## 01_gathering_data

In these notebooks, I'll build a model that differentiates between satirical and sensational news content. This content should be reviewed alongside my upcoming presentation, whose slides can be viewed [here](https://docs.google.com/presentation/d/1nEYjhchUErCDQR8fqr9AunZtu1Rai8-R14BtKmHf9_Q/edit?usp=sharing).

**Description:** Querying the Pushshift API for subreddit data, then saving it for EDA and preprocessing.

In [1]:
import requests
import json
import time
import pandas as pd
import pickle

**Testing Pushshift API**

I first tried using Reddit's API directly, but I found Pushshift much easier to use: you can request virtually as many pages as you'd like, as long as you stay within the maximum of 60 requests/min.
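The loops later in this notebook stay under that limit by calling `time.sleep(1)` after each request. As a slightly more general sketch of the same idea, the helper below only sleeps for whatever is left of the interval since the previous call. The name `Throttle` and its parameters are my own illustration, not part of Pushshift or `requests`, and the clock and sleep functions are injectable purely so the behaviour can be checked without waiting.

```python
import time


class Throttle:
    """Space successive calls at least `min_interval` seconds apart.

    Hypothetical helper for illustration; one call per second keeps a
    client at or below 60 requests per minute.
    """

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Sleep only for the unused part of the interval
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` immediately before each `requests.get(...)` would take the place of the fixed one-second sleep.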

In [2]:
tr = requests.get('https://api.pushshift.io/reddit/search/submission/?subreddit=nottheonion')

In [3]:
tr.status_code

200

In [4]:
tr_js = tr.json()

In [5]:
tr_js['data'][0]

{'author': 'PM_ME_YOUR_BITCOlN',
 'author_flair_css_class': None,
 'author_flair_richtext': [],
 'author_flair_text': None,
 'author_flair_type': 'text',
 'author_fullname': 't2_n0y4u8n',
 'can_mod_post': False,
 'contest_mode': False,
 'created_utc': 1536545998,
 'domain': 'longisland.news12.com',
 'full_link': 'https://www.reddit.com/r/nottheonion/comments/9eis00/residents_crack_pipe_dispenser_found_near_busy/',
 'id': '9eis00',
 'is_crosspostable': False,
 'is_meta': False,
 'is_original_content': False,
 'is_reddit_media_domain': False,
 'is_self': False,
 'is_video': False,
 'link_flair_background_color': '',
 'link_flair_richtext': [],
 'link_flair_text_color': 'dark',
 'link_flair_type': 'text',
 'locked': False,
 'media_only': False,
 'no_follow': True,
 'num_comments': 2,
 'num_crossposts': 0,
 'over_18': False,
 'parent_whitelist_status': 'all_ads',
 'permalink': '/r/nottheonion/comments/9eis00/residents_crack_pipe_dispenser_found_near_busy/',
 'pinned': False,
 'post_hint': 

In [6]:
tr_js['data'][0]['title']

'Residents: Crack pipe dispenser found near busy shopping center, bus stop'

In [7]:
post_list = []
post_list.extend(tr_js)

In [8]:
post_list

['data']

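The request itself works well, though `post_list` above only picked up the dict's key; the posts themselves live under `tr_js['data']`. I'll now try to programmatically grab posts from each subreddit.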
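The `['data']` output above comes from how `list.extend` treats a dict: iterating a dict yields its keys, not its values. A minimal sketch with a made-up response (the titles are placeholders, not real posts) shows the difference:

```python
# Hypothetical response shaped like the Pushshift payload above
response = {'data': [{'title': 'first post'}, {'title': 'second post'}]}

keys_only = []
keys_only.extend(response)          # iterating a dict yields its keys

records = []
records.extend(response['data'])    # iterating the list yields the post dicts

print(keys_only)     # the single key
print(len(records))  # number of posts
```

This is why the collection loops extend with `the_json['data']` rather than with the parsed response itself.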

**Gathering Data**

Based on the `total_results` key in the Pushshift API response, the two subreddits that I'm comparing contain the following number of posts:

- /r/theonion: 7124
- /r/nottheonion: 115527

Given the size of both subreddits, I'll look to gather 5000 posts from each to analyze. That should give me enough data for my model.

In [9]:
posts = []
before = None  # created_utc of the oldest post fetched so far
url = 'https://api.pushshift.io/reddit/search/submission/?subreddit=nottheonion'
for _ in range(200):
    # After the first page, only request posts older than the last one seen
    params = {} if before is None else {'before': before}
    res = requests.get(url, params=params)
    if res.status_code != 200:
        print(res.status_code)
        break
    page = res.json()['data']
    if not page:  # nothing older left; avoids an IndexError on page[-1]
        break
    posts.extend(page)
    before = page[-1]['created_utc']
    time.sleep(1)  # stay under the rate limit
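The loop above pages backwards through the subreddit by passing the oldest `created_utc` seen so far as `before`. Since the same loop is needed for both subreddits, it could be factored into a function. The sketch below is one way to do that; `collect_posts` and `fetch_page` are hypothetical names, and the page-fetching step is passed in as a callable so the paging logic can be tried without touching the network.

```python
import time


def collect_posts(fetch_page, n_pages=200, pause=1.0):
    """Page backwards through results using the oldest timestamp seen so far.

    `fetch_page(before)` is assumed to return a list of post dicts, newest
    first, each with a 'created_utc' key; `before` is None on the first call.
    """
    posts = []
    before = None
    for _ in range(n_pages):
        page = fetch_page(before)
        if not page:  # no more results: stop instead of indexing page[-1]
            break
        posts.extend(page)
        before = page[-1]['created_utc']
        time.sleep(pause)  # pause between requests
    return posts
```

In this notebook, `fetch_page` would wrap the same `requests.get(url, params=...)` call used in the cell above and return the `'data'` list from the parsed JSON.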

In [10]:
len(posts)

5000

In [11]:
# import json
# with open('../data/nottheonion_ps_20180830.json', 'w') as fout:
#     json.dump(posts, fout)

**Separating Titles from JSON and Saving into List**

In [12]:
# Pull the title out of each post
title_list = [post['title'] for post in posts]

**Checking for Duplicates**

In [13]:
len(title_list)

5000

In [14]:
len(set(title_list))

4093

I'm a little disappointed by the number of duplicate titles (907) within the first 5000 posts that I pulled. Nonetheless, over 4,000 unique titles still seems acceptable for building a model. I'll consider grabbing more if I find a much larger share of unique titles in The Onion subreddit.
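The duplicate count is just the difference between the two lengths above (5000 - 4093 = 907). To see which titles repeat rather than only how many, a `collections.Counter` is one option. The sketch below runs on a toy list standing in for `title_list`; the titles are placeholders.

```python
from collections import Counter

# Toy stand-in for title_list
titles = ['A', 'B', 'A', 'C', 'A', 'B']

counts = Counter(titles)
n_unique = len(counts)              # same as len(set(titles))
n_dupes = len(titles) - n_unique    # rows that drop_duplicates would remove
repeated = {t: n for t, n in counts.items() if n > 1}

print(n_unique, n_dupes)
print(counts.most_common(2))        # most frequently reposted titles first
```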

In [15]:
df = pd.DataFrame(title_list)

In [16]:
df['sub'] = 1

In [17]:
df.rename(columns = {0:'Title'}, inplace=True)

**Getting Rid of Duplicate Values**

In [18]:
df.drop_duplicates(inplace=True)

**Working on /r/TheOnion**

In [19]:
posts_o = []
before = None  # created_utc of the oldest post fetched so far
url = 'https://api.pushshift.io/reddit/search/submission/?subreddit=theonion'
for _ in range(200):
    # After the first page, only request posts older than the last one seen
    params = {} if before is None else {'before': before}
    res = requests.get(url, params=params)
    if res.status_code != 200:
        print(res.status_code)
        break
    page = res.json()['data']
    if not page:  # nothing older left; avoids an IndexError on page[-1]
        break
    posts_o.extend(page)
    before = page[-1]['created_utc']
    time.sleep(1)  # stay under the rate limit

In [20]:
len(posts_o)

5000

In [21]:
# import json
# with open('../data/theonion_ps_20180830.json', 'w') as fout:
#     json.dump(posts_o, fout)

In [22]:
# Pull the title out of each post
title_list_o = [post['title'] for post in posts_o]

**Checking for Duplicates**

In [23]:
len(title_list_o)

5000

In [24]:
len(set(title_list_o))

4644

I'm a bit surprised that The Onion subreddit yielded slightly more unique titles (4,644 versus 4,093). The difference isn't large enough to justify grabbing more posts from Not the Onion, however, so I'll move forward to preprocessing my data.

In [25]:
df_o = pd.DataFrame(title_list_o)

In [26]:
df_o['sub'] = 0

In [27]:
df_o.rename(columns = {0:'Title'}, inplace=True)

**Getting Rid of Duplicate Values**

In [28]:
df_o.drop_duplicates(inplace=True)

In [29]:
df.info(), df_o.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4093 entries, 0 to 4999
Data columns (total 2 columns):
Title    4093 non-null object
sub      4093 non-null int64
dtypes: int64(1), object(1)
memory usage: 95.9+ KB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4644 entries, 0 to 4999
Data columns (total 2 columns):
Title    4644 non-null object
sub      4644 non-null int64
dtypes: int64(1), object(1)
memory usage: 108.8+ KB


(None, None)

**Saving Title Lists as a `pickle`**

In [30]:
# with open("../data/nto_t.pkl", 'wb') as nto_t:
#     pickle.dump(df, nto_t)

In [31]:
# with open("../data/to_t.pkl", 'wb') as to_t:
#     pickle.dump(df_o, to_t)
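As a quick sanity check on the save step, the sketch below round-trips an object through `pickle` in a temporary directory. It uses a plain list of `(title, label)` tuples as a stand-in; the DataFrames above go through `pickle.dump` and `pickle.load` in the same way. The file name and the example records are placeholders.

```python
import os
import pickle
import tempfile

# Stand-in for the labelled titles (placeholder values)
records = [('Example headline one', 1), ('Example headline two', 0)]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'titles.pkl')  # hypothetical file name
    with open(path, 'wb') as fout:
        pickle.dump(records, fout)
    with open(path, 'rb') as fin:
        restored = pickle.load(fin)

print(restored == records)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.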

**Go to: 02_preprocessing**