## Reddit Post Natural Language Processing and Classification
![reddit](../images/reddit_logo.png)
### 01. Data Retrieval




In [23]:
import requests
import pandas as pd
import time

In [24]:
base_url = 'https://api.pushshift.io/'

submission_url = 'reddit/search/submission' 
comment_url = 'reddit/search/comment'

#### Problem Statement

1. Build at least three models to classify two different reddit categories (subreddits).
2. Using a sentiment analyzer, determine whether one subreddit has an overall more negative sentiment in its collection of text.

Download a large collection of posts from each of two subreddits.

The two subreddits to look at are:<br>
Ask Culinary - a subreddit devoted to food, cooking, sharing recipes, etc.<br>
![reddit](../images/reddit_ask_culinary_logo.png)<br>

and Running - a subreddit pertaining to running, jogging, etc.<br>
![reddit](../images/reddit_running_logo.png)


A working theory to prove or disprove regarding the two subreddits:
* Cooking can sometimes be an independent activity, but it is also tied to family gatherings, holidays, birthdays, and other group social occasions. A chef may also work in a kitchen alongside other cooks.

* In contrast, running is primarily a solitary activity. A runner could be in a group, but anyone who has tried to run and talk at the same time knows that even group runs are not conducive to social interaction. The working assumption is that most runners, joggers, and even walkers are doing so independently.

* When analyzing the posts with the VADER SentimentIntensityAnalyzer, search words such as **alone** and **lonely** (and their variations) might prove to be a factor in telling the two subreddits apart.
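
The keyword idea in the last bullet can be sketched without any sentiment library. The snippet below is only an illustration: it does not call VADER, the word list is an assumption, and the sample posts are made up.

```python
import re

# Hypothetical keyword pattern; the exact word list is an assumption,
# not something taken from the notebook.
SOLITUDE_PATTERN = re.compile(
    r"\b(alone|lonely|loneliness|lonesome)\b", re.IGNORECASE
)


def count_keyword_posts(texts):
    """Return how many texts mention at least one solitude keyword."""
    # `text or ""` guards against posts whose selftext is missing (None).
    return sum(1 for text in texts if SOLITUDE_PATTERN.search(text or ""))


# Made-up sample posts, for illustration only.
sample_posts = [
    "I usually run alone before work.",
    "Cooking for a family gathering this weekend.",
    "Feeling LONELY on long runs lately.",
    None,
]

print(count_keyword_posts(sample_posts))
```

Comparing this count (as a share of all posts) between the two subreddits would be one simple way to test the working theory alongside the sentiment scores.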

---

In order to download the posts, this notebook will be run twice, once for each subreddit.
After running for one subreddit, the local variable will be changed to point to the second subreddit, and all cells will be run again.

Each notebook execution loops through calls to the Pushshift API (api.pushshift.io), which serves archived reddit posts, and saves the posts to a .csv file.
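
As an alternative to editing the variable by hand between runs, the per-subreddit settings could be derived from the subreddit name. This is a minimal sketch; `build_run_config` is a hypothetical helper that is not part of the notebook, and the `../data` directory is taken from the save step at the end.

```python
def build_run_config(subreddit, size=100, data_dir="../data"):
    """Return the query parameters and output path for one subreddit run.

    Hypothetical helper: it only builds values and performs no network or
    file access.
    """
    params = {"subreddit": subreddit, "size": size}
    csv_path = f"{data_dir}/{subreddit}.csv"
    return params, csv_path


for name in ["AskCulinary", "running"]:
    run_params, run_csv_path = build_run_config(name)
    print(run_params, "->", run_csv_path)
```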

In [38]:
# https://www.reddit.com/r/AskCulinary/
# https://www.reddit.com/r/running/

subreddit = 'AskCulinary'
#subreddit = 'running'

# the max number of posts the api can download in each call is 100.
params = {
    'subreddit' : subreddit,
    'size' : 100
}

In [39]:
base_url + submission_url

'https://api.pushshift.io/reddit/search/submission'
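
When `params` is passed to `requests.get`, the dictionary is encoded into the URL's query string. The sketch below approximates that step with the standard library so the final request URL can be inspected without sending anything.

```python
from urllib.parse import urlencode

base_url = "https://api.pushshift.io/"
submission_url = "reddit/search/submission"
params = {"subreddit": "AskCulinary", "size": 100}

# Join endpoint and query string the way an HTTP client would.
full_url = f"{base_url}{submission_url}?{urlencode(params)}"
print(full_url)
```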

Perform one query to get the first 100 posts and create a dataframe.

The for loop later will append its downloaded posts to this initial dataframe.

In [40]:
res = requests.get (base_url + submission_url, params)
res.status_code

200

In [41]:
data = res.json()

posts = data['data']
len(posts)

100

Create a dataframe from the request data, which is in json format.

In [42]:
posts_data = pd.DataFrame (posts)
posts_data.head()

|  | all_awardings | allow_live_comments | author | author_flair_css_class | author_flair_richtext | author_flair_text | author_flair_type | author_fullname | author_patreon_flair | author_premium | ... | upvote_ratio | url | whitelist_status | wls | post_hint | preview | removed_by_category | author_flair_background_color | author_flair_text_color | edited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [] | False | PrinceOfWales_ |  | [] |  | text | t2_lvu3m | False | False | ... | 1.0 | https://www.reddit.com/r/AskCulinary/comments/... | all_ads | 6 |  |  |  |  |  |  |
| 1 | [] | False | frowogger |  | [] |  | text | t2_3iogam6g | False | False | ... | 1.0 | https://www.reddit.com/r/AskCulinary/comments/... | all_ads | 6 |  |  |  |  |  |  |
| 2 | [] | False | pelse_O_clock |  | [] |  | text | t2_5k6soum5 | False | False | ... | 1.0 | https://www.reddit.com/r/AskCulinary/comments/... | all_ads | 6 |  |  |  |  |  |  |
| 3 | [] | False | EmbryoRoux |  | [] |  | text | t2_9oos7w6k | False | False | ... | 1.0 | https://www.reddit.com/r/AskCulinary/comments/... | all_ads | 6 |  |  |  |  |  |  |
| 4 | [] | False | batcat03 |  | [] |  | text | t2_rnw06 | False | False | ... | 1.0 | https://www.reddit.com/r/AskCulinary/comments/... | all_ads | 6 |  |  |  |  |  |  |


In [43]:
posts_data.columns

Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_css_class', 'author_flair_richtext', 'author_flair_text',
       'author_flair_type', 'author_fullname', 'author_patreon_flair',
       'author_premium', 'awarders', 'can_mod_post', 'contest_mode',
       'created_utc', 'domain', 'full_link', 'gildings', 'id',
       'is_crosspostable', 'is_meta', 'is_original_content',
       'is_reddit_media_domain', 'is_robot_indexable', 'is_self', 'is_video',
       'link_flair_background_color', 'link_flair_richtext',
       'link_flair_template_id', 'link_flair_text', 'link_flair_text_color',
       'link_flair_type', 'locked', 'media_only', 'no_follow', 'num_comments',
       'num_crossposts', 'over_18', 'parent_whitelist_status', 'permalink',
       'pinned', 'pwls', 'retrieved_on', 'score', 'selftext', 'send_replies',
       'spoiler', 'stickied', 'subreddit', 'subreddit_id',
       'subreddit_subscribers', 'subreddit_type', 'thumbnail', 'title',
       'total_awards

Keep only the columns we need by creating a new dataframe that contains just the relevant ones.

The main text of each post is contained in a column called **selftext**.

In [44]:
posts_df = posts_data[['subreddit', 'selftext', 'title', 'author', 'created_utc']].copy()
posts_df.head()

|  | subreddit | selftext | title | author | created_utc |
|---|---|---|---|---|---|
| 0 | AskCulinary | Hi everyone,\n\nI'm not sure if this is the pl... | Fridge for an instructional kitchen? | PrinceOfWales_ | 1610056955 |
| 1 | AskCulinary | Hey, I was wondering if anybody had tips to ke... | How to keep sourdough starter jar clean? | frowogger | 1610055394 |
| 2 | AskCulinary | \nI’m Gonna make birria tacos tomorrow and my ... | What substitutes can I use for guajillo chili’... | pelse_O_clock | 1610055301 |
| 3 | AskCulinary | Hi all,\n\nI just found a steal on a WagnerWar... | Wagner Magnalite Dutch Oven | EmbryoRoux | 1610055100 |
| 4 | AskCulinary | Hi all,\n\nI tried making Julia's recipe for P... | AWFUL SMELL Potage Parmentier/Vichyssoise | batcat03 | 1610054738 |


Convert the created_utc column to a datetime, purely to make it readable for a sanity check.

In [45]:
posts_df['created_date'] = pd.to_datetime(posts_df['created_utc'], unit='s')
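
The `created_utc` values are seconds since the Unix epoch, which is what `unit='s'` tells pandas. As a cross-check, the same conversion can be sketched with the standard library, here using the timestamp of the oldest post in the first batch (1609917494).

```python
from datetime import datetime, timezone


def utc_from_epoch(seconds):
    """Convert epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


readable = utc_from_epoch(1609917494).strftime("%Y-%m-%d %H:%M:%S")
print(readable)
```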

In [46]:
posts_df.tail()

|  | subreddit | selftext | title | author | created_utc | created_date |
|---|---|---|---|---|---|---|
| 95 | AskCulinary | I love wine and beer, but sometimes I want to ... | I like some fancy drinks that are not alcoholi... | elforastero | 1609940803 | 2021-01-06 13:46:43 |
| 96 | AskCulinary | Normally i presalt and let rest in the fridge ... | Should I salt meats right before/after pan fry... | SuperomegaOP | 1609938259 | 2021-01-06 13:04:19 |
| 97 | AskCulinary | Does anyone know how you use Glutamic Acid E62... | How do you use Glutamic Acid E620 in food? | 2021redditsusernames | 1609918984 | 2021-01-06 07:43:04 |
| 98 | AskCulinary | There are various recipes I have where I want ... | What temperature should I keep eggs below to e... | dunwannatacoboutit | 1609918416 | 2021-01-06 07:33:36 |
| 99 | AskCulinary | Was following a recipe online for crepe recipe... | Ultra thick double cream | razorblaze186 | 1609917494 | 2021-01-06 07:18:14 |


In [47]:
# The oldest post date is in the last row of the dataframe.
# When sending an api request, adjust this parameter to get older and older posts.
posts_df.iloc[len(posts_df)-1]['created_utc']

1609917494
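
The paging idea described in the comments above (use the oldest timestamp seen so far as the next `before` value) can be sketched independently of the API. Here `fetch_page` is a hypothetical stand-in for the real request, and the sketch assumes `before` is exclusive and that results arrive newest first.

```python
def fetch_all(fetch_page, max_pages):
    """Collect posts page by page, moving the `before` cursor back in time.

    `fetch_page(before)` is a hypothetical callable that returns a list of
    post dicts sorted newest first, each with a 'created_utc' key.
    """
    posts = []
    before = None
    for _ in range(max_pages):
        page = fetch_page(before)
        if not page:  # no older posts left
            break
        posts.extend(page)
        before = page[-1]["created_utc"]  # oldest post seen so far
    return posts


# Stand-in for the real API: 7 fake posts with timestamps 107..101.
FAKE_POSTS = [{"created_utc": ts} for ts in range(107, 100, -1)]


def fake_fetch_page(before, size=3):
    """Return up to `size` fake posts strictly older than `before`."""
    older = [p for p in FAKE_POSTS if before is None or p["created_utc"] < before]
    return older[:size]


all_posts = fetch_all(fake_fetch_page, max_pages=10)
print(len(all_posts), all_posts[-1]["created_utc"])
```

One consequence of an exclusive cursor, at least in this simplified model, is that posts sharing the exact timestamp of the cursor would be skipped on the next page.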

In [48]:
# Get more data. 100 loops should retrieve 10_000 more posts from this subreddit.
num_loops = 100
keep_cols = ['subreddit', 'selftext', 'title', 'author', 'created_utc']

params = {
    'subreddit' : subreddit,
    'size' : 100,
    'before': ''
}

for index in range(num_loops):
    # Timestamp of the oldest post currently in the dataframe (the last row)
    oldest_utc = posts_df.iloc[-1]['created_utc']

    # Ask only for posts older than that timestamp
    params['before'] = oldest_utc

    res = requests.get(base_url + submission_url, params=params)

    # every 5 times through the loop, print a status update.
    if index % 5 == 0:
        print (f'Loop: {index} Query: {submission_url} Params: {params} \n Response code: {res.status_code} ')

    if res.status_code == 200:
        temp_data = res.json()['data']   # list of reddit posts (each one a dict)
        if not temp_data:
            break                        # nothing older to fetch; stop early
        temp_data_df = pd.DataFrame(temp_data)  # convert list of post dicts to a dataframe

        # only keep relevant columns
        temp_posts_df = temp_data_df[keep_cols].copy()
        temp_posts_df['created_date'] = pd.to_datetime(temp_posts_df['created_utc'], unit='s')

        # concat to existing dataframe of prior responses
        posts_df = pd.concat([posts_df, temp_posts_df], ignore_index=True)

    # sleep one second after every request (successful or not)
    # so as not to overload the Pushshift API
    time.sleep(1)
        



Loop: 0 Query: reddit/search/submission Params: {'subreddit': 'AskCulinary', 'size': 100, 'before': 1609917494} 
 Response code: 200 
Loop: 5 Query: reddit/search/submission Params: {'subreddit': 'AskCulinary', 'size': 100, 'before': 1609361956} 
 Response code: 200 
Loop: 10 Query: reddit/search/submission Params: {'subreddit': 'AskCulinary', 'size': 100, 'before': 1608838188} 
 Response code: 200 
Loop: 15 Query: reddit/search/submission Params: {'subreddit': 'AskCulinary', 'size': 100, 'before': 1608380890} 
 Response code: 200 
Loop: 20 Query: reddit/search/submission Params: {'subreddit': 'AskCulinary', 'size': 100, 'before': 1607787868} 
 Response code: 200 
Loop: 25 Query: reddit/search/submission Params: {'subreddit': 'AskCulinary', 'size': 100, 'before': 1607185849} 
 Response code: 200 
Loop: 30 Query: reddit/search/submission Params: {'subreddit': 'AskCulinary', 'size': 100, 'before': 1606436753} 
 Response code: 200 
Loop: 35 Query: reddit/search/submission Params: {'subred
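
The loop above only uses a response when its status code is 200. One possible way to handle other status codes is to retry with an increasing delay. The sketch below is an assumption about how that could look, not something the notebook does; `get_with_retry` and the `fetch` callable are hypothetical, and the fake responses are for illustration only.

```python
import time


def get_with_retry(fetch, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `fetch()` until it reports status 200 or attempts run out.

    `fetch` is a hypothetical zero-argument callable that returns a
    (status_code, payload) tuple. Returns the payload, or None on failure.
    """
    for attempt in range(max_attempts):
        status, payload = fetch()
        if status == 200:
            return payload
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
    return None


# Fake endpoint that fails twice and then succeeds.
responses = iter([(429, None), (500, None), (200, {"data": []})])
delays = []  # record the delays instead of actually sleeping
result = get_with_retry(lambda: next(responses), sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the helper testable without waiting in real time.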

In [52]:
posts_df.tail()

|  | subreddit | selftext | title | author | created_utc | created_date |
|---|---|---|---|---|---|---|
| 10095 | AskCulinary |  | I think I used spoiled flour to make bread... | miss_scarlett_ohara | 1595878226 | 2020-07-27 19:30:26 |
| 10096 | AskCulinary | It just goes bulgur wheat; butter, brown; butt... | How does the Flavor Bible not have an entry fo... | naestekaerlighed | 1595878012 | 2020-07-27 19:26:52 |
| 10097 | AskCulinary | I just bought these: [https://www.amazon.com/... | Just bought new pots and pans, not sure about ... | aziraphale60 | 1595876770 | 2020-07-27 19:06:10 |
| 10098 | AskCulinary | Hi all! \n\nI just got a pasta roller and was ... | Looking for advice on sauces for fresh pasta | 9jharris1 | 1595874873 | 2020-07-27 18:34:33 |
| 10099 | AskCulinary | The title. | Will a molecular gastronomy student be able to... | normiedeilim | 1595874352 | 2020-07-27 18:25:52 |


In [51]:
posts_df.to_csv(f'../data/{ subreddit }.csv', index=False)