# Web APIs and NLP

## 1. Introduction

We will use the [Pushshift API](https://github.com/pushshift/api) to collect posts from two different subreddits and build an NLP model that predicts which subreddit a post came from. For those unfamiliar with Reddit, it is a site made up of many communities, called subreddits, where people post about whatever they are passionate about. To see how the site works firsthand, visit [reddit.com](https://reddit.com/).

## 2. Problem Statement

The two subreddits we will be looking at are [/r/TheSilphRoad](https://www.reddit.com/r/TheSilphRoad/) and [/r/pokemongo](https://www.reddit.com/r/pokemongo/). We want to determine whether The Silph Road is actually more tailored to experienced Pokemon Go players than the pokemongo subreddit. By using NLP and testing different classification models, we hope to tell posts from the two subreddits apart by finding language that is more typical of veteran players of the game. If we can accurately tell the difference between posts on the two subreddits, the hardcore players of The Silph Road will be able to brag that their subreddit is the superior one.

## 3. Web Scraping

Now we will use the Pushshift API to get posts from the two subreddits we are interested in and do some exploratory analysis on the posts. 

In [1]:
# import libraries
import requests
import pandas as pd
from time import sleep, time
from tqdm import tqdm

In [2]:
# set up pushshift url
pushshift_url = 'https://api.pushshift.io/reddit/search/submission'

In [3]:
# write a function to aggregate posts from multiple calls
def get_20000(subreddit, last_time):
    posts = []
    for i in tqdm(range(200)): # use tqdm to see how long each iteration takes
        reddit_params = {
            'subreddit': subreddit,
            'size': 100,
            'before': last_time,
            'is_video': False # remove videos as we only want to look at text
        }
        try:
            reddit_res = requests.get(pushshift_url, params=reddit_params)
            reddit_data = reddit_res.json()
            posts += reddit_data['data']
            last_time = posts[-1]['created_utc'] # oldest post so far becomes the next cutoff
        except (requests.RequestException, ValueError, KeyError, IndexError):
            pass # skip a failed request and try again on the next iteration
        sleep(2) # make sure we aren't making too many requests, even after a failure
    return posts
silph_posts = get_20000('TheSilphRoad', int(time()))
pogo_posts = get_20000('pokemongo', int(time()))

100%|██████████| 200/200 [19:40<00:00,  5.90s/it]
100%|██████████| 200/200 [17:11<00:00,  5.16s/it]
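
The loop above pages backwards through time: each request asks for posts created before `last_time`, and the timestamp of the last post returned becomes the cutoff for the next request. Here is a minimal sketch of that idea. It uses a hypothetical in-memory `fake_fetch_page` in place of the real API and assumes results arrive newest first. The names `paginate_backwards`, `fake_posts` and `fake_fetch_page` are illustrative only and are not part of Pushshift.

```python
def paginate_backwards(fetch_page, start_time, n_pages):
    """Collect posts page by page, moving the 'before' cutoff back each time."""
    posts = []
    before = start_time
    for _ in range(n_pages):
        page = fetch_page(before)
        if not page:  # nothing older is left, so stop early
            break
        posts.extend(page)
        before = page[-1]['created_utc']  # oldest post on this page
    return posts


# Hypothetical stand-in data: ten posts with timestamps 10 down to 1 (newest first).
fake_posts = [{'id': i, 'created_utc': i} for i in range(10, 0, -1)]


def fake_fetch_page(before, size=3):
    """Simulated API call: return up to `size` posts created strictly before `before`."""
    older = [post for post in fake_posts if post['created_utc'] < before]
    return older[:size]


collected = paginate_backwards(fake_fetch_page, start_time=11, n_pages=5)
collected_ids = [post['id'] for post in collected]
print(collected_ids)
```

Because the cutoff is strictly "before" the last timestamp seen, no post is collected twice in this simulation. With real data, posts that share the same `created_utc` as the cutoff could be skipped.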


In [4]:
len(silph_posts)

19700

In [5]:
len(pogo_posts)

20000
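
The 19,700 posts collected for TheSilphRoad fall short of the 20,000 requested. This could mean that a few requests failed and were skipped, or that some responses returned fewer than 100 posts. One way to make the scraper more robust is to retry a failed call a few times before moving on. The sketch below is an assumption, not the approach used in this notebook. `fetch_with_retry` and `flaky_fetch` are hypothetical names, and the failure is simulated so the example runs without network access.

```python
import time


def fetch_with_retry(fetch, max_retries=3, wait_seconds=0.0):
    """Call fetch() until it succeeds or max_retries attempts have failed."""
    last_error = None
    for _ in range(max_retries):
        try:
            return fetch()
        except Exception as error:  # broad on purpose for this sketch
            last_error = error
            time.sleep(wait_seconds)  # pause before trying again
    raise RuntimeError(f'giving up after {max_retries} attempts') from last_error


# Simulated request that fails twice and then succeeds.
calls = {'count': 0}


def flaky_fetch():
    calls['count'] += 1
    if calls['count'] < 3:
        raise ConnectionError('simulated failure')
    return [{'id': 'abc', 'created_utc': 1600000000}]


result = fetch_with_retry(flaky_fetch)
print(calls['count'], result)
```

In a real scraper the `fetch` argument could wrap the `requests.get` call, with `wait_seconds` set to a few seconds to stay polite to the API.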

In [7]:
silph_df = pd.DataFrame(silph_posts)
silph_df.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,...,secure_media,secure_media_embed,poll_data,crosspost_parent,crosspost_parent_list,edited,banned_by,media_metadata,collections,suggested_sort
0,[],False,gearowns,,[],,text,t2_14924t,False,False,...,,,,,,,,,,
1,[],False,Weignot,,[],,text,t2_30r66mkn,False,False,...,,,,,,,,,,
2,[],False,Rebelsoul76,,[],,text,t2_ixuw55y,False,False,...,,,,,,,,,,
3,[],False,brodstar12,usa-midwest-wheat,"[{'e': 'text', 't': 'Melbourne, Australia'}]","Melbourne, Australia",richtext,t2_13sdwg,False,False,...,,,,,,,,,,
4,[],False,brodstar12,usa-midwest-wheat,"[{'e': 'text', 't': 'Melbourne, Australia'}]","Melbourne, Australia",richtext,t2_13sdwg,False,False,...,,,,,,,,,,


In [8]:
pogo_df = pd.DataFrame(pogo_posts)
pogo_df.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,...,thumbnail_height,thumbnail_width,url_overridden_by_dest,media,media_embed,secure_media,secure_media_embed,media_metadata,gilded,distinguished
0,[],False,Ham7116,,[],,text,t2_c0sjr,False,False,...,,,,,,,,,,
1,[],False,Ok_Understanding_44,,[],,text,t2_7w29o2xg,False,False,...,,,,,,,,,,
2,[],False,lavumidnite,,[],,text,t2_5tqei2tq,False,False,...,,,,,,,,,,
3,[],False,figmentine,,[],,text,t2_zulim,False,False,...,,,,,,,,,,
4,[],False,Ok_Anywhere_458,,[],,text,t2_4m97io3s,False,False,...,,,,,,,,,,


Most of these columns carry no information that is useful for NLP, so we will keep only the post text, the title, and the subreddit.

In [1]:
# create a function to only keep what we need
def dataframe_cleaner(df):
    cleaned_df = df[['selftext', 'title', 'subreddit']]
    # drop posts whose text is empty, removed or deleted
    unusable_text = ['', '[removed]', '[deleted]']
    cleaned_df = cleaned_df.loc[~cleaned_df['selftext'].isin(unusable_text)]
    cleaned_df = cleaned_df.reset_index(drop = True)
    return cleaned_df
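
To see what the cleaner keeps and drops, here is a small sketch on a made-up DataFrame. The toy rows and the extra `score` column are invented for illustration, and the function restates the filter compactly with `isin`. Note that a missing `selftext` (NaN) is not one of the three filtered values, so that row passes through.

```python
import numpy as np
import pandas as pd


def dataframe_cleaner(df):
    """Keep only the NLP columns and drop empty, removed or deleted posts."""
    cleaned_df = df[['selftext', 'title', 'subreddit']]
    unusable_text = ['', '[removed]', '[deleted]']
    cleaned_df = cleaned_df.loc[~cleaned_df['selftext'].isin(unusable_text)]
    return cleaned_df.reset_index(drop=True)


# Invented example rows, one per case the filter has to handle.
toy_df = pd.DataFrame({
    'selftext': ['How do I beat this raid?', '', '[removed]', '[deleted]', np.nan],
    'title': ['Raid help', 'Empty post', 'Removed post', 'Deleted post', 'Link post'],
    'subreddit': ['TheSilphRoad'] * 5,
    'score': [1, 2, 3, 4, 5],
})

cleaned = dataframe_cleaner(toy_df)
kept_titles = cleaned['title'].tolist()
print(kept_titles)
```

If posts without any body text should also be excluded, one option would be to add `cleaned_df['selftext'].notna()` to the filter. Whether that is desirable depends on whether titles alone are considered enough signal.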

In [9]:
# only keep things we need for NLP
silph_df = dataframe_cleaner(silph_df)
silph_df.head()
# check size of dataframe
silph_df.shape

Unnamed: 0,selftext,title,subreddit
0,Has anybody found a permanent fix to not being...,Fix to not being able to attack?,TheSilphRoad
1,,Whoa.,TheSilphRoad
2,2 raids today where my attacks wouldn’t regist...,Attack glitch during Regi raids,TheSilphRoad
3,I’ve had a Pokémon defending a gym for over 36...,[Bug?] Can’t seem to earn or collect pokecoins...,TheSilphRoad
4,,Can’t earn or collect pokecoins for Pokémon in...,TheSilphRoad


In [None]:
# only keep things we need for NLP
pogo_df = dataframe_cleaner(pogo_df)
pogo_df.head()
# check size of dataframe
pogo_df.shape

In [13]:
# save our reddit dataframes to use in future notebooks
pogo_df.to_csv('../data/pogo.csv', index = False)
silph_df.to_csv('../data/silph.csv', index = False)
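
Passing `index = False` matters when the files are read back later. Without it, pandas writes the row index as an unnamed first column, which `read_csv` then loads as `Unnamed: 0`. The sketch below shows the difference using in-memory buffers instead of the real files. The single example row is invented.

```python
import io

import pandas as pd

# Invented one-row frame with the same three columns we saved.
example_df = pd.DataFrame({
    'selftext': ['some text'],
    'title': ['A title'],
    'subreddit': ['pokemongo'],
})

# Default behaviour: the index is written as an extra, unnamed column.
with_index = io.StringIO()
example_df.to_csv(with_index)
with_index.seek(0)
columns_with_index = list(pd.read_csv(with_index).columns)

# With index=False only the real columns are written.
without_index = io.StringIO()
example_df.to_csv(without_index, index=False)
without_index.seek(0)
columns_without_index = list(pd.read_csv(without_index).columns)

print(columns_with_index)
print(columns_without_index)
```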