# Project 3: Web APIs and NLP
_By: Kelly Wu (NYC - Tensors)_

## Contents
- [Imports](#Imports)
- [Importing Our Subreddits](#Importing-Our-Subreddits)
- [Cleaning Our Subreddit Dataframes](#Cleaning-Our-Subreddit-Dataframes)


### Imports
Here we import the libraries we need to scrape posts from our two selected subreddits: SmashBrosUltimate and mariokart.

In [1]:
import pandas as pd
import datetime as dt
import time
import requests

### Importing Our Subreddits 
After selecting our subreddits, we define a function that queries the Pushshift API for a given subreddit. By default it sends ten requests, moving the `after` cutoff back 30 days each time, and concatenates the results into a single DataFrame with one column per selected field, such as title and content.

In [2]:
subreddit_1 = 'SmashBrosUltimate'
subreddit_2 = 'mariokart'

In [3]:
# Code from Tom
def query_pushshift(subreddit, kind='submission', skip = 30, times = 10, 
                    subfield = ['title', 'selftext', 'subreddit', 'created_utc', 
                                'author', 'num_comments', 'score', 'is_self'],
                    comfields = ['body', 'score', 'created_utc']):
    stem = "https://api.pushshift.io/reddit/search/{}/?subreddit={}&size=500".format(kind, subreddit)
    mylist = []
    for x in range(1, times + 1):
        URL = "{}&after={}d".format(stem, skip * x)
        print(URL)
        
        response = requests.get(URL)
        # Raise an HTTPError if the API returns a 4xx/5xx status
        response.raise_for_status()
        mine = response.json()['data']
        df = pd.DataFrame.from_dict(mine)
        mylist.append(df)
        
        time.sleep(2)
    
    full = pd.concat(mylist, sort=False)
    
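    if kind == "submission":
        # Keep only the selected fields, drop exact duplicate rows,
        # and keep self (text) posts only
        full = full[subfield]
        full = full.drop_duplicates()
        full = full.loc[full['is_self'] == True].copy()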
    
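    # created_utc is in Unix epoch seconds; note that date.fromtimestamp
    # converts using the machine's local time zone, not UTC
    def get_date(created):
        return dt.date.fromtimestamp(created)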
    
    _timestamp = full["created_utc"].apply(get_date)
    full['timestamp'] = _timestamp
    
    print(full.shape)
    
    return full 
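
As a minimal offline sketch of the paging scheme used above, the snippet below only builds the request URLs and does not call the API. The helper name `build_pushshift_urls` is hypothetical and not part of the original code; the URL template and default arguments mirror `query_pushshift`.

```python
def build_pushshift_urls(subreddit, kind="submission", skip=30, times=10):
    """Return the URLs that query_pushshift would request (hypothetical helper)."""
    stem = "https://api.pushshift.io/reddit/search/{}/?subreddit={}&size=500".format(
        kind, subreddit
    )
    # Each request moves the `after` cutoff back by another `skip` days.
    return ["{}&after={}d".format(stem, skip * x) for x in range(1, times + 1)]


urls = build_pushshift_urls("mariokart")
print(urls[0])
print(urls[-1])
```

Because each request sets only a lower bound (`after`) and no upper bound, the time windows overlap, which is one reason the de-duplication step inside the function is useful.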

In [4]:
sub_1_query = query_pushshift(subreddit_1)

https://api.pushshift.io/reddit/search/submission/?subreddit=SmashBrosUltimate&size=500&after=30d
https://api.pushshift.io/reddit/search/submission/?subreddit=SmashBrosUltimate&size=500&after=60d
https://api.pushshift.io/reddit/search/submission/?subreddit=SmashBrosUltimate&size=500&after=90d
https://api.pushshift.io/reddit/search/submission/?subreddit=SmashBrosUltimate&size=500&after=120d
https://api.pushshift.io/reddit/search/submission/?subreddit=SmashBrosUltimate&size=500&after=150d
https://api.pushshift.io/reddit/search/submission/?subreddit=SmashBrosUltimate&size=500&after=180d
https://api.pushshift.io/reddit/search/submission/?subreddit=SmashBrosUltimate&size=500&after=210d
https://api.pushshift.io/reddit/search/submission/?subreddit=SmashBrosUltimate&size=500&after=240d
https://api.pushshift.io/reddit/search/submission/?subreddit=SmashBrosUltimate&size=500&after=270d
https://api.pushshift.io/reddit/search/submission/?subreddit=SmashBrosUltimate&size=500&after=300d
(1472, 9)


In [5]:
sub_2_query = query_pushshift(subreddit_2)

https://api.pushshift.io/reddit/search/submission/?subreddit=mariokart&size=500&after=30d
https://api.pushshift.io/reddit/search/submission/?subreddit=mariokart&size=500&after=60d
https://api.pushshift.io/reddit/search/submission/?subreddit=mariokart&size=500&after=90d
https://api.pushshift.io/reddit/search/submission/?subreddit=mariokart&size=500&after=120d
https://api.pushshift.io/reddit/search/submission/?subreddit=mariokart&size=500&after=150d
https://api.pushshift.io/reddit/search/submission/?subreddit=mariokart&size=500&after=180d
https://api.pushshift.io/reddit/search/submission/?subreddit=mariokart&size=500&after=210d
https://api.pushshift.io/reddit/search/submission/?subreddit=mariokart&size=500&after=240d
https://api.pushshift.io/reddit/search/submission/?subreddit=mariokart&size=500&after=270d
https://api.pushshift.io/reddit/search/submission/?subreddit=mariokart&size=500&after=300d
(1892, 9)
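
The `timestamp` column created inside `query_pushshift` is derived from `created_utc`, which holds Unix epoch seconds. If dates that do not depend on the machine's time zone are preferred, one possible alternative is to parse the column as UTC with pandas. This is a sketch on a toy frame with made-up epoch values, not the scraped data:

```python
import pandas as pd

# Toy epoch-second values chosen for illustration; not taken from the scrape.
toy = pd.DataFrame({"created_utc": [0, 86_399, 86_400]})

# Parse as UTC-aware datetimes, then keep only the calendar date.
toy["timestamp"] = pd.to_datetime(toy["created_utc"], unit="s", utc=True).dt.date

print(toy)
```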


### Cleaning Our Subreddit Dataframes
Rather than work with every field returned by the function above, we focus on three features: the title, the post content (`selftext`), and the subreddit each post comes from. We then do some light cleaning of the new DataFrames by dropping posts with duplicate titles and rows with null values.

In [6]:
features = ['title', 'selftext', 'subreddit']

In [7]:
smash = sub_1_query[features]
smash.head()

Unnamed: 0,title,selftext,subreddit
0,Who's your main?,Mine:\nKing k. Roll\nDonkey Kong\nNess\nLucas\...,SmashBrosUltimate
1,What character to main?,I really like Fox and spent 20 hours trying to...,SmashBrosUltimate
13,Challenge,Type the reveal tagline of a character you wan...,SmashBrosUltimate
17,New Smash character prediction,So my theory is that when they announce the se...,SmashBrosUltimate
18,Mewtwo kinda thicc,Lowkey bruh he got some thicc ass thighs what ...,SmashBrosUltimate


In [8]:
kart = sub_2_query[features]
kart.head()

Unnamed: 0,title,selftext,subreddit
0,Time Trial issues,[removed],mariokart
1,Time Trial [MK8DX],So I have always considered myself very good ...,mariokart
11,Is my build awful? (MK8DX),[removed],mariokart
12,mario kart maker,[removed],mariokart
15,Research for a YouTube video I’m working on.,[removed],mariokart


In [9]:
print(smash.shape)
print(kart.shape)

(1472, 3)
(1892, 3)


In [10]:
# Drop rows with duplicated titles, keeping the first occurrence
smash = smash.drop_duplicates(subset = 'title')
kart = kart.drop_duplicates(subset = 'title')

In [11]:
smash.isna().sum()

title        0
selftext     3
subreddit    0
dtype: int64

In [12]:
kart.isna().sum()

title         0
selftext     12
subreddit     0
dtype: int64

In [13]:
# Drop rows with any null values
smash = smash.dropna()
kart = kart.dropna()

In [14]:
# Checking the number of rows after dropping duplicates and nulls
print(smash.shape)
print(kart.shape)

(1453, 3)
(1857, 3)
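
To make the effect of the two cleaning steps concrete, here is a minimal sketch on a toy frame. The rows are invented for illustration and are not from the scraped data.

```python
import pandas as pd

# Invented rows: one repeated title and one missing selftext.
toy = pd.DataFrame({
    "title": ["Time Trial", "Time Trial", "Best kart?", "New track"],
    "selftext": ["first post", "repost", None, "details"],
    "subreddit": ["mariokart"] * 4,
})

# Keep the first row for each title, then drop rows with any null value.
deduped = toy.drop_duplicates(subset="title")
cleaned = deduped.dropna()

print(deduped.shape)
print(cleaned.shape)
```

Note that `dropna()` only removes true nulls. Placeholder strings such as `[removed]`, which appear in the `kart.head()` output, are kept.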


In [15]:
# Exporting our scraped subreddit dataframes 
smash.to_csv('./smashultimate.csv', index = False)
kart.to_csv('./mariokart.csv', index = False)