<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px" />

# Project 3: Web API, NLP and Classification Models

## Task Guidelines

**For Project 3, the goal is two-fold:**

1. Using Pushshift's API, collect posts from any two subreddits.
2. Use NLP to train a classifier on which subreddit a given post came from. This is a binary classification problem.

## Overview

**This project consists of three notebooks:**

1. <a href="01_Data_Gathering.ipynb" Title="Data Gathering">01_Data_Gathering</a> - calls the Pushshift API in a timed loop, pulling 100 submissions per request. Posts come from four gaming subreddits: boardgames (1,400 rows), Fallout (1,600), RocketLeague (1,500) and DestinyTheGame (400), for a total of 4,900 rows.
2. <a href="02_NLP.ipynb" Title="NLP">02_NLP</a> - covers EDA and NLP preprocessing.
3. <a href="03_Classification_Model.ipynb" Title="Models">03_Classification_Models</a> - loads the saved CSV, builds several classification models and reports their predictions and scores.


In [1]:
import pandas as pd
import requests
import matplotlib.pyplot as plt
import time

In [2]:
# call only one url to test
# url='https://api.pushshift.io/reddit/search/submission'

# params = {'subreddit': 'RocketLeague',
#           'size' : 500,
#         } 

In [3]:
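# Wrap the API call in a function so we can loop to collect data
def getredditdata(num, subname, posttime):
    """Pull (num - 1) batches of up to 100 submissions from `subname`, paging back from `posttime`."""
    url = 'https://api.pushshift.io/reddit/search/submission'
    reddit_df = pd.DataFrame()

    # range(1, num) makes num - 1 requests, e.g. num=15 gives 14 batches
    for i in range(1, num):
        params = {'subreddit': subname,
                  'size': 100,
                  'before': posttime}
        push_res = requests.get(url, params=params)
        push_res.raise_for_status()  # stop on an HTTP error rather than parsing a bad response
        posts = push_res.json()['data']
        if not posts:  # no older posts left, so stop early
            break
        df = pd.DataFrame(posts)

        # the oldest post in this batch becomes the cursor for the next request
        posttime = int(df['created_utc'].min())
        reddit_df = pd.concat([reddit_df, df], axis=0, ignore_index=True)

        print(f'Finished pulling {100*i} records from {subname}.')

        # pause between requests to avoid hammering the API
        time.sleep(55)

    return reddit_df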
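
The function pages backwards in time: each request asks for posts `before` the oldest `created_utc` seen so far. The cursor logic can be sketched offline as below. This is only an illustration: `fake_fetch`, `paginate` and `FAKE_POSTS` are hypothetical stand-ins, not part of the notebook or of the Pushshift API, and the stand-in assumes `before` is a strict "older than" filter.

```python
# Hypothetical in-memory stand-in for the API: 250 posts, newest first.
FAKE_POSTS = [{"id": i, "created_utc": 1_000_000 - i} for i in range(250)]


def fake_fetch(before, size=100):
    """Return up to `size` of the newest posts strictly older than `before`."""
    older = [p for p in FAKE_POSTS if p["created_utc"] < before]
    return older[:size]


def paginate(fetch, before, pages):
    """Collect up to `pages` batches, moving the `before` cursor back each time."""
    collected = []
    for _ in range(pages):
        batch = fetch(before)
        if not batch:  # nothing older is left, so stop early
            break
        collected.extend(batch)
        # The oldest timestamp in this batch becomes the next cursor.
        before = min(post["created_utc"] for post in batch)
    return collected


posts = paginate(fake_fetch, before=1_000_001, pages=5)
print(len(posts))
```

Because the cursor always moves to the oldest timestamp already collected, consecutive batches do not overlap in this sketch, and the loop ends on its own once the source runs dry.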

In [4]:
# num=15 makes 14 requests of 100 posts each: 1,400 rows from boardgames
boardgames_df = getredditdata(15, 'boardgames', int(time.time()))

Finished pulling 100 records from boardgames.
Finished pulling 200 records from boardgames.
Finished pulling 300 records from boardgames.
Finished pulling 400 records from boardgames.
Finished pulling 500 records from boardgames.
Finished pulling 600 records from boardgames.
Finished pulling 700 records from boardgames.
Finished pulling 800 records from boardgames.
Finished pulling 900 records from boardgames.
Finished pulling 1000 records from boardgames.
Finished pulling 1100 records from boardgames.
Finished pulling 1200 records from boardgames.
Finished pulling 1300 records from boardgames.
Finished pulling 1400 records from boardgames.


In [5]:
# num=17 makes 16 requests of 100 posts each: 1,600 rows from Fallout
fallout_df = getredditdata(17, 'Fallout', int(time.time()))

Finished pulling 100 records from Fallout.
Finished pulling 200 records from Fallout.
Finished pulling 300 records from Fallout.
Finished pulling 400 records from Fallout.
Finished pulling 500 records from Fallout.
Finished pulling 600 records from Fallout.
Finished pulling 700 records from Fallout.
Finished pulling 800 records from Fallout.
Finished pulling 900 records from Fallout.
Finished pulling 1000 records from Fallout.
Finished pulling 1100 records from Fallout.
Finished pulling 1200 records from Fallout.
Finished pulling 1300 records from Fallout.
Finished pulling 1400 records from Fallout.
Finished pulling 1500 records from Fallout.
Finished pulling 1600 records from Fallout.


In [6]:
# num=16 makes 15 requests of 100 posts each: 1,500 rows from RocketLeague
rocketleague_df = getredditdata(16, 'RocketLeague', int(time.time()))

Finished pulling 100 records from RocketLeague.
Finished pulling 200 records from RocketLeague.
Finished pulling 300 records from RocketLeague.
Finished pulling 400 records from RocketLeague.
Finished pulling 500 records from RocketLeague.
Finished pulling 600 records from RocketLeague.
Finished pulling 700 records from RocketLeague.
Finished pulling 800 records from RocketLeague.
Finished pulling 900 records from RocketLeague.
Finished pulling 1000 records from RocketLeague.
Finished pulling 1100 records from RocketLeague.
Finished pulling 1200 records from RocketLeague.
Finished pulling 1300 records from RocketLeague.
Finished pulling 1400 records from RocketLeague.
Finished pulling 1500 records from RocketLeague.


In [7]:
# num=5 makes 4 requests of 100 posts each: 400 rows from DestinyTheGame
destinygame_df = getredditdata(5, 'DestinyTheGame', int(time.time()))

Finished pulling 100 records from DestinyTheGame.
Finished pulling 200 records from DestinyTheGame.
Finished pulling 300 records from DestinyTheGame.
Finished pulling 400 records from DestinyTheGame.


In [8]:
# Print the shapes to confirm each subreddit returned the expected number of rows
print(boardgames_df.shape)
print(fallout_df.shape)
print(rocketleague_df.shape)
print(destinygame_df.shape)

(1400, 82)
(1600, 70)
(1500, 82)
(400, 67)
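
The column counts differ across subreddits (82, 70, 82 and 67), so the four frames do not share a schema. The toy example below, with made-up values, suggests why it helps to pick a common set of columns before combining frames. The `flair` and `score` columns are hypothetical extras chosen for illustration, not columns confirmed to be in the pulled data.

```python
import pandas as pd

# Toy frames with made-up values; `flair` and `score` are hypothetical extra columns.
a = pd.DataFrame({"subreddit": ["boardgames"], "title": ["t1"],
                  "selftext": ["s1"], "flair": ["x"]})
b = pd.DataFrame({"subreddit": ["Fallout"], "title": ["t2"],
                  "selftext": ["s2"], "score": [3]})

# Concatenating as-is keeps the union of columns and fills the gaps with NaN.
union = pd.concat([a, b], ignore_index=True)

# Restricting each frame to the shared columns first avoids those gaps.
keep = ["subreddit", "selftext", "title"]
trimmed = pd.concat([a[keep], b[keep]], ignore_index=True)

print(union.shape, trimmed.shape)
```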


In [9]:
# Keep only the three columns needed for the NLP task
boardgames_df = boardgames_df[['subreddit', 'selftext', 'title']]
fallout_df = fallout_df[['subreddit', 'selftext', 'title']]
rocketleague_df = rocketleague_df[['subreddit', 'selftext', 'title']]
destinygame_df = destinygame_df[['subreddit', 'selftext', 'title']]

In [10]:
rocketleague_df.isnull().sum()

subreddit    0
selftext     3
title        0
dtype: int64
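
A few posts have no `selftext`. One common way to handle this, shown here on made-up rows and not necessarily what the later notebooks do, is to replace the missing bodies with an empty string and join title and body into a single text field. The `text` column name is an assumption made for this sketch.

```python
import pandas as pd

# Made-up rows that mimic the three columns kept above.
toy = pd.DataFrame({
    "subreddit": ["RocketLeague", "RocketLeague"],
    "selftext": ["Looking for a teammate", None],
    "title": ["LFG", "Clip of the day"],
})

# Replace missing bodies with an empty string, then join title and body.
toy["selftext"] = toy["selftext"].fillna("")
toy["text"] = (toy["title"] + " " + toy["selftext"]).str.strip()

print(toy["text"].tolist())
```

Filling rather than dropping keeps title-only posts in the dataset, which may matter when the title alone carries enough signal.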

In [11]:
# Combine the three main subreddit DataFrames into one
full_df = pd.concat([boardgames_df, fallout_df, rocketleague_df])
print(f'Merged all the dataframes to get {len(full_df)} records.')

Merged all the dataframes to get 4500 records.


In [12]:
full_df.shape

(4500, 3)

In [13]:
full_df.isnull().sum()

subreddit     0
selftext     17
title         0
dtype: int64

In [14]:
full_df.to_csv('./datasets/reddit4500.csv')

In [15]:
full_df1 = pd.concat([full_df, destinygame_df], join="outer")

In [16]:
full_df1.to_csv('./datasets/reddit4900.csv')
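
Note that `to_csv` writes the DataFrame index as an unnamed first column by default, and the index here repeats because each source frame keeps its own row labels. The small round trip below, using a temporary file and made-up rows, shows one way a downstream notebook could read such a file back cleanly. The file name `toy.csv` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Each frame keeps its own row labels, so the combined index is [0, 0].
toy = pd.concat([
    pd.DataFrame({"subreddit": ["boardgames"], "title": ["t1"]}),
    pd.DataFrame({"subreddit": ["Fallout"], "title": ["t2"]}),
])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.csv")
    toy.to_csv(path)  # the index is written as an unnamed first column
    with_index = pd.read_csv(path)  # the index comes back as a data column
    clean = pd.read_csv(path, index_col=0).reset_index(drop=True)

print(list(with_index.columns))
print(list(clean.columns))
```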

In [17]:
full_df1.isnull().sum()

subreddit     0
selftext     17
title         0
dtype: int64

### Summary:

1. Collected posts from three subreddits: boardgames, Fallout and RocketLeague. A fourth, DestinyTheGame, was pulled to check whether adding it makes a big difference to model accuracy.
2. After a quick evaluation, decided to go with three categories.
3. 4,500 records were gathered for the NLP and modelling steps.
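
As a rough reference point for the accuracy scores in the modelling notebook, the class shares can be worked out from the row counts printed earlier. The sketch below only does that arithmetic. Treating the largest share as a majority-class baseline is an assumption about how the scores might be compared, not something taken from the later notebooks.

```python
# Row counts per subreddit, taken from the shapes printed earlier in this notebook.
counts = {"boardgames": 1400, "Fallout": 1600, "RocketLeague": 1500}

total = sum(counts.values())
shares = {name: round(n / total, 3) for name, n in counts.items()}

# Always predicting the largest class would be right this share of the time.
baseline = max(shares.values())

print(total, shares, baseline)
```

The three classes are close to balanced, so a model would need to score well above roughly 0.36 accuracy to be doing better than a constant guess.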

### Next link: 
Please navigate to <a href="02_NLP.ipynb" title="NLP Notebook">NLP Notebook</a>