# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Project 3: Web APIs & NLP from Reddit Part 1/2

# Contents:
- [Problem Statement](#Problem-Statement)
- [Data collection via API](#Data-Collection-via-API)
- [Save file](#Save-marvel-and-dc-post)

## Problem Statement

Our client is a game creator that specialises in mobile games, and their most recent game features heroes from the DC cinematic universe and Marvel Studios.

As data scientists at a consultancy firm, our task is to build a classification model based on Reddit posts from the Marvel and DC subreddits to support the client's marketing campaign. The client will use the model to predict whether an internet user is a Marvel fan or a DC fan, so that it can display the appropriate advertisement and entice the user to download the game, boosting the game's popularity. The model will also help the client understand which superheroes are discussed most, so that those heroes can be given higher priority or added to the game.

The dataset consists of the latest 1000 posts from each of the two subreddits. The primary stakeholder is the client, who wants to boost the game's popularity; the secondary stakeholders are the game's consumers.

The following is the general workflow for this project: 
+ Data Collection
+ Exploratory data analysis (EDA)
+ Data cleaning by removing special characters, lemmatizing/stemming and word removal
+ Pre-processing and feature engineering
+ Modelling and evaluation
+ Conclusion and recommendation

The model will then be evaluated on two metrics: test accuracy and ROC AUC score. The objective is to score highly on both.
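
To make the two metrics concrete, here is a minimal pure-Python sketch of how each one can be computed. The labels and scores are made up for illustration, and the helper names `accuracy` and `roc_auc` are hypothetical; in practice a library such as scikit-learn would normally be used instead.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def roc_auc(y_true, y_score):
    """Probability that a random positive outscores a random negative (ties count as half)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels (1 = marvel, 0 = dc) and predicted probabilities, for illustration only.
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in y_score]  # threshold at 0.5

print(accuracy(y_true, y_pred))  # 4 of 6 correct
print(roc_auc(y_true, y_score))  # 8 of 9 positive/negative pairs ranked correctly
```

Accuracy depends on the chosen threshold, while ROC AUC only depends on how the scores rank the two classes, which is why the two numbers can differ.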

## Data Collection via API

In [1]:
# Import libraries
import pandas as pd
import numpy as np 
import requests
from bs4 import BeautifulSoup

### To gather information from 1 request

In [2]:
url = 'https://api.pushshift.io/reddit/search/submission'

params = {
    'subreddit':'marvelstudios',
    'size':100,
    'before':1622966051
}

res = requests.get(url, params=params)
res.status_code

200

A status code of 200 means the request succeeded.
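
As a small illustration of reading status codes, the sketch below uses the standard library's `http.HTTPStatus` to turn a code into a readable message. The helper name `describe_status` is hypothetical and not part of this notebook; with `requests`, calling `res.raise_for_status()` is another way to fail early on 4xx/5xx responses.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable summary of an HTTP status code."""
    status = HTTPStatus(code)
    # Codes in the 2xx range indicate success; anything else is treated as a failure here.
    outcome = "success" if 200 <= code < 300 else "failure"
    return f"{code} {status.phrase} ({outcome})"


print(describe_status(200))  # 200 OK (success)
print(describe_status(429))  # 429 Too Many Requests (failure)
```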

In [3]:
data = res.json()
post = data['data']

In [4]:
len(post)

100

In [5]:
df= pd.DataFrame(post)

In [6]:
df[['subreddit','selftext','title']].head()

| | subreddit | selftext | title |
|---|---|---|---|
| 0 | marvelstudios | | Dang, I love making these. |
| 1 | marvelstudios | | Dang, I love making these. |
| 2 | marvelstudios | After Sam's words of advice in Episode 5 of FA... | What I think Bucky will do next. |
| 3 | marvelstudios | 8 years later definitely could of changed to 5... | I know that marvel had the power to change Bla... |
| 4 | marvelstudios | | Let's Talk: Marvel! |


In [7]:
df.columns

Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_css_class', 'author_flair_richtext',
       'author_flair_template_id', 'author_flair_text',
       'author_flair_text_color', 'author_flair_type', 'author_fullname',
       'author_patreon_flair', 'author_premium', 'awarders', 'can_mod_post',
       'contest_mode', 'created_utc', 'domain', 'full_link', 'gildings', 'id',
       'is_created_from_ads_ui', 'is_crosspostable', 'is_meta',
       'is_original_content', 'is_reddit_media_domain', 'is_robot_indexable',
       'is_self', 'is_video', 'link_flair_background_color',
       'link_flair_css_class', 'link_flair_richtext', 'link_flair_template_id',
       'link_flair_text', 'link_flair_text_color', 'link_flair_type', 'locked',
       'media_only', 'no_follow', 'num_comments', 'num_crossposts', 'over_18',
       'parent_whitelist_status', 'permalink', 'pinned', 'post_hint',
       'preview', 'pwls', 'retrieved_on', 'score', 'selftext', 'send_replies',
       '

### To gather 1000 posts per subreddit

In [8]:
from time import sleep

In [9]:
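def get_posts(subreddit, n_iter):
    """Fetch n_iter pages of up to 100 posts each from `subreddit`, newest first."""
    df_list = []
    current_time = 1623963107  # fixed start time (epoch seconds) for the first request

    for _ in range(n_iter):
        res = requests.get(
            url,
            params={
                "subreddit": subreddit,
                "size": 100,
                "before": current_time,  # only posts created before this time
                "is_self": True,  # if submission is a self post
                "is_original_content": True,
                "allow_videos": False,
                "allow_images": False,
                "allow_videogifs": False,
                "title:not": "POLL",  # intended to exclude poll posts by title
            },
        )
        sleep(3)  # pause between requests
        df = pd.DataFrame(res.json()["data"])
        # Keep only the columns needed later
        df = df.loc[:, ["subreddit", "title", "selftext", "created_utc", "id"]]
        df_list.append(df)
        # Page backwards: the oldest post in this batch bounds the next request
        current_time = df.created_utc.min()

    return pd.concat(df_list, axis=0)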

Note:
- Pushshift returns at most 100 posts per request, so we need 10 requests to collect 1000 posts per subreddit.
- The columns kept are subreddit, title, selftext, created_utc and id.
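
The paging idea behind the 10 requests can be sketched without any network access. In the example below, `fake_fetch` is a hypothetical stand-in for a single API call, and the posts are made-up records; only the cursor logic (setting the next `before` to the oldest timestamp seen so far) mirrors the approach used in this notebook.

```python
def fake_fetch(posts, before, size):
    """Hypothetical stand-in for one API call: newest `size` posts created before `before`."""
    older = [p for p in posts if p["created_utc"] < before]
    older.sort(key=lambda p: p["created_utc"], reverse=True)
    return older[:size]


def paginate(posts, start_time, size, n_iter):
    """Collect up to size * n_iter posts by moving the `before` cursor back each round."""
    collected = []
    before = start_time
    for _ in range(n_iter):
        page = fake_fetch(posts, before, size)
        if not page:  # nothing older is left, so stop early
            break
        collected.extend(page)
        # The oldest timestamp on this page becomes the next cursor.
        before = min(p["created_utc"] for p in page)
    return collected


# Made-up posts with timestamps 1..25, for illustration only.
all_posts = [{"id": f"p{t}", "created_utc": t} for t in range(1, 26)]
result = paginate(all_posts, start_time=100, size=10, n_iter=3)

print(len(result))                     # total posts collected
print(len({p["id"] for p in result}))  # number of unique ids
```

Comparing the total count with the number of unique ids, as done for the real data with `nunique()`, is a quick way to confirm that moving the cursor did not fetch any post twice. Because this sketch filters strictly on `created_utc < before`, posts that share the cursor timestamp would be skipped; whether the real API behaves the same way is an assumption worth checking.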

In [10]:
df_marvel = get_posts('marvelstudios', 10)

In [11]:
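# Note: `res` here is the global response from In [2]; responses inside get_posts are local to the function
res.status_code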

200

In [12]:
df_marvel.head(10)

| | subreddit | title | selftext | created_utc | id |
|---|---|---|---|---|---|
| 0 | marvelstudios | Spider multiverse and LoKi TVA ... | How does the spider multiverse coincide with t... | 1623961657 | o26yrt |
| 1 | marvelstudios | My guess on how Loki will end and lead into th... | First up I'm sure I'm not the first person to ... | 1623961646 | o26ymh |
| 2 | marvelstudios | Excuse me what? Partial rant | Ok so we watched the first episode of loki and... | 1623961252 | o26tbi |
| 3 | marvelstudios | Joe Biden had too much | [removed] | 1623960897 | o26o25 |
| 4 | marvelstudios | So if there's only a single timeline, how are ... | | 1623960476 | o26hkv |
| 5 | marvelstudios | Loki strength | How did Loki got his ass kicked by some random... | 1623960470 | o26hhu |
| 6 | marvelstudios | Whoever is making Loki should be brought in to... | [removed] | 1623960149 | o26cxa |
| 7 | marvelstudios | avengers age of ultron | hi y’all, i’m currently rewatching all the mar... | 1623960140 | o26csr |
| 8 | marvelstudios | BLACK WIDOW Social Media Reactions Megathread | BLACK WIDOW Social Media Reactions Megathread\... | 1623959901 | o2691g |
| 9 | marvelstudios | Just did a first watch of Thor: Ragnarok with ... | I didn't let them know Hulk was in it, the com... | 1623959421 | o2622o |


In [13]:
df_marvel['id'].nunique()

1000

In [14]:
len(df_marvel)

1000

In [15]:
df_marvel.shape

(1000, 5)

This is the data from the marvelstudios subreddit.

In [16]:
df_dc = get_posts('DC_Cinematic', 10)

In [17]:
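# Note: `res` here is the global response from In [2]; responses inside get_posts are local to the function
res.status_code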

200

In [18]:
df_dc.head(10)

| | subreddit | title | selftext | created_utc | id |
|---|---|---|---|---|---|
| 0 | DC_Cinematic | OTHER: Where can I stream/rent THE DARK KNIGHT... | Hi everyone! I was planning to watch THE DARK ... | 1623954447 | o242ng |
| 1 | DC_Cinematic | Discussion: To those still Streaming ZSJL week... | The thing you should all be doing is switching... | 1623952206 | o236we |
| 2 | DC_Cinematic | Poll:which cinematic universe do you wb to con... | Hamadaverse:jl17,shazam,bop,ww84,tss,\nBlack A... | 1623951214 | o22sv3 |
| 3 | DC_Cinematic | Which Batman stories do you think made Batflec... | Stories like Knightfall and Killing Joke come ... | 1623950894 | o22o9n |
| 4 | DC_Cinematic | MERCHANDISE: About a US 4K/Blu-ray/DVD release... | Everything proceeding refers to Warner Bros ex... | 1623950516 | o22is3 |
| 5 | DC_Cinematic | I Think Man Should play Reverse Flash in Flash... | [removed] | 1623949368 | o2230e |
| 6 | DC_Cinematic | Discussion: After the snyder cut | After I watched the snyder cut I just gave up ... | 1623947876 | o21il2 |
| 7 | DC_Cinematic | Other: Has anyone bought ZSJL 4K DVD in the US? | Has anyone bought the ZSJL 4k DVD in the US an... | 1623945609 | o20nia |
| 8 | DC_Cinematic | What's gonna happen with the DCEU, since that ... | [removed] | 1623916879 | o1rsvp |
| 9 | DC_Cinematic | Question: Unfilmed aquaman scene where he foll... | I read about this scene quite recently but can... | 1623914391 | o1r825 |


In [19]:
len(df_dc)

1000

In [20]:
df_dc['id'].nunique()

1000

In [21]:
df_dc.shape

(1000, 5)

This is the data from the DC_Cinematic subreddit.

In [22]:
print('min', df_marvel['created_utc'].min())
print('max', df_marvel['created_utc'].max())

min 1623339481
max 1623961657


In [23]:
print('min', df_dc['created_utc'].min())
print('max', df_dc['created_utc'].max())

min 1620251011
max 1623954447
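
The `created_utc` values are Unix timestamps (seconds since 1970-01-01 UTC), which are hard to read at a glance. The sketch below converts the min and max values printed above into dates and a span in days; the helper name `describe_range` is hypothetical.

```python
from datetime import datetime, timezone


def describe_range(name, min_utc, max_utc):
    """Convert epoch seconds to UTC datetimes and report the span in days."""
    start = datetime.fromtimestamp(min_utc, tz=timezone.utc)
    end = datetime.fromtimestamp(max_utc, tz=timezone.utc)
    span_days = (end - start).total_seconds() / 86400  # 86400 seconds per day
    return f"{name}: {start:%Y-%m-%d} to {end:%Y-%m-%d} ({span_days:.1f} days)"


# Min and max created_utc values taken from the outputs of In [22] and In [23].
marvel_range = describe_range("marvelstudios", 1623339481, 1623961657)
dc_range = describe_range("DC_Cinematic", 1620251011, 1623954447)

print(marvel_range)  # marvelstudios: 2021-06-10 to 2021-06-17 (7.2 days)
print(dc_range)      # DC_Cinematic: 2021-05-05 to 2021-06-17 (42.9 days)
```

The 1000 marvelstudios posts cover about a week, while the 1000 DC_Cinematic posts cover about six weeks, so the two samples come from time windows of quite different lengths.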


There is a total of 2000 posts from the Marvel and DC subreddits.

## Save marvel and dc post 

In [24]:
# Save the pulled posts (1000 per subreddit) to CSV
df_marvel.to_csv('../data/marvel.csv', index=False)
df_dc.to_csv('../data/dc.csv', index=False)