# Classifying Reddit Posts to Marvel/DC Movie Subreddits Pt 1 

## 1. Problem Statement
We are employees of a marketing agency hired by a toy company to perform market research on Reddit. Our task is to build a model that classifies posts as coming from either the Marvel or the DC movies subreddit, in order to:

- Build a classifier that can be applied to text data from other platforms (e.g. Twitter, Facebook) to determine public interest in either movie franchise
- Find popular heroes and keywords for each franchise to identify product and marketing opportunities (using sentiment analysis)

The success of the classifier will be evaluated using accuracy, i.e. is the model able to correctly label a post as coming from the Marvel or DC subreddit? Similarly, sentiment analysis will be evaluated using accuracy, i.e. is the model able to correctly identify a post as having positive, neutral or negative sentiment? Together, these results can help us identify the most discussed heroes and keywords (a sign of popularity) and whether the discussions around them are positive, neutral or negative. That in turn points to product and marketing opportunities that could boost revenue for our customer, the toy company.
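As a minimal illustration of the accuracy metric described above, the sketch below computes the share of posts whose predicted label matches the true label. The `accuracy` helper and the example labels are hypothetical and are not taken from the project data.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels for five posts (illustrative only)
true_labels = ["marvel", "dc", "dc", "marvel", "dc"]
predicted = ["marvel", "dc", "marvel", "marvel", "dc"]

example_accuracy = accuracy(true_labels, predicted)  # 4 of 5 labels match
```

The same calculation would apply to the three-class sentiment labels, since accuracy only counts exact matches.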

## 2. Overview of notebook
In this first code notebook, we use the Pushshift API to get posts from the Marvel and DC subreddits to be used as the training dataset for our classification model.

## 3. Import libraries

All libraries used will be imported here. 

In [1]:
import requests
import json
import pandas as pd
import numpy as np 
import time
from random import randint

## 4. Data Collection

Use the API to collect sufficient data for NLP and model training.
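To make a single request concrete, the sketch below builds the query URL for one page of results without sending it. It reuses the endpoint and the `subreddit`, `size` and `before` parameters that appear in the function in this section; the `build_query_url` helper and the timestamp are illustrative assumptions.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.pushshift.io/reddit/search/submission"


def build_query_url(subreddit, before, size=100):
    """Build the URL for one page of submissions created before `before`."""
    params = {"subreddit": subreddit, "size": size, "before": before}
    return f"{BASE_URL}?{urlencode(params)}"


# Illustrative timestamp in epoch seconds, not a value from the project
url = build_query_url("DC_Cinematic", before=1600000000)
print(url)
```

Passing the same dictionary to `requests.get(url, params=...)` lets `requests` do this encoding itself.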

In [2]:
# Define a function that takes in a subreddit and number of posts (in multiples of 100)
# and returns a DataFrame including the subreddit, title, selftext and creation time of the post 

def get_posts(subreddit, n):
    posts = []
    date = round(time.time())  # start from the current time and page backwards
    url = 'https://api.pushshift.io/reddit/search/submission'

    for _ in range(n // 100):  # integer division: range() does not accept floats
        # Call API to pull up to 100 posts created before `date`
        params = {
            'subreddit': subreddit,
            'size': 100,
            'before': date,
        }
        res = requests.get(url, params=params)
        res.raise_for_status()  # stop on HTTP errors instead of parsing an error response
        data = res.json()['data']  # a list of posts, each post is a dict
        if not data:  # no older posts left to collect
            break

        # Add onto/reset accumulators
        date = min(post['created_utc'] for post in data)  # oldest post in this batch
        posts.extend(data)

        # Wait 1-3 min before calling API again
        # time.sleep(randint(60,180))

    # Return a dataframe of all the posts
    return pd.DataFrame(posts)
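The `before` cursor is what lets each call pick up where the previous one stopped. The sketch below replays that paging logic against an in-memory list instead of the live API, so it can be run offline. `fake_fetch`, `collect` and the synthetic posts are hypothetical stand-ins, not part of the Pushshift API.

```python
# Synthetic posts with created_utc 1..250 (hypothetical stand-in for the API)
ALL_POSTS = [{"id": i, "created_utc": i} for i in range(1, 251)]


def fake_fetch(before, size=100):
    """Return up to `size` of the newest posts created strictly before `before`."""
    older = [p for p in ALL_POSTS if p["created_utc"] < before]
    older.sort(key=lambda p: p["created_utc"], reverse=True)
    return older[:size]


def collect(n, start):
    """Page backwards in time until n posts are requested or data runs out."""
    posts, cursor = [], start
    for _ in range(n // 100):
        batch = fake_fetch(cursor)
        if not batch:  # nothing older left
            break
        cursor = min(p["created_utc"] for p in batch)  # oldest post seen so far
        posts.extend(batch)
    return posts


collected = collect(500, start=1000)
```

Here five pages are requested but only 250 posts exist, so the loop stops once a page comes back empty and no post is collected twice.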

In [13]:
# Use function to collect and store DC subreddit data in dataframes
dc_df = get_posts('DC_Cinematic', 5000)

In [14]:
# Use function to collect and store marvel subreddit data in dataframes 
marvel_df = get_posts('marvelstudios', 5000)

In [15]:
# Save dataframes in csv files 
marvel_df.to_csv('./datasets/marvel.csv', index=False)
dc_df.to_csv('./datasets/dc.csv', index=False)
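Note that `to_csv` does not create missing folders, so the save cell above fails if `./datasets` does not exist yet. A minimal sketch of creating the folder first is shown below; it uses a temporary directory as a stand-in for the project root, which is an assumption made so the example can run anywhere.

```python
import tempfile
from pathlib import Path

# A temporary directory stands in for the project root in this sketch
project_root = Path(tempfile.mkdtemp())
datasets_dir = project_root / "datasets"

# Create the folder if it is missing; exist_ok avoids an error on reruns
datasets_dir.mkdir(parents=True, exist_ok=True)
datasets_dir.mkdir(parents=True, exist_ok=True)  # second call is a no-op

# Paths that could then be passed to DataFrame.to_csv
marvel_path = datasets_dir / "marvel.csv"
dc_path = datasets_dir / "dc.csv"
```

In the notebook itself, the equivalent would be to call `Path('./datasets').mkdir(parents=True, exist_ok=True)` before saving.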