# Book 1 - Problem Statement & Data Collection

---

This project consists of 4 separate notebooks:

1. Book 1 - Problem Statement & Data Collection
2. Book 2 - Preprocessing
3. Book 3 - Modelling & Evaluation Part 1 - Text Classification
4. Book 4 - Modelling & Evaluation Part 2 - Sentiment Analysis

## I. Problem Statement

We are employees of a marketing agency hired by a toy company to perform market research on Reddit. The task is to classify posts from the *Marvel* vs *DC* movie subreddits in order to:
- Build a classifier model that can be applied to text data from other platforms (e.g. Twitter, Facebook) to determine public interest in either movie franchise
- Decide which top heroes from each movie franchise to create toys for


## 1.0 Import Libraries

---

In [1]:
import requests
import pandas as pd

## 2.0 Request Subreddit Data via Pushshift API

---

The API sometimes returns fewer posts than requested in a given iteration. Of the 20,000 posts requested across both subreddits, 19,990 were received. Because the shortfall is negligible, the issue was ignored.
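To put the shortfall in perspective, a quick check using the counts reported in this notebook (9,993 and 9,997 posts, against 100 requests of 100 posts per subreddit) could look like the sketch below. The variable names are illustrative only.

```python
# Counts taken from this notebook: 100 requests x 100 posts for each of 2 subreddits
requested = 2 * 100 * 100
received = 9_993 + 9_997

missing = requested - received
shortfall_pct = 100 * missing / requested
print(f'Missing posts: {missing} ({shortfall_pct:.2f}% of requested)')
```

A shortfall of this size is unlikely to affect a classifier trained on roughly 10,000 posts per class.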

In [2]:
# Function to grab posts from a subreddit using the Pushshift API
# Since the API limits each request to 100 posts, the function loops to request more posts
def get_data(subreddit, loops):
    for loop in range(loops):
        if loop == 0:
            url = 'https://api.pushshift.io/reddit/search/submission'

            params = {
                'subreddit': subreddit,
                'size': 100
            }

            res = requests.get(url,params)

            df = pd.DataFrame(res.json()['data'])
            created_utc = df['created_utc'].iloc[-1]
            print(f'DataFrame shape: {df.shape}')
        else:
            params = {
                'subreddit': subreddit,
                'size': 100, 
                'before' : created_utc
            }
            res = requests.get(url,params)
            df = pd.concat([df, pd.DataFrame(res.json()['data'])], ignore_index=True)
            created_utc = df['created_utc'].iloc[-1]
            print(f'DataFrame shape: {df.shape}')
        print(f'Scraping data from subreddit/{subreddit}')
        print(f'Status code: {res.status_code}')
        print(f'Iteration: {loop}')
    return df
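
The pagination idea in `get_data` can be illustrated offline. The sketch below is not the notebook's function: `paginate`, `fake_fetch`, and `fake_posts` are hypothetical names, and the fake data stands in for the Pushshift response (assumed to be ordered newest first). It shows how passing the oldest `created_utc` seen so far as the `before` cursor walks backwards through the posts, and it adds one assumed safeguard: stopping early when a page comes back empty.

```python
import pandas as pd

def paginate(fetch_page, loops):
    """Collect pages, using the oldest timestamp seen so far as the `before` cursor."""
    frames = []
    before = None
    for _ in range(loops):
        page = fetch_page(before)
        if not page:  # assumed safeguard: stop when nothing more is returned
            break
        frames.append(pd.DataFrame(page))
        before = page[-1]['created_utc']  # last item of a newest-first page is the oldest
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

# Hypothetical stand-in for the API: 250 fake posts, newest first
fake_posts = [{'id': i, 'created_utc': 1_000_000 - i} for i in range(250)]

def fake_fetch(before, size=100):
    """Return up to `size` fake posts strictly older than `before`."""
    older = [p for p in fake_posts if before is None or p['created_utc'] < before]
    return older[:size]

df_demo = paginate(fake_fetch, loops=5)
print(df_demo.shape)
```

With real data, the same loop would call `requests.get` in place of `fake_fetch`; checking for an empty page avoids indexing into a response that has no posts.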


In [3]:
# Initialize the number of loops to run
iteration = 100

### 2.1 Request Data from r/marvelstudios

A total of 9,993 posts were scraped from the subreddit.

In [4]:
# Request the 10,000 most recent posts from the r/marvelstudios subreddit
raw_data_marvel = get_data('marvelstudios',iteration)

DataFrame shape: (100, 80)
Scraping data from subreddit/marvelstudios
Status code: 200
Iteration: 0
DataFrame shape: (200, 82)
Scraping data from subreddit/marvelstudios
Status code: 200
Iteration: 1
DataFrame shape: (300, 82)
Scraping data from subreddit/marvelstudios
Status code: 200
Iteration: 2
DataFrame shape: (400, 82)
Scraping data from subreddit/marvelstudios
Status code: 200
Iteration: 3
DataFrame shape: (500, 82)
Scraping data from subreddit/marvelstudios
Status code: 200
Iteration: 4
DataFrame shape: (600, 82)
Scraping data from subreddit/marvelstudios
Status code: 200
Iteration: 5
DataFrame shape: (700, 82)
Scraping data from subreddit/marvelstudios
Status code: 200
Iteration: 6
DataFrame shape: (800, 82)
Scraping data from subreddit/marvelstudios
Status code: 200
Iteration: 7
DataFrame shape: (900, 82)
Scraping data from subreddit/marvelstudios
Status code: 200
Iteration: 8
DataFrame shape: (1000, 82)
Scraping data from subreddit/marvelstudios
Status code: 200
Iteration: 9

In [None]:
raw_data_marvel.info()

### 2.2 Request Data from r/DC_Cinematic

A total of 9,997 posts were scraped from the subreddit.

In [10]:
# Request the 10,000 most recent posts from the r/DC_Cinematic subreddit
raw_data_dc = get_data('DC_Cinematic',iteration)

DataFrame shape: (100, 81)
Scraping data from subreddit/DC_Cinematic
Status code: 200
Iteration: 0
DataFrame shape: (200, 81)
Scraping data from subreddit/DC_Cinematic
Status code: 200
Iteration: 1
DataFrame shape: (300, 81)
Scraping data from subreddit/DC_Cinematic
Status code: 200
Iteration: 2
DataFrame shape: (400, 81)
Scraping data from subreddit/DC_Cinematic
Status code: 200
Iteration: 3
DataFrame shape: (500, 81)
Scraping data from subreddit/DC_Cinematic
Status code: 200
Iteration: 4
DataFrame shape: (600, 81)
Scraping data from subreddit/DC_Cinematic
Status code: 200
Iteration: 5
DataFrame shape: (700, 81)
Scraping data from subreddit/DC_Cinematic
Status code: 200
Iteration: 6
DataFrame shape: (800, 81)
Scraping data from subreddit/DC_Cinematic
Status code: 200
Iteration: 7
DataFrame shape: (900, 81)
Scraping data from subreddit/DC_Cinematic
Status code: 200
Iteration: 8
DataFrame shape: (1000, 81)
Scraping data from subreddit/DC_Cinematic
Status code: 200
Iteration: 9

In [None]:
raw_data_dc.info()

## 3.0 Save Data

---

In [11]:
# Save requested data
# Note: index_label=True writes the index column under the header "True"
raw_data_marvel.to_csv('./data/raw_data_marvel.csv', index_label=True)
raw_data_dc.to_csv('./data/raw_data_dc.csv', index_label=True)
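
As a minimal sketch of the save step, the example below writes a small made-up DataFrame to a temporary folder and reads it back. It assumes the target folder may not exist yet, which is why `os.makedirs` is called first (`to_csv` does not create missing folders). The `demo` frame and the file name are hypothetical and stand in for the scraped data.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the scraped data
demo = pd.DataFrame({
    'title': ['post a', 'post b'],
    'subreddit': ['marvelstudios', 'DC_Cinematic'],
})

with tempfile.TemporaryDirectory() as tmp:
    data_dir = os.path.join(tmp, 'data')
    os.makedirs(data_dir, exist_ok=True)  # create the folder if it is missing

    path = os.path.join(data_dir, 'demo.csv')
    demo.to_csv(path)  # the index is written as the first column
    reloaded = pd.read_csv(path, index_col=0)  # read that column back as the index

print(reloaded.shape)
```

Reading the file back with `index_col=0` keeps the saved index from reappearing as an extra unnamed column.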

---

# End of Book 1

In the next book, we will go through preprocessing, where the data is cleaned and EDA is carried out.