# Project 3: Web APIs & NLP
---
**Book 1: Data Collection**<br>
Book 2: Data Cleaning & Exploratory Data Analysis<br>
Book 3: Preprocessing & Vectorization<br>
Book 4: ML Modeling<br>
Book 5: Sentiment Analysis, Conclusion & Recommendation<br>
Author: Lee Wan Xian

## Problem Statement

Our client is a firm that runs a discussion website for streaming services. As part of their initiative to build an in-house label tagging algorithm, they have tasked us to develop a machine learning (ML) classification model that tags each post with the correct streaming service. The client is also interested in users' sentiments towards famous shows, so that they can evaluate how to improve their search and homepage recommendations for users.

## Contents:
- [Background](#Background)
- [Data Collection](#Data-Collection)

## Background

There are over 200 streaming services globally ([source](https://flixed.io/complete-list-streaming-services/)). With users having so many options in streaming services, the forum website that the client runs has become a hotspot for open discussions and sharing. The client wishes to add label tags to posts so that users can search for forum posts related to a specific streaming service. Since this change is new to the firm, they have engaged us to build an ML classification model that can perform this task.

As a team of data professionals, we will use Reddit posts as the training data for our model, because the client lacks the resources for human annotation of their own forum posts. Reddit serves as a good substitute source of training data, given the similar nature of business, the similar user demographics, and the fact that each post's subreddit indicates which streaming service it relates to.

For this project, we will collect posts from the subreddits below using the [Pushshift](https://github.com/pushshift/api) API.
* `r/DisneyPlus`: https://www.reddit.com/r/DisneyPlus/
* `r/netflix`: https://www.reddit.com/r/netflix/

In addition, the client wishes to understand users' sentiments on famous shows. With a good understanding of users' sentiments, they can improve their search and homepage recommendations, which in turn improves users' experience with the website.

---

## Python Libraries

In [1]:
import pandas as pd
import requests
import time

## Data Collection

For this project, we will extract up to 15,000 Reddit posts from each subreddit (`r/DisneyPlus` and `r/netflix`). All extracted posts were created on September 30, 2022, 11:59 PM GMT or earlier.
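The scraping function below takes this cutoff as a Unix timestamp through its `before` parameter. As a quick check, the following minimal sketch (standard library only, with illustrative variable names) converts the stated date into epoch seconds, assuming GMT is treated as UTC:

```python
from datetime import datetime, timezone

# Cutoff stated above: September 30, 2022, 11:59 PM GMT (treated here as UTC)
cutoff = datetime(2022, 9, 30, 23, 59, tzinfo=timezone.utc)

# Epoch seconds, the unit used by the `created_utc` field
cutoff_epoch = int(cutoff.timestamp())
print(cutoff_epoch)  # 1664582340
```

This matches the default `before=1664582340` used by `reddit_to_df`.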

In [2]:
# Function to scrape posts from a subreddit into a dataframe

def reddit_to_df(reddit, runs, post_count=150, before=1664582340):
    
    url = 'https://api.pushshift.io/reddit/search/submission'
    params = {'subreddit': reddit, 'size': post_count, 'before': before}
    posts = []
    
    for i in range(runs):
        res = requests.get(url, params=params)
        
        if res.status_code != 200:
            print(f'ERROR: Unable to scrape from the subreddit due to HTTP status code {res.status_code}')
        else:
            batch = res.json()['data']
            if not batch:
                # no older posts left, so stop instead of requesting the same window again
                print(f'Batch {i+1} returned no posts, stopping early')
                break
            posts += batch
            # batches arrive newest first, so the last post is the oldest; use it as the next cutoff
            params['before'] = posts[-1]['created_utc']
            print(f'Batch {i+1} scraped into Dataframe, earliest created_utc in this batch: {posts[-1]["created_utc"]}')
        # pause between requests (also after an error) to avoid overloading the API
        time.sleep(3)
            
    df = pd.DataFrame(posts)
    
    # with reference to pushshift API docs, these 5 columns should provide enough value & insights to the classification model & sentiment analysis
    return df[['subreddit','title','selftext','is_video','created_utc']] 
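
The function above pages backwards through time: each request asks for posts older than `before`, and the oldest `created_utc` in the returned batch becomes the next cutoff. The sketch below illustrates that idea without any network calls. `paginate`, `fake_fetch` and `fake_posts` are hypothetical names used only for illustration, and the fake fetcher assumes that `before` is exclusive and that results arrive newest first.

```python
def paginate(fetch_page, runs, before):
    """Collect posts newest-first, moving the `before` cursor back on each run."""
    posts = []
    for _ in range(runs):
        batch = fetch_page(before)
        if not batch:
            # nothing older is left, so stop early
            break
        posts.extend(batch)
        # the last post of the batch is the oldest one seen so far
        before = batch[-1]['created_utc']
    return posts

# Hypothetical stand-in for the API: 10 fake posts with created_utc 10 down to 1
fake_posts = [{'created_utc': t} for t in range(10, 0, -1)]

def fake_fetch(before, size=4):
    # return up to `size` posts strictly older than `before`, newest first
    older = [p for p in fake_posts if p['created_utc'] < before]
    return older[:size]

collected = paginate(fake_fetch, runs=5, before=11)
print(len(collected))  # 10
```

Because the cursor only ever moves to an older timestamp, no post is fetched twice in this sketch, and the loop ends once the source is exhausted even if `runs` has not been used up.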
    

In [3]:
# Scrape 15_000 reddit posts from DisneyPlus subreddit into dataframe

dfdisney_raw = reddit_to_df(reddit='DisneyPlus', runs=100)

Batch 1 scraped into Dataframe, earliest created_utc in this batch: 1664002270
Batch 2 scraped into Dataframe, earliest created_utc in this batch: 1663363347
Batch 3 scraped into Dataframe, earliest created_utc in this batch: 1662839944
Batch 4 scraped into Dataframe, earliest created_utc in this batch: 1662388926
Batch 5 scraped into Dataframe, earliest created_utc in this batch: 1661450795
Batch 6 scraped into Dataframe, earliest created_utc in this batch: 1660798533
Batch 7 scraped into Dataframe, earliest created_utc in this batch: 1660188020
Batch 8 scraped into Dataframe, earliest created_utc in this batch: 1659689701
Batch 9 scraped into Dataframe, earliest created_utc in this batch: 1659060446
Batch 10 scraped into Dataframe, earliest created_utc in this batch: 1658538331
Batch 11 scraped into Dataframe, earliest created_utc in this batch: 1658081126
Batch 12 scraped into Dataframe, earliest created_utc in this batch: 1657469367
Batch 13 scraped into Dataframe, earliest created

In [4]:
# Scrape 15_000 reddit posts from netflix subreddit into dataframe

dfnetflix_raw = reddit_to_df(reddit='netflix', runs=100)

Batch 1 scraped into Dataframe, earliest created_utc in this batch: 1664369608
Batch 2 scraped into Dataframe, earliest created_utc in this batch: 1664179472
Batch 3 scraped into Dataframe, earliest created_utc in this batch: 1664020243
Batch 4 scraped into Dataframe, earliest created_utc in this batch: 1663788929
Batch 5 scraped into Dataframe, earliest created_utc in this batch: 1663436282
Batch 6 scraped into Dataframe, earliest created_utc in this batch: 1663171219
Batch 7 scraped into Dataframe, earliest created_utc in this batch: 1662882024
Batch 8 scraped into Dataframe, earliest created_utc in this batch: 1662741635
Batch 9 scraped into Dataframe, earliest created_utc in this batch: 1662535441
Batch 10 scraped into Dataframe, earliest created_utc in this batch: 1662207440
Batch 11 scraped into Dataframe, earliest created_utc in this batch: 1661886731
Batch 12 scraped into Dataframe, earliest created_utc in this batch: 1661617444
Batch 13 scraped into Dataframe, earliest created

In [5]:
# Show the shape for both DisneyPlus & Netflix subreddit
print(f'The no. of rows,columns in DisneyPlus subreddit corpus is {dfdisney_raw.shape}.')
print(f'The no. of rows,columns in Netflix subreddit corpus is {dfnetflix_raw.shape}.')

The no. of rows,columns in DisneyPlus subreddit corpus is (14980, 5).
The no. of rows,columns in Netflix subreddit corpus is (14990, 5).


A total of 29,970 posts were extracted from `r/DisneyPlus` and `r/netflix` combined, with a time frame spanning from Unix epoch 1614041482 to 1664582340. Only the `subreddit`, `title`, `selftext`, `is_video` and `created_utc` fields of the posts were extracted for modelling and analysis.
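
To make the epoch range easier to read, the two timestamps quoted above can be converted into calendar dates. This is a minimal sketch using only the standard library; the variable names are illustrative.

```python
from datetime import datetime, timezone

# Epoch bounds quoted above, interpreted as UTC
earliest = datetime.fromtimestamp(1614041482, tz=timezone.utc)
latest = datetime.fromtimestamp(1664582340, tz=timezone.utc)

print(earliest.isoformat())      # 2021-02-23T00:51:22+00:00
print(latest.isoformat())        # 2022-09-30T23:59:00+00:00
print((latest - earliest).days)  # 584
```

In other words, the combined corpus covers roughly 19 months of posts.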

In [6]:
# Combine both dataframes into 1 for EDA & Modelling
# ignore_index=True gives the combined dataframe a fresh 0..n-1 index

df_raw = pd.concat([dfdisney_raw, dfnetflix_raw], ignore_index=True)
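
By default, `pd.concat` keeps each dataframe's original row labels, so a combined dataframe can carry duplicate index values unless `ignore_index=True` is passed. This does not change the exported CSV (it is written with `index=False`), but it can matter for label-based lookups on the combined dataframe. A small sketch with toy data (the frames `a` and `b` are hypothetical stand-ins for the two scraped dataframes):

```python
import pandas as pd

# Toy stand-ins for the two scraped dataframes (hypothetical data)
a = pd.DataFrame({'subreddit': ['DisneyPlus', 'DisneyPlus'], 'title': ['a1', 'a2']})
b = pd.DataFrame({'subreddit': ['netflix', 'netflix'], 'title': ['b1', 'b2']})

kept = pd.concat([a, b])                      # row labels: 0, 1, 0, 1
reset = pd.concat([a, b], ignore_index=True)  # row labels: 0, 1, 2, 3

print(kept.index.is_unique)   # False
print(reset.index.is_unique)  # True
```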

### Export the dataframe

In [7]:
import os
# create new folder named 'data' if it does not exist
os.makedirs('../data', exist_ok=True)
    
# Export the combined dataframe into a csv file
df_raw.to_csv('../data/df_raw.csv', index=False)
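
One thing to keep in mind when the CSV is loaded again: with pandas' default settings, empty strings written by `to_csv` are read back by `read_csv` as missing values (`NaN`). Posts with an empty `selftext` may therefore show up as nulls after reloading. The sketch below demonstrates this with a hypothetical two-row sample and an in-memory buffer instead of a file on disk.

```python
import io
import pandas as pd

# Hypothetical two-row sample: the second post has an empty body
sample = pd.DataFrame({'title': ['t1', 't2'], 'selftext': ['some text', '']})

# Write to an in-memory buffer and read it back with default settings
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
roundtrip = pd.read_csv(buffer)

# Count missing values in selftext after the round trip
print(roundtrip['selftext'].isna().sum())
```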

**Please proceed to Book 2 for Data Cleaning & EDA.**