<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Web APIs & NLP

## 1. Problem Statement

Our company, Twitch, offers a video streaming service focused on video games, music, and sports. We recently created a beta forum where gamers can post comments about the games that interest them, and we are testing it with selected users. Because the posts are not currently organized by game and their volume is growing, the marketing and business development teams find it difficult to understand the users and to design sales and marketing campaigns that meet their needs. The teams also indicated that users may lose interest in this brand-new forum if they cannot find and navigate to the games relevant to them, and that a poor user experience could prevent us from launching the feature.

Thus, the marketing and business development teams have asked the data science team to categorize the content by game. Dota 2 and League of Legends are currently the two most popular games among our users. Our goal is to use Reddit posts from the Dota 2 and League of Legends subreddits to build a text classifier that achieves at least 85% accuracy, and to identify the top 5 features for each game. Categorizing the current content for both games will make the forum easier to navigate and improve the user experience. The top 5 features will give the marketing team some insight for their sales and marketing strategy.

## 2. Methodology

1. Scrape data from the two subreddits below using the Pushshift API. This is covered in Notebook 1 (the current notebook).
    - https://www.reddit.com/r/DotA2/
    - https://www.reddit.com/r/leagueoflegends/


2. Data cleaning, text preprocessing (removing URLs, punctuation, and stopwords; tokenizing; lemmatizing and stemming), and visualization. This is covered in Notebook 2.


3. Train and analyze 3 classification models (Naive Bayes, Logistic Regression, and K-Nearest Neighbors) with different vectorizers (CountVectorizer and TfidfVectorizer) and hyperparameters. As this is a classification problem, the 5 evaluation metrics below will be used. The most important features will also be identified. This is covered in Notebook 3.

    - Accuracy Score
    - AUC Score
    - Precision Score
    - Recall Score
    - F1 Score
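As a rough illustration of how the five metrics listed above can be computed, here is a minimal sketch using scikit-learn. The labels and probabilities are made-up toy values, and the encoding (1 for r/leagueoflegends, 0 for r/DotA2) is an assumption for the example only; the actual models and data are handled in Notebook 3.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Toy data (hypothetical): 1 = r/leagueoflegends, 0 = r/DotA2
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
# Predicted probability of class 1 from some classifier
y_prob = [0.9, 0.2, 0.6, 0.4, 0.1, 0.7, 0.8, 0.3]
# Hard predictions using a 0.5 threshold
y_pred = [int(p >= 0.5) for p in y_prob]

scores = {
    "accuracy": accuracy_score(y_true, y_pred),
    "auc": roc_auc_score(y_true, y_prob),  # AUC uses probabilities, not hard labels
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
}

for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Note that the AUC score is computed from the predicted probabilities, while the other four metrics depend on the chosen classification threshold.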


## 3. Importing Python Modules

In [1]:
import requests
import time
import pandas as pd
import numpy as np
import json
import seaborn as sns

In [2]:
# Set the max display for the columns and rows
pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", 100)

## 4. Scraping using Pushshift

In [3]:
# Test the Pushshift API extraction for Reddit submissions
url = 'https://api.pushshift.io/reddit/search/submission'
params = {
    'subreddit': 'leagueoflegends',
    'size': 100,
}
result = requests.get(url, params=params)
data = result.json()
posts = data['data']
data_df = pd.DataFrame(posts)
data_df.head(3)

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,author_premium,awarders,can_mod_post,contest_mode,created_utc,domain,full_link,gildings,id,is_created_from_ads_ui,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,link_flair_background_color,link_flair_richtext,link_flair_text_color,link_flair_type,locked,media_only,no_follow,num_comments,num_crossposts,over_18,parent_whitelist_status,permalink,pinned,pwls,removed_by_category,retrieved_on,score,selftext,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_subscribers,subreddit_type,thumbnail,title,total_awards_received,treatment_tags,upvote_ratio,url,url_overridden_by_dest,whitelist_status,wls,post_hint,preview,thumbnail_height,thumbnail_width,media,media_embed,secure_media,secure_media_embed,media_metadata,author_flair_background_color,author_flair_text_color,author_flair_template_id,banned_by,author_cakeday,crosspost_parent,crosspost_parent_list
0,[],False,Pokemon_Only,,[],,text,t2_26tp2z5g,False,False,False,[],False,False,1652852675,youtu.be,https://www.reddit.com/r/leagueoflegends/comme...,{},us5wfe,False,False,False,False,False,False,False,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/leagueoflegends/comments/us5wfe/t1_gumayusi...,False,6,moderator,1652852685,1,,True,True,False,leagueoflegends,t5_2rfxx,5838151,public,spoiler,T1 Gumayusi: 🥱Group Stage wasn't fun🥱 wants st...,0,[],1.0,https://youtu.be/EcfjBTsudwE,https://youtu.be/EcfjBTsudwE,all_ads,6,,,,,,,,,,,,,,,,
1,[],False,Ecstatic-Buy-2907,,[],,text,t2_ipdcr06y,False,False,False,[],False,False,1652852217,self.leagueoflegends,https://www.reddit.com/r/leagueoflegends/comme...,{},us5sm1,False,False,False,False,False,False,True,False,,[],dark,text,False,False,True,1,0,False,all_ads,/r/leagueoflegends/comments/us5sm1/unpopular_o...,False,6,moderator,1652852227,1,[removed],True,False,False,leagueoflegends,t5_2rfxx,5838140,public,self,Unpopular Opinion: The 12.10 nerfs will initia...,0,[],1.0,https://www.reddit.com/r/leagueoflegends/comme...,,all_ads,6,,,,,,,,,,,,,,,,
2,[],False,jcplay26,,[],,text,t2_6n7uayrm,False,False,False,[],False,False,1652852082,dotesports.com,https://www.reddit.com/r/leagueoflegends/comme...,{},us5rf5,False,True,False,False,False,True,False,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/leagueoflegends/comments/us5rf5/how_to_acqu...,False,6,,1652852092,1,,True,False,False,leagueoflegends,t5_2rfxx,5838140,public,https://b.thumbs.redditmedia.com/ZINESUvqwkEBC...,How to acquire all Teamwork and Strategy chall...,0,[],1.0,https://dotesports.com/league-of-legends/news/...,https://dotesports.com/league-of-legends/news/...,all_ads,6,link,"{'enabled': False, 'images': [{'id': '_CRtairm...",78.0,140.0,,,,,,,,,,,,
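
The `created_utc` values in the output above are Unix epoch timestamps in seconds. A small sketch of how one of them can be turned into a readable UTC datetime, using only the standard library (the helper name `utc_from_epoch` is made up for this example):

```python
from datetime import datetime, timezone


def utc_from_epoch(seconds):
    """Convert a Unix epoch timestamp (in seconds) to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# 'created_utc' of the first post in the sample output
first_post_time = utc_from_epoch(1652852675)
print(first_post_time.isoformat())
```

Because the value is a plain integer, it can be compared and passed around directly, for example as a cutoff when requesting older posts.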


In [4]:
# Create a function to automate scraping through the Pushshift API
def pushshift_subreddit(subreddit, n):
    url = 'https://api.pushshift.io/reddit/search/submission'

    # Collect one dataframe per request and combine them at the end
    frames = []

    # The first request has no 'before' parameter, so it starts from the current time
    params = {'subreddit': subreddit, 'size': 100}

    # Loop n times to retrieve the required number of posts (up to 100 posts per request)
    for i in range(n):
        result = requests.get(url, params=params)
        result.raise_for_status()  # Fail on an HTTP error instead of on .json()
        data = pd.DataFrame(result.json()['data'])
        if data.empty:  # No older posts left to retrieve
            break
        frames.append(data)

        # Subsequent requests use the 'created_utc' of the last post retrieved so far
        params['before'] = data['created_utc'].iloc[-1]

        # Pause before the next request
        time.sleep(5)

    # DataFrame.append was removed in pandas 2.0, so combine the pages with pd.concat
    df = pd.concat(frames, ignore_index=True)
    return df[['subreddit', 'title', 'selftext', 'created_utc']]
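
The pagination idea in the function above (request a page, then ask for posts older than the oldest one seen so far) can be sketched without any network access. In the example below, `paginate_before`, `fake_fetch`, and `FAKE_POSTS` are hypothetical names invented for illustration; `fake_fetch` is an in-memory stand-in for the API, not part of Pushshift.

```python
from typing import Callable, Dict, List, Optional

Post = Dict[str, int]


def paginate_before(fetch_page: Callable[[Optional[int]], List[Post]],
                    n_pages: int) -> List[Post]:
    """Collect up to n_pages pages of posts, walking backwards in time."""
    posts: List[Post] = []
    before: Optional[int] = None  # None means "start from the newest post"
    for _ in range(n_pages):
        page = fetch_page(before)
        if not page:  # The source has no older posts left
            break
        posts.extend(page)
        # The oldest timestamp on this page bounds the next request
        before = min(p["created_utc"] for p in page)
    return posts


# Hypothetical stand-in for the API: 25 posts, newest first
FAKE_POSTS = [{"id": i, "created_utc": 1_000_000 - i} for i in range(25)]


def fake_fetch(before: Optional[int], size: int = 10) -> List[Post]:
    """Return up to `size` posts strictly older than `before`, newest first."""
    older = [p for p in FAKE_POSTS
             if before is None or p["created_utc"] < before]
    return older[:size]


collected = paginate_before(fake_fetch, n_pages=5)
print(len(collected))
```

Although five pages are requested, the loop stops once a page comes back empty, so only the 25 available posts are collected. One caveat of this scheme is that a strict "older than" cutoff would skip posts sharing the exact timestamp of the last post on a page.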

In [5]:
# Extract dota posts
dota = pushshift_subreddit('DotA2', 40)

In [6]:
dota.shape

(4000, 4)

In [7]:
dota.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   subreddit    4000 non-null   object
 1   title        4000 non-null   object
 2   selftext     3997 non-null   object
 3   created_utc  4000 non-null   int64 
dtypes: int64(1), object(3)
memory usage: 125.1+ KB


In [8]:
# Extract league posts
league = pushshift_subreddit('leagueoflegends', 40)

In [9]:
league.shape

(3999, 4)

In [10]:
league.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3999 entries, 0 to 3998
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   subreddit    3999 non-null   object
 1   title        3999 non-null   object
 2   selftext     3983 non-null   object
 3   created_utc  3999 non-null   int64 
dtypes: int64(1), object(3)
memory usage: 125.1+ KB
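
The `info()` output shows a few missing `selftext` values, and the sample output earlier contains a post whose body is the placeholder `[removed]`. A minimal sketch of how such rows could be counted before cleaning is given below. The `sample` frame and the `selftext_summary` helper are hypothetical and only mimic the four exported columns; the same call could be applied to `dota` or `league`.

```python
import pandas as pd

# Hypothetical miniature frame with the same four columns as the scraped data
sample = pd.DataFrame({
    "subreddit": ["DotA2", "DotA2", "leagueoflegends", "leagueoflegends"],
    "title": ["a", "b", "c", "d"],
    "selftext": ["some body text", None, "[removed]", ""],
    "created_utc": [4, 3, 2, 1],
})


def selftext_summary(df):
    """Count missing, removed, and empty post bodies in the 'selftext' column."""
    text = df["selftext"]
    return {
        "missing": int(text.isna().sum()),
        "removed": int((text == "[removed]").sum()),
        "empty": int((text == "").sum()),
    }


summary = selftext_summary(sample)
print(summary)
```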


In [12]:
# Export both datasets to CSV files
dota.to_csv('../datasets/dota.csv', index=False)
league.to_csv('../datasets/league.csv', index=False)

### Continued in Notebook 2