<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3 - Web APIs & NLP

# Part 2 : Web-scraping and Data Processing

### Contents:
* [Organisation of Notebooks](#Organisation-of-Notebooks)
* [Import Libraries](#Import-Libraries)
* [Scraping Data using Reddit API](#Scraping-Data-using-Reddit-API)
* [Data Processing](#Data-Processing)
* [Summary](#Summary)

## Organisation of Notebooks
1. [Introduction](./01_Introduction.ipynb)
2. Web-scraping and Data Processing
3. [EDA and Modeling](./03_EDA_Modeling.ipynb)

## Import Libraries

In [1]:
import requests
import re
import pandas as pd
import numpy as np
import time


from wordcloud import WordCloud

from sklearn.model_selection import train_test_split

from sklearn.feature_extraction.text import CountVectorizer

from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.tokenize import RegexpTokenizer 

In [2]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

## Scraping Data using Reddit API

In [3]:
# specify the endpoint url and the 2 subreddits
url = 'https://api.pushshift.io/reddit/search/submission'
subreddit_board = 'boardgames'
subreddit_mobile = 'MobileGaming'

### A function is defined to scrape posts using the Pushshift Reddit API and to drop posts that are removed, deleted, empty or written by AutoModerator

In [4]:
# function to scrape data from the endpoint until at least `limit` usable posts are collected,
# keeping 5 columns: subreddit, selftext, title, created_utc and author
def scrap_data(url, subreddit, limit):
    columns = ['subreddit', 'selftext', 'title', 'created_utc', 'author']
    df = None
    time_retrieve_last = None

    while df is None or len(df) < limit:

        params = {'subreddit': subreddit, 'size': 100}
        # after the first request, set the `before` param to the time of the last post
        # retrieved, so that only posts older than that one are returned
        if time_retrieve_last is not None:
            params['before'] = time_retrieve_last

        res = requests.get(url, params=params)

        # in case there is an error when hitting the endpoint,
        # sleep 0.5 sec then hit the endpoint once more
        if res.status_code != 200:
            time.sleep(0.5)
            res = requests.get(url, params=params)
        # fail with a clear HTTP error if the retry did not succeed either
        res.raise_for_status()

        new_df = pd.DataFrame(res.json()['data'])

        # stop if the endpoint has no older posts left, to avoid an endless loop
        if new_df.empty:
            break

        # remember the time of the oldest post in this batch for the next request
        time_retrieve_last = new_df['created_utc'].iloc[-1]

        # first batch becomes df, later batches are concatenated vertically
        df = new_df if df is None else pd.concat([df, new_df], ignore_index=True)

        # filter off posts that are removed, deleted, empty or null,
        # and posts written by AutoModerator
        mask = (
            df['selftext'].notna()
            & ~df['selftext'].isin(['', '[removed]', '[deleted]'])
            & (df['author'] != 'AutoModerator')
        )
        df = df.loc[mask, columns]

    # return an empty dataframe if nothing could be retrieved
    return df if df is not None else pd.DataFrame(columns=columns)
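
The paging in `scrap_data` relies on the `before` parameter acting as a cursor: each request asks for posts older than the oldest one already seen. The sketch below imitates that loop without touching the network. `fake_fetch`, `collect_posts` and the in-memory `ALL_POSTS` list are hypothetical stand-ins (not part of Pushshift or of this notebook), and the newest-first ordering is an assumption.

```python
# Hypothetical in-memory "endpoint": 25 posts, assumed to be ordered newest first
ALL_POSTS = [{'id': i, 'created_utc': 1000 - i} for i in range(25)]


def fake_fetch(size, before=None):
    """Return up to `size` posts strictly older than `before` (newest first)."""
    older = [p for p in ALL_POSTS if before is None or p['created_utc'] < before]
    return older[:size]


def collect_posts(limit, size=10):
    """Page backwards in time until `limit` posts are collected or none remain."""
    collected = []
    before = None
    while len(collected) < limit:
        batch = fake_fetch(size, before)
        if not batch:  # nothing older is left, so stop instead of looping forever
            break
        collected.extend(batch)
        before = batch[-1]['created_utc']  # cursor = oldest post seen so far
    return collected


posts_sample = collect_posts(limit=22)
print(len(posts_sample))                # number of posts collected
print(posts_sample[-1]['created_utc'])  # timestamp of the oldest collected post
```

Because whole batches are appended, the loop can overshoot `limit`, which is why the notebook talks about "at least" 1000 posts. The empty-batch check is what keeps the loop from running forever on a small subreddit.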


### Get data from the boardgames subreddit using the function defined and check for any null values in the dataframe

In [5]:
%%time
# takes about 1.5 min
df_board = scrap_data(url, subreddit_board, 1000)

CPU times: user 1.1 s, sys: 81.2 ms, total: 1.18 s
Wall time: 1min 31s


In [6]:
df_board['author'].value_counts()

StarXedHero             8
RoadToInfamyGames       7
kryzak123               6
guispfilho              5
ThinEzzy                4
EndersGame_Reviewer     4
Mancupcake              4
rsbrown42               4
HSUbablue               4
Karrion42               4
Tiny792                 4
highendthinking         4
laxar2                  3
zWeApOnz                3
KyriSGS                 3
Zelbinian               3
GuysoftheBeholder       3
LoveHerMore             3
zebraman7               3
jaymoont                3
Brilliant_Ed_9912       3
Yoonzee                 3
CavalloSkate            3
ShelfClutter            3
michele_piccolini       3
AssumeBattlePoise       3
BxMxCx360               3
JarlGilles              2
Premo-Busey             2
justmeaskingaquestio    2
TBPMach                 2
pulipul777              2
AlwaysBeQuestioning     2
djkidkaz                2
Kingofthered            2
Benetton_Cumbersome     2
EsotericTribble         2
chicken_fried_food      2
ArcadianDelS

In [7]:
df_board.loc[df_board['author'] == 'AutoModerator', :]

Unnamed: 0,subreddit,selftext,title,created_utc,author


In [8]:
df_board.isna().sum()

subreddit      0
selftext       0
title          0
created_utc    0
author         0
dtype: int64
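
As a small sanity check of the filtering idea, the same kind of masks can be tried on a toy dataframe. The rows below are made up for illustration and cover each case the filter is meant to drop; this is a sketch of the technique, not the exact code path inside `scrap_data`.

```python
import pandas as pd

# Made-up rows: one usable post plus one row for each case that should be dropped
toy = pd.DataFrame({
    'selftext': ['Great game', '', '[removed]', '[deleted]', None, 'Weekly thread'],
    'author': ['alice', 'bob', 'carol', 'dave', 'erin', 'AutoModerator'],
})

keep = (
    toy['selftext'].notna()                                   # drop null selftext
    & ~toy['selftext'].isin(['', '[removed]', '[deleted]'])   # drop empty/removed/deleted
    & (toy['author'] != 'AutoModerator')                      # drop AutoModerator posts
)
filtered = toy.loc[keep]

print(filtered)
print(filtered.isna().sum())
```

Only the first row should survive, and the null counts of the filtered frame should all be zero, which mirrors the checks done on `df_board` above.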

### Get data from the MobileGaming subreddit using the function defined and check for any null values in the dataframe

In [None]:
%%time
# takes about 
df_mobile = scrap_data(url, subreddit_mobile, 1000)

In [None]:
df_mobile['author'].value_counts()

In [None]:
df_mobile.isna().sum()

## Data Processing

### Merge the selftext and title of the posts into a new column 'text' for the individual subreddit before merging the 2 dataframes together for data processing

In [None]:
# join selftext and title with a space so that the last word of selftext
# and the first word of title are not fused into one token
df_board['text'] = df_board['selftext'] + ' ' + df_board['title']
df_mobile['text'] = df_mobile['selftext'] + ' ' + df_mobile['title']
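
When two text columns are concatenated with `+`, nothing is inserted between them, so whether a separator is used decides if the last word of `selftext` and the first word of `title` stay separate tokens. A toy illustration with made-up text:

```python
import pandas as pd

# Made-up post, for illustration only
toy = pd.DataFrame({
    'selftext': ['Looking for a co-op game'],
    'title': ['Recommendations please'],
})

fused = toy['selftext'] + toy['title']          # no separator between the columns
spaced = toy['selftext'] + ' ' + toy['title']   # single space as separator

print(fused[0])
print(spaced[0])
```

In the first result the words "game" and "Recommendations" run together, which a `\w+` tokenizer would treat as a single token.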

In [None]:
# concatenate 2 df vertically
posts = pd.concat([df_board, df_mobile], ignore_index=True)

In [None]:
posts.head()

### 2 functions are defined here for lemmatizing and stemming

In [None]:
def process_data_lem(x):
    # remove urls
    x = re.sub(r'\w+:\/\/[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', '', x)
    # remove digits
    x = re.sub(r'\d+', '', x)
    # remove an underscore and the letters/digits that follow it
    x = re.sub(r'_[A-Za-z0-9]+', '', x)
    # remove letter runs that begin with a double 'a'
    x = re.sub(r'[Aa][Aa][A-Za-z]+', '', x)

    # note: WordNetLemmatizer needs the NLTK 'wordnet' corpus to be downloaded
    lemmatizer = WordNetLemmatizer()
    tokenizer = RegexpTokenizer(r'\w+', gaps=False)

    # lowercase, tokenize, then lemmatize each token
    token_lem = [lemmatizer.lemmatize(token) for token in tokenizer.tokenize(x.lower())]

    return ' '.join(token_lem)

def process_data_stem(x):
    # remove urls
    x = re.sub(r'\w+:\/\/[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', '', x)
    # remove digits
    x = re.sub(r'\d+', '', x)
    # remove an underscore and the letters/digits that follow it
    x = re.sub(r'_[A-Za-z0-9]+', '', x)
    # remove letter runs that begin with a double 'a'
    x = re.sub(r'[Aa][Aa][A-Za-z]+', '', x)

    p_stemmer = PorterStemmer()
    tokenizer = RegexpTokenizer(r'\w+', gaps=False)

    # lowercase, tokenize, then stem each token
    token_stem = [p_stemmer.stem(token) for token in tokenizer.tokenize(x.lower())]

    return ' '.join(token_stem)
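
To see what the four regex substitutions do before any lemmatizing or stemming happens, they can be run on a single made-up sentence. In this sketch `clean_text` is a hypothetical helper that only repeats the substitutions, and `re.findall(r'\w+', ...)` stands in for NLTK's `RegexpTokenizer` so that the example needs nothing beyond the standard library.

```python
import re


def clean_text(x):
    """Apply the same four substitutions used in the processing functions."""
    x = re.sub(r'\w+:\/\/[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', '', x)  # urls
    x = re.sub(r'\d+', '', x)                # digits
    x = re.sub(r'_[A-Za-z0-9]+', '', x)      # underscore plus following letters/digits
    x = re.sub(r'[Aa][Aa][A-Za-z]+', '', x)  # letter runs starting with a double "a"
    return x


# Made-up sentence containing a url, digits, an underscore name and "aaah"
sample = 'Top 10 games at https://example.com/list?id=5 by user_name, aaah fun!'
tokens = re.findall(r'\w+', clean_text(sample).lower())
print(tokens)
```

For this sentence the surviving tokens are `top`, `games`, `at`, `by`, `user` and `fun`. Note that the underscore rule keeps `user` but drops `_name`, so usernames are only partly removed.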

### A function is defined here to compare the lemmatizing and stemming method 

In [None]:
def compare_lem_stem(df):
    # collect the (stem, lem) token pairs that differ, position by position, for every post
    list_words_diff = [
        (stem, lem)
        for stem_text, lem_text in zip(df['processed_text_stem'], df['processed_text_lem'])
        for stem, lem in zip(stem_text.split(), lem_text.split())
        if stem != lem
    ]

    return list_words_diff
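
The comparison pairs tokens by position, so it can be tried on plain lists of strings. In the sketch below `diff_pairs` is a hypothetical stand-alone version of the same idea, and the strings are hand-written for illustration rather than actual notebook output.

```python
def diff_pairs(stem_texts, lem_texts):
    """Collect (stem, lem) token pairs that differ, position by position."""
    return [
        (stem, lem)
        for stem_text, lem_text in zip(stem_texts, lem_texts)
        for stem, lem in zip(stem_text.split(), lem_text.split())
        if stem != lem
    ]


# Hand-written strings for illustration, not actual notebook output
stem_texts = ['mani game becaus', 'board game']
lem_texts = ['many game because', 'board game']

pairs = diff_pairs(stem_texts, lem_texts)
print(pairs)
```

Pairing by position only makes sense when both processed strings contain the same number of tokens in the same order, which holds when both functions apply the same cleaning and tokenizing steps and change each token in place.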


In [None]:
%%time
posts['processed_text_lem'] = posts['text'].apply(process_data_lem)

In [None]:
%%time
posts['processed_text_stem'] = posts['text'].apply(process_data_stem)

In [None]:
# return (stem, lem)
word_list = compare_lem_stem(posts)
word_list

### Observations:

From the comparison of the words produced by stemming and lemmatizing, it can be observed that stemming reduces words to a root form more aggressively than lemmatizing. The downside is that some of the resulting stems, such as 'mani', 'becaus' and 'condi', are not real words and carry little meaning on their own. Hence, the lemmatizing method is chosen in this case, even though it does not always reduce a word to its root form.

In [None]:
# Create label column
# boardgames - 1
# MobileGaming - 0
posts['subreddit_cat'] = posts['subreddit'].map(lambda x: 1 if x == 'boardgames' else 0)

In [None]:
processed_posts = posts.loc[:, ['subreddit_cat', 'processed_text_lem']]

### Export the merged data for EDA and modeling in the next section

In [None]:
processed_posts.head()

In [None]:
processed_posts.to_csv('../data/posts.csv', index=False)
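
One thing worth checking after the export is how empty text survives a round trip through csv. If a post happens to be empty after cleaning (an assumption, this may not occur in the scraped data), `pd.read_csv` with its default settings reads the empty field back as `NaN` rather than as an empty string. A minimal sketch with made-up rows and an in-memory buffer instead of a file:

```python
import io

import pandas as pd

# Made-up rows: the second post is assumed to be empty after cleaning
toy = pd.DataFrame({
    'subreddit_cat': [1, 0],
    'processed_text_lem': ['fun board game', ''],
})

# Write to an in-memory buffer instead of '../data/posts.csv'
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(toy['processed_text_lem'].isna().sum())       # nulls before export
print(reloaded['processed_text_lem'].isna().sum())  # nulls after reload
```

If this matters for the next notebook, the nulls can be dropped or filled with empty strings right after loading.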

## Summary

In this section, the Reddit API (through the Pushshift endpoint) is used to scrape at least 1000 posts from each of the 2 subreddits ('boardgames' and 'MobileGaming'). A function is defined to scrape the data from the API and to clean it by removing posts that are removed, deleted or empty, as well as posts written by AutoModerator. Stemming and lemmatizing are then compared, and the lemmatizing method is chosen over stemming because many of the stems produced by the stemming method are not meaningful words. At the end of this section, the 'boardgames' and 'MobileGaming' data are merged and exported to a csv file to be used in the next section.