# Project 3: Web APIs & NLP

# Data Collection Notebook

In this notebook, we use the `Pushshift Reddit API` to scrape Reddit submissions from `r/LifeProTips` and `r/YouShouldKnow`.

The two chosen subreddits share similar content: both give redditors random bits of useful information. First, to keep the data balanced, we aim to obtain 1,500 posts from each subreddit. Second, to weed out 'weak' posts, we only take posts above a certain score threshold. Finally, we clean some of the data before exporting it for modelling <a href="./Modelling.ipynb">here</a>.

For the final evaluation and summary notebook, please click <a href="./Summary.ipynb">here</a>.

### Scraping the Data

In [2]:
import requests
import pandas as pd
import numpy as np
import time

In [2]:
url = 'https://api.pushshift.io/reddit/search/submission'

In [3]:
def scrape(sub, score):
    url = 'https://api.pushshift.io/reddit/search/submission'
    cols = ['subreddit', 'id', 'title', 'selftext', 'created_utc']
    df_concat_list = []
    params = {
        'subreddit': sub,
        'size': 100,
        'before': None, # no cutoff on the first request (requests omits None-valued params)
        'is_video': 'false', # skip video posts
        'stickied': 'false', # skip stickied posts (i.e. rules, megathreads, etc)
        'score': '>' + str(score) # only scrape posts above a certain score
    }
    print(sub + " Scraping Start")
    for i in range(15):
        res = requests.get(url, params=params)
        res.raise_for_status() # fail loudly on HTTP errors instead of parsing an error response
        posts = res.json()['data']
        if not posts: # no older posts left above the score threshold
            break
        df_new = pd.DataFrame(posts)[cols]
        df_concat_list.append(df_new) # collect this page of extracted data
        params['before'] = int(df_new['created_utc'].iloc[-1]) # next request continues from the last timestamp of this page
        time.sleep(20) # add delay to avoid overloading the API
        print(f'Iteration {i+1}')
    df = pd.concat(df_concat_list)
    df.to_csv(f'./dataset/{sub}.csv', index=False) # export to csv after scraping is complete
    print('done')
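
The loop in `scrape` walks backwards in time: each request asks for posts older than the last `created_utc` seen on the previous page. The following is a minimal offline sketch of that cursor logic, not the scraper itself. The names `paginate`, `fake_fetch` and `ALL_POSTS` are hypothetical stand-ins that never call Pushshift; `fake_fetch` simply serves posts from an in-memory list, newest first, which is the ordering the loop assumes.

```python
def paginate(fetch_page, max_pages):
    """Collect posts page by page, moving the 'before' cursor back in time."""
    posts, before = [], None
    for _ in range(max_pages):
        page = fetch_page(before)
        if not page:  # nothing older is left, so stop early
            break
        posts.extend(page)
        before = page[-1]['created_utc']  # oldest post on this page
    return posts


# Hypothetical data source: 7 posts with timestamps 70, 60, ..., 10 (newest first).
ALL_POSTS = [{'id': i, 'created_utc': ts} for i, ts in enumerate(range(70, 0, -10))]


def fake_fetch(before, size=3):
    """Return up to `size` posts strictly older than `before`, newest first."""
    older = [p for p in ALL_POSTS if before is None or p['created_utc'] < before]
    return older[:size]


collected = paginate(fake_fetch, max_pages=5)
print(len(collected))                      # number of posts gathered
print(len({p['id'] for p in collected}))   # number of distinct ids
```

In this sketch the cursor is a strict "older than" bound, so no post is collected twice as long as timestamps are unique, and the early exit avoids indexing into an empty page when the source runs dry.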

`LifeProTips` has a relatively large user base of 19.7m users at the time of scraping. Therefore, we set the score threshold higher, since there are more users to upvote the posts.

In [4]:
scrape("LifeProTips",4500)

LifeProTips Scraping Start
Iteration 1
Iteration 2
Iteration 3
Iteration 4
Iteration 5
Iteration 6
Iteration 7
Iteration 8
Iteration 9
Iteration 10
Iteration 11
Iteration 12
Iteration 13
Iteration 14
Iteration 15
done


`YouShouldKnow` has a significantly smaller user base than `LifeProTips`. Therefore, we scale the score threshold down so that we do not reach the bottom of the list and run out of posts to scrape.

In [7]:
scrape("YouShouldKnow",150)

YouShouldKnow Scraping Start
Iteration 1
Iteration 2
Iteration 3
Iteration 4
Iteration 5
Iteration 6
Iteration 7
Iteration 8
Iteration 9
Iteration 10
Iteration 11
Iteration 12
Iteration 13
Iteration 14
Iteration 15
done


### Data Cleaning

Both subreddits have their own post tags: "LPT" for `LifeProTips` and "YSK" for `YouShouldKnow`. Left in place, these tags would make it too easy for the model to tell the two subreddits apart later. Therefore, we remove the tags from each post title. We also concatenate the `title` and `selftext` columns to form `text`, which contains all of our post content.
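
As a rough illustration of the tag removal and concatenation described above, the sketch below applies the same two tag patterns used in this notebook's cleaning step to a few made-up titles, using the standard `re` module rather than pandas. The helper name `build_text` and the sample posts are assumptions for illustration only.

```python
import re

# Same patterns as the cleaning step, applied to plain strings.
LEADING_TAG = re.compile(r"^.?[LlPpTtYySsKk]{3}[^\s]*\s?[\-\:]?\s?")
OTHER_TAGS = re.compile(r"ysk|YSK|lpt|LPT")


def build_text(title, selftext=None):
    """Join title and body, then strip the subreddit tags."""
    body = "." if selftext is None else selftext  # placeholder for a missing body
    text = f"{title} {body}"
    text = LEADING_TAG.sub("", text)  # tag at the start of the post
    return OTHER_TAGS.sub("", text)   # any remaining tags


# Made-up sample posts (title, selftext).
samples = [
    ("LPT: Keep a spare key at work", None),
    ("YSK - you can freeze bread", "It keeps for months."),
    ("[LPT] Label your chargers", None),
]
for title, body in samples:
    print(build_text(title, body))
```

One caveat worth knowing: the leading pattern matches any first word that begins with three letters drawn from L, P, T, Y, S and K, so a title that starts with an ordinary word such as "Style" loses that word as well.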

_Edit: After modelling, we discovered that URLs were mistakenly picked up as key words by the modelling process, so we remove all URLs to keep this unwanted data out. The models also picked up "000" as a key feature, so we convert ",000" to 'k' to avoid splitting up potentially significant numbers._
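
To make the two fixes in the edit note concrete, here is a small sketch that applies the notebook's URL pattern and the ",000" replacement to a plain string with the standard `re` module. The function name `strip_urls_and_thousands` and the example sentence are hypothetical.

```python
import re

# URL pattern copied from the cleaning step, split over two lines for readability.
URL = re.compile(
    r"https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b"
    r"([-a-zA-Z0-9()@:%_\+.~#?&//=]*)"
)


def strip_urls_and_thousands(text):
    """Drop URLs, then rewrite each ',000' group as 'k'."""
    text = URL.sub("", text)
    return text.replace(",000", "k")


print(strip_urls_and_thousands("Save 10,000 steps: https://www.example.com/tips?id=1"))
```

The replacement is applied to every ",000" group, so a figure such as "1,000,000" becomes "1kk" rather than "1m"; whether that matters depends on how often such numbers appear in the posts.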

In [7]:
def cleaner(sub):
    df = pd.read_csv(f'./dataset/{sub}.csv')

    # fill missing 'selftext' with '.' so we can concatenate 'title' and 'selftext'
    df['selftext'] = df['selftext'].fillna('.')
    df['text'] = df['title'] + ' ' + df['selftext']

    # all patterns below are regular expressions, so pass regex=True explicitly
    # (pandas 2.0 and later treat the pattern as a literal string by default)

    # remove post tag at start of post
    df['text'] = df['text'].str.replace(r"^.?[LlPpTtYySsKk]{3}[^\s]*\s?[\-\:]?\s?", "", regex=True)

    # remove [removed] tags
    df['text'] = df['text'].str.replace(r'\[removed\]', '', regex=True)

    # remove any other LPT & YSK tags
    df['text'] = df['text'].str.replace(r'ysk|YSK|lpt|LPT', '', regex=True)

    # remove url
    df['text'] = df['text'].str.replace(r'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)', '', regex=True)

    # replace ",000" with k to avoid picking up "000" with cvec
    df['text'] = df['text'].str.replace(r'\,000', 'k', regex=True)

    # export to same file
    df.to_csv(f'./dataset/{sub}.csv', index=False)

In [8]:
cleaner("YouShouldKnow")

In [9]:
cleaner("LifeProTips")

In [10]:
def nullfind(df): # find missing values by column
    col_with_null = df.loc[:, df.isnull().any()].columns.tolist()
    return df[col_with_null].isnull().sum()/len(df) # fraction of rows missing in each column

def duplicate_check(df): # check for duplicate entries by df shape
    n_unique = df.id.nunique()
    print("length of ID column: "+ str(len(df.id)))
    print("shape of DataFrame: "+ str(df.shape))
    print(f'no. of unique posts: {n_unique}')

In [11]:
LPT = pd.read_csv('./dataset/LifeProTips.csv')
YSK = pd.read_csv('./dataset/YouShouldKnow.csv')

In [12]:
duplicate_check(LPT)
nullfind(LPT)

length of ID column: 1472
shape of DataFrame: (1472, 6)
no. of unique posts: 1472


Series([], dtype: float64)

In [13]:
duplicate_check(YSK)
nullfind(YSK)

length of ID column: 1439
shape of DataFrame: (1439, 6)
no. of unique posts: 1439


Series([], dtype: float64)

As we observed above, the rows do not add up to 3,000. Missing data or image posts are dropped by the API, and therefore we have a combined total of 2,911 data points (1,472 + 1,439).
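
The arithmetic behind the combined total can be checked directly from the row counts printed by `duplicate_check`. This is a small sketch with the counts typed in by hand; `TARGET_PER_SUBREDDIT`, `rows` and `shortfall` are illustrative names that are not used elsewhere in the notebook.

```python
TARGET_PER_SUBREDDIT = 1500  # 15 requests of up to 100 posts each

# Row counts reported by duplicate_check for each subreddit.
rows = {"LifeProTips": 1472, "YouShouldKnow": 1439}

total = sum(rows.values())
shortfall = {sub: TARGET_PER_SUBREDDIT - n for sub, n in rows.items()}

print(total)      # 2911
print(shortfall)  # {'LifeProTips': 28, 'YouShouldKnow': 61}
```

The two classes remain close to balanced, with `LifeProTips` holding slightly more than half of the rows.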

In [14]:
# concatenate both DataFrames into one to export for feature engineering and modelling
concat = pd.concat([LPT,YSK])
concat.to_csv('./dataset/data.csv', index=False)