# Web APIs & NLP: Reddit classification
### Notebook 01 - Data Scraping and Preliminary Cleaning


_Author: Joe Serigano (jserigano4@gmail.com)_

---

**Objectives:**
- Gather and prepare data from subreddits [r/LifeProTips](https://www.reddit.com/r/LifeProTips) and [r/ShittyLifeProTips](https://www.reddit.com/r/ShittyLifeProTips) using the requests library and [Pushshift's](https://github.com/pushshift/api) API.
- Remove any duplicate posts.
- Save post titles for preprocessing analysis. 

In [1]:
# Import libraries
import requests
import pandas as pd
import numpy as np
import time

# The data sets are large, so remove the limits on how many columns and rows pandas displays
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In this project we will be analyzing the titles of posts from each subreddit, so we will use only the '/reddit/search/submission' endpoint to scrape the data.

In [2]:
url = 'https://api.pushshift.io/reddit/search/submission'

The function below gathers posts from a given subreddit that were posted directly before the input UTC time. Pushshift limits each request to 100 posts, but we'll want much more data than that in order to build a good model. To get around this limit, after each request we take the earliest (minimum) value in the 'created_utc' column and use it as the 'before' parameter of the next request, so that each request picks up where the previous one stopped.

Once we gather these data, we will save only the subreddit name and the title of each post since this is what we are going to analyze. 

In [3]:
def reddit_scraper(subreddit, n_size, utc_time, n_iter):
    '''
    Function to scrape posts from a subreddit, working backwards from the specified UTC time.
    Maximum number of posts scraped = n_size * n_iter
    **************
    Input params:
    subreddit: Subreddit to pull from
    n_size: Number of results to return per request from the pushshift.io API (max=100)
    utc_time: Return results posted before this time (UTC epoch seconds)
    n_iter: Number of requests to make, in order to pull up to n_iter * n_size posts
    **************
    '''
    df_all = []
    
    for i in range(n_iter):
        # Define parameters for API call
        params = {
            'subreddit': subreddit,
            'size': n_size,
            'before': utc_time
        }
        
        # Make url request and check that the request was successful 
        res = requests.get(url, params)
        if res.status_code != 200:
            print('ERROR:', res.status_code)
            break
            
        # Convert the JSON response into a DataFrame and stop if no posts were returned
        df = pd.DataFrame(res.json()['data'])
        if df.empty:
            break
        df_all.append(df)
        
        # Re-define beginning UTC time as the new earliest (minimum) UTC time in the DataFrame
        utc_time = df['created_utc'].min()

        # Pause between requests so we do not overload the API
        time.sleep(2)
    
    # Concatenate all the individual DataFrames and save only the subreddit and title columns
    full_df = pd.concat(df_all)
    full_df = full_df[['subreddit','title']]
    
    return full_df
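The paging logic can be checked without touching the network. The sketch below is a minimal illustration, not the notebook's scraper: `paginate_before`, `fake_fetch`, and `FAKE_POSTS` are hypothetical names, and the fake fetcher only imitates an API that returns the newest posts strictly older than `before`.

```python
def paginate_before(fetch_page, start_utc, n_iter):
    """Collect posts by repeatedly requesting the page just before `before`."""
    posts = []
    before = start_utc
    for _ in range(n_iter):
        page = fetch_page(before)
        if not page:  # no older posts left, so stop early
            break
        posts.extend(page)
        # The earliest timestamp on this page becomes the next cutoff
        before = min(post['created_utc'] for post in page)
    return posts


# Hypothetical stand-in for the API: 250 fake posts with timestamps 1..250
FAKE_POSTS = [{'created_utc': t, 'title': f'post {t}'} for t in range(1, 251)]


def fake_fetch(before, size=100):
    """Return up to `size` fake posts older than `before`, newest first."""
    older = [p for p in FAKE_POSTS if p['created_utc'] < before]
    older.sort(key=lambda p: p['created_utc'], reverse=True)
    return older[:size]


posts = paginate_before(fake_fetch, start_utc=251, n_iter=5)
print(len(posts))  # 250: two full pages, one partial page, then an empty page
```

One caveat of this scheme: posts that share the exact boundary timestamp are either skipped (if the cutoff is exclusive, as in the fake fetcher) or returned twice (if it is inclusive), which is one possible source of missing or duplicate rows.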

Now that we've created a function to scrape the data, we can define our input parameters and create a DataFrame for each subreddit. We want around 10,000 posts from each subreddit after removing duplicates, so we'll set n_iter to 110, which requests up to 11,000 posts. The starting UTC time is the time at which this project was started: Monday, June 27, 2022, 7:15:49 PM at UTC-4 (23:15:49 UTC). The subreddit r/LifeProTips has more duplicates, so we pull twice as much data for that subreddit.

In [4]:
n_size = 100
utc_time = 1656371749 # Current UTC to use as starting UTC time
n_iter =  110

subreddit_1 = 'lifeprotips'
subreddit_2 = 'shittylifeprotips'

df_lpt = reddit_scraper(subreddit_1, n_size, utc_time, n_iter*2)
df_slpt = reddit_scraper(subreddit_2, n_size, utc_time, n_iter)
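As a quick sanity check on these parameters, the sketch below converts the epoch value to a readable UTC timestamp and computes the maximum number of rows each call can return. It uses only the standard library; the variable names `start`, `max_lpt`, and `max_slpt` are introduced here for illustration.

```python
from datetime import datetime, timezone

n_size = 100
utc_time = 1656371749
n_iter = 110

# Interpret the epoch seconds as a timezone-aware UTC datetime
start = datetime.fromtimestamp(utc_time, tz=timezone.utc)
print(start.isoformat())  # 2022-06-27T23:15:49+00:00

# Upper bounds on the number of posts returned by each scraper call
max_lpt = n_size * n_iter * 2
max_slpt = n_size * n_iter
print(max_lpt, max_slpt)  # 22000 11000
```

The 23:15:49 UTC result corresponds to 7:15:49 PM in a UTC-4 time zone. The actual row counts can come in slightly under these bounds if any request returns fewer than 100 posts.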

Now that we've pulled the appropriate data we need to inspect it for any missing values or any other issues. 

In [5]:
print('SHAPE/NULL VALUES:')
print('*'*10, 'LPT', '*'*10)
print(f'Shape: {df_lpt.shape}')
print(df_lpt.isnull().sum())
print('*'*10, 'SLPT', '*'*10)
print(f'Shape: {df_slpt.shape}')
print(df_slpt.isnull().sum())

SHAPE/NULL VALUES:
********** LPT **********
Shape: (21952, 2)
subreddit    0
title        0
dtype: int64
********** SLPT **********
Shape: (10994, 2)
subreddit    0
title        0
dtype: int64


The shapes of these DataFrames are close to the 22,000 and 11,000 rows we requested, and there are no missing values. Let's display the first few rows of each DataFrame.

In [6]:
df_lpt.head(20)

Unnamed: 0,subreddit,title
0,LifeProTips,LPT: If you are as outraged as I am about the ...
1,LifeProTips,Doing things when you don't want to do them is...
2,LifeProTips,LPT: Dedicate a credit card to only subscripti...
3,LifeProTips,LPT: Like to have a flutter but gambling is to...
4,LifeProTips,LPT: Enraged by RvW and feeling powerless? Con...
5,LifeProTips,Enraged by RvW and feeling powerless? Consider...
6,LifeProTips,LPT: Having frequent nightmares? Get a sleep s...
7,LifeProTips,"LPT: If you receive a coupon at a store, set a..."
8,LifeProTips,LPT: When writing a recipe either online or of...
9,LifeProTips,"LPT: If you live in a legal abortion state, or..."


In [7]:
df_slpt.head(20)

Unnamed: 0,subreddit,title
0,ShittyLifeProTips,A tale as old as time
1,ShittyLifeProTips,"At a large karaoke party, secretly request the..."
2,ShittyLifeProTips,SLPT: Can't get an abortion?? Just take off yo...
3,ShittyLifeProTips,SLPT: hate your boss plant CP on his computer
4,ShittyLifeProTips,hate your boss plant CP on his computer
5,ShittyLifeProTips,I want the stick my tongue up in your ass hard...
6,ShittyLifeProTips,SLPT Get cheap gas
7,ShittyLifeProTips,Cheap gas trick
8,ShittyLifeProTips,Genius!
9,ShittyLifeProTips,Gorilla tape to cover holes when using chemica...


There are two things to note:
- Posts in these subreddits are supposed to begin with 'LPT' and 'SLPT'. We are building a model that will attempt to predict which subreddit a given post came from, so we need to remove these prefixes; otherwise they would give away the subreddit to the model. I also noticed at least one typo of 'SPLT' in one of the first rows, so we will remove that as well.
- There are some duplicate posts in both subreddits. It also looks like some users posted an initial post and then, instead of editing it, created a new, similar post. Unfortunately, 'drop_duplicates' only removes rows that are exactly identical, so it will miss these near-duplicates. A quick scan of the data set suggests that they are not very common, so we will leave them in and remove only identical duplicates.
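If near-duplicates ever became a problem, one cheap option is to compare titles on a normalized key instead of the raw text. This is a sketch of that idea, not something this notebook does: `normalized_key` is a hypothetical helper and the example titles are made up. It only catches differences in case, punctuation, and spacing, not reworded posts.

```python
import pandas as pd


def normalized_key(titles):
    """Lowercase, replace non-alphanumeric characters, and collapse whitespace."""
    return (titles.str.lower()
                  .str.replace(r'[^a-z0-9 ]', ' ', regex=True)
                  .str.split()
                  .str.join(' '))


# Made-up titles: the first three differ only in case, punctuation, and spacing
df = pd.DataFrame({
    'subreddit': ['ShittyLifeProTips'] * 4,
    'title': ['Get cheap gas', 'get cheap gas!', 'Get  cheap gas ', 'Cheap gas trick'],
})

key = normalized_key(df['title'])
deduped = df[~key.duplicated()]  # keep the first row for each normalized key
print(deduped['title'].tolist())  # ['Get cheap gas', 'Cheap gas trick']
```

The original titles are kept untouched; the key is used only to decide which rows to drop.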

In [8]:
def replace_substring(column, substring_list, replacement):
    '''
    Function to replace every instance of each substring in a column of both df_slpt and df_lpt.
    **************
    Input params:
    column: Name of the DataFrame column to modify
    substring_list: List of substrings to replace
    replacement: String of text to replace each substring with
    **************
    '''
    for word in substring_list:
        # regex=False treats each substring as literal text rather than a regex pattern
        df_slpt[column] = df_slpt[column].str.replace(word, replacement, regex=False)
        df_lpt[column] = df_lpt[column].str.replace(word, replacement, regex=False)
    return df_slpt, df_lpt

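We will use the above function to strip these prefixes from the post titles. Note that it removes every occurrence of each substring, not only those at the start of a title. The order of the list also matters: 'SLPT' must come before 'LPT', otherwise removing 'LPT' first would leave a stray 'S' behind.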

In [9]:
column = 'title'
substring_list = ['SLPT', 'SPLT', 'LPT', ': ']
replacement = ''

replace_substring(column, substring_list, replacement);
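Because the literal replacement above also removes 'LPT' and ': ' from the middle of a title, an anchored regular expression is one alternative that touches only the prefix. This is a sketch under that assumption, not the method used in this notebook; `PREFIX_PATTERN` and the sample titles are made up for illustration, and the match is case-sensitive.

```python
import pandas as pd

# Match an optional leading tag (including the 'SPLT' typo), then an optional colon
PREFIX_PATTERN = r'^\s*(?:SLPT|SPLT|LPT)\b\s*:?\s*'

# Made-up titles covering the tagged, untagged-colon, and typo cases
titles = pd.Series([
    'LPT: Dedicate a credit card to subscriptions',
    'SLPT Get cheap gas',
    'SPLT: made-up typo example',
    'Note: the LPT tag stays here',
])

cleaned = titles.str.replace(PREFIX_PATTERN, '', regex=True)
print(cleaned.tolist())
```

The last title is left unchanged because the tag is not at the start, whereas the literal approach would strip both the 'LPT' and the ': ' inside it.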

Next, we will drop all identical duplicates from each DataFrame.

In [10]:
df_slpt.drop_duplicates(inplace=True)
df_lpt.drop_duplicates(inplace=True)

In [11]:
print(f'New LPT shape after duplicate removal: {df_lpt.shape}')
print(f'New SLPT shape after duplicate removal: {df_slpt.shape}')

New LPT shape after duplicate removal: (20866, 2)
New SLPT shape after duplicate removal: (10239, 2)


I guess there weren't as many duplicates in the r/LPT data as I thought! We will save only the first half of it (10,433 rows) for analysis, so that the two data sets are similar in size.

Finally, we will save these DataFrames to CSV files for further cleaning, preprocessing, and NLP modelling.

In [20]:
df_lpt_reduced = df_lpt.head(10433)

In [21]:
df_lpt_reduced.shape

(10433, 2)
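Since the scraper works backwards in time, `head` keeps only the most recent posts. If a sample spread over the whole scraped period were preferred, a random sample would be one alternative. The sketch below shows both on made-up data; `big`, `small`, and `n_target` are hypothetical names, and matching the smaller data set's size exactly is an assumption rather than what this notebook does.

```python
import pandas as pd

# Made-up stand-ins for the larger and smaller data sets
big = pd.DataFrame({'subreddit': 'LifeProTips',
                    'title': [f'lpt {i}' for i in range(10)]})
small = pd.DataFrame({'subreddit': 'ShittyLifeProTips',
                      'title': [f'slpt {i}' for i in range(5)]})

n_target = len(small)

# Option 1: first rows only (the most recent posts in scrape order)
first_rows = big.head(n_target)

# Option 2: reproducible random sample drawn from all rows
random_rows = big.sample(n=n_target, random_state=42)

print(first_rows.shape, random_rows.shape)  # (5, 2) (5, 2)
```

Fixing `random_state` keeps the sample reproducible across notebook runs.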

In [22]:
df_lpt_reduced.head()

Unnamed: 0,subreddit,title
0,LifeProTips,If you are as outraged as I am about the overt...
1,LifeProTips,Doing things when you don't want to do them is...
2,LifeProTips,Dedicate a credit card to only subscription se...
3,LifeProTips,Like to have a flutter but gambling is too exp...
4,LifeProTips,Enraged by RvW and feeling powerless? Consider...


In [23]:
df_lpt_reduced.to_csv('data/lifeprotips.csv', index=False)
df_slpt.to_csv('data/shittylifeprotips.csv', index=False)