<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Web APIs & NLP

Part 1 of 3

**Problem Statement**

During routine maintenance on an otherwise typical workday, an engineer accidentally deleted multiple posts from the subreddits r/nottheonion and r/theonion. The former collects true news stories so ridiculous that they could pass for content from the latter, which shares satirical articles on international, national and local news. Despite the engineer's quick follow-up action, he could recover only the titles of the lost posts.

Our team has been engaged to build a classification model, trained on posts submitted before 01 Jan 2022, that sorts the recovered posts back into their respective subreddits, r/nottheonion and r/theonion, based solely on the post titles.

We will use NLP to build several models, such as Multinomial Naive Bayes, Random Forest and Logistic Regression, and will measure initial success by their accuracy scores under cross-validation. Beyond that, we will study the confusion matrices together with the F1 scores to pick out the best model(s).
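
To make the evaluation criteria concrete, here is a minimal sketch of how accuracy, the confusion-matrix counts and the F1 score relate to one another. The labels and predictions are made up for illustration (they are not results from this project), and treating r/TheOnion as the positive class is an assumption.

```python
# Hypothetical labels: 1 = r/TheOnion (satire), 0 = r/nottheonion (real news).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


tp, fp, fn, tn = confusion_counts(y_true, y_pred)

accuracy = (tp + tn) / len(y_true)   # share of all posts sorted correctly
precision = tp / (tp + fp)           # of posts predicted as satire, how many were
recall = tp / (tp + fn)              # of actual satire posts, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"tp={tp} fp={fp} fn={fn} tn={tn}")
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Accuracy treats every mistake alike, whereas the confusion matrix and F1 score show which subreddit the errors fall on, which is why both are examined after the initial accuracy comparison.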

As it is now, our primary stakeholders, the volunteer moderators of each subreddit, have to spend a substantial amount of time reviewing user reports and deleting spam posts. This is especially so with the increase in bots spamming subreddits with irrelevant posts. This classifier can serve as a proof of concept for an automated moderator that deletes posts which do not belong to the subreddit they were posted to, which in turn frees up time for our human moderators. As a consequence, our secondary stakeholders, the subreddit community as a whole, can enjoy their daily reads without being bothered by irrelevant posts.


**Data Collection**  

In [1]:
# API scrape 
!pip install psaw
from psaw import PushshiftAPI



In [2]:
import pandas as pd

In [3]:
# Define a function that scrapes submissions from a given subreddit.
def scrape_data(subreddit):

    # Instantiate the Pushshift API client
    api = PushshiftAPI()

    # Fetch up to 3000 submissions and materialise the generator as a list
    scrape_list = list(api.search_submissions(subreddit=subreddit,
                                before=1640995200,  # 01 Jan 2022 00:00:00 UTC, fixed so reruns scrape the same batch of posts
                                filter=['title', 'subreddit', 'num_comments', 'author', 'subreddit_subscribers', 'score', 'domain', 'created_utc'],
                                limit=3000))

    # Keep only the fields needed downstream, one dict per post.
    # psaw returns namedtuples, so fields are read by name rather than by position.
    clean_scrape_lst = []
    for post in scrape_list:
        scrape_dict = {}
        scrape_dict['subreddit'] = post.subreddit
        scrape_dict['author'] = post.author
        scrape_dict['domain'] = post.domain
        scrape_dict['title'] = post.title
        scrape_dict['num_comments'] = post.num_comments
        scrape_dict['score'] = post.score
        scrape_dict['timestamp'] = post.created_utc
        clean_scrape_lst.append(scrape_dict)

    # Show number of subscribers, as recorded on one of the scraped posts
    print(subreddit, 'subscribers:', scrape_list[1].subreddit_subscribers)

    # Return list of scraped data
    return clean_scrape_lst

I intend to collect 2000 unique posts from each subreddit, so I scrape 3000 knowing that there might be repeated posts.
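
The `before` value passed to the scraper is a Unix timestamp. As a quick sanity check, the sketch below derives it from the 01 Jan 2022 cutoff, assuming the cutoff is meant as midnight UTC, and converts one of the scraped `timestamp` values back into a readable date.

```python
from datetime import datetime, timezone

# Cutoff date from the problem statement, interpreted as midnight UTC (assumption).
cutoff = datetime(2022, 1, 1, tzinfo=timezone.utc)
cutoff_epoch = int(cutoff.timestamp())
print(cutoff_epoch)  # 1640995200

# Convert a scraped epoch value back into an ISO 8601 string for inspection.
sample = datetime.fromtimestamp(1640973300, tz=timezone.utc)
print(sample.isoformat())  # 2021-12-31T17:55:00+00:00
```

Deriving the number from a date rather than hard-coding it makes the cutoff easier to audit and to change later.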

In [4]:
# Call function and create DataFrame
df_onion = pd.DataFrame(scrape_data('theonion'))

# Save data to csv
df_onion.to_csv('../datasets/the_onion.csv')

# Shape of DataFrame
print(f'df_onion shape: {df_onion.shape}')

# Show head
df_onion.head()



theonion subscribers: 165298
df_onion shape: (2996, 7)


| | subreddit | author | domain | title | num_comments | score | timestamp |
|---|---|---|---|---|---|---|---|
| 0 | TheOnion | mothershipq | theonion.com | Surgeon Kind Of Pissed Patient Seeing Her Defo... | 0 | 1 | 1640973300 |
| 1 | TheOnion | -ImYourHuckleberry- | theartnewspaper.com | McDonald’s blocked from building drive-through... | 1 | 1 | 1640971771 |
| 2 | TheOnion | dwaxe | theonion.com | Gwyneth Paltrow Touts New Diamond-Encrusted Tr... | 0 | 1 | 1640955671 |
| 3 | TheOnion | dwaxe | theonion.com | Artist Crafting Music Box Hopes It Delights At... | 0 | 1 | 1640955669 |
| 4 | TheOnion | dwaxe | theonion.com | Homeowner Trying To Smoke Out Snakes Accidenta... | 0 | 1 | 1640955668 |


In [5]:
# Call function and create DataFrame
df_not_onion = pd.DataFrame(scrape_data('nottheonion'))

# Save data to csv
df_not_onion.to_csv('../datasets/not_onion.csv')

# Shape of DataFrame
print(f'df_not_onion shape: {df_not_onion.shape}')

# Show head
df_not_onion.head()



nottheonion subscribers: 20438921
df_not_onion shape: (2997, 7)


| | subreddit | author | domain | title | num_comments | score | timestamp |
|---|---|---|---|---|---|---|---|
| 0 | nottheonion | Taco_duck68 | wral.com | Man attempts to pay for car with rap, steals p... | 0 | 1 | 1640995192 |
| 1 | nottheonion | BlackNingaa | bloodyelbow.com | Former UFC fighter reveals past as sex worker ... | 1 | 1 | 1640994707 |
| 2 | nottheonion | Lopsided_File_1642 | facebook.com | Log into Facebook | 1 | 1 | 1640991506 |
| 3 | nottheonion | SkinnyWhiteGirl19 | theartnewspaper.com | McDonald’s blocked from building drive-through... | 0 | 1 | 1640990429 |
| 4 | nottheonion | kids-cake-and-crazy | kjrh.com | Legendary actress Betty White dies at 99 on Ne... | 0 | 1 | 1640989181 |


In [6]:
# Checking the unique titles in r/TheOnion df.
df_onion['title'].nunique()

2783

In [7]:
# Checking the unique titles in r/nottheonion df.
df_not_onion['title'].nunique()

2338

As expected, there are repeated titles in both subreddits. Duplicate rows add no new information for the model, so I keep only unique titles downstream.

In [8]:
# Dropping the duplicate titles.
df_onion.drop_duplicates(subset=['title'], inplace=True)
df_not_onion.drop_duplicates(subset=['title'], inplace=True)

In [9]:
print(df_onion.shape)
print(df_not_onion.shape)

(2783, 7)
(2338, 7)
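
As an illustration of what the de-duplication step does, the sketch below runs `drop_duplicates` on a small made-up DataFrame (the titles and scores are placeholders, not scraped data). By default pandas keeps the first occurrence of each title and leaves the original index labels in place.

```python
import pandas as pd

# Toy data with repeated titles; values are placeholders for illustration only.
toy = pd.DataFrame({
    "title": ["A", "B", "A", "C", "B"],
    "score": [1, 2, 3, 4, 5],
})

n_unique = toy["title"].nunique()

# keep="first" is the default: later repeats of a title are dropped.
deduped = toy.drop_duplicates(subset=["title"])

print(n_unique)                # 3
print(len(deduped))            # 3
print(deduped.index.tolist())  # [0, 1, 3]
```

The row count after de-duplication matches `nunique()`, which mirrors the shapes printed above. Because the index keeps its gaps, calling `reset_index(drop=True)` afterwards is one option if contiguous row labels are wanted.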


In [10]:
# Slicing so I take 2000 unique titles from each df, keeping to the same number from each df.
df_onion = df_onion.iloc[0:2000]
df_not_onion = df_not_onion.iloc[0:2000]

In [11]:
# Combining all 4000 rows of data from both subreddit threads together using concatenation. 
combined_df = pd.concat([df_onion, df_not_onion],axis=0)
combined_df.head()

| | subreddit | author | domain | title | num_comments | score | timestamp |
|---|---|---|---|---|---|---|---|
| 0 | TheOnion | mothershipq | theonion.com | Surgeon Kind Of Pissed Patient Seeing Her Defo... | 0 | 1 | 1640973300 |
| 1 | TheOnion | -ImYourHuckleberry- | theartnewspaper.com | McDonald’s blocked from building drive-through... | 1 | 1 | 1640971771 |
| 2 | TheOnion | dwaxe | theonion.com | Gwyneth Paltrow Touts New Diamond-Encrusted Tr... | 0 | 1 | 1640955671 |
| 3 | TheOnion | dwaxe | theonion.com | Artist Crafting Music Box Hopes It Delights At... | 0 | 1 | 1640955669 |
| 4 | TheOnion | dwaxe | theonion.com | Homeowner Trying To Smoke Out Snakes Accidenta... | 0 | 1 | 1640955668 |


In [12]:
# Checking the shape.
combined_df.shape

(4000, 7)
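
De-duplication was done within each subreddit separately, so the same title can still appear under both labels in the combined data; the two `head()` outputs above each contain what appears to be the same "McDonald’s blocked from building drive-through..." headline. The sketch below shows one way such cross-label titles could be found, using made-up titles rather than the scraped data.

```python
import pandas as pd

# Toy stand-in for combined_df; titles are placeholders for illustration only.
toy = pd.DataFrame({
    "subreddit": ["TheOnion", "TheOnion", "nottheonion", "nottheonion"],
    "title": ["Headline A", "Headline B", "Headline B", "Headline C"],
})

# Count how many distinct subreddits each title appears in.
labels_per_title = toy.groupby("title")["subreddit"].nunique()

# Titles seen under more than one label give the model conflicting targets.
ambiguous = labels_per_title[labels_per_title > 1].index.tolist()
print(ambiguous)  # ['Headline B']
```

Whether to drop such titles or keep them is a judgment call for the cleaning stage; the check only surfaces them.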

In [13]:
# To export the data.
combined_df.to_csv('../datasets/combined_df.csv')
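
One detail to keep in mind when this file is loaded again: `to_csv` writes the index as an unnamed first column by default, so reading the file back without `index_col` produces an extra `Unnamed: 0` column. The sketch below demonstrates this with a toy DataFrame and an in-memory buffer standing in for the CSV file.

```python
import io

import pandas as pd

# Toy stand-in for combined_df; values are placeholders for illustration only.
toy = pd.DataFrame({
    "subreddit": ["TheOnion", "nottheonion"],
    "title": ["Headline A", "Headline B"],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer)  # the index is written as an unnamed first column

buffer.seek(0)
naive = pd.read_csv(buffer)  # index comes back as a regular column

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)  # index restored as the index

print(naive.columns.tolist())     # ['Unnamed: 0', 'subreddit', 'title']
print(restored.columns.tolist())  # ['subreddit', 'title']
```

Passing `index=False` to `to_csv` is an alternative when the index carries no information worth keeping.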

With the data from both subreddits combined, the next notebook is where I clean the data and carry out modelling.