In [25]:
import requests
import time
import pandas as pd

# Background
During maintenance, an engineer accidentally deleted multiple posts from r/nottheonion and r/theonion, and was able to recover only the titles of the lost posts. As Reddit's data science team, we were tasked with building a classification model, trained on posts submitted before 01 Jan 2022, that assigns each recovered post back to its subreddit (r/nottheonion or r/theonion) based solely on its title.

This model will also serve as a proof of concept for an automated moderator that deletes posts which do not belong in the subreddit they were posted to. Bots have increasingly been spamming subreddits with irrelevant posts, and moderators spend a substantial amount of their time reviewing user reports and deleting spam. Because human moderators are volunteers, having automated moderators police the subreddits for spam would free up their time for other things.

# Problem Statement
For each of the 2 subreddits, scrape the 1,000 most recent posts submitted before 01 Jan 2022, forming a dataset of 2,000 posts.

Using the title texts of the posts, train a classification model to classify the posts into 1 of the 2 subreddits.

Four model types (Logistic Regression, Multinomial Naive Bayes, Random Forest and Support Vector Machine) will each be combined with 2 types of vectorizers (Count Vectorizer and TF-IDF Vectorizer). Hence, a total of 8 machine learning classification models will be trained and scored, in addition to the baseline model.
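
The count of 8 follows from pairing every vectorizer with every model type. A minimal sketch of that grid; the string labels are illustrative names only and are not necessarily the identifiers used in the later notebooks:

```python
from itertools import product

# Labels are illustrative; the later notebooks may name these differently.
vectorizers = ["CountVectorizer", "TfidfVectorizer"]
models = [
    "LogisticRegression",
    "MultinomialNB",
    "RandomForest",
    "SupportVectorMachine",
]

# Every (vectorizer, model) pair is one candidate classifier.
combinations = list(product(vectorizers, models))

for vectorizer, model in combinations:
    print(f"{vectorizer} + {model}")

print(len(combinations))
```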

Since the dataset is split evenly between the two subreddits, accuracy will be the key performance indicator. We want accuracy to be high so that Redditors can discuss posts in the subreddits where they belong; having posts in the right subreddits is also an indicator of Reddit's quality of service. Secondly, we want a balanced model, so that the workload of human moderators is reduced by a similar amount in each subreddit.
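
To make the baseline concrete: with an even split, a model that always predicts the same subreddit is right half the time, so any trained model has to beat an accuracy of 0.5. A minimal sketch, assuming exactly 1,000 posts per subreddit (the label strings are placeholders):

```python
from collections import Counter

# Assumed labels: 1,000 posts from each subreddit, as in the problem statement.
labels = ["theonion"] * 1_000 + ["nottheonion"] * 1_000


def baseline_accuracy(y):
    """Accuracy of always predicting the most frequent class in y."""
    most_common_count = Counter(y).most_common(1)[0][1]
    return most_common_count / len(y)


print(baseline_accuracy(labels))  # 0.5
```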

## Notebook 1 - Scraping

This is the first of 3 notebooks for this project.

In this notebook, I will:
- Use the Pushshift API to scrape posts from the subreddits
- Loop over requests until there are 1,000 posts with unique titles, as the API returns at most 100 posts per request
- Export the 1,000 posts to a csv file, 1 each for The Onion and Not The Onion
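
The looping strategy in the list above can be sketched independently of the network. In this sketch `fetch_page` is a hypothetical stand-in for the API request (it is not a Pushshift function), and `fake_fetch` simulates an archive so that the logic can be run offline:

```python
def collect_unique_titles(fetch_page, before, target=1_000, page_size=100):
    """Page backwards in time until `target` posts with unique titles are collected.

    `fetch_page(before, size)` is a hypothetical callable returning a list of
    post dicts (newest first), each with 'title' and 'created_utc' keys.
    """
    seen_titles = set()
    unique_posts = []
    while len(unique_posts) < target:
        page = fetch_page(before, page_size)
        if not page:  # the archive has no older posts
            break
        for post in page:
            if post["title"] not in seen_titles:
                seen_titles.add(post["title"])
                unique_posts.append(post)
        # The next request asks only for posts older than the oldest one seen.
        before = page[-1]["created_utc"]
    return unique_posts[:target]


# Simulated archive: 250 posts, one per second, with every 10th title repeated.
archive = [
    {"title": f"title {t if t % 10 else t - 1}", "created_utc": t}
    for t in range(250, 0, -1)
]


def fake_fetch(before, size):
    """Return up to `size` simulated posts created strictly before `before`."""
    older = [p for p in archive if p["created_utc"] < before]
    return older[:size]


posts = collect_unique_titles(fake_fetch, before=251, target=200)
print(len(posts), posts[0]["created_utc"], posts[-1]["created_utc"])
```

Tracking seen titles while paging keeps the newest copy of each title and preserves the newest-first order, so truncating to `target` keeps the most recent posts.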

In [26]:
url = 'https://api.pushshift.io/reddit/search/submission'

The Unix timestamp for 01 Jan 2022 00:00:00 UTC is 1640995200.
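
The value can be reproduced with the standard library by taking midnight UTC at the start of 01 Jan 2022 (the variable names here are illustrative):

```python
from datetime import datetime, timezone

# Midnight at the start of 01 Jan 2022, in UTC.
cutoff = datetime(2022, 1, 1, tzinfo=timezone.utc)
before_timestamp = int(cutoff.timestamp())

print(before_timestamp)  # 1640995200
```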

In [27]:
params = {
    'subreddit':'theonion',
    'size': 100,
    'before': 1640995200
}

In [28]:
res = requests.get(url, params)

In [29]:
res.status_code

200

In [30]:
data = res.json()

In [31]:
posts = data['data']

In [32]:
len(posts)

100

In [33]:
df = pd.DataFrame(posts)

In [34]:
df.head()

| | all_awardings | allow_live_comments | author | author_flair_css_class | author_flair_richtext | author_flair_text | author_flair_type | author_fullname | author_is_blocked | author_patreon_flair | ... | url_overridden_by_dest | whitelist_status | wls | crosspost_parent | crosspost_parent_list | removed_by_category | media | media_embed | secure_media | secure_media_embed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [] | False | mothershipq | | [] | | text | t2_4negm | False | False | ... | https://www.theonion.com/surgeon-kind-of-pisse... | all_ads | 6 | | | | | | | |
| 1 | [] | False | -ImYourHuckleberry- | | [] | | text | t2_g3p2c | False | False | ... | https://www.theartnewspaper.com/2021/12/31/mcd... | all_ads | 6 | t3_rstp9v | [{'all_awardings': [{'award_sub_type': 'GLOBAL... | moderator | | | | |
| 2 | [] | False | dwaxe | | [] | | text | t2_3jamc | False | False | ... | https://www.theonion.com/gwyneth-paltrow-touts... | all_ads | 6 | | | | | | | |
| 3 | [] | False | dwaxe | | [] | | text | t2_3jamc | False | False | ... | https://www.theonion.com/artist-crafting-music... | all_ads | 6 | | | | | | | |
| 4 | [] | False | dwaxe | | [] | | text | t2_3jamc | False | False | ... | https://www.theonion.com/homeowner-trying-to-s... | all_ads | 6 | | | | | | | |


In [35]:
posts[0]

{'all_awardings': [],
 'allow_live_comments': False,
 'author': 'mothershipq',
 'author_flair_css_class': None,
 'author_flair_richtext': [],
 'author_flair_text': None,
 'author_flair_type': 'text',
 'author_fullname': 't2_4negm',
 'author_is_blocked': False,
 'author_patreon_flair': False,
 'author_premium': True,
 'awarders': [],
 'can_mod_post': False,
 'contest_mode': False,
 'created_utc': 1640973300,
 'domain': 'theonion.com',
 'full_link': 'https://www.reddit.com/r/TheOnion/comments/rszeht/surgeon_kind_of_pissed_patient_seeing_her/',
 'gildings': {},
 'id': 'rszeht',
 'is_created_from_ads_ui': False,
 'is_crosspostable': True,
 'is_meta': False,
 'is_original_content': False,
 'is_reddit_media_domain': False,
 'is_robot_indexable': True,
 'is_self': False,
 'is_video': False,
 'link_flair_background_color': '',
 'link_flair_richtext': [],
 'link_flair_text_color': 'dark',
 'link_flair_type': 'text',
 'locked': False,
 'media_only': False,
 'no_follow': True,
 'num_comments': 0,

In [36]:
df[['subreddit','title','created_utc']].head()

| | subreddit | title | created_utc |
|---|---|---|---|
| 0 | TheOnion | Surgeon Kind Of Pissed Patient Seeing Her Defo... | 1640973300 |
| 1 | TheOnion | McDonald’s blocked from building drive-through... | 1640971771 |
| 2 | TheOnion | Gwyneth Paltrow Touts New Diamond-Encrusted Tr... | 1640955671 |
| 3 | TheOnion | Artist Crafting Music Box Hopes It Delights At... | 1640955669 |
| 4 | TheOnion | Homeowner Trying To Smoke Out Snakes Accidenta... | 1640955668 |


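Converting `created_utc` to a date and time (UTC):
- post 0 = Friday, December 31, 2021 17:55:00
- post 1 = Friday, December 31, 2021 17:29:31
- post 2 = Friday, December 31, 2021 13:01:11

Post 0 is the newest post, so the API returns posts sorted from newest to oldest.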
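
These readable dates come from interpreting `created_utc` as seconds since the Unix epoch. A short sketch using the standard library, with the three timestamps copied from the table above; the helper name is illustrative, and the weekday and month names assume Python's default (English) locale:

```python
from datetime import datetime, timezone

created_utc = [1640973300, 1640971771, 1640955671]  # posts 0, 1 and 2


def to_utc_string(timestamp):
    """Format a Unix timestamp as a human-readable UTC date and time."""
    moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return moment.strftime("%A, %B %d, %Y %H:%M:%S")


for index, timestamp in enumerate(created_utc):
    print(f"post {index} = {to_utc_string(timestamp)}")
```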

In [37]:
df[['subreddit','title','created_utc']].tail()

| | subreddit | title | created_utc |
|---|---|---|---|
| 95 | TheOnion | Army Receives 15-Yard Penalty For Drone-Striki... | 1639255284 |
| 96 | TheOnion | The Taliban are trying to Adapt to western ide... | 1639242218 |
| 97 | TheOnion | ‘And What Do You Want Me To Do, Brush Every Ni... | 1639241099 |
| 98 | TheOnion | Mrs. Fields CEO Under Fire For Laying Off 1,00... | 1639241070 |
| 99 | TheOnion | End Of Man’s Usefulness To Society Celebrated ... | 1639190215 |


In [38]:
df.iloc[-1]['created_utc']

1639190215

In [39]:
df['title'].nunique()

98

In [40]:
previous_df = df
previous_df['title'].nunique()

98

In [41]:
# The API returns at most 100 posts per request, so request repeatedly.
# Each iteration fetches posts older than the oldest post seen so far and appends them to the running results.
before_time = df.iloc[-1]['created_utc']
previous_df = df
i = 1
while previous_df['title'].nunique() < 1_000:
    params = {
        'subreddit': 'theonion',
        'size': 100,
        'before': before_time
    }
    res = requests.get(url, params=params)
    print(f'try {i}, status code is {res.status_code}')
    data = res.json()
    posts = data['data']
    print(f'try {i}, number of posts scraped is {len(posts)}')
    if not posts:  # no older posts left; stop instead of failing on an empty DataFrame
        break
    df = pd.DataFrame(posts)
    before_time = df.iloc[-1]['created_utc']
    print(f'try {i}, before_time is {before_time}')
    # Append older posts after newer ones so that rows stay sorted newest-first
    previous_df = pd.concat(objs=[previous_df, df], ignore_index=True)
    print(f'try {i}, combined df shape is {previous_df.shape}')
    i += 1
    time.sleep(1)  # pause between requests to avoid overloading the API
# Keep the first (newest) occurrence of each title, then the 1,000 most recent posts
theonion_df = previous_df.drop_duplicates(subset=['title']).iloc[:1_000]
theonion_df['title'].nunique()

try 1, status code is 200
try 1, number of posts scraped is 100
try 1, before_time is 1637028589
try 1, combined df shape is (200, 71)
try 2, status code is 200
try 2, number of posts scraped is 99
try 2, before_time is 1635253246
try 2, combined df shape is (299, 71)
try 3, status code is 200
try 3, number of posts scraped is 100
try 3, before_time is 1632588548
try 3, combined df shape is (399, 76)
try 4, status code is 200
try 4, number of posts scraped is 100
try 4, before_time is 1629862378
try 4, combined df shape is (499, 76)
try 5, status code is 200
try 5, number of posts scraped is 100
try 5, before_time is 1626744709
try 5, combined df shape is (599, 76)
try 6, status code is 200
try 6, number of posts scraped is 100
try 6, before_time is 1624497754
try 6, combined df shape is (699, 76)
try 7, status code is 200
try 7, number of posts scraped is 100
try 7, before_time is 1622042395
try 7, combined df shape is (799, 76)
try 8, status code is 200
try 8, number of posts 

1000

In [42]:
theonion_df.to_csv('../data/the_onion.csv', index = False)

In [43]:
# Repeat the first request for r/nottheonion
params = {
    'subreddit': 'nottheonion',
    'size': 100,
    'before': 1640995200
}
res = requests.get(url, params=params)
res.raise_for_status()  # stop early if the request failed
data = res.json()
posts = data['data']
df = pd.DataFrame(posts)

In [44]:
len(posts)

100

In [45]:
posts[0]

{'all_awardings': [],
 'allow_live_comments': False,
 'author': 'Taco_duck68',
 'author_flair_css_class': None,
 'author_flair_richtext': [],
 'author_flair_text': None,
 'author_flair_type': 'text',
 'author_fullname': 't2_bqrj5t0e',
 'author_is_blocked': False,
 'author_patreon_flair': False,
 'author_premium': True,
 'awarders': [],
 'can_mod_post': False,
 'contest_mode': False,
 'created_utc': 1640995192,
 'domain': 'wral.com',
 'full_link': 'https://www.reddit.com/r/nottheonion/comments/rt6ods/man_attempts_to_pay_for_car_with_rap_steals/',
 'gildings': {},
 'id': 'rt6ods',
 'is_created_from_ads_ui': False,
 'is_crosspostable': True,
 'is_meta': False,
 'is_original_content': False,
 'is_reddit_media_domain': False,
 'is_robot_indexable': True,
 'is_self': False,
 'is_video': False,
 'link_flair_background_color': '',
 'link_flair_richtext': [],
 'link_flair_text_color': 'dark',
 'link_flair_type': 'text',
 'locked': False,
 'media_only': False,
 'no_follow': False,
 'num_comments

In [46]:
df[['subreddit','title','created_utc']].tail()

| | subreddit | title | created_utc |
|---|---|---|---|
| 95 | nottheonion | DEA Releases Emoji Drug Decoder | 1640919490 |
| 96 | nottheonion | Stay safe and remain ugly | 1640918203 |
| 97 | nottheonion | Man who has been dropping used condoms in temp... | 1640917746 |
| 98 | nottheonion | Georgia man squirts fart spray called Liquid A... | 1640917366 |
| 99 | nottheonion | Man faked a COVID positive test to avoid court... | 1640917033 |


In [47]:
# The API returns at most 100 posts per request, so request repeatedly.
# Each iteration fetches posts older than the oldest post seen so far and appends them to the running results.
before_time = df.iloc[-1]['created_utc']
previous_df = df
i = 1
while previous_df['title'].nunique() < 1_000:
    params = {
        'subreddit': 'nottheonion',
        'size': 100,
        'before': before_time
    }
    res = requests.get(url, params=params)
    print(f'try {i}, status code is {res.status_code}')
    data = res.json()
    posts = data['data']
    print(f'try {i}, number of posts scraped is {len(posts)}')
    if not posts:  # no older posts left; stop instead of failing on an empty DataFrame
        break
    df = pd.DataFrame(posts)
    before_time = df.iloc[-1]['created_utc']
    print(f'try {i}, before_time is {before_time}')
    # Append older posts after newer ones so that rows stay sorted newest-first
    previous_df = pd.concat(objs=[previous_df, df], ignore_index=True)
    print(f'try {i}, combined df shape is {previous_df.shape}')
    i += 1
    time.sleep(1)  # pause between requests to avoid overloading the API
# Keep the first (newest) occurrence of each title, then the 1,000 most recent posts
nottheonion_df = previous_df.drop_duplicates(subset=['title']).iloc[:1_000]
nottheonion_df['title'].nunique()

try 1, status code is 200
try 1, number of posts scraped is 100
try 1, before_time is 1640859919
try 1, combined df shape is (200, 73)
try 2, status code is 200
try 2, number of posts scraped is 100
try 2, before_time is 1640803266
try 2, combined df shape is (300, 73)
try 3, status code is 200
try 3, number of posts scraped is 100
try 3, before_time is 1640751254
try 3, combined df shape is (400, 73)
try 4, status code is 200
try 4, number of posts scraped is 100
try 4, before_time is 1640698420
try 4, combined df shape is (500, 73)
try 5, status code is 200
try 5, number of posts scraped is 100
try 5, before_time is 1640607307
try 5, combined df shape is (600, 73)
try 6, status code is 200
try 6, number of posts scraped is 98
try 6, before_time is 1640502834
try 6, combined df shape is (698, 73)
try 7, status code is 200
try 7, number of posts scraped is 100
try 7, before_time is 1640383202
try 7, combined df shape is (798, 73)
try 8, status code is 200
try 8, number of posts 

1000

In [48]:
nottheonion_df.to_csv('../data/not_the_onion.csv', index = False)