# Subreddit Classifier Project: Monty Python vs. Python Language
## Stage 1: Problem Statement, API/Webscraping

### Problem Statement:
The word "python" means so many things to so many people. Snake enthusiasts adore their ball pythons with a love that borders on [incomprehensible](https://www.reddit.com/r/ballpython/comments/kbstzv/my_first_snake_i_couldnt_stop_crying/?utm_source=share&utm_medium=web2x&context=3). Fans of screwball British comedy have one holy grail, as it were: the incomparable Monty Python. And while the Python programming language may have been named after the comedy troupe, it has gained a reputation in its own right, with fans in the data science world who are almost as passionate about this beautiful, flexible language as fans of its eponym.

However, in the world of Reddit advertising, the proliferation of the word "python" and other data science-y terms is a challenge for LabelBox, a company that sells data science products and services. While their display ads on Reddit perform well overall, the LabelBox marketing team has discovered that they inadvertently display ads on subreddits that are not related to data science but whose names resemble data science topics. Ad campaigns on these "imposter" subreddits have very low return on ad spend (ROAS).

Rather than discontinue ads on these subreddits entirely, LabelBox wants to get more surgical in their targeting. The data science and python subreddits are heavily saturated with ads from LabelBox and their competitors, so the opportunity to scale in those subreddits is limited. In addition, the marketing team knows of several lucrative clients who first saw LabelBox ads on one of the "imposter" subreddits after landing there by accident. LabelBox hopes to gain an edge on their competition by finding prospective clients via misplaced posts on imposter subreddits.

In this project, I will develop a classification model and present it to the LabelBox technical and marketing teams. The model uses natural language processing to predict which subreddit a post belongs to, so that each post can be labeled a true positive, true negative, false positive or false negative. False positives, posts on ```r/montypython``` that the model believes should be on ```r/python```, will then be targeted by LabelBox via direct message or highly targeted display ads. LabelBox can then discontinue broad, costly batch campaigns in these subreddits, thereby reducing marketing spend while improving marketing metrics such as return on ad spend (ROAS) and cost to acquire (CTA).

My goal is for the model to accurately predict which subreddit a post belongs to 90% or more of the time across my training and testing datasets.
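To make the four outcome categories and the accuracy goal concrete, here is a minimal sketch that tallies them for a binary classifier. The labels and predictions are made-up illustrations, not project results; the coding 1 = `r/python`, 0 = `r/montypython` matches the `subred` column created later in this notebook, and the function names are hypothetical.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive=1):
    """Tally true/false positives and negatives for a binary classifier."""
    counts = Counter()
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            counts["tp" if actual == positive else "fp"] += 1
        else:
            counts["tn" if actual != positive else "fn"] += 1
    return counts


def accuracy(counts):
    """Share of all posts whose subreddit was predicted correctly."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total if total else 0.0


# Hypothetical labels: 1 = r/python, 0 = r/montypython
y_true = [1, 1, 1, 0, 0, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 0, 0, 1, 0, 1]

counts = confusion_counts(y_true, y_pred)
print(dict(counts), accuracy(counts))
```

In this toy example the single false positive is a Monty Python post that the model places in the Python subreddit, which is the kind of post LabelBox wants to find.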

### Data Collection

#### Imports

In [5]:
import pandas as pd
import requests
import time

In [6]:
# Creating pushshift url & params to retrieve submissions from my chosen subreddits:

url = 'https://api.pushshift.io/reddit/search/submission'

params_pp = {
    'subreddit': 'python',
    'size': 100,
}

params_mp = {
    'subreddit': 'montypython',
    'size': 100
}
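When `requests.get` is given a dictionary of parameters, it encodes them into the URL's query string. The sketch below uses only the standard library to show roughly what the request URL for the Python subreddit looks like; it is an illustration of the encoding, not a replacement for the `requests` call.

```python
from urllib.parse import urlencode

url = 'https://api.pushshift.io/reddit/search/submission'
params_pp = {'subreddit': 'python', 'size': 100}

# Join the base URL and the encoded parameters with "?"
full_url = f"{url}?{urlencode(params_pp)}"
print(full_url)
```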

#### Pull Python Subreddit Submission Data:

In [7]:
res = requests.get(url, params=params_pp)

In [8]:
res.status_code

200

In [9]:
data = res.json()

In [10]:
posts = data['data']

In [12]:
# Create dataframe for submissions from Python subreddit:

pp = pd.DataFrame(posts)
pp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 77 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   all_awardings                  100 non-null    object 
 1   allow_live_comments            100 non-null    bool   
 2   author                         100 non-null    object 
 3   author_flair_css_class         0 non-null      object 
 4   author_flair_richtext          97 non-null     object 
 5   author_flair_text              3 non-null      object 
 6   author_flair_type              97 non-null     object 
 7   author_fullname                97 non-null     object 
 8   author_patreon_flair           97 non-null     object 
 9   author_premium                 97 non-null     object 
 10  awarders                       100 non-null    object 
 11  can_mod_post                   100 non-null    bool   
 12  contest_mode                   100 non-null    bool

In [13]:
# Look at key columns: 

pp[['subreddit', 'title', 'selftext']]

| | subreddit | title | selftext |
|---|---|---|---|
| 0 | Python | How would I make a script to automatically mov... | Like if I press the W key, I would want the ch... |
| 1 | Python | Best IDE | [removed] |
| 2 | Python | LEARNING PYTHON AS A BEGINNER | Hy guys my name is Jason and i want to learn p... |
| 3 | Python | I automated a full time full before it could b... | Thought this was funny. I work as an Accountan... |
| 4 | Python | Learn SQLite with this free course with Python!! | Learn databases with python which you can imp... |
| ... | ... | ... | ... |
| 95 | Python | Rounding function on Python | [removed] |
| 96 | Python | I built a Telegram bot notifier to bring peace... | I’ve been reading Robert Martin’s *Clean Code*... |
| 97 | Python | Hello World! | |
| 98 | Python | Better approach for python multitasking for ta... | [removed] |


In [236]:
# Run while loop to pull the most recent 1500 posts from the Python subreddit:

while len(pp) < 1500:
    url = 'https://api.pushshift.io/reddit/search/submission'
    params_new_pp = {
        'subreddit': 'python',
        'size': 100,
        # Pushshift returns newest to oldest, so page back from the oldest record pulled so far
        'before': pp.iloc[-1]['created_utc']
    }
    res = requests.get(url, params=params_new_pp)
    if res.status_code == 200:
        pp_newposts = res.json()['data']
        if not pp_newposts:
            break  # no older posts left to fetch
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        pp = pd.concat([pp, pd.DataFrame(pp_newposts)])
    # Pause between requests; a failed request is retried on the next pass
    time.sleep(1)
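The paging logic above (request a batch, then ask for posts older than the last `created_utc` seen) can be sketched as a reusable function. This is a minimal sketch under stated assumptions: `collect_posts`, `fetch_page`, `max_failures` and `fake_fetch` are hypothetical names, and the stand-in fetcher fabricates posts so the loop can be exercised without calling the API.

```python
import time


def collect_posts(fetch_page, target=1500, pause=0.0, max_failures=5):
    """Page backwards through submissions until `target` posts are collected.

    fetch_page(before) should return a list of post dicts older than
    `before` (newest first), or None if the request failed.
    """
    posts, before, failures = [], None, 0
    while len(posts) < target and failures < max_failures:
        page = fetch_page(before)
        if page is None:   # failed request: count it, wait, then retry
            failures += 1
            time.sleep(pause)
            continue
        if not page:       # empty page: nothing older to fetch
            break
        posts.extend(page)
        before = page[-1]['created_utc']  # oldest timestamp seen so far
        time.sleep(pause)
    return posts[:target]


def fake_fetch(before, page_size=100, oldest=1):
    """Stand-in for the API call: posts with descending integer timestamps."""
    start = 3000 if before is None else before - 1
    stop = max(start - page_size, oldest - 1)
    return [{'id': str(ts), 'created_utc': ts} for ts in range(start, stop, -1)]


posts = collect_posts(fake_fetch, target=1500)
print(len(posts), posts[0]['created_utc'], posts[-1]['created_utc'])
```

In real use, `fetch_page` would wrap the `requests.get` call and return `None` on a non-200 status. Capping the number of failures keeps the loop from running forever if the service is down.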

In [287]:
len(pp)

1500

In [276]:
pp = pp.reset_index() 

#### Pull Monty Python Subreddit Submissions:

In [228]:
res = requests.get(url, params=params_mp)

In [229]:
res.status_code

200

In [230]:
mp_data = res.json()

In [231]:
mp_posts = mp_data['data']

In [136]:
# Create Monty Python dataframe (will append to main DF later):

mp = pd.DataFrame(mp_posts)

In [137]:
mp[['subreddit', 'title', 'selftext']]

| | subreddit | title | selftext |
|---|---|---|---|
| 0 | montypython | You Silly King | |
| 1 | montypython | ‘Tis just some art | |
| 2 | montypython | Linzhi Miner Phoenix Ethereum Official Review ... | |
| 3 | montypython | My new awesome Yule T | |
| 4 | montypython | Automate Whatsapp with 2 lines using Python | |
| ... | ... | ... | ... |
| 95 | montypython | Black knight joke found in 15th century manusc... | |
| 96 | montypython | This is the best lovely T I’ve ever found | |
| 97 | montypython | Was Castle Anthrax a trap? | I recently rewatched that part, and I can't sh... |
| 98 | montypython | ya'll think monty phython and the holy grail i... | for me it feels like they meet on Fridays at t... |


In [285]:
# Use while loop to pull the most recent 1500 posts from the Monty Python subreddit:

while len(mp) < 1500:
    url = 'https://api.pushshift.io/reddit/search/submission'
    params_new_mp = {
        'subreddit': 'montypython',
        'size': 100,
        # Page back from the oldest record pulled so far
        'before': mp.iloc[-1]['created_utc']
    }
    res = requests.get(url, params=params_new_mp)
    if res.status_code == 200:
        mp_newposts = res.json()['data']
        if not mp_newposts:
            break  # no older posts left to fetch
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        mp = pd.concat([mp, pd.DataFrame(mp_newposts)])
    # Pause between requests; a failed request is retried on the next pass
    time.sleep(1)

In [286]:
len(mp)

1500

In [288]:
# Check for duplicate records in both dataframes: 

mp.duplicated(subset=['id'], keep=False)

0     False
1     False
2     False
3     False
4     False
      ...  
95    False
96    False
97    False
98    False
99    False
Length: 1500, dtype: bool

In [290]:
pp.duplicated(subset=['id'], keep=False)

0       False
1       False
2       False
3       False
4       False
        ...  
1495    False
1496    False
1497    False
1498    False
1499    False
Length: 1500, dtype: bool
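Because the boolean Series above are truncated in the display, they cannot confirm by eye that no duplicates exist. A minimal sketch of a more compact check is to sum the flags into a single count; the toy dataframe here is hypothetical and stands in for `mp` or `pp`.

```python
import pandas as pd

# Hypothetical sample with one repeated post id
sample = pd.DataFrame({
    'id': ['a1', 'b2', 'a1', 'c3'],
    'title': ['first', 'second', 'first again', 'third'],
})

# duplicated() flags repeats after the first occurrence; summing counts them
n_dupes = int(sample.duplicated(subset=['id']).sum())

# If any are found, keep the first occurrence of each id
deduped = sample.drop_duplicates(subset=['id']).reset_index(drop=True)

print(n_dupes, len(deduped))
```

A count of 0 on the real dataframes would confirm that the paging loop did not pull the same submission twice.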

### Data Cleaning and EDA 

In [162]:
mp.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 99
Data columns (total 77 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   all_awardings                  1000 non-null   object 
 1   allow_live_comments            1000 non-null   bool   
 2   author                         1000 non-null   object 
 3   author_flair_css_class         0 non-null      object 
 4   author_flair_richtext          964 non-null    object 
 5   author_flair_text              0 non-null      object 
 6   author_flair_type              964 non-null    object 
 7   author_fullname                964 non-null    object 
 8   author_patreon_flair           964 non-null    object 
 9   author_premium                 964 non-null    object 
 10  awarders                       1000 non-null   object 
 11  can_mod_post                   1000 non-null   bool   
 12  contest_mode                   1000 non-null   bool

In [204]:
# Look at various columns to determine usefulness of data:

mp[['id','author', 'subreddit', 'title', 'selftext', 'score', 'num_comments', 'domain']].head()

| | id | author | subreddit | title | selftext | score | num_comments | domain |
|---|---|---|---|---|---|---|---|---|
| 0 | klkxxo | MrJFrayFilms | montypython | You Silly King | | 1 | 0 | i.redd.it |
| 1 | kl5fis | AidenAvocado | montypython | ‘Tis just some art | | 1 | 10 | i.redd.it |
| 2 | kl2g82 | eydiemaloy696 | montypython | Linzhi Miner Phoenix Ethereum Official Review ... | | 1 | 0 | youtube.com |
| 3 | kkmm2u | PositionFederal | montypython | My new awesome Yule T | | 1 | 6 | i.redd.it |
| 4 | kkhdex | Just_Philosopher385 | montypython | Automate Whatsapp with 2 lines using Python | | 1 | 0 | vocal.media |


In [325]:
pp.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| index | 1500.0 | 49.5 | 28.875697 | 0.0 | 24.75 | 49.5 | 74.25 | 99.0 |
| created_utc | 1500.0 | 1608084000.0 | 456198.556193 | 1607348000.0 | 1607658000.0 | 1608061000.0 | 1608489000.0 | 1608926000.0 |
| num_comments | 1500.0 | 4.402 | 15.535893 | 0.0 | 0.0 | 2.0 | 2.0 | 270.0 |
| num_crossposts | 1500.0 | 0.001333333 | 0.036503 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| pwls | 1500.0 | 6.0 | 0.0 | 6.0 | 6.0 | 6.0 | 6.0 | 6.0 |
| retrieved_on | 1500.0 | 1608086000.0 | 456756.580798 | 1607348000.0 | 1607658000.0 | 1608061000.0 | 1608489000.0 | 1608926000.0 |
| score | 1500.0 | 1.338667 | 6.899834 | 0.0 | 1.0 | 1.0 | 1.0 | 237.0 |
| subreddit_subscribers | 1500.0 | 715959.3 | 2478.085588 | 711830.0 | 713678.8 | 715855.5 | 718104.5 | 720495.0 |
| total_awards_received | 1500.0 | 0.003333333 | 0.077414 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| upvote_ratio | 1500.0 | 0.9878533 | 0.076393 | 0.33 | 1.0 | 1.0 | 1.0 | 1.0 |


In [324]:
mp.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| created_utc | 1500.0 | 1591543000.0 | 9019151.0 | 1577315000.0 | 1583857000.0 | 1590545000.0 | 1598446000.0 | 1609137000.0 |
| num_comments | 1500.0 | 4.437333 | 8.986652 | 0.0 | 0.0 | 0.0 | 6.0 | 105.0 |
| num_crossposts | 1500.0 | 0.004 | 0.08938303 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| pwls | 832.0 | 6.936298 | 0.2443681 | 6.0 | 7.0 | 7.0 | 7.0 | 7.0 |
| retrieved_on | 1500.0 | 1591548000.0 | 9019538.0 | 1577315000.0 | 1583857000.0 | 1590545000.0 | 1598446000.0 | 1609137000.0 |
| score | 1500.0 | 6.192 | 35.19568 | 0.0 | 1.0 | 1.0 | 1.0 | 506.0 |
| subreddit_subscribers | 1500.0 | 24687.83 | 1639.295 | 21443.0 | 23463.75 | 24721.0 | 26136.0 | 27194.0 |
| thumbnail_height | 1026.0 | 117.2086 | 25.26653 | 22.0 | 105.0 | 127.0 | 140.0 | 140.0 |
| thumbnail_width | 1026.0 | 140.0 | 0.0 | 140.0 | 140.0 | 140.0 | 140.0 | 140.0 |
| total_awards_received | 1500.0 | 0.0006666667 | 0.02581989 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
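The `created_utc` values in the two summaries are Unix timestamps in seconds, which are hard to read as raw numbers. The sketch below converts the min and max values shown above into dates to compare the time span each set of 1,500 posts covers. The inputs are copied from the summary tables, where they appear rounded, so the resulting dates are approximate; `utc_range` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def utc_range(min_ts, max_ts):
    """Convert epoch seconds to UTC datetimes and return the span in whole days."""
    start = datetime.fromtimestamp(min_ts, tz=timezone.utc)
    end = datetime.fromtimestamp(max_ts, tz=timezone.utc)
    return start, end, (end - start).days


# Min and max created_utc as displayed (rounded) in the summaries above
pp_start, pp_end, pp_days = utc_range(1607348000, 1608926000)
mp_start, mp_end, mp_days = utc_range(1577315000, 1609137000)

print(f"r/python:      {pp_start:%Y-%m-%d} to {pp_end:%Y-%m-%d} ({pp_days} days)")
print(f"r/montypython: {mp_start:%Y-%m-%d} to {mp_end:%Y-%m-%d} ({mp_days} days)")
```

On these figures, the Python posts span a few weeks while the Monty Python posts span about a year, which reflects how much more active the larger subreddit is.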


In [1]:
# Based on the summary statistics, both subreddits average about 4.4 comments per post.
# If the models don't perform well on submissions alone, I will pull comment data and append it to my dataset.
# Because this project focuses on NLP, I am moving forward with just subreddit, title and selftext for now.

In [331]:
# Checking for null values in Monty Python df for title, selftext:

mp.isnull().sum().tail(40)

pwls                              668
removed_by_category               883
retrieved_on                        0
score                               0
selftext                            0
send_replies                        0
spoiler                             0
stickied                            0
subreddit                           0
subreddit_id                        0
subreddit_subscribers               0
subreddit_type                      0
thumbnail                           0
thumbnail_height                  474
thumbnail_width                   474
title                               0
total_awards_received               0
treatment_tags                    518
upvote_ratio                      636
url                                 0
url_overridden_by_dest           1003
whitelist_status                  668
wls                               668
media                            1337
media_embed                      1355
secure_media                     1337
secure_media

In [292]:
# Filling nulls in selftext with blank string:

mp['selftext'] = mp['selftext'].fillna('')

In [293]:
# Checking for null values in important columns in Python dataframe:

pp.isnull().sum().tail(40)

parent_whitelist_status             0
permalink                           0
pinned                              0
pwls                                0
retrieved_on                        0
score                               0
selftext                            4
send_replies                        0
spoiler                             0
stickied                            0
subreddit                           0
subreddit_id                        0
subreddit_subscribers               0
subreddit_type                      0
thumbnail                           0
title                               0
total_awards_received               0
treatment_tags                      0
upvote_ratio                        0
url                                 0
whitelist_status                    0
wls                                 0
post_hint                         783
preview                           783
thumbnail_height                 1010
thumbnail_width                  1010
url_overridd

In [294]:
# Filling nulls in selftext with blank string:

pp['selftext'] = pp['selftext'].fillna('')

In [309]:
all_py = pp[['id','subreddit', 'title', 'selftext']]

In [310]:
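# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two dataframes the same way
all_py = pd.concat([all_py, mp[['id','subreddit', 'title', 'selftext']]])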

In [311]:
# Resetting index and deleting the "index" column, which is unnecessary.
all_py = all_py.reset_index()

In [312]:
all_py.drop(['index'], axis=1, inplace=True)

In [314]:
# Because many posts only have title, no selftext, I am using Hov's suggestion to merge them to a single corpus:
# This will hopefully simplify the model, and increase the chance of having strong accuracy 
# given more tokens/bigrams/trigrams to work with per user post. 

all_py['title_selftext'] = all_py['title'] + ' ' + all_py['selftext']
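The concatenation above relies on the earlier `fillna('')` step: adding a string to a missing value yields a missing value, which would discard the title as well. The sketch below shows the same merge on a toy dataframe and also blanks out the `[removed]` placeholder seen in the sample output. Treating `[deleted]` the same way is an assumption, and dropping placeholders at all is a suggestion rather than what this notebook does.

```python
import pandas as pd

# Toy rows modeled on the sample output above
toy = pd.DataFrame({
    'title': ['Best IDE', 'Hello World!', 'Was Castle Anthrax a trap?'],
    'selftext': ['[removed]', None, 'I recently rewatched that part'],
})

# '[removed]' appears in the data; '[deleted]' is an assumed second placeholder
PLACEHOLDERS = ['[removed]', '[deleted]']

# Fill missing bodies first, then blank out exact-match placeholder bodies
selftext = toy['selftext'].fillna('').replace(PLACEHOLDERS, '')

# Merge title and body, trimming the trailing space left by empty bodies
toy['title_selftext'] = (toy['title'] + ' ' + selftext).str.strip()

print(toy['title_selftext'].tolist())
```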

In [315]:
# Create binary column for subreddit for classification:

all_py['subred'] = (all_py['subreddit'] == 'Python').astype(int)

In [327]:
# Checking that counts are as expected:

all_py['subred'].value_counts(dropna=False)

1    1500
0    1500
Name: subred, dtype: int64

In [317]:
# Final check for nulls. All other data cleansing/EDA will be done during preprocessing.

all_py.isnull().sum()

id                0
subreddit         0
title             0
selftext          0
title_selftext    0
subred            0
dtype: int64

In [322]:
all_py.to_csv('./data/all_py.csv')
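One detail to keep in mind when reloading this file: `to_csv` writes the dataframe index as an unnamed first column by default, which `read_csv` then loads as `Unnamed: 0`. The sketch below demonstrates the difference on a toy dataframe using in-memory buffers. Passing `index=False` is one option, not what the cell above does; reading with `index_col=0` is another.

```python
import io

import pandas as pd

toy = pd.DataFrame({'id': ['klkxxo', 'kl5fis'], 'subred': [0, 0]})

# Default behavior: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_default = list(pd.read_csv(with_index).columns)

# index=False leaves the index out of the file
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_clean = list(pd.read_csv(without_index).columns)

print(cols_default)
print(cols_clean)
```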