# Problem Statement

This project seeks to identify the classification model that best distinguishes which of two subreddits a post belongs to. The performance of two parametric models will be evaluated. In the case of Logistic Regression, the coefficients can be interpreted to identify features that increase the log-odds of a post belonging to the positive class, all else being equal.
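
For reference, the log-odds interpretation comes from the standard form of logistic regression. The symbols are introduced here for illustration: $x_j$ is the value of feature $j$ for a post (for example a word count) and $\beta_j$ is its fitted coefficient.

$$
\log\frac{P(y=1\mid x)}{1-P(y=1\mid x)} = \beta_0 + \sum_j \beta_j x_j
$$

A one-unit increase in $x_j$ therefore changes the log-odds by $\beta_j$, which is the same as multiplying the odds by $e^{\beta_j}$, with all other features held fixed.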

Advertisers seeking to use Reddit as a marketing platform will be able to develop marketing strategies targeted specifically at each subreddit, based on what each community is most concerned about.

Two closely related subreddits, /gainit and /keto, were selected to determine which classification model more accurately assigns posts to their respective subreddit.

The /gainit subreddit is a fitness subreddit for information and discussion for people looking to put on weight, muscle, and strength.

The /keto subreddit is a place to share thoughts, ideas, benefits, and experiences around eating within a Ketogenic lifestyle.

While seemingly dissimilar, the two subreddits share commonalities:
1. Focus on caloric intake and fitness
2. Macronutrients: A ketogenic diet is relatively high in protein and fat. An increased intake of these nutrients is also what a person seeking to gain muscle or weight would adopt to reach that goal.

Moderators of each subreddit can potentially use the model to flag misclassified posts and so maintain the integrity of the subreddit.

In [1]:
#imports:
import pandas as pd
import numpy as np
import requests
import re
import time
import random
from nltk.corpus import stopwords
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
%matplotlib inline

Using the requests library, data is pulled via Reddit's API. Reddit provides only 25 posts per request by default, so to get 500 posts per subreddit the request was repeated 20 times, which also leaves a buffer for duplicate posts. The result is a list of nested JSON dictionaries, which is saved to two separate CSV files (one per subreddit) for EDA and modelling in a separate notebook.

In [2]:
url = "https://www.reddit.com/r/keto.json"

In [3]:
%%capture
from tqdm import tnrange,tqdm_notebook as tqdm
tqdm().pandas()

In [4]:

posts = []
after = None
for a in tnrange(20, desc='loop'):
    # The first request has no pagination token; later requests pass
    # the `after` id returned with the previous page.
    if after is None:
        current_url = url
    else:
        current_url = url + '?after=' + after
    print(current_url)
    res = requests.get(current_url, headers={'User-agent': 'Pony Inc 1.0'})

    # Stop on any non-OK response.
    if res.status_code != 200:
        print('Status error', res.status_code)
        break

    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts.extend(current_posts)
    after = current_dict['data']['after']

    # Save progress after every request. Note that `posts` is cumulative,
    # so each append re-writes earlier pages; the resulting duplicates are
    # removed later with drop_duplicates().
    if a > 0:
        prev_posts = pd.read_csv('./keto.csv')
        current_df = pd.DataFrame(posts)
        combined_posts = pd.concat([prev_posts, current_df], ignore_index=True)
        combined_posts.to_csv('./keto.csv', index=False)
    else:
        pd.DataFrame(posts).to_csv('./keto.csv', index=False)

    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2, 6)
    print(sleep_duration)
    time.sleep(sleep_duration)

  



https://www.reddit.com/r/keto.json
4
https://www.reddit.com/r/keto.json?after=t3_gj1ps1
2
https://www.reddit.com/r/keto.json?after=t3_gic4s0
2
https://www.reddit.com/r/keto.json?after=t3_gimmqs
2
https://www.reddit.com/r/keto.json?after=t3_gicedv
3
https://www.reddit.com/r/keto.json?after=t3_gi0r8p
6
https://www.reddit.com/r/keto.json?after=t3_ghi7ou
6
https://www.reddit.com/r/keto.json?after=t3_ghcw3k
4
https://www.reddit.com/r/keto.json?after=t3_ggxv49
3
https://www.reddit.com/r/keto.json?after=t3_ggs7zu
2
https://www.reddit.com/r/keto.json?after=t3_gge4la
2
https://www.reddit.com/r/keto.json?after=t3_gg40to
2
https://www.reddit.com/r/keto.json?after=t3_gfrjnb
3
https://www.reddit.com/r/keto.json?after=t3_gfm23h
6
https://www.reddit.com/r/keto.json?after=t3_gfazst
5
https://www.reddit.com/r/keto.json?after=t3_gexvbe
6
https://www.reddit.com/r/keto.json?after=t3_gerzvo
6
https://www.reddit.com/r/keto.json?after=t3_gdeblc
5
https://www.reddit.com/r/keto.json?after=t3_gdvu3o
5
https://w

In [5]:
keto_df = pd.read_csv('./keto.csv')
keto_df.shape[0]

5290
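
The row count above is well over the intended 500 because the loop appends the cumulative `posts` list to the CSV on every iteration, so earlier pages are written repeatedly. A minimal sketch of one alternative, which collects each page once and de-duplicates on the post's `name` id before anything is saved, is shown below. The `collect_posts` helper and the `fetch_page` callable are hypothetical names introduced here, and the stub stands in for the real `requests.get` call so the example runs offline.

```python
def collect_posts(fetch_page, n_pages):
    """Walk up to `n_pages` pages, keeping each post once (keyed on 'name')."""
    seen, posts, after = set(), [], None
    for _ in range(n_pages):
        page_posts, after = fetch_page(after)
        for post in page_posts:
            if post['name'] not in seen:
                seen.add(post['name'])
                posts.append(post)
        if after is None:  # the API reports no further pages
            break
    return posts


# Offline stub standing in for the Reddit request (made-up data).
def fake_fetch(after):
    pages = {
        None: ([{'name': 't3_a'}, {'name': 't3_b'}], 't3_b'),
        't3_b': ([{'name': 't3_b'}, {'name': 't3_c'}], None),
    }
    return pages[after]


unique_posts = collect_posts(fake_fetch, n_pages=20)
print(len(unique_posts))
```

With the real API, `fetch_page` would wrap the `requests.get` call and return `(current_posts, current_dict['data']['after'])`; the resulting list could then be written to CSV once, after the loop.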

In [8]:
url = "https://www.reddit.com/r/gainit.json"

In [9]:
posts = []
after = None
for a in tnrange(20, desc='loop'):
    # The first request has no pagination token; later requests pass
    # the `after` id returned with the previous page.
    if after is None:
        current_url = url
    else:
        current_url = url + '?after=' + after
    print(current_url)
    res = requests.get(current_url, headers={'User-agent': 'Pony Inc 1.0'})

    # Stop on any non-OK response.
    if res.status_code != 200:
        print('Status error', res.status_code)
        break

    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts.extend(current_posts)
    after = current_dict['data']['after']

    # Save progress after every request. Note that `posts` is cumulative,
    # so each append re-writes earlier pages; the resulting duplicates are
    # removed later with drop_duplicates().
    if a > 0:
        prev_posts = pd.read_csv('./gainit.csv')
        current_df = pd.DataFrame(posts)
        combined_posts = pd.concat([prev_posts, current_df], ignore_index=True)
        combined_posts.to_csv('./gainit.csv', index=False)
    else:
        pd.DataFrame(posts).to_csv('./gainit.csv', index=False)

    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2, 6)
    print(sleep_duration)
    time.sleep(sleep_duration)



https://www.reddit.com/r/gainit.json
3
https://www.reddit.com/r/gainit.json?after=t3_ghu04s
2
https://www.reddit.com/r/gainit.json?after=t3_gevnsq
3
https://www.reddit.com/r/gainit.json?after=t3_gdh8a1
6
https://www.reddit.com/r/gainit.json?after=t3_gca57e
3
https://www.reddit.com/r/gainit.json?after=t3_g7pbdn
4
https://www.reddit.com/r/gainit.json?after=t3_g63xtu
2
https://www.reddit.com/r/gainit.json?after=t3_g38a3c
6
https://www.reddit.com/r/gainit.json?after=t3_g1r03k
2
https://www.reddit.com/r/gainit.json?after=t3_fzp7c1
5
https://www.reddit.com/r/gainit.json?after=t3_fxel1s
4
https://www.reddit.com/r/gainit.json?after=t3_ful5x9
4
https://www.reddit.com/r/gainit.json?after=t3_frodeb
5
https://www.reddit.com/r/gainit.json?after=t3_fqo22i
3
https://www.reddit.com/r/gainit.json?after=t3_fozrp4
3
https://www.reddit.com/r/gainit.json?after=t3_fnx13j
3
https://www.reddit.com/r/gainit.json?after=t3_fmrsl5
6
https://www.reddit.com/r/gainit.json?after=t3_fjw9gn
5
https://www.reddit.com/r/g

In [10]:
gainIt_df = pd.read_csv('./gainit.csv')  # same path the scraping loop writes to
gainIt_df.shape[0]

5290

In [36]:
keto_filtered_df = keto_df.loc[:,['selftext','title']]
keto_filtered_df['target'] = 1

In [37]:
gainIt_filtered_df = gainIt_df.loc[:,['selftext','title']]
gainIt_filtered_df['target'] = 0

In [38]:
combined_df = pd.concat([keto_filtered_df,gainIt_filtered_df], ignore_index=True)
combined_df.head()

|   | selftext | title | target |
|---|---|---|---|
| 0 |  | New to gaining weight? Please read the FAQ bef... | 1 |
| 1 | **Welcome to the weekly stupid questions threa... | [Mod] Simple Questions - the weekly stupid que... | 1 |
| 2 | Yes, we understand not having everything be id... | Am I the only person who is tired of the const... | 1 |
| 3 | Pics: https://m.imgur.com/a/itfwgds \n\nBeen t... | [PROGRESS] 71kg - 81 kg (157lbs-179lbs) 3 mont... | 1 |
| 4 | That is all, just wanted to share. Feels good ... | Feel like I've finally gained enough weight to... | 1 |


In [39]:
combined_df.shape

(22120, 3)

In [40]:
combined_df.drop_duplicates(keep='first', inplace=True)

In [41]:
combined_df.shape

(1049, 3)

In [42]:
combined_df.isnull().sum()

selftext    45
title        0
target       0
dtype: int64

In [43]:
combined_df.fillna('missingtext', inplace=True)

In [44]:
combined_df.target.value_counts(normalize=True)

0    0.529075
1    0.470925
Name: target, dtype: float64
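
These proportions also give a reference point for the models that follow: a classifier that always predicts the majority class would be right about 53% of the time. A small sketch of that calculation is below. It uses made-up labels rather than the notebook's data, and `majority_baseline` is a hypothetical helper name.

```python
from collections import Counter


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


toy_labels = [0, 0, 0, 1, 1]  # illustrative only
print(majority_baseline(toy_labels))  # 0.6
```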

In [91]:
X_train, X_test, y_train, y_test = train_test_split(combined_df[['selftext']],
                                                    combined_df['target'],
                                                    test_size = 0.25,
                                                    random_state = 42)

In [121]:
def selftext_to_words(raw):
    # Convert a raw selftext (a single string) into a preprocessed
    # string of lower-case words with stopwords removed.

    # 1. Remove bracketed text, URLs and leakage terms, then non-letters.
    # Note: matching is case-sensitive, so e.g. 'Keto' is not removed here.
    pattern_1 = r'\[(.*?)\]' # square brackets
    pattern_2 = r'(https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s]{2,}|www\.[a-zA-Z0-9]+\.[^\s]{2,})' # urls
    pattern_3 = 'keto' # subreddit name (prevent data leakage)
    pattern_4 = r'\/r\/' # indicative of a subreddit link
    pattern_5 = 'gainit' # subreddit name (prevent data leakage)
    generic_re = re.compile("(%s|%s|%s|%s|%s)" % (pattern_1, pattern_2, pattern_3, pattern_4, pattern_5))
    raw_text = re.sub(generic_re, ' ', raw)
    letters_only = re.sub("[^a-zA-Z]", " ", raw_text)

    # 2. Convert to lower case, split into individual words.
    words = letters_only.lower().split()

    # 3. Searching a set is much faster than searching a list,
    # so convert the stopwords to a set.
    stops = set(stopwords.words('english'))

    # 4. Remove stopwords.
    meaningful_words = [w for w in words if w not in stops]

    # 5. Join the words back into one string separated by spaces,
    # and return the result.
    return " ".join(meaningful_words)
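
One thing to watch in the function above: the leakage patterns `keto` and `gainit` are matched before the text is lowercased, so capitalised forms such as "Keto" survive the substitution and end up as tokens. A hedged sketch of a case-insensitive variant follows. `LEAKAGE_RE` and `strip_leakage` are hypothetical names, and the sketch covers only the subreddit-name patterns, not the URL or square-bracket patterns.

```python
import re

# Hypothetical helper, not part of the original notebook.
# re.IGNORECASE makes 'Keto', 'KETO', 'GainIt', etc. match as well.
LEAKAGE_RE = re.compile(r"keto|gainit|/r/", flags=re.IGNORECASE)


def strip_leakage(raw):
    """Replace subreddit-identifying substrings with a space, ignoring case."""
    return LEAKAGE_RE.sub(" ", raw)


cleaned = strip_leakage("New to Keto? Ask in /r/GainIt or r/keto.")
print(cleaned.lower().split())
```

As in the original patterns, the match is on substrings, so a word like "ketogenic" is reduced to "genic" rather than removed whole.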

In [88]:
#combined_df.iloc[0,0]

In [87]:
#pattern_1 = '\[(.*?)\]' #square brackets
#pattern_2 = '(https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s]{2,}|www\.[a-zA-Z0-9]+\.[^\s]{2,})' #urls
#pattern_3 = 'keto' #subreddit(prevent data leakage)
#pattern_4 = '\/r\/' #indicative of subreddit
#generic_re = re.compile("(%s|%s|%s|%s)" % (pattern_1, pattern_2, pattern_3,pattern_4))

#raw_text = re.sub(generic_re, r' ', combined_df.iloc[0,0])
#raw_text

"Hey   !\n\nRunning? Lifting? Yoga? Swimming? Rowing? How are you getting your heart rate up these days?\n\nShare your fitness regimen OR ask the community any questions you have about working out!\n\nIf you're new to    and need some info, start with  (  and  (  Or, if you have a question that doesn't seem to be covered, head on over to the Community Support thread (pinned to the top of the subreddit) and ask the community!"

In [122]:
total_selftext = combined_df.shape[0]
total_selftext

995

In [123]:
print(f'There are {total_selftext} posts.')

# Initialize empty lists to hold the cleaned posts.
clean_train_selftext = []
clean_test_selftext = []

There are 995 posts.


In [124]:
print("Cleaning and parsing the training set posts...")

# Instantiate counter.
j = 0

# For every post in our training set...
for train_selftext in X_train['selftext']:
    
    # Convert post to words, then append to clean_train_selftext.
    clean_train_selftext.append(selftext_to_words(train_selftext))
    
    # If the running count is divisible by 100, print a message.
    if (j + 1) % 100 == 0:
        print(f'Post {j + 1} of {total_selftext}.')
    
    j += 1

# Let's do the same for our testing set (the counter keeps running).
print("Cleaning and parsing the testing set posts...")

# For every post in our testing set...
for test_selftext in X_test['selftext']:
    
    # Convert post to words, then append to clean_test_selftext.
    clean_test_selftext.append(selftext_to_words(test_selftext))
    
    # If the running count is divisible by 100, print a message.
    if (j + 1) % 100 == 0:
        print(f'Post {j + 1} of {total_selftext}.')
        
    j += 1


Cleaning and parsing the training set posts...
Post 100 of 995.
Post 200 of 995.
Post 300 of 995.
Post 400 of 995.
Post 500 of 995.
Post 600 of 995.
Post 700 of 995.
Cleaning and parsing the testing set posts...
Post 800 of 995.
Post 900 of 995.


In [125]:
pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('logreg', LogisticRegression(solver='lbfgs',max_iter=200))
    
])

In [126]:
pipe_params = {
    'cvec__max_features' :[2000, 3000, 4000, 5000],
    'cvec__min_df':[2,3],
    'cvec__max_df':[0.9,0.95],
    'cvec__ngram_range': [(1, 1),(1, 2),(1, 3)]
}

gs = GridSearchCV(pipe, # what object are we optimizing?
                  param_grid = pipe_params, # what parameters values are we searching?
                  cv=5) # 5-fold cross-validation.
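
For a sense of the search cost, the grid above contains 4 × 2 × 2 × 3 = 48 parameter combinations, and with 5-fold cross-validation that means 240 pipeline fits (plus one final refit on the full training set, which is `GridSearchCV`'s default behaviour). The arithmetic can be checked with the standard library alone; `n_folds`, `n_combinations` and `n_fits` are names introduced here for illustration.

```python
import math

pipe_params = {
    'cvec__max_features': [2000, 3000, 4000, 5000],
    'cvec__min_df': [2, 3],
    'cvec__max_df': [0.9, 0.95],
    'cvec__ngram_range': [(1, 1), (1, 2), (1, 3)],
}
n_folds = 5

# Number of combinations is the product of the option counts per parameter.
n_combinations = math.prod(len(values) for values in pipe_params.values())
n_fits = n_combinations * n_folds
print(n_combinations, n_fits)  # 48 240
```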

In [127]:
gs.fit(clean_train_selftext,y_train)


GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('cvec',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                        prep

In [128]:
gs.best_score_

0.8794004474272932

In [129]:
gs.best_params_

{'cvec__max_df': 0.9,
 'cvec__max_features': 4000,
 'cvec__min_df': 3,
 'cvec__ngram_range': (1, 3)}

In [130]:
gs_model = gs.best_estimator_

In [134]:
gs_model.score(clean_train_selftext, y_train)

0.9919571045576407

In [135]:
gs_model.score(clean_test_selftext, y_test)

0.9076305220883534

In [136]:
pipe_tfidf = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('logreg', LogisticRegression(solver='lbfgs',max_iter=200))
    
])



In [137]:
pipe_params = {
    'tvec__max_features' :[2000, 3000, 4000, 5000],
    'tvec__min_df':[2,3],
    'tvec__max_df':[0.9,0.95],
    'tvec__ngram_range': [(1, 1),(1, 2),(1, 3)]
}

gs_tfidf = GridSearchCV(pipe_tfidf, # what object are we optimizing?
                  param_grid = pipe_params, # what parameters values are we searching?
                  cv=5) # 5-fold cross-validation.

In [138]:
gs_tfidf.fit(clean_train_selftext,y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('tvec',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                        no

In [139]:
gs_tfidf.best_params_

{'tvec__max_df': 0.9,
 'tvec__max_features': 4000,
 'tvec__min_df': 2,
 'tvec__ngram_range': (1, 2)}

In [140]:
gs_model = gs_tfidf.best_estimator_

In [141]:
gs_model.score(clean_train_selftext, y_train)

0.985254691689008

In [142]:
gs_model.score(clean_test_selftext, y_test)

0.9116465863453815