# Classifying Subreddits by Title Content

### Notebook 1

## Problem Statement

Reddit can be a confusing place. With so many communities specializing in so many different topics, it is easy to lose track of which subreddit is which. To this end, it would be useful to have a model that can take a specific post and predict which subreddit it came from. Specifically, we want a model that can classify posts between two gaming subreddits, r/gaming and r/pcgaming. r/gaming is a general gaming subreddit, mostly centered around video games, whereas r/pcgaming is centered specifically around video games played on PCs, and the PCs that are built to play those games.

In order to build this model, posts from each subreddit will be required, and the language in those posts will need to be analyzed. This is a binary classification problem, so the modeling techniques used must be able to make this kind of prediction. The models will be compared on accuracy. Ideally, the final model will clearly outperform a baseline that always predicts the most common class, and will generalize well to new data.
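
To make the baseline concrete: a mode prediction always guesses the most common class, so its accuracy equals that class's share of the data. A minimal sketch with made-up labels (the `labels` list and the `baseline_accuracy` helper are hypothetical and not taken from the project):

```python
from collections import Counter

def baseline_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    counts = Counter(labels)
    # most_common(1) returns [(label, count)] for the majority class
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)

# hypothetical target column: 0 = r/gaming, 1 = r/pcgaming
labels = [0] * 6 + [1] * 4
print(baseline_accuracy(labels))
```

Any model worth keeping should score noticeably above this number on the holdout data.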

## Executive Summary

The first step of this project was to acquire the necessary data by attempting to pull 1000 posts from each targeted subreddit. In the end, we found that roughly 30-50% of the posts pulled from each subreddit were duplicates.

The real work began once the target information (the post title) had been isolated. Each of the posts had unique words and language patterns that could help to decipher which subreddit it came from. In order to fully analyze this information we had to use NLP (natural language processing). This was an iterative, multi-step process that included stemming words down to their base roots, removing so-called "stop words" in order to access the words conveying more meaning, and even removing some of the words that were very frequent in both of the subreddits (such as "game", "gaming", and "new").

As there are many modeling techniques available to solve a classification problem, it was important to optimize each model and then compare the best models against each other. To this end, we used pipelines and grid searches to perform large groups of fits at a time. As always, a `train_test_split` was used to keep a holdout dataset from the model for testing purposes. Two vectorizers were utilized, `CountVectorizer` and `TfidfVectorizer`, in order to optimize the way that each title's words were incorporated. For the classifiers, six different types were utilized: `LogisticRegression`, `KNeighborsClassifier`, `MultinomialNB`, `RandomForestClassifier`, `AdaBoostClassifier`, and `SVC` (Support Vector Classifier). Each of these models has certain benefits and pitfalls, and the results of each were somewhat unexpected.

In the end, the model that optimized accuracy while keeping a reasonable balance between bias and variance used `TfidfVectorizer` and `SVC`. This model reached 78.3% accuracy on the testing set. We believe this model is effective enough to continue classifying other posts between these two subreddits, mainly because misclassifications in this case carry fairly low stakes.

## Table of Contents

1. [Importing Packages](#Importing-Packages)
2. [Scraping Data](#Scraping-Data)
    1. [Scraping r/gaming](#Scraping-r/gaming)
    2. [Scraping r/pcgaming](#Scraping-r/pcgaming)
3. [Converting Posts](#Converting-Posts)
4. [Saving Data](#Saving-Data)

## Importing Packages

In [1]:
# importing all the things that might be useful
import numpy as np
import pandas as pd
import requests
import time

## Scraping Data

Reddit's API will allow people to pull up to 1000 posts from each subreddit. However, a typical request for posts only returns 25 entries. Since we need a large amount of data to create a well-functioning model, we can either pull 25 entries many times, or pull more entries fewer times.

With either method, we want a custom function to do the pulls for us without pinging reddit so often that we get blocked. Pulling a larger number of posts each time also reduces how often we have to ping reddit, which should reduce the chances of getting blocked.
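
As a rough illustration of the trade-off, the sketch below counts the requests needed at each page size and shows how a params dict ends up in the request URL. It uses only the standard library; the `pulls_needed` helper is hypothetical and the `after` value is a made-up placeholder. `requests` performs this encoding itself when it is given `params`.

```python
import math
from urllib.parse import urlencode

TARGET_POSTS = 1000  # the per-subreddit cap described above

def pulls_needed(target, per_pull):
    """Number of requests needed to cover `target` posts."""
    return math.ceil(target / per_pull)

print(pulls_needed(TARGET_POSTS, 25))   # default page size
print(pulls_needed(TARGET_POSTS, 100))  # larger page size

# how a params dict is turned into the query string of the request URL
base_url = "https://www.reddit.com/r/gaming.json"
params = {"limit": 100, "after": "t3_example"}  # "t3_example" is a placeholder id
full_url = f"{base_url}?{urlencode(params)}"
print(full_url)
```

At 100 posts per pull, ten requests cover the same ground as forty requests at the default size.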

### Scraping r/gaming

In [2]:
# setting the url/request info for the first subreddit
# changing post limit to 100 and using params dict from Madhi's wisdom
gaming_url = 'https://www.reddit.com/r/gaming.json' # API Endpoint

# setting params dict to increase posts per pull, and update the link to grab new posts
params = {"limit": 100}
user_agent = {'User-agent': 'jondov'}

In [3]:
# making a list for all the posts from r/gaming
gaming_posts = []

In [4]:
# creating custom function to take in vars and make the right pulls
def reddit_puller(url, params, user_agent, pulls, post_list):
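    # adapting code from Boom Devahastin Na Ayudhya
    
    # working on a copy so the `after` value from one subreddit is not carried into the next call
    params = dict(params)
    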
    for pull_num in range(int(pulls)):
        
        # stating which pull is being attempted
        print("Pulling data attempted", pull_num+1, "times")
        
        # establishing the request code
        res = requests.get(url, headers=user_agent, params=params)
        
        # pull the correct data if the code is good
        if res.status_code == 200:
            json_data = res.json()                      #  Pull JSON
            post_list.extend(json_data['data']['children']) #  Get posts and extend `post_list`
            
            # updating url with the id of the last post in the pull
            # next pull will grab the following entries
            after = json_data['data']['after']
            params["after"] = after
        
        else:
            print("There has been an error. The code is: ", res.status_code)
            break
            
        # sleeping the func so we aren't locked out
        time.sleep(2)

In [5]:
# running the new function
reddit_puller(gaming_url, params, user_agent, 10, gaming_posts)

Pulling data attempted 1 times
Pulling data attempted 2 times
Pulling data attempted 3 times
Pulling data attempted 4 times
Pulling data attempted 5 times
Pulling data attempted 6 times
Pulling data attempted 7 times
Pulling data attempted 8 times
Pulling data attempted 9 times
Pulling data attempted 10 times


The pulling appears to have been completed successfully, but now we have to verify what we ended up with.

In [5]:
# making func to check total and unique posts scraped
def post_checker(post_list):
    print(f"Total posts scraped: {len(post_list)}.")
    print("We have:", len(set([p['data']['id'] for p in post_list])), "unique posts for this subreddit")

In [5]:
# checking how many posts were scraped
len(gaming_posts)

945

In [6]:
# checking how many of the scraped posts have unique ids
print("We have:", len(set([p['data']['id'] for p in gaming_posts])), "unique posts in this subreddit")

We have: 645 unique posts in this subreddit


Unfortunately, we did not end up with as many unique posts as expected: only 645 of the 945 scraped posts have distinct ids. Most likely something in the request or in the reddit API is keeping the posts from being pulled in the sequential fashion that we wanted.

This issue is beyond the scope of the project, however, as we can still work with the number of unique posts that we have here.
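
One possible explanation, offered as a guess rather than a diagnosis: when a listing runs out, the `after` value in the response is empty, and a request sent without a usable `after` starts again from the top of the listing, which would produce repeats. The sketch below shows a variant of the pulling loop that stops as soon as `after` is empty. It also works on a copy of `params`, so one subreddit's `after` value cannot leak into the next call. The `pull_pages` function, the `fetch_page` argument, and the fake pages are hypothetical stand-ins so that the logic can run without touching reddit.

```python
def pull_pages(fetch_page, params, max_pulls):
    """Collect posts page by page, stopping when the listing runs out.

    fetch_page(params) must return the parsed JSON of one listing page.
    """
    params = dict(params)  # copy so the caller's dict is left untouched
    posts = []
    for _ in range(max_pulls):
        page = fetch_page(params)
        posts.extend(page["data"]["children"])
        after = page["data"]["after"]
        if not after:
            # no further pages: another request would start over from the top
            break
        params["after"] = after
    return posts


# hypothetical two-page listing used in place of real requests
FAKE_PAGES = {
    None: {"data": {"children": [{"data": {"id": "a"}}, {"data": {"id": "b"}}], "after": "t3_b"}},
    "t3_b": {"data": {"children": [{"data": {"id": "c"}}], "after": None}},
}

def fake_fetch(params):
    """Return the fake page that matches the requested `after` value."""
    return FAKE_PAGES[params.get("after")]

shared_params = {"limit": 100}
posts = pull_pages(fake_fetch, shared_params, max_pulls=10)
print([p["data"]["id"] for p in posts])
print(shared_params)
```

With real data, `fetch_page` could wrap the `requests.get(...)` call, the status-code check, and the `time.sleep` pause used in `reddit_puller`.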

### Additional Scraping for r/gaming

A few months after the project was completed, new data was pulled to extend the original dataset and improve the model.

In [6]:
# making new list for posts
new_gaming_posts = []

In [7]:
# making new pull 9/23/2019
reddit_puller(gaming_url, params, user_agent, 10, new_gaming_posts)

Pulling data attempted 1 times
Pulling data attempted 2 times
Pulling data attempted 3 times
Pulling data attempted 4 times
Pulling data attempted 5 times
Pulling data attempted 6 times
Pulling data attempted 7 times
Pulling data attempted 8 times
Pulling data attempted 9 times
Pulling data attempted 10 times


In [8]:
# checking the new post list
post_checker(new_gaming_posts)

Total posts scraped: 967.
We have: 566 unique posts for this subreddit


### Scraping r/pcgaming

The same process as above is now going to be repeated for r/pcgaming. We only need to make a new base url and list to store the posts, as the `params` and `user_agent` will stay the same, and the function will take care of the rest for us.

In [9]:
# setting the url for the second subreddit
pcgaming_url = 'https://www.reddit.com/r/pcgaming.json' # API Endpoint

In [7]:
# making the list to store all the posts from r/pcgaming
pcgaming_posts = []

In [8]:
# running func on new subreddit
reddit_puller(pcgaming_url, params, user_agent, 10, pcgaming_posts)

Pulling data attempted 1 times
Pulling data attempted 2 times
Pulling data attempted 3 times
Pulling data attempted 4 times
Pulling data attempted 5 times
Pulling data attempted 6 times
Pulling data attempted 7 times
Pulling data attempted 8 times
Pulling data attempted 9 times
Pulling data attempted 10 times


The second pull was also successful, and we should now also check how many unique posts we have against the total pulled.

In [10]:
len(pcgaming_posts)

985

In [20]:
# checking how many of the scraped posts have unique ids
print("We have:", len(set([p['data']['id'] for p in pcgaming_posts])), "unique posts in this subreddit")

We have: 583 unique posts in this subreddit


We have the same issue with this subreddit as well. It is unfortunate that we were not able to pull the full number of unique posts for each subreddit, but the model should still be effective with over 1200 total posts in our dataset.

### Additional Scraping for r/pcgaming

The additional scraping will also be performed for the pcgaming subreddit.

In [10]:
# making new list for posts
new_pcgaming_posts = []

In [13]:
# making new pull 9/23/2019
reddit_puller(pcgaming_url, params, user_agent, 10, new_pcgaming_posts)

Pulling data attempted 1 times
Pulling data attempted 2 times
Pulling data attempted 3 times
Pulling data attempted 4 times
Pulling data attempted 5 times
Pulling data attempted 6 times
Pulling data attempted 7 times
Pulling data attempted 8 times
Pulling data attempted 9 times
Pulling data attempted 10 times


In [14]:
# checking the new post list
post_checker(new_pcgaming_posts)

Total posts scraped: 1296.
We have: 648 unique posts for this subreddit


## Converting Posts

Now the posts need to be converted into something we can work with visually. Dataframes will do nicely.

In [12]:
# making dataframes of what was scraped
gaming_posts_df = pd.DataFrame(gaming_posts)
pcgaming_posts_df = pd.DataFrame(pcgaming_posts)

In [13]:
print(gaming_posts_df.shape)
gaming_posts_df.head()

(945, 2)


| | data | kind |
|---|---|---|
| 0 | {'approved_at_utc': None, 'subreddit': 'gaming... | t3 |
| 1 | {'approved_at_utc': None, 'subreddit': 'gaming... | t3 |
| 2 | {'approved_at_utc': None, 'subreddit': 'gaming... | t3 |
| 3 | {'approved_at_utc': None, 'subreddit': 'gaming... | t3 |
| 4 | {'approved_at_utc': None, 'subreddit': 'gaming... | t3 |


In [14]:
print(pcgaming_posts_df.shape)
pcgaming_posts_df.head()

(985, 2)


| | data | kind |
|---|---|---|
| 0 | {'approved_at_utc': None, 'subreddit': 'pcgami... | t3 |
| 1 | {'approved_at_utc': None, 'subreddit': 'pcgami... | t3 |
| 2 | {'approved_at_utc': None, 'subreddit': 'pcgami... | t3 |
| 3 | {'approved_at_utc': None, 'subreddit': 'pcgami... | t3 |
| 4 | {'approved_at_utc': None, 'subreddit': 'pcgami... | t3 |


We now have two dataframes of all the information from each pulled post. They're not quite in the final form that we need them, but it is a good start.

We will repeat the process for the new posts that were pulled.

In [15]:
# making dataframes of new scraped posts 9/23/2019
new_gaming_posts_df = pd.DataFrame(new_gaming_posts)
new_pcgaming_posts_df = pd.DataFrame(new_pcgaming_posts)

In [17]:
new_gaming_posts_df.head()

| | kind | data |
|---|---|---|
| 0 | t3 | {'approved_at_utc': None, 'subreddit': 'gaming... |
| 1 | t3 | {'approved_at_utc': None, 'subreddit': 'gaming... |
| 2 | t3 | {'approved_at_utc': None, 'subreddit': 'gaming... |
| 3 | t3 | {'approved_at_utc': None, 'subreddit': 'gaming... |
| 4 | t3 | {'approved_at_utc': None, 'subreddit': 'gaming... |


In [20]:
new_pcgaming_posts_df.head()

| | kind | data |
|---|---|---|
| 0 | t3 | {'approved_at_utc': None, 'subreddit': 'pcgami... |
| 1 | t3 | {'approved_at_utc': None, 'subreddit': 'pcgami... |
| 2 | t3 | {'approved_at_utc': None, 'subreddit': 'pcgami... |
| 3 | t3 | {'approved_at_utc': None, 'subreddit': 'pcgami... |
| 4 | t3 | {'approved_at_utc': None, 'subreddit': 'pcgami... |
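
Each row's `data` cell is a nested dictionary, so the fields of interest still have to be pulled out. As a sketch of one way to do that before building a dataframe, the helper below keeps a few keys per post. The `flatten_posts` name and the sample posts are hypothetical; `id`, `title`, and `subreddit` are keys that appear in the scraped posts in this notebook.

```python
def flatten_posts(post_list, keys=("id", "title", "subreddit")):
    """Return one flat dict per post, keeping only the requested keys."""
    return [{key: post["data"].get(key) for key in keys} for post in post_list]

# hypothetical posts shaped like the entries of json_data['data']['children']
sample_posts = [
    {"kind": "t3", "data": {"id": "abc", "title": "First post", "subreddit": "gaming", "score": 10}},
    {"kind": "t3", "data": {"id": "def", "title": "Second post", "subreddit": "gaming", "score": 3}},
]

rows = flatten_posts(sample_posts)
print(rows[0])
```

A list of flat dicts like `rows` can be passed straight to `pd.DataFrame`, which gives one column per key.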


## Saving Data

This seems like a good place to save the progress. Saving these to files will allow us to continue working with these datasets without having to pull the posts from reddit again.

In [21]:
# appending the new data as it was scraped so it doesn't get lost
new_gaming_posts_df.to_csv("../datasets/gaming_posts.csv", mode="a", index=False, index_label=False)
new_pcgaming_posts_df.to_csv("../datasets/pcgaming_posts.csv", mode="a", index=False, index_label=False)
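
One thing to keep in mind when saving in this format: a CSV file stores text, so each dictionary in the `data` column is written out as its string representation and comes back as a string when the file is read. A minimal sketch of the round trip using only the standard library (the in-memory buffer and the sample row are made up for illustration); `ast.literal_eval` turns the string back into a dictionary:

```python
import ast
import csv
import io

# made-up row shaped like the scraped posts
row = {"kind": "t3", "data": {"approved_at_utc": None, "subreddit": "gaming", "title": "Example"}}

# write to an in-memory buffer standing in for a .csv file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["kind", "data"])
writer.writeheader()
writer.writerow(row)

# read it back: the nested dict is now plain text
buffer.seek(0)
loaded = next(csv.DictReader(buffer))
print(type(loaded["data"]).__name__)

# literal_eval safely parses Python literals such as dicts, strings, and None
restored = ast.literal_eval(loaded["data"])
print(restored["title"])
```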

Currently the dataframes hold the full information of each post, though what we want is only the title of the unique posts that were scraped. We can loop through the dictionary of each post to extract the titles, and then only keep one of each unique title.

In [23]:
# making list of just the titles from each post
# getting set of that list to remove duplicates
unique_titles = set([new_gaming_posts_df["data"][post]["title"] for post in range(len(new_gaming_posts_df))])
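
A `set` removes duplicates but does not keep the original order, so the rows shown by `.head()` can differ between runs. If the scrape order matters, one alternative (a sketch, not what was used here) is `dict.fromkeys`, which drops repeats while keeping the first occurrence of each title. The `unique_in_order` helper and the sample titles are made up.

```python
def unique_in_order(titles):
    """Drop duplicate titles, keeping each one's first occurrence."""
    # dicts preserve insertion order (Python 3.7+), so the keys stay in scrape order
    return list(dict.fromkeys(titles))

# made-up titles with one repeat
titles = ["Post A", "Post B", "Post A", "Post C"]
deduped = unique_in_order(titles)
print(deduped)
```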

The unique titles can now be turned into a dataframe to more easily view and work on them. This is a good time to add our "target" column, a binary label for which subreddit the post belongs to.

In [26]:
# making a new df of the titles, and creating a subreddit dummy column
new_gaming_titles_df = pd.DataFrame(unique_titles, columns=["title"])
new_gaming_titles_df["pc_sub"] = 0
new_gaming_titles_df.head()

| | title | pc_sub |
|---|---|---|
| 0 | Who's your favorite and most trustworthy game ... | 0 |
| 1 | Mario Maker 2 in a nut | 0 |
| 2 | Finish adding backlight LCD screen to my gameb... | 0 |
| 3 | In Just Five Days Borderlands 3 Became 2K's Fa... | 0 |
| 4 | Whats a game everyone hates but you like? | 0 |


In [27]:
# repeating the process for the pcgaming df
unique_titles = set([new_pcgaming_posts_df["data"][post]["title"] for post in range(len(new_pcgaming_posts_df))])
new_pcgaming_titles_df = pd.DataFrame(unique_titles, columns=["title"])
new_pcgaming_titles_df["pc_sub"] = 1
new_pcgaming_titles_df.head()

| | title | pc_sub |
|---|---|---|
| 0 | Just built PC but can’t seem to finish all the... | 1 |
| 1 | So the rockstar games launcher | 1 |
| 2 | Shenmue 3 Kickstarter backers can still reques... | 1 |
| 3 | Early Access release date for Chernobylite ann... | 1 |
| 4 | Borderlands 3 - DX11 and DX12 testing - 3900X ... | 1 |


We now have two dataframes of just post titles, matched with their respective subreddit. With these dataframes, we can begin the process of exploring and cleaning the data. This seems like another good spot to save the progress.

In [28]:
# appending these new dfs to the old csv files
new_gaming_titles_df.to_csv(
    "../datasets/gaming_titles.csv",
    mode="a",
    index=False,
    index_label=False
)
new_pcgaming_titles_df.to_csv(
    "../datasets/pcgaming_titles.csv",
    mode="a",
    index=False,
    index_label=False
)
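
When appending to a CSV file, the header row should only be written once. With pandas this can be done by passing `header=False` to `to_csv` on later writes. The standard-library sketch below shows the same idea by writing the header only when the file does not exist yet. The `append_rows` helper, the file name, and the rows are hypothetical.

```python
import csv
import os
import tempfile

def append_rows(path, fieldnames, rows):
    """Append rows to a CSV file, writing the header only for a new file."""
    is_new_file = not os.path.exists(path)
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if is_new_file:
            writer.writeheader()
        writer.writerows(rows)

# demonstration in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "titles.csv")
    append_rows(path, ["title", "pc_sub"], [{"title": "Old post", "pc_sub": 0}])
    append_rows(path, ["title", "pc_sub"], [{"title": "New post", "pc_sub": 0}])
    with open(path, newline="", encoding="utf-8") as f:
        lines = f.read().splitlines()

print(lines)
```

Without a check like this, each append would add another header row in the middle of the file, which then shows up as a data row when the file is read back.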

In order to preserve these cells, and not constantly re-pull the posts, the remainder of the project will be in a new notebook.