# Classifying Subreddits by Post Content

### Notebook 1

## Problem Statement

Reddit can sometimes be a confusing place. With so many communities specializing in so many different topics, it can easily be overwhelming to keep different subreddits straight. To this end, it could be useful to have a model that takes a specific post and predicts which subreddit it came from. Specifically, we want a model that can classify posts between two gaming subreddits, r/gaming and r/pcgaming. r/gaming is a general gaming subreddit, mostly centered around video games, whereas r/pcgaming is centered specifically around video games played on PCs, and the PCs that are built to play those games.

In order to build this model, posts from each subreddit will be required, and the language in those posts will need to be analyzed. This is specifically a binary classification problem, so the modeling techniques used must be able to make this kind of prediction. In order to compare the different models, the accuracy metric will be used. Ideally, the model will perform significantly better than a baseline that always predicts the most common class (the mode), and will generalize well to new data.
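
To make the baseline concrete: a model that always predicts the more common subreddit is right exactly as often as that subreddit appears in the data. The sketch below is only an illustration. `baseline_accuracy` is a hypothetical helper, and the label counts are assumptions borrowed from the unique-post counts reported in the scraping section, not the final modeling data.

```python
from collections import Counter


def baseline_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


# Illustrative labels only: 0 = r/gaming, 1 = r/pcgaming.
# The counts are assumed for the example, not the final modeling data.
labels = [0] * 645 + [1] * 583

print(round(baseline_accuracy(labels), 3))
```

Any trained model would need to beat this number on held-out data to be worth using.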

## Executive Summary

The first step of this project was to acquire the necessary data by attempting to pull 1000 posts from each targeted subreddit. In the end, a large share of the pulled posts turned out to be duplicates: only 645 of the 945 posts pulled from r/gaming and 583 of the 985 posts pulled from r/pcgaming had unique IDs.

The real work began once the target information (the post title) had been isolated. Each of the posts had unique words and language patterns that could help to determine which subreddit it came from. In order to fully analyze this information we had to use NLP (natural language processing). This was an iterative, multi-step process that included stemming words down to their base roots, removing so-called "stop words" in order to focus on the words conveying more meaning, and even removing some of the words that were very frequent in both of the subreddits (such as "game", "gaming", and "new").

As there are many modeling techniques available to solve a classification problem, it was important to optimize each model and then compare the best models against each other. To this end, we used pipelines and grid searches to perform hundreds of fits at a time. As always, `train_test_split` was used to hold out a test set from the model for evaluation. Two vectorizers, `CountVectorizer` and `TfidfVectorizer`, were used to optimize the way each title's words were incorporated. Five different classifiers were used: `LogisticRegression`, `KNeighborsClassifier`, `MultinomialNB`, `RandomForestClassifier`, and `AdaBoostClassifier`. Each of these models has certain benefits and pitfalls, and the results of each were somewhat unexpected.

In the end, the model that optimized accuracy while keeping a reasonable balance between bias and variance used `TfidfVectorizer` and `AdaBoostClassifier`. This model achieved 78.4% accuracy on the testing set. We believe this model is effective enough to continue classifying other posts between these two subreddits, mainly due to the fact that misclassifications in this case carry fairly low stakes.

## Table of Contents

1. [Importing Packages](#Importing-Packages)
2. [Scraping Data](#Scraping-Data)
    1. [Scraping r/gaming](#Scraping-r/gaming)
    2. [Scraping r/pcgaming](#Scraping-r/pcgaming)
3. [Saving Data](#Saving-Data)

## Importing Packages

In [1]:
# importing all the things that might be useful
import numpy as np
import pandas as pd
import requests
import time

## Scraping Data

Reddit's API will allow people to pull up to 1000 posts from each subreddit. However, a typical request for posts returns only 25 entries. Since we need a large amount of data to build a well-functioning model, we can either pull 25 entries many times, or pull more entries per request a few times.

With either method, we want to automate the pulls without pinging Reddit so often that we get blocked. Pulling a larger number of posts each time also reduces how often we have to ping Reddit, hopefully also reducing the chances of getting blocked.
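
One way to package the pull loop as a reusable function is sketched below. It is an illustration under stated assumptions rather than the code used in this notebook: `collect_posts` and `fetch_page` are hypothetical names, and the HTTP call is passed in as a function so that the pagination logic can be tried without contacting Reddit.

```python
import time


def collect_posts(fetch_page, max_pulls=10, pause=2.0):
    """Follow a listing's `after` token for up to `max_pulls` pages.

    `fetch_page(after)` must return the parsed JSON of one page, shaped
    like {"data": {"children": [...], "after": <token or None>}}.
    """
    posts = []
    after = None
    for pull_num in range(max_pulls):
        page = fetch_page(after)
        posts.extend(page["data"]["children"])
        after = page["data"]["after"]
        if after is None:
            # No further page: stop instead of requesting page one again.
            break
        if pull_num < max_pulls - 1:
            time.sleep(pause)  # wait between requests
    return posts


# Stand-in for the real API so the sketch runs offline: two pages of fake posts.
fake_pages = {
    None: {"data": {"children": [{"data": {"id": "a"}}, {"data": {"id": "b"}}], "after": "t3_b"}},
    "t3_b": {"data": {"children": [{"data": {"id": "c"}}], "after": None}},
}

posts = collect_posts(lambda after: fake_pages[after], pause=0)
print(len(posts))
```

With `requests`, `fetch_page` could wrap a call such as `requests.get(url, headers=user_agent, params={"limit": 100, "after": after}).json()`, ideally with a status-code check before parsing.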

### Scraping r/gaming

In [2]:
# setting the url/request info for the first subreddit
# changing post limit to 100 and using params dict from Madhi's wisdom
gaming_url = 'https://www.reddit.com/r/gaming.json' # API Endpoint

# setting params dict to increase posts per pull, and update the link to grab new posts
params = {"limit": 100}
user_agent = {'User-agent': 'jondov'}

In [3]:
# making a list for all the posts from r/gaming
gaming_posts = []

In [4]:
# copypasta from Boom's local session
# minor changes made

# making the pull for r/gaming
for pull_num in range(10):

    ##### PREPARATIONS #####
    
    # Create some kind of message to tell us which request number we're at
    print("Pulling data attempted", pull_num+1, "times")
    
    
    ##### PULLING REQUEST AND EXTRACTING THE DATA #####
    
    # Step 1: Make request
    res = requests.get(gaming_url, headers=user_agent, params=params)
    

    # Step 2: Extract just the data that we need
    if res.status_code == 200:
        json_data = res.json()                     #  Pull JSON
        gaming_posts.extend(json_data['data']['children']) #  Get posts and extend the `gaming_posts` list
        
        # Step 3: Update the after param for next loop - grabs the next set of posts
        # 'after' = ID of the last post in this pull iteration
        after = json_data['data']['after']
        params["after"] = after
    else:
        print("We've run into an error. The status code is:", res.status_code)
        break

    # Create a brief pause so we don't send requests too quickly and get blocked
    time.sleep(2)

Pulling data attempted 1 times
Pulling data attempted 2 times
Pulling data attempted 3 times
Pulling data attempted 4 times
Pulling data attempted 5 times
Pulling data attempted 6 times
Pulling data attempted 7 times
Pulling data attempted 8 times
Pulling data attempted 9 times
Pulling data attempted 10 times


The pulling appears to have been completed successfully, but now we have to verify what we ended up with.

In [5]:
# checking how many posts were scraped
len(gaming_posts)

945

In [6]:
# checking how many of the scraped posts have unique ids
print("We have:", len(set([p['data']['id'] for p in gaming_posts])), "unique posts in this subreddit")

We have: 645 unique posts in this subreddit


Unfortunately, we did not end up with as many unique posts as expected. There is most likely something in the pull request or the Reddit API that keeps the posts from being returned in the sequential fashion that we wanted.

This issue is beyond the scope of the project, however, and we can still work with the number of unique posts that we have here.
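
One possible contributor, which was not verified here, is the end of the listing: when Reddit has no further page it returns `after` as `null`, and `requests` leaves `None`-valued parameters out of the query string, so the next request would start again from the first page. Whatever the cause, duplicates can be dropped by post ID while keeping the first occurrence of each post. The following is a minimal sketch; `dedupe_by_id` is a hypothetical helper and the sample posts are made up.

```python
def dedupe_by_id(posts):
    """Return posts with repeated IDs removed, keeping the first occurrence."""
    seen = set()
    unique = []
    for post in posts:
        post_id = post["data"]["id"]
        if post_id not in seen:
            seen.add(post_id)
            unique.append(post)
    return unique


# Made-up posts in the same shape as the API's `children` entries.
sample = [
    {"kind": "t3", "data": {"id": "abc", "title": "First"}},
    {"kind": "t3", "data": {"id": "def", "title": "Second"}},
    {"kind": "t3", "data": {"id": "abc", "title": "First"}},  # repeat
]

unique_sample = dedupe_by_id(sample)
print(len(sample), "pulled,", len(unique_sample), "unique")
```

Unlike a plain `set` of IDs, this keeps the full post dictionaries and their original order.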

### Scraping r/pcgaming

The same process as above is now going to be repeated for r/pcgaming. The only changes to the code here are to make that distinction.

In [7]:
# setting the url/request info for the second subreddit
pcgaming_url = 'https://www.reddit.com/r/pcgaming.json' # API Endpoint
params = {"limit": 100}
user_agent = {'User-agent': 'jondov'}

In [8]:
# making the list to store all the posts from r/pcgaming
pcgaming_posts = []

In [9]:
# same loop as above

# making the pull for r/pcgaming
for pull_num in range(10):

    ##### PREPARATIONS #####
    
    # Create some kind of message to tell us which request number we're at
    print("Pulling data attempted", pull_num+1, "times")
    
    
    ##### PULLING REQUEST AND EXTRACTING THE DATA #####
    
    # Step 1: Make request
    res = requests.get(pcgaming_url, headers=user_agent, params=params)
    

    # Step 2: Extract just the data that we need
    if res.status_code == 200:
        json_data = res.json()                     #  Pull JSON
        pcgaming_posts.extend(json_data['data']['children']) #  Get posts and extend the `pcgaming_posts` list
        
        # Step 3: Update the after param for next loop - grabs the next set of posts
        # 'after' = ID of the last post in this pull iteration
        after = json_data['data']['after']
        params["after"] = after
    else:
        print("We've run into an error. The status code is:", res.status_code)
        break

    # Create a brief pause so we don't send requests too quickly and get blocked
    time.sleep(2)

Pulling data attempted 1 times
Pulling data attempted 2 times
Pulling data attempted 3 times
Pulling data attempted 4 times
Pulling data attempted 5 times
Pulling data attempted 6 times
Pulling data attempted 7 times
Pulling data attempted 8 times
Pulling data attempted 9 times
Pulling data attempted 10 times


In [10]:
len(pcgaming_posts)

985

In [20]:
# checking how many of the scraped posts have unique ids
print("We have:", len(set([p['data']['id'] for p in pcgaming_posts])), "unique posts in this subreddit")

We have: 583 unique posts in this subreddit


We have the same issue with this subreddit. It is unfortunate that we were not able to pull the full number of unique posts for each subreddit, but with over 1200 unique posts in total the model should still have enough data to be effective.

Now the posts need to be converted into something we can work with visually. Dataframes will do nicely.

In [12]:
# making dataframes of what was scraped
gaming_posts_df = pd.DataFrame(gaming_posts)
pcgaming_posts_df = pd.DataFrame(pcgaming_posts)

In [13]:
print(gaming_posts_df.shape)
gaming_posts_df.head()

(945, 2)


|   | data | kind |
|---|------|------|
| 0 | {'approved_at_utc': None, 'subreddit': 'gaming... | t3 |
| 1 | {'approved_at_utc': None, 'subreddit': 'gaming... | t3 |
| 2 | {'approved_at_utc': None, 'subreddit': 'gaming... | t3 |
| 3 | {'approved_at_utc': None, 'subreddit': 'gaming... | t3 |
| 4 | {'approved_at_utc': None, 'subreddit': 'gaming... | t3 |


In [14]:
print(pcgaming_posts_df.shape)
pcgaming_posts_df.head()

(985, 2)


|   | data | kind |
|---|------|------|
| 0 | {'approved_at_utc': None, 'subreddit': 'pcgami... | t3 |
| 1 | {'approved_at_utc': None, 'subreddit': 'pcgami... | t3 |
| 2 | {'approved_at_utc': None, 'subreddit': 'pcgami... | t3 |
| 3 | {'approved_at_utc': None, 'subreddit': 'pcgami... | t3 |
| 4 | {'approved_at_utc': None, 'subreddit': 'pcgami... | t3 |


We now have two dataframes holding all the information from each pulled post. They're not quite in the final form that we need, but it is a good start.

## Saving Data

This seems like a good place to save the progress. Saving these to files will allow us to continue working with these datasets without having to pull the posts from reddit again.

In [21]:
# saving the data as it was scraped so it doesn't get lost
gaming_posts_df.to_csv("../datasets/gaming_posts.csv", index=False)
pcgaming_posts_df.to_csv("../datasets/pcgaming_posts.csv", index=False)
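
One caveat with this approach: `to_csv` writes each post dictionary as its string representation, so after reloading, the `data` column holds strings that need to be parsed (for example with `ast.literal_eval`) before keys such as `title` can be accessed. Saving the raw list of posts as JSON avoids that step. A minimal sketch follows, assuming a hypothetical file name and made-up posts; it writes to a temporary directory rather than `../datasets/`.

```python
import json
import tempfile
from pathlib import Path

# Made-up posts in the same shape as the scraped `children` entries.
posts = [
    {"kind": "t3", "data": {"id": "abc", "title": "First", "approved_at_utc": None}},
    {"kind": "t3", "data": {"id": "def", "title": "Second", "approved_at_utc": None}},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "example_posts.json"  # hypothetical file name

    # Write the raw list of post dictionaries.
    with path.open("w", encoding="utf-8") as f:
        json.dump(posts, f)

    # Read it back: nested dictionaries and None values survive intact.
    with path.open(encoding="utf-8") as f:
        reloaded = json.load(f)

print(reloaded[0]["data"]["title"])
```

The reloaded list can be passed straight to `pd.DataFrame`, just like the freshly scraped one.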

Currently the dataframes hold the full information of each post, though what we want is only the title of the unique posts that were scraped. We can loop through the dictionary of each post to extract the titles, and then only keep one of each unique title.

In [16]:
# making list of just the titles from each post
# getting set of that list to remove duplicates
unique_titles = set([gaming_posts_df["data"][post]["title"] for post in range(len(gaming_posts_df))])

This set of titles can now be turned into a dataframe to make it easier to view and work with. This is a good time to add our "target" column, a binary indicator of which subreddit the post belongs to.

In [17]:
# making a new df of the titles, and creating a subreddit dummy column
gaming_titles_df = pd.DataFrame(unique_titles, columns=["title"])
gaming_titles_df["pc_sub"] = 0
gaming_titles_df.head()

|   | title | pc_sub |
|---|-------|--------|
| 0 | Two Lord of the Rings Inspired Mario Maker Lev... | 0 |
| 1 | Request for "offline mmo" | 0 |
| 2 | This sub is extremely hypocritical | 0 |
| 3 | Chapter 15 of Evil Within is kicking my butt//... | 0 |
| 4 | Turning Inferno into Dust | 0 |


In [18]:
# repeating the process for the pcgaming df
unique_titles = set([pcgaming_posts_df["data"][post]["title"] for post in range(len(pcgaming_posts_df))])
pcgaming_titles_df = pd.DataFrame(unique_titles, columns=["title"])
pcgaming_titles_df["pc_sub"] = 1
pcgaming_titles_df.head()

|   | title | pc_sub |
|---|-------|--------|
| 0 | Crackdown 3 Flying High Update (Official Trailer) | 1 |
| 1 | New games coming to Gamepass. Shadow of War, D... | 1 |
| 2 | Super Buckyball Tournament is Coming to Steam ... | 1 |
| 3 | Skyrim’s murky seas finally get an overhaul un... | 1 |
| 4 | Nvidia RTX 2060 Super/ RTX 2070 Super Review! ... | 1 |


We now have two dataframes of just post titles, matched with their respective subreddit. With these dataframes, we can begin the process of exploring and cleaning the data. This seems like another good spot to save the progress.

In [19]:
# saving these new dfs to csv files
gaming_titles_df.to_csv("../datasets/gaming_titles.csv", index=False)
pcgaming_titles_df.to_csv("../datasets/pcgaming_titles.csv", index=False)

In order to preserve these cells, and not constantly re-pull the posts, the remainder of the project will be in a new notebook.