# NLP Subreddit Classification

## Problem Statement

The hosts of the My Favorite Murder podcast would like to know if it’s possible to predict if a piece of content was posted on the True Crime subreddit or on their subreddit. Natural Language Processing was used to convert the text to numeric values, then the data was tested on three model types: Logistic Regression, Naive Bayes, and Support Vector Machine. Success was measured by accuracy.

### Contents:
- [Imports](#Imports)
- [Webscraping](#Webscraping)
- [EDA](#EDA)
  * [DataFrame Creation](#DataFrame-Creation)
  * [Comment Counts Review](#Comment-Counts-Review)  
  * [Most Popular Words](#Most-Popular-Words)
- [Data Cleanup](#Data-Cleanup)
  * [Beautiful Soup, Regex, Lemmatizing, and Stopwords](#Beautiful-Soup,-Regex,-Lemmatizing,-and-Stopwords)
- [Modeling](#Modeling)
  * [Preprocessing](#Preprocessing)
  * [Naive Bayes Model](#Naive-Bayes-Model) 
- [Evaluation](#Evaluation)
  * [Confustion Matrix](#Confustion-Matrix) 
- [Conclusion](#Conclusion)

## Imports

Importing in packages.

In [1223]:
#importing in the packages
import numpy as np
import pandas as pd
import requests
import time
import regex as re
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from bs4 import BeautifulSoup   
from nltk.corpus import stopwords
from sklearn.metrics import confusion_matrix
from sklearn import svm

%config InlineBackend.figure_format = 'retina'

In [897]:
#importing warning to turn off future warnings
import warnings
warnings.simplefilter(action='ignore')

## Webscraping

Scraping in the data from the subreddits, True Crime and My Favorite Murder.

In [2]:
#edited code from Boom:
#defining urls I'll be pulling from as well as username
url_truecrime = "https://www.reddit.com/r/TrueCrime.json"
url_ssdgm = "https://www.reddit.com/r/myfavoritemurder.json"
username = {"User-agent": 'greybon'}      

In [3]:
#edited code from Boom:
#creating a fuction to pull posts from reddit
def get_subreddit(url, n_pulls, headers):   
    # Create empty templates
    posts = []
    after = None

    # Create a loop that does max 25 requests per pull
    for pull_num in range(n_pulls):
        print("Pulling data attempted", pull_num+1,"time(s)")

        if after == None:
            new_url = url                 
        else:
            new_url = url+"?after="+after 

        res = requests.get(new_url, headers=headers)

        if res.status_code == 200:
            subreddit_json = res.json()                      
            posts.extend(subreddit_json['data']['children']) 
            after = subreddit_json['data']['after']          
        else:
            print("We've run into an error. The status code is:", res.status_code)
            break
        time.sleep(1)
    print("We have:", len(set([p['data']['name'] for p in posts])), "posts in this subreddit")
        
        
    return(posts)

In [1771]:
#edited code from Boom:
#calling function for true crime subreddit 
true_crime = get_subreddit(url_truecrime, n_pulls = 40, headers = username)

Pulling data attempted 1 time(s)
Pulling data attempted 2 time(s)
Pulling data attempted 3 time(s)
Pulling data attempted 4 time(s)
Pulling data attempted 5 time(s)
Pulling data attempted 6 time(s)
Pulling data attempted 7 time(s)
Pulling data attempted 8 time(s)
Pulling data attempted 9 time(s)
Pulling data attempted 10 time(s)
Pulling data attempted 11 time(s)
Pulling data attempted 12 time(s)
Pulling data attempted 13 time(s)
Pulling data attempted 14 time(s)
Pulling data attempted 15 time(s)
Pulling data attempted 16 time(s)
Pulling data attempted 17 time(s)
Pulling data attempted 18 time(s)
Pulling data attempted 19 time(s)
Pulling data attempted 20 time(s)
Pulling data attempted 21 time(s)
Pulling data attempted 22 time(s)
Pulling data attempted 23 time(s)
Pulling data attempted 24 time(s)
Pulling data attempted 25 time(s)
Pulling data attempted 26 time(s)
Pulling data attempted 27 time(s)
Pulling data attempted 28 time(s)
Pulling data attempted 29 time(s)
Pulling data attempted 

In [1772]:
#edited code from Boom:
#calling function for true crime subreddit
fav_murder = get_subreddit(url_ssdgm, n_pulls = 40, headers = username)

Pulling data attempted 1 time(s)
Pulling data attempted 2 time(s)
Pulling data attempted 3 time(s)
Pulling data attempted 4 time(s)
Pulling data attempted 5 time(s)
Pulling data attempted 6 time(s)
Pulling data attempted 7 time(s)
Pulling data attempted 8 time(s)
Pulling data attempted 9 time(s)
Pulling data attempted 10 time(s)
Pulling data attempted 11 time(s)
Pulling data attempted 12 time(s)
Pulling data attempted 13 time(s)
Pulling data attempted 14 time(s)
Pulling data attempted 15 time(s)
Pulling data attempted 16 time(s)
Pulling data attempted 17 time(s)
Pulling data attempted 18 time(s)
Pulling data attempted 19 time(s)
Pulling data attempted 20 time(s)
Pulling data attempted 21 time(s)
Pulling data attempted 22 time(s)
Pulling data attempted 23 time(s)
Pulling data attempted 24 time(s)
Pulling data attempted 25 time(s)
Pulling data attempted 26 time(s)
Pulling data attempted 27 time(s)
Pulling data attempted 28 time(s)
Pulling data attempted 29 time(s)
Pulling data attempted 

## EDA

For EDA, the first thing I did was check out what info was available for each post from each subreddit. Things that are popping out are me are that each post is tagged for what subreddit it belongs to, and that the text available is limited to title, selfttext, and selftexthtml.  

In [1773]:
#Looking at the first post to see the breakdown components of a post
true_crime[0]

{'kind': 't3',
 'data': {'approved_at_utc': None,
  'subreddit': 'TrueCrime',
  'selftext': 'In an effort to clear up some of the clutter and to maintain the focus of the sub on learning about and discussing True Crime, we have decided to sticky this thread for all lighthearted discussion and memes. Any posts containing memes or jokes outside of this thread will be removed.\n\nPlease remember to keep things respectful of victims and their families, any inappropriate content will not be permitted. \n\nEnjoy!',
  'author_fullname': 't2_j3bh3',
  'saved': False,
  'mod_reason_title': None,
  'gilded': 0,
  'clicked': False,
  'title': '[Meme Thread] For all jokes and lighthearted discussion',
  'link_flair_richtext': [],
  'subreddit_name_prefixed': 'r/TrueCrime',
  'hidden': False,
  'pwls': 6,
  'link_flair_css_class': None,
  'downs': 0,
  'thumbnail_height': None,
  'hide_score': False,
  'name': 't3_aptilw',
  'quarantine': False,
  'link_flair_text_color': 'dark',
  'author_flair_ba

In [1774]:
#checking out the ssdgm subreddit to see how the posts are broken down too also noting the flair 
#section marking someone as "I'm a Karen". That sort of flag would mark it 100% to this thread 
#so I'll not use that for prediction purposes.
fav_murder[0]

{'kind': 't3',
 'data': {'approved_at_utc': None,
  'subreddit': 'myfavoritemurder',
  'selftext': "Just a friendly reminder that we do not allow any direct links to merchandise here on the subreddit.  You are more than welcome to post a picture of it but under no circumstance are you allowed to link to the actual merchandise's website/etsy/etc.\n\n",
  'author_fullname': 't2_r4538',
  'saved': False,
  'mod_reason_title': None,
  'gilded': 0,
  'clicked': False,
  'title': 'New Subscribers: Please Read the Subreddit Rules and reminder: NO MERCHANDISE DIRECT LINKS!',
  'link_flair_richtext': [],
  'subreddit_name_prefixed': 'r/myfavoritemurder',
  'hidden': False,
  'pwls': 6,
  'link_flair_css_class': None,
  'downs': 0,
  'thumbnail_height': None,
  'hide_score': False,
  'name': 't3_9gkh9q',
  'quarantine': False,
  'link_flair_text_color': 'dark',
  'author_flair_background_color': '',
  'subreddit_type': 'public',
  'ups': 60,
  'domain': 'self.myfavoritemurder',
  'media_embed': 

### DataFrame Creation

I created a function to pull elements from the scraped data and put it into a dataframe. I used a dictionary because each section label looked like a good key to add a value to. I pulled in the number of comments because I wanted to see how chatty the audiences are. 

In [1775]:
#I want to pull in title, selftext, subreddit, and comments info into a dataframe so I'm going to create 
#a function to do this
def pulling_posts(gimme_posts):
    
#creating an empty df
    df = pd.DataFrame()
    
    #looping through posts
    for post in gimme_posts:
        
        #creating a dictionary of the pieces I want to pull
        content_dict = {'title': [post['data']['title']],
                    'selftext': [post['data']['selftext']],
                    'subreddit': [post['data']['subreddit']],
                    'num_comments': [post['data']['num_comments']]} #pulled in so I can see how chatty
                                                                    #the users are, gives a picture of how
                                                                    #actively engaged the audiences are
        
        #appending to the df
        to_append = pd.DataFrame(content_dict)
        df = df.append(to_append, ignore_index=True)
        
    return df

In [1776]:
#creating the true crime df
true_crime_df = pulling_posts(gimme_posts = true_crime)

In [1777]:
#saving the file so I can read it in and use it from here
true_crime_df.to_csv("./datasets/true_crime_df.csv", index=False)

In [None]:
#creating the ssdgm df
fav_murder_df = pulling_posts(gimme_posts = fav_murder)

In [None]:
#saving the file so I can read it in and use it from here
fav_murder_df.to_csv("./datasets/fav_murder_df.csv", index=False)

In [1864]:
#reading in the true crime file
true_crime_df = pd.read_csv('./datasets/true_crime_df.csv')

In [1866]:
#confirming that it pulled in what I wanted. Noting that not all posts include selftext
#because of that, I won't use it by itself as a predictor
true_crime_df.head(2)

Unnamed: 0,title,selftext,subreddit,num_comments
0,[Meme Thread] For all jokes and lighthearted d...,In an effort to clear up some of the clutter a...,TrueCrime,14
1,A black man in Colorado was detained by cops a...,,TrueCrime,30


In [1867]:
#The file now has NaNs in it. I'm coverting them to a stopword that will get removed in an upcoming function
#that way it won't interfere with anything
true_crime_df['selftext'] = true_crime_df['selftext'].fillna('was')

In [1868]:
#reading in the my favorite murder file
fav_murder_df = pd.read_csv('./datasets/fav_murder_df.csv')

In [1869]:
#checking out the df, same note on selftext here as for the true crime subreddit, also MFM is mentioned right away.
#guessing that may be something that helps in prediction on this just because it references the podcast.
fav_murder_df.head(2)

Unnamed: 0,title,selftext,subreddit,num_comments
0,New Subscribers: Please Read the Subreddit Rul...,Just a friendly reminder that we do not allow ...,myfavoritemurder,20
1,MFM #167 - Bomb Grade: Official Discussion Post,This is the official discussion post for My Fa...,myfavoritemurder,37


In [None]:
#doing the same thing I did for the true crime df by changing NaNs to one of the stopwords
fav_murder_df['selftext'] = fav_murder_df['selftext'].fillna('was')

I know that my options for text here are title and selftext. Selftext is not available for all rows, but I'd like to use it. So I'm creating a column for it + the title. I'll compare it to the title column in my models to see if one does better than the other.

In [1781]:
#creating a column that combines the text columns into one
#I want to test how it does against the title columns
true_crime_df['combinedtext'] = true_crime_df['title'] + " " + true_crime_df['selftext']

In [1782]:
#adding the same combinedtext column to the my favorite murder df
fav_murder_df['combinedtext'] = fav_murder_df['title'] + " " + fav_murder_df['selftext']

### Comment Counts Review

I took a look at the comment counts because I wanted to see how chatty each subreddits audiences were. I was looking to see that they both had engaged audiences, which they do in this case. Mostly that's a marker to me that the content that's being shared continues to be of interest to the group who is following it. It means the subreddits remain on topic, which is good to know because I want them to be compareable. I'll be able to confirm this when I review each subreddit's top words. 

In [1870]:
#taking a look at how chatty followers of the true crime subreddit are
true_crime_df.describe()

Unnamed: 0,num_comments
count,982.0
mean,13.756619
std,30.105706
min,0.0
25%,1.0
50%,4.0
75%,13.0
max,569.0


In [1871]:
#and over to ssdgm's subreddit, interesting to note that the variations in max between this one and the other
#Curious to see if there is some sort of outlier on the true crime thread that folks were extra chatty on.
fav_murder_df.describe()

Unnamed: 0,num_comments
count,1001.0
mean,6.999001
std,12.519225
min,0.0
25%,1.0
50%,3.0
75%,7.0
max,142.0


I dove into the comments more to see what was so exciting over on the True Crime subreddit, since they do have a post that's a bit of an outlier on the comment counts. I will classify that one as a "celebrity" citing since it was a picture of someone who had been found not guilty on a high profile murder case.

### Most Popular Words

Next, I want to check out the frequency counts on words to see what's most popular. I want to see if there are differences in the vocabularly used or if everything overlaps. What I noticed is that while there was much overlapping with targeted words (murder, crime, true), there was also a strong use of podcast specific terminology on the My Favorite Murder subreddit that will be helpful with predicting which one content was posted on.

In [1790]:
#before diving into cleanup, I want to see what the most popular words used on each subbreddit are
#mostly I want to see if podcast terms pop out or if it's all murder and mayhem
def count_words(df):
    
    #instantiating
    cvec = CountVectorizer(stop_words = 'english', lowercase=True)

    # fitting/transforming
    new_cvec = cvec.fit_transform(df)

    # converting text to an array
    new_cvec = pd.DataFrame(new_cvec.toarray(), columns= cvec.get_feature_names())

    # getting the counts
    count_words = new_cvec.sum().sort_values(0, ascending=False)
    
    #returning a df
    return count_words

In [1791]:
#calling a df of the function for the combined text column (title + selftext) for the true crime df
crime_df = pd.DataFrame(count_words(true_crime_df['combinedtext']), columns=["Count Combined"])

In [1792]:
#Interesting, looks like some cleaning up is needed. Not sure what amp is, maybe part of a popular url, 
#I can see https and com are in the top there too so might be. 
crime_df.head()

Unnamed: 0,Count Combined
amp,286
https,269
case,231
murder,225
com,220


In [1793]:
#calling a df of the function for true crime - in this case I'm calling in the title column because I know I 
#want to test it compared to the combined text column in my models. 
crime_df = pd.DataFrame(count_words(true_crime_df['title']), columns=["Count Title"])

In [1794]:
#it looks like the title text is cleaner
crime_df.head()

Unnamed: 0,Count Title
murder,113
case,91
crime,85
true,72
old,71


In [1795]:
#and now over to the favorite murder df, calling the function with combined text column
murder_df = pd.DataFrame(count_words(fav_murder_df['combinedtext']), columns=["Count Combined"])

In [1796]:
#seeing what the top words are, and noting that amp and https don't appear to be as much of issue here
#although conversational words just, know, and like are cropping up higher.
murder_df.head()

Unnamed: 0,Count Combined
just,190
know,165
murder,156
episode,154
like,154


In [1797]:
#calling the function with the title only text for the favorite murder df
murder_df = pd.DataFrame(count_words(fav_murder_df['title']), columns=["Count Title"])

In [1798]:
#seeing what the top words from the title column are
murder_df.head()

Unnamed: 0,Count Title
mfm,76
murder,63
karen,63
episode,51
murderino,51


## Data-Cleanup

My approach to data cleanup is to first get rid of duplicates, if there are any. Then combine the dataframes to clean them. I noticed a couple odd characters I wanted to take care of right away. Ampersands being converted to html and tags for new rows (\n).

In [1799]:
#dropping the duplicates on the true crime df
true_crime_df = true_crime_df.drop_duplicates()

In [1800]:
#seeing how many we lost, which is none in this case
true_crime_df.shape

(931, 5)

In [1801]:
#dropping duplicates on the ssdgm df
fav_murder_df = fav_murder_df.drop_duplicates()

In [1802]:
#seeing how many we lost, which is none in this case
fav_murder_df.shape

(1001, 5)

In [1803]:
#I'm going go ahead and stack the dfs together here to make cleaning easier. 
df_combined = pd.concat([true_crime_df, fav_murder_df], axis=0)
df_combined.head()

Unnamed: 0,title,selftext,subreddit,num_comments,combinedtext
0,[Meme Thread] For all jokes and lighthearted d...,In an effort to clear up some of the clutter a...,TrueCrime,14,[Meme Thread] For all jokes and lighthearted d...
1,A black man in Colorado was detained by cops a...,was,TrueCrime,30,A black man in Colorado was detained by cops a...
2,Human remains were found at the home of actor ...,was,TrueCrime,2,Human remains were found at the home of actor ...
3,"Shawn Souza - Dartmouth, MA Police Officer Cha...",was,TrueCrime,1,"Shawn Souza - Dartmouth, MA Police Officer Cha..."
4,"Richey Edwards, of ManicStreetPreachers, disap...",was,TrueCrime,4,"Richey Edwards, of ManicStreetPreachers, disap..."


In [1804]:
#first up, I can see that &s have somehow been flipped over to html so let's fix that
df_combined['combinedtext'] = df_combined['combinedtext'].map(lambda cell: cell.replace("&amp;","&"))

In [1805]:
#Fixing it in title row too just in case
df_combined['title'] = df_combined['title'].map(lambda cell: cell.replace("&amp;","&"))

In [1806]:
#I also noticed some new lines in the code and am pulling that out
df_combined['combinedtext'] = df_combined['combinedtext'].map(lambda cell: cell.replace("\n"," "))

In [1807]:
#Fixing it in title row too just in case
df_combined['title'] = df_combined['title'].map(lambda cell: cell.replace("\n"," "))

In [1808]:
# making subreddit a binary column where true crime = 1 and my favorite murder = 0
df_combined['subreddit'] = df_combined['subreddit'].map({'myfavoritemurder': 0, 'TrueCrime': 1})

### Beautiful Soup, Regex, Lemmatizing, and Stopwords

My approach to the following function is that I wanted to strip any random html tags across the text. I also wanted to remove any punctuation and set the text all to lowercase. I realize I can do some of these things in CountVectorizer or TfidfVectorizer, but I wanted these elements cleaned now in order to make my gridsearches run faster.

For the text normalization techniques, I tested lemmatizing, Porter and Snowball stemming, and none across all my baseline models. All had similar scores. However, lemmatizing brought my training and test scores closer together, which helped with overfitting. So I went with lemmatizing here.

For stopwords, I created a custom list of words based on what I learned when I reviewed most popular words. Murder, true, and crime were in the top words on both subreddits, meaning they would contradict each other and not help with differentiation. I also noticed that any any urls shared in the selftext field had been chopped up into pieces. Https, www, and com. X200b showed up a lot too and I looked into it and found it represents a spacing issue. None of these words will help with predictions so I added them to my stopword list too. Lastly, I noticed some popular conversational language cropping up in the top words that were not words already in the English set (like and know)
. So I added them in too. 

I tested having no stopwords, English, and English plus my custom list of words. The last one worked best across my baseline models for increasing scores as well as helping to fix overfitting.

In [1873]:
#Function to clean up the text, edited code from Matt Brems
def posts_clean(text):
    #removing any html
    review_text = BeautifulSoup(text).get_text()
    
    # removing non-letters
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    
    # converting to lower case, split into individual words
    words = letters_only.lower().split()
    
    #lemmatizer
    lemmatizer = WordNetLemmatizer()
    lem_words = [lemmatizer.lemmatize(i) for i in words]
        
    # converting stopwords to set
    stop_words = stopwords.words('english')
    new_stops = ['www', 'https', 'com', 'x200b', 'like', 'know', 'murder', 'crime', 'true']
    stop_words.extend(new_stops)
    stops = set(stop_words)
    
    # removing stop words
    meaningful_words = [w for w in lem_words if not w in stops]
    
    # joining the words back into one string separated by space, 
    # and return the result.
    return(" ".join(meaningful_words))

In [1874]:
#initializing an empty list to hold the clean posts
clean_posts = []

In [1875]:
#setting up a for loop to iterate through the column
j = 0
for posts_update in df_combined['combinedtext']:
    clean_posts.append(posts_clean(posts_update)) 
    j += 1

In [1876]:
#pulling the list values into a column
df_combined['clean_posts'] = clean_posts

In [1877]:
# initializing an empty list to hold the clean titles
clean_titles = []

In [1878]:
#setting up a for loop to iterate through the column
j = 0
for posts_update in df_combined['title']:
    clean_titles.append(posts_clean(posts_update))
    j += 1

In [1879]:
#pulling the list values into a column
df_combined['clean_titles'] = clean_titles

In [1880]:
#checking to make sure everything updated correctly
df_combined.head(1)

Unnamed: 0,title,selftext,subreddit,num_comments,combinedtext,clean_posts,clean_titles
0,[Meme Thread] For all jokes and lighthearted d...,In an effort to clear up some of the clutter a...,1,14,[Meme Thread] For all jokes and lighthearted d...,meme thread joke lighthearted discussion effor...,meme thread joke lighthearted discussion


In [1817]:
#I'm saving a clean version of the dataframe to use in a supplement notebook (Modeling_Supplement.ipynb)
#The supplement notebook includes testing models, while the best model is featured in the next section.
df_combined.to_csv('./datasets/clean_df_combined.csv')

## Modeling

My approach for modeling was to set up 4 variations of pipelines for three model types: Logistic Regression, Naive Bayes, and Support Vector Machine. I only included the best model in this notebook. Please see Modeling_Supplement.ipynb for a look at the other models that were tested.

I set up four variations because I wanted to test different options for X (title vs. title + subtext) and I wanted to test two vectorizers (CountVectorizer and TfidfVectorizer). I set each of these models at their default settings to run tests on text normalization and stopwords. Once I settled on lemmatization and stopwords of English plus a custom list as my best performing options, I moved on to testing parameters. 

For parameters, I took a step by step approach, layering in new ranges of parameters for the gridsearch to test, including max features, n-grams, and max and min document frequencies. I also tested model specific options including penalties and the inverse of regularization strength. What I found though was that adding penalties made my scores worse and my models more overfit. So I backed up and scaled down on the options and ended up honing in on max features. I started dropping the max features and found that it helped with overfitting and with better scores. However, I had to make sure to not go too low, else the range between train and test scores would increase again.

In [1818]:
#Finding the baseline accuracy. Our goal here is to do better than 51.7%, which is the majority of the sample.
df_combined['subreddit'].value_counts(normalize=True)

0    0.518116
1    0.481884
Name: subreddit, dtype: float64

### Best Model

My best performing model was Naive Bayes Multinomial. The X was the title combined with selftext. The text normalizer was lemmatization and the stopwords were English plus a custom list of words. TifdVectorizer was the vectorizer used with max features of 738.

#### Preprocessing

In [1819]:
#dropping down to just the columns I want to use
df_crop = df_combined[['clean_posts', 'subreddit']]

In [1820]:
#setting X and y
X = df_crop['clean_posts']
y = df_crop['subreddit']

In [1821]:
#splitting into train test split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    random_state=42,
                                                    stratify=y)

#### Modeling

In [1822]:
#setting up the pipeline order
pipe = Pipeline([
    ('tvec', TfidfVectorizer()), 
    ('nb', MultinomialNB()) 
])

In [1857]:
#setting up the pipe parameters
pipe_params = {
    'tvec__max_features': [738],

}
best = GridSearchCV(pipe, param_grid=pipe_params, cv=5)
best.fit(X_train, y_train); 
print(best.best_score_)
best.best_params_

0.8184955141476881


{'tvec__max_features': 738}

In [1858]:
# Train score
best.score(X_train, y_train)

0.8723257418909592

In [1859]:
# Test score
best.score(X_test, y_test)

0.8612836438923396

## Evaluation

<p style="text-align: center;"><strong>MODEL BREAKDOWN</strong></p>

| Model | Train Score | Test Score | Best Cross-Val Score | X | Text Normalization
|:----------:|:-------------:|:------:|:------:| :------:| :------:|
| Logistic Regression 1 | 0.9586 | 0.8282 | 0.7964 | combined text | CountVectorizer | 
| Logistic Regression 2 | 0.9296 | 0.8178 | 0.7778 | title | CountVectorizer |  
| Logistic Regression 3 | 0.9165 | 0.8654 | 0.8213 | combined text | TfidfVectorizer |
| Logistic Regression 4 | 0.9068 | 0.8157 | 0.7888 | title | TfidfVectorizer |
| Naive Bayes 1 | 0.8578 | 0.8219 | 0.7985 | combined text | CountVectorizer | 
| Naive Bayes 2 | 0.8730 | 0.8116 | 0.7867 | title | CountVectorizer |  
| Naive Bayes 3 | 0.8723 | 0.8613 | 0.8185 | combined text | TfidfVectorizer |
| Naive Bayes 4 | 0.8937 | 0.8219 | 0.7902 | title | TfidfVectorizer |
| Support Vector Machine 1 | 0.6273 | 0.6398 | 0.5769 | combined text | CountVectorizer | 
| Support Vector Machine 2 | 0.6273 | 0.6398 | 0.5769 | title | CountVectorizer |  
| Support Vector Machine 3 | 0.7129 | 0.7371 | 0.6929 | combined text | TfidfVectorizer |
| Support Vector Machine 4 | 0.5183 | 0.5176 | 0.5183 | title | TfidfVectorizer |

My baseline accuracy is 51.8% and my goal was to beat that score. You can see the breakdown of the model scores in the grid above. All models except one beat the baseline score. One of my Support Vector Machine models did not beat it. This model used the title as the X and used TfidVectorizer as its vectorizer. The train score matched the 51.8% while the test score was lower at 51.7%. The Support Vector Machine models overall were my worst performing models. While the train and test scores were about the same for each model (so they weren't overfit), they were the lowest accuracy scores overall. 

The Logistic Regression models had higher scores overall, however each model was significantly overfit. The test scores ranged from 81.6 - 86.6%, while the train scores ranged from 90.7 - 95.9%. The Naive Bayes models were similar scores, however they were less overfit, including one model that is close enough in range between train and test scores to no longer be considered overfit. The test scores ranged from 81.2 - 86.1% and the train scores ranged from 85.5 - 89.4%. The best performing model was a Naive Bayes model with a test score of 86.1% and a train score of 87.2%. 

### Confustion Matrix

I wanted to take a look at how many posts were actually currected accurately with my best model.

In [1826]:
#getting predictions of the best model
predictions = best.predict(X_test)

In [1827]:
#generating a confusion matrix
confusion_matrix(y_test, predictions)

array([[223,  27],
       [ 41, 192]])

In [1828]:
#unraveling the matrix
tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()

In [1829]:
#printing out the values
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

True Negatives: 223
False Positives: 27
False Negatives: 41
True Positives: 192


This model correctly predicted 415 posts, while it incorrectly predicted 68 posts.

## Conclusion

My best model was able to predict placement of content on the True Crime subreddit vs. the My Favorite Murder subreddit with an accuracy of 86%. That's why I would recommend the Naive Bayes model as the one to use here. However, the caveat I would add is that Reddit content changes daily. What that means is that accuracy can and will change based on what content is available on any particular day. The model will need to be adjusted accordingly with updating what stopwords are included and fine tuning the max feature value. 