# Introduction

## Data Science Problem

For this project, we were tasked with choosing two different subreddits, collecting posts from each, and using Natural Language Processing (NLP) to see how well we could classify each post as belonging to one subreddit or the other.

With the timing of this project coinciding with the 2020 NFL Draft, I selected r/cowboys and r/eagles, the subreddits of two rival NFL franchises with dedicated fan bases. \[I myself have been a Cowboys fan since I was six years old.\]

**PRIMARY DATA SCIENCE PROBLEM**

Build a machine learning classifier that maximizes the accuracy of predicting r/cowboys posts ***without using obvious keywords/tokens***, but with particular emphasis on maximizing the model's Sensitivity (in other words, trying to minimize the chances of falsely predicting a true Cowboys fan to be an Eagles fan).

**SECONDARY DATA SCIENCE PROBLEM**

Uncover and diagnose which words are more indicative of one fan base or the other heading into the draft.

## Executive Summary

To carry out this project, I collected 2000 Reddit posts using the [Pushshift API](https://github.com/pushshift/api), pulling 1000 posts from the r/cowboys subreddit and 1000 posts from the r/eagles subreddit. Since I wanted to detect r/cowboys posts, I coded each post as either a 1 (r/cowboys) or a 0 (r/eagles), setting up a binary classification problem.

Because this project is about Natural Language Processing and not image evaluation, the two fields from the posts that I focused on for analysis were the "title" and "selftext" fields. Titles are typically like headlines and may contain anything from a few words to a couple of sentences. Selftext fields, when present, are more like paragraphs in their form. For the analysis, I concatenated the title and selftext fields into a single (sometimes lengthy) string. As a reference, about three-quarters of posts only consist of a title.

At this point, the two files were merged, shuffled, and randomly split into train and test data files for modeling, with 1400 cases selected for training the machine learning classifier and 600 posts used as a holdout sample to evaluate model performance. Stratification was used to ensure that each of these files contained 50% r/cowboys posts and 50% r/eagles posts.

The two datasets were cleaned by removing HTML and non-letter characters, converting all remaining characters to lowercase, and splitting the strings into "tokens", which are essentially any words or characters that are separated by spaces. These tokens were then lemmatized, a step that reduces multiple forms of the same word to a common base form. Stopwords were removed in the final step prior to analysis. Stopwords are either words that have been consistently judged to provide little meaning in text analysis or, in this case, words that point so obviously to one subreddit or the other that they would make classification too easy. I wanted my machine learners to work hard!

The analysis stream consisted of running the data through one of two types of vectorizers in order to convert the raw text data into numeric variables suitable for modeling. Count Vectorization simply counts how many times each token is used within each post (so values are non-negative integers with no theoretical upper limit), while TF-IDF Vectorization weights each token's frequency within a post by how rare that token is across all posts in the data set, so tokens that show up in many posts are down-weighted. After vectorization, the data were run through a series of classification models, including:

- Logistic Regression Classifier
- Multinomial Naive Bayes Classifier
- Decision Tree Classifier
- Bagged Decision Tree Classifier
- Random Forests Classifier
- Extra Trees Classifier
- Gradient Boosting Classifier
- Support Vector Machine Classifier
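
To make the difference between the two vectorizers concrete, here is a minimal pure-Python sketch on three made-up posts. It uses the textbook weighting of count × ln(N / df); scikit-learn's `TfidfVectorizer` applies a smoothed version of the IDF term and L2-normalizes each row by default, so its numbers will differ. The toy posts and variable names are illustrative assumptions, not part of the project data.

```python
import math
from collections import Counter

# Toy corpus (hypothetical posts, already cleaned and lowercased)
docs = [
    "draft pick trade draft",
    "trade rumors today",
    "draft grades today",
]
tokenized = [doc.split() for doc in docs]
vocab = sorted({tok for toks in tokenized for tok in toks})

# Count vectorization: raw token counts per post
counts = [Counter(toks) for toks in tokenized]
count_matrix = [[c[tok] for tok in vocab] for c in counts]

# Document frequency: number of posts that contain each token
n_docs = len(docs)
doc_freq = {tok: sum(tok in c for c in counts) for tok in vocab}

# Textbook TF-IDF: count * ln(N / df)
tfidf_matrix = [
    [c[tok] * math.log(n_docs / doc_freq[tok]) for tok in vocab]
    for c in counts
]

print(vocab)
for row_counts, row_tfidf in zip(count_matrix, tfidf_matrix):
    print(row_counts, [round(v, 2) for v in row_tfidf])
```

In the first toy post, "pick" appears once and "draft" appears twice, yet "pick" receives the larger TF-IDF weight because it occurs in only one of the three posts.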
    
Model performance was judged against the baseline accuracy rate of 50%, as well as an "unfair" model where custom stopwords were not removed. Again, I was looking for a model that would be relatively high in accuracy, balance Sensitivity and Specificity (as shown by a high ROC AUC score), but lean towards maximizing Sensitivity.
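
As a quick reference for the metrics named above, the following sketch computes accuracy, Sensitivity, and Specificity from true and predicted labels, with 1 standing for r/cowboys as in this project. The function name and the example labels are hypothetical, and the sketch assumes both classes are present in `y_true`.

```python
def classification_rates(y_true, y_pred):
    """Return accuracy, sensitivity, and specificity for binary labels (1 = r/cowboys)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # Cowboys posts predicted Cowboys
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # Cowboys posts predicted Eagles
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # Eagles posts predicted Eagles
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # Eagles posts predicted Cowboys
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # share of true Cowboys posts that were caught
        "specificity": tn / (tn + fp),  # share of true Eagles posts that were caught
    }


# Hypothetical labels for eight posts
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 1]
rates = classification_rates(y_true, y_pred)
print(rates)
```

Leaning towards Sensitivity means accepting more Eagles posts mislabeled as Cowboys in exchange for fewer Cowboys posts mislabeled as Eagles.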

**Results**

Somewhat surprisingly, even the "unfair" model didn't have an easy time of it, correctly predicting only 80% of the posts (which is still a substantial improvement over the naive 50% accuracy baseline). The remaining models achieved cross-validated accuracy ranging from 56% to 70%. The "winning" model passed Count Vectorized data through a Logistic Regression classifier with L1 (LASSO) regularization, resulting in a solution with 67% cross-validated accuracy, 72% Sensitivity, and a ROC AUC score of 0.72. This is decent performance, but well short of world-class classification levels.

There are at least two reasons for this. First, both subreddits consist of NFL fans, who share a common vocabulary that may not differ enough between the two groups for highly accurate classification. Second, I did not scrape a huge sample of posts, so some of the difficulty may stem from the relatively small sample size used for the analysis.

Differences in word usage were revealed, however, with strong indicators often being references to various media members who cover one or the other team (such as Justin Melo or Todd Archer). Fans may have been reacting in the form of "did you see what Todd Archer posted?", for example. r/cowboys posts were more likely to contain references to other teams (like Tampa, Dolphins, Steelers, San Francisco). Certain vulgarities were more associated with one or the other subreddit as well.

## Table of Contents

- [2 Imports](#Imports)
- [3 Get Data](#Get-Data)
- [4 Data Cleaning & Preprocessing](#Data-Cleaning)
- [5 Modeling](#Modeling)
    - [Logistic Regression Classifier](#Logistic-Regression-Classifier)
    - [Multinomial Naive Bayes Classifier](#Multinomial-Naive-Bayes-Classifier)
    - [Decision Tree Classifier](#Decision-Tree-Classifier)
    - [Random Forest/ExtraTrees Classifiers](#Random-Forests-and-Extra-Trees-Classifiers)
    - [Boosting Classifiers](#Boosting-Classifiers)
    - [Support Vector Machine Classifiers](#Support-Vector-Machine-Classifier)
- [6 Conclusions](#Conclusions)

# Imports

In [2]:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import regex as re
from sklearn.preprocessing import StandardScaler 
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, plot_roc_curve, precision_score, roc_auc_score
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier, \
                             GradientBoostingClassifier, AdaBoostClassifier, VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
import time
import datetime as dt

# Get Data

## Get Cowboys Data

In [2]:
# URL for r/cowboys
reddit_url_dal = 'https://api.pushshift.io/reddit/search/submission/?subreddit=cowboys&size=1000'

In [3]:
# Pull cowboys data from reddit
r_dal = requests.get(reddit_url_dal)
# Return json encoded data
dal = r_dal.json()
# Select data library from json file
dal = dal['data']
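
Since the same request is repeated later for r/eagles, the pull could be wrapped in a small helper. The sketch below is one possible shape for it, assuming the same endpoint and the `subreddit` and `size` query parameters used above; the function names and the 30-second timeout are illustrative choices, not part of the original notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.pushshift.io/reddit/search/submission/"


def build_url(subreddit, size=1000):
    """Build the submission-search URL for a given subreddit."""
    return BASE_URL + "?" + urlencode({"subreddit": subreddit, "size": size})


def fetch_posts(subreddit, size=1000, timeout=30):
    """Request one page of posts and return the list stored under the 'data' key."""
    import requests  # third-party; imported here so build_url works without it

    response = requests.get(build_url(subreddit, size), timeout=timeout)
    response.raise_for_status()  # fail loudly on a non-2xx response
    return response.json()["data"]


print(build_url("cowboys"))
```

With a helper like this, `dal = fetch_posts("cowboys")` and `eag = fetch_posts("eagles")` would replace the two separate request cells.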

In [31]:
len(dal)

1000

In [6]:
dal[0]

{'all_awardings': [],
 'allow_live_comments': False,
 'author': 'drewchaiinz',
 'author_flair_css_class': 'JNW13',
 'author_flair_richtext': [],
 'author_flair_template_id': '713e5116-1baf-11ea-8398-0ed97e74e1df',
 'author_flair_text': 'Michael Gallup',
 'author_flair_text_color': 'dark',
 'author_flair_type': 'text',
 'author_fullname': 't2_4uvf9n5y',
 'author_patreon_flair': False,
 'author_premium': False,
 'awarders': [],
 'can_mod_post': False,
 'contest_mode': False,
 'created_utc': 1587105700,
 'domain': 'i.redd.it',
 'full_link': 'https://www.reddit.com/r/cowboys/comments/g2wo5q/fan_art_i_made_of_dak_zeke_coop_by_yung_drxw_on/',
 'gildings': {},
 'id': 'g2wo5q',
 'is_crosspostable': True,
 'is_meta': False,
 'is_original_content': False,
 'is_reddit_media_domain': True,
 'is_robot_indexable': True,
 'is_self': False,
 'is_video': False,
 'link_flair_background_color': '',
 'link_flair_richtext': [],
 'link_flair_text_color': 'dark',
 'link_flair_type': 'text',
 'locked': False,

In [7]:
dal[0]['selftext']

''

In [8]:
dal[999]['title']

'Cowboy Fans Standards Are Higher Than Any Other Team'

In [12]:
dal[999]['selftext']

'The level of play that is needed from our players to deem them “worth it” is ridiculous. I’ve seen people calling Tank, Amari, Dak, and Byron essentially wastes this offseason while they were playing at elite levels in 2019. I really hope the wishes of these fans are met so that they can deal with the suffering of an actually awful team. PAY DAK, AMARI, and BYRON!'

In [65]:
# Save time just in case
ts = time.time()
print(ts)

1587177624.658514


In [59]:
dt.datetime.fromtimestamp(1587177624.658514)

datetime.datetime(2020, 4, 17, 22, 40, 24, 658514)

In [18]:
# Create dataframe for dallas posts
posts = []
for raw_post in dal:
    post = {
        'author': raw_post['author'],
        'title': raw_post['title'],
        # Fall back to an empty string when a post has no 'selftext' key
        'selftext': raw_post.get('selftext', ""),
    }
    posts.append(post)

df_dal = pd.DataFrame(posts)
df_dal

Unnamed: 0,author,title,selftext
0,drewchaiinz,"Fan art I made of Dak, Zeke &amp; COOP by @yun...",
1,StrokinCole,Call me the god of trade downs,
2,classic_hispanic2,What do you guys think of this draft,
3,youngDCmixa,Skins fan coming in peace,Howdy folks! Not trying to ruffle any feathers...
4,goldfishlaboratory,If anyone wants their heart broken...,2016 Divisional Playoff game vs Green Bay is o...
...,...,...,...
995,texastiger1025,Will training camp ever move back home to Texas?,"Seriously, Oxnard is a tradition at this point..."
996,ElizabethAnnWashingt,So according to 105.3 WFAN the deadline to vot...,for those who may not understand the reason th...
997,mhunt2929,Happy Birthday 88,
998,kidofChrist,Only #2?,


In [34]:
# Add dependent variable (1 = cowboys posts, 0 = eagles posts)
df_dal["subreddit"] = 1

In [35]:
df_dal.head()

Unnamed: 0,author,title,selftext,cowboys,subreddit
0,drewchaiinz,"Fan art I made of Dak, Zeke &amp; COOP by @yun...",,1,1
1,StrokinCole,Call me the god of trade downs,,1,1
2,classic_hispanic2,What do you guys think of this draft,,1,1
3,youngDCmixa,Skins fan coming in peace,Howdy folks! Not trying to ruffle any feathers...,1,1
4,goldfishlaboratory,If anyone wants their heart broken...,2016 Divisional Playoff game vs Green Bay is o...,1,1


## Get Eagles Data

In [39]:
# URL for r/eagles
reddit_url_eagles = 'https://api.pushshift.io/reddit/search/submission/?subreddit=eagles&size=1000'

In [40]:
# Pull eagles data from reddit
r_eagles = requests.get(reddit_url_eagles)
# Return json encoded data
eag = r_eagles.json()
# Select data library from json file
eag = eag['data']

In [43]:
len(eag)

1000

In [44]:
eag[0]['title']

'3 tight end set 2020? Bears release Trey Burton'

In [45]:
eag[0]['selftext']

'Tough go in Chicago but he’s a fan favourite #phillyspecial\n\nhttp://www.nfl.com/news/story/0ap3000001109866/article/roundup-bears-cutting-te-trey-burton-after-2-years'

In [46]:
eag[999]['title']

'NFL Game pass is totally free from now until May 31st. What games are people watching?'

In [47]:
eag[999]['selftext']

'I’m currently enjoying the 37-9 beat down of the cowboys, where Kamu took over kicking duties and Doug kept going for 2. Probably going to check out the 2017 bears game, or the 2017 broncos game, before watching some of rookie Wentz.\n\nProbably going to also go back and watch Foles’ 7 TD game. Also watching the Snow Bowl, as that was my girlfriend and I’s first Birds game we watched together, and it will be fun to rewatch it together all these years later.\n\nWhat games would people want to rewatch?'

In [48]:
# Create dataframe for eagles posts
posts = []
for raw_post in eag:
    post = {
        'author': raw_post['author'],
        'title': raw_post['title'],
        # Fall back to an empty string when a post has no 'selftext' key
        'selftext': raw_post.get('selftext', ""),
    }
    posts.append(post)

df_eag = pd.DataFrame(posts)
df_eag

Unnamed: 0,author,title,selftext
0,MikeCloney,3 tight end set 2020? Bears release Trey Burton,Tough go in Chicago but he’s a fan favourite #...
1,5566y,Source: The #Bears are releasing TE Trey Burto...,
2,[deleted],Crazy Thought: RB/QB Cam Newton ?,[deleted]
3,amatom27,"[Schefter] Bears releasing Trey Burton, per so...",
4,[deleted],[Rapoport] Source: The #Bears are releasing TE...,[deleted]
...,...,...,...
995,LumberjackWeezy,WR market is very low right now. Good time to ...,"Historically speaking, rookie WR's don't usual..."
996,BCSinReverse,[Schefter] Former 49ers’ WR Emmanuel Sanders r...,
997,randomphiladelphian,Slay is still trashing his old team,[removed]
998,eaglesnation11,Alshon and a 2nd for OBJ?,Apparently OBJ rumors are rampant again. We wo...


In [49]:
# Add dependent variable (1 = cowboys posts, 0 = eagles posts)
df_eag["subreddit"] = 0

In [50]:
df_eag.head()

Unnamed: 0,author,title,selftext,subreddit
0,MikeCloney,3 tight end set 2020? Bears release Trey Burton,Tough go in Chicago but he’s a fan favourite #...,0
1,5566y,Source: The #Bears are releasing TE Trey Burto...,,0
2,[deleted],Crazy Thought: RB/QB Cam Newton ?,[deleted],0
3,amatom27,"[Schefter] Bears releasing Trey Burton, per so...",,0
4,[deleted],[Rapoport] Source: The #Bears are releasing TE...,[deleted],0


## Merge Cowboys and Eagles data together

In [51]:
# Stack the two dataframes row-wise; ignore_index gives a fresh 0-1999 index
df = pd.concat([df_dal, df_eag], ignore_index=True)

In [52]:
df.head()

Unnamed: 0,author,title,selftext,subreddit
0,drewchaiinz,"Fan art I made of Dak, Zeke &amp; COOP by @yun...",,1
1,StrokinCole,Call me the god of trade downs,,1
2,classic_hispanic2,What do you guys think of this draft,,1
3,youngDCmixa,Skins fan coming in peace,Howdy folks! Not trying to ruffle any feathers...,1
4,goldfishlaboratory,If anyone wants their heart broken...,2016 Divisional Playoff game vs Green Bay is o...,1


In [53]:
df.tail()

Unnamed: 0,author,title,selftext,subreddit
1995,LumberjackWeezy,WR market is very low right now. Good time to ...,"Historically speaking, rookie WR's don't usual...",0
1996,BCSinReverse,[Schefter] Former 49ers’ WR Emmanuel Sanders r...,,0
1997,randomphiladelphian,Slay is still trashing his old team,[removed],0
1998,eaglesnation11,Alshon and a 2nd for OBJ?,Apparently OBJ rumors are rampant again. We wo...,0
1999,whubby777,NFL Game pass is totally free from now until M...,I’m currently enjoying the 37-9 beat down of t...,0


In [54]:
df.dtypes

author       object
title        object
selftext     object
subreddit     int64
dtype: object

In [55]:
df['subreddit'].value_counts()

1    1000
0    1000
Name: subreddit, dtype: int64

In [59]:
# Replace [removed] and [deleted] with empty strings in selftext
# (assign the result back rather than using inplace=True on a column selection)
df["selftext"] = df["selftext"].replace(["[removed]", "[deleted]"], "")

In [60]:
# Concatenate title and selftext to create final independent variable input
df["title_selftext"] = df["title"] + " " + df["selftext"]

In [61]:
df.head()

Unnamed: 0,author,title,selftext,subreddit,title_selftext
0,drewchaiinz,"Fan art I made of Dak, Zeke &amp; COOP by @yun...",,1,"Fan art I made of Dak, Zeke &amp; COOP by @yun..."
1,StrokinCole,Call me the god of trade downs,,1,Call me the god of trade downs
2,classic_hispanic2,What do you guys think of this draft,,1,What do you guys think of this draft
3,youngDCmixa,Skins fan coming in peace,Howdy folks! Not trying to ruffle any feathers...,1,Skins fan coming in peace Howdy folks! Not try...
4,goldfishlaboratory,If anyone wants their heart broken...,2016 Divisional Playoff game vs Green Bay is o...,1,If anyone wants their heart broken... 2016 Div...


In [62]:
df.tail()

Unnamed: 0,author,title,selftext,subreddit,title_selftext
1995,LumberjackWeezy,WR market is very low right now. Good time to ...,"Historically speaking, rookie WR's don't usual...",0,WR market is very low right now. Good time to ...
1996,BCSinReverse,[Schefter] Former 49ers’ WR Emmanuel Sanders r...,,0,[Schefter] Former 49ers’ WR Emmanuel Sanders r...
1997,randomphiladelphian,Slay is still trashing his old team,,0,Slay is still trashing his old team
1998,eaglesnation11,Alshon and a 2nd for OBJ?,Apparently OBJ rumors are rampant again. We wo...,0,Alshon and a 2nd for OBJ? Apparently OBJ rumor...
1999,whubby777,NFL Game pass is totally free from now until M...,I’m currently enjoying the 37-9 beat down of t...,0,NFL Game pass is totally free from now until M...


In [67]:
df = df[["author", "title", "selftext", "title_selftext", "subreddit"]]

In [68]:
df.head()

Unnamed: 0,author,title,selftext,title_selftext,subreddit
0,drewchaiinz,"Fan art I made of Dak, Zeke &amp; COOP by @yun...",,"Fan art I made of Dak, Zeke &amp; COOP by @yun...",1
1,StrokinCole,Call me the god of trade downs,,Call me the god of trade downs,1
2,classic_hispanic2,What do you guys think of this draft,,What do you guys think of this draft,1
3,youngDCmixa,Skins fan coming in peace,Howdy folks! Not trying to ruffle any feathers...,Skins fan coming in peace Howdy folks! Not try...,1
4,goldfishlaboratory,If anyone wants their heart broken...,2016 Divisional Playoff game vs Green Bay is o...,If anyone wants their heart broken... 2016 Div...,1


In [69]:
# Save raw data to csv
df.to_csv("../data/raw_data_all.csv", index=False)

In [70]:
# Read in file to make sure it was saved correctly
pd.read_csv("../data/raw_data_all.csv")

Unnamed: 0,author,title,selftext,title_selftext,subreddit
0,drewchaiinz,"Fan art I made of Dak, Zeke &amp; COOP by @yun...",,"Fan art I made of Dak, Zeke &amp; COOP by @yun...",1
1,StrokinCole,Call me the god of trade downs,,Call me the god of trade downs,1
2,classic_hispanic2,What do you guys think of this draft,,What do you guys think of this draft,1
3,youngDCmixa,Skins fan coming in peace,Howdy folks! Not trying to ruffle any feathers...,Skins fan coming in peace Howdy folks! Not try...,1
4,goldfishlaboratory,If anyone wants their heart broken...,2016 Divisional Playoff game vs Green Bay is o...,If anyone wants their heart broken... 2016 Div...,1
...,...,...,...,...,...
1995,LumberjackWeezy,WR market is very low right now. Good time to ...,"Historically speaking, rookie WR's don't usual...",WR market is very low right now. Good time to ...,0
1996,BCSinReverse,[Schefter] Former 49ers’ WR Emmanuel Sanders r...,,[Schefter] Former 49ers’ WR Emmanuel Sanders r...,0
1997,randomphiladelphian,Slay is still trashing his old team,,Slay is still trashing his old team,0
1998,eaglesnation11,Alshon and a 2nd for OBJ?,Apparently OBJ rumors are rampant again. We wo...,Alshon and a 2nd for OBJ? Apparently OBJ rumor...,0


# Data Cleaning
***Note that almost all of the code in this section borrows heavily from Matt Brems's NLP lecture code***

In [3]:
df = pd.read_csv("../data/raw_data_all.csv")

In [4]:
df.isnull().sum()

author               0
title                0
selftext          1530
title_selftext       0
subreddit            0
dtype: int64

## Create train test split

In [5]:
df.columns

Index(['author', 'title', 'selftext', 'title_selftext', 'subreddit'], dtype='object')

In [6]:
X_train, X_test, y_train, y_test = train_test_split(df[['author', 'title_selftext']],
                                                    df['subreddit'],
                                                    test_size = 0.3,
                                                    random_state = 42, 
                                                    stratify=df['subreddit'])

In [7]:
X_train.head()

Unnamed: 0,author,title_selftext
1593,Time-Ambassador,What about Greg Ward? This sub has been adaman...
785,Conety,And this is why we didn’t pay him. Asking wayy...
729,Badlands32,Defense is suddenly bottom of the league in a ...
403,PersonBehindAScreen,[Rodriguez] DeMarcus Lawrence Biggest Winner f...
890,teo1315,"Cowboys need TE help, would you prefer the tea..."


In [7]:
y_train.value_counts()

1    700
0    700
Name: subreddit, dtype: int64

In [8]:
y_test.value_counts()

1    300
0    300
Name: subreddit, dtype: int64

## Remove HTML code artifacts
There actually might not be any in here, but better safe than sorry.

In [266]:
X_train["title_selftext"][515]

'2014 Demarco Murray So because gamepass is now free, I’ve been rewatching a lot of 2014 highlights just because of how great and enjoyable that season was and I have to say.....man was Demarco a beast that year. I think people on this sub forget him too much at times because of the injury issues and how soon Zeke came after him, but I’ll take his 2014 season (minus all those fumbles) up against any season Zeke has had thus far in his career (although I’d still take Zeke as a player because of the consistency).\n\nAlso, that offense looked damn near unstoppable at times. Everyone was at their peak from romo to the offensive line to dez, Murray, Beasley and the other skill guys (even Dunbar had a solid role on that team on third downs). Hell, even the body catcher was good as a #2 WR. \n\nDamn, I miss that team. If only the defense was a bit better at that point in time (or the refs could call a catch correctly). We would’ve gone to Seattle and beaten them again.\n\nWould we have beaten

In [267]:
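# Initialize the BeautifulSoup object on a single reddit entry
# (naming the parser explicitly avoids the "no parser was explicitly specified" warning)
example1 = BeautifulSoup(X_train['title_selftext'][515], "html.parser")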

# Print the raw reddit post and then the output of get_text(), for 
# comparison
print(X_train['title_selftext'][515])
print()
print(example1.get_text())

2014 Demarco Murray So because gamepass is now free, I’ve been rewatching a lot of 2014 highlights just because of how great and enjoyable that season was and I have to say.....man was Demarco a beast that year. I think people on this sub forget him too much at times because of the injury issues and how soon Zeke came after him, but I’ll take his 2014 season (minus all those fumbles) up against any season Zeke has had thus far in his career (although I’d still take Zeke as a player because of the consistency).

Also, that offense looked damn near unstoppable at times. Everyone was at their peak from romo to the offensive line to dez, Murray, Beasley and the other skill guys (even Dunbar had a solid role on that team on third downs). Hell, even the body catcher was good as a #2 WR. 

Damn, I miss that team. If only the defense was a bit better at that point in time (or the refs could call a catch correctly). We would’ve gone to Seattle and beaten them again.

Would we have beaten the Pa

## Remove characters other than letters, numbers, and #
Numbers are kept for now, since player numbers may come into play here; this choice may need to be reconsidered.

In [278]:
# Use regular expressions to do a find-and-replace
# In this case, the # symbol may be important to keep because #2 WR is a thing in football
okchars_only = re.sub(r"[^a-zA-Z0-9#]",      # The pattern to search for - anything except letters, numbers and #
                      " ",                   # The string to replace it with
                      example1.get_text())   # The text to search

In [279]:
okchars_only

'2014 Demarco Murray So because gamepass is now free  I ve been rewatching a lot of 2014 highlights just because of how great and enjoyable that season was and I have to say     man was Demarco a beast that year  I think people on this sub forget him too much at times because of the injury issues and how soon Zeke came after him  but I ll take his 2014 season  minus all those fumbles  up against any season Zeke has had thus far in his career  although I d still take Zeke as a player because of the consistency    Also  that offense looked damn near unstoppable at times  Everyone was at their peak from romo to the offensive line to dez  Murray  Beasley and the other skill guys  even Dunbar had a solid role on that team on third downs   Hell  even the body catcher was good as a #2 WR    Damn  I miss that team  If only the defense was a bit better at that point in time  or the refs could call a catch correctly   We would ve gone to Seattle and beaten them again   Would we have beaten the P

In [40]:
# Convert okchars_only to lower case.
lower_case = okchars_only.lower()

# Split lower_case up at each space.
words = lower_case.split()

In [41]:
lower_case

'2014 demarco murray so because gamepass is now free  i ve been rewatching a lot of 2014 highlights just because of how great and enjoyable that season was and i have to say     man was demarco a beast that year  i think people on this sub forget him too much at times because of the injury issues and how soon zeke came after him  but i ll take his 2014 season  minus all those fumbles  up against any season zeke has had thus far in his career  although i d still take zeke as a player because of the consistency    also  that offense looked damn near unstoppable at times  everyone was at their peak from romo to the offensive line to dez  murray  beasley and the other skill guys  even dunbar had a solid role on that team on third downs   hell  even the body catcher was good as a #2 wr    damn  i miss that team  if only the defense was a bit better at that point in time  or the refs could call a catch correctly   we would ve gone to seattle and beaten them again   would we have beaten the p

In [42]:
words

['2014',
 'demarco',
 'murray',
 'so',
 'because',
 'gamepass',
 'is',
 'now',
 'free',
 'i',
 've',
 'been',
 'rewatching',
 'a',
 'lot',
 'of',
 '2014',
 'highlights',
 'just',
 'because',
 'of',
 'how',
 'great',
 'and',
 'enjoyable',
 'that',
 'season',
 'was',
 'and',
 'i',
 'have',
 'to',
 'say',
 'man',
 'was',
 'demarco',
 'a',
 'beast',
 'that',
 'year',
 'i',
 'think',
 'people',
 'on',
 'this',
 'sub',
 'forget',
 'him',
 'too',
 'much',
 'at',
 'times',
 'because',
 'of',
 'the',
 'injury',
 'issues',
 'and',
 'how',
 'soon',
 'zeke',
 'came',
 'after',
 'him',
 'but',
 'i',
 'll',
 'take',
 'his',
 '2014',
 'season',
 'minus',
 'all',
 'those',
 'fumbles',
 'up',
 'against',
 'any',
 'season',
 'zeke',
 'has',
 'had',
 'thus',
 'far',
 'in',
 'his',
 'career',
 'although',
 'i',
 'd',
 'still',
 'take',
 'zeke',
 'as',
 'a',
 'player',
 'because',
 'of',
 'the',
 'consistency',
 'also',
 'that',
 'offense',
 'looked',
 'damn',
 'near',
 'unstoppable',
 'at',
 'times',
 'ev

In [43]:
# Consider lemmatizing the tokens
# WordNetLemmatizer treats every token as a noun by default, which is why 'was' becomes 'wa' below
lemmatizer = WordNetLemmatizer()
tokens_lem = [lemmatizer.lemmatize(i) for i in words]
# Compare original and lemmatized tokens
list(zip(words, tokens_lem))

[('2014', '2014'),
 ('demarco', 'demarco'),
 ('murray', 'murray'),
 ('so', 'so'),
 ('because', 'because'),
 ('gamepass', 'gamepass'),
 ('is', 'is'),
 ('now', 'now'),
 ('free', 'free'),
 ('i', 'i'),
 ('ve', 've'),
 ('been', 'been'),
 ('rewatching', 'rewatching'),
 ('a', 'a'),
 ('lot', 'lot'),
 ('of', 'of'),
 ('2014', '2014'),
 ('highlights', 'highlight'),
 ('just', 'just'),
 ('because', 'because'),
 ('of', 'of'),
 ('how', 'how'),
 ('great', 'great'),
 ('and', 'and'),
 ('enjoyable', 'enjoyable'),
 ('that', 'that'),
 ('season', 'season'),
 ('was', 'wa'),
 ('and', 'and'),
 ('i', 'i'),
 ('have', 'have'),
 ('to', 'to'),
 ('say', 'say'),
 ('man', 'man'),
 ('was', 'wa'),
 ('demarco', 'demarco'),
 ('a', 'a'),
 ('beast', 'beast'),
 ('that', 'that'),
 ('year', 'year'),
 ('i', 'i'),
 ('think', 'think'),
 ('people', 'people'),
 ('on', 'on'),
 ('this', 'this'),
 ('sub', 'sub'),
 ('forget', 'forget'),
 ('him', 'him'),
 ('too', 'too'),
 ('much', 'much'),
 ('at', 'at'),
 ('times', 'time'),
 ('because',

## Create and remove stopwords

In [9]:
# Print English stopwords.
print(stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [133]:
my_stop_words = stopwords.words("english") + [
    "dallas",
    "cowboys",
    "cowboy",
    "philadelphia",
    "eagles",
    "philly",
    "eagle",
    "phi",
    "dal",
    "dallascowboys",
    "boys",
    "birds",
    "texas",
    "app",
    "http",
    "https",
    "etc",
    "imgur",
    "www",
    "com",
    "png",
    "auto",
    "width",
    "reddit",
    "x200b",
    "2qgz89t",
    "2z11chhwhq99x5sashj6fb5xi6wir5t",
    "zeke",
    "dak",
    "prescott",
    "romo",
    "tony",
    "carson",
    "wentz",
    "dez",
    "bryant",
    "demarco",
    "murray",
    "cole",
    "beasley",
    "aldon",
    "smith",
    "amari",
    "cooper",
    "alshon",
    "jeffery",
    "anthony",
    "brown", 
    "chidobe",
    "awuzie",
    "avonte",
    "maddox",
    "blake",
    "jarwin",
    "brian",
    "dawkins",
    "byron",
    "jones",
    "cam",
    "fleming",
    "haha",
    "ha", 
    "ha-ha",
    "clinton",
    "clinton-dix",
    "dix",
    "rush",
    "darius",
    "slay",
    "david",
    "irving",
    "demarcus",
    "lawrence",
    "robinson",
    "desean",
    "jackson",
    "donovan",
    "mcnabb",
    "dontari",
    "poe",
    "dorsett",
    "ezekiel",
    "elliott",
    "jake",
    "zach",
    "ertz",
    "leighton",
    "van",
    "der",
    "esch",
    "fletcher",
    "cox",
    "nick",
    "foles",
    "kai",
    "forbath",
    "travis",
    "frederick",
    "jason",
    "garrett",
    "witten",
    "kelce",
    "peters",
    "gerald",
    "mccoy",
    "howie",
    "roseman",
    "jeff",
    "heath",
    "tim",
    "jernigan",
    "jerry",
    "jim",
    "schwartz",
    "joe",
    "looney",
    "jordan",
    "matthews",
    "jp",
    "ladouceur",
    "lane",
    "johnson",
    "lesean",
    "mccoy",
    "maliek",
    "collins",
    "mike",
    "mccarthy",
    "michael",
    "gallup",
    "miles",
    "sanders",
    "nelson",
    "agholor",
    "nickell",
    "robey",
    "coleman",
    "robert",
    "quinn",
    "randall",
    "cobb",
    "randy",
    "gregory",
    "sean",
    "lee",
    "roger",
    "staubach",
    "stephen",
    "tank",
    "troy",
    "aikman",
    "tyron",
    "smith",
    "vick",
    "whiteside",
    "zack",
    "martin",
    "zuerlein",
    "andre",
    "dillard",
    "boston",
    "scott",
    "brandon",
    "graham",
    "cameron",
    "chido",
    "cunningham",
    "curtis",
    "samuel",
    "darian",
    "thompson",
    "jeremy",
    "maclin",
    "kerry",
    "hyder",
    "rasul",
    "douglas",
    "tavon",
    "austin",
    "timmy",
    "malcolm",
    "jenkins",
    "jalen",
    "mills",
    "greg",
    "javon",
    "hargrave",
    "kellen",
    "moore",
    "dlaw"
]
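
A hand-built list this long is easy to get slightly wrong, so a quick audit can help. The sketch below flags repeated entries (such as "mccoy" and "smith") and entries that can never equal a token, assuming the same character filter as the exploratory cleaning cell above, which replaces everything except letters, numbers, and # with a space. Under that assumption, hyphenated entries such as "ha-ha" and "clinton-dix" are split before stopword removal. The function name and the short sample list are illustrative only.

```python
import re
from collections import Counter


def audit_stop_words(stop_words, allowed=r"[a-z0-9#]+"):
    """Return (duplicates, unmatchable) for a list of stop words."""
    counts = Counter(stop_words)
    # Entries listed more than once (harmless, but redundant)
    duplicates = sorted(w for w, n in counts.items() if n > 1)
    # Entries containing characters that the cleaning step would have removed
    unmatchable = sorted(w for w in counts if not re.fullmatch(allowed, w))
    return duplicates, unmatchable


# Small sample drawn from the custom list above
sample = ["mccoy", "smith", "ha-ha", "clinton-dix", "mccoy", "smith", "dix"]
duplicates, unmatchable = audit_stop_words(sample)
print(duplicates)
print(unmatchable)
```

Running `audit_stop_words(my_stop_words)` on the full list would also flag NLTK entries that contain apostrophes, such as "you're", for the same reason.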

In [134]:
my_stop_words

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [332]:
# Remove stopwords from "words."
words = [w for w in words if w not in my_stop_words]

In [333]:
words

['2014',
 'gamepass',
 'free',
 'rewatching',
 'lot',
 '2014',
 'highlights',
 'great',
 'enjoyable',
 'season',
 'say',
 'man',
 'beast',
 'year',
 'think',
 'people',
 'sub',
 'forget',
 'much',
 'times',
 'injury',
 'issues',
 'soon',
 'came',
 'take',
 '2014',
 'season',
 'minus',
 'fumbles',
 'season',
 'thus',
 'far',
 'career',
 'although',
 'still',
 'take',
 'player',
 'consistency',
 'also',
 'offense',
 'looked',
 'damn',
 'near',
 'unstoppable',
 'times',
 'everyone',
 'peak',
 'offensive',
 'line',
 'skill',
 'guys',
 'even',
 'dunbar',
 'solid',
 'role',
 'team',
 'third',
 'downs',
 'hell',
 'even',
 'body',
 'catcher',
 'good',
 '#2',
 'wr',
 'damn',
 'miss',
 'team',
 'defense',
 'bit',
 'better',
 'point',
 'time',
 'refs',
 'could',
 'call',
 'catch',
 'correctly',
 'would',
 'gone',
 'seattle',
 'beaten',
 'would',
 'beaten',
 'pats',
 'probably',
 'defense',
 'stop',
 'brady',
 'damn',
 'would',
 'loved',
 'see',
 'offense',
 'go',
 'bellichek']

## Combine into single function

### Function without lemmatizer included

In [9]:
# Adapted from Matt Brems's NLP lecture code
# This version does NOT include a lemmatizer
def reddit_to_words(raw_post, stop_words):
    # Function to convert a raw reddit post to a string of words
    # The input is a single string (a raw reddit post), and 
    # the output is a single string (a preprocessed reddit post)
    
    # 1. Remove HTML. There may not be any in these posts, but just in case.
    # Naming the parser explicitly avoids bs4's "no parser specified" warning.
    post_text = BeautifulSoup(raw_post, "html.parser").get_text()
    
    # 2. Remove non-letters/non-numbers, but keep # character
    okchars_only = re.sub(r"[^a-zA-Z0-9#]", " ", post_text)
    
    # 3. Convert to lower case, split into individual words.
    words = okchars_only.lower().split()
    
    # 4. In Python, searching a set is much faster than searching
    # a list, so convert the stopwords to a set.
    stops = set(stop_words)
    
    # 5. Remove stopwords.
    meaningful_words = [w for w in words if w not in stops]
    
    # 6. Join the words back into one string separated by space, 
    # and return the result.
    return " ".join(meaningful_words)

### Function without lemmatizer, also excludes numbers and special characters

In [10]:
# Adapted from Matt Brems's NLP lecture code
# This version does NOT include a lemmatizer
def reddit_to_words_nonum(raw_post, stop_words):
    # Function to convert a raw reddit post to a string of words
    # The input is a single string (a raw reddit post), and 
    # the output is a single string (a preprocessed reddit post)
    
    # 1. Remove HTML. There may not be any in these posts, but just in case.
    # Naming the parser explicitly avoids bs4's "no parser specified" warning.
    post_text = BeautifulSoup(raw_post, "html.parser").get_text()
    
    # 2. Remove everything except letters (numbers and the # character are dropped too)
    okchars_only = re.sub(r"[^a-zA-Z]", " ", post_text)
    
    # 3. Convert to lower case, split into individual words.
    words = okchars_only.lower().split()
    
    # 4. In Python, searching a set is much faster than searching
    # a list, so convert the stopwords to a set.
    stops = set(stop_words)
    
    # 5. Remove stopwords.
    meaningful_words = [w for w in words if w not in stops]
    
    # 6. Join the words back into one string separated by space, 
    # and return the result.
    return " ".join(meaningful_words)

### Function with lemmatizer that also removes numbers and # sign

In [11]:
# Adapted from Matt Brems's NLP lecture code
# This version INCLUDES a lemmatizer
def reddit_to_words_wlem(raw_post, stop_words):
    # Function to convert a raw reddit post to a string of words
    # The input is a single string (a raw reddit post), and 
    # the output is a single string (a preprocessed reddit post)
    
    # 1. Remove HTML. There may not be any in these posts, but just in case.
    # Naming the parser explicitly avoids bs4's "no parser specified" warning.
    post_text = BeautifulSoup(raw_post, "html.parser").get_text()
    
    # 2. Remove everything except letters (numbers and the # character are dropped too)
    okchars_only = re.sub(r"[^a-zA-Z]", " ", post_text)
    
    # 3. Convert to lower case, split into individual words.
    words = okchars_only.lower().split()
    
    # 4. Lemmatize words
    # Note: because lemmatizing happens before stopword removal, stopwords are
    # matched in their lemmatized form (e.g. "does" becomes "doe" and is kept).
    lemmatizer = WordNetLemmatizer()
    tokens_lem = [lemmatizer.lemmatize(i) for i in words]
    
    # 5. In Python, searching a set is much faster than searching
    # a list, so convert the stopwords to a set.
    stops = set(stop_words)
    
    # 6. Remove stopwords.
    meaningful_words = [w for w in tokens_lem if w not in stops]
    
    # 7. Join the words back into one string separated by space, 
    # and return the result.
    return " ".join(meaningful_words)
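
The three functions above differ only in which characters they keep and whether they lemmatize. As a rough sketch (not the notebook's actual code), they could be folded into one function with those two choices as parameters. The name `clean_post` and its `keep_pattern` and `lemmatize` arguments are hypothetical, and the HTML-stripping step is left out so the example runs with the standard library alone.

```python
import re


def clean_post(raw_post, stop_words, keep_pattern=r"[^a-zA-Z]", lemmatize=None):
    """Lower-case, tokenize, optionally lemmatize, and drop stop words.

    keep_pattern: regex matching the characters to replace with a space.
    lemmatize: optional callable mapping one token to its lemma
               (for example WordNetLemmatizer().lemmatize).
    """
    # Replace every unwanted character with a space.
    text = re.sub(keep_pattern, " ", raw_post)
    # Lower-case and split on whitespace.
    tokens = text.lower().split()
    # Lemmatize only when a lemmatizer was supplied.
    if lemmatize is not None:
        tokens = [lemmatize(t) for t in tokens]
    # Set membership is faster than list membership.
    stops = set(stop_words)
    return " ".join(t for t in tokens if t not in stops)


example = "The Eagles traded UP 2 spots in Round #1!"
stops = ["the", "in", "eagles"]

# Letters only (like reddit_to_words_nonum).
letters_only = clean_post(example, stops)
# Letters, digits, and '#' (like reddit_to_words).
with_numbers = clean_post(example, stops, keep_pattern=r"[^a-zA-Z0-9#]")

print(letters_only)
print(with_numbers)
```

Passing the lemmatizer in as a callable keeps the sketch free of an NLTK dependency; in the notebook the natural argument would be `WordNetLemmatizer().lemmatize`.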

## Apply function

In [12]:
# Get the number of posts in the dataframe.
total_posts = df.shape[0]
print(f'There are {total_posts} posts in the dataframe.')

There are 2000 posts in the dataframe.


### Try un-lemmatized version first

In [116]:
# Initialize an empty list to hold the clean posts.
clean_train_posts = []
clean_test_posts = []

print("Cleaning and parsing the training set reddit posts...")

# Instantiate counter.
j = 0

# For every post in our training set...
for train_post in X_train['title_selftext']:
    
    # Convert post to words, then append to clean_train_posts.
    clean_train_posts.append(reddit_to_words_nonum(train_post, my_stop_words))
    
    # If the index is divisible by 100, print a message.
    if (j + 1) % 100 == 0:
        print(f'Review {j + 1} of {total_posts}.')
    
    j += 1

# Let's do the same for our testing set.
print("Cleaning and parsing the testing set reddit posts...")

# For every post in our testing set...
for test_post in X_test['title_selftext']:
    
    # Convert post to words, then append to clean_test_posts.
    clean_test_posts.append(reddit_to_words_nonum(test_post, my_stop_words))
    
    # If the index is divisible by 100, print a message.
    if (j + 1) % 100 == 0:
        print(f'Review {j + 1} of {total_posts}.')
        
    j += 1

Cleaning and parsing the training set reddit posts...
Review 100 of 2000.
Review 200 of 2000.
Review 300 of 2000.
Review 400 of 2000.
Review 500 of 2000.
Review 600 of 2000.
Review 700 of 2000.
Review 800 of 2000.
Review 900 of 2000.
Review 1000 of 2000.
Review 1100 of 2000.
Review 1200 of 2000.
Review 1300 of 2000.
Review 1400 of 2000.
Cleaning and parsing the testing set reddit posts...
Review 1500 of 2000.
Review 1600 of 2000.
Review 1700 of 2000.
Review 1800 of 2000.
Review 1900 of 2000.
Review 2000 of 2000.
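
The train and test loops above repeat the same pattern. A minimal sketch of a shared helper follows, assuming a hypothetical function name `clean_all` and a `cleaner` callable; it uses `enumerate` rather than a manual counter and reports progress per data set.

```python
def clean_all(posts, cleaner, label, report_every=100):
    """Apply `cleaner` to every post, printing progress every `report_every` posts."""
    cleaned = []
    total = len(posts)
    for i, post in enumerate(posts, start=1):
        cleaned.append(cleaner(post))
        if i % report_every == 0:
            print(f"{label}: post {i} of {total}.")
    return cleaned


# Toy data standing in for X_train['title_selftext'].
toy_posts = ["Draft Day!", "Who do we pick?", "Trade back"]
toy_cleaned = clean_all(toy_posts, lambda p: p.lower(), "toy", report_every=2)
print(toy_cleaned)
```

In the notebook, the `cleaner` argument could be something like `lambda p: reddit_to_words_nonum(p, my_stop_words)`.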


In [117]:
clean_train_posts[250]

'case drafting reagor reagor talented explosive collegiate playmaker hope target first round trade back scenario reagor blessed incredible athletic traits truly special athlete reaches top speed incredibly quickly change direction instant potential rac superstar aggressive ball carrier attacks defenders strong compliment power finesse moves studying reagor games saw player extreme boom bust potential player raw athletic gifts incredible ball carrier vision also significant flaws game although talented receiver belong second tier jefferson mims opposed drafting first wide receiver reagor athletic traits project favourable scheme fit yet believe reagor would wasted pick natural hands catcher double catches straight drops passes consistently drops concentration drops others occur already thinking running secured catch drops passes three levels naturally track ball shoulder uses pads help make shoulder catches instead relying hands shows lack concentration fighting contact traffic reagor h

### Now try lemmatized version

In [135]:
# Initialize an empty list to hold the clean posts.
clean_train_posts_lem = []
clean_test_posts_lem = []

print("Cleaning and parsing the training set reddit posts...")

# Instantiate counter.
j = 0

# For every post in our training set...
for train_post in X_train['title_selftext']:
    
    # Convert post to words, then append to clean_train_posts_lem.
    clean_train_posts_lem.append(reddit_to_words_wlem(train_post, my_stop_words))
    
    # If the index is divisible by 100, print a message.
    if (j + 1) % 100 == 0:
        print(f'Review {j + 1} of {total_posts}.')
    
    j += 1

# Let's do the same for our testing set.
print("Cleaning and parsing the testing set reddit posts...")

# For every post in our testing set...
for test_post in X_test['title_selftext']:
    
    # Convert post to words, then append to clean_test_posts_lem.
    clean_test_posts_lem.append(reddit_to_words_wlem(test_post, my_stop_words))
    
    # If the index is divisible by 100, print a message.
    if (j + 1) % 100 == 0:
        print(f'Review {j + 1} of {total_posts}.')
        
    j += 1

Cleaning and parsing the training set reddit posts...
Review 100 of 2000.
Review 200 of 2000.
Review 300 of 2000.
Review 400 of 2000.
Review 500 of 2000.
Review 600 of 2000.
Review 700 of 2000.
Review 800 of 2000.
Review 900 of 2000.
Review 1000 of 2000.
Review 1100 of 2000.
Review 1200 of 2000.
Review 1300 of 2000.
Review 1400 of 2000.
Cleaning and parsing the testing set reddit posts...
Review 1500 of 2000.
Review 1600 of 2000.
Review 1700 of 2000.
Review 1800 of 2000.
Review 1900 of 2000.
Review 2000 of 2000.


In [292]:
clean_train_posts_lem[250]

'case drafting reagor reagor talented explosive collegiate playmaker hope target first round trade back scenario reagor blessed incredible athletic trait truly special athlete reach top speed incredibly quickly change direction instant potential rac superstar aggressive ball carrier attack defender strong compliment power finesse move studying reagor game saw player extreme boom bust potential player raw athletic gift incredible ball carrier vision also significant flaw game although talented receiver doe belong second tier jefferson mims opposed drafting first wide receiver reagor athletic trait project favourable scheme fit yet believe reagor would wasted pick natural hand catcher double catch straight drop pass consistently drop concentration drop others occur already thinking running secured catch drop pass three level naturally track ball shoulder us pad help make shoulder catch instead relying hand show lack concentration fighting contact traffic reagor hand give confidence proje

# Modeling

## Logistic Regression Classifier

### First try using count vectorization

In [136]:
pipe_cvec = Pipeline([
    ('cvec', CountVectorizer(min_df=2, stop_words=my_stop_words)),
    ('ss', StandardScaler(with_mean=False)),
    ('lr', LogisticRegression(penalty='l1', solver='saga', random_state=42, max_iter=5000))
])

In [137]:
pipe_cvec_params = {
    'cvec__max_features': [750, 1000, 1250],
    'cvec__ngram_range': [(1,1), (1,2)],
#     'cvec__min_df': [1, 2, 5],
    'lr__C': [.1, .35, .5]
}
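
As a quick check on the size of this search, the grid can be enumerated by hand. The sketch below assumes 5-fold cross-validation (the `cv=5` used for the grid search) and counts only the cross-validation fits, not the final refit of the best candidate.

```python
from itertools import product

# Same values as pipe_cvec_params (the commented-out min_df entry is excluded).
param_grid = {
    "cvec__max_features": [750, 1000, 1250],
    "cvec__ngram_range": [(1, 1), (1, 2)],
    "lr__C": [0.1, 0.35, 0.5],
}
n_folds = 5

# Every combination of one value per hyperparameter is one candidate.
candidates = list(product(*param_grid.values()))
n_candidates = len(candidates)
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)
```

Adding a fourth hyperparameter with two values would double both numbers, which is worth keeping in mind before widening the grid.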

In [138]:
# Instantiate grid search
gs = GridSearchCV(pipe_cvec, 
                  param_grid=pipe_cvec_params, 
                  cv=5, 
                  n_jobs=-1,
                  verbose=1)

# Fit grid search to training data
gs.fit(clean_train_posts_lem, y_train)


Fitting 5 folds for each of 18 candidates, totalling 90 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:   13.1s
[Parallel(n_jobs=-1)]: Done  90 out of  90 | elapsed:  1.2min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('cvec',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=2,
                                                        ngram_range=(1, 1),
                                                        prep

In [139]:
gs.best_params_

{'cvec__max_features': 1000, 'cvec__ngram_range': (1, 1), 'lr__C': 0.35}

In [140]:
# Best score
print('Best CV score', gs.best_score_)
print('Best train score', gs.score(clean_train_posts_lem, y_train))
print('Best test score', gs.score(clean_test_posts_lem, y_test))

Best CV score 0.6564285714285715
Best train score 0.8957142857142857
Best test score 0.6533333333333333


In [141]:
c_vectorizer = CountVectorizer(max_features=1000, ngram_range=(1,1), min_df=2, stop_words=my_stop_words)

In [142]:
train_data_features_c = c_vectorizer.fit_transform(clean_train_posts_lem)

test_data_features_c = c_vectorizer.transform(clean_test_posts_lem)

In [143]:
print(train_data_features_c.shape)

(1400, 1000)


In [144]:
print(test_data_features_c.shape)

(600, 1000)
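
Both matrices have the same 1000 columns because the vocabulary is learned from the training posts only and then reused for the test posts. The sketch below is a simplified, standard-library stand-in for that fit/transform split, not `CountVectorizer`'s actual implementation; the helper names `fit_vocabulary` and `transform` are hypothetical.

```python
from collections import Counter


def fit_vocabulary(docs, min_df=1, max_features=None):
    """Pick a vocabulary from whitespace-tokenized docs.

    A term must appear in at least `min_df` documents. If `max_features`
    is set, only the most frequent terms (by total count) are kept.
    """
    doc_freq = Counter()
    term_freq = Counter()
    for doc in docs:
        tokens = doc.split()
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    kept = [t for t in term_freq if doc_freq[t] >= min_df]
    # Most frequent first, alphabetical as a tie-breaker.
    kept.sort(key=lambda t: (-term_freq[t], t))
    if max_features is not None:
        kept = kept[:max_features]
    return sorted(kept)


def transform(docs, vocab):
    """Return one row of term counts per document, columns ordered as `vocab`."""
    index = {t: i for i, t in enumerate(vocab)}
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for tok in doc.split():
            if tok in index:  # words unseen during fitting are ignored
                row[index[tok]] += 1
        rows.append(row)
    return rows


train_docs = ["draft pick trade", "draft pick pick", "trade rumor"]
test_docs = ["draft rumor mill"]

vocab = fit_vocabulary(train_docs, min_df=2)
train_counts = transform(train_docs, vocab)
test_counts = transform(test_docs, vocab)

print(vocab)
print(train_counts)
print(test_counts)
```

Here "rumor" is dropped by `min_df=2` and "mill" never appears in training, so neither contributes a column to the test row.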


In [145]:
c_vocab = c_vectorizer.get_feature_names()
print(c_vocab)

['aav', 'ability', 'able', 'absolute', 'absolutely', 'according', 'act', 'actually', 'adam', 'adamant', 'add', 'adding', 'address', 'age', 'agency', 'agent', 'ago', 'agree', 'agreed', 'agreement', 'ahead', 'aiyuk', 'aj', 'alabama', 'allowed', 'almost', 'along', 'already', 'also', 'although', 'always', 'amazing', 'among', 'amount', 'anderson', 'andrew', 'another', 'anybody', 'anyone', 'anything', 'anywhere', 'appreciate', 'april', 'arcega', 'archer', 'arizona', 'around', 'article', 'ask', 'asking', 'athletic', 'attack', 'attempt', 'authentic', 'available', 'average', 'away', 'awesome', 'back', 'backup', 'bad', 'ball', 'barnett', 'base', 'based', 'basically', 'baun', 'baylor', 'bear', 'beat', 'became', 'become', 'becomes', 'begin', 'beginning', 'behind', 'believe', 'best', 'better', 'big', 'biggest', 'bill', 'bird', 'birthday', 'bit', 'blocker', 'blocking', 'blogging', 'board', 'boise', 'bonus', 'bored', 'bottom', 'boundary', 'bowl', 'box', 'boy', 'bpa', 'brady', 'brandin', 'break', 'bre

In [146]:
# Scale data for use in regularized logistic regression model
ss = StandardScaler(with_mean=False)

X_train_sc_c = ss.fit_transform(train_data_features_c)
X_test_sc_c = ss.transform(test_data_features_c)

In [148]:
# Instantiate logistic regression model.
lr_c = LogisticRegression(penalty='l1', 
                          solver = 'saga', 
                          C=0.35, 
                          random_state=42, 
                          max_iter=5000,
                          n_jobs=-1,
                          verbose=1)

# Fit model to training data.
lr_c.fit(X_train_sc_c, y_train)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 16 concurrent workers.


convergence after 3098 epochs took 12 seconds


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:   11.9s finished


LogisticRegression(C=0.35, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=5000,
                   multi_class='auto', n_jobs=-1, penalty='l1', random_state=42,
                   solver='saga', tol=0.0001, verbose=1, warm_start=False)

In [149]:
# Evaluate model on cross-validated data
c_cv_mean = cross_val_score(lr_c, X_train_sc_c, y_train, cv=5).mean()
print(c_cv_mean)

# Evaluate model on training data.
train_acc_c = lr_c.score(X_train_sc_c, y_train)
print(train_acc_c)

# Evaluate model on testing data.
test_acc_c = lr_c.score(X_test_sc_c, y_test)
print(test_acc_c)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    9.3s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    9.8s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    9.2s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:   11.0s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.


0.6649999999999999
0.8957142857142857
0.6533333333333333


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:   11.5s finished


In [150]:
# Predict test data
y_pred_c = lr_c.predict(X_test_sc_c)

In [151]:
# Predict test data probabilities
y_probs_c = lr_c.predict_proba(X_test_sc_c)
y_probs_c

array([[9.99816394e-01, 1.83605956e-04],
       [8.90471879e-01, 1.09528121e-01],
       [1.26393787e-01, 8.73606213e-01],
       ...,
       [6.90692697e-06, 9.99993093e-01],
       [9.77582493e-01, 2.24175068e-02],
       [6.55762097e-01, 3.44237903e-01]])

In [152]:
# Create confusion matrix
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_c).ravel()
print(confusion_matrix(y_test, y_pred_c))

[[176 124]
 [ 84 216]]


In [153]:
# Compute sensitivity & specificity
sensitivity_c = tp / (tp + fn)
specificity_c = tn / (tn + fp)

# Compute precision
precision_c = precision_score(y_test, y_pred_c)

# Compute roc_auc score
roc_auc_c = roc_auc_score(y_test, y_probs_c[:, 1])               
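
The column index matters here: `predict_proba` orders its columns by class label, so column 1 holds the probability of class 1 (r/cowboys), which is what `roc_auc_score` expects. The sketch below uses a small hand-written AUC function and made-up probabilities to illustrate that scoring with column 0 instead yields one minus the intended value.

```python
def roc_auc(y_true, scores):
    """Rank-based AUC: chance that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and class-1 probabilities, for illustration only.
y_true = [0, 0, 1, 1, 1]
p_class1 = [0.1, 0.6, 0.4, 0.8, 0.9]
p_class0 = [1 - p for p in p_class1]

auc_pos = roc_auc(y_true, p_class1)  # scored with the positive-class column
auc_neg = roc_auc(y_true, p_class0)  # scored with the wrong column

print(round(auc_pos, 4), round(auc_neg, 4))
```

An AUC well below 0.5 on a model whose accuracy beats chance is therefore a strong hint that the wrong probability column was passed.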

In [154]:
# Model score summary
print('Cross-Val Accuracy:', round(c_cv_mean, 4))
print('Train Accuracy:', round(train_acc_c, 4))
print('Test Accuracy:', round(test_acc_c, 4))
print('Test Sensitivity:', round(sensitivity_c, 4))
print('Test Specificity:', round(specificity_c, 4))
print('Test Precision:', round(precision_c, 4))
print('ROC AUC:', round(roc_auc_c, 4))

Cross-Val Accuracy: 0.665
Train Accuracy: 0.8957
Test Accuracy: 0.6533
Test Sensitivity: 0.72
Test Specificity: 0.5867
Test Precision: 0.6353
ROC AUC: 0.7223
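
These test metrics follow directly from the confusion matrix printed earlier (`[[176 124], [84 216]]`, rows are true classes, columns are predictions). A short worked check:

```python
# Counts taken from the confusion matrix above: tn, fp / fn, tp.
tn, fp, fn, tp = 176, 124, 84, 216

# Sensitivity: share of true r/cowboys posts predicted as r/cowboys.
sensitivity = tp / (tp + fn)
# Specificity: share of true r/eagles posts predicted as r/eagles.
specificity = tn / (tn + fp)
# Precision: share of r/cowboys predictions that were correct.
precision = tp / (tp + fp)
# Accuracy: share of all posts classified correctly.
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(sensitivity, 4), round(specificity, 4),
      round(precision, 4), round(accuracy, 4))
```

The gap between sensitivity and specificity shows the model leans toward predicting r/cowboys, which fits the stated priority of not mislabeling Cowboys fans.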


In [155]:
# Get model coefficients
lr_c.coef_

array([[ 0.00000000e+00, -5.16216508e-02,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  6.35734251e-02,  5.49362422e-02,
         0.00000000e+00,  0.00000000e+00,  5.32000703e-03,
         8.57277892e-02,  0.00000000e+00,  1.48386131e-01,
         2.69308671e-01,  0.00000000e+00,  1.27103234e-01,
         1.93403104e-02,  1.60352677e-02,  0.00000000e+00,
        -2.59525842e-01,  2.69489552e-01,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         2.83598420e-02,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  1.26602218e-01,
         9.20422703e-02,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  9.06507465e-02,  1.62704116e-01,
         9.66333752e-02,  0.00000000e+00,  1.57188791e-02,
        -1.73240297e-01, -1.30749419e-01,  1.56637799e-01,
         5.12570163e-03,  0.00000000e+00, -1.49435427e-02,
        -2.97722543e-02,  0.00000000e+00,  0.00000000e+0

In [156]:
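# Undo the scaling (divide by each feature's scale), then exponentiate
# to express each coefficient as an odds ratio per raw count.
# https://stackoverflow.com/questions/31029340/how-to-adjust-scaled-scikit-learn-logicistic-regression-coeffs-to-score-a-non-sc/38836670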

lr_c_coefficients = np.exp(np.true_divide(lr_c.coef_,  ss.scale_))
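
The reasoning behind this line: with `StandardScaler(with_mean=False)` each feature is divided by its scale, so a coefficient learned on scaled data corresponds to `coef / scale` per raw count, and exponentiating turns that into a multiplicative change in the odds of r/cowboys. The sketch below uses made-up numbers and a hypothetical helper name to illustrate this for a single feature.

```python
import math


def odds_ratio_per_raw_count(scaled_coef, feature_scale):
    """Odds ratio for one extra occurrence of a word, given a coefficient
    fitted on data that was divided by `feature_scale`."""
    return math.exp(scaled_coef / feature_scale)


# Hypothetical coefficient/scale pairs, for illustration only.
zeroed = odds_ratio_per_raw_count(0.0, 0.25)     # L1 penalty zeroed this word
positive = odds_ratio_per_raw_count(0.2, 0.1)    # pushes toward r/cowboys
negative = odds_ratio_per_raw_count(-0.2, 0.1)   # pushes toward r/eagles

print(zeroed, round(positive, 4), round(negative, 4))
```

This is why the many coefficients that the L1 penalty shrank to exactly zero show up as `1.0` after the transformation: an odds ratio of 1 means the word has no effect.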

In [157]:
lr_c_coefficients

array([[1.00000000e+00, 6.93547062e-01, 1.00000000e+00, 1.00000000e+00,
        1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 1.81719452e+00,
        1.53809285e+00, 1.00000000e+00, 1.00000000e+00, 1.05695567e+00,
        2.91916757e+00, 1.00000000e+00, 2.18024866e+00, 3.56855610e+00,
        1.00000000e+00, 3.76286095e+00, 1.27378801e+00, 1.25525909e+00,
        1.00000000e+00, 3.88786290e-02, 1.49520740e+01, 1.00000000e+00,
        1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 1.26884915e+00,
        1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 1.00000000e+00,
        3.74325656e+00, 3.16349234e+00, 1.00000000e+00, 1.00000000e+00,
        1.00000000e+00, 3.10888823e+00, 2.13603830e+00, 1.99226103e+00,
        1.00000000e+00, 1.19486555e+00, 1.86953114e-01, 1.76468061e-01,
        7.98902842e+00, 1.07036662e+00, 1.00000000e+00, 8.50533697e-01,
        6.88992714e-01, 1.00000000e+00, 1.00000000e+00, 2.29347718e-01,
        1.00000000e+00, 1.00000000e+00, 1.08823258e+00, 1.000000

In [158]:
feature_names_c = np.array(c_vectorizer.get_feature_names())

In [159]:
feature_names_c.shape

(1000,)

In [160]:
lr_c_coefficients.shape

(1, 1000)

In [161]:
df_lr_c_coef = pd.DataFrame(lr_c_coefficients[0], index=feature_names_c)

In [163]:
df_lr_c_coef.to_csv("../data/final cvec lasso coefs.csv")

### Run model with no custom stopwords for comparison

In [164]:
# Initialize an empty list to hold the clean posts.
clean_train_posts_orig_lem = []
clean_test_posts_orig_lem = []

print("Cleaning and parsing the training set reddit posts...")

# Instantiate counter.
j = 0

# For every post in our training set...
for train_post in X_train['title_selftext']:
    
    # Convert post to words, then append to clean_train_posts_orig_lem.
    clean_train_posts_orig_lem.append(reddit_to_words_wlem(train_post, stopwords.words("english")))
    
    # If the index is divisible by 100, print a message.
    if (j + 1) % 100 == 0:
        print(f'Review {j + 1} of {total_posts}.')
    
    j += 1

# Let's do the same for our testing set.
print("Cleaning and parsing the testing set reddit posts...")

# For every post in our testing set...
for test_post in X_test['title_selftext']:
    
    # Convert post to words, then append to clean_test_posts_orig_lem.
    clean_test_posts_orig_lem.append(reddit_to_words_wlem(test_post, stopwords.words("english")))
    
    # If the index is divisible by 100, print a message.
    if (j + 1) % 100 == 0:
        print(f'Review {j + 1} of {total_posts}.')
        
    j += 1

Cleaning and parsing the training set reddit posts...
Review 100 of 2000.
Review 200 of 2000.
Review 300 of 2000.
Review 400 of 2000.
Review 500 of 2000.
Review 600 of 2000.
Review 700 of 2000.
Review 800 of 2000.
Review 900 of 2000.
Review 1000 of 2000.
Review 1100 of 2000.
Review 1200 of 2000.
Review 1300 of 2000.
Review 1400 of 2000.
Cleaning and parsing the testing set reddit posts...
Review 1500 of 2000.
Review 1600 of 2000.
Review 1700 of 2000.
Review 1800 of 2000.
Review 1900 of 2000.
Review 2000 of 2000.


In [165]:
# Instantiate count vectorizer
c_vectorizer = CountVectorizer(max_features=1000, ngram_range=(1,1), min_df=2, 
                               stop_words=stopwords.words("english"))

# Fit count vectorizer & transform
train_data_features_orig_c = c_vectorizer.fit_transform(clean_train_posts_orig_lem)

test_data_features_orig_c = c_vectorizer.transform(clean_test_posts_orig_lem)

print(train_data_features_orig_c.shape)

print(test_data_features_orig_c.shape)

c_orig_vocab = c_vectorizer.get_feature_names()
print(c_orig_vocab)

(1400, 1000)
(600, 1000)
['aav', 'ability', 'able', 'absolute', 'absolutely', 'according', 'act', 'actually', 'adam', 'adamant', 'add', 'adding', 'address', 'age', 'agency', 'agent', 'agholor', 'ago', 'agree', 'agreed', 'agreement', 'ahead', 'aiyuk', 'aj', 'alabama', 'aldon', 'allowed', 'almost', 'along', 'already', 'alshon', 'also', 'although', 'always', 'amari', 'amazing', 'among', 'amount', 'anderson', 'andrew', 'another', 'anthony', 'anybody', 'anyone', 'anything', 'anywhere', 'app', 'appreciate', 'april', 'arcega', 'archer', 'arizona', 'around', 'article', 'ask', 'asking', 'athletic', 'attack', 'attempt', 'austin', 'authentic', 'available', 'average', 'away', 'awesome', 'back', 'backup', 'bad', 'ball', 'barnett', 'base', 'based', 'basically', 'baun', 'baylor', 'bear', 'beat', 'became', 'become', 'becomes', 'begin', 'beginning', 'behind', 'believe', 'best', 'better', 'big', 'biggest', 'bill', 'bird', 'birthday', 'bit', 'blake', 'blocker', 'board', 'bored', 'boston', 'boundary', 'bo

In [166]:
# Scale data for use in regularized logistic regression model
ss = StandardScaler(with_mean=False)

X_train_sc_orig_c = ss.fit_transform(train_data_features_orig_c)
X_test_sc_orig_c = ss.transform(test_data_features_orig_c)

# Instantiate logistic regression model.
lr_orig_c = LogisticRegression(penalty='l1', solver = 'saga', C=0.35, random_state=42, max_iter=5000)

# Fit model to training data.
lr_orig_c.fit(X_train_sc_orig_c, y_train)

# Evaluate model on cross-validated data
c_orig_cv_mean = cross_val_score(lr_orig_c, X_train_sc_orig_c, y_train, cv=5).mean()

# Evaluate model on training data.
train_acc_orig_c = lr_orig_c.score(X_train_sc_orig_c, y_train)

# Evaluate model on testing data.
test_acc_orig_c = lr_orig_c.score(X_test_sc_orig_c, y_test)

# Predict test data
y_pred_orig_c = lr_orig_c.predict(X_test_sc_orig_c)

# Predict test data probabilities
y_probs_orig_c = lr_orig_c.predict_proba(X_test_sc_orig_c)

# Create confusion matrix
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_orig_c).ravel()
print(confusion_matrix(y_test, y_pred_orig_c))

# Compute sensitivity & specificity
sensitivity_orig_c = tp / (tp + fn)
specificity_orig_c = tn / (tn + fp)

# Compute precision
precision_orig_c = precision_score(y_test, y_pred_orig_c)

# Compute roc_auc score
roc_auc_orig_c = roc_auc_score(y_test, y_probs_orig_c[:, 1])

# Model score summary
print('Cross-Val Accuracy:', round(c_orig_cv_mean, 4))
print('Train Accuracy:', round(train_acc_orig_c, 4))
print('Test Accuracy:', round(test_acc_orig_c, 4))
print('Test Sensitivity:', round(sensitivity_orig_c, 4))
print('Test Specificity:', round(specificity_orig_c, 4))
print('Test Precision:', round(precision_orig_c, 4))
print('ROC AUC:', round(roc_auc_orig_c, 4))

[[248  52]
 [ 66 234]]
Cross-Val Accuracy: 0.8157
Train Accuracy: 0.9479
Test Accuracy: 0.8033
Test Sensitivity: 0.78
Test Specificity: 0.8267
Test Precision: 0.8182
ROC AUC: 0.8958


In [167]:
lr_orig_c.coef_

array([[ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        -3.96693588e-02,  0.00000000e+00,  5.57561342e-02,
         0.00000000e+00, -2.93448906e-03,  0.00000000e+00,
         5.34799590e-02,  0.00000000e+00,  0.00000000e+00,
         5.91217244e-02, -3.39454783e-01,  0.00000000e+00,
         1.73570254e-01,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00, -1.70059455e-01,  1.74930764e-01,
         0.00000000e+00,  1.92792125e-01,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  6.84116805e-03,
        -2.00324821e-01, -2.85722128e-02,  0.00000000e+00,
         0.00000000e+00,  3.19910746e-01,  0.00000000e+00,
         0.00000000e+00,  1.86106980e-01,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  2.33102625e-01,
         5.25150826e-02,  1.40700226e-01,  4.97450238e-02,
         0.00000000e+00,  0.00000000e+00,  1.70286550e-03,
        -1.88093072e-01,  0.00000000e+00,  1.55439922e-0

In [168]:
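# Undo the scaling (divide by each feature's scale), then exponentiate
# to express each coefficient as an odds ratio per raw count.
# https://stackoverflow.com/questions/31029340/how-to-adjust-scaled-scikit-learn-logicistic-regression-coeffs-to-score-a-non-sc/38836670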

orig_coefficients = np.exp(np.true_divide(lr_orig_c.coef_,  ss.scale_))

In [169]:
orig_coefficients

array([[1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 1.00000000e+00,
        1.00000000e+00, 1.00000000e+00, 5.90797412e-01, 1.00000000e+00,
        1.54800786e+00, 1.00000000e+00, 9.77239027e-01, 1.00000000e+00,
        1.95094338e+00, 1.00000000e+00, 1.00000000e+00, 1.32217667e+00,
        2.64782610e-01, 1.00000000e+00, 6.10827965e+00, 1.00000000e+00,
        1.00000000e+00, 1.00000000e+00, 1.19091325e-01, 5.78789861e+00,
        1.00000000e+00, 2.24399020e+00, 1.00000000e+00, 1.00000000e+00,
        1.00000000e+00, 1.05912035e+00, 4.62336729e-01, 9.16319371e-01,
        1.00000000e+00, 1.00000000e+00, 3.46687443e+00, 1.00000000e+00,
        1.00000000e+00, 1.02641670e+01, 1.00000000e+00, 1.00000000e+00,
        1.00000000e+00, 5.07396751e+00, 1.92917731e+00, 1.92767202e+00,
        1.42592851e+00, 1.00000000e+00, 1.00000000e+00, 1.01947403e+00,
        1.61917815e-01, 1.00000000e+00, 7.86307121e+00, 1.87311204e+00,
        1.00000000e+00, 1.00000000e+00, 1.60229484e-01, 1.000000

In [170]:
feature_names_orig = np.array(c_vectorizer.get_feature_names())

In [171]:
feature_names_orig.shape

(1000,)

In [172]:
lr_orig_c.coef_.shape

(1, 1000)

In [173]:
df_orig_coef = pd.DataFrame(orig_coefficients[0], index=feature_names_orig)

In [174]:
df_orig_coef.to_csv("../data/orig cvec lasso coefs.csv")

### Now try Tfidf Vectorization

#### Use pipeline to tune hyperparameters

In [175]:
pipe_tvec = Pipeline([
    ('tvec', TfidfVectorizer(stop_words=my_stop_words)),
    ('ss', StandardScaler(with_mean=False)),
    ('lr', LogisticRegression(penalty='l1', solver='saga', random_state=42, max_iter=3000))
])

In [176]:
pipe_tvec_params = {
    'tvec__max_features': [1000, 1100, 1250],
    'tvec__ngram_range': [(1,1), (1,2)],
    'tvec__min_df': [1, 2],
    'lr__C': [0.05, 0.1, 0.15]
}

In [177]:
# Instantiate grid search
gs = GridSearchCV(pipe_tvec, 
                  param_grid=pipe_tvec_params, 
                  cv=5,
                  n_jobs=-1,
                  verbose=1)

# Fit grid search to training data
gs.fit(clean_train_posts_lem, y_train)

Fitting 5 folds for each of 36 candidates, totalling 180 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:    2.0s
[Parallel(n_jobs=-1)]: Done 180 out of 180 | elapsed:   24.3s finished


GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('tvec',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                        no

In [178]:
gs.best_params_

{'lr__C': 0.1,
 'tvec__max_features': 1250,
 'tvec__min_df': 2,
 'tvec__ngram_range': (1, 1)}

In [179]:
# Best score
print('Best CV score', gs.best_score_)
print('Best train score', gs.score(clean_train_posts_lem, y_train))
print('Best test score', gs.score(clean_test_posts_lem, y_test))

Best CV score 0.6642857142857144
Best train score 0.9207142857142857
Best test score 0.655


#### Fit model again using selected hyperparameters
Refitting outside the pipeline makes the fitted vectorizer, scaler, and coefficients directly accessible, so the results are easier to interpret.

In [180]:
t_vectorizer = TfidfVectorizer(max_features=1250, ngram_range=(1,1), min_df=2, stop_words=my_stop_words)

In [181]:
train_data_features_t = t_vectorizer.fit_transform(clean_train_posts_lem)

test_data_features_t = t_vectorizer.transform(clean_test_posts_lem)
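
For intuition about what the tf-idf features contain, here is a small standard-library sketch. It assumes the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` followed by L2 row normalization, which matches `TfidfVectorizer`'s default settings as far as I know; the function name `tfidf_rows` and the toy documents are hypothetical, and tokenization is simplified to whitespace splitting.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocab, rows) where each row holds L2-normalized tf-idf weights."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for toks in tokenized for t in set(toks))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows


docs = ["draft pick trade", "draft pick pick", "trade rumor"]
vocab, rows = tfidf_rows(docs)

for doc, row in zip(docs, rows):
    print(doc, [round(v, 3) for v in row])
```

In the last document, "rumor" receives more weight than "trade" because it appears in fewer documents, which is the main difference from plain counts.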

In [182]:
ss = StandardScaler(with_mean=False)
X_train_sc_t = ss.fit_transform(train_data_features_t)
X_test_sc_t = ss.transform(test_data_features_t)

In [183]:
lr_t = LogisticRegression(penalty='l1', 
                          solver='saga', 
                          C=.1,  
                          random_state=42, 
                          max_iter=3000,
                          n_jobs=-1)

# Fit model
lr_t.fit(X_train_sc_t, y_train)

LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=3000,
                   multi_class='auto', n_jobs=-1, penalty='l1', random_state=42,
                   solver='saga', tol=0.0001, verbose=0, warm_start=False)

In [184]:
# Compute cv mean score
t_cv_mean = cross_val_score(lr_t, X_train_sc_t, y_train, cv=5).mean()

# Compute train Accuracy Score
train_acc_t = lr_t.score(X_train_sc_t, y_train)

# Compute test Accuracy Score
test_acc_t = lr_t.score(X_test_sc_t, y_test)

# Compute test predictions
y_pred_t = lr_t.predict(X_test_sc_t)

# Compute test probabilities
y_probs_t = lr_t.predict_proba(X_test_sc_t)

# Compute confusion matrix
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_t).ravel()
print(confusion_matrix(y_test, y_pred_t))

# Compute sensitivity & specificity
sensitivity_t = tp / (tp + fn)
specificity_t = tn / (tn + fp)

# Compute precision
precision_t = precision_score(y_test, y_pred_t)

# Compute ROC AUC from the positive-class (r/cowboys = 1) probabilities in column 1
roc_auc_t = roc_auc_score(y_test, y_probs_t[:, 1])

# Model score summary
print(f'Cross-Val Accuracy: {round(t_cv_mean, 4)}')
print(f'Train Accuracy: {round(train_acc_t, 4)}')
print(f'Test Accuracy: {round(test_acc_t, 4)}')
print(f'Test Sensitivity: {round(sensitivity_t, 4)}')
print(f'Test Specificity: {round(specificity_t, 4)}')
print(f'Test Precision: {round(precision_t, 4)}')
print(f'ROC AUC: {round(roc_auc_t, 4)}')

[[180 120]
 [ 87 213]]
Cross-Val Accuracy: 0.6621
Train Accuracy: 0.9207
Test Accuracy: 0.655
Test Sensitivity: 0.71
Test Specificity: 0.6
Test Precision: 0.6396
ROC AUC: 0.7012
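
The accuracy, Sensitivity, Specificity, and Precision figures all follow from the four confusion-matrix counts. As a minimal sketch (the helper name `rates_from_confusion` is hypothetical and not part of the notebook), the logistic regression figures can be reproduced by hand from the printed matrix:

```python
def rates_from_confusion(tn, fp, fn, tp):
    """Derive summary rates from binary confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        'accuracy': (tp + tn) / total,
        'sensitivity': tp / (tp + fn),   # share of true r/cowboys posts caught
        'specificity': tn / (tn + fp),   # share of true r/eagles posts caught
        'precision': tp / (tp + fp),     # share of predicted r/cowboys posts that are correct
    }

# Counts copied from the matrix printed above, laid out as [[tn, fp], [fn, tp]]
rates = rates_from_confusion(tn=180, fp=120, fn=87, tp=213)
for name, value in rates.items():
    print(f'{name}: {round(value, 4)}')
```

Because the test set is balanced (300 posts per class), accuracy here is simply the average of Sensitivity and Specificity.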


## Multinomial Naive Bayes Classifier

### Try Count Vectorized inputs

In [185]:
m_pipe = Pipeline([
    ('mnb', MultinomialNB())
])

In [186]:
m_pipe_params = {
    'mnb__alpha': [.001, .01, .05, .1, .5, 1],
}

In [187]:
# Instantiate grid search
gs = GridSearchCV(m_pipe, 
                  param_grid=m_pipe_params, 
                  cv=5,
                  n_jobs=-1)

# Fit grid search to training data
gs.fit(train_data_features_c, y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('mnb',
                                        MultinomialNB(alpha=1.0,
                                                      class_prior=None,
                                                      fit_prior=True))],
                                verbose=False),
             iid='deprecated', n_jobs=-1,
             param_grid={'mnb__alpha': [0.001, 0.01, 0.05, 0.1, 0.5, 1]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

In [188]:
gs.best_params_

{'mnb__alpha': 0.1}

In [189]:
# Best score
print(gs.best_score_)

gs.score(train_data_features_c, y_train)

0.68


0.7885714285714286

In [190]:
gs.score(test_data_features_c, y_test)

0.6033333333333334

In [191]:
mnb_c = MultinomialNB(alpha=0.1)

In [192]:
mc_cv_mean = cross_val_score(mnb_c, train_data_features_c, y_train, cv=5).mean()

In [193]:
mnb_c.fit(train_data_features_c, y_train)
train_acc_mc = mnb_c.score(train_data_features_c, y_train)

In [194]:
test_acc_mc = mnb_c.score(test_data_features_c, y_test)

In [195]:
y_pred_mc = mnb_c.predict(test_data_features_c)

In [196]:
y_probs_mc = mnb_c.predict_proba(test_data_features_c)

In [197]:
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_mc).ravel()
print(confusion_matrix(y_test, y_pred_mc))

[[185 115]
 [123 177]]


In [198]:
# Compute Sensitivity & Specificity
sensitivity_mc = tp / (tp + fn)
specificity_mc = tn / (tn + fp)

# Compute precision
precision_mc = precision_score(y_test, y_pred_mc)

# Compute ROC AUC score from the positive-class (r/cowboys = 1) probabilities in column 1
roc_auc_mc = roc_auc_score(y_test, y_probs_mc[:, 1])

In [199]:
# Model score summary
print(f'Cross-Val Accuracy: {round(mc_cv_mean, 4)}')
print(f'Train Accuracy: {round(train_acc_mc, 4)}')
print(f'Test Accuracy: {round(test_acc_mc, 4)}')
print(f'Test Sensitivity: {round(sensitivity_mc, 4)}')
print(f'Test Specificity: {round(specificity_mc, 4)}')
print(f'Test Precision: {round(precision_mc, 4)}')
print(f'ROC AUC: {round(roc_auc_mc, 4)}')

Cross-Val Accuracy: 0.68
Train Accuracy: 0.7886
Test Accuracy: 0.6033
Test Sensitivity: 0.59
Test Specificity: 0.6167
Test Precision: 0.6062
ROC AUC: 0.6979
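
The ROC AUC depends on which `predict_proba` column is scored. scikit-learn orders the columns as in `classes_`, so with 0/1 labels column 1 holds P(r/cowboys) and column 0 holds its complement. The sketch below uses a small rank-based AUC written in plain Python (the `roc_auc` helper and the toy numbers are illustrative assumptions, not the notebook's data) to show that scoring column 0 returns one minus the AUC:

```python
def roc_auc(y_true, scores):
    """Rank-based AUC: chance that a random positive outscores a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    # Count a win when the positive scores higher, half a win on ties
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and made-up P(class = 1) values, standing in for predict_proba(...)[:, 1]
y_toy = [0, 0, 1, 1, 0, 1]
p_class1 = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70]
p_class0 = [1 - p for p in p_class1]   # what column 0 would hold

auc_pos = roc_auc(y_toy, p_class1)
auc_neg = roc_auc(y_toy, p_class0)
print(round(auc_pos, 4), round(auc_neg, 4), round(auc_pos + auc_neg, 4))
```

On a balanced problem, an AUC well below 0.5 is therefore usually a sign that the wrong column was scored rather than that the model is worse than chance.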


## Decision Tree Classifier

### Fit single Decision Tree first

In [200]:
dt = DecisionTreeClassifier(random_state=42)

In [201]:
grid = GridSearchCV(dt, 
                    param_grid = {'max_depth':[2,3,5,7],
                                  'min_samples_split':[5,10,15,20],
                                  'min_samples_leaf':[2,3,4,5],
                                  'ccp_alpha':[0, 0.001, 0.01, 0.1, 1, 10]},
                    cv = 5,
                    n_jobs=-1,
                    verbose = 1)

In [202]:
# Grid search over parameter space
grid.fit(train_data_features_c, y_train)

Fitting 5 folds for each of 384 candidates, totalling 1920 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done 1128 tasks      | elapsed:    0.9s
[Parallel(n_jobs=-1)]: Done 1920 out of 1920 | elapsed:    1.3s finished


GridSearchCV(cv=5, error_score=nan,
             estimator=DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features=None,
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              presort='deprecated',
                                              random_state=42,
                                              splitter='best'),
             iid='deprecated', n_jobs=-1,
             param_grid={'ccp_alpha': [0, 0.001, 0.01, 0.1, 1, 10],
              

In [203]:
# Best parameters
grid.best_params_

{'ccp_alpha': 0.001,
 'max_depth': 7,
 'min_samples_leaf': 2,
 'min_samples_split': 20}
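
The "384 candidates, totalling 1920 fits" line in the grid search log is the Cartesian product of the hyperparameter lists multiplied by the number of folds. A minimal sketch of that arithmetic, using the same grid as above:

```python
from itertools import product

param_grid = {
    'max_depth': [2, 3, 5, 7],
    'min_samples_split': [5, 10, 15, 20],
    'min_samples_leaf': [2, 3, 4, 5],
    'ccp_alpha': [0, 0.001, 0.01, 0.1, 1, 10],
}

# Every combination of one value per hyperparameter is one candidate
candidates = list(product(*param_grid.values()))
n_folds = 5
print(len(candidates), len(candidates) * n_folds)   # 384 1920
```

The same calculation explains the later logs: each extra value in any list multiplies the total number of fits, which is why the boosting grids take minutes rather than seconds.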

In [204]:
grid.best_score_

0.5657142857142857

In [205]:
# Evaluate model.
print(f'Score on training set: {grid.score(train_data_features_c, y_train)}')
print(f'Score on testing set: {grid.score(test_data_features_c, y_test)}')

Score on training set: 0.5964285714285714
Score on testing set: 0.5433333333333333


In [206]:
dt_preds = grid.predict(test_data_features_c)

In [207]:
tn, fp, fn, tp = confusion_matrix(y_test, dt_preds).ravel()

print(confusion_matrix(y_test, dt_preds))

[[279  21]
 [253  47]]


In [208]:
dt_sensitivity = tp / (tp + fn)
dt_specificity = tn / (tn + fp)
# Note: hard class predictions (not probabilities) are passed here,
# so this AUC equals (sensitivity + specificity) / 2
dt_roc_auc = roc_auc_score(y_test, dt_preds)
print(f'Sensitivity: {round(dt_sensitivity, 4)}')
print(f'Specificity: {round(dt_specificity, 4)}')
print(f'Precision: {round(precision_score(y_test, dt_preds), 4)}')
print(f'ROC AUC Score: {round(dt_roc_auc, 4)}')

Sensitivity: 0.1567
Specificity: 0.93
Precision: 0.6912
ROC AUC Score: 0.5433
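
For the decision tree, `roc_auc_score` receives hard 0/1 predictions rather than probabilities. The ROC curve then has a single interior point, and the area under it reduces to the average of Sensitivity and Specificity. A short sketch (the function name is hypothetical) that checks this against the confusion matrix above:

```python
def auc_from_hard_predictions(tn, fp, fn, tp):
    """Area under the one-point ROC curve (0,0) -> (FPR, TPR) -> (1,1)."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    # Trapezoid rule over the two line segments
    return 0.5 * fpr * tpr + 0.5 * (1 - fpr) * (tpr + 1)

tn, fp, fn, tp = 279, 21, 253, 47   # decision tree confusion matrix above
auc_hard = auc_from_hard_predictions(tn, fp, fn, tp)
balanced = 0.5 * (tp / (tp + fn) + tn / (tn + fp))
print(round(auc_hard, 4), round(balanced, 4))
```

This is why the tree-based AUC values are not directly comparable with the probability-based AUC values reported for the other classifiers.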


### Now try Bagged Decision Tree Classifier

In [209]:
# Instantiate BaggingClassifier.
bag = BaggingClassifier(base_estimator=DecisionTreeClassifier(min_samples_split=15,
                                                              min_samples_leaf=4,
                                                              max_depth = 7,
                                                              ccp_alpha=0), 
                        random_state=42)

# Fit BaggingClassifier.
bag.fit(train_data_features_c, y_train)

# Score BaggingClassifier.
bag.score(test_data_features_c, y_test)

0.5633333333333334

In [210]:
cross_val_score(bag, train_data_features_c, y_train, cv=5).mean()

0.6071428571428571

In [211]:
bag.score(train_data_features_c, y_train)

0.6414285714285715

In [212]:
bag_preds = bag.predict(test_data_features_c)

In [213]:
tn, fp, fn, tp = confusion_matrix(y_test, bag_preds).ravel()

print(confusion_matrix(y_test, bag_preds))

[[263  37]
 [225  75]]


In [214]:
bag_sensitivity = tp / (tp + fn)
bag_specificity = tn / (tn + fp)
# Note: hard class predictions (not probabilities) are passed here,
# so this AUC equals (sensitivity + specificity) / 2
bag_roc_auc = roc_auc_score(y_test, bag_preds)
print(f'Sensitivity: {round(bag_sensitivity, 4)}')
print(f'Specificity: {round(bag_specificity, 4)}')
print(f'Precision: {round(precision_score(y_test, bag_preds), 4)}')
print(f'ROC AUC Score: {round(bag_roc_auc, 4)}')

Sensitivity: 0.25
Specificity: 0.8767
Precision: 0.6696
ROC AUC Score: 0.5633


## Random Forests and Extra Trees Classifiers

In [215]:
# Instantiate models
rf = RandomForestClassifier()
et = ExtraTreesClassifier()

In [216]:
cross_val_score(rf, train_data_features_c, y_train, cv=5, n_jobs=-1).mean()

0.6742857142857143

In [217]:
cross_val_score(et, train_data_features_c, y_train, cv=5, n_jobs=-1).mean()

0.6699999999999999

The **Random Forest** model scores slightly higher in cross-validation (0.674 vs. 0.670 for Extra Trees), so let's move forward with that model only.

In [218]:
# Run Random Forest Classifier in grid search
rf = RandomForestClassifier(random_state=42)
params = {
    'n_estimators': [50, 100, 200],
    'max_features': [None, 'auto'],
    'max_depth': [None, 2, 3, 4, 5, 6],
    'ccp_alpha': [0, 0.1, 0.5, 1]
}
gs = GridSearchCV(rf, param_grid=params, cv=5, n_jobs=-1, verbose=1)
gs.fit(train_data_features_c, y_train)
print(gs.best_score_) # cross-val score
gs.best_params_

Fitting 5 folds for each of 144 candidates, totalling 720 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:    3.2s
[Parallel(n_jobs=-1)]: Done 168 tasks      | elapsed:    9.1s
[Parallel(n_jobs=-1)]: Done 418 tasks      | elapsed:   26.7s
[Parallel(n_jobs=-1)]: Done 720 out of 720 | elapsed:   44.2s finished


0.675


{'ccp_alpha': 0,
 'max_depth': None,
 'max_features': 'auto',
 'n_estimators': 100}

In [219]:
rf_train_acc_c = gs.score(train_data_features_c, y_train)
rf_test_acc_c = gs.score(test_data_features_c, y_test)

In [220]:
rf_preds = gs.predict(test_data_features_c)
rf_probs = gs.predict_proba(test_data_features_c)

In [221]:
tn, fp, fn, tp = confusion_matrix(y_test, rf_preds).ravel()

print(confusion_matrix(y_test, rf_preds))

[[194 106]
 [122 178]]


In [222]:
rf_sensitivity = tp / (tp + fn)
rf_specificity = tn / (tn + fp)
rf_roc_auc = roc_auc_score(y_test, rf_probs[:, 1]) 
print(f'RF CV Accuracy:', round(gs.best_score_, 4))
print(f'RF Train Accuracy: {round(rf_train_acc_c, 4)}')
print(f'RF Test Accuracy: {round(rf_test_acc_c, 4)}')
print(f'Sensitivity: {round(rf_sensitivity, 4)}')
print(f'Specificity: {round(rf_specificity, 4)}')
print(f'Precision:', round(precision_score(y_test, rf_preds), 4))
print(f'ROC AUC Score: {round(rf_roc_auc, 4)}')

RF CV Accuracy: 0.675
RF Train Accuracy: 0.9771
RF Test Accuracy: 0.62
Sensitivity: 0.5933
Specificity: 0.6467
Precision: 0.6268
ROC AUC Score: 0.6882


## Boosting Classifiers

### Test AdaBoost vs. Gradient Boosting Classifier

In [223]:
# Instantiate models
ada = AdaBoostClassifier(random_state=42)
gbm = GradientBoostingClassifier(random_state=42)

In [224]:
cross_val_score(ada, train_data_features_c, y_train, cv=5, n_jobs=-1).mean()

0.5900000000000001

In [225]:
cross_val_score(gbm, train_data_features_c, y_train, cv=5, n_jobs=-1).mean()

0.6399999999999999

### Proceed with Gradient Boosting using Count Vectorized inputs

In [226]:
# Try to add a stochastic element to boosting by tuning the subsample hyperparameter
# With subsample < 1, each tree is fit on a random fraction of the training rows
gbm = GradientBoostingClassifier(random_state=42)
params = {
    'n_estimators': [100, 500, 1000],
    'learning_rate': [0.001, 0.01, 0.1],
    'subsample': [.2, .4, .6, .8, 1],
    'max_depth': [1, 2, 3, 4, 5, 6],
    'max_features': [None, 'auto']
}
gs_gbm = GridSearchCV(gbm, param_grid=params, cv=5, n_jobs=-1, verbose=1)
gs_gbm.fit(train_data_features_c, y_train)
print(gs_gbm.best_score_) # cross-val score
gs_gbm.best_params_

Fitting 5 folds for each of 540 candidates, totalling 2700 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:    0.4s
[Parallel(n_jobs=-1)]: Done 232 tasks      | elapsed:   14.8s
[Parallel(n_jobs=-1)]: Done 482 tasks      | elapsed:   43.6s
[Parallel(n_jobs=-1)]: Done 832 tasks      | elapsed:  2.1min
[Parallel(n_jobs=-1)]: Done 1282 tasks      | elapsed:  3.1min
[Parallel(n_jobs=-1)]: Done 1832 tasks      | elapsed:  5.0min
[Parallel(n_jobs=-1)]: Done 2482 tasks      | elapsed:  6.5min
[Parallel(n_jobs=-1)]: Done 2700 out of 2700 | elapsed:  7.4min finished


0.6828571428571429


{'learning_rate': 0.1,
 'max_depth': 6,
 'max_features': None,
 'n_estimators': 1000,
 'subsample': 1}
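
The grid search settled on `subsample=1`, meaning no stochastic element at all. For reference, here is a rough sketch of what a smaller value does, assuming (as the scikit-learn documentation describes) that `subsample` is the fraction of training rows used to fit each individual tree. The helper name and the without-replacement draw are illustrative assumptions, not the library's internals:

```python
import random

def stage_row_samples(n_rows, subsample, n_stages, seed=42):
    """Yield the row indices each boosting stage would be fit on."""
    rng = random.Random(seed)
    n_inbag = max(1, round(subsample * n_rows))
    for _ in range(n_stages):
        # Draw a fresh random subset for every stage; subsample=1 keeps every row
        yield sorted(rng.sample(range(n_rows), n_inbag))

# 1400 training posts, 60% of rows per tree, first three boosting stages
stages = list(stage_row_samples(n_rows=1400, subsample=0.6, n_stages=3))
print([len(rows) for rows in stages])
```

Each stage sees a different 840-row subset, which adds randomness across trees in exchange for less data per tree.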

In [227]:
gbm_cv_acc = gs_gbm.best_score_

In [228]:
gbm_train_acc = gs_gbm.score(train_data_features_c, y_train)

In [229]:
gbm_test_acc = gs_gbm.score(test_data_features_c, y_test)

In [230]:
gbm_preds = gs_gbm.predict(test_data_features_c)
gbm_probs = gs_gbm.predict_proba(test_data_features_c)

In [231]:
tn, fp, fn, tp = confusion_matrix(y_test, gbm_preds).ravel()

print(confusion_matrix(y_test, gbm_preds))

[[200 100]
 [125 175]]


In [232]:
gbm_sensitivity = tp / (tp + fn)
gbm_specificity = tn / (tn + fp)
gbm_roc_auc = roc_auc_score(y_test, gbm_probs[:, 1])
print(f'GBM CV Accuracy: {round(gs_gbm.best_score_, 4)}')
print(f'GBM Train Accuracy: {round(gbm_train_acc, 4)}')
print(f'GBM Test Accuracy: {round(gbm_test_acc, 4)}')
print(f'Sensitivity: {round(gbm_sensitivity, 4)}')
print(f'Specificity: {round(gbm_specificity, 4)}')
print(f'Precision: {round(precision_score(y_test, gbm_preds), 4)}')
print(f'ROC AUC Score: {round(gbm_roc_auc, 4)}')

GBM CV Accuracy: 0.6829
GBM Train Accuracy: 0.9707
GBM Test Accuracy: 0.625
Sensitivity: 0.5833
Specificity: 0.6667
Precision: 0.6364
ROC AUC Score: 0.7263


### Try again with TfidfVectorization

In [233]:
cross_val_score(ada, train_data_features_t, y_train, cv=5, n_jobs=-1).mean()

0.6142857142857143

In [234]:
cross_val_score(gbm, train_data_features_t, y_train, cv=5, n_jobs=-1).mean()

0.6642857142857143

In [235]:
gbm = GradientBoostingClassifier(random_state=42)
params = {
    'n_estimators': [100, 500, 1000],
    'learning_rate': [0.001, 0.01, 0.1],
    'subsample': [.2, .4, .6, .8, 1],
    'max_depth': [1, 2, 3, 4, 5, 6],
    'max_features': [None, 'auto']
}
gs_gbm = GridSearchCV(gbm, param_grid=params, cv=5, n_jobs=-1, verbose=1)
gs_gbm.fit(train_data_features_t, y_train)
print(gs_gbm.best_score_) # cross-val score
gs_gbm.best_params_

Fitting 5 folds for each of 540 candidates, totalling 2700 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:    0.4s
[Parallel(n_jobs=-1)]: Done 200 tasks      | elapsed:   15.4s
[Parallel(n_jobs=-1)]: Done 450 tasks      | elapsed:   56.3s
[Parallel(n_jobs=-1)]: Done 800 tasks      | elapsed:  2.7min
[Parallel(n_jobs=-1)]: Done 1250 tasks      | elapsed:  4.1min
[Parallel(n_jobs=-1)]: Done 1800 tasks      | elapsed:  6.8min
[Parallel(n_jobs=-1)]: Done 2450 tasks      | elapsed:  8.6min
[Parallel(n_jobs=-1)]: Done 2700 out of 2700 | elapsed: 10.0min finished


0.6799999999999999


{'learning_rate': 0.01,
 'max_depth': 6,
 'max_features': None,
 'n_estimators': 1000,
 'subsample': 1}

In [236]:
gbm_train_acc_t = gs_gbm.score(train_data_features_t, y_train)

In [237]:
gbm_test_acc_t = gs_gbm.score(test_data_features_t, y_test)

In [238]:
gbm_preds_t = gs_gbm.predict(test_data_features_t)
gbm_probs_t = gs_gbm.predict_proba(test_data_features_t)

In [239]:
tn, fp, fn, tp = confusion_matrix(y_test, gbm_preds_t).ravel()

print(confusion_matrix(y_test, gbm_preds_t))

[[206  94]
 [143 157]]


In [240]:
gbm_sensitivity_t = tp / (tp + fn)
gbm_specificity_t = tn / (tn + fp)
gbm_roc_auc_t = roc_auc_score(y_test, gbm_probs_t[:, 1])
print(f'GBM CV Accuracy: {round(gs_gbm.best_score_, 4)}')
print(f'GBM Train Accuracy: {round(gbm_train_acc_t, 4)}')
print(f'GBM Test Accuracy: {round(gbm_test_acc_t, 4)}')
print(f'Sensitivity: {round(gbm_sensitivity_t, 4)}')
print(f'Specificity: {round(gbm_specificity_t, 4)}')
print(f'Precision: {round(precision_score(y_test, gbm_preds_t), 4)}')
print(f'ROC AUC Score: {round(gbm_roc_auc_t, 4)}')

GBM CV Accuracy: 0.68
GBM Train Accuracy: 0.9443
GBM Test Accuracy: 0.605
Sensitivity: 0.5233
Specificity: 0.6867
Precision: 0.6255
ROC AUC Score: 0.6751


## Support Vector Machine Classifier

### First try with Count Vectorization

In [241]:
# Create pipeline
svc_pipe = Pipeline([
    ('ss', StandardScaler(with_mean=False)),
    ('svc', SVC(random_state=42, probability=True))
])

In [242]:
svc_params = {
    'svc__C': np.linspace(0.0001, 1, 20),
    'svc__kernel': ['rbf', 'poly'],
    'svc__degree': [2, 3]
}
gs_svc = GridSearchCV(svc_pipe, param_grid=svc_params, cv=5, n_jobs=-1, verbose=1)
gs_svc.fit(train_data_features_c, y_train)
print(gs_svc.best_score_) # cross-val score
gs_svc.best_params_

Fitting 5 folds for each of 80 candidates, totalling 400 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:    1.4s
[Parallel(n_jobs=-1)]: Done 168 tasks      | elapsed:    7.8s
[Parallel(n_jobs=-1)]: Done 400 out of 400 | elapsed:   19.1s finished


0.6378571428571429


{'svc__C': 1.0, 'svc__degree': 2, 'svc__kernel': 'rbf'}

In [243]:
gs_svc_train_acc = gs_svc.score(train_data_features_c, y_train)

In [244]:
gs_svc_test_acc = gs_svc.score(test_data_features_c, y_test)

In [245]:
svc_preds = gs_svc.predict(test_data_features_c)
svc_probs = gs_svc.predict_proba(test_data_features_c)

In [246]:
tn, fp, fn, tp = confusion_matrix(y_test, svc_preds).ravel()

print(confusion_matrix(y_test, svc_preds))

[[145 155]
 [ 60 240]]


In [247]:
svc_sensitivity = tp / (tp + fn)
svc_specificity = tn / (tn + fp)
svc_roc_auc = roc_auc_score(y_test, svc_probs[:, 1])
print(f'SVC CV Accuracy: {round(gs_svc.best_score_, 4)}')
print(f'SVC Train Accuracy: {round(gs_svc_train_acc, 4)}')
print(f'SVC Test Accuracy: {round(gs_svc_test_acc, 4)}')
print(f'Sensitivity: {round(svc_sensitivity, 4)}')
print(f'Specificity: {round(svc_specificity, 4)}')
print(f'Precision: {round(precision_score(y_test, svc_preds), 4)}')
print(f'ROC AUC Score: {round(svc_roc_auc, 4)}')

SVC CV Accuracy: 0.6379
SVC Train Accuracy: 0.8407
SVC Test Accuracy: 0.6417
Sensitivity: 0.8
Specificity: 0.4833
Precision: 0.6076
ROC AUC Score: 0.6919


### Now try with TfidfVectorization

In [248]:
# Create pipeline
svct_pipe = Pipeline([
    ('ss', StandardScaler(with_mean=False)),
    ('svc', SVC(random_state=42, probability=True))
])

In [249]:
svc_t_params = {
    'svc__C': np.linspace(0.0001, 1, 20),
    'svc__kernel': ['rbf', 'poly'],
    'svc__degree': [2, 3]
}
gs_svc_t = GridSearchCV(svct_pipe, param_grid=svc_t_params, cv=5, n_jobs=-1, verbose=1)
gs_svc_t.fit(train_data_features_t, y_train)
print(gs_svc_t.best_score_) # cross-val score
gs_svc_t.best_params_

Fitting 5 folds for each of 80 candidates, totalling 400 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:    1.5s
[Parallel(n_jobs=-1)]: Done 168 tasks      | elapsed:    8.8s
[Parallel(n_jobs=-1)]: Done 400 out of 400 | elapsed:   22.4s finished


0.6364285714285713


{'svc__C': 1.0, 'svc__degree': 2, 'svc__kernel': 'rbf'}

In [250]:
gs_svc_train_acc_t = gs_svc_t.score(train_data_features_t, y_train)

In [251]:
gs_svc_test_acc_t = gs_svc_t.score(test_data_features_t, y_test)

In [252]:
svc_preds_t = gs_svc_t.predict(test_data_features_t)
svc_probs_t = gs_svc_t.predict_proba(test_data_features_t)

In [253]:
tn, fp, fn, tp = confusion_matrix(y_test, svc_preds_t).ravel()

print(confusion_matrix(y_test, svc_preds_t))

[[207  93]
 [128 172]]


In [254]:
svc_t_sensitivity = tp / (tp + fn)
svc_t_specificity = tn / (tn + fp)
svc_t_roc_auc = roc_auc_score(y_test, svc_probs_t[:, 1])
print(f'SVC CV Accuracy: {round(gs_svc_t.best_score_, 4)}')
print(f'SVC Train Accuracy: {round(gs_svc_train_acc_t, 4)}')
print(f'SVC Test Accuracy: {round(gs_svc_test_acc_t, 4)}')
print(f'Sensitivity: {round(svc_t_sensitivity, 4)}')
print(f'Specificity: {round(svc_t_specificity, 4)}')
print(f'Precision: {round(precision_score(y_test, svc_preds_t), 4)}')
print(f'ROC AUC Score: {round(svc_t_roc_auc, 4)}')

SVC CV Accuracy: 0.6364
SVC Train Accuracy: 0.9421
SVC Test Accuracy: 0.6317
Sensitivity: 0.5733
Specificity: 0.69
Precision: 0.6491
ROC AUC Score: 0.6999


## Modeling Summary

**Model Summary Table**
<br>Sorted in descending order by ROC AUC Score
<br>Numbers in **bold** represent the highest score on a given measure

|Stopwords Used|Vectorizer/Classifier|CV Accuracy|Train Accuracy|Test Accuracy|Test Sensitivity|Test Specificity|Test Precision|ROC AUC Score|
|---|---|---|---|---|---|---|---|---|
|Standard English only|CVEC, Logistic Regression with LASSO|**.8157**|.9479|**.8033**|.7800|.8267|**.8182**|**.8958**|
|Standard + custom|CVEC, Gradient Boosting Classifier|.6829|.9707|.6250|.5833|.6667|.6364|.7263|
|Standard + custom|CVEC, Logistic Regression with LASSO|.6650|.8957|.6533|.7200|.5867|.6353|.7223|
|Standard + custom|TfidfVec, Logistic Regression with LASSO|.6621|.9207|.6550|.7100|.6000|.6396|.7012|
|Standard + custom|TfidfVec, Support Vector Classifier|.6364|.9421|.6317|.5733|.6900|.6491|.6999|
|Standard + custom|CVEC, Multinomial Naive Bayes Classifier|.6800|.7886|.6033|.5900|.6167|.6062|.6979|
|Standard + custom|CVEC, Support Vector Classifier|.6379|.8407|.6417|**.8000**|.4833|.6076|.6919|
|Standard + custom|CVEC, Random Forest Classifier|.6750|**.9771**|.6200|.5933|.6467|.6268|.6882|
|Standard + custom|TfidfVec, Gradient Boosting Classifier|.6800|.9443|.6050|.5233|.6867|.6255|.6751|
|Standard + custom|CVEC, Bagged Decision Tree Classifier|.6071|.6414|.5633|.2500|.8767|.6696|.5633|
|Standard + custom|CVEC, Decision Tree Classifier|.5657|.5964|.5433|.1567|**.9300**|.6912|.5433|

# Conclusions 

As mentioned in the Executive Summary, classifying two NFL team-based subreddits proved to be a challenging task, even before removing obviously identifying tokens (such as Cowboys, Eagles, Dallas, Philadelphia, etc.).

The reason for this is at least two-fold. For one thing, both subreddits consist of fans of NFL football, and therefore use a common vocabulary or terminology that may not differ enough for highly-accurate classification. And for another, I did not scrape a huge sample of posts, so some challenges may be a result of the relatively small sample size used for the analysis.

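That said, with the obvious tokens removed, my best models achieved cross-validated accuracy of 66% or better, test accuracy of about 65%, Sensitivity scores over 70%, and ROC AUC scores of roughly 0.70 to 0.73. So the models were decent, though I obviously hoped for better performance. Unfortunately, most of the models overfit the training data considerably: the boosting, random forest, SVM, and logistic regression models scored 0.84 to 0.98 on the training data but only 0.60 to 0.66 on the holdout sample.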
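
The degree of overfitting can be made concrete as the gap between train and test accuracy. The sketch below copies the custom-stopword rows from the Model Summary Table and ranks the models by that gap; the dictionary layout and short labels are purely illustrative:

```python
# (train accuracy, test accuracy) copied from the Model Summary Table,
# custom-stopword rows only
scores = {
    'CVEC, Gradient Boosting': (0.9707, 0.6250),
    'CVEC, Logistic Regression (LASSO)': (0.8957, 0.6533),
    'TfidfVec, Logistic Regression (LASSO)': (0.9207, 0.6550),
    'TfidfVec, Support Vector': (0.9421, 0.6317),
    'CVEC, Multinomial Naive Bayes': (0.7886, 0.6033),
    'CVEC, Support Vector': (0.8407, 0.6417),
    'CVEC, Random Forest': (0.9771, 0.6200),
    'TfidfVec, Gradient Boosting': (0.9443, 0.6050),
    'CVEC, Bagged Decision Tree': (0.6414, 0.5633),
    'CVEC, Decision Tree': (0.5964, 0.5433),
}

# Larger gap = more overfitting to the training data
gaps = {name: round(train - test, 4) for name, (train, test) in scores.items()}
for name, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{name}: {gap}')
```

The single and bagged decision trees show small gaps only because they score poorly on the training data as well, so a small gap here points to underfitting rather than good generalization.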

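Among the custom-stopword models, the Gradient Boosting Classifier on count-vectorized inputs achieved the highest cross-validated accuracy (.6829) and ROC AUC (.7263), though at the expense of lower Sensitivity (.5833) and a lower test accuracy (.6250) than the logistic regression models. The Multinomial Naive Bayes model performed similarly (CV accuracy .6800, test accuracy .6033).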

Overall, the models that seemed best able to balance accuracy and Sensitivity were Logistic Regression Classifiers applying L1 (LASSO) regularization.

Differences in word usage were revealed, with strong indicators often being references to various media members who cover one or the other team (such as Justin Melo or Todd Archer). Fans may have been reacting in the form of "did you see what Todd Archer posted?", for example. r/cowboys posts were more likely to contain references to other teams (like Tampa, Dolphins, Steelers, San Francisco). Certain vulgarities were more associated with one or the other subreddit as well.

**Recommendations**

For future research, larger and longer-term data pulls could be attempted to see whether increasing the overall sample size leads to less overfitting and better model performance.

Many posts only consisted of title text without more substantial selftext entries, so another approach would be to screen for posts that contain both title and selftext content.
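
A screening step along these lines could be as simple as the filter below. It is only a sketch: the `title` and `selftext` field names come from the posts described earlier, while the helper name, the sample posts, and the `[removed]` / `[deleted]` placeholder strings are assumptions about how missing bodies show up in the scraped data:

```python
def has_title_and_body(post):
    """True when a post has a title and real selftext."""
    placeholders = {'', '[removed]', '[deleted]'}   # assumed markers for missing bodies
    title = (post.get('title') or '').strip()
    body = (post.get('selftext') or '').strip()
    return bool(title) and body not in placeholders

# Made-up posts for illustration only
posts = [
    {'title': 'Draft thread', 'selftext': 'Who do we take in round one?'},
    {'title': 'Look at this pick', 'selftext': ''},
    {'title': 'Trade rumor', 'selftext': '[removed]'},
    {'title': 'Mock draft'},
]
kept = [p for p in posts if has_title_and_body(p)]
print(len(kept), 'of', len(posts), 'posts kept')
```

Since roughly three-quarters of the posts are title-only, a filter like this would need a much larger initial pull to leave a usable sample.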

Lastly, a more rigorous approach to stopword removal (such as removing ALL current and former player/coach/GM names, and even the names of local sports reporters) might surface topics that are more distinctive to each fan base.