# Modeling
Now that we've taken a look at some of the more interesting features of our data, and engineered a few along the way, let's see if we can get some machine learning going!

In [1]:
#imports
import pandas as pd
import numpy as np

import xgboost as xg

from nltk.tokenize import RegexpTokenizer

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, recall_score, precision_score

In [2]:
data = pd.read_csv('../data/reddit_comments_partial.csv')
data.head()

Unnamed: 0,comment,word_count,neg_sent,neu_sent,pos_sent,comp_sent,subreddit
0,What’s a ghost profile,4,0.535,0.465,0.0,-0.3182,0
1,"i’m bummed for sure, but i’ll improvise someth...",20,0.0,1.0,0.0,0.0,0
2,Makes my mouth water.,4,0.0,1.0,0.0,0.0,1
3,"2, 3, and 4 are all the same car. 2 has sunroo...",35,0.07,0.93,0.0,-0.2755,0
4,"i wish there were a ""love"" button. damn!!! bea...",18,0.12,0.489,0.39,0.7249,1


In [3]:
# check once more to make sure we didn't get any null values
data.isna().sum()

comment       0
word_count    0
neg_sent      0
neu_sent      0
pos_sent      0
comp_sent     0
subreddit     0
dtype: int64

## Baseline
Let's look at the exact class split of our final data to get a good idea of the baseline score we need to beat.

In [4]:
data['subreddit'].value_counts(normalize=True)

0    0.500994
1    0.499006
Name: subreddit, dtype: float64

Our baseline accuracy to beat is about 50%, since the two classes are almost evenly split. We'll measure accuracy against this, but we'll also keep an eye on our F1 scores to make sure our misclassifications are at least somewhat balanced. Quick reminder that oddlysatisfying is our positive class (1) and mildlyinfuriating is our negative class (0).
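As a rough illustration of where that baseline comes from, here is a minimal sketch of a "always guess the most common class" score. The `majority_baseline` helper and the toy label list are made up for illustration; the toy list only mimics the roughly 50.1% / 49.9% split shown above.

```python
from collections import Counter

def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

# hypothetical labels that mimic the class split above
toy_labels = [0] * 501 + [1] * 499
label, accuracy = majority_baseline(toy_labels)
print(label, accuracy)  # 0 0.501
```

Any model that can't clear this number is doing no better than ignoring the comments entirely.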

## Additional Transformations
One final thing I would like to add is a way to remove punctuation, which reduces dimensionality and keeps model run times down. We'll use a regular expression tokenizer for this. We'll also have to redo our CountVectorizer transformation, since the earlier version caused storage issues.

In [5]:
# initialize our tokenizer
tokenizer = RegexpTokenizer(r'\w+')

# clean all punctuation out of our comments and lowercase them
data['comment'] = [' '.join(tokenizer.tokenize(str(value).lower())) for value in data['comment']]
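To see what this cleaning step does to a single comment, here is a small sketch that uses the standard library's `re.findall` as a stand-in for `RegexpTokenizer(r'\w+')`, since both pull out runs of word characters. The `strip_punctuation` name is just for illustration.

```python
import re

def strip_punctuation(comment):
    """Lowercase a comment and keep only runs of word characters."""
    # stand-in for tokenizer.tokenize(): \w+ matches letters, digits, and underscores
    return " ".join(re.findall(r"\w+", str(comment).lower()))

print(strip_punctuation("What’s a ghost profile"))
# what s a ghost profile
print(strip_punctuation('i wish there were a "love" button. damn!!!'))
# i wish there were a love button damn
```

Notice that the apostrophe splits "what’s" into "what" and "s". That is why fragments like 's', 't', 'don', and 'll' show up in the stop word list below.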

In [6]:
# stop words we determined from our cleaning step
stops = ['like', 'just', 'don', 'know', 'think', 'time', 'people', 'looks', 'good', 'really', 'make', 've', 'way', 'want', 'lol', 'thing', 'did', 'work', 'right',
         'need', 'use', 'look', 'does', 'water', 'got', 'thought', 'used', 'yeah', 'going', 'shit', 'pretty', 'say', 'actually', 'probably', 'sure', 'didn', 'll',
         'doesn', 'little', 'makes', 'lot', 'day', 'yes', 'years', 'things', 'better', 'oh', 'isn', 'feel', 'doing', 'long', 'man', 'stuff', 'fuck', 'different',
         'maybe', 'mean', 'gt', 'new', 'bad', 'getting', 'said', 'job', 'fucking', 'life', 'point', 'old', 'post', 'car', 'eat', 'person', 'house', 'wrong', 'big',
         'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he',
         'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what',
         'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having',
         'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against',
         'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again',
         'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
         'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll',
         'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven',
         "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't",
         'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

In [7]:
# initialize our bag of words transformers
cvec = CountVectorizer(ngram_range=(1,5),      # check groups of up to 5 consecutive words
                       stop_words=stops,       # set our ignored words
                       max_features=10_000,    # limit the total number of features to 10,000
                       strip_accents='ascii',  # fold accented characters to their base ascii form to prevent duplicates of the same word
                       min_df=3)               # require a term to appear in at least 3 comments to be considered

# declare our X and y variables
X = data.drop('subreddit', axis=1)
y = data['subreddit']

# transform our comments into a new dataframe
Z = cvec.fit_transform(X['comment'])
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
Z_fit = pd.DataFrame(Z.toarray().astype('uint8'), columns=cvec.get_feature_names_out())

# add back in our word count and sentiment score columns
Z_fit[X.columns] = X
Z_fit.drop('comment', axis=1, inplace=True)

# check to make sure everything combined correctly
Z_fit.head()

Unnamed: 0,00,000,01,02,03,04,05,07,08,10,...,zeros,zikr,zone,zoomed,zucchini,word_count,neg_sent,neu_sent,pos_sent,comp_sent
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,4,0.535,0.465,0.0,-0.3182
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,20,0.0,1.0,0.0,0.0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,4,0.0,1.0,0.0,0.0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,35,0.07,0.93,0.0,-0.2755
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,18,0.12,0.489,0.39,0.7249
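To make the `ngram_range` and `stop_words` settings more concrete, here is a rough pure-Python stand-in for what the vectorizer's word analyzer does to one comment. It is a sketch, not scikit-learn's actual implementation, and `word_ngrams` is a made-up name. It assumes the default token pattern (tokens of two or more word characters) and that stop words are dropped before n-grams are built.

```python
import re

def word_ngrams(text, stop_words, max_n=5):
    """Approximate CountVectorizer's word n-grams for a single document."""
    # default-style tokenization: keep tokens with at least two word characters
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    # stop words are removed first, so n-grams can span the gaps they leave
    tokens = [t for t in tokens if t not in stop_words]
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

# row 2 of our data, using three words that are in our stop list
print(word_ngrams("makes my mouth water", {"makes", "my", "water"}))
# ['mouth']

# a made-up comment, limited to bigrams to keep the output short
print(word_ngrams("a love button would be satisfying", {"a", "would", "be"}, max_n=2))
# ['love', 'button', 'satisfying', 'love button', 'button satisfying']
```

The first example shows why so many rows above are all zeros: after stop word removal a short comment can be left with one token, and that token still has to pass `min_df` and make the top 10,000 features to get a column.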


In [8]:
# finding an instance of a non-zero value in our data for use in presentation slides
Z_fit[1159:1170]

Unnamed: 0,00,000,01,02,03,04,05,07,08,10,...,zeros,zikr,zone,zoomed,zucchini,word_count,neg_sent,neu_sent,pos_sent,comp_sent
1159,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,23,0.0,0.847,0.153,0.5859
1160,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,46,0.0,0.785,0.215,0.8648
1161,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,6,0.0,1.0,0.0,0.0
1162,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,90,0.074,0.797,0.129,0.7269
1163,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0.0,1.0,0.0,0.0
1164,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,13,0.199,0.678,0.123,-0.3004
1165,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,3,0.0,0.476,0.524,0.296
1166,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,9,0.0,0.778,0.222,0.431
1167,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,5,0.16,0.84,0.0,-0.2732
1168,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,3,0.0,0.328,0.672,0.6249


In [9]:
# check our shape and make sure we still don't have any nulls before modeling
print(f'Shape after transformation: {Z_fit.shape}')
print(f'Null count after transformation: {Z_fit.isna().sum().sum()}')

Shape after transformation: (63388, 10004)
Null count after transformation: 0
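For a sense of why the `uint8` cast matters at this shape, here is a back-of-the-envelope memory estimate for just the bag-of-words block. It assumes roughly 10,000 count columns and compares one byte per value against the eight bytes of the `int64` counts that CountVectorizer produces by default. The `dense_bytes` helper is made up for illustration.

```python
def dense_bytes(n_rows, n_cols, bytes_per_value):
    """Size of a dense numeric block, ignoring index and container overhead."""
    return n_rows * n_cols * bytes_per_value

n_rows, n_terms = 63_388, 10_000
for dtype, width in [("uint8", 1), ("int64", 8)]:
    gib = dense_bytes(n_rows, n_terms, width) / 2**30
    print(f"{dtype}: {gib:.2f} GiB")
# uint8: 0.59 GiB
# int64: 4.72 GiB
```

The trade-off is that `uint8` can only hold values up to 255, which is only safe as long as no term appears more than 255 times in a single comment.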


## Model Fitting
Let's go ahead and split our data into training and testing sets and see if we can get some decent models running!

In [10]:
# create our split for training and testing
X_train, X_test, y_train, y_test = train_test_split(Z_fit,y,random_state=413,train_size=.67)

## !!QUICK DISCLAIMER HERE!!
These models were fit on a machine with 32 GB of DDR4 3200 MHz RAM, a Ryzen 3900X 12-core processor, and an RTX 3060 Ti GPU with CUDA enabled. Even on that hardware the code below took several minutes to run, so proceed with caution. If your machine does not have a CUDA-capable GPU, remove `tree_method="gpu_hist"` from the `xg_model = xg.sklearn.XGBClassifier(tree_method="gpu_hist")` line.
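If you'd rather not edit that line by hand, one option is to build the keyword arguments from a flag. This is only a sketch: `xgb_kwargs` and `use_gpu` are hypothetical names, and leaving `tree_method` out simply lets XGBoost fall back to its default. Newer XGBoost releases may also prefer a different way of selecting the GPU, so check the docs for your installed version.

```python
def xgb_kwargs(use_gpu):
    """Keyword arguments for XGBClassifier, depending on whether a CUDA GPU is available."""
    # use_gpu is a hypothetical flag you set yourself; nothing here detects hardware
    return {"tree_method": "gpu_hist"} if use_gpu else {}

print(xgb_kwargs(use_gpu=True))   # {'tree_method': 'gpu_hist'}
print(xgb_kwargs(use_gpu=False))  # {}

# usage with the model from this notebook (not run here):
# xg_model = xg.sklearn.XGBClassifier(**xgb_kwargs(use_gpu=False))
```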

In [11]:
# # initialize, fit, and make predictions with a logistic regression model
# logreg = LogisticRegression(max_iter=10_000)
# logreg.fit(X_train,y_train)
# log_preds = logreg.predict(X_test)

# # print our scores
# print(f'Training Accuracy: {logreg.score(X_train, y_train)}')
# print(f'Testing Accuracy: {logreg.score(X_test, y_test)}')
# print(f'F1 Score: {f1_score(y_test, log_preds)}')
# print(f'Recall Score: {recall_score(y_test, log_preds)}')
# print(f'Precision Score: {precision_score(y_test, log_preds)}')

Training Accuracy: 0.8345145871105983
Testing Accuracy: 0.7581146326306228
F1 Score: 0.7680176049880799
Recall Score: 0.7979422692197771
Precision Score: 0.7402562969509501
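As a quick sanity check on how these scores relate, the F1 score is the harmonic mean of precision and recall, so it can be recomputed from the two values printed above. The `f1_from` helper is just for illustration.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# precision and recall reported for the logistic regression model
precision, recall = 0.7402562969509501, 0.7979422692197771
print(round(f1_from(precision, recall), 4))  # 0.768
```

Recall being higher than precision here means the model leans toward predicting the positive class (oddlysatisfying): it catches most of those comments but also pulls in a fair number of mildlyinfuriating ones.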


In [12]:
# # check our top coefficients
# log_coefficients = pd.DataFrame(zip(cvec.get_feature_names_out(),logreg.coef_[0]),columns=['Word','Coefficient']).sort_values(by = 'Coefficient')
# log_coefficients[abs(log_coefficients['Coefficient']) > .5]

Unnamed: 0,Word,Coefficient
5752,net,-3.104940
7109,refilling,-3.086177
6268,passphrase,-2.938167
5203,magician,-2.748970
7100,refers,-2.728523
...,...,...
6979,rave,2.674706
6127,osha,2.768829
9610,wedding,2.939022
9609,websites,3.003101


In [13]:
# # initialize, fit, and make predictions with a random forest model
# forrest = RandomForestClassifier(max_features=100,max_depth=100)
# forrest.fit(X_train,y_train)
# forrest_preds = forrest.predict(X_test)

# # print our scores
# print(f'Training Accuracy: {forrest.score(X_train, y_train)}')
# print(f'Testing Accuracy: {forrest.score(X_test, y_test)}')
# print(f'F1 Score: {f1_score(y_test, forrest_preds)}')
# print(f'Recall Score: {recall_score(y_test, forrest_preds)}')
# print(f'Precision Score: {precision_score(y_test, forrest_preds)}')

Training Accuracy: 0.8253078716240081
Testing Accuracy: 0.7192026387494622
F1 Score: 0.7340155768882448
Recall Score: 0.7721253691530914
Precision Score: 0.6994908086648831


In [14]:
# # initialize, fit, and make predictions with an XGBoost model
# xg_model = xg.sklearn.XGBClassifier(tree_method="gpu_hist")
# xg_model.fit(X_train,y_train)
# xg_preds = xg_model.predict(X_test)

# print(f'Training Accuracy: {xg_model.score(X_train, y_train)}')
# print(f'Testing Accuracy: {xg_model.score(X_test, y_test)}')
# print(f'F1 Score: {f1_score(y_test, xg_preds)}')
# print(f'Recall Score: {recall_score(y_test, xg_preds)}')
# print(f'Precision Score: {precision_score(y_test, xg_preds)}')



Training Accuracy: 0.7459087805222633
Testing Accuracy: 0.7215450069314977
F1 Score: 0.7378133861457442
Recall Score: 0.7807945127179194
Precision Score: 0.6993174061433447


# Results
It looks like our top contender was logistic regression, with the highest testing accuracy (0.758) and F1 score (0.768). It's still overfit, with training accuracy at 0.835, but with more time we could revisit the stop words to try to remove words that are highly common in both subreddits. Some manual hyperparameter adjustments were made, but grid searching was avoided due to hardware and time constraints. XGBoost had the smallest gap between training and testing accuracy (0.746 vs. 0.722), so it was the least overfit of the three, and it's the model I'd be most interested in exploring further. It would also be worth trying a few other modeling techniques, such as SVMs or other tree-based methods.
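To put the three models side by side, here is a small sketch that takes the training and testing accuracies printed above and computes the gap between them as a rough overfitting signal. The dictionary layout and variable names are just for illustration.

```python
# (training accuracy, testing accuracy) as reported in the cells above
scores = {
    "logistic regression": (0.8345145871105983, 0.7581146326306228),
    "random forest": (0.8253078716240081, 0.7192026387494622),
    "xgboost": (0.7459087805222633, 0.7215450069314977),
}

for name, (train, test) in scores.items():
    print(f"{name}: test={test:.3f}, gap={train - test:.3f}")
# logistic regression: test=0.758, gap=0.076
# random forest: test=0.719, gap=0.106
# xgboost: test=0.722, gap=0.024

best_test = max(scores, key=lambda name: scores[name][1])
smallest_gap = min(scores, key=lambda name: scores[name][0] - scores[name][1])
print(best_test, "|", smallest_gap)  # logistic regression | xgboost
```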

## Final Thoughts
It's clear that there is a real divide between the types of comments on each subreddit, but it would take more time and research to truly home in on a great model that could accurately distinguish between the two. We still managed to score well above our baseline, but I think more work needs to be done before this model could become really useful.