# Part 4: Preprocessing and Modeling

---

## Notebook Summary

This notebook performs some basic preprocessing of the data to prepare for a baseline model and then iterates until reaching a production model. We begin by trying multiple feature sets in a preliminary model while holding the hyperparameter grid constant, which allows us to do some targeted feature selection. We then compare subsequent models to our baseline to evaluate their effectiveness in classifying a text post as belonging to either the ADHD or autism subreddit. Included in this notebook, the reader will find:

* Null Baseline and Preliminary Logistic Regression Models
* Model Iterations and Evaluations
* Production Model
* Notebook Conclusion

---

## Null Baseline and Preliminary Logistic Regression Models

In this section, we will develop a null baseline and a set of preliminary models of our data. We begin by importing all necessary libraries for our preprocessing and modeling and reading in our dataset.

In [7]:
# import requisite libraries
import pandas as pd
import numpy as np

from nltk.corpus import stopwords

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer

import warnings
warnings.simplefilter('ignore')
warnings.filterwarnings('ignore', category = UserWarning)

In [8]:
# reads in dataset
sr_posts = pd.read_csv('./data_files/sr_posts_cleaned.csv')

sr_posts

| | title | selftext | subreddit | post_word_count | post_length | selftext_lemma | sentiment |
|---|---|---|---|---|---|---|---|
| 0 | Megathread: US Medication Shortage | as many of you are aware by now the current u ... | 0 | 319 | 2033 | a many of you are aware by now the current u s... | -0.9567 |
| 1 | Did you do something you're proud of? Somethin... | what success have you had this week did you ac... | 0 | 52 | 288 | what success have you had this week did you ac... | 0.9633 |
| 2 | The Vyvanse poops have taken over my mornings.. | i now wake up at least 1 5 hours early to ensu... | 0 | 85 | 379 | i now wake up at least 1 5 hour early to ensur... | 0.6124 |
| 3 | Why does someone forcing you to push through e... | i can t even explain how it hurts but it s so ... | 0 | 62 | 304 | i can t even explain how it hurt but it s so m... | -0.8755 |
| 4 | Just had an epiphany- isn’t it crazy how relig... | so my mom can believe in all her saints god je... | 0 | 175 | 897 | so my mom can believe in all her saint god jes... | -0.9856 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 4579 | Scared of old people | anyone else scared of old people when i talk t... | 1 | 77 | 398 | anyone else scared of old people when i talk t... | -0.7933 |
| 4580 | Recommendations for blocking out noise for sleep | i have a really hard time sleeping due to even... | 1 | 116 | 588 | i have a really hard time sleeping due to even... | -0.6284 |
| 4581 | Mom Refuses Access to Diagnosis Report | i 17f was diagnosed with autism and adhd aroun... | 1 | 83 | 403 | i 17f wa diagnosed with autism and adhd around... | -0.4019 |
| 4582 | Do you feel uncomfortable with everyday sounds? | i m talking about inevitable sounds as birds d... | 1 | 204 | 1067 | i m talking about inevitable sound a bird dog ... | 0.3574 |


In [9]:
# check for nulls
sr_posts.isnull().sum()

title               0
selftext            7
subreddit           0
post_word_count     0
post_length         0
selftext_lemma     10
sentiment           0
dtype: int64

In [10]:
# drop all null values
sr_posts.dropna(inplace = True)

In [11]:
# check for nulls
sr_posts.isnull().sum()

title              0
selftext           0
subreddit          0
post_word_count    0
post_length        0
selftext_lemma     0
sentiment          0
dtype: int64

We will start by exploring the null baseline model, that is, the accuracy obtained by always predicting the majority class. This will then be used as an ongoing yardstick to compare other models moving forward. Since the data were collected fairly evenly from both subreddits, we expect this baseline to sit near 50%. The reader will recall that we are hoping to achieve an accuracy score of at least 90% on our test set, exceeding our null baseline score by approximately 40 percentage points.

In [12]:
# displays proportion of posts belonging to each subreddit
sr_posts['subreddit'].value_counts(normalize = True)

1    0.500219
0    0.499781
Name: subreddit, dtype: float64

According to this null model, always guessing the majority class (autism, coded as 1) yields a baseline accuracy of approximately 0.50. Any subsequent model must exceed this score to show that it is performing better than the null baseline.
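As a minimal sketch of how this null baseline can be computed by hand, the snippet below always predicts the majority class and measures the resulting accuracy. The `labels` list is hypothetical toy data, not the notebook's dataset.

```python
from collections import Counter

# hypothetical toy labels (0 = ADHD, 1 = autism); not the real dataset
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1]

# the null model always predicts the most frequent class
majority_class, majority_count = Counter(labels).most_common(1)[0]

# accuracy of always guessing the majority class
null_accuracy = majority_count / len(labels)

print(f'Majority class: {majority_class}')
print(f'Null baseline accuracy: {null_accuracy:.4f}')
```

On the real data, the same quantity is simply the largest value returned by `value_counts(normalize = True)` above.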

We will be looking at a few different preliminary models for the purpose of feature selection. The goal here is to iterate over different features on the same model before iterating with the final, selected features on multiple different models.

Let us first look at a preliminary model which uses a count vectorizer and logistic regression on the cleaned selftext, with our new stop words from our EDA but without lemmatization.

In [13]:
# declares features and target variable, then train/test splits dataset
X = sr_posts['selftext']
y = sr_posts['subreddit']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, random_state = 42)

In [14]:
# builds a list of stop words including English and adding subreddit specific words
new_stop_words = stopwords.words('english')
incl_stop_words = ['don', 've', 'autism', 'autistic', 'adhd', 'medication', 'meds']
for word in incl_stop_words:
    new_stop_words.append(word)

In [15]:
# creates a pipeline for the CountVectorizer and Logistic Regression
pipe = Pipeline([
    ('cvec', CountVectorizer(stop_words = new_stop_words)),
    ('lr', LogisticRegression(max_iter = 5000))
])

In [16]:
# creates the pipe params for pipe
pipe_params = {
    'cvec__max_features': [4000, 5000],
    'cvec__min_df': [2, 4, 6],
    'cvec__max_df': [1.0, 0.8],
    'lr__penalty': ['l2', None],
    'lr__C': [0.1, 1, 10]
}
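
To get a rough sense of how much work this grid implies, the sketch below counts the candidate combinations with `itertools.product`. It assumes `GridSearchCV` is left at its default of 5-fold cross-validation, which applies whenever no `cv` argument is passed.

```python
from itertools import product

# same grid as above, repeated here so the snippet is self-contained
pipe_params = {
    'cvec__max_features': [4000, 5000],
    'cvec__min_df': [2, 4, 6],
    'cvec__max_df': [1.0, 0.8],
    'lr__penalty': ['l2', None],
    'lr__C': [0.1, 1, 10]
}

# every combination of one value per hyperparameter
n_candidates = len(list(product(*pipe_params.values())))

# assumed: default 5-fold cross-validation
n_folds = 5
n_fits = n_candidates * n_folds

print(f'Candidate combinations: {n_candidates}')
print(f'Total cross-validation fits: {n_fits}')
```

This count excludes the one final refit of the best candidate on the full training set.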

In [77]:
# runs a grid search over pipe for the given params
gs = GridSearchCV(pipe,
                 param_grid=pipe_params,
                 n_jobs = -1)

# fits the grid search over our train data and prints best score and best params
gs.fit(X_train, y_train)
print(gs.best_score_) 
gs.best_params_

0.8131195335276967


{'cvec__max_df': 1.0,
 'cvec__max_features': 4000,
 'cvec__min_df': 6,
 'lr__C': 0.1,
 'lr__penalty': 'l2'}

In [78]:
# print training and test scores
print(f'Logistic Regression Training Score: {gs.score(X_train, y_train)}')
print(f'Logistic Regression Test Score: {gs.score(X_test, y_test)}')

Logistic Regression Training Score: 0.9551020408163265
Logistic Regression Test Score: 0.8243006993006993


This logistic regression model appears to be quite overfit: the training score (0.955) is roughly 13 percentage points higher than the test score (0.824).

We will continue by investigating a logistic regression model with lemmatizing and, after that, logistic regression models with bigrams. For the purposes of comparison, we will leave the parameters in the grid search the same. Again, the rationale is to determine, based on this preliminary logistic regression model, which features we should use in iterating with other models: the vectorized selftext alone, the vectorized lemmatized selftext, the vectorized selftext bigrams, or the vectorized lemmatized selftext bigrams.

In [79]:
# declares features and target variable, then train/test splits dataset
X = sr_posts['selftext_lemma']
y = sr_posts['subreddit']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, random_state = 42)

In [80]:
# runs a grid search over pipe for the given params
gs = GridSearchCV(pipe,
                 param_grid=pipe_params,
                 n_jobs = -1)

# fits the grid search over our train data and prints best score and best params
gs.fit(X_train, y_train)
print(gs.best_score_) 
gs.best_params_

0.8206997084548104


{'cvec__max_df': 1.0,
 'cvec__max_features': 5000,
 'cvec__min_df': 2,
 'lr__C': 0.1,
 'lr__penalty': 'l2'}

In [81]:
# print training and test scores
print(f'Logistic Regression Lemmatized Training Score: {gs.score(X_train, y_train)}')
print(f'Logistic Regression Lemmatized Test Score: {gs.score(X_test, y_test)}')

Logistic Regression Lemmatized Training Score: 0.9565597667638484
Logistic Regression Lemmatized Test Score: 0.8199300699300699


This second preliminary model shows a slightly higher cross-validation score (0.821 versus 0.813) and a marginally lower test score (0.820 versus 0.824). The training score is also slightly higher on the lemmatized text. Because the cross-validation score is the fairer basis for choosing between feature sets, this would indicate that the lemmatized text might be slightly better to use as our starting features than the original text with the stop words.
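The three scores compared above come from different places in the grid search, which is easy to lose track of. The sketch below shows where each one originates, using synthetic data from `make_classification` as a stand-in for the vectorized posts (an assumption made purely for illustration).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# hypothetical synthetic data, standing in for the vectorized posts
X_toy, y_toy = make_classification(n_samples = 300, n_features = 20, random_state = 42)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, stratify = y_toy, random_state = 42)

gs_toy = GridSearchCV(LogisticRegression(max_iter = 5000),
                      param_grid = {'C': [0.1, 1, 10]})
gs_toy.fit(X_tr, y_tr)

# best_score_: mean accuracy over the validation folds for the best candidate
cv_score = gs_toy.best_score_

# score(): accuracy of the best candidate after it is refit on all of X_tr
train_score = gs_toy.score(X_tr, y_tr)
test_score = gs_toy.score(X_te, y_te)

print(f'Cross-validation score: {cv_score}')
print(f'Training score: {train_score}')
print(f'Test score: {test_score}')
```

Only the cross-validation score is computed on data the fitted candidate had not seen while still staying inside the training set, which is why it is a reasonable basis for choosing between feature sets.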

We will run two more iterations of this logistic regression, holding the parameters constant: one with bigrams of the original text and another with bigrams of the lemmatized text.

In [82]:
# declares features and target variable, then train/test splits dataset
X = sr_posts['selftext']
y = sr_posts['subreddit']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, random_state = 42)

In [83]:
# creates a pipeline for the CountVectorizer and Logistic Regression
pipe = Pipeline([
    ('cvec', CountVectorizer(stop_words = new_stop_words, ngram_range = (2, 2))),
    ('lr', LogisticRegression(max_iter = 10000))
])

In [84]:
# runs a grid search over pipe for the given params
gs = GridSearchCV(pipe,
                 param_grid=pipe_params,
                 n_jobs = -1)

# fits the grid search over our train data and prints best score and best params
gs.fit(X_train, y_train)
print(gs.best_score_) 
gs.best_params_

0.70932944606414


{'cvec__max_df': 1.0,
 'cvec__max_features': 4000,
 'cvec__min_df': 4,
 'lr__C': 0.1,
 'lr__penalty': 'l2'}

In [85]:
# print training and test scores
print(f'Logistic Regression Bigram Training Score: {gs.score(X_train, y_train)}')
print(f'Logistic Regression Bigram Test Score: {gs.score(X_test, y_test)}')

Logistic Regression Bigram Training Score: 0.88600583090379
Logistic Regression Bigram Test Score: 0.7106643356643356


The bigram features perform much worse than the single-word selftext features, with cross-validation and test scores of roughly 0.71 compared with roughly 0.82.

Finally, we shall see if lemmatized bigrams perform any better.

In [86]:
# declares features and target variable, then train/test splits dataset
X = sr_posts['selftext_lemma']
y = sr_posts['subreddit']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, random_state = 42)

In [87]:
# runs a grid search over pipe for the given params
gs = GridSearchCV(pipe,
                 param_grid=pipe_params,
                 n_jobs = -1)

# fits the grid search over our train data and prints best score and best params
gs.fit(X_train, y_train)
print(gs.best_score_) 
gs.best_params_

0.7081632653061225


{'cvec__max_df': 1.0,
 'cvec__max_features': 4000,
 'cvec__min_df': 4,
 'lr__C': 0.1,
 'lr__penalty': 'l2'}

In [88]:
# print training and test scores
print(f'Logistic Regression Lemmatized Bigram Training Score: {gs.score(X_train, y_train)}')
print(f'Logistic Regression Lemmatized Bigram Test Score: {gs.score(X_test, y_test)}')

Logistic Regression Lemmatized Bigram Training Score: 0.8915451895043732
Logistic Regression Lemmatized Bigram Test Score: 0.7027972027972028


Based on what we are seeing here, the lemmatized bigrams perform slightly worse than the bigrams of the original text.

Comparing all four preliminary logistic regression models, it does appear that the lemmatized selftext will likely make the best features to use moving forward with subsequent iterations of our model, since it achieves the highest cross-validation score. Therefore, we will keep the lemmatized selftext as our features and begin iterating with different types of models.
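The comparison can be laid out side by side with a few lines of plain Python. The scores below are copied from the four preliminary runs above and rounded to four decimal places; the dictionary layout and names are just one convenient way to organize them.

```python
# scores copied from the four preliminary runs above (rounded)
results = {
    'selftext unigrams':   {'cv': 0.8131, 'train': 0.9551, 'test': 0.8243},
    'lemmatized unigrams': {'cv': 0.8207, 'train': 0.9566, 'test': 0.8199},
    'selftext bigrams':    {'cv': 0.7093, 'train': 0.8860, 'test': 0.7107},
    'lemmatized bigrams':  {'cv': 0.7082, 'train': 0.8915, 'test': 0.7028},
}

# rank feature sets by mean cross-validation accuracy, best first
ranked = sorted(results.items(), key = lambda item: item[1]['cv'], reverse = True)

for name, scores in ranked:
    gap = scores['train'] - scores['test']  # rough overfitting indicator
    print(f"{name:<20} cv={scores['cv']:.4f}  test={scores['test']:.4f}  gap={gap:.4f}")

best_features = ranked[0][0]
print(f'Best feature set by cross-validation: {best_features}')
```

Ranking on the cross-validation score rather than the test score keeps the test set out of the selection decision, even though the unlemmatized text scores marginally higher on the test set.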

In the next section, we shall iterate with multiple classification models using the lemmatized selftext as our features.

---

## Model Iterations

For each of the following models, we will build a pipeline, specify grid search parameters, run a grid search, and then do a cursory evaluation with an accuracy score.

We will begin first with the K-Nearest Neighbors Classifier.

In [89]:
# declares features and target variable, then train/test splits dataset
X = sr_posts['selftext_lemma']
y = sr_posts['subreddit']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, random_state = 42)

In [90]:
# creates a pipeline for the CountVectorizer and KNN
pipe = Pipeline([
    ('cvec', CountVectorizer(stop_words = new_stop_words)),
    ('knn', KNeighborsClassifier())
])

In [91]:
# creates the pipe params for pipe
pipe_params = {
    'cvec__max_features': [4000, 5000],
    'cvec__min_df': [2, 4, 6],
    'cvec__max_df': [1.0, 0.8],
    'knn__n_neighbors': range(1, 51, 10),
    'knn__metric': ['euclidean', 'manhattan']
}

In [92]:
# runs a grid search over pipe for the given params
gs = GridSearchCV(pipe,
                 param_grid=pipe_params,
                 n_jobs = -1)

# fits the grid search over our train data and prints best score and best params
gs.fit(X_train, y_train)
print(gs.best_score_) 
gs.best_params_

0.5405247813411078


{'cvec__max_df': 1.0,
 'cvec__max_features': 4000,
 'cvec__min_df': 4,
 'knn__metric': 'euclidean',
 'knn__n_neighbors': 1}

In [93]:
# print training and test scores
print(f'K-Nearest Neighbors Training Score: {gs.score(X_train, y_train)}')
print(f'K-Nearest Neighbors Test Score: {gs.score(X_test, y_test)}')

K-Nearest Neighbors Training Score: 0.9991253644314869
K-Nearest Neighbors Test Score: 0.5454545454545454


We can see here that the K-Nearest Neighbors model is barely performing better than the baseline accuracy score, with a cross-validation score of 0.54 and a test score of 0.55. Therefore, K-Nearest Neighbors is not going to be a useful model for the purposes of our problem statement.
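The near-perfect training score alongside a near-chance test score is what one would expect when the grid search selects `n_neighbors = 1`: each training point is its own nearest neighbor, so the model simply recalls its label. The sketch below illustrates this on randomly generated toy data (an assumption for illustration, unrelated to the subreddit posts).

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# hypothetical toy data: random features with random labels (no real signal)
rng = np.random.default_rng(42)
X_toy = rng.normal(size = (200, 5))
y_toy = rng.integers(0, 2, size = 200)

# hold out the last 50 rows as a makeshift test set
X_tr, X_te = X_toy[:150], X_toy[150:]
y_tr, y_te = y_toy[:150], y_toy[150:]

# with one neighbor, every training point matches itself at distance zero
knn = KNeighborsClassifier(n_neighbors = 1)
knn.fit(X_tr, y_tr)

train_acc = knn.score(X_tr, y_tr)
test_acc = knn.score(X_te, y_te)

print(f'Toy training accuracy: {train_acc}')
print(f'Toy test accuracy: {test_acc}')
```

In the notebook the training score is 0.9991 rather than exactly 1.0, which may be due to a few posts sharing identical count vectors but different labels; that is a guess, not something verified here.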

Let us continue with a decision tree model.

In [94]:
# creates a pipeline for the CountVectorizer and Decision Tree
pipe = Pipeline([
    ('cvec', CountVectorizer(stop_words = new_stop_words)),
    ('dt', DecisionTreeClassifier())
])

In [95]:
# creates the pipe params for pipe
pipe_params = {
    'cvec__max_features': [4000, 5000],
    'cvec__min_df': [2, 4, 6],
    'cvec__max_df': [1.0, 0.8],
    'dt__max_depth': [3, 5, 7], 
    'dt__ccp_alpha': [0, 0.1, 1]
}

In [96]:
# runs a grid search over pipe for the given params
gs = GridSearchCV(pipe,
                 param_grid=pipe_params,
                 n_jobs = -1)

# fits the grid search over our train data and prints best score and best params
gs.fit(X_train, y_train)
print(gs.best_score_) 
gs.best_params_

0.7084548104956268


{'cvec__max_df': 0.8,
 'cvec__max_features': 4000,
 'cvec__min_df': 2,
 'dt__ccp_alpha': 0,
 'dt__max_depth': 7}

In [97]:
# print training and test scores
print(f'Decision Tree Training Score: {gs.score(X_train, y_train)}')
print(f'Decision Tree Test Score: {gs.score(X_test, y_test)}')

Decision Tree Training Score: 0.7387755102040816
Decision Tree Test Score: 0.708916083916084


Looking at the cross-validation and test scores of approximately 0.71 each, we see that this model is definitely performing better than the KNN model above but is only performing at about the level of the logistic regression with bigram features. Therefore, we can see that the decision tree model ultimately is not a good fit for our problem statement.

Now, we shall continue with a few ensemble models.

In [104]:
# creates a pipeline for the CountVectorizer and Bagging Classifier
pipe = Pipeline([
    ('cvec', CountVectorizer(stop_words = new_stop_words)),
    ('bc', BaggingClassifier())
])

In [107]:
# creates the pipe params for pipe
pipe_params = {
    'cvec__max_features': [4000, 5000],
    'cvec__min_df': [2, 4, 6],
    'cvec__max_df': [1.0, 0.8],
    'bc__estimator': [LogisticRegression(max_iter = 10000), DecisionTreeClassifier()], 
    'bc__n_estimators': range(10, 31, 10)
}

In [108]:
# runs a grid search over pipe for the given params
gs = GridSearchCV(pipe,
                 param_grid=pipe_params,
                 n_jobs = -1)

# fits the grid search over our train data and prints best score and best params
gs.fit(X_train, y_train)
print(gs.best_score_) 
gs.best_params_

0.8137026239067054


{'bc__estimator': LogisticRegression(max_iter=10000),
 'bc__n_estimators': 10,
 'cvec__max_df': 1.0,
 'cvec__max_features': 5000,
 'cvec__min_df': 4}

In [109]:
# print training and test scores
print(f'Bagging Classifier Training Score: {gs.score(X_train, y_train)}')
print(f'Bagging Classifier Test Score: {gs.score(X_test, y_test)}')

Bagging Classifier Training Score: 0.972594752186589
Bagging Classifier Test Score: 0.8155594405594405


Regarding the Bagging Classifier, we do see a rather high cross-validation score of 0.81 and test score of 0.82. These are close to our logistic regression scores in the preliminary model. However, they are still slightly lower than the logistic regression scores on the lemmatized text, so our preliminary model remains the best classifier so far.

We will now continue with a random forest classifier.

In [110]:
# creates a pipeline for the CountVectorizer and Random Forest
pipe = Pipeline([
    ('cvec', CountVectorizer(stop_words = new_stop_words)),
    ('rf', RandomForestClassifier())
])

In [111]:
# creates the pipe params for pipe
pipe_params = {
    'cvec__max_features': [4000, 5000],
    'cvec__min_df': [2, 4, 6],
    'cvec__max_df': [1.0, 0.8],
    'rf__n_estimators': range(1, 301, 100),
    'rf__max_depth': [None, 2, 4]
}

In [112]:
# runs a grid search over pipe for the given params
gs = GridSearchCV(pipe,
                 param_grid=pipe_params,
                 n_jobs = -1)

# fits the grid search over our train data and prints best score and best params
gs.fit(X_train, y_train)
print(gs.best_score_) 
gs.best_params_

0.8163265306122449


{'cvec__max_df': 1.0,
 'cvec__max_features': 4000,
 'cvec__min_df': 6,
 'rf__max_depth': None,
 'rf__n_estimators': 201}

In [113]:
# print training and test scores
print(f'Random Forest Classifier Training Score: {gs.score(X_train, y_train)}')
print(f'Random Forest Classifier Test Score: {gs.score(X_test, y_test)}')

Random Forest Classifier Training Score: 0.9991253644314869
Random Forest Classifier Test Score: 0.8138111888111889


For the random forest classifier, we also see that the cross-validation score of 0.82 and the test score of 0.81 are both high and very close to those of the preliminary logistic regression model. However, these scores are still somewhat lower than those of the logistic regression, so the logistic regression is still our best model so far.

Let us try one more, a boosting model using the AdaBoost ensemble.

In [114]:
# creates a pipeline for the CountVectorizer and AdaBoost
pipe = Pipeline([
    ('cvec', CountVectorizer(stop_words = new_stop_words)),
    ('ab', AdaBoostClassifier())
])

In [115]:
# creates the pipe params for pipe
pipe_params = {
    'cvec__max_features': [4000, 5000],
    'cvec__min_df': [2, 4, 6],
    'cvec__max_df': [1.0, 0.8],
    'ab__n_estimators': [50, 100, 150],
    'ab__learning_rate': [0.9, 1.1]
}

In [116]:
# runs a grid search over pipe for the given params
gs = GridSearchCV(pipe,
                 param_grid=pipe_params,
                 n_jobs = -1)

# fits the grid search over our train data and prints best score and best params
gs.fit(X_train, y_train)
print(gs.best_score_) 
gs.best_params_

0.7979591836734693


{'ab__learning_rate': 1.1,
 'ab__n_estimators': 150,
 'cvec__max_df': 1.0,
 'cvec__max_features': 4000,
 'cvec__min_df': 2}

In [117]:
# print training and test scores
print(f'AdaBoost Classifier Training Score: {gs.score(X_train, y_train)}')
print(f'AdaBoost Classifier Test Score: {gs.score(X_test, y_test)}')

AdaBoost Classifier Training Score: 0.8880466472303207
AdaBoost Classifier Test Score: 0.7788461538461539


Looking at these final scores for the AdaBoost Classifier, we can see that the cross-validation and test scores are substantially lower than the logistic regression scores. Therefore, at this point, we can safely conclude that our final production model should be the logistic regression model.

We shall continue forward with the logistic regression model as our final production model with the following features and hyperparameters:

* Features: Lemmatized Text with English and Modified Stop Words
* Max Document Frequency: 100%
* Maximum Features: 5,000
* Minimum Document Frequency: 2
* Logistic Regression C-value: 0.1
* Logistic Regression Penalty: L2 "Ridge"
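
As a minimal sketch, the settings listed above could be assembled into a single pipeline as follows. The four-post corpus and its labels are hypothetical, and `toy_stop_words` contains only the custom additions as a stand-in for the full list (the notebook combines them with NLTK's English stop words).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# stand-in stop word list; the notebook adds these to NLTK's English list
toy_stop_words = ['don', 've', 'autism', 'autistic', 'adhd', 'medication', 'meds']

# pipeline using the hyperparameters listed above
production_pipe = Pipeline([
    ('cvec', CountVectorizer(stop_words = toy_stop_words,
                             max_df = 1.0,
                             max_features = 5000,
                             min_df = 2)),
    ('lr', LogisticRegression(C = 0.1, penalty = 'l2', max_iter = 5000))
])

# hypothetical miniature corpus (0 = ADHD, 1 = autism), for illustration only
toy_posts = [
    'i forgot my appointment again and lost focus at work',
    'lost focus again and forgot to eat lunch',
    'loud sound and bright light overwhelm me at the store',
    'bright light and loud sound make the store hard'
]
toy_labels = [0, 0, 1, 1]

# fit on the toy posts and classify one unseen post
production_pipe.fit(toy_posts, toy_labels)
preds = production_pipe.predict(['i lost focus and forgot my keys'])
print(preds)
```

With `min_df = 2`, any word that appears in only one post (for example "appointment" in the toy corpus) is dropped from the vocabulary before the logistic regression is fit.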

---

## Notebook Conclusion

In this notebook, we looked at a null baseline accuracy score for the sake of comparison with all subsequent models. We then built a series of preliminary models using logistic regression and the same hyperparameters over a grid search, changing only the features used in each model to do some more specific feature selection. Then, we iterated with those selected features through five additional models to compare against our null baseline and preliminary model and determine which was the best fit for our final production model. We determined that the logistic regression preliminary model outperforms the other iterated models, and we will use the logistic regression as our final production model.

Now that we have our final production model in the form of a logistic regression, we can begin a final evaluation of how well our model performs on other metrics besides accuracy alone. In Part 5, we will look at other metrics including precision and recall, as well as the f1-score, to determine how well our model is doing in answering our problem statement.