# Modeling

In this notebook I will try different classification models to find the one that best predicts whether a post came from the fantasy or scifi subreddit.

In [18]:
# imports
import pandas as pd
import numpy as np

from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier,
                              AdaBoostClassifier, StackingClassifier)
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import f1_score

---
## K-Nearest Neighbor

I'm going to start with KNN models utilizing CountVectorizer or TfidfVectorizer on titles only, posts only, and titles with posts. I will be looking at accuracy and F1 score.
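
Since both metrics are reported for every model below, here is a minimal sketch of how accuracy and F1 relate to the confusion counts. The helper name `accuracy_and_f1` and the toy labels are illustrative assumptions; the notebook itself relies on each estimator's `score` method and `sklearn.metrics.f1_score`, with scifi (1) as the positive class.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Compute accuracy and F1 for a binary problem from raw labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # F1 is the harmonic mean of precision and recall: 2TP / (2TP + FP + FN)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# hypothetical labels, only to exercise the helper
toy_true = [1, 1, 1, 0, 0, 1, 0, 1]
toy_pred = [1, 0, 1, 0, 1, 1, 0, 1]
acc, f1 = accuracy_and_f1(toy_true, toy_pred)
print(acc, f1)
```

Accuracy treats both classes equally, while F1 only looks at how well the positive class is recovered, which is why the two numbers can move in different directions between models.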

In [2]:
# read in data
combined = pd.read_csv("../data/combined_subreddits.csv")

# lemmatize titles and posts: lowercase, tokenize, lemmatize each token, rejoin
lem = WordNetLemmatizer()

def lemmatize_text(text):
    return " ".join(lem.lemmatize(token) for token in word_tokenize(text.lower()))

combined["title"] = combined["title"].map(lemmatize_text)
combined["post"] = combined["post"].map(lemmatize_text)

# convert subreddits fantasy=0 and scifi=1
combined["subreddit"] = combined["subreddit"].map({"fantasy": 0, "scifi": 1})

In [18]:
# create X and y
X = combined["title"]
y = combined["subreddit"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

In [19]:
# pipeline with countvectorizer and knn on titles
pipe1 = Pipeline([
    ("cvec", CountVectorizer()),
    ("ss", StandardScaler(with_mean=False)),
    ("knn", KNeighborsClassifier())
])

pipe1_params = {
    "cvec__stop_words": [None, "english"],
    "cvec__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "cvec__max_features": range(100, 501, 100),
    "cvec__min_df": range(2, 11),
    "cvec__max_df": [0.9, 0.95, 0.99],
    "knn__n_neighbors": range(3, 16, 2)
}

gs1 = GridSearchCV(pipe1, param_grid=pipe1_params, cv=5, n_jobs=-1, verbose=1)

gs1.fit(X_train, y_train)

Fitting 5 folds for each of 5670 candidates, totalling 28350 fits


In [20]:
gs1.best_params_, gs1.best_score_

({'cvec__max_df': 0.9,
  'cvec__max_features': 200,
  'cvec__min_df': 8,
  'cvec__ngram_range': (1, 3),
  'cvec__stop_words': 'english',
  'knn__n_neighbors': 15},
 0.8099318648379743)

In [21]:
gs1.score(X_train, y_train), gs1.score(X_test, y_test), f1_score(y_test, gs1.predict(X_test))

(0.8319959879638916, 0.7834586466165413, 0.8115183246073299)

In [22]:
# see baseline model score
y_train.value_counts(normalize=True)

subreddit
1    0.556169
0    0.443831
Name: proportion, dtype: float64

*This model is a little overfit but still performs better than the baseline. Let's see how KNN does with TfidfVectorizer.*
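
The baseline here is the accuracy of always predicting the most common class, which is what the `value_counts(normalize=True)` output above shows (about 0.556 for scifi). As a rough sketch, the same idea in plain Python, where `majority_baseline` and the toy labels are assumptions made for illustration:

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


# hypothetical labels that roughly mirror the 56/44 class split above
toy_labels = [1] * 5 + [0] * 4
label, baseline_acc = majority_baseline(toy_labels)
print(label, round(baseline_acc, 3))
```

Any model worth keeping should beat this number on the test set, not just on the training set.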

In [23]:
# knn with tfidfvectorizer on titles
pipe2 = Pipeline([
    ("tf", TfidfVectorizer()),
    ("ss", StandardScaler(with_mean=False)),
    ("knn", KNeighborsClassifier())
])

pipe2_params = {
    "tf__stop_words": [None, "english"],
    "tf__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "tf__max_features": range(50, 151, 10),
    "tf__min_df": range(2, 11),
    "tf__max_df": [0.90, 0.95, 0.99], 
    "knn__n_neighbors": range(3, 16, 2)
}

gs2 = GridSearchCV(pipe2, param_grid=pipe2_params, cv=5, n_jobs=-1, verbose=1)

gs2.fit(X_train, y_train)

Fitting 5 folds for each of 12474 candidates, totalling 62370 fits


In [24]:
gs2.best_params_, gs2.best_score_

({'knn__n_neighbors': 15,
  'tf__max_df': 0.9,
  'tf__max_features': 120,
  'tf__min_df': 4,
  'tf__ngram_range': (1, 2),
  'tf__stop_words': 'english'},
 0.8079369277465019)

In [25]:
gs2.score(X_train, y_train), gs2.score(X_test, y_test), f1_score(y_test, gs2.predict(X_test))

(0.8204613841524574, 0.7548872180451128, 0.788586251621271)

*This KNN model did worse. Next I'll try KNN on the posts. I'm also going to stop grid searching over stop words and always remove English stop words: they didn't show much impact during my EDA, and `stop_words="english"` has been in the best params each time I've searched with and without them.*
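
Fixing an option instead of searching over it also shrinks the grid. A small sketch of the arithmetic, using the `pipe1_params` grid copied from the first KNN search; the helper `n_candidates` is a name introduced here for illustration:

```python
from math import prod


def n_candidates(param_grid):
    """Number of parameter combinations a full grid search will try."""
    return prod(len(list(values)) for values in param_grid.values())


# grid copied from the first KNN search above
pipe1_params = {
    "cvec__stop_words": [None, "english"],
    "cvec__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "cvec__max_features": range(100, 501, 100),
    "cvec__min_df": range(2, 11),
    "cvec__max_df": [0.9, 0.95, 0.99],
    "knn__n_neighbors": range(3, 16, 2),
}
cv = 5

with_stop_words = n_candidates(pipe1_params)

# same grid with the stop word option fixed instead of searched
fixed = {k: v for k, v in pipe1_params.items() if k != "cvec__stop_words"}
without_stop_words = n_candidates(fixed)

print(with_stop_words, with_stop_words * cv)        # 5670 candidates, 28350 fits
print(without_stop_words, without_stop_words * cv)  # 2835 candidates, 14175 fits
```

Dropping a two-valued option halves the number of fits, which leaves room to widen the ranges that seem to matter more.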

In [39]:
X = combined["post"]
y = combined["subreddit"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

In [27]:
# knn with countvectorizer on posts
pipe3 = Pipeline([
    ("cvec", CountVectorizer(stop_words="english")),
    ("ss", StandardScaler(with_mean=False)),
    ("knn", KNeighborsClassifier())
])

pipe3_params = {
    "cvec__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "cvec__max_features": range(50, 301, 50),
    "cvec__min_df": range(2, 11),
    "cvec__max_df": [0.9, 0.95, 0.99],
    "knn__n_neighbors": range(3, 11, 2)
}

gs3 = GridSearchCV(pipe3, param_grid=pipe3_params, cv=5, n_jobs=-1, verbose=1)

gs3.fit(X_train, y_train)

Fitting 5 folds for each of 1944 candidates, totalling 9720 fits


In [28]:
gs3.best_params_, gs3.best_score_

({'cvec__max_df': 0.9,
  'cvec__max_features': 300,
  'cvec__min_df': 6,
  'cvec__ngram_range': (1, 2),
  'knn__n_neighbors': 9},
 0.8034130552511932)

In [29]:
gs3.score(X_train, y_train), gs3.score(X_test, y_test), f1_score(y_test, gs3.predict(X_test))

(0.8360080240722166, 0.7909774436090226, 0.8124156545209177)

*Not much better on the training set, but less overfit than the KNN models on titles, because the test accuracy improved.*

In [30]:
# knn with tfidfvectorizer with posts
pipe4 = Pipeline([
    ("tf", TfidfVectorizer(stop_words="english")),
    ("ss", StandardScaler(with_mean=False)),
    ("knn", KNeighborsClassifier())
])

pipe4_params = {
    "tf__ngram_range": [(1, 2), (1, 3)],
    "tf__max_features": range(50, 101, 10),
    "tf__min_df": range(2, 11),
    "tf__max_df": [0.9, 0.95, 0.99],
    "knn__n_neighbors": range(3, 16, 2)
}

gs4 = GridSearchCV(pipe4, param_grid=pipe4_params, cv=5, n_jobs=-1, verbose=1)

gs4.fit(X_train, y_train)

Fitting 5 folds for each of 2268 candidates, totalling 11340 fits


In [32]:
gs4.best_params_, gs4.best_score_

({'knn__n_neighbors': 13,
  'tf__max_df': 0.9,
  'tf__max_features': 90,
  'tf__min_df': 10,
  'tf__ngram_range': (1, 2)},
 0.8024004735456731)

In [31]:
gs4.score(X_train, y_train), gs4.score(X_test, y_test), f1_score(y_test, gs4.predict(X_test))

(0.8239719157472417, 0.7669172932330827, 0.7973856209150327)

*TfidfVectorizer also does worse than CountVectorizer on posts.*

In [59]:
# new X and y using both titles and posts
X = combined["title"] + " " + combined["post"]
y = combined["subreddit"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

In [60]:
# knn with countvectorizer on titles and posts
pipe5 = Pipeline([
    ("cvec", CountVectorizer(stop_words="english")),
    ("ss", StandardScaler(with_mean=False)),
    ("knn", KNeighborsClassifier())
])

pipe5_params = {
    "cvec__ngram_range": [(1, 2), (1, 3)],
    "cvec__max_features": range(100, 1001, 100),
    "cvec__min_df": range(2, 11),
    "cvec__max_df": [0.9, 0.95, 0.99],
    "knn__n_neighbors": range(3, 16, 2)
}

gs5 = GridSearchCV(pipe5, param_grid=pipe5_params, cv=5, n_jobs=-1, verbose=1)

gs5.fit(X_train, y_train)

Fitting 5 folds for each of 3780 candidates, totalling 18900 fits


In [61]:
gs5.best_params_, gs5.best_score_

({'cvec__max_df': 0.9,
  'cvec__max_features': 100,
  'cvec__min_df': 8,
  'cvec__ngram_range': (1, 3),
  'knn__n_neighbors': 3},
 0.7733063815317187)

In [62]:
gs5.score(X_train, y_train), gs5.score(X_test, y_test), f1_score(y_test, gs5.predict(X_test))

(0.8671013039117352, 0.7804511278195488, 0.8215158924205379)

*KNN did about the same using both titles and posts as it did using just one or the other. I'm going to skip fitting KNN with TfidfVectorizer on the combined text: it did worse on titles and on posts individually, and I wouldn't expect it to do significantly better with both since CountVectorizer didn't.*
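
To make the comparison across the five KNN runs easier to see, the scores printed above can be collected in one place. This is only a bookkeeping sketch: the numbers are the reported train accuracy, test accuracy and test F1 rounded to four decimals, and the variable names are my own.

```python
# (vectorizer, text, train accuracy, test accuracy, test F1), rounded from the outputs above
knn_runs = [
    ("CountVectorizer", "titles", 0.8320, 0.7835, 0.8115),
    ("TfidfVectorizer", "titles", 0.8205, 0.7549, 0.7886),
    ("CountVectorizer", "posts", 0.8360, 0.7910, 0.8124),
    ("TfidfVectorizer", "posts", 0.8240, 0.7669, 0.7974),
    ("CountVectorizer", "titles + posts", 0.8671, 0.7805, 0.8215),
]

summary = [
    {
        "model": f"{vec} on {text}",
        "test_acc": test,
        "f1": f1,
        # train minus test accuracy as a rough measure of overfitting
        "gap": round(train - test, 4),
    }
    for vec, text, train, test, f1 in knn_runs
]

best_by_test = max(summary, key=lambda row: row["test_acc"])
most_overfit = max(summary, key=lambda row: row["gap"])

for row in sorted(summary, key=lambda row: row["test_acc"], reverse=True):
    print(f'{row["model"]:<36} test={row["test_acc"]:.4f} f1={row["f1"]:.4f} gap={row["gap"]:.4f}')
```

Sorted this way, the test accuracies all sit within a few points of each other, while the train/test gap is largest for the combined text.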

---
## Logistic Regression

I'm going to follow a similar workflow to the KNN section above:
* titles only with CountVectorizer and TfidfVectorizer
* posts only with CountVectorizer and TfidfVectorizer
* titles and posts with CountVectorizer and TfidfVectorizer

In [36]:
# starting with logreg and countvectorizer
# create X and y with just titles
X = combined["title"]
y = combined["subreddit"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

In [39]:
# Logistic Regression with CountVectorizer with titles only
pipe6 = Pipeline([
    ("cvec", CountVectorizer(stop_words="english")),
    ("ss", StandardScaler(with_mean=False)),
    ("lr", LogisticRegression())
])

pipe6_params = {
    "cvec__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "cvec__max_features": range(1000, 2001, 100),
    "cvec__min_df": range(2, 11),
    "cvec__max_df": [0.9, 0.95, 0.99],
    "lr__C": [0.01, 0.1, 1],
    "lr__penalty": ["l2", "l1"],
    "lr__solver": ["liblinear", "lbfgs"]
}

gs6 = GridSearchCV(pipe6, param_grid=pipe6_params, cv=5, n_jobs=-1, verbose=1)

gs6.fit(X_train, y_train)

Fitting 5 folds for each of 10692 candidates, totalling 53460 fits


13365 fits failed out of a total of 53460.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
13365 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\_Cramer_\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\_Cramer_\anaconda3\Lib\site-packages\sklearn\base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\_Cramer_\anaconda3\Lib\site-packages\sklearn\pipeline.py", line 475, in fit
    self._final_estimator.fit(Xt, y, **last_step_params["fit"])
  File "C:\Users\_Cramer_\anaconda3\Lib\site-packages

In [40]:
gs6.best_params_, gs6.best_score_

({'cvec__max_df': 0.99,
  'cvec__max_features': 1600,
  'cvec__min_df': 2,
  'cvec__ngram_range': (1, 2),
  'lr__C': 0.1,
  'lr__penalty': 'l1',
  'lr__solver': 'liblinear'},
 0.8670973917205073)

In [41]:
gs6.score(X_train, y_train), gs6.score(X_test, y_test), f1_score(y_test, gs6.predict(X_test))

(0.9493480441323972, 0.8631578947368421, 0.8785046728971962)

*Already looking better than KNN, but still overfit.*
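
The 13365 failed fits in the search above are exactly one quarter of the total, which is what you would expect if one of the four penalty/solver pairs is invalid. The traceback is cut off, so the cause is an assumption on my part, but `lbfgs` does not support the `l1` penalty in scikit-learn, which would explain it. One way to avoid the wasted fits is to pass `GridSearchCV` a list of grids so that only valid pairs are generated. A sketch using `ParameterGrid` to count the candidates (the `shared` name is illustrative):

```python
from sklearn.model_selection import ParameterGrid

# settings shared by every candidate, copied from pipe6_params
shared = {
    "cvec__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "cvec__max_features": list(range(1000, 2001, 100)),
    "cvec__min_df": list(range(2, 11)),
    "cvec__max_df": [0.9, 0.95, 0.99],
    "lr__C": [0.01, 0.1, 1],
}

# one grid per solver, each listing only the penalties that solver accepts
param_grid = [
    {**shared, "lr__solver": ["liblinear"], "lr__penalty": ["l1", "l2"]},
    {**shared, "lr__solver": ["lbfgs"], "lr__penalty": ["l2"]},
]

n_valid = len(ParameterGrid(param_grid))
n_original = 10692  # candidate count reported by the search above
print(n_valid, n_original - n_valid)
```

The same list can be passed directly as `param_grid` to `GridSearchCV`. The 2673 candidates it leaves out, times 5 folds, match the 13365 failed fits in the warning.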

In [42]:
# LogisticRegression with TfidfVectorizer on titles only
pipe7 = Pipeline([
    ("tf", TfidfVectorizer(stop_words="english")),
    ("ss", StandardScaler(with_mean=False)),
    ("lr", LogisticRegression())
])

pipe7_params = {
    "tf__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "tf__max_features": range(100, 251, 10),
    "tf__min_df": range(2, 11),
    "tf__max_df": [0.9, 0.95, 0.99],
    "lr__C": [0.01, 0.1, 1],
    "lr__penalty": ["l2", "l1"],
    "lr__solver": ["lbfgs", "liblinear"]
}

gs7 = GridSearchCV(pipe7, param_grid=pipe7_params, cv=5, n_jobs=-1, verbose=1)

gs7.fit(X_train, y_train)

Fitting 5 folds for each of 15552 candidates, totalling 77760 fits


19440 fits failed out of a total of 77760.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
19440 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\_Cramer_\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\_Cramer_\anaconda3\Lib\site-packages\sklearn\base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\_Cramer_\anaconda3\Lib\site-packages\sklearn\pipeline.py", line 475, in fit
    self._final_estimator.fit(Xt, y, **last_step_params["fit"])
  File "C:\Users\_Cramer_\anaconda3\Lib\site-packages

In [43]:
gs7.best_params_, gs7.best_score_

({'lr__C': 0.01,
  'lr__penalty': 'l2',
  'lr__solver': 'liblinear',
  'tf__max_df': 0.9,
  'tf__max_features': 230,
  'tf__min_df': 5,
  'tf__ngram_range': (1, 2)},
 0.8385184065691866)

In [44]:
gs7.score(X_train, y_train), gs7.score(X_test, y_test), f1_score(y_test, gs7.predict(X_test))

(0.8811434302908726, 0.8330827067669173, 0.8514056224899599)

*Logistic Regression did better with CountVectorizer.*

In [45]:
# new X and y with just posts
X = combined["post"]
y = combined["subreddit"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

In [46]:
# LogisticRegression with CountVectorizer on posts only
pipe8 = Pipeline([
    ("cvec", CountVectorizer(stop_words="english")),
    ("ss", StandardScaler(with_mean=False)),
    ("lr", LogisticRegression())
])

pipe8_params = {
    "cvec__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "cvec__max_features": range(1000, 2001, 100),
    "cvec__min_df": range(2, 11),
    "cvec__max_df": [0.9, 0.95, 0.99],
    "lr__C": [0.01, 0.1, 1],
    "lr__penalty": ["l2", "l1"],
    "lr__solver": ["liblinear", "lbfgs"]
}

gs8 = GridSearchCV(pipe8, param_grid=pipe8_params, cv=5, n_jobs=-1, verbose=1)

gs8.fit(X_train, y_train)

Fitting 5 folds for each of 10692 candidates, totalling 53460 fits


13365 fits failed out of a total of 53460.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
13365 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\_Cramer_\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\_Cramer_\anaconda3\Lib\site-packages\sklearn\base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\_Cramer_\anaconda3\Lib\site-packages\sklearn\pipeline.py", line 475, in fit
    self._final_estimator.fit(Xt, y, **last_step_params["fit"])
  File "C:\Users\_Cramer_\anaconda3\Lib\site-packages

In [49]:
gs8.best_params_, gs8.best_score_

({'cvec__max_df': 0.9,
  'cvec__max_features': 1500,
  'cvec__min_df': 8,
  'cvec__ngram_range': (1, 3),
  'lr__C': 1,
  'lr__penalty': 'l1',
  'lr__solver': 'liblinear'},
 0.8731073915945643)

In [48]:
gs8.score(X_train, y_train), gs8.score(X_test, y_test), f1_score(y_test, gs8.predict(X_test))

(0.9864593781344032, 0.8601503759398497, 0.8793774319066148)

*This model performed very similarly to the same model fit on titles only: slightly lower accuracy but slightly higher F1 score.*

In [50]:
# LogisticRegression with TfidfVectorizer on posts only
# every model so far has included bigrams and/or trigrams in its best parameters,
# so no longer grid searching over single words only
pipe9 = Pipeline([
    ("tf", TfidfVectorizer(stop_words="english")),
    ("ss", StandardScaler(with_mean=False)),
    ("lr", LogisticRegression())
])

pipe9_params = {
    "tf__ngram_range": [(1, 2), (1, 3)],
    "tf__max_features": range(100, 261, 20), 
    "tf__min_df": range(2, 11),
    "tf__max_df": [0.9, 0.95, 0.99],
    "lr__C": [0.01, 0.1, 1],
    "lr__penalty": ["l2", "l1"],
    "lr__solver": ["liblinear", "lbfgs"]
}

gs9 = GridSearchCV(pipe9, param_grid=pipe9_params, cv=5, n_jobs=-1, verbose=1)

gs9.fit(X_train, y_train)

Fitting 5 folds for each of 5832 candidates, totalling 29160 fits


7290 fits failed out of a total of 29160.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
7290 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\_Cramer_\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\_Cramer_\anaconda3\Lib\site-packages\sklearn\base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\_Cramer_\anaconda3\Lib\site-packages\sklearn\pipeline.py", line 475, in fit
    self._final_estimator.fit(Xt, y, **last_step_params["fit"])
  File "C:\Users\_Cramer_\anaconda3\Lib\site-packages\s

In [51]:
gs9.best_params_, gs9.best_score_

({'lr__C': 0.01,
  'lr__penalty': 'l2',
  'lr__solver': 'liblinear',
  'tf__max_df': 0.9,
  'tf__max_features': 160,
  'tf__min_df': 6,
  'tf__ngram_range': (1, 3)},
 0.8791262074784951)

In [52]:
gs9.score(X_train, y_train), gs9.score(X_test, y_test), f1_score(y_test, gs9.predict(X_test))

(0.9097291875626881, 0.8586466165413534, 0.8743315508021391)

*LogisticRegression again did slightly worse on the test set using TfidfVectorizer on posts only, so I will only try CountVectorizer with both titles and posts.*

In [55]:
# new X and y containing titles and posts
X = combined["title"] + " " + combined["post"]
y = combined["subreddit"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

In [63]:
# LogisticRegression with CountVectorizer on titles and posts
# not grid searching over logreg solver anymore
# liblinear has been in best params for every model so far
pipe10 = Pipeline([
    ("cvec", CountVectorizer(stop_words="english")),
    ("ss", StandardScaler(with_mean=False)),
    ("lr", LogisticRegression(solver="liblinear"))
])

pipe10_params = {
    "cvec__ngram_range": [(1, 2), (1, 3)],
    "cvec__max_features": range(1000, 2001, 100),
    "cvec__min_df": range(2, 11),
    "cvec__max_df": [0.9, 0.95, 0.99],
    "lr__C": [0.01, 0.1, 1],
    "lr__penalty": ["l2", "l1"],
}

gs10 = GridSearchCV(pipe10, param_grid=pipe10_params, cv=5, n_jobs=-1, verbose=1)

gs10.fit(X_train, y_train)

Fitting 5 folds for each of 3564 candidates, totalling 17820 fits


In [64]:
gs10.best_params_, gs10.best_score_

({'cvec__max_df': 0.9,
  'cvec__max_features': 2000,
  'cvec__min_df': 5,
  'cvec__ngram_range': (1, 2),
  'lr__C': 0.01,
  'lr__penalty': 'l2'},
 0.9112164834196044)

In [65]:
gs10.score(X_train, y_train), gs10.score(X_test, y_test), f1_score(y_test, gs10.predict(X_test))

(0.9919759277833501, 0.9022556390977443, 0.9145860709592641)

*The best model so far. The best max_features sits at the top of the searched range, so I'll rerun the search with higher values.*
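
Spotting a best parameter that landed on the boundary of its range can be automated. A minimal sketch, assuming a hypothetical helper `params_at_grid_edge` that compares `best_params_` with the grid that was searched (values below are copied from the `gs10` cells above):

```python
def params_at_grid_edge(best_params, param_grid):
    """Flag numeric best parameters that sit on the min or max of their searched range."""
    flagged = {}
    for name, best in best_params.items():
        values = list(param_grid.get(name, []))
        # skip fixed or non-numeric options such as penalty or ngram_range
        if len(values) < 2 or not all(isinstance(v, (int, float)) for v in values):
            continue
        if best == max(values):
            flagged[name] = "upper"
        elif best == min(values):
            flagged[name] = "lower"
    return flagged


# grid and best parameters copied from the gs10 search above
pipe10_params = {
    "cvec__ngram_range": [(1, 2), (1, 3)],
    "cvec__max_features": range(1000, 2001, 100),
    "cvec__min_df": range(2, 11),
    "cvec__max_df": [0.9, 0.95, 0.99],
    "lr__C": [0.01, 0.1, 1],
    "lr__penalty": ["l2", "l1"],
}
gs10_best = {
    "cvec__max_df": 0.9,
    "cvec__max_features": 2000,
    "cvec__min_df": 5,
    "cvec__ngram_range": (1, 2),
    "lr__C": 0.01,
    "lr__penalty": "l2",
}

edges = params_at_grid_edge(gs10_best, pipe10_params)
print(edges)
```

Besides `cvec__max_features` at the upper edge, this also flags `cvec__max_df` and `lr__C` at the lower edge of their ranges, which could be widened in the same way.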

In [68]:
# increasing max features on last model
pipe11 = Pipeline([
    ("cvec", CountVectorizer(ngram_range=(1, 2), min_df=5, max_df=0.9, stop_words="english")),
    ("ss", StandardScaler(with_mean=False)),
    ("lr", LogisticRegression(C=0.01))
])

pipe11_params = {
    "cvec__max_features": range(2000, 4001, 100),
    "lr__solver": ["liblinear", "lbfgs"]
}

gs11 = GridSearchCV(pipe11, param_grid=pipe11_params, cv=5, n_jobs=-1, verbose=1)

gs11.fit(X_train, y_train)

Fitting 5 folds for each of 42 candidates, totalling 210 fits


In [69]:
gs11.best_params_, gs11.best_score_

({'cvec__max_features': 2200, 'lr__solver': 'liblinear'}, 0.9112227805695141)

In [70]:
gs11.score(X_train, y_train), gs11.score(X_test, y_test), f1_score(y_test, gs11.predict(X_test))

(0.9944834503510531, 0.8977443609022556, 0.9102902374670184)

*Test accuracy and F1 score were better at 2000 max features in model 10.*

---
## Naive Bayes

My best accuracy and F1 score so far have come from using both titles and posts, so moving forward I will use the combined text. Next I'm going to try a Multinomial Naive Bayes model.

In [73]:
# reinstantiate X and y
X = combined["title"] + " " + combined["post"]
y = combined["subreddit"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

In [79]:
# Multinomial Naive Bayes using CountVectorization on titles and posts
pipe12 = Pipeline([
    ("cvec", CountVectorizer(stop_words="english")),
    ("ss", StandardScaler(with_mean=False)),
    ("nb", MultinomialNB())
])

pipe12_params = {
    "cvec__ngram_range": [(1, 2), (1, 3)],
    "cvec__max_features": range(1000, 3001, 200),
    "cvec__min_df": range(2, 11, 2),
    "cvec__max_df": [0.90, 0.95, 0.99],
    "nb__alpha": range(1, 11)
}

grid12 = GridSearchCV(pipe12, param_grid=pipe12_params, cv=5, n_jobs=-1, verbose=1)

grid12.fit(X_train, y_train)

Fitting 5 folds for each of 3300 candidates, totalling 16500 fits


In [80]:
grid12.best_params_, grid12.best_score_

({'cvec__max_df': 0.9,
  'cvec__max_features': 2200,
  'cvec__min_df': 2,
  'cvec__ngram_range': (1, 2),
  'nb__alpha': 1},
 0.8761350612712686)

In [81]:
grid12.score(X_train, y_train), grid12.score(X_test, y_test), f1_score(y_test, grid12.predict(X_test))

(0.9232698094282848, 0.8766917293233083, 0.8867403314917127)

*Another step back. The LogisticRegression using CountVectorizer on the combined title and post data has performed the best so far.*

In [84]:
# Multinomial Naive Bayes using TfidfVectorizer on titles and posts
pipe13 = Pipeline([
    ("tf", TfidfVectorizer(stop_words="english")),
    ("ss", StandardScaler(with_mean=False)),
    ("nb", MultinomialNB())
])

pipe13_params = {
    "tf__ngram_range": [(1, 2), (1, 3)],
    "tf__max_features": range(100, 501, 50), 
    "tf__min_df": range(2, 11),
    "tf__max_df": [0.9, 0.95, 0.99],
    "nb__alpha": [0.01, 0.1, 1, 2, 3, 5]
}

gs13 = GridSearchCV(pipe13, param_grid=pipe13_params, cv=5, n_jobs=-1, verbose=1)

gs13.fit(X_train, y_train)

Fitting 5 folds for each of 2916 candidates, totalling 14580 fits


In [85]:
gs13.best_params_, gs13.best_score_

({'nb__alpha': 0.01,
  'tf__max_df': 0.9,
  'tf__max_features': 300,
  'tf__min_df': 10,
  'tf__ngram_range': (1, 3)},
 0.8771300109570408)

In [86]:
gs13.score(X_train, y_train), gs13.score(X_test, y_test), f1_score(y_test, gs13.predict(X_test))

(0.9012036108324974, 0.8526315789473684, 0.8653846153846154)

*Even lower performance with Multinomial Naive Bayes using TfidfVectorizer.*

---
## Decision Trees

I will now look into a couple of methods based on Decision Trees.

In [3]:
# X and y using titles and posts
X = combined["title"] + " " + combined["post"]
y = combined["subreddit"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

In [17]:
# basic decision tree with countvectorization
pipe14 = Pipeline([
    ("cvec", CountVectorizer(stop_words="english", ngram_range=(1, 2))),
    ("dt", DecisionTreeClassifier(random_state=42))
])

pipe14.fit(X_train, y_train)

In [18]:
pipe14.score(X_train, y_train), pipe14.score(X_test, y_test), f1_score(y_test, pipe14.predict(X_test))

(0.9990239141044412, 0.8845029239766082, 0.895364238410596)

*This did pretty well. Trying with TfidfVectorizer next.*

In [19]:
# decision tree with tfidfvectorization
pipe15 =  Pipeline([
    ("tf", TfidfVectorizer(stop_words="english", ngram_range=(1, 2))),
    ("dt", DecisionTreeClassifier(random_state=42))
])

pipe15.fit(X_train, y_train)

In [20]:
pipe15.score(X_train, y_train), pipe15.score(X_test, y_test), f1_score(y_test, pipe15.predict(X_test))

(0.9990239141044412, 0.8742690058479532, 0.8853333333333333)

*TfidfVectorizer didn't perform as well again, so I'm going to stick with CountVectorizer moving forward.*

In [6]:
# RandomForestClassifier with CountVectorizer
pipe16 = Pipeline([
    ("cvec", CountVectorizer(stop_words="english", ngram_range=(1, 2))),
    ("rf", RandomForestClassifier(random_state=42))
])

pipe16_params = {
    "cvec__max_features": range(1000, 3001, 200),
    "cvec__min_df": range(2, 11, 2),
    "cvec__max_df": [0.90, 0.95, 0.99],
    "rf__n_estimators": range(100, 201, 25),
    "rf__max_depth": [None, *range(1, 6)]
}

gs16 = GridSearchCV(pipe16, param_grid=pipe16_params, cv=5, n_jobs=-1, verbose=1)

gs16.fit(X_train, y_train)

Fitting 5 folds for each of 4950 candidates, totalling 24750 fits


In [7]:
gs16.best_params_, gs16.best_score_

({'cvec__max_df': 0.9,
  'cvec__max_features': 1400,
  'cvec__min_df': 6,
  'rf__max_depth': None,
  'rf__n_estimators': 150},
 0.9054198948533481)

In [8]:
gs16.score(X_train, y_train), gs16.score(X_test, y_test), f1_score(y_test, gs16.predict(X_test))

(0.9985597695631301, 0.9035971223021583, 0.9128738621586476)

*Slightly higher accuracy and slightly lower F1 score compared to best Logistic Regression.*

In [15]:
# AdaBoost with CountVectorizer
pipe17 = Pipeline([
    ("cvec", CountVectorizer(stop_words="english", min_df=2, max_df=0.9, ngram_range=(1, 2))),
    ("ada", AdaBoostClassifier(random_state=42))
])

pipe17_params = {
    "cvec__max_features": range(1000, 2001, 100),
    "ada__n_estimators": range(20, 201, 20),
    "ada__learning_rate": [0.1, 1, 10]
}

gs17 = GridSearchCV(pipe17, param_grid=pipe17_params, cv=5, n_jobs=-1, verbose=1)

gs17.fit(X_train, y_train)

Fitting 5 folds for each of 330 candidates, totalling 1650 fits




In [16]:
gs17.best_params_, gs17.best_score_

({'ada__learning_rate': 1,
  'ada__n_estimators': 200,
  'cvec__max_features': 1800},
 0.9169422154584026)

In [17]:
gs17.score(X_train, y_train), gs17.score(X_test, y_test), f1_score(y_test, gs17.predict(X_test))

(0.9927988478156505, 0.9035971223021583, 0.9117259552042161)

*Very similar to the RandomForestClassifier model just done.*

In [20]:
# StackingClassifier with CountVectorizer
level1_models = [
    ("logr_pipe", Pipeline([
        ("cvec", CountVectorizer(stop_words="english", max_df=0.9,
                                 min_df=8, max_features=1500, ngram_range=(1, 3))),
        ("ss", StandardScaler(with_mean=False)),
        ("logr", LogisticRegression(penalty="l1", solver="liblinear"))
    ])),
    ("rf_pipe", Pipeline([
        ("cvec", CountVectorizer(stop_words="english", max_df=0.9,
                                 min_df=6, max_features=1400, ngram_range=(1, 2))),
        ("rf", RandomForestClassifier(n_estimators=150, random_state=42))
    ])),
    ("ada_pipe", Pipeline([
        ("cvec", CountVectorizer(stop_words="english", max_features=1800)),
        ("ada", AdaBoostClassifier(n_estimators=200))
    ]))
]

stacked_model = StackingClassifier(estimators=level1_models,
                                   final_estimator=LogisticRegression())

stacked_model.fit(X_train, y_train)



In [21]:
stacked_model.score(X_train, y_train), stacked_model.score(X_test, y_test), f1_score(y_test, stacked_model.predict(X_test))

(0.9980796927508402, 0.9122302158273381, 0.919631093544137)

*Built from my top performing models from previous iterations, this produces the highest accuracy and F1 score so far.*
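
For context on why stacking can help: during `fit`, `StackingClassifier` uses cross-validated predictions from each level 1 model as the training features for the final estimator, then refits the level 1 models on all of the training data. The sketch below shows this on synthetic data from `make_classification`, with two simple base models standing in for the text pipelines above; the data and model choices are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the vectorized subreddit text
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=42, stratify=y_toy)

toy_stack = StackingClassifier(
    estimators=[
        ("logr", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=42)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,  # folds used to produce the out-of-fold predictions for the final estimator
)
toy_stack.fit(X_tr, y_tr)

# transform returns the level 1 predictions that the final estimator sees
meta_features = toy_stack.transform(X_te)
test_accuracy = toy_stack.score(X_te, y_te)
print(meta_features.shape, test_accuracy)
```

Because the final estimator is trained on out-of-fold predictions, it learns how much to trust each base model without seeing predictions those models made on their own training rows.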

---
I'm going to take this last stacking model and visualize its performance in my production model and insights notebook.