# The "Most Epic" Data-Science Project
# Notebook-3 (Modeling)
### Perry Shyr

### This notebook covers my baseline model and five classifier models.  A pipeline with a grid search is used to find the optimal hyperparameters for the first model (logistic regression).  A grid search is also used to tune the fourth model (random forests).  Every model is scored on the test split, which comprises 458 posts.

#### 1. Logistic Regression model
#### 2. Naive-Bayes (multinomial) model
#### 3. k-Nearest Neighbors model
#### 4. Random-forests model
#### 5. Support-vector Machine model

### I chose the TF-IDF vectorizer for each of my NLP models.  I also explored dimensionality reduction with SVD, but the resulting components provided little to no information.

## Load libraries and data:

In [238]:
import requests
import json
import time
import numpy as np
import pandas as pd
import pickle
import matplotlib.pyplot as plt

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
import regex as re
from nltk.stem.porter import PorterStemmer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn import svm, linear_model, datasets

from sklearn.pipeline import Pipeline
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_regression
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
from sklearn.metrics import recall_score, make_scorer, f1_score, precision_score
from sklearn.metrics import classification_report, roc_curve

np.random.seed(42)

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

%matplotlib inline

### I import the collected data, from which "Star" and "star" have been removed wherever a franchise title is referenced.

In [239]:
data = pd.read_csv('../data/combined_no_star.csv')

### Class sizes:

In [240]:
data['is_trek'].value_counts(normalize=True)*100    #  As percentages, my classes are almost evenly balanced.

1    52.925096
0    47.074904
Name: is_trek, dtype: float64

### BASELINE: My baseline model predicts that every post originated from the 'r/startrek' subreddit, which is correct for about 52.9% of the posts.
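As a rough check on the baseline, the accuracy of always predicting 'r/startrek' equals the share of that class. The sketch below applies this to the test split only, rebuilding the labels from the support counts (214 and 244) that the classification reports in this notebook show. The function name and the stand-in list are my own, not part of the original code.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)

# Stand-in for y_test: 244 'r/startrek' posts (1) and 214 other posts (0),
# taken from the support column of the classification reports.
y_test_standin = [1] * 244 + [0] * 214

label, accuracy = majority_baseline_accuracy(y_test_standin)
print(label, round(accuracy, 3))
```

Any classifier below has to beat this figure on the test split to be worth keeping.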

## Train-Test-Split:

In [241]:
X=data['title']
y=data['is_trek']

In [242]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y, 
                                                    random_state = 42)

## GridSearch over TFIDF/Log-Reg pipeline:

In [243]:
star_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('logistic', LogisticRegression())
])

In [244]:
star_pipe.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
 ...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

In [245]:
star_pipe.score(X_train, y_train)

0.9905178701677607

In [246]:
star_pipe.score(X_test, y_test)

0.9104803493449781

In [247]:
star_params = {
    'tfidf__min_df': [2,3],
    'tfidf__max_df': np.linspace(.1,.35,10),
    'logistic__C': np.linspace(0.5,1.5,10),
    'logistic__penalty': ['l1','l2']
}

In [248]:
gs = GridSearchCV(star_pipe, star_params)

In [249]:
gs.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
 ...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'tfidf__min_df': [2, 3], 'tfidf__max_df': array([0.1    , 0.12778, 0.15556, 0.18333, 0.21111, 0.23889, 0.26667,
       0.29444, 0.32222, 0.35   ]), 'logistic__C': array([0.5    , 0.61111, 0.72222, 0.83333, 0.94444, 1.05556, 1.16667,
       1.27778, 1.38889, 1.5    ]), 'logistic__penalty': ['l1', 'l2']},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)
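To get a feel for the cost of this search, the number of candidate settings is the product of the list lengths in the grid. The sketch below counts them without scikit-learn; the `linspace` helper is a plain-Python stand-in for `np.linspace`, and the fold count of 3 is an assumption about the default of the scikit-learn version that produced the output above.

```python
from itertools import product

def linspace(start, stop, num):
    """Plain-Python stand-in for np.linspace: num evenly spaced points."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]

# Same shape as star_params: 2 x 10 x 10 x 2 values.
grid = {
    'tfidf__min_df': [2, 3],
    'tfidf__max_df': linspace(0.1, 0.35, 10),
    'logistic__C': linspace(0.5, 1.5, 10),
    'logistic__penalty': ['l1', 'l2'],
}

n_candidates = len(list(product(*grid.values())))
n_folds = 3  # assumption: cross-validation default of the version used here
print(n_candidates, n_candidates * n_folds)
```

Each candidate refits both the vectorizer and the classifier, so narrowing one of the ten-point ranges shortens the search considerably.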

In [250]:
gs.score(X_train, y_train)

0.9649890590809628

In [251]:
gs.score(X_test, y_test)

0.8864628820960698

In [252]:
gs.best_params_

{'logistic__C': 0.9444444444444444,
 'logistic__penalty': 'l2',
 'tfidf__max_df': 0.18333333333333335,
 'tfidf__min_df': 2}

#### The critical hyperparameter turns out to be "min_df" for the vectorizer.  The untuned pipeline, which uses the default "min_df" of 1, scored 0.910 on the test split, against 0.886 for the grid search restricted to values of 2 and 3.  I nevertheless keep "min_df" at 2 going forward, at the expense of some accuracy.
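To illustrate what "min_df" and "max_df" do, the sketch below filters a vocabulary by document frequency: an integer "min_df" is a minimum document count, and a fractional "max_df" is a maximum share of documents. It is a simplified stand-in, assuming whitespace tokenization rather than the vectorizer's own token pattern, and the titles are hypothetical, not taken from the dataset.

```python
from collections import Counter

def document_frequencies(docs):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return df

def kept_vocabulary(docs, min_df=1, max_df=1.0):
    """Terms found in at least min_df documents and at most a max_df share of them."""
    df = document_frequencies(docs)
    n_docs = len(docs)
    return sorted(t for t, c in df.items() if c >= min_df and c / n_docs <= max_df)

# Hypothetical titles, for illustration only.
titles = [
    "picard returns to the enterprise",
    "vader boards the enterprise",
    "picard meets vader",
    "a new lightsaber appears",
]

print(kept_vocabulary(titles, min_df=1))
print(kept_vocabulary(titles, min_df=2))
```

Raising "min_df" from 1 to 2 discards every term seen in a single title, which shrinks the feature space and removes one-off words the model could otherwise memorize.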

## Use consistent TFIDF-Vectorizer result for separate modeling:

In [253]:
tfidf = TfidfVectorizer(stop_words='english', min_df=2, max_df=.3, ngram_range=(1,2))
X_train_transform = tfidf.fit_transform(X_train)
X_test_transform = tfidf.transform(X_test)
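With its default setting of smooth_idf=True, TfidfVectorizer weights each term by a smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1, before normalizing each row. The sketch below evaluates only that formula; the corpus size and document counts are hypothetical.

```python
import math

def smooth_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 1000  # hypothetical corpus size
for doc_freq in (2, 50, 300):
    print(doc_freq, round(smooth_idf(n_docs, doc_freq), 3))
```

Terms that appear in few titles receive the largest weights, while a term found in every title gets a weight of exactly 1.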

## 1. Logistic regression:

In [254]:
logreg = LogisticRegression(C=1.05, penalty='l2')
logreg.fit(X_train_transform, y_train)
logreg.score(X_test_transform, y_test)

0.8864628820960698

In [255]:
predictions_logreg = logreg.predict(X_test_transform)

In [256]:
from sklearn.metrics import confusion_matrix

In [257]:
confusion_matrix(y_test, predictions_logreg)

array([[198,  16],
       [ 36, 208]])

In [258]:
tn, fp, fn, tp = confusion_matrix(y_test, predictions_logreg).ravel()
print("LR_Confirmed Star-Wars posts: %s" % tn)
print("LR_Star-Wars posts labeled as Star-Trek: %s" % fp)
print("LR_Star-Trek posts labeled as Star-Wars: %s" % fn)
print("LR_Confirmed Star-Trek posts: %s" % tp)

LR_Confirmed Star-Wars posts: 198
LR_Star-Wars posts labeled as Star-Trek: 16
LR_Star-Trek posts labeled as Star-Wars: 36
LR_Confirmed Star-Trek posts: 208


In [259]:
print(classification_report(y_test, predictions_logreg))

             precision    recall  f1-score   support

          0       0.85      0.93      0.88       214
          1       0.93      0.85      0.89       244

avg / total       0.89      0.89      0.89       458
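The figures in this report follow directly from the four confusion-matrix counts. As a minimal sketch, with a helper name of my own choosing, the accuracy and the class-1 (Star-Trek) precision, recall and f1-score can be recomputed like this:

```python
def binary_metrics(tn, fp, fn, tp):
    """Accuracy plus precision, recall and F1 for the positive class (label 1)."""
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Counts from the logistic-regression confusion matrix above.
accuracy, precision, recall, f1 = binary_metrics(tn=198, fp=16, fn=36, tp=208)
print(round(accuracy, 4), round(precision, 2), round(recall, 2), round(f1, 2))
```

The rounded values agree with the class-1 row of the report and with the test score of 0.886.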



### NOTE: Tuning the C-value or choosing the "l1" penalty only degraded the score.

## Save models, splits and vectorizer:

In [103]:
with open('../data/p3_log_reg_MinDF2.pkl', 'wb+') as f:
    pickle.dump(logreg, f)

In [104]:
X_train.to_csv('../data/X_train.csv', index=False)
X_test.to_csv('../data/X_test.csv', index=False)
y_train.to_csv('../data/y_train.csv', index=False)
y_test.to_csv('../data/y_test.csv', index=False)

In [106]:
with open('../data/p3_xtrain_transform.pkl', 'wb+') as f:
    pickle.dump(X_train_transform, f)

In [107]:
with open('../data/p3_xtest_transform.pkl', 'wb+') as f:
    pickle.dump(X_test_transform, f)

In [108]:
with open('../data/tfidf.pkl', 'wb+') as f:
    pickle.dump(tfidf, f)
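The saved files can be read back with pickle.load in a later notebook. The sketch below shows the round trip on a stand-in object written to a temporary directory; the dictionary and the file name are hypothetical, and in practice the object would be one of the fitted models or the vectorizer saved above.

```python
import os
import pickle
import tempfile

# Stand-in object; in this notebook it would be logreg, tfidf, and so on.
artifact = {'name': 'example-model', 'C': 1.05}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.pkl')  # hypothetical file name
    with open(path, 'wb') as f:
        pickle.dump(artifact, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == artifact)
```

Pickled scikit-learn objects are generally only safe to reload under the same library version that wrote them, so it helps to record that version alongside the files.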

## 2. Naive Bayes Model:

In [260]:
nb = MultinomialNB()

In [261]:
model_nb = nb.fit(X_train_transform, y_train)

In [262]:
predictions = model_nb.predict(X_test_transform)

In [263]:
model_nb.score(X_train_transform, y_train)

0.9635302698760029

In [264]:
model_nb.score(X_test_transform, y_test)

0.8580786026200873

### Save model:

In [49]:
with open('../data/naive_bayes.pkl', 'wb+') as f:
    pickle.dump(model_nb, f)

In [265]:
confusion_matrix(y_test, predictions)

array([[167,  47],
       [ 18, 226]])

In [266]:
tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()
print("NB_Confirmed Star-Wars posts: %s" % tn)
print("NB_Star-Wars posts labeled as Star-Trek: %s" % fp)
print("NB_Star-Trek posts labeled as Star-Wars: %s" % fn)
print("NB_Confirmed Star-Trek posts: %s" % tp)

NB_Confirmed Star-Wars posts: 167
NB_Star-Wars posts labeled as Star-Trek: 47
NB_Star-Trek posts labeled as Star-Wars: 18
NB_Confirmed Star-Trek posts: 226


In [267]:
print(classification_report(y_test, predictions))

             precision    recall  f1-score   support

          0       0.90      0.78      0.84       214
          1       0.83      0.93      0.87       244

avg / total       0.86      0.86      0.86       458



## 3. KNN Modeling:

In [268]:
ss = StandardScaler()
ss.fit(X_train_transform.toarray())
X_train_sc = ss.transform(X_train_transform.toarray())
X_test_sc = ss.transform(X_test_transform.toarray())

In [269]:
knn = KNeighborsClassifier(n_neighbors=4)

In [270]:
cross_val_score(knn, X_train_sc, y_train).mean()

0.5718264461618437

In [271]:
model_knn = knn.fit(X_train_sc, y_train)

In [272]:
model_knn.score(X_test_sc, y_test)

0.5676855895196506

In [59]:
with open('../data/p3_knn.pkl', 'wb+') as f:
    pickle.dump(model_knn, f)

In [273]:
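# Predict on the scaled test matrix (the model was fit on X_train_sc); the outputs recorded below came from the unscaled matrix.
predictions_knn = model_knn.predict(X_test_sc)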

In [274]:
confusion_matrix(y_test, predictions_knn)

array([[100, 114],
       [130, 114]])

In [275]:
tn, fp, fn, tp = confusion_matrix(y_test, predictions_knn).ravel()
print("KN_Confirmed Star-Wars posts: %s" % tn)
print("KN_Star-Wars posts labeled as Star-Trek: %s" % fp)
print("KN_Star-Trek posts labeled as Star-Wars: %s" % fn)
print("KN_Confirmed Star-Trek posts: %s" % tp)

KN_Confirmed Star-Wars posts: 100
KN_Star-Wars posts labeled as Star-Trek: 114
KN_Star-Trek posts labeled as Star-Wars: 130
KN_Confirmed Star-Trek posts: 114


In [276]:
print(classification_report(y_test, predictions_knn))

             precision    recall  f1-score   support

          0       0.43      0.47      0.45       214
          1       0.50      0.47      0.48       244

avg / total       0.47      0.47      0.47       458



## 4a. Random-Forests Modeling, with GridSearch:

In [277]:
rf = RandomForestClassifier()
rf_params = {
    'n_estimators': [100, 300],
    'max_features': [ 150, 250,350],
    'max_depth': [1,2]
}
rf_search = GridSearchCV(rf, param_grid = rf_params)
rf_search.fit(X_train_transform, y_train)
print(rf_search.best_score_)

0.7811816192560175


In [278]:
rf_search.score(X_test_transform, y_test)

0.7641921397379913

In [279]:
rf_search.best_params_

{'max_depth': 2, 'max_features': 350, 'n_estimators': 300}
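Every value selected here is the largest one offered in its grid, which may mean the optimum lies outside the grid; the unrestricted forest in 4b is consistent with that reading, since it scores higher. A small check of this kind, with a helper name of my own, could look like the following sketch:

```python
def params_on_grid_edge(best_params, param_grid):
    """Report which best values sit at the lower or upper end of their grid."""
    on_edge = {}
    for name, best in best_params.items():
        values = sorted(param_grid[name])
        if best == values[-1]:
            on_edge[name] = 'upper'
        elif best == values[0]:
            on_edge[name] = 'lower'
    return on_edge

# Grid and result copied from the cells above.
rf_params = {
    'n_estimators': [100, 300],
    'max_features': [150, 250, 350],
    'max_depth': [1, 2],
}
best = {'max_depth': 2, 'max_features': 350, 'n_estimators': 300}

print(params_on_grid_edge(best, rf_params))
```

With two-value grids a best value is always on an edge, so the signal is strongest for 'max_features', where the largest of three options was chosen.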

## 4b. Random-Forests Modeling:

In [280]:
rf = RandomForestClassifier(n_estimators=100)
print('cross', cross_val_score(rf, X_train_transform, y_train).mean())

cross 0.867252962351828


In [281]:
rf.fit(X_train_transform, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [282]:
rf.score(X_test_transform, y_test)

0.834061135371179

In [71]:
with open('../data/p3_randomforests.pkl', 'wb+') as f:
    pickle.dump(rf, f)

In [283]:
predictions_rf = rf.predict(X_test_transform)

In [284]:
confusion_matrix(y_test, predictions_rf)

array([[165,  49],
       [ 27, 217]])

In [285]:
tn, fp, fn, tp = confusion_matrix(y_test, predictions_rf).ravel()
print("RF_Confirmed Star-Wars posts: %s" % tn)
print("RF_Star-Wars posts labeled as Star-Trek: %s" % fp)
print("RF_Star-Trek posts labeled as Star-Wars: %s" % fn)
print("RF_Confirmed Star-Trek posts: %s" % tp)

RF_Confirmed Star-Wars posts: 165
RF_Star-Wars posts labeled as Star-Trek: 49
RF_Star-Trek posts labeled as Star-Wars: 27
RF_Confirmed Star-Trek posts: 217


In [286]:
print(classification_report(y_test, predictions_rf))

             precision    recall  f1-score   support

          0       0.86      0.77      0.81       214
          1       0.82      0.89      0.85       244

avg / total       0.84      0.83      0.83       458



## 5. SVM-model:

In [287]:
svc = svm.SVC(C=1., kernel='linear')
svc.fit(X_train_transform, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [288]:
svc.score(X_train_transform, y_train)

0.975929978118162

In [289]:
predictions_svm = svc.predict(X_test_transform)

In [290]:
svc.score(X_test_transform, y_test)

0.87117903930131

In [291]:
confusion_matrix(y_test, predictions_svm)

array([[196,  18],
       [ 41, 203]])

In [292]:
tn, fp, fn, tp = confusion_matrix(y_test, predictions_svm).ravel()
print("SV_Confirmed Star-Wars posts: %s" % tn)
print("SV_Star-Wars posts labeled as Star-Trek: %s" % fp)
print("SV_Star-Trek posts labeled as Star-Wars: %s" % fn)
print("SV_Confirmed Star-Trek posts: %s" % tp)

SV_Confirmed Star-Wars posts: 196
SV_Star-Wars posts labeled as Star-Trek: 18
SV_Star-Trek posts labeled as Star-Wars: 41
SV_Confirmed Star-Trek posts: 203


In [293]:
print(classification_report(y_test, predictions_svm))

             precision    recall  f1-score   support

          0       0.83      0.92      0.87       214
          1       0.92      0.83      0.87       244

avg / total       0.88      0.87      0.87       458



In [35]:
with open('../data/p3_svm.pkl', 'wb+') as f:
    pickle.dump(svc, f)

### The summary scores from the modeling process are:

| Model | f1-score (avg / total) |
| --- | --- |
| Logistic Regression | 0.89 |
| SVM | 0.87 |
| Naive-Bayes | 0.86 |
| Random Forests | 0.83 |
| KNN | 0.47 |

## Clearly, the KNN model was the worst.  On the other hand, three separate models (logistic regression, SVM and Naive-Bayes) scored in the high 0.8s in terms of accuracy.
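The same ranking can be re-derived from the confusion matrices recorded in this notebook. The sketch below computes the test accuracy of each model from its four counts and sorts the models from best to worst; the dictionary layout and helper are my own.

```python
# Confusion matrices copied from this notebook, as (tn, fp, fn, tp).
confusion = {
    'Logistic Regression': (198, 16, 36, 208),
    'Naive-Bayes': (167, 47, 18, 226),
    'KNN': (100, 114, 130, 114),
    'Random Forests': (165, 49, 27, 217),
    'SVM': (196, 18, 41, 203),
}

def accuracy(tn, fp, fn, tp):
    """Share of test posts that were classified correctly."""
    return (tn + tp) / (tn + fp + fn + tp)

ranking = sorted(confusion, key=lambda name: accuracy(*confusion[name]), reverse=True)
for name in ranking:
    print(f"{name:<20} {accuracy(*confusion[name]):.3f}")
```

Sorting by accuracy reproduces the order of the summary table, because the classes are close enough to balanced that accuracy and the averaged f1-score agree to two decimals.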

## Continue to Notebook-4