# Predict the Epic Sci-Fi Universe
# Notebook-3 (Modeling)
### Perry Shyr

### This notebook covers my baseline model and five classifier-models.  A pipeline with GridSearching is used to find optimum hyperparameters with the first model created (logistic regression).  A GridSearch is also used to find the best hyperparameters for the fourth model created (random-forests).  The test split comprised of 458 posts is used in the scoring calculations of each model.

#### 1. Logistic Regression model
#### 2. Naive-Bayes (multinomial) model
#### 3. k-Nearest Neighbors model
#### 4. Random-forests model
#### 5. Support-vector Machine model

### Note: I choose to go with the TF-IDF vectorizer for each of my NLP models.  I did explore additional analysis using SVD, but the component results provided little to no information.  

## Load libraries and data:

In [135]:
import numpy as np
import pandas as pd
import pickle
import matplotlib.pyplot as plt

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn import svm, linear_model, datasets

from sklearn.pipeline import Pipeline
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import confusion_matrix
from sklearn.metrics import recall_score, make_scorer, f1_score, precision_score
from sklearn.metrics import classification_report, roc_curve

np.random.seed(42)

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

%matplotlib inline

### I import the data collected without the "Star" or "star" when the franchise title is referenced.

In [103]:
data = pd.read_csv('../data/combined_no_star.csv')

### Class sizes:
#### My positive-class ("1") is the group of post-titles from the 'r/startrek' subreddit.  The class balance is fairly even.

In [104]:
data['is_trek'].value_counts(normalize=True)*100    #  As percentages, my classes are almost evenly balanced.

1    52.925096
0    47.074904
Name: is_trek, dtype: float64

## BASELINE: My baseline model states that any given post originated from the 'r/startrek' subreddit.

## Train-Test-Split:
#### This is used in preparation for cross-validation, and in measuring how well a model generalizes on untrained data.

In [105]:
X=data['title']
y=data['is_trek']

In [106]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y, 
                                                    random_state = 42)

### The split results are saved:

In [None]:
X_train.to_csv('../data/X_train.csv', index=False)
X_test.to_csv('../data/X_test.csv', index=False)
y_train.to_csv('../data/y_train.csv', index=False)
y_test.to_csv('../data/y_test.csv', index=False)

## GridSearch over TFIDF/Log-Reg pipeline:
#### This pipeline sets up a sequence of vectorization and modeling.  It allows me to search over various hyperparameters in the two steps (vectorizing and modeling.  The model used in this pipeline is logistic regresssion.

In [107]:
star_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('logistic', LogisticRegression())
])

In [108]:
star_pipe.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
 ...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

In [109]:
star_pipe.score(X_train, y_train)

0.9905178701677607

In [110]:
star_pipe.score(X_test, y_test)      # This modeling is overfit.

0.9104803493449781

#### The high training fit is not unexpected.  There is an 8% drop in accuracy representing high variance and unnecessary complexity.  That is, ideally some features should be removed from the modeling.

In [125]:
star_params = {
    'tfidf__min_df': [1,2,3],
    'tfidf__max_df': np.linspace(.2,.45,10),
    'logistic__C': np.linspace(0.5,2.0,15),
    'logistic__penalty': ['l1','l2']
}

In [126]:
gs = GridSearchCV(star_pipe, star_params)

#### The hyperparameters we are searching over include minimum-document frequency, maximum-document frequency in the vectorization, and 'C' (regularization-inverse) and regulatization penalty in logistic regression.  We intentionally excluded a 'min_df' of 1, which means that features that only appear in a single document are given weight.  The upper range for 'max_df' searched is unusually low, which means that features are not weighted if they occur more often than 35% of the corpus.  Although we searched over through the vectorizing process, I may yet decide to commit to using the results going forwards.

In [127]:
gs.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
 ...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'tfidf__min_df': [1, 2, 3], 'tfidf__max_df': array([0.2    , 0.22778, 0.25556, 0.28333, 0.31111, 0.33889, 0.36667,
       0.39444, 0.42222, 0.45   ]), 'logistic__C': array([0.5    , 0.60714, 0.71429, 0.82143, 0.92857, 1.03571, 1.14286,
       1.25   , 1.35714, 1.46429, 1.57143, 1.67857, 1.78571, 1.89286,
       2.     ]), 'logistic__penalty': ['l1', 'l2']},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)

In [128]:
gs.score(X_train, y_train)

0.9956236323851203

In [129]:
gs.score(X_test, y_test)                     # This modeling is also overfit.

0.9082969432314411

In [130]:
gs.best_params_

{'logistic__C': 1.5714285714285714,
 'logistic__penalty': 'l2',
 'tfidf__max_df': 0.2,
 'tfidf__min_df': 1}

#### The search yielded 1.57 for 'C,' 'Ridge' for the regularization, 1 for the 'min_df' and probably less than 20% for 'max_df' because this is the low-end of my range.  It fit under 99% of the training dataset and scored almost 91% of the test dataset.  The logistic-regression model is definitely overfit.  The 'min_df' of 1 and 'max_df' of under 20% are practically absurd values, so will be ignored.

### I include a CountVectorization step here to get an idea of word frequencies.  I set the 'binary' option to 'true just to count document frequency and not frequencies within each document.  This data may be useful during my model evaluation (next notebook).

In [7]:
cvec = CountVectorizer(binary='true')

In [72]:
X_cv = cvec.fit_transform(X)

In [84]:
df_cvec  = pd.DataFrame(X_cv.todense(),
                   columns=cvec.get_feature_names())
df_cvec.sum().sort_values(ascending=False)

the             719
of              343
to              326
in              313
trek            275
and             254
for             225
is              187
on              170
my              165
what            163
wars            160
this            155
you             137
it              137
new             127
with            122
that            107
series          105
picard          103
was             100
from             97
about            96
have             88
be               84
just             83
do               82
at               82
tng              76
discovery        75
               ... 
naked             1
nails             1
mythology         1
mysterious        1
myself            1
needed            1
neepers           1
norton            1
nightmare         1
northeast         1
north             1
normally          1
normal            1
nor               1
noise             1
nodes             1
nm                1
nintendo          1
niece             1


#### When I sort by document frequency, I see a lot of stop words.
### I save the count results and note how huge the file is that is saved.

In [101]:
df_cvec.to_csv('../data/cvec.csv', index=False)

#### From the GridSearch above, the critical hyperparameter that I find is "min_df" for the Vectorizer.  Although a 'min_df' value of "1" gives the best score, it makes no sense, so I choose to use a 'min_df' value of "2" going forwards despite the loss in accuracy.  The 'max_df' value of "0.2" is unbelievably low, so  I decide to use a value of "0.5."

#### I also choose to include bigrams in the models below to capture features like "Death Star" or "USS Enterprise."  

## The TFIDF-Vectorization is run here as the consistent input of each of my five separate models, for comparability:

In [131]:
tfidf = TfidfVectorizer(stop_words='english', min_df=2, max_df=.5, ngram_range=(1,2))
X_train_transform = tfidf.fit_transform(X_train)
X_test_transform = tfidf.transform(X_test)

### The vectorization is saved:

In [None]:
with open('../pickles/p3_xtrain_transform.pkl', 'wb+') as f:
    pickle.dump(X_train_transform, f)

In [None]:
with open('../pickles/p3_xtest_transform.pkl', 'wb+') as f:
    pickle.dump(X_test_transform, f)

In [None]:
with open('../pickles/tfidf.pkl', 'wb+') as f:
    pickle.dump(tfidf, f)

## I plan to test five separate models for comparison purposes before deciding which two to evaluate in detail.  
### These include Logistic-regression, Multinomial-Naive-Bayes, k-Nearest Neighbors, Random-forests (an ensemble method) and Support-vector Machines.  I can use the test scoring a confusion-matrix measures to compare each model.

## Model-1. Logistic regression, using C=1.6:

In [146]:
logreg = LogisticRegression(C=1.6, penalty='l2')
logreg.fit(X_train_transform, y_train)
logreg.score(X_test_transform, y_test)

0.8842794759825328

In [147]:
predictions_logreg = logreg.predict(X_test_transform)

In [148]:
confusion_matrix(y_test, predictions_logreg)

array([[197,  17],
       [ 36, 208]])

In [149]:
tn, fp, fn, tp = confusion_matrix(y_test, predictions_logreg).ravel()
print("LR_Confirmed Star-Wars posts:    %s" % tn)
print("LR_Misclassified Star-Trek posts: %s" % fp)
print("LR_Misclassified Star-Wars posts: %s" % fn)
print("LR_Confirmed Star-Trek posts:    %s" % tp)

LR_Confirmed Star-Wars posts:    197
LR_Misclassified Star-Trek posts: 17
LR_Misclassified Star-Wars posts: 36
LR_Confirmed Star-Trek posts:    208


In [150]:
print(classification_report(y_test, predictions_logreg))

             precision    recall  f1-score   support

          0       0.85      0.92      0.88       214
          1       0.92      0.85      0.89       244

avg / total       0.89      0.88      0.88       458



## Save models, splits and vectorizer:

In [103]:
with open('../pickles/p3_log_reg_MinDF2.pkl', 'wb+') as f:
    pickle.dump(logreg, f)

### Model-2. Naive Bayes Model:

In [260]:
nb = MultinomialNB()

In [261]:
model_nb = nb.fit(X_train_transform, y_train)

In [262]:
predictions = model_nb.predict(X_test_transform)

In [263]:
model_nb.score(X_train_transform, y_train)

0.9635302698760029

In [264]:
model_nb.score(X_test_transform, y_test)             # This modeling is also overfit.

0.8580786026200873

### The model is saved:

In [49]:
with open('../pickles/naive_bayes.pkl', 'wb+') as f:
    pickle.dump(model_nb, f)

In [265]:
confusion_matrix(y_test, predictions)

array([[167,  47],
       [ 18, 226]])

In [266]:
tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()
print("NB_Confirmed Star-Wars posts:    %s" % tn)
print("NB_Misclassified Star-Trek posts: %s" % fp)
print("NB_Misclassified Star-Wars posts: %s" % fn)
print("NB_Confirmed Star-Trek posts:    %s" % tp)

NB_Confirmed Star-Wars posts:    167
NB_Misclassified Star-Trek posts: 47
NB_Misclassified Star-Wars posts: 18
NB_Confirmed Star-Trek posts:    226


In [267]:
print(classification_report(y_test, predictions))

             precision    recall  f1-score   support

          0       0.90      0.78      0.84       214
          1       0.83      0.93      0.87       244

avg / total       0.86      0.86      0.86       458



### Models-3. KNN Modeling:

In [151]:
ss = StandardScaler()
ss.fit(X_train_transform.toarray())                    # It is necessary to scale the features.
X_train_sc = ss.transform(X_train_transform.toarray())
X_test_sc = ss.transform(X_test_transform.toarray())

In [180]:
knn = KNeighborsClassifier(n_neighbors=7)

In [181]:
cross_val_score(knn, X_train_sc, y_train).mean()

0.6163707304389896

#### Setting 'n_neighbors' to "7," gets me the highest test-score.

In [182]:
model_knn = knn.fit(X_train_sc, y_train)

In [183]:
model_knn.score(X_test_sc, y_test)                     # This modeling is slightly overfit.

0.6157205240174672

In [59]:
with open('../pickles/p3_knn.pkl', 'wb+') as f:
    pickle.dump(model_knn, f)

In [184]:
predictions_knn = model_knn.predict(X_test_transform.toarray())

In [185]:
confusion_matrix(y_test, predictions_knn)

array([[ 58, 156],
       [ 77, 167]])

In [186]:
tn, fp, fn, tp = confusion_matrix(y_test, predictions_knn).ravel()
print("KN_Confirmed Star-Wars posts:       %s" % tn)
print("KN_Misclassified Star-Trek posts: %s" % fp)
print("KN_Misclassified Star-Wars posts:   %s" % fn)
print("KN_Confirmed Star-Trek posts:     %s" % tp)

KN_Confirmed Star-Wars posts:       58
KN_Misclassified Star-Trek posts: 156
KN_Misclassified Star-Wars posts:   77
KN_Confirmed Star-Trek posts:     167


In [187]:
print(classification_report(y_test, predictions_knn))

             precision    recall  f1-score   support

          0       0.43      0.27      0.33       214
          1       0.52      0.68      0.59       244

avg / total       0.48      0.49      0.47       458



### Model-4. Random-Forests Modeling:
#### Since there is little to no risk of overfitting, no 'max_depth' is set (default).  I also use the default setting for 'max_features' as well.

In [193]:
rf = RandomForestClassifier(n_estimators=500)
print('cross', cross_val_score(rf, X_train_transform, y_train).mean())

cross 0.8716341356003406


In [194]:
rf.fit(X_train_transform, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [195]:
rf.score(X_test_transform, y_test)

0.8384279475982532

#### This ensemble result is somewhat overfit, I am surprised to admit.
#### I need to save each model as they are fit/scored.

In [71]:
with open('../pickles/p3_randomforests.pkl', 'wb+') as f:
    pickle.dump(rf, f)

In [196]:
predictions_rf = rf.predict(X_test_transform)

In [197]:
confusion_matrix(y_test, predictions_rf)

array([[167,  47],
       [ 27, 217]])

In [198]:
tn, fp, fn, tp = confusion_matrix(y_test, predictions_rf).ravel()
print("RF_Confirmed Star-Wars posts:    %s" % tn)
print("RF_Misclassified Star-Trek posts: %s" % fp)
print("RF_Misclassified Star-Wars posts: %s" % fn)
print("RF_Confirmed Star-Trek posts:    %s" % tp)

RF_Confirmed Star-Wars posts:    167
RF_Misclassified Star-Trek posts: 47
RF_Misclassified Star-Wars posts: 27
RF_Confirmed Star-Trek posts:    217


In [199]:
print(classification_report(y_test, predictions_rf))

             precision    recall  f1-score   support

          0       0.86      0.78      0.82       214
          1       0.82      0.89      0.85       244

avg / total       0.84      0.84      0.84       458



## Model-5. SVM-model:
#### I find the best value for the budget 'C' to be the default, "1."  The best kernel to use appears to be "linear" as well.

In [287]:
svc = svm.SVC(C=1., kernel='linear')
svc.fit(X_train_transform, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [288]:
svc.score(X_train_transform, y_train)

0.975929978118162

In [289]:
predictions_svm = svc.predict(X_test_transform)

In [290]:
svc.score(X_test_transform, y_test)                      # This modeling is overfit.

0.87117903930131

#### This model is not the most overfit, but is definitely one suffering from high variance.

In [291]:
confusion_matrix(y_test, predictions_svm)

array([[196,  18],
       [ 41, 203]])

In [292]:
tn, fp, fn, tp = confusion_matrix(y_test, predictions_svm).ravel()
print("SV_Confirmed Star-Wars posts:    %s" % tn)
print("SV_Misclassified Star-Trek posts: %s" % fp)
print("SV_Misclassified Star-Wars posts: %s" % fn)
print("SV_Confirmed Star-Trek posts:    %s" % tp)

SV_Confirmed Star-Wars posts:    196
SV_Misclassified Star-Trek posts: 18
SV_Misclassified Star-Wars posts: 41
SV_Confirmed Star-Trek posts:    203


In [293]:
print(classification_report(y_test, predictions_svm))

             precision    recall  f1-score   support

          0       0.83      0.92      0.87       214
          1       0.92      0.83      0.87       244

avg / total       0.88      0.87      0.87       458



#### I remember to save the latest model fit:

In [35]:
with open('../pickles/p3_svm.pkl', 'wb+') as f:
    pickle.dump(svc, f)

### The summary scores from the modeling process are:

| Model | f1-score |
| --- | --- |
| Logistic Regression | 0.89 |
| SVM | 0.87 |
| Naive-Bayes | 0.86 |
| Random Forests | 0.83 |
| KNN | 0.47 |

## Clearly, the KNN-model was the worst.  On the other hand, three separate models scored in the high-0.8's in accuracy terms.

# Continue to Notebook-4.   > > > > > > >

### In the next notebook, I examine two of the 5 models in greater detail.  I choose to investigate the Logistic-regression model for high accuracy and the Random-forests model for relatively low variance.