# Project 3: Web APIs & Classification

## Notebook 4: Modeling Evaluation Conclusion and Recommendation

### Contents:
- [Model selections:](#Model-selections:)
- [Multinomial Naive-Bayes](#Multinomial-Naive-Bayes)
- [Logistic Regressions](#Logistic-Regressions)
- [Random Forest](#Random-Forest)
- [Evaluating Models](#Evaluating-Models)
- [Conclusion and Recommendations](#Conclusion-and-Recommendations)

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import itertools

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score 

In [3]:
X_train = pd.read_csv('../datasets/X_train.csv', index_col = 0)
X_test = pd.read_csv('../datasets/X_test.csv', index_col = 0)
y_train = pd.read_csv('../datasets/y_train.csv', index_col = 0)
y_test = pd.read_csv('../datasets/y_test.csv', index_col = 0)

In [4]:
X_train.head()

Unnamed: 0,combine_text_tokens,combine_text_stem,combine_text_lemma
615,the guy is a walking sugar cookie,the guy is a walk sugar cooki,the guy is a walking sugar cookie
526,claire mccaskill on twitter when agents were ...,clair mccaskil on twitter when agent were mak...,claire mccaskill on twitter when agent were m...
1715,southfield city clerk charged with felonies t...,southfield citi clerk charg with feloni tie t...,southfield city clerk charged with felony tie...
715,amy siskind on the eve of the second mass shoo...,ami siskind on the eve of the second mass shoo...,amy siskind on the eve of the second mass shoo...
1806,hmmm and they call fascist,hmmm and they call fascist,hmmm and they call fascist


In [5]:
X_train_final = X_train['combine_text_tokens'].values.astype('U')
X_test_final = X_test['combine_text_tokens'].values.astype('U')

### Model selections: 
#### 1.  TF-IDF Vectorizer Multinomial Naive-Bayes
  - `tf__ngram_range = (1, 2)`
  - `tf__stop_words = 'english'`
  
#### 2. TF-IDF Vectorizer Scaled Logistic Regression
  - `tf__ngram_range = (1, 2)`
  - `tf__stop_words = 'english'`
  
#### 3. TF-IDF Vectorizer Random Forest 
  - `tf__ngram_range = (1, 1)`
  - `tf__stop_words = 'english'`
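As a rough illustration of what the `ngram_range` setting changes, the sketch below builds word n-grams in plain Python. It is not how `TfidfVectorizer` is implemented, and the helper name `word_ngrams` and the three example tokens are assumptions made for illustration only.

```python
def word_ngrams(tokens, ngram_range=(1, 2)):
    """Return all contiguous word n-grams for n between min_n and max_n."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


# Hypothetical tokens left after stop-word removal
tokens = ["walking", "sugar", "cookie"]

unigrams_only = word_ngrams(tokens, (1, 1))    # setting used for Random Forest
uni_and_bigrams = word_ngrams(tokens, (1, 2))  # setting used for the other two models

print(unigrams_only)
print(uni_and_bigrams)
```

With `(1, 2)` the vocabulary also contains two-word phrases, which is why the `max_features` cap matters more for the Naive-Bayes and Logistic Regression pipelines.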

### Multinomial Naive-Bayes

In [8]:
mnb_loop = pd.DataFrame(columns = ['train_accuracy', 'test_accuracy', 'best_params', 'tn', 'fp', 'fn', 'tp',
                                   'precision', 'recall', 'f1'])

In [27]:
mnb_steps = [('tf', TfidfVectorizer(stop_words = 'english', ngram_range = (1, 2))),
             ('mnb', MultinomialNB())]

mnb_params = {"mnb__alpha": np.arange(.1, 1, .1),
              "tf__max_features": [30000, 32500, 35000, 37500]}

In [28]:
mnb_params_1 = {"mnb__alpha": np.arange(.01, 1, .1),
                "tf__max_features": [22500, 25000, 27500, 30000, 32500]}

In [29]:
mnb_params_2 = {"mnb__alpha": np.arange(.01, 0.5, .05),
                "tf__max_features": [22500, 25000, 27500, 30000, 32500]}

In [15]:
pipe = Pipeline(mnb_steps)

In [26]:
def mnb_pipe(params):
    mnb_results = {}

    grid = GridSearchCV(pipe, params, cv = 5) # optimize GridSearch hyperparameters on `cv=5` cross validation runs
    grid.fit(X_train_final, y_train.values.ravel()) # fit to our training data

    print('Train Accuracy: ', grid.score(X_train_final, y_train))
    mnb_results['train_accuracy'] = grid.score(X_train_final, y_train) # print/store training accuracy

    print('Test Accuracy: ',grid.score(X_test_final, y_test))
    mnb_results['test_accuracy'] = grid.score(X_test_final, y_test) # print/store test accuracy

    print('Best Params: ',grid.best_params_, '\n')
    mnb_results['best_params'] = grid.best_params_ # print/store best parameters

    tn, fp, fn, tp = confusion_matrix(y_test, grid.predict(X_test_final)).ravel() # inspect counted results in matrix
    print("True Negatives: %s" % tn)
    mnb_results['tn'] = tn
    print("False Positives: %s" % fp)
    mnb_results['fp'] = fp
    print("False Negatives: %s" % fn)
    mnb_results['fn'] = fn
    print("True Positives: %s" % tp, '\n')
    mnb_results['tp'] = tp

    print("Precision Score: ", precision_score(y_test, grid.predict(X_test_final)))
    mnb_results['precision'] = precision_score(y_test, grid.predict(X_test_final))

    print("Recall Score: ", recall_score(y_test, grid.predict(X_test_final)))
    mnb_results['recall'] = recall_score(y_test, grid.predict(X_test_final))

    print("F1 Score: ", f1_score(y_test, grid.predict(X_test_final)), '\n')
    mnb_results['f1'] = f1_score(y_test, grid.predict(X_test_final))
    
    return mnb_results

In [30]:
mnb_results = mnb_pipe(mnb_params)

Train Accuracy:  0.9946559786239145
Test Accuracy:  0.744
Best Params:  {'mnb__alpha': 0.2, 'tf__max_features': 30000} 

True Negatives: 197
False Positives: 53
False Negatives: 75
True Positives: 175 

Precision Score:  0.7675438596491229
Recall Score:  0.7
F1 Score:  0.7322175732217573 



In [31]:
mnb_loop = mnb_loop.append(mnb_results, ignore_index = True)

In [33]:
mnb_results = mnb_pipe(mnb_params_1)

Train Accuracy:  0.9946559786239145
Test Accuracy:  0.746
Best Params:  {'mnb__alpha': 0.21000000000000002, 'tf__max_features': 22500} 

True Negatives: 197
False Positives: 53
False Negatives: 74
True Positives: 176 

Precision Score:  0.7685589519650655
Recall Score:  0.704
F1 Score:  0.7348643006263048 



In [34]:
mnb_loop = mnb_loop.append(mnb_results, ignore_index = True)

In [35]:
mnb_results = mnb_pipe(mnb_params_2)

Train Accuracy:  0.9946559786239145
Test Accuracy:  0.746
Best Params:  {'mnb__alpha': 0.21000000000000002, 'tf__max_features': 22500} 

True Negatives: 197
False Positives: 53
False Negatives: 74
True Positives: 176 

Precision Score:  0.7685589519650655
Recall Score:  0.704
F1 Score:  0.7348643006263048 



In [36]:
mnb_loop = mnb_loop.append(mnb_results, ignore_index = True)

In [37]:
mnb_loop

Unnamed: 0,train_accuracy,test_accuracy,best_params,tn,fp,fn,tp,precision,recall,f1
0,0.994656,0.744,"{'mnb__alpha': 0.2, 'tf__max_features': 30000}",197,53,75,175,0.767544,0.7,0.732218
1,0.994656,0.746,"{'mnb__alpha': 0.21000000000000002, 'tf__max_f...",197,53,74,176,0.768559,0.704,0.734864
2,0.994656,0.746,"{'mnb__alpha': 0.21000000000000002, 'tf__max_f...",197,53,74,176,0.768559,0.704,0.734864


### Logistic Regressions

In [52]:
lr_loop = pd.DataFrame(columns = ['train_accuracy', 'test_accuracy', 'best_params', 'tn', 'fp', 'fn', 'tp',
                                  'precision', 'recall', 'f1'])

In [53]:
lr_steps = [('tf', TfidfVectorizer(stop_words = 'english', ngram_range = (1, 2))),
            ('sc', StandardScaler(with_mean = False)),
            ('lr', LogisticRegression(random_state = 42))]

lr_params = {"lr__penalty": ['l2'], "lr__C": [1.2],
             "lr__tol": [.00035], "tf__max_features": [25000, 27500, 30000, 32500, 35000]}

In [54]:
lr_params_1 = {"lr__penalty": ['l2'], "lr__C": [1.2],
               "lr__tol": [.00035], "tf__max_features": [15000, 17500, 20000, 22500, 25000, 27500]}

In [55]:
lr_params_2 = {"lr__penalty": ['l1'], "lr__C": [1, 1.2], "lr__solver": ['liblinear'],
               "lr__tol": [.00035], "tf__max_features": [14000, 15000, 16000]}

In [56]:
lr_params_3 = {"lr__penalty": ['l1'], "lr__C": [1, 1.2], "lr__solver": ['liblinear'],
               "lr__tol": [.00035], "tf__max_features": [12000, 13000, 14000, 15000, 16000]}

In [57]:
pipe = Pipeline(lr_steps)

In [58]:
def lr_pipe(params):
    lr_results = {}

    grid = GridSearchCV(pipe, params, cv = 5) # optimize GridSearch hyperparameters on `cv=5` cross validation runs
    grid.fit(X_train_final, y_train.values.ravel()) # fit to our training data

    print('Train Accuracy: ', grid.score(X_train_final, y_train))
    lr_results['train_accuracy'] = grid.score(X_train_final, y_train) # print/store training accuracy

    print('Test Accuracy: ',grid.score(X_test_final, y_test))
    lr_results['test_accuracy'] = grid.score(X_test_final, y_test) # print/store test accuracy

    print('Best Params: ',grid.best_params_, '\n')
    lr_results['best_params'] = grid.best_params_ # print/store best parameters

    tn, fp, fn, tp = confusion_matrix(y_test, grid.predict(X_test_final)).ravel() # inspect counted results in matrix
    print("True Negatives: %s" % tn)
    lr_results['tn'] = tn
    print("False Positives: %s" % fp)
    lr_results['fp'] = fp
    print("False Negatives: %s" % fn)
    lr_results['fn'] = fn
    print("True Positives: %s" % tp, '\n')
    lr_results['tp'] = tp

    print("Precision Score: ", precision_score(y_test, grid.predict(X_test_final)))
    lr_results['precision'] = precision_score(y_test, grid.predict(X_test_final))

    print("Recall Score: ", recall_score(y_test, grid.predict(X_test_final)))
    lr_results['recall'] = recall_score(y_test, grid.predict(X_test_final))

    print("F1 Score: ", f1_score(y_test, grid.predict(X_test_final)), '\n')
    lr_results['f1'] = f1_score(y_test, grid.predict(X_test_final))
    
    return lr_results

In [59]:
lr_results = lr_pipe(lr_params)

Train Accuracy:  0.9966599866399466
Test Accuracy:  0.722
Best Params:  {'lr__C': 1.2, 'lr__penalty': 'l2', 'lr__tol': 0.00035, 'tf__max_features': 25000} 

True Negatives: 172
False Positives: 78
False Negatives: 61
True Positives: 189 

Precision Score:  0.7078651685393258
Recall Score:  0.756
F1 Score:  0.7311411992263056 



In [60]:
lr_loop = lr_loop.append(lr_results, ignore_index = True)

In [61]:
lr_results = lr_pipe(lr_params_1)

Train Accuracy:  0.9959919839679359
Test Accuracy:  0.712
Best Params:  {'lr__C': 1.2, 'lr__penalty': 'l2', 'lr__tol': 0.00035, 'tf__max_features': 17500} 

True Negatives: 171
False Positives: 79
False Negatives: 65
True Positives: 185 

Precision Score:  0.7007575757575758
Recall Score:  0.74
F1 Score:  0.7198443579766536 



In [62]:
lr_loop = lr_loop.append(lr_results, ignore_index = True)

In [63]:
lr_results = lr_pipe(lr_params_2)

Train Accuracy:  0.9953239812959251
Test Accuracy:  0.726
Best Params:  {'lr__C': 1, 'lr__penalty': 'l1', 'lr__solver': 'liblinear', 'lr__tol': 0.00035, 'tf__max_features': 15000} 

True Negatives: 193
False Positives: 57
False Negatives: 80
True Positives: 170 

Precision Score:  0.748898678414097
Recall Score:  0.68
F1 Score:  0.7127882599580714 



In [64]:
lr_loop = lr_loop.append(lr_results, ignore_index = True)

In [65]:
lr_loop

Unnamed: 0,train_accuracy,test_accuracy,best_params,tn,fp,fn,tp,precision,recall,f1
0,0.99666,0.722,"{'lr__C': 1.2, 'lr__penalty': 'l2', 'lr__tol':...",172,78,61,189,0.707865,0.756,0.731141
1,0.995992,0.712,"{'lr__C': 1.2, 'lr__penalty': 'l2', 'lr__tol':...",171,79,65,185,0.700758,0.74,0.719844
2,0.995324,0.726,"{'lr__C': 1, 'lr__penalty': 'l1', 'lr__solver'...",193,57,80,170,0.748899,0.68,0.712788


### Random Forest

In [38]:
rf_loop = pd.DataFrame(columns = ['train_accuracy', 'test_accuracy', 'best_params', 'tn', 'fp', 'fn', 'tp',
                                  'precision', 'recall', 'f1'])

In [39]:
rf_steps = [('tf', TfidfVectorizer(stop_words = 'english', ngram_range = (1, 1))),
            ('rf', RandomForestClassifier(random_state = 42))]

rf_params = {"rf__n_estimators": np.arange(98, 102, 1), "rf__max_depth": [7, 8, 9], 
             "rf__criterion": ['gini'], "tf__max_features": [None, 35000, 40000, 45000]}

In [40]:
rf_params_1 = {"rf__n_estimators": np.arange(94, 98, 1), "rf__max_depth": [5, 6, 7], 
               "rf__criterion": ['gini'], "tf__max_features": [None, 2500, 5000, 7500]}

In [41]:
rf_params_2 = {"rf__n_estimators": np.arange(100, 104, 1), "rf__max_depth": [6, 7, 8, 9], 
               "rf__criterion": ['gini'], "tf__max_features": [None]}

In [42]:
pipe = Pipeline(rf_steps)
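To get a feel for how much work a grid like `rf_params` asks of `GridSearchCV`, the sketch below counts the parameter combinations in plain Python. It uses `range` in place of `np.arange` so that it has no dependencies; the variable names are assumptions for illustration.

```python
from itertools import product

# Same values as rf_params, with range() standing in for np.arange(98, 102, 1)
rf_grid = {
    "rf__n_estimators": list(range(98, 102)),
    "rf__max_depth": [7, 8, 9],
    "rf__criterion": ["gini"],
    "tf__max_features": [None, 35000, 40000, 45000],
}
cv_folds = 5

# Every combination of one value per parameter is one candidate model
candidates = list(product(*rf_grid.values()))
n_candidates = len(candidates)

# Each candidate is fitted once per cross-validation fold
n_cv_fits = n_candidates * cv_folds

print(n_candidates, n_cv_fits)
```

Narrowing or widening any one list changes the total multiplicatively, which is worth keeping in mind when adding further values to a grid.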

In [43]:
def rf_pipe(params):
    rf_results = {}

    grid = GridSearchCV(pipe, params, cv = 5) # optimize GridSearch hyperparameters on `cv=5` cross validation runs
    grid.fit(X_train_final, y_train.values.ravel()) # fit to our training data

    print('Train Accuracy: ', grid.score(X_train_final, y_train))
    rf_results['train_accuracy'] = grid.score(X_train_final, y_train) # print/store training accuracy

    print('Test Accuracy: ',grid.score(X_test_final, y_test))
    rf_results['test_accuracy'] = grid.score(X_test_final, y_test) # print/store test accuracy

    print('Best Params: ',grid.best_params_, '\n')
    rf_results['best_params'] = grid.best_params_ # print/store best parameters

    tn, fp, fn, tp = confusion_matrix(y_test, grid.predict(X_test_final)).ravel() # inspect counted results in matrix
    print("True Negatives: %s" % tn)
    rf_results['tn'] = tn
    print("False Positives: %s" % fp)
    rf_results['fp'] = fp
    print("False Negatives: %s" % fn)
    rf_results['fn'] = fn
    print("True Positives: %s" % tp, '\n')
    rf_results['tp'] = tp

    print("Precision Score: ", precision_score(y_test, grid.predict(X_test_final)))
    rf_results['precision'] = precision_score(y_test, grid.predict(X_test_final))

    print("Recall Score: ", recall_score(y_test, grid.predict(X_test_final)))
    rf_results['recall'] = recall_score(y_test, grid.predict(X_test_final))

    print("F1 Score: ", f1_score(y_test, grid.predict(X_test_final)), '\n')
    rf_results['f1'] = f1_score(y_test, grid.predict(X_test_final))
    
    return rf_results
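The `mnb_pipe`, `lr_pipe` and `rf_pipe` helpers share the same body and call `grid.predict` several times each. One possible consolidation is sketched below; it is not the code that produced the results in this notebook. The names `confusion_counts`, `summarise_search` and `StubSearch` are hypothetical, and the confusion counts are computed in plain Python so that the sketch runs without scikit-learn.

```python
def confusion_counts(y_true, y_pred):
    """Count (tn, fp, fn, tp) for binary labels where 1 is the positive class."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def summarise_search(search, X_train, y_train, X_test, y_test):
    """Collect the same fields as the *_pipe helpers from an already fitted search."""
    y_pred = list(search.predict(X_test))  # predict once and reuse the result
    tn, fp, fn, tp = confusion_counts(y_test, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "train_accuracy": search.score(X_train, y_train),
        "test_accuracy": search.score(X_test, y_test),
        "best_params": search.best_params_,
        "tn": tn, "fp": fp, "fn": fn, "tp": tp,
        "precision": precision, "recall": recall, "f1": f1,
    }


class StubSearch:
    """Hypothetical stand-in for a fitted GridSearchCV, used only for this demo."""
    best_params_ = {"demo__param": 1}

    def predict(self, X):
        return [1 if "blue" in text else 0 for text in X]

    def score(self, X, y):
        preds = self.predict(X)
        return sum(p == t for p, t in zip(preds, y)) / len(y)


X_demo = ["blue wave", "red wave", "blue state", "red state"]
y_demo = [1, 0, 0, 1]
demo_results = summarise_search(StubSearch(), X_demo, y_demo, X_demo, y_demo)
print(demo_results)
```

With the real objects, the fitted `grid` would be passed as `search` and the labels as flat arrays, for example `y_test.values.ravel()`.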

In [44]:
rf_results = rf_pipe(rf_params)

Train Accuracy:  0.7895791583166333
Test Accuracy:  0.708
Best Params:  {'rf__criterion': 'gini', 'rf__max_depth': 9, 'rf__n_estimators': 98, 'tf__max_features': None} 

True Negatives: 162
False Positives: 88
False Negatives: 58
True Positives: 192 

Precision Score:  0.6857142857142857
Recall Score:  0.768
F1 Score:  0.7245283018867925 



In [45]:
rf_loop = rf_loop.append(rf_results, ignore_index = True)

In [46]:
rf_loop

Unnamed: 0,train_accuracy,test_accuracy,best_params,tn,fp,fn,tp,precision,recall,f1
0,0.789579,0.708,"{'rf__criterion': 'gini', 'rf__max_depth': 9, ...",162,88,58,192,0.685714,0.768,0.724528


In [47]:
rf_results = rf_pipe(rf_params_1)

Train Accuracy:  0.7735470941883767
Test Accuracy:  0.716
Best Params:  {'rf__criterion': 'gini', 'rf__max_depth': 7, 'rf__n_estimators': 96, 'tf__max_features': 2500} 

True Negatives: 172
False Positives: 78
False Negatives: 64
True Positives: 186 

Precision Score:  0.7045454545454546
Recall Score:  0.744
F1 Score:  0.7237354085603113 



In [48]:
rf_loop = rf_loop.append(rf_results, ignore_index = True)

In [49]:
rf_results = rf_pipe(rf_params_2)

Train Accuracy:  0.7895791583166333
Test Accuracy:  0.712
Best Params:  {'rf__criterion': 'gini', 'rf__max_depth': 9, 'rf__n_estimators': 102, 'tf__max_features': None} 

True Negatives: 161
False Positives: 89
False Negatives: 55
True Positives: 195 

Precision Score:  0.6866197183098591
Recall Score:  0.78
F1 Score:  0.7303370786516854 



In [50]:
rf_loop = rf_loop.append(rf_results, ignore_index = True)

In [51]:
rf_loop

Unnamed: 0,train_accuracy,test_accuracy,best_params,tn,fp,fn,tp,precision,recall,f1
0,0.789579,0.708,"{'rf__criterion': 'gini', 'rf__max_depth': 9, ...",162,88,58,192,0.685714,0.768,0.724528
1,0.773547,0.716,"{'rf__criterion': 'gini', 'rf__max_depth': 7, ...",172,78,64,186,0.704545,0.744,0.723735
2,0.789579,0.712,"{'rf__criterion': 'gini', 'rf__max_depth': 9, ...",161,89,55,195,0.68662,0.78,0.730337


### Evaluating Models

**Accuracy** - Out of all the posts (r/democrats or r/Republican), how many did we predict correctly? ((TP + TN) / (TP + FP + TN + FN))  
**Precision** - Of the posts we labeled as democrats, how many are actually democrats posts? (TP / (TP + FP))  
**Recall/Sensitivity** - Of all the democrats posts, how many did we correctly predict? (TP / (TP + FN))  
**F1-score** - The harmonic mean of precision and recall (2 * (Recall * Precision) / (Recall + Precision))  
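As a worked example of these four formulas, the sketch below plugs in the confusion-matrix counts printed for the second Multinomial Naive-Bayes run (TN = 197, FP = 53, FN = 74, TP = 176). It uses plain arithmetic rather than `sklearn.metrics`, and the variable names are chosen for illustration.

```python
# Counts taken from the printed output of the second Multinomial Naive-Bayes run
tn, fp, fn, tp = 197, 53, 74, 176

accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (recall * precision) / (recall + precision)

print(round(accuracy, 6), round(precision, 6), round(recall, 6), round(f1, 6))
```

The rounded values agree with the `test_accuracy`, `precision`, `recall` and `f1` columns reported for that run.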

In [66]:
mnb_loop.sort_values('test_accuracy', ascending = False)

Unnamed: 0,train_accuracy,test_accuracy,best_params,tn,fp,fn,tp,precision,recall,f1
1,0.994656,0.746,"{'mnb__alpha': 0.21000000000000002, 'tf__max_f...",197,53,74,176,0.768559,0.704,0.734864
2,0.994656,0.746,"{'mnb__alpha': 0.21000000000000002, 'tf__max_f...",197,53,74,176,0.768559,0.704,0.734864
0,0.994656,0.744,"{'mnb__alpha': 0.2, 'tf__max_features': 30000}",197,53,75,175,0.767544,0.7,0.732218


Index 1 is chosen among the Multinomial Naive-Bayes runs. Apart from the test accuracy score, it is important to also evaluate the precision and recall of the model, and index 1 outperforms index 0 on both, giving it the highest F1-score (tied with index 2, which found the same best parameters).

In [67]:
lr_loop.sort_values('test_accuracy', ascending = False)

Unnamed: 0,train_accuracy,test_accuracy,best_params,tn,fp,fn,tp,precision,recall,f1
2,0.995324,0.726,"{'lr__C': 1, 'lr__penalty': 'l1', 'lr__solver'...",193,57,80,170,0.748899,0.68,0.712788
0,0.99666,0.722,"{'lr__C': 1.2, 'lr__penalty': 'l2', 'lr__tol':...",172,78,61,189,0.707865,0.756,0.731141
1,0.995992,0.712,"{'lr__C': 1.2, 'lr__penalty': 'l2', 'lr__tol':...",171,79,65,185,0.700758,0.74,0.719844


Index 0 is chosen among the Logistic Regression runs even though it does not have the highest test accuracy score. On closer inspection, it has a higher recall (0.756 vs 0.680) and a higher F1-score (0.731 vs 0.713) than index 2, although index 2 has the higher precision. We decided to pick the model with the higher F1-score because it gives more accurate predictions for posts from the democrats subreddit, which is our focus.
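The difference between ranking by accuracy and ranking by F1-score can be reproduced from the `lr_loop` values shown above. The sketch below copies those three rows into plain dictionaries (the `lr_runs` name is an assumption) and picks the best run under each criterion.

```python
# Rows copied from the lr_loop output above
lr_runs = [
    {"index": 0, "test_accuracy": 0.722, "precision": 0.707865, "recall": 0.756, "f1": 0.731141},
    {"index": 1, "test_accuracy": 0.712, "precision": 0.700758, "recall": 0.740, "f1": 0.719844},
    {"index": 2, "test_accuracy": 0.726, "precision": 0.748899, "recall": 0.680, "f1": 0.712788},
]

best_by_accuracy = max(lr_runs, key=lambda run: run["test_accuracy"])
best_by_f1 = max(lr_runs, key=lambda run: run["f1"])

print("Best by accuracy: index", best_by_accuracy["index"])
print("Best by F1-score: index", best_by_f1["index"])
```

The two criteria disagree here because index 2 gains accuracy through fewer false positives while missing more democrats posts, which lowers its recall and F1-score.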

In [68]:
rf_loop.sort_values('test_accuracy', ascending = False)

Unnamed: 0,train_accuracy,test_accuracy,best_params,tn,fp,fn,tp,precision,recall,f1
1,0.773547,0.716,"{'rf__criterion': 'gini', 'rf__max_depth': 7, ...",172,78,64,186,0.704545,0.744,0.723735
2,0.789579,0.712,"{'rf__criterion': 'gini', 'rf__max_depth': 9, ...",161,89,55,195,0.68662,0.78,0.730337
0,0.789579,0.708,"{'rf__criterion': 'gini', 'rf__max_depth': 9, ...",162,88,58,192,0.685714,0.768,0.724528


Index 1 scored higher than index 2 in terms of test accuracy. However, index 2 has the higher F1-score (0.730 vs 0.724), which means it gives more accurate predictions for posts from the democrats subreddit, which is our focus. Hence the index 2 parameters are chosen.

In [70]:
mnb_loop[2:3]

Unnamed: 0,train_accuracy,test_accuracy,best_params,tn,fp,fn,tp,precision,recall,f1
2,0.994656,0.746,"{'mnb__alpha': 0.21000000000000002, 'tf__max_f...",197,53,74,176,0.768559,0.704,0.734864


In [71]:
final_model = pd.concat([mnb_loop[1:2], lr_loop[:1], rf_loop[2:3]], axis = 0)

In [72]:
final_model.sort_values('test_accuracy', ascending = False)

Unnamed: 0,train_accuracy,test_accuracy,best_params,tn,fp,fn,tp,precision,recall,f1
1,0.994656,0.746,"{'mnb__alpha': 0.21000000000000002, 'tf__max_f...",197,53,74,176,0.768559,0.704,0.734864
0,0.99666,0.722,"{'lr__C': 1.2, 'lr__penalty': 'l2', 'lr__tol':...",172,78,61,189,0.707865,0.756,0.731141
2,0.789579,0.712,"{'rf__criterion': 'gini', 'rf__max_depth': 9, ...",161,89,55,195,0.68662,0.78,0.730337


Looking at the final models selected for the individual model types, the Random Forest model is dropped first, as it has the lowest test accuracy and F1-score of the three. We then compare the Multinomial Naive-Bayes model and the Logistic Regression model. The Multinomial Naive-Bayes model has the higher test accuracy, precision and F1-score, while the Logistic Regression model has the higher recall. On balance the Multinomial Naive-Bayes model outperforms the Logistic Regression model. Hence, our final selection will be the Multinomial Naive-Bayes model with `alpha = 0.21` and `max_features = 22500`.

In [78]:
final_steps = [('tf', TfidfVectorizer(stop_words = 'english', ngram_range = (1, 2), max_features = 22500)),
               ('mnb', MultinomialNB(alpha = 0.21))]

In [79]:
pipe = Pipeline(final_steps)

In [80]:
pipe.fit(X_train_final, y_train.values.ravel())

Pipeline(memory=None,
         steps=[('tf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=22500,
                                 min_df=1, ngram_range=(1, 2), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words='english', strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('mnb',
                 MultinomialNB(alpha=0.21, class_prior=None, fit_prior=True))],
         verbose=False)

In [81]:
pipe.score(X_train_final, y_train)

0.9946559786239145

In [82]:
pipe.score(X_test_final, y_test)

0.746

In [83]:
selected_model = pipe.named_steps['mnb']

In [84]:
selected_vec = pipe.named_steps['tf']

In [85]:
selected_vec.fit_transform(X_train_final)

<1497x18544 sparse matrix of type '<class 'numpy.float64'>'
	with 29683 stored elements in Compressed Sparse Row format>

In [87]:
selected_model_df = pd.DataFrame(selected_model.coef_.T, index = selected_vec.get_feature_names(), columns = ['coef'])

In [89]:
selected_model_df.coef.sort_values(ascending=False).head(10)

trump       -6.361883
black       -7.025105
media       -7.032372
biden       -7.041769
president   -7.236651
like        -7.269424
left        -7.295475
new         -7.338719
joe         -7.349093
people      -7.359577
Name: coef, dtype: float64
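The values above are negative because they are logarithms. The sketch below converts a few of them back to probabilities, assuming that `coef_` here holds the log of P(word | class 1), which is how older scikit-learn releases exposed `MultinomialNB.coef_` for a two-class problem. That reading is an assumption about the version used, not something stated in the notebook.

```python
import math

# A few log values copied from the output above
log_probs = {"trump": -6.361883, "black": -7.025105, "media": -7.032372}

# exp() undoes the natural logarithm, giving a probability between 0 and 1
probs = {word: math.exp(value) for word, value in log_probs.items()}

for word, prob in probs.items():
    print(f"{word}: {prob:.6f}")
```

Under that assumption, a less negative coefficient simply means the word takes up a larger share of the class's total TF-IDF weight.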

In [91]:
# Function researched and borrowed from Stackoverflow
# https://stackoverflow.com/questions/11116697/how-to-get-most-informative-features-for-scikit-learn-classifiers

def important_features(vectorizer, classifier, n = 20):
    """Print the n features with the largest summed TF-IDF weight in each class."""
    class_labels = classifier.classes_
    feature_names = vectorizer.get_feature_names()
    # feature_count_[k] holds the per-feature totals accumulated for class k during fit
    topn_class1 = sorted(zip(classifier.feature_count_[0], feature_names), reverse = True)[:n]
    topn_class2 = sorted(zip(classifier.feature_count_[1], feature_names), reverse = True)[:n]
    print("Important words for r/Republican\n")
    for count, feat in topn_class1:
        print(class_labels[0], count, feat)
    print("-----------------------------------------\n")
    print("Important words for r/democrats\n")
    for count, feat in topn_class2:
        print(class_labels[1], count, feat)

In [93]:
important_features(selected_vec, selected_model, 15)

Important words for r/Republican

0 24.101323119490903 trump
0 8.523535229197075 president
0 7.517688689080907 twitter
0 5.588794782007526 donald
0 5.551203193684958 says
0 5.516157499149025 donald trump
0 5.507025891187277 vote
0 5.455446317451146 gop
0 5.221953423854521 people
0 5.016747455609432 obama
0 4.7216884543495565 house
0 4.40920412953961 clinton
0 4.366930702089541 campaign
0 3.9816754552585594 hillary
0 3.929695598111138 new
-----------------------------------------

Important words for r/democrats

1 11.077877017130145 trump
1 5.605385911379153 black
1 5.563278093974671 media
1 5.509282903760532 biden
1 4.496579039463293 president
1 4.344831980523482 like
1 4.227705256391047 left
1 4.039891621072071 new
1 3.9960326234330528 joe
1 3.9521665022616954 people
1 3.9476687337660734 says
1 3.6770328184155217 antifa
1 3.5774170319809064 don
1 3.5754855064920537 china
1 3.4722865777886565 just
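Several words appear in both lists above. The sketch below copies the two top-15 lists from the output and separates the shared words from the ones that are specific to each subreddit. The variable names are chosen for illustration.

```python
# Top-15 words copied from the output above, in printed order
republican_top = ["trump", "president", "twitter", "donald", "says",
                  "donald trump", "vote", "gop", "people", "obama",
                  "house", "clinton", "campaign", "hillary", "new"]
democrats_top = ["trump", "black", "media", "biden", "president",
                 "like", "left", "new", "joe", "people",
                 "says", "antifa", "don", "china", "just"]

# Words that rank highly for both classes carry little discriminating signal
shared = [word for word in republican_top if word in democrats_top]
only_republican = [word for word in republican_top if word not in democrats_top]
only_democrats = [word for word in democrats_top if word not in republican_top]

print("Shared:", shared)
print("Only r/Republican:", only_republican)
print("Only r/democrats:", only_democrats)
```

A third of each top-15 list is shared, with `trump` ranked first for both classes.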


### Conclusion and Recommendations

Based on our findings, we were able to answer the questions that we formulated as part of the problem statement.  
- Between the two similar subreddits r/democrats and r/Republican, are we able to differentiate the posts using Natural Language Processing models?  

Yes, we were able to differentiate the posts from the two subreddits, improving on our baseline score of 50.025% to reach a test accuracy of 74.6%. However, the overlap in some major keywords between the two subreddits has likely reduced the performance of our models.  
- Which model is then likely to work best?  

The Multinomial Naive-Bayes model seems to work best after evaluating accuracy, precision, recall and F1-score across the 3 different models that were utilised.
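The 50.025% baseline corresponds to always predicting the majority class. The class counts are not shown in this notebook, so the sketch below assumes 999 posts in one class and 998 in the other. Those counts reproduce the quoted figure for 1,497 training rows plus 500 test rows, but they are an assumption, not reported data.

```python
# Assumed class counts (not shown in the notebook); 999 + 998 = 1997 posts
majority_count = 999
minority_count = 998

# Accuracy of a model that always predicts the majority class
baseline = majority_count / (majority_count + minority_count)

# Test accuracy of the selected Multinomial Naive-Bayes model
test_accuracy = 0.746
improvement = test_accuracy - baseline

print(f"Baseline: {baseline:.5%}")
print(f"Improvement over baseline: {improvement:.3f}")
```

Under that assumption the selected model improves on the baseline by roughly 25 percentage points.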

While we have managed to answer the problem statement that we formulated, there is still room for improvement. In particular, the final selected model shows a large difference between the train score (0.995) and the test score (0.746), which is likely due to overfitting. We can take a few steps to further improve the model:

1. Increase the amount of data collected by utilising pushshift.io, an alternative Reddit API that is capable of exceeding the 1000-post limit [source](https://pushshift.io/).
2. Collect additional text information, such as comments, from the individual Reddit posts.
3. Explore alternative social media platforms like Twitter or Facebook.
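To put the train/test difference mentioned above in numbers, the sketch below computes the gap for the three selected models from the scores in the `final_model` table. The `selected` and `gaps` names are chosen for illustration.

```python
# (train_accuracy, test_accuracy) copied from the final_model table
selected = {
    "Multinomial Naive-Bayes": (0.994656, 0.746),
    "Logistic Regression": (0.99666, 0.722),
    "Random Forest": (0.789579, 0.712),
}

# A large positive gap suggests the model fits the training data far better
# than it generalises to unseen posts
gaps = {name: round(train - test, 3) for name, (train, test) in selected.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap}")
```

The depth-limited Random Forest has by far the smallest gap, while the two linear-style models fit the training set almost perfectly, so the improvement steps listed above are most relevant to them.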