# PART TWO 

## Content List - Part 2

- [Data Cleaning and EDA](#Data-Cleaning-and-EDA)
- [Preprocessing and Modeling](#Preprocessing-and-Modeling)
- [Evaluation and Conceptual Understanding](#Evaluation-and-Conceptual-Understanding)
- [Conclusion and Recommendations](#Conclusion-and-Recommendations)

## Data Cleaning and EDA

### Importing packages

In [1]:
import pandas as pd
from sklearn.metrics import confusion_matrix
from xgboost import XGBClassifier
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
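from sklearn.feature_extraction import text
# sklearn.feature_extraction.stop_words is not importable in newer scikit-learn releases;
# text.ENGLISH_STOP_WORDS is the public location, so alias the module for the cells below
stop_words = text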
from sklearn.metrics import accuracy_score, recall_score, precision_score

### Importing the DataFrame from CSV

In [2]:
master_df = pd.read_csv('./data/master_df.csv')

In [3]:
len(master_df)

1799

In [4]:
#check for nulls
master_df.isnull().sum()

ID                 0
Length of Title    0
Post Text          0
Subreddit          0
dtype: int64

## Preprocessing and Modeling

In [5]:
#check shape of new, combined dataframe
master_df.shape

(1799, 4)

In [6]:
master_df.columns

Index(['ID', 'Length of Title', 'Post Text', 'Subreddit'], dtype='object')

In [7]:
master_df['Length of Title'].mean()

108.51917732073375

In [8]:
#set feature and targets
X = master_df[['Post Text', 'Length of Title']]
y = master_df['Subreddit']

In [9]:
#train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify= y)

## Determining Baseline Score

Because accuracy is our metric, we need a baseline to compare our results against. We get it from the distribution of the classes, which also shows whether there is any inherent imbalance.

In [10]:
# Baseline Accuracy
y_test.value_counts(normalize=True)

1    0.551111
0    0.448889
Name: Subreddit, dtype: float64

The baseline accuracy of 55.11% is the score a model would get by always predicting the majority class (1). It is slightly better than a coin flip, and it is the figure every model below should be judged against.
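As a quick illustration of how the baseline falls out of the class counts, here is a minimal sketch using only the standard library. The counts 248 and 202 are an assumption: they are reconstructed from the proportions printed above for the 450 test rows, not read from the data.

```python
from collections import Counter

# Assumed counts: 248 posts of class 1 and 202 of class 0 reproduce the
# proportions printed above for the 450-row test set.
labels = [1] * 248 + [0] * 202

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# Always predicting the majority class is correct exactly this often.
baseline_accuracy = majority_count / len(labels)

print(majority_class, round(baseline_accuracy, 4))
```

Any classifier that cannot beat this figure on the test set is doing no better than ignoring the text entirely.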

In [11]:
#show us the shape of our data
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(1349, 2)
(1349,)
(450, 2)
(450,)


## Amending stop word lists

In [12]:
additional_politics_english_stop = ['www', 'things', 'does', 'x200b', 'amp', 'want', 'watch',
                           'just', 'like', 'https', 'com', 'republican', 'republicans',
                           'libertarians', 'democrats', 'democrat', 'people', 'libertarian',
                           'says', 'say', 'did', 'this', 'conservative', 'conservatives' ]

additional_english_stop = ['www', 'things', 'does', 'x200b', 'amp',
                           'just', 'like', 'https', 'com', 'watch', 'want',
                           'says', 'say', 'did', 'this']

new_stop_list = stop_words.ENGLISH_STOP_WORDS.union(additional_english_stop)
new_politics_english_stop_list = stop_words.ENGLISH_STOP_WORDS.union(additional_politics_english_stop)
print(len(stop_words.ENGLISH_STOP_WORDS))
print(len(additional_english_stop))
print(len(new_politics_english_stop_list))
print(len(new_stop_list))



318
15
341
332
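The lengths above do not simply add up (318 + 15 = 333, yet the combined list has 332 entries) because a set union keeps each word only once, so one of the added words must already have been in the built-in list. A minimal sketch with a toy base list, which is a stand-in and not scikit-learn's actual stop word list:

```python
# Toy stand-in for the built-in list (an assumption, not the real 318 words).
base_stop_words = frozenset({'the', 'and', 'this', 'of'})

# Extra words to add; 'this' is already in the toy base list.
additional_stop_words = ['www', 'https', 'this']

# union() accepts any iterable and returns a new frozenset without duplicates.
combined = base_stop_words.union(additional_stop_words)

# The overlap explains why the combined length is smaller than the sum.
overlap = base_stop_words.intersection(additional_stop_words)

print(len(base_stop_words), len(additional_stop_words), len(combined))
print(sorted(overlap))
```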


In [13]:
new_stop_list

frozenset({'a',
           'about',
           'above',
           'across',
           'after',
           'afterwards',
           'again',
           'against',
           'all',
           'almost',
           'alone',
           'along',
           'already',
           'also',
           'although',
           'always',
           'am',
           'among',
           'amongst',
           'amoungst',
           'amount',
           'amp',
           'an',
           'and',
           'another',
           'any',
           'anyhow',
           'anyone',
           'anything',
           'anyway',
           'anywhere',
           'are',
           'around',
           'as',
           'at',
           'back',
           'be',
           'became',
           'because',
           'become',
           'becomes',
           'becoming',
           'been',
           'before',
           'beforehand',
           'behind',
           'being',
           'below',
           'beside',
  

## Pipeline & GridSearchCV

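Note: when running a grid search with a vectorizer, any additional feature (such as the length feature) has to be added to X_train alongside the vectorized text. The searches below are fit on 'Post Text' only.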
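One way to do this is sketched below with scikit-learn's ColumnTransformer, which vectorizes the text column and passes the numeric column through unchanged. This is only an illustration under assumptions: the tiny DataFrame and its labels are made up, and the notebook itself fits on 'Post Text' alone.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up toy data with the same column names as master_df.
toy_df = pd.DataFrame({
    'Post Text': ['taxes are too high', 'lower taxes now',
                  'free markets work', 'regulate the markets',
                  'cut spending today', 'raise spending today'],
    'Length of Title': [18, 15, 17, 20, 18, 20],
})
toy_y = [1, 1, 0, 0, 1, 0]

# A string column name hands the vectorizer a 1-D series of documents;
# a list of names hands 'passthrough' a 2-D numeric block.
features = ColumnTransformer([
    ('cvec', CountVectorizer(), 'Post Text'),
    ('length', 'passthrough', ['Length of Title']),
])

pipe = Pipeline([
    ('features', features),
    ('lr', LogisticRegression(max_iter=1000)),
])

pipe.fit(toy_df, toy_y)

# Number of model inputs: one column per vocabulary term plus the length column.
n_columns = pipe.named_steps['features'].transform(toy_df).shape[1]
print(n_columns)
```

In a grid search over such a pipeline, the vectorizer parameters would be addressed with the nested prefix, for example 'features__cvec__min_df', and the whole DataFrame (not a single column) would be passed to fit.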

### CountVectorizer with Logistic Regression

In [14]:
pipe_cvec_lr = Pipeline([
    ('cvec', CountVectorizer()),
    ('lr', LogisticRegression())
])

pipe_params_cvec_lr = {
    'cvec__max_features': [None,500,1000],
    'cvec__min_df': [2,3],
    'cvec__max_df': [.3,.4,],
    'cvec__ngram_range': [(1,2),(1,3)],
    'cvec__stop_words': [None,'english',new_stop_list],
    'lr__penalty': ['l2']
}

gs = GridSearchCV(pipe_cvec_lr, param_grid=pipe_params_cvec_lr, cv=5,n_jobs = -1,verbose = 1)

gs.fit(X_train['Post Text'],y_train)


Fitting 5 folds for each of 72 candidates, totalling 360 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    7.8s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   22.7s
[Parallel(n_jobs=-1)]: Done 360 out of 360 | elapsed:   41.5s finished


GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('cvec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid={'cvec__max_features': [None, 500, 1000], 'cvec__min_df': [2, 3], 'cvec__max_df': [0.3, 0.4], 'cvec__ngram_range': [(1, 2), (1, 3)], 'cvec__stop_words': [None, 'english', frozenset({'when', 'although', 'had', 'eleven', 'put', 'seem', 'every', 'former', 'for', 'go', 'whereas', 'cannot', 'n...erywhere', 'becoming', 'even', 'what', 'everything', 'while', 'its', 'de'})], 'lr__penalty': ['l2']},
       pre_dispatch='2*n_jobs', refit=T

In [15]:
cvlr_bestscore = gs.best_score_
cvlr_params = gs.best_params_
cvlr_train = gs.score(X_train["Post Text"],y_train)
cvlr_test= gs.score(X_test["Post Text"],y_test)
cvlr = ('CountVec with LogReg', cvlr_bestscore, cvlr_params, cvlr_train, cvlr_test)


In [16]:
print(f'Best CV Score: {gs.best_score_}')
print(f'Best Parameters: {gs.best_params_}')
print(f'Train Accuracy Score: {gs.score(X_train["Post Text"],y_train)}')
print(f'Test Accuracy Score: {gs.score(X_test["Post Text"],y_test)}')

Best CV Score: 0.7835433654558932
Best Parameters: {'cvec__max_df': 0.4, 'cvec__max_features': None, 'cvec__min_df': 2, 'cvec__ngram_range': (1, 2), 'cvec__stop_words': frozenset({'when', 'although', 'had', 'eleven', 'put', 'seem', 'every', 'former', 'for', 'go', 'whereas', 'cannot', 'nobody', 'serious', 'in', 'over', 'anyone', 'like', 'anywhere', 'one', 'get', 'inc', 'mine', 'afterwards', 'system', 'yet', 'ie', 'her', 're', 'during', 'so', 'hereby', 'the', 'mostly', 'besides', 'again', 'cant', 'itself', 'moreover', 'through', 'done', 'yourselves', 'amongst', 'hereupon', 'whatever', 'say', 'he', 'such', 'empty', 'still', 'made', 'nothing', 'two', 'ever', 'thereupon', 'now', 'but', 'seeming', 'only', 'ltd', 'either', 'where', 'third', 'i', 'be', 'meanwhile', 'seems', 'just', 'and', 'whenever', 'eight', 'onto', 'except', 'next', 'nor', 'may', 'same', 'as', 'can', 'show', 'of', 'an', 'themselves', 'five', 'being', 'my', 'call', 'did', 'https', 'front', 'con', 'they', 'your', 'is', 'upon',

Strong results with CountVectorizer and Logistic Regression: the best CV score in the output above is approximately 0.7835, with 'cvec__max_df': 0.4, 'cvec__max_features': None, 'cvec__min_df': 2, 'cvec__ngram_range': (1, 2), 'cvec__stop_words': new_stop_list, 'lr__penalty': 'l2'.

The printed output is cut off before the train and test scores, so the scores below are the ones recorded from an earlier run of this notebook.

Train Accuracy Score: 0.9798055347793567

Test Accuracy Score: 0.8094170403587444

The train score of approximately 0.9798 is much higher than the test score of 0.8094, which indicates that the model is overfit despite the hyperparameter tuning.
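The "72 candidates" and "360 fits" reported by the grid search above follow directly from the parameter grid: the number of candidates is the product of the number of options per parameter, and each candidate is fit once per fold. A small sketch of that arithmetic, where the string 'custom' is a placeholder standing in for new_stop_list:

```python
from math import prod

param_grid = {
    'cvec__max_features': [None, 500, 1000],
    'cvec__min_df': [2, 3],
    'cvec__max_df': [.3, .4],
    'cvec__ngram_range': [(1, 2), (1, 3)],
    'cvec__stop_words': [None, 'english', 'custom'],  # 'custom' is a placeholder
    'lr__penalty': ['l2'],
}
cv_folds = 5

# One candidate per combination of options.
n_candidates = prod(len(options) for options in param_grid.values())

# Each candidate is fit once on every cross-validation fold.
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)
```

The same arithmetic explains why the random forest searches later in the notebook take much longer: every extra option multiplies the number of fits.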

### TF-IDF with Logistic Regression

In [17]:
pipe_tvec_lr = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('lr', LogisticRegression())
])

pipe_params_tvec_lr = {
    'tvec__max_features': [None,1000],
    'tvec__min_df': [2,3,4],
    'tvec__max_df': [.3,.5],
    'tvec__ngram_range': [(1,1),(1,3)],
    'tvec__stop_words': [None, new_stop_list,'english'],
    'lr__penalty': ['l2']
}

gs = GridSearchCV(pipe_tvec_lr, param_grid=pipe_params_tvec_lr, cv=4, n_jobs=-1, verbose = 1)

gs.fit(X_train['Post Text'],y_train)


print(f'Best Score: {gs.best_score_}')
print(f'Best Parameters: {gs.best_params_}')
print(f'Train Accuracy Score: {gs.score(X_train["Post Text"],y_train)}')
print(f'Test Accuracy Score: {gs.score(X_test["Post Text"],y_test)}')

Fitting 4 folds for each of 72 candidates, totalling 288 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  76 tasks      | elapsed:    8.3s


Best Score: 0.7435137138621201
Best Parameters: {'lr__penalty': 'l2', 'tvec__max_df': 0.5, 'tvec__max_features': None, 'tvec__min_df': 2, 'tvec__ngram_range': (1, 1), 'tvec__stop_words': frozenset({'when', 'although', 'had', 'eleven', 'put', 'seem', 'every', 'former', 'for', 'go', 'whereas', 'cannot', 'nobody', 'serious', 'in', 'over', 'anyone', 'like', 'anywhere', 'one', 'get', 'inc', 'mine', 'afterwards', 'system', 'yet', 'ie', 'her', 're', 'during', 'so', 'hereby', 'the', 'mostly', 'besides', 'again', 'cant', 'itself', 'moreover', 'through', 'done', 'yourselves', 'amongst', 'hereupon', 'whatever', 'say', 'he', 'such', 'empty', 'still', 'made', 'nothing', 'two', 'ever', 'thereupon', 'now', 'but', 'seeming', 'only', 'ltd', 'either', 'where', 'third', 'i', 'be', 'meanwhile', 'seems', 'just', 'and', 'whenever', 'eight', 'onto', 'except', 'next', 'nor', 'may', 'same', 'as', 'can', 'show', 'of', 'an', 'themselves', 'five', 'being', 'my', 'call', 'did', 'https', 'front', 'con', 'they', 'yo

[Parallel(n_jobs=-1)]: Done 288 out of 288 | elapsed:   24.6s finished


In [18]:
tflr_bestscore = gs.best_score_
tflr_params = gs.best_params_
tflr_train = gs.score(X_train["Post Text"],y_train)
tflr_test= gs.score(X_test["Post Text"],y_test)
tflr = ('TF-IDF with LogReg',tflr_bestscore, tflr_params, tflr_train, tflr_test)

In [19]:
print(f'Best Score: {gs.best_score_}')
print(f'Best Parameters: {gs.best_params_}')
print(f'Train Accuracy Score: {gs.score(X_train["Post Text"],y_train)}')
print(f'Test Accuracy Score: {gs.score(X_test["Post Text"],y_test)}')

Best Score: 0.7435137138621201
Best Parameters: {'lr__penalty': 'l2', 'tvec__max_df': 0.5, 'tvec__max_features': None, 'tvec__min_df': 2, 'tvec__ngram_range': (1, 1), 'tvec__stop_words': frozenset({'when', 'although', 'had', 'eleven', 'put', 'seem', 'every', 'former', 'for', 'go', 'whereas', 'cannot', 'nobody', 'serious', 'in', 'over', 'anyone', 'like', 'anywhere', 'one', 'get', 'inc', 'mine', 'afterwards', 'system', 'yet', 'ie', 'her', 're', 'during', 'so', 'hereby', 'the', 'mostly', 'besides', 'again', 'cant', 'itself', 'moreover', 'through', 'done', 'yourselves', 'amongst', 'hereupon', 'whatever', 'say', 'he', 'such', 'empty', 'still', 'made', 'nothing', 'two', 'ever', 'thereupon', 'now', 'but', 'seeming', 'only', 'ltd', 'either', 'where', 'third', 'i', 'be', 'meanwhile', 'seems', 'just', 'and', 'whenever', 'eight', 'onto', 'except', 'next', 'nor', 'may', 'same', 'as', 'can', 'show', 'of', 'an', 'themselves', 'five', 'being', 'my', 'call', 'did', 'https', 'front', 'con', 'they', 'yo

Results for TF-IDF and Logistic Regression: the best CV score in the output above is approximately 0.7435, with optimal parameters 'tvec__max_df': 0.5, 'tvec__max_features': None, 'tvec__min_df': 2, 'tvec__ngram_range': (1, 1), 'tvec__stop_words': new_stop_list, 'lr__penalty': 'l2'.

The printed output is cut off before the train and test scores, so the scores below are the ones recorded from an earlier run of this notebook.

Train Accuracy Score: 0.9281974569932685

Test Accuracy Score: 0.7982062780269058

The train score of approximately 0.9282 is well above the test score of 0.7982, which indicates that this model is overfit despite the hyperparameter tuning.

### CountVectorizer with Multinomial Naive Bayes

In [25]:
pipe_cvec_mnb = Pipeline([
    ('cvec', CountVectorizer()),
    ('mnb', MultinomialNB())
])

pipe_params_cvec_mnb = {
    'cvec__max_features': [None,500,1000,2500],
    'cvec__min_df': [2,3],
    'cvec__max_df': [.4, .8],
    'cvec__ngram_range': [(1,1),(1,2),(1,3)],
    'cvec__stop_words': [None, new_stop_list,'english']
}

gs = GridSearchCV(pipe_cvec_mnb, param_grid=pipe_params_cvec_mnb, cv=4, n_jobs = 4, verbose = 1)

gs.fit(X_train['Post Text'],y_train)

Fitting 4 folds for each of 144 candidates, totalling 576 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    7.1s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:   20.2s
[Parallel(n_jobs=4)]: Done 442 tasks      | elapsed:   43.2s
[Parallel(n_jobs=4)]: Done 576 out of 576 | elapsed:   56.9s finished


GridSearchCV(cv=4, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('cvec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)), ('mnb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]),
       fit_params=None, iid='warn', n_jobs=4,
       param_grid={'cvec__max_features': [None, 500, 1000, 2500], 'cvec__min_df': [2, 3], 'cvec__max_df': [0.4, 0.8], 'cvec__ngram_range': [(1, 1), (1, 2), (1, 3)], 'cvec__stop_words': [None, frozenset({'when', 'although', 'had', 'eleven', 'put', 'seem', 'every', 'former', 'for', 'go', 'whereas', 'cannot',...everal', 'everywhere', 'becoming', 'even', 'what', 'everything', 'while', 'its', 'de

In [26]:
cvmnb_bestscore = gs.best_score_
cvmnb_params = gs.best_params_
cvmnb_train = gs.score(X_train["Post Text"],y_train)
cvmnb_test= gs.score(X_test["Post Text"],y_test)
cvmnb = ('CountVec with MNB',cvmnb_bestscore, cvmnb_params, cvmnb_train, cvmnb_test)

In [23]:
print(f'Best Score: {gs.best_score_}')
print(f'Best Parameters: {gs.best_params_}')
print(f'Train Accuracy Score: {gs.score(X_train["Post Text"],y_train)}')
print(f'Test Accuracy Score: {gs.score(X_test["Post Text"],y_test)}')

Best Score: 0.7086730911786508
Best Parameters: {'cvec__max_df': 0.4, 'cvec__max_features': None, 'cvec__min_df': 2, 'cvec__ngram_range': (1, 3), 'cvec__stop_words': 'english'}
Train Accuracy Score: 0.9154929577464789
Test Accuracy Score: 0.7488888888888889


Results for CountVectorizer and Multinomial Naive Bayes: a best CV score of approximately 0.7087, with optimal parameters 'cvec__max_df': 0.4, 'cvec__max_features': None, 'cvec__min_df': 2, 'cvec__ngram_range': (1, 3), 'cvec__stop_words': 'english'.

Train Accuracy Score: 0.9154929577464789

Test Accuracy Score: 0.7488888888888889

The train score of approximately 0.9155 is much higher than the test score of 0.7489, which indicates that this model is very overfit despite the hyperparameter tuning.

### TF-IDF with Multinomial Naive Bayes

In [29]:
pipe_tvec_mnb = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('mnb', MultinomialNB())
])

pipe_params_tvec_mnb = {
    'tvec__max_features': [None,500,1000,3000],
    'tvec__min_df': [2,3],
    'tvec__max_df': [.2,.3,.4,],
    'tvec__ngram_range': [(1,1),(1,2),(1,3)],
    'tvec__stop_words': [None, new_stop_list,'english']
}

gs = GridSearchCV(pipe_tvec_mnb, param_grid=pipe_params_tvec_mnb, cv=4, n_jobs = -1, verbose = 1)

gs.fit(X_train['Post Text'],y_train)


Fitting 4 folds for each of 216 candidates, totalling 864 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    5.2s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   17.7s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:   43.0s
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done 864 out of 864 | elapsed:  1.4min finished


GridSearchCV(cv=4, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('tvec', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
...rue,
        vocabulary=None)), ('mnb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid={'tvec__max_features': [None, 500, 1000, 3000], 'tvec__min_df': [2, 3], 'tvec__max_df': [0.2, 0.3, 0.4], 'tvec__ngram_range': [(1, 1), (1, 2), (1, 3)], 'tvec__stop_words': [None, frozenset({'when', 'although', 'had', 'eleven', 'put', 'seem', 'every', 'former', 'for', 'go', 'whereas', 'can...everal', 'everywhere', 'becoming', 'even', 'what', 'everything', 'while', 'its', 'de'}), 'english']},
       pre_dispatch='2*n_jobs', refit=T

In [30]:
tfmnb_bestscore = gs.best_score_
tfmnb_params = gs.best_params_
tfmnb_train = gs.score(X_train["Post Text"],y_train)
tfmnb_test= gs.score(X_test["Post Text"],y_test)

tfmnb = ('TF-IDF with MNB',tfmnb_bestscore, tfmnb_params, tfmnb_train, tfmnb_test)

In [31]:

print(f'Best Score: {gs.best_score_}')
print(f'Best Parameters: {gs.best_params_}')
print(f'Train Accuracy Score: {gs.score(X_train["Post Text"],y_train)}')
print(f'Test Accuracy Score: {gs.score(X_test["Post Text"],y_test)}')

Best Score: 0.7131208302446257
Best Parameters: {'tvec__max_df': 0.4, 'tvec__max_features': None, 'tvec__min_df': 2, 'tvec__ngram_range': (1, 2), 'tvec__stop_words': frozenset({'when', 'although', 'had', 'eleven', 'put', 'seem', 'every', 'former', 'for', 'go', 'whereas', 'cannot', 'nobody', 'serious', 'in', 'over', 'anyone', 'like', 'anywhere', 'one', 'get', 'inc', 'mine', 'afterwards', 'system', 'yet', 'ie', 'her', 're', 'during', 'so', 'hereby', 'the', 'mostly', 'besides', 'again', 'cant', 'itself', 'moreover', 'through', 'done', 'yourselves', 'amongst', 'hereupon', 'whatever', 'say', 'he', 'such', 'empty', 'still', 'made', 'nothing', 'two', 'ever', 'thereupon', 'now', 'but', 'seeming', 'only', 'ltd', 'either', 'where', 'third', 'i', 'be', 'meanwhile', 'seems', 'just', 'and', 'whenever', 'eight', 'onto', 'except', 'next', 'nor', 'may', 'same', 'as', 'can', 'show', 'of', 'an', 'themselves', 'five', 'being', 'my', 'call', 'did', 'https', 'front', 'con', 'they', 'your', 'is', 'upon', 'n

Not bad! Results for TF-IDF and Multinomial Naive Bayes: the best CV score in the output above is approximately 0.7131, with optimal parameters 'tvec__max_df': 0.4, 'tvec__max_features': None, 'tvec__min_df': 2, 'tvec__ngram_range': (1, 2), 'tvec__stop_words': new_stop_list.

The printed output is cut off before the train and test scores, so the scores below are the ones recorded from an earlier run of this notebook.

Train Accuracy Score: 0.9012715033657442

Test Accuracy Score: 0.7623318385650224

The train score of approximately 0.9013 is much higher than the test score of 0.7623, which indicates that this model is very overfit despite the hyperparameter tuning.

### Random Forest with CountVectorizer

In [32]:
from sklearn.ensemble import RandomForestClassifier
rf_pipe = Pipeline([
        ('cvec', CountVectorizer()),
        ('rfc', RandomForestClassifier())])

rf_params = [{
    'cvec__max_features': [None, 500,1000],
    'cvec__min_df': [2,3],
    'cvec__max_df': [.3,.4,.8],
    'cvec__ngram_range': [(1,1),(1,2),(1,3)],
    'rfc__bootstrap': [True],
    'rfc__max_features': [.5, .6],
    'rfc__min_samples_leaf': [3,6],
    'rfc__min_samples_split':[3,6],
    'rfc__n_estimators':[10,100]
}]

In [33]:
gs = GridSearchCV(rf_pipe, 
                   param_grid=rf_params, 
                   cv = 4,
                   verbose = 1,
                   n_jobs = -1)

gs.fit(X_train['Post Text'],y_train)

Fitting 4 folds for each of 864 candidates, totalling 3456 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   19.5s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  2.1min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  4.2min
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:  6.0min
[Parallel(n_jobs=-1)]: Done 1242 tasks      | elapsed:  7.9min
[Parallel(n_jobs=-1)]: Done 1792 tasks      | elapsed: 10.3min
[Parallel(n_jobs=-1)]: Done 2442 tasks      | elapsed: 13.0min
[Parallel(n_jobs=-1)]: Done 3192 tasks      | elapsed: 16.0min
[Parallel(n_jobs=-1)]: Done 3456 out of 3456 | elapsed: 17.0min finished


GridSearchCV(cv=4, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('cvec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...obs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid=[{'cvec__max_features': [None, 500, 1000], 'cvec__min_df': [2, 3], 'cvec__max_df': [0.3, 0.4, 0.8], 'cvec__ngram_range': [(1, 1), (1, 2), (1, 3)], 'rfc__bootstrap': [True], 'rfc__max_features': [0.5, 0.6], 'rfc__min_samples_leaf': [3, 6], 'rfc__min_samples_split': [3, 6], 'rfc__n_estimators': [10, 100]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)

In [34]:
cvrf_bestscore = gs.best_score_
cvrf_params = gs.best_params_
cvrf_train = gs.score(X_train["Post Text"],y_train)
cvrf_test= gs.score(X_test["Post Text"],y_test)

cvrf = ('CountVec with RandomForest',cvrf_bestscore, cvrf_params, cvrf_train, cvrf_test)


In [35]:
print(f'Best Score: {gs.best_score_}')
print(f'Best Parameters: {gs.best_params_}')
print(f'Train Accuracy Score: {gs.score(X_train["Post Text"],y_train)}')
print(f'Test Accuracy Score: {gs.score(X_test["Post Text"],y_test)}')

Best Score: 0.7568569310600445
Best Parameters: {'cvec__max_df': 0.4, 'cvec__max_features': 1000, 'cvec__min_df': 2, 'cvec__ngram_range': (1, 2), 'rfc__bootstrap': True, 'rfc__max_features': 0.5, 'rfc__min_samples_leaf': 3, 'rfc__min_samples_split': 3, 'rfc__n_estimators': 10}
Train Accuracy Score: 0.8799110452186805
Test Accuracy Score: 0.7755555555555556


This model reduced variance with the combination of CountVectorizer and RandomForestClassifier. The best parameters were as follows: {'cvec__max_df': 0.4, 'cvec__max_features': 1000, 'cvec__min_df': 2, 'cvec__ngram_range': (1, 2), 'rfc__bootstrap': True, 'rfc__max_features': 0.5, 'rfc__min_samples_leaf': 3, 'rfc__min_samples_split': 3, 'rfc__n_estimators': 10}

Train Accuracy Score: 0.8799110452186805

Test Accuracy Score: 0.7755555555555556

The train accuracy is still higher than the test accuracy, so the model is still overfit, although the smaller gap between the two scores points to lower variance than in the previous models.

### Random Forest with TF-IDF

In [36]:
rf_pipe = Pipeline([
        ('tvec', TfidfVectorizer()),
        ('rfc', RandomForestClassifier())])

rf_params = [{
    'tvec__max_features': [None],
    'tvec__min_df': [2,4],
    'tvec__max_df': [.3,.4, .5],
    'tvec__ngram_range': [(1,1),(1,2),(1,3)],
    'tvec__stop_words': [None],
    'rfc__bootstrap': [False, True],
    'rfc__n_estimators': [10,100],
    'rfc__max_features': [.5, .6, .7],
    'rfc__min_samples_leaf': [10],
    'rfc__min_samples_split':[3]
}]

In [37]:
gs= GridSearchCV(rf_pipe, 
                   param_grid=rf_params, 
                   cv = 4,
                   verbose = 1,
                   n_jobs = 3)

gs.fit(X_train['Post Text'],y_train)

Fitting 4 folds for each of 216 candidates, totalling 864 fits


[Parallel(n_jobs=3)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done  44 tasks      | elapsed:    8.1s
[Parallel(n_jobs=3)]: Done 194 tasks      | elapsed:  1.2min
[Parallel(n_jobs=3)]: Done 444 tasks      | elapsed:  3.3min
[Parallel(n_jobs=3)]: Done 794 tasks      | elapsed:  5.0min
[Parallel(n_jobs=3)]: Done 864 out of 864 | elapsed:  5.6min finished


GridSearchCV(cv=4, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('tvec', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
...obs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=3,
       param_grid=[{'tvec__max_features': [None], 'tvec__min_df': [2, 4], 'tvec__max_df': [0.3, 0.4, 0.5], 'tvec__ngram_range': [(1, 1), (1, 2), (1, 3)], 'tvec__stop_words': [None], 'rfc__bootstrap': [False, True], 'rfc__n_estimators': [10, 100], 'rfc__max_features': [0.5, 0.6, 0.7], 'rfc__min_samples_leaf': [10], 'rfc__min_samples_split': [3]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)

In [38]:
tfrf_bestscore = gs.best_score_
tfrf_params = gs.best_params_
tfrf_train = gs.score(X_train["Post Text"],y_train)
tfrf_test= gs.score(X_test["Post Text"],y_test)
tfrf = ('TF-IDF with RandomForest', tfrf_bestscore, tfrf_params, tfrf_train, tfrf_test)

In [39]:
print(f'Best Score: {gs.best_score_}')
print(f'Best Parameters: {gs.best_params_}')
print(f'Train Accuracy Score: {gs.score(X_train["Post Text"],y_train)}')
print(f'Test Accuracy Score: {gs.score(X_test["Post Text"],y_test)}')

Best Score: 0.7064492216456635
Best Parameters: {'rfc__bootstrap': False, 'rfc__max_features': 0.5, 'rfc__min_samples_leaf': 10, 'rfc__min_samples_split': 3, 'rfc__n_estimators': 100, 'tvec__max_df': 0.3, 'tvec__max_features': None, 'tvec__min_df': 2, 'tvec__ngram_range': (1, 2), 'tvec__stop_words': None}
Train Accuracy Score: 0.8346923647146034
Test Accuracy Score: 0.7355555555555555


The best CV score for the combination of TF-IDF and RandomForestClassifier, approximately 0.7064, was lower than that of the previous model.

The best parameters were as follows: {'rfc__bootstrap': False, 'rfc__max_features': 0.5, 'rfc__min_samples_leaf': 10, 'rfc__min_samples_split': 3, 'rfc__n_estimators': 100, 'tvec__max_df': 0.3, 'tvec__max_features': None, 'tvec__min_df': 2, 'tvec__ngram_range': (1, 2), 'tvec__stop_words': None}

Train Accuracy Score: 0.8346923647146034

Test Accuracy Score: 0.7355555555555555

The train accuracy is still higher than the test accuracy (0.8347 vs 0.7356), so the model is still overfit, with a train-test gap close to that of the previous model.

### AdaBoost with CountVectorizer

In [40]:
from sklearn.ensemble import AdaBoostClassifier

In [41]:
ada_pipe = Pipeline([
        ('cvec', CountVectorizer()),
        ('ada', AdaBoostClassifier())
])

ada_params = {
    'cvec__max_features': [None,500,1000],
    'cvec__min_df': [3,5],
    'cvec__max_df': [.4,.3],
    'cvec__ngram_range': [(1,2),(2,3),(1,3)],
    'cvec__stop_words': [None, 'english', new_stop_list],
    'ada__learning_rate': [0.3,.5,.7]}

gs= GridSearchCV(ada_pipe, 
                   param_grid=ada_params, 
                   cv = 5,
                   verbose = 1,
                   n_jobs = -1)

gs.fit(X_train['Post Text'],y_train)

Fitting 5 folds for each of 324 candidates, totalling 1620 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    7.6s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   29.7s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:  2.2min
[Parallel(n_jobs=-1)]: Done 1242 tasks      | elapsed:  3.5min
[Parallel(n_jobs=-1)]: Done 1620 out of 1620 | elapsed:  4.6min finished


GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('cvec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...m='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None))]),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid={'cvec__max_features': [None, 500, 1000], 'cvec__min_df': [3, 5], 'cvec__max_df': [0.4, 0.3], 'cvec__ngram_range': [(1, 2), (2, 3), (1, 3)], 'cvec__stop_words': [None, 'english', frozenset({'when', 'although', 'had', 'eleven', 'put', 'seem', 'every', 'former', 'for', 'go', 'whereas', 'can...ming', 'even', 'what', 'everything', 'while', 'its', 'de'})], 'ada__learning_rate': [0.3, 0.5, 0.7]},
       pre_dispatch='2*n_jobs', refit=T

In [42]:
cvada_bestscore = gs.best_score_
cvada_params = gs.best_params_
cvada_train = gs.score(X_train["Post Text"],y_train)
cvada_test= gs.score(X_test["Post Text"],y_test)
cvada = ('CountVec with AdaBoost',cvada_bestscore, cvada_params, cvada_train, cvada_test)

In [43]:
print(f'Best Score: {gs.best_score_}')
print(f'Best Parameters: {gs.best_params_}')
print(f'Train Accuracy Score: {gs.score(X_train["Post Text"],y_train)}')
print(f'Test Accuracy Score: {gs.score(X_test["Post Text"],y_test)}')

Best Score: 0.7368421052631579
Best Parameters: {'ada__learning_rate': 0.7, 'cvec__max_df': 0.4, 'cvec__max_features': None, 'cvec__min_df': 3, 'cvec__ngram_range': (1, 3), 'cvec__stop_words': None}
Train Accuracy Score: 0.8057820607857672
Test Accuracy Score: 0.7711111111111111


### AdaBoost with TF-IDF

In [44]:
ada_pipe = Pipeline([
        ('tvec', TfidfVectorizer()),
        ('ada', AdaBoostClassifier())
])

ada_params = {
    'tvec__max_features': [None,500,1000],
    'tvec__min_df': [2,3,4],
    'tvec__max_df': [.5,.4,.3],
    'tvec__ngram_range': [(1,1),(1,3)],
    'tvec__stop_words': [None, 'english', new_stop_list],
    'ada__learning_rate': [.5]}

gs= GridSearchCV(ada_pipe, 
                   param_grid=ada_params, 
                   cv = 3,
                   verbose = 1,
                   n_jobs = -1)

gs.fit(X_train['Post Text'],y_train)

Fitting 3 folds for each of 162 candidates, totalling 486 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    6.0s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   31.5s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 486 out of 486 | elapsed:  1.3min finished


GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('tvec', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
...m='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None))]),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid={'tvec__max_features': [None, 500, 1000], 'tvec__min_df': [2, 3, 4], 'tvec__max_df': [0.5, 0.4, 0.3], 'tvec__ngram_range': [(1, 1), (1, 3)], 'tvec__stop_words': [None, 'english', frozenset({'when', 'although', 'had', 'eleven', 'put', 'seem', 'every', 'former', 'for', 'go', 'whereas', 'can...re', 'becoming', 'even', 'what', 'everything', 'while', 'its', 'de'})], 'ada__learning_rate': [0.5]},
       pre_dispatch='2*n_jobs', refit=T

In [45]:
tfada_bestscore = gs.best_score_
tfada_params = gs.best_params_
tfada_train = gs.score(X_train["Post Text"],y_train)
tfada_test= gs.score(X_test["Post Text"],y_test)
tfada = ('TF-IDF with AdaBoost',tfada_bestscore, tfada_params, tfada_train, tfada_test)

In [46]:
print(f'Best Score: {gs.best_score_}')
print(f'Best Parameters: {gs.best_params_}')
print(f'Train Accuracy Score: {gs.score(X_train["Post Text"],y_train)}')
print(f'Test Accuracy Score: {gs.score(X_test["Post Text"],y_test)}')

Best Score: 0.7160859896219421
Best Parameters: {'ada__learning_rate': 0.5, 'tvec__max_df': 0.5, 'tvec__max_features': 1000, 'tvec__min_df': 2, 'tvec__ngram_range': (1, 3), 'tvec__stop_words': None}
Train Accuracy Score: 0.7894736842105263
Test Accuracy Score: 0.76


AdaBoost with TF-IDF proved the most balanced so far, with lower variance at a comparable accuracy. The grid search selected: 'ada__learning_rate': 0.5, 'tvec__max_df': 0.5, 'tvec__max_features': 1000, 'tvec__min_df': 2, 'tvec__ngram_range': (1, 3), 'tvec__stop_words': None.

Train Accuracy Score: 0.7894736842105263

Test Accuracy Score: 0.76

The scores show a small amount of overfitting (a train/test gap of about 0.03), but of the models tried so far this one should generalize best to new data, so we will make our predictions using it.

### XGBoost with CountVectorizer


In [47]:
xgb_pipe = Pipeline([
        ('cvec', CountVectorizer()),
        ('xgb', XGBClassifier())
])

xgb_params = {
    'cvec__max_features': [None,500,1000],
    'cvec__min_df': [3,5],
    'cvec__max_df': [.4,.3],
    'cvec__ngram_range': [(1,2),(2,3),(1,3)],
    'cvec__stop_words': [None, 'english', new_stop_list]}

gs= GridSearchCV(xgb_pipe, 
                   param_grid= xgb_params, 
                   cv = 5,
                   verbose = 1,
                   n_jobs = -1)

gs.fit(X_train['Post Text'],y_train)

cvxgb_bestscore = gs.best_score_
cvxgb_params = gs.best_params_
cvxgb_train = gs.score(X_train["Post Text"],y_train)
cvxgb_test= gs.score(X_test["Post Text"],y_test)
cvxgb = ('CountVec with XGBoost',cvxgb_bestscore, cvxgb_params, cvxgb_train, cvxgb_test)


print(f'Best Score: {gs.best_score_}')
print(f'Best Parameters: {gs.best_params_}')
print(f'Train Accuracy Score: {gs.score(X_train["Post Text"],y_train)}')
print(f'Test Accuracy Score: {gs.score(X_test["Post Text"],y_test)}')

Fitting 5 folds for each of 108 candidates, totalling 540 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   15.2s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   56.7s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  2.2min
[Parallel(n_jobs=-1)]: Done 540 out of 540 | elapsed:  2.7min finished


Best Score: 0.7412898443291327
Best Parameters: {'cvec__max_df': 0.4, 'cvec__max_features': 1000, 'cvec__min_df': 3, 'cvec__ngram_range': (1, 2), 'cvec__stop_words': None}
Train Accuracy Score: 0.8206078576723499
Test Accuracy Score: 0.7777777777777778


### XGBoost with TF-IDF

In [48]:
xgb_pipe = Pipeline([
        ('tvec', TfidfVectorizer()),
        ('xgb', XGBClassifier())
])

xgb_params = {
    'tvec__max_features': [None,500,1000],
    'tvec__min_df': [2,3,4],
    'tvec__max_df': [.5,.4,.3],
    'tvec__ngram_range': [(1,1),(1,3)],
    'tvec__stop_words': [None, 'english', new_stop_list]}

gs= GridSearchCV(xgb_pipe, 
                   param_grid=xgb_params, 
                   cv = 3,
                   verbose = 1,
                   n_jobs = -1)

gs.fit(X_train['Post Text'],y_train)

Fitting 3 folds for each of 162 candidates, totalling 486 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   18.1s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  2.6min
[Parallel(n_jobs=-1)]: Done 486 out of 486 | elapsed:  2.8min finished


GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('tvec', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
...
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1))]),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid={'tvec__max_features': [None, 500, 1000], 'tvec__min_df': [2, 3, 4], 'tvec__max_df': [0.5, 0.4, 0.3], 'tvec__ngram_range': [(1, 1), (1, 3)], 'tvec__stop_words': [None, 'english', frozenset({'when', 'although', 'had', 'eleven', 'put', 'seem', 'every', 'former', 'for', 'go', 'whereas', 'can...metime', 'several', 'everywhere', 'becoming', 'even', 'what', 'everything', 'while', 'its', 'de'})]},
       pre_dispatch='2*n_jobs', refit=T

In [50]:
tfxgb_bestscore = gs.best_score_
tfxgb_params = gs.best_params_
tfxgb_train = gs.score(X_train["Post Text"],y_train)
tfxgb_test= gs.score(X_test["Post Text"],y_test)
tfxgb = ('TF-IDF with XGBoost',tfxgb_bestscore, tfxgb_params, tfxgb_train, tfxgb_test)

In [51]:
print(f'Best Score: {gs.best_score_}')
print(f'Best Parameters: {gs.best_params_}')
print(f'Train Accuracy Score: {gs.score(X_train["Post Text"],y_train)}')
print(f'Test Accuracy Score: {gs.score(X_test["Post Text"],y_test)}')

Best Score: 0.7197924388435878
Best Parameters: {'tvec__max_df': 0.5, 'tvec__max_features': 1000, 'tvec__min_df': 2, 'tvec__ngram_range': (1, 3), 'tvec__stop_words': frozenset({'when', 'although', 'had', 'eleven', 'put', 'seem', 'every', 'former', 'for', 'go', 'whereas', 'cannot', 'nobody', 'serious', 'in', 'over', 'anyone', 'like', 'anywhere', 'one', 'get', 'inc', 'mine', 'afterwards', 'system', 'yet', 'ie', 'her', 're', 'during', 'so', 'hereby', 'the', 'mostly', 'besides', 'again', 'cant', 'itself', 'moreover', 'through', 'done', 'yourselves', 'amongst', 'hereupon', 'whatever', 'say', 'he', 'such', 'empty', 'still', 'made', 'nothing', 'two', 'ever', 'thereupon', 'now', 'but', 'seeming', 'only', 'ltd', 'either', 'where', 'third', 'i', 'be', 'meanwhile', 'seems', 'just', 'and', 'whenever', 'eight', 'onto', 'except', 'next', 'nor', 'may', 'same', 'as', 'can', 'show', 'of', 'an', 'themselves', 'five', 'being', 'my', 'call', 'did', 'https', 'front', 'con', 'they', 'your', 'is', 'upon', 'n

## Predictions Utilizing the Top Performing Models
I would classify two of the models as the best: the one with the highest overall test score (lowest bias) and the one with the smallest difference between train and test scores (lowest variance). Both are refit and tested below with their tuned hyperparameters.
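The two selection criteria can be sketched in a few lines. This is a minimal illustration rather than the notebook's own code: the scores are copied from outputs printed in this notebook (the grid-searched AdaBoost and XGBoost runs, and the refit Logistic Regression), while the names `candidates`, `train_test_gap`, `best_by_test` and `best_by_gap` are hypothetical.

```python
# (label, train accuracy, test accuracy), copied from printed outputs in this notebook.
# The list and helper names are hypothetical and exist only for this sketch.
candidates = [
    ("TF-IDF with AdaBoost", 0.7894736842105263, 0.76),
    ("CountVec with XGBoost", 0.8206078576723499, 0.7777777777777778),
    ("CountVec with LogisticRegression", 0.9836916234247591, 0.7955555555555556),
]


def train_test_gap(row):
    """Difference between train and test accuracy, used as a rough proxy for variance."""
    _, train_acc, test_acc = row
    return train_acc - test_acc


# Criterion 1: highest accuracy on the held-out test set.
best_by_test = max(candidates, key=lambda row: row[2])

# Criterion 2: smallest gap between train and test accuracy.
best_by_gap = min(candidates, key=train_test_gap)

print("Highest test accuracy:", best_by_test[0])
print("Smallest train/test gap:", best_by_gap[0])
```

The train/test gap is only a rough stand-in for variance; cross-validated score spread would be a more direct measure.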

In [52]:
master_df.head()

| | ID | Length of Title | Post Text | Subreddit |
|---|---|---|---|---|
| 0 | t3_bu8owc | 63 | Research Confirms It: Weak Men Are More Likely... | 0 |
| 1 | t3_c926ee | 60 | The 3 Big Differences Between Conservatives an... | 0 |
| 2 | t3_c9upog | 273 | It's time for trump to do more than lip Servic... | 0 |
| 3 | t3_c9lci0 | 65 | Koch Groups Renew Push to Reform ‘Broken’ U.S.... | 0 |
| 4 | t3_c9i6zr | 95 | 30 people have been shot in Chicago since the ... | 0 |


In [53]:
#define features
X = master_df['Post Text']
y = master_df['Subreddit']

#train test split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    stratify=y,
                                                    random_state=42)

### AdaBoost with TF-IDF

In [54]:
#instantiate Adaboost with learning rate of 0.5 as optimized by GridSearch
ada = AdaBoostClassifier(learning_rate=0.5)

In [55]:
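#instantiate TF-IDF with the hyperparameters used for the final model
#note: these differ from the best_params_ printed by the GridSearch above
#(max_df=0.5, max_features=1000, min_df=2, stop_words=None)
tf = TfidfVectorizer(max_df=0.4,
                     max_features=None,
                     min_df=3,
                     ngram_range=(1, 3),
                     stop_words='english')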


# Fit our TfidfVectorizer on the training data and transform the training data.
# (scikit-learn 1.0+ provides get_feature_names_out(); get_feature_names() was removed in 1.2)
X_train_tf = pd.DataFrame(tf.fit_transform(X_train).todense(),
                          columns=tf.get_feature_names())

# Transform the test data with the vectorizer already fitted on the training data (no refit).
X_test_tf = pd.DataFrame(tf.transform(X_test).todense(),
                         columns=tf.get_feature_names())

In [56]:
#fit the model to our data
ada = ada.fit(X_train_tf, y_train)

In [57]:
X_test_tf.shape

(450, 1691)

In [58]:
y_test.shape

(450,)

In [59]:
ada.score(X_train_tf, y_train)

0.7746478873239436

In [60]:
ada.score(X_test_tf, y_test)

0.7577777777777778

### LogisticRegression and CountVectorizer

In [61]:
#instantiate countvectorizer 
cvec = CountVectorizer(stop_words= new_stop_list,
                       ngram_range=(1,2), min_df=2,
                       max_features=None, max_df = 0.4)

In [62]:
# Fit our CountVectorizer on the training data and transform the training data.
X_train_cvec = pd.DataFrame(cvec.fit_transform(X_train).todense(),
                            columns=cvec.get_feature_names())

# Transform the test data with the vectorizer already fitted on the training data (no refit).
X_test_cvec = pd.DataFrame(cvec.transform(X_test).todense(),
                           columns=cvec.get_feature_names())

In [63]:
#instantiate logisticregression
lr = LogisticRegression()
#fit data
lr = lr.fit(X_train_cvec, y_train)



In [64]:
#examine and verify shape
X_test_cvec.shape

(450, 3078)

In [65]:
#examine shape to verify a fit
y_test.shape

(450,)

In [66]:
#score our logistic regression model on our fitted training data
lr.score(X_train_cvec, y_train)

0.9836916234247591

In [67]:
#score our logistic regression model on our fitted testing data
lr.score(X_test_cvec, y_test)

0.7955555555555556

## Evaluation and Conceptual Understanding

Although our models performed well, there are inherent limitations. First, we must choose between a model with very high variance (Logistic Regression, 0.98 train vs. 0.80 test accuracy) and one with slightly lower accuracy but much lower variance (AdaBoost, 0.77 train vs. 0.76 test accuracy). We are also limited by the computational cost of grid-searching every hyperparameter combination, since each added option multiplies the number of fits.
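To make the computational cost concrete, the number of fits in a grid search is the product of the option counts times the number of folds. The sketch below is illustrative only: the option counts are taken from `ada_params` and from the CountVectorizer `xgb_params` above, and the helper name `count_fits` is hypothetical.

```python
from math import prod

# Number of options per hyperparameter, counted from the grids defined above.
ada_grid_sizes = {
    "tvec__max_features": 3,
    "tvec__min_df": 3,
    "tvec__max_df": 3,
    "tvec__ngram_range": 2,
    "tvec__stop_words": 3,
    "ada__learning_rate": 1,
}
xgb_cvec_grid_sizes = {
    "cvec__max_features": 3,
    "cvec__min_df": 2,
    "cvec__max_df": 2,
    "cvec__ngram_range": 3,
    "cvec__stop_words": 3,
}


def count_fits(grid_sizes, cv):
    """Return (number of candidate settings, number of cross-validation fits)."""
    n_candidates = prod(grid_sizes.values())
    return n_candidates, n_candidates * cv


ada_candidates, ada_fits = count_fits(ada_grid_sizes, cv=3)
xgb_candidates, xgb_fits = count_fits(xgb_cvec_grid_sizes, cv=5)

print(f"AdaBoost grid: {ada_candidates} candidates, {ada_fits} fits")
print(f"XGBoost grid: {xgb_candidates} candidates, {xgb_fits} fits")
```

These totals agree with the "Fitting ... folds for each of ... candidates" lines in the grid search logs above. Adding a second value for `ada__learning_rate` alone would double the AdaBoost total.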

In [70]:

# # prepare configuration for cross validation test harness
# seed = 42
# # prepare models
# models = []
# models.append(('LR', LogisticRegression()))
# models.append(('LDA', LinearDiscriminantAnalysis()))
# models.append(('KNN', KNeighborsClassifier()))
# models.append(('CART', DecisionTreeClassifier()))
# models.append(('NB', GaussianNB()))
# models.append(('SVM', SVC()))
# # evaluate each model in turn
# results = []
# names = []
# scoring = 'accuracy'
# for name, model in models:
# 	kfold = model_selection.KFold(n_splits=10, random_state=seed)
# 	cv_results = model_selection.cross_val_score(model, X, y, cv=kfold, scoring=scoring)
# 	results.append(cv_results)
# 	names.append(name)
# 	msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
# 	print(msg)

In [71]:
# # boxplot algorithm comparison
# fig = plt.figure()
# fig.suptitle('Algorithm Comparison')
# ax = fig.add_subplot(111)
# plt.boxplot(results)
# ax.set_xticklabels(names)
# plt.show()

In [72]:
#generate predictions
pred = ada.predict(X_test_tf)

#generate confusion matrix
conf = confusion_matrix( y_test,# True values.
                     pred)# Predicted values.
tn, fp, fn, tp = conf.ravel()

In [73]:
#convert confusion matrix to dataframe
df_ada= pd.DataFrame(conf, index =  ['actual republican', 'actual democrats'], columns = ['predicted republican', 'predicted democrats'])


#### Confusion Matrix- Adaboost

In [74]:
df_ada

| | predicted republican | predicted democrats |
|---|---|---|
| actual republican | 149 | 53 |
| actual democrats | 56 | 192 |


This provides another view of the accuracy score: 109 of the 450 test posts (about 24%, or roughly 1 in 4) are misclassified.
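The counts in the AdaBoost matrix can be turned into rates directly. The following is a small sketch that copies the four counts from the table above into a plain nested list; the variable names are hypothetical.

```python
# Counts copied from the AdaBoost confusion matrix above
# (rows are actual classes, columns are predicted classes).
conf = [[149, 53],
        [56, 192]]

total = sum(sum(row) for row in conf)
correct = conf[0][0] + conf[1][1]

accuracy = correct / total
error_rate = 1 - accuracy

# Recall per actual class: the share of each row that was predicted correctly.
recall_republican = conf[0][0] / sum(conf[0])
recall_democrats = conf[1][1] / sum(conf[1])

print(f"Accuracy: {accuracy:.4f}")
print(f"Error rate: {error_rate:.4f}")
print(f"Recall (republican): {recall_republican:.4f}")
print(f"Recall (democrats): {recall_democrats:.4f}")
```

The accuracy computed here matches the test score reported by `ada.score(X_test_tf, y_test)` earlier in the notebook.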

#### Confusion Matrix- Logistic Regression

In [75]:
#generate predictions
pred = lr.predict(X_test_cvec)

#generate confusion matrix
conf = confusion_matrix( y_test,# True values.
                     pred)# Predicted values.
tn, fp, fn, tp = conf.ravel()

In [76]:
#convert confusion matrix to dataframe
df_lr= pd.DataFrame(conf, index =  ['actual republican', 'actual democrats'], columns = ['predicted republican', 'predicted democrats'])


In [77]:
df_lr

| | predicted republican | predicted democrats |
|---|---|---|
| actual republican | 164 | 38 |
| actual democrats | 54 | 194 |



## Conclusion and Recommendations

All of our models beat the baseline accuracy of ~55%. Although almost all of them displayed varying degrees of bias, variance and overfitting, the strongest model was LogisticRegression with CountVectorizer, with AdaBoost with TF-IDF as the lower-variance alternative. These were chosen not only on raw accuracy, but also on variance and goodness of fit.

In recommending this model to advertising companies that wish to target potential clients, it is important to weigh the pros and cons of the roughly 80% test accuracy (0.796 in the run above) offered by the Logistic Regression version of our model. Although about 4 out of 5 recipients would be classified correctly, a consistent 1 out of 5 would not actually belong to the class our model assigned them.
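For a targeting use case, precision per predicted class may be more informative than overall accuracy, because it describes how many of the people placed in an audience actually belong there. The sketch below uses the counts from the Logistic Regression confusion matrix above. The campaign size of 1,000 is a hypothetical figure, and the estimate assumes that the test-set rates carry over to new data.

```python
# Counts copied from the Logistic Regression confusion matrix above
# (rows are actual classes, columns are predicted classes).
conf_lr = [[164, 38],
           [54, 194]]

# Column totals: how many posts the model assigned to each class.
predicted_republican = conf_lr[0][0] + conf_lr[1][0]
predicted_democrats = conf_lr[0][1] + conf_lr[1][1]

# Precision per predicted class: the share of each column that truly belongs to that class.
precision_republican = conf_lr[0][0] / predicted_republican
precision_democrats = conf_lr[1][1] / predicted_democrats

# Hypothetical campaign size, used only for illustration.
campaign_size = 1000
expected_hits = round(campaign_size * precision_democrats)

print(f"Precision (republican): {precision_republican:.3f}")
print(f"Precision (democrats): {precision_democrats:.3f}")
print(f"Expected on-target recipients out of {campaign_size}: {expected_hits}")
```

On these counts the two classes are not equally reliable, so the cost of a mistargeted message may differ depending on which audience is being built.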

Additional features could also improve the accuracy of our model. Three ideas for future iterations:

1. Fixing typos and other spelling errors that may have impaired our model's ability to interpret text.

2. Incorporating sentiment analysis, which would involve creating two bags of words that define positive and negative sentiment, then filtering and weighting them accordingly.

3. Incorporating a loudness feature that measures the prevalence of consecutive capital letters. Although our preprocessing transforms all text to lowercase, there is an argument for keeping runs of uppercase text, as they usually convey intense emotion.
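The sentiment and loudness ideas could be prototyped as simple per-post features. This is a rough sketch under stated assumptions, not a tested design: the two word sets are tiny hypothetical lexicons, and "loudness" is defined here as the fraction of words that sit inside a run of two or more consecutive all-caps words.

```python
import re

# Hypothetical, deliberately tiny lexicons; a real project would use a curated list.
POSITIVE_WORDS = {"great", "good", "win", "love"}
NEGATIVE_WORDS = {"bad", "broken", "weak", "hate"}

# A run of two or more consecutive all-caps words, each at least two letters long.
CAPS_RUN = re.compile(r"\b[A-Z]{2,}(?:\s+[A-Z]{2,})+\b")


def loudness(text):
    """Fraction of whitespace-separated words that fall inside an all-caps run."""
    words = text.split()
    if not words:
        return 0.0
    shouted = sum(len(match.group().split()) for match in CAPS_RUN.finditer(text))
    return shouted / len(words)


def sentiment_score(text):
    """Positive lexicon hits minus negative lexicon hits, on lowercased tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positive = sum(token in POSITIVE_WORDS for token in tokens)
    negative = sum(token in NEGATIVE_WORDS for token in tokens)
    return positive - negative


print(loudness("THIS IS BROKEN and nobody cares"))
print(sentiment_score("Great win, but a broken system"))
```

Loudness has to be computed on the raw text, before any lowercasing by the vectorizer. Either score could then be added as an extra numeric column alongside the vectorized text.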
      