# Project 3: 3. Modelling and Classification

In this notebook, we run several models with different vectorizers and compare their training and testing scores.

Models:<br>
&emsp;     1. Logistic Regression Model and Count Vectorizer <br>
&emsp;     2. MultinomialNB Model and Count Vectorizer<br>
&emsp;     3. Logistic Regression Model and TfidfVectorizer<br>
&emsp;     4. MultinomialNB Model and TfidfVectorizer<br>
&emsp;     5. Decision Tree Classifier Model and Count Vectorizer<br>
&emsp;     6. Decision Tree Classifier Model and TfidfVectorizer<br>
&emsp;     7. Random Forest Classifier Model and Count Vectorizer<br>
&emsp;     8. Random Forest Classifier Model and TfidfVectorizer<br>

Afterwards, a final modelling run fits each model with its best parameters and reports the corresponding scores. The results are then tabulated for easy comparison across models.

Importing the relevant libraries for modelling and classification:

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

Reading CSV file from previous notebooks:

In [2]:
df = pd.read_csv("./datasets/combined_df.csv")

In [3]:
df.head()

Unnamed: 0,names,titles,t_tokenized,t_stopped,t_final,posts,p_tokenized,p_stopped,p_final,subreddit,final_combined
0,t3_dczhrz,URL not found,"['url', 'not', 'found']","['url', 'found']","['url', 'found']",Had a customer raise a ticket today which said...,"['had', 'a', 'customer', 'raise', 'a', 'ticket...","['customer', 'raise', 'ticket', 'today', 'said...","['customer', 'raise', 'ticket', 'today', 'said...",talesfromtechsupport,"['url', 'found', 'customer', 'raise', 'ticket'..."
1,t3_db4quc,Screen Protector Did Its Job!,"['screen', 'protector', 'did', 'its', 'job']","['screen', 'protector', 'job']","['screen', 'protector', 'job']",I work at a cell phone retail store. Someone c...,"['i', 'work', 'at', 'a', 'cell', 'phone', 'ret...","['work', 'cell', 'phone', 'retail', 'store', '...","['work', 'cell', 'phone', 'retail', 'store', '...",talesfromretail,"['screen', 'protector', 'job', 'work', 'cell',..."
2,t3_cm7nr1,My dear coworker...,"['my', 'dear', 'coworker']","['dear', 'coworker']","['dear', 'coworker']",I do (or coordinate) all the tech where I work...,"['i', 'do', 'or', 'coordinate', 'all', 'the', ...","['coordinate', 'tech', 'work', 'small', 'part'...","['coordinate', 'tech', 'work', 'small', 'part'...",talesfromtechsupport,"['dear', 'coworker', 'coordinate', 'tech', 'wo..."
3,t3_cyh93q,"""No that's not turquoise""","['no', 'thats', 'not', 'turquoise']","['thats', 'turquoise']","['thats', 'turquoise']","So this happened a few weeks ago, and I was so...","['so', 'this', 'happened', 'a', 'few', 'weeks'...","['happened', 'weeks', 'ago', 'dumbfounded', 'e...","['happened', 'weeks', 'ago', 'dumbfounded', 'e...",talesfromretail,"['thats', 'turquoise', 'happened', 'weeks', 'a..."
4,t3_dah9xr,We broker her Laptop,"['we', 'broker', 'her', 'laptop']","['broker', 'laptop']","['broker', 'laptop']",This is a old one but as it is some time ago i...,"['this', 'is', 'a', 'old', 'one', 'but', 'as',...","['old', 'one', 'time', 'ago', 'recall', 'highl...","['old', 'one', 'time', 'ago', 'recall', 'highl...",talesfromtechsupport,"['broker', 'laptop', 'old', 'one', 'time', 'ag..."


First, we turn `subreddit` into a 1/0 column, where 1 indicates `talesfromtechsupport`.

In [4]:
df['talesfromtechsupport'] = [1 if df.loc[i,'subreddit'] == 'talesfromtechsupport' else 0 for i in range(df.shape[0])]

In [5]:
df['talesfromtechsupport'].value_counts()

1    976
0    442
Name: talesfromtechsupport, dtype: int64
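
The class counts above are imbalanced (976 versus 442), so it helps to know what accuracy a trivial classifier would reach before reading any model scores. As a minimal sketch, the snippet below computes the majority-class baseline from those two counts; the variable names are illustrative and not part of the original notebook.

```python
# Class counts taken from the value_counts() output above
n_techsupport = 976  # label 1
n_retail = 442       # label 0

total = n_techsupport + n_retail

# Accuracy of a classifier that always predicts the majority class
baseline_accuracy = max(n_techsupport, n_retail) / total

print(f"Total posts: {total}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any model scoring close to this baseline would be learning little beyond the class frequencies.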

Here, we split our data into `X` and `y`. We predict the subreddit of each post from its title and post text combined (the `final_combined` column) and see how well the models perform with this feature.

In [6]:
X = df['final_combined']
y = df['talesfromtechsupport']
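
Because `combined_df.csv` is read back from disk, each entry of `final_combined` is a string such as `"['url', 'found', ...]"` rather than a Python list. The sketch below uses a made-up two-row sample (not the real data) to suggest why the vectorizers still cope with this: the default `token_pattern` keeps only runs of two or more word characters, so brackets, quotes, and commas are discarded.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample mimicking how list-valued columns look after a CSV round trip
sample = [
    "['url', 'found', 'customer', 'raise', 'ticket']",
    "['screen', 'protector', 'job', 'work', 'cell']",
]

vec = CountVectorizer()  # default token_pattern: r"(?u)\b\w\w+\b"
counts = vec.fit_transform(sample)

# The learned vocabulary contains only the words, not the list punctuation
vocab = sorted(vec.vocabulary_)
print(vocab)
print(counts.shape)
```

One side effect of the default pattern is that single-character tokens are dropped as well.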

In [25]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.25,
                                                    random_state=42,
                                                    stratify=y)    
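
The `stratify=y` argument keeps the class ratio roughly the same in the training and test splits. As a rough illustration, the sketch below repeats the split on synthetic labels with the same class counts as the target column; `X_demo` and `y_demo` are placeholders, not the notebook's data.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the notebook's target column
y_demo = [1] * 976 + [0] * 442
X_demo = list(range(len(y_demo)))  # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# Per-class counts in each split
train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

The test-split counts should line up with the `support` column of the classification reports further down (111 and 244).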

## GridSearch for Best Parameters:

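We can then use the GridSearch function below to find the best parameters for each model. Note that the "train score" it prints is `grid.best_score_`, the mean cross-validated accuracy of the best parameter combination on the training split, rather than the accuracy on the training data itself.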

In [7]:
def grid_searcher(X, y, vectorizer, model):
    
    X_train, X_test, y_train, y_test = train_test_split(X,
                                                        y,
                                                        test_size=0.25,
                                                        random_state=42,
                                                        stratify=y)
    
    full_name_dict = {'cvec' : 'Count Vectorizer',
                      'tvec' : 'TfidfVectorizer',
                      'multi_nb' : 'MultinomialNB',
                      'lr' : 'Logistic Regression',
                      'dt' : 'Decision Tree Classifier',
                      'rf': 'Random Forest Classifier'}
    
    vec_dict =  {'cvec': CountVectorizer(),
                 'tvec': TfidfVectorizer()
                }
    
    param_dict = {'cvec': {'cvec__max_features': [2500, 3000, 3500],
                           'cvec__min_df': [2, 3],
                           'cvec__max_df': [.9, .95],
                           'cvec__ngram_range': [(1,1), (1,2)]},
                  'tvec': {'tvec__max_features': [2500, 3000, 3500],
                           'tvec__min_df':[2,3],
                           'tvec__max_df':[.9,.95],
                           'tvec__ngram_range':[(1,1),(1,2)]},
                  'dt' : {'dt__max_depth': [3,5],
                          'dt__min_samples_split': [5,10],
                          'dt__min_samples_leaf': [2,3]},
                  'rf' : {'rf__n_estimators': [100],
                          'rf__max_depth': [None, 1, 2],
                          'rf__min_samples_split': [5,10],
                          'rf__min_samples_leaf': [2,3]},
                  'lr' : {},
                  'multi_nb' : {}
                 }

    model_dict = {'multi_nb' : MultinomialNB(),
                  'lr' : LogisticRegression(),
                  'dt' : DecisionTreeClassifier(),
                  'rf' : RandomForestClassifier()
                  }
    
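    # Build a two-step pipeline: vectorizer first, then classifier
    pipe = Pipeline([(vectorizer, vec_dict[vectorizer]),
                     (model, model_dict[model])])

    # Merge the vectorizer grid into the model grid to get the full search space
    param_dict[model].update(param_dict[vectorizer])
    pipe_params = param_dict[model]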
    
    grid = GridSearchCV(pipe,
           param_grid=pipe_params,
           cv=3)
        
    grid.fit(X_train, y_train)
    
    print(f'Using {full_name_dict[model]} Model and {full_name_dict[vectorizer]}:')
    print(f'Model train score : {grid.best_score_}')
    print(f'Model test score : {grid.score(X_test,y_test)}')
    print(f'Model best params : {grid.best_params_}')
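
The parameter names in `param_dict` follow the `<step name>__<parameter>` convention that `Pipeline` uses to route values to the right step. To get a feel for how large each search is, the sketch below counts the candidate combinations with `ParameterGrid`; the grids are copied from `grid_searcher`, while `count_fits` is a hypothetical helper added here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Grids copied from grid_searcher above
cvec_grid = {'cvec__max_features': [2500, 3000, 3500],
             'cvec__min_df': [2, 3],
             'cvec__max_df': [.9, .95],
             'cvec__ngram_range': [(1, 1), (1, 2)]}
dt_grid = {'dt__max_depth': [3, 5],
           'dt__min_samples_split': [5, 10],
           'dt__min_samples_leaf': [2, 3]}
rf_grid = {'rf__n_estimators': [100],
           'rf__max_depth': [None, 1, 2],
           'rf__min_samples_split': [5, 10],
           'rf__min_samples_leaf': [2, 3]}

CV_FOLDS = 3  # same as cv=3 in grid_searcher


def count_fits(*grids):
    """Return (number of candidates, number of cross-validation fits)."""
    merged = {}
    for grid in grids:
        merged.update(grid)
    n_candidates = len(ParameterGrid(merged))
    return n_candidates, n_candidates * CV_FOLDS


for name, grids in [('lr / multi_nb + cvec', (cvec_grid,)),
                    ('dt + cvec', (dt_grid, cvec_grid)),
                    ('rf + cvec', (rf_grid, cvec_grid))]:
    candidates, fits = count_fits(*grids)
    print(f'{name}: {candidates} candidates, {fits} CV fits')
```

The counts exclude the single extra refit that `GridSearchCV` performs on the whole training split once the best combination is known.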

GridSearchCV for Logistic Regression and CountVectorizer:

In [8]:
grid_searcher(X,y,'cvec','lr')



Using Logistic Regression Model and Count Vectorizer:
Model train score : 0.967074317968015
Model test score : 0.952112676056338
Model best params : {'cvec__max_df': 0.9, 'cvec__max_features': 3500, 'cvec__min_df': 3, 'cvec__ngram_range': (1, 1)}


GridSearchCV for MultinomialNB and CountVectorizer:

In [9]:
grid_searcher(X,y,'cvec','multi_nb')

Using MultinomialNB Model and Count Vectorizer:
Model train score : 0.9783631232361242
Model test score : 0.9690140845070423
Model best params : {'cvec__max_df': 0.9, 'cvec__max_features': 3000, 'cvec__min_df': 2, 'cvec__ngram_range': (1, 2)}


GridSearchCV for Logistic Regression and TfidfVectorizer:

In [10]:
grid_searcher(X,y,'tvec','lr')



Using Logistic Regression Model and TfidfVectorizer:
Model train score : 0.9463781749764817
Model test score : 0.9633802816901409
Model best params : {'tvec__max_df': 0.9, 'tvec__max_features': 2500, 'tvec__min_df': 3, 'tvec__ngram_range': (1, 1)}


GridSearchCV for MultinomialNB and TfidfVectorizer:

In [11]:
grid_searcher(X,y,'tvec','multi_nb')

Using MultinomialNB Model and TfidfVectorizer:
Model train score : 0.9510818438381938
Model test score : 0.971830985915493
Model best params : {'tvec__max_df': 0.9, 'tvec__max_features': 2500, 'tvec__min_df': 3, 'tvec__ngram_range': (1, 2)}


GridSearchCV for Decision Tree Classifier and CountVectorizer:

In [12]:
grid_searcher(X,y,'cvec','dt')

Using Decision Tree Classifier Model and Count Vectorizer:
Model train score : 0.883349012229539
Model test score : 0.923943661971831
Model best params : {'cvec__max_df': 0.9, 'cvec__max_features': 2500, 'cvec__min_df': 3, 'cvec__ngram_range': (1, 1), 'dt__max_depth': 5, 'dt__min_samples_leaf': 2, 'dt__min_samples_split': 10}


GridSearchCV for Decision Tree Classifier and TfidfVectorizer:

In [13]:
grid_searcher(X,y,'tvec','dt')

Using Decision Tree Classifier Model and TfidfVectorizer:
Model train score : 0.8852304797742239
Model test score : 0.8845070422535212
Model best params : {'dt__max_depth': 5, 'dt__min_samples_leaf': 3, 'dt__min_samples_split': 5, 'tvec__max_df': 0.95, 'tvec__max_features': 3000, 'tvec__min_df': 3, 'tvec__ngram_range': (1, 2)}


GridSearchCV for Random Forest Classifier and CountVectorizer:

In [14]:
grid_searcher(X,y,'cvec','rf')

Using Random Forest Classifier Model and Count Vectorizer:
Model train score : 0.955785512699906
Model test score : 0.9746478873239437
Model best params : {'cvec__max_df': 0.9, 'cvec__max_features': 3000, 'cvec__min_df': 2, 'cvec__ngram_range': (1, 1), 'rf__max_depth': None, 'rf__min_samples_leaf': 2, 'rf__min_samples_split': 5, 'rf__n_estimators': 100}


GridSearchCV for Random Forest Classifier and TfidfVectorizer:

In [15]:
grid_searcher(X,y,'tvec','rf')

Using Random Forest Classifier Model and TfidfVectorizer:
Model train score : 0.9539040451552211
Model test score : 0.971830985915493
Model best params : {'rf__max_depth': None, 'rf__min_samples_leaf': 2, 'rf__min_samples_split': 10, 'rf__n_estimators': 100, 'tvec__max_df': 0.9, 'tvec__max_features': 2500, 'tvec__min_df': 2, 'tvec__ngram_range': (1, 2)}


Of the 8 models run above: <br>

`MultinomialNB` with `CountVectorizer` gives the best cross-validated score (0.978), along with a strong test score (0.969). <br>
`Random Forest Classifier` with `CountVectorizer` gives the highest test score (0.975), with a cross-validated score of 0.956.

# Final Modelling Runs:

Finally, we use the best parameters found by GridSearch to initialize each model, run predictions on the test data, and interpret the results below.
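
As an aside, retyping the best parameters by hand is one option; another is to reuse the fitted search object directly. The sketch below is a toy illustration on a made-up corpus (not the notebook's data) and assumes the default `refit=True`, in which case `best_estimator_` is the pipeline refit with `best_params_` on everything passed to `fit`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus, purely for illustration
docs = ["printer driver crashed", "server reboot fixed login",
        "password reset ticket", "router firmware update",
        "laptop screen driver", "email server outage",
        "customer wanted refund", "coupon expired at register",
        "cashier shift at store", "customer returned shoes",
        "store closing sale", "refund without receipt"]
labels = [1] * 6 + [0] * 6  # 1 = tech support, 0 = retail

pipe = Pipeline([('cvec', CountVectorizer()),
                 ('multi_nb', MultinomialNB())])
grid = GridSearchCV(pipe,
                    param_grid={'cvec__ngram_range': [(1, 1), (1, 2)]},
                    cv=3)
grid.fit(docs, labels)

# With refit=True (the default), this is a ready-to-use fitted pipeline
final_model = grid.best_estimator_
prediction = final_model.predict(["driver update for the server"])

print(grid.best_params_)
print(prediction)
```

Because `grid_searcher` does not return its `grid` object, using this approach in the notebook would require returning it from the function first.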

### Final Logistic Regression with CountVectorizer model:

In [66]:
lr_cvec_fin = Pipeline([('cvec',CountVectorizer(max_df = 0.9, 
                                                max_features = 3500, 
                                                min_df = 3,
                                                ngram_range = (1, 1))),
                        ('lr', LogisticRegression())])

In [67]:
lr_cvec_fin.fit(X_train,y_train)

lr_cvec_ypred_fin = lr_cvec_fin.predict(X_test)
lr_cvec_fin_acc = accuracy_score(y_test,lr_cvec_ypred_fin)

print(f'Accuracy: {lr_cvec_fin_acc}')
print(classification_report(y_test,lr_cvec_ypred_fin,target_names=['TalesFromRetail','TalesFromTechSupport']))

Accuracy: 0.952112676056338
                      precision    recall  f1-score   support

     TalesFromRetail       0.90      0.95      0.93       111
TalesFromTechSupport       0.98      0.95      0.96       244

           micro avg       0.95      0.95      0.95       355
           macro avg       0.94      0.95      0.95       355
        weighted avg       0.95      0.95      0.95       355





In [68]:
tn, fp, fn, tp = confusion_matrix(y_test, lr_cvec_ypred_fin).ravel()

print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

True Negatives: 106
False Positives: 5
False Negatives: 12
True Positives: 232


<b> Since `TalesFromTechSupport` = 1, in this case: </b>

106 posts were correctly predicted by this model to be from TalesFromRetail (true negatives). <br>
5 posts from TalesFromRetail were wrongly predicted by this model to be from TalesFromTechSupport (false positives). <br>
12 posts from TalesFromTechSupport were wrongly predicted by this model to be from TalesFromRetail (false negatives). <br>
232 posts were correctly predicted by this model to be from TalesFromTechSupport (true positives). <br>
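
The four counts printed above are enough to recover the accuracy and the per-class precision and recall shown in the classification report. The sketch below works through that arithmetic for this model; the variable names are illustrative.

```python
# Counts from the Logistic Regression + CountVectorizer confusion matrix above
tn, fp, fn, tp = 106, 5, 12, 232

accuracy = (tp + tn) / (tn + fp + fn + tp)

# Positive class (TalesFromTechSupport = 1)
precision_tech = tp / (tp + fp)  # of posts predicted tech support, share that really are
recall_tech = tp / (tp + fn)     # of real tech support posts, share that were found

# Negative class (TalesFromRetail = 0): the roles of the counts are mirrored
precision_retail = tn / (tn + fn)
recall_retail = tn / (tn + fp)

print(f"accuracy: {accuracy:.4f}")
print(f"TalesFromTechSupport precision/recall: {precision_tech:.2f} / {recall_tech:.2f}")
print(f"TalesFromRetail precision/recall: {precision_retail:.2f} / {recall_retail:.2f}")
```

The same calculation applies to each of the confusion matrices in the sections that follow.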

### Final Logistic Regression with TfidfVectorizer model:

In [104]:
lr_tvec_fin = Pipeline([('tvec',TfidfVectorizer(max_df = 0.9, 
                                                max_features = 2500, 
                                                min_df = 3,
                                                ngram_range = (1, 1))),
                        ('lr', LogisticRegression())])

In [32]:
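# Sanity check on the training set. Note: this reuses the name lr_cvec_ypred_fin,
# so running it in document order overwrites the test-set predictions from In [67].
lr_cvec_ypred_fin = lr_cvec_fin.predict(X_train)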

In [105]:
lr_tvec_fin.fit(X_train,y_train)

lr_tvec_ypred_fin = lr_tvec_fin.predict(X_test)
lr_tvec_fin_acc = accuracy_score(y_test,lr_tvec_ypred_fin)

print(f'Accuracy: {lr_tvec_fin_acc}')
print(classification_report(y_test,lr_tvec_ypred_fin,target_names=['TalesFromRetail','TalesFromTechSupport']))

Accuracy: 0.9633802816901409
                      precision    recall  f1-score   support

     TalesFromRetail       0.97      0.91      0.94       111
TalesFromTechSupport       0.96      0.99      0.97       244

           micro avg       0.96      0.96      0.96       355
           macro avg       0.97      0.95      0.96       355
        weighted avg       0.96      0.96      0.96       355





In [106]:
tn, fp, fn, tp = confusion_matrix(y_test, lr_tvec_ypred_fin).ravel()

print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

True Negatives: 101
False Positives: 10
False Negatives: 3
True Positives: 241


101 posts were correctly predicted by this model to be from TalesFromRetail (true negatives). <br>
10 posts from TalesFromRetail were wrongly predicted by this model to be from TalesFromTechSupport (false positives). <br>
3 posts from TalesFromTechSupport were wrongly predicted by this model to be from TalesFromRetail (false negatives). <br>
241 posts were correctly predicted by this model to be from TalesFromTechSupport (true positives). <br>

### Final MultinomialNB with CountVectorizer model:

In [33]:
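# Training-set confusion matrix for the Logistic Regression + CountVectorizer
# pipeline, using the training-set predictions from In [32]
confusion_matrix(y_train, lr_cvec_ypred_fin)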

array([[331,   0],
       [  0, 732]], dtype=int64)

In [73]:
nb_cvec_fin = Pipeline([('cvec',CountVectorizer(max_df = 0.9, 
                                                max_features = 3000, 
                                                min_df = 2,
                                                ngram_range = (1, 2))),
                        ('multi_nb', MultinomialNB())])

In [74]:
nb_cvec_fin.fit(X_train,y_train)

nb_cvec_ypred_fin = nb_cvec_fin.predict(X_test)
nb_cvec_fin_acc = accuracy_score(y_test,nb_cvec_ypred_fin)

print(f'Accuracy: {nb_cvec_fin_acc}')
print(classification_report(y_test,nb_cvec_ypred_fin,target_names=['TalesFromRetail','TalesFromTechSupport']))

Accuracy: 0.9690140845070423
                      precision    recall  f1-score   support

     TalesFromRetail       0.93      0.97      0.95       111
TalesFromTechSupport       0.99      0.97      0.98       244

           micro avg       0.97      0.97      0.97       355
           macro avg       0.96      0.97      0.96       355
        weighted avg       0.97      0.97      0.97       355



In [76]:
tn, fp, fn, tp = confusion_matrix(y_test, nb_cvec_ypred_fin).ravel()

print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

True Negatives: 108
False Positives: 3
False Negatives: 8
True Positives: 236


108 posts were correctly predicted by this model to be from TalesFromRetail (true negatives). <br>
3 posts from TalesFromRetail were wrongly predicted by this model to be from TalesFromTechSupport (false positives). <br>
8 posts from TalesFromTechSupport were wrongly predicted by this model to be from TalesFromRetail (false negatives). <br>
236 posts were correctly predicted by this model to be from TalesFromTechSupport (true positives). <br>

### Final MultinomialNB with TfidfVectorizer model:

In [107]:
nb_tvec_fin = Pipeline([('tvec',TfidfVectorizer(max_df = 0.9, 
                                                max_features = 2500, 
                                                min_df = 3,
                                                ngram_range = (1, 2))),
                         ('multi_nb', MultinomialNB())])

In [108]:
nb_tvec_fin.fit(X_train,y_train)

nb_tvec_ypred_fin = nb_tvec_fin.predict(X_test)
nb_tvec_fin_acc = accuracy_score(y_test,nb_tvec_ypred_fin)

print(f'Accuracy: {nb_tvec_fin_acc}')
print(classification_report(y_test,nb_tvec_ypred_fin,target_names=['TalesFromRetail','TalesFromTechSupport']))

Accuracy: 0.971830985915493
                      precision    recall  f1-score   support

     TalesFromRetail       0.96      0.95      0.95       111
TalesFromTechSupport       0.98      0.98      0.98       244

           micro avg       0.97      0.97      0.97       355
           macro avg       0.97      0.96      0.97       355
        weighted avg       0.97      0.97      0.97       355



In [109]:
tn, fp, fn, tp = confusion_matrix(y_test, nb_tvec_ypred_fin).ravel()

print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

True Negatives: 105
False Positives: 6
False Negatives: 4
True Positives: 240


105 posts were correctly predicted by this model to be from TalesFromRetail (true negatives). <br>
6 posts from TalesFromRetail were wrongly predicted by this model to be from TalesFromTechSupport (false positives). <br>
4 posts from TalesFromTechSupport were wrongly predicted by this model to be from TalesFromRetail (false negatives). <br>
240 posts were correctly predicted by this model to be from TalesFromTechSupport (true positives). <br>

### Final Decision Tree Classifier with CountVectorizer model:

In [85]:
dt_cvec_fin = Pipeline([('cvec',CountVectorizer(max_df = 0.9, 
                                                max_features = 2500, 
                                                min_df = 3,
                                                ngram_range = (1, 1))),
                         ('dt', DecisionTreeClassifier(max_depth = 5, 
                                                       min_samples_leaf = 2,
                                                       min_samples_split = 10))])

In [89]:
dt_cvec_fin.fit(X_train,y_train)

dt_cvec_ypred_fin = dt_cvec_fin.predict(X_test)
dt_cvec_fin_acc = accuracy_score(y_test,dt_cvec_ypred_fin)

print(f'Accuracy: {dt_cvec_fin_acc}')
print(classification_report(y_test,dt_cvec_ypred_fin,target_names=['TalesFromRetail','TalesFromTechSupport']))

Accuracy: 0.923943661971831
                      precision    recall  f1-score   support

     TalesFromRetail       0.93      0.82      0.87       111
TalesFromTechSupport       0.92      0.97      0.95       244

           micro avg       0.92      0.92      0.92       355
           macro avg       0.93      0.90      0.91       355
        weighted avg       0.92      0.92      0.92       355



In [87]:
tn, fp, fn, tp = confusion_matrix(y_test, dt_cvec_ypred_fin).ravel()

print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

True Negatives: 91
False Positives: 20
False Negatives: 9
True Positives: 235


91 posts were correctly predicted by this model to be from TalesFromRetail (true negatives). <br>
20 posts from TalesFromRetail were wrongly predicted by this model to be from TalesFromTechSupport (false positives). <br>
9 posts from TalesFromTechSupport were wrongly predicted by this model to be from TalesFromRetail (false negatives). <br>
235 posts were correctly predicted by this model to be from TalesFromTechSupport (true positives). <br>

### Final Decision Tree Classifier with TfidfVectorizer model:

In [88]:
dt_tvec_fin = Pipeline([('tvec',TfidfVectorizer(max_df = 0.95, 
                                                max_features = 3000, 
                                                min_df = 3,
                                                ngram_range = (1, 2))),
                         ('dt', DecisionTreeClassifier(max_depth = 5, 
                                                       min_samples_leaf = 3,
                                                       min_samples_split = 5))])

In [90]:
dt_tvec_fin.fit(X_train,y_train)

dt_tvec_ypred_fin = dt_tvec_fin.predict(X_test)
dt_tvec_fin_acc = accuracy_score(y_test,dt_tvec_ypred_fin)

print(f'Accuracy: {dt_tvec_fin_acc}')
print(classification_report(y_test,dt_tvec_ypred_fin,target_names=['TalesFromRetail','TalesFromTechSupport']))

Accuracy: 0.8788732394366198
                      precision    recall  f1-score   support

     TalesFromRetail       0.85      0.74      0.79       111
TalesFromTechSupport       0.89      0.94      0.91       244

           micro avg       0.88      0.88      0.88       355
           macro avg       0.87      0.84      0.85       355
        weighted avg       0.88      0.88      0.88       355



In [91]:
tn, fp, fn, tp = confusion_matrix(y_test, dt_tvec_ypred_fin).ravel()

print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

True Negatives: 82
False Positives: 29
False Negatives: 14
True Positives: 230


82 posts were correctly predicted by this model to be from TalesFromRetail (true negatives). <br>
29 posts from TalesFromRetail were wrongly predicted by this model to be from TalesFromTechSupport (false positives). <br>
14 posts from TalesFromTechSupport were wrongly predicted by this model to be from TalesFromRetail (false negatives). <br>
230 posts were correctly predicted by this model to be from TalesFromTechSupport (true positives). <br>

### Final Random Forest Classifier with CountVectorizer model:

In [92]:
rf_cvec_fin = Pipeline([('cvec',CountVectorizer(max_df = 0.9, 
                                                max_features = 3000, 
                                                min_df = 2,
                                                ngram_range = (1, 1))),
                         ('rf', RandomForestClassifier(max_depth = None,
                                                       min_samples_leaf = 2,
                                                       min_samples_split = 5,
                                                       n_estimators = 100))])

In [93]:
rf_cvec_fin.fit(X_train,y_train)

rf_cvec_ypred_fin = rf_cvec_fin.predict(X_test)
rf_cvec_fin_acc = accuracy_score(y_test,rf_cvec_ypred_fin)

print(f'Accuracy: {rf_cvec_fin_acc}')
print(classification_report(y_test,rf_cvec_ypred_fin,target_names=['TalesFromRetail','TalesFromTechSupport']))

Accuracy: 0.9774647887323944
                      precision    recall  f1-score   support

     TalesFromRetail       0.99      0.94      0.96       111
TalesFromTechSupport       0.97      1.00      0.98       244

           micro avg       0.98      0.98      0.98       355
           macro avg       0.98      0.97      0.97       355
        weighted avg       0.98      0.98      0.98       355



In [94]:
tn, fp, fn, tp = confusion_matrix(y_test, rf_cvec_ypred_fin).ravel()

print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

True Negatives: 104
False Positives: 7
False Negatives: 1
True Positives: 243


104 posts were correctly predicted by this model to be from TalesFromRetail (true negatives). <br>
7 posts from TalesFromRetail were wrongly predicted by this model to be from TalesFromTechSupport (false positives). <br>
1 post from TalesFromTechSupport was wrongly predicted by this model to be from TalesFromRetail (false negative). <br>
243 posts were correctly predicted by this model to be from TalesFromTechSupport (true positives). <br>

### Final Random Forest Classifier with TfidfVectorizer model:

In [96]:
rf_tvec_fin = Pipeline([('tvec',TfidfVectorizer(max_df = 0.9, 
                                                max_features = 2500, 
                                                min_df = 2,
                                                ngram_range = (1, 2))),
                         ('rf', RandomForestClassifier(max_depth = None,
                                                       min_samples_leaf = 2,
                                                       min_samples_split = 10,
                                                       n_estimators = 100))])

In [97]:
rf_tvec_fin.fit(X_train,y_train)

rf_tvec_ypred_fin = rf_tvec_fin.predict(X_test)
rf_tvec_fin_acc = accuracy_score(y_test,rf_tvec_ypred_fin)

print(f'Accuracy: {rf_tvec_fin_acc}')
print(classification_report(y_test,rf_tvec_ypred_fin,target_names=['TalesFromRetail','TalesFromTechSupport']))

Accuracy: 0.9830985915492958
                      precision    recall  f1-score   support

     TalesFromRetail       1.00      0.95      0.97       111
TalesFromTechSupport       0.98      1.00      0.99       244

           micro avg       0.98      0.98      0.98       355
           macro avg       0.99      0.97      0.98       355
        weighted avg       0.98      0.98      0.98       355



In [98]:
tn, fp, fn, tp = confusion_matrix(y_test, rf_tvec_ypred_fin).ravel()

print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

True Negatives: 105
False Positives: 6
False Negatives: 0
True Positives: 244


105 posts were correctly predicted by this model to be from TalesFromRetail (true negatives). <br>
6 posts from TalesFromRetail were wrongly predicted by this model to be from TalesFromTechSupport (false positives). <br>
0 posts from TalesFromTechSupport were wrongly predicted by this model to be from TalesFromRetail (false negatives). <br>
244 posts were correctly predicted by this model to be from TalesFromTechSupport (true positives). <br>

# Models Score Tabulation:

In [150]:
# Mean cross-validated scores printed by grid_searcher
# (named so as not to shadow the imported GridSearchCV class)
grid_search_scores = [0.967074317968015,
                      0.9463781749764817,
                      0.9783631232361242,
                      0.9510818438381938,
                      0.883349012229539,
                      0.8852304797742239,
                      0.955785512699906,
                      0.9539040451552211]
# Test-set accuracy of each final model, in the same order as `models`
Best_Params = [lr_cvec_fin_acc,
               lr_tvec_fin_acc,
               nb_cvec_fin_acc,
               nb_tvec_fin_acc,
               dt_cvec_fin_acc,
               dt_tvec_fin_acc,
               rf_cvec_fin_acc,
               rf_tvec_fin_acc]

models = ['LRCvec',
          'LRTvec',
          'NBCvec',
          'NBTvec',
          'DTCvec',
          'DTTvec',
          'RFCvec',
          'RFTvec']

score_df = {'GridSearchCV' : grid_search_scores, 'Best Params' : Best_Params, 'Models' : models}

In [151]:
finalized_df = pd.DataFrame(score_df)

In [152]:
finalized_df.set_index('Models')

Models,GridSearchCV,Best Params
LRCvec,0.967074,0.952113
LRTvec,0.946378,0.96338
NBCvec,0.978363,0.969014
NBTvec,0.951082,0.971831
DTCvec,0.883349,0.923944
DTTvec,0.88523,0.878873
RFCvec,0.955786,0.977465
RFTvec,0.953904,0.983099
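
To make the comparison easier to read, the scores can be ranked and the difference between test accuracy and cross-validated score computed per model. The sketch below rebuilds the table from the values printed earlier in this notebook; the RFTvec test accuracy is taken from that model's final run output (0.983), and the `Gap` column and `ranked` name are additions for illustration.

```python
import pandas as pd

# Scores copied from the outputs earlier in this notebook
scores = pd.DataFrame({
    'Models': ['LRCvec', 'LRTvec', 'NBCvec', 'NBTvec',
               'DTCvec', 'DTTvec', 'RFCvec', 'RFTvec'],
    'GridSearchCV': [0.967074317968015, 0.9463781749764817,
                     0.9783631232361242, 0.9510818438381938,
                     0.883349012229539, 0.8852304797742239,
                     0.955785512699906, 0.9539040451552211],
    'Best Params': [0.952112676056338, 0.9633802816901409,
                    0.9690140845070423, 0.971830985915493,
                    0.923943661971831, 0.8788732394366198,
                    0.9774647887323944, 0.9830985915492958],
}).set_index('Models')

# Positive gap: test accuracy exceeds the cross-validated score
scores['Gap'] = scores['Best Params'] - scores['GridSearchCV']

# Rank models by test accuracy, best first
ranked = scores.sort_values('Best Params', ascending=False)
print(ranked.round(4))
```

With a test set of 355 posts, a difference of one or two hundredths corresponds to only a handful of posts, so small gaps between the top models should be read with caution.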
