## Mini Project 03 - NLP Emotions: Model Comparison

Dataset:
- A. Tripathi, "Emotion Classification NLP", Kaggle.com, 2021. [Online]. Available: https://www.kaggle.com/datasets/anjaneyatripathi/emotion-classification-nlp. [Accessed: 16- Jul- 2022].

Sources:
- WASSA-2017 Shared Task on Emotion Intensity. Saif M. Mohammad and Felipe Bravo-Marquez. In Proceedings of the EMNLP 2017 Workshop on Computational Approaches to Subjectivity, Sentiment, and Social Media (WASSA), September 2017, Copenhagen, Denmark.

Emotion Labels:
- joy: 1
- sadness: 2
- anger: 3
- fear: 4
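
The integer codes above can be kept in a small lookup so that predictions are readable later on. This is a minimal sketch; the names `LABEL_TO_EMOTION`, `EMOTION_TO_LABEL` and `decode_labels` are illustrative and do not appear in the notebook.

```python
# Integer codes as listed above; the dict and function names are illustrative.
LABEL_TO_EMOTION = {1: "joy", 2: "sadness", 3: "anger", 4: "fear"}
EMOTION_TO_LABEL = {name: code for code, name in LABEL_TO_EMOTION.items()}


def decode_labels(codes):
    """Translate a sequence of integer labels into emotion names."""
    return [LABEL_TO_EMOTION[code] for code in codes]


print(decode_labels([1, 4, 3]))  # ['joy', 'fear', 'anger']
```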

### Import Libraries

In [207]:
## Import Libraries
import numpy as np
import pandas as pd
import re as regex
import spacy
from pathlib import Path
import time


import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.model_selection import train_test_split

from sklearn.svm import SVC

from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import average_precision_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
from sklearn.metrics import PrecisionRecallDisplay
from sklearn.metrics import plot_confusion_matrix
from simple_colors import *

import string
from collections import Counter
from sklearn import metrics
from sklearn.metrics import classification_report

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

import neattext.functions as nfx
import nltk

import warnings
warnings.filterwarnings('ignore')

import import_ipynb
import nlpEmotion_functions
from nlpEmotion_functions import show_summary_report, calcRocPrecisionRecall, testCalcRocPrecisionRecall, calcRocAuc

%matplotlib inline

### Load Data

In [208]:
dfData = pd.read_csv("convertedTextDataset.csv")

In [209]:
dfData.head()

Unnamed: 0.1,Unnamed: 0,text,label,short
0,0,Just got back from seeing @GaryDelaney in Burs...,1,get see burslem amaze face hurt laugh hilarious
1,1,Oh dear an evening of absolute hilarity I don'...,1,dear even absolute hilarity don think laugh lo...
2,2,Been waiting all week for this game ❤️❤️❤️ #ch...,1,wait week game red heart red heart red heart c...
3,3,"@gardiner_love : Thank you so much, Gloria! Yo...",1,thank gloria sweet thoughtful day joyful love ...
4,4,I feel so blessed to work with the family that...,1,feel bless work family nanny red heart love am...


In [210]:
dfData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7102 entries, 0 to 7101
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  7102 non-null   int64 
 1   text        7102 non-null   object
 2   label       7102 non-null   int64 
 3   short       7097 non-null   object
dtypes: int64(2), object(2)
memory usage: 222.1+ KB


In [211]:
dfData.drop(['Unnamed: 0'], axis= 1, inplace= True)

In [212]:
dfData.shape

(7102, 3)

### Split the data into train and test sets

In [213]:
## split the dataset
# Features and Labels
X= dfData['short']
y= dfData['label']

# split the dataset
X_train, X_test, y_train, y_test= train_test_split(X, y, test_size= 0.2, random_state= 42)
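
With 7102 rows and `test_size= 0.2`, the split sizes can be checked by hand. The sketch below assumes the test share is rounded up and the remainder goes to training, which agrees with the row counts reported further down (5681 and 1421); `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test), rounding the test share up (assumed behaviour)."""
    n_test = math.ceil(n_samples * test_size)
    return n_samples - n_test, n_test


print(split_sizes(7102, 0.2))  # (5681, 1421)
```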

## Feature Engineering

### Count Vectors as features

In [214]:
# create a count vectorizer object
count_vect = CountVectorizer(token_pattern = r'\w{1,}')

# Learn a vocabulary dictionary of all tokens in the raw documents
# count_vect.fit(dfData['text'])

# Transform documents to document-term matrix.
X_train_count = count_vect.fit_transform(X_train.astype('str'))
X_test_count = count_vect.transform(X_test.astype('str'))

In [215]:
X_train_count.shape

(5681, 8917)
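
To make the document-term matrix concrete, here is a pure-Python sketch of what a count vectorizer does with the same token pattern. It is an illustration, not scikit-learn's implementation: it lowercases the text (as `CountVectorizer` does by default), learns an alphabetically sorted vocabulary from the training documents only, and drops test tokens that were never seen. The example documents are made up.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"\w{1,}")  # same pattern as the cell above


def tokenize(text):
    """Lowercase the text and extract every run of word characters."""
    return TOKEN_PATTERN.findall(text.lower())


def fit_vocabulary(docs):
    """Map each distinct training token to a column index (alphabetical order)."""
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(vocab)}


def transform(docs, vocab):
    """Build one row of token counts per document; unseen tokens are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(tok, 0) for tok in vocab])
    return rows


train_docs = ["feel bless work family", "feel sad sad day"]  # made-up examples
test_docs = ["sad family i"]
vocab = fit_vocabulary(train_docs)
print(vocab)
print(transform(train_docs, vocab))
print(transform(test_docs, vocab))  # 'i' is not in the vocabulary, so it is dropped
```

Fitting on the training split only, as the notebook does, keeps test-set vocabulary from leaking into the features.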

In [216]:
# create a count vectorizer object
# count_vect2 = CountVectorizer(token_pattern = r'\w{1,}')

# Learn a vocabulary dictionary of all tokens in the raw documents
# count_vect.fit(dfData['text'])

# Transform documents to document-term matrix.
# X_train_count2 = count_vect2.fit_transform(X2_train.astype('str'))
# X_test_count2 = count_vect2.transform(X2_test.astype('str'))

### TF-IDF Vectors as features
- Word level
- N-Gram level
- Character level
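
As a reminder of what the TF-IDF weights represent, the sketch below computes them by hand for a tiny made-up corpus. It assumes the smoothed variant, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each row, which is my understanding of the `TfidfVectorizer` defaults; treat it as an illustration rather than the library's code. `tfidf_rows` is a hypothetical helper.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocab, rows) of L2-normalised TF-IDF weights for whitespace-split docs."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(tok for toks in tokenised for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}
    rows = []
    for toks in tokenised:
        tf = Counter(toks)
        raw = [tf[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows


vocab, rows = tfidf_rows(["happy happy day", "sad day", "angry day"])
for tok, weight in zip(vocab, rows[0]):
    print(f"{tok:6s} {weight:.3f}")
```

In the first document, "happy" outweighs "day" because it is both more frequent in that document and rarer across the corpus.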

In [217]:
%%time
# word level tf-idf
tfidf_vect = TfidfVectorizer(analyzer = 'word',
                             token_pattern = r'\w{1,}',
                             max_features = 5000)
print(tfidf_vect)

# tfidf_vect.fit(dfData['text'])
X_train_tfidf = tfidf_vect.fit_transform(X_train.astype('str'))
X_test_tfidf  = tfidf_vect.transform(X_test.astype('str'))

TfidfVectorizer(max_features=5000, token_pattern='\\w{1,}')
CPU times: total: 62.5 ms
Wall time: 68.8 ms


In [218]:
X_train_tfidf

<5681x5000 sparse matrix of type '<class 'numpy.float64'>'
	with 39110 stored elements in Compressed Sparse Row format>

In [219]:
%%time
# ngram level tf-idf
tfidf_vect_ngram = TfidfVectorizer(analyzer = 'word',
                                   token_pattern = r'\w{1,}',
                                   ngram_range = (2, 3),
                                   max_features = 5000)
print(tfidf_vect_ngram)

# tfidf_vect_ngram.fit(dfData['text'])
X_train_tfidf_ngram = tfidf_vect_ngram.fit_transform(X_train.astype('str'))
X_test_tfidf_ngram  = tfidf_vect_ngram.transform(X_test.astype('str'))

TfidfVectorizer(max_features=5000, ngram_range=(2, 3), token_pattern='\\w{1,}')
CPU times: total: 219 ms
Wall time: 227 ms
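
With `ngram_range = (2, 3)` the features are word bigrams and trigrams only; single words are not included. The sketch below lists the n-grams such a setting would produce for one short text. `word_ngrams` is a hypothetical helper that splits on whitespace, so it only approximates the vectorizer's tokenisation.

```python
def word_ngrams(text, min_n=2, max_n=3):
    """List all word n-grams of length min_n..max_n, shortest first."""
    tokens = text.split()
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


print(word_ngrams("feel bless work family"))
# ['feel bless', 'bless work', 'work family', 'feel bless work', 'bless work family']
print(word_ngrams("happy"))  # [] : a one-word text yields no bigram or trigram
```

Short texts therefore contribute very few of these features, which is worth keeping in mind when reading the n-gram scores later in the notebook.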


In [220]:
%%time
# characters level tf-idf
tfidf_vect_ngram_chars = TfidfVectorizer(analyzer = 'char',
#                                          token_pattern = r'\w{1,}',
                                         ngram_range = (2, 3),
                                         max_features = 5000)
print(tfidf_vect_ngram_chars)

# tfidf_vect_ngram_chars.fit(dfData['text'])
X_train_tfidf_ngram_chars = tfidf_vect_ngram_chars.fit_transform(X_train.astype('str'))
X_test_tfidf_ngram_chars  = tfidf_vect_ngram_chars.transform(X_test.astype('str'))

TfidfVectorizer(analyzer='char', max_features=5000, ngram_range=(2, 3))
CPU times: total: 422 ms
Wall time: 422 ms
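
Character-level features slide a window over the raw string instead of over tokens. The sketch below assumes that `analyzer = 'char'` lets the window cross spaces between words (as opposed to `'char_wb'`, which respects word boundaries); `char_ngrams` is a hypothetical helper and skips the whitespace normalisation the library may apply.

```python
def char_ngrams(text, min_n=2, max_n=3):
    """List all character n-grams of length min_n..max_n, spaces included."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(text) - n + 1):
            grams.append(text[i:i + n])
    return grams


print(char_ngrams("so sad"))
# ['so', 'o ', ' s', 'sa', 'ad', 'so ', 'o s', ' sa', 'sad']
```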


### Text / NLP based features

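Create some additional hand-crafted features.

Char Count = number of characters in the text

Word Count = number of words in the text

Word Density = average number of characters per word (total characters, spaces included, divided by the word count)

Punctuation Count = number of punctuation marks in the text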

Title Word Count = Number of Words in Title

Uppercase Word Count = number of all-uppercase words in the text

In [221]:
# Number of characters in text
def char_count(text):
    return len(text)

In [222]:
demoText= dfData['short'][4321]
demoText

'hope bsnl slightly smile face speed slightly smile face win thank elate like neutral face slightly smile face'

In [223]:
charCount= char_count(demoText)
print(charCount)

109


In [224]:
# Number of words in text
def word_count(text):
    return len(text.split())

In [225]:
wordCount= word_count(demoText)
print(wordCount)

18


In [226]:
# Average number of characters per word (character count includes spaces)
def word_density(text):
    charCount= len(text)
    wordCount= len(text.split())
    # guard against empty or whitespace-only text
    if wordCount == 0:
        return 0.0
    return charCount/ wordCount

In [227]:
wordDensity= word_density(demoText)
print(wordDensity)

6.055555555555555


In [228]:
# Number of punctuation marks in text, counted per character
def punctuation_count(text):
    # backslash is escaped explicitly to avoid an invalid escape sequence
    punctuations= "!#$%&'()*+,-./:;<=>?@[\\]^_`{|}~"
    d= dict()
    for i in punctuations:
        d[str(i)+' count']= text.count(i)
    return d

In [229]:
punctCount= punctuation_count(demoText)
print(punctCount)

{'! count': 0, '# count': 0, '$ count': 0, '% count': 0, '& count': 0, "' count": 0, '( count': 0, ') count': 0, '* count': 0, '+ count': 0, ', count': 0, '- count': 0, '. count': 0, '/ count': 0, ': count': 0, '; count': 0, '< count': 0, '= count': 0, '> count': 0, '? count': 0, '@ count': 0, '[ count': 0, '\\ count': 0, '] count': 0, '^ count': 0, '_ count': 0, '` count': 0, '{ count': 0, '| count': 0, '} count': 0, '~ count': 0}


In [230]:
# Number of uppercase words in text
def count_capital_words(text):
    return sum(map(str.isupper, text.split()))

In [231]:
capitalWordCount= count_capital_words(demoText)
print(capitalWordCount)

0
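
The feature list also names a title word count, which has no cell of its own. A minimal sketch follows, assuming it is meant to count title-case words (first letter uppercase, the rest lowercase); `title_word_count` is a hypothetical helper and the sentence is made up.

```python
def title_word_count(text):
    """Count whitespace-separated words in title case, e.g. 'London' but not 'TODAY'."""
    return sum(map(str.istitle, text.split()))


print(title_word_count("Big News from London TODAY"))  # 3
```

Like the uppercase count, this only carries information on the raw `text` column; the preprocessed `short` column is already lowercased.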


In [232]:
## load spaCy
nlp = spacy.load('en_core_web_md')

In [233]:
# tokenisation
# doc = nlp('Drinking a glass of wine is good for your wellbeing!')
doc = nlp(demoText)
doc_Adv= []
for token in doc:
#     print(f"token:{token}\t tag:{token.tag_}\t\tPOS:{token.pos_}\t\t text:'{token.text}' \tlemma:{token.lemma_}\t ")
    if token.pos_ == 'ADV':
        doc_Adv.append([token.text, token.pos_])
doc_Adv

[['slightly', 'ADV'], ['slightly', 'ADV'], ['slightly', 'ADV']]

In [234]:
advFreq= len(doc_Adv)
advFreq

3

In [235]:
def adj_count(cell):
    doc = nlp(cell)
    doc_Pos= []
    for token in doc:
#         print(f"token:{token}\t tag:{token.tag_}\t\tPOS:{token.pos_}\t\t text:'{token.text}' \tlemma:{token.lemma_}\t ")
        if token.pos_ == 'ADJ':
            doc_Pos.append([token.text, token.pos_])
    posFreq= len(doc_Pos)
    return posFreq

def adv_count(cell):
    doc = nlp(cell)
    doc_Pos= []
    for token in doc:
#         print(f"token:{token}\t tag:{token.tag_}\t\tPOS:{token.pos_}\t\t text:'{token.text}' \tlemma:{token.lemma_}\t ")
        if token.pos_ == 'ADV':
            doc_Pos.append([token.text, token.pos_])
    posFreq= len(doc_Pos)
    return posFreq

def noun_count(cell):
    doc = nlp(cell)
    doc_Pos= []
    for token in doc:
#         print(f"token:{token}\t tag:{token.tag_}\t\tPOS:{token.pos_}\t\t text:'{token.text}' \tlemma:{token.lemma_}\t ")
        if token.pos_ == 'NOUN':
            doc_Pos.append([token.text, token.pos_])
    posFreq= len(doc_Pos)
    return posFreq

def num_count(cell):
    doc = nlp(cell)
    doc_Pos= []
    for token in doc:
#         print(f"token:{token}\t tag:{token.tag_}\t\tPOS:{token.pos_}\t\t text:'{token.text}' \tlemma:{token.lemma_}\t ")
        if token.pos_ == 'NUM':
            doc_Pos.append([token.text, token.pos_])
    posFreq= len(doc_Pos)
    return posFreq

def pron_count(cell):
    doc = nlp(cell)
    doc_Pos= []
    for token in doc:
#         print(f"token:{token}\t tag:{token.tag_}\t\tPOS:{token.pos_}\t\t text:'{token.text}' \tlemma:{token.lemma_}\t ")
        if token.pos_ == 'PRON':
            doc_Pos.append([token.text, token.pos_])
    posFreq= len(doc_Pos)
    return posFreq

def propn_count(cell):
    doc = nlp(cell)
    doc_Pos= []
    for token in doc:
#         print(f"token:{token}\t tag:{token.tag_}\t\tPOS:{token.pos_}\t\t text:'{token.text}' \tlemma:{token.lemma_}\t ")
        if token.pos_ == 'PROPN':
            doc_Pos.append([token.text, token.pos_])
    posFreq= len(doc_Pos)
    return posFreq

def verb_count(cell):
    doc = nlp(cell)
    doc_Pos= []
    for token in doc:
#         print(f"token:{token}\t tag:{token.tag_}\t\tPOS:{token.pos_}\t\t text:'{token.text}' \tlemma:{token.lemma_}\t ")
        if token.pos_ == 'VERB':
            doc_Pos.append([token.text, token.pos_])
    posFreq= len(doc_Pos)
    return posFreq
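
The seven functions above differ only in the tag they test, and each one parses the text again. A possible alternative is to parse once and count every tag in a single pass. The sketch below is written against any iterable of objects with a `.pos_` attribute, so it runs without a spaCy model; `pos_counts` is a hypothetical helper, and the stand-in tokens and their tags are invented for the demonstration.

```python
from collections import Counter
from types import SimpleNamespace

POS_TAGS = ("ADJ", "ADV", "NOUN", "NUM", "PRON", "PROPN", "VERB")


def pos_counts(doc, tags=POS_TAGS):
    """Count tokens per coarse POS tag; doc is any iterable of objects with .pos_."""
    counts = Counter(token.pos_ for token in doc)
    return {tag: counts.get(tag, 0) for tag in tags}


# Stand-in tokens with invented tags; in the notebook, doc would be nlp(text).
fake_doc = [
    SimpleNamespace(text=text, pos_=pos)
    for text, pos in [("hope", "VERB"), ("slightly", "ADV"),
                      ("smile", "NOUN"), ("slightly", "ADV")]
]
print(pos_counts(fake_doc))
```

Applied to a DataFrame column, one call per row would yield all seven counts at the cost of a single parse.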

In [236]:
# %%time
# # Initialise some columns for feature's counts
# dfData['char_count']= dfData['text'].apply(lambda x: char_count(x))
# dfData['word_count']= dfData['text'].apply(lambda x: word_count(x))
# dfData['word_density']= dfData['text'].apply(lambda x: word_density(x))
# # dfData['punctuation_count']= dfData['text'].apply(lambda x: punctuation_count(x))
# # dfData['uppercase_word_count']= dfData['text'].apply(lambda x: count_capital_words(x))
# dfData['adj_count']= dfData['text'].apply(lambda x: adj_count(x))
# dfData['adv_count']= dfData['text'].apply(lambda x: adv_count(x))
# dfData['noun_count']= dfData['text'].apply(lambda x: noun_count(x))
# dfData['num_count']= dfData['text'].apply(lambda x: num_count(x))
# dfData['pron_count']= dfData['text'].apply(lambda x: pron_count(x))
# dfData['propn_count']= dfData['text'].apply(lambda x: propn_count(x))
# dfData['verb_count']= dfData['text'].apply(lambda x: verb_count(x))

In [237]:
# dfData.sample(10)

In [238]:
# filepath = Path('nlpTrainFeatures.csv')  
# filepath.parent.mkdir(parents=True, exist_ok=True)  
# dfData.to_csv(filepath)

### Topic Models as features

In [239]:
%%time
# train a LDA Model
lda_model = LatentDirichletAllocation(n_components = 20, learning_method = 'online', max_iter = 20)

Xtrain_topics = lda_model.fit_transform(X_train_count)
Xtest_topics = lda_model.transform(X_test_count)
topic_word = lda_model.components_ 
vocab = count_vect.get_feature_names_out()

CPU times: total: 16.3 s
Wall time: 16.3 s


In [240]:
Xtrain_topics.shape

(5681, 20)

In [241]:
Xtest_topics.shape

(1421, 20)

In [242]:
# view the topic models
n_top_words = 10
topic_summaries = []
print('Group Top Words')
print('-----', '-'*80)
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    top_words = ' '.join(topic_words)
    topic_summaries.append(top_words)
    print('  %3d %s' % (i, top_words))

Group Top Words
----- --------------------------------------------------------------------------------
    0 heart god game follow red party damn joyous awe sick
    1 face joy tear lose say horror shake break fan turn
    2 get anxiety depression tonight revenge new fume thank give great
    3 think shock hold long head madden evil bright reason breezy
    4 fuck feel right have like despair world help sleep actually
    5 face smile start eye cry nervous weary loudly hope sober
    6 people talk worry terrorism way skin person tone light dark
    7 anger stop try year doesn wish end delight hilarious month
    8 love like look work make sad pout little rage laugh
    9 angry day didn away snap post class fall birthday order
   10 want find unhappy hour chirp success car wow dog keep
   11 blue life well gbbo job meet family cause beautiful care
   12 good need fear cheer frown point day quote morning rejoice
   13 time play awful sadness best stay gon guy player sorry
   14 amp spark
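
The slice `[:-(n_top_words+1):-1]` in the cell above is compact but easy to misread: it walks the ascending sort order backwards from the end and stops after `n_top_words` items. The sketch below reproduces that logic with plain lists; the vocabulary and weights are invented, and `top_words` is a hypothetical helper.

```python
def top_words(weights, vocab, n_top=3):
    """Return the n_top vocabulary entries with the largest weights, largest first."""
    # Indices sorted by ascending weight, like np.argsort.
    order = sorted(range(len(weights)), key=weights.__getitem__)
    # Walk backwards from the end and stop after n_top items.
    picked = order[:-(n_top + 1):-1]
    return [vocab[i] for i in picked]


vocab = ["joy", "fear", "game", "heart", "sick"]  # invented example
weights = [0.5, 0.1, 2.0, 3.5, 0.2]
print(top_words(weights, vocab))  # ['heart', 'game', 'joy']
```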

In [243]:
dfTrainTopic= pd.DataFrame(Xtrain_topics)
dfTestTopic= pd.DataFrame(Xtest_topics)

In [244]:
dfTrainTopic.shape

(5681, 20)

In [245]:
dfTestTopic.shape

(1421, 20)

### Modelling

In [246]:
## helper function
def train_model(classifier, feature_vector_train, label, feature_vector_valid, testLabel):
    # fit the training dataset on the classifier
    classifier.fit(feature_vector_train, label)

    # predict the labels on validation dataset
    predictions = classifier.predict(feature_vector_valid)
    accScore= accuracy_score(testLabel, predictions)
    return [accScore, predictions]
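
Accuracy here is simply the share of predictions that equal the true label, so the value does not depend on which of the two label sequences is passed first. A minimal pure-Python sketch with invented labels (the function name `accuracy` is illustrative):

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


y_true = [1, 2, 3, 4, 1]  # invented labels
y_pred = [1, 2, 4, 4, 2]
print(accuracy(y_true, y_pred))  # 0.6
```

Because a single number hides which emotions are confused with each other, the per-class metrics imported at the top of the notebook are a natural complement.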

In [247]:
# Keep the results in a dataframe
results = pd.DataFrame(columns = ['Count Vectors',
                                  'WordLevel TF-IDF',
                                  'N-Gram Vectors',
                                  'CharLevel Vectors'])

### Naive Bayes Classifier

In [248]:
%%time
# Naive Bayes on Count Vectors
gnbAccuracy1 = train_model(MultinomialNB(), X_train_count, y_train, X_test_count, y_test)
print('NB, Count Vectors    : %.4f\n' % gnbAccuracy1[0])

NB, Count Vectors    : 0.8304

CPU times: total: 0 ns
Wall time: 3.99 ms


In [249]:
%%time
# Naive Bayes on Word Level TF IDF Vectors
gnbAccuracy2 = train_model(MultinomialNB(), X_train_tfidf, y_train, X_test_tfidf, y_test)
print('NB, WordLevel TF-IDF : %.4f\n' % gnbAccuracy2[0])

NB, WordLevel TF-IDF : 0.8086

CPU times: total: 0 ns
Wall time: 2.99 ms


In [250]:
%%time
# Naive Bayes on Ngram Level TF IDF Vectors
gnbAccuracy3 = train_model(MultinomialNB(), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram, y_test)
print('NB, N-Gram Vectors   : %.4f\n' % gnbAccuracy3[0])

NB, N-Gram Vectors   : 0.3955

CPU times: total: 15.6 ms
Wall time: 2.99 ms


In [251]:
%%time
# Naive Bayes on Character Level TF IDF Vectors
gnbAccuracy4 = train_model(MultinomialNB(), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars, y_test)
print('NB, CharLevel Vectors: %.4f\n' % gnbAccuracy4[0])

NB, CharLevel Vectors: 0.6953

CPU times: total: 31.2 ms
Wall time: 5.98 ms


In [252]:
results.loc['Naïve Bayes'] = {
    'Count Vectors': gnbAccuracy1[0],
    'WordLevel TF-IDF': gnbAccuracy2[0],
    'N-Gram Vectors': gnbAccuracy3[0],
    'CharLevel Vectors': gnbAccuracy4[0]}

### Linear Classifier

In [253]:
%%time
# Linear Classifier on Count Vectors
logAccuracy1 = train_model(LogisticRegression(solver = 'lbfgs', max_iter = 350), X_train_count, y_train, X_test_count, y_test)
print('LR, Count Vectors    : %.4f\n' % logAccuracy1[0])

LR, Count Vectors    : 0.8431

CPU times: total: 859 ms
Wall time: 1.11 s


In [254]:
%%time
# Linear Classifier on Word Level TF IDF Vectors
logAccuracy2 = train_model(LogisticRegression(solver = 'lbfgs', max_iter = 100), X_train_tfidf, y_train, X_test_tfidf, y_test)
print('LR, WordLevel TF-IDF : %.4f\n' % logAccuracy2[0])

LR, WordLevel TF-IDF : 0.8480

CPU times: total: 547 ms
Wall time: 587 ms


In [255]:
%%time
# Linear Classifier on Ngram Level TF IDF Vectors
logAccuracy3 = train_model(LogisticRegression(solver = 'lbfgs', max_iter = 100), X_train_tfidf_ngram, y_train,
                           X_test_tfidf_ngram, y_test)
print('LR, N-Gram Vectors   : %.4f\n' % logAccuracy3[0])

LR, N-Gram Vectors   : 0.3955

CPU times: total: 562 ms
Wall time: 512 ms


In [256]:
%%time
# Linear Classifier on Character Level TF IDF Vectors
logAccuracy4 = train_model(LogisticRegression(solver = 'lbfgs', max_iter = 100), X_train_tfidf_ngram_chars, y_train,
                           X_test_tfidf_ngram_chars, y_test)
print('LR, CharLevel Vectors: %.4f\n' % logAccuracy4[0])

LR, CharLevel Vectors: 0.7797

CPU times: total: 891 ms
Wall time: 1.04 s


In [257]:
results.loc['Logistic Regression'] = {
    'Count Vectors': logAccuracy1[0],
    'WordLevel TF-IDF': logAccuracy2[0],
    'N-Gram Vectors': logAccuracy3[0],
    'CharLevel Vectors': logAccuracy4[0]}

### Support Vector Machine

In [258]:
%%time
# Support Vector Machine on Count Vectors
svmAccuracy1 = train_model(LinearSVC(), X_train_count, y_train, X_test_count, y_test)
print('SVM, Count Vectors    : %.4f\n' % svmAccuracy1[0])

SVM, Count Vectors    : 0.8360

CPU times: total: 422 ms
Wall time: 422 ms


In [259]:
%%time
# Support Vector Machine on Word Level TF IDF Vectors
svmAccuracy2 = train_model(LinearSVC(), X_train_tfidf, y_train, X_test_tfidf, y_test)
print('SVM, WordLevel TF-IDF : %.4f\n' % svmAccuracy2[0])

SVM, WordLevel TF-IDF : 0.8550

CPU times: total: 46.9 ms
Wall time: 41.9 ms


In [260]:
%%time
# Support Vector Machine on Ngram Level TF IDF Vectors
svmAccuracy3 = train_model(LinearSVC(), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram, y_test)
print('SVM, N-Gram Vectors   : %.4f\n' % svmAccuracy3[0])

SVM, N-Gram Vectors   : 0.3885

CPU times: total: 62.5 ms
Wall time: 47.9 ms


In [261]:
%%time
# Support Vector Machine on Character Level TF IDF Vectors
svmAccuracy4 = train_model(LinearSVC(), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars, y_test)
print('SVM, CharLevel Vectors: %.4f\n' % svmAccuracy4[0])

SVM, CharLevel Vectors: 0.8030

CPU times: total: 203 ms
Wall time: 205 ms


In [262]:
results.loc['Support Vector Machine'] = {
    'Count Vectors': svmAccuracy1[0],
    'WordLevel TF-IDF': svmAccuracy2[0],
    'N-Gram Vectors': svmAccuracy3[0],
    'CharLevel Vectors': svmAccuracy4[0]}

### Bagging Models

In [263]:
%%time
# Bagging (Random Forest) on Count Vectors
rfcAccuracy1 = train_model(RandomForestClassifier(n_estimators = 100), X_train_count, y_train, X_test_count, y_test)
print('RF, Count Vectors    : %.4f\n' % rfcAccuracy1[0])

RF, Count Vectors    : 0.8318

CPU times: total: 5.84 s
Wall time: 5.84 s


In [264]:
%%time
# Bagging (Random Forest) on Word Level TF IDF Vectors
rfcAccuracy2 = train_model(RandomForestClassifier(n_estimators = 100), X_train_tfidf, y_train, X_test_tfidf, y_test)
print('RF, WordLevel TF-IDF : %.4f\n' % rfcAccuracy2[0])

RF, WordLevel TF-IDF : 0.8234

CPU times: total: 3.98 s
Wall time: 3.99 s


In [265]:
%%time
# Bagging (Random Forest) on Ngram Level TF IDF Vectors
rfcAccuracy3 = train_model(RandomForestClassifier(n_estimators = 100), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram, y_test)
print('RF, N-Gram Vectors   : %.4f\n' % rfcAccuracy3[0])

RF, N-Gram Vectors   : 0.3871

CPU times: total: 4.66 s
Wall time: 4.66 s


In [266]:
%%time
# Bagging (Random Forest) on Character Level TF IDF Vectors
rfcAccuracy4 = train_model(RandomForestClassifier(n_estimators = 100), X_train_tfidf_ngram_chars, y_train,
                           X_test_tfidf_ngram_chars, y_test)
print('RF, CharLevel Vectors: %.4f\n' % rfcAccuracy4[0])

RF, CharLevel Vectors: 0.7628

CPU times: total: 11.1 s
Wall time: 11.1 s


In [267]:
results.loc['Random Forest'] = {
    'Count Vectors': rfcAccuracy1[0],
    'WordLevel TF-IDF': rfcAccuracy2[0],
    'N-Gram Vectors': rfcAccuracy3[0],
    'CharLevel Vectors': rfcAccuracy4[0]}

### Boosting Models

In [268]:
%%time
# Gradient Boosting on Count Vectors
gradBstAccuracy1 = train_model(GradientBoostingClassifier(), X_train_count, y_train, X_test_count, y_test)
print('GB, Count Vectors    : %.4f\n' % gradBstAccuracy1[0])

GB, Count Vectors    : 0.8128

CPU times: total: 3.59 s
Wall time: 3.59 s


In [269]:
%%time
# Gradient Boosting on Word Level TF IDF Vectors
gradBstAccuracy2 = train_model(GradientBoostingClassifier(), X_train_tfidf, y_train, X_test_tfidf, y_test)
print('GB, WordLevel TF-IDF : %.4f\n' % gradBstAccuracy2[0])

GB, WordLevel TF-IDF : 0.8100

CPU times: total: 4.62 s
Wall time: 4.63 s


In [270]:
%%time
# Gradient Boosting on Ngram Level TF IDF Vectors
gradBstAccuracy3 = train_model(GradientBoostingClassifier(), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram, y_test)
print('GB, N-Gram Vectors   : %.4f\n' % gradBstAccuracy3[0])

GB, N-Gram Vectors   : 0.3842

CPU times: total: 2.11 s
Wall time: 2.13 s


In [271]:
%%time
# Gradient Boosting on Character Level TF IDF Vectors
gradBstAccuracy4 = train_model(GradientBoostingClassifier(), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars,
                              y_test)
print('GB, CharLevel Vectors: %.4f\n' % gradBstAccuracy4[0])

GB, CharLevel Vectors: 0.7889

CPU times: total: 51.6 s
Wall time: 51.6 s


In [272]:
results.loc['Gradient Boosting'] = {
    'Count Vectors': gradBstAccuracy1[0],
    'WordLevel TF-IDF': gradBstAccuracy2[0],
    'N-Gram Vectors': gradBstAccuracy3[0],
    'CharLevel Vectors': gradBstAccuracy4[0]}

In [273]:
results

Model,Count Vectors,WordLevel TF-IDF,N-Gram Vectors,CharLevel Vectors
Naïve Bayes,0.830401,0.808586,0.395496,0.695285
Logistic Regression,0.843068,0.847994,0.395496,0.779733
Support Vector Machine,0.836031,0.855032,0.388459,0.802956
Random Forest,0.831809,0.823364,0.387051,0.762843
Gradient Boosting,0.812808,0.809993,0.384236,0.788881
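
To read the table at a glance, the sketch below picks the best-scoring feature set for each model and the best combination overall. The scores are copied from the table above into a plain dictionary so the example runs on its own; the variable names are illustrative.

```python
# Accuracy scores copied from the results table above.
scores = {
    "Naïve Bayes": {"Count Vectors": 0.830401, "WordLevel TF-IDF": 0.808586,
                    "N-Gram Vectors": 0.395496, "CharLevel Vectors": 0.695285},
    "Logistic Regression": {"Count Vectors": 0.843068, "WordLevel TF-IDF": 0.847994,
                            "N-Gram Vectors": 0.395496, "CharLevel Vectors": 0.779733},
    "Support Vector Machine": {"Count Vectors": 0.836031, "WordLevel TF-IDF": 0.855032,
                               "N-Gram Vectors": 0.388459, "CharLevel Vectors": 0.802956},
    "Random Forest": {"Count Vectors": 0.831809, "WordLevel TF-IDF": 0.823364,
                      "N-Gram Vectors": 0.387051, "CharLevel Vectors": 0.762843},
    "Gradient Boosting": {"Count Vectors": 0.812808, "WordLevel TF-IDF": 0.809993,
                          "N-Gram Vectors": 0.384236, "CharLevel Vectors": 0.788881},
}

# Best feature set per model.
best_feature = {model: max(row, key=row.get) for model, row in scores.items()}
for model, feature in best_feature.items():
    print(f"{model:24s} {feature:18s} {scores[model][feature]:.4f}")

# Best (model, feature set, accuracy) combination overall.
overall = max(
    ((model, feature, value)
     for model, row in scores.items()
     for feature, value in row.items()),
    key=lambda item: item[2],
)
print(overall)  # ('Support Vector Machine', 'WordLevel TF-IDF', 0.855032)
```

On this single train/test split, the linear SVM on word-level TF-IDF scores highest, and every model scores lowest on the word n-gram features.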


In [274]:
filepath = Path('modelPhase01ResultSummary.csv')  
filepath.parent.mkdir(parents=True, exist_ok=True)  
results.to_csv(filepath)