<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

# Lab 8.5: Text Classification

In this lab you will implement different types of feature engineering for text classification:
* Count vectors
* TF-IDF vectors (word level, n-gram level, character level)
* Text/NLP based features
* Topic models
  
The following classification algorithms will be applied to the count and TF-IDF vector features:
* Naïve Bayes
* Logistic Regression
* Support Vector Machine
* Random Forest
* Gradient Boosting

## Import libraries

In [1]:
## Import Libraries
import numpy as np
import pandas as pd

import string
import spacy

from collections import Counter

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC

# import warnings
# warnings.filterwarnings('ignore')

## Load data

Sample:

    __label__2 Stuning even for the non-gamer: This sound ...
    __label__2 The best soundtrack ever to anything.: I'm ...
    __label__2 Amazing!: This soundtrack is my favorite m ...
    __label__2 Excellent Soundtrack: I truly like this so ...
    __label__2 Remember, Pull Your Jaw Off The Floor Afte ...
    __label__2 an absolute masterpiece: I am quite sure a ...
    __label__1 Buyer beware: This is a self-published boo ...
    . . .
    
There are only two **labels**:
- `__label__1`
- `__label__2`

In [2]:
## Loading the data

df_corpus = pd.read_fwf(
    filepath_or_buffer = '../DATA/corpus.txt',
    colspecs = [(9, 10),   # label: get only the numbers 1 or 2
                (11, 9000) # text: make it wide enough to reach the end of the line
               ],
    header = 0,  # NB: the first line of the file is consumed as a header row and replaced by `names`
    names = ['label', 'text'],
    lineterminator = '\n'
)

# convert label from [1, 2] to [0, 1]
df_corpus['label'] = df_corpus['label'] - 1
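
The fixed-width offsets in `colspecs` work only because every label is exactly one digit long. As a hedged alternative, each line could instead be split on its first space. The helper name `parse_line` and the two in-memory sample lines are illustrative assumptions, not part of the lab.

```python
# Sketch: parse "__label__N text" lines without fixed-width offsets.
# `parse_line` is a hypothetical helper, not part of the lab.

PREFIX = "__label__"

def parse_line(line):
    """Return (label, text) with the label shifted from {1, 2} to {0, 1}."""
    tag, _, text = line.rstrip("\n").partition(" ")
    if not tag.startswith(PREFIX):
        raise ValueError(f"unexpected label prefix: {tag!r}")
    return int(tag[len(PREFIX):]) - 1, text

# Two made-up lines in the same shape as the corpus sample.
sample_lines = [
    "__label__2 Amazing!: This soundtrack is my favorite\n",
    "__label__1 Buyer beware: This is a self-published book\n",
]
rows = [parse_line(line) for line in sample_lines]
print(rows)
```

A list of such tuples could be passed to `pd.DataFrame(rows, columns=['label', 'text'])` to obtain the same two columns.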

## Inspect the data

In [4]:
# ANSWER
df_corpus.head()

|   | label | text |
|---|---|---|
| 0 | 1 | The best soundtrack ever to anything.: I'm rea... |
| 1 | 1 | Amazing!: This soundtrack is my favorite music... |
| 2 | 1 | Excellent Soundtrack: I truly like this soundt... |
| 3 | 1 | Remember, Pull Your Jaw Off The Floor After He... |
| 4 | 1 | an absolute masterpiece: I am quite sure any o... |


In [5]:
df_corpus.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9999 entries, 0 to 9998
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   label   9999 non-null   int64 
 1   text    9999 non-null   object
dtypes: int64(1), object(1)
memory usage: 156.4+ KB


## Split the data into train and test

In [7]:
## ANSWER
## split the dataset

X_train, X_test, y_train, y_test = train_test_split(df_corpus.text, df_corpus.label)

## Feature Engineering

### Count Vectors as features

In [61]:
# create a count vectorizer object
count_vect = CountVectorizer(token_pattern = r'\w{1,}')

# Learn a vocabulary dictionary of all tokens in the raw documents
count_vect.fit(X_train)

# Transform documents to document-term matrix.
X_train_count = count_vect.transform(X_train)
X_test_count = count_vect.transform(X_test)

X_train_count[6, :]

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 85 stored elements and shape (1, 27300)>
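
To make the document-term matrix concrete, here is a minimal pure-Python sketch of what a count vectorizer computes. It assumes lowercasing and an alphabetically sorted vocabulary, which matches `CountVectorizer`'s default behaviour as far as I know. The two toy documents are made up for illustration.

```python
import re
from collections import Counter

# Two made-up documents, not taken from the corpus.
docs = ["good sound", "bad sound, bad price"]

def tokenize(text):
    # Lowercase, then keep runs of word characters, mirroring r'\w{1,}'.
    return re.findall(r"\w{1,}", text.lower())

# Vocabulary: every distinct token, sorted; one matrix column per token.
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})

# Document-term matrix: one row per document, one count per vocabulary entry.
matrix = []
for doc in docs:
    counts = Counter(tokenize(doc))
    matrix.append([counts[tok] for tok in vocabulary])

print(vocabulary)  # ['bad', 'good', 'price', 'sound']
print(matrix)      # [[0, 1, 0, 1], [2, 0, 1, 1]]
```

Most entries in a real matrix are zero, which is why the output above is stored sparsely: row 6 holds only 85 non-zero counts out of 27,300 columns.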

### TF-IDF Vectors as features
- Word level
- N-Gram level
- Character level

In [9]:
%%time
# word level tf-idf
tfidf_vect = TfidfVectorizer(analyzer = 'word',
                             token_pattern = r'\w{1,}',
                             max_features = 5000)
print(tfidf_vect)

tfidf_vect.fit(X_train)
X_train_tfidf = tfidf_vect.transform(X_train)
X_test_tfidf  = tfidf_vect.transform(X_test)

TfidfVectorizer(max_features=5000, token_pattern='\\w{1,}')
CPU times: total: 156 ms
Wall time: 746 ms
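
The weighting itself can be sketched by hand. The example below assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalisation of each row, which I believe corresponds to `TfidfVectorizer`'s defaults. The toy documents and the names `doc_freq` and `tfidf_row` are illustrative.

```python
import math
import re
from collections import Counter

# Three made-up documents.
docs = ["good sound", "bad sound", "good good price"]
tokens = [re.findall(r"\w{1,}", d.lower()) for d in docs]
vocabulary = sorted({t for doc in tokens for t in doc})
n_docs = len(docs)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(t for doc in tokens for t in set(doc))

# Smoothed inverse document frequency: rarer terms get larger weights.
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

def tfidf_row(doc_tokens):
    """Raw term counts times idf, scaled to unit Euclidean length."""
    counts = Counter(doc_tokens)
    raw = [counts[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

weights = [tfidf_row(doc) for doc in tokens]
for row in weights:
    print([round(v, 3) for v in row])
```

In this toy corpus "bad" and "price" each occur in one document, so they receive a higher idf than "good" and "sound", which occur in two.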


In [10]:
%%time
# ngram level tf-idf
tfidf_vect_ngram = TfidfVectorizer(analyzer = 'word',
                                   token_pattern = r'\w{1,}',
                                   ngram_range = (2, 3),
                                   max_features = 5000)
print(tfidf_vect_ngram)

tfidf_vect_ngram.fit(X_train)
X_train_tfidf_ngram = tfidf_vect_ngram.transform(X_train)
X_test_tfidf_ngram  = tfidf_vect_ngram.transform(X_test)

TfidfVectorizer(max_features=5000, ngram_range=(2, 3), token_pattern='\\w{1,}')
CPU times: total: 656 ms
Wall time: 3.42 s
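
With `ngram_range = (2, 3)` the features are bigrams and trigrams only, so single words are excluded. A rough sketch of how such word n-grams can be generated is shown below; the function name `word_ngrams` is a hypothetical helper, not a scikit-learn API.

```python
import re

def word_ngrams(text, ngram_range=(2, 3)):
    """Return all word n-grams with n in the inclusive range."""
    tokens = re.findall(r"\w{1,}", text.lower())
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        # Slide a window of n tokens across the text.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

print(word_ngrams("This is not good"))
# ['this is', 'is not', 'not good', 'this is not', 'is not good']
```

Phrases such as "not good" become single features here, which a unigram model would see only as two unrelated words.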


In [12]:
%%time
# character level tf-idf
tfidf_vect_ngram_chars = TfidfVectorizer(analyzer = 'char',
                                         ngram_range = (2, 3),
                                         max_features = 5000)
print(tfidf_vect_ngram_chars)

tfidf_vect_ngram_chars.fit(X_train)
X_train_tfidf_ngram_chars = tfidf_vect_ngram_chars.transform(X_train)
X_test_tfidf_ngram_chars  = tfidf_vect_ngram_chars.transform(X_test)

TfidfVectorizer(analyzer='char', max_features=5000, ngram_range=(2, 3))
CPU times: total: 406 ms
Wall time: 4.84 s
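
One motivation often given for character n-grams is that misspelled words still share most of their features with the correct spelling. The sketch below illustrates this with the typo "Stuning" from the data sample. The helpers `char_ngrams` and `jaccard` are illustrative assumptions and simplify what `analyzer = 'char'` does.

```python
def char_ngrams(text, ngram_range=(2, 3)):
    """Return the set of character n-grams with n in the inclusive range."""
    text = text.lower()
    lo, hi = ngram_range
    return {
        text[i:i + n]
        for n in range(lo, hi + 1)
        for i in range(len(text) - n + 1)
    }

def jaccard(a, b):
    """Share of n-grams the two sets have in common."""
    return len(a & b) / len(a | b)

typo = char_ngrams("Stuning")
correct = char_ngrams("Stunning")

print(sorted(typo & correct))           # n-grams shared by both spellings
print(round(jaccard(typo, correct), 2)) # overlap between 0 and 1
```

At the word level the two spellings would be entirely different tokens with no overlap at all.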


### Text / NLP based features

Create some other features.

* `char_count` = number of characters in the text
* `word_count` = number of words in the text
* `word_density` = average number of characters per word
* `punctuation_count` = number of punctuation characters in the text
* `title_word_count` = number of title-case words in the text (stored below as `title_case_word_count`)
* `uppercase_word_count` = number of upper-case words in the text
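
As a worked example of these definitions, the sketch below computes every feature for one made-up sentence. The function name `text_features` is hypothetical, and it assumes (as one possible reading) that spaces are excluded from the character count.

```python
import string

def text_features(text):
    """Compute the hand-crafted features for a single string."""
    words = text.split()
    char_count = len(text.replace(" ", ""))  # assumption: spaces not counted
    return {
        "char_count": char_count,
        "word_count": len(words),
        "word_density": char_count / len(words),
        "punctuation_count": sum(ch in string.punctuation for ch in text),
        "title_case_word_count": sum(w.istitle() for w in words),
        "uppercase_word_count": sum(w.isupper() for w in words),
    }

features = text_features("Buyer beware: This is a GREAT Book!")
print(features)
```

Here "Buyer", "This" and "Book!" count as title case, while "GREAT" counts only as upper case. Note that a one-letter word such as "I" satisfies both `istitle()` and `isupper()`, so it would be counted in both columns.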


In [42]:
for index, row in df_corpus.sample(8).iterrows():
    print()
    print(row.text)


Power connector is not durable: Adapter worked flawlessly for six months, then became intermittent. Seems to be broken or shorted somewhere near the connector to the computer. Even the ridiculously flimsy cord from Apple lasted longer than six months, so I can't say I'm pleased. Macally says they'll sell the connector cable separately for $17.00 which is not much of a bargain.

This compilation befuddles me.: This compilation befuddles me.The sound is a bit muffled on the tracks listed as 2000 remastered when compared with the John Denver's Greatest Hits.Not all tracks are 2000 remastered. Calypso and I'm Sorry are listed as 1975. They sound and look like the same files as Denvers's Greatest Hits when brought into audacity editing software.It also seems the 2000 remasters are not the original arrangements.I have both this "Best of Rocky Mountain" and Best of Volume 1 & 2.The Rocky Mountain version does dull or soften Denver's voice. Some may like that.But if dulls the instruments.I pr

In [46]:
# %%time

import string

# NB: The instructions do not specify whether whitespace counts toward the character count; assumed not.
df_corpus["char_count"] = df_corpus['text'].apply(lambda text: len(text.replace(' ', '')))
df_corpus['word_count'] = df_corpus['text'].apply(lambda text: len(text.split()))

# word_density
def word_density(text):
    words = text.split()
    word_lengths = pd.Series([len(word) for word in words])
    return word_lengths.mean()

# df_corpus['word_density_0'] = df_corpus['text'].apply(word_density)
df_corpus['word_density'] = df_corpus['char_count']/df_corpus['word_count']

df_corpus['punctuation_count'] = df_corpus['text'].apply(lambda text: len([char for char in text if char in string.punctuation]))

# df_corpus['title_word_count'] = df_corpus['text'].apply(lambda text: len(text.split(':')[0].split()))
df_corpus['title_case_word_count'] = df_corpus['text'].apply(lambda text: len([word for word in text.split() if word.istitle()]))
df_corpus['uppercase_word_count'] = df_corpus['text'].apply(lambda text: len([word for word in text.split() if word.isupper()]))
df_corpus.head()

|   | label | text | char_count | word_count | word_density | punctuation_count | title_case_word_count | uppercase_word_count |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | The best soundtrack ever to anything.: I'm rea... | 413 | 97 | 4.257732 | 14 | 7 | 3 |
| 1 | 1 | Amazing!: This soundtrack is my favorite music... | 632 | 129 | 4.899225 | 40 | 24 | 4 |
| 2 | 1 | Excellent Soundtrack: I truly like this soundt... | 626 | 118 | 5.305085 | 33 | 52 | 4 |
| 3 | 1 | Remember, Pull Your Jaw Off The Floor After He... | 395 | 87 | 4.54023 | 22 | 30 | 0 |
| 4 | 1 | an absolute masterpiece: I am quite sure any o... | 684 | 142 | 4.816901 | 35 | 14 | 3 |


In [47]:

df_corpus.columns

Index(['label', 'text', 'char_count', 'word_count', 'word_density',
       'punctuation_count', 'title_case_word_count', 'uppercase_word_count'],
      dtype='object')

In [48]:
## load spaCy
nlp = spacy.load('en_core_web_sm')

Part-of-speech tags in **spaCy**

    POS   DESCRIPTION               EXAMPLES
    ----- ------------------------- ---------------------------------------------
    ADJ   adjective                 big, old, green, incomprehensible, first
    ADP   adposition                in, to, during
    ADV   adverb                    very, tomorrow, down, where, there
    AUX   auxiliary                 is, has (done), will (do), should (do)
    CONJ  conjunction               and, or, but
    CCONJ coordinating conjunction  and, or, but
    DET   determiner                a, an, the
    INTJ  interjection              psst, ouch, bravo, hello
    NOUN  noun                      girl, cat, tree, air, beauty
    NUM   numeral                   1, 2017, one, seventy-seven, IV, MMXIV
    PART  particle                  's, not,
    PRON  pronoun                   I, you, he, she, myself, themselves, somebody
    PROPN proper noun               Mary, John, London, NATO, HBO
    PUNCT punctuation               ., (, ), ?
    SCONJ subordinating conjunction if, while, that
    SYM   symbol                    $, %, §, ©, +, −, ×, ÷, =, :), 😝
    VERB  verb                      run, runs, running, eat, ate, eating
    X     other                     sfpksdpsxmsa
    SPACE space
    
Find out the number of adjectives, adverbs, nouns, numerals, pronouns, proper nouns and verbs.

Hint:
1. Convert the text to a spaCy document
2. Use `pos_`
3. Use `Counter`

*Learning about Counter*

In [55]:
sample = df_corpus.iloc[123].text
print(sample)

sample_doc = nlp(sample)
c = Counter([token.pos_ for token in sample_doc])
print()
print(c)


Oh!: This cereal is so sweet....yet so good for you! One taste=ADDICTION!!!! I just tried this cereal out of curiousity and I was hooked! It is an excellent breakfast choice, or just any time to eat! Especially as breakfast because you will crave more honey taste and you won't be hungry later {probably because of how much you'd eat} and it is actually sweeter and better than Honey Bunches of Oats. Cap'n Crunch Berries is an option, but this splendid cereal is SO FILLED WITH A HONEY-FILLED TASTE! I have not tried much better tasting cereal than this! Oh!

Counter({'PUNCT': 19, 'NOUN': 14, 'AUX': 11, 'ADV': 11, 'ADJ': 11, 'PRON': 10, 'ADP': 9, 'PROPN': 9, 'VERB': 8, 'DET': 7, 'CCONJ': 6, 'PART': 3, 'SCONJ': 3, 'INTJ': 2, 'NUM': 1})
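
Two properties of `Counter` matter for the next step: it tallies any iterable of hashable items, and looking up a key that never occurred returns 0 instead of raising `KeyError`. A minimal sketch with a hand-written tag list (no spaCy needed) is shown below; the name `tag_counts` is illustrative.

```python
from collections import Counter

# Hand-written part-of-speech tags standing in for [token.pos_ for token in doc].
tags = ["PRON", "AUX", "ADV", "ADJ", "PUNCT", "ADV", "ADJ", "PUNCT"]
tag_counts = Counter(tags)

print(tag_counts["ADJ"])    # 2
print(tag_counts["NUM"])    # 0: missing keys count as zero
print("NUM" in tag_counts)  # False: the lookup did not add the key
```

This is why a document without any numerals can still be read with `c['NUM']` safely.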


In [56]:
# Initialise columns for the feature counts
df_corpus['adj_count'] = 0
df_corpus['adv_count'] = 0
df_corpus['noun_count'] = 0
df_corpus['num_count'] = 0
df_corpus['pron_count'] = 0
df_corpus['propn_count'] = 0
df_corpus['verb_count'] = 0

In [57]:
# ANSWER
for i in range(df_corpus.shape[0]):
    # convert into a spaCy document
    doc = nlp(df_corpus.iloc[i]['text'])
    # count the part-of-speech tags in the document
    c = Counter([t.pos_ for t in doc])

    df_corpus.at[i, 'adj_count'] = c['ADJ']
    df_corpus.at[i, 'adv_count'] = c['ADV']
    df_corpus.at[i, 'noun_count'] = c['NOUN']
    df_corpus.at[i, 'num_count'] = c['NUM']
    df_corpus.at[i, 'pron_count'] = c['PRON']
    df_corpus.at[i, 'propn_count'] = c['PROPN']
    df_corpus.at[i, 'verb_count'] = c['VERB']
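
Calling `nlp` once per row is simple but can be slow on roughly ten thousand reviews. spaCy also offers `nlp.pipe`, which processes texts in batches and is typically faster. The sketch below is one possible arrangement; `POS_COLUMNS` and `pos_count_rows` are hypothetical names, and the function only assumes that `nlp.pipe(texts)` yields documents whose tokens expose `pos_`.

```python
from collections import Counter

# Mapping from spaCy coarse POS tags to the feature column names used above.
POS_COLUMNS = {
    "ADJ": "adj_count",
    "ADV": "adv_count",
    "NOUN": "noun_count",
    "NUM": "num_count",
    "PRON": "pron_count",
    "PROPN": "propn_count",
    "VERB": "verb_count",
}

def pos_count_rows(texts, nlp):
    """Return one dict of POS counts per text, processed in batches."""
    rows = []
    for doc in nlp.pipe(texts):
        counts = Counter(token.pos_ for token in doc)
        # Counter returns 0 for tags that do not occur in this document.
        rows.append({col: counts[tag] for tag, col in POS_COLUMNS.items()})
    return rows
```

Under these assumptions, `pd.DataFrame(pos_count_rows(df_corpus['text'], nlp), index=df_corpus.index)` would give the seven count columns in one step.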

In [None]:
# Variation using iterrows
for idx, row in df_corpus.iterrows():
    doc = nlp(row.text)
    pos_counts = Counter([token.pos_ for token in doc])

    # iterrows yields copies of each row, so write back through the index
    df_corpus.at[idx, 'adj_count'] = pos_counts['ADJ']
    # ...etc.

In [59]:
cols = [
    'char_count', 'word_count', 'word_density',
    'punctuation_count', 'title_case_word_count',
    'uppercase_word_count', 'adj_count',
    'adv_count', 'noun_count', 'num_count',
    'pron_count', 'propn_count', 'verb_count']

df_corpus[cols].sample(5)

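|   | char_count | word_count | word_density | punctuation_count | title_case_word_count | uppercase_word_count | adj_count | adv_count | noun_count | num_count | pron_count | propn_count | verb_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 956 | 234 | 58 | 4.034483 | 10 | 2 | 0 | 8 | 5 | 9 | 1 | 7 | 1 | 7 |
| 5912 | 559 | 118 | 4.737288 | 39 | 14 | 4 | 11 | 5 | 22 | 0 | 13 | 3 | 15 |
| 5085 | 669 | 138 | 4.847826 | 33 | 16 | 0 | 10 | 7 | 27 | 3 | 9 | 11 | 13 |
| 8957 | 197 | 49 | 4.020408 | 15 | 4 | 2 | 6 | 3 | 10 | 1 | 6 | 1 | 3 |
| 9545 | 104 | 25 | 4.16 | 5 | 3 | 2 | 3 | 1 | 3 | 0 | 4 | 0 | 5 |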


### Topic Models as features

In [62]:
%%time
# train an LDA model
lda_model = LatentDirichletAllocation(n_components = 20, learning_method = 'online', max_iter = 20)

X_topics = lda_model.fit_transform(X_train_count)
topic_word = lda_model.components_
vocab = count_vect.get_feature_names_out()

CPU times: total: 6.28 s
Wall time: 38.4 s


In [63]:
# view the topic models
n_top_words = 10
topic_summaries = []
print('Group Top Words')
print('-----', '-'*80)
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    top_words = ' '.join(topic_words)
    topic_summaries.append(top_words)
    print('  %3d %s' % (i, top_words))

Group Top Words
----- --------------------------------------------------------------------------------
    0 eyes products refund bottle ugly cat ride shut mothman medicine
    1 british japan cave voices queen religious intriguing blue elvis mindless
    2 band released metal tales bands mountain mary kate hardcore gammell
    3 tess lets lucky ati hardy headphones basics headphone circumstances owning
    4 study difference larry bra squeem planet cup taylor timely failure
    5 and the my for a it to in product i
    6 art volume shoes dance named tale hockey video his visual
    7 i the it to this a and was t not
    8 jay privacy cambridge desires freud lily delta davis conductor 1992
    9 threads marie lhasa trained strauss murdered guards ellington vimes sang
   10 l le wedding viewers sizing burmese bench robots crystal workbook
   11 with card works christmas support camera software player the work
   12 her she woman life who his admit herself self rice
   13 jewish pen ich 
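
The slice `[:-(n_top_words+1):-1]` in the cell above is compact but easy to misread: after an ascending sort it walks backwards from the end and keeps the last `n_top_words` entries, largest first. The sketch below reproduces that logic in plain Python on made-up weights; all `toy_` names are illustrative.

```python
# Made-up vocabulary and topic weights standing in for vocab and topic_dist.
toy_vocab = ["bad", "good", "price", "sound"]
toy_topic_dist = [0.1, 2.5, 0.4, 1.2]
toy_n_top = 2

# Indices sorted by ascending weight, the same ordering np.argsort returns.
order = sorted(range(len(toy_topic_dist)), key=toy_topic_dist.__getitem__)

# Walk backwards from the end, stopping before the (n+1)-th element from the end.
top_idx = order[:-(toy_n_top + 1):-1]
top_words = [toy_vocab[i] for i in top_idx]

print(order)      # [0, 2, 3, 1]
print(top_words)  # ['good', 'sound']
```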

## Modelling

Run the following cells to train a number of models on the count vector and TF-IDF vector feature sets generated above.

In [64]:
## helper function

def train_model(classifier, feature_vector_train, label, feature_vector_valid):
    # fit the classifier on the training set
    classifier.fit(feature_vector_train, label)

    # predict the labels of the validation set
    predictions = classifier.predict(feature_vector_valid)

    # NB: relies on the global y_test; accuracy_score expects (y_true, y_pred)
    return accuracy_score(y_test, predictions)
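
Accuracy is simply the fraction of predictions that match the true labels, so it gives the same value whichever argument comes first. A minimal pure-Python sketch of the metric follows; the function `accuracy` and the toy labels are illustrative and are not scikit-learn's implementation.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

toy_true = [1, 0, 1, 1]
toy_pred = [1, 1, 1, 0]

print(accuracy(toy_true, toy_pred))  # 0.5
print(accuracy(toy_pred, toy_true))  # 0.5, the metric is symmetric
```

Other metrics such as precision and recall are not symmetric, so passing the true labels first is the safer habit.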

In [65]:
# Keep the results in a dataframe
results = pd.DataFrame(columns = ['Count Vectors',
                                  'WordLevel TF-IDF',
                                  'N-Gram Vectors',
                                  'CharLevel Vectors'])

### Naive Bayes Classifier

In [66]:
%%time
# Naive Bayes on Count Vectors
accuracy1 = train_model(MultinomialNB(), X_train_count, y_train, X_test_count)
print(f"NB, Count Vectors    : {accuracy1:.4f}\n")

NB, Count Vectors    : 0.8376

CPU times: total: 0 ns
Wall time: 13.8 ms


In [67]:
%%time
# Naive Bayes on Word Level TF IDF Vectors
accuracy2 = train_model(MultinomialNB(), X_train_tfidf, y_train, X_test_tfidf)
print(f'NB, WordLevel TF-IDF : {accuracy2:.4f}\n')

NB, WordLevel TF-IDF : 0.8408

CPU times: total: 15.6 ms
Wall time: 13.3 ms


In [69]:
%%time
# Naive Bayes on Ngram Level TF IDF Vectors
accuracy3 = train_model(MultinomialNB(), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print(f'NB, N-Gram Vectors   : {accuracy3:.4f}\n')

NB, N-Gram Vectors   : 0.8340

CPU times: total: 0 ns
Wall time: 5 ms


In [71]:
%%time
# Naive Bayes on Character Level TF IDF Vectors
accuracy4 = train_model(MultinomialNB(), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print(f'NB, CharLevel Vectors: {accuracy4:.4f}\n')

NB, CharLevel Vectors: 0.8076

CPU times: total: 0 ns
Wall time: 68.1 ms


In [72]:
results.loc['Naïve Bayes'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

In [73]:
results

|   | Count Vectors | WordLevel TF-IDF | N-Gram Vectors | CharLevel Vectors |
|---|---|---|---|---|
| Naïve Bayes | 0.8376 | 0.8408 | 0.834 | 0.8076 |


### Linear Classifier

In [74]:
%%time
# Linear Classifier on Count Vectors
accuracy1 = train_model(LogisticRegression(solver = 'lbfgs', max_iter = 350), X_train_count, y_train, X_test_count)
print('LR, Count Vectors    : %.4f\n' % accuracy1)

LR, Count Vectors    : 0.8572

CPU times: total: 1min 32s
Wall time: 15.1 s


In [75]:
%%time
# Linear Classifier on Word Level TF IDF Vectors
accuracy2 = train_model(LogisticRegression(solver = 'lbfgs', max_iter = 100), X_train_tfidf, y_train, X_test_tfidf)
print('LR, WordLevel TF-IDF : %.4f\n' % accuracy2)

LR, WordLevel TF-IDF : 0.8592

CPU times: total: 0 ns
Wall time: 19 ms


In [76]:
%%time
# Linear Classifier on Ngram Level TF IDF Vectors
accuracy3 = train_model(LogisticRegression(solver = 'lbfgs', max_iter = 100), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('LR, N-Gram Vectors   : %.4f\n' % accuracy3)

LR, N-Gram Vectors   : 0.8332

CPU times: total: 31.2 ms
Wall time: 15.9 ms


In [77]:
%%time
# Linear Classifier on Character Level TF IDF Vectors
accuracy4 = train_model(LogisticRegression(solver = 'lbfgs', max_iter = 100), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('LR, CharLevel Vectors: %.4f\n' % accuracy4)

LR, CharLevel Vectors: 0.8372

CPU times: total: 15.6 ms
Wall time: 104 ms


In [78]:
results.loc['Logistic Regression'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

### Support Vector Machine

In [79]:
%%time
# Support Vector Machine on Count Vectors
accuracy1 = train_model(LinearSVC(), X_train_count, y_train, X_test_count)
print('SVM, Count Vectors    : %.4f\n' % accuracy1)

SVM, Count Vectors    : 0.8388

CPU times: total: 93.8 ms
Wall time: 261 ms


In [80]:
%%time
# Support Vector Machine on Word Level TF IDF Vectors
accuracy2 = train_model(LinearSVC(), X_train_tfidf, y_train, X_test_tfidf)
print('SVM, WordLevel TF-IDF : %.4f\n' % accuracy2)

SVM, WordLevel TF-IDF : 0.8584

CPU times: total: 15.6 ms
Wall time: 60.5 ms


In [81]:
%%time
# Support Vector Machine on Ngram Level TF IDF Vectors
accuracy3 = train_model(LinearSVC(), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('SVM, N-Gram Vectors   : %.4f\n' % accuracy3)

SVM, N-Gram Vectors   : 0.8176

CPU times: total: 0 ns
Wall time: 32.1 ms


In [82]:
%%time
# Support Vector Machine on Character Level TF IDF Vectors
accuracy4 = train_model(LinearSVC(), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('SVM, CharLevel Vectors: %.4f\n' % accuracy4)

SVM, CharLevel Vectors: 0.8460

CPU times: total: 391 ms
Wall time: 733 ms


In [84]:
results.loc['Support Vector Machine'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

### Bagging Models

In [85]:
%%time
# Bagging (Random Forest) on Count Vectors
accuracy1 = train_model(RandomForestClassifier(n_estimators = 100), X_train_count, y_train, X_test_count)
print('RF, Count Vectors    : %.4f\n' % accuracy1)

RF, Count Vectors    : 0.8124

CPU times: total: 172 ms
Wall time: 7.77 s


In [86]:
%%time
# Bagging (Random Forest) on Word Level TF IDF Vectors
accuracy2 = train_model(RandomForestClassifier(n_estimators = 100), X_train_tfidf, y_train, X_test_tfidf)
print('RF, WordLevel TF-IDF : %.4f\n' % accuracy2)

RF, WordLevel TF-IDF : 0.8200

CPU times: total: 766 ms
Wall time: 5.13 s


In [87]:
%%time
# Bagging (Random Forest) on Ngram Level TF IDF Vectors
accuracy3 = train_model(RandomForestClassifier(n_estimators = 100), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('RF, N-Gram Vectors   : %.4f\n' % accuracy3)

RF, N-Gram Vectors   : 0.7792

CPU times: total: 719 ms
Wall time: 5.28 s


In [88]:
%%time
# Bagging (Random Forest) on Character Level TF IDF Vectors
accuracy4 = train_model(RandomForestClassifier(n_estimators = 100), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('RF, CharLevel Vectors: %.4f\n' % accuracy4)

RF, CharLevel Vectors: 0.7716

CPU times: total: 1.58 s
Wall time: 17.6 s


In [89]:
results.loc['Random Forest'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

### Boosting Models

In [90]:
%%time
# Gradient Boosting on Count Vectors
accuracy1 = train_model(GradientBoostingClassifier(), X_train_count, y_train, X_test_count)
print('GB, Count Vectors    : %.4f\n' % accuracy1)

GB, Count Vectors    : 0.7888

CPU times: total: 250 ms
Wall time: 5.23 s


In [91]:
%%time
# Gradient Boosting on Word Level TF IDF Vectors
accuracy2 = train_model(GradientBoostingClassifier(), X_train_tfidf, y_train, X_test_tfidf)
print('GB, WordLevel TF-IDF : %.4f\n' % accuracy2)

GB, WordLevel TF-IDF : 0.7836

CPU times: total: 828 ms
Wall time: 11.4 s


In [92]:
%%time
# Gradient Boosting on Ngram Level TF IDF Vectors
accuracy3 = train_model(GradientBoostingClassifier(), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('GB, N-Gram Vectors   : %.4f\n' % accuracy3)

GB, N-Gram Vectors   : 0.7380

CPU times: total: 1.5 s
Wall time: 6.91 s


In [93]:
%%time
# Gradient Boosting on Character Level TF IDF Vectors
accuracy4 = train_model(GradientBoostingClassifier(), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('GB, CharLevel Vectors: %.4f\n' % accuracy4)

GB, CharLevel Vectors: 0.7972

CPU times: total: 15.2 s
Wall time: 2min 9s


In [94]:
results.loc['Gradient Boosting'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

In [95]:
results

|   | Count Vectors | WordLevel TF-IDF | N-Gram Vectors | CharLevel Vectors |
|---|---|---|---|---|
| Naïve Bayes | 0.8376 | 0.8408 | 0.834 | 0.8076 |
| Logistic Regression | 0.8572 | 0.8592 | 0.8332 | 0.8372 |
| Support Vector Machine | 0.8388 | 0.8584 | 0.8176 | 0.846 |
| Random Forest | 0.8124 | 0.82 | 0.7792 | 0.7716 |
| Gradient Boosting | 0.7888 | 0.7836 | 0.738 | 0.7972 |


Which combination of features and model performed the best?

In [96]:
results.max()

Count Vectors        0.8572
WordLevel TF-IDF     0.8592
N-Gram Vectors       0.8340
CharLevel Vectors    0.8460
dtype: float64

The best accuracy (0.8592) came from Logistic Regression on the word-level TF-IDF features, narrowly ahead of the Support Vector Machine on the same features (0.8584).
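
Note that `results.max()` reports the best score per feature set but not which model achieved it. The sketch below finds the winning (model, feature set) pair directly. It copies the accuracies from the results table into a plain dictionary so that it runs on its own; `results_table` is an illustrative name.

```python
# Accuracies copied from the results table above.
results_table = {
    "Naïve Bayes": {"Count Vectors": 0.8376, "WordLevel TF-IDF": 0.8408,
                    "N-Gram Vectors": 0.8340, "CharLevel Vectors": 0.8076},
    "Logistic Regression": {"Count Vectors": 0.8572, "WordLevel TF-IDF": 0.8592,
                            "N-Gram Vectors": 0.8332, "CharLevel Vectors": 0.8372},
    "Support Vector Machine": {"Count Vectors": 0.8388, "WordLevel TF-IDF": 0.8584,
                               "N-Gram Vectors": 0.8176, "CharLevel Vectors": 0.8460},
    "Random Forest": {"Count Vectors": 0.8124, "WordLevel TF-IDF": 0.8200,
                      "N-Gram Vectors": 0.7792, "CharLevel Vectors": 0.7716},
    "Gradient Boosting": {"Count Vectors": 0.7888, "WordLevel TF-IDF": 0.7836,
                          "N-Gram Vectors": 0.7380, "CharLevel Vectors": 0.7972},
}

# Flatten to (model, features, accuracy) triples and take the highest accuracy.
best_model, best_features, best_accuracy = max(
    ((model, features, acc)
     for model, row in results_table.items()
     for features, acc in row.items()),
    key=lambda triple: triple[2],
)

print(best_model, "|", best_features, "|", best_accuracy)
# Logistic Regression | WordLevel TF-IDF | 0.8592
```

On the `results` DataFrame itself, `results.stack().idxmax()` should return the same pair. Because the train/test split is random, the ranking of closely matched models may change between runs.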



---

© 2024 Institute of Data
