<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

# Lab 8.5: Text Classification

In this lab you will implement different types of feature engineering for text classification:
* Count vectors
* TF-IDF vectors (word level, n-gram level, character level)
* Text/NLP based features
* Topic models
  
The following classification algorithms will be applied to the count and TF-IDF vector features:
* Naïve Bayes
* Logistic Regression
* Support Vector Machine
* Random Forest
* Gradient Boosting

## Import libraries

In [4]:
## Import Libraries
import numpy as np
import pandas as pd

import string
import spacy

from collections import Counter

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# import warnings
# warnings.filterwarnings('ignore')

## Load data

Sample:

    __label__2 Stuning even for the non-gamer: This sound ...
    __label__2 The best soundtrack ever to anything.: I'm ...
    __label__2 Amazing!: This soundtrack is my favorite m ...
    __label__2 Excellent Soundtrack: I truly like this so ...
    __label__2 Remember, Pull Your Jaw Off The Floor Afte ...
    __label__2 an absolute masterpiece: I am quite sure a ...
    __label__1 Buyer beware: This is a self-published boo ...
    . . .
    
There are only two **labels**:
- `__label__1`
- `__label__2`
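
As a rough illustration of the line format, the sketch below splits one `__label__N text` line into a numeric label and the review text. The function name `parse_labelled_line` and the two shortened sample lines are hypothetical and only serve the example; the lab itself loads the file with `pd.read_fwf`.

```python
def parse_labelled_line(line):
    """Split one '__label__N text' line into (label, text)."""
    prefix, _, text = line.partition(' ')
    if not prefix.startswith('__label__'):
        raise ValueError('unexpected line format: %r' % line)
    # Map '__label__1' -> 0 and '__label__2' -> 1
    label = int(prefix[len('__label__'):]) - 1
    return label, text.strip()


# Hypothetical, shortened lines in the same format as the sample above
sample_lines = [
    '__label__2 Amazing!: This soundtrack is my favorite',
    '__label__1 Buyer beware: This is a self-published book',
]
parsed = [parse_labelled_line(line) for line in sample_lines]
print(parsed)
```

Parsing on the first space, rather than on fixed character positions, would keep working if the label ever had more than one digit.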

In [7]:
## Loading the data

df_corpus = pd.read_fwf(
    filepath_or_buffer = 'corpus.txt',
    colspecs = [(9, 10),   # label: keep only the digit, 1 or 2
                (11, 9000) # text: make the range big enough to reach the end of the line
               ],
    header = 0,            # the first line of the file is consumed as the header row
    names = ['label', 'text'],
    lineterminator = '\n'
)

# convert label from [1, 2] to [0, 1]
df_corpus['label'] = df_corpus['label'] - 1

## Inspect the data

In [9]:
# ANSWER

df_corpus.head()

| | label | text |
|---|---|---|
| 0 | 1 | The best soundtrack ever to anything.: I'm rea... |
| 1 | 1 | Amazing!: This soundtrack is my favorite music... |
| 2 | 1 | Excellent Soundtrack: I truly like this soundt... |
| 3 | 1 | Remember, Pull Your Jaw Off The Floor After He... |
| 4 | 1 | an absolute masterpiece: I am quite sure any o... |


In [10]:
df_corpus.tail()

| | label | text |
|---|---|---|
| 9994 | 1 | A revelation of life in small town America in ... |
| 9995 | 1 | Great biography of a very interesting journali... |
| 9996 | 0 | Interesting Subject; Poor Presentation: You'd ... |
| 9997 | 0 | Don't buy: The box looked used and it is obvio... |
| 9998 | 1 | Beautiful Pen and Fast Delivery.: The pen was ... |


In [11]:
df_corpus.shape

(9999, 2)

In [12]:
df_corpus.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9999 entries, 0 to 9998
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   label   9999 non-null   int64 
 1   text    9999 non-null   object
dtypes: int64(1), object(1)
memory usage: 156.4+ KB


## Split the data into train and test

In [14]:
# Features and Labels
X = df_corpus['text']
y = df_corpus['label']

In [15]:
## ANSWER
## split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

## Feature Engineering

### Count Vectors as features

In [18]:
# create a count vectorizer object
count_vect = CountVectorizer(token_pattern = r'\w{1,}')

# Learn a vocabulary dictionary of all tokens in the raw documents
count_vect.fit(X_train)

# Transform documents to document-term matrix.
X_train_count = count_vect.transform(X_train)
X_test_count = count_vect.transform(X_test)

### TF-IDF Vectors as features
- Word level
- N-Gram level
- Character level

In [20]:
%%time
# word level tf-idf
tfidf_vect = TfidfVectorizer(analyzer = 'word',
                             token_pattern = r'\w{1,}',
                             max_features = 5000)
print(tfidf_vect)

tfidf_vect.fit(X_train)
X_train_tfidf = tfidf_vect.transform(X_train)
X_test_tfidf  = tfidf_vect.transform(X_test)

TfidfVectorizer(max_features=5000, token_pattern='\\w{1,}')
CPU times: total: 2.39 s
Wall time: 3.06 s


In [21]:
%%time
# ngram level tf-idf
tfidf_vect_ngram = TfidfVectorizer(analyzer = 'word',
                                   token_pattern = r'\w{1,}',
                                   ngram_range = (2, 3),
                                   max_features = 5000)
print(tfidf_vect_ngram)

tfidf_vect_ngram.fit(X_train)
X_train_tfidf_ngram = tfidf_vect_ngram.transform(X_train)
X_test_tfidf_ngram  = tfidf_vect_ngram.transform(X_test)

TfidfVectorizer(max_features=5000, ngram_range=(2, 3), token_pattern='\\w{1,}')
CPU times: total: 9.98 s
Wall time: 13 s


In [22]:
%%time
# character level tf-idf
tfidf_vect_ngram_chars = TfidfVectorizer(analyzer = 'char',
                                         ngram_range = (2, 3),
                                         max_features = 5000)
print(tfidf_vect_ngram_chars)

tfidf_vect_ngram_chars.fit(X_train)
X_train_tfidf_ngram_chars = tfidf_vect_ngram_chars.transform(X_train)
X_test_tfidf_ngram_chars  = tfidf_vect_ngram_chars.transform(X_test)

TfidfVectorizer(analyzer='char', max_features=5000, ngram_range=(2, 3))
CPU times: total: 15.2 s
Wall time: 21.4 s


### Text / NLP based features

Create some additional features from the raw text:

* `char_count`: number of characters in the text
* `word_count`: number of words in the text
* `word_density`: average number of characters per word
* `punctuation_count`: number of punctuation characters in the text
* `title_word_count`: number of title-case words in the text
* `uppercase_word_count`: number of all-uppercase words in the text
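
As a worked example, the six features listed above can be computed for a single string with the standard library alone. The sentence is made up for illustration, and words are taken to be whitespace-separated chunks, which is an assumption about how "word" is defined here.

```python
import string

# Made-up sentence, for illustration only
text = 'Amazing!: This soundtrack is my FAVORITE.'
words = text.split()  # whitespace-separated chunks, punctuation included

features = {
    'char_count': len(text),
    'word_count': len(words),
    'word_density': sum(len(w) for w in words) / len(words),
    'punctuation_count': sum(1 for ch in text if ch in string.punctuation),
    'title_word_count': sum(1 for w in words if w.istitle()),
    'uppercase_word_count': sum(1 for w in words if w.isupper()),
}
print(features)
```

Note that `str.istitle()` and `str.isupper()` ignore attached punctuation, so "Amazing!:" counts as a title-case word and "FAVORITE." as an uppercase word.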


In [24]:
%%time
# ANSWER

# Define functions to compute features
def char_count(text):
    return len(text)

def word_count(text):
    return len(text.split())

def word_density(text):
    words = text.split()
    if len(words) == 0:
        return 0
    return sum(len(word) for word in words) / len(words)

def punctuation_count(text):
    return sum(1 for char in text if char in string.punctuation)

def title_word_count(text):
    return sum(1 for word in text.split() if word.istitle())

def uppercase_word_count(text):
    return sum(1 for word in text.split() if word.isupper())

# Compute features for each text
def compute_features(df):
    df['char_count'] = df['text'].apply(char_count)
    df['word_count'] = df['text'].apply(word_count)
    df['word_density'] = df['text'].apply(word_density)
    df['punctuation_count'] = df['text'].apply(punctuation_count)
    return df  # Return the updated DataFrame

# Sample DataFrame
df_corpus = pd.DataFrame({
    'label': [1, 1, 1, 1, 1],
    'text': [
        "The best soundtrack ever to anything.: I'm really impressed.",
        "Amazing!: This soundtrack is my favorite music of all time.",
        "Excellent Soundtrack: I truly like this soundtrack very much.",
        "Remember, Pull Your Jaw Off The Floor After Hearing This Music!",
        "An absolute masterpiece: I am quite sure any other soundtrack is inferior."
    ]
})

# Compute features
df_corpus = compute_features(df_corpus)

# Display the updated DataFrame
print(df_corpus.head())

# Optional: Measure computation time
import time
start_time = time.time()

# Compute features again (if needed)
df_corpus = compute_features(df_corpus)

print("Time taken to compute features: %s seconds" % (time.time() - start_time))

   label                                               text  char_count  \
0      1  The best soundtrack ever to anything.: I'm rea...          60   
1      1  Amazing!: This soundtrack is my favorite music...          59   
2      1  Excellent Soundtrack: I truly like this soundt...          61   
3      1  Remember, Pull Your Jaw Off The Floor After He...          63   
4      1  An absolute masterpiece: I am quite sure any o...          74   

   word_count  word_density  punctuation_count  
0           9      5.777778                  4  
1          10      5.000000                  3  
2           9      5.888889                  2  
3          11      4.818182                  2  
4          12      5.250000                  2  
Time taken to compute features: 0.0020072460174560547 seconds
CPU times: total: 0 ns
Wall time: 14 ms


In [25]:
## load spaCy
nlp = spacy.load('en_core_web_sm')

Part-of-speech tags in **spaCy**

    POS   DESCRIPTION               EXAMPLES
    ----- ------------------------- ---------------------------------------------
    ADJ   adjective                 big, old, green, incomprehensible, first
    ADP   adposition                in, to, during
    ADV   adverb                    very, tomorrow, down, where, there
    AUX   auxiliary                 is, has (done), will (do), should (do)
    CONJ  conjunction               and, or, but
    CCONJ coordinating conjunction  and, or, but
    DET   determiner                a, an, the
    INTJ  interjection              psst, ouch, bravo, hello
    NOUN  noun                      girl, cat, tree, air, beauty
    NUM   numeral                   1, 2017, one, seventy-seven, IV, MMXIV
    PART  particle                  's, not,
    PRON  pronoun                   I, you, he, she, myself, themselves, somebody
    PROPN proper noun               Mary, John, London, NATO, HBO
    PUNCT punctuation               ., (, ), ?
    SCONJ subordinating conjunction if, while, that
    SYM   symbol                    $, %, §, ©, +, −, ×, ÷, =, :), 😝
    VERB  verb                      run, runs, running, eat, ate, eating
    X     other                     sfpksdpsxmsa
    SPACE space
    
Find the number of adjectives, adverbs, nouns, numerals, pronouns, proper nouns and verbs in each text.

    Hint:
    1. Convert the text to a spaCy document
    2. Use each token's pos_ attribute
    3. Count the tags with Counter
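
The counting step of the hint can be sketched without loading a model. In the example below the `tagged` list is hand-written and merely stands in for `[(t.text, t.pos_) for t in nlp(text)]`; it is not actual spaCy output, and the column naming (`adj_count` and so on) is an assumption that follows the names used in this lab.

```python
from collections import Counter

# Hand-written (token, POS) pairs standing in for spaCy's token.pos_ values
tagged = [('This', 'PRON'), ('is', 'AUX'), ('a', 'DET'), ('truly', 'ADV'),
          ('great', 'ADJ'), ('soundtrack', 'NOUN'), ('.', 'PUNCT')]

# Count how often each POS tag occurs
pos_counts = Counter(pos for _, pos in tagged)

# Keep only the tags of interest; a Counter returns 0 for missing tags
wanted = ['ADJ', 'ADV', 'NOUN', 'NUM', 'PRON', 'PROPN', 'VERB']
row = {tag.lower() + '_count': pos_counts[tag] for tag in wanted}
print(row)
```

Because a `Counter` yields zero for absent keys, every document gets the same set of columns even when a tag never appears in it.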

In [27]:
# Initialise columns for the part-of-speech counts
df_corpus['adj_count'] = 0
df_corpus['adv_count'] = 0
df_corpus['noun_count'] = 0
df_corpus['num_count'] = 0
df_corpus['pron_count'] = 0
df_corpus['propn_count'] = 0
df_corpus['verb_count'] = 0

In [28]:
# ANSWER
df_corpus = pd.DataFrame({
    'label': [1, 1, 1, 1, 1],
    'text': [
        "The best soundtrack ever to anything.: I'm really impressed.",
        "Amazing!: This soundtrack is my favorite music of all time.",
        "Excellent Soundtrack: I truly like this soundtrack very much.",
        "Remember, Pull Your Jaw Off The Floor After Hearing This Music!",
        "An absolute masterpiece: I am quite sure any other soundtrack is inferior."
    ]
})

# Function to count POS tags
def count_pos(text):
    doc = nlp(text)
    pos_counts = Counter([token.pos_ for token in doc])
    return pos_counts

# Apply function to compute POS counts
pos_counts = df_corpus['text'].apply(count_pos)

# Create a DataFrame from the POS counts
pos_df = pd.DataFrame(list(pos_counts)).fillna(0).astype(int)

# Ensure all POS tags are included, even if they have zero counts
pos_df = pos_df.reindex(columns=['ADJ', 'ADV', 'NOUN', 'NUM', 'PRON', 'ADP', 'VERB'], fill_value=0)

# Combine with the original DataFrame
df_corpus = pd.concat([df_corpus, pos_df], axis=1)

# Rename columns for clarity
df_corpus = df_corpus.rename(columns={
    'ADJ': 'Adjective',
    'ADV': 'Adverb',
    'NOUN': 'Noun',
    'NUM': 'Numeric',
    'PRON': 'Pronoun',
    'ADP': 'Preposition',
    'VERB': 'Verb'
})


# Display the updated DataFrame
print(df_corpus.head())

# Optional: measure how long the POS tagging takes, without modifying df_corpus
# (repeating the concat here would append a second, duplicate set of POS columns)
import time
start_time = time.time()

_ = df_corpus['text'].apply(count_pos)

print("Time taken to compute POS counts: %s seconds" % (time.time() - start_time))



   label                                               text  Adjective  \
0      1  The best soundtrack ever to anything.: I'm rea...          2   
1      1  Amazing!: This soundtrack is my favorite music...          2   
2      1  Excellent Soundtrack: I truly like this soundt...          1   
3      1  Remember, Pull Your Jaw Off The Floor After He...          0   
4      1  An absolute masterpiece: I am quite sure any o...          4   

   Adverb  Noun  Numeric  Pronoun  Preposition  Verb  
0       2     1        0        2            1     0  
1       0     3        0        1            1     0  
2       3     2        0        1            0     1  
3       0     0        0        1            2     3  
4       1     2        0        1            0     0  
Time taken to compute POS counts: 0.10580182075500488 seconds


In [29]:
print(df_corpus.columns)

Index(['label', 'text', 'Adjective', 'Adverb', 'Noun', 'Numeric', 'Pronoun',
       'Preposition', 'Verb'],
      dtype='object')


In [30]:
cols = [
    'char_count', 'word_count', 'word_density',
    'punctuation_count', 'title_word_count',
    'uppercase_word_count', 'adj_count',
    'adv_count', 'noun_count', 'num_count',
    'pron_count', 'propn_count', 'verb_count']


In [31]:
# Add functions for the additional counts
def adj_count(text):
    # Placeholder for adjective count (you can refine this)
    return sum(1 for word in text.split() if word.endswith('y'))

def adv_count(text):
    # Placeholder for adverb count (you can refine this)
    return sum(1 for word in text.split() if word.endswith('ly'))

def noun_count(text):
    # Placeholder for noun count (you can refine this)
    return sum(1 for word in text.split() if word.istitle())

def num_count(text):
    return sum(1 for word in text.split() if word.isdigit())

def pron_count(text):
    # Placeholder for pronoun count (you can refine this)
    return sum(1 for word in text.split() if word.lower() in ['i', 'you', 'he', 'she', 'it', 'we', 'they'])

def propn_count(text):
    # Placeholder for proper noun count (you can refine this)
    return sum(1 for word in text.split() if word.istitle())

def verb_count(text):
    # Placeholder for verb count (you can refine this)
    return sum(1 for word in text.split() if word.endswith('ing'))

# Update compute_features to include the new counts
def compute_features(df):
    df['char_count'] = df['text'].apply(char_count)
    df['word_count'] = df['text'].apply(word_count)
    df['word_density'] = df['text'].apply(word_density)
    df['punctuation_count'] = df['text'].apply(punctuation_count)
    df['title_word_count'] = df['text'].apply(title_word_count)
    df['uppercase_word_count'] = df['text'].apply(uppercase_word_count)
    df['adj_count'] = df['text'].apply(adj_count)
    df['adv_count'] = df['text'].apply(adv_count)
    df['noun_count'] = df['text'].apply(noun_count)
    df['num_count'] = df['text'].apply(num_count)
    df['pron_count'] = df['text'].apply(pron_count)
    df['propn_count'] = df['text'].apply(propn_count)
    df['verb_count'] = df['text'].apply(verb_count)
    return df

# Then compute the features again
df_corpus = compute_features(df_corpus)

# Now you can sample the full set of columns
df_corpus[cols].sample(5)

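| | char_count | word_count | word_density | punctuation_count | title_word_count | uppercase_word_count | adj_count | adv_count | noun_count | num_count | pron_count | propn_count | verb_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 61 | 9 | 5.888889 | 2 | 3 | 1 | 2 | 1 | 3 | 0 | 1 | 3 | 0 |
| 3 | 63 | 11 | 4.818182 | 2 | 11 | 0 | 0 | 0 | 11 | 0 | 0 | 11 | 1 |
| 4 | 74 | 12 | 5.25 | 2 | 2 | 1 | 1 | 0 | 2 | 0 | 1 | 2 | 0 |
| 1 | 59 | 10 | 5.0 | 3 | 2 | 0 | 1 | 0 | 2 | 0 | 0 | 2 | 0 |
| 0 | 60 | 9 | 5.777778 | 4 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 |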


### Topic Models as features

In [33]:
%%time
# train an LDA model
lda_model = LatentDirichletAllocation(n_components = 20, learning_method = 'online', max_iter = 20)

X_topics = lda_model.fit_transform(X_train_count)
topic_word = lda_model.components_
vocab = count_vect.get_feature_names_out()

CPU times: total: 2min 53s
Wall time: 4min 16s


In [34]:
# view the topic models
n_top_words = 10
topic_summaries = []
print('Group Top Words')
print('-----', '-'*80)
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    top_words = ' '.join(topic_words)
    topic_summaries.append(top_words)
    print('  %3d %s' % (i, top_words))

Group Top Words
----- --------------------------------------------------------------------------------
    0 daily career actions adventures beautifully superficial afternoon harder indulgent tons
    1 dialogue running ipod disk yoga government chose allowed load field
    2 cute economics politics wanting economic pratchett costume self overview gay
    3 hollywood diane lane van worthy seconds heater grammar damme drivel
    4 tape chapters digital descriptions japanese copy camera receive moment 70
    5 versions blocks peter paperback goodness rocket victorian stewart fairy trade
    6 flight stockings housing 55 reduce acne varies concerns retarded wagner
    7 the and a i to of it this is in
    8 recipes cooking ballet paris scooter celiac entry alike neighbors geforce
    9 l et les il honor lighter titan est manon scheme
   10 the i it to and a for this my not
   11 orwell winston starting situations odd ship stephen catholic darkness peace
   12 spanish cap philadelphia life
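
To use topics as features, each document can be represented by its topic distribution, which is what `fit_transform` returns. The sketch below shows this on four made-up documents with two topics; `toy_docs`, `n_components = 2` and `random_state = 0` are assumptions chosen to keep the example small and repeatable.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Four made-up documents, for illustration only
toy_docs = [
    'great soundtrack great music',
    'music soundtrack album',
    'boring book bad plot',
    'bad book poor writing',
]
toy_counts = CountVectorizer().fit_transform(toy_docs)

# Fit a two-topic model and get one topic distribution per document
toy_lda = LatentDirichletAllocation(n_components = 2, random_state = 0)
doc_topics = toy_lda.fit_transform(toy_counts)

print(doc_topics.shape)        # (number of documents, number of topics)
print(doc_topics.sum(axis=1))  # each row is a probability distribution
```

The resulting matrix has one row per document and one column per topic, so it can be passed to a classifier in the same way as the count or TF-IDF matrices.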

## Modelling

Run the following cells to train a number of models on the count vector and TF-IDF vector feature sets generated above.

In [37]:
## helper function

def train_model(classifier, feature_vector_train, label, feature_vector_valid):
    # fit the classifier on the training dataset
    classifier.fit(feature_vector_train, label)

    # predict the labels of the validation dataset
    predictions = classifier.predict(feature_vector_valid)

    # y_test is read from the notebook's global scope;
    # accuracy_score expects (y_true, y_pred)
    return accuracy_score(y_test, predictions)
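
The helper above scores predictions against the notebook-level `y_test`. A possible variant, sketched here under the assumption that passing the evaluation labels explicitly is acceptable, makes the function usable with any split. The name `train_and_score` and the toy data are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB


def train_and_score(classifier, X_fit, y_fit, X_eval, y_eval):
    """Fit on (X_fit, y_fit) and return accuracy on (X_eval, y_eval)."""
    classifier.fit(X_fit, y_fit)
    predictions = classifier.predict(X_eval)
    return accuracy_score(y_eval, predictions)


# Made-up data, for illustration only
toy_train = ['great album', 'loved it', 'terrible book', 'awful waste']
toy_train_labels = [1, 1, 0, 0]
toy_eval = ['great book', 'awful album']
toy_eval_labels = [1, 0]

# Fit the vocabulary on the training texts only, then transform both sets
vect = CountVectorizer().fit(toy_train)
score = train_and_score(MultinomialNB(),
                        vect.transform(toy_train), toy_train_labels,
                        vect.transform(toy_eval), toy_eval_labels)
print(score)
```

With the labels as a parameter, the same function could also be reused for cross-validation folds or a separate validation set.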

In [38]:
# Keep the results in a dataframe
results = pd.DataFrame(columns = ['Count Vectors',
                                  'WordLevel TF-IDF',
                                  'N-Gram Vectors',
                                  'CharLevel Vectors'])

### Naive Bayes Classifier

In [40]:
%%time
# Naive Bayes on Count Vectors
accuracy1 = train_model(MultinomialNB(), X_train_count, y_train, X_test_count)
print('NB, Count Vectors    : %.4f\n' % accuracy1)

NB, Count Vectors    : 0.8520

CPU times: total: 0 ns
Wall time: 22.6 ms


In [41]:
%%time
# Naive Bayes on Word Level TF IDF Vectors
accuracy2 = train_model(MultinomialNB(), X_train_tfidf, y_train, X_test_tfidf)
print('NB, WordLevel TF-IDF : %.4f\n' % accuracy2)

NB, WordLevel TF-IDF : 0.8550

CPU times: total: 0 ns
Wall time: 16 ms


In [42]:
%%time
# Naive Bayes on Ngram Level TF IDF Vectors
accuracy3 = train_model(MultinomialNB(), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('NB, N-Gram Vectors   : %.4f\n' % accuracy3)

NB, N-Gram Vectors   : 0.8360

CPU times: total: 0 ns
Wall time: 13 ms


In [43]:
%%time
# Naive Bayes on Character Level TF IDF Vectors
accuracy4 = train_model(MultinomialNB(), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('NB, CharLevel Vectors: %.4f\n' % accuracy4)

NB, CharLevel Vectors: 0.8195

CPU times: total: 46.9 ms
Wall time: 48.5 ms


In [44]:
results.loc['Naïve Bayes'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

### Linear Classifier

In [46]:
%%time
# Linear Classifier on Count Vectors
accuracy1 = train_model(LogisticRegression(solver = 'lbfgs', max_iter = 350), X_train_count, y_train, X_test_count)
print('LR, Count Vectors    : %.4f\n' % accuracy1)

LR, Count Vectors    : 0.8520

CPU times: total: 781 ms
Wall time: 3.5 s


In [47]:
%%time
# Linear Classifier on Word Level TF IDF Vectors
accuracy2 = train_model(LogisticRegression(solver = 'lbfgs', max_iter = 100), X_train_tfidf, y_train, X_test_tfidf)
print('LR, WordLevel TF-IDF : %.4f\n' % accuracy2)

LR, WordLevel TF-IDF : 0.8715

CPU times: total: 31.2 ms
Wall time: 110 ms


In [48]:
%%time
# Linear Classifier on Ngram Level TF IDF Vectors
accuracy3 = train_model(LogisticRegression(solver = 'lbfgs', max_iter = 100), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('LR, N-Gram Vectors   : %.4f\n' % accuracy3)

LR, N-Gram Vectors   : 0.8295

CPU times: total: 31.2 ms
Wall time: 76.1 ms


In [49]:
%%time
# Linear Classifier on Character Level TF IDF Vectors
accuracy4 = train_model(LogisticRegression(solver = 'lbfgs', max_iter = 100), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('LR, CharLevel Vectors: %.4f\n' % accuracy4)

LR, CharLevel Vectors: 0.8490

CPU times: total: 188 ms
Wall time: 463 ms


In [50]:
results.loc['Logistic Regression'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

### Support Vector Machine

In [52]:
%%time
# Support Vector Machine on Count Vectors
accuracy1 = train_model(LinearSVC(), X_train_count, y_train, X_test_count)
print('SVM, Count Vectors    : %.4f\n' % accuracy1)

SVM, Count Vectors    : 0.8345

CPU times: total: 1.05 s
Wall time: 1.37 s


In [53]:
%%time
# Support Vector Machine on Word Level TF IDF Vectors
accuracy2 = train_model(LinearSVC(), X_train_tfidf, y_train, X_test_tfidf)
print('SVM, WordLevel TF-IDF : %.4f\n' % accuracy2)

SVM, WordLevel TF-IDF : 0.8605

CPU times: total: 31.2 ms
Wall time: 307 ms


In [54]:
%%time
# Support Vector Machine on Ngram Level TF IDF Vectors
accuracy3 = train_model(LinearSVC(), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('SVM, N-Gram Vectors   : %.4f\n' % accuracy3)

SVM, N-Gram Vectors   : 0.8120

CPU times: total: 46.9 ms
Wall time: 147 ms


In [55]:
%%time
# Support Vector Machine on Character Level TF IDF Vectors
accuracy4 = train_model(LinearSVC(), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('SVM, CharLevel Vectors: %.4f\n' % accuracy4)

SVM, CharLevel Vectors: 0.8590

CPU times: total: 906 ms
Wall time: 1.92 s


In [56]:
results.loc['Support Vector Machine'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

### Bagging Models

In [58]:
%%time
# Bagging (Random Forest) on Count Vectors
accuracy1 = train_model(RandomForestClassifier(n_estimators = 100), X_train_count, y_train, X_test_count)
print('RF, Count Vectors    : %.4f\n' % accuracy1)

RF, Count Vectors    : 0.8240

CPU times: total: 28.8 s
Wall time: 40.4 s


In [59]:
%%time
# Bagging (Random Forest) on Word Level TF IDF Vectors
accuracy2 = train_model(RandomForestClassifier(n_estimators = 100), X_train_tfidf, y_train, X_test_tfidf)
print('RF, WordLevel TF-IDF : %.4f\n' % accuracy2)

RF, WordLevel TF-IDF : 0.8300

CPU times: total: 16 s
Wall time: 25.6 s


In [60]:
%%time
# Bagging (Random Forest) on Ngram Level TF IDF Vectors
accuracy3 = train_model(RandomForestClassifier(n_estimators = 100), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('RF, N-Gram Vectors   : %.4f\n' % accuracy3)

RF, N-Gram Vectors   : 0.7855

CPU times: total: 13.9 s
Wall time: 25.3 s


In [61]:
%%time
# Bagging (Random Forest) on Character Level TF IDF Vectors
accuracy4 = train_model(RandomForestClassifier(n_estimators = 100), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('RF, CharLevel Vectors: %.4f\n' % accuracy4)

RF, CharLevel Vectors: 0.7855

CPU times: total: 49.2 s
Wall time: 1min 15s


In [62]:
results.loc['Random Forest'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

### Boosting Models

In [64]:
%%time
# Gradient Boosting on Count Vectors
accuracy1 = train_model(GradientBoostingClassifier(), X_train_count, y_train, X_test_count)
print('GB, Count Vectors    : %.4f\n' % accuracy1)

GB, Count Vectors    : 0.7990

CPU times: total: 14.2 s
Wall time: 26.7 s


In [65]:
%%time
# Gradient Boosting on Word Level TF IDF Vectors
accuracy2 = train_model(GradientBoostingClassifier(), X_train_tfidf, y_train, X_test_tfidf)
print('GB, WordLevel TF-IDF : %.4f\n' % accuracy2)

GB, WordLevel TF-IDF : 0.7920

CPU times: total: 30.6 s
Wall time: 51.7 s


In [66]:
%%time
# Gradient Boosting on Ngram Level TF IDF Vectors
accuracy3 = train_model(GradientBoostingClassifier(), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('GB, N-Gram Vectors   : %.4f\n' % accuracy3)

GB, N-Gram Vectors   : 0.7335

CPU times: total: 19.7 s
Wall time: 33.4 s


In [67]:
%%time
# Gradient Boosting on Character Level TF IDF Vectors
accuracy4 = train_model(GradientBoostingClassifier(), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('GB, CharLevel Vectors: %.4f\n' % accuracy4)

GB, CharLevel Vectors: 0.8025

CPU times: total: 5min 15s
Wall time: 8min


In [68]:
results.loc['Gradient Boosting'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

In [69]:
results

| | Count Vectors | WordLevel TF-IDF | N-Gram Vectors | CharLevel Vectors |
|---|---|---|---|---|
| Naïve Bayes | 0.852 | 0.855 | 0.836 | 0.8195 |
| Logistic Regression | 0.852 | 0.8715 | 0.8295 | 0.849 |
| Support Vector Machine | 0.8345 | 0.8605 | 0.812 | 0.859 |
| Random Forest | 0.824 | 0.83 | 0.7855 | 0.7855 |
| Gradient Boosting | 0.799 | 0.792 | 0.7335 | 0.8025 |


Which combination of features and model performed the best?

In [None]:
# Logistic Regression on WordLevel TF-IDF vectors gave the highest accuracy (0.8715)
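
The best combination can also be read off programmatically. The sketch below rebuilds the results table from the accuracies reported above (the name `results_copy` is hypothetical) and finds the largest entry by stacking the table into a single series.

```python
import pandas as pd

# Accuracies copied from the results table above
results_copy = pd.DataFrame(
    {'Count Vectors':     [0.8520, 0.8520, 0.8345, 0.8240, 0.7990],
     'WordLevel TF-IDF':  [0.8550, 0.8715, 0.8605, 0.8300, 0.7920],
     'N-Gram Vectors':    [0.8360, 0.8295, 0.8120, 0.7855, 0.7335],
     'CharLevel Vectors': [0.8195, 0.8490, 0.8590, 0.7855, 0.8025]},
    index = ['Naïve Bayes', 'Logistic Regression', 'Support Vector Machine',
             'Random Forest', 'Gradient Boosting'])

# stack() gives a Series indexed by (model, feature set); idxmax() returns that pair
best_model, best_features = results_copy.stack().idxmax()
best_score = results_copy.loc[best_model, best_features]
print(best_model, best_features, best_score)
```

In the notebook itself the same two lines could be applied directly to `results`.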



---

© 2024 Institute of Data


