### Sentiment Analysis on Imdb: tf-idf + Naive Bayes

This is a classical exercise of Sentiment Analysis, based on the famous IMdb dataset. Text is vectorized using sklearn tf-idf.

As a classifier, we're using Multinomial Naive Bayes

* added k-fold cv to better compare with distil-bert

In [1]:
import pandas as pd
import numpy as np
import os
from tqdm import tqdm

import re
import nltk
from nltk.corpus import stopwords

from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, f1_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from sklearn.naive_bayes import GaussianNB, MultinomialNB

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/datascience/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
# to remove stop words from text
stop = stopwords.words('english')

### Loading the dataset as a DataFrame

In [3]:
# it takes some time... good to have a progress bar
basepath = 'aclImdb'

labels = {'pos': 1, 'neg': 0}

df = pd.DataFrame()

with tqdm(total=50000) as pbar:
    for s in ('test', 'train'):
        for l in ('pos', 'neg'):
            path = os.path.join(basepath, s, l)
            for file in sorted(os.listdir(path)):
                with open(os.path.join(path, file), 
                          'r', encoding='utf-8') as infile:
                    txt = infile.read()
                df = df.append([[txt, labels[l]]], 
                               ignore_index=True)
                
                pbar.update(1)
                
df.columns = ['text', 'target']

100%|██████████| 50000/50000 [02:11<00:00, 379.85it/s]


### preprocess text

In [4]:
# for text cleaning: add some percentage points to the result
def preprocessor(text):
    text = re.sub('<[^>]*>', '', text)
    emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)',
                           text)
    text = (re.sub('[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))

    return text

# if you want to test the difference without text preprocessing
def dummy_preprocessor(text):
    return text

In [5]:
# apply globally
df['text'] = df['text'].apply(preprocessor)

In [6]:
# this is our text corpus
corpus = df['text'].values

In [7]:
# removing words from stop words list
vectorizer = TfidfVectorizer(stop_words=stop)

# feature extraction using tf-idf
X = vectorizer.fit_transform(corpus)
y = df['target'].values

In [8]:
# as we can see, we have a very large (103938) number of extracted features
X.shape

(50000, 103938)

### train the classifier Naive Bayes (NB)

In [9]:
%%time

SEED = 1432
N_FOLDS = 5

kf = KFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)

avg_acc_score = 0.

for i, (train_idx, valid_idx) in enumerate(kf.split(X)):
    print()
    print("Processing fold:", i + 1)

    # here we split the DataFrame, using the indexes for the fold

    X_train = X[train_idx].toarray()
    y_train = y[train_idx]
    
    X_valid = X[valid_idx].toarray()
    y_valid = y[valid_idx]
    
    # train of the classifier
    # MUltinomial works better than Gaussian

    clf = MultinomialNB()

    clf.fit(X_train, y_train)
    
    y_pred = clf.predict(X_valid)
    
    acc = accuracy_score(y_valid, y_pred)
    
    print(f"acc: {round(acc, 3)}")
    print()
    
    avg_acc_score += acc/N_FOLDS


Processing fold: 1
acc: 0.865


Processing fold: 2
acc: 0.862


Processing fold: 3
acc: 0.868


Processing fold: 4
acc: 0.87


Processing fold: 5
acc: 0.867

CPU times: user 1min 51s, sys: 52.1 s, total: 2min 43s
Wall time: 1min 26s


### Validation

In [10]:
print(f"Avg ACC score computed on validation set: {round(avg_acc_score, 3)}")

Avg ACC score computed on validation set: 0.866


### Final remarks
It is easy to develop a Model based on the **bags-of-words** approach. We need only to use the **TfidfVectorizer** provided by sklearn.

To increase the accuracy I have added:
* some preprocessing to clean the text
* removal of stop words

As we can easy see, such a simple model gives an:
* ACC = 0.866

The reason is that to understand if the mood is positive or negative sometimes it is enough to see which words are contained in the text.