### Sentiment Analysis on IMDb: TF-IDF + Naive Bayes

This is a classic sentiment analysis exercise based on the well-known IMDb movie review dataset.

In [1]:
import pandas as pd
import numpy as np
import os
from tqdm import tqdm

import re
import nltk
from nltk.corpus import stopwords

from sklearn.metrics import accuracy_score, f1_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

from sklearn.naive_bayes import GaussianNB, MultinomialNB

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/datascience/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
# to remove stop words from text
stop = stopwords.words('english')

### Loading the dataset as a DataFrame

In [3]:
# it takes some time... good to have a progress bar
basepath = 'aclImdb'

labels = {'pos': 1, 'neg': 0}

# collect the rows in a plain list: DataFrame.append was removed in pandas 2.0,
# and copying the whole frame on every iteration is slow
rows = []

with tqdm(total=50000) as pbar:
    for s in ('test', 'train'):
        for l in ('pos', 'neg'):
            path = os.path.join(basepath, s, l)
            for file in sorted(os.listdir(path)):
                with open(os.path.join(path, file),
                          'r', encoding='utf-8') as infile:
                    txt = infile.read()
                rows.append([txt, labels[l]])

                pbar.update(1)

df = pd.DataFrame(rows, columns=['text', 'target'])

100%|██████████| 50000/50000 [01:34<00:00, 531.54it/s]


### Preprocess text

In [4]:
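# text cleaning: adds a few percentage points to the accuracy
def preprocessor(text):
    # strip HTML tags
    text = re.sub(r'<[^>]*>', '', text)
    # collect emoticons such as :-) ;( =D before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)',
                           text)
    # lower-case, collapse non-word characters, re-append emoticons without '-'
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))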

    return text

# if you want to test the difference without text preprocessing
def dummy_preprocessor(text):
    return text
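
To illustrate what the cleaning step does, here is a minimal sketch that applies the same `preprocessor` logic to a made-up review snippet (the `sample` string is an assumption, not taken from the dataset). HTML tags and punctuation are dropped, the text is lower-cased, and the emoticon is moved to the end without its `-`.

```python
import re

def preprocessor(text):
    # same logic as the notebook cell: strip tags, keep emoticons, lower-case
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
    return text

# illustrative input, not a real review
sample = "<p>This movie is great :-)</p>"
cleaned = preprocessor(sample)
print(cleaned)  # this movie is great :)
```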

In [5]:
# apply globally
df['text'] = df['text'].apply(preprocessor)

In [6]:
# this is our text corpus
corpus = df['text'].values

In [7]:
# ignore the words in the stop-word list
vectorizer = TfidfVectorizer(stop_words=stop)

# feature extraction using tf-idf
X = vectorizer.fit_transform(corpus)
y = df['target'].values

In [8]:
# as we can see, we have a very large (103938) number of extracted features
X.shape

(50000, 103938)
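
To make the shape above easier to read, here is a small sketch of the same feature extraction on a toy corpus. The three sentences and the short stop-word list are assumptions chosen for illustration. Each row is a document, each column is a vocabulary term, and with the default settings every row is L2-normalized.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# toy corpus and stop-word list, for illustration only
toy_corpus = [
    "the movie was great",
    "the movie was terrible",
    "great acting and great plot",
]
toy_stop = ["the", "was", "and"]

toy_vectorizer = TfidfVectorizer(stop_words=toy_stop)
X_toy = toy_vectorizer.fit_transform(toy_corpus)

# vocabulary_ maps each term to its column index
toy_terms = sorted(toy_vectorizer.vocabulary_)
print(toy_terms)    # ['acting', 'great', 'movie', 'plot', 'terrible']
print(X_toy.shape)  # (3, 5): 3 documents, 5 terms

# the default norm='l2' gives every non-empty row unit length
row_norms = np.linalg.norm(X_toy.toarray(), axis=1)
print(row_norms)
```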

In [9]:
# train/validation split
SEED = 1234

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=SEED)

In [10]:
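# convert the sparse matrices to dense arrays (this needs a lot of memory;
# MultinomialNB also accepts sparse input, while GaussianNB requires dense arrays)
X_train = X_train.toarray()
X_valid = X_valid.toarray()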
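
The dense conversion is not required for `MultinomialNB`. As a hedged sketch on a toy corpus (texts and labels are made up for illustration), the classifier can be fitted on the sparse TF-IDF matrix directly and gives the same predictions as on the dense copy:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# toy data, for illustration only
toy_texts = ["great film", "wonderful acting", "terrible film", "boring plot"]
toy_labels = np.array([1, 1, 0, 0])

X_sparse = TfidfVectorizer().fit_transform(toy_texts)  # scipy sparse matrix
X_dense = X_sparse.toarray()                           # dense numpy array

# fit one model per representation and compare the predictions
pred_sparse = MultinomialNB().fit(X_sparse, toy_labels).predict(X_sparse)
pred_dense = MultinomialNB().fit(X_dense, toy_labels).predict(X_dense)

print(np.array_equal(pred_sparse, pred_dense))
```

On the full dataset, skipping `toarray()` avoids materializing a 40000 x 103938 dense matrix.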

### Train the Naive Bayes (NB) classifier

In [12]:
%%time
# train the classifier
# Multinomial works better than Gaussian

clf = MultinomialNB()

clf.fit(X_train, y_train)

CPU times: user 16.5 s, sys: 1.85 s, total: 18.3 s
Wall time: 7.51 s


MultinomialNB()

### Validation

In [13]:
y_pred = clf.predict(X_valid)

In [14]:
print("Number of mislabeled points out of a total %d texts : %d" % (X_valid.shape[0], (y_valid != y_pred).sum()))

Number of mislabeled points out of a total 10000 texts : 1355


In [15]:
acc = accuracy_score(y_valid, y_pred)

print(f"ACC score computed on validation set: {round(acc, 3)}")

ACC score computed on validation set: 0.864
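
The two outputs above agree: 1355 errors out of 10000 texts is an accuracy of 0.8645. `f1_score` is imported at the top of the notebook but never used. The sketch below checks the arithmetic and shows how F1 could be computed next to accuracy. `toy_true` and `toy_pred` are hand-made labels for illustration, not results from the model.

```python
from sklearn.metrics import accuracy_score, f1_score

# accuracy recomputed from the counts printed by the notebook
n_valid, n_wrong = 10000, 1355
acc_from_counts = (n_valid - n_wrong) / n_valid
print(acc_from_counts)

# hand-made labels: 2 true positives, 1 false negative, 1 false positive, 2 true negatives
toy_true = [1, 0, 1, 1, 0, 0]
toy_pred = [1, 0, 0, 1, 0, 1]

toy_acc = accuracy_score(toy_true, toy_pred)  # 4 correct out of 6
toy_f1 = f1_score(toy_true, toy_pred)         # precision = recall = 2/3
print(toy_acc, toy_f1)
```

In the notebook itself the same call would be `f1_score(y_valid, y_pred)`.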


### Final remarks
It is easy to build a model based on the **bag-of-words** approach: we only need the **TfidfVectorizer** provided by sklearn.

To increase the accuracy I have added:
* some preprocessing to polish the text
* removal of stop words

As we can see, such a simple model gives:
* ACC = 0.864

The reason is that, to tell whether the mood of a review is positive or negative, it is often enough to look at which words the text contains.
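
One way to see which words drive the decision is to compare the per-class log-probabilities that `MultinomialNB` stores in `feature_log_prob_`. This is a minimal sketch on a toy corpus (texts, labels and variable names are assumptions for illustration). On the real model the same idea would use `clf` and `vectorizer`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# toy data, for illustration only
toy_texts = ["great wonderful film", "great acting",
             "terrible boring film", "terrible plot"]
toy_labels = [1, 1, 0, 0]

toy_vec = TfidfVectorizer()
X_toy = toy_vec.fit_transform(toy_texts)
toy_clf = MultinomialNB().fit(X_toy, toy_labels)

# order the terms by their column index
vocab = toy_vec.vocabulary_
terms = sorted(vocab, key=vocab.get)

# classes_ is [0, 1], so row 1 is the positive class and row 0 the negative one
log_ratio = toy_clf.feature_log_prob_[1] - toy_clf.feature_log_prob_[0]

most_positive = terms[int(np.argmax(log_ratio))]
most_negative = terms[int(np.argmin(log_ratio))]
print(most_positive, most_negative)  # great terrible
```

Words that appear equally in both classes, such as "film" here, get a log-ratio close to zero and carry little sentiment.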