# Sentiment analysis
In this notebook we use a machine learning algorithm to infer the sentiment, positive or negative, of a text. We train our model on the [IMDB](http://ai.stanford.edu/~amaas/data/sentiment/) dataset, which contains 50k movie reviews. We download the dataset and extract the files into a folder. The dataset contains two subfolders, train/ and test/, each holding 25k reviews split into two subfolders, pos/ and neg/, with 12,500 text files each. Each file contains a short text, the content of the review. The name of the file is built from the review's unique identifier and the score given to the movie. A score of 7 or higher is positive, a score of 4 or lower is negative.
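
As a small illustration of the naming rule, the sketch below splits a file name into identifier, score and label. It assumes the name has the form `<id>_<score>.txt`; the helper `parse_review_filename` and the sample names are hypothetical and are not used in the rest of the notebook.

```python
import os

def parse_review_filename(filename):
    """Split a name like '127_8.txt' into (review_id, score, label)."""
    stem, _ = os.path.splitext(os.path.basename(filename))
    review_id, score = (int(part) for part in stem.split('_'))
    if score >= 7:
        label = 1      # positive review
    elif score <= 4:
        label = 0      # negative review
    else:
        label = None   # scores 5 and 6 are not covered by the rule above
    return review_id, score, label

print(parse_review_filename('127_8.txt'))           # (127, 8, 1)
print(parse_review_filename('train/neg/42_3.txt'))  # (42, 3, 0)
```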

In [2]:
import os
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import warnings
import torch
import torch.nn as nn
import torchvision
warnings.filterwarnings('ignore')
print("NumPy version: %s"%np.__version__)
print("Pandas version: %s"%pd.__version__)
print("PyTorch version: %s"%torch.__version__)

NumPy version: 1.23.1
Pandas version: 1.4.3
PyTorch version: 1.13.0


We copy the reviews and their sentiment labels into a tabular format so that they are easier to split and shuffle.

In [3]:
basepath = 'data/aclImdb'

labels = {'pos': 1, 'neg': 0}
rows = []  # collect (review, sentiment) pairs and build the DataFrame once at the end
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = os.path.join(basepath, s, l)
        for file in sorted(os.listdir(path)):
            with open(os.path.join(path, file), 'r', encoding='utf-8') as infile:
                rows.append((infile.read(), labels[l]))

df = pd.DataFrame(rows, columns=['review', 'sentiment'])

In [4]:
df = df.sample(frac=1, random_state=0).reset_index(drop=True)
df.to_csv(basepath + '/movie_data.csv', index=False, encoding='utf-8')

In [16]:
df.shape

(50000, 2)

We have saved the pandas DataFrame to a CSV file, so from now on we can read it back from disk.

In [24]:
df = pd.read_csv(basepath + '/movie_data.csv', encoding='utf-8')
df.head(3)

| | review | sentiment |
|---|---|---|
| 0 | In 1974, the teenager Martha Moxley (Maggie Gr... | 1 |
| 1 | OK... so... I really like Kris Kristofferson a... | 0 |
| 2 | \*\*\*SPOILER\*\*\* Do not read this, if you think a... | 0 |


## Bag-of-words representation
We want to represent each document by the words that it uses. The words come from a dictionary built by analyzing all the documents in the dataset. Each document is represented by an array that contains the number of times each word of the dictionary occurs in it, so the length of each array is equal to the length of the dictionary. This representation of a set of documents is called bag-of-words. To create such a representation for the reviews we have to tokenize each review and create an array with the number of occurrences of each word. This process is called vectorization. The resulting representation contains only numbers of occurrences and no words. Scikit-Learn provides the class [CountVectorizer](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction) to do exactly that. In the bag-of-words model the order of the words in a sentence does not matter: the words are treated as independent variables.

In [75]:
sample_reviews = np.array([df.iloc[i]['review'].split('.')[0] for i in range(0, 3)])
sample_reviews

array(['Election is a Chinese mob movie, or triads in this case',
       'I was just watching a Forensic Files marathon on Court TV',
       'Police Story is a stunning series of set pieces for Jackie Chan to show his unique talents and bravery'],
      dtype='<U102')

In [76]:
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
docs = np.array(sample_reviews)
bag = count.fit_transform(docs)

We can print the vocabulary built from the sample of documents with the index of each word. The index is assigned in alphabetical order.

In [77]:
print(count.vocabulary_)

{'election': 6, 'is': 12, 'chinese': 4, 'mob': 16, 'movie': 17, 'or': 20, 'triads': 31, 'in': 11, 'this': 29, 'case': 2, 'was': 34, 'just': 14, 'watching': 35, 'forensic': 9, 'files': 7, 'marathon': 15, 'on': 19, 'court': 5, 'tv': 32, 'police': 22, 'story': 26, 'stunning': 27, 'series': 23, 'of': 18, 'set': 24, 'pieces': 21, 'for': 8, 'jackie': 13, 'chan': 3, 'to': 30, 'show': 25, 'his': 10, 'unique': 33, 'talents': 28, 'and': 0, 'bravery': 1}


The length of the vocabulary is the length of the array that represents each document. Documents with different meaning but with the exact same words will be represented by the same vector.

In [78]:
print(len(count.vocabulary_))

36
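
The loss of word order can be checked with a small pure-Python sketch of the bag-of-words construction. It is not the CountVectorizer implementation: the function `bag_of_words` and the two toy sentences are hypothetical, and the regular expression only mirrors the vectorizer's default behaviour of keeping tokens with two or more word characters.

```python
import re
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of strings."""
    # Lowercase and keep tokens made of two or more word characters.
    tokenized = [re.findall(r'\b\w\w+\b', doc.lower()) for doc in docs]
    # Indices are assigned in alphabetical order of the terms.
    terms = sorted({token for tokens in tokenized for token in tokens})
    vocabulary = {term: index for index, term in enumerate(terms)}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[term] for term in terms])
    return vocabulary, vectors

vocabulary, vectors = bag_of_words(['the dog bites the man',
                                    'the man bites the dog'])
print(vocabulary)  # {'bites': 0, 'dog': 1, 'man': 2, 'the': 3}
print(vectors)     # [[1, 1, 1, 2], [1, 1, 1, 2]]
```

The two sentences mean different things but end up with the same vector, because only the counts survive.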


We can print the bag-of-words representation of the sample documents. Each row returned by the vectorizer is the forward index of a document, that is, a map that links the document to the words found in it and to their number of occurrences. Each value in the row is the frequency of a term in the document, or term frequency *tf(t, d)*, where t is the term and d the document. Most of the time we are interested in the inverted index, a map that links a word to the documents that contain it.

In [79]:
print(bag.toarray())

[[0 0 1 0 1 0 1 0 0 0 0 1 1 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0]
 [0 0 0 0 0 1 0 1 0 1 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1]
 [1 1 0 1 0 0 0 0 1 0 1 0 1 1 0 0 0 0 1 0 0 1 1 1 1 1 1 1 1 0 1 0 0 1 0 0]]
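
The inverted index mentioned above can be sketched in a few lines of plain Python. The function `build_inverted_index` is hypothetical, the three documents are shortened versions of the sample reviews, and splitting on whitespace is a simplification of the vectorizer's tokenization.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for term in doc.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = ['Election is a Chinese mob movie',
        'I was just watching a Forensic Files marathon',
        'Police Story is a stunning series']
inverted = build_inverted_index(docs)
print(inverted['is'])        # [0, 2]
print(inverted['a'])         # [0, 1, 2]
print(inverted['forensic'])  # [1]
```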


## Word relevancy
Not all words have the same relevance when we want to classify or rank a text. Words that occur in few documents provide more information about a document than words that occur everywhere. One way to measure the relevance of a word is the *term frequency-inverse document frequency*, or *tf_idf(t, d)*. The inverse document frequency of a term t, or *idf(t)*, is defined as

$$idf(t) = \log \frac{n}{1 + df(t)}$$

where n is the total number of documents and *df(t)* is the number of documents that contain the term t (at least once). The term-frequency of a term t, or *tf(t, d)*, is computed as the number of occurrences of a term t in the document d and corresponds to the counts that are returned by the vectorizer. The tf_idf(t,d) of a term t in a document d, that is its relevance, is defined as

$$tf\_idf(t, d) = tf(t, d) \cdot idf(t)$$

Scikit-Learn provides the class [TfidfTransformer](https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting) that implements tf_idf.
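
Note that with `smooth_idf=True` Scikit-Learn uses a slightly different formula from the one above, $idf(t) = \ln \frac{1 + n}{1 + df(t)} + 1$, and with `norm='l2'` it scales each document vector to unit length. The following pure-Python sketch applies that variant to the three sample sentences. The function `tfidf_l2` is hypothetical, and the sentences are written already lowercased and without punctuation or one-letter words, which is an assumption about what the vectorizer's default tokenization keeps.

```python
import math
from collections import Counter

def tfidf_l2(docs):
    """Smoothed tf-idf with L2 row normalization, one dict (term -> weight) per document."""
    tokenized = [doc.split() for doc in docs]
    n = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed idf: ln((1 + n) / (1 + df(t))) + 1
        weights = {t: tf[t] * (math.log((1 + n) / (1 + df[t])) + 1) for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

docs = ['election is chinese mob movie or triads in this case',
        'was just watching forensic files marathon on court tv',
        'police story is stunning series of set pieces for jackie chan '
        'to show his unique talents and bravery']
rows = tfidf_l2(docs)
print(round(rows[0]['election'], 5))  # 0.32311
print(round(rows[0]['is'], 5))        # 0.24574
print(round(rows[1]['tv'], 5))        # 0.33333
```

The word "is" appears in two of the three sentences, so it receives a lower weight than the words that appear in only one.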

In [86]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)
relevance = tfidf.fit_transform(count.fit_transform(docs)).toarray()
print(relevance)

[[0.         0.         0.32311233 0.         0.32311233 0.
  0.32311233 0.         0.         0.         0.         0.32311233
  0.24573525 0.         0.         0.         0.32311233 0.32311233
  0.         0.         0.32311233 0.         0.         0.
  0.         0.         0.         0.         0.         0.32311233
  0.         0.32311233 0.         0.         0.         0.        ]
 [0.         0.         0.         0.         0.         0.33333333
  0.         0.33333333 0.         0.33333333 0.         0.
  0.         0.         0.33333333 0.33333333 0.         0.
  0.         0.33333333 0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.33333333 0.         0.33333333 0.33333333]
 [0.23851206 0.23851206 0.         0.23851206 0.         0.
  0.         0.         0.23851206 0.         0.23851206 0.
  0.18139457 0.23851206 0.         0.         0.         0.
  0.23851206 0.         0.         0.23851206 0.23

## Cleaning text data
Before creating the bag of words we have to remove the characters that do not represent words, for example markup from HTML documents, punctuation and special characters as in the last part of the first document. We use regular expressions to accomplish this task.

In [6]:
df.loc[0, 'review'][-50:]

'is seven.<br /><br />Title (Brazil): Not Available'

In [7]:
import re
def preprocessor(text):
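    text = re.sub(r'<[^>]*>', '', text)  # removes HTML markup
    # collect emoticons such as :) ;-( =D before the punctuation is stripped
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # lowercase, replace runs of non-word characters with a space, append the emoticons without the '-' nose
    text = (re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', ''))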
    return text

In [8]:
preprocessor(df.loc[0, 'review'][-50:])

'is seven title brazil not available'
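
The review fragment above contains no emoticons, so it does not exercise the second regular expression. The self-contained sketch below repeats the function (with raw-string patterns) and applies it to a made-up string, which is an assumption chosen only to show that emoticons are moved to the end of the text and lose the `-` character.

```python
import re

def preprocessor(text):
    text = re.sub(r'<[^>]*>', '', text)  # remove HTML markup
    # Find emoticons made of eyes (: ; =), an optional nose (-) and a mouth.
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, replace non-word runs with a space, append the emoticons.
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    return text

cleaned = preprocessor('</a>This :) is :( a test :-)!')
print(repr(cleaned))  # 'this is a test :) :( :)'
```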

We apply the preprocessing function to all the reviews

In [9]:
df['review'] = df['review'].apply(preprocessor)

## Tokenization
After we have removed all the non-word characters from a document we can extract the words by splitting it on whitespace. To reduce the size of the dictionary we use the Porter stemming algorithm, which reduces the different forms of a word to a common root.

In [11]:
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer(text):
    return text.split()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

We will also remove stop words, that is, very common words such as "the", "a" and "and" that do not help in understanding the text.

In [13]:
import nltk

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Luigi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

We can see the result of applying the stemming algorithm and removing the English stop words.

In [15]:
from nltk.corpus import stopwords

stop = stopwords.words('english')
[w for w in tokenizer_porter('a runner likes running and runs a lot') if w not in stop]

['runner', 'like', 'run', 'run', 'lot']

## Document classification
We use a [logistic regression](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression) model to classify the reviews as positive or negative. We have cleaned the reviews by removing markup, punctuation and special characters. Now we divide the dataset into a train set of 25k reviews and a test set of the same size, as in the original dataset.

In [16]:
# .loc slices by label and includes the end label, so the train set stops at 24999
X_train = df.loc[:24999, 'review'].values
y_train = df.loc[:24999, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values

We define a grid of hyperparameter values and use the class [GridSearchCV](https://scikit-learn.org/stable/modules/grid_search.html#grid-search) from Scikit-Learn to find the combination that gives the best cross-validated accuracy.

In [18]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

tfidf = TfidfVectorizer(strip_accents=None, lowercase=False, preprocessor=None)

"""
param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf':[False],
               'vect__norm':[None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              ]
"""

small_param_grid = [{'vect__ngram_range': [(1, 1)],
                     'vect__stop_words': [None],
                     'vect__tokenizer': [tokenizer, tokenizer_porter],
                     'clf__penalty': ['l2'],
                     'clf__C': [1.0, 10.0]},
                    {'vect__ngram_range': [(1, 1)],
                     'vect__stop_words': [stop, None],
                     'vect__tokenizer': [tokenizer],
                     'vect__use_idf': [False],
                     'vect__norm': [None],
                     'clf__penalty': ['l2'],
                     'clf__C': [1.0, 10.0]},
                    ]

lr_tfidf = Pipeline([('vect', tfidf), ('clf', LogisticRegression(solver='liblinear'))])

gs_lr_tfidf = GridSearchCV(lr_tfidf, small_param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=-1)

In [19]:
gs_lr_tfidf.fit(X_train, y_train)

Fitting 5 folds for each of 8 candidates, totalling 40 fits
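
The number of candidates reported above follows from the size of the grid: each dictionary contributes the product of the lengths of its lists. The sketch below redoes the arithmetic. The name `grid_shape` is hypothetical, and strings stand in for the tokenizer functions and the stop-word list, since only the number of options matters here.

```python
from math import prod

# Same shape as small_param_grid, with placeholders for non-literal values.
grid_shape = [
    {'vect__ngram_range': [(1, 1)],
     'vect__stop_words': [None],
     'vect__tokenizer': ['tokenizer', 'tokenizer_porter'],
     'clf__penalty': ['l2'],
     'clf__C': [1.0, 10.0]},
    {'vect__ngram_range': [(1, 1)],
     'vect__stop_words': ['stop', None],
     'vect__tokenizer': ['tokenizer'],
     'vect__use_idf': [False],
     'vect__norm': [None],
     'clf__penalty': ['l2'],
     'clf__C': [1.0, 10.0]},
]

# Each dictionary is expanded into every combination of its options.
per_grid = [prod(len(options) for options in grid.values()) for grid in grid_shape]
candidates = sum(per_grid)
folds = 5
print(per_grid, candidates, candidates * folds)  # [4, 4] 8 40
```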


After the search is completed we have the best combination of hyperparameters.

In [20]:
print(f'Best parameter set: {gs_lr_tfidf.best_params_}')
print(f'CV Accuracy: {gs_lr_tfidf.best_score_:.3f}')

Best parameter set: {'clf__C': 10.0, 'clf__penalty': 'l2', 'vect__ngram_range': (1, 1), 'vect__stop_words': None, 'vect__tokenizer': <function tokenizer at 0x000001DAA19B00D0>}
CV Accuracy: 0.897


We test the logistic regression model, using the best combination of hyperparameters, on the test set. The model classifies a review correctly about 90% of the time.

In [21]:
clf = gs_lr_tfidf.best_estimator_
print(f'Test Accuracy: {clf.score(X_test, y_test):.3f}')

Test Accuracy: 0.899


## Online algorithm
It takes 10 minutes to process the 25k training reviews on a laptop. For larger datasets, when we do not have access to a cluster of computers to parallelize the computation, one approach is to fit the model incrementally on small batches of documents. We clean and tokenize the data using a function similar to the preprocessor function we have used before.

In [23]:
def tokenizer(text):
    text = re.sub(r'<[^>]*>', '', text)  # remove HTML markup
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized

We define an iterator function to return one review at a time from the CSV file that we have previously created using the original dataset.

In [28]:
def stream_docs(path):
    with open(path, 'r', encoding='utf-8') as csv:
        next(csv)  # skip header
        for line in csv:
            text, label = line[:-3], int(line[-2])
            yield text, label
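
The slicing in `stream_docs` relies on every line ending with a comma, a one-digit label and a newline. The sketch below applies the same slicing to a made-up line. The helper `split_line` and the sample text are hypothetical, and the approach assumes that each review occupies exactly one line of the file.

```python
def split_line(line):
    """Apply the slicing used in stream_docs to a single CSV line."""
    # The last three characters are ',', the label digit and the newline.
    return line[:-3], int(line[-2])

line = '"A fine movie, really.",1\n'
text, label = split_line(line)
print(repr(text), label)  # '"A fine movie, really."' 1
```

The text is returned as stored in the file, so the CSV quoting (the surrounding and doubled double quotes) is still present.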

We test the iterator on the CSV file by asking for the first review.

In [27]:
next(stream_docs(path=basepath + '/movie_data.csv'))

('"In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70\'s, they discover the criminal and a net of power and money to cover the murder.<br /><br />""Murder in Greenwich"" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a Kennedy. The powerful and rich f

We define a function that uses the iterator to return a batch of documents.

In [30]:
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y
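
One detail of `get_minibatch` is worth checking on a toy stream: when the stream runs out in the middle of a batch, the function returns `(None, None)` and the documents already read for that batch are dropped. The five-document stream below is hypothetical and serves only to exercise the function.

```python
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y

# Hypothetical stream of five (text, label) pairs.
toy_stream = iter([('good', 1), ('bad', 0), ('great', 1), ('awful', 0), ('fine', 1)])

first = get_minibatch(toy_stream, size=2)
second = get_minibatch(toy_stream, size=2)
third = get_minibatch(toy_stream, size=2)  # only one document is left
print(first)   # (['good', 'bad'], [1, 0])
print(second)  # (['great', 'awful'], [1, 0])
print(third)   # (None, None)
```

With 50,000 reviews and batch sizes that divide that number evenly, no document is lost.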

Since we assume that the dataset does not fit in memory, we cannot use the CountVectorizer class to map a review to a vector, because it needs to see all the documents to build its dictionary. Scikit-Learn provides the [HashingVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html) that maps a set of documents to a matrix of word occurrences without the need to keep a dictionary in memory. We set the number of features used to store the occurrences of each word to a large number, e.g. $2^{21} = 2097152$, to reduce collisions, that is, different words that are mapped to the same hash value.
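
The idea behind the hashing trick can be sketched in plain Python. This is not the HashingVectorizer implementation: the function `hashed_counts` is hypothetical and uses MD5 from the standard library as a stand-in for the MurmurHash3 function used by Scikit-Learn, so the column indices differ from the real ones.

```python
import hashlib
from collections import Counter

def hashed_counts(tokens, n_features=2**21):
    """Count tokens by hashed column index, with no vocabulary kept in memory."""
    counts = Counter()
    for token in tokens:
        digest = hashlib.md5(token.encode('utf-8')).digest()
        # Reduce the hash to a column index in [0, n_features).
        index = int.from_bytes(digest[:8], 'big') % n_features
        counts[index] += 1
    return counts

counts = hashed_counts('a great great movie'.split())
great_index = next(iter(hashed_counts(['great'])))
print(sum(counts.values()))  # 4 tokens counted in total
print(counts[great_index])   # occurrences stored in the column of 'great'

# With only 4 columns, five distinct words must share at least one column.
small = hashed_counts(['one', 'two', 'three', 'four', 'five'], n_features=4)
print(len(small), max(small.values()))
```

The last call shows why a large number of features is preferable: the fewer the columns, the more often different words end up in the same one.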

In [33]:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vect = HashingVectorizer(decode_error='ignore', 
                         n_features=2**21,
                         preprocessor=None, 
                         tokenizer=tokenizer)

clf = SGDClassifier(loss='log_loss', random_state=1)  # logistic loss, spelled 'log' in scikit-learn < 1.1
doc_stream = stream_docs(path=basepath + '/movie_data.csv')

Now we can start the incremental learning process using batches of 1000 documents in 45 loops.

In [34]:
classes = np.array([0, 1])
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)
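
To make the idea of incremental fitting concrete, here is a minimal pure-Python sketch of logistic regression trained one mini-batch at a time with stochastic gradient descent. It is not what SGDClassifier does internally, since it has no regularization and uses a fixed learning rate. The class `TinyOnlineLogReg`, the function `featurize` and the toy batches are all hypothetical.

```python
import math

def featurize(text):
    """Bag-of-words features as a dict term -> count."""
    feats = {}
    for token in text.split():
        feats[token] = feats.get(token, 0) + 1
    return feats

class TinyOnlineLogReg:
    """Hypothetical stand-in for a classifier that supports partial_fit."""
    def __init__(self, lr=0.5):
        self.lr = lr
        self.w = {}    # one weight per feature seen so far
        self.b = 0.0   # bias term

    def predict_proba(self, x):
        z = self.b + sum(self.w.get(f, 0.0) * v for f, v in x.items())
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, X, y):
        # One gradient step of the logistic loss per example in the batch.
        for x, label in zip(X, y):
            error = self.predict_proba(x) - label
            self.b -= self.lr * error
            for f, v in x.items():
                self.w[f] = self.w.get(f, 0.0) - self.lr * error * v
        return self

train_batches = [(['good great', 'bad awful'], [1, 0]),
                 (['great fun', 'awful boring'], [1, 0])]
model = TinyOnlineLogReg()
for docs, labels in train_batches:
    model.partial_fit([featurize(d) for d in docs], labels)

# Held-out documents built from words seen during training.
p_pos = model.predict_proba(featurize('good fun'))
p_neg = model.predict_proba(featurize('bad boring'))
print(p_pos > 0.5, p_neg < 0.5)  # True True
```

The weights are updated after every batch and the batches are never revisited, so the memory needed does not grow with the number of documents.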

At the end of the training process we use the last 5,000 documents, that is the 5 remaining batches, to evaluate the performance of our model.

In [35]:
X_test, y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)
print(f'Accuracy: {clf.score(X_test, y_test):.3f}')

Accuracy: 0.868


The performance of the online process is close to the one we achieved with the data loaded into memory. Now we can use the last 5,000 documents to finalize the training of the model.

In [36]:
clf = clf.partial_fit(X_test, y_test)

## Topic modeling