# Sentiment analysis
In this notebook we will use a machine learning algorithm to infer the sentiment, positive or negative, of a text. We will use the [IMDB](http://ai.stanford.edu/~amaas/data/sentiment/) dataset to train our model. The dataset contains 50k movie reviews. We download the dataset and extract the files into a folder. The dataset contains two subfolders, train/ and test/, each containing 25k reviews split into two subfolders, pos/ and neg/, with 12500 txt files each. Each file contains a short text, the content of the review. The name of the file is built from the review's unique identifier and the score given to the movie. A score of 7 or higher is positive; a score of 4 or lower is negative.
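
As an illustration of the naming convention, the sketch below splits a file name into the review identifier and the score, and derives the label from the two thresholds. It assumes names of the form `<id>_<score>.txt` (for example `127_8.txt`); the helper `parse_review_filename` is hypothetical and is not used in the rest of the notebook.

```python
import os

def parse_review_filename(filename):
    """Split a name such as '127_8.txt' into (review_id, score, label)."""
    stem, _ = os.path.splitext(os.path.basename(filename))
    review_id, score = (int(part) for part in stem.split('_'))
    if score >= 7:
        label = 1      # positive review
    elif score <= 4:
        label = 0      # negative review
    else:
        label = None   # neither positive nor negative under the thresholds above
    return review_id, score, label

print(parse_review_filename('127_8.txt'))
print(parse_review_filename('train/neg/42_3.txt'))
```

In the notebook itself the label is taken from the pos/ or neg/ folder name, so the score in the file name is not needed to build the training table.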

In [7]:
import os
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import warnings
import torch
import torch.nn as nn
import torchvision
warnings.filterwarnings('ignore')
print("NumPy version: %s"%np.__version__)
print("Pandas version: %s"%pd.__version__)
print("PyTorch version: %s"%torch.__version__)

NumPy version: 1.23.1
Pandas version: 1.4.3
PyTorch version: 1.13.0


We copy the reviews and their sentiment into a tabular format so that they are easier to split and shuffle.

In [9]:
basepath = 'data/aclImdb'

labels = {'pos': 1, 'neg': 0}
rows = []  # (review, sentiment) pairs; the dataframe is built once at the end
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = os.path.join(basepath, s, l)
        for file in sorted(os.listdir(path)):
            with open(os.path.join(path, file), 'r', encoding='utf-8') as infile:
                rows.append((infile.read(), labels[l]))

# Building the dataframe from a list avoids one pd.concat call per review
df = pd.DataFrame(rows, columns=['review', 'sentiment'])

In [11]:
df = df.sample(frac=1, random_state=0).reset_index(drop=True)
df.to_csv(basepath + '/movie_data.csv', index=False, encoding='utf-8')

In [16]:
df.shape

(50000, 2)

We saved the shuffled pandas dataframe into a CSV file; here we read it back and look at the first rows.

In [13]:
df = pd.read_csv(os.path.join(basepath, 'movie_data.csv'), encoding='utf-8')
df.head(3)

|   | review | sentiment |
|---|--------|-----------|
| 0 | Election is a Chinese mob movie, or triads in ... | 1 |
| 1 | I was just watching a Forensic Files marathon ... | 0 |
| 2 | Police Story is a stunning series of set piece... | 1 |


## Bag-of-words representation
We want to represent each document by the words that are used in it. The words come from a dictionary built by analyzing all the documents in the dataset. Each document is represented by an array that contains the number of times each word of the dictionary occurs in it, so the length of each array is equal to the length of the dictionary. This representation of a set of documents is called bag-of-words. To create such a representation for the reviews we have to tokenize each review and create an array with the number of occurrences of each word. This process is called vectorization. The resulting representation contains only numbers of occurrences and no words. Scikit-Learn provides the class [CountVectorizer](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction) to do exactly that. In the bag-of-words model the order of the words in a sentence does not matter; the words are treated as independent variables.
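
To make the steps concrete before using the library, here is a minimal hand-written sketch of tokenization and vectorization on two made-up sentences. The names `toy_docs`, `tokenize` and `vectorize` are hypothetical, and the tokenization rule (lowercase, words of at least two characters) is an assumption chosen to resemble the vectorizer's default behavior.

```python
import re
from collections import Counter

toy_docs = ["the movie was good", "the plot was bad, the acting was good"]

def tokenize(text):
    # Lowercase the text and keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# Dictionary: every distinct word in the dataset, indexed in alphabetical order
all_words = sorted({w for d in toy_docs for w in tokenize(d)})
vocabulary = {word: i for i, word in enumerate(all_words)}

def vectorize(text):
    # One count per dictionary word, in dictionary order
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

vectors = [vectorize(d) for d in toy_docs]
print(vocabulary)
print(vectors)
```

Both vectors have seven entries, one per dictionary word, even though the first sentence uses only four of them.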

In [75]:
sample_reviews = np.array([df.iloc[i]['review'].split('.')[0] for i in range(0, 3)])
sample_reviews

array(['Election is a Chinese mob movie, or triads in this case',
       'I was just watching a Forensic Files marathon on Court TV',
       'Police Story is a stunning series of set pieces for Jackie Chan to show his unique talents and bravery'],
      dtype='<U102')

In [76]:
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
docs = np.array(sample_reviews)
bag = count.fit_transform(docs)

We can print the vocabulary built from the sample of documents with the index of each word. The index is assigned in alphabetical order.

In [77]:
print(count.vocabulary_)

{'election': 6, 'is': 12, 'chinese': 4, 'mob': 16, 'movie': 17, 'or': 20, 'triads': 31, 'in': 11, 'this': 29, 'case': 2, 'was': 34, 'just': 14, 'watching': 35, 'forensic': 9, 'files': 7, 'marathon': 15, 'on': 19, 'court': 5, 'tv': 32, 'police': 22, 'story': 26, 'stunning': 27, 'series': 23, 'of': 18, 'set': 24, 'pieces': 21, 'for': 8, 'jackie': 13, 'chan': 3, 'to': 30, 'show': 25, 'his': 10, 'unique': 33, 'talents': 28, 'and': 0, 'bravery': 1}


The length of the vocabulary is the length of the array that represents each document. Documents with different meanings that use exactly the same words, the same number of times, are represented by the same vector.
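
A small sketch of this loss of word order, using two made-up sentences and plain word counts instead of the vectorizer:

```python
from collections import Counter

a = "the dog bites the man"
b = "the man bites the dog"

# Counting words discards their order, so both sentences get the same bag
bag_a = Counter(a.split())
bag_b = Counter(b.split())
print(bag_a == bag_b)
```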

In [78]:
print(len(count.vocabulary_))

36


We can print the bag-of-words representation of the sample documents. Each row returned by the vectorizer is the index entry of one document, that is, a map that links the document to the words found in it together with their number of occurrences. Each value in the row is the frequency of a term in the document, or term frequency *tf(t, d)*, where t is the term and d the document. Most of the time we are interested in the inverted index, a map that links a word to the documents that contain it.
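
The difference between the two maps can be sketched in a few lines. The example below uses two shortened, lowercased sentences and splits on whitespace, which is a simplification of what the vectorizer does; `forward_index` and `inverted_index` are hypothetical names.

```python
from collections import defaultdict

toy_docs = ["election is a chinese mob movie", "police story is a stunning series"]

# Forward index: document id -> {term: term frequency}
forward_index = {}
for doc_id, text in enumerate(toy_docs):
    counts = {}
    for term in text.split():
        counts[term] = counts.get(term, 0) + 1
    forward_index[doc_id] = counts

# Inverted index: term -> ids of the documents that contain it
inverted_index = defaultdict(list)
for doc_id, counts in forward_index.items():
    for term in counts:
        inverted_index[term].append(doc_id)

print(inverted_index["is"])
print(inverted_index["police"])
```

Here `is` maps to both documents while `police` maps only to the second one.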

In [79]:
print(bag.toarray())

[[0 0 1 0 1 0 1 0 0 0 0 1 1 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0]
 [0 0 0 0 0 1 0 1 0 1 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1]
 [1 1 0 1 0 0 0 0 1 0 1 0 1 1 0 0 0 0 1 0 0 1 1 1 1 1 1 1 1 0 1 0 0 1 0 0]]


## Word relevancy
Not all words have the same relevance when we want to classify or rank a text. The least used words are those that provide the most information about a document. One way to measure the relevance of a word is the *term frequency-inverse document frequency*, or *tf_idf(t, d)*. The inverse document frequency of a term t, or *idf(t)*, is defined as

$$idf(t) = \log \frac{n}{1 + df(t)}$$

where n is the total number of documents and *df(t)* is the number of documents that contain the term t at least once. The term frequency of a term t, or *tf(t, d)*, is the number of occurrences of t in the document d and corresponds to the counts returned by the vectorizer. The *tf_idf(t, d)* of a term t in a document d, that is its relevance, is defined as

$$tf\_idf(t, d) = tf(t, d) \times idf(t)$$
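
As a worked example, the sketch below applies the idf definition above, and the product of tf(t, d) and idf(t), to a made-up corpus of three tokenized documents. The names `corpus`, `doc_freq`, `idf` and `tf_idf` are hypothetical.

```python
import math

# Toy corpus of n = 3 tokenized documents (made-up data)
corpus = [
    ["election", "is", "a", "mob", "movie"],
    ["watching", "a", "marathon", "on", "tv"],
    ["police", "story", "is", "a", "series"],
]
n = len(corpus)

def doc_freq(term):
    # Number of documents that contain the term at least once
    return sum(term in doc for doc in corpus)

def idf(term):
    return math.log(n / (1 + doc_freq(term)))

def tf_idf(term, doc):
    # Term frequency in the document times the inverse document frequency
    return doc.count(term) * idf(term)

for term in ("election", "is", "a"):
    print(term, doc_freq(term), round(tf_idf(term, corpus[0]), 3))
```

In the first document `election`, found in one document only, gets a weight of about 0.405, `is`, found in two documents, gets 0, and `a`, found in all three, gets a negative weight of about -0.288 under this definition.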

Scikit-Learn provides a class [TfidfTransformer](https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting) that implements the tf_idf.
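
The weights computed by the transformer do not follow the idf definition given earlier literally. As far as we understand the Scikit-Learn documentation, with `smooth_idf=True` the idf is computed as ln((1 + n) / (1 + df(t))) + 1, and with `norm='l2'` each document vector is then divided by its Euclidean length. The sketch below applies that variant by hand to the first sample review, which has ten distinct terms, each occurring once, of which only `is` also appears in another review. The variable names are hypothetical.

```python
import math

n = 3                         # number of sample reviews
# Document frequency of the ten terms of the first review:
# nine occur only there, 'is' also occurs in the third review
dfs = [1] * 9 + [2]
tfs = [1] * 10                # every term occurs once in the first review

# Smoothed idf (assumed to correspond to smooth_idf=True)
idfs = [math.log((1 + n) / (1 + d)) + 1 for d in dfs]
raw = [tf * idf_t for tf, idf_t in zip(tfs, idfs)]

# L2 normalization (norm='l2'): divide by the Euclidean length of the vector
length = math.sqrt(sum(w * w for w in raw))
weights = [w / length for w in raw]

print(round(weights[0], 8), round(weights[-1], 8))
```

The two values come out at roughly 0.323 for the terms unique to the review and 0.246 for `is`, which can be compared with the first row of the matrix printed by the next cell.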

In [86]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)
relevance = tfidf.fit_transform(count.fit_transform(docs)).toarray()
print(relevance)

[[0.         0.         0.32311233 0.         0.32311233 0.
  0.32311233 0.         0.         0.         0.         0.32311233
  0.24573525 0.         0.         0.         0.32311233 0.32311233
  0.         0.         0.32311233 0.         0.         0.
  0.         0.         0.         0.         0.         0.32311233
  0.         0.32311233 0.         0.         0.         0.        ]
 [0.         0.         0.         0.         0.         0.33333333
  0.         0.33333333 0.         0.33333333 0.         0.
  0.         0.         0.33333333 0.33333333 0.         0.
  0.         0.33333333 0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.33333333 0.         0.33333333 0.33333333]
 [0.23851206 0.23851206 0.         0.23851206 0.         0.
  0.         0.         0.23851206 0.         0.23851206 0.
  0.18139457 0.23851206 0.         0.         0.         0.
  0.23851206 0.         0.         0.23851206 0.23