# Deep Learning for Sentiment Analysis

This Jupyter Notebook illustrates sentiment analysis with deep learning on the IMDB dataset from Kaggle (https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews), which contains 50,000 movie reviews. To run the notebook, the required packages such as `torch` and `tensorflow` have to be installed first. Furthermore, the word embeddings have to be downloaded and saved into the appropriate folders. More detailed instructions can be found in the respective chapters.

The original author of this notebook is Tien Tran (https://www.kaggle.com/tientd95), who published his work at https://www.kaggle.com/tientd95/deep-learning-for-sentiment-analysis.

<a class="anchor" id="0.1"></a>

# **Table of Contents**


1.  [Processing Dataset](#1)
2.  [Pretrained Word Embedding](#2)
3.  [Building Model Pipeline](#3)
4.  [Training Model with fastText Embedding](#4)
5.  [Training Model with GloVe Embedding](#5)
6.  [Model Evaluation](#6)
7.  [Evaluation of FastText](#7)
8.  [Interact with User's Input Review](#8)


In [1]:
import numpy as np
import pandas as pd
import io
from tqdm import tqdm
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer 
import os, re, csv, math, codecs
from sklearn import model_selection
from sklearn import metrics
import torch
import torch.nn as nn
import tensorflow as tf  # PyTorch is used for the model, TensorFlow only for the tokenizer and padding

from utils import train_test_split, evaluate

torch.manual_seed(42)
np.random.seed(42)

# **1. Processing Dataset** <a class="anchor" id="1"></a>

The reviews are stored in the file 'IMDB Dataset.csv' (approx. 66.21 MB) in the directory 'data'. The file has to be placed there first and can be downloaded from https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews.

In [2]:
df = pd.read_csv('./data/IMDB Dataset.csv')
df.head(3)

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive


In [3]:
# Convert the sentiment column to numerical values (positive -> 1, negative -> 0)
df.sentiment = df.sentiment.apply(lambda x: 1 if x=='positive' else 0)

X_train, X_test = train_test_split(df)

# Shuffle the rows of the training data
X_train = X_train.sample(frac=1, random_state=42).reset_index(drop=True)
# Get the labels
y_train = X_train.sentiment.values

X_train.head(3)

Unnamed: 0,review,sentiment
0,The story is similar to ET: an extraterrestria...,0
1,"To many people, Beat Street has inspired their...",1
2,Extremely entertaining mid-1950's western that...,1


Now we check the distribution of the sentiments in both sets.

In [4]:
np.unique(X_train.sentiment.values, return_counts=True)

(array([0, 1]), array([20007, 19993]))

In [5]:
np.unique(X_test.sentiment.values, return_counts=True)

(array([0, 1]), array([4993, 5007]))

# **2. Pretrained Word Embedding** <a class="anchor" id="2"></a>

**fastText** is a word embedding developed by Facebook and released in 2016.
fastText improves on Word2Vec by taking word parts (character n-grams) into account, which enables training embeddings on smaller datasets and generalizing to unknown words.
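
To make the idea of "word parts" concrete, the following minimal sketch extracts character n-grams in the way the fastText authors describe them: the word is wrapped in the boundary markers `<` and `>` and all substrings of length 3 to 6 are collected. The function name `char_ngrams` is hypothetical and this is not code from the fastText library; it only illustrates why an unseen word still shares many pieces with known words.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the set of character n-grams of `word`, including boundary markers."""
    wrapped = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for start in range(len(wrapped) - n + 1):
            grams.add(wrapped[start:start + n])
    return grams


known = char_ngrams("entertaining")
unseen = char_ngrams("entertainingly")
shared = known & unseen

print(sorted(shared)[:5])
print(f"{len(shared)} of {len(unseen)} n-grams of the unseen word are shared")
```

Note that this notebook only looks up whole-word vectors from the `.vec` file, so a word that is missing from the file simply receives a zero vector in the embedding matrix.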

The full version of fastText can be found here : https://fasttext.cc/docs/en/english-vectors.html 

Due to memory constraints (the full version takes about 13 GB of RAM after loading), we use a reduced version of fastText. It can be downloaded from https://fasttext.cc/docs/en/pretrained-vectors.html under 'Simple English: bin+text, text' or directly from https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.simple.zip.

The embeddings are all stored in the 'embeddings' directory. In this folder there is a subfolder for each embedding that consists of more than one file. The fastText embedding, for example, should be found under './embeddings/wiki.simple/wiki.simple.vec'.

The vectors of this embedding have a length of 300. Each line of the file holds 301 space-separated values: the first value is the word in the vocabulary, and the following 300 values are the components of its embedding vector. We therefore read the embedding into a dictionary `fasttext_embedding` with the word as key and the embedding vector as value.
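
The line layout described above can be illustrated with a small, self-contained parser. This is only a sketch: the helper `parse_vec_line` is hypothetical, the toy lines use 3 dimensions instead of 300, and the assumption that a `.vec` file typically starts with a `<vocab_size> <dim>` header line should be checked against the downloaded file.

```python
def parse_vec_line(line, dim=300):
    """Split one line of a .vec text file into (word, vector).

    Returns None for lines that do not have exactly `dim` numbers after the
    word, such as a leading "<vocab_size> <dim>" header line.
    """
    parts = line.rstrip().split(" ")
    if len(parts) != dim + 1:
        return None
    word, numbers = parts[0], parts[1:]
    return word, [float(x) for x in numbers]


# Toy lines with dim=3 instead of 300 to keep the example readable.
print(parse_vec_line("hello 0.1 -0.2 0.3", dim=3))
print(parse_vec_line("111051 300", dim=3))
```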

In [6]:
# Load the fastText embeddings into a dictionary {word: vector}
print('loading word embeddings...')
fasttext_embedding = {}
with codecs.open('./embeddings/wiki.simple/wiki.simple.vec', encoding='utf-8') as f:
    for line in tqdm(f):
        values = line.rstrip().split(' ')
        # Skip lines without exactly 300 coefficients, e.g. a "<count> <dim>" header line
        if len(values) != 301:
            continue
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        fasttext_embedding[word] = coefs

# Check the dimensions of fastText
fasttext_embedding['hello'].shape

loading word embeddings...


111052it [00:08, 13781.14it/s]


(300,)

**GloVe** (Global Vectors) is a word embedding that was developed as an open-source project at Stanford and launched in 2014. The model is trained in an unsupervised manner to obtain vector representations for words. It uses co-occurrence statistics from a corpus to map words into a semantically meaningful vector space.
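
As a rough illustration of what "co-occurrence statistics" means, the sketch below counts how often two words appear close to each other in a tiny toy corpus. It is a simplified stand-in, not the GloVe training code: the function `cooccurrence_counts`, the window size and the corpus are assumptions made for this example, and the actual model is fitted on such counts rather than being the counts themselves.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count how often two words appear within `window` tokens of each other."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            # Look only to the right and store both orders to keep the counts symmetric.
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                counts[(word, tokens[j])] += 1
                counts[(tokens[j], word)] += 1
    return counts


corpus = [
    "the movie was great",
    "the movie was boring",
    "a great movie",
]
counts = cooccurrence_counts(corpus)

print(counts[("movie", "was")])
print(counts[("great", "movie")])
```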

GloVe ('glove.6B.100d.txt') can be downloaded from https://www.kaggle.com/danielwillgeorge/glove6b100dtxt.

The vectors of this embedding have a length of 100 and can be extracted very easily using a dict comprehension.

In [7]:
# Load GloVe embedding.
glove = pd.read_csv('./embeddings/glove/glove.6B.100d.txt', sep=" ", quoting=3, header=None, index_col=0)
glove_embedding = {key: val.values for key, val in glove.T.items()}

# Check the dimensions of GloVe
glove_embedding['hello'].shape

(100,)

# **3. Building Model Pipeline** <a class="anchor" id="3"></a>

### Dataset class

First we need to create a Dataset class that takes the padded token-index sequences and the labels as NumPy arrays and returns them as torch tensors.

In [8]:
class IMDBDataset:
    def __init__(self, reviews, targets):
        """
        Arguments:
        reviews: a 2D numpy array of padded token indices, one row per review
        targets: a 1D numpy array with the labels
        
        Each item is returned as a dictionary of torch tensors (see __getitem__)
        """
        self.reviews = reviews
        self.target = targets
    
    def __len__(self):
        # return length of dataset
        return len(self.reviews)
    
    def __getitem__(self, index):
        # given an index, return the review and the target at that index as torch tensors
        review = torch.tensor(self.reviews[index,:], dtype = torch.long)
        target = torch.tensor(self.target[index], dtype = torch.float)
        
        return {'review': review,
                'target': target}

**Now we move on to build a model class. Before that, there are a few things to remember:**

* The model is given an embedding matrix in which each row is the embedding vector of one word.
* The number of rows of the embedding matrix is the number of words in the entire dataset plus one, because index 0 is reserved for padding.
* The embedding dimension is the number of columns of the matrix and equals the dimension of the pretrained embedding (fastText, GloVe, ... if we use a pretrained embedding).
* Pretrained embeddings come in several versions with different dimensions, so we should check the dimension before configuring the model.
* When we use a pretrained embedding (as in this kernel), we do not compute gradients for the embedding (`requires_grad = False`).
* When we train the embedding from scratch, we treat the embedding matrix as a weight parameter and train it (`requires_grad = True`).


In [9]:
class LSTM(nn.Module):
    def __init__(self, embedding_matrix, flatten=False):
        """
        Given embedding_matrix: numpy array with vector for all words
        return prediction ( in torch tensor format)
        """
        super(LSTM, self).__init__()
        self.flatten = flatten
        # Number of words = number of rows in embedding matrix
        num_words = embedding_matrix.shape[0]
        # Dimension of embedding is num of columns in the matrix
        embedding_dim = embedding_matrix.shape[1]
        # Define an input embedding layer
        self.embedding = nn.Embedding(
                                      num_embeddings=num_words,
                                      embedding_dim=embedding_dim)
        # Use the pretrained embedding matrix as the weight of the embedding layer
        self.embedding.weight = nn.Parameter(torch.tensor(embedding_matrix, dtype = torch.float32))
        # Because we use pretrained embeddings (GloVe, fastText, etc.), we turn off requires_grad so that the embedding weights are not updated
        self.embedding.weight.requires_grad = False
        # LSTM with hidden_size = 128
        self.lstm = nn.LSTM(
                            embedding_dim, 
                            128,
                            bidirectional=True,
                            batch_first=True,
                             )
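        if flatten:
            # Flattened LSTM output: MAX_LEN (128) time steps * 256 features (2 directions * hidden_size 128)
            self.out = nn.Linear(128*256, 1)
        else:
            # Input size 512: the bidirectional LSTM yields hidden_size*2 = 256 features,
            # and mean pooling and max pooling are concatenated (256*2 = 512), see forward()
            self.out = nn.Linear(512, 1)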
        
    def forward(self, x):
        # pass input (tokens) through embedding layer
        x = self.embedding(x)
        # feed the embeddings to the LSTM; `hidden` holds the output of every time step
        hidden, _ = self.lstm(x)
        if self.flatten:
            flattened = torch.flatten(hidden, start_dim=1)
            out = self.out(flattened)
        else:
            # apply mean and max pooling over the time axis of the LSTM output
            avg_pool= torch.mean(hidden, 1)
            max_pool, index_max_pool = torch.max(hidden, 1)
            # concatenate avg_pool and max_pool (256 features each because the LSTM is bidirectional ==> 512)
            out = torch.cat((avg_pool, max_pool), 1)
            # pass the pooled vector through the linear layer to reduce it to a single output
            out = self.out(out)
        
        return out

The input of a forward pass is a matrix $X \in \mathbb{N}_0^{m \times n}$, where $m$ is the number of reviews in the batch and $n$ is the number of tokens per review (`MAX_LEN`). $X$ contains, for every word of a review, its index in the embedding matrix. The reviews are brought to a fixed length using sequence padding, so that longer reviews are truncated to `MAX_LEN` tokens and shorter ones are padded with a special token (index 0). This greatly simplifies batching. Note that `torch.nn.LSTM` does not detect padding on its own: the padded positions are processed like any other token, here as all-zero embedding vectors. Since `bidirectional` is True, the sequences are also passed through the LSTM in reverse order. For each review the LSTM therefore returns `MAX_LEN` hidden states per direction, which are concatenated to 256 features per time step. Together, these hidden states should contain all the information needed to perform the sentiment analysis. Because they form a matrix for a single sample, we need to reduce its dimension. For this purpose we try two approaches:

- The first approach applies average and max pooling to the hidden states of the LSTM. This leaves a vector representation of the review, which a linear layer reduces to one dimension for the sentiment prediction. This was the approach of the original author of this notebook.
- The second approach flattens the hidden states in order to feed them to a wider linear layer.
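
The input sizes of the final linear layer in both variants follow directly from the shapes involved. The short calculation below spells this out using the values from this notebook (`MAX_LEN = 128`, hidden size 128, bidirectional LSTM); the variable names are chosen for this sketch only.

```python
MAX_LEN = 128        # padded review length used in this notebook
HIDDEN_SIZE = 128    # hidden size of the LSTM
NUM_DIRECTIONS = 2   # bidirectional=True

# Per time step the LSTM emits one vector per direction, concatenated.
features_per_step = HIDDEN_SIZE * NUM_DIRECTIONS

# Pooling: mean and max over the time axis, then concatenated.
pooled_features = 2 * features_per_step

# Flattening: every time step keeps its own features.
flattened_features = MAX_LEN * features_per_step

# A linear layer with one output has one weight per input plus one bias.
pooled_head_params = pooled_features + 1
flattened_head_params = flattened_features + 1

print(pooled_features, flattened_features)
print(pooled_head_params, flattened_head_params)
```

The flattened head thus has 64 times as many input features as the pooled head, which matters when comparing how easily the two variants can fit the training data.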

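The last linear layer in both versions has a single neuron as output. Since the models are trained with `BCEWithLogitsLoss`, this output is a raw logit and not a probability. The notebook applies a threshold of 0.5 directly to this value to split between a positive and a negative review; on the probability scale this corresponds to a threshold of about 0.62, so a review needs a somewhat stronger signal to be classified as positive.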
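
Because the model is trained with `BCEWithLogitsLoss`, its output is a raw logit, and a threshold on the logit is not the same as a threshold on the probability. The sketch below converts between the two scales with plain Python; the helper names `sigmoid` and `logit` are defined here for illustration and are not part of the notebook.

```python
import math


def sigmoid(z):
    """Logistic function that maps a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Inverse of the sigmoid: maps a probability to a logit."""
    return math.log(p / (1.0 - p))


# Probability that corresponds to a threshold of 0.5 on the raw model output.
print(round(sigmoid(0.5), 4))
# Logit that corresponds to a threshold of 0.5 on the probability.
print(logit(0.5))
```

A symmetric decision rule would therefore either threshold the raw output at 0 or apply a sigmoid first and threshold at 0.5.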

Now that the model class is built, we move on to create the `train` and `predict` functions.

In [10]:
def train(data_loader, model, optimizer, device):
    """
    Train the model for one epoch.
    data_loader: torch DataLoader that serves the dataset in batches
    model: the LSTM model
    optimizer: torch optimizer, e.g. Adam
    device: cuda or cpu
    """
    # set model to training mode
    model.train()
    # go through batches of data in data loader
    for data in data_loader:
        reviews = data['review']
        targets = data['target']
        # move the data to device that we want to use
        reviews = reviews.to(device, dtype = torch.long)
        targets = targets.to(device, dtype = torch.float)
        # clear the gradient
        optimizer.zero_grad()
        # make prediction from model
        predictions = model(reviews)
        # calculate the loss
        loss = nn.BCEWithLogitsLoss()(predictions, targets.view(-1,1))
        # backpropagation
        loss.backward()
        # single optimization step
        optimizer.step()

In [11]:
def predict(data_loader, model, device):
    final_predictions = []
    final_targets = []
    model.eval()
    
    with torch.no_grad():
        for data in data_loader:
            reviews = data['review']
            targets = data['target']
            reviews = reviews.to(device, dtype = torch.long)
            targets = targets.to(device, dtype=torch.float)
            # make prediction
            predictions = model(reviews)
            # move prediction and target to cpu
            predictions = predictions.cpu().numpy().tolist()
            targets = data['target'].cpu().numpy().tolist()
            # add predictions to final_prediction
            final_predictions.extend(predictions)
            final_targets.extend(targets)
    return final_predictions, final_targets

## Config

In [12]:
MAX_LEN = 128
TRAIN_BATCH_SIZE = 16
VALID_BATCH_SIZE = 8
EPOCHS = 5

In [13]:
def create_embedding_matrix(word_index, embedding_dict=None, d_model=100):
    """
    Create the embedding matrix as a numpy array.
    :param word_index: a dictionary mapping word -> index
    :param embedding_dict: a dictionary mapping word -> embedding vector
    :param d_model: the dimension of the pretrained word embedding (default 100; set explicitly per embedding below)
    :return: a numpy array with the embedding vectors of all known words; unknown words keep a zero vector
    """
    embedding_matrix = np.zeros((len(word_index) + 1, d_model))
    ## loop over all the words
    for word, index in word_index.items():
        if word in embedding_dict:
            embedding_matrix[index] = embedding_dict[word]
    return embedding_matrix
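
Words that are missing from the pretrained embedding keep an all-zero row in the matrix above, so it can be useful to know how large that share is. The following sketch computes the vocabulary coverage on toy data; the helper `embedding_coverage` and the toy dictionaries are hypothetical and stand in for `tokenizer.word_index` and a loaded embedding dictionary such as `fasttext_embedding`.

```python
def embedding_coverage(word_index, embedding_dict):
    """Return the share of vocabulary words that have a pretrained vector."""
    if not word_index:
        return 0.0
    found = sum(1 for word in word_index if word in embedding_dict)
    return found / len(word_index)


# Toy stand-ins for tokenizer.word_index and a loaded embedding dictionary.
toy_word_index = {"the": 1, "movie": 2, "was": 3, "greeeat": 4}
toy_embedding = {"the": [0.1, 0.2], "movie": [0.3, 0.4], "was": [0.5, 0.6]}

print(embedding_coverage(toy_word_index, toy_embedding))
```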

Now it is time to run the model.
The entire workflow consists of the following steps:

==> **Step1**: Create a tokenizer to convert the sentences of the dataset to token indices.
After fitting it, we have a dictionary that contains each word and its index. We use it to create the embedding matrix.

==> **Step2**: Split the dataset into a training and a validation part. In this notebook this was already done in section 1 (`X_train` and `X_test`) instead of using cross-validation.

==> **Step3**: Convert the reviews to token indices and apply `pad_sequences` to ensure that all reviews have the same length (example: a sentence with 10 words yields a longer index vector than a sentence with 2 words; `pad_sequences` pads or truncates both to a fixed length).

We use the tokenizer from `tensorflow` 2 because of its convenient `pad_sequences`.

==> **Step4**: Initialize the dataset class `IMDBDataset`.

==> **Step5**: Load the dataset created in step 4 into a PyTorch DataLoader in order to divide it into batches.

==> **Step6**: Now we have almost all necessary components to start training. We create the model and the optimizer, send the model to the device and start training.
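
Step 3 can be illustrated without TensorFlow. The sketch below re-implements, for a single sequence, what `pad_sequences` does with its default settings `padding='pre'` and `truncating='pre'`. The helper `pad_or_truncate` is hypothetical and only meant to show the behaviour, not to replace the Keras function.

```python
def pad_or_truncate(sequence, maxlen, padding="pre", truncating="pre", value=0):
    """Bring one list of token indices to exactly `maxlen` entries."""
    if len(sequence) > maxlen:
        if truncating == "pre":
            # Drop tokens from the front and keep the last `maxlen` tokens.
            sequence = sequence[len(sequence) - maxlen:]
        else:
            # Drop tokens from the end and keep the first `maxlen` tokens.
            sequence = sequence[:maxlen]
    filler = [value] * (maxlen - len(sequence))
    return filler + sequence if padding == "pre" else sequence + filler


short_review = [12, 7]
long_review = [5, 9, 3, 8, 1, 4, 6]

print(pad_or_truncate(short_review, maxlen=5))
print(pad_or_truncate(long_review, maxlen=5))
```

With these defaults, a review longer than `MAX_LEN` keeps its last `MAX_LEN` tokens, and shorter reviews are padded with zeros at the front.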

# **4. Training Model with fastText Embedding** <a class="anchor" id="4"></a>


The fastText embedding used in this kernel has dimension 300, so we set `d_model=300`.

In [14]:
# STEP 1: Tokenization 
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(df.review.values.tolist())

In [15]:
print('Load fasttext embedding')
embedding_matrix = create_embedding_matrix(tokenizer.word_index, embedding_dict=fasttext_embedding, d_model=300)

# convert the reviews to sequences of token indices
xtrain = tokenizer.texts_to_sequences(X_train.review.values)
xtest = tokenizer.texts_to_sequences(X_test.review.values)

# zero padding
xtrain = tf.keras.preprocessing.sequence.pad_sequences(xtrain, maxlen=MAX_LEN)
xtest = tf.keras.preprocessing.sequence.pad_sequences(xtest, maxlen=MAX_LEN)

# initialize dataset class for training
train_dataset = IMDBDataset(reviews=xtrain, targets=X_train.sentiment.values)

# Load dataset to Pytorch DataLoader
# after we have train_dataset, we create a torch dataloader to load train_dataset class based on specified batch_size
train_data_loader = torch.utils.data.DataLoader(train_dataset, batch_size = TRAIN_BATCH_SIZE, num_workers=2)
# initialize dataset class for validation
valid_dataset = IMDBDataset(reviews=xtest, targets=X_test.sentiment.values)
valid_data_loader = torch.utils.data.DataLoader(valid_dataset, batch_size = VALID_BATCH_SIZE, num_workers=1)

# Running 
device = torch.device('cuda')
# feed embedding matrix to lstm
model_fasttext = LSTM(embedding_matrix, flatten=False)
# set model to cuda device
model_fasttext.to(device)
# initialize Adam optimizer
optimizer = torch.optim.Adam(model_fasttext.parameters(), lr=1e-3)

print('training model')

for epoch in range(EPOCHS):
    #train one epoch
    train(train_data_loader, model_fasttext, optimizer, device)
    #validate
    outputs_fasttext_train, targets_fasttext_train = predict(train_data_loader, model_fasttext, device)
    outputs_fasttext, targets_fasttext = predict(valid_data_loader, model_fasttext, device)
    # threshold
    outputs_fasttext_train = np.array(outputs_fasttext_train) >= 0.5
    outputs_fasttext = np.array(outputs_fasttext) >= 0.5
    # calculate accuracy
    accuracy_train = metrics.accuracy_score(targets_fasttext_train, outputs_fasttext_train)
    accuracy = metrics.accuracy_score(targets_fasttext, outputs_fasttext)
    print(f'epoch: {epoch}, accuracy_train: {accuracy_train},    accuracy_test: {accuracy}')

Load fasttext embedding
training model
epoch: 0, accuracy_train: 0.858125,    accuracy_test: 0.8409
epoch: 1, accuracy_train: 0.89355,    accuracy_test: 0.8577
epoch: 2, accuracy_train: 0.91785,    accuracy_test: 0.8608
epoch: 3, accuracy_train: 0.939175,    accuracy_test: 0.8643
epoch: 4, accuracy_train: 0.937525,    accuracy_test: 0.8543


Now we flatten the output of the hidden state instead of pooling it.

In [16]:
print('Load fasttext embedding')
embedding_matrix = create_embedding_matrix(tokenizer.word_index, embedding_dict=fasttext_embedding, d_model=300)

# convert the reviews to sequences of token indices
xtrain = tokenizer.texts_to_sequences(X_train.review.values)
xtest = tokenizer.texts_to_sequences(X_test.review.values)

# zero padding
xtrain = tf.keras.preprocessing.sequence.pad_sequences(xtrain, maxlen=MAX_LEN)
xtest = tf.keras.preprocessing.sequence.pad_sequences(xtest, maxlen=MAX_LEN)

# initialize dataset class for training
train_dataset = IMDBDataset(reviews=xtrain, targets=X_train.sentiment.values)

# Load dataset to Pytorch DataLoader
# after we have train_dataset, we create a torch dataloader to load train_dataset class based on specified batch_size
train_data_loader = torch.utils.data.DataLoader(train_dataset, batch_size = TRAIN_BATCH_SIZE, num_workers=2)
# initialize dataset class for validation
valid_dataset = IMDBDataset(reviews=xtest, targets=X_test.sentiment.values)
valid_data_loader = torch.utils.data.DataLoader(valid_dataset, batch_size = VALID_BATCH_SIZE, num_workers=1)

# Running 
device = torch.device('cuda')
# feed embedding matrix to lstm
model_fasttext_flatten = LSTM(embedding_matrix, flatten=True)
# set model to cuda device
model_fasttext_flatten.to(device)
# initialize Adam optimizer
optimizer = torch.optim.Adam(model_fasttext_flatten.parameters(), lr=1e-3)

print('training model flatten')

for epoch in range(EPOCHS):
    #train one epoch
    train(train_data_loader, model_fasttext_flatten, optimizer, device)
    #validate
    outputs_fasttext_flatten_train, targets_fasttext_flatten_train = predict(train_data_loader,
                                                                             model_fasttext_flatten, device)
    outputs_fasttext_flatten, targets_fasttext_flatten = predict(valid_data_loader, model_fasttext_flatten, device)
    # threshold
    outputs_fasttext_flatten_train = np.array(outputs_fasttext_flatten_train) >= 0.5
    outputs_fasttext_flatten = np.array(outputs_fasttext_flatten) >= 0.5
    # calculate accuracy
    accuracy_train = metrics.accuracy_score(targets_fasttext_flatten_train, outputs_fasttext_flatten_train)
    accuracy = metrics.accuracy_score(targets_fasttext_flatten, outputs_fasttext_flatten)
    print(f'epoch: {epoch}, accuracy_train: {accuracy_train},    accuracy_test: {accuracy}')

Load fasttext embedding
training model flatten
epoch: 0, accuracy_train: 0.83725,    accuracy_test: 0.8123
epoch: 1, accuracy_train: 0.88265,    accuracy_test: 0.8072
epoch: 2, accuracy_train: 0.95245,    accuracy_test: 0.8121
epoch: 3, accuracy_train: 0.976725,    accuracy_test: 0.8133
epoch: 4, accuracy_train: 0.912525,    accuracy_test: 0.7554


# **5. Training Model with GloVe Embedding** <a class="anchor" id="5"></a>

The GloVe embedding used in this kernel has dimension 100, so we set `d_model=100`.

In [17]:
print('Load Glove embedding')
embedding_matrix = create_embedding_matrix(tokenizer.word_index, embedding_dict=glove_embedding, d_model=100)

# convert the reviews to sequences of token indices
xtrain = tokenizer.texts_to_sequences(X_train.review.values)
xtest = tokenizer.texts_to_sequences(X_test.review.values)

# zero padding
xtrain = tf.keras.preprocessing.sequence.pad_sequences(xtrain, maxlen=MAX_LEN)
xtest = tf.keras.preprocessing.sequence.pad_sequences(xtest, maxlen=MAX_LEN)

# initialize dataset class for training
train_dataset = IMDBDataset(reviews=xtrain, targets=X_train.sentiment.values)

# load dataset to Pytorch DataLoader
# after we have train_dataset, we create a torch dataloader to load train_dataset class based on specified batch_size
train_data_loader = torch.utils.data.DataLoader(train_dataset, batch_size = TRAIN_BATCH_SIZE, num_workers=2)
# initialize dataset class for validation
valid_dataset = IMDBDataset(reviews=xtest, targets=X_test.sentiment.values)
valid_data_loader = torch.utils.data.DataLoader(valid_dataset, batch_size = VALID_BATCH_SIZE, num_workers=1)

# Running 
device = torch.device('cuda')
# feed embedding matrix to lstm
model_glove = LSTM(embedding_matrix, flatten=False)
# set model to cuda device
model_glove.to(device)
# initialize Adam optimizer
optimizer = torch.optim.Adam(model_glove.parameters(), lr=1e-3)

print('training model')

for epoch in range(EPOCHS):
    #train one epoch
    train(train_data_loader, model_glove, optimizer, device)
    #validate
    outputs_glove_train, targets_glove_train = predict(train_data_loader, model_glove, device)
    outputs_glove, targets_glove = predict(valid_data_loader, model_glove, device)
    # threshold
    outputs_glove_train = np.array(outputs_glove_train) >= 0.5
    outputs_glove = np.array(outputs_glove) >= 0.5
    # calculate accuracy
    accuracy_train = metrics.accuracy_score(targets_glove_train, outputs_glove_train)
    accuracy = metrics.accuracy_score(targets_glove, outputs_glove)
    print(f'epoch: {epoch}, accuracy_train: {accuracy_train},    accuracy_test: {accuracy}')

Load Glove embedding
training model
epoch: 0, accuracy_train: 0.8327,    accuracy_test: 0.8257
epoch: 1, accuracy_train: 0.863675,    accuracy_test: 0.8456
epoch: 2, accuracy_train: 0.87955,    accuracy_test: 0.8515
epoch: 3, accuracy_train: 0.89585,    accuracy_test: 0.8538
epoch: 4, accuracy_train: 0.90885,    accuracy_test: 0.8544


Now we flatten the output of the hidden state instead of pooling it.

In [18]:
print('Load Glove embedding')
embedding_matrix = create_embedding_matrix(tokenizer.word_index, embedding_dict=glove_embedding, d_model=100)

# convert the reviews to sequences of token indices
xtrain = tokenizer.texts_to_sequences(X_train.review.values)
xtest = tokenizer.texts_to_sequences(X_test.review.values)

# zero padding
xtrain = tf.keras.preprocessing.sequence.pad_sequences(xtrain, maxlen=MAX_LEN)
xtest = tf.keras.preprocessing.sequence.pad_sequences(xtest, maxlen=MAX_LEN)

# initialize dataset class for training
train_dataset = IMDBDataset(reviews=xtrain, targets=X_train.sentiment.values)

# load dataset to Pytorch DataLoader
# after we have train_dataset, we create a torch dataloader to load train_dataset class based on specified batch_size
train_data_loader = torch.utils.data.DataLoader(train_dataset, batch_size = TRAIN_BATCH_SIZE, num_workers=2)
# initialize dataset class for validation
valid_dataset = IMDBDataset(reviews=xtest, targets=X_test.sentiment.values)
valid_data_loader = torch.utils.data.DataLoader(valid_dataset, batch_size = VALID_BATCH_SIZE, num_workers=1)

# Running 
device = torch.device('cuda')
# feed embedding matrix to lstm
model_glove_flatten = LSTM(embedding_matrix, flatten=True)
# set model to cuda device
model_glove_flatten.to(device)
# initialize Adam optimizer
optimizer = torch.optim.Adam(model_glove_flatten.parameters(), lr=1e-3)

print('training model flatten')

for epoch in range(EPOCHS):
    #train one epoch
    train(train_data_loader, model_glove_flatten, optimizer, device)
    #validate
    outputs_glove_flatten_train, targets_glove_flatten_train = predict(train_data_loader,
                                                                       model_glove_flatten, device)
    outputs_glove_flatten, targets_glove_flatten = predict(valid_data_loader, model_glove_flatten, device)
    # threshold
    outputs_glove_flatten_train = np.array(outputs_glove_flatten_train) >= 0.5
    outputs_glove_flatten = np.array(outputs_glove_flatten) >= 0.5
    # calculate accuracy
    accuracy_train = metrics.accuracy_score(targets_glove_flatten_train, outputs_glove_flatten_train)
    accuracy = metrics.accuracy_score(targets_glove_flatten, outputs_glove_flatten)
    print(f'epoch: {epoch}, accuracy_train: {accuracy_train},    accuracy_test: {accuracy}')

Load Glove embedding
training model flatten
epoch: 0, accuracy_train: 0.828025,    accuracy_test: 0.8085
epoch: 1, accuracy_train: 0.879025,    accuracy_test: 0.8134
epoch: 2, accuracy_train: 0.949075,    accuracy_test: 0.8276
epoch: 3, accuracy_train: 0.967975,    accuracy_test: 0.8268
epoch: 4, accuracy_train: 0.956125,    accuracy_test: 0.8051


# **6. Model Evaluation** <a class="anchor" id="6"></a>

In this section we look at the evaluation of our four models. The models in which the hidden state was flattened instead of pooled have the suffix `_flatten`.

## fastText

**Accuracy**

In [19]:
evaluation_fasttext = evaluate(targets_fasttext, outputs_fasttext)
print(evaluation_fasttext[0])

0.8543


In [20]:
evaluation_fasttext_flatten = evaluate(targets_fasttext_flatten, outputs_fasttext_flatten)
print(evaluation_fasttext_flatten[0])

0.7554


**classification-report**

In [21]:
print(evaluation_fasttext[1])

              precision    recall  f1-score   support

    Negative       0.80      0.94      0.87      4993
    Positive       0.93      0.77      0.84      5007

    accuracy                           0.85     10000
   macro avg       0.86      0.85      0.85     10000
weighted avg       0.87      0.85      0.85     10000



In [22]:
print(evaluation_fasttext_flatten[1])

              precision    recall  f1-score   support

    Negative       0.68      0.95      0.79      4993
    Positive       0.91      0.57      0.70      5007

    accuracy                           0.76     10000
   macro avg       0.80      0.76      0.75     10000
weighted avg       0.80      0.76      0.75     10000



## GloVe

**Accuracy**

In [23]:
evaluation_glove = evaluate(targets_glove, outputs_glove)
print(evaluation_glove[0])

0.8544


In [24]:
evaluation_glove_flatten = evaluate(targets_glove_flatten, outputs_glove_flatten)
print(evaluation_glove_flatten[0])

0.8051


**classification-report**

In [25]:
print(evaluation_glove[1])

              precision    recall  f1-score   support

    Negative       0.79      0.96      0.87      4993
    Positive       0.95      0.75      0.84      5007

    accuracy                           0.85     10000
   macro avg       0.87      0.85      0.85     10000
weighted avg       0.87      0.85      0.85     10000



In [26]:
print(evaluation_glove_flatten[1])

              precision    recall  f1-score   support

    Negative       0.75      0.92      0.83      4993
    Positive       0.90      0.69      0.78      5007

    accuracy                           0.81     10000
   macro avg       0.82      0.81      0.80     10000
weighted avg       0.82      0.81      0.80     10000



## Discussion

The best results with the current setting were an **accuracy** of **0.85** on the test set. This figure is fairly arbitrary, however: with a different seed for shuffling the training set (after splitting), both best models achieved an **accuracy** of **0.88**. The different shuffling evidently led the optimization to a different local minimum.

The results show very clearly that the flattened version of the LSTM's hidden state performs worse than the pooling version. Comparing the accuracy on the training and the test set makes it apparent that the flattened models overfit. We suspect this is because the linear layer of the flattened version has many more weights than that of the pooling version. We therefore think that the pooling version generalizes better because pooling itself has regularizing properties. Regularizing the flattened versions would have been a valid attempt to improve their performance.

Most tests have shown that **FastText** and **GloVe** perform very similarly in accuracy, precision and recall on the test set. The performance therefore does not really depend on the word embedding in this case.

# **7. Evaluation of FastText** <a class="anchor" id="7"></a>

Here we look at a few examples of incorrectly predicted reviews from the **FastText** model. In particular, we are interested in the **FP** and **FN** predictions. First, we need to extract them.

In [27]:
from sklearn.metrics import confusion_matrix

In [28]:
def evaluate_sample(model, sentence):
    """Print the model's prediction for a single review, followed by the review text."""
    model.eval()
    # tokenize and pad the review in the same way as the training data
    sentence_token = tokenizer.texts_to_sequences([sentence])
    sentence_token = tf.keras.preprocessing.sequence.pad_sequences(sentence_token, maxlen = MAX_LEN)
    sentence_train = torch.tensor(sentence_token, dtype = torch.long).to(device)
    # no gradients are needed for inference
    with torch.no_grad():
        prediction = model(sentence_train)
    # threshold on the raw model output, as in the training loops
    if prediction.item() > 0.5:
        print('------> Positive')
    else:
        print('------> Negative')
    print(sentence)

fasttext_evaluation = targets_fasttext - outputs_fasttext[:,0]
fasttext_fp_idxs = np.where(fasttext_evaluation == -1)[0]
fasttext_fn_idxs = np.where(fasttext_evaluation == 1)[0]
confusion_matrix(targets_fasttext, outputs_fasttext)

array([[4693,  300],
       [1157, 3850]])

The relatively high number of **TP** and **TN** predictions was to be expected given the relatively high accuracy of the model and the balance of the dataset. In the current setting, however, the model classifies the negative reviews more accurately than the positive reviews: it makes far more **FN** (1157) than **FP** (300) predictions. The distribution of all four values is quite arbitrary, though, since a different random seed gave a well-balanced **TP** to **TN** and **FN** to **FP** ratio.
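
The four values can be read directly from the confusion matrix printed above and turned into the usual metrics. The following lines are a small worked calculation using those numbers, assuming scikit-learn's layout with true labels in the rows and predictions in the columns, both in the order (negative, positive).

```python
# Confusion matrix from the output above.
tn, fp = 4693, 300
fn, tp = 1157, 3850

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
recall_positive = tp / (tp + fn)      # share of positive reviews that were found
recall_negative = tn / (tn + fp)      # share of negative reviews that were found
precision_positive = tp / (tp + fp)   # share of "positive" predictions that are right

print(f"accuracy:           {accuracy:.4f}")
print(f"recall (positive):  {recall_positive:.2f}")
print(f"recall (negative):  {recall_negative:.2f}")
print(f"precision (pos.):   {precision_positive:.2f}")
```

These values agree with the accuracy and the classification report of the fastText model in section 6.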

Now we look at a few examples of wrong predictions. *Keep in mind that the "positive" or "negative" shown is the prediction of the model, not the label!*

**FP** examples:

In [48]:
evaluate_sample(model_fasttext, X_test.iloc[fasttext_fp_idxs[2]]['review'])

------> Positive
Still a sucker for Pyun's esthetic sense, I liked this movie, though the "unfinished" ending was a let-down. As usual, Pyun develops a warped sense of humour and Kathy Long's fights are extremely impressive. Beautifully photographed, this has the feel it was done for the big screen.


In [40]:
evaluate_sample(model_fasttext, X_test.iloc[fasttext_fp_idxs[3]]['review'])

------> Positive
"Black Vengeance" is an alternate title for "Ying hung ho hon" AKA "Tragic Hero" (1987). I have just seen this on VHS, together with the first part of the story, "Gong woo ching" ("Rich and Famous"), also 1987. (The poster and 2 stills featured on the page are for a 4-DVD set of movies starring Rod Perry (The Black Gestapo), Fred Williamson (Black Cobra 2), Richard Lawson (Black Fist). The fourth movie is called "The Black Six"). Strangely, while the characters retain their original names in "Rich and Famous", in "Black Vengeance" Chow Yun-Fat's character is named Eddie Shaw, Alex Man (Man Tze Leung) is Harry, and Andy Lau is called Johnny. Also confusing is the fact that 1994 is given as the copyright dates on both films. Perhaps that was the year they were American-dubbed. According to the release dates given on IMDb "Tragic Hero" was released before "Rich and Famous". Was there any reason for releasing the sequel first? Despite some users' comments, I enjoyed these 

In [41]:
evaluate_sample(model_fasttext, X_test.iloc[fasttext_fp_idxs[4]]['review'])

------> Positive
I watched this movie when I was a young lad full of raging hormones and it was about as sexy a movie as I had ever seen-or ever was to see. It may not have been a great movie. My guess is it wasn't. I don't really remember much about it, to tell you the truth. I only remember the sexual chemistry between Crosby and Biehn. No woman in ANY movie has ever done it for me as the unbelievably sexy Cathy did in this movie. I haven't seen it since that first time I caught it on TV in the 70s and I don't think I'd want to see it again since I'm sure it would be a disappointment-my hormones aren't as raging and I've become more jaded over the years. Still, when I think back on the shower scene I can still remember how great it felt way back when.<br /><br />Added later: After watching the movie again, I discovered that it's dangerous to go home again. What was once erotic is now pretty tame. The older woman-younger man thing still works for me, just not as much as it once did, p

As the examples above show, it can be quite difficult to classify these reviews correctly because they are not as extremely negative as the usual negative reviews; each of them contains some positive sentences. The 2nd example, for instance, looks to me like an actually positive review that was simply labelled wrongly.

**FN** examples:

In [43]:
evaluate_sample(model_fasttext, X_test.iloc[fasttext_fn_idxs[2]]['review'])

------> Negative
I actually had seen the last parts of this movie when I was a child. Thanks to the search feature of plots I was able to find out the name of it. For years I did not know the name, but the movie stuck in my mind. The ending left hope that the main character would get back to Earth eventually. It was a shame it did not make it to a series. This movie reminds me of Journey to The Far Side Of the Sun. Also known as Doppleganger. If you liked this feature the other one is worth a watch. It was done before The Stranger, but shares a similar plot. Yet different. I just picked up The Stranger off of eBay on VHS. Hope they make a DVD, but it is doubtful unless it comes out on Dollar DVD. A few pilots are making it on the budget DVD's and maybe this one will.


In [44]:
evaluate_sample(model_fasttext, X_test.iloc[fasttext_fn_idxs[3]]['review'])

------> Negative
This is a very dark movie, somewhat better than the average Asylum film. It was a lot better than I thought it would be, is a combination of a psychological thriller and a horror film. <br /><br />The voice on the telephone is really creepy - this voice without a face, this unknown and threatening voice works really well in the film, since we never see the killer face is left to the imagination of the spectator.<br /><br />The action and suspense never decay and after the first half of the film, it becomes vertiginous; there is not much gore in this film, just enough to serve the story and also the director does a good job at holding your attention. <br /><br />I gave this movie a 8/10 because some clichés.


In [45]:
evaluate_sample(model_fasttext, X_test.iloc[fasttext_fn_idxs[4]]['review'])

------> Negative
The good thing about this film is that it stands alone - you don't have to have seen the original. Unfortunately this is also it's biggest drawback. It would have been nice to have included a few of the original characters in the new story and seen how their lives had developed. Sinclair as in the original is excellent and provides the films best comic moments as he attempts to deal with awkward and embarrassing situations but the supporting cast is not as strong as in the original movie. Forsyth is to be congratulated on a brave attempt to move the character on and create an original sequel but the film is ultimately flawed and lacks the warmth of the original


Here again, the reviews are not clearly positive or negative. For the 3rd one, for instance, I cannot really tell whether the writer meant it as a positive or a negative review. In my opinion, the only clearly misclassified review is the 2nd one, and even that is hard to tell without reading its last sentence.
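
One possible way to put a number on "not clearly positive or negative" is the distance between the model output and the decision threshold. The sketch below assumes that the model output is a probability that is thresholded at 0.5. The helper `rank_by_margin` and the toy values are hypothetical and not part of this notebook.

```python
# Toy sketch: rank predictions by how close they are to the decision threshold.
# Assumption: the model output is a probability and 0.5 is the threshold.

def rank_by_margin(probs, threshold=0.5):
    """Return (index, margin) pairs sorted from least to most confident."""
    margins = [(i, abs(p - threshold)) for i, p in enumerate(probs)]
    return sorted(margins, key=lambda pair: pair[1])

# Made-up model outputs for four misclassified reviews.
toy_probs = [0.48, 0.05, 0.35, 0.51]

ranked = rank_by_margin(toy_probs)
for idx, margin in ranked:
    print(f"sample {idx}: margin {margin:.2f}")
```

Samples with a small margin are the ones the model itself is unsure about. A misclassified sample with a large margin is a better candidate for a wrong label or a genuinely misleading review.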

# **8. Interact with User's Input Review** <a class="anchor" id="8"></a>

Now it's time to test the model by entering any review we can think of and seeing how the model reacts.

In [35]:
def Interact_user_input(model):
    '''
    Read reviews from the keyboard and print the predicted sentiment.
    Enter 'q' or 'quit' to stop.

    model: trained model : fasttext model or glove model
    '''
    model.eval()  # inference mode (disables dropout etc.)

    while True:
        try:
            sentence = input('Review: ')
            if sentence in ['q', 'quit']:
                break
            # Convert the review to a padded sequence of word ids
            sentence = np.array([sentence])
            sentence_token = tokenizer.texts_to_sequences(sentence)
            sentence_token = tf.keras.preprocessing.sequence.pad_sequences(sentence_token, maxlen=MAX_LEN)
            sentence_train = torch.tensor(sentence_token, dtype=torch.long).to(device)
            # No gradients are needed for a prediction
            with torch.no_grad():
                predict = model(sentence_train)
            # Outputs above 0.5 count as positive
            if predict.item() > 0.5:
                print('------> Positive')
            else:
                print('------> Negative')
        except KeyError:
            print('please enter again')
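
To make the two preprocessing calls in the function above more tangible, here is a simplified pure-Python sketch of what they do to a review. It assumes the Keras defaults (words missing from the vocabulary are dropped, padding and truncation happen at the front) and ignores the punctuation filtering of the real tokenizer. The names `toy_texts_to_sequences`, `toy_pad_sequences`, `TOY_MAX_LEN` and the tiny vocabulary are made up for illustration.

```python
# Simplified sketch of tokenizing and padding, assuming Keras default behaviour.
# The vocabulary is made up; the fitted `tokenizer` of the notebook has its own.

TOY_MAX_LEN = 6  # illustrative value, not the MAX_LEN of the notebook
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4, "boring": 5}

def toy_texts_to_sequences(texts, word_index):
    """Map each text to word ids, skipping words missing from the vocabulary."""
    sequences = []
    for text in texts:
        words = text.lower().split()
        sequences.append([word_index[w] for w in words if w in word_index])
    return sequences

def toy_pad_sequences(sequences, maxlen, value=0):
    """Left-pad short sequences and keep only the last `maxlen` ids of long ones."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

short = toy_texts_to_sequences(["The movie was absolutely great"], toy_word_index)
short_padded = toy_pad_sequences(short, TOY_MAX_LEN)

long = toy_texts_to_sequences(["the movie was boring the movie was boring"], toy_word_index)
long_padded = toy_pad_sequences(long, TOY_MAX_LEN)

print(short_padded)
print(long_padded)
```

Note that the unknown word "absolutely" simply disappears from the sequence. A review that consists only of unknown words therefore turns into pure padding, and the model still returns a prediction for it.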

## Enter reviews and test the result

In [37]:
Interact_user_input(model_fasttext)