# IMDB Classification - Bag of Words and Embeddings

This tutorial walks through the steps for building a deep learning model for sentiment analysis. We will classify IMDB movie reviews as either positive or negative. It is intended as teaching material for the workshop.

The tutorial draws on content from various sources, including the tutorials at http://www.hvass-labs.org/, for the purpose of teaching in the deep learning class.

The topics addressed in the tutorial:

1. Basic exploration of the IMDB movies dataset
2. Tokenization, text to sequences, padding and truncating
3. Building an NN model using Bag of Words
4. Building an NN model using Embeddings
5. Peeping into Word Embeddings

We will mostly explore how to use Bag of Words and Word Embeddings vector representations of text to build plain vanilla NN models. Future tutorials will explore RNN and LSTM models.

### IMDB Movie Reviews

The dataset is available at https://www.kaggle.com/c/word2vec-nlp-tutorial/data

The labeled data set consists of 50,000 IMDB movie reviews, specially selected for sentiment analysis. The sentiment of the reviews is binary: an IMDB rating < 5 results in a sentiment score of 0, and a rating >= 7 results in a sentiment score of 1. No individual movie has more than 30 reviews.
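
The labelling rule above can be written as a small function. This is only a sketch: the name `rating_to_sentiment` is hypothetical, and returning `None` for ratings 5 and 6 is an assumption based on the rule leaving that range unlabelled.

```python
def rating_to_sentiment(rating):
    """Map an IMDB rating to the binary sentiment label described above."""
    if rating < 5:
        return 0      # negative review
    if rating >= 7:
        return 1      # positive review
    return None       # assumption: ratings 5 and 6 receive no label


for rating in (2, 6, 9):
    print(rating, rating_to_sentiment(rating))
```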

**Data Fields**

- id - Unique ID of each review
- sentiment - Sentiment of the review; 1 for positive reviews and 0 for negative reviews
- review - Text of the review

In [None]:
%load_ext tensorboard

### Loading the dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn

In [None]:
imdb_df = pd.read_csv('labeledTrainData.tsv',
                      sep = '\t')

In [None]:
pd.set_option('display.max_colwidth', 500)
imdb_df.head(5)

### Data Tokenization

The text data needs to be converted into vectors using either a bag of words or an embeddings model. We will first explore the bag of words (BOW) model. In the BOW model, a sentence is represented as a vector whose dimensions are the words (also called tokens).

To create the vectors, we first need to tokenize the sentences and find all unique tokens (words) used across all sentences. The corpus of unique words could be very large, so we limit it to the most popular (frequently used) words. In this example, we will use the 10000 most frequent words.
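
The two ideas above (a vocabulary limited to the most frequent words, and a sentence represented as counts over that vocabulary) can be sketched in plain Python. The function names and the toy reviews are hypothetical, and splitting on whitespace is a simplification of real tokenization.

```python
from collections import Counter


def build_vocabulary(sentences, max_tokens):
    """Keep only the `max_tokens` most frequent words across all sentences."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    return [word for word, _ in counts.most_common(max_tokens)]


def bag_of_words(sentence, vocabulary):
    """Represent a sentence as word counts, one dimension per vocabulary word."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]


# Hypothetical toy corpus, only for illustration
toy_reviews = ["the movie was good", "the movie was bad", "good good movie"]
toy_vocab = build_vocabulary(toy_reviews, max_tokens=4)
toy_vector = bag_of_words("the movie was good good", toy_vocab)

print(toy_vocab)
print(toy_vector)
```

Words outside the vocabulary (here *bad*, the least frequent word) simply do not get a dimension, which is how limiting the vocabulary keeps the vectors small.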

In [None]:
import os
#os.environ["KERAS_BACKEND"] = "tensorflow"

In [None]:
import keras
print(keras.__version__)

### Encode Y Variable

In [None]:
y = np.array(imdb_df.sentiment)

In [None]:
y[0:5]

How many classes are available?

In [None]:
imdb_df.sentiment.unique()

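Reviews differ in length, so each token sequence is padded or truncated to a fixed length. Padding or truncating can be done at the beginning or at the end of the sequence; *pre* and *post* denote the beginning and the end, respectively. Here the fixed length is set through `output_sequence_length` of the `TextVectorization` layer, which pads and truncates at the end of the sequence.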
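
The difference between *pre* and *post* is easiest to see on a short list of token ids. The helper below is a hypothetical plain-Python sketch, not a Keras function; it assumes `length` is positive and uses 0 as the padding value.

```python
def pad_or_truncate(tokens, length, padding="post", truncating="post", pad_value=0):
    """Return a copy of `tokens` with exactly `length` items."""
    if len(tokens) > length:
        # 'post' drops tokens from the end, 'pre' drops them from the beginning
        tokens = tokens[:length] if truncating == "post" else tokens[-length:]
    missing = [pad_value] * (length - len(tokens))
    # 'post' appends the padding, 'pre' puts it in front
    return tokens + missing if padding == "post" else missing + tokens


short_review = [5, 6, 7]
long_review = [1, 2, 3, 4, 5, 6]

print(pad_or_truncate(short_review, 5, padding="post"))
print(pad_or_truncate(short_review, 5, padding="pre"))
print(pad_or_truncate(long_review, 4, truncating="post"))
print(pad_or_truncate(long_review, 4, truncating="pre"))
```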

In [None]:
max_num_tokens = 10000
max_review_length = 500

In [None]:
from keras.layers import TextVectorization

In [None]:
vectorize_layer = TextVectorization(max_tokens = max_num_tokens,
                                    output_mode='int',
                                    output_sequence_length = max_review_length,
                                    standardize='lower_and_strip_punctuation',
                                    split='whitespace')

In [None]:
vectorize_layer.adapt(list(imdb_df.review))

In [None]:
vectorize_layer.get_vocabulary()[0:20]

In [None]:
vectorize_layer(["I like the movie gladiator"])[0][0:50]

In [None]:
vectorize_layer(imdb_df.review[0:1])[0][0:50]
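
To make the integer output above less opaque, here is a rough plain-Python approximation of what the layer does in `int` mode: lowercase, strip punctuation, split on whitespace, look up each word, then pad. The helper names and the toy vocabulary are hypothetical; index 0 for padding and index 1 for out-of-vocabulary words follow the layout of `get_vocabulary()`, whose first two entries are `''` and `'[UNK]'`.

```python
import string


def standardize(text):
    """Lowercase the text and remove ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def text_to_sequence(text, vocabulary, length):
    """Map each word to its vocabulary index, then truncate or pad at the end."""
    index = {word: i for i, word in enumerate(vocabulary)}
    ids = [index.get(word, 1) for word in standardize(text).split()]  # 1 = unknown word
    ids = ids[:length]
    return ids + [0] * (length - len(ids))                             # 0 = padding


# Hypothetical toy vocabulary in the same layout as get_vocabulary()
toy_vocabulary = ["", "[UNK]", "the", "movie", "i", "like"]
sequence = text_to_sequence("I like the movie, Gladiator!", toy_vocabulary, length=8)
print(sequence)
```

The word *gladiator* is not in the toy vocabulary, so it maps to the unknown-word index.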

### Split Datasets

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(imdb_df.review,
                                                    imdb_df.sentiment,
                                                    test_size = 0.2)

In [None]:
X_train.shape

In [None]:
X_test.shape

In [None]:
input_shape = X_train.shape

In [None]:
input_shape

### Using Embeddings

In word embeddings, each word is represented by a vector, i.e. a series of numbers (weights). The vectors represent words in an N-dimensional space, in which words with similar meanings are placed near each other while dissimilar words are kept far apart. The dimensions of the space represent latent factors by which the words can be described. Every word is assigned a weight for each latent factor, and words that share some common meaning have similar weights across the common factors.
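
A toy illustration of this idea, with made-up numbers: the two dimensions below are hypothetical latent factors (roughly "sentiment" and "intensity"), not values learned by any model.

```python
import math

# Hypothetical 2-dimensional embeddings: [sentiment, intensity]
toy_embeddings = {
    "good":     [0.9, 0.1],
    "great":    [1.0, 0.2],
    "bad":      [-0.8, 0.1],
    "terrible": [-1.0, 0.3],
}


def distance(word1, word2):
    """Euclidean distance between the vectors of two words."""
    return math.dist(toy_embeddings[word1], toy_embeddings[word2])


print("good vs great:   ", round(distance("good", "great"), 3))
print("good vs terrible:", round(distance("good", "terrible"), 3))
```

Words with similar weights on the factors end up close together, while words with opposite weights end up far apart.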

The word embedding weights can be estimated while the NN model is being trained. Pre-built word embeddings are also available and can be used in the model. We will discuss the pre-built word embeddings later in the tutorial.

Word embeddings are commonly used in many Natural Language Processing (NLP) tasks because they are useful representations of words and often lead to better performance. Given their widespread use, this section introduces the concept of word embeddings to the prospective NLP practitioner.

Here is a good reference for understanding embeddings:

https://medium.com/huggingface/universal-word-sentence-embeddings-ce48ddc8fc3a

(Token sequences) -> Embeddings (8) -> Flatten -> Dense Layer (16) -> ReLU -> Dense Layer (1) -> Sigmoid

In [None]:
from keras.layers import Embedding
from keras.optimizers import SGD
from keras.models import Sequential
from keras.layers import Flatten, Dense, Activation, Dropout

In [None]:
vectorize_layer = TextVectorization(max_tokens = max_num_tokens,
                                    output_mode='int',
                                    output_sequence_length = max_review_length,
                                    standardize='lower_and_strip_punctuation',
                                    split='whitespace')

In [None]:
vectorize_layer.adapt(list(X_train))

In [None]:
train_ds = vectorize_layer(X_train)

In [None]:
keras.backend.clear_session()  # clear default graph

emb_model = Sequential()
emb_model.add(keras.Input(shape=(max_review_length,)))
# The Input layer above fixes the sequence length at max_review_length,
# so we can later flatten the embedded inputs
emb_model.add(Embedding(max_num_tokens, 8))

# After the Embedding layer,
# our activations have shape `(samples, maxlen, 8)`.

# We flatten the 3D tensor of embeddings
# into a 2D tensor of shape `(samples, maxlen * 8)`
emb_model.add(Flatten())

emb_model.add(Dense(16))
emb_model.add(Activation('relu'))

# We add the classifier on top
emb_model.add(Dense(1))
emb_model.add(Activation('sigmoid'))

In [None]:
emb_model.summary()
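
The parameter counts reported by the summary can be cross-checked by hand. The arithmetic below assumes the architecture defined above (vocabulary of 10000, sequences of length 500, embedding size 8, one hidden layer of 16 units); the variable names are only for illustration.

```python
vocab_size = 10000       # max_num_tokens
sequence_length = 500    # max_review_length
embedding_dim = 8
dense_units = 16

# Embedding: one vector of size embedding_dim per vocabulary entry
embedding_params = vocab_size * embedding_dim

# Flatten has no weights; it only reshapes (500, 8) into 4000 values
flattened_size = sequence_length * embedding_dim

# Dense layers: one weight per input-output pair plus one bias per unit
dense_params = flattened_size * dense_units + dense_units
output_params = dense_units * 1 + 1

total_params = embedding_params + dense_params + output_params
print(embedding_params, dense_params, output_params, total_params)
```

More than half of the weights sit in the embedding table, and most of the rest in the first dense layer, because it connects to every position of the flattened sequence.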

In [None]:
sgd = SGD(learning_rate=0.01, momentum=0.8)
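
As a rough sketch of what momentum does, the update rule documented for Keras `SGD` (`velocity = momentum * velocity - learning_rate * g`, then `w = w + velocity`) can be applied to a single weight. The function name and the toy objective `f(w) = w**2` are hypothetical and chosen only to keep the example small.

```python
def sgd_momentum_step(w, velocity, grad, learning_rate=0.01, momentum=0.8):
    """One SGD-with-momentum update for a single weight."""
    velocity = momentum * velocity - learning_rate * grad
    return w + velocity, velocity


# Minimize the toy objective f(w) = w**2, whose gradient is 2 * w
w, velocity = 5.0, 0.0
for _ in range(500):
    grad = 2 * w
    w, velocity = sgd_momentum_step(w, velocity, grad)

print(w)
```

The velocity accumulates past gradients, so steps in a consistent direction grow larger than plain SGD steps with the same learning rate.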

In [None]:
emb_model.compile(optimizer=sgd,
              loss='binary_crossentropy',
              metrics=['accuracy'])

In [None]:
callbacks_list = [keras.callbacks.ReduceLROnPlateau(monitor='val_loss',
                                    factor=0.1,
                                    patience=2),
                  keras.callbacks.EarlyStopping(monitor='val_loss',
                                patience=6),
                  keras.callbacks.TensorBoard(log_dir="klogs", histogram_freq=1)]
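
To see how `factor=0.1` and `patience=2` interact, the plateau logic can be simulated on a list of validation losses. This is a simplified, hypothetical re-implementation for illustration: it ignores the `min_delta`, `cooldown` and `min_lr` options of the real callback, and the loss values are made up.

```python
def simulate_reduce_lr(val_losses, initial_lr=0.01, factor=0.1, patience=2):
    """Return the learning rate in effect after each epoch."""
    lr, best, wait = initial_lr, float("inf"), 0
    schedule = []
    for loss in val_losses:
        if loss < best:
            best, wait = loss, 0      # improvement: reset the counter
        else:
            wait += 1                 # no improvement this epoch
            if wait >= patience:
                lr *= factor          # plateau detected: shrink the learning rate
                wait = 0
        schedule.append(lr)
    return schedule


# Hypothetical validation losses for six epochs
toy_val_losses = [0.60, 0.50, 0.52, 0.51, 0.53, 0.55]
schedule = simulate_reduce_lr(toy_val_losses)
print(schedule)
```

`EarlyStopping` follows the same counting idea with its own, longer patience of 6, so the learning rate gets several chances to drop before training is stopped.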

In [None]:
emb_history = emb_model.fit(train_ds,
                            y_train,
                            epochs=20,
                            batch_size=32,
                            callbacks = callbacks_list,
                            validation_split=0.3)

In [None]:
%tensorboard --logdir klogs

#### Conclusion:

The model is overfitting. The training accuracy is about 98%, whereas the validation accuracy is 80%.
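
The gap can also be read directly from the training history. The sketch below assumes a dictionary shaped like `emb_history.history` with the keys `accuracy` and `val_accuracy`; the helper name and the per-epoch numbers are hypothetical, with the last epoch chosen to echo the figures quoted above.

```python
def accuracy_gap(history):
    """Training accuracy minus validation accuracy for the last epoch."""
    return history["accuracy"][-1] - history["val_accuracy"][-1]


# Hypothetical values in the same layout as `emb_history.history`
toy_history = {
    "accuracy":     [0.70, 0.90, 0.98],
    "val_accuracy": [0.72, 0.81, 0.80],
}

gap = accuracy_gap(toy_history)
print(f"train/validation gap: {gap:.2f}")
```

A gap that keeps widening over the epochs while validation accuracy stalls is the usual sign of overfitting.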

### Model 4

Add a dropout layer as regularization to deal with overfitting.

In [None]:
keras.backend.clear_session()  # clear default graph

emb_model_2 = Sequential()
emb_model_2.add(keras.Input(shape=(max_review_length,)))
# The Input layer above fixes the sequence length at max_review_length,
# so we can later flatten the embedded inputs
emb_model_2.add(Embedding(max_num_tokens, 8))

# After the Embedding layer,
# our activations have shape `(samples, maxlen, 8)`.

# We flatten the 3D tensor of embeddings
# into a 2D tensor of shape `(samples, maxlen * 8)`
emb_model_2.add(Flatten())

emb_model_2.add(Dense(16))
emb_model_2.add(Activation('relu'))

emb_model_2.add(Dropout(0.8))

# We add the classifier on top
emb_model_2.add(Dense(1))
emb_model_2.add(Activation('sigmoid'))
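
What `Dropout(0.8)` does during training can be sketched in plain Python: each activation is set to zero with probability 0.8, and the survivors are scaled by `1 / (1 - 0.8)` so that the expected sum stays the same. The function below is a hypothetical illustration, not the Keras implementation; at prediction time the real layer passes its inputs through unchanged.

```python
import random


def dropout(values, rate, rng):
    """Zero each value with probability `rate`; scale the rest by 1 / (1 - rate)."""
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]


rng = random.Random(0)            # fixed seed so the sketch is repeatable
activations = [1.0] * 10000       # toy activations, all equal to 1
dropped = dropout(activations, rate=0.8, rng=rng)

zero_fraction = dropped.count(0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)
print(zero_fraction, mean_after)
```

With a rate of 0.8 only about one in five hidden units contributes to each training step, which is a strong form of regularization.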

In [None]:
emb_model_2.compile(optimizer="adam",
              loss='binary_crossentropy',
              metrics=['accuracy'])

In [None]:
callbacks_list = [keras.callbacks.ReduceLROnPlateau(monitor='val_loss',
                                    factor=0.1,
                                    patience=2),
                  keras.callbacks.EarlyStopping(monitor='val_loss',
                                patience=6),
                  keras.callbacks.TensorBoard(log_dir="klogs1", histogram_freq=1)]

In [None]:
emb_history = emb_model_2.fit(train_ds,
                              y_train,
                              epochs=20,
                              batch_size=32,
                              callbacks = callbacks_list,
                              validation_split=0.3)

In [None]:
%tensorboard --logdir klogs1

### Checking performance on test set

We will use model 4 to check performance on the test set and to make predictions.

In [None]:
test_ds =  vectorize_layer(X_test)

In [None]:
result = emb_model_2.evaluate(test_ds, y_test)

In [None]:
print("Accuracy: {0:.2%}".format(result[1]))

### Predicting Test Data and Confusion Matrix

We will predict the classes using model 4 and build the confusion matrix to understand precision and recall.

In [None]:
y_pred_probs = emb_model_2.predict(test_ds)

In [None]:
y_pred = np.where(y_pred_probs >= 0.5, 1,0)

In [None]:
from sklearn import metrics

cm = metrics.confusion_matrix( y_test,
                            y_pred, labels = [1,0] )

In [None]:
sn.heatmap(cm, annot=True,
           fmt='d',   # the cells hold integer counts
           xticklabels = ["Positive", "Negative"] ,
           yticklabels = ["Positive", "Negative"] )

plt.ylabel('True label')
plt.xlabel('Predicted label');
plt.title( 'Confusion Matrix for Sentiment Classification');

In [None]:
from sklearn.metrics import classification_report

In [None]:
print( classification_report(y_test,y_pred))
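
Precision and recall for the positive class can be recovered by hand from the confusion matrix. The sketch assumes the layout produced by `labels = [1,0]` above (rows are true labels, columns are predicted labels, positive class first); the function name and the counts are hypothetical.

```python
def precision_recall(cm):
    """Precision and recall of the positive class for a 2x2 matrix with labels=[1, 0]."""
    tp, fn = cm[0]    # true positives, false negatives (row: actually positive)
    fp, tn = cm[1]    # false positives, true negatives (row: actually negative)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall


# Hypothetical counts, only for illustration
toy_cm = [[2100, 400],
          [300, 2200]]

precision, recall = precision_recall(toy_cm)
print(f"precision={precision:.3f} recall={recall:.3f}")
```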

## Peeping into Embeddings

We will look at the embeddings estimated for different words and check whether they are placed nearer or farther according to their meaning.

In [None]:
layer_embedding = emb_model_2.get_layer('embedding')

In [None]:
weights_embedding = layer_embedding.get_weights()[0]

In [None]:
weights_embedding.shape

In [None]:
vocab = vectorize_layer.get_vocabulary()
vocab[0:20]

In [None]:
vocab.index("the")

In [None]:
def get_embeddings( word ):
    token = vocab.index(word)
    return weights_embedding[token]

In [None]:
good = get_embeddings('good')
good

In [None]:
great = get_embeddings('great')
great

In [None]:
bad = get_embeddings('bad')
bad

In [None]:
terrible = get_embeddings('terrible')
terrible

We will calculate the Euclidean distance between the word embeddings.

In [None]:
from scipy.spatial.distance import cdist

In [None]:
def get_distance( word1, word2 ):

    word1_token = vocab.index(word1)
    word2_token = vocab.index(word2)

    return cdist([weights_embedding[word1_token]],
                 [weights_embedding[word2_token]],
                 metric = 'euclidean')

In [None]:
get_distance( 'good',
             'awesome' )

In [None]:
get_distance( 'good', 'bad' )

In [None]:
get_distance( 'bad', 'terrible' )

In [None]:
get_distance( 'great', 'terrible' )

It can be observed that the words *good* and *great* are placed close together, as are *bad* and *terrible*, while *good* and *terrible* are placed far apart. This indicates that the embeddings have captured the meaning of the words from how they are used in sentences expressing positive and negative sentiment.
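
Instead of checking pairs one at a time, the words can be ranked by distance to a query word. The sketch below uses a hypothetical helper and made-up 2-dimensional vectors; with the trained model, the same function could be tried with `vocab` and `weights_embedding.tolist()`.

```python
import math


def nearest_words(word, vocabulary, embeddings, k=3):
    """Return the k words whose vectors are closest to the vector of `word`."""
    target = embeddings[vocabulary.index(word)]
    distances = [(math.dist(target, vector), other)
                 for other, vector in zip(vocabulary, embeddings)
                 if other != word]
    return [other for _, other in sorted(distances)[:k]]


# Hypothetical toy vocabulary and embeddings, only for illustration
toy_vocab = ["good", "great", "awesome", "bad", "terrible"]
toy_weights = [[0.9, 0.1], [1.0, 0.2], [1.1, 0.0], [-0.8, 0.1], [-1.0, 0.3]]

print(nearest_words("good", toy_vocab, toy_weights, k=2))
print(nearest_words("terrible", toy_vocab, toy_weights, k=1))
```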

## Storing Embeddings

In [None]:
import numpy as np

# Assume you used a TextVectorization layer
vocab = vectorize_layer.get_vocabulary()  # list of words

print(len(vocab))

# Save embeddings
weights_embedding = layer_embedding.get_weights()[0]

print(weights_embedding.shape)

np.savetxt("tensor.tsv", weights_embedding, delimiter="\t")

# Save metadata
with open("metadata.tsv", "w", encoding='utf-8') as f:
    for word in vocab:
        # The padding token is an empty string; write a visible placeholder instead
        if word == "":
            f.write("[sp]\n")
        else:
            f.write(f"{word}\n")
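
The two `print` calls above compare the vocabulary size with the number of embedding rows. The same check can be run on the saved files, since tools that pair each vector with a label (an assumption about how these files will be used) expect one metadata line per row of vectors. The helper names are hypothetical, and the demonstration writes small toy files to a temporary directory rather than touching the real ones.

```python
import os
import tempfile


def count_rows(path):
    """Count the lines in a text file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def files_are_aligned(tensor_path, metadata_path):
    """True when both files have the same number of rows."""
    return count_rows(tensor_path) == count_rows(metadata_path)


# Toy demonstration with two vectors and two labels
with tempfile.TemporaryDirectory() as tmp:
    tensor_path = os.path.join(tmp, "tensor.tsv")
    metadata_path = os.path.join(tmp, "metadata.tsv")
    with open(tensor_path, "w", encoding="utf-8") as f:
        f.write("0.1\t0.2\n0.3\t0.4\n")
    with open(metadata_path, "w", encoding="utf-8") as f:
        f.write("[sp]\n[UNK]\n")
    aligned = files_are_aligned(tensor_path, metadata_path)

    # Adding an extra label breaks the alignment
    with open(metadata_path, "a", encoding="utf-8") as f:
        f.write("the\n")
    aligned_after_extra_row = files_are_aligned(tensor_path, metadata_path)

print(aligned, aligned_after_extra_row)
```

A mismatch can occur when the adapted vocabulary is smaller than `max_num_tokens`, because the embedding matrix always has `max_num_tokens` rows.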

Some more examples expressing sentiments.

### Participant Exercise: 1

- Build a model with an embedding dimension of 16 or 32
- Add one more dense layer
- Change the number of neurons in the dense layer
- Build the model and check its accuracy


### Participant Exercise: 2

- Explore words, their embeddings and distances between them.

## Excellent References

For further exploration and better understanding, you can use the following references.

- Glossary of Deep Learning: Word Embedding

    https://medium.com/deeper-learning/glossary-of-deep-learning-word-embedding-f90c3cec34ca

- wevi: word embedding visual inspector

    https://ronxin.github.io/wevi/

- Learning Word Embedding

    https://lilianweng.github.io/lil-log/2017/10/15/learning-word-embedding.html

- On the contribution of neural networks and word embeddings in Natural Language Processing

    https://medium.com/@josecamachocollados/on-the-contribution-of-neural-networks-and-word-embeddings-in-natural-language-processing-c8bb1b85c61c