# IMDB Classification - Bag of Words and Embeddings

This tutorial walks through the steps for building a deep learning model for sentiment analysis. We will classify IMDB movie reviews as either positive or negative. The tutorial is intended for teaching during the workshop.

It draws on material from several sources, including the tutorials at http://www.hvass-labs.org/, adapted for teaching in the deep learning class.

The topics addressed in the tutorial:

1. Basic exploration of the IMDB movies dataset.
2. Tokenization, text to sequences, padding and truncating.
3. Building an NN model using Bag of Words.
4. Building an NN model using Embeddings.
5. Peeking into Word Embeddings.

We will mostly explore how to use Bag of Words and Word Embeddings vector representations of text to build plain vanilla NN models. Future tutorials will explore RNN and LSTM models.

### IMDB Movie Reviews

The dataset is available at https://www.kaggle.com/c/word2vec-nlp-tutorial/data

The labeled data set consists of 50,000 IMDB movie reviews, specially selected for sentiment analysis. The sentiment of each review is binary: an IMDB rating < 5 gives a sentiment score of 0, and a rating >= 7 gives a sentiment score of 1. No individual movie has more than 30 reviews.

**Data Fields**

- id - Unique ID of each review
- sentiment - Sentiment of the review; 1 for positive reviews and 0 for negative reviews
- review - Text of the review

In [None]:
from google.colab import drive
drive.mount('/content/drive')

### Loading the dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn

In [None]:
imdb_df = pd.read_csv('/content/drive/MyDrive/AdvancedML/labeledTrainData.tsv', sep = '\t')

In [None]:
pd.set_option('display.max_colwidth', 500)
imdb_df.head(5)

### Data Tokenization

The text data need to be converted into vectors using either a bag of words or an embeddings model. We will first explore the bag of words (BOW) model. In the BOW model, a sentence is represented as a vector whose dimensions are the words (also called tokens).

To create the vectors, we first need to tokenize the sentences and find all the unique tokens (words) used across all sentences. The corpus of unique words could be very large, so we limit it to the most popular (frequently used) words. In this example, we will use 10000 words.
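
As a rough illustration of the BOW idea described above, the following sketch builds count vectors for a tiny made-up corpus using only the standard library and NumPy. The sentences, the `vocab_size` of 5 and the `bow_vector` helper are all hypothetical and stand in for the reviews and the 10000-word limit.

```python
from collections import Counter

import numpy as np

# Toy corpus standing in for the reviews (hypothetical examples).
docs = [
    "the movie was great great fun",
    "the movie was boring",
    "great acting and great story",
]

# Count tokens across the whole corpus and keep only the most frequent ones.
vocab_size = 5  # plays the role of the 10000-word limit
counts = Counter(token for doc in docs for token in doc.split())
vocab = [word for word, _ in counts.most_common(vocab_size)]
index = {word: i for i, word in enumerate(vocab)}


def bow_vector(doc):
    """Return a count vector with one dimension per vocabulary word."""
    vec = np.zeros(len(vocab), dtype=int)
    for token in doc.split():
        if token in index:  # tokens outside the vocabulary are dropped
            vec[index[token]] += 1
    return vec


bow = np.stack([bow_vector(doc) for doc in docs])
print(vocab)
print(bow)
```

Each row of `bow` is one document and each column is one vocabulary word, so word order is lost and only the counts remain.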

In [None]:
import tensorflow as tf
from tensorflow import keras
from keras.preprocessing.text import Tokenizer

In [None]:
all_tokenizer = Tokenizer()

In [None]:
all_tokenizer.fit_on_texts( imdb_df.review )

In [None]:
max_num_tokens = 10000

In [None]:
tokenizer = Tokenizer(num_words = max_num_tokens)

In [None]:
tokenizer.fit_on_texts( imdb_df.review )

### Encode Y Variable

In [None]:
y = np.array(imdb_df.sentiment)

In [None]:
y[0:5]

How many classes are available?

In [None]:
imdb_df.sentiment.unique()

## Text Vectorization

In [None]:
from keras.layers import TextVectorization

In [None]:
max_review_length = 552

In [None]:
vectorize_layer = TextVectorization(max_tokens = max_num_tokens,
                                    output_mode='int',
                                    output_sequence_length = max_review_length,
                                    standardize='lower_and_strip_punctuation',
                                    split='whitespace')
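
To make the effect of `output_mode='int'` and `output_sequence_length` more concrete, here is a plain-Python sketch of the same idea: standardize, split on whitespace, map tokens to integer indexes, then truncate or pad to a fixed length. It only approximates the layer's behaviour. The `standardize` and `to_padded_sequence` helpers and the toy vocabulary are hypothetical; index 0 for padding and index 1 for unknown tokens mirror the first two entries of the layer's vocabulary.

```python
import re


def standardize(text):
    """Lowercase and drop punctuation, roughly like 'lower_and_strip_punctuation'."""
    return re.sub(r"[^\w\s]", "", text.lower())


def to_padded_sequence(text, word_to_id, seq_len, pad_id=0, unk_id=1):
    """Map a text to a fixed-length list of integer token indexes."""
    tokens = standardize(text).split()
    ids = [word_to_id.get(tok, unk_id) for tok in tokens]
    ids = ids[:seq_len]                            # truncate long texts
    return ids + [pad_id] * (seq_len - len(ids))   # pad short texts at the end


# Hypothetical toy vocabulary: '' is the padding token, '[UNK]' is unknown.
toy_vocab = ["", "[UNK]", "the", "movie", "was", "great"]
word_to_id = {word: i for i, word in enumerate(toy_vocab)}

short_seq = to_padded_sequence("The movie was great!", word_to_id, seq_len=6)
long_seq = to_padded_sequence("The movie was great, the plot was thin",
                              word_to_id, seq_len=6)
print(short_seq)  # [2, 3, 4, 5, 0, 0]
print(long_seq)   # [2, 3, 4, 5, 2, 1]
```

Every review ends up with the same length, which is what lets the later layers work with a fixed input shape.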

In [None]:
text_dataset = tf.data.Dataset.from_tensor_slices(list(imdb_df.review))

In [None]:
vectorize_layer.adapt(text_dataset)

In [None]:
vectorize_layer.get_vocabulary()[0:10]

### Creating Word Index

In [None]:
voc = vectorize_layer.get_vocabulary()
word_index = dict(zip(voc, range(len(voc))))

In [None]:
from itertools import islice

first10 = dict(islice(word_index.items(), 10))
         
for word, i in first10.items():
  print(f"{word} : {i}")


### Split Datasets

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(imdb_df.review, 
                                                    imdb_df.sentiment, 
                                                    test_size = 0.2)

In [None]:
X_train.shape

In [None]:
X_test.shape

In [None]:
input_shape = X_train.shape

In [None]:
input_shape

## Applying Pre-trained Embeddings

Word embeddings are generally computed using word-occurrence statistics (observations about which words co-occur in sentences or documents), using a variety of techniques, some involving neural networks, others not. The idea of a dense, low-dimensional embedding space for words, computed in an unsupervised way, was initially explored by Bengio et al. in the early 2000s, but it only started to take off in research and industry applications after the release of one of the most famous and successful word-embedding schemes: the Word2vec algorithm (https://code.google.com/archive/p/word2vec), developed by Tomas Mikolov at Google in 2013. Word2vec dimensions capture specific semantic properties, such as gender.

There are various precomputed databases of word embeddings that you can download and use in a Keras Embedding layer. Word2vec is one of them. Another popular one is called Global Vectors for Word Representation (GloVe, https://nlp.stanford.edu/projects/glove), which was developed by Stanford researchers in 2014. This embedding technique is based on factorizing a matrix of word co-occurrence statistics. Its developers have made available precomputed embeddings for millions of English tokens, obtained from Wikipedia data and Common Crawl data.

In this tutorial we use GloVe, one of the most widely used sets of pre-trained word embeddings. It can be downloaded from https://nlp.stanford.edu/projects/glove/

The glove.6B embeddings were pre-computed from 2014 English Wikipedia and Gigaword 5. They come as an 822MB zip file named glove.6B.zip, containing embedding vectors for 400,000 words (or non-word tokens) in 50, 100, 200 and 300 dimensions. We will use the 50-dimensional file, glove.6B.50d.txt.

In [None]:
!wget https://nlp.stanford.edu/data/glove.6B.zip

In [None]:
!mkdir glove
!unzip glove.6B.zip -d glove/

In [None]:
!head -20 /content/glove/glove.6B.50d.txt

In [None]:
import os

glove_dir = '/content/glove'

embeddings_index = {}
# Each line holds a token followed by its embedding coefficients, separated by spaces.
with open(os.path.join(glove_dir, 'glove.6B.50d.txt'), encoding='utf-8') as f:
    for line_num, line in enumerate(f):
        # Print the first line to show the file format
        if line_num == 0:
            print(line)
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

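Get the word indexes from the vocabulary of `vectorize_layer`. The model uses this layer to turn words into integer indexes, so the rows of the embedding matrix must follow the same indexes.

In [None]:
voc = vectorize_layer.get_vocabulary()
word_index = dict(zip(voc, range(len(voc))))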

In [None]:
embedding_dim = 50  # Must match the GloVe file loaded above (glove.6B.50d.txt)
max_words = 10000

# The embedding matrix has one row per word index and one column per embedding dimension
embedding_matrix = np.zeros((max_words,
                             embedding_dim))

for word, i in word_index.items():
    if i < max_words:
        embedding_vector = embeddings_index.get(word)
        # Words not found in the embedding index stay all-zeros.
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector


In [None]:
embedding_matrix.shape
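
Before using the matrix, it can help to check how many vocabulary words actually received a pre-trained vector, since every missing word keeps an all-zero row. The sketch below shows one way to do that on toy data. `toy_embeddings_index`, `toy_word_index` and the 3-dimensional vectors are hypothetical stand-ins for `embeddings_index`, `word_index` and the 50-dimensional GloVe vectors.

```python
import numpy as np

# Hypothetical stand-ins for embeddings_index and word_index.
toy_embeddings_index = {
    "movie": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "great": np.array([0.4, 0.5, 0.6], dtype="float32"),
}
toy_word_index = {"": 0, "[UNK]": 1, "movie": 2, "great": 3, "zzzfilm": 4}

toy_dim = 3
toy_matrix = np.zeros((len(toy_word_index), toy_dim), dtype="float32")
missing = []
for word, i in toy_word_index.items():
    vector = toy_embeddings_index.get(word)
    if vector is not None:
        toy_matrix[i] = vector   # copy the pre-trained vector into row i
    else:
        missing.append(word)     # this row stays all-zeros

coverage = 1 - len(missing) / len(toy_word_index)
print(f"coverage: {coverage:.0%}, missing: {missing}")
```

The same loop over the real `word_index` and `embeddings_index` would report how much of the review vocabulary GloVe covers.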

### Embedding Model

In [None]:
import tensorflow as tf
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Flatten, Dense, Activation
from keras.layers import Embedding
from keras.layers import Dropout
from keras.optimizers import SGD

tf.keras.backend.clear_session()  # clear default graph

pre_trained_emb_model = Sequential()
pre_trained_emb_model.add(keras.Input(shape=(1,), dtype=tf.string))
pre_trained_emb_model.add(vectorize_layer)
pre_trained_emb_model.add(Embedding(max_num_tokens, 
                                    embedding_dim, 
                                    input_length=max_review_length,
                                    name='layer_embedding'))

pre_trained_emb_model.add(Flatten())
pre_trained_emb_model.add(Dense(32, activation='relu'))
pre_trained_emb_model.add(Dense(1, activation='sigmoid'))
pre_trained_emb_model.summary()

The Embedding layer has a single weight matrix: a 2D float matrix where each row *i* is the word vector associated with index *i*. Simple enough. Let's load the GloVe matrix we prepared into the Embedding layer of our model (the layer named `layer_embedding`).

Additionally, we freeze the embedding layer (we set its trainable attribute to False), following the same rationale as what you are already familiar with in the context of pre-trained convnet features: when parts of a model are pre-trained (like our Embedding layer), and parts are randomly initialized (like our classifier), the pre-trained parts should not be updated during training to avoid forgetting what they already know. The large gradient update triggered by the randomly initialized layers would be very disruptive to the already learned features.
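
The description of the Embedding weight matrix above can be pictured as plain row indexing. The following NumPy sketch is only an analogy for what the Embedding and Flatten layers compute, not the Keras implementation. The sizes, the random `weights` and the `batch` of token indexes are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, seq_len = 6, 4, 5

# Stands in for the Embedding layer's single weight matrix.
weights = rng.normal(size=(vocab_size, dim))

# Two padded sequences of token indexes, as a vectorization step would produce.
batch = np.array([[2, 3, 4, 5, 0],
                  [2, 1, 0, 0, 0]])

embedded = weights[batch]                     # lookup: one row per token index
flattened = embedded.reshape(len(batch), -1)  # what Flatten does before Dense

print(embedded.shape)   # (2, 5, 4)
print(flattened.shape)  # (2, 20)
```

Freezing the layer simply means these rows are left untouched by the optimizer while the Dense layers are trained.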

In [None]:
pre_trained_emb_model.get_layer('layer_embedding').set_weights([embedding_matrix])
pre_trained_emb_model.get_layer('layer_embedding').trainable = False

In [None]:
#sgd = SGD(learning_rate=0.01, momentum=0.8)
pre_trained_emb_model.compile(optimizer='adam',
                              loss='binary_crossentropy',
                              metrics=['accuracy'])

In [None]:
pre_trained_emb_history = pre_trained_emb_model.fit(X_train, 
                                                    y_train,
                                                    epochs=20,
                                                    batch_size=128,
                                                    validation_split=0.1)

### Embedding Layer with Dropouts

In [None]:
tf.keras.backend.clear_session()  # clear default graph

pre_trained_emb_model = Sequential()
pre_trained_emb_model.add(keras.Input(shape=(1,), dtype=tf.string))
pre_trained_emb_model.add(vectorize_layer)
pre_trained_emb_model.add(Embedding(max_num_tokens, 
                                    embedding_dim, 
                                    input_length=max_review_length,
                                    name='layer_embedding'))
pre_trained_emb_model.add(Flatten())
pre_trained_emb_model.add(Dense(64))
pre_trained_emb_model.add(Activation('sigmoid'))
pre_trained_emb_model.add(Dropout(0.4))

pre_trained_emb_model.add(Dense(1))
pre_trained_emb_model.add(Activation('sigmoid'))
pre_trained_emb_model.summary()

In [None]:
pre_trained_emb_model.get_layer('layer_embedding').set_weights([embedding_matrix])
pre_trained_emb_model.get_layer('layer_embedding').trainable = True

In [None]:
pre_trained_emb_model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])

pre_trained_emb_history = pre_trained_emb_model.fit(X_train, 
                                                    y_train,
                                                    epochs=10,
                                                    batch_size=64,
                                                    validation_split=0.2)
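
The `plot_accuracy` helper called in the next cell is not defined in this section. A minimal sketch of what such a helper might look like is given below. It assumes the history dictionary uses the keys `accuracy` and `val_accuracy`, which is what Keras records when the model is compiled with `metrics=['accuracy']` and trained with a validation split. The body is an assumption, not the original helper.

```python
import matplotlib.pyplot as plt


def plot_accuracy(history):
    """Plot training and validation accuracy from a History.history dict."""
    epochs = range(1, len(history["accuracy"]) + 1)
    fig, ax = plt.subplots()
    ax.plot(epochs, history["accuracy"], label="training")
    if "val_accuracy" in history:  # only present when validation data was used
        ax.plot(epochs, history["val_accuracy"], label="validation")
    ax.set_xlabel("epoch")
    ax.set_ylabel("accuracy")
    ax.legend()
    return ax
```

A widening gap between the two curves is the usual sign that the model is overfitting the training reviews.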

In [None]:
plot_accuracy(pre_trained_emb_history.history)

### Exploring the embeddings

In [None]:
# KeyedVectors can read the GloVe text file directly (see no_header=True below)
from gensim.models import KeyedVectors

word2vec_output_file = "/content/glove/glove.6B.50d.txt"

In [None]:
pretrained_w2v_model = KeyedVectors.load_word2vec_format(word2vec_output_file, binary=False, no_header=True)

In [None]:
pretrained_w2v_model.most_similar('bangalore')

In [None]:
pretrained_w2v_model.most_similar('dhoni')

In [None]:
pretrained_w2v_model.most_similar('google')

In [None]:
pretrained_w2v_model.most_similar('hp')

In [None]:
pretrained_w2v_model.most_similar('wikipedia')

In [None]:
def analogy(a, b, c):
    # Solve "a is to b as c is to ?" by ranking words close to (b - a + c)
    result = pretrained_w2v_model.most_similar(positive=[c, b], negative=[a])
    return result[0][0]

In [None]:
analogy('india', 'indian', 'japan')

In [None]:
analogy('india', 'delhi', 'canada')

In [None]:
analogy('india', 'dhoni', 'england')
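
The analogy queries above rely on vector arithmetic: roughly, the query vector is `b - a + c` and the answer is the word whose vector has the highest cosine similarity to it. The sketch below reproduces that idea with NumPy on hand-made 3-dimensional vectors. The vectors, the word list and the `toy_analogy` helper are invented for illustration and only approximate what `most_similar` does.

```python
import numpy as np

# Hand-made toy vectors: [india-ness, capital-ness, canada-ness] (hypothetical).
toy_vectors = {
    "india":  np.array([1.0, 0.0, 0.0]),
    "delhi":  np.array([1.0, 1.0, 0.0]),
    "mumbai": np.array([1.0, 0.2, 0.0]),
    "canada": np.array([0.0, 0.0, 1.0]),
    "ottawa": np.array([0.0, 1.0, 1.0]),
    "maple":  np.array([0.0, 0.2, 1.0]),
}


def unit(v):
    """Scale a vector to length 1 so dot products become cosine similarities."""
    return v / np.linalg.norm(v)


def toy_analogy(a, b, c, vectors):
    """Return the word closest to (b - a + c), excluding the three inputs."""
    query = unit(unit(vectors[c]) + unit(vectors[b]) - unit(vectors[a]))
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: float(unit(vectors[w]) @ query))


print(toy_analogy("india", "delhi", "canada", toy_vectors))  # ottawa
```

With real embeddings the directions are learned from text rather than designed by hand, so the top answer is not always the expected one.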

## Excellent References

For further exploration and better understanding, you can use the following references.

- Glossary of Deep Learning: Word Embedding

    https://medium.com/deeper-learning/glossary-of-deep-learning-word-embedding-f90c3cec34ca


- wevi: word embedding visual inspector

    https://ronxin.github.io/wevi/  
    
    
- Learning Word Embedding    

    https://lilianweng.github.io/lil-log/2017/10/15/learning-word-embedding.html


- On the contribution of neural networks and word embeddings in Natural Language Processing

    https://medium.com/@josecamachocollados/on-the-contribution-of-neural-networks-and-word-embeddings-in-natural-language-processing-c8bb1b85c61c