# Text Classification for the IMDB Dataset using DL
**Objective:** classify IMDB reviews as positive or negative. <br>
In this notebook we explore different DL-based text classification models and compare their performance. <br>
The notebook is coded with Keras and explores the following three architectures:
1. CNN-based models with and without pre-trained embeddings
2. LSTM-based models with and without pre-trained embeddings
3. Transformer-based models with and without pre-trained embeddings (left for you to implement)

This notebook needs a GPU; Google Colab can be used. <br>
**Useful documentation** <br>
- [Pre-trained embeddings with Keras](https://keras.io/examples/nlp/pretrained_word_embeddings/) 
- [Sentiment classification with LSTM keras](https://slundberg.github.io/shap/notebooks/deep_explainer/Keras%20LSTM%20for%20IMDB%20Sentiment%20Classification.html) 
# Installation of needed libraries

In [None]:
#!pip install numpy==1.19.5
!pip install tensorflow

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


# Importing needed libraries

In [1]:
import os, sys, numpy as np, pandas as pd
from zipfile import ZipFile
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.layers import Dense, Input, GlobalMaxPooling1D
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Embedding, LSTM
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.initializers import Constant

In [2]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Tue Apr  4 18:07:01 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P8     9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [3]:
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

Your runtime has 27.3 gigabytes of available RAM

You are using a high-RAM runtime!


# Downloading dataset & pre-trained GloVe embeddings
1. [GloVe](http://nlp.stanford.edu/data/glove.6B.zip)
2. [IMDB dataset](http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz)

Both are compressed archives, so we need to extract them. <br>
We will put the data and pre-trained embeddings into a folder called Data.


In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
GLOVE_DIR = '/content/drive/MyDrive/glove'

In [None]:
#if not os.path.exists('Data'):
#    os.mkdir('Data')
#if not os.path.exists('Data/glove.6B') : 
#    temp='glove.6B.zip' 
#    file = ZipFile(temp)  
#    file.extractall('Data/glove.6B') 
#    file.close()

In [6]:
#BASE_DIR = 'Data'
#GLOVE_DIR = os.path.join(BASE_DIR, 'glove.6B')

df_train = pd.read_excel('/content/drive/MyDrive/train_set_imdb_reviews.xlsx')
df_test = pd.read_excel('/content/drive/MyDrive/test_set_imdb_reviews.xlsx')

# EDA
- Explore both datasets
- Clean the datasets. **Important:** deduce the cleaning steps from exploring the train set only, never the test set, to guarantee **no data leakage**; then apply the same steps to both sets. 
- Check if the dataset is balanced or not 
- Bonus: fix the imbalance if it turns out to be the case

At the end of the EDA, set the cleaned reviews (texts) to the variables ``train_texts`` and ``test_texts`` and the sentiments to ``train_labels`` and ``test_labels``. <br>
If you could not complete this step, use the following commands: <br>
1. ``train_texts = df_train.reviews.apply(lambda x: str(x)).tolist()``
2. ``test_texts = df_test.reviews.apply(lambda x: str(x)).tolist()``
3. ``train_labels = df_train.sentiment.tolist()``
4. ``test_labels = df_test.sentiment.tolist()``
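
The balance check listed above can be sketched in plain Python, assuming the labels are available as a list of 0/1 integers. The helper name `class_balance` and the toy labels are illustrative and not part of the notebook.

```python
from collections import Counter

def class_balance(labels):
    """Return per-class counts and the ratio rarest class / most frequent class."""
    counts = Counter(labels)
    ratio = min(counts.values()) / max(counts.values())
    return dict(counts), ratio

toy_labels = [0, 0, 0, 1, 1, 0]  # made-up labels for illustration only
counts, ratio = class_balance(toy_labels)
print(counts, ratio)
```

A ratio of 1.0 means the classes are perfectly balanced; the further it drops below 1.0, the stronger the imbalance.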

In [7]:
df_train

Unnamed: 0,reviews,sentiment
0,Story of a man who has unnatural feelings for ...,0
1,Airport '77 starts as a brand new luxury 747 p...,0
2,This film lacked something I couldn't put my f...,0
3,"Sorry everyone,,, I know this is supposed to b...",0
4,When I was little my parents took me along to ...,0
...,...,...
24995,"Seeing as the vote average was pretty low, and...",1
24996,"The plot had some wretched, unbelievable twist...",1
24997,I am amazed at how this movie(and most others ...,1
24998,A Christmas Together actually came before my t...,1


In [8]:
df_test

Unnamed: 0,reviews,sentiment
0,Once again Mr. Costner has dragged out a movie...,0
1,This is an example of why the majority of acti...,0
2,"First of all I hate those moronic rappers, who...",0
3,Not even the Beatles could write songs everyon...,0
4,Brass pictures (movies is not a fitting word f...,0
...,...,...
24995,I was extraordinarily impressed by this film. ...,1
24996,"Although I'm not a golf fan, I attended a snea...",1
24997,"From the start of ""The Edge Of Love"", the view...",1
24998,"This movie, with all its complexity and subtle...",1


In [9]:
df_train.sentiment.value_counts()

0    12500
1    12500
Name: sentiment, dtype: int64

In [10]:
df_test.sentiment.value_counts()

0    12500
1    12500
Name: sentiment, dtype: int64

In [11]:
df_train.reviews[0]

"Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is a terrific example of absurd comedy. A formal orchestra audience is turned into an insane, violent mob by the crazy chantings of it's singers. Unfortunately it stays absurd the WHOLE time with no general narrative eventually making it just too off putting. Even those from the era should be turned off. The cryptic dialogue would make Shakespeare seem easy to a third grader. On a technical level it's better than you might think with some good cinematography by future great Vilmos Zsigmond. Future stars Sally Kirkland and Frederic Forrest can be seen briefly."

In [12]:
df_train.dtypes

reviews      object
sentiment     int64
dtype: object

In [None]:
is_integer = df_train['reviews'].apply(lambda x: isinstance(x, int))
df_train.loc[is_integer, 'reviews']

3504    0
Name: reviews, dtype: object

In [None]:
is_integer = df_test['reviews'].apply(lambda x: isinstance(x, int))
df_test.loc[is_integer, 'reviews']

2950    0
Name: reviews, dtype: object

In [13]:
df_train.drop([3504],inplace=True)
df_test.drop([2950],inplace=True)

In [14]:
import string
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

def preprocess_corpus(texts):
    """Remove stop words, punctuation and digits, then lowercase each review."""
    stop_words = set(stopwords.words('english'))
    # drop English stop words (compared case-insensitively)
    texts = [' '.join([word for word in text.split() if word.lower() not in stop_words]) for text in texts]
    # strip punctuation characters
    texts = [''.join([char for char in text if char not in string.punctuation]) for text in texts]
    # strip digits
    texts = [''.join([char for char in text if not char.isdigit()]) for text in texts]
    # lowercase everything
    texts = [text.lower() for text in texts]
    return texts

train_texts = preprocess_corpus(df_train.reviews.tolist())
test_texts = preprocess_corpus(df_test.reviews.tolist())

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
### uncomment the following if you fail the cleaning
#train_texts = df_train.reviews.apply(lambda x: str(x)).tolist()
# test_texts = df_test.reviews.apply(lambda x: str(x)).tolist()

In [15]:
train_labels = df_train.sentiment.tolist()
test_labels = df_test.sentiment.tolist()

# Tokenization of sentences using keras Tokenizer

In Keras, unlike PyTorch, the Tokenizer not only splits a sentence into words but also converts the words into their ids.<br>
As we mentioned in class, Keras is a high-level layer on top of TensorFlow, implemented to allow novice DL users (more precisely, traditional ML users) to develop DL models. <br>

**Remember**, the pre-processing is learnt by looking at the train dataset only, to guarantee **no data leakage**, and is then applied to both datasets. 
&rarr; we fit the tokenizer on the training data, then use it to tokenize both datasets. <br>
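
To make the "fit on train, apply to both" idea concrete, here is a rough pure-Python sketch. It is not the Keras implementation: `fit_vocab` and `to_sequences` are hypothetical helpers, the texts are made up, and the real `Tokenizer` also lowercases and filters punctuation. Ids are assigned by descending frequency starting at 1, and only ids below `num_words` are kept.

```python
from collections import Counter

def fit_vocab(texts, num_words):
    """Build a word -> id mapping from the training texts only."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = [word for word, _ in counts.most_common()]  # most frequent first
    # id 0 is reserved for padding, so ids start at 1
    return {word: i for i, word in enumerate(ranked, start=1) if i < num_words}

def to_sequences(texts, vocab):
    """Map each text to a list of ids, dropping words unseen during fitting."""
    return [[vocab[word] for word in text.split() if word in vocab] for text in texts]

toy_train = ["good movie good cast", "bad movie"]
toy_test = ["good plot", "bad bad cast"]

vocab = fit_vocab(toy_train, num_words=4)      # fitted on train only
train_ids = to_sequences(toy_train, vocab)
test_ids = to_sequences(toy_test, vocab)       # test reuses the train vocabulary
print(vocab, train_ids, test_ids)
```

Words that appear only in the test set ("plot") or fall outside the cap ("bad") simply vanish from the test sequences, which is the price of not peeking at the test data.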


In [16]:
#Vectorize these text samples into a 2D integer tensor using Keras Tokenizer 
# 
MAX_NUM_WORDS = 20000 
tokenizer = Tokenizer(num_words=MAX_NUM_WORDS) 
tokenizer.fit_on_texts(train_texts) 
train_sequences = tokenizer.texts_to_sequences(train_texts) #Converting text to a vector of word indexes 
test_sequences = tokenizer.texts_to_sequences(test_texts) 
word_index = tokenizer.word_index 
print('Found %s unique tokens.' % len(word_index))

Found 118411 unique tokens.


In [None]:
train_sequences[0]

[13,
 52,
 7470,
 1283,
 4534,
 405,
 527,
 53,
 1191,
 363,
 1673,
 126,
 10654,
 7589,
 213,
 591,
 2009,
 1011,
 2932,
 834,
 5302,
 368,
 2514,
 1673,
 125,
 10,
 716,
 1210,
 741,
 152,
 1386,
 8,
 986,
 591,
 554,
 13882,
 316,
 9,
 25,
 2161,
 194,
 683,
 755,
 13883,
 1591,
 594,
 47,
 133,
 30,
 6,
 522,
 621,
 21,
 621,
 300,
 3387,
 12945,
 16957,
 8233,
 34,
 3228]

Since we are dealing with a classical ML/DL model, the input dimension must be fixed. <br>
Just as the number of attributes/features/columns is fixed in a traditional ML model, the input dimension of a DL model is fixed as well. <br>
In our case, the input features are sentences, i.e. lists of words. To give every input the same size, we fix a maximum length parameter (MAX_LEN): the maximum number of words kept per sentence. <br>
You might ask: every sentence has a different set of words, so shouldn't the input size equal the number of unique words in our corpus? <br>
The answer is no, because we never deal with words directly. We deal with embeddings, where every word is mapped to a vector of the same dimension $d$ &rarr; every sentence of our corpus is transformed into an input of size MAX_LEN $\times d$ &rarr; all inputs have the same size. <br>
Thus: <br>
- sentences with more than MAX_LEN words are truncated; we chose post-truncation, i.e. the first MAX_LEN words are retained and the remaining words are removed. 
- sentences with fewer than MAX_LEN words are padded; we chose post-padding, i.e. the id 0 is appended after the ids of the words present in the sentence.
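
The two rules above can be sketched on plain Python lists. The helper `pad_or_truncate_post` is a hypothetical stand-in for what `pad_sequences(..., padding='post', truncating='post')` does to one sequence; the ids are arbitrary.

```python
def pad_or_truncate_post(ids, max_len, pad_id=0):
    """Keep the first max_len ids; fill the tail with pad_id when too short."""
    clipped = ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))

short_row = pad_or_truncate_post([13, 52, 7470], max_len=5)
long_row = pad_or_truncate_post([1, 2, 3, 4, 5, 6, 7], max_len=5)
print(short_row)
print(long_row)
```

Both rows come out with exactly `max_len` entries, which is what lets the model assume a fixed input shape.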

In [17]:
MAX_LEN = 1000
trainvalid_data = pad_sequences(sequences=train_sequences, maxlen=MAX_LEN, padding='post', truncating='post', value=0.0)
test_data = pad_sequences(sequences=test_sequences, maxlen=MAX_LEN, padding='post', truncating='post', value=0.0)

In [None]:
trainvalid_data[0], test_data[0]

(array([   13,    52,  7470,  1283,  4534,   405,   527,    53,  1191,
          363,  1673,   126, 10654,  7589,   213,   591,  2009,  1011,
         2932,   834,  5302,   368,  2514,  1673,   125,    10,   716,
         1210,   741,   152,  1386,     8,   986,   591,   554, 13882,
          316,     9,    25,  2161,   194,   683,   755, 13883,  1591,
          594,    47,   133,    30,     6,   522,   621,    21,   621,
          300,  3387, 12945, 16957,  8233,    34,  3228,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
      

# Converting the target into a categorical tensor variable for DL model training
Keras provides the function ``to_categorical``, which transforms each integer label into a one-hot encoded array whose dimension equals the number of unique categories: the value at index i is 1 if the sample belongs to category i, and 0 elsewhere. ``to_categorical`` assumes each sample belongs to exactly one category; a sample belonging to several categories at once would need a multi-hot target (several 1s), which has to be built differently. <br>
Here there are 2 categories: neg and pos &rarr; the dimension is 2. <br>
Example: the target of a pos review is converted by ``to_categorical`` to ``array([0,1])``, while the target of a neg review is converted to ``array([1,0])``.
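
As a minimal illustration of the encoding, the following sketch builds the same one-hot vectors by hand. `one_hot` is a hypothetical helper written for this example, not a Keras function.

```python
def one_hot(label, num_classes):
    """Return a list with 1.0 at position `label` and 0.0 elsewhere."""
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec

neg = one_hot(0, 2)  # negative review
pos = one_hot(1, 2)  # positive review
print(neg, pos)

# The original label is recovered as the index of the largest entry.
recovered = pos.index(max(pos))
```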

In [18]:
trainvalid_labels = to_categorical(np.asarray(train_labels))
test_labels = to_categorical(np.asarray(test_labels))

In [19]:
train_labels[12500] , trainvalid_labels[12500]

(1, array([0., 1.], dtype=float32))

# Split the training data into a training set and a validation set

In [20]:
VALIDATION_SPLIT = 0.2
indices = np.arange(trainvalid_data.shape[0])
np.random.shuffle(indices)
trainvalid_data = trainvalid_data[indices]
trainvalid_labels = trainvalid_labels[indices]
num_validation_samples = int(VALIDATION_SPLIT * trainvalid_data.shape[0])
x_train = trainvalid_data[:-num_validation_samples]
y_train = trainvalid_labels[:-num_validation_samples]
x_val = trainvalid_data[-num_validation_samples:]
y_val = trainvalid_labels[-num_validation_samples:]

# Convert the token ids into embedding vectors
1. Extract embeddings from glove.6B.100d.txt
2. Convert the words in the dataset into embeddings using the dictionary from step 1
3. Create the embedding layer for keras; this will be the first layer of our DL model.
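
Steps 1 and 2 can be tried on a toy example before loading the real file. The sketch below assumes the GloVe text layout (a word followed by its vector on each line); the two "GloVe" lines, the 3-dimensional vectors, and the tiny `word_index` are invented for illustration.

```python
import io

EMBEDDING_DIM = 3  # toy dimension; the notebook uses 100

# Two made-up lines in the GloVe text layout: word followed by its vector.
fake_glove = io.StringIO("movie 0.1 0.2 0.3\ngreat 0.4 0.5 0.6\n")

# Step 1: word -> vector dictionary
embeddings_index = {}
for line in fake_glove:
    word, *coefs = line.split()
    embeddings_index[word] = [float(c) for c in coefs]

# Toy tokenizer vocabulary (ids start at 1; id 0 is the padding id)
word_index = {"great": 1, "movie": 2, "zsigmond": 3}

# Step 2: one row per id; words missing from the embeddings keep an all-zero row
embedding_matrix = [[0.0] * EMBEDDING_DIM for _ in range(len(word_index) + 1)]
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector

print(embedding_matrix)
```

Row 0 (padding) and row 3 (a word absent from the toy embeddings) stay at zero, so such tokens contribute nothing through a frozen embedding layer.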

In [21]:
EMBEDDING_DIM = 100 
print('Preparing embedding matrix.')

# first, build index mapping words in the embeddings set to their embedding vector
#  every line in glove.6B.100d.txt contains the word followed by the embedding vector
embeddings_index = {}
with open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt'),encoding='utf8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors in Glove embeddings.' % len(embeddings_index))

# prepare embedding matrix - rows are the words from word_index, columns are the embeddings of that word from glove.
num_words = min(MAX_NUM_WORDS, len(word_index)) + 1
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in word_index.items():
    if i > MAX_NUM_WORDS:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector

Preparing embedding matrix.
Found 400000 word vectors in Glove embeddings.


In [22]:
# load these pre-trained word embeddings into an Embedding layer
# note that we set trainable = False so as to keep the embeddings fixed during training
embedding_layer = Embedding(num_words,
                            EMBEDDING_DIM,
                            embeddings_initializer=Constant(embedding_matrix),
                            input_length=MAX_LEN,
                            trainable=False)
print("Preparing of embedding matrix is done")

Preparing of embedding matrix is done


# Training and evaluating the DL model
We will test 3 DL models:
- 1D CNN-based architecture
- LSTM-based architecture
- Transformer-based architecture (to do it on your own)

### 1D CNN Model with pre-trained embedding

In [40]:
labels_index = {}  # create an empty dictionary
for label in train_labels:
    if label not in labels_index:
        labels_index[label] = len(labels_index)  # add new label to dictionary with next integer index

In [41]:
print('Define a 1D CNN model.')

cnnmodel = Sequential()
cnnmodel.add(embedding_layer)
cnnmodel.add(Conv1D(128, 5, activation='relu'))
cnnmodel.add(MaxPooling1D(5))
cnnmodel.add(Conv1D(128, 5, activation='relu'))
cnnmodel.add(MaxPooling1D(5))
cnnmodel.add(Conv1D(128, 5, activation='relu'))
cnnmodel.add(GlobalMaxPooling1D())
cnnmodel.add(Dense(128, activation='relu'))
cnnmodel.add(Dense(len(labels_index), activation='softmax'))

cnnmodel.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])
#Train the model. Tune to validation set. 
cnnmodel.fit(x_train, y_train,
          batch_size=128,
          epochs=1, validation_data=(x_val, y_val))
#Evaluate on test set:
score, acc = cnnmodel.evaluate(test_data, test_labels)
print('Test accuracy with CNN:', acc)

Define a 1D CNN model.
Test accuracy with CNN: 0.5000200271606445


### 1D CNN model with training your own embedding
The only difference here is that the ``embedding_layer`` we created from the pre-trained GloVe embeddings is no longer used. Instead, we initialize an embedding layer with randomly initialized weights, ``Embedding(MAX_NUM_WORDS, 128)``, and train it together with the rest of the model.
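
To get a feel for what training the embedding costs, the weight counts of the two variants can be compared with simple arithmetic. The sketch only uses sizes already fixed in this notebook (a 20000-word cap, 128-dimensional trained vectors, 100-dimensional GloVe vectors); the variable names are illustrative.

```python
MAX_NUM_WORDS = 20000   # vocabulary cap used in the notebook
TRAINED_DIM = 128       # width of the randomly initialized embedding
GLOVE_DIM = 100         # width of the GloVe vectors

# Embedding trained from scratch: every entry is a trainable weight.
trainable_weights = MAX_NUM_WORDS * TRAINED_DIM

# Frozen GloVe embedding: the matrix has one extra row (see num_words), none trainable.
frozen_weights = (MAX_NUM_WORDS + 1) * GLOVE_DIM

print(trainable_weights)  # 2560000
print(frozen_weights)     # 2000100
```

The from-scratch variant therefore adds roughly 2.5 million weights to learn from 20000 training reviews, whereas the GloVe variant adds none.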

In [42]:
print("Defining and training a CNN model, training embedding layer on the fly instead of using pre-trained embeddings")
cnnmodel = Sequential()
cnnmodel.add(Embedding(MAX_NUM_WORDS, 128))
cnnmodel.add(Conv1D(128, 5, activation='relu'))
cnnmodel.add(MaxPooling1D(5))
cnnmodel.add(Conv1D(128, 5, activation='relu'))
cnnmodel.add(MaxPooling1D(5))
cnnmodel.add(Conv1D(128, 5, activation='relu'))
cnnmodel.add(GlobalMaxPooling1D())
cnnmodel.add(Dense(128, activation='relu'))
cnnmodel.add(Dense(len(labels_index), activation='softmax'))

cnnmodel.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])
#Train the model. Tune to validation set. 
cnnmodel.fit(x_train, y_train,
          batch_size=128,
          epochs=1, validation_data=(x_val, y_val))
#Evaluate on test set:
score, acc = cnnmodel.evaluate(test_data, test_labels)
print('Test accuracy with CNN:', acc)

Defining and training a CNN model, training embedding layer on the fly instead of using pre-trained embeddings
Test accuracy with CNN: 0.5005000233650208


### LSTM Model with training your own embedding 

In [None]:
print("Defining and training an LSTM model, training embedding layer on the fly")

#model
rnnmodel = Sequential()
rnnmodel.add(Embedding(MAX_NUM_WORDS, 128))
rnnmodel.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
rnnmodel.add(Dense(2, activation='softmax'))  # one-hot targets: softmax over the 2 classes
rnnmodel.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
print('Training the RNN')

rnnmodel.fit(x_train, y_train,
          batch_size=32,
          epochs=1,
          validation_data=(x_val, y_val))
score, acc = rnnmodel.evaluate(test_data, test_labels,
                            batch_size=32)
print('Test accuracy with RNN:', acc)
#Test accuracy with RNN: 0.82998

Defining and training an LSTM model, training embedding layer on the fly




Training the RNN
Test accuracy with RNN: 0.49998000264167786


### LSTM Model using pre-trained Embedding Layer


In [None]:
print("Defining and training an LSTM model, using pre-trained embedding layer")

rnnmodel2 = Sequential()
rnnmodel2.add(embedding_layer)
rnnmodel2.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
rnnmodel2.add(Dense(2, activation='softmax'))  # one-hot targets: softmax over the 2 classes
rnnmodel2.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
print('Training the RNN')

rnnmodel2.fit(x_train, y_train,
          batch_size=32,
          epochs=1,
          validation_data=(x_val, y_val))
score, acc = rnnmodel2.evaluate(test_data, test_labels,
                            batch_size=32)
print('Test accuracy with RNN:', acc)
#Test accuracy with RNN: 0.793



Defining and training an LSTM model, using pre-trained embedding layer
Training the RNN
Test accuracy with RNN: 0.5000200271606445


### Transformer Model 
Refer to the [keras tutorial](https://keras.io/examples/nlp/text_classification_with_transformer/) to implement and evaluate your model 

In [25]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class TransformerBlock(layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super().__init__()
        self.att = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.ffn = keras.Sequential(
            [layers.Dense(ff_dim, activation="relu"), layers.Dense(embed_dim),]
        )
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

    def call(self, inputs, training=None):
        attn_output = self.att(inputs, inputs)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)

In [46]:
class TokenAndPositionEmbedding(layers.Layer):
    def __init__(self, maxlen, vocab_size, embed_dim):
        super().__init__()
        self.token_emb = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
        self.pos_emb = layers.Embedding(input_dim=maxlen, output_dim=embed_dim)

    def call(self, x):
        maxlen = tf.shape(x)[-1]
        positions = tf.range(start=0, limit=maxlen, delta=1)
        positions = self.pos_emb(positions)
        x = self.token_emb(x)
        return x + positions

In [27]:
vocab_size = 20000  # Only consider the top 20k words
maxlen = 200 

print(len(x_train), "Training sequences")
print(len(x_val), "Validation sequences")
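# x_train / x_val are already post-padded to MAX_LEN, so keep the first `maxlen` ids;
# the default truncating='pre' would keep the trailing positions, which are mostly padding
x_train = keras.preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen, padding='post', truncating='post')
x_val = keras.preprocessing.sequence.pad_sequences(x_val, maxlen=maxlen, padding='post', truncating='post')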

20000 Training sequences
4999 Validation sequences
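
One thing to watch when re-padding to a shorter `maxlen`: `pad_sequences` truncates from the front by default (`truncating='pre'`), so a row that was already post-padded can lose its real tokens. The sketch below imitates the two truncation modes with list slicing on a made-up row; it does not call Keras.

```python
row = [13, 52, 7470, 0, 0, 0, 0, 0]  # 3 real ids, post-padded to length 8
maxlen = 4

pre_truncated = row[-maxlen:]   # 'pre' truncation keeps the last maxlen entries
post_truncated = row[:maxlen]   # 'post' truncation keeps the first maxlen entries

print(pre_truncated)
print(post_truncated)
```

With front truncation the row collapses to padding only, and a model fed such rows has nothing to learn from.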


In [48]:
embed_dim = 32  # Embedding size for each token
num_heads = 2  # Number of attention heads
ff_dim = 32  # Hidden layer size in feed forward network inside transformer

inputs = layers.Input(shape=(maxlen,))
# separate name so the GloVe `embedding_layer` defined earlier is not overwritten
token_pos_embedding = TokenAndPositionEmbedding(maxlen, vocab_size, embed_dim)
x = token_pos_embedding(inputs)
transformer_block = TransformerBlock(embed_dim, num_heads, ff_dim)
x = transformer_block(x)
x = layers.GlobalAveragePooling1D()(x)
x = layers.Dropout(0.1)(x)
x = layers.Dense(20, activation="relu")(x)
x = layers.Dropout(0.1)(x)
outputs = layers.Dense(2, activation="softmax")(x)
#logits = logits.reshape((64,))

model = keras.Model(inputs=inputs, outputs=outputs)

In [51]:
# softmax output with one-hot targets pairs with categorical cross-entropy
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
history = model.fit(
    x_train, y_train, batch_size=32, epochs=2, validation_data=(x_val, y_val)
)

Epoch 1/2
Epoch 2/2
