## Bidirectional GRU for Sentence Classification

### Learning Objectives:

At the end of the experiment, you will be able to:

*  generate vector representation of words in the data using Glove embeddings
*  implement the multi-layer bidirectional GRU (Gated
Recurrent Unit) for solving the sentence classification problem

### Dataset Description

The **sentence polarity dataset v1.0** contains two data files which are:
  * **rt-polarity.pos**: It contains 5346 positive examples
  * **rt-polarity.neg**: It contains 5349 negative examples

Each line in these two files corresponds to a single snippet (usually
containing roughly one single sentence) that includes the review of a movie.

**Note:** Here is the source [link](https://www.cs.cornell.edu/people/pabo/movie-review-data/) to the Movie  dataset





### Importing the libraries and packages

In [52]:
import pandas as pd
import numpy as np
from sklearn.utils import shuffle

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
import string
from gensim.utils import simple_preprocess

import keras
from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences
from sklearn.preprocessing import LabelEncoder

from keras.layers import Input, Embedding, Dense, Bidirectional, Dropout, GRU
from keras.models import Sequential   # the model

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Loading the data

In [53]:
# Read the positive and negative files and split the sentences into a list
with open('rt-polarity.neg',"r", encoding='ISO-8859-1') as data_neg:
    data_neg_set = data_neg.read().splitlines()

with open('rt-polarity.pos',"r", encoding='ISO-8859-1') as data_pos:
  data_pos_set = data_pos.read().splitlines()

In [54]:
# Length of the positive and negative reviews
len(data_neg_set), len(data_pos_set)

(5349, 5346)

In [55]:
# Loading the negative reviews
data_neg_set = pd.DataFrame(data_neg_set, columns=["Review"])

# Loading the positive reviews
data_pos_set = pd.DataFrame(data_pos_set, columns=["Review"])

In [56]:
# Print the first five rows of the positive examples
data_pos_set.head()

Unnamed: 0,Review
0,the rock is destined to be the 21st century's ...
1,"the gorgeously elaborate continuation of "" the..."
2,effective but too-tepid biopic
3,if you sometimes like to go to the movies to h...
4,"emerges as something rare , an issue movie tha..."


In [57]:
# Print the first five rows of the negative examples
data_neg_set.head()

Unnamed: 0,Review
0,"simplistic , silly and tedious ."
1,"it's so laddish and juvenile , only teenage bo..."
2,exploitative and largely devoid of the depth o...
3,[garbus] discards the potential for pathologic...
4,a visually flashy but narratively opaque and e...


#### Giving the labels to the data

Let us give the labels as positive and negative for the sentences present in the two files.

In [58]:
data_neg_set['Polarity'] = 'Negative'
data_pos_set['Polarity'] = 'Positive'

Let us have a glance at few of the values present in the data with negative and positive reviews that we have labeled in the previous step.

In [59]:
data_neg_set.head()

Unnamed: 0,Review,Polarity
0,"simplistic , silly and tedious .",Negative
1,"it's so laddish and juvenile , only teenage bo...",Negative
2,exploitative and largely devoid of the depth o...,Negative
3,[garbus] discards the potential for pathologic...,Negative
4,a visually flashy but narratively opaque and e...,Negative


#### Combining the positive and negative data

Now, we have to work on the combined data containing the positive and negative reviews, so, let us concatenate both the dataframes.

In [60]:
dataframes = [data_neg_set, data_pos_set]
rt_polarity_data = pd.concat(dataframes)
rt_polarity_data.head()

Unnamed: 0,Review,Polarity
0,"simplistic , silly and tedious .",Negative
1,"it's so laddish and juvenile , only teenage bo...",Negative
2,exploitative and largely devoid of the depth o...,Negative
3,[garbus] discards the potential for pathologic...,Negative
4,a visually flashy but narratively opaque and e...,Negative


In [61]:
rt_polarity_data.reset_index(inplace=True, drop=True)

In [62]:
rt_polarity_data.head()

Unnamed: 0,Review,Polarity
0,"simplistic , silly and tedious .",Negative
1,"it's so laddish and juvenile , only teenage bo...",Negative
2,exploitative and largely devoid of the depth o...,Negative
3,[garbus] discards the potential for pathologic...,Negative
4,a visually flashy but narratively opaque and e...,Negative


When you combine the negative and positive examples, it is a good idea to shuffle the examples so that the negative and positive examples are spread throughout. If we do not shuffle it, then, it may happen that in some mini-batches, examples from only one class(positive or negative) will be present. Therefore, it is better to avoid such scenarios.


In [63]:
rt_polarity_data = shuffle(rt_polarity_data)

In [64]:
rt_polarity_data.head(10)

Unnamed: 0,Review,Polarity
1725,"paid in full is so stale , in fact , that its ...",Negative
3278,"as it stands , there's some fine sex onscreen ...",Negative
9449,"the overall result is an intelligent , realist...",Positive
4894,seeing as the film lacks momentum and its posi...,Negative
2547,there's something fundamental missing from thi...,Negative
183,just too silly and sophomoric to ensnare its t...,Negative
5697,an unbelievably fun film just a leading man aw...,Positive
995,howard and his co-stars all give committed per...,Negative
5939,there's a disreputable air about the whole thi...,Positive
941,all very stylish and beautifully photographed ...,Negative


Let us check the value counts of negative and positive reviews.

In [65]:
rt_polarity_data['Polarity'].value_counts()

Negative    5349
Positive    5346
Name: Polarity, dtype: int64

Checking whether there are any null values present in the data.

In [66]:
rt_polarity_data.isnull().values.any()

False

### Label Encoding

In [67]:
# Converting the labels from categorical to numerical
le = LabelEncoder()
rt_polarity_data['Polarity'] = le.fit_transform(rt_polarity_data['Polarity'])
rt_polarity_data.head()

Unnamed: 0,Review,Polarity
1725,"paid in full is so stale , in fact , that its ...",0
3278,"as it stands , there's some fine sex onscreen ...",0
9449,"the overall result is an intelligent , realist...",1
4894,seeing as the film lacks momentum and its posi...,0
2547,there's something fundamental missing from thi...,0


### Data Preprocessing


We can preprocess the text using gensim package. Gensim provides function **simple_preprocess** for more effective preprocessing of the corpus. In such kind of preprocessing, we can convert a document into a list of lowercase tokens. We can also ignore tokens that are too short or too long.

**Note:** Refer to the following [link](https://radimrehurek.com/gensim/utils.html#gensim.utils.simple_preprocess) for gensim `simple_preprocess` method

In [68]:
rt_polarity_data['Review'] = rt_polarity_data['Review'].apply(lambda x:simple_preprocess(x, max_len=30))

In [69]:
# Remove stop words
stop_words = set(stopwords.words('english'))

rt_polarity_data['Review'] = rt_polarity_data['Review'].apply(lambda x: [w for w in x if not w in stop_words])

In [70]:
rt_polarity_data.head()

Unnamed: 0,Review,Polarity
1725,"[paid, full, stale, fact, vibrant, scene, one,...",0
3278,"[stands, fine, sex, onscreen, tense, arguing, ...",0
9449,"[overall, result, intelligent, realistic, port...",1
4894,"[seeing, film, lacks, momentum, position, rema...",0
2547,"[something, fundamental, missing, story, somet...",0


### Hyperparameters

In [71]:
# Hyperparameters
MAX_SENT_LEN = 30   # Number of words to consider from each review
MAX_VOCAB_SIZE = 20000  # Max vocabulary size
BATCH_SIZE = 32
N_EPOCHS = 15

### Tokenize and Pad sequences

A Neural Network only accepts numeric data, so we need to encode the reviews. Here use keras.Tokenizer() to encode the reviews into integers, where each unique word is automatically indexed (using `fit_on_texts` method) calculates the frequency of each word in our corpus/messages.

`texts_to_sequences` method finally converts our array of sequences of strings to list of sequences of integers (most frequent word is assigned 1 and so on).

Each reviews has a different length, so we need to add padding (by adding 0) or truncating the words to the same length (in this case, it is the mean of all reviews length) using `keras.preprocessing.sequence.pad_sequences.`

`post`, pad or truncate the words in the back of a sentence
`pre`, pad or truncate the words in front of a sentence

Each word is assigned an integer and that integer is placed in a list.


For example if we have a sentence “How text to sequence and padding works”. Each word is assigned a number. We suppose how = 1, text = 2, to = 3, sequence = 4, and = 5, padding = 6, works = 7. After texts_to_sequences is called our sentence will look like [1, 2, 3, 4, 5, 6, 7 ]. Now for suppose our MAX_SEQUENCE_LENGTH = 10. After padding our sentence will look like `pre` = [0, 0, 0, 1, 2, 3, 4, 5, 6, 7 ], `post` = [1, 2, 3, 4, 5, 6, 7, 0, 0, 0]

In [72]:
tokenizer = Tokenizer(num_words=MAX_VOCAB_SIZE)
tokenizer.fit_on_texts([' '.join(seq[:MAX_SENT_LEN]) for seq in rt_polarity_data['Review']])

print("Number of words in vocabulary:", len(tokenizer.word_index))

Number of words in vocabulary: 18007


In [73]:
# Convert the sequence of words to sequnce of indices
X = tokenizer.texts_to_sequences([' '.join(seq[:MAX_SENT_LEN]) for seq in rt_polarity_data['Review']])
X = pad_sequences(X, maxlen=MAX_SENT_LEN, padding='post', truncating='post')

y = rt_polarity_data['Polarity']

### Splitting the data into train and test sets

In [74]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42, train_size=10000)

In [75]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((10000, 30), (695, 30), (10000,), (695,))

### Load the GloVe word embeddings

**What is GloVe?**

GloVe stands for global vectors for word representation. It is an unsupervised learning algorithm developed by Stanford for generating word embeddings by aggregating global word-word co-occurrence matrix from a corpus. Word embeddings are basically a form of word representation that bridges the human understanding of language to that of a machine. Meaning that two similar words are represented by almost similar vectors that are very closely placed in a vector space. These are essential for solving most Natural language processing problems.The resulting embeddings show interesting linear substructures of the word in vector space.

Thus when using word embeddings, all individual words are represented as real-valued vectors in a predefined vector space. Each word is mapped to one vector and the vector values are learned in a way that resembles a neural network.

Now, let us load the 300-dimensional GloVe embeddings.

In [76]:
embeddings_index = {}
# Loading the 300-dimensional vector of the model
f = open('glove.6B.300d.txt')
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

print('Found %s word vectors.' % len(embeddings_index))

Found 53047 word vectors.


In [77]:
# Adding 1 because of reversed 0 index
words_not_found = []
vocab_size = len(tokenizer.word_index) + 1
print('Loaded %s word vectors.' % len(embeddings_index))

embedding_dim = 300

# Create a weight matrix for words in the training data
embedding_matrix = np.zeros((vocab_size, embedding_dim))


# Initialize with a vector of zeros for missing words
embedding_vector = embeddings_index.get(word)
if embedding_vector is not None:
    if len(embedding_vector) != embedding_dim:
        print(f"Warning: Word '{word}' has an embedding of dimension {len(embedding_vector)}, but expected dimension is {embedding_dim}. Skipping this word.")
    else:
        embedding_matrix[i] = embedding_vector
else:
    words_not_found.append(word)


Loaded 53047 word vectors.


In [78]:
print(tokenizer.word_index)



In [79]:
print(len(tokenizer.word_index))

18007


### Define the Bi-directional GRU model



### LSTM vs GRU
<center>
<img src="https://cdn.iisc.talentsprint.com/DLFA/Experiment_related_data/GRU.png" width=700px, height=500/>
</center>
<br><br>

Simple RNNs have a very short memory, due to the issue of vanishing gradients. More complicated cell architectures try to solve the short memory problem. The most famous one is probably the Long Short Term Memory (LSTM) cell:


It uses a gated cell architecture to update and forget information selectively in the network memory (cell and hidden states). The Gated Recurrent Units (GRU) have a slightly simpler architecture (and only one hidden state). GRUs are usually faster than LSTMs, while still often have competitive performances for many applications.

### GRU - The subtle differences

* The **update gate** acts similar to the **forget gate** and **input gate** of an LSTM

* The **update gate** decides how much of the past information (from previous time steps) needs to be passed along to the future.

* The **reset gate** decides how much of the past information to forget

* Some tensor ops and speedier to train than LSTMs

<center>
<img src="https://cdn.iisc.talentsprint.com/DLFA/Experiment_related_data/GRU_Subtle_difference.png" width=700px, height=500/>
</center>
<br><br>

### The need for Bi-directional GRUs

* Bi-directional GRUs are just putting two independent GRUs together

* The input sequence is fed in forward order for one GRU, and reverse order for the other

* The otputs of the two networks are usually concatenated at each time step

* Preserving information from both past and future helps understand context better

<center>
<img src="https://cdn.iisc.talentsprint.com/DLFA/Experiment_related_data/Bi-GRU.jpg" width=700px, height=500/>
</center>


A bidirectional GRU consists of forward layer and a backward layer. The input sequence is fed to the forward layer in the regular way, while in the backward layer the input is processed in the reverse order, starting from the last word, then proceed to the next to last word and so on up to to first word.

The hidden states are then concatenated for each token generating an intermediate representation sequence. Hence, for each intermediate representation the information from the sequence before and after the respective token are taken into account. That means for each iteration step the network has access to the complete document and can deduce the right label from that information.

In [80]:
# Build a sequential model by stacking neural net units
model = Sequential()
embedding_layer = Embedding(vocab_size,
                            embedding_dim,
                            weights = [embedding_matrix],
                            input_length = MAX_SENT_LEN,
                            trainable=False)
model.add(embedding_layer)
model.add(Bidirectional(GRU(128, return_sequences=True, dropout=0.50, name='first_gru_layer')))
model.add(Dropout(0.5))
model.add(Bidirectional(GRU(64, name='second_gru_layer')))
model.add(Dropout(0.5))
model.add(Dense(32, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid', name='output_layer'))

In [81]:
print('Summary of the built model...')
model.summary()

Summary of the built model...
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 30, 300)           5402400   
                                                                 
 bidirectional_2 (Bidirecti  (None, 30, 256)           330240    
 onal)                                                           
                                                                 
 dropout_3 (Dropout)         (None, 30, 256)           0         
                                                                 
 bidirectional_3 (Bidirecti  (None, 128)               123648    
 onal)                                                           
                                                                 
 dropout_4 (Dropout)         (None, 128)               0         
                                                                 
 dense_1 (Dense)        

### Compile and train the model

In [82]:
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

In [83]:
model.fit(X_train, y_train,
          batch_size=BATCH_SIZE,
          epochs=N_EPOCHS,
          validation_data=(X_test, y_test))

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.src.callbacks.History at 0x7b36dd5eabc0>

### Evaluate the model

In [84]:
print('Testing...')
model.evaluate(X_test, y_test)

Testing...


[0.6931480765342712, 0.49928057193756104]

In [85]:
# model predictions on the test data
preds = model.predict(X_test)



In [86]:
preds.shape

(695, 1)

In [87]:
# Get the text sequences for the preprocessed movie reviews
reviews_list_idx = tokenizer.texts_to_sequences([' '.join(seq[:MAX_SENT_LEN]) for seq in rt_polarity_data['Review']])

In [88]:
print(reviews_list_idx[1])

[1253, 290, 340, 1856, 3206, 6964, 169, 111]


In [89]:
# Function to get the predictions on the movie reviews using GRU model
def add_score_predictions(data, reviews_list_idx):

  # Pad the sequences of the data
  reviews_list_idx = pad_sequences(reviews_list_idx, maxlen=MAX_SENT_LEN, padding='post', truncating='post')

  # Get the predictons by using GRU model
  review_preds = model.predict(reviews_list_idx)

  # Add the predictions to the movie reviews data
  rt_polarity_data['polarity score'] = review_preds

  # Set the threshold for the predictions
  pred_sentiment = np.array(list(map(lambda x : 'positive' if x > 0.5 else 'negative', review_preds)))

  # Add the sentiment predictions to the movie reviews
  rt_polarity_data['predicted polarity'] = pred_sentiment

  return rt_polarity_data

In [90]:
# Call the above function to get the sentiment score and the predicted sentiment
data = add_score_predictions(rt_polarity_data, reviews_list_idx)



In [93]:
# Display the data
data[:5]

Unnamed: 0,Review,Polarity,polarity score,predicted polarity
1725,"[paid, full, stale, fact, vibrant, scene, one,...",0,0.500242,positive
3278,"[stands, fine, sex, onscreen, tense, arguing, ...",0,0.500242,positive
9449,"[overall, result, intelligent, realistic, port...",1,0.500242,positive
4894,"[seeing, film, lacks, momentum, position, rema...",0,0.500242,positive
2547,"[something, fundamental, missing, story, somet...",0,0.500242,positive
