## Final Project Day 3 Solution: Neural Networks and Transformers  for a Classification Task

We continue to work with the final project dataset to see how Recurrent Neural Networks (RNNs), Long Short-term Memory Networks (LSTMs) and Transformers, perform to predict the __isPositive__ field of the dataset.

* We are giving you two pieces of code to read your training and test datasets.
* Use the notebooks from the class and implement the model, train and test with the corresponding datasets.

*Note: Incorporate all that you have learned over Day 1 and Day 2. Feel free to use your processed data from Day 1 to save on redundant work.*

Overall dataset schema:
* __reviewText:__ Text of the review
* __summary:__ Summary of the review
* __verified:__ Whether the purchase was verified (True or False)
* __time:__ UNIX timestamp for the review
* __log_votes:__ Logarithm-adjusted votes log(1+votes)
* __isPositive:__ Rating of the review

Importing libraries:

In [1]:
import re
from collections import Counter
import d2l
from sklearn.model_selection import train_test_split
import mxnet as mx
from mxnet import gluon, np, npx, autograd
from mxnet.gluon import nn, rnn

npx.set_np()

### Reading the dataset

Let's read the datasets below and fill-in the reviewText field. We will use this field as input to our ML model.

In [2]:
import pandas as pd

df_train = pd.read_csv('../../DATA/NLP/EMBK-NLP-FINAL-TRAIN-CSV.csv')
df_test = pd.read_csv('../../DATA/NLP/EMBK-NLP-FINAL-TEST-CSV.csv')

Let's look at the first five rows in the datasets. As you can see the __log_votes__ field is numeric. That's why we will build a regression model.

In [3]:
df_train.head()

Unnamed: 0,reviewText,summary,verified,time,log_votes,isPositive
0,"PURCHASED FOR YOUNGSTER WHO\nINHERITED MY ""TOO...",IDEAL FOR BEGINNER!,True,1361836800,0.0,1.0
1,unable to open or use,Two Stars,True,1452643200,0.0,0.0
2,Waste of money!!! It wouldn't load to my system.,Dont buy it!,True,1433289600,0.0,0.0
3,I attempted to install this OS on two differen...,I attempted to install this OS on two differen...,True,1518912000,0.0,0.0
4,I've spent 14 fruitless hours over the past tw...,Do NOT Download.,True,1441929600,1.098612,0.0


In [4]:
df_test.head()

Unnamed: 0,reviewText,summary,verified,time,log_votes,isPositive
0,Kaspersky offers the best security for your co...,State of the art protection,True,1465516800,0.0,1.0
1,This Value was extremely discounted which I ap...,Quickbooks,True,1393632000,0.0,1.0
2,Some dufus probably got stock options by the t...,Sad,False,1228176000,2.639057,0.0
3,I have reviewed the software and it is beyond ...,Excellent product,True,1402531200,0.0,1.0
4,"Plain old simple you need Anti-Virus,I have tr...",A must have,True,1367539200,0.0,1.0


### Exploratory Data Analysis and Missing Value Imputation

Let's look at the target distribution for our datasets.

In [5]:
df_train["isPositive"].value_counts()

1.0    43692
0.0    26308
Name: isPositive, dtype: int64

In [6]:
df_test["isPositive"].value_counts()

1.0    4980
0.0    3020
Name: isPositive, dtype: int64

Checking the number of missing values:

In [7]:
print(df_train.isna().sum())

reviewText    11
summary       14
verified       0
time           0
log_votes      0
isPositive     0
dtype: int64


In [8]:
print(df_test.isna().sum())

reviewText    2
summary       1
verified      0
time          0
log_votes     0
isPositive    0
dtype: int64


We will only consider the reviewText field. Let's fill-in the missing values for that below. We will just use the placeholder "Missing" here.

In [9]:
df_train["reviewText"].fillna("Missing", inplace=True)
df_test["reviewText"].fillna("Missing", inplace=True)

### Text processing-cleaning

Next, we will clean the text. We will remove leading/train white space, extra space and html tags. Recurrent neural networks usually __DON'T__ need text processing work further than simple text cleaning. Stemming and lemmatization can introduce some errors that will cause our model to skip those words completely. 

In [10]:
# Some string preprocessing
def clean_str(text):
    text = text.lower().strip() # Remove leading/trailing whitespace
    text = re.sub('\s+', ' ', text) # Remove extra space and tabs
    text = re.compile('<.*?>').sub('', text) # Remove HTML tags/markups:
    return text

clean_reviews_train = [clean_str(x) for x in df_train["reviewText"].tolist()]
clean_reviews_test = [clean_str(x) for x in df_test["reviewText"].tolist()]

Next we prepare the review texts to be fed into the deep learning model by (1) Reserving 15% of the dataset as a validation dataset, (2) padding and truncating the data to the length of 50 words, and (3) converting the encoded text into into MXNet's NDArray format.

In [11]:
# This separates 15% of the entire dataset into val dataset.
X_train, X_val, y_train, y_val = \
    train_test_split(clean_reviews_train, df_train["isPositive"].tolist(), 
                     test_size=0.15, random_state=42)

In [12]:
def load_data(X_train, X_val, y_train, y_val, num_steps=50):
    ## num_steps=50 trim the sentence after the 50th word
    
    train_tokens = d2l.tokenize(X_train, token='word')
    val_tokens = d2l.tokenize(X_val, token='word')
    vocab = d2l.Vocab(train_tokens, min_freq=5)
    
    ## convert to ndarray
    train_features = np.array([d2l.trim_pad(vocab[line], num_steps, vocab.unk)
                               for line in train_tokens], dtype=np.float32)
    val_features = np.array([d2l.trim_pad(vocab[line], num_steps, vocab.unk)
                              for line in val_tokens], dtype=np.float32)  ## l2_loss does not accept float64
    y_train = np.array(y_train, dtype=np.float32)
    y_val = np.array(y_val, dtype=np.float32)

    return train_features, val_features, y_train, y_val, vocab

truncate_word_after_max = 50
train_features, val_features, train_labels, val_labels, vocab = \
    load_data(X_train, X_val, y_train, y_val, truncate_word_after_max)

### Using pre-trained GloVe Word Embeddings:

In this example, we will use GloVe word vectors. The following code shows how to get the word vectors and create an embedding dictionary using Gluon. The dictionary maps the words to their word vectors. 

In [13]:
from mxnet.contrib import text
glove_embedding = text.embedding.create('glove', pretrained_file_name='glove.6B.300d.txt')
embedding_matrix = glove_embedding.get_vecs_by_tokens(vocab.idx_to_token)
embedding_matrix.shape

(24986, 300)

### Training the model

#### Initializing the model
After all the data and word vector preparation, now is the time to define the model and its parameters.

In [14]:
context, num_hidden = mx.cpu(), 100

Let's define a simple model:
1. Initializing with a `Sequential()` block, which can be added a series architecture;
1. One embedding layer with the shape of `embedding_matrix` from Glove;
1. One `RNN` block with `num_hidden` hidden states, and a input layout 'NTC', where (T, N, C) stand for (sequence length, batch size, word vector dimensions) respectively;
1. One `Dense` layer with activation function ReLU.

In [28]:
def initialize(input_dim, output_dim, num_hidden, context):
    model = nn.Sequential()
    model.add(nn.Embedding(input_dim, output_dim),
              rnn.RNN(num_hidden, layout = 'NTC'),        
              nn.Dense(2)              # Output layer
              )

    model.collect_params().initialize(mx.init.Xavier(), ctx=context)
    model[0].weight.set_data(embedding_matrix)
    model[0].collect_params().setattr('grad_req', 'null')
    return model


model = initialize(embedding_matrix.shape[0], embedding_matrix.shape[1],
                   num_hidden, context)
model

Sequential(
  (0): Embedding(24986 -> 300, float32)
  (1): RNN(-1 -> 100, NTC)
  (2): Dense(-1 -> 2, linear)
)

#### Evaluation
Let's define a function that will calculate the accurary metrics for the model.

In [24]:
def evaluate_accuracy(model, loader, context=mx.cpu()):
    wrong = 0
    total = 0
    for _, (data, target) in enumerate(loader):
        
        data = data.as_in_context(context)
        target = target.as_in_context(context)
        predictions = np.argmax(model(data), axis=1)
        wrong += np.abs(predictions, target).sum()
        total += data.shape[0]
    
    return float(1 - wrong/total)

#### Defining the `train()` function

In [32]:
def train(net, train_features, train_labels, val_features, val_labels,
          num_epochs, learning_rate, batch_size):
    softmax_cross_entropy = gluon.loss.SoftmaxCrossEntropyLoss()
    train_iter = d2l.load_array((train_features, train_labels), batch_size)
    val_iter = d2l.load_array((val_features, val_labels), batch_size)
    # The SGD optimization algorithm is used here
    trainer = gluon.Trainer(net.collect_params(), 'sgd', 
                            {'learning_rate': learning_rate})
    for epoch in range(num_epochs):
        train_ls, val_ls = 0, 0
        for X, y in train_iter:
            with autograd.record():
                out = net(X)
                y = y.reshape((-1,))
                l = softmax_cross_entropy(out, y)
                l.backward()
            trainer.step(batch_size)
            train_ls += l.sum()        
        
        for val_X, val_y in val_iter:
            val_ls += softmax_cross_entropy(net(val_X), val_y).sum()
            
        # Let's take the average losses
        training_loss = train_ls / len(train_labels)
        val_loss = val_ls / len(val_labels)
        print("Epoch %s. Train_loss (mse) %s Validation_loss (mse) %s" % \
              (epoch, training_loss, val_loss))
            
    # Calculating training and validation accuracy
    train_accuracy = evaluate_accuracy(net, train_iter)
    val_accuracy = evaluate_accuracy(net, val_iter)
    return net, train_accuracy, val_accuracy

#### Training 
Let's start the training process below. We will print Mean Squared Error after each epoch.

In [33]:
learning_rate, epochs, batch_size = 0.01, 10, 64

# initializing
model = initialize(embedding_matrix.shape[0], embedding_matrix.shape[1],
                   num_hidden, context)

# training
model_out, train_accuracy, val_accuracy = \
    train(net=model, train_features=train_features, 
          train_labels=train_labels, 
          val_features=val_features, val_labels=val_labels,
          num_epochs=epochs, learning_rate=learning_rate, 
          batch_size=batch_size)


Epoch 0. Train_loss (mse) 0.59695274 Validation_loss (mse) 0.5375231
Epoch 1. Train_loss (mse) 0.51408947 Validation_loss (mse) 0.4935246
Epoch 2. Train_loss (mse) 0.482084 Validation_loss (mse) 0.47253278
Epoch 3. Train_loss (mse) 0.45635378 Validation_loss (mse) 0.47566172
Epoch 4. Train_loss (mse) 0.43796033 Validation_loss (mse) 0.44435388
Epoch 5. Train_loss (mse) 0.42355752 Validation_loss (mse) 0.451506
Epoch 6. Train_loss (mse) 0.41417316 Validation_loss (mse) 0.42983395
Epoch 7. Train_loss (mse) 0.4048757 Validation_loss (mse) 0.43024126
Epoch 8. Train_loss (mse) 0.39600328 Validation_loss (mse) 0.4245714
Epoch 9. Train_loss (mse) 0.3883114 Validation_loss (mse) 0.41779023


### Evaluating the model

In [36]:
X_test = clean_reviews_test
y_test = df_test["isPositive"].tolist()

test_tokens = d2l.tokenize(X_test, token='word')
test_features = np.array([d2l.trim_pad(vocab[line], truncate_word_after_max, vocab.unk)
                           for line in test_tokens], dtype=np.float32)
y_test = np.array(y_test, dtype=np.float32)
test_iter = d2l.load_array((test_features, y_test), batch_size)

evaluate_accuracy(model_out, test_iter)

0.34324997663497925