## Assignment 2.3: Text classification via RNN (30 points)

In this assignment you will perform sentiment analysis of IMDB reviews using an RNN. An additional goal is to learn the high-level abstractions of the **torchtext** module, which provides data processing utilities and popular datasets for natural language.

In [1]:
import pandas as pd
import numpy as np
import torch

from torchtext import datasets

from torchtext.data import Field, LabelField
from torchtext.data import BucketIterator

import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

### Preparing Data

In [2]:
TEXT = Field(sequential=True, lower=True, batch_first=True)
LABEL = LabelField(batch_first=True)

In [3]:
train, tst = datasets.IMDB.splits(TEXT, LABEL)
trn, vld = train.split()

In [4]:
%%time
TEXT.build_vocab(trn)

CPU times: user 913 ms, sys: 22.8 ms, total: 936 ms
Wall time: 937 ms


In [5]:
LABEL.build_vocab(trn)

The vocab.freqs is a collections.Counter object, so we can take a look at the most frequent words.

In [6]:
TEXT.vocab.freqs.most_common(10)

[('the', 223801),
 ('a', 111451),
 ('and', 110409),
 ('of', 100200),
 ('to', 93193),
 ('is', 72635),
 ('in', 62957),
 ('i', 49539),
 ('this', 48863),
 ('that', 46089)]
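
As a small illustration of what a `Counter`-based frequency table does, here is a minimal sketch on a made-up two-review corpus (the `reviews` list is hypothetical and only stands in for the tokenized, lower-cased IMDB text):

```python
from collections import Counter

# Toy corpus standing in for tokenized, lower-cased reviews (hypothetical data).
reviews = [
    "the movie was great".split(),
    "the plot was thin but the cast was great".split(),
]

# Count every token across all reviews, as a vocabulary builder would.
freqs = Counter(token for review in reviews for token in review)

# most_common(n) returns the n most frequent (token, count) pairs.
top3 = freqs.most_common(3)
print(top3)
```

As in the real output above, function words such as "the" dominate the top of the list.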

### Creating the Iterator (2 points)

During training, we'll be using a special kind of Iterator, called the **BucketIterator**. When we pass data into a neural network, we want the sequences to be padded to the same length so that we can process them as a batch:

e.g.
\[ 
\[3, 15, 2, 7\],
\[4, 1\], 
\[5, 5, 6, 8, 1\] 
\] -> \[ 
\[3, 15, 2, 7, **0**\],
\[4, 1, **0**, **0**, **0**\], 
\[5, 5, 6, 8, 1\] 
\] 

If the sequences differ greatly in length, the padding wastes a lot of memory and time. The BucketIterator groups sequences of similar lengths together for each batch to minimize padding.
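
To get a feel for how much padding length-based grouping can save, here is a minimal, framework-free sketch. It only compares two extremes (batches taken in arbitrary order versus batches taken from a fully length-sorted list) and is not how `BucketIterator` is implemented; the sequence lengths are randomly generated and purely hypothetical.

```python
import random

def padded_cells(lengths, batch_size):
    """Total cells (real tokens + padding) when each batch is padded to its longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += max(batch) * len(batch)
    return total

random.seed(0)
# Hypothetical review lengths in tokens; real IMDB lengths will differ.
lengths = [random.randint(10, 1000) for _ in range(1024)]

real_tokens = sum(lengths)
arbitrary_order = padded_cells(lengths, batch_size=64)
sorted_order = padded_cells(sorted(lengths), batch_size=64)

print(f"real tokens:               {real_tokens}")
print(f"padded, arbitrary batches: {arbitrary_order}")
print(f"padded, length-sorted:     {sorted_order}")
```

The gap between the last two numbers is the padding that grouping by length avoids.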

Complete the definition of the **BucketIterator** object.

In [7]:
train_iter, val_iter, test_iter = BucketIterator.splits(
        (trn, vld, tst),
        batch_sizes=(64, 64, 64),
        sort=True,
        sort_key=lambda x: len(x.text),
        sort_within_batch=False,
        device='cuda',
        repeat=False
)

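Let's take a look at what the output of the BucketIterator looks like. Do not be surprised by the shape: because the field was created with **batch_first=True**, the batch dimension comes first. The trailing 1s are padding (torchtext assigns index 1 to its default `<pad>` token).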

In [8]:
batch = next(iter(train_iter)); batch.text

tensor([[    9,   518,   845,  ...,     1,     1,     1],
        [   10,    20,     7,  ...,     1,     1,     1],
        [ 1331,   136,  2280,  ...,     1,     1,     1],
        ...,
        [   10,    25,     7,  ...,   300,    17,  3521],
        [ 1850,  2434,    16,  ...,   115,    16,  2166],
        [   10,    20,     7,  ..., 13938,    13,   695]], device='cuda:0')

In [9]:
batch.text.shape

torch.Size([64, 34])

The batch has all the fields we passed to the Dataset as attributes. The batch data can be accessed through the attribute with the same name.

In [10]:
batch.__dict__.keys()

dict_keys(['batch_size', 'dataset', 'fields', 'input_fields', 'target_fields', 'text', 'label'])

### Define the RNN-based text classification model (10 points)

Start simple. Implement the model according to the schema below.  
![alt text](https://miro.medium.com/max/1396/1*v-tLYQCsni550A-hznS0mw.jpeg)


In [11]:
class RNNBaseline(nn.Module):
    def __init__(self, hidden_dim, emb_dim):
        super().__init__()
        self.emb = nn.Embedding(len(TEXT.vocab), emb_dim)
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)  # one logit per review

    def forward(self, seq):
        embedded = self.emb(seq)         # (batch, seq_len, emb_dim)
        h_seq, h_n = self.gru(embedded)  # h_n: (1, batch, hidden_dim)
        # Take the batch size from the input itself, not from a global variable.
        logits = self.fc(h_n).view(seq.shape[0], -1)  # (batch, 1)
        return logits
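
The shapes involved in the forward pass can be checked in isolation. The sketch below uses made-up toy sizes and a randomly initialized `nn.GRU` on the CPU; it is only meant to show what `h_seq` and `h_n` look like when `batch_first=True`.

```python
import torch
import torch.nn as nn

batch_size, seq_len, emb_dim, hidden_dim = 4, 7, 5, 3  # toy sizes

gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)
embedded = torch.randn(batch_size, seq_len, emb_dim)

h_seq, h_n = gru(embedded)
print(h_seq.shape)  # (batch, seq_len, hidden): one state per time step
print(h_n.shape)    # (num_layers, batch, hidden): batch_first does not apply here

# For a single-layer, unidirectional GRU the last time step of h_seq equals h_n.
same = torch.allclose(h_seq[:, -1, :], h_n[-1])
print(same)
```

Because the sequences are fed in padded and unpacked, the final state is the state after the last position of the padded batch, padding tokens included.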

In [12]:
em_sz = 200
nh = 300
model = RNNBaseline(nh, emb_dim=em_sz); model

RNNBaseline(
  (emb): Embedding(201057, 200)
  (gru): GRU(200, 300, batch_first=True)
  (fc): Linear(in_features=300, out_features=1, bias=True)
)

If you're using a GPU, remember to call model.cuda() to move your model to the GPU.

In [13]:
model.cuda()

RNNBaseline(
  (emb): Embedding(201057, 200)
  (gru): GRU(200, 300, batch_first=True)
  (fc): Linear(in_features=300, out_features=1, bias=True)
)

### The training loop (3 points)

Define the optimizer and the loss function.

In [14]:
opt = optim.Adam(model.parameters(), lr=3e-4)
loss_func = nn.BCEWithLogitsLoss()
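
`BCEWithLogitsLoss` takes raw logits rather than probabilities. The following plain-Python sketch shows why working on the logit is attractive; it uses one standard numerically stable rearrangement of binary cross-entropy and is not claimed to be PyTorch's exact implementation. The function names are made up for this illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_on_probability(z, y):
    """Binary cross-entropy computed the naive way: sigmoid first, then logs."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def bce_on_logit(z, y):
    """Same loss rearranged to work on the logit directly (numerically stable)."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

# The two agree for moderate logits.
print(bce_on_probability(2.0, 1), bce_on_logit(2.0, 1))

# For a confident wrong prediction the naive form breaks: sigmoid(50.0) rounds
# to exactly 1.0 in floating point, so log(1 - p) becomes log(0).
try:
    naive = bce_on_probability(50.0, 0)
except ValueError:
    naive = None
stable = bce_on_logit(50.0, 0)
print(naive, stable)
```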

Define the stopping criterion (here, a fixed number of epochs).

In [15]:
epochs = 5

In [16]:
%%time
for epoch in range(1, epochs + 1):
    running_loss = 0.0
    model.train()
    for batch in train_iter:

        x = batch.text
        y = batch.label.view(-1, 1).type(torch.float)

        opt.zero_grad()
        preds = model(x)
        loss = loss_func(preds, y)
        loss.backward()
        opt.step()
        running_loss += loss.item()  # loss is already the mean over the batch

    # Sum of per-batch means divided by the number of examples, so the printed
    # value is roughly the mean per-batch loss divided by the batch size (64).
    epoch_loss = running_loss / len(trn)

    val_loss = 0.0
    model.eval()
    with torch.no_grad():  # gradients are not needed for validation
        for batch in val_iter:

            x = batch.text
            y = batch.label.view(-1, 1).type(torch.float)

            preds = model(x)
            loss = loss_func(preds, y)
            val_loss += loss.item()

    val_loss /= len(vld)  # same normalization as the training loss
    print('Epoch: {}, Training Loss: {}, Validation Loss: {}'.format(epoch, epoch_loss, val_loss))

Epoch: 1, Training Loss: 0.010202687626225608, Validation Loss: 0.009948110779126485
Epoch: 2, Training Loss: 0.008900403203283037, Validation Loss: 0.010478779538472493
Epoch: 3, Training Loss: 0.008424819031783513, Validation Loss: 0.008872889057795206
Epoch: 4, Training Loss: 0.005823815512657165, Validation Loss: 0.006868855317433675
Epoch: 5, Training Loss: 0.00452646757704871, Validation Loss: 0.0074263299147288
CPU times: user 47.9 s, sys: 503 ms, total: 48.4 s
Wall time: 48.8 s
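
The scale of the printed losses follows from how they are normalized: each batch contributes its mean loss, and the sum of those means is divided by the number of examples rather than by the number of batches. The sketch below reproduces both normalizations on made-up data (150 examples, each with a loss of ln 2); the helper name is hypothetical.

```python
import math

def epoch_loss_variants(per_example_losses, batch_size):
    """Return (sum of batch means / n_examples, sum of batch means / n_batches)."""
    batch_means = []
    for start in range(0, len(per_example_losses), batch_size):
        chunk = per_example_losses[start:start + batch_size]
        batch_means.append(sum(chunk) / len(chunk))
    total = sum(batch_means)
    return total / len(per_example_losses), total / len(batch_means)

# Hypothetical run: every example has a loss of ln 2, which is what a binary
# classifier scores when it outputs probability 0.5 for everything.
losses = [math.log(2)] * 150
per_example_norm, per_batch_norm = epoch_loss_variants(losses, batch_size=64)

print(round(per_example_norm, 4))  # small number, scaled down by about the batch size
print(round(per_batch_norm, 4))    # the usual mean loss
```

Read this way, the first-epoch training loss of about 0.0102 printed above corresponds to a mean per-batch loss of roughly 0.65 (0.0102 × 64), slightly below ln 2 ≈ 0.693.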


### Calculate performance of the trained model (5 points)

In [17]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = np.zeros(len(tst))
y_pred = np.zeros(len(tst))

model.eval()

with torch.no_grad():
    for i, batch in enumerate(test_iter):
        x = batch.text
        y = batch.label
        # The model returns logits. exp(logit) > 0.5 is the same as logit > -ln 2,
        # a looser cut-off than sigmoid(logit) > 0.5 (which is logit > 0).
        y_batch_pred = torch.exp(model(x))
        # 64 is the test batch size set in BucketIterator.splits.
        y_true[i * 64 : (i + 1) * 64] = y.cpu().numpy()
        y_pred[i * 64 : (i + 1) * 64] = y_batch_pred.cpu().numpy().flatten() > 0.5

print(f'accuracy: {round(accuracy_score(y_true, y_pred), 3)}')
print(f'precision: {round(precision_score(y_true, y_pred), 3)}')
print(f'recall: {round(recall_score(y_true, y_pred), 3)}')
print(f'f1: {round(f1_score(y_true, y_pred), 3)}')

accuracy: 0.786
precision: 0.766
recall: 0.824
f1: 0.794
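
The evaluation cell thresholds `torch.exp` of the logit at 0.5, which is not the same decision rule as thresholding the sigmoid. A minimal sketch on a few made-up logits shows where the two rules disagree:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical logits, chosen to straddle both cut-offs.
logits = [-2.0, -0.5, -0.1, 0.3, 1.5]

via_exp = [math.exp(z) > 0.5 for z in logits]      # equivalent to z > -ln 2
via_sigmoid = [sigmoid(z) > 0.5 for z in logits]   # equivalent to z > 0

print(via_exp)
print(via_sigmoid)
```

Any logit between -ln 2 ≈ -0.693 and 0 is labelled 1 under the `exp` rule and 0 under the sigmoid rule. This shift of the cut-off may contribute to recall being higher than precision in the numbers above.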


Write down the calculated performance

### Accuracy: 0.786
### Precision: 0.766
### Recall: 0.824
### F1: 0.794
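
As a quick consistency check, F1 is the harmonic mean of precision and recall, so the reported value can be recomputed from the other two numbers (the helper name below is made up for this sketch):

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

baseline_f1 = f1_from(0.766, 0.824)
print(round(baseline_f1, 3))
```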

### Experiments (10 points)

Experiment with the model and try to achieve better results. You can find advice [here](https://arxiv.org/abs/1801.06146). Implement and describe your experiments in detail, and mention what was helpful.

### 1. Replace GRU with LSTM (minor f1-score improvement)
### 2. Stack more GRU layers (no improvement)
### 3. Stack more linear layers (minor f1-score improvement)

## 1. Replace GRU with LSTM

In [18]:
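class LSTM_RNN(nn.Module):
    def __init__(self, hidden_dim, emb_dim):
        super().__init__()
        self.emb = nn.Embedding(len(TEXT.vocab), emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        # One logit per review, to match BCEWithLogitsLoss and the (batch, 1) targets.
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, seq):
        embedded = self.emb(seq)
        # nn.LSTM returns (output, (h_n, c_n)); only the hidden state is used here.
        h_seq, (h_n, c_n) = self.lstm(embedded)
        logits = self.fc(h_n[-1]).view(seq.shape[0], -1)
        return logits

epochs = 5
em_sz = 200
nh = 300
# NOTE: this line builds RNNBaseline, so the output recorded below comes from the
# GRU baseline; use LSTM_RNN(nh, emb_dim=em_sz) to train the LSTM defined above.
model = RNNBaseline(nh, emb_dim=em_sz)
model.cuda()

opt = optim.Adam(model.parameters(), lr=3e-4)
loss_func = nn.BCEWithLogitsLoss()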

In [19]:
%%time
for epoch in range(1, epochs + 1):
    running_loss = 0.0
    model.train() 
    for batch in train_iter: 
        
        x = batch.text
        y = batch.label.view(-1, 1).type(torch.float)

        opt.zero_grad()
        preds = model(x)   
        loss = loss_func(preds, y)
        loss.backward()
        opt.step()
        running_loss += loss.item()

    epoch_loss = running_loss / len(trn)
    
    val_loss = 0.0
    model.eval()
    for batch in val_iter:
        
        x = batch.text
        y = batch.label.view(-1, 1).type(torch.float)
        
        preds = model(x) 
        loss = loss_func(preds, y)
        val_loss += loss.item()
        
    val_loss /= len(vld)
    print('Epoch: {}, Training Loss: {}, Validation Loss: {}'.format(epoch, epoch_loss, val_loss))

Epoch: 1, Training Loss: 0.010394512609073094, Validation Loss: 0.010490611632664999
Epoch: 2, Training Loss: 0.008800734301975794, Validation Loss: 0.009240469149748484
Epoch: 3, Training Loss: 0.00720909207378115, Validation Loss: 0.010440164800484975
Epoch: 4, Training Loss: 0.006119077843427658, Validation Loss: 0.007283803216616313
Epoch: 5, Training Loss: 0.00468870895419802, Validation Loss: 0.0075952236930529275
CPU times: user 48.9 s, sys: 239 ms, total: 49.1 s
Wall time: 49.4 s


In [20]:
y_true = np.zeros(len(tst))
y_pred = np.zeros(len(tst))

model.eval()

with torch.no_grad():
    for i, batch in enumerate(test_iter):
        x = batch.text
        y = batch.label
        # exp(logit) > 0.5 means logit > -ln 2 (a sigmoid would cut at logit > 0).
        y_batch_pred = torch.exp(model(x))
        y_true[i * 64 : (i + 1) * 64] = y.cpu().numpy()
        y_pred[i * 64 : (i + 1) * 64] = y_batch_pred.cpu().numpy().flatten() > 0.5

print(f'accuracy: {round(accuracy_score(y_true, y_pred), 3)}')
print(f'precision: {round(precision_score(y_true, y_pred), 3)}')
print(f'recall: {round(recall_score(y_true, y_pred), 3)}')
print(f'f1: {round(f1_score(y_true, y_pred), 3)}')

accuracy: 0.799
precision: 0.801
recall: 0.797
f1: 0.799
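
When swapping `nn.GRU` for `nn.LSTM`, note that the two modules return differently shaped results: the LSTM returns its final hidden state and cell state together as a tuple. A minimal check with made-up toy sizes on the CPU:

```python
import torch
import torch.nn as nn

x = torch.randn(4, 7, 5)  # (batch, seq_len, emb_dim), toy sizes

# GRU: second return value is the final hidden state tensor.
gru_out, gru_h = nn.GRU(5, 3, batch_first=True)(x)

# LSTM: second return value is a tuple (final hidden state, final cell state).
lstm_out, (lstm_h, lstm_c) = nn.LSTM(5, 3, batch_first=True)(x)

print(gru_h.shape)
print(lstm_h.shape, lstm_c.shape)
```

Passing the LSTM's second return value straight into a linear layer therefore fails; the hidden state has to be unpacked first.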


## 2. Stack more GRU layers

In [21]:
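class RNN_GRU_MORE_LAYERS(nn.Module):
    def __init__(self, hidden_dim, emb_dim):
        super().__init__()
        self.emb = nn.Embedding(len(TEXT.vocab), emb_dim)
        self.gru = nn.GRU(emb_dim, hidden_dim, num_layers=3, batch_first=True)
        # One logit per review, to match BCEWithLogitsLoss and the (batch, 1) targets.
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, seq):
        embedded = self.emb(seq)
        h_seq, h_n = self.gru(embedded)  # h_n: (num_layers, batch, hidden_dim)
        # Classify from the final state of the top layer only.
        logits = self.fc(h_n[-1]).view(seq.shape[0], -1)
        return logits

epochs = 5
em_sz = 200
nh = 300
# NOTE: this line builds RNNBaseline, so the output recorded below comes from the
# single-layer baseline; use RNN_GRU_MORE_LAYERS(nh, emb_dim=em_sz) to train the
# stacked model defined above.
model = RNNBaseline(nh, emb_dim=em_sz)
model.cuda()

opt = optim.Adam(model.parameters(), lr=3e-4)
loss_func = nn.BCEWithLogitsLoss()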

for epoch in range(1, epochs + 1):
    running_loss = 0.0
    model.train() 
    for batch in train_iter: 
        
        x = batch.text
        y = batch.label.view(-1, 1).type(torch.float)

        opt.zero_grad()
        preds = model(x)   
        loss = loss_func(preds, y)
        loss.backward()
        opt.step()
        running_loss += loss.item()

    epoch_loss = running_loss / len(trn)
    
    val_loss = 0.0
    model.eval()
    for batch in val_iter:
        
        x = batch.text
        y = batch.label.view(-1, 1).type(torch.float)
        
        preds = model(x) 
        loss = loss_func(preds, y)
        val_loss += loss.item()
        
    val_loss /= len(vld)
    print('Epoch: {}, Training Loss: {}, Validation Loss: {}'.format(epoch, epoch_loss, val_loss))

y_true = np.zeros(len(tst))
y_pred = np.zeros(len(tst))

model.eval()

with torch.no_grad():
    for i, batch in enumerate(test_iter):
        x = batch.text
        y = batch.label
        # exp(logit) > 0.5 means logit > -ln 2 (a sigmoid would cut at logit > 0).
        y_batch_pred = torch.exp(model(x))
        y_true[i * 64 : (i + 1) * 64] = y.cpu().numpy()
        y_pred[i * 64 : (i + 1) * 64] = y_batch_pred.cpu().numpy().flatten() > 0.5

print(f'accuracy: {round(accuracy_score(y_true, y_pred), 3)}')
print(f'precision: {round(precision_score(y_true, y_pred), 3)}')
print(f'recall: {round(recall_score(y_true, y_pred), 3)}')
print(f'f1: {round(f1_score(y_true, y_pred), 3)}')

Epoch: 1, Training Loss: 0.010115732039724077, Validation Loss: 0.010826970005035401
Epoch: 2, Training Loss: 0.009827562093734742, Validation Loss: 0.009730577715237936
Epoch: 3, Training Loss: 0.00772027542420796, Validation Loss: 0.008830604835351308
Epoch: 4, Training Loss: 0.0062943147114345, Validation Loss: 0.008755472727616629
Epoch: 5, Training Loss: 0.005029705234936305, Validation Loss: 0.00799410094022751
accuracy: 0.733
precision: 0.663
recall: 0.944
f1: 0.779
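
With stacked recurrent layers, `h_n` holds one final state per layer, so it matters which slice is passed to the classifier. The sketch below uses made-up toy sizes and a randomly initialized three-layer `nn.GRU` on the CPU to show the shapes and one way to pick the top layer.

```python
import torch
import torch.nn as nn

batch_size, seq_len, emb_dim, hidden_dim = 4, 6, 5, 3  # toy sizes
gru = nn.GRU(emb_dim, hidden_dim, num_layers=3, batch_first=True)

h_seq, h_n = gru(torch.randn(batch_size, seq_len, emb_dim))
print(h_n.shape)  # (num_layers, batch, hidden): one final state per layer

# Final state of the last (top) layer; it matches the last time step of h_seq.
top = h_n[-1]
matches_output = torch.allclose(top, h_seq[:, -1, :])

# Reshaping all layers at once does not group values by example:
# row 0 ends up holding the layer-0 states of examples 0, 1 and 2.
mixed = h_n.reshape(batch_size, -1)
row0_is_mixed = torch.equal(mixed[0], torch.cat([h_n[0, 0], h_n[0, 1], h_n[0, 2]]))
print(matches_output, row0_is_mixed)
```

Indexing `h_n[-1]` (or using `h_seq[:, -1, :]`) keeps each row tied to a single example, whereas a flat reshape over the layer dimension does not.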


## 3. Stack more linear layers

In [22]:
class RNN_MORE_LAYERS_FOR_THE_GOD_OF_LAYERS(nn.Module):
    def __init__(self, hidden_dim, emb_dim):
        super().__init__()
        self.emb = nn.Embedding(len(TEXT.vocab), emb_dim)
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.fc1 = nn.Linear(hidden_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, 1)

    def forward(self, seq):
        embedded = self.emb(seq)
        h_seq, h_n = self.gru(embedded)  # h_n: (1, batch, hidden_dim)
        fc_out = self.relu(self.fc1(h_n))
        # Take the batch size from the input itself, not from a global variable.
        logits = self.fc2(fc_out).view(seq.shape[0], -1)
        return logits

epochs = 5
em_sz = 200
nh = 300
model = RNN_MORE_LAYERS_FOR_THE_GOD_OF_LAYERS(nh, emb_dim=em_sz)
model.cuda()

opt = optim.Adam(model.parameters(), lr=3e-4)
loss_func = nn.BCEWithLogitsLoss()

for epoch in range(1, epochs + 1):
    running_loss = 0.0
    model.train() 
    for batch in train_iter: 
        
        x = batch.text
        y = batch.label.view(-1, 1).type(torch.float)

        opt.zero_grad()
        preds = model(x)   
        loss = loss_func(preds, y)
        loss.backward()
        opt.step()
        running_loss += loss.item()

    epoch_loss = running_loss / len(trn)
    
    val_loss = 0.0
    model.eval()
    for batch in val_iter:
        
        x = batch.text
        y = batch.label.view(-1, 1).type(torch.float)
        
        preds = model(x) 
        loss = loss_func(preds, y)
        val_loss += loss.item()
        
    val_loss /= len(vld)
    print('Epoch: {}, Training Loss: {}, Validation Loss: {}'.format(epoch, epoch_loss, val_loss))

y_true = np.zeros(len(tst))
y_pred = np.zeros(len(tst))

model.eval()

with torch.no_grad():
    for i, batch in enumerate(test_iter):
        x = batch.text
        y = batch.label
        # exp(logit) > 0.5 means logit > -ln 2 (a sigmoid would cut at logit > 0).
        y_batch_pred = torch.exp(model(x))
        y_true[i * 64 : (i + 1) * 64] = y.cpu().numpy()
        y_pred[i * 64 : (i + 1) * 64] = y_batch_pred.cpu().numpy().flatten() > 0.5

print(f'accuracy: {round(accuracy_score(y_true, y_pred), 3)}')
print(f'precision: {round(precision_score(y_true, y_pred), 3)}')
print(f'recall: {round(recall_score(y_true, y_pred), 3)}')
print(f'f1: {round(f1_score(y_true, y_pred), 3)}')

Epoch: 1, Training Loss: 0.010186308268138341, Validation Loss: 0.010147208086649576
Epoch: 2, Training Loss: 0.008149138351849147, Validation Loss: 0.010538814282417297
Epoch: 3, Training Loss: 0.009862441568715232, Validation Loss: 0.009839516989390056
Epoch: 4, Training Loss: 0.007747751390933991, Validation Loss: 0.007571780848503113
Epoch: 5, Training Loss: 0.005280466116326196, Validation Loss: 0.007482968397935231
accuracy: 0.771
precision: 0.714
recall: 0.904
f1: 0.798
