# Homework 7: Text classification with PyTorch

#### Important: Save a copy of this notebook before you start working, and upload the solved notebook to the HW submission page given below. As always, there is no need to upload the datasets.
 - All theoretical questions must be answered in your own words, do not copy-paste text from the internet. Points can be deducted for terrible formatting or incomprehensible English.
 - Code must be commented. If you use code you found online, you have to add the link to the source you used. There is no penalty for using outside sources as long as you convince us you understand the code.

**Note that this HW has only one notebook.**

*Once completed, download the IPython notebook and upload it to https://courses.cs.ut.ee/2024/nn/spring/Main/Practices.*

## Introduction

In this practice session we are looking into text classification. This means we are going to touch on topics like word embeddings and recurrent neural networks.

Let's download the data first! This is a subset of a movie review dataset: 25000 reviews with their corresponding positive or negative labels.

In [None]:
!wget -c https://courses.cs.ut.ee/2023/nn/spring/Main/HomePage?action=download\&upname=labels.txt -O labels.txt
!wget -c https://courses.cs.ut.ee/2023/nn/spring/Main/HomePage?action=download\&upname=reviews.txt -O reviews.txt

import numpy as np
# read data from text files

with open('reviews.txt', 'r') as f:
    reviews = f.readlines()

with open('labels.txt', 'r') as f:
    labels = f.readlines()

# Preprocessing

In some cases, feeding the raw text straight into an embedding layer and then into a model is not the best solution, so some pre-processing steps are required to use text data properly.


**Task 1** 

Clean up the reviews of all punctuation, since the focus of this is words and not special characters.
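
As a rough illustration of one common approach, shown on a toy string rather than on the real reviews: `str.translate` with a table built from `string.punctuation`. The names `toy_review` and `strip_punctuation` are made up for this sketch, and whether to also lowercase the text is left open.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(text):
    """Return `text` with all ASCII punctuation characters removed."""
    return text.translate(PUNCT_TABLE)

# Toy input (hypothetical, not taken from reviews.txt).
toy_review = "Great movie, loved it!!! (Would watch again.)"
cleaned = strip_punctuation(toy_review)
print(cleaned)
```

Note that this deletes characters instead of replacing them with spaces, so a hyphenated word such as "well-made" becomes "wellmade". Replacing punctuation with a space is an equally reasonable choice.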


In [None]:
# Your code here

**Task 2** 

Get the word counts of the words in the reviews and sort the words by frequency to get the most common words. Let's use a vocabulary of the top 2k words! Create a dictionary `vocab_to_int` that maps each word to its position in the top-2k ranking.
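
A minimal sketch of the counting step on toy data, assuming `collections.Counter` is acceptable here. The names `toy_reviews` and `toy_vocab_to_int` are illustrative, and starting the ranks at 1 (so that 0 stays free for padding) is an assumption of this sketch.

```python
from collections import Counter

# Toy corpus (hypothetical); the homework uses the cleaned reviews instead.
toy_reviews = ["the movie was great", "the plot was bad", "great cast great plot"]

# Count every whitespace-separated token across all reviews.
word_counts = Counter(word for review in toy_reviews for word in review.split())

top_k = 3  # the homework uses 2000; 3 keeps the toy output readable

# most_common returns (word, count) pairs sorted by descending count.
toy_vocab_to_int = {
    word: rank
    for rank, (word, _) in enumerate(word_counts.most_common(top_k), start=1)
}
print(toy_vocab_to_int)
```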

In [None]:
# your code here 
print(vocab_to_int) # print the dictionary

**Task 3** 


However, the top 2k words do not cover all the words that can appear in a review, so we now need to take care of the words not found in the dictionary. Let's add a special token 'UNK' to denote these words. Shift the indices and add 'UNK' to the dictionary at position 1 to denote the words not found in this dictionary.







In [None]:
len(vocab_to_int)

2000

In [None]:
## Your code here

**Task 4** 


Encode all the sentences in the dataset using the dictionary. Fix the sequence length to 250 words: pad shorter sequences with 0 and truncate longer ones. *Note that you will also have to handle the words missing in the dictionary.*
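
The encode, truncate and pad steps can be sketched on a toy vocabulary as follows. The helper `encode_and_pad` and the `toy_vocab` dictionary are hypothetical names. Padding on the left is an assumption of this sketch, since the task statement does not fix the side.

```python
import numpy as np

def encode_and_pad(tokens, vocab, seq_len, unk_id=1, pad_id=0):
    """Map tokens to ids, then truncate or left-pad to exactly `seq_len`."""
    # Unknown words fall back to unk_id; anything past seq_len is dropped.
    ids = [vocab.get(tok, unk_id) for tok in tokens][:seq_len]
    row = np.full(seq_len, pad_id, dtype=np.int64)
    if ids:
        row[-len(ids):] = ids  # real tokens sit at the end of the row
    return row

# Toy vocabulary (hypothetical): 0 is padding, 1 is the unknown-word id.
toy_vocab = {"UNK": 1, "great": 2, "movie": 3}

short = encode_and_pad("great movie tonight".split(), toy_vocab, seq_len=5)
long = encode_and_pad("great great movie movie great movie".split(), toy_vocab, seq_len=5)
print(short, long)
```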

In [None]:
## Your code here

Here we convert the labels from our dataset into 1 (positive) or 0 (negative):

In [None]:
labels=[1 if label.strip()=='positive' else 0 for label in labels]


**Task 5** 

Split the dataset into *train:val:test* sets with *80:10:10* proportions.
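
A sketch of an 80:10:10 split by slicing, shown on ten toy samples. All `toy_*` names are illustrative. The slices are taken in order without shuffling; if the files happen to be sorted by label, shuffling features and labels together first would be advisable.

```python
import numpy as np

# Toy data (hypothetical): 10 samples with 2 features each, plus labels.
toy_x = np.arange(20).reshape(10, 2)
toy_y = np.arange(10)

n = len(toy_x)
train_end = n * 8 // 10  # first 80% -> training
val_end = n * 9 // 10    # next 10% -> validation, remainder -> test

toy_train_x, toy_val_x, toy_test_x = toy_x[:train_end], toy_x[train_end:val_end], toy_x[val_end:]
toy_train_y, toy_val_y, toy_test_y = toy_y[:train_end], toy_y[train_end:val_end], toy_y[val_end:]

print(len(toy_train_y), len(toy_val_y), len(toy_test_y))
```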


In [None]:
#split_dataset into 80% training , 10% test and 10% Validation Dataset
train_x= #
test_x=#
valid_x=#
## do the same for y
print(len(train_y), len(valid_y), len(test_y))

20000 2500 2500


# Data Loading

Now we create the datasets and data loaders for our train, validation and test sets. Note how the data loader automatically creates data batches for your training/validation/test routines.
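
To see the batching in isolation, here is a small sketch with seven toy samples and a batch size of 3. The `toy_*` names are made up. It only shows that the last batch is smaller when the dataset size is not a multiple of the batch size.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset (hypothetical): 7 sequences of length 4 and 7 labels.
toy_inputs = torch.zeros(7, 4, dtype=torch.long)
toy_labels = torch.ones(7)

toy_loader = DataLoader(TensorDataset(toy_inputs, toy_labels), batch_size=3, shuffle=False)

# Each iteration yields one (inputs, labels) batch.
batch_shapes = [(tuple(x.shape), tuple(y.shape)) for x, y in toy_loader]
print(batch_shapes)
```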



In [None]:
import torch
from torch.utils.data import DataLoader, TensorDataset

#create Tensor Dataset
train_data=TensorDataset(torch.FloatTensor(train_x), torch.FloatTensor(train_y))
valid_data=TensorDataset(torch.FloatTensor(valid_x), torch.FloatTensor(valid_y))
test_data=TensorDataset(torch.FloatTensor(test_x), torch.FloatTensor(test_y))

#dataloader
batch_size=50
train_loader=DataLoader(train_data, batch_size=batch_size, shuffle=True)
valid_loader=DataLoader(valid_data, batch_size=batch_size, shuffle=True)
test_loader=DataLoader(test_data, batch_size=batch_size, shuffle=True)

# Model

Next we create the model class, where we define the layers of our model in its `__init__` function. The `forward` function is what runs when you call an instance of the model class; its definition determines the computational chain of your model.
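
The relationship between calling a model instance and its `forward` function can be illustrated with a much smaller module. `TinyNet` is a hypothetical example and is not part of the homework model.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    """A single linear layer followed by a sigmoid (illustration only)."""
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(3, 1)

    def forward(self, x):
        return torch.sigmoid(self.linear(x))

tiny = TinyNet()
x = torch.randn(5, 3)  # batch of 5 samples with 3 features each
out = tiny(x)          # calling the instance dispatches to forward()
print(out.shape)
```

Calling `tiny(x)` instead of `tiny.forward(x)` is the usual convention, because the call also runs any hooks registered on the module.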

In [None]:

import torch.nn as nn
 
class SentimentalLSTM(nn.Module):
    """
    The RNN model that will be used to perform Sentiment analysis.
    """
    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, drop_prob=0.5):    
        """
        Initialize the model by setting up the layers
        """
        super().__init__()
        self.output_size=output_size
        self.n_layers=n_layers
        self.hidden_dim=hidden_dim
        
        #Embedding and LSTM layers
        self.embedding=nn.Embedding(vocab_size, embedding_dim)
        self.lstm=nn.LSTM(embedding_dim, hidden_dim, n_layers, dropout=drop_prob, batch_first=True)
        
        #dropout layer
        self.dropout=nn.Dropout(0)
        
        #Linear and sigmoid layer
        self.fc1=nn.Linear(hidden_dim, 64)
        self.fc2=nn.Linear(64, 16)
        self.fc3=nn.Linear(16,output_size)
        self.sigmoid=nn.Sigmoid()
        
    def forward(self, x):
        """
        Perform a forward pass of our model on a batch of encoded reviews.
        """
        batch_size=x.size(0)  # number of reviews in the batch
        
        #Embedding and LSTM output
        embedd=self.embedding(x)
        lstm_out, hidden=self.lstm(embedd)
        
        #stack up the lstm output
        lstm_out=lstm_out.contiguous().view(-1, self.hidden_dim)
        
        #dropout and fully connected layers
        out=self.dropout(lstm_out)
        out=self.fc1(out)
        out=self.dropout(out)
        out=self.fc2(out)
        out=self.dropout(out)
        out=self.fc3(out)
        sig_out=self.sigmoid(out)
        
        sig_out=sig_out.view(batch_size, -1)
        sig_out=sig_out[:, -1]
        
        return sig_out
    


In [None]:
## function to get the validation loss at some stage of the training
def validation(net,valid_loader,criterion):
    # Get validation loss
    val_losses = []
    train_on_gpu = torch.cuda.is_available()
    net.eval()
    # no gradients are needed for evaluation
    with torch.no_grad():
        for inputs, labels in valid_loader:
            # move the batch to the GPU only if one is available
            if(train_on_gpu):
                inputs, labels = inputs.cuda(), labels.cuda()
            output = net(inputs.to(torch.long))
            val_loss = criterion(output.squeeze(), labels.float())

            val_losses.append(val_loss.item())
    net.train()
    return val_losses


In [None]:
## the main train function to call when you want to train your model
def train(net,train_loader,validation_loader,criterion,optimizer):
    # check if CUDA is available
    train_on_gpu = torch.cuda.is_available()

    # training params

    epochs = 4 # 3-4 is approx where I noticed the validation loss stop decreasing

    counter = 0
    print_every = 100
    clip=5 # gradient clipping

    # move model to GPU, if available
    if(train_on_gpu):
        net.cuda()

    net.train()
    # train for some number of epochs
    for e in range(epochs):

        # batch loop
        for inputs, labels in train_loader:
            counter += 1

            if(train_on_gpu):
                inputs=inputs.cuda()
                labels=labels.cuda()
            
            # zero accumulated gradients
            net.zero_grad()

            # get the output from the model
            output = net(inputs.to(torch.long))

            # calculate the loss and perform backprop
            loss = criterion(output.squeeze(), labels.float())
            loss.backward()
            # `clip_grad_norm_` helps prevent the exploding gradient problem in RNNs / LSTMs.
            nn.utils.clip_grad_norm_(net.parameters(), clip)
            optimizer.step()

            # loss stats
            if counter % print_every == 0:
                val_losses= validation(net,validation_loader,criterion)
                print("Epoch: {}/{}...".format(e+1, epochs),
                    "Step: {}...".format(counter),
                    "Loss: {:.6f}...".format(loss.item()),
                    "Val Loss: {:.6f}".format(np.mean(val_losses)))

In [None]:
## function to test your model on the test set after it is trained
def test(net,test_loader,criterion):
    test_losses = [] # track loss
    num_correct = 0
    train_on_gpu = torch.cuda.is_available()

    net.eval()
    # iterate over test data
    for inputs, labels in test_loader:

        if(train_on_gpu):
            inputs, labels = inputs.cuda(), labels.cuda()

        output = net(inputs.to(torch.long))

        # calculate loss
        test_loss = criterion(output.squeeze(), labels.float())
        test_losses.append(test_loss.item())

        # convert output probabilities to predicted class (0 or 1)
        pred = torch.round(output.squeeze())  # rounds to the nearest integer

        # compare predictions to true label
        correct_tensor = pred.eq(labels.float().view_as(pred))
        correct = np.squeeze(correct_tensor.numpy()) if not train_on_gpu else np.squeeze(correct_tensor.cpu().numpy())
        num_correct += np.sum(correct)


    # -- stats! -- ##
    # avg test loss
    print("Test loss: {:.3f}".format(np.mean(test_losses)))

    # accuracy over all test data
    test_acc = num_correct/len(test_loader.dataset)
    print("Test accuracy: {:.3f}".format(test_acc))


Here we define some model parameters and create an instance of our model.

In [None]:
vocab_size = len(vocab_to_int)+1 # +1 for the 0 padding
output_size = 1
embedding_dim = 256
hidden_dim = 80
n_layers = 1

net = SentimentalLSTM(vocab_size, output_size, embedding_dim, hidden_dim, n_layers,drop_prob=0)
print(net)

**Task 6** 

Explain the dimensions of the embedding layer.

*Your answer here:*

**Task 7** 

Update the learning rate or the model parameters in the model definition above to train the network; the goal is to reach above 78% test accuracy. The train and test blocks are given below; run the block to check how it works.

In [None]:
lr=0.005
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=lr)
train(net,train_loader,valid_loader,criterion,optimizer)
test(net,test_loader,criterion)

**Task 8** 

Now let's switch things up! Use the model and the model definition block below and do the same computation, but instead of using LSTM in your model use simple RNN or GRU units. Do you see a difference in your models in terms of number of parameters, definition, performance, etc? Explain your answer.
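
One way to compare the number of parameters is to sum `numel()` over a module's parameters. The sketch below does this for the three recurrent layer types, using the same `embedding_dim` and `hidden_dim` as this notebook. The helper `count_parameters` and the `counts` dictionary are illustrative names.

```python
import torch.nn as nn

def count_parameters(module):
    """Total number of trainable parameters in `module`."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

embedding_dim, hidden_dim = 256, 80  # same sizes as in this notebook

counts = {}
for name, layer_cls in [("LSTM", nn.LSTM), ("GRU", nn.GRU), ("RNN", nn.RNN)]:
    layer = layer_cls(embedding_dim, hidden_dim, num_layers=1, batch_first=True)
    counts[name] = count_parameters(layer)

print(counts)
```

`nn.LSTM` returns `(output, (h_n, c_n))`, while `nn.GRU` and `nn.RNN` return `(output, h_n)`. Both unpack into two values, so a line such as `lstm_out, hidden = self.lstm(embedd)` keeps working after the swap.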

*Your answer here:*

In [None]:
## Edit this code block to define the model with RNN or GRU instead of LSTM
import torch.nn as nn
 
class SentimentalLSTM(nn.Module):
    """
    The RNN model that will be used to perform Sentiment analysis.
    """
    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, drop_prob=0.5):    
        """
        Initialize the model by setting up the layers
        """
        super().__init__()
        self.output_size=output_size
        self.n_layers=n_layers
        self.hidden_dim=hidden_dim
        
        #Embedding and LSTM layers
        self.embedding=nn.Embedding(vocab_size, embedding_dim)
        self.lstm=nn.LSTM(embedding_dim, hidden_dim, n_layers, dropout=drop_prob, batch_first=True)
        
        #dropout layer
        self.dropout=nn.Dropout(0)
        
        #Linear and sigmoid layer
        self.fc1=nn.Linear(hidden_dim, 64)
        self.fc2=nn.Linear(64, 16)
        self.fc3=nn.Linear(16,output_size)
        self.sigmoid=nn.Sigmoid()
        
    def forward(self, x):
        """
        Perform a forward pass of our model on a batch of encoded reviews.
        """
        batch_size=x.size(0)  # number of reviews in the batch
        
        #Embedding and LSTM output
        embedd=self.embedding(x)
        lstm_out, hidden=self.lstm(embedd)
        
        #stack up the lstm output
        lstm_out=lstm_out.contiguous().view(-1, self.hidden_dim)
        
        #dropout and fully connected layers
        out=self.dropout(lstm_out)
        out=self.fc1(out)
        out=self.dropout(out)
        out=self.fc2(out)
        out=self.dropout(out)
        out=self.fc3(out)
        sig_out=self.sigmoid(out)
        
        sig_out=sig_out.view(batch_size, -1)
        sig_out=sig_out[:, -1]
        
        return sig_out


Create an instance of the new model and train/test it. You may need to change the learning rate and/or the number of epochs to get your model to train well.

In [None]:
vocab_size = len(vocab_to_int)+1 # +1 for the 0 padding
output_size = 1
embedding_dim = 256
hidden_dim = 80
n_layers = 1

net = SentimentalLSTM(vocab_size, output_size, embedding_dim, hidden_dim, n_layers,drop_prob=0)
print(net)

In [None]:
lr=0.001
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=lr)
train(net,train_loader,valid_loader,criterion,optimizer)
test(net,test_loader,criterion)

**Task 9** 

Generate positive and negative sentences. Try to get scores above 0.8 for positive sentences, and below 0.2 for negative sentences.

In [None]:
inputs = [] ## write your reviews here, in quotes, separated by commas

encoded = np.zeros((len(inputs),250),int)
for sent_idx,sent in enumerate(inputs):
    # keep at most 250 words so the index stays inside the array
    for idx,word in enumerate(sent.split()[:250]):
        if(word not in vocab_to_int.keys()):
            encoded[sent_idx,idx]=1
        else:
            encoded[sent_idx,idx]=vocab_to_int[word]
encoded = torch.from_numpy(encoded)

In [None]:
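net.eval()
# train_on_gpu is local to train() and test(), so define it here as well
train_on_gpu = torch.cuda.is_available()
if(train_on_gpu):
    encoded= encoded.cuda()
output= net(encoded.to(torch.long))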

True


**Task 10** 

Convert the output into positive or negative with 0.5 threshold (i.e. >0.5 is positive).
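
A sketch of the thresholding rule on a hand-written tensor of scores. `toy_output` stands in for the model output and is not produced by the network. A score of exactly 0.5 is treated as negative here, following the ">0.5 is positive" rule above.

```python
import torch

# Hypothetical scores in [0, 1], one per sentence.
toy_output = torch.tensor([0.91, 0.12, 0.50, 0.73])

# Strictly greater than 0.5 counts as positive.
toy_pred = ["positive" if p > 0.5 else "negative" for p in toy_output.tolist()]
print(toy_pred)
```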

In [None]:
## Your code here

**Task 11** 

How performant would you rate your model? What would you change if you were tasked with improving this model? Explain. *Note that it could be anything in terms of dataset, data pre-processing, neural network architecture, etc.*

*Your answer here:*

Now let's try to visualise how your network has learnt to embed the words in the vocabulary. We will first extract the weights of the embedding layer, use PCA to project them into a 2D space, and then visualise the projection of some words in that 2D space.

In [None]:
embed= net.embedding.weight.cpu().detach().numpy()

In [None]:
from sklearn.decomposition import PCA
embed_2d = PCA(n_components = 2).fit_transform(embed)
embed_2d.shape

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(16,8))
word_idx=[]
for word in ["great","awesome","beautiful","magnificent","masterpiece"]:
    idx = vocab_to_int[word]
    word_idx.append(idx)
    plt.scatter(embed_2d[idx,0],embed_2d[idx,1],color="red",s=100)
    plt.text(embed_2d[idx, 0], embed_2d[idx, 1], word,fontsize=20)
for word in ["bad", "terrible", "boring", "lame"]:
    idx = vocab_to_int[word]
    word_idx.append(idx)
    plt.scatter(embed_2d[idx,0],embed_2d[idx,1],color="blue",s=100)
    plt.text(embed_2d[idx, 0], embed_2d[idx, 1], word,fontsize=20)

for word in ["actor", "producer", "director", "dog","and","the"]:
    idx = vocab_to_int[word]
    word_idx.append(idx)
    plt.scatter(embed_2d[idx,0],embed_2d[idx,1],color="gray",s=100)
    plt.text(embed_2d[idx, 0], embed_2d[idx, 1], word,fontsize=20)
plt.show()


**Task 12**

Explain what the distribution of words under the learnt word embeddings shows.

*Your answer here:*