# CS 190I Homework 3: Multi-class classification in PyTorch
In this machine problem (MP), you will train a neural network to classify textual sequences. You will use `torch.nn` to implement the network and `torch.autograd` to calculate gradients and train your model.

## Basic classes and functions in PyTorch

### [torch.autograd](https://pytorch.org/docs/stable/autograd.html)

The `torch.autograd` package provides classes and functions that implement automatic differentiation of arbitrary scalar-valued functions. To obtain the gradient of a scalar with respect to a tensor, create the tensor with `requires_grad=True`, then call `backward()` on the scalar you want to differentiate. The gradient is accumulated in the tensor's `.grad` attribute. You can refer to [this tutorial](https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html) for more information.

For example, let's calculate $∇_\boldsymbol{x}||\boldsymbol{x}||^2$ and verify if it equals $2\boldsymbol{x}$.
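
As a quick sanity check on what the result should be, the gradient can be worked out by hand. Writing the squared norm as a sum over components (with $x_i$ denoting the $i$-th component of $\boldsymbol{x}$, notation introduced here only for this derivation):

$$
||\boldsymbol{x}||^2 = \sum_i x_i^2
\quad\Longrightarrow\quad
\frac{\partial}{\partial x_j} \sum_i x_i^2 = 2 x_j
\quad\Longrightarrow\quad
∇_\boldsymbol{x}||\boldsymbol{x}||^2 = 2\boldsymbol{x}
$$

Only the $i = j$ term of the sum depends on $x_j$, which is why each gradient component is simply $2x_j$.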

In [None]:
# Include packages
import math
import torch
from torch import nn
import random
import numpy as np
from tqdm.notebook import tqdm
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt
plt.rcParams["savefig.bbox"] = 'tight'
%matplotlib inline

In [None]:
x = torch.randn(5, requires_grad=True)
norm_square = (x**2).sum()

# calculate gradient
norm_square.backward()

print(f"2x is: {2 * x.data}")
print(f"gradient is: {x.grad}")

**Note:** the gradient is accumulated in the `.grad` attribute, so you need to clear the accumulated gradients before every iteration.
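
The accumulation behaviour can be seen in a minimal sketch like the one below. The tensor `w` and the variable names are illustrative only and are not part of the assignment. Calling `backward()` twice without clearing adds the second gradient to the first, and zeroing `.grad` in between restores the expected value.

```python
import torch

# A small leaf tensor that tracks gradients.
w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# First backward pass: d/dw sum(w^2) = 2w.
(w ** 2).sum().backward()
first_grad = w.grad.clone()

# Second backward pass without clearing: the new gradient is added to the old one.
(w ** 2).sum().backward()
accumulated_grad = w.grad.clone()

# Reset the stored gradient in place, then recompute.
w.grad.zero_()
(w ** 2).sum().backward()
fresh_grad = w.grad.clone()

print(first_grad)        # equals 2w
print(accumulated_grad)  # equals 4w, because two gradients were summed
print(fresh_grad)        # equals 2w again
```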

### [torch.nn](https://pytorch.org/docs/stable/nn.html#)
The `torch.nn` package defines a set of Modules, including all kinds of layers you might use in a neural network, loss functions, weight initialization functions, etc. In this notebook, we will introduce the [loss functions](https://pytorch.org/docs/stable/nn.html#loss-functions) in `torch.nn`, which define a set of functions you might use for various problems such as regression and classification.

For example, the following cell illustrates the use of `nn.MSELoss` to calculate the mean squared error.

In [None]:
x = torch.randn(5)
y = torch.randn(5)

# calculate MSE with torch
mse_th = ((x - y)**2).mean()
print(f"MSE using tensor operations: {mse_th}")

# calculate MSE with nn.MSELoss
loss_func = nn.MSELoss()
mse_nn = loss_func(x, y)
print(f"MSE using nn: {mse_nn}")

### [torch.optim](https://pytorch.org/docs/stable/optim.html)
In previous homeworks, you manually updated the parameters after calculating the gradients. `torch.optim` implements various optimization algorithms, such as SGD, that update your parameters for you. To use one, create an optimizer (e.g., `torch.optim.SGD`) by specifying the parameters to update and the associated optimization hyperparameters, such as the learning rate. In the training loop, you will need to modify your code to include the following two steps:
- Use `optimizer.zero_grad()` to clear gradients of parameters.
- Use `optimizer.step()` to automatically update parameters.

You can refer to [this tutorial](https://pytorch.org/tutorials/beginner/basics/optimization_tutorial.html) for more details and examples.
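
The two steps above can be sketched on a toy problem. This is a minimal illustration, not the training loop for this homework: the objective, `target`, the learning rate, and the step count are all assumptions chosen so the example converges quickly.

```python
import torch

# Toy problem (illustrative only): find w that minimizes ||w - target||^2.
target = torch.tensor([1.0, -2.0, 0.5])
w = torch.zeros(3, requires_grad=True)

# The optimizer is told which tensors to update and with what learning rate.
optimizer = torch.optim.SGD([w], lr=0.1)

for step in range(100):
    optimizer.zero_grad()              # clear gradients left over from the previous step
    loss = ((w - target) ** 2).sum()   # forward pass
    loss.backward()                    # populate w.grad
    optimizer.step()                   # plain SGD update: w <- w - lr * w.grad

final_loss = ((w - target) ** 2).sum().item()
print(f"final loss: {final_loss:.6f}")
```

With a model built from `torch.nn` layers, the same pattern applies, except that `model.parameters()` is passed to the optimizer instead of a hand-made list.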

## Homework: Text Classification with PyTorch

In this problem, you will create a text-classification model that classifies whether a given movie review is positive or negative. We will experiment with the [SST-2](https://nlp.stanford.edu/sentiment/) dataset.


**Download SST-2 dataset**

We provide utility functions that download the raw SST-2 dataset and extract the document strings from it.

In [None]:
import random
import requests
import zipfile
import csv

def download_and_extract(url, local_filename, extract_dir):
    try:
        # Download the file
        response = requests.get(url)
        if response.status_code != 200:
            raise Exception(f"Failed to download {url}. Status code: {response.status_code}")
        # Save to local file
        with open(local_filename, 'wb') as file:
            file.write(response.content)
        # Extract the file
        with zipfile.ZipFile(local_filename, 'r') as zip_ref:
            zip_ref.extractall(extract_dir)
        print(f"Extracted {local_filename} to {extract_dir} successfully.")
    except Exception as e:
        print(f"Error: {e}")

def read_sst2(path, maxidx=None):
    data = {'documents' : [], 'labels' : []}
    with open(path, newline="", encoding="utf-8") as csvfile:
        reader = csv.DictReader(csvfile, delimiter="\t")
        for i, row in enumerate(reader):
            text = row["sentence"]
            label = int(row["label"])  # Convert the label to an integer (0 or 1)
            data['documents'].append(text)
            data['labels'].append(label)

            if i == maxidx:
                break
    print(f"Loaded {len(data['documents'])} samples from {path}")
    return data

SST2_URL = "https://dl.fbaipublicfiles.com/glue/data/SST-2.zip"
download_and_extract(SST2_URL, 'sst2.zip', '.')
sst2_train = read_sst2("SST-2/train.tsv")
sst2_dev = read_sst2("SST-2/dev.tsv")
for _ in range(3):
    # randrange excludes the upper bound; randint includes it and could index past the end
    idx = random.randrange(len(sst2_train['documents']))
    print(f"Example {idx}: {sst2_train['documents'][idx]}\tLabel: {sst2_train['labels'][idx]}")

As with classification on MNIST, where we converted raw images into image features, we first need to convert each raw movie review string into a common text feature: [**bag-of-words**](https://www.wikiwand.com/en/Bag-of-words_model).


**Tokenize**

In natural language processing, we typically first split the full text into small pieces called tokens. This process is called **tokenization**, and it lets us represent the text as a sequence of integers.

Below, you need to implement a basic tokenization function on the *documents* that splits a document string into a list of words and converts each word into the integer index that represents it.

More specifically, you need to implement the following functions (details in the following cell):

**normalize(document)**: a function that lowercases all characters in the document and adds whitespace before and after each of the ".,!?:;" characters.

**build_vocab(documents)**: a function that finds all unique words in the documents and creates a dictionary mapping each word to its integer index in the vocabulary. Remember to add a special **\<unk\>** token to the vocabulary.

**tokenize(word2id, document)**: a function that first splits the document into a sequence of words and then converts each word into its index in the vocabulary. For unknown words, use the index of **\<unk\>**.

**bag_of_words_doc(word2id, document)**: a function that constructs the **bag-of-words** feature of a document. A bag of words represents the document as an unordered collection of its words.
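
To make the target representation concrete, here is a small illustration on a hand-written vocabulary. Everything in it is hypothetical: the `toy_word2id` mapping, the example text, and the choice of a per-word count vector (one common bag-of-words convention; the assignment may intend a different encoding or container type). It is meant to show what the output looks like, not to prescribe how your functions should be written.

```python
from collections import Counter

# Hypothetical toy vocabulary; index 0 is reserved for unknown words.
toy_word2id = {"<unk>": 0, "a": 1, "great": 2, "movie": 3, "not": 4}

# Already-normalized, whitespace-separated text.
text = "a great great movie indeed"

# Map each word to its index, falling back to <unk> for unseen words.
ids = [toy_word2id.get(word, toy_word2id["<unk>"]) for word in text.split()]

# Count how often each index occurs; word order is discarded.
counts = Counter(ids)
feature = [counts.get(i, 0) for i in range(len(toy_word2id))]

print(ids)      # [1, 2, 2, 3, 0]
print(feature)  # [1, 1, 2, 1, 0]
```

Note that "indeed" is not in the toy vocabulary, so it contributes to the `<unk>` slot, and the feature length always equals the vocabulary size regardless of document length.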

In [None]:
documents = [
    "This is, the first document.",
    "This document , is the second document.",
    "And this is the third one.",
    "Is this the first document?",
    "How many documents are here"
]

In [None]:
import numpy as np
from collections import Counter

def normalize(document):
    ## TODO:
    ## 1. Lowercase all characters in the document.
    ## 2. Add whitespace before and after each of the following punctuation marks: .,!?;:

    ## END OF YOUR CODE
    return document

def build_vocab(documents):
    # Build vocabulary
    vocabulary = set()

    for document in documents:
        ## TODO: normalize document, split the document into words and find the unique words
        pass  # placeholder so the cell parses; replace with your code
        ## END OF YOUR CODE

    vocabulary = sorted(list(vocabulary))
    assert "<unk>" not in vocabulary

    ## TODO: insert the <unk> token into the vocabulary

    ## END OF YOUR CODE

    word2id = {} # A dictionary that maps from word to integer index in the vocabulary
    ## TODO: construct a mapping from word string into an integer index

    ## END OF YOUR CODE

    print("Number of unique words: ", len(vocabulary))
    print("The words are", vocabulary)
    print("Word to id dict is: ", word2id)
    return vocabulary, word2id

def tokenize(word2id, document):
    wordids = []
    ## TODO: Tokenize the document string into a list of integers called wordids

    ## END OF YOUR CODE
    return wordids

def bag_of_words_doc(word2id, document):
    feature = None
    ## TODO: Construct bag of word feature for a document

    ## END OF YOUR CODE
    return feature

vocab, word2id = build_vocab(documents)
document = documents[0]
print("Input document: ", document)
print("Tokenize result: ", tokenize(word2id, document))
print("Document bag of words feature: ", bag_of_words_doc(word2id, document))

Now create a vocabulary using the entire SST-2 training set.

In [None]:
vocab, word2id = build_vocab(sst2_train['documents'])


**Train MLP on SST2**

Implement a two-layer MLP with a ReLU activation function for the binary classification task on SST-2. Use cross-entropy loss (or, equivalently, negative log-likelihood) to train the model. Complete the `train_mlp_sst()` function to train your model, visualize the training and validation losses, and report the accuracy on the validation set. Remember to tokenize the data on the fly during training to save memory.
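
The equivalence between cross-entropy and negative log-likelihood can be checked numerically. The sketch below uses made-up logits and labels (they are not SST-2 data) and compares `nn.CrossEntropyLoss` on raw logits with `nn.NLLLoss` on log-softmax outputs.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical batch: 4 examples, 2 classes (negative / positive).
logits = torch.randn(4, 2)
labels = torch.tensor([0, 1, 1, 0])

# CrossEntropyLoss takes raw logits and integer class labels.
ce = nn.CrossEntropyLoss()(logits, labels)

# The same value: log-softmax followed by negative log-likelihood.
nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), labels)

print(torch.allclose(ce, nll))  # True
```

One practical consequence: when training with `nn.CrossEntropyLoss`, the model's `forward` should return raw logits, since the loss applies log-softmax internally.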

In [None]:
class TwoLayerMLP(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int, output_dim: int):
        '''
        Create a two-layer fully-connected network
        Inputs:
        input_dim: dimension of input features
        hidden_dim: dimension of hidden layer
        output_dim: dimension of output
        '''
        super().__init__()
        ## TODO: define layers in the model
        ## Model architecture: input --> hidden layer --> output

        ## End of your code

    def forward(self, x):
        logits = None
        ## TODO: forward pass

        ## End of your code
        return logits

In [None]:
def visualize_loss_acc(losses, accs, split):
    '''
    This function plots the loss curve and accuracy curve using matplotlib.
    '''
    # use matplotlib plot train curves
    plt.figure(figsize=(6, 10))
    plt.subplot(2, 1, 1)

    plt.plot(range(len(losses)), losses)
    plt.xlabel('Iter #')
    plt.ylabel('Loss')
    plt.title(f'{split} loss vs iteration number')

    plt.subplot(2, 1, 2)
    plt.plot(range(len(accs)), accs)
    plt.xlabel('Iter #')
    plt.ylabel('Acc')
    plt.title(f'{split} accuracy vs iteration number')

    # Show the figure.
    plt.show()

In [None]:
def train_mlp_sst(num_epochs, batch_size, lr, model, sst2_train, sst2_val):
    '''
    This function trains the model using stochastic gradient descent on the dataset.
    Returns:
    model: the optimized model.
    '''

    losses = []
    accs = []
    val_losses = []
    val_accs = []

    ## TODO: define loss function and optimizer, use SGD optimizer

    ## End of your code

    # Train loop
    # If implemented correctly, it should take <15 seconds for an epoch
    for i in tqdm(range(num_epochs)):
        ## TODO: shuffle training data

        ## End of your code

        epoch_step = math.ceil(len(sst2_train['documents']) / batch_size)
        for j in range(epoch_step):
            ## TODO: get features and labels for the batch: dynamically convert raw document string into feature tensors
            ## TODO: calculate loss and gradient
            ## TODO: update parameters
            ## Note: remember to clear gradients before every iteration
            pass  # placeholder so the cell parses; replace with your code
            ## End of your code

        loss, acc, val_loss, val_acc = None, None, None, None
        ## TODO: calculate loss, predictions, and accuracy
        ## Remember to wrap the computations in torch.no_grad() so that no computation graph is built

        ## End of your code
        losses.append(loss)
        accs.append(acc)
        val_losses.append(val_loss)
        val_accs.append(val_acc)

    print("Training done")
    visualize_loss_acc(losses, accs, "Training")
    visualize_loss_acc(val_losses, val_accs, "Validation")
    return model

In [None]:
# STOCHASTIC GRADIENT DESCENT HYPER-PARAMETERS
num_epochs = 10
batch_size = 256
lr = 0.2
hidden_dim = 128 # use this as hidden layer dimension

model = None
#################################
## TODO: initialize model      ##
#################################

##################################
######### End of your code #######
##################################
model = train_mlp_sst(num_epochs, batch_size, lr, model, sst2_train, sst2_dev)

**Effect of number of layers**

Experiment with different hyper-parameters: try 3, 5, and 10 layers. You need to implement a new model class called `NLayerMLP` that takes the number of layers as a hyper-parameter and constructs an MLP with that many layers.

Visualize the training loss and validation loss (compute the validation loss at the end of each epoch), and discuss your findings.

In [None]:
class NLayerMLP(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int, output_dim: int, num_layers: int):
        '''
        Create an N-layer fully-connected network
        Inputs:
        input_dim: dimension of input features
        hidden_dim: dimension of hidden layer
        output_dim: dimension of output
        num_layers: number of hidden layers
        '''
        super().__init__()
        ## TODO: define layers in the model

        ## End of your code

    def forward(self, x):
        logits = None
        ## TODO: forward pass

        ## End of your code
        return logits

In [None]:
# STOCHASTIC GRADIENT DESCENT HYPER-PARAMETERS
num_epochs = 10
batch_size = 256
lr = 0.2
hidden_dim = 128 # use this as hidden layer dimension
num_layers = 5

model = None
#################################
## TODO: initialize model      ##
#################################

##################################
######### End of your code #######
##################################
model = train_mlp_sst(num_epochs, batch_size, lr, model, sst2_train, sst2_dev)