> DUPLICATE THIS COLAB TO START WORKING ON IT. Using File > Save a copy to drive.

# Week 1: Text Classification

### What are we building
We’ll continue to apply our learning philosophy of repetition as we build multiple classification models of increasing complexity in the following order:

1. Average of Word2Vec + MLP Layer
1. Can we concatenate 3 token embeddings and then average them? Does this do better than the previous method?
1. Build an embedding layer based model.
1. **Extension**: Explore different parameters, features and architectures. 

###  Evaluation
We’ll be evaluating our models on the following metric: 

1. Accuracy: is the ratio of the number of correctly classified instances to the total number of instances
1. **Extension**: this is a multi-class classification problem, visualize a [confusion matrix](https://torchmetrics.readthedocs.io/en/latest/references/functional.html#confusion-matrix-func) of N*N of actual class vs predicted class (N = number of classes).


### Instructions

1. We've provide scaffolding for all the boiler plate PyTorch code to get to our first model. This covers downloading and parsing the dataset, training code for the baseline model. **Make sure to read all the steps and internalize what is happening**.
1. At this point our model gets to an accuracy of about 0.32. After this we'll try to improve the model by using sliding windows of text instead of just one word at a time. **Does this improve accuracy?**
1. The third model we're going to build is an embedding layer based model. Here instead of using pre-trained word-embeddings we'll be creating new vectors as part of the training process. **How do you think this model will perform?**
1. **Extension**: We've suggested a bunch of extensions to the project so go crazy, tweak any parts of the pipeline and see if you can beat all the current modes.

### Code Overview
- Dependencies: Python dependencies and loading the spacy model
- Project
  - Dataset: Download the conversation dataset and parse it into a pytorch Dataset
  - Trainer: Trainer function to help with multi-epoch training
  - Model 1: Simple Word2Vec + MLP model
  - Model 2: Sliding window trigram (Word2Vec)
  - Model 3: Embedding bag based model on Trigram
- Extensions
 


# Dependencies

✨ Now let's get started and to kick things off as always we install some dependencies.

In [2]:
%%capture
# Install all the required dependencies for the project
!pip install pytorch-lightning==1.5.10 spacy==2.2.4
!python -m spacy download en_core_web_md

Import all the necessary libraries we need throughout the project.

In [3]:
from sklearn.preprocessing import LabelEncoder
from torch import nn
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset, random_split
from collections import Counter
import en_core_web_md
import numpy as np
import pytorch_lightning as pl
import spacy
import torch
import torch.nn.functional as F
import torchmetrics

First things first, let's load the Spacy data which comes with pre-trainined embeddings. This process is expensive so only do this once.

In [4]:
# Really expensive operation to load the entire space word-vector index in memory
# We'll only run it once 
loaded_spacy_model = en_core_web_md.load()

Fix the random seed for numpy and pytorch so the entire class gets consistent results which we can discuss with each other.

In [5]:
# Fix the random seed so that we get consistent results
torch.manual_seed(0)
np.random.seed(0)

# Classifier Project 

✨ Let's Begin ✨

### Data Loading and Processing (Common to ALL Solutions)

#### Dataset

We’ll be using the Empathetic Dialogs dataset open-sourced by Facebook ([link](https://research.fb.com/publications/towards-empathetic-open-domain-conversation-models-a-new-benchmark-and-dataset/)). It can be downloaded as a tar ball from the following [link](https://dl.fbaipublicfiles.com/parlai/empatheticdialogues/empatheticdialogues.tar.gz)

A sample row from the dataset: 
```
conv_id,utterance_idx,context,prompt,speaker_idx,utterance,selfeval,tags
hit:12388_conv:24777,1,joyful,I felt overcome with emotions when Christmas came around as a kid,437,Christmas was the best time of year back in the day!,5|5|5_5|5|5, ''
```

The three columns we'll primarily focus on are:
1. context ==> emotion we're trying to predict
1. prompt + utterance ==> We'll combine these sentences and use them as input 

But let's download and explore the dataset and these should automatically get clear.


In [6]:
import tarfile
import os
import csv

DIRECTORY_NAME="classification"
TRAIN_FILE="classification/empatheticdialogues/train.csv"
VALIDATION_FILE="classification/empatheticdialogues/valid.csv"
TEST_FILE="classification/empatheticdialogues/test.csv"


def download_dataset():
  """
  Download the dialog dataset. The tarball contains three files: train.csv, valid.csv, test.csv 
  """
  !wget 'https://dl.fbaipublicfiles.com/parlai/empatheticdialogues/empatheticdialogues.tar.gz'
  if not os.path.isdir(DIRECTORY_NAME):
    !mkdir classification
  tar = tarfile.open('empatheticdialogues.tar.gz')
  tar.extractall(DIRECTORY_NAME)
  tar.close()

# Expensive operation so we should just do this once
download_dataset()

--2022-07-18 07:21:16--  https://dl.fbaipublicfiles.com/parlai/empatheticdialogues/empatheticdialogues.tar.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 172.67.9.4, 104.22.75.142, 104.22.74.142, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|172.67.9.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 28022709 (27M) [application/gzip]
Saving to: ‘empatheticdialogues.tar.gz’


2022-07-18 07:21:20 (8.77 MB/s) - ‘empatheticdialogues.tar.gz’ saved [28022709/28022709]



Now the question is that did it do the right thing? Time to find out.


In [7]:
import glob
glob.glob(f"{DIRECTORY_NAME}/**/*.csv", recursive=True)

['classification/empatheticdialogues/test.csv',
 'classification/empatheticdialogues/train.csv',
 'classification/empatheticdialogues/valid.csv']

Cool we see all our files. Let's poke at one of them before we start parsing our dataset.

In [8]:
with open(TRAIN_FILE, 'r', newline='\n') as file:
  reader = csv.reader(file, delimiter = ',')
  i = 0
  while(i < 5):
    print(next(reader))
    i += 1

['conv_id', 'utterance_idx', 'context', 'prompt', 'speaker_idx', 'utterance', 'selfeval', 'tags']
['hit:0_conv:1', '1', 'sentimental', 'I remember going to the fireworks with my best friend. There was a lot of people_comma_ but it only felt like us in the world.', '1', 'I remember going to see the fireworks with my best friend. It was the first time we ever spent time alone together. Although there was a lot of people_comma_ we felt like the only people in the world.', '5|5|5_2|2|5', '']
['hit:0_conv:1', '2', 'sentimental', 'I remember going to the fireworks with my best friend. There was a lot of people_comma_ but it only felt like us in the world.', '0', 'Was this a friend you were in love with_comma_ or just a best friend?', '5|5|5_2|2|5', '']
['hit:0_conv:1', '3', 'sentimental', 'I remember going to the fireworks with my best friend. There was a lot of people_comma_ but it only felt like us in the world.', '1', 'This was a best friend. I miss her.', '5|5|5_2|2|5', '']
['hit:0_conv:

The set of columns we care about are:
1. context ==> emotion we're trying to predict
1. prompt + utterance ==> We'll combine these sentences and use them as input 

Parse the dataset file and create a label encoder that converts text labels to integer ids or vice versa

In [9]:
def parse_dataset(file_path, label_encoder):
  """
  Function to parse the csv into training or test dataset

  Input: Tuple[conv_id,utterance_idx,context,prompt,speaker_idx,utterance,selfeval,tags]
  Output: Tuple[label, merge sentences in the conversation]
  """
  data = []
  with open(file_path, 'r', newline='\n') as file:
    reader = csv.reader(file, delimiter = ',')
     # This skips the first row of the CSV file.
    next(reader)
    for row in reader:
      # This is a bad row if it is missing any of the entries
      if len(row) != 8:
        continue
      # Append the entry into the list of data points.
      data.append((label_encoder([row[2]])[0], row[3] + " " + row[5]))
  return data


# A lable encoder converts the text labels into integer ids
def get_label_encoder():
  """Get all the labels in a dataset and return two maps that convert labels -> id or vice versa.
  """
  # We pass an identity encoder since we still need the raw labels to train the label encoder
  raw_data = parse_dataset(TRAIN_FILE, lambda x: x)
  le = LabelEncoder()
  le.fit([x[0] for x in raw_data])
  return le

# Global variables used throughout the notebook
label_encoder = get_label_encoder()

Creating the global training, validation and test datasets from the data files.

In [10]:
training_data = parse_dataset(TRAIN_FILE, label_encoder.transform)
validation_data = parse_dataset(VALIDATION_FILE, label_encoder.transform)
test_data = parse_dataset(TEST_FILE, label_encoder.transform)

print('Shape of training dataset: ({rows}, {cols})'.format(rows=len(training_data), cols=len(training_data[0])))
print('Shape of validation dataset: ({rows}, {cols})'.format(rows=len(validation_data), cols=len(validation_data[0])))
print('Shape of test dataset: ({rows}, {cols})'.format(rows=len(test_data), cols=len(test_data[0])))

Shape of training dataset: (76668, 2)
Shape of validation dataset: (6313, 2)
Shape of test dataset: (5697, 2)


[Dataset and Data loaders](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html): Code for processing data samples can get messy and hard to maintain; we ideally want our dataset code to be decoupled from our model training code for better readability and modularity. PyTorch provides two data primitives: torch.utils.data.DataLoader and torch.utils.data.Dataset that allow you to use pre-loaded datasets as well as your own data. Dataset stores the samples and their corresponding labels, and DataLoader wraps an iterable around the Dataset to enable easy access to the samples.

[LightingDataModule](https://pytorch-lightning.readthedocs.io/en/latest/extensions/datamodules.html#datamodules): A datamodule is a shareable, reusable class that encapsulates all the steps needed to process data. A datamodule encapsulates the five steps involved in data processing in PyTorch:

1. Download / tokenize / process.
2. Clean and (maybe) save to disk.
3. Load inside Dataset.
4. Apply transforms (rotate, tokenize, etc…).
5. Wrap inside a DataLoader.


In [11]:
class ClassificationDataset(Dataset):
  """Creates an pytorch dataset to consume our pre-loaded csv data

  Reference: https://pytorch.org/tutorials/beginner/basics/data_tutorial.html 
  """
  def __init__(self, data, vectorizer):
    self.dataset = data
    # Vectorizer needs to implement a vectorize function that returns vector and tokens
    # 🌟🌟🌟 Pay extra attention here since you'll have to work on this in the models 🌟🌟🌟
    self.vectorizer = vectorizer

  def __len__(self):
    return len(self.dataset)

  def __getitem__(self, idx):
    (label, sentence) = self.dataset[idx]
    sentence_vector, sentence_tokens = self.vectorizer.vectorize(sentence)
    return {
        "vectors": sentence_vector,
        "label": label,
        "tokens": sentence_tokens, # for debugging only
        "sentence": sentence # for debugging only
      }

class ClassificationDataModule(pl.LightningDataModule):
  """LightningDataModule: Wrapper class for the dataset to be used in training
  """
  def __init__(self, vectorizer, params):
    super().__init__()
    self.params = params
    self.classification_train = ClassificationDataset(training_data, vectorizer)
    self.classification_val = ClassificationDataset(validation_data, vectorizer)
    self.classification_test = ClassificationDataset(test_data, vectorizer)

  # Function to convert the input raw data from the dataset into model input. 
  # 🌟🌟🌟 Pay extra attention here since you'll have to work on this in the models 🌟🌟🌟
  def collate_fn(self, batch):
    # Embedding layers need the inputs to be integer so we need to add this special case here.
    if self.params.integer_input: 
      word_vector = [torch.LongTensor(item["vectors"]) for item in batch]
      sentence_vector = pad_sequence(word_vector, batch_first=True, padding_value=0)
    else:
      sentence_vector = torch.stack([torch.Tensor(item["vectors"]) for item in batch])
    labels = torch.LongTensor([item["label"] for item in batch])
    return {"vectors": sentence_vector, "labels": labels, "sentences": [item["sentence"] for item in batch]}

  # Training dataloader .. will reset itself each epoch
  def train_dataloader(self):
    return DataLoader(self.classification_train, batch_size=self.params.batch_size, collate_fn=self.collate_fn)

  # Validation dataloader .. will reset itself each epoch
  def val_dataloader(self):
    return DataLoader(self.classification_val, batch_size=self.params.batch_size, collate_fn=self.collate_fn)

  # Test dataloader .. will reset itself each epoch
  def test_dataloader(self):
    return DataLoader(self.classification_test, batch_size=self.params.batch_size, collate_fn=self.collate_fn)

### Classfier and Trainer (Common to all solutions)

We've now created the DataLoader and Datasets we'll use in the entire project.It is time to write the training and testing loops. 

[LightingModule](https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html): organizes your PyTorch code into 5 sections

1. Computations (init).
2. Train loop (training_step)
3. Validation loop (validation_step)
4. Test loop (test_step)
5. Optimizers (configure_optimizers)

In [12]:
# 🌟🌟🌟 Pay extra attention here since you'll have to work on this in the models 🌟🌟🌟
class EmotionClassifier(pl.LightningModule):
  def __init__(self, model, params):
      super().__init__()
      self.model = model
      self.params = params
      self.accuracy = torchmetrics.Accuracy()

  def forward(self, x):
      return self.model(x)

  def training_step(self, batch, batch_idx):
    x = batch["vectors"]
    y = batch["labels"]
    y_hat = self(x)
    loss = F.cross_entropy(y_hat, y, reduction='mean')
    self.log_dict(
        {'train_loss': loss}, 
        batch_size=self.params.batch_size, 
        prog_bar=True
        )
    return loss
  
  def validation_step(self, batch, batch_nb):
    x = batch["vectors"]
    y = batch["labels"]
    y_hat = self(x)
    val_loss = F.cross_entropy(y_hat, y, reduction='mean')
    predictions = torch.argmax(y_hat, dim=1)
    self.log_dict(
        {
          'val_loss': val_loss,
          'val_accuracy': self.accuracy(predictions, y)
        },
        batch_size=self.params.batch_size,  
        prog_bar=True
      )
    return val_loss

  def test_step(self, batch, batch_nb):
    x = batch["vectors"]
    y = batch["labels"]
    y_hat = self(x)
    test_loss = F.cross_entropy(y_hat, y, reduction='mean')
    predictions = torch.argmax(y_hat, dim=1)
    self.log_dict(
        {
          'test_loss': test_loss,
          'test_accuracy': self.accuracy(predictions, y)
        },
        batch_size=self.params.batch_size, 
        prog_bar=True
      )
    return test_loss
  
  def predict_step(self, batch, batch_idx):
    y_hat = self.model(batch["vectors"])
    predictions = torch.argmax(y_hat, dim=1)
    return {'logits':y_hat, 'predictions': predictions, 'labels': batch["labels"], 'sentences': batch['sentences']}

  def configure_optimizers(self):
    optimizer = torch.optim.Adam(self.parameters(), lr=self.params.learning_rate)
    return optimizer

Once we have a Lightning and LightingDataModule, a [Trainer](https://pytorch-lightning.readthedocs.io/en/latest/common/trainer.html) automates everything else. It provides functions for training (fit), testing and inference. We ended up writing a helper function that takes the model, vectorizer and hyper parameters to be able to easily compare our different models.

In [13]:
def trainer(model, params, vectorizer):
  # Create a pytorch trainer
  trainer = pl.Trainer(max_epochs=params.max_epochs, check_val_every_n_epoch=1)

  # Initialize our data loader with the passed vectorizer
  data_module = ClassificationDataModule(vectorizer, params)

  # Instantiate a new model
  model = EmotionClassifier(model, params)

  # Train and validate the model
  trainer.fit(model, data_module.train_dataloader(), val_dataloaders=data_module.val_dataloader())

  # Test the model
  trainer.test(model, data_module.test_dataloader())

  # Predict on the same test set to show some output
  output = trainer.predict(model, data_module.test_dataloader())

  for i in range(2):
    print("-----------")
    print("Sentence: ", output[1]['sentences'][i])
    print("Predicted Emotion: ", label_encoder.inverse_transform([output[1]['predictions'][i].numpy()])[0])
    print("Actual Label: ", label_encoder.inverse_transform([output[1]['labels'][i].numpy()])[0])

# Models

Are we building models yet? Finally the time has come to build our baseline model and then we'll work towards improving it. 

### Model 1: Average word vector of the sentence -- Baseline
##### <font color='red'>Expected accuracy: ~30%</font>

Let's build our first simple word2vec based model we'll use as our baseline.

Here we've three key pieces:

1. *WordVectorClassificationModel*: Simple linear model that just has one single neuron layer that maps the input word2vec dimensions (300) to the output classes (32) building a really simple classifier.

In [14]:
class WordVectorClassificationModel(torch.nn.Module):
  def __init__(self, word_vec_dimension, num_classes):
    super().__init__()
    self.classes = num_classes
    self.linear_layer = torch.nn.Linear(word_vec_dimension, num_classes)
    
  # 🌟🌟🌟 Pay extra attention here since you'll have to work on this in the models 🌟🌟🌟
  def forward(self, batch):
    """Projection from word_vec_dim to n_classes

    Batch is of shape (batch_size, max_seq_len, word_vector_dim)
    """
    return self.linear_layer(batch)

2. *HParams*: a class that contains all the hyper parameters relevant for the current model.
3. *SpacyVectorizer*: Vectorizer that converts the text sentence into the input to the DataLoader's collate function. Basically we'll call vectorizer on each row of the input data and then call the collate_fn on each batch of items which is fed to the Neural Network.

It will take several minutes to train the model, so don't be alarmed if you don't get the result right away. When the cell finishes running, under the `DATALOADER:0 TEST RESULTS` section, you should see the `test_accuracy` field with a value of ~0.3.

In [15]:
class HParams:
  batch_size: int = 32
  integer_input: bool = False
  word_vec_dimension: int = 300
  num_classes: int = len(label_encoder.classes_)
  learning_rate: float = 0.001
  max_epochs: int = 4


# 🌟🌟🌟 Pay extra attention here since you'll have to work on this in the models 🌟🌟🌟
class SpacyVectorizer:
  def vectorize(self, sentence):
    """
    Given a sentence, tokenize it and reference pre-trained word vector for each token.

    Returns a tuple of sentence_vector and list of text tokens
    """
    sentence_vector = []
    sentence_tokens = []
    spacy_doc = loaded_spacy_model.make_doc(sentence) ## I am Sourabh
    word_vector = [token.vector for token in spacy_doc] ## [ [Embedding of I], [Embedding of am], [Embedding of UNK]] 
    sentence_tokens = list([token.text for token in spacy_doc])
    sentence_vector = np.mean(np.array(word_vector), axis=0)
    return sentence_vector, sentence_tokens


trainer(
    model=WordVectorClassificationModel(HParams.word_vec_dimension, 
                                        HParams.num_classes),
    params=HParams,
    vectorizer=SpacyVectorizer())

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
Missing logger folder: /content/lightning_logs

  | Name     | Type                          | Params
-----------------------------------------------------------
0 | model    | WordVectorClassificationModel | 9.6 K 
1 | accuracy | Accuracy                      | 0     
-----------------------------------------------------------
9.6 K     Trainable params
0         Non-trainable params
9.6 K     Total params
0.039     Total estimated model params size (MB)


Validation sanity check: 0it [00:00, ?it/s]

Training: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Testing: 0it [00:00, ?it/s]

--------------------------------------------------------------------------------
DATALOADER:0 TEST RESULTS
{'test_accuracy': 0.3310055732727051, 'test_loss': 2.4306912422180176}
--------------------------------------------------------------------------------


Predicting: 2396it [00:00, ?it/s]

-----------
Sentence:  All my friends live in a different country hi_comma_ i feel so lonely sometimes because all my friends live in a different country
Predicted Emotion:  lonely
Actual Label:  lonely
-----------
Sentence:  All my friends live in a different country i was thinking about it! I wanted to join a group for local moms
Predicted Emotion:  lonely
Actual Label:  lonely


🎉🎉🎉 WE HAVE OUR TEXT CLASSIFIER 🎉🎉🎉


Now might be a good time to play around with the vectorizer and the classifier.

### Assignment Part: 1 - Model 2: Sliding Window Word2Vec ---- TO BE COMPLETED
##### <font color='red'>Expected accuracy: ~40%</font>

We'll be re-using the simple linear model from Model-1 but changing the input to use sliding windows instead of one word at a time. 

Implement a new `SpacyChunkVectorizer` which is a variant of the `SpacyVectorizer` that operates. Here are some instructions on how to implement it.

1. Split the sentence into chunks of the size of the n_grams parameter.
2. Concat all the spacy embeddings of the tokens inside to create embeddings of the chuck. Each chunk vector is of the size of `3 * size_of_embedding`
3. Sentence vector is the average of all chunk vectors.
4. Return the sentence_vector and tokens (for debugging, option: could just return None for tokens)

Does this model perform better than our baseline? Why do you think that is?

In [None]:
class HParamsSpacy:
  batch_size: int = 32
  integer_input: bool = False
  word_vec_dimension: int = 300
  num_classes: int = len(label_encoder.classes_)
  learning_rate: float = 0.001
  max_epochs: int = 4
  n_grams: int = 3  ## ADAPT, Change it to your liking


class SpacyChunkVectorizer:
  def __init__(self, params):
    self.params = params


  def vectorize(self, sentence):
    """
    Given a sentence, tokenize it and returns a word vector for that sentence.
    Sentence_vector is of length (n_grams * word_vector_dim)
    Sentence_tokens is of length (n)
    Sentence is of length (n)
    """
    
    chunk_vector = []
    ngrams_vector = []
    sentence_vector = []
    sentence_tokens = []


    # 1. Split the sentence into tokens using Spacy's function make_doc
    spacy_doc = loaded_spacy_model.make_doc(sentence)

    # 2. Split the list of token into size of the n_grams parameter.
    word_vector = []
    token_vector = []
    if len(spacy_doc) < self.params.n_grams:
      # get the current chunk
      cur_word_vector = [token.vector for token in spacy_doc]
      cur_token_vector = [token.text for token in spacy_doc]

      # pad the chunks with enough zeros
      for i in range(self.params.n_grams - len(spacy_doc)):
        cur_word_vector.append(np.zeros(self.params.word_vec_dimension))
        cur_token_vector.append("")

      # add it back to the word_vector and token_vector
      word_vector.append(cur_word_vector)
      token_vector.append(cur_token_vector)
    else:
      for i in range(len(spacy_doc) - self.params.n_grams + 1):
        window = spacy_doc[i:i+self.params.n_grams]
        word_vector.append([token.vector for token in window])
        token_vector.append([token.text for token in window])
    

    # 3. Concat all the spacy embeddings of the tokens inside to create embeddings of the chuck. Each chunk vector is of the size of 3 * size_of_embedding
    sentence_tokens = [" ".join(text) for text in token_vector]
    ngrams_vector = [np.concatenate(words) for words in word_vector]

    # 4. Sentence vector is the average of all chunk vectors.
    sentence_vector = np.mean(ngrams_vector, axis=0)

    # 5. Return the sentence_vector and tokens (option: could just return None for tokens)
    
    return sentence_vector, sentence_tokens


trainer(
    model=WordVectorClassificationModel(
        # Observe the change in input parameters
        HParamsSpacy.word_vec_dimension * HParamsSpacy.n_grams,  
        HParamsSpacy.num_classes),
    params=HParamsSpacy,
    vectorizer=SpacyChunkVectorizer(HParamsSpacy))

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs

  | Name     | Type                          | Params
-----------------------------------------------------------
0 | model    | WordVectorClassificationModel | 28.8 K
1 | accuracy | Accuracy                      | 0     
-----------------------------------------------------------
28.8 K    Trainable params
0         Non-trainable params
28.8 K    Total params
0.115     Total estimated model params size (MB)


Validation sanity check: 0it [00:00, ?it/s]

Training: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Testing: 0it [00:00, ?it/s]

--------------------------------------------------------------------------------
DATALOADER:0 TEST RESULTS
{'test_accuracy': 0.38791900873184204, 'test_loss': 2.151660680770874}
--------------------------------------------------------------------------------


Predicting: 2396it [00:00, ?it/s]

-----------
Sentence:  All my friends live in a different country hi_comma_ i feel so lonely sometimes because all my friends live in a different country
Predicted Emotion:  lonely
Actual Label:  lonely
-----------
Sentence:  All my friends live in a different country i was thinking about it! I wanted to join a group for local moms
Predicted Emotion:  lonely
Actual Label:  lonely


### Assignment Part: 2 - Model 3: EmbeddingBag  ---- TO BE COMPLETED
##### <font color='red'>Expected accuracy: ~43%</font>

The third model we're going to build is an embedding layer based model. Here instead of using pre-trained word-embeddings we'll be creating new vectors as part of the training process. How do you think this model will perform?

Implementation has the following steps:

1. **`get_char_trigram_token_map`**: Implement a map that returns top `num_tokens` token in the corpus to some allocated ids between `1 to num_tokens`. Here are some steps that should help with the implementation.
      1. Compute a frequency map of the `num_tokens` most common  character trigrams in the training data. **Note: We're now moving away from words and moving to just using characters directly**.
      2. Add an extra key called "UNK" to capture unknown tokens.
      3. Create unique integer ids for all these tokens 1...N

2. **Vectorizer**: Implement a new vectorizer for each sentence that does the following:
    1. Get all trigrams for the sentence
    2. Get id for every trigram
    3. For missing trigrams in the map mark them as "UNK"
    4. Append all the ids into a list and that is your sentence vector

3. **Forward pass of the model**: Implement the forward pass of the model
    1. Pass the input batch through the embedding layer
    1. Pass the output of the embedding layer into our linear layer  


Some rational for trying this out is:
- We average the embeddings in a sentence anyway so word or not word maybe doesn't matter.
- Vocabalary of character trigrams is much smaller than word trigrams so our models are easier to train.

In [None]:

class HParamsCTT:
  batch_size: int = 16  ### NOTE THE CHANGE
  integer_input: bool = True  ### NOTE THE CHANGE
  num_classes: int = len(label_encoder.classes_)
  learning_rate: float = 0.001
  max_epochs: int = 4
  n_grams: int = 3
  embed_dim: int = 350 # ADDED, Change it to your liking
  num_tokens: int = 5000
  n_grams: int = 3
 

class CharacterTrigramTokenizer:
  """
  We represent a sentence as a vector of num_tokens tokens.
  If the trigram is present in the sentence then we add the token's id to the sentence
  """

  def __init__(self, train_data, num_tokens):
    self.num_tokens = num_tokens
    self.token_to_id_map = self.get_char_trigram_token_map(train_data, num_tokens)


  def get_char_trigram_token_map(self, train_data, num_tokens):
    """
    1. Compute a frequency map of the `num_tokens` most common trigrams in the training data.
    2. Add an extra key called "UNK" to capture unknown tokens.
    3. Create unique integer ids for all these tokens 1...N
    """
    token_to_id_map = {}
    frequency_map = {}
    token_to_id_map["UNK"] = 1
    n_grams = 3

    for _, sentence in train_data:
      for i in range(len(sentence)-n_grams+1):
        token = sentence[i:i+n_grams]
        if token not in frequency_map:
          frequency_map[token] = 0
        frequency_map[token] += 1

    # sort the frequency map items so that the tokens with the largest frequency
    # are at the head of the list
    frequency_tokens = sorted(frequency_map.items(), key=lambda x: x[1], reverse=True)

    # start at id 2 since the UNK token has id 1
    for i in range(2, self.num_tokens + 2):
      token_to_id_map[frequency_tokens[i][0]] = i

    return token_to_id_map
  
  def vectorize(self, sentence):
    """
    Given a sentence (string), do the following - 
    1. Get all trigrams for the sentence
    2. Get id for every trigram
    3. For missing trigrams in the map mark them as "UNK"
    4. Append all the ids into a list and that is your sentence vector
    """
    sentence_vector = []
    n_grams = 3

    for i in range(len(sentence)-n_grams+1):
      token = sentence[i:i+n_grams]
      id = self.token_to_id_map.get(token, self.token_to_id_map["UNK"])
      sentence_vector.append(id)
    
    return sentence_vector, None

Let's just validate the output of the tokenizer before we train the model. We should see something like:

```
Number of trigrams:  12997
UNK 1
 th 2
 i  3
the 4
my  5
```

In [None]:
characterTokenizer = CharacterTrigramTokenizer(training_data, HParamsCTT.num_tokens)
i = 0
for k, v in characterTokenizer.token_to_id_map.items():
  if i > 10:
    break
  print(k,v)
  i += 1


UNK 1
the 2
 to 3
ing 4
he  5
ng  6
to  7
nd  8
 a  9
 wa 10
ed  11


Now we can create the simple embedding layer based model and start training it.

In [None]:
class EmbeddingBagClassificationModel(torch.nn.Module):
  def __init__(self, num_tokens, embed_dim, n_classes):
    super().__init__()
    self.classes = n_classes
    self.embedding = torch.nn.EmbeddingBag(num_tokens, embed_dim, padding_idx=0)
    self.linear_layer = torch.nn.Linear(embed_dim, n_classes)
    
  def forward(self, batch):
    """Pass the input batch through the embedding layer and then follow it up with the linear layer
    """
    y = self.embedding(batch)
    y = self.linear_layer(y)

    return y


trainer(
    # Note the plus three: padding + UNK + 1 buffer
    model=EmbeddingBagClassificationModel(
        HParamsCTT.num_tokens + 3, 
        HParamsCTT.embed_dim, 
        HParamsCTT.num_classes), 
    params=HParamsCTT,
    vectorizer=characterTokenizer)

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs

  | Name     | Type                            | Params
-------------------------------------------------------------
0 | model    | EmbeddingBagClassificationModel | 1.8 M 
1 | accuracy | Accuracy                        | 0     
-------------------------------------------------------------
1.8 M     Trainable params
0         Non-trainable params
1.8 M     Total params
7.049     Total estimated model params size (MB)


Validation sanity check: 0it [00:00, ?it/s]

Training: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Testing: 0it [00:00, ?it/s]

--------------------------------------------------------------------------------
DATALOADER:0 TEST RESULTS
{'test_accuracy': 0.4306722581386566, 'test_loss': 1.9705818891525269}
--------------------------------------------------------------------------------


Predicting: 4792it [00:00, ?it/s]

-----------
Sentence:  i really hope my husband finds a full time job soon thank you so much!
Predicted Emotion:  hopeful
Actual Label:  hopeful
-----------
Sentence:  i really hope my husband finds a full time job soon he is an armed guard
Predicted Emotion:  hopeful
Actual Label:  hopeful


🎉 CONGRATS!!! on finishing the assignment. Now is a good time to pause and reflect how much progress we've made in understanding word vectors, reading some pytorch code and build our first model. But hey, don't stop here, there is a lot to do or play with in the next sections.

# Extensions

Now that you've worked through the project. There is a lot more for us to try.

- Which model performed the best? Why do you think that was?
- Try adding a hidden layer to the baseline of the models and see if that changes anything
- Does adding a hidden layer to the embedding bag model help?
- visualize a [confusion matrix](https://torchmetrics.readthedocs.io/en/latest/references/functional.html#confusion-matrix-func) of N*N of actual class vs predicted class (N = number of classes)