> DUPLICATE THIS COLAB TO START WORKING ON IT, using File > Save a copy in Drive.

# Prereq Week: Text Classification

### What are we building
We’ll continue to apply our learning philosophy of repetition as we build multiple classification models of increasing complexity in the following order:

1. Average of Word2Vec + MLP Layer
1. Can we concatenate 3 token embeddings and then average them? Does this do better than the previous method?
1. Build an embedding layer based model.
1. **Extension**: Explore different parameters, features and architectures. 

###  Evaluation
We’ll be evaluating our models on the following metrics:

1. Accuracy: the ratio of the number of correctly classified instances to the total number of instances
1. **Extension**: this is a multi-class classification problem, visualize a [confusion matrix](https://torchmetrics.readthedocs.io/en/latest/references/functional.html#confusion-matrix-func) of N*N of actual class vs predicted class (N = number of classes).
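
As a quick illustration of the two metrics above, here is a minimal sketch in plain Python. The toy labels and the helper names `accuracy` and `confusion_matrix` are made up for this example and are not part of the project code (the notebook itself relies on `torchmetrics`).

```python
def accuracy(actual, predicted):
    """Fraction of positions where the predicted class equals the actual class."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


def confusion_matrix(actual, predicted, num_classes):
    """N x N table: rows are actual classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix


# Toy example with 3 classes (hypothetical labels, not from the dataset)
actual = [0, 1, 2, 2, 1, 0]
predicted = [0, 2, 2, 2, 1, 1]

print(accuracy(actual, predicted))
for row in confusion_matrix(actual, predicted, num_classes=3):
    print(row)
```

The diagonal of the matrix counts correct predictions, so accuracy is the diagonal sum divided by the total count. One caveat: `MulticlassAccuracy` in torchmetrics accepts an `average` argument, and depending on its setting the logged number can be a per-class (macro) average rather than this plain ratio. It is worth checking the torchmetrics documentation for the default in your installed version.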


### Instructions

1. We've provided scaffolding for all the boilerplate PyTorch code to get to our first model. This covers downloading and parsing the dataset and the training code for the baseline model. **Make sure to read all the steps and internalize what is happening**.
1. At this point our model gets to an accuracy of about 0.32. After this we'll try to improve the model by using sliding windows of text instead of just one word at a time. **Does this improve accuracy?**
1. The third model we're going to build is an embedding layer based model. Here instead of using pre-trained word-embeddings we'll be creating new vectors as part of the training process. **How do you think this model will perform?**
1. **Extension**: We've suggested a bunch of extensions to the project, so go crazy: tweak any part of the pipeline and see if you can beat all the current models.

### Code Overview
- Dependencies: Python dependencies and loading the spacy model
- Project
  - Dataset: Download the conversation dataset and parse it into a pytorch Dataset
  - Trainer: Trainer function to help with multi-epoch training
  - Model 1: Simple Word2Vec + MLP model
  - Model 2: Sliding window trigram (Word2Vec)
  - Model 3: Embedding bag based model on Trigram
- Extensions
 


## Dependencies

In [5]:
import os
import subprocess

from sklearn.preprocessing import LabelEncoder
import torch
import torch.nn.functional as F
import torch.utils.data as tfdata
from torchmetrics.classification import MulticlassAccuracy
import spacy
from tqdm import tqdm
import numpy as np
import lightning as L
import pandas as pd
import warnings
from torch.nn.utils.rnn import pad_sequence
warnings.filterwarnings("ignore", ".*does not have many workers.*")

In [6]:
# Load the spaCy model
# python -m spacy download en_core_web_lg
nlp = spacy.load('en_core_web_lg')

# Fix the random seed so that we get consistent results
torch.manual_seed(0)
np.random.seed(0)

# Classifier Project
✨ Let's Begin ✨

### Data Loading and Processing (Common to ALL Solutions)

#### Dataset

We’ll be using the EmpatheticDialogues dataset open-sourced by Facebook ([link](https://research.fb.com/publications/towards-empathetic-open-domain-conversation-models-a-new-benchmark-and-dataset/)). It can be downloaded as a tarball from the following [link](https://dl.fbaipublicfiles.com/parlai/empatheticdialogues/empatheticdialogues.tar.gz).

A sample row from the dataset: 
```
conv_id,utterance_idx,context,prompt,speaker_idx,utterance,selfeval,tags
hit:12388_conv:24777,1,joyful,I felt overcome with emotions when Christmas came around as a kid,437,Christmas was the best time of year back in the day!,5|5|5_5|5|5, ''
```

Let's download and explore the dataset; the meaning of these columns should become clear as we go.

[Dataset and Data loaders](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html): Code for processing data samples can get messy and hard to maintain; we ideally want our dataset code to be decoupled from our model training code for better readability and modularity. PyTorch provides two data primitives: torch.utils.data.DataLoader and torch.utils.data.Dataset that allow you to use pre-loaded datasets as well as your own data. Dataset stores the samples and their corresponding labels, and DataLoader wraps an iterable around the Dataset to enable easy access to the samples.

[LightningDataModule](https://pytorch-lightning.readthedocs.io/en/latest/extensions/datamodules.html#datamodules): A datamodule is a shareable, reusable class that encapsulates all the steps needed to process data. A datamodule encapsulates the five steps involved in data processing in PyTorch:

1. Download / tokenize / process.
2. Clean and (maybe) save to disk.
3. Load inside Dataset.
4. Apply transforms (rotate, tokenize, etc…).
5. Wrap inside a DataLoader.


In [7]:
###########
# DATASET #
###########

class EmpatheticDataset(tfdata.Dataset):
    def __init__(self, data_dir: str = 'classification_data', split: str = 'train', transform=None):
        self.data_dir = data_dir
        self.dataset_url = 'https://dl.fbaipublicfiles.com/parlai/empatheticdialogues/empatheticdialogues.tar.gz'
        self.directory_name = 'empatheticdialogues'
        self.transform = transform
        # Check if the dataset directory already exists to avoid re-downloading
        if not os.path.exists(self.data_dir):
            os.makedirs(self.data_dir, exist_ok=True)
            # Download the dataset using wget
            subprocess.run(['wget', '-q', self.dataset_url, '-O', 'empatheticdialogues.tar.gz'])
            # Extract the dataset
            subprocess.run(['tar', '-xvf', 'empatheticdialogues.tar.gz', '-C', self.data_dir])
            # Remove the tar file to clean up
            os.remove('empatheticdialogues.tar.gz')
        else:
            print("Dataset already downloaded and extracted.")
        train_data_url = f"{self.data_dir}/{self.directory_name}/train.csv"
        val_data_url = f"{self.data_dir}/{self.directory_name}/valid.csv"
        test_data_url = f"{self.data_dir}/{self.directory_name}/test.csv"

        self.label_encoder = LabelEncoder()
        self.label_encoder.fit(pd.read_csv(train_data_url, on_bad_lines='skip')['context'])

        if split == 'train':
            self.pd_data = pd.read_csv(train_data_url, on_bad_lines='skip')
        elif split == 'val':
            self.pd_data = pd.read_csv(val_data_url, on_bad_lines='skip')
        elif split == 'test':
            self.pd_data = pd.read_csv(test_data_url, on_bad_lines='skip')
        else:
            raise ValueError("Invalid split. Must be one of 'train', 'val', or 'test'.")
        
        # Add a new column for the index of the context
        self.pd_data['context_idx'] = self.label_encoder.transform(self.pd_data['context'])

        # Apply the transform function to the data
        self.data = [self.transform(item) if self.transform else item for _, item in tqdm(list(self.pd_data.iterrows()), "Applying dataset transform")]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

Let's poke around our dataset a little!

In [8]:
# Show the internal Pandas DataFrame within the EmpatheticDataset
sample_dataset = EmpatheticDataset()
sample_dataset.pd_data

Dataset already downloaded and extracted.


Applying dataset transform: 100%|██████████| 76668/76668 [00:00<00:00, 4780129.91it/s]


Unnamed: 0,conv_id,utterance_idx,context,prompt,speaker_idx,utterance,selfeval,tags,context_idx
0,hit:0_conv:1,1,sentimental,I remember going to the fireworks with my best...,1,I remember going to see the fireworks with my ...,5|5|5_2|2|5,,28
1,hit:0_conv:1,2,sentimental,I remember going to the fireworks with my best...,0,Was this a friend you were in love with_comma_...,5|5|5_2|2|5,,28
2,hit:0_conv:1,3,sentimental,I remember going to the fireworks with my best...,1,This was a best friend. I miss her.,5|5|5_2|2|5,,28
3,hit:0_conv:1,4,sentimental,I remember going to the fireworks with my best...,0,Where has she gone?,5|5|5_2|2|5,,28
4,hit:0_conv:1,5,sentimental,I remember going to the fireworks with my best...,1,We no longer talk.,5|5|5_2|2|5,,28
...,...,...,...,...,...,...,...,...,...
76663,hit:12424_conv:24848,5,sentimental,I found some pictures of my grandma in the att...,389,Yeah reminds me of the good old days. I miss ...,5|5|5_5|5|5,,28
76664,hit:12424_conv:24849,1,surprised,I woke up this morning to my wife telling me s...,294,I woke up this morning to my wife telling me s...,5|5|5_5|5|5,,29
76665,hit:12424_conv:24849,2,surprised,I woke up this morning to my wife telling me s...,389,Oh hey that's awesome! That is awesome right?,5|5|5_5|5|5,,29
76666,hit:12424_conv:24849,3,surprised,I woke up this morning to my wife telling me s...,294,It is soooo awesome. We have been wanting a b...,5|5|5_5|5|5,,29


Let's explore the label encoder in the data module.  It should be able to convert the string labels to integers and vice versa.

In [9]:
print(f'Label encoder classes: {sample_dataset.label_encoder.classes_}')
print(f'Label for "sad": {sample_dataset.label_encoder.transform(["sad"])}')
print(f'Label for "hopeful": {sample_dataset.label_encoder.transform(["hopeful"])}')
print(f'Label for "angry": {sample_dataset.label_encoder.transform(["angry"])}')

Label encoder classes: ['afraid' 'angry' 'annoyed' 'anticipating' 'anxious' 'apprehensive'
 'ashamed' 'caring' 'confident' 'content' 'devastated' 'disappointed'
 'disgusted' 'embarrassed' 'excited' 'faithful' 'furious' 'grateful'
 'guilty' 'hopeful' 'impressed' 'jealous' 'joyful' 'lonely' 'nostalgic'
 'prepared' 'proud' 'sad' 'sentimental' 'surprised' 'terrified' 'trusting']
Label for "sad": [27]
Label for "hopeful": [19]
Label for "angry": [1]


### Transform
Now, let's create a transform to extract the context and the utterance from the dataset.

The columns we care about are:
1. "context": This is the emotion we're trying to predict (this has already been converted to a number using the dataset label encoder)
1. "prompt" and "utterance": We'll combine these sentences and use them as input 

In [10]:
def transform_mean_vectors(item):
    # Combine 'prompt' and 'utterance' into a single string
    input_string = item['prompt'] + ' ' + item['utterance']
    # Vectorize the input string
    x = np.mean([token.vector for token in nlp.make_doc(input_string)], axis=0)
    # Retrieve the context label
    y = item['context_idx'] # This column was added in the EmpatheticDataset class __init__ method
    return {
        'input_string': input_string, # This is useful for visualization/debugging
        'sentiment': item['context'], # This is useful for visualization/debugging
        'input_vector': x,
        'sentiment_idx': y
    }

### Collate
Next, we'll create a collate function that stacks the per-sample input vectors and emotion labels into batch tensors.

In [11]:
def collate_fn(batch):
    # A torch data loader will combine a list of samples into a batch
    # This function will be used to process the batch
    # batch is a list of the outputs from the __getitem__ method of the EmpatheticDataset class
    # Separate the batch into input and target, then convert to tensors
    input_tensors = torch.stack([torch.tensor(item['input_vector']) for item in batch])
    target_tensors = torch.stack([torch.tensor(item['sentiment_idx']) for item in batch])
    return {
        'input': input_tensors,
        'target': target_tensors,
        'input_string': [item['input_string'] for item in batch], # This is useful for visualization/debugging
        'sentiment': [item['sentiment'] for item in batch] # This is useful for visualization/debugging
    }

In [12]:
class EmpatheticDialoguesDataModule(L.LightningDataModule):
    def __init__(self, batch_size=32, collate_fn=None, transform=None):
        super().__init__()
        self.batch_size = batch_size
        self.collate_fn = collate_fn
        self.transform = transform

    def prepare_data(self):
        # This downloads the dataset (if needed) and prepares it
        # The only thing kept on the data module is the label encoder fitted on the training split
        EmpatheticDataset()
        self.label_encoder = EmpatheticDataset().label_encoder

    def setup(self, stage):
        # Retrieve the dataset from disk
        self.train_dataset = EmpatheticDataset(split='train', transform=self.transform)
        self.val_dataset = EmpatheticDataset(split='val', transform=self.transform)
        self.test_dataset = EmpatheticDataset(split='test', transform=self.transform)

    def train_dataloader(self):
        return tfdata.DataLoader(self.train_dataset,
                               batch_size=self.batch_size,
                               collate_fn=self.collate_fn,
                               num_workers=0,
                               shuffle=True)

    def val_dataloader(self):
        return tfdata.DataLoader(self.val_dataset,
                               batch_size=self.batch_size,
                               collate_fn=self.collate_fn,
                               num_workers=0)

    def test_dataloader(self):
        return tfdata.DataLoader(self.test_dataset,
                               batch_size=self.batch_size,
                               collate_fn=self.collate_fn,
                               num_workers=0)

### Classifier Module

We've now created the Datasets and DataLoaders we'll use throughout the project. It is time to write the training and testing code via a `LightningModule`.

[LightningModule](https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html): organizes your PyTorch code into 5 sections

1. Computations (init).
2. Train loop (training_step)
3. Validation loop (validation_step)
4. Test loop (test_step)
5. Optimizers (configure_optimizers)

In [13]:
class EmotionClassifier(L.LightningModule):
  def __init__(self, model, batch_size, learning_rate, num_classes):
      super().__init__()
      self.model = model
      self.batch_size = batch_size
      self.learning_rate = learning_rate
      self.accuracy = MulticlassAccuracy(num_classes=num_classes)

  def training_step(self, batch, batch_idx):
    x = batch["input"]
    y = batch["target"]
    y_hat = self.model(x)
    loss = F.cross_entropy(y_hat, y)
    self.log_dict(
      {'train_loss': loss},
      batch_size=self.batch_size,
      prog_bar=True
    )
    return loss
  
  def validation_step(self, batch, batch_nb):
    x = batch["input"]
    y = batch["target"]
    y_hat = self.model(x)
    val_loss = F.cross_entropy(y_hat, y)
    predictions = torch.argmax(y_hat, dim=1)
    self.log_dict(
        {
          'val_loss': val_loss,
          'val_accuracy': self.accuracy(predictions, y)
        },
        batch_size=self.batch_size,
        prog_bar=True
      )
    return val_loss

  def test_step(self, batch, batch_nb):
    x = batch["input"]
    y = batch["target"]
    y_hat = self.model(x)
    test_loss = F.cross_entropy(y_hat, y)
    predictions = torch.argmax(y_hat, dim=1)
    self.log_dict(
        {
          'test_loss': test_loss,
          'test_accuracy': self.accuracy(predictions, y)
        },
        batch_size=self.batch_size,
        prog_bar=True
      )
    return test_loss
  
  def predict_step(self, batch, batch_idx):
    y_hat = self.model(batch["input"])
    predictions = torch.argmax(y_hat, dim=1)
    return {'logits':y_hat, 'predictions': predictions, 'sentiment_labels': batch["sentiment"], 'input_string': batch['input_string']}

  def configure_optimizers(self):
    optimizer = torch.optim.Adam(self.parameters(), lr=self.learning_rate)
    return optimizer

# Models

Are we building models yet? Finally the time has come to build our baseline model and then we'll work towards improving it.

### Model 1: Average word vector of the sentence -- Baseline
##### <font color='red'>Expected accuracy: ~29 - 32%</font>

Let's build our first simple word2vec based model we'll use as our baseline.

Here we've three key pieces:

1. *WordVectorClassificationModel*: A simple linear model with a single linear layer that maps the input word2vec dimensions (300) to the output classes (32), giving us a really simple classifier.

In [14]:
class WordVectorClassificationModel(torch.nn.Module):
  def __init__(self, word_vec_dimension, num_classes):
    super().__init__()
    self.linear_layer = torch.nn.Linear(word_vec_dimension, num_classes)

  # 🌟🌟🌟 Pay extra attention here since you'll have to work on this in the models 🌟🌟🌟
  def forward(self, batch):
    """Projection from word_vec_dim to n_classes

    Batch is of shape (batch_size, word_vec_dimension)
    """
    return self.linear_layer(batch)
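
To make the shapes concrete, the sketch below mimics what a single linear layer computes (`logits = x @ W.T + b`) using plain Python lists, and then checks the parameter count for the real dimensions. The tiny toy dimensions and the helper `linear_forward` are illustrative assumptions, not part of the notebook's model.

```python
def linear_forward(x, weight, bias):
    """Compute x @ weight.T + bias for one input vector.

    x:      list of length in_dim
    weight: list of out_dim rows, each of length in_dim
    bias:   list of length out_dim
    """
    return [
        sum(w_i * x_i for w_i, x_i in zip(row, x)) + b
        for row, b in zip(weight, bias)
    ]


# Toy layer: 3 input features -> 2 classes (hypothetical numbers)
weight = [[1.0, 0.0, -1.0],
          [0.5, 0.5, 0.5]]
bias = [0.0, 1.0]
x = [2.0, 4.0, 6.0]

logits = linear_forward(x, weight, bias)
print(logits)  # one score per class

# Parameter count for the real layer:
# one weight per (input dimension, class) pair plus one bias per class
word_vec_dimension, num_classes = 300, 32
num_params = word_vec_dimension * num_classes + num_classes
print(num_params)
```

The count of 9,632 parameters is consistent with the `9.6 K` reported in the training summary for Model 1. The same formula with a 900-dimensional input gives 28,832, which lines up with the `28.8 K` reported for Model 2.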

### Trainer
Now, let's use the Lightning [`Trainer`](https://lightning.ai/docs/pytorch/latest/common/trainer.html) to train our model.

In [15]:
def train(
    model,
    transform,
    collate_fn,
    batch_size=32,
    max_epochs=4,
    learning_rate=0.001,
):
    # Create a Lightning trainer
    trainer = L.Trainer(max_epochs=max_epochs, check_val_every_n_epoch=1)

    # Initialize our data module with the passed transform and collate function
    data_module = EmpatheticDialoguesDataModule(collate_fn=collate_fn,
                                                transform=transform,
                                                batch_size=batch_size)
    data_module.prepare_data()
    data_module.setup('fit')

    # Instantiate a new lightning module
    module = EmotionClassifier(model,
                            batch_size=batch_size,
                            learning_rate=learning_rate,
                            num_classes=len(data_module.label_encoder.classes_))

    # Train and validate the model
    trainer.fit(module, data_module.train_dataloader(), val_dataloaders=data_module.val_dataloader())

    # Test the model
    trainer.test(module, data_module.test_dataloader())

    # Predict on the same test set to show some output
    output = trainer.predict(module, data_module.test_dataloader())

    # Un-collate the output
    uncollated_output = {}
    for batch_output in output:
        for k, v in batch_output.items():
            if k not in uncollated_output:
                uncollated_output[k] = []
            uncollated_output[k].extend(v)

    return uncollated_output, data_module.label_encoder
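
The un-collate loop at the end of `train` flattens a list of per-batch dictionaries into one dictionary of per-sample lists. A minimal sketch with made-up batches follows; plain lists stand in for tensors, and the helper name `uncollate` is hypothetical.

```python
def uncollate(batch_outputs):
    """Merge a list of per-batch dicts into a single dict of flat lists."""
    merged = {}
    for batch_output in batch_outputs:
        for key, values in batch_output.items():
            # Create the list on first sight of a key, then append this batch's items
            merged.setdefault(key, []).extend(values)
    return merged


# Two toy "batches" of predictions (hypothetical values)
batches = [
    {"predictions": [3, 1], "input_string": ["a", "b"]},
    {"predictions": [2], "input_string": ["c"]},
]

flat = uncollate(batches)
print(flat)
```

After flattening, index `i` in every list refers to the same test sample, which is what lets the exploration cell print an input string next to its predicted and actual label.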

In [16]:
uncollated_output, label_encoder = train(
    model=WordVectorClassificationModel(word_vec_dimension=300, num_classes=32),
    transform=transform_mean_vectors,
    collate_fn=collate_fn,
    batch_size=256,
    max_epochs=4,
    learning_rate=0.001,
)

# Feel free to explore the uncollated_output dictionary to see the predictions
for _ in range(5):
    # Randomly select a sample from the test set
    i = np.random.randint(len(uncollated_output['input_string']))
    print(f'Input: {uncollated_output["input_string"][i]}')
    print(f'Predicted label: {label_encoder.inverse_transform([uncollated_output["predictions"][i].item()])}')
    print(f'Actual label: {uncollated_output["sentiment_labels"][i]}')
    print()

GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/Users/grinch/Developer/nlp_course/nlp_course_env/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `lightning.pytorch` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default


Dataset already downloaded and extracted.


Applying dataset transform: 100%|██████████| 76668/76668 [00:00<00:00, 6503304.53it/s]


Dataset already downloaded and extracted.


Applying dataset transform: 100%|██████████| 76668/76668 [00:00<00:00, 6568126.37it/s]


Dataset already downloaded and extracted.


Applying dataset transform: 100%|██████████| 76668/76668 [00:10<00:00, 7290.07it/s]


Dataset already downloaded and extracted.


Applying dataset transform: 100%|██████████| 6318/6318 [00:00<00:00, 6533.52it/s]


Dataset already downloaded and extracted.


Applying dataset transform: 100%|██████████| 5701/5701 [00:00<00:00, 6136.24it/s]

  | Name     | Type                          | Params
-----------------------------------------------------------
0 | model    | WordVectorClassificationModel | 9.6 K 
1 | accuracy | MulticlassAccuracy            | 0     
-----------------------------------------------------------
9.6 K     Trainable params
0         Non-trainable params
9.6 K     Total params
0.039     Total estimated model params size (MB)


Epoch 3: 100%|██████████| 300/300 [00:02<00:00, 121.16it/s, v_num=46, train_loss=2.270, val_loss=2.330, val_accuracy=0.331]

`Trainer.fit` stopped: `max_epochs=4` reached.


Epoch 3: 100%|██████████| 300/300 [00:02<00:00, 121.03it/s, v_num=46, train_loss=2.270, val_loss=2.330, val_accuracy=0.331]
Testing DataLoader 0: 100%|██████████| 23/23 [00:01<00:00, 11.99it/s]
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       Test metric             DataLoader 0
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
      test_accuracy         0.32079431414604187
        test_loss           2.3128020763397217
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Predicting DataLoader 0: 100%|██████████| 23/23 [00:00<00:00, 172.02it/s]
Input: hopeful: I know things will work out in God's perfect timing Yes I am. For everything we want for the baby to be gifted to us lol
Predicted label: ['hopeful']
Actual label: hopeful

Input: I lost my temper with my daughter 

🎉🎉🎉 WE HAVE OUR TEXT CLASSIFIER 🎉🎉🎉

Now might be a good time to play around with the vectorizer and the classifier.

### Assignment Part: 1 - Model 2: Sliding Window of Vectors ---- TO BE COMPLETED
##### <font color='red'>Expected accuracy: ~30 to 36%</font>

We'll be re-using the simple linear model from Model-1 but changing the input to use sliding windows instead of one word at a time.

Implement a new `transform_sliding_window`, a variant of `transform_mean_vectors` that operates on sliding windows of tokens instead of single tokens. Here are some instructions on how to implement it.

1. Split the sentence into chunks of the size of the n_grams parameter.
2. Concatenate the spaCy embeddings of the tokens in each chunk to create the chunk embedding. Each chunk vector has size `n_grams * size_of_embedding`.
3. The sentence vector is the average of all chunk vectors.
4. Return the sentence vector and the tokens (for debugging; optionally, just return None for the tokens).

Does this model perform better than our baseline? Why do you think that is?
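
To see the shapes involved, here is a small NumPy sketch of the padded sliding-window idea, using made-up 2-dimensional "embeddings" instead of spaCy's 300-dimensional vectors. The function name `sliding_window_mean` and the toy vectors are assumptions for illustration; this is one possible encoding, not the only valid one.

```python
import numpy as np


def sliding_window_mean(token_vectors, n_grams=3):
    """Average of concatenated n-gram windows over a zero-padded token sequence."""
    dim = token_vectors[0].shape[0]
    # Pad with (n_grams - 1) zero vectors on each side so that every token
    # appears in every position of the window
    pad = [np.zeros(dim)] * (n_grams - 1)
    padded = pad + list(token_vectors) + pad
    # Each window concatenates n_grams consecutive vectors -> (n_grams * dim,)
    windows = [
        np.concatenate(padded[i:i + n_grams])
        for i in range(len(padded) - n_grams + 1)
    ]
    return np.mean(windows, axis=0)


# Two toy tokens with 2-dimensional vectors (hypothetical values)
tokens = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
vector = sliding_window_mean(tokens, n_grams=3)
print(vector.shape)
print(vector)
```

In this toy output every 2-dimensional block of the result is the same. With full padding each token passes through every window position exactly once, so each block ends up being the sum of all token vectors divided by the number of windows. That may be a useful hint when thinking about why this variant does or does not beat the baseline, and about which other encodings (for example, less padding) might carry more word-order information.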

In [17]:
def transform_sliding_window(item):
    """Given a sentence, tokenize it and calculate a vector for that sentence.

    Sentence is of length (n)

    1. Split the sentence into tokens using spaCy's function make_doc
    2. Split the list of tokens into chunks of the size of the n_grams parameter.
    3. Concatenate the spaCy embeddings of the tokens in each chunk to create the chunk embedding. Each chunk vector is of the size of n_grams * size_of_embedding
    4. The sentence vector is the average of all chunk vectors.
    5. Return the sentence_vector and tokens (optional: could just return None for tokens)

    Sentence_vector is of length (n_grams * word_vector_dim)
    Sentence_tokens is of length (n)

    Example of word tri-gram encoding: "I am doing great right now.":
      <EMPTY (300,)> <EMPTY (300,)> <I (300,)> -> (900, )
      <EMPTY (300,)> <I (300,)> <am (300,)> -> (900, )
      <I (300,)> <am (300,)> <doing (300,)> -> (900, )
      <am (300,)> <doing (300,)> <great (300,)> -> (900, )
      <doing (300,)> <great (300,)> <right (300,)> -> (900, )
      <great (300,)> <right (300,)> <now (300,)> -> (900, )
      <right (300,)> <now (300,)> <EMPTY (300,)> -> (900, )
      <now (300,)> <EMPTY (300,)> <EMPTY (300,)> -> (900, )

    We'd encourage you to also try other variants to encode!
    """
    n_grams: int = 3  ## ADAPT, Change it to your liking

    # Combine 'prompt' and 'utterance' into a single string
    input_string = item['prompt'] + ' ' + item['utterance']
    ### TO BE IMPLEMENTED ###
    sentence_vectors = [token.vector for token in nlp.make_doc(input_string)]
    # pad the sentence_vectors with (n_gram-1) empty vectors on both sides
    sentence_vectors = [np.zeros_like(sentence_vectors[0])]*(n_grams-1) + sentence_vectors + [np.zeros_like(sentence_vectors[0])]*(n_grams-1)
    # create the n-gram chunks
    split_sentence = []
    for i in range(len(sentence_vectors) - n_grams + 1):
        split_sentence.append(np.concatenate(sentence_vectors[i:i+n_grams], axis=0))
    # calculate the sentence vector
    x = np.mean(split_sentence, axis=0)
    ### TO BE IMPLEMENTED ###
    # Retrieve the context label
    y = item['context_idx'] # This column was added in the EmpatheticDataset class __init__ method
    return {
        'input_string': input_string, # This is useful for visualization/debugging
        'sentiment': item['context'], # This is useful for visualization/debugging
        'input_vector': x,
        'sentiment_idx': y
    }

Now, let's train the model and see how it performs.

In [18]:
uncollated_output, label_encoder = train(
    model=WordVectorClassificationModel(word_vec_dimension=300*3, num_classes=32),
    transform=transform_sliding_window,
    collate_fn=collate_fn,
    batch_size=256,
    max_epochs=4,
    learning_rate=0.001,
)

# Feel free to explore the uncollated_output dictionary to see the predictions
for _ in range(5):
    # Randomly select a sample from the test set
    i = np.random.randint(len(uncollated_output['input_string']))
    print(f'Input: {uncollated_output["input_string"][i]}')
    print(f'Predicted label: {label_encoder.inverse_transform([uncollated_output["predictions"][i].item()])}')
    print(f'Actual label: {uncollated_output["sentiment_labels"][i]}')
    print()

GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


Dataset already downloaded and extracted.


Applying dataset transform: 100%|██████████| 76668/76668 [00:00<00:00, 6125588.60it/s]


Dataset already downloaded and extracted.


Applying dataset transform: 100%|██████████| 76668/76668 [00:00<00:00, 6416747.80it/s]


Dataset already downloaded and extracted.


Applying dataset transform: 100%|██████████| 76668/76668 [00:13<00:00, 5665.63it/s]


Dataset already downloaded and extracted.


Applying dataset transform: 100%|██████████| 6318/6318 [00:01<00:00, 5005.98it/s]


Dataset already downloaded and extracted.


Applying dataset transform: 100%|██████████| 5701/5701 [00:01<00:00, 4543.41it/s]

  | Name     | Type                          | Params
-----------------------------------------------------------
0 | model    | WordVectorClassificationModel | 28.8 K
1 | accuracy | MulticlassAccuracy            | 0     
-----------------------------------------------------------
28.8 K    Trainable params
0         Non-trainable params
28.8 K    Total params
0.115     Total estimated model params size (MB)


Epoch 3: 100%|██████████| 300/300 [00:02<00:00, 120.33it/s, v_num=47, train_loss=2.020, val_loss=2.230, val_accuracy=0.351]

`Trainer.fit` stopped: `max_epochs=4` reached.


Epoch 3: 100%|██████████| 300/300 [00:02<00:00, 120.15it/s, v_num=47, train_loss=2.020, val_loss=2.230, val_accuracy=0.351]
Testing DataLoader 0: 100%|██████████| 23/23 [00:00<00:00, 78.38it/s]
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       Test metric             DataLoader 0
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
      test_accuracy         0.34902888536453247
        test_loss           2.2372679710388184
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Predicting DataLoader 0: 100%|██████████| 23/23 [00:00<00:00, 224.64it/s]
Input: I told my brother a secret and I hope he keeps it! I told my brother that I was planning to propose to my best friend_comma_ but I hope he keeps it a secret!
Predicted label: ['trusting']
Actual label: trusting

Input: Play

### Assignment Part: 2 - Model 3: EmbeddingBag  ---- TO BE COMPLETED
##### <font color='red'>Expected accuracy: ~32 to 38%</font>

The third model we're going to build is an embedding layer based model. Here instead of using pre-trained word-embeddings we'll be creating new vectors as part of the training process. How do you think this model will perform?

Implementation has the following steps:

1. **`get_char_trigram_token_map`**: Create a map of trigram to trigram_id for the most common trigrams in the training sentences. Here are some steps that should help with the implementation.
      1. Compute a frequency map of the `num_tokens` most common character trigrams in the training data. **Note: A trigram is a group of three characters. The model won't understand words, so don't use the spaCy tokenizer for this!**
      2. Create unique integer ids for all these trigrams 1...N (don't use 0 because that is used for padding!)

2. **Vectorizer**: Implement a new transform for each sentence that does the following:
    1. Get all trigrams for the sentence
    2. Get id for every trigram
    3. Append all the ids into a list; that list is your input sentence vector

3. **Forward pass of the model**: Implement the forward pass of the model
    1. Pass the input batch through the embedding layer
    1. Pass the output of the embedding layer into our linear layer  

Note:  We have provided a new `collate_fn` because the input to the model is now a padded list of integers.

Some rationale for trying this out:
- We average the embeddings in a sentence anyway, so whether a token is a word or not may not matter.
- The vocabulary of character trigrams is much smaller than that of word trigrams, so our models are easier to train.
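
The forward pass described in step 3 can be sketched without PyTorch to show what "embed, then average" means when sequences are padded with id 0. The lookup table, the ids, and the function name `embed_and_average` below are made-up assumptions; in the notebook the table would be a trainable embedding layer rather than a fixed NumPy array.

```python
import numpy as np


def embed_and_average(token_ids, embedding_table, padding_id=0):
    """Look up each id and average over the non-padding positions.

    token_ids:       (batch_size, max_seq_len) integer array, padded with padding_id
    embedding_table: (vocab_size + 1, embedding_dim) array; row 0 is reserved for padding
    """
    embedded = embedding_table[token_ids]            # (batch, seq, dim)
    mask = (token_ids != padding_id)[..., None]      # (batch, seq, 1)
    summed = (embedded * mask).sum(axis=1)           # padded positions contribute nothing
    counts = np.maximum(mask.sum(axis=1), 1)         # avoid dividing by zero for empty rows
    return summed / counts                           # (batch, dim)


# Toy table: 3 trigram ids (1..3) plus the padding row 0, embedding_dim = 2
embedding_table = np.array([
    [0.0, 0.0],   # id 0: padding
    [1.0, 0.0],   # id 1
    [0.0, 2.0],   # id 2
    [4.0, 4.0],   # id 3
])
# Two sentences; the second is shorter and padded with 0
token_ids = np.array([
    [1, 2, 3],
    [3, 0, 0],
])

sentence_vectors = embed_and_average(token_ids, embedding_table)
print(sentence_vectors)
```

The resulting `(batch_size, embedding_dim)` matrix is what would then be passed to the linear layer. In PyTorch, `torch.nn.EmbeddingBag` with `mode='mean'` combines the lookup and the averaging in one layer, which is one option to consider for this model.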

In [19]:
from collections import Counter

def get_char_trigram_token_map(train_data, num_tokens, verbose=True):
    """
    1. Compute a frequency map of the `num_tokens` most common trigrams in the training data.
    2. Create unique integer ids for all these tokens 1...N

    We HIGHLY recommend using the `collections.Counter` class to compute the frequency map.

    Args:
    train_data: List of strings from the training corpus
    """
    token_to_id_map = {}

    ### TO BE IMPLEMENTED ###
    def iterate_trigrams(sentences):
        if verbose:
            sentences = tqdm(sentences, "Computing trigrams")
        for sentence in sentences:
            for i in range(len(sentence) - 2):
                yield sentence[i:i+3]
    counter = Counter(iterate_trigrams(train_data))
    token_to_id_map = {k: i+1 for i, (k, _) in enumerate(counter.most_common(num_tokens))}
    ### TO BE IMPLEMENTED ###

    return token_to_id_map

In [20]:
get_char_trigram_token_map(["The quick brown fox jumps over the lazy dog"], 10)

Computing trigrams: 100%|██████████| 1/1 [00:00<00:00, 7570.95it/s]


{'he ': 1,
 'The': 2,
 'e q': 3,
 ' qu': 4,
 'qui': 5,
 'uic': 6,
 'ick': 7,
 'ck ': 8,
 'k b': 9,
 ' br': 10}

Let's just validate the output of the tokenizer before we train the model. We should see something like:

```
[(' th', 1), (' I ', 2), ('the', 3), (' to', 4), ('ing', 5)]
```

In [21]:
token_map = get_char_trigram_token_map(EmpatheticDataset(transform=lambda item: item['prompt'] + ' ' + item['utterance']), 5000)
list(token_map.items())[:5]

Dataset already downloaded and extracted.


Applying dataset transform: 100%|██████████| 76668/76668 [00:00<00:00, 386821.42it/s]
Computing trigrams: 100%|██████████| 76668/76668 [00:01<00:00, 54422.42it/s]


[(' th', 1), (' I ', 2), ('the', 3), (' to', 4), ('ing', 5)]

In [22]:
def get_char_trigram_token_transform(token_map):
    """ Returns a transform function that converts a string to a list of token ids using the token_map.
     
    This effectively saves the token_map into the transform function.
    """
    def transform_trigram_tokenize(item):
        # Combine 'prompt' and 'utterance' into a single string
        input_string = item['prompt'] + ' ' + item['utterance']
        # Calculate the tokenized input.
        # Given an input sentence (input_string), do the following:
        # 1. Get all character trigrams for the sentence
        # 2. Look up the id of every trigram that exists in token_map (unknown trigrams are skipped)
        # 3. Collect the ids into a list; that list is your sentence representation
        ### TO BE IMPLEMENTED ###
        def iterate_trigrams(sentence):
            for i in range(len(sentence) - 2):
                yield sentence[i:i+3]
        x = [token_map[trigram] for trigram in iterate_trigrams(input_string) if trigram in token_map]
        ### TO BE IMPLEMENTED ###
        # Retrieve the context label
        y = item['context_idx'] # This column was added in the EmpatheticDataset class __init__ method
        return {
            'input_string': input_string, # This is useful for visualization/debugging
            'sentiment': item['context'], # This is useful for visualization/debugging
            'input_tokens': x,
            'sentiment_idx': y
        }
    return transform_trigram_tokenize
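
To make the tokenization step concrete, here is a minimal standalone sketch of the same lookup on a toy vocabulary. `trigram_ids` and `toy_token_map` are hypothetical names used only for illustration. The real transform also carries the label and debug fields.

```python
def trigram_ids(text, token_map):
    """Map each known character trigram in `text` to its id; unknown trigrams are skipped."""
    trigrams = (text[i:i + 3] for i in range(len(text) - 2))
    return [token_map[t] for t in trigrams if t in token_map]

toy_token_map = {"the": 1, "he ": 2, " ca": 3}  # hypothetical toy vocabulary

ids_known = trigram_ids("the cat", toy_token_map)
ids_short = trigram_ids("hi", toy_token_map)
print(ids_known)  # → [1, 2, 3]  ("e c" and "cat" are not in the map, so they are dropped)
print(ids_short)  # → []         (fewer than 3 characters, so there are no trigrams)
```

Note that an input with no known trigrams produces an empty list, which is worth keeping in mind when the ids are later averaged.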

Here is a new collate function that prepares the input as a padded tensor of integers.

In [23]:
def collate_fn_tokens(batch):
    # A torch data loader will combine a list of samples into a batch
    # This function will be used to process the batch
    # batch is a list of the outputs from the __getitem__ method of the EmpatheticDataset class
    # Separate the batch into input and target, then convert to tensors
    PAD_TOKEN = 0
    input_tensors = pad_sequence([torch.tensor(item['input_tokens'], dtype=torch.long) for item in batch],
                                 batch_first=True,
                                 padding_value=PAD_TOKEN)
    target_tensors = torch.stack([torch.tensor(item['sentiment_idx']) for item in batch])
    return {
        'input': input_tensors,
        'target': target_tensors,
        'input_string': [item['input_string'] for item in batch], # This is useful for visualization/debugging
        'sentiment': [item['sentiment'] for item in batch] # This is useful for visualization/debugging
    }
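
As a quick illustration of what the padding step does, the sketch below runs `pad_sequence` on three token lists of different lengths. The token ids are made up for the example.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

token_lists = [[5, 2, 9], [7], [3, 4]]  # hypothetical token ids of varying length

padded = pad_sequence(
    [torch.tensor(ids, dtype=torch.long) for ids in token_lists],
    batch_first=True,   # result has shape (batch, longest_sequence)
    padding_value=0,    # same PAD_TOKEN as in collate_fn_tokens
)
print(padded.shape)  # → torch.Size([3, 3])
print(padded)
```

Every row is right-padded with 0 up to the length of the longest sequence in the batch, so the batch can be fed to an embedding layer as one tensor.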

Now we can create the simple embedding layer based model and start training it.

In [24]:
class EmbeddingBagClassificationModel(torch.nn.Module):
  def __init__(self, num_tokens, embed_dim, n_classes):
    super().__init__()
    self.classes = n_classes
    # self.embedding = torch.nn.EmbeddingBag(num_tokens, embed_dim)
    self.embedding = torch.nn.Embedding(num_tokens, embed_dim, padding_idx=0)
    self.linear_layer = torch.nn.Linear(embed_dim, n_classes)

  def forward(self, batch):
    """Pass the input batch through the embedding layer and then follow it up with the linear layer
    """
    ### TO BE IMPLEMENTED ###
    y = self.embedding(batch)
    y = torch.mean(y, dim=1)
    y = self.linear_layer(y)
    ### TO BE IMPLEMENTED ###
    return y
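
One thing to be aware of: `torch.mean(y, dim=1)` divides by the padded sequence length, so short inputs in a batch are averaged together with their zero padding vectors. The sketch below shows one possible alternative, a masked mean that divides by the number of real tokens. It assumes the pad id is 0, as in `collate_fn_tokens`. It is an optional variation to experiment with, not what the notebook's model uses.

```python
import torch

torch.manual_seed(0)
embedding = torch.nn.Embedding(10, 4, padding_idx=0)  # toy sizes for illustration
batch = torch.tensor([[5, 2, 9], [7, 0, 0]])          # second row has two pad tokens

vectors = embedding(batch)                  # (batch, seq_len, embed_dim)
mask = (batch != 0).unsqueeze(-1).float()   # 1.0 for real tokens, 0.0 for padding
lengths = mask.sum(dim=1).clamp(min=1.0)    # guard against rows with no tokens
masked_mean = (vectors * mask).sum(dim=1) / lengths

plain_mean = vectors.mean(dim=1)            # what the model above computes
```

For the second row, `masked_mean` equals the embedding of token 7, while `plain_mean` is that embedding divided by 3. For rows with no padding the two agree.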

In [27]:
num_tokens = 5000

token_map = get_char_trigram_token_map(EmpatheticDataset(transform=lambda item: item['prompt'] + ' ' + item['utterance']), num_tokens)

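# Token ids run from 1 to num_tokens and 0 is the padding id, so the embedding table
# needs at least num_tokens + 1 rows; num_tokens + 2 leaves one spare row.
uncollated_output, label_encoder = train(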
    model=EmbeddingBagClassificationModel(num_tokens=num_tokens+2, embed_dim=300, n_classes=32),
    transform=get_char_trigram_token_transform(token_map),
    collate_fn=collate_fn_tokens,
    batch_size=256,
    max_epochs=10,
    learning_rate=0.001,
)

# Feel free to explore the uncollated_output dictionary to see the predictions
for _ in range(5):
    # Randomly select a sample from the test set
    i = np.random.randint(len(uncollated_output['input_string']))
    print(f'Input: {uncollated_output["input_string"][i]}')
    print(f'Predicted label: {label_encoder.inverse_transform([uncollated_output["predictions"][i].item()])}')
    print(f'Actual label: {uncollated_output["sentiment_labels"][i]}')
    print()

Dataset already downloaded and extracted.


Applying dataset transform: 100%|██████████| 76668/76668 [00:00<00:00, 373074.87it/s]
Computing trigrams: 100%|██████████| 76668/76668 [00:01<00:00, 47881.59it/s]
GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


Dataset already downloaded and extracted.


Applying dataset transform: 100%|██████████| 76668/76668 [00:01<00:00, 38870.69it/s]


Dataset already downloaded and extracted.


Applying dataset transform: 100%|██████████| 6318/6318 [00:00<00:00, 34560.14it/s]


Dataset already downloaded and extracted.


Applying dataset transform: 100%|██████████| 5701/5701 [00:00<00:00, 31902.12it/s]

  | Name     | Type                            | Params
-------------------------------------------------------------
0 | model    | EmbeddingBagClassificationModel | 160 K 
1 | accuracy | MulticlassAccuracy              | 0     
-------------------------------------------------------------
160 K     Trainable params
0         Non-trainable params
160 K     Total params
0.641     Total estimated model params size (MB)


Epoch 1:  33%|███▎      | 396/1198 [01:44<03:31,  3.79it/s, v_num=50, train_loss=2.910, val_loss=2.880, val_accuracy=0.165] 

/Users/grinch/Developer/nlp_course/nlp_course_env/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py:54: Detected KeyboardInterrupt, attempting graceful shutdown...


Testing DataLoader 0: 100%|██████████| 90/90 [00:07<00:00, 11.85it/s]
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       Test metric             DataLoader 0
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
      test_accuracy          0.178696870803833
        test_loss            2.736837148666382
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Predicting DataLoader 0: 100%|██████████| 90/90 [00:07<00:00, 11.72it/s] 
Input: I saw a homeless lady  I saw a homeless lady and i felt so bad for her because it was so hot outside so went to five guys and bought her some food and cold drinks
Predicted label: ['disgusted']
Actual label: caring

Input: My husband usually gets a large bonus each year around this time_comma_ but there's no guarantee. It's hard waiting_comma_ an

🎉 CONGRATS!!! on finishing the assignment. Now is a good time to pause and reflect on how much progress we've made in understanding word vectors, reading some PyTorch code, and building our first model. But hey, don't stop here: there is a lot more to do and play with in the next sections.

# Extensions

Now that you've worked through the project, there is a lot more for us to try.

- Which model performed the best? Why do you think that was?
- Try decreasing and increasing the size of the dataset. How does that impact training time and accuracy of each model?
- Try adding a hidden layer to the baseline model and see if that changes anything.
- Does adding a hidden layer to the embedding bag model help?
- Visualize an N×N [confusion matrix](https://torchmetrics.readthedocs.io/en/latest/references/functional.html#confusion-matrix-func) of actual class vs predicted class (N = number of classes).
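
As a starting point for the confusion matrix extension, here is a minimal sketch that builds the matrix with plain PyTorch rather than the linked torchmetrics function. The `confusion_matrix` helper and the toy labels are assumptions for illustration. In the notebook you would pass the predictions and label indices collected from the test set.

```python
import torch

def confusion_matrix(preds, targets, num_classes):
    """Count (actual, predicted) pairs; rows are actual classes, columns are predicted."""
    flat_index = targets * num_classes + preds
    counts = torch.bincount(flat_index, minlength=num_classes * num_classes)
    return counts.reshape(num_classes, num_classes)

targets = torch.tensor([0, 0, 1, 2, 2, 2])  # hypothetical true labels
preds = torch.tensor([0, 1, 1, 2, 0, 2])    # hypothetical predictions

cm = confusion_matrix(preds, targets, num_classes=3)
print(cm)

# Accuracy is the sum of the diagonal divided by the total count.
accuracy = cm.diag().sum().item() / cm.sum().item()
print(accuracy)
```

Off-diagonal cells show which classes get confused with each other, which accuracy alone cannot tell you.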