> DUPLICATE THIS COLAB BEFORE YOU START WORKING ON IT, using File > Save a copy in Drive.

# Prereq Week: Text Classification

### What are we building
We’ll continue to apply our learning philosophy of repetition as we build multiple classification models of increasing complexity in the following order:

1. Average of Word2Vec + MLP Layer
1. Can we concatenate 3 token embeddings and then average them? Does this do better than the previous method?
1. Build an embedding layer based model.
1. **Extension**: Explore different parameters, features and architectures. 
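
To make the difference between the first two approaches concrete, here is a minimal sketch using NumPy and random toy vectors in place of real word embeddings. The function names `mean_pool` and `trigram_mean_pool`, and the zero-padding of short inputs, are illustrative assumptions and not part of the notebook's scaffolding.

```python
import numpy as np

def mean_pool(token_vectors: np.ndarray) -> np.ndarray:
    """Average token vectors of shape (num_tokens, dim) into one (dim,) vector."""
    return token_vectors.mean(axis=0)

def trigram_mean_pool(token_vectors: np.ndarray, window: int = 3) -> np.ndarray:
    """Concatenate each run of `window` consecutive vectors, then average the runs."""
    num_tokens, dim = token_vectors.shape
    if num_tokens < window:
        # Assumption: pad short inputs with zero vectors so that one window exists.
        padding = np.zeros((window - num_tokens, dim))
        token_vectors = np.vstack([token_vectors, padding])
        num_tokens = window
    windows = [
        token_vectors[i : i + window].reshape(-1)  # shape (window * dim,)
        for i in range(num_tokens - window + 1)
    ]
    return np.mean(windows, axis=0)

# Toy stand-in for word vectors: 5 tokens, 4 dimensions each.
rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(5, 4))

sentence_vector = mean_pool(toy_vectors)         # shape (4,)
trigram_vector = trigram_mean_pool(toy_vectors)  # shape (12,)
```

Note that the trigram representation is three times wider than the plain average, so the first layer of the MLP needs a correspondingly larger input size.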

### Evaluation
We’ll be evaluating our models on the following metrics:

1. Accuracy: the ratio of correctly classified instances to the total number of instances.
1. **Extension**: Since this is a multi-class classification problem, visualize an N×N [confusion matrix](https://torchmetrics.readthedocs.io/en/latest/references/functional.html#confusion-matrix-func) of actual class vs. predicted class (N = number of classes).
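
As a rough illustration of what these two metrics compute, here is a dependency-free sketch on made-up labels. The helper names and the toy data are assumptions for illustration only; in the notebook itself you would more likely use the `torchmetrics` equivalents.

```python
def accuracy(actual, predicted):
    """Fraction of positions where the predicted class equals the actual class."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

def confusion_matrix(actual, predicted, num_classes):
    """N x N count matrix: rows are actual classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix

# Toy integer labels for a 3-class problem (made up for this example).
actual = [0, 1, 2, 2, 1, 0]
predicted = [0, 2, 2, 2, 1, 1]

acc = accuracy(actual, predicted)                        # 4 of 6 correct
cm = confusion_matrix(actual, predicted, num_classes=3)
for row in cm:
    print(row)
```

The diagonal of the matrix holds the correct predictions, so accuracy equals the sum of the diagonal divided by the total count; the off-diagonal cells show which classes get confused with each other.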


### Instructions

1. We've provided scaffolding for all the boilerplate PyTorch code needed to get to our first model. This covers downloading and parsing the dataset and the training code for the baseline model. **Make sure to read all the steps and internalize what is happening**.
1. At this point our model reaches an accuracy of about 0.32. Next, we'll try to improve the model by using sliding windows of text instead of just one word at a time. **Does this improve accuracy?**
1. The third model we're going to build is an embedding layer based model. Here, instead of using pre-trained word embeddings, we'll learn new vectors as part of the training process. **How do you think this model will perform?**
1. **Extension**: We've suggested a bunch of extensions to the project, so go crazy: tweak any part of the pipeline and see if you can beat all the current models.
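
To give a feel for what "creating new vectors as part of the training process" involves, here is a minimal sketch of the lookup side of an embedding layer. It assumes a simple frequency-ordered vocabulary with an `<unk>` entry and uses a random NumPy table; in the real model the table would be a learnable parameter (for example a PyTorch `nn.Embedding` or `nn.EmbeddingBag`) updated by gradient descent. The helper names are hypothetical.

```python
from collections import Counter

import numpy as np

def build_vocab(tokenized_texts, min_count=1):
    """Map each token to an integer id; id 0 is reserved for unknown tokens."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    vocab = {"<unk>": 0}
    for token, count in counts.most_common():
        if count >= min_count:
            vocab[token] = len(vocab)
    return vocab

def embed_and_average(tokens, vocab, embedding_table):
    """Look up one row per token and average them (assumes a non-empty token list)."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens]
    return embedding_table[ids].mean(axis=0)

# Toy tokenized training texts (made up for this example).
texts = [["i", "miss", "her"], ["i", "am", "so", "happy"]]
vocab = build_vocab(texts)

# Random table standing in for weights that training would normally learn.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 8))

# "pizza" is not in the vocabulary, so it falls back to the <unk> row.
vector = embed_and_average(["i", "miss", "pizza"], vocab, embedding_table)
```

Unlike pre-trained vectors, these rows start out meaningless and only become useful through training on the classification task.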

### Code Overview
- Dependencies: Python dependencies and loading the spaCy model
- Project
  - Dataset: Download the conversation dataset and parse it into a PyTorch Dataset
  - Trainer: Trainer function to help with multi-epoch training
  - Model 1: Simple Word2Vec + MLP model
  - Model 2: Sliding window trigram (Word2Vec)
  - Model 3: Embedding bag based model on trigrams
- Extensions
 


## Dependencies

In [3]:
from sklearn.preprocessing import LabelEncoder
from torch import nn
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset, random_split
from collections import Counter
import en_core_web_lg
import numpy as np
import lightning as L
import spacy
import torch
import torch.nn.functional as F
import torchmetrics
import pandas as pd

In [5]:
# Load the spaCy model
# python -m spacy download en_core_web_lg
nlp = spacy.load('en_core_web_lg')

# Fix the random seed so that we get consistent results
torch.manual_seed(0)
np.random.seed(0)

# Classifier Project
✨ Let's Begin ✨

### Data Loading and Processing (Common to ALL Solutions)

#### Dataset

We’ll be using the EmpatheticDialogues dataset open-sourced by Facebook ([link](https://research.fb.com/publications/towards-empathetic-open-domain-conversation-models-a-new-benchmark-and-dataset/)). It can be downloaded as a tarball from the following [link](https://dl.fbaipublicfiles.com/parlai/empatheticdialogues/empatheticdialogues.tar.gz).

A sample row from the dataset: 
```
conv_id,utterance_idx,context,prompt,speaker_idx,utterance,selfeval,tags
hit:12388_conv:24777,1,joyful,I felt overcome with emotions when Christmas came around as a kid,437,Christmas was the best time of year back in the day!,5|5|5_5|5|5, ''
```

The three columns we'll primarily focus on are:
1. context ==> emotion we're trying to predict
1. prompt + utterance ==> We'll combine these sentences and use them as input 

Let's download and explore the dataset; the role of each column should become clear along the way.
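
As a small preview of how one row turns into a training example, here is a sketch that parses the sample row shown above with the standard `csv` module. The `to_example` helper is hypothetical, and replacing the dataset's `_comma_` placeholder with a real comma is an assumption about how you may want to clean the text.

```python
import csv
import io

# Header plus the sample row from the dataset description above.
SAMPLE = """conv_id,utterance_idx,context,prompt,speaker_idx,utterance,selfeval,tags
hit:12388_conv:24777,1,joyful,I felt overcome with emotions when Christmas came around as a kid,437,Christmas was the best time of year back in the day!,5|5|5_5|5|5, ''
"""

def to_example(row):
    """Combine prompt and utterance into the input text; context is the label."""
    text = f"{row['prompt']} {row['utterance']}"
    # Assumption: restore commas that the dataset encodes as "_comma_".
    text = text.replace("_comma_", ",")
    return text, row["context"]

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
text, label = to_example(rows[0])
print(label)
print(text)
```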

In [17]:
DIRECTORY_NAME="classification_data"
# The file paths must live under the directory the tarball is extracted to
TRAIN_FILE=f"{DIRECTORY_NAME}/empatheticdialogues/train.csv"
VALIDATION_FILE=f"{DIRECTORY_NAME}/empatheticdialogues/valid.csv"
TEST_FILE=f"{DIRECTORY_NAME}/empatheticdialogues/test.csv"

# Download the dataset
!wget 'https://dl.fbaipublicfiles.com/parlai/empatheticdialogues/empatheticdialogues.tar.gz'
# Extract the dataset to a directory (-p avoids an error if it already exists)
!mkdir -p classification_data
!tar -xvf empatheticdialogues.tar.gz -C classification_data

--2024-02-14 11:11:18--  https://dl.fbaipublicfiles.com/parlai/empatheticdialogues/empatheticdialogues.tar.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 99.84.238.181, 99.84.238.162, 99.84.238.206, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|99.84.238.181|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 28022709 (27M) [application/gzip]
Saving to: ‘empatheticdialogues.tar.gz.1’


2024-02-14 11:11:21 (9.97 MB/s) - ‘empatheticdialogues.tar.gz.1’ saved [28022709/28022709]

x empatheticdialogues/
x empatheticdialogues/test.csv
x empatheticdialogues/train.csv
x empatheticdialogues/valid.csv


Cool, we see all our files. Let's poke at one of them before we start parsing the dataset.

In [16]:
# See the parse_dataset function below for a short explanation.
# on_bad_lines='skip' drops rows that pandas cannot parse (e.g. too many fields).
df = pd.read_csv(TRAIN_FILE, on_bad_lines='skip')
df

Unnamed: 0,conv_id,utterance_idx,context,prompt,speaker_idx,utterance,selfeval,tags
0,hit:0_conv:1,1,sentimental,I remember going to the fireworks with my best...,1,I remember going to see the fireworks with my ...,5|5|5_2|2|5,
1,hit:0_conv:1,2,sentimental,I remember going to the fireworks with my best...,0,Was this a friend you were in love with_comma_...,5|5|5_2|2|5,
2,hit:0_conv:1,3,sentimental,I remember going to the fireworks with my best...,1,This was a best friend. I miss her.,5|5|5_2|2|5,
3,hit:0_conv:1,4,sentimental,I remember going to the fireworks with my best...,0,Where has she gone?,5|5|5_2|2|5,
4,hit:0_conv:1,5,sentimental,I remember going to the fireworks with my best...,1,We no longer talk.,5|5|5_2|2|5,
...,...,...,...,...,...,...,...,...
76663,hit:12424_conv:24848,5,sentimental,I found some pictures of my grandma in the att...,389,Yeah reminds me of the good old days. I miss ...,5|5|5_5|5|5,
76664,hit:12424_conv:24849,1,surprised,I woke up this morning to my wife telling me s...,294,I woke up this morning to my wife telling me s...,5|5|5_5|5|5,
76665,hit:12424_conv:24849,2,surprised,I woke up this morning to my wife telling me s...,389,Oh hey that's awesome! That is awesome right?,5|5|5_5|5|5,
76666,hit:12424_conv:24849,3,surprised,I woke up this morning to my wife telling me s...,294,It is soooo awesome. We have been wanting a b...,5|5|5_5|5|5,


The columns we care about are:
1. "context": This is the emotion we're trying to predict
1. "prompt" and "utterance": We'll combine these sentences and use them as input 

Let's create a label encoder that converts our text labels to integer ids and back.