# Train your sentiment classifier

[![image](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/hieultp/holistics-nlp-workshop/blob/main/02-sentiment-classification.ipynb)

Good to see you again ;) Let's continue our journey then.

Just in case you forgot what we need to train your ~dragon~ AI model 👀:
- Prepare data
- Define the model
- Define the loss function
- Train it!

Okay, let's dive in then.

In [None]:
! git clone https://github.com/hieultp/holistics-nlp-workshop
%cd holistics-nlp-workshop

## 1. Prepare dataset

Let's prepare the dataset. We will use the TweetEval dataset this time.

The dataset contains many tweets, each associated with a sentiment label.

Usually, we have three separate sets. The first is used for training the model (train set), the second for validating it (validation set), and the last for testing it (test set).

The reason we need the additional validation set is that we tune our model based on it. This way we avoid checking the test set too many times, which might lead to overfitting to the test set.

But for simplicity, let's settle for two sets only for now.
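
For reference, here is a minimal sketch of how a three-way split could be made from a single list of samples. The helper `split_dataset`, its fractions, and the toy samples are all made up for illustration; the notebook itself just loads the ready-made train and test splits.

```python
import random

def split_dataset(samples, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle samples and split them into train / validation / test lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible

    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)

    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Toy (text, label) pairs standing in for real tweets (hypothetical data).
toy_samples = [(f"tweet {i}", i % 3) for i in range(20)]
train_split, val_split, test_split = split_dataset(toy_samples)
print(len(train_split), len(val_split), len(test_split))  # 16 2 2
```

You would tune things like the embedding size on the validation split and look at the test split only once at the end.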

In [None]:
!pip install -q torchtext==0.14.1

from src.datasets import TweetEvalSetiment
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

An additional step before building the vocabulary is that we have to tokenize the input text.

<div>
<img src="https://miro.medium.com/max/1400/1*UhfwmhMN9sdfcWIbO5_tGg.jpeg" width="600"/>
</div>
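
To get a feel for what tokenizing means, here is a rough sketch of a tokenizer in plain Python. `simple_tokenize` is a made-up stand-in, not the tokenizer used in this notebook, and real tokenizers handle many more cases.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and split it into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

print(simple_tokenize("I love this workshop!"))
# ['i', 'love', 'this', 'workshop', '!']
```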

In [None]:
LABEL_MAPPING = {
    0: "negative",
    1: "neutral",
    2: "positive",
}

In [None]:
train_data = TweetEvalSetiment(type="train")
test_data = TweetEvalSetiment(type="test")

Let's check a sample from our data.

In [None]:
text, label = train_data[0]

In [None]:
text

In [None]:
LABEL_MAPPING[label]

Looks good enough. Let's build our vocabulary then.

In [None]:
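tokenizer = get_tokenizer('basic_english')
# build_vocab_from_iterator expects an iterable of token lists, so tokenize each text first;
# iterating over the raw strings would yield single characters instead of words.
vocab = build_vocab_from_iterator(
    (tokenizer(text) for text, _ in train_data), specials=["<unk>", "<pad>"]
)
vocab.set_default_index(vocab["<unk>"])

text_pipeline = lambda x: vocab(tokenizer(x))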

You may notice that we have two additional tokens: `<unk>` and `<pad>`.

The `<unk>` token is used at inference time when we encounter a new word that is not in the vocabulary. The `<pad>` token is used when we prepare batches for our model.

A DL model usually operates on fixed-length sentences. Thus we pad each sentence to a fixed length and truncate those longer than our predefined length.
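
The idea can be sketched in plain Python, without torchtext. The helpers `build_toy_vocab` and `encode` and the toy corpus are made up for illustration, and the real `Vocab` object may order its tokens differently.

```python
from collections import Counter

def build_toy_vocab(tokenized_texts, specials=("<unk>", "<pad>")):
    """Map each token to an integer id; special tokens take the first ids."""
    counts = Counter(token for tokens in tokenized_texts for token in tokens)
    itos = list(specials) + [token for token, _ in counts.most_common()]
    return {token: index for index, token in enumerate(itos)}

def encode(tokens, vocab, max_len):
    """Convert tokens to ids, truncate to max_len, then pad up to max_len."""
    ids = [vocab.get(token, vocab["<unk>"]) for token in tokens][:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))

toy_corpus = [["good", "day"], ["bad", "day", "today"]]
toy_vocab = build_toy_vocab(toy_corpus)

# Short sentence: padded with the <pad> id (1).
print(encode(["good", "day"], toy_vocab, max_len=4))  # [3, 2, 1, 1]
# Long sentence with unseen words: <unk> id (0) for unknowns, cut to 4 tokens.
print(encode(["what", "a", "bad", "day", "today"], toy_vocab, max_len=4))  # [0, 0, 4, 2]
```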

Note: You can also play around with the tokenizer. Different ways of tokenizing result in different vocabularies and ultimately affect the model performance too.

In [None]:
import torch

MAX_SEQUENCE_LEN = max(len(text_pipeline(text)) for text, _ in train_data)

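def collate_batch(batch):
    # A dummy row of length MAX_SEQUENCE_LEN forces pad_sequence to pad every batch to that length.
    text_list, label_list = [torch.zeros(MAX_SEQUENCE_LEN, dtype=torch.int64)], []
    for _text, _label in batch:
        # Truncate so texts longer than MAX_SEQUENCE_LEN (e.g. in the test set) keep the fixed length.
        processed_text = torch.tensor(text_pipeline(_text)[:MAX_SEQUENCE_LEN], dtype=torch.int64)
        text_list.append(processed_text)
        label_list.append(_label)
    # Pad with the <pad> index, then drop the dummy row.
    text_list = torch.nn.utils.rnn.pad_sequence(text_list, batch_first=True, padding_value=vocab["<pad>"])[1:]
    label_list = torch.tensor(label_list, dtype=torch.int64)
    return text_list, label_list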

## 2. Define the model

It's time to define our model. This model simply contains an embedding followed by a linear layer.

You could play around with the embedding size, or take a look at the source code and change the number of linear layers. 👀
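
If you are curious what "an embedding followed by a linear layer" computes, here is a rough sketch of the forward pass in plain Python, so every step is visible. `TinyTextClassifier` is made up for illustration and is not the code in `src.models`. In particular, averaging the embeddings of the non-padding tokens is an assumption; the real model may combine them differently.

```python
import random

class TinyTextClassifier:
    """Embedding lookup -> average over non-padding tokens -> linear layer."""

    def __init__(self, vocab_size, embed_dim, num_class, padding_idx, seed=0):
        rng = random.Random(seed)
        self.padding_idx = padding_idx
        # One embed_dim-sized vector per token id.
        self.embedding = [
            [rng.uniform(-0.5, 0.5) for _ in range(embed_dim)]
            for _ in range(vocab_size)
        ]
        # Linear layer: one weight row and one bias per class.
        self.weight = [
            [rng.uniform(-0.5, 0.5) for _ in range(embed_dim)]
            for _ in range(num_class)
        ]
        self.bias = [0.0] * num_class

    def forward(self, token_ids):
        """Return one raw score (logit) per class for a single sentence."""
        vectors = [self.embedding[i] for i in token_ids if i != self.padding_idx]
        if not vectors:  # sentence made only of padding
            vectors = [[0.0] * len(self.embedding[0])]
        # Assumed pooling step: average the token vectors column by column.
        pooled = [sum(column) / len(vectors) for column in zip(*vectors)]
        return [
            sum(w * x for w, x in zip(row, pooled)) + b
            for row, b in zip(self.weight, self.bias)
        ]

toy_model = TinyTextClassifier(vocab_size=6, embed_dim=4, num_class=3, padding_idx=1)
toy_logits = toy_model.forward([3, 2, 1, 1])
print(len(toy_logits))  # one logit per class
```

A bigger embedding size gives each word more numbers to describe it, at the cost of more parameters to train.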

In [None]:
from src.models import TextClassificationModel

vocab_size = len(vocab)
model = TextClassificationModel(
    vocab_size, embed_dim=64, num_class=len(LABEL_MAPPING), padding_idx=vocab["<pad>"]
)

In [None]:
model

Cool, let's define the loss function and we're ready to go.

## 3. Define the loss function

In [None]:
import torch.nn as nn

This is just like the previous exercise. Take your time to explore!

In [None]:
loss_fn = nn.CrossEntropyLoss()
# loss_fn = nn.MSELoss()
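
As a reminder of what cross-entropy measures, here is a small sketch for a single sample in plain Python. The helper names and the example scores are made up for illustration. `nn.CrossEntropyLoss` does the same softmax-then-negative-log computation on raw scores and, by default, averages it over the batch.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = [x - max(logits) for x in logits]  # subtract the max for numerical stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[target])

# The model strongly favours class 2 ("positive" in LABEL_MAPPING).
confident_right = cross_entropy([0.1, 0.2, 3.0], target=2)
confident_wrong = cross_entropy([0.1, 0.2, 3.0], target=0)
print(confident_right < confident_wrong)  # True
```

The loss is small when the model puts most of its probability on the right label and grows quickly when it is confidently wrong.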

## 4. Train it

It's timeeeee.

<div>
<img src="https://i.imgflip.com/3f23r3.jpg" width="300"/>
</div>

In [None]:
from src.train import train

model = train(model, loss_fn, train_data, test_data, num_epochs=10, collate_fn=collate_batch)

## 5. Test it

Great, now you have your model trained (and hopefully its performance on the test set is not bad).

Let's test it. Run the three cells below, type in any tweet, and see whether your model guesses its sentiment correctly.

Then feel free to go back and play around with the loss function and the model to see whether you can improve the performance.

In [None]:
!pip install -q gradio

import torch
import gradio as gr

In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = model.to(device)  # make sure the model sits on the same device as the inputs
model.eval()

In [None]:
def inference(text):
    text = torch.nn.utils.rnn.pad_sequence(
        [
            torch.zeros(MAX_SEQUENCE_LEN, dtype=torch.int64),
            # Truncate long inputs so they match the fixed length used in training.
            torch.tensor(text_pipeline(text)[:MAX_SEQUENCE_LEN], dtype=torch.int64)
        ],
        batch_first=True,
        padding_value=vocab["<pad>"]
    )[1:].to(device)
    with torch.no_grad():  # no gradients needed at inference time
        prediction = model(text).softmax(1).squeeze()
    return {LABEL_MAPPING[i]: prediction[i].item() for i in range(len(LABEL_MAPPING))}

gr.Interface(
    fn=inference,
    inputs="text",
    outputs=gr.Label(num_top_classes=3),  # the gr.outputs namespace is gone in newer Gradio releases
    live=True,
    css=".footer {display:none !important}",
    title="Sentiment Analysis",
    description="Enter a tweet and see predictions in real time.",
    thumbnail="https://raw.githubusercontent.com/gradio-app/real-time-mnist/master/thumbnail2.png"
).launch()


## 6. Use a pretrained model

Training a model on our own can sometimes be painful 🥲 Let's try a pretrained model that is freely available on the internet thanks to dedicated researchers and practitioners.

A very popular community for sharing NLP models is `🤗 Hugging Face`. We will use their library and a model made freely available there by contributors.

In [None]:
!pip install -q transformers emoji

from transformers import pipeline

pretrained_model = pipeline(model="finiteautomata/bertweet-base-sentiment-analysis", device=device)

In [None]:
def pretrained_inference(text):
    return {
        "NEG": "Negative",
        "NEU": "Neutral",
        "POS": "Positive",
    }[pretrained_model(text)[0]['label']]

gr.Interface(
    fn=pretrained_inference,
    inputs="text",
    outputs=gr.Textbox(),  # the gr.outputs namespace is gone in newer Gradio releases
    live=True,
    css=".footer {display:none !important}",
    title="Sentiment Analysis",
    description="Enter a tweet and see predictions in real time.",
    thumbnail="https://raw.githubusercontent.com/gradio-app/real-time-mnist/master/thumbnail2.png"
).launch()

Quite convenient, right? If you happen to have any cool models, don't hesitate to share them back with the community ;)

That's all for this notebook. Thank you for staying until the end ;)