# Course 1: Recap

The slides for the course are available [here](https://github.com/NathanGodey/AdvancedNLP/raw/main/slides/pdf/course2_tokenization.pdf).

## Goal of the session

We are going to build a system that identifies emotions in an English social media post.

❗❗❗ SELECT A GPU HARDWARE ❗❗❗

## Part 0: Install libraries

Let's install the following Hugging Face libraries:
- `transformers`: a module that allows downloading and using NLP models (among others)
- `datasets`: a module that handles dataset downloading and management

In [None]:
!pip install datasets transformers > /dev/null

In [None]:
# write your imports here

## Part 1: Get a Language Model

In this session, we are going to use a pre-trained language model. Let's first download one and observe how good it is at its pretraining task (Masked Language Modeling).

### Question 1
Using `transformers`'s `AutoModelForMaskedLM`, load the `distilbert-base-cased` language model [available here](https://huggingface.co/distilbert-base-cased). Print it to see what layers it contains.

### Question 2

Using `transformers`'s `AutoTokenizer`, load the tokenizer for `distilbert-base-cased` [available here](https://huggingface.co/distilbert-base-cased). How many tokens does it have in its vocabulary? Use it to tokenize this sentence:

In [None]:
raw_sentence = "This model seems to work preeeetty well."

You can actually use the `return_tensors="pt"` argument in the tokenizer to get a Torch tensor.

### Question 3


Run the tokenized sentence (as a Torch tensor) through the model.

That took quite some time... Can we go faster?

### Question 4
Load all tensors and the model onto the GPU using the `.cuda()` method. Run the tokenized sentence through the model. Was it faster?

What is the shape of `logits`? What do you think it represents?

### Question 5
What is DistilBERT's favorite ice-cream flavor?

## Part 2: Get a dataset
We are going to download a dataset of Reddit posts annotated for emotion recognition.

### Question 6
Download [this dataset](https://huggingface.co/datasets/go_emotions) using the `load_dataset` function of the `datasets` module. Select the "raw" config. What is the returned object made of?

### Question 7
How many emotions are labeled? Can there be several labels at the same time? What is the frequency of each label?
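
As a rough sketch of the counting logic, assuming each row exposes one 0/1 column per emotion (check `dataset.features` to confirm how the "raw" config is actually laid out), the frequencies could be computed as below. The shortened `EMOTIONS` list and the three toy rows are made up for illustration and only stand in for the real dataset.

```python
from collections import Counter

# Hypothetical subset of emotion columns; the real dataset has more.
EMOTIONS = ["joy", "anger", "sadness"]

# Toy rows standing in for dataset rows (made-up data).
rows = [
    {"text": "I love this!", "joy": 1, "anger": 0, "sadness": 0},
    {"text": "Ugh, not again.", "joy": 0, "anger": 1, "sadness": 1},
    {"text": "So happy, yet so sad.", "joy": 1, "anger": 0, "sadness": 1},
]


def label_frequencies(rows, emotions):
    """Return the fraction of rows in which each emotion is set to 1."""
    counts = Counter()
    for row in rows:
        for emotion in emotions:
            counts[emotion] += row[emotion]
    return {emotion: counts[emotion] / len(rows) for emotion in emotions}


def max_labels_per_row(rows, emotions):
    """Return the largest number of emotions active in a single row."""
    return max(sum(row[emotion] for emotion in emotions) for row in rows)


frequencies = label_frequencies(rows, EMOTIONS)
most_labels = max_labels_per_row(rows, EMOTIONS)
```

A value of `most_labels` above 1 would indicate that several emotions can be active for the same post.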

### Question 8
Using a regular expression (regex), find out what percentage of user profiles end with a number (e.g. User456).
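
One possible approach, sketched on made-up usernames: the pattern `\d$` matches a digit located at the very end of a string. The column holding the usernames (for instance one named `author`) is an assumption to check against the dataset, and whether each user is counted once or once per row is a choice left open here.

```python
import re

# Matches any string whose last character is a digit.
ENDS_WITH_DIGIT = re.compile(r"\d$")


def percent_ending_with_digit(names):
    """Percentage of names whose final character is a digit."""
    if not names:
        return 0.0
    hits = sum(1 for name in names if ENDS_WITH_DIGIT.search(name))
    return 100.0 * hits / len(names)


# Made-up usernames for illustration.
sample_authors = ["User456", "alice", "bob_2", "carol99x"]
percent = percent_ending_with_digit(sample_authors)
```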

### Question 9
Using a regular expression (regex), find out what percentage of messages use smileys (e.g. ":)" or ":/")
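
A naive pattern such as `:/` would also fire on URLs like `http://`. The sketch below is one hedged way around that: it requires that the smiley is not glued to a preceding word character and that `:/` is not followed by another slash. The set of recognised smileys is an arbitrary choice, and the sample messages are made up.

```python
import re

# Eyes (: ; =), optional nose (-), then a mouth.
# (?<!\w) skips matches glued to a word, e.g. "http://" or "Note:Please".
# /(?!/) accepts ":/" but not "://".
SMILEY = re.compile(r"(?<!\w)[:;=]-?(?:[)(DPp|\\]|/(?!/))")


def percent_with_smiley(messages):
    """Percentage of messages containing at least one smiley."""
    if not messages:
        return 0.0
    hits = sum(1 for message in messages if SMILEY.search(message))
    return 100.0 * hits / len(messages)


# Made-up messages for illustration.
sample_messages = [
    "Great job :)",
    "meh :/",
    "see http://example.com",
    "plain text",
]
percent = percent_with_smiley(sample_messages)
```

The trade-off is that a smiley written directly after a word, as in `nice:)`, is missed; relaxing the lookbehind would catch it at the cost of more false positives.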

## Part 3: Fine-tuning

### Question 10
We first need to prepare the model for fine-tuning. What should be the shape and range of the predictions of the fine-tuned model? With that in mind, build a prediction head and use it to replace the current LM head.

**Tip**: You will need `torch.nn.Linear` and an [activation function](https://pytorch.org/docs/stable/nn.html#non-linear-activations-weighted-sum-nonlinearity). You also need to pick a pooling strategy to get only one vector for each sequence.
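
To make the pooling idea concrete, here is a dependency-free sketch of masked mean pooling on plain Python lists, followed by a sigmoid that squashes a score into the (0, 1) range. It is only one possible pooling strategy among several, the numbers are made up, and in the notebook the same logic would be written with tensors.

```python
import math


def masked_mean_pool(hidden_states, attention_mask):
    """Average the token vectors of one sequence, ignoring padding positions.

    hidden_states: list of token vectors (seq_len x hidden_dim)
    attention_mask: list of 0/1 flags (seq_len), 1 for real tokens
    """
    hidden_dim = len(hidden_states[0])
    n_real = sum(attention_mask)
    pooled = [0.0] * hidden_dim
    for vector, keep in zip(hidden_states, attention_mask):
        if keep:
            for i, value in enumerate(vector):
                pooled[i] += value / n_real
    return pooled


def sigmoid(x):
    """Map a real-valued score to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


# Toy sequence: 2 real tokens + 1 padding token, hidden size 2.
hidden_states = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]
pooled = masked_mean_pool(hidden_states, attention_mask)
```

The padding vector is ignored, so `pooled` only reflects the two real tokens.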

### Question 11
Build a `collate_fn` for the fine-tuning. It should:
- Take a batch of rows from the dataset as an input
- Tokenize the `text` as a `torch.Tensor`
- Retrieve the labels associated with the emotions and turn them into a `torch.Tensor`
- Return the tokenized text and the labels in a dictionary

In [None]:
def collate_emotions(rows):
  ...

### Question 12
Split the dataset into two parts: `train` (90%) and `val` (10%). You can use the [dedicated method](https://huggingface.co/docs/datasets/v2.14.5/en/package_reference/main_classes#datasets.Dataset.train_test_split). Create a dataloader for each split using the `collate_emotions` function.

In [None]:
test_batch = next(iter(train_dataloader))

### Question 13



With the `test_batch` variable, make a prediction using the untrained model from Question 10. How can we compare the model's output with the expected labels? What loss is best suited for this problem?
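
One candidate worth considering, assuming the model outputs one independent probability per emotion, is binary cross-entropy averaged over the labels. The sketch below computes it by hand on made-up numbers. In PyTorch, `torch.nn.BCEWithLogitsLoss` computes the same quantity directly from logits.

```python
import math


def binary_cross_entropy(probabilities, targets, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and 0/1 targets."""
    total = 0.0
    for p, y in zip(probabilities, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(targets)


# Made-up predictions for 3 emotions on one post, and its multi-hot labels.
predicted = [0.9, 0.2, 0.5]
expected = [1, 0, 1]
loss = binary_cross_entropy(predicted, expected)

# Confidently wrong predictions are penalised much more heavily.
bad_loss = binary_cross_entropy([0.1, 0.8, 0.5], expected)
```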

### Question 14

In order to fine-tune the model, we are going to use the [PyTorch Lightning library](https://lightning.ai/docs/pytorch/stable/). Fill in the missing parts:

In [None]:
!pip install lightning

In [None]:
import lightning.pytorch as pl

# define the LightningModule
class FineTuner(pl.LightningModule):
  def __init__(self, model, learning_rate, weight_decay):
    super().__init__()
    self.model = model
    self.learning_rate = learning_rate
    self.weight_decay = weight_decay

  def common_step(self, batch, batch_idx):
    # You may need to adapt this:
    input_ids, labels = batch["input_ids"], batch["labels"]

    # TO COMPLETE

    return loss, predictions, labels

  def training_step(self, batch, batch_idx):
    loss, _, _ = self.common_step(batch, batch_idx)
    self.log("train_loss", loss)
    return loss

  def validation_step(self, batch, batch_idx):
    loss, predictions, labels = self.common_step(batch, batch_idx)

    accuracy = ...  # to complete

    self.log("val_loss", loss)
    self.log("val_accuracy", accuracy)

    return loss

  def configure_optimizers(self):
    optimizer = ...
    lr_scheduler = ...

    return [optimizer], [lr_scheduler]

In [None]:
pl_module = FineTuner(
    model=...,
    learning_rate=...,
    weight_decay=...,
)

trainer = pl.Trainer(
    accumulate_grad_batches=1,
    gradient_clip_val=None,
    max_epochs=3,
    precision=32,
    val_check_interval=1.,
)

### Question 15
Write a function that asks the user for an input sentence and outputs the labels corresponding to the emotions it expresses.
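
The last step of such a function, turning per-emotion probabilities into label names, could look like the sketch below. The 0.5 threshold, the label names and the probabilities are illustrative assumptions; the tokenization and the model call are left out.

```python
def probabilities_to_labels(probabilities, label_names, threshold=0.5):
    """Return the names of all labels whose probability reaches the threshold."""
    return [
        name
        for name, probability in zip(label_names, probabilities)
        if probability >= threshold
    ]


# Made-up label names and model outputs for a single sentence.
label_names = ["joy", "anger", "sadness"]
probabilities = [0.81, 0.07, 0.55]
predicted_labels = probabilities_to_labels(probabilities, label_names)
```

Because each emotion is thresholded independently, the result can contain zero, one or several labels.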

### Question 16

Re-initialize the DistilBERT model using the `init_weights` method. Fine-tune it with the previous approach. What can you say about the final performance?

### Question 17

Create a class built around a `torch.nn.LSTM` model that behaves like the `BertModel` class from Hugging Face's `transformers`. Fine-tune it with the previous approach. What can you say about the final performance?