In [None]:
from transformers import TrainingArguments, Trainer
from transformers import DataCollatorWithPadding, DataCollatorForSeq2Seq
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

from datasets import load_dataset, load_from_disk

import torch
from torch.utils.data import DataLoader

import evaluate
from sklearn.metrics import accuracy_score, f1_score

from tqdm import tqdm

# Task-A: Sentiment Analysis

## A.1 Pretrained model
In this subsection we will load a pretrained model and evaluate its performance on a movie-review sentiment analysis task.

In Hugging Face’s Transformers library, the `Auto Classes` are generic wrappers that automatically pick the right model/tokenizer/config class for you, based on the pretrained model you’re loading.

### AutoTokenizer
- `AutoTokenizer` is a Hugging Face class that automatically picks the right tokenizer for the model you specify.
- A tokenizer is responsible for splitting text into tokens (subwords or word pieces) that the model can understand.
- `AutoTokenizer.from_pretrained(model_name)` downloads and loads the pretrained tokenizer for the model of your choice.

### AutoModelForSequenceClassification
- `AutoModelForSequenceClassification` is a Hugging Face class that loads a model configured for sequence classification (e.g., sentiment analysis, text classification).
- `AutoModelForSequenceClassification.from_pretrained(model_name)` loads a model that has already been trained (pretrained) and published.

This is the setup we’re using for the following exercise. In practice, Hugging Face provides many additional classes for a wide variety of setups. You are encouraged to explore the [Auto Classes documentation](https://huggingface.co/docs/transformers/en/model_doc/auto) to get a broader understanding of what’s available.
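
To see the Auto class dispatch in isolation, the sketch below builds a tiny, randomly initialised DistilBERT classifier from a configuration object instead of downloading pretrained weights. The configuration values are arbitrary assumptions chosen only to keep the example fast, and the random token IDs stand in for real tokenizer output, so the predictions themselves are meaningless.

```python
import torch
from transformers import AutoModelForSequenceClassification, DistilBertConfig

# Hypothetical tiny configuration, chosen only so the example runs quickly offline.
config = DistilBertConfig(
    vocab_size=100, dim=32, n_layers=1, n_heads=2, hidden_dim=64,
    max_position_embeddings=64, num_labels=2,
)

# The Auto class inspects the config type and builds the matching architecture.
model = AutoModelForSequenceClassification.from_config(config)
model.eval()
print(type(model).__name__)

# Fake batch: 3 sequences of 8 token IDs each (stand-in for tokenizer output).
input_ids = torch.randint(0, config.vocab_size, (3, 8))
attention_mask = torch.ones_like(input_ids)

with torch.no_grad():
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)

# One score (logit) per class for each sequence.
print(outputs.logits.shape)
preds = outputs.logits.argmax(dim=-1)
```

With `from_pretrained`, the same dispatch happens, except that the configuration and the weights are read from the published checkpoint.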

**`TODO:`** Load the tokenizer and the pretrained model for sequence classification for `distilbert/distilbert-base-uncased`.

### Datasets
- The `datasets` library is another core library from Hugging Face, separate from `transformers`.
- In short, this library is designed to make it easy to access, share, preprocess, and work with large datasets (especially for NLP, but also vision, audio, and multimodal tasks).
- The `load_dataset("dataset_name")` function pulls and prepares a dataset from the internet.
- The `load_from_disk("path/to/dataset")` function loads a dataset from a local directory, typically one that was previously written with `save_to_disk`.
- The `.map()` method applies a function to every element (or batch of elements) in a Dataset. Have a look at an example here: [link](https://huggingface.co/docs/datasets/en/process#map).
- When loading a dataset with `load_dataset`, the `split` argument (e.g. `split="train"`) can be used to get a specific split.
- Note: These are different from the PyTorch datasets. They're similar and it's easy to transition from one to the other but they're not identical.

For more on HF Datasets, you are encouraged to explore the [documentation](https://huggingface.co/docs/datasets/en/index).

**`TODO:`** Load the train and test split of the `imdb` dataset. How many samples are in each split?

### DataLoader
- `DataLoader` is a PyTorch utility that wraps a dataset and handles batching, shuffling, and parallel loading.
- It takes a dataset (can be from both HF and PyTorch) and returns an iterator you can loop over in training or evaluation.

Have a look at its [documentation](https://docs.pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader). Specifically, see which arguments are needed to load a dataset, to set your own batch size, and to decide whether the data will be shuffled.
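
As a small illustration of batching, the sketch below wraps a made-up list of dictionaries in a `DataLoader` and pulls out the first batch. The toy samples are assumptions; the default collation behaviour shown (strings stay in a list, integers become a tensor) is what PyTorch does when no `collate_fn` is given.

```python
from torch.utils.data import DataLoader

# Toy dataset: any object with __len__ and __getitem__ works, including a list of dicts.
samples = [
    {"text": "great movie", "label": 1},
    {"text": "terrible plot", "label": 0},
    {"text": "loved it", "label": 1},
]

loader = DataLoader(samples, batch_size=2, shuffle=False)

# The loader is iterable; each iteration yields one collated batch.
first_batch = next(iter(loader))
print(first_batch)
print(len(loader))  # number of batches, the last one may be smaller
```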

**`TODO:`** Load your test data into a `DataLoader` of batch size 16. Do not shuffle the data.

**`TODO:`** As we have previously mentioned, `DataLoader` returns an iterator. Using a `for` loop, investigate what the iterator returns for its first iteration.

In [None]:
def evaluate_model(model, tokenizer, test_loader):
    """
    Evaluate a Hugging Face sequence classification model on a test dataset.

    Parameters:
        model (transformers.AutoModelForSequenceClassification): The pretrained sequence classification model to evaluate.
        tokenizer (transformers.AutoTokenizer): The tokenizer used to preprocess the input text for the model.
        test_loader (torch.utils.data.DataLoader): A DataLoader providing the test dataset.

    Returns:
        all_preds (torch.Tensor): Model predictions on the test set.
        all_labels (torch.Tensor): Ground truth labels from the test set.
        acc (float): Overall accuracy on the test set.
        f1 (float): F1 score on the test set.
    """

    all_labels = None
    all_preds = None

    for batch in tqdm(test_loader):
        text = batch['text']
        labels = batch['label']

        # TODO: Tokenize the text using the provided tokenizer, use both truncation and padding.
        inputs = ...

        if torch.cuda.is_available():
            inputs = inputs.to('cuda')
            model = model.to('cuda')

        # TODO: Perform a forward pass of the inputs through the model
        with torch.no_grad():
            outputs = ...
        
        # TODO: Get the logits from the model's output and compute the predictions by taking the argmax
        logits = ...
        preds = ...

        if all_labels is None:
            all_labels = labels.cpu()
            all_preds = preds.cpu()
        else:
            all_labels = torch.cat((all_labels, labels.cpu()))
            all_preds = torch.cat((all_preds, preds.cpu()))

    # TODO: compute f1 score between model predictions and ground-truth labels (you can use sklearn.metrics)
    f1 = ...

    # TODO: compute accuracy score between model predictions and ground-truth labels (you can use sklearn.metrics)
    acc = ...

    # TODO: compute the accuracy on Positive(label==1) samples
    pos_acc = ...

    # TODO: compute the accuracy on Negative(label==0) samples
    neg_acc = ...

    print('Accuracy: ', acc*100, '%')
    print(' -- Positive Accuracy: ', pos_acc*100, '%')
    print(' -- Negative Accuracy: ', neg_acc*100, '%')
    print('F1 score: ', f1)

    return all_preds, all_labels, acc, f1

## A.2 Finetuned model
In this section, we will further train the model for the task that we are interested in and see if we can increase its performance.

**`TODO:`** Use `evaluate_model` to measure the performance of the pretrained model.

**`TODO:`** Define a function that receives some samples and tokenizes them with the tokenizer we have defined. Use the `Dataset.map` method discussed earlier to apply your function to the train data. When defining the function, keep in mind how `.map()` passes samples to it: one example at a time, or a batch of examples when `batched=True`.

### Data Collator
- A data collator is a small function/class that tells the DataLoader how to merge a list of individual samples into a single batch.
- In NLP, a main challenge is padding. Since sentences have different lengths, you need to pad them so they fit into a uniform tensor batch.
- The `DataCollatorWithPadding` will automatically pad sequences in a batch to the length of the longest sequence in that batch (dynamic padding) based on the tokenizer you're using.

For more info on Data Collators please refer to the following [documentation](https://huggingface.co/docs/transformers/en/main_classes/data_collator#transformers.DataCollatorWithPadding).
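
To make dynamic padding concrete, here is a sketch of the idea in plain PyTorch. It is not `DataCollatorWithPadding` itself: `pad_collate` is a hypothetical stand-in, the token IDs are made up, and a padding ID of 0 is an assumption.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def pad_collate(samples, pad_id=0):
    """Hypothetical collate function: pad each batch to its own longest sequence."""
    ids = [torch.tensor(s["input_ids"]) for s in samples]
    input_ids = pad_sequence(ids, batch_first=True, padding_value=pad_id)
    # 1 marks real tokens, 0 marks padding.
    attention_mask = torch.zeros_like(input_ids)
    for row, seq in enumerate(ids):
        attention_mask[row, : len(seq)] = 1
    labels = torch.tensor([s["label"] for s in samples])
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

# Made-up token IDs of different lengths.
tokenized = [
    {"input_ids": [101, 7, 8, 102], "label": 1},
    {"input_ids": [101, 9, 102], "label": 0},
    {"input_ids": [101, 5, 6, 7, 8, 9, 102], "label": 1},
    {"input_ids": [101, 4, 102], "label": 0},
]

loader = DataLoader(tokenized, batch_size=2, shuffle=False, collate_fn=pad_collate)
shapes = [tuple(batch["input_ids"].shape) for batch in loader]
print(shapes)
```

Each batch is padded only as far as its own longest sequence, so short batches do not pay for the longest example in the whole dataset.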

**`TODO:`** Define a data collator that automatically pads sequences in a batch based on the defined `tokenizer`.

In [None]:
data_collator = ...

### Training Arguments
- A configuration object that stores all the knobs and settings related to the training procedure (like learning rate, batch size, number of epochs, output directory, etc.).
- It tells the training loop how to train (e.g., optimization settings, saving/checkpoint rules, logging options).
- You don’t train with it directly, you just define the "rules of training."

### Trainer
- This is the high-level training loop: it actually runs the training, evaluation, and prediction based on your model, data, and `TrainingArguments`.
- It takes care of all the heavy lifting (forward pass, loss calculation, backprop, optimizer steps, checkpoint saving, etc.).
- You just call methods like `.train()`, `.evaluate()`, or `.predict()` without writing a manual loop.

In [None]:
output_dir_name = "imdb-finetuned-distilbert"

training_args = TrainingArguments(
    output_dir=output_dir_name,          # Where to save model checkpoints and logs
    learning_rate=2e-5,                  # Step size for the optimizer (how fast the model learns)
    per_device_train_batch_size=16,      # Training batch size per GPU/CPU device
    per_device_eval_batch_size=16,       # Evaluation batch size per GPU/CPU device
    num_train_epochs=1,                  # Number of times to iterate over the full training dataset
    weight_decay=0.01,                   # Weight decay coefficient for the optimizer (regularization, helps prevent overfitting)
    save_strategy="epoch",               # When to save checkpoints ("epoch" = at the end of each epoch)
    push_to_hub=False,                   # Whether to push the model to the Hugging Face Hub
    report_to="none"                     # Where to report logs (e.g., "wandb", "tensorboard", "none")
)

**`TODO:`** Based on everything we have defined so far in this exercise, complete the following code to initialize the trainer.

**`TODO:`** Use the `.train` method of the trainer to train the model.

**`TODO:`** Use `evaluate_model` to measure the performance of the fine-tuned model.

# Task-B: Machine Translation

### `text_target` in Tokenizers

- The `text_target` argument is used when working with sequence-to-sequence (encoder–decoder) models such as T5, BART, mBART, or mT5.
- It allows you to tokenize the target/output text (e.g. a translation or summary) alongside the input text in a single call to the tokenizer.
- The tokenized targets are stored under the key `labels`, which the model uses during training for loss computation.

### AutoModelForSeq2SeqLM
- `AutoModelForSeq2SeqLM` is a Hugging Face class that loads a model configured for sequence-to-sequence tasks (e.g., machine translation, text summarization, question answering, text generation with input-output pairs).
- These models typically take in a sequence as input (e.g., a sentence or paragraph) and generate a new sequence as output (e.g., a translated or summarized version).

Note: The tokenizer for this model requires the `protobuf` package (`pip install protobuf`).


**`TODO:`** Load the tokenizer and the pretrained model for sequence-to-sequence tasks for `google/mt5-small`.

**`TODO:`** Use the tokenizer to tokenize both an input and a target in one go. You can use "Hello" for input and "Bonjour" for output.

**`TODO:`** Unzip the `wmt14_fr_en_10k` dataset and load it into a HF `Dataset`. Split the data into a training and a validation set. To keep the run time short, use the `.select()` method of the `Dataset` (it takes the indices to keep, e.g. `range(N)`) to use only 3000 samples for training and 50 for validation. Then print the data to inspect their features and verify the number of samples.

**`TODO:`** Define a function that receives some samples and tokenizes them with the tokenizer we have defined. Prepend the phrase `"translate English to French: "` to the inputs. Tokenize the targets (French translations) as well. Use the `Dataset.map` method discussed earlier to apply your function to all of the data.

### Evaluate
- `evaluate` is Hugging Face's library for evaluation metrics.
- It provides ready-to-use implementations of common metrics (accuracy, F1, BLEU, ROUGE, etc.).
- `evaluate.load("sacrebleu")` loads the SacreBLEU metric, a standard metric for evaluating machine translation quality. Try it on the toy example in the `sacrebleu.compute` cell below. Any ideas about what the results could mean?

More information regarding HF Evaluate can be found in the following [documentation](https://huggingface.co/docs/evaluate/en/index).
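
To build some intuition for what a BLEU-style score measures, the sketch below computes clipped n-gram precisions and the brevity penalty by hand for one sentence pair. It is a simplified illustration under stated assumptions: tokens are split on whitespace and no smoothing is applied, whereas SacreBLEU uses its own tokenization and smoothing, so its reported score will not match a hand computation exactly. The helper names `ngrams` and `clipped_precision` are made up.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(pred_tokens, ref_tokens, n):
    """Return (matches, total) for the prediction's n-grams, clipped by the reference counts."""
    pred_counts = ngrams(pred_tokens, n)
    ref_counts = ngrams(ref_tokens, n)
    matches = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    return matches, sum(pred_counts.values())

pred = "the cat is on the mat".split()
ref = "there is a cat on the mat".split()

for n in range(1, 5):
    matches, total = clipped_precision(pred, ref, n)
    print(f"{n}-gram precision: {matches}/{total}")

# Brevity penalty: predictions shorter than the reference are penalised.
bp = 1.0 if len(pred) > len(ref) else math.exp(1 - len(ref) / len(pred))
print(f"brevity penalty: {bp:.3f}")
```

BLEU combines these precisions (as a geometric mean) with the brevity penalty, which is why a prediction that shares many single words with the reference can still score low when few longer n-grams match.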

In [None]:
sacrebleu = evaluate.load("sacrebleu")

predictions = ["the cat is on the mat"]
references = [["there is a cat on the mat"]]

results = sacrebleu.compute(predictions=predictions, references=references)
print(results)

### Data Collators (continued)
- `DataCollatorForSeq2Seq` is a collator for sequence-to-sequence tasks (like translation or summarization). It dynamically pads not only the inputs but also the labels (decoder side). Labels are padded with `label_pad_token_id`, which defaults to -100, so that padded positions are ignored during loss computation.
- The collator needs the tokenizer to pad the inputs. Passing the model as well is optional but useful: when the model provides `prepare_decoder_input_ids_from_labels`, the collator uses it to build `decoder_input_ids` from the labels, so that they match what the model expects during training.
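
The effect of padding labels with -100 can be checked directly in PyTorch. In this sketch the logits are random and the label values are made up; it only shows that cross-entropy with `ignore_index=-100` averages over the real target positions and skips the padded ones.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size = 10

# Made-up decoder logits for 2 target sequences of length 4.
logits = torch.randn(2, 4, vocab_size)

# The second target has only 2 real tokens; the rest is label padding (-100).
labels = torch.tensor([[3, 5, 1, 2],
                       [4, 1, -100, -100]])

# Cross-entropy skips every position whose label equals ignore_index.
loss_padded = F.cross_entropy(
    logits.view(-1, vocab_size), labels.view(-1), ignore_index=-100
)

# Same loss computed by hand over the 6 real positions only.
mask = labels != -100
loss_manual = F.cross_entropy(logits[mask], labels[mask])

print(loss_padded.item(), loss_manual.item())
```

Had the labels been padded with a real token ID instead, the model would also be trained to predict padding, which distorts the loss.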

In [None]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

Same as before, we need to be able to evaluate our model. Below is the evaluation method that we have chosen.

In [None]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Convert token IDs back to text
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in labels with the padding token ID, then decode
    labels = [[(l if l != -100 else tokenizer.pad_token_id) for l in label] for label in labels]
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # SacreBLEU expects a list of prediction strings, plus a list of lists for references
    results = sacrebleu.compute(
        predictions=decoded_preds,
        references=[[lbl] for lbl in decoded_labels]
    )
    return {"bleu": results["score"]}


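**`TODO:`** As before, define the `Seq2SeqTrainingArguments` and the `Seq2SeqTrainer`, then train the model. Once training is done, evaluate the final model.

Notes:
- You now have a validation split as well, which the trainer can use for evaluation.
- The `compute_metrics` function given above can be passed to the `Seq2SeqTrainer`, which will then report BLEU on the validation split. Since it decodes token IDs, the trainer must return generated sequences rather than logits; in `Seq2SeqTrainingArguments` this is controlled by `predict_with_generate=True`.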