In [82]:
from transformers import TrainingArguments, Trainer
from transformers import DataCollatorWithPadding, DataCollatorForSeq2Seq
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

from datasets import load_dataset, load_from_disk

import torch
from torch.utils.data import DataLoader

import evaluate
from sklearn.metrics import accuracy_score, f1_score

from tqdm import tqdm

# Task-A: Sentiment Analysis

## A.1 Pretrained model
In this subsection we will load a pretrained model and evaluate its performance on a movie-review sentiment analysis task.

In Hugging Face’s Transformers library, the `Auto Classes` are generic wrappers that automatically pick the right model/tokenizer/config class for you, based on the pretrained model you’re loading.

### AutoTokenizer
- `AutoTokenizer` is a Hugging Face class that automatically picks the right tokenizer for the model you specify.
- A tokenizer is responsible for splitting text into tokens (subwords or word pieces) that the model can understand.
- `AutoTokenizer.from_pretrained(model_name)` downloads and loads the pretrained tokenizer for the model of your choice.
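
To make the idea of subword tokenization concrete, here is a minimal sketch of greedy longest-match splitting, in the spirit of the WordPiece scheme used by BERT-style models. The vocabulary `toy_vocab` and the function `wordpiece_split` are hypothetical and exist only for illustration; a real tokenizer uses a much larger learned vocabulary and extra normalization steps.

```python
# Toy vocabulary (hypothetical); "##" marks a piece that continues a word.
toy_vocab = {"un", "##believ", "##able", "movie", "[UNK]"}

def wordpiece_split(word, vocab):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

print(wordpiece_split("unbelievable", toy_vocab))  # ['un', '##believ', '##able']
print(wordpiece_split("xyz", toy_vocab))           # ['[UNK]']
```

The model never sees the strings themselves; each piece is mapped to an integer id, which is what ends up in `input_ids`.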

### AutoModelForSequenceClassification
- `AutoModelForSequenceClassification` is a Hugging Face class that loads a model configured for sequence classification (e.g., sentiment analysis, text classification).
- `AutoModelForSequenceClassification.from_pretrained(model_name)` loads a model that has already been trained (pretrained) and published.

This is the setup we’re using for the following exercise. In practice, Hugging Face provides many additional classes for a wide variety of setups. You are encouraged to explore the [Auto Classes documentation](https://huggingface.co/docs/transformers/en/model_doc/auto) to get a broader understanding of what’s available.

**`TODO:`** Load the tokenizer and the pretrained model for sequence classification for `distilbert/distilbert-base-uncased`.

In [3]:
model_name = "distilbert/distilbert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Datasets
- The `datasets` library is another core library from Hugging Face, separate from `transformers`.
- In short, this library is designed to make it easy to access, share, preprocess, and work with large datasets (especially for NLP, but also vision, audio, and multimodal tasks).
- The `load_dataset("dataset_name")` function pulls and prepares a dataset from the internet.
- The `load_from_disk("dataset_path")` function loads a dataset that was previously saved to a local directory.
- The `.map()` method applies a function to every element (or batch of elements) in a Dataset. Have a look at an example here: [link](https://huggingface.co/docs/datasets/en/process#map).
- When loading a dataset with `load_dataset`, the `split` argument can be used to get a specific split.
- Note: These are different from the PyTorch datasets. They're similar and it's easy to transition from one to the other but they're not identical.
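
The batched form of `.map()` can be puzzling at first, so here is a plain-Python sketch of the calling convention only. It is not how `datasets` is implemented, and `toy_batched_map` and `add_length` are hypothetical names. The point it illustrates is that, with `batched=True`, the function receives a dictionary of lists (one list per column) and returns a dictionary of lists that is merged back as new or updated columns.

```python
def toy_batched_map(columns, fn, batch_size=2):
    """Apply fn to slices of a column-oriented table and merge the new columns."""
    n_rows = len(next(iter(columns.values())))
    result = {name: list(values) for name, values in columns.items()}
    for start in range(0, n_rows, batch_size):
        # Each batch is a dict of lists, one list per column.
        batch = {name: values[start:start + batch_size]
                 for name, values in columns.items()}
        for name, values in fn(batch).items():
            result.setdefault(name, [None] * n_rows)
            result[name][start:start + batch_size] = values
    return result

def add_length(batch):
    # Receives a list of texts and returns a list of the same length.
    return {"n_chars": [len(text) for text in batch["text"]]}

table = {"text": ["good", "bad film", "ok"], "label": [1, 0, 1]}
print(toy_batched_map(table, add_length)["n_chars"])  # [4, 8, 2]
```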

For more details on HF Datasets, you are encouraged to explore the following [documentation](https://huggingface.co/docs/datasets/en/index).

**`TODO:`** Load the train and test split of the `imdb` dataset. How many samples are in each split?

In [14]:
train_data = load_dataset("imdb", split="train")
test_data = load_dataset("imdb", split="test")

print('#Training samples: ', len(train_data))
print('#Test samples: ', len(test_data))

#Training samples:  25000
#Test samples:  25000


### DataLoader
- `DataLoader` is a PyTorch utility that wraps a dataset and handles batching, shuffling, and parallel loading.
- It takes a dataset (either an HF or a PyTorch dataset) and returns an iterator you can loop over in training or evaluation.

Have a look at its [documentation](https://docs.pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader). Specifically, see which arguments are needed to load a dataset, to set your own batch size, and to decide whether the data will be shuffled or not.

**`TODO:`** Load your test data into a `DataLoader` of batch size 16. Do not shuffle the data.

In [33]:
test_loader = DataLoader(test_data, batch_size=16, shuffle=False)

**`TODO:`** As we have previously mentioned, `DataLoader` returns an iterator. Using a `for` loop, investigate what the iterator returns for its first iteration.

In [34]:
# Inspect the first batch returned by the data loader
for batch in test_loader:
    print(batch['text'])
    print(batch['label'])
    break

['I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn\'t match the background, and painfully one-dimensional characters cannot be overcome with a \'sci-fi\' setting. (I\'m sure there are those of you out there who think Babylon 5 is good sci-fi TV. It\'s not. It\'s clichéd and uninspiring.) While US viewers might like emotion and character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, yet not as a serious philosophy. It\'s really difficult to care about the characters here as they are not simply foolish, just missing a spark of life. Their actions and reactions are wooden and predictable, often painful to watch. The makers of Earth KNOW it\'s rubbish as they hav
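
The batch printed above is a dictionary with one entry per column, which is why `batch['text']` is a list of strings rather than a single string. The following plain-Python sketch mimics that grouping step; `toy_batches` and the three samples are made up for illustration, and the real `DataLoader` additionally converts numeric fields such as the labels into tensors.

```python
def toy_batches(samples, batch_size):
    """Yield consecutive batches, each a dict of lists (no shuffling)."""
    for start in range(0, len(samples), batch_size):
        chunk = samples[start:start + batch_size]
        # Merge the per-sample dicts into one dict with a list per field.
        yield {key: [sample[key] for sample in chunk] for key in chunk[0]}

samples = [
    {"text": "great film", "label": 1},
    {"text": "dull plot", "label": 0},
    {"text": "loved it", "label": 1},
]
batches = list(toy_batches(samples, batch_size=2))
print(len(batches))         # 2 (the last batch holds the single leftover sample)
print(batches[0]["text"])   # ['great film', 'dull plot']
print(batches[1]["label"])  # [1]
```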

In [22]:
def evaluate_model(model, tokenizer, test_loader):
    """
    Evaluate a Hugging Face sequence classification model on a test dataset.

    Parameters:
        model (transformers.AutoModelForSequenceClassification): The pretrained sequence classification model to evaluate.
        tokenizer (transformers.AutoTokenizer): The tokenizer used to preprocess the input text for the model.
        test_loader (torch.utils.data.DataLoader): A DataLoader providing the test dataset.

    Returns:
        all_preds (torch.Tensor): Model predictions on the test set.
        all_labels (torch.Tensor): Ground truth labels from the test set.
        acc (float): Overall accuracy on the test set.
        f1 (float): F1 score on the test set.
    """

    all_labels = None
    all_preds = None

    # Put the model in evaluation mode so that dropout is switched off
    model.eval()

    for batch in tqdm(test_loader):
        text = batch['text']
        labels = batch['label']

        # TODO: Tokenize the text using the provided tokenizer, use both truncation and padding.
        inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

        if torch.cuda.is_available():
            inputs = inputs.to('cuda')
            model = model.to('cuda')

        # TODO: Perform a forward pass of the tokenized inputs through the model
        with torch.no_grad():
            outputs = model(**inputs)
        
        # TODO: Get the logits from the model's output and compute the predictions by taking the argmax
        logits = outputs.logits
        preds = torch.argmax(logits, dim=-1).cpu()

        if all_labels is None:
            all_labels = labels.cpu()
            all_preds = preds.cpu()
        else:
            all_labels = torch.cat((all_labels, labels.cpu()))
            all_preds = torch.cat((all_preds, preds.cpu()))

    # TODO: compute f1 score between model predictions and ground-truth labels (you can use sklearn.metrics)
    f1 = f1_score(all_labels, all_preds)

    # TODO: compute accuracy score between model predictions and ground-truth labels (you can use sklearn.metrics)
    acc = accuracy_score(all_labels, all_preds)

    # TODO: compute the accuracy on Positive(label==1) samples
    pos_acc = accuracy_score(all_labels[all_labels==1], all_preds[all_labels==1])

    # TODO: compute the accuracy on Negative(label==0) samples
    neg_acc = accuracy_score(all_labels[all_labels==0], all_preds[all_labels==0])

    print('Accuracy: ', acc*100, '%')
    print(' -- Positive Accuracy: ', pos_acc*100, '%')
    print(' -- Negative Accuracy: ', neg_acc*100, '%')
    print('F1 score: ', f1)

    return all_preds, all_labels, acc, f1
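
To see what the four numbers printed by `evaluate_model` measure, here is a small worked example in plain Python. The labels and predictions are made up. It assumes the binary setting used here, where `f1_score` reports the F1 of the positive class (label 1) by default; note that the accuracy restricted to one class is the same quantity as the recall of that class.

```python
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # made-up ground truth
preds  = [1, 1, 1, 0, 1, 1, 0, 0]  # made-up predictions

pairs = list(zip(labels, preds))
tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # positives found
fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # positives missed
fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # false alarms
tn = sum(1 for y, p in pairs if y == 0 and p == 0)  # negatives found

accuracy = (tp + tn) / len(labels)  # 5 / 8
pos_acc = tp / (tp + fn)            # accuracy on label == 1 (recall)
neg_acc = tn / (tn + fp)            # accuracy on label == 0
precision = tp / (tp + fp)
f1 = 2 * precision * pos_acc / (precision + pos_acc)

print(accuracy, pos_acc, neg_acc, round(f1, 4))  # 0.625 0.75 0.5 0.6667
```

A large gap between the positive and negative accuracy is a hint that the model predicts one class almost all the time, which the overall accuracy alone can hide.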

## A.2 Finetuned model
In this section, we will further train the model for the task that we are interested in and see if we can increase its performance.

**`TODO:`** Use `evaluate_model` to measure the performance of the pretrained model.

In [None]:
all_preds, all_labels, acc, f1 = evaluate_model(model, tokenizer, test_loader)

**`TODO:`** Define a function that receives some samples and tokenizes them with the tokenizer we have defined. Use the `Dataset.map` method that we have previously discussed to apply your function to the train data. Keep in mind when defining the tokenizing function that `.map()` can pass it either a single sample or a batch of samples.

In [40]:
def tokenize(samples):
    return tokenizer(samples['text'], truncation=True)

# TODO: Tokenize the training data
tokenized_train = train_data.map(tokenize, batched = True)

"If only to avoid making this type of film in the future. This film is interesting as an experiment but tells no cogent story.<br /><br />One might feel virtuous for sitting thru it because it touches on so many IMPORTANT issues but it does so without any discernable motive. The viewer comes away with no new perspectives (unless one comes up with one while one's mind wanders, as it will invariably do during this pointless film).<br /><br />One might better spend one's time staring out a window at a tree growing.<br /><br />"

### Data Collator
- A data collator is a small function/class that tells the DataLoader how to merge a list of individual samples into a single batch.
- In NLP, a main challenge is padding. Since sentences have different lengths, you need to pad them so they fit into a uniform tensor batch.
- The `DataCollatorWithPadding` will automatically pad sequences in a batch to the length of the longest sequence in that batch (dynamic padding) based on the tokenizer you're using.

For more info on Data Collators please refer to the following [documentation](https://huggingface.co/docs/transformers/en/main_classes/data_collator#transformers.DataCollatorWithPadding).
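
Dynamic padding can be sketched in a few lines of plain Python. This is an illustration of the idea, not the library's implementation: `pad_batch` is a hypothetical helper, the token ids are made up, and the padding id of 0 is an assumption (the real collator takes it from the tokenizer).

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest sequence in this batch only."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # The attention mask is 1 for real tokens and 0 for padding.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq))
                      for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[101, 2023, 102], [101, 2023, 3185, 2001, 102]])
print(batch["input_ids"][0])       # [101, 2023, 102, 0, 0]
print(batch["attention_mask"][0])  # [1, 1, 1, 0, 0]
```

Because the target length is computed per batch, a batch of short reviews is padded far less than it would be if every sequence were padded to a fixed maximum length.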

**`TODO:`** Define a data collator that automatically pads sequences in a batch based on the defined `tokenizer`.

In [None]:
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer = tokenizer)

### Training Arguments
- A configuration object that stores all the knobs and settings related to the training procedure (like learning rate, batch size, number of epochs, output directory, etc.).
- It tells the training loop how to train (e.g., optimization settings, saving/checkpoint rules, logging options).
- You don’t train with it directly; you only define the "rules of training."

### Trainer
- This is the high-level training loop: it actually runs the training, evaluation, and prediction based on your model, data, and `TrainingArguments`.
- It takes care of all the heavy lifting (forward pass, loss calculation, backprop, optimizer steps, checkpoint saving, etc.).
- You just call methods like `.train()`, `.evaluate()`, or `.predict()` without writing a manual loop.

In [None]:
output_dir_name = "imdb-finetuned-distilbert"

training_args = TrainingArguments(
    output_dir=output_dir_name,          # Where to save model checkpoints and logs
    learning_rate=2e-5,                  # Step size for the optimizer (how fast the model learns)
    per_device_train_batch_size=16,      # Training batch size per GPU/CPU device
    per_device_eval_batch_size=16,       # Evaluation batch size per GPU/CPU device
    num_train_epochs=1,                  # Number of times to iterate over the full training dataset
    weight_decay=0.01,                   # Weight decay coefficient for the optimizer (regularization that helps prevent overfitting)
    save_strategy="epoch",               # When to save checkpoints ("epoch" = at the end of each epoch)
    push_to_hub=False,                   # Whether to push the model to the Hugging Face Hub
    report_to="none"                     # Where to report logs (e.g., "wandb", "tensorboard", "none")
)
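
As a quick sanity check on these settings, the number of optimizer steps can be estimated from the dataset size printed earlier. This back-of-the-envelope sketch assumes a single device and no gradient accumulation; with several devices or accumulation the count would be smaller.

```python
import math

num_train_samples = 25000  # size of the IMDB train split printed earlier
batch_size = 16            # per_device_train_batch_size
num_epochs = 1             # num_train_epochs

# The last, partially filled batch still counts as a step, hence the ceiling.
steps_per_epoch = math.ceil(num_train_samples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 1563 1563
```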

**`TODO:`** Based on everything that we have defined so far in this exercise, complete the following code to initialize the trainer.

In [None]:
# TODO: Initialize the trainer
trainer = Trainer(
   model = model,
   args = training_args,
   train_dataset = tokenized_train,
   tokenizer = tokenizer,
   data_collator = data_collator,
)

**`TODO:`** Use the `.train()` method of the trainer to train the model.

In [None]:
trainer.train()

**`TODO:`** Use `evaluate_model` to measure the performance of the fine-tuned model.

In [None]:
all_preds, all_labels, acc, f1 = evaluate_model(model, tokenizer, test_loader)

# Task-B: Machine Translation

### `text_target` in Tokenizers

- The `text_target` argument is used when working with sequence-to-sequence (encoder–decoder) models such as T5, BART, mBART, or mT5.
- It allows you to tokenize the target/output text (e.g. a translation or summary) alongside the input text in a single call to the tokenizer.
- The tokenized targets are stored under the key `labels`, which the model uses during training for loss computation.

### AutoModelForSeq2SeqLM
- `AutoModelForSeq2SeqLM` is a Hugging Face class that loads a model configured for sequence-to-sequence tasks (e.g., machine translation, text summarization, question answering, text generation with input-output pairs).
- These models typically take in a sequence as input (e.g., a sentence or paragraph) and generate a new sequence as output (e.g., a translated or summarized version).

Note: The tokenizer of this model requires the `protobuf` package (`pip install protobuf`).


In [None]:
model_name = "google/mt5-small"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

**`TODO:`** Use the tokenizer to tokenize both an input and a target in one go. You can use "Hello" as the input and "Bonjour" as the target.

In [77]:
ts = tokenizer("Hello", text_target="Bonjour")
ts

{'input_ids': [30273, 1], 'attention_mask': [1, 1], 'labels': [259, 66392, 1]}

**`TODO:`** Unzip the `wmt14_fr_en_10k` dataset and load it into a HF `Dataset`. Split the data into a training and a validation set. To keep the run time short, use the `.select()` method of the `Dataset`, which takes the indices to keep (e.g. `range(N)`), to use only 3000 samples for training and 50 for validation. Then print the data to see its features and confirm the number of samples.

In [62]:
data = load_from_disk('wmt14_fr_en_10k')
data["train"] = data["train"].select(range(3000))
data["validation"] = data["validation"].select(range(50))
print(data)

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 3000
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 50
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 3003
    })
})


**`TODO:`** Define a function that receives some samples and tokenizes them with the tokenizer we have defined. Prepend the phrase `"translate English to French: "` to the inputs. Tokenize the targets (the French translations) as well. Use the `Dataset.map` method that we have previously discussed to apply your function to all of the data.

In [79]:
def preprocess_function(examples):
    prefix = "translate English to French: "
    inputs = [prefix + src["en"] for src in examples["translation"]]
    targets = [tgt["fr"] for tgt in examples["translation"]]

    model_inputs = tokenizer(inputs, text_target=targets, max_length=64, truncation=True)
    return model_inputs

tokenized_datasets = data.map(preprocess_function, batched=True)

### Evaluate
- `evaluate` is Hugging Face's library for evaluation.
- It provides ready-to-use implementations of common metrics (accuracy, F1, BLEU, ROUGE, etc.).
- Calling `evaluate.load("sacrebleu")` loads the SacreBLEU metric, a standard metric for evaluating machine translation quality. Give it a try in the following example. Any ideas about what the results could mean?

More information regarding HF Evaluate can be found in the following [documentation](https://huggingface.co/docs/evaluate/en/index).

In [90]:
sacrebleu = evaluate.load("sacrebleu")

predictions = ["the cat is on the mat"]
references = [["there is a cat on the mat"]]

results = sacrebleu.compute(predictions=predictions, references=references)
print(results)

{'score': 29.05925408079185, 'counts': [5, 2, 1, 0], 'totals': [6, 5, 4, 3], 'precisions': [83.33333333333333, 40.0, 25.0, 16.666666666666668], 'bp': 0.846481724890614, 'sys_len': 6, 'ref_len': 7}
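
One way to read the printed dictionary is to rebuild the score from its parts. The sketch below uses only the `counts`, `totals`, `sys_len` and `ref_len` values shown above. The handling of the zero 4-gram count is an assumption about the default smoothing (the k-th zero count is replaced by 1 / (2^k * total)); it was chosen because it reproduces the printed precision of 16.67.

```python
import math

counts = [5, 2, 1, 0]   # matching n-grams for n = 1..4
totals = [6, 5, 4, 3]   # n-grams in the prediction for n = 1..4
sys_len, ref_len = 6, 7 # prediction and reference lengths in tokens

# Assumed smoothing: the k-th zero count becomes 1 / (2^k * total).
precisions, k = [], 0
for count, total in zip(counts, totals):
    if count == 0:
        k += 1
        precisions.append(1 / (2 ** k * total))
    else:
        precisions.append(count / total)

# Brevity penalty: penalizes predictions shorter than the reference.
bp = 1.0 if sys_len > ref_len else math.exp(1 - ref_len / sys_len)

# BLEU = brevity penalty * geometric mean of the n-gram precisions.
log_mean = sum(math.log(p) for p in precisions) / len(precisions)
score = 100 * bp * math.exp(log_mean)
print(round(bp, 4), round(score, 2))  # 0.8465 29.06
```

So the score of about 29 combines two effects: few long n-grams of the prediction appear in the reference, and the prediction is one token shorter than the reference.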


### Data Collators (continued)
- `DataCollatorForSeq2Seq` is a collator for sequence-to-sequence tasks (like translation or summarization). It not only pads the inputs dynamically, but also pads the labels (decoder side), using -100 as the padding value so that padded positions are ignored during loss computation.
- It is initialized with the tokenizer and, optionally, the model. When the model is given, the collator can also use it to prepare the decoder inputs from the labels, so that the batch matches what the model expects for training and loss calculation.

In [91]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
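
The effect of padding labels with -100 can be illustrated in plain Python. This is a sketch of the idea rather than the collator's code: `pad_labels` and `mean_loss` are hypothetical helpers, the second label sequence and all loss values are made up, and -100 is used because it is the default index that PyTorch's cross-entropy loss ignores.

```python
IGNORE_INDEX = -100  # label value that the loss skips

def pad_labels(label_seqs, pad_value=IGNORE_INDEX):
    """Pad label lists to the longest one in the batch with the ignore index."""
    max_len = max(len(seq) for seq in label_seqs)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in label_seqs]

def mean_loss(per_token_losses, padded_labels):
    """Average per-token losses over positions whose label is not ignored."""
    kept = [loss
            for losses, labels in zip(per_token_losses, padded_labels)
            for loss, label in zip(losses, labels)
            if label != IGNORE_INDEX]
    return sum(kept) / len(kept)

labels = pad_labels([[259, 66392, 1], [4837, 1]])  # second row gets one -100
losses = [[0.5, 1.5, 1.0], [2.0, 1.0, 9.0]]        # made-up per-token losses
print(labels[1])                  # [4837, 1, -100]
print(mean_loss(losses, labels))  # 1.2
```

The large made-up loss of 9.0 sits on a padded position, so it does not influence the average; without the ignore index the model would be penalized for its output on padding.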

Same as before, we need to be able to evaluate our model. Below is the evaluation method that we have chosen.