# Task 2: Question Answering Fine-tuning

This task consists of building a *Question Answering* model on your own. To do so, you are encouraged to research how these models work and how the data must be preprocessed beforehand.

The dataset is selected at the top of the notebook. However, the choice, configuration, and training of the model are entirely up to you. Remember that the notebook must be submitted with every cell executed and without errors, including the evaluation cells at the end of the notebook.

In [1]:
# Libraries

import logging
logging.getLogger("transformers").setLevel(logging.ERROR)

import torch
print("Is CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)
print("Number of GPUs available:", torch.cuda.device_count())

from time import time
from datasets import *
from transformers import *
import pandas as pd
import numpy as np
import re

pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', None)
pd.set_option('display.colheader_justify', 'center')

Is CUDA available: True
CUDA version: 12.6
Number of GPUs available: 1
🚨 Config not found for parakeet. You can manually add it to HARDCODED_CONFIG_FOR_MODELS in utils/auto_docstring.py


TAPAS models are not usable since `tensorflow_probability` can't be loaded. It seems you have `tensorflow_probability` installed with the wrong tensorflow version. Please try to reinstall it following the instructions here: https://github.com/tensorflow/probability.
GroupViT models are not usable since `tensorflow_probability` can't be loaded. It seems you have `tensorflow_probability` installed with the wrong tensorflow version.Please try to reinstall it following the instructions here: https://github.com/tensorflow/probability.


## Dataset

The SQuAD dataset (Stanford Question Answering Dataset) is used mainly to train and evaluate reading-comprehension models. It consists of (question, answer, context) triples.

Here is the dataset card so you can explore it: https://huggingface.co/datasets/rajpurkar/squad
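As a rough illustration of that triple structure, the sketch below builds one SQuAD-style record by hand and checks that `answer_start` is a character offset into `context`. The field names follow the dataset's features; the record content and the `answer_span` helper are invented for illustration and are not taken from the dataset or any library.

```python
# Hand-written record in the SQuAD layout; the text is invented for illustration.
context = "The Eiffel Tower was completed in 1889 in Paris."
record = {
    "id": "example-0001",
    "title": "Example",
    "context": context,
    "question": "When was the Eiffel Tower completed?",
    # Each reference answer is stored with its character offset in the context.
    "answers": {"text": ["1889"], "answer_start": [context.find("1889")]},
}

def answer_span(rec, k=0):
    """Return the (start, end) character span of the k-th reference answer."""
    start = rec["answers"]["answer_start"][k]
    end = start + len(rec["answers"]["text"][k])
    return start, end

start, end = answer_span(record)
extracted = record["context"][start:end]
print(extracted)
```

Slicing the context with the stored offset recovers the answer text, which is the property the preprocessing step relies on when it converts answers into token positions.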

In [2]:
# Do not modify this cell
# This cell must be executed in the submission

dataset = load_dataset("squad")
dataset

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

Solely to keep training times short, we filter the dataset and keep only the records whose _context_ field is shorter than 400 characters.

The rest of the assignment must be carried out on the already filtered `dataset` variable.

In [3]:
# Do not modify this cell
# This cell must be executed in the submission

def filtra_por_longitud(ejemplo):
    return len(ejemplo["context"]) < 400

dataset = dataset.filter(filtra_por_longitud)

assert len(dataset['train']) == 6880
assert len(dataset['validation']) == 736

dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 6880
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 736
    })
})

## Modeling

This section is where you will do all the work for the assignment. The format, the analysis, the chosen model, and any intermediate steps you consider appropriate are entirely up to you. However, there are some guidelines you must follow:

- The `model_checkpoint` variable must store the name of the 🤗 model and tokenizer you are going to use.
- The `model` and `tokenizer` variables will store, respectively, the 🤗 model and tokenizer you are going to use. Note that the model must have the *AutoModelForQuestionAnswering* architecture.
- The `trainer` variable will store the 🤗 _Trainer_ that you will use in the next section to train the model.

**Important**
Using pretrained Hugging Face models that have already been trained on this dataset is not allowed. For example, `roberta-base-squad2`, `distilbert-base-cased-distilled-squad`, etc. would not be permitted.
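Extractive QA models predict a start token and an end token, whereas SQuAD annotates answers as character offsets, so preprocessing usually has to convert one into the other. The sketch below shows that conversion in plain Python under simplifying assumptions: a hypothetical whitespace tokenizer (`whitespace_offsets`) stands in for a real tokenizer's offset mapping, and `char_span_to_token_span` is an illustrative helper, not a library function.

```python
import re

def whitespace_offsets(text):
    """Hypothetical stand-in for an offset mapping: one (start, end) pair per word."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span to inclusive token indices, or (0, 0) if not covered."""
    if not offsets or offsets[0][0] > start_char or offsets[-1][1] < end_char:
        return 0, 0
    # Advance past every token that starts at or before the answer, then step back.
    token_start = 0
    while token_start < len(offsets) and offsets[token_start][0] <= start_char:
        token_start += 1
    token_start -= 1
    # Walk left past every token that ends at or after the answer, then step forward.
    token_end = len(offsets) - 1
    while token_end >= 0 and offsets[token_end][1] >= end_char:
        token_end -= 1
    token_end += 1
    return token_start, token_end

# Illustrative sentence and answer (not a record from the dataset).
context = "Super Bowl 50 was played in Santa Clara, California."
answer = "Santa Clara"
start_char = context.index(answer)
end_char = start_char + len(answer)

offsets = whitespace_offsets(context)
token_start, token_end = char_span_to_token_span(offsets, start_char, end_char)
tokens = [context[s:e] for s, e in offsets]
print(tokens[token_start:token_end + 1])
```

With a subword tokenizer the offsets are finer-grained, but the same two scans apply; the main extra step is restricting the search to the tokens that belong to the context rather than the question.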

In [4]:
model_checkpoint = "roberta-base"

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

max_length = 384

def preprocess_training_examples(examples):
    questions = [q.strip() for q in examples["question"]]
    contexts = examples["context"]

    tokenized_examples = tokenizer(
        questions,
        contexts,
        truncation="only_second",
        max_length=max_length,
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = tokenized_examples.pop("offset_mapping")

    start_positions = []
    end_positions = []

    for i, offsets in enumerate(offset_mapping):
        answers = examples["answers"][i]

        # No annotated answer: point both labels at token 0 (<s>).
        if len(answers["answer_start"]) == 0:
            start_positions.append(0)
            end_positions.append(0)
            continue

        # Character span of the first reference answer within the context.
        start_char = answers["answer_start"][0]
        end_char = start_char + len(answers["text"][0])

        sequence_ids = tokenized_examples.sequence_ids(i)

        # Locate the first and last tokens that belong to the context (sequence id 1).
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx

        idx = len(sequence_ids) - 1
        while sequence_ids[idx] != 1:
            idx -= 1
        context_end = idx

        # If the answer is not fully inside the (possibly truncated) context, label it (0, 0).
        if not (offsets[context_start][0] <= start_char and offsets[context_end][1] >= end_char):
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Advance past every token that starts at or before the answer, then step back one.
            token_start_index = context_start
            while token_start_index <= context_end and offsets[token_start_index][0] <= start_char:
                token_start_index += 1
            start_positions.append(token_start_index - 1)

            # Walk left past every token that ends at or after the answer, then step forward one.
            token_end_index = context_end
            while token_end_index >= context_start and offsets[token_end_index][1] >= end_char:
                token_end_index -= 1
            end_positions.append(token_end_index + 1)

    tokenized_examples["start_positions"] = start_positions
    tokenized_examples["end_positions"] = end_positions

    return tokenized_examples


tokenized_datasets = dataset.map(
    preprocess_training_examples,
    batched=True,
    remove_columns=dataset["train"].column_names,
)

train_dataset = tokenized_datasets["train"]
eval_dataset = tokenized_datasets["validation"]

# Data collator
data_collator = default_data_collator

# TrainingArguments
training_args = TrainingArguments(
    output_dir="./roberta_base_squad",
    overwrite_output_dir=True,
    do_train=True,
    do_eval=True,
    per_gpu_train_batch_size=16,
    per_gpu_eval_batch_size=16,
    learning_rate=2e-5,
    num_train_epochs=2,
    logging_dir="./roberta_base_squad/logs",
    report_to="none" # Added to prevent wandb error
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--roberta-base/snapshots/e2da8e2f811d1448a5b465c236feacd80ffbac7b/config.json
Model config RobertaConfig {
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.57.1",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}

loading file vocab.json from cache at /root/.cache/huggingface/hub/models--roberta-base/snapshots/e2da8e2f811d1448a5b465c236feacd80ffbac7b/vocab.json
loading file merges.txt from cache at /root/.ca


PyTorch: setting up devices
  trainer = Trainer(
Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.


## Training

In [5]:
# Do not modify this cell
# This cell must be executed in the submission

assert len(trainer.train_dataset) == 6880

start = time()

trainer.train()

end = time()
print(f">>>>>>>>>>>>> elapsed time: {(end-start)/60:.0f}m")

Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.
***** Running training *****
  Num examples = 6,880
  Num Epochs = 2
  Instantaneous batch size per device = 8
  Training with DataParallel so batch size has been adjusted to: 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 860
  Number of trainable parameters = 124,056,578


| Step | Training Loss |
|------|---------------|
| 500  | 1.5697        |


Saving model checkpoint to ./roberta_base_squad/checkpoint-500
Configuration saved in ./roberta_base_squad/checkpoint-500/config.json
Model weights saved in ./roberta_base_squad/checkpoint-500/model.safetensors
tokenizer config file saved in ./roberta_base_squad/checkpoint-500/tokenizer_config.json
Special tokens file saved in ./roberta_base_squad/checkpoint-500/special_tokens_map.json
Saving model checkpoint to ./roberta_base_squad/checkpoint-860
Configuration saved in ./roberta_base_squad/checkpoint-860/config.json
Model weights saved in ./roberta_base_squad/checkpoint-860/model.safetensors
tokenizer config file saved in ./roberta_base_squad/checkpoint-860/tokenizer_config.json
Special tokens file saved in ./roberta_base_squad/checkpoint-860/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)




>>>>>>>>>>>>> elapsed time: 3m


## Evaluation

In [6]:
# Do not modify this cell
# This cell must be executed in the submission

print(f"**** EVALUATION ****")
print(f"********\nTokenizer config:\n{tokenizer}")
print(f"\n\n********\nModel config:\n{model.config}")
print(f"\n\n********\nTrainer arguments:\n{trainer.args}")

**** EVALUATION ****
********
Tokenizer config:
RobertaTokenizerFast(name_or_path='roberta-base', vocab_size=50265, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': '<mask>'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	0: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	1: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	3: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	50264: AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=False, special=True),
}
)


********
Model config:
RobertaConfig {
  "architect

In [7]:
# Do not modify this cell
# This cell must be executed in the submission

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

question_answerer = pipeline("question-answering", model=model, tokenizer=tokenizer, device=device)

Device set to use cuda


In [8]:
# Do not modify this cell
# This cell must be executed in the submission

assert len(dataset['train']) == 6880
assert len(dataset['validation']) == 736

def calculate_sentence_similarity(sentence1, sentence2):
    sentence1 = re.sub(r'[^a-zA-Z0-9\s]', '', sentence1).lower()
    sentence2 = re.sub(r'[^a-zA-Z0-9\s]', '', sentence2).lower()
    words1 = set(sentence1.lower().split())
    words2 = set(sentence2.lower().split())
    matches = len(words1.intersection(words2))
    total_words = len(words1.union(words2))
    if total_words == 0:
        return 0.0
    return (matches / total_words) * 100

samples = [324,342,249,176,270,168,120,58,90,192,278,289,197,146,323,248,260,273,112,211]
evaluation_list = []

for ii in samples:
    context = dataset['validation'][ii]['context']
    question = dataset['validation'][ii]['question']
    answer = dataset['validation'][ii]['answers']
    answers = [f"{tt}" for ii, tt in enumerate(answer['text'])]
    prediction = question_answerer(context=context, question=question)['answer']
    match = max([calculate_sentence_similarity(w, prediction) for w in answers])
    evaluation_list.append((ii,context,question,answers,prediction,match))

print(f"*** evaluation_df ***")
evaluation_df = pd.DataFrame(evaluation_list, columns=['sample', 'context', 'question', 'real_answers', 'predicted_answer', 'match'])
evaluation_df[['sample','real_answers','predicted_answer', 'match']]

Disabling tokenizer parallelism, we're using DataLoader multithreading already
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


*** evaluation_df ***


|   | sample | real_answers | predicted_answer | match |
|---|--------|--------------|------------------|-------|
| 0 | 324 | [electric lamps, electric lamps wirelessly, electric lamps] | electric lamps | 100.0 |
| 1 | 342 | [stationary, stationary, stationary waves] | stationary | 100.0 |
| 2 | 249 | [Kevin Harlan, Kevin Harlan, Kevin Harlan] | Kevin Harlan | 100.0 |
| 3 | 176 | [Tuesday, Tuesday afternoon prior to the game, Tuesday] | Tuesday afternoon | 50.0 |
| 4 | 270 | [Lady Gaga, Lady Gaga, Lady Gaga] | Lady Gaga | 100.0 |
| 5 | 168 | [San Jose Marriott., the San Jose Marriott, San Jose Marriott.] | San Jose Marriott | 100.0 |
| 6 | 120 | [11, 11, 11] | 11 | 100.0 |
| 7 | 58 | [Roger Goodell, Roger Goodell, Goodell] | Roger Goodell | 100.0 |
| 8 | 90 | [2014, 2014, 2014] | 2014 | 100.0 |
| 9 | 192 | [25 percent, 25 percent, 25] | 25 percent | 100.0 |


### Evaluation criteria

The **final grade for Task 2** will depend on the quality of your model's predictions.

The evaluation criteria are as follows:

- Task 2 will receive a passing grade if the notebook is submitted without errors and with a trained model (regardless of its predictions).
- The grade will be weighted by the _match_ column, which gives 100% when the predicted words coincide with those of a reference answer (ignoring case and punctuation) and decreases gradually with the number of words that do not match.

Note: The grade computed below is indicative and may be reduced depending on the code in the submission.
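To make the scoring concrete, the sketch below reimplements the word-overlap measure as a small standalone function and applies the grade formula from the final cell to hypothetical score lists. `word_overlap` and `grade` are illustrative names, and the score lists passed to `grade` are invented; only the "Tuesday afternoon" comparison mirrors a row of the evaluation table.

```python
import math
import re

def word_overlap(reference, prediction):
    """Shared words over all distinct words, as a percentage (case and punctuation ignored)."""
    def clean(s):
        return set(re.sub(r"[^a-zA-Z0-9\s]", "", s).lower().split())
    ref_words, pred_words = clean(reference), clean(prediction)
    union = ref_words | pred_words
    if not union:
        return 0.0
    return len(ref_words & pred_words) / len(union) * 100

def grade(match_scores):
    """Mean match divided by 10, rounded up, with a floor of 5."""
    mean = sum(match_scores) / len(match_scores)
    return max(float(math.ceil(mean / 10)), 5.0)

# The best score over all reference answers is the one that counts.
references = ["Tuesday", "Tuesday afternoon prior to the game", "Tuesday"]
best = max(word_overlap(r, "Tuesday afternoon") for r in references)
print(best)

# Punctuation and case differences do not lower the score.
print(word_overlap("San Jose Marriott.", "san jose marriott"))

# Hypothetical score lists: a strong run and a weak run that hits the floor.
print(grade([100.0, 50.0, 100.0]))
print(grade([20.0, 0.0, 10.0]))
```

Because the measure divides by the union of both word sets, a prediction that adds extra words is penalized just as one that omits words, which is why "Tuesday afternoon" scores only 50 against the reference "Tuesday".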

In [9]:
print(f"Your grade for Task 2 is: {max(np.ceil(evaluation_df['match'].sum() / len(evaluation_df) / 10), 5.0)}")

Your grade for Task 2 is: 10.0
