# Task 2: Question Answering Fine-tuning

In [1]:
# Libraries

import logging
logging.getLogger("transformers").setLevel(logging.ERROR)

import torch
print("Is CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)
print("Number of GPUs available:", torch.cuda.device_count())

from time import time
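from datasets import load_dataset
from transformers import (
    AutoModelForQuestionAnswering,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
    pipeline,
)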
import pandas as pd
import numpy as np
import re

pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', None)
pd.set_option('display.colheader_justify', 'center')

Is CUDA available: True
CUDA version: 12.8
Number of GPUs available: 1



## Dataset

SQuAD (Stanford Question Answering Dataset) is a dataset used primarily to train and evaluate reading-comprehension models. It consists of triples of question, answer, and context.

Here is the dataset card so you can explore it: https://huggingface.co/datasets/rajpurkar/squad
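
To make the record layout concrete, the following minimal sketch builds a hypothetical record with the same fields as the dataset (`id`, `title`, `context`, `question`, `answers`) and checks that each `answer_start` is a character offset into `context`. The record contents and the helper `answer_spans` are illustrative assumptions, not data taken from SQuAD.

```python
# Hypothetical record with the same fields as a SQuAD example (not real dataset content).
record = {
    "id": "example-0001",
    "title": "Example",
    "context": "Paris is the capital of France.",
    "question": "What is the capital of France?",
    "answers": {"text": ["Paris"], "answer_start": [0]},
}


def answer_spans(record):
    """Return (start, end, text) character spans for every annotated answer."""
    spans = []
    answers = record["answers"]
    for text, start in zip(answers["text"], answers["answer_start"]):
        end = start + len(text)
        # The answer text must be recoverable by slicing the context.
        assert record["context"][start:end] == text
        spans.append((start, end, text))
    return spans


print(answer_spans(record))
```

The preprocessing step later in the notebook relies on this character-offset convention when it converts answers into token positions.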

In [2]:
# Do not modify this cell
# This cell must be executed in the submission

dataset = load_dataset("squad")
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

Solely to keep training times short, we filter the dataset and keep only the records whose _context_ field is shorter than 300 characters.

The rest of the assignment must be carried out on the variable `ds_tarea`.

In [3]:
# Do not modify this cell
# This cell must be executed in the submission

def filtra_por_longitud(ejemplo):
    return len(ejemplo["context"]) < 300

ds_tarea = dataset.filter(filtra_por_longitud)

assert len(ds_tarea['train']) == 3466
assert len(ds_tarea['validation']) == 345

ds_tarea

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 3466
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 345
    })
})

## Modeling

This section is where you will do all the work for the assignment. The format, the analysis, the chosen model, and any intermediate steps you consider appropriate are entirely up to you. However, there are a few guidelines you must follow:

- The variable `model_checkpoint` must store the name of the 🤗 model and tokenizer you are going to use.
- The variables `model` and `tokenizer` will store, respectively, the 🤗 model and tokenizer you are going to use.
- The variable `trainer` will store the 🤗 _Trainer_ that you will use in the next section to train the model.

In [4]:
# Model and tokenizer definition

model_checkpoint = 'deepset/roberta-base-squad2' #'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

# Dataset preprocessing

def preprocess_function(x):
    tokenized = tokenizer(
        x['question'],
        x['context'],
        truncation='only_second',
        padding='max_length',
        max_length=384,
        stride=256,
        return_offsets_mapping=True,
        return_overflowing_tokens=True
    )
    
    sample_mapping = tokenized.pop('overflow_to_sample_mapping')
    offset_mapping = tokenized.pop('offset_mapping')
    
    start_positions = []
    end_positions = []
    
    for i, offsets in enumerate(offset_mapping):
        input_ids = tokenized['input_ids'][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)
        
        sequence_ids = tokenized.sequence_ids(i)
        sample_index = sample_mapping[i]
        answers = x['answers'][sample_index]
        
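        if len(answers['answer_start']) == 0:
            # No annotated answer: label the feature with the CLS token.
            start_positions.append(cls_index)
            end_positions.append(cls_index)
        else:
            # Character span of the first annotated answer within the context.
            start_char = answers['answer_start'][0]
            end_char = start_char + len(answers['text'][0])

            # First token that belongs to the context (sequence id 1).
            token_start_index = 0
            while sequence_ids[token_start_index] != 1:
                token_start_index += 1

            # Last token that belongs to the context.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != 1:
                token_end_index -= 1

            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                # The answer is not fully inside this feature: label it with the CLS token.
                start_positions.append(cls_index)
                end_positions.append(cls_index)
            else:
                # Advance to the last token that starts at or before start_char.
                # The scan is bounded by the context end so that special and padding
                # tokens, whose offsets are (0, 0), are never counted.
                while token_start_index <= token_end_index and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                start_positions.append(token_start_index - 1)

                # Step back to the first token that ends at or after end_char.
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                end_positions.append(token_end_index + 1)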
    
    tokenized['start_positions'] = start_positions
    tokenized['end_positions'] = end_positions
    return tokenized

ds_tokenized = ds_tarea.map(preprocess_function, batched=True)

# DataCollator

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

loading configuration file config.json from cache at C:\Users\jacob\.cache\huggingface\hub\models--deepset--roberta-base-squad2\snapshots\adc3b06f79f797d1c575d5479d6f5efe54a9e3b4\config.json
Model config RobertaConfig {
  "_name_or_path": "deepset/roberta-base-squad2",
  "architectures": [
    "RobertaForQuestionAnswering"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "language": "english",
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "name": "Roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.46.3",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}

loading file vocab.json from cac
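
The core of `preprocess_function` is the conversion of a character span into token positions through the offset mapping. The sketch below isolates that step in plain Python. It is a simplified illustration, not the notebook's exact code: the helper name `char_span_to_token_span` is hypothetical, and the offsets and sequence ids are hand-written at word level rather than produced by the RoBERTa tokenizer.

```python
def char_span_to_token_span(offsets, sequence_ids, start_char, end_char):
    """Map a character span of the context to (start_token, end_token).

    Returns None when the answer is not fully covered by the context tokens.
    """
    # First and last token that belong to the context (sequence id 1).
    ctx_start = sequence_ids.index(1)
    ctx_end = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)

    if offsets[ctx_start][0] > start_char or offsets[ctx_end][1] < end_char:
        return None

    # Last context token that starts at or before start_char.
    start_token = ctx_start
    while start_token <= ctx_end and offsets[start_token][0] <= start_char:
        start_token += 1
    start_token -= 1

    # First context token that ends at or after end_char.
    end_token = ctx_end
    while end_token >= ctx_start and offsets[end_token][1] >= end_char:
        end_token -= 1
    end_token += 1

    return start_token, end_token


# Toy input (assumption): question "Which city", context "Paris is the capital of France".
# Special tokens carry sequence id None and offset (0, 0).
sequence_ids = [None, 0, 0, None, None, 1, 1, 1, 1, 1, 1, None]
offsets = [(0, 0), (0, 5), (6, 10), (0, 0), (0, 0),
           (0, 5), (6, 8), (9, 12), (13, 20), (21, 23), (24, 30), (0, 0)]

# "the capital" occupies characters 9..20 of the context.
print(char_span_to_token_span(offsets, sequence_ids, 9, 20))
```

In the toy input, the span "the capital" maps to the tokens at positions 7 and 8, which are the labels a question-answering head would be trained to predict.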

In [5]:
# Fine-Tuning

training_args = TrainingArguments(
    output_dir='./results2',
    evaluation_strategy='epoch',
    learning_rate=3e-5,
    per_device_eval_batch_size=16,
    per_device_train_batch_size=8,
    num_train_epochs=4,
    weight_decay=0.01,
    save_strategy='epoch',
    load_best_model_at_end=True,
    metric_for_best_model='eval_loss',
    eval_steps=350,     # unused while evaluation_strategy='epoch'
    logging_steps=350,  # training loss is logged every 350 optimizer steps
    save_total_limit=1,
    gradient_accumulation_steps=2,
    lr_scheduler_type='polynomial',
    max_grad_norm=1.0,
    warmup_ratio=0.1
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
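
As a rough illustration of what `lr_scheduler_type='polynomial'` combined with `warmup_ratio=0.1` implies, the sketch below computes the learning rate per optimizer step in plain Python. It assumes linear warmup followed by polynomial decay with power 1.0 down to a final rate of 1e-7 (taken here to be the scheduler defaults, which is an assumption), and it uses the 868 total optimization steps reported in the training log. The function name `polynomial_lr` is hypothetical.

```python
import math


def polynomial_lr(step, total_steps, warmup_steps, lr_init=3e-5, lr_end=1e-7, power=1.0):
    """Learning rate at `step` under linear warmup and polynomial decay."""
    if step < warmup_steps:
        # Linear ramp from 0 to lr_init.
        return lr_init * step / max(1, warmup_steps)
    if step > total_steps:
        return lr_end
    # Fraction of the decay phase that is still ahead.
    remaining = 1 - (step - warmup_steps) / (total_steps - warmup_steps)
    return (lr_init - lr_end) * remaining ** power + lr_end


total_steps = 868                            # from the training log
warmup_steps = math.ceil(total_steps * 0.1)  # warmup_ratio=0.1

for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{polynomial_lr(step, total_steps, warmup_steps):.2e}")
```

Under these assumptions the rate peaks at `learning_rate=3e-5` once warmup ends and then falls linearly, because a power of 1.0 reduces polynomial decay to a straight line.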


In [6]:
# Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=ds_tokenized['train'],
    eval_dataset=ds_tokenized['validation'],
    tokenizer=tokenizer,
    data_collator=data_collator
)


## Training

In [7]:
# Do not modify this cell
# This cell must be executed in the submission

start = time()

trainer.train()

end = time()
print(f">>>>>>>>>>>>> elapsed time: {(end-start)/60:.0f}m")

The following columns in the training set don't have a corresponding argument in `RobertaForQuestionAnswering.forward` and have been ignored: context, id, answers, question, title. If context, id, answers, question, title are not expected by `RobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 3,466
  Num Epochs = 4
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 2
  Total optimization steps = 868
  Number of trainable parameters = 124,056,578


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | No log | 0.959009 |
| 2 | 0.502800 | 1.022871 |
| 3 | 0.502800 | 1.243897 |
| 4 | 0.252800 | 1.398905 |
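
The step counts in the log and the `No log` entry for the first epoch follow from the training arguments. The sketch below reproduces that arithmetic under two assumptions: a single device, and a final partial accumulation group that does not add an extra optimizer step. The helper `last_logged_step` is hypothetical.

```python
import math

num_examples = 3466
per_device_train_batch_size = 8
gradient_accumulation_steps = 2
num_train_epochs = 4
logging_steps = 350

# Batches served by the dataloader in one epoch (the last one is partial).
batches_per_epoch = math.ceil(num_examples / per_device_train_batch_size)
# One optimizer step per `gradient_accumulation_steps` batches.
steps_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
total_steps = steps_per_epoch * num_train_epochs

print("optimizer steps per epoch:", steps_per_epoch)
print("total optimization steps:", total_steps)


def last_logged_step(epoch):
    """Most recent logging step at or before the end of `epoch` (0 if none)."""
    return (epoch * steps_per_epoch // logging_steps) * logging_steps


for epoch in range(1, num_train_epochs + 1):
    print("epoch", epoch, "-> last logged step:", last_logged_step(epoch) or "No log")
```

With 217 optimizer steps per epoch and `logging_steps=350`, no training loss has been logged when the first evaluation runs. Epochs 2 and 3 both report the value logged at step 350, and epoch 4 reports the value logged at step 700.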


The following columns in the evaluation set don't have a corresponding argument in `RobertaForQuestionAnswering.forward` and have been ignored: context, id, answers, question, title. If context, id, answers, question, title are not expected by `RobertaForQuestionAnswering.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 345
  Batch size = 16
Saving model checkpoint to ./results2\checkpoint-217
Configuration saved in ./results2\checkpoint-217\config.json
Model weights saved in ./results2\checkpoint-217\model.safetensors
tokenizer config file saved in ./results2\checkpoint-217\tokenizer_config.json
Special tokens file saved in ./results2\checkpoint-217\special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `RobertaForQuestionAnswering.forward` and have been ignored: context, id, answers, question, title. If context, id, answers, question, title are not expected by `RobertaForQuestionAnsweri

>>>>>>>>>>>>> elapsed time: 11m


## Evaluation

In [8]:
# Do not modify this cell
# This cell must be executed in the submission

print(f"**** EVALUATION ****")
print(f"********\nTokenizer config:\n{tokenizer}")
print(f"\n\n********\nModel config:\n{model.config}")
print(f"\n\n********\nTrainer arguments:\n{trainer.args}")

**** EVALUATION ****
********
Tokenizer config:
RobertaTokenizerFast(name_or_path='deepset/roberta-base-squad2', vocab_size=50265, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': '<mask>'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	1: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	3: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	50264: AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=True, special=True),
}


********
Model config:
RobertaConfig 

In [9]:
# Do not modify this cell
# This cell must be executed in the submission

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

question_answerer = pipeline("question-answering", model=model, tokenizer=tokenizer, device=device)

In [10]:
# Do not modify this cell
# This cell must be executed in the submission

assert len(ds_tarea['train']) == 3466
assert len(ds_tarea['validation']) == 345

def calculate_sentence_similarity(sentence1, sentence2):
    sentence1 = re.sub(r'[^a-zA-Z0-9\s]', '', sentence1).lower()
    sentence2 = re.sub(r'[^a-zA-Z0-9\s]', '', sentence2).lower()
    words1 = set(sentence1.lower().split())
    words2 = set(sentence2.lower().split())
    matches = len(words1.intersection(words2))
    total_words = len(words1.union(words2))
    if total_words == 0:
        return 0.0
    return (matches / total_words) * 100

samples = [324,342,249,176,70,168,120,58,90,192,278,289,197,146,323,248,260,273,112,211]
evaluation_list = []

for ii in samples:
    context = ds_tarea['validation'][ii]['context']
    question = ds_tarea['validation'][ii]['question']
    answer = ds_tarea['validation'][ii]['answers']
    answers = [f"{tt}" for ii, tt in enumerate(answer['text'])]
    prediction = question_answerer(context=context, question=question)['answer']
    match = max([calculate_sentence_similarity(w, prediction) for w in answers])
    evaluation_list.append((ii,context,question,answers,prediction,match))

print(f"*** evaluation_df ***")
evaluation_df = pd.DataFrame(evaluation_list, columns=['sample', 'context', 'question', 'real_answers', 'predicted_answer', 'match'])
evaluation_df[['sample','real_answers','predicted_answer', 'match']]

Disabling tokenizer parallelism, we're using DataLoader multithreading already
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


*** evaluation_df ***


| | sample | real_answers | predicted_answer | match |
|---|---|---|---|---|
| 0 | 324 | [Hospitality Business/Financial Centre, Downtown Riverside, Hospitality Business/Financial Centre] | Hospitality Business/Financial Centre | 100.0 |
| 1 | 342 | [Rugby, Rugby, Rugby] | Rugby | 100.0 |
| 2 | 249 | [extremely high, high, extremely high] | high | 100.0 |
| 3 | 176 | ["A Machine to End War", "A Machine to End War", A Machine to End War] | A Machine to End War | 100.0 |
| 4 | 70 | [Death Wish Coffee, Death Wish Coffee, Death Wish Coffee] | Death Wish Coffee | 100.0 |
| 5 | 168 | [antagonistic, antagonistic, antagonistic] | antagonistic | 100.0 |
| 6 | 120 | [1892 to 1894, from 1892 to 1894, from 1892 to 1894] | 1892 to 1894 | 100.0 |
| 7 | 58 | [Vince Lombardi Trophy, the Vince Lombardi Trophy, Vince Lombardi Trophy] | Vince Lombardi Trophy | 100.0 |
| 8 | 90 | [5 Live Sports Extra, 5 Live Sports Extra, 5 Live Sports Extra] | 5 Live Sports Extra | 100.0 |
| 9 | 192 | [time, time complexity, time complexity] | time complexity | 100.0 |


### Evaluation criteria

The **final grade for Task 2** will depend on the predictions of your model.

The evaluation criteria are as follows:

- Task 2 is passed if the notebook is submitted without errors and with a trained model (regardless of its predictions).
- The grade is weighted by the _match_ column, which awards 100% when all words match and decreases gradually with the number of words that do not match.

Note: The grade computed below is indicative and may be reduced depending on the code in the submission.
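
To illustrate how the _match_ column behaves, the sketch below restates the similarity used in the evaluation cell as intersection over union of lowercase word sets. The name `word_overlap` is hypothetical and the function is a condensed rewrite for illustration, not the cell's original code.

```python
import re


def word_overlap(sentence1, sentence2):
    """Shared words divided by all distinct words, as a percentage."""
    words1 = set(re.sub(r'[^a-zA-Z0-9\s]', '', sentence1).lower().split())
    words2 = set(re.sub(r'[^a-zA-Z0-9\s]', '', sentence2).lower().split())
    union = words1 | words2
    if not union:
        return 0.0
    return len(words1 & words2) / len(union) * 100


# One of two distinct words is shared.
print(word_overlap("high", "extremely high"))                              # 50.0
# Three of four distinct words are shared.
print(word_overlap("the Vince Lombardi Trophy", "Vince Lombardi Trophy"))  # 75.0
# The evaluation keeps the best score over all reference answers.
references = ["extremely high", "high", "extremely high"]
print(max(word_overlap(ref, "high") for ref in references))                # 100.0
```

Because the evaluation takes the maximum over the reference answers, a prediction scores 100 as soon as it matches any one of them word for word, ignoring case and punctuation.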

In [11]:
print(f"Your grade for Task 2 is: {max(np.ceil(evaluation_df['match'].sum() / len(evaluation_df) / 10), 5.0)}")

Your grade for Task 2 is: 10.0
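
The grade formula can be read as: average the _match_ scores, divide by 10, round up, and never go below 5. The sketch below restates it with the standard library only. The function name `indicative_grade` and the sample score lists are hypothetical.

```python
import math


def indicative_grade(matches):
    """Mean match divided by 10, rounded up, with a floor of 5.0."""
    mean_match = sum(matches) / len(matches)
    return max(float(math.ceil(mean_match / 10)), 5.0)


print(indicative_grade([100.0] * 20))  # every prediction matches exactly
print(indicative_grade([62.5] * 20))   # partial matches are rounded up
print(indicative_grade([0.0, 30.0]))   # low scores are clamped to the floor
```

The floor of 5.0 reflects the first criterion above: a notebook that runs and trains a model passes regardless of its predictions.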
