# <font color='darkorange'> **Fine-tuning BERTa for extractive Question Answering in Catalan** </font>

In this notebook we will fine-tune BERTa for extractive Question Answering in Catalan using Hugging Face transformers.

In [1]:
# Install Hugging Face transformers and datasets.
!pip install transformers datasets huggingface_hub

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.27.3-py3-none-any.whl (6.8 MB)
Collecting datasets
  Downloading datasets-2.10.1-py3-none-any.whl (469 kB)
Collecting huggingface_hub
  Downloading huggingface_hub-0.13.3-py3-none-any.whl (199 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)

In [2]:
import transformers
# Before continuing, make sure the installed version of transformers is at least 4.16.
print(transformers.__version__)

4.27.3


In [3]:
model_checkpoint = "projecte-aina/roberta-base-ca-v2"

## **1. Loading ViquiQuAD**

In [4]:
# Use datasets library to load the dataset and get the metric we need for evaluation.
from datasets import load_dataset, load_metric

In [5]:
# Connect to Google Drive to load the dataset.
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
# Load ViquiQuAD from Google Drive
#   (originally downloaded from https://huggingface.co/datasets/projecte-aina/viquiquad/tree/main).
path = "/content/drive/My Drive/viquiquad"

script = path+"/viquiquad.py"
dataset = {"train": path+"/train.json",
           "dev": path+"/dev.json",
           "test": path+"/test.json"}

raw_dataset = load_dataset(script, data_files=dataset)

Downloading and preparing dataset viquiquad/default to /root/.cache/huggingface/datasets/viquiquad/default-5315f57053f38ae0/1.0.1/bd00175ebc7354f52ec9b50720441e2b918cc8a93efecb056e0231e07e378717...



Dataset viquiquad downloaded and prepared to /root/.cache/huggingface/datasets/viquiquad/default-5315f57053f38ae0/1.0.1/bd00175ebc7354f52ec9b50720441e2b918cc8a93efecb056e0231e07e378717. Subsequent calls will reuse this data.



In [7]:
raw_dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 11259
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 1493
    })
    test: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 1428
    })
})
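
The preprocessing in the next section relies on `answers[0]["answer_start"]` being a character offset into `context`. A minimal sketch of a sanity check for that assumption is given here; the record is invented for illustration (it is not taken from ViquiQuAD) and `answer_is_aligned` is a hypothetical helper, not part of the datasets library.

```python
# Hypothetical record that mimics the fields shown above; the text is invented
# for illustration and is not taken from ViquiQuAD.
example = {
    "id": "demo-0",
    "title": "Barcelona",
    "context": "Barcelona és la capital de Catalunya.",
    "question": "Quina és la capital de Catalunya?",
    "answers": [{"text": "Barcelona", "answer_start": 0}],
}


def answer_is_aligned(record):
    """Return True if every answer text occurs at its stated character offset."""
    context = record["context"]
    for answer in record["answers"]:
        start = answer["answer_start"]
        end = start + len(answer["text"])
        if context[start:end] != answer["text"]:
            return False
    return True


print(answer_is_aligned(example))
```

Assuming the real records share this layout, the same helper could be applied to each row, for example `all(answer_is_aligned(r) for r in raw_dataset["train"])`.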

## **2. Preprocessing the dataset**



In [8]:
# Instantiate tokenizer with the AutoTokenizer.from_pretrained method 
#   to get the tokenizer corresponding to the model architecture and 
#   to download the vocabulary used when pretraining it.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
# The tokenizer will tokenize the inputs and convert the tokens to their corresponding IDs in the pretrained vocabulary.


In [9]:
max_length = 384  # Maximum length of a feature (question and context).
# Examples longer than max_length are split into several input features.
# The context of each feature overlaps part of the previous feature's context,
#   in case the answer lies at the point where the context is split.
doc_stride = 128  # Allowed overlap between the features when splitting is performed.
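
To make the effect of `max_length` and `doc_stride` concrete, here is a simplified sketch of overlapping windows over a plain list of tokens. `split_with_overlap` is a hypothetical helper written for this illustration; the real tokenizer additionally reserves room in each feature for the question and the special tokens, so its window boundaries will differ.

```python
def split_with_overlap(tokens, window, overlap):
    """Split `tokens` into chunks of at most `window` items, where each chunk
    repeats the last `overlap` items of the previous one."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be >= 0 and smaller than window")
    step = window - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
        start += step
    return chunks


# Toy sizes instead of 384 and 128, so the overlap is easy to see.
context_tokens = list(range(10))
for chunk in split_with_overlap(context_tokens, window=5, overlap=2):
    print(chunk)
```

With these toy values, consecutive chunks share two tokens, which is the role `doc_stride` plays above.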

In [10]:
def prepare_train_features(examples):
    # Tokenize examples with truncation and padding, but keep the overflows using a stride.
    
    tokenized_examples = tokenizer(examples["question"],
                                   examples["context"],
                                   truncation="only_second", # Only truncate context (not question).
                                   max_length=max_length,
                                   stride=doc_stride, 
                                   return_overflowing_tokens=True,
                                   return_offsets_mapping=True, # Map to find start and end positions of the answers in the tokens.
                                   padding="max_length")
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping") # Map from a feature to its corresponding example.
    offset_mapping = tokenized_examples.pop("offset_mapping") # Map from token to character position in the original context.
    
    # Label examples:
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        input_ids = tokenized_examples["input_ids"][i]
        # Get the sequence IDs of this feature (to tell the question apart from the context):
        #   None for special tokens,
        #   0 for tokens from the first sequence (question) and 1 for tokens from the second (context).
        sequence_ids = tokenized_examples.sequence_ids(i)

        # Index of the example containing this span of text (as one example can give several spans).
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        start_char = answers[0]["answer_start"]
        end_char = answers[0]["answer_start"] + len(answers[0]["text"])

        # Start token index of the current span in the text.
        token_start_index = 0
        while sequence_ids[token_start_index] != 1:
            token_start_index += 1

        # End token index of the current span in the text.
        token_end_index = len(input_ids) - 1
        while sequence_ids[token_end_index] != 1:
            token_end_index -= 1

        # Detect if the answer is out of the span. If so, the label is (0,0).
        if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
            tokenized_examples["start_positions"].append(0)
            tokenized_examples["end_positions"].append(0)
        # Otherwise move the token_start_index and token_end_index to the two ends of the answer.
        else:
            while (token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char):
                token_start_index += 1
            tokenized_examples["start_positions"].append(token_start_index - 1)
            while offsets[token_end_index][1] >= end_char:
                token_end_index -= 1
            tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples
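
The labelling logic above can be tried in isolation on hand-written offsets. The sketch below is a standalone restatement of that logic for a single feature; `char_span_to_token_span` and the toy offsets are assumptions made for illustration and do not come from a real tokenizer run.

```python
def char_span_to_token_span(offsets, sequence_ids, start_char, end_char):
    """Return (start_token, end_token) for an answer, or (0, 0) if the answer
    is not fully inside the context tokens of this feature."""
    # First and last token that belong to the context (sequence id 1).
    first = sequence_ids.index(1)
    last = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)

    # Answer not fully covered by this feature: label it (0, 0).
    if offsets[first][0] > start_char or offsets[last][1] < end_char:
        return 0, 0

    # Walk right until a token starts after the answer start, then step back.
    start_token = first
    while start_token <= last and offsets[start_token][0] <= start_char:
        start_token += 1
    # Walk left until a token ends before the answer end, then step forward.
    end_token = last
    while end_token >= first and offsets[end_token][1] >= end_char:
        end_token -= 1
    return start_token - 1, end_token + 1


# Toy feature: special token, 2 question tokens, 2 special tokens,
# 4 context tokens for "Barcelona és la capital", 1 special token.
sequence_ids = [None, 0, 0, None, None, 1, 1, 1, 1, None]
offsets = [(0, 0), (0, 5), (6, 10), (0, 0), (0, 0),
           (0, 9), (10, 12), (13, 15), (16, 23), (0, 0)]

# The answer "la capital" covers characters 13 to 23 of the context.
print(char_span_to_token_span(offsets, sequence_ids, 13, 23))
```

In this toy layout the answer maps to the two context tokens covering "la" and "capital"; an answer outside the window falls back to `(0, 0)`.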

In [11]:
# Apply function to dataset.
tokenized_dataset = raw_dataset.map(prepare_train_features, 
                                    batched=True, 
                                    remove_columns=raw_dataset["train"].column_names)


## **3. Fine-tuning the model**

In [13]:
# Download pretrained model.
# Since our task is QA, we use the TFAutoModelForQuestionAnswering class. 
from transformers import TFAutoModelForQuestionAnswering
# As with the tokenizer, the from_pretrained method downloads and caches the model.
# from_pt=True loads the PyTorch weights of the checkpoint into the TensorFlow model.
model = TFAutoModelForQuestionAnswering.from_pretrained(model_checkpoint, from_pt=True)


Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFRobertaForQuestionAnswering: ['roberta.embeddings.position_ids']
- This IS expected if you are initializing TFRobertaForQuestionAnswering from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaForQuestionAnswering from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFRobertaForQuestionAnswering were not initialized from the PyTorch model and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [14]:
batch_size = 17      # Set to 32 in Armengol-Estapé et al.
learning_rate = 5e-5 
num_train_epochs = 5 # Set to 10 in Armengol-Estapé et al.

In [15]:
# Convert train and validation sets to tf.data.Dataset.
train_set = model.prepare_tf_dataset(tokenized_dataset["train"],
                                     shuffle=True,
                                     batch_size=batch_size)
dev_set = model.prepare_tf_dataset(tokenized_dataset["validation"],
                                   shuffle=False,
                                   batch_size=batch_size)

In [16]:
# Create optimizer and specify loss function.
from transformers import create_optimizer # AdamW optimizer.
total_train_steps = len(train_set)*num_train_epochs # Compute total number of training steps.
optimizer, schedule = create_optimizer(init_lr=learning_rate, 
                                       num_warmup_steps=0, 
                                       num_train_steps=total_train_steps)
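
As a rough illustration of what this schedule does, the sketch below assumes that `create_optimizer` with no warmup and its default settings decays the learning rate linearly from `init_lr` to zero over `num_train_steps`, and that the incomplete last batch of each epoch is dropped. Both points are assumptions to verify against the installed transformers version. `linear_decay_lr` and `num_train_features` are hypothetical names and values; the real step count comes from `len(train_set)`.

```python
def linear_decay_lr(step, init_lr, total_steps):
    """Learning rate after `step` updates under linear decay to zero, no warmup."""
    progress = min(step, total_steps) / total_steps
    return init_lr * (1.0 - progress)


# Hypothetical sizes for illustration only.
num_train_features = 1700
batch_size = 17
num_train_epochs = 5
steps_per_epoch = num_train_features // batch_size  # Incomplete last batch dropped.
total_train_steps = steps_per_epoch * num_train_epochs

for step in (0, total_train_steps // 2, total_train_steps):
    print(step, linear_decay_lr(step, 5e-5, total_train_steps))
```

Under these assumptions the learning rate is halved at the midpoint of training and reaches zero at the last step.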

In [17]:
# Compile the model.
import tensorflow as tf
model.compile(optimizer=optimizer, 
              jit_compile=True, 
              metrics=["accuracy"])

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


In [18]:
# Fine-tune the model.
model.fit(train_set,
          validation_data=dev_set,
          epochs=num_train_epochs)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f47a98e72b0>

## **4. Evaluation**

#### **4.1. Preprocessing the test set**

Candidate answers are scored by adding the start logit and the end logit of a span. For each feature, the `n_best_size` highest-scoring start and end indices are selected, every start/end pair is checked for validity one by one, and the valid candidates are sorted by score to keep the best one.

To check if a given span is inside the context (and not the question) and to get back the text inside, in the test features we keep (1) the ID of the example that generated the feature (as one example can generate several features), and (2) the offset mapping (to map from token indices to character positions in the context). For this reason, the test set is preprocessed with the function `prepare_test_features`.
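
The scoring idea described above can be sketched on toy logits before looking at the full post-processing function. `best_span` is a hypothetical, simplified helper: it ignores the offset mapping and the null answer, and the logit values are invented.

```python
def best_span(start_logits, end_logits, n_best_size=2, max_answer_length=3):
    """Return (score, start, end) of the best-scoring valid span, or None."""
    def top_indices(logits):
        # Indices of the n_best_size largest logits, best first.
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        return order[:n_best_size]

    candidates = []
    for start in top_indices(start_logits):
        for end in top_indices(end_logits):
            # Skip spans that end before they start or that are too long.
            if end < start or end - start + 1 > max_answer_length:
                continue
            candidates.append((start_logits[start] + end_logits[end], start, end))
    return max(candidates) if candidates else None


start_logits = [0.1, 2.0, 0.3, 1.5]
end_logits = [0.2, 0.1, 3.0, 1.0]
print(best_span(start_logits, end_logits))
```

Restricting the search to the top `n_best_size` indices keeps the number of start/end pairs small compared with scoring every possible span.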


In [19]:
def prepare_test_features(examples):
    # Tokenize examples with truncation and maybe padding, but keep the overflows using a stride.

    tokenized_examples = tokenizer(examples["question"],
                                   examples["context"],
                                   truncation="only_second", # Only truncate context (not question).
                                   max_length=max_length,
                                   stride=doc_stride,
                                   return_overflowing_tokens=True, 
                                   return_offsets_mapping=True, # Map to find start and end positions of the answers in the tokens.
                                   padding="max_length")
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping") # Map from a feature to its corresponding example.
    
    tokenized_examples["example_id"] = [] # Keep the example_id of the feature and store the offset mappings.

    for i in range(len(tokenized_examples["input_ids"])):
        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1

        # Index of the example containing this span of text (as one example can give several spans).
        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["id"][sample_index])

        # Set to None the offsets of tokens that are not part of the context,
        #   so that it is easy to tell whether a token position belongs to the context.
        tokenized_examples["offset_mapping"][i] = [(o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])]

    return tokenized_examples

In [20]:
# Apply function to test set. 
test_features = raw_dataset["test"].map(prepare_test_features,
                                        batched=True,
                                        remove_columns=raw_dataset["test"].column_names)


In [21]:
# Convert test set into a tf.data.Dataset.
test_set = model.prepare_tf_dataset(test_features,
                                    shuffle=False,
                                    batch_size=batch_size)

#### **4.2. Making predictions and processing them**

In [22]:
# Predictions for all features.
raw_predictions = model.predict(test_set)



In [23]:
import numpy as np
import collections
from tqdm.auto import tqdm

def postprocess_qa_predictions(examples,
                               features,
                               all_start_logits,
                               all_end_logits,
                               n_best_size=20, # Best indices in start and end logits.
                               max_answer_length=30, # Eliminate longer answers.
                               ):
    
    # Build a map from each example to its corresponding features.
    example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
    features_per_example = collections.defaultdict(list)
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)

    predictions = collections.OrderedDict()

    print(f"Post-processing {len(examples)} example predictions split into {len(features)} features.")

    # Loop over all examples.
    for example_index, example in enumerate(tqdm(examples)):
        feature_indices = features_per_example[example_index] # Indices of the features associated to the current example.
        min_null_score = None
        valid_answers = []
        context = example["context"]
        
        # Looping through all the features associated to the current example.
        for feature_index in feature_indices:
            # Grab the predictions of the model for this feature.
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            # Offsets used to map token positions in the logits back to spans of text in the original context.
            offset_mapping = features[feature_index]["offset_mapping"]

            # Update the null (CLS) score; with this comparison the highest value across the example's features is kept.
            cls_index = features[feature_index]["input_ids"].index(tokenizer.cls_token_id)
            feature_null_score = start_logits[cls_index] + end_logits[cls_index]
            if min_null_score is None or min_null_score < feature_null_score:
                min_null_score = feature_null_score

            # Go through all combinations of the `n_best_size` highest start and end logits.
            start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
            end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Out-of-scope answers are not considered, 
                    #   either because the indices are out of bounds 
                    #   or because they correspond to part of the input_ids that are not in the context.
                    if (start_index >= len(offset_mapping)
                        or end_index >= len(offset_mapping)
                        or not offset_mapping[start_index]
                        or not offset_mapping[end_index]):
                        continue
                    # Answers with a length that is either < 0 or > max_answer_length are not considered.
                    if (end_index < start_index
                        or end_index - start_index + 1 > max_answer_length):
                        continue
                    start_char = offset_mapping[start_index][0]
                    end_char = offset_mapping[end_index][1]
                    valid_answers.append(
                        {"score": start_logits[start_index] + end_logits[end_index],
                         "text": context[start_char:end_char]})

        if len(valid_answers) > 0:
            best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
        else:
            # Fake prediction to avoid failure if there is not a single non-null prediction.
            best_answer = {"text": "", "score": 0.0}

        # Select final answer: the best one or the null answer.
        answer = (best_answer["text"] if best_answer["score"] > min_null_score else "")
        predictions[example["id"]] = answer

    return predictions

In [24]:
# Apply post-processing function to raw predictions. 
final_predictions = postprocess_qa_predictions(raw_dataset["test"],
                                               test_features,
                                               raw_predictions["start_logits"],
                                               raw_predictions["end_logits"])

Post-processing 1428 example predictions split into 1445 features.



#### **4.3. Computing the metrics**

In [25]:
# Load metric from the datasets library.
metric = load_metric("squad_v2")


In [26]:
# Format predictions and labels as a list of dictionaries.
formatted_predictions = [{"id": k, "prediction_text": v, "no_answer_probability": 0.0}
                         for k, v in final_predictions.items()]
references = [{"id": ex["id"], "answers": ex["answers"]} 
              for ex in raw_dataset["test"]]

In [27]:
# Compute the metric.
metric.compute(predictions=formatted_predictions, 
               references=references)

{'exact': 74.85994397759103,
 'f1': 88.67034332364784,
 'total': 1428,
 'HasAns_exact': 74.85994397759103,
 'HasAns_f1': 88.67034332364784,
 'HasAns_total': 1428,
 'best_exact': 74.85994397759103,
 'best_exact_thresh': 0.0,
 'best_f1': 88.67034332364784,
 'best_f1_thresh': 0.0}
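
To give an intuition for the `exact` and `f1` values, here is a simplified sketch of SQuAD-style exact match and token-level F1 for a single prediction and a single reference. It is not the `squad_v2` implementation: as far as I know, the official script also strips English articles, takes the best score over all reference answers, and reports percentages. The helper names and the example strings are hypothetical.

```python
import collections
import string


def normalize(text):
    """Lowercase, drop ASCII punctuation and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = collections.Counter(pred_tokens) & collections.Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


prediction = "la capital de Catalunya"
reference = "capital de Catalunya"
print(exact_match(prediction, reference), token_f1(prediction, reference))
```

A prediction that contains the reference plus one extra token gets no exact-match credit but still a high F1, which is consistent with `f1` being well above `exact` in the results above.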