
# Advanced Topics in Machine Learning - Natural Language Processing -- Group Assignment
> Students
> - Patrik Rác
> - Méline Trochon
> - Valentina Moretti
> - Mattia Colbertaldo

---



In [1]:
# Imports
import os
import torch
import pandas

## 1. Investigate Dataset


We will investigate the dataset ***SQuAD2.0 (Stanford Question Answering Dataset)***, available [here](https://rajpurkar.github.io/SQuAD-explorer/).

In [2]:
# Set the directory to work in
WORKING_DIR = "./squad"
DATA_DIR = "./data"

In [3]:
# Download the dataset
print("Downloading the DEV dataset of SQuAD2.0")
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json

Downloading the DEV dataset of SQuAD2.0
--2023-12-29 14:32:32--  https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.108.153, 185.199.109.153, 185.199.110.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4370528 (4.2M) [application/json]
Saving to: ‘dev-v2.0.json’


2023-12-29 14:32:32 (42.8 MB/s) - ‘dev-v2.0.json’ saved [4370528/4370528]



In [4]:
dev_dataframe = pandas.read_json("/content/dev-v2.0.json")
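# len() counts the rows (one per article); .size would count rows x columns instead.
print(f"Number of articles in the dataset: {len(dev_dataframe)} (i.e. categories of questions)")
dev_dataframe.head()

Number of articles in the dataset: 35 (i.e. categories of questions)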


| | version | data |
|---|---|---|
| 0 | v2.0 | {'title': 'Normans', 'paragraphs': [{'qas': [{... |
| 1 | v2.0 | {'title': 'Computational_complexity_theory', '... |
| 2 | v2.0 | {'title': 'Southern_California', 'paragraphs':... |
| 3 | v2.0 | {'title': 'Sky_(United_Kingdom)', 'paragraphs'... |
| 4 | v2.0 | {'title': 'Victoria_(Australia)', 'paragraphs'... |


In [6]:
def count_paragraphs(row):
    return len(row["data"]["paragraphs"])

print(f"Number of individual paragraphs: {sum(dev_dataframe.apply(count_paragraphs, axis=1))}")


def count_questions(row):
    return sum(len(p["qas"]) for p in row["data"]["paragraphs"])

print(f"Number of individual questions: {sum(dev_dataframe.apply(count_questions, axis=1))}")

def count_impossible_questions(row):
    n_impossible = 0
    for p in row["data"]["paragraphs"]:
        for q in p["qas"]:
            if q["is_impossible"]:
                n_impossible += 1
    return n_impossible


print(f"Number of impossible questions: {sum(dev_dataframe.apply(count_impossible_questions, axis=1))}")

Number of individual paragraphs: 1204
Number of individual questions: 11873
Number of impossible questions: 5945


In [5]:
print(type(dev_dataframe.iloc[0]["data"]))

import json
print(json.dumps(dev_dataframe.iloc[1]["data"], sort_keys=False, indent=4))

<class 'dict'>
{
    "title": "Computational_complexity_theory",
    "paragraphs": [
        {
            "qas": [
                {
                    "question": "What branch of theoretical computer science deals with broadly classifying computational problems by difficulty and class of relationship?",
                    "id": "56e16182e3433e1400422e28",
                    "answers": [
                        {
                            "text": "Computational complexity theory",
                            "answer_start": 0
                        },
                        {
                            "text": "Computational complexity theory",
                            "answer_start": 0
                        },
                        {
                            "text": "Computational complexity theory",
                            "answer_start": 0
                        }
                    ],
                    "is_impossible": false
                },
           

In [None]:
# Example of a Paragraph and a question
import textwrap
print("Category of given paragraph: {}\n".format(dev_dataframe.iloc[0]["data"]["title"]))

example_datum = dev_dataframe.iloc[0]["data"]["paragraphs"][0]
print("Context\n {}\n".format(textwrap.fill(example_datum["context"], 50)))

print("Example Question:\n{}\n".format(example_datum["qas"][0]["question"]))

print("Example Answer:\n{}\n".format(example_datum["qas"][0]["answers"][0]["text"]))

Category of given paragraph: Normans

Context
 The Normans (Norman: Nourmands; French: Normands;
Latin: Normanni) were the people who in the 10th
and 11th centuries gave their name to Normandy, a
region in France. They were descended from Norse
("Norman" comes from "Norseman") raiders and
pirates from Denmark, Iceland and Norway who,
under their leader Rollo, agreed to swear fealty
to King Charles III of West Francia. Through
generations of assimilation and mixing with the
native Frankish and Roman-Gaulish populations,
their descendants would gradually merge with the
Carolingian-based cultures of West Francia. The
distinct cultural and ethnic identity of the
Normans emerged initially in the first half of the
10th century, and it continued to evolve over the
succeeding centuries.

Example Question:
In what country is Normandy located?

Example Answer:
France
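
The nested article, paragraph, question layout explored above can be flattened into one record per question, which is usually easier to plot, index, or feed to a tokenizer. The following is a minimal sketch under stated assumptions: `flatten_squad` and `toy_articles` are hypothetical names introduced here, and the column-wise `answers` layout is only meant to resemble the Hugging Face `squad_v2` format.

```python
from typing import Any, Dict, List


def flatten_squad(articles: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Flatten SQuAD-style articles into one record per question."""
    records = []
    for article in articles:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                records.append({
                    "id": qa["id"],
                    "title": article["title"],
                    "context": paragraph["context"],
                    "question": qa["question"],
                    "is_impossible": qa.get("is_impossible", False),
                    # Column-wise answers: parallel lists of texts and start offsets.
                    "answers": {
                        "text": [a["text"] for a in qa["answers"]],
                        "answer_start": [a["answer_start"] for a in qa["answers"]],
                    },
                })
    return records


# Tiny hand-written article (hypothetical data, not read from dev-v2.0.json).
toy_articles = [{
    "title": "Normans",
    "paragraphs": [{
        "context": "Normandy is a region in France.",
        "qas": [
            {"id": "q1", "question": "In what country is Normandy located?",
             "answers": [{"text": "France", "answer_start": 24}],
             "is_impossible": False},
            {"id": "q2", "question": "Who ruled Normandy in 1500?",
             "answers": [], "is_impossible": True},
        ],
    }],
}]

toy_records = flatten_squad(toy_articles)
print(len(toy_records), toy_records[0]["answers"])
```

Assuming `dev_dataframe` is loaded as above, `flatten_squad(dev_dataframe["data"].tolist())` should yield one record per question, and `pandas.DataFrame(...)` on the result gives a flat table for further inspection.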



## TODO Section
---

- Investigate the data further (Possibly some plots etc.)
- Train Word2Vec on the corpus of paragraphs.
- (Potentially index the paragraphs) -> Find appropriate Datastructure potentially

=> Prepare Data for Training

---
- Investigate the application of `DistilBert`
  - Fine Tuning
  - One-Shot learning
  - (Two-Shot learning)
- Come up with alternatives (Custom Transformer etc.)
---


## Finetuning DistilBERT

In [None]:
# Install the required Huggingface libs
! pip install datasets transformers accelerate -U

In [None]:
# Import the transformers package (We'll be taking a pretrained model from here)
import transformers
print(transformers.__version__)

In [None]:
model_checkpoint = "distilbert-base-uncased"
batch_size = 16

In [None]:
from datasets import load_dataset

squad2_datasets = load_dataset("squad_v2") # TODO: Replace with loading custom dataset

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [None]:
import transformers
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

In [None]:
max_length = 384 # The maximum length of a feature (question and context)
doc_stride = 128 # The allowed overlap between two parts of the context when it needs to be split.
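
To make the role of these two values concrete, here is a small sketch of how a long token sequence can be cut into overlapping windows. It is a simplification and not the tokenizer's actual implementation: `window_bounds` is a hypothetical helper, and it ignores that in the real setup part of the `max_length` budget is taken by the question and the special tokens.

```python
from typing import List, Tuple


def window_bounds(n_tokens: int, window: int, overlap: int) -> List[Tuple[int, int]]:
    """Return (start, end) token ranges that cover n_tokens with a fixed overlap."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    step = window - overlap  # how far each new window advances
    bounds = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        bounds.append((start, end))
        if end == n_tokens:  # the last window reaches the end of the sequence
            break
        start += step
    return bounds


# Toy numbers chosen for readability (the notebook uses 384 and 128).
toy_bounds = window_bounds(n_tokens=10, window=4, overlap=2)
print(toy_bounds)  # [(0, 4), (2, 6), (4, 8), (6, 10)]
```

Each window shares `overlap` tokens with its predecessor, so an answer that straddles one window boundary is likely to appear whole in a neighbouring window.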

In [None]:
pad_on_right = tokenizer.padding_side == "right"

In [None]:
def prepare_train_features(examples):
    # Some of the questions have a lot of whitespace on the left, which is not useful and will make the
    # truncation of the context fail (the tokenized question will take up a lot of space). So we remove that
    # left whitespace.
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that slightly overlaps the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # The offset mappings will give us a map from token to character position in the original context. This will
    # help us compute the start_positions and end_positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Let's label those examples!
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        # We will label impossible answers with the index of the CLS token.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        # If no answers are given, set the cls_index as answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start/end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != (1 if pad_on_right else 0):
                token_start_index += 1

            # End token index of the current span in the text.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != (1 if pad_on_right else 0):
                token_end_index -= 1

            # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise move the token_start_index and token_end_index to the two ends of the answer.
                # Note: we could go after the last offset if the answer is the last word (edge case).
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples
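
The core of the labelling step above is the mapping from a character span in the context to token indices via the offset mapping. The sketch below isolates that idea on hand-written offsets. `char_span_to_token_span` is a hypothetical helper, not part of `transformers`, and returning `None` stands in for the case that the function above labels with the CLS index.

```python
from typing import List, Optional, Tuple


def char_span_to_token_span(
    offsets: List[Tuple[int, int]], start_char: int, end_char: int
) -> Optional[Tuple[int, int]]:
    """Map a character span [start_char, end_char) to inclusive token indices.

    Returns None when the span is not fully covered by the given offsets.
    """
    if not offsets or offsets[0][0] > start_char or offsets[-1][1] < end_char:
        return None
    # Last token that starts at or before the first answer character.
    start_token = max(i for i, (s, _) in enumerate(offsets) if s <= start_char)
    # First token that ends at or after the last answer character.
    end_token = min(i for i, (_, e) in enumerate(offsets) if e >= end_char)
    return start_token, end_token


# Hand-written offsets for the text "a region in France." (hypothetical tokenization).
toy_offsets = [(0, 1), (2, 8), (9, 11), (12, 18), (18, 19)]

print(char_span_to_token_span(toy_offsets, 12, 18))  # span of "France"
print(char_span_to_token_span(toy_offsets, 30, 36))  # answer outside this window
```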

In [None]:
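# Use the DatasetDict loaded above; the name `datasets` is never bound by `from datasets import load_dataset`.
tokenized_datasets = squad2_datasets.map(prepare_train_features, batched=True, remove_columns=squad2_datasets["train"].column_names)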

### Actual Fine Tuning of the Model

In [None]:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

In [None]:
model_name = model_checkpoint.split("/")[-1]
args = TrainingArguments(
    f"{model_name}-finetuned-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True,  # Requires being logged in to the Hugging Face Hub; set to False to train locally only.
)

In [None]:
from transformers import default_data_collator

data_collator = default_data_collator

In [None]:
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

In [None]:
trainer.train()

In [None]:
trainer.save_model("test-squad-trained")

In [None]:
# TODO: Evaluation. The evaluation of the model should be rather general and work with all model training strategies.
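
One model-independent option is to score predicted answer strings with exact match and token-level F1, the two metrics commonly reported for SQuAD. The sketch below is an assumption about how this could look, not the evaluation the project ends up using. The function names are chosen here for illustration, and the normalization (lowercasing, dropping punctuation and English articles) follows the usual SQuAD convention as I understand it.

```python
import collections
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, truth: str) -> int:
    """Return 1 if both strings are equal after normalization, else 0."""
    return int(normalize_answer(prediction) == normalize_answer(truth))


def f1_score(prediction: str, truth: str) -> float:
    """Token-level F1 between a predicted and a reference answer."""
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(truth).split()
    # Unanswerable questions: an empty side scores 1 only if both sides are empty.
    if not pred_tokens or not truth_tokens:
        return float(pred_tokens == truth_tokens)
    common = collections.Counter(pred_tokens) & collections.Counter(truth_tokens)
    n_same = sum(common.values())
    if n_same == 0:
        return 0.0
    precision = n_same / len(pred_tokens)
    recall = n_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Normans", "normans"))
print(round(f1_score("in France", "France"), 3))
```

Because both functions only take strings, they can be applied to the output of any of the planned strategies (fine-tuning, one-shot learning, or a custom model) as long as each produces an answer string, with the empty string standing for "no answer".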

# One shot learning