# **NLP - lab 9**

## **Task 1**

Get acquainted with the Simple legal questions dataset.

## **Task 2**

Select one open issue in the dataset, provide the answers for the questions in the package and open a pull request with the answers.

I chose the 37th package.

## **Task 3**

The subset of the answers that you have provided in point 2 is your test dataset. If in the dataset there are questions that are the same as the questions in your test set, make the questions and the answers part of your test dataset (i.e. remove them from the training set).
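
The rule above can be sketched in a few lines. This is a minimal illustration, not the code used in this lab: the helper names `normalize` and `move_duplicates_to_test` and the toy question-answer pairs are assumptions, and "the same question" is taken to mean equal after lowercasing and collapsing whitespace.

```python
def normalize(question: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(question.lower().split())


def move_duplicates_to_test(train, test):
    """Move every training pair whose question also occurs in the test set."""
    test_questions = {normalize(q) for q, _ in test}
    kept = [(q, a) for q, a in train if normalize(q) not in test_questions]
    moved = [(q, a) for q, a in train if normalize(q) in test_questions]
    return kept, test + moved


# Toy (question, answer) pairs, invented for illustration only.
train_pairs = [("Kto uchwala ustawy?", "Sejm"), ("Ile trwa kadencja Sejmu?", "4 lata")]
test_pairs = [("kto  uchwala ustawy?", "Sejm")]

train_pairs, test_pairs = move_duplicates_to_test(train_pairs, test_pairs)
print(len(train_pairs), len(test_pairs))
```

A stricter or looser notion of "the same question" (for example, ignoring punctuation) only requires changing `normalize`.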

## **Task 4**

The remaining questions and answers are your training set. Divide that set into training and validation subsets. The validation part should be selected as 20% of the original training set. Make sure that there are no questions in the validation set that are present in the training subset. If there are such questions, make them part of the validation set.
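
One way to satisfy the "no shared questions" requirement is to split by question rather than by row. The sketch below assumes a list of (question, answer) pairs; the function name `grouped_split`, the seed, and the toy data are hypothetical and are not part of the lab code.

```python
import random
from collections import defaultdict


def grouped_split(pairs, val_fraction=0.2, seed=0):
    """Split pairs so that each distinct question lands in exactly one subset."""
    groups = defaultdict(list)
    for question, answer in pairs:
        groups[question.strip().lower()].append((question, answer))

    keys = sorted(groups)
    random.Random(seed).shuffle(keys)  # fixed seed keeps the split reproducible

    target = int(val_fraction * len(pairs))
    train, val = [], []
    for key in keys:
        # Fill the validation subset first, one whole question group at a time.
        (val if len(val) < target else train).extend(groups[key])
    return train, val


# Toy data: 10 pairs over 8 distinct questions, invented for illustration only.
pairs = [(f"question {i % 8}", f"answer {i}") for i in range(10)]
train, val = grouped_split(pairs)
print(len(train), len(val))
```

Because whole groups are moved, the validation subset can end up slightly larger than 20% when a repeated question is drawn.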

## **Task 5**

If the training set is small (fewer than 1,000 question-answer pairs), use one of the available QA datasets, e.g. PoQuAD or SQuAD. Using the second dataset is sensible if you are training a multilingual model, such as mT5.

We use the PoQuAD dataset because the training set is small.

In [1]:
!git clone https://github.com/ipipan/poquad.git

Cloning into 'poquad'...
remote: Enumerating objects: 33, done.
remote: Counting objects: 100% (33/33), done.
remote: Compressing objects: 100% (26/26), done.
remote: Total 33 (delta 13), reused 25 (delta 7), pack-reused 0
Unpacking objects: 100% (33/33), done.


In [2]:
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.8.0-py3-none-any.whl (452 kB)
Collecting xxhash
  Downloading xxhash-3.1.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting huggingface-hub<1.0.0,>=0.2.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Collecting multiprocess
  Downloading multiprocess-0.70.14-py38-none-any.whl (132 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1
  Downloading urllib3-1.25.11-py2.py3-none-any.whl (127 kB)
Installing collected 

In [3]:
from datasets import Dataset, DatasetDict, load_metric
import json

def convert_poquad2squad_format(path: str) -> Dataset:
  result = {
        "id": [],
        "question": [],
        "answers": [],
        "context": [],
        "title": []
        }
  with open(path, encoding='utf-8') as f:
    data = json.loads(f.read())
  
  for elem in data['data']:
    i = 0
    for par in elem['paragraphs']:
      for qas in par['qas']:
        if not qas['is_impossible']:
          i += 1
          result['id'].append(f"{elem['id']}_{i}")
          result['question'].append(qas['question'])
          answers = {
              'text': [],
              'answer_start': []
          }
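          for answer in qas['answers']:
            # NOTE: the generative answer is stored as the answer text; it may differ from the
            # extractive span that starts at `answer_start`.
            answers['text'].append(answer['generative_answer'])
            answers['answer_start'].append(answer['answer_start'])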
          result['answers'].append(answers)
          result['context'].append(par['context'])
          result['title'].append(elem['title'])
  return Dataset.from_dict(result)
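
The preprocessing further down derives the end of each answer from `answer_start` plus the length of the stored answer text, so it may be worth checking that the stored text really occurs at that offset. The sketch below is an assumption-laden illustration: `span_matches` is a hypothetical helper and the record is a toy example, not taken from PoQuAD.

```python
def span_matches(context: str, text: str, answer_start: int) -> bool:
    """True if `text` occurs in `context` exactly at character offset `answer_start`."""
    return context[answer_start:answer_start + len(text)] == text


# Toy record, invented for illustration only.
context = "Sejm składa się z 460 posłów."
extractive = {"text": "460 posłów", "answer_start": 18}
rephrased = {"text": "z 460 posłów", "answer_start": 18}

print(span_matches(context, extractive["text"], extractive["answer_start"]))
print(span_matches(context, rephrased["text"], rephrased["answer_start"]))
```

Counting how many converted rows fail such a check would show how often the stored text and the offset disagree.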

In [4]:
dataset_test = convert_poquad2squad_format('./poquad/poquad_dev.json')
dataset_test

Dataset({
    features: ['id', 'question', 'answers', 'context', 'title'],
    num_rows: 3853
})

In [5]:
dataset_train = convert_poquad2squad_format('./poquad/poquad_train.json')
dataset_train

Dataset({
    features: ['id', 'question', 'answers', 'context', 'title'],
    num_rows: 30757
})

To limit memory use and training time, we keep only the first 3,000 samples of the training dataset and the first 600 samples of the test dataset. We then draw 20% of the retained training samples (600) to obtain a validation dataset.

In [39]:
N_TRAIN = 3_000
N_TEST = 600
N_VAL = int(0.2 * N_TRAIN)

In [40]:
dataset_train = Dataset.from_dict({'id':dataset_train['id'][:N_TRAIN],
                                  'question':dataset_train['question'][:N_TRAIN],
                                  'answers':dataset_train['answers'][:N_TRAIN],
                                  'context':dataset_train['context'][:N_TRAIN],
                                  'title':dataset_train['title'][:N_TRAIN]})
dataset_train

Dataset({
    features: ['id', 'question', 'answers', 'context', 'title'],
    num_rows: 3000
})

In [41]:
dataset_test = Dataset.from_dict({'id':dataset_test['id'][:N_TEST],
                                  'question':dataset_test['question'][:N_TEST],
                                  'answers':dataset_test['answers'][:N_TEST],
                                  'context':dataset_test['context'][:N_TEST],
                                  'title':dataset_test['title'][:N_TEST]})
dataset_test

Dataset({
    features: ['id', 'question', 'answers', 'context', 'title'],
    num_rows: 600
})

We also create a validation dataset.

In [42]:
import random
import pandas as pd

temp_df = pd.DataFrame(dataset_train)
temp_df = temp_df.sample(n=N_VAL)
indices = temp_df.index
dataset_val = Dataset.from_pandas(temp_df)

dataset_val = Dataset.from_dict({'id':dataset_val['id'],
                                  'question':dataset_val['question'],
                                  'answers':dataset_val['answers'],
                                  'context':dataset_val['context'],
                                  'title':dataset_val['title']})
dataset_val

Dataset({
    features: ['id', 'question', 'answers', 'context', 'title'],
    num_rows: 600
})

We should remove these records from the training dataset.

In [43]:
temp_df = pd.DataFrame(dataset_train)
bad_df = temp_df.index.isin(indices)
temp_df = temp_df[~bad_df]
dataset_train = Dataset.from_pandas(temp_df)

dataset_train = Dataset.from_dict({'id':dataset_train['id'],
                                  'question':dataset_train['question'],
                                  'answers':dataset_train['answers'],
                                  'context':dataset_train['context'],
                                  'title':dataset_train['title']})
dataset_train

Dataset({
    features: ['id', 'question', 'answers', 'context', 'title'],
    num_rows: 2400
})

We should also check that the three derived sets do not overlap.

In [44]:
# Training set and validation set
print(set(dataset_train['id']).intersection(dataset_val['id'])) # empty

# Test set and validation set
print(set(dataset_val['id']).intersection(dataset_test['id'])) # empty

# Training set and test set
print(set(dataset_train['id']).intersection(dataset_test['id'])) # empty

set()
set()
set()
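
The check above compares ids only. A similar check on the question text could catch the same question appearing under different ids. The following is a minimal sketch under that assumption; `pairwise_overlaps` is a hypothetical helper and the question lists are toy data.

```python
from itertools import combinations


def pairwise_overlaps(splits):
    """Map each pair of split names to the set of values the two splits share."""
    return {
        (a, b): set(splits[a]) & set(splits[b])
        for a, b in combinations(splits, 2)
    }


# Toy question lists, invented for illustration only.
questions = {
    "train": ["Kto uchwala ustawy?", "Ile trwa kadencja Sejmu?"],
    "val": ["Kto powołuje premiera?"],
    "test": ["Ile trwa kadencja Sejmu?"],
}

overlaps = pairwise_overlaps(questions)
for names, shared in overlaps.items():
    print(names, shared)
```

With the real data, the lists would come from the `question` column of each split instead of the toy dictionary.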


We can now combine the three datasets into a single `DatasetDict`.

In [45]:
datasets = {}
datasets['train'] = dataset_train
datasets['val'] = dataset_val
datasets['test'] = dataset_test

datasets = DatasetDict(datasets)
datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'question', 'answers', 'context', 'title'],
        num_rows: 2400
    })
    val: Dataset({
        features: ['id', 'question', 'answers', 'context', 'title'],
        num_rows: 600
    })
    test: Dataset({
        features: ['id', 'question', 'answers', 'context', 'title'],
        num_rows: 600
    })
})

## **Task 6**

Train a neural model able to answer the legal questions. Fine-tune at least two pre-trained models. Make sure you are using a machine with a GPU, since training the model on a CPU will take a very long time. The training should include at least 10 epochs (depending on the size of the training set you are using). The pre-trained models you can use include:
* plT5-base
* plT5-large
* mT5-base
* mT5-large

We install all required libraries.

In [13]:
!pip install transformers
!pip install sentencepiece
!pip install sacremoses
!pip install huggingface_hub
!python -c "from huggingface_hub.hf_api import HfFolder; HfFolder.save_token('<YOUR_HF_TOKEN>')"

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Installing collected packages: tokenizers, transformers
Successfully installed tokenizers-0.13.2 transformers-4.25.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.97
Looking in indexes: https://pypi.org/simple, https://u

In [46]:
import collections
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer, AutoTokenizer
from transformers import default_data_collator
import numpy as np
from tqdm.auto import tqdm
from pyarrow import Table

We create a dictionary that holds the tokenizer, the model, and the model name for each chosen transformer.

In [47]:
models = {
    'herbert': {
        'tokenizer': None,
        'model': None,
        'model_name': None
    },
    'bert': {
        'tokenizer': None,
        'model': None,
        'model_name': None
    }
}

**Preprocessing the training data**

In [129]:
max_length = 384 # The maximum length of a feature (question and context together), in tokens.
doc_stride = 128 # The overlap, in tokens, between two consecutive parts of a context that has to be split.
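
To make the interaction of `max_length` and `doc_stride` concrete, here is a simplified model of how a long context could be cut into overlapping windows. It is only a sketch: the real tokenizer handles special tokens and the question itself, and both `sliding_windows` and the value of `question_and_special_tokens` are assumptions made for illustration.

```python
def sliding_windows(n_context_tokens, window, stride):
    """Return (start, end) token ranges of width `window` whose neighbours share `stride` tokens."""
    if stride >= window:
        raise ValueError("stride must be smaller than window")
    spans, start = [], 0
    while True:
        end = min(start + window, n_context_tokens)
        spans.append((start, end))
        if end == n_context_tokens:
            return spans
        start += window - stride  # step forward, keeping `stride` tokens of overlap


max_length, doc_stride = 384, 128
question_and_special_tokens = 24  # hypothetical: question length plus special tokens
window = max_length - question_and_special_tokens  # tokens left for the context

print(sliding_windows(800, window, doc_stride))
```

Under these assumptions an 800-token context yields three features, which is why the number of features reported during training is larger than the number of examples.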

In [130]:
def preprocess(tokenizer):
  for i, example in enumerate(datasets["train"]):
    if len(tokenizer(example["question"], example["context"])["input_ids"]) > max_length:
        break
  example = datasets["train"][i]
  pad_on_right = tokenizer.padding_side == "right"

  def prepare_train_features(examples):
    # Some of the questions have a lot of whitespace on the left, which is not useful and will make the
    # truncation of the context fail (the tokenized question will take a lot of space). So we remove that
    # left whitespace.
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that slightly overlaps the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # The offset mappings will give us a map from token to character position in the original context. This will
    # help us compute the start_positions and end_positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Let's label those examples!
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        # We will label impossible answers with the index of the CLS token.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id) if tokenizer.cls_token_id is not None else None

        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        # If no answers are given, set the cls_index as answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start/end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != (1 if pad_on_right else 0):
                token_start_index += 1

            # End token index of the current span in the text.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != (1 if pad_on_right else 0):
                token_end_index -= 1

            # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise move the token_start_index and token_end_index to the two ends of the answer.
                # Note: we could go after the last offset if the answer is the last word (edge case).
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples
  
  return datasets.map(prepare_train_features, batched=True, remove_columns=datasets["train"].column_names)
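
The core of the labelling step is turning a character span into a token span with the help of the offset mapping. The sketch below shows that idea in isolation; it is a simplified variant and not the loop used in `prepare_train_features`, and both the helper name and the one-token-per-word offsets are invented for illustration.

```python
def char_span_to_token_span(offsets, start_char, end_char):
    """Return (first, last) indices of the tokens covering [start_char, end_char), or None."""
    first = last = None
    for index, (tok_start, tok_end) in enumerate(offsets):
        if tok_start <= start_char < tok_end:
            first = index
        if tok_start < end_char <= tok_end:
            last = index
    if first is None or last is None:
        return None  # the answer is not fully inside this window
    return first, last


# Toy offsets for "Sejm uchwala ustawy", one token per word (invented for illustration).
offsets = [(0, 4), (5, 12), (13, 19)]

print(char_span_to_token_span(offsets, 5, 12))
print(char_span_to_token_span(offsets, 5, 19))
print(char_span_to_token_span(offsets, 5, 25))
```

Returning `None` here plays the role that the CLS index plays in the notebook: it marks a feature whose window does not contain the whole answer.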

**Handy functions**

In [131]:
batch_size = 16

In [132]:
def create_trainer(qa, args = None):
  tokenized_datasets = preprocess(qa['tokenizer'])

  args = TrainingArguments(
    f"{qa['model_name']}-finetuned-squad",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=10,
    weight_decay=0.01,
  ) if args is None else args

  trainer = Trainer(
    qa['model'],
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["val"],
    data_collator=default_data_collator,
    tokenizer=qa['tokenizer'],
  )

  return trainer

**Evaluation**

In [133]:
def evaluate(tokenizer, trainer, test):

  pad_on_right = tokenizer.padding_side == "right"

  def prepare_validation_features(examples):
      # Some of the questions have a lot of whitespace on the left, which is not useful and will make the
      # truncation of the context fail (the tokenized question will take a lot of space). So we remove that
      # left whitespace.
      examples["question"] = [q.lstrip() for q in examples["question"]]

      # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results
      # in one example possibly giving several features when a context is long, each of those features having a
      # context that slightly overlaps the context of the previous feature.
      tokenized_examples = tokenizer(
          examples["question" if pad_on_right else "context"],
          examples["context" if pad_on_right else "question"],
          truncation="only_second" if pad_on_right else "only_first",
          max_length=max_length,
          stride=doc_stride,
          return_overflowing_tokens=True,
          return_offsets_mapping=True,
          padding="max_length",
      )

      # Since one example might give us several features if it has a long context, we need a map from a feature to
      # its corresponding example. This key gives us just that.
      sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

      # We keep the example_id that gave us this feature and we will store the offset mappings.
      tokenized_examples["example_id"] = []

      for i in range(len(tokenized_examples["input_ids"])):
          # Grab the sequence corresponding to that example (to know what is the context and what is the question).
          sequence_ids = tokenized_examples.sequence_ids(i)
          context_index = 1 if pad_on_right else 0

          # One example can give several spans, this is the index of the example containing this span of text.
          sample_index = sample_mapping[i]
          tokenized_examples["example_id"].append(examples["id"][sample_index])

          # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token
          # position is part of the context or not.
          tokenized_examples["offset_mapping"][i] = [
              (o if sequence_ids[k] == context_index else None)
              for k, o in enumerate(tokenized_examples["offset_mapping"][i])
          ]

      return tokenized_examples

  validation_features = test.map(
      prepare_validation_features,
      batched=True,
      remove_columns=test.column_names
  )

  raw_predictions = trainer.predict(validation_features)

  validation_features.set_format(type=validation_features.format["type"], columns=list(validation_features.features.keys()))
  max_answer_length = 30

  examples = test
  features = validation_features

  example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
  features_per_example = collections.defaultdict(list)
  for i, feature in enumerate(features):
      features_per_example[example_id_to_index[feature["example_id"]]].append(i)

  def postprocess_qa_predictions(examples, features, raw_predictions, n_best_size = 20, max_answer_length = 30):
      all_start_logits, all_end_logits = raw_predictions
      # Build a map example to its corresponding features.
      example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
      features_per_example = collections.defaultdict(list)
      for i, feature in enumerate(features):
          features_per_example[example_id_to_index[feature["example_id"]]].append(i)

      # The dictionaries we have to fill.
      predictions = collections.OrderedDict()

      # Logging.
      print(f"Post-processing {len(examples)} example predictions split into {len(features)} features.")

      # Let's loop over all the examples!
      for example_index, example in enumerate(tqdm(examples)):
          # Those are the indices of the features associated to the current example.
          feature_indices = features_per_example[example_index]

          min_null_score = None # Only used if squad_v2 is True.
          valid_answers = []
          
          context = example["context"]
          # Looping through all the features associated to the current example.
          for feature_index in feature_indices:
              # We grab the predictions of the model for this feature.
              start_logits = all_start_logits[feature_index]
              end_logits = all_end_logits[feature_index]
              # This is what will allow us to map the positions in our logits to spans of text in the original
              # context.
              offset_mapping = features[feature_index]["offset_mapping"]

              # Update minimum null prediction.
              cls_index = features[feature_index]["input_ids"].index(tokenizer.cls_token_id)
              feature_null_score = start_logits[cls_index] + end_logits[cls_index]
              if min_null_score is None or min_null_score < feature_null_score:
                  min_null_score = feature_null_score

              # Go through all possibilities for the `n_best_size` greater start and end logits.
              start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
              end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
              for start_index in start_indexes:
                  for end_index in end_indexes:
                      # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
                      # to part of the input_ids that are not in the context.
                      if (
                          start_index >= len(offset_mapping)
                          or end_index >= len(offset_mapping)
                          or offset_mapping[start_index] is None
                          or offset_mapping[end_index] is None
                      ):
                          continue
                      # Don't consider answers with a length that is either < 0 or > max_answer_length.
                      if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                          continue

                      start_char = offset_mapping[start_index][0]
                      end_char = offset_mapping[end_index][1]
                      valid_answers.append(
                          {
                              "score": start_logits[start_index] + end_logits[end_index],
                              "text": context[start_char: end_char]
                          }
                      )
          
          if len(valid_answers) > 0:
              best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
          else:
              # In the very rare edge case we have not a single non-null prediction, we create a fake prediction to avoid
              # failure.
              best_answer = {"text": "", "score": 0.0}
          
          predictions[example["id"]] = best_answer["text"]
        

      return predictions

  final_predictions = postprocess_qa_predictions(test, validation_features, raw_predictions.predictions)

  formatted_predictions = [{"id": k, "prediction_text": v} for k, v in final_predictions.items()]
  references = [{"id": ex["id"], "answers": ex["answers"]} for ex in test]
  return load_metric("squad").compute(predictions=formatted_predictions, references=references)
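
The span selection inside `postprocess_qa_predictions` can be hard to follow within the full function, so here is a stripped-down sketch of the same idea in plain Python. It leaves out the offset mapping and the check that a token belongs to the context; `best_span` and the toy logits are assumptions made for illustration.

```python
def best_span(start_logits, end_logits, n_best=20, max_answer_length=30):
    """Return (score, start, end) of the highest-scoring valid span, or None if there is none."""
    def top(logits):
        # Indices of the `n_best` largest logits, best first.
        return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:n_best]

    candidates = [
        (start_logits[s] + end_logits[e], s, e)
        for s in top(start_logits)
        for e in top(end_logits)
        if s <= e and e - s + 1 <= max_answer_length
    ]
    return max(candidates) if candidates else None


# Toy logits for a five-token window, invented for illustration only.
start_logits = [0.1, 2.0, 0.3, 0.2, 1.5]
end_logits = [0.0, 0.4, 3.0, 0.1, 0.2]

print(best_span(start_logits, end_logits))
```

The score of a span is the sum of its start and end logits, and spans that end before they start or exceed `max_answer_length` tokens are discarded, as in the function above.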

**Creating tokenizers and models**

In [121]:
models['herbert']['tokenizer'] = AutoTokenizer.from_pretrained('allegro/herbert-base-cased')
models['herbert']['model'] = AutoModelForQuestionAnswering.from_pretrained("allegro/herbert-base-cased")
models['herbert']['model_name'] = "allegro/herbert-base-cased"

models['bert']['tokenizer'] = AutoTokenizer.from_pretrained('dkleczek/bert-base-polish-cased-v1')
models['bert']['model'] = AutoModelForQuestionAnswering.from_pretrained("dkleczek/bert-base-polish-cased-v1")
models['bert']['model_name'] = "dkleczek/bert-base-polish-cased-v1"

loading file vocab.json
loading file merges.txt
loading file tokenizer.json
loading file added_tokens.json
loading file special_tokens_map.json
loading file tokenizer_config.json
loading configuration file allegro/herbert-base-cased/config.json
Model config BertConfig {
  "_name_or_path": "allegro/herbert-base-cased",
  "architectures": [
    "BertForQuestionAnswering"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 514,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "tokenizer_class": "HerbertTokenizerFast",
  "torch_dtype": "float32",
  "transformers_version": "4.25.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 50000
}

loading weights file allegro/herbert-b

**Finetuning**

In [54]:
metrics = {}  # maps model name -> dict with 'exact_match' and 'f1'

for model in tqdm(models):
  trainer = create_trainer(models[model])
  name = models[model]['model_name']
  trainer.train()
  trainer.save_model(name)
  evaluation_result = evaluate(models[model]['tokenizer'], trainer, datasets['test'])
  print(evaluation_result)
  metrics[models[model]['model_name']] = evaluation_result

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
***** Running training *****
  Num examples = 2548
  Num Epochs = 10
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 1600
  Number of trainable parameters = 123853826


***** Running Evaluation *****
  Num examples = 637
  Batch size = 16
***** Running Evaluation *****
  Num examples = 637
  Batch size = 16
***** Running Evaluation *****
  Num examples = 637
  Batch size = 16
Saving model checkpoint to allegro/herbert-base-cased-finetuned-squad/checkpoint-500
Configuration saved in allegro/herbert-base-cased-finetuned-squad/checkpoint-500/config.json
Model weights saved in allegro/herbert-base-cased-finetuned-squad/checkpoint-500/pytorch_model.bin
tokenizer config file saved in allegro/herbert-base-cased-finetuned-squad/checkpoint-500/tokenizer_config.json
Special tokens file saved in allegro/herbert-base-cased-finetuned-squad/checkpoint-500/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 637
  Batch size = 16
***** Running Evaluation *****
  Num examples = 637
  Batch size = 16
***** Running Evaluation *****
  Num examples = 637
  Batch size = 16
Saving model checkpoint to allegro/herbert-base-cased-finetuned-squad/checkpoint-

| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | No log | 2.301251 |
| 2 | No log | 1.816592 |
| 3 | No log | 1.650167 |
| 4 | 2.403400 | 1.712057 |
| 5 | 2.403400 | 1.926689 |
| 6 | 2.403400 | 1.843805 |
| 7 | 0.805500 | 1.935361 |
| 8 | 0.805500 | 2.019613 |
| 9 | 0.805500 | 2.033337 |
| 10 | 0.437900 | 2.040416 |


***** Running Evaluation *****
  Num examples = 637
  Batch size = 16


Training completed. Do not forget to share your model on huggingface.co/models =)


Saving model checkpoint to allegro/herbert-base-cased
Configuration saved in allegro/herbert-base-cased/config.json
Model weights saved in allegro/herbert-base-cased/pytorch_model.bin
tokenizer config file saved in allegro/herbert-base-cased/tokenizer_config.json
Special tokens file saved in allegro/herbert-base-cased/special_tokens_map.json


  0%|          | 0/1 [00:00<?, ?ba/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 633
  Batch size = 16


Post-processing 600 example predictions split into 633 features.


{'exact_match': 34.0, 'f1': 52.21449356286099}


PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
***** Running training *****
  Num examples = 2530
  Num Epochs = 10
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 1590
  Number of trainable parameters = 131532290


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | No log | 2.282816 |
| 2 | No log | 1.997542 |
| 3 | No log | 2.216819 |
| 4 | 2.019300 | 2.316905 |
| 5 | 2.019300 | 2.644211 |
| 6 | 2.019300 | 2.985743 |
| 7 | 0.355300 | 3.191115 |
| 8 | 0.355300 | 3.32555 |
| 9 | 0.355300 | 3.445309 |
| 10 | 0.096100 | 3.505562 |


***** Running Evaluation *****
  Num examples = 630
  Batch size = 16
***** Running Evaluation *****
  Num examples = 630
  Batch size = 16
***** Running Evaluation *****
  Num examples = 630
  Batch size = 16
Saving model checkpoint to dkleczek/bert-base-polish-cased-v1-finetuned-squad/checkpoint-500
Configuration saved in dkleczek/bert-base-polish-cased-v1-finetuned-squad/checkpoint-500/config.json
Model weights saved in dkleczek/bert-base-polish-cased-v1-finetuned-squad/checkpoint-500/pytorch_model.bin
tokenizer config file saved in dkleczek/bert-base-polish-cased-v1-finetuned-squad/checkpoint-500/tokenizer_config.json
Special tokens file saved in dkleczek/bert-base-polish-cased-v1-finetuned-squad/checkpoint-500/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 630
  Batch size = 16
***** Running Evaluation *****
  Num examples = 630
  Batch size = 16
***** Running Evaluation *****
  Num examples = 630
  Batch size = 16
Saving model checkpoint to dkleczek/bert-

  0%|          | 0/1 [00:00<?, ?ba/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 627
  Batch size = 16


Post-processing 600 example predictions split into 627 features.


  0%|          | 0/600 [00:00<?, ?it/s]

{'exact_match': 27.5, 'f1': 45.94652594740561}
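
The step counts in the training logs above follow from the number of features, the batch size, and the number of epochs. The sketch below reproduces that arithmetic for a single device without gradient accumulation, which is what the logs report; `total_optimization_steps` is a hypothetical helper, not a Trainer function.

```python
import math


def total_optimization_steps(num_features, batch_size, epochs):
    """Optimizer updates for a full run: ceil(features / batch size) per epoch, times the epochs."""
    return math.ceil(num_features / batch_size) * epochs


herbert_steps = total_optimization_steps(2548, 16, 10)
bert_steps = total_optimization_steps(2530, 16, 10)
print(herbert_steps, bert_steps)
```

The same arithmetic suggests why the training-loss column reads `No log` for the first three epochs: assuming the Trainer's default logging interval of 500 steps, the first logged value arrives during epoch 4 (steps 481 to 640 for the HerBERT run).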


It might be a good idea to download the fine-tuned models :)

In [104]:
from google.colab import files

!zip -r /content/herbert.zip /content/allegro/herbert-base-cased-finetuned-squad/checkpoint-1500

  adding: content/allegro/herbert-base-cased-finetuned-squad/checkpoint-1500/ (stored 0%)
  adding: content/allegro/herbert-base-cased-finetuned-squad/checkpoint-1500/tokenizer_config.json (deflated 46%)
  adding: content/allegro/herbert-base-cased-finetuned-squad/checkpoint-1500/merges.txt (deflated 60%)
  adding: content/allegro/herbert-base-cased-finetuned-squad/checkpoint-1500/rng_state.pth (deflated 28%)
  adding: content/allegro/herbert-base-cased-finetuned-squad/checkpoint-1500/special_tokens_map.json (deflated 53%)
  adding: content/allegro/herbert-base-cased-finetuned-squad/checkpoint-1500/vocab.json (deflated 62%)
  adding: content/allegro/herbert-base-cased-finetuned-squad/checkpoint-1500/training_args.bin (deflated 48%)
  adding: content/allegro/herbert-base-cased-finetuned-squad/checkpoint-1500/scheduler.pt (deflated 50%)
  adding: content/allegro/herbert-base-cased-finetuned-squad/checkpoint-1500/trainer_state.json (deflated 76%)
  adding: content/allegro/herbert-base-cas

## **Task 7**

Report the obtained performance of the models (in the form of a table). The report should include exact match and F1 score for the tokens appearing both in the reference and the predicted answer.

In [56]:
!pip install tabletext

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting tabletext
  Downloading tabletext-0.1.tar.gz (6.1 kB)
Building wheels for collected packages: tabletext
  Building wheel for tabletext (setup.py) ... done
  Created wheel for tabletext: filename=tabletext-0.1-py3-none-any.whl size=6022 sha256=d44acfbae48896b80082c23bacda7dbdc4a65a5514cfc977257854d3a5f676f6
  Stored in directory: /root/.cache/pip/wheels/bf/07/53/fc26a7a3b10eed822a5a2e56ec0729d472ad39aa45e2ed814b
Successfully built tabletext
Installing collected packages: tabletext
Successfully installed tabletext-0.1


In [58]:
import tabletext

data = [["model","exact_match","f1"]]

for name, results in metrics.items():
  data.append([name.split('/')[-1], results["exact_match"], results["f1"]])

print(tabletext.to_text(data))

┌───────────────────────────┬─────────────┬───────────────────┐
│ model                     │ exact_match │ f1                │
├───────────────────────────┼─────────────┼───────────────────┤
│ herbert-base-cased        │        34.0 │ 52.21449356286099 │
├───────────────────────────┼─────────────┼───────────────────┤
│ bert-base-polish-cased-v1 │        27.5 │ 45.94652594740561 │
└───────────────────────────┴─────────────┴───────────────────┘
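
For reference, the two reported metrics can be sketched in a few lines. This is a minimal, hedged illustration in the spirit of the SQuAD evaluation script, not necessarily what the `evaluate` helper used in this notebook computes. The function names `normalize_answer`, `exact_match` and `token_f1` are hypothetical, and the English-specific article removal done by the original script is deliberately left out.

```python
import collections
import string


def normalize_answer(text):
    """Lowercase, drop ASCII punctuation and collapse whitespace.

    Assumption: unlike the English SQuAD script, articles (a/an/the) are not
    removed, since that step does not apply to Polish.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def token_f1(prediction, reference):
    """F1 over the tokens shared by the prediction and the reference."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts every shared token at most as often
    # as it occurs in both answers.
    common = collections.Counter(pred_tokens) & collections.Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Exact match rewards only identical answers, while the token-level F1 gives partial credit when the predicted span overlaps the reference, which is why F1 is noticeably higher than exact match for both models.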


## **Task 8**

Report the best results obtained on the validation dataset and the corresponding results on your test dataset. The results on the test set have to be obtained for the model that yields the best result on the validation dataset.

The best performance on the validation dataset was achieved by the Herbert model. The results:
* exact_match = 34.0
* f1 = 52.21

## **Task 9**

Generate, report and analyze the answers provided by the best model on your test dataset.

In this case we load the dataset that I created myself and submitted as a pull request to the NLP repo. It is worth mentioning that I removed the questions with no answer.

In [93]:
with open('./qas.json', encoding='utf-8') as f:
    data = json.loads(f.read())
    data = data['data']

qas_dict = {
    'id': [],
    'question': [],
    'answers': [],
    'context': [],
    'title': []
}

for row in data:
  qas_dict['id'].append(row['id'])
  qas_dict['question'].append(row['question'])
  qas_dict['answers'].append({'answer_start': [0], 'text': row['answer']})
  qas_dict['context'].append(row['context'])
  qas_dict['title'].append(row['title'])

qas_test = Dataset.from_dict(qas_dict)
qas_test

Dataset({
    features: ['id', 'question', 'answers', 'context', 'title'],
    num_rows: 6
})

In [100]:
tokenizer = models['herbert']['tokenizer']
best_model = AutoModelForQuestionAnswering.from_pretrained('./allegro/herbert-base-cased-finetuned-squad/checkpoint-1500')
best_trainer = Trainer(model = best_model)
pad_on_right = tokenizer.padding_side == "right"

def prepare_validation_features(examples):
    # Some of the questions have lots of whitespace on the left, which is not useful and will make the
    # truncation of the context fail (the tokenized question will take a lots of space). So we remove that
    # left whitespace
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results
    # in one example possible giving several features when a context is long, each of those features having a
    # context that overlaps a bit the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    # We keep the example_id that gave us this feature and we will store the offset mappings.
    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1 if pad_on_right else 0

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["id"][sample_index])

        # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token
        # position is part of the context or not.
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples

validation_features = qas_test.map(
    prepare_validation_features,
    batched=True,
    remove_columns=qas_test.column_names
)

raw_predictions = best_trainer.predict(validation_features)

validation_features.set_format(type=validation_features.format["type"], columns=list(validation_features.features.keys()))
max_answer_length = 30

examples = qas_test
features = validation_features

example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
features_per_example = collections.defaultdict(list)
for i, feature in enumerate(features):
    features_per_example[example_id_to_index[feature["example_id"]]].append(i)

def postprocess_qa_predictions(examples, features, raw_predictions, n_best_size = 20, max_answer_length = 30):
    all_start_logits, all_end_logits = raw_predictions
    # Build a map example to its corresponding features.
    example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
    features_per_example = collections.defaultdict(list)
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)

    # The dictionaries we have to fill.
    predictions = collections.OrderedDict()

    # Logging.
    print(f"Post-processing {len(examples)} example predictions split into {len(features)} features.")

    # Let's loop over all the examples!
    for example_index, example in enumerate(tqdm(examples)):
        # Those are the indices of the features associated to the current example.
        feature_indices = features_per_example[example_index]

        min_null_score = None # Only used if squad_v2 is True.
        valid_answers = []
        
        context = example["context"]
        # Looping through all the features associated to the current example.
        for feature_index in feature_indices:
            # We grab the predictions of the model for this feature.
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            # This is what will allow us to map some the positions in our logits to span of texts in the original
            # context.
            offset_mapping = features[feature_index]["offset_mapping"]

            # Update minimum null prediction.
            cls_index = features[feature_index]["input_ids"].index(tokenizer.cls_token_id)
            feature_null_score = start_logits[cls_index] + end_logits[cls_index]
            if min_null_score is None or min_null_score < feature_null_score:
                min_null_score = feature_null_score

            # Go through all possibilities for the `n_best_size` greater start and end logits.
            start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
            end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
                    # to part of the input_ids that are not in the context.
                    if (
                        start_index >= len(offset_mapping)
                        or end_index >= len(offset_mapping)
                        or offset_mapping[start_index] is None
                        or offset_mapping[end_index] is None
                    ):
                        continue
                    # Don't consider answers with a length that is either < 0 or > max_answer_length.
                    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                        continue

                    start_char = offset_mapping[start_index][0]
                    end_char = offset_mapping[end_index][1]
                    valid_answers.append(
                        {
                            "score": start_logits[start_index] + end_logits[end_index],
                            "text": context[start_char: end_char]
                        }
                    )
        
        if len(valid_answers) > 0:
            best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
        else:
            # In the very rare edge case we have not a single non-null prediction, we create a fake prediction to avoid
            # failure.
            best_answer = {"text": "", "score": 0.0}
        
        predictions[example["id"]] = best_answer["text"]
      

    return predictions

final_predictions = postprocess_qa_predictions(qas_test, validation_features, raw_predictions.predictions)

formatted_predictions = [{"id": k, "prediction_text": v} for k, v in final_predictions.items()]
references = [{"id": ex["id"], "answers": ex["answers"]} for ex in qas_test]

good_answers = 0
all_answers = len(qas_test)
print()
print("----------")
for i, pred in enumerate(formatted_predictions):
  print(qas_test[i]['question'])
  print(pred['prediction_text'])
  print(qas_test[i]['context'])
  if input() == 'y':
    good_answers += 1
  print()
  print("----------")
print("Accuracy:", good_answers / all_answers)

loading configuration file ./allegro/herbert-base-cased-finetuned-squad/checkpoint-1500/config.json
Model config BertConfig {
  "_name_or_path": "./allegro/herbert-base-cased-finetuned-squad/checkpoint-1500",
  "architectures": [
    "BertForQuestionAnswering"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 514,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "tokenizer_class": "HerbertTokenizerFast",
  "torch_dtype": "float32",
  "transformers_version": "4.25.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 50000
}

loading weights file ./allegro/herbert-base-cased-finetuned-squad/checkpoint-1500/pytorch_model.bin
All model checkpoint weights were used when initi

  0%|          | 0/1 [00:00<?, ?ba/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 6
  Batch size = 8


Post-processing 6 example predictions split into 6 features.


  0%|          | 0/6 [00:00<?, ?it/s]


----------
Czy zamniętych zakładach karnych skazani mogą korzystać z własnych butów?
7) skazani mogą korzystać z własnej odzieży, bielizny i obuwia, 8
Art. 92. W zakładzie karnym typu otwartego: 1) cele mieszkalne skazanych pozostają otwarte przez całą dobę, 2) skazanych zatrudnia się przede wszystkim poza terenem zakładu karnego, bez konwojenta, na pojedynczych stanowiskach pracy, 3) skazanym można zezwalać na uczestniczenie w nauczaniu, szkoleniu oraz zajęciach terapeutycznych organizowanych poza terenem zakładu karnego, 4) skazani mogą brać udział w organizowanych przez administrację, poza terenem zakładu karnego, grupowych zajęciach kulturalno-oświatowych lub sportowych, 5) skazanym można zezwalać na udział w zajęciach i imprezach kulturalno-oświatowych lub sportowych organizowanych poza terenem zakładu karnego, 6) skazani mogą poruszać się po terenie zakładu karnego w czasie i miejscach ustalonych w porządku wewnętrznym, 7) skazani mogą korzystać z własnej odzieży, bielizny i obu

## **Task 10**

Perform hyperparameter tuning for the models to obtain better results. Take into account some of the following parameters:
* learning rate
* gradient accumulation steps
* batch size
* gradient clipping
* learning rate schedule

The Herbert model was the best model, so we tune hyperparameters for this particular model. We consider a grid over the following parameters:
* learning rate - [2e-5, 0.01, 0.1],
* batch size - [16, 32],
* weight decay - [0.1, 0.01, 0.001]

Because fine-tuning the model can take a long time, we reduce the size of the training, validation and test datasets.

In [134]:
N_TRAIN_NEW = 500
N_TEST_NEW = 100
N_VAL_NEW = int(0.2 * N_TRAIN_NEW)

training_args_list = []

for lr in [2e-5, 0.01, 0.1]:
  for batch_size in [16, 32]:
    for weight_decay in [0.1, 0.01, 0.001]:
      training_args_list.append(
          TrainingArguments(
            "herbert-finetuned-poquad",  # output directory, shared by all configurations
            evaluation_strategy = "epoch",
            learning_rate=lr,
            per_device_train_batch_size=batch_size,
            per_device_eval_batch_size=batch_size,
            num_train_epochs=10,
            weight_decay=weight_decay,
          )
      )
training_args_list[0]

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).

TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=epoch,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
jit_mode_eval=False,
label_n

In [135]:
dataset_train = Dataset.from_dict({'id':dataset_train['id'][:N_TRAIN_NEW],
                                  'question':dataset_train['question'][:N_TRAIN_NEW],
                                  'answers':dataset_train['answers'][:N_TRAIN_NEW],
                                  'context':dataset_train['context'][:N_TRAIN_NEW],
                                  'title':dataset_train['title'][:N_TRAIN_NEW]})
dataset_train

Dataset({
    features: ['id', 'question', 'answers', 'context', 'title'],
    num_rows: 400
})

In [136]:
dataset_test = Dataset.from_dict({'id':dataset_test['id'][:N_TEST_NEW],
                                  'question':dataset_test['question'][:N_TEST_NEW],
                                  'answers':dataset_test['answers'][:N_TEST_NEW],
                                  'context':dataset_test['context'][:N_TEST_NEW],
                                  'title':dataset_test['title'][:N_TEST_NEW]})
dataset_test

Dataset({
    features: ['id', 'question', 'answers', 'context', 'title'],
    num_rows: 100
})

In [137]:
temp_df = pd.DataFrame(dataset_train)
temp_df = temp_df.sample(n=N_VAL_NEW)
indices = temp_df.index
dataset_val = Dataset.from_pandas(temp_df)

dataset_val = Dataset.from_dict({'id':dataset_val['id'],
                                  'question':dataset_val['question'],
                                  'answers':dataset_val['answers'],
                                  'context':dataset_val['context'],
                                  'title':dataset_val['title']})

temp_df = pd.DataFrame(dataset_train)
bad_df = temp_df.index.isin(indices)
temp_df = temp_df[~bad_df]
dataset_train = Dataset.from_pandas(temp_df)

dataset_train = Dataset.from_dict({'id':dataset_train['id'],
                                  'question':dataset_train['question'],
                                  'answers':dataset_train['answers'],
                                  'context':dataset_train['context'],
                                  'title':dataset_train['title']})
dataset_train

Dataset({
    features: ['id', 'question', 'answers', 'context', 'title'],
    num_rows: 300
})

In [138]:
datasets = {}
datasets['train'] = dataset_train
datasets['val'] = dataset_val
datasets['test'] = dataset_test

datasets = DatasetDict(datasets)
datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'question', 'answers', 'context', 'title'],
        num_rows: 300
    })
    val: Dataset({
        features: ['id', 'question', 'answers', 'context', 'title'],
        num_rows: 100
    })
    test: Dataset({
        features: ['id', 'question', 'answers', 'context', 'title'],
        num_rows: 100
    })
})

In [139]:
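metrics_opt = {}  # run name -> {'exact_match': ..., 'f1': ...}

# `total` lets tqdm show real progress for the enumerate() iterator.
for idx, args in tqdm(enumerate(training_args_list), total=len(training_args_list)):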
  trainer = create_trainer(models['herbert'], args=args)
  name = f"{models['herbert']['model_name']}_args_{idx+1}"
  trainer.train()
  # trainer.save_model(name)
  evaluation_result = evaluate(models['herbert']['tokenizer'], trainer, datasets['test'])
  print(evaluation_result)
  metrics_opt[name] = evaluation_result

0it [00:00, ?it/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

***** Running training *****
  Num examples = 308
  Num Epochs = 10
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 200
  Number of trainable parameters = 123853826


Epoch,Training Loss,Validation Loss
1,No log,0.255015
2,No log,0.307461
3,No log,0.322013
4,No log,0.330894
5,No log,0.323887
6,No log,0.344893
7,No log,0.322521
8,No log,0.314738
9,No log,0.310369
10,No log,0.310131


***** Running Evaluation *****
  Num examples = 108
  Batch size = 16
***** Running Evaluation *****
  Num examples = 108
  Batch size = 16
***** Running Evaluation *****
  Num examples = 108
  Batch size = 16
***** Running Evaluation *****
  Num examples = 108
  Batch size = 16
***** Running Evaluation *****
  Num examples = 108
  Batch size = 16
***** Running Evaluation *****
  Num examples = 108
  Batch size = 16
***** Running Evaluation *****
  Num examples = 108
  Batch size = 16
***** Running Evaluation *****
  Num examples = 108
  Batch size = 16
***** Running Evaluation *****
  Num examples = 108
  Batch size = 16
***** Running Evaluation *****
  Num examples = 108
  Batch size = 16


Training completed. Do not forget to share your model on huggingface.co/models =)




  0%|          | 0/1 [00:00<?, ?ba/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 104
  Batch size = 16


Post-processing 100 example predictions split into 104 features.


  0%|          | 0/100 [00:00<?, ?it/s]

{'exact_match': 34.0, 'f1': 53.51423957879065}


  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

***** Running training *****
  Num examples = 308
  Num Epochs = 10
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 200
  Number of trainable parameters = 123853826


Epoch,Training Loss,Validation Loss
1,No log,0.318072
2,No log,0.371906
3,No log,0.358384
4,No log,0.39715
5,No log,0.394609
6,No log,0.425898
7,No log,0.435314
8,No log,0.432978
9,No log,0.435326
10,No log,0.435106


***** Running Evaluation *****
  Num examples = 108
  Batch size = 16
***** Running Evaluation *****
  Num examples = 108
  Batch size = 16
***** Running Evaluation *****
  Num examples = 108
  Batch size = 16
***** Running Evaluation *****
  Num examples = 108
  Batch size = 16
***** Running Evaluation *****
  Num examples = 108
  Batch size = 16
***** Running Evaluation *****
  Num examples = 108
  Batch size = 16
***** Running Evaluation *****
  Num examples = 108
  Batch size = 16
***** Running Evaluation *****
  Num examples = 108
  Batch size = 16
***** Running Evaluation *****
  Num examples = 108
  Batch size = 16
***** Running Evaluation *****
  Num examples = 108
  Batch size = 16


Training completed. Do not forget to share your model on huggingface.co/models =)




  0%|          | 0/1 [00:00<?, ?ba/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 104
  Batch size = 16


Post-processing 100 example predictions split into 104 features.


  0%|          | 0/100 [00:00<?, ?it/s]

{'exact_match': 34.0, 'f1': 51.117333706884786}


  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

***** Running training *****
  Num examples = 308
  Num Epochs = 10
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 200
  Number of trainable parameters = 123853826


Epoch,Training Loss,Validation Loss
1,No log,0.539569
2,No log,0.433105
3,No log,0.477652
4,No log,0.488353
5,No log,0.514252
6,No log,0.550554
7,No log,0.519091
8,No log,0.488404
9,No log,0.507495
10,No log,0.500044


***** Running Evaluation *****
  Num examples = 108
  Batch size = 16
***** Running Evaluation *****
  Num examples = 108
  Batch size = 16
***** Running Evaluation *****
  Num examples = 108
  Batch size = 16
***** Running Evaluation *****
  Num examples = 108
  Batch size = 16
***** Running Evaluation *****
  Num examples = 108
  Batch size = 16
***** Running Evaluation *****
  Num examples = 108
  Batch size = 16
***** Running Evaluation *****
  Num examples = 108
  Batch size = 16
***** Running Evaluation *****
  Num examples = 108
  Batch size = 16
***** Running Evaluation *****
  Num examples = 108
  Batch size = 16
***** Running Evaluation *****
  Num examples = 108
  Batch size = 16


Training completed. Do not forget to share your model on huggingface.co/models =)




  0%|          | 0/1 [00:00<?, ?ba/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 104
  Batch size = 16


Post-processing 100 example predictions split into 104 features.


  0%|          | 0/100 [00:00<?, ?it/s]

{'exact_match': 37.0, 'f1': 54.43400037355144}


  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

***** Running training *****
  Num examples = 308
  Num Epochs = 10
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 100
  Number of trainable parameters = 123853826


OutOfMemoryError: ignored

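Only the first three configurations finished. All of them use learning rate = 2e-5 and batch size = 16 and differ only in weight decay:
* weight decay = 0.1: exact_match = 34.0, f1 = 53.51, final validation loss = 0.310,
* weight decay = 0.01: exact_match = 34.0, f1 = 51.12, final validation loss = 0.435,
* weight decay = 0.001: exact_match = 37.0, f1 = 54.43, final validation loss = 0.500.

Weight decay = 0.1 gives the lowest validation loss, while weight decay = 0.001 gives the highest exact match and F1 on the reduced test set. With only 100 test examples these differences are small and should be read with caution.

Training for the remaining configurations, starting with the first one that uses batch size = 32, was interrupted by OutOfMemoryError (CUDA out of memory).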
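
One possible way to cover the rest of the grid is sketched below, under the assumption that the failure was caused by batch size 32 not fitting into GPU memory. The idea is to cap the per-device batch at 16 and reach the target batch size through gradient accumulation. The helper `build_grid`, the cap `MAX_DEVICE_BATCH` and the per-run output directory names are hypothetical; the dictionary keys match existing `TrainingArguments` parameters, so each entry could be passed as `TrainingArguments(**config)`.

```python
import itertools

LEARNING_RATES = [2e-5, 0.01, 0.1]
BATCH_SIZES = [16, 32]
WEIGHT_DECAYS = [0.1, 0.01, 0.001]

# Assumption: 16 is the largest per-device batch that fits on the GPU.
MAX_DEVICE_BATCH = 16


def build_grid(max_device_batch=MAX_DEVICE_BATCH):
    """Enumerate the grid in the same order as the nested loops above."""
    configs = []
    grid = itertools.product(LEARNING_RATES, BATCH_SIZES, WEIGHT_DECAYS)
    for idx, (lr, batch_size, weight_decay) in enumerate(grid, start=1):
        device_batch = min(batch_size, max_device_batch)
        # Ceiling division: accumulation steps needed to reach batch_size.
        accumulation = -(-batch_size // device_batch)
        configs.append({
            # Hypothetical naming: one directory per run, so checkpoints
            # from different configurations do not overwrite each other.
            "output_dir": f"herbert-finetuned-poquad-args-{idx}",
            "learning_rate": lr,
            "weight_decay": weight_decay,
            "per_device_train_batch_size": device_batch,
            "gradient_accumulation_steps": accumulation,
        })
    return configs


grid = build_grid()
print(len(grid))
print(grid[3])
```

With accumulation, the effective batch size stays at 32 while only 16 examples are held on the device at a time, at the cost of twice as many forward and backward passes per optimizer step.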

## **Task 11**

Answer the following questions:

**Which pre-trained model performs better on that task?**

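The Herbert model (`herbert-base-cased`): exact_match = 34.0 and f1 = 52.21, versus 27.5 and 45.95 for `bert-base-polish-cased-v1`.

**Does the performance on the validation dataset reflect the performance on your test set?**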

Yes.

**What are the outcomes of the model on your own questions? Are they satisfying? If not, what might be the reason for that?**

For only one question (_Czy zamniętych zakładach karnych skazani mogą korzystać z własnych butów?_, i.e. "Can convicts in closed prisons use their own shoes?") the model returned the answer "6". The rest of the answers were correct. In my opinion, this number was returned because the correct answer sits in the enumerated point labelled with this number. The model judged that such a short span was a good enough answer and returned it.

**Why is extractive question answering not well suited for inflectional languages?**

The model can only copy a span of characters from the context that contains the correct answer. In an inflectional language the words in that span keep the grammatical case they have in the context, which often differs from the case the question calls for, so the extracted answer sounds artificial even when it points at the right information.
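
A small, invented illustration of this effect (the sentence and question are not taken from the dataset): the natural answer to the question is the nominative form "Kraków", but the context only contains the locative form "Krakowie", so no extracted span can equal the expected answer.

```python
# Invented example, not part of the legal questions dataset.
question = "Jakie miasto jest siedzibą Uniwersytetu Jagiellońskiego?"
context = "Uniwersytet Jagielloński ma siedzibę w Krakowie."
reference = "Kraków"  # nominative form a person would naturally give


def extract_span(text, start_char, end_char):
    """An extractive model can only return a slice of the context."""
    return text[start_char:end_char]


start = context.index("Krakowie")
prediction = extract_span(context, start, start + len("Krakowie"))

print(prediction)               # the locative form copied from the context
print(reference in context)     # the nominative form never occurs there
print(prediction == reference)  # so exact match fails for the best span
```

Even a perfect span selector is penalised here: the answer is semantically right but differs from the reference on the surface.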

**Why do you have to remove the duplicated questions from the training and the validation subsets?**

If the same question appeared in more than one of the training, validation and test sets, the model would be evaluated on questions it had already seen during training, so the results obtained would be biased in the model's favour.
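
A split that respects this requirement can be sketched as follows. This is a minimal illustration, assuming each row is a dictionary with a `question` key; the helper name `split_without_question_overlap` and the normalization used to detect duplicates (lowercasing and collapsing whitespace) are assumptions, not the procedure used earlier in this notebook.

```python
import random
from collections import defaultdict


def split_without_question_overlap(rows, val_fraction=0.2, seed=0):
    """Split rows so that no question text appears in both subsets."""
    # Group rows by normalized question so duplicates always stay together.
    groups = defaultdict(list)
    for row in rows:
        key = " ".join(row["question"].lower().split())
        groups[key].append(row)

    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    # Fill the validation set group by group until it reaches the target
    # size; a duplicated question may push it slightly above the target.
    target = int(val_fraction * len(rows))
    train, val = [], []
    for key in keys:
        (val if len(val) < target else train).extend(groups[key])
    return train, val


rows = [{"question": f"Q{i % 8}", "answer": str(i)} for i in range(10)]
train, val = split_without_question_overlap(rows)
train_questions = {r["question"] for r in train}
val_questions = {r["question"] for r in val}
print(len(train), len(val), train_questions & val_questions)
```

Splitting by groups rather than by individual rows is what guarantees an empty intersection; repeated questions are moved as a whole into one subset.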