# Contextual question answering

The aim of this exercise is to build a neural model that answers questions in the legal domain based on a given context.

In [1]:
!pip install transformers
!pip install datasets



In [35]:
!pip install tabletext

Collecting tabletext
  Downloading tabletext-0.1.tar.gz (6.1 kB)
Building wheels for collected packages: tabletext
  Building wheel for tabletext (setup.py) ... done
  Created wheel for tabletext: filename=tabletext-0.1-py3-none-any.whl size=6023 sha256=a1938ab0c4ab6430ba2adbd7df4fe27c183809747beb8c9e19f12d4c18a376c0
  Stored in directory: /root/.cache/pip/wheels/cc/ae/ab/697f6cd9887c63663da889f796c2c7ea280bc407b16f6fd081
Successfully built tabletext
Installing collected packages: tabletext
Successfully installed tabletext-0.1


In [59]:
import collections
import tabletext
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer, AutoTokenizer
from datasets import Dataset, DatasetDict, load_metric
from transformers import default_data_collator
import numpy as np
import pandas as pd
import json
from tqdm.auto import tqdm
from pyarrow import Table

## Train a neural model able to answer legal questions. Use at least two models from exercise 6 as pre-trained models used as a base for fine-tuning.

In [3]:
herbert_model = AutoModelForQuestionAnswering.from_pretrained("allegro/herbert-base-cased")
herbert_tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")

Some weights of the model checkpoint at allegro/herbert-base-cased were not used when initializing BertForQuestionAnswering: ['cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.sso.sso_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.bias', 'cls.sso.sso_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQuestionAnswering were not initialized

In [4]:
bert_model = AutoModelForQuestionAnswering.from_pretrained("dkleczek/bert-base-polish-cased-v1")
bert_tokenizer = AutoTokenizer.from_pretrained("dkleczek/bert-base-polish-cased-v1")

Some weights of the model checkpoint at dkleczek/bert-base-polish-cased-v1 were not used when initializing BertForQuestionAnswering: ['cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQuestionAnswering were not initialized

### Loading dataset

Inspired by: https://github.com/huggingface/notebooks/blob/master/examples/question_answering.ipynb

In [6]:
def to_squad_format(dataset):
    res = {
        "id": [],
        "question": [],
        "answers": [],
        "context": [],
        "title": []
    }
    for act in dataset["data"]:
        title = act["title"]
        for para in act["paragraphs"]:
            ctx = para["context"]
            for q in para["qas"]:
                q["context"] = ctx
                q["title"] = title
                answers = {"answer_start": [], "text": []}
                for an in q["answers"]:
                    answers["answer_start"].append(an["answer_start"])
                    answers["text"].append(an["text"])
                q["answers"] = answers
                for c in ["id", "question", "answers", "context", "title"]:
                    res[c].append(q[c])
    return Dataset.from_dict(res)
    
def read_dataset():
    dataset = {}
    for t in ["train", "dev", "test"]:
        # The files are only read, so open them read-only as UTF-8 text.
        with open(f"drive/MyDrive/lqad-pl-1/lqad-pl-{t}.json", "r", encoding="utf-8") as f:
            dataset[t] = to_squad_format(json.load(f))
    return DatasetDict(dataset)

datasets = read_dataset()
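
The training labels are derived from the `answer_start` character offsets, so it can be worth confirming that each offset really points at the answer text inside its context. The sketch below is a pure-Python illustration on hand-made records; the helper name `find_misaligned_answers` and the sample data are made up for this example and are not part of the LQAD-PL files.

```python
def find_misaligned_answers(records):
    """Return the ids of records whose answer text is not found at answer_start."""
    bad_ids = []
    for rec in records:
        context = rec["context"]
        starts = rec["answers"]["answer_start"]
        texts = rec["answers"]["text"]
        for start, text in zip(starts, texts):
            if context[start:start + len(text)] != text:
                bad_ids.append(rec["id"])
                break
    return bad_ids


# Hypothetical toy records in the flattened (SQuAD-like) layout used above.
CONTEXT = "Ważność wyboru stwierdza Sąd Najwyższy ."
ANSWER = "Sąd Najwyższy"
correct_start = CONTEXT.index(ANSWER)

sample_records = [
    {
        "id": "ok_1",
        "context": CONTEXT,
        "answers": {"answer_start": [correct_start], "text": [ANSWER]},
    },
    {
        # Deliberately shifted offset to show what a broken record looks like.
        "id": "shifted_1",
        "context": CONTEXT,
        "answers": {"answer_start": [correct_start - 5], "text": [ANSWER]},
    },
]

misaligned = find_misaligned_answers(sample_records)
print(misaligned)
```

Iterating over a `Dataset` yields one dict per row, so the same helper should also work when called as `find_misaligned_answers(datasets["train"])`.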

In [7]:
datasets["train"]

Dataset({
    features: ['id', 'question', 'answers', 'context', 'title'],
    num_rows: 3719
})

In [8]:
datasets["train"][0]

{'answers': {'answer_start': [115],
  'text': ['najpóźniej w dniu wyborów kończy 21 lat']},
 'context': ' Rozdział IV SEJM I SENAT Art . 99 § 1 . Wybrany do Sejmu może być obywatel polski mający prawo wybierania , który najpóźniej w dniu wyborów kończy 21 lat . § 2 . Wybrany do Senatu może być obywatel polski mający prawo wybierania , który najpóźniej w dniu wyborów kończy 30 lat . § 3 . Wybraną do Sejmu lub do Senatu nie może być osoba skazana prawomocnym wyrokiem na karę pozbawienia wolności za przestępstwo umyślne ścigane z oskarżenia publicznego .',
 'id': '2007_adw_1',
 'question': 'Wybrany do Sejmu może być obywatel polski mający prawo wybierania, który: ',
 'title': 'Konstytucja Rzeczypospolitej Polskiej z dnia 2 kwietnia 1997 r. uchwalona przez Zgromadzenie Narodowe w dniu 2 kwietnia 1997 r., przyjęta przez Naród w referendum konstytucyjnym w dniu 25 maja 1997 r., podpisana przez Prezydenta Rzeczypospolitej Polskiej w dniu 16 lipca 1997 r'}

In [9]:
from datasets import ClassLabel, Sequence
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
    display(HTML(df.to_html()))

In [10]:
show_random_elements(datasets["train"])

Unnamed: 0,id,question,answers,context,title
0,2013_kom_33,"Zgodnie z Kodeksem cywilnym, spadkobierca może być uznany za niegodnego przez:","{'answer_start': [98], 'text': ['sąd']}","Księga CZWARTA SPADKI Tytuł I Przepisy ogólne Art . 928 § 1 . Spadkobierca może być uznany przez sąd za niegodnego , jeżeli : 1 ) dopuścił się umyślnie ciężkiego przestępstwa przeciwko spadkodawcy ; 2 ) podstępem lub groźbą nakłonił spadkodawcę do sporządzenia lub odwołania testamentu albo w taki sam sposób przeszkodził mu w dokonaniu jednej z tych czynności ; 3 ) umyślnie ukrył lub zniszczył testament spadkodawcy , podrobił lub przerobił jego testament albo świadomie skorzystał z testamentu przez inną osobę podrobionego lub przerobionego .",Kodeks cywilny
1,2007_rad_10,Ważność wyboru Prezydenta Rzeczypospolitej stwierdza:,"{'answer_start': [117], 'text': ['Sąd Najwyższy']}",Rozdział V PREZYDENT RZECZYPOSPOLITEJ POLSKIEJ Art . 129 § 1 . Ważność wyboru Prezydenta Rzeczypospolitej stwierdza Sąd Najwyższy .,"Konstytucja Rzeczypospolitej Polskiej z dnia 2 kwietnia 1997 r. uchwalona przez Zgromadzenie Narodowe w dniu 2 kwietnia 1997 r., przyjęta przez Naród w referendum konstytucyjnym w dniu 25 maja 1997 r., podpisana przez Prezydenta Rzeczypospolitej Polskiej w dniu 16 lipca 1997 r"
2,2014_kom_12,"Zgodnie z Kodeksem cywilnym, umowa o przedłużenie wieczystego użytkowania powinna być zawarta w formie:","{'answer_start': [166], 'text': ['aktu notarialnego']}",Księga DRUGA WŁASNOŚĆ I INNE PRAWA RZECZOWE Tytuł II Użytkowanie wieczyste Art . 236 § 3 . Umowa o przedłużenie wieczystego użytkowania powinna być zawarta w formie aktu notarialnego .,Kodeks cywilny
3,2018_adw_rad_70,"Zgodnie z Kodeksem postępowania cywilnego, w postępowaniu procesowym, niebędącym postępowaniem uproszczonym, powód będący usługodawcą jest obowiązany wnieść pozew na urzędowym formularzu, jeżeli dochodzi roszczeń wynikających z umów:","{'answer_start': [264], 'text': ['przewóz osób i bagażu w komunikacji masowej']}","Część PIERWSZA POSTĘPOWANIE ROZPOZNAWCZE Księga PIERWSZA PROCES Tytuł VI Postępowanie Dział II Postępowanie przed sądami pierwszej instancji Rozdział 2 Pozew Art . 187_1 Jeżeli powód będący usługodawcą lub sprzedawcą dochodzi roszczeń wynikających z umów o : 2 ) przewóz osób i bagażu w komunikacji masowej ,",Kodeks postępowania cywilnego
4,2009_adw_rad_100,"Zgodnie z Kodeksem spółek handlowych, jeżeli umowa spółki partnerskiej nie stanowi inaczej, spadkobierca partnera:","{'answer_start': [114], 'text': ['nie wstępuje do spółki w miejsce zmarłego partnera']}","Tytuł II Spółki osobowe Dział II Spółka partnerska Rozdział 3 Rozwiązanie spółki Art . 101 Spadkobierca partnera nie wstępuje do spółki w miejsce zmarłego partnera , chyba że umowa spółki stanowi inaczej , z uwzględnieniem art . 87 .",Kodeks spółek handlowych
5,2018_kom_95,"Zgodnie z Kodeksem rodzinnym i opiekuńczym, do majątku osobistego każdego z małżonków pozostających w ustawowej wspólności majątkowej należą:","{'answer_start': [160], 'text': ['przedmioty majątkowe służące wyłącznie do zaspokajania osobistych potrzeb jednego z małżonków']}",Tytuł I Małżeństwo Dział III Małżeńskie ustroje majątkowe Rozdział I Ustawowy ustrój majątkowy Art . 33 Do majątku osobistego każdego z małżonków należą : 4 ) przedmioty majątkowe służące wyłącznie do zaspokajania osobistych potrzeb jednego z małżonków ;,Kodeks rodzinny i opiekuńczy
6,2017_not_147,"Zgodnie z ustawą – Prawo o notariacie, postępowanie dyscyplinarne w pierwszej instancji powinno być zakończone w terminie:","{'answer_start': [166], 'text': ['1 miesiąca od daty wpływu wniosku']}",Dział I Ustrój notariatu Rozdział 6 Odpowiedzialność dyscyplinarna Art . 59 § 2 . Postępowanie dyscyplinarne w pierwszej instancji powinno być zakończone w terminie 1 miesiąca od daty wpływu wniosku .,Prawo o notariacie
7,2009_adw_rad_73,"Zgodnie z Kodeksem postępowania cywilnego, zarządzenie przewodniczącego o zwrocie pozwu:","{'answer_start': [226], 'text': ['doręcza się tylko powodowi']}",Część PIERWSZA POSTĘPOWANIE ROZPOZNAWCZE Księga PIERWSZA PROCES Tytuł VI Postępowanie Dział I Przepisy ogólne o czynnościach procesowych Rozdział 1 Pisma procesowe Art . 130 § 4 . Zarządzenie przewodniczącego o zwrocie pozwu doręcza się tylko powodowi .,Kodeks postępowania cywilnego
8,2017_adw_rad_13,"Zgodnie z Kodeksem postępowania karnego, oskarżyciel publiczny może cofnąć akt oskarżenia:","{'answer_start': [90], 'text': ['do czasu rozpoczęcia przewodu sądowego na pierwszej rozprawie głównej . W toku przewodu sądowego przed sądem pierwszej instancji']}",Dział I Przepisy wstępne Art . 14 § 2 . Oskarżyciel publiczny może cofnąć akt oskarżenia do czasu rozpoczęcia przewodu sądowego na pierwszej rozprawie głównej . W toku przewodu sądowego przed sądem pierwszej instancji cofnięcie aktu oskarżenia dopuszczalne jest jedynie za zgodą oskarżonego . Ponowne wniesienie aktu oskarżenia przeciwko tej samej osobie o ten sam czyn jest niedopuszczalne .,Kodeks postępowania karnego
9,2007_rad_181,"Zgodnie z Kodeksem rodzinnym i opiekuńczym, o ustalenie nierównych udziałów w majątku wspólnym małżonków, spadkobiercy małżonka:","{'answer_start': [386], 'text': ['gdy spadkodawca wytoczył powództwo o unieważnienie małżeństwa albo o rozwód lub wystąpił o orzeczenie separacji']}","Tytuł I Małżeństwo Dział III Małżeńskie ustroje majątkowe Rozdział I Ustawowy ustrój majątkowy Art . 43 § 2 . Jednakże z ważnych powodów każdy z małżonków może żądać , ażeby ustalenie udziałów w majątku wspólnym nastąpiło z uwzględnieniem stopnia , w którym każdy z nich przyczynił się do powstania tego majątku . Spadkobiercy małżonka mogą wystąpić z takim żądaniem tylko w wypadku , gdy spadkodawca wytoczył powództwo o unieważnienie małżeństwa albo o rozwód lub wystąpił o orzeczenie separacji .",Kodeks rodzinny i opiekuńczy


### Preprocess train data

In [11]:
max_length = 384  # Maximum length of a feature (question and context together), in tokens
doc_stride = 128  # Number of tokens shared by two consecutive chunks when a long context has to be split
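
To make the role of the two values concrete, here is a simplified sketch of how a long token sequence can be cut into overlapping windows. It is an illustration only, not the tokenizer's implementation: the real tokenizer also reserves room for the question and the special tokens, so fewer than `max_length` positions are left for the context. The function name `split_with_stride` is hypothetical.

```python
def split_with_stride(tokens, window, stride):
    """Cut `tokens` into windows of at most `window` items that overlap by `stride` items."""
    if stride >= window:
        raise ValueError("stride must be smaller than window")
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
        # Advance so that the next window repeats the last `stride` tokens.
        start += window - stride
    return chunks


context_tokens = list(range(10))  # stand-in for ten context tokens
chunks = split_with_stride(context_tokens, window=4, stride=2)
for chunk in chunks:
    print(chunk)
```

Each window repeats the last two tokens of the previous one, which is how an answer that straddles a cut still appears whole in at least one feature.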

In [26]:
def preprocess(tokenizer):
  # For models that pad on the left, the context comes first and the question second.
  pad_on_right = tokenizer.padding_side == "right"

  def prepare_train_features(examples):
    # Some of the questions have a lot of whitespace on the left, which is not useful and would make the
    # truncation of the context fail (the tokenized question would take a lot of space). So we remove that
    # left whitespace.
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. As a result,
    # one example may give several features when its context is long, each of those features having a
    # context that slightly overlaps the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # The offset mappings will give us a map from token to character position in the original context. This will
    # help us compute the start_positions and end_positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Let's label those examples!
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        # We will label impossible answers with the index of the CLS token.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        # If no answers are given, set the cls_index as answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start/end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != (1 if pad_on_right else 0):
                token_start_index += 1

            # End token index of the current span in the text.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != (1 if pad_on_right else 0):
                token_end_index -= 1

            # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise move the token_start_index and token_end_index to the two ends of the answer.
                # Note: we could go after the last offset if the answer is the last word (edge case).
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples
  
  return datasets.map(prepare_train_features, batched=True, remove_columns=datasets["train"].column_names)
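
The labeling step above boils down to translating a character span into a pair of token indices with the help of the offset mapping. The following stand-alone sketch shows that idea on whitespace tokens; the helper names are hypothetical, and a real subword tokenizer would produce different offsets than this toy splitter.

```python
import re


def whitespace_offsets(text):
    """Return (start, end) character offsets of whitespace-separated tokens."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span to (first_token, last_token), or None if it is not covered."""
    if not offsets or start_char < offsets[0][0] or end_char > offsets[-1][1]:
        # The answer lies (partly) outside this window; the notebook labels
        # such features with the CLS index instead.
        return None
    # Last token that starts at or before the answer start.
    first = max(i for i, (s, _) in enumerate(offsets) if s <= start_char)
    # First token that ends at or after the answer end.
    last = min(i for i, (_, e) in enumerate(offsets) if e >= end_char)
    return first, last


context = "Umowa powinna być zawarta w formie aktu notarialnego ."
answer = "aktu notarialnego"
start_char = context.index(answer)
end_char = start_char + len(answer)

offsets = whitespace_offsets(context)
token_span = char_span_to_token_span(offsets, start_char, end_char)
truncated_span = char_span_to_token_span(offsets[:5], start_char, end_char)
print(token_span, truncated_span)
```

The second call simulates a window that ends before the answer, which is the case the preprocessing handles by pointing both labels at the CLS token.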

### Finetuning the model

In [21]:
batch_size = 16

In [22]:
def create_trainer(qa):
  tokenized_datasets = preprocess(qa.tokenizer)

  args = TrainingArguments(
    f"{qa.name}-finetuned-squad",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
  )

  trainer = Trainer(
    qa.model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["dev"],
    data_collator=default_data_collator,
    tokenizer=qa.tokenizer,
  )
  trainer.train()

  return trainer
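
The training logs in the Run section report 3723 training features (four more than the 3719 questions, because a few long contexts are split) and 699 optimization steps. That step count follows from the batch size and the number of epochs, as the small calculation below shows. It assumes a single device and no gradient accumulation, which matches the logged settings; the function name is made up for this sketch.

```python
import math


def total_optimization_steps(num_features, batch_size, num_epochs):
    """Steps for one device without gradient accumulation (the last batch may be partial)."""
    steps_per_epoch = math.ceil(num_features / batch_size)
    return steps_per_epoch * num_epochs


steps_per_epoch = math.ceil(3723 / 16)
steps = total_optimization_steps(num_features=3723, batch_size=16, num_epochs=3)
print(steps_per_epoch, steps)
```

With 233 steps per epoch, step 500 falls inside the third epoch. Since `TrainingArguments` logs the training loss every 500 steps by default, this is the likely reason the first two epochs show "No log" in the training-loss column.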

### Evaluation

In [79]:
def evaluate(tokenizer, trainer, test):

  pad_on_right = tokenizer.padding_side == "right"

  def prepare_validation_features(examples):
      # Some of the questions have a lot of whitespace on the left, which is not useful and would make the
      # truncation of the context fail (the tokenized question would take a lot of space). So we remove that
      # left whitespace.
      examples["question"] = [q.lstrip() for q in examples["question"]]

      # Tokenize our examples with truncation and padding, but keep the overflows using a stride. As a result,
      # one example may give several features when its context is long, each of those features having a
      # context that slightly overlaps the context of the previous feature.
      tokenized_examples = tokenizer(
          examples["question" if pad_on_right else "context"],
          examples["context" if pad_on_right else "question"],
          truncation="only_second" if pad_on_right else "only_first",
          max_length=max_length,
          stride=doc_stride,
          return_overflowing_tokens=True,
          return_offsets_mapping=True,
          padding="max_length",
      )

      # Since one example might give us several features if it has a long context, we need a map from a feature to
      # its corresponding example. This key gives us just that.
      sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

      # We keep the example_id that gave us this feature and we will store the offset mappings.
      tokenized_examples["example_id"] = []

      for i in range(len(tokenized_examples["input_ids"])):
          # Grab the sequence corresponding to that example (to know what is the context and what is the question).
          sequence_ids = tokenized_examples.sequence_ids(i)
          context_index = 1 if pad_on_right else 0

          # One example can give several spans, this is the index of the example containing this span of text.
          sample_index = sample_mapping[i]
          tokenized_examples["example_id"].append(examples["id"][sample_index])

          # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token
          # position is part of the context or not.
          tokenized_examples["offset_mapping"][i] = [
              (o if sequence_ids[k] == context_index else None)
              for k, o in enumerate(tokenized_examples["offset_mapping"][i])
          ]

      return tokenized_examples

  validation_features = test.map(
      prepare_validation_features,
      batched=True,
      remove_columns=test.column_names
  )

  raw_predictions = trainer.predict(validation_features)

  validation_features.set_format(type=validation_features.format["type"], columns=list(validation_features.features.keys()))

  def postprocess_qa_predictions(examples, features, raw_predictions, n_best_size = 20, max_answer_length = 30):
      all_start_logits, all_end_logits = raw_predictions
      # Build a map example to its corresponding features.
      example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
      features_per_example = collections.defaultdict(list)
      for i, feature in enumerate(features):
          features_per_example[example_id_to_index[feature["example_id"]]].append(i)

      # The dictionaries we have to fill.
      predictions = collections.OrderedDict()

      # Logging.
      print(f"Post-processing {len(examples)} example predictions split into {len(features)} features.")

      # Let's loop over all the examples!
      for example_index, example in enumerate(tqdm(examples)):
          # Those are the indices of the features associated to the current example.
          feature_indices = features_per_example[example_index]

          min_null_score = None # Only used if squad_v2 is True.
          valid_answers = []
          
          context = example["context"]
          # Looping through all the features associated to the current example.
          for feature_index in feature_indices:
              # We grab the predictions of the model for this feature.
              start_logits = all_start_logits[feature_index]
              end_logits = all_end_logits[feature_index]
              # This lets us map positions in the logits back to spans of text in the original
              # context.
              offset_mapping = features[feature_index]["offset_mapping"]

              # Update minimum null prediction.
              cls_index = features[feature_index]["input_ids"].index(tokenizer.cls_token_id)
              feature_null_score = start_logits[cls_index] + end_logits[cls_index]
              if min_null_score is None or min_null_score < feature_null_score:
                  min_null_score = feature_null_score

              # Go through all combinations of the `n_best_size` greatest start and end logits.
              start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
              end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
              for start_index in start_indexes:
                  for end_index in end_indexes:
                      # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
                      # to part of the input_ids that are not in the context.
                      if (
                          start_index >= len(offset_mapping)
                          or end_index >= len(offset_mapping)
                          or offset_mapping[start_index] is None
                          or offset_mapping[end_index] is None
                      ):
                          continue
                      # Don't consider answers with a length that is either < 0 or > max_answer_length.
                      if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                          continue

                      start_char = offset_mapping[start_index][0]
                      end_char = offset_mapping[end_index][1]
                      valid_answers.append(
                          {
                              "score": start_logits[start_index] + end_logits[end_index],
                              "text": context[start_char: end_char]
                          }
                      )
          
          if len(valid_answers) > 0:
              best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
          else:
              # In the very rare edge case we have not a single non-null prediction, we create a fake prediction to avoid
              # failure.
              best_answer = {"text": "", "score": 0.0}
          
          predictions[example["id"]] = best_answer["text"]
        

      return predictions

  final_predictions = postprocess_qa_predictions(test, validation_features, raw_predictions.predictions)

  formatted_predictions = [{"id": k, "prediction_text": v} for k, v in final_predictions.items()]
  references = [{"id": ex["id"], "answers": ex["answers"]} for ex in test]
  return load_metric("squad").compute(predictions=formatted_predictions, references=references)
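
The core of the post-processing is the search for the best-scoring pair of start and end positions. The sketch below isolates that search on plain Python lists. It is a simplification under stated assumptions: it skips the check that both positions lie inside the context, and the function name `best_span` is hypothetical.

```python
def best_span(start_logits, end_logits, n_best_size=20, max_answer_length=30):
    """Return (start, end, score) of the best-scoring valid span, or None if there is none."""

    def top_indices(logits):
        # Indices of the `n_best_size` largest logits, best first.
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        return order[:n_best_size]

    best = None
    for start in top_indices(start_logits):
        for end in top_indices(end_logits):
            # Skip spans that end before they start or that are too long.
            if end < start or end - start + 1 > max_answer_length:
                continue
            score = start_logits[start] + end_logits[end]
            if best is None or score > best[2]:
                best = (start, end, score)
    return best


start_logits = [0.1, 2.0, 0.3, 0.2]
end_logits = [0.0, 0.5, 3.0, 0.1]

unrestricted = best_span(start_logits, end_logits)
single_token = best_span(start_logits, end_logits, max_answer_length=1)
print(unrestricted[:2], single_token[:2])
```

Restricting the answer to one token changes the winner from the span covering positions 1 to 2 to the single position 2, which shows how `max_answer_length` can override the individually strongest start logit.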

### Run

In [16]:
class QA:
  def __init__(self, model, tokenizer, name):
    self.model = model
    self.tokenizer = tokenizer
    self.name = name

In [17]:
qas = [
      QA(herbert_model, herbert_tokenizer, "herbert"),
      QA(bert_model, bert_tokenizer, "bert"),
]

In [27]:
metrics = {}

for qa in tqdm(qas):
  print("---------------------")
  print(f"Run for {qa.name}")
  trainer = create_trainer(qa)
  # trainer.train()
  evaluation_result = evaluate(qa.tokenizer, trainer, datasets['test'])
  print(evaluation_result)
  metrics[qa.name] = evaluation_result


---------------------
Run for herbert



PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
***** Running training *****
  Num examples = 3723
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 699


Epoch,Training Loss,Validation Loss
1,No log,1.216785
2,No log,1.189502
3,1.370400,1.172559


***** Running Evaluation *****
  Num examples = 445
  Batch size = 16
***** Running Evaluation *****
  Num examples = 445
  Batch size = 16
Saving model checkpoint to herbert-finetuned-squad/checkpoint-500
Configuration saved in herbert-finetuned-squad/checkpoint-500/config.json
Model weights saved in herbert-finetuned-squad/checkpoint-500/pytorch_model.bin
tokenizer config file saved in herbert-finetuned-squad/checkpoint-500/tokenizer_config.json
Special tokens file saved in herbert-finetuned-squad/checkpoint-500/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 445
  Batch size = 16


Training completed. Do not forget to share your model on huggingface.co/models =)





The following columns in the test set  don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id.
***** Running Prediction *****
  Num examples = 467
  Batch size = 16


Post-processing 463 example predictions split into 467 features.



{'exact_match': 66.30669546436285, 'f1': 79.88661148820276}
---------------------
Run for bert



PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
***** Running training *****
  Num examples = 3723
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 699


Epoch,Training Loss,Validation Loss
1,No log,1.312793
2,No log,1.298335
3,1.217100,1.392769


***** Running Evaluation *****
  Num examples = 445
  Batch size = 16
***** Running Evaluation *****
  Num examples = 445
  Batch size = 16
Saving model checkpoint to bert-finetuned-squad/checkpoint-500
Configuration saved in bert-finetuned-squad/checkpoint-500/config.json
Model weights saved in bert-finetuned-squad/checkpoint-500/pytorch_model.bin
tokenizer config file saved in bert-finetuned-squad/checkpoint-500/tokenizer_config.json
Special tokens file saved in bert-finetuned-squad/checkpoint-500/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 445
  Batch size = 16


Training completed. Do not forget to share your model on huggingface.co/models =)





The following columns in the test set  don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id.
***** Running Prediction *****
  Num examples = 467
  Batch size = 16


Post-processing 463 example predictions split into 467 features.



{'exact_match': 66.52267818574514, 'f1': 79.89378661049585}
---------------------
Run for large



PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
***** Running training *****
  Num examples = 3723
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 699


RuntimeError: ignored

## Report the obtained performance of the models (in the form of a table). 

The report should include the EM (exact match) and F1 scores. The development dataset should be used to pick the best model (for each compared pre-trained model). The results should include performance on both the development and the test datasets.
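
For reference, the two scores can be sketched in a few lines of plain Python. This is a simplified version written for illustration: the `squad` metric used in this notebook additionally strips punctuation and English articles before comparing, takes the best score over all reference answers, and reports values on a 0 to 100 scale. The function names below are made up for this sketch.

```python
from collections import Counter


def normalize(text):
    """Lowercase and split on whitespace (a simplified normalization)."""
    return text.lower().split()


def exact_match(prediction, reference):
    """1.0 if the normalized texts are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction, reference):
    """Token-level F1 between a predicted and a reference answer."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    # Multiset intersection counts each shared token as often as it occurs in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("Sąd Najwyższy", "sąd najwyższy")
partial_f1 = f1_score("Sąd Najwyższy", "Sąd")
print(em, round(partial_f1, 3))
```

A prediction that contains the reference plus one extra token scores zero on EM but still earns partial credit on F1, which is why F1 is usually the higher of the two numbers.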

In [36]:
data = [["model","exact_match","f1"]]

for name, results in metrics.items():
  data.append([name, results["exact_match"], results["f1"]])

print(tabletext.to_text(data))

┌─────────┬───────────────────┬───────────────────┐
│ model   │ exact_match       │ f1                │
├─────────┼───────────────────┼───────────────────┤
│ herbert │ 66.30669546436285 │ 79.88661148820276 │
├─────────┼───────────────────┼───────────────────┤
│ bert    │ 66.52267818574514 │ 79.89378661049585 │
└─────────┴───────────────────┴───────────────────┘


## Process questions from lab9
Generate and report the answers of the best model to the questions you prepared for exercise 9. As the context, take the regulation the questions were created for.

In [96]:
class Question:
  def __init__(self, art, question, answer):
    self.art = art 
    self.question = question
    self.answer = answer

In [97]:
questions = []

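# Assumed row layout: article;question;article text. The header row is skipped.
seen_questions = set()
with open("drive/MyDrive/questions.csv", "r") as file:
  for line in file.readlines()[1:]:
    # Strip the trailing newline so it does not end up in the last field.
    splitted = line.rstrip("\n").split(";")
    # Keep only well-formed rows and drop repeated questions.
    if len(splitted) > 2 and splitted[1] not in seen_questions:
      seen_questions.add(splitted[1])
      questions.append(Question(splitted[0], splitted[1], splitted[2]))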
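
Splitting each line on `;` by hand works for simple files but does not handle quoted fields. As an alternative, a minimal sketch with the standard `csv` module is shown below. The function name `read_questions` and the sample column names are hypothetical, and the three-column layout (article, question, context) is an assumption based on how the fields are used in this notebook.

```python
import csv
import io


def read_questions(handle, delimiter=";"):
    """Return unique (article, question, context) rows, skipping the header."""
    reader = csv.reader(handle, delimiter=delimiter)
    next(reader, None)  # skip the header row, if there is one
    seen = set()
    rows = []
    for row in reader:
        if len(row) < 3:
            continue  # malformed or empty line
        article, question, context = (field.strip() for field in row[:3])
        if question in seen:
            continue  # the same question was already read
        seen.add(question)
        rows.append((article, question, context))
    return rows


# Hypothetical sample standing in for the real questions file.
sample = io.StringIO(
    "art;question;context\n"
    "Art. 10;Kto określi wysokość składki?;Art. 10. Rada Ministrów określi ...\n"
    "Art. 10;Kto określi wysokość składki?;Art. 10. Rada Ministrów określi ...\n"
    "broken line\n"
)
print(read_questions(sample))
```

Tracking seen questions in a set keeps the duplicate check constant-time per row instead of rescanning the whole list.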

In [99]:
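# The third CSV column appears to hold the full article text, so it is used as the context;
# 'answers' is only a placeholder that keeps the SQuAD-style schema complete.
my_questions = Dataset(Table.from_pydict(
    {
        # Row index as id: predictions are keyed by id, and several questions may refer to the same article.
        'id': [str(i) for i in range(len(questions))],
        'question': [q.question for q in questions],
        'answers': [{'answer_start': [0], 'text': [q.answer]} for q in questions],
        'context': [q.answer for q in questions],
        'title': [q.art for q in questions],
    }
))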

In [100]:
qa = qas[0]
tokenized_datasets = preprocess(qa.tokenizer)


args = TrainingArguments(
    f"{qa.name}-finetuned-squad1",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    qa.model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["dev"],
    data_collator=default_data_collator,
    tokenizer=qa.tokenizer,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
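
The prediction cell below tokenizes with `return_overflowing_tokens=True` and a `stride`, so a long context is cut into several overlapping features. This is why the prediction logs report more features than examples. The idea can be sketched in plain Python as follows. It is a simplified model under stated assumptions: `split_with_stride` and `window_size` are hypothetical names, and the real tokenizer also reserves room inside `max_length` for the question and the special tokens.

```python
def split_with_stride(context_tokens, window_size, stride):
    """Split tokens into windows of at most `window_size` tokens.

    Consecutive windows share `stride` tokens, so an answer that falls on a
    window boundary is still fully contained in at least one window
    (as long as it is not longer than the overlap).
    """
    if not 0 <= stride < window_size:
        raise ValueError("stride must satisfy 0 <= stride < window_size")
    windows = []
    start = 0
    while True:
        windows.append(context_tokens[start:start + window_size])
        if start + window_size >= len(context_tokens):
            break  # the current window already reaches the end of the context
        start += window_size - stride
    return windows


tokens = list(range(10))
for window in split_with_stride(tokens, window_size=4, stride=2):
    print(window)
```

With ten tokens, a window of four and a stride of two, the sketch yields four windows, each repeating the last two tokens of the previous one. A context that fits into a single window yields exactly one feature.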


In [104]:
  test = my_questions
  tokenizer = qa.tokenizer
  
  pad_on_right = tokenizer.padding_side == "right"

  def prepare_validation_features(examples):
      # Some of the questions have a lot of whitespace on the left, which is not useful and will make the
      # truncation of the context fail (the tokenized question will take a lot of space). So we remove that
      # left whitespace.
      examples["question"] = [q.lstrip() for q in examples["question"]]

      # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results
      # in one example possibly giving several features when a context is long, each of those features having a
      # context that slightly overlaps the context of the previous feature.
      tokenized_examples = tokenizer(
          examples["question" if pad_on_right else "context"],
          examples["context" if pad_on_right else "question"],
          truncation="only_second" if pad_on_right else "only_first",
          max_length=max_length,
          stride=doc_stride,
          return_overflowing_tokens=True,
          return_offsets_mapping=True,
          padding="max_length",
      )

      # Since one example might give us several features if it has a long context, we need a map from a feature to
      # its corresponding example. This key gives us just that.
      sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

      # We keep the example_id that gave us this feature and we will store the offset mappings.
      tokenized_examples["example_id"] = []

      for i in range(len(tokenized_examples["input_ids"])):
          # Grab the sequence corresponding to that example (to know what is the context and what is the question).
          sequence_ids = tokenized_examples.sequence_ids(i)
          context_index = 1 if pad_on_right else 0

          # One example can give several spans, this is the index of the example containing this span of text.
          sample_index = sample_mapping[i]
          tokenized_examples["example_id"].append(examples["id"][sample_index])

          # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token
          # position is part of the context or not.
          tokenized_examples["offset_mapping"][i] = [
              (o if sequence_ids[k] == context_index else None)
              for k, o in enumerate(tokenized_examples["offset_mapping"][i])
          ]

      return tokenized_examples

  validation_features = test.map(
      prepare_validation_features,
      batched=True,
      remove_columns=test.column_names
  )

  raw_predictions = trainer.predict(validation_features)

  # The Trainer hides the columns the model does not use (offset_mapping, example_id),
  # so expose every column again for the post-processing step.
  validation_features.set_format(type=validation_features.format["type"], columns=list(validation_features.features.keys()))

  def postprocess_qa_predictions(examples, features, raw_predictions, n_best_size = 20, max_answer_length = 30):
      all_start_logits, all_end_logits = raw_predictions
      # Build a map example to its corresponding features.
      example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
      features_per_example = collections.defaultdict(list)
      for i, feature in enumerate(features):
          features_per_example[example_id_to_index[feature["example_id"]]].append(i)

      # The dictionaries we have to fill.
      predictions = collections.OrderedDict()

      # Logging.
      print(f"Post-processing {len(examples)} example predictions split into {len(features)} features.")

      # Let's loop over all the examples!
      for example_index, example in enumerate(tqdm(examples)):
          # Those are the indices of the features associated to the current example.
          feature_indices = features_per_example[example_index]

          min_null_score = None # Only used if squad_v2 is True.
          valid_answers = []
          
          context = example["context"]
          # Looping through all the features associated to the current example.
          for feature_index in feature_indices:
              # We grab the predictions of the model for this feature.
              start_logits = all_start_logits[feature_index]
              end_logits = all_end_logits[feature_index]
              # This is what will allow us to map some of the positions in our logits to spans of text in the
              # original context.
              offset_mapping = features[feature_index]["offset_mapping"]

              # Update minimum null prediction.
              cls_index = features[feature_index]["input_ids"].index(tokenizer.cls_token_id)
              feature_null_score = start_logits[cls_index] + end_logits[cls_index]
              if min_null_score is None or min_null_score < feature_null_score:
                  min_null_score = feature_null_score

              # Go through all combinations of the `n_best_size` highest start and end logits.
              start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
              end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
              for start_index in start_indexes:
                  for end_index in end_indexes:
                      # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
                      # to part of the input_ids that are not in the context.
                      if (
                          start_index >= len(offset_mapping)
                          or end_index >= len(offset_mapping)
                          or offset_mapping[start_index] is None
                          or offset_mapping[end_index] is None
                      ):
                          continue
                      # Don't consider answers with a length that is either < 0 or > max_answer_length.
                      if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                          continue

                      start_char = offset_mapping[start_index][0]
                      end_char = offset_mapping[end_index][1]
                      valid_answers.append(
                          {
                              "score": start_logits[start_index] + end_logits[end_index],
                              "text": context[start_char: end_char]
                          }
                      )
          
          if len(valid_answers) > 0:
              best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
          else:
              # In the very rare edge case we have not a single non-null prediction, we create a fake prediction to avoid
              # failure.
              best_answer = {"text": "", "score": 0.0}
          
          predictions[example["id"]] = best_answer["text"]
        

      return predictions

  final_predictions = postprocess_qa_predictions(test, validation_features, raw_predictions.predictions)

  formatted_predictions = [{"id": k, "prediction_text": v} for k, v in final_predictions.items()]
  references = [{"id": ex["id"], "answers": ex["answers"]} for ex in test]

  # Manual evaluation: each prediction is shown with its question and context,
  # and typing 'y' counts it as correct.
  good_answers = 0
  all_answers = len(test)
  print()
  print("----------")
  for i, pred in enumerate(formatted_predictions):
    print(test[i]['question'])
    print(pred['prediction_text'])
    print(test[i]['context'])
    if input() == 'y':
      good_answers += 1
    print()
    print("----------")
  print("Accuracy:", good_answers / all_answers)

The following columns in the test set  don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id.
***** Running Prediction *****
  Num examples = 27
  Batch size = 16


Post-processing 26 example predictions split into 27 features.



----------
Kiedy zostaną powołani sędziowie łowieccy oraz rzecznicy dyscyplinarni i ich zastępcy pierwszej kadencji?
90 dni od dnia wejścia w życie niniejszej ustawy
Art. 5. 1. Sędziowie łowieccy oraz rzecznicy dyscyplinarni i ich zastępcy pierwszej kadencji zostaną powołani w terminie 90 dni od dnia wejścia w życie niniejszej ustawy. 2. Sprawy dyscyplinarne w toku zostaną przekazane rzecznikom dyscyplinarnym powołanym zgodnie z art. 35o ust. 1 pkt 2 i ust. 2 pkt 2 ustawy zmienianej w art. 1 w brzmieniu nadanym niniejsza ustawą w terminie 30 dni od dnia ich powołania.

y

----------
Kto określi wysokość składki na ubezpiecznie społeczne?
Rada Ministrów
Art. 10. Rada Ministrów określi w drodze rozporządzenia wysokość składki na ubezpieczenie społeczne oraz sposób jej naliczania dla osób pozbawionych wolności oraz dla nieletnich zatrudnionych w gospodarstwach pomocniczych zakładów poprawczych i schronisk dla nieletnich.

y

----------
Od czego rozpoczyna się przewód sądowy?
odczytania p
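
The core of the post-processing step above is choosing, among the highest-scoring start and end positions, the valid pair with the largest combined score and mapping it back to characters of the context. A minimal single-feature sketch in plain Python is given below. The name `best_span` and the toy logits are hypothetical, and the word-level offsets are a stand-in for the tokenizer's `offset_mapping`.

```python
def best_span(start_logits, end_logits, offsets, context,
              n_best_size=20, max_answer_length=30):
    """Return (text, score) for the best valid answer span of one feature."""
    def top_indices(logits):
        # Indices of the `n_best_size` largest logits, best first.
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        return order[:n_best_size]

    best = None
    for start in top_indices(start_logits):
        for end in top_indices(end_logits):
            # Skip positions outside the context (offset is None),
            # e.g. question tokens or special tokens.
            if offsets[start] is None or offsets[end] is None:
                continue
            # Skip spans that are reversed or longer than allowed.
            if end < start or end - start + 1 > max_answer_length:
                continue
            score = start_logits[start] + end_logits[end]
            if best is None or score > best[1]:
                start_char, end_char = offsets[start][0], offsets[end][1]
                best = (context[start_char:end_char], score)
    # Fall back to an empty answer when no valid span exists.
    return best if best is not None else ("", 0.0)


context = "Rada Ministrów określi wysokość składki"
# One leading special token (None) followed by one "token" per word.
offsets = [None]
position = 0
for word in context.split():
    offsets.append((position, position + len(word)))
    position += len(word) + 1

start_logits = [0.1, 5.0, 1.0, 0.2, 0.1, 0.0]
end_logits = [0.1, 0.5, 4.0, 0.3, 0.2, 0.1]
print(best_span(start_logits, end_logits, offsets, context))  # ('Rada Ministrów', 9.0)
```

The notebook's version additionally loops over all features of an example and keeps the best span across them, which matters when a long context was split into several windows.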

## Answer the following questions:

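  -  Which pre-trained model performs better on that task? *dkleczek/bert-base-polish-cased-v1, but only marginally: 66.52 vs. 66.31 exact match, with a practically identical F1 (79.89).*
  -  Is the performance of the model compatible with the observations from exercise 6? *Not entirely: in exercise 6 HerBERT was better, whereas here the two models are practically tied.*
  -  What are the outcomes of the model on your own questions? Are they satisfying? If not, what might be the reason for that? *76% accuracy (answers judged manually), which is very satisfying.*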
