# Zero-Shot and Translation Experiments on XQuAD with mBERT

If you're opening this notebook on Colab, you will need to mount Google Drive and change to the project directory.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
%cd /content/drive/MyDrive/master/Applications2/project

/content/drive/.shortcut-targets-by-id/1AtJHZWX_djzvUvOzjyLFydMJ7eINbuMI/project


In [None]:
%cd /content/drive/MyDrive/LAP/Subjects/AP2/project

/content/drive/MyDrive/LAP/Subjects/AP2/project


If you're opening this notebook on Colab, you will also need to install 🤗 Transformers and 🤗 Datasets.

In [None]:
!pip install datasets transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


To share your model with the community on the Hugging Face Hub and query it through the inference API, there are a few more steps to follow.

First, sign up on the Hugging Face website ([here](https://huggingface.co/join)) if you haven't already. Then execute the following cell and enter your credentials so that your authentication token is stored locally:

In [None]:
from huggingface_hub import notebook_login

notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token
Authenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal in case you want to set this credential helper as the default

git config --global credential.helper store


## Create XQuAD-XTREME Dataset

XQuAD (Cross-lingual Question Answering Dataset) is a benchmark dataset for evaluating cross-lingual question answering
performance (Artetxe et al., 2020). The dataset consists of a subset of 240 paragraphs and 1190 question-answer pairs from the development set
of SQuAD v1.1 (Rajpurkar et al., 2016) together with their professional translations into eleven languages: Spanish, German,
Greek, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi and Romanian. Consequently, the dataset is entirely parallel
across 12 languages. https://arxiv.org/pdf/1910.11856.pdf

We also include the "translate-train", "translate-dev", and "translate-test" splits from XTREME (Hu et al., 2020)
for each non-English language that XTREME covers, that is, every language except Romanian, which only has a test split here. These can be used to run XQuAD in the "translate-train" or "translate-test" settings. https://proceedings.mlr.press/v119/hu20b/hu20b.pdf



Make sure you are in the virtual environment where you installed Datasets, and run the following command:

In [None]:
# !huggingface-cli login

Login using your Hugging Face Hub credentials, and create a new dataset repository:

In [None]:
# !huggingface-cli repo create xquad_xtreme --type dataset

git version 2.17.1
Error: unknown flag: --version

Sorry, no usage text found for "git-lfs"

You are about to create datasets/juletxara/xquad_xtreme
Proceed? [Y/n] y

Your repo now lives at:
  https://huggingface.co/datasets/juletxara/xquad_xtreme

You can clone it locally with the command below, and commit/push as usual.

  git clone https://huggingface.co/datasets/juletxara/xquad_xtreme



Install Git LFS and clone your repository:

In [None]:
# !git lfs install
# !git clone https://huggingface.co/datasets/juletxara/xquad_xtreme

Updated git hooks.
Git LFS initialized.
Cloning into 'xquad_xtreme'...
remote: Enumerating objects: 3, done.
remote: Counting objects: 100% (3/3), done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 3 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (3/3), done.


We have to create these files to upload the dataset:

* `README.md` is a dataset card that describes the dataset's contents, creation, and usage.

* `xquad_xtreme.py` is your dataset loading script.

* `dataset_infos.json` contains metadata about the dataset.
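
As a rough illustration of what the loading script has to do, the sketch below flattens a SQuAD-format JSON object into the `id`, `context`, `question`, and `answers` fields that the dataset exposes. The function name `iter_squad_examples` and the toy input are hypothetical; the actual `xquad_xtreme.py` is not reproduced here and may be organised differently.

```python
def iter_squad_examples(squad_json):
    """Yield (key, example) pairs from a SQuAD-format dict (hypothetical helper)."""
    for article in squad_json["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                yield qa["id"], {
                    "id": qa["id"],
                    "context": context,
                    "question": qa["question"],
                    # Store answers column-wise, as in the examples printed by the notebook.
                    "answers": {
                        "answer_start": [a["answer_start"] for a in qa["answers"]],
                        "text": [a["text"] for a in qa["answers"]],
                    },
                }


# Toy input, made up for illustration only.
toy = {
    "data": [
        {
            "title": "Toy",
            "paragraphs": [
                {
                    "context": "Paris is the capital of France.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "What is the capital of France?",
                            "answers": [{"text": "Paris", "answer_start": 0}],
                        }
                    ],
                }
            ],
        }
    ]
}

for key, example in iter_squad_examples(toy):
    print(key, example["answers"])
```

In a real loading script this logic would typically live in the builder's `_generate_examples` method, which is expected to yield `(key, example)` pairs of this shape.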

Run the following command to create the metadata file, `dataset_infos.json`. This will also test your new dataset loading script and make sure it works correctly.

In [None]:
# !datasets-cli test xquad_xtreme --save_infos --all_configs

Testing builder 'ar' (1/12)
Downloading and preparing dataset xquad/ar to /root/.cache/huggingface/datasets/xquad/ar/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40...

If you want to be able to test your dataset script without downloading the full dataset, you need to create some dummy data for automated testing.

In [None]:
# !datasets-cli dummy_data xquad_xtreme --auto_generate --json_field data

Dummy data generation done and dummy data test succeeded for config 'ar''.

## Load XQuAD-XTREME Dataset

The dataset described above is loaded from the Hub one language configuration at a time. Every configuration has the 1190-example `test` split; all languages except English and Romanian also have the `translate_train`, `translate_dev`, and `translate_test` splits from XTREME.

In [None]:
from datasets import load_dataset

In [None]:
langs = ["ar", "de", "zh", "vi", "en", "es", "hi", "el", "th", "tr", "ru", "ro"]
xquad = {}
for lang in langs:
    xquad[lang] = load_dataset("juletxara/xquad_xtreme", lang)

Downloading and preparing dataset xquad/ar (download: 420.97 MiB, generated: 155.77 MiB, post-processed: Unknown size, total: 576.74 MiB) to /root/.cache/huggingface/datasets/juletxara___xquad/ar/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40...
Generating test split: 1190 examples
Generating translate_train split: 86787 examples
Generating translate_dev split: 34448 examples
Generating translate_test split: 1151 examples
Dataset xquad downloaded and prepared to /root/.cache/huggingface/datasets/juletxara___xquad/ar/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40. Subsequent calls will reuse this data.

Downloading and preparing dataset xquad/de (download: 127.04 MiB, generated: 111.52 MiB, post-processed: Unknown size, total: 238.55 MiB) to /root/.cache/huggingface/datasets/juletxara___xquad/de/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40...
Generating test split: 1190 examples
Generating translate_train split: 82603 examples
Generating translate_dev split: 32950 examples
Generating translate_test split: 1168 examples
Dataset xquad downloaded and prepared to /root/.cache/huggingface/datasets/juletxara___xquad/de/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40. Subsequent calls will reuse this data.

Downloading and preparing dataset xquad/zh (download: 174.57 MiB, generated: 88.76 MiB, post-processed: Unknown size, total: 263.32 MiB) to /root/.cache/huggingface/datasets/juletxara___xquad/zh/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40...
Generating test split: 1190 examples
Generating translate_train split: 85700 examples
Generating translate_dev split: 33985 examples
Generating translate_test split: 1186 examples
Dataset xquad downloaded and prepared to /root/.cache/huggingface/datasets/juletxara___xquad/zh/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40. Subsequent calls will reuse this data.

Downloading and preparing dataset xquad/vi (download: 218.09 MiB, generated: 137.55 MiB, post-processed: Unknown size, total: 355.64 MiB) to /root/.cache/huggingface/datasets/juletxara___xquad/vi/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40...
Generating test split: 1190 examples
Generating translate_train split: 87187 examples
Generating translate_dev split: 34575 examples
Generating translate_test split: 1178 examples
Dataset xquad downloaded and prepared to /root/.cache/huggingface/datasets/juletxara___xquad/vi/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40. Subsequent calls will reuse this data.

Downloading and preparing dataset xquad/en (download: 595.10 KiB, generated: 1.06 MiB, post-processed: Unknown size, total: 1.65 MiB) to /root/.cache/huggingface/datasets/juletxara___xquad/en/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40...
Generating test split: 1190 examples
Dataset xquad downloaded and prepared to /root/.cache/huggingface/datasets/juletxara___xquad/en/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40. Subsequent calls will reuse this data.

Downloading and preparing dataset xquad/es (download: 138.41 MiB, generated: 118.85 MiB, post-processed: Unknown size, total: 257.26 MiB) to /root/.cache/huggingface/datasets/juletxara___xquad/es/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40...
Generating test split: 1190 examples
Generating translate_train split: 87488 examples
Generating translate_dev split: 34697 examples
Generating translate_test split: 1188 examples
Dataset xquad downloaded and prepared to /root/.cache/huggingface/datasets/juletxara___xquad/es/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40. Subsequent calls will reuse this data.

Downloading and preparing dataset xquad/hi (download: 472.23 MiB, generated: 246.51 MiB, post-processed: Unknown size, total: 718.74 MiB) to /root/.cache/huggingface/datasets/juletxara___xquad/hi/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40...
Generating test split: 1190 examples
Generating translate_train split: 85804 examples
Generating translate_dev split: 34111 examples
Generating translate_test split: 1184 examples
Dataset xquad downloaded and prepared to /root/.cache/huggingface/datasets/juletxara___xquad/hi/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40. Subsequent calls will reuse this data.

Downloading and preparing dataset xquad/el (download: 499.40 MiB, generated: 184.85 MiB, post-processed: Unknown size, total: 684.26 MiB) to /root/.cache/huggingface/datasets/juletxara___xquad/el/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40...
Generating test split: 1190 examples
Generating translate_train split: 79946 examples
Generating translate_dev split: 31869 examples
Generating translate_test split: 1182 examples
Dataset xquad downloaded and prepared to /root/.cache/huggingface/datasets/juletxara___xquad/el/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40. Subsequent calls will reuse this data.

Downloading and preparing dataset xquad/th (download: 461.54 MiB, generated: 235.99 MiB, post-processed: Unknown size, total: 697.52 MiB) to /root/.cache/huggingface/datasets/juletxara___xquad/th/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40...
Generating test split: 1190 examples
Generating translate_train split: 85846 examples
Generating translate_dev split: 34079 examples
Generating translate_test split: 1157 examples
Dataset xquad downloaded and prepared to /root/.cache/huggingface/datasets/juletxara___xquad/th/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40. Subsequent calls will reuse this data.

Downloading and preparing dataset xquad/tr (download: 151.08 MiB, generated: 109.61 MiB, post-processed: Unknown size, total: 260.69 MiB) to /root/.cache/huggingface/datasets/juletxara___xquad/tr/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40...
Generating test split: 1190 examples
Generating translate_train split: 86511 examples
Generating translate_dev split: 34308 examples
Generating translate_test split: 1112 examples
Dataset xquad downloaded and prepared to /root/.cache/huggingface/datasets/juletxara___xquad/tr/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40. Subsequent calls will reuse this data.

Downloading and preparing dataset xquad/ru (download: 513.80 MiB, generated: 186.36 MiB, post-processed: Unknown size, total: 700.16 MiB) to /root/.cache/huggingface/datasets/juletxara___xquad/ru/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40...
Generating test split: 1190 examples
Generating translate_train split: 84869 examples
Generating translate_dev split: 33735 examples
Generating translate_test split: 1190 examples
Dataset xquad downloaded and prepared to /root/.cache/huggingface/datasets/juletxara___xquad/ru/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40. Subsequent calls will reuse this data.

Downloading and preparing dataset xquad/ro (download: 645.66 KiB, generated: 1.24 MiB, post-processed: Unknown size, total: 1.87 MiB) to /root/.cache/huggingface/datasets/juletxara___xquad/ro/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40...
Generating test split: 1190 examples
Dataset xquad downloaded and prepared to /root/.cache/huggingface/datasets/juletxara___xquad/ro/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40. Subsequent calls will reuse this data.

In [None]:
xquad

{'ar': DatasetDict({
    test: Dataset({
        features: ['id', 'context', 'question', 'answers'],
        num_rows: 1190
    })
    translate_train: Dataset({
        features: ['id', 'context', 'question', 'answers'],
        num_rows: 86787
    })
    translate_dev: Dataset({
        features: ['id', 'context', 'question', 'answers'],
        num_rows: 34448
    })
    translate_test: Dataset({
        features: ['id', 'context', 'question', 'answers'],
        num_rows: 1151
    })
}),
 'de': DatasetDict({
    test: Dataset({
        features: ['id', 'context', 'question', 'answers'],
        num_rows: 1190
    })
    translate_train: Dataset({
        features: ['id', 'context', 'question', 'answers'],
        num_rows: 82603
    })
    translate_dev: Dataset({
        features: ['id', 'context', 'question', 'answers'],
        num_rows: 32950
    })
    translate_test: Dataset({
        features: ['id', 'context', 'question', 'answers'],
        num_rows: 1168
    })
}),
 'el':

In [None]:
xquad["es"]["test"][0]

{'answers': {'answer_start': [133], 'text': ['308']},
 'context': '\ufeffLos Panthers, que además de liderar las intercepciones de la NFL con 24 y contar con cuatro jugadores de la Pro Bowl, cedieron solo 308 puntos en defensa y se sitúan en el sexto lugar de la liga. Kawann Short, tacle defensivo de la Pro Bowl, lideró al equipo con 11 capturas, 3 balones sueltos forzados y 2 recuperaciones. A su vez, el liniero Mario Addison, consiguió 6 capturas y media. En la línea de los Panthers, también destacó como ala defensiva el veterano Jared Allen ―5 veces jugador de la Pro Bowl y que fue el líder, en activo, de capturas de la NFL con 136― junto con el también ala defensiva Kony Ealy, que lleva 5 capturas en solo 9 partidos como titular. Detrás de ellos, Thomas Davis y Luke Kuechly, dos de los tres apoyadores titulares que también han sido seleccionados para jugar la Pro Bowl. Davis se hizo con 5 capturas y media, 4 balones sueltos forzados y 4 intercepciones, mientras que Kuechly lideró a

In [None]:
xquad["es"]["translate_train"][0]

{'answers': {'answer_start': [161],
  'text': ['Coleman A. Young Municipal Center']},
 'context': 'Los tribunales de Detroit son administrados por el estado y las elecciones no son partidistas. El tribunal testamentario del condado de Wayne está ubicado en el Coleman A. Young Municipal Center en el centro de Detroit. El tribunal de circuito se encuentra al otro lado de la avenida Gratiot. en el Frank Murphy Hall of Justice, en el centro de Detroit. La ciudad alberga el Trigésimo Sexto Tribunal de Distrito, así como el Primer Distrito del Tribunal de Apelaciones de Michigan y el Tribunal de Distrito de los Estados Unidos para el Distrito Este de Michigan. La ciudad proporciona la aplicación de la ley a través del Departamento de Policía de Detroit y servicios de emergencia a través del Departamento de Bomberos de Detroit.',
 'id': '5728d4d3ff5b5019007da7ba',
 'question': '¿Dónde se encuentra el tribunal testamentario del condado de Wayne?'}

In [None]:
xquad["es"]["translate_dev"][0]

{'answers': {'answer_start': [227], 'text': ['una fuerza innata de ímpetu']},
 'context': 'Las deficiencias de la física aristotélica no se corregirían por completo hasta el trabajo del siglo XVII de Galileo Galilei, quien fue influenciado por la idea medieval tardía de que los objetos en movimiento forzado llevaban una fuerza innata de ímpetu. Galileo construyó un experimento en el que las piedras y las balas de cañón fueron rodadas por una pendiente para refutar la teoría aristotélica del movimiento a principios del siglo XVII. Mostró que los cuerpos eran acelerados por la gravedad hasta un punto que era independiente de su masa y argumentó que los objetos retienen su velocidad a menos que actúen por una fuerza, por ejemplo, la fricción.',
 'id': '57373f80c3c5551400e51e91',
 'question': '¿Qué contenían los objetos en movimiento forzado según la idea medieval tardía que influyen en Aristóteles?'}

In [None]:
xquad["es"]["translate_test"][0]

{'answers': {'answer_start': [411],
  'text': ['Cobb, Shepley, Rutan and Coolidge, Holabird & Roche, and other architectural firms']},
 'context': 'The first buildings on the University of Chicago campus, which make up what is now known as the main quadrangle, were part of a "master plan" conceived by two administrators of the University of Chicago and planned by the architect Henry Ives of Chicago. The main quadrangle consists of six quadrangle, each surrounded by buildings, bordering a larger quadrangle. The main quadrangle buildings were designed by Cobb, Shepley, Rutan and Coolidge, Holabird & Roche, and other architectural firms in a mixture of Victorian Gothic and collegiate Gothic styles, used in the faculties of the University Oxford (Mitchell Tower, for example, follows the model of the Magdalena Tower in Oxford, and Commons University, Hutchinson Hall, imitates Christ Church Hall).',
 'id': '57284b904b864d19001648e4',
 'question': 'Who helped design the main quadrangle?'}
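
The four splits shown above map onto the three experimental settings named in the title. The sketch below spells out one plausible pairing of training and evaluation data, following the way XTREME usually describes these settings; the helper `splits_for` and the exact pairing are assumptions made for illustration, not something defined by the dataset itself.

```python
HUB_ID = "juletxara/xquad_xtreme"
SETTINGS = ("zero-shot", "translate-train", "translate-test")
# English and Romanian only have a `test` split in this dataset.
TEST_ONLY = {"en", "ro"}


def splits_for(setting, lang):
    """Return (train, eval) as (dataset, config, split) triples (hypothetical helper)."""
    if setting not in SETTINGS:
        raise ValueError(f"unknown setting: {setting!r}")
    english_train = ("squad", "plain_text", "train")
    if setting == "zero-shot":
        # Fine-tune on English SQuAD, evaluate directly on the target language.
        return english_train, (HUB_ID, lang, "test")
    if lang in TEST_ONLY:
        raise ValueError(f"no translated splits available for {lang!r}")
    if setting == "translate-train":
        # Fine-tune on SQuAD machine-translated into the target language.
        return (HUB_ID, lang, "translate_train"), (HUB_ID, lang, "test")
    # translate-test: fine-tune on English, evaluate on the test set translated into English.
    return english_train, (HUB_ID, lang, "translate_test")


for setting in SETTINGS:
    print(setting, splits_for(setting, "es"))
```

In the translate-train setting, the `translate_dev` split could serve as validation data during fine-tuning.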

## Load SQuAD Dataset

The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage. The version loaded here is SQuAD v1.1, in which every question is answerable; unanswerable questions were only introduced in SQuAD 2.0. https://arxiv.org/pdf/1606.05250.pdf

In [None]:
from datasets import load_dataset, load_metric

In [None]:
squad = load_dataset("squad")

Downloading and preparing dataset squad/plain_text (download: 33.51 MiB, generated: 85.63 MiB, post-processed: Unknown size, total: 119.14 MiB) to /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453...
Generating train split: 87599 examples
Generating validation split: 10570 examples
Dataset squad downloaded and prepared to /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453. Subsequent calls will reuse this data.

In [None]:
squad

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

In [None]:
squad["train"][0]

{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']},
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'id': '5733be284776f41900661182',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'title': 'University_of_Notre_Dame'}

In [None]:
squad["validation"][0]

{'answers': {'answer_start': [177, 177, 177],
  'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos']},
 'context': 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi\'s Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.',
 'id': '56be4db0acb8001400a502ec',
 'question': 'Which NFL team represented the AFC at Super Bo
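
The `answer_start` values are character offsets into `context`, so each answer text can be recovered by slicing. A minimal check, assuming the field layout shown in the examples above; the helper name `answers_are_aligned` is made up for illustration, and the context of the first validation example is shortened to its opening sentences:

```python
def answers_are_aligned(example):
    """True if every answer text is found at its stated character offset."""
    context = example["context"]
    starts = example["answers"]["answer_start"]
    texts = example["answers"]["text"]
    return all(context[s:s + len(t)] == t for s, t in zip(starts, texts))


# Shortened copy of the first SQuAD validation example.
example = {
    "context": (
        "Super Bowl 50 was an American football game to determine the champion of the "
        "National Football League (NFL) for the 2015 season. The American Football "
        "Conference (AFC) champion Denver Broncos defeated the National Football "
        "Conference (NFC) champion Carolina Panthers"
    ),
    "answers": {
        "answer_start": [177, 177, 177],
        "text": ["Denver Broncos", "Denver Broncos", "Denver Broncos"],
    },
}

print(answers_are_aligned(example))
```

Running a check like this over the translated splits is a cheap way to spot answers whose offsets no longer match after machine translation.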

## Preprocessing SQuAD

Load the mBERT tokenizer to process the question and context fields.

In [None]:
from transformers import AutoTokenizer

model_name = "bert-base-multilingual-cased"

tokenizer = AutoTokenizer.from_pretrained(model_name)


A specific challenge when preprocessing for question answering is how to deal with very long documents. In other tasks we usually truncate inputs that are longer than the model's maximum sequence length, but here removing part of the context might remove the answer we are looking for. To deal with this, we allow one (long) example in our dataset to produce several input features, each shorter than the maximum length of the model (or the one we set as a hyper-parameter). Also, in case the answer lies at the point where we split a long context, we allow some overlap between the generated features, controlled by the hyper-parameter `doc_stride`:

In [None]:
max_length = 384 # The maximum length of a feature (question and context)
doc_stride = 128 # The authorized overlap between two parts of the context when it needs to be split.
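
To make the effect of the overlap concrete, here is a small pure-Python sketch of how a long context could be cut into overlapping windows. It is a simplification: the real tokenizer also reserves room for the question and the special tokens, so `window` below stands for the number of context tokens that fit in one feature. The function name `context_windows` is hypothetical.

```python
def context_windows(n_context_tokens, window, overlap):
    """Return (start, end) token spans covering the context with the given overlap."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap  # how far each new window advances
    spans = []
    start = 0
    while True:
        end = min(start + window, n_context_tokens)
        spans.append((start, end))
        if end == n_context_tokens:
            break
        start += step
    return spans


# Assumed numbers: a 20-token question plus 3 special tokens leaves
# 384 - 23 = 361 context tokens per feature; the context has 800 tokens.
print(context_windows(800, window=361, overlap=128))
# [(0, 361), (233, 594), (466, 800)]
```

Consecutive windows share 128 tokens, so an answer that is cut by one window boundary is still fully contained in a neighbouring window as long as it is shorter than the overlap.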

There are a few preprocessing steps particular to question answering that we should be aware of:

1. Some examples in a dataset may have a very long context that exceeds the maximum input length of the model. Truncate only the context by setting `truncation="only_second"`.
2. Next, map the start and end positions of the answer to the original context by setting `return_offset_mapping=True`.
3. With the mapping in hand, you can find the start and end tokens of the answer. Use the `sequence_ids` method to find which part of the offset corresponds to the question and which corresponds to the `context`.
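
The steps above can be sketched on hand-written toy data, without calling the tokenizer. The offsets and sequence ids below are invented for illustration (they mimic what a fast tokenizer returns for a question/context pair), and the helper name `char_span_to_token_span` is hypothetical. It returns `None` when the answer is not fully contained in the context part of the feature.

```python
def char_span_to_token_span(offsets, sequence_ids, start_char, end_char):
    """Map an answer given as a character span in the context to token indices."""
    context_tokens = [i for i, s in enumerate(sequence_ids) if s == 1]
    if not context_tokens:
        return None
    first, last = context_tokens[0], context_tokens[-1]

    # The answer must lie entirely inside the context covered by this feature.
    if offsets[first][0] > start_char or offsets[last][1] < end_char:
        return None

    # Move right until we pass the token that starts the answer.
    start_tok = first
    while start_tok <= last and offsets[start_tok][0] <= start_char:
        start_tok += 1
    # Move left until we pass the token that ends the answer.
    end_tok = last
    while end_tok >= first and offsets[end_tok][1] >= end_char:
        end_tok -= 1
    return start_tok - 1, end_tok + 1


# Toy feature: [CLS] Who won ? [SEP] Denver Broncos won [SEP]
context = "Denver Broncos won"
sequence_ids = [None, 0, 0, 0, None, 1, 1, 1, None]
offsets = [(0, 0), (0, 3), (4, 7), (7, 8), (0, 0), (0, 6), (7, 14), (15, 18), (0, 0)]

answer = "Denver Broncos"
start_char = context.index(answer)
span = char_span_to_token_span(offsets, sequence_ids, start_char, start_char + len(answer))
print(span)
```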

In [None]:
def prepare_train_features(examples):
    # Some of the questions have lots of whitespace on the left, which is not useful and will make the
    # truncation of the context fail (the tokenized question will take a lot of space). So we remove that
    # left whitespace.
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that slightly overlaps the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # The offset mappings will give us a map from token to character position in the original context. This will
    # help us compute the start_positions and end_positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Let's label those examples!
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        # We will label impossible answers with the index of the CLS token.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        # If no answers are given, set the cls_index as answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start/end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != 1:
                token_start_index += 1

            # End token index of the current span in the text.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != 1:
                token_end_index -= 1

            # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise move the token_start_index and token_end_index to the two ends of the answer.
                # Note: we could go after the last offset if the answer is the last word (edge case).
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples

Use 🤗 Datasets map function to apply the preprocessing function over the entire dataset. You can speed up the map function by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don’t need.

In [None]:
squad_train = squad.map(prepare_train_features, batched=True, 
                            remove_columns=squad["train"].column_names)


The evaluation features are similar to the training features. At evaluation time we must check that a predicted span lies inside the context (and not the question) and recover the corresponding text. To do this, we need to add two things to our validation features:
- the ID of the example that generated the feature (since each example can generate several features, as seen before);
- the offset mapping that will give us a map from token indices to character positions in the context.

That's why we will re-process the validation set with the following function, slightly different from `prepare_train_features`:

In [None]:
def prepare_validation_features(examples):
    # Some of the questions have lots of whitespace on the left, which is not useful and will make the
    # truncation of the context fail (the tokenized question will take a lot of space). So we remove that
    # left whitespace.
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that slightly overlaps the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    # We keep the example_id that gave us this feature and we will store the offset mappings.
    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1
        
        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["id"][sample_index])

        # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token
        # position is part of the context or not.
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples
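
The masking step at the end of the function above can be seen in isolation on toy data. The offsets and sequence ids are invented for illustration, and `mask_non_context_offsets` is a hypothetical helper that mirrors the list comprehension in `prepare_validation_features`.

```python
def mask_non_context_offsets(offsets, sequence_ids, context_index=1):
    """Keep offsets only for context tokens; everything else becomes None."""
    return [o if s == context_index else None for o, s in zip(offsets, sequence_ids)]


# Toy feature: [CLS] Who won ? [SEP] Denver Broncos won [SEP]
toy_sequence_ids = [None, 0, 0, 0, None, 1, 1, 1, None]
toy_offsets = [(0, 0), (0, 3), (4, 7), (7, 8), (0, 0), (0, 6), (7, 14), (15, 18), (0, 0)]

masked = mask_non_context_offsets(toy_offsets, toy_sequence_ids)
print(masked)
```

After masking, a simple `is None` check is enough to reject predicted positions that fall on the question or on special tokens.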

Use the map function again to apply the preprocessing function over the validation dataset. You can speed up the map function by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don’t need.

In [None]:
squad_eval = squad["validation"].map(prepare_validation_features, batched=True, 
                                          remove_columns=squad["validation"].column_names)


## Fine-tuning mBERT

Now that our data is ready for training, we can download the pretrained model and fine-tune it. Since our task is question answering, we use the `AutoModelForQuestionAnswering` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us:

In [None]:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

model_name = "bert-base-multilingual-cased"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)


Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertForQuestionAnswering: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-bas

Next we need a data collator to batch our processed examples together. The default one works here:

In [None]:
from transformers import DefaultDataCollator

data_collator = DefaultDataCollator()

To instantiate a `Trainer`, we first need to define the [`TrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments), a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model; all other arguments are optional:

In [None]:
batch_size = 16
args = TrainingArguments(
    "bert-base-multilingual-cased-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True,
)

We will evaluate our model and compute metrics in the next section (this is a very long operation, so we will only compute the evaluation loss during training).

Then we just need to pass all of this along with our datasets to the `Trainer`:

In [None]:
trainer = Trainer(
    model,
    args,
    train_dataset=squad_train["train"],
    eval_dataset=squad_train["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

We can now finetune our model by just calling the `train` method:

In [None]:
# trainer.train()

In [None]:
trainer.push_to_hub()

## Evaluating mBERT

We load a model that is already finetuned on SQuAD to save time. We evaluate on the validation set of SQuAD.

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "salti/bert-base-multilingual-cased-finetuned-squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="bert-base-multilingual-cased-finetuned-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

https://huggingface.co/salti/bert-base-multilingual-cased-finetuned-squad/resolve/main/config.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpx7vo12og


Downloading:   0%|          | 0.00/822 [00:00<?, ?B/s]

storing https://huggingface.co/salti/bert-base-multilingual-cased-finetuned-squad/resolve/main/config.json in cache at /root/.cache/huggingface/transformers/1df6572a9ae2fd1152d4fa4e3b9d30e0d303c69cb87d5b8401ef5cb032016bef.aa91fcc51e661ddbf70fda4906759b1d8178a512385633adf5d0db934ae1e333
creating metadata file for /root/.cache/huggingface/transformers/1df6572a9ae2fd1152d4fa4e3b9d30e0d303c69cb87d5b8401ef5cb032016bef.aa91fcc51e661ddbf70fda4906759b1d8178a512385633adf5d0db934ae1e333
loading configuration file https://huggingface.co/salti/bert-base-multilingual-cased-finetuned-squad/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/1df6572a9ae2fd1152d4fa4e3b9d30e0d303c69cb87d5b8401ef5cb032016bef.aa91fcc51e661ddbf70fda4906759b1d8178a512385633adf5d0db934ae1e333
Model config BertConfig {
  "_name_or_path": "salti/bert-base-multilingual-cased-finetuned-squad",
  "architectures": [
    "BertForQuestionAnswering"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_d

Downloading:   0%|          | 0.00/676M [00:00<?, ?B/s]

storing https://huggingface.co/salti/bert-base-multilingual-cased-finetuned-squad/resolve/main/pytorch_model.bin in cache at /root/.cache/huggingface/transformers/9f54849aca742a855728dc8a74c1a733627678a2e8c7a97ba60e2b318ad1438a.b9c031b09975cb84030a5da7731e0b35844cbf7eed5f7ddc0dc0704ba6cc5802
creating metadata file for /root/.cache/huggingface/transformers/9f54849aca742a855728dc8a74c1a733627678a2e8c7a97ba60e2b318ad1438a.b9c031b09975cb84030a5da7731e0b35844cbf7eed5f7ddc0dc0704ba6cc5802
loading weights file https://huggingface.co/salti/bert-base-multilingual-cased-finetuned-squad/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/9f54849aca742a855728dc8a74c1a733627678a2e8c7a97ba60e2b318ad1438a.b9c031b09975cb84030a5da7731e0b35844cbf7eed5f7ddc0dc0704ba6cc5802
All model checkpoint weights were used when initializing BertForQuestionAnswering.

All the weights of BertForQuestionAnswering were initialized from the model checkpoint at salti/bert-base-multilingual-c

We can grab the predictions for all features by using the `Trainer.predict` method:

In [None]:
raw_predictions = trainer.predict(squad_eval)

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 10851
  Batch size = 16


The `Trainer` *hides* the columns that are not used by the model (here `example_id` and `offset_mapping` which we will need for our post-processing), so we set them back:

In [None]:
squad_eval.set_format(type=squad_eval.format["type"], 
                      columns=list(squad_eval.features.keys()))

To post-process the predictions, we use the fact that the offset mappings are set to `None` for every token that belongs to the question, which makes it easy to check whether a candidate answer lies fully inside the context. We also discard very long answers, using a hyperparameter (`max_answer_length`) that we can tune.

Since one example can give several features, we need a map between examples and their corresponding features, so that we can gather all the candidate answers from all the features generated by a given example and then pick the best one. The following code builds a map from each example index to its corresponding feature indices and uses it to select the final answer.

In [None]:
from tqdm.auto import tqdm
import collections
import numpy as np

def postprocess_qa_predictions(examples, features, raw_predictions, n_best_size = 20, max_answer_length = 30):
    all_start_logits, all_end_logits = raw_predictions
    # Build a map example to its corresponding features.
    example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
    features_per_example = collections.defaultdict(list)
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)

    # The dictionaries we have to fill.
    predictions = collections.OrderedDict()

    # Logging.
    print(f"Post-processing {len(examples)} example predictions split into {len(features)} features.")

    # Let's loop over all the examples!
    for example_index, example in enumerate(tqdm(examples)):
        # Those are the indices of the features associated to the current example.
        feature_indices = features_per_example[example_index]

        valid_answers = []
        
        context = example["context"]
        # Looping through all the features associated to the current example.
        for feature_index in feature_indices:
            # We grab the predictions of the model for this feature.
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            # This is what will allow us to map the positions in our logits to spans of text in the original
            # context.
            offset_mapping = features[feature_index]["offset_mapping"]

            # Go through all possibilities for the `n_best_size` greatest start and end logits.
            start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
            end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
                    # to part of the input_ids that are not in the context.
                    if (
                        start_index >= len(offset_mapping)
                        or end_index >= len(offset_mapping)
                        or offset_mapping[start_index] is None
                        or offset_mapping[end_index] is None
                    ):
                        continue
                    # Don't consider answers with a length that is either < 0 or > max_answer_length.
                    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                        continue

                    start_char = offset_mapping[start_index][0]
                    end_char = offset_mapping[end_index][1]
                    valid_answers.append(
                        {
                            "score": start_logits[start_index] + end_logits[end_index],
                            "text": context[start_char: end_char]
                        }
                    )
        
        if len(valid_answers) > 0:
            best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
        else:
            # In the very rare edge case where we do not have a single valid prediction, we create a fake
            # prediction to avoid failure.
            best_answer = {"text": "", "score": 0.0}
        
        # Let's pick our final answer: the best one
        predictions[example["id"]] = best_answer["text"]

    return predictions
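
The span-selection loop inside the function above can be exercised on toy logits. The sketch below is a simplified, pure-Python variant for a single feature; the helper `best_span` and all the numbers are invented for illustration. Note how the highest-scoring position (index 0, standing in for `[CLS]`) is skipped because its offset is `None`.

```python
def best_span(start_logits, end_logits, offsets, context, n_best_size=20, max_answer_length=30):
    """Return the text of the best-scoring valid (start, end) token pair."""
    order = range(len(start_logits))
    start_indexes = sorted(order, key=lambda i: start_logits[i], reverse=True)[:n_best_size]
    end_indexes = sorted(order, key=lambda i: end_logits[i], reverse=True)[:n_best_size]

    candidates = []
    for s in start_indexes:
        for e in end_indexes:
            # Skip positions outside the context (masked with None).
            if offsets[s] is None or offsets[e] is None:
                continue
            # Skip spans that end before they start or that are too long.
            if e < s or e - s + 1 > max_answer_length:
                continue
            score = start_logits[s] + end_logits[e]
            candidates.append((score, context[offsets[s][0]:offsets[e][1]]))

    if not candidates:
        return ""  # same fallback as the empty prediction above
    return max(candidates, key=lambda c: c[0])[1]


toy_context = "Denver Broncos won"
toy_masked_offsets = [None, None, None, None, None, (0, 6), (7, 14), (15, 18), None]
toy_start_logits = [5.0, 0.1, 0.2, 0.1, 0.0, 4.0, 1.0, 0.5, 0.0]
toy_end_logits = [5.0, 0.0, 0.1, 0.3, 0.0, 1.5, 4.5, 0.5, 0.0]

prediction = best_span(toy_start_logits, toy_end_logits, toy_masked_offsets, toy_context)
print(prediction)
```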

And we can apply our post-processing function to our raw predictions:

In [None]:
final_predictions = postprocess_qa_predictions(squad["validation"], squad_eval, raw_predictions.predictions)

Post-processing 10570 example predictions split into 10851 features.



Then we can load the metric from the datasets library.

In [None]:
from datasets import load_metric

metric = load_metric("squad")


Then we can call compute on it. We just need to format predictions and labels a bit as it expects a list of dictionaries and not one big dictionary.

In [None]:
formatted_predictions = [{"id": k, "prediction_text": v} for k, v in final_predictions.items()]
references = [{"id": ex["id"], "answers": ex["answers"]} for ex in squad["validation"]]
metric.compute(predictions=formatted_predictions, references=references)

{'exact_match': 81.90160832544939, 'f1': 89.121876471452}
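
For intuition about what these two numbers measure, here is a minimal sketch of exact match and token-level F1 for a single prediction/reference pair, written in the style of the SQuAD v1.1 evaluation script. It is an illustration, not the code behind `metric.compute`: the function names are chosen here, and the sketch assumes that the real metric additionally takes the maximum over all gold answers for a question and averages over the dataset, reporting percentages.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction, reference):
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    # Multiset intersection counts tokens shared by prediction and reference.
    num_same = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Denver Broncos", "Denver Broncos"))
print(f1_score("Denver", "Denver Broncos"))
```

A partially correct answer therefore scores zero on exact match but still earns partial credit on F1, which is why F1 is consistently higher than EM in the results.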

## Zero-Shot mBERT

Zero-Shot performance of the mBERT model fine-tuned on SQuAD.

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "salti/bert-base-multilingual-cased-finetuned-squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="bert-base-multilingual-cased-finetuned-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

Use the map function again to apply the validation preprocessing function, this time to the XQuAD test split of each language. You can speed up the map function by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don’t need.

In [None]:
from collections import defaultdict
langs = ["ar", "de", "zh", "vi", "en", "es", "hi", "el", "th", "tr", "ru", "ro"]
split = "test"

xquad_prep = defaultdict(dict)

def map_datasets(langs, split, prepare_features):
    # Preprocess the given split of every language and store the features in `xquad_prep`.
    for lang in langs:
        xquad_prep[lang][split] = xquad[lang][split].map(
            prepare_features,
            batched=True,
            remove_columns=xquad[lang][split].column_names,
        )

In [None]:
map_datasets(langs, split, prepare_validation_features)


In [None]:
def compute_results(langs, split):
    results = {}
    for lang in langs:
        # Grab the predictions for all features with the `Trainer.predict` method.
        raw_predictions = trainer.predict(xquad_prep[lang][split])

        # The `Trainer` hides the columns the model does not use (`example_id` and `offset_mapping`),
        # which we need for post-processing, so we set them back.
        xquad_prep[lang][split].set_format(
            type=xquad_prep[lang][split].format["type"],
            columns=list(xquad_prep[lang][split].features.keys()),
        )

        # Apply our post-processing function to the raw predictions.
        final_predictions = postprocess_qa_predictions(xquad[lang][split], xquad_prep[lang][split], raw_predictions.predictions)

        # Format predictions and labels as lists of dictionaries, which is what the metric expects.
        formatted_predictions = [{"id": k, "prediction_text": v} for k, v in final_predictions.items()]
        references = [{"id": ex["id"], "answers": ex["answers"]} for ex in xquad[lang][split]]
        results[lang] = metric.compute(predictions=formatted_predictions, references=references)
    return results

In [None]:
results_zero_shot_mbert = compute_results(langs, split)
print(results_zero_shot_mbert)

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1189
  Batch size = 16


Post-processing 1151 example predictions split into 1189 features.

***** Running Prediction *****
  Num examples = 1207
  Batch size = 16

Post-processing 1168 example predictions split into 1207 features.

***** Running Prediction *****
  Num examples = 1223
  Batch size = 16

Post-processing 1186 example predictions split into 1223 features.

***** Running Prediction *****
  Num examples = 1216
  Batch size = 16

Post-processing 1178 example predictions split into 1216 features.

***** Running Prediction *****
  Num examples = 1235
  Batch size = 16

Post-processing 1188 example predictions split into 1235 features.

***** Running Prediction *****
  Num examples = 1222
  Batch size = 16

Post-processing 1184 example predictions split into 1222 features.

***** Running Prediction *****
  Num examples = 1219
  Batch size = 16

Post-processing 1182 example predictions split into 1219 features.

***** Running Prediction *****
  Num examples = 1200
  Batch size = 16

Post-processing 1157 example predictions split into 1200 features.

***** Running Prediction *****
  Num examples = 1144
  Batch size = 16

Post-processing 1112 example predictions split into 1144 features.

***** Running Prediction *****
  Num examples = 1237
  Batch size = 16

Post-processing 1190 example predictions split into 1237 features.

{'ar': {'exact_match': 55.77758470894874, 'f1': 70.44619460475333}, 'de': {'exact_match': 63.27054794520548, 'f1': 76.73134579887086}, 'zh': {'exact_match': 56.57672849915683, 'f1': 70.12370423516863}, 'vi': {'exact_match': 55.602716468590835, 'f1': 70.63286731677346}, 'es': {'exact_match': 65.06734006734007, 'f1': 78.73225869616985}, 'hi': {'exact_match': 55.8277027027027, 'f1': 70.57554022304278}, 'el': {'exact_match': 61.92893401015228, 'f1': 75.9954399757002}, 'th': {'exact_match': 45.894554883318925, 'f1': 60.04206160508226}, 'tr': {'exact_match': 42.71582733812949, 'f1': 61.64366188342535}, 'ru': {'exact_match': 63.109243697478995, 'f1': 76.6412196213264}}


In [None]:
import pandas as pd
def results_df(results_dict):
    dict_results = defaultdict(list)
    for lang, scores in results_dict.items():
        dict_results["lang"].append(lang)
        dict_results["F1"].append(scores['f1'])
        dict_results["EM"].append(scores['exact_match'])

    df_results = pd.DataFrame(dict_results).round(2)
    return df_results

In [None]:
df_results_zero_shot_mbert = results_df(results_zero_shot_mbert)
df_results_zero_shot_mbert.to_csv("results/results_zero_shot_mbert.csv")
df_results_zero_shot_mbert

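|   | lang | F1 | EM |
|---|------|-------|-------|
| 0 | ar | 70.45 | 55.78 |
| 1 | de | 76.73 | 63.27 |
| 2 | zh | 70.12 | 56.58 |
| 3 | vi | 70.63 | 55.6 |
| 4 | es | 78.73 | 65.07 |
| 5 | hi | 70.58 | 55.83 |
| 6 | el | 76.0 | 61.93 |
| 7 | th | 60.04 | 45.89 |
| 8 | tr | 61.64 | 42.72 |
| 9 | ru | 76.64 | 63.11 |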
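
To summarise the table, one might compute the macro-average over languages and identify the strongest and weakest language. The sketch below hard-codes the rounded scores from the table above and uses only the standard library; the variable names are chosen here for illustration and are not part of the notebook.

```python
from statistics import mean

# Rounded zero-shot mBERT scores copied from the results table above.
zero_shot_mbert = {
    "ar": {"F1": 70.45, "EM": 55.78},
    "de": {"F1": 76.73, "EM": 63.27},
    "zh": {"F1": 70.12, "EM": 56.58},
    "vi": {"F1": 70.63, "EM": 55.6},
    "es": {"F1": 78.73, "EM": 65.07},
    "hi": {"F1": 70.58, "EM": 55.83},
    "el": {"F1": 76.0, "EM": 61.93},
    "th": {"F1": 60.04, "EM": 45.89},
    "tr": {"F1": 61.64, "EM": 42.72},
    "ru": {"F1": 76.64, "EM": 63.11},
}

# Macro-average: every language counts equally, regardless of its number of examples.
macro_f1 = mean(scores["F1"] for scores in zero_shot_mbert.values())
macro_em = mean(scores["EM"] for scores in zero_shot_mbert.values())

best_lang = max(zero_shot_mbert, key=lambda lang: zero_shot_mbert[lang]["F1"])
worst_lang = min(zero_shot_mbert, key=lambda lang: zero_shot_mbert[lang]["F1"])

print(f"macro F1 = {macro_f1:.2f}, macro EM = {macro_em:.2f}")
print(f"best F1: {best_lang}, worst F1: {worst_lang}")
```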


## Zero-Shot XLM-R

Zero-Shot performance of the XLM-R model fine-tuned on SQuAD.

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "vanichandna/xlm-roberta-finetuned-squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name, from_tf=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="xlm-roberta-finetuned-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)


All TF 2.0 model weights were used when initializing XLMRobertaForQuestionAnswering.

All the weights of XLMRobertaForQuestionAnswering were initialized from the TF 2.0 model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use XLMRobertaForQuestionAnswering for predictions without further training.



Use the map function again to apply the validation preprocessing function to the XQuAD test split of each language, now with the XLM-R tokenizer. You can speed up the map function by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don’t need.

In [None]:
langs = ["ar", "de", "zh", "vi", "en", "es", "hi", "el", "th", "tr", "ru", "ro"]
split = "test"

map_datasets(langs, split, prepare_validation_features)


In [None]:
results_zero_shot_xlm_r = compute_results(langs, split)
print(results_zero_shot_xlm_r)

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1292
  Batch size = 16


NameError: ignored

In [None]:
df_results_zero_shot_xlm_r = results_df(results_zero_shot_xlm_r)
df_results_zero_shot_xlm_r.to_csv("results/results_zero_shot_xlm_r.csv")
df_results_zero_shot_xlm_r

## Zero-Shot XLM-R-large

Zero-Shot performance of the XLM-R-large model fine-tuned on SQuAD.

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "Palak/xlm-roberta-large_squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="xlm-roberta-large_squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

Use the map function again to apply the preprocessing function over the validation dataset. You can speed up the map function by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don’t need.

In [None]:
langs = ["ar", "de", "zh", "vi", "en", "es", "hi", "el", "th", "tr", "ru", "ro"]
split = "test"

map_datasets(langs, split, prepare_validation_features)

In [None]:
results_zero_shot_xlm_r_large = compute_results(langs, split)
print(results_zero_shot_xlm_r_large)

In [None]:
df_results_zero_shot_xlm_r_large = results_df(results_zero_shot_xlm_r_large)
df_results_zero_shot_xlm_r_large.to_csv("results/results_zero_shot_xlm_r_large.csv")
df_results_zero_shot_xlm_r_large

## Translate Test mBERT

We take a model fine-tuned on SQuAD and evaluate it on test data that we translated from each target language into English. Here we use mBERT, although a monolingual English model would also work, because the evaluation data is entirely in English.

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "salti/bert-base-multilingual-cased-finetuned-squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="bert-base-multilingual-cased-finetuned-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
langs = ["ar", "de", "zh", "vi", "es", "hi", "el", "th", "tr", "ru"]
split = "translate_test"

map_datasets(langs, split, prepare_validation_features)

Loading cached processed dataset at /root/.cache/huggingface/datasets/xquad/ar/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40/cache-071cdb5168ee2a5e.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/xquad/de/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40/cache-38da8b469993c3c1.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/xquad/zh/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40/cache-cdafdadb0dc8ba13.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/xquad/vi/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40/cache-6943b6f9c879f76c.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/xquad/es/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40/cache-ed3a1b751476ea08.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/xquad/hi/1.0.0/c826765c504683edb842

In [None]:
results_translate_test_mbert = compute_results(langs, split)
print(results_translate_test_mbert)

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1189
  Batch size = 16

Post-processing 1151 example predictions split into 1189 features.

***** Running Prediction *****
  Num examples = 1207
  Batch size = 16

Post-processing 1168 example predictions split into 1207 features.

***** Running Prediction *****
  Num examples = 1223
  Batch size = 16

Post-processing 1186 example predictions split into 1223 features.

***** Running Prediction *****
  Num examples = 1216
  Batch size = 16

Post-processing 1178 example predictions split into 1216 features.

***** Running Prediction *****
  Num examples = 1235
  Batch size = 16

Post-processing 1188 example predictions split into 1235 features.

***** Running Prediction *****
  Num examples = 1222
  Batch size = 16

Post-processing 1184 example predictions split into 1222 features.

***** Running Prediction *****
  Num examples = 1219
  Batch size = 16

Post-processing 1182 example predictions split into 1219 features.

***** Running Prediction *****
  Num examples = 1200
  Batch size = 16

Post-processing 1157 example predictions split into 1200 features.

***** Running Prediction *****
  Num examples = 1144
  Batch size = 16

Post-processing 1112 example predictions split into 1144 features.

***** Running Prediction *****
  Num examples = 1237
  Batch size = 16

Post-processing 1190 example predictions split into 1237 features.

{'ar': {'exact_match': 55.77758470894874, 'f1': 70.44619460475333}, 'de': {'exact_match': 63.27054794520548, 'f1': 76.73134579887086}, 'zh': {'exact_match': 56.57672849915683, 'f1': 70.12370423516863}, 'vi': {'exact_match': 55.602716468590835, 'f1': 70.63286731677346}, 'es': {'exact_match': 65.06734006734007, 'f1': 78.73225869616985}, 'hi': {'exact_match': 55.8277027027027, 'f1': 70.57554022304278}, 'el': {'exact_match': 61.92893401015228, 'f1': 75.9954399757002}, 'th': {'exact_match': 45.894554883318925, 'f1': 60.04206160508226}, 'tr': {'exact_match': 42.71582733812949, 'f1': 61.64366188342535}, 'ru': {'exact_match': 63.109243697478995, 'f1': 76.6412196213264}}


In [None]:
df_results_translate_test_mbert = results_df(results_translate_test_mbert)
df_results_translate_test_mbert.to_csv("results/results_translate_test_mbert.csv")
df_results_translate_test_mbert

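|   | lang | F1 | EM |
|---|------|----|----|
| 0 | ar | 70.45 | 55.78 |
| 1 | de | 76.73 | 63.27 |
| 2 | zh | 70.12 | 56.58 |
| 3 | vi | 70.63 | 55.6 |
| 4 | es | 78.73 | 65.07 |
| 5 | hi | 70.58 | 55.83 |
| 6 | el | 76.0 | 61.93 |
| 7 | th | 60.04 | 45.89 |
| 8 | tr | 61.64 | 42.72 |
| 9 | ru | 76.64 | 63.11 |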
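
The `exact_match` and `f1` scores reported above look like the standard SQuAD metrics. As a rough illustration of what they measure, here is a minimal sketch of SQuAD-style scoring for a single prediction and a single reference. It is an assumption that `compute_results` uses this exact normalization. The function names below are illustrative, and the sketch returns fractions in [0, 1], whereas the notebook reports percentages.

```python
import re
import string
from collections import Counter

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction, reference):
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower", "Eiffel Tower"))
print(f1_score("in Paris, France", "Paris"))
```

F1 gives partial credit when the predicted span overlaps the reference, which is why it is consistently higher than EM in the results.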


## Translate Test BERT

We use the model fine-tuned on SQuAD and evaluate it on test data that we translated from the target language to English. In this case we use a monolingual BERT model, because we only evaluate on English data.

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "rsvp-ai/bertserini-bert-base-squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="bertserini-bert-base-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
langs = ["ar", "de", "zh", "vi", "es", "hi", "el", "th", "tr", "ru"]
split = "translate_test"

map_datasets(langs, split, prepare_validation_features)

In [None]:
results_translate_test_bert = compute_results(langs, split)
print(results_translate_test_bert)

In [None]:
df_results_translate_test_bert = results_df(results_translate_test_bert)
df_results_translate_test_bert.to_csv("results/results_translate_test_bert.csv")
df_results_translate_test_bert

## Translate Test BERT-large

We use the model fine-tuned on SQuAD and evaluate it on test data that we translated from the target language to English. In this case we use a monolingual BERT-large model, because we only evaluate on English data.

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "bert-large-cased-whole-word-masking-finetuned-squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="bert-large-cased-whole-word-masking-finetuned-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
langs = ["ar", "de", "zh", "vi", "es", "hi", "el", "th", "tr", "ru"]
split = "translate_test"

map_datasets(langs, split, prepare_validation_features)

In [None]:
results_translate_test_bert_large = compute_results(langs, split)
print(results_translate_test_bert_large)

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1182
  Batch size = 16

Post-processing 1151 example predictions split into 1182 features.

***** Running Prediction *****
  Num examples = 1204
  Batch size = 16

Post-processing 1168 example predictions split into 1204 features.

***** Running Prediction *****
  Num examples = 1222
  Batch size = 16

Post-processing 1186 example predictions split into 1222 features.

***** Running Prediction *****
  Num examples = 1210
  Batch size = 16

Post-processing 1178 example predictions split into 1210 features.

***** Running Prediction *****
  Num examples = 1230
  Batch size = 16

Post-processing 1188 example predictions split into 1230 features.

***** Running Prediction *****
  Num examples = 1220
  Batch size = 16

Post-processing 1184 example predictions split into 1220 features.

***** Running Prediction *****
  Num examples = 1219
  Batch size = 16

Post-processing 1182 example predictions split into 1219 features.

***** Running Prediction *****
  Num examples = 1195
  Batch size = 16

Post-processing 1157 example predictions split into 1195 features.

***** Running Prediction *****
  Num examples = 1145
  Batch size = 16

Post-processing 1112 example predictions split into 1145 features.

***** Running Prediction *****
  Num examples = 1237
  Batch size = 16

Post-processing 1190 example predictions split into 1237 features.

{'ar': {'exact_match': 59.07906168549088, 'f1': 73.62074258466953}, 'de': {'exact_match': 66.3527397260274, 'f1': 80.41370021364638}, 'zh': {'exact_match': 59.527824620573355, 'f1': 73.98176392520615}, 'vi': {'exact_match': 62.13921901528013, 'f1': 76.39100238167063}, 'es': {'exact_match': 68.68686868686869, 'f1': 81.929551000276}, 'hi': {'exact_match': 61.6554054054054, 'f1': 75.31862416628505}, 'el': {'exact_match': 66.83587140439933, 'f1': 80.18622690602038}, 'th': {'exact_match': 53.93258426966292, 'f1': 67.49195736299158}, 'tr': {'exact_match': 47.302158273381295, 'f1': 66.30064032352537}, 'ru': {'exact_match': 66.97478991596638, 'f1': 80.09890179233422}}


In [None]:
df_results_translate_test_bert_large = results_df(results_translate_test_bert_large)
df_results_translate_test_bert_large.to_csv("results/results_translate_test_bert_large.csv")
df_results_translate_test_bert_large

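|   | lang | F1 | EM |
|---|------|----|----|
| 0 | ar | 73.62 | 59.08 |
| 1 | de | 80.41 | 66.35 |
| 2 | zh | 73.98 | 59.53 |
| 3 | vi | 76.39 | 62.14 |
| 4 | es | 81.93 | 68.69 |
| 5 | hi | 75.32 | 61.66 |
| 6 | el | 80.19 | 66.84 |
| 7 | th | 67.49 | 53.93 |
| 8 | tr | 66.3 | 47.3 |
| 9 | ru | 80.1 | 66.97 |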
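
To put these numbers side by side with the translate-test mBERT run, the following sketch computes the per-language F1 difference between the two result tables. The F1 values are copied from the two tables in this notebook; the variable names are illustrative and not part of the notebook's helpers.

```python
# F1 scores copied from the translate-test result tables in this notebook.
f1_mbert = {
    "ar": 70.45, "de": 76.73, "zh": 70.12, "vi": 70.63, "es": 78.73,
    "hi": 70.58, "el": 76.0, "th": 60.04, "tr": 61.64, "ru": 76.64,
}
f1_bert_large = {
    "ar": 73.62, "de": 80.41, "zh": 73.98, "vi": 76.39, "es": 81.93,
    "hi": 75.32, "el": 80.19, "th": 67.49, "tr": 66.3, "ru": 80.1,
}

# Positive values mean BERT-large scored higher than mBERT for that language.
delta = {
    lang: round(f1_bert_large[lang] - f1_mbert[lang], 2) for lang in f1_mbert
}
mean_gain = sum(delta.values()) / len(delta)

for lang, gain in sorted(delta.items(), key=lambda item: item[1], reverse=True):
    print(f"{lang}: {gain:+.2f}")
print(f"mean F1 gain: {mean_gain:.2f}")
```

On these numbers, BERT-large scores a higher F1 than mBERT for every language, with the largest gain for Thai.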


## Translate Test XLM-R

We take a model fine-tuned on SQuAD and evaluate it on test data that we translated from each target language into English. Here we use XLM-R, although a monolingual English model would also work, because the evaluation data is entirely in English.

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "vanichandna/xlm-roberta-finetuned-squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name,from_tf=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="xlm-roberta-finetuned-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
langs = ["ar", "de", "zh", "vi", "es", "hi", "el", "th", "tr", "ru"]
split = "translate_test"

map_datasets(langs, split, prepare_validation_features)

In [None]:
results_translate_test_xlm_r = compute_results(langs, split)
print(results_translate_test_xlm_r)

In [None]:
df_results_translate_test_xlm_r = results_df(results_translate_test_xlm_r)
df_results_translate_test_xlm_r.to_csv("results/results_translate_test_xlm_r.csv")
df_results_translate_test_xlm_r

## Translate Test XLM-R-large

We use the model fine-tuned on SQuAD and evaluate it on test data that we translated from the target language to English. In this case we use XLM-R-large, but we could also use a monolingual model, because we only evaluate on English data.

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "Palak/xlm-roberta-large_squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="xlm-roberta-large_squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
langs = ["ar", "de", "zh", "vi", "es", "hi", "el", "th", "tr", "ru"]
split = "translate_test"

map_datasets(langs, split, prepare_validation_features)

In [None]:
results_translate_test_xlm_r_large = compute_results(langs, split)
print(results_translate_test_xlm_r_large)

In [None]:
df_results_translate_test_xlm_r_large = results_df(results_translate_test_xlm_r_large)
df_results_translate_test_xlm_r_large.to_csv("results/results_translate_test_xlm_r_large.csv")
df_results_translate_test_xlm_r_large

## Translate Test RoBERTa

We use the model fine-tuned on SQuAD and evaluate it on test data that we translated from the target language to English. In this case we use a monolingual RoBERTa model, because we only evaluate on English data.

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "thatdramebaazguy/roberta-base-squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="roberta-base-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
langs = ["ar", "de", "zh", "vi", "es", "hi", "el", "th", "tr", "ru"]
split = "translate_test"

map_datasets(langs, split, prepare_validation_features)

In [None]:
results_translate_test_roberta = compute_results(langs, split)
print(results_translate_test_roberta)

In [None]:
df_results_translate_test_roberta = results_df(results_translate_test_roberta)
df_results_translate_test_roberta.to_csv("results/results_translate_test_roberta.csv")
df_results_translate_test_roberta

## Translate Test RoBERTa-large

We use the model fine-tuned on SQuAD and evaluate it on test data that we translated from the target language to English. In this case we use a monolingual RoBERTa-large model, because we only evaluate on English data.

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "csarron/roberta-large-squad-v1"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="csarron/roberta-large-squad-v1",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
langs = ["ar", "de", "zh", "vi", "es", "hi", "el", "th", "tr", "ru"]
split = "translate_test"

map_datasets(langs, split, prepare_validation_features)

In [None]:
results_translate_test_roberta_large = compute_results(langs, split)
print(results_translate_test_roberta_large)

In [None]:
df_results_translate_test_roberta_large = results_df(results_translate_test_roberta_large)
df_results_translate_test_roberta_large.to_csv("results/results_translate_test_roberta_large.csv")
df_results_translate_test_roberta_large

## Translate Train Es mBERT

For many language pairs, an MT model is available and can be used to obtain data in the target language. To evaluate the impact of such data, we translate the English training data into the target language with an MT system and then fine-tune mBERT on the translated data. For QA tasks, the answer spans must be aligned between the source and target language. To save time, we use data that has already been translated.
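
The notebook does not state how the answer spans were aligned in the pre-translated data. As a minimal sketch of one simple heuristic, the snippet below looks up the translated answer verbatim in the translated context and recomputes its character offset, discarding the example when no match is found. The function `realign_answer` and the Spanish sentence are hypothetical illustrations, not taken from the dataset.

```python
def realign_answer(translated_context, translated_answer):
    """Return a SQuAD-style answers dict, or None if the answer is not found.

    This naive heuristic only works when the translated answer appears
    verbatim in the translated context.
    """
    start = translated_context.find(translated_answer)
    if start == -1:
        return None  # no exact match: the example would have to be dropped
    return {"text": [translated_answer], "answer_start": [start]}


context_es = "La Torre Eiffel está en París, la capital de Francia."
aligned = realign_answer(context_es, "París")
missing = realign_answer(context_es, "Londres")
print(aligned)
print(missing)
```

Because the context and the answer are translated separately, the answer may be worded differently in each, so a heuristic like this can lose training examples.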

Use 🤗 Datasets map function to apply the preprocessing function over the entire dataset. You can speed up the map function by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don’t need.

In [None]:
langs = ["es"]

split = "translate_train"
map_datasets(langs, split, prepare_train_features)

split = "translate_dev"
map_datasets(langs, split, prepare_train_features)

In [None]:
xquad_prep["es"]

{'test': Dataset({
     features: ['input_ids', 'token_type_ids', 'attention_mask', 'offset_mapping', 'example_id'],
     num_rows: 1274
 }), 'translate_dev': Dataset({
     features: ['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions'],
     num_rows: 36431
 }), 'translate_test': Dataset({
     features: ['input_ids', 'token_type_ids', 'attention_mask', 'offset_mapping', 'example_id'],
     num_rows: 1235
 }), 'translate_train': Dataset({
     features: ['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions'],
     num_rows: 90680
 })}

Now that the data is ready for training, we can download the pretrained model and fine-tune it. Since the task is question answering, we use the `AutoModelForQuestionAnswering` class. As with the tokenizer, the `from_pretrained` method downloads and caches the model for us:

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer

model_name = "bert-base-multilingual-cased"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

batch_size = 16

training_args = TrainingArguments(
    output_dir="bert-base-multilingual-cased-squad-es",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True
)

loading configuration file https://huggingface.co/bert-base-multilingual-cased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/6c4a5d81a58c9791cdf76a09bce1b5abfb9cf958aebada51200f4515403e5d08.0fe59f3f4f1335dadeb4bce8b8146199d9083512b50d07323c1c319f96df450c
Model config BertConfig {
  "_name_or_path": "bert-base-multilingual-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "pooler_type": "first_token_transform",
  "position_embedding_type": "abs

Define the trainer and train. The trainer uses the translated SQuAD train and dev data.

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=xquad_prep["es"]["translate_train"],
    eval_dataset=xquad_prep["es"]["translate_dev"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

Cloning https://huggingface.co/juletxara/bert-base-multilingual-cased-squad-es into local empty directory.


In [None]:
trainer.train()

In [None]:
langs = ["ar", "de", "zh", "vi", "en", "es", "hi", "el", "th", "tr", "ru", "ro"]
split = "test"
results_translate_train_es_mbert = compute_results(langs, split)
print(results_translate_train_es_mbert)

In [None]:
df_results_translate_train_es_mbert = results_df(results_translate_train_es_mbert)
df_results_translate_train_es_mbert.to_csv("results/results_translate_train_es_mbert.csv")
df_results_translate_train_es_mbert

In [None]:
trainer.push_to_hub()

## Translate Train Es XLM-R

For many language pairs, an MT model is available and can be used to obtain data in the target language. To evaluate the impact of such data, we translate the English training data into the target language with an MT system. To save time, we use an XLM-R model that has already been fine-tuned.

Use 🤗 Datasets map function to apply the preprocessing function over the entire dataset. You can speed up the map function by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don’t need.

In [None]:
langs = ["es"]

split = "translate_train"
map_datasets(langs, split, prepare_train_features)

split = "translate_dev"
map_datasets(langs, split, prepare_train_features)

Instead of fine-tuning ourselves, we download a checkpoint that has already been fine-tuned. Since the task is question answering, we use the `AutoModelForQuestionAnswering` class. As with the tokenizer, the `from_pretrained` method downloads and caches the model for us:

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "mrm8488/bert-multi-cased-finetuned-xquadv1"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="bert-multi-cased-finetuned-xquadv1",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)
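
A question-answering head returns one start logit and one end logit per token, and an answer is read off as the highest-scoring valid span. The sketch below shows that decoding step on plain lists, assuming the usual start-plus-end score and a cap on answer length. The function `best_span` and the toy logits are illustrative and are not taken from this notebook's `compute_results`.

```python
from typing import List, Tuple


def best_span(start_logits: List[float], end_logits: List[float],
              max_answer_len: int = 30) -> Tuple[int, int, float]:
    """Return (start, end, score) of the best span with start <= end."""
    best = (0, 0, float("-inf"))
    for i, s in enumerate(start_logits):
        # Only consider ends at or after the start, within the length cap.
        last = min(len(end_logits), i + max_answer_len)
        for j in range(i, last):
            score = s + end_logits[j]
            if score > best[2]:
                best = (i, j, score)
    return best


tokens = ["the", "capital", "is", "Paris", "France", "."]
start = [0.1, 0.2, 0.0, 4.0, 1.0, -1.0]
end = [0.0, 0.1, 0.3, 2.5, 3.5, -0.5]
i, j, score = best_span(start, end, max_answer_len=3)
print(tokens[i:j + 1], score)
```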

In [None]:
langs = ["ar", "de", "zh", "vi", "en", "es", "hi", "el", "th", "tr", "ru", "ro"]
split = "test"
results_translate_train_es_xlm_r = compute_results(langs, split)
print(results_translate_train_es_xlm_r)

In [None]:
df_results_translate_train_es_xlm_r = results_df(results_translate_train_es_xlm_r)
df_results_translate_train_es_xlm_r.to_csv("results/results_translate_train_es_xlm_r.csv")
df_results_translate_train_es_xlm_r

## Translate Train All

We also experiment with a multi-task version of the translate-train setting in which we fine-tune mBERT jointly on the translated training data of all languages, combined with the original SQuAD training data.

In [None]:
langs = ["ar", "de", "zh", "vi", "es", "hi", "el", "th", "tr", "ru"]

split = "translate_train"
map_datasets(langs, split, prepare_train_features)

split = "translate_dev"
map_datasets(langs, split, prepare_train_features)

In [None]:
from datasets import DatasetDict, concatenate_datasets

xquad_merged = DatasetDict()
xquad_merged["translate_train"] = squad_train["train"]
xquad_merged["translate_dev"] = squad_train["validation"]

for lang in langs:
    for split in ["translate_train", "translate_dev"]:
        xquad_merged[split] = concatenate_datasets([xquad_merged[split], xquad_prep[lang][split]])

In [None]:
xquad_merged

DatasetDict({
    translate_train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions'],
        num_rows: 1067346
    })
    translate_dev: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions'],
        num_rows: 406809
    })
})

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer

model_name = "bert-base-multilingual-cased"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

batch_size = 16

training_args = TrainingArguments(
    output_dir="bert-base-multilingual-cased-squad-all",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True
)

loading configuration file https://huggingface.co/bert-base-multilingual-cased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/6c4a5d81a58c9791cdf76a09bce1b5abfb9cf958aebada51200f4515403e5d08.0fe59f3f4f1335dadeb4bce8b8146199d9083512b50d07323c1c319f96df450c
Model config BertConfig {
  "_name_or_path": "bert-base-multilingual-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "pooler_type": "first_token_transform",
  "position_embedding_type": "abs

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=xquad_merged["translate_train"],
    eval_dataset=xquad_merged["translate_dev"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

Cloning https://huggingface.co/juletxara/bert-base-multilingual-cased-squad-all into local empty directory.


In [None]:
trainer.train()
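
For a sense of scale, the number of optimizer steps implied by the merged dataset can be worked out from the figures above (1,067,346 training rows, 406,809 dev rows, batch size 16, 3 epochs). This is a rough sketch that assumes a single device and no gradient accumulation, neither of which is stated in this notebook.

```python
import math

num_train_rows = 1_067_346  # translate_train rows in xquad_merged
num_dev_rows = 406_809      # translate_dev rows in xquad_merged
batch_size = 16             # per_device_train_batch_size / per_device_eval_batch_size
num_epochs = 3              # num_train_epochs

# Assumption: one device, no gradient accumulation, last partial batch kept.
steps_per_epoch = math.ceil(num_train_rows / batch_size)
total_steps = steps_per_epoch * num_epochs
eval_batches_per_epoch = math.ceil(num_dev_rows / batch_size)

print(steps_per_epoch, total_steps, eval_batches_per_epoch)
```

Under these assumptions the run takes 66,710 steps per epoch and 200,130 steps overall, with 25,426 evaluation batches at the end of each epoch.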

In [None]:
langs = ["ar", "de", "zh", "vi", "en", "es", "hi", "el", "th", "tr", "ru", "ro"]
split = "test"
results_translate_train_all_mbert = compute_results(langs, split)
print(results_translate_train_all_mbert)

In [None]:
df_results_translate_train_all_mbert = results_df(results_translate_train_all_mbert)
df_results_translate_train_all_mbert.to_csv("results/results_translate_train_all_mbert.csv")
df_results_translate_train_all_mbert

In [None]:
trainer.push_to_hub()

## Data Augmentation

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "mrm8488/bert-multi-cased-finetuned-xquadv1"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="bert-multi-cased-finetuned-xquadv1",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
langs = ["ar", "de", "zh", "vi", "en", "es", "hi", "el", "th", "tr", "ru", "ro"]
split = "test"

map_datasets(langs, split, prepare_validation_features)

In [None]:
results_data_augmentation = compute_results(langs, split)
print(results_data_augmentation)

In [None]:
df_results_data_augmentation = results_df(results_data_augmentation)
df_results_data_augmentation.to_csv("results/results_data_augmentation.csv")
df_results_data_augmentation