# Zero-Shot and Translation Experiments on XQuAD with mBERT

If you're opening this Notebook on Colab, you will need to mount Drive and change directory.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
%cd /content/drive/MyDrive/master/Applications2/project

/content/drive/.shortcut-targets-by-id/1AtJHZWX_djzvUvOzjyLFydMJ7eINbuMI/project


In [None]:
%cd /content/drive/MyDrive/LAP/Subjects/AP2/project

/content/drive/MyDrive/LAP/Subjects/AP2/project


If you're opening this Notebook on colab, you will need to install 🤗 Transformers and 🤗 Datasets.

In [None]:
!pip install datasets transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.2.2-py3-none-any.whl (346 kB)
Collecting transformers
  Downloading transformers-4.19.2-py3-none-any.whl (4.2 MB)
Collecting aiohttp
  Downloading aiohttp-3.8.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)
Collecting dill<0.3.5
  Downloading dill-0.3.4-py2.py3-none-any.whl (86 kB)
Collecting huggingface-hub<1.0.0,>=0.1.0
  Downloading huggingface_hub-0.7.0-py3-none-any.whl (86 kB)
Collecting fsspec[http]>=2021.05.0
  Downloading fsspec-2022.5.0-py3-none-any.whl (140 

To be able to share your model with the community and generate results like the one shown in the picture below via the inference API, there are a few more steps to follow.

First, create an authentication token on the Hugging Face website (sign up [here](https://huggingface.co/join) if you haven't already!), then execute the following cell and enter your credentials when prompted (recent versions of `huggingface_hub` ask for the access token rather than a username and password):

In [None]:
from huggingface_hub import notebook_login

notebook_login()


## Create XQuAD-XTREME Dataset

XQuAD (Cross-lingual Question Answering Dataset) is a benchmark dataset for evaluating cross-lingual question answering
performance. The dataset consists of a subset of 240 paragraphs and 1190 question-answer pairs from the development set
of SQuAD v1.1 (Rajpurkar et al., 2016) together with their professional translations into 11 languages: Spanish, German,
Greek, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi and Romanian. Consequently, the dataset is entirely parallel
across 12 languages. https://arxiv.org/pdf/1910.11856.pdf

We also include "translate-train", "translate-dev", and "translate-test"
splits for each non-English language from XTREME (Hu et al., 2020). These can be used to run XQuAD in the "translate-train" or "translate-test" settings. https://proceedings.mlr.press/v119/hu20b/hu20b.pdf

As the dataset is based on SQuAD v1.1, there are no unanswerable questions in the data. We chose this
setting so that models can focus on cross-lingual transfer.

We show the average number of tokens per paragraph, question, and answer for each language in the
table below (Romanian is not covered by the table). The statistics were obtained using [Jieba](https://github.com/fxsjy/jieba) for Chinese
and the [Moses tokenizer](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl)
for the other languages.

|           |   en  |   es  |   de  |   el  |   ru  |   tr  |   ar  |   vi  |   th  |   zh  |   hi  |
|-----------|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|
| Paragraph | 142.4 | 160.7 | 139.5 | 149.6 | 133.9 | 126.5 | 128.2 | 191.2 | 158.7 | 147.6 | 232.4 |
| Question  |  11.5 |  13.4 |  11.0 |  11.7 |  10.0 |  9.8  |  10.7 |  14.8 |  11.5 |  10.5 |  18.7 |
| Answer    |  3.1  |  3.6  |  3.0  |  3.3  |  3.1  |  3.1  |  3.1  |  4.5  |  4.1  |  3.5  |  5.6  |

Make sure you are in the virtual environment where you installed Datasets, and run the following command:

In [None]:
# !huggingface-cli login

Login using your Hugging Face Hub credentials, and create a new dataset repository:

In [None]:
# !huggingface-cli repo create xquad_xtreme --type dataset

git version 2.17.1
Error: unknown flag: --version

Sorry, no usage text found for "git-lfs"

You are about to create datasets/juletxara/xquad_xtreme
Proceed? [Y/n] y

Your repo now lives at:
  https://huggingface.co/datasets/juletxara/xquad_xtreme

You can clone it locally with the command below, and commit/push as usual.

  git clone https://huggingface.co/datasets/juletxara/xquad_xtreme



Install Git LFS and clone your repository:

In [None]:
# !git lfs install
# !git clone https://huggingface.co/datasets/juletxara/xquad_xtreme

Updated git hooks.
Git LFS initialized.
Cloning into 'xquad_xtreme'...
remote: Enumerating objects: 3, done.
remote: Counting objects: 100% (3/3), done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 3 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (3/3), done.


We have to create these files to upload the dataset:

* `README.md` is a dataset card that describes the dataset's contents,
creation, and usage.

* `xquad_xtreme.py` is your dataset loading script.

* `dataset_infos.json` contains metadata about the dataset.

Run the following command to create the metadata file, `dataset_infos.json`. This will also test your new dataset loading script and make sure it works correctly.

In [None]:
# !datasets-cli test xquad_xtreme --save_infos --all_configs

Testing builder 'ar' (1/12)
Downloading and preparing dataset xquad/ar to /root/.cache/huggingface/datasets/xquad/ar/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40...

If you want to be able to test your dataset script without downloading the full dataset, you need to create some dummy data for automated testing.

In [None]:
# !datasets-cli dummy_data xquad_xtreme --auto_generate --json_field data

Downloading data files: 100% 4/4 [00:00<00:00, 10.74it/s]
Extracting data files: 100% 4/4 [00:00<00:00, 1451.82it/s]
Dummy data generation done and dummy data test succeeded for config 'ar''.

## Load XQuAD-XTREME Dataset

XQuAD (Cross-lingual Question Answering Dataset) is a benchmark dataset for evaluating cross-lingual question answering
performance. The dataset consists of a subset of 240 paragraphs and 1190 question-answer pairs from the development set
of SQuAD v1.1 (Rajpurkar et al., 2016) together with their professional translations into 11 languages: Spanish, German,
Greek, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi and Romanian. Consequently, the dataset is entirely parallel
across 12 languages. https://arxiv.org/pdf/1910.11856.pdf

We also include "translate-train", "translate-dev", and "translate-test"
splits for each non-English language from XTREME (Hu et al., 2020). These can be used to run XQuAD in the "translate-train" or "translate-test" settings. https://proceedings.mlr.press/v119/hu20b/hu20b.pdf
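
To make the evaluation settings concrete, here is a minimal sketch that maps each setting to the data a model would be fine-tuned on and the split it would be evaluated on. The helper `plan_experiment` and the setting labels are hypothetical and only return descriptive strings; the split names follow the dataset described above, and `squad/train` stands for the English SQuAD v1.1 training set.

```python
def plan_experiment(setting, lang):
    """Return (training data, evaluation data) labels for one setting and language.

    Hypothetical helper for illustration only; it returns labels, not datasets.
    """
    if setting == "zero-shot":
        # Fine-tune on English SQuAD, evaluate directly on the target language.
        return ("squad/train", f"{lang}/test")
    if setting == "translate-train":
        # Fine-tune on SQuAD training data translated into the target language.
        return (f"{lang}/translate_train", f"{lang}/test")
    if setting == "translate-test":
        # Fine-tune on English SQuAD, evaluate on the test set translated into English.
        return ("squad/train", f"{lang}/translate_test")
    raise ValueError(f"unknown setting: {setting}")


for setting in ("zero-shot", "translate-train", "translate-test"):
    print(setting, plan_experiment(setting, "es"))
```

The translate settings only apply to languages that ship the `translate_*` splits, that is, the non-English languages from XTREME.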

In [None]:
from datasets import load_dataset, load_metric

langs = ["ar", "de", "zh", "vi", "en", "es", "hi", "el", "th", "tr", "ru", "ro"]
xquad = {}
for lang in langs:
    xquad[lang] = load_dataset("juletxara/xquad_xtreme", lang)

Loading each configuration reported the following split sizes (number of examples) and data sizes. English and Romanian only provide a `test` split.

| Config | test | translate_train | translate_dev | translate_test |  Download  |  Generated |
|--------|:----:|:---------------:|:-------------:|:--------------:|:----------:|:----------:|
| ar     | 1190 |      86787      |     34448     |      1151      | 420.97 MiB | 155.77 MiB |
| de     | 1190 |      82603      |     32950     |      1168      | 127.04 MiB | 111.52 MiB |
| zh     | 1190 |      85700      |     33985     |      1186      | 174.57 MiB |  88.76 MiB |
| vi     | 1190 |      87187      |     34575     |      1178      | 218.09 MiB | 137.55 MiB |
| en     | 1190 |        -        |       -       |        -       | 595.10 KiB |   1.06 MiB |
| es     | 1190 |      87488      |     34697     |      1188      | 138.41 MiB | 118.85 MiB |
| hi     | 1190 |      85804      |     34111     |      1184      | 472.23 MiB | 246.51 MiB |
| el     | 1190 |      79946      |     31869     |      1182      | 499.40 MiB | 184.85 MiB |
| th     | 1190 |      85846      |     34079     |      1157      | 461.54 MiB | 235.99 MiB |
| tr     | 1190 |      86511      |     34308     |      1112      | 151.08 MiB | 109.61 MiB |
| ru     | 1190 |      84869      |     33735     |      1190      | 513.80 MiB | 186.36 MiB |
| ro     | 1190 |        -        |       -       |        -       | 645.66 KiB |   1.24 MiB |

In [None]:
xquad

{'ar': DatasetDict({
    test: Dataset({
        features: ['id', 'context', 'question', 'answers'],
        num_rows: 1190
    })
    translate_train: Dataset({
        features: ['id', 'context', 'question', 'answers'],
        num_rows: 86787
    })
    translate_dev: Dataset({
        features: ['id', 'context', 'question', 'answers'],
        num_rows: 34448
    })
    translate_test: Dataset({
        features: ['id', 'context', 'question', 'answers'],
        num_rows: 1151
    })
}),
 'de': DatasetDict({
    test: Dataset({
        features: ['id', 'context', 'question', 'answers'],
        num_rows: 1190
    })
    translate_train: Dataset({
        features: ['id', 'context', 'question', 'answers'],
        num_rows: 82603
    })
    translate_dev: Dataset({
        features: ['id', 'context', 'question', 'answers'],
        num_rows: 32950
    })
    translate_test: Dataset({
        features: ['id', 'context', 'question', 'answers'],
        num_rows: 1168
    })
}),
 'el':

In [None]:
xquad["es"]["test"][0]

{'answers': {'answer_start': [133], 'text': ['308']},
 'context': '\ufeffLos Panthers, que además de liderar las intercepciones de la NFL con 24 y contar con cuatro jugadores de la Pro Bowl, cedieron solo 308 puntos en defensa y se sitúan en el sexto lugar de la liga. Kawann Short, tacle defensivo de la Pro Bowl, lideró al equipo con 11 capturas, 3 balones sueltos forzados y 2 recuperaciones. A su vez, el liniero Mario Addison, consiguió 6 capturas y media. En la línea de los Panthers, también destacó como ala defensiva el veterano Jared Allen ―5 veces jugador de la Pro Bowl y que fue el líder, en activo, de capturas de la NFL con 136― junto con el también ala defensiva Kony Ealy, que lleva 5 capturas en solo 9 partidos como titular. Detrás de ellos, Thomas Davis y Luke Kuechly, dos de los tres apoyadores titulares que también han sido seleccionados para jugar la Pro Bowl. Davis se hizo con 5 capturas y media, 4 balones sueltos forzados y 4 intercepciones, mientras que Kuechly lideró a

In [None]:
xquad["es"]["translate_train"][0]

{'answers': {'answer_start': [161],
  'text': ['Coleman A. Young Municipal Center']},
 'context': 'Los tribunales de Detroit son administrados por el estado y las elecciones no son partidistas. El tribunal testamentario del condado de Wayne está ubicado en el Coleman A. Young Municipal Center en el centro de Detroit. El tribunal de circuito se encuentra al otro lado de la avenida Gratiot. en el Frank Murphy Hall of Justice, en el centro de Detroit. La ciudad alberga el Trigésimo Sexto Tribunal de Distrito, así como el Primer Distrito del Tribunal de Apelaciones de Michigan y el Tribunal de Distrito de los Estados Unidos para el Distrito Este de Michigan. La ciudad proporciona la aplicación de la ley a través del Departamento de Policía de Detroit y servicios de emergencia a través del Departamento de Bomberos de Detroit.',
 'id': '5728d4d3ff5b5019007da7ba',
 'question': '¿Dónde se encuentra el tribunal testamentario del condado de Wayne?'}

In [None]:
xquad["es"]["translate_dev"][0]

{'answers': {'answer_start': [227], 'text': ['una fuerza innata de ímpetu']},
 'context': 'Las deficiencias de la física aristotélica no se corregirían por completo hasta el trabajo del siglo XVII de Galileo Galilei, quien fue influenciado por la idea medieval tardía de que los objetos en movimiento forzado llevaban una fuerza innata de ímpetu. Galileo construyó un experimento en el que las piedras y las balas de cañón fueron rodadas por una pendiente para refutar la teoría aristotélica del movimiento a principios del siglo XVII. Mostró que los cuerpos eran acelerados por la gravedad hasta un punto que era independiente de su masa y argumentó que los objetos retienen su velocidad a menos que actúen por una fuerza, por ejemplo, la fricción.',
 'id': '57373f80c3c5551400e51e91',
 'question': '¿Qué contenían los objetos en movimiento forzado según la idea medieval tardía que influyen en Aristóteles?'}

In [None]:
xquad["es"]["translate_test"][0]

{'answers': {'answer_start': [411],
  'text': ['Cobb, Shepley, Rutan and Coolidge, Holabird & Roche, and other architectural firms']},
 'context': 'The first buildings on the University of Chicago campus, which make up what is now known as the main quadrangle, were part of a "master plan" conceived by two administrators of the University of Chicago and planned by the architect Henry Ives of Chicago. The main quadrangle consists of six quadrangle, each surrounded by buildings, bordering a larger quadrangle. The main quadrangle buildings were designed by Cobb, Shepley, Rutan and Coolidge, Holabird & Roche, and other architectural firms in a mixture of Victorian Gothic and collegiate Gothic styles, used in the faculties of the University Oxford (Mitchell Tower, for example, follows the model of the Magdalena Tower in Oxford, and Commons University, Hutchinson Hall, imitates Christ Church Hall).',
 'id': '57284b904b864d19001648e4',
 'question': 'Who helped design the main quadrangle?'}
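
The records above follow the SQuAD layout, where each `answer_start` is a character offset into `context`. A quick sanity check, sketched below, is to confirm that slicing the context at that offset reproduces the answer text. The helper name `answer_is_aligned` and the toy record are made up for illustration; with the real data the same function could be applied to any example of any split.

```python
def answer_is_aligned(example):
    """Check that every answer text appears in the context at its start offset."""
    context = example["context"]
    answers = example["answers"]
    return all(
        context[start:start + len(text)] == text
        for start, text in zip(answers["answer_start"], answers["text"])
    )


# Toy record in the same layout as the examples above (contents are made up).
toy_context = "The court is located in the Municipal Center in the city centre."
toy = {
    "context": toy_context,
    "question": "Where is the court located?",
    "answers": {
        "answer_start": [toy_context.index("Municipal Center")],
        "text": ["Municipal Center"],
    },
}
print(answer_is_aligned(toy))
```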

## Load SQuAD Dataset

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage. The `squad` dataset loaded here is v1.1, which contains no unanswerable questions. https://arxiv.org/pdf/1606.05250.pdf

In [None]:
from datasets import load_dataset, load_metric

In [None]:
squad = load_dataset("squad")

Reusing dataset squad (/root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453)


  0%|          | 0/2 [00:00<?, ?it/s]

In [None]:
squad

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

In [None]:
squad["train"][0]

{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']},
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'id': '5733be284776f41900661182',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'title': 'University_of_Notre_Dame'}

In [None]:
squad["validation"][0]

{'answers': {'answer_start': [177, 177, 177],
  'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos']},
 'context': 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi\'s Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.',
 'id': '56be4db0acb8001400a502ec',
 'question': 'Which NFL team represented the AFC at Super Bo

## Preprocessing SQuAD

Load the mBERT tokenizer to process the question and context fields.

In [None]:
from transformers import AutoTokenizer

model_name = "bert-base-multilingual-cased"

tokenizer = AutoTokenizer.from_pretrained(model_name)


One thing specific to preprocessing for question answering is how to deal with very long documents. In other tasks we usually truncate them when they are longer than the model's maximum sequence length, but here, removing part of the context might result in losing the answer we are looking for. To deal with this, we allow one (long) example in our dataset to produce several input features, each shorter than the maximum length of the model (or the one we set as a hyper-parameter). In case the answer lies at the point where we split a long context, we also allow some overlap between the generated features, controlled by the hyper-parameter `doc_stride`:

In [None]:
max_length = 384 # The maximum length of a feature (question and context)
doc_stride = 128 # The authorized overlap between two parts of the context when splitting is needed.
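
As a rough illustration of how `max_length` and `doc_stride` interact, the sketch below splits a toy token list into overlapping windows. It is a simplified model, not the tokenizer's actual implementation: `split_with_stride` is a hypothetical helper, and it ignores the question and special tokens that also count toward `max_length` in the real tokenizer.

```python
def split_with_stride(context_tokens, window, stride):
    """Split tokens into windows of at most `window` tokens.

    Consecutive windows share `stride` tokens (hypothetical helper,
    a simplified picture of overflow handling).
    """
    if stride >= window:
        raise ValueError("stride must be smaller than window")
    step = window - stride  # how far each new window advances
    chunks = []
    start = 0
    while True:
        chunks.append(context_tokens[start:start + window])
        if start + window >= len(context_tokens):
            break  # the last window reached the end of the context
        start += step
    return chunks


tokens = list(range(10))  # stand-in for ten context tokens
chunks = split_with_stride(tokens, window=4, stride=2)
for chunk in chunks:
    print(chunk)
# [0, 1, 2, 3]
# [2, 3, 4, 5]
# [4, 5, 6, 7]
# [6, 7, 8, 9]
```

Each window repeats the last two tokens of the previous one, so an answer that straddles a split point still appears whole in at least one window as long as it is no longer than the overlap.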

There are a few preprocessing steps particular to question answering that we should be aware of:

1. Some examples in a dataset may have a very long context that exceeds the maximum input length of the model. Truncate only the context by setting `truncation="only_second"`.
2. Next, map the start and end positions of the answer to the original context by setting `return_offsets_mapping=True`.
3. With the mapping in hand, you can find the start and end tokens of the answer. Use the `sequence_ids` method to find which part of the offset corresponds to the question and which corresponds to the `context`.
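
The three steps above can be pictured on a toy example without loading a tokenizer. In this sketch the offsets and sequence ids are written by hand (they are assumptions standing in for what the tokenizer would return), and `char_span_to_token_span` is a hypothetical helper that mirrors the labelling logic in simplified form.

```python
def char_span_to_token_span(offsets, sequence_ids, start_char, end_char, cls_index=0):
    """Map a character span in the context to (start_token, end_token).

    Returns (cls_index, cls_index) when the answer is not fully inside
    the context tokens of this feature. Hypothetical helper for illustration.
    """
    context_tokens = [i for i, s in enumerate(sequence_ids) if s == 1]
    if not context_tokens:
        return cls_index, cls_index
    first, last = context_tokens[0], context_tokens[-1]

    # The answer must lie entirely within this feature's context window.
    if offsets[first][0] > start_char or offsets[last][1] < end_char:
        return cls_index, cls_index

    # Walk forward to the last token that starts at or before the answer start.
    start_token = first
    while start_token <= last and offsets[start_token][0] <= start_char:
        start_token += 1
    # Walk backward to the first token that ends at or after the answer end.
    end_token = last
    while end_token >= first and offsets[end_token][1] >= end_char:
        end_token -= 1
    return start_token - 1, end_token + 1


# Toy feature: [CLS] who won [SEP] Denver Broncos won [SEP]
context = "Denver Broncos won"
sequence_ids = [None, 0, 0, None, 1, 1, 1, None]
offsets = [(0, 0), (0, 3), (4, 7), (0, 0), (0, 6), (7, 14), (15, 18), (0, 0)]

print(char_span_to_token_span(offsets, sequence_ids, 0, 14))   # (4, 5)
print(char_span_to_token_span(offsets, sequence_ids, 15, 18))  # (6, 6)
print(char_span_to_token_span(offsets, sequence_ids, 20, 25))  # (0, 0)
```

The last call shows the fallback: an answer outside the window is labelled with the index of the CLS token.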

In [None]:
def prepare_train_features(examples):
    # Some of the questions have a lot of whitespace on the left, which is not useful and will make the
    # truncation of the context fail (the tokenized question will take a lot of space). So we remove that
    # left whitespace.
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that overlaps a bit with the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # The offset mappings will give us a map from token to character position in the original context. This will
    # help us compute the start_positions and end_positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Let's label those examples!
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        # We will label impossible answers with the index of the CLS token.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        # If no answers are given, set the cls_index as answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start/end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != 1:
                token_start_index += 1

            # End token index of the current span in the text.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != 1:
                token_end_index -= 1

            # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise move the token_start_index and token_end_index to the two ends of the answer.
                # Note: we could go after the last offset if the answer is the last word (edge case).
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples

Use the 🤗 Datasets `map` function to apply the preprocessing function over the entire dataset. You can speed up `map` by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don’t need.

In [None]:
squad_train = squad.map(prepare_train_features, batched=True,
                        remove_columns=squad["train"].column_names)

  0%|          | 0/88 [00:00<?, ?ba/s]

  0%|          | 0/11 [00:00<?, ?ba/s]

The evaluation features are similar to the training features. During post-processing we will have to check that a predicted span lies inside the context (and not the question) and recover the corresponding text. To do this, we need to add two things to our validation features:
- the ID of the example that generated the feature (since each example can generate several features, as seen before);
- the offset mapping that will give us a map from token indices to character positions in the context.

That's why we will re-process the validation set with the following function, slightly different from `prepare_train_features`:

In [None]:
def prepare_validation_features(examples):
    # Some of the questions have a lot of whitespace on the left, which is not useful and will make the
    # truncation of the context fail (the tokenized question will take a lot of space). So we remove that
    # left whitespace.
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that overlaps a bit with the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    # We keep the example_id that gave us this feature and we will store the offset mappings.
    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1
        
        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["id"][sample_index])

        # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token
        # position is part of the context or not.
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples
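
To see what the masking step in `prepare_validation_features` produces, here is a minimal sketch on hand-written toy values. `mask_non_context` is a hypothetical helper, and the offsets and sequence ids are assumptions standing in for real tokenizer output.

```python
def mask_non_context(offsets, sequence_ids, context_index=1):
    """Replace offsets of question and special tokens with None (hypothetical helper)."""
    return [o if s == context_index else None for o, s in zip(offsets, sequence_ids)]


# Toy feature: [CLS] who won [SEP] Denver Broncos won [SEP]
sequence_ids = [None, 0, 0, None, 1, 1, 1, None]
offsets = [(0, 0), (0, 3), (4, 7), (0, 0), (0, 6), (7, 14), (15, 18), (0, 0)]

masked = mask_non_context(offsets, sequence_ids)
print(masked)
# [None, None, None, None, (0, 6), (7, 14), (15, 18), None]
```

After masking, a simple `is None` check is enough to tell whether a predicted token position belongs to the context.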

Use the map function again to apply the preprocessing function over the validation dataset. You can speed up the map function by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don’t need.

In [None]:
squad_eval = squad["validation"].map(prepare_validation_features, batched=True,
                                     remove_columns=squad["validation"].column_names)

  0%|          | 0/11 [00:00<?, ?ba/s]

## Fine-tuning mBERT

Now that our data is ready for training, we can download the pretrained model and fine-tune it. Since our task is question answering, we use the `AutoModelForQuestionAnswering` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us:

In [None]:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

model_name = "bert-base-multilingual-cased"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

Downloading:   0%|          | 0.00/681M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertForQuestionAnswering: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-bas

Then we need a data collator that will batch our processed examples together. Here the default one will work:

In [None]:
from transformers import DefaultDataCollator

data_collator = DefaultDataCollator()

To instantiate a `Trainer`, we need to define a few more things. The most important is the [`TrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments), a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model; all other arguments are optional:

In [None]:
batch_size = 16
args = TrainingArguments(
    "bert-base-multilingual-cased-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True,
)

We will evaluate our model and compute metrics in the next section (this is a very long operation, so we will only compute the evaluation loss during training).

Then we just need to pass all of this along with our datasets to the `Trainer`:

In [None]:
trainer = Trainer(
    model,
    args,
    train_dataset=squad_train["train"],
    eval_dataset=squad_train["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

We can now finetune our model by just calling the `train` method:

In [None]:
# trainer.train()

In [None]:
trainer.push_to_hub()

## Evaluating mBERT

We load a model that is already finetuned on SQuAD to save time. We evaluate on the validation set of SQuAD.

In [None]:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

model_name = "salti/bert-base-multilingual-cased-finetuned-squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="bert-base-multilingual-cased-finetuned-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

https://huggingface.co/salti/bert-base-multilingual-cased-finetuned-squad/resolve/main/config.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpx7vo12og


Downloading:   0%|          | 0.00/822 [00:00<?, ?B/s]

storing https://huggingface.co/salti/bert-base-multilingual-cased-finetuned-squad/resolve/main/config.json in cache at /root/.cache/huggingface/transformers/1df6572a9ae2fd1152d4fa4e3b9d30e0d303c69cb87d5b8401ef5cb032016bef.aa91fcc51e661ddbf70fda4906759b1d8178a512385633adf5d0db934ae1e333
creating metadata file for /root/.cache/huggingface/transformers/1df6572a9ae2fd1152d4fa4e3b9d30e0d303c69cb87d5b8401ef5cb032016bef.aa91fcc51e661ddbf70fda4906759b1d8178a512385633adf5d0db934ae1e333
loading configuration file https://huggingface.co/salti/bert-base-multilingual-cased-finetuned-squad/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/1df6572a9ae2fd1152d4fa4e3b9d30e0d303c69cb87d5b8401ef5cb032016bef.aa91fcc51e661ddbf70fda4906759b1d8178a512385633adf5d0db934ae1e333
Model config BertConfig {
  "_name_or_path": "salti/bert-base-multilingual-cased-finetuned-squad",
  "architectures": [
    "BertForQuestionAnswering"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_d

Downloading:   0%|          | 0.00/676M [00:00<?, ?B/s]

storing https://huggingface.co/salti/bert-base-multilingual-cased-finetuned-squad/resolve/main/pytorch_model.bin in cache at /root/.cache/huggingface/transformers/9f54849aca742a855728dc8a74c1a733627678a2e8c7a97ba60e2b318ad1438a.b9c031b09975cb84030a5da7731e0b35844cbf7eed5f7ddc0dc0704ba6cc5802
creating metadata file for /root/.cache/huggingface/transformers/9f54849aca742a855728dc8a74c1a733627678a2e8c7a97ba60e2b318ad1438a.b9c031b09975cb84030a5da7731e0b35844cbf7eed5f7ddc0dc0704ba6cc5802
loading weights file https://huggingface.co/salti/bert-base-multilingual-cased-finetuned-squad/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/9f54849aca742a855728dc8a74c1a733627678a2e8c7a97ba60e2b318ad1438a.b9c031b09975cb84030a5da7731e0b35844cbf7eed5f7ddc0dc0704ba6cc5802
All model checkpoint weights were used when initializing BertForQuestionAnswering.

All the weights of BertForQuestionAnswering were initialized from the model checkpoint at salti/bert-base-multilingual-c

We can grab the predictions for all features by using the `Trainer.predict` method:

In [None]:
raw_predictions = trainer.predict(squad_eval)

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 10851
  Batch size = 16


The `Trainer` *hides* the columns that are not used by the model (here `example_id` and `offset_mapping` which we will need for our post-processing), so we set them back:

In [None]:
squad_eval.set_format(type=squad_eval.format["type"], 
                      columns=list(squad_eval.features.keys()))

Since we set the offset mappings to `None` for tokens that belong to the question, it is easy to check whether an answer lies fully inside the context. We also discard very long answers, using a hyper-parameter (`max_answer_length`) that we can tune.

To post-process all features, we need a map between examples and their corresponding features. Since one example can give several features, we have to gather the answers from all the features generated by a given example, then pick the best one. The following code builds a map from each example index to the indices of its features and uses it to select the final answer.
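
The grouping of features by example can be sketched on toy ids. `group_features_by_example` is a hypothetical helper that isolates the first few lines of the post-processing function; the ids are made up for illustration.

```python
import collections


def group_features_by_example(example_ids, feature_example_ids):
    """Map each example index to the indices of the features it produced."""
    example_id_to_index = {k: i for i, k in enumerate(example_ids)}
    features_per_example = collections.defaultdict(list)
    for feature_index, example_id in enumerate(feature_example_ids):
        features_per_example[example_id_to_index[example_id]].append(feature_index)
    return features_per_example


example_ids = ["a", "b"]               # two examples
feature_example_ids = ["a", "a", "b"]  # example "a" was split into two features

groups = group_features_by_example(example_ids, feature_example_ids)
print(dict(groups))  # {0: [0, 1], 1: [2]}
```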

In [None]:
from tqdm.auto import tqdm
import collections
import numpy as np

def postprocess_qa_predictions(examples, features, raw_predictions, n_best_size=20, max_answer_length=30):
    all_start_logits, all_end_logits = raw_predictions
    # Build a map from each example to its corresponding features.
    example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
    features_per_example = collections.defaultdict(list)
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)

    # The dictionaries we have to fill.
    predictions = collections.OrderedDict()

    # Logging.
    print(f"Post-processing {len(examples)} example predictions split into {len(features)} features.")

    # Let's loop over all the examples!
    for example_index, example in enumerate(tqdm(examples)):
        # Those are the indices of the features associated to the current example.
        feature_indices = features_per_example[example_index]

        valid_answers = []
        
        context = example["context"]
        # Looping through all the features associated to the current example.
        for feature_index in feature_indices:
            # We grab the predictions of the model for this feature.
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            # This is what will allow us to map the positions in our logits to spans of text in the original
            # context.
            offset_mapping = features[feature_index]["offset_mapping"]

            # Go through all possibilities for the `n_best_size` greatest start and end logits.
            start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
            end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
                    # to part of the input_ids that are not in the context.
                    if (
                        start_index >= len(offset_mapping)
                        or end_index >= len(offset_mapping)
                        or offset_mapping[start_index] is None
                        or offset_mapping[end_index] is None
                    ):
                        continue
                    # Don't consider answers with a length that is either < 0 or > max_answer_length.
                    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                        continue

                    start_char = offset_mapping[start_index][0]
                    end_char = offset_mapping[end_index][1]
                    valid_answers.append(
                        {
                            "score": start_logits[start_index] + end_logits[end_index],
                            "text": context[start_char: end_char]
                        }
                    )
        
        if len(valid_answers) > 0:
            best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
        else:
            # In the very rare edge case where we do not have a single non-null prediction, we create a fake
            # prediction to avoid failure.
            best_answer = {"text": "", "score": 0.0}
        
        # Let's pick our final answer: the best one
        predictions[example["id"]] = best_answer["text"]

    return predictions
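
To make the span selection concrete, the following sketch applies the same filtering rules to a single toy feature. It is a simplified, pure-Python version under stated assumptions: `best_span` is a hypothetical helper, the logits are made up, and the offsets are assumed to be already masked with `None` outside the context.

```python
def best_span(start_logits, end_logits, offsets, n_best_size=2, max_answer_length=3):
    """Return (score, start_char, end_char) of the best valid span, or None."""
    def top(xs):
        # Indices of the `n_best_size` largest values.
        return sorted(range(len(xs)), key=lambda i: xs[i], reverse=True)[:n_best_size]

    best = None
    for s in top(start_logits):
        for e in top(end_logits):
            # Skip positions outside the context (question or special tokens).
            if offsets[s] is None or offsets[e] is None:
                continue
            # Skip spans that end before they start or are too long.
            if e < s or e - s + 1 > max_answer_length:
                continue
            score = start_logits[s] + end_logits[e]
            if best is None or score > best[0]:
                best = (score, offsets[s][0], offsets[e][1])
    return best


context = "Denver Broncos won"
offsets = [None, None, None, None, (0, 6), (7, 14), (15, 18), None]
start_logits = [5.0, 0.0, 0.0, 0.0, 4.0, 1.0, 0.5, 0.0]
end_logits = [5.0, 0.0, 0.0, 0.0, 0.5, 3.0, 1.0, 0.0]

score, start_char, end_char = best_span(start_logits, end_logits, offsets)
print(score, context[start_char:end_char])  # 7.0 Denver Broncos
```

Note that the highest logits sit on position 0 (the CLS token in this toy layout), but that candidate is discarded because its offset is `None`.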

And we can apply our post-processing function to our raw predictions:

In [None]:
final_predictions = postprocess_qa_predictions(squad["validation"], squad_eval, raw_predictions.predictions)

Post-processing 10570 example predictions split into 10851 features.


  0%|          | 0/10570 [00:00<?, ?it/s]

Then we can load the metric from the datasets library.

In [None]:
from datasets import load_metric
metric = load_metric("squad")

Downloading builder script:   0%|          | 0.00/1.72k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/1.12k [00:00<?, ?B/s]

Then we can call `compute` on it. We just need to format the predictions and labels a bit, as it expects lists of dictionaries and not one big dictionary.

In [None]:
formatted_predictions = [{"id": k, "prediction_text": v} for k, v in final_predictions.items()]
references = [{"id": ex["id"], "answers": ex["answers"]} for ex in squad["validation"]]
metric.compute(predictions=formatted_predictions, references=references)

{'exact_match': 81.90160832544939, 'f1': 89.121876471452}
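
For intuition about the two numbers, here is a simplified sketch of how exact match and token-level F1 can be computed for a single prediction against a single reference. It is not the metric's actual code: the helper names are hypothetical, and the real metric additionally takes the maximum over all reference answers and averages over the dataset.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction, reference):
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    # Multiset intersection counts tokens shared by both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Denver Broncos", "Denver Broncos"))  # 1.0
print(round(f1_score("Denver", "Denver Broncos"), 4))       # 0.6667
```

A partially correct answer gets no credit under exact match but a proportional score under F1, which is why F1 is always at least as high as exact match.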

## Zero-Shot mBERT

Zero-Shot performance of the mBERT model fine-tuned on SQuAD.

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "salti/bert-base-multilingual-cased-finetuned-squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="bert-base-multilingual-cased-finetuned-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

Apply the validation preprocessing function to the XQuAD test split of each language with the `map` function. You can speed up `map` by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don’t need.

In [None]:
from collections import defaultdict
langs = ["ar", "de", "zh", "vi", "en", "es", "hi", "el", "th", "tr", "ru", "ro"]
split = "test"

xquad_prep = defaultdict(dict)

def map_datasets(langs, split, prepare_features):
    for lang in langs:
        xquad_prep[lang][split] = xquad[lang][split].map(prepare_features, batched=True, 
                                    remove_columns=xquad[lang][split].column_names)

In [None]:
map_datasets(langs, split, prepare_validation_features)

  0%|          | 0/2 [00:00<?, ?ba/s]

NameError: ignored

In [None]:
def compute_results(langs, split):
    results = {}
    for lang in langs:
        # Grab the predictions for all features with the `Trainer.predict` method.
        raw_predictions = trainer.predict(xquad_prep[lang][split])

        # The `Trainer` hides the columns not used by the model (`example_id` and `offset_mapping`),
        # which we need for post-processing, so we set them back.
        xquad_prep[lang][split].set_format(
            type=xquad_prep[lang][split].format["type"],
            columns=list(xquad_prep[lang][split].features.keys()),
        )
        
        # And we can apply our post-processing function to our raw predictions
        final_predictions = postprocess_qa_predictions(xquad[lang][split], xquad_prep[lang][split], raw_predictions.predictions)

        # We just need to format predictions and labels a bit as it expects a list of dictionaries and not one big dictionary.
        formatted_predictions = [{"id": k, "prediction_text": v} for k, v in final_predictions.items()]
        references = [{"id": ex["id"], "answers": ex["answers"]} for ex in xquad[lang][split]]
        results[lang] = metric.compute(predictions=formatted_predictions, references=references)
    return results

In [None]:
results_zero_shot_mbert = compute_results(langs, split)
print(results_zero_shot_mbert)

In [None]:
import pandas as pd
def results_df(results_dict, model):
    F1colname = "F1_" + model
    EMcolname = "EM_" + model
    dict_results = defaultdict(list)
    for lang, scores in results_dict.items():
        dict_results["lang"].append(lang)
        dict_results[F1colname].append(scores['f1'])
        dict_results[EMcolname].append(scores['exact_match'])

    avg_f1 = np.average(dict_results[F1colname])
    avg_em = np.average(dict_results[EMcolname])
    dict_results["lang"].append('avg')
    dict_results[F1colname].append(avg_f1)
    dict_results[EMcolname].append(avg_em)
    df_results = pd.DataFrame(dict_results).round(2)
    return df_results

In [None]:
df_results_zero_shot_mbert = results_df(results_zero_shot_mbert, "ZS_mbert")
df_results_zero_shot_mbert.to_csv("results/results_zero_shot_mbert.csv")
df_results_zero_shot_mbert

## Zero-Shot XLM-R

Zero-Shot performance of the XLM-R model fine-tuned on SQuAD.

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "vanichandna/xlm-roberta-finetuned-squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name, from_tf=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="xlm-roberta-finetuned-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

loading configuration file https://huggingface.co/vanichandna/xlm-roberta-finetuned-squad/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/6e9fab04c4168068e4162152496d68d0594b8a838953657e9234b70b2c4932fb.b04b4828cf6cfcbbcba34b7e4fc29fe9a0563001f3cedda0f9cf6487875eae92
Model config XLMRobertaConfig {
  "_name_or_path": "vanichandna/xlm-roberta-finetuned-squad",
  "architectures": [
    "XLMRobertaForQuestionAnswering"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "xlm-roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.19.2",
  "type_vocab_size": 1,
  "use_c

Apply the validation preprocessing function to the XQuAD test split of each language again, this time with the XLM-R tokenizer. You can speed up `map` by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don’t need.

In [None]:
langs = ["ar", "de", "zh", "vi", "en", "es", "hi", "el", "th", "tr", "ru", "ro"]
split = "test"

map_datasets(langs, split, prepare_validation_features)

Loading cached processed dataset at /root/.cache/huggingface/datasets/juletxara___xquad/ar/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40/cache-170be1790677e51e.arrow


  0%|          | 0/2 [00:00<?, ?ba/s]

Loading cached processed dataset at /root/.cache/huggingface/datasets/juletxara___xquad/zh/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40/cache-e18d12a46e600624.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/juletxara___xquad/vi/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40/cache-8ff2f81c60b146f3.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/juletxara___xquad/en/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40/cache-e9942ab8a849638f.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/juletxara___xquad/es/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40/cache-a5dd56cf46af11e0.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/juletxara___xquad/hi/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40/cache-4b9e7a84397ccee5.arrow
Loading cached processed dataset at /root/.ca

In [None]:
results_zero_shot_xlm_r = compute_results(langs, split)
print(results_zero_shot_xlm_r)

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1292
  Batch size = 16


Post-processing 1190 example predictions split into 1292 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1269
  Batch size = 16


Post-processing 1190 example predictions split into 1269 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1230
  Batch size = 16


Post-processing 1190 example predictions split into 1230 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1287
  Batch size = 16


Post-processing 1190 example predictions split into 1287 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1249
  Batch size = 16


Post-processing 1190 example predictions split into 1249 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1275
  Batch size = 16


Post-processing 1190 example predictions split into 1275 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1342
  Batch size = 16


Post-processing 1190 example predictions split into 1342 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1417
  Batch size = 16


Post-processing 1190 example predictions split into 1417 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1283
  Batch size = 16


Post-processing 1190 example predictions split into 1283 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1253
  Batch size = 16


Post-processing 1190 example predictions split into 1253 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1292
  Batch size = 16


Post-processing 1190 example predictions split into 1292 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1296
  Batch size = 16


Post-processing 1190 example predictions split into 1296 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

{'ar': {'exact_match': 52.10084033613445, 'f1': 67.9083701500317}, 'de': {'exact_match': 59.831932773109244, 'f1': 75.30241062207294}, 'zh': {'exact_match': 54.95798319327731, 'f1': 64.99449476760391}, 'vi': {'exact_match': 54.53781512605042, 'f1': 73.6063652986747}, 'en': {'exact_match': 73.78151260504201, 'f1': 84.39720349086315}, 'es': {'exact_match': 59.2436974789916, 'f1': 76.97376109748436}, 'hi': {'exact_match': 52.52100840336134, 'f1': 69.01052539755287}, 'el': {'exact_match': 56.97478991596638, 'f1': 74.33620155037423}, 'th': {'exact_match': 56.38655462184874, 'f1': 67.99273042550345}, 'tr': {'exact_match': 51.76470588235294, 'f1': 67.9781301084965}, 'ru': {'exact_match': 58.57142857142857, 'f1': 75.1266201920252}, 'ro': {'exact_match': 66.30252100840336, 'f1': 80.02036648068083}}
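
The `exact_match` and `f1` keys above look like SQuAD-style metrics. As a rough illustration of what they measure, here is a minimal sketch that scores one prediction the way the SQuAD v1.1 evaluation does for English text. The function names are ours and purely illustrative; the metric that `compute_results` actually calls is not shown in this section and may treat non-English text differently.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def score(prediction, references):
    """Keep the best score over all reference answers, scaled to 0-100."""
    return {
        "exact_match": 100.0 * max(exact_match(prediction, r) for r in references),
        "f1": 100.0 * max(token_f1(prediction, r) for r in references),
    }


# A prediction that contains the reference plus extra words:
# no exact match, but partial credit from F1.
example = score("the Eiffel Tower in Paris", ["Eiffel Tower"])
print(example)
```

Because F1 gives partial credit for overlapping tokens, it is always at least as high as exact match, which is the pattern in every row of the results.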


In [None]:
df_results_zero_shot_xlm_r = results_df(results_zero_shot_xlm_r, "ZS_xml_r")
df_results_zero_shot_xlm_r.to_csv("results/results_zero_shot_xlm_r.csv")
df_results_zero_shot_xlm_r

Unnamed: 0,lang,F1_ZS_xml_r,EM_ZS_xml_r
0,ar,67.91,52.1
1,de,75.3,59.83
2,zh,64.99,54.96
3,vi,73.61,54.54
4,en,84.4,73.78
5,es,76.97,59.24
6,hi,69.01,52.52
7,el,74.34,56.97
8,th,67.99,56.39
9,tr,67.98,51.76


## Zero-Shot XLM-R-large

Zero-shot performance on XQuAD of the XLM-R-large model fine-tuned on SQuAD.

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "Palak/xlm-roberta-large_squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

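batch_size = 16
# The Trainer is only used for prediction (train_dataset=None), so the
# training hyperparameters below have no effect on the reported results.
training_args = TrainingArguments(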
    output_dir="xlm-roberta-large_squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

loading configuration file https://huggingface.co/Palak/xlm-roberta-large_squad/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/15ba31e9d5d227c3e7514429b9e154b4dd9fa20dc6634e2aa859a6570580ab8a.e0ee6561aeb1e6c81ca8555d0aed71ddc7b312c694416a29b2dd0b16df13a0b3
Model config XLMRobertaConfig {
  "_name_or_path": "Palak/xlm-roberta-large_squad",
  "architectures": [
    "XLMRobertaForQuestionAnswering"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "xlm-roberta",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "output_past": true,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.19.2",
  "type_vocab_size": 1,

Use the `map` function again to apply the preprocessing function to the evaluation split of each language (here, `test`). You can speed up `map` by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don't need.
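
To make the batched call concrete, here is a small pure-Python sketch of what a batched preprocessing function looks like: it receives a dict of lists and returns a dict of lists, possibly with more rows than it was given. This is why the logs report more features than examples, and why the original columns have to be dropped when the row count changes. `split_into_windows` and `map_batched` are hypothetical stand-ins, not the notebook's `prepare_validation_features` or the 🤗 Datasets implementation.

```python
def split_into_windows(batch, max_len=8, stride=4):
    """Batched preprocessing: dict of lists in, dict of lists out.

    A long context yields several windows that overlap by `stride` tokens,
    so the output can contain more rows than the input batch.
    """
    out = {"example_id": [], "window": []}
    for example_id, tokens in zip(batch["id"], batch["tokens"]):
        start = 0
        while True:
            out["example_id"].append(example_id)
            out["window"].append(tokens[start:start + max_len])
            if start + max_len >= len(tokens):
                break
            start += max_len - stride
    return out


def map_batched(rows, fn, batch_size=2):
    """Toy stand-in for a batched map: call fn on column-wise batches."""
    features = {}
    for i in range(0, len(rows), batch_size):
        chunk = rows[i:i + batch_size]
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        for key, values in fn(batch).items():
            features.setdefault(key, []).extend(values)
    return features


rows = [
    {"id": "q1", "tokens": list(range(5))},   # fits in one window
    {"id": "q2", "tokens": list(range(14))},  # needs several windows
]
features = map_batched(rows, split_into_windows)
print(len(rows), "examples ->", len(features["window"]), "features")
print(features["example_id"])
```

The `example_id` column plays the same role as in the real pipeline: it lets post-processing map each feature back to the question it came from.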

In [None]:
langs = ["ar", "de", "zh", "vi", "en", "es", "hi", "el", "th", "tr", "ru", "ro"]
split = "test"

map_datasets(langs, split, prepare_validation_features)

Loading cached processed dataset at /root/.cache/huggingface/datasets/juletxara___xquad/ar/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40/cache-fed9318934c744e6.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/juletxara___xquad/de/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40/cache-edb00a787e8fc2e0.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/juletxara___xquad/zh/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40/cache-7a849a9896f183e1.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/juletxara___xquad/vi/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40/cache-f7db87079a79f88c.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/juletxara___xquad/en/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40/cache-ad4fc05a7f9f4f09.arrow
Loading cached processed dataset at /root/.ca

In [None]:
results_zero_shot_xlm_r_large = compute_results(langs, split)
print(results_zero_shot_xlm_r_large)

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1292
  Batch size = 16


Post-processing 1190 example predictions split into 1292 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1269
  Batch size = 16


Post-processing 1190 example predictions split into 1269 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1230
  Batch size = 16


Post-processing 1190 example predictions split into 1230 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1287
  Batch size = 16


Post-processing 1190 example predictions split into 1287 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1249
  Batch size = 16


Post-processing 1190 example predictions split into 1249 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1275
  Batch size = 16


Post-processing 1190 example predictions split into 1275 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1342
  Batch size = 16


Post-processing 1190 example predictions split into 1342 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1417
  Batch size = 16


Post-processing 1190 example predictions split into 1417 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1283
  Batch size = 16


Post-processing 1190 example predictions split into 1283 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1253
  Batch size = 16


Post-processing 1190 example predictions split into 1253 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1292
  Batch size = 16


Post-processing 1190 example predictions split into 1292 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1296
  Batch size = 16


Post-processing 1190 example predictions split into 1296 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

{'ar': {'exact_match': 57.983193277310924, 'f1': 75.02501274311958}, 'de': {'exact_match': 63.78151260504202, 'f1': 79.88947322002413}, 'zh': {'exact_match': 57.983193277310924, 'f1': 66.845738295318}, 'vi': {'exact_match': 59.32773109243698, 'f1': 79.03355276548136}, 'en': {'exact_match': 75.88235294117646, 'f1': 86.47668134937751}, 'es': {'exact_match': 62.6890756302521, 'f1': 81.0434426032045}, 'hi': {'exact_match': 60.7563025210084, 'f1': 76.01546703765584}, 'el': {'exact_match': 61.260504201680675, 'f1': 79.08708559799136}, 'th': {'exact_match': 61.680672268907564, 'f1': 72.82312622018495}, 'tr': {'exact_match': 58.319327731092436, 'f1': 74.13915628372908}, 'ru': {'exact_match': 63.109243697478995, 'f1': 80.29068172686758}, 'ro': {'exact_match': 70.16806722689076, 'f1': 83.50450484850974}}


In [None]:
df_results_zero_shot_xlm_r_large = results_df(results_zero_shot_xlm_r_large, "ZS_xml_r_large")
df_results_zero_shot_xlm_r_large.to_csv("results/results_zero_shot_xlm_r_large.csv")
df_results_zero_shot_xlm_r_large

Unnamed: 0,lang,F1_ZS_xml_r_large,EM_ZS_xml_r_large
0,ar,75.03,57.98
1,de,79.89,63.78
2,zh,66.85,57.98
3,vi,79.03,59.33
4,en,86.48,75.88
5,es,81.04,62.69
6,hi,76.02,60.76
7,el,79.09,61.26
8,th,72.82,61.68
9,tr,74.14,58.32
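
To see where the larger model helps most, we can subtract the XLM-R F1 scores from the XLM-R-large ones. The sketch below is only an illustration: it copies the F1 values for the ten languages shown in the two tables, and the variable names are ours.

```python
# F1 values copied from the two zero-shot tables (ten languages shown).
f1_base = {"ar": 67.91, "de": 75.3, "zh": 64.99, "vi": 73.61, "en": 84.4,
           "es": 76.97, "hi": 69.01, "el": 74.34, "th": 67.99, "tr": 67.98}
f1_large = {"ar": 75.03, "de": 79.89, "zh": 66.85, "vi": 79.03, "en": 86.48,
            "es": 81.04, "hi": 76.02, "el": 79.09, "th": 72.82, "tr": 74.14}

# Per-language F1 gain of XLM-R-large over XLM-R, in points.
gains = {lang: round(f1_large[lang] - f1_base[lang], 2) for lang in f1_base}

# Languages ordered from largest to smallest gain.
ranked = sorted(gains, key=gains.get, reverse=True)

for lang in ranked:
    print(f"{lang}: +{gains[lang]:.2f}")
```

With these numbers, every language improves. Arabic and Hindi gain the most, and Chinese and English gain the least.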


## Translate Test mBERT

We take the mBERT model fine-tuned on SQuAD and evaluate it on test data that we translated from the target language into English. We use mBERT here, but a monolingual English model would also work, because we only evaluate on English data.

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "salti/bert-base-multilingual-cased-finetuned-squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

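batch_size = 16
# The Trainer is only used for prediction (train_dataset=None), so the
# training hyperparameters below have no effect on the reported results.
training_args = TrainingArguments(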
    output_dir="bert-base-multilingual-cased-finetuned-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

loading configuration file https://huggingface.co/salti/bert-base-multilingual-cased-finetuned-squad/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/1df6572a9ae2fd1152d4fa4e3b9d30e0d303c69cb87d5b8401ef5cb032016bef.aa91fcc51e661ddbf70fda4906759b1d8178a512385633adf5d0db934ae1e333
Model config BertConfig {
  "_name_or_path": "salti/bert-base-multilingual-cased-finetuned-squad",
  "architectures": [
    "BertForQuestionAnswering"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per

In [None]:
langs = ["ar", "de", "zh", "vi", "es", "hi", "el", "th", "tr", "ru"]
split = "translate_test"

map_datasets(langs, split, prepare_validation_features)

  0%|          | 0/2 [00:00<?, ?ba/s]


In [None]:
results_translate_test_mbert = compute_results(langs, split)
print(results_translate_test_mbert)

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1189
  Batch size = 16


Post-processing 1151 example predictions split into 1189 features.


  0%|          | 0/1151 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1207
  Batch size = 16


Post-processing 1168 example predictions split into 1207 features.


  0%|          | 0/1168 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1223
  Batch size = 16


Post-processing 1186 example predictions split into 1223 features.


  0%|          | 0/1186 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1216
  Batch size = 16


Post-processing 1178 example predictions split into 1216 features.


  0%|          | 0/1178 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1235
  Batch size = 16


Post-processing 1188 example predictions split into 1235 features.


  0%|          | 0/1188 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1222
  Batch size = 16


Post-processing 1184 example predictions split into 1222 features.


  0%|          | 0/1184 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1219
  Batch size = 16


Post-processing 1182 example predictions split into 1219 features.


  0%|          | 0/1182 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1200
  Batch size = 16


Post-processing 1157 example predictions split into 1200 features.


  0%|          | 0/1157 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1144
  Batch size = 16


Post-processing 1112 example predictions split into 1144 features.


  0%|          | 0/1112 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1237
  Batch size = 16


Post-processing 1190 example predictions split into 1237 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

{'ar': {'exact_match': 55.77758470894874, 'f1': 70.44619460475333}, 'de': {'exact_match': 63.27054794520548, 'f1': 76.73134579887086}, 'zh': {'exact_match': 56.57672849915683, 'f1': 70.12370423516863}, 'vi': {'exact_match': 55.602716468590835, 'f1': 70.63286731677346}, 'es': {'exact_match': 65.06734006734007, 'f1': 78.73225869616985}, 'hi': {'exact_match': 55.8277027027027, 'f1': 70.57554022304278}, 'el': {'exact_match': 61.92893401015228, 'f1': 75.9954399757002}, 'th': {'exact_match': 45.894554883318925, 'f1': 60.04206160508226}, 'tr': {'exact_match': 42.71582733812949, 'f1': 61.64366188342535}, 'ru': {'exact_match': 63.109243697478995, 'f1': 76.6412196213264}}


In [None]:
df_results_translate_test_mbert = results_df(results_translate_test_mbert, "TT_mbert")
df_results_translate_test_mbert.to_csv("results/results_translate_test_mbert.csv")
df_results_translate_test_mbert

Unnamed: 0,lang,F1_TT_mbert,EM_TT_mbert
0,ar,70.45,55.78
1,de,76.73,63.27
2,zh,70.12,56.58
3,vi,70.63,55.6
4,es,78.73,65.07
5,hi,70.58,55.83
6,el,76.0,61.93
7,th,60.04,45.89
8,tr,61.64,42.72
9,ru,76.64,63.11


## Translate Test BERT

We take a model fine-tuned on SQuAD and evaluate it on test data that we translated from the target language into English. Here we use a monolingual English BERT model, which is sufficient because we only evaluate on English data.

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "rsvp-ai/bertserini-bert-base-squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

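batch_size = 16
# The Trainer is only used for prediction (train_dataset=None), so the
# training hyperparameters below have no effect on the reported results.
training_args = TrainingArguments(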
    output_dir="bertserini-bert-base-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

loading configuration file https://huggingface.co/rsvp-ai/bertserini-bert-base-squad/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/ed70b9dc9bc27dc87b2280c39cf44cf6a203a269dedcd004633a5eec0898d233.1afc8e7ce4fe9efe54e1f9ac968f42b33a2baac9f294d23e1123658693b0ec80
Model config BertConfig {
  "_name_or_path": "rsvp-ai/bertserini-bert-base-squad",
  "architectures": [
    "BertForQuestionAnswering"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.19.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file 

In [None]:
langs = ["ar", "de", "zh", "vi", "es", "hi", "el", "th", "tr", "ru"]
split = "translate_test"

map_datasets(langs, split, prepare_validation_features)

Loading cached processed dataset at /root/.cache/huggingface/datasets/juletxara___xquad/ar/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40/cache-1a5399a6c07ceec8.arrow


  0%|          | 0/2 [00:00<?, ?ba/s]

Loading cached processed dataset at /root/.cache/huggingface/datasets/juletxara___xquad/zh/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40/cache-4f8f9b9203bc7fe2.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/juletxara___xquad/vi/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40/cache-307ce8d974bd8ee1.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/juletxara___xquad/es/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40/cache-acbb07ca315efefa.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/juletxara___xquad/hi/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40/cache-b32b0a83ae4ea00d.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/juletxara___xquad/el/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40/cache-63a995805cfdc313.arrow
Loading cached processed dataset at /root/.ca

In [None]:
results_translate_test_bert = compute_results(langs, split)
print(results_translate_test_bert)

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1175
  Batch size = 16


Post-processing 1151 example predictions split into 1175 features.


  0%|          | 0/1151 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1198
  Batch size = 16


Post-processing 1168 example predictions split into 1198 features.


  0%|          | 0/1168 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1222
  Batch size = 16


Post-processing 1186 example predictions split into 1222 features.


  0%|          | 0/1186 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1209
  Batch size = 16


Post-processing 1178 example predictions split into 1209 features.


  0%|          | 0/1178 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1224
  Batch size = 16


Post-processing 1188 example predictions split into 1224 features.


  0%|          | 0/1188 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1215
  Batch size = 16


Post-processing 1184 example predictions split into 1215 features.


  0%|          | 0/1184 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1215
  Batch size = 16


Post-processing 1182 example predictions split into 1215 features.


  0%|          | 0/1182 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1188
  Batch size = 16


Post-processing 1157 example predictions split into 1188 features.


  0%|          | 0/1157 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1140
  Batch size = 16


Post-processing 1112 example predictions split into 1140 features.


  0%|          | 0/1112 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1232
  Batch size = 16


Post-processing 1190 example predictions split into 1232 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

{'ar': {'exact_match': 54.995655951346656, 'f1': 69.41258276904172}, 'de': {'exact_match': 62.67123287671233, 'f1': 75.71406459678344}, 'zh': {'exact_match': 55.98650927487353, 'f1': 69.88326379523107}, 'vi': {'exact_match': 58.31918505942275, 'f1': 72.22220607235384}, 'es': {'exact_match': 62.62626262626262, 'f1': 77.15532703184773}, 'hi': {'exact_match': 53.71621621621622, 'f1': 69.68960251827099}, 'el': {'exact_match': 60.575296108291035, 'f1': 74.958777632221}, 'th': {'exact_match': 46.49956784788245, 'f1': 60.50298205744256}, 'tr': {'exact_match': 41.81654676258993, 'f1': 59.87210519455375}, 'ru': {'exact_match': 60.50420168067227, 'f1': 74.89718180819315}}


In [None]:
df_results_translate_test_bert = results_df(results_translate_test_bert, "TT_bert")
df_results_translate_test_bert.to_csv("results/results_translate_test_bert.csv")
df_results_translate_test_bert

Unnamed: 0,lang,F1_TT_bert,EM_TT_bert
0,ar,69.41,55.0
1,de,75.71,62.67
2,zh,69.88,55.99
3,vi,72.22,58.32
4,es,77.16,62.63
5,hi,69.69,53.72
6,el,74.96,60.58
7,th,60.5,46.5
8,tr,59.87,41.82
9,ru,74.9,60.5
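
To compare the two translate-test runs, we can average the F1 scores over the ten languages and check which languages favour the monolingual model. This is a minimal sketch using the rounded F1 values from the `TT_mbert` and `TT_bert` tables; the variable names are ours. The mean is unweighted, even though the logs show that the number of evaluated examples differs by language (from 1112 to 1190).

```python
# F1 values copied from the TT_mbert and TT_bert tables.
f1_tt_mbert = {"ar": 70.45, "de": 76.73, "zh": 70.12, "vi": 70.63, "es": 78.73,
               "hi": 70.58, "el": 76.0, "th": 60.04, "tr": 61.64, "ru": 76.64}
f1_tt_bert = {"ar": 69.41, "de": 75.71, "zh": 69.88, "vi": 72.22, "es": 77.16,
              "hi": 69.69, "el": 74.96, "th": 60.5, "tr": 59.87, "ru": 74.9}


def macro_average(scores):
    """Unweighted mean over languages."""
    return sum(scores.values()) / len(scores)


mean_mbert = macro_average(f1_tt_mbert)
mean_bert = macro_average(f1_tt_bert)

# Languages where the monolingual BERT model scores higher than mBERT.
bert_wins = [lang for lang in f1_tt_mbert if f1_tt_bert[lang] > f1_tt_mbert[lang]]

print(f"mean F1 mBERT: {mean_mbert:.2f}")
print(f"mean F1 BERT:  {mean_bert:.2f}")
print("BERT ahead on:", bert_wins)
```

On these numbers, mBERT is slightly ahead on average, and the monolingual model is ahead only on Vietnamese and Thai.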


## Translate Test BERT-large

We take a model fine-tuned on SQuAD and evaluate it on test data that we translated from the target language into English. Here we use a monolingual English BERT-large model, which is sufficient because we only evaluate on English data.

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

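# Monolingual English BERT-large checkpoint, already fine-tuned on SQuAD.
model_name = "bert-large-cased-whole-word-masking-finetuned-squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
# No training happens in this section (train_dataset=None below), so the
# training hyperparameters are placeholders; only the eval batch size matters.
training_args = TrainingArguments(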
    output_dir="dir-bert-large-cased-whole-word-masking-finetuned-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
langs = ["ar", "de", "zh", "vi", "es", "hi", "el", "th", "tr", "ru"]
split = "translate_test"

map_datasets(langs, split, prepare_validation_features)

In [None]:
results_translate_test_bert_large = compute_results(langs, split)
print(results_translate_test_bert_large)

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1182
  Batch size = 16

Post-processing 1151 example predictions split into 1182 features.
Post-processing 1168 example predictions split into 1204 features.
Post-processing 1186 example predictions split into 1222 features.
Post-processing 1178 example predictions split into 1210 features.
Post-processing 1188 example predictions split into 1230 features.
Post-processing 1184 example predictions split into 1220 features.
Post-processing 1182 example predictions split into 1219 features.
Post-processing 1157 example predictions split into 1195 features.
Post-processing 1112 example predictions split into 1145 features.
Post-processing 1190 example predictions split into 1237 features.

{'ar': {'exact_match': 59.07906168549088, 'f1': 73.62074258466953}, 'de': {'exact_match': 66.3527397260274, 'f1': 80.41370021364638}, 'zh': {'exact_match': 59.527824620573355, 'f1': 73.98176392520615}, 'vi': {'exact_match': 62.13921901528013, 'f1': 76.39100238167063}, 'es': {'exact_match': 68.68686868686869, 'f1': 81.929551000276}, 'hi': {'exact_match': 61.6554054054054, 'f1': 75.31862416628505}, 'el': {'exact_match': 66.83587140439933, 'f1': 80.18622690602038}, 'th': {'exact_match': 53.93258426966292, 'f1': 67.49195736299158}, 'tr': {'exact_match': 47.302158273381295, 'f1': 66.30064032352537}, 'ru': {'exact_match': 66.97478991596638, 'f1': 80.09890179233422}}
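
The prediction logs above report more features than examples (for instance, 1151 examples split into 1182 features). This is what one would expect if contexts longer than the model's maximum input length are cut into overlapping windows. The sketch below illustrates that idea on plain token lists. It is a simplified assumption about what `prepare_validation_features` (defined elsewhere in this notebook) does: it ignores question tokens and special tokens, and the function name `split_into_windows` and the values 384 and 128 are hypothetical.

```python
def split_into_windows(context_tokens, max_context_len, stride):
    """Split a token list into overlapping (start, end) windows.

    Consecutive windows share `stride` tokens, so an answer that
    straddles a window boundary is still fully contained in one window
    (as long as it is no longer than `stride` tokens).
    """
    if stride >= max_context_len:
        raise ValueError("stride must be smaller than max_context_len")
    windows = []
    start = 0
    while True:
        end = min(start + max_context_len, len(context_tokens))
        windows.append((start, end))
        if end == len(context_tokens):
            break
        # Step back so the next window overlaps this one by `stride` tokens.
        start = end - stride
    return windows


# Two hypothetical contexts of 300 and 500 tokens.
context_lengths = [300, 500]
features_per_example = [
    len(split_into_windows(list(range(n)), max_context_len=384, stride=128))
    for n in context_lengths
]
total_features = sum(features_per_example)
```

In this toy setting the 300-token context fits in one window while the 500-token context needs two, so two examples yield three features. At post-processing time the predictions of all windows belonging to the same example are merged back into a single answer.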


In [None]:
df_results_translate_test_bert_large = results_df(results_translate_test_bert_large, "TT_bert_large")
df_results_translate_test_bert_large.to_csv("results/results_translate_test_bert_large.csv")
df_results_translate_test_bert_large

Unnamed: 0,lang,F1_TT_bert_large,EM_TT_bert_large
0,ar,73.62,59.08
1,de,80.41,66.35
2,zh,73.98,59.53
3,vi,76.39,62.14
4,es,81.93,68.69
5,hi,75.32,61.66
6,el,80.19,66.84
7,th,67.49,53.93
8,tr,66.3,47.3
9,ru,80.1,66.97
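
To make the difference between the two monolingual models easier to read, the following sketch computes the per-language F1 gain of BERT-large over the BERT results from the previous section. The numbers are copied from the two tables above; the variable names are illustrative only.

```python
# F1 scores copied from the TT_bert and TT_bert_large tables.
f1_bert = {
    "ar": 69.41, "de": 75.71, "zh": 69.88, "vi": 72.22, "es": 77.16,
    "hi": 69.69, "el": 74.96, "th": 60.5, "tr": 59.87, "ru": 74.9,
}
f1_bert_large = {
    "ar": 73.62, "de": 80.41, "zh": 73.98, "vi": 76.39, "es": 81.93,
    "hi": 75.32, "el": 80.19, "th": 67.49, "tr": 66.3, "ru": 80.1,
}

# Per-language gain in F1 points, rounded like the tables.
gain = {
    lang: round(f1_bert_large[lang] - f1_bert[lang], 2) for lang in f1_bert
}
mean_gain = sum(gain.values()) / len(gain)
best_lang = max(gain, key=gain.get)

for lang, delta in sorted(gain.items(), key=lambda item: -item[1]):
    print(f"{lang}: +{delta:.2f}")
print(f"mean gain: +{mean_gain:.2f}")
```

With these values BERT-large improves F1 for every language, by about 5.1 points on average. The largest gain is for Thai (about 7 points) and the smallest for Chinese (about 4.1 points).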


## Translate Test XLM-R

We use an XLM-R model fine-tuned on SQuAD and evaluate it on the test data translated from each target language into English. XLM-R is multilingual, but because we only evaluate on English data a monolingual model could be used just as well.

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "vanichandna/xlm-roberta-finetuned-squad"
# from_tf=True loads the model from the checkpoint's TensorFlow weights.
model = AutoModelForQuestionAnswering.from_pretrained(model_name, from_tf=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

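batch_size = 16
# Inference only (train_dataset=None below): the training hyperparameters are
# placeholders, and only the eval batch size is used.
training_args = TrainingArguments(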
    output_dir="xlm-roberta-finetuned-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

loading configuration file https://huggingface.co/vanichandna/xlm-roberta-finetuned-squad/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/6e9fab04c4168068e4162152496d68d0594b8a838953657e9234b70b2c4932fb.b04b4828cf6cfcbbcba34b7e4fc29fe9a0563001f3cedda0f9cf6487875eae92
Model config XLMRobertaConfig {
  "_name_or_path": "vanichandna/xlm-roberta-finetuned-squad",
  "architectures": [
    "XLMRobertaForQuestionAnswering"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "xlm-roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.19.2",
  "type_vocab_size": 1,
  "use_c

In [None]:
langs = ["ar", "de", "zh", "vi", "es", "hi", "el", "th", "tr", "ru"]
split = "translate_test"

map_datasets(langs, split, prepare_validation_features)

In [None]:
results_translate_test_xlm_r = compute_results(langs, split)
print(results_translate_test_xlm_r)

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1202
  Batch size = 16

Post-processing 1151 example predictions split into 1202 features.
Post-processing 1168 example predictions split into 1224 features.
Post-processing 1186 example predictions split into 1237 features.
Post-processing 1178 example predictions split into 1226 features.
Post-processing 1188 example predictions split into 1249 features.
Post-processing 1184 example predictions split into 1230 features.
Post-processing 1182 example predictions split into 1234 features.
Post-processing 1157 example predictions split into 1211 features.
Post-processing 1112 example predictions split into 1156 features.
Post-processing 1190 example predictions split into 1258 features.

{'ar': {'exact_match': 56.47263249348393, 'f1': 70.34504098193672}, 'de': {'exact_match': 65.75342465753425, 'f1': 79.01460928822512}, 'zh': {'exact_match': 57.41989881956155, 'f1': 71.06902529526067}, 'vi': {'exact_match': 58.40407470288625, 'f1': 73.0365990863463}, 'es': {'exact_match': 66.41414141414141, 'f1': 79.31029572326956}, 'hi': {'exact_match': 57.601351351351354, 'f1': 72.41350340102578}, 'el': {'exact_match': 64.9746192893401, 'f1': 77.81299240667566}, 'th': {'exact_match': 45.375972342264475, 'f1': 60.32438968055058}, 'tr': {'exact_match': 44.33453237410072, 'f1': 63.418026906561565}, 'ru': {'exact_match': 63.61344537815126, 'f1': 77.40922333369343}}


In [None]:
df_results_translate_test_xlm_r = results_df(results_translate_test_xlm_r, "TT_xlm_r")
df_results_translate_test_xlm_r.to_csv("results/results_translate_test_xlm_r.csv")
df_results_translate_test_xlm_r

Unnamed: 0,lang,F1_TT_xlm_r,EM_TT_xlm_r
0,ar,70.35,56.47
1,de,79.01,65.75
2,zh,71.07,57.42
3,vi,73.04,58.4
4,es,79.31,66.41
5,hi,72.41,57.6
6,el,77.81,64.97
7,th,60.32,45.38
8,tr,63.42,44.33
9,ru,77.41,63.61
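
Each results table is derived from the nested dictionary printed by `compute_results`, with scores rounded to two decimals. The sketch below shows one way such a conversion could look. The notebook's own `results_df` helper is defined elsewhere and returns a pandas DataFrame; this version is only an assumption about its behavior, uses the standard library instead, and the names `results_rows`, `rows_to_csv` and the tag `TT_demo` are hypothetical.

```python
import csv
import io


def results_rows(results, tag):
    """Flatten {lang: {"exact_match": x, "f1": y}} into rounded table rows."""
    return [
        {
            "lang": lang,
            f"F1_{tag}": round(scores["f1"], 2),
            f"EM_{tag}": round(scores["exact_match"], 2),
        }
        for lang, scores in results.items()
    ]


def rows_to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Two entries copied from the XLM-R results printed above.
example = {
    "ar": {"exact_match": 56.47263249348393, "f1": 70.34504098193672},
    "de": {"exact_match": 65.75342465753425, "f1": 79.01460928822512},
}
rows = results_rows(example, "TT_demo")
csv_text = rows_to_csv(rows)
print(csv_text)
```

Tagging the column names with the experiment name keeps the columns distinct if the per-model tables are later joined on `lang`.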


## Translate Test XLM-R-large

We use an XLM-R-large model fine-tuned on SQuAD and evaluate it on the test data translated from each target language into English. XLM-R-large is multilingual, but because we only evaluate on English data a monolingual model could be used just as well.

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "Palak/xlm-roberta-large_squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

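batch_size = 16
# Inference only (train_dataset=None below): the training hyperparameters are
# placeholders, and only the eval batch size is used.
training_args = TrainingArguments(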
    output_dir="xlm-roberta-large_squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

loading configuration file https://huggingface.co/Palak/xlm-roberta-large_squad/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/15ba31e9d5d227c3e7514429b9e154b4dd9fa20dc6634e2aa859a6570580ab8a.e0ee6561aeb1e6c81ca8555d0aed71ddc7b312c694416a29b2dd0b16df13a0b3
Model config XLMRobertaConfig {
  "_name_or_path": "Palak/xlm-roberta-large_squad",
  "architectures": [
    "XLMRobertaForQuestionAnswering"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "xlm-roberta",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "output_past": true,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.19.2",
  "type_vocab_size": 1,

In [None]:
langs = ["ar", "de", "zh", "vi", "es", "hi", "el", "th", "tr", "ru"]
split = "translate_test"

map_datasets(langs, split, prepare_validation_features)

In [None]:
results_translate_test_xlm_r_large = compute_results(langs, split)
print(results_translate_test_xlm_r_large)

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1202
  Batch size = 16

Post-processing 1151 example predictions split into 1202 features.
Post-processing 1168 example predictions split into 1224 features.
Post-processing 1186 example predictions split into 1237 features.
Post-processing 1178 example predictions split into 1226 features.
Post-processing 1188 example predictions split into 1249 features.
Post-processing 1184 example predictions split into 1230 features.
Post-processing 1182 example predictions split into 1234 features.
Post-processing 1157 example predictions split into 1211 features.
Post-processing 1112 example predictions split into 1156 features.
Post-processing 1190 example predictions split into 1258 features.

{'ar': {'exact_match': 59.07906168549088, 'f1': 72.86049593023081}, 'de': {'exact_match': 66.60958904109589, 'f1': 80.07410198897794}, 'zh': {'exact_match': 58.85328836424958, 'f1': 73.61014549140681}, 'vi': {'exact_match': 61.544991511035654, 'f1': 75.13980262377616}, 'es': {'exact_match': 67.08754208754209, 'f1': 81.51584137543719}, 'hi': {'exact_match': 60.13513513513514, 'f1': 74.24963260290495}, 'el': {'exact_match': 66.24365482233503, 'f1': 79.61634373353917}, 'th': {'exact_match': 45.980985306828, 'f1': 61.73035348044344}, 'tr': {'exact_match': 48.201438848920866, 'f1': 66.18604275834437}, 'ru': {'exact_match': 65.71428571428571, 'f1': 79.70748081174477}}


In [None]:
df_results_translate_test_xlm_r_large = results_df(results_translate_test_xlm_r_large, "TT_xlm_r_large")
df_results_translate_test_xlm_r_large.to_csv("results/results_translate_test_xlm_r_large.csv")
df_results_translate_test_xlm_r_large

Unnamed: 0,lang,F1_TT_xlm_r_large,EM_TT_xlm_r_large
0,ar,72.86,59.08
1,de,80.07,66.61
2,zh,73.61,58.85
3,vi,75.14,61.54
4,es,81.52,67.09
5,hi,74.25,60.14
6,el,79.62,66.24
7,th,61.73,45.98
8,tr,66.19,48.2
9,ru,79.71,65.71
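
With four translate-test tables available at this point, a macro-average over languages gives a quick way to rank the models. The sketch below does this for F1, using the values copied from the tables above in the language order ar, de, zh, vi, es, hi, el, th, tr, ru. The model labels and variable names are illustrative only.

```python
# F1 per language, copied from the four translate-test tables.
f1_by_model = {
    "BERT": [69.41, 75.71, 69.88, 72.22, 77.16, 69.69, 74.96, 60.5, 59.87, 74.9],
    "BERT-large": [73.62, 80.41, 73.98, 76.39, 81.93, 75.32, 80.19, 67.49, 66.3, 80.1],
    "XLM-R": [70.35, 79.01, 71.07, 73.04, 79.31, 72.41, 77.81, 60.32, 63.42, 77.41],
    "XLM-R-large": [72.86, 80.07, 73.61, 75.14, 81.52, 74.25, 79.62, 61.73, 66.19, 79.71],
}

# Unweighted mean over the ten languages for each model.
mean_f1 = {name: sum(values) / len(values) for name, values in f1_by_model.items()}

# Model names sorted from best to worst average F1.
ranking = sorted(mean_f1, key=mean_f1.get, reverse=True)

for name in ranking:
    print(f"{name}: {mean_f1[name]:.2f}")
```

On these numbers the average F1 is roughly 75.6 for BERT-large, 74.5 for XLM-R-large, 72.4 for XLM-R and 70.4 for BERT, so in the translate-test setting the monolingual large model comes out slightly ahead of the multilingual large one.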


## Translate Test RoBERTa

We use a RoBERTa model fine-tuned on SQuAD and evaluate it on the test data translated from each target language into English. Because we only evaluate on English data, a monolingual English model is sufficient here.

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "thatdramebaazguy/roberta-base-squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

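batch_size = 16
# Inference only (train_dataset=None below): the training hyperparameters are
# placeholders, and only the eval batch size is used.
training_args = TrainingArguments(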
    output_dir="roberta-base-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

# No train/eval datasets are passed: this Trainer is only used to run predictions.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

loading configuration file https://huggingface.co/thatdramebaazguy/roberta-base-squad/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/892dedc3d4dc51337f07712ae60bee9d35540607c674cf14c1e8fa759c8044f6.143a3b91b00882c22e382aa9dce198fca4876a0147d4208940720f26038eecea
Model config RobertaConfig {
  "_name_or_path": "thatdramebaazguy/roberta-base-squad",
  "architectures": [
    "RobertaForQuestionAnswering"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.19.2",
  "type_vocab_size": 1,
  "use_cache":

In [None]:
langs = ["ar", "de", "zh", "vi", "es", "hi", "el", "th", "tr", "ru"]
split = "translate_test"

map_datasets(langs, split, prepare_validation_features)

Loading cached processed dataset at /root/.cache/huggingface/datasets/juletxara___xquad/ar/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40/cache-1c8cc633a0b7ba64.arrow


  0%|          | 0/2 [00:00<?, ?ba/s]

Loading cached processed dataset at /root/.cache/huggingface/datasets/juletxara___xquad/zh/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40/cache-57cec9d44e42eec3.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/juletxara___xquad/vi/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40/cache-6d530472cf1c272a.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/juletxara___xquad/es/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40/cache-0a8ffe6edfe3cd79.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/juletxara___xquad/hi/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40/cache-d29fc8a2aeb05d4e.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/juletxara___xquad/el/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40/cache-5df8a366be36cb84.arrow
Loading cached processed dataset at /root/.ca

In [None]:
results_translate_test_roberta = compute_results(langs, split)
print(results_translate_test_roberta)

The following columns in the test set don't have a corresponding argument in `RobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `RobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1185
  Batch size = 16


Post-processing 1151 example predictions split into 1185 features.


  0%|          | 0/1151 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `RobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `RobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1203
  Batch size = 16


Post-processing 1168 example predictions split into 1203 features.


  0%|          | 0/1168 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `RobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `RobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1222
  Batch size = 16


Post-processing 1186 example predictions split into 1222 features.


  0%|          | 0/1186 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `RobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `RobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1209
  Batch size = 16


Post-processing 1178 example predictions split into 1209 features.


  0%|          | 0/1178 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `RobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `RobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1228
  Batch size = 16


Post-processing 1188 example predictions split into 1228 features.


  0%|          | 0/1188 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `RobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `RobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1216
  Batch size = 16


Post-processing 1184 example predictions split into 1216 features.


  0%|          | 0/1184 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `RobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `RobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1208
  Batch size = 16


Post-processing 1182 example predictions split into 1208 features.


  0%|          | 0/1182 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `RobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `RobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1191
  Batch size = 16


Post-processing 1157 example predictions split into 1191 features.


  0%|          | 0/1157 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `RobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `RobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1145
  Batch size = 16


Post-processing 1112 example predictions split into 1145 features.


  0%|          | 0/1112 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `RobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `RobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1237
  Batch size = 16


Post-processing 1190 example predictions split into 1237 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

{'ar': {'exact_match': 56.99391833188532, 'f1': 71.61217230800395}, 'de': {'exact_match': 62.41438356164384, 'f1': 76.99315217534317}, 'zh': {'exact_match': 57.92580101180438, 'f1': 72.3979777561931}, 'vi': {'exact_match': 56.621392190152804, 'f1': 72.35810811677062}, 'es': {'exact_match': 64.64646464646465, 'f1': 80.04981137323145}, 'hi': {'exact_match': 55.57432432432432, 'f1': 72.03651371348244}, 'el': {'exact_match': 63.87478849407783, 'f1': 76.79308090404957}, 'th': {'exact_match': 46.58599827139153, 'f1': 62.23507716283365}, 'tr': {'exact_match': 44.064748201438846, 'f1': 63.419033460984835}, 'ru': {'exact_match': 62.35294117647059, 'f1': 77.24099636983811}}


In [None]:
df_results_translate_test_roberta = results_df(results_translate_test_roberta, "TT_roberta")
df_results_translate_test_roberta.to_csv("results/results_translate_test_roberta.csv")
df_results_translate_test_roberta

Unnamed: 0,lang,F1_TT_roberta,EM_TT_roberta
0,ar,71.61,56.99
1,de,76.99,62.41
2,zh,72.4,57.93
3,vi,72.36,56.62
4,es,80.05,64.65
5,hi,72.04,55.57
6,el,76.79,63.87
7,th,62.24,46.59
8,tr,63.42,44.06
9,ru,77.24,62.35
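
The `exact_match` and `f1` values above are reported per language as percentages. As a rough illustration of what these two numbers measure, here is a minimal sketch of the SQuAD v1.1-style scoring for a single prediction and a single reference. It is an assumption that the notebook's `compute_results` helper (defined earlier) relies on the same formulation; the function names below are illustrative and not taken from the notebook.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("The Eiffel Tower", "eiffel tower")
partial = f1_score("in Paris, France", "Paris")
```

A prediction that contains the reference plus extra words gets no exact-match credit but partial F1 credit, which is why F1 is consistently higher than EM in the tables.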


## Translate Test RoBERTa-large

We take a model that is already fine-tuned on SQuAD and evaluate it on test data that we translated from each target language into English. Because the evaluation data is entirely English, a monolingual RoBERTa-large model is sufficient here.

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "csarron/roberta-large-squad-v1"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="roberta-large-squad-v1",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

# No train/eval datasets are passed: this Trainer is only used to run predictions.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

loading configuration file https://huggingface.co/csarron/roberta-large-squad-v1/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/0fc041ca661ebb7473123e9722819c48e27f9a83aedf13e1150435275cc21bdd.e59dc72a749f921962e2e724ff292de7eb1b9902723153972b2069e8847ff4f0
Model config RobertaConfig {
  "_name_or_path": "csarron/roberta-large-squad-v1",
  "architectures": [
    "RobertaForQuestionAnswering"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.19.2",
  "type_vocab_size": 1,
  "use_cache": true,
  

In [None]:
langs = ["ar", "de", "zh", "vi", "es", "hi", "el", "th", "tr", "ru"]
split = "translate_test"

map_datasets(langs, split, prepare_validation_features)



  0%|          | 0/2 [00:00<?, ?ba/s]



In [None]:
results_translate_test_roberta_large = compute_results(langs, split)
print(results_translate_test_roberta_large)

The following columns in the test set don't have a corresponding argument in `RobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `RobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1185
  Batch size = 16


Post-processing 1151 example predictions split into 1185 features.


  0%|          | 0/1151 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `RobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `RobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1203
  Batch size = 16


Post-processing 1168 example predictions split into 1203 features.


  0%|          | 0/1168 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `RobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `RobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1222
  Batch size = 16


Post-processing 1186 example predictions split into 1222 features.


  0%|          | 0/1186 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `RobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `RobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1209
  Batch size = 16


Post-processing 1178 example predictions split into 1209 features.


  0%|          | 0/1178 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `RobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `RobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1228
  Batch size = 16


Post-processing 1188 example predictions split into 1228 features.


  0%|          | 0/1188 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `RobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `RobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1216
  Batch size = 16


Post-processing 1184 example predictions split into 1216 features.


  0%|          | 0/1184 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `RobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `RobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1208
  Batch size = 16


Post-processing 1182 example predictions split into 1208 features.


  0%|          | 0/1182 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `RobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `RobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1191
  Batch size = 16


Post-processing 1157 example predictions split into 1191 features.


  0%|          | 0/1157 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `RobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `RobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1145
  Batch size = 16


Post-processing 1112 example predictions split into 1145 features.


  0%|          | 0/1112 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `RobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `RobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1237
  Batch size = 16


Post-processing 1190 example predictions split into 1237 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

{'ar': {'exact_match': 61.07732406602954, 'f1': 74.7489821351272}, 'de': {'exact_match': 67.12328767123287, 'f1': 80.37386737984684}, 'zh': {'exact_match': 59.86509274873524, 'f1': 73.98462224762834}, 'vi': {'exact_match': 61.969439728353144, 'f1': 76.3938454313451}, 'es': {'exact_match': 69.44444444444444, 'f1': 83.13549813942818}, 'hi': {'exact_match': 60.979729729729726, 'f1': 75.05802677072126}, 'el': {'exact_match': 68.02030456852792, 'f1': 80.79771497740221}, 'th': {'exact_match': 50.99394987035436, 'f1': 65.27015861017648}, 'tr': {'exact_match': 46.94244604316547, 'f1': 65.99299955055531}, 'ru': {'exact_match': 67.98319327731092, 'f1': 81.15423069826312}}


In [None]:
df_results_translate_test_roberta_large = results_df(results_translate_test_roberta_large, "TT_roberta_large")
df_results_translate_test_roberta_large.to_csv("results/results_translate_test_roberta_large.csv")
df_results_translate_test_roberta_large

Unnamed: 0,lang,F1_TT_roberta_large,EM_TT_roberta_large
0,ar,74.75,61.08
1,de,80.37,67.12
2,zh,73.98,59.87
3,vi,76.39,61.97
4,es,83.14,69.44
5,hi,75.06,60.98
6,el,80.8,68.02
7,th,65.27,50.99
8,tr,65.99,46.94
9,ru,81.15,67.98
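
To put the RoBERTa and RoBERTa-large translate-test tables side by side, the following sketch computes the mean F1 of each model and the per-language gain of the large model. The F1 values are copied from the two result tables above; the variable names are illustrative only.

```python
# F1 scores copied from the TT_roberta and TT_roberta_large tables.
f1_roberta_base = {
    "ar": 71.61, "de": 76.99, "zh": 72.4, "vi": 72.36, "es": 80.05,
    "hi": 72.04, "el": 76.79, "th": 62.24, "tr": 63.42, "ru": 77.24,
}
f1_roberta_large = {
    "ar": 74.75, "de": 80.37, "zh": 73.98, "vi": 76.39, "es": 83.14,
    "hi": 75.06, "el": 80.8, "th": 65.27, "tr": 65.99, "ru": 81.15,
}


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


mean_base = mean(f1_roberta_base.values())
mean_large = mean(f1_roberta_large.values())

# Per-language F1 gain of the large model over the base model.
gains = {
    lang: round(f1_roberta_large[lang] - f1_roberta_base[lang], 2)
    for lang in f1_roberta_base
}

for lang, gain in sorted(gains.items(), key=lambda item: item[1], reverse=True):
    print(f"{lang}: +{gain:.2f} F1")
print(f"mean F1 base={mean_base:.2f}, large={mean_large:.2f}")
```

The larger model improves F1 for every language in this setting, but the size of the gain varies by language.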


## Translate Train Es mBERT

For many language pairs, an MT model may be available, which can be used to obtain training data in the target language. To evaluate the impact of using such data, the English training data is translated into the target language with an MT system, and mBERT is then fine-tuned on the translated data. For QA tasks, the answer spans must also be aligned between the source and target language. To save time, we use data that was already translated.
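
To make the span-alignment step concrete, here is a minimal sketch of one simple heuristic: look for the translated answer verbatim in the translated context and record its character offset in SQuAD format. This is only an assumption for illustration; the already-translated data used in this notebook may have been aligned with a different method, and `align_answer` is a hypothetical helper that is not part of the notebook.

```python
from typing import Optional


def align_answer(translated_context: str, translated_answer: str) -> Optional[dict]:
    """Return a SQuAD-style answer dict, or None if no verbatim match exists."""
    start = translated_context.find(translated_answer)
    if start == -1:
        # The translated answer does not appear in the translated context;
        # such examples would typically be dropped from the training set.
        return None
    return {"text": [translated_answer], "answer_start": [start]}


context_es = "La Torre Eiffel se encuentra en París."
aligned = align_answer(context_es, "París")
missing = align_answer(context_es, "Londres")
```

Examples for which the heuristic returns `None` cannot be used for extractive training, so translated training sets are usually somewhat smaller than the English original.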

Now that our data is ready for training, we can download the pretrained model and fine-tune it. Since our task is question answering, we use the `AutoModelForQuestionAnswering` class. As with the tokenizer, the `from_pretrained` method will download and cache the model for us:

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "bert-base-multilingual-cased"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16

training_args = TrainingArguments(
    output_dir="bert-base-multilingual-cased-squad-es",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True
)

loading configuration file https://huggingface.co/bert-base-multilingual-cased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/6c4a5d81a58c9791cdf76a09bce1b5abfb9cf958aebada51200f4515403e5d08.0fe59f3f4f1335dadeb4bce8b8146199d9083512b50d07323c1c319f96df450c
Model config BertConfig {
  "_name_or_path": "bert-base-multilingual-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "pooler_type": "first_token_transform",
  "position_embedding_type": "abs

Use the 🤗 Datasets `map` function to apply the preprocessing function over the entire dataset. Setting `batched=True` speeds up `map` by processing multiple elements of the dataset at once. Remove the columns you don’t need.
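
As a small illustration of what `batched=True` means for the preprocessing function, the sketch below shows a function that receives each column as a list and returns new columns as lists. The function `add_question_length` and the toy batch are hypothetical and only stand in for the notebook's `prepare_train_features`; the `map` call itself is shown as a comment so that the example runs without 🤗 Datasets installed.

```python
def add_question_length(batch: dict) -> dict:
    """Batched preprocessing: columns come in as lists and go out as lists."""
    questions = [q.strip() for q in batch["question"]]
    return {
        "question": questions,
        "question_length": [len(q.split()) for q in questions],
    }


# A toy batch in the column-oriented layout that a batched map function receives.
toy_batch = {
    "question": ["  Who wrote it?", "When was it built? "],
    "context": ["(context 1)", "(context 2)"],
}
processed = add_question_length(toy_batch)

# With a real 🤗 Datasets object this would be applied as, for example:
# dataset = dataset.map(add_question_length, batched=True, remove_columns=["context"])
```

Returning lists of the same length as the input keeps one output row per input row; QA preprocessing that splits long contexts into several features returns more rows than it receives, which is why the original columns are removed.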

In [None]:
langs = ["es"]

split = "translate_train"
map_datasets(langs, split, prepare_train_features)

split = "translate_dev"
map_datasets(langs, split, prepare_train_features)

  0%|          | 0/88 [00:00<?, ?ba/s]

  0%|          | 0/35 [00:00<?, ?ba/s]

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=xquad_prep["es"]["translate_train"],
    eval_dataset=xquad_prep["es"]["translate_dev"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
# trainer.train()

In [None]:
langs = ["ar", "de", "zh", "vi", "en", "es", "hi", "el", "th", "tr", "ru", "ro"]
split = "test"

map_datasets(langs, split, prepare_validation_features)

In [None]:
langs = ["ar", "de", "zh", "vi", "en", "es", "hi", "el", "th", "tr", "ru", "ro"]
split = "test"
results_translate_train_es_mbert = compute_results(langs, split)
print(results_translate_train_es_mbert)

In [None]:
df_results_translate_train_es_mbert = results_df(results_translate_train_es_mbert, "TTr_es_mbert")
df_results_translate_train_es_mbert.to_csv("results/results_translate_train_es_mbert.csv")
df_results_translate_train_es_mbert

In [None]:
trainer.push_to_hub()

## Translate Train Es XLM-R

For many language pairs, an MT model may be available, which can be used to obtain training data in the target language. To evaluate the impact of using such data, the English training data is translated into the target language with an MT system. To save time, we use an XLM-R model that has already been fine-tuned on such data.

Because the model is already fine-tuned, we only need to download it. Since our task is question answering, we use the `AutoModelForQuestionAnswering` class. As with the tokenizer, the `from_pretrained` method will download and cache the model for us:

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "saattrupdan/xlmr-base-texas-squad-es"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="xlmr-base-texas-squad-es",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

# No train/eval datasets are passed: this Trainer is only used to run predictions.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

https://huggingface.co/saattrupdan/xlmr-base-texas-squad-es/resolve/main/config.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmp8olpvol1


Downloading:   0%|          | 0.00/716 [00:00<?, ?B/s]

storing https://huggingface.co/saattrupdan/xlmr-base-texas-squad-es/resolve/main/config.json in cache at /root/.cache/huggingface/transformers/faf74a33d110680ea7f72791fa918036c0fd9edbcbdbf36af3bde7c86b23a8f2.b749a309a5febb0957ff5acc5d4f5534d14f38569553b954c37647de7ea10ae4
creating metadata file for /root/.cache/huggingface/transformers/faf74a33d110680ea7f72791fa918036c0fd9edbcbdbf36af3bde7c86b23a8f2.b749a309a5febb0957ff5acc5d4f5534d14f38569553b954c37647de7ea10ae4
loading configuration file https://huggingface.co/saattrupdan/xlmr-base-texas-squad-es/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/faf74a33d110680ea7f72791fa918036c0fd9edbcbdbf36af3bde7c86b23a8f2.b749a309a5febb0957ff5acc5d4f5534d14f38569553b954c37647de7ea10ae4
Model config XLMRobertaConfig {
  "_name_or_path": "saattrupdan/xlmr-base-texas-squad-es",
  "architectures": [
    "XLMRobertaForQuestionAnswering"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": 

Downloading:   0%|          | 0.00/1.03G [00:00<?, ?B/s]

storing https://huggingface.co/saattrupdan/xlmr-base-texas-squad-es/resolve/main/pytorch_model.bin in cache at /root/.cache/huggingface/transformers/acdf22ff03faa9cd992d27a19088abdc9f28a3b4e1e40a14e98cfcb0b18c1403.d1cb5eddd332add5df96171f7f759027748f6b03de7b4fdd45f72ebf349e9eb3
creating metadata file for /root/.cache/huggingface/transformers/acdf22ff03faa9cd992d27a19088abdc9f28a3b4e1e40a14e98cfcb0b18c1403.d1cb5eddd332add5df96171f7f759027748f6b03de7b4fdd45f72ebf349e9eb3
loading weights file https://huggingface.co/saattrupdan/xlmr-base-texas-squad-es/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/acdf22ff03faa9cd992d27a19088abdc9f28a3b4e1e40a14e98cfcb0b18c1403.d1cb5eddd332add5df96171f7f759027748f6b03de7b4fdd45f72ebf349e9eb3
All model checkpoint weights were used when initializing XLMRobertaForQuestionAnswering.

All the weights of XLMRobertaForQuestionAnswering were initialized from the model checkpoint at saattrupdan/xlmr-base-texas-squad-es.
If your 

Downloading:   0%|          | 0.00/398 [00:00<?, ?B/s]

storing https://huggingface.co/saattrupdan/xlmr-base-texas-squad-es/resolve/main/tokenizer_config.json in cache at /root/.cache/huggingface/transformers/111d92253a878f3919c5c83c05be6fe131f4d4e29e27d3994266b0581067ef7a.b36482fbec4a714d3cfec99e0b05f4fdeec9e759090a78aed5597583a8b4783d
creating metadata file for /root/.cache/huggingface/transformers/111d92253a878f3919c5c83c05be6fe131f4d4e29e27d3994266b0581067ef7a.b36482fbec4a714d3cfec99e0b05f4fdeec9e759090a78aed5597583a8b4783d
https://huggingface.co/saattrupdan/xlmr-base-texas-squad-es/resolve/main/sentencepiece.bpe.model not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpdt82rvt9


Downloading:   0%|          | 0.00/4.83M [00:00<?, ?B/s]

storing https://huggingface.co/saattrupdan/xlmr-base-texas-squad-es/resolve/main/sentencepiece.bpe.model in cache at /root/.cache/huggingface/transformers/e70140c9e0682f255072dc32b137ae3cde1808241a5359b356fa10413d32a677.71e50b08dbe7e5375398e165096cacc3d2086119d6a449364490da6908de655e
creating metadata file for /root/.cache/huggingface/transformers/e70140c9e0682f255072dc32b137ae3cde1808241a5359b356fa10413d32a677.71e50b08dbe7e5375398e165096cacc3d2086119d6a449364490da6908de655e
https://huggingface.co/saattrupdan/xlmr-base-texas-squad-es/resolve/main/tokenizer.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpi8fob3ry


Downloading:   0%|          | 0.00/8.66M [00:00<?, ?B/s]

storing https://huggingface.co/saattrupdan/xlmr-base-texas-squad-es/resolve/main/tokenizer.json in cache at /root/.cache/huggingface/transformers/6aa7bc2ebebd8aec9f0d822224378677812644178ecff3c50ceddf15ed84427c.2dedbd3aa2bb53e8e26ed0125daf18f6d4aeeeeb98252d7ce59b3a63d810a963
creating metadata file for /root/.cache/huggingface/transformers/6aa7bc2ebebd8aec9f0d822224378677812644178ecff3c50ceddf15ed84427c.2dedbd3aa2bb53e8e26ed0125daf18f6d4aeeeeb98252d7ce59b3a63d810a963
https://huggingface.co/saattrupdan/xlmr-base-texas-squad-es/resolve/main/special_tokens_map.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmp0wfbwk90


Downloading:   0%|          | 0.00/239 [00:00<?, ?B/s]

storing https://huggingface.co/saattrupdan/xlmr-base-texas-squad-es/resolve/main/special_tokens_map.json in cache at /root/.cache/huggingface/transformers/f06ef5b71fd7eb7e035d0f645a260e1b7933933ff5438715bc96d41446ac6db1.a11ebb04664c067c8fe5ef8f8068b0f721263414a26058692f7b2e4ba2a1b342
creating metadata file for /root/.cache/huggingface/transformers/f06ef5b71fd7eb7e035d0f645a260e1b7933933ff5438715bc96d41446ac6db1.a11ebb04664c067c8fe5ef8f8068b0f721263414a26058692f7b2e4ba2a1b342
loading file https://huggingface.co/saattrupdan/xlmr-base-texas-squad-es/resolve/main/sentencepiece.bpe.model from cache at /root/.cache/huggingface/transformers/e70140c9e0682f255072dc32b137ae3cde1808241a5359b356fa10413d32a677.71e50b08dbe7e5375398e165096cacc3d2086119d6a449364490da6908de655e
loading file https://huggingface.co/saattrupdan/xlmr-base-texas-squad-es/resolve/main/tokenizer.json from cache at /root/.cache/huggingface/transformers/6aa7bc2ebebd8aec9f0d822224378677812644178ecff3c50ceddf15ed84427c.2dedbd3aa2

In [None]:
langs = ["es"]

split = "translate_train"
map_datasets(langs, split, prepare_train_features)

split = "translate_dev"
map_datasets(langs, split, prepare_train_features)

  0%|          | 0/88 [00:00<?, ?ba/s]

  0%|          | 0/35 [00:00<?, ?ba/s]

In [None]:
langs = ["ar", "de", "zh", "vi", "en", "es", "hi", "el", "th", "tr", "ru", "ro"]
split = "test"

map_datasets(langs, split, prepare_validation_features)

In [None]:
langs = ["ar", "de", "zh", "vi", "en", "es", "hi", "el", "th", "tr", "ru", "ro"]
split = "test"
results_translate_train_es_xlm_r = compute_results(langs, split)
print(results_translate_train_es_xlm_r)

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1292
  Batch size = 16


Post-processing 1190 example predictions split into 1292 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1269
  Batch size = 16


Post-processing 1190 example predictions split into 1269 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1230
  Batch size = 16


Post-processing 1190 example predictions split into 1230 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1287
  Batch size = 16


Post-processing 1190 example predictions split into 1287 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1249
  Batch size = 16


Post-processing 1190 example predictions split into 1249 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1275
  Batch size = 16


Post-processing 1190 example predictions split into 1275 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1342
  Batch size = 16


Post-processing 1190 example predictions split into 1342 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1417
  Batch size = 16


Post-processing 1190 example predictions split into 1417 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1283
  Batch size = 16


Post-processing 1190 example predictions split into 1283 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1253
  Batch size = 16


Post-processing 1190 example predictions split into 1253 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1292
  Batch size = 16


Post-processing 1190 example predictions split into 1292 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1296
  Batch size = 16


Post-processing 1190 example predictions split into 1296 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

{'ar': {'exact_match': 47.89915966386555, 'f1': 67.0388115846991}, 'de': {'exact_match': 56.38655462184874, 'f1': 74.19614072488609}, 'zh': {'exact_match': 50.33613445378151, 'f1': 63.442740732656574}, 'vi': {'exact_match': 52.016806722689076, 'f1': 73.25160491170328}, 'en': {'exact_match': 66.1344537815126, 'f1': 80.40537178558418}, 'es': {'exact_match': 56.63865546218487, 'f1': 76.30470519642189}, 'hi': {'exact_match': 48.23529411764706, 'f1': 66.8705713638248}, 'el': {'exact_match': 52.436974789915965, 'f1': 73.46932544279235}, 'th': {'exact_match': 58.48739495798319, 'f1': 68.67126990656391}, 'tr': {'exact_match': 46.470588235294116, 'f1': 66.15025291004822}, 'ru': {'exact_match': 54.20168067226891, 'f1': 72.34748330210338}, 'ro': {'exact_match': 59.2436974789916, 'f1': 76.02388227839175}}


In [None]:
df_results_translate_train_es_xlm_r = results_df(results_translate_train_es_xlm_r, "TTr_es_xlm_r")
df_results_translate_train_es_xlm_r.to_csv("results/results_translate_train_es_xlm_r.csv")
df_results_translate_train_es_xlm_r

Unnamed: 0,lang,F1_TTr_es_xlm_r,EM_TTr_es_xlm_r
0,ar,67.04,47.9
1,de,74.2,56.39
2,zh,63.44,50.34
3,vi,73.25,52.02
4,en,80.41,66.13
5,es,76.3,56.64
6,hi,66.87,48.24
7,el,73.47,52.44
8,th,68.67,58.49
9,tr,66.15,46.47
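
For readers unfamiliar with the two columns, the following is a minimal sketch of how SQuAD-style exact match (EM) and token-level F1 are usually computed for a single prediction. It is an assumption that `compute_results` relies on the standard SQuAD v1.1 metric; the normalization below (lowercasing, removing punctuation and English articles) is English-centric, and the helper names are illustrative, not taken from this notebook.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower", "eiffel tower"))  # 1.0
print(token_f1("in Paris France", "Paris"))             # 0.5
```

Dataset-level scores are then the mean over all examples (taking the best score over the reference answers of each example), multiplied by 100. This explains why F1 is always at least as high as EM in the tables.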


## Translate Train De XLM-R

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "saattrupdan/xlmr-base-texas-squad-de"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="bert-multi-cased-finetuned-xquadv1-de",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

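# train_dataset and eval_dataset are None: this Trainer only runs prediction with the loaded checkpoint.
trainer = Trainer(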
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

Downloading:   0%|          | 0.00/716 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.03G [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/398 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/4.83M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/8.66M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/239 [00:00<?, ?B/s]

In [None]:
langs = ["de"]

split = "translate_train"
map_datasets(langs, split, prepare_train_features)

split = "translate_dev"
map_datasets(langs, split, prepare_train_features)

  0%|          | 0/83 [00:00<?, ?ba/s]

  0%|          | 0/33 [00:00<?, ?ba/s]

In [None]:
langs = ["ar", "de", "zh", "vi", "en", "es", "hi", "el", "th", "tr", "ru", "ro"]
split = "test"

map_datasets(langs, split, prepare_validation_features)

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]
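
The prediction logs in this notebook report more features than test examples (for instance, 1190 examples split into 1292 features). A plausible explanation, assuming `prepare_validation_features` follows the usual sliding-window preprocessing (its definition lies outside this section), is that contexts longer than the model input are cut into overlapping windows. The sketch below only illustrates that counting logic; the function name and the numbers 354 and 128 are hypothetical, not values read from the notebook.

```python
def window_spans(n_context_tokens, budget, overlap):
    """Return (start, end) token spans covering a context with overlapping windows.

    budget  -- context tokens that fit into one feature (assumed value)
    overlap -- tokens shared by two consecutive windows (the "stride" in HF tokenizers)
    """
    if not 0 <= overlap < budget:
        raise ValueError("overlap must be non-negative and smaller than budget")
    spans, start = [], 0
    while True:
        end = min(start + budget, n_context_tokens)
        spans.append((start, end))
        if end == n_context_tokens:
            return spans
        # Advance so that the next window re-reads `overlap` tokens.
        start += budget - overlap


# Hypothetical setting: 384-token inputs minus about 30 tokens of question and special tokens.
print(len(window_spans(200, budget=354, overlap=128)))  # 1 feature for a short context
print(len(window_spans(500, budget=354, overlap=128)))  # 2 features for a long context
```

Under this reading, a language whose tokenization produces longer sequences yields more features for the same 1190 examples.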

In [None]:
langs = ["ar", "de", "zh", "vi", "en", "es", "hi", "el", "th", "tr", "ru", "ro"]
split = "test"
results_translate_train_de_xlm_r = compute_results(langs, split)
print(results_translate_train_de_xlm_r)

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1292
  Batch size = 16


Post-processing 1190 example predictions split into 1292 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1269
  Batch size = 16


Post-processing 1190 example predictions split into 1269 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1230
  Batch size = 16


Post-processing 1190 example predictions split into 1230 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1287
  Batch size = 16


Post-processing 1190 example predictions split into 1287 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1249
  Batch size = 16


Post-processing 1190 example predictions split into 1249 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1275
  Batch size = 16


Post-processing 1190 example predictions split into 1275 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1342
  Batch size = 16


Post-processing 1190 example predictions split into 1342 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1417
  Batch size = 16


Post-processing 1190 example predictions split into 1417 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1283
  Batch size = 16


Post-processing 1190 example predictions split into 1283 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1253
  Batch size = 16


Post-processing 1190 example predictions split into 1253 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1292
  Batch size = 16


Post-processing 1190 example predictions split into 1292 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1296
  Batch size = 16


Post-processing 1190 example predictions split into 1296 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

{'ar': {'exact_match': 48.15126050420168, 'f1': 65.89257531559332}, 'de': {'exact_match': 58.8235294117647, 'f1': 74.26857035689696}, 'zh': {'exact_match': 55.04201680672269, 'f1': 64.67947178871539}, 'vi': {'exact_match': 53.19327731092437, 'f1': 72.71213914676726}, 'en': {'exact_match': 67.14285714285714, 'f1': 79.78760904305636}, 'es': {'exact_match': 57.89915966386555, 'f1': 75.92705374261308}, 'hi': {'exact_match': 50.588235294117645, 'f1': 66.43992031829906}, 'el': {'exact_match': 54.45378151260504, 'f1': 72.2918166818599}, 'th': {'exact_match': 56.80672268907563, 'f1': 65.42983090672162}, 'tr': {'exact_match': 50.7563025210084, 'f1': 65.82973809534654}, 'ru': {'exact_match': 56.38655462184874, 'f1': 73.12606185820042}, 'ro': {'exact_match': 61.09243697478992, 'f1': 75.27282891644208}}
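
A single summary number makes runs easier to compare. The sketch below computes the unweighted mean of each metric over the twelve languages; the scores are copied from the dictionary printed above and rounded to two decimals, and `macro_average` is an illustrative helper, not a function defined in this notebook.

```python
# Scores copied from the printed results, rounded to two decimals.
results = {
    "ar": {"exact_match": 48.15, "f1": 65.89},
    "de": {"exact_match": 58.82, "f1": 74.27},
    "zh": {"exact_match": 55.04, "f1": 64.68},
    "vi": {"exact_match": 53.19, "f1": 72.71},
    "en": {"exact_match": 67.14, "f1": 79.79},
    "es": {"exact_match": 57.90, "f1": 75.93},
    "hi": {"exact_match": 50.59, "f1": 66.44},
    "el": {"exact_match": 54.45, "f1": 72.29},
    "th": {"exact_match": 56.81, "f1": 65.43},
    "tr": {"exact_match": 50.76, "f1": 65.83},
    "ru": {"exact_match": 56.39, "f1": 73.13},
    "ro": {"exact_match": 61.09, "f1": 75.27},
}


def macro_average(results, metric):
    """Unweighted mean of one metric over all languages."""
    return sum(scores[metric] for scores in results.values()) / len(results)


for metric in ("f1", "exact_match"):
    print(f"{metric}: {macro_average(results, metric):.2f}")
```

Because every test set contains the same 1190 examples, this unweighted mean coincides with the example-weighted mean.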


In [None]:
df_results_translate_train_de_xlm_r = results_df(results_translate_train_de_xlm_r, "TTr_de_xlm_r")
df_results_translate_train_de_xlm_r.to_csv("results/results_translate_train_de_xlm_r.csv")
df_results_translate_train_de_xlm_r

Unnamed: 0,lang,F1_TTr_de_xlm_r,EM_TTr_de_xlm_r
0,ar,65.89,48.15
1,de,74.27,58.82
2,zh,64.68,55.04
3,vi,72.71,53.19
4,en,79.79,67.14
5,es,75.93,57.9
6,hi,66.44,50.59
7,el,72.29,54.45
8,th,65.43,56.81
9,tr,65.83,50.76
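
To see where the German translate-train run differs from the Spanish one reported earlier, the per-language F1 scores can be subtracted. This is a small sketch with the values copied (rounded) from the two printed result dictionaries; `f1_deltas` is an illustrative name, not a helper from this notebook.

```python
# F1 per language, rounded to two decimals from the printed results.
f1_es = {"ar": 67.04, "de": 74.20, "zh": 63.44, "vi": 73.25, "en": 80.41, "es": 76.30,
         "hi": 66.87, "el": 73.47, "th": 68.67, "tr": 66.15, "ru": 72.35, "ro": 76.02}
f1_de = {"ar": 65.89, "de": 74.27, "zh": 64.68, "vi": 72.71, "en": 79.79, "es": 75.93,
         "hi": 66.44, "el": 72.29, "th": 65.43, "tr": 65.83, "ru": 73.13, "ro": 75.27}


def f1_deltas(baseline, other):
    """Return {lang: other - baseline} for the languages present in both runs."""
    return {lang: other[lang] - baseline[lang] for lang in baseline if lang in other}


deltas = f1_deltas(f1_es, f1_de)
# List the languages from the largest loss to the largest gain of the German run.
for lang, delta in sorted(deltas.items(), key=lambda item: item[1]):
    print(f"{lang}: {delta:+.2f}")
```

With these numbers the German run is ahead only on de, zh and ru, and the largest gap is on Thai (about 3.2 F1 points in favour of the Spanish run). Differences this small may fall within run-to-run variance, which this notebook does not measure.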


## Translate Train All mBERT

We also experiment with a multi-task version of the translate-train setting, in which mBERT is fine-tuned jointly on the English SQuAD data and on the translated training data of all ten languages listed below.

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "bert-base-multilingual-cased"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16

training_args = TrainingArguments(
    output_dir="bert-base-multilingual-cased-squad-all",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True
)

In [None]:
# "en" is added below from squad_train; "ro" appears only among the test languages.
langs = ["ar", "de", "zh", "vi", "es", "hi", "el", "th", "tr", "ru"]

split = "translate_train"
map_datasets(langs, split, prepare_train_features)

split = "translate_dev"
map_datasets(langs, split, prepare_train_features)

Loading cached processed dataset at /root/.cache/huggingface/datasets/xquad/ar/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40/cache-82ed776ec82aef03.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/xquad/de/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40/cache-f29fbee97ac68574.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/xquad/zh/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40/cache-b569d8b5e32c0920.arrow


  0%|          | 0/88 [00:00<?, ?ba/s]

Loading cached processed dataset at /root/.cache/huggingface/datasets/xquad/es/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40/cache-6d22e1e6dd4c1662.arrow


  0%|          | 0/86 [00:00<?, ?ba/s]

  0%|          | 0/80 [00:00<?, ?ba/s]

  0%|          | 0/86 [00:00<?, ?ba/s]

  0%|          | 0/87 [00:00<?, ?ba/s]

  0%|          | 0/85 [00:00<?, ?ba/s]

  0%|          | 0/35 [00:00<?, ?ba/s]

  0%|          | 0/33 [00:00<?, ?ba/s]

  0%|          | 0/34 [00:00<?, ?ba/s]

  0%|          | 0/35 [00:00<?, ?ba/s]

Loading cached processed dataset at /root/.cache/huggingface/datasets/xquad/es/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40/cache-cc7b8b134d366ad0.arrow


  0%|          | 0/35 [00:00<?, ?ba/s]

  0%|          | 0/32 [00:00<?, ?ba/s]

  0%|          | 0/35 [00:00<?, ?ba/s]

  0%|          | 0/35 [00:00<?, ?ba/s]

  0%|          | 0/34 [00:00<?, ?ba/s]

In [None]:
from datasets import DatasetDict, concatenate_datasets

# Merge the English SQuAD features with the translated features of every language in `langs`.
xquad_merged = DatasetDict()
for merged_split, squad_split in [("translate_train", "train"), ("translate_dev", "validation")]:
    xquad_merged[merged_split] = concatenate_datasets(
        [squad_train[squad_split]] + [xquad_prep[lang][merged_split] for lang in langs]
    )

In [None]:
xquad_merged

DatasetDict({
    translate_train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions'],
        num_rows: 1067346
    })
    translate_dev: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions'],
        num_rows: 406809
    })
})

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=xquad_merged["translate_train"],
    eval_dataset=xquad_merged["translate_dev"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

loading configuration file https://huggingface.co/bert-base-multilingual-cased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/6c4a5d81a58c9791cdf76a09bce1b5abfb9cf958aebada51200f4515403e5d08.0fe59f3f4f1335dadeb4bce8b8146199d9083512b50d07323c1c319f96df450c
Model config BertConfig {
  "_name_or_path": "bert-base-multilingual-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "pooler_type": "first_token_transform",
  "position_embedding_type": "abs

In [None]:
# trainer.train()

In [None]:
langs = ["ar", "de", "zh", "vi", "en", "es", "hi", "el", "th", "tr", "ru", "ro"]
split = "test"

map_datasets(langs, split, prepare_validation_features)

In [None]:
langs = ["ar", "de", "zh", "vi", "en", "es", "hi", "el", "th", "tr", "ru", "ro"]
split = "test"
results_translate_train_all_mbert = compute_results(langs, split)
print(results_translate_train_all_mbert)

In [None]:
df_results_translate_train_all_mbert = results_df(results_translate_train_all_mbert, "TTr_all_mbert")
df_results_translate_train_all_mbert.to_csv("results/results_translate_train_all_mbert.csv")
df_results_translate_train_all_mbert

In [None]:
trainer.push_to_hub()

## Fine-tuning XQuAD mBERT

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "alon-albalak/bert-base-multilingual-xquad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="bert-base-multilingual-xquad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

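# train_dataset and eval_dataset are None: this Trainer only runs prediction with the loaded checkpoint.
trainer = Trainer(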
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

loading configuration file https://huggingface.co/alon-albalak/bert-base-multilingual-xquad/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/470e6ca9a5d81f600655f10ab2fce49e027c09aed81d6226dd19b443fab64a11.1ef5cb917e276a09d6324402c6bb6e4099f28b87a921cc6cd6b5fdf56d51d5a3
Model config BertConfig {
  "_name_or_path": "alon-albalak/bert-base-multilingual-xquad",
  "architectures": [
    "BertForQuestionAnswering"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "po

In [None]:
langs = ["ar", "de", "zh", "vi", "en", "es", "hi", "el", "th", "tr", "ru", "ro"]
split = "test"

map_datasets(langs, split, prepare_validation_features)

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

In [None]:
results_fine_tuning_xquad_mbert = compute_results(langs, split)
print(results_fine_tuning_xquad_mbert)

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1381
  Batch size = 16


Post-processing 1190 example predictions split into 1381 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1256
  Batch size = 16


Post-processing 1190 example predictions split into 1256 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1336
  Batch size = 16


Post-processing 1190 example predictions split into 1336 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1254
  Batch size = 16


Post-processing 1190 example predictions split into 1254 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1224
  Batch size = 16


Post-processing 1190 example predictions split into 1224 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1262
  Batch size = 16


Post-processing 1190 example predictions split into 1262 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1420
  Batch size = 16


Post-processing 1190 example predictions split into 1420 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1716
  Batch size = 16


Post-processing 1190 example predictions split into 1716 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1190
  Batch size = 16


Post-processing 1190 example predictions split into 1190 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1276
  Batch size = 16


Post-processing 1190 example predictions split into 1276 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1349
  Batch size = 16


Post-processing 1190 example predictions split into 1349 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1307
  Batch size = 16


Post-processing 1190 example predictions split into 1307 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

{'ar': {'exact_match': 84.28571428571429, 'f1': 90.00810258925225}, 'de': {'exact_match': 90.0, 'f1': 94.17307265160898}, 'zh': {'exact_match': 84.45378151260505, 'f1': 87.50140056022403}, 'vi': {'exact_match': 87.56302521008404, 'f1': 93.44286792284115}, 'en': {'exact_match': 95.29411764705883, 'f1': 97.33996743890854}, 'es': {'exact_match': 92.43697478991596, 'f1': 96.20689571799173}, 'hi': {'exact_match': 77.47899159663865, 'f1': 88.20275068140423}, 'el': {'exact_match': 86.97478991596638, 'f1': 92.21595598206223}, 'th': {'exact_match': 16.80672268907563, 'f1': 25.175680709833998}, 'tr': {'exact_match': 84.45378151260505, 'f1': 89.85649740393453}, 'ru': {'exact_match': 90.08403361344538, 'f1': 94.35131833669668}, 'ro': {'exact_match': 91.34453781512605, 'f1': 95.4745314605511}}


In [None]:
df_results_fine_tuning_xquad_mbert = results_df(results_fine_tuning_xquad_mbert, "FT_xquad_mbert")
df_results_fine_tuning_xquad_mbert.to_csv("results/results_fine_tuning_xquad_mbert.csv")
df_results_fine_tuning_xquad_mbert

Unnamed: 0,lang,F1_FT_xquad_mbert,EM_FT_xquad_mbert
0,ar,90.01,84.29
1,de,94.17,90.0
2,zh,87.5,84.45
3,vi,93.44,87.56
4,en,97.34,95.29
5,es,96.21,92.44
6,hi,88.2,77.48
7,el,92.22,86.97
8,th,25.18,16.81
9,tr,89.86,84.45
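
Thai stands far apart from the other languages in this table. A simple check like the following can flag such outliers automatically before results are aggregated. It is only a sketch: the F1 values are copied (rounded) from the printed results, while the helper name and the 20-point margin are arbitrary assumptions. The check says nothing about the cause, which this section does not establish.

```python
from statistics import median

# F1 per language, rounded to two decimals from the printed results.
f1_scores = {"ar": 90.01, "de": 94.17, "zh": 87.50, "vi": 93.44, "en": 97.34, "es": 96.21,
             "hi": 88.20, "el": 92.22, "th": 25.18, "tr": 89.86, "ru": 94.35, "ro": 95.47}


def flag_outliers(scores, margin=20.0):
    """Return the languages whose score lies more than `margin` points below the median."""
    threshold = median(scores.values()) - margin
    return sorted(lang for lang, score in scores.items() if score < threshold)


print(flag_outliers(f1_scores))  # ['th']
```

A flagged language is a prompt to inspect tokenization, preprocessing and a few raw predictions for that language before drawing conclusions from the average.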


## Fine-tuning XQuAD XLM-R

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "alon-albalak/xlm-roberta-base-xquad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="xlm-roberta-base-xquad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

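# train_dataset and eval_dataset are None: this Trainer only runs prediction with the loaded checkpoint.
trainer = Trainer(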
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

https://huggingface.co/alon-albalak/xlm-roberta-base-xquad/resolve/main/config.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpztllec7w


Downloading:   0%|          | 0.00/692 [00:00<?, ?B/s]

storing https://huggingface.co/alon-albalak/xlm-roberta-base-xquad/resolve/main/config.json in cache at /root/.cache/huggingface/transformers/e2d4ea694ec14fd7be1dd49b8c6907c4f654ee0454829427b44dcd35bdfc25e3.8b062b593b6db4aac49a300f3c4e9282269bbbfcdfecc1baa101ea84d93b8e9e
creating metadata file for /root/.cache/huggingface/transformers/e2d4ea694ec14fd7be1dd49b8c6907c4f654ee0454829427b44dcd35bdfc25e3.8b062b593b6db4aac49a300f3c4e9282269bbbfcdfecc1baa101ea84d93b8e9e
loading configuration file https://huggingface.co/alon-albalak/xlm-roberta-base-xquad/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/e2d4ea694ec14fd7be1dd49b8c6907c4f654ee0454829427b44dcd35bdfc25e3.8b062b593b6db4aac49a300f3c4e9282269bbbfcdfecc1baa101ea84d93b8e9e
Model config XLMRobertaConfig {
  "_name_or_path": "alon-albalak/xlm-roberta-base-xquad",
  "architectures": [
    "XLMRobertaForQuestionAnswering"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": nul

Downloading:   0%|          | 0.00/1.03G [00:00<?, ?B/s]

storing https://huggingface.co/alon-albalak/xlm-roberta-base-xquad/resolve/main/pytorch_model.bin in cache at /root/.cache/huggingface/transformers/7a4965177aac7fa2b5726644da9dc09f7993eebe4b18bfea911179e4221c6d64.e573141447b29e44474bc05f85798e8d9f28e83e25a3c7a6963f9187a76c438b
creating metadata file for /root/.cache/huggingface/transformers/7a4965177aac7fa2b5726644da9dc09f7993eebe4b18bfea911179e4221c6d64.e573141447b29e44474bc05f85798e8d9f28e83e25a3c7a6963f9187a76c438b
loading weights file https://huggingface.co/alon-albalak/xlm-roberta-base-xquad/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/7a4965177aac7fa2b5726644da9dc09f7993eebe4b18bfea911179e4221c6d64.e573141447b29e44474bc05f85798e8d9f28e83e25a3c7a6963f9187a76c438b
All model checkpoint weights were used when initializing XLMRobertaForQuestionAnswering.

All the weights of XLMRobertaForQuestionAnswering were initialized from the model checkpoint at alon-albalak/xlm-roberta-base-xquad.
If your tas

Downloading:   0%|          | 0.00/356 [00:00<?, ?B/s]

storing https://huggingface.co/alon-albalak/xlm-roberta-base-xquad/resolve/main/tokenizer_config.json in cache at /root/.cache/huggingface/transformers/5869aa8df5966cdb5ab78a4b59c5a33ce3667c8d0e447f261b3c0176159c63e7.fea55cb36db9ffd477d449fa9329a4b36f413dcd9a2680ed17914af13f83cd83
creating metadata file for /root/.cache/huggingface/transformers/5869aa8df5966cdb5ab78a4b59c5a33ce3667c8d0e447f261b3c0176159c63e7.fea55cb36db9ffd477d449fa9329a4b36f413dcd9a2680ed17914af13f83cd83
loading configuration file https://huggingface.co/alon-albalak/xlm-roberta-base-xquad/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/e2d4ea694ec14fd7be1dd49b8c6907c4f654ee0454829427b44dcd35bdfc25e3.8b062b593b6db4aac49a300f3c4e9282269bbbfcdfecc1baa101ea84d93b8e9e
Model config XLMRobertaConfig {
  "_name_or_path": "alon-albalak/xlm-roberta-base-xquad",
  "architectures": [
    "XLMRobertaForQuestionAnswering"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dro

Downloading:   0%|          | 0.00/4.83M [00:00<?, ?B/s]

storing https://huggingface.co/alon-albalak/xlm-roberta-base-xquad/resolve/main/sentencepiece.bpe.model in cache at /root/.cache/huggingface/transformers/5712424c300472d474e267328969c68208ccee7e9097c21f01dd1e5f449955be.71e50b08dbe7e5375398e165096cacc3d2086119d6a449364490da6908de655e
creating metadata file for /root/.cache/huggingface/transformers/5712424c300472d474e267328969c68208ccee7e9097c21f01dd1e5f449955be.71e50b08dbe7e5375398e165096cacc3d2086119d6a449364490da6908de655e
https://huggingface.co/alon-albalak/xlm-roberta-base-xquad/resolve/main/tokenizer.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpe_6zawnm


Downloading:   0%|          | 0.00/8.66M [00:00<?, ?B/s]

storing https://huggingface.co/alon-albalak/xlm-roberta-base-xquad/resolve/main/tokenizer.json in cache at /root/.cache/huggingface/transformers/b826d2cfbd393622580a5565a322ba25316f5df40e6e95f44a232e43d0aeaa1b.2dedbd3aa2bb53e8e26ed0125daf18f6d4aeeeeb98252d7ce59b3a63d810a963
creating metadata file for /root/.cache/huggingface/transformers/b826d2cfbd393622580a5565a322ba25316f5df40e6e95f44a232e43d0aeaa1b.2dedbd3aa2bb53e8e26ed0125daf18f6d4aeeeeb98252d7ce59b3a63d810a963
https://huggingface.co/alon-albalak/xlm-roberta-base-xquad/resolve/main/special_tokens_map.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmp7ngvynr0


Downloading:   0%|          | 0.00/239 [00:00<?, ?B/s]

storing https://huggingface.co/alon-albalak/xlm-roberta-base-xquad/resolve/main/special_tokens_map.json in cache at /root/.cache/huggingface/transformers/cb10a82f98694a3b203431384c67210b05e561bc93c399e7321ae8b267afc5eb.a11ebb04664c067c8fe5ef8f8068b0f721263414a26058692f7b2e4ba2a1b342
creating metadata file for /root/.cache/huggingface/transformers/cb10a82f98694a3b203431384c67210b05e561bc93c399e7321ae8b267afc5eb.a11ebb04664c067c8fe5ef8f8068b0f721263414a26058692f7b2e4ba2a1b342
loading file https://huggingface.co/alon-albalak/xlm-roberta-base-xquad/resolve/main/sentencepiece.bpe.model from cache at /root/.cache/huggingface/transformers/5712424c300472d474e267328969c68208ccee7e9097c21f01dd1e5f449955be.71e50b08dbe7e5375398e165096cacc3d2086119d6a449364490da6908de655e
loading file https://huggingface.co/alon-albalak/xlm-roberta-base-xquad/resolve/main/tokenizer.json from cache at /root/.cache/huggingface/transformers/b826d2cfbd393622580a5565a322ba25316f5df40e6e95f44a232e43d0aeaa1b.2dedbd3aa2bb5

In [None]:
langs = ["ar", "de", "zh", "vi", "en", "es", "hi", "el", "th", "tr", "ru", "ro"]
split = "test"

map_datasets(langs, split, prepare_validation_features)

  0%|          | 0/2 [00:00<?, ?ba/s]

In [None]:
results_fine_tuning_xquad_xlm_r = compute_results(langs, split)
print(results_fine_tuning_xquad_xlm_r)

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1292
  Batch size = 16


Post-processing 1190 example predictions split into 1292 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1269
  Batch size = 16


Post-processing 1190 example predictions split into 1269 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1230
  Batch size = 16


Post-processing 1190 example predictions split into 1230 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1287
  Batch size = 16


Post-processing 1190 example predictions split into 1287 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1249
  Batch size = 16


Post-processing 1190 example predictions split into 1249 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1275
  Batch size = 16


Post-processing 1190 example predictions split into 1275 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1342
  Batch size = 16


Post-processing 1190 example predictions split into 1342 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1417
  Batch size = 16


Post-processing 1190 example predictions split into 1417 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1283
  Batch size = 16


Post-processing 1190 example predictions split into 1283 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1253
  Batch size = 16


Post-processing 1190 example predictions split into 1253 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1292
  Batch size = 16


Post-processing 1190 example predictions split into 1292 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1296
  Batch size = 16


Post-processing 1190 example predictions split into 1296 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

{'ar': {'exact_match': 88.15126050420169, 'f1': 92.5003818120229}, 'de': {'exact_match': 91.84873949579831, 'f1': 95.11634782167732}, 'zh': {'exact_match': 92.94117647058823, 'f1': 93.97478991596638}, 'vi': {'exact_match': 91.26050420168067, 'f1': 95.50502157631122}, 'en': {'exact_match': 97.47899159663865, 'f1': 98.52001063980404}, 'es': {'exact_match': 93.61344537815125, 'f1': 97.80438425351333}, 'hi': {'exact_match': 88.57142857142857, 'f1': 92.62737821541008}, 'el': {'exact_match': 91.84873949579831, 'f1': 96.03973472922334}, 'th': {'exact_match': 92.3529411764706, 'f1': 94.02063389458345}, 'tr': {'exact_match': 87.3109243697479, 'f1': 91.96783162809383}, 'ru': {'exact_match': 90.84033613445378, 'f1': 95.15990194807932}, 'ro': {'exact_match': 94.78991596638656, 'f1': 97.70723093575782}}
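The prediction logs above report more features than the 1190 examples per language (1292 features for the first language, for instance). A likely explanation, assuming `prepare_validation_features` follows the usual sliding-window tokenization for extractive QA, is that contexts longer than the maximum sequence length are split into several overlapping windows, so one example can yield more than one feature. The sketch below illustrates the idea in plain Python; `split_into_windows`, `count_features`, `max_len` and `stride` are hypothetical names chosen for illustration, not the notebook's helpers.

```python
def split_into_windows(tokens, max_len, stride):
    """Split a token list into windows of at most max_len tokens.

    Consecutive windows overlap by `stride` tokens (assumed convention).
    """
    if stride >= max_len:
        raise ValueError("stride must be smaller than max_len")
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_len])
        # Stop once the current window reaches the end of the sequence.
        if start + max_len >= len(tokens):
            break
        start += max_len - stride
    return windows


def count_features(examples, max_len, stride):
    """Total number of windows (features) produced by a list of examples."""
    return sum(len(split_into_windows(ex, max_len, stride)) for ex in examples)


# A short example fits in one window, a long one is split into several.
short_example = list(range(3))
long_example = list(range(10))
print(count_features([short_example, long_example], max_len=4, stride=2))  # 2 examples -> 5 features
```

Under this assumption, the number of features depends on how many tokens the tokenizer produces for each context, which would explain why the counts differ from one language to the next.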


In [None]:
df_results_fine_tuning_xquad_xlm_r = results_df(results_fine_tuning_xquad_xlm_r, "FT_xquad_xlm_r")
df_results_fine_tuning_xquad_xlm_r.to_csv("results/results_fine_tuning_xquad_xlm_r.csv")
df_results_fine_tuning_xquad_xlm_r

Unnamed: 0,lang,F1_FT_xquad_xlm_r,EM_FT_xquad_xlm_r
0,ar,92.5,88.15
1,de,95.12,91.85
2,zh,93.97,92.94
3,vi,95.51,91.26
4,en,98.52,97.48
5,es,97.8,93.61
6,hi,92.63,88.57
7,el,96.04,91.85
8,th,94.02,92.35
9,tr,91.97,87.31
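The table above holds the same numbers as the printed dictionary, rounded to two decimals and laid out with one row per language. The `results_df` helper is defined elsewhere in the notebook; the sketch below only illustrates that kind of flattening in plain Python. `flatten_results` and the `"demo"` tag are hypothetical, and the two sample entries are copied from the output above.

```python
def flatten_results(results, tag):
    """Turn {lang: {"exact_match": ..., "f1": ...}} into a list of flat rows."""
    rows = []
    for lang, metrics in results.items():
        rows.append({
            "lang": lang,
            f"F1_{tag}": round(metrics["f1"], 2),
            f"EM_{tag}": round(metrics["exact_match"], 2),
        })
    return rows


# Two entries copied from the printed results above.
sample = {
    "ar": {"exact_match": 88.15126050420169, "f1": 92.5003818120229},
    "en": {"exact_match": 97.47899159663865, "f1": 98.52001063980404},
}
for row in flatten_results(sample, "demo"):
    print(row)
```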


## Fine-tuning XQuAD XLM-R-large

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

# Load the XLM-R large XQuAD checkpoint and its tokenizer from the Hugging Face Hub.
model_name = "alon-albalak/xlm-roberta-large-xquad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

# Same arguments as in the other experiments; the batch size also applies to prediction.
batch_size = 16
training_args = TrainingArguments(
    output_dir="xlm-roberta-large-xquad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

# No train/eval datasets are attached: this Trainer is only used to run prediction below.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

https://huggingface.co/alon-albalak/xlm-roberta-large-xquad/resolve/main/config.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpqms1va75


Downloading:   0%|          | 0.00/694 [00:00<?, ?B/s]

storing https://huggingface.co/alon-albalak/xlm-roberta-large-xquad/resolve/main/config.json in cache at /root/.cache/huggingface/transformers/589a3f1fafca5a7d5c612c5243f4a1cc46e421309ce5981a3a1b0d5a2455b9d7.82e8d4b651bc479215ea60a66f5b1ee98a9eecf3910c8ebf96aa47587fc7548c
creating metadata file for /root/.cache/huggingface/transformers/589a3f1fafca5a7d5c612c5243f4a1cc46e421309ce5981a3a1b0d5a2455b9d7.82e8d4b651bc479215ea60a66f5b1ee98a9eecf3910c8ebf96aa47587fc7548c
loading configuration file https://huggingface.co/alon-albalak/xlm-roberta-large-xquad/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/589a3f1fafca5a7d5c612c5243f4a1cc46e421309ce5981a3a1b0d5a2455b9d7.82e8d4b651bc479215ea60a66f5b1ee98a9eecf3910c8ebf96aa47587fc7548c
Model config XLMRobertaConfig {
  "_name_or_path": "alon-albalak/xlm-roberta-large-xquad",
  "architectures": [
    "XLMRobertaForQuestionAnswering"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": 

Downloading:   0%|          | 0.00/2.08G [00:00<?, ?B/s]

storing https://huggingface.co/alon-albalak/xlm-roberta-large-xquad/resolve/main/pytorch_model.bin in cache at /root/.cache/huggingface/transformers/d6799fc6d166b019982cb850aa1e08b35c283b96a4cd235f7e5136e20e22d6c7.48e708136ff9dbcdf5be31a11f1dad1e56cea7710a769f3fd22e4c181163bf58
creating metadata file for /root/.cache/huggingface/transformers/d6799fc6d166b019982cb850aa1e08b35c283b96a4cd235f7e5136e20e22d6c7.48e708136ff9dbcdf5be31a11f1dad1e56cea7710a769f3fd22e4c181163bf58
loading weights file https://huggingface.co/alon-albalak/xlm-roberta-large-xquad/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/d6799fc6d166b019982cb850aa1e08b35c283b96a4cd235f7e5136e20e22d6c7.48e708136ff9dbcdf5be31a11f1dad1e56cea7710a769f3fd22e4c181163bf58
All model checkpoint weights were used when initializing XLMRobertaForQuestionAnswering.

All the weights of XLMRobertaForQuestionAnswering were initialized from the model checkpoint at alon-albalak/xlm-roberta-large-xquad.
If your 

Downloading:   0%|          | 0.00/4.83M [00:00<?, ?B/s]

storing https://huggingface.co/alon-albalak/xlm-roberta-large-xquad/resolve/main/sentencepiece.bpe.model in cache at /root/.cache/huggingface/transformers/d42070dcbd45186cdf0047bf4f1cf65e48278096a0f0642cda1629907c81a30c.71e50b08dbe7e5375398e165096cacc3d2086119d6a449364490da6908de655e
creating metadata file for /root/.cache/huggingface/transformers/d42070dcbd45186cdf0047bf4f1cf65e48278096a0f0642cda1629907c81a30c.71e50b08dbe7e5375398e165096cacc3d2086119d6a449364490da6908de655e
https://huggingface.co/alon-albalak/xlm-roberta-large-xquad/resolve/main/tokenizer.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpvttvoglh


Downloading:   0%|          | 0.00/8.66M [00:00<?, ?B/s]

storing https://huggingface.co/alon-albalak/xlm-roberta-large-xquad/resolve/main/tokenizer.json in cache at /root/.cache/huggingface/transformers/9ba1409b5e154f113894e8f25560e1447c7307223c98ad971640f6b4f834821a.2dedbd3aa2bb53e8e26ed0125daf18f6d4aeeeeb98252d7ce59b3a63d810a963
creating metadata file for /root/.cache/huggingface/transformers/9ba1409b5e154f113894e8f25560e1447c7307223c98ad971640f6b4f834821a.2dedbd3aa2bb53e8e26ed0125daf18f6d4aeeeeb98252d7ce59b3a63d810a963
loading file https://huggingface.co/alon-albalak/xlm-roberta-large-xquad/resolve/main/sentencepiece.bpe.model from cache at /root/.cache/huggingface/transformers/d42070dcbd45186cdf0047bf4f1cf65e48278096a0f0642cda1629907c81a30c.71e50b08dbe7e5375398e165096cacc3d2086119d6a449364490da6908de655e
loading file https://huggingface.co/alon-albalak/xlm-roberta-large-xquad/resolve/main/tokenizer.json from cache at /root/.cache/huggingface/transformers/9ba1409b5e154f113894e8f25560e1447c7307223c98ad971640f6b4f834821a.2dedbd3aa2bb53e8e26

In [None]:
langs = ["ar", "de", "zh", "vi", "en", "es", "hi", "el", "th", "tr", "ru", "ro"]
split = "test"

map_datasets(langs, split, prepare_validation_features)

  0%|          | 0/2 [00:00<?, ?ba/s]

In [None]:
results_fine_tuning_xquad_xlm_r_large = compute_results(langs, split)
print(results_fine_tuning_xquad_xlm_r_large)

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1292
  Batch size = 16


Post-processing 1190 example predictions split into 1292 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1269
  Batch size = 16


Post-processing 1190 example predictions split into 1269 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1230
  Batch size = 16


Post-processing 1190 example predictions split into 1230 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1287
  Batch size = 16


Post-processing 1190 example predictions split into 1287 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1249
  Batch size = 16


Post-processing 1190 example predictions split into 1249 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1275
  Batch size = 16


Post-processing 1190 example predictions split into 1275 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1342
  Batch size = 16


Post-processing 1190 example predictions split into 1342 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1417
  Batch size = 16


Post-processing 1190 example predictions split into 1417 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1283
  Batch size = 16


Post-processing 1190 example predictions split into 1283 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1253
  Batch size = 16


Post-processing 1190 example predictions split into 1253 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1292
  Batch size = 16


Post-processing 1190 example predictions split into 1292 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1296
  Batch size = 16


Post-processing 1190 example predictions split into 1296 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

{'ar': {'exact_match': 94.2016806722689, 'f1': 97.04008498265348}, 'de': {'exact_match': 95.63025210084034, 'f1': 98.12034550700231}, 'zh': {'exact_match': 95.71428571428571, 'f1': 96.27891156462586}, 'vi': {'exact_match': 94.03361344537815, 'f1': 97.62400268842809}, 'en': {'exact_match': 99.15966386554622, 'f1': 99.6990404004739}, 'es': {'exact_match': 95.7983193277311, 'f1': 98.45924746002913}, 'hi': {'exact_match': 93.61344537815125, 'f1': 96.50460182833032}, 'el': {'exact_match': 94.45378151260505, 'f1': 97.77491054105244}, 'th': {'exact_match': 95.12605042016807, 'f1': 96.0633843280902}, 'tr': {'exact_match': 92.26890756302521, 'f1': 95.92916027034194}, 'ru': {'exact_match': 95.96638655462185, 'f1': 98.144208266057}, 'ro': {'exact_match': 97.05882352941177, 'f1': 98.86947939461108}}
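The `exact_match` and `f1` keys in the dictionary above use the SQuAD metric names. As a rough illustration, assuming the usual SQuAD-style definitions, the sketch below computes both scores for a single prediction. It is deliberately simplified: `normalize` only lowercases and splits on whitespace, whereas the official SQuAD script also strips punctuation and English articles, and all function names here are hypothetical.

```python
from collections import Counter


def normalize(text):
    """Simplified normalization: lowercase and split on whitespace."""
    return text.lower().split()


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match, otherwise as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Paris", "paris"))
print(token_f1("the Eiffel Tower", "Eiffel Tower"))
```

The reported scores are percentages averaged over all examples of a language. Whitespace tokenization is a poor fit for languages such as Chinese or Thai, so this sketch should not be used to reproduce the numbers above.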


In [None]:
df_results_fine_tuning_xquad_xlm_r_large = results_df(results_fine_tuning_xquad_xlm_r_large, "FT_xquad_xml_r_large")
df_results_fine_tuning_xquad_xlm_r_large.to_csv("results/results_fine_tuning_xquad_xlm_r_large.csv")
df_results_fine_tuning_xquad_xlm_r_large

Unnamed: 0,lang,F1_FT_xquad_xml_r_large,EM_FT_xquad_xml_r_large
0,ar,97.04,94.2
1,de,98.12,95.63
2,zh,96.28,95.71
3,vi,97.62,94.03
4,en,99.7,99.16
5,es,98.46,95.8
6,hi,96.5,93.61
7,el,97.77,94.45
8,th,96.06,95.13
9,tr,95.93,92.27
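To compare this run with the XLM-R base run reported earlier, it can help to look at the per-language F1 difference. The sketch below hard-codes the F1 scores of both runs, rounded to two decimals from the printed dictionaries (including `ru` and `ro`, which do not appear in the tables). `f1_gains` is a hypothetical helper, not part of the notebook.

```python
# F1 scores rounded to two decimals from the printed results of both runs.
f1_base = {"ar": 92.50, "de": 95.12, "zh": 93.97, "vi": 95.51, "en": 98.52, "es": 97.80,
           "hi": 92.63, "el": 96.04, "th": 94.02, "tr": 91.97, "ru": 95.16, "ro": 97.71}
f1_large = {"ar": 97.04, "de": 98.12, "zh": 96.28, "vi": 97.62, "en": 99.70, "es": 98.46,
            "hi": 96.50, "el": 97.77, "th": 96.06, "tr": 95.93, "ru": 98.14, "ro": 98.87}


def f1_gains(base, large):
    """Per-language F1 difference (large minus base), rounded to two decimals."""
    return {lang: round(large[lang] - base[lang], 2) for lang in base}


gains = f1_gains(f1_base, f1_large)
# Print languages from the largest to the smallest gain.
for lang, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang}: {gain:+.2f}")
```

With these numbers the large checkpoint scores higher on every language, with the largest gain on Arabic and the smallest on Spanish.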


## Data Augmentation mBERT

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

# Load the mBERT XQuAD checkpoint and its tokenizer from the Hugging Face Hub.
model_name = "mrm8488/bert-multi-cased-finetuned-xquadv1"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

# Same arguments as in the other experiments; the batch size also applies to prediction.
batch_size = 16
training_args = TrainingArguments(
    output_dir="bert-multi-cased-finetuned-xquadv1",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

# No train/eval datasets are attached: this Trainer is only used to run prediction below.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

https://huggingface.co/mrm8488/bert-multi-cased-finetuned-xquadv1/resolve/main/config.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpq398pty2


Downloading:   0%|          | 0.00/657 [00:00<?, ?B/s]

storing https://huggingface.co/mrm8488/bert-multi-cased-finetuned-xquadv1/resolve/main/config.json in cache at /root/.cache/huggingface/transformers/0f0d4a1b24bfa98a5f51989ef463e5d0e88eb9e154803dac5af801f778f760da.b94e26d9c4c6febfe5881b469240758616a103ec32f181f3bec6660712555f9b
creating metadata file for /root/.cache/huggingface/transformers/0f0d4a1b24bfa98a5f51989ef463e5d0e88eb9e154803dac5af801f778f760da.b94e26d9c4c6febfe5881b469240758616a103ec32f181f3bec6660712555f9b
loading configuration file https://huggingface.co/mrm8488/bert-multi-cased-finetuned-xquadv1/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/0f0d4a1b24bfa98a5f51989ef463e5d0e88eb9e154803dac5af801f778f760da.b94e26d9c4c6febfe5881b469240758616a103ec32f181f3bec6660712555f9b
Model config BertConfig {
  "_name_or_path": "mrm8488/bert-multi-cased-finetuned-xquadv1",
  "architectures": [
    "BertForQuestionAnswering"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "direct

Downloading:   0%|          | 0.00/679M [00:00<?, ?B/s]

storing https://huggingface.co/mrm8488/bert-multi-cased-finetuned-xquadv1/resolve/main/pytorch_model.bin in cache at /root/.cache/huggingface/transformers/961c2f2e8bb39d8cb535e249cf5cf593945b5b8d49abfa0a5b4c4f2e1e7c25e4.31b829a1432106ca2faa48850072218959b3a897e315df4b9288f6755a00893b
creating metadata file for /root/.cache/huggingface/transformers/961c2f2e8bb39d8cb535e249cf5cf593945b5b8d49abfa0a5b4c4f2e1e7c25e4.31b829a1432106ca2faa48850072218959b3a897e315df4b9288f6755a00893b
loading weights file https://huggingface.co/mrm8488/bert-multi-cased-finetuned-xquadv1/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/961c2f2e8bb39d8cb535e249cf5cf593945b5b8d49abfa0a5b4c4f2e1e7c25e4.31b829a1432106ca2faa48850072218959b3a897e315df4b9288f6755a00893b
All model checkpoint weights were used when initializing BertForQuestionAnswering.

All the weights of BertForQuestionAnswering were initialized from the model checkpoint at mrm8488/bert-multi-cased-finetuned-xquadv1.
If

Downloading:   0%|          | 0.00/40.0 [00:00<?, ?B/s]

storing https://huggingface.co/mrm8488/bert-multi-cased-finetuned-xquadv1/resolve/main/tokenizer_config.json in cache at /root/.cache/huggingface/transformers/14948b39c7fd98c051749dab60bd56c6712eb6abb164395abfa81c83ba4de1a4.f19de0c372e9b00104464e8b09d5fbbdd67565d0e0af78462fb22d8f5d2c1fe1
creating metadata file for /root/.cache/huggingface/transformers/14948b39c7fd98c051749dab60bd56c6712eb6abb164395abfa81c83ba4de1a4.f19de0c372e9b00104464e8b09d5fbbdd67565d0e0af78462fb22d8f5d2c1fe1
loading configuration file https://huggingface.co/mrm8488/bert-multi-cased-finetuned-xquadv1/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/0f0d4a1b24bfa98a5f51989ef463e5d0e88eb9e154803dac5af801f778f760da.b94e26d9c4c6febfe5881b469240758616a103ec32f181f3bec6660712555f9b
Model config BertConfig {
  "_name_or_path": "mrm8488/bert-multi-cased-finetuned-xquadv1",
  "architectures": [
    "BertForQuestionAnswering"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,

Downloading:   0%|          | 0.00/972k [00:00<?, ?B/s]

storing https://huggingface.co/mrm8488/bert-multi-cased-finetuned-xquadv1/resolve/main/vocab.txt in cache at /root/.cache/huggingface/transformers/5bb748d396e0b6fa52b12b476f832ed3d4d6471fa94f584cc7b857e92a2e7ff9.6c5b6600e968f4b5e08c86d8891ea99e51537fc2bf251435fb46922e8f7a7b29
creating metadata file for /root/.cache/huggingface/transformers/5bb748d396e0b6fa52b12b476f832ed3d4d6471fa94f584cc7b857e92a2e7ff9.6c5b6600e968f4b5e08c86d8891ea99e51537fc2bf251435fb46922e8f7a7b29
https://huggingface.co/mrm8488/bert-multi-cased-finetuned-xquadv1/resolve/main/special_tokens_map.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmp4ovin5d_


Downloading:   0%|          | 0.00/112 [00:00<?, ?B/s]

storing https://huggingface.co/mrm8488/bert-multi-cased-finetuned-xquadv1/resolve/main/special_tokens_map.json in cache at /root/.cache/huggingface/transformers/c21ae7b423e72e02a5d909c5927f700025e4e0747ea39ca8db8880b8bf89b08e.dd8bd9bfd3664b530ea4e645105f557769387b3da9f79bdb55ed556bdd80611d
creating metadata file for /root/.cache/huggingface/transformers/c21ae7b423e72e02a5d909c5927f700025e4e0747ea39ca8db8880b8bf89b08e.dd8bd9bfd3664b530ea4e645105f557769387b3da9f79bdb55ed556bdd80611d
loading file https://huggingface.co/mrm8488/bert-multi-cased-finetuned-xquadv1/resolve/main/vocab.txt from cache at /root/.cache/huggingface/transformers/5bb748d396e0b6fa52b12b476f832ed3d4d6471fa94f584cc7b857e92a2e7ff9.6c5b6600e968f4b5e08c86d8891ea99e51537fc2bf251435fb46922e8f7a7b29
loading file https://huggingface.co/mrm8488/bert-multi-cased-finetuned-xquadv1/resolve/main/tokenizer.json from cache at None
loading file https://huggingface.co/mrm8488/bert-multi-cased-finetuned-xquadv1/resolve/main/added_tokens

In [None]:
langs = ["ar", "de", "zh", "vi", "en", "es", "hi", "el", "th", "tr", "ru", "ro"]
split = "test"

map_datasets(langs, split, prepare_validation_features)

  0%|          | 0/2 [00:00<?, ?ba/s]

In [None]:
results_data_augmentation_mbert = compute_results(langs, split)
print(results_data_augmentation_mbert)

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1387
  Batch size = 16


Post-processing 1190 example predictions split into 1387 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1265
  Batch size = 16


Post-processing 1190 example predictions split into 1265 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1336
  Batch size = 16


Post-processing 1190 example predictions split into 1336 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1289
  Batch size = 16


Post-processing 1190 example predictions split into 1289 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1233
  Batch size = 16


Post-processing 1190 example predictions split into 1233 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1274
  Batch size = 16


Post-processing 1190 example predictions split into 1274 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1532
  Batch size = 16


Post-processing 1190 example predictions split into 1532 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1848
  Batch size = 16


Post-processing 1190 example predictions split into 1848 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 2574
  Batch size = 16


Post-processing 1190 example predictions split into 2574 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1317
  Batch size = 16


Post-processing 1190 example predictions split into 1317 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1358
  Batch size = 16


Post-processing 1190 example predictions split into 1358 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1363
  Batch size = 16


Post-processing 1190 example predictions split into 1363 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

{'ar': {'exact_match': 94.45378151260505, 'f1': 97.06741779896319}, 'de': {'exact_match': 97.89915966386555, 'f1': 98.94310987798319}, 'zh': {'exact_match': 96.80672268907563, 'f1': 97.5254101640656}, 'vi': {'exact_match': 97.47899159663865, 'f1': 98.86345837108117}, 'en': {'exact_match': 99.24369747899159, 'f1': 99.69790826547336}, 'es': {'exact_match': 98.90756302521008, 'f1': 99.57443197124364}, 'hi': {'exact_match': 95.12605042016807, 'f1': 97.67274608446718}, 'el': {'exact_match': 94.6218487394958, 'f1': 96.96550737351699}, 'th': {'exact_match': 84.87394957983193, 'f1': 87.30776692961565}, 'tr': {'exact_match': 97.39495798319328, 'f1': 98.75147919784077}, 'ru': {'exact_match': 97.3109243697479, 'f1': 98.53471845682292}, 'ro': {'exact_match': 81.59663865546219, 'f1': 90.62669117460511}}


In [None]:
df_results_data_augmentation_mbert = results_df(results_data_augmentation_mbert, "data_augm_mbert")
df_results_data_augmentation_mbert.to_csv("results/results_data_augmentation_mbert.csv")
df_results_data_augmentation_mbert

Unnamed: 0,lang,F1_data_augm,EM_data_augm
0,ar,97.07,94.45
1,de,98.94,97.9
2,zh,97.53,96.81
3,vi,98.86,97.48
4,en,99.7,99.24
5,es,99.57,98.91
6,hi,97.67,95.13
7,el,96.97,94.62
8,th,87.31,84.87
9,tr,98.75,97.39
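
The `avg` entry reported for this model in the final results table (96.8 / 94.6) is consistent with an unweighted mean over the twelve languages. Below is a minimal sketch of that computation. It assumes the per-language scores are the values from the printed results dict above, rounded to two decimals; the variable names are illustrative and are not part of the notebook.

```python
# Per-language (F1, EM) for data-augmentation mBERT, copied from the results
# dict printed above and rounded to two decimals.
scores = {
    "ar": (97.07, 94.45), "de": (98.94, 97.90), "zh": (97.53, 96.81),
    "vi": (98.86, 97.48), "en": (99.70, 99.24), "es": (99.57, 98.91),
    "hi": (97.67, 95.13), "el": (96.97, 94.62), "th": (87.31, 84.87),
    "tr": (98.75, 97.39), "ru": (98.53, 97.31), "ro": (90.63, 81.60),
}

# Assumption: "avg" is the plain (unweighted) mean across languages.
f1_avg = sum(f1 for f1, _ in scores.values()) / len(scores)
em_avg = sum(em for _, em in scores.values()) / len(scores)

print(f"avg: {f1_avg:.1f} / {em_avg:.1f}")
```

Every XQuAD language has the same number of examples (1190 in the logs above), so a plain mean over languages gives the same result as a mean over all examples.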


## Baselines

We show results using baseline methods in the tables below. We directly fine-tune [mBERT](https://github.com/google-research/bert/blob/master/multilingual.md)
and [XLM-R Large](https://arxiv.org/abs/1911.02116) on the English SQuAD v1.1 training data
and evaluate them via zero-shot transfer on the XQuAD test datasets. For translate-train, 
we fine-tune mBERT on the SQuAD v1.1 training data, which we automatically translate
to the target language. For translate-test, we fine-tune [BERT-Large](https://arxiv.org/abs/1810.04805)
on the SQuAD v1.1 training set and evaluate it on the XQuAD test set of the target language,
which we automatically translate to English. Note that results with translate-test are not directly
comparable as we drop a small number (less than 3%) of the test examples.

| Model Baseline F1 / EM                | en   | ar   | de   | el   | es   | hi   | ru   | th   | tr   | vi   | zh   | ro   | avg  |
|:-----------------------|------|------|------|------|------|------|------|------|------|------|------|------|------|
| Zero-shot mBERT                 | 83.5 / 72.2 | 61.5 / 45.1 | 70.6 / 54.0 | 62.6 / 44.9 | 75.5 / 56.9 | 59.2 / 46.0 | 71.3 / 53.3 | 42.7 / 33.5 | 55.4 / 40.1 | 69.5 / 49.6 | 58.0 / 48.3 | 72.7 / 59.9 | 65.2 / 50.3 |
| Zero-shot XLM-R Large           | 86.5 / 75.7 | 68.6 / 49.0 | **80.4** / 63.4 | **79.8** / 61.7 | 82.0 / 63.9 | **76.7** / 59.7 | **80.1** / 64.3 | **74.2** / **62.8** | **75.9** / **59.3** | **79.1** / 59.0 | 59.3 / 50.0 | **83.6** / **69.7** | **77.2** / 61.5 |
| Translate-train mBERT | 83.5 / 72.2 | 68.0 / 51.1 | 75.6 / 60.7 | 70.0 / 53.0 | 80.2 / 63.1 | 69.6 / 55.4 | 75.0 / 59.7 | 36.9 / 33.5 | 68.9 / 54.8 | 75.6 / 56.2 | 66.2 / 56.6 |   | 70.0 / 56.0 |
| Translate-test BERT Large | **87.9** / **77.1** | **73.7** / **58.8** | 79.8 / **66.7** | 79.4 / **65.5** | **82.0** / **68.4**| 74.9 / **60.1** | 79.9 / **66.7** | 64.6 / 50.0 | 67.4 / 49.6 | 76.3 / **61.5** | **73.7** / **59.1** |     | 76.3 / **62.1** |
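
Each cell reports F1 / EM. As a reminder of what the two numbers measure, here is a minimal sketch of SQuAD v1.1-style exact match and token-level F1 for a single prediction. It assumes English-style normalization (lowercasing, removing punctuation and the articles a/an/the) and whitespace tokenization. The function names are illustrative. The metric implementation actually used in this notebook is not shown in this section and may differ, in particular for languages without whitespace word boundaries such as Chinese or Thai.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction, gold):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower", "eiffel tower"))
print(f1_score("in Paris France", "Paris"))
```

A dataset-level score is then the mean of the per-example values, expressed as a percentage. This explains why F1 is never lower than EM in the tables: a prediction that matches exactly also gets an F1 of 1.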

## Results

* Our results are similar to the XQuAD baseline results in the first table.
* Zero-shot is better than translate-test for larger models and worse for smaller models.
* Monolingual models get better results than multilingual models in translate-test.
* Larger versions of the models get better results.
* Results from worst to best: translate-train, translate-test multilingual, translate-test monolingual, zero-shot, fine-tuning, data augmentation.
* Best languages: English, Spanish, Romanian, Russian.
* Worst languages: Chinese, Hindi, Thai, Turkish.

| Model Ours F1 / EM                            | en          | ar          | de          | el          | es          | hi          | ru          | th          | tr          | vi          | zh          | ro          | avg         |
|:-----------------------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|
| *Zero-shot* | | | | | | | | | | | | | |
| Zero-shot mBERT              | 85.0 / 73.5 | 57.8 / 42.2 | 72.6 / 55.9 | 62.2 / 45.2 | 76.4 / 58.1 | 55.3 / 40.6 | 71.3 / 54.7 | 35.1 / 26.3 | 51.1 / 34.9 | 68.1 / 47.9 | 58.2 / 47.3 | 72.4 / 59.5 | 63.8 / 48.8 |
| Zero-shot XLM-R              | 84.4 / 73.8 | 67.9 / 52.1 | 75.3 / 59.8 | 74.3 / 57.0 | 77.0 / 59.2 | 69.0 / 52.5 | 75.1 / 58.6 | 68.0 / 56.4 | 68.0 / 51.8 | 73.6 / 54.5 | 65.0 / 55.0 | 80.0 / 66.3 | 73.1 / 58.1 |
| Zero-shot XLM-R Large        | **86.5 / 75.9** | **75.0 / 58.0** | **79.9 / 63.8** | **79.1 / 61.3** | **81.0 / 62.7** | **76.0 / 60.8** | **80.3 / 63.1** | **72.8 / 61.7** | **74.1 / 58.3** | **79.0 / 59.3** | **66.8 / 58.0** | **83.5 / 70.2** | **77.8 / 62.8** |
| *Translate-test monolingual* | | | | | | | | | | | | | |
| Translate-test BERT          |         | 69.4 / 55.0 | 75.7 / 62.7 | 75.0 / 60.6 | 77.2 / 62.6 | 69.7 / 53.7 | 74.9 / 60.5 | 60.5 / 46.5 | 59.9 / 41.8 | 72.2 / 58.3 | 69.9 / 56.0 |         | 70.4 / 55.8 |
| Translate-test BERT Large    |         | 73.6 / 59.1 | 80.4 / 66.4 | 80.2 / 66.8 | 81.9 / 68.7 | **75.3 / 61.7** | 80.1 / 67.0 | **67.5 / 53.9** | **66.3 / 47.3** | **76.4 / 62.1** | 74.0 / 59.5 |         | 75.6 / 61.2 |
| Translate-test RoBERTa       |         | 71.6 / 57.0 | 77.0 / 62.4 | 76.8 / 63.9 | 80.0 / 64.6 | 72.0 / 55.6 | 77.2 / 62.4 | 62.2 / 46.6 | 63.4 / 44.1 | 72.4 / 56.6 | 72.4 / 57.9 |         | 72.5 / 57.1 |
| Translate-test RoBERTa Large |         | **74.8 / 61.1** | **80.4 / 67.1** | **80.8 / 68.0** | **83.1 / 69.4** | 75.1 / 61.0 | **81.2 / 68.0** | 65.3 / 51.0 | 66.0 / 46.9 | 76.4 / 62.0 | **74.0 / 59.9** |         | **75.7 / 61.4** |
| *Translate-test multilingual* | | | | | | | | | | | | | |
| Translate-test mBERT         |         | 70.4 / 55.8 | 76.7 / 63.3 | 76.0 / 61.9 | 78.7 / 65.1 | 70.6 / 55.8 | 76.6 / 63.1 | 60.0 / 45.9 | 61.6 / 42.7 | 70.6 / 55.6 | 70.1 / 56.6 |         | 71.2 / 56.6 |
| Translate-test XLM-R         |         | 70.4 / 56.5 | 79.0 / 65.8 | 77.8 / 65.0 | 79.3 / 66.4 | 72.4 / 57.6 | 77.4 / 63.6 | 60.3 / 45.4 | 63.4 / 44.3 | 73.0 / 58.4 | 71.1 / 57.4 |         | 72.4 / 58.0 |
| Translate-test XLM-R Large   |         | **72.9 / 59.1** | **80.1 / 66.6** | **79.6 / 66.2** | **81.5 / 67.1** | **74.2 / 60.1** | **79.7 / 65.7** | **61.7 / 46.0** | **66.2 / 48.2** | **75.1 / 61.5** | **73.6 / 58.8** |         | **74.5 / 59.9** |
| *Translate-train* | | | | | | | | | | | | | |
| Translate-train es XLM-R     | **80.4** / 66.1 | **67.0** / 47.9 | 74.2 / 56.4 | **73.5** / 52.4 | **76.3** / 56.6 | **66.9** / 48.2 | 72.4 / 54.2 | **68.7** / **58.5** | **66.2** / 46.5 | 73.2 / 52.0 | 63.4 / 50.3 | **76.0** / 59.2 | **71.5** / 54.0 |
| Translate-train de XLM-R     | 79.8 / **67.1** | 65.9 / **48.2** | **74.3** / **58.8** | 72.3 / **54.4** | 75.9 / **57.9** | 66.4 / **50.6** | **73.1** / **56.4** | 65.4 / 56.8 | 65.8 / **50.8** | 72.7 / **53.2** | **64.7** / **55.0** | 75.3 / **61.1** | 71.0 / **55.9** |
| *Fine-tuning XQuAD* | | | | | | | | | | | | | |
| Fine-tuning mBERT            | 97.3 / 95.3 | 90.0 / 84.3 | 94.2 / 90.0 | 92.2 / 87.0 | 96.2 / 92.4 | 88.2 / 77.5 | 94.4 / 90.1 | 25.2 / 16.8 | 89.9 / 84.4 | 93.4 / 87.6 | 87.5 / 84.4 | 95.5 / 91.3 | 87.0 / 81.8 |
| Fine-tuning XLM-R            | 98.5 / 97.5 | 92.5 / 88.2 | 95.1 / 91.8 | 96.0 / 91.8 | 97.8 / 93.6 | 92.6 / 88.6 | 95.2 / 90.8 | 94.0 / 92.4 | 92.0 / 87.3 | 95.5 / 91.3 | 94.0 / 92.9 | 97.7 / 94.8 | 95.1 / 91.8 |
| Fine-tuning XLM-R Large      | **99.7 / 99.2** | **97.0 / 94.2** | **98.1 / 95.6** | **97.8 / 94.4** | **98.5 / 95.8** | **96.5 / 93.6** | **98.1 / 96.0** | **96.1 / 95.1** | **95.9 / 92.3** | **97.6 / 94.0** | **96.3 / 95.7** | **98.9 / 97.1** | **97.5 / 95.2** |
| *Data-augmentation XQuAD* | | | | | | | | | | | | | |
| Data-augmentation mBERT      | 99.7 / 99.2 | 97.1 / 94.4 | 98.9 / 97.9 | 97.0 / 94.6 | 99.6 / 98.9 | 97.7 / 95.1 | 98.5 / 97.3 | 87.3 / 84.9 | 98.8 / 97.4 | 98.9 / 97.5 | 97.5 / 96.8 | 90.6 / 81.6 | 96.8 / 94.6 |
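
Note that the zero-shot `avg` covers twelve languages while the translate-test `avg` covers only ten (no `en` or `ro`), so the two averages are not directly comparable. The sketch below recomputes the F1 means for both settings on the ten shared languages, using the rounded F1 values from the table above. The dictionary layout and names are illustrative only. Because the inputs are already rounded, the means can differ from the table's `avg` by about 0.1.

```python
# Languages present in both the zero-shot and the translate-test rows.
shared = ["ar", "de", "el", "es", "hi", "ru", "th", "tr", "vi", "zh"]

# F1 values copied from the table above, in the order of `shared`.
f1 = {
    "Zero-shot mBERT":            [57.8, 72.6, 62.2, 76.4, 55.3, 71.3, 35.1, 51.1, 68.1, 58.2],
    "Translate-test mBERT":       [70.4, 76.7, 76.0, 78.7, 70.6, 76.6, 60.0, 61.6, 70.6, 70.1],
    "Zero-shot XLM-R Large":      [75.0, 79.9, 79.1, 81.0, 76.0, 80.3, 72.8, 74.1, 79.0, 66.8],
    "Translate-test XLM-R Large": [72.9, 80.1, 79.6, 81.5, 74.2, 79.7, 61.7, 66.2, 75.1, 73.6],
}

# Unweighted mean F1 over the shared languages for each setting.
mean_f1 = {model: sum(values) / len(values) for model, values in f1.items()}

for model, value in mean_f1.items():
    print(f"{model:28s} {value:.1f}")
```

On this like-for-like subset, translate-test is ahead for mBERT while zero-shot is ahead for XLM-R Large, which matches the observation that zero-shot wins for larger models and loses for smaller ones.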

In [None]:
import pandas as pd

df_results_zero_shot_mbert = pd.read_csv('results/results_zero_shot_mbert.csv')
df_results_zero_shot_xlm_r = pd.read_csv('results/results_zero_shot_xlm_r.csv')
df_results_zero_shot_xlm_r_large = pd.read_csv("results/results_zero_shot_xlm_r_large.csv")
df_results_translate_test_mbert = pd.read_csv("results/results_translate_test_mbert.csv")
df_results_translate_test_bert = pd.read_csv("results/results_translate_test_bert.csv")
df_results_translate_test_bert_large = pd.read_csv("results/results_translate_test_bert_large.csv")
df_results_translate_test_xlm_r = pd.read_csv("results/results_translate_test_xlm_r.csv")
df_results_translate_test_xlm_r_large = pd.read_csv("results/results_translate_test_xlm_r_large.csv")
df_results_translate_test_roberta = pd.read_csv("results/results_translate_test_roberta.csv")
df_results_translate_test_roberta_large = pd.read_csv("results/results_translate_test_roberta_large.csv")
#df_results_translate_train_es_mbert = pd.read_csv("results/results_translate_train_es_mbert.csv")
df_results_translate_train_es_xlm_r = pd.read_csv("results/results_translate_train_es_xlm_r.csv")
df_results_translate_train_de_xlm_r = pd.read_csv("results/results_translate_train_de_xlm_r.csv")
df_results_fine_tuning_mbert = pd.read_csv("results/results_fine_tuning_xquad_mbert.csv")
df_results_fine_tuning_xquad_xlm_r = pd.read_csv("results/results_fine_tuning_xquad_xlm_r.csv")
df_results_fine_tuning_xquad_xml_r_large = pd.read_csv("results/results_fine_tuning_xquad_xlm_r_large.csv")
df_results_data_augmentation_mbert = pd.read_csv("results/results_data_augmentation_mbert.csv")


dataframes = [df_results_zero_shot_mbert, 
              df_results_zero_shot_xlm_r, 
              df_results_zero_shot_xlm_r_large, 
              df_results_translate_test_mbert, 
              df_results_translate_test_bert, 
              df_results_translate_test_bert_large, 
              df_results_translate_test_xlm_r, 
              df_results_translate_test_xlm_r_large, 
              df_results_translate_test_roberta, 
              df_results_translate_test_roberta_large, 
              #df_results_translate_train_es_mbert,
              df_results_translate_train_es_xlm_r, 
              df_results_translate_train_de_xlm_r,
              df_results_fine_tuning_mbert,
              df_results_fine_tuning_xquad_xlm_r,
              df_results_fine_tuning_xquad_xml_r_large,
              df_results_data_augmentation_mbert, 
              ]

In [None]:
dfs = []
for df in dataframes:
    name1 = list(df.columns)[2]
    name2 = list(df.columns)[3]
    name = name1[3:]
    df = df.round(1)
    df = df.astype({name1: 'str', name2: 'str'}) #float to string
    df[name] = df[[name1, name2]].apply(lambda x: ' / '.join(x), axis=1) #concat F1 and EM
    df = df.drop([name1, name2,"Unnamed: 0"], axis=1) #remove useless columns
    df = df.set_index('lang').T #rotate dataframe
    dfs.append(df)
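
To make the reshaping in the loop above concrete, here is a small worked example on a toy frame with two languages. The toy values are taken from the data-augmentation output shown earlier. The frame itself and the names `toy`, `label` and `row` are illustrative only. String concatenation is used in place of the row-wise `join`, which should give the same result.

```python
import pandas as pd

# Toy stand-in for one per-model results CSV (two languages only).
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "lang": ["de", "th"],
    "F1_data_augm": [98.94, 87.31],
    "EM_data_augm": [97.90, 84.87],
})

f1_col, em_col = list(toy.columns)[2:4]
label = f1_col[3:]  # strip the "F1_" prefix -> "data_augm"

# Round to one decimal, then turn both score columns into strings.
row = toy.round(1).astype({f1_col: "str", em_col: "str"})
# Build the "F1 / EM" cell text.
row[label] = row[f1_col] + " / " + row[em_col]
# Keep only lang + combined column, then rotate: one row per model.
row = row.drop([f1_col, em_col, "Unnamed: 0"], axis=1).set_index("lang").T

print(row)
```

The result is a one-row frame indexed by the model label, with one column per language, which is the shape `pd.concat(..., axis=0)` needs to stack the models into a single table.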

In [None]:
# NOTE: this rebinds the name `results_df`, which earlier cells call as a helper function
results_df = pd.concat(dfs, axis=0)
# reorder languages to match baseline
results_df = results_df[['en', 'ar', 'de', 'el', 'es', 'hi', 'ru', 'th', 'tr', 'vi', 'zh', 'ro', 'avg']]
# rename rows
rows = ["Zero-shot mBERT", "Zero-shot XLM-R", "Zero-shot XLM-R Large", 
        "Translate-test mBERT", "Translate-test BERT", "Translate-test BERT Large",
        "Translate-test XLM-R", "Translate-test XLM-R Large", "Translate-test RoBERTa",
        "Translate-test RoBERTa Large", "Translate-train es XLM-R", "Translate-train de XLM-R",
        "Fine-tuning mBERT", "Fine-tuning XLM-R", "Fine-tuning XLM-R Large", "Data-augmentation mBERT"]
# assign labels by position; a dict-based rename would collapse rows if two index labels coincided
results_df.index = rows
results_df.to_csv("results/results.csv")
display(results_df)

lang,en,ar,de,el,es,hi,ru,th,tr,vi,zh,ro,avg
Zero-shot mBERT,85.0 / 73.5,57.8 / 42.2,72.6 / 55.9,62.2 / 45.2,76.4 / 58.1,55.3 / 40.6,71.3 / 54.7,35.1 / 26.3,51.1 / 34.9,68.1 / 47.9,58.2 / 47.3,72.4 / 59.5,63.8 / 48.8
Zero-shot XLM-R,84.4 / 73.8,67.9 / 52.1,75.3 / 59.8,74.3 / 57.0,77.0 / 59.2,69.0 / 52.5,75.1 / 58.6,68.0 / 56.4,68.0 / 51.8,73.6 / 54.5,65.0 / 55.0,80.0 / 66.3,73.1 / 58.1
Zero-shot XLM-R Large,86.5 / 75.9,75.0 / 58.0,79.9 / 63.8,79.1 / 61.3,81.0 / 62.7,76.0 / 60.8,80.3 / 63.1,72.8 / 61.7,74.1 / 58.3,79.0 / 59.3,66.8 / 58.0,83.5 / 70.2,77.8 / 62.8
Translate-test mBERT,,70.4 / 55.8,76.7 / 63.3,76.0 / 61.9,78.7 / 65.1,70.6 / 55.8,76.6 / 63.1,60.0 / 45.9,61.6 / 42.7,70.6 / 55.6,70.1 / 56.6,,71.2 / 56.6
Translate-test BERT,,69.4 / 55.0,75.7 / 62.7,75.0 / 60.6,77.2 / 62.6,69.7 / 53.7,74.9 / 60.5,60.5 / 46.5,59.9 / 41.8,72.2 / 58.3,69.9 / 56.0,,70.4 / 55.8
Translate-test BERT Large,,73.6 / 59.1,80.4 / 66.4,80.2 / 66.8,81.9 / 68.7,75.3 / 61.7,80.1 / 67.0,67.5 / 53.9,66.3 / 47.3,76.4 / 62.1,74.0 / 59.5,,75.6 / 61.2
Translate-test XLM-R,,70.4 / 56.5,79.0 / 65.8,77.8 / 65.0,79.3 / 66.4,72.4 / 57.6,77.4 / 63.6,60.3 / 45.4,63.4 / 44.3,73.0 / 58.4,71.1 / 57.4,,72.4 / 58.0
Translate-test XLM-R Large,,72.9 / 59.1,80.1 / 66.6,79.6 / 66.2,81.5 / 67.1,74.2 / 60.1,79.7 / 65.7,61.7 / 46.0,66.2 / 48.2,75.1 / 61.5,73.6 / 58.8,,74.5 / 59.9
Translate-test RoBERTa,,71.6 / 57.0,77.0 / 62.4,76.8 / 63.9,80.0 / 64.6,72.0 / 55.6,77.2 / 62.4,62.2 / 46.6,63.4 / 44.1,72.4 / 56.6,72.4 / 57.9,,72.5 / 57.1
Translate-test RoBERTa Large,,74.8 / 61.1,80.4 / 67.1,80.8 / 68.0,83.1 / 69.4,75.1 / 61.0,81.2 / 68.0,65.3 / 51.0,66.0 / 46.9,76.4 / 62.0,74.0 / 59.9,,75.7 / 61.4


In [None]:
results_df.to_markdown()

'|                              | en          | ar          | de          | el          | es          | hi          | ru          | th          | tr          | vi          | zh          | ro          | avg         |\n|:-----------------------------|:------------|:------------|:------------|:------------|:------------|:------------|:------------|:------------|:------------|:------------|:------------|:------------|:------------|\n| Zero-shot mBERT              | 85.0 / 73.5 | 57.8 / 42.2 | 72.6 / 55.9 | 62.2 / 45.2 | 76.4 / 58.1 | 55.3 / 40.6 | 71.3 / 54.7 | 35.1 / 26.3 | 51.1 / 34.9 | 68.1 / 47.9 | 58.2 / 47.3 | 72.4 / 59.5 | 63.8 / 48.8 |\n| Zero-shot XLM-R              | 84.4 / 73.8 | 67.9 / 52.1 | 75.3 / 59.8 | 74.3 / 57.0 | 77.0 / 59.2 | 69.0 / 52.5 | 75.1 / 58.6 | 68.0 / 56.4 | 68.0 / 51.8 | 73.6 / 54.5 | 65.0 / 55.0 | 80.0 / 66.3 | 73.1 / 58.1 |\n| Zero-shot XLM-R Large        | 86.5 / 75.9 | 75.0 / 58.0 | 79.9 / 63.8 | 79.1 / 61.3 | 81.0 / 62.7 | 76.0 / 60.8 | 80.3 / 63.1 | 72.8

## Analysis of the results

* There are no translate-test results for English and Romanian.
* Models fine-tuned on XQuAD give much better results in all languages.
* English gets the best results, even though it has no translate-test results.
* Thai gets the worst results, but they improve a lot with fine-tuning and data augmentation.
* Overall, the results from worst to best: zero-shot, translate-train, translate-test, fine-tuning, data augmentation.
* Best languages: English, German, Romanian, Russian.
* Worst languages: Chinese, Hindi, Thai, Turkish.
* We got worse results than the baseline with zero-shot, but better results with fine-tuning and data augmentation.
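
The comparison with the baseline can be checked directly from the "F1 / EM" cells. The sketch below parses a few cells of the Zero-shot mBERT row from both tables and prints the difference (ours minus baseline). The helper `parse_cell` and the dictionaries are illustrative and not part of the notebook. Only three columns are included to keep the example short.

```python
def parse_cell(cell):
    """Split an 'F1 / EM' table cell into two floats."""
    f1, em = (float(part) for part in cell.split("/"))
    return f1, em


# Zero-shot mBERT cells copied from the baseline table and from our table.
baseline = {"en": "83.5 / 72.2", "th": "42.7 / 33.5", "avg": "65.2 / 50.3"}
ours = {"en": "85.0 / 73.5", "th": "35.1 / 26.3", "avg": "63.8 / 48.8"}

deltas = {}
for lang in baseline:
    base_f1, base_em = parse_cell(baseline[lang])
    our_f1, our_em = parse_cell(ours[lang])
    # Positive values mean our run scored higher than the baseline.
    deltas[lang] = (round(our_f1 - base_f1, 1), round(our_em - base_em, 1))
    print(f"{lang}: F1 {deltas[lang][0]:+.1f}, EM {deltas[lang][1]:+.1f}")
```

For these columns the average is about 1.4 F1 points below the baseline, with the largest gap on Thai, while English is slightly above it.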