# Zero-Shot and Translation Experiments on MLQA

If you're opening this Notebook on colab, you will need to mount your Drive and change directory.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%cd /content/drive/MyDrive/master/Applications2/project

[Errno 2] No such file or directory: '/content/drive/MyDrive/master/Applications2/project'
/content


In [3]:
%cd /content/drive/MyDrive/LAP/Subjects/AP2/project

/content/drive/MyDrive/LAP/Subjects/AP2/project


If you're opening this Notebook on colab, you will need to install 🤗 Transformers and 🤗 Datasets.

In [4]:
!pip install datasets transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.2.2-py3-none-any.whl (346 kB)
Collecting transformers
  Downloading transformers-4.19.2-py3-none-any.whl (4.2 MB)
Collecting aiohttp
  Downloading aiohttp-3.8.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)
Collecting dill<0.3.5
  Downloading dill-0.3.4-py2.py3-none-any.whl (86 kB)
Collecting fsspec[http]>=2021.05.0
  Downloading fsspec-2022.5.0-py3-none-any.whl (140 kB)
Collecting xxhash
  Downloading xxhash-3.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

To be able to share your model with the community and generate results like the one shown in the picture below via the inference API, there are a few more steps to follow.

First, store your authentication token from the Hugging Face website (sign up [here](https://huggingface.co/join) if you haven't already!), then execute the following cell and enter your credentials when prompted:

In [5]:
from huggingface_hub import notebook_login

notebook_login()


## Load MLQA Dataset

MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance. It consists of over 5K extractive QA instances per language (12K in English) in SQuAD format, covering seven languages: English, Arabic, German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel: on average, each QA instance is available in 4 different languages.
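To make "SQuAD format" concrete, here is a minimal sketch of what one extractive QA record looks like and how the answer offset relates to the context. The record is hypothetical (the text is made up, not taken from MLQA), the field names assume the usual SQuAD layout as exposed by 🤗 Datasets, and `answer_spans_match` is an illustrative helper, not part of any library.

```python
# Hypothetical record in SQuAD format; the text is invented for illustration
# and does not come from MLQA.
context = "MLQA covers seven languages, including German and Hindi."
answer_text = "seven languages"

example = {
    "id": "hypothetical-0001",
    "context": context,
    "question": "How many languages does MLQA cover?",
    "answers": {
        "text": [answer_text],
        # Character offset of the answer inside the context.
        "answer_start": [context.index(answer_text)],
    },
}


def answer_spans_match(record):
    """Return True if every answer text sits at its stated offset in the context."""
    answers = record["answers"]
    return all(
        record["context"][start:start + len(text)] == text
        for text, start in zip(answers["text"], answers["answer_start"])
    )


print(answer_spans_match(example))
```

A check like this can be handy after loading a config, since an extractive model is trained to predict exactly these character spans.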

In [6]:
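from datasets import load_dataset, load_metric

# The seven MLQA languages. English has no translate-train/translate-test config.
langs = ["ar", "de", "vi", "zh", "en", "es", "hi"]
translate_langs = ["ar", "de", "vi", "zh", "es", "hi"]

# Keys into `mlqa`, grouped by experiment type.
langs_test = []
langs_translate_test = []
langs_translate_train = []

# All loaded datasets, keyed by a short config name.
mlqa = {}

# Load all 7 x 7 = 49 language-pair configs.
# In "mlqa.{lang1}.{lang2}", lang1 is the context language and lang2 the question language.
for lang1 in langs:
    for lang2 in langs:
        mlqa[f"{lang1}.{lang2}"] = load_dataset("mlqa", f"mlqa.{lang1}.{lang2}")
        langs_test.append(f"{lang1}.{lang2}")

# Load the translate-train and translate-test configs for each non-English language.
for lang in translate_langs:
    mlqa[f"translate-train.{lang}"] = load_dataset("mlqa", f"mlqa-translate-train.{lang}")
    mlqa[f"translate-test.{lang}"] = load_dataset("mlqa", f"mlqa-translate-test.{lang}")
    langs_translate_test.append(f"translate-test.{lang}")
    langs_translate_train.append(f"translate-train.{lang}")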

Downloading and preparing dataset mlqa/mlqa.ar.ar (download: 72.21 MiB, generated: 8.61 MiB, post-processed: Unknown size, total: 80.82 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.ar.ar/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/5335 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/517 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.ar.ar/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


Downloading and preparing dataset mlqa/mlqa.ar.de (download: 72.21 MiB, generated: 2.38 MiB, post-processed: Unknown size, total: 74.59 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.ar.de/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/1649 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/207 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.ar.de/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


Downloading and preparing dataset mlqa/mlqa.ar.vi (download: 72.21 MiB, generated: 3.36 MiB, post-processed: Unknown size, total: 75.57 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.ar.vi/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/2047 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/163 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.ar.vi/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


Downloading and preparing dataset mlqa/mlqa.ar.zh (download: 72.21 MiB, generated: 3.35 MiB, post-processed: Unknown size, total: 75.56 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.ar.zh/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/1912 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/188 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.ar.zh/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


Downloading and preparing dataset mlqa/mlqa.ar.en (download: 72.21 MiB, generated: 8.46 MiB, post-processed: Unknown size, total: 80.67 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.ar.en/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/5335 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/517 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.ar.en/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


Downloading and preparing dataset mlqa/mlqa.ar.es (download: 72.21 MiB, generated: 3.06 MiB, post-processed: Unknown size, total: 75.27 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.ar.es/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/1978 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/161 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.ar.es/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


Downloading and preparing dataset mlqa/mlqa.ar.hi (download: 72.21 MiB, generated: 3.12 MiB, post-processed: Unknown size, total: 75.33 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.ar.hi/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/1831 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/186 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.ar.hi/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


Downloading and preparing dataset mlqa/mlqa.de.ar (download: 72.21 MiB, generated: 1.70 MiB, post-processed: Unknown size, total: 73.91 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.de.ar/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/1649 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/207 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.de.ar/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


Downloading and preparing dataset mlqa/mlqa.de.de (download: 72.21 MiB, generated: 4.53 MiB, post-processed: Unknown size, total: 76.74 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.de.de/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/4517 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/512 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.de.de/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


Downloading and preparing dataset mlqa/mlqa.de.vi (download: 72.21 MiB, generated: 1.78 MiB, post-processed: Unknown size, total: 73.99 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.de.vi/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/1675 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/182 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.de.vi/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


Downloading and preparing dataset mlqa/mlqa.de.zh (download: 72.21 MiB, generated: 1.74 MiB, post-processed: Unknown size, total: 73.95 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.de.zh/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/1621 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/190 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.de.zh/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


Downloading and preparing dataset mlqa/mlqa.de.en (download: 72.21 MiB, generated: 4.51 MiB, post-processed: Unknown size, total: 76.72 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.de.en/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/4517 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/512 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.de.en/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


Downloading and preparing dataset mlqa/mlqa.de.es (download: 72.21 MiB, generated: 1.76 MiB, post-processed: Unknown size, total: 73.97 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.de.es/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/1776 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/196 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.de.es/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


Downloading and preparing dataset mlqa/mlqa.de.hi (download: 72.21 MiB, generated: 1.43 MiB, post-processed: Unknown size, total: 73.64 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.de.hi/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/1430 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/163 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.de.hi/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


Downloading and preparing dataset mlqa/mlqa.vi.ar (download: 72.21 MiB, generated: 3.23 MiB, post-processed: Unknown size, total: 75.45 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.vi.ar/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/2047 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/163 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.vi.ar/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


Downloading and preparing dataset mlqa/mlqa.vi.de (download: 72.21 MiB, generated: 2.35 MiB, post-processed: Unknown size, total: 74.56 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.vi.de/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/1675 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/182 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.vi.de/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


Downloading and preparing dataset mlqa/mlqa.vi.vi (download: 72.21 MiB, generated: 8.13 MiB, post-processed: Unknown size, total: 80.34 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.vi.vi/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/5495 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/511 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.vi.vi/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


Downloading and preparing dataset mlqa/mlqa.vi.zh (download: 72.21 MiB, generated: 3.06 MiB, post-processed: Unknown size, total: 75.28 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.vi.zh/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/1943 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/184 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.vi.zh/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


Downloading and preparing dataset mlqa/mlqa.vi.en (download: 72.21 MiB, generated: 8.04 MiB, post-processed: Unknown size, total: 80.26 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.vi.en/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/5495 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/511 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.vi.en/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


Downloading and preparing dataset mlqa/mlqa.vi.es (download: 72.21 MiB, generated: 2.96 MiB, post-processed: Unknown size, total: 75.17 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.vi.es/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/2018 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/189 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.vi.es/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


Downloading and preparing dataset mlqa/mlqa.vi.hi (download: 72.21 MiB, generated: 2.85 MiB, post-processed: Unknown size, total: 75.06 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.vi.hi/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/1947 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/177 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.vi.hi/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


Downloading and preparing dataset mlqa/mlqa.zh.ar (download: 72.21 MiB, generated: 1.78 MiB, post-processed: Unknown size, total: 73.99 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.zh.ar/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/1912 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/188 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.zh.ar/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


Downloading and preparing dataset mlqa/mlqa.zh.de (download: 72.21 MiB, generated: 1.46 MiB, post-processed: Unknown size, total: 73.67 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.zh.de/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/1621 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/190 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.zh.de/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


Downloading and preparing dataset mlqa/mlqa.zh.vi (download: 72.21 MiB, generated: 1.85 MiB, post-processed: Unknown size, total: 74.06 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.zh.vi/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/1943 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/184 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.zh.vi/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


Downloading and preparing dataset mlqa/mlqa.zh.zh (download: 72.21 MiB, generated: 4.54 MiB, post-processed: Unknown size, total: 76.75 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.zh.zh/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/5137 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/504 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.zh.zh/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


Downloading and preparing dataset mlqa/mlqa.zh.en (download: 72.21 MiB, generated: 4.57 MiB, post-processed: Unknown size, total: 76.78 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.zh.en/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/5137 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/504 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.zh.en/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


Downloading and preparing dataset mlqa/mlqa.zh.es (download: 72.21 MiB, generated: 1.75 MiB, post-processed: Unknown size, total: 73.96 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.zh.es/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/1947 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/161 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.zh.es/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


Downloading and preparing dataset mlqa/mlqa.zh.hi (download: 72.21 MiB, generated: 1.65 MiB, post-processed: Unknown size, total: 73.86 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.zh.hi/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/1767 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/189 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.zh.hi/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


Downloading and preparing dataset mlqa/mlqa.en.ar (download: 72.21 MiB, generated: 6.93 MiB, post-processed: Unknown size, total: 79.14 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.en.ar/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/5335 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/517 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.en.ar/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


Downloading and preparing dataset mlqa/mlqa.en.de (download: 72.21 MiB, generated: 5.29 MiB, post-processed: Unknown size, total: 77.51 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.en.de/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/4517 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/512 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.en.de/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


Downloading and preparing dataset mlqa/mlqa.en.vi (download: 72.21 MiB, generated: 7.24 MiB, post-processed: Unknown size, total: 79.45 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.en.vi/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/5495 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/511 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.en.vi/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


Downloading and preparing dataset mlqa/mlqa.en.zh (download: 72.21 MiB, generated: 6.71 MiB, post-processed: Unknown size, total: 78.93 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.en.zh/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/5137 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/504 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.en.zh/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


Downloading and preparing dataset mlqa/mlqa.en.en (download: 72.21 MiB, generated: 14.40 MiB, post-processed: Unknown size, total: 86.61 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.en.en/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/11590 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1148 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.en.en/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


Downloading and preparing dataset mlqa/mlqa.en.es (download: 72.21 MiB, generated: 6.31 MiB, post-processed: Unknown size, total: 78.53 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.en.es/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/5253 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/500 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.en.es/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


Downloading and preparing dataset mlqa/mlqa.en.hi (download: 72.21 MiB, generated: 6.59 MiB, post-processed: Unknown size, total: 78.80 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.en.hi/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/4918 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/507 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.en.hi/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


Downloading and preparing dataset mlqa/mlqa.es.ar (download: 72.21 MiB, generated: 1.76 MiB, post-processed: Unknown size, total: 73.97 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.es.ar/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/1978 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/161 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.es.ar/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


Downloading and preparing dataset mlqa/mlqa.es.de (download: 72.21 MiB, generated: 1.43 MiB, post-processed: Unknown size, total: 73.64 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.es.de/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/1776 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/196 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.es.de/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


Downloading and preparing dataset mlqa/mlqa.es.vi (download: 72.21 MiB, generated: 1.79 MiB, post-processed: Unknown size, total: 74.00 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.es.vi/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/2018 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/189 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.es.vi/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


Downloading and preparing dataset mlqa/mlqa.es.zh (download: 72.21 MiB, generated: 1.68 MiB, post-processed: Unknown size, total: 73.89 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.es.zh/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/1947 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/161 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.es.zh/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


Downloading and preparing dataset mlqa/mlqa.es.en (download: 72.21 MiB, generated: 4.44 MiB, post-processed: Unknown size, total: 76.65 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.es.en/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/5253 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/500 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.es.en/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


Downloading and preparing dataset mlqa/mlqa.es.es (download: 72.21 MiB, generated: 4.48 MiB, post-processed: Unknown size, total: 76.69 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.es.es/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/5253 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/500 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.es.es/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


Downloading and preparing dataset mlqa/mlqa.es.hi (download: 72.21 MiB, generated: 1.59 MiB, post-processed: Unknown size, total: 73.80 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.es.hi/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/1723 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/187 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.es.hi/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


Downloading and preparing dataset mlqa/mlqa.hi.ar (download: 72.21 MiB, generated: 4.56 MiB, post-processed: Unknown size, total: 76.77 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.hi.ar/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/1831 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/186 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.hi.ar/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


Downloading and preparing dataset mlqa/mlqa.hi.de (download: 72.21 MiB, generated: 3.11 MiB, post-processed: Unknown size, total: 75.32 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.hi.de/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/1430 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/163 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.hi.de/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

Downloading and preparing dataset mlqa/mlqa.hi.vi (download: 72.21 MiB, generated: 4.84 MiB, post-processed: Unknown size, total: 77.05 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.hi.vi/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/1947 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/177 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.hi.vi/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

Downloading and preparing dataset mlqa/mlqa.hi.zh (download: 72.21 MiB, generated: 4.48 MiB, post-processed: Unknown size, total: 76.69 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.hi.zh/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/1767 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/189 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.hi.zh/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

Downloading and preparing dataset mlqa/mlqa.hi.en (download: 72.21 MiB, generated: 11.75 MiB, post-processed: Unknown size, total: 83.96 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.hi.en/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/4918 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/507 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.hi.en/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

Downloading and preparing dataset mlqa/mlqa.hi.es (download: 72.21 MiB, generated: 4.01 MiB, post-processed: Unknown size, total: 76.22 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.hi.es/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/1723 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/187 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.hi.es/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

Downloading and preparing dataset mlqa/mlqa.hi.hi (download: 72.21 MiB, generated: 12.13 MiB, post-processed: Unknown size, total: 84.34 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.hi.hi/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/4918 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/507 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.hi.hi/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

Downloading and preparing dataset mlqa/mlqa-translate-train.ar (download: 60.43 MiB, generated: 109.07 MiB, post-processed: Unknown size, total: 169.50 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa-translate-train.ar/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Downloading data:   0%|          | 0.00/63.4M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/78058 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/9512 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa-translate-train.ar/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

Downloading and preparing dataset mlqa/mlqa-translate-test.ar (download: 9.61 MiB, generated: 5.23 MiB, post-processed: Unknown size, total: 14.84 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa-translate-test.ar/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Downloading data:   0%|          | 0.00/10.1M [00:00<?, ?B/s]

Generating test split:   0%|          | 0/5335 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa-translate-test.ar/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

Downloading and preparing dataset mlqa/mlqa-translate-train.de (download: 60.43 MiB, generated: 84.23 MiB, post-processed: Unknown size, total: 144.66 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa-translate-train.de/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating train split:   0%|          | 0/80069 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/9927 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa-translate-train.de/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

Downloading and preparing dataset mlqa/mlqa-translate-test.de (download: 9.61 MiB, generated: 3.70 MiB, post-processed: Unknown size, total: 13.31 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa-translate-test.de/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/4517 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa-translate-test.de/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

Downloading and preparing dataset mlqa/mlqa-translate-train.vi (download: 60.43 MiB, generated: 105.02 MiB, post-processed: Unknown size, total: 165.45 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa-translate-train.vi/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating train split:   0%|          | 0/84816 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10356 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa-translate-train.vi/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

Downloading and preparing dataset mlqa/mlqa-translate-test.vi (download: 9.61 MiB, generated: 5.72 MiB, post-processed: Unknown size, total: 15.33 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa-translate-test.vi/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/5495 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa-translate-test.vi/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

Downloading and preparing dataset mlqa/mlqa-translate-train.zh (download: 60.43 MiB, generated: 59.66 MiB, post-processed: Unknown size, total: 120.09 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa-translate-train.zh/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating train split:   0%|          | 0/76285 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/9568 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa-translate-train.zh/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

Downloading and preparing dataset mlqa/mlqa-translate-test.zh (download: 9.61 MiB, generated: 4.61 MiB, post-processed: Unknown size, total: 14.22 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa-translate-test.zh/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/5137 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa-translate-test.zh/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

Downloading and preparing dataset mlqa/mlqa-translate-train.es (download: 60.43 MiB, generated: 87.27 MiB, post-processed: Unknown size, total: 147.70 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa-translate-train.es/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating train split:   0%|          | 0/81810 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10123 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa-translate-train.es/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

Downloading and preparing dataset mlqa/mlqa-translate-test.es (download: 9.61 MiB, generated: 3.74 MiB, post-processed: Unknown size, total: 13.34 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa-translate-test.es/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/5253 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa-translate-test.es/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

Downloading and preparing dataset mlqa/mlqa-translate-train.hi (download: 60.43 MiB, generated: 181.71 MiB, post-processed: Unknown size, total: 242.14 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa-translate-train.hi/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating train split:   0%|          | 0/82451 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10253 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa-translate-train.hi/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

Downloading and preparing dataset mlqa/mlqa-translate-test.hi (download: 9.61 MiB, generated: 4.40 MiB, post-processed: Unknown size, total: 14.00 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa-translate-test.hi/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/4918 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa-translate-test.hi/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

In [7]:
mlqa

{'ar.ar': DatasetDict({
    test: Dataset({
        features: ['context', 'question', 'answers', 'id'],
        num_rows: 5335
    })
    validation: Dataset({
        features: ['context', 'question', 'answers', 'id'],
        num_rows: 517
    })
}),
 'ar.de': DatasetDict({
    test: Dataset({
        features: ['context', 'question', 'answers', 'id'],
        num_rows: 1649
    })
    validation: Dataset({
        features: ['context', 'question', 'answers', 'id'],
        num_rows: 207
    })
}),
 'ar.en': DatasetDict({
    test: Dataset({
        features: ['context', 'question', 'answers', 'id'],
        num_rows: 5335
    })
    validation: Dataset({
        features: ['context', 'question', 'answers', 'id'],
        num_rows: 517
    })
}),
 'ar.es': DatasetDict({
    test: Dataset({
        features: ['context', 'question', 'answers', 'id'],
        num_rows: 1978
    })
    validation: Dataset({
        features: ['context', 'question', 'answers', 'id'],
        num_rows: 161

In [8]:
mlqa["en.es"]["test"][0]

{'answers': {'answer_start': [457],
  'text': ['Rutgers University biochemists']},
 'context': 'In 1994, five unnamed civilian contractors and the widows of contractors Walter Kasza and Robert Frost sued the USAF and the United States Environmental Protection Agency. Their suit, in which they were represented by George Washington University law professor Jonathan Turley, alleged they had been present when large quantities of unknown chemicals had been burned in open pits and trenches at Groom. Biopsies taken from the complainants were analyzed by Rutgers University biochemists, who found high levels of dioxin, dibenzofuran, and trichloroethylene in their body fat. The complainants alleged they had sustained skin, liver, and respiratory injuries due to their work at Groom, and that this had contributed to the deaths of Frost and Kasza. The suit sought compensation for the injuries they had sustained, claiming the USAF had illegally handled toxic materials, and that the EPA had failed in

In [9]:
mlqa["en.es"]["validation"][0]

{'answers': {'answer_start': [571],
  'text': ['remains infected for its lifetime']},
 'context': 'Pappataci fever is prevalent in the subtropical zone of the Eastern Hemisphere between 20°N and 45°N, particularly in Southern Europe, North Africa, the Balkans, Eastern Mediterranean, Iraq, Iran, Pakistan, Afghanistan and India.The disease is transmitted by the bites of phlebotomine sandflies of the Genus Phlebotomus, in particular, Phlebotomus papatasi, Phlebotomus perniciosus and Phlebotomus perfiliewi. The sandfly becomes infected when biting an infected human in the period between 48 hours before the onset of fever and 24 hours after the end of the fever, and remains infected for its lifetime. Besides this horizontal virus transmission from man to sandfly, the virus can be transmitted in insects transovarially, from an infected female sandfly to its offspring.Pappataci fever is seldom recognised in endemic populations because it is mixed with other febrile illnesses of childhood, but

In [10]:
mlqa["translate-train.es"]["train"][0]

{'answers': {'answer_start': [575], 'text': ['Santa Bernadette Soubirous']},
 'context': 'Arquitectónico, la escuela tiene un carácter católico. sobre la cúpula de oro del edificio principal es una estatua de oro de la Virgen María. inmediatamente frente al edificio principal y frente a ella, es una estatua de cobre de Cristo con los brazos levantado con la leyenda venite ad me omnes. junto al edificio principal se encuentra la Basílica del sagrado corazón. Inmediatamente detrás de la Basílica se encuentra la gruta, un lugar de oración y reflexión de Marian. Se trata de una réplica de la gruta en Lourdes, Francia, donde la Virgen María supuestamente apareció a saint bernadette soubirous en 1858. Al final de la unidad principal (y en una línea directa que conecta a través de 3 estatuas y la cúpula de oro), es una simple y moderna estatua de piedra de María.',
 'id': '5733be284776f41900661182',
 'question': 'A quién presuntamente apareció la Virgen María en 1858 en Lourdes Francia?'}

In [11]:
mlqa["translate-train.es"]["validation"][0]

{'answers': {'answer_start': [182, 182, 182],
  'text': ['DENVER BRONCOS', 'DENVER BRONCOS', 'DENVER BRONCOS']},
 'context': "Super Bowl 50 fue un juego de fútbol americano para determinar el campeón de la liga nacional de fútbol (NFL) para la temporada 2015 la conferencia de fútbol americano (AFC) campeón DENVER BRONCOS derrotó a la conferencia nacional de fútbol (NFC) Campeona Carolina Panthers 24-10 para ganar su tercer título de super bowl. El juego se jugó el 7 de febrero de 2016, en el estadio Levi ' s en la zona de la bahía de San Francisco en santa clara, California. Como este fue el 50º super bowl, la liga hizo hincapié en el aniversario de oro con varias iniciativas temáticas de oro, así como suspender temporalmente la tradición de nombrar cada juego de super bowl con números romanos (bajo el cual el juego habría sido conocido como super bowl l), de modo que el logotipo podría característica de forma prominente los números árabes 50.",
 'id': '56be4db0acb8001400a502ec',
 'que

In [12]:
mlqa["translate-test.es"]["test"][0]

{'answers': {'answer_start': [-1],
  'text': ['A Fan-shaped structures that followed a pattern of leaves, languages and lobes']},
 'context': 'After the eruption, the emissions of pyroclastic material that occurred from the gap created by the collapse were mostly magmatic origin, and in lower proportion of fragments of pre-existing volcanic rocks. The resulting deposits formed a fan-shaped structures that followed a pattern of leaves, languages and overlapping lobes. During the eruption of may 18, there were at least 17 emissions of pyroclastic flow separated over time, whose volumes of aggregation were around 208 million m3.',
 'id': 'b77c037b331e06542272669766df3b9515366b57',
 'question': 'What was the appearance of the deposits that left that landslide?'}

## Load SQuAD Dataset

The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage. The `squad` dataset loaded here is SQuAD 1.1, which contains no unanswerable questions (those were introduced in SQuAD 2.0). https://arxiv.org/pdf/1606.05250.pdf

In [13]:
from datasets import load_dataset, load_metric

In [14]:
squad = load_dataset("squad")

Downloading builder script:   0%|          | 0.00/1.97k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/1.02k [00:00<?, ?B/s]

Downloading and preparing dataset squad/plain_text (download: 33.51 MiB, generated: 85.63 MiB, post-processed: Unknown size, total: 119.14 MiB) to /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/8.12M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.05M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Generating train split:   0%|          | 0/87599 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10570 [00:00<?, ? examples/s]

Dataset squad downloaded and prepared to /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

In [15]:
squad

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

In [16]:
squad["train"][0]

{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']},
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'id': '5733be284776f41900661182',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'title': 'University_of_Notre_Dame'}
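
Each SQuAD record stores an answer as its text plus a character offset (`answer_start`) into `context`. As a minimal sketch of that convention, the check below verifies that slicing the context at the stated offset reproduces the answer text. The helper name `answer_matches_context` and the toy record are hypothetical and not part of the notebook. The sketch also assumes that a negative offset, such as the `-1` seen in the translate-test example, means no usable position is available.

```python
def answer_matches_context(example):
    """Return True if every answer text is found at its stated character offset."""
    context = example["context"]
    answers = example["answers"]
    for start, text in zip(answers["answer_start"], answers["text"]):
        # A negative offset cannot be sliced meaningfully (assumption: it marks a missing position).
        if start < 0 or context[start:start + len(text)] != text:
            return False
    return True


# Toy record in the SQuAD layout (hypothetical data, not taken from the dataset).
toy_context = "Mount Example erupted in 1980 and again in 2004."
toy_answer = "in 1980"
toy_example = {
    "context": toy_context,
    "answers": {"answer_start": [toy_context.find(toy_answer)], "text": [toy_answer]},
}
missing_offset_example = {
    "context": toy_context,
    "answers": {"answer_start": [-1], "text": [toy_answer]},
}

print(answer_matches_context(toy_example))
print(answer_matches_context(missing_offset_example))
```

Running the same check over `squad["train"]` would be one way to confirm that the offsets are consistent before tokenization.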

In [17]:
squad["validation"][0]

{'answers': {'answer_start': [177, 177, 177],
  'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos']},
 'context': 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi\'s Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.',
 'id': '56be4db0acb8001400a502ec',
 'question': 'Which NFL team represented the AFC at Super Bo

## Preprocessing SQuAD

Load the mBERT tokenizer to process the question and context fields.

In [18]:
from transformers import AutoTokenizer

model_name = "bert-base-multilingual-cased"

tokenizer = AutoTokenizer.from_pretrained(model_name)

Downloading:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/625 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/972k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.87M [00:00<?, ?B/s]

One thing specific to preprocessing for question answering is how to handle very long documents. In other tasks we usually truncate inputs that exceed the model's maximum sequence length, but here removing part of the context might remove the answer we are looking for. To deal with this, we allow one (long) example in our dataset to produce several input features, each shorter than the maximum length of the model (or the one we set as a hyper-parameter). In case the answer lies at the point where a long context is split, we also allow some overlap between the generated features, controlled by the hyper-parameter `doc_stride`:

In [19]:
max_length = 384 # The maximum length of a feature (question and context)
doc_stride = 128 # The allowed overlap between two parts of the context when it has to be split.
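
To make the role of the overlap concrete, here is a simplified sketch of splitting a token list into overlapping windows. It is a pure-Python model of the idea, not the tokenizer's actual implementation. The function name `split_with_stride` and the parameter `window_size` (the room left for context tokens once the question and special tokens are accounted for) are assumptions introduced for illustration.

```python
def split_with_stride(context_tokens, window_size, stride):
    """Split a token list into windows of at most `window_size` tokens,
    where consecutive windows share `stride` tokens."""
    if not 0 <= stride < window_size:
        raise ValueError("stride must be non-negative and smaller than window_size")
    windows = []
    start = 0
    while True:
        windows.append(context_tokens[start:start + window_size])
        if start + window_size >= len(context_tokens):
            break  # this window already reaches the end of the context
        start += window_size - stride  # step forward, keeping `stride` tokens of overlap
    return windows


tokens = [f"t{i}" for i in range(10)]
windows = split_with_stride(tokens, window_size=4, stride=2)
for window in windows:
    print(window)
```

With these toy values each window repeats the last two tokens of the previous one, so an answer that straddles a split point still appears whole in at least one window, provided it is no longer than the overlap.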

There are a few preprocessing steps particular to question answering that we should be aware of:

1. Some examples in a dataset may have a very long context that exceeds the maximum input length of the model. Truncate only the context by setting `truncation="only_second"`.
2. Next, map the start and end positions of the answer to the original context by setting `return_offsets_mapping=True`.
3. With the mapping in hand, you can find the start and end tokens of the answer. Use the `sequence_ids` method to find which part of the offsets corresponds to the `question` and which corresponds to the `context`.
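
The steps above can be illustrated with a toy whitespace tokenizer. This is only a sketch under stated assumptions: `toy_encode` and `locate_answer` are hypothetical helpers, the `[CLS] question [SEP] context [SEP]` layout mimics BERT-style inputs, and a real subword tokenizer would produce different tokens and offsets. The sketch returns `None` when the answer is not covered by the context tokens, which is a simplification.

```python
import re


def toy_encode(question, context):
    """Whitespace 'tokenizer' returning (offsets, sequence_ids) in the layout
    [CLS] question [SEP] context [SEP]; special tokens get sequence id None."""
    offsets, sequence_ids = [(0, 0)], [None]  # [CLS]
    for seq_id, text in enumerate((question, context)):
        for match in re.finditer(r"\S+", text):
            offsets.append(match.span())  # character span within its own text
            sequence_ids.append(seq_id)
        offsets.append((0, 0))  # [SEP]
        sequence_ids.append(None)
    return offsets, sequence_ids


def locate_answer(offsets, sequence_ids, start_char, end_char):
    """Return (start_token, end_token) covering [start_char, end_char), or None."""
    start_token = end_token = None
    for i, seq_id in enumerate(sequence_ids):
        if seq_id != 1:
            continue  # skip question and special tokens
        tok_start, tok_end = offsets[i]
        if start_token is None and tok_end > start_char:
            start_token = i  # first context token that ends after the answer starts
        if tok_start < end_char:
            end_token = i  # last context token that starts before the answer ends
    if start_token is None or end_token is None or start_token > end_token:
        return None
    return start_token, end_token


question = "Who won?"
context = "The Denver Broncos won the title."
answer = "Denver Broncos"
start_char = context.find(answer)
offsets, sequence_ids = toy_encode(question, context)
span = locate_answer(offsets, sequence_ids, start_char, start_char + len(answer))
print(span)
```

The sequence ids are what keep the search restricted to the context: question tokens have their own character offsets, which could otherwise be mistaken for positions in the context.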

In [20]:
def prepare_train_features(examples):
    # Some of the questions have a lot of whitespace on the left, which is not useful and will make the
    # truncation of the context fail (the tokenized question will take up a lot of space). So we remove that
    # left whitespace.
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. This can result
    # in one example giving several features when its context is long, each of those features having a
    # context that slightly overlaps the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # The offset mappings will give us a map from token to character position in the original context. This will
    # help us compute the start_positions and end_positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Let's label those examples!
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        # We will label impossible answers with the index of the CLS token.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        # If no answers are given, set the cls_index as answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start/end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != 1:
                token_start_index += 1

            # End token index of the current span in the text.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != 1:
                token_end_index -= 1

            # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise move the token_start_index and token_end_index to the two ends of the answer.
                # Note: we could go after the last offset if the answer is the last word (edge case).
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples

Use 🤗 Datasets map function to apply the preprocessing function over the entire dataset. You can speed up the map function by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don’t need.

In [21]:
squad_train = squad.map(prepare_train_features, batched=True, 
                            remove_columns=squad["train"].column_names)

  0%|          | 0/88 [00:00<?, ?ba/s]

  0%|          | 0/11 [00:00<?, ?ba/s]

The evaluation features are similar to the training features. At evaluation time we also have to check that a predicted span lies inside the context (and not the question) and recover the corresponding text. To do this, we need to add two things to our validation features:
- the ID of the example that generated the feature (since each example can generate several features, as seen before);
- the offset mapping that will give us a map from token indices to character positions in the context.

That's why we will re-process the validation set with the following function, slightly different from `prepare_train_features`:

In [22]:
def prepare_validation_features(examples):
    # Some of the questions have a lot of whitespace on the left, which is not useful and will make the
    # truncation of the context fail (the tokenized question will take up a lot of space). So we remove that
    # left whitespace.
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. This can result
    # in one example giving several features when its context is long, each of those features having a
    # context that slightly overlaps the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    # We keep the example_id that gave us this feature and we will store the offset mappings.
    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1
        
        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["id"][sample_index])

        # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token
        # position is part of the context or not.
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples
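
Setting the non-context offsets to `None` makes it cheap to reject spans that touch the question or a special token, and the remaining offsets map a token span straight back to text. The sketch below shows that lookup on hand-written data. `span_to_text` and the `masked_offsets` list are hypothetical and only mimic the shape of a masked `offset_mapping`.

```python
def span_to_text(context, offsets, start_token, end_token):
    """Return the context substring for a token span, or None if the span is invalid."""
    if start_token > end_token:
        return None
    if start_token < 0 or end_token >= len(offsets):
        return None
    if offsets[start_token] is None or offsets[end_token] is None:
        return None  # span touches the question or a special token
    start_char = offsets[start_token][0]
    end_char = offsets[end_token][1]
    return context[start_char:end_char]


context = "The Denver Broncos won the title."
# Hypothetical masked offsets: [CLS], two question tokens and both [SEP] tokens are None.
masked_offsets = [None, None, None, None,
                  (0, 3), (4, 10), (11, 18), (19, 22), (23, 26), (27, 33), None]

print(span_to_text(context, masked_offsets, 5, 6))
print(span_to_text(context, masked_offsets, 2, 5))
```

The second call returns `None` because its start token belongs to the question, which is exactly the case the masking is meant to catch.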

Use the map function again to apply the preprocessing function over the validation dataset. You can speed up the map function by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don’t need.

In [23]:
squad_eval = squad["validation"].map(prepare_validation_features, batched=True, 
                                          remove_columns=squad["validation"].column_names)

  0%|          | 0/11 [00:00<?, ?ba/s]

## Fine-tuning mBERT

Now that our data is ready for training, we can download the pretrained model and fine-tune it. Since our task is question answering, we use the `AutoModelForQuestionAnswering` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us:

In [24]:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

model_name = "bert-base-multilingual-cased"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

Downloading:   0%|          | 0.00/681M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertForQuestionAnswering: ['cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-bas

Then we need a data collator to batch our processed examples together. The default one works here:

In [25]:
from transformers import DefaultDataCollator

data_collator = DefaultDataCollator()
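
As a rough intuition for what a default collator does, the sketch below stacks a list of per-feature dictionaries into one dictionary of arrays. `toy_collate` is a hypothetical helper written for this illustration, not the library's implementation, and it assumes every feature has already been padded to the same length.

```python
import numpy as np

def toy_collate(features):
    """Stack a list of per-feature dicts into one dict of arrays (hypothetical helper)."""
    keys = features[0].keys()
    return {key: np.array([f[key] for f in features]) for key in keys}

# Two made-up features that already share the same length.
features = [
    {"input_ids": [101, 7, 8, 102], "attention_mask": [1, 1, 1, 1]},
    {"input_ids": [101, 9, 102, 0], "attention_mask": [1, 1, 1, 0]},
]
batch = toy_collate(features)
print(batch["input_ids"].shape)
```

If the features had different lengths, plain stacking like this would not produce a rectangular batch, and a collator that pads dynamically would be needed instead.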

To instantiate a `Trainer`, we will need to define three more things. The most important is the [`TrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments), which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional:

In [26]:
batch_size = 16
args = TrainingArguments(
    "bert-base-multilingual-cased-squad",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    # push_to_hub=True,
)

We will evaluate our model and compute metrics in the next section (this is a very long operation, so we will only compute the evaluation loss during training).

Then we just need to pass all of this along with our datasets to the `Trainer`:

In [27]:
trainer = Trainer(
    model,
    args,
    train_dataset=squad_train["train"],
    eval_dataset=squad_train["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

We can now finetune our model by just calling the `train` method:

In [28]:
# trainer.train()

In [29]:
# trainer.push_to_hub()

## Evaluating mBERT

We load a model that is already finetuned on SQuAD to save time. We evaluate on the validation set of SQuAD.

In [30]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, DefaultDataCollator, TrainingArguments, Trainer

model_name = "salti/bert-base-multilingual-cased-finetuned-squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="bert-base-multilingual-cased-finetuned-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

https://huggingface.co/salti/bert-base-multilingual-cased-finetuned-squad/resolve/main/config.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpqsh2ch2_


Downloading:   0%|          | 0.00/822 [00:00<?, ?B/s]

storing https://huggingface.co/salti/bert-base-multilingual-cased-finetuned-squad/resolve/main/config.json in cache at /root/.cache/huggingface/transformers/1df6572a9ae2fd1152d4fa4e3b9d30e0d303c69cb87d5b8401ef5cb032016bef.aa91fcc51e661ddbf70fda4906759b1d8178a512385633adf5d0db934ae1e333
creating metadata file for /root/.cache/huggingface/transformers/1df6572a9ae2fd1152d4fa4e3b9d30e0d303c69cb87d5b8401ef5cb032016bef.aa91fcc51e661ddbf70fda4906759b1d8178a512385633adf5d0db934ae1e333
loading configuration file https://huggingface.co/salti/bert-base-multilingual-cased-finetuned-squad/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/1df6572a9ae2fd1152d4fa4e3b9d30e0d303c69cb87d5b8401ef5cb032016bef.aa91fcc51e661ddbf70fda4906759b1d8178a512385633adf5d0db934ae1e333
Model config BertConfig {
  "_name_or_path": "salti/bert-base-multilingual-cased-finetuned-squad",
  "architectures": [
    "BertForQuestionAnswering"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_d

Downloading:   0%|          | 0.00/676M [00:00<?, ?B/s]

storing https://huggingface.co/salti/bert-base-multilingual-cased-finetuned-squad/resolve/main/pytorch_model.bin in cache at /root/.cache/huggingface/transformers/9f54849aca742a855728dc8a74c1a733627678a2e8c7a97ba60e2b318ad1438a.b9c031b09975cb84030a5da7731e0b35844cbf7eed5f7ddc0dc0704ba6cc5802
creating metadata file for /root/.cache/huggingface/transformers/9f54849aca742a855728dc8a74c1a733627678a2e8c7a97ba60e2b318ad1438a.b9c031b09975cb84030a5da7731e0b35844cbf7eed5f7ddc0dc0704ba6cc5802
loading weights file https://huggingface.co/salti/bert-base-multilingual-cased-finetuned-squad/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/9f54849aca742a855728dc8a74c1a733627678a2e8c7a97ba60e2b318ad1438a.b9c031b09975cb84030a5da7731e0b35844cbf7eed5f7ddc0dc0704ba6cc5802
All model checkpoint weights were used when initializing BertForQuestionAnswering.

All the weights of BertForQuestionAnswering were initialized from the model checkpoint at salti/bert-base-multilingual-c

Downloading:   0%|          | 0.00/264 [00:00<?, ?B/s]

storing https://huggingface.co/salti/bert-base-multilingual-cased-finetuned-squad/resolve/main/tokenizer_config.json in cache at /root/.cache/huggingface/transformers/834c4d17cccdc848e5ac1e4978b371d9d3afa3bb371fe8f6be85119e3a0ea885.c60f034cf5bf819518a0170960ddb62b4576fa3d01e9021876b801600cbb6f42
creating metadata file for /root/.cache/huggingface/transformers/834c4d17cccdc848e5ac1e4978b371d9d3afa3bb371fe8f6be85119e3a0ea885.c60f034cf5bf819518a0170960ddb62b4576fa3d01e9021876b801600cbb6f42
loading configuration file https://huggingface.co/salti/bert-base-multilingual-cased-finetuned-squad/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/1df6572a9ae2fd1152d4fa4e3b9d30e0d303c69cb87d5b8401ef5cb032016bef.aa91fcc51e661ddbf70fda4906759b1d8178a512385633adf5d0db934ae1e333
Model config BertConfig {
  "_name_or_path": "salti/bert-base-multilingual-cased-finetuned-squad",
  "architectures": [
    "BertForQuestionAnswering"
  ],
  "attention_probs_dropout_prob": 0.1,
  "cl

Downloading:   0%|          | 0.00/972k [00:00<?, ?B/s]

storing https://huggingface.co/salti/bert-base-multilingual-cased-finetuned-squad/resolve/main/vocab.txt in cache at /root/.cache/huggingface/transformers/f4a09a359df2eb92d135ea2d1092b9f0f3388951f8612fd97e9b071eaa3c3c98.6c5b6600e968f4b5e08c86d8891ea99e51537fc2bf251435fb46922e8f7a7b29
creating metadata file for /root/.cache/huggingface/transformers/f4a09a359df2eb92d135ea2d1092b9f0f3388951f8612fd97e9b071eaa3c3c98.6c5b6600e968f4b5e08c86d8891ea99e51537fc2bf251435fb46922e8f7a7b29
https://huggingface.co/salti/bert-base-multilingual-cased-finetuned-squad/resolve/main/special_tokens_map.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmp1z0rang8


Downloading:   0%|          | 0.00/112 [00:00<?, ?B/s]

storing https://huggingface.co/salti/bert-base-multilingual-cased-finetuned-squad/resolve/main/special_tokens_map.json in cache at /root/.cache/huggingface/transformers/33bece2c5789380abddaab5941fb528198297c167ddfa9547ab37dd92f940b55.dd8bd9bfd3664b530ea4e645105f557769387b3da9f79bdb55ed556bdd80611d
creating metadata file for /root/.cache/huggingface/transformers/33bece2c5789380abddaab5941fb528198297c167ddfa9547ab37dd92f940b55.dd8bd9bfd3664b530ea4e645105f557769387b3da9f79bdb55ed556bdd80611d
loading file https://huggingface.co/salti/bert-base-multilingual-cased-finetuned-squad/resolve/main/vocab.txt from cache at /root/.cache/huggingface/transformers/f4a09a359df2eb92d135ea2d1092b9f0f3388951f8612fd97e9b071eaa3c3c98.6c5b6600e968f4b5e08c86d8891ea99e51537fc2bf251435fb46922e8f7a7b29
loading file https://huggingface.co/salti/bert-base-multilingual-cased-finetuned-squad/resolve/main/tokenizer.json from cache at None
loading file https://huggingface.co/salti/bert-base-multilingual-cased-finetuned

We can grab the predictions for all features by using the `Trainer.predict` method:

In [31]:
raw_predictions = trainer.predict(squad_eval)

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 10851
  Batch size = 16
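
For a sense of scale, the numbers in the log above determine how many forward passes the prediction loop runs. This is a small arithmetic sketch; it assumes a single device, so that the per-device batch size equals the effective batch size.

```python
import math

num_features = 10851  # "Num examples" reported in the log above
batch_size = 16       # per-device evaluation batch size from the same log

num_batches = math.ceil(num_features / batch_size)
last_batch_size = num_features - (num_batches - 1) * batch_size
print(num_batches, last_batch_size)
```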


The `Trainer` *hides* the columns that are not used by the model (here `example_id` and `offset_mapping` which we will need for our post-processing), so we set them back:

In [32]:
squad_eval.set_format(type=squad_eval.format["type"], 
                      columns=list(squad_eval.features.keys()))

We can now refine the test we had before: since we set the offset mapping to `None` for tokens that belong to the question, it is easy to check whether an answer lies fully inside the context. We also discard very long answers, using a hyperparameter (`max_answer_length`) that we can tune.

As mentioned in the code above, this was easy for the first feature because we knew it came from the first example. For the other features, we need a map between examples and their corresponding features. Also, since one example can give several features, we have to gather the answers from all the features generated by a given example, then pick the best one. The following code builds a map from each example index to the indices of its features, then selects the best answer per example.

In [33]:
from tqdm.auto import tqdm
import collections
import numpy as np

def postprocess_qa_predictions(examples, features, raw_predictions, n_best_size = 20, max_answer_length = 30):
    all_start_logits, all_end_logits = raw_predictions
    # Build a map from each example to its corresponding features.
    example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
    features_per_example = collections.defaultdict(list)
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)

    # The dictionaries we have to fill.
    predictions = collections.OrderedDict()

    # Logging.
    print(f"Post-processing {len(examples)} example predictions split into {len(features)} features.")

    # Let's loop over all the examples!
    for example_index, example in enumerate(tqdm(examples)):
        # Those are the indices of the features associated to the current example.
        feature_indices = features_per_example[example_index]

        valid_answers = []
        
        context = example["context"]
        # Looping through all the features associated to the current example.
        for feature_index in feature_indices:
            # We grab the predictions of the model for this feature.
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            # This is what will allow us to map the positions in our logits to spans of text in the original
            # context.
            offset_mapping = features[feature_index]["offset_mapping"]

            # Go through all possibilities for the `n_best_size` greatest start and end logits.
            start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
            end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
                    # to part of the input_ids that are not in the context.
                    if (
                        start_index >= len(offset_mapping)
                        or end_index >= len(offset_mapping)
                        or offset_mapping[start_index] is None
                        or offset_mapping[end_index] is None
                    ):
                        continue
                    # Don't consider answers with a length that is either < 0 or > max_answer_length.
                    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                        continue

                    start_char = offset_mapping[start_index][0]
                    end_char = offset_mapping[end_index][1]
                    valid_answers.append(
                        {
                            "score": start_logits[start_index] + end_logits[end_index],
                            "text": context[start_char: end_char]
                        }
                    )
        
        if len(valid_answers) > 0:
            best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
        else:
            # In the very rare edge case where there is not a single valid prediction, we create a fake prediction to
            # avoid failure.
            best_answer = {"text": "", "score": 0.0}
        
        # Let's pick our final answer: the best one
        predictions[example["id"]] = best_answer["text"]

    return predictions
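
The span search inside the function can be tried in isolation on a single made-up feature. In this sketch, `best_span` is a hypothetical single-feature helper that mirrors the loops above; the context, offsets, and logits are invented for illustration and are not model outputs.

```python
import numpy as np

def best_span(start_logits, end_logits, offset_mapping, context,
              n_best_size=20, max_answer_length=30):
    """Toy single-feature version of the span search above (hypothetical helper)."""
    # Indices of the n_best_size largest logits, best first.
    start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
    end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
    candidates = []
    for s in start_indexes:
        for e in end_indexes:
            # Skip positions outside the context (their offsets were set to None).
            if offset_mapping[s] is None or offset_mapping[e] is None:
                continue
            # Skip spans that end before they start or that are too long.
            if e < s or e - s + 1 > max_answer_length:
                continue
            candidates.append({
                "score": float(start_logits[s] + end_logits[e]),
                "text": context[offset_mapping[s][0] : offset_mapping[e][1]],
            })
    if not candidates:
        return {"text": "", "score": 0.0}
    return max(candidates, key=lambda c: c["score"])

context = "Paris is the capital of France"
# Positions 0-2 and 9 stand for special/question tokens, so their offsets are None.
offset_mapping = [None, None, None, (0, 5), (6, 8), (9, 12),
                  (13, 20), (21, 23), (24, 30), None]
start_logits = np.array([5.0, 0.1, 0.0, 4.0, 0.2, 0.1, 0.3, 0.0, 1.0, 0.0])
end_logits = np.array([5.0, 0.0, 0.1, 3.5, 0.1, 0.0, 0.2, 0.1, 1.5, 0.0])

answer = best_span(start_logits, end_logits, offset_mapping, context)
print(answer)
```

Position 0 has the highest start and end logits here, but it is rejected because its offset is `None`, so the best remaining in-context span wins.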

And we can apply our post-processing function to our raw predictions:

In [34]:
final_predictions = postprocess_qa_predictions(squad["validation"], squad_eval, raw_predictions.predictions)

Post-processing 10570 example predictions split into 10851 features.


  0%|          | 0/10570 [00:00<?, ?it/s]
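
The log reports more features than examples because one example can yield several features. A minimal sketch of the example-to-features map built inside the function, using made-up ids, looks like this:

```python
import collections

# Made-up data: example "q1" was split into two features, "q2" into one.
examples = [{"id": "q1"}, {"id": "q2"}]
features = [{"example_id": "q1"}, {"example_id": "q1"}, {"example_id": "q2"}]

example_id_to_index = {ex["id"]: i for i, ex in enumerate(examples)}
features_per_example = collections.defaultdict(list)
for i, feature in enumerate(features):
    features_per_example[example_id_to_index[feature["example_id"]]].append(i)

print(dict(features_per_example))
```

Each example index then points at every feature whose candidate answers must be pooled before the best one is chosen.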

Then we can load the metric from the datasets library.

In [35]:
metric = load_metric("squad")

Downloading builder script:   0%|          | 0.00/1.72k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/1.12k [00:00<?, ?B/s]

Then we can call `compute` on it. We just need to reformat the predictions and labels a bit, since it expects lists of dictionaries rather than one big dictionary.

In [36]:
formatted_predictions = [{"id": k, "prediction_text": v} for k, v in final_predictions.items()]
references = [{"id": ex["id"], "answers": ex["answers"]} for ex in squad["validation"]]
metric.compute(predictions=formatted_predictions, references=references)

{'exact_match': 81.90160832544939, 'f1': 89.121876471452}
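
To make the two reported numbers less opaque, here is a simplified sketch of how SQuAD-style exact match and token-level F1 are usually computed for one prediction. It is a re-implementation written for illustration, not the code behind the `squad` metric, and the example strings are made up.

```python
import collections
import re
import string

def normalize_answer(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, truth):
    return float(normalize_answer(prediction) == normalize_answer(truth))

def f1_score(prediction, truth):
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(truth).split()
    # Multiset intersection counts the tokens shared by both answers.
    common = collections.Counter(pred_tokens) & collections.Counter(truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

def score_example(prediction, truths):
    """Score against every reference answer and keep the best value of each metric."""
    return (max(exact_match(prediction, t) for t in truths),
            max(f1_score(prediction, t) for t in truths))

em, f1 = score_example("in Paris France", ["Paris", "Paris, France"])
print(em, f1)
```

The dataset-level scores are then the averages of these per-example values, expressed as percentages. Note that the article stripping in this sketch is English-specific.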

## Zero-Shot mBERT

We now measure the zero-shot performance on MLQA of the mBERT model fine-tuned on SQuAD.

In [37]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "salti/bert-base-multilingual-cased-finetuned-squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="bert-base-multilingual-cased-finetuned-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

loading configuration file https://huggingface.co/salti/bert-base-multilingual-cased-finetuned-squad/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/1df6572a9ae2fd1152d4fa4e3b9d30e0d303c69cb87d5b8401ef5cb032016bef.aa91fcc51e661ddbf70fda4906759b1d8178a512385633adf5d0db934ae1e333
Model config BertConfig {
  "_name_or_path": "salti/bert-base-multilingual-cased-finetuned-squad",
  "architectures": [
    "BertForQuestionAnswering"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per

Use the `map` function again to apply the preprocessing function, this time to the chosen split of every MLQA language configuration. Setting `batched=True` speeds up `map` by processing multiple elements of the dataset at once. Remove the columns you don’t need.

In [38]:
from collections import defaultdict

mlqa_prep = defaultdict(dict)

def map_datasets(langs, split, prepare_features):
    for lang in langs:
        mlqa_prep[lang][split] = mlqa[lang][split].map(prepare_features, batched=True, 
                                    remove_columns=mlqa[lang][split].column_names)
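
The nested `defaultdict(dict)` used for `mlqa_prep` can be sketched on its own. The configuration names and stored strings below are placeholders invented for illustration; in the notebook the stored values are the mapped datasets.

```python
from collections import defaultdict

prepared = defaultdict(dict)  # same shape as mlqa_prep above

for lang in ["lang_a", "lang_b"]:          # placeholder configuration names
    for split in ["validation", "test"]:
        # No need to create prepared[lang] first: defaultdict does it on first access.
        prepared[lang][split] = f"<features for {lang}/{split}>"

print(prepared["lang_a"]["test"])

# Caveat: merely looking up an unknown key also creates an empty entry.
_ = prepared["missing"]
print(sorted(prepared))
```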

In [39]:
split = "test"
map_datasets(langs_test, split, prepare_validation_features)

In [40]:
def compute_results(langs, split):
    results = {}
    for lang in langs:
        # We can grab the predictions for all features by using the `Trainer.predict` method
        raw_predictions = trainer.predict(mlqa_prep[lang][split])

        # The Trainer hides example_id and offset_mapping, which we need for post-processing, so we set them back
        mlqa_prep[lang][split].set_format(type=mlqa_prep[lang][split].format["type"], 
                        columns=list(mlqa_prep[lang][split].features.keys()))
        
        # And we can apply our post-processing function to our raw predictions
        final_predictions = postprocess_qa_predictions(mlqa[lang][split], mlqa_prep[lang][split], raw_predictions.predictions)

        # We just need to format predictions and labels a bit as it expects a list of dictionaries and not one big dictionary.
        formatted_predictions = [{"id": k, "prediction_text": v} for k, v in final_predictions.items()]
        references = [{"id": ex["id"], "answers": ex["answers"]} for ex in mlqa[lang][split]]
        results[lang] = metric.compute(predictions=formatted_predictions, references=references)
    return results
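
Since `compute_results` returns one `{'exact_match': ..., 'f1': ...}` dictionary per configuration, a single summary number can be handy. The helper below is a hypothetical addition, not part of the notebook, and the values in `toy_results` are placeholders rather than measured scores.

```python
def macro_average(results):
    """Average each metric over all configurations (hypothetical helper)."""
    n = len(results)
    return {
        name: sum(scores[name] for scores in results.values()) / n
        for name in ("exact_match", "f1")
    }

# Placeholder values for illustration only; these are not experimental results.
toy_results = {
    "config_a": {"exact_match": 50.0, "f1": 60.0},
    "config_b": {"exact_match": 40.0, "f1": 70.0},
}
summary = macro_average(toy_results)
print(summary)
```

An unweighted mean like this treats every configuration equally, regardless of how many examples it contains.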

In [41]:
results_zero_shot_mbert = compute_results(langs_test, split)
print(results_zero_shot_mbert)

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 7534
  Batch size = 16


Post-processing 5335 example predictions split into 7534 features.


  0%|          | 0/5335 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 2090
  Batch size = 16


Post-processing 1649 example predictions split into 2090 features.


  0%|          | 0/1649 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 2939
  Batch size = 16


Post-processing 2047 example predictions split into 2939 features.


  0%|          | 0/2047 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 2902
  Batch size = 16


Post-processing 1912 example predictions split into 2902 features.


  0%|          | 0/1912 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 7470
  Batch size = 16


Post-processing 5335 example predictions split into 7470 features.


  0%|          | 0/5335 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 2763
  Batch size = 16


Post-processing 1978 example predictions split into 2763 features.


  0%|          | 0/1978 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 2642
  Batch size = 16


Post-processing 1831 example predictions split into 2642 features.


  0%|          | 0/1831 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1916
  Batch size = 16


Post-processing 1649 example predictions split into 1916 features.


  0%|          | 0/1649 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 5242
  Batch size = 16


Post-processing 4517 example predictions split into 5242 features.


  0%|          | 0/4517 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1986
  Batch size = 16


Post-processing 1675 example predictions split into 1986 features.


  0%|          | 0/1675 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1960
  Batch size = 16


Post-processing 1621 example predictions split into 1960 features.


  0%|          | 0/1621 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 5232
  Batch size = 16


Post-processing 4517 example predictions split into 5232 features.


  0%|          | 0/4517 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 2046
  Batch size = 16


Post-processing 1776 example predictions split into 2046 features.


  0%|          | 0/1776 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1604
  Batch size = 16


Post-processing 1430 example predictions split into 1604 features.


  0%|          | 0/1430 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 2905
  Batch size = 16


Post-processing 2047 example predictions split into 2905 features.


  0%|          | 0/2047 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 2175
  Batch size = 16


Post-processing 1675 example predictions split into 2175 features.


  0%|          | 0/1675 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 7399
  Batch size = 16


Post-processing 5495 example predictions split into 7399 features.


  0%|          | 0/5495 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 2736
  Batch size = 16


Post-processing 1943 example predictions split into 2736 features.


  0%|          | 0/1943 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 7373
  Batch size = 16


Post-processing 5495 example predictions split into 7373 features.


  0%|          | 0/5495 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 2678
  Batch size = 16


Post-processing 2018 example predictions split into 2678 features.


  0%|          | 0/2018 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 2540
  Batch size = 16


Post-processing 1947 example predictions split into 2540 features.


  0%|          | 0/1947 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 2368
  Batch size = 16


Post-processing 1912 example predictions split into 2368 features.


  0%|          | 0/1912 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 2006
  Batch size = 16


Post-processing 1621 example predictions split into 2006 features.


  0%|          | 0/1621 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 2486
  Batch size = 16


Post-processing 1943 example predictions split into 2486 features.


  0%|          | 0/1943 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 6334
  Batch size = 16


Post-processing 5137 example predictions split into 6334 features.


  0%|          | 0/5137 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 6296
  Batch size = 16


Post-processing 5137 example predictions split into 6296 features.


  0%|          | 0/5137 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 2442
  Batch size = 16


Post-processing 1947 example predictions split into 2442 features.


  0%|          | 0/1947 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 2077
  Batch size = 16


Post-processing 1767 example predictions split into 2077 features.


  0%|          | 0/1767 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 6902
  Batch size = 16


Post-processing 5335 example predictions split into 6902 features.


  0%|          | 0/5335 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 5552
  Batch size = 16


Post-processing 4517 example predictions split into 5552 features.


  0%|          | 0/4517 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 7162
  Batch size = 16


Post-processing 5495 example predictions split into 7162 features.


  0%|          | 0/5495 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 6726
  Batch size = 16


Post-processing 5137 example predictions split into 6726 features.


  0%|          | 0/5137 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 14706
  Batch size = 16


Post-processing 11590 example predictions split into 14706 features.


  0%|          | 0/11590 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 6585
  Batch size = 16


Post-processing 5253 example predictions split into 6585 features.


  0%|          | 0/5253 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 6360
  Batch size = 16


Post-processing 4918 example predictions split into 6360 features.


  0%|          | 0/4918 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 2065
  Batch size = 16


Post-processing 1978 example predictions split into 2065 features.


  0%|          | 0/1978 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1833
  Batch size = 16


Post-processing 1776 example predictions split into 1833 features.


  0%|          | 0/1776 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 2107
  Batch size = 16


Post-processing 2018 example predictions split into 2107 features.


  0%|          | 0/2018 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 2022
  Batch size = 16


Post-processing 1947 example predictions split into 2022 features.


  0%|          | 0/1947 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 5451
  Batch size = 16


Post-processing 5253 example predictions split into 5451 features.


  0%|          | 0/5253 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 5456
  Batch size = 16


Post-processing 5253 example predictions split into 5456 features.


  0%|          | 0/5253 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1790
  Batch size = 16


Post-processing 1723 example predictions split into 1790 features.


  0%|          | 0/1723 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 2822
  Batch size = 16


Post-processing 1831 example predictions split into 2822 features.


  0%|          | 0/1831 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 2007
  Batch size = 16


Post-processing 1430 example predictions split into 2007 features.


  0%|          | 0/1430 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 3058
  Batch size = 16


Post-processing 1947 example predictions split into 3058 features.


  0%|          | 0/1947 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 2781
  Batch size = 16


Post-processing 1767 example predictions split into 2781 features.


  0%|          | 0/1767 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 7359
  Batch size = 16


Post-processing 4918 example predictions split into 7359 features.


  0%|          | 0/4918 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 2489
  Batch size = 16


Post-processing 1723 example predictions split into 2489 features.


  0%|          | 0/1723 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 7500
  Batch size = 16


Post-processing 4918 example predictions split into 7500 features.


  0%|          | 0/4918 [00:00<?, ?it/s]

{'ar.ar': {'exact_match': 28.02249297094658, 'f1': 44.929516784532346}, 'ar.de': {'exact_match': 30.503335354760463, 'f1': 46.34281596858772}, 'ar.vi': {'exact_match': 20.078163165608206, 'f1': 35.855170013530305}, 'ar.zh': {'exact_match': 21.286610878661087, 'f1': 36.836139618796864}, 'ar.en': {'exact_match': 33.68322399250234, 'f1': 51.05706234227977}, 'ar.es': {'exact_match': 28.665318503538927, 'f1': 45.39895834649171}, 'ar.hi': {'exact_match': 17.312943746586566, 'f1': 30.78874873874883}, 'de.ar': {'exact_match': 23.650697392359007, 'f1': 36.746358156228915}, 'de.de': {'exact_match': 43.74584901483286, 'f1': 59.39314267422581}, 'de.vi': {'exact_match': 29.611940298507463, 'f1': 43.59122326994722}, 'de.zh': {'exact_match': 30.66008636644047, 'f1': 46.46878322661527}, 'de.en': {'exact_match': 46.734558335178214, 'f1': 62.361949176983636}, 'de.es': {'exact_match': 40.990990990990994, 'f1': 56.37397727738263}, 'de.hi': {'exact_match': 21.53846153846154, 'f1': 34.05070095047693}, 'vi.a

In [42]:
from collections import defaultdict

import numpy as np
import pandas as pd


def results_df(results_dict, model):
    """Collect per-language-pair F1/EM scores into a DataFrame with a final 'avg' row."""
    f1_colname = "F1_" + model
    em_colname = "EM_" + model
    dict_results = defaultdict(list)
    # One row per language pair (keys such as "ar.de").
    for lang, scores in results_dict.items():
        dict_results["lang"].append(lang)
        dict_results[f1_colname].append(scores['f1'])
        dict_results[em_colname].append(scores['exact_match'])

    # Unweighted mean over language pairs, computed before the 'avg' row is appended.
    avg_f1 = np.average(dict_results[f1_colname])
    avg_em = np.average(dict_results[em_colname])
    dict_results["lang"].append('avg')
    dict_results[f1_colname].append(avg_f1)
    dict_results[em_colname].append(avg_em)
    df_results = pd.DataFrame(dict_results).round(2)
    return df_results
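
The `avg` row produced by `results_df` is an unweighted mean over language pairs, while the prediction logs show that the test sets differ considerably in size (from 1430 to 11590 examples). The sketch below contrasts the unweighted mean with an example-weighted mean. The helper name `macro_and_weighted_average` is hypothetical, and the scores and example counts are illustrative placeholders, not results from this notebook.

```python
def macro_and_weighted_average(scores, counts):
    """Return (unweighted mean, example-weighted mean) of per-pair scores.

    scores: dict mapping a language pair to a metric value (e.g. F1).
    counts: dict mapping the same pairs to their number of test examples.
    """
    macro = sum(scores.values()) / len(scores)
    total = sum(counts[pair] for pair in scores)
    weighted = sum(scores[pair] * counts[pair] for pair in scores) / total
    return macro, weighted


# Illustrative placeholder values (not taken from the experiments).
toy_scores = {"xx.xx": 40.0, "yy.yy": 60.0}
toy_counts = {"xx.xx": 3000, "yy.yy": 1000}

macro, weighted = macro_and_weighted_average(toy_scores, toy_counts)
print(f"unweighted: {macro:.2f}, example-weighted: {weighted:.2f}")
```

The unweighted mean treats every language pair equally, which is usually what one wants when comparing cross-lingual transfer; the weighted variant is only shown to make the difference explicit.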

In [43]:
df_results_zero_shot_mbert = results_df(results_zero_shot_mbert, "Zero-shot mBERT")
df_results_zero_shot_mbert.to_csv("results/mlqa/results_zero_shot_mbert.csv")
df_results_zero_shot_mbert

Unnamed: 0,lang,F1_Zero-shot mBERT,EM_Zero-shot mBERT
0,ar.ar,44.93,28.02
1,ar.de,46.34,30.5
2,ar.vi,35.86,20.08
3,ar.zh,36.84,21.29
4,ar.en,51.06,33.68
5,ar.es,45.4,28.67
6,ar.hi,30.79,17.31
7,de.ar,36.75,23.65
8,de.de,59.39,43.75
9,de.vi,43.59,29.61
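
The keys of the results dictionary are language pairs such as `ar.de`, so the flat table can also be read as a matrix. The following is a minimal sketch of that pivot using only the standard library. It assumes the first code is the context language and the second the question language (MLQA's usual naming, treated here as an assumption), and `pivot_scores` is a hypothetical helper that is not part of the notebook. The sample values are the rounded scores from the table above.

```python
from collections import defaultdict


def pivot_scores(results, metric="f1"):
    """Turn {'ar.de': {'f1': ..., 'exact_match': ...}} into {'ar': {'de': ...}}."""
    matrix = defaultdict(dict)
    for pair, scores in results.items():
        # Assumption: "<first>.<second>" = "<context lang>.<question lang>".
        first, second = pair.split(".")
        matrix[first][second] = round(scores[metric], 2)
    return {lang: dict(row) for lang, row in matrix.items()}


# Four of the zero-shot mBERT entries shown above (rounded).
sample = {
    "ar.ar": {"exact_match": 28.02, "f1": 44.93},
    "ar.de": {"exact_match": 30.50, "f1": 46.34},
    "de.ar": {"exact_match": 23.65, "f1": 36.75},
    "de.de": {"exact_match": 43.75, "f1": 59.39},
}

f1_matrix = pivot_scores(sample, metric="f1")
for first_lang, row in f1_matrix.items():
    print(first_lang, row)
```

Laid out this way, the diagonal holds the monolingual settings (`ar.ar`, `de.de`) and the off-diagonal cells hold the cross-lingual ones.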


## Zero-Shot XLM-R

Zero-shot performance on the MLQA test sets of an XLM-R model fine-tuned on SQuAD.

In [44]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

# XLM-R checkpoint already fine-tuned on SQuAD; from_tf=True converts TensorFlow weights on load.
model_name = "vanichandna/xlm-roberta-finetuned-squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name, from_tf=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

# No training happens in this section: the Trainer is used only to run predictions,
# so of these arguments only the evaluation batch size affects the results below.
batch_size = 16
training_args = TrainingArguments(
    output_dir="xlm-roberta-finetuned-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

https://huggingface.co/vanichandna/xlm-roberta-finetuned-squad/resolve/main/config.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpevl72ctz


Downloading:   0%|          | 0.00/688 [00:00<?, ?B/s]

storing https://huggingface.co/vanichandna/xlm-roberta-finetuned-squad/resolve/main/config.json in cache at /root/.cache/huggingface/transformers/6e9fab04c4168068e4162152496d68d0594b8a838953657e9234b70b2c4932fb.b04b4828cf6cfcbbcba34b7e4fc29fe9a0563001f3cedda0f9cf6487875eae92
creating metadata file for /root/.cache/huggingface/transformers/6e9fab04c4168068e4162152496d68d0594b8a838953657e9234b70b2c4932fb.b04b4828cf6cfcbbcba34b7e4fc29fe9a0563001f3cedda0f9cf6487875eae92
loading configuration file https://huggingface.co/vanichandna/xlm-roberta-finetuned-squad/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/6e9fab04c4168068e4162152496d68d0594b8a838953657e9234b70b2c4932fb.b04b4828cf6cfcbbcba34b7e4fc29fe9a0563001f3cedda0f9cf6487875eae92
Model config XLMRobertaConfig {
  "_name_or_path": "vanichandna/xlm-roberta-finetuned-squad",
  "architectures": [
    "XLMRobertaForQuestionAnswering"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_d

Downloading:   0%|          | 0.00/1.03G [00:00<?, ?B/s]

storing https://huggingface.co/vanichandna/xlm-roberta-finetuned-squad/resolve/main/tf_model.h5 in cache at /root/.cache/huggingface/transformers/40231dec3bc81ec27bf1027247098bd684738dffa5f2296b2cad573879d0c229.fdcf91f65f36389325caf97398f95d17232f926dca8f8ecac72019c5ca36ba9d.h5
creating metadata file for /root/.cache/huggingface/transformers/40231dec3bc81ec27bf1027247098bd684738dffa5f2296b2cad573879d0c229.fdcf91f65f36389325caf97398f95d17232f926dca8f8ecac72019c5ca36ba9d.h5
loading weights file https://huggingface.co/vanichandna/xlm-roberta-finetuned-squad/resolve/main/tf_model.h5 from cache at /root/.cache/huggingface/transformers/40231dec3bc81ec27bf1027247098bd684738dffa5f2296b2cad573879d0c229.fdcf91f65f36389325caf97398f95d17232f926dca8f8ecac72019c5ca36ba9d.h5
Loading TensorFlow weights from /root/.cache/huggingface/transformers/40231dec3bc81ec27bf1027247098bd684738dffa5f2296b2cad573879d0c229.fdcf91f65f36389325caf97398f95d17232f926dca8f8ecac72019c5ca36ba9d.h5
All TF 2.0 model weights w

Downloading:   0%|          | 0.00/398 [00:00<?, ?B/s]

storing https://huggingface.co/vanichandna/xlm-roberta-finetuned-squad/resolve/main/tokenizer_config.json in cache at /root/.cache/huggingface/transformers/14a0caad62ab9246eaaf3d48369dcfa9230e8c7a75c21a5a5b984e9b681a780b.b36482fbec4a714d3cfec99e0b05f4fdeec9e759090a78aed5597583a8b4783d
creating metadata file for /root/.cache/huggingface/transformers/14a0caad62ab9246eaaf3d48369dcfa9230e8c7a75c21a5a5b984e9b681a780b.b36482fbec4a714d3cfec99e0b05f4fdeec9e759090a78aed5597583a8b4783d
https://huggingface.co/vanichandna/xlm-roberta-finetuned-squad/resolve/main/tokenizer.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpkpeu_mhg


Downloading:   0%|          | 0.00/16.3M [00:00<?, ?B/s]

storing https://huggingface.co/vanichandna/xlm-roberta-finetuned-squad/resolve/main/tokenizer.json in cache at /root/.cache/huggingface/transformers/7f7f92d02ee98c92db3e0ef5445a40b8301c4206d4df0adc2bbb55b109b17dc3.d9915425c8f75472760a4bb40b4d5d0b3ab357a1b00aaf7917cda68403cbc936
creating metadata file for /root/.cache/huggingface/transformers/7f7f92d02ee98c92db3e0ef5445a40b8301c4206d4df0adc2bbb55b109b17dc3.d9915425c8f75472760a4bb40b4d5d0b3ab357a1b00aaf7917cda68403cbc936
https://huggingface.co/vanichandna/xlm-roberta-finetuned-squad/resolve/main/special_tokens_map.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpx65rprkb


Downloading:   0%|          | 0.00/239 [00:00<?, ?B/s]

storing https://huggingface.co/vanichandna/xlm-roberta-finetuned-squad/resolve/main/special_tokens_map.json in cache at /root/.cache/huggingface/transformers/21e94506b48b2fcd8bf2ffb7d2216e68cb34dd50b493e0cdbffbb6c97efd961b.a11ebb04664c067c8fe5ef8f8068b0f721263414a26058692f7b2e4ba2a1b342
creating metadata file for /root/.cache/huggingface/transformers/21e94506b48b2fcd8bf2ffb7d2216e68cb34dd50b493e0cdbffbb6c97efd961b.a11ebb04664c067c8fe5ef8f8068b0f721263414a26058692f7b2e4ba2a1b342
loading file https://huggingface.co/vanichandna/xlm-roberta-finetuned-squad/resolve/main/sentencepiece.bpe.model from cache at None
loading file https://huggingface.co/vanichandna/xlm-roberta-finetuned-squad/resolve/main/tokenizer.json from cache at /root/.cache/huggingface/transformers/7f7f92d02ee98c92db3e0ef5445a40b8301c4206d4df0adc2bbb55b109b17dc3.d9915425c8f75472760a4bb40b4d5d0b3ab357a1b00aaf7917cda68403cbc936
loading file https://huggingface.co/vanichandna/xlm-roberta-finetuned-squad/resolve/main/added_toke

In [45]:
split = "test"
map_datasets(langs_test, split, prepare_validation_features)

In [None]:
results_zero_shot_xlm_r = compute_results(langs_test, split)
print(results_zero_shot_xlm_r)

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 6810
  Batch size = 16


Post-processing 5335 example predictions split into 6810 features.


  0%|          | 0/5335 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1962
  Batch size = 16


Post-processing 1649 example predictions split into 1962 features.


  0%|          | 0/1649 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 2635
  Batch size = 16


Post-processing 2047 example predictions split into 2635 features.


  0%|          | 0/2047 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 2595
  Batch size = 16


Post-processing 1912 example predictions split into 2595 features.


  0%|          | 0/1912 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 6793
  Batch size = 16


Post-processing 5335 example predictions split into 6793 features.


  0%|          | 0/5335 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 2492
  Batch size = 16


Post-processing 1978 example predictions split into 2492 features.


  0%|          | 0/1978 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 2365
  Batch size = 16


Post-processing 1831 example predictions split into 2365 features.


  0%|          | 0/1831 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1918
  Batch size = 16


Post-processing 1649 example predictions split into 1918 features.


  0%|          | 0/1649 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 5278
  Batch size = 16


Post-processing 4517 example predictions split into 5278 features.


  0%|          | 0/4517 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 2008
  Batch size = 16


Post-processing 1675 example predictions split into 2008 features.


  0%|          | 0/1675 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1966
  Batch size = 16


Post-processing 1621 example predictions split into 1966 features.


  0%|          | 0/1621 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 5274
  Batch size = 16


Post-processing 4517 example predictions split into 5274 features.


  0%|          | 0/4517 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 2066
  Batch size = 16


Post-processing 1776 example predictions split into 2066 features.


  0%|          | 0/1776 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1604
  Batch size = 16


Post-processing 1430 example predictions split into 1604 features.


  0%|          | 0/1430 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 2862
  Batch size = 16


Post-processing 2047 example predictions split into 2862 features.


  0%|          | 0/2047 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 2166
  Batch size = 16


Post-processing 1675 example predictions split into 2166 features.


  0%|          | 0/1675 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 7335
  Batch size = 16


Post-processing 5495 example predictions split into 7335 features.


  0%|          | 0/5495 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 2690
  Batch size = 16


Post-processing 1943 example predictions split into 2690 features.


  0%|          | 0/1943 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 7311
  Batch size = 16


Post-processing 5495 example predictions split into 7311 features.


  0%|          | 0/5495 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 2651
  Batch size = 16


Post-processing 2018 example predictions split into 2651 features.


  0%|          | 0/2018 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 2504
  Batch size = 16


Post-processing 1947 example predictions split into 2504 features.


  0%|          | 0/1947 [00:00<?, ?it/s]

The following columns in the test set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping. If example_id, offset_mapping are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 2120
  Batch size = 16


Post-processing 1912 example predictions split into 2120 features.


***** Running Prediction *****
  Num examples = 1829
  Batch size = 16


Post-processing 1621 example predictions split into 1829 features.


***** Running Prediction *****
  Num examples = 2210
  Batch size = 16


Post-processing 1943 example predictions split into 2210 features.


***** Running Prediction *****
  Num examples = 5724
  Batch size = 16


Post-processing 5137 example predictions split into 5724 features.


***** Running Prediction *****
  Num examples = 5724
  Batch size = 16


Post-processing 5137 example predictions split into 5724 features.


***** Running Prediction *****
  Num examples = 2212
  Batch size = 16


Post-processing 1947 example predictions split into 2212 features.


***** Running Prediction *****
  Num examples = 1900
  Batch size = 16


Post-processing 1767 example predictions split into 1900 features.


***** Running Prediction *****
  Num examples = 7119
  Batch size = 16


Post-processing 5335 example predictions split into 7119 features.


***** Running Prediction *****
  Num examples = 5759
  Batch size = 16


Post-processing 4517 example predictions split into 5759 features.


***** Running Prediction *****
  Num examples = 7457
  Batch size = 16


Post-processing 5495 example predictions split into 7457 features.


***** Running Prediction *****
  Num examples = 6958
  Batch size = 16


Post-processing 5137 example predictions split into 6958 features.


***** Running Prediction *****
  Num examples = 15269
  Batch size = 16


Post-processing 11590 example predictions split into 15269 features.


***** Running Prediction *****
  Num examples = 6840
  Batch size = 16


Post-processing 5253 example predictions split into 6840 features.


***** Running Prediction *****
  Num examples = 6517
  Batch size = 16


Post-processing 4918 example predictions split into 6517 features.


***** Running Prediction *****
  Num examples = 2061
  Batch size = 16


Post-processing 1978 example predictions split into 2061 features.


***** Running Prediction *****
  Num examples = 1837
  Batch size = 16


Post-processing 1776 example predictions split into 1837 features.


***** Running Prediction *****
  Num examples = 2107
  Batch size = 16


Post-processing 2018 example predictions split into 2107 features.


***** Running Prediction *****
  Num examples = 2021
  Batch size = 16


Post-processing 1947 example predictions split into 2021 features.


***** Running Prediction *****
  Num examples = 5454
  Batch size = 16


Post-processing 5253 example predictions split into 5454 features.


***** Running Prediction *****
  Num examples = 5457
  Batch size = 16


Post-processing 5253 example predictions split into 5457 features.


***** Running Prediction *****
  Num examples = 1783
  Batch size = 16


Post-processing 1723 example predictions split into 1783 features.


***** Running Prediction *****
  Num examples = 2380
  Batch size = 16


Post-processing 1831 example predictions split into 2380 features.


***** Running Prediction *****
  Num examples = 1760
  Batch size = 16


Post-processing 1430 example predictions split into 1760 features.


***** Running Prediction *****
  Num examples = 2592
  Batch size = 16


In [None]:
df_results_zero_shot_xlm_r = results_df(results_zero_shot_xlm_r, "Zero-shot XLM-R")
df_results_zero_shot_xlm_r.to_csv("results/mlqa/results_zero_shot_xlm_r.csv")
df_results_zero_shot_xlm_r

## Zero-Shot XLM-R-large

Zero-Shot performance of the XLM-R-large model fine-tuned on SQuAD.

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "Palak/xlm-roberta-large_squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="xlm-roberta-large_squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
split = "test"
map_datasets(langs_test, split, prepare_validation_features)

In [None]:
results_zero_shot_xlm_r_large = compute_results(langs_test, split)
print(results_zero_shot_xlm_r_large)

In [None]:
df_results_zero_shot_xlm_r_large = results_df(results_zero_shot_xlm_r_large, "Zero-shot XLM-R Large")
df_results_zero_shot_xlm_r_large.to_csv("results/mlqa/results_zero_shot_xlm_r_large.csv")
df_results_zero_shot_xlm_r_large

## Translate Test BERT

We use the model fine-tuned on SQuAD and evaluate it on test data that we translated from the target language to English. In this case we use a monolingual BERT model, because we only evaluate on English data.
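As a rough illustration of the translate-test data flow described above, the sketch below translates the question and context of one example into English before it would be handed to an English QA model. The function `translate_test_example`, the `translate` callable, and the toy German-to-English lookup table are hypothetical and only stand in for a real MT system. The notebook itself evaluates on test data that was translated beforehand.

```python
from typing import Callable, Dict


def translate_test_example(
    example: Dict[str, str],
    translate: Callable[[str], str],
) -> Dict[str, str]:
    """Return a copy of `example` with question and context translated.

    `translate` is a placeholder for any MT system (hypothetical here).
    All other fields, such as the example id, are kept unchanged.
    """
    return {
        **example,
        "question": translate(example["question"]),
        "context": translate(example["context"]),
    }


# Toy stand-in for an MT system: a fixed German -> English lookup table.
toy_table = {
    "Wo steht der Turm?": "Where is the tower?",
    "Der Turm steht in Paris.": "The tower is in Paris.",
}

example_de = {
    "id": "0",
    "question": "Wo steht der Turm?",
    "context": "Der Turm steht in Paris.",
}
example_en = translate_test_example(example_de, lambda text: toy_table[text])
print(example_en)
```

Because the model only ever sees English text in this setting, a monolingual English model is sufficient.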

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "rsvp-ai/bertserini-bert-base-squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="bertserini-bert-base-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
split = "test"
map_datasets(langs_translate_test, split, prepare_validation_features)

In [None]:
results_translate_test_bert = compute_results(langs_translate_test, split)
print(results_translate_test_bert)

In [None]:
df_results_translate_test_bert = results_df(results_translate_test_bert, "Translate-test BERT")
df_results_translate_test_bert.to_csv("results/mlqa/results_translate_test_bert.csv")
df_results_translate_test_bert

## Translate Test BERT-large

We use the model fine-tuned on SQuAD and evaluate it on test data that we translated from the target language to English. In this case we use a monolingual BERT-large model, because we only evaluate on English data.

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "bert-large-cased-whole-word-masking-finetuned-squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="dir-bert-large-cased-whole-word-masking-finetuned-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
split = "test"
map_datasets(langs_translate_test, split, prepare_validation_features)

In [None]:
results_translate_test_bert_large = compute_results(langs_translate_test, split)
print(results_translate_test_bert_large)

In [None]:
df_results_translate_test_bert_large = results_df(results_translate_test_bert_large, "Translate-test BERT Large")
df_results_translate_test_bert_large.to_csv("results/mlqa/results_translate_test_bert_large.csv")
df_results_translate_test_bert_large

## Translate Test RoBERTa

We use the model fine-tuned on SQuAD and evaluate it on test data that we translated from the target language to English. In this case we use a monolingual RoBERTa model, because we only evaluate on English data.

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "thatdramebaazguy/roberta-base-squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="roberta-base-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
split = "test"
map_datasets(langs_translate_test, split, prepare_validation_features)

In [None]:
results_translate_test_roberta = compute_results(langs_translate_test, split)
print(results_translate_test_roberta)

In [None]:
df_results_translate_test_roberta = results_df(results_translate_test_roberta, "Translate-test RoBERTa")
df_results_translate_test_roberta.to_csv("results/mlqa/results_translate_test_roberta.csv")
df_results_translate_test_roberta

## Translate Test RoBERTa-large

We use the model fine-tuned on SQuAD and evaluate it on test data that we translated from the target language to English. In this case we use a monolingual RoBERTa-large model, because we only evaluate on English data.

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "csarron/roberta-large-squad-v1"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="roberta-large-squad-v1",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
split = "test"
map_datasets(langs_translate_test, split, prepare_validation_features)

In [None]:
results_translate_test_roberta_large = compute_results(langs_translate_test, split)
print(results_translate_test_roberta_large)

In [None]:
df_results_translate_test_roberta_large = results_df(results_translate_test_roberta_large, "Translate-test RoBERTa Large")
df_results_translate_test_roberta_large.to_csv("results/mlqa/results_translate_test_roberta_large.csv")
df_results_translate_test_roberta_large

## Translate Test mBERT

We use the model fine-tuned on SQuAD and evaluate it on test data that we translated from the target language to English. In this case we use mBERT, but we could also use a monolingual model, because we only evaluate on English data.

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "salti/bert-base-multilingual-cased-finetuned-squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="bert-base-multilingual-cased-finetuned-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
split = "test"
map_datasets(langs_translate_test, split, prepare_validation_features)

In [None]:
results_translate_test_mbert = compute_results(langs_translate_test, split)
print(results_translate_test_mbert)

In [None]:
df_results_translate_test_mbert = results_df(results_translate_test_mbert, "Translate-test mBERT")
df_results_translate_test_mbert.to_csv("results/mlqa/results_translate_test_mbert.csv")
df_results_translate_test_mbert

## Translate Test XLM-R

We use the model fine-tuned on SQuAD and evaluate it on test data that we translated from the target language to English. In this case we use XLM-R, but we could also use a monolingual model, because we only evaluate on English data.

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "vanichandna/xlm-roberta-finetuned-squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name, from_tf=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="xlm-roberta-finetuned-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
split = "test"
map_datasets(langs_translate_test, split, prepare_validation_features)

In [None]:
results_translate_test_xlm_r = compute_results(langs_translate_test, split)
print(results_translate_test_xlm_r)

In [None]:
df_results_translate_test_xlm_r = results_df(results_translate_test_xlm_r, "Translate-test XLM-R")
df_results_translate_test_xlm_r.to_csv("results/mlqa/results_translate_test_xlm_r.csv")
df_results_translate_test_xlm_r

## Translate Test XLM-R-large

We use the model fine-tuned on SQuAD and evaluate it on test data that we translated from the target language to English. In this case we use XLM-R-large, but we could also use a monolingual model, because we only evaluate on English data.

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "Palak/xlm-roberta-large_squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="xlm-roberta-large_squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
split = "test"
map_datasets(langs_translate_test, split, prepare_validation_features)

In [None]:
results_translate_test_xlm_r_large = compute_results(langs_translate_test, split)
print(results_translate_test_xlm_r_large)

In [None]:
df_results_translate_test_xlm_r_large = results_df(results_translate_test_xlm_r_large, "Translate-test XLM-R Large")
df_results_translate_test_xlm_r_large.to_csv("results/mlqa/results_translate_test_xlm_r_large.csv")
df_results_translate_test_xlm_r_large

## Translate Train Es mBERT

For many language pairs, an MT model is available and can be used to obtain training data in the target language. To evaluate the impact of such data, the English training data is translated into the target language with an MT system, and mBERT is then fine-tuned on the translated data. For QA tasks, the answer spans must be aligned between the source and the target language. To save time, we use data that was already translated.
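To make the answer-span alignment step more concrete, here is a minimal sketch that recovers a SQuAD-style `answer_start` by searching for the translated answer inside the translated context. The function name `realign_answer` is hypothetical, and exact string search is a simplifying assumption. It is not necessarily how the pre-translated data used in this notebook was aligned.

```python
from typing import Any, Dict, Optional


def realign_answer(
    translated_context: str,
    translated_answer: str,
) -> Optional[Dict[str, Any]]:
    """Locate the translated answer inside the translated context.

    Returns a SQuAD-style answers dict, or None when the answer is empty
    or does not occur verbatim in the context (such examples would
    typically be dropped from the training data).
    """
    if not translated_answer:
        return None
    start = translated_context.find(translated_answer)
    if start == -1:
        return None
    return {"text": [translated_answer], "answer_start": [start]}


context = "Der Turm steht in Paris."
found = realign_answer(context, "Paris")
missing = realign_answer(context, "London")
print(found, missing)
```

Translation can change inflection or word order, so a verbatim search may fail more often than a dedicated alignment method would.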

Now that our data is ready for training, we can download the pretrained model and fine-tune it. Since our task is question answering, we use the `AutoModelForQuestionAnswering` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us:

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "bert-base-multilingual-cased"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()  # used by the Trainer below

batch_size = 16

training_args = TrainingArguments(
    output_dir="bert-base-multilingual-cased-squad-es",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True
)

Use the 🤗 Datasets `map` function to apply the preprocessing function over the entire dataset. Setting `batched=True` speeds up `map` by processing multiple elements of the dataset at once. Remove the columns you don’t need.
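The sketch below illustrates the shape of a batched preprocessing function in plain Python: it receives a dict of columns and returns a dict of columns. The function `split_into_windows` and its whitespace tokenization are hypothetical simplifications. The notebook's `prepare_train_features` presumably does the equivalent with the model tokenizer, which is an assumption here. With 🤗 Datasets, such a function would be applied as `dataset.map(split_into_windows, batched=True, remove_columns=dataset.column_names)`.

```python
from typing import Any, Dict, List


def split_into_windows(
    batch: Dict[str, List[str]],
    max_len: int = 4,
    stride: int = 2,
) -> Dict[str, List[Any]]:
    """Batched preprocessing: dict of columns in, dict of columns out.

    Each context is split into overlapping windows of at most `max_len`
    whitespace tokens, with `stride` tokens shared between neighbouring
    windows. One input example can therefore yield several output rows.
    """
    if stride >= max_len:
        raise ValueError("stride must be smaller than max_len")
    out: Dict[str, List[Any]] = {"example_id": [], "window": []}
    step = max_len - stride
    for example_id, context in zip(batch["id"], batch["context"]):
        tokens = context.split()
        start = 0
        while True:
            out["example_id"].append(example_id)
            out["window"].append(tokens[start:start + max_len])
            if start + max_len >= len(tokens):
                break
            start += step
    return out


batch = {
    "id": ["a", "b"],
    "context": ["one two three four five six", "short text"],
}
features = split_into_windows(batch)
print(features)
```

Because the output can have more rows than the input, the original columns no longer line up with the new ones. Removing them is then a requirement, not just tidiness.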

In [None]:
lang = ["translate-train.es"]

split = "train"
map_datasets(lang, split, prepare_train_features)

split = "validation"
map_datasets(lang, split, prepare_train_features)

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=mlqa_prep["translate-train.es"]["train"],
    eval_dataset=mlqa_prep["translate-train.es"]["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
# trainer.train()

In [None]:
split = "test"
map_datasets(langs_test, split, prepare_validation_features)

In [None]:
split = "test"
results_translate_train_es_mbert = compute_results(langs_test, split)
print(results_translate_train_es_mbert)

In [None]:
df_results_translate_train_es_mbert = results_df(results_translate_train_es_mbert, "Translate-train es mBERT")
df_results_translate_train_es_mbert.to_csv("results/mlqa/results_translate_train_es_mbert.csv")
df_results_translate_train_es_mbert

In [None]:
trainer.push_to_hub()

## Translate Train Es XLM-R

For many language pairs, an MT model is available and can be used to obtain training data in the target language. To evaluate the impact of such data, the English training data is translated into the target language with an MT system. To save time, we use an XLM-R model that has already been fine-tuned.

Because the model is already fine-tuned, we only need to download it. Since our task is question answering, we use the `AutoModelForQuestionAnswering` class. As with the tokenizer, the `from_pretrained` method will download and cache the model for us:

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "saattrupdan/xlmr-base-texas-squad-es"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="xlmr-base-texas-squad-es",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
split = "test"
map_datasets(langs_test, split, prepare_validation_features)

In [None]:
split = "test"
results_translate_train_es_xlm_r = compute_results(langs_test, split)
print(results_translate_train_es_xlm_r)

In [None]:
df_results_translate_train_es_xlm_r = results_df(results_translate_train_es_xlm_r, "Translate-train es XLM-R")
df_results_translate_train_es_xlm_r.to_csv("results/mlqa/results_translate_train_es_xlm_r.csv")
df_results_translate_train_es_xlm_r

## Translate Train De XLM-R

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "saattrupdan/xlmr-base-texas-squad-de"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="xlmr-base-texas-squad-de",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
split = "test"
map_datasets(langs_test, split, prepare_validation_features)

In [None]:
split = "test"
results_translate_train_de_xlm_r = compute_results(langs_test, split)
print(results_translate_train_de_xlm_r)

In [None]:
df_results_translate_train_de_xlm_r = results_df(results_translate_train_de_xlm_r, "Translate-train de XLM-R")
df_results_translate_train_de_xlm_r.to_csv("results/mlqa/results_translate_train_de_xlm_r.csv")
df_results_translate_train_de_xlm_r

## Translate Train All mBERT

We also experiment with a multi-task version of the translate-train setting where we fine-tune mBERT on the combined translated training data of all languages jointly.

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "bert-base-multilingual-cased"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()  # used by the Trainer below

batch_size = 16

training_args = TrainingArguments(
    output_dir="bert-base-multilingual-cased-squad-all",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True
)

In [None]:
langs = ["ar", "de", "zh", "vi", "es", "hi", "el", "th", "tr", "ru"]

split = "translate_train"
map_datasets(langs, split, prepare_train_features)

split = "translate_dev"
map_datasets(langs, split, prepare_train_features)

In [None]:
from datasets import DatasetDict, concatenate_datasets

xquad_merged = DatasetDict()
xquad_merged["translate_train"] = squad_train["train"]
xquad_merged["translate_dev"] = squad_train["validation"]

for lang in langs:
    for split in ["translate_train", "translate_dev"]:
        xquad_merged[split] = concatenate_datasets([xquad_merged[split], xquad_prep[lang][split]])

In [None]:
xquad_merged

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=xquad_merged["translate_train"],
    eval_dataset=xquad_merged["translate_dev"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
# trainer.train()

In [None]:
langs = ["ar", "de", "zh", "vi", "en", "es", "hi", "el", "th", "tr", "ru", "ro"]
split = "test"

map_datasets(langs, split, prepare_validation_features)

In [None]:
langs = ["ar", "de", "zh", "vi", "en", "es", "hi", "el", "th", "tr", "ru", "ro"]
split = "test"
results_translate_train_all_mbert = compute_results(langs, split)
print(results_translate_train_all_mbert)

In [None]:
df_results_translate_train_all_mbert = results_df(results_translate_train_all_mbert, "TTr_all_mbert")
df_results_translate_train_all_mbert.to_csv("results/mlqa/results_translate_train_all_mbert.csv")
df_results_translate_train_all_mbert

In [None]:
trainer.push_to_hub()

## Fine-tuning XQuAD mBERT

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "alon-albalak/bert-base-multilingual-xquad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="bert-base-multilingual-xquad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
langs = ["ar", "de", "zh", "vi", "en", "es", "hi", "el", "th", "tr", "ru", "ro"]
split = "test"

map_datasets(langs, split, prepare_validation_features)

In [None]:
results_fine_tuning_xquad_mbert = compute_results(langs, split)
print(results_fine_tuning_xquad_mbert)

In [None]:
df_results_fine_tuning_xquad_mbert = results_df(results_fine_tuning_xquad_mbert, "FT_xquad_mbert")
df_results_fine_tuning_xquad_mbert.to_csv("results/mlqa/results_fine_tuning_xquad_mbert.csv")
df_results_fine_tuning_xquad_mbert

## Fine-tuning XQuAD XLM-R

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "alon-albalak/xlm-roberta-base-xquad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="xlm-roberta-base-xquad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
langs = ["ar", "de", "zh", "vi", "en", "es", "hi", "el", "th", "tr", "ru", "ro"]
split = "test"

map_datasets(langs, split, prepare_validation_features)

In [None]:
results_fine_tuning_xquad_xlm_r = compute_results(langs, split)
print(results_fine_tuning_xquad_xlm_r)

In [None]:
df_results_fine_tuning_xquad_xlm_r = results_df(results_fine_tuning_xquad_xlm_r, "FT_xquad_xlm_r")
df_results_fine_tuning_xquad_xlm_r.to_csv("results/mlqa/results_fine_tuning_xquad_xlm_r.csv")
df_results_fine_tuning_xquad_xlm_r

## Fine-tuning XQuAD XLM-R-large

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "alon-albalak/xlm-roberta-large-xquad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="xlm-roberta-large-xquad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
langs = ["ar", "de", "zh", "vi", "en", "es", "hi", "el", "th", "tr", "ru", "ro"]
split = "test"

map_datasets(langs, split, prepare_validation_features)

In [None]:
results_fine_tuning_xquad_xlm_r_large = compute_results(langs, split)
print(results_fine_tuning_xquad_xlm_r_large)

In [None]:
df_results_fine_tuning_xquad_xlm_r_large = results_df(results_fine_tuning_xquad_xlm_r_large, "FT_xquad_xlm_r_large")
df_results_fine_tuning_xquad_xlm_r_large.to_csv("results/mlqa/results_fine_tuning_xquad_xlm_r_large.csv")
df_results_fine_tuning_xquad_xlm_r_large

## Data Augmentation mBERT

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "mrm8488/bert-multi-cased-finetuned-xquadv1"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="bert-multi-cased-finetuned-xquadv1",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
langs = ["ar", "de", "zh", "vi", "en", "es", "hi", "el", "th", "tr", "ru", "ro"]
split = "test"

map_datasets(langs, split, prepare_validation_features)

In [None]:
results_data_augmentation_mbert = compute_results(langs, split)
print(results_data_augmentation_mbert)

In [None]:
df_results_data_augmentation_mbert = results_df(results_data_augmentation_mbert, "data_augm_mbert")
df_results_data_augmentation_mbert.to_csv("results/mlqa/results_data_augmentation_mbert.csv")
df_results_data_augmentation_mbert

## Baselines

The MLQA [paper](http://arxiv.org/abs/1910.07475) presents several baselines for zero-shot experiments on MLQA, with training QA data taken from SQuAD V1.1, and using the MLQA English development set for early stopping.
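As a rough illustration of what early stopping on a development set means, the sketch below picks the epoch with the best dev score and stops once the score has not improved for a fixed number of epochs. The function name `best_epoch_with_patience`, the `patience` parameter, and the score values are all hypothetical; the paper's exact stopping criterion is not reproduced here.

```python
def best_epoch_with_patience(dev_scores, patience=2):
    """Return (best_epoch, stopped_epoch) for a list of per-epoch dev scores.

    Training is considered stopped once `patience` consecutive epochs
    fail to improve on the best score seen so far (an assumed criterion).
    """
    best_score = float("-inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, score in enumerate(dev_scores):
        if score > best_score:
            best_score, best_epoch = score, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    # Ran out of epochs before the patience budget was exhausted
    return best_epoch, len(dev_scores) - 1


# Made-up dev F1 values, for illustration only (not results from this notebook)
dev_f1 = [70.1, 74.3, 75.0, 74.8, 74.9, 76.0]
best, stopped = best_epoch_with_patience(dev_f1, patience=2)
print(f"best epoch: {best}, stopped at epoch: {stopped}")
```

With a checkpoint saved per epoch, the checkpoint from `best` would be the one kept for evaluation on the target languages.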

The table below shows F1 scores for zero-shot transfer: models are trained on English questions and documents and evaluated on target-language questions and documents (see the paper for further details). There is still considerable room for improvement, and we hope the community will engage in this QA challenge.

| Model (F1) | en | es | de | ar | hi | vi | zh |
|:--- |:---: |:---: |:---: |:---: |:---: |:---: |:---: |
| BERT-Large | **80.2** | - | - | - | - | - | - |
| Multilingual-BERT | 77.7 | 64.3 | 57.9 | 45.7 | 43.8 | 57.1 | 57.5 |
| XLM | 74.9 | **68.0** | **62.2** | **54.8** | 48.8 | 61.4 | 61.1 |
| Translate-test BERT-L | - | 65.4 | 57.9 | 33.6 | 23.8 | 58.2 | 44.2 |
| Translate-train M-BERT | - | 53.9 | 62.0 | 51.8 | **55.0** | **62.0** | **61.4** |
| Translate-train XLM | - | 65.2 | 61.4 | 54.0 | 50.7 | 59.3 | 59.8 |
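
For readers unfamiliar with the metrics, here is a minimal sketch of how token-level F1 and exact match (EM) are typically computed for extractive QA. It is a simplification under stated assumptions: answers are only lowercased and split on whitespace, whereas the official SQuAD and MLQA evaluation scripts apply further normalization (punctuation, articles, language-specific handling), and whitespace splitting is not appropriate for languages such as Chinese. The function names `exact_match` and `token_f1` are illustrative, not taken from this notebook.

```python
from collections import Counter


def exact_match(prediction, reference):
    """1.0 if the answers match after trimming and lowercasing, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall (whitespace tokens)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Toy example: the prediction contains one extra token
print(exact_match("the Eiffel Tower", "Eiffel Tower"))
print(round(token_f1("the Eiffel Tower", "Eiffel Tower"), 2))
```

Corpus-level scores such as those in the table are averages of these per-question values, usually taking the maximum over all reference answers for each question.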

## Results


In [None]:
import pandas as pd

df_results_zero_shot_mbert = pd.read_csv('results/mlqa/results_zero_shot_mbert.csv')
df_results_zero_shot_xlm_r = pd.read_csv('results/mlqa/results_zero_shot_xlm_r.csv')
df_results_zero_shot_xlm_r_large = pd.read_csv("results/mlqa/results_zero_shot_xlm_r_large.csv")
df_results_translate_test_mbert = pd.read_csv("results/mlqa/results_translate_test_mbert.csv")
df_results_translate_test_bert = pd.read_csv("results/mlqa/results_translate_test_bert.csv")
df_results_translate_test_bert_large = pd.read_csv("results/mlqa/results_translate_test_bert_large.csv")
df_results_translate_test_xlm_r = pd.read_csv("results/mlqa/results_translate_test_xlm_r.csv")
df_results_translate_test_xlm_r_large = pd.read_csv("results/mlqa/results_translate_test_xlm_r_large.csv")
df_results_translate_test_roberta = pd.read_csv("results/mlqa/results_translate_test_roberta.csv")
df_results_translate_test_roberta_large = pd.read_csv("results/mlqa/results_translate_test_roberta_large.csv")
df_results_translate_train_es_xlm_r = pd.read_csv("results/mlqa/results_translate_train_es_xlm_r.csv")
df_results_translate_train_de_xlm_r = pd.read_csv("results/mlqa/results_translate_train_de_xlm_r.csv")
df_results_fine_tuning_mbert = pd.read_csv("results/mlqa/results_fine_tuning_xquad_mbert.csv")
df_results_fine_tuning_xquad_xlm_r = pd.read_csv("results/mlqa/results_fine_tuning_xquad_xlm_r.csv")
df_results_fine_tuning_xquad_xlm_r_large = pd.read_csv("results/mlqa/results_fine_tuning_xquad_xlm_r_large.csv")
df_results_data_augmentation_mbert = pd.read_csv("results/mlqa/results_data_augmentation_mbert.csv")

# Order matters: it must match the row labels assigned when the table is built
dataframes = [df_results_zero_shot_mbert,
              df_results_zero_shot_xlm_r,
              df_results_zero_shot_xlm_r_large,
              df_results_translate_test_mbert,
              df_results_translate_test_bert,
              df_results_translate_test_bert_large,
              df_results_translate_test_xlm_r,
              df_results_translate_test_xlm_r_large,
              df_results_translate_test_roberta,
              df_results_translate_test_roberta_large,
              df_results_translate_train_es_xlm_r,
              df_results_translate_train_de_xlm_r,
              df_results_fine_tuning_mbert,
              df_results_fine_tuning_xquad_xlm_r,
              df_results_fine_tuning_xquad_xlm_r_large,
              df_results_data_augmentation_mbert,
              ]

In [None]:
dfs = []
for df in dataframes:
    # Columns 2 and 3 hold the F1 and EM scores (after the saved index column and 'lang')
    name1 = list(df.columns)[2]
    name2 = list(df.columns)[3]
    name = name1[3:]  # drop the 3-character metric prefix to get the shared label
    df = df.round(1)
    df = df.astype({name1: 'str', name2: 'str'})  # float to string
    df[name] = df[[name1, name2]].apply(lambda x: ' / '.join(x), axis=1)  # concat F1 and EM
    df = df.drop([name1, name2, "Unnamed: 0"], axis=1)  # remove useless columns
    df = df.set_index('lang').T  # rotate dataframe: one row per experiment, one column per language
    dfs.append(df)

In [None]:
# Named results_table so the results_df() helper used in earlier cells is not overwritten
results_table = pd.concat(dfs, axis=0)
# reorder languages to match baseline
results_table = results_table[['en', 'ar', 'de', 'el', 'es', 'hi', 'ru', 'th', 'tr', 'vi', 'zh', 'ro', 'avg']]
# rename rows
rows = ["Zero-shot mBERT", "Zero-shot XLM-R", "Zero-shot XLM-R Large",
        "Translate-test mBERT", "Translate-test BERT", "Translate-test BERT Large",
        "Translate-test XLM-R", "Translate-test XLM-R Large", "Translate-test RoBERTa",
        "Translate-test RoBERTa Large", "Translate-train es XLM-R", "Translate-train de XLM-R",
        "Fine-tuning mBERT", "Fine-tuning XLM-R", "Fine-tuning XLM-R Large", "Data-augmentation mBERT"]
results_table = results_table.rename(dict(zip(results_table.index, rows)))
results_table.to_csv("results/mlqa/results.csv")
display(results_table)

In [None]:
# print so the markdown renders with real line breaks instead of escaped "\n"
print(results_table.to_markdown())