# Zero-Shot and Translation Experiments on MLQA

If you're opening this notebook on Colab, you will need to mount Google Drive and change to the project directory.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
%cd /content/drive/MyDrive/master/Applications2/project

In [None]:
%cd /content/drive/MyDrive/LAP/Subjects/AP2/project
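The mount and `%cd` cells above only apply on Colab. A minimal sketch of a guard that skips them elsewhere is shown below. The names `running_on_colab` and `PROJECT_DIR` are illustrative and not part of the original notebook, and the path is simply the one from the first `%cd` cell; adjust it to your own Drive layout.

```python
import os


def running_on_colab() -> bool:
    """Return True only when the google.colab package can be imported."""
    try:
        import google.colab  # noqa: F401  (only importable inside Colab)
    except ImportError:
        return False
    return True


# Assumption: same project path as the first %cd cell above.
PROJECT_DIR = "/content/drive/MyDrive/master/Applications2/project"

if running_on_colab():
    from google.colab import drive

    drive.mount("/content/drive")  # prompts for authorization on first use
    os.chdir(PROJECT_DIR)          # plain-Python equivalent of the %cd magic
```

Outside Colab the guard is a no-op, so the same notebook can also be run locally without editing these cells.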

If you're opening this notebook on Colab, you will need to install 🤗 Transformers and 🤗 Datasets.

In [None]:
!pip install datasets transformers

To share your model with the community and generate results via the Inference API, there are a few more steps to follow.

First, store your authentication token from the Hugging Face website (sign up [here](https://huggingface.co/join) if you haven't already!), then execute the following cell and enter your credentials when prompted:

In [None]:
from huggingface_hub import notebook_login

notebook_login()

## Load MLQA Dataset

MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual extractive question answering. It contains over 5K QA instances per language (12K in English) in SQuAD format, covering seven languages: English, Arabic, German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel: each QA instance is available in 4 different languages on average.
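To make "SQuAD format" concrete, here is a small sketch of what one extractive QA record looks like and how the answer offset relates to the context. The record is invented for illustration and is not taken from MLQA. The field names (`id`, `context`, `question`, `answers` with `text` and `answer_start`) are an assumption based on the flattened layout commonly used for SQuAD-style datasets in 🤗 Datasets.

```python
# Hypothetical SQuAD-style record; the text is made up and not from MLQA.
example = {
    "id": "example-0001",
    "context": "The Rhine flows through Basel before reaching the North Sea.",
    "question": "Which city does the Rhine flow through?",
    "answers": {"text": ["Basel"], "answer_start": [24]},
}


def answer_spans_are_consistent(record: dict) -> bool:
    """Check that each answer text occurs in the context at its character offset."""
    context = record["context"]
    answers = record["answers"]
    return all(
        context[start:start + len(text)] == text
        for text, start in zip(answers["text"], answers["answer_start"])
    )


print(answer_spans_are_consistent(example))
```

In an extractive setting the answer is always a span of the context, which is why a character offset plus the answer text is enough to locate it.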

In [3]:
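from datasets import load_dataset, load_metric

# The seven MLQA languages; the translate-* configurations cover the six non-English ones.
langs = ["ar", "de", "vi", "zh", "en", "es", "hi"]
translate_langs = ["ar", "de", "vi", "zh", "es", "hi"]

# Maps a short key such as "ar.de" or "translate-train.ar" to the loaded dataset.
mlqa = {}

# One configuration per ordered language pair (7 x 7 = 49).
for lang1 in langs:
    for lang2 in langs:
        mlqa[f"{lang1}.{lang2}"] = load_dataset("mlqa", f"mlqa.{lang1}.{lang2}")

# translate-train and translate-test configurations for each non-English language.
for lang in translate_langs:
    mlqa[f"translate-train.{lang}"] = load_dataset("mlqa", f"mlqa-translate-train.{lang}")
    mlqa[f"translate-test.{lang}"] = load_dataset("mlqa", f"mlqa-translate-test.{lang}")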
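Separately from the loader output, the configuration names built by the loops above can be enumerated offline to see how many datasets are requested. This is a sketch that only rebuilds the name strings and downloads nothing; `pair_configs` and `translate_configs` are illustrative names.

```python
from itertools import product

langs = ["ar", "de", "vi", "zh", "en", "es", "hi"]
translate_langs = [lang for lang in langs if lang != "en"]

# One configuration per ordered language pair, named as in the loading loop.
pair_configs = [f"mlqa.{a}.{b}" for a, b in product(langs, repeat=2)]

# Two translate configurations (train and test) per non-English language.
translate_configs = [
    f"mlqa-translate-{split}.{lang}"
    for lang in translate_langs
    for split in ("train", "test")
]

print(len(pair_configs), len(translate_configs))  # 49 12
```

That is 61 `load_dataset` calls in total, which explains the length of the loader output.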

The loader prints a download log for every configuration. Each one reports the same 72.21 MiB download and is cached under `/root/.cache/huggingface/datasets/mlqa/<config>/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7`. The sizes and split sizes reported are:

| Configuration | Generated (MiB) | Total (MiB) | Test examples | Validation examples |
|---|---|---|---|---|
| mlqa.ar.ar | 8.61 | 80.82 | 5335 | 517 |
| mlqa.ar.de | 2.38 | 74.59 | 1649 | 207 |
| mlqa.ar.vi | 3.36 | 75.57 | 2047 | 163 |
| mlqa.ar.zh | 3.35 | 75.56 | 1912 | 188 |
| mlqa.ar.en | 8.46 | 80.67 | 5335 | 517 |
| mlqa.ar.es | 3.06 | 75.27 | 1978 | 161 |
| mlqa.ar.hi | 3.12 | 75.33 | 1831 | 186 |
| mlqa.de.ar | 1.70 | 73.91 | 1649 | 207 |
| mlqa.de.de | 4.53 | 76.74 | 4517 | 512 |
| mlqa.de.vi | 1.78 | 73.99 | 1675 | 182 |
| mlqa.de.zh | 1.74 | 73.95 | 1621 | 190 |
| mlqa.de.en | 4.51 | 76.72 | 4517 | 512 |
| mlqa.de.es | 1.76 | 73.97 | 1776 | 196 |
| mlqa.de.hi | 1.43 | 73.64 | 1430 | 163 |
| mlqa.vi.ar | 3.23 | 75.45 | 2047 | 163 |
| mlqa.vi.de | 2.35 | 74.56 | 1675 | 182 |
| mlqa.vi.vi | 8.13 | 80.34 | 5495 | 511 |
| mlqa.vi.zh | 3.06 | 75.28 | 1943 | 184 |
| mlqa.vi.en | 8.04 | 80.26 | 5495 | 511 |
| mlqa.vi.es | 2.96 | 75.17 | 2018 | 189 |
| mlqa.vi.hi | 2.85 | 75.06 | 1947 | 177 |
| mlqa.zh.ar | 1.78 | 73.99 | 1912 | 188 |
| mlqa.zh.de | 1.46 | 73.67 | 1621 | 190 |
| mlqa.zh.vi | 1.85 | 74.06 | 1943 | 184 |
| mlqa.zh.zh | 4.54 | 76.75 | 5137 | 504 |
| mlqa.zh.en | 4.57 | 76.78 | 5137 | 504 |
| mlqa.zh.es | 1.75 | 73.96 | 1947 | 161 |
| mlqa.zh.hi | 1.65 | 73.86 | 1767 | 189 |
| mlqa.en.ar | 6.93 | 79.14 | 5335 | 517 |
| mlqa.en.de | 5.29 | 77.51 | 4517 | 512 |
| mlqa.en.vi | 7.24 | 79.45 | 5495 | 511 |
| mlqa.en.zh | 6.71 | 78.93 | 5137 | 504 |
| mlqa.en.en | 14.40 | 86.61 | 11590 | 1148 |
| mlqa.en.es | 6.31 | 78.53 | 5253 | 500 |
| mlqa.en.hi | 6.59 | 78.80 | 4918 | 507 |
| mlqa.es.ar | 1.76 | 73.97 | 1978 | 161 |
| mlqa.es.de | 1.43 | 73.64 | 1776 | 196 |
| mlqa.es.vi | 1.79 | 74.00 | 2018 | 189 |
| mlqa.es.zh | 1.68 | 73.89 | 1947 | 161 |
| mlqa.es.en | 4.44 | 76.65 | 5253 | 500 |
| mlqa.es.es | 4.48 | 76.69 | 5253 | 500 |
| mlqa.es.hi | 1.59 | 73.80 | 1723 | 187 |
| mlqa.hi.ar | 4.56 | 76.77 | 1831 | 186 |
| mlqa.hi.de | 3.11 | 75.32 | 1430 | 163 |
| mlqa.hi.vi | 4.84 | 77.05 | 1947 | 177 |
| mlqa.hi.zh | 4.48 | 76.69 | 1767 | 189 |
| mlqa.hi.en | 11.75 | 83.96 | 4918 | 507 |
| mlqa.hi.es | 4.01 | 76.22 | 1723 | 187 |

Downloading and preparing dataset mlqa/mlqa.hi.hi (download: 72.21 MiB, generated: 12.13 MiB, post-processed: Unknown size, total: 84.34 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa.hi.hi/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/4918 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/507 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa.hi.hi/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

Downloading and preparing dataset mlqa/mlqa-translate-train.ar (download: 60.43 MiB, generated: 109.07 MiB, post-processed: Unknown size, total: 169.50 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa-translate-train.ar/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Downloading data:   0%|          | 0.00/63.4M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/78058 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/9512 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa-translate-train.ar/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

Downloading and preparing dataset mlqa/mlqa-translate-test.ar (download: 9.61 MiB, generated: 5.23 MiB, post-processed: Unknown size, total: 14.84 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa-translate-test.ar/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Downloading data:   0%|          | 0.00/10.1M [00:00<?, ?B/s]

Generating test split:   0%|          | 0/5335 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa-translate-test.ar/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

Downloading and preparing dataset mlqa/mlqa-translate-train.de (download: 60.43 MiB, generated: 84.23 MiB, post-processed: Unknown size, total: 144.66 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa-translate-train.de/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating train split:   0%|          | 0/80069 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/9927 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa-translate-train.de/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

Downloading and preparing dataset mlqa/mlqa-translate-test.de (download: 9.61 MiB, generated: 3.70 MiB, post-processed: Unknown size, total: 13.31 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa-translate-test.de/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/4517 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa-translate-test.de/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

Downloading and preparing dataset mlqa/mlqa-translate-train.vi (download: 60.43 MiB, generated: 105.02 MiB, post-processed: Unknown size, total: 165.45 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa-translate-train.vi/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating train split:   0%|          | 0/84816 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10356 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa-translate-train.vi/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

Downloading and preparing dataset mlqa/mlqa-translate-test.vi (download: 9.61 MiB, generated: 5.72 MiB, post-processed: Unknown size, total: 15.33 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa-translate-test.vi/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/5495 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa-translate-test.vi/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

Downloading and preparing dataset mlqa/mlqa-translate-train.zh (download: 60.43 MiB, generated: 59.66 MiB, post-processed: Unknown size, total: 120.09 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa-translate-train.zh/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating train split:   0%|          | 0/76285 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/9568 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa-translate-train.zh/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

Downloading and preparing dataset mlqa/mlqa-translate-test.zh (download: 9.61 MiB, generated: 4.61 MiB, post-processed: Unknown size, total: 14.22 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa-translate-test.zh/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/5137 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa-translate-test.zh/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

Downloading and preparing dataset mlqa/mlqa-translate-train.es (download: 60.43 MiB, generated: 87.27 MiB, post-processed: Unknown size, total: 147.70 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa-translate-train.es/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating train split:   0%|          | 0/81810 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10123 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa-translate-train.es/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

Downloading and preparing dataset mlqa/mlqa-translate-test.es (download: 9.61 MiB, generated: 3.74 MiB, post-processed: Unknown size, total: 13.34 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa-translate-test.es/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/5253 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa-translate-test.es/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

Downloading and preparing dataset mlqa/mlqa-translate-train.hi (download: 60.43 MiB, generated: 181.71 MiB, post-processed: Unknown size, total: 242.14 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa-translate-train.hi/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating train split:   0%|          | 0/82451 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10253 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa-translate-train.hi/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

Downloading and preparing dataset mlqa/mlqa-translate-test.hi (download: 9.61 MiB, generated: 4.40 MiB, post-processed: Unknown size, total: 14.00 MiB) to /root/.cache/huggingface/datasets/mlqa/mlqa-translate-test.hi/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7...


Generating test split:   0%|          | 0/4918 [00:00<?, ? examples/s]

Dataset mlqa downloaded and prepared to /root/.cache/huggingface/datasets/mlqa/mlqa-translate-test.hi/1.0.0/224fde9ea61350ffb013e4beff31d44c6e125ce82c3aa4af70298eceabc8f7f7. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

In [4]:
mlqa

{'ar.ar': DatasetDict({
    test: Dataset({
        features: ['context', 'question', 'answers', 'id'],
        num_rows: 5335
    })
    validation: Dataset({
        features: ['context', 'question', 'answers', 'id'],
        num_rows: 517
    })
}),
 'ar.de': DatasetDict({
    test: Dataset({
        features: ['context', 'question', 'answers', 'id'],
        num_rows: 1649
    })
    validation: Dataset({
        features: ['context', 'question', 'answers', 'id'],
        num_rows: 207
    })
}),
 'ar.en': DatasetDict({
    test: Dataset({
        features: ['context', 'question', 'answers', 'id'],
        num_rows: 5335
    })
    validation: Dataset({
        features: ['context', 'question', 'answers', 'id'],
        num_rows: 517
    })
}),
 'ar.es': DatasetDict({
    test: Dataset({
        features: ['context', 'question', 'answers', 'id'],
        num_rows: 1978
    })
    validation: Dataset({
        features: ['context', 'question', 'answers', 'id'],
        num_rows: 161

In [11]:
mlqa["en.es"]["test"][0]

{'answers': {'answer_start': [457],
  'text': ['Rutgers University biochemists']},
 'context': 'In 1994, five unnamed civilian contractors and the widows of contractors Walter Kasza and Robert Frost sued the USAF and the United States Environmental Protection Agency. Their suit, in which they were represented by George Washington University law professor Jonathan Turley, alleged they had been present when large quantities of unknown chemicals had been burned in open pits and trenches at Groom. Biopsies taken from the complainants were analyzed by Rutgers University biochemists, who found high levels of dioxin, dibenzofuran, and trichloroethylene in their body fat. The complainants alleged they had sustained skin, liver, and respiratory injuries due to their work at Groom, and that this had contributed to the deaths of Frost and Kasza. The suit sought compensation for the injuries they had sustained, claiming the USAF had illegally handled toxic materials, and that the EPA had failed in

In [12]:
mlqa["en.es"]["validation"][0]

{'answers': {'answer_start': [571],
  'text': ['remains infected for its lifetime']},
 'context': 'Pappataci fever is prevalent in the subtropical zone of the Eastern Hemisphere between 20°N and 45°N, particularly in Southern Europe, North Africa, the Balkans, Eastern Mediterranean, Iraq, Iran, Pakistan, Afghanistan and India.The disease is transmitted by the bites of phlebotomine sandflies of the Genus Phlebotomus, in particular, Phlebotomus papatasi, Phlebotomus perniciosus and Phlebotomus perfiliewi. The sandfly becomes infected when biting an infected human in the period between 48 hours before the onset of fever and 24 hours after the end of the fever, and remains infected for its lifetime. Besides this horizontal virus transmission from man to sandfly, the virus can be transmitted in insects transovarially, from an infected female sandfly to its offspring.Pappataci fever is seldom recognised in endemic populations because it is mixed with other febrile illnesses of childhood, but

In [8]:
mlqa["translate-train.es"]["train"][0]

{'answers': {'answer_start': [575], 'text': ['Santa Bernadette Soubirous']},
 'context': 'Arquitectónico, la escuela tiene un carácter católico. sobre la cúpula de oro del edificio principal es una estatua de oro de la Virgen María. inmediatamente frente al edificio principal y frente a ella, es una estatua de cobre de Cristo con los brazos levantado con la leyenda venite ad me omnes. junto al edificio principal se encuentra la Basílica del sagrado corazón. Inmediatamente detrás de la Basílica se encuentra la gruta, un lugar de oración y reflexión de Marian. Se trata de una réplica de la gruta en Lourdes, Francia, donde la Virgen María supuestamente apareció a saint bernadette soubirous en 1858. Al final de la unidad principal (y en una línea directa que conecta a través de 3 estatuas y la cúpula de oro), es una simple y moderna estatua de piedra de María.',
 'id': '5733be284776f41900661182',
 'question': 'A quién presuntamente apareció la Virgen María en 1858 en Lourdes Francia?'}

In [9]:
mlqa["translate-train.es"]["validation"][0]

{'answers': {'answer_start': [182, 182, 182],
  'text': ['DENVER BRONCOS', 'DENVER BRONCOS', 'DENVER BRONCOS']},
 'context': "Super Bowl 50 fue un juego de fútbol americano para determinar el campeón de la liga nacional de fútbol (NFL) para la temporada 2015 la conferencia de fútbol americano (AFC) campeón DENVER BRONCOS derrotó a la conferencia nacional de fútbol (NFC) Campeona Carolina Panthers 24-10 para ganar su tercer título de super bowl. El juego se jugó el 7 de febrero de 2016, en el estadio Levi ' s en la zona de la bahía de San Francisco en santa clara, California. Como este fue el 50º super bowl, la liga hizo hincapié en el aniversario de oro con varias iniciativas temáticas de oro, así como suspender temporalmente la tradición de nombrar cada juego de super bowl con números romanos (bajo el cual el juego habría sido conocido como super bowl l), de modo que el logotipo podría característica de forma prominente los números árabes 50.",
 'id': '56be4db0acb8001400a502ec',
 'que

In [10]:
mlqa["translate-test.es"]["test"][0]

{'answers': {'answer_start': [-1],
  'text': ['A Fan-shaped structures that followed a pattern of leaves, languages and lobes']},
 'context': 'After the eruption, the emissions of pyroclastic material that occurred from the gap created by the collapse were mostly magmatic origin, and in lower proportion of fragments of pre-existing volcanic rocks. The resulting deposits formed a fan-shaped structures that followed a pattern of leaves, languages and overlapping lobes. During the eruption of may 18, there were at least 17 emissions of pyroclastic flow separated over time, whose volumes of aggregation were around 208 million m3.',
 'id': 'b77c037b331e06542272669766df3b9515366b57',
 'question': 'What was the appearance of the deposits that left that landslide?'}
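
The `translate-test` example above has `answer_start` set to -1, and its answer text does not occur verbatim in the translated context. Below is a minimal sketch of a check for this, assuming SQuAD-style records laid out as in the outputs above. The helper name `answer_is_aligned` is hypothetical and not part of 🤗 Datasets, and reading a negative offset as "no recorded position" is an assumption.

```python
def answer_is_aligned(example):
    """Return True if the first answer's character span matches the context exactly."""
    start = example["answers"]["answer_start"][0]
    text = example["answers"]["text"][0]
    if start < 0:
        # Assumption: a negative offset marks an answer without a recorded character position.
        return False
    return example["context"][start:start + len(text)] == text


# Toy records in the same layout as the outputs above (contents are illustrative).
aligned = {"context": "The sky is blue.", "answers": {"answer_start": [11], "text": ["blue"]}}
unaligned = {"context": "The sky is blue.", "answers": {"answer_start": [-1], "text": ["Blue sky"]}}

print(answer_is_aligned(aligned))    # True
print(answer_is_aligned(unaligned))  # False
```

A check like this could be passed to `Dataset.filter`, for example `mlqa["translate-test.es"]["test"].filter(answer_is_aligned)`, to count how many examples carry a usable character span.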

## Load SQuAD Dataset

The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage. The `squad` configuration loaded below is SQuAD v1.1, in which every question is answerable; unanswerable questions were only introduced in SQuAD v2.0. Paper: https://arxiv.org/pdf/1606.05250.pdf

In [None]:
from datasets import load_dataset, load_metric

In [None]:
squad = load_dataset("squad")

In [None]:
squad

In [None]:
squad["train"][0]

In [None]:
squad["validation"][0]

## Preprocessing SQuAD

Load the mBERT tokenizer to process the question and context fields.

In [None]:
from transformers import AutoTokenizer

model_name = "bert-base-multilingual-cased"

tokenizer = AutoTokenizer.from_pretrained(model_name)

One issue specific to question answering preprocessing is how to handle very long documents. In other tasks we usually truncate inputs that are longer than the model's maximum sequence length, but here, removing part of the context might remove the answer we are looking for. To deal with this, we allow one (long) example in our dataset to produce several input features, each shorter than the maximum length of the model (or the one we set as a hyper-parameter). In case the answer lies at the point where we split a long context, we also allow some overlap between the generated features, controlled by the hyper-parameter `doc_stride`:

In [None]:
max_length = 384 # The maximum length of a feature (question and context)
doc_stride = 128 # The authorized overlap between two parts of the context when splitting is needed.
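
To make the effect of `max_length` and `doc_stride` concrete, here is a minimal pure-Python sketch of the splitting. It is not the tokenizer's implementation: the helper name `split_with_overlap` is hypothetical, it works on a plain list of context tokens, and it ignores the question and special tokens that also count toward `max_length` in the real features.

```python
def split_with_overlap(tokens, window, overlap):
    """Split `tokens` into chunks of at most `window` items.

    Consecutive chunks share `overlap` items, so each chunk starts
    `window - overlap` items after the previous one.
    """
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    step = window - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last chunk already reaches the end of the context
        start += step
    return chunks


# Toy context of 10 "tokens", window of 4, overlap of 2.
chunks = split_with_overlap(list(range(10)), window=4, overlap=2)
for chunk in chunks:
    print(chunk)
```

With the toy values, an answer that straddles the boundary between tokens 3 and 4 is cut in the first chunk but fully contained in the second, which is the purpose of the overlap.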

There are a few preprocessing steps particular to question answering that we should be aware of:

1. Some examples in a dataset may have a very long context that exceeds the maximum input length of the model. Truncate only the context by setting `truncation="only_second"`.
2. Next, map the start and end positions of the answer to the original context by setting `return_offsets_mapping=True`.
3. With the mapping in hand, you can find the start and end tokens of the answer. Use the `sequence_ids` method to find which part of the offset corresponds to the question and which corresponds to the `context`.

In [None]:
def prepare_train_features(examples):
    # Some of the questions have lots of whitespace on the left, which is not useful and will make the
    # truncation of the context fail (the tokenized question will take a lot of space). So we remove that
    # left whitespace.
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that slightly overlaps the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # The offset mappings will give us a map from token to character position in the original context. This will
    # help us compute the start_positions and end_positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Let's label those examples!
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        # We will label impossible answers with the index of the CLS token.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        # If no answers are given, set the cls_index as answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start/end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != 1:
                token_start_index += 1

            # End token index of the current span in the text.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != 1:
                token_end_index -= 1

            # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise move the token_start_index and token_end_index to the two ends of the answer.
                # Note: we could go after the last offset if the answer is the last word (edge case).
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples
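
The labeling logic can be checked by hand on a toy feature. The sketch below uses made-up offsets and sequence ids for the question "Where?" and the context "Paris is nice"; it does not call the tokenizer. The helper name `label_feature` is hypothetical, and its loops use explicit bounds that differ slightly from the function above.

```python
def label_feature(offsets, sequence_ids, start_char, end_char, cls_index=0):
    """Return (start_token, end_token) for an answer span.

    Falls back to (cls_index, cls_index) when the answer is not fully
    contained in the context part of this feature.
    """
    context_tokens = [i for i, s in enumerate(sequence_ids) if s == 1]
    first, last = context_tokens[0], context_tokens[-1]

    if not (offsets[first][0] <= start_char and offsets[last][1] >= end_char):
        return cls_index, cls_index

    # Move past every token that starts at or before the answer start.
    start_token = first
    while start_token <= last and offsets[start_token][0] <= start_char:
        start_token += 1
    # Move back past every token that ends at or after the answer end.
    end_token = last
    while end_token >= first and offsets[end_token][1] >= end_char:
        end_token -= 1
    return start_token - 1, end_token + 1


# Tokens: [CLS] Where ? [SEP] Paris is nice [SEP]
offsets = [(0, 0), (0, 5), (5, 6), (0, 0), (0, 5), (6, 8), (9, 13), (0, 0)]
sequence_ids = [None, 0, 0, None, 1, 1, 1, None]

print(label_feature(offsets, sequence_ids, 0, 5))    # (4, 4): "Paris"
print(label_feature(offsets, sequence_ids, 6, 13))   # (5, 6): "is nice"
print(label_feature(offsets, sequence_ids, 20, 25))  # (0, 0): answer outside this feature
```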

Use 🤗 Datasets map function to apply the preprocessing function over the entire dataset. You can speed up the map function by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don’t need.

In [None]:
# `squad` is a DatasetDict, so this tokenizes both the train and validation splits.
squad_train = squad.map(
    prepare_train_features,
    batched=True,
    remove_columns=squad["train"].column_names,
)

The evaluation features are similar to the training features. At evaluation time, we also need to check that a predicted span lies inside the context (and not the question) and to recover the corresponding text. To do this, we need to add two things to our validation features:
- the ID of the example that generated the feature (since each example can generate several features, as seen before);
- the offset mapping that will give us a map from token indices to character positions in the context.

That's why we will re-process the validation set with the following function, slightly different from `prepare_train_features`:

In [None]:
def prepare_validation_features(examples):
    # Some of the questions have lots of whitespace on the left, which is not useful and will make the
    # truncation of the context fail (the tokenized question will take a lot of space). So we remove that
    # left whitespace.
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that slightly overlaps the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    # We keep the example_id that gave us this feature and we will store the offset mappings.
    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1
        
        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["id"][sample_index])

        # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token
        # position is part of the context or not.
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples

Use the map function again to apply the preprocessing function over the validation dataset. You can speed up the map function by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don’t need.

In [None]:
squad_eval = squad["validation"].map(prepare_validation_features, batched=True, 
                                          remove_columns=squad["validation"].column_names)

## Fine-tuning mBERT

Now that our data is ready for training, we can download the pretrained model and fine-tune it. Since our task is question answering, we use the `AutoModelForQuestionAnswering` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us:

In [None]:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

model_name = "bert-base-multilingual-cased"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

Then we will need a data collator that will batch our processed examples together, here the default one will work:

In [None]:
from transformers import DefaultDataCollator

data_collator = DefaultDataCollator()

To instantiate a `Trainer`, we still need to define the [`TrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments), a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model; all other arguments are optional:

In [None]:
batch_size = 16
args = TrainingArguments(
    "bert-base-multilingual-cased-squad",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True,
)

We will evaluate our model and compute metrics in the next section (this is a very long operation, so we will only compute the evaluation loss during training).

Then we just need to pass all of this along with our datasets to the `Trainer`:

In [None]:
trainer = Trainer(
    model,
    args,
    train_dataset=squad_train["train"],
    eval_dataset=squad_train["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

We can now finetune our model by just calling the `train` method:

In [None]:
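# Uncomment to fine-tune; the evaluation section below loads an already fine-tuned checkpoint instead.
# trainer.train()

In [None]:
# Only run this after training; otherwise the weights pushed are not fine-tuned on SQuAD.
trainer.push_to_hub()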

## Evaluating mBERT

We load a model that is already finetuned on SQuAD to save time. We evaluate on the validation set of SQuAD.

In [None]:
from transformers import (
    AutoModelForQuestionAnswering,
    AutoTokenizer,
    DefaultDataCollator,
    Trainer,
    TrainingArguments,
)

model_name = "salti/bert-base-multilingual-cased-finetuned-squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="bert-base-multilingual-cased-finetuned-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

We can grab the predictions for all features by using the `Trainer.predict` method:

In [None]:
raw_predictions = trainer.predict(squad_eval)

The `Trainer` *hides* the columns that are not used by the model (here `example_id` and `offset_mapping` which we will need for our post-processing), so we set them back:

In [None]:
squad_eval.set_format(type=squad_eval.format["type"], 
                      columns=list(squad_eval.features.keys()))

When post-processing, we only keep answers that lie fully inside the context: since we set the offset mappings to `None` for tokens that are not part of the context, this is easy to check. We also discard very long answers, using a hyper-parameter (`max_answer_length`) that we can tune.
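
As a minimal sketch of this filtering on a single toy feature, the code below scores every pair of top start and end positions and keeps the best valid one. The helper name `best_span` and all the numbers are made up for illustration, and it uses plain Python lists rather than NumPy.

```python
def best_span(start_logits, end_logits, offsets, context, n_best_size=3, max_answer_length=2):
    """Pick the highest-scoring span that lies in the context and is not too long."""
    def top(logits):
        # Indices of the `n_best_size` largest logits, best first.
        return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:n_best_size]

    candidates = []
    for s in top(start_logits):
        for e in top(end_logits):
            # Skip positions outside the context (their offsets were set to None).
            if offsets[s] is None or offsets[e] is None:
                continue
            # Skip spans that end before they start or that are too long.
            if e < s or e - s + 1 > max_answer_length:
                continue
            candidates.append({
                "score": start_logits[s] + end_logits[e],
                "text": context[offsets[s][0]:offsets[e][1]],
            })
    if not candidates:
        return {"text": "", "score": 0.0}
    return max(candidates, key=lambda c: c["score"])


context = "Paris is nice"
# Tokens: [CLS] Where ? [SEP] Paris is nice [SEP]; only context tokens keep offsets.
offsets = [None, None, None, None, (0, 5), (6, 8), (9, 13), None]
start_logits = [5.0, 0.1, 0.0, 0.2, 4.0, 1.0, 0.5, 0.0]
end_logits = [4.5, 0.0, 0.1, 0.0, 3.0, 0.5, 2.0, 0.1]

print(best_span(start_logits, end_logits, offsets, context))
```

In this toy case the `[CLS]` position has the largest logits but is rejected because its offset is `None`, so the selected answer is "Paris".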

Since one example can produce several features, we need a map between examples and their corresponding features. We then gather the candidate answers from all the features generated by a given example and pick the best one. The following code builds a map from each example index to the indices of its features, then selects the best answer for each example.
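
The example-to-features map can be sketched in isolation on toy ids before reading the full function. The helper name `group_features_by_example` and the ids are hypothetical.

```python
import collections


def group_features_by_example(example_ids, feature_example_ids):
    """Map each example index to the list of feature indices generated from it."""
    example_id_to_index = {k: i for i, k in enumerate(example_ids)}
    features_per_example = collections.defaultdict(list)
    for feature_index, example_id in enumerate(feature_example_ids):
        features_per_example[example_id_to_index[example_id]].append(feature_index)
    return dict(features_per_example)


# Example "a" has a long context and was split into two features; "b" fits in one.
mapping = group_features_by_example(["a", "b"], ["a", "a", "b"])
print(mapping)
```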

In [None]:
from tqdm.auto import tqdm
import collections
import numpy as np

def postprocess_qa_predictions(examples, features, raw_predictions, n_best_size = 20, max_answer_length = 30):
    all_start_logits, all_end_logits = raw_predictions
    # Build a map example to its corresponding features.
    example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
    features_per_example = collections.defaultdict(list)
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)

    # The dictionaries we have to fill.
    predictions = collections.OrderedDict()

    # Logging.
    print(f"Post-processing {len(examples)} example predictions split into {len(features)} features.")

    # Let's loop over all the examples!
    for example_index, example in enumerate(tqdm(examples)):
        # Those are the indices of the features associated to the current example.
        feature_indices = features_per_example[example_index]

        valid_answers = []
        
        context = example["context"]
        # Looping through all the features associated to the current example.
        for feature_index in feature_indices:
            # We grab the predictions of the model for this feature.
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            # This is what will allow us to map the positions in our logits to spans of text in the original
            # context.
            offset_mapping = features[feature_index]["offset_mapping"]

            # Go through all combinations of the `n_best_size` greatest start and end logits.
            start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
            end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
                    # to part of the input_ids that are not in the context.
                    if (
                        start_index >= len(offset_mapping)
                        or end_index >= len(offset_mapping)
                        or offset_mapping[start_index] is None
                        or offset_mapping[end_index] is None
                    ):
                        continue
                    # Don't consider answers with a length that is either < 0 or > max_answer_length.
                    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                        continue

                    start_char = offset_mapping[start_index][0]
                    end_char = offset_mapping[end_index][1]
                    valid_answers.append(
                        {
                            "score": start_logits[start_index] + end_logits[end_index],
                            "text": context[start_char: end_char]
                        }
                    )
        
        if len(valid_answers) > 0:
            best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
        else:
            # In the very rare edge case we have not a single non-null prediction, we create a fake prediction to avoid
            # failure.
            best_answer = {"text": "", "score": 0.0}
        
        # Let's pick our final answer: the best one
        predictions[example["id"]] = best_answer["text"]

    return predictions
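
To make the filtering rules concrete, here is a minimal sketch of the span selection for a single feature, run on hand-made logits. The function name `best_span` and all `toy_*` values are hypothetical and only illustrate the logic used in `postprocess_qa_predictions`: skip positions whose offset is `None`, skip spans that are reversed or too long, and keep the highest `start + end` score.

```python
import numpy as np

def best_span(start_logits, end_logits, offset_mapping, context,
              n_best_size=20, max_answer_length=30):
    """Return the highest-scoring valid answer span for one feature."""
    start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
    end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
    best = {"text": "", "score": float("-inf")}
    for start_index in start_indexes:
        for end_index in end_indexes:
            # Positions outside the context (question, special tokens) map to None.
            if offset_mapping[start_index] is None or offset_mapping[end_index] is None:
                continue
            # Reject reversed spans and spans longer than max_answer_length tokens.
            if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                continue
            score = float(start_logits[start_index] + end_logits[end_index])
            if score > best["score"]:
                start_char = offset_mapping[start_index][0]
                end_char = offset_mapping[end_index][1]
                best = {"text": context[start_char:end_char], "score": score}
    return best

# Toy feature: 4 non-context positions, 7 context tokens, 1 trailing special token.
toy_context = "Paris is the capital of France."
toy_offsets = [None, None, None, None,
               (0, 5), (6, 8), (9, 12), (13, 20), (21, 23), (24, 30), (30, 31), None]
toy_start = np.array([5.0, 0.1, 0.1, 0.1, 4.0, 0.2, 0.3, 0.5, 0.1, 1.0, 0.0, 0.1])
toy_end = np.array([5.0, 0.1, 0.1, 0.1, 4.5, 0.2, 0.1, 0.6, 0.1, 1.5, 0.2, 0.1])

toy_answer = best_span(toy_start, toy_end, toy_offsets, toy_context)
print(toy_answer)
```

Position 0 has the largest logits in this toy input, but its offset is `None`, so it is ignored and the best remaining span is the one covering "Paris".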

And we can apply our post-processing function to our raw predictions:

In [None]:
final_predictions = postprocess_qa_predictions(squad["validation"], squad_eval, raw_predictions.predictions)

Then we can load the metric from the datasets library.

In [None]:
metric = load_metric("squad")

Then we can call compute on it. We just need to format predictions and labels a bit as it expects a list of dictionaries and not one big dictionary.

In [None]:
formatted_predictions = [{"id": k, "prediction_text": v} for k, v in final_predictions.items()]
references = [{"id": ex["id"], "answers": ex["answers"]} for ex in squad["validation"]]
metric.compute(predictions=formatted_predictions, references=references)
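
For intuition about the two numbers the metric returns, the following is a simplified sketch of SQuAD-style exact match and token-level F1 for a single prediction and a single reference. The helper names are our own, and this is not the library's implementation: the real metric also takes the maximum over all reference answers of a question and reports percentages averaged over the dataset.

```python
import collections
import re
import string

def normalize_answer(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(truth))

def f1_score(prediction, truth):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(truth).split()
    common = collections.Counter(pred_tokens) & collections.Counter(truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

toy_em = exact_match("the Eiffel Tower", "Eiffel Tower in Paris")
toy_f1 = f1_score("the Eiffel Tower", "Eiffel Tower in Paris")
print(toy_em, round(toy_f1, 4))
```

Note that the article-stripping step is specific to English, so scores for other languages depend on punctuation and whitespace handling only.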

## Zero-Shot mBERT

Zero-Shot performance of the mBERT model fine-tuned on SQuAD.

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "salti/bert-base-multilingual-cased-finetuned-squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="bert-base-multilingual-cased-finetuned-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

Use the map function again to apply the preprocessing function to the test split of each language. You can speed up the map function by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don’t need.

In [None]:
from collections import defaultdict
langs = ["ar", "de", "zh", "vi", "en", "es", "hi", "el", "th", "tr", "ru", "ro"]
split = "test"

xquad_prep = defaultdict(dict)

def map_datasets(langs, split, prepare_features):
    for lang in langs:
        xquad_prep[lang][split] = xquad[lang][split].map(prepare_features, batched=True, 
                                    remove_columns=xquad[lang][split].column_names)
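
Why `remove_columns` is needed together with `batched=True` can be seen in a small sketch. A batched function receives a dictionary of lists and may return more rows than it was given, because one long example can be split into several features; the original columns then no longer line up and have to be dropped. The function `toy_prepare_features` below is hypothetical and windows over characters rather than tokens, purely for illustration.

```python
def toy_prepare_features(batch, window=20, stride=10):
    """Split each context into overlapping character windows (one row per window)."""
    features = {"example_id": [], "chunk": []}
    for example_id, context in zip(batch["id"], batch["context"]):
        start = 0
        while True:
            # Every window remembers which example it came from.
            features["example_id"].append(example_id)
            features["chunk"].append(context[start:start + window])
            if start + window >= len(context):
                break
            start += stride
    return features

# A batch in the dict-of-lists layout that a batched map function receives.
toy_batch = {"id": ["q1", "q2"], "context": ["a" * 35, "b" * 15]}
toy_features = toy_prepare_features(toy_batch)
print(len(toy_batch["id"]), "examples ->", len(toy_features["chunk"]), "features")
```

The `example_id` column plays the same role as in `prepare_validation_features`: it lets the post-processing step group features back by example.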

In [None]:
map_datasets(langs, split, prepare_validation_features)

In [None]:
def compute_results(langs, split):
    results = {}
    for lang in langs:
        # We can grab the predictions for all features by using the `Trainer.predict` method
        raw_predictions = trainer.predict(xquad_prep[lang][split])

        # The Trainer hides the columns the model does not use (here example_id and offset_mapping,
        # which we will need for our post-processing), so we set them back
        xquad_prep[lang][split].set_format(type=xquad_prep[lang][split].format["type"], 
                        columns=list(xquad_prep[lang][split].features.keys()))
        
        # And we can apply our post-processing function to our raw predictions
        final_predictions = postprocess_qa_predictions(xquad[lang][split], xquad_prep[lang][split], raw_predictions.predictions)

        # We just need to format predictions and labels a bit as it expects a list of dictionaries and not one big dictionary.
        formatted_predictions = [{"id": k, "prediction_text": v} for k, v in final_predictions.items()]
        references = [{"id": ex["id"], "answers": ex["answers"]} for ex in xquad[lang][split]]
        results[lang] = metric.compute(predictions=formatted_predictions, references=references)
    return results

In [None]:
results_zero_shot_mbert = compute_results(langs, split)
print(results_zero_shot_mbert)

In [None]:
import pandas as pd
def results_df(results_dict, model):
    F1colname = "F1_" + model
    EMcolname = "EM_" + model
    dict_results = defaultdict(list)
    for lang, scores in results_dict.items():
        dict_results["lang"].append(lang)
        dict_results[F1colname].append(scores['f1'])
        dict_results[EMcolname].append(scores['exact_match'])

    avg_f1 = np.average(dict_results[F1colname])
    avg_em = np.average(dict_results[EMcolname])
    dict_results["lang"].append('avg')
    dict_results[F1colname].append(avg_f1)
    dict_results[EMcolname].append(avg_em)
    df_results = pd.DataFrame(dict_results).round(2)
    return df_results

In [None]:
df_results_zero_shot_mbert = results_df(results_zero_shot_mbert, "ZS_mbert")
df_results_zero_shot_mbert.to_csv("results/mlqa/results_zero_shot_mbert.csv")
df_results_zero_shot_mbert

## Zero-Shot XLM-R

Zero-Shot performance of the XLM-R model fine-tuned on SQuAD.

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "vanichandna/xlm-roberta-finetuned-squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name,from_tf=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="xlm-roberta-finetuned-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

Use the map function again to apply the preprocessing function to the test split of each language. You can speed up the map function by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don’t need.

In [None]:
langs = ["ar", "de", "zh", "vi", "en", "es", "hi", "el", "th", "tr", "ru", "ro"]
split = "test"

map_datasets(langs, split, prepare_validation_features)

In [None]:
results_zero_shot_xlm_r = compute_results(langs, split)
print(results_zero_shot_xlm_r)

In [None]:
df_results_zero_shot_xlm_r = results_df(results_zero_shot_xlm_r, "ZS_xml_r")
df_results_zero_shot_xlm_r.to_csv("results/mlqa/results_zero_shot_xlm_r.csv")
df_results_zero_shot_xlm_r

## Zero-Shot XLM-R-large

Zero-Shot performance of the XLM-R-large model fine-tuned on SQuAD.

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "Palak/xlm-roberta-large_squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="xlm-roberta-large_squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

Use the map function again to apply the preprocessing function to the test split of each language. You can speed up the map function by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don’t need.

In [None]:
langs = ["ar", "de", "zh", "vi", "en", "es", "hi", "el", "th", "tr", "ru", "ro"]
split = "test"

map_datasets(langs, split, prepare_validation_features)

In [None]:
results_zero_shot_xlm_r_large = compute_results(langs, split)
print(results_zero_shot_xlm_r_large)

In [None]:
df_results_zero_shot_xlm_r_large = results_df(results_zero_shot_xlm_r_large, "ZS_xml_r_large")
df_results_zero_shot_xlm_r_large.to_csv("results/mlqa/results_zero_shot_xlm_r_large.csv")
df_results_zero_shot_xlm_r_large

## Translate Test mBERT

We use the model fine-tuned on SQuAD and evaluate it on test data that we translated from the target language to English. In this case we use mBERT, but we could also use a monolingual model, because we only evaluate on English data.

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "salti/bert-base-multilingual-cased-finetuned-squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="bert-base-multilingual-cased-finetuned-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
langs = ["ar", "de", "zh", "vi", "es", "hi", "el", "th", "tr", "ru"]
split = "translate_test"

map_datasets(langs, split, prepare_validation_features)

In [None]:
results_translate_test_mbert = compute_results(langs, split)
print(results_translate_test_mbert)

In [None]:
df_results_translate_test_mbert = results_df(results_translate_test_mbert, "TT_mbert")
df_results_translate_test_mbert.to_csv("results/mlqa/results_translate_test_mbert.csv")
df_results_translate_test_mbert

## Translate Test BERT

We use the model fine-tuned on SQuAD and evaluate it on test data that we translated from the target language to English. In this case we use a monolingual BERT model, because we only evaluate on English data.

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "rsvp-ai/bertserini-bert-base-squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="bertserini-bert-base-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
langs = ["ar", "de", "zh", "vi", "es", "hi", "el", "th", "tr", "ru"]
split = "translate_test"

map_datasets(langs, split, prepare_validation_features)

In [None]:
results_translate_test_bert = compute_results(langs, split)
print(results_translate_test_bert)

In [None]:
df_results_translate_test_bert = results_df(results_translate_test_bert, "TT_bert")
df_results_translate_test_bert.to_csv("results/mlqa/results_translate_test_bert.csv")
df_results_translate_test_bert

## Translate Test BERT-large

We use the model fine-tuned on SQuAD and evaluate it on test data that we translated from the target language to English. In this case we use a monolingual BERT-large model, because we only evaluate on English data.

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "bert-large-cased-whole-word-masking-finetuned-squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="dir-bert-large-cased-whole-word-masking-finetuned-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
langs = ["ar", "de", "zh", "vi", "es", "hi", "el", "th", "tr", "ru"]
split = "translate_test"

map_datasets(langs, split, prepare_validation_features)

In [None]:
results_translate_test_bert_large = compute_results(langs, split)
print(results_translate_test_bert_large)

In [None]:
df_results_translate_test_bert_large = results_df(results_translate_test_bert_large, "TT_bert_large")
df_results_translate_test_bert_large.to_csv("results/mlqa/results_translate_test_bert_large.csv")
df_results_translate_test_bert_large

## Translate Test XLM-R

We use the model fine-tuned on SQuAD and evaluate it on test data that we translated from the target language to English. In this case we use XLM-R, but we could also use a monolingual model, because we only evaluate on English data.

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "vanichandna/xlm-roberta-finetuned-squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name,from_tf=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="xlm-roberta-finetuned-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
langs = ["ar", "de", "zh", "vi", "es", "hi", "el", "th", "tr", "ru"]
split = "translate_test"

map_datasets(langs, split, prepare_validation_features)

In [None]:
results_translate_test_xlm_r = compute_results(langs, split)
print(results_translate_test_xlm_r)

In [None]:
df_results_translate_test_xlm_r = results_df(results_translate_test_xlm_r, "TT_xml_r")
df_results_translate_test_xlm_r.to_csv("results/mlqa/results_translate_test_xlm_r.csv")
df_results_translate_test_xlm_r

## Translate Test XLM-R-large

We use the model fine-tuned on SQuAD and evaluate it on test data that we translated from the target language to English. In this case we use XLM-R-large, but we could also use a monolingual model, because we only evaluate on English data.

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "Palak/xlm-roberta-large_squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="xlm-roberta-large_squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
langs = ["ar", "de", "zh", "vi", "es", "hi", "el", "th", "tr", "ru"]
split = "translate_test"

map_datasets(langs, split, prepare_validation_features)

In [None]:
results_translate_test_xlm_r_large = compute_results(langs, split)
print(results_translate_test_xlm_r_large)

In [None]:
df_results_translate_test_xlm_r_large = results_df(results_translate_test_xlm_r_large, "TT_xml_r_large")
df_results_translate_test_xlm_r_large.to_csv("results/mlqa/results_translate_test_xlm_r_large.csv")
df_results_translate_test_xlm_r_large

## Translate Test RoBERTa

We use the model fine-tuned on SQuAD and evaluate it on test data that we translated from the target language to English. In this case we use a monolingual RoBERTa model, because we only evaluate on English data.

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "thatdramebaazguy/roberta-base-squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="roberta-base-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
langs = ["ar", "de", "zh", "vi", "es", "hi", "el", "th", "tr", "ru"]
split = "translate_test"

map_datasets(langs, split, prepare_validation_features)

In [None]:
results_translate_test_roberta = compute_results(langs, split)
print(results_translate_test_roberta)

In [None]:
df_results_translate_test_roberta = results_df(results_translate_test_roberta, "TT_roberta")
df_results_translate_test_roberta.to_csv("results/mlqa/results_translate_test_roberta.csv")
df_results_translate_test_roberta

## Translate Test RoBERTa-large

We use the model fine-tuned on SQuAD and evaluate it on test data that we translated from the target language to English. In this case we use a monolingual RoBERTa-large model, because we only evaluate on English data.

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "csarron/roberta-large-squad-v1"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="roberta-large-squad-v1",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
langs = ["ar", "de", "zh", "vi", "es", "hi", "el", "th", "tr", "ru"]
split = "translate_test"

map_datasets(langs, split, prepare_validation_features)

In [None]:
results_translate_test_roberta_large = compute_results(langs, split)
print(results_translate_test_roberta_large)

In [None]:
df_results_translate_test_roberta_large = results_df(results_translate_test_roberta_large, "TT_roberta_large")
df_results_translate_test_roberta_large.to_csv("results/mlqa/results_translate_test_roberta_large.csv")
df_results_translate_test_roberta_large

## Translate Train Es mBERT

For many language pairs, an MT model may be available, which can be used to obtain data in the target language. To evaluate the impact of using such data, we translate the English training data into the target language using our MT system. We then fine-tune mBERT on the translated data. For QA tasks, the answer spans must be aligned between the source and target language. We use data that was already translated to save time.
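
As an illustration of what span alignment involves, here is a minimal sketch of one simple heuristic: look for the translated answer as an exact substring of the translated context and recompute `answer_start`. This is an assumption made for illustration, not necessarily how the pre-translated data used here was produced; the function name `realign_answer` is hypothetical, and examples where the answer cannot be found would typically be dropped.

```python
def realign_answer(translated_context, translated_answer):
    """Return SQuAD-style answers for a translated pair, or None if no exact match."""
    start = translated_context.find(translated_answer)
    if start == -1:
        # The translation of the answer does not appear verbatim in the context.
        return None
    return {"text": [translated_answer], "answer_start": [start]}

toy_context_es = "Madrid es la capital de Espana."
toy_aligned = realign_answer(toy_context_es, "Madrid")
toy_missing = realign_answer(toy_context_es, "Barcelona")
print(toy_aligned, toy_missing)
```

Exact matching fails whenever the MT system inflects or rephrases the answer differently in isolation than in context, which is why alignment is a real source of noise in translate-train data.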

Now that our data is ready for training, we can download the pretrained model and fine-tune it. Since our task is question answering, we use the `AutoModelForQuestionAnswering` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us:

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer

model_name = "bert-base-multilingual-cased"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

batch_size = 16

training_args = TrainingArguments(
    output_dir="bert-base-multilingual-cased-squad-es",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True
)

Use 🤗 Datasets map function to apply the preprocessing function over the entire dataset. You can speed up the map function by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don’t need.

In [None]:
langs = ["es"]

split = "translate_train"
map_datasets(langs, split, prepare_train_features)

split = "translate_dev"
map_datasets(langs, split, prepare_train_features)

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=xquad_prep["es"]["translate_train"],
    eval_dataset=xquad_prep["es"]["translate_dev"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
# trainer.train()

In [None]:
langs = ["ar", "de", "zh", "vi", "en", "es", "hi", "el", "th", "tr", "ru", "ro"]
split = "test"

map_datasets(langs, split, prepare_validation_features)

In [None]:
langs = ["ar", "de", "zh", "vi", "en", "es", "hi", "el", "th", "tr", "ru", "ro"]
split = "test"
results_translate_train_es_mbert = compute_results(langs, split)
print(results_translate_train_es_mbert)

In [None]:
df_results_translate_train_es_mbert = results_df(results_translate_train_es_mbert, "TTr_es_mbert")
df_results_translate_train_es_mbert.to_csv("results/mlqa/results_translate_train_es_mbert.csv")
df_results_translate_train_es_mbert

In [None]:
trainer.push_to_hub()

## Translate Train Es XLM-R

For many language pairs, an MT model may be available, which can be used to obtain data in the target language. To evaluate the impact of using such data, we translate the English training data into the target language using our MT system. We use an XLM-R model that has already been fine-tuned to save time.

Since the model is already fine-tuned, we only need to download it. Our task is question answering, so we use the `AutoModelForQuestionAnswering` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us:

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "saattrupdan/xlmr-base-texas-squad-es"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="xlmr-base-texas-squad-es",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
langs = ["es"]

split = "translate_train"
map_datasets(langs, split, prepare_train_features)

split = "translate_dev"
map_datasets(langs, split, prepare_train_features)

In [None]:
langs = ["ar", "de", "zh", "vi", "en", "es", "hi", "el", "th", "tr", "ru", "ro"]
split = "test"

map_datasets(langs, split, prepare_validation_features)

In [None]:
langs = ["ar", "de", "zh", "vi", "en", "es", "hi", "el", "th", "tr", "ru", "ro"]
split = "test"
results_translate_train_es_xlm_r = compute_results(langs, split)
print(results_translate_train_es_xlm_r)

In [None]:
df_results_translate_train_es_xlm_r = results_df(results_translate_train_es_xlm_r, "TTr_es_xml_r")
df_results_translate_train_es_xlm_r.to_csv("results/mlqa/results_translate_train_es_xlm_r.csv")
df_results_translate_train_es_xlm_r

## Translate Train De XLM-R

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "saattrupdan/xlmr-base-texas-squad-de"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="xlmr-base-texas-squad-de",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
langs = ["de"]

split = "translate_train"
map_datasets(langs, split, prepare_train_features)

split = "translate_dev"
map_datasets(langs, split, prepare_train_features)

In [None]:
langs = ["ar", "de", "zh", "vi", "en", "es", "hi", "el", "th", "tr", "ru", "ro"]
split = "test"

map_datasets(langs, split, prepare_validation_features)

In [None]:
langs = ["ar", "de", "zh", "vi", "en", "es", "hi", "el", "th", "tr", "ru", "ro"]
split = "test"
results_translate_train_de_xlm_r = compute_results(langs, split)
print(results_translate_train_de_xlm_r)

In [None]:
df_results_translate_train_de_xlm_r = results_df(results_translate_train_de_xlm_r, "TTr_de_xml_r")
df_results_translate_train_de_xlm_r.to_csv("results/mlqa/results_translate_train_de_xlm_r.csv")
df_results_translate_train_de_xlm_r

## Translate Train All mBERT

We also experiment with a multi-task version of the translate-train setting where we fine-tune mBERT on the combined translated training data of all languages jointly.

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer

model_name = "bert-base-multilingual-cased"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

batch_size = 16

training_args = TrainingArguments(
    output_dir="bert-base-multilingual-cased-squad-all",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True
)

In [None]:
langs = ["ar", "de", "zh", "vi", "es", "hi", "el", "th", "tr", "ru"]

split = "translate_train"
map_datasets(langs, split, prepare_train_features)

split = "translate_dev"
map_datasets(langs, split, prepare_train_features)

In [None]:
from datasets import DatasetDict, concatenate_datasets

xquad_merged = DatasetDict()
xquad_merged["translate_train"] = squad_train["train"]
xquad_merged["translate_dev"] = squad_train["validation"]

for lang in langs:
    for split in ["translate_train", "translate_dev"]:
        xquad_merged[split] = concatenate_datasets([xquad_merged[split], xquad_prep[lang][split]])

In [None]:
xquad_merged

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=xquad_merged["translate_train"],
    eval_dataset=xquad_merged["translate_dev"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
# trainer.train()

In [None]:
langs = ["ar", "de", "zh", "vi", "en", "es", "hi", "el", "th", "tr", "ru", "ro"]
split = "test"

map_datasets(langs, split, prepare_validation_features)

In [None]:
langs = ["ar", "de", "zh", "vi", "en", "es", "hi", "el", "th", "tr", "ru", "ro"]
split = "test"
results_translate_train_all_mbert = compute_results(langs, split)
print(results_translate_train_all_mbert)

In [None]:
df_results_translate_train_all_mbert = results_df(results_translate_train_all_mbert, "TTr_all_mbert")
df_results_translate_train_all_mbert.to_csv("results/mlqa/results_translate_train_all_mbert.csv")
df_results_translate_train_all_mbert

In [None]:
trainer.push_to_hub()

## Fine-tuning XQuAD mBERT

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "alon-albalak/bert-base-multilingual-xquad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="bert-base-multilingual-xquad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
langs = ["ar", "de", "zh", "vi", "en", "es", "hi", "el", "th", "tr", "ru", "ro"]
split = "test"

map_datasets(langs, split, prepare_validation_features)

In [None]:
results_fine_tuning_xquad_mbert = compute_results(langs, split)
print(results_fine_tuning_xquad_mbert)

In [None]:
df_results_fine_tuning_xquad_mbert = results_df(results_fine_tuning_xquad_mbert, "FT_xquad_mbert")
df_results_fine_tuning_xquad_mbert.to_csv("results/mlqa/results_fine_tuning_xquad_mbert.csv")
df_results_fine_tuning_xquad_mbert

## Fine-tuning XQuAD XLM-R

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "alon-albalak/xlm-roberta-base-xquad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="xlm-roberta-base-xquad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
langs = ["ar", "de", "zh", "vi", "en", "es", "hi", "el", "th", "tr", "ru", "ro"]
split = "test"

map_datasets(langs, split, prepare_validation_features)

In [None]:
results_fine_tuning_xquad_xlm_r = compute_results(langs, split)
print(results_fine_tuning_xquad_xlm_r)

In [None]:
df_results_fine_tuning_xquad_xlm_r = results_df(results_fine_tuning_xquad_xlm_r, "FT_xquad_xlm_r")
df_results_fine_tuning_xquad_xlm_r.to_csv("results/mlqa/results_fine_tuning_xquad_xlm_r.csv")
df_results_fine_tuning_xquad_xlm_r

## Fine-tuning XQuAD XLM-R Large

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "alon-albalak/xlm-roberta-large-xquad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="xlm-roberta-large-xquad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
langs = ["ar", "de", "zh", "vi", "en", "es", "hi", "el", "th", "tr", "ru", "ro"]
split = "test"

map_datasets(langs, split, prepare_validation_features)

In [None]:
results_fine_tuning_xquad_xlm_r_large = compute_results(langs, split)
print(results_fine_tuning_xquad_xlm_r_large)

In [None]:
df_results_fine_tuning_xquad_xlm_r_large = results_df(results_fine_tuning_xquad_xlm_r_large, "FT_xquad_xlm_r_large")
df_results_fine_tuning_xquad_xlm_r_large.to_csv("results/mlqa/results_fine_tuning_xquad_xlm_r_large.csv")
df_results_fine_tuning_xquad_xlm_r_large

## Data Augmentation mBERT

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer, DefaultDataCollator

model_name = "mrm8488/bert-multi-cased-finetuned-xquadv1"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DefaultDataCollator()

batch_size = 16
training_args = TrainingArguments(
    output_dir="bert-multi-cased-finetuned-xquadv1",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
langs = ["ar", "de", "zh", "vi", "en", "es", "hi", "el", "th", "tr", "ru", "ro"]
split = "test"

map_datasets(langs, split, prepare_validation_features)

In [None]:
results_data_augmentation_mbert = compute_results(langs, split)
print(results_data_augmentation_mbert)

In [None]:
df_results_data_augmentation_mbert = results_df(results_data_augmentation_mbert, "data_augm_mbert")
df_results_data_augmentation_mbert.to_csv("results/mlqa/results_data_augmentation_mbert.csv")
df_results_data_augmentation_mbert

## Baselines

We show results using baseline methods in the tables below. We directly fine-tune [mBERT](https://github.com/google-research/bert/blob/master/multilingual.md)
and [XLM-R Large](https://arxiv.org/abs/1911.02116) on the English SQuAD v1.1 training data
and evaluate them via zero-shot transfer on the XQuAD test datasets. For translate-train, 
we fine-tune mBERT on the SQuAD v1.1 training data, which we automatically translate
to the target language. For translate-test, we fine-tune [BERT-Large](https://arxiv.org/abs/1810.04805)
on the SQuAD v1.1 training set and evaluate it on the XQuAD test set of the target language,
which we automatically translate to English. Note that results with translate-test are not directly
comparable as we drop a small number (less than 3%) of the test examples.

| Baseline model (F1 / EM)              | en   | ar   | de   | el   | es   | hi   | ru   | th   | tr   | vi   | zh   | ro   | avg  |
|:-----------------------|------|------|------|------|------|------|------|------|------|------|------|------|------|
| Zero-shot mBERT                 | 83.5 / 72.2 | 61.5 / 45.1 | 70.6 / 54.0 | 62.6 / 44.9 | 75.5 / 56.9 | 59.2 / 46.0 | 71.3 / 53.3 | 42.7 / 33.5 | 55.4 / 40.1 | 69.5 / 49.6 | 58.0 / 48.3 | 72.7 / 59.9 | 65.2 / 50.3 |
| Zero-shot XLM-R Large           | 86.5 / 75.7 | 68.6 / 49.0 | **80.4** / 63.4 | **79.8** / 61.7 | 82.0 / 63.9 | **76.7** / 59.7 | **80.1** / 64.3 | **74.2** / **62.8** | **75.9** / **59.3** | **79.1** / 59.0 | 59.3 / 50.0 | **83.6** / **69.7** | **77.2** / 61.5 |
| Translate-train mBERT | 83.5 / 72.2 | 68.0 / 51.1 | 75.6 / 60.7 | 70.0 / 53.0 | 80.2 / 63.1 | 69.6 / 55.4 | 75.0 / 59.7 | 36.9 / 33.5 | 68.9 / 54.8 | 75.6 / 56.2 | 66.2 / 56.6 |   | 70.0 / 56.0 |
| Translate-test BERT Large | **87.9** / **77.1** | **73.7** / **58.8** | 79.8 / **66.7** | 79.4 / **65.5** | **82.0** / **68.4**| 74.9 / **60.1** | 79.9 / **66.7** | 64.6 / 50.0 | 67.4 / 49.6 | 76.3 / **61.5** | **73.7** / **59.1** |     | 76.3 / **62.1** |
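
Every cell in these tables is an F1 / EM pair. As a reminder of what the two numbers measure, here is a minimal sketch of SQuAD-style exact match and token-level F1 for a single prediction. It is a simplification and not the evaluation code used in this notebook: the function names are made up for illustration, the normalization (lowercasing, dropping punctuation and the English articles) is English-oriented, and it assumes whitespace tokenization, which is not adequate for languages such as Chinese or Thai.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower", "eiffel tower"))
print(f1_score("in Paris France", "Paris"))
```

The reported scores are these per-example values averaged over the test set and multiplied by 100. When a question has several reference answers, the usual convention is to take the maximum score over the references.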

## Results

* Our results are similar to the XQuAD baseline results in the first table.
* Zero-shot beats translate-test for larger models and loses to it for smaller models.
* In translate-test, monolingual models score higher than multilingual ones.
* Larger versions of each model score higher.
* Approaches from worst to best: translate-train, translate-test multilingual, translate-test monolingual, zero-shot, fine-tuning, data augmentation.
* Best languages: English, Spanish, Romanian, Russian.
* Worst languages: Chinese, Hindi, Thai, Turkish.

| Our models (F1 / EM)         | en          | ar          | de          | el          | es          | hi          | ru          | th          | tr          | vi          | zh          | ro          | avg         |
|:-----------------------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|
| *Zero-shot* | | | | | | | | | | | | | |
| Zero-shot mBERT              | 85.0 / 73.5 | 57.8 / 42.2 | 72.6 / 55.9 | 62.2 / 45.2 | 76.4 / 58.1 | 55.3 / 40.6 | 71.3 / 54.7 | 35.1 / 26.3 | 51.1 / 34.9 | 68.1 / 47.9 | 58.2 / 47.3 | 72.4 / 59.5 | 63.8 / 48.8 |
| Zero-shot XLM-R              | 84.4 / 73.8 | 67.9 / 52.1 | 75.3 / 59.8 | 74.3 / 57.0 | 77.0 / 59.2 | 69.0 / 52.5 | 75.1 / 58.6 | 68.0 / 56.4 | 68.0 / 51.8 | 73.6 / 54.5 | 65.0 / 55.0 | 80.0 / 66.3 | 73.1 / 58.1 |
| Zero-shot XLM-R Large        | **86.5 / 75.9** | **75.0 / 58.0** | **79.9 / 63.8** | **79.1 / 61.3** | **81.0 / 62.7** | **76.0 / 60.8** | **80.3 / 63.1** | **72.8 / 61.7** | **74.1 / 58.3** | **79.0 / 59.3** | **66.8 / 58.0** | **83.5 / 70.2** | **77.8 / 62.8** |
| *Translate-test monolingual* | | | | | | | | | | | | | |
| Translate-test BERT          |         | 69.4 / 55.0 | 75.7 / 62.7 | 75.0 / 60.6 | 77.2 / 62.6 | 69.7 / 53.7 | 74.9 / 60.5 | 60.5 / 46.5 | 59.9 / 41.8 | 72.2 / 58.3 | 69.9 / 56.0 |         | 70.4 / 55.8 |
| Translate-test BERT Large    |         | 73.6 / 59.1 | 80.4 / 66.4 | 80.2 / 66.8 | 81.9 / 68.7 | **75.3 / 61.7** | 80.1 / 67.0 | **67.5 / 53.9** | **66.3 / 47.3** | **76.4 / 62.1** | 74.0 / 59.5 |         | 75.6 / 61.2 |
| Translate-test RoBERTa       |         | 71.6 / 57.0 | 77.0 / 62.4 | 76.8 / 63.9 | 80.0 / 64.6 | 72.0 / 55.6 | 77.2 / 62.4 | 62.2 / 46.6 | 63.4 / 44.1 | 72.4 / 56.6 | 72.4 / 57.9 |         | 72.5 / 57.1 |
| Translate-test RoBERTa Large |         | **74.8 / 61.1** | **80.4 / 67.1** | **80.8 / 68.0** | **83.1 / 69.4** | 75.1 / 61.0 | **81.2 / 68.0** | 65.3 / 51.0 | 66.0 / 46.9 | 76.4 / 62.0 | **74.0 / 59.9** |         | **75.7 / 61.4** |
| *Translate-test multilingual* | | | | | | | | | | | | | |
| Translate-test mBERT         |         | 70.4 / 55.8 | 76.7 / 63.3 | 76.0 / 61.9 | 78.7 / 65.1 | 70.6 / 55.8 | 76.6 / 63.1 | 60.0 / 45.9 | 61.6 / 42.7 | 70.6 / 55.6 | 70.1 / 56.6 |         | 71.2 / 56.6 |
| Translate-test XLM-R         |         | 70.4 / 56.5 | 79.0 / 65.8 | 77.8 / 65.0 | 79.3 / 66.4 | 72.4 / 57.6 | 77.4 / 63.6 | 60.3 / 45.4 | 63.4 / 44.3 | 73.0 / 58.4 | 71.1 / 57.4 |         | 72.4 / 58.0 |
| Translate-test XLM-R Large   |         | **72.9 / 59.1** | **80.1 / 66.6** | **79.6 / 66.2** | **81.5 / 67.1** | **74.2 / 60.1** | **79.7 / 65.7** | **61.7 / 46.0** | **66.2 / 48.2** | **75.1 / 61.5** | **73.6 / 58.8** |         | **74.5 / 59.9** |
| *Translate-train* | | | | | | | | | | | | | |
| Translate-train es XLM-R     | **80.4** / 66.1 | **67.0** / 47.9 | 74.2 / 56.4 | **73.5** / 52.4 | **76.3** / 56.6 | **66.9** / 48.2 | 72.4 / 54.2 | **68.7** / **58.5** | **66.2** / 46.5 | 73.2 / 52.0 | 63.4 / 50.3 | **76.0** / 59.2 | **71.5** / 54.0 |
| Translate-train de XLM-R     | 79.8 / **67.1** | 65.9 / **48.2** | **74.3** / **58.8** | 72.3 / **54.4** | 75.9 / **57.9** | 66.4 / **50.6** | **73.1** / **56.4** | 65.4 / 56.8 | 65.8 / **50.8** | 72.7 / **53.2** | **64.7** / **55.0** | 75.3 / **61.1** | 71.0 / **55.9** |
| *Fine-tuning XQuAD* | | | | | | | | | | | | | |
| Fine-tuning mBERT            | 97.3 / 95.3 | 90.0 / 84.3 | 94.2 / 90.0 | 92.2 / 87.0 | 96.2 / 92.4 | 88.2 / 77.5 | 94.4 / 90.1 | 25.2 / 16.8 | 89.9 / 84.4 | 93.4 / 87.6 | 87.5 / 84.4 | 95.5 / 91.3 | 87.0 / 81.8 |
| Fine-tuning XLM-R            | 98.5 / 97.5 | 92.5 / 88.2 | 95.1 / 91.8 | 96.0 / 91.8 | 97.8 / 93.6 | 92.6 / 88.6 | 95.2 / 90.8 | 94.0 / 92.4 | 92.0 / 87.3 | 95.5 / 91.3 | 94.0 / 92.9 | 97.7 / 94.8 | 95.1 / 91.8 |
| Fine-tuning XLM-R Large      | **99.7 / 99.2** | **97.0 / 94.2** | **98.1 / 95.6** | **97.8 / 94.4** | **98.5 / 95.8** | **96.5 / 93.6** | **98.1 / 96.0** | **96.1 / 95.1** | **95.9 / 92.3** | **97.6 / 94.0** | **96.3 / 95.7** | **98.9 / 97.1** | **97.5 / 95.2** |
| *Data-augmentation XQuAD* | | | | | | | | | | | | | |
| Data-augmentation mBERT      | 99.7 / 99.2 | 97.1 / 94.4 | 98.9 / 97.9 | 97.0 / 94.6 | 99.6 / 98.9 | 97.7 / 95.1 | 98.5 / 97.3 | 87.3 / 84.9 | 98.8 / 97.4 | 98.9 / 97.5 | 97.5 / 96.8 | 90.6 / 81.6 | 96.8 / 94.6 |
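
The zero-shot versus translate-test comparison in the bullets can be checked more carefully than by reading the avg column, because the zero-shot averages include en and ro while the translate-test rows do not. The sketch below recomputes the mean F1 over the ten languages that both settings share, using F1 values copied from the table above. The dictionary layout and variable names are only for illustration.

```python
from statistics import mean

# F1 per language in the order: ar, de, el, es, hi, ru, th, tr, vi, zh
# (the languages that have translate-test results). Values are copied
# from the results table; en and ro are deliberately left out.
f1 = {
    "Zero-shot mBERT": [57.8, 72.6, 62.2, 76.4, 55.3, 71.3, 35.1, 51.1, 68.1, 58.2],
    "Translate-test mBERT": [70.4, 76.7, 76.0, 78.7, 70.6, 76.6, 60.0, 61.6, 70.6, 70.1],
    "Zero-shot XLM-R": [67.9, 75.3, 74.3, 77.0, 69.0, 75.1, 68.0, 68.0, 73.6, 65.0],
    "Translate-test XLM-R": [70.4, 79.0, 77.8, 79.3, 72.4, 77.4, 60.3, 63.4, 73.0, 71.1],
    "Zero-shot XLM-R Large": [75.0, 79.9, 79.1, 81.0, 76.0, 80.3, 72.8, 74.1, 79.0, 66.8],
    "Translate-test XLM-R Large": [72.9, 80.1, 79.6, 81.5, 74.2, 79.7, 61.7, 66.2, 75.1, 73.6],
}

# Mean F1 over the shared languages, rounded like the table entries.
shared_avg = {name: round(mean(scores), 1) for name, scores in f1.items()}

for model in ("mBERT", "XLM-R", "XLM-R Large"):
    zero_shot = shared_avg[f"Zero-shot {model}"]
    translate_test = shared_avg[f"Translate-test {model}"]
    print(f"{model}: zero-shot {zero_shot} vs translate-test {translate_test}")
```

On these numbers translate-test is clearly ahead for mBERT and zero-shot is ahead for XLM-R Large, which agrees with the second bullet. The base XLM-R pair is separated by only about one point, so that case is less clear-cut.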

In [None]:
import pandas as pd

df_results_zero_shot_mbert = pd.read_csv('results/mlqa/results_zero_shot_mbert.csv')
df_results_zero_shot_xlm_r = pd.read_csv('results/mlqa/results_zero_shot_xlm_r.csv')
df_results_zero_shot_xlm_r_large = pd.read_csv("results/mlqa/results_zero_shot_xlm_r_large.csv")
df_results_translate_test_mbert = pd.read_csv("results/mlqa/results_translate_test_mbert.csv")
df_results_translate_test_bert = pd.read_csv("results/mlqa/results_translate_test_bert.csv")
df_results_translate_test_bert_large = pd.read_csv("results/mlqa/results_translate_test_bert_large.csv")
df_results_translate_test_xlm_r = pd.read_csv("results/mlqa/results_translate_test_xlm_r.csv")
df_results_translate_test_xlm_r_large = pd.read_csv("results/mlqa/results_translate_test_xlm_r_large.csv")
df_results_translate_test_roberta = pd.read_csv("results/mlqa/results_translate_test_roberta.csv")
df_results_translate_test_roberta_large = pd.read_csv("results/mlqa/results_translate_test_roberta_large.csv")
df_results_translate_train_es_xlm_r = pd.read_csv("results/mlqa/results_translate_train_es_xlm_r.csv")
df_results_translate_train_de_xlm_r = pd.read_csv("results/mlqa/results_translate_train_de_xlm_r.csv")
df_results_fine_tuning_xquad_mbert = pd.read_csv("results/mlqa/results_fine_tuning_xquad_mbert.csv")
df_results_fine_tuning_xquad_xlm_r = pd.read_csv("results/mlqa/results_fine_tuning_xquad_xlm_r.csv")
df_results_fine_tuning_xquad_xlm_r_large = pd.read_csv("results/mlqa/results_fine_tuning_xquad_xlm_r_large.csv")
df_results_data_augmentation_mbert = pd.read_csv("results/mlqa/results_data_augmentation_mbert.csv")

# Order matters: it must match the row labels assigned when the table is built.
dataframes = [df_results_zero_shot_mbert,
              df_results_zero_shot_xlm_r,
              df_results_zero_shot_xlm_r_large,
              df_results_translate_test_mbert,
              df_results_translate_test_bert,
              df_results_translate_test_bert_large,
              df_results_translate_test_xlm_r,
              df_results_translate_test_xlm_r_large,
              df_results_translate_test_roberta,
              df_results_translate_test_roberta_large,
              df_results_translate_train_es_xlm_r,
              df_results_translate_train_de_xlm_r,
              df_results_fine_tuning_xquad_mbert,
              df_results_fine_tuning_xquad_xlm_r,
              df_results_fine_tuning_xquad_xlm_r_large,
              df_results_data_augmentation_mbert,
              ]

In [None]:
dfs = []
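for df in dataframes:
    # Columns 2 and 3 hold the F1 and EM scores for one experiment.
    name1 = list(df.columns)[2]
    name2 = list(df.columns)[3]
    # Drop the first three characters of the F1 column name to get the experiment label.
    name = name1[3:]
    df = df.round(1)
    df = df.astype({name1: 'str', name2: 'str'})  # float to string
    # Join F1 and EM into a single "F1 / EM" cell.
    df[name] = df[[name1, name2]].apply(lambda x: ' / '.join(x), axis=1)
    # Drop the separate score columns and the saved CSV index.
    df = df.drop([name1, name2, "Unnamed: 0"], axis=1)
    # Transpose so that each experiment becomes one row with a column per language.
    df = df.set_index('lang').T
    dfs.append(df)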
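
To make the reshaping step concrete, here is the same sequence of operations applied to a small made-up frame. The layout (a saved index column, `lang`, then an F1 and an EM column) is inferred from the column positions the loop relies on. The column names `F1_demo` and `EM_demo` and all the numbers are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for one results CSV: saved index, language, F1, EM.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "lang": ["en", "es", "avg"],
    "F1_demo": [85.04, 76.41, 80.725],
    "EM_demo": [73.53, 58.07, 65.8],
})

f1_col, em_col = toy.columns[2], toy.columns[3]
label = f1_col[3:]  # "F1_demo" -> "demo"

toy = toy.round(1)  # only numeric columns are rounded
toy = toy.astype({f1_col: "str", em_col: "str"})
toy[label] = toy[f1_col] + " / " + toy[em_col]  # one "F1 / EM" string per language

# One row labelled "demo", one column per language.
row = toy.drop(columns=[f1_col, em_col, "Unnamed: 0"]).set_index("lang").T
print(row)
```

Concatenating one such single-row frame per experiment gives a table with the same shape as the markdown tables above.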

In [None]:
# Named results_table so that the results_df() helper used in earlier cells is not overwritten.
results_table = pd.concat(dfs, axis=0)
# reorder languages to match the baseline table
results_table = results_table[['en', 'ar', 'de', 'el', 'es', 'hi', 'ru', 'th', 'tr', 'vi', 'zh', 'ro', 'avg']]
# rename rows
rows = ["Zero-shot mBERT", "Zero-shot XLM-R", "Zero-shot XLM-R Large",
        "Translate-test mBERT", "Translate-test BERT", "Translate-test BERT Large",
        "Translate-test XLM-R", "Translate-test XLM-R Large", "Translate-test RoBERTa",
        "Translate-test RoBERTa Large", "Translate-train es XLM-R", "Translate-train de XLM-R",
        "Fine-tuning mBERT", "Fine-tuning XLM-R", "Fine-tuning XLM-R Large", "Data-augmentation mBERT"]
results_table = results_table.rename(dict(zip(results_table.index, rows)))
results_table.to_csv("results/mlqa/results.csv")
display(results_table)

In [None]:
# print() renders the line breaks instead of showing the escaped string
print(results_table.to_markdown())