# Zero-Shot and Translation Experiments on XQuAD with mBERT

https://proceedings.mlr.press/v119/hu20b/hu20b.pdf

If you're opening this notebook on Colab, you will need to mount Drive and change directory.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [62]:
%cd /content/drive/MyDrive/LAP/Subjects/AP2/project

/content/drive/MyDrive/LAP/Subjects/AP2/project


If you're opening this notebook on Colab, you will need to install 🤗 Transformers and 🤗 Datasets.

In [3]:
!pip install datasets transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.2.2-py3-none-any.whl (346 kB)
Collecting transformers
  Downloading transformers-4.19.2-py3-none-any.whl (4.2 MB)
Collecting xxhash
  Downloading xxhash-3.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting huggingface-hub<1.0.0,>=0.1.0
  Downloading huggingface_hub-0.7.0-py3-none-any.whl (86 kB)
Collecting aiohttp
  Downloading aiohttp-3.8.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)

To be able to share your model with the community and generate results like the one shown in the picture below via the inference API, there are a few more steps to follow.

First you have to store your authentication token from the Hugging Face website (sign up [here](https://huggingface.co/join) if you haven't already!), then execute the following cell and enter your token when prompted:

In [5]:
from huggingface_hub import notebook_login

notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token
Authenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal in case you want to set this credential helper as the default

git config --global credential.helper store


## Load SQuAD Dataset

The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage. The `squad` dataset loaded below is SQuAD v1.1, in which every question is answerable; unanswerable questions were only introduced in SQuAD 2.0.

In [7]:
from datasets import list_datasets, list_metrics, load_dataset, load_metric

In [8]:
squad = load_dataset("squad")

Downloading builder script:   0%|          | 0.00/1.97k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/1.02k [00:00<?, ?B/s]

Downloading and preparing dataset squad/plain_text (download: 33.51 MiB, generated: 85.63 MiB, post-processed: Unknown size, total: 119.14 MiB) to /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/8.12M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.05M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Generating train split:   0%|          | 0/87599 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10570 [00:00<?, ? examples/s]

Dataset squad downloaded and prepared to /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

In [9]:
squad

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

In [10]:
squad["train"][0]

{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']},
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'id': '5733be284776f41900661182',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'title': 'University_of_Notre_Dame'}
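
In the record above, `answer_start` is a character offset into `context`, and `text` is the answer string found at that offset. The following is a minimal sketch of a consistency check for that convention. The helper name `answer_span_is_consistent` and the toy record are illustrative assumptions, not part of the notebook or of 🤗 Datasets.

```python
def answer_span_is_consistent(example):
    """Return True if every answer text sits at its stated character offset."""
    context = example["context"]
    answers = example["answers"]
    return all(
        context[start:start + len(text)] == text
        for start, text in zip(answers["answer_start"], answers["text"])
    )


# Toy record in the same layout as a SQuAD example (hypothetical data).
toy_context = "The Grotto is a replica of the grotto at Lourdes, France."
toy_answer = "Lourdes, France"
toy_example = {
    "context": toy_context,
    "question": "Where is the original grotto?",
    "answers": {
        # Compute the offset instead of hard-coding it.
        "answer_start": [toy_context.index(toy_answer)],
        "text": [toy_answer],
    },
}

print(answer_span_is_consistent(toy_example))
```

The same check could be run over a real split (for instance with a list comprehension over `squad["train"]`) to spot records whose offsets do not line up with their answer text.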

In [11]:
squad["validation"][0]

{'answers': {'answer_start': [177, 177, 177],
  'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos']},
 'context': 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi\'s Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.',
 'id': '56be4db0acb8001400a502ec',
 'question': 'Which NFL team represented the AFC at Super Bo

## Load XQuAD Dataset

XQuAD (Cross-lingual Question Answering Dataset) is a benchmark dataset for evaluating cross-lingual question answering
performance. The dataset consists of a subset of 240 paragraphs and 1190 question-answer pairs from the development set
of SQuAD v1.1 (Rajpurkar et al., 2016) together with their professional translations into eleven languages: Spanish, German,
Greek, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi and Romanian. Consequently, the dataset is entirely parallel
across 12 languages.

We also include "translate-train", "translate-dev", and "translate-test"
splits for each non-English language from XTREME (Hu et al., 2020). These can be used to run XQuAD in the "translate-train" or "translate-test" settings.

In [63]:
langs = ["ar", "de", "zh", "vi", "en", "es", "hi", "el", "th", "tr", "ru", "ro"]
xquad = {}
for lang in langs:
    xquad[lang] = load_dataset("xquad.py", lang)

Downloading and preparing dataset xquad/ar to /root/.cache/huggingface/datasets/xquad/ar/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40...


Downloading data files:   0%|          | 0/4 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/168k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/312M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/127M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.18M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/4 [00:00<?, ?it/s]

Generating test split: 0 examples [00:00, ? examples/s]

Generating translate_train split: 0 examples [00:00, ? examples/s]

Generating translate_dev split: 0 examples [00:00, ? examples/s]

Generating translate_test split: 0 examples [00:00, ? examples/s]

Dataset xquad downloaded and prepared to /root/.cache/huggingface/datasets/xquad/ar/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40. Subsequent calls will reuse this data.


  0%|          | 0/4 [00:00<?, ?it/s]

In [67]:
xquad

{'ar': DatasetDict({
    test: Dataset({
        features: ['id', 'context', 'question', 'answers'],
        num_rows: 1190
    })
    translate_train: Dataset({
        features: ['id', 'context', 'question', 'answers'],
        num_rows: 86787
    })
    translate_dev: Dataset({
        features: ['id', 'context', 'question', 'answers'],
        num_rows: 34448
    })
    translate_test: Dataset({
        features: ['id', 'context', 'question', 'answers'],
        num_rows: 1151
    })
}),
 'de': DatasetDict({
    test: Dataset({
        features: ['id', 'context', 'question', 'answers'],
        num_rows: 1190
    })
    translate_train: Dataset({
        features: ['id', 'context', 'question', 'answers'],
        num_rows: 82603
    })
    translate_dev: Dataset({
        features: ['id', 'context', 'question', 'answers'],
        num_rows: 32950
    })
    translate_test: Dataset({
        features: ['id', 'context', 'question', 'answers'],
        num_rows: 1168
    })
}),
 'el':

In [66]:
xquad["es"]["test"][0]

{'answers': {'answer_start': [133], 'text': ['308']},
 'context': '\ufeffLos Panthers, que además de liderar las intercepciones de la NFL con 24 y contar con cuatro jugadores de la Pro Bowl, cedieron solo 308 puntos en defensa y se sitúan en el sexto lugar de la liga. Kawann Short, tacle defensivo de la Pro Bowl, lideró al equipo con 11 capturas, 3 balones sueltos forzados y 2 recuperaciones. A su vez, el liniero Mario Addison, consiguió 6 capturas y media. En la línea de los Panthers, también destacó como ala defensiva el veterano Jared Allen ―5 veces jugador de la Pro Bowl y que fue el líder, en activo, de capturas de la NFL con 136― junto con el también ala defensiva Kony Ealy, que lleva 5 capturas en solo 9 partidos como titular. Detrás de ellos, Thomas Davis y Luke Kuechly, dos de los tres apoyadores titulares que también han sido seleccionados para jugar la Pro Bowl. Davis se hizo con 5 capturas y media, 4 balones sueltos forzados y 4 intercepciones, mientras que Kuechly lideró a

In [18]:
xquad["es"]["translate_train"][0]

{'answers': {'answer_start': [161],
  'text': ['Coleman A. Young Municipal Center']},
 'context': 'Los tribunales de Detroit son administrados por el estado y las elecciones no son partidistas. El tribunal testamentario del condado de Wayne está ubicado en el Coleman A. Young Municipal Center en el centro de Detroit. El tribunal de circuito se encuentra al otro lado de la avenida Gratiot. en el Frank Murphy Hall of Justice, en el centro de Detroit. La ciudad alberga el Trigésimo Sexto Tribunal de Distrito, así como el Primer Distrito del Tribunal de Apelaciones de Michigan y el Tribunal de Distrito de los Estados Unidos para el Distrito Este de Michigan. La ciudad proporciona la aplicación de la ley a través del Departamento de Policía de Detroit y servicios de emergencia a través del Departamento de Bomberos de Detroit.',
 'id': '5728d4d3ff5b5019007da7ba',
 'question': '¿Dónde se encuentra el tribunal testamentario del condado de Wayne?'}

In [19]:
xquad["es"]["translate_dev"][0]

{'answers': {'answer_start': [227], 'text': ['una fuerza innata de ímpetu']},
 'context': 'Las deficiencias de la física aristotélica no se corregirían por completo hasta el trabajo del siglo XVII de Galileo Galilei, quien fue influenciado por la idea medieval tardía de que los objetos en movimiento forzado llevaban una fuerza innata de ímpetu. Galileo construyó un experimento en el que las piedras y las balas de cañón fueron rodadas por una pendiente para refutar la teoría aristotélica del movimiento a principios del siglo XVII. Mostró que los cuerpos eran acelerados por la gravedad hasta un punto que era independiente de su masa y argumentó que los objetos retienen su velocidad a menos que actúen por una fuerza, por ejemplo, la fricción.',
 'id': '57373f80c3c5551400e51e91',
 'question': '¿Qué contenían los objetos en movimiento forzado según la idea medieval tardía que influyen en Aristóteles?'}

In [20]:
xquad["es"]["translate_test"][0]

{'answers': {'answer_start': [411],
  'text': ['Cobb, Shepley, Rutan and Coolidge, Holabird & Roche, and other architectural firms']},
 'context': 'The first buildings on the University of Chicago campus, which make up what is now known as the main quadrangle, were part of a "master plan" conceived by two administrators of the University of Chicago and planned by the architect Henry Ives of Chicago. The main quadrangle consists of six quadrangle, each surrounded by buildings, bordering a larger quadrangle. The main quadrangle buildings were designed by Cobb, Shepley, Rutan and Coolidge, Holabird & Roche, and other architectural firms in a mixture of Victorian Gothic and collegiate Gothic styles, used in the faculties of the University Oxford (Mitchell Tower, for example, follows the model of the Magdalena Tower in Oxford, and Commons University, Hutchinson Hall, imitates Christ Church Hall).',
 'id': '57284b904b864d19001648e4',
 'question': 'Who helped design the main quadrangle?'}

## Preprocessing

Load the mBERT tokenizer to process the question and context fields.

In [21]:
from transformers import AutoTokenizer

model_name = "bert-base-multilingual-cased"

tokenizer = AutoTokenizer.from_pretrained(model_name)

Downloading:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/625 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/972k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.87M [00:00<?, ?B/s]

One thing specific to preprocessing for question answering is how to deal with very long documents. In other tasks we usually truncate inputs that are longer than the model's maximum sequence length, but here removing part of the context might remove the answer we are looking for. To deal with this, we allow one (long) example in our dataset to produce several input features, each shorter than the maximum length of the model (or the one we set as a hyper-parameter). Also, in case the answer lies at the point where we split a long context, we allow some overlap between the generated features, controlled by the hyper-parameter `doc_stride`:

In [22]:
max_length = 384 # The maximum length of a feature (question and context)
doc_stride = 128 # The authorized overlap between two parts of the context when splitting is needed.
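
To make the role of the overlap concrete, here is a simplified sketch of splitting a long token sequence into overlapping windows. It is an illustration only, not the tokenizer's actual implementation: it ignores the question tokens and special tokens that also count toward `max_length`, and the function name `split_with_overlap` is a hypothetical helper.

```python
def split_with_overlap(tokens, window, overlap):
    """Split `tokens` into chunks of at most `window` items.

    Consecutive chunks share `overlap` items, so an answer that straddles
    one chunk boundary can still appear whole in the next chunk.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap  # how far each new chunk advances
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the current chunk already reaches the end
        start += step
    return chunks


# Ten toy "tokens", windows of 4 with an overlap of 2.
context_tokens = list(range(10))
chunks = split_with_overlap(context_tokens, window=4, overlap=2)
for chunk in chunks:
    print(chunk)
```

With `window=4` and `overlap=2`, each chunk repeats the last two tokens of the previous one, which is the role `doc_stride` plays for the context in this notebook.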

There are a few preprocessing steps particular to question answering that we should be aware of:

1. Some examples in a dataset may have a very long context that exceeds the maximum input length of the model. Truncate only the context by setting `truncation="only_second"`.
2. Next, map the start and end positions of the answer to the original context by setting `return_offsets_mapping=True`.
3. With the mapping in hand, you can find the start and end tokens of the answer. Use the `sequence_ids` method to find which part of the offset corresponds to the question and which corresponds to the `context`.
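
The steps above can be sketched on hand-written toy data, without loading a tokenizer. The offsets and sequence ids below are made up to mimic what a tokenizer would return for the question "Who?" and the context "Marie Curie won"; the helper name `char_span_to_token_span` is a hypothetical illustration of the labeling logic, under the assumption that index 0 holds the CLS token.

```python
def char_span_to_token_span(offsets, sequence_ids, start_char, end_char, cls_index=0):
    """Map an answer's character span to (start_token, end_token).

    Returns (cls_index, cls_index) when the answer is not fully contained
    in the context part of this feature.
    """
    # First and last token positions that belong to the context (sequence id 1).
    context_start = sequence_ids.index(1)
    context_end = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)

    # Answer lies (partly) outside this feature: label with the CLS index.
    if offsets[context_start][0] > start_char or offsets[context_end][1] < end_char:
        return cls_index, cls_index

    # Walk inward from both ends of the context to the answer boundaries.
    token_start = context_start
    while token_start <= context_end and offsets[token_start][0] <= start_char:
        token_start += 1
    token_end = context_end
    while token_end >= context_start and offsets[token_end][1] >= end_char:
        token_end -= 1
    return token_start - 1, token_end + 1


# Toy feature: [CLS] Who ? [SEP] Marie Curie won [SEP]
toy_sequence_ids = [None, 0, 0, None, 1, 1, 1, None]
toy_offsets = [(0, 0), (0, 3), (3, 4), (0, 0), (0, 5), (6, 11), (12, 15), (0, 0)]

# "Curie" occupies characters 6..11 of the context "Marie Curie won".
print(char_span_to_token_span(toy_offsets, toy_sequence_ids, 6, 11))
```

Special tokens have sequence id `None`, question tokens `0` and context tokens `1`, which is why the search for the context boundaries looks for the value `1`.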

In [23]:
def prepare_train_features(examples):
    # Some of the questions have lots of whitespace on the left, which is not useful and will make the
    # truncation of the context fail (the tokenized question will take a lot of space). So we remove that
    # left whitespace.
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that overlaps a bit with the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # The offset mappings will give us a map from token to character position in the original context. This will
    # help us compute the start_positions and end_positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Let's label those examples!
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        # We will label impossible answers with the index of the CLS token.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        # If no answers are given, set the cls_index as answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start/end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != 1:
                token_start_index += 1

            # End token index of the current span in the text.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != 1:
                token_end_index -= 1

            # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise move the token_start_index and token_end_index to the two ends of the answer.
                # Note: we could go after the last offset if the answer is the last word (edge case).
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples

Use 🤗 Datasets map function to apply the preprocessing function over the entire dataset. You can speed up the map function by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don’t need.

In [24]:
squad_train = squad.map(prepare_train_features, batched=True, 
                            remove_columns=squad["train"].column_names)

  0%|          | 0/88 [00:00<?, ?ba/s]

  0%|          | 0/11 [00:00<?, ?ba/s]

The evaluation features are similar to the train features. At evaluation time, however, we have to check that a predicted span lies inside the context (and not the question) and recover the text it covers. To do this, we need to add two things to our validation features:
- the ID of the example that generated the feature (since each example can generate several features, as seen before);
- the offset mapping that will give us a map from token indices to character positions in the context.

That's why we will re-process the validation set with the following function, slightly different from `prepare_train_features`:

In [25]:
def prepare_validation_features(examples):
    # Some of the questions have lots of whitespace on the left, which is not useful and will make the
    # truncation of the context fail (the tokenized question will take a lot of space). So we remove that
    # left whitespace.
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that overlaps a bit with the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    # We keep the example_id that gave us this feature and we will store the offset mappings.
    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1
        
        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["id"][sample_index])

        # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token
        # position is part of the context or not.
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples
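
To illustrate why the non-context offsets are set to `None`, the sketch below masks a hand-written offset mapping and then turns a pair of token indices back into text. The post-processing step itself is not shown in this section, so this is only an assumption about how the masked offsets would typically be used; `mask_non_context_offsets` and `span_to_text` are hypothetical helper names and the data is made up.

```python
def mask_non_context_offsets(offsets, sequence_ids, context_index=1):
    """Keep offsets of context tokens; replace all others with None."""
    return [
        o if sequence_ids[k] == context_index else None
        for k, o in enumerate(offsets)
    ]


def span_to_text(context, masked_offsets, start_token, end_token):
    """Return the context substring for a token span, or None if the span is invalid."""
    if end_token < start_token:
        return None
    if masked_offsets[start_token] is None or masked_offsets[end_token] is None:
        return None  # span touches the question or a special token
    start_char = masked_offsets[start_token][0]
    end_char = masked_offsets[end_token][1]
    return context[start_char:end_char]


# Toy feature: [CLS] Who ? [SEP] Marie Curie won [SEP]
toy_context = "Marie Curie won"
toy_sequence_ids = [None, 0, 0, None, 1, 1, 1, None]
toy_offsets = [(0, 0), (0, 3), (3, 4), (0, 0), (0, 5), (6, 11), (12, 15), (0, 0)]

masked = mask_non_context_offsets(toy_offsets, toy_sequence_ids)
print(span_to_text(toy_context, masked, 4, 5))  # span inside the context
print(span_to_text(toy_context, masked, 1, 2))  # span inside the question
```

Because question and special-token positions carry `None`, a candidate span can be rejected with a simple check instead of re-deriving which tokens belong to the context.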

Use the map function again to apply the preprocessing function over the validation dataset. You can speed up the map function by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don’t need.

In [26]:
squad_eval = squad["validation"].map(prepare_validation_features, batched=True, 
                                          remove_columns=squad["validation"].column_names)

  0%|          | 0/11 [00:00<?, ?ba/s]

## Fine-tuning

Now that our data is ready for training, we can download the pretrained model and fine-tune it. Since our task is question answering, we use the `AutoModelForQuestionAnswering` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us:

In [28]:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

model_name = "bert-base-multilingual-cased"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

Downloading:   0%|          | 0.00/681M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertForQuestionAnswering: ['cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-bas

Then we will need a data collator that will batch our processed examples together, here the default one will work:

In [29]:
from transformers import DefaultDataCollator

data_collator = DefaultDataCollator()

To instantiate a `Trainer`, we still need to define the [`TrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments), a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional:

In [30]:
batch_size = 16
args = TrainingArguments(
    "bert-base-multilingual-cased-finetuned-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True,
)

We will evaluate our model and compute metrics in the next section (this is a very long operation, so we will only compute the evaluation loss during training).

Then we just need to pass all of this along with our datasets to the `Trainer`:

In [None]:
trainer = Trainer(
    model,
    args,
    train_dataset=squad_train["train"],
    eval_dataset=squad_train["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

We can now finetune our model by just calling the `train` method:

In [None]:
# trainer.train()

Since this training is particularly long, let's save the model just in case we need to restart.

In [None]:
# trainer.save_model("bert-base-multilingual-cased-finetuned-squad")

## Fine-tuned Model

We load a model that is already finetuned on SQuAD.

In [None]:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

model_name = "salti/bert-base-multilingual-cased-finetuned-squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

training_args = TrainingArguments(
    output_dir="bert-base-multilingual-cased-finetuned-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

## Evaluation

We can grab the predictions for all features by using the `Trainer.predict` method:

In [42]:
raw_predictions = trainer.predict(squad_eval)

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 10851
  Batch size = 16


The `Trainer` *hides* the columns that are not used by the model (here `example_id` and `offset_mapping` which we will need for our post-processing), so we set them back:

In [43]:
squad_eval.set_format(type=squad_eval.format["type"], 
                      columns=list(squad_eval.features.keys()))

To turn the raw logits into answers, we use the offset mappings: since we set them to `None` for every token that is not part of the context, it is easy to check whether a candidate answer lies fully inside the context. We also discard very long answers (the maximum length is a hyperparameter we can tune).

Since one example can give several features, we need a map between examples and their corresponding features, so that we can gather the candidate answers from all the features generated by a given example and then pick the best one. The following code builds a map from each example index to the indices of its features.
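
As a toy illustration of that mapping, the snippet below uses made-up ids (`q1`, `q2`) in place of real dataset rows; only the dictionary-building logic mirrors the function that follows.

```python
import collections

# Hypothetical data: two examples, the first one split into two features.
examples = {"id": ["q1", "q2"]}
features = [{"example_id": "q1"}, {"example_id": "q1"}, {"example_id": "q2"}]

# Example id -> position of the example in the dataset.
example_id_to_index = {k: i for i, k in enumerate(examples["id"])}

# Example position -> list of feature positions that came from it.
features_per_example = collections.defaultdict(list)
for i, feature in enumerate(features):
    features_per_example[example_id_to_index[feature["example_id"]]].append(i)

print(dict(features_per_example))  # {0: [0, 1], 1: [2]}
```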

In [44]:
from tqdm.auto import tqdm
import collections
import numpy as np

def postprocess_qa_predictions(examples, features, raw_predictions, n_best_size=20, max_answer_length=30):
    all_start_logits, all_end_logits = raw_predictions
    # Build a map from each example to its corresponding features.
    example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
    features_per_example = collections.defaultdict(list)
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)

    # The dictionaries we have to fill.
    predictions = collections.OrderedDict()

    # Logging.
    print(f"Post-processing {len(examples)} example predictions split into {len(features)} features.")

    # Let's loop over all the examples!
    for example_index, example in enumerate(tqdm(examples)):
        # Those are the indices of the features associated to the current example.
        feature_indices = features_per_example[example_index]

        valid_answers = []
        
        context = example["context"]
        # Looping through all the features associated to the current example.
        for feature_index in feature_indices:
            # We grab the predictions of the model for this feature.
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            # This is what will allow us to map positions in our logits to spans of text in the original
            # context.
            offset_mapping = features[feature_index]["offset_mapping"]

            # Go through all possibilities for the `n_best_size` greatest start and end logits.
            start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
            end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
                    # to part of the input_ids that are not in the context.
                    if (
                        start_index >= len(offset_mapping)
                        or end_index >= len(offset_mapping)
                        or offset_mapping[start_index] is None
                        or offset_mapping[end_index] is None
                    ):
                        continue
                    # Don't consider answers with a length that is either < 0 or > max_answer_length.
                    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                        continue

                    start_char = offset_mapping[start_index][0]
                    end_char = offset_mapping[end_index][1]
                    valid_answers.append(
                        {
                            "score": start_logits[start_index] + end_logits[end_index],
                            "text": context[start_char: end_char]
                        }
                    )
        
        if len(valid_answers) > 0:
            best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
        else:
            # In the very rare edge case where we do not have a single non-null prediction, we create a fake
            # prediction to avoid failure.
            best_answer = {"text": "", "score": 0.0}
        
        # Let's pick our final answer: the best one
        predictions[example["id"]] = best_answer["text"]

    return predictions
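
The span search at the heart of this function can be tried on a single hand-made feature. The sketch below is a pure-Python simplification (no NumPy, one feature only); `best_span` and all the logits are invented for illustration. Note how the `[CLS]` position has the highest logits but is rejected because its offset is `None`.

```python
def best_span(start_logits, end_logits, offsets, context, n_best_size=2, max_answer_length=3):
    """Return (score, text) of the best valid span for one feature."""
    def top(logits):
        # Indices of the `n_best_size` largest logits, best first.
        return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:n_best_size]

    candidates = []
    for s in top(start_logits):
        for e in top(end_logits):
            # Skip positions outside the context (question and special tokens).
            if offsets[s] is None or offsets[e] is None:
                continue
            # Skip spans that end before they start or that are too long.
            if e < s or e - s + 1 > max_answer_length:
                continue
            text = context[offsets[s][0]:offsets[e][1]]
            candidates.append((start_logits[s] + end_logits[e], text))
    # Same fallback as above: an empty answer when nothing is valid.
    return max(candidates, default=(0.0, ""))


context = "Paris is the capital of France"
# Token layout: [CLS] where ? [SEP] Paris is the capital of France [SEP]
offsets = [None, None, None, None,
           (0, 5), (6, 8), (9, 12), (13, 20), (21, 23), (24, 30), None]
start_logits = [5.0, 0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
end_logits = [5.0, 0.0, 0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0]

result = best_span(start_logits, end_logits, offsets, context)
print(result)  # (7.0, 'Paris')
```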

And we can apply our post-processing function to our raw predictions:

In [45]:
final_predictions = postprocess_qa_predictions(squad["validation"], squad_eval, raw_predictions.predictions)

Post-processing 10570 example predictions split into 10851 features.


  0%|          | 0/10570 [00:00<?, ?it/s]

Then we can load the metric from the datasets library.

In [46]:
metric = load_metric("squad")

Then we can call compute on it. We just need to format predictions and labels a bit as it expects a list of dictionaries and not one big dictionary.

In [47]:
formatted_predictions = [{"id": k, "prediction_text": v} for k, v in final_predictions.items()]
references = [{"id": ex["id"], "answers": ex["answers"]} for ex in squad["validation"]]
metric.compute(predictions=formatted_predictions, references=references)

{'exact_match': 81.90160832544939, 'f1': 89.121876471452}
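
For intuition about what these two numbers measure, here is a minimal sketch of SQuAD-style exact match and token-level F1 for a single prediction and a single reference. It follows the normalization of the official SQuAD v1.1 evaluation script (lowercasing, removing punctuation and the English articles, collapsing whitespace), but it is an illustration and not the code that `load_metric("squad")` runs.

```python
import collections
import re
import string


def normalize_answer(s):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def exact_match(prediction, truth):
    return float(normalize_answer(prediction) == normalize_answer(truth))


def f1_score(prediction, truth):
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(truth).split()
    # Multiset intersection counts the tokens shared by both answers.
    common = collections.Counter(pred_tokens) & collections.Counter(truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower", "eiffel tower!"))  # 1.0
print(f1_score("in Paris France", "Paris"))
```

When an example has several reference answers, the score for that example is the maximum over the references, and the reported values are averages over all examples, scaled to percentages.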

## Zero-Shot

Zero-shot performance of the model fine-tuned on SQuAD: we evaluate it directly on the Spanish XQuAD test set, without any fine-tuning on Spanish data.

Use the map function again to apply the preprocessing function over the Spanish test set. You can speed up the map function by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don’t need.

In [33]:
xquad_es_test = xquad["es"]["test"].map(prepare_validation_features, batched=True, 
                                  remove_columns=xquad["es"]["test"].column_names)

  0%|          | 0/2 [00:00<?, ?ba/s]

We can grab the predictions for all features by using the `Trainer.predict` method:

In [34]:
raw_predictions = trainer.predict(xquad_es_test)

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1274
  Batch size = 16


The `Trainer` *hides* the columns that are not used by the model (here `example_id` and `offset_mapping` which we will need for our post-processing), so we set them back:

In [35]:
xquad_es_test.set_format(type=xquad_es_test.format["type"], 
                      columns=list(xquad_es_test.features.keys()))

And we can apply our post-processing function to our raw predictions:

In [38]:
final_predictions = postprocess_qa_predictions(xquad["es"]["test"], xquad_es_test, raw_predictions.predictions)

Post-processing 1190 example predictions split into 1274 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

Then we can load the metric from the datasets library.

In [39]:
metric = load_metric("squad")

Downloading builder script:   0%|          | 0.00/1.72k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/1.12k [00:00<?, ?B/s]

Then we can call compute on it. We just need to format predictions and labels a bit as it expects a list of dictionaries and not one big dictionary.

In [40]:
formatted_predictions = [{"id": k, "prediction_text": v} for k, v in final_predictions.items()]
references = [{"id": ex["id"], "answers": ex["answers"]} for ex in xquad["es"]["test"]]
metric.compute(predictions=formatted_predictions, references=references)

{'exact_match': 58.0672268907563, 'f1': 76.40620285202574}

## Translate Test

We use the model fine-tuned on SQuAD and evaluate it on test data that we translated from the target language to English using our in-house MT system. In this case we could also use a monolingual model, because we only evaluate on English data.

Use the map function again to apply the preprocessing function over the translated test set. You can speed up the map function by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don’t need.

In [None]:
langs = ["ar", "de", "zh", "vi", "es", "hi", "el", "th", "tr", "ru"]
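
The cells in this section only process Spanish, and the same five steps (preprocess, predict, restore columns, post-process, score) recur in every experiment. Below is a sketch of how they could be wrapped in one function. `evaluate_split` is a hypothetical helper that is not part of the original notebook, and looping it over `langs` assumes that every language offers the same splits as `xquad["es"]`.

```python
def evaluate_split(raw_split, trainer, metric, prepare_fn, postprocess_fn):
    """Run the notebook's evaluation steps on one dataset split and return the scores."""
    # 1. Tokenize into features, dropping the raw text columns.
    features = raw_split.map(prepare_fn, batched=True, remove_columns=raw_split.column_names)
    # 2. Get start/end logits for every feature.
    raw_predictions = trainer.predict(features)
    # 3. Make the columns hidden by the Trainer visible again.
    features.set_format(type=features.format["type"], columns=list(features.features.keys()))
    # 4. Turn logits into one answer string per example.
    final_predictions = postprocess_fn(raw_split, features, raw_predictions.predictions)
    # 5. Format predictions and references the way the metric expects.
    formatted = [{"id": k, "prediction_text": v} for k, v in final_predictions.items()]
    references = [{"id": ex["id"], "answers": ex["answers"]} for ex in raw_split]
    return metric.compute(predictions=formatted, references=references)
```

Under that assumption, a dictionary comprehension over `langs` that calls `evaluate_split(xquad[lang]["translate_test"], trainer, metric, prepare_validation_features, postprocess_qa_predictions)` would collect one score dictionary per language.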

In [48]:
xquad_es_translate_test = xquad["es"]["translate_test"].map(prepare_validation_features, batched=True, 
                                  remove_columns=xquad["es"]["translate_test"].column_names)

  0%|          | 0/2 [00:00<?, ?ba/s]

We can grab the predictions for all features by using the `Trainer.predict` method:

In [49]:
raw_predictions = trainer.predict(xquad_es_translate_test)

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1235
  Batch size = 16


The `Trainer` *hides* the columns that are not used by the model (here `example_id` and `offset_mapping` which we will need for our post-processing), so we set them back:

In [50]:
xquad_es_translate_test.set_format(type=xquad_es_translate_test.format["type"],
                                   columns=list(xquad_es_translate_test.features.keys()))

And we can apply our post-processing function to our raw predictions:

In [51]:
final_predictions = postprocess_qa_predictions(xquad["es"]["translate_test"], xquad_es_translate_test, raw_predictions.predictions)

Post-processing 1188 example predictions split into 1235 features.


  0%|          | 0/1188 [00:00<?, ?it/s]

Then we can load the metric from the datasets library.

In [52]:
metric = load_metric("squad")

Then we can call compute on it. We just need to format predictions and labels a bit as it expects a list of dictionaries and not one big dictionary.

In [53]:
formatted_predictions = [{"id": k, "prediction_text": v} for k, v in final_predictions.items()]
references = [{"id": ex["id"], "answers": ex["answers"]} for ex in xquad["es"]["translate_test"]]
metric.compute(predictions=formatted_predictions, references=references)

{'exact_match': 65.06734006734007, 'f1': 78.73225869616985}
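
To put the Spanish results side by side, the snippet below recomputes the gap between the zero-shot and translate-test scores reported above. The numbers are copied from the cell outputs; keep in mind that the two runs were scored on 1190 and 1188 examples respectively, so the comparison is approximate.

```python
# Scores copied from the outputs above.
results = {
    "SQuAD validation (en)": {"exact_match": 81.90160832544939, "f1": 89.121876471452},
    "XQuAD es, zero-shot": {"exact_match": 58.0672268907563, "f1": 76.40620285202574},
    "XQuAD es, translate-test": {"exact_match": 65.06734006734007, "f1": 78.73225869616985},
}

zero_shot = results["XQuAD es, zero-shot"]
translate_test = results["XQuAD es, translate-test"]

# Gain of translate-test over zero-shot, in percentage points.
em_gain = translate_test["exact_match"] - zero_shot["exact_match"]
f1_gain = translate_test["f1"] - zero_shot["f1"]

for name, scores in results.items():
    print(f"{name:26s} EM {scores['exact_match']:.2f}  F1 {scores['f1']:.2f}")
print(f"translate-test vs zero-shot: EM {em_gain:+.2f}, F1 {f1_gain:+.2f}")
```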

## Translate Train

For many language pairs, an MT model may be available, which can be used to obtain data in the target language. To evaluate the impact of using such data, we translate the English training data into the target language using our MT system. We then fine-tune mBERT on the translated data. For QA tasks, the answer spans must be aligned between the source and target language. We use data that was already translated to save time.
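
The translated data used here already carries aligned answers. For illustration only, one simple way to recover a span after translation is to search for the translated answer string in the translated context. `locate_answer` is a hypothetical helper, and exact string matching is an assumption that fails whenever the MT system renders the answer differently in the two places.

```python
def locate_answer(context, answer_text):
    """Return a SQuAD-style `answers` dict, or None if the answer is not found verbatim."""
    start = context.find(answer_text)
    if start == -1:
        return None  # alignment failed; such examples would have to be dropped or fixed
    return {"text": [answer_text], "answer_start": [start]}


context_es = "París es la capital de Francia."
print(locate_answer(context_es, "Francia"))   # {'text': ['Francia'], 'answer_start': [23]}
print(locate_answer(context_es, "Alemania"))  # None
```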

In [None]:
langs = ["ar", "de", "zh", "vi", "es", "hi", "el", "th", "tr", "ru"]

Now that our data is ready for training, we can download the pretrained model and fine-tune it. Since our task is question answering, we use the `AutoModelForQuestionAnswering` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us:

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer

model_name = "bert-base-multilingual-cased"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

batch_size = 48

training_args = TrainingArguments(
    output_dir="bert-base-multilingual-xquad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

Use 🤗 Datasets map function to apply the preprocessing function over the entire dataset. You can speed up the map function by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don’t need.

In [None]:
xquad_es_train = xquad_es.map(prepare_train_features, batched=True, 
                              remove_columns=xquad["es"]["test"].column_names)

Define the `Trainer` and train. It uses the translated SQuAD train and dev data.

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=xquad_es_train["translate_train"],
    eval_dataset=xquad_es_train["translate_dev"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
trainer.train()

Use the map function again to apply the preprocessing function over the Spanish test set. You can speed up the map function by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don’t need.

In [None]:
xquad_es_test = xquad["es"]["test"].map(prepare_validation_features, batched=True, 
                                  remove_columns=xquad["es"]["test"].column_names)

  0%|          | 0/2 [00:00<?, ?ba/s]

We can grab the predictions for all features by using the `Trainer.predict` method:

In [None]:
raw_predictions = trainer.predict(xquad_es_test)

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1274
  Batch size = 16


The `Trainer` *hides* the columns that are not used by the model (here `example_id` and `offset_mapping` which we will need for our post-processing), so we set them back:

In [None]:
xquad_es_test.set_format(type=xquad_es_test.format["type"], 
                      columns=list(xquad_es_test.features.keys()))

And we can apply our post-processing function to our raw predictions:

In [None]:
final_predictions = postprocess_qa_predictions(xquad["es"]["test"], xquad_es_test, raw_predictions.predictions)

Post-processing 1190 example predictions split into 1274 features.


  0%|          | 0/1190 [00:00<?, ?it/s]

Then we can load the metric from the datasets library.

In [None]:
metric = load_metric("squad")

Downloading builder script:   0%|          | 0.00/1.72k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/1.12k [00:00<?, ?B/s]

Then we can call compute on it. We just need to format predictions and labels a bit as it expects a list of dictionaries and not one big dictionary.

In [None]:
formatted_predictions = [{"id": k, "prediction_text": v} for k, v in final_predictions.items()]
references = [{"id": ex["id"], "answers": ex["answers"]} for ex in xquad["es"]["test"]]
metric.compute(predictions=formatted_predictions, references=references)

{'exact_match': 58.0672268907563, 'f1': 76.40620285202574}

## Translate Train All

In [None]:
langs = ["ar", "de", "zh", "vi", "es", "hi", "el", "th", "tr", "ru"]

## Fine-tuned XQuAD

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer

model_name = "alon-albalak/bert-base-multilingual-xquad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
batch_size = 48

training_args = TrainingArguments(
    output_dir="bert-base-multilingual-xquad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

Use the map function again to apply the preprocessing function over the Spanish test set. You can speed up the map function by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don’t need.

In [55]:
xquad_es_test = xquad["es"]["test"].map(prepare_validation_features, batched=True, 
                                  remove_columns=xquad["es"]["test"].column_names)

Loading cached processed dataset at /root/.cache/huggingface/datasets/xquad/es/1.0.0/c826765c504683edb842a571920db7c3721b021e292ccf87f607737218cbeb40/cache-b7db7172212b5cee.arrow


We can grab the predictions for all features by using the `Trainer.predict` method:

In [None]:
raw_predictions = trainer.predict(xquad_es_test)

The `Trainer` *hides* the columns that are not used by the model (here `example_id` and `offset_mapping` which we will need for our post-processing), so we set them back:

In [None]:
xquad_es_test.set_format(type=xquad_es_test.format["type"], 
                      columns=list(xquad_es_test.features.keys()))

And we can apply our post-processing function to our raw predictions:

In [None]:
final_predictions = postprocess_qa_predictions(xquad["es"]["test"], xquad_es_test, raw_predictions.predictions)

Then we can load the metric from the datasets library.

In [None]:
metric = load_metric("squad")

Then we can call compute on it. We just need to format predictions and labels a bit as it expects a list of dictionaries and not one big dictionary.

In [None]:
formatted_predictions = [{"id": k, "prediction_text": v} for k, v in final_predictions.items()]
references = [{"id": ex["id"], "answers": ex["answers"]} for ex in xquad["es"]["test"]]
metric.compute(predictions=formatted_predictions, references=references)