<a href="https://colab.research.google.com/github/jyjoon001/KoBART-KorQuAD/blob/main/Question_Answering_on_SQUAD_AIC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and 🤗 Datasets. Run the following cell to do so.

In [None]:
! pip install datasets transformers
! pip install sentencepiece

Collecting datasets
  Downloading https://files.pythonhosted.org/packages/46/1a/b9f9b3bfef624686ae81c070f0a6bb635047b17cdb3698c7ad01281e6f9a/datasets-1.6.2-py3-none-any.whl (221kB)
Collecting transformers
  Downloading https://files.pythonhosted.org/packages/d8/b2/57495b5309f09fa501866e225c84532d1fd89536ea62406b2181933fb418/transformers-4.5.1-py3-none-any.whl (2.1MB)
Collecting xxhash
  Downloading https://files.pythonhosted.org/packages/7d/4f/0a862cad26aa2ed7a7cd87178cbbfa824fc1383e472d63596a0d018374e7/xxhash-2.0.2-cp37-cp37m-manylinux2010_x86_64.whl (243kB)
Collecting fsspec
  Downloading https://files.pythonhosted.org/packages/e9/91/2ef649137816850fa4f4c97c6f2eabb1a79bf0aa2c8ed198e387e373455e/fsspec-2021.4.0-py3-none-any.whl (108kB)

If you're opening this notebook locally, make sure your environment has the latest version of those libraries installed.

You can find a script version of this notebook to fine-tune your model in a distributed fashion using multiple GPUs or TPUs [here](https://github.com/huggingface/transformers/tree/master/examples/question-answering).

# Fine-tuning a model on a question-answering task

In this notebook, we will see how to fine-tune one of the [🤗 Transformers](https://github.com/huggingface/transformers) models on a question answering task, which is the task of extracting the answer to a question from a given context. We will see how to easily load a dataset for these kinds of tasks and use the `Trainer` API to fine-tune a model on it.

![Widget inference representing the QA task](https://github.com/huggingface/notebooks/blob/master/examples/images/question_answering.png?raw=1)

**Note:** This notebook fine-tunes models that answer questions by taking a substring of a context, not by generating new text.

This notebook is built to run on any question answering task with the same format as SQUAD (version 1 or 2), with any model checkpoint from the [Model Hub](https://huggingface.co/models) as long as that model has a version with a question answering head and a fast tokenizer (check on [this table](https://huggingface.co/transformers/index.html#bigtable) if this is the case). It might need some small adjustments if you decide to use a different dataset than the one used here. Depending on your model and the GPU you are using, you might need to adjust the batch size to avoid out-of-memory errors. Set those three parameters, then the rest of the notebook should run smoothly:

In [None]:
# This flag is the difference between SQUAD v1 or 2 (if you're using another dataset, it indicates if impossible
# answers are allowed or not).
squad_v2 = False
model_checkpoint = "hyunwoongko/kobart"
batch_size = 16
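
To make the `squad_v2` flag concrete: in the SQuAD v2 layout exposed by 🤗 Datasets, a question with no possible answer is a record whose `answers` lists are empty. The sketch below uses two hypothetical records (written for illustration, not taken from the dataset) and a hypothetical helper, `is_answerable`, to show the distinction.

```python
# Hypothetical records in the SQuAD-style layout (illustrative, not real data).
answerable = {
    "question": "What is your name?",
    "context": "My name is Sylvain.",
    "answers": {"text": ["Sylvain"], "answer_start": [11]},
}
unanswerable = {
    "question": "Where do you live?",
    "context": "My name is Sylvain.",
    "answers": {"text": [], "answer_start": []},
}


def is_answerable(record):
    """Return True when the record carries at least one gold answer."""
    return len(record["answers"]["text"]) > 0


print(is_answerable(answerable))    # True
print(is_answerable(unanswerable))  # False
```

With `squad_v2 = False`, every record is expected to look like the first one.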

## Loading the dataset

We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric we need to use for evaluation (to compare our model to the benchmark). This can be easily done with the functions `load_dataset` and `load_metric`.  

In [None]:
from datasets import load_dataset, load_metric

For our example here, we'll use the [SQUAD dataset](https://rajpurkar.github.io/SQuAD-explorer/). The notebook should work with any question answering dataset provided by the 🤗 Datasets library. If you're using your own dataset defined from a JSON or csv file (see the [Datasets documentation](https://huggingface.co/docs/datasets/loading_datasets.html#from-local-files) on how to load them), it might need some adjustments in the names of the columns used.

In [None]:
datasets = load_dataset("squad_v2" if squad_v2 else "squad")
datasetsKR = load_dataset("squad_kor_v2" if squad_v2 else "squad_kor_v1")



Downloading and preparing dataset squad/plain_text (download: 33.51 MiB, generated: 85.75 MiB, post-processed: Unknown size, total: 119.27 MiB) to /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/4fffa6cf76083860f85fa83486ec3028e7e32c342c218ff2a620fc6b2868483a...



Dataset squad downloaded and prepared to /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/4fffa6cf76083860f85fa83486ec3028e7e32c342c218ff2a620fc6b2868483a. Subsequent calls will reuse this data.




Downloading and preparing dataset squad_kor_v1/squad_kor_v1 (download: 40.44 MiB, generated: 87.40 MiB, post-processed: Unknown size, total: 127.84 MiB) to /root/.cache/huggingface/datasets/squad_kor_v1/squad_kor_v1/1.0.0/92f88eedc7d67b3f38389e8682eabe68caa450442cc4f7370a27873dbc045fe4...



Dataset squad_kor_v1 downloaded and prepared to /root/.cache/huggingface/datasets/squad_kor_v1/squad_kor_v1/1.0.0/92f88eedc7d67b3f38389e8682eabe68caa450442cc4f7370a27873dbc045fe4. Subsequent calls will reuse this data.


The `datasets` object itself is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key per split (here, the training and validation sets).

We can see the training and validation sets both have a column for the context, the question and the answers to those questions.

To access an actual element, you need to select a split first, then give an index:

In [None]:
datasets["train"][0]

{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']},
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'id': '5733be284776f41900661182',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'title': 'University_of_Notre_Dame'}

In [None]:
datasetsKR["train"][0]

{'answers': {'answer_start': [54], 'text': ['교향곡']},
 'context': '1839년 바그너는 괴테의 파우스트을 처음 읽고 그 내용에 마음이 끌려 이를 소재로 해서 하나의 교향곡을 쓰려는 뜻을 갖는다. 이 시기 바그너는 1838년에 빛 독촉으로 산전수전을 다 걲은 상황이라 좌절과 실망에 가득했으며 메피스토펠레스를 만나는 파우스트의 심경에 공감했다고 한다. 또한 파리에서 아브네크의 지휘로 파리 음악원 관현악단이 연주하는 베토벤의 교향곡 9번을 듣고 깊은 감명을 받았는데, 이것이 이듬해 1월에 파우스트의 서곡으로 쓰여진 이 작품에 조금이라도 영향을 끼쳤으리라는 것은 의심할 여지가 없다. 여기의 라단조 조성의 경우에도 그의 전기에 적혀 있는 것처럼 단순한 정신적 피로나 실의가 반영된 것이 아니라 베토벤의 합창교향곡 조성의 영향을 받은 것을 볼 수 있다. 그렇게 교향곡 작곡을 1839년부터 40년에 걸쳐 파리에서 착수했으나 1악장을 쓴 뒤에 중단했다. 또한 작품의 완성과 동시에 그는 이 서곡(1악장)을 파리 음악원의 연주회에서 연주할 파트보까지 준비하였으나, 실제로는 이루어지지는 않았다. 결국 초연은 4년 반이 지난 후에 드레스덴에서 연주되었고 재연도 이루어졌지만, 이후에 그대로 방치되고 말았다. 그 사이에 그는 리엔치와 방황하는 네덜란드인을 완성하고 탄호이저에도 착수하는 등 분주한 시간을 보냈는데, 그런 바쁜 생활이 이 곡을 잊게 한 것이 아닌가 하는 의견도 있다.',
 'id': '6566495-0-0',
 'question': '바그너는 괴테의 파우스트를 읽고 무엇을 쓰고자 했는가?',
 'title': '파우스트_서곡'}

We can see the answers are indicated by their start position in the text (here at character 515) and their full text, which is a substring of the context as we mentioned above.
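
The relationship between `answer_start` and `text` can be checked directly: slicing the context at the stored character offset should give back the answer text. The sketch below uses a hypothetical record and a hypothetical helper, `answer_char_span`; the offset is computed with `str.index` rather than typed by hand.

```python
def answer_char_span(record, which=0):
    """Return the (start, end) character offsets of one gold answer."""
    start = record["answers"]["answer_start"][which]
    end = start + len(record["answers"]["text"][which])
    return start, end


# Hypothetical record in the same layout as the dataset rows shown above.
context = "The Eiffel Tower is located in Paris."
record = {
    "context": context,
    "question": "Where is the Eiffel Tower located?",
    "answers": {"text": ["Paris"], "answer_start": [context.index("Paris")]},
}

start, end = answer_char_span(record)
# The slice of the context must match the stored answer text.
assert record["context"][start:end] == record["answers"]["text"][0]
```

The same check can be run over real rows to catch misaligned offsets before training.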

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset (automatically decoding the labels in passing).

In [None]:
from datasets import ClassLabel, Sequence
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
    display(HTML(df.to_html()))
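
As an aside, the retry loop that picks distinct indices could also be written with `random.sample`, which draws without replacement. This is only a sketch of an alternative, not the function used above; `pick_unique_indices` and its `seed` parameter are hypothetical names.

```python
import random


def pick_unique_indices(n_rows, num_examples=10, seed=None):
    """Draw `num_examples` distinct row indices from range(n_rows)."""
    if num_examples > n_rows:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    rng = random.Random(seed)  # a local generator keeps the draw reproducible
    return rng.sample(range(n_rows), num_examples)


picks = pick_unique_indices(100, num_examples=10, seed=0)
assert len(set(picks)) == 10  # no index is drawn twice
```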

In [None]:
show_random_elements(datasetsKR["train"])

Unnamed: 0,answers,context,id,question,title
0,"{'answer_start': [32], 'text': ['만델라 재단']}","남아프리카 공화국의 넬슨 만델라 전 대통령의 대외 창구인 만델라 재단은 18일 김대중 전 대통령 사망과 관련해 ""우리는 그가 인권을 위해 싸우고 북한과의 화해를 위해 얼마나 노력했는지를 기억한다""면서 ""유족과 한국 국민에 위로의 뜻을 전한다""라고 밝혔다. 만델라 재단은 성명에서 ""노벨평화상 수상자인 김 전 대통령이 서거한 데 대해 애도한다""면서 이 같이 말하고, ""만델라 전 대통령은 지난 2001년 3월 김 전 대통령과 만난 적이 있으며, 당시 김 전 대통령은 한반도의 비무장지대를 평화공원으로 전환하자는 만델라의 아이디어에 공감을 표시했다""라고 소개했다. 만델라 전 대통령은 지난 1997년 5월 대선을 앞둔 김 전 대통령에게 자신의 셋째 딸 진드지 여사 부부를 보내 자신이 27년 동안 옥중에서 차고 있던 손목시계를 선물하며 승리를 기원하는 등 개인적으로도 친분이 있다. 이에 당시 김 전 대통령도 유신 체제와 망명 시절을 거치며 20년 동안 간직해온 낡은 가방을 답례품으로 전달했다.",6488203-49-0,넬슨 만델라의 대외 창구는?,김대중
1,"{'answer_start': [425], 'text': ['박규리']}","1월 3일 일본 니혼TV '스타드래프트'에서 걸그룹 카라가 '일본인이 좋아하는 100명의 스타'로 선정되어 한국 아이돌 가수 중 유일하게 순위에 이름을 올렸으며 1월 26일에 발간된 한국문화산업교류재단의 한류총서 4번째 시리즈 ‘한류 포에버:일본편’에 따르면 한국 가수 중 가장 먼저 떠오른 가수와 가장 좋아하는 가수로 카라가 1위를 차지했다. 광고모델로서의 위상도 대단하여 2월 6일 대상의 홍초가 일본에서 카라를 광고 모델로 기용하여 전년 대비 35배 매출 성장, 500억의 매출을 기록했다고 발표하였다. 카라의 최초의 단독공연인 카라시아가 2월 18일, 2월 19일 이틀간 서울 송파구 방이동 올림픽공원 체조경기장에서 개최되었는데 18일 공연에서 니콜이 발목 인대를 다치는 부상을 당하였으나 19일 공연에 정상적으로 참가하였다. 공연 이후 2월 21일에 박규리가 성대결절수술을 받았다.",6597270-7-1,올림픽공원 체조경기장에서 공연 이후 성대결절수술을 받은 카라의 멤버는?,카라_(음악_그룹)
2,"{'answer_start': [165], 'text': ['카를 대공']}","그러는 동안 나폴레옹은 오스트리아군의 본대에 총공격을 시도했다. 프랑스군 본대의 총 병력, 그리고 좌익에 있는 란의 병력과 예비 기병대가 앞으로 진군하였다. 로젠베르크의 우익과 호엔촐레른의 좌익 사이에 위치한 오스트리아군의 전열이 쉽게 붕괴되었다. 승리는 프랑스군에게 돌아가는 듯하였다. 그러나 카를 대공이 남은 예비부대를 이끌고 전장에 나타났고, 대공은 직접 손에 군기를 들고 병사들을 인도하였다. 란의 부대에 오스트리아군의 공격이 집중되었으며, 란은 이를 저지하려 하였으나 실패하였고, 전 전선에 걸쳐 프랑스군은 물러나기 시작했다. 아스페른은 오스트리아군의 손에 떨어졌다. 이런 상황에서 심각한 소식이 나폴레옹에게 날아들었다. 이미 예전에 부서진 적이 있던 도나우 강의 다리가 오스트리아군에 의해 강 하류로 흘러 내려간 거대한 바지선에 의해 부서졌다는 보고였다.",6585190-4-1,남은 예비부대를 이끌고 전장에 나타난 사람은?,아스페른-에슬링_전투
3,"{'answer_start': [520], 'text': ['중화 혁명당']}","국민당은 1913년 1월과 2월에 걸쳐 시행된 국회선거에서 민중들의 지지를 받아 과반수의 의석을 차지하였다. 그러나 국민당의 내각을 꺼린 위안스카이는 암살범을 보내 3월 20일 국무총리로 예정된 쑹자오런을 저격, 3월 22일 쑹자오런은 숨을 거두었다. 위안스카이는 이어 국회의 반대를 무릅쓰고 1913년 4월 25일 5개국(영국, 독일, 프랑스, 일본, 러시아) 은행단으로부터 2500만 파운드에 달하는 거액의 차관을 도입하는 협정을 체결하여 독재를 위한 재정기반을 확보했다. 6월에 위안스카이가 국민당 계열의 도독 3인을 파면하자 쑨원, 천치메이등을 비롯한 국민당 세력이 봉기(제2차 혁명)를 호소했으며, 1913년 6월 9일 한 봉기는 6월 9일에서 7월 12일까지 양자 강 중류와 하류 지역 일대에서 전투가 일어났으나나 당시 전란 재발을 달가워하지 않는 분위기에서 크게 호응받지 못했고 1913년 9월 위안스카이의 압도적인 무력에 완벽히 결국 진압됐고, 쑨원은 7월 일본 도쿄로 망명했다. 거기서 쑨원은 1914년 7월에 ‘중화 혁명당’을 결성해 국외에서 위안스카이 군벌 정부에 저항했다. 10월에 국회는 ‘공민단’이라 자칭하는 폭도에게 협박받는 가운데 위안스카이를 정식 대총통으로 선출했다. 곧 이어 위안스카이는 국민당 해산을 명하고 대총통 권한을 대폭 강화시킨 ‘신약법’을 제정했다. 결국 ‘위안스카이를 배려한 혁명’이 되어버렸다. 이후 일본이 위안스카이에게 “황제 등극을 인정해 주는 대가로 ‘21개조 요구’를 들어달라”고 요구하자 ‘황제 등극’을 원했던 위안스카이는 이를 수용 하여 1915년 12월 ‘황제 제도’의 부활을 시도했다가 군중에게 지탄받아 실패하고(제3차 혁명), 1916년 6월 위안스카이는 사망했다.",6533378-21-2,쑨원이 일본 망명 후 위안스카이 군벌 정부에 저항하고자 결성한 단체는?,신해혁명
4,"{'answer_start': [130], 'text': ['Ocean Sunfish']}","몸이 전반적으로 둥그렇고, 피부는 거칠고 회색이기 때문에 라틴어로는 “Mola”라고 한다. 또한 때로 맑은 날 수면에 누워서 일광욕을 하는 것처럼 보이는 모습이 예로부터 눈에 띄어 마치 해바라기를 하는 것처럼 보였기 때문에 영어로는 Ocean Sunfish라고도 한다(또는 머리만 있고 몸통 이하는 없는 것처럼 보인다고 해서 Head Fish라고 불린다.). 세계 여러 나라에서 불리는 명칭은 다양한데, 이는 모두 개복치의 가지각색 특색에서 유래한 것이 많다. 포르투갈어 및 프랑스어, 스페인어, 독일어로는 “달과 비슷한 물고기”라는 뜻인 peixe lua, poisson lune, pez luna와 Mondfisch로 각각 불린다.",6478833-0-1,개복치의 영어표현은?,개복치
5,"{'answer_start': [121], 'text': ['광주 폭동']}","일베저장소는 ""5.18 광주 민주화 운동""을 폭동으로 규정하고, 재평가하자는 주장의 진원지로 꼽힌다. 또한, 일베의 대다수 이용자는 5.18 광주 민주화 운동에 부정적인 반응을 보인다. 5.18 광주 민주화 운동을 '광주 폭동'이라고 부르거나 '광주 사태'로 한단계 낮춰 부르며 그 당시 희생자들의 사진을 보고 '홍어'라 비하하였다. 반대로 전두환 前 대통령이 '광주를 땅크로 진압했다'는 뜻으로 전땅크라는 용어로 부르기도 했다. 일부 회원들은 그 당시 5.18 광주 민주화 운동을 취재했던 조갑제의 의견대로 5.18을 옹호하지만 이는 소수이다. 일부 일베 사용자들은 시민이 먼저 계엄군을 공격했다고 주장하며 군대의 폭력은 정당방위라고 주장했다. 또한 지만원의 책을 인용하며 5.18이 북한의 지령에서 시작되었다고 설파한다. 5.18기념재단의 송선태는 CBS 라디오에서 일베가 주로 '북한군 특수부대가 와서 광주시민을 살상했다'고 주장하거나 희생자들의 사진을 두고 '홍어 말리는 중'이라는 표현을 쓴다고 진술했다. 또한 지역감정을 교묘히 오버랩핑해서 죽은 사람의 명예를 훼손하는 수준이라고 말했다. 3월 22일 5·18기념재단은 광주시청 등 총 4곳 기관과 모여 5·18을 비하하는 일베를 대상으로 법적 소송을 준비하고 있다고 주간조선에 밝혔다. 고려대학교에서는 학생회 측이 마련한 '5·18 광주민주화운동 사진전' 전시물 위에 일베 회원으로 추정되는 사람들이 '광주민주화운동은 북한에 의한 폭동이었다""는 내용을 담은 사진 10여장을 붙여 훼손하는 일이 벌어졌다.",6548380-3-1,일베의 대다수는 5.18 광주 민주화 운동을 낮추어 무엇이라 불렀나요?,일베저장소
6,"{'answer_start': [108], 'text': ['예수회']}","동아시아에 대한 保敎權은 포르투갈이 교황으로부터 부여받았다. 반면 프랑스는 포르투갈의 보교권에 도전하며 예수회를 중심으로 단독으로 중국에 선교사를 파견하였다. 1685년 프랑스의 루이 14세는 예수회 선교사 5명을 중국으로 파견하였다. 프랑스 선교사들은 점차 강희제의 신임을 얻었으며, 프랑스 예수회는 중국에서 하나의 독립된 역량을 지니기 시작했다. 프랑스 예수회는 중국관습과 조화하려는 적응주의 선교정책을 취했다. 뒤늦게 중국에 도착한 프란치스코회,도미니코회와 파리외방전교회 등 여러 단체의 선교사들은 예수회를 곱지 못한 시선으로 바라보고 있었다. 이들의 눈에는 예수회가 조상과 공자에 대한 제사를 인정하고 있는 것이나 중국인들이 예부터 믿고 있는 ‘上帝’나 ‘天’을 전능하신 하느님으로 해석하고 있는 것은 너무나 이단적이었기 때문이었다. 그들은 교황에게 예수회 선교사들을 고소했고, 이에 따른 儀禮논쟁이 1634년부터 1742년에 이르기까지 백년에 걸쳐 진행됐다. 이에 교황 클레멘스 11세는 中國儀禮 문제를 심사하게 했다. 교황은 중국 가톨릭교회에 중국의례를 금하였고 중국에 사절단을 파견하였다. 교황이 파견한 투르농(CarloTommascoMaillarddeTournon;多羅)사절단은 中國儀禮를 금지하는 교황의 조서를 가지고 1705년 12월에 康熙帝를 알현했다. 교황 특사를 만난 강희제는 중국의례 문제가 외국에 의해 흔들리는 데 분노했고 이후 천주교 금지정책을 시행했다. 강희제의 禁敎 정책은 이후의 雍正 乾隆 嘉慶 道光에 의해 계승되고 준수됐으며 淸朝의 기본 國策이 됐고 淸朝 황제들은 천주교를 제국의 안전을 위협하는 존재로 인식하였다. 1821년(道光 元年)에는 大淸律例에 禁敎 條項이 추가돼 백성들이 천주교를 믿거나 선교하면 판결 후 즉각 교수형을 집행하거나 유배하여 노예로 삼도록 했다. 또한 內地에서 서양인의 부동산 구입을 금지했고,禁敎를 엄격히 실행하지 않는 지방관은 처벌하도록 규정했다. 1796년에서 1804년까지의 백련교도의 난은 淸朝에게 천주교 등 邪敎에 대한 위기심을 고조시켰다. 천주교는 옹정제의 禁敎 이후 중앙정부의 감시를 피해 대부분 四川 貴州와 같은 중국의 주변지역을 중심으로 선교해 나갔기 때문에 천주교가 白蓮敎와 같은 秘密宗敎로 취급됐고,天主敎徒는 匪徒와 같이 인식됐다. 淸朝의 기독교 금지 조치 이후 기독교 문제가 다시 대두된 시기는 아편전쟁 이후이다. 아편전쟁의 결과 영국은 淸과 난징조약(南京條約)을 맺었고, 프랑스는 1843년에 駐그리스 대사인 라그르네(Lagrene)와 사절단을 파견하여 영국이 南京條約에 의해 획득한 이익과 보장을 요구하였고 이듬해 10월에 황푸조약을 체결했다. 프랑스의 선교사들은 이 조약의 종교보호조항을 통해 중국에서의 保敎權을 법률적으로 보장받았다.",6508691-3-1,1685년 프랑스의 루이 14세가 중국으로 파견한 5명의 선교사는 어디 소속입니까?,보교권
7,"{'answer_start': [127], 'text': ['키스 보크']}","에피소드의 스토리보드는 아티스트 앤서니 윌리엄스가 맡았다. 한편 데이비스는 에드거 라이트에게 에피소드를 감독할 기회를 주었으나, 라이트는 아직 《새벽의 황당한 저주》를 작업하고 있었기 때문에 어쩔 수 없이 거절했다. 그 대신에 키스 보크가 에피소드 감독을 맡았다. 〈Rose〉는 네번째와 다섯번째 에피소드와 함께 첫번째 제작 블록의 일부로 2004년 7월 촬영을 시작했다. 처음 닷새는 런던에서 촬영하는 시간으로 보냈고, 나머지 기간은 카디프에서 촬영했다. 제작팀은 런던 아이의 불빛을 더욱 밝히는 허가를 받았다. 닥터와 로즈가 런던 시내를 달리는 장면에서 제작팀은 런던 버스가 이들 뒤로 지나가길 원했기 때문에 신중한 타이밍이 시도되었는데 버스가 오길 기다려야만 했다. 카디프에서 다른 장면을 촬영할 때에는 런던처럼 보이게 런던 버스 한 대와 《런던 이브닝 스탠더드》 소속 밴 한 대가 투입되었다. 로즈가 사는 공립 주택단지의 바깥 모습은 런던의 주택단지에서 촬영하는 동시에 다른 장면들은 카디프의 주택단지에서도 촬영했다. 한편 미키네 아파트는 타일러네 아파트와 같은 세트장을 사용했는데, 단지 새로 장식한 것이었다. 제작팀은 카디프 내 장면들을 비밀로 촬영할 것으로 보였지만, 촬영 개시 전날 카디프 시의회가 제작팀이 촬영하게 될 거리들을 지명해 놓은 보도 자료를 발행했다. 에피소드 내 절정에서 등장하는 오톤들의 공격 장면은 런던 아이처럼 장면 배경을 런던의 주요 랜드마크 주변으로 정해놓은 장면들을 제외하고 2004년 7월 20일부터 22일까지 카디프의 워킹 가에서 촬영되었다.",6483060-8-1,Rose의 에피소드 감독을 맡은 사람은 누구인가?,Rose_(닥터_후)
8,"{'answer_start': [412], 'text': ['복지부']}","12월 10일에는 대한의사협회를 중심으로 전국에서 의사 약 1만 명이 서울 덕수궁 앞에서 문재인 케어 반대 및 한의사 의료기기 사용 반대 전국 의사 총궐기대회를 열었다. 이를 주관한 국민겅강수호 비상대책협의회는 ""문재인 케어는 '포퓰리즘' 정책""이라며 ""정부는 의사들이 받는 낮은 수가 문제부터 개선해야 한다""고 주장했다. 또한 ""문재인 케어는 구체적인 재정 확보 방안이 없어 건보 재정을 위태롭게 할 것""이라며 ""이로 인해 의료 수가가 깎이면 의사 집단의 생존권이 위협받을 수 있다""고 주장했다. 하지만 이를 두고 '밥그릇 지키기'라는 비판도 나왔다. 그동안 정보를 독점하여 진료비를 의사 마음대로 책정할 수 있던 비급여 의료 진료 행위가 건강보험 적용으로 바뀌면서 공적 관리체계에 들어오는 상황을 반대하는 목소리라는 지적이다. 이에 복지부는 재정균형 차원에서 급여수가 인상에 대해 협의할 용의가 있음을 밝혔지만 이를 거부한 것도 비급여의 급여화에 대한 반대의 연장선상이 아닌가하는 의심을 샀다.",6580768-125-1,재정균형 차원에서 급여수가 인상을 협의할 용의가 있다고 밝힌 부처는 어디인가?,문재인_정부
9,"{'answer_start': [360], 'text': ['한신 고시엔 구장']}","2010년에는 조지마 겐지의 입단으로 인해 8경기 출전에 그쳤고, 6월에는 오른쪽 팔꿈치의 부상으로 1군 등록이 말소되었다. 이 부상이 빠른 회복의 기미가 보이지 않아 9월 2일에 현역 은퇴를 표명하겠다는 구단에게 입장을 전달하여 구단 측은 이를 받아들였다. 9월 25일 주니치와의 2군 최종전이 은퇴 경기가 되면서 동기인 시모야나기와 배터리를 짜며 1회를 무실점으로 막아 경기를 마무리 지었다. 경기 직후 은퇴식을 가지면서 한신과 옛 친정팀이었던 주니치와 함께 양팀 선수들로부터 헹가래를 받았다. 이때 야노는 “행복한 야구 인생을 보낼 수 있었다”라고 말하며 20년의 현역 생활을 되돌아 보기도 했다. 9월 30일에는 요코하마전 종료 직후의 한신 고시엔 구장에서도 은퇴식을 가졌다.",6522583-4-2,야노 아키히로의 은퇴식이 거행된 장소는?,야노_아키히로


In [None]:
show_random_elements(datasets["train"])

Unnamed: 0,answers,context,id,question,title
0,"{'answer_start': [996], 'text': ['43']}","Richmond has a humid subtropical climate (Köppen Cfa), with hot and humid summers and generally cool winters. The mountains to the west act as a partial barrier to outbreaks of cold, continental air in winter; Arctic air is delayed long enough to be modified, then further warmed as it subsides in its approach to Richmond. The open waters of the Chesapeake Bay and Atlantic Ocean contribute to the humid summers and mild winters. The coldest weather normally occurs from late December to early February, and the January daily mean temperature is 37.9 °F (3.3 °C), with an average of 6.0 days with highs at or below the freezing mark. Downtown areas straddle the border between USDA Hardiness zones 7B and 8A, and temperatures seldom lower to 0 °F (−18 °C), with the most recent subzero (°F) reading occurring on January 28, 2000, when the temperature reached −1 °F (−18 °C). The July daily mean temperature is 79.3 °F (26.3 °C), and high temperatures reach or exceed 90 °F (32 °C) approximately 43 days out of the year; while 100 °F (38 °C) temperatures are not uncommon, they do not occur every year. Extremes in temperature have ranged from −12 °F (−24 °C) on January 19, 1940 up to 107 °F (42 °C) on August 6, 1918.[a]",57343cffd058e614000b6b62,About how many days a year does the temperature in Richmond go above 32 degrees Celsius?,"Richmond,_Virginia"
1,"{'answer_start': [0], 'text': ['Chinese characters']}","Chinese characters are logograms used in the writing of Chinese and some other Asian languages. In Standard Chinese they are called Hanzi (simplified Chinese: 汉字; traditional Chinese: 漢字). They have been adapted to write a number of other languages including: Japanese, where they are known as kanji, Korean, where they are known as hanja, and Vietnamese in a system known as chữ Nôm. Collectively, they are known as CJKV characters. In English, they are sometimes called Han characters. Chinese characters constitute the oldest continuously used system of writing in the world. By virtue of their widespread current use in East Asia, and historic use throughout the Sinosphere, Chinese characters are among the most widely adopted writing systems in the world.",5726b6e6f1498d1400e8e896,What are logograms used in the writing of Chinese?,Chinese_characters
2,"{'answer_start': [438], 'text': ['35%']}","The Boston Public Schools enrolls 57,000 students attending 145 schools, including the renowned Boston Latin Academy, John D. O'Bryant School of Math & Science, and Boston Latin School. The Boston Latin School, established 1635, is the oldest public high school in the US; Boston also operates the United States' second oldest public high school, and its oldest public elementary school. The system's students are 40% Hispanic or Latino, 35% Black or African American, 13% White, and 9% Asian. There are private, parochial, and charter schools as well, and approximately 3,300 minority students attend participating suburban schools through the Metropolitan Educational Opportunity Council.",56e14eddcd28a01900c67791,What percentage of Bostons public students are African American?,Boston
3,"{'answer_start': [182], 'text': ['Royal Radar Establishment of the Ministry of Defence']}","The next great advance in computing power came with the advent of the integrated circuit. The idea of the integrated circuit was first conceived by a radar scientist working for the Royal Radar Establishment of the Ministry of Defence, Geoffrey W.A. Dummer. Dummer presented the first public description of an integrated circuit at the Symposium on Progress in Quality Electronic Components in Washington, D.C. on 7 May 1952.",56fdea41761e401900d28c4c,Where did Geoffrey W.A. Dummer work at?,Computer
4,"{'answer_start': [401], 'text': ['""SaveTheArctic.org""']}","In response, Shell filed lawsuits to seek injunctions from possible protests, and Benjamin Jealous of the NAACP and Radford argued that the legal action was ""trampling American's rights."" According to Greenpeace, Shell lodged a request with Google to ban video footage of a Greenpeace protest action that occurred at the Shell-sponsored Formula One (F1) Belgian Grand Prix on 25 August 2013, in which ""SaveTheArctic.org"" banners appear at the winners' podium ceremony. In the video, the banners rise up automatically—activists controlled their appearance with the use of four radio car antennas—revealing the website URL, alongside an image that consists of half of a polar bear's head and half of the Shell logo.",57263f3489a1e219009ac5cb,What banners appeared on the winners' podium at the August 2013 ceremony?,Royal_Dutch_Shell
5,"{'answer_start': [191], 'text': ['Latin origin']}","The Germanic superstrate has had different outcomes in Spanish and Catalan. For example, Catalan fang ""mud"" and rostir ""to roast"", of Germanic origin, contrast with Spanish lodo and asar, of Latin origin; whereas Catalan filosa ""spinning wheel"" and pols ""temple"", of Latin origin, contrast with Spanish rueca and sien, of Germanic origin.",56e16caee3433e1400422f05,What is the origin of some Spanish words?,Catalan_language
6,"{'answer_start': [299], 'text': ['espionage']}","Congress acted defiantly toward the Supreme Court by passing the Drug Kingpin Act of 1988 and the Federal Death Penalty Act of 1994 that made roughly fifty crimes punishable by death, including crimes that do not always involve the death of someone. Such non-death capital offenses include treason, espionage (spying for another country), and high-level drug trafficking. Since no one has yet been sentenced to death for such non-death capital offenses, the Supreme Court has not ruled on their constitutionality.",57101271b654c5140001f7b9,What is another term for the act of spying for another country?,Capital_punishment_in_the_United_States
7,"{'answer_start': [359], 'text': ['fields of co-operation between India and Myanmar include remote sensing, oil and gas exploration, information technology, hydro power and construction']}","Despite Western isolation, Asian corporations have generally remained willing to continue investing in the country and to initiate new investments, particularly in natural resource extraction. The country has close relations with neighbouring India and China with several Indian and Chinese companies operating in the country. Under India's Look East policy, fields of co-operation between India and Myanmar include remote sensing, oil and gas exploration, information technology, hydro power and construction of ports and buildings.",5726f798f1498d1400e8f143,What is the benefit to the two countries involved in the India Look East policy ?,Myanmar
8,"{'answer_start': [96], 'text': ['19 Entertainment']}","American Idol is an American singing competition series created by Simon Fuller and produced by 19 Entertainment, and is distributed by FremantleMedia North America. It began airing on Fox on June 11, 2002, as an addition to the Idols format based on the British series Pop Idol and has since become one of the most successful shows in the history of American television. The concept of the series is to find new solo recording artists, with the winner being determined by the viewers in America. Winners chosen by viewers through telephone, Internet, and SMS text voting were Kelly Clarkson, Ruben Studdard, Fantasia Barrino, Carrie Underwood, Taylor Hicks, Jordin Sparks, David Cook, Kris Allen, Lee DeWyze, Scotty McCreery, Phillip Phillips, Candice Glover, Caleb Johnson, and Nick Fradiani.",56d21eb1e7d4791d00902668,What company produces American idol?,American_Idol
9,"{'answer_start': [154], 'text': ['the Greeks']}","There was a linguistic predisposition to use such terms. The Romans had used them in near Gaul / far Gaul, near Spain / far Spain and others. Before them the Greeks had the habit, which appears in Linear B, the oldest known script of Europe, referring to the near province and the far province of the kingdom of Pylos. Usually these terms were given with reference to a geographic feature, such as a mountain range or a river.",56f8e1749b226e1400dd1173,The appearance of what culture using the terms appears in Linear B?,Near_East


## Preprocessing the training data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer`, which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put them in the format the model expects, as well as generate the other inputs that the model requires.

To do all of this, we instantiate our tokenizers with the `from_pretrained` method (called here on `PreTrainedTokenizerFast` rather than on `AutoTokenizer`), which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

In [None]:
from transformers import PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast.from_pretrained("facebook/bart-base")
tokenizerKR = PreTrainedTokenizerFast.from_pretrained("hyunwoongko/kobart")





The following assertion ensures that our tokenizers are fast tokenizers (backed by Rust) from the 🤗 Tokenizers library. Fast tokenizers are available for almost all models, and we will need some of their special features for our preprocessing.

In [None]:
import transformers
assert all(isinstance(t, transformers.PreTrainedTokenizerFast) for t in (tokenizer, tokenizerKR))

You can check which types of models have a fast tokenizer available and which don't in the [big table of models](https://huggingface.co/transformers/index.html#bigtable).

You can directly call this tokenizer on two sentences (one for the question, one for the context):

In [None]:
tokenizer("What is your name?", "My name is Sylvain.")

{'input_ids': [0, 2264, 16, 110, 766, 116, 2, 2, 2387, 766, 16, 28856, 1851, 4, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [None]:
tokenizerKR("이름은?", "김민수.")

{'input_ids': [23667, 262, 20756, 11372, 245], 'token_type_ids': [0, 0, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1]}

Depending on the model you selected, you will see different keys in the dictionary returned by the cell above. They don't matter much for what we're doing here (just know they are required by the model we will instantiate later), you can learn more about them in [this tutorial](https://huggingface.co/transformers/preprocessing.html) if you're interested.
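
To see how the question and the context share one sequence, the sketch below splits the English `input_ids` printed above back into two segments. It assumes that `0` and `2` are the ids of the `<s>` and `</s>` special tokens in this vocabulary and that the pair is laid out as `<s> question </s></s> context </s>`. Other checkpoints use different layouts; the Korean output above, for instance, does not appear to contain such markers.

```python
# The English `input_ids` from the output above.
input_ids = [0, 2264, 16, 110, 766, 116, 2, 2, 2387, 766, 16, 28856, 1851, 4, 2]

BOS, EOS = 0, 2  # assumption: ids of <s> and </s> for this checkpoint

# Locate the double </s></s> that separates the two segments.
sep = next(
    i for i in range(len(input_ids) - 1)
    if input_ids[i] == EOS and input_ids[i + 1] == EOS
)

question_ids = input_ids[1:sep]      # drop the leading <s>
context_ids = input_ids[sep + 2:-1]  # drop the separator and the trailing </s>

print(question_ids)  # [2264, 16, 110, 766, 116]
print(context_ids)   # [2387, 766, 16, 28856, 1851, 4]
```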

Now one specific thing for the preprocessing in question answering is how to deal with very long documents. In other tasks we usually truncate them when they are longer than the model's maximum sequence length, but here, removing part of the context might result in losing the answer we are looking for. To deal with this, we will allow one (long) example in our dataset to give several input features, each shorter than the maximum length of the model (or the one we set as a hyper-parameter). Also, in case the answer lies at the point where we split a long context, we allow some overlap between the features we generate, controlled by the hyper-parameter `doc_stride`:

In [None]:
max_length = 512 # The maximum length of a feature (question and context)
doc_stride = 128 # The authorized overlap between two parts of the context when splitting is needed.
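
The splitting idea can be sketched on a plain list, independently of any tokenizer. The hypothetical helper `split_with_overlap` below cuts a sequence into windows of at most `window` items, where consecutive windows share `stride` items. In the real tokenizer the window is whatever remains of `max_length` once the question and special tokens are accounted for; that bookkeeping is left out here.

```python
def split_with_overlap(tokens, window, stride):
    """Split `tokens` into windows of at most `window` items.

    Consecutive windows overlap by `stride` items, so a span cut at one
    boundary still appears whole in a neighbouring window if it is short enough.
    """
    if not 0 <= stride < window:
        raise ValueError("stride must satisfy 0 <= stride < window")
    step = window - stride
    chunks, start = [], 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this window reached the end of the sequence
        start += step
    return chunks


chunks = split_with_overlap(list(range(10)), window=4, stride=2)
print(chunks)  # [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

A larger `stride` produces more windows but makes it less likely that an answer is split across every window that contains part of it.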

Let's find one long example in our dataset:

In [None]:
for i, example in enumerate(datasets["train"]):
    if len(tokenizer(example["question"], example["context"])["input_ids"]) > 512:
        break
example = datasets["train"][i]

In [None]:
for i, exampleKR in enumerate(datasetsKR["train"]):
    if len(tokenizerKR(exampleKR["question"], exampleKR["context"])["input_ids"]) > 512:
        break
exampleKR = datasetsKR["train"][i]

Note that we never want to truncate the question, only the context, hence the `only_second` truncation we picked. Now, our tokenizer can automatically return a list of features capped at a certain maximum length, with the overlap we talked about above; we just have to tell it to do so with `return_overflowing_tokens=True` and by passing the stride:

In [None]:
tokenized_example = tokenizer(
    example["question"],
    example["context"],
    max_length=max_length,
    truncation="only_second",
    return_overflowing_tokens=True,
    stride=doc_stride
)

In [None]:
tokenized_exampleKR = tokenizerKR(
    exampleKR["question"],
    exampleKR["context"],
    max_length=max_length,
    truncation="only_second",
    return_overflowing_tokens=True,
    stride=doc_stride
)

Now we don't have one list of `input_ids`, but several: 

In [None]:
[len(x) for x in tokenized_example["input_ids"]]

[512, 157]

In [None]:
[len(x) for x in tokenized_exampleKR["input_ids"]]

[512, 227]
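
As a rough guide to how many features a long example yields, the count can be estimated from three numbers: the context length in tokens, the room left for context in each feature, and the overlap. The sketch below is only arithmetic on hypothetical values; `count_windows` and `capacity` are illustrative names, and the exact room left for context depends on the question length and the special tokens the tokenizer adds.

```python
import math


def count_windows(n_context, capacity, stride):
    """Estimate how many overlapping windows cover `n_context` tokens.

    `capacity` is the number of context tokens that fit in one feature and
    `stride` is the overlap between consecutive windows.
    """
    if not 0 <= stride < capacity:
        raise ValueError("stride must satisfy 0 <= stride < capacity")
    if n_context <= capacity:
        return 1
    # Each extra window adds (capacity - stride) new tokens.
    return 1 + math.ceil((n_context - capacity) / (capacity - stride))


# Hypothetical numbers: 600 context tokens, 500 fit per feature, overlap 128.
print(count_windows(600, capacity=500, stride=128))  # 2
```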

And if we decode them, we can see the overlap:

In [None]:
for x in tokenized_example["input_ids"][:2]:
    print(tokenizer.decode(x))

<s>When were the Rinbung princes overthrown?</s></s>In 1565, the powerful Rinbung princes were overthrown by one of their own ministers, Karma Tseten who styled himself as the Tsangpa, "the one of Tsang", and established his base of power at Shigatse. The second successor of this first Tsang king, Karma Phuntsok Namgyal, took control of the whole of Central Tibet (Ü-Tsang), reigning from 1611–1621. Despite this, the leaders of Lhasa still claimed their allegiance to the Phagmodru as well as the Gelug, while the Ü-Tsang king allied with the Karmapa. Tensions rose between the nationalistic Ü-Tsang ruler and the Mongols who safeguarded their Mongol Dalai Lama in Lhasa. The fourth Dalai Lama refused to give an audience to the Ü-Tsang king, which sparked a conflict as the latter began assaulting Gelug monasteries. Chen writes of the speculation over the fourth Dalai Lama's mysterious death and the plot of the Ü-Tsang king to have him murdered for "cursing" him with illness, although Chen wr

In [None]:
for x in tokenized_exampleKR["input_ids"][:2]:
    print(tokenizerKR.decode(x))

2011-2012 국제빙상경기연맹 스피드스케이팅월드컵 1차 대회 3천m 디비전 B에서 김보름은 몇 위를 차지했는가? 2011년 11월 18일 러시아 첼랴빈스크에서 열린 2011-2012 국제빙상경기연맹 스피드스케이팅월드컵 1차 대회 3000m 디비전B에서 김보름은 4분12초38로 2위를 차지했다. 19일 열린 1500m 디비전B에서는 김보름이 2분01초39로 4위를 차지했다. 20일 열린 팀추월에서는 이주연, 노선영, 김보름이 3분05초79로 4위를 차지했다. 2011년 11월 25일 카자흐스탄 아스타나에서 열린 2차 대회 3000m 디비전A에서 김보름은 4분13초61로 20위를 차지했다. 26일 열린 1500m 디비전A에서는 김보름이 2분02초58로 20위를 차지했다. 27일 열린 매스스타트에서는 김보름(한국체대)이 7분26초85의 기록으로 첫 동메달을 따냈다. 2011년 12월 2일 네덜란드 히렌빈에서 열린 월드컵 3차 대회 5000m 디비전B에서 김보름은 7분26초10으로 12위를 차지했다. 3일 열린 1500m 디비전B에서는 김보름이 2분01초39로 5위를 차지했다. 4일 열린 팀추월에서는 이주연(동두천시청)-노선영(한국체대)-김보름으로 구성된 대표팀이 3분03초18의 기록으로 캐나다(3분00초01)와 러시아(3분02초38)에 이어 3위에 올랐다. 한국 스피드스케이팅이 월드컵 시리즈 팀 추월에서 메달을 딴 것은 이번이 처음이다. 2012년 2월 12일 노르웨이 하마르에서 끝난 월드컵 5차 대회 팀추월 경기에서 이주연(25)과 노선영(23), 김보름(18)이 출전하여 3분05초65의 기록으로 폴란드(3분05초57)에 이어 동메달을 차지했다. 3000m에 출전한 김보름은 4분15초66의 기록으로 16위에 머물렀다. 1500m 디비전A에서는 김보름이 2분00초24로 10위를 기록했다. 디비전B에서는 이주연(동두천시청)과 노선영(한체대)이 각각 1위(2분01초53)와 3위(2분01초82)를 차지했다. 2012년 3월 10일 독일 베를린에서 열린 월드컵 파이널 

In [None]:
for i, exampleKR in enumerate(datasets["train"]):
    if len(tokenizerKR(exampleKR["question"], exampleKR["context"])["input_ids"]) > 512:
        break
exampleKR = datasets["train"][i]

In [None]:
tokenized_exampleKR2 = tokenizerKR(
    example["question"],
    example["context"],
    max_length=max_length,
    truncation="only_second",
    return_overflowing_tokens=True,
    stride=doc_stride
)

In [None]:
for x in tokenized_exampleKR2["input_ids"][:2]:
    print(tokenizerKR.decode(x))

When were the Rinbung princes overthrown? In 1565, the powerful Rinbung princes were overthrown by one of their own ministers, Karma Tseten who styled himself as the Tsangpa, "the one of Tsang", and established his base of power at Shigatse. The second successor of this first Tsang king, Karma Phuntsok Namgyal, took control of the whole of Central Tibet (Ü-Tsang), reigning from 1611–1621. Despite this, the leaders of Lhasa still claimed their allegiance to the Phagmodru as well as the Gelug, while the Ü-Tsang king allied with the Karmapa. Tensions rose between the nationalistic Ü-Tsang ruler and the Mongols who safeguarded their Mongol Dalai Lama in Lhasa. The fourth Dalai Lama refused to give an audience to the Ü-Tsang king, which sparked a conflict as the latter began assaulting Gelug monasteries. Chen writes of the speculation over the fourth Dalai Lama's mysterious death and the plot of the Ü-Tsang king to have him murder
When were the Rinbung princes overthrown? an audience to the

In [None]:
tokenized_example = tokenizer(
    example["question"],
    example["context"],
    max_length=max_length,
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
    stride=doc_stride
)
print(tokenized_example["offset_mapping"][0][:100])

This gives, for each index of our input IDs, the corresponding start and end character in the original text that gave our token. The very first token (`[CLS]`) has (0, 0) because it doesn't correspond to any part of the question or the context, then the second token corresponds to characters 0 to 3 of the question:

In [None]:
first_token_id = tokenized_example["input_ids"][0][1]
offsets = tokenized_example["offset_mapping"][0][1]
print(tokenizer.convert_ids_to_tokens([first_token_id])[0], example["question"][offsets[0]:offsets[1]])
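To make the idea of an offset mapping concrete without loading a model, here is a toy sketch. The regex-based `toy_tokenize_with_offsets` is a hypothetical stand-in for a real subword tokenizer (it splits on words and punctuation only), but the contract it illustrates is the same: slicing the original text with a token's offsets gives back the text of that token.

```python
import re


def toy_tokenize_with_offsets(text):
    """Return (token, (start_char, end_char)) pairs for words and punctuation."""
    return [
        (match.group(), (match.start(), match.end()))
        for match in re.finditer(r"\w+|[^\w\s]", text)
    ]


question = "When were the Rinbung princes overthrown?"
tokens = toy_tokenize_with_offsets(question)

for token, (start, end) in tokens:
    # The end offset is exclusive, like a Python slice.
    print(token, (start, end), repr(question[start:end]))
```

Because the offsets index into the original string, they remain valid however the tokenizer normalizes or splits the text internally.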

So we can use this mapping to find the position of the start and end tokens of our answer in a given feature. We just have to distinguish which parts of the offsets correspond to the question and which parts correspond to the context; this is where the `sequence_ids` method of our `tokenized_example` can be useful:

In [None]:
sequence_ids = tokenized_example.sequence_ids()
print(sequence_ids)

It returns `None` for the special tokens, then 0 or 1 depending on whether the corresponding token comes from the first sentence passed (the question) or the second (the context). With all of this, we can find the first and last token of the answer in one of our input features (or detect that the answer is not in this feature):

In [None]:
answers = example["answers"]
start_char = answers["answer_start"][0]
end_char = start_char + len(answers["text"][0])

# Start token index of the current span in the text.
token_start_index = 0
while sequence_ids[token_start_index] != 1:
    token_start_index += 1

# End token index of the current span in the text.
token_end_index = len(tokenized_example["input_ids"][0]) - 1
while sequence_ids[token_end_index] != 1:
    token_end_index -= 1

# Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
offsets = tokenized_example["offset_mapping"][0]
if (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
    # Move the token_start_index and token_end_index to the two ends of the answer.
    # Note: we could go after the last offset if the answer is the last word (edge case).
    while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
        token_start_index += 1
    start_position = token_start_index - 1
    while offsets[token_end_index][1] >= end_char:
        token_end_index -= 1
    end_position = token_end_index + 1
    print(start_position, end_position)
else:
    print("The answer is not in this feature.")

And we can double check that it is indeed the theoretical answer:

In [None]:
print(tokenizer.decode(tokenized_example["input_ids"][0][start_position: end_position+1]))
print(answers["text"][0])
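The same search can be written as a small standalone function and checked on hand-made data. This is a sketch under assumptions: `find_answer_span` is a hypothetical helper, and the offsets and sequence ids below are typed by hand for a toy question ("Who won?") and context rather than produced by a tokenizer.

```python
def find_answer_span(offsets, sequence_ids, start_char, end_char):
    """Return (start_token, end_token) of the answer, or None if it is not in this feature."""
    context_tokens = [i for i, seq_id in enumerate(sequence_ids) if seq_id == 1]
    first, last = context_tokens[0], context_tokens[-1]

    # The answer must lie entirely inside the context covered by this feature.
    if offsets[first][0] > start_char or offsets[last][1] < end_char:
        return None

    # Walk forward past every token starting at or before the answer start.
    start = first
    while start <= last and offsets[start][0] <= start_char:
        start += 1
    # Walk backward past every token ending at or after the answer end.
    end = last
    while end >= first and offsets[end][1] >= end_char:
        end -= 1
    return start - 1, end + 1


context = "Denver Broncos won the game"
#            [CLS]   who     won     ?       [SEP]   denver  broncos  won       the       game      [SEP]
offsets = [(0, 0), (0, 3), (4, 7), (7, 8), (0, 0), (0, 6), (7, 14), (15, 18), (19, 22), (23, 27), (0, 0)]
sequence_ids = [None, 0, 0, 0, None, 1, 1, 1, 1, 1, None]

answer = "Denver Broncos"
start_char = context.index(answer)
span = find_answer_span(offsets, sequence_ids, start_char, start_char + len(answer))
print(span)
```

Returning `None` corresponds to the case that the notebook labels with the CLS index.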

For this notebook to work with any kind of models, we need to account for the special case where the model expects padding on the left (in which case we switch the order of the question and the context):

In [None]:
pad_on_right = tokenizerKR.padding_side == "right"

Now let's put everything together in one function we will apply to our training set. In the case of impossible answers (the answer is in another feature given by an example with a long context), we set the CLS index for both the start and end position. We could also simply discard those examples from the training set if the flag `allow_impossible_answers` is `False`. Since the preprocessing is already complex enough as it is, we've kept it simple for this part.

In [None]:
def prepare_train_features(examples):
    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that slightly overlaps the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # The offset mappings will give us a map from token to character position in the original context. This will
    # help us compute the start_positions and end_positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Let's label those examples!
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        # We will label impossible answers with the index of the CLS token.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        # If no answers are given, set the cls_index as answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start/end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != (1 if pad_on_right else 0):
                token_start_index += 1

            # End token index of the current span in the text.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != (1 if pad_on_right else 0):
                token_end_index -= 1

            # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise move the token_start_index and token_end_index to the two ends of the answer.
                # Note: we could go after the last offset if the answer is the last word (edge case).
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples

This function works with one or several examples. In the case of several examples, the tokenizer will return a list of lists for each key:

In [None]:
features = prepare_train_features(datasets['train'][:5])

To apply this function on all the examples in our dataset, we just use the `map` method of the `datasets` object we created earlier. This will apply the function on all the elements of all the splits in `datasets`, so our training, validation and testing data will be preprocessed in one single command. Since our preprocessing changes the number of samples, we need to remove the old columns when applying it.

In [None]:
tokenized_datasets = datasets.map(prepare_train_features, batched=True, remove_columns=datasets["train"].column_names)

Loading cached processed dataset at /home/sgugger/.cache/huggingface/datasets/squad/plain_text/1.0.0/4c81550d83a2ac7c7ce23783bd8ff36642800e6633c1f18417fb58c3ff50cdd7/cache-a5c71e98733887b0.arrow
Loading cached processed dataset at /home/sgugger/.cache/huggingface/datasets/squad/plain_text/1.0.0/4c81550d83a2ac7c7ce23783bd8ff36642800e6633c1f18417fb58c3ff50cdd7/cache-14932a8c6aecc96d.arrow


Even better, the results are automatically cached by the 🤗 Datasets library to avoid spending time on this step the next time you run your notebook. The 🤗 Datasets library is normally smart enough to detect when the function you pass to `map` has changed (and thus that the cached data should not be used). For instance, it will properly detect if you change the task in the first cell and rerun the notebook. 🤗 Datasets warns you when it uses cached files; you can pass `load_from_cache_file=False` in the call to `map` to ignore the cached files and force the preprocessing to be applied again.

Note that we passed `batched=True` to encode the texts by batches together. This is to leverage the full benefit of the fast tokenizer we loaded earlier, which will use multi-threading to treat the texts in a batch concurrently.
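The reason the old columns have to be removed is easier to see on a toy version of a batched map. The sketch below is pure Python and is not how 🤗 Datasets is implemented: `batched_map` and `split_long_contexts` are hypothetical helpers, and "features" here are just groups of three words.

```python
def batched_map(rows, fn, batch_size=2):
    """Apply `fn` to column-wise batches; `fn` may return more rows than it received."""
    output = {}
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Turn a list of row dicts into a dict of column lists, as a batched function expects.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        for key, values in fn(batch).items():
            output.setdefault(key, []).extend(values)
    return output


def split_long_contexts(batch, window=3):
    """Cut every context into pieces of `window` words and remember the source row."""
    pieces, sample_in_batch = [], []
    for i, context in enumerate(batch["context"]):
        words = context.split()
        for start in range(0, len(words), window):
            pieces.append(" ".join(words[start:start + window]))
            sample_in_batch.append(i)
    return {"piece": pieces, "sample_in_batch": sample_in_batch}


rows = [
    {"id": "a", "context": "one two three four five"},
    {"id": "b", "context": "six seven"},
    {"id": "c", "context": "eight nine ten eleven"},
]
toy_features = batched_map(rows, split_long_contexts)
print(len(rows), "rows in,", len(toy_features["piece"]), "rows out")
```

Three input rows become five output rows, so the original `id` and `context` columns (three values each) can no longer sit next to the new columns. The `sample_in_batch` list plays the role of `overflow_to_sample_mapping`: it is the only link back to the source example.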

## Fine-tuning the model

Now that our data is ready for training, we can download the pretrained model and fine-tune it. Since our task is question answering, we use the `AutoModelForQuestionAnswering` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us:

In [None]:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForQuestionAnswering: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
You should probably TRAIN this mode

The warning is telling us we are throwing away some weights (the `vocab_transform`, `vocab_layer_norm` and `vocab_projector` layers) and randomly initializing some others (the `qa_outputs` layer). This is absolutely normal in this case, because we are removing the head used to pretrain the model on a masked language modeling objective and replacing it with a new head for which we don't have pretrained weights. The library therefore warns us that we should fine-tune this model before using it for inference, which is exactly what we are going to do.
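The two lists in the warning are simply set differences between the parameter names stored in the checkpoint and the names the new model expects. Here is a small sketch of that bookkeeping; the head names are copied from the warning above, while the single shared key is only an illustrative stand-in for the many encoder weights that do get loaded.

```python
# Illustrative stand-in for the encoder weights present on both sides.
shared_keys = {"distilbert.embeddings.word_embeddings.weight"}

# Names stored in the pretrained (masked language modeling) checkpoint.
checkpoint_keys = shared_keys | {
    "vocab_transform.weight", "vocab_transform.bias",
    "vocab_layer_norm.weight", "vocab_layer_norm.bias",
    "vocab_projector.weight", "vocab_projector.bias",
}

# Names expected by the question-answering model.
model_keys = shared_keys | {"qa_outputs.weight", "qa_outputs.bias"}

unused_in_checkpoint = sorted(checkpoint_keys - model_keys)  # dropped pretraining head
newly_initialized = sorted(model_keys - checkpoint_keys)     # random QA head

print("Not used:", unused_in_checkpoint)
print("Newly initialized:", newly_initialized)
```

Anything in the second list starts from random values, which is why the model has to be fine-tuned before its predictions mean anything.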

To instantiate a `Trainer`, we will need to define three more things. The most important is the [`TrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments), which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional:

In [None]:
args = TrainingArguments(
    "test-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

Here we set the evaluation to be done at the end of each epoch, tweak the learning rate, use the `batch_size` defined at the top of the notebook and customize the number of epochs for training, as well as the weight decay.

Then we will need a data collator that will batch our processed examples together, here the default one will work:

In [None]:
from transformers import default_data_collator

data_collator = default_data_collator

We will evaluate our model and compute metrics in the next section (this is a very long operation, so we will only compute the evaluation loss during training).

Then we just need to pass all of this along with our datasets to the `Trainer`:

In [None]:
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

We can now fine-tune our model by just calling the `train` method:

In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss,Runtime,Samples Per Second
1,1.2206,1.160322,39.5749,272.496
2,0.9452,1.12169,39.706,271.596
3,0.773,1.157358,39.734,271.405


TrainOutput(global_step=16599, training_loss=1.1112074395519933, metrics={'train_runtime': 3487.0114, 'train_samples_per_second': 4.76, 'total_flos': 40606919924189184, 'epoch': 3.0})
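As a sanity check, the `global_step` reported above can be related to the size of the tokenized training set. The arithmetic below is a sketch under stated assumptions: a batch size of 16 (an assumption, chosen because the evaluation batches shown later have 16 rows), a single device, and no gradient accumulation. `total_optimizer_steps` is a hypothetical helper, not part of the `Trainer` API.

```python
import math


def total_optimizer_steps(num_features, batch_size, num_epochs):
    """Steps for a run on one device without gradient accumulation."""
    return math.ceil(num_features / batch_size) * num_epochs


global_step = 16599  # from the TrainOutput above
num_epochs = 3       # from the TrainingArguments above
batch_size = 16      # assumption, see the lead-in

steps_per_epoch = global_step // num_epochs
# The last batch of an epoch may be partial, so the feature count is only bounded.
max_features = steps_per_epoch * batch_size
min_features = (steps_per_epoch - 1) * batch_size + 1

print(steps_per_epoch, min_features, max_features)
```

Under these assumptions, the number of training features is noticeably larger than the number of training examples, which is expected since long contexts are split into several features.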

Since this training is particularly long, let's save the model just in case we need to restart.

In [None]:
trainer.save_model("test-squad-trained")

## Evaluation

Evaluating our model will require a bit more work, as we will need to map the predictions of our model back to parts of the context. The model itself predicts logits for the start and end positions of our answers: if we take a batch from our validation dataloader, here is the output our model gives us:

In [None]:
import torch

for batch in trainer.get_eval_dataloader():
    break
batch = {k: v.to(trainer.args.device) for k, v in batch.items()}
with torch.no_grad():
    output = trainer.model(**batch)
output.keys()

odict_keys(['loss', 'start_logits', 'end_logits'])

The output of the model is a dict-like object that contains the loss (since we provided labels) and the start and end logits. We won't need the loss for our predictions, so let's have a look at the logits:

In [None]:
output.start_logits.shape, output.end_logits.shape

(torch.Size([16, 384]), torch.Size([16, 384]))

We have one logit for each feature and each token. The most obvious way to predict an answer for each feature is to take the index of the maximum of the start logits as a start position and the index of the maximum of the end logits as an end position.

In [None]:
output.start_logits.argmax(dim=-1), output.end_logits.argmax(dim=-1)

(tensor([ 46,  57,  78,  43, 118,  15,  72,  35,  15,  34,  73,  41,  80,  91,
         156,  35], device='cuda:0'),
 tensor([ 47,  58,  81,  55, 118, 110,  75,  37, 110,  36,  76,  53,  83,  94,
         158,  35], device='cuda:0'))

This will work great in a lot of cases, but what if this prediction gives us something impossible? The start position could be greater than the end position, or point to a span of text in the question instead of the context. In that case, we might want to look at the second best prediction to see if it gives a possible answer and select that instead.

However, picking the second best answer is not as easy as picking the best one: is it the second best index in the start logits with the best index in the end logits? Or the best index in the start logits with the second best index in the end logits? And if that second best answer is not possible either, it gets even trickier for the third best answer.


To rank our answers, we will use the score obtained by adding the start and end logits. We won't try to order all the possible answers and will limit ourselves with a hyper-parameter we call `n_best_size`. We'll pick the `n_best_size` best indices in the start and end logits and gather all the answers they predict. After checking if each one is valid, we will sort them by their score and keep the best one. Here is how we would do this on the first feature in the batch:

In [None]:
n_best_size = 20

In [None]:
import numpy as np

start_logits = output.start_logits[0].cpu().numpy()
end_logits = output.end_logits[0].cpu().numpy()
# Gather the indices of the best start/end logits:
start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
valid_answers = []
for start_index in start_indexes:
    for end_index in end_indexes:
        if start_index <= end_index: # We need to refine that test to check the answer is inside the context
            valid_answers.append(
                {
                    "score": start_logits[start_index] + end_logits[end_index],
                    "text": "" # We need to find a way to get back the original substring corresponding to the answer in the context
                }
            )

And then we can sort the `valid_answers` according to their `score` and only keep the best one. The only point left is how to check a given span is inside the context (and not the question) and how to get back the text inside. To do this, we need to add two things to our validation features:
- the ID of the example that generated the feature (since each example can generate several features, as seen before);
- the offset mapping that will give us a map from token indices to character positions in the context.

That's why we will re-process the validation set with the following function, slightly different from `prepare_train_features`:

In [None]:
def prepare_validation_features(examples):
    # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that slightly overlaps the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    # We keep the example_id that gave us this feature and we will store the offset mappings.
    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1 if pad_on_right else 0

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["id"][sample_index])

        # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token
        # position is part of the context or not.
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples

And like before, we can apply that function to our validation set easily:

In [None]:
validation_features = datasets["validation"].map(
    prepare_validation_features,
    batched=True,
    remove_columns=datasets["validation"].column_names
)

Now we can grab the predictions for all features by using the `Trainer.predict` method:

In [None]:
raw_predictions = trainer.predict(validation_features)

The `Trainer` *hides* the columns that are not used by the model (here `example_id` and `offset_mapping` which we will need for our post-processing), so we set them back:

In [None]:
validation_features.set_format(type=validation_features.format["type"], columns=list(validation_features.features.keys()))

We can now refine the test we had before: since we set `None` in the offset mappings when it corresponds to a part of the question, it's easy to check if an answer is fully inside the context. We also eliminate very long answers from our considerations (with a hyper-parameter we can tune):

In [None]:
max_answer_length = 30

In [None]:
start_logits = output.start_logits[0].cpu().numpy()
end_logits = output.end_logits[0].cpu().numpy()
offset_mapping = validation_features[0]["offset_mapping"]
# The first feature comes from the first example. For the more general case, we will need to match the example_id to
# an example index.
context = datasets["validation"][0]["context"]

# Gather the indices of the best start/end logits:
start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
valid_answers = []
for start_index in start_indexes:
    for end_index in end_indexes:
        # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
        # to part of the input_ids that are not in the context.
        if (
            start_index >= len(offset_mapping)
            or end_index >= len(offset_mapping)
            or offset_mapping[start_index] is None
            or offset_mapping[end_index] is None
        ):
            continue
        # Don't consider answers with a length that is either < 0 or > max_answer_length.
        if end_index < start_index or end_index - start_index + 1 > max_answer_length:
            continue
        # At this point the span is inside the context and has a valid length.
        start_char = offset_mapping[start_index][0]
        end_char = offset_mapping[end_index][1]
        valid_answers.append(
            {
                "score": start_logits[start_index] + end_logits[end_index],
                "text": context[start_char: end_char]
            }
        )

valid_answers = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[:n_best_size]
valid_answers

[{'score': 16.706663, 'text': 'Denver Broncos'},
 {'score': 14.635585,
  'text': 'Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers'},
 {'score': 13.234194, 'text': 'Carolina Panthers'},
 {'score': 12.468662, 'text': 'Broncos'},
 {'score': 11.709289, 'text': 'Denver'},
 {'score': 10.397583,
  'text': 'Broncos defeated the National Football Conference (NFC) champion Carolina Panthers'},
 {'score': 10.104669,
  'text': 'American Football Conference (AFC) champion Denver Broncos'},
 {'score': 9.721636,
  'text': 'The American Football Conference (AFC) champion Denver Broncos'},
 {'score': 9.007437,
  'text': 'Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10'},
 {'score': 8.834958,
  'text': 'Denver Broncos defeated the National Football Conference (NFC) champion Carolina'},
 {'score': 8.38701,
  'text': 'Denver Broncos defeated the National Football Conference (NFC)'},
 {'score': 8.143825,
  'text': 'De

We can compare to the actual ground-truth answer:

In [None]:
datasets["validation"][0]["answers"]

{'answer_start': [177, 177, 177],
 'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos']}

Our model picked the right answer as the most likely one!

As we mentioned in the code above, this was easy on the first feature because we knew it comes from the first example. For the other features, we will need a map between examples and their corresponding features. Also, since one example can give several features, we will need to gather together all the answers in all the features generated by a given example, then pick the best one. The following code builds a map from example index to its corresponding features indices:

In [None]:
import collections

examples = datasets["validation"]
features = validation_features

example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
features_per_example = collections.defaultdict(list)
for i, feature in enumerate(features):
    features_per_example[example_id_to_index[feature["example_id"]]].append(i)

We're almost ready for our post-processing function. The last bit to deal with is the impossible answer (when `squad_v2 = True`). The code above only keeps answers that are inside the context; we also need to grab the score for the impossible answer (which has start and end indices corresponding to the index of the CLS token). When one example gives several features, we should predict the impossible answer only when all the features give a high score to the impossible answer (since one feature could predict the impossible answer just because the answer isn't in the part of the context it has access to). This is why the score of the impossible answer for one example is the *minimum* of the scores for the impossible answer in each feature generated by the example.

We then predict the impossible answer when that score is greater than the score of the best non-impossible answer. All combined together, this gives us this post-processing function:
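The decision rule described above can be isolated in a few lines. This is a toy sketch with made-up scores: `choose_prediction` is a hypothetical helper, and each feature is reduced to its null (CLS) score plus the score and text of its best span.

```python
def choose_prediction(feature_results):
    """Pick the final answer for one example.

    `feature_results` holds one (null_score, best_span_score, best_span_text)
    tuple per feature generated by the example.
    """
    # The example-level null score is the minimum over features: one feature
    # seeing the answer is enough to pull it down.
    min_null_score = min(null_score for null_score, _, _ in feature_results)
    best_score, best_text = max((score, text) for _, score, text in feature_results)
    return best_text if best_score > min_null_score else ""


# The answer is only visible in the second feature, so the first one votes "impossible".
answerable = [(8.0, 2.0, "the game"), (1.0, 6.5, "Denver Broncos")]
# Every feature gives a high score to the impossible answer.
unanswerable = [(8.0, 2.0, "the game"), (7.0, 3.0, "the season")]

print(repr(choose_prediction(answerable)))
print(repr(choose_prediction(unanswerable)))
```

Taking the maximum of the null scores instead would let a single feature that cannot see the answer veto a confident span found in another feature.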

In [None]:
from tqdm.auto import tqdm

def postprocess_qa_predictions(examples, features, raw_predictions, n_best_size = 20, max_answer_length = 30):
    all_start_logits, all_end_logits = raw_predictions
    # Build a map example to its corresponding features.
    example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
    features_per_example = collections.defaultdict(list)
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)

    # The dictionaries we have to fill.
    predictions = collections.OrderedDict()

    # Logging.
    print(f"Post-processing {len(examples)} example predictions split into {len(features)} features.")

    # Let's loop over all the examples!
    for example_index, example in enumerate(tqdm(examples)):
        # Those are the indices of the features associated to the current example.
        feature_indices = features_per_example[example_index]

        min_null_score = None # Only used if squad_v2 is True.
        valid_answers = []
        
        context = example["context"]
        # Looping through all the features associated to the current example.
        for feature_index in feature_indices:
            # We grab the predictions of the model for this feature.
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            # This is what will allow us to map some of the positions in our logits to spans of text in the original
            # context.
            offset_mapping = features[feature_index]["offset_mapping"]

            # Update minimum null prediction.
            cls_index = features[feature_index]["input_ids"].index(tokenizer.cls_token_id)
            feature_null_score = start_logits[cls_index] + end_logits[cls_index]
            if min_null_score is None or feature_null_score < min_null_score:
                min_null_score = feature_null_score

            # Go through all possibilities for the `n_best_size` greatest start and end logits.
            start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
            end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
                    # to part of the input_ids that are not in the context.
                    if (
                        start_index >= len(offset_mapping)
                        or end_index >= len(offset_mapping)
                        or offset_mapping[start_index] is None
                        or offset_mapping[end_index] is None
                    ):
                        continue
                    # Don't consider answers with a length that is either < 0 or > max_answer_length.
                    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                        continue

                    start_char = offset_mapping[start_index][0]
                    end_char = offset_mapping[end_index][1]
                    valid_answers.append(
                        {
                            "score": start_logits[start_index] + end_logits[end_index],
                            "text": context[start_char: end_char]
                        }
                    )
        
        if len(valid_answers) > 0:
            best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
        else:
            # In the very rare edge case where there is not a single non-null prediction, we create a fake prediction
            # to avoid failure.
            best_answer = {"text": "", "score": 0.0}
        
        # Let's pick our final answer: the best one or the null answer (only for squad_v2)
        if not squad_v2:
            predictions[example["id"]] = best_answer["text"]
        else:
            answer = best_answer["text"] if best_answer["score"] > min_null_score else ""
            predictions[example["id"]] = answer

    return predictions
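
To make the nested loop above concrete, here is a minimal, self-contained sketch of the same span selection on a toy example. The helper names (`top_k_indexes`, `best_span`), the hand-written `offset_mapping`, and the logits are all made up for illustration; `top_k_indexes` is a pure-Python stand-in for the `np.argsort(...)[-1 : -n_best_size - 1 : -1]` slice used in the function.

```python
def top_k_indexes(logits, k):
    """Indexes of the k largest logits, best first (stand-in for the argsort slice)."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


def best_span(start_logits, end_logits, offset_mapping, context,
              n_best_size=3, max_answer_length=4):
    """Hypothetical helper: pick the best valid (start, end) pair for one feature."""
    candidates = []
    for start_index in top_k_indexes(start_logits, n_best_size):
        for end_index in top_k_indexes(end_logits, n_best_size):
            # Skip tokens that are not part of the context (offset is None).
            if offset_mapping[start_index] is None or offset_mapping[end_index] is None:
                continue
            # Skip spans that end before they start or that are too long.
            if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                continue
            start_char = offset_mapping[start_index][0]
            end_char = offset_mapping[end_index][1]
            candidates.append(
                {
                    "score": start_logits[start_index] + end_logits[end_index],
                    "text": context[start_char:end_char],
                }
            )
    if not candidates:
        return {"text": "", "score": 0.0}
    return max(candidates, key=lambda c: c["score"])


# Toy data (assumed for illustration): one offset pair per token, None for [CLS].
context = "Paris is the capital of France."
offset_mapping = [None, (0, 5), (6, 8), (9, 12), (13, 20), (21, 23), (24, 30), (30, 31)]
start_logits = [0.1, 2.0, 0.2, 0.3, 0.1, 0.0, 4.0, 0.0]
end_logits = [0.2, 4.5, 0.1, 0.0, 0.3, 0.1, 3.0, 0.5]

toy_answer = best_span(start_logits, end_logits, offset_mapping, context)
print(toy_answer)
```

Here the highest start logit (on "France") and the highest end logit (on "Paris") would form a span that ends before it starts, so that pair is discarded and the best valid span is "France".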

And we can apply our post-processing function to our raw predictions:

In [None]:
final_predictions = postprocess_qa_predictions(datasets["validation"], validation_features, raw_predictions.predictions)

Post-processing 10570 example predictions split into 10784 features.
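
The two counts differ because a long context is split into several overlapping features, so one example can own more than one feature. Below is a minimal sketch of how features can be grouped back under their example, assuming each feature carries an `example_id` field as in the validation features prepared earlier (the toy `examples` and `features` lists are invented for illustration):

```python
import collections

# Toy stand-ins (assumed): two examples, the first one split into two features.
examples = [{"id": "q1"}, {"id": "q2"}]
features = [{"example_id": "q1"}, {"example_id": "q1"}, {"example_id": "q2"}]

# Map each example id to its position, then collect the feature indices per example.
example_id_to_index = {example["id"]: i for i, example in enumerate(examples)}
features_per_example = collections.defaultdict(list)
for feature_index, feature in enumerate(features):
    example_index = example_id_to_index[feature["example_id"]]
    features_per_example[example_index].append(feature_index)

print(f"{len(examples)} examples split into {len(features)} features.")
print(dict(features_per_example))
```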


Then we can load the metric from the `datasets` library.

In [None]:
metric = load_metric("squad_v2" if squad_v2 else "squad")

Then we can call `compute` on it. We just need to reformat the predictions and labels a bit, since it expects lists of dictionaries rather than one big dictionary. For squad_v2, each prediction also needs a `no_answer_probability` field (set to 0.0 here, since we have already replaced the answer with an empty string wherever the null answer was picked).

In [None]:
if squad_v2:
    formatted_predictions = [{"id": k, "prediction_text": v, "no_answer_probability": 0.0} for k, v in final_predictions.items()]
else:
    formatted_predictions = [{"id": k, "prediction_text": v} for k, v in final_predictions.items()]
references = [{"id": ex["id"], "answers": ex["answers"]} for ex in datasets["validation"]]
metric.compute(predictions=formatted_predictions, references=references)

{'exact_match': 76.74550614947965, 'f1': 85.13412652023338}
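
For intuition about what these two numbers measure, here is a simplified sketch of exact match and token-level F1 for a single prediction and a single reference, following the normalization used by the official SQuAD v1.1 evaluation script (lowercasing, then stripping punctuation and the articles a/an/the). The function names are our own and this is not the code that `metric.compute` runs; the real metric also takes the best score over all reference answers for each question and averages over the dataset, reported as a percentage.

```python
import collections
import re
import string

PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_score(prediction, ground_truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(ground_truth))


def f1_score(prediction, ground_truth):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(ground_truth).split()
    common = collections.Counter(pred_tokens) & collections.Counter(truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


# A prediction that contains the reference plus extra words: no exact match, partial F1.
print(exact_match_score("the Eiffel Tower in Paris", "Eiffel Tower"))
print(f1_score("the Eiffel Tower in Paris", "Eiffel Tower"))
# Differences in case, articles, and punctuation are ignored.
print(exact_match_score("The Eiffel Tower.", "Eiffel Tower"))
```

This is why F1 is higher than exact match in the result above: a prediction that overlaps the reference without matching it exactly still earns partial F1 credit.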

Don't forget to [upload your model](https://huggingface.co/transformers/model_sharing.html) to the [🤗 Model Hub](https://huggingface.co/models). You can then use it to generate results like the one shown in the first picture of this notebook!