# Question Answering on SQuAD

Source: https://huggingface.co/transformers/examples.html#the-big-table-of-tasks

In [1]:
import os

os.environ['NCCL_DEBUG']="WARN"

In [None]:
import torch
print("torch.__version__:", torch.__version__)
print("torch.version.cuda:", torch.version.cuda)
print("torch.cuda.nccl.version():", torch.cuda.nccl.version())
print("torch.cuda.is_available():", torch.cuda.is_available())

In [2]:
# !pip install datasets transformers

In [3]:
# This flag selects between SQuAD v1 and v2. If you're using another dataset, it indicates whether
# impossible (unanswerable) questions are allowed.
squad_v2 = False
model_checkpoint = "distilbert-base-uncased"
batch_size = 16

# Loading the dataset

In [4]:
from datasets import load_dataset, load_metric

In [5]:
datasets = load_dataset("squad_v2" if squad_v2 else "squad")

Reusing dataset squad (/home/e/e0389098/.cache/huggingface/datasets/squad/plain_text/1.0.0/0fd9e01360d229a22adfe0ab7e2dd2adc6e2b3d6d3db03636a51235947d4c6e9)


In [6]:
datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

In [7]:
datasets["train"][0]

{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']},
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'id': '5733be284776f41900661182',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'title': 'University_of_Notre_Dame'}

In [8]:
from datasets import ClassLabel, Sequence
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    # Draw distinct indices without replacement.
    picks = random.sample(range(len(dataset)), num_examples)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
    display(HTML(df.to_html()))

In [9]:
show_random_elements(datasets["train"])

Unnamed: 0,answers,context,id,question,title
0,"{'answer_start': [64], 'text': ['1585 and 1598']}","In addition there are two service bells, cast by Robert Mot, in 1585 and 1598 respectively, a Sanctus bell cast in 1738 by Richard Phelps and Thomas Lester and two unused bells—one cast about 1320, by the successor to R de Wymbish, and a second cast in 1742, by Thomas Lester. The two service bells and the 1320 bell, along with a fourth small silver ""dish bell"", kept in the refectory, have been noted as being of historical importance by the Church Buildings Council of the Church of England.",56e8f74999e8941900975f3e,When were the two service bells cast?,Westminster_Abbey
1,"{'answer_start': [56], 'text': ['Battle of Jena-Auerstedt']}","After the disastrous defeat of the Prussian Army at the Battle of Jena-Auerstedt in 1806, Napoleon occupied Berlin and had the officials of the Prussian General Directory swear an oath of allegiance to him, while King Frederick William III and his consort Louise fled via Königsberg and the Curonian Spit to Memel. The French troops immediately took up pursuit but were delayed in the Battle of Eylau on 9 February 1807 by an East Prussian contingent under General Anton Wilhelm von L'Estocq. Napoleon had to stay at the Finckenstein Palace, but in May, after a siege of 75 days, his troops led by Marshal François Joseph Lefebvre were able to capture the city Danzig, which had been tenaciously defended by General Count Friedrich Adolf von Kalkreuth. On 14 June, Napoleon ended the War of the Fourth Coalition with his victory at the Battle of Friedland. Frederick William and Queen Louise met with Napoleon for peace negotiations, and on 9 July the Prussian king signed the Treaty of Tilsit.",572a29476aef05140015532a,What defeat led to Prussia having to swear its allegiance to Napoleon?,East_Prussia
2,"{'answer_start': [1006], 'text': ['anti-corporatist']}","Nicholas Lezard described post-punk as ""a fusion of art and music"". The era saw the robust appropriation of ideas from literature, art, cinema, philosophy, politics and critical theory into musical and pop cultural contexts. Artists sought to refuse the common distinction between high and low culture and returned to the art school tradition found in the work of artists such as Captain Beefheart and David Bowie. Among major influences on a variety of post-punk artists were writers such as William S. Burroughs and J.G. Ballard, avant-garde political scenes such as Situationism and Dada, and intellectual movements such as postmodernism. Many artists viewed their work in explicitly political terms. Additionally, in some locations, the creation of post-punk music was closely linked to the development of efficacious subcultures, which played important roles in the production of art, multimedia performances, fanzines and independent labels related to the music. Many post-punk artists maintained an anti-corporatist approach to recording and instead seized on alternate means of producing and releasing music. Journalists also became an important element of the culture, and popular music magazines and critics became immersed in the movement.",572e6570dfa6aa1500f8cffd,Why did many post-punk artists produce and release their own music?,Post-punk
3,"{'answer_start': [323], 'text': ['government']}","The Arthashastra and the Edicts of Ashoka are the primary written records of the Mauryan times. Archaeologically, this period falls into the era of Northern Black Polished Ware (NBPW). The Mauryan Empire was based on a modern and efficient economy and society. However, the sale of merchandise was closely regulated by the government. Although there was no banking in the Mauryan society, usury was customary. A significant amount of written records on slavery are found, suggesting a prevalence thereof. During this period, a high quality steel called Wootz steel was developed in south India and was later exported to China and Arabia.",572860cdff5b5019007da1d6,What organization closely monitored business dealings in the Mauryan Empire?,History_of_India
4,"{'answer_start': [181], 'text': ['United Kingdom and Prussia']}","Over time, the relative power of these five nations fluctuated, which by the dawn of the 20th century had served to create an entirely different balance of power. Some, such as the United Kingdom and Prussia (as the founder of the newly formed German state), experienced continued economic growth and political power. Others, such as Russia and Austria-Hungary, stagnated. At the same time, other states were emerging and expanding in power, largely through the process of industrialization. These countries seeking to attain great power status were: Italy after the Risorgimento, Japan after the Meiji Restoration, and the United States after its civil war. By the dawn of the 20th century, the balance of world power had changed substantially since the Congress of Vienna. The Eight-Nation Alliance was a belligerent alliance of eight nations against the Boxer Rebellion in China. It formed in 1900 and consisted of the five Congress powers plus Italy, Japan, and the United States, representing the great powers at the beginning of 20th century.",57310a8305b4da19006bcd13,What countries found their economic growth in early 20th century?,Great_power
5,"{'answer_start': [863], 'text': ['Xbox Live Vision']}","When the Xbox 360 was released, Microsoft's online gaming service Xbox Live was shut down for 24 hours and underwent a major upgrade, adding a basic non-subscription service called Xbox Live Silver (later renamed Xbox Live Free) to its already established premium subscription-based service (which was renamed Gold). Xbox Live Free is included with all SKUs of the console. It allows users to create a user profile, join on message boards, and access Microsoft's Xbox Live Arcade and Marketplace and talk to other members. A Live Free account does not generally support multiplayer gaming; however, some games that have rather limited online functions already, (such as Viva Piñata) or games that feature their own subscription service (e.g. EA Sports games) can be played with a Free account. Xbox Live also supports voice the latter a feature possible with the Xbox Live Vision.",570b212d6b8089140040f762,Voice support came online with what feature addition?,Xbox_360
6,"{'answer_start': [505], 'text': ['higher taxes']}","Under the millet system, non-Muslim people were considered subjects of the Empire, but were not subject to the Muslim faith or Muslim law. The Orthodox millet, for instance, was still officially legally subject to Justinian's Code, which had been in effect in the Byzantine Empire for 900 years. Also, as the largest group of non-Muslim subjects (or zimmi) of the Islamic Ottoman state, the Orthodox millet was granted a number of special privileges in the fields of politics and commerce, and had to pay higher taxes than Muslim subjects.",572a586fb8ce0319002e2ac4,Being a non-muslim in the Empire resulted in what as it related to taxes?,Ottoman_Empire
7,"{'answer_start': [328], 'text': ['emergency']}","Players may only be transferred during transfer windows that are set by the Football Association. The two transfer windows run from the last day of the season to 31 August and from 31 December to 31 January. Player registrations cannot be exchanged outside these windows except under specific licence from the FA, usually on an emergency basis. As of the 2010–11 season, the Premier League introduced new rules mandating that each club must register a maximum 25-man squad of players aged over 21, with the squad list only allowed to be changed in transfer windows or in exceptional circumstances. This was to enable the 'home grown' rule to be enacted, whereby the League would also from 2010 require at least 8 of the named 25 man squad to be made up of 'home-grown players'.",5733f9bcd058e614000b66e3,On which basis are transfers outside of transfer windows licenced?,Premier_League
8,"{'answer_start': [1255], 'text': ['Bill Gates']}","Education in Israel is highly valued in the national culture with its historical values dating back to Ancient Israel and was viewed as one fundamental blocks of ancient Israelite life. Israeli culture views higher education as the key to higher mobility and socioeconomic status in Israeli society. The emphasis of education within Israeli society goes to the gulf within the Jewish diaspora from the Renaissance and Enlightenment Movement all the way to the roots of Zionism in the 1880s. Jewish communities in the Levant were the first to introduce compulsory education for which the organized community, not less than the parents, was responsible for the education of the next generation of Jews. With contemporary Jewish culture's strong emphasis, promotion of scholarship and learning and the strong propensity to promote cultivation of intellectual pursuits as well as the nations high university educational attainment rate exemplifies how highly Israeli society values higher education. The Israeli education system has been praised for various reasons, including its high quality and its major role in spurring Israel's economic development and technological boom. Many international business leaders and organizations such as Microsoft founder Bill Gates have praised Israel for its high quality of education in helping spur Israel's economic development. In 2012, the country ranked second among OECD countries (tied with Japan and after Canada) for the percentage of 25- to 64-year-olds that have attained tertiary education with 46 percent compared with the OECD average of 32 percent. In addition, nearly twice as many Israelis aged 55–64 held a higher education degree compared to other OECD countries, with 47 percent holding an academic degree compared with the OECD average of 25%. 
In 2012, the country ranked third in the world in the number of academic degrees per capita (20 percent of the population).",5725c576271a42140099d17b,Who praised Israel for its high quality education?,Israel
9,"{'answer_start': [504], 'text': ['Hundred Years' War']}","Changes also took place within the recruitment and composition of armies. The use of the national or feudal levy was gradually replaced by paid troops of domestic retinues or foreign mercenaries. The practice was associated with Edward III of England and the condottieri of the Italian city-states. All over Europe, Swiss soldiers were in particularly high demand. At the same time, the period also saw the emergence of the first permanent armies. It was in Valois France, under the heavy demands of the Hundred Years' War, that the armed forces gradually assumed a permanent nature.",572738d25951b619008f86ef,Which conflict in France resulted in the establishment of permanent armies?,Late_Middle_Ages


# Preprocessing the training data

In [10]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [11]:
import transformers
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

In [12]:
tokenizer("What is your name?", "My name is Sylvain.")

{'input_ids': [101, 2054, 2003, 2115, 2171, 1029, 102, 2026, 2171, 2003, 25353, 22144, 2378, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
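
The output above follows the usual BERT-style pair layout, `[CLS] question [SEP] context [SEP]`. The sketch below splits the printed `input_ids` back into its two segments. It assumes the ids 101 and 102 are `[CLS]` and `[SEP]`, as in the uncased BERT vocabulary; `split_pair` is a hypothetical helper, not part of the `transformers` API.

```python
def split_pair(input_ids, cls_id=101, sep_id=102):
    """Split a [CLS] A [SEP] B [SEP] encoding into the ids of A and B."""
    # Assumption: the sequence starts with [CLS] and ends with [SEP].
    assert input_ids[0] == cls_id and input_ids[-1] == sep_id
    first_sep = input_ids.index(sep_id)
    return input_ids[1:first_sep], input_ids[first_sep + 1:-1]


# The ids printed by the tokenizer call above.
input_ids = [101, 2054, 2003, 2115, 2171, 1029, 102,
             2026, 2171, 2003, 25353, 22144, 2378, 1012, 102]
question_ids, context_ids = split_pair(input_ids)
print(len(question_ids), len(context_ids))
```

In real code, `tokenizer.cls_token_id` and `tokenizer.sep_token_id` would be safer than hard-coded ids.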

In [13]:
# Allow one (long) example in the dataset to give several input features, each no longer than the model's maximum length
max_length = 384 # The maximum length of a feature (question and context)
doc_stride = 128 # The number of overlapping tokens between two consecutive parts of the context when it has to be split.

In [14]:
for i, example in enumerate(datasets["train"]):
    if len(tokenizer(example["question"], example["context"])["input_ids"]) > max_length:
        break
example = datasets["train"][i]

In [15]:
len(tokenizer(example["question"], example["context"])["input_ids"])

396

In [16]:
len(tokenizer(example["question"], example["context"], max_length=max_length, truncation="only_second")["input_ids"])

384

In [17]:
tokenized_example = tokenizer(
    example["question"],
    example["context"],
    max_length=max_length,
    truncation="only_second",  # Truncate only context, never the question
    return_overflowing_tokens=True,
    stride=doc_stride
)

In [18]:
[len(x) for x in tokenized_example["input_ids"]]

[384, 157]
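
The two lengths can be reproduced by hand. The 396-token example has 17 non-context tokens (`[CLS]`, 14 question tokens, and two `[SEP]`), which leaves 379 context tokens. The sketch below assumes each new window starts `stride` tokens before the end of the previous one; `window_lengths` is a hypothetical helper written for illustration and is not how the tokenizer is implemented internally.

```python
def window_lengths(n_context, n_overhead, max_length, stride):
    """Length of each feature when a long context is split with overlap.

    n_context:  number of context tokens
    n_overhead: tokens repeated in every feature (special tokens + question)
    """
    capacity = max_length - n_overhead  # context tokens that fit in one feature
    if capacity <= stride:
        raise ValueError("stride must be smaller than the per-feature context capacity")
    lengths = []
    start = 0
    while True:
        end = min(start + capacity, n_context)
        lengths.append(end - start + n_overhead)
        if end == n_context:
            break
        # Assumption: the next window re-reads the last `stride` context tokens.
        start = end - stride
    return lengths


print(window_lengths(n_context=379, n_overhead=17, max_length=384, stride=128))
```

The first window takes 367 context tokens (384 in total); the second restarts 128 tokens earlier and covers the remaining 140 context tokens (157 in total).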

In [19]:
# See the overlap in contexts
for x in tokenized_example["input_ids"][:2]:
    print(tokenizer.decode(x))

[CLS] how many wins does the notre dame men's basketball team have? [SEP] the men's basketball team has over 1, 600 wins, one of only 12 schools who have reached that mark, and have appeared in 28 ncaa tournaments. former player austin carr holds the record for most points scored in a single game of the tournament with 61. although the team has never won the ncaa tournament, they were named by the helms athletic foundation as national champions twice. the team has orchestrated a number of upsets of number one ranked teams, the most notable of which was ending ucla's record 88 - game winning streak in 1974. the team has beaten an additional eight number - one teams, and those nine wins rank second, to ucla's 10, all - time in wins against the top team. the team plays in newly renovated purcell pavilion ( within the edmund p. joyce center ), which reopened for the beginning of the 2009 – 2010 season. the team is coached by mike brey, who, as of the 2014 – 15 season, his fifteenth at notr

In [20]:
tokenized_example = tokenizer(
    example["question"],
    example["context"],
    max_length=max_length,
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,  # Also return each token's (start, end) character span in its source text
    stride=doc_stride
)
print(tokenized_example["offset_mapping"][0][:100])

[(0, 0), (0, 3), (4, 8), (9, 13), (14, 18), (19, 22), (23, 28), (29, 33), (34, 37), (37, 38), (38, 39), (40, 50), (51, 55), (56, 60), (60, 61), (0, 0), (0, 3), (4, 7), (7, 8), (8, 9), (10, 20), (21, 25), (26, 29), (30, 34), (35, 36), (36, 37), (37, 40), (41, 45), (45, 46), (47, 50), (51, 53), (54, 58), (59, 61), (62, 69), (70, 73), (74, 78), (79, 86), (87, 91), (92, 96), (96, 97), (98, 101), (102, 106), (107, 115), (116, 118), (119, 121), (122, 126), (127, 138), (138, 139), (140, 146), (147, 153), (154, 160), (161, 165), (166, 171), (172, 175), (176, 182), (183, 186), (187, 191), (192, 198), (199, 205), (206, 208), (209, 210), (211, 217), (218, 222), (223, 225), (226, 229), (230, 240), (241, 245), (246, 248), (248, 249), (250, 258), (259, 262), (263, 267), (268, 271), (272, 277), (278, 281), (282, 285), (286, 290), (291, 301), (301, 302), (303, 307), (308, 312), (313, 318), (319, 321), (322, 325), (326, 330), (330, 331), (332, 340), (341, 351), (352, 354), (355, 363), (364, 373), (374,

In [21]:
first_token_id = tokenized_example["input_ids"][0][1]
offsets = tokenized_example["offset_mapping"][0][1]
print(tokenizer.convert_ids_to_tokens([first_token_id])[0], example["question"][offsets[0]:offsets[1]])

how How


In [22]:
sequence_ids = tokenized_example.sequence_ids()
print(sequence_ids)

[None, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, None, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
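
In this list, `None` marks special tokens, `0` the question, and `1` the context. A minimal sketch of how the context boundaries can be read off such a list follows; `context_token_range` is a hypothetical helper and the input is a shortened toy list, not the real output.

```python
def context_token_range(sequence_ids, context_id=1):
    """Return (first, last) token indices belonging to the context, or None."""
    idxs = [i for i, s in enumerate(sequence_ids) if s == context_id]
    if not idxs:
        return None
    return idxs[0], idxs[-1]


# Toy input: [CLS], 3 question tokens, [SEP], 4 context tokens, [SEP].
toy_sequence_ids = [None, 0, 0, 0, None, 1, 1, 1, 1, None]
print(context_token_range(toy_sequence_ids))
```

The same scan works when padding follows the context, because padding positions are also `None`.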

In [23]:
# Find the first and last token of the answer in one of our input features (or detect that the answer is not in this feature)
answers = example["answers"]
start_char = answers["answer_start"][0]
end_char = start_char + len(answers["text"][0])

# Start token index of the current span in the text.
token_start_index = 0
while sequence_ids[token_start_index] != 1:
    token_start_index += 1

# End token index of the current span in the text.
token_end_index = len(tokenized_example["input_ids"][0]) - 1
while sequence_ids[token_end_index] != 1:
    token_end_index -= 1

# Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
offsets = tokenized_example["offset_mapping"][0]
if (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
    # Move the token_start_index and token_end_index to the two ends of the answer.
    # Note: we could go after the last offset if the answer is the last word (edge case).
    while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
        token_start_index += 1
    start_position = token_start_index - 1
    while offsets[token_end_index][1] >= end_char:
        token_end_index -= 1
    end_position = token_end_index + 1
    print(start_position, end_position)
else:
    print("The answer is not in this feature.")

23 26


In [24]:
print(tokenizer.decode(tokenized_example["input_ids"][0][start_position: end_position+1]))
print(answers["text"][0])

over 1, 600
over 1,600
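
The character-to-token conversion can be tried on a toy offset mapping. The sketch below is a simplified variant of the loop above: it assumes the answer starts and ends inside tokens and returns `None` where the notebook would fall back to the CLS index. The offsets are made up for the text `over 1,600 wins`, and `char_span_to_token_span` is a hypothetical name.

```python
def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span [start_char, end_char) to inclusive token indices."""
    start_tok = end_tok = None
    for i, (s, e) in enumerate(offsets):
        if s == e:
            continue  # special tokens carry an empty (0, 0) span
        if start_tok is None and s <= start_char < e:
            start_tok = i
        if s < end_char <= e:
            end_tok = i
    if start_tok is None or end_tok is None:
        return None  # the answer is not fully inside this feature
    return start_tok, end_tok


# Toy offsets for "[CLS] over 1 , 600 wins [SEP]" (illustrative values only).
toy_offsets = [(0, 0), (0, 4), (5, 6), (6, 7), (7, 10), (11, 15), (0, 0)]
print(char_span_to_token_span(toy_offsets, 5, 10))   # the span of "1,600"
print(char_span_to_token_span(toy_offsets, 20, 25))  # outside the text
```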


In [25]:
pad_on_right = tokenizer.padding_side == "right"

In [26]:
# Combine all of the steps above into one function
def prepare_train_features(examples):
    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that overlaps a bit with the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # The offset mappings will give us a map from token to character position in the original context. This will
    # help us compute the start_positions and end_positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Let's label those examples!
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        # We will label impossible answers with the index of the CLS token.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        # If no answers are given, set the cls_index as answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start/end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != (1 if pad_on_right else 0):
                token_start_index += 1

            # End token index of the current span in the text.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != (1 if pad_on_right else 0):
                token_end_index -= 1

            # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise move the token_start_index and token_end_index to the two ends of the answer.
                # Note: we could go after the last offset if the answer is the last word (edge case).
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples
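
The role of `sample_mapping` is easier to see on a toy batch. The sketch below assumes three examples that produce one, two, and one features respectively; `build_sample_mapping` is a hypothetical helper that mimics the shape of `overflow_to_sample_mapping`, not the tokenizer's own code.

```python
def build_sample_mapping(features_per_example):
    """For each feature, record the index of the example it came from."""
    mapping = []
    for example_index, n_features in enumerate(features_per_example):
        mapping.extend([example_index] * n_features)
    return mapping


# Toy batch: the second example is long and is split into two features.
toy_answers = ["answer A", "answer B", "answer C"]
sample_mapping = build_sample_mapping([1, 2, 1])
print(sample_mapping)

# Each feature looks up the answer of its source example.
answers_per_feature = [toy_answers[i] for i in sample_mapping]
print(answers_per_feature)
```

Both features of the long example receive the same answer, and the offset check then decides which of them actually contains it.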

In [27]:
features = prepare_train_features(datasets['train'][:5])

In [28]:
# Apply the function to every example in every split of the dataset
tokenized_datasets = datasets.map(prepare_train_features, batched=True, remove_columns=datasets["train"].column_names)

Loading cached processed dataset at /home/e/e0389098/.cache/huggingface/datasets/squad/plain_text/1.0.0/0fd9e01360d229a22adfe0ab7e2dd2adc6e2b3d6d3db03636a51235947d4c6e9/cache-1e3a6e7007d17481.arrow
Loading cached processed dataset at /home/e/e0389098/.cache/huggingface/datasets/squad/plain_text/1.0.0/0fd9e01360d229a22adfe0ab7e2dd2adc6e2b3d6d3db03636a51235947d4c6e9/cache-1b50f23fb1f44f36.arrow


# Fine-tuning the model

In [29]:
# !pip install transformers



In [30]:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForQuestionAnswering: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
You should probably TRAIN this mode

In [31]:
args = TrainingArguments(
    "test-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)
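
A rough sketch of how these arguments translate into a number of optimizer steps follows. It assumes a single device, no gradient accumulation, and that the last partial batch is kept; the feature count used here is a made-up placeholder, not the size of the tokenized SQuAD training set.

```python
import math


def total_optimizer_steps(num_features, batch_size, num_epochs):
    """Optimizer steps for one device with no gradient accumulation."""
    steps_per_epoch = math.ceil(num_features / batch_size)
    return steps_per_epoch * num_epochs


# Hypothetical feature count, with the batch size and epochs used above.
print(total_optimizer_steps(num_features=1000, batch_size=16, num_epochs=3))
```

With several GPUs, the per-device batch size is multiplied by the number of devices, so the step count shrinks accordingly.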

In [32]:
# The data collator batches processed examples together
from transformers import default_data_collator

data_collator = default_data_collator

In [34]:
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

RuntimeError: CUDA error: out of memory

In [None]:
trainer.train()

In [None]:
trainer.save_model("test-squad-trained")

# Evaluation