# Question Answering with Hugging Face Transformers

[Original Keras Code](https://keras.io/examples/nlp/question_answering/)

<br/>

**Author**: Yookyung Kho

**Date presented**: 2022/05/09, DSBA keras2torch Study

**Task description**: Question Answering with pretrained `distilbert-base-cased` from HuggingFace

**References**:

- https://qa.fastforwardlabs.com/pytorch/hugging%20face/wikipedia/bert/transformers/2020/05/19/Getting_Started_with_QA.html

- https://huggingface.co/transformers/v3.0.2/model_doc/auto.html

- https://huggingface.co/course/chapter7/7?fw=tf

- https://huggingface.co/course/chapter6/3b?fw=pt

- https://huggingface.co/transformers/v3.2.0/custom_datasets.html

## 0. About QA

- Task: **context-based (extractive) question answering**: each question is answered with a span taken from a given paragraph

- Dataset: SQuAD v1.1

<img src="qa_output.png" width="1000" height="600">

<img src="qa_input.png" width="1000" height="600">

[img source](https://blog.paperspace.com/how-to-train-question-answering-machine-learning-models/)

- Input: Question, Context (which contains the answer span)

```
[CLS] question [SEP] context [SEP]
```

- Output: the start and end tokens of the answer

    - `start_logits`: (batch_size, sequence_length)
    
    - `end_logits`: (batch_size, sequence_length)
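To make the output format concrete, here is a minimal sketch of how one pair of `start_logits` / `end_logits` vectors can be turned into an answer string. The function name `best_span`, the toy logits, and the toy offsets are made up for illustration; the filtering rules (valid offsets, `start <= end`, a maximum answer length) mirror the post-processing used in the inference section of this notebook.

```python
def best_span(start_logits, end_logits, offsets, context, n_best=20, max_answer_length=30):
    """Pick the highest-scoring valid (start, end) token pair for ONE feature.

    offsets[i] is the (start_char, end_char) span of token i in `context`,
    or None when token i is not a context token (question, special, padding).
    """
    def top_indices(logits):
        # Indices of the n_best largest logits, best first
        return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:n_best]

    best = {"text": "", "score": float("-inf")}
    for s in top_indices(start_logits):
        for e in top_indices(end_logits):
            # Skip spans that touch non-context tokens
            if offsets[s] is None or offsets[e] is None:
                continue
            # Skip spans that end before they start or that are too long
            if e < s or e - s + 1 > max_answer_length:
                continue
            score = start_logits[s] + end_logits[e]
            if score > best["score"]:
                best = {"text": context[offsets[s][0]:offsets[e][1]], "score": score}
    return best


# Toy feature (hypothetical values): 4 non-context tokens, 3 context tokens, 1 [SEP]
context = "The cat sat"
offsets = [None, None, None, None, (0, 3), (4, 7), (8, 11), None]
start_logits = [5.0, 0.0, 0.0, 0.0, 1.0, 4.0, 0.5, 0.0]
end_logits = [5.0, 0.0, 0.0, 0.0, 0.2, 3.0, 1.0, 0.0]

print(best_span(start_logits, end_logits, offsets, context)["text"])  # -> cat
```

Index 0 has the largest logits in this toy input, but it is skipped because it is not a context token.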

## 1. Loading the dataset

In [1]:
import torch
from datasets import load_dataset

datasets = load_dataset("squad")

Reusing dataset squad (/root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453)



In [2]:
datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

In [3]:
ex_train = datasets["train"][0]
ex_train

{'id': '5733be284776f41900661182',
 'title': 'University_of_Notre_Dame',
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'answers': {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}}

In [4]:
ex_valid = datasets["validation"][0]
ex_valid

{'id': '56be4db0acb8001400a502ec',
 'title': 'Super_Bowl_50',
 'context': 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi\'s Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.',
 'question': 'Which NFL team represented the AFC at Super Bowl 50?',
 'answers': {'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'],


- The original `train` split is divided into our train and validation sets.

- The original `validation` split is held out as the test set.


In [3]:
shuffled_set = datasets["train"].shuffle(seed=602)

small_set = shuffled_set.select(range(10000)) # sample only 10,000 examples

train_valid = small_set.train_test_split(test_size=0.2)

train_examples = train_valid["train"]
valid_examples = train_valid["test"]

print(f"Data size: Train({len(train_examples)}), Valid({len(valid_examples)})")

Loading cached shuffled indices for dataset at /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453/cache-b7b1a28774d2c4ca.arrow


Data size: Train(8000), Valid(2000)


## 2. Preprocessing the training data

### 2.1. Datasets

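How should a very **long context** be handled?

- Usually, a sequence is simply truncated to the maximum length (max length).

- In QA, however, the answer has to be found inside the given context, so truncating the context to max length can cut the answer out entirely.

- (Solution) **Split the context into smaller chunks that are allowed to overlap.**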
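The overlapping split can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the tokenizer's actual implementation: `split_with_overlap`, `window`, and `overlap` are hypothetical names, with `overlap` playing the role of the tokenizer's `stride` (the number of tokens shared by two consecutive chunks).

```python
def split_with_overlap(tokens, window, overlap):
    """Split `tokens` into chunks of at most `window` items.

    Consecutive chunks share `overlap` items, so an answer that is cut in one
    chunk has a chance to appear whole in the next.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last chunk reached the end of the sequence
        start += step
    return chunks


context_tokens = list(range(10))  # stand-in for 10 context token ids
for chunk in split_with_overlap(context_tokens, window=4, overlap=2):
    print(chunk)
```

With `window=4` and `overlap=2`, every token except those at the two ends appears in two chunks.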

In [4]:
from transformers import AutoTokenizer

model_checkpoint = "distilbert-base-cased"

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [5]:
max_length = 384  # The maximum length of a feature (question and context)
doc_stride = 128  # The authorized overlap between two parts of the context when splitting

In [6]:
def prepare_train_features(examples):
    # Tokenize our examples with truncation and padding, but keep the overflows using a
    # stride. This results in one example possibly giving several features when a context is long,
    # each of those features having a context that slightly overlaps the context of the previous
    # feature.
    examples["question"] = [q.lstrip() for q in examples["question"]]
    examples["context"] = [c.lstrip() for c in examples["context"]]
    
    # 1) Tokenize into: [CLS] question [SEP] context [SEP]
    tokenized_examples = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",  # keep the question intact; truncate only the context when max_length is exceeded
        max_length=max_length,
        stride=doc_stride,  # amount of overlap between consecutive chunks (128 tokens)
        return_overflowing_tokens=True,  # let the tokenizer know we want the overflowing tokens
        return_offsets_mapping=True,  # needed to compute the start_positions and end_positions
        padding="max_length",
    )
    # tokenized_examples behaves like a dictionary:
    # {'input_ids': [101, 1706, ..., 102],
    #  'attention_mask': [1, 1, 1 ..., 0],
    #  'offset_mapping': [(0, 0), (0, 2), (3, 7), ...(694, 695), (0, 0)],
    #  'overflow_to_sample_mapping': [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]}
    
    
    # 2) Pop "overflow_to_sample_mapping" and "offset_mapping"
    # Only input_ids and attention_mask are fed to the model, so these two entries are
    # removed from tokenized_examples and kept in separate variables.

    # "overflow_to_sample_mapping": when a long context yields several features, this tells
    # which original example (sample) each feature came from.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    # "offset_mapping": for every token, a (start_char, end_char) tuple giving its character
    # span in the original text; end_char is exclusive.
    # e.g. a token "cat" starting at character 15 -> (15, 18)
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # 3) Label the start and end token positions of the answer
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        # If a feature does not contain the answer, it is labeled with index 0,
        # i.e. the [CLS] token at the very start of the sequence.
        input_ids = tokenized_examples["input_ids"][i]  # the i-th span (feature)
        cls_index = input_ids.index(tokenizer.cls_token_id)  # 0

        # Locate the question and the context inside the sequence
        sequence_ids = tokenized_examples.sequence_ids(i)  # None for special tokens ([CLS], [SEP], [PAD]), 0 for question tokens, 1 for context tokens

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]  # index of the original example
        answers = examples["answers"][sample_index]  # dict holding the answer 'text' and 'answer_start' (start character index)

        # If no answer is given, treat [CLS] as the answer
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start/end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != 1:
                token_start_index += 1

            # End token index of the current span in the text.
            token_end_index = len(input_ids) - 1  # max_length - 1
            while sequence_ids[token_end_index] != 1:  # skip the trailing [PAD] and [SEP] positions (None)
                token_end_index -= 1

            # The answer can only lie inside this span if the first context token starts at or
            # before the first character of the answer and the last context token ends at or
            # after the last character of the answer.
            # In every other case, [CLS] is used as the answer label.
            if not (
                offsets[token_start_index][0] <= start_char
                and offsets[token_end_index][1] >= end_char
            ):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise move the token_start_index and token_end_index to the two ends of the answer.
                # Note: we could go after the last offset if the answer is the last word (edge case).
                while (
                    token_start_index < len(offsets)  # len(offsets) == max_length (number of tokens in the sequence)
                    and offsets[token_start_index][0] <= start_char
                ):
                    token_start_index += 1  # advance one token at a time until just past the answer's first token
                tokenized_examples["start_positions"].append(token_start_index - 1)
                
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)
    
    return tokenized_examples

# [83, 51, 19, 0, 0, 64, 27, 0, 34, 0, 0, 0, 67, 34, 0, 0, 0, 0, 0] ## start positions
# [85, 53, 21, 0, 0, 70, 33, 0, 40, 0, 0, 0, 68, 35, 0, 0, 0, 0, 0] ## end positions
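The labeling step inside `prepare_train_features` can be isolated into a small function to see what it does on a toy input. The sketch below is a simplified restatement under stated assumptions, not the notebook's exact code: `char_span_to_token_span` is a hypothetical helper, the offsets and sequence ids are written by hand, and the search is restricted to context tokens only.

```python
def char_span_to_token_span(offsets, sequence_ids, start_char, end_char, cls_index=0):
    """Map an answer's character span to (start_token, end_token).

    Returns (cls_index, cls_index) when the answer is not fully contained in
    the context part of this feature.
    """
    context_idx = [i for i, s in enumerate(sequence_ids) if s == 1]
    if not context_idx:
        return cls_index, cls_index
    first, last = context_idx[0], context_idx[-1]

    # The answer must lie entirely inside the characters covered by this feature
    if offsets[first][0] > start_char or offsets[last][1] < end_char:
        return cls_index, cls_index

    # Last context token that starts at or before the answer's first character
    start_token = max(i for i in context_idx if offsets[i][0] <= start_char)
    # First context token that ends at or after the answer's last character
    end_token = min(i for i in context_idx if offsets[i][1] >= end_char)
    return start_token, end_token


# Toy feature for question "Who?" and context "The cat sat" (hand-written values)
# tokens:     [CLS]   Who     ?       [SEP]   The     cat     sat      [SEP]   [PAD]
offsets = [(0, 0), (0, 3), (3, 4), (0, 0), (0, 3), (4, 7), (8, 11), (0, 0), (0, 0)]
sequence_ids = [None, 0, 0, None, 1, 1, 1, None, None]

print(char_span_to_token_span(offsets, sequence_ids, 4, 7))    # answer "cat"
print(char_span_to_token_span(offsets, sequence_ids, 20, 25))  # answer outside this feature
```

The first call returns the index of the `cat` token for both start and end; the second falls back to the `[CLS]` index.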

In [7]:
# features: train_dataset
train_dataset = train_examples.map(
    prepare_train_features,
    batched=True,
    num_proc=3,
    remove_columns=train_examples.column_names,
)
print(f"Train data size: {len(train_examples)} -> {len(train_dataset)}")

Train data size: 8000 -> 8103


In [8]:
train_examples

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 8000
})

In [9]:
train_dataset

Dataset({
    features: ['input_ids', 'attention_mask', 'start_positions', 'end_positions'],
    num_rows: 8103
})

In [13]:
print(train_dataset[0])

{'input_ids': [101, 1327, 1202, 1103, 1697, 1105, 3732, 2031, 5299, 1116, 22417, 1106, 1146, 8678, 136, 102, 1398, 1433, 3099, 1104, 1103, 1244, 1311, 117, 1259, 1103, 1697, 117, 1103, 3302, 1116, 1104, 1103, 3732, 2031, 117, 1352, 7030, 1105, 27597, 117, 1105, 1155, 1484, 1104, 2757, 117, 20335, 1148, 1105, 17766, 1106, 1146, 8678, 1103, 5317, 119, 1636, 12749, 1116, 170, 3101, 3161, 1306, 1115, 1103, 3013, 1104, 1644, 1110, 7298, 1106, 1103, 3013, 1104, 1251, 1769, 2301, 119, 1335, 1103, 1269, 1159, 117, 1103, 2877, 1433, 1144, 5602, 21435, 131, 1103, 7663, 3392, 1110, 1714, 1106, 4958, 1184, 24026, 1122, 1209, 3593, 117, 1112, 1263, 1112, 1122, 12543, 1439, 1157, 4035, 15447, 5894, 3758, 1105, 18788, 1103, 7950, 1193, 4921, 2266, 1104, 2833, 119, 18872, 117, 1103, 9799, 3392, 1144, 170, 2178, 1104, 9799, 21435, 117, 1105, 1103, 3275, 3392, 1145, 1144, 1672, 21435, 3113, 3758, 1259, 16810, 2916, 21435, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

In [10]:
# Apply the same preprocessing to the validation set
valid_dataset = valid_examples.map(
    prepare_train_features,
    batched=True,
    num_proc=3,
    remove_columns=valid_examples.column_names,
)
print(f"Valid data size: {len(valid_examples)} -> {len(valid_dataset)}")

Valid data size: 2000 -> 2022


In [11]:
print(f"Final Data size: Train({len(train_dataset)}), Valid({len(valid_dataset)})")

Final Data size: Train(8103), Valid(2022)


#### Example

In [34]:
train_examples[53]

{'id': '56cd8d2762d2951400fa66e1',
 'title': 'Sino-Tibetan_relations_during_the_Ming_dynasty',
 'context': 'Van Praag states that the Ming court established diplomatic delegations with Tibet merely to secure urgently needed horses. Wang and Nyima argue that these were not diplomatic delegations at all, that Tibetan areas were ruled by the Ming since Tibetan leaders were granted positions as Ming officials, that horses were collected from Tibet as a mandatory "corvée" tax, and therefore Tibetans were "undertaking domestic affairs, not foreign diplomacy". Sperling writes that the Ming simultaneously bought horses in the Kham region while fighting Tibetan tribes in Amdo and receiving Tibetan embassies in Nanjing. He also argues that the embassies of Tibetan lamas visiting the Ming court were for the most part efforts to promote commercial transactions between the lamas\' large, wealthy entourage and Ming Chinese merchants and officials. Kolmaš writes that while the Ming maintained a laiss

Because this is a long sequence that exceeds max_len (384), it is split into two features, as shown below.

- The first feature contains the answer; the second does not, so it is labeled 0 (\[CLS\]).

In [35]:
tokenizer.decode(train_dataset[53]['input_ids']) # example with max_len=100, stride=20

'[CLS] who were the Tibetan areas were ruled by? [SEP] Van Praag states that the Ming court established diplomatic delegations with Tibet merely to secure urgently needed horses. Wang and Nyima argue that these were not diplomatic delegations at all, that Tibetan areas were ruled by the Ming since Tibetan leaders were granted positions as Ming officials, that horses were collected from Tibet as a mandatory " corvée " tax, and therefore Tibetans were " undertaking domestic affairs, not foreign diplomacy ". Sperling writes that the Ming simultaneously bought horses in the Kham region while fighting Tibetan tribes in Amdo and receiving Tibetan embassies in Nanjing. He also argues that the embassies of Tibetan lamas visiting the Ming court were for the most part efforts to promote commercial transactions between the lamas\'large, wealthy entourage and Ming Chinese merchants and officials. Kolmaš writes that while the Ming maintained a laissez - faire policy towards Tibet and limited the nu

In [36]:
tokenizer.decode(train_dataset[54]['input_ids'])

'[CLS] who were the Tibetan areas were ruled by? [SEP] the latter. As for the Yongle Emperor\'s gifts to his Tibetan and Nepalese vassals such as silver wares, Buddha relics, utensils for Buddhist temples and religious ceremonies, and gowns and robes for monks, Tsai writes " in his effort to draw neighboring states to the Ming orbit so that he could bask in glory, the Yongle Emperor was quite willing to pay a small price ". The Information Office of the State Council of the PRC lists the Tibetan tribute items as oxen, horses, camels, sheep, fur products, medical herbs, Tibetan incenses, thangkas ( painted scrolls ), and handicrafts ; while the Ming awarded Tibetan tribute - bearers an equal value of gold, silver, satin and brocade, bolts of cloth, grains, and tea leaves. Silk workshops during the Ming also catered specifically to the Tibetan market with silk clothes and furnishings featuring Tibetan Buddhist iconography. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]

In [38]:
print(train_dataset['start_positions'][53:55])
print(train_dataset['end_positions'][53:55])

[17, 0]
[18, 0]


### 2.2. DataLoader

In [44]:
batch_size=16

In [45]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
valid_dataloader = DataLoader(valid_dataset, batch_size=batch_size, shuffle=True)

In [46]:
print(f"Train iteration: {len(train_dataloader)}, Valid iteration: {len(valid_dataloader)}")

Train iteration: 507, Valid iteration: 127


#### Example

In [67]:
batch_ex_v = next(iter(valid_dataloader))

In [68]:
reshape_tensor(batch_ex_v['input_ids'], batch_size=2).shape

torch.Size([2, 384])

In [71]:
len(batch_ex['input_ids'])

384

In [33]:
batch_ex = next(iter(train_dataloader))

In [37]:
prob = batch_ex['input_ids']
input_ids = torch.concat([prob[i][0].view(1) for i in range(len(prob))]).unsqueeze(0) # 384->[1, 384]

for batch_idx in range(1, batch_size):
    new_ids = torch.concat([prob[i][batch_idx].view(1) for i in range(len(prob))]).unsqueeze(0) # 384->[1, 384]
    input_ids = torch.cat([input_ids , new_ids], dim=0)

input_ids

tensor([[ 101, 4434, 1121,  ...,    0,    0,    0],
        [ 101, 1327, 2578,  ...,    0,    0,    0],
        [ 101, 1327, 1583,  ...,    0,    0,    0],
        [ 101, 1731, 1242,  ...,    0,    0,    0]])

In [41]:
batch_ex = next(iter(train_dataloader))

In [42]:
batch_ex['input_ids'][:10]

[tensor([101, 101, 101, 101]),
 tensor([1327, 1327, 2627, 1327]),
 tensor([1710, 1110, 2234, 1132]),
 tensor([1108,  170, 1103, 1103]),
 tensor([1103, 7224,  185, 3501]),
 tensor([ 8099,  1115, 19456,  1637]),
 tensor([ 4264, 18028, 21123,  3002]),
 tensor([1114, 1103, 3855, 1104]),
 tensor([ 136, 2860, 1154, 1103]),
 tensor([ 102, 1104, 3352, 7085])]

In [43]:
batch_ex['attention_mask'][:10]

[tensor([1, 1, 1, 1]),
 tensor([1, 1, 1, 1]),
 tensor([1, 1, 1, 1]),
 tensor([1, 1, 1, 1]),
 tensor([1, 1, 1, 1]),
 tensor([1, 1, 1, 1]),
 tensor([1, 1, 1, 1]),
 tensor([1, 1, 1, 1]),
 tensor([1, 1, 1, 1]),
 tensor([1, 1, 1, 1])]

In [44]:
def reshape_tensor(org_tensor, batch_size):
    new_tensor = torch.concat([org_tensor[i][0].view(1) for i in range(len(org_tensor))]).unsqueeze(0) # 384->[1, 384]
    for batch_idx in range(1, batch_size):
        new_ids = torch.concat([org_tensor[i][batch_idx].view(1) for i in range(len(org_tensor))]).unsqueeze(0) # 384->[1, 384]
        new_tensor = torch.cat([new_tensor, new_ids], dim=0)
    return new_tensor

In [71]:
reshape_tensor(batch_ex['input_ids'], batch_size)

tensor([[ 101, 1327, 1710,  ...,    0,    0,    0],
        [ 101, 1327, 1110,  ...,    0,    0,    0],
        [ 101, 2627, 2234,  ...,    0,    0,    0],
        [ 101, 1327, 1132,  ...,    0,    0,    0]])

In [100]:
input_ids.shape # do the same for attention_mask

torch.Size([4, 384])

In [46]:
reshape_tensor(batch_ex['attention_mask'], batch_size)

tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])
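The reshape is needed because the batches shown above arrive "transposed": one entry per token position, each holding the values for every example in the batch. A minimal sketch of the same operation, using plain Python lists and made-up token ids instead of tensors:

```python
# Layout produced by the DataLoader in this notebook:
# one row per token position, one column per example in the batch.
columns = [
    [101, 101, 101],     # position 0 of examples 0, 1, 2
    [1327, 2627, 1731],  # position 1
    [102, 102, 102],     # position 2
    [0, 0, 0],           # position 3 (padding)
]

# Transpose to the layout the model expects: one row per example.
rows = [list(example) for example in zip(*columns)]

for row in rows:
    print(row)
print(len(rows), len(rows[0]))  # batch_size, seq_len
```

With real tensors, `torch.stack(batch['input_ids'], dim=1)` should give the same `[batch_size, seq_len]` result as `reshape_tensor`, and calling `set_format("torch")` on the dataset before building the `DataLoader` is a common way to avoid the transposed layout altogether. Both are suggestions rather than what this notebook does.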

## 3. Fine-tuning the model

In [18]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

In [48]:
from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint).to(device)

Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertForQuestionAnswering: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
You should probably TRAIN this model on

In [5]:
from tqdm.auto import tqdm

In [50]:
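# transformers.AdamW is deprecated in recent releases; torch.optim.AdamW is used instead.
# weight_decay=0.0 matches the default of the transformers implementation.
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.0)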

In [15]:
# The DataLoader changes the layout of the inputs, so they are reshaped back to [batch_size, seq_len]
### Applied to the model inputs input_ids and attention_mask

def reshape_tensor(org_tensor, batch_size):
    new_tensor = torch.concat([org_tensor[i][0].view(1) for i in range(len(org_tensor))]).unsqueeze(0) # 384->[1, 384]
    for batch_idx in range(1, batch_size):
        new_ids = torch.concat([org_tensor[i][batch_idx].view(1) for i in range(len(org_tensor))]).unsqueeze(0) # 384->[1, 384]
        new_tensor = torch.cat([new_tensor, new_ids], dim=0)
    return new_tensor

In [52]:
def train_epoch(model, dataloader, optimizer, device):
    model.train()
    losses = 0
    
    for batch_idx, batch in tqdm(enumerate(dataloader)):
        # input
        batch_size = batch['input_ids'][0].size(0)
        
        input_ids = reshape_tensor(batch['input_ids'], batch_size).to(device)
        attention_mask = reshape_tensor(batch['attention_mask'], batch_size).to(device)
        start_positions = batch['start_positions'].to(device)
        end_positions = batch['end_positions'].to(device)
        
        # Output
        outputs = model(input_ids, attention_mask=attention_mask, start_positions=start_positions, end_positions=end_positions)
        
        # Calculate loss, Update parameters
        optimizer.zero_grad()
        loss = outputs[0] # the QA model returns the loss directly as its first output
        
        loss.backward()
        optimizer.step()
        
        losses += loss.item()
    
    train_loss = losses / len(dataloader)
    
    return train_loss
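The loss returned as `outputs[0]` is, as far as I understand the Hugging Face question-answering heads, the average of two cross-entropy terms: one over the start logits and one over the end logits. The sketch below reproduces that computation for a single example in plain Python. `cross_entropy` and `qa_loss` are hypothetical helper names, and the real model additionally averages over the batch.

```python
import math


def cross_entropy(logits, target):
    """Negative log-softmax of `logits` at index `target` (numerically stable)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def qa_loss(start_logits, end_logits, start_position, end_position):
    """Average of the start-position and end-position cross-entropy losses."""
    start_loss = cross_entropy(start_logits, start_position)
    end_loss = cross_entropy(end_logits, end_position)
    return 0.5 * (start_loss + end_loss)


# Toy 4-token sequence whose answer spans tokens 1..2 (made-up logits)
uninformed = qa_loss([0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], 1, 2)
confident = qa_loss([0.0, 5.0, 0.0, 0.0], [0.0, 0.0, 5.0, 0.0], 1, 2)
print(uninformed, confident)
```

Uniform logits give a loss of `log(4)`, the value for a model that guesses uniformly over four positions; the confident, correct logits give a much smaller loss.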

In [53]:
def valid_epoch(model, dataloader, device):
    model.eval()
    losses = 0
    
    # Gradients are not needed for validation, so disable autograd to save memory
    with torch.no_grad():
        for batch_idx, batch in enumerate(dataloader):
            # input
            batch_size = batch['input_ids'][0].size(0)

            input_ids = reshape_tensor(batch['input_ids'], batch_size).to(device)
            attention_mask = reshape_tensor(batch['attention_mask'], batch_size).to(device)
            start_positions = batch['start_positions'].to(device)
            end_positions = batch['end_positions'].to(device)

            # Output
            outputs = model(input_ids, attention_mask=attention_mask, start_positions=start_positions, end_positions=end_positions)

            loss = outputs[0]

            losses += loss.item()
        
    valid_loss = losses / len(dataloader)
    
    return valid_loss

In [54]:
from timeit import default_timer as timer

num_epochs = 3

for epoch in range(num_epochs):
    start_time = timer()
    train_loss = train_epoch(model, train_dataloader, optimizer, device)
    end_time = timer()
    valid_loss = valid_epoch(model, valid_dataloader, device)
    
    print((f"[Epoch {epoch}] Train loss: {train_loss:.3f}, Valid loss: {valid_loss:.3f}, Epoch time = {(end_time - start_time):.3f}s"))


[Epoch 0] Train loss: 2.644, Valid loss: 1.708, Epoch time = 135.889s



[Epoch 1] Train loss: 1.274, Valid loss: 1.614, Epoch time = 138.224s



[Epoch 2] Train loss: 0.659, Valid loss: 1.978, Epoch time = 139.553s


In [92]:
torch.save(model, 'distill_bert_qa.pth')

## 4. Inference(Test)

In [16]:
from tqdm.auto import tqdm

In [1]:
import torch
from datasets import load_dataset

datasets = load_dataset("squad", split="validation")
datasets

Reusing dataset squad (/root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453)


Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 10570
})

In [2]:
batch_size = 8

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

In [3]:
model = torch.load('distill_bert_qa.pth')

### Inference Example

In [78]:
shuffled_test = datasets.shuffle(seed=602)

test_examples = shuffled_test.select([7]) # pick a single example

Loading cached shuffled indices for dataset at /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453/cache-434e5e0fe7c25b62.arrow


In [79]:
test_examples[0]

{'id': '5730de74f6cb411900e244fd',
 'title': 'United_Methodist_Church',
 'context': 'Unlike confirmation and profession of faith, Baptism is a sacrament in the UMC. The Book of Discipline of the United Methodist Church directs the local church to offer membership preparation or confirmation classes to all people, including adults. The term confirmation is generally reserved for youth, while some variation on membership class is generally used for adults wishing to join the church. The Book of Discipline normally allows any youth at least completing sixth grade to participate, although the pastor has discretionary authority to allow a younger person to participate. In confirmation and membership preparation classes, students learn about Church and the Methodist-Christian theological tradition in order to profess their ultimate faith in Christ.',
 'question': 'How do students learn about the church?',
 'answers': {'text': ['confirmation and membership preparation classes',
   'In confirm

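- 3 gold answers: to make evaluation robust to variations in how an answer is phrased, answers were collected from three annotators

- Evaluation metrics

    - Exact match: 1 if the prediction matches any one of the three gold answers, otherwise 0 (binary accuracy)
    
    - F1: a word-level F1 score is computed against each of the three gold answers, the maximum is taken as the per-question F1 score, and the scores are macro-averaged over all questions
    
    - Both can be computed easily with `metric = load_metric("squad")`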
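The two metrics described above can be written out by hand for a single question. This is a minimal sketch rather than the official scorer: the helper names are hypothetical, the gold answers are made up, and the text normalization (lowercasing, removing punctuation and the articles a/an/the) follows my understanding of the standard SQuAD evaluation.

```python
import collections
import re
import string


def normalize(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    return float(normalize(prediction) == normalize(gold))


def f1(prediction, gold):
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    # Multiset intersection counts each shared word at most as often as it appears in both
    common = collections.Counter(pred_tokens) & collections.Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def score(prediction, gold_answers):
    """Per-question (EM, F1): the best value over all gold answers."""
    em = max(exact_match(prediction, g) for g in gold_answers)
    best_f1 = max(f1(prediction, g) for g in gold_answers)
    return em, best_f1


gold_answers = ["the red fox", "red fox", "quick red fox"]  # made-up gold answers
print(score("Red fox!", gold_answers))  # matches a gold answer after normalization
print(score("fox", gold_answers))       # partial overlap only
```

The dataset-level numbers are then the mean of these per-question values over all questions.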


In [80]:
def preprocess_test_examples(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    sample_map = inputs.pop("overflow_to_sample_mapping")
    example_ids = []

    for i in range(len(inputs["input_ids"])):
        sample_idx = sample_map[i]
        example_ids.append(examples["id"][sample_idx])

        sequence_ids = inputs.sequence_ids(i)
        offset = inputs["offset_mapping"][i]
        inputs["offset_mapping"][i] = [
            o if sequence_ids[k] == 1 else None for k, o in enumerate(offset)
        ]

    inputs["example_id"] = example_ids
    return inputs

In [81]:
test_dataset = test_examples.map(
    preprocess_test_examples,
    batched=True,
    remove_columns=test_examples.column_names,
)

print(f"Test data size: {len(test_examples)} -> {len(test_dataset)}")


Test data size: 1 -> 1


In [82]:
from torch.utils.data import DataLoader

test_dataset_for_model = test_dataset.remove_columns(["example_id", "offset_mapping"])

test_dataloader = DataLoader(test_dataset_for_model, batch_size=batch_size, shuffle=False)

In [83]:
import collections
import random
import numpy as np
from datasets import load_metric


def compute_metrics(start_logits, end_logits, features, examples):
    example_to_features = collections.defaultdict(list)
    for idx, feature in enumerate(features):
        example_to_features[feature["example_id"]].append(idx)
    # defaultdict(list,
    #             {'5726398589a1e219009ac58b': [0],
    #              '571cdcb85efbb31900334e0c': [1],
    #              '5730aa52069b531400832221': [2],
    #              '572684f5dd62a815002e87fe': [3],
    #              '572732f8f1498d1400e8f476': [4],
    #              '56beae423aeaaa14008c91f4': [5],
    #              ...,
    #              '5726fc63dd62a815002e9706': [38, 39],
    #              ...,
    #              '5729081d3f37b31900477fad': [100]})
    metric = load_metric("squad")
    predicted_answers = []
    n_best = 20
    max_answer_length = 30
    ex_idx = random.randint(0, len(examples) - 1)  # index of the example to print

    for idx, example in tqdm(enumerate(examples)):
        example_id = example["id"]
        context = example["context"]
        answers = []

        # Loop through all features associated with that example
        for feature_index in example_to_features[example_id]:
            start_logit = start_logits[feature_index]
            end_logit = end_logits[feature_index]
            offsets = features[feature_index]["offset_mapping"]

            start_indexes = np.argsort(start_logit)[-1 : -n_best - 1 : -1].tolist()
            end_indexes = np.argsort(end_logit)[-1 : -n_best - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Skip answers that are not fully in the context
                    if offsets[start_index] is None or offsets[end_index] is None:
                        continue
                    # Skip answers with a length that is either < 0 or > max_answer_length
                    if (
                        end_index < start_index
                        or end_index - start_index + 1 > max_answer_length
                    ):
                        continue

                    answer = {
                        "text": context[offsets[start_index][0] : offsets[end_index][1]],
                        "logit_score": start_logit[start_index] + end_logit[end_index],
                    }
                    answers.append(answer)

        # Select the answer with the best score
        if len(answers) > 0:
            best_answer = max(answers, key=lambda x: x["logit_score"])
        else:
            # No valid span was found: fall back to an empty prediction
            best_answer = {"text": "", "logit_score": None}
        predicted_answers.append(
            {"id": example_id, "prediction_text": best_answer["text"]}
        )

        ### For display: print the randomly chosen example
        if idx == ex_idx:
            print(f"Inference Example\n")
            print(f"[Id] {example['id']}\n[Context] {example['context']}\n[Question] {example['question']}")
            print(f"[Real Answers] {example['answers']}")
            print(f"[Pred Answers] {best_answer}")

    theoretical_answers = [{"id": ex["id"], "answers": ex["answers"]} for ex in examples]
    return metric.compute(predictions=predicted_answers, references=theoretical_answers)

In [84]:
def reshape_tensor(org_tensor, batch_size):
    new_tensor = torch.concat([org_tensor[i][0].view(1) for i in range(len(org_tensor))]).unsqueeze(0) # 384->[1, 384]
    for batch_idx in range(1, batch_size):
        new_ids = torch.concat([org_tensor[i][batch_idx].view(1) for i in range(len(org_tensor))]).unsqueeze(0) # 384->[1, 384]
        new_tensor = torch.cat([new_tensor, new_ids], dim=0)
    return new_tensor

In [85]:
def test_epoch(model, dataloader, features, examples, device):
    model.eval()
    
    starts, ends = [], []
    # Inference only: disable autograd so no computation graph is kept
    with torch.no_grad():
        for batch_idx, batch in tqdm(enumerate(dataloader)):
            # input
            batch_size = batch['input_ids'][0].size(0)

            input_ids = reshape_tensor(batch['input_ids'], batch_size).to(device)
            attention_mask = reshape_tensor(batch['attention_mask'], batch_size).to(device)

            # Output
            outputs = model(input_ids, attention_mask=attention_mask)

            starts.append(outputs.start_logits) # start_logits: (batch_size, max_len)
            ends.append(outputs.end_logits) # end_logits: (batch_size, max_len)
        
    all_start_logits = torch.cat(starts, dim=0).cpu().detach().numpy()
    all_end_logits = torch.cat(ends, dim=0).cpu().detach().numpy()
    
    dict_metrics = compute_metrics(all_start_logits, all_end_logits, features, examples) ###
    
    return dict_metrics['exact_match'], dict_metrics['f1']

#### Good Examples

In [64]:
### good
test_exact_match, test_f1 = test_epoch(model, test_dataloader, test_dataset, test_examples, device)

print(f"[Test Result] Exact Match: {test_exact_match}, F1: {test_f1}")


Inference Example

[Id] 5729081d3f37b31900477fad
[Context] Neutrophils and macrophages are phagocytes that travel throughout the body in pursuit of invading pathogens. Neutrophils are normally found in the bloodstream and are the most abundant type of phagocyte, normally representing 50% to 60% of the total circulating leukocytes. During the acute phase of inflammation, particularly as a result of bacterial infection, neutrophils migrate toward the site of inflammation in a process called chemotaxis, and are usually the first cells to arrive at the scene of infection. Macrophages are versatile cells that reside within tissues and produce a wide array of chemicals including enzymes, complement proteins, and regulatory factors such as interleukin 1. Macrophages also act as scavengers, ridding the body of worn-out cells and other debris, and as antigen-presenting cells that activate the adaptive immune system.
[Question] What percentage of leukocytes do neutrophils represent?
[Real Answer

In [53]:
### good
test_exact_match, test_f1 = test_epoch(model, test_dataloader, test_dataset, test_examples, device)

print(f"[Test Result] Exact Match: {test_exact_match}, F1: {test_f1}")

Inference Example

[Id] 572750e8dd62a815002e9af4
[Context] The project must adhere to zoning and building code requirements. Constructing a project that fails to adhere to codes does not benefit the owner. Some legal requirements come from malum in se considerations, or the desire to prevent things that are indisputably bad – bridge collapses or explosions. Other legal requirements come from malum prohibitum considerations, or things that are a matter of custom or expectation, such as isolating businesses to a business district and residences to a residential district. An attorney may seek changes or exemptions in the law that governs the land where the building will be built, either by arguing that a rule is inapplicable (the bridge design will not cause a collapse), or that the custom is no longer needed (acceptance of live-work spaces has grown in the community).
[Question] Who may seek changes or exemptions in the law that governs the land where the building will be built?
[Real An

#### Bad Examples

In [86]:
### bad
test_exact_match, test_f1 = test_epoch(model, test_dataloader, test_dataset, test_examples, device)

print(f"[Test Result] Exact Match: {test_exact_match}, F1: {test_f1}")

Inference Example

[Id] 5730de74f6cb411900e244fd
[Context] Unlike confirmation and profession of faith, Baptism is a sacrament in the UMC. The Book of Discipline of the United Methodist Church directs the local church to offer membership preparation or confirmation classes to all people, including adults. The term confirmation is generally reserved for youth, while some variation on membership class is generally used for adults wishing to join the church. The Book of Discipline normally allows any youth at least completing sixth grade to participate, although the pastor has discretionary authority to allow a younger person to participate. In confirmation and membership preparation classes, students learn about Church and the Methodist-Christian theological tradition in order to profess their ultimate faith in Christ.
[Question] How do students learn about the church?
[Real Answers] {'text': ['confirmation and membership preparation classes', 'In confirmation and membership preparation 

In [73]:
### bad

test_exact_match, test_f1 = test_epoch(model, test_dataloader, test_dataset, test_examples, device)

print(f"[Test Result] Exact Match: {test_exact_match}, F1: {test_f1}")

Inference Example

[Id] 5726398589a1e219009ac58b
[Context] Connection-oriented transmission requires a setup phase in each involved node before any packet is transferred to establish the parameters of communication. The packets include a connection identifier rather than address information and are negotiated between endpoints so that they are delivered in order and with error checking. Address information is only transferred to each node during the connection set-up phase, when the route to the destination is discovered and an entry is added to the switching table in each network node through which the connection passes. The signaling protocols used allow the application to specify its requirements and discover link parameters. Acceptable values for service parameters may be negotiated. Routing a packet requires the node to look up the connection id in a table. The packet header can be small, as it only needs to contain this code and any information, such as length, timestamp, or sequ