# SQuAD Question Answering with BERT 


Hugging Face Datasets provides SQuAD; load it with load_dataset("squad").

In [12]:
!pip install transformers
!pip install datasets #from huggingface 



In [13]:
from datasets import load_dataset

squad = load_dataset("squad")

print("\n\n ", squad["train"][0])

Reusing dataset squad (/root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453)


  0%|          | 0/2 [00:00<?, ?it/s]



  {'id': '5733be284776f41900661182', 'title': 'University_of_Notre_Dame', 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.', 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?', 'answers': {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}}
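In the record above, answer_start is a character offset into context, so slicing the context from that offset for the length of the answer text should recover the answer. A minimal sketch of that check follows; it assumes a shortened, hypothetical record built from an excerpt of the context above, so the offset is recomputed with str.index rather than copied from the dataset.

```python
# Hypothetical shortened record in the same layout as squad["train"][0].
# The context is only an excerpt, so answer_start is recomputed here and
# differs from the 515 stored in the real dataset.
context = (
    "It is a replica of the grotto at Lourdes, France where the Virgin Mary "
    "reputedly appeared to Saint Bernadette Soubirous in 1858."
)
answer_text = "Saint Bernadette Soubirous"
example = {
    "context": context,
    "answers": {"text": [answer_text], "answer_start": [context.index(answer_text)]},
}

# answer_start is a character index; the end is start + len(answer text).
start = example["answers"]["answer_start"][0]
end = start + len(example["answers"]["text"][0])
recovered = example["context"][start:end]
print(recovered)
```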


In [14]:
squad

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

# Preprocess

Load the tokenizer

In [15]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

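Points to watch:

- Some examples in the dataset have very long contexts, so set **truncation="only_second"**: when a (question, context) pair exceeds max_length, only the context is truncated and the question is kept intact.
- The answer's [start, end] character positions in the original context have to be mapped to token positions. Passing **return_offsets_mapping=True** returns each token's character offsets, which makes this mapping possible.
- The sequence_ids method tells whether a given offset belongs to the question or to the context.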
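The mapping from a character span to a token span can be tried out without downloading a model. The sketch below is an illustration only: toy_encode is a hypothetical whitespace tokenizer that imitates the layout of offset_mapping and sequence_ids (special tokens get offset (0, 0) and sequence id None), and char_span_to_token_span is a hypothetical helper that restates the search done in preprocess_function in a compact form. Real BERT tokens are subwords, so the actual indices will differ.

```python
import re


def toy_encode(question, context):
    """Whitespace-split a (question, context) pair and mimic the layout
    [CLS] question [SEP] context [SEP] with per-token character offsets."""
    tokens, offsets, seq_ids = ["[CLS]"], [(0, 0)], [None]
    for seq_id, text in enumerate((question, context)):
        for match in re.finditer(r"\S+", text):
            tokens.append(match.group())
            offsets.append((match.start(), match.end()))
            seq_ids.append(seq_id)  # 0 = question, 1 = context
        tokens.append("[SEP]")
        offsets.append((0, 0))
        seq_ids.append(None)
    return tokens, offsets, seq_ids


def char_span_to_token_span(offsets, seq_ids, start_char, end_char):
    """Return (start_token, end_token) for a character span in the context,
    or (0, 0) when the span lies outside the context tokens."""
    context_idx = [i for i, s in enumerate(seq_ids) if s == 1]
    context_start, context_end = context_idx[0], context_idx[-1]
    if offsets[context_start][0] > end_char or offsets[context_end][1] < start_char:
        return 0, 0
    # Last context token starting at or before the answer start.
    start_tok = max(i for i in context_idx if offsets[i][0] <= start_char)
    # First context token ending at or after the answer end.
    end_tok = min(i for i in context_idx if offsets[i][1] >= end_char)
    return start_tok, end_tok


question = "To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?"
context = "the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858."
answer = "Saint Bernadette Soubirous"
start_char = context.index(answer)
end_char = start_char + len(answer)

tokens, offsets, seq_ids = toy_encode(question, context)
start_tok, end_tok = char_span_to_token_span(offsets, seq_ids, start_char, end_char)
print(tokens[start_tok:end_tok + 1])
```

Because the offsets of the context restart at 0, the sequence ids are what keep question tokens from being mistaken for context tokens with the same character positions.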

In [22]:
def preprocess_function(examples):
  questions = [q.strip() for q in examples["question"]]  # strip() removes leading and trailing whitespace
  inputs = tokenizer(
      questions,
      examples["context"],
      max_length=384,
      truncation="only_second",  # truncate only the context, never the question
      return_offsets_mapping=True,
      padding="max_length",
  )

  offset_mapping = inputs.pop("offset_mapping")
  answers = examples["answers"]

  start_positions = []
  end_positions = [] 

  for i, offset in enumerate(offset_mapping):

    answer = answers[i]
    start_char = answer["answer_start"][0]
    end_char = answer["answer_start"][0] + len(answer["text"][0])
    sequence_ids = inputs.sequence_ids(i)


    idx = 0 
    while sequence_ids[idx] != 1:
      idx += 1
    context_start = idx 
    while sequence_ids[idx] == 1: 
      idx += 1 
    context_end = idx - 1 

    # If the answer is not fully inside the context, label it (0, 0)
    if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
      start_positions.append(0)
      end_positions.append(0)
    else:
      # Otherwise it's the start and end token positions
      idx = context_start
      while idx <= context_end and offset[idx][0] <= start_char:
        idx += 1
      start_positions.append(idx - 1)

      idx = context_end
      while idx >= context_start and offset[idx][1] >= end_char:
        idx -= 1
      end_positions.append(idx + 1)

  inputs["start_positions"] = start_positions
  inputs["end_positions"] = end_positions
  
  return inputs





In [23]:

tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)

  0%|          | 0/88 [00:00<?, ?ba/s]

  0%|          | 0/11 [00:00<?, ?ba/s]

# Fine-tune BERT with the Hugging Face Trainer (PyTorch)
(This is not the canonical approach, but it can be seen as a convenient one, especially for large Transformer language models such as BERT.)

The process breaks down into four main pieces:
- **data_collator**: builds batches from the data.
- **TrainingArguments**: sets all the training arguments in one place.
- **Trainer**: runs the training. It plays the role of model.fit() in TensorFlow/Keras, or of a hand-written training loop in PyTorch.
- **ModelForQuestionAnswering**: configures the model for extractive question-answering tasks (such as SQuAD) by adding a head that predicts the start and end positions of the answer.
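To make the role of the data collator concrete, here is a toy sketch of what collation does: it takes a list of per-example dictionaries and regroups them into one dictionary of per-field batches. toy_collate and the token ids below are hypothetical and for illustration only; the collators shipped with transformers additionally convert each field to tensors.

```python
def toy_collate(features):
    """Regroup a list of per-example dicts into one dict of per-field lists."""
    return {key: [feature[key] for feature in features] for key in features[0]}


# Made-up, already padded examples in the shape produced by preprocessing.
features = [
    {"input_ids": [101, 2040, 102, 3000, 102, 0], "start_positions": 3, "end_positions": 3},
    {"input_ids": [101, 2054, 102, 4000, 4001, 102], "start_positions": 3, "end_positions": 4},
]

batch = toy_collate(features)
print(batch["start_positions"], batch["end_positions"])
```

Since preprocessing pads every example to max_length, no extra padding is needed at collation time in this notebook.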

In [24]:
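from transformers import AutoModelForQuestionAnswering, DefaultDataCollator, TrainingArguments, Trainer

# Depending on available compute, "distilbert-base-uncased" can be used instead (load the matching tokenizer too).
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")

# data_collator is passed to Trainer below. Examples are already padded to max_length,
# so the default collator, which only batches them, is enough.
data_collator = DefaultDataCollator()


# Gather the training arguments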
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

#Collect your model, training args, dataset, data collator, tokenizer in Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_squad["train"],
    eval_dataset=tokenized_squad["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

Downloading:   0%|          | 0.00/420M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForQuestionAnswering: ['cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased a

In [None]:
trainer.train()

***** Running training *****
  Num examples = 87599
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 16425
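The step count in this log can be reproduced by hand. The sketch below assumes a single device and that the final, partial batch of each epoch is kept; the other values are read from the log above.

```python
import math

num_examples = 87599       # "Num examples" in the log
batch_size = 16            # "Total train batch size" in the log
grad_accum_steps = 1       # "Gradient Accumulation steps" in the log
num_epochs = 3             # "Num Epochs" in the log

# One optimizer step per batch (no accumulation); the last batch is partial.
batches_per_epoch = math.ceil(num_examples / batch_size)
steps_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)
```

With these numbers each epoch takes 5475 steps, which gives the 16425 total optimization steps reported.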


Epoch,Training Loss,Validation Loss
