# BERT train

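Date: 210116 \
Author: 조진욱\
Goal: Fine-tune a Hugging Face (HF) BERT model on a small SQuAD dataset\
Notes: 
1. apex is used to speed up training.
2. squad_evaluate in the transformers package takes far too long to run. The cause is still unknown.
- When I tested it separately, I recall it was not that slow.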

Reference code:
https://github.com/huggingface/transformers/blob/master/examples/legacy/question-answering/run_squad.py

Where possible, modifications to the reference code will be noted here (TODO).

In [1]:
import numpy as np
import torch
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
from tqdm.autonotebook import tqdm, trange
import os
import random

In [2]:
import transformers
from transformers import (
    BertForQuestionAnswering,
    BertTokenizer,
)
from transformers.data.metrics.squad_metrics import (
    compute_predictions_logits,
    squad_evaluate,
)

from transformers.data.processors.squad import SquadResult, SquadProcessor, squad_convert_examples_to_features

In [3]:
import config as cfg
from utils import load_and_cache_examples, set_seed
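The `config` module is not shown in this notebook. As a rough guide, the sketch below lists the attributes the later cells read from `cfg`. Only the attribute names come from the notebook. The values marked "from the logs" are inferred from the outputs further down, and every other value is a placeholder assumption, not the author's actual setting.

```python
# Hypothetical config.py: attribute names are taken from the notebook cells,
# values are either inferred from the logged outputs or plain placeholders.

# Inferred from the logs in this notebook
model_name = "bert-base-cased"       # named in the checkpoint warning
data_dir = "./data"                  # assumption: the path printed by "Creating features ..."
train_batch_size = 8                 # 16 total / 2 accumulation steps
eval_batch_size = 8                  # "Batch size" in the evaluation log
gradient_accumulation_steps = 2
num_train_epochs = 5.0

# Placeholders (assumptions, not the author's settings)
tokenizer_name = "bert-base-cased"
device = "cuda"
output_dir = "./output"
seed = 42
learning_rate = 3e-5
epsilon = 1e-6
weight_decay = 0.01
max_grad_norm = 1.0
n_best_size = 20
max_answer_length = 30
verbose_logging = False
null_score_diff_threshold = 0.0

# Effective batch size per optimizer step, as printed by the training cell
print(train_batch_size * gradient_accumulation_steps)  # 16
```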

In [4]:
import apex
from apex import amp

In [8]:
mode = 'train'

In [9]:
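# BertForQuestionAnswering = pretrained BERT encoder + one extra output layer for question answering.
# The extra layer has no pretrained weights, so loading prints warnings (not errors) like the ones below:
# the unused cls.* pretraining heads are dropped and the new layer is randomly initialized.
# If this is turned into an assignment later, asking each student to customize this part could work well.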
model = BertForQuestionAnswering.from_pretrained(cfg.model_name)
tokenizer = BertTokenizer.from_pretrained(cfg.tokenizer_name)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForQuestionAnswering: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-cased and a

In [10]:
model = model.to(cfg.device)

In [11]:
train_dataset = load_and_cache_examples(cfg, tokenizer, mode_or_filename=mode, output_examples=False)

  0%|          | 0/200 [00:00<?, ?it/s]

Creating features from dataset file at %s ./data


100%|██████████| 200/200 [00:09<00:00, 20.72it/s]
convert squad examples to features: 100%|██████████| 38708/38708 [00:57<00:00, 668.00it/s] 
add example index and unique id: 100%|██████████| 38708/38708 [00:00<00:00, 1410098.66it/s]


In [12]:
optimizer = apex.optimizers.FusedLAMB(model.parameters(),
                                lr = cfg.learning_rate,
                                eps=cfg.epsilon,
                                weight_decay=cfg.weight_decay,
                                max_grad_norm=cfg.max_grad_norm)
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
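The `loss_scale : dynamic` setting is what produces the "Gradient overflow. Skipping step" messages in the training log below. The following is a toy model of dynamic loss scaling, not Apex's actual implementation: the class name and the defaults (initial scale 2^16, factor 2, growth after 2000 overflow-free steps) are assumptions chosen to mimic commonly cited Apex defaults.

```python
class DynamicLossScaler:
    """Toy dynamic loss scaler (illustrative only, not the Apex class)."""

    def __init__(self, init_scale=2.0 ** 16, factor=2.0, growth_interval=2000):
        self.scale = init_scale
        self.factor = factor
        self.growth_interval = growth_interval  # assumed default
        self._clean_steps = 0

    def update(self, overflow):
        """Return True if the optimizer step should be applied."""
        if overflow:
            # inf/NaN gradients: skip the step and shrink the scale
            self.scale /= self.factor
            self._clean_steps = 0
            return False
        self._clean_steps += 1
        if self._clean_steps == self.growth_interval:
            # A long run without overflow: try a larger scale again
            self.scale *= self.factor
            self._clean_steps = 0
        return True


scaler = DynamicLossScaler()
for _ in range(4):          # four overflows in a row, as in the first epoch's log
    scaler.update(overflow=True)
print(scaler.scale)         # 4096.0
```

Under this model, a few skipped steps at the start of training are expected while the scale settles, and an occasional later overflow follows each time the scale grows back.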


In [13]:
dataset, examples, features = load_and_cache_examples(cfg, tokenizer, mode='dev', output_examples=True)
eval_sampler = SequentialSampler(dataset)
eval_dataloader = DataLoader(dataset, sampler=eval_sampler, batch_size=cfg.eval_batch_size)

  5%|▌         | 1/20 [00:00<00:02,  6.71it/s]

Creating features from dataset file at %s ./data


100%|██████████| 20/20 [00:01<00:00, 18.75it/s]
convert squad examples to features: 100%|██████████| 4639/4639 [00:06<00:00, 693.77it/s]
add example index and unique id: 100%|██████████| 4639/4639 [00:00<00:00, 1421180.06it/s]


In [14]:
def evaluate(model, tokenizer):
    print("***** Running evaluation *****")
    print("  Num examples = ", len(dataset))
    print("  Batch size = ", cfg.eval_batch_size)
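    all_results = []
    model.eval()  # switch to eval mode once, before the loop
    for batch in tqdm(eval_dataloader, desc="Evaluating"):
        batch = tuple(t.to(cfg.device) for t in batch)

        with torch.no_grad():
            inputs = {
                "input_ids": batch[0],
                "attention_mask": batch[1],
                "token_type_ids": batch[2],
            }

            feature_indices = batch[3]
            outputs = model(**inputs)

        # Convert the logits to plain Python lists on the CPU, as to_list() does in the reference script.
        # Passing GPU tensors to the post-processing is a likely cause of the slow evaluation (not verified here).
        start_logits_batch = outputs.start_logits.detach().cpu().tolist()
        end_logits_batch = outputs.end_logits.detach().cpu().tolist()

        for i, feature_index in enumerate(feature_indices):
            eval_feature = features[feature_index.item()]
            unique_id = int(eval_feature.unique_id)

            result = SquadResult(unique_id, start_logits_batch[i], end_logits_batch[i])
            all_results.append(result)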
            
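    predictions = compute_predictions_logits(
        examples,
        features,
        all_results,
        cfg.n_best_size,
        cfg.max_answer_length,
        True,   # do_lower_case (note: the checkpoint used here is cased)
        None,   # output_prediction_file
        None,   # output_nbest_file
        None,   # output_null_log_odds_file
        cfg.verbose_logging,
        False,  # version_2_with_negative
        cfg.null_score_diff_threshold,
        tokenizer,
    )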
    results = squad_evaluate(examples, predictions)
    return results
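To make the roles of `cfg.n_best_size` and `cfg.max_answer_length` more concrete, here is a simplified sketch of how a span can be picked from start and end logits. It is not the library's implementation: `best_span` is a hypothetical helper, and the real `compute_predictions_logits` additionally handles several features per example, maps token indices back to the original text, and supports null answers.

```python
def best_span(start_logits, end_logits, n_best_size=20, max_answer_length=30):
    """Return (score, start_index, end_index) of the best valid span, or None."""

    def top_indices(logits):
        # Indices of the n_best_size largest logits, best first
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        return order[:n_best_size]

    best = None
    for s in top_indices(start_logits):
        for e in top_indices(end_logits):
            # Skip spans that end before they start or are too long
            if e < s or e - s + 1 > max_answer_length:
                continue
            score = start_logits[s] + end_logits[e]
            if best is None or score > best[0]:
                best = (score, s, e)
    return best


start = [0.1, 2.0, 0.3, 0.2]
end = [0.0, 0.5, 3.0, 0.1]
print(best_span(start, end))  # (5.0, 1, 2)
```

Note that the sketch sorts plain Python floats. Sorting a list of individual GPU tensor elements instead would require a device round trip per comparison, which is one reason to convert logits to lists before post-processing.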
    

In [15]:
train_sampler = RandomSampler(train_dataset)
train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=cfg.train_batch_size)

t_total = len(train_dataloader) // cfg.gradient_accumulation_steps * cfg.num_train_epochs

global_step = 1
tr_loss = 0.0
best_metrics = {'f1': 0, 'exact': 0, 'epoch': -1}
model.zero_grad()
# Added here for reproducibility
set_seed(cfg.seed)
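The `t_total` formula can be checked against the numbers printed by the training cell. The sketch below redoes the arithmetic with plain integers. The per-step batch size of 8 is an assumption derived from the logged total of 16 and the 2 accumulation steps.

```python
import math

num_examples = 39189      # "Num examples" in the training log
train_batch_size = 8      # assumption: 16 total / 2 accumulation steps
accumulation_steps = 2
num_epochs = 5

batches_per_epoch = math.ceil(num_examples / train_batch_size)
steps_per_epoch = batches_per_epoch // accumulation_steps
t_total = steps_per_epoch * num_epochs
leftover_batches = batches_per_epoch % accumulation_steps

print(batches_per_epoch, steps_per_epoch, t_total, leftover_batches)
# 4899 2449 12245 1
```

Because 4899 is odd, one mini-batch per epoch is left over after the last optimizer step. With the loop as written in this notebook, its gradients appear to carry over into the first step of the next epoch, since `model.zero_grad()` is only called after an optimizer step.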

In [16]:
# Train!
print("***** Running training *****")
print("  Num examples = ", len(train_dataset))
print("  Num Epochs = ", cfg.num_train_epochs)
print(
    "  Total train batch size = ",
    cfg.train_batch_size
    * cfg.gradient_accumulation_steps
)
print("  Gradient Accumulation steps = ", cfg.gradient_accumulation_steps)
print("  Total optimization steps = ", t_total)

for now_epoch in trange(int(cfg.num_train_epochs), desc="Epoch:"):

    for step, batch in enumerate(tqdm(train_dataloader, desc="Iteration:")):
        model.train()
        batch = tuple(t.to(cfg.device) for t in batch)
                                 
        inputs = {
            "input_ids": batch[0],
            "attention_mask": batch[1],
            "token_type_ids": batch[2],
            "start_positions": batch[3],
            "end_positions": batch[4],
        }

        outputs = model(**inputs)
        loss = outputs[0]

        if cfg.gradient_accumulation_steps > 1:
            loss = loss / cfg.gradient_accumulation_steps

        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()

        tr_loss += loss.item()
        if (step + 1) % cfg.gradient_accumulation_steps == 0:
            # With Apex AMP, clip the parameters that the optimizer actually updates
            torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), cfg.max_grad_norm)

            optimizer.step()
            model.zero_grad()
            global_step += 1

    results = evaluate(model, tokenizer)

    if best_metrics['f1'] < results['f1']:
        best_metrics['f1'] = results['f1']
        best_metrics['exact'] = results['exact']
        best_metrics['epoch'] = now_epoch
        model.save_pretrained(cfg.output_dir)

    for key, value in results.items():
        print("dev eval_{}: {}".format(key, value))

    for key, value in best_metrics.items():
        print("dev best eval_{}: {}".format(key, value))


***** Running training *****
  Num examples =  39189
  Num Epochs =  5.0
  Total train batch size =  16
  Gradient Accumulation steps =  2
  Total optimization steps =  12245.0


Epoch::   0%|          | 0/5 [00:00<?, ?it/s]

Iteration::   0%|          | 0/4899 [00:00<?, ?it/s]

Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8192.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0
***** Running evaluation *****
  Num examples = %d 4654
  Batch size = %d 8


Evaluating:   0%|          | 0/582 [00:00<?, ?it/s]

dev eval_exact: 66.99719767191205
dev eval_f1: 79.9423909806456
dev eval_total: 4639
dev eval_HasAns_exact: 66.99719767191205
dev eval_HasAns_f1: 79.9423909806456
dev eval_HasAns_total: 4639
dev eval_best_exact: 66.99719767191205
dev eval_best_exact_thresh: 0.0
dev eval_best_f1: 79.9423909806456
dev eval_best_f1_thresh: 0.0
dev best eval_f1: 79.9423909806456
dev best eval_exact: 66.99719767191205
dev best eval_epoch: 0


Iteration::   0%|          | 0/4899 [00:00<?, ?it/s]

Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8192.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0
***** Running evaluation *****
  Num examples = %d 4654
  Batch size = %d 8


Evaluating:   0%|          | 0/582 [00:00<?, ?it/s]

dev eval_exact: 70.29532226773011
dev eval_f1: 82.54096893361755
dev eval_total: 4639
dev eval_HasAns_exact: 70.29532226773011
dev eval_HasAns_f1: 82.54096893361755
dev eval_HasAns_total: 4639
dev eval_best_exact: 70.29532226773011
dev eval_best_exact_thresh: 0.0
dev eval_best_f1: 82.54096893361755
dev eval_best_f1_thresh: 0.0
dev best eval_f1: 82.54096893361755
dev best eval_exact: 70.29532226773011
dev best eval_epoch: 1


Iteration::   0%|          | 0/4899 [00:00<?, ?it/s]

Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0
***** Running evaluation *****
  Num examples = %d 4654
  Batch size = %d 8


Evaluating:   0%|          | 0/582 [00:00<?, ?it/s]

dev eval_exact: 70.61866781633972
dev eval_f1: 82.4030336016294
dev eval_total: 4639
dev eval_HasAns_exact: 70.61866781633972
dev eval_HasAns_f1: 82.4030336016294
dev eval_HasAns_total: 4639
dev eval_best_exact: 70.61866781633972
dev eval_best_exact_thresh: 0.0
dev eval_best_f1: 82.4030336016294
dev eval_best_f1_thresh: 0.0
dev best eval_f1: 82.54096893361755
dev best eval_exact: 70.29532226773011
dev best eval_epoch: 1


Iteration::   0%|          | 0/4899 [00:00<?, ?it/s]

Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2048.0
***** Running evaluation *****
  Num examples = %d 4654
  Batch size = %d 8


Evaluating:   0%|          | 0/582 [00:00<?, ?it/s]

dev eval_exact: 71.07135158439318
dev eval_f1: 82.74566015061633
dev eval_total: 4639
dev eval_HasAns_exact: 71.07135158439318
dev eval_HasAns_f1: 82.74566015061633
dev eval_HasAns_total: 4639
dev eval_best_exact: 71.07135158439318
dev eval_best_exact_thresh: 0.0
dev eval_best_f1: 82.74566015061633
dev eval_best_f1_thresh: 0.0
dev best eval_f1: 82.74566015061633
dev best eval_exact: 71.07135158439318
dev best eval_epoch: 3


Iteration::   0%|          | 0/4899 [00:00<?, ?it/s]

Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8192.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0
***** Running evaluation *****
  Num examples = %d 4654
  Batch size = %d 8


Evaluating:   0%|          | 0/582 [00:00<?, ?it/s]

dev eval_exact: 70.27376589782281
dev eval_f1: 82.27344184912754
dev eval_total: 4639
dev eval_HasAns_exact: 70.27376589782281
dev eval_HasAns_f1: 82.27344184912754
dev eval_HasAns_total: 4639
dev eval_best_exact: 70.27376589782281
dev eval_best_exact_thresh: 0.0
dev eval_best_f1: 82.27344184912754
dev eval_best_f1_thresh: 0.0
dev best eval_f1: 82.74566015061633
dev best eval_exact: 71.07135158439318
dev best eval_epoch: 3
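The `eval_exact` and `eval_f1` values in the logs above are the usual SQuAD metrics: exact match and token-level F1 after answer normalization, averaged over examples and reported as percentages. The following is a minimal sketch of the per-example computation in the style of the official SQuAD evaluation script. The function names are illustrative, and the handling of multiple reference answers (taking the maximum over them) is omitted.

```python
import collections
import re
import string


def normalize_answer(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(gold, pred):
    return int(normalize_answer(gold) == normalize_answer(pred))


def f1_score(gold, pred):
    gold_tokens = normalize_answer(gold).split()
    pred_tokens = normalize_answer(pred).split()
    if not gold_tokens or not pred_tokens:
        # If either side is empty, F1 is 1 only when both are empty
        return float(gold_tokens == pred_tokens)
    common = collections.Counter(gold_tokens) & collections.Counter(pred_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("the Eiffel Tower", "Eiffel Tower"))        # 1
print(round(f1_score("Eiffel Tower in Paris", "the Eiffel Tower"), 3))  # 0.667
```

F1 gives partial credit for overlapping tokens, which is why `eval_f1` is consistently higher than `eval_exact` in the logs.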


In [2]:
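# Results 20200120
# In the current environment, one training epoch takes about 7 minutes
# Running eval takes about 16 minutes
# So each epoch takes about 23 minutes, and 23 * 5 = 115 minutes in total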
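Since evaluation takes longer than training here, it may help to time each evaluation stage separately (model inference, `compute_predictions_logits`, `squad_evaluate`) to see where the minutes go. The sketch below is one way to do that with the standard library only. The `timed` helper and the stand-in workloads are hypothetical. In the notebook, the `with` blocks would wrap the corresponding calls inside `evaluate`.

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(name):
    """Accumulate wall-clock seconds spent inside the block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = timings.get(name, 0.0) + time.perf_counter() - start


# Stand-in workloads; replace with the real evaluation stages
with timed("inference"):
    sum(range(100_000))
with timed("postprocess"):
    sorted(range(100_000), reverse=True)

for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f}s")

# Total run time from the notes above: (train + eval) minutes per epoch * epochs
train_min, eval_min, epochs = 7, 16, 5
print((train_min + eval_min) * epochs)  # 115
```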