# Fine-tuning BERT for intent classification

## Environment setup

Install the Hugging Face Transformers and Datasets libraries.

Reference: [HuggingFace Transformers](https://huggingface.co/docs)

In [7]:
!pip install -q transformers datasets


## Load dataset

Mount Google Drive so that the uploaded data can be loaded.


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# Adjust this to your own path.
!ls "/content/drive/MyDrive/Colab Notebooks/챗폿 프로젝트/2주차_수업자료_학생용/finetuning"

Fine_tuning_BERT_for_text_classification.ipynb	preprocess_data_실습.ipynb
Inference_and_Demo.ipynb			preprocess_data.ipynb
중앙정보_버트_미세조정_최종본.key		split_data.ipynb
original.csv


## Load the pre-split train, eval, and test CSV files

In [13]:
# Adjust these to your own paths.
train_data = "/content/drive/MyDrive/Colab Notebooks/챗폿 프로젝트/4team_0917/processed_train.csv"
eval_data = "/content/drive/MyDrive/Colab Notebooks/챗폿 프로젝트/4team_0917/processed_eval.csv"
test_data = "/content/drive/MyDrive/Colab Notebooks/챗폿 프로젝트/4team_0917/processed_test.csv"

### Load the train, eval, and test files with `datasets`

In [14]:
from datasets import load_dataset

dataset = load_dataset('csv', data_files={'train': train_data, 'eval': eval_data, 'test': test_data})




### Confirm that the train, eval, and test splits are loaded
For the assignment, split the data into train / validation / test sets.

In [15]:
dataset

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'utt', '날씨 묻기', '관광지 추천', '숙소 추천', '맛집 추천', '인사', '소개', '활동 추천', '카페 추천'],
        num_rows: 4496
    })
    eval: Dataset({
        features: ['Unnamed: 0', 'utt', '날씨 묻기', '관광지 추천', '숙소 추천', '맛집 추천', '인사', '소개', '활동 추천', '카페 추천'],
        num_rows: 562
    })
    test: Dataset({
        features: ['Unnamed: 0', 'utt', '날씨 묻기', '관광지 추천', '숙소 추천', '맛집 추천', '인사', '소개', '활동 추천', '카페 추천'],
        num_rows: 562
    })
})

Check the first example in the train split.

In [16]:
example = dataset['train'][0]
example

{'Unnamed: 0': 5291,
 'utt': '한림 여름에 할 만한 것 뭐있어?',
 '날씨 묻기': 0.0,
 '관광지 추천': 0.0,
 '숙소 추천': 0.0,
 '맛집 추천': 0.0,
 '인사': 0.0,
 '소개': 0.0,
 '활동 추천': 1.0,
 '카페 추천': 0.0}

Each example consists of an index, an utterance, and one column per label.

Collect the label names in a separate list and map each label to an integer.

Store the mappings as dictionaries.

In [17]:
labels = [label for label in dataset['train'].features.keys() if label not in ['Unnamed: 0', 'utt']]
id2label = {idx:label for idx, label in enumerate(labels)}
label2id = {label:idx for idx, label in enumerate(labels)}
print(labels)
print(id2label)
print(label2id)

['날씨 묻기', '관광지 추천', '숙소 추천', '맛집 추천', '인사', '소개', '활동 추천', '카페 추천']
{0: '날씨 묻기', 1: '관광지 추천', 2: '숙소 추천', 3: '맛집 추천', 4: '인사', 5: '소개', 6: '활동 추천', 7: '카페 추천'}
{'날씨 묻기': 0, '관광지 추천': 1, '숙소 추천': 2, '맛집 추천': 3, '인사': 4, '소개': 5, '활동 추천': 6, '카페 추천': 7}


## Preprocess data

BERT takes `input_ids` rather than raw text as input, so the text is first tokenized with a pretrained tokenizer.

This example uses `AutoTokenizer` for tokenization.

In [18]:
from transformers import AutoTokenizer
import numpy as np

tokenizer = AutoTokenizer.from_pretrained("klue/bert-base")

def preprocess_data(examples):
  # Take a batch of utterances.
  text = examples["utt"]
  # Tokenize, padding or truncating every utterance to 128 tokens.
  encoding = tokenizer(text, padding="max_length", truncation=True, max_length=128)
  # Collect the label columns for the batch.
  labels_batch = {k: examples[k] for k in examples.keys() if k in labels}
  # Allocate a (batch_size, num_labels) matrix of zeros.
  labels_matrix = np.zeros((len(text), len(labels)))
  # Fill one column per label.
  for idx, label in enumerate(labels):
    labels_matrix[:, idx] = labels_batch[label]

  encoding["labels"] = labels_matrix.tolist()
  
  return encoding
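
The label step in `preprocess_data` turns per-label columns into one label vector per utterance. The sketch below shows the same transposition in plain Python on a made-up two-example batch. The utterances and values are hypothetical and only three of the eight labels are used.

```python
# Toy batch in the column-oriented layout that a batched map function receives.
# The utterances and label values are invented for illustration.
toy_labels = ["날씨 묻기", "관광지 추천", "활동 추천"]
toy_examples = {
    "utt": ["utterance a", "utterance b"],
    "날씨 묻기": [1.0, 0.0],
    "관광지 추천": [0.0, 0.0],
    "활동 추천": [0.0, 1.0],
}

# One column per label, in the fixed order given by `toy_labels`.
columns = [toy_examples[name] for name in toy_labels]

# Transpose the columns so that each row is the label vector of one utterance.
toy_labels_matrix = [list(row) for row in zip(*columns)]
print(toy_labels_matrix)
```

Each row lines up with the utterance at the same position in `utt`, which is what the `labels` field of the encoded dataset holds.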


In [19]:
text = dataset['train']["utt"]
encoding = tokenizer(text, padding="max_length", truncation=True, max_length=64)
dataset['train'].column_names
labels_batch = {k: dataset['train'][k] for k in dataset['train'].column_names if k in labels}
labels_matrix = np.zeros((len(text), len(labels)))
for idx, label in enumerate(labels):
  labels_matrix[:, idx] = labels_batch[label]

encoding["labels"] = labels_matrix.tolist()

In [20]:
encoding[0]

Encoding(num_tokens=64, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

Encode the whole dataset:

In [21]:
encoded_dataset = dataset.map(preprocess_data, batched=True, remove_columns=dataset['train'].column_names)


In [22]:
encoded_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 4496
    })
    eval: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 562
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 562
    })
})

Check the result:

In [24]:
example = encoded_dataset['train'][0]
print(example.keys())
print(example)

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'labels'])
{'input_ids': [2, 17711, 4565, 2170, 1892, 19341, 575, 1097, 2689, 2051, 35, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [25]:
tokenizer.decode(example['input_ids'])

'[CLS] 한림 여름에 할 만한 것 뭐있어? [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]'
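
The padded output above can be imitated by hand. The following is a simplified sketch of what `padding="max_length"` with `truncation=True` produces, not the tokenizer's own implementation. The special-token ids (2 for `[CLS]`, 3 for `[SEP]`, 0 for `[PAD]`) are read off the output above, and the helper name `pad_to_max_length` is hypothetical.

```python
PAD_ID, CLS_ID, SEP_ID = 0, 2, 3  # ids as they appear in the encoded example above


def pad_to_max_length(token_ids, max_length):
    """Truncate, wrap in [CLS]/[SEP], then pad; return (input_ids, attention_mask)."""
    # Leave room for the two special tokens when truncating.
    body = token_ids[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    # Real tokens get mask 1, padding gets mask 0.
    attention_mask = [1] * len(input_ids)
    n_pad = max_length - len(input_ids)
    return input_ids + [PAD_ID] * n_pad, attention_mask + [0] * n_pad


# Word-piece ids of the first training utterance, without special tokens.
word_piece_ids = [17711, 4565, 2170, 1892, 19341, 575, 1097, 2689, 2051, 35]
input_ids, attention_mask = pad_to_max_length(word_piece_ids, max_length=128)
print(len(input_ids), sum(attention_mask))
```

The attention mask tells the model to ignore the `[PAD]` positions, so only the first 12 positions of this example contribute to the prediction.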

In [26]:
example['labels']

[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0]

In [32]:
[id2label[idx] for idx, label in enumerate(example['labels']) if label == 1.0]

['활동 추천']

Convert the encoded dataset to PyTorch format.
[Reference](https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/main_classes#datasets.Dataset.set_format)

In [33]:
encoded_dataset.set_format("torch")

## Define the model

To define the model, import the `AutoModelForSequenceClassification` class provided by `transformers`.

Loading the pretrained weights adds a text classification layer on top of the encoder, and that layer starts with randomly initialized weights.

In [23]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("klue/bert-base", 
                                                           num_labels=len(labels),
                                                           id2label=id2label,
                                                           label2id=label2id)


Some weights of the model checkpoint at klue/bert-base were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized

## Train the model

Training uses the Trainer API provided by Hugging Face. This requires defining `TrainingArguments` and a `Trainer`.

* `TrainingArguments` : [Documentation](https://huggingface.co/transformers/main_classes/trainer.html#trainingarguments)
* `Trainer` : [Reference](https://huggingface.co/transformers/main_classes/trainer.html#id1)

In [27]:
batch_size = 1
metric_name = "acc"

In [28]:
from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    "bert-finetuned",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    #push_to_hub=True,
)

Because `metric_for_best_model` is set to `acc`, a `compute_metrics` function that returns this metric must be defined.

In [29]:
from sklearn.metrics import accuracy_score
from transformers import EvalPrediction
import torch
    
# source: https://jesusleal.io/2021/04/21/Longformer-multilabel-classification/
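def acc_metrics(predictions, labels):
    # `predictions` holds the raw logits returned by the model.
    probs = torch.Tensor(predictions)
    # Build one-hot predictions: set the highest-scoring label in each row to 1.
    y_pred = np.zeros(probs.shape)
    y_pred[np.arange(len(probs)), probs.argmax(1)] = 1
    # Compute accuracy; a row counts as correct only if it matches the label row exactly.
    y_true = labels
    accuracy = accuracy_score(y_true, y_pred)
    # Return the metrics as a dictionary.
    metrics = {'acc': accuracy}
    return metrics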

def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions, 
            tuple) else p.predictions
    result = acc_metrics(
        predictions=preds, 
        labels=p.label_ids)
    return result
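
To make the metric concrete, here is a dependency-free sketch of the same idea: take the argmax of each logit row, turn it into a one-hot vector, and count a row as correct only when it equals the label vector. The logits and labels below are invented, and the helper names are hypothetical.

```python
def one_hot_argmax(logits_row):
    """Return a one-hot list with 1.0 at the position of the largest logit."""
    best = max(range(len(logits_row)), key=lambda i: logits_row[i])
    return [1.0 if i == best else 0.0 for i in range(len(logits_row))]


def exact_match_accuracy(logits, targets):
    """Fraction of rows whose one-hot prediction equals the target row."""
    hits = sum(one_hot_argmax(row) == list(target) for row, target in zip(logits, targets))
    return hits / len(targets)


# Made-up logits for three utterances and three labels.
toy_logits = [[-0.5, 2.0, 0.1], [1.5, -0.2, 0.3], [0.0, 0.1, 0.9]]
toy_targets = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
toy_accuracy = exact_match_accuracy(toy_logits, toy_targets)
print(toy_accuracy)
```

The first and third rows match their targets and the second does not, so two of three rows count as correct. Since every utterance in this dataset carries exactly one label, exact-match accuracy coincides with ordinary single-label accuracy.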

In [34]:
encoded_dataset['train'][0]['labels'].type()

'torch.FloatTensor'

In [35]:
encoded_dataset['train']['input_ids'][0]

tensor([    2, 17711,  4565,  2170,  1892, 19341,   575,  1097,  2689,  2051,
           35,     3,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0])

In [36]:
outputs = model(input_ids=encoded_dataset['train']['input_ids'][0].unsqueeze(0), labels=encoded_dataset['train'][0]['labels'].unsqueeze(0))
outputs

SequenceClassifierOutput(loss=tensor(0.6334, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), logits=tensor([[-0.4682, -0.3526,  0.3932, -0.2371, -0.6769, -0.0448,  0.1078,  0.2663]],
       grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
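
The `BinaryCrossEntropyWithLogitsBackward0` in the output shows that the float label vectors are scored with a per-label binary cross-entropy. As a rough check, the sketch below recomputes that loss in plain Python from the rounded logits printed above; because of the rounding it should land close to, but not necessarily exactly on, the reported 0.6334. The function name `bce_with_logits` is hypothetical.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over all label positions, computed from raw logits."""
    total = 0.0
    for x, y in zip(logits, targets):
        # Per-element loss: log(1 + e^x) - y * x,
        # which equals -[y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))].
        total += math.log1p(math.exp(x)) - y * x
    return total / len(logits)


# Rounded logits and the label vector of the first training example, copied from above.
first_logits = [-0.4682, -0.3526, 0.3932, -0.2371, -0.6769, -0.0448, 0.1078, 0.2663]
first_targets = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0]
first_loss = bce_with_logits(first_logits, first_targets)
print(f"{first_loss:.3f}")
```

A loss near log(2) ≈ 0.69 per label is what an untrained classification layer with near-zero logits produces, which fits the randomly initialized head.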

Start training:

In [37]:
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [38]:
trainer.train()

***** Running training *****
  Num examples = 4496
  Num Epochs = 5
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 1
  Gradient Accumulation steps = 1
  Total optimization steps = 22480
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.




***** Running Evaluation *****
  Num examples = 562
  Batch size = 1
Saving model checkpoint to bert-finetuned/checkpoint-4496
Configuration saved in bert-finetuned/checkpoint-4496/config.json
Model weights saved in bert-finetuned/checkpoint-4496/pytorch_model.bin
tokenizer config file saved in bert-finetuned/checkpoint-4496/tokenizer_config.json
Special tokens file saved in bert-finetuned/checkpoint-4496/special_tokens_map.json


| Epoch | Training Loss | Validation Loss | Acc |
|---|---|---|---|
| 1 | 0.0013 | 0.000849 | 1.0 |
| 2 | 0.0002 | 0.000117 | 1.0 |
| 3 | 0.0 | 2e-05 | 1.0 |
| 4 | 0.0 | 5e-06 | 1.0 |
| 5 | 0.0 | 2e-06 | 1.0 |


***** Running Evaluation *****
  Num examples = 562
  Batch size = 1
Saving model checkpoint to bert-finetuned/checkpoint-8992
Configuration saved in bert-finetuned/checkpoint-8992/config.json
Model weights saved in bert-finetuned/checkpoint-8992/pytorch_model.bin
tokenizer config file saved in bert-finetuned/checkpoint-8992/tokenizer_config.json
Special tokens file saved in bert-finetuned/checkpoint-8992/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 562
  Batch size = 1
Saving model checkpoint to bert-finetuned/checkpoint-13488
Configuration saved in bert-finetuned/checkpoint-13488/config.json
Model weights saved in bert-finetuned/checkpoint-13488/pytorch_model.bin
tokenizer config file saved in bert-finetuned/checkpoint-13488/tokenizer_config.json
Special tokens file saved in bert-finetuned/checkpoint-13488/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 562
  Batch size = 1
Saving model checkpoint to bert-finetuned/checkpoint-17984

TrainOutput(global_step=22480, training_loss=0.003954472609139396, metrics={'train_runtime': 1953.7037, 'train_samples_per_second': 11.506, 'train_steps_per_second': 11.506, 'total_flos': 1478763790172160.0, 'train_loss': 0.003954472609139396, 'epoch': 5.0})
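
The step counts in the log follow directly from the dataset size and the training arguments. The sketch below redoes that arithmetic, assuming a single device and no gradient accumulation, as the log reports. It is a simplification of how the Trainer counts steps, not its actual code.

```python
import math

num_examples = 4496     # "Num examples" in the training log
batch_size = 1          # per-device batch size; one device assumed
grad_accum_steps = 1    # "Gradient Accumulation steps" in the training log
num_epochs = 5

# One optimizer step per (batch_size * grad_accum_steps) examples.
steps_per_epoch = math.ceil(num_examples / (batch_size * grad_accum_steps))
total_steps = steps_per_epoch * num_epochs

# With save_strategy="epoch", a checkpoint is written at the end of each epoch.
checkpoint_steps = [steps_per_epoch * epoch for epoch in range(1, num_epochs + 1)]
print(steps_per_epoch, total_steps, checkpoint_steps)
```

With `batch_size = 1` every example is its own optimizer step, which is why the run needs 22480 steps. A larger batch size would cut the step count proportionally.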

## Evaluate

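After training, evaluate the model. `trainer.evaluate()` runs on the `eval_dataset` passed to the `Trainer`, which in this notebook is the `test` split.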

In [39]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 562
  Batch size = 1


{'eval_loss': 0.0008487121667712927,
 'eval_acc': 1.0,
 'eval_runtime': 9.7736,
 'eval_samples_per_second': 57.502,
 'eval_steps_per_second': 57.502,
 'epoch': 5.0}
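
The `eval_loss` here matches the epoch-1 validation loss (0.000849) rather than the epoch-5 value. One reading consistent with `load_best_model_at_end=True` is that every epoch tied at `acc = 1.0`, so the first checkpoint stayed the best one and was reloaded. The sketch below illustrates that selection rule under the assumption that a later checkpoint replaces the current best only on a strict improvement. It is an illustration with a hypothetical helper `pick_best`, not the Trainer's implementation.

```python
def pick_best(history, greater_is_better=True):
    """Return the step of the best metric value; on ties the earliest step wins."""
    best_step, best_value = None, None
    for step, value in history:
        if best_value is None:
            improved = True
        elif greater_is_better:
            improved = value > best_value   # strict: a tie does not replace the best
        else:
            improved = value < best_value
        if improved:
            best_step, best_value = step, value
    return best_step


# Accuracy per checkpoint step, taken from the training table (all epochs tie at 1.0).
acc_by_checkpoint = [(4496, 1.0), (8992, 1.0), (13488, 1.0), (17984, 1.0), (22480, 1.0)]
best_checkpoint_step = pick_best(acc_by_checkpoint)
print(best_checkpoint_step)
```

If the later, lower-loss checkpoints are preferred, selecting on validation loss instead of accuracy would break the tie.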

## Save the trained model to Drive

In [42]:
# Set this to the path where you want to save the model.
trainer.save_model('/content/drive/MyDrive/Colab Notebooks/챗폿 프로젝트/2주차_수업자료_학생용/finetuning/') 

Saving model checkpoint to /content/drive/MyDrive/Colab Notebooks/챗폿 프로젝트/2주차_수업자료_학생용/finetuning/
Configuration saved in /content/drive/MyDrive/Colab Notebooks/챗폿 프로젝트/2주차_수업자료_학생용/finetuning/config.json
Model weights saved in /content/drive/MyDrive/Colab Notebooks/챗폿 프로젝트/2주차_수업자료_학생용/finetuning/pytorch_model.bin
tokenizer config file saved in /content/drive/MyDrive/Colab Notebooks/챗폿 프로젝트/2주차_수업자료_학생용/finetuning/tokenizer_config.json
Special tokens file saved in /content/drive/MyDrive/Colab Notebooks/챗폿 프로젝트/2주차_수업자료_학생용/finetuning/special_tokens_map.json


👉 The next notebook is "Inference_and_Demo.ipynb".