<a href="https://colab.research.google.com/github/pu-bi/AI-industry-job-experience-for-non-majors/blob/main/2-preprocess/2_Fine_tuning_BERT_for_text_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning BERT for intent classification

## Environment setup

Install the HuggingFace Transformers and Datasets libraries.

Reference: [HuggingFace Transformers](https://huggingface.co/docs)

In [None]:
!pip install -q transformers datasets

## Load dataset

Mount Google Drive so that the uploaded data can be loaded.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Load the pre-split train, eval, and test CSV files

In [None]:
# Set these paths to match your own Drive layout.
train_data = "/content/drive/MyDrive/week2/REALLLLwk2/finetuning/processed_train.csv"
eval_data = "/content/drive/MyDrive/week2/REALLLLwk2/finetuning/processed_eval.csv"
test_data = "/content/drive/MyDrive/week2/REALLLLwk2/finetuning/processed_test.csv"

### Load the train, eval, and test files with `datasets`

In [None]:
from datasets import load_dataset

dataset = load_dataset('csv', data_files={'train': train_data, 'eval': eval_data, 'test': test_data})




### Confirm that the train, eval, and test splits are loaded
For the assignment, use the data split into train / validation / test sets.

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'utt', '날씨 묻기', '관광지 추천', '숙소 추천', '맛집 추천', '인사', '소개', '기타'],
        num_rows: 20
    })
    eval: Dataset({
        features: ['Unnamed: 0', 'utt', '날씨 묻기', '관광지 추천', '숙소 추천', '맛집 추천', '인사', '소개', '기타'],
        num_rows: 2
    })
    test: Dataset({
        features: ['Unnamed: 0', 'utt', '날씨 묻기', '관광지 추천', '숙소 추천', '맛집 추천', '인사', '소개', '기타'],
        num_rows: 3
    })
})
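The 20 / 2 / 3 split above was prepared ahead of time. If you start from a single list of rows instead, one way to produce three disjoint splits is sketched below. The function name `split_three_ways`, the fractions, and the seed are all illustrative assumptions, not part of the original notebook.

```python
import random

def split_three_ways(rows, eval_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle rows and split them into disjoint train / eval / test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_eval = max(1, round(len(shuffled) * eval_frac))
    n_test = max(1, round(len(shuffled) * test_frac))
    eval_rows = shuffled[:n_eval]
    test_rows = shuffled[n_eval:n_eval + n_test]
    train_rows = shuffled[n_eval + n_test:]
    return train_rows, eval_rows, test_rows

# Placeholder data standing in for the 25 utterances.
rows = [f"utterance {i}" for i in range(25)]
train_rows, eval_rows, test_rows = split_three_ways(rows)
print(len(train_rows), len(eval_rows), len(test_rows))
```

Each split could then be written to its own CSV file and loaded with `load_dataset` as shown earlier.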

Check the first example in the train split.

In [None]:
example = dataset['train'][0]
example

{'Unnamed: 0': 14,
 'utt': '세상에 이런일이..',
 '날씨 묻기': 0.0,
 '관광지 추천': 0.0,
 '숙소 추천': 0.0,
 '맛집 추천': 0.0,
 '인사': 0.0,
 '소개': 0.0,
 '기타': 1.0}

Each record consists of an index, an utterance, and one column per label.

We collect the label names into a list and map each label to an integer.

The mappings are stored as dictionaries.

In [None]:
labels = [label for label in dataset['train'].features.keys() if label not in ['Unnamed: 0', 'utt']]
id2label = {idx:label for idx, label in enumerate(labels)}
label2id = {label:idx for idx, label in enumerate(labels)}
print(labels)
print(id2label)
print(label2id)

['날씨 묻기', '관광지 추천', '숙소 추천', '맛집 추천', '인사', '소개', '기타']
{0: '날씨 묻기', 1: '관광지 추천', 2: '숙소 추천', 3: '맛집 추천', 4: '인사', 5: '소개', 6: '기타'}
{'날씨 묻기': 0, '관광지 추천': 1, '숙소 추천': 2, '맛집 추천': 3, '인사': 4, '소개': 5, '기타': 6}
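Since each row carries one column per label, it can be worth checking that exactly one label is active before training. The following is a minimal sketch using only the standard library. The in-memory CSV is a hypothetical sample that copies the column layout above; its second row is invented for illustration.

```python
import csv
import io

# Hypothetical two-row sample mirroring the columns shown above.
sample_csv = (
    "Unnamed: 0,utt,날씨 묻기,관광지 추천,숙소 추천,맛집 추천,인사,소개,기타\n"
    "14,세상에 이런일이..,0.0,0.0,0.0,0.0,0.0,0.0,1.0\n"
    "3,안녕하세요,0.0,0.0,0.0,0.0,1.0,0.0,0.0\n"
)

reader = csv.DictReader(io.StringIO(sample_csv))
# Every column except the index and the utterance is a label.
label_names = [c for c in reader.fieldnames if c not in ("Unnamed: 0", "utt")]
label2id = {name: idx for idx, name in enumerate(label_names)}

class_ids = []
for row in reader:
    active = [name for name in label_names if float(row[name]) == 1.0]
    if len(active) != 1:
        raise ValueError(f"expected exactly one active label, got {active}")
    class_ids.append(label2id[active[0]])

print(class_ids)  # [6, 4]
```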


## Preprocess data

BERT takes `input_ids` as input, not raw text. We therefore first tokenize the text with a pretrained tokenizer.

In this example we use `AutoTokenizer` for tokenization.

In [None]:
from transformers import AutoTokenizer
import numpy as np

tokenizer = AutoTokenizer.from_pretrained("klue/bert-base")

def preprocess_data(examples):
  # Receive a batch of texts.
  text = examples["utt"]
  # Tokenize, padding or truncating every text to 128 tokens.
  encoding = tokenizer(text, padding="max_length", truncation=True, max_length=128)
  # Collect the label columns for the batch.
  labels_batch = {k: examples[k] for k in examples.keys() if k in labels}
  # Create a zero matrix of shape (batch_size, num_labels).
  labels_matrix = np.zeros((len(text), len(labels)))
  # Fill in one column per label.
  for idx, label in enumerate(labels):
    labels_matrix[:, idx] = labels_batch[label]

  encoding["labels"] = labels_matrix.tolist()

  return encoding


Encode the dataset:

In [None]:
encoded_dataset = dataset.map(preprocess_data, batched=True, remove_columns=dataset['train'].column_names)


In [None]:
encoded_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 20
    })
    eval: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 2
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 3
    })
})

Check the result.

In [None]:
example = encoded_dataset['train'][0]
print(example.keys())
print(example)

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'labels'])
{'input_ids': [2, 3991, 2170, 3667, 2210, 2052, 18, 18, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [None]:
tokenizer.decode(example['input_ids'])

'[CLS] 세상에 이런일이.. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]'
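The decoded string shows what `padding="max_length"` does: the 9 real tokens are followed by `[PAD]` until the sequence reaches 128 tokens, and `attention_mask` marks which positions are real. The toy function below reproduces that behaviour in plain Python. `pad_or_truncate` is a hypothetical helper, and its truncation is naive: unlike the real tokenizer it does not keep `[SEP]` at the end of a truncated sequence.

```python
PAD_ID = 0  # id of [PAD] in the encoded example above

def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    kept = token_ids[:max_length]        # naive truncation (assumption)
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask

# The non-padding ids from the encoded example above.
token_ids = [2, 3991, 2170, 3667, 2210, 2052, 18, 18, 3]
input_ids, attention_mask = pad_or_truncate(token_ids, max_length=128)
print(len(input_ids), sum(attention_mask))  # 128 9
```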

In [None]:
example['labels']

[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]

In [None]:
[id2label[idx] for idx, label in enumerate(example['labels']) if label == 1.0]

['기타']

Convert the encoded data to PyTorch format.
[Reference](https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/main_classes#datasets.Dataset.set_format)

In [None]:
encoded_dataset.set_format("torch")


## Model definition

To define the model, we import `AutoModelForSequenceClassification` from `transformers`.

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("klue/bert-base", 
                                                           num_labels=len(labels),
                                                           id2label=id2label,
                                                           label2id=label2id)


Some weights of the model checkpoint at klue/bert-base were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized
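A note on the loss: `preprocess_data` stores the labels as rows of floats, and no `problem_type` is passed to the model. As far as I know, the BERT classification head in Transformers then treats the task as multi-label and applies binary cross-entropy on the logits, but treat that as an assumption to verify against your installed version. The sketch below shows what that loss computes for one 7-label row. `bce_with_logits` is an illustrative pure-Python function, not the library implementation.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over all labels, computed from raw logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        # max(z, 0) - z*y + log(1 + exp(-|z|)) is a numerically stable form
        # of -[y*log(sigmoid(z)) + (1-y)*log(1 - sigmoid(z))].
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

targets = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]  # the label row shown earlier

undecided = [0.0] * 7              # every label at probability 0.5
confident = [-5.0] * 6 + [5.0]     # strongly favours the last label

print(bce_with_logits(undecided, targets))   # equals ln 2
print(bce_with_logits(confident, targets))   # close to 0
```

An undecided model starts near ln 2 (about 0.693), so a loss below that value indicates the model has learned something about the labels.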

## Model training

We train with the Trainer API provided by Hugging Face. This requires defining `TrainingArguments` and a `Trainer`.

* `TrainingArguments` : [Docs](https://huggingface.co/transformers/main_classes/trainer.html#trainingarguments)
* `Trainer` : [Reference](https://huggingface.co/transformers/main_classes/trainer.html#id1)

In [None]:
batch_size = 1
metric_name = "acc"

In [None]:
from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    "bert-finetuned",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    #push_to_hub=True,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


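To report accuracy at each evaluation, and to let `load_best_model_at_end` select the best checkpoint by `metric_name`, we define a `compute_metrics` function.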

In [None]:
import numpy as np
from sklearn.metrics import accuracy_score
from transformers import EvalPrediction

# source: https://jesusleal.io/2021/04/21/Longformer-multilabel-classification/
def acc_metrics(predictions, labels):
    # The model returns raw logits, one row per example.
    logits = np.asarray(predictions)
    # Build a one-hot prediction: set the highest-scoring label to 1.
    y_pred = np.zeros(logits.shape)
    y_pred[np.arange(len(logits)), logits.argmax(axis=1)] = 1
    # Compute accuracy; an example counts only if its whole label row matches.
    y_true = labels
    accuracy = accuracy_score(y_true, y_pred)
    # Return the metrics as a dictionary.
    return {'acc': accuracy}

def compute_metrics(p: EvalPrediction):
    # Some models return a tuple; the logits are its first element.
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    return acc_metrics(predictions=preds, labels=p.label_ids)
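To make the metric concrete, here is a dependency-free sketch of what `acc_metrics` measures: the highest-scoring label of each row becomes a one-hot prediction, and an example counts as correct only when that row equals the true row. The logits below are hypothetical values chosen for illustration, and the helper names are not from the notebook.

```python
def one_hot_argmax(logits_row):
    """Return a one-hot list with 1.0 at the position of the largest logit."""
    best = max(range(len(logits_row)), key=lambda i: logits_row[i])
    return [1.0 if i == best else 0.0 for i in range(len(logits_row))]

def subset_accuracy(logits, y_true):
    """Fraction of rows whose one-hot prediction equals the true row exactly."""
    y_pred = [one_hot_argmax(row) for row in logits]
    hits = sum(pred == true for pred, true in zip(y_pred, y_true))
    return hits / len(y_true)

# Two hypothetical eval examples with 7 labels each.
logits = [
    [0.1, -0.3, 0.2, 0.0, -1.0, 0.4, 2.0],   # argmax is index 6
    [1.5, 0.2, -0.1, 0.0, 0.3, -0.2, 0.9],   # argmax is index 0
]
y_true = [
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],     # true label is index 6 (hit)
    [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0],     # true label is index 4 (miss)
]
print(subset_accuracy(logits, y_true))  # 0.5
```

With only two eval examples, this metric can take just three values: 0.0, 0.5, or 1.0.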

Start training!

In [None]:
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["eval"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [None]:
trainer.train()

***** Running training *****
  Num examples = 20
  Num Epochs = 5
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 1
  Gradient Accumulation steps = 1
  Total optimization steps = 100
  Number of trainable parameters = 110622727


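| Epoch | Training Loss | Validation Loss | Acc |
|-------|---------------|-----------------|-----|
| 1 | No log | 0.395757 | 0.5 |
| 2 | No log | 0.381517 | 0.5 |
| 3 | No log | 0.39057 | 0.0 |
| 4 | No log | 0.357949 | 0.5 |
| 5 | No log | 0.358333 | 0.5 |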


***** Running Evaluation *****
  Num examples = 2
  Batch size = 1
Saving model checkpoint to bert-finetuned/checkpoint-20
Configuration saved in bert-finetuned/checkpoint-20/config.json
Model weights saved in bert-finetuned/checkpoint-20/pytorch_model.bin
tokenizer config file saved in bert-finetuned/checkpoint-20/tokenizer_config.json
Special tokens file saved in bert-finetuned/checkpoint-20/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 2
  Batch size = 1
Saving model checkpoint to bert-finetuned/checkpoint-40
Configuration saved in bert-finetuned/checkpoint-40/config.json
Model weights saved in bert-finetuned/checkpoint-40/pytorch_model.bin
tokenizer config file saved in bert-finetuned/checkpoint-40/tokenizer_config.json
Special tokens file saved in bert-finetuned/checkpoint-40/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 2
  Batch size = 1
Saving model checkpoint to bert-finetuned/checkpoint-60
Configuration saved in bert-finetun

TrainOutput(global_step=100, training_loss=0.2911382293701172, metrics={'train_runtime': 40.0038, 'train_samples_per_second': 2.5, 'train_steps_per_second': 2.5, 'total_flos': 6578071680000.0, 'train_loss': 0.2911382293701172, 'epoch': 5.0})
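The step counts in the log follow directly from the settings. The arithmetic below is a small sketch assuming a single device and no gradient accumulation, which matches the log. The checkpoint steps are derived from `save_strategy="epoch"`, not read from the truncated output.

```python
import math

num_examples = 20   # "Num examples" in the training log
batch_size = 1      # per-device batch size, single device assumed
num_epochs = 5

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
# With save_strategy="epoch", a checkpoint is written at the end of each epoch.
checkpoint_steps = [steps_per_epoch * epoch for epoch in range(1, num_epochs + 1)]

print(steps_per_epoch, total_steps, checkpoint_steps)
# 20 100 [20, 40, 60, 80, 100]
```

The "No log" entries in the Training Loss column are probably due to the logging interval: to my knowledge the default `logging_steps` is 500, which is longer than this 100-step run.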

## Evaluate

After training, evaluate the model on the test data.

In [None]:
trainer.evaluate(encoded_dataset["test"])

***** Running Evaluation *****
  Num examples = 3
  Batch size = 1


{'eval_loss': 0.39940711855888367,
 'eval_acc': 0.3333333333333333,
 'eval_runtime': 0.0791,
 'eval_samples_per_second': 37.926,
 'eval_steps_per_second': 37.926,
 'epoch': 5.0}
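An `eval_acc` of 0.333 on three test examples means one of them was classified correctly. To see which intent the model picks for a given example, the logits can be mapped back to a label name through `id2label`. This is a minimal sketch in plain Python. The logits are hypothetical values, and `predict_label` is an illustrative helper, not part of the Trainer API.

```python
id2label = {0: "날씨 묻기", 1: "관광지 추천", 2: "숙소 추천", 3: "맛집 추천",
            4: "인사", 5: "소개", 6: "기타"}

def predict_label(logits_row, id2label):
    """Return the label name whose logit is the largest."""
    best = max(range(len(logits_row)), key=lambda i: logits_row[i])
    return id2label[best]

# Hypothetical logits for a single utterance.
hypothetical_logits = [-1.2, 0.3, -0.5, 2.1, -0.7, -0.9, 0.4]
print(predict_label(hypothetical_logits, id2label))  # 맛집 추천
```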

## Save the trained model to Drive

In [None]:
# Set the path where you want to save the model.
trainer.save_model('/content/drive/MyDrive/week2/REALLLLwk2/finetuning/my_model') 

Saving model checkpoint to /content/drive/MyDrive/week2/REALLLLwk2/finetuning/my_model
Configuration saved in /content/drive/MyDrive/week2/REALLLLwk2/finetuning/my_model/config.json
Model weights saved in /content/drive/MyDrive/week2/REALLLLwk2/finetuning/my_model/pytorch_model.bin
tokenizer config file saved in /content/drive/MyDrive/week2/REALLLLwk2/finetuning/my_model/tokenizer_config.json
Special tokens file saved in /content/drive/MyDrive/week2/REALLLLwk2/finetuning/my_model/special_tokens_map.json


👉 The next file is "Inference_and_Demo.ipynb"!