## **02** Fine-Tuning and Evaluating DistilBERT

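A transformer-based pretrained model has already learned the relationships between tokens and sentences from large-scale unlabeled text through self-supervised learning. Fine-tuning such a model on a small labeled dataset can therefore be enough for it to make high-accuracy predictions. This section fine-tunes DistilBERT in two ways: with Hugging Face's `Trainer` class and with a conventional PyTorch training loop.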

### Preparation: GPU Setup

In [2]:
import torch
torch.cuda.is_available()

True

### **004** The IMDB Dataset

A dataset used to classify the sentiment of movie review comments as positive or negative.

In [4]:
from torchtext.datasets import IMDB

train_iter = IMDB(split="train")
test_iter = IMDB(split="test")

In [5]:
import random
random.seed(6)

train_lists = list(train_iter)
test_lists = list(test_iter)

train_lists_small = random.sample(train_lists, 1000)
test_lists_small = random.sample(test_lists, 1000)

print(train_lists_small[0])
print(test_lists_small[0])

(2, "I LOVED this movie! I am biased seeing as I am a huge Disney fan, but I really enjoyed myself. The action takes off running in the beginning of the film and just keeps going! This is a bit of a departure for Disney, they don't spend quite as much time on character development (my husband pointed this out)and there are no musical numbers. It is strictly action adventure. I thoroughly enjoyed it and recommend it to anyone who loves Disney, be they young or old.")
(1, 'This was an abysmal show. In short it was about this kid called Doug who guilt-tripped a lot. Seriously he could feel guilty over killing a fly then feeling guilty over feeling guilty for killing the fly and so forth. The animation was grating and unpleasant and the jokes cheap. <br /><br />It aired here in Sweden as a part of the "Disney time" show and i remember liking it some what but then i turned 13.<br /><br />I never got why some of the characters were green and purple too. What was up with that? <br /><br />Tru

### **005** Label Encoding

In this dataset, positive reviews are labeled 2 and negative reviews are labeled 1. Here we remap them so that 1 means positive and 0 means negative.

In [6]:
train_texts = []
train_labels = []

for label, text in train_lists_small:
    train_labels.append(1 if label==2 else 0)
    train_texts.append(text)

test_texts = []
test_labels = []

for label, text in test_lists_small:
    test_labels.append(1 if label==2 else 0)
    test_texts.append(text)


print(train_texts[0])
print(train_labels[0])
print(test_texts[0])
print(test_labels[0])

I LOVED this movie! I am biased seeing as I am a huge Disney fan, but I really enjoyed myself. The action takes off running in the beginning of the film and just keeps going! This is a bit of a departure for Disney, they don't spend quite as much time on character development (my husband pointed this out)and there are no musical numbers. It is strictly action adventure. I thoroughly enjoyed it and recommend it to anyone who loves Disney, be they young or old.
1
This was an abysmal show. In short it was about this kid called Doug who guilt-tripped a lot. Seriously he could feel guilty over killing a fly then feeling guilty over feeling guilty for killing the fly and so forth. The animation was grating and unpleasant and the jokes cheap. <br /><br />It aired here in Sweden as a part of the "Disney time" show and i remember liking it some what but then i turned 13.<br /><br />I never got why some of the characters were green and purple too. What was up with that? <br /><br />Truly a horri
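Because the 1,000 reviews are drawn at random, it can be worth checking that the remapped labels stay roughly balanced. The following is a minimal sketch and not part of the original notebook: `LABEL_MAP`, `encode_labels`, and the sample values are hypothetical names and data chosen for illustration. Unlike the `1 if label==2 else 0` expression, the explicit lookup raises an error if a raw label ever differs from what is expected.

```python
from collections import Counter

# Assumed mapping, following the text above: 2 (positive) -> 1, 1 (negative) -> 0
LABEL_MAP = {2: 1, 1: 0}


def encode_labels(raw_labels):
    """Remap raw IMDB labels and fail loudly on any unexpected value."""
    encoded = []
    for raw in raw_labels:
        if raw not in LABEL_MAP:
            raise ValueError(f"unexpected label: {raw!r}")
        encoded.append(LABEL_MAP[raw])
    return encoded


# Illustrative raw labels, not taken from the real dataset
sample_raw = [2, 1, 1, 2, 2]
sample_encoded = encode_labels(sample_raw)
balance = Counter(sample_encoded)

print(sample_encoded)  # [1, 0, 0, 1, 1]
print(balance)         # Counter({1: 3, 0: 2})
```

Applied to the notebook's own lists, the same `Counter` call on `train_labels` would show how many positive and negative reviews ended up in the sample.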

### **006** Splitting Training and Validation Data

In [7]:
from sklearn.model_selection import train_test_split

train_texts, val_texts, train_labels, val_labels = train_test_split(
    train_texts, train_labels, test_size=0.2, random_state=3
)

print(len(train_texts))
print(len(val_texts))

800
200
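The split above is purely random, so the positive/negative ratio in the 200 validation reviews may drift from that of the 800 training reviews. scikit-learn's `train_test_split` accepts a `stratify` argument to keep the ratio fixed. The sketch below illustrates the idea in plain Python; `stratified_split` is a hypothetical helper written for this explanation, not scikit-learn's implementation, and the toy data is made up.

```python
import random
from collections import defaultdict


def stratified_split(texts, labels, test_size=0.2, seed=3):
    """Split so that each label keeps the same share in both parts."""
    rng = random.Random(seed)

    # Group example indices by label
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])

    # Shuffle again so the examples are not ordered by label
    rng.shuffle(train_idx)
    rng.shuffle(val_idx)

    def pick(index_list, sequence):
        return [sequence[i] for i in index_list]

    return (pick(train_idx, texts), pick(val_idx, texts),
            pick(train_idx, labels), pick(val_idx, labels))


# Toy data: 60 positive and 40 negative placeholder reviews
toy_labels = [1] * 60 + [0] * 40
toy_texts = [f"review {i}" for i in range(100)]

tr_texts, va_texts, tr_labels, va_labels = stratified_split(toy_texts, toy_labels)
print(len(tr_texts), len(va_texts))    # 80 20
print(sum(tr_labels), sum(va_labels))  # 48 12
```

In the notebook itself, the likely equivalent would be adding `stratify=train_labels` to the existing `train_test_split` call.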


### **007** Tokenizing and Encoding

In [8]:
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")


In [10]:
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)

print(train_encodings["input_ids"][0][:10])
print(tokenizer.decode(train_encodings["input_ids"][0][:10]))

[101, 4937, 11350, 2038, 2048, 1000, 7592, 14433, 1000, 1011]
[CLS] cat soup has two " hello kitty " -
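The `truncation=True, padding=True` arguments cut every sequence to the model's maximum length and then pad shorter sequences up to the longest one in the call, while an attention mask records which positions hold real tokens. The following toy sketch shows that bookkeeping on lists of token ids. It is a simplification under stated assumptions, not the tokenizer's actual implementation: `pad_and_truncate` is a hypothetical helper, the padding id 0 is assumed from the `padding_idx=0` shown in the model printout of this section, and the real tokenizer also keeps the closing `[SEP]` token when it truncates.

```python
PAD_ID = 0  # assumed padding id (see padding_idx=0 in the model printout)


def pad_and_truncate(sequences, max_length=512, pad_id=PAD_ID):
    """Truncate to max_length, then pad to the longest remaining sequence."""
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)

    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)

    return {"input_ids": input_ids, "attention_mask": attention_mask}


# 101 and 102 are the [CLS] and [SEP] ids; max_length=3 is tiny on purpose
batch = pad_and_truncate([[101, 4937, 11350, 102], [101, 102]], max_length=3)
print(batch["input_ids"])       # [[101, 4937, 11350], [101, 102, 0]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 1, 0]]
```

Because each of the three tokenizer calls above pads to the longest review in its own list, the training, validation, and test encodings can end up with different sequence lengths.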


### **008** Creating the Dataset Class

In [11]:
import torch

class IMDBDataset(torch.utils.data.Dataset):
    
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item
    
    def __len__(self):
        return len(self.labels)
    
train_dataset = IMDBDataset(train_encodings, train_labels)
val_dataset = IMDBDataset(val_encodings, val_labels)
test_dataset = IMDBDataset(test_encodings, test_labels)  

In [12]:
for i in train_dataset:
    print(i)
    break

{'input_ids': tensor([  101,  4937, 11350,  2038,  2048,  1000,  7592, 14433,  1000,  1011,
         2828, 18401,  2015, 28866,  2075,  2006,  1037, 13576,  4440,  2083,
         1996, 25115,  1010,  2073,  2505,  2064,  4148,  1010,  1998,  2515,
         1012,  2023,  2568,  1011,  4440,  4691,  4004,  2460,  3594,  2053,
        13764,  8649,  1010,  4942, 21532,  2773, 22163,  2612,  1012,  2045,
         2003,  2053,  2126,  1997,  7851,  2023, 17183, 14088,  9476,  3272,
         2000,  2425,  2017,  2000,  2156,  2009,  2005,  4426,  1012,  1998,
         2191,  2469,  2053,  2028,  2104,  2184,  2003,  1999,  1996,  2282,
         1012,  4487,  6491,  6633,  5677,  3672,  1998,  2064,  3490, 10264,
         2964,  1998, 18186,  1998,  9576,  2854,  1998,  5573,  2331,  1998,
         2655,  3560, 27770,  2005,  2500,  2024,  2691,  6991,  1012,  7481,
         1012,  3383,  1996,  2087, 13432,  3746,  2003,  2008,  1997,  2019,
        10777,  3605,  1997,  2300,  2008,  1996, 
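The `for i in train_dataset` loop works even though `IMDBDataset` defines only `__getitem__` and `__len__`. This appears to rely on Python's sequence protocol: indices 0, 1, 2, ... are requested until an `IndexError` is raised. The sketch below reproduces that behavior without PyTorch; `ToyDataset` and its contents are hypothetical stand-ins for the real class and encodings.

```python
class ToyDataset:
    """Torch-free stand-in for IMDBDataset, using plain lists instead of tensors."""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # Indexing past the end raises IndexError, which ends iteration
        item = {key: values[idx] for key, values in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)


toy = ToyDataset(
    {"input_ids": [[101, 2204, 102], [101, 2919, 102]],
     "attention_mask": [[1, 1, 1], [1, 1, 1]]},
    [1, 0],
)

first = next(iter(toy))  # same effect as the for/break loop above
print(first)             # {'input_ids': [101, 2204, 102], 'attention_mask': [1, 1, 1], 'labels': 1}
print(len(list(toy)))    # 2
```

A `DataLoader` uses the same two methods: `__len__` to know how many indices exist and `__getitem__` to fetch the items it collates into a batch.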

### **009** Loading the Pretrained Model

In [13]:
from transformers import DistilBertForSequenceClassification

model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

model

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'pre_classifier.weight', 'classifier.bias', 'classifier.we

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

### **010** Configuring TrainingArguments

In [15]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir = "D:/_MODEL_CHECKPOINT/distilbert-base-uncased",
    num_train_epochs=8,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="D:/_MODEL_CHECKPOINT/distilbert-base-uncased/logs",
    logging_steps=10,
)
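These settings interact in a way that is easy to miss. With 800 training examples, a batch size of 16, and 8 epochs, training runs for 400 optimizer steps, which is fewer than the 500 warmup steps. The worked arithmetic below is a sketch under stated assumptions: a single device, no gradient accumulation, the default `learning_rate` of 5e-5, and a linear warmup. `warmup_lr` is a hypothetical helper, not a library function.

```python
import math

train_size = 800      # size of the training split created earlier
batch_size = 16       # per_device_train_batch_size
epochs = 8            # num_train_epochs
warmup_steps = 500    # warmup_steps
peak_lr = 5e-5        # assumed: TrainingArguments' default learning_rate

steps_per_epoch = math.ceil(train_size / batch_size)
total_steps = steps_per_epoch * epochs


def warmup_lr(step):
    """Linear warmup only; the decay phase after warmup is not modeled."""
    return peak_lr * min(step, warmup_steps) / warmup_steps


print(steps_per_epoch)  # 50
print(total_steps)      # 400
print(warmup_lr(10))            # roughly 1e-06
print(warmup_lr(total_steps))   # roughly 4e-05, still below the 5e-05 peak
```

Under these assumptions the learning rate is still rising when training ends, which agrees with the steadily increasing `learning_rate` values in the training log of this section. Lowering `warmup_steps` to a fraction of the 400 total steps would be one way to let the schedule reach its peak.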

### **011** Moving the Model to the GPU

In [16]:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

### **012** Fine-Tuning with the Trainer Class

In [18]:
input_tokens = tokenizer(["I feel fantastic", "My life is going something wrong", "I have not figured out what the chosen title has to do with the movie."], truncation=True, padding=True)

In [19]:
outputs = model(torch.tensor(input_tokens["input_ids"]).to(device))

# Match the label encoding used above: 1 = positive, 0 = negative
label_dict = {1: "positive", 0: "negative"}

print([label_dict[i] for i in torch.argmax(outputs["logits"], axis=1).cpu().numpy()])

['negative', 'negative', 'negative']
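Before fine-tuning, the classification head is newly initialized (as the loading warning states), so the model assigns the same class to all three sentences regardless of their content. With logits close to zero, both classes get a probability near 0.5 and the cross-entropy loss sits near ln 2 ≈ 0.693, which is consistent with the first losses of about 0.69 to 0.70 in the training log. The sketch below works through that arithmetic in plain Python; the logit values are hypothetical and the helper functions are written only for this illustration.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[label])


def argmax(values):
    """Index of the largest value, i.e. the predicted class."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical logits from an untrained head: nearly equal for both classes
untrained_logits = [0.02, -0.01]

probs = softmax(untrained_logits)
predicted = argmax(untrained_logits)
loss = cross_entropy(untrained_logits, label=1)

print(predicted)      # 0
print(math.log(2))    # about 0.693, the loss of a 50/50 guess
print(loss)           # close to ln 2 because the logits are nearly equal
```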


In [20]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)

trainer.train()


{'loss': 0.6992, 'learning_rate': 1.0000000000000002e-06, 'epoch': 0.2}
{'loss': 0.6977, 'learning_rate': 2.0000000000000003e-06, 'epoch': 0.4}
{'loss': 0.6843, 'learning_rate': 3e-06, 'epoch': 0.6}
{'loss': 0.6851, 'learning_rate': 4.000000000000001e-06, 'epoch': 0.8}
{'loss': 0.6905, 'learning_rate': 5e-06, 'epoch': 1.0}
{'loss': 0.6794, 'learning_rate': 6e-06, 'epoch': 1.2}
{'loss': 0.6756, 'learning_rate': 7.000000000000001e-06, 'epoch': 1.4}
{'loss': 0.6672, 'learning_rate': 8.000000000000001e-06, 'epoch': 1.6}
{'loss': 0.6289, 'learning_rate': 9e-06, 'epoch': 1.8}
{'loss': 0.6114, 'learning_rate': 1e-05, 'epoch': 2.0}
{'loss': 0.5085, 'learning_rate': 1.1000000000000001e-05, 'epoch': 2.2}
{'loss': 0.4384, 'learning_rate': 1.2e-05, 'epoch': 2.4}
{'loss': 0.3489, 'learning_rate': 1.3000000000000001e-05, 'epoch': 2.6}
{'loss': 0.3817, 'learning_rate': 1.4000000000000001e-05, 'epoch': 2.8}
{'loss': 0.2754, 'learning_rate': 1.5e-05, 'epoch': 3.0}
{'loss': 0.1873, 'learning_rate': 1.60

TrainOutput(global_step=400, training_loss=0.28413260404020546, metrics={'train_runtime': 521.0175, 'train_samples_per_second': 12.284, 'train_steps_per_second': 0.768, 'train_loss': 0.28413260404020546, 'epoch': 8.0})
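The `TrainOutput` above reports only the training loss, so it says nothing yet about accuracy on the validation set. One common approach is to give `Trainer` a `compute_metrics` function. The sketch below is a minimal accuracy function under the assumption that it receives a `(logits, labels)` pair; it is written in plain Python so it can be tried without a model, and the toy logits and labels are made up.

```python
def compute_metrics(eval_pred):
    """Return accuracy for a (logits, labels) pair."""
    logits, labels = eval_pred

    # Predicted class = index of the largest logit in each row
    predictions = [max(range(len(row)), key=lambda j: row[j]) for row in logits]

    correct = sum(int(pred == int(label)) for pred, label in zip(predictions, labels))
    return {"accuracy": correct / len(predictions)}


# Hypothetical logits for four reviews and their true labels
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 0.5], [1.2, 0.7]]
toy_labels = [0, 1, 0, 0]

metrics = compute_metrics((toy_logits, toy_labels))
print(metrics)  # {'accuracy': 0.75}
```

Assuming the standard `Trainer` interface, passing `compute_metrics=compute_metrics` when constructing the trainer and then calling `trainer.evaluate()` should report this value for `val_dataset`. The held-out `test_dataset` could be scored the same way.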

In [22]:
input_tokens = tokenizer(["I feel fantastic", "My life is going something wrong", "I have not figured out what the chosen title has to do with the movie."], truncation=True, padding=True)

outputs = model(torch.tensor(input_tokens["input_ids"]).to(device))

label_dict = {1: "positive", 0: "negative"}

print([label_dict[i] for i in torch.argmax(outputs["logits"], axis=1).cpu().numpy()])

['positive', 'negative', 'negative']


### **013** Fine-Tuning with a PyTorch Training Loop

In [23]:
def test_inference(model, tokenizer):
    input_tokens = tokenizer(["I feel fantastic", "My life is going something wrong", "I have not figured out what the chosen title has to do with the movie."], truncation=True, padding=True)

    outputs = model(torch.tensor(input_tokens["input_ids"]).to(device))

    label_dict = {1: "positive", 0: "negative"}

    print([label_dict[i] for i in torch.argmax(outputs["logits"], axis=1).cpu().numpy()])

In [24]:
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import DistilBertForSequenceClassification
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")
model.to(device)

# test_inference prints its predictions and returns None, so the outer print also emits "None"
print(test_inference(model, tokenizer))

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

# Define the optimizer (torch.optim.AdamW is used because transformers.AdamW is deprecated)
optim = AdamW(model.parameters(), lr=5e-5)

# Switch the model to training mode
model.train()

losses = []

for epoch in range(8):
    print(f"epoch: {epoch}")
    for batch in train_loader:
        # Reset the gradients left over from the previous step
        optim.zero_grad()

        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)

        # Forward pass; passing labels makes the model return the loss as well
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)

        # Read the loss and record it as a plain float so the autograd graph is not kept alive
        loss = outputs[0]
        losses.append(loss.item())

        # Backpropagate the error
        loss.backward()

        # Update the weights
        optim.step()

# Switch the model to evaluation mode
model.eval()

print(test_inference(model, tokenizer))

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'pre_classifier.weight', 'classifier.bias', 'classifier.we

['negative', 'negative', 'negative']
None
epoch: 0
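The training loop collects every batch loss in `losses` but never summarizes it. A per-epoch average is a simple way to see whether the manual loop is converging, much like the loss log printed by `Trainer`. The sketch below assumes `losses` holds plain floats (for example collected with `loss.item()`) and that every epoch has the same number of steps; `mean_loss_per_epoch` and the toy values are hypothetical.

```python
def mean_loss_per_epoch(losses, steps_per_epoch):
    """Average consecutive chunks of batch losses, one chunk per epoch."""
    means = []
    for start in range(0, len(losses), steps_per_epoch):
        chunk = losses[start:start + steps_per_epoch]
        means.append(sum(chunk) / len(chunk))
    return means


# Made-up batch losses: three epochs of two steps each
toy_losses = [0.75, 0.25, 0.5, 0.25, 0.125, 0.125]

epoch_means = mean_loss_per_epoch(toy_losses, steps_per_epoch=2)
print(epoch_means)  # [0.5, 0.375, 0.125]
```

For the loop above, `steps_per_epoch` would be `len(train_loader)`, which is 50 for 800 examples at a batch size of 16.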
