# Machine Learning Textbook: PyTorch Edition

<table align="left"><tr><td>
<a href="https://colab.research.google.com/github/rickiepark/ml-with-pytorch/blob/main/ch16/ch16-part3-bert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Run in Colab"/></a>
</td></tr></table>

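## Transformers
- Processes sequences as an RNN does, but replaces recurrence with attention
- Attention mechanism
    - Encoding context embeddings with multi-head attention
    - The decoder and masked multi-head attention
    - Positional encoding and layer normalization

- K, Q, V
    - Q · K (dot product) => how similar the two vectors are in direction
        - Somewhat better suited to parallel processing (all query-key pairs can be computed as one matrix multiplication)
        - How close are Q (the query) and K (the key) as vectors?
        - "How much should this Q (query) attend to each K (token)?"

    - softmax(dot products): normalizes the scores into weights that sum to 1
    - weights × V → final representation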
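The three steps in the list above (dot product, softmax, weighted sum of V) can be sketched in plain Python. This is a minimal illustration, not an excerpt from any library: the function names are made up for this example, and the division by the square root of the key dimension is the scaling used in standard scaled dot-product attention, which the notes above do not mention.

```python
import math


def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) for lists of query, key, and value vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Step 1: Q . K measures how well the query matches each key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        # Step 2: softmax normalizes the scores into attention weights.
        weights = softmax(scores)
        # Step 3: the output is the weight-averaged value vector.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# One query that points in the same direction as the first key.
outputs, weights = scaled_dot_product_attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[1.0, 0.0], [0.0, 1.0]],
)
print(weights[0])  # the first key receives the larger weight
print(outputs[0])
```

Because each query is handled independently of the others, the loop over queries can be replaced by a single matrix multiplication, which is where the parallelism comes from.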



In [1]:
%pip install transformers datasets accelerate evaluate


# Chapter 16: Transformers - Improving Natural Language Processing with Attention Mechanisms (Part 3/3)

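**Outline**

- Fine-tuning a BERT model in PyTorch
  - Loading the IMDb movie review dataset
  - Tokenizing the dataset
  - Loading and fine-tuning a pre-trained BERT model
  - Fine-tuning a transformer more conveniently using the Trainer API
- Summary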

---

Quoted from https://huggingface.co/transformers/custom_datasets.html:

> DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% less parameters than bert-base-uncased , runs 60% faster while preserving over 95% of BERT's performances as measured on the GLUE language understanding benchmark.

---

In [1]:
from IPython.display import Image

## Fine-tuning a BERT model in PyTorch

### Loading the IMDb movie review dataset

In [None]:
import time

import pandas as pd

import torch
import torch.nn.functional as F
import transformers

from transformers import DistilBertTokenizerFast
from transformers import DistilBertForSequenceClassification

**General settings**

In [6]:
torch.manual_seed(42)
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
NUM_EPOCHS = 3

print(DEVICE)

cpu


**Download the dataset**

The following cell downloads the IMDb movie review dataset (http://ai.stanford.edu/~amaas/data/sentiment/) for positive/negative sentiment classification as a CSV file:
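The download cell itself does not appear here. A minimal sketch of what such a step could look like, using only the standard library, is shown below. `DATA_URL` is a hypothetical placeholder rather than the actual location of the archive, the helper names `unpack_gzip` and `fetch_movie_data` are illustrative, and the sketch assumes the data is distributed as a gzip-compressed CSV.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path

# Hypothetical placeholder: replace with the real location of the archive.
DATA_URL = "https://example.com/movie_data.csv.gz"


def unpack_gzip(archive_path, csv_path):
    """Decompress a .gz archive into a plain CSV file and return its path."""
    with gzip.open(archive_path, "rb") as f_in, open(csv_path, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return Path(csv_path)


def fetch_movie_data(url=DATA_URL,
                     archive_path="movie_data.csv.gz",
                     csv_path="movie_data.csv"):
    """Download and unpack the dataset unless the CSV already exists."""
    if Path(csv_path).exists():
        return Path(csv_path)
    urllib.request.urlretrieve(url, archive_path)
    return unpack_gzip(archive_path, csv_path)
```

Calling `fetch_movie_data()` with a valid URL would leave a `movie_data.csv` file in the working directory, which is the file name the next cell reads.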

Check the dataset:

In [10]:
df = pd.read_csv('movie_data.csv')
df.head()

|   | review | sentiment |
|---|--------|-----------|
| 0 | In 1974, the teenager Martha Moxley (Maggie Gr... | 1 |
| 1 | OK... so... I really like Kris Kristofferson a... | 0 |
| 2 | \*\*\*SPOILER\*\*\* Do not read this, if you think a... | 0 |
| 3 | hi for all the people who have seen this wonde... | 1 |
| 4 | I recently bought the DVD, forgetting just how... | 0 |


In [9]:
df.shape

(50000, 2)

**Split the dataset into training, validation, and test sets**

In [11]:
train_texts = df.iloc[:35000]['review'].values
train_labels = df.iloc[:35000]['sentiment'].values

valid_texts = df.iloc[35000:40000]['review'].values
valid_labels = df.iloc[35000:40000]['sentiment'].values

test_texts = df.iloc[40000:]['review'].values
test_labels = df.iloc[40000:]['sentiment'].values

### Tokenizing the dataset

In [12]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


In [13]:
train_encodings = tokenizer(list(train_texts), truncation=True, padding=True)
valid_encodings = tokenizer(list(valid_texts), truncation=True, padding=True)
test_encodings = tokenizer(list(test_texts), truncation=True, padding=True)

In [14]:
train_encodings[0]

Encoding(num_tokens=512, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])
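To make the effect of `truncation=True` and `padding=True` concrete, here is a simplified sketch in plain Python. It is not the tokenizer's implementation: the helper `pad_and_truncate` is hypothetical, special tokens such as `[CLS]` and `[SEP]` are ignored, and the toy `max_length=4` stands in for the model's 512-token limit. The padding id of 0 matches the `padding_idx=0` shown in the model's embedding layer.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Truncate sequences to max_length, then pad to the longest one."""
    # Truncation: drop every token past the maximum length.
    truncated = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)
    # Padding: append pad_id so all sequences share one length.
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids))
                      for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_and_truncate([[5, 6, 7, 8, 9], [5, 6]], max_length=4)
print(batch["input_ids"])       # [[5, 6, 7, 8], [5, 6, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

The attention mask is what lets the model ignore padded positions, which is why the training loop passes `attention_mask` alongside `input_ids`.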

**Dataset class and loaders**

In [15]:
class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)


train_dataset = IMDbDataset(train_encodings, train_labels)
valid_dataset = IMDbDataset(valid_encodings, valid_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)

In [16]:
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=16, shuffle=True)
valid_loader = torch.utils.data.DataLoader(valid_dataset, batch_size=16, shuffle=False)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=16, shuffle=False)
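With `batch_size=16`, the number of batches per epoch follows directly from the split sizes above. The sketch below works through that arithmetic, assuming `DataLoader`'s default `drop_last=False`, under which a final partial batch is kept. The helper name `num_batches` is illustrative.

```python
import math


def num_batches(num_examples, batch_size, drop_last=False):
    """Number of batches a loader yields per pass over the data."""
    if drop_last:
        return num_examples // batch_size      # discard the partial batch
    return math.ceil(num_examples / batch_size)  # keep the partial batch


print(num_batches(35000, 16))  # training set: 2188
print(num_batches(5000, 16))   # validation set: 313
print(num_batches(10000, 16))  # test set: 625
```

The training figure of 2188 is the batch count that appears in the training log.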

### Loading and fine-tuning a pre-trained BERT model

In [18]:
%pip install hf_xet

Collecting hf_xet
  Downloading hf_xet-1.2.0-cp37-abi3-win_amd64.whl.metadata (5.0 kB)
Downloading hf_xet-1.2.0-cp37-abi3-win_amd64.whl (2.9 MB)
   ---------------------------------------- 0.0/2.9 MB ? eta -:--:--
   ------------------------- -------------- 1.8/2.9 MB 9.7 MB/s eta 0:00:01
   ---------------------------------------- 2.9/2.9 MB 10.0 MB/s  0:00:00
Installing collected packages: hf_xet
Successfully installed hf_xet-1.2.0
Note: you may need to restart the kernel to use updated packages.


In [19]:
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
model.to(DEVICE)
model.train()

optim = torch.optim.Adam(model.parameters(), lr=5e-5)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


**Train the model: manual training loop**

In [20]:
def compute_accuracy(model, data_loader, device):
    with torch.no_grad():
        correct_pred, num_examples = 0, 0

        for batch_idx, batch in enumerate(data_loader):

            ### Prepare data
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            ### Forward pass: the predicted class is the largest logit
            outputs = model(input_ids, attention_mask=attention_mask)
            logits = outputs['logits']
            predicted_labels = torch.argmax(logits, 1)
            num_examples += labels.size(0)
            correct_pred += (predicted_labels == labels).sum()

        return correct_pred.float()/num_examples * 100

In [18]:
start_time = time.time()

for epoch in range(NUM_EPOCHS):

    model.train()

    for batch_idx, batch in enumerate(train_loader):

        ### Prepare data
        input_ids = batch['input_ids'].to(DEVICE)
        attention_mask = batch['attention_mask'].to(DEVICE)
        labels = batch['labels'].to(DEVICE)

        ### Forward pass
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss, logits = outputs['loss'], outputs['logits']

        ### Backward pass
        optim.zero_grad()
        loss.backward()
        optim.step()

        ### Logging
        if not batch_idx % 250:
            print(f'Epoch: {epoch+1:04d}/{NUM_EPOCHS:04d} | '
                  f'Batch {batch_idx:04d}/{len(train_loader):04d} | '
                  f'Loss: {loss:.4f}')

    model.eval()

    with torch.set_grad_enabled(False):
        print(f'Training accuracy: '
              f'{compute_accuracy(model, train_loader, DEVICE):.2f}%'
              f'\nValidation accuracy: '
              f'{compute_accuracy(model, valid_loader, DEVICE):.2f}%')

    print(f'Time elapsed: {(time.time() - start_time)/60:.2f} min')

print(f'Total training time: {(time.time() - start_time)/60:.2f} min')
print(f'Test accuracy: {compute_accuracy(model, test_loader, DEVICE):.2f}%')

Epoch: 0001/0003 | Batch 0000/2188 | Loss: 0.6793
Epoch: 0001/0003 | Batch 0250/2188 | Loss: 0.4335
Epoch: 0001/0003 | Batch 0500/2188 | Loss: 0.3769
Epoch: 0001/0003 | Batch 0750/2188 | Loss: 0.1101
Epoch: 0001/0003 | Batch 1000/2188 | Loss: 0.4627
Epoch: 0001/0003 | Batch 1250/2188 | Loss: 0.3404
Epoch: 0001/0003 | Batch 1500/2188 | Loss: 0.3589
Epoch: 0001/0003 | Batch 1750/2188 | Loss: 0.3894
Epoch: 0001/0003 | Batch 2000/2188 | Loss: 0.1958
Training accuracy: 96.69%
Validation accuracy: 92.56%
Time elapsed: 11.56 min
Epoch: 0002/0003 | Batch 0000/2188 | Loss: 0.0801
Epoch: 0002/0003 | Batch 0250/2188 | Loss: 0.4389
Epoch: 0002/0003 | Batch 0500/2188 | Loss: 0.1823
Epoch: 0002/0003 | Batch 0750/2188 | Loss: 0.0404
Epoch: 0002/0003 | Batch 1000/2188 | Loss: 0.1314
Epoch: 0002/0003 | Batch 1250/2188 | Loss: 0.1172
Epoch: 0002/0003 | Batch 1500/2188 | Loss: 0.0208
Epoch: 0002/0003 | Batch 1750/2188 | Loss: 0.0265
Epoch: 0002/0003 | Batch 2000/2188 | Loss: 0.0242
Training accuracy: 98.79%
Validation accuracy: 92.88%
Time elapsed: 23.08 min
Epoch: 0003/0003 | Batch 0000/2188 | Loss: 0.0188
Epoch: 0003/0003 | Batch 0250/2188 | Loss: 0.0543
Epoch: 0003/0003 | Batch 0500/2188 | Loss: 0.0050
Epoch

In [19]:
del model  # free memory

### Fine-tuning a transformer more conveniently using the Trainer API

Load the pre-trained model:

In [21]:
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
model.to(DEVICE)
model.train()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


In [22]:
from transformers import Trainer, TrainingArguments


optim = torch.optim.Adam(model.parameters(), lr=5e-5)
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    logging_dir='./logs',
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)

In [23]:
# Install the evaluate library first with: pip install evaluate
import evaluate
import numpy as np


metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred # logits are a numpy array, not pytorch tensor
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(
               predictions=predictions, references=labels)

Downloading builder script: 4.20kB [00:00, 8.46MB/s]
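The `compute_metrics` function above reduces each row of logits to a class prediction with `argmax` and then compares the predictions with the labels. A dependency-free sketch of the same computation, with a hypothetical helper name and made-up logits, may help show what the accuracy metric measures:

```python
def accuracy_from_logits(logits, labels):
    """Fraction of rows whose largest logit sits at the true label's index."""
    # argmax per row: index of the largest logit.
    predictions = [max(range(len(row)), key=row.__getitem__)
                   for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Four made-up examples with two classes (0 = negative, 1 = positive).
toy_logits = [[0.1, 0.9], [2.0, -1.0], [0.3, 0.4], [1.5, 0.2]]
toy_labels = [1, 0, 0, 0]
print(accuracy_from_logits(toy_logits, toy_labels))  # 0.75
```

Only the ordering of the logits within a row matters here, so no softmax is needed before taking the argmax.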


In [24]:
optim = torch.optim.Adam(model.parameters(), lr=5e-5)


training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    logging_dir='./logs',
    logging_steps=10
)

trainer = Trainer(
    model=model,
    compute_metrics=compute_metrics,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    optimizers=(optim, None) # optimizer and learning rate scheduler
)

# Force the use of a single GPU (even if several are available)
# so that the run is comparable with the previous code.

trainer.args._n_gpu = 1

In [None]:
start_time = time.time()
trainer.train()
print(f'Total training time: {(time.time() - start_time)/60:.2f} min')

In [25]:
trainer.evaluate()

{'eval_loss': 0.3060450553894043,
 'eval_accuracy': 0.9336,
 'eval_runtime': 49.6515,
 'eval_samples_per_second': 201.404,
 'eval_steps_per_second': 12.588,
 'epoch': 3.0}

In [26]:
model.eval()
model.to(DEVICE)
print(f'Test accuracy: {compute_accuracy(model, test_loader, DEVICE):.2f}%')

Test accuracy: 93.36%


## Summary