# [Module 0.2] IMDB Reviews from Scratch (English)

This notebook builds a custom dataset from the IMDB dataset (English) and shows how to train on it with both the HF Trainer and native PyTorch.

The main steps are:
- 1. Download the IMDB data
- 2. Prepare the dataset
- 3. Create a custom torch Dataset
- 4. Fine-tuning with Trainer
- 5. Fine-tuning with native PyTorch



---
### Reference:
[Fine-tuning with custom datasets](https://huggingface.co/transformers/v3.2.0/custom_datasets.html)

# 1. Download the IMDB data

In [1]:
! wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
! tar -xf aclImdb_v1.tar.gz

--2022-07-11 11:46:05--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz’


2022-07-11 11:46:08 (25.2 MB/s) - ‘aclImdb_v1.tar.gz’ saved [84125825/84125825]



In [2]:
from pathlib import Path

def read_imdb_split(split_dir):
    """Read the reviews under split_dir/pos and split_dir/neg (pos -> 1, neg -> 0)."""
    split_dir = Path(split_dir)
    texts = []
    labels = []
    for label_dir in ["pos", "neg"]:
        for text_file in (split_dir/label_dir).iterdir():
            texts.append(text_file.read_text())
            # Compare strings with ==; `is` tests object identity and triggers a SyntaxWarning on literals
            labels.append(0 if label_dir == "neg" else 1)

    return texts, labels

train_texts, train_labels = read_imdb_split('aclImdb/train')
test_texts, test_labels = read_imdb_split('aclImdb/test')
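
Before loading all 50,000 reviews, it can help to check the reader on a tiny directory tree. The sketch below repeats `read_imdb_split` so that it runs on its own, builds a made-up `pos`/`neg` layout in a temporary folder (the file names and review texts are invented for illustration), and counts the labels. On the real data each split should come out balanced, with 12,500 positive and 12,500 negative reviews.

```python
import tempfile
from collections import Counter
from pathlib import Path


def read_imdb_split(split_dir):
    """Read the reviews under split_dir/pos and split_dir/neg (pos -> 1, neg -> 0)."""
    split_dir = Path(split_dir)
    texts, labels = [], []
    for label_dir in ["pos", "neg"]:
        for text_file in (split_dir / label_dir).iterdir():
            texts.append(text_file.read_text())
            labels.append(0 if label_dir == "neg" else 1)
    return texts, labels


# Hypothetical toy data that mimics the aclImdb/train layout.
toy_reviews = {"pos": ["great film", "loved it"], "neg": ["awful"]}

with tempfile.TemporaryDirectory() as tmp:
    for label_dir, reviews in toy_reviews.items():
        folder = Path(tmp) / label_dir
        folder.mkdir()
        for i, review in enumerate(reviews):
            (folder / f"{i}_0.txt").write_text(review)
    toy_texts, toy_labels = read_imdb_split(tmp)

# Count how many reviews carry each label.
label_counts = Counter(toy_labels)
print(label_counts)
```

The same `Counter(train_labels)` call on the notebook's variables gives a quick check that both classes were read.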


In [3]:
print(f"train texts length: {len(train_texts)} , Sample: {train_texts[0]}")
print(f"train labels length: {len(train_labels)} , Sample: {train_labels[0]}")

train texts length: 25000 , Sample: I really like this movie. Bozz is an ultra-cool, not to be intimidated soldier who does not want to go to war. His persona is similar in a way to Yossarian in Catch-22, Joseph Heller's classic novel about men and war. This film, however, is not set in a war zone, but in a pre-war combat prep training. This wonderful film is all about the sickening realization that the Vietnam war was a mistake and those men who were pegged to be sacrificed for a losing cause.<br /><br />Colin Farrell is brilliant as Bozz, a soldier who showed as much genuine love and compassion for his fellow soldier as he did disdain and irreverence for the establishment that was trying to kill him. Bozz is totally cool and non-plussed, testing and tweaking his military superiors, getting their goat at every opportunity. He is a Jesus Christ figure with a psychology degree, "saving" his fellow soldiers and showing the ones in genuine need, the way out of this man's army.<br /><br />

In [4]:
print(f"test texts length: {len(test_texts)} , Sample: {test_texts[0]}")
print(f"test labels length: {len(test_labels)} , Sample: {test_labels[0]}")

test texts length: 25000 , Sample: One of Starevich's earliest films made in France is possibly his only political satire. The story of The Frogs Who Wanted A King mirrors its title as a group of high "croakers" feel that democracy has gone flat so they demand a king from Jupiter to rule their land. When he sends down a stump, the frogs ask for another king, saying the stump is but "political timber." Jupiter sends down a hungry stork this time whose frog lusty eyes devour the town's residents. As the original "croaker" is about to slide down the stork's beak, he speaks his moral: "let well enough alone." This film features a few beautiful crowd scenes of dozens of puppet frogs. Starewicz tricks the audience into believing they are all moving at once by keeping the background in constant motion and animating only about six frogs or so at one time. The slightly corny dialogue and problems with lighting in a few places diminish the quality of repeat viewings, however its historical signi

# 2. Prepare the dataset

## Create the validation set

In [5]:
from sklearn.model_selection import train_test_split
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)
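
The split above is random on every run and does not enforce class balance. As a hedged variation, not part of the original notebook, `train_test_split` also accepts `random_state` for reproducibility and `stratify` to keep the label ratio identical in both parts. The toy texts and labels below are placeholders.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Placeholder data: 5 positive and 5 negative "reviews".
toy_texts = [f"review {i}" for i in range(10)]
toy_labels = [1] * 5 + [0] * 5

tr_texts, va_texts, tr_labels, va_labels = train_test_split(
    toy_texts,
    toy_labels,
    test_size=0.2,        # 20% of the examples go to validation
    random_state=42,      # fixed seed so the split is repeatable
    stratify=toy_labels,  # preserve the pos/neg ratio in both parts
)

print(Counter(tr_labels), Counter(va_labels))
```

With the notebook's variables, the equivalent call would pass `stratify=train_labels`.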

## Create the tokenizer for the distilbert-base-uncased model

In [6]:
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')




## Create input encodings for the distilbert-base-uncased model

In [7]:
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)

In [8]:
train_encodings[0]

Encoding(num_tokens=512, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])
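
The `num_tokens=512` above follows from the two tokenizer options: `truncation=True` cuts each review to the model's maximum length (512 for this model), and `padding=True` pads shorter reviews up to the longest sequence in the batch. The toy function below illustrates only that idea. It is not the tokenizer's implementation, the name `pad_and_truncate` is hypothetical, and the token ids are arbitrary placeholder integers.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Toy version of padding=True + truncation=True on lists of token ids."""
    # Pad to the longest sequence, but never beyond max_length.
    target_len = min(max(len(seq) for seq in sequences), max_length)
    input_ids, attention_mask = [], []
    for seq in sequences:
        # Simplification: a plain slice. A real tokenizer also keeps the
        # closing special token when it truncates.
        seq = seq[:max_length]
        n_pad = target_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


toy_batch = pad_and_truncate(
    [[101, 7592, 102], [101, 2023, 3185, 2001, 2307, 102]],
    max_length=5,
)
print(toy_batch["input_ids"])
print(toy_batch["attention_mask"])
```

The short sequence is padded with zeros and the long one is cut to five ids, so both rows end up the same length, which is what lets them be stacked into one tensor later.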

In [9]:
# import numpy as np
# print(np.asarray(train_encodings.data['input_ids'][0]))


# 3. Create a custom torch Dataset

In [10]:
import torch

class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = IMDbDataset(train_encodings, train_labels)
val_dataset = IMDbDataset(val_encodings, val_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)
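
A quick way to confirm the dataset class behaves as expected is to wrap it in a `DataLoader` and inspect one batch. The sketch below repeats the class so that it runs standalone and feeds it a small hand-written dictionary in place of the real tokenizer output; the ids and labels are made up for illustration.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class IMDbDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # One tensor per field (input_ids, attention_mask, ...) plus the label.
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)


# Placeholder encodings: three already-padded sequences of length 4.
toy_encodings = {
    "input_ids": [[101, 7, 102, 0], [101, 8, 9, 102], [101, 5, 102, 0]],
    "attention_mask": [[1, 1, 1, 0], [1, 1, 1, 1], [1, 1, 1, 0]],
}
toy_labels = [1, 0, 1]

toy_dataset = IMDbDataset(toy_encodings, toy_labels)
toy_loader = DataLoader(toy_dataset, batch_size=2, shuffle=False)

# The default collate function stacks the per-item dicts into batched tensors.
first_batch = next(iter(toy_loader))
for key, value in first_batch.items():
    print(key, tuple(value.shape), value.dtype)
```

Each field comes back with a leading batch dimension, which is the shape the model's forward pass expects.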

# 4. Fine-tuning with Trainer

In [11]:
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=1,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset             # evaluation dataset
)

trainer.train()




Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_projector.weight', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.bias', 'classifier.w

| Step | Training Loss |
|------|---------------|
| 10   | 0.6927 |
| 20   | 0.6962 |
| 30   | 0.6889 |
| 40   | 0.6979 |
| 50   | 0.693  |
| 60   | 0.6972 |
| 70   | 0.6861 |
| 80   | 0.6643 |
| 90   | 0.6711 |
| 100  | 0.6279 |


Saving model checkpoint to ./results/checkpoint-500
Configuration saved in ./results/checkpoint-500/config.json
Model weights saved in ./results/checkpoint-500/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-1000
Configuration saved in ./results/checkpoint-1000/config.json
Model weights saved in ./results/checkpoint-1000/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=1250, training_loss=0.313766100692749, metrics={'train_runtime': 317.9913, 'train_samples_per_second': 62.895, 'train_steps_per_second': 3.931, 'total_flos': 2649347973120000.0, 'train_loss': 0.313766100692749, 'epoch': 1.0})
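
The run above reports only the training loss. One possible way to obtain a validation accuracy as well is to give the `Trainer` a `compute_metrics` function. The sketch below is an assumption about how that could look, not part of the original notebook. It uses plain NumPy and is exercised here on made-up logits rather than real model output.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Turn (logits, labels) into an accuracy score."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # highest-scoring class per row
    accuracy = float((predictions == labels).mean())
    return {"accuracy": accuracy}


# Made-up logits for four examples and two classes.
toy_logits = np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 0.5], [1.2, 0.4]])
toy_labels = np.array([0, 1, 0, 0])

toy_metrics = compute_metrics((toy_logits, toy_labels))
print(toy_metrics)

# With the notebook's objects, the function would be passed as
#   Trainer(model=model, args=training_args, train_dataset=train_dataset,
#           eval_dataset=val_dataset, compute_metrics=compute_metrics)
# and trainer.evaluate() would then report it under the key "eval_accuracy".
```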

# 5. Fine-tuning with native PyTorch

In [12]:
from torch.utils.data import DataLoader
from torch.optim import AdamW  # transformers.AdamW is deprecated in newer transformers releases
from transformers import DistilBertForSequenceClassification

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
model.to(device)
model.train()  # enable dropout for training

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

optim = AdamW(model.parameters(), lr=5e-5)

for epoch in range(1):
    for batch in train_loader:
        optim.zero_grad()  # clear gradients left over from the previous step
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        # Passing labels makes the model compute the classification loss itself
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optim.step()

model.eval()  # switch back to inference mode (disables dropout)

loading configuration file https://huggingface.co/distilbert-base-uncased/resolve/main/config.json from cache at /home/ec2-user/.cache/huggingface/transformers/23454919702d26495337f3da04d1655c7ee010d5ec9d77bdb9e399e00302c0a1.91b885ab15d631bf9cee9dc9d25ece0afd932f2f5130eba28f2055b2220c0333
Model config DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.11.0",
  "vocab_size": 30522
}

loading weights file https://huggingface.co/distilbert-base-uncased/resolve/main/pytorch_model.bin from cache at /home/ec2-user/.cache/huggingface/transformers/9c169103d7e5a73936dd2b627e42851bec0831212

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
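
The native PyTorch loop ends with `model.eval()` but never measures how well the model does. A minimal evaluation loop could look like the sketch below. The function name `evaluate_accuracy` and the `ToyClassifier` stand-in are hypothetical and exist only so the example runs without downloading a model. The loop assumes the model's output exposes a `.logits` attribute, as the Transformers sequence classification models do.

```python
from types import SimpleNamespace

import torch
from torch.utils.data import DataLoader


@torch.no_grad()  # no gradients are needed during evaluation
def evaluate_accuracy(model, data_loader, device):
    """Return the fraction of examples whose argmax prediction matches the label."""
    model.eval()
    correct, total = 0, 0
    for batch in data_loader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)
        outputs = model(input_ids, attention_mask=attention_mask)
        predictions = outputs.logits.argmax(dim=-1)
        correct += (predictions == labels).sum().item()
        total += labels.size(0)
    return correct / max(total, 1)


class ToyClassifier(torch.nn.Module):
    """Hypothetical stand-in: predicts class 1 when the first token id is odd."""

    def forward(self, input_ids, attention_mask=None):
        is_odd = input_ids[:, 0] % 2
        logits = torch.stack([1 - is_odd, is_odd], dim=1).float()
        return SimpleNamespace(logits=logits)


def make_item(ids, label):
    # Build one example in the same dict layout the dataset class produces.
    return {
        "input_ids": torch.tensor(ids),
        "attention_mask": torch.ones(len(ids), dtype=torch.long),
        "labels": torch.tensor(label),
    }


toy_items = [make_item([1, 2], 1), make_item([2, 3], 0),
             make_item([3, 1], 0), make_item([4, 4], 0)]
toy_loader = DataLoader(toy_items, batch_size=2)

toy_accuracy = evaluate_accuracy(ToyClassifier(), toy_loader, torch.device("cpu"))
print(f"toy accuracy: {toy_accuracy:.2f}")
```

With the notebook's objects, the call would be along the lines of `evaluate_accuracy(model, DataLoader(val_dataset, batch_size=64), device)`.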
       

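# 6. Restart the kernel

- After the whole notebook has run, its kernel still holds GPU memory, as the figure below shows. (Run `nvidia-smi` in a terminal.)
![before-nvidia-smi.png](img/before-nvidia-smi.png)

- Running the cell below restarts this notebook's kernel, after which you can confirm that the memory has been released.
![after-nvidia-smi.png](img/after-nvidia-smi.png)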

In [13]:
import IPython

IPython.Application.instance().kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}
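
If restarting the kernel is inconvenient, a lighter alternative is to drop the references to the large objects and ask PyTorch to return its cached blocks to the driver. This is a sketch under the assumption that nothing else still references the model. The helper name `release_cached_gpu_memory` is hypothetical. The CUDA context itself stays alive, so `nvidia-smi` will usually still show some memory in use, and only a restart releases everything.

```python
import gc

import torch


def release_cached_gpu_memory():
    """Collect garbage, empty PyTorch's CUDA cache, and return the bytes still allocated."""
    gc.collect()  # free Python objects that are no longer referenced
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # hand cached, unused blocks back to the driver
        return torch.cuda.memory_allocated()
    return 0  # nothing to release on a CPU-only machine


# In the notebook, delete the large objects first, e.g. `del model, trainer, optim`.
still_allocated = release_cached_gpu_memory()
print(f"bytes still allocated by tensors: {still_allocated}")
```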