# [Module 0.2] IMDB Reviews from Scratch (English)

This notebook shows how to build a custom dataset from the IMDB review dataset (English) and fine-tune a model on it with both the HF Trainer and native PyTorch.

The main steps are:
- 1. Download the IMDB data
- 2. Prepare the dataset
- 3. Create a torch custom Dataset
- 4. Fine-tuning with Trainer
- 5. Fine-tuning with native PyTorch



---
### Reference:
[Fine-tuning with custom datasets](https://huggingface.co/transformers/v3.2.0/custom_datasets.html)

# 1. Download the IMDB data

In [3]:
! wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
! tar -xf aclImdb_v1.tar.gz

--2022-05-29 02:09:43--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz’


2022-05-29 02:09:46 (27.2 MB/s) - ‘aclImdb_v1.tar.gz’ saved [84125825/84125825]



In [1]:
from pathlib import Path

def read_imdb_split(split_dir):
    split_dir = Path(split_dir)
    texts = []
    labels = []
    for label_dir in ["pos", "neg"]:
        for text_file in (split_dir/label_dir).iterdir():
            texts.append(text_file.read_text())
            # Compare by value with ==; `is` tests object identity and is unreliable for strings
            labels.append(0 if label_dir == "neg" else 1)

    return texts, labels

train_texts, train_labels = read_imdb_split('aclImdb/train')
test_texts, test_labels = read_imdb_split('aclImdb/test')

In [2]:
print(f"train texts length: {len(train_texts)} , Sample: {train_texts[0]}")
print(f"train labels length: {len(train_labels)} , Sample: {train_labels[0]}")

train texts length: 25000 , Sample: Hayao Miyazaki has no equal when it comes to using hand-drawn animation as a form of storytelling, yet often he is being compared to Walt Disney. That is just so unfair, because it becomes apparent by watching Miyazaki's films that he is the superior artist. He really has a gift of thrilling both grownups and children, and Laputa is indeed one awesome ride.<br /><br />But where can I begin to describe a movie so magical and breathtaking! Miyazaki's works have never cease to amaze me. Laputa is an adventure of a grand scale and I wonder how a film can be so packed with details and imagination. Ask yourself this question: if you are a kid dreaming of an adventure so grand in scope and so magical, what would it be like? The answer would be to strap yourself in some seat and watch Laputa, because it's truly a childhood fantasy come true. Every minute of the movie is rich and engrossing ... from the train chase to the amazing air-flying sequences... and t

In [3]:
print(f"test texts length: {len(test_texts)} , Sample: {test_texts[0]}")
print(f"test labels length: {len(test_labels)} , Sample: {test_labels[0]}")

test texts length: 25000 , Sample: I taped The Morrison Murders on Lifetime Movie network and I watched The Morrison Murders on Lifetime, Lifetime Movie network and on Courttv. Jonathan Scarfe and John Corbett did a great job of playing Luke and Walker Morrison. I am glad that Walker got his brother Luke to confess of murdering his parents and their brother Bobby. I enjoy watching True stories on Lifetime, Lifetime Movie network and on Courttv. The Morrison Murders is a good movie to watch. Next time The Morrison Murders is on Lifetime, Lifetime Movienetwork or Courttv I am going to watch The Morrrison Murders again because My favorite actor John Corbbett is in The Morrison Murders. I give The Morrison Murders a ten because it is a good movie about Walker who tries to find out who killed his parents and his brother Bobby and at the end Walker discovers it was his brother Luke who murdered his parents and his brother Bobby.
test labels length: 25000 , Sample: 1
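
The loader above expects the layout `aclImdb/<split>/{pos,neg}/*.txt` and maps `neg` to 0 and `pos` to 1. Because `pos` is read first, the first samples in each list are all positive, which is why both printed sample labels are 1. The following is a minimal sketch that runs the same logic on a throwaway directory; the file names and review texts are made up for illustration, and the `sorted()` call and explicit `encoding` are additions that are not in the original cell.

```python
import tempfile
from pathlib import Path

def read_imdb_split(split_dir):
    """Read every review under split_dir/pos and split_dir/neg."""
    split_dir = Path(split_dir)
    texts, labels = [], []
    for label_dir in ["pos", "neg"]:
        # sorted() makes the order reproducible; iterdir() order is arbitrary
        for text_file in sorted((split_dir / label_dir).iterdir()):
            texts.append(text_file.read_text(encoding="utf-8"))
            labels.append(0 if label_dir == "neg" else 1)
    return texts, labels

# Build a tiny fake split (hypothetical file names and texts).
fake_reviews = {"pos": ["great film", "loved it"], "neg": ["dull"]}
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "train"
    for label_dir, reviews in fake_reviews.items():
        (root / label_dir).mkdir(parents=True)
        for i, review in enumerate(reviews):
            (root / label_dir / f"{i}_0.txt").write_text(review, encoding="utf-8")
    toy_texts, toy_labels = read_imdb_split(root)

print(toy_texts, toy_labels)
```

Since the lists come back grouped by label, any split or training loop that follows should shuffle them.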


# 2. Prepare the dataset

## Create the validation set

In [4]:
from sklearn.model_selection import train_test_split
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)
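
The split above is random on every run and does not enforce class balance. As a hedged variation (not what the original cell does), `train_test_split` also accepts `random_state` for reproducibility and `stratify` to keep the label ratio equal in both parts. A small sketch on toy data, where the texts and labels are placeholders:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy stand-ins for the real review lists: 5 positive, 5 negative.
toy_texts = [f"review {i}" for i in range(10)]
toy_labels = [1] * 5 + [0] * 5

tr_texts, va_texts, tr_labels, va_labels = train_test_split(
    toy_texts,
    toy_labels,
    test_size=0.2,        # same 80/20 ratio as the cell above
    stratify=toy_labels,  # keep the pos/neg ratio identical in both parts
    random_state=42,      # fixed seed so the split is repeatable
)

print(Counter(tr_labels), Counter(va_labels))
```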

## Create the tokenizer for the distilbert-base-uncased model

In [5]:
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

## Create the input encodings for the distilbert-base-uncased model

In [6]:
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)

In [10]:
train_encodings[0]

Encoding(num_tokens=512, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])
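
The `num_tokens=512` above follows from the two tokenizer flags: `truncation=True` cuts each review to the model's maximum input length, and `padding=True` pads every sequence to the longest one in the call. The sketch below imitates that behaviour in plain Python to make the resulting `input_ids` and `attention_mask` concrete. `pad_and_truncate` is a hypothetical helper, not a Transformers API, and unlike the real tokenizer it does not preserve the closing special token when it truncates.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Toy stand-in for tokenizer(..., truncation=True, padding=True)."""
    # Truncation: drop everything past max_length.
    truncated = [ids[:max_length] for ids in batch_ids]
    # Padding: extend every sequence to the longest one in the batch.
    longest = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids; max_length=5 plays the role of the 512-token limit.
toy = pad_and_truncate(
    [[101, 7, 8, 102], [101, 9, 102], [101, 1, 2, 3, 4, 5, 102]],
    max_length=5,
)
for row, mask in zip(toy["input_ids"], toy["attention_mask"]):
    print(row, mask)
```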

In [30]:
# import numpy as np
# print(np.asarray(train_encodings.data['input_ids'][0]))


# 3. Create a torch custom Dataset

In [31]:
import torch

class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = IMDbDataset(train_encodings, train_labels)
val_dataset = IMDbDataset(val_encodings, val_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)
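
Before training, it can help to confirm that the dataset collates into batches of the expected shape. The sketch below repeats the same `__getitem__` pattern on hand-written encodings; `ToyDataset` and the token ids are illustrative stand-ins, and a plain dict is assumed to behave like the tokenizer output for the purposes of `.items()`.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    """Same structure as IMDbDataset, fed with hand-written encodings."""
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # One tensor per encoding field, plus the label under 'labels'.
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

toy_encodings = {
    "input_ids": [[101, 7, 102, 0], [101, 8, 9, 102], [101, 102, 0, 0]],
    "attention_mask": [[1, 1, 1, 0], [1, 1, 1, 1], [1, 1, 0, 0]],
}
toy_labels = [1, 0, 1]

loader = DataLoader(ToyDataset(toy_encodings, toy_labels), batch_size=2, shuffle=False)
batch = next(iter(loader))
print({key: tuple(val.shape) for key, val in batch.items()})
```

The default collate function stacks the per-sample dicts key by key, which only works because every sequence was padded to the same length beforehand.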

# 4. Fine-tuning with Trainer

In [32]:
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=1,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset             # evaluation dataset
)

trainer.train()

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier

Step,Training Loss
10,0.6911
20,0.6907
30,0.6856
40,0.68
50,0.6673
60,0.6314
70,0.5534
80,0.4444
90,0.3651
100,0.3512


Saving model checkpoint to ./results/checkpoint-500
Configuration saved in ./results/checkpoint-500/config.json
Model weights saved in ./results/checkpoint-500/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=939, training_loss=0.22117599048299658, metrics={'train_runtime': 284.0823, 'train_samples_per_second': 211.206, 'train_steps_per_second': 3.305, 'total_flos': 7948043919360000.0, 'train_loss': 0.22117599048299658, 'epoch': 3.0})
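
The run above reports only the training loss. `Trainer` accepts a `compute_metrics` callable that receives the predictions and label ids gathered during evaluation. A minimal accuracy sketch follows, checked here on made-up logits; the function name and toy arrays are illustrative and not part of the original notebook.

```python
import numpy as np

def compute_metrics(eval_pred):
    """Accuracy from (logits, labels), the pair Trainer passes at evaluation."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)  # predicted class = highest logit
    return {"accuracy": float((preds == labels).mean())}

# Made-up logits for four reviews and two classes.
toy_logits = np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 0.5], [1.2, 0.4]])
toy_label_ids = np.array([0, 1, 0, 0])
toy_metrics = compute_metrics((toy_logits, toy_label_ids))
print(toy_metrics)
```

To use it, pass `compute_metrics=compute_metrics` when constructing the `Trainer` and call `trainer.evaluate()`; the returned dictionary then reports the value under an `eval_`-prefixed key.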

# 5. Fine-tuning with native PyTorch

In [34]:
from torch.utils.data import DataLoader
from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of the PyTorch implementation
from transformers import DistilBertForSequenceClassification

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
model.to(device)
model.train()

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

optim = AdamW(model.parameters(), lr=5e-5)

for epoch in range(1):
    for batch in train_loader:
        optim.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs[0]
        loss.backward()
        optim.step()

model.eval()

loading configuration file https://huggingface.co/distilbert-base-uncased/resolve/main/config.json from cache at /home/ec2-user/.cache/huggingface/transformers/23454919702d26495337f3da04d1655c7ee010d5ec9d77bdb9e399e00302c0a1.91b885ab15d631bf9cee9dc9d25ece0afd932f2f5130eba28f2055b2220c0333
Model config DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.18.0",
  "vocab_size": 30522
}

loading weights file https://huggingface.co/distilbert-base-uncased/resolve/main/pytorch_model.bin from cache at /home/ec2-user/.cache/huggingface/transformers/9c169103d7e5a73936dd2b627e42851bec0831212

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
       