# [Module 0.2] IMDB Reviews from Scratch (English)

This module uses the IMDB dataset (English) to build a custom dataset and then fine-tune a model on it with both the HF Trainer and native PyTorch.

The main steps are:
- 1. Download the IMDB data
- 2. Prepare the datasets
- 3. Create a custom torch Dataset
- 4. Fine-tuning with Trainer
- 5. Fine-tuning with native PyTorch



---
### Reference:
[Fine-tuning with custom datasets](https://huggingface.co/transformers/v3.2.0/custom_datasets.html)

# 1. Download the IMDB data

In [1]:
! wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
! tar -xf aclImdb_v1.tar.gz

--2022-06-08 14:13:30--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz.2’


2022-06-08 14:13:33 (26.2 MB/s) - ‘aclImdb_v1.tar.gz.2’ saved [84125825/84125825]



In [2]:
from pathlib import Path

def read_imdb_split(split_dir):
    split_dir = Path(split_dir)
    texts = []
    labels = []
    for label_dir in ["pos", "neg"]:
        for text_file in (split_dir/label_dir).iterdir():
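            # Read as UTF-8 explicitly so the result does not depend on the platform default
            texts.append(text_file.read_text(encoding="utf-8"))
            # Compare strings with ==; `is` checks object identity, not equality
            labels.append(0 if label_dir == "neg" else 1)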

    return texts, labels

train_texts, train_labels = read_imdb_split('aclImdb/train')
test_texts, test_labels = read_imdb_split('aclImdb/test')
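
The label mapping in `read_imdb_split` can be checked without downloading the archive. The following is a minimal sketch that runs the same logic against a temporary directory holding one positive and one negative review. The file names and review texts are made up for illustration, and `sorted()` is added only to make the file order deterministic.

```python
import tempfile
from pathlib import Path

def read_imdb_split(split_dir):
    split_dir = Path(split_dir)
    texts, labels = [], []
    for label_dir in ["pos", "neg"]:
        # sorted() is an addition here so the demo output is deterministic
        for text_file in sorted((split_dir / label_dir).iterdir()):
            texts.append(text_file.read_text(encoding="utf-8"))
            labels.append(0 if label_dir == "neg" else 1)
    return texts, labels

# Build a tiny stand-in for aclImdb/train (hypothetical contents).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name, text in [("pos", "great film"), ("neg", "dull film")]:
        (root / name).mkdir()
        (root / name / "0_1.txt").write_text(text, encoding="utf-8")
    demo_texts, demo_labels = read_imdb_split(root)

print(demo_texts, demo_labels)  # positive reviews map to 1, negative to 0
```

The sketch compares directory names with `==`, since `is` tests object identity and is not guaranteed to hold for equal strings.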

In [3]:
print(f"train texts length: {len(train_texts)} , Sample: {train_texts[0]}")
print(f"train labels length: {len(train_labels)} , Sample: {train_labels[0]}")

train texts length: 25000 , Sample: This is an entertaining "history" of the FBI, but it should be viewed as fiction, because that's exactly what it is. What else could it be when J. Edgar Hoover personally approved and had a cameo role in the production. James Stewart is excellent, as usual, and the supporting cast, except for the talentless Vera Miles, is good. Murray Hamilton is especially good in a supporting role as Stewart's partner and best friend. The FBI accomplishments that the film highlights are undoubtedly all true. What is significant is what it leaves out.<br /><br />One of the most shameful parts of the film is the depiction of the killing of John Dillinger. It is portrayed pretty much as it happened, but no mention at all is made of Melvin Purvis, the Chicago Bureau Chief who headed the operation. Instead, the operation is depicted as if the fictional Chip Hardesty were running it. It has been said that Hoover was jealous of the publicity that Purvis received after Dil

In [4]:
print(f"test texts length: {len(test_texts)} , Sample: {test_texts[0]}")
print(f"test labels length: {len(test_labels)} , Sample: {test_labels[0]}")

test texts length: 25000 , Sample: for a movie like this little hidden gem to come out in the 80s, its shocking how not a lot of people know about it.<br /><br />this movie is definitely worth a look. it has all the things you need for a horror movie. especially the good old chills.<br /><br />i remember watching this movie for the first time about 15 years ago, but i couldn't remember the name of it, so i came to IMDb a few years ago to ask for help on finding the title. i eventually got the name of the title, and bought the movie. i still love it as much as i did all those years ago.<br /><br />buy this movie!!
test labels length: 25000 , Sample: 1


# 2. Prepare the datasets

## Create the validation set

In [5]:
from sklearn.model_selection import train_test_split
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)
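
The call above shuffles before splitting but fixes no seed, so the validation set differs from run to run. A minimal sketch on toy data, assuming an arbitrary seed (`random_state=42` is not from the original notebook): `random_state` makes the split repeatable and `stratify` keeps the positive/negative ratio the same in both parts.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy stand-in for the IMDB lists: 5 positive then 5 negative reviews.
texts = [f"review {i}" for i in range(10)]
labels = [1] * 5 + [0] * 5

tr_texts, va_texts, tr_labels, va_labels = train_test_split(
    texts,
    labels,
    test_size=0.2,      # same 80/20 ratio as the notebook
    random_state=42,    # arbitrary seed, chosen only for repeatability
    stratify=labels,    # preserve the class balance in both splits
)

print(Counter(tr_labels), Counter(va_labels))
```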

## Create the tokenizer for the distilbert-base-uncased model

In [6]:
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

  "Passing `gradient_checkpointing` to a config initialization is deprecated and will be removed in v5 "


## Create the input encodings for the distilbert-base-uncased model

In [7]:
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)

In [8]:
train_encodings[0]

Encoding(num_tokens=512, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])
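
The `num_tokens=512` in this output reflects `truncation=True` cutting long reviews to the model's 512-token limit, while `padding=True` pads shorter ones to the longest sequence in the call. The plain-Python sketch below illustrates the idea on lists of token ids. `pad_and_truncate` is a hypothetical helper, not the tokenizer's implementation, and unlike the real tokenizer it does not keep the closing special token when it truncates.

```python
def pad_and_truncate(batch_ids, max_length=512, pad_id=0):
    """Roughly mimic truncation=True, padding=True on lists of token ids."""
    # Truncation: drop everything past max_length.
    truncated = [ids[:max_length] for ids in batch_ids]
    # Padding: extend every sequence to the longest one in the batch.
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [
        [1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids; max_length=4 keeps the example small.
toy = pad_and_truncate([[101, 7592, 2088, 999, 102], [101, 102]], max_length=4)
print(toy["input_ids"])
print(toy["attention_mask"])
```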

In [9]:
# import numpy as np
# print(np.asarray(train_encodings.data['input_ids'][0]))


# 3. Create a custom torch Dataset

In [10]:
import torch

class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = IMDbDataset(train_encodings, train_labels)
val_dataset = IMDbDataset(val_encodings, val_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)
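
To see what such a dataset hands to a training loop, here is a small sketch with the same class structure applied to made-up encodings. `ToyDataset` and the token ids are illustrative assumptions; a plain dict stands in for the tokenizer output, which works because only `.items()` and indexing are used.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    """Same structure as IMDbDataset, fed with a plain dict of lists."""
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

# Hypothetical, already padded encodings for three short reviews.
toy_encodings = {
    "input_ids": [[101, 2307, 102], [101, 2919, 102], [101, 102, 0]],
    "attention_mask": [[1, 1, 1], [1, 1, 1], [1, 1, 0]],
}
toy_labels = [1, 0, 1]

dataset = ToyDataset(toy_encodings, toy_labels)
item = dataset[0]
# The default collate function stacks the per-item dicts into batch tensors.
batch = next(iter(DataLoader(dataset, batch_size=2)))
print(sorted(item.keys()), batch["input_ids"].shape)
```

Each item is a dict of tensors keyed by the argument names the model expects, which is why both the Trainer and a manual loop can consume it directly.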

# 4. Fine-tuning with Trainer

In [11]:
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=1,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset             # evaluation dataset
)

trainer.train()

  "Passing `gradient_checkpointing` to a config initialization is deprecated and will be removed in v5 "
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at di

Step,Training Loss
10,0.6938
20,0.6989
30,0.6861
40,0.6854
50,0.6799
60,0.6727
70,0.6632
80,0.6472
90,0.6054
100,0.5113


Saving model checkpoint to ./results/checkpoint-500
Configuration saved in ./results/checkpoint-500/config.json
Model weights saved in ./results/checkpoint-500/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-1000
Configuration saved in ./results/checkpoint-1000/config.json
Model weights saved in ./results/checkpoint-1000/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=1250, training_loss=0.30584291639328004, metrics={'train_runtime': 333.6115, 'train_samples_per_second': 59.95, 'train_steps_per_second': 3.747, 'total_flos': 2649347973120000.0, 'train_loss': 0.30584291639328004, 'epoch': 1.0})
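
The `global_step=1250` in this output follows from the split and the batch size: 20,000 training reviews (80% of 25,000) at 16 per step. The sketch below reproduces that count and approximates the learning rate a linear schedule with `warmup_steps=500` would produce. It assumes a single device and the `TrainingArguments` defaults of `learning_rate=5e-5` with a linear scheduler; `linear_warmup_lr` is a hypothetical helper, not a library function.

```python
import math

num_train = int(25000 * 0.8)  # 20,000 reviews remain after the 20% validation split
batch_size = 16               # per_device_train_batch_size, single device assumed
epochs = 1
total_steps = math.ceil(num_train / batch_size) * epochs

def linear_warmup_lr(step, base_lr=5e-5, warmup_steps=500, total=total_steps):
    """Learning rate at a given step: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total - step) / (total - warmup_steps))

print(total_steps)
for step in (0, 250, 500, 875, 1250):
    print(step, linear_warmup_lr(step))
```

With only 1,250 steps in total, the 500 warmup steps take up 40% of the run, so the peak learning rate is reached shortly before the halfway point.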

# 5. Fine-tuning with native PyTorch

In [12]:
from torch.utils.data import DataLoader
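from transformers import DistilBertForSequenceClassification
# transformers.AdamW is deprecated in newer releases; torch's AdamW is used instead
# (note: its default weight_decay is 0.01)
from torch.optim import AdamW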

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
model.to(device)
model.train()

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

optim = AdamW(model.parameters(), lr=5e-5)

for epoch in range(1):
    for batch in train_loader:
        optim.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs[0]  # when labels are passed, the first output is the loss
        loss.backward()
        optim.step()

model.eval()

loading configuration file https://huggingface.co/distilbert-base-uncased/resolve/main/config.json from cache at /home/ec2-user/.cache/huggingface/transformers/23454919702d26495337f3da04d1655c7ee010d5ec9d77bdb9e399e00302c0a1.91b885ab15d631bf9cee9dc9d25ece0afd932f2f5130eba28f2055b2220c0333
  "Passing `gradient_checkpointing` to a config initialization is deprecated and will be removed in v5 "
Model config DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.11.0",
  "vocab_size": 30522
}

loading weights file https://huggingface.co/distilbert-base-uncased/resolve/main/pytorch_model.b

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
       