# [Module 2.1] Model Training from Scratch




This notebook covers the following main tasks.
- 1. Environment setup
- 2. Data loading
- 3. Using the Hugging Face Electra tokenizer and pre-trained model
- 4. Creating a custom torch Dataset and preparing for training
- 5. Model fine-tuning
    - 5.1. Fine-tuning with native PyTorch
    - 5.2. Training with a Python script
    - 5.3. Fine-tuning with Trainer
    
---
### References:
- Reference material for fine-tuning with a custom dataset
    - [Fine-tuning with custom datasets](https://huggingface.co/transformers/v3.2.0/custom_datasets.html)

# 1. Environment setup

In [6]:
%load_ext autoreload
%autoreload 2

# Add the src folder to the import path
import torch
import sys
sys.path.append('./src')
import config
from data_util import read_nsmc_split

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [7]:
import logging
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
# logger.setLevel(logging.WARNING)
logger.addHandler(logging.StreamHandler(sys.stdout))
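
Re-running this cell in the same kernel attaches another `StreamHandler` to the same logger, and every message is then printed once per handler. A minimal sketch of a guard follows; the helper name `get_notebook_logger` is hypothetical and is not part of this notebook's `src` modules.

```python
import logging
import sys


def get_notebook_logger(name: str, level: int = logging.INFO) -> logging.Logger:
    """Return a logger with exactly one stdout handler (hypothetical helper)."""
    log = logging.getLogger(name)
    log.setLevel(level)
    # Attach a handler only on the first call so re-running the cell is harmless.
    if not log.handlers:
        log.addHandler(logging.StreamHandler(sys.stdout))
    # Do not forward records to the root logger, which may print them again.
    log.propagate = False
    return log


logger_demo = get_notebook_logger("nsmc_demo")
logger_demo = get_notebook_logger("nsmc_demo")  # second call adds no handler
```

Calling the helper twice leaves a single handler on the logger, so a re-run cell keeps logging each line once.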

In [8]:
%store -r local_train_output_path
%store -r local_test_output_path



# 2. Data loading


## 2.1. Loading the training data

In [9]:
train_texts, train_labels = read_nsmc_split(local_train_output_path)
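
`read_nsmc_split` is defined in `src/data_util.py` and is not shown in this notebook. As a rough sketch of what such a reader could look like, the code below assumes the file follows the public NSMC layout (tab-separated `id`, `document`, `label` columns with a header row). That layout and the function name `read_split_sketch` are assumptions; the preprocessed file used here may differ.

```python
import csv
import io
from typing import List, TextIO, Tuple


def read_split_sketch(handle: TextIO) -> Tuple[List[str], List[int]]:
    """Parse tab-separated rows of (id, document, label) into texts and labels."""
    texts, labels = [], []
    # QUOTE_NONE keeps quote characters inside reviews as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        document = (row.get("document") or "").strip()
        if not document:  # skip rows whose review text is empty
            continue
        texts.append(document)
        labels.append(int(row["label"]))
    return texts, labels


# In-memory stand-in for a real file, so the sketch runs without any data on disk.
sample = io.StringIO("id\tdocument\tlabel\n1\tgood movie\t1\n2\t\t0\n3\tboring\t0\n")
demo_texts, demo_labels = read_split_sketch(sample)
```

The two returned lists stay aligned by position, which is the shape the later tokenizer and dataset cells rely on.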

In [10]:
logger.info(f"len: {len(train_texts)} \nSample: {train_texts[0:5]}")
logger.info(f"len: {len(train_labels)} \nSample: {train_labels[0:5]}")

len: 149552 
Sample: ['흠   포스터보고 초딩영화줄    오버연기조차 가볍지 않구나', '너무재밓었다그래서보는것을추천한다', '교도소 이야기구먼   솔직히 재미는 없다  평점 조정', '사이몬페그의 익살스런 연기가 돋보였던 영화 스파이더맨에서 늙어보이기만 했던 커스틴 던스트가 너무나도 이뻐보였다', '막 걸음마 뗀 세부터 초등학교 학년생인 살용영화 ㅋㅋㅋ   별반개도 아까움']
len: 149552 
Sample: [1, 0, 0, 1, 0]


## 2.2. Creating the validation dataset

In [11]:
from sklearn.model_selection import train_test_split
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)
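
The call above passes no `random_state`, so the validation set can change from run to run. scikit-learn's `train_test_split` accepts `random_state` and `stratify` for a reproducible, class-balanced split. The dependency-free sketch below illustrates the same idea; the function name `stratified_split` and the seed value are illustrative choices, not something this notebook defines.

```python
import random
from collections import defaultdict


def stratified_split(texts, labels, test_size=0.2, seed=100):
    """Split so each label keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    val_indices = []
    for indices in by_label.values():
        rng.shuffle(indices)  # seeded shuffle, so the split is repeatable
        n_val = round(len(indices) * test_size)
        val_indices.extend(indices[:n_val])
    val_set = set(val_indices)

    train = [(texts[i], labels[i]) for i in range(len(texts)) if i not in val_set]
    val = [(texts[i], labels[i]) for i in sorted(val_set)]
    return train, val


toy_texts = [f"review {i}" for i in range(10)]
toy_labels = [0, 1] * 5
toy_train, toy_val = stratified_split(toy_texts, toy_labels)
```

With five examples per class and `test_size=0.2`, one example of each class lands in the validation part, and repeating the call with the same seed returns the same split.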

# 3. Using the Hugging Face Electra tokenizer and pre-trained model

## 3.1. Loading the Electra library

In [12]:
# from datasets import load_dataset
from transformers import (
    ElectraModel, 
    ElectraTokenizer, 
    ElectraForSequenceClassification, 
    Trainer, 
    TrainingArguments, 
    set_seed
)
# from transformers.trainer_utils import get_last_checkpoint



## 3.2. Specifying the pre-trained model_id and tokenizer_id
- [KoElectra Git](https://github.com/monologg/KoELECTRA)
- KoElectra Model
    - Small:
        - monologg/koelectra-small-v3-discriminator
    - Base: 
        - monologg/koelectra-base-v3-discriminator
        


In [13]:
tokenizer_id = 'monologg/koelectra-small-v3-discriminator'
model_id = "monologg/koelectra-small-v3-discriminator"


## 3.3. Creating the input encodings for the Electra model

In [14]:
%%time 

tokenizer = ElectraTokenizer.from_pretrained(tokenizer_id)

train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
# test_encodings = tokenizer(test_texts, truncation=True, padding=True)

CPU times: user 37.9 s, sys: 307 ms, total: 38.2 s
Wall time: 38.3 s
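
With `padding=True` the tokenizer pads every sequence in the call to the longest one, and with `truncation=True` it cuts sequences that exceed the maximum length. Because a whole split is encoded in a single call here, each example ends up as long as the longest review in that split. The toy function below sketches the mechanics on lists of token ids. `pad_and_truncate` is a hypothetical name, and unlike the real tokenizer it does not preserve the trailing special token when truncating.

```python
from typing import Dict, List


def pad_and_truncate(
    batch: List[List[int]], max_length: int, pad_id: int = 0
) -> Dict[str, List[List[int]]]:
    """Cut each sequence to max_length, then pad all to the longest one."""
    truncated = [ids[:max_length] for ids in batch]
    width = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = width - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


toy_batch = [[2, 11, 12, 3], [2, 21, 3], [2, 31, 32, 33, 34, 3]]
encoded = pad_and_truncate(toy_batch, max_length=5)
```

The zeros at the end of `input_ids` and `attention_mask` in the batch printed later in this notebook are this kind of padding.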


In [15]:
logger.info(f'type of train_encoding: {type(train_encodings)}')

type of train_encoding: <class 'transformers.tokenization_utils_base.BatchEncoding'>


# 4. Creating a custom torch Dataset and preparing for training

## 4.1. Creating the custom torch dataset

In [16]:
from data_util import NSMCDataset

train_dataset = NSMCDataset(train_encodings, train_labels)
val_dataset = NSMCDataset(val_encodings, val_labels)
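
`NSMCDataset` comes from `src/data_util.py` and its body is not shown here. The Hugging Face guide referenced at the top wraps the encodings and labels in a `torch.utils.data.Dataset` whose `__getitem__` returns one example as a dict. The sketch below follows that shape under a hypothetical name. To stay dependency-free it returns plain Python values; a real implementation would most likely subclass `torch.utils.data.Dataset` and wrap each value in `torch.tensor`.

```python
class EncodedDatasetSketch:
    """Map-style dataset pairing tokenizer output with labels (hypothetical)."""

    def __init__(self, encodings, labels):
        self.encodings = encodings  # dict of field name -> list of sequences
        self.labels = labels        # list of integer class ids

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Pick the idx-th row of every field, then attach the label.
        item = {key: values[idx] for key, values in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item


toy_encodings = {
    "input_ids": [[2, 5, 3], [2, 6, 3]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}
toy_dataset = EncodedDatasetSketch(toy_encodings, [1, 0])
```

The `labels` key matches the field name that appears in the batches produced by the data loaders in section 5.1.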

In [17]:
logger.info(f"len(train_dataset) : {len(train_dataset)}")
logger.info(f"len(val_dataset) : {len(val_dataset)}")


len(train_dataset) : 119641
len(val_dataset) : 29911


## 4.2. Creating dataset metadata

In [18]:
from train_util import create_train_meta
# Prepare model labels - useful in inference API
seed = 100

# Set seed before initializing model
set_seed(seed)
    
num_labels, label2id, id2label = create_train_meta(train_dataset)
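
`create_train_meta` lives in `src/train_util.py` and is not shown. The three values it returns are passed to `from_pretrained` later, so they presumably describe the label set. A minimal sketch of deriving them from a list of labels follows; the function name `build_label_maps` and the choice of string label names are assumptions, and the real helper takes the dataset object instead of a list.

```python
def build_label_maps(labels):
    """Return (num_labels, label2id, id2label) for a list of class labels."""
    unique = sorted(set(labels))
    label2id = {str(label): index for index, label in enumerate(unique)}
    id2label = {index: str(label) for index, label in enumerate(unique)}
    return len(unique), label2id, id2label


demo_num_labels, demo_label2id, demo_id2label = build_label_maps([1, 0, 0, 1, 0])
```

For the binary sentiment labels used here this gives two classes with mappings in both directions.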

# 5. Model fine-tuning

## 5.1. Fine-tuning with native PyTorch

### Creating the train data loaders
- When using a subset of the data for debugging
    - train_sample_loader
    - eval_sample_loader
- When using the full dataset
    - train_loader
    - eval_loader

In [19]:
from torch.utils.data import DataLoader, SubsetRandomSampler


from train_util import create_random_sampler
    
subset_train_sampler = create_random_sampler(train_dataset, frac=0.01, is_shuffle=True, logger=logger)
train_sampler = create_random_sampler(train_dataset, frac=1, is_shuffle=True, logger=logger)

subset_eval_sampler = create_random_sampler(val_dataset, frac=0.001, is_shuffle=False, logger=logger)
eval_sampler = create_random_sampler(val_dataset, frac=1, is_shuffle=False, logger=logger)

# subset_test_sampler = create_random_sampler(test_dataset, frac=0.001, is_shuffle=False, logger=logger)
# test_sampler = create_random_sampler(test_dataset, frac=1, is_shuffle=False, logger=logger)
    
train_sample_loader = DataLoader(dataset=train_dataset, 
                          shuffle=False, 
                          batch_size=16, 
                          sampler=subset_train_sampler)    

train_loader = DataLoader(dataset=train_dataset, 
                          shuffle=False, 
                          batch_size=16, 
                          sampler=train_sampler)    

eval_sample_loader = DataLoader(dataset=val_dataset, 
                          shuffle=False, 
                          batch_size=16, 
                          sampler=subset_eval_sampler)    

eval_loader = DataLoader(dataset=val_dataset, 
                          shuffle=False, 
                          batch_size=16, 
                          sampler=eval_sampler)    


dataset size with frac: 0.01 ==> 1196
dataset size with frac: 1 ==> 119641
dataset size with frac: 0.001 ==> 29
dataset size with frac: 1 ==> 29911
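
The sizes logged above are consistent with taking `int(len(dataset) * frac)` indices: 119641 × 0.01 gives 1196 and 29911 × 0.001 gives 29. The sketch below shows one way such an index list could be built. The function name `subset_indices` and the seed are assumptions, since `create_random_sampler` itself is not shown. A list like this can then be handed to a sampler such as `torch.utils.data.SubsetRandomSampler`. Note that `DataLoader` does not allow `shuffle=True` together with a `sampler`, which is why every loader above sets `shuffle=False`.

```python
import random


def subset_indices(dataset_size, frac, is_shuffle, seed=100):
    """Return the first int(dataset_size * frac) indices, optionally shuffled."""
    indices = list(range(dataset_size))
    if is_shuffle:
        random.Random(seed).shuffle(indices)  # seeded for repeatable subsets
    subset_size = int(dataset_size * frac)    # truncates toward zero
    return indices[:subset_size]


train_subset = subset_indices(119641, 0.01, is_shuffle=True)
eval_subset = subset_indices(29911, 0.001, is_shuffle=False)
```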


In [20]:
next(iter(train_sample_loader))

{'input_ids': tensor([[    2,  2784,  4172,  ...,     0,     0,     0],
         [    2, 12707,  5158,  ...,     0,     0,     0],
         [    2,  7549, 27802,  ...,     0,     0,     0],
         ...,
         [    2, 18774,  4270,  ...,     0,     0,     0],
         [    2,  9338,  4086,  ...,     0,     0,     0],
         [    2, 25901,  4073,  ...,     0,     0,     0]]),
 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         ...,
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         ...,
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]]),
 'labels': tensor([1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1])}

### Defining parameters

In [21]:
class Params:
    def __init__(self):
        self.epochs = 1        
        self.batch_size = 256
        self.lr = 0.001
        self.log_interval = 50
        self.model_dir = config.model_dir
                        
args = Params()
print("# of epochs: ", args.epochs)

# of epochs:  1


### Loading the model

In [22]:
from transformers import AdamW

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

model = ElectraForSequenceClassification.from_pretrained(
    model_id, num_labels=num_labels, label2id=label2id, id2label=id2label
)

model.to(device)
model.train()

optimizer = AdamW(model.parameters(), lr=5e-5)

Some weights of the model checkpoint at monologg/koelectra-small-v3-discriminator were not used when initializing ElectraForSequenceClassification: ['discriminator_predictions.dense_prediction.bias', 'discriminator_predictions.dense.weight', 'discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense.bias']
- This IS expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at monologg/koelectra-small-v3-discriminator and are newly initialized

### Running the training loop

In [23]:
from train_util import train_epoch, eval_epoch, save_best_model
import time

epochs = 1
best_acc = 0
for epoch in range(epochs):
    start_time = time.time()

    train_epoch(args, 
                model, 
                train_sample_loader, 
                optimizer, 
                epoch, 
                device, 
                logger,
                sampler=None, 
                )            

    elapsed_time = time.time() - start_time    
    print("The time elapse of epoch {:03d}".format(epoch) + " is: " + 
                time.strftime("%H: %M: %S", time.gmtime(elapsed_time)))

    acc = eval_epoch(args, 
               model, 
               epoch, 
               device, 
               logger,
               eval_sample_loader)
    
    best_acc = save_best_model(model, 
                               acc, 
                               epoch, 
                               best_acc,
                               args.model_dir,
                               logger)            


The time elapse of epoch 000 is: 00: 00: 04
Train Epoch: 0 Acc=0.519231;
the model is saved at models/nsmc/sentimental-electro-hf.pth
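
`save_best_model` returns the updated `best_acc`, and the log shows a save once the accuracy exceeds the initial value of 0. The sketch below captures that keep-the-best logic with a hypothetical `update_best` function. The `save_fn` callback stands in for the actual save step, which in a real implementation would probably be something like `torch.save(model.state_dict(), path)`.

```python
def update_best(acc, best_acc, save_fn):
    """Call save_fn and return acc when it beats best_acc; else keep best_acc."""
    if acc > best_acc:
        save_fn()
        return acc
    return best_acc


saved = []  # records each simulated save
best = 0.0
for epoch_acc in [0.52, 0.50, 0.61]:
    best = update_best(epoch_acc, best, lambda: saved.append("checkpoint"))
```

Only the first and third epochs improve on the running best, so the model is saved twice and the weaker second epoch never overwrites the better checkpoint.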


## 5.2. Training with a Python script

In [25]:
class ParamsScript:
    def __init__(self):
        self.epochs = 1        
        self.train_batch_size = 32        
        self.eval_batch_size = 128 
        self.learning_rate = 5e-5
        self.warmup_steps = 0      
        self.fp16 = True
        self.tokenizer_id = 'monologg/koelectra-small-v3-discriminator'
        self.model_id = 'monologg/koelectra-small-v3-discriminator'     
        # SageMaker Container environment        
        self.output_data_dir = f"{config.output_data_dir}"                                            
        self.model_dir = f"{config.model_dir}"                                       
        self.train_data_dir = f"{config.train_data_dir}"               
        self.checkpoint_dir = f"{config.checkpoint_dir}"                                               
        self.is_evaluation = config.is_evaluation                               
        self.is_test = True
        self.test_data_dir = f"{config.test_data_dir}"                               
        self.eval_ratio = config.eval_ratio                                       
        self.use_subset_train_sampler = config.use_subset_train_sampler                                                       
        self.log_interval = 50
        self.n_gpus = 1                        
        self.seed = 100
                        
script_args = ParamsScript()
print("# of epochs: ", script_args.epochs)

# of epochs:  1
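
`ParamsScript` mimics the arguments a training script would receive when launched as a job. A rough sketch of the matching `argparse` setup is shown below, covering only a few of the fields. The function name `parse_script_args` is hypothetical. `SM_MODEL_DIR`, `SM_CHANNEL_TRAIN`, and `SM_OUTPUT_DATA_DIR` are environment variables set inside SageMaker training containers, and `SM_CHANNEL_TRAIN` assumes the input channel is named `train`.

```python
import argparse
import os


def parse_script_args(argv=None):
    """Parse training arguments, falling back to local defaults (sketch)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--eval_batch_size", type=int, default=128)
    parser.add_argument("--learning_rate", type=float, default=5e-5)
    parser.add_argument("--seed", type=int, default=100)
    # Inside a SageMaker container these paths come from environment variables;
    # the fallbacks are local folders for running outside a container.
    parser.add_argument("--model_dir", default=os.environ.get("SM_MODEL_DIR", "models/nsmc"))
    parser.add_argument("--train_data_dir", default=os.environ.get("SM_CHANNEL_TRAIN", "data/nsmc/train"))
    parser.add_argument("--output_data_dir", default=os.environ.get("SM_OUTPUT_DATA_DIR", "output/nsmc"))
    return parser.parse_args(argv)


demo_args = parse_script_args(["--epochs", "3"])
```

Keeping the attribute names identical to the `ParamsScript` fields lets the same `train(args)` function run from the notebook and from a script entry point.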


In [26]:
%%time 
from train_lib import train
train(script_args)

##### Args: 
 {'epochs': 1, 'train_batch_size': 32, 'eval_batch_size': 128, 'learning_rate': 5e-05, 'warmup_steps': 0, 'fp16': True, 'tokenizer_id': 'monologg/koelectra-small-v3-discriminator', 'model_id': 'monologg/koelectra-small-v3-discriminator', 'output_data_dir': 'output/nsmc', 'model_dir': 'models/nsmc', 'train_data_dir': 'data/nsmc/train', 'checkpoint_dir': 'checkpoint/nsmc', 'is_evaluation': True, 'is_test': True, 'test_data_dir': 'data/nsmc/test', 'eval_ratio': 0.2, 'use_subset_train_sampler': True, 'log_interval': 50, 'n_gpus': 1, 'seed': 100}
device: cuda
train_data_filenames ['data/nsmc/train/ratings_train.txt']
train_file_path data/nsmc/train/ratings_train.txt
len: 149552 
Sample: ['흠   포스터보고 초딩영화줄    오버연기조차 가볍지 않구나', '너무재밓었다그래서보는것을추천한다', '교도소 이야기구먼   솔직히 재미는 없다  평점 조정', '사이몬페그의 익살스런 연기가 돋보였던 영화 스파이더맨에서 늙어보이기만 했던 커스틴 던스트가 너무나도 이뻐보였다', '막 걸음마 뗀 세부터 초등학교 학년생인 살용영화 ㅋㅋㅋ   별반개도 아까움']
len: 149552 
Sample: [1, 0, 0, 1, 0]
size of train_dataset : 119641
dataset size with frac: 0.

Some weights of the model checkpoint at monologg/koelectra-small-v3-discriminator were not used when initializing ElectraForSequenceClassification: ['discriminator_predictions.dense_prediction.bias', 'discriminator_predictions.dense.weight', 'discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense.bias']
- This IS expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at monologg/koelectra-small-v3-discriminator and are newly initialized

The time elapse of epoch 000 is: 00: 00: 04
size of val_dataset : 29911
dataset size with frac: 1 ==> 29911
Train Epoch: 0 Acc=0.551960;
the model is saved at models/nsmc/sentimental-electro-hf.pth
size of test_dataset : 49832
dataset size with frac: 1 ==> 49832
Test Accuracy: Acc=0.551227;
CPU times: user 1min 36s, sys: 6.78 s, total: 1min 43s
Wall time: 1min 43s


## 5.3. Fine-tuning with Trainer

In [27]:
%%time

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=1,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=100,
)

model = ElectraForSequenceClassification.from_pretrained(
    model_id, num_labels=num_labels, label2id=label2id, id2label=id2label
)

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset             # evaluation dataset
)

trainer.train()

Some weights of the model checkpoint at monologg/koelectra-small-v3-discriminator were not used when initializing ElectraForSequenceClassification: ['discriminator_predictions.dense_prediction.bias', 'discriminator_predictions.dense.weight', 'discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense.bias']
- This IS expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at monologg/koelectra-small-v3-discriminator and are newly initialized

Step,Training Loss
100,0.6934
200,0.6797
300,0.5198
400,0.4413
500,0.4141
600,0.3945
700,0.377
800,0.3738
900,0.3452
1000,0.3455


Saving model checkpoint to ./results/checkpoint-500
Configuration saved in ./results/checkpoint-500/config.json
Model weights saved in ./results/checkpoint-500/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-1000
Configuration saved in ./results/checkpoint-1000/config.json
Model weights saved in ./results/checkpoint-1000/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-1500
Configuration saved in ./results/checkpoint-1500/config.json
Model weights saved in ./results/checkpoint-1500/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)




CPU times: user 9min 10s, sys: 33.7 s, total: 9min 44s
Wall time: 5min 16s


TrainOutput(global_step=1870, training_loss=0.40147568299808606, metrics={'train_runtime': 315.7906, 'train_samples_per_second': 378.862, 'train_steps_per_second': 5.922, 'total_flos': 845575811718948.0, 'train_loss': 0.40147568299808606, 'epoch': 1.0})
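
The `TrainOutput` reports `global_step=1870` for one epoch over the 119,641 training examples. That count matches an effective batch size of 64, which would follow from `per_device_train_batch_size=16` on four devices. The device count is an inference from these numbers; the notebook does not state it. The arithmetic can be checked as follows.

```python
import math

train_examples = 119641   # len(train_dataset) logged in section 4.1
per_device_batch = 16     # per_device_train_batch_size
assumed_devices = 4       # assumption: not stated anywhere in the notebook
warmup_steps = 500        # warmup_steps passed to TrainingArguments

effective_batch = per_device_batch * assumed_devices
steps_per_epoch = math.ceil(train_examples / effective_batch)
warmup_share = warmup_steps / steps_per_epoch
```

Under that assumption the 500 warmup steps cover a little over a quarter of the single epoch. The checkpoints at steps 500, 1000, and 1500 are consistent with the `TrainingArguments` default of `save_steps=500`.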