<a href="https://colab.research.google.com/github/haeuni7/coding-practice/blob/main/baseline_shopping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intent Tagging for Shopping Customer-Service Dialogue Sentences

- 2nd mock competition (2022-11-28 to 2022-12-25)
- Text classification task

## Directory Structure
```
$ Shopping/
├── DATA/
│    ├── train/
│    │    ├── texts/
│    │    │    ├── shopping1_0001.txt
│    │    │    ├── ...
│    │    │    └── shopping7_2536.txt
│    │    └── labels/
│    │          ├── shopping1_0001.json
│    │          ├── ...
│    │          └── shopping7_2536.json
│    ├── test/
│    │    └── texts/
│    │          ├── test_0001.txt
│    │          ├── ...
│    │          └── test_1456.txt
│    └── sample_submission.csv
│
└── baseline/
      ├── config/
      │    ├── train.yaml
      │    └── predict.yaml
      ├── modules/
      │    ├── earlystopper.py
      │    ├── losses.py
      │    ├── recorders.py
      │    └── utils.py
      ├── results/
      │    └── train/
      ├── wandb/ # created after wandb is first used
      ├── baseline_shopping.ipynb
      └── submission.csv (inference file generated after running the code)

```
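- config : yaml files that record the parameters needed for training and inference
- modules
    - earlystopper.py : stops training when the loss has not improved for a set number of epochs
    - losses.py : returns the loss function specified in the config
    - recorders.py : records the log, learning curve, best model.pt, etc.
    - utils.py : helper functions for loading and saving files in various formats
- results
    - train/ : directory where training and inference logs are recorded
- baseline_shopping.ipynb : code that runs everything from preprocessing through training and inference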

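## How to Run the Notebook
- This baseline supports experiment tracking with [Wandb](https://wandb.ai/site).
- Edit `config/train.yaml` before running the baseline code.
- Edit `config/predict.yaml` before running inference.
- Training results are saved to `results/train/train_serial` (`train_serial` is the timestamp-based ID of the run), and the training curves can be viewed in wandb.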

# 0. Preparation

### Mount Google Drive

In [None]:
# Sign in with your Google account to use Google Colaboratory (this mounts Google Drive).
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import os
os.chdir('/content/drive/MyDrive/Shopping/')

### Install Libraries

In [None]:
!pip install transformers
!pip install evaluate
!pip install omegaconf
!pip install wandb

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


# 1. Import

In [None]:
import os
import json
import wandb
import shutil
import random
import evaluate
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from tqdm import tqdm
from pathlib import Path
from omegaconf import OmegaConf
from sklearn.metrics import f1_score
from datetime import datetime, timezone, timedelta
from sklearn.model_selection import train_test_split

import torch
import torch.nn as nn
import torch.nn.functional as F

from torch.utils import *
from torch.optim import Adam
from torch.utils.data import DataLoader

from transformers import logging, get_linear_schedule_with_warmup
from transformers import ( 
    BertConfig,
    ElectraConfig
)


# Add the model libraries to use in experiments
from transformers import (
    BertTokenizer,
    BertTokenizerFast,
    AutoTokenizer,
    ElectraTokenizer,
    AlbertTokenizer,
    RobertaTokenizerFast
)
from transformers import (
    BertModel,
    AutoModel, 
    ElectraForSequenceClassification,
    BertForSequenceClassification,
    AlbertForSequenceClassification,
    RobertaForSequenceClassification,
    TextClassificationPipeline
)

# 2. Experiment Setup
### Train Configuration

In [None]:
# Project directory
PROJECT_DIR = os.getcwd()
print('PROJECT_DIR :',PROJECT_DIR)

# Load train config
## !!! Re-run this cell every time you modify the config !!!
config_path = os.path.join(PROJECT_DIR, 'baseline', 'config', 'train.yaml')
config = OmegaConf.load(config_path)

PROJECT_DIR : /content/drive/MyDrive/Shopping


### Experiment Logging and Path Setup

In [None]:
# Train Serial
kst = timezone(timedelta(hours=9))
train_serial = datetime.now(tz=kst).strftime("%Y%m%d_%H%M%S")


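# Recorder directory
# Create a fresh directory unless resuming (`not` also covers `resume: false`, which `== None` did not)
if not config.train.resume: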
    RECORDER_DIR = os.path.join(PROJECT_DIR, 'results', 'train', train_serial)
else:
    RECORDER_DIR = os.path.join(PROJECT_DIR, config.train.resume_weight_dir)
    
os.makedirs(RECORDER_DIR, exist_ok=True)
print("Results will be found here : ", RECORDER_DIR)

# Data Directory
DATA_DIR = Path(config.train.dataset_path)
TRAIN_DIR = DATA_DIR/'train/'
TEST_DIR = DATA_DIR/'test/'
SAMPLE_DIR = os.path.join(DATA_DIR,'sample_submission.csv')

print('DATA_DIR :',DATA_DIR)
print('TRAIN_DIR :',TRAIN_DIR)
print('TEST_DIR :',TEST_DIR)
print('SAMPLE_DIR :',SAMPLE_DIR)

Results will be found here :  /content/drive/MyDrive/Shopping/results/train/20221019_180402
DATA_DIR : /content/drive/MyDrive/Shopping/DATA
TRAIN_DIR : /content/drive/MyDrive/Shopping/DATA/train
TEST_DIR : /content/drive/MyDrive/Shopping/DATA/test
SAMPLE_DIR : /content/drive/MyDrive/Shopping/DATA/sample_submission.csv


### Fix the Seed and Select the GPU

In [None]:
# Set seeds
random.seed(config.train.seed)
np.random.seed(config.train.seed)
torch.manual_seed(config.train.seed)
torch.cuda.manual_seed_all(config.train.seed)

torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True

# Select the GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

### Set logger

In [None]:
os.chdir('/content/drive/MyDrive/Shopping/baseline')
from modules.utils import get_logger

logger = get_logger(name='train', dir_=RECORDER_DIR, stream=False)
logger.info(f"Set Logger {RECORDER_DIR}")
logger.info(OmegaConf.to_yaml(config))

INFO:train:Set Logger /content/drive/MyDrive/Shopping/results/train/20221019_180402
INFO:train:train:
  dataset_path: /content/drive/MyDrive/Shopping/DATA
  num_classes: 14
  max_seq_len: 128
  train_batch_size: 32
  eval_batch_size: 32
  num_epochs: 5
  metric: f1
  seed: 777
  early_stopping_patience: 2
  resume: true
  resume_weight_dir: results/train/20221019_180402
  save_strategy: epoch
  save_steps: 100
  eval_steps: 500
  evaluation_strategy: epoch
  fp16: true
  num_workers: 1
  use_cuda: true
  gpus: '1'
  optimizer: adam
  adam_epsilon: 1.0e-08
  warmup_proportion: 0.04
  learning_rate: 0.0001
  weight_decay: 0.01
  wandb: true
  logging_strategy: epoch
  logging_steps: 200
model:
  pretrained_model: klue/roberta-large
  architecture: RobertaForSequenceClassification
  tokenizer_class: BertTokenizer



# 3. EDA and Data Preprocessing
### Visualize as a DataFrame

In [None]:
def make_df(data_dir:Path, test=False):
    sentence_id=[]     # file name + '.txt-' + position of the sentence within the text
    speaker = []       # speaker
    text = []          # utterance text
    speechAct = []     # label describing what kind of utterance the sentence is
    
    for directory in data_dir.iterdir():
        for file in directory.iterdir():
            if 'checkpoint' in str(file): continue
            if not test:
                # Only the label files parse as JSON; anything else is skipped
                try:
                    with open(file) as f:
                        label = json.load(f)
                except (OSError, ValueError):
                    continue
                lines = label['info'][0]['annotations']['lines']
                for i, l in enumerate(lines):
                    # Split at the first '.' only, so periods inside the utterance are kept
                    parts = l['text'].split('.', 1)
                    sentence_id.append(file.stem+'.txt-'+str(i)) # sentence id
                    speaker.append(parts[0]) # speaker
                    text.append(parts[-1]) # text without the speaker prefix
                    speechAct.append(l['speechAct'])
            elif file.match("*.txt"):
                try:
                    with open(file, "r") as f:
                        lines = f.read().splitlines()
                except (OSError, ValueError):
                    continue

                for i, l in enumerate(lines):
                    parts = l.split('.', 1)
                    sentence_id.append(file.stem+'.txt-'+str(i))
                    speaker.append(parts[0]) # speaker
                    text.append(parts[-1]) # text without the speaker prefix
    if not test:
        df = pd.DataFrame({'sentence_id': sentence_id, 'speaker':speaker,'text': text, 'speechAct':speechAct})
    else:
        df = pd.DataFrame({'sentence_id': sentence_id, 'speaker':speaker,'text': text})

    return df
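
The speaker handling in `make_df` can be illustrated in isolation. This is a minimal sketch, assuming each line has the form `<speaker>.<utterance>`; `split_speaker` is a hypothetical helper and not part of the baseline. It shows why splitting at the first period only matters: taking the last piece of `split('.')` would drop everything before the final period of the utterance.

```python
def split_speaker(line):
    """Split '<speaker>.<utterance>' at the first period only."""
    speaker, sep, utterance = line.partition(".")
    if not sep:
        # No period at all: treat the whole line as the utterance
        return "", line
    return speaker, utterance


line = "A.Hello. How can I help you"
print(split_speaker(line))        # speaker and full utterance
print(line.split(".")[-1])        # only the part after the last period
```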

In [None]:
# test_df = make_df(TEST_DIR, test=True)
# test_df['order']=test_df['sentence_id'].str.split('-')

# for i in range(len(test_df)):
#     test_df['order'][i][1]=int(test_df['order'][i][1])
# test_df.sort_values(by='order',inplace=True)
# test_df.reset_index(inplace=True)
# test_df.drop(['index', 'speaker', 'order'], axis=1, inplace=True)
# test_df

In [None]:
# test_df.drop(['index', 'speaker'], axis=1)

In [None]:
train_df = make_df(TRAIN_DIR)

In [None]:
train_df

Unnamed: 0,sentence_id,speaker,text,speechAct
0,shopping1_2424.txt-0,A,반갑습니다 상담사 #@이름#입니다,인사하기
1,shopping1_2424.txt-1,B,세탁기 산 지가 아직 일 년이 안 됐는데 고장이 났어요,진술하기
2,shopping1_2424.txt-2,A,고객님 세탁기의 종류는 무엇인가요,질문하기
3,shopping1_2424.txt-3,B,지금 일반 통돌이 쓰고 있어요,진술하기
4,shopping1_2424.txt-4,A,어떤 것이 고장이 났나요,질문하기
...,...,...,...,...
189778,shopping3_1565.txt-6,B,뭐 따로 해야 할 절차가 있을까요,질문하기
189779,shopping3_1565.txt-7,A,이 기능을 이용하기 위해서는요,진술하기
189780,shopping3_1565.txt-8,A,사용할 계좌 인증 및 자동이체 출금 동의 절차를 완료해야 합니다,진술하기
189781,shopping3_1565.txt-9,B,네 그렇군요 알겠습니다,진술하기


In [None]:
# Dictionary that encodes labels (used for training)
label2encoding = {Act: ind for ind, Act in enumerate(train_df['speechAct'].unique())}
# Dictionary that decodes encoded labels (used for inference)
encoding2label = {idx: Act for idx, Act in enumerate(train_df['speechAct'].unique())}
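
One caveat: `unique()` returns labels in order of first appearance, and that order depends on the order in which the files were read. If training and inference run in separate sessions, the ids could decode to different labels. A minimal sketch of a reproducible mapping, assuming it is acceptable to sort the label names; `build_label_maps` and the placeholder label names are hypothetical, not part of the baseline:

```python
import json


def build_label_maps(labels):
    """Build id mappings from sorted label names so the ids do not depend on read order."""
    classes = sorted(set(labels))
    label2id = {label: i for i, label in enumerate(classes)}
    id2label = {i: label for label, i in label2id.items()}
    return label2id, id2label


labels = ["greeting", "statement", "question", "statement"]  # placeholder label names
label2id, id2label = build_label_maps(labels)

# JSON turns integer keys into strings, so convert them back after loading
restored = {int(k): v for k, v in json.loads(json.dumps(id2label)).items()}
print(label2id)
print(restored == id2label)
```

Saving the mapping next to the model weights would let a later inference session reuse exactly the ids seen during training.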

In [None]:
# Check whether any text is longer than max_seq_len (128); note this compares character counts, not token counts
train_df['text'][train_df['text'].str.len()>config.train.max_seq_len]

Series([], Name: text, dtype: object)

### Tokenizer

In [None]:
# Map the tokenizer class name given in the config to the corresponding class

TOKENIZER_CLASSES = {
    "BertTokenizer": BertTokenizer,
    "BertTokenizerFast": BertTokenizerFast,
    "AutoTokenizer": AutoTokenizer,
    "ElectraTokenizer": ElectraTokenizer,
    "AlbertTokenizer": AlbertTokenizer,
    "RobertaTokenizerFast" : RobertaTokenizerFast
}
TOKENIZER = TOKENIZER_CLASSES[config.model.tokenizer_class].from_pretrained(config.model.pretrained_model)

In [None]:
# Tokenizer example
comment_ex = train_df['text'][0]
print(TOKENIZER(comment_ex))
print(TOKENIZER.encode(comment_ex),"\n")

# Split into tokens
print(TOKENIZER.tokenize(comment_ex),"\n")

# Map tokens to token ids
print(TOKENIZER.convert_tokens_to_ids(TOKENIZER.tokenize(comment_ex)))

{'input_ids': [0, 9927, 2219, 3606, 4981, 2063, 7, 36, 3934, 7, 3714, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
[0, 9927, 2219, 3606, 4981, 2063, 7, 36, 3934, 7, 3714, 2] 

['반갑', '##습', '##니다', '상담', '##사', '#', '@', '이름', '#', '입니다'] 

[9927, 2219, 3606, 4981, 2063, 7, 36, 3934, 7, 3714]


# 4. Dataset

### Define the CustomDataset Class

In [None]:
class CustomDataset(torch.utils.data.Dataset):  # Subclass Dataset and customize it to convert the data into model inputs

    def __init__(self, df, tokenizer, max_seq_len, mode = 'train'):  # A Dataset must define __init__, __len__, and __getitem__

        self.data = df
        self.tokenizer = tokenizer
        self.max_seq_len = max_seq_len
        self.mode = mode
        self.speakers = self.data['speaker'].tolist()
        self.texts = self.data['text'].tolist()
        
        if self.mode!='test':
            # Labels are required for training and validation
            if 'speechAct' not in df.columns:
                raise KeyError("CustomDataset Error : 'speechAct' column does not exist in the dataframe")
            self.labels = df['speechAct'].tolist()
     
    def __len__(self):  # Returns the number of samples, which must be known before inputs can be read by index
        return len(self.data)
                

    def __getitem__(self, idx):  # Returns the input data for the given index
        """
        Tokenize the utterance text at index idx and return it as a dictionary
        of 'input_ids', 'attention_mask' and 'token_type_ids' (plus 'labels' outside test mode)
        """
        tokenized_text = self.tokenizer(self.texts[idx],
                             padding= 'max_length',
                             max_length=self.max_seq_len,
                             truncation=True,
                             return_token_type_ids=True,
                             return_attention_mask=True,
                             return_tensors = "pt")

        if self.mode=='test':
            data = {'input_ids': tokenized_text['input_ids'].squeeze(0).long(),
                   'attention_mask': tokenized_text['attention_mask'].squeeze(0).long(),
                   'token_type_ids': tokenized_text['token_type_ids'].squeeze(0).long()
                    }
        else:
            data = {'input_ids': tokenized_text['input_ids'].squeeze(0).long(),
                   'attention_mask': tokenized_text['attention_mask'].squeeze(0).long(),
                   'token_type_ids': tokenized_text['token_type_ids'].squeeze(0).long(),
                    'labels': label2encoding[self.labels[idx]]
                    }


        return data
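
The effect of `padding='max_length'` and `truncation=True` can be sketched without a tokenizer. This is a simplified illustration only: `pad_or_truncate` and the default `pad_id=0` are hypothetical, and a real tokenizer also preserves the closing special token when truncating and uses its own pad token id.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_len long."""
    ids = list(token_ids[:max_len])          # truncate if too long
    mask = [1] * len(ids)                    # 1 marks real tokens
    n_pad = max_len - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad   # 0 marks padding


ids, mask = pad_or_truncate([0, 9927, 2219, 2], max_len=6)
print(ids)
print(mask)
```

Because every sample comes out with the same length, the default batching in `DataLoader` can stack them into tensors without a custom collate function.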

### Split into Train and Validation Sets

In [None]:
train_data, val_data = train_test_split(train_df, test_size=0.2, random_state=config.train.seed)

train_dataset = CustomDataset(train_data, TOKENIZER, config.train.max_seq_len, 'train')
val_dataset = CustomDataset(val_data, TOKENIZER, config.train.max_seq_len, 'validation')

print("Train dataset: ", len(train_dataset))
print("Validation dataset: ", len(val_dataset))

Train dataset:  151826
Validation dataset:  37957


### DataLoader

In [None]:
# torch.utils.data.DataLoader : returns the inputs in batches

train_dataloader = DataLoader(dataset=train_dataset,
                                batch_size=config.train.train_batch_size,
                                num_workers=config.train.num_workers, 
                                shuffle=True,
                                pin_memory=True,
                                drop_last=False)

val_dataloader = DataLoader(dataset=val_dataset,
                            batch_size=config.train.eval_batch_size,
                            num_workers=config.train.num_workers, 
                            shuffle=False,
                            pin_memory=True,
                            drop_last=False)


# 5. Model Declaration

In [None]:
from transformers import logging
logging.set_verbosity_error()

# Select the base model according to the architecture given in the config
BASE_MODELS = {
    "BertForSequenceClassification": BertForSequenceClassification,
    "AutoModel": AutoModel,
    "ElectraForSequenceClassification": ElectraForSequenceClassification,
    "AlbertForSequenceClassification": AlbertForSequenceClassification,
    "RobertaForSequenceClassification": RobertaForSequenceClassification
}

model = BASE_MODELS[config.model.architecture].from_pretrained(config.model.pretrained_model, num_labels = config.train.num_classes)
# print(model)

In [None]:
# # When resuming training, set resume_weight_dir in the config file
# if config.train.resume :
#     checkpoint = torch.load(config.train.resume_weight_dir)
#     model.load_state_dict(checkpoint['model'])

# 6. Optimizer

In [None]:
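# Note: this optimizer and scheduler are not passed to the Trainer declared below,
# so the Trainer builds its own optimizer and schedule from TrainingArguments.
optimizer = Adam(model.parameters(), lr=config.train.learning_rate, eps=config.train.adam_epsilon)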

total_steps = len(train_dataloader) * config.train.num_epochs 
lr_scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=int(total_steps * config.train.warmup_proportion), 
                                                num_training_steps=total_steps)

# 7. Metric

- [evaluate library documentation](https://pypi.org/project/evaluate/)
- [How to use each metric](https://huggingface.co/evaluate-metric?sort_spaces=alphabetical#spaces)

In [None]:
# To see the list of available metrics, uncomment the line below, then update the config file and reload it.
# evaluate.list_evaluation_modules()

def compute_metrics(eval_pred):
    metric_fn = evaluate.load(config.train.metric)
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    if config.train.metric=='f1':
        score = metric_fn.compute(predictions=predictions, references=labels, average='macro')
    else:
        score = metric_fn.compute(predictions=predictions, references=labels)
    return score
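
With `average='macro'`, F1 is computed per class and then averaged with equal weight, so rare classes count as much as frequent ones. A small pure-Python sketch of that calculation; `macro_f1` is a hypothetical helper written for illustration, and the toy labels are made up:

```python
def macro_f1(references, predictions):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(references) | set(predictions))
    scores = []
    for c in classes:
        pairs = list(zip(references, predictions))
        tp = sum(1 for r, p in pairs if r == c and p == c)
        fp = sum(1 for r, p in pairs if r != c and p == c)
        fn = sum(1 for r, p in pairs if r == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


refs = [0, 0, 0, 0, 1, 1, 2]
preds = [0, 0, 0, 0, 1, 0, 0]
# Per-class F1: class 0 = 0.8, class 1 = 2/3, class 2 = 0
print(macro_f1(refs, preds))
print(sum(r == p for r, p in zip(refs, preds)) / len(refs))  # accuracy, for comparison
```

In this toy case accuracy is 5/7 while macro F1 is below 0.5, because the class that is never predicted contributes a score of zero.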

# 8. Early stopper

In [None]:
from modules.earlystoppers import EarlyStopper

early_stopper = EarlyStopper(patience=config.train.early_stopping_patience,
                             mode='min',
                             logger=logger)

INFO:train:Initiated ealry stopper, mode: min, best score: inf, patience: 2
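
The code of `modules/earlystoppers.py` is not shown here, so the following is only a hypothetical sketch of patience-based early stopping in `min` mode, not the module's actual implementation. The class name `PatienceStopper` is made up, and the scores are the `eval_loss` values from this notebook's training log, rounded to four decimals.

```python
class PatienceStopper:
    """Signal a stop after `patience` consecutive evaluations without improvement."""

    def __init__(self, patience, mode="min"):
        self.patience = patience
        self.mode = mode
        self.best = float("inf") if mode == "min" else float("-inf")
        self.bad_epochs = 0
        self.stop = False

    def step(self, score):
        improved = score < self.best if self.mode == "min" else score > self.best
        if improved:
            self.best = score
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            self.stop = self.bad_epochs >= self.patience
        return self.stop


stopper = PatienceStopper(patience=2, mode="min")
eval_losses = [0.5346, 0.5170, 0.5128, 0.5286, 0.5589]
decisions = [stopper.step(loss) for loss in eval_losses]
print(decisions, stopper.best)
```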


# 9. Recorder

In [None]:
from modules.recorders import Recorder

recorder = Recorder(record_dir=RECORDER_DIR,
                    model=model,
                    optimizer=optimizer,
                    scheduler=None,
                    amp=None,
                    logger=logger)

In [None]:
# Wandb
# For how to use Wandb, see the lecture given by mentor 곽대훈 (Kwak Dae-hoon) on November 21!
if config.train.wandb:
    
    wandb_project_serial = 'shopping_conversation'
    os.environ["WANDB_NOTEBOOK_NAME"] = wandb_project_serial
    wandb_username = 'cheche7' # Change this to your own account.
    wandb.init(project=wandb_project_serial, dir=RECORDER_DIR, entity=wandb_username)
    wandb.run.name = train_serial
    wandb.config.update(config)
    wandb.watch(model)


[34m[1mwandb[0m: Currently logged in as: [33mcheche7[0m. Use [1m`wandb login --relogin`[0m to force relogin


# 10. Define the Trainer Class

### Training Parameter Setup

- [transformers Trainer documentation](https://huggingface.co/docs/transformers/main_classes/trainer)
- [transformers TrainingArguments parameter documentation](https://huggingface.co/docs/transformers/v4.23.1/en/main_classes/trainer#transformers.TrainingArguments) : This page describes each TrainingArguments parameter. Use it to learn what each one means, then adjust or add parameters.

In [None]:
# The Trainer from transformers is used for training; the parameters it needs are specified here in advance

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir= RECORDER_DIR,  # output directory
    num_train_epochs=config.train.num_epochs,  # total number of training epochs
    resume_from_checkpoint = config.train.resume_weight_dir,  # not used by Trainer itself; to resume, pass it to trainer.train(resume_from_checkpoint=...)
    per_device_train_batch_size=config.train.train_batch_size,  # batch size per device during training
    per_device_eval_batch_size=config.train.eval_batch_size,  # batch size for evaluation
    learning_rate = config.train.learning_rate,
    warmup_ratio = config.train.warmup_proportion,
    weight_decay=config.train.weight_decay,
    logging_strategy = config.train.logging_strategy,
    # logging_steps=config.train.logging_steps, # valid when logging_strategy ='steps'
    save_strategy=config.train.save_strategy,
    # save_steps=config.train.save_steps, # valid when save_strategy ='steps'
    # eval_steps=config.train.eval_steps, # valid when evaluation_strategy ='steps'
    do_train=True, # Perform training
    do_eval=True, # Perform evaluation
    evaluation_strategy = config.train.evaluation_strategy, # evaluate after each epoch
    gradient_accumulation_steps = 64,  # number of batches to accumulate gradients over before each optimizer update
    fp16 = config.train.fp16, # Use mixed precision
    run_name = train_serial, # experiment name, typically used for wandb
    report_to="wandb",  # log to wandb (use "none" to disable)
    load_best_model_at_end = True,  # the best checkpoint is chosen by eval loss unless metric_for_best_model is set
    overwrite_output_dir=True,
    save_total_limit=3,
    seed= config.train.seed  # Seed for experiment reproducibility
)
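
The step counts and learning rates reported in the training log follow from these arguments. A rough sketch of the arithmetic, assuming a single GPU, the 151,826 training examples from the split above, and the values logged from `train.yaml` (batch size 32, 5 epochs, learning rate 1e-4, warmup proportion 0.04). The rounding rules used here (floor for updates per epoch, ceil for warmup steps) are assumptions about the Trainer's behavior that happen to reproduce the logged numbers:

```python
import math

num_examples = 151826      # size of the training split
per_device_batch = 32      # train_batch_size
grad_accum = 64            # gradient_accumulation_steps
epochs = 5
base_lr = 1e-4
warmup_ratio = 0.04

batches_per_epoch = math.ceil(num_examples / per_device_batch)  # 4745
updates_per_epoch = batches_per_epoch // grad_accum             # 74
total_steps = updates_per_epoch * epochs                        # 370
warmup_steps = math.ceil(total_steps * warmup_ratio)            # 15
effective_batch = per_device_batch * grad_accum                 # 2048


def linear_lr(step):
    """Linear warmup to base_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * (total_steps - step) / (total_steps - warmup_steps)


print(total_steps, effective_batch)
print(linear_lr(updates_per_epoch))   # learning rate at the end of epoch 1
```

With only 370 optimizer updates in total, each update averages gradients over 2,048 samples, which is worth keeping in mind when tuning the learning rate.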

## Training

In [None]:
class Run:

    def __init__(self, trainer, tokenizer, training_args, test=None, submission_name = None):
        self.trainer = trainer
        self.tokenizer = tokenizer
        self.training_args = training_args
        self.test = test
        self.submission_name = submission_name

    def __call__(self):
        if self.training_args.do_train:
            self.train()

        if self.training_args.do_eval:
            self.validate()

        if self.training_args.do_predict and self.test is not None:
            self.predict()

    def train(self):
        self.trainer.train()
        self.trainer.save_model()
        if self.trainer.is_world_process_zero():
            self.tokenizer.save_pretrained(RECORDER_DIR)

    def validate(self):
        logger.info("*** Evaluate ***")
        result = self.trainer.evaluate()
        output_eval_file = os.path.join(RECORDER_DIR, "eval_results.txt")
        if self.trainer.is_world_process_zero():
            with open(output_eval_file, "w") as writer:
                logger.info("***** Eval results *****")
                for key, value in result.items():
                    logger.info("%s = %s", key, value)
                    writer.write("%s = %s\n" % (key, value))

        logger.info("Validation set result : {}".format(result))

    # Inference function
    def predict(self):
        logger.info("*** Test ***")
        predictions = self.trainer.predict(test_dataset=self.test)
        # test_df is the module-level DataFrame built in the inference section; its row order
        # (file name, then sentence index) is kept as-is in the submission file
        test_df['speechAct'] = self.prediction(predictions.predictions)
        test_df.to_csv(self.submission_name, index=False)
        print("Submission file saved as :", self.submission_name)

    def prediction(self,logit):
        labels = np.argmax(logit, axis=1)
        return list(map(lambda x: encoding2label[x], labels))

# 11. Declare the Trainer

In [None]:
# Trainer
from transformers import Trainer
from transformers import EarlyStoppingCallback

# Set trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=config.train.early_stopping_patience)]
)

# Set runner
train = Run(
    training_args=training_args,
    trainer=trainer,
    tokenizer=TOKENIZER,
    )

Using cuda_amp half precision backend


# 12. Training

In [None]:
train()

***** Running training *****
  Num examples = 151826
  Num Epochs = 5
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 2048
  Gradient Accumulation steps = 64
  Total optimization steps = 370
  Number of trainable parameters = 336670734
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
***** Running Evaluation *****
  Num examples = 37957
  Batch size = 32


{'loss': 0.8224, 'learning_rate': 8.338028169014085e-05, 'epoch': 1.0}


Saving model checkpoint to /content/drive/MyDrive/Shopping/results/train/20221019_180402/checkpoint-74
Configuration saved in /content/drive/MyDrive/Shopping/results/train/20221019_180402/checkpoint-74/config.json


{'eval_loss': 0.534602165222168, 'eval_f1': 0.5187044945686227, 'eval_runtime': 218.9306, 'eval_samples_per_second': 173.375, 'eval_steps_per_second': 5.422, 'epoch': 1.0}


Model weights saved in /content/drive/MyDrive/Shopping/results/train/20221019_180402/checkpoint-74/pytorch_model.bin
Deleting older checkpoint [/content/drive/MyDrive/Shopping/results/train/20221019_180402/checkpoint-222] due to args.save_total_limit
***** Running Evaluation *****
  Num examples = 37957
  Batch size = 32


{'loss': 0.5148, 'learning_rate': 6.253521126760565e-05, 'epoch': 2.0}


Saving model checkpoint to /content/drive/MyDrive/Shopping/results/train/20221019_180402/checkpoint-148
Configuration saved in /content/drive/MyDrive/Shopping/results/train/20221019_180402/checkpoint-148/config.json


{'eval_loss': 0.5170103311538696, 'eval_f1': 0.5493392387885896, 'eval_runtime': 218.8314, 'eval_samples_per_second': 173.453, 'eval_steps_per_second': 5.424, 'epoch': 2.0}


Model weights saved in /content/drive/MyDrive/Shopping/results/train/20221019_180402/checkpoint-148/pytorch_model.bin
Deleting older checkpoint [/content/drive/MyDrive/Shopping/results/train/20221019_180402/checkpoint-296] due to args.save_total_limit
***** Running Evaluation *****
  Num examples = 37957
  Batch size = 32


{'loss': 0.4655, 'learning_rate': 4.1690140845070425e-05, 'epoch': 3.0}


Saving model checkpoint to /content/drive/MyDrive/Shopping/results/train/20221019_180402/checkpoint-222
Configuration saved in /content/drive/MyDrive/Shopping/results/train/20221019_180402/checkpoint-222/config.json


{'eval_loss': 0.5127819180488586, 'eval_f1': 0.5320473912790674, 'eval_runtime': 218.9583, 'eval_samples_per_second': 173.353, 'eval_steps_per_second': 5.421, 'epoch': 3.0}


Model weights saved in /content/drive/MyDrive/Shopping/results/train/20221019_180402/checkpoint-222/pytorch_model.bin
Deleting older checkpoint [/content/drive/MyDrive/Shopping/results/train/20221019_180402/checkpoint-370] due to args.save_total_limit
***** Running Evaluation *****
  Num examples = 37957
  Batch size = 32


{'loss': 0.4153, 'learning_rate': 2.0845070422535212e-05, 'epoch': 4.0}


Saving model checkpoint to /content/drive/MyDrive/Shopping/results/train/20221019_180402/checkpoint-296
Configuration saved in /content/drive/MyDrive/Shopping/results/train/20221019_180402/checkpoint-296/config.json


{'eval_loss': 0.5286005735397339, 'eval_f1': 0.562328156642362, 'eval_runtime': 219.0053, 'eval_samples_per_second': 173.315, 'eval_steps_per_second': 5.42, 'epoch': 4.0}


Model weights saved in /content/drive/MyDrive/Shopping/results/train/20221019_180402/checkpoint-296/pytorch_model.bin
Deleting older checkpoint [/content/drive/MyDrive/Shopping/results/train/20221019_180402/checkpoint-74] due to args.save_total_limit
***** Running Evaluation *****
  Num examples = 37957
  Batch size = 32


{'loss': 0.3584, 'learning_rate': 0.0, 'epoch': 5.0}


Saving model checkpoint to /content/drive/MyDrive/Shopping/results/train/20221019_180402/checkpoint-370
Configuration saved in /content/drive/MyDrive/Shopping/results/train/20221019_180402/checkpoint-370/config.json


{'eval_loss': 0.5588827133178711, 'eval_f1': 0.5605111398970889, 'eval_runtime': 218.8559, 'eval_samples_per_second': 173.434, 'eval_steps_per_second': 5.424, 'epoch': 5.0}


Model weights saved in /content/drive/MyDrive/Shopping/results/train/20221019_180402/checkpoint-370/pytorch_model.bin
Deleting older checkpoint [/content/drive/MyDrive/Shopping/results/train/20221019_180402/checkpoint-148] due to args.save_total_limit


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from /content/drive/MyDrive/Shopping/results/train/20221019_180402/checkpoint-222 (score: 0.5127819180488586).
Saving model checkpoint to /content/drive/MyDrive/Shopping/results/train/20221019_180402
Configuration saved in /content/drive/MyDrive/Shopping/results/train/20221019_180402/config.json


{'train_runtime': 14148.68, 'train_samples_per_second': 53.654, 'train_steps_per_second': 0.026, 'train_loss': 0.515282713400351, 'epoch': 5.0}


Model weights saved in /content/drive/MyDrive/Shopping/results/train/20221019_180402/pytorch_model.bin
tokenizer config file saved in /content/drive/MyDrive/Shopping/results/train/20221019_180402/tokenizer_config.json
Special tokens file saved in /content/drive/MyDrive/Shopping/results/train/20221019_180402/special_tokens_map.json
INFO:train:*** Evaluate ***
***** Running Evaluation *****
  Num examples = 37957
  Batch size = 32
INFO:train:***** Eval results *****
INFO:train:eval_loss = 0.5127819180488586
INFO:train:eval_f1 = 0.5320473912790674
INFO:train:eval_runtime = 219.7114
INFO:train:eval_samples_per_second = 172.758
INFO:train:eval_steps_per_second = 5.403
INFO:train:epoch = 5.0
INFO:train:Validation set result : {'eval_loss': 0.5127819180488586, 'eval_f1': 0.5320473912790674, 'eval_runtime': 219.7114, 'eval_samples_per_second': 172.758, 'eval_steps_per_second': 5.403, 'epoch': 5.0}


{'eval_loss': 0.5127819180488586, 'eval_f1': 0.5320473912790674, 'eval_runtime': 219.7114, 'eval_samples_per_second': 172.758, 'eval_steps_per_second': 5.403, 'epoch': 5.0}


# 13. Inference
### Inference Config Setup

In [None]:
# Load config
predict_config_path = os.path.join(PROJECT_DIR, 'baseline', 'config', 'predict.yaml')
predict_config = OmegaConf.load(predict_config_path)

### Create the Directory for Inference Logs

In [None]:
# Serial
if predict_config['predict']['train_serial'] is None:
    train_serial = os.path.basename(RECORDER_DIR)
else:
    train_serial = predict_config['predict']['train_serial']
    

# Predict directory
PREDICT_DIR = os.path.join(RECORDER_DIR, 'predict')
os.makedirs(PREDICT_DIR, exist_ok=True)

### Set the Data Path

In [None]:
# Data Directory
DATA_DIR = predict_config['predict']['dataset_path']

### Declare the Test Dataset and DataLoader


In [None]:
test_df = make_df(TEST_DIR, test=True)

# Sort by file name, then by the numeric sentence index (so '-10' comes after '-2')
order = test_df['sentence_id'].str.rsplit('-', n=1, expand=True)
test_df['file'] = order[0]
test_df['line'] = order[1].astype(int)
test_df = test_df.sort_values(by=['file', 'line']).reset_index(drop=True)
test_df = test_df.drop(columns=['file', 'line'])

test_dataset = CustomDataset(test_df, TOKENIZER, config.train.max_seq_len, 'test')
print("Test dataset: ", len(test_dataset))

test_dataloader = DataLoader(dataset=test_dataset,
                            batch_size=predict_config.predict.batch_size,
                            num_workers=predict_config.predict.num_workers,
                            shuffle=False,
                            pin_memory=True,
                            drop_last=False)

Test dataset:  20520
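
The reason the sentence index has to be compared as a number can be shown with plain Python. A minimal sketch, assuming ids of the form `<file>.txt-<index>` as in the table below; `sentence_sort_key` is a hypothetical helper:

```python
def sentence_sort_key(sentence_id):
    """Sort key: file name first, then the sentence index as an integer."""
    file_name, _, index = sentence_id.rpartition("-")
    return file_name, int(index)


ids = ["test_0001.txt-10", "test_0001.txt-2", "test_0002.txt-0", "test_0001.txt-0"]
print(sorted(ids))                          # string order puts '-10' before '-2'
print(sorted(ids, key=sentence_sort_key))   # numeric order within each file
```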




In [None]:
test_df

Unnamed: 0,sentence_id,speaker,text
0,test_0001.txt-0,A,안녕하십니까 #@소속# 에어컨 상담사 #@이름#입니다
1,test_0001.txt-1,A,제가 말씀 못 드린 부분이 있어서요
2,test_0001.txt-2,A,혹시 냉매가 부족한 경우에는 냉매보충비가 별도로 나올 수가 있습니다
3,test_0001.txt-3,B,가스 추가 말씀하시는 거죠
4,test_0001.txt-4,B,가스 추가는 보통 얼마나 하나요
...,...,...,...
20515,test_1456.txt-8,A,내일 언제쯤 연락을 드릴까요
20516,test_1456.txt-9,B,다섯시 이후에는 가능할 거 같습니다
20517,test_1456.txt-10,A,네 그럼 내일 다섯시 반에 다시 연락드리겠습니다 고객님
20518,test_1456.txt-11,B,네 감사합니다 수고하세요


In [None]:
model = BASE_MODELS[config.model.architecture].from_pretrained(RECORDER_DIR)

predicting_args = TrainingArguments(
        run_name=train_serial,
        disable_tqdm=False,
        per_device_eval_batch_size = predict_config.predict.batch_size,
        fp16=config.train.fp16,
        gradient_accumulation_steps=64,
        do_train=False,
        do_eval=False,
        do_predict=True,
        output_dir='.',
    )

trainer_prediction = Trainer(
    model= model,
    args=predicting_args
)

predict = Run(
    training_args=predicting_args,
    trainer=trainer_prediction,
    tokenizer=TOKENIZER,
    test=test_dataset,
    submission_name = predict_config.predict.submission_name
    )

loading configuration file /content/drive/MyDrive/Shopping/results/train/20221019_180402/config.json
Model config RobertaConfig {
  "_name_or_path": "klue/roberta-large",
  "architectures": [
    "RobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6",
    "7": "LABEL_7",
    "8": "LABEL_8",
    "9": "LABEL_9",
    "10": "LABEL_10",
    "11": "LABEL_11",
    "12": "LABEL_12",
    "13": "LABEL_13"
  },
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_10": 10,
    "LABEL_11": 11,
    "LABEL_12": 12,
    "LABEL_13": 13,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LA

## Inference

In [None]:
predict()

INFO:train:*** Test ***
***** Running Prediction *****
  Num examples = 20520
  Batch size = 1


Submission file saved as : submission.csv
