## [0418] XLM-RoBERTa를 이용한 첫 베이스라인

- **add_special_tokens 함수**: Entity 위치 정보를 활용해 [ENT],[/ENT] entity special token 추가

In [1]:
!pip install mxnet
!pip install gluonnlp pandas tqdm
!pip install sentencepiece
!pip install transformers==3
!pip install torch



In [28]:
import torch
from torch import nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import gluonnlp as nlp
import pandas as pd
import numpy as np
import re
import tarfile
import pickle as pickle
from tqdm import tqdm
from kobert.utils import get_tokenizer
from kobert.pytorch_kobert import get_pytorch_kobert_model
from transformers import AdamW
from transformers.optimization import get_cosine_schedule_with_warmup
from sklearn.model_selection import train_test_split

# Using KoELECTRA Model
from transformers import *

# Added by Me
import os
from tqdm.notebook import tqdm
from ohsuz.utils import *
from ohsuz.loss import *
from ohsuz.config import *
from torch.optim.lr_scheduler import StepLR, ReduceLROnPlateau, CosineAnnealingLR

In [29]:
max_len = 128
batch_size = 16
warmup_ratio = 0.01
epochs = 10
max_grad_norm = 1
log_interval = 50
lr = 5e-5

MODEL_NAME = "xlm-roberta-large"

In [30]:
# error labels need to be fixed later
error_label_0 = ['wikitree-12599-4-108-111-4-7',
                 'wikipedia-25967-115-24-26-35-37',
                 'wikipedia-16427-6-14-17-20-22',
                 'wikipedia-16427-8-0-3-26-28',
                 'wikitree-19765-5-30-33-6-8',
                 'wikitree-58702-0-18-20-22-24',
                 'wikitree-71638-8-21-23-15-17',
                 'wikipedia-257-0-0-1-53-57',
                 'wikipedia-13649-28-66-70-14-24',
                 'wikipedia-6017-8-20-26-4-7']
error_label_1 = ['wikitree-55837-4-0-2-10-11']
error_label_2 = ['wikitree-62775-3-3-7-0-2']
error_label_3 = ['wikipedia-23188-0-74-86-41-42']

In [31]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
config = AutoConfig.from_pretrained(MODEL_NAME)
config.num_labels = 42

model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, config=config)
model = model.to(device)
model.parameters()

Some weights of the model checkpoint at xlm-roberta-large were not used when initializing XLMRobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight']
- This IS expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-large and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.we

<generator object Module.parameters at 0x7f4b614c1550>

In [32]:
# check everything is fine
print(type(tokenizer))
print(type(config))
print(type(model))

<class 'transformers.tokenization_xlm_roberta.XLMRobertaTokenizer'>
<class 'transformers.configuration_xlm_roberta.XLMRobertaConfig'>
<class 'transformers.modeling_xlm_roberta.XLMRobertaForSequenceClassification'>


### 1. Dataset & DataLoader 준비

**add_entity_tokens**
- input
    - entity token을 추가할 문장
    - 첫 번째 entity 시작, 끝 index
    - 두 번째 entity 시작, 끝 index
- output
    - 해당하는 index에 entity token이 추가된 문장

**make_embedding_layer**
- input
    - 문장의 input_ids
- output
    - entity에 해당하는 token이면 1, 아니면 0으로 나타내는 tensor

In [43]:
def make_embedding_layer(input_ids):
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    flag = False
    tokens = tokenizer.convert_ids_to_tokens(input_ids)
    special_tokens = special_tokens_dict['additional_special_tokens']
    is_entity_layer = []

    for i, token in enumerate(tokens):
        if token in special_tokens:
            if flag == False:
                flag = True
            else:
                flag = False
        else:
            if flag == True:
                is_entity_layer.append(5)
                continue
        is_entity_layer.append(0)

    is_entity_layer = torch.tensor(is_entity_layer)
    return is_entity_layer

In [44]:
def add_entity_tokens(sentence, a1, a2, b1, b2):
    new_sentence = None
    # special_tokens = special_tokens_dict['additional_special_tokens']
    
    if a1 > b1: # b1 먼저
        # new_sentence = sentence[:b1] + special_tokens[2] + sentence[b1:b2+1] + special_tokens[3] + sentence[b2+1:a1] + special_tokens[0] + sentence[a1:a2+1] + special_tokens[1] + sentence[a2+1:]
        new_sentence = sentence[:b1] + "$" + sentence[b1:b2+1] + "$" + sentence[b2+1:a1] + "#" + sentence[a1:a2+1] + "#" + sentence[a2+1:]
    else: # a1 먼저
        new_sentence = sentence[:a1] + "#" + sentence[a1:a2+1] + "#" + sentence[a2+1:b1] + "$" + sentence[b1:b2+1] + "$" + sentence[b2+1:]
    return new_sentence

In [45]:
def load_data(dataset_dir, add_entity=True):
    with open('/opt/ml/input/data/label_type.pkl', 'rb') as f:
        label_type = pickle.load(f)
    dataset = pd.read_csv(dataset_dir, delimiter='\t', header=None)
    dataset = preprocessing_dataset(dataset, label_type, add_entity)
    return dataset


def preprocessing_dataset(dataset, label_type, add_entity):
    label = []
    sentences = None
    """
    for i in dataset[8]:
        if i == 'blind':
            label.append(100)
        else:
            label.append(label_type[i])
    """
    for ID, i in zip(dataset[0], dataset[8]):
        if i == 'blind':
            label.append(100)
        elif ID in error_label_0:
            label.append(label_type['관계_없음'])
        elif ID in error_label_1:
            label.append(label_type['단체:구성원'])
        elif ID in error_label_2:
            label.append(label_type['단체:본사_도시'])
        elif ID in error_label_3:
            label.append(label_type['단체:하위_단체'])
        else:
            label.append(label_type[i])
    
    if add_entity:
        ### 이 부분을 더 효율적으로 고치려면???
        sentences = [add_entity_tokens(dataset[1][i], dataset[3][i], dataset[4][i], dataset[6][i], dataset[7][i]) for i in range(len(dataset))]
    else:
        sentences = dataset[1]

    out_dataset = pd.DataFrame({'sentence':sentences,'entity_01':dataset[2],'entity_02':dataset[5],'label':label,})
    return out_dataset

In [46]:
add_UNK_token = True
add_ENT_token = False

In [47]:
def tokenizing_dataset(dataset, tokenizer):
    added_token_num = 0
    if add_UNK_token:
        for text in list(dataset['sentence']):
            input_ids = tokenizer.encode(text, add_special_tokens=False)
            decoded_ids = tokenizer.decode(input_ids)

            ori_text = ''.join(text.split(' '))
            dec_text = ''.join(decoded_ids.split(' '))
            if ori_text == dec_text: continue

            unk_cha = ''
            unk_list = []
            for dec in list(dec_text.split('[UNK]')):
                if dec == '': continue
                ori_text = list(ori_text.split(dec))
                unk_cha = ori_text[0]
                if unk_cha != '':
                    unk_list.append(unk_cha)
                    unk_cha = ''
                ori_text = ''.join(ori_text[1:])
                if ori_text == '': break
            added_token_num += tokenizer.add_tokens(list(''.join(unk_list)))
    if add_ENT_token:
        special_tokens_dict = {'additional_special_tokens': ['[ENT]']}
        added_token_num += tokenizer.add_special_tokens(special_tokens_dict)
        
    concat_entity = []
    for e01, e02 in zip(dataset['entity_01'], dataset['entity_02']):
        temp = e01 + '[SEP][SEP]' + e02 # 태양님 추천
        concat_entity.append(temp)

    tokenized_dataset = tokenizer(concat_entity,
                                  list(dataset['sentence']),
                                  return_tensors="pt",
                                  padding=True,
                                  truncation=True,
                                  max_length=190,
                                  add_special_tokens=True)
    return tokenized_dataset, added_token_num

In [48]:
class KoElecDataset(Dataset):
    def __init__(self, tsv_file, add_entity=True, handle_UNK='REMOVE'):
        self.dataset = load_data(tsv_file, add_entity)
        self.dataset['sentence'] = self.dataset['entity_01'] + ' [SEP] ' + self.dataset['entity_02'] + ' [SEP] ' + self.dataset['sentence']
        self.sentences = list(self.dataset['sentence'])
        self.labels = list(self.dataset['label'])
        self.tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-discriminator")
        self.tokenizer.add_special_tokens(special_tokens_dict)
        self.handle_UNK = handle_UNK
        
    def __len__(self):
        return len(self.dataset)
    
    def __getitem__(self, idx):
        sentence, label = self.sentences[idx], self.labels[idx]
        inputs = self.tokenizer(
            sentence,
            return_tensors='pt',
            truncation=True,
            max_length=190,
            pad_to_max_length=True,
            add_special_tokens=True
        )
            
        input_ids = inputs['input_ids'][0]
        is_embedding_layer = make_embedding_layer(input_ids, self.tokenizer)
        attention_mask = inputs['attention_mask'][0] + is_embedding_layer
        
        return input_ids, attention_mask, label

In [49]:
class KlueDataset(Dataset):
    def __init__(self, tokenized_dataset, labels):
        self.tokenized_dataset = tokenized_dataset
        self.labels = labels
        
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.tokenized_dataset.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        is_embedding_layer = make_embedding_layer(item['input_ids'])
        item['attention_mask'] = item['attention_mask'] + is_embedding_layer
        # item['input_ids'], item['attention_mask'], item['labels']
        return item['input_ids'], item['attention_mask'], item['labels']
    
    def __len__(self):
        return len(self.labels)

In [50]:
dataset = load_data(os.path.join(train_dir, 'train.tsv'))
labels = dataset['label'].values

tokenized_dataset, added_token_num = tokenizing_dataset(dataset, tokenizer)
klue_dataset = KlueDataset(tokenized_dataset, labels)

model.resize_token_embeddings(tokenizer.vocab_size + added_token_num)

We need to remove 26 to truncate the inputbut the first sequence has a length 12. Please select another truncation strategy than TruncationStrategy.ONLY_FIRST, for instance 'longest_first' or 'only_second'.
We need to remove 89 to truncate the inputbut the first sequence has a length 15. Please select another truncation strategy than TruncationStrategy.ONLY_FIRST, for instance 'longest_first' or 'only_second'.
We need to remove 54 to truncate the inputbut the first sequence has a length 13. Please select another truncation strategy than TruncationStrategy.ONLY_FIRST, for instance 'longest_first' or 'only_second'.
We need to remove 30 to truncate the inputbut the first sequence has a length 17. Please select another truncation strategy than TruncationStrategy.ONLY_FIRST, for instance 'longest_first' or 'only_second'.
We need to remove 35 to truncate the inputbut the first sequence has a length 14. Please select another truncation strategy than TruncationStrategy.ONLY_FIRST, for instance

Embedding(250002, 1024)

In [51]:
def load_test_dataset(dataset_dir, tokenizer):
    test_dataset = load_data(dataset_dir)
    test_label = test_dataset['label'].values
    tokenized_test = tokenizing_dataset(test_dataset, tokenizer)
    return tokenized_test, test_label

test_dataset, test_label = load_test_dataset(os.path.join(test_dir, 'test.tsv'), tokenizer)
test_dataset = KlueDataset(test_dataset, test_label)

We need to remove 20 to truncate the inputbut the first sequence has a length 19. Please select another truncation strategy than TruncationStrategy.ONLY_FIRST, for instance 'longest_first' or 'only_second'.
We need to remove 23 to truncate the inputbut the first sequence has a length 16. Please select another truncation strategy than TruncationStrategy.ONLY_FIRST, for instance 'longest_first' or 'only_second'.
We need to remove 53 to truncate the inputbut the first sequence has a length 12. Please select another truncation strategy than TruncationStrategy.ONLY_FIRST, for instance 'longest_first' or 'only_second'.


**Train, Valid set 8 : 2 로 분리**

In [12]:
# train_dataset, val_dataset = train_test_split(dataset, test_size=0.2, random_state=42)

In [13]:
# train_dataset.__getitem__(0)

In [52]:
#train_loader = DataLoader(train_dataset, batch_size=batch_size, num_workers=5)
#val_loader = DataLoader(val_dataset, batch_size=batch_size, num_workers=5)
train_loader = DataLoader(klue_dataset, batch_size=batch_size, num_workers=5, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=batch_size, num_workers=5, shuffle=False)

In [54]:
import sys

In [55]:
next(iter(train_loader))

  import sys
  import sys
  import sys
  import sys
  import sys


IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 185, in _worker_loop
    data = fetcher.fetch(index)
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "<ipython-input-49-8cedcc0bed95>", line 9, in __getitem__
    is_embedding_layer = make_embedding_layer(item['input_ids'])
  File "<ipython-input-43-a07466dad1f1>", line 4, in make_embedding_layer
    tokens = tokenizer.convert_ids_to_tokens(input_ids)
  File "/opt/conda/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 859, in convert_ids_to_tokens
    tokens.append(self._convert_id_to_token(index))
  File "/opt/conda/lib/python3.7/site-packages/transformers/tokenization_xlm_roberta.py", line 285, in _convert_id_to_token
    return self.sp_model.IdToPiece(index - self.fairseq_offset)
  File "/opt/conda/lib/python3.7/site-packages/sentencepiece/__init__.py", line 501, in _batched_func
    return _func(self, arg)
  File "/opt/conda/lib/python3.7/site-packages/sentencepiece/__init__.py", line 494, in _func
    raise IndexError('piece id is out of range.')
IndexError: piece id is out of range.


### 2. Model 준비

In [16]:
model = ElectraForSequenceClassification.from_pretrained("monologg/koelectra-base-v3-discriminator", num_labels=42).to(device)

Some weights of the model checkpoint at monologg/koelectra-base-v3-discriminator were not used when initializing ElectraForSequenceClassification: ['discriminator_predictions.dense.weight', 'discriminator_predictions.dense.bias', 'discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense_prediction.bias', 'electra.embeddings.position_ids']
- This IS expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at monologg/koelectra-base-v3-discri

In [17]:
model.resize_token_embeddings(35004) # 내가 임의로 입력함 나중에 수정

Embedding(35004, 768)

### 3. Train

In [18]:
optimizer = AdamW(model.parameters(), lr=lr)
ls_loss = LabelSmoothingLoss()
cels_loss = CELSLoss()
scheduler = CosineAnnealingLR(optimizer, T_max=10, eta_min=1e-6) # 익효님꺼로 파라미터 변경
# scheduler = CosineAnnealingLR(optimizer, T_max=2, eta_min=0.)

In [19]:
def calc_accuracy(X,Y):
    max_vals, max_indices = torch.max(X, 1)
    train_acc = (max_indices == Y).sum().data.cpu().numpy()/max_indices.size()[0]
    return train_acc

In [20]:
best_acc = 0.0

for epoch in range(epochs):
    train_acc = 0.0
    test_acc = 0.0
    
    model.train()
    
    for batch_id, (input_ids_batch, attention_masks_batch, y_batch) in tqdm(enumerate(train_loader)):
        optimizer.zero_grad()
        y_batch = y_batch.to(device)
        # 생각해보니까 내가 추가적으로 만든 embedding layer를 입력으로 주려면 모델 내부 구조를 바꿔야 되지 않나...?
        # 우선 attention mask에 더해서 입력으로 줘보자
        y_pred = model(input_ids_batch.to(device), attention_mask=attention_masks_batch.to(device))[0]
        loss = cels_loss(y_pred, y_batch)
        loss.backward()
        optimizer.step()
        scheduler.step()
        train_acc += calc_accuracy(y_pred, y_batch)
        if batch_id % log_interval == 0:
            print(f"epoch {epoch+1} batch id {batch_id+1} loss {loss.data.cpu().numpy()} train acc {train_acc / (batch_id+1)}")

    train_acc = train_acc / (batch_id+1)
    print(f"epoch {epoch+1} train acc {train_acc}")
 
    """
    model.eval()
    for batch_id, (input_ids_batch, attention_masks_batch, y_batch) in tqdm(enumerate(val_loader)):
        y_batch = y_batch.to(device)
        y_pred = model(input_ids_batch.to(device), attention_mask=attention_masks_batch.to(device))[0]
        test_acc += calc_accuracy(y_pred, y_batch)
        
    print(f"epoch {epoch+1} test acc {test_acc / (batch_id+1)}")
    
    if test_acc >= best_acc:
        best_acc = test_acc
        torch.save(model.state_dict(), "/opt/ml/models/0417_koelectra_eda.pt")
    """
    
    if train_acc >= best_acc:
        best_acc = train_acc
        torch.save(model.state_dict(), "/opt/ml/models/0417_koelectra_eda.pt")

HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))

epoch 1 batch id 1 loss 7.272258758544922 train acc 0.1875
epoch 1 batch id 51 loss 6.673955917358398 train acc 0.39705882352941174
epoch 1 batch id 101 loss 6.026796340942383 train acc 0.42636138613861385
epoch 1 batch id 151 loss 4.209381103515625 train acc 0.44867549668874174
epoch 1 batch id 201 loss 5.922874450683594 train acc 0.45677860696517414
epoch 1 batch id 251 loss 4.489680290222168 train acc 0.4663844621513944
epoch 1 batch id 301 loss 8.037888526916504 train acc 0.4688538205980066
epoch 1 batch id 351 loss 5.937761306762695 train acc 0.47150997150997154
epoch 1 batch id 401 loss 3.1563119888305664 train acc 0.47989401496259354
epoch 1 batch id 451 loss 8.230195999145508 train acc 0.48822062084257206
epoch 1 batch id 501 loss 4.939570903778076 train acc 0.4806636726546906
epoch 1 batch id 551 loss 3.7603254318237305 train acc 0.48139745916515425
epoch 1 batch id 601 loss 5.61888313293457 train acc 0.47878535773710484
epoch 1 batch id 651 loss 5.980191707611084 train acc 0.

HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))

epoch 2 batch id 1 loss 3.7403199672698975 train acc 0.3125
epoch 2 batch id 51 loss 1.5157744884490967 train acc 0.6029411764705882
epoch 2 batch id 101 loss 3.796792984008789 train acc 0.556930693069307
epoch 2 batch id 151 loss 1.2219057083129883 train acc 0.5620860927152318
epoch 2 batch id 201 loss 3.876227378845215 train acc 0.5699626865671642
epoch 2 batch id 251 loss 1.7792237997055054 train acc 0.5864043824701195
epoch 2 batch id 301 loss 6.283151149749756 train acc 0.5820182724252492
epoch 2 batch id 351 loss 5.178122520446777 train acc 0.5888532763532763
epoch 2 batch id 401 loss 1.028862476348877 train acc 0.5897755610972568
epoch 2 batch id 451 loss 6.330960273742676 train acc 0.6004711751662971
epoch 2 batch id 501 loss 1.6671016216278076 train acc 0.5993013972055888
epoch 2 batch id 551 loss 3.046321392059326 train acc 0.6012931034482759
epoch 2 batch id 601 loss 3.7662477493286133 train acc 0.598585690515807
epoch 2 batch id 651 loss 4.479378700256348 train acc 0.603782

HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))

epoch 3 batch id 1 loss 2.4970407485961914 train acc 0.4375
epoch 3 batch id 51 loss 0.7761947512626648 train acc 0.6678921568627451
epoch 3 batch id 101 loss 3.484156847000122 train acc 0.6398514851485149
epoch 3 batch id 151 loss 0.47594958543777466 train acc 0.6485927152317881
epoch 3 batch id 201 loss 2.02579402923584 train acc 0.6598258706467661
epoch 3 batch id 251 loss 0.9196346402168274 train acc 0.6710657370517928
epoch 3 batch id 301 loss 5.404382705688477 train acc 0.668812292358804
epoch 3 batch id 351 loss 3.343883514404297 train acc 0.6785968660968661
epoch 3 batch id 401 loss 0.2589179277420044 train acc 0.6769014962593516
epoch 3 batch id 451 loss 5.069751739501953 train acc 0.6838968957871396
epoch 3 batch id 501 loss 0.5187137126922607 train acc 0.685379241516966
epoch 3 batch id 551 loss 2.507948160171509 train acc 0.6855716878402904
epoch 3 batch id 601 loss 2.098541736602783 train acc 0.6815723793677204
epoch 3 batch id 651 loss 3.6151535511016846 train acc 0.68721

HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))

epoch 4 batch id 1 loss 1.5283775329589844 train acc 0.8125
epoch 4 batch id 51 loss 0.33055874705314636 train acc 0.7463235294117647
epoch 4 batch id 101 loss 3.163536548614502 train acc 0.719059405940594
epoch 4 batch id 151 loss 0.334639310836792 train acc 0.7272350993377483
epoch 4 batch id 201 loss 1.8308120965957642 train acc 0.7307213930348259
epoch 4 batch id 251 loss 0.3766354024410248 train acc 0.7355577689243028
epoch 4 batch id 301 loss 4.703818321228027 train acc 0.7323504983388704
epoch 4 batch id 351 loss 2.7060351371765137 train acc 0.7403846153846154
epoch 4 batch id 401 loss 0.06334559619426727 train acc 0.739713216957606
epoch 4 batch id 451 loss 3.7516708374023438 train acc 0.7448725055432373
epoch 4 batch id 501 loss 0.23402813076972961 train acc 0.747255489021956
epoch 4 batch id 551 loss 2.2515430450439453 train acc 0.75
epoch 4 batch id 601 loss 1.1276555061340332 train acc 0.7480241264559068
epoch 4 batch id 651 loss 3.0977838039398193 train acc 0.7529761904761

HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))

epoch 6 batch id 1 loss 0.3223203718662262 train acc 1.0
epoch 6 batch id 51 loss 0.10573983937501907 train acc 0.8725490196078431
epoch 6 batch id 101 loss 2.3689968585968018 train acc 0.8508663366336634
epoch 6 batch id 151 loss 0.15261194109916687 train acc 0.8497516556291391
epoch 6 batch id 201 loss 1.1380054950714111 train acc 0.8479477611940298
epoch 6 batch id 251 loss 0.14735965430736542 train acc 0.8523406374501992
epoch 6 batch id 301 loss 1.8486146926879883 train acc 0.8515365448504983
epoch 6 batch id 351 loss 1.5735357999801636 train acc 0.8575498575498576
epoch 6 batch id 401 loss 0.036719661206007004 train acc 0.8559850374064838
epoch 6 batch id 451 loss 1.0059475898742676 train acc 0.8607261640798226
epoch 6 batch id 501 loss 0.049501001834869385 train acc 0.8627744510978044
epoch 6 batch id 551 loss 0.42797160148620605 train acc 0.8651315789473685
epoch 6 batch id 601 loss 0.31830114126205444 train acc 0.8632487520798668
epoch 6 batch id 651 loss 2.948869228363037 tra

HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))

epoch 7 batch id 1 loss 0.893836259841919 train acc 0.875
epoch 7 batch id 51 loss 0.06119949370622635 train acc 0.8970588235294118
epoch 7 batch id 101 loss 0.8392494916915894 train acc 0.8886138613861386
epoch 7 batch id 151 loss 0.17795896530151367 train acc 0.8853476821192053
epoch 7 batch id 201 loss 1.043816328048706 train acc 0.8843283582089553
epoch 7 batch id 251 loss 0.05219545587897301 train acc 0.8837151394422311
epoch 7 batch id 301 loss 1.32590913772583 train acc 0.8824750830564784
epoch 7 batch id 351 loss 0.4039571285247803 train acc 0.8883547008547008
epoch 7 batch id 401 loss 0.0065556475892663 train acc 0.8855985037406484
epoch 7 batch id 451 loss 1.0869872570037842 train acc 0.8899667405764967
epoch 7 batch id 501 loss 0.03300292044878006 train acc 0.8919660678642715
epoch 7 batch id 551 loss 1.8676316738128662 train acc 0.89258166969147
epoch 7 batch id 601 loss 0.2760053277015686 train acc 0.8861272878535774
epoch 7 batch id 651 loss 2.453744411468506 train acc 0.

HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))

epoch 8 batch id 1 loss 0.08752886950969696 train acc 1.0
epoch 8 batch id 51 loss 0.02913292869925499 train acc 0.9154411764705882
epoch 8 batch id 101 loss 0.4835392236709595 train acc 0.9071782178217822
epoch 8 batch id 151 loss 0.1325603425502777 train acc 0.9060430463576159
epoch 8 batch id 201 loss 0.5812167525291443 train acc 0.9076492537313433
epoch 8 batch id 251 loss 0.0539797805249691 train acc 0.897410358565737
epoch 8 batch id 301 loss 1.185375690460205 train acc 0.8978405315614618
epoch 8 batch id 351 loss 0.26715177297592163 train acc 0.9038461538461539
epoch 8 batch id 401 loss 0.00900849886238575 train acc 0.9036783042394015
epoch 8 batch id 451 loss 0.4436936378479004 train acc 0.9067350332594235
epoch 8 batch id 501 loss 0.03314460813999176 train acc 0.908807385229541
epoch 8 batch id 551 loss 0.10402217507362366 train acc 0.9095961887477314
epoch 8 batch id 601 loss 0.2489045262336731 train acc 0.9076539101497504
epoch 8 batch id 651 loss 2.6365795135498047 train ac

HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))

epoch 9 batch id 1 loss 0.06647329032421112 train acc 1.0
epoch 9 batch id 51 loss 0.013598231598734856 train acc 0.9142156862745098
epoch 9 batch id 101 loss 0.3521116375923157 train acc 0.9195544554455446
epoch 9 batch id 151 loss 0.0018363241106271744 train acc 0.9221854304635762
epoch 9 batch id 201 loss 0.5658109188079834 train acc 0.9228855721393034
epoch 9 batch id 251 loss 0.05831364914774895 train acc 0.9220617529880478
epoch 9 batch id 301 loss 0.6558061838150024 train acc 0.9233803986710963
epoch 9 batch id 351 loss 0.2306309938430786 train acc 0.926994301994302
epoch 9 batch id 401 loss 0.004605754278600216 train acc 0.9265897755610972
epoch 9 batch id 451 loss 0.3317747712135315 train acc 0.928630820399113
epoch 9 batch id 501 loss 0.016798539087176323 train acc 0.9286427145708582
epoch 9 batch id 551 loss 0.05090568959712982 train acc 0.9284255898366606
epoch 9 batch id 601 loss 0.3126387894153595 train acc 0.9255407653910149
epoch 9 batch id 651 loss 2.123741388320923 tr

HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))

epoch 10 batch id 1 loss 0.04791385680437088 train acc 1.0
epoch 10 batch id 51 loss 0.025720154866576195 train acc 0.9301470588235294
epoch 10 batch id 101 loss 0.13396503031253815 train acc 0.932549504950495
epoch 10 batch id 151 loss 0.005933879874646664 train acc 0.9358443708609272
epoch 10 batch id 201 loss 0.13361045718193054 train acc 0.9375
epoch 10 batch id 251 loss 0.03144533187150955 train acc 0.9404880478087649
epoch 10 batch id 301 loss 0.2751581072807312 train acc 0.9428986710963455
epoch 10 batch id 351 loss 0.1719191074371338 train acc 0.9433760683760684
epoch 10 batch id 401 loss 0.0032591994386166334 train acc 0.9435785536159601
epoch 10 batch id 451 loss 0.2934042811393738 train acc 0.9438747228381374
epoch 10 batch id 501 loss 0.00810663215816021 train acc 0.9434880239520959
epoch 10 batch id 551 loss 0.08300015330314636 train acc 0.9422640653357531
epoch 10 batch id 601 loss 0.17189796268939972 train acc 0.9395798668885191
epoch 10 batch id 651 loss 1.4332994222640

### **6. 예측**

In [21]:
model.load_state_dict(torch.load("/opt/ml/models/0417_koelectra_eda.pt"))

model.eval()

predictions = []

for input_ids_batch, attention_masks_batch, y_batch in tqdm(test_loader):
    y_batch = y_batch.to(device)
    y_pred = model(input_ids_batch.to(device), attention_mask=attention_masks_batch.to(device))[0]
    _, predict = torch.max(y_pred, 1)
    predictions.extend(predict.tolist())

HBox(children=(FloatProgress(value=0.0, max=313.0), HTML(value='')))




In [22]:
submission = pd.DataFrame(predictions, columns=['pred'])
submission.to_csv(os.path.join(submission_dir, '0417_submission_1.csv'), index=False)

### **7. 데이터 증강**