<a href="https://colab.research.google.com/github/KYUSEONGHAN/Drawing-Dirary/blob/master/text/KoBERT_short_long_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Install the required libraries
!pip install mxnet
!pip install gluonnlp pandas tqdm
!pip install sentencepiece
!pip install transformers==3.0.2
!pip install torch

# Install KoBERT from GitHub
!pip install git+https://git@github.com/SKTBrain/KoBERT.git@master

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting mxnet
  Downloading mxnet-1.9.1-py3-none-manylinux2014_x86_64.whl (49.1 MB)
Collecting graphviz<0.9.0,>=0.8.1
  Downloading graphviz-0.8.4-py2.py3-none-any.whl (16 kB)
Installing collected packages: graphviz, mxnet
  Attempting uninstall: graphviz
    Found existing installation: graphviz 0.10.1
    Uninstalling graphviz-0.10.1:
      Successfully uninstalled graphviz-0.10.1
Successfully installed graphviz-0.8.4 mxnet-1.9.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting gluonnlp
  Downloading gluonnlp-0.10.0.tar.gz (344 kB)
Building wheels for collected packages: gluonnlp
  Building wheel for gluonnlp (setup.py) ... done
  Created wheel for gluonnlp: filename=gluonnlp-0.10.0-cp3

In [2]:
import torch
from torch import nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import gluonnlp as nlp
import numpy as np
from tqdm import tqdm, tqdm_notebook

#kobert
from kobert.utils import get_tokenizer
from kobert.pytorch_kobert import get_pytorch_kobert_model

#transformers
from transformers import AdamW
from transformers.optimization import get_cosine_schedule_with_warmup

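# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")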

# Load the pretrained BERT model and its vocabulary
bertmodel, vocab = get_pytorch_kobert_model()

/content/.cache/kobert_v1.zip[██████████████████████████████████████████████████]
/content/.cache/kobert_news_wiki_ko_cased-1087f8699e.spiece[██████████████████████████████████████████████████]


In [3]:
# Mount Google Drive to read the dataset files
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
# Load the Excel dataset files; they are reshaped as needed in the cells below
import pandas as pd

short_dataset = pd.read_excel('/content/drive/MyDrive/Github/DataSet/한국어_감정_정보가_포함된_단발성_대화_데이터셋.xlsx')
long_dataset = pd.read_excel('/content/drive/MyDrive/Github/DataSet/한국어_감정_정보가_포함된_연속적_대화_데이터셋.xlsx')

In [6]:
short_dataset.head()

Unnamed: 0,Sentence,Emotion,Unnamed: 2,Unnamed: 3,Unnamed: 4,공포,5468
0,언니 동생으로 부르는게 맞는 일인가요..??,공포,,,,놀람,5898.0
1,그냥 내 느낌일뿐겠지?,공포,,,,분노,5665.0
2,아직너무초기라서 그런거죠?,공포,,,,슬픔,5267.0
3,유치원버스 사고 낫다던데,공포,,,,중립,4830.0
4,근데 원래이런거맞나요,공포,,,,행복,6037.0


In [7]:
short_dataset['Emotion'].unique()

array(['공포', '놀람', '분노', '슬픔', '중립', '행복', '혐오'], dtype=object)

In [8]:
long_dataset.head()

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,행복,중립,슬픔,공포,혐오,분노,놀람
0,dialog #,발화,감정,,,1030.0,,,,,,
1,S,아 진짜! 사무실에서 피지 말라니깐! 간접흡연이 얼마나 안좋은데!,분노,,,,,,,,,
2,,그럼 직접흡연하는 난 얼마나 안좋겠니? 안그래? 보면 꼭... 지 생각만 하고.,혐오,,,,,,,,,
3,,손님 왔어요.,중립,,,,,,,,,
4,,손님? 누구?,중립,,,,,,,,,


In [10]:
long_dataset['Unnamed: 2'][1:].unique()

array(['분노', '혐오', '중립', '놀람', '행복', '공포', '슬픔', 'ㅈ중립', '분ㄴ', '중림', nan,
       'ㅍ', 'ㄴ중립', '분', '줄'], dtype=object)

In [12]:
# As the previous cell shows, the continuous (multi-turn) dialogue dataset contains many mistyped labels
# Map each mistyped label to the label it was most likely meant to be
long_dataset['Unnamed: 2'] = long_dataset['Unnamed: 2'][1:].replace({
    'ㅈ중립': '중립', '중림': '중립', '분ㄴ': '분노', 'ㄴ중립': '중립',
    '분': '분노', '줄': '중립', 'ㅍ': '중립',
})
long_dataset['Unnamed: 2'][1:].unique()

array(['분노', '혐오', '중립', '놀람', '행복', '공포', '슬픔', nan], dtype=object)
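
The cleanup above maps a handful of mistyped labels onto the seven valid emotions. A minimal plain-Python sketch of the same idea follows, which also reports any value that is still not a valid label after the mapping. The helper name `normalize_labels` and the sample value `"??"` are hypothetical and only serve the illustration; the typo table is copied from the cell above.

```python
from collections import Counter

# The seven labels that appear in the single-turn dataset.
VALID_LABELS = {"공포", "놀람", "분노", "슬픔", "중립", "행복", "혐오"}

# Typo -> intended label, copied from the replace() call in the notebook.
LABEL_FIXES = {
    "ㅈ중립": "중립",
    "중림": "중립",
    "분ㄴ": "분노",
    "ㄴ중립": "중립",
    "분": "분노",
    "줄": "중립",
    "ㅍ": "중립",
}


def normalize_labels(labels):
    """Return (cleaned labels, Counter of values that are still not valid)."""
    cleaned = [LABEL_FIXES.get(label, label) for label in labels]
    unknown = Counter(label for label in cleaned if label not in VALID_LABELS)
    return cleaned, unknown


# "??" is a made-up value that the typo table does not cover.
cleaned, unknown = normalize_labels(["분노", "ㅈ중립", "분ㄴ", "줄", "??"])
print(cleaned)
print(unknown)
```

Reporting the leftovers makes it easier to notice a new typo instead of silently carrying it into training.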

In [13]:
# Drop rows whose label is missing (NaN)
long_dataset.dropna(subset=['Unnamed: 2'], inplace=True)

long_dataset['Unnamed: 2'][1:].unique()

array(['혐오', '중립', '분노', '놀람', '행복', '공포', '슬픔'], dtype=object)

In [15]:
# Only the Sentence and Emotion columns of the original file are needed, so keep just those
short_dataset = short_dataset[['Sentence', 'Emotion']]
short_dataset

Unnamed: 0,Sentence,Emotion
0,언니 동생으로 부르는게 맞는 일인가요..??,공포
1,그냥 내 느낌일뿐겠지?,공포
2,아직너무초기라서 그런거죠?,공포
3,유치원버스 사고 낫다던데,공포
4,근데 원래이런거맞나요,공포
...,...,...
38589,솔직히 예보 제대로 못하는 데 세금이라도 아끼게 그냥 폐지해라..,혐오
38590,재미가 없으니 망하지,혐오
38591,공장 도시락 비우생적임 아르바이트했는데 화장실가성 손도 않씯고 재료 담고 바닥 떨어...,혐오
38592,코딱지 만한 나라에서 지들끼리 피터지게 싸우는 센징 클래스 ㅉㅉㅉ,혐오


In [16]:
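# Only the utterance and emotion columns of the original file are needed, so keep just those
# Note: the header row (index 0) was already removed by dropna above, so [1:] here
# also drops the first utterance (index 1); the output below starts at index 2
long_dataset = long_dataset[['Unnamed: 1', 'Unnamed: 2']][1:]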

long_dataset

Unnamed: 0,Unnamed: 1,Unnamed: 2
2,그럼 직접흡연하는 난 얼마나 안좋겠니? 안그래? 보면 꼭... 지 생각만 하고.,혐오
3,손님 왔어요.,중립
4,손님? 누구?,중립
5,몰라요. 팀장님 친구래요.,중립
6,내 친구? 친구 누구?,중립
...,...,...
55624,얘긴 다 끝났냐? 원예부,중립
55625,"예. 그거 때문에, 부탁이 있......는......데요.",중립
55626,여자 숨겨달라는거면 사절이다.,중립
55627,아무래도 안되나요?,중립


In [17]:
# Rename the columns of the continuous dialogue dataset
long_dataset.columns = ['Sentence', 'Emotion']

long_dataset

Unnamed: 0,Sentence,Emotion
2,그럼 직접흡연하는 난 얼마나 안좋겠니? 안그래? 보면 꼭... 지 생각만 하고.,혐오
3,손님 왔어요.,중립
4,손님? 누구?,중립
5,몰라요. 팀장님 친구래요.,중립
6,내 친구? 친구 누구?,중립
...,...,...
55624,얘긴 다 끝났냐? 원예부,중립
55625,"예. 그거 때문에, 부탁이 있......는......데요.",중립
55626,여자 숨겨달라는거면 사절이다.,중립
55627,아무래도 안되나요?,중립


In [18]:
# Concatenate the single-turn and continuous dialogue datasets
frames = [short_dataset, long_dataset]
result = pd.concat(frames)

result

Unnamed: 0,Sentence,Emotion
0,언니 동생으로 부르는게 맞는 일인가요..??,공포
1,그냥 내 느낌일뿐겠지?,공포
2,아직너무초기라서 그런거죠?,공포
3,유치원버스 사고 낫다던데,공포
4,근데 원래이런거맞나요,공포
...,...,...
55624,얘긴 다 끝났냐? 원예부,중립
55625,"예. 그거 때문에, 부탁이 있......는......데요.",중립
55626,여자 숨겨달라는거면 사절이다.,중립
55627,아무래도 안되나요?,중립


In [20]:
print(len(result))  # total number of rows: 94214

result['Emotion'].unique()

94214


array(['공포', '놀람', '분노', '슬픔', '중립', '행복', '혐오'], dtype=object)

In [21]:
# Encode each emotion label as an integer class id
result.loc[(result['Emotion'] == "공포"), 'Emotion'] = 0  # 공포 (fear) => 0
result.loc[(result['Emotion'] == "놀람"), 'Emotion'] = 1  # 놀람 (surprise) => 1
result.loc[(result['Emotion'] == "분노"), 'Emotion'] = 2  # 분노 (anger) => 2
result.loc[(result['Emotion'] == "슬픔"), 'Emotion'] = 3  # 슬픔 (sadness) => 3
result.loc[(result['Emotion'] == "중립"), 'Emotion'] = 4  # 중립 (neutral) => 4
result.loc[(result['Emotion'] == "행복"), 'Emotion'] = 5  # 행복 (happiness) => 5
result.loc[(result['Emotion'] == "혐오"), 'Emotion'] = 6  # 혐오 (disgust) => 6

# Build a list of [sentence, label] pairs; the label is stored as a string
data_list = []

for q, label in zip(result['Sentence'], result['Emotion']):
    data_list.append([q, str(label)])
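
The integer codes above are needed again when a predicted class id has to be turned back into an emotion. A small sketch of a two-way mapping follows; the names `LABEL_TO_ID`, `ID_TO_LABEL`, `encode`, and `decode` are hypothetical, and the English glosses are my own translations rather than part of the dataset.

```python
# Label order matches the integer codes assigned in the cell above.
EMOTIONS = ["공포", "놀람", "분노", "슬픔", "중립", "행복", "혐오"]

# English glosses for readability only (translations, not dataset values).
EMOTIONS_EN = ["fear", "surprise", "anger", "sadness", "neutral", "happiness", "disgust"]

LABEL_TO_ID = {label: idx for idx, label in enumerate(EMOTIONS)}
ID_TO_LABEL = {idx: label for label, idx in LABEL_TO_ID.items()}


def encode(label):
    """Emotion string -> integer class id."""
    return LABEL_TO_ID[label]


def decode(class_id):
    """Integer class id (e.g. an argmax over the logits) -> emotion string."""
    return ID_TO_LABEL[class_id]


print(encode("중립"), decode(6), EMOTIONS_EN[6])
```

Keeping both directions in one place avoids the class order drifting between training and inference code.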

In [22]:
# Split the data into train and test sets
from sklearn.model_selection import train_test_split

dataset_train, dataset_test = train_test_split(data_list, test_size=0.25, random_state=0)
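
The split above is random but not stratified, so the share of each emotion may differ slightly between the two sets. scikit-learn's `train_test_split` accepts a `stratify` argument for this purpose; the plain-Python sketch below only illustrates the idea and is not what the notebook runs. The function name `stratified_split` and the toy rows are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(rows, test_size=0.25, seed=0):
    """Split [sentence, label] rows so each label keeps roughly the same share."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        n_test = round(len(group) * test_size)
        test.extend(group[:n_test])
        train.extend(group[n_test:])

    # Shuffle again so the rows are not grouped by label.
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Toy data: eight rows, two labels with four rows each.
toy_rows = [[f"sentence {i}", str(i % 2)] for i in range(8)]
toy_train, toy_test = stratified_split(toy_rows)
print(len(toy_train), len(toy_test))
```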

In [23]:
# Dataset class that converts the raw [sentence, label] pairs into BERT model inputs
class BERTDataset(Dataset):
    def __init__(self, dataset, sent_idx, label_idx, bert_tokenizer, max_len,
                 pad, pair):
        transform = nlp.data.BERTSentenceTransform(
            bert_tokenizer, max_seq_length=max_len, pad=pad, pair=pair)

        self.sentences = [transform([i[sent_idx]]) for i in dataset]
        self.labels = [np.int32(i[label_idx]) for i in dataset]

    def __getitem__(self, i):
        return (self.sentences[i] + (self.labels[i], ))

    def __len__(self):
        return (len(self.labels))
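
Each item of `BERTDataset` is a tuple of token ids, valid length, segment ids, and label. The sketch below illustrates, in plain Python, how a single sentence is assumed to be laid out: `[CLS]` and `[SEP]` are added, the sequence is truncated or padded to `max_len`, and the valid length counts the non-padding positions. This reflects my understanding of what the transform does for `pair=False`, not the gluonnlp source; the function name and all token ids are hypothetical.

```python
def pad_single_sentence(token_ids, max_len, cls_id, sep_id, pad_id):
    """Lay out one tokenized sentence as (input ids, valid length, segment ids)."""
    body = token_ids[: max_len - 2]            # leave room for [CLS] and [SEP]
    ids = [cls_id] + body + [sep_id]
    valid_length = len(ids)                    # number of non-padding positions
    ids = ids + [pad_id] * (max_len - valid_length)
    segment_ids = [0] * max_len                # a single sentence uses segment 0 only
    return ids, valid_length, segment_ids


# Hypothetical ids: 2 = [CLS], 3 = [SEP], 1 = [PAD].
ids, valid_length, segment_ids = pad_single_sentence(
    [11, 12, 13], max_len=8, cls_id=2, sep_id=3, pad_id=1
)
print(ids, valid_length, segment_ids)
```

The valid length is what the classifier later uses to build the attention mask, so padding positions do not take part in attention.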

In [24]:
# Setting parameters
max_len = 64
batch_size = 64
warmup_ratio = 0.1
num_epochs = 10
max_grad_norm = 1
log_interval = 200
learning_rate =  5e-5

# Tokenizer
tokenizer = get_tokenizer()
tok = nlp.data.BERTSPTokenizer(tokenizer, vocab, lower=False)

using cached model. /content/.cache/kobert_news_wiki_ko_cased-1087f8699e.spiece


In [25]:
data_train = BERTDataset(dataset_train, 0, 1, tok, max_len, True, False)
data_test = BERTDataset(dataset_test, 0, 1, tok, max_len, True, False)

In [26]:
train_dataloader = torch.utils.data.DataLoader(data_train, batch_size=batch_size, num_workers=5)
test_dataloader = torch.utils.data.DataLoader(data_test, batch_size=batch_size, num_workers=5)



In [27]:
class BERTClassifier(nn.Module):
    def __init__(self,
                 bert,
                 hidden_size = 768,
                 num_classes=7,   # set to the number of emotion classes
                 dr_rate=None,
                 params=None):
        super(BERTClassifier, self).__init__()
        self.bert = bert
        self.dr_rate = dr_rate
                 
        self.classifier = nn.Linear(hidden_size , num_classes)
        if dr_rate:
            self.dropout = nn.Dropout(p=dr_rate)
    
    def gen_attention_mask(self, token_ids, valid_length):
        attention_mask = torch.zeros_like(token_ids)
        for i, v in enumerate(valid_length):
            attention_mask[i][:v] = 1
        return attention_mask.float()

    def forward(self, token_ids, valid_length, segment_ids):
        attention_mask = self.gen_attention_mask(token_ids, valid_length)
        
        _, pooler = self.bert(input_ids = token_ids, token_type_ids = segment_ids.long(), attention_mask = attention_mask.float().to(token_ids.device))
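        # Apply dropout only when it is configured; otherwise use the pooled output as is
        # (previously `out` was undefined when dr_rate was None)
        out = self.dropout(pooler) if self.dr_rate else pooler
        return self.classifier(out)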
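
`gen_attention_mask` marks the first `valid_length` positions of every row with 1 and the padding positions with 0. A plain-Python illustration of the values it produces is given below; the function name `attention_mask_rows` and the sample lengths are hypothetical.

```python
def attention_mask_rows(valid_lengths, seq_len):
    """Return one mask row per sequence: 1.0 for real tokens, 0.0 for padding."""
    return [
        [1.0 if position < valid else 0.0 for position in range(seq_len)]
        for valid in valid_lengths
    ]


# Two sequences padded to length 5, with 3 and 5 real tokens respectively.
mask = attention_mask_rows([3, 5], seq_len=5)
for row in mask:
    print(row)
```

In PyTorch the same values could also be obtained without a Python loop by comparing `torch.arange(seq_len)` against the valid lengths, which may be faster for large batches.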

In [28]:
model = BERTClassifier(bertmodel,  dr_rate=0.5).to(device)

# Prepare optimizer and schedule (linear warmup followed by cosine decay)
no_decay = ['bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
    {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]

optimizer = AdamW(optimizer_grouped_parameters, lr=learning_rate)
loss_fn = nn.CrossEntropyLoss()

t_total = len(train_dataloader) * num_epochs
warmup_step = int(t_total * warmup_ratio)

scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=warmup_step, num_training_steps=t_total)
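
To make the schedule concrete, the sketch below computes the learning-rate multiplier that a cosine schedule with warmup applies at a given step: a linear ramp from 0 to 1 during warmup, then half a cosine wave down to 0. The formula follows my understanding of the default behaviour (half a cosine cycle) and the function name `cosine_warmup_factor` is hypothetical. The figure of 1105 batches per epoch is taken from the progress bar in the training log.

```python
import math


def cosine_warmup_factor(step, num_warmup_steps, num_training_steps, num_cycles=0.5):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear warmup from 0 to 1.
        return step / max(1, num_warmup_steps)
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    # Cosine decay from 1 to 0 over the remaining steps.
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * num_cycles * 2.0 * progress)))


learning_rate = 5e-5
t_total = 1105 * 10                 # batches per epoch * num_epochs
warmup_step = int(t_total * 0.1)    # warmup_ratio = 0.1

for step in (0, warmup_step // 2, warmup_step, t_total // 2, t_total):
    factor = cosine_warmup_factor(step, warmup_step, t_total)
    print(step, learning_rate * factor)
```

With these settings the learning rate peaks at the end of the first epoch and decays for the remaining nine.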

In [29]:
def calc_accuracy(X,Y):
    max_vals, max_indices = torch.max(X, 1)
    train_acc = (max_indices == Y).sum().data.cpu().numpy()/max_indices.size()[0]
    return train_acc
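
`calc_accuracy` returns the accuracy of one batch, and the training loop averages these per-batch values. When the last batch is smaller than the others, that average weights its samples more heavily than the rest. If the split sizes follow from the 94214 rows and `test_size=0.25` (about 23554 test rows, which would match the 369 test batches in the log), the last test batch would hold only 2 samples. The sketch below contrasts the two ways of aggregating; the function names and the batch counts are hypothetical.

```python
def mean_of_batch_accuracies(batches):
    """Average the per-batch accuracies (what summing calc_accuracy results does)."""
    return sum(correct / size for correct, size in batches) / len(batches)


def sample_weighted_accuracy(batches):
    """Count correct predictions over all samples, regardless of batch size."""
    total_correct = sum(correct for correct, _ in batches)
    total_size = sum(size for _, size in batches)
    return total_correct / total_size


# Hypothetical (correct, batch size) pairs; the last batch is much smaller.
batches = [(48, 64), (48, 64), (2, 2)]
print(mean_of_batch_accuracies(batches))
print(sample_weighted_accuracy(batches))
```

With hundreds of full batches the difference is small, so the logged numbers are still a reasonable estimate.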

In [30]:
from time import time

start_time = time()

for e in range(num_epochs):
    train_acc = 0.0
    test_acc = 0.0
    model.train()
    for batch_id, (token_ids, valid_length, segment_ids, label) in enumerate(tqdm_notebook(train_dataloader)):
        optimizer.zero_grad()
        token_ids = token_ids.long().to(device)
        segment_ids = segment_ids.long().to(device)
        valid_length= valid_length
        label = label.long().to(device)
        out = model(token_ids, valid_length, segment_ids)
        loss = loss_fn(out, label)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        optimizer.step()
        scheduler.step()  # Update learning rate schedule
        train_acc += calc_accuracy(out, label)
        if batch_id % log_interval == 0:
            print("epoch {} batch id {} loss {} train acc {}".format(e+1, batch_id+1, loss.data.cpu().numpy(), train_acc / (batch_id+1)))
    print("epoch {} train acc {}".format(e+1, train_acc / (batch_id+1)))
    
    model.eval()
    with torch.no_grad():  # gradients are not needed during evaluation
        for batch_id, (token_ids, valid_length, segment_ids, label) in enumerate(tqdm_notebook(test_dataloader)):
            token_ids = token_ids.long().to(device)
            segment_ids = segment_ids.long().to(device)
            label = label.long().to(device)
            out = model(token_ids, valid_length, segment_ids)
            test_acc += calc_accuracy(out, label)
    print("epoch {} test acc {}".format(e+1, test_acc / (batch_id+1)))

print("total run time: ", time() - start_time)

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`


  0%|          | 0/1105 [00:00<?, ?it/s]

epoch 1 batch id 1 loss 2.0548183917999268 train acc 0.15625
epoch 1 batch id 201 loss 1.472127079963684 train acc 0.4193097014925373
epoch 1 batch id 401 loss 1.1792147159576416 train acc 0.48889495012468825
epoch 1 batch id 601 loss 1.0727893114089966 train acc 0.531353993344426
epoch 1 batch id 801 loss 1.019858479499817 train acc 0.5582084893882646
epoch 1 batch id 1001 loss 0.8428675532341003 train acc 0.5765016233766234
epoch 1 train acc 0.5836255656108598



  0%|          | 0/369 [00:00<?, ?it/s]

epoch 1 test acc 0.6652693089430894


  0%|          | 0/1105 [00:00<?, ?it/s]

epoch 2 batch id 1 loss 0.9817138314247131 train acc 0.59375
epoch 2 batch id 201 loss 0.9503335356712341 train acc 0.659981343283582
epoch 2 batch id 401 loss 1.0881073474884033 train acc 0.6612375311720698
epoch 2 batch id 601 loss 0.8558388352394104 train acc 0.6651674292845258
epoch 2 batch id 801 loss 0.7448445558547974 train acc 0.6720310549313359
epoch 2 batch id 1001 loss 0.7134836912155151 train acc 0.6785714285714286
epoch 2 train acc 0.6816600678733031


  0%|          | 0/369 [00:00<?, ?it/s]

epoch 2 test acc 0.6852981029810298


  0%|          | 0/1105 [00:00<?, ?it/s]

epoch 3 batch id 1 loss 0.7935680747032166 train acc 0.640625
epoch 3 batch id 201 loss 0.6328109502792358 train acc 0.7242692786069652
epoch 3 batch id 401 loss 0.8691173195838928 train acc 0.7262312967581047
epoch 3 batch id 601 loss 0.5990959405899048 train acc 0.7278234193011647
epoch 3 batch id 801 loss 0.554921567440033 train acc 0.734335986267166
epoch 3 batch id 1001 loss 0.5956881046295166 train acc 0.7395260989010989
epoch 3 train acc 0.7431561085972851


  0%|          | 0/369 [00:00<?, ?it/s]

epoch 3 test acc 0.6821646341463414


  0%|          | 0/1105 [00:00<?, ?it/s]

epoch 4 batch id 1 loss 0.6105411648750305 train acc 0.71875
epoch 4 batch id 201 loss 0.49215376377105713 train acc 0.7755752487562189
epoch 4 batch id 401 loss 0.64748615026474 train acc 0.7796524314214464
epoch 4 batch id 601 loss 0.5262615084648132 train acc 0.7806260399334443
epoch 4 batch id 801 loss 0.4521009624004364 train acc 0.7876092384519351
epoch 4 batch id 1001 loss 0.43125468492507935 train acc 0.7930975274725275
epoch 4 train acc 0.7965073529411765


  0%|          | 0/369 [00:00<?, ?it/s]

epoch 4 test acc 0.6804708672086721


  0%|          | 0/1105 [00:00<?, ?it/s]

epoch 5 batch id 1 loss 0.5150302052497864 train acc 0.8125
epoch 5 batch id 201 loss 0.3327905535697937 train acc 0.8277363184079602
epoch 5 batch id 401 loss 0.3711697459220886 train acc 0.8288263715710723
epoch 5 batch id 601 loss 0.3797905147075653 train acc 0.8300228785357737
epoch 5 batch id 801 loss 0.2495306134223938 train acc 0.8374102684144819
epoch 5 batch id 1001 loss 0.3707127571105957 train acc 0.8422514985014985
epoch 5 train acc 0.8445135746606335


  0%|          | 0/369 [00:00<?, ?it/s]

epoch 5 test acc 0.6767022357723578


  0%|          | 0/1105 [00:00<?, ?it/s]

epoch 6 batch id 1 loss 0.22809535264968872 train acc 0.90625
epoch 6 batch id 201 loss 0.3405245542526245 train acc 0.8706467661691543
epoch 6 batch id 401 loss 0.2633797824382782 train acc 0.87196072319202
epoch 6 batch id 601 loss 0.37289607524871826 train acc 0.873674084858569
epoch 6 batch id 801 loss 0.15798546373844147 train acc 0.8794085518102372
epoch 6 batch id 1001 loss 0.15849870443344116 train acc 0.882352022977023
epoch 6 train acc 0.8844032805429864


  0%|          | 0/369 [00:00<?, ?it/s]

epoch 6 test acc 0.6791581978319783


  0%|          | 0/1105 [00:00<?, ?it/s]

epoch 7 batch id 1 loss 0.1255810558795929 train acc 0.96875
epoch 7 batch id 201 loss 0.16630958020687103 train acc 0.9088930348258707
epoch 7 batch id 401 loss 0.33294418454170227 train acc 0.9087437655860349
epoch 7 batch id 601 loss 0.15972420573234558 train acc 0.9092658069883528
epoch 7 batch id 801 loss 0.14473915100097656 train acc 0.9133504993757803
epoch 7 batch id 1001 loss 0.1946878433227539 train acc 0.9163492757242757
epoch 7 train acc 0.9177743212669683


  0%|          | 0/369 [00:00<?, ?it/s]

epoch 7 test acc 0.6687838753387534


  0%|          | 0/1105 [00:00<?, ?it/s]

epoch 8 batch id 1 loss 0.0576060488820076 train acc 0.984375
epoch 8 batch id 201 loss 0.08498072624206543 train acc 0.9353233830845771
epoch 8 batch id 401 loss 0.2435993254184723 train acc 0.9338372817955112
epoch 8 batch id 601 loss 0.07727284729480743 train acc 0.9353161397670549
epoch 8 batch id 801 loss 0.1322789192199707 train acc 0.9372659176029963
epoch 8 batch id 1001 loss 0.1004389151930809 train acc 0.9383272977022977
epoch 8 train acc 0.9394513574660633


  0%|          | 0/369 [00:00<?, ?it/s]

epoch 8 test acc 0.6705199864498645


  0%|          | 0/1105 [00:00<?, ?it/s]

epoch 9 batch id 1 loss 0.029289882630109787 train acc 1.0
epoch 9 batch id 201 loss 0.10230683535337448 train acc 0.9486940298507462
epoch 9 batch id 401 loss 0.17442022264003754 train acc 0.9472802369077307
epoch 9 batch id 601 loss 0.041350096464157104 train acc 0.9491212562396006
epoch 9 batch id 801 loss 0.0769433006644249 train acc 0.9510962858926342
epoch 9 batch id 1001 loss 0.13195060193538666 train acc 0.9526411088911089
epoch 9 train acc 0.9535492081447964


  0%|          | 0/369 [00:00<?, ?it/s]

epoch 9 test acc 0.6705623306233063


  0%|          | 0/1105 [00:00<?, ?it/s]

epoch 10 batch id 1 loss 0.026002151891589165 train acc 1.0
epoch 10 batch id 201 loss 0.06982851028442383 train acc 0.9571672885572139
epoch 10 batch id 401 loss 0.14161288738250732 train acc 0.9553070448877805
epoch 10 batch id 601 loss 0.035604119300842285 train acc 0.9560108153078203
epoch 10 batch id 801 loss 0.111112579703331 train acc 0.9575530586766542
epoch 10 batch id 1001 loss 0.06541336327791214 train acc 0.9588848651348651
epoch 10 train acc 0.9594598416289593


  0%|          | 0/369 [00:00<?, ?it/s]

epoch 10 test acc 0.6717903116531165
total run time:  8100.676142215729
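
In the log above the training accuracy keeps rising to about 0.96 while the test accuracy peaks at epoch 2 and then stays around 0.67 to 0.68, which suggests the model overfits after the first few epochs. The sketch below reads the best epoch off the logged values (rounded to four decimals) and shows where a simple patience-based early-stopping rule would have stopped. The function names and the patience value are hypothetical; the notebook itself trains for all 10 epochs.

```python
# Per-epoch accuracies copied from the training log, rounded to 4 decimals.
train_acc_by_epoch = [0.5836, 0.6817, 0.7432, 0.7965, 0.8445,
                      0.8844, 0.9178, 0.9395, 0.9535, 0.9595]
test_acc_by_epoch = [0.6653, 0.6853, 0.6822, 0.6805, 0.6767,
                     0.6792, 0.6688, 0.6705, 0.6706, 0.6718]


def best_epoch(scores):
    """Return (1-based epoch, score) of the highest score."""
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index + 1, scores[best_index]


def early_stop_epoch(scores, patience):
    """Epoch at which training stops once `patience` epochs pass without improvement."""
    best = float("-inf")
    waited = 0
    for epoch, score in enumerate(scores, start=1):
        if score > best:
            best = score
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(scores)


print(best_epoch(test_acc_by_epoch))
print(early_stop_epoch(test_acc_by_epoch, patience=2))
print(round(train_acc_by_epoch[-1] - test_acc_by_epoch[-1], 4))
```

Saving a checkpoint whenever the test accuracy improves would keep the epoch-2 weights rather than the final ones.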


In [31]:
# Save the trained model
PATH = 'drive/MyDrive/Github/model/'
torch.save(model, PATH + 'KoBERT_단발_연속.pt')  # save the entire model object
torch.save(model.state_dict(), PATH + '단발_연속_model_state_dict.pt')  # save only the model's state_dict
torch.save({
    'model': model.state_dict(),
    'optimizer': optimizer.state_dict()
}, PATH + 'all_2.tar')  # save several values together; scalars such as epoch and loss can be added to checkpoint training progress