# Korean BERT pre-trained cased (KoBERT) for Huggingface Transformers
The original model download service has stopped working, so model downloads were switched to go through Hugging Face.
https://github.com/SKTBrain/KoBERT/tree/master/kobert_hf/kobert_tokenizer

Reference blogs
1. https://complexoftaste.tistory.com/2
2. https://velog.io/@j_aion/iOS-KoBERT-%EA%B0%90%EC%84%B1-%EB%B6%84%EC%84%9D-%EB%AA%A8%EB%8D%B8%EB%A7%81
3. https://velog.io/@seolini43/KOBERT%EB%A1%9C-%EB%8B%A4%EC%A4%91-%EB%B6%84%EB%A5%98-%EB%AA%A8%EB%8D%B8-%EB%A7%8C%EB%93%A4%EA%B8%B0-%ED%8C%8C%EC%9D%B4%EC%8D%ACColab
4. https://sig413.tistory.com/m/80



## Error reference links
1. https://github.com/SKTBrain/KoBERT/issues/104
2. https://blog.naver.com/newyearchive/223097878715



## Environment setup

In [1]:
!pip install mxnet
!pip install gluonnlp==0.8.0
!pip install tqdm pandas
!pip install sentencepiece
!pip install transformers
!pip install torch



In [2]:
!pip install 'git+https://github.com/SKTBrain/KoBERT.git#egg=kobert_tokenizer&subdirectory=kobert_hf'

Collecting kobert_tokenizer
  Cloning https://github.com/SKTBrain/KoBERT.git to /tmp/pip-install-x3ztqhn2/kobert-tokenizer_47a53007716e4e01a16425b5bfc16e35
  Running command git clone --filter=blob:none --quiet https://github.com/SKTBrain/KoBERT.git /tmp/pip-install-x3ztqhn2/kobert-tokenizer_47a53007716e4e01a16425b5bfc16e35
  Resolved https://github.com/SKTBrain/KoBERT.git to commit 47a69af87928fc24e20f571fe10c3cc9dd9af9a3
  Preparing metadata (setup.py) ... done


In [3]:
# Import PyTorch and ML modeling libraries in Google Colab
import torch
from torch import nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import gluonnlp as nlp
import numpy as np
from tqdm import tqdm, tqdm_notebook
import pandas as pd
from sklearn.model_selection import train_test_split



In [4]:
# ★ Import the model and tokenizer via Hugging Face
from kobert_tokenizer import KoBERTTokenizer
from transformers import BertModel
from transformers import AdamW
from transformers.optimization import get_cosine_schedule_with_warmup

In [5]:

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')


# # A GPU runtime must be connected -> otherwise this line raises an error
# device = torch.device("cuda:0")

# Torch GPU setup: fall back to CPU when no GPU is available
device_type = 'cuda' if torch.cuda.is_available() else 'cpu'
device = torch.device(device_type)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [6]:
def get_kobert_model(model_path, vocab_file, ctx="cpu"):
    bertmodel = BertModel.from_pretrained(model_path)
    device = torch.device(ctx)
    bertmodel.to(device)
    bertmodel.eval()
    vocab_b_obj = nlp.vocab.BERTVocab.from_sentencepiece(vocab_file,
                                                         padding_token='[PAD]')
    return bertmodel, vocab_b_obj

In [7]:
tokenizer = KoBERTTokenizer.from_pretrained('skt/kobert-base-v1')
bertmodel, vocab = get_kobert_model('skt/kobert-base-v1',tokenizer.vocab_file)
tok = nlp.data.BERTSPTokenizer(tokenizer, vocab, lower = False)

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'XLNetTokenizer'. 
The class this function is called from is 'KoBERTTokenizer'.


In [8]:
OneOff_data = pd.read_excel("/content/drive/MyDrive/Prog_All/Data/ko_단발성_대화_데이터셋.xlsx", engine='openpyxl')
continuity_data = pd.read_excel("/content/drive/MyDrive/Prog_All/Data/ko_연속적_대화_데이터셋.xlsx", engine='openpyxl')

In [9]:
OneOff_data.head(10)

Unnamed: 0,Sentence,Emotion,Unnamed: 2,Unnamed: 3,Unnamed: 4,공포,5468
0,언니 동생으로 부르는게 맞는 일인가요..??,공포,,,,놀람,5898.0
1,그냥 내 느낌일뿐겠지?,공포,,,,분노,5665.0
2,아직너무초기라서 그런거죠?,공포,,,,슬픔,5267.0
3,유치원버스 사고 낫다던데,공포,,,,중립,4830.0
4,근데 원래이런거맞나요,공포,,,,행복,6037.0
5,남자친구가 떠날까봐요,공포,,,,혐오,5429.0
6,이거 했는데 허리가 아플수도 있나요? ;;,공포,,,,Total,38594.0
7,내가불안해서꾸는걸까..,공포,,,,,
8,일주일도 안 남았당...ㅠㅠ,공포,,,,,
9,약은 최대한 안먹으려고 하는데좋은 음시있나요?0,공포,,,,,


In [10]:
type(OneOff_data)

pandas.core.frame.DataFrame

In [11]:
continuity_data.head(10)

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,행복,중립,슬픔,공포,혐오,분노,놀람
0,dialog #,발화,감정,,,1030.0,,,,,,
1,S,아 진짜! 사무실에서 피지 말라니깐! 간접흡연이 얼마나 안좋은데!,분노,,,,,,,,,
2,,그럼 직접흡연하는 난 얼마나 안좋겠니? 안그래? 보면 꼭... 지 생각만 하고.,혐오,,,,,,,,,
3,,손님 왔어요.,중립,,,,,,,,,
4,,손님? 누구?,중립,,,,,,,,,
5,,몰라요. 팀장님 친구래요.,중립,,,,,,,,,
6,,내 친구? 친구 누구?,중립,,,,,,,,,
7,,그걸 내가 어떻게 알아요!,분노,,,,,,,,,
8,S,그래서... 무슨 일 해?,중립,,,,,,,,,
9,,그냥 방송일 조금.,중립,,,,,,,,,


In [12]:
OneOff_data['Emotion'].unique()

array(['공포', '놀람', '분노', '슬픔', '중립', '행복', '혐오'], dtype=object)

In [13]:
OneOff_data = OneOff_data[['Sentence','Emotion']]
continuity_data = continuity_data[['Unnamed: 1','Unnamed: 2']]
continuity_data.drop([0],axis=0,inplace=True)
continuity_data.rename(columns={'Unnamed: 1':'Sentence','Unnamed: 2':'Emotion'},inplace=True)
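# Normalize mistyped emotion labels
continuity_data.replace('ㅍ','공포',inplace=True)
continuity_data.replace(['분','분ㄴ'],'분노',inplace=True)
# These are typos of 중립 (neutral), so map them to 중립 rather than 분노 (anger)
continuity_data.replace(['ㅈ중립','중림','ㄴ중립','줄'],'중립',inplace=True)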

In [14]:
continuity_data['Emotion'].unique()

array(['분노', '혐오', '중립', '놀람', '행복', '공포', '슬픔', nan], dtype=object)

In [15]:
# Drop rows containing NaN
continuity_data = continuity_data.dropna(how='any')

In [16]:
ooidx = OneOff_data[OneOff_data['Emotion']=='중립'].index
OneOff_data = OneOff_data.drop(ooidx)

In [17]:
cidx = continuity_data[continuity_data['Emotion']=='중립'].index
continuity_data = continuity_data.drop(cidx)

In [18]:
OneOff_data['Emotion'].unique()

array(['공포', '놀람', '분노', '슬픔', '행복', '혐오'], dtype=object)

In [19]:
continuity_data['Emotion'].unique()

array(['분노', '혐오', '놀람', '행복', '공포', '슬픔'], dtype=object)

In [20]:
print(type(OneOff_data),type(continuity_data))

<class 'pandas.core.frame.DataFrame'> <class 'pandas.core.frame.DataFrame'>


In [21]:
data = pd.concat([OneOff_data,continuity_data])
data.sample(n=10)

Unnamed: 0,Sentence,Emotion
36343,당장 쳐 끌고 와야지 뭐하는데...,혐오
14159,난 오늘 이날씨에 노가다했다 더 할말있냐?,분노
16862,좌파언론의 선동질에 나라 무너진다,분노
29440,무..무슨 소릴 하는 거야?,놀람
27522,멋지고 자랑스런 우리 국민!!,행복
30128,죄송해요. 그냥 좀 쉬고 싶어서요..,슬픔
14259,살인마 전대가리가 아직도 잘 살고 있는 헬조선,분노
19655,진짜 너무 한심하네요. . 고인의 명복을빕니다~~,슬픔
35702,댓글 달때마다 우선 정은이 찬양글부터 올리는거 잊지말고~~,혐오
25681,네? 제가요?,놀람


In [22]:
data.drop_duplicates(['Sentence','Emotion'],inplace=True) # remove duplicate rows
len(data)

44475

In [23]:
# data.loc[(data['Emotion'] == "공포"), 'label'] = 0  # 공포 → 0 fear
# data.loc[(data['Emotion'] == "놀람"), 'label'] = 1  # 놀람 → 1 surprise
# data.loc[(data['Emotion'] == "분노"), 'label'] = 2  # 분노 → 2 anger
# data.loc[(data['Emotion'] == "슬픔"), 'label'] = 3  # 슬픔 → 3 sadness
# data.loc[(data['Emotion'] == "중립"), 'label'] = 4  # 중립 → 4 neutral
# data.loc[(data['Emotion'] == "행복"), 'label'] = 5  # 행복 → 5 happiness
# data.loc[(data['Emotion'] == "혐오"), 'label'] = 6  # 혐오 → 6 disgust

In [24]:
num_labeling_dics ={
    '공포': 0,
    '놀람': 1,
    '분노': 2,
    '슬픔': 3,
    '행복': 4,
    '혐오': 5
  }


In [25]:
# Map emotion names to numeric labels
for label_class in num_labeling_dics:
    data.loc[(data['Emotion'] == label_class), 'Emotion'] = num_labeling_dics[label_class]

In [26]:
data_list = []
for q, label in zip(data['Sentence'], data['Emotion'])  :
    check_data = []
    check_data.append(q)
    check_data.append(str(label))

    data_list.append(check_data)

In [27]:
print(data_list[0])
print(data_list[6000])
print(data_list[12000])
print(data_list[18000])
print(data_list[24000])
print(data_list[30000])
print(data_list[-1])

['언니 동생으로 부르는게 맞는 일인가요..??', '0']
['36도라고...미쳤다', '1']
['빵셔틀!박근혜!', '2']
['잠은오는데 20분정도 자다가깨고방금도 자다가 깼어요...', '3']
['청하 솔로로 ost도 낸다던데 다재다능하다 진짜', '4']
['당신을 계속 봐야하는 팬들이 젤 힘들어...', '5']
['자네는 대체 뭘 하러 왔나! 젖은 생쥐 꼴이 된 나를 보면서 비웃으러 왔나?', '2']


In [28]:
data['Emotion'].value_counts()

1    9866
2    9238
3    7167
4    7015
5    5621
0    5568
Name: Emotion, dtype: int64
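
The class counts above are uneven, so it can help to know what accuracy a trivial predictor that always outputs the most frequent class would reach. A minimal sketch using the counts printed above; the variable names are illustrative and not part of the notebook:

```python
# Class counts copied from the value_counts() output above (label id -> rows).
class_counts = {1: 9866, 2: 9238, 3: 7167, 4: 7015, 5: 5621, 0: 5568}

total = sum(class_counts.values())

# Share of each class in the combined dataset.
class_share = {label: n / total for label, n in class_counts.items()}

# Accuracy of always predicting the most frequent class.
majority_label = max(class_counts, key=class_counts.get)
majority_baseline = class_counts[majority_label] / total

print(total, majority_label, round(majority_baseline, 3))
```

Any accuracy reported later is easier to interpret against this baseline than against a uniform 1/6 guess.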

## wandb

In [29]:
!pip install wandb



In [30]:
!wandb login

[34m[1mwandb[0m: Currently logged in as: [33mtracy110410[0m ([33mteam_5g[0m). Use [1m`wandb login --relogin`[0m to force relogin


In [31]:
import wandb
wandb.init(project="CJ_KoBERT", entity='tracy110410')

[34m[1mwandb[0m: Currently logged in as: [33mtracy110410[0m. Use [1m`wandb login --relogin`[0m to force relogin


## Loading the model
Reference notebook: https://github.com/ChangZero/koBERT-finetuning-demo/blob/main/kobert_colab.ipynb

In [32]:

class BERTSentenceTransform:
    r"""BERT style data transformation.

    Parameters
    ----------
    tokenizer : BERTTokenizer.
        Tokenizer for the sentences.
    max_seq_length : int.
        Maximum sequence length of the sentences.
    pad : bool, default True
        Whether to pad the sentences to maximum length.
    pair : bool, default True
        Whether to transform sentences or sentence pairs.
    """

    def __init__(self, tokenizer, max_seq_length,vocab, pad=True, pair=True):
        self._tokenizer = tokenizer
        self._max_seq_length = max_seq_length
        self._pad = pad
        self._pair = pair
        self._vocab = vocab

    def __call__(self, line):
        """Perform transformation for sequence pairs or single sequences.

        The transformation is processed in the following steps:
        - tokenize the input sequences
        - insert [CLS], [SEP] as necessary
        - generate type ids to indicate whether a token belongs to the first
        sequence or the second sequence.
        - generate valid length

        For sequence pairs, the input is a tuple of 2 strings:
        text_a, text_b.

        Inputs:
            text_a: 'is this jacksonville ?'
            text_b: 'no it is not'
        Tokenization:
            text_a: 'is this jack ##son ##ville ?'
            text_b: 'no it is not .'
        Processed:
            tokens: '[CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]'
            type_ids: 0     0  0    0    0     0       0 0     1  1  1  1   1 1
            valid_length: 14

        For single sequences, the input is a tuple of single string:
        text_a.

        Inputs:
            text_a: 'the dog is hairy .'
        Tokenization:
            text_a: 'the dog is hairy .'
        Processed:
            text_a: '[CLS] the dog is hairy . [SEP]'
            type_ids: 0     0   0   0  0     0 0
            valid_length: 7

        Parameters
        ----------
        line: tuple of str
            Input strings. For sequence pairs, the input is a tuple of 2 strings:
            (text_a, text_b). For single sequences, the input is a tuple of single
            string: (text_a,).

        Returns
        -------
        np.array: input token ids in 'int32', shape (batch_size, seq_length)
        np.array: valid length in 'int32', shape (batch_size,)
        np.array: input token type ids in 'int32', shape (batch_size, seq_length)

        """

        # convert to unicode
        text_a = line[0]
        if self._pair:
            assert len(line) == 2
            text_b = line[1]

        tokens_a = self._tokenizer.tokenize(text_a)
        tokens_b = None

        if self._pair:
            tokens_b = self._tokenizer.tokenize(text_b)

        if tokens_b:
            # Modifies `tokens_a` and `tokens_b` in place so that the total
            # length is less than the specified length.
            # Account for [CLS], [SEP], [SEP] with "- 3"
            self._truncate_seq_pair(tokens_a, tokens_b,
                                    self._max_seq_length - 3)
        else:
            # Account for [CLS] and [SEP] with "- 2"
            if len(tokens_a) > self._max_seq_length - 2:
                tokens_a = tokens_a[0:(self._max_seq_length - 2)]

        # The embedding vectors for `type=0` and `type=1` were learned during
        # pre-training and are added to the wordpiece embedding vector
        # (and position vector). This is not *strictly* necessary since
        # the [SEP] token unambiguously separates the sequences, but it makes
        # it easier for the model to learn the concept of sequences.

        # For classification tasks, the first vector (corresponding to [CLS]) is
        # used as the "sentence vector". Note that this only makes sense because
        # the entire model is fine-tuned.
        #vocab = self._tokenizer.vocab
        vocab = self._vocab
        tokens = []
        tokens.append(vocab.cls_token)
        tokens.extend(tokens_a)
        tokens.append(vocab.sep_token)
        segment_ids = [0] * len(tokens)

        if tokens_b:
            tokens.extend(tokens_b)
            tokens.append(vocab.sep_token)
            segment_ids.extend([1] * (len(tokens) - len(segment_ids)))

        input_ids = self._tokenizer.convert_tokens_to_ids(tokens)

        # The valid length of sentences. Only real  tokens are attended to.
        valid_length = len(input_ids)

        if self._pad:
            # Zero-pad up to the sequence length.
            padding_length = self._max_seq_length - valid_length
            # use padding tokens for the rest
            input_ids.extend([vocab[vocab.padding_token]] * padding_length)
            segment_ids.extend([0] * padding_length)

        return np.array(input_ids, dtype='int32'), np.array(valid_length, dtype='int32'),\
            np.array(segment_ids, dtype='int32')
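
`__call__` above calls `self._truncate_seq_pair` when `pair=True`, but no such method is defined in this cell. The notebook only uses `pair=False`, so that branch is never reached here. A minimal sketch of such a helper, assuming the common approach of trimming the longer sequence one token at a time; it is written as a standalone function, and `truncate_seq_pair` is a hypothetical name that would have to be attached to the class as a method before the pair path could run:

```python
def truncate_seq_pair(tokens_a, tokens_b, max_length):
    """Trim two token lists in place until their combined length fits."""
    while len(tokens_a) + len(tokens_b) > max_length:
        # Drop from the longer list so both sequences keep as much as possible.
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()


a = ["t1", "t2", "t3", "t4", "t5"]
b = ["u1", "u2"]
truncate_seq_pair(a, b, 5)
print(a, b)
```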





## Model Class & Funcs

In [33]:
class BERTDataset(Dataset):
    def __init__(self, dataset, sent_idx, label_idx, bert_tokenizer, vocab, max_len,
                 pad, pair):
        transform = BERTSentenceTransform(bert_tokenizer, max_seq_length=max_len,vocab=vocab, pad=pad, pair=pair)
        #transform = nlp.data.BERTSentenceTransform(
        #    tokenizer, max_seq_length=max_len, pad=pad, pair=pair)
        self.sentences = [transform([i[sent_idx]]) for i in dataset]
        self.labels = [np.int32(i[label_idx]) for i in dataset]

    def __getitem__(self, i):
        return (self.sentences[i] + (self.labels[i], ))

    def __len__(self):
        return (len(self.labels))
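
Each item returned by `BERTDataset.__getitem__` is a tuple `(input_ids, valid_length, segment_ids, label)`. For the single-sentence case used in this notebook (`pair=False`), the transform reduces to truncating, adding `[CLS]`/`[SEP]`, and padding. A dependency-free sketch of that path; `encode_single` and the toy vocabulary ids are made up for illustration and do not match the real KoBERT vocabulary:

```python
def encode_single(tokens, token_to_id, max_len,
                  cls="[CLS]", sep="[SEP]", pad="[PAD]"):
    """Mirror the single-sentence path: truncate, add specials, pad."""
    tokens = tokens[: max_len - 2]          # leave room for [CLS] and [SEP]
    sequence = [cls] + tokens + [sep]
    input_ids = [token_to_id[t] for t in sequence]
    valid_length = len(input_ids)           # number of real (non-pad) tokens
    input_ids += [token_to_id[pad]] * (max_len - valid_length)
    segment_ids = [0] * max_len             # single sentence: all segment 0
    return input_ids, valid_length, segment_ids


# Hypothetical vocabulary; real ids come from the SentencePiece vocab file.
toy_vocab = {"[PAD]": 1, "[CLS]": 2, "[SEP]": 3,
             "▁영화": 10, "▁무섭": 11, "네요": 12}

ids, valid_len, seg = encode_single(["▁영화", "▁무섭", "네요"], toy_vocab, max_len=8)
item = (ids, valid_len, seg, 0)  # same layout as one dataset item, label 0
print(item)
```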

In [34]:
# Split into train & test data
from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(data_list, test_size=0.2, random_state=0)

In [35]:
# Setting parameters
max_len = 64
batch_size = 64
warmup_ratio = 0.1
num_epochs = 5
max_grad_norm = 1
log_interval = 200
learning_rate =  5e-5
random_seed = 123

In [36]:
# config: update the run's config in place instead of replacing wandb.config
wandb.config.update({
  "learning_rate": learning_rate,
  "epochs": num_epochs,
  "batch_size": batch_size,
  "seed": random_seed
})


https://teddylee777.github.io/machine-learning/wandb/


In [37]:
# Track loss
wandb.define_metric('train_loss', summary='min')
wandb.define_metric('val_loss', summary='min')
# Track F1 score
wandb.define_metric('train_f1', summary='max')
wandb.define_metric('val_f1', summary='max')

<wandb.sdk.wandb_metric.Metric at 0x7cf1a41d52d0>
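
The cell above registers `train_f1` and `val_f1` summaries, but the training loop in this notebook only logs accuracy and loss. If F1 is wanted, a macro-averaged score could be computed per epoch and passed to `wandb.log`. A minimal, dependency-free sketch; `macro_f1` is a hypothetical helper, not part of wandb, and scoring a class with no true or predicted examples as 0 is one convention among several:

```python
def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1 scores."""
    f1_scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); define it as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / num_classes


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = macro_f1(y_true, y_pred, num_classes=3)
print(round(score, 4))
```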

In [38]:
data_train = BERTDataset(train_data, 0, 1, tokenizer, vocab, max_len, True, False)
data_test = BERTDataset(test_data, 0, 1, tokenizer, vocab, max_len, True, False)

In [39]:
train_dataloader = torch.utils.data.DataLoader(data_train, batch_size=batch_size, num_workers=5,shuffle=True)
# Keep the test set in a fixed order so collected predictions can be matched to labels
test_dataloader = torch.utils.data.DataLoader(data_test, batch_size=batch_size, num_workers=5,shuffle=False)

In [40]:
# KoBERT classification model

class BERTClassifier(nn.Module):
    def __init__(self,
                 bert,
                 hidden_size = 768,
                 num_classes=6,   # number of classes set to 6
                 dr_rate=None,
                 params=None):
        super(BERTClassifier, self).__init__()
        self.bert = bert
        self.dr_rate = dr_rate

        self.classifier = nn.Linear(hidden_size , num_classes)
        if dr_rate:
            self.dropout = nn.Dropout(p=dr_rate)

    def gen_attention_mask(self, token_ids, valid_length):
        attention_mask = torch.zeros_like(token_ids)
        for i, v in enumerate(valid_length):
            attention_mask[i][:v] = 1
        return attention_mask.float()

    def forward(self, token_ids, valid_length, segment_ids):
        attention_mask = self.gen_attention_mask(token_ids, valid_length)

        _, pooler = self.bert(input_ids = token_ids, token_type_ids = segment_ids.long(), attention_mask = attention_mask.float().to(token_ids.device),return_dict=False)
        # Apply dropout only when dr_rate is set; otherwise use the pooled output
        # directly (previously `out` was undefined when dr_rate was None).
        out = self.dropout(pooler) if self.dr_rate else pooler
        return self.classifier(out)
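
`gen_attention_mask` sets the first `valid_length` positions of each row to 1 and leaves the padding positions at 0. The same mask can be built without a Python loop by comparing position indices against the lengths. A sketch in NumPy to keep it framework-independent; `attention_mask_from_lengths` is a hypothetical helper, and the same broadcasting comparison should carry over to PyTorch tensors, though that is not verified here:

```python
import numpy as np


def attention_mask_from_lengths(valid_length, max_len):
    """Return a (batch, max_len) float mask: 1 for real tokens, 0 for padding."""
    positions = np.arange(max_len)[None, :]       # shape (1, max_len)
    lengths = np.asarray(valid_length)[:, None]   # shape (batch, 1)
    return (positions < lengths).astype(np.float32)


mask = attention_mask_from_lengths([3, 5], max_len=6)
print(mask)
```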

In [41]:
# Build the classifier on top of the pretrained BERT model
model = BERTClassifier(bertmodel,  dr_rate=0.5).to(device)

In [42]:
# Prepare optimizer and schedule (cosine decay with linear warmup)
no_decay = ['bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
    {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]

optimizer = AdamW(optimizer_grouped_parameters, lr=learning_rate)
loss_fn = nn.CrossEntropyLoss() # standard loss function for multi-class classification

t_total = len(train_dataloader) * num_epochs
warmup_step = int(t_total * warmup_ratio)

scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=warmup_step, num_training_steps=t_total)
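
To see what learning rates this schedule produces, the multiplier can be reproduced in plain Python. The sketch below assumes the usual definition (linear ramp from 0 to 1 over the warmup steps, then half a cosine period down to 0) and takes the 556 batches per epoch from the progress bars in the training output; `cosine_warmup_multiplier` is a hypothetical name, and the library's implementation may differ in detail:

```python
import math


def cosine_warmup_multiplier(step, num_warmup_steps, num_training_steps, num_cycles=0.5):
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear warmup from 0 to 1.
        return step / max(1, num_warmup_steps)
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    # Half a cosine period: 1 at the end of warmup, 0 at the final step.
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * num_cycles * 2.0 * progress)))


t_total = 556 * 5                      # batches per epoch * num_epochs
warmup_step = int(t_total * 0.1)       # warmup_ratio = 0.1
for step in (0, warmup_step, (warmup_step + t_total) // 2, t_total):
    print(step, 5e-5 * cosine_warmup_multiplier(step, warmup_step, t_total))
```

With these settings the learning rate peaks at `learning_rate` once warmup ends and falls to about half of it midway through the remaining steps.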





In [43]:
# Define a function for measuring accuracy
def calc_accuracy(X,Y):
    max_vals, max_indices = torch.max(X, 1)
    train_acc = (max_indices == Y).sum().data.cpu().numpy()/max_indices.size()[0]
    return train_acc

train_dataloader

<torch.utils.data.dataloader.DataLoader at 0x7cf1a4b579a0>

In [44]:
if torch.cuda.is_available():
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
else:
    device = torch.device("cpu")
    print('No GPU available, using the CPU instead.')

There are 1 GPU(s) available.
We will use the GPU: Tesla V100-SXM2-16GB


In [45]:
# Train the KoBERT model
wandb.watch(model)

# train_history=[]
# test_history=[]
# loss_history=[]
y_preds =[]
results ={
      "train_loss": [],
      "train_acc": [],
      "val_loss": [],
      "val_acc": []
  }


for e in range(num_epochs):
    train_acc = 0.0
    test_acc = 0.0

    model.train()
    for batch_id, (token_ids, valid_length, segment_ids, label) in enumerate(tqdm_notebook(train_dataloader)):
        optimizer.zero_grad()
        token_ids = token_ids.long().to(device)
        segment_ids = segment_ids.long().to(device)
        valid_length= valid_length
        label = label.long().to(device)


        y_pred = model(token_ids, valid_length, segment_ids)

        #print(label.shape,out.shape)
        train_loss = loss_fn(y_pred, label)
        train_loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        optimizer.step()
        scheduler.step()  # Update learning rate schedule
        train_acc += calc_accuracy(y_pred, label)


        if batch_id % log_interval == 0:
            print("epoch {} batch id {} loss {} train acc {}".format(e+1, batch_id+1, train_loss.data.cpu().numpy(), train_acc / (batch_id+1)))
            # train_history.append(train_acc / (batch_id+1))
            # loss_history.append(train_loss.data.cpu().numpy())
            results["train_loss"].append(train_loss.data.cpu().numpy())
            results["train_acc"].append(train_acc / (batch_id+1))

        # wandb log
        wandb.log({"train_acc": train_acc / (batch_id+1)}, commit=False)
        wandb.log({"train_loss": train_loss.data.cpu().numpy()}, commit=False)


    print("epoch {} train acc {}".format(e+1, train_acc / (batch_id+1)))
    #train_history.append(train_acc / (batch_id+1))

    model.eval()
    # Evaluation only: disable gradient tracking and do not backpropagate on test data.
    with torch.no_grad():
        for batch_id, (token_ids, valid_length, segment_ids, label) in enumerate(tqdm_notebook(test_dataloader)):
            token_ids = token_ids.long().to(device)
            segment_ids = segment_ids.long().to(device)
            label = label.long().to(device)
            test_pred = model(token_ids, valid_length, segment_ids)
            y_preds.append(test_pred.cpu()) # for confusion matrix

            test_loss = loss_fn(test_pred, label)
            test_acc += calc_accuracy(test_pred, label)

            # wandb log
            wandb.log({"test_acc": test_acc / (batch_id+1)}, commit=False)
            wandb.log({"test_loss": test_loss.data.cpu().numpy()})

            results["val_loss"].append(test_loss.data.cpu().numpy())
            results["val_acc"].append(test_acc/(batch_id+1))




    print("epoch {} test acc {}".format(e+1, test_acc / (batch_id+1)))


    #test_history.append(test_acc / (batch_id+1))

y_pred_tensor = torch.cat(y_preds)
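
`y_preds` is collected "for confusion matrix", but note that it is appended to in every epoch and the true labels are not stored alongside it. A confusion matrix needs predictions and labels from a single pass, in the same order. A dependency-free sketch of the counting step on toy data; `confusion_matrix` and `argmax` are hypothetical helpers written for this example, and the logits are made up:

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


def confusion_matrix(y_true, y_pred, num_classes):
    """matrix[t][p] counts samples with true label t predicted as p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


# Toy logits for four samples and three classes (illustrative values only).
logits = [[2.0, 0.1, -1.0], [0.3, 1.5, 0.2], [0.9, 0.8, 0.1], [-0.5, 0.0, 2.2]]
true_labels = [0, 1, 1, 2]
pred_labels = [argmax(row) for row in logits]
cm = confusion_matrix(true_labels, pred_labels, num_classes=3)
for row in cm:
    print(row)
```

Rows are true classes and columns are predicted classes, so off-diagonal entries show which emotions get confused with each other.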



Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  for batch_id, (token_ids, valid_length, segment_ids, label) in enumerate(tqdm_notebook(train_dataloader)):


  0%|          | 0/556 [00:00<?, ?it/s]

epoch 1 batch id 1 loss 1.8416944742202759 train acc 0.109375
epoch 1 batch id 201 loss 1.3807346820831299 train acc 0.31848569651741293
epoch 1 batch id 401 loss 1.182374358177185 train acc 0.4178615960099751
epoch 1 train acc 0.45757643884892085


Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  for batch_id, (token_ids, valid_length, segment_ids, label) in enumerate(tqdm_notebook(test_dataloader)):


  0%|          | 0/139 [00:00<?, ?it/s]

epoch 1 test acc 0.5883079250885006


  0%|          | 0/556 [00:00<?, ?it/s]

epoch 2 batch id 1 loss 0.9759495258331299 train acc 0.6875
epoch 2 batch id 201 loss 1.0230379104614258 train acc 0.6204912935323383
epoch 2 batch id 401 loss 0.8198161721229553 train acc 0.6244934538653366
epoch 2 train acc 0.6239939298561151


  0%|          | 0/139 [00:00<?, ?it/s]

epoch 2 test acc 0.6141693787826881


  0%|          | 0/556 [00:00<?, ?it/s]

epoch 3 batch id 1 loss 0.7173115611076355 train acc 0.734375
epoch 3 batch id 201 loss 0.813073992729187 train acc 0.7112873134328358
epoch 3 batch id 401 loss 0.7934165596961975 train acc 0.7122428304239401
epoch 3 train acc 0.7150048711031175


  0%|          | 0/139 [00:00<?, ?it/s]

epoch 3 test acc 0.6280975219824141


  0%|          | 0/556 [00:00<?, ?it/s]

epoch 4 batch id 1 loss 0.5100511312484741 train acc 0.875
epoch 4 batch id 201 loss 0.8458660840988159 train acc 0.7860696517412935
epoch 4 batch id 401 loss 0.5504788160324097 train acc 0.790017144638404
epoch 4 train acc 0.7939879346522781


  0%|          | 0/139 [00:00<?, ?it/s]

epoch 4 test acc 0.6280957376955579


  0%|          | 0/556 [00:00<?, ?it/s]

epoch 5 batch id 1 loss 0.3688485622406006 train acc 0.875
epoch 5 batch id 201 loss 0.6185582876205444 train acc 0.841806592039801
epoch 5 batch id 401 loss 0.3019507825374603 train acc 0.8446072319201995
epoch 5 train acc 0.8464703237410072


  0%|          | 0/139 [00:00<?, ?it/s]

epoch 5 test acc 0.6273160043393857


## predict
This function takes a sentence, finds the label with the highest score (argmax) under the trained model, and prints the inferred emotion.

In [46]:
def predict(predict_sentence):

    data = [predict_sentence, '0']
    dataset_another = [data]

    another_test = BERTDataset(dataset_another, 0, 1, tokenizer, vocab, max_len, True, False)
    test_dataloader = torch.utils.data.DataLoader(another_test, batch_size=batch_size, num_workers=5)

    model.eval()

    for batch_id, (token_ids, valid_length, segment_ids, label) in enumerate(test_dataloader):
        token_ids = token_ids.long().to(device)
        segment_ids = segment_ids.long().to(device)

        valid_length= valid_length
        label = label.long().to(device)

        out = model(token_ids, valid_length, segment_ids)
        print(out)


        test_eval=[]
        for i in out:
            logits=i
            logits = logits.detach().cpu().numpy()
            print(logits)
            print(np.argmax(logits))

            # Class indices follow num_labeling_dics: six classes, with no 중립 (neutral).
            if np.argmax(logits) == 0:
                test_eval.append("공포가")  # fear
            elif np.argmax(logits) == 1:
                test_eval.append("놀람이")  # surprise
            elif np.argmax(logits) == 2:
                test_eval.append("분노가")  # anger
            elif np.argmax(logits) == 3:
                test_eval.append("슬픔이")  # sadness
            elif np.argmax(logits) == 4:
                test_eval.append("행복이")  # happiness
            elif np.argmax(logits) == 5:
                test_eval.append("혐오가")  # disgust

        print(">> 입력하신 내용에서 " + test_eval[0] + " 느껴집니다.")
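
The index-to-emotion mapping in `predict` has to stay in sync with `num_labeling_dics`, which is easy to get wrong with a hand-written `if`/`elif` chain. One option is to derive the reverse lookup from the same dictionary. A minimal sketch; `id_to_emotion` and `emotion_from_logits` are hypothetical names, and the logits are the ones printed for the sample sentence in this notebook:

```python
# Same mapping as defined earlier in the notebook.
num_labeling_dics = {'공포': 0, '놀람': 1, '분노': 2, '슬픔': 3, '행복': 4, '혐오': 5}

# Reverse lookup: class index -> emotion name.
id_to_emotion = {idx: name for name, idx in num_labeling_dics.items()}


def emotion_from_logits(logits):
    """Return the emotion name for the highest-scoring class."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return id_to_emotion[best]


sample_logits = [4.240846, 1.2631497, -1.739334, -0.82390773, -1.8942081, -1.3148326]
print(emotion_from_logits(sample_logits))
```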

In [47]:
# #version2
# # Tokenization
# tokenizer = get_tokenizer()
# tok = nlp.data.BERTSPTokenizer(tokenizer, vocab, lower=False)

# def predict(predict_sentence):

#     data = [predict_sentence, '0']
#     dataset_another = [data]

#     another_test = BERTDataset(dataset_another, 0, 1, tok, max_len, True, False)
#     test_dataloader = torch.utils.data.DataLoader(another_test, batch_size=batch_size, num_workers=5)

#     model.eval()

#     for batch_id, (token_ids, valid_length, segment_ids, label) in enumerate(test_dataloader):
#         token_ids = token_ids.long().to(device)
#         segment_ids = segment_ids.long().to(device)

#         valid_length= valid_length
#         label = label.long().to(device)

#         out = model(token_ids, valid_length, segment_ids)


# #         test_eval=[]
#         for i in out:
#             logits=i
#             logits = logits.detach().cpu().numpy()
#             emotion = emotion_dict[np.argmax(logits)]


#         print(f">> 입력하신 내용의 감정은 {emotion}입니다.")


In [48]:
predict_sentence = '영화에 나오는 귀신이 너무 무섭네요'
predict(predict_sentence)

tensor([[ 4.2408,  1.2631, -1.7393, -0.8239, -1.8942, -1.3148]],
       device='cuda:0', grad_fn=<AddmmBackward0>)
[ 4.240846    1.2631497  -1.739334   -0.82390773 -1.8942081  -1.3148326 ]
0
>> 입력하신 내용에서 공포가 느껴집니다.
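
The values printed above are raw logits, not probabilities. To report a confidence alongside the predicted emotion, the logits can be passed through a softmax. A dependency-free sketch using the logits printed above; `softmax` here is a small helper written for this example rather than a library call:

```python
import math


def softmax(logits):
    """Convert logits to probabilities, shifting by the max for numerical stability."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.240846, 1.2631497, -1.739334, -0.82390773, -1.8942081, -1.3148326]
probs = softmax(logits)
print([round(p, 3) for p in probs])
```

For this sentence the first class (공포, fear) takes roughly 0.94 of the probability mass.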


In [49]:
# Keep asking for input in a loop; entering 0 exits
while True:
    sentence = input("하고싶은 말을 입력해주세요 : ")
    if sentence == "0" :
        print("감정 분석을 종료합니다.")
        break
    predict(sentence)
    print("\n")

하고싶은 말을 입력해주세요 : 이가 아파요
tensor([[ 0.0942, -0.3275, -1.4182,  4.6459, -0.9046, -1.7782]],
       device='cuda:0', grad_fn=<AddmmBackward0>)
[ 0.09424177 -0.3274801  -1.4182079   4.6458745  -0.90463585 -1.7781863 ]
3
>> 입력하신 내용에서 슬픔이 느껴집니다.


하고싶은 말을 입력해주세요 : 내 친구? 친구 누구?
tensor([[-1.3561,  4.7290, -0.3394, -0.8494, -0.1847, -2.0528]],
       device='cuda:0', grad_fn=<AddmmBackward0>)
[-1.3561198   4.7290363  -0.33938766 -0.8493966  -0.18468508 -2.052827  ]
1
>> 입력하신 내용에서 놀람이 느껴집니다.


하고싶은 말을 입력해주세요 : 몰라요. 팀장님 친구래요.
tensor([[-1.0671, -0.2598,  0.1842,  4.0124, -0.8151, -1.8532]],
       device='cuda:0', grad_fn=<AddmmBackward0>)
[-1.0671387  -0.25978243  0.18422553  4.0123873  -0.81507313 -1.8531748 ]
3
>> 입력하신 내용에서 슬픔이 느껴집니다.


하고싶은 말을 입력해주세요 : 뭐가 이상해? 우정만 돋는구만. 뭐.
tensor([[-1.4978,  4.0791, -1.0941, -0.9226,  0.2268, -0.8000]],
       device='cuda:0', grad_fn=<AddmmBackward0>)
[-1.4978005   4.0790687  -1.0941209  -0.9226144   0.22683087 -0.8000285 ]
1
>> 입력하신 내용에서 놀람이 느껴집니다.


하고싶은 말을 