<h2>BERT (Bidirectional Encoder Representations from Transformers)</h2>

<h2>Input representation</h2>
<b>The input is the sum of three embeddings: Token, Segment, and Position.</b><br>

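<b>Token Embedding - Uses WordPiece embeddings: each word is split greedily, so that the longest sub-word found in the vocabulary becomes one unit.</b><br>
<b>This handles the out-of-vocabulary (OOV) problem of conventional word embeddings effectively and also tends to improve accuracy.</b><br>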
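
<b>A minimal sketch of the greedy longest-match-first splitting described above. The function name wordpiece_tokenize and the vocabulary toy_vocab are hypothetical and chosen only for illustration; a real BERT vocabulary is far larger, and the actual tokenizer also handles normalization and punctuation.</b><br>

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into sub-words."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest candidate first and shrink it until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary (assumption, not the real BERT vocabulary)
toy_vocab = {"un", "##afford", "##able", "##aff", "play", "##ing", "##ed"}

print(wordpiece_tokenize("unaffordable", toy_vocab))  # ['un', '##afford', '##able']
print(wordpiece_tokenize("playing", toy_vocab))       # ['play', '##ing']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```

<b>Note that "##afford" wins over the shorter "##aff" because longer candidates are tried first.</b><br>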

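<b>Sentence (Segment) Embeddings - Because the input length is limited, the two sentences combined are capped at 512 sub-words.</b><br>
<b>To speed up pre-training, most training steps use sequences limited to 128 tokens, and the longer inputs (up to 512) are trained on in a separate final phase.</b><br>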
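
<b>A rough sketch of how a sentence pair could be packed into one input with segment ids. The [CLS] A [SEP] B [SEP] layout is the standard BERT input format; the function name pack_pair and the "trim the longer sentence first" truncation rule are assumptions made for illustration.</b><br>

```python
def pack_pair(tokens_a, tokens_b, max_len=512):
    """Build [CLS] A [SEP] B [SEP] plus segment ids, truncated to max_len (>= 3)."""
    tokens_a, tokens_b = list(tokens_a), list(tokens_b)
    budget = max_len - 3  # room left after [CLS] and two [SEP] tokens
    # Trim the longer sentence one token at a time until the pair fits.
    while len(tokens_a) + len(tokens_b) > budget:
        longer = tokens_a if len(tokens_a) >= len(tokens_b) else tokens_b
        longer.pop()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers sentence B and its [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


tokens, segment_ids = pack_pair(["a", "b"], ["c"])
print(tokens)       # ['[CLS]', 'a', 'b', '[SEP]', 'c', '[SEP]']
print(segment_ids)  # [0, 0, 0, 0, 1, 1]
```

<b>For single-sentence tasks such as the movie-review classification in this notebook, there is no sentence B, so every segment id is 0.</b><br>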

<b>Position Embedding - Self-Attention by itself has no notion of token order, so positional information must be supplied to the model.</b><br>
<b>Positions are simply encoded in token order: 0, 1, 2, ...</b><br>
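
<b>A small NumPy sketch of the three-way sum that forms the input representation. The table sizes and random values are placeholders (assumptions); in BERT the tables are learned, and layer normalization and dropout follow the sum, which this sketch omits.</b><br>

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, num_segments, max_positions, hidden = 100, 2, 16, 8  # toy sizes (assumption)

# Randomly initialised lookup tables standing in for learned embeddings
token_table = rng.normal(size=(vocab_size, hidden))
segment_table = rng.normal(size=(num_segments, hidden))
position_table = rng.normal(size=(max_positions, hidden))


def embed(input_ids, token_type_ids):
    """Return token + segment + position embeddings for one sequence."""
    input_ids = np.asarray(input_ids)
    token_type_ids = np.asarray(token_type_ids)
    position_ids = np.arange(len(input_ids))  # 0, 1, 2, ... in token order
    return (token_table[input_ids]
            + segment_table[token_type_ids]
            + position_table[position_ids])


x = embed([5, 17, 17, 42], [0, 0, 1, 1])
print(x.shape)  # (4, 8)
```

<b>The same token id (17) appears twice but receives two different vectors, because its position and segment differ.</b><br>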

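<h2>Language modeling data</h2>
<b>Trained on about 3.3 billion words in total (800 million words from BookCorpus and 2.5 billion words from Wikipedia).</b><br>
<b>The labels for the MLM and NSP (Next Sentence Prediction) tasks are generated from the raw text itself, so this pre-training is self-supervised; it is also sometimes described as semi-supervised.</b><br>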

<b>MLM (Masked Language Model) - Randomly masks tokens in the input sentence and trains the model to predict the masked tokens, i.e. a fill-in-the-blank task.</b><br>
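
<b>A simplified sketch of how MLM training pairs could be generated from unlabeled text. The function name mask_tokens is hypothetical. The 15% default follows the rate used in the original BERT setup, which additionally replaces some selected tokens with a random token or leaves them unchanged; that refinement is left out here.</b><br>

```python
import random


def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=None):
    """Randomly hide tokens; return (masked input, labels to predict)."""
    rng = random.Random(seed)
    special = {"[CLS]", "[SEP]", "[PAD]"}  # never mask special tokens
    masked, labels = [], []
    for tok in tokens:
        if tok not in special and rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)   # the model must recover the original token
        else:
            masked.append(tok)
            labels.append(None)  # no prediction needed at this position
    return masked, labels


tokens = ["[CLS]", "the", "movie", "was", "great", "[SEP]"]
masked, labels = mask_tokens(tokens, mask_prob=0.5, seed=111)
print(masked)
print(labels)
```

<b>Because the labels come from the text itself, no manual annotation is required.</b><br>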

In [1]:
import os
import re
import json
import copy
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
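# 'seaborn-white' was renamed to 'seaborn-v0_8-white' in Matplotlib 3.6; use whichever is available
style_name = 'seaborn-v0_8-white' if 'seaborn-v0_8-white' in plt.style.available else 'seaborn-white'
plt.style.use(style_name)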

from tqdm import tqdm
import tensorflow as tf
import transformers
from transformers import BertTokenizer, TFBertModel  # import only the names used below
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

In [2]:
tf.random.set_seed(111)
np.random.seed(111)

BATCH_SIZE = 32
NUM_EPOCHS = 3
VALID_SPLIT = 0.2
MAX_LEN = 39

In [3]:
import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

In [4]:
import urllib.request
train_file = urllib.request.urlopen('https://raw.githubusercontent.com/e9t/nsmc/master/ratings_train.txt')
test_file = urllib.request.urlopen('https://raw.githubusercontent.com/e9t/nsmc/master/ratings_test.txt')

train_data = pd.read_table(train_file)
test_data = pd.read_table(test_file)

train_data = train_data.dropna()
test_data = test_data.dropna()

In [5]:
train_data.head()

Unnamed: 0,id,document,label
0,9976970,아 더빙.. 진짜 짜증나네요 목소리,0
1,3819312,흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나,1
2,10265843,너무재밓었다그래서보는것을추천한다,0
3,9045019,교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정,0
4,6483659,사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 ...,1


In [6]:
test_data.head()

Unnamed: 0,id,document,label
0,6270596,굳 ㅋ,1
1,9274899,GDNTOPCLASSINTHECLUB,0
2,8544678,뭐야 이 평점들은.... 나쁘진 않지만 10점 짜리는 더더욱 아니잖아,0
3,6825595,지루하지는 않은데 완전 막장임... 돈주고 보기에는....,0
4,6723715,3D만 아니었어도 별 다섯 개 줬을텐데.. 왜 3D로 나와서 제 심기를 불편하게 하죠??,0


<b>BertTokenizer</b>

In [7]:
from ipywidgets import IntProgress

In [8]:
from tqdm import tqdm

In [9]:
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased', cache_dir = 'bert_ckpt', do_lower_case = False)

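def bert_tokenizer(sentence, MAX_LEN):
    # Encode one sentence: add [CLS]/[SEP], then pad or truncate to exactly MAX_LEN tokens
    encoded_dict = tokenizer.encode_plus(
        text = sentence,
        add_special_tokens = True,
        max_length = MAX_LEN,
        padding = 'max_length',   # replaces the deprecated pad_to_max_length = True
        truncation = True,        # cut longer inputs so every row has the same length
        return_attention_mask = True
    )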
    
    input_id = encoded_dict['input_ids']
    attention_mask = encoded_dict['attention_mask']
    token_type_id = encoded_dict['token_type_ids']
    
    return input_id, attention_mask, token_type_id
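
<b>To make the three returned lists concrete, here is a plain-Python sketch of the padding and masking step. The helper pad_and_mask is hypothetical and only mimics the shape of the tokenizer output for a single sentence; its truncation is naive, whereas the real tokenizer keeps the closing [SEP] token. The pad id of 0 matches the [PAD] id visible in the decoded sample in this notebook.</b><br>

```python
def pad_and_mask(token_ids, max_len, pad_id=0):
    """Pad (or naively truncate) ids to max_len and build the matching masks."""
    ids = list(token_ids)[:max_len]
    num_pad = max_len - len(ids)
    input_ids = ids + [pad_id] * num_pad
    attention_mask = [1] * len(ids) + [0] * num_pad  # 1 = real token, 0 = padding
    token_type_ids = [0] * max_len                   # single sentence: all segment 0
    return input_ids, attention_mask, token_type_ids


ids, mask, types = pad_and_mask([101, 9247, 8867, 102], max_len=6)
print(ids)    # [101, 9247, 8867, 102, 0, 0]
print(mask)   # [1, 1, 1, 1, 0, 0]
print(types)  # [0, 0, 0, 0, 0, 0]
```

<b>The attention mask lets the model ignore the padded positions, so sentences of different lengths can share one fixed-size batch.</b><br>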

In [None]:
input_ids = []
attention_masks = []
token_type_ids = []
train_data_labels = []

for train_sentence, train_label in tqdm(zip(train_data['document'], train_data['label']), total = len(train_data)):
    try:
        input_id, attention_mask, token_type_id = bert_tokenizer(train_sentence, MAX_LEN)
        
        input_ids.append(input_id)
        attention_masks.append(attention_mask)
        token_type_ids.append(token_type_id)
        train_data_labels.append(train_label)
    
    except Exception as e:
        print(e)
        pass
    
train_movie_input_ids = np.array(input_ids , dtype = int)
train_movie_attention_masks = np.array(attention_masks, dtype = int)
train_movie_token_type_ids = np.array(token_type_ids, dtype = int)
train_movie_inputs = (train_movie_input_ids, train_movie_attention_masks, train_movie_token_type_ids)
train_data_labels = np.asarray(train_data_labels, dtype=np.int32)
        
    
print("Sentences: {}\nLabels: {}".format(len(train_movie_input_ids), len(train_data_labels)))

In [11]:
idx = 5

input_id = train_movie_input_ids[idx]
attention_mask = train_movie_attention_masks[idx]
token_type_id = train_movie_token_type_ids[idx]

print(input_id)
print(attention_mask)
print(token_type_id)
print(tokenizer.decode(input_id))

[   101   9247   8867  32158  23811    100    124  24982  17655   9757
  55511    122  23321  10954  24017  12030    129 106249  24974  30858
  18227    119    100    119    119    119   9353  30134  21789  12092
   9519 118671 119169    119    102      0      0      0      0]
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0
 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0]
[CLS] 막 걸음마 [UNK] 3세부터 초등학교 1학년생인 8살용영화. [UNK]... 별반개도 아까움. [SEP] [PAD] [PAD] [PAD] [PAD]


In [None]:
class TFBertClassifier(tf.keras.Model):
    def __init__(self, model_name, dir_path, num_class):
        super(TFBertClassifier, self).__init__()
        
        self.bert = TFBertModel.from_pretrained(model_name, cache_dir = dir_path)
        self.dropout = tf.keras.layers.Dropout(self.bert.config.hidden_dropout_prob)
        self.classifier = tf.keras.layers.Dense(num_class,
                                               kernel_initializer = tf.keras.initializers.TruncatedNormal(self.bert.config.initializer_range),
                                               name = 'classifier')
        
    def call(self, inputs, attention_mask = None, token_type_ids = None, training = False):
        outputs = self.bert(inputs, attention_mask = attention_mask, token_type_ids = token_type_ids)
        pooled_output = outputs[1]
        pooled_output = self.dropout(pooled_output, training = training)
        logits = self.classifier(pooled_output)
        
        return logits

cls_model = TFBertClassifier(model_name = 'bert-base-multilingual-cased',
                            dir_path = 'bert_ckpt',
                            num_class=2)

<b>Model training</b>

In [13]:
optimizer  = tf.keras.optimizers.Adam(3e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits = True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
cls_model.compile(optimizer = optimizer, loss = loss, metrics = [metric])
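
<b>The classifier outputs raw logits rather than probabilities, which is why the loss is built with from_logits = True. The NumPy sketch below illustrates what such a loss computes for integer labels; the function name sparse_ce_from_logits is hypothetical, and this is an illustration, not the Keras implementation.</b><br>

```python
import numpy as np


def sparse_ce_from_logits(logits, labels):
    """Mean cross-entropy between integer labels and softmax(logits)."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    # Subtract the row-wise max for numerical stability before the log-softmax
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability of the true class for each example
    return -log_probs[np.arange(len(labels)), labels].mean()


# Equal logits for both classes give ln(2), about 0.6931, per example
print(sparse_ce_from_logits([[0.0, 0.0]], [0]))
# A confident, correct prediction gives a loss close to 0
print(sparse_ce_from_logits([[10.0, -10.0]], [0]))
```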

In [None]:
model_name = 'tf2_bert_naver_movie'

es_callback = EarlyStopping(monitor = 'val_accuracy', min_delta = 0.0001, patience = 2)

checkpoint_path = os.path.join('./', model_name, 'weights.h5')
checkpoint_dir = os.path.dirname(checkpoint_path)

if os.path.exists(checkpoint_dir):
    print("{} Directory already exists\n".format(checkpoint_dir))
else:
    os.makedirs(checkpoint_dir, exist_ok = True)
    print("{} Directory created\n".format(checkpoint_dir))

cp_callback = ModelCheckpoint(checkpoint_path, monitor = 'val_accuracy',
                             verbose = 1, save_best_only = True, save_weights_only = True)

history = cls_model.fit(train_movie_inputs, train_data_labels,
                       epochs = NUM_EPOCHS, batch_size = BATCH_SIZE, validation_split = VALID_SPLIT,
                       callbacks = [es_callback, cp_callback])

print(history.history)

In [None]:
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.xlabel('Epochs')
plt.ylabel('loss')
plt.legend(['loss', 'val_loss'])
plt.show()

<b>Model evaluation</b>

In [None]:
input_ids = []
attention_masks = []
token_type_ids = []
test_data_labels = []

for test_sentence, test_label in tqdm(zip(test_data['document'], test_data['label']), total = len(test_data)):
    try:
        input_id, attention_mask, token_type_id = bert_tokenizer(test_sentence, MAX_LEN)
        
        input_ids.append(input_id)
        attention_masks.append(attention_mask)
        token_type_ids.append(token_type_id)
        test_data_labels.append(test_label)
    
    except Exception as e:
        print(e)
        pass
    
test_movie_input_ids = np.array(input_ids , dtype = int)
test_movie_attention_masks = np.array(attention_masks, dtype = int)
test_movie_token_type_ids = np.array(token_type_ids, dtype = int)
test_movie_inputs = (test_movie_input_ids, test_movie_attention_masks, test_movie_token_type_ids)
test_data_labels = np.asarray(test_data_labels, dtype=np.int32)

In [None]:
cls_model.evaluate(test_movie_inputs, test_data_labels, batch_size = 1024)
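
<b>Evaluation reports the loss and accuracy, but for inspecting individual reviews the logits have to be turned into class predictions. A minimal sketch with made-up logits follows; the function name accuracy_from_logits and the sample values are assumptions. In the NSMC data, label 0 is a negative review and label 1 a positive one.</b><br>

```python
import numpy as np


def accuracy_from_logits(logits, labels):
    """Return (predicted classes, accuracy) for a batch of logits."""
    preds = np.argmax(np.asarray(logits), axis=-1)  # index of the largest logit per row
    accuracy = float((preds == np.asarray(labels)).mean())
    return preds, accuracy


# Made-up logits for three reviews (assumption, not real model output)
sample_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5]]
sample_labels = [0, 0, 1]

preds, acc = accuracy_from_logits(sample_logits, sample_labels)
print(preds)  # [0 1 1]
print(acc)    # two of the three predictions match the labels
```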