A model with a user-defined final classifier layer<br>
(pretrained model + classification layer)


https://github.com/ukairia777/tensorflow-nlp-tutorial/blob/main/18.%20Fine-tuning%20BERT%20(Cls%2C%20NER%2C%20NLI)/18-3.%20google_bert_nsmc_tpu.ipynb <br><br>

https://wikidocs.net/158085

In [1]:
!pip install transformers



In [2]:
import transformers
transformers.__version__

'4.46.3'

In [3]:
import pandas as pd
import numpy as np
import urllib.request
import os
from tqdm import tqdm
import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

In [4]:
# Download the data
urllib.request.urlretrieve("https://raw.githubusercontent.com/e9t/nsmc/master/ratings_train.txt", filename="ratings_train.txt")
urllib.request.urlretrieve("https://raw.githubusercontent.com/e9t/nsmc/master/ratings_test.txt", filename="ratings_test.txt")

('ratings_test.txt', <http.client.HTTPMessage at 0x7a01b70a4220>)

In [5]:
train_data = pd.read_table('ratings_train.txt')
test_data = pd.read_table('ratings_test.txt')

In [6]:
print('Total count of reviews for train:', len(train_data))
print('Total count of reviews for test:', len(test_data))

Total count of reviews for train: 150000
Total count of reviews for test: 50000


In [7]:
# Check the first 5 rows
train_data.head()

Unnamed: 0,id,document,label
0,9976970,아 더빙.. 진짜 짜증나네요 목소리,0
1,3819312,흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나,1
2,10265843,너무재밓었다그래서보는것을추천한다,0
3,9045019,교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정,0
4,6483659,사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 ...,1


In [8]:
test_data.head()

Unnamed: 0,id,document,label
0,6270596,굳 ㅋ,1
1,9274899,GDNTOPCLASSINTHECLUB,0
2,8544678,뭐야 이 평점들은.... 나쁘진 않지만 10점 짜리는 더더욱 아니잖아,0
3,6825595,지루하지는 않은데 완전 막장임... 돈주고 보기에는....,0
4,6723715,3D만 아니었어도 별 다섯 개 줬을텐데.. 왜 3D로 나와서 제 심기를 불편하게 하죠??,0


In [9]:
# Preprocess the data
train_data = train_data.dropna(how='any') # drop rows that contain a null value
train_data = train_data.reset_index(drop=True)
print(train_data.isnull().values.any()) # check whether any null values remain

test_data = test_data.dropna(how='any')
test_data = test_data.reset_index(drop=True)
print(test_data.isnull().values.any())

False
False


In [10]:
# Check the data sizes after preprocessing
print(len(train_data))
print(len(test_data))

149995
49997


In [11]:
# Create the tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')



In [12]:
# Test the tokenizer
sample = '보는내내 그대로 들어맞는 예측 카리스마 없는 악역'
print(tokenizer.encode(sample)) # integer token IDs (not embedding vectors)
print(tokenizer.tokenize(sample)) # see how the sentence is actually split into tokens
print(tokenizer.decode(tokenizer.encode(sample))) # check that the token IDs decode back to the original text
for elem in tokenizer.encode(sample): # decode each ID to see the individual tokens
    print(tokenizer.decode(elem))

[101, 9356, 11018, 31605, 31605, 110589, 71568, 118913, 11018, 9576, 119281, 9786, 79940, 23811, 40364, 9520, 23160, 102]
['보', '##는', '##내', '##내', '그대로', '들어', '##맞', '##는', '예', '##측', '카', '##리스', '##마', '없는', '악', '##역']
[CLS] 보는내내 그대로 들어맞는 예측 카리스마 없는 악역 [SEP]
[CLS]
보
##는
##내
##내
그대로
들어
##맞
##는
예
##측
카
##리스
##마
없는
악
##역
[SEP]
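
The `##` prefix in the output above marks a piece that continues the preceding word. The following is a minimal sketch of greedy longest-match-first WordPiece splitting, not the tokenizer's actual implementation. `wordpiece_split` and `toy_vocab` are hypothetical names, and the toy vocabulary is an assumption chosen so that a few words from the sample split the same way as above; the real tokenizer also normalizes text and splits punctuation first.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one whitespace-delimited word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece matched: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary (assumption, for illustration only).
toy_vocab = {"보", "##는", "##내", "그대로", "들어", "##맞", "악", "##역"}
tokens = [p for w in "보는내내 그대로 들어맞는 악역".split()
          for p in wordpiece_split(w, toy_vocab)]
print(tokens)
```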


In [13]:
# Check what the special token IDs stand for (decode them individually)
print(tokenizer.decode(101)) # [CLS]; start of the sequence
print(tokenizer.decode(102)) # [SEP]; separator between sequences

[CLS]
[SEP]


In [14]:
# Fix the sequence length and check that the remaining positions are filled with the padding value
#- For fine-tuning, every sequence must have the same length; positions beyond the actual tokens are filled with the padding ID (=0).
max_seq_len = 128
encoded_result = tokenizer.encode(sample, max_length=max_seq_len, padding='max_length', truncation=True) # pad up to max_seq_len and truncate longer inputs
print(encoded_result)
print('length:', len(encoded_result))


[101, 9356, 11018, 31605, 31605, 110589, 71568, 118913, 11018, 9576, 119281, 9786, 79940, 23811, 40364, 9520, 23160, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
length: 128
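
The padded output above can be reproduced by hand, which also shows where the attention mask comes from. This is a minimal sketch, assuming the special token IDs seen in the outputs above (101 for `[CLS]`, 102 for `[SEP]`, 0 for padding); `pad_and_mask` is a hypothetical helper, not part of the `transformers` API.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # IDs observed in the outputs above

def pad_and_mask(content_ids, max_len):
    """Truncate, add [CLS]/[SEP], pad to max_len, and build the attention mask."""
    content_ids = content_ids[: max_len - 2]  # leave room for [CLS] and [SEP]
    ids = [CLS_ID] + content_ids + [SEP_ID]
    n_real = len(ids)
    ids = ids + [PAD_ID] * (max_len - n_real)        # fill the rest with padding
    mask = [1] * n_real + [0] * (max_len - n_real)   # 1 = real token, 0 = padding
    return ids, mask

ids, mask = pad_and_mask([9356, 11018, 31605], max_len=8)
print(ids)   # [101, 9356, 11018, 31605, 102, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```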




In [15]:
# Convert the samples into features for training
def convert_examples_to_features(examples, labels, max_seq_len, tokenizer):
    input_ids, attention_masks, token_type_ids, data_labels = [], [], [], []

    for example, label in tqdm(zip(examples, labels), total=len(examples)): # show progress with tqdm
        # input_id: integer encoding of the sentence, later looked up in the word embedding table
        input_id = tokenizer.encode(example, max_length=max_seq_len, padding='max_length', truncation=True) # fixed-length sequence: pad up to max_seq_len, truncate anything longer

        # attention_mask: 1 where a real token is located, 0 at padding positions
        padding_count = input_id.count(tokenizer.pad_token_id)
        attention_mask = [1] * (max_seq_len - padding_count) + [0] * padding_count

        # token_type_id: used for segment embeddings; this example has a single sentence, so it is all zeros
        token_type_id = [0] * max_seq_len

        # fail if any sequence length differs from max_seq_len
        assert len(input_id) == max_seq_len, "Error with input length {} vs {}".format(len(input_id), max_seq_len)
        assert len(attention_mask) == max_seq_len, "Error with attention mask length {} vs {}".format(len(attention_mask), max_seq_len)
        assert len(token_type_id) == max_seq_len, "Error with token type length {} vs {}".format(len(token_type_id), max_seq_len)

        # collect the features
        input_ids.append(input_id)
        attention_masks.append(attention_mask)
        token_type_ids.append(token_type_id)
        data_labels.append(label)

    # cast to array types for computation
    # input
    input_ids = np.array(input_ids, dtype=int)
    attention_masks = np.array(attention_masks, dtype=int)
    token_type_ids = np.array(token_type_ids, dtype=int)

    # output
    data_labels = np.asarray(data_labels, dtype=np.int32) # np.asarray skips the copy when the input is already a matching ndarray; int32 is ample for 0/1 labels

    return (input_ids, attention_masks, token_type_ids), data_labels

In [16]:
train_X, train_y = convert_examples_to_features(train_data['document'], train_data['label'], max_seq_len=max_seq_len, tokenizer=tokenizer)

100%|██████████| 149995/149995 [01:13<00:00, 2035.22it/s]


In [17]:
test_X, test_y = convert_examples_to_features(test_data['document'], test_data['label'], max_seq_len=max_seq_len, tokenizer=tokenizer)

100%|██████████| 49997/49997 [00:14<00:00, 3390.81it/s]


In [18]:
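# test
inputs = tokenizer(sample, return_tensors='pt') # 'pt' returns PyTorch tensors; use 'tf' for TensorFlow tensors
print(inputs['input_ids']) # integer token IDs of the sentence
print(inputs['token_type_ids']) # segment IDs; 0 for the first sentence, 1 for the second
print(inputs['attention_mask']) # real tokens vs. padding; 1 for real tokens, 0 for padding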


tensor([[   101,   9356,  11018,  31605,  31605, 110589,  71568, 118913,  11018,
           9576, 119281,   9786,  79940,  23811,  40364,   9520,  23160,    102]])
tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
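
The segment IDs above are all 0 because the input is a single sentence. As a rough sketch of the BERT convention for sentence pairs, `[CLS] A [SEP]` gets segment 0 and `B [SEP]` gets segment 1. `segment_ids` is a hypothetical helper written for illustration, and its arguments are token counts without special tokens.

```python
def segment_ids(len_a, len_b=0):
    """token_type_ids for '[CLS] A [SEP]' or '[CLS] A [SEP] B [SEP]'.

    len_a and len_b are the token counts of each sentence, without special tokens.
    """
    first = [0] * (len_a + 2)                      # [CLS] + A + [SEP]
    second = [1] * (len_b + 1) if len_b else []    # B + [SEP]
    return first + second

print(segment_ids(16))    # 18 zeros: the 16-token sample plus [CLS] and [SEP]
print(segment_ids(3, 2))  # [0, 0, 0, 0, 0, 1, 1, 1]
```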


In [19]:
# Inspect one feature sample
# Maximum length: 128 = max_seq_len
input_id = train_X[0][0]
attention_mask = train_X[1][0]
token_type_id = train_X[2][0]
label = train_y[0]

print('Integer encoding of the tokens:', input_id)
print('Attention mask:', attention_mask)
print('Segment encoding:', token_type_id)
print('Length of each encoding:', len(input_id))
print('Decoded integer encoding:', tokenizer.decode(input_id))
print('Label:', label)

Integer encoding of the tokens: [   101   9519   9074 119005    119    119   9708 119235   9715 119230
  16439  77884  48549   9284  22333  12692    102      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0]
Attention mask: [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

In [20]:
# Load the pretrained BERT model
# https://huggingface.co/google-bert/bert-base-multilingual-cased
model = TFBertModel.from_pretrained("bert-base-multilingual-cased")

# Inspect the outputs of the pretrained BERT model
input_ids_layer = tf.keras.layers.Input(shape=(max_seq_len,), dtype=tf.int32)
attention_masks_layer = tf.keras.layers.Input(shape=(max_seq_len,), dtype=tf.int32)
token_type_ids_layer = tf.keras.layers.Input(shape=(max_seq_len,), dtype=tf.int32)

outputs = model([input_ids_layer, attention_masks_layer, token_type_ids_layer])
print(outputs)
print(outputs[0]) # last_hidden_state: one hidden vector per token
print(outputs[1]) # pooler_output: pooled representation of the [CLS] token


Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

In [27]:
# Define the BERT model: reuse the pretrained model and add a custom classifier layer on top
class TFBertForSequenceClassification(tf.keras.Model):
    def __init__(self, model_name):
        super().__init__()
        self.bert = TFBertModel.from_pretrained(model_name, from_pt=True)
        self.classifier = tf.keras.layers.Dense(1,
                                                kernel_initializer=tf.keras.initializers.TruncatedNormal(0.02),
                                                activation='sigmoid',
                                                name='classifier')

    def call(self, inputs): #? to verify: call the model on one small batch and print the output shape
        input_ids, attention_mask, token_type_ids = inputs
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        cls_token = outputs[1] # pooler_output: the [CLS] hidden state passed through a dense + tanh layer
        prediction = self.classifier(cls_token)

        return prediction
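
To make the role of `outputs[1]` and the `classifier` layer concrete: in the class above, the pooled `[CLS]` vector (dense + tanh inside BERT) is fed to a single-unit dense layer with a sigmoid. The sketch below traces that arithmetic with plain Python lists. The 2-dimensional hidden state and every weight are made-up numbers for illustration only, and `dense` is a hypothetical helper, not a Keras call.

```python
import math

def dense(x, weights, bias, activation):
    """y_j = activation(sum_i x_i * W[i][j] + b_j) for a list-based dense layer."""
    return [
        activation(sum(xi * row[j] for xi, row in zip(x, weights)) + bias[j])
        for j in range(len(bias))
    ]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy 2-dimensional [CLS] hidden state and made-up weights (illustration only).
h_cls = [0.5, -1.0]
pooled = dense(h_cls, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], math.tanh)  # pooler: dense + tanh
prob = dense(pooled, [[2.0], [1.0]], [0.0], sigmoid)[0]                  # head: Dense(1, sigmoid)
print(pooled, prob)
```

Because the head ends in a sigmoid, `prob` always lies strictly between 0 and 1 and can be read as the probability of the positive class.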

In [28]:
# Compile the model (with optimizer and loss function)
model = TFBertForSequenceClassification("bert-base-multilingual-cased")
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
loss = tf.keras.losses.BinaryCrossentropy() # binary classification (two classes)
model.compile(optimizer=optimizer, loss=loss, metrics = ['accuracy'])

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already
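
The loss chosen above, binary cross-entropy, averages -(y·log(p) + (1-y)·log(1-p)) over the samples. A minimal pure-Python sketch of that formula follows; the function name and the clipping constant `eps` are assumptions for illustration and this is not the Keras implementation.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)); eps clipping avoids log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

print(binary_crossentropy([0, 1], [0.1, 0.9]))  # confident and correct: small loss
print(binary_crossentropy([0, 1], [0.9, 0.1]))  # confident and wrong: large loss
```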

In [29]:
# Train
model.fit(train_X, train_y, epochs=2, batch_size=64, validation_split=0.2) # 0.8 for training, 0.2 for validation

Epoch 1/2
Epoch 2/2


<tf_keras.src.callbacks.History at 0x7a01711315a0>
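
Note that, per the Keras documentation, `validation_split` holds out the last fraction of the provided samples before any shuffling, so the validation set here is the tail of `train_X`. A small sketch of that slicing follows; `split_train_validation` is a hypothetical helper and the index arithmetic is an assumption about how the split point is rounded.

```python
import math

def split_train_validation(samples, validation_split):
    """Hold out the LAST fraction of samples, following the documented Keras behaviour."""
    split_at = int(math.floor(len(samples) * (1.0 - validation_split)))
    return samples[:split_at], samples[split_at:]

train_part, val_part = split_train_validation(list(range(10)), 0.2)
print(train_part)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(val_part)    # [8, 9]
```

If the rows are ordered in some meaningful way, shuffling the data before calling `fit` keeps the held-out tail representative.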

In [30]:
# Evaluate on the test data
results = model.evaluate(test_X, test_y, batch_size=1024)
print("test loss, test acc:", results)

test loss, test acc: [0.3222801983356476, 0.8611716628074646]
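
With a sigmoid output and 0/1 labels, the accuracy reported above amounts to thresholding each predicted probability and comparing it with the label. A sketch under the assumption that the threshold is 0.5 (`binary_accuracy` here is a hypothetical pure-Python helper, not the Keras metric itself):

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

print(binary_accuracy([0, 1, 1, 0], [0.1, 0.8, 0.4, 0.3]))  # 0.75
```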


In [31]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/gdrive/')

Mounted at /content/gdrive/


In [34]:
# Save the model
# model.save_pretrained('/content/gdrive/MyDrive/Colab Notebooks/nsmc_model/bert-base')  # fails: this custom tf.keras.Model subclass has no save_pretrained method
model.save('/content/gdrive/MyDrive/Colab Notebooks/nsmc_model/bert-base', save_format='tf') # use the standard Keras save method instead
tokenizer.save_pretrained('/content/gdrive/MyDrive/Colab Notebooks/nsmc_model/bert-base')


('/content/gdrive/MyDrive/Colab Notebooks/nsmc_model/bert-base/tokenizer_config.json',
 '/content/gdrive/MyDrive/Colab Notebooks/nsmc_model/bert-base/special_tokens_map.json',
 '/content/gdrive/MyDrive/Colab Notebooks/nsmc_model/bert-base/vocab.txt',
 '/content/gdrive/MyDrive/Colab Notebooks/nsmc_model/bert-base/added_tokens.json')

In [35]:
# Predict

def sentiment_predict(new_sentence):
    input_id = tokenizer.encode(new_sentence, max_length=max_seq_len, padding='max_length', truncation=True)

    #! TODO: print each of the values below to inspect them
    padding_count = input_id.count(tokenizer.pad_token_id) # number of padding tokens in input_id
    attention_mask = [1] * (max_seq_len - padding_count) + [0] * padding_count
    token_type_id = [0] * max_seq_len

    # add a batch dimension of size 1
    input_ids = np.array([input_id])
    attention_masks = np.array([attention_mask])
    token_type_ids = np.array([token_type_id])

    encoded_input = [input_ids, attention_masks, token_type_ids]
    score = model.predict(encoded_input)[0][0] # sigmoid output: probability of the positive class
    print(score)

    if score > 0.5:
        print("positive:", score)
    else:
        print("negative:", (1 - score))


In [36]:
sentiment_predict("별 똥같은 영화를 다 보네. 개별로입니다.")



0.032830168
negative: 0.9671698324382305
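
The decision rule in `sentiment_predict` can be separated from the model call so that it is easy to check on its own. A minimal sketch, where `interpret_score` is a hypothetical helper and the 0.5 threshold is the one used above:

```python
def interpret_score(score, threshold=0.5):
    """Map a sigmoid output to (label, confidence in that label)."""
    if score > threshold:
        return "positive", score
    return "negative", 1.0 - score

label, confidence = interpret_score(0.032830168)
print(label, round(confidence, 4))  # negative 0.9672
```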
