https://huggingface.co/transformers/

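- Hugging Face's transformers is a package that makes it easy to work with Transformer models such as BERT in natural language processing.
- It is built primarily on PyTorch, but it can also be used with TensorFlow 2.0.
- TensorFlow 2.0 includes Keras, so anyone already familiar with TensorFlow or Keras can pick it up easily.
- We will learn how to use Hugging Face with TensorFlow 2.0 by working through positive/negative sentiment analysis of Naver movie reviews.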

* Install
Install the transformers package in Colab.

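- Hugging Face's transformers is a module that implements a variety of Transformer-based models (transformers.models) and a training script (transformers.Trainer).
- Normally you would declare the layers and the model in PyTorch and write the whole training script yourself; using Hugging Face saves that effort.
- In short, the company Hugging Face maintains the transformers package. The layer.py and model.py of a typical PyTorch implementation correspond to transformers.models, and train.py corresponds to transformers.Trainer.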

In [56]:
!pip install transformers
!pip install sentencepiece



In [60]:
import tensorflow as tf
import numpy as np
import pandas as pd
from transformers import BertTokenizer, TFBertModel  # BertTokenizer is used in a later cell
import json
from tqdm import tqdm
import os

In [None]:
from google.colab import drive
drive.mount('/content/gdrive/')

Mounted at /content/gdrive/


In [None]:
!git clone https://github.com/e9t/nsmc.git

Cloning into 'nsmc'...
remote: Enumerating objects: 14763, done.
remote: Total 14763 (delta 0), reused 0 (delta 0), pack-reused 14763
Receiving objects: 100% (14763/14763), 56.19 MiB | 23.12 MiB/s, done.
Resolving deltas: 100% (1749/1749), done.
Checking out files: 100% (14737/14737), done.


In [None]:
os.listdir('nsmc')

['ratings_train.txt',
 'ratings.txt',
 'ratings_test.txt',
 'raw',
 'README.md',
 '.git',
 'synopses.json',
 'code']

In [None]:
train = pd.read_table('nsmc/' + 'ratings_train.txt')
test = pd.read_table('nsmc/' + 'ratings_test.txt')
train.head()

Unnamed: 0,id,document,label
0,9976970,아 더빙.. 진짜 짜증나네요 목소리,0
1,3819312,흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나,1
2,10265843,너무재밓었다그래서보는것을추천한다,0
3,9045019,교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정,0
4,6483659,사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 ...,1
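
`pd.read_table` reads the NSMC files as tab-separated text with an `id`, `document`, `label` header. As a rough illustration of that layout, the sketch below parses the same format with the standard library only. The `read_nsmc` helper and the sample rows are made up for this example and are not part of the notebook; skipping rows with empty text is an assumption about what you may want, not something the notebook does.

```python
import csv
import io

def read_nsmc(fileobj):
    """Parse NSMC-style TSV text into (document, label) pairs.

    Hypothetical helper for illustration; the notebook itself uses pd.read_table.
    """
    reader = csv.DictReader(fileobj, delimiter="\t", quoting=csv.QUOTE_NONE)
    pairs = []
    for row in reader:
        document = (row["document"] or "").strip()
        if not document:
            # Skip rows whose review text is empty
            continue
        pairs.append((document, int(row["label"])))
    return pairs

# Made-up sample in the same column layout as ratings_train.txt
sample = io.StringIO("id\tdocument\tlabel\n1\tgreat movie\t1\n2\t\t0\n3\tboring\t0\n")
pairs = read_nsmc(sample)
```

With a real file you would pass `open('nsmc/ratings_train.txt', encoding='utf-8')` instead of the in-memory sample.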


In [None]:
# bert input
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')

Downloading:   0%|          | 0.00/972k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.87M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/625 [00:00<?, ?B/s]

In [None]:
tokenizer.encode('보는내내 그대로 들어맞는 예측 카리스마 없는 악역')

[101,
 9356,
 11018,
 31605,
 31605,
 110589,
 71568,
 118913,
 11018,
 9576,
 119281,
 9786,
 79940,
 23811,
 40364,
 9520,
 23160,
 102]

In [None]:
tokenizer.tokenize('보는내내 그대로 들어맞는 예측 카리스마 없는 악역')

['보',
 '##는',
 '##내',
 '##내',
 '그대로',
 '들어',
 '##맞',
 '##는',
 '예',
 '##측',
 '카',
 '##리스',
 '##마',
 '없는',
 '악',
 '##역']
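
The `##` prefix in the output marks a piece that continues the previous piece inside the same word. A minimal sketch of the greedy longest-match-first idea behind WordPiece is shown below. It assumes a tiny hand-made vocabulary (`toy_vocab`) rather than the real model vocabulary, and it leaves out the normalization and punctuation splitting that the actual `BertTokenizer` performs first.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one whitespace-delimited word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the '##' prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # shrink the candidate from the right and try again
        if piece is None:
            return [unk]  # no piece matched: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary (an assumption for this example, not the real model vocabulary)
toy_vocab = {"보", "##는", "##내", "그대로", "들어", "##맞"}
tokens = [p for w in "보는내내 그대로 들어맞는".split() for p in wordpiece(w, toy_vocab)]
```

With this toy vocabulary the first three words split the same way as in the tokenizer output above.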

In [None]:
print( tokenizer.encode('전율을 일으키는 영화. 다시 보고 싶은 영화', max_length = 129 ,pad_to_max_length = True))

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


[101, 9665, 119183, 10622, 9641, 119185, 66815, 42428, 119, 25805, 98199, 9495, 10892, 42428, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
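
The encoded list starts with 101 (`[CLS]`), ends its real content with 102 (`[SEP]`), and is then filled with 0 (`[PAD]`) up to the requested length. The sketch below reproduces that layout in plain Python. `to_fixed_length` is a hypothetical helper written for this note, and the three special ids are the ones `bert-base-multilingual-cased` uses; other vocabularies may assign different ids.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special ids of bert-base-multilingual-cased

def to_fixed_length(piece_ids, max_length):
    """Wrap piece ids in [CLS]/[SEP], truncate if needed, then right-pad to max_length."""
    body = piece_ids[: max_length - 2]  # leave room for the two special tokens
    ids = [CLS_ID] + body + [SEP_ID]
    return ids + [PAD_ID] * (max_length - len(ids))

short = to_fixed_length([9665, 119183, 10622], max_length=8)
long = to_fixed_length(list(range(1000, 1020)), max_length=8)
```

A short input is padded on the right, while a long input is cut so that `[SEP]` still lands in the last position.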




In [None]:
# mask input
valid_num = len(tokenizer.encode('전율을 일으키는 영화. 다시 보고 싶은 영화'))
print(valid_num * [1] + (128 -valid_num) * [0])

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
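
The mask marks real tokens with 1 and padding with 0. A small sketch of building the mask and the segment input directly from the padded ids follows. The helper names are invented for this example. The `pad_id` parameter is there because counting zeros only works when the pad id is 0, which holds for the multilingual checkpoint but should be checked through `tokenizer.pad_token_id` for any other tokenizer.

```python
def attention_mask(ids, pad_id=0):
    """1 for real tokens, 0 for padding; pad_id must match the tokenizer in use."""
    return [0 if token_id == pad_id else 1 for token_id in ids]

def segment_ids(ids):
    """Single-sentence input: every position belongs to segment 0."""
    return [0] * len(ids)

ids = [101, 9665, 119183, 102, 0, 0]
mask = attention_mask(ids)
segments = segment_ids(ids)

# Same helper with a vocabulary whose pad id is assumed to be 1
other_mask = attention_mask([2, 5, 3, 1, 1], pad_id=1)
```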


# Convert the Naver movie review sentences into BERT inputs

In [70]:
def convert_data(data_df):
  global tokenizer

  SEQ_LEN = 128  # length of the BERT input

  tokens, masks, segments, targets = [],[],[],[]

  for i in tqdm(range(len(data_df))):
    token = tokenizer.encode(data_df[DATA_COLUMN][i], max_length= SEQ_LEN, truncation=True,padding='max_length')

    num_zeros= token.count(0)
    mask = [1]* (SEQ_LEN-num_zeros) + [0]*num_zeros

    segment = [0] *SEQ_LEN

    tokens.append(token)
    masks.append(mask)
    segments.append(segment)

    targets.append(data_df[LABEL_COLUMN][i])

  tokens= np.array(tokens)
  masks= np.array(masks)
  segments= np.array(segments)
  targets= np.array(targets)

  return  [tokens,masks, segments], targets


In [71]:
def load_data(pandas_dataframe):
  data_df = pandas_dataframe
  data_df[DATA_COLUMN] = data_df[DATA_COLUMN].astype(str)
  data_df[LABEL_COLUMN] = data_df[LABEL_COLUMN].astype(int)
  data_x, data_y = convert_data(data_df)
  return data_x, data_y


SEQ_LEN= 128
BATCH_SIZE = 20
DATA_COLUMN = 'document'
LABEL_COLUMN= 'label'

train_x, train_y= load_data(train)


100%|██████████| 150000/150000 [00:31<00:00, 4708.05it/s]


In [72]:
test_x, test_y = load_data(test)

100%|██████████| 50000/50000 [00:10<00:00, 4918.65it/s]


In [73]:
test_y

array([1, 0, 0, ..., 0, 0, 0])

#### Building a sentiment analysis model with BERT

In [74]:
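# Connect to the TPU. COLAB_TPU_ADDR is set only on a Colab TPU runtime; with TPU = True
# on any other runtime, the os.environ lookup below raises KeyError.
TPU = True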
if TPU :
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
else:
    pass

KeyError: ignored

In [None]:
# Use the Rectified Adam optimizer
!pip install tensorflow_addons
import tensorflow_addons as tfa
opt = tfa.optimizers.RectifiedAdam(
    learning_rate = 1.0e-5, weight_decay = 0.0025, warmup_proportion = 0.05
)

In [None]:
def create_sentiment_bert():
  model = TFBertModel.from_pretrained('bert-base-multilingual-cased')
  token_inputs = tf.keras.layers.Input((SEQ_LEN,), dtype=tf.int32, name='input_word_ids')
  mask_inputs = tf.keras.layers.Input((SEQ_LEN,), dtype=tf.int32, name='input_masks')
  segment_inputs = tf.keras.layers.Input((SEQ_LEN,), dtype=tf.int32, name='input_segment')
  bert_outputs = model([token_inputs, mask_inputs, segment_inputs])

  bert_outputs = bert_outputs[1]
  sentiment_first = tf.keras.layers.Dense(1, activation='sigmoid', kernel_initializer=tf.keras.initializers.TruncatedNormal(stddev=0.02))(bert_outputs)
  sentiment_model = tf.keras.Model([token_inputs, mask_inputs, segment_inputs], sentiment_first)

  sentiment_model.compile(optimizer=opt, loss=tf.keras.losses.BinaryCrossentropy(), metrics = ['accuracy'])
  return sentiment_model
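
The classification head is a single `Dense(1, activation='sigmoid')` layer on top of BERT's pooled output, trained with binary cross-entropy. As a sketch of what that head computes for one example, the pure-Python version below uses a made-up 3-dimensional vector in place of the 768-dimensional pooled output. The function names, weights, and inputs are illustrative only and are not taken from the trained model.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(pooled, weights, bias):
    """Dense(1, activation='sigmoid'): weighted sum of the pooled vector, then sigmoid."""
    z = sum(w * h for w, h in zip(weights, pooled)) + bias
    return sigmoid(z)

def binary_crossentropy(label, prob, eps=1e-7):
    """Loss for one example; eps keeps log() away from 0."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))

# Toy 3-dimensional "pooled output"; the real one has 768 dimensions
p = predict_proba([0.5, -1.0, 0.25], weights=[0.2, 0.1, -0.4], bias=0.0)
loss = binary_crossentropy(1, p)
```

A probability above 0.5 is read as a positive review (label 1), and the loss grows as the predicted probability moves away from the true label.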

In [None]:
# When running on TPU
if TPU:
    strategy = tf.distribute.experimental.TPUStrategy(resolver)
    with strategy.scope():
        sentiment_model = create_sentiment_bert()
    sentiment_model.fit(train_x, train_y, epochs = 4, shuffle = True, batch_size = 100, validation_data = ( test_x, test_y) )
else:
    sentiment_model = create_sentiment_bert()
    sentiment_model.fit(train_x, train_y, epochs = 4, shuffle = True, batch_size = 100, validation_data = ( test_x, test_y) )

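Analyzing Korean data requires a BERT that covers Korean, such as the multilingual BERT trained on more than 100 languages.
This time we will use KoBERT, which was built by SKT and trained on Korean data.
Before loading the model, we load the tokenizer.
Hugging Face makes loading a tokenizer very easy.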

https://github.com/monologg/KoBERT-NER

How to use KoBERT on Huggingface Transformers Library
- The original KoBERT has been adapted so that it can be used directly from the transformers library.
- Since transformers v2.2.2, individually built models can be uploaded and downloaded directly through transformers.
- To use the tokenizer, import KoBertTokenizer from tokenization_kobert.py.



In [75]:
# Download the Naver movie sentiment analysis data
!git clone https://github.com/e9t/nsmc.git

fatal: destination path 'nsmc' already exists and is not an empty directory.


In [76]:
import os
os.listdir('nsmc')

['ratings_train.txt',
 '.git',
 'synopses.json',
 'ratings.txt',
 'ratings_test.txt',
 'raw',
 'README.md',
 'code']

In [77]:
import pandas as pd
train = pd.read_table('nsmc/' + 'ratings_train.txt')
test = pd.read_table('nsmc/' + 'ratings_test.txt')

In [78]:
train.head()

Unnamed: 0,id,document,label
0,9976970,아 더빙.. 진짜 짜증나네요 목소리,0
1,3819312,흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나,1
2,10265843,너무재밓었다그래서보는것을추천한다,0
3,9045019,교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정,0
4,6483659,사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 ...,1


In [79]:
# tokenization_kobert.py upload
from google.colab import files
files.upload()

Saving tokenization_kobert.py to tokenization_kobert (2).py




In [80]:
from tokenization_kobert import KoBertTokenizer
tokenizer = KoBertTokenizer.from_pretrained('monologg/kobert')

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'BertTokenizer'. 
The class this function is called from is 'KoBertTokenizer'.


In [81]:
from tqdm import tqdm
import numpy as np

In [82]:
def convert_data(data_df):
    global tokenizer
    SEQ_LEN = 64

    tokens, masks, segments, targets = [], [], [], []

    for i in tqdm (range( len( data_df ))):
        token = tokenizer.encode (data_df[DATA_COLUMN][i], truncation = True, padding = 'max_length', max_length = SEQ_LEN)
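        # Count padding by the tokenizer's own pad id; it is not guaranteed to be 0 for every vocabulary
        num_pads = token.count(tokenizer.pad_token_id)
        mask = [1] * (SEQ_LEN - num_pads) + [0] * num_pads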

        segment = [0] * SEQ_LEN
        tokens.append(token)
        masks.append(mask)
        segments.append(segment)

        targets.append(data_df[LABEL_COLUMN][i])
    
    tokens = np.array(tokens)
    masks = np.array(masks)
    segments = np.array(segments)
    targets = np.array(targets)

    return [tokens, masks, segments], targets

In [83]:
def load_data(pandas_dataframe):
    data_df = pandas_dataframe
    data_df[DATA_COLUMN] = data_df[DATA_COLUMN].astype(str)
    data_df[LABEL_COLUMN] = data_df[LABEL_COLUMN].astype(int)
    data_x, data_y = convert_data(data_df)
    return data_x, data_y

In [84]:
SEQ_LEN = 64
BATCH_SIZE = 32
DATA_COLUMN = 'document'
LABEL_COLUMN = 'label'

train_x, train_y = load_data(train)

100%|██████████| 150000/150000 [00:30<00:00, 4839.15it/s]


In [85]:

model = TFBertModel.from_pretrained("monologg/kobert", from_pt=True)
# Define the token, mask, and segment inputs
token_inputs = tf.keras.layers.Input((SEQ_LEN,), dtype=tf.int32, name='input_word_ids')
mask_inputs = tf.keras.layers.Input((SEQ_LEN,), dtype=tf.int32, name='input_masks')
segment_inputs = tf.keras.layers.Input((SEQ_LEN,), dtype=tf.int32, name='input_segment')
# Define a model whose input is [token, mask, segment]
bert_outputs = model([token_inputs, mask_inputs, segment_inputs])

All PyTorch model weights were used when initializing TFBertModel.

All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


In [86]:
bert_outputs = bert_outputs[1]
bert_outputs.shape

TensorShape([None, 768])

In [87]:
!pip install tensorflow_addons
import tensorflow_addons as tfa
# total steps = steps per epoch * 2 epochs = 2344 * 2 (150000 examples / batch size 64, rounded up)
opt = tfa.optimizers.RectifiedAdam(lr=5.0e-5, total_steps = 2344*2, warmup_proportion=0.1, min_lr=1e-5, epsilon=1e-08, clipnorm=1.0)



  "The `lr` argument is deprecated, use `learning_rate` instead.")


In [88]:
sentiment_drop = tf.keras.layers.Dropout(0.5)(bert_outputs)
sentiment_first = tf.keras.layers.Dense(1, activation='sigmoid', kernel_initializer=tf.keras.initializers.TruncatedNormal(stddev=0.02))(sentiment_drop)
sentiment_model = tf.keras.Model([token_inputs, mask_inputs, segment_inputs], sentiment_first)
sentiment_model.compile(optimizer=opt, loss=tf.keras.losses.BinaryCrossentropy(), metrics = ['accuracy'])

sentiment_model.summary()

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_word_ids (InputLayer)     [(None, 64)]         0                                            
__________________________________________________________________________________________________
input_masks (InputLayer)        [(None, 64)]         0                                            
__________________________________________________________________________________________________
input_segment (InputLayer)      [(None, 64)]         0                                            
__________________________________________________________________________________________________
tf_bert_model_2 (TFBertModel)   TFBaseModelOutputWit 92186880    input_word_ids[0][0]             
                                                                 input_masks[0][0]          

In [89]:
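# test_x / test_y from the earlier section were encoded at length 128 with the multilingual tokenizer,
# which does not match this model's 64-token inputs; re-encode the test set with the KoBERT tokenizer first.
test_x, test_y = load_data(test)
sentiment_model.fit(train_x, train_y, epochs=2, shuffle=True, batch_size=64, validation_data=(test_x, test_y))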

Epoch 1/2

ValueError: ignored
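
The `total_steps` value given to the optimizer in this section can be checked with a little arithmetic: it is the number of batches per epoch times the number of epochs. The sketch below assumes the 150,000 training examples reported by the progress bar, together with the `batch_size=64` and `epochs=2` passed to `fit`. The helper name is made up for this note.

```python
import math

def total_training_steps(num_examples, batch_size, epochs):
    """Optimizer updates over the whole run: batches per epoch times epochs."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # the last batch may be partial
    return steps_per_epoch, steps_per_epoch * epochs

steps_per_epoch, total_steps = total_training_steps(150_000, 64, 2)
```

If the batch size or the number of epochs passed to `fit` changes, `total_steps` should be recomputed so that the optimizer's warmup and decay schedule still spans the whole run.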