# End-to-end Masked Language Modeling with BERT

**Author:** [Ankur Singh](https://twitter.com/ankur310794)<br>
**Date created:** 2020/09/18<br>
**Last modified:** 2020/09/18<br>
**Description:** Implement a Masked Language Model (MLM) with BERT and fine-tune it on the IMDB Reviews dataset.

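## Introduction

Masked Language Modeling is a fill-in-the-blank task, where a model uses the context words surrounding a mask token to try to predict what the masked word should be.

For an input that contains one or more mask tokens, the model will generate the most likely substitution for each.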

Example:

- Input: "I have watched this [MASK] and it was awesome."
- Output: "I have watched this movie and it was awesome."
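
To make the input/output pair above concrete, here is a minimal sketch in plain Python that builds one masked training pair from a sentence. The `make_mlm_pair` helper and its whitespace tokenization are illustrative assumptions; they are not part of this example's pipeline.

```python
def make_mlm_pair(sentence, target, mask_token="[MASK]"):
    """Hypothetical helper: hide the first occurrence of `target` in `sentence`.

    Returns the masked sentence, the masked position, and the hidden word
    (the label a masked language model would be trained to predict).
    """
    words = sentence.split()  # naive whitespace tokenization (assumption)
    position = words.index(target)
    masked_words = words.copy()
    masked_words[position] = mask_token
    return " ".join(masked_words), position, target


masked_sentence, position, label = make_mlm_pair(
    "I have watched this movie and it was awesome.", "movie"
)
print(masked_sentence)
print(position, label)
```

The model only ever sees `masked_sentence`; the hidden word serves as the training target, which is why no human-written labels are needed.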

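Masked language modeling is a great way to train a language model in a self-supervised setting (without human-annotated labels). Such a model can then be fine-tuned to accomplish various supervised NLP tasks.

This example teaches you how to build a BERT model from scratch, train it with the masked language modeling task, and then fine-tune this model on a sentiment classification task.

We will use the Keras `TextVectorization` and `MultiHeadAttention` layers to create a BERT Transformer-Encoder network architecture.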

In [1]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import TextVectorization
from dataclasses import dataclass
import pandas as pd
import numpy as np
import glob
import re
from pprint import pprint

## Setup

In [2]:
@dataclass
class Config:
    MAX_LEN = 256
    BATCH_SIZE = 32
    LR = 0.001
    VOCAB_SIZE = 30000
    EMBED_DIM = 128
    NUM_HEAD = 8  # used in bert model
    FF_DIM = 128  # used in bert model
    NUM_LAYERS = 1

config = Config()
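
As a quick sanity check on these settings, the sketch below derives two quantities that the model code relies on later: the per-head key dimension (`EMBED_DIM // NUM_HEAD`) and the size of the word embedding table. The constants are copied from `Config`; the variable names are only illustrative.

```python
# Values copied from the Config class above.
EMBED_DIM = 128
NUM_HEAD = 8
VOCAB_SIZE = 30000

# The embedding width should split evenly across the attention heads.
assert EMBED_DIM % NUM_HEAD == 0
key_dim = EMBED_DIM // NUM_HEAD

# One EMBED_DIM-sized vector is stored per vocabulary entry.
word_embedding_params = VOCAB_SIZE * EMBED_DIM

print(key_dim)
print(word_embedding_params)
```

The embedding table accounts for 3,840,000 parameters, which agrees with the `word_embedding` row of the model summary printed later in this example.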

## Load the data

We will first download the IMDB data and load it into a Pandas dataframe.

In [3]:
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz

In [4]:
def get_text_list_from_files(files):
    text_list = []
    for name in files:
        with open(name, encoding='UTF-8') as f:
            for line in f:
                text_list.append(line)
    return text_list


def get_data_from_text_files(folder_name):

    pos_files = glob.glob("aclImdb/" + folder_name + "/pos/*.txt")
    pos_texts = get_text_list_from_files(pos_files)
#     print(pos_texts)
    neg_files = glob.glob("aclImdb/" + folder_name + "/neg/*.txt")
    neg_texts = get_text_list_from_files(neg_files)
    df = pd.DataFrame(
        {
            "review": pos_texts + neg_texts,
            "sentiment": [0] * len(pos_texts) + [1] * len(neg_texts),
        }
    )
    df = df.sample(len(df)).reset_index(drop=True)
    return df


train_df = get_data_from_text_files("train")
test_df = get_data_from_text_files("test")

print(train_df.head())

# `DataFrame.append` was removed in pandas 2.0; `pd.concat` is the equivalent call.
all_data = pd.concat([train_df, test_df])

                                              review  sentiment
0  Sholay: Considered to be one of the greatest f...          1
1  Raising Victor Vargas fails terribly in what i...          1
2  From the moment the film begins, already there...          1
3  This documentary film is based on incomplete c...          1
4  Honestly, I can't be bothered to spend my time...          1


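## Dataset preparation

We will use the `TextVectorization` layer to vectorize the text into integer token ids. It transforms a batch of strings into either a sequence of token indices (one sample = 1D array of integer token indices, in order) or a dense representation (one sample = 1D array of float values encoding an unordered set of tokens).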
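
The integer output mode can be pictured with a small pure-Python sketch. This is not the Keras implementation: `build_vocab` and `encode_text` are hypothetical helpers, and the alphabetical tie-breaking between equally frequent words is an assumption made only so the toy result is deterministic. What it does mirror is the layout visible in the vocabulary printed below: index 0 is reserved for padding and index 1 for out-of-vocabulary words.

```python
from collections import Counter


def build_vocab(texts, max_tokens):
    """Hypothetical helper: most frequent words first, after '' and '[UNK]'."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Assumption: ties are broken alphabetically to keep the toy deterministic.
    ordered = sorted(counts, key=lambda word: (-counts[word], word))
    return ["", "[UNK]"] + ordered[: max_tokens - 2]


def encode_text(text, vocab, sequence_length):
    """Map words to ids, use 1 for unknown words, pad with 0 to a fixed length."""
    index = {token: i for i, token in enumerate(vocab)}
    ids = [index.get(word, 1) for word in text.lower().split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))


toy_texts = ["the movie was good", "the movie was bad", "the end"]
toy_vocab = build_vocab(toy_texts, max_tokens=10)
toy_ids = encode_text("the movie was great", toy_vocab, sequence_length=6)
print(toy_vocab)
print(toy_ids)
```

The word "great" never appears in the toy corpus, so it maps to the `[UNK]` id, and the sequence is right-padded with zeros up to the requested length.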

Below, we define 3 preprocessing functions.

1.  The `get_vectorize_layer` function builds the `TextVectorization` layer.
2.  The `encode` function encodes raw text into integer token ids.
3.  The `get_masked_input_and_labels` function will mask input token ids. It masks 15% of all input tokens in each sequence at random.
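
The masking code in this example uses three thresholds: 0.15 to select tokens, 0.90 to decide which selected tokens are altered, and 1/9 to turn some of the altered tokens into random ones. The sketch below works out, with exact fractions, what share of the selected tokens ends up in each group. These are expected proportions only; the variable names are illustrative, and any single run will deviate because the draws are random.

```python
from fractions import Fraction

select_rate = Fraction(15, 100)     # tokens chosen for prediction
mask_or_random = Fraction(90, 100)  # chosen tokens whose input is altered
random_share = Fraction(1, 9)       # altered tokens that become a random token

p_random = mask_or_random * random_share  # replaced by a random token
p_mask = mask_or_random - p_random        # replaced by the mask token
p_unchanged = 1 - mask_or_random          # left as the original token

print(p_mask, p_random, p_unchanged)
print(select_rate * p_mask)  # share of all eligible tokens shown as the mask token
```

So roughly 80% of the selected tokens are replaced by the mask token, 10% by a random token, and 10% are left unchanged. Because padding and other low token ids are excluded from selection, the effective masking rate over a padded sequence is lower than 15%.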

In [10]:
all_data.review.values.tolist()

["Sholay: Considered to be one of the greatest films. I always wondered if they would ever remake being the classic it is. That was the time RGV announced this movie and I was somewhat excited to see it. I always thought that maybe this will be a good movie, but every week we would here RGV change something. And the movie is a very B-Grade movie, something that I had not hoped.<br /><br />I really tried looking for positives, but I promised to keep Sholay out of my mind. The cinematography is awesome. The movie tries to be its own. But that is the up side. The action sequences are weak. The screenplay had potential. The biggest flaw is editing. None of the scenes excite you. For example, the comedy sequences felt very out of place and forced. Ironic because comedy was just as entertaining in the original. And none of the characters are developed. And no scenes will linger until the end. And the ending was very disappointing.<br /><br />The biggest question is acting. Amitabh Bachchan w

In [15]:
texts = ['You got to go and dig those holes. Holes only leaves troble, which makes a movie so good. Disney has done it again.Shia LaBeouf should be nominated for Best Actor for his performance as Stanley Yelnats. He has alredy won the Daytime Emmy for Best Actor in a Comedy Series (Even Stevens). Holes is one of the best movies in 2003.',
         'This film made John Glover a star. Alan Raimy is one of the most compelling character that I have ever seen on film. And I mean that sport.']
vectorize_layer = TextVectorization(
        max_tokens=100,
        output_mode="int",
        output_sequence_length=100,
    )

vectorize_layer.adapt(texts)
vocab = vectorize_layer.get_vocabulary()
print(vocab)
encoded_texts = vectorize_layer(texts)
print(encoded_texts)

['', '[UNK]', 'the', 'holes', 'for', 'best', 'a', 'that', 'one', 'of', 'is', 'in', 'i', 'has', 'film', 'and', 'actor', 'you', 'yelnats', 'won', 'which', 'troble', 'to', 'those', 'this', 'stevens', 'star', 'stanley', 'sport', 'so', 'should', 'series', 'seen', 'raimy', 'performance', 'only', 'on', 'nominated', 'movies', 'movie', 'most', 'mean', 'makes', 'made', 'leaves', 'labeouf', 'john', 'it', 'his', 'he', 'have', 'got', 'good', 'go', 'glover', 'ever', 'even', 'emmy', 'done', 'disney', 'dig', 'daytime', 'compelling', 'comedy', 'character', 'be', 'as', 'alredy', 'alan', 'againshia', '2003']
tf.Tensor(
[[17 51 22 53 15 60 23  3  3 35 44 21 20 42  6 39 29 52 59 13 58 47 69 45
  30 65 37  4  5 16  4 48 34 66 27 18 49 13 67 19  2 61 57  4  5 16 11  6
  63 31 56 25  3 10  8  9  2  5 38 11 70  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0]
 [24 14 43 46 54  6 26 68 33 10  8  9  2 40 62 64  7 12 50 55 32 36 14 15
  12 41

In [76]:
encoded_texts = np.arange(80).reshape(10,8)
print(encoded_texts)
inp_mask = np.random.rand(10,8) < 0.15
inp_mask[encoded_texts <= 2] = False

labels = -1 * np.ones(encoded_texts.shape, dtype=int)
print(labels)
labels[inp_mask] = encoded_texts[inp_mask]
print(labels)
encoded_texts_masked = np.copy(encoded_texts)
inp_mask_2mask = inp_mask & (np.random.rand(*encoded_texts.shape) < 0.90)
print(inp_mask_2mask)
encoded_texts_masked[inp_mask_2mask] = 99
print(encoded_texts_masked)

inp_mask_2random = inp_mask_2mask & (np.random.rand(*encoded_texts.shape) < 1 / 9)
encoded_texts_masked[inp_mask_2random] = np.random.randint( 3, 80, inp_mask_2random.sum())
# np.random.randint( 3, 80, inp_mask_2random.sum())
print(encoded_texts_masked)

sample_weights = np.ones(labels.shape)
print(sample_weights)
sample_weights[labels == -1] = 0
print(sample_weights)

[[ 0  1  2  3  4  5  6  7]
 [ 8  9 10 11 12 13 14 15]
 [16 17 18 19 20 21 22 23]
 [24 25 26 27 28 29 30 31]
 [32 33 34 35 36 37 38 39]
 [40 41 42 43 44 45 46 47]
 [48 49 50 51 52 53 54 55]
 [56 57 58 59 60 61 62 63]
 [64 65 66 67 68 69 70 71]
 [72 73 74 75 76 77 78 79]]
[[-1 -1 -1 -1 -1 -1 -1 -1]
 [-1 -1 -1 -1 -1 -1 -1 -1]
 [-1 -1 -1 -1 -1 -1 -1 -1]
 [-1 -1 -1 -1 -1 -1 -1 -1]
 [-1 -1 -1 -1 -1 -1 -1 -1]
 [-1 -1 -1 -1 -1 -1 -1 -1]
 [-1 -1 -1 -1 -1 -1 -1 -1]
 [-1 -1 -1 -1 -1 -1 -1 -1]
 [-1 -1 -1 -1 -1 -1 -1 -1]
 [-1 -1 -1 -1 -1 -1 -1 -1]]
[[-1 -1 -1 -1 -1 -1 -1 -1]
 [-1 -1 -1 -1 -1 13 -1 -1]
 [-1 -1 -1 -1 -1 -1 -1 -1]
 [-1 -1 26 -1 -1 -1 -1 -1]
 [-1 -1 -1 -1 -1 -1 -1 -1]
 [-1 41 -1 -1 -1 -1 -1 -1]
 [-1 49 -1 -1 -1 53 -1 -1]
 [-1 -1 58 59 -1 -1 -1 -1]
 [-1 -1 -1 -1 -1 -1 -1 -1]
 [-1 -1 -1 -1 -1 -1 -1 -1]]
[[False False False False False False False False]
 [False False False False False  True False False]
 [False False False False False False False False]
 [False False  True False False Fa

In [99]:
def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
    return tf.strings.regex_replace(
        stripped_html, "[%s]" % re.escape(r"!#$%&'()*+,-./:;<=>?@\^_`{|}~"), ""
    )


def get_vectorize_layer(texts, vocab_size, max_seq, special_tokens=["[MASK]"]):
    """Build Text vectorization layer

    Args:
      texts (list): List of strings, i.e. the input texts
      vocab_size (int): Vocabulary size
      max_seq (int): Maximum sequence length.
      special_tokens (list, optional): List of special tokens. Defaults to ['[MASK]'].

    Returns:
        layers.Layer: Return TextVectorization Keras Layer
    """
    vectorize_layer = TextVectorization(
        max_tokens=vocab_size,
        output_mode="int",
        standardize=custom_standardization,
        output_sequence_length=max_seq,
    )
    vectorize_layer.adapt(texts)

    # Insert mask token in vocabulary
    vocab = vectorize_layer.get_vocabulary()
#     print(vocab)
    vocab = vocab[2 : vocab_size - len(special_tokens)] + ["[mask]"]
#     print(vocab)
    vectorize_layer.set_vocabulary(vocab)
    return vectorize_layer


vectorize_layer = get_vectorize_layer(
    all_data.review.values.tolist(),
    config.VOCAB_SIZE,
    config.MAX_LEN,
    special_tokens=["[mask]"],
)

# vocab = vectorize_layer.get_vocabulary()
# print(vocab)

# Get mask token id for masked language model
mask_token_id = vectorize_layer(["[mask]"]).numpy()[0][0]
# print(mask_token_id)

def encode(texts):
    encoded_texts = vectorize_layer(texts)
    return encoded_texts.numpy()


def get_masked_input_and_labels(encoded_texts):
    # 15% BERT masking
    inp_mask = np.random.rand(*encoded_texts.shape) < 0.15
    # Do not mask special tokens
    inp_mask[encoded_texts <= 2] = False
    # Set targets to -1 by default, it means ignore
    labels = -1 * np.ones(encoded_texts.shape, dtype=int)
    # Set labels for masked tokens
    labels[inp_mask] = encoded_texts[inp_mask]

    # Prepare input
    encoded_texts_masked = np.copy(encoded_texts)
    # Set input to [MASK], which is the last token, for 90% of the selected tokens
    # This means leaving 10% unchanged
    inp_mask_2mask = inp_mask & (np.random.rand(*encoded_texts.shape) < 0.90)
    encoded_texts_masked[
        inp_mask_2mask
    ] = mask_token_id  # mask token is the last in the dict

    # Set 10% to a random token
    inp_mask_2random = inp_mask_2mask & (np.random.rand(*encoded_texts.shape) < 1 / 9)
    encoded_texts_masked[inp_mask_2random] = np.random.randint(
        3, mask_token_id, inp_mask_2random.sum()
    )

    # Prepare sample_weights to pass to .fit() method
    sample_weights = np.ones(labels.shape)
    sample_weights[labels == -1] = 0

    # y_labels would be same as encoded_texts, i.e. the input tokens
    y_labels = np.copy(encoded_texts)

    return encoded_texts_masked, y_labels, sample_weights


# We have 25000 examples for training
x_train = encode(train_df.review.values)  # encode reviews with vectorizer
print(x_train.shape)

y_train = train_df.sentiment.values
print(y_train.shape)
train_classifier_ds = (
    tf.data.Dataset.from_tensor_slices((x_train, y_train))
    .shuffle(1000)
    .batch(config.BATCH_SIZE)
)

# We have 25000 examples for testing
x_test = encode(test_df.review.values)
y_test = test_df.sentiment.values
test_classifier_ds = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(
    config.BATCH_SIZE
)

# Build dataset for end-to-end model input (will be used at the end)
test_raw_classifier_ds = tf.data.Dataset.from_tensor_slices(
    (test_df.review.values, y_test)
).batch(config.BATCH_SIZE)

# Prepare data for masked language model
x_all_review = encode(all_data.review.values)
print(x_all_review.shape)

x_masked_train, y_masked_labels, sample_weights = get_masked_input_and_labels(
    x_all_review
)

# print(x_masked_train[0])
# print(y_masked_labels[0])
# print(sample_weights[0])

mlm_ds = tf.data.Dataset.from_tensor_slices(
    (x_masked_train, y_masked_labels, sample_weights)
)
mlm_ds = mlm_ds.shuffle(1000).batch(config.BATCH_SIZE)

(25000, 256)
(25000,)
(50000, 256)


In [100]:
sent = """This is really a new low in entertainment. () {} << Even though there are a lot worse movies out.<br /><br />In the Gangster / Drug scene genre it is hard to have a convincing storyline (this movies does not, i mean Sebastians motives for example couldn't be more far fetched and worn out cliché.) Then you would also need a setting of character relationships that is believable (this movie does not.) <br /><br />Sure Tristan is drawn away from his family but why was that again? what's the deal with his father again that he has to ask permission to go out at his age? interesting picture though to ask about the lack and need of rebellious behavior of kids in upper class family. But this movie does not go in this direction. Even though there would be the potential judging by the random Backflashes. Wasn't he already down and out, why does he do it again? <br /><br />So there are some interesting questions brought up here for a solid socially critic drama (but then again, this movie is just not, because of focusing on "cool" production techniques and special effects an not giving the characters a moment to reflect and most of all forcing the story along the path where they want it to be and not paying attention to let the story breath and naturally evolve.) <br /><br />It wants to be a drama to not glorify abuse of substances and violence (would be political incorrect these days, wouldn't it?) but on the other hand it is nothing more then a cheap action movie (like there are so so many out there) with an average set of actors and a Vinnie Jones who is managing to not totally ruin what's left of his reputation by doing what he always does.<br /><br />So all in all i .. just ... can't recommend it.<br /><br />1 for Vinnie and 2 for the editing."""

sent = custom_standardization(sent)
print(sent)

tf.Tensor(b'this is really a new low in entertainment    even though there are a lot worse movies out  in the gangster  drug scene genre it is hard to have a convincing storyline this movies does not i mean sebastians motives for example couldnt be more far fetched and worn out clich\xc3\xa9 then you would also need a setting of character relationships that is believable this movie does not   sure tristan is drawn away from his family but why was that again whats the deal with his father again that he has to ask permission to go out at his age interesting picture though to ask about the lack and need of rebellious behavior of kids in upper class family but this movie does not go in this direction even though there would be the potential judging by the random backflashes wasnt he already down and out why does he do it again   so there are some interesting questions brought up here for a solid socially critic drama but then again this movie is just not because of focusing on "cool" produ
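
The standardization step can be mirrored in plain Python to see exactly which characters are removed. The sketch below is an approximation for illustration: `standardize_py` is a hypothetical name, and it uses Python's `re` module rather than `tf.strings`. Note that the punctuation set contains neither square brackets nor double quotes, which is why a `[mask]` token written into the text survives standardization and why `"cool"` keeps its quotes in the output above.

```python
import re

# Same punctuation set as in custom_standardization (no brackets, no double quote).
PUNCTUATION = "!#$%&'()*+,-./:;<=>?@\\^_`{|}~"


def standardize_py(text):
    """Hypothetical pure-Python mirror of custom_standardization."""
    lowered = text.lower()
    without_breaks = lowered.replace("<br />", " ")
    return re.sub("[%s]" % re.escape(PUNCTUATION), "", without_breaks)


cleaned = standardize_py('This is <br />GREAT!! [mask] "cool"')
print(cleaned)
```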

In [101]:
a = np.random.randint( 3, 29999, 2)
a

array([23008,  1762])

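## Create BERT model (Pretraining Model) for masked language modeling

We will create a BERT-like pretraining model architecture using the `MultiHeadAttention` layer. It will take token ids as inputs (including masked tokens) and it will predict the correct ids for the masked input tokens.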
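
At the core of `MultiHeadAttention` is scaled dot-product attention. The sketch below shows a single head in plain Python, without the learned query/key/value/output projections that the Keras layer adds, so it is a simplified illustration rather than what the layer computes. The function names are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def scaled_dot_product_attention(query, key, value):
    """Single-head attention over lists of vectors (no learned projections)."""
    d_k = len(key[0])
    output = []
    for q in query:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in key]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        output.append(
            [sum(w * v[j] for w, v in zip(weights, value)) for j in range(len(value[0]))]
        )
    return output


tokens = [[1.0, 0.0], [0.0, 1.0]]
# Self-attention: the same sequence is used as query, key and value.
attended = scaled_dot_product_attention(tokens, tokens, tokens)
print(attended)
```

Passing the same sequence three times corresponds to how the encoder block in this example calls its attention layer with `(query, key, value)` all set to the encoder input.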

In [102]:

def bert_module(query, key, value, i):
    # Multi headed self-attention
    attention_output = layers.MultiHeadAttention(
        num_heads=config.NUM_HEAD,
        key_dim=config.EMBED_DIM // config.NUM_HEAD,
        name="encoder_{}/multiheadattention".format(i),
    )(query, key, value)
    attention_output = layers.Dropout(0.1, name="encoder_{}/att_dropout".format(i))(
        attention_output
    )
    attention_output = layers.LayerNormalization(
        epsilon=1e-6, name="encoder_{}/att_layernormalization".format(i)
    )(query + attention_output)

    # Feed-forward layer
    ffn = keras.Sequential(
        [
            layers.Dense(config.FF_DIM, activation="relu"),
            layers.Dense(config.EMBED_DIM),
        ],
        name="encoder_{}/ffn".format(i),
    )
    ffn_output = ffn(attention_output)
    ffn_output = layers.Dropout(0.1, name="encoder_{}/ffn_dropout".format(i))(
        ffn_output
    )
    sequence_output = layers.LayerNormalization(
        epsilon=1e-6, name="encoder_{}/ffn_layernormalization".format(i)
    )(attention_output + ffn_output)
    return sequence_output


def get_pos_encoding_matrix(max_len, d_emb):
    pos_enc = np.array(
        [
            [pos / np.power(10000, 2 * (j // 2) / d_emb) for j in range(d_emb)]
            if pos != 0
            else np.zeros(d_emb)
            for pos in range(max_len)
        ]
    )
    pos_enc[1:, 0::2] = np.sin(pos_enc[1:, 0::2])  # dim 2i
    pos_enc[1:, 1::2] = np.cos(pos_enc[1:, 1::2])  # dim 2i+1
    return pos_enc


loss_fn = keras.losses.SparseCategoricalCrossentropy(
    reduction=tf.keras.losses.Reduction.NONE
)
loss_tracker = tf.keras.metrics.Mean(name="loss")


class MaskedLanguageModel(tf.keras.Model):
    def train_step(self, inputs):
        if len(inputs) == 3:
            features, labels, sample_weight = inputs
        else:
            features, labels = inputs
            sample_weight = None

        with tf.GradientTape() as tape:
            predictions = self(features, training=True)
            loss = loss_fn(labels, predictions, sample_weight=sample_weight)

        # Compute gradients
        trainable_vars = self.trainable_variables
        gradients = tape.gradient(loss, trainable_vars)

        # Update weights
        self.optimizer.apply_gradients(zip(gradients, trainable_vars))

        # Compute our own metrics
        loss_tracker.update_state(loss, sample_weight=sample_weight)

        # Return a dict mapping metric names to current value
        return {"loss": loss_tracker.result()}

    @property
    def metrics(self):
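        # We list our `Metric` objects here so that `reset_states()` can be
        # called automatically at the start of each epoch
        # or at the start of `evaluate()`.
        # If you don't implement this property, you have to call
        # `reset_states()` yourself at the time of your choosing.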
        return [loss_tracker]


def create_masked_language_bert_model():
    inputs = layers.Input((config.MAX_LEN,), dtype=tf.int64)

    word_embeddings = layers.Embedding(
        config.VOCAB_SIZE, config.EMBED_DIM, name="word_embedding"
    )(inputs)
    position_embeddings = layers.Embedding(
        input_dim=config.MAX_LEN,
        output_dim=config.EMBED_DIM,
        weights=[get_pos_encoding_matrix(config.MAX_LEN, config.EMBED_DIM)],
        name="position_embedding",
    )(tf.range(start=0, limit=config.MAX_LEN, delta=1))
    embeddings = word_embeddings + position_embeddings

    encoder_output = embeddings
    for i in range(config.NUM_LAYERS):
        encoder_output = bert_module(encoder_output, encoder_output, encoder_output, i)

    mlm_output = layers.Dense(config.VOCAB_SIZE, name="mlm_cls", activation="softmax")(
        encoder_output
    )
    mlm_model = MaskedLanguageModel(inputs, mlm_output, name="masked_bert_model")

    optimizer = keras.optimizers.Adam(learning_rate=config.LR)
    mlm_model.compile(optimizer=optimizer)
    return mlm_model


id2token = dict(enumerate(vectorize_layer.get_vocabulary()))
token2id = {y: x for x, y in id2token.items()}


class MaskedTextGenerator(keras.callbacks.Callback):
    def __init__(self, sample_tokens, top_k=5):
        self.sample_tokens = sample_tokens
        self.k = top_k

    def decode(self, tokens):
        return " ".join([id2token[t] for t in tokens if t != 0])

    def convert_ids_to_tokens(self, id):
        return id2token[id]

    def on_epoch_end(self, epoch, logs=None):
        prediction = self.model.predict(self.sample_tokens)

        masked_index = np.where(self.sample_tokens == mask_token_id)
        masked_index = masked_index[1]
        mask_prediction = prediction[0][masked_index]

        top_indices = mask_prediction[0].argsort()[-self.k :][::-1]
        values = mask_prediction[0][top_indices]

        for i in range(len(top_indices)):
            p = top_indices[i]
            v = values[i]
            # Use the tokens stored on the callback rather than a global variable.
            tokens = np.copy(self.sample_tokens[0])
            tokens[masked_index[0]] = p
            result = {
                "input_text": self.decode(self.sample_tokens[0]),
                "prediction": self.decode(tokens),
                "probability": v,
                "predicted mask token": self.convert_ids_to_tokens(p),
            }
            pprint(result)


sample_tokens = vectorize_layer(["I have watched this [mask] and it was awesome"])
generator_callback = MaskedTextGenerator(sample_tokens.numpy())

bert_masked_model = create_masked_language_bert_model()
bert_masked_model.summary()

Model: "masked_bert_model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_1 (InputLayer)           [(None, 256)]        0           []                               
                                                                                                  
 word_embedding (Embedding)     (None, 256, 128)     3840000     ['input_1[0][0]']                
                                                                                                  
 tf.__operators__.add (TFOpLamb  (None, 256, 128)    0           ['word_embedding[0][0]']         
 da)                                                                                              
                                                                                                  
 encoder_0/multiheadattention (  (None, 256, 128)    66048       ['tf.__operators_
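
The `get_pos_encoding_matrix` helper defined in the cell above fills the position embedding with fixed sinusoids: even dimensions use a sine and odd dimensions a cosine of `pos / 10000^(2 * (j // 2) / d_emb)`, and row 0 is left as all zeros. The sketch below reproduces the same table with the standard library only, so individual entries can be checked by hand. The function name `positional_encoding` is an assumption.

```python
import math


def positional_encoding(max_len, d_emb):
    """Pure-Python mirror of get_pos_encoding_matrix (row 0 stays all zeros)."""
    table = [[0.0] * d_emb for _ in range(max_len)]
    for pos in range(1, max_len):
        for j in range(d_emb):
            angle = pos / math.pow(10000, 2 * (j // 2) / d_emb)
            # Even dimensions use sine, odd dimensions use cosine.
            table[pos][j] = math.sin(angle) if j % 2 == 0 else math.cos(angle)
    return table


pe = positional_encoding(max_len=4, d_emb=8)
print(pe[1][:4])
```

Each pair of dimensions shares one wavelength, so neighbouring positions get similar vectors in the slowly varying dimensions and distinct ones in the fast dimensions.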

## Train and Save

In [103]:
bert_masked_model.fit(mlm_ds, epochs=5, callbacks=[generator_callback])
bert_masked_model.save("bert_mlm_imdb.h5")

Epoch 1/5
 'predicted mask token': 'this',
 'prediction': 'i have watched this this and it was awesome',
 'probability': 0.05381592}
{'input_text': 'i have watched this [mask] and it was awesome',
 'predicted mask token': 'i',
 'prediction': 'i have watched this i and it was awesome',
 'probability': 0.046898097}
{'input_text': 'i have watched this [mask] and it was awesome',
 'predicted mask token': 'to',
 'prediction': 'i have watched this to and it was awesome',
 'probability': 0.030786717}
{'input_text': 'i have watched this [mask] and it was awesome',
 'predicted mask token': 'a',
 'prediction': 'i have watched this a and it was awesome',
 'probability': 0.026925892}
{'input_text': 'i have watched this [mask] and it was awesome',
 'predicted mask token': 'movie',
 'prediction': 'i have watched this movie and it was awesome',
 'probability': 0.021953668}
Epoch 2/5
 'predicted mask token': 'movie',
 'prediction': 'i have watched this movie and it was awesome',
 'probability': 0.2870
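
During training, `sample_weights` is 1 at the positions selected for prediction and 0 everywhere else, so only those positions contribute to the reported loss. The sketch below illustrates that idea as a weighted mean of per-token cross-entropy. It is a rough illustration, not the exact reduction Keras performs, and `masked_cross_entropy` is a hypothetical name.

```python
import math


def masked_cross_entropy(probabilities, labels, sample_weights):
    """Weighted mean of -log p(label); zero-weight positions are ignored."""
    weighted_sum = 0.0
    weight_total = 0.0
    for probs, label, weight in zip(probabilities, labels, sample_weights):
        weighted_sum += -math.log(probs[label]) * weight
        weight_total += weight
    return weighted_sum / weight_total


# Three token positions, vocabulary of size 3; position 0 was not selected.
toy_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.25, 0.25, 0.5]]
toy_labels = [0, 1, 2]
toy_weights = [0.0, 1.0, 1.0]

loss = masked_cross_entropy(toy_probs, toy_labels, toy_weights)
print(loss)
```

Changing the prediction at position 0 leaves the result untouched, which is the behaviour the zero weights are meant to achieve.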

In [104]:
bert_masked_model = keras.models.load_model(
    "bert_mlm_imdb.h5", custom_objects={"MaskedLanguageModel": MaskedLanguageModel}
)

In [117]:
def decode(tokens):
    return " ".join([id2token[t] for t in tokens if t != 0])

# sample_tokens = vectorize_layer(["I have watched this [mask] and it was awesome"])
# sample_tokens = vectorize_layer(["This movie did not even [mask] close to being boring."])
# this is really a new low in entertainment
sample_tokens = vectorize_layer(["this is really a new [mask] in entertainment even though there are a lot worse movies out"])

print(sample_tokens)
prediction = bert_masked_model.predict(sample_tokens)
print(prediction.shape)

masked_index = np.where(sample_tokens == mask_token_id)
print(masked_index)
masked_index = masked_index[1]
print(masked_index)
mask_prediction = prediction[0][masked_index]
print(mask_prediction.shape)
print(mask_prediction[0][mask_prediction[0].argsort()[-5:][::-1]])
top_indices = mask_prediction[0].argsort()[-5 :][::-1]
print(top_indices)
values = mask_prediction[0][top_indices]
print(values)

for i in range(len(top_indices)):
    p = top_indices[i]
    v = values[i]
    tokens = np.copy(sample_tokens[0])
    tokens[masked_index[0]] = p
    result = {
        "input_text": decode(sample_tokens[0].numpy()),
        "prediction": decode(tokens),
        "probability": v,
        "predicted mask token": id2token[p],
    }
    pprint(result)

tf.Tensor(
[[   11     7    62     4   167 29999     8   724    53   152    47    23
      4   163   419    92    45     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0  
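
The cell above picks the five most likely tokens with `argsort()[-5:][::-1]`, which sorts ascending, keeps the last five indices and reverses them. The same selection is sketched below in plain Python on a toy distribution; `top_k` and the probabilities are made up for illustration.

```python
def top_k(probabilities, k):
    """Return (index, probability) pairs for the k largest entries, best first."""
    order = sorted(
        range(len(probabilities)), key=lambda i: probabilities[i], reverse=True
    )
    return [(i, probabilities[i]) for i in order[:k]]


toy_distribution = [0.05, 0.40, 0.10, 0.30, 0.15]
best = top_k(toy_distribution, k=3)
print(best)
```

In the notebook, each returned index is then looked up in `id2token` to recover the predicted word.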

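## Fine-tune a sentiment classification model

We will fine-tune our self-supervised model on a downstream task of sentiment classification. To do this, let's create a classifier by adding a pooling layer and a `Dense` layer on top of the pretrained BERT features.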
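
The pooling layer used here is `GlobalMaxPooling1D`, which collapses the sequence axis by keeping, for every feature, the largest value seen at any position. A minimal sketch of that reduction for a single sequence follows; the helper name and the toy values are assumptions.

```python
def global_max_pool(sequence):
    """Max over the sequence axis: (seq_len, features) -> (features,)."""
    return [max(column) for column in zip(*sequence)]


# Four token positions with three features each (made-up values).
toy_sequence = [
    [0.1, -2.0, 0.3],
    [0.7, 0.5, -1.0],
    [-0.4, 0.9, 0.2],
    [0.0, 0.0, 0.0],
]
pooled = global_max_pool(toy_sequence)
print(pooled)
```

In the classifier this turns the `(256, 128)` encoder output of each review into a single 128-dimensional vector that the `Dense` layers can consume.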

In [120]:
# Load pretrained bert model
mlm_model = keras.models.load_model(
    "bert_mlm_imdb.h5", custom_objects={"MaskedLanguageModel": MaskedLanguageModel}
)
pretrained_bert_model = tf.keras.Model(
    mlm_model.input, mlm_model.get_layer("encoder_0/ffn_layernormalization").output
)

# Freeze it
pretrained_bert_model.trainable = False


def create_classifier_bert_model():
    inputs = layers.Input((config.MAX_LEN,), dtype=tf.int64)
    sequence_output = pretrained_bert_model(inputs)
    pooled_output = layers.GlobalMaxPooling1D()(sequence_output)
    hidden_layer = layers.Dense(64, activation="relu")(pooled_output)
    outputs = layers.Dense(1, activation="sigmoid")(hidden_layer)
    classifer_model = keras.Model(inputs, outputs, name="classification")
    optimizer = keras.optimizers.Adam()
    classifer_model.compile(
        optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"]
    )
    return classifer_model


classifer_model = create_classifier_bert_model()
classifer_model.summary()

# Train the classifier with frozen BERT stage
# classifer_model.fit(
#     train_classifier_ds,
#     epochs=5,
#     validation_data=test_classifier_ds,
# )

# Unfreeze the BERT model for fine-tuning
pretrained_bert_model.trainable = True
optimizer = keras.optimizers.Adam()
classifer_model.compile(
    optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"]
)
classifer_model.fit(
    train_classifier_ds,
    epochs=5,
    validation_data=test_classifier_ds,
)

Model: "classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_4 (InputLayer)        [(None, 256)]             0         
                                                                 
 model_1 (Functional)        (None, 256, 128)          3939584   
                                                                 
 global_max_pooling1d_1 (Glo  (None, 128)              0         
 balMaxPooling1D)                                                
                                                                 
 dense_4 (Dense)             (None, 64)                8256      
                                                                 
 dense_5 (Dense)             (None, 1)                 65        
                                                                 
Total params: 3,947,905
Trainable params: 8,321
Non-trainable params: 3,939,584
______________________________________
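
The parameter counts in this summary can be checked by hand. A `Dense` layer with `n_in` inputs and `n_out` units holds `n_in * n_out` weights plus `n_out` biases. The sketch below recomputes the two classifier layers and adds the encoder figure copied from the summary; `dense_params` is a hypothetical helper.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


hidden_params = dense_params(128, 64)  # pooled features -> hidden layer
output_params = dense_params(64, 1)    # hidden layer -> sigmoid output
head_params = hidden_params + output_params

frozen_encoder_params = 3_939_584  # copied from the summary above
total_params = head_params + frozen_encoder_params

print(hidden_params, output_params, head_params, total_params)
```

The summary was printed while the encoder was still frozen, which is why only the 8,321 parameters of the classification head are listed as trainable.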

<keras.callbacks.History at 0x1dcee9a9d60>

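## Create an end-to-end model and evaluate it

When you want to deploy a model, it's best if it already includes its preprocessing pipeline, so that you don't have to reimplement the preprocessing logic in your production environment. Let's create an end-to-end model that incorporates the `TextVectorization` layer, and let's evaluate. Our model will accept raw strings as input.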

In [121]:

def get_end_to_end(model):
    inputs_string = keras.Input(shape=(1,), dtype="string")
    indices = vectorize_layer(inputs_string)
    outputs = model(indices)
    end_to_end_model = keras.Model(inputs_string, outputs, name="end_to_end_model")
    optimizer = keras.optimizers.Adam(learning_rate=config.LR)
    end_to_end_model.compile(
        optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"]
    )
    return end_to_end_model


end_to_end_classification_model = get_end_to_end(classifer_model)
end_to_end_classification_model.evaluate(test_raw_classifier_ds)



[0.8236384987831116, 0.8361999988555908]
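
The two numbers returned by `evaluate` are the loss and the accuracy, in the order given to `compile`: a binary cross-entropy of about 0.82 and an accuracy of about 83.6% on the test set. When turning the model's sigmoid output into a label, keep in mind that this example encodes positive reviews as 0 and negative reviews as 1 (see `get_data_from_text_files`). The sketch below shows that mapping; `to_sentiment` and the 0.5 threshold are assumptions for illustration.

```python
def to_sentiment(probability, threshold=0.5):
    """Map the sigmoid output to a label.

    In this example positive reviews are labelled 0 and negative reviews 1,
    so a high probability means "negative".
    """
    return "negative" if probability >= threshold else "positive"


print(to_sentiment(0.91))
print(to_sentiment(0.12))
```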