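BERT is pre-trained on a task that asks it to predict some of the words in a sentence.  
In `GPT`, the model sees __< SOS >__ and predicts I, then sees I and predicts study. In other words, it predicts the next word from the preceding words only, without looking at the context on both sides. BERT is bidirectional: it takes both the left and the right context into account.  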

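A Masked Language Model (MLM) is used for pre-training  
* Uses the bidirectional idea: when a sentence has a blank, the model infers it from both the left and the right context  
* Each word of a sentence such as I study math is turned into a blank with some probability, and the model predicts it  
* Input words are masked with probability k%, and the masked words are predicted  
    * If the ratio is too high, too little context is left to predict the masked words  
    * If the ratio is too low, training becomes inefficient or slow  
    * The most suitable value is k = 15  
* Problem: the mask token never appears during fine-tuning. The word relationships learned in pre-training (which account for the mask) do not show up in the actual task (classification, etc.), which creates a mismatch.  
* Solution  
    * Still predict 15% of the words, but do not replace all of them with the mask token  
    * 80% are replaced with the mask, 10% are replaced with a random word, and 10% are left unchanged  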
   
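* Pre-training task in BERT: Next Sentence Prediction (a way to learn relationships between sentences)  
    * Take two sentences, append [SEP] to each, and concatenate them  
    * Prepend [CLS] to the first sentence; this token carries the prediction over the combined sentences (used in classification tasks, ignored otherwise)   
    * Trained to predict from unlabeled input data alone  
    * The task is to judge whether the second sentence appropriately follows the first  
    * Assign the IsNext label if it does, NotNext if it does not  
    * To understand the relationship between sentences, the model predicts whether sentence B should come after sentence A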
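
The 80/10/10 rule described above can be sketched in plain Python. This is a minimal illustration, not the notebook's own routine: `mask_tokens`, its parameters, and the toy vocabulary are names assumed here, and each token is selected independently with probability 15% rather than by picking a fixed number of positions.

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", select_prob=0.15, rng=None):
    """Corrupt a list of string tokens with the 80/10/10 rule.

    Returns (corrupted_tokens, targets); targets maps position -> original token.
    """
    if rng is None:
        rng = random.Random()
    corrupted = list(tokens)
    targets = {}
    for pos, token in enumerate(tokens):
        if rng.random() >= select_prob:
            continue                              # not selected for prediction
        targets[pos] = token                      # the model must recover this token
        roll = rng.random()
        if roll < 0.8:
            corrupted[pos] = mask_token           # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted[pos] = rng.choice(vocab)    # 10%: replace with a random word
        # remaining 10%: keep the original token
    return corrupted, targets

words = "i study math every day".split()
corrupted, targets = mask_tokens(words, vocab=words, rng=random.Random(0))
print(corrupted, targets)
```

Because the unchanged and random-word cases still count as prediction targets, the model cannot assume that a token it sees without a mask is correct.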

In [4]:
# %%
# code by Tae Hwan Jung(Jeff Jung) @graykode
# Reference : https://github.com/jadore801120/attention-is-all-you-need-pytorch
#           https://github.com/JayParks/transformer, https://github.com/dhlee347/pytorchic-bert
import math
import re
from random import random, randrange, randint, shuffle
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

In [100]:
# sample IsNext and NotNext to be same in small batch size
def make_batch():
    batch = []
    positive = negative = 0
    while positive != batch_size/2 or negative != batch_size/2:
        tokens_a_index, tokens_b_index = randrange(len(sentences)), randrange(len(sentences)) # two random sentence indices
        tokens_a, tokens_b = token_list[tokens_a_index], token_list[tokens_b_index] # token ids of the two chosen sentences
        input_ids = [word_dict['[CLS]']] + tokens_a + [word_dict['[SEP]']] + tokens_b + [word_dict['[SEP]']] # [CLS] A [SEP] B [SEP]
        segment_ids = [0] * (1 + len(tokens_a) + 1) + [1] * (len(tokens_b) + 1) # 0 for segment A, 1 for segment B
        # segment 0 : CLS + tokens_a + SEP
        # segment 1 : tokens_b + SEP

        # MASK LM
        # The training data generator chooses 15% of the token positions at random for prediction
        n_pred = min(max_pred, max(1, int(round(len(input_ids) * 0.15)))) # predict 15% of the tokens (at least 1, at most max_pred)
        cand_maked_pos = [i for i, token in enumerate(input_ids)
                          if token != word_dict['[CLS]'] and token != word_dict['[SEP]']] # indices of tokens other than CLS and SEP
        shuffle(cand_maked_pos) # shuffle so that the first n_pred entries form a random sample
        masked_tokens, masked_pos = [], []
        for pos in cand_maked_pos[:n_pred]: # the first n_pred shuffled positions = a random 15%
            masked_pos.append(pos) # index of the token to predict
            masked_tokens.append(input_ids[pos]) # original token id at that index
            if random() < 0.8:  # 80%: replace with [MASK]
                input_ids[pos] = word_dict['[MASK]'] # make mask
            elif random() < 0.5:  # half of the remaining 20% = 10%: replace with a random token
                index = randint(0, vocab_size - 1) # random index in vocabulary
                input_ids[pos] = word_dict[number_dict[index]] # replace
            # otherwise (10%): keep the original token

        # Zero padding -> pad input_ids and segment_ids to maxlen
        n_pad = maxlen - len(input_ids)
        input_ids.extend([0] * n_pad)
        segment_ids.extend([0] * n_pad)

        # Zero padding for the prediction slots -> pad masked_tokens and masked_pos to max_pred
        if max_pred > n_pred:
            n_pad = max_pred - n_pred
            masked_tokens.extend([0] * n_pad)
            masked_pos.extend([0] * n_pad)

        if tokens_a_index + 1 == tokens_b_index and positive < batch_size/2: # B directly follows A, and fewer than batch_size/2 positives collected so far
            batch.append([input_ids, segment_ids, masked_tokens, masked_pos, True]) # IsNext
            positive += 1
        elif tokens_a_index + 1 != tokens_b_index and negative < batch_size/2: # B does not follow A, and fewer than batch_size/2 negatives collected so far
            batch.append([input_ids, segment_ids, masked_tokens, masked_pos, False]) # NotNext
            negative += 1
    return batch
# Preprocessing finished

___

In [101]:
def get_attn_pad_mask(seq_q, seq_k): # mask that excludes padding positions when computing attention
                                     # Padding = True, Tokens = False
    batch_size, len_q = seq_q.size()
    batch_size, len_k = seq_k.size()
    # eq(zero) is PAD token
    pad_attn_mask = seq_k.data.eq(0).unsqueeze(1)  # batch_size x 1 x len_k(=len_q), True marks padding
    return pad_attn_mask.expand(batch_size, len_q, len_k)  # batch_size x len_q x len_k

In [102]:
input_ids, segment_ids, masked_tokens, masked_pos, isNext = map(torch.LongTensor, zip(*batch))

In [103]:
get_attn_pad_mask(input_ids, input_ids)[0][0] # False = 14 (tokens), True = 16 (zero padding)

tensor([False, False, False, False, False, False, False, False, False, False,
        False, False, False, False,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True])

___

In [104]:
def gelu(x):
    "Implementation of the gelu activation function by Hugging Face"
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
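
As a quick sanity check, the erf-based formula above should agree with PyTorch's built-in `torch.nn.functional.gelu`, whose default mode is the exact erf form. A minimal sketch; `gelu_erf` is a name introduced here for illustration only.

```python
import math
import torch
import torch.nn.functional as F

def gelu_erf(x):
    # Same formula as gelu() above: x * Phi(x), with Phi the standard normal CDF.
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))

x = torch.linspace(-3.0, 3.0, steps=7)
same = torch.allclose(gelu_erf(x), F.gelu(x), atol=1e-6)
print(same)
```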

___

In [105]:
class Embedding(nn.Module):
    def __init__(self):
        super(Embedding, self).__init__()
        self.tok_embed = nn.Embedding(vocab_size, d_model)  # token embedding
        self.pos_embed = nn.Embedding(maxlen, d_model)  # position embedding
        self.seg_embed = nn.Embedding(n_segments, d_model)  # segment embedding (which sentence the token belongs to)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, seg):
        seq_len = x.size(1) # sequence length including padding = maxlen
        pos = torch.arange(seq_len, dtype=torch.long, device=x.device) # like np.arange, but returns a tensor
        pos = pos.unsqueeze(0).expand_as(x)  # (seq_len,) -> (batch_size, seq_len)
        embedding = self.tok_embed(x) + self.pos_embed(pos) + self.seg_embed(seg)
        return self.norm(embedding)

In [106]:
input_ids.size()

torch.Size([6, 30])

In [107]:
torch.arange(30, dtype=torch.long)

tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
        18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29])

___

In [108]:
class ScaledDotProductAttention(nn.Module):
    def __init__(self):
        super(ScaledDotProductAttention, self).__init__()

    def forward(self, Q, K, V, attn_mask):
        scores = torch.matmul(Q, K.transpose(-1, -2)) / np.sqrt(d_k) # scores : [batch_size x n_heads x len_q(=len_k) x len_k(=len_q)]
        scores.masked_fill_(attn_mask, -1e9) # Fills elements of self tensor with value where mask is one.
        attn = nn.Softmax(dim=-1)(scores)
        context = torch.matmul(attn, V)
        return context, attn
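
To see what the `-1e9` fill does, here is a small worked example on toy tensors. The sizes and the choice of which key is "padding" are assumptions made for illustration; the computation mirrors the forward pass above but uses the out-of-place `masked_fill`.

```python
import math
import torch

torch.manual_seed(0)
d_k = 4                                   # toy head size (assumption for this example)
Q = torch.randn(1, 1, 3, d_k)             # [batch, heads, len_q, d_k]
K = torch.randn(1, 1, 3, d_k)
V = torch.randn(1, 1, 3, d_k)
# Pretend the last key position is padding: True marks positions to hide.
attn_mask = torch.tensor([False, False, True]).expand(1, 1, 3, 3)

scores = torch.matmul(Q, K.transpose(-1, -2)) / math.sqrt(d_k)
scores = scores.masked_fill(attn_mask, -1e9)   # out-of-place variant of masked_fill_
attn = torch.softmax(scores, dim=-1)
context = torch.matmul(attn, V)

print(attn[0, 0])        # the last column is numerically zero
print(attn.sum(dim=-1))  # each row still sums to 1
```

After the softmax, the hidden key receives essentially no weight, so padded positions do not contribute to `context`.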

In [109]:
class MultiHeadAttention(nn.Module): # MH-Self Attention + Add & Norm
    def __init__(self):
        super(MultiHeadAttention, self).__init__()
        self.W_Q = nn.Linear(d_model, d_k * n_heads)
        self.W_K = nn.Linear(d_model, d_k * n_heads)
        self.W_V = nn.Linear(d_model, d_v * n_heads)
    def forward(self, Q, K, V, attn_mask):
        # q: [batch_size x len_q x d_model], k: [batch_size x len_k x d_model], v: [batch_size x len_k x d_model]
        residual, batch_size = Q, Q.size(0)
        # (B, S, D) -proj-> (B, S, D) -split-> (B, S, H, W) -trans-> (B, H, S, W)
        q_s = self.W_Q(Q).view(batch_size, -1, n_heads, d_k).transpose(1,2)  # q_s: [batch_size x n_heads x len_q x d_k]
        k_s = self.W_K(K).view(batch_size, -1, n_heads, d_k).transpose(1,2)  # k_s: [batch_size x n_heads x len_k x d_k]
        v_s = self.W_V(V).view(batch_size, -1, n_heads, d_v).transpose(1,2)  # v_s: [batch_size x n_heads x len_k x d_v]

        attn_mask = attn_mask.unsqueeze(1).repeat(1, n_heads, 1, 1) # attn_mask : [batch_size x n_heads x len_q x len_k]

        # context: [batch_size x n_heads x len_q x d_v], attn: [batch_size x n_heads x len_q(=len_k) x len_k(=len_q)]
        context, attn = ScaledDotProductAttention()(q_s, k_s, v_s, attn_mask)
        context = context.transpose(1, 2).contiguous().view(batch_size, -1, n_heads * d_v) # context: [batch_size x len_q x n_heads * d_v]
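        # NOTE: this Linear and the LayerNorm below are constructed on every forward call, so their weights are
        # re-initialized each time and never trained; registering both in __init__ would make them trainable.
        output = nn.Linear(n_heads * d_v, d_model)(context)
        return nn.LayerNorm(d_model)(output + residual), attn # output: [batch_size x len_q x d_model]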
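
In the class above, the output projection and the LayerNorm are built inside `forward`, so they are not registered as parameters of the module. One possible variant registers them in `__init__`. This is a sketch under assumptions: the class name, the constructor arguments, and `out_proj` are introduced here, and the hyperparameters are passed in instead of being read from globals.

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttentionRegistered(nn.Module):
    """Hypothetical variant whose output projection and LayerNorm are trainable."""

    def __init__(self, d_model=768, n_heads=12, d_k=64, d_v=64):
        super().__init__()
        self.n_heads, self.d_k, self.d_v = n_heads, d_k, d_v
        self.W_Q = nn.Linear(d_model, d_k * n_heads)
        self.W_K = nn.Linear(d_model, d_k * n_heads)
        self.W_V = nn.Linear(d_model, d_v * n_heads)
        self.out_proj = nn.Linear(n_heads * d_v, d_model)  # created once, so it is trained
        self.norm = nn.LayerNorm(d_model)

    def forward(self, Q, K, V, attn_mask):
        residual, batch_size = Q, Q.size(0)
        q_s = self.W_Q(Q).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        k_s = self.W_K(K).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        v_s = self.W_V(V).view(batch_size, -1, self.n_heads, self.d_v).transpose(1, 2)
        mask = attn_mask.unsqueeze(1)  # [batch, 1, len_q, len_k], broadcast over heads
        scores = torch.matmul(q_s, k_s.transpose(-1, -2)) / math.sqrt(self.d_k)
        attn = torch.softmax(scores.masked_fill(mask, -1e9), dim=-1)
        context = torch.matmul(attn, v_s)  # [batch, heads, len_q, d_v]
        context = context.transpose(1, 2).contiguous().view(batch_size, -1, self.n_heads * self.d_v)
        return self.norm(self.out_proj(context) + residual), attn

# Toy usage with small sizes.
mha = MultiHeadAttentionRegistered(d_model=8, n_heads=2, d_k=4, d_v=4)
x = torch.randn(2, 5, 8)
no_padding = torch.zeros(2, 5, 5, dtype=torch.bool)
out, attn = mha(x, x, x, no_padding)
print(out.shape, attn.shape)
print(sorted(name for name, _ in mha.named_parameters()))
```

With this layout, `model.parameters()` includes the projection and normalization weights, so the optimizer can update them.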

In [110]:
class PoswiseFeedForwardNet(nn.Module): # FFNN
    def __init__(self):
        super(PoswiseFeedForwardNet, self).__init__()
        self.fc1 = nn.Linear(d_model, d_ff) # (768 * (768 * 4))
        self.fc2 = nn.Linear(d_ff, d_model) # ((768 * 4) * 768)

    def forward(self, x):
        # (batch_size, len_seq, d_model) -> (batch_size, len_seq, d_ff) -> (batch_size, len_seq, d_model)
        return self.fc2(gelu(self.fc1(x)))

Encoder  
1. Multi-head Self-Attention  
2. Add & Norm  
3. FFNN(Feed Forward Neural Network)  
4. Add & Norm

In [111]:
class EncoderLayer(nn.Module):
    def __init__(self):
        super(EncoderLayer, self).__init__()
        self.enc_self_attn = MultiHeadAttention()
        self.pos_ffn = PoswiseFeedForwardNet()

    def forward(self, enc_inputs, enc_self_attn_mask):
        enc_outputs, attn = self.enc_self_attn(enc_inputs, enc_inputs, enc_inputs, enc_self_attn_mask) # enc_inputs to same Q,K,V
        enc_outputs = self.pos_ffn(enc_outputs) # enc_outputs: [batch_size x len_q x d_model]; note: no Add & Norm is applied around the FFN here
        return enc_outputs, attn
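
The four encoder steps listed earlier (self-attention, Add & Norm, FFNN, Add & Norm) can also be written with PyTorch's built-in `nn.MultiheadAttention`. This is a sketch for comparison, not the notebook's implementation: `EncoderLayerSketch` and its arguments are assumptions, and the mask here is a per-key padding mask of shape `[batch, len]` instead of the `[batch, len, len]` mask used above.

```python
import torch
import torch.nn as nn

class EncoderLayerSketch(nn.Module):
    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, key_padding_mask=None):
        # 1. Multi-head self-attention; True in key_padding_mask hides a key position.
        attn_out, attn = self.attn(x, x, x, key_padding_mask=key_padding_mask)
        # 2. Add & Norm
        x = self.norm1(x + attn_out)
        # 3. Position-wise feed-forward network, then 4. Add & Norm
        x = self.norm2(x + self.ffn(x))
        return x, attn

# Toy usage: batch of 2, length 5, last two positions are padding.
layer = EncoderLayerSketch(d_model=8, n_heads=2, d_ff=16)
x = torch.randn(2, 5, 8)
pad = torch.tensor([[False, False, False, True, True]] * 2)
out, attn = layer(x, pad)
print(out.shape, attn.shape)
```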

In [125]:
class BERT(nn.Module):
    def __init__(self):
        super(BERT, self).__init__()
        self.embedding = Embedding() # tok_embedding + pos_embedding + seg_embedding
        self.layers = nn.ModuleList([EncoderLayer() for _ in range(n_layers)]) # stack of n_layers encoder layers (6 here)
        self.fc = nn.Linear(d_model, d_model) # FC Layer (768 * 768) # Embedding Size
        self.activ1 = nn.Tanh()
        self.linear = nn.Linear(d_model, d_model)
        self.activ2 = gelu
        self.norm = nn.LayerNorm(d_model)
        self.classifier = nn.Linear(d_model, 2) # output = 2
        # decoder is shared with embedding layer
        embed_weight = self.embedding.tok_embed.weight
        n_vocab, n_dim = embed_weight.size()
        self.decoder = nn.Linear(n_dim, n_vocab, bias=False)
        self.decoder.weight = embed_weight
        self.decoder_bias = nn.Parameter(torch.zeros(n_vocab))

    def forward(self, input_ids, segment_ids, masked_pos):
        output = self.embedding(input_ids, segment_ids) # embedding layer
        enc_self_attn_mask = get_attn_pad_mask(input_ids, input_ids) # mask that excludes padding positions when computing attention
        for layer in self.layers: # run the n_layers encoder layers in sequence
            output, enc_self_attn = layer(output, enc_self_attn_mask)
        # output : [batch_size, len, d_model], attn : [batch_size, n_heads, len, len]
        # it will be decided by first token(CLS)
        h_pooled = self.activ1(self.fc(output[:, 0])) # [batch_size, d_model] # tanh
        logits_clsf = self.classifier(h_pooled) # [batch_size, 2] 

        # get masked position from final output of transformer.
        masked_pos = masked_pos[:, :, None].expand(-1, -1, output.size(-1)) # [batch_size, max_pred, d_model]
        
        h_masked = torch.gather(output, 1, masked_pos) # masking position [batch_size, max_pred, d_model]
        h_masked = self.norm(self.activ2(self.linear(h_masked)))
        logits_lm = self.decoder(h_masked) + self.decoder_bias # [batch_size, max_pred, n_vocab]

        return logits_lm, logits_clsf
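
The `expand` plus `torch.gather` step in `forward` picks out the hidden vectors at the masked positions. A tiny worked example on hand-made values (the tensor contents are assumptions chosen so the result is easy to read):

```python
import torch

# Toy encoder output: batch of 1, 4 positions, hidden size 3.
output = torch.arange(12, dtype=torch.float32).view(1, 4, 3)
masked_pos = torch.tensor([[2, 0]])  # pick positions 2 and 0

# Repeat each position index across the hidden dimension: [1, 2] -> [1, 2, 3]
index = masked_pos[:, :, None].expand(-1, -1, output.size(-1))
h_masked = torch.gather(output, 1, index)
print(h_masked)  # rows [6., 7., 8.] and [0., 1., 2.]

# Equivalent advanced-indexing form.
batch_idx = torch.arange(output.size(0))[:, None]
print(torch.equal(h_masked, output[batch_idx, masked_pos]))
```

Note that padded prediction slots carry position 0, so they gather the `[CLS]` vector.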

In [189]:
embedding = Embedding()

In [195]:
# word_token, segment index
input_ids, segment_ids

(tensor([[ 1, 22,  3,  5,  2, 21,  3, 12, 17, 26, 20, 12, 25,  2,  0,  0,  0,  0,
           0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
         [ 1, 19, 10, 23, 14,  3,  5, 21, 24,  3, 12,  2, 19, 26, 20,  3,  6, 16,
          10,  2,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
         [ 1,  3, 28,  5,  2, 21,  4,  3, 17, 26, 20, 12, 25,  2,  0,  0,  0,  0,
           0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
         [ 1, 19, 26, 20, 12,  3, 16, 10,  2, 19,  3, 23, 14, 13,  5,  3, 24,  4,
          12,  2,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
         [ 1, 21,  4, 12, 17, 26,  3, 12, 25,  2,  3, 23, 18,  3,  8, 15,  9,  2,
           0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
         [ 1, 21,  4,  3,  3, 26, 20, 12, 25,  2,  3, 23, 18, 11,  8, 15,  9,  2,
           0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0]]),
 tensor([[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

In [193]:
# embedding layer
output = embedding(input_ids, segment_ids)
print(output.shape)
output
# token + segment + position

torch.Size([6, 30, 768])


tensor([[[ 0.1189, -0.6494, -0.1653,  ..., -1.5683, -0.5090, -1.1351],
         [ 0.2719,  0.0470,  0.2538,  ..., -1.5979, -0.3503, -0.1319],
         [-0.6173, -1.3230, -0.9118,  ..., -1.8902,  0.8973, -0.5789],
         ...,
         [-0.2015,  1.1879,  2.1656,  ..., -1.6345, -0.6579,  0.4428],
         [-0.6804,  1.2126,  1.5787,  ..., -1.7551, -0.8357, -0.0583],
         [-1.4206,  0.3712,  1.6523,  ..., -0.9094, -2.0265, -0.1076]],

        [[ 0.1189, -0.6494, -0.1653,  ..., -1.5683, -0.5090, -1.1351],
         [-0.2506, -1.1500,  0.1414,  ..., -2.3564,  0.6285,  0.3956],
         [ 0.4329, -0.1175, -0.6646,  ..., -0.7343,  0.8666, -0.0565],
         ...,
         [-0.2015,  1.1879,  2.1656,  ..., -1.6345, -0.6579,  0.4428],
         [-0.6804,  1.2126,  1.5787,  ..., -1.7551, -0.8357, -0.0583],
         [-1.4206,  0.3712,  1.6523,  ..., -0.9094, -2.0265, -0.1076]],

        [[ 0.1189, -0.6494, -0.1653,  ..., -1.5683, -0.5090, -1.1351],
         [-0.4722, -0.5195, -0.4120,  ..., -1

In [203]:
# zero-padded positions are marked True
enc_self_attn_mask = get_attn_pad_mask(input_ids, input_ids)
print(enc_self_attn_mask.shape)
enc_self_attn_mask[0]

torch.Size([6, 30, 30])


tensor([[False, False, False, False, False, False, False, False, False, False,
         False, False, False, False,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,  True],
        [False, False, False, False, False, False, False, False, False, False,
         False, False, False, False,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,  True],
        [False, False, False, False, False, False, False, False, False, False,
         False, False, False, False,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,  True],
        [False, False, False, False, False, False, False, False, False, False,
         False, False, False, False,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,  True],
        [False, False, False, False, False, Fals

In [206]:
# define the stack of Transformer encoder layers (multi-head attention + FFN)
layers = nn.ModuleList([EncoderLayer() for _ in range(n_layers)])
layers

ModuleList(
  (0): EncoderLayer(
    (enc_self_attn): MultiHeadAttention(
      (W_Q): Linear(in_features=768, out_features=768, bias=True)
      (W_K): Linear(in_features=768, out_features=768, bias=True)
      (W_V): Linear(in_features=768, out_features=768, bias=True)
    )
    (pos_ffn): PoswiseFeedForwardNet(
      (fc1): Linear(in_features=768, out_features=3072, bias=True)
      (fc2): Linear(in_features=3072, out_features=768, bias=True)
    )
  )
  (1): EncoderLayer(
    (enc_self_attn): MultiHeadAttention(
      (W_Q): Linear(in_features=768, out_features=768, bias=True)
      (W_K): Linear(in_features=768, out_features=768, bias=True)
      (W_V): Linear(in_features=768, out_features=768, bias=True)
    )
    (pos_ffn): PoswiseFeedForwardNet(
      (fc1): Linear(in_features=768, out_features=3072, bias=True)
      (fc2): Linear(in_features=3072, out_features=768, bias=True)
    )
  )
  (2): EncoderLayer(
    (enc_self_attn): MultiHeadAttention(
      (W_Q): Linear(in_feature

In [207]:
# run the n_layers (6) encoder layers in sequence
for layer in layers:
    output, enc_self_attn = layer(output, enc_self_attn_mask) 

In [210]:
output.shape, enc_self_attn.shape

(torch.Size([6, 30, 768]), torch.Size([6, 12, 30, 30]))

In [212]:
fc = nn.Linear(d_model, d_model)  # (768 * 768)

In [214]:
activ1 = nn.Tanh()

In [215]:
print(output[:, 0].shape)
h_pooled = activ1(fc(output[:, 0]))

torch.Size([6, 768])


In [216]:
h_pooled.shape

torch.Size([6, 768])

In [133]:
classifier = nn.Linear(d_model, 2)

In [137]:
logits_clsf = classifier(h_pooled) # 768 * 2

In [138]:
logits_clsf

tensor([[-0.0257,  0.1241],
        [-0.0215,  0.1168],
        [-0.0196,  0.1300],
        [-0.0148,  0.1248],
        [-0.0348,  0.1221],
        [-0.0328,  0.1235]], grad_fn=<AddmmBackward0>)

In [139]:
masked_pos

tensor([[[ 6,  6,  6,  ...,  6,  6,  6],
         [ 2,  2,  2,  ...,  2,  2,  2],
         [ 0,  0,  0,  ...,  0,  0,  0],
         [ 0,  0,  0,  ...,  0,  0,  0],
         [ 0,  0,  0,  ...,  0,  0,  0]],

        [[ 5,  5,  5,  ...,  5,  5,  5],
         [15, 15, 15,  ..., 15, 15, 15],
         [ 9,  9,  9,  ...,  9,  9,  9],
         [ 0,  0,  0,  ...,  0,  0,  0],
         [ 0,  0,  0,  ...,  0,  0,  0]],

        [[ 1,  1,  1,  ...,  1,  1,  1],
         [ 7,  7,  7,  ...,  7,  7,  7],
         [ 0,  0,  0,  ...,  0,  0,  0],
         [ 0,  0,  0,  ...,  0,  0,  0],
         [ 0,  0,  0,  ...,  0,  0,  0]],

        [[ 5,  5,  5,  ...,  5,  5,  5],
         [10, 10, 10,  ..., 10, 10, 10],
         [15, 15, 15,  ..., 15, 15, 15],
         [ 0,  0,  0,  ...,  0,  0,  0],
         [ 0,  0,  0,  ...,  0,  0,  0]],

        [[13, 13, 13,  ..., 13, 13, 13],
         [ 6,  6,  6,  ...,  6,  6,  6],
         [10, 10, 10,  ..., 10, 10, 10],
         [ 0,  0,  0,  ...,  0,  0,  0],
        

In [149]:
output.shape, h_masked.shape

(torch.Size([6, 30, 768]), torch.Size([6, 5, 768]))

In [141]:
h_masked = torch.gather(output, 1, masked_pos) # torch.gather(input, dim, index)
h_masked

tensor([[[ 0.3436,  0.0796, -0.1103,  ..., -0.3413, -0.1028,  0.0774],
         [ 0.0917, -0.2705, -0.0349,  ..., -0.0708, -0.0929, -0.1157],
         [-0.1180,  0.0144, -0.1075,  ..., -0.1348, -0.2497,  0.1692],
         [-0.1180,  0.0144, -0.1075,  ..., -0.1348, -0.2497,  0.1692],
         [-0.1180,  0.0144, -0.1075,  ..., -0.1348, -0.2497,  0.1692]],

        [[-0.0712, -0.1533,  0.0117,  ..., -0.0195, -0.2495,  0.2690],
         [ 0.2108,  0.1466,  0.0904,  ..., -0.2638, -0.1845,  0.1594],
         [-0.1219, -0.2331,  0.1493,  ..., -0.0733, -0.1656,  0.1293],
         [-0.1773,  0.0553, -0.0641,  ..., -0.1362, -0.2448,  0.1423],
         [-0.1773,  0.0553, -0.0641,  ..., -0.1362, -0.2448,  0.1423]],

        [[ 0.1992, -0.0858, -0.0056,  ...,  0.0475, -0.0161,  0.0718],
         [ 0.3804, -0.0241,  0.0120,  ..., -0.2772, -0.1647,  0.0564],
         [-0.1039,  0.0225, -0.0971,  ..., -0.1524, -0.2400,  0.1594],
         [-0.1039,  0.0225, -0.0971,  ..., -0.1524, -0.2400,  0.1594],
  

In [150]:
norm = nn.LayerNorm(d_model)

In [152]:
activ2 = gelu

In [155]:
linear = nn.Linear(d_model, d_model)

In [156]:
h_masked = norm(activ2(linear(h_masked)))

In [158]:
h_masked.shape

torch.Size([6, 5, 768])

In [160]:
embed_weight = embedding.tok_embed.weight
n_vocab, n_dim = embed_weight.size()

decoder = nn.Linear(n_dim, n_vocab, bias=False)
decoder_bias = nn.Parameter(torch.zeros(n_vocab))
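
In the `BERT` class the decoder reuses the token-embedding matrix (`self.decoder.weight = embed_weight`), while the step-by-step cell above creates a decoder with its own independent weight. A minimal sketch of the tying step on toy sizes; the hidden size of 16 is an arbitrary assumption, and 29 matches the vocabulary size reported in this notebook.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 29, 16
tok_embed = nn.Embedding(vocab_size, d_model)          # weight: [vocab_size, d_model]
decoder = nn.Linear(d_model, vocab_size, bias=False)   # weight: [vocab_size, d_model]
decoder.weight = tok_embed.weight                      # share one Parameter between both modules
decoder_bias = nn.Parameter(torch.zeros(vocab_size))

h = torch.randn(2, 5, d_model)                         # stand-in for h_masked
logits = decoder(h) + decoder_bias                     # [2, 5, vocab_size]

print(decoder.weight is tok_embed.weight)
print(logits.shape)
```

Tying works because `nn.Embedding` and `nn.Linear` store their weights with the same `[vocab_size, d_model]` layout, so no transpose is needed.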

In [166]:
logits_lm = decoder(h_masked) + decoder_bias

In [169]:
logits_clsf.shape, logits_lm.shape

(torch.Size([6, 2]), torch.Size([6, 5, 29]))

In [186]:
logits_lm.data.max(2)[1]

tensor([[ 8,  5,  5,  5,  5],
        [18, 18, 15,  5,  5],
        [12, 18,  5,  5,  5],
        [18, 18, 18,  5,  5],
        [ 5,  5, 18,  5,  5],
        [24,  5, 18,  5,  5]])

In [170]:
logits_lm.data.max(2)[1][0].data.numpy()

array([8, 5, 5, 5, 5], dtype=int64)

___

In [26]:
if __name__ == '__main__':
    # BERT Parameters
    maxlen = 30 # maximum of length
    batch_size = 6
    max_pred = 5  # max tokens of prediction
    n_layers = 6 # number of encoder layers
    n_heads = 12 # number of heads in Multi-Head Attention
    d_model = 768 # Embedding Size
    d_ff = 768 * 4  # 4*d_model, FeedForward dimension
    d_k = d_v = 64  # dimension of K(=Q), V
    n_segments = 2

    text = (
        'Hello, how are you? I am Romeo.\n'
        'Hello, Romeo My name is Juliet. Nice to meet you.\n'
        'Nice meet you too. How are you today?\n'
        'Great. My baseball team won the competition.\n'
        'Oh Congratulations, Juliet\n'
        'Thanks you Romeo'
    )
    sentences = re.sub("[.,!?\\-]", '', text.lower()).split('\n')  # strip '.', ',', '!', '?', '-'
    word_list = list(set(" ".join(sentences).split()))
    word_dict = {'[PAD]': 0, '[CLS]': 1, '[SEP]': 2, '[MASK]': 3}
    for i, w in enumerate(word_list):
        word_dict[w] = i + 4
    number_dict = {i: w for i, w in enumerate(word_dict)}
    vocab_size = len(word_dict)

    token_list = list()
    for sentence in sentences:
        arr = [word_dict[s] for s in sentence.split()]
        token_list.append(arr)

    model = BERT()
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    batch = make_batch()
    input_ids, segment_ids, masked_tokens, masked_pos, isNext = map(torch.LongTensor, zip(*batch))

    for epoch in range(100):
        optimizer.zero_grad()
        logits_lm, logits_clsf = model(input_ids, segment_ids, masked_pos)
        loss_lm = criterion(logits_lm.transpose(1, 2), masked_tokens) # for masked LM
        loss_lm = (loss_lm.float()).mean()
        loss_clsf = criterion(logits_clsf, isNext) # for sentence classification
        loss = loss_lm + loss_clsf
        if (epoch + 1) % 10 == 0:
            print('Epoch:', '%04d' % (epoch + 1), 'cost =', '{:.6f}'.format(loss))
        loss.backward()
        optimizer.step()

    # Predict masked tokens and isNext
    input_ids, segment_ids, masked_tokens, masked_pos, isNext = map(torch.LongTensor, zip(batch[0]))
    print(text)
    print([number_dict[w.item()] for w in input_ids[0] if number_dict[w.item()] != '[PAD]'])

    logits_lm, logits_clsf = model(input_ids, segment_ids, masked_pos)
    logits_lm = logits_lm.data.max(2)[1][0].data.numpy()
    print('masked tokens list : ',[pos.item() for pos in masked_tokens[0] if pos.item() != 0])
    print('predict masked tokens list : ',[pos for pos in logits_lm if pos != 0])

    logits_clsf = logits_clsf.data.max(1)[1].data.numpy()[0]
    print('isNext : ', True if isNext else False)
    print('predict isNext : ',True if logits_clsf else False)

Epoch: 0010 cost = 50.184807
Epoch: 0020 cost = 28.885002
Epoch: 0030 cost = 11.690194
Epoch: 0040 cost = 5.196136
Epoch: 0050 cost = 3.803531
Epoch: 0060 cost = 3.314933
Epoch: 0070 cost = 3.649222
Epoch: 0080 cost = 3.378980
Epoch: 0090 cost = 3.077944
Epoch: 0100 cost = 2.820939
Hello, how are you? I am Romeo.
Hello, Romeo My name is Juliet. Nice to meet you.
Nice meet you too. How are you today?
Great. My baseball team won the competition.
Oh Congratulations, Juliet
Thanks you Romeo
['[CLS]', 'oh', '[MASK]', 'juliet', '[SEP]', 'nice', '[MASK]', 'you', 'too', 'how', 'are', 'you', 'today', '[SEP]']
masked tokens list :  [4, 28]
predict masked tokens list :  [12, 12, 12, 12, 12]
isNext :  False
predict isNext :  False
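
In the training loop above, `masked_tokens` is zero-padded up to `max_pred`, and the plain `nn.CrossEntropyLoss()` also scores those padded slots (target id 0, which is `[PAD]` in this notebook). One common option is to skip them with `ignore_index`. The tensors below are random stand-ins, so this is a sketch of the idea and not a re-run of the experiment.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size = 29
logits_lm = torch.randn(2, 5, vocab_size)             # [batch, max_pred, vocab]
masked_tokens = torch.tensor([[4, 28, 0, 0, 0],       # 0 marks padded prediction slots
                              [12, 7, 19, 0, 0]])

criterion_all = nn.CrossEntropyLoss()                 # averages over all 10 slots
criterion_real = nn.CrossEntropyLoss(ignore_index=0)  # averages over the 5 real targets only

loss_all = criterion_all(logits_lm.transpose(1, 2), masked_tokens)
loss_real = criterion_real(logits_lm.transpose(1, 2), masked_tokens)

# Manual check: mean of the per-slot losses at non-padded slots.
per_slot = nn.CrossEntropyLoss(reduction='none')(logits_lm.transpose(1, 2), masked_tokens)
manual = per_slot[masked_tokens != 0].mean()
print(loss_all.item(), loss_real.item(), manual.item())
```

This only works because id 0 is reserved for `[PAD]` and never appears as a real prediction target.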


In [28]:
logits_clsf

0

In [258]:
model

BERT(
  (embedding): Embedding(
    (tok_embed): Embedding(29, 768)
    (pos_embed): Embedding(30, 768)
    (seg_embed): Embedding(2, 768)
    (norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (layers): ModuleList(
    (0): EncoderLayer(
      (enc_self_attn): MultiHeadAttention(
        (W_Q): Linear(in_features=768, out_features=768, bias=True)
        (W_K): Linear(in_features=768, out_features=768, bias=True)
        (W_V): Linear(in_features=768, out_features=768, bias=True)
      )
      (pos_ffn): PoswiseFeedForwardNet(
        (fc1): Linear(in_features=768, out_features=3072, bias=True)
        (fc2): Linear(in_features=3072, out_features=768, bias=True)
      )
    )
    (1): EncoderLayer(
      (enc_self_attn): MultiHeadAttention(
        (W_Q): Linear(in_features=768, out_features=768, bias=True)
        (W_K): Linear(in_features=768, out_features=768, bias=True)
        (W_V): Linear(in_features=768, out_features=768, bias=True)
      )
      (pos_ffn): Po
