# [GD-10] Transformer - Translator
"Going Deeper Node 10. Making translator using Transformer" / 2022. 04. 12 (Tue) 이형주

## Contents
---
- **1. Environment Setup**
- **2. Modeling**
- **3. Project Retrospective**

## Rubric 평가기준
---

|  평가문항  |  상세기준  |
|:---------|:---------|
|1. 번역기 모델 학습에 필요한 텍스트 데이터 전처리가 잘 이루어졌다.|데이터 정제, SentencePiece를 활용한 토큰화 및 데이터셋 구축의 과정이 지시대로 진행되었다.
|2. Transformer 번역기 모델이 정상적으로 구동된다.|Transformer 모델의 학습과 추론 과정이 정상적으로 진행되어, 한-영 번역기능이 정상 동작한다.
|3. 테스트 결과 의미가 통하는 수준의 번역문이 생성되었다.|제시된 문장에 대한 그럴듯한 영어 번역문이 생성되며, 시각화된 Attention Map으로 결과를 뒷받침한다.

## 1. Environment Setup

In [1]:
# 한글폰트 설치

!sudo apt-get install -y fonts-nanum
!sudo fc-cache -fv
!rm ~/.cache/matplotlib -rf

Reading package lists... Done
Building dependency tree       
Reading state information... Done
fonts-nanum is already the newest version (20180306-3).
The following packages were automatically installed and are no longer required:
  accountsservice-ubuntu-schemas bc bluez-obexd cups cups-browsed cups-client
  cups-common cups-core-drivers cups-daemon cups-filters
  cups-filters-core-drivers cups-ipp-utils cups-ppdc cups-server-common
  fonts-droid-fallback fonts-noto-mono fonts-urw-base35 ghostscript
  gir1.2-dbusmenu-glib-0.4 gnome-bluetooth gnome-power-manager
  gnome-screensaver gsettings-ubuntu-schemas gvfs-backends indicator-applet
  indicator-application indicator-appmenu indicator-bluetooth indicator-common
  indicator-datetime indicator-keyboard indicator-messages indicator-power
  indicator-printers indicator-session indicator-sound jayatana
  libaccounts-glib0 libbamf3-2 libcdio-cdda2 libcdio-paranoia2 libcdio18
  libcupsfilters1 libfcitx-config4 libfcitx-gclient1 libfcitx-u

In [2]:
# 한글폰트 셋팅

%matplotlib inline  

import matplotlib as mpl  # 기본 설정 만지는 용도
import matplotlib.pyplot as plt  # 그래프 그리는 용도
import matplotlib.font_manager as fm  # 폰트 관련 용도

plt.rc('font', family='NanumGothic')

In [3]:
sys_font=fm.findSystemFonts()
print(f"sys_font number: {len(sys_font)}")
print(sys_font)

nanum_font = [f for f in sys_font if 'Nanum' in f]
print(f"nanum_font number: {len(nanum_font)}")

sys_font number: 103
['/usr/share/fonts/truetype/dejavu/DejaVuSerif-Italic.ttf', '/usr/share/fonts/truetype/liberation/LiberationSerif-BoldItalic.ttf', '/usr/share/fonts/opentype/urw-base35/NimbusSansNarrow-BoldOblique.otf', '/usr/share/fonts/truetype/liberation/LiberationSansNarrow-Regular.ttf', '/usr/share/fonts/truetype/dejavu/DejaVuSerifCondensed-Italic.ttf', '/usr/share/fonts/truetype/ubuntu/Ubuntu-MI.ttf', '/usr/share/fonts/truetype/dejavu/DejaVuSerifCondensed-Bold.ttf', '/usr/share/fonts/truetype/dejavu/DejaVuSerifCondensed-BoldItalic.ttf', '/usr/share/fonts/truetype/ubuntu/UbuntuMono-BI.ttf', '/usr/share/fonts/truetype/nanum/NanumGothicBold.ttf', '/usr/share/fonts/truetype/liberation/LiberationSans-Italic.ttf', '/usr/share/fonts/opentype/urw-base35/NimbusMonoPS-Bold.otf', '/usr/share/fonts/opentype/urw-base35/URWGothic-BookOblique.otf', '/usr/share/fonts/truetype/nanum/NanumGothic.ttf', '/usr/share/fonts/truetype/liberation/LiberationSansNarrow-BoldItalic.ttf', '/usr/share/font

In [4]:
path = '/usr/share/fonts/truetype/nanum/NanumGothic.ttf'  # 설치된 나눔글꼴중 원하는 폰트의 전체 경로를 가져온다.
font_name = fm.FontProperties(fname=path, size=10).get_name()
print(font_name)
plt.rc('font', family=font_name)

NanumGothic


In [5]:
import tensorflow as tf
import numpy as np
import time
import re
import os
import io
import random
import matplotlib.ticker as ticker
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split

In [6]:
def positional_encoding(pos, d_model):
    def cal_angle(position, i):
        return position / np.power(10000, int(i) / d_model)

    def get_posi_angle_vec(position):
        return [cal_angle(position, i) for i in range(d_model)]

    sinusoid_table = np.array([get_posi_angle_vec(pos_i) for pos_i in range(pos)])
    sinusoid_table[:, 0::2] = np.sin(sinusoid_table[:, 0::2])
    sinusoid_table[:, 1::2] = np.cos(sinusoid_table[:, 1::2])
    return sinusoid_table

In [7]:
class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model
            
        self.depth = d_model // self.num_heads
            
        self.W_q = tf.keras.layers.Dense(d_model)
        self.W_k = tf.keras.layers.Dense(d_model)
        self.W_v = tf.keras.layers.Dense(d_model)
            
        self.linear = tf.keras.layers.Dense(d_model)

    def scaled_dot_product_attention(self, Q, K, V, mask):
        d_k = tf.cast(K.shape[-1], tf.float32)
        QK = tf.matmul(Q, K, transpose_b=True)

        scaled_qk = QK / tf.math.sqrt(d_k)

        if mask is not None: scaled_qk += (mask * -1e9)  

        attentions = tf.nn.softmax(scaled_qk, axis=-1)
        out = tf.matmul(attentions, V)

        return out, attentions
            

    def split_heads(self, x):
        batch_size = x.shape[0]
        split_x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        split_x = tf.transpose(split_x, perm=[0, 2, 1, 3])

        return split_x

    def combine_heads(self, x):
        batch_size = x.shape[0]
        combined_x = tf.transpose(x, perm=[0, 2, 1, 3])
        combined_x = tf.reshape(combined_x, (batch_size, -1, self.d_model))

        return combined_x

        
    def call(self, Q, K, V, mask):
        WQ = self.W_q(Q)
        WK = self.W_k(K)
        WV = self.W_v(V)
        
        WQ_splits = self.split_heads(WQ)
        WK_splits = self.split_heads(WK)
        WV_splits = self.split_heads(WV)
            
        out, attention_weights = self.scaled_dot_product_attention(
            WQ_splits, WK_splits, WV_splits, mask)
    				        
        out = self.combine_heads(out)
        out = self.linear(out)
                
        return out, attention_weights

In [8]:
class PoswiseFeedForwardNet(tf.keras.layers.Layer):
    def __init__(self, d_model, d_ff):
        super(PoswiseFeedForwardNet, self).__init__()
        self.w_1 = tf.keras.layers.Dense(d_ff, activation='relu')
        self.w_2 = tf.keras.layers.Dense(d_model)

    def call(self, x):
        out = self.w_1(x)
        out = self.w_2(out)
            
        return out

print("Completed")

Completed


In [9]:
# Encoder Layer

class EncoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, n_heads, d_ff, dropout):
        super(EncoderLayer, self).__init__()

        self.enc_self_attn = MultiHeadAttention(d_model, n_heads)
        self.ffn = PoswiseFeedForwardNet(d_model, d_ff)

        self.norm_1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.norm_2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        self.dropout = tf.keras.layers.Dropout(dropout)
        
    def call(self, x, mask):

        """
        Multi-Head Attention
        """
        residual = x
        out = self.norm_1(x)
        out, enc_attn = self.enc_self_attn(out, out, out, mask)
        out = self.dropout(out)
        out += residual
        
        """
        Position-Wise Feed Forward Network
        """
        residual = out
        out = self.norm_2(out)
        out = self.ffn(out)
        out = self.dropout(out)
        out += residual
        
        return out, enc_attn

print("Completed")

Completed


In [10]:
# Decoder Layer

class DecoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(DecoderLayer, self).__init__()

        self.dec_self_attn = MultiHeadAttention(d_model, num_heads)
        self.enc_dec_attn = MultiHeadAttention(d_model, num_heads)

        self.ffn = PoswiseFeedForwardNet(d_model, d_ff)

        self.norm_1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.norm_2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.norm_3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        self.dropout = tf.keras.layers.Dropout(dropout)
    
    def call(self, x, enc_out, causality_mask, padding_mask):

        """
        Masked Multi-Head Attention
        """
        residual = x
        out = self.norm_1(x)
        out, dec_attn = self.dec_self_attn(out, out, out, padding_mask)
        out = self.dropout(out)
        out += residual

        """
        Multi-Head Attention
        """
        residual = out
        out = self.norm_2(out)
        out, dec_enc_attn = self.enc_dec_attn(out, enc_out, enc_out, causality_mask)
        out = self.dropout(out)
        out += residual
        
        """
        Position-Wise Feed Forward Network
        """
        residual = out
        out = self.norm_3(out)
        out = self.ffn(out)
        out = self.dropout(out)
        out += residual

        return out, dec_attn, dec_enc_attn

In [11]:
# Encoder

class Encoder(tf.keras.Model):
    def __init__(self,
                 n_layers,
                 d_model,
                 n_heads,
                 d_ff,
                 dropout):
        super(Encoder, self).__init__()
        self.n_layers = n_layers
        self.enc_layers = [EncoderLayer(d_model, n_heads, d_ff, dropout) 
                        for _ in range(n_layers)]
        
    def call(self, x, mask):
        out = x
    
        enc_attns = list()
        for i in range(self.n_layers):
            out, enc_attn = self.enc_layers[i](out, mask)
            enc_attns.append(enc_attn)
        
        return out, enc_attns

print("Completed")

Completed


In [12]:
# Decoder

class Decoder(tf.keras.Model):
    def __init__(self,
                 n_layers,
                 d_model,
                 n_heads,
                 d_ff,
                 dropout):
        super(Decoder, self).__init__()
        self.n_layers = n_layers
        self.dec_layers = [DecoderLayer(d_model, n_heads, d_ff, dropout) 
                            for _ in range(n_layers)]
                            
                            
    def call(self, x, enc_out, causality_mask, padding_mask):
        out = x
    
        dec_attns = list()
        dec_enc_attns = list()
        for i in range(self.n_layers):
            out, dec_attn, dec_enc_attn = \
            self.dec_layers[i](out, enc_out, causality_mask, padding_mask)

            dec_attns.append(dec_attn)
            dec_enc_attns.append(dec_enc_attn)

        return out, dec_attns, dec_enc_attns

print("Completed")

Completed


In [13]:
class Transformer(tf.keras.Model):
    def __init__(self,
                    n_layers,
                    d_model,
                    n_heads,
                    d_ff,
                    src_vocab_size,
                    tgt_vocab_size,
                    pos_len,
                    dropout=0.2,
                    shared=True):
        super(Transformer, self).__init__()
        self.d_model = tf.cast(d_model, tf.float32)

        self.enc_emb = tf.keras.layers.Embedding(src_vocab_size, d_model)
        self.dec_emb = tf.keras.layers.Embedding(tgt_vocab_size, d_model)

        self.pos_encoding = positional_encoding(pos_len, d_model)
        self.dropout = tf.keras.layers.Dropout(dropout)

        self.encoder = Encoder(n_layers, d_model, n_heads, d_ff, dropout)
        self.decoder = Decoder(n_layers, d_model, n_heads, d_ff, dropout)

        self.fc = tf.keras.layers.Dense(tgt_vocab_size)

        self.shared = shared

        if shared: self.fc.set_weights(tf.transpose(self.dec_emb.weights))

    def embedding(self, emb, x):
        seq_len = x.shape[1]
        out = emb(x)

        if self.shared: out *= tf.math.sqrt(self.d_model)

        out += self.pos_encoding[np.newaxis, ...][:, :seq_len, :]
        out = self.dropout(out)

        return out

        
    def call(self, enc_in, dec_in, enc_mask, causality_mask, dec_mask):
        enc_in = self.embedding(self.enc_emb, enc_in)
        dec_in = self.embedding(self.dec_emb, dec_in)

        enc_out, enc_attns = self.encoder(enc_in, enc_mask)
        
        dec_out, dec_attns, dec_enc_attns = \
        self.decoder(dec_in, enc_out, causality_mask, dec_mask)
        
        logits = self.fc(dec_out)
        
        return logits, enc_attns, dec_attns, dec_enc_attns

In [14]:
def generate_padding_mask(seq):
    seq = tf.cast(tf.math.equal(seq, 0), tf.float32)
    return seq[:, tf.newaxis, tf.newaxis, :]

def generate_causality_mask(src_len, tgt_len):
    mask = 1 - np.cumsum(np.eye(src_len, tgt_len), 0)
    return tf.cast(mask, tf.float32)

def generate_masks(src, tgt):
    enc_mask = generate_padding_mask(src)
    dec_mask = generate_padding_mask(tgt)

    dec_enc_causality_mask = generate_causality_mask(tgt.shape[1], src.shape[1])
    dec_enc_mask = tf.maximum(enc_mask, dec_enc_causality_mask)

    dec_causality_mask = generate_causality_mask(tgt.shape[1], tgt.shape[1])
    dec_mask = tf.maximum(dec_mask, dec_causality_mask)

    return enc_mask, dec_enc_mask, dec_mask

print("genetate_padding_mask 정의 완료")

genetate_padding_mask 정의 완료


In [15]:
class LearningRateScheduler(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, d_model, warmup_steps=4000):
        super(LearningRateScheduler, self).__init__()
        self.d_model = d_model
        self.warmup_steps = warmup_steps
    
    def __call__(self, step):
        arg1 = step ** -0.5
        arg2 = step * (self.warmup_steps ** -1.5)
        
        return (self.d_model ** -0.5) * tf.math.minimum(arg1, arg2)

learning_rate = LearningRateScheduler(512)
optimizer = tf.keras.optimizers.Adam(learning_rate,
                                     beta_1=0.9,
                                     beta_2=0.98, 
                                     epsilon=1e-9)

In [16]:
data_dir = os.getenv('HOME')+'/aiffel/transformer/data'
kor_path = data_dir + "/korean-english-park.train.ko"
eng_path = data_dir + "/korean-english-park.train.en"

In [17]:
def clean_corpus(kor_path, eng_path):
    with open(kor_path, "r") as f: kor = f.read().splitlines()
    with open(eng_path, "r") as f: eng = f.read().splitlines()
    assert len(kor) == len(eng)
    
    cleaned_corpus = list(set(zip(kor, eng)))
    
    return cleaned_corpus
    
cleaned_corpus = clean_corpus(kor_path, eng_path)
print('정제된 corpus 데이터의 수는', len(cleaned_corpus), '개 입니다.')

정제된 corpus 데이터의 수는 78968 개 입니다.


In [18]:
def preprocess_sentence(sentence):
    # 모든 입력은 소문자로 변환
    sentence = sentence.lower()
    # 알파벳, 문장부호, 한글만 남기고 모두 제거
    sentence = re.sub(r"[^a-zA-Zㄱ-ㅎ가-힣ㅏ-ㅣ.,?!]+", " ", sentence)
    # 문장부호 양 사이드에 공백 추가
    sentence = re.sub(r"([,.?!])", r" \1 ", sentence)
    sentence = re.sub(r'[" "]+', " ", sentence)
    
    return sentence

In [19]:
import sentencepiece as spm

# Sentencepiece를 활용하여 학습한 tokenizer를 생성
def generate_tokenizer(corpus, vocab_size, lang="ko", pad_id=0, bos_id=1, eos_id=2, unk_id=3):
    # corpus를 받아 txt 파일로 저장
    temp_file = os.getenv('HOME') + f'/aiffel/transformer/temp/corpus_{lang}.txt'
    
    with open(temp_file, 'w') as f:
        for row in corpus:
            f.write(str(row) + '\n')
    
    # Sentencepiece
    spm.SentencePieceTrainer.Train(
        f'--input={temp_file} --pad_id={pad_id} --bos_id={bos_id} --eos_id={eos_id} \
        --unk_id={unk_id} --model_prefix=spm_{lang} --vocab_size={vocab_size}'
    )
    tokenizer = spm.SentencePieceProcessor()
    tokenizer.Load(f'spm_{lang}.model')

    return tokenizer

In [20]:
SRC_VOCAB_SIZE = TGT_VOCAB_SIZE = 20000

eng_corpus = []
kor_corpus = []

for pair in cleaned_corpus:
    k, e = pair[0], pair[1]

    kor_corpus.append(preprocess_sentence(k))
    eng_corpus.append(preprocess_sentence(e))

ko_tokenizer = generate_tokenizer(kor_corpus, SRC_VOCAB_SIZE, "ko")
en_tokenizer = generate_tokenizer(eng_corpus, TGT_VOCAB_SIZE, "en")
en_tokenizer.set_encode_extra_options("bos:eos")

sentencepiece_trainer.cc(177) LOG(INFO) Running command: --input=/aiffel/aiffel/transformer/temp/corpus_ko.txt --pad_id=0 --bos_id=1 --eos_id=2         --unk_id=3 --model_prefix=spm_ko --vocab_size=20000
sentencepiece_trainer.cc(77) LOG(INFO) Starts training with : 
trainer_spec {
  input: /aiffel/aiffel/transformer/temp/corpus_ko.txt
  input_format: 
  model_prefix: spm_ko
  model_type: UNIGRAM
  vocab_size: 20000
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  hard_vocab_limit: 1
  use_all_vocab: 0
  un

True

g vocabs: spm_en.vocab


In [21]:
cleaned_corpus[5]

('NSIDC의 선임연구원인 마크 서리지는 이 연구결과에 대해 “깜짝 놀랐다”고 설명했다.',
 'Mark Serreze, senior research scientist at NSIDC, termed the decline "astounding.')

In [22]:
from tqdm import tqdm_notebook
import tqdm
import tensorflow as tf

src_corpus = []
tgt_corpus = []

assert len(kor_corpus) == len(eng_corpus)

# 토큰의 길이가 50 이하인 문장만
for idx in tqdm_notebook(range(len(kor_corpus))):
    src = ko_tokenizer.EncodeAsIds(kor_corpus[idx])
    tgt = en_tokenizer.EncodeAsIds(eng_corpus[idx])
    
    if len(src) <= 50 and len(tgt) <= 50:
        src_corpus.append(src)
        tgt_corpus.append(tgt)
        
# 패딩처리를 완료하여 학습용 데이터를 완성 
enc_train = tf.keras.preprocessing.sequence.pad_sequences(src_corpus, padding='post')
dec_train = tf.keras.preprocessing.sequence.pad_sequences(tgt_corpus, padding='post')

  0%|          | 0/78968 [00:00<?, ?it/s]

In [23]:
# Loss 함수 정의
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')

def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)

    # Masking 되지 않은 입력의 개수로 Scaling하는 과정
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask

    return tf.reduce_sum(loss_)/tf.reduce_sum(mask)

In [24]:
@tf.function()
def train_step(src, tgt, model, optimizer):
    gold = tgt[:, 1:]
        
    enc_mask, dec_enc_mask, dec_mask = generate_masks(src, tgt)

    # 계산된 loss에 tf.GradientTape()를 적용해 학습을 진행합니다.
    with tf.GradientTape() as tape:
        predictions, enc_attns, dec_attns, dec_enc_attns = model(src, tgt, enc_mask, dec_enc_mask, dec_mask)
        loss = loss_function(gold, predictions[:, :-1])

    # 최종적으로 optimizer.apply_gradients()가 사용됩니다. 
    gradients = tape.gradient(loss, model.trainable_variables)    
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    
    return loss, enc_attns, dec_attns, dec_enc_attns

## 2. Modeling

In [25]:
# Attention 시각화 함수
def visualize_attention(src, tgt, enc_attns, dec_attns, dec_enc_attns):
    def draw(data, ax, x="auto", y="auto"):
        import seaborn
        seaborn.heatmap(data, 
                        square=True,
                        vmin=0.0, vmax=1.0, 
                        cbar=False, ax=ax,
                        xticklabels=x,
                        yticklabels=y)
        
    for layer in range(0, 2, 1):
        fig, axs = plt.subplots(1, 4, figsize=(20, 10))
        print("Encoder Layer", layer + 1)
        for h in range(4):
            draw(enc_attns[layer][0, h, :len(src), :len(src)], axs[h], src, src)
        plt.show()
        
    for layer in range(0, 2, 1):
        fig, axs = plt.subplots(1, 4, figsize=(20, 10))
        print("Decoder Self Layer", layer+1)
        for h in range(4):
            draw(dec_attns[layer][0, h, :len(tgt), :len(tgt)], axs[h], tgt, tgt)
        plt.show()

        print("Decoder Src Layer", layer+1)
        fig, axs = plt.subplots(1, 4, figsize=(20, 10))
        for h in range(4):
            draw(dec_enc_attns[layer][0, h, :len(tgt), :len(src)], axs[h], src, tgt)
        plt.show()

In [26]:
# 번역 생성 함수
def evaluate(sentence, model, src_tokenizer, tgt_tokenizer):
    sentence = preprocess_sentence(sentence)
    pieces = src_tokenizer.encode_as_pieces(sentence)
    tokens = src_tokenizer.encode_as_ids(sentence)

    _input = tf.keras.preprocessing.sequence.pad_sequences([tokens], maxlen=enc_train.shape[-1], padding='post')
    
    print(len(_input))
    print(enc_train.shape[-1])

    ids = []
    output = tf.expand_dims([tgt_tokenizer.bos_id()], 0)
    for i in range(dec_train.shape[-1]):
        enc_padding_mask, combined_mask, dec_padding_mask = generate_masks(_input, output)
        
        # InvalidArgumentError: In[0] mismatch In[1] shape: 50 vs. 1: [1,8,1,50] [1,8,1,64] 0 0 [Op:BatchMatMulV2]
        predictions, enc_attns, dec_attns, dec_enc_attns = model(_input, output, enc_padding_mask, combined_mask, dec_padding_mask)
        
        predicted_id = tf.argmax(tf.math.softmax(predictions, axis=-1)[0, -1]).numpy().item()
        if tgt_tokenizer.eos_id() == predicted_id:
            result = tgt_tokenizer.decode_ids(ids)
            return pieces, result, enc_attns, dec_attns, dec_enc_attns

        ids.append(predicted_id)
        output = tf.concat([output, tf.expand_dims([predicted_id], 0)], axis=-1)
    result = tgt_tokenizer.decode_ids(ids)
    return pieces, result, enc_attns, dec_attns, dec_enc_attns

In [27]:
# 번역 생성 및 Attention 시각화 결합
def translate(sentence, model, src_tokenizer, tgt_tokenizer, plot_attention=False):
    pieces, result, enc_attns, dec_attns, dec_enc_attns = evaluate(sentence, model, src_tokenizer, tgt_tokenizer)
    
    print('Input: %s' % (sentence))
    print('Predicted translation: {}'.format(result))

    if plot_attention:
        visualize_attention(pieces, result.split(), enc_attns, dec_attns, dec_enc_attns)

In [28]:
transformer = Transformer(
    n_layers=2,
    d_model=512,
    n_heads=8,
    d_ff = 2048,
    src_vocab_size=SRC_VOCAB_SIZE,
    tgt_vocab_size=TGT_VOCAB_SIZE,
    pos_len=200,
    dropout=0.2,
    shared=True
)

In [29]:
from tqdm.notebook import tqdm

BATCH_SIZE = 64
EPOCHS = 20

examples = [
            "오바마는 대통령이다.",
            "시민들은 도시 속에 산다.",
            "커피는 필요 없다.",
            "일곱 명의 사망자가 발생했다."
]

for epoch in range(EPOCHS):
    total_loss = 0
    
    idx_list = list(range(0, enc_train.shape[0], BATCH_SIZE))
    random.shuffle(idx_list)
    t = tqdm_notebook(idx_list)

    for (batch, idx) in enumerate(t):
        batch_loss, enc_attns, dec_attns, dec_enc_attns = \
        train_step(enc_train[idx:idx+BATCH_SIZE],
                    dec_train[idx:idx+BATCH_SIZE],
                    transformer,
                    optimizer)

        total_loss += batch_loss
        
        t.set_description_str('Epoch %2d' % (epoch + 1))
        t.set_postfix_str('Loss %.4f' % (total_loss.numpy() / (batch + 1)))

    for example in examples:
        translate(example, transformer, ko_tokenizer, en_tokenizer)

  0%|          | 0/1128 [00:00<?, ?it/s]

1
50
Input: 오바마는 대통령이다.
Predicted translation: obama is a vote .
1
50
Input: 시민들은 도시 속에 산다.
Predicted translation: the same time in the area of the town of the town of the area .
1
50
Input: 커피는 필요 없다.
Predicted translation: the first time , the first time .
1
50
Input: 일곱 명의 사망자가 발생했다.
Predicted translation: the quake was found in the town of the town of the town of the town of the town of the town of the town of the town of the town of the town of the death toll .


  0%|          | 0/1128 [00:00<?, ?it/s]

1
50
Input: 오바마는 대통령이다.
Predicted translation: obama is a campaign that is a one of obama s campaign .
1
50
Input: 시민들은 도시 속에 산다.
Predicted translation: the city is the city of city .
1
50
Input: 커피는 필요 없다.
Predicted translation: the coffee is not available .
1
50
Input: 일곱 명의 사망자가 발생했다.
Predicted translation: the death toll from the death toll from the death toll in the death toll .


  0%|          | 0/1128 [00:00<?, ?it/s]

1
50
Input: 오바마는 대통령이다.
Predicted translation: the obama is the first time .
1
50
Input: 시민들은 도시 속에 산다.
Predicted translation: the city is the city siena .
1
50
Input: 커피는 필요 없다.
Predicted translation: coffee is not coffee .
1
50
Input: 일곱 명의 사망자가 발생했다.
Predicted translation: the death toll was hit by a rise in the city .


  0%|          | 0/1128 [00:00<?, ?it/s]

1
50
Input: 오바마는 대통령이다.
Predicted translation: obama is a third person .
1
50
Input: 시민들은 도시 속에 산다.
Predicted translation: the demonstrators are being used to be in the city .
1
50
Input: 커피는 필요 없다.
Predicted translation: coffee does not mean coffee .
1
50
Input: 일곱 명의 사망자가 발생했다.
Predicted translation: the death toll is hit by a seventh of seven people .


  0%|          | 0/1128 [00:00<?, ?it/s]

1
50
Input: 오바마는 대통령이다.
Predicted translation: the obama is a man .
1
50
Input: 시민들은 도시 속에 산다.
Predicted translation: the city is a mountainous city .
1
50
Input: 커피는 필요 없다.
Predicted translation: the memory is no .
1
50
Input: 일곱 명의 사망자가 발생했다.
Predicted translation: seven seven of seven seven seven people were killed tuesday .


  0%|          | 0/1128 [00:00<?, ?it/s]

1
50
Input: 오바마는 대통령이다.
Predicted translation: it is the kind of personal .
1
50
Input: 시민들은 도시 속에 산다.
Predicted translation: the city is in the city .
1
50
Input: 커피는 필요 없다.
Predicted translation: there s no more need to be there .
1
50
Input: 일곱 명의 사망자가 발생했다.
Predicted translation: seven people were killed and were wounded .


  0%|          | 0/1128 [00:00<?, ?it/s]

1
50
Input: 오바마는 대통령이다.
Predicted translation: the president is part of the country .
1
50
Input: 시민들은 도시 속에 산다.
Predicted translation: the city is one of the city s biggest city .
1
50
Input: 커피는 필요 없다.
Predicted translation: coffee needs to do so .
1
50
Input: 일곱 명의 사망자가 발생했다.
Predicted translation: seven people were confirmed deaths on monday , and seven others were wounded .


  0%|          | 0/1128 [00:00<?, ?it/s]

1
50
Input: 오바마는 대통령이다.
Predicted translation: that s the president of the country .
1
50
Input: 시민들은 도시 속에 산다.
Predicted translation: the city is keeping the mountain in a mountain .
1
50
Input: 커피는 필요 없다.
Predicted translation: practice has no closer .
1
50
Input: 일곱 명의 사망자가 발생했다.
Predicted translation: seven of the seven death toll stood on sunday , on the seventh of the seventh day of the seventh day of the seventh fatality .


  0%|          | 0/1128 [00:00<?, ?it/s]

1
50
Input: 오바마는 대통령이다.
Predicted translation: obama is the second leg of his five year old connection
1
50
Input: 시민들은 도시 속에 산다.
Predicted translation: they are on city .
1
50
Input: 커피는 필요 없다.
Predicted translation: coffee needs to do away with coffee
1
50
Input: 일곱 명의 사망자가 발생했다.
Predicted translation: seven seven of the seventh fatality in my seven my seven my seven seven seven seven my seventh floor seven days .


  0%|          | 0/1128 [00:00<?, ?it/s]

1
50
Input: 오바마는 대통령이다.
Predicted translation: it s the point of that obama .
1
50
Input: 시민들은 도시 속에 산다.
Predicted translation: the only in the city of mumbai .
1
50
Input: 커피는 필요 없다.
Predicted translation: coffee needs to do is take at coffee .
1
50
Input: 일곱 명의 사망자가 발생했다.
Predicted translation: seven seven of seven seven seven seven seven of the seven seven seven seven seven seven seven seven people were killed .


  0%|          | 0/1128 [00:00<?, ?it/s]

1
50
Input: 오바마는 대통령이다.
Predicted translation: obama is the second of the president .
1
50
Input: 시민들은 도시 속에 산다.
Predicted translation: so the city is beneath .
1
50
Input: 커피는 필요 없다.
Predicted translation: if coffee needs coffee is fading .
1
50
Input: 일곱 명의 사망자가 발생했다.
Predicted translation: seven sri lankan seven personnel were among the seven seven seven seven seven seven seven seven of the seven seven seven seven seven seven seven seven seven seven people .


  0%|          | 0/1128 [00:00<?, ?it/s]

1
50
Input: 오바마는 대통령이다.
Predicted translation: obama is eighth .
1
50
Input: 시민들은 도시 속에 산다.
Predicted translation: the only city around .
1
50
Input: 커피는 필요 없다.
Predicted translation: coffee needs coffee
1
50
Input: 일곱 명의 사망자가 발생했다.
Predicted translation: seven seven sri lankan seven astronauts were killed and seven others .


  0%|          | 0/1128 [00:00<?, ?it/s]

1
50
Input: 오바마는 대통령이다.
Predicted translation: that s the president .
1
50
Input: 시민들은 도시 속에 산다.
Predicted translation: the only neighbors in the city .
1
50
Input: 커피는 필요 없다.
Predicted translation: coffee needs at the coffee house .
1
50
Input: 일곱 명의 사망자가 발생했다.
Predicted translation: seven others died wednesday seven of seven deaths .


  0%|          | 0/1128 [00:00<?, ?it/s]

1
50
Input: 오바마는 대통령이다.
Predicted translation: yes , that would be president of the united states .
1
50
Input: 시민들은 도시 속에 산다.
Predicted translation: it s the city of to go to the city .
1
50
Input: 커피는 필요 없다.
Predicted translation: coffee needs to do music at the coffee
1
50
Input: 일곱 명의 사망자가 발생했다.
Predicted translation: the seventh floor leader was killed friday when a seventh personnel died .


  0%|          | 0/1128 [00:00<?, ?it/s]

1
50
Input: 오바마는 대통령이다.
Predicted translation: obama is the second life of the younger man .
1
50
Input: 시민들은 도시 속에 산다.
Predicted translation: the only place in the city .
1
50
Input: 커피는 필요 없다.
Predicted translation: coffee needs no coffee right at the coffee
1
50
Input: 일곱 명의 사망자가 발생했다.
Predicted translation: the seven seven seventh of the seventh vote fourth in the seventh vote related to the seventh vote thursday


  0%|          | 0/1128 [00:00<?, ?it/s]

1
50
Input: 오바마는 대통령이다.
Predicted translation: obama is the second presidential candidate .
1
50
Input: 시민들은 도시 속에 산다.
Predicted translation: where the city is a god .
1
50
Input: 커피는 필요 없다.
Predicted translation: coffee needs at coffee
1
50
Input: 일곱 명의 사망자가 발생했다.
Predicted translation: seven seven seven others were originally listed in theni province on friday , a seven report said .


  0%|          | 0/1128 [00:00<?, ?it/s]

1
50
Input: 오바마는 대통령이다.
Predicted translation: obama is the second leg of the oval office .
1
50
Input: 시민들은 도시 속에 산다.
Predicted translation: where the city is the only one of the city .
1
50
Input: 커피는 필요 없다.
Predicted translation: coffee needs at coffee houses .
1
50
Input: 일곱 명의 사망자가 발생했다.
Predicted translation: seven seven seven seven seven seven seven seven seven seven seven seven seven seven seven seven seven seven seven seven seven seven days were taken into the began thursday evening .


  0%|          | 0/1128 [00:00<?, ?it/s]

1
50
Input: 오바마는 대통령이다.
Predicted translation: he is a second person in that country .
1
50
Input: 시민들은 도시 속에 산다.
Predicted translation: where goes to the city .
1
50
Input: 커피는 필요 없다.
Predicted translation: if stay there is no coffee .
1
50
Input: 일곱 명의 사망자가 발생했다.
Predicted translation: seven others were among the seven seven people .


  0%|          | 0/1128 [00:00<?, ?it/s]

1
50
Input: 오바마는 대통령이다.
Predicted translation: president obama is in very car .
1
50
Input: 시민들은 도시 속에 산다.
Predicted translation: on sunday the city is a god renowned for god .
1
50
Input: 커피는 필요 없다.
Predicted translation: if coffee do is not coffee .
1
50
Input: 일곱 명의 사망자가 발생했다.
Predicted translation: seven sri lanka s seven report shows seven of the seven seven seven people were dead and seven others were still vulnerable .


  0%|          | 0/1128 [00:00<?, ?it/s]

1
50
Input: 오바마는 대통령이다.
Predicted translation: obama is a popular vote on his wife .
1
50
Input: 시민들은 도시 속에 산다.
Predicted translation: the crowd is on the brink of five .
1
50
Input: 커피는 필요 없다.
Predicted translation: coffee needs to do is not coffee .
1
50
Input: 일곱 명의 사망자가 발생했다.
Predicted translation: the seven were among the seven casualtiess and a mostly whom died , the fire officials said on sunday .


## 3. Project Retrospective

+ 두 가지 오류로 인해 Transformer 모델 구현 및 프로젝트 완성에 많은 시간이 소요되었으며, LMS -> Colab에서도 나중에야 동작에 문제 없음을 다시 한 번 확인하였다.
    - 허무하게도, 원인은 원본 파일이 위치해 있는 (.ko) 곳에 txt 파일이 동시에 위치해서는 안 되는 것이었다. 이유는 공식 문서에서도 찾지 못했다. 폴더를 따로 만들어, 경로를 분리 지정하였더니 해결되었다.
    - How to solve OSError: [Errno 30] Read-only file system?
        + https://www.kaggle.com/questions-and-answers/70138
    - OSError: Not found: unknown field name “xxxx” in TrainerSpec
        + https://forums.fast.ai/t/lesson-8-notebook-10-nlp-oserror-not-found-unknown-field-name-minloglevel-in-trainerspec/71436
        
+ 이번 프로젝트에서는 이전과 동일한 DataSet을 사용하므로 전체적인 윤곽을 잡는 것은 그렇게 어렵지 않았다. 하지만 이전과 달랐던 점은 모델 학습결과 번역 성능이 Attention 대비 월등히 향상되었다는 점이며, 데이터 전처리부터 토크나이징, 패딩 외에 성능을 끌어올리기 위한 다양한 예시 코드들을 보면서 이전에 하던 하이퍼파라미터 튜닝은 참 소소한(?) 작업이었구나 하는 점을 다시금 느낀다.

+ 다음 그룹 프로젝트에서 진행하는 것도, 데이터셋만 다를 뿐, 실제로 진행하는 전 과정이 비슷하게 진행될 것으로 보인다. 따라서 이번에 고생한 경험을 발판삼아 조금 덜 고생할 수 있기를 바란다.