<a href="https://colab.research.google.com/github/jinsusong/21-study-paper-review/blob/main/Transformer_Code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Transformer : Attention is All You Need

As of 2021, the latest high-performance models are based on the Transformer architecture.

GPT : uses the Transformer's decoder architecture
BERT : uses the Transformer's encoder architecture



Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')
%pwd

""" 
Use this JavaScript code in Inspect > Console so you won't need to click the page every 15 min:

########################
function ConnectButton(){
    console.log("Connect pushed"); 
    document.querySelector("#top-toolbar > colab-connect-button").shadowRoot.querySelector("#connect").click() 
}
setInterval(ConnectButton,60000);
########################

"""

Change the current path to the working project folder

In [None]:
%cd drive/MyDrive/projects/transformers_translation/

### Step 0 : Get The Data 

Upload the data to the current path and unzip it (uncomment and run this only once)

In [None]:
# # data is from: https://www.statmt.org/europarl/ you can use this or just upload your own data
# %cd data
# !wget https://www.statmt.org/europarl/v7/de-en.tgz
# !tar -xvf de-en.tgz
# %cd ..
# %pwd

Get the non-breaking prefixes

In [None]:
# get the non-breaking prefix files from https://github.com/moses-smt/mosesdecoder/tree/master/scripts/share/nonbreaking_prefixes
# then rename them to "nonbreaking-prefix.en" and "nonbreaking-prefix.de" (the names read below) and put them
# in your data folder, so we don't treat the dot in 'mr.jackson' as the end of a sentence

### Step 1 : Importing Dependencies

In [None]:
import numpy as np
import math 
import re
import time # to see how long it takes in training


In [None]:
%tensorflow_version 2.x

import tensorflow as tf
from tensorflow.keras import layers 
import tensorflow_datasets as tfds # tools for the tokenizer 



### Step 2 : Data Preprocessing 

read files

In [None]:
with open("data/europarl-v7.de-en.en", mode='r', encoding="utf-8") as f:
    text_en = f.read()

with open("data/europarl-v7.de-en.de", mode='r', encoding="utf-8") as f:
    text_de = f.read()

print(text_en[:50])
print(text_de[:50])



In [None]:
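with open("data/nonbreaking-prefix.en", mode='r', encoding="utf-8") as f:
    non_breaking_prefix_en = f.read()

with open("data/nonbreaking-prefix.de", mode='r', encoding="utf-8") as f:
    non_breaking_prefix_de = f.read()

# one prefix per line -> list of " prefix." strings such as " Mr."
# (blank lines and '#' comment lines in the prefix files are skipped)
non_breaking_prefix_en = [' ' + p + '.' for p in non_breaking_prefix_en.split("\n")
                          if p and not p.startswith('#')]
non_breaking_prefix_de = [' ' + p + '.' for p in non_breaking_prefix_de.split("\n")
                          if p and not p.startswith('#')]

print(non_breaking_prefix_en[:5])
print(non_breaking_prefix_de[:5])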


Cleaning

In [None]:
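# Protect dots that do not end a sentence:
# 1. tag each non-breaking prefix with '###'
for prefix in non_breaking_prefix_en:
    text_en = text_en.replace(prefix, prefix + '###')
# 2. tag every dot that is directly followed by a digit or a letter (e.g. "3.5")
text_en = re.sub(r"\.(?=[0-9]|[a-z]|[A-Z])", ".###", text_en)
# 3. remove the tagged dots, so only sentence-ending dots remain
text_en = re.sub(r"\.###",'',text_en)
# 4. collapse repeated spaces and clear any leftover tags
text_en = re.sub(r" +", ' ', text_en)
text_en = text_en.replace('###',' ')

# one sentence per line
text_en = text_en.split("\n")

# same steps for the German text
for prefix in non_breaking_prefix_de:
    text_de = text_de.replace(prefix, prefix + '###')
text_de = re.sub(r"\.(?=[0-9]|[a-z]|[A-Z])", ".###", text_de)
text_de = re.sub(r"\.###",'',text_de)
text_de = re.sub(r" +",' ',text_de)
text_de = text_de.replace('###',' ')

text_de = text_de.split("\n")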
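
To see what the cleaning steps do, here is a minimal sketch that applies the same substitutions to one short string. The helper name `clean_text` and the sample sentence are made up for illustration; only the `re` calls mirror the cell above.

```python
import re

def clean_text(text, prefixes):
    """Apply the same cleaning steps as the notebook cell to a small string."""
    # Tag the dot of each known abbreviation, e.g. " Mr." -> " Mr.###"
    for prefix in prefixes:
        text = text.replace(prefix, prefix + "###")
    # Tag dots that are directly followed by a digit or a letter, e.g. "3.5"
    text = re.sub(r"\.(?=[0-9]|[a-z]|[A-Z])", ".###", text)
    # Drop every tagged dot together with its tag
    text = re.sub(r"\.###", "", text)
    # Collapse runs of spaces, then clear any leftover tags
    text = re.sub(r" +", " ", text)
    return text.replace("###", " ")

sample = "I met Mr. Jackson.  He paid 3.5 euros."
cleaned = clean_text(sample, [" Mr."])
print(cleaned)  # I met Mr Jackson. He paid 35 euros.
```

Note the side effect on numbers: the dot in `3.5` is removed as well, so decimals lose their separator. That is acceptable for this tutorial, but it is worth knowing if your own data is number-heavy.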





### Tokenizing


In [None]:
tokenizer_en = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
    text_en, target_vocab_size=8000
)

tokenizer_de = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
    text_de, target_vocab_size=8000
)





In [None]:
VOCAB_SIZE_EN = tokenizer_en.vocab_size + 2
VOCAB_SIZE_DE = tokenizer_de.vocab_size + 2 

# we use size-2 and size-1 as the start and end tokens, which are the same as
# tokenizer.vocab_size and tokenizer.vocab_size + 1, because the tokenizer's
# own ids run from 0 to tokenizer.vocab_size - 1
# tokenizer_en.encode(sentence) returns a list, and list + list + list concatenates them

inputs = [[VOCAB_SIZE_EN-2] + tokenizer_en.encode(sentence) + [VOCAB_SIZE_EN-1]
          for sentence in text_en]

outputs = [[VOCAB_SIZE_DE-2] + tokenizer_de.encode(sentence) + [VOCAB_SIZE_DE-1]
          for sentence in text_de]
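
The id arithmetic for the start and end tokens can be checked on a toy vocabulary. This is only a sketch: `toy_vocab` and `toy_encode` are hypothetical stand-ins for the real subword tokenizer, and the toy keeps id 0 free for padding as an assumption.

```python
# Hypothetical stand-in for the subword tokenizer: real tokens use ids 1..3,
# and id 0 is kept free for padding (an assumption made for this toy).
toy_vocab = {"hello": 1, "world": 2, "again": 3}
toy_tokenizer_size = len(toy_vocab) + 1   # ids 0..3 -> size 4

def toy_encode(sentence):
    """Map each whitespace-separated word to its id."""
    return [toy_vocab[word] for word in sentence.split()]

TOY_VOCAB_SIZE = toy_tokenizer_size + 2   # two extra ids: start and end
START_ID = TOY_VOCAB_SIZE - 2             # same as toy_tokenizer_size
END_ID = TOY_VOCAB_SIZE - 1               # same as toy_tokenizer_size + 1

# list + list + list concatenates, exactly as in the cell above
encoded = [START_ID] + toy_encode("hello world") + [END_ID]
print(encoded)  # [4, 1, 2, 5]
```

Because the two new ids sit just past the tokenizer's own range, they cannot collide with any real token.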




### Remove too long sentences

- Why? (1) Padding makes long sentences expensive in RAM: sentences of length 1, 100, and 2 all become length 100 once padded, so we would rather lose the one long sentence than pad everything to 100. (2) Long sentences take too much time to train on.
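
The cost of padding in that example can be worked out directly. A small sketch, with `padding_fraction` as a made-up helper name:

```python
def padding_fraction(lengths):
    """Fraction of slots that are padding once every sentence is padded to the longest one."""
    padded_len = max(lengths)
    total_slots = padded_len * len(lengths)
    return (total_slots - sum(lengths)) / total_slots

# Lengths 1, 100, 2 -> 300 slots, of which only 103 hold real tokens
print(padding_fraction([1, 100, 2]))  # 197 of 300 slots are padding
# Dropping the long sentence -> 4 slots, of which 3 hold real tokens
print(padding_fraction([1, 2]))       # 1 of 4 slots is padding
```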

In [None]:
MAX_LENGTH = 20 # we still have a lot of data with a max length of 20

# Collect the indices of the input sentences that are too long.
idx_to_remove = [count for count, sent in enumerate(inputs)
                 if len(sent) > MAX_LENGTH]

# Delete in reverse order: deleting from the front would shift the
# positions of the later items and invalidate the remaining indices.
for idx in reversed(idx_to_remove):
    del inputs[idx]
    del outputs[idx]

# Same for the output sentences longer than MAX_LENGTH.
idx_to_remove = [count for count, sent in enumerate(outputs)
                 if len(sent) > MAX_LENGTH]

for idx in reversed(idx_to_remove):
    del inputs[idx]
    del outputs[idx]
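
An alternative that avoids index bookkeeping altogether is to filter the pairs in a single pass. This is a sketch of a different approach, not the notebook's method; `filter_pairs` and the toy lists are made up for illustration.

```python
def filter_pairs(src, tgt, max_length):
    """Keep only the sentence pairs where both sides fit within max_length."""
    kept = [(s, t) for s, t in zip(src, tgt)
            if len(s) <= max_length and len(t) <= max_length]
    # Unzip back into two parallel lists (both empty if nothing survives)
    new_src = [s for s, _ in kept]
    new_tgt = [t for _, t in kept]
    return new_src, new_tgt

toy_inputs = [[1, 2], [1, 2, 3, 4], [5]]
toy_outputs = [[7], [8, 9], [1, 2, 3, 4, 5]]
filtered = filter_pairs(toy_inputs, toy_outputs, max_length=3)
print(filtered)  # ([[1, 2]], [[7]])
```

Both versions keep the two lists aligned. The in-place deletion above saves memory on a large corpus, while the single-pass filter is easier to read.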

 

### input / output creation

1. padding
2. batching

In [None]:
inputs = tf.keras.preprocessing.sequence.pad_sequences(inputs,
                                                       value=0,
                                                       padding='post',
                                                       maxlen = MAX_LENGTH)

outputs = tf.keras.preprocessing.sequence.pad_sequences(outputs,
                                                        value=0,
                                                        padding='post',
                                                        maxlen = MAX_LENGTH)
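
What `padding='post'` means can be shown without TensorFlow. The helper `pad_post` below is a hypothetical plain-Python sketch of right-padding only. It does not reproduce everything `pad_sequences` does; in particular it refuses over-long sequences instead of truncating them.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence with `value` up to `maxlen` (no truncation in this sketch)."""
    padded = []
    for seq in sequences:
        if len(seq) > maxlen:
            raise ValueError("sequence longer than maxlen; filter it out first")
        padded.append(list(seq) + [value] * (maxlen - len(seq)))
    return padded

toy_padded = pad_post([[4, 1, 2, 5], [4, 3, 5]], maxlen=6)
print(toy_padded)  # [[4, 1, 2, 5, 0, 0], [4, 3, 5, 0, 0, 0]]
```

Since the over-long sentences were already removed, every remaining sequence is only extended with zeros and none is cut.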



In [None]:
BATCH_SIZE = 64
BUFFER_SIZE = 20000 # number of elements held in the shuffle buffer

# turn our data into a tf.data dataset
dataset = tf.data.Dataset.from_tensor_slices((inputs, outputs))

# cache the dataset so later epochs read it faster,
# which speeds up training in return:
dataset = dataset.cache()

# shuffle the examples, then group them into batches
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE)

# prefetch upcoming batches while the current one is being processed:
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
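
A rough plain-Python picture of shuffling through a fixed-size buffer and then batching is sketched below. It is only an illustration of the idea and makes no claim about how `tf.data` is implemented; `shuffle_with_buffer` and `batch` are made-up helper names.

```python
import random

def shuffle_with_buffer(items, buffer_size, seed=0):
    """Yield items in a randomized order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)        # fill the buffer first
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]              # emit a random buffered element
        buffer[idx] = item             # and put the new item in its place
    rng.shuffle(buffer)                # drain what is left at the end
    yield from buffer

def batch(items, batch_size):
    """Group consecutive items into lists of at most batch_size elements."""
    chunk = []
    for item in items:
        chunk.append(item)
        if len(chunk) == batch_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk                    # final, possibly smaller batch

batches = list(batch(shuffle_with_buffer(range(10), buffer_size=4), batch_size=3))
print([len(b) for b in batches])  # [3, 3, 3, 1]
```

With a buffer smaller than the dataset, an element can only move a limited distance from its original position, so the shuffle is approximate. A larger buffer gives better mixing at the cost of memory.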


### Step 3 : Model Building

- A - Positional Encoding (see the formula in the paper)
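
For reference, the paper defines the encoding as $PE_{(pos,\,2i)} = \sin\left(pos / 10000^{2i/d_{model}}\right)$ and $PE_{(pos,\,2i+1)} = \cos\left(pos / 10000^{2i/d_{model}}\right)$. A standalone NumPy sketch of that formula follows. The function name `positional_encoding` and the small sizes are chosen for illustration, and this is not the Keras layer built in this notebook.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) array of sinusoidal positional encodings."""
    pos = np.arange(seq_len)[:, np.newaxis]   # (seq_len, 1) word positions
    i = np.arange(d_model)[np.newaxis, :]     # (1, d_model) embedding dimensions
    # Dimensions 2k and 2k+1 share the same frequency, hence i // 2
    angles = pos / np.power(10000.0, (2 * (i // 2)) / np.float32(d_model))
    pe = np.zeros((seq_len, d_model), dtype=np.float32)
    pe[:, 0::2] = np.sin(angles[:, 0::2])     # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])     # odd dimensions use cosine
    return pe

pe = positional_encoding(seq_len=20, d_model=8)
print(pe.shape)  # (20, 8)
```

At position 0 every sine entry is 0 and every cosine entry is 1, which is a quick sanity check for any implementation.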

In [None]:
class PositionalEncoding(layers.Layer):
    def __init__(self):
        # PositionalEncoding subclasses layers.Layer, so it has all
        # the properties that a layer has
        super(PositionalEncoding, self).__init__()

    def get_angles(self, pos, i, d_model):
        """
        :pos: (seq_len, 1) index of the word in the sentence, e.g. [0 to 19]
        :i: indices of the embedding dimensions, e.g. [0 to 199] for a
            200-dimensional embedding such as GloVe 200
        :d_model: the size (dimension) of the embedding (e.g. GloVe size 200)
        :return: (seq_len, d_model) the angle for every position paired with
                every one of the embedding dimensions
        """
        angles = 1 / np.power(10000., (2*(i//2))/np.float32(d_model))
        return pos * angles # dim: (seq_len, d_model)