# 텍스트 데이터 전처리
1. 데이터 표준화
- 대문자나 발음 표현 기호 같은거 다 무시하고 동일하게 소문자로 통일
- 쉼표나 콤마 등 스톱워드(의미가 없는 문자들) 삭제
- 머신 러닝에서 드물게 문법 다 무시하고 기본형으로 바꿔버리는 어간 추출이란 기법도 있다( 물론 그만큼 원본에 대한 데이터는 사라지는 게 맞다)
2. 텍스트 토큰화
- 시퀀스 모델과 BoW모델로 나뉜다
- BoW모델의 경우 문장의 순서에 대한 정보는 거의 없지만, 2-gram, 3-gram등 순서정보가 눈꼽만큼 들어가 있는 토큰화 방법도 있다



 토큰화를 진행할 때 각 토큰을 수치로 인코딩하여 벡터화 시켜야 하는데, 이를 지금부터 구현해보도록 하자.
 **주의할 점**
 - 어휘사전에 모든 토큰이 기록되어 있지 않기 때문에 예외 단어를 위한 인덱스를 만들어 둔다(보통은 1을 사용한다 -> 1은 어휘사전에 없는 모든 단어에 대응함)
 - 이러한 예외처리 토큰을 OOV(out of vocabulary)라고 부른다
 - 0번쨰 인덱스는 일반적으로 무시할 수 있는 토큰을 매핑하기 위해 사용한다.

In [None]:
from tensorflow.keras.layers import TextVectorization
text_vectorization = TextVectorization(
    output_mode="int"
)

In [None]:
dataset = [
    "I write, erase, rewrite",
    "Erase again, and then",
    "A poppy blooms"
]
text_vectorization.adapt(dataset) # 말뭉치로 어휘사전 인덱싱 가능

In [None]:
text_vectorization.get_vocabulary() # 저장된 어휘사전 열람

['',
 '[UNK]',
 'erase',
 'write',
 'then',
 'rewrite',
 'poppy',
 'i',
 'blooms',
 'and',
 'again',
 'a']

In [None]:
vocabulary = text_vectorization.get_vocabulary()
test_sentence = "I write, rewrite, and still rewrite again"
encoded_sentence = text_vectorization(test_sentence)
print("encoded sentence : ", encoded_sentence)
inverse_vocab = dict(enumerate(vocabulary))
decoded_sentence = " ".join(inverse_vocab[int(i)] for i in encoded_sentence)
print("decoded sentence : ", decoded_sentence)

encoded sentence :  tf.Tensor([ 7  3  5  9  1  5 10], shape=(7,), dtype=int64)
decoded sentence :  i write rewrite and [UNK] rewrite again


# IMDB 영화 리뷰 데이터 준비하기

In [None]:
!curl -0 https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz --output aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 80.2M  100 80.2M    0     0  9224k      0  0:00:08  0:00:08 --:--:-- 17.2M


In [None]:
!rm -r aclImdb/train/unsup
!cat aclImdb/train/pos/4077_10.txt

I first saw this back in the early 90s on UK TV, i did like it then but i missed the chance to tape it, many years passed but the film always stuck with me and i lost hope of seeing it TV again, the main thing that stuck with me was the end, the hole castle part really touched me, its easy to watch, has a great story, great music, the list goes on and on, its OK me saying how good it is but everyone will take there own best bits away with them once they have seen it, yes the animation is top notch and beautiful to watch, it does show its age in a very few parts but that has now become part of it beauty, i am so glad it has came out on DVD as it is one of my top 10 films of all time. Buy it or rent it just see it, best viewing is at night alone with drink and food in reach so you don't have to stop the film.<br /><br />Enjoy

In [None]:
import os, pathlib, shutil, random

base_dir = pathlib.Path("aclImdb")
val_dir = base_dir / "val"
train_dir = base_dir / "train"
for category in ("neg", "pos"):
  os.makedirs(val_dir / category)
  files = os.listdir(train_dir / category)
  random.Random(1337).shuffle(files)
  num_val_samples = int(0.2 * len(files))
  val_files = files[-num_val_samples:]
  for fname in val_files:
    shutil.move(train_dir / category / fname, val_dir / category / fname)

In [None]:
from tensorflow import keras
batch_size = 32

train_ds = keras.utils.text_dataset_from_directory(
    train_dir, batch_size = batch_size
)

val_ds = keras.utils.text_dataset_from_directory(
    val_dir, batch_size = batch_size
)

test_ds = keras.utils.text_dataset_from_directory(
    base_dir / "test", batch_size = batch_size
)

Found 20000 files belonging to 2 classes.
Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.


In [None]:
for inputs, targets in train_ds:
  print("inputs.shape : ", inputs.shape)
  print("inputs.dtype : ", inputs.dtype)
  print("targets.shape : ", targets.shape)
  print("targets.dtype : ", targets.dtype)
  print("input[0] : ", inputs[0])
  print("target[0] : ", targets[0])
  break

inputs.shape :  (32,)
inputs.dtype :  <dtype: 'string'>
targets.shape :  (32,)
targets.dtype :  <dtype: 'int32'>
input[0] :  tf.Tensor(b'[***POSSIBLE SPOILERS***] This movie\'s reputation precedes it, so it was with anticipation that I sat down to watch it in letterbox on TCM. What a major disappointment.<br /><br />The cast is superb and the production values are first-rate, but the characters are without depth, the plot is thin, and the whole thing goes on too long. For a movie that deals with alcoholism, family divisions, unfaithfulness, gambling, and sexual repression, the movie is curiously flat, prosaic, lifeless, and cliche-ridden. One example is the portrayal of Frank Hirsch\'s unfaithfuness: his rather heavy-handed request to his wife to "go upstairs and relax a bit" followed by her predictable pleading of a headache, leads - even more predictably - to his evening liaison with his secretary ("hey Nancy, I\'ve got the blues tonight. Let\'s go for a drive"), all according to wel

# BoW 방식의 처리
- 단어사전에 등록된 단어 갯수를 20000개로 해서 위의 string을 shape 20000, 의 one_hot_vector로 만들어 보자!!!


In [None]:
text_vectorization = TextVectorization(
    max_tokens=20000,
    output_mode="multi_hot" # 멀티_핫_이진벡터로 출력
)

text_only_train_ds = train_ds.map(lambda x, y: x)
text_vectorization.adapt(text_only_train_ds)
binary_1gram_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)
binary_1gram_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)
binary_1gram_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)

In [None]:
for inputs, targets in binary_1gram_train_ds:
  print("inputs.shape : ", inputs.shape)
  print("inputs.dtype : ", inputs.dtype)
  print("targets.shape : ", targets.shape)
  print("targets.dtype : ", targets.dtype)
  print("input[0] : ", inputs[0])
  print("target[0] : ", targets[0])
  break

inputs.shape :  (32, 20000)
inputs.dtype :  <dtype: 'float32'>
targets.shape :  (32,)
targets.dtype :  <dtype: 'int32'>
input[0] :  tf.Tensor([1. 1. 1. ... 0. 0. 0.], shape=(20000,), dtype=float32)
target[0] :  tf.Tensor(1, shape=(), dtype=int32)


# 2-gram bow 방식 사용
- 위에서는 unigram 방식으로 BoW 인코딩을 진행하였다, 2-gram 방식의 인코딩도 진행해보자



In [None]:
text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode="multi_hot" # 멀티_핫_이진벡터로 출력
)

In [None]:
text_vectorization.adapt(text_only_train_ds)
binary_2gram_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)
binary_2gram_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)
binary_2gram_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)

In [None]:
for inputs, targets in binary_2gram_train_ds:
  print("inputs.shape : ", inputs.shape)
  print("inputs.dtype : ", inputs.dtype)
  print("targets.shape : ", targets.shape)
  print("targets.dtype : ", targets.dtype)
  print("input[0] : ", inputs[0])
  print("target[0] : ", targets[0])
  break

inputs.shape :  (32, 20000)
inputs.dtype :  <dtype: 'float32'>
targets.shape :  (32,)
targets.dtype :  <dtype: 'int32'>
input[0] :  tf.Tensor([1. 1. 1. ... 0. 0. 0.], shape=(20000,), dtype=float32)
target[0] :  tf.Tensor(1, shape=(), dtype=int32)


# TF-IDF 인코딩을 사용한 바이그램
- 개별단어나 N-그램의 등장 횟수를 카운트한 정보를 추가하기 위한 기법이다.
- TF-IDF 인코딩을 활용할 때는 단어마다 카운트를 세어 많이 등장하는 단어에 대해 가중치를 부여한다.
- 그러나, a 나 the같은 의미는 없지만 많이 쓰이는 단어에 대해서만 너무 카운트가 커져버리는 단점이 있다.
- 그렇기에 특정 단어에 대해 현재 문서에서 많이 쓰일수록, 다른 문서에서 적게 쓰일수록 해당 단어의 가중치를 높게 잡는 방식을 활용하여 각 단어의 Count를 정규화해준다.
- 각 단어에 대해 Count = (문서에서 쓰인 횟수) / log(전체문서에서 쓰인 횟수)



In [None]:
text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode="tf_idf" # tf-idf 사용
)

In [None]:
text_vectorization.adapt(text_only_train_ds)
tfidf_2gram_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)
tfidf_2gram_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)
tfidf_2gram_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)

In [None]:
for inputs, targets in tfidf_2gram_train_ds:
  print("inputs.shape : ", inputs.shape)
  print("inputs.dtype : ", inputs.dtype)
  print("targets.shape : ", targets.shape)
  print("targets.dtype : ", targets.dtype)
  print("input[0] : ", inputs[0])
  print("target[0] : ", targets[0])
  break

inputs.shape :  (32, 20000)
inputs.dtype :  <dtype: 'float32'>
targets.shape :  (32,)
targets.dtype :  <dtype: 'int32'>
input[0] :  tf.Tensor(
[986.4775     14.647509    1.4225553 ...   0.          0.
   0.       ], shape=(20000,), dtype=float32)
target[0] :  tf.Tensor(0, shape=(), dtype=int32)


# 시퀀스 모델
- 지금까지는 단어의 순서에 관한 정보를 거의 신경 안쓰도록 하는 data를 준비해봤다.(따로 fully connected layer로 학습은 안함..)
- 지금부터는 RNN을 활용하여 단어의 순서까지 고려하는 모델을 구성해보겠따
- 케라스 책에서는 각 단어를 단순 one_hot_vector로 나타내는 거부터 하긴 하는데 그냥 단어 임베딩을 바로 활용해 보도록 하자


# 단어 임베딩
- 각 단어를 임의의 벡터로 나타내는 행렬을 학습한다
- 학습과정에서 woman이나 girl같은 유사한 단어들은 코사인 유사도나 L2거리가 작고, woman의 단어 벡터와 king의 단어벡터를 합치면 queen의 단어벡터가 나오는 등 여러 훌륭한 특징들을 도출해 낼 수 있다.

1. 직접 이러한 가중치행렬을 학습시키거나
2. 사전 훈련된 단어 임베딩을 활용하는 방법이 있다.

In [None]:
from tensorflow.keras import layers

max_tokens = 20000
embedding_layer = layers.Embedding(input_dim=max_tokens, output_dim=256) # 각 문장을 max_tokens 길이의 문장으로, 단어 벡터의 차원을 256으로 설정
# ( batch_size, sequence_length ) 를 입력으로 받고, ( batch_size, sequence_length, output_dim )을 출력으로 내놓음

In [None]:
inputs = keras.Input(shape=(None,), dtype="int64")
embedded = layers.Embedding(input_dim=max_tokens, output_dim=256)(inputs)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)

model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary() # 임베딩 층을 활용하는 양방향 LSTM 예시 훈련은 안할거임.. 너무 오래걸림

# 패딩과 마스킹 이해하기

- TextVectorizaion이나 Embedding 층을 활용하여 단어를 인코딩 할 때 각 문장은 max_tokens 갯수로 짤리게 된다
- 만약 최대 문장 길이 600까지만 본다면 그보다 짧은 문장은 0으로 패딩이 이루어지고 그보다 긴 문장은 짤린다
- 그런데 짧은 문장의 경우 끝부분이 0으로 가득차게 되면서 최종 예측에 큰 영향을 미치기 때문에 이를 건너뛸 방법이 필요한데 이것이 마스킹이다


In [None]:
embedding_layer = layers.Embedding(input_dim=10, output_dim=256, mask_zero=True)
some_input = [
    [4, 3, 2, 1, 0, 0, 0],
    [5, 4, 3, 2, 1, 0, 0],
    [2, 1, 0, 0, 0, 0, 0]
]
mask = embedding_layer.compute_mask(some_input)
mask

<tf.Tensor: shape=(3, 7), dtype=bool, numpy=
array([[ True,  True,  True,  True, False, False, False],
       [ True,  True,  True,  True,  True, False, False],
       [ True,  True, False, False, False, False, False]])>

In [None]:
# 사전 훈련된 임베딩 사용하기
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip -q glove.6B.zip

--2023-09-20 06:26:23--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2023-09-20 06:26:24--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2023-09-20 06:26:24--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’

glov

In [None]:
import numpy as np

path_to_glove_file = "glove.6B.100d.txt"
embeddings_index = {}
with open(path_to_glove_file) as f:
  for line in f:
    word, coefs = line.split(maxsplit=1)
    coefs = np.fromstring(coefs, 'f', sep=" ")
    embeddings_index[word] = coefs

print(f"단어 벡터 갯수 : {len(embeddings_index)}")

단어 벡터 갯수 : 400000


In [None]:
embedding_dim = 100

vocabulary = text_vectorization.get_vocabulary()
word_index = dict(zip(vocabulary, range(len(vocabulary))))

embedding_matrix = np.zeros((max_tokens, embedding_dim))
for word, i in word_index.items():
  if i < max_tokens:
    embedding_vector = embeddings_index.get(word)
  if embedding_vector is not None:
    embedding_matrix[i] = embedding_vector

In [None]:
embedding_layer = layers.Embedding(
    max_tokens,
    embedding_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False,
    mask_zero=True
)

In [None]:
inputs = keras.Input(shape=(None,), dtype="int64")
embedded = embedding_layer(inputs)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding_3 (Embedding)     (None, None, 100)         2000000   
                                                                 
 bidirectional_1 (Bidirecti  (None, 64)                34048     
 onal)                                                           
                                                                 
 dropout_1 (Dropout)         (None, 64)                0         
                                                                 
 dense_1 (Dense)             (None, 1)                 65        
                                                                 
Total params: 2034113 (7.76 MB)
Trainable params: 34113 (133.25 KB)
Non-trainable params: 2000000 (7.63 MB)
_________________

In [None]:
max_length = 600
max_tokens = 20000
text_vectorization = layers.TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=max_length
)
text_vectorization.adapt(text_only_train_ds)
int_train_ds = train_ds.map(
    lambda x, y : (text_vectorization(x), y), num_parallel_calls=4
)

int_val_ds = val_ds.map(
    lambda x, y : (text_vectorization(x), y), num_parallel_calls=4
)

int_test_ds = test_ds.map(
    lambda x, y : (text_vectorization(x), y), num_parallel_calls=4
)

In [None]:
for inputs, targets in int_train_ds:
  print("inputs.shape : ", inputs.shape)
  print("inputs.dtype : ", inputs.dtype)
  print("targets.shape : ", targets.shape)
  print("targets.dtype : ", targets.dtype)
  print("input[0] : ", inputs[0])
  print("target[0] : ", targets[0])
  break

inputs.shape :  (32, 600)
inputs.dtype :  <dtype: 'int64'>
targets.shape :  (32,)
targets.dtype :  <dtype: 'int32'>
input[0] :  tf.Tensor(
[   10   281    11   522  4112     4    10    26     6   129    30     2
  1793   195    20    92    17     1   192   121   108   773     1     2
    20     7   384   283  3105    46    45    23    24     3   842   327
    11    19 10152     5  1329     4   830    47     5     2   647     5
     2     1    24     2   115   121  2408     6    20    52   927   276
  9730     1  5247  1261     1     6    65  1819  1378     1    13   914
     1  2227   364    16  1258  2557    78   382    23     1   158     2
   195     1    30  5851  3661  2214   417    11    20  2202    37   377
     6  1405    10   393   936    17    53    72    12    54    60  3254
     7    12     7    14   100   352     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0    

In [None]:
callbacks = [
    keras.callbacks.ModelCheckpoint("glove_embeddings_seq_model.x", save_best_only=True)
]

model.fit(int_train_ds, validation_data=int_val_ds, epochs=10, callbacks=callbacks)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x7bb3623dba00>

# 셀프 어텐션 이해하기
- 트랜스포머 모델을 살펴보기 전에 셀프 어텐션 메커니즘에 대해 이해해보자

In [None]:
def self_attention(input_sequence):
  output = np.zeros_like(input_sequence)
  for i, pivot_vector in enumerate(input_sequence): # 각 단어 벡터 pivot_vector에 대해서
    scores = np.zeros(shape = (len(input_sequence),))
    for j, vector in enumerate(input_sequence): # 그 단어와 다른 단어의 유사도를 내적으로 계산한다
      scores[j] = np.dot(pivot_vector, vector.T)
    scores /= np.sqrt(input_sequence.shape[1]) # 단어 공간 차원 크기의 제곱근으로 정규화
    scores = np.softmax(scores) # 소프트맥스
    new_pivot_representation = np.zeros_like(pivot_vector)
    for j, vector in enumerate(input_sequence): # 해당 pivot_vector의 새로운 표현으로 각 벡터를 스코어배 해서 선형결합한 벡터를 얻어냄
      new_pivot_representation += vector * scores[j]
    output[i] = new_pivot_representation # 이렇게 얻어낸 새로운 표현으로 각 단어 벡터를 인코딩함
  return output

# 트랜스포머로 seq2seq 학습 해보기


In [2]:
!wget http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip
!unzip -q spa-eng.zip

--2023-09-21 06:57:33--  http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 142.251.175.207, 74.125.24.207, 142.250.4.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.251.175.207|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2638744 (2.5M) [application/zip]
Saving to: ‘spa-eng.zip.1’


2023-09-21 06:57:35 (2.11 MB/s) - ‘spa-eng.zip.1’ saved [2638744/2638744]

replace spa-eng/_about.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: 

In [3]:
text_file = "spa-eng/spa.txt"
with open(text_file) as f:
  lines = f.read().split('\n')[:-1]
text_pairs = []
for line in lines:
  english, spanish = line.split("\t")
  spanish = "[start]" + spanish + "[end]"
  text_pairs.append((english, spanish))

In [4]:
import random
print(random.choice(text_pairs))

('Tom is really a good worker.', '[start]Tomás es realmente un buen trabajador.[end]')


In [5]:
import random

random.shuffle(text_pairs)
num_val_samples = int(0.15 * len(text_pairs))
num_train_samples = len(text_pairs) - 2 * num_val_samples
train_pairs = text_pairs[:num_train_samples]
val_pairs = text_pairs[num_train_samples:num_train_samples + num_val_samples]
test_pairs = text_pairs[num_train_samples + num_val_samples:]

# 영어와 스페인어 텍스트 쌍 벡터화

In [6]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import string
import re

strip_chars = string.punctuation + "¿"
strip_chars = strip_chars.replace("[", "")
strip_chars = strip_chars.replace("]", "")

def custom_standardization(input_string):
  lowercase = tf.strings.lower(input_string)
  return tf.strings.regex_replace(lowercase, f"[{re.escape(strip_chars)}]", "")
vocab_size = 15000
sequence_length = 20

source_vectorization = layers.TextVectorization(
    max_tokens = vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length
)

target_vectorization = layers.TextVectorization(
    max_tokens = vocab_size,
    output_mode="int",
    output_sequence_length = (sequence_length+1),
    standardize = custom_standardization,
)

train_english_texts = [pair[0] for pair in train_pairs]
train_spanish_texts = [pair[1] for pair in train_pairs]
source_vectorization.adapt(train_english_texts)
target_vectorization.adapt(train_spanish_texts)

In [7]:
batch_size = 64

def format_dataset(eng, spa):
  eng = source_vectorization(eng)
  spa = target_vectorization(spa)
  return ({
      "english" : eng,
      "spanish" : spa[:, :-1] # 입력은 [start]어쩌구 저쩌구
  }, spa[:, 1:])  #출력은 저쩌구 저쩌구[end]

def make_dataset(pairs):
  eng_texts, spa_texts = zip(*pairs)
  eng_texts = list(eng_texts)
  spa_texts = list(spa_texts)
  dataset = tf.data.Dataset.from_tensor_slices((eng_texts, spa_texts))
  dataset = dataset.batch(batch_size)
  dataset = dataset.map(format_dataset, num_parallel_calls=4)
  return dataset.shuffle(2048).prefetch(16).cache()

train_ds = make_dataset(train_pairs)
val_ds = make_dataset(val_pairs)

In [8]:
for inputs, targets in train_ds.take(1):
  print(f"inputs['english'].shape : {inputs['english'].shape}")
  print(f"inputs['spanish'].shape : {inputs['spanish'].shape}")
  print(f"targets.shape : {targets.shape}")

inputs['english'].shape : (64, 20)
inputs['spanish'].shape : (64, 20)
targets.shape : (64, 20)


# 트랜스포머 encoding layer

In [9]:
class TransformerEncoder(layers.Layer):
  def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
    super().__init__(**kwargs)
    self.embed_dim = embed_dim
    self.dense_dim = dense_dim
    self.num_heads = num_heads
    self.attention = layers.MultiHeadAttention(
        num_heads=num_heads, key_dim=embed_dim
    )
    self.dense_proj = keras.Sequential(
        [layers.Dense(dense_dim, activation="relu"),
         layers.Dense(embed_dim)]
    )
    self.layernorm_1 = layers.LayerNormalization()
    self.layernorm_2 = layers.LayerNormalization()

  def call(self, inputs, mask=None):
    if mask is not None:
      mask = mask[:, tf.newaxis, :] # num_heads에 대한 차원이 추가되기 때문에 요렇게 해줌
    attention_output = self.attention(
        inputs, inputs, attention_mask=mask # 인코딩 MultiHeadAttention Q, K, V 모두 source sequence
    )
    proj_input = self.layernorm_1(inputs + attention_output)
    proj_output = self.dense_proj(proj_input)
    return self.layernorm_2(proj_input + proj_output)

  def get_config(self):
    config = super().get_config()
    config.update({
        "embed_dim" : self.embed_dim,
        "num_heads" : self.num_heads,
        "dense_dim" : self.dense_dim
    })
    return config
# 사용자 정의층을 지정할 때 위와같이 get_config 함수를 지정해서 직렬화 해 주어야 한다.
# keras.models.load_model 사용시에도 custom_objects={"층 이름" : 층이름} 식으로 사용자 정의 클래스를 명시해야 된다.

In [14]:
class PositionalEmbedding(layers.Layer):
  def __init__(self, sequence_length, input_dim, output_dim, **kwargs):
    super().__init__(**kwargs)
    self.token_embeddings = layers.Embedding(
        input_dim=input_dim, output_dim=output_dim
    )
    self.position_embeddings = layers.Embedding(
        input_dim=sequence_length, output_dim=output_dim
    )
    self.sequence_length = sequence_length
    self.input_dim = input_dim
    self.output_dim = output_dim

  def call(self, inputs):
    length = tf.shape(inputs)[-1]
    positions = tf.range(start=0, limit=length, delta=1)
    embedded_tokens = self.token_embeddings(inputs)
    embedded_positions = self.position_embeddings(positions)
    return embedded_tokens + embedded_positions

  def compute_mask(self, inputs, mask=None):
    return tf.math.not_equal(inputs, 0) # 정수 시퀀스인 inputs 중 0인 값에 대하여 False로 마스킹

  def get_config(self):
    config = super().get_config()
    config.update({
        "output_dim" : self.output_dim,
        "input_dim" : self.input_dim,
        "sequence_length" : self.sequence_length
    })
    return config

In [18]:
class TransformerDecoder(layers.Layer):
  def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
    super().__init__(**kwargs)
    self.embed_dim = embed_dim
    self.dense_dim = dense_dim
    self.num_heads = num_heads
    self.attention_1 = layers.MultiHeadAttention(
        num_heads = num_heads, key_dim = embed_dim
    )
    self.attention_2 = layers.MultiHeadAttention(
        num_heads = num_heads, key_dim = embed_dim
    )
    self.dense_proj = keras.Sequential([
        layers.Dense(dense_dim, activation="relu"),
        layers.Dense(embed_dim)
    ])
    self.layernorm_1 = layers.LayerNormalization()
    self.layernorm_2 = layers.LayerNormalization()
    self.layernorm_3 = layers.LayerNormalization()

  def get_causal_attention_mask(self, inputs):
    input_shape = tf.shape(inputs)
    batch_size, sequence_length = input_shape[0], input_shape[1]
    i = tf.range(sequence_length)[:, tf.newaxis]
    j = tf.range(sequence_length)
    mask = tf.cast(i >= j, dtype="int32")
    mask = tf.reshape(mask, (1, sequence_length, sequence_length))
    mult = tf.concat(
        [tf.expand_dims(batch_size, -1),
         tf.constant([1, 1], dtype="int32")], axis=0
    )
    return tf.tile(mask, mult)

  def call(self, inputs, encoder_outputs, mask=None):
    causal_mask = self.get_causal_attention_mask(inputs)
    if mask is not None:
      padding_mask = tf.cast(
          mask[:, tf.newaxis, :], dtype="int32"
      )
      padding_mask = tf.minimum(padding_mask, causal_mask)

    attention_output_1 = self.attention_1(
        query=inputs,
        value=inputs,
        key=inputs,
        attention_mask = causal_mask
    )
    attention_output_1 = self.layernorm_1(inputs + attention_output_1)
    attention_output_2 = self.attention_2(
        query=inputs,
        value=encoder_outputs,
        key=encoder_outputs,
        attention_mask = padding_mask
    )
    attention_output_2 = self.layernorm_2(attention_output_1 + attention_output_2)
    proj_output = self.dense_proj(attention_output_2)
    return self.layernorm_3(attention_output_2 + proj_output)

  def get_config(self):
    config = super().get_config()
    config.update({
        "embed_dim" : embed_dim,
        "num_heads" : num_heads,
        "dense_dim" : dense_dim
    })
    return config

In [19]:
embed_dim = 256
dense_dim = 2048
num_heads = 8

encoder_inputs = keras.Input(shape=(None,), dtype="int64", name="english")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(encoder_inputs)
encoder_outputs = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)

decoder_inputs = keras.Input(shape=(None,), dtype="int64", name="spanish")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(decoder_inputs)
x = TransformerDecoder(embed_dim, dense_dim, num_heads)(x, encoder_outputs)
x = layers.Dropout(0.5)(x)

decoder_outputs = layers.Dense(vocab_size, activation="softmax")(x)
transformer = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)

In [20]:
callbacks = [
    tf.keras.callbacks.ModelCheckpoint("eng_to_spa_by_first_transformer.x",
                                       save_best_only=True),
    keras.callbacks.ReduceLROnPlateau(factor=0.1, patience=5)
]

transformer.compile(
    optimizer="rmsprop",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"]
)
transformer.fit(train_ds, epochs=20, callbacks=callbacks, validation_data=val_ds)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.src.callbacks.History at 0x7a77c839bc70>

In [26]:
import numpy as np

spa_vocab = target_vectorization.get_vocabulary()
spa_index_lookup = dict(zip(range(len(spa_vocab)), spa_vocab))
max_decoded_sentence_length = 20

def decode_sequence(input_sentence):
  tokenized_input_sentence = source_vectorization([input_sentence])
  decoded_sentence = "[start]"
  for i in range(max_decoded_sentence_length):
    tokenized_target_sentence = target_vectorization(
        [decoded_sentence])[:, :-1]
    predictions = transformer(
        [tokenized_input_sentence, tokenized_target_sentence]
    )
    sampled_token_index = np.argmax(predictions[0, i, :])
    sampled_token = spa_index_lookup[sampled_token_index]
    decoded_sentence += " " + sampled_token
    if sampled_token == "[end]":
      break
  return decoded_sentence


test_eng_texts = [pair[0] for pair in test_pairs]
for _ in range(10):
  input_sentence = random.choice(test_eng_texts)
  print("-")
  print(input_sentence)
  print(decode_sequence(input_sentence))
  print('\n')

-
You're afraid of him.
[start] miedo[end]                   


-
I could kill you.
[start] podría matar[end]                  


-
My grandpa drinks coffee with a group of old guys every morning.
[start] el café con un [UNK] de [UNK] de todas las mañanas[end]         


-
I can't wait for spring to come so we can sit under the cherry trees.
[start] no puedo llegar a la primavera hasta los [UNK] cerca de los árboles[end]       


-
That never happens around here.
[start] aquí no hay de eso[end]               


-
He got in with a shotgun in his hands.
[start] con una [UNK] en la mano[end]              


-
I sometimes lie on the grass.
[start] a veces en la [UNK]               


-
The temperature is above average this winter.
[start] la temperatura está [UNK] en invierno[end]              


-
What rotten luck!
[start] qué suerte[end]                  


-
I couldn't make myself heard in the noisy class.
[start] no pude leer a lo suficientemente [UNK]             


