<h3> Summary of notebook: </h3>

- The notebook "translator_transformer_tidy_mode_v1.ipynb" has all the pieces.
- Here I separate the code into a script for training and a script for translating
- Use this script ("train_v1.ipynb") to train the model; save the source arrays, target arrays, dictionaries, model weights
- Use the next script ("translate_v1.ipynb") to load the model and predict
- *** In "translate_v1.ipynb", you need to instantiate the Transformer with the same parameters from the "train_v1.ipynb" script ***
- Then you load the model weights and use the functions to predict new sentences + plot attention weights

- In google colab, I experimented with some sets of (hyper) parameters, using 10% of the data (see the file "Training_Observations.docx")
- So far, the most promising are: embedding_dim = 256, num_heads = 8, num_layers = 4, fully_connected_dim = 512
- Increasing any of the parameters meant the accuracy reached a peak and then decreased to a useless value. 
- You could investigate further why this happened; with "Params 3", I also tried training with 50% of the data, but the accuracy still showed the strange behaviour.

In [1]:
import pandas as pd
import numpy as np
import string
from string import digits
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import tensorflow as tf
from tensorflow.keras.layers import Bidirectional, Concatenate, LSTM, Embedding, Dense, MultiHeadAttention, LayerNormalization, Dropout
from keras.preprocessing.text import Tokenizer
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
from tensorflow.keras.initializers import Constant
import matplotlib.ticker as ticker
from sklearn.model_selection import train_test_split
import re
import os
import io
import time

In [2]:
# from google.colab import drive
# drive.mount('/content/gdrive')
# %cd gdrive/MyDrive/ColabNotebooks/train_translate

# df_en_de = pd.read_table('/content/gdrive/MyDrive/deu-eng/deu.txt', names=['eng', 'deu', 'attr'])

In [3]:
from model_components import preprocess_sentence, get_angles, positional_encoding, create_padding_mask, create_look_ahead_mask, FullyConnected, EncoderLayer, Encoder, DecoderLayer, Decoder, Transformer, CustomSchedule

In [4]:
df_en_de = pd.read_table('deu-eng/deu.txt', names=['eng', 'deu', 'attr'])
df_en_de = df_en_de.drop('attr',axis = 1).rename(columns = {'eng':'english', 'deu':'german'})

In [88]:
# pre-process sentences using helper function

pairs = df_en_de
pairs = pairs.sample(frac = 0.1)
pairs['english'] = pairs['english'].apply(preprocess_sentence)
pairs['german'] = pairs['german'].apply(preprocess_sentence)

In [89]:
len(pairs)

252

In [90]:
# define source and target

source = pairs['german']
target = pairs ['english']

In [91]:
# create tokenizer & tensor for source and target
source_sentence_tokenizer= Tokenizer(filters='')
source_sentence_tokenizer.fit_on_texts(source)
source_tensor = source_sentence_tokenizer.texts_to_sequences(source)
source_tensor= tf.keras.preprocessing.sequence.pad_sequences(source_tensor, padding='post' )

target_sentence_tokenizer= Tokenizer(filters='')
target_sentence_tokenizer.fit_on_texts(target)
target_tensor = target_sentence_tokenizer.texts_to_sequences(target)
target_tensor= tf.keras.preprocessing.sequence.pad_sequences(target_tensor, padding='post' )

In [92]:
# Create word to index and index to word mappings for source and target

source_word_index = source_sentence_tokenizer.word_index
target_word_index = target_sentence_tokenizer.word_index

source_index_word = source_sentence_tokenizer.index_word
target_index_word = target_sentence_tokenizer.index_word

<h3> Convert Dictionary into DataFrame + Convert DataFrame into Dictionary </h3>

In [93]:
# Convert dictionary into dataframe
# This will be used in the training environment. 
# Once you fit the tokenizer to the training set and create a word_index dictionary,
# you save the dictionary as a csv file. 
def df_word_index(dictionary):
    df = pd.DataFrame.from_dict(dictionary, orient = 'index', columns= ['index']).reset_index()
    df = df.rename(columns = {'level_0':'word'})
    return df

def df_to_dict (df):
    dict_word_index = {row['word']:row['index'] for index, row in df.iterrows()}
    dict_index_word = {row['index']: row['word'] for index, row in df.iterrows()}
    return dict_word_index, dict_index_word

In [94]:
# convert dictionary into dataframes
df_source_word_index = df_word_index(source_word_index)
df_target_word_index = df_word_index(target_word_index)

# save dataframes as csv files; they will be loaded in the "translate_v1.ipynb" script
df_source_word_index.to_csv('df_source_word_index.csv', index = False)
df_target_word_index.to_csv('df_target_word_index.csv', index = False)

In [95]:
vocab_len_source = len(source_word_index.keys())
vocab_len_target = len(target_word_index.keys())
vocab_len_source, vocab_len_target

(682, 619)

In [96]:
num_tokens_source = vocab_len_source + 1
num_tokens_target = vocab_len_target + 1

In [97]:
source_train_tensor, source_test_tensor, target_train_tensor, target_test_tensor = train_test_split(
                                                                source_tensor, target_tensor,test_size=0.2
                                                                )

In [98]:
# save numpy array as csv file:
np.savetxt('source_train_tensor.csv', source_train_tensor, delimiter = ',')
np.savetxt('source_test_tensor.csv', source_test_tensor, delimiter = ',')
np.savetxt('target_train_tensor.csv', target_train_tensor, delimiter = ',')
np.savetxt('target_test_tensor.csv', target_test_tensor, delimiter = ',')


"""
For translating mode
"""
# # load numpy array from csv file:
# load_ar = np.loadtxt('source_train_tensor.csv', delimiter = ',')

'\nFor translating mode\n'

In [99]:
max_source_length= max(len(t) for t in source_tensor)
max_target_length= max(len(t) for t in  target_tensor)


In [100]:
max_source_length, max_target_length

(20, 19)

In [101]:
BATCH_SIZE = 32
#Create training dataset and shuffle
dataset_train = tf.data.Dataset.from_tensor_slices((source_train_tensor, target_train_tensor)).shuffle(BATCH_SIZE)
# divide into batches
dataset_train = dataset_train.batch(BATCH_SIZE, drop_remainder=True)

#Create test dataset
dataset_test = tf.data.Dataset.from_tensor_slices((source_test_tensor, target_test_tensor)).shuffle(BATCH_SIZE)
dataset_test = dataset_test.batch(BATCH_SIZE, drop_remainder=True)


In [102]:
source_batch_train, target_batch_train =next(iter(dataset_train))
print(source_batch_train.shape, target_batch_train.shape)



(32, 20) (32, 19)


<h3> Parameters 6 </h3>

In [103]:
# Transformer arguments: 
# num_layers, embedding_dim, num_heads, fully_connected_dim, input_vocab_size, 
# target_vocab_size, max_positional_encoding_input,
# max_positional_encoding_target, dropout_rate=0.1, layernorm_eps=1e-6

num_layers = 4
embedding_dim = 128
num_heads = 5
fully_connected_dim = 512
input_vocab_size = num_tokens_source
target_vocab_size = num_tokens_target
max_positional_encoding_input = max_source_length
max_positional_encoding_target = max_target_length

<h3> Modify parameter number </h3>

In [104]:
# params_3 = f'\n params_3: num_layers: {num_layers}, embedding_dim: {embedding_dim}, num_heads: {num_heads}, \
# fully_connected_dim: {fully_connected_dim}, input_vocab_size: {input_vocab_size}, target_vocab_size: {target_vocab_size}, \
# max_positional_encoding_input: {max_positional_encoding_input}, max_positional_encoding_target: {max_positional_encoding_target}'

In [105]:
# Transformer arguments: 
# num_layers, embedding_dim, num_heads, fully_connected_dim, input_vocab_size, 
# target_vocab_size, max_positional_encoding_input,
# max_positional_encoding_target, dropout_rate=0.1, layernorm_eps=1e-6

transformer = Transformer(
    num_layers=num_layers,
    embedding_dim=embedding_dim,
    num_heads=num_heads,
    fully_connected_dim=fully_connected_dim,
    input_vocab_size=input_vocab_size,
    target_vocab_size=target_vocab_size,
    max_positional_encoding_input = max_positional_encoding_input,
    max_positional_encoding_target = max_positional_encoding_target
    )

(1, 20, 32)
(1, 19, 32)


- Create optimizer
- Use customised learning rate as defined in 'Attention Is All You Need' paper
- The learning rate increases linearly until "warmup_steps" training steps, then decays asymptotically
- Inputs: d_model, warmup_steps (default = 4000)

In [106]:
learning_rate = CustomSchedule(embedding_dim, warmup_steps = 2500)

optimizer = tf.keras.optimizers.Adam(learning_rate, beta_1=0.9, beta_2=0.98,
                                     epsilon=1e-9)

In [107]:
# define loss object
# from_logits = False, because we apply softmax to final Dense layer of Transformer
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=False, reduction='none')


In [108]:
def loss_function(real, pred):
                                                            # real = (m, Ty)
                                                            # pred = (m, Ty, num_tokens_target)

  mask = tf.math.logical_not(tf.math.equal(real, 0))        # want to select only non-zero values
                                                            # mask = (m, Ty), and is "True" for non-zero values

  loss_ = loss_object(real, pred)                           # compute loss for each time-step
                                                            # loss = (m, Ty)

  mask = tf.cast(mask, dtype=loss_.dtype)                   
  loss_ *= mask                                             # only count loss from non-zero values

  return tf.reduce_sum(loss_)/tf.reduce_sum(mask)           # divide sum(loss) by number of non-zero values

def accuracy_function(real, pred):                          # pred = (m, Ty, num_tokens_target)
  
  accuracies = tf.equal(real, tf.cast(tf.argmax(pred, axis=2), tf.int32))      # accuracies = (m, Ty) -- binary values

  mask = tf.math.logical_not(tf.math.equal(real, 0))        # mask = (m, Ty) -- boolean values
  accuracies = tf.math.logical_and(mask, accuracies)        # suppress values where real value is 0

  accuracies = tf.cast(accuracies, dtype=tf.float32)
  mask = tf.cast(mask, dtype=tf.float32)
  return tf.reduce_sum(accuracies)/tf.reduce_sum(mask)      # divide sum of 1s in "accuracies" by sum of 1s in "mask"

In [109]:
train_loss = tf.keras.metrics.Mean(name='train_loss')
train_accuracy = tf.keras.metrics.Mean(name='train_accuracy')

test_loss = tf.keras.metrics.Mean(name = 'test_loss')
test_accuracy = tf.keras.metrics.Mean(name = 'test_accuracy')

In [110]:
@tf.function
def train_step(inp, tar):
                            # inp = (m, Tx)
                            # tar = (m, Ty)


  tar_inp = tar[:, :-1]     # "start_" to last word
  tar_real = tar[:, 1:]     # first word to "_end"

  with tf.GradientTape() as tape:
    predictions, _ = transformer(inputs = (inp, tar_inp),
                                 training = True)
    loss = loss_function(tar_real, predictions)

  gradients = tape.gradient(loss, transformer.trainable_variables)
  optimizer.apply_gradients(zip(gradients, transformer.trainable_variables))
  acc = accuracy_function(tar_real, predictions)

  # store cumulative loss and acc in train_loss and train_accuracy
  train_loss(loss)
  train_accuracy(acc)

In [111]:
def compute_test_loss_acc(inp, tar):
    # inp = (m, Tx)
    # tar = (m, Ty)

    tar_inp = tar[:, :-1]
    tar_real = tar[:, 1:]

    predictions, _ = transformer (inputs = (inp, tar_inp), training = False)
    test_loss = loss_function(tar_real, predictions)
    test_acc = accuracy_function(tar_real, predictions)

    return test_loss, test_acc

In [112]:
# checkpoint_path = './checkpoints/train'

# ckpt = tf.train.Checkpoint(optimizer=optimizer,
#                                  transformer=transformer)

# ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep = 3)
# if ckpt_manager.latest_checkpoint:
#     ckpt.restore(ckpt_manager.latest_checkpoint)
#     print('Latest checkpoint restored!')

In [113]:
EPOCHS = 10
# train_loss_dict = {}
# train_acc_dict = {}
# test_loss_dict = {}
# test_acc_dict = {}

train_loss_list = []
train_acc_list = []
test_loss_list = []
test_acc_list = []


for epoch in range(EPOCHS):
  start = time.time()

  # reset tf Mean objects
  train_loss.reset_states()
  train_accuracy.reset_states()
  test_loss.reset_states()
  train_accuracy.reset_states()

  # iterate over every batch (= (inp, tar) tuple) in training dataset
  for (batch, (inp, tar)) in enumerate(dataset_train):
    train_step(inp, tar)

    if batch % 50 == 0:
      print(f'Epoch {epoch + 1} Batch {batch} -- Train_Loss: {train_loss.result():.4f} Train_Accuracy: {train_accuracy.result():.4f}')

  #if (epoch+1) % 1 == 0:
    #ckpt_save_path = ckpt_manager.save()
    #print(f'Saving checkpoint for epoch {epoch + 1} at {ckpt_save_path}')
  
  # after one epoch of training, compute test loss and test acc
  for (batch, (inp, tar)) in enumerate(dataset_test):
    test_loss_batch, test_accuracy_batch = compute_test_loss_acc(inp, tar)
    # Update tf Mean objects
    test_loss(test_loss_batch)
    test_accuracy(test_accuracy_batch)
  


  print(f'Summary -- Epoch {epoch + 1} Train_Loss: {train_loss.result():.4f} Train_Accuracy: {train_accuracy.result():.4f} \
    Test_Loss: {test_loss.result():.4f} Test_Accuracy: {test_accuracy.result():.4f}')
  
  train_loss_list.append (train_loss.result().numpy())
  train_acc_list.append(train_accuracy.result().numpy())

  test_loss_list.append(test_loss.result().numpy())
  test_acc_list.append(test_accuracy.result().numpy())

  print(f'Time taken for 1 epoch: {time.time() - start:.2f} secs\n')

Epoch 1 Batch 0 -- Train_Loss: 6.4782 Train_Accuracy: 0.0041
Summary -- Epoch 1 Train_Loss: 6.4721 Train_Accuracy: 0.0014     Test_Loss: 6.4957 Test_Accuracy: 0.0000
Time taken for 1 epoch: 4.64 secs

Epoch 2 Batch 0 -- Train_Loss: 6.4691 Train_Accuracy: 0.0043
Summary -- Epoch 2 Train_Loss: 6.4643 Train_Accuracy: 0.0021     Test_Loss: 6.4419 Test_Accuracy: 0.0000
Time taken for 1 epoch: 0.39 secs

Epoch 3 Batch 0 -- Train_Loss: 6.4627 Train_Accuracy: 0.0043
Summary -- Epoch 3 Train_Loss: 6.4345 Train_Accuracy: 0.0028     Test_Loss: 6.3969 Test_Accuracy: 0.0000
Time taken for 1 epoch: 0.38 secs

Epoch 4 Batch 0 -- Train_Loss: 6.3918 Train_Accuracy: 0.0041
Summary -- Epoch 4 Train_Loss: 6.3895 Train_Accuracy: 0.0027     Test_Loss: 6.3710 Test_Accuracy: 0.0000
Time taken for 1 epoch: 0.38 secs

Epoch 5 Batch 0 -- Train_Loss: 6.3708 Train_Accuracy: 0.0086
Summary -- Epoch 5 Train_Loss: 6.3427 Train_Accuracy: 0.0108     Test_Loss: 6.3037 Test_Accuracy: 0.0131
Time taken for 1 epoch: 0.39 s

In [116]:
all_metrics = zip(train_loss_list, train_acc_list, test_loss_list, test_acc_list)
df_metrics = pd.DataFrame(all_metrics, columns = ['train_loss', 'train_acc', 'test_loss', 'test_acc']).reset_index().rename(columns = {'index':'epoch'})
df_metrics['epoch'] = df_metrics['epoch'].apply(lambda x: x+1)
df_metrics = df_metrics.apply(lambda x: round(x, 3))
df_metrics

Unnamed: 0,epoch,train_loss,train_acc,test_loss,test_acc
0,1,6.472,0.001,6.496,0.0
1,2,6.464,0.002,6.442,0.0
2,3,6.434,0.003,6.397,0.0
3,4,6.39,0.003,6.371,0.0
4,5,6.343,0.011,6.304,0.013
5,6,6.292,0.06,6.277,0.03
6,7,6.25,0.107,6.245,0.048
7,8,6.208,0.135,6.193,0.061
8,9,6.189,0.138,6.222,0.071
9,10,6.159,0.137,6.182,0.079


In [60]:
df_metrics.to_csv('df_metrics.csv', index = False)

In [61]:
# with open("params.txt", "a") as text_file:
#     text_file.write(params_3)
#     #text_file.write('params_3 -- time taken for 1 epoch: 54 secs')

In [62]:
file_path = 'saved_models/model'
transformer.save_weights(file_path,save_format='tf')

# # Recreate the exact same model purely from the file
# new_model = keras.models.load_model('path_to_my_model')