<h3> Notebook Overview </h3>

- 1) Load pre-processed source and target tensors (train + test)
- 2) Load pre-trained tokenizers from json files
- 3) Load embedding matrices for source and target languages
- 4) Create TensorFlow dataset
- 5) Instantiate model's components -- Encoder, BahdanauAttention, Decoder (from subclasses in my own module "model_components")
- 6) Functions for computing loss and gradients
- 7) Set up checkpoints for training
- 8) Train
- 9) Save trained model

In [2]:
import numpy as np
import json
from keras.preprocessing.text import Tokenizer, tokenizer_from_json
import tensorflow as tf
import time
import seaborn as sns

In [3]:
# # For colab
# from google.colab import drive
# drive.mount('/content/gdrive')
# %cd gdrive/MyDrive/Colab Notebooks/colab_upload

Mounted at /content/gdrive
/content/gdrive/MyDrive/Colab Notebooks/colab_upload


In [4]:
# Import from my own module
from model_components import Encoder, BahdanauAttention, Decoder

<h3> 1) Load source and target tensors (train + test) </h3>

In [5]:
# ## For colab

# file_path = '/content/gdrive/MyDrive/attention_data/tensors/'

# source_train_tensor = np.loadtxt(file_path + 'source_train_tensor.csv', delimiter = ',', dtype = 'int32')
# source_test_tensor = np.loadtxt(file_path + 'source_test_tensor.csv', delimiter = ',', dtype = 'int32')
# target_train_tensor = np.loadtxt(file_path + 'target_train_tensor.csv', delimiter = ',', dtype = 'int32')
# target_test_tensor = np.loadtxt(file_path + 'target_test_tensor.csv', delimiter = ',', dtype = 'int32')

In [None]:
# Load source and target arrays (train + test) from CSV files:
source_train_tensor = np.loadtxt('tensors/source_train_tensor.csv', delimiter = ',', dtype = 'int32')
source_test_tensor = np.loadtxt('tensors/source_test_tensor.csv', delimiter = ',', dtype = 'int32')
target_train_tensor = np.loadtxt('tensors/target_train_tensor.csv', delimiter = ',', dtype = 'int32')
target_test_tensor = np.loadtxt('tensors/target_test_tensor.csv', delimiter = ',', dtype = 'int32')

In [6]:
# The tensors are already padded, so every row has the same length (the number of columns)
max_source_length = source_train_tensor.shape[1]
max_target_length = target_train_tensor.shape[1]

print(max_source_length, max_target_length)

77 103


<h3> 2) Load pre-trained tokenizers from json files </h3>

In [7]:
with open ('tokenizers/source_sentence_tokenizer.json') as f:
    data = json.load(f)
    source_sentence_tokenizer = tokenizer_from_json(data)

with open ('tokenizers/target_sentence_tokenizer.json') as f:
    data = json.load(f)
    target_sentence_tokenizer = tokenizer_from_json(data)

- Create word-to-index and index-to-word mappings for source and target languages

In [8]:
source_word_index = source_sentence_tokenizer.word_index
target_word_index = target_sentence_tokenizer.word_index

source_index_word = source_sentence_tokenizer.index_word
target_index_word = target_sentence_tokenizer.index_word

- Retrieve vocab size and number of tokens for source and target languages

In [9]:
vocab_len_source = len(source_word_index.keys())
vocab_len_target = len(target_word_index.keys())

num_tokens_source = vocab_len_source + 1
num_tokens_target = vocab_len_target + 1
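
The "+ 1" accounts for index 0: the Keras `Tokenizer` numbers words from 1, and the padded tensors use 0 as the padding value. A minimal sketch with a hypothetical toy `word_index` (all words and ids below are made up for illustration; only the `start_` token name comes from this notebook):

```python
# Hypothetical toy mapping, shaped like Tokenizer.word_index (indices start at 1).
toy_word_index = {"start_": 1, "end_": 2, "hello": 3, "world": 4}

vocab_len = len(toy_word_index)   # number of distinct words
num_tokens = vocab_len + 1        # +1 for the padding index 0

# Every id that can occur in a padded tensor must be a valid embedding row.
padded_sentence = [1, 3, 4, 2, 0, 0]
assert max(padded_sentence) < num_tokens
print(vocab_len, num_tokens)
```

Sizing the embedding and output layers with `num_tokens` rather than `vocab_len` keeps the largest word index in range.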

<h3> 3) Load embedding matrices for source and target languages </h3>

In [10]:
# ## For colab

# file_path = '/content/gdrive/MyDrive/attention_data/embeddings/'

# embedding_matrix_source = np.loadtxt(file_path + 'embedding_matrix_source.csv', delimiter = ',', dtype = 'float32')
# embedding_matrix_target = np.loadtxt(file_path + 'embedding_matrix_target.csv', delimiter = ',', dtype = 'float32')


In [None]:
# Load embedding matrices (float dtype, so the embedding values are not truncated to integers)
embedding_matrix_source = np.loadtxt('embeddings/embedding_matrix_source.csv', delimiter = ',', dtype = 'float32')
embedding_matrix_target = np.loadtxt('embeddings/embedding_matrix_target.csv', delimiter = ',', dtype = 'float32')



In [11]:
# Retrieve embedding dimensions for source and target languages
embedding_dim_source = embedding_matrix_source.shape[1]
embedding_dim_target = embedding_matrix_target.shape[1]

print(embedding_dim_source, embedding_dim_target)

96 300
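
The embedding matrices were built in an earlier pre-processing step that is not shown here. As a rough sketch of how such a matrix usually lines up with the tokenizer (an assumption, not the actual pre-processing code): row `i` holds the vector for the word with index `i`, and row 0 stays zero for padding. The words and vectors below are made up.

```python
import numpy as np

# Hypothetical inputs: a tiny word_index and made-up 3-dimensional "pretrained" vectors.
toy_word_index = {"start_": 1, "hello": 2, "world": 3}
toy_vectors = {"hello": [0.1, 0.2, 0.3], "world": [0.4, 0.5, 0.6]}
embedding_dim = 3

# One row per token id, including id 0 (padding), which stays all zeros.
embedding_matrix = np.zeros((len(toy_word_index) + 1, embedding_dim), dtype="float32")
for word, idx in toy_word_index.items():
    if word in toy_vectors:              # words without a vector keep a zero row
        embedding_matrix[idx] = toy_vectors[word]

print(embedding_matrix.shape)
```

Under that layout, the first dimension of each matrix should equal the corresponding `num_tokens_*` value and the second dimension is the embedding size printed above.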


<h3> 4) Create TensorFlow dataset </h3>

In [12]:
BATCH_SIZE = 32
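# Create TensorFlow dataset and shuffle
# (a shuffle buffer as large as the dataset gives a full shuffle; a buffer of only BATCH_SIZE would barely reorder it)
dataset = tf.data.Dataset.from_tensor_slices((source_train_tensor, target_train_tensor)).shuffle(len(source_train_tensor))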
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)

# Check dataset
source_batch, target_batch = next(iter(dataset))
print(source_batch.shape, target_batch.shape)
print(source_batch[1])

(32, 77) (32, 103)
tf.Tensor(
[   1  230  221  193    5 1340   10   22  148    2    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0], shape=(77,), dtype=int32)
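
What shuffling and batching with `drop_remainder=True` do to paired arrays can be sketched in plain NumPy. This is an illustration on made-up data, not how `tf.data` works internally: `tf.data` shuffles through a buffer, which only amounts to a full permutation when the buffer is at least as large as the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
source = np.arange(10).reshape(10, 1)       # 10 hypothetical "sentences"
target = source * 10                        # paired targets
batch_size = 4

perm = rng.permutation(len(source))         # one permutation keeps the pairs aligned
source, target = source[perm], target[perm]

n_batches = len(source) // batch_size       # drop_remainder=True discards the last 2 rows
batches = [(source[i * batch_size:(i + 1) * batch_size],
            target[i * batch_size:(i + 1) * batch_size]) for i in range(n_batches)]

print(n_batches, batches[0][0].shape)
```

Dropping the remainder keeps every batch at exactly `batch_size` rows, which matters here because the training step builds its first decoder input with a fixed `BATCH_SIZE`.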


<h3> 5) Instantiate NMT model's components </h3>

- Define all arguments for the model components

In [13]:
BUFFER_SIZE = len(source_train_tensor)
steps_per_epoch = BUFFER_SIZE // BATCH_SIZE

# BATCH_SIZE, num_tokens_source/target and embedding_dim_source/target are reused from the cells above

units = 256
attention_layer_units = 100

- Create Encoder, Attention, Decoder objects
- The code for the Encoder, BahdanauAttention and Decoder classes is in "model_components.py"

In [14]:
encoder = Encoder(num_tokens_source, embedding_dim_source, units, BATCH_SIZE, embedding_matrix_source)
attention_layer= BahdanauAttention(attention_layer_units)
decoder = Decoder(num_tokens_target, embedding_dim_target, 2*units, BATCH_SIZE, embedding_matrix_target, attention_layer_units)

# Note: We are passing in "2 * units" as the "decoder_units" argument for the Decoder, since the encoder uses bi-directional LSTMs,
# and we are feeding the final hidden and cell states of the Encoder as the initial hidden and cell states of the Decoder.


# Check dimensions are correct for Encoder, BahdanauAttention layer, and Decoder.
enc_sequential, enc_final_h, enc_final_c = encoder(source_batch)
print (f'Encoder sequential: {enc_sequential.shape}')
print (f'Encoder final state_h: {enc_final_h.shape}')
print (f'Encoder final state_c: {enc_final_c.shape}')

attention_result, attention_weights = attention_layer(enc_final_h, enc_sequential)
print(f"Context vector: (batch size, units) {attention_result.shape}")
print(f"Attention weights: (batch_size, sequence_length, 1) {attention_weights.shape}")

sample_decoder_output, _, _, _ = decoder(tf.random.uniform((BATCH_SIZE,1)), enc_final_h, enc_final_c, enc_sequential)
print (f'Decoder output shape: (batch_size, vocab size) {sample_decoder_output.shape}')

Encoder sequential: (32, 77, 512)
Encoder final state_h: (32, 512)
Encoder final state_c: (32, 512)
Context vector: (batch size, units) (32, 512)
Attention weights: (batch_size, sequence_length, 1) (32, 77, 1)
Decoder output shape: (batch_size, vocab size) (32, 17410)
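
The shapes printed above are what additive (Bahdanau) attention produces. The sketch below reproduces the computation in NumPy with small made-up sizes and random weights. It assumes the usual additive formulation; the actual layer lives in "model_components.py" and may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(0)
m, Tx, enc_dim, att_units = 2, 5, 8, 4          # small hypothetical sizes

query = rng.normal(size=(m, enc_dim))           # decoder hidden state
values = rng.normal(size=(m, Tx, enc_dim))      # encoder outputs per time-step

# Hypothetical weights standing in for the layer's Dense kernels.
W1 = rng.normal(size=(enc_dim, att_units))
W2 = rng.normal(size=(enc_dim, att_units))
v = rng.normal(size=(att_units, 1))

# score = v^T tanh(W1 h_enc + W2 h_dec), one score per source time-step
score = np.tanh(values @ W1 + (query @ W2)[:, None, :]) @ v     # (m, Tx, 1)

# softmax over the time axis
weights = np.exp(score - score.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)                   # (m, Tx, 1)

# context vector: attention-weighted sum of the encoder outputs
context = (weights * values).sum(axis=1)                        # (m, enc_dim)
print(weights.shape, context.shape)
```

With `m = 32`, `Tx = 77` and `enc_dim = 512`, the same computation gives the `(32, 77, 1)` attention weights and `(32, 512)` context vector shown in the output.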


- Create optimizer

In [15]:
optimizer = tf.keras.optimizers.Adam()

<h3> 6) Loss and Gradients </h3>

- Define loss_function() (computes the masked loss for one decoder time-step) and get_loss_and_grads() (computes the batch loss and its gradient with respect to the variables)

In [16]:
def loss_function(real, pred):
    """
    Compute the masked loss for one decoder time-step, averaged over the batch

    Arguments
    real -- (m,) ground-truth token ids for this time-step (0 = padding)
    pred -- (m, vocab_size) decoder outputs, treated as logits

    Returns
    scalar loss, averaged over all m positions (padded positions contribute 0)
    """

    # create mask, such that mask value = 1 when "real" value is non-zero
    mask = tf.cast(tf.math.not_equal(real, 0), tf.float32)

    # compute sparse categorical cross-entropy and zero it out where "real" is padding
    loss_ = tf.nn.sparse_softmax_cross_entropy_with_logits(labels = real, logits = pred) * mask
    return tf.reduce_mean(loss_)
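
To make the masking concrete, here is a NumPy sketch of the same computation on a made-up batch of three positions, one of which is padding. It also computes an alternative that averages over real tokens only. That alternative is not what the notebook does; it is included only for comparison.

```python
import numpy as np

real = np.array([3, 0, 1])                    # 0 marks a padded position
logits = np.array([[0.0, 0.0, 0.0, 2.0],
                   [1.0, 1.0, 1.0, 1.0],
                   [0.0, 3.0, 0.0, 0.0]])

# log-softmax, then pick the log-probability of the true class
shifted = logits - logits.max(axis=1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
per_example = -log_probs[np.arange(len(real)), real]

mask = (real != 0).astype("float32")
masked = per_example * mask

mean_over_batch = masked.mean()               # what loss_function returns
mean_over_tokens = masked.sum() / mask.sum()  # alternative: average over real tokens only
print(mean_over_batch, mean_over_tokens)
```

Because the mean runs over the whole batch, time-steps where most sentences have already ended contribute little to the total loss.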

In [17]:
# use tf.function decorator to activate graph mode instead of eager execution
# this improves performance considerably
@tf.function
def get_loss_and_grads(inp, targ):
    """
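    Compute the loss for one batch and its gradients with respect to the model variables
    Loop over Ty with teacher forcing (i.e. the ground-truth token at time t-1 is the decoder
    input at time t, and the prediction is scored against the ground-truth token at time t)

    Arguments
    inp -- (m, Tx)
    targ -- (m, Ty)

    Returns
    loss -- scalar, summed over the Ty - 1 decoding steps
    gradients -- list of gradients, in the same order as encoder.variables + decoder.variables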
    """

    loss = 0
    with tf.GradientTape() as tape:
        # retrieve Encoder outputs
        enc_sequential, enc_final_h, enc_final_c = encoder(inp)

        # Initialise the Decoder (hidden + cell) states at time-step 0 with the final states of Encoder
        dec_h = enc_final_h
        dec_c = enc_final_c

        # Initialise Decoder inputs at time-step 0 with "start_" token
        dec_input = tf.expand_dims([target_sentence_tokenizer.word_index['start_']] * BATCH_SIZE, 1)    # (m, 1)

        # Loop over Ty, starting from time-step = 1, to predict output iteratively
        for t in range(1, targ.shape[1]):
            # retrieve "predictions", "dec_h", "dec_c" for current time-step
            predictions, dec_h, dec_c, _ = decoder(dec_input, dec_h, dec_c, enc_sequential)     # predictions = (m, vocab_size)
            
            # Teacher forcing: compute loss by comparing "predictions" with ground-truth values at time-step t
            loss += loss_function(targ[:, t], predictions)
            
            # Teacher forcing: update "dec_input" with ground-truth values at time-step t
            dec_input = tf.expand_dims(targ[:, t], 1)
        
    # "loss" now holds the loss for the current batch ("inp", "targ"), summed over time-steps
    # Compute gradients (d_loss/d_variables) once the forward pass has been recorded
    variables = encoder.variables + decoder.variables
    gradients = tape.gradient(loss, variables)
    return loss, gradients
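
The teacher-forcing loop can be traced on a single made-up target sequence. The token ids are hypothetical, and the sketch assumes each target sequence begins with the `start_` token, so that `targ[0]` equals the initial decoder input.

```python
# Hypothetical token ids: 1 = 'start_', 2 = end token, 0 = padding.
targ = [1, 7, 9, 2, 0]

dec_input = targ[0]                 # the loop starts from the 'start_' token
pairs = []
for t in range(1, len(targ)):
    pairs.append((dec_input, targ[t]))   # (what the decoder is fed, what it is scored against)
    dec_input = targ[t]                  # teacher forcing: feed the ground truth, not the prediction

print(pairs)
```

The last pair has a padding target (0), which is exactly the case the mask in `loss_function` zeroes out.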


<h3> 7) Set up checkpoints for training </h3>

In [18]:
checkpoint_path = './checkpoints'

ckpt = tf.train.Checkpoint(optimizer = optimizer,
                           encoder = encoder, decoder = decoder)

ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep = 3)
if ckpt_manager.latest_checkpoint:
    ckpt.restore(ckpt_manager.latest_checkpoint)
    print('Latest checkpoint restored!')
else:
    print('Initialising from scratch')

Initialising from scratch
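
`max_to_keep = 3` means the manager keeps only the three most recent checkpoints and deletes older ones. A minimal sketch of that rolling behaviour, using a plain `deque` as a stand-in (this illustrates the retention rule only; it is not how TensorFlow implements it):

```python
from collections import deque

max_to_keep = 3
kept = deque(maxlen=max_to_keep)    # oldest entry is dropped once the limit is reached

for save_number in range(1, 5):     # four hypothetical saves
    kept.append(f"./checkpoints/ckpt-{save_number}")

print(list(kept))
```

After a fourth save, the first checkpoint is no longer available to restore.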


<h3> 8) Train </h3>

In [21]:
EPOCHS = 10

# for every epoch
for epoch in range(EPOCHS):
    start = time.time()
    total_loss = 0
    
    # for every batch
    for (batch, (inp, targ)) in enumerate(dataset):
        # compute loss and gradients for batch
        loss, gradients = get_loss_and_grads(inp, targ)
        # update total_loss
        total_loss += loss
        
        # use optimizer and gradients to update variables
        variables = encoder.variables + decoder.variables
        optimizer.apply_gradients(zip(gradients, variables))

        if batch % 2000 == 0:
            print(f'Epoch {epoch + 1} Batch {batch} Loss {loss.numpy():.4f}')
    
    print(f'Epoch {epoch + 1} Loss {total_loss / steps_per_epoch:.4f}')
    print(f'Time taken for 1 epoch {time.time() - start} sec\n')
    # save a checkpoint after every epoch
    ckpt_save_path = ckpt_manager.save()
    print(f'Saving checkpoint after epoch {epoch + 1} at {ckpt_save_path}')
        

Epoch 1 Batch 0 Loss 70.1638
Epoch 1 Batch 2000 Loss 20.7534
Epoch 1 Batch 4000 Loss 19.3555
Epoch 1 Batch 6000 Loss 18.7942
Epoch 1 Loss 21.8595
Time taken for 1 epoch 2006.8411746025085 sec

Saving checkpoint after epoch 1 at ./checkpoints/ckpt-1
Epoch 2 Batch 0 Loss 11.7460
Epoch 2 Batch 2000 Loss 8.5506
Epoch 2 Batch 4000 Loss 10.1264
Epoch 2 Batch 6000 Loss 11.7742
Epoch 2 Loss 11.8022
Time taken for 1 epoch 1920.6357457637787 sec

Saving checkpoint after epoch 2 at ./checkpoints/ckpt-2
Epoch 3 Batch 0 Loss 7.5941
Epoch 3 Batch 2000 Loss 6.3093
Epoch 3 Batch 4000 Loss 8.3682
Epoch 3 Batch 6000 Loss 9.4711
Epoch 3 Loss 8.4040
Time taken for 1 epoch 1922.575403213501 sec

Saving checkpoint after epoch 3 at ./checkpoints/ckpt-3
Epoch 4 Batch 0 Loss 5.2665
Epoch 4 Batch 2000 Loss 5.8861
Epoch 4 Batch 4000 Loss 5.1178
Epoch 4 Batch 6000 Loss 6.9959
Epoch 4 Loss 6.5665
Time taken for 1 epoch 1921.3809475898743 sec

Saving checkpoint after epoch 4 at ./checkpoints/ckpt-4
Epoch 5 Batch 0 
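
The logged losses look large because each one is the sum of per-time-step batch means, not a per-token average. That sum equals the per-sentence total of token losses, averaged over the batch. As a rough sanity check, the first logged value can be compared with the loss of a uniform guess over the vocabulary. This assumes the untrained decoder predicts close to uniformly, which is only an approximation.

```python
import math

vocab_size = 17410                  # decoder output width printed earlier
first_batch_loss = 70.1638          # "Epoch 1 Batch 0" in the log above

# Cross-entropy of a uniform guess over the vocabulary
uniform_loss_per_token = math.log(vocab_size)

# Rough estimate only: assumes the untrained decoder is close to uniform
approx_tokens_per_sentence = first_batch_loss / uniform_loss_per_token
print(round(uniform_loss_per_token, 2), round(approx_tokens_per_sentence, 1))
```

Under that assumption, the ratio is an estimate of the average number of non-padding target tokens per sentence in the first batch.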

<h3> 9) Save model </h3>

In [22]:
# Save encoder and decoder weights separately in TensorFlow checkpoint format
file_path = 'saved_models/model'
encoder.save_weights(file_path + '/encoder', save_format='tf')
decoder.save_weights(file_path + '/decoder', save_format='tf')