## **0. Preliminary Settings**

Mount *Google Drive* so that partial results and checkpoints can be stored there

In [None]:
from google.colab import drive
drive.mount('/content/drive')

checkpoint_path = '/content/drive/My Drive/Colab Notebooks/Models/'
results_path = '/content/drive/My Drive/Colab Notebooks/Results/'
model_name = '4. Verse-Space Transformer/1. Word-Level/'

First of all, we need to clone the repository to get access to the code and use utility functions inside the notebook

In [None]:
!git clone https://github.com/mazzio97/DeepComedy.git

project_path = 'DeepComedy/'

This folder is then added to the system path so that the modules can be used inside the notebook

In [None]:
import sys

sys.path.append(project_path + 'src')

Finally, the *Divine Comedy* is loaded and stored in a variable

In [None]:
with open(project_path + 'res/divine_comedy.txt', 'r', encoding='ISO-8859-1') as f:
  divine_comedy = f.read()

print(divine_comedy[:231])
print('\n\n[...]\n\n')
print(divine_comedy[-266:])

We also set the seeds of Python, NumPy/Keras and TensorFlow to make the results as reproducible as possible

> The results may still differ slightly, due to other randomized routines called during the execution and to the non-determinism introduced by parallel computing

In [None]:
import random
import numpy as np
import tensorflow as tf

random.seed(0)
np.random.seed(0)
tf.random.set_seed(0)
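
As a small illustration of why the seeds are fixed, the sketch below uses only Python's `random` module (the helper name `draw` is hypothetical) to show that re-seeding a generator reproduces the same pseudo-random sequence

```python
import random

def draw(seed, n=5):
    """Return n pseudo-random integers drawn from a freshly seeded generator."""
    rng = random.Random(seed)  # independent generator, the global state is left untouched
    return [rng.randint(0, 99) for _ in range(n)]

first = draw(0)
second = draw(0)

print(first == second)  # True: same seed, same sequence
```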

## **1. Data Processing**

### ***1.1 Text Mark***

We use the provided function `mark` to map the original *Divine Comedy* into a marked version containing:

* a marker both at the beginning and at the end of each *cantica*

* a marker both at the beginning and at the end of each *canto*

* a marker between each pair of consecutive *tercets*

In [None]:
from text_processing.markers import mark

divine_comedy_marked = mark(divine_comedy)
print(divine_comedy_marked[:260])
print('\n\n[...]\n\n')
print(divine_comedy_marked[-319:])

### ***1.2 Extracting the Verses***

We want to build a dataset in which the input sequence is a piece of the *Divine Comedy* made of *n* verses, going from verse *i* to verse *i+n-1*, and the target sequence is the same piece shifted by one verse, going from verse *i+1* to verse *i+n*. As a first step, we therefore split the marked text to get a list of verses

In [None]:
divine_comedy_split = divine_comedy_marked.split('\n')

for i, verse in enumerate(divine_comedy_split[:20]):
  print(f'{i+1:02} --> {verse}')

In [None]:
print(len(divine_comedy_split))

### ***1.3 Building the Dataset***

Since the rhyming scheme of the *Divine Comedy* is known, we need at least the last *3* verses (*3 actual verses, or 2 actual verses + 1 marker verse indicating the end of the tercet*) to predict a correct next one, so we set `seq_length = 3`

> Unlike single-token models, here we have fewer samples and greater variability (indeed, the dataset is less dense), thus we can choose a `step_length` of *1* and a larger `train_val_split`

In [None]:
seq_length = 3
step_length = 1
batch_size = 64
train_val_split = 0.7

tot_samples = int((len(divine_comedy_split) - seq_length) / step_length)
train_samples = round(tot_samples * train_val_split)

print('Train Samples:', train_samples)
print('  Val Samples:', tot_samples - train_samples)
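
The count above can be checked by hand on a toy list: with a step of *1*, a list of *N* verses yields *N - seq_length* full windows of `seq_length + 1` verses. The following is a minimal sketch, in which `count_windows` and the toy value of `n_verses` are illustrative and not part of the repository

```python
def count_windows(n_items, window_size, step=1):
    """Count the full windows of `window_size` items obtained by sliding with `step`."""
    if n_items < window_size:
        return 0
    return (n_items - window_size) // step + 1

seq_length = 3
n_verses = 10  # toy value, not the real number of verses

# each sample needs seq_length input verses plus one extra verse for the shifted target
tot_samples = count_windows(n_verses, seq_length + 1)
train_samples = round(tot_samples * 0.7)

print(tot_samples)                  # 7, i.e. n_verses - seq_length
print(train_samples)                # 5
print(tot_samples - train_samples)  # 2
```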

Now, we map the list of verses into a dataset by taking *4* verses at a time and splitting them into an input string made of the first *3* verses and a target string made of the last *3* verses

In [None]:
from tensorflow.data import Dataset
from tensorflow.strings import reduce_join

def split_input_target(chunk):
  input_text = reduce_join(chunk[:-1], separator='\n') + '\n'
  target_text = reduce_join(chunk[1:], separator='\n')
  return input_text, target_text

dataset = Dataset.from_tensor_slices(divine_comedy_split)
dataset = dataset.window(seq_length + 1, step_length, drop_remainder=True)
dataset = dataset.flat_map(lambda window: window.batch(seq_length + 1))
dataset = dataset.map(split_input_target).shuffle(tot_samples, seed=0)
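
To make the windowing concrete, here is a plain-Python sketch of the same idea on placeholder verses. It only mirrors the logic of the pipeline above without using `tf.data`, and `sliding_windows` is a hypothetical helper

```python
def sliding_windows(items, size, step=1):
    """Yield every full window of `size` consecutive items (no incomplete tail)."""
    for start in range(0, len(items) - size + 1, step):
        yield items[start:start + size]

def split_input_target(chunk):
    """Input: all verses but the last (newline-terminated); target: all verses but the first."""
    input_text = '\n'.join(chunk[:-1]) + '\n'
    target_text = '\n'.join(chunk[1:])
    return input_text, target_text

verses = ['v1', 'v2', 'v3', 'v4', 'v5']  # placeholder verses
seq_length = 3

samples = [split_input_target(w) for w in sliding_windows(verses, seq_length + 1)]
for input_text, target_text in samples:
    print(repr(input_text), '->', repr(target_text))
```

Each target is its input shifted by one verse, which is what lets the model learn to predict the verse that follows the last one it has seen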

Finally, we encode each block of the comedy using the provided `word_tokenizer` to tokenize the text into words, including punctuation

> Some special tokens are reserved for the markers

In [None]:
from text_processing.tokenizers import word_tokenizer

tokenizer = word_tokenizer(divine_comedy)
print(tokenizer.vocab_size, 'tokens:')
print()
for token in tokenizer.tokens[:40]:
  print("'{}'".format('\\n' if token == '\n' else token))
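
The implementation of `word_tokenizer` lives in the repository and is not shown here. As a rough idea of what word-level tokenization with punctuation may look like, the sketch below defines a hypothetical `ToyWordTokenizer`. The regular expression, the choice of leaving id *0* free for padding, and the fact that `decode` returns a list of tokens rather than a string are all assumptions made for illustration

```python
import re

class ToyWordTokenizer:
    """Hypothetical word-level tokenizer: words, punctuation marks and newlines are tokens."""

    PATTERN = re.compile(r"\w+|[^\w\s]|\n")

    def __init__(self, corpus):
        # id 0 is left unused so that it can serve as the padding value (assumption)
        self.tokens = sorted(set(self.PATTERN.findall(corpus)))
        self.token_to_id = {token: i + 1 for i, token in enumerate(self.tokens)}
        self.id_to_token = {i: token for token, i in self.token_to_id.items()}
        self.vocab_size = len(self.tokens) + 1

    def encode(self, text):
        """Map a text to the list of ids of its tokens (spaces are discarded)."""
        return [self.token_to_id[token] for token in self.PATTERN.findall(text)]

    def decode(self, ids):
        """Map a list of ids back to the list of tokens."""
        return [self.id_to_token[i] for i in ids]

corpus = 'Nel mezzo del cammin di nostra vita\nmi ritrovai per una selva oscura,\n'
toy_tokenizer = ToyWordTokenizer(corpus)

ids = toy_tokenizer.encode('selva oscura,\n')
print(ids)
print(toy_tokenizer.decode(ids))
```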

In [None]:
def encode_dataset(input_dataset, target_dataset):
  def encode_sample(input, target):
    input = [tokenizer.vocab_size] + tokenizer.encode(input.numpy()) + [tokenizer.vocab_size+1]
    target = [tokenizer.vocab_size] + tokenizer.encode(target.numpy()) + [tokenizer.vocab_size+1]
    return input, target

  input_dataset, target_dataset = tf.py_function(encode_sample, [input_dataset, target_dataset], [tf.int64, tf.int64])
  input_dataset.set_shape([None])
  target_dataset.set_shape([None])
  return input_dataset, target_dataset

train_dataset = dataset.take(train_samples).map(encode_dataset)
train_dataset = train_dataset.cache()
train_dataset = train_dataset.padded_batch(batch_size)
train_dataset = train_dataset.prefetch(tf.data.experimental.AUTOTUNE)

val_dataset = dataset.skip(train_samples).map(encode_dataset)
val_dataset = val_dataset.cache()
val_dataset = val_dataset.padded_batch(batch_size)
val_dataset = val_dataset.prefetch(tf.data.experimental.AUTOTUNE)

In [None]:
for input, target in train_dataset.take(1):
  input = input.numpy()[0]
  target = target.numpy()[0]

  print(f'Input  Shape: {input.shape}')
  print(f'Target Shape: {target.shape}')
  print()

  print('INPUT:\n')
  print(tokenizer.decode([token for token in input if token < tokenizer.vocab_size]))
  print('\n\n---------------------\n\n')
  print('TARGET:\n')
  print(tokenizer.decode([token for token in target if token < tokenizer.vocab_size]))
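
The encoding step surrounds every sequence with a start id (`vocab_size`) and an end id (`vocab_size + 1`), and `padded_batch` then right-pads the sequences of a batch with zeros. A plain-Python sketch of both operations, with hypothetical helper names and a toy vocabulary size, could look like this

```python
def add_boundaries(ids, vocab_size):
    """Surround a token sequence with the start (vocab_size) and end (vocab_size + 1) ids."""
    return [vocab_size] + ids + [vocab_size + 1]

def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence with `pad_id` up to the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

vocab_size = 10  # toy value
batch = pad_batch([
    add_boundaries([3, 7, 2], vocab_size),
    add_boundaries([5], vocab_size),
])
print(batch)  # [[10, 3, 7, 2, 11], [10, 5, 11, 0, 0]]
```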

## **2. Model**

### ***2.1 Architecture***

The model is an encoder-decoder *Transformer*: the tokenized words are embedded and passed through a stack of encoder layers, while a stack of decoder layers attends both to the previous target tokens and to the encoder output, and a final layer outputs, for each position, the logits over the tokens

> The variable parameters of the model are:
> * the number of encoder and decoder layers (`num_layers`)
> * the number of attention heads (`num_heads`)
> * the dimension of the model (`d_model`)
> * the dimension of the feed-forward networks (`dff`)
> * the dropout rate (`dropout`)

In [None]:
from transformer.encoder import *
from transformer.decoder import *
from transformer.attention import *
from transformer.functions import *
from transformer.model import *

history = {'train loss': [], 'train acc': [], 'val loss': [], 'val acc': []}

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')
train_loss = tf.keras.metrics.Mean(name='train_loss')
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='train_accuracy')
val_loss = tf.keras.metrics.Mean(name='val_loss')
val_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='val_accuracy')

def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask
    return tf.reduce_sum(loss_) / tf.reduce_sum(mask)

step_signature = [
    tf.TensorSpec(shape=(None, None), dtype=tf.int64),
    tf.TensorSpec(shape=(None, None), dtype=tf.int64),
]

@tf.function(input_signature=step_signature)
def train_step(inp, tar):
  tar_inp = tar[:, :-1]
  tar_real = tar[:, 1:]
  
  enc_padding_mask, combined_mask, dec_padding_mask = create_masks(inp, tar_inp)
  with tf.GradientTape() as tape:
    predictions, _ = model(inp, tar_inp, True, enc_padding_mask, combined_mask, dec_padding_mask)
    loss = loss_function(tar_real, predictions)

  gradients = tape.gradient(loss, model.trainable_variables)    
  optimizer.apply_gradients(zip(gradients, model.trainable_variables))
  
  train_loss(loss)
  train_accuracy(tar_real, predictions)

@tf.function(input_signature=step_signature)
def val_step(inp, tar):
  tar_inp = tar[:, :-1]
  tar_real = tar[:, 1:]
  
  enc_padding_mask, combined_mask, dec_padding_mask = create_masks(inp, tar_inp)
  # no gradient tape is needed here, and dropout is disabled (training=False) during validation
  predictions, _ = model(inp, tar_inp, False, enc_padding_mask, combined_mask, dec_padding_mask)
  loss = loss_function(tar_real, predictions)

  val_loss(loss)
  val_accuracy(tar_real, predictions)
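
The role of the mask in `loss_function` can be seen on a tiny example. The sketch below is a plain-Python analogue that, for simplicity, works on probabilities instead of logits (a deliberate simplification). `masked_mean_loss` is a hypothetical helper, and padding is assumed to be id *0* as in the cell above

```python
import math

def masked_mean_loss(real, probs, pad_id=0):
    """Average cross-entropy over the non-padding positions only.

    real  -- list of target token ids
    probs -- list of probability distributions (one list per position)
    """
    total, count = 0.0, 0
    for token, dist in zip(real, probs):
        if token == pad_id:
            continue  # padding contributes neither to the sum nor to the count
        total += -math.log(dist[token])
        count += 1
    return total / count

real = [1, 2, 0]  # the last position is padding
probs = [
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
    [0.9, 0.05, 0.05],
]
print(masked_mean_loss(real, probs))
```

Only the first two positions enter the average, so a batch with a lot of padding is not rewarded for trivially predicting the padding value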

In [None]:
from tensorflow.keras.optimizers import Adam

num_layers = 6
num_heads = 8
d_model = 128
dff = 512
dropout = 0.2
input_vocab_size = tokenizer.vocab_size + 2
target_vocab_size = tokenizer.vocab_size + 2
optimizer = Adam(CustomSchedule(d_model), beta_1=0.9, beta_2=0.98, epsilon=1e-9)

model = Transformer(
    num_layers, d_model, num_heads, dff,
    input_vocab_size, target_vocab_size,
    pe_input=input_vocab_size, pe_target=target_vocab_size,
    rate=dropout
)

for (inp, tar) in train_dataset.take(1):
  val_step(inp, tar)

model.summary()
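
`CustomSchedule` comes from the repository and its definition is not shown here. If it follows the warm-up schedule of the original Transformer paper (an assumption, as is the `warmup_steps=4000` default used below), the learning rate would behave as in this plain-Python sketch, where `transformer_lr` is a hypothetical name

```python
def transformer_lr(step, d_model=128, warmup_steps=4000):
    """Learning rate at a given (1-based) step: linear warm-up, then inverse-sqrt decay."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (1, 1000, 4000, 16000):
    print(step, f'{transformer_lr(step):.6f}')
```

Under this assumption the rate grows linearly during the first `warmup_steps` steps, peaks there, and then decays proportionally to the inverse square root of the step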

### ***2.2 Training***

We can now proceed with the training phase, saving the weights of the model every `epochs_interval` epochs in a file whose name encodes the values of its parameters

In [None]:
import time
from utils.checkpoint import restore_checkpoint, save_checkpoint

checkpoint_signature = 'seq_{} stp_{} btc_{} tvs_{} nlr_{} nhd_{} dmd_{} dff_{} drp_{} epc_'.format(
    seq_length, step_length, batch_size, train_val_split,
    num_layers, num_heads, d_model, dff, dropout
)
checkpoint_directory = checkpoint_path + model_name
initial_epoch = restore_checkpoint(model, checkpoint_directory, checkpoint_signature)

epochs = 50 + initial_epoch
epochs_interval = 10
batches_interval = 20

for epoch in range(initial_epoch, epochs):
  start = time.time()
  train_loss.reset_states()
  train_accuracy.reset_states()
  val_loss.reset_states()
  val_accuracy.reset_states()
  print(f'Starting Epoch {epoch+1}/{epochs}')
  
  for (batch, (inp, tar)) in enumerate(train_dataset):
    train_step(inp, tar)
    if (batch + 1) % batches_interval == 0:
      print(f'  > Batch {batch+1}', end=' \t\t ')
      print(f'- train_loss: {train_loss.result():.4f} - train_acc: {train_accuracy.result():.4f}')      
  history['train loss'].append(train_loss.result())
  history['train acc'].append(train_accuracy.result())

  for (batch, (inp, tar)) in enumerate(val_dataset):
    val_step(inp, tar)  
  history['val loss'].append(val_loss.result())
  history['val acc'].append(val_accuracy.result())

  elapsed = time.time() - start
  print(f'Ending Epoch {epoch+1}/{epochs}', end=' \t ')
  print(f'- train_loss: {history["train loss"][-1]:.4f} - train_acc: {history["train acc"][-1]:.4f}', end=' ')
  print(f'- val_loss: {history["val loss"][-1]:.4f} - val_acc: {history["val acc"][-1]:.4f}')
  print(f'Elapsed Time {elapsed:.2f}s\n')

  save_checkpoint(model, epoch, checkpoint_directory, checkpoint_signature, epochs_interval, verbose=True)

Here's a graphical representation of the improvement of the model, with respect both to the loss and the accuracy, across the epochs

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

if epochs - initial_epoch > 0:
  sns.set_style('darkgrid')
  sns.set_context('notebook')
  plt.figure(figsize=(12, 5))

  x = np.arange(initial_epoch, epochs) + 1  # one point per epoch trained in this session

  plt.subplot(1, 2, 1)
  plt.plot(x, history['train loss'], label='train')
  plt.plot(x, history['val loss'], label='val')
  plt.legend()
  plt.title('Loss')

  plt.subplot(1, 2, 2)
  plt.plot(x, history['train acc'], label='train')
  plt.plot(x, history['val acc'], label='val')
  plt.legend()
  plt.title('Accuracy')

  plt.show()

## **3. Evaluation**

### ***3.1 Generation***

The generation is based on the trained model and it uses a `temperature_factor` to allow some degree of randomness

> The next token is sampled from the subset of tokens whose probability is at least `1 / temperature_factor` times the maximal one

> A higher `temperature_factor` leads to a more explorative generation, while a lower one leads to a more conservative generation (in particular, with `temperature_factor = 1` only the most probable token can be chosen, so the generation is deterministic)
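
The selection rule can be sketched in plain Python on a toy distribution. The helper names are hypothetical, and the re-weighting of the surviving tokens through a softmax of their probabilities mirrors the choice made in the notebook's `evaluate` function

```python
import math
import random

def candidate_ids(probabilities, temperature_factor):
    """Ids whose probability is at least max(probabilities) / temperature_factor."""
    threshold = max(probabilities) / temperature_factor
    return [i for i, p in enumerate(probabilities) if p >= threshold]

def sample_token(probabilities, temperature_factor=1.0, rng=random):
    """Pick one id among the candidates, weighting them by a softmax of their probabilities."""
    ids = candidate_ids(probabilities, temperature_factor)
    weights = [math.exp(probabilities[i]) for i in ids]  # unnormalized softmax weights
    return rng.choices(ids, weights=weights, k=1)[0]

probabilities = [0.50, 0.30, 0.15, 0.05]
print(candidate_ids(probabilities, 1))   # [0]
print(candidate_ids(probabilities, 2))   # [0, 1]
print(candidate_ids(probabilities, 20))  # [0, 1, 2, 3]
```

Note that a softmax over values in *[0, 1]* is rather flat, so the surviving candidates end up with similar sampling weights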

In [None]:
from tensorflow.nn import softmax
from text_processing.markers import unmark, MARKERS

newline_token = tokenizer.encode('\n')[0]

def evaluate(inp_sentence, max_length=30, temperature_factor=1, verbose=False):
  # the encoder input is the input sentence surrounded by a start and an end token
  encoder_input = tf.expand_dims([tokenizer.vocab_size] + inp_sentence + [tokenizer.vocab_size + 1], 0)

  # the decoder input is the same sentence, without the first token, preceded by a start token
  decoder_input = tf.expand_dims([tokenizer.vocab_size] + inp_sentence[1:], 0)

  # the final output of the evaluation (initially, this is an empty list)
  output = []

  # we repeat the process to get the entire verse (until the end token or the newline token is predicted)
  for i in range(max_length):
    enc_padding_mask, combined_mask, dec_padding_mask = create_masks(encoder_input, decoder_input)  
    logits, _ = model(
        encoder_input, decoder_input, False,
        enc_padding_mask, combined_mask, dec_padding_mask
    )

    # we get the probabilities for the decoded token (i-th token after the last token of the input sentence)
    probabilities = softmax(logits[0, len(inp_sentence)+i-1, :tokenizer.vocab_size]).numpy()

    # we take a subset of possible tokens whose probability is at least 1/temperature_factor of the maximal one
    indices = np.arange(tokenizer.vocab_size)[probabilities >= probabilities.max() / temperature_factor]

    # we renormalize this subset using, again, a softmax activation
    probabilities = softmax(probabilities[probabilities >= probabilities.max() / temperature_factor]).numpy()
    
    # the id is randomly chosen among the indices according to the computed probabilities
    predicted_id = np.random.choice(indices, size=1, p=probabilities)[0]
    
    # if the token coincides with the end token or the newline token, the generation is interrupted
    if predicted_id in [newline_token, tokenizer.vocab_size+1]:
      break

    # otherwise the token is appended both to the new decoder input and to the final output
    decoder_input = tf.concat([decoder_input, [[predicted_id]]], axis=-1)
    output.append(predicted_id)

    if verbose:
      print(tokenizer.decode([predicted_id]), end='')

  return output

def generate(
    input_string=divine_comedy_marked[:386], # first three tercets of the comedy
    max_iterations=250, end_marker=MARKERS['canto end'],
    temperature_factor=1.0, verbose=False
):
  # at the beginning, the generated string is the input string (plus a newline character)
  generated_string = input_string + '\n'

  for i in range(max_iterations):
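    # the input sequence is made up of the last 'seq_length' verses of the generated string
    # (the trailing newline is dropped before splitting, so that no empty verse is produced)
    input_sequence = generated_string[:-1].split('\n')[-seq_length:]
    input_sequence = tokenizer.encode('\n'.join(input_sequence))

    # the generated verse is then decoded (temperature and verbosity are forwarded to 'evaluate')
    generated_verse = tokenizer.decode(evaluate(
        input_sequence, temperature_factor=temperature_factor, verbose=verbose
    ))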
    if verbose:
      print()

    # if the verse coincides with the end marker, the generation is interrupted, otherwise it is appended with a newline
    if generated_verse == end_marker:
      break
    generated_string += generated_verse + '\n'
  
  # we finally return the decoded (and unmarked) string, excluding the input provided by the user
  return unmark(generated_string[len(input_string):])

generate(verbose=True)
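
The verse-by-verse loop can be illustrated without the model. In the sketch below, `next_verse` is a hypothetical callable standing in for the call to `evaluate`, and `'<END>'` is a placeholder rather than the repository's actual end marker. The point is only to show how the context window slides over the verses generated so far

```python
def generate_verses(seed_verses, next_verse, seq_length=3, max_iterations=10, end_marker='<END>'):
    """Generate verses one at a time, always conditioning on the last `seq_length` verses."""
    verses = list(seed_verses)
    for _ in range(max_iterations):
        context = verses[-seq_length:]  # the window slides as new verses are appended
        verse = next_verse(context)
        if verse == end_marker:
            break
        verses.append(verse)
    return verses[len(seed_verses):]  # only the newly generated verses

def dummy_next_verse(context):
    """Dummy "model": counts upwards from the last verse and stops after 'v6'."""
    n = int(context[-1][1:]) + 1
    return f'v{n}' if n <= 6 else '<END>'

print(generate_verses(['v1', 'v2', 'v3'], dummy_next_verse))  # ['v4', 'v5', 'v6']
```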

### ***3.2 Results***

We then try different temperatures, generating several samples for each of them so that the evaluation can be averaged; finally, we use the `store` utility function to save the results in a *.txt* file

In [None]:
import os
from metrics.store import store

result_dir = results_path + model_name
result_file = 'seq_{} stp_{} btc_{} tvs_{} nlr_{} nhd_{} dmd_{} dff_{} drp_{}.txt'.format(
    seq_length, step_length, batch_size, train_val_split,
    num_layers, num_heads, d_model, dff, dropout
)

print('Generating Temperature 1...')
samples = [generate(temperature_factor=1)]
store(result_dir, result_file, temperature=1, sample_texts=samples, original_text=divine_comedy)

repetitions = 5
for temperature in [2, 3, 5, 10, 20, 50, 100]:
  samples = []
  for r in range(repetitions):
    print(f'Generating Temperature {temperature} (repetition {r+1})...')
    samples.append(generate(temperature_factor=temperature))
  store(result_dir, result_file, temperature, sample_texts=samples, original_text=divine_comedy)