## **0. Preliminary Settings**

Mount *Google Drive* so that partial results and checkpoints can be stored

In [None]:
from google.colab import drive
drive.mount('/content/drive')

checkpoint_path = '/content/drive/My Drive/Colab Notebooks/Models/'
results_path = '/content/drive/My Drive/Colab Notebooks/Results/'
model_name = '6. Transformer GAN/2. Subword-Level/'

First of all, we need to clone the repository to get access to the code and use utility functions inside the notebook

In [None]:
!git clone https://github.com/mazzio97/DeepComedy.git

project_path = 'DeepComedy/'

This folder is then added to the system path so that the modules can be used inside the notebook

In [None]:
import sys

sys.path.append(project_path + 'src')

Finally, the *Divine Comedy* is loaded and stored in a variable

In [None]:
with open(project_path + 'res/divine_comedy.txt', 'r', encoding='ISO-8859-1') as f:
  divine_comedy = f.read()

print(divine_comedy[:231])
print('\n\n[...]\n\n')
print(divine_comedy[-266:])

Also, we set Python's, NumPy's/Keras' and TensorFlow's seeds to make the run as reproducible as possible

> The results may still differ slightly, due to other randomized routines called during the execution and to the inherent non-determinism of parallel computing

In [None]:
import random
import numpy as np
import tensorflow as tf

random.seed(0)
np.random.seed(0)
tf.random.set_seed(0)
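
As a minimal illustration of what a fixed seed provides, the sketch below uses only Python's standard `random` module; the helper `draw` is a hypothetical name introduced for this example and is not part of the notebook's code.

```python
import random

def draw(seed, n=5):
    # Seed a private generator and draw n pseudo-random integers
    rng = random.Random(seed)
    return [rng.randint(0, 99) for _ in range(n)]

first = draw(0)
second = draw(0)
other = draw(1)
print(first == second)  # same seed, same sequence
print(first == other)   # a different seed is expected to give a different sequence
```

The same idea applies to NumPy and TensorFlow, each of which keeps its own generator state and therefore needs its own seed.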

## **1. Data Processing**

### ***1.1 Text Mark***

We use the provided function `mark` to map the original *Divine Comedy* into a marked version containing:

* a marker both at the beginning and at the end of each *cantica*

* a marker both at the beginning and at the end of each *canto*

* a marker between each pair of *tercets*

In [None]:
from text_processing.markers import mark

divine_comedy_marked = mark(divine_comedy)
print(divine_comedy_marked[:260])
print('\n\n[...]\n\n')
print(divine_comedy_marked[-319:])
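
To give an idea of what a marked text looks like, here is a toy sketch of the marking step for a single *canto*. It is not the repository's `mark` function: the marker strings and the helper `toy_mark` are hypothetical, and tercets are assumed to be separated by a blank line.

```python
# Hypothetical marker strings: the repository defines its own markers
CANTO_START, CANTO_END, TERCET = '<canto>', '</canto>', '<tercet>'

def toy_mark(canto_text):
    # Tercets are assumed to be separated by a blank line
    tercets = [t.strip() for t in canto_text.strip().split('\n\n') if t.strip()]
    body = f'\n{TERCET}\n'.join(tercets)
    return f'{CANTO_START}\n{body}\n{CANTO_END}'

sample = 'verse 1\nverse 2\nverse 3\n\nverse 4\nverse 5\nverse 6'
marked = toy_mark(sample)
print(marked)
```

Putting each marker on its own line means that, once tokenized, it is delimited by newline tokens just like a verse.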

### ***1.2 The Tokenizer***

We use the provided `subword_tokenizer` to tokenize the text into subwords, including punctuation

> Some special tokens are reserved for the markers

In [None]:
from text_processing.tokenizers import subword_tokenizer

tokenizer = subword_tokenizer(divine_comedy, target_vocab_size=2048, max_subword_length=3)
print(tokenizer.vocab_size, 'tokens:')
print()
for token in tokenizer.subwords[:40]:
  print("'{}'".format('\\n' if token == '\n' else token))

In [None]:
divine_comedy_tokenized = tokenizer.encode(divine_comedy_marked)
print(len(divine_comedy_tokenized))

In [None]:
tokenized_sample = divine_comedy_tokenized[:57]
print(tokenizer.decode(tokenized_sample))

In [None]:
for token in tokenized_sample:
  print(token, '-->', tokenizer.decode([token]))
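
The effect of `max_subword_length` can be pictured with a toy greedy tokenizer. This is only a sketch under the assumption of a longest-match strategy; the function `greedy_subwords` and its vocabulary are made up for the example and do not reproduce the provided `subword_tokenizer`.

```python
def greedy_subwords(text, vocab, max_subword_length=3):
    # Greedy longest-match: at each position take the longest vocabulary
    # entry (up to max_subword_length characters); fall back to one character
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_subword_length, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens

toy_vocab = {'nel', ' me', 'zzo', ' de', 'l ', 'cam', 'min'}  # hypothetical vocabulary
text = 'nel mezzo del cammin'
tokens = greedy_subwords(text, toy_vocab)
print(tokens)
```

Concatenating the tokens gives back the original text, which is the property that makes encoding and decoding invertible.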

### ***1.3 Building the Dataset***

To decide the minimal length of a window sequence that still lets the net get a clear insight into the rhyming scheme, we compute the maximal encoded length of four consecutive verses (plus the tercet marker among them) and use it as a lower bound for the sequence length

In [None]:
# the newline token
newline = tokenizer.encode('\n')[0]

# the indices of each newline
indices = [i + 1 for i, t in enumerate(divine_comedy_tokenized) if t == newline]

# the length of each verse (or marker)
verses_lengths = [end - start for start, end in zip([0] + indices, indices + [len(divine_comedy_tokenized)])]

# five lines (4 verses + tercet marker) should be enough to understand the rhyming scheme
sequences_lengths = [sum(verses_lengths[i:i+5]) for i in range(len(verses_lengths)-4)]
max(sequences_lengths)
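
The same computation can be followed by hand on a tiny made-up token list. The token ids and the newline id below are hypothetical; only the logic mirrors the cell above.

```python
NEWLINE = 0  # hypothetical id of the newline token
tokens = [5, 6, NEWLINE, 7, NEWLINE, 8, 9, 10, NEWLINE, 11, NEWLINE, 12, 13, NEWLINE, 14, NEWLINE]

# position right after each newline token, i.e. where the next line starts
starts = [i + 1 for i, t in enumerate(tokens) if t == NEWLINE]

# consecutive boundaries give the length of each line (newline included)
bounds = [0] + starts
lengths = [end - start for start, end in zip(bounds, bounds[1:])]

# total length of every run of `span` consecutive lines
span = 5
window_lengths = [sum(lengths[i:i + span]) for i in range(len(lengths) - span + 1)]
print(lengths, window_lengths, max(window_lengths))
```

The maximum over all runs is the shortest window guaranteed to contain any five consecutive lines.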

Given that the ***sequence length*** should be at least *109*, we set it to *126 (+2 special start/end tokens = 128)*. We then choose a ***step_length***, namely the stride between two consecutive samples, and, finally, a ***train/validation split*** percentage

> Since the text is very dense, the `step_length` cannot be too small, as this would lead both to a prohibitive training time and to heavy overfitting

> To avoid this behaviour while keeping the data as trustworthy as possible, we choose a medium `step_length` together with a small `train_val_split`, so that (at the cost of a somewhat more expensive training) we can easily monitor overfitting while still using a lot of training data

In [None]:
seq_length = 126
step_length = 8
batch_size = 128

tot_samples = int((len(divine_comedy_tokenized) - seq_length) / step_length)
print('Tot Samples:', tot_samples)

Finally, the tokenized dataset is split into windows of length `seq_length` (*+1*) sampled every `step_length` tokens. Each window is then split into an *input sequence*, made up of its first `seq_length` tokens wrapped between the special start and end tokens, and a *target token*, namely the one-hot encoding of its last token

In [None]:
from tensorflow.data import Dataset

def split_input_target(chunk):
  input_sequence = tf.concat(([tokenizer.vocab_size], chunk[:-1], [tokenizer.vocab_size+1]), axis=-1)
  target_token = tf.one_hot(chunk[-1], tokenizer.vocab_size + 2, dtype=tf.int64)
  return input_sequence, target_token

dataset = Dataset.from_tensor_slices(divine_comedy_tokenized)
dataset = dataset.window(seq_length + 1, step_length, drop_remainder=True)
dataset = dataset.flat_map(lambda window: window.batch(seq_length + 1))
dataset = dataset.map(split_input_target).shuffle(tot_samples, seed=0)
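# partial batches are dropped because the training steps build tensors with exactly batch_size rows
dataset = dataset.batch(batch_size, drop_remainder=True).cache().prefetch(tf.data.experimental.AUTOTUNE)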
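
The windowing logic can be sketched in plain Python on a short list of made-up token ids. The helpers `sliding_windows` and `split_window` and the special ids are hypothetical stand-ins for the `tf.data` pipeline above.

```python
def sliding_windows(tokens, seq_length, step_length):
    # Windows of seq_length + 1 tokens taken every step_length positions;
    # incomplete trailing windows are dropped
    size = seq_length + 1
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, step_length)]

def split_window(window, start_id, end_id):
    # Input: all tokens but the last, wrapped in start/end ids; target: the last token
    return [start_id] + window[:-1] + [end_id], window[-1]

tokens = list(range(10, 30))   # 20 made-up token ids
START, END = 100, 101          # hypothetical special ids
pairs = [split_window(w, START, END) for w in sliding_windows(tokens, seq_length=4, step_length=8)]
for inp, tar in pairs:
    print(inp, '->', tar)
```

With 20 tokens, windows of 5 and a stride of 8, only the windows starting at positions 0 and 8 are complete, so two samples are produced.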

In [None]:
for input, target in dataset.take(1):
  print(f'INPUT {input.shape}\n')
  print(tokenizer.decode([t for t in input[0].numpy() if 0 < t < tokenizer.vocab_size]))
  print('\n\n---------------------\n\n')
  print(f'TARGET {target.shape}\n')
  print(tokenizer.decode([tf.argmax(target[0])]))

## **2. Model**

### ***2.1 Architecture***

The ***Transformer*** is a state-of-the-art model for *Natural Language Processing* and *Machine Translation* tasks proposed by *Vaswani et al.* in 2017 (https://arxiv.org/pdf/1706.03762v5.pdf)

* It consists of an *Encoder* and a *Decoder*, each of them made up of a given number of layers built from two kinds of sub-modules: a *Multi-Head Attention* sub-module (with a configurable number of heads) and a classical *Feed-Forward* sub-module

* Also, as it does not use recurrent layers to process strictly sequential data, it adds a *positional encoding* to the standard token embedding of the input
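
As a reference for the *positional encoding*, here is a small NumPy sketch of the sinusoidal formulation given by *Vaswani et al.*. Whether the repository's `transformer` module uses exactly this variant is an assumption; the function name is made up for the example.

```python
import numpy as np

def positional_encoding(max_position, d_model):
    # PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)
    positions = np.arange(max_position)[:, np.newaxis]
    dims = np.arange(d_model)[np.newaxis, :]
    angles = positions / np.power(10000.0, (2 * (dims // 2)) / d_model)
    encoding = np.zeros((max_position, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])
    encoding[:, 1::2] = np.cos(angles[:, 1::2])
    return encoding

pe = positional_encoding(max_position=128, d_model=256)
print(pe.shape)
```

Each position gets a distinct vector with values in [-1, 1], which is added to the token embedding so that the attention layers can tell positions apart.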

> The variable parameters of the model are:
> * the number of layers for the *Encoder* and the *Decoder*
> * the number of heads for the *Multi-Head Attention* sub-module
> * the dimension of all sub-layers in the model, as well as the embedding layers, known as *d_model*
> * the inner feed-forward dimension, known as *dff*
> * the dropout rate
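
To see how *d_model* and the number of heads interact, the following NumPy sketch splits a sequence of `d_model`-sized vectors into heads and applies scaled dot-product self-attention. The learned projection matrices are deliberately omitted, so this only illustrates the shapes and the attention formula, not the actual layer.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, num_heads):
    # x: (seq_len, d_model); learned projections are omitted for brevity
    seq_len, d_model = x.shape
    depth = d_model // num_heads
    heads = x.reshape(seq_len, num_heads, depth).transpose(1, 0, 2)  # (heads, seq, depth)
    scores = heads @ heads.transpose(0, 2, 1) / np.sqrt(depth)       # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    context = weights @ heads                                        # (heads, seq, depth)
    return context.transpose(1, 0, 2).reshape(seq_len, d_model), weights

rng = np.random.default_rng(0)
out, weights = multi_head_self_attention(rng.normal(size=(6, 256)), num_heads=4)
print(out.shape, weights.shape)
```

With `d_model = 256` and 4 heads, each head works on 64-dimensional slices, which is why `d_model` should be divisible by the number of heads.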

#### *2.1.1 Discriminator*

In [None]:
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Embedding, GRU, Reshape, Attention, AdditiveAttention, Concatenate, Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import plot_model

embedding_dim = 256
gru_units = 1024
attention = 'ADD'
discriminator_dropout = 0.2

input = Input((seq_length + 2,), name='input')
input_embedding = Embedding(tokenizer.vocab_size + 2, embedding_dim, name=f'input_embedding')(input)
gru, state = GRU(
    gru_units, return_sequences=True, return_state=True, stateful=False,
    dropout=discriminator_dropout, recurrent_initializer='glorot_uniform', name=f'gru'
)(input_embedding)
expanded_state = Reshape((1, gru_units), name=f'expanded_state')(state)
if attention == 'ADD':
  input_features = AdditiveAttention(name=f'input_features')([expanded_state, gru])
elif attention == 'MUL':
  input_features = Attention(name=f'input_features')([expanded_state, gru])
input_features = Reshape((gru_units,), name=f'input_flatten')(input_features)

target = Input((tokenizer.vocab_size + 2,), name='target')
concatenate = Concatenate(name='concatenate')([input_features, target])
output = Dense(1, activation='sigmoid', name='output')(concatenate)

discriminator = Model([input, target], output, name='Discriminator')
discriminator_optimizer = Adam()

display(plot_model(discriminator, show_shapes=True, show_layer_names=False))
print()
discriminator.summary()
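
The discriminator uses the final GRU state as a query over all the GRU outputs. A minimal NumPy sketch of an additive attention score follows; any learned parameters of the Keras layer are left out, and the random arrays are stand-ins for real activations.

```python
import numpy as np

def additive_attention(query, keys):
    # query: (units,), keys: (steps, units); the keys also act as values
    scores = np.tanh(query[np.newaxis, :] + keys).sum(axis=-1)  # one score per step
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()                           # softmax over the steps
    return weights @ keys, weights

rng = np.random.default_rng(0)
outputs = rng.normal(size=(5, 8))   # stand-in for the GRU outputs: 5 steps, 8 units
state = outputs[-1]                 # stand-in for the final GRU state
context, weights = additive_attention(state, outputs)
print(context.shape, weights.round(3))
```

The result is a single vector of `units` features summarising the input sequence, which is what gets concatenated with the target token.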

#### *2.1.2 Generator*

In [None]:
from transformer.model import *
from transformer.functions import *

@tf.function
def build_step(inp, tar):
  tar_inp = tar[:, :-1]
  tar_real = tar[:, 1:]  
  enc_padding_mask, combined_mask, dec_padding_mask = create_masks(inp, tar_inp)
  predictions, _ = generator(inp, tar_inp, True, enc_padding_mask, combined_mask, dec_padding_mask)

num_layers = 3
num_heads = 4
d_model = 256
dff = 512
generator_dropout = 0.2
input_vocab_size = tokenizer.vocab_size + 2
target_vocab_size = tokenizer.vocab_size + 2
generator_optimizer = Adam(CustomSchedule(d_model), beta_1=0.9, beta_2=0.98, epsilon=1e-9)

generator = Transformer(
    num_layers, d_model, num_heads, dff,
    input_vocab_size, target_vocab_size,
    pe_input=input_vocab_size, pe_target=target_vocab_size,
    rate=generator_dropout
)

for (inp, tar) in dataset.take(1):
  build_step(inp, tf.expand_dims(tf.argmax(tar, axis=-1), 1))

generator.summary()
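
The generator's optimizer relies on `CustomSchedule(d_model)`, whose definition lives in the repository and is not shown here. Assuming it follows the warm-up schedule described by *Vaswani et al.*, with a hypothetical default of 4000 warm-up steps, the learning rate would behave as in this sketch.

```python
def warmup_learning_rate(step, d_model, warmup_steps=4000):
    # lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (1, 1000, 4000, 16000):
    print(step, f'{warmup_learning_rate(step, d_model=256):.6f}')
```

Under this assumption the rate grows linearly during the warm-up, peaks at `warmup_steps` and then decays with the inverse square root of the step.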

### ***2.2 Training***

We need to write a custom training loop because, during the decoding phase, we have to add one token at a time and set the correct input state. We then proceed with the training phase, storing the weights of the model every `epochs_interval` epochs in a file whose name encodes the values of its parameters

In [None]:
from tensorflow.keras.losses import binary_crossentropy
from tensorflow.keras.metrics import binary_accuracy

history = {'discr loss': [], 'discr acc': [], 'adver loss': [], 'adver acc': []}

# the discriminator's labels are 1s for real samples and 0s for generated ones
# as we are trying to improve the discriminator weights only
@tf.function
def discriminator_step(inputs, targets):
  with tf.GradientTape() as tape:
    dec_inputs = tf.expand_dims(tf.ones(batch_size, dtype=tf.int64) * tokenizer.vocab_size, 1)
    enc_padding_mask, combined_mask, dec_padding_mask = create_masks(inputs, dec_inputs)
    logits, _ = generator(inputs, dec_inputs, False, enc_padding_mask, combined_mask, dec_padding_mask)
    generated = tf.squeeze(logits)

    real_preds = tf.squeeze(discriminator([inputs, targets]))
    fake_preds = tf.squeeze(discriminator([inputs, generated]))

    loss = binary_crossentropy(tf.ones(batch_size), real_preds) + binary_crossentropy(tf.zeros(batch_size), fake_preds)
    acc = binary_accuracy(tf.ones(batch_size), real_preds) + binary_accuracy(tf.zeros(batch_size), fake_preds)

  gradients = tape.gradient(loss, discriminator.trainable_variables)
  discriminator_optimizer.apply_gradients(zip(gradients, discriminator.trainable_variables))
  return loss / 2, acc / 2

# the discriminator's labels are 1s as we are trying to improve the generator weights
@tf.function
def adversarial_step(inputs):
  with tf.GradientTape() as tape:
    dec_inputs = tf.expand_dims(tf.ones(batch_size, dtype=tf.int64) * tokenizer.vocab_size, 1)
    enc_padding_mask, combined_mask, dec_padding_mask = create_masks(inputs, dec_inputs)
    logits, _ = generator(inputs, dec_inputs, False, enc_padding_mask, combined_mask, dec_padding_mask)
    generated = tf.squeeze(logits)

    preds = tf.squeeze(discriminator([inputs, generated]))
    loss = binary_crossentropy(tf.ones(batch_size), preds)
    acc = binary_accuracy(tf.ones(batch_size), preds)

  gradients = tape.gradient(loss, generator.trainable_variables)
  generator_optimizer.apply_gradients(zip(gradients, generator.trainable_variables))
  return loss, acc
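
The role of the labels in the two steps can be checked on a few made-up discriminator outputs. The sketch below re-implements binary cross-entropy in plain Python for illustration only; the prediction values are invented.

```python
import math

def bce(labels, preds, eps=1e-7):
    # mean binary cross-entropy; predictions are clipped to avoid log(0)
    total = 0.0
    for y, p in zip(labels, preds):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)

real_preds = [0.9, 0.8]   # discriminator outputs on real targets (made up)
fake_preds = [0.3, 0.1]   # discriminator outputs on generated targets (made up)

# discriminator objective: real -> 1, generated -> 0
discr_loss = (bce([1, 1], real_preds) + bce([0, 0], fake_preds)) / 2
# adversarial objective: generated -> 1, so only fooling the discriminator lowers it
adver_loss = bce([1, 1], fake_preds)
print(f'{discr_loss:.4f} {adver_loss:.4f}')
```

When the discriminator is confident that the generated samples are fake, its own loss is low while the adversarial loss is high, which is the signal pushing the generator.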

In [None]:
import time
from utils.checkpoint import restore_checkpoint, save_checkpoint

discriminator_checkpoint_signature = \
  'seq_{} stp_{} btc_{} emb_{} unt_{} att_{} ddr_{} nlr_{} nhd_{} dmd_{} dff_{} gdr_{} epc_'.format(
    seq_length, step_length, batch_size,
    embedding_dim, gru_units, attention, discriminator_dropout,
    num_layers, num_heads, d_model, dff, generator_dropout
  )
generator_checkpoint_signature = \
  'seq_{} stp_{} btc_{} emb_{} unt_{} att_{} ddr_{} nlr_{} nhd_{} dmd_{} dff_{} gdr_{} epc_'.format(
    seq_length, step_length, batch_size,
    embedding_dim, gru_units, attention, discriminator_dropout,
    num_layers, num_heads, d_model, dff, generator_dropout
  )
checkpoint_directory = checkpoint_path + model_name
initial_epoch = restore_checkpoint(discriminator, checkpoint_directory, discriminator_checkpoint_signature)
assert initial_epoch == restore_checkpoint(generator, checkpoint_directory, generator_checkpoint_signature, verbose=False)

epochs = 50
epochs_interval = 10
batches_interval = 20

for epoch in range(initial_epoch, epochs):
  start = time.time()
  history['discr loss'].append(0)
  history['discr acc'].append(0)
  history['adver loss'].append(0)
  history['adver acc'].append(0)
  print(f'Starting Epoch {epoch+1}/{epochs}')
  
  for (batch, (inp, tar)) in enumerate(dataset):
    discr_loss, discr_acc = discriminator_step(inp, tar)
    adver_loss, adver_acc = adversarial_step(inp)
    history['discr loss'][-1] += discr_loss
    history['discr acc'][-1] += discr_acc
    history['adver loss'][-1] += adver_loss
    history['adver acc'][-1] += adver_acc
    if (batch + 1) % batches_interval == 0:
      print(f'  > Batch {batch+1}', end=' \t\t ')
      print(f'- discr_loss: {discr_loss:.4f} - discr_acc: {discr_acc:.4f}', end=' ')
      print(f'- adver_loss: {adver_loss:.4f} - adver_acc: {adver_acc:.4f}')

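  # enumerate is zero-based, so the number of processed batches is batch + 1
  num_batches = batch + 1
  history['discr loss'][-1] /= num_batches
  history['discr acc'][-1] /= num_batches
  history['adver loss'][-1] /= num_batches
  history['adver acc'][-1] /= num_batches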
  elapsed = time.time() - start
  print(f'Ending Epoch {epoch+1}/{epochs}', end=' \t ')
  print(f'- discr_loss: {history["discr loss"][-1]:.4f} - discr_acc: {history["discr acc"][-1]:.4f}', end=' ')
  print(f'- adver_loss: {history["adver loss"][-1]:.4f} - adver_acc: {history["adver acc"][-1]:.4f}')
  print(f'Elapsed Time {elapsed:.2f}s\n')

  save_checkpoint(discriminator, epoch, checkpoint_directory, discriminator_checkpoint_signature, epochs_interval)
  save_checkpoint(generator, epoch, checkpoint_directory, generator_checkpoint_signature, epochs_interval, verbose=False)
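
The internals of `save_checkpoint` are not shown in this notebook. As a purely hypothetical sketch of interval-based saving, one could build the file name from the signature and save only when a whole interval of epochs has been completed; the helper below and its naming scheme are assumptions.

```python
def checkpoint_name(signature, epoch, epochs_interval):
    # epoch is zero-based; a name is returned only when a full interval has elapsed
    completed = epoch + 1
    if completed % epochs_interval != 0:
        return None
    return f'{signature}{completed}'

signature = 'seq_126 stp_8 btc_128 epc_'   # shortened, illustrative signature
names = [checkpoint_name(signature, e, epochs_interval=10) for e in range(30)]
saved = [n for n in names if n is not None]
print(saved)
```

Encoding the hyperparameters in the file name lets a later run with the same settings find and resume from its own checkpoints.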

Here's a graphical representation of the improvement of the model, with respect both to the loss and the accuracy, across the epochs

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

if epochs - initial_epoch > 0:
  sns.set_style('darkgrid')
  sns.set_context('notebook')
  plt.figure(figsize=(12, 5))

  x = np.arange(initial_epoch, epochs) + 1

  plt.subplot(1, 2, 1)
  plt.plot(x, history['discr loss'], label='discriminator')
  plt.plot(x, history['adver loss'], label='adversarial')
  plt.legend()
  plt.title('Loss')

  plt.subplot(1, 2, 2)
  plt.plot(x, history['discr acc'], label='discriminator')
  plt.plot(x, history['adver acc'], label='adversarial')
  plt.legend()
  plt.title('Accuracy')

  plt.show()

## **3. Generation**

The generation is based on the trained model and it uses a `temperature_factor` to allow some degree of randomness

> The next token is sampled from the subset of tokens whose probability is at least `1 / temperature_factor` times the maximal one

> A higher `temperature_factor` leads to a more explorative generation, while a lower one leads to a more conservative generation (in particular, with `temperature_factor = 1` only the most probable token can be chosen, so the generation is deterministic)
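
The selection rule can be tried on a small, made-up probability vector. The helper `candidate_ids` is a hypothetical name; it only shows which token ids survive the threshold for a few values of the factor.

```python
import numpy as np

def candidate_ids(probabilities, temperature_factor):
    # ids whose probability is at least max / temperature_factor
    probabilities = np.asarray(probabilities, dtype=float)
    return np.flatnonzero(probabilities >= probabilities.max() / temperature_factor)

probs = [0.50, 0.30, 0.15, 0.05]
for factor in (1, 2, 5, 20):
    print(factor, candidate_ids(probs, factor).tolist())
```

With a factor of 1 only the most probable token is left, while larger factors progressively admit less likely tokens.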

In [None]:
from tensorflow.nn import softmax
from text_processing.markers import unmark, MARKERS

def evaluate(inp_sentence):
  # the encoder input is the input sentence surrounded by a start and an end token
  encoder_input = tf.expand_dims([tokenizer.vocab_size] + inp_sentence + [tokenizer.vocab_size + 1], 0)

  # the decoder input is the same sentence, without its first token, preceded by a start token
  decoder_input = tf.expand_dims([tokenizer.vocab_size] + inp_sentence[1:], 0)

  # we now call the generator to get the logits tensor
  enc_padding_mask, combined_mask, dec_padding_mask = create_masks(encoder_input, decoder_input)
  logits, attention_weights = generator(
      encoder_input, decoder_input, False,
      enc_padding_mask, combined_mask, dec_padding_mask
  )

  # finally, we return the logits for the next token, excluding the special start/end tokens
  return logits[0, len(inp_sentence)-1, :].numpy()[:tokenizer.vocab_size]

def generate(
    input_string=divine_comedy_marked[:386], # first three tercets of the comedy
    max_iterations=4000, end_marker=MARKERS['canto end'],
    temperature_factor=1.0, verbose=False
):
  # at the beginning, the generated string is the encoding of the input string
  generated_string = tokenizer.encode(input_string)

  for i in range(max_iterations):
    # the input sequence is made up of the last 'seq_length' tokens of the generated string
    input_sequence = generated_string[-seq_length:]

    # as the evaluate function returns logits, we need to apply a softmax activation to get probabilities
    probabilities = softmax(evaluate(input_sequence)).numpy()

    # we take a subset of possible tokens whose probability is at least 1/temperature_factor of the maximal one
    mask = probabilities >= probabilities.max() / temperature_factor
    indices = np.arange(tokenizer.vocab_size)[mask]

    # we renormalize this subset using, again, a softmax activation
    probabilities = softmax(probabilities[mask]).numpy()

    # the id is randomly chosen among the indices according to the computed probabilities
    predicted_id = np.random.choice(indices, size=1, p=probabilities)[0]

    # the id is then mapped into a token from the vocabulary
    predicted_token = tokenizer.decode([predicted_id])
    if verbose:
      print(predicted_token, end='')

    # if the token coincides with the end marker, the generation is interrupted, otherwise the token is appended 
    if predicted_token == end_marker:
      break
    generated_string.append(predicted_id)
  
  # we finally return the decoded (and unmarked) string, excluding the input provided by the user
  return unmark(tokenizer.decode(generated_string)[len(input_string):])

### ***3.2 Results***

We then try different temperatures, generating several samples for each of them so that the evaluation can be averaged, and finally use the `store` utility function to save the results in a *.txt* file

In [None]:
from metrics.store import store

result_dir = results_path + model_name
result_file = 'seq_{} stp_{} btc_{} emb_{} unt_{} att_{} ddr_{} nlr_{} nhd_{} dmd_{} dff_{} gdr_{}'.format(
    seq_length, step_length, batch_size,
    embedding_dim, gru_units, attention, discriminator_dropout,
    num_layers, num_heads, d_model, dff, generator_dropout
)

print('Generating Temperature 1...')
samples = [generate(temperature_factor=1)]
store(result_dir, result_file, temperature=1, sample_texts=samples, original_text=divine_comedy)

repetitions = 5
for temperature in [2, 3, 5, 10]:
  samples = []
  for r in range(repetitions):
    print(f'Generating Temperature {temperature} (repetition {r+1})...')
    samples.append(generate(temperature_factor=temperature))
  store(result_dir, result_file, temperature, sample_texts=samples, original_text=divine_comedy)