## **0. Preliminary Settings**

First of all, we need to clone the repository to get access to the code and use utility functions inside the notebook

In [None]:
!git clone https://github.com/mazzio97/DeepComedy.git

project_path = 'DeepComedy/'

This folder is then added to the system path so that the modules can be used inside the notebook

In [None]:
import sys

sys.path.append(project_path + 'src')

Finally, the *Divine Comedy* is loaded and stored in a variable

In [None]:
with open(project_path + 'res/divine_comedy.txt', 'r', encoding='ISO-8859-1') as f:
  divine_comedy = f.read()

print(divine_comedy[:231])
print('\n\n[...]\n\n')
print(divine_comedy[-266:])

We also set the seeds of Python, NumPy/Keras and TensorFlow to make the results as reproducible as possible

> The results could still differ slightly, though, due to other randomized routines called during the execution and to the inherent stochasticity introduced by parallel computing

In [None]:
import random
import numpy as np
import tensorflow as tf

random.seed(0)
np.random.seed(0)
tf.random.set_seed(0)

## **1. Data Processing**

### ***1.1 Text Mark***

We use the provided function `mark` to map the original *Divine Comedy* into a marked version containing:

* a marker both at the beginning and at the end of each *cantica*

* a marker both at the beginning and at the end of each *canto*

* a marker between each pair of consecutive *tercets*

In [None]:
from text_processing.markers import mark

divine_comedy_marked = mark(divine_comedy)
print(divine_comedy_marked[:260])
print('\n\n[...]\n\n')
print(divine_comedy_marked[-319:])

### ***1.2 The Tokenizer***

We use the provided `subword_tokenizer` to tokenize the text into subwords, including punctuation

> Some special tokens are reserved for the markers

In [None]:
from text_processing.tokenizers import subword_tokenizer

tokenizer = subword_tokenizer(divine_comedy, target_vocab_size=2048, max_subword_length=3)
print(tokenizer.vocab_size, 'tokens:')
print()
for i, token in enumerate(tokenizer.subwords[:40]):
  print("'{}'".format('\\n' if token == '\n' else token))

In [None]:
divine_comedy_tokenized = tokenizer.encode(divine_comedy_marked)
print(len(divine_comedy_tokenized))

In [None]:
tokenized_sample = divine_comedy_tokenized[:57]
print(tokenizer.decode(tokenized_sample))

In [None]:
for token in tokenized_sample:
  print(token, '-->', tokenizer.decode([token]))
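
To give an intuition of what a bounded subword length means, here is a minimal sketch of a greedy longest-match tokenizer. It is not the implementation behind `subword_tokenizer`: the function `greedy_subword_tokenize` and the toy vocabulary are hypothetical, and only the bound of three characters per subword is taken from the cell above

```python
def greedy_subword_tokenize(text, vocabulary, max_subword_length=3):
    """Split text greedily, always taking the longest known subword (hypothetical helper)."""
    pieces = []
    i = 0
    while i < len(text):
        # try the longest candidate first, falling back to a single character
        for length in range(min(max_subword_length, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocabulary:
                pieces.append(piece)
                i += length
                break
    return pieces

# toy vocabulary, made up for this example only
toy_vocabulary = {'nel', ' me', 'zzo', ' de', 'l ', 'ca', 'mm', 'in'}
toy_text = 'nel mezzo del cammin'
toy_pieces = greedy_subword_tokenize(toy_text, toy_vocabulary)
print(toy_pieces)  # ['nel', ' me', 'zzo', ' de', 'l ', 'ca', 'mm', 'in']
```

Concatenating the pieces gives back the original text, which is the same round-trip property that `tokenizer.decode(tokenizer.encode(...))` relies on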

### ***1.3 Building the Dataset***

To determine the minimal window length that gives the network a clear view of the rhyming scheme, we compute the maximal encoded length of five consecutive lines (four verses plus a tercet marker) and require the sequence length to be at least that long

In [None]:
# the newline token
newline = tokenizer.encode('\n')[0]

# the indices of each newline
indices = [i + 1 for i, t in enumerate(divine_comedy_tokenized) if t == newline]

# the length of each verse (or marker)
verses_lengths = [end - start for start, end in zip([0] + indices, indices +  [len(divine_comedy_tokenized)])]

# five verses (4 + tercet mark) should be enough to understand the rhyming scheme
sequences_lengths = [sum(verses_lengths[i:i+5]) for i in range(len(verses_lengths)-4)]
max(sequences_lengths)
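
The same computation can be followed by hand on a tiny example. This is only a sketch on made-up data: the token ids are hypothetical, `0` plays the role of the newline token, and lines are grouped two at a time instead of five to keep the numbers small

```python
# hypothetical token ids: 0 plays the role of the newline token
toy_tokens = [5, 6, 0, 7, 0, 8, 9, 10, 0, 11]
toy_newline = 0

# position right after each newline, i.e. where the next line starts
starts = [i + 1 for i, t in enumerate(toy_tokens) if t == toy_newline]

# length of each line, newline included (the last chunk has no trailing newline)
lengths = [end - start for start, end in zip([0] + starts, starts + [len(toy_tokens)])]

# total length of every group of `group_size` consecutive lines
group_size = 2
group_lengths = [sum(lengths[i:i + group_size]) for i in range(len(lengths) - group_size + 1)]

print(lengths)             # [3, 2, 4, 1]
print(group_lengths)       # [5, 6, 5]
print(max(group_lengths))  # 6
```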

Given that the ***sequence length*** should be at least *109*, we set it to *128*; we then choose a ***step_length***, namely the number of tokens between the starts of two consecutive samples, and, finally, a ***train/validation split*** percentage

> Since the text is very dense, we cannot use too small a `step_length`, as it would lead both to a prohibitive training time and to heavy overfitting

> To avoid this while keeping the data as trustworthy as possible, we choose a medium `step_length` together with a small `train_val_split` (the fraction of samples used for training), so that, at the cost of a somewhat more expensive training, we can easily monitor overfitting while still using a lot of training data

In [None]:
seq_length = 128
step_length = 16
batch_size = 128
train_val_split = 0.3

tot_samples = int((len(divine_comedy_tokenized) - seq_length) / step_length)
train_samples = round(tot_samples * train_val_split)

print('Train Samples:', train_samples)
print('  Val Samples:', tot_samples - train_samples)

Finally, the tokenized dataset is cut into windows of length `seq_length + 1` sampled every `step_length` tokens, and each window is then split into an *input sequence* and a *target sequence*, both of length `seq_length`, offset by a single token

In [None]:
from tensorflow.data import Dataset

def split_input_target(chunk):
  input_sequence = chunk[:-1]
  target_sequence = chunk[1:]
  return input_sequence, target_sequence

dataset = Dataset.from_tensor_slices(divine_comedy_tokenized)
dataset = dataset.window(seq_length + 1, step_length, drop_remainder=True)
dataset = dataset.flat_map(lambda window: window.batch(seq_length + 1))
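# reshuffle_each_iteration=False keeps the shuffled order fixed, so the two splits stay disjoint across epochs
dataset = dataset.map(split_input_target).shuffle(tot_samples, seed=0, reshuffle_each_iteration=False)

train_dataset = dataset.take(train_samples).batch(batch_size)
# skip (rather than take) the training samples, so that validation data never overlaps with training data
val_dataset = dataset.skip(train_samples).batch(batch_size)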
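
The windowing logic can be reproduced in plain Python on a toy sequence. This is a sketch meant to illustrate the idea, not a replacement for the `tf.data` pipeline: `sliding_windows` is a hypothetical helper and the values of `seq_length` and `step_length` are shrunk for readability

```python
def sliding_windows(tokens, seq_length, step_length):
    """Return (input, target) pairs from windows of seq_length + 1 tokens (hypothetical helper)."""
    window_size = seq_length + 1
    pairs = []
    # incomplete windows at the end are dropped, as with drop_remainder=True
    for start in range(0, len(tokens) - window_size + 1, step_length):
        chunk = tokens[start:start + window_size]
        pairs.append((chunk[:-1], chunk[1:]))
    return pairs

toy_sequence = list(range(10))
pairs = sliding_windows(toy_sequence, seq_length=4, step_length=2)

for input_sequence, target_sequence in pairs:
    print(input_sequence, '-->', target_sequence)
# [0, 1, 2, 3] --> [1, 2, 3, 4]
# [2, 3, 4, 5] --> [3, 4, 5, 6]
# [4, 5, 6, 7] --> [5, 6, 7, 8]
```

The number of complete windows is `(len(tokens) - (seq_length + 1)) // step_length + 1`, which can differ by one from an estimate based on `len(tokens) - seq_length`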

In [None]:
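# 'input_sequence' is used instead of 'input' to avoid shadowing the Python builtin
for input_sequence, target_sequence in dataset.take(1):
  print('INPUT:\n')
  print(tokenizer.decode(input_sequence))
  print('\n\n---------------------\n\n')
  print('TARGET:\n')
  print(tokenizer.decode(target_sequence))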

## **2. Model**

### ***2.1 Architecture***

The model consists of an initial *Embedding* layer that maps each token id to a dense vector, followed by one or two *RNN* layer(s) and, finally, by a *Dense* layer with *softmax* activation, which outputs the probability of each token in the vocabulary

> The variable parameters of the model are:
> * the dimension of the *Embedding* layer
> * the kind of *RNN* (*GRU* or *LSTM*)
> * the number of units of the *RNN* layers
> * the dropout rate 

In [None]:
from tensorflow.keras import Model, Input
from tensorflow.keras.layers import Embedding, GRU, LSTM, Dense
from tensorflow.keras.utils import plot_model

embedding_dim = 64
rnn_type = 'LSTM'
rnn_units_1 = 512
rnn_units_2 = None
dropout = 0.1

def rnn_layer(units, name):
  if rnn_type == 'LSTM':
    return LSTM(
      units, dropout=dropout, return_sequences=True, stateful=False,
      recurrent_initializer='glorot_uniform', name=name
    )
  elif rnn_type == 'GRU':
    return GRU(
      units, dropout=dropout, return_sequences=True, stateful=False,
      recurrent_initializer='glorot_uniform', name=name
    )

input_tensor = Input((seq_length,), dtype='int64', name='input')
embedding_tensor = Embedding(tokenizer.vocab_size, embedding_dim, name='embedding')(input_tensor)
rnn_tensor = rnn_layer(rnn_units_1, 'rnn_layer_1')(embedding_tensor)
if rnn_units_2 is not None:
    rnn_tensor = rnn_layer(rnn_units_2, 'rnn_layer_2')(rnn_tensor)
output_tensor = Dense(tokenizer.vocab_size, activation='softmax', name='output_layer')(rnn_tensor)

model = Model(input_tensor, output_tensor, name='DeepComedy')
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

display(plot_model(model, show_shapes=True, show_layer_names=False, rankdir='LR'))
model.summary()
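
The parameter counts reported by `model.summary()` can be cross-checked by hand. The sketch below assumes the standard Keras layouts (biases enabled and, for the *GRU*, the TensorFlow 2 default `reset_after=True`); the vocabulary size of 2048 is a hypothetical placeholder, since the actual `tokenizer.vocab_size` may differ from the requested target

```python
def embedding_params(vocab_size, emb_dim):
    # one dense vector per token id
    return vocab_size * emb_dim

def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights and one bias vector
    return 4 * (input_dim * units + units * units + units)

def gru_params(input_dim, units):
    # three gates; with reset_after=True each gate has two bias vectors
    return 3 * (input_dim * units + units * units + 2 * units)

def dense_params(input_dim, output_dim):
    # weight matrix plus bias
    return input_dim * output_dim + output_dim

assumed_vocab_size = 2048  # hypothetical, replace with tokenizer.vocab_size
emb_dim, units = 64, 512   # same values as embedding_dim and rnn_units_1 above

total_lstm = (embedding_params(assumed_vocab_size, emb_dim)
              + lstm_params(emb_dim, units)
              + dense_params(units, assumed_vocab_size))

print('LSTM layer:', lstm_params(emb_dim, units))  # 1181696
print('GRU layer: ', gru_params(emb_dim, units))   # 887808
print('Total (single LSTM layer):', total_lstm)    # 2363392
```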

### ***2.2 Training***

We can now proceed with the training phase, storing the weights of the model every `epochs_interval` epochs in a file whose name encodes the values of its parameters

In [None]:
from utils.validation import validation_callback
from utils.checkpoint import restore_checkpoint, checkpoint_callback

checkpoint_signature = 'seq_{} stp_{} btc_{} tvs_{} emb_{} rnn_{} ru1_{} ru2_{} drp_{} epc_'.format(
    seq_length, step_length, batch_size, train_val_split,
    embedding_dim, rnn_type, rnn_units_1, rnn_units_2, dropout
)
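# NOTE: `checkpoint_path` and `model_name` are not defined in the cells above, so they must be set before running this cell
checkpoint_directory = checkpoint_path + model_name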
initial_epoch = restore_checkpoint(model, checkpoint_directory, checkpoint_signature)

epochs = 50
epochs_interval = 10
batches_interval = 20

val_callback, history = validation_callback(model, val_dataset, epochs, batches_interval)
ckp_callback = checkpoint_callback(model, checkpoint_directory, checkpoint_signature, epochs_interval)
model.fit(train_dataset, epochs=epochs, initial_epoch=initial_epoch, callbacks=[val_callback, ckp_callback], verbose=0)

Here's a graphical representation of the improvement of the model, with respect both to the loss and the accuracy, across the epochs

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

if epochs - initial_epoch > 0:
  sns.set_style('darkgrid')
  sns.set_context('notebook')
  plt.figure(figsize=(12, 5))

  x = np.arange(initial_epoch, epochs) + 1

  plt.subplot(1, 2, 1)
  plt.plot(x, history['train loss'], label='train')
  plt.plot(x, history['val loss'], label='val')
  plt.legend()
  plt.title('Loss')

  plt.subplot(1, 2, 2)
  plt.plot(x, history['train acc'], label='train')
  plt.plot(x, history['val acc'], label='val')
  plt.legend()
  plt.title('Accuracy')

  plt.show()

## **3. Generation**

The generation is based on the trained model and it uses a `temperature_factor` to allow some degree of randomness

> The next token is sampled from the subset of tokens whose probability is at least `1 / temperature_factor` times the maximal one

> A higher `temperature_factor` leads to a more explorative generation, while a lower one leads to a more conservative generation (in particular, with `temperature_factor = 1` only the most probable token passes the filter, so, barring ties, the generation is completely deterministic)
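
The effect of the threshold can be seen on a toy distribution. This is a sketch in plain Python under stated assumptions: `candidate_tokens` and `softmax_over` are hypothetical helpers and the probabilities are made up, while the filtering rule and the softmax renormalization mirror the ones used by the generation function in this section

```python
import math

def candidate_tokens(probabilities, temperature_factor):
    """Indices whose probability is at least max / temperature_factor (hypothetical helper)."""
    threshold = max(probabilities) / temperature_factor
    return [i for i, p in enumerate(probabilities) if p >= threshold]

def softmax_over(values):
    """Plain softmax, used here to renormalize the surviving probabilities."""
    exps = [math.exp(v) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

toy_probabilities = [0.50, 0.30, 0.15, 0.05]  # made-up values
for factor in (1.0, 2.0, 4.0):
    kept = candidate_tokens(toy_probabilities, factor)
    weights = softmax_over([toy_probabilities[i] for i in kept])
    print(factor, kept, [round(w, 3) for w in weights])
# kept indices: [0] for 1.0, [0, 1] for 2.0, [0, 1, 2] for 4.0
```

Since the softmax is applied to values that already lie between 0 and 1, the surviving tokens end up with fairly similar weights, so it is mostly the threshold that controls how random the generation is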

In [None]:
from tensorflow.nn import softmax
from text_processing.markers import unmark, MARKERS

def generate(
    input_string=divine_comedy_marked[:386], # first three tercets of the comedy
    max_iterations=4000, end_marker=MARKERS['canto end'],
    temperature_factor=1.0, verbose=False
):
  # at the beginning, the generated string is the encoding of the input string
  generated_string = tokenizer.encode(input_string)
  
  for i in range(max_iterations):  
    # the input sequence is made up of the last 'seq_length' tokens of the generated string
    input_sequence = np.array([generated_string[-seq_length:]], dtype='int64')
    
    # we are interested in the probabilities for the last element of the sequence
    probabilities = model.predict(input_sequence)[0, -1]

    # we take a subset of possible tokens whose probability is at least 1/temperature_factor of the maximal one
    mask = probabilities >= probabilities.max() / temperature_factor
    indices = np.arange(tokenizer.vocab_size)[mask]

    # we renormalize this subset using, again, a softmax activation
    probabilities = softmax(probabilities[mask]).numpy()

    # the id is randomly chosen among the indices according to the computed probabilities
    predicted_id = np.random.choice(indices, size=1, p=probabilities)[0]

    # the id is then mapped into a token from the vocabulary
    predicted_token = tokenizer.decode([predicted_id])
    if verbose:
      print(predicted_token, end='')

    # if the token coincides with the end marker, the generation is interrupted, otherwise the token is appended 
    if predicted_token == end_marker:
      break
    generated_string.append(predicted_id)
  
  # we finally return the decoded (and unmarked) string, excluding the input provided by the user
  return unmark(tokenizer.decode(generated_string)[len(input_string):])

In [None]:
generated_canto = generate(temperature_factor=3.0, verbose=True)

In [None]:
print(generated_canto)