# TensorFlow 2 Text generator on Dante Alighieri's Divine Comedy

Author: **Ivan Bongiorni**, [LinkedIn profile](https://www.linkedin.com/in/ivan-bongiorni-b8a583164/)


This Notebook contains a **text generator RNN** that was trained on the **Divina Commedia** (the *Divine Comedy*) by **Dante Alighieri**. This is a poem written at the beginning of the XII century. It's hard to explain what it represents for Italian culture: it's without any doubt the main pillar of our national literature, one of the building blocks of modern Italian language, and arguably the gratest poem ever. All modern representations of Hell, Purgatory and Heaven derive from this opera.

It's structure is extremely interesting: each verse is composed of 11 syllables, and its rhymes follow an **A-B-A-B** structure. Lot of pattern to be learned! 

In [1]:
import time
import re

import numpy as np
import pandas as pd

%tensorflow_version 2.x
import tensorflow as tf
print(tf.__version__)

from matplotlib import pyplot as plt

# Read file from Colab Notebook
#from google.colab import drive
#drive.mount('/content/drive')

2.3.0


In [2]:
current_path = " [...] /TF_2.0/NLP/text_generator/"

# Read the Divina Commedia
with open( "DivinaCommedia.txt", 'r', encoding="utf8") as file:
    divina_commedia = file.read()

# Replace rare characters
divina_commedia = divina_commedia.replace("ä", "a")
divina_commedia = divina_commedia.replace("é", "è")
divina_commedia = divina_commedia.replace("ë", "è")
divina_commedia = divina_commedia.replace("Ë", "E")
divina_commedia = divina_commedia.replace("ï", "i")
divina_commedia = divina_commedia.replace("Ï", "I")
divina_commedia = divina_commedia.replace("ó", "ò")
divina_commedia = divina_commedia.replace("ö", "o")
divina_commedia = divina_commedia.replace("ü", "u")

divina_commedia = divina_commedia.replace("(", "-")
divina_commedia = divina_commedia.replace(")", "-")
#divina_commedia = divina_commedia.replace("[", "")
#divina_commedia = divina_commedia.replace("]", "")

divina_commedia = re.sub(r'[0-9]+', '', divina_commedia)
divina_commedia = re.sub(r'\[.*\r?\n', '', divina_commedia)
divina_commedia = re.sub(r'.*Canto.*\r?\n', '', divina_commedia)

# divina_commedia = divina_commedia.replace(" \n", "\n")  # with this i lose the "terzina": results are not so exciting
#divina_commedia = divina_commedia.replace(" \n", "<eot>")  # end of terzina
#divina_commedia = divina_commedia.replace("\n", "<eor>")

In [None]:
print(divina_commedia[1:1000])

NFERNO



Nel mezzo del cammin di nostra vita
mi ritrovai per una selva oscura,
chè la diritta via era smarrita.

Ahi quanto a dir qual era è cosa dura
esta selva selvaggia e aspra e forte
che nel pensier rinova la paura! 

Tant'è amara che poco è più morte;
ma per trattar del ben ch'i' vi trovai,
dirò de l'altre cose ch'i' v' ho scorte. 

Io non so ben ridir com'i' v'intrai,
tant'era pien di sonno a quel punto
che la verace via abbandonai. 

Ma poi ch'i' fui al piè d'un colle giunto,
là dove terminava quella valle
che m'avea di paura il cor compunto, 

guardai in alto e vidi le sue spalle
vestite già de' raggi del pianeta
che mena dritto altrui per ogne calle. 

Allor fu la paura un poco queta,
che nel lago del cor m'era durata
la notte ch'i' passai con tanta pieta. 

E come quei che con lena affannata,
uscito fuor del pelago a la riva,
si volge a l'acqua perigliosa e guata, 

così l'animo mio, ch'ancor fuggiva,
si volse a retro a rimirar lo passo
che non lasciò già mai persona viva.


In [None]:
# Check lenght of text
print(len(divina_commedia))

534048


I will now extract the set of unique characters, and create a dictionary for vectorization of text. In order to feed the text into a Neural Network, I must turn each character into a number.

In [3]:
# Store unique characters into a dict with numerical encoding
unique_chars = list(set(divina_commedia))
unique_chars.sort()  # to make sure you get the same encoding at each run

# Store them in a dict, associated with a numerical index
char2idx = { char[1]: char[0] for char in enumerate(unique_chars) }


In [None]:
print(len(char2idx))

62


In [None]:
char2idx

{'\n': 0,
 ' ': 1,
 '!': 2,
 '"': 3,
 "'": 4,
 ',': 5,
 '-': 6,
 '.': 7,
 ':': 8,
 ';': 9,
 '?': 10,
 'A': 11,
 'B': 12,
 'C': 13,
 'D': 14,
 'E': 15,
 'F': 16,
 'G': 17,
 'H': 18,
 'I': 19,
 'L': 20,
 'M': 21,
 'N': 22,
 'O': 23,
 'P': 24,
 'Q': 25,
 'R': 26,
 'S': 27,
 'T': 28,
 'U': 29,
 'V': 30,
 'Z': 31,
 'a': 32,
 'b': 33,
 'c': 34,
 'd': 35,
 'e': 36,
 'f': 37,
 'g': 38,
 'h': 39,
 'i': 40,
 'j': 41,
 'l': 42,
 'm': 43,
 'n': 44,
 'o': 45,
 'p': 46,
 'q': 47,
 'r': 48,
 's': 49,
 't': 50,
 'u': 51,
 'v': 52,
 'x': 53,
 'y': 54,
 'z': 55,
 'È': 56,
 'à': 57,
 'è': 58,
 'ì': 59,
 'ò': 60,
 'ù': 61}

Once I have a dictionary that maps each characted with its respective numerical index, I can process the whole corpus.

In [4]:
def numerical_encoding(text, char_dict):
    """ Text to list of chars, to np.array of numerical idx """
    chars_list = [ char for char in text ]
    chars_list = [ char_dict[char] for char in chars_list ]
    chars_list = np.array(chars_list)
    return chars_list


In [None]:
# Let's see what the first line will look like
print("{}".format(divina_commedia[276:511]))
print("\nbecomes:")
print(numerical_encoding(divina_commedia[276:511], char2idx))

el ben ch'i' vi trovai,
dirò de l'altre cose ch'i' v' ho scorte. 

Io non so ben ridir com'i' v'intrai,
tant'era pien di sonno a quel punto
che la verace via abbandonai. 

Ma poi ch'i' fui al piè d'un colle giunto,
là dove terminava qu

becomes:
[36 42  1 33 36 44  1 34 39  4 40  4  1 52 40  1 50 48 45 52 32 40  5  0
 35 40 48 60  1 35 36  1 42  4 32 42 50 48 36  1 34 45 49 36  1 34 39  4
 40  4  1 52  4  1 39 45  1 49 34 45 48 50 36  7  1  0  0 19 45  1 44 45
 44  1 49 45  1 33 36 44  1 48 40 35 40 48  1 34 45 43  4 40  4  1 52  4
 40 44 50 48 32 40  5  0 50 32 44 50  4 36 48 32  1 46 40 36 44  1 35 40
  1 49 45 44 44 45  1 32  1 47 51 36 42  1 46 51 44 50 45  0 34 39 36  1
 42 32  1 52 36 48 32 34 36  1 52 40 32  1 32 33 33 32 44 35 45 44 32 40
  7  1  0  0 21 32  1 46 45 40  1 34 39  4 40  4  1 37 51 40  1 32 42  1
 46 40 58  1 35  4 51 44  1 34 45 42 42 36  1 38 40 51 44 50 45  5  0 42
 57  1 35 45 52 36  1 50 36 48 43 40 44 32 52 32  1 47 51]


## RNN dataprep

I need to generate a set of stacked input sequences. My goal is to train a Neural Network to find a mapping between an input sequence and an output sequence of equal length, in which each character is shifted left of one position.

For example, the first verse:

> Nel mezzo del cammin di nostra vita

would be translated in a train sequence as:

`Nel mezzo del cammin di nostra vit`

be associated with the target sequence:

`el mezzo del cammin di nostra vita`

The following function is a preparatory step for that. More generally, given a sequence:

```
A B C D E F G H I
```

and assuming input sequences of length 5, it will generate a matrix like:

```
A B C D E
B C D E F
C D E F G
D E F G H
E F G H I
```

I will save that matrix as it is in .csv format, to use it to train the Language Generator later.
The split between train and target sets will be as:

```
 Train:           Target:
                 
A B C D E        B C D E F
B C D E F        C D E F G
C D E F G        D E F G H
D E F G H        E F G H I
                 
```

Train and target sets are fundamentally the same matrix, with the train having the last row removed, and the target set having the first removed.

In [5]:
# Apply it on the whole Comedy
encoded_text = numerical_encoding(divina_commedia, char2idx)

In [None]:
print(encoded_text[311:600])

[42 50 48 36  1 34 45 49 36  1 34 39  4 40  4  1 52  4  1 39 45  1 49 34
 45 48 50 36  7  1  0  0 19 45  1 44 45 44  1 49 45  1 33 36 44  1 48 40
 35 40 48  1 34 45 43  4 40  4  1 52  4 40 44 50 48 32 40  5  0 50 32 44
 50  4 36 48 32  1 46 40 36 44  1 35 40  1 49 45 44 44 45  1 32  1 47 51
 36 42  1 46 51 44 50 45  0 34 39 36  1 42 32  1 52 36 48 32 34 36  1 52
 40 32  1 32 33 33 32 44 35 45 44 32 40  7  1  0  0 21 32  1 46 45 40  1
 34 39  4 40  4  1 37 51 40  1 32 42  1 46 40 58  1 35  4 51 44  1 34 45
 42 42 36  1 38 40 51 44 50 45  5  0 42 57  1 35 45 52 36  1 50 36 48 43
 40 44 32 52 32  1 47 51 36 42 42 32  1 52 32 42 42 36  0 34 39 36  1 43
  4 32 52 36 32  1 35 40  1 46 32 51 48 32  1 40 42  1 34 45 48  1 34 45
 43 46 51 44 50 45  5  1  0  0 38 51 32 48 35 32 40  1 40 44  1 32 42 50
 45  1 36  1 52 40 35 40  1 42 36  1 49 51 36  1 49 46 32 42 42 36  0 52
 36]


In [6]:
def get_text_matrix(sequence, len_input):
    
    # create empty matrix
    X = np.empty((len(sequence)-len_input, len_input))
    
    # fill each row/time window from input sequence
    for i in range(X.shape[0]):
        X[i,:] = sequence[i : i+len_input]
        
    return X

In [9]:
text_matrix = get_text_matrix(encoded_text, 100)

In [None]:
print(text_matrix.shape)

(533848, 200)


In [None]:
print("100th train sequence:\n")
print(text_matrix[ 100, : ])
print("\n\n100th target sequence:\n")
print(text_matrix[ 101, : ])
print("\n\n102th target sequence:\n")
print(text_matrix[ 102, : ])
print("\n\n115th target sequence:\n")
print(text_matrix[ 180, : ])

100th train sequence:

[36. 48. 32.  1. 49. 43. 32. 48. 48. 40. 50. 32.  7.  0.  0. 11. 39. 40.
  1. 47. 51. 32. 44. 50. 45.  1. 32.  1. 35. 40. 48.  1. 47. 51. 32. 42.
  1. 36. 48. 32.  1. 58.  1. 34. 45. 49. 32.  1. 35. 51. 48. 32.  0. 36.
 49. 50. 32.  1. 49. 36. 42. 52. 32.  1. 49. 36. 42. 52. 32. 38. 38. 40.
 32.  1. 36.  1. 32. 49. 46. 48. 32.  1. 36.  1. 37. 45. 48. 50. 36.  0.
 34. 39. 36.  1. 44. 36. 42.  1. 46. 36. 44. 49. 40. 36. 48.  1. 48. 40.
 44. 45. 52. 32.  1. 42. 32.  1. 46. 32. 51. 48. 32.  2.  1.  0.  0. 28.
 32. 44. 50.  4. 58.  1. 32. 43. 32. 48. 32.  1. 34. 39. 36.  1. 46. 45.
 34. 45.  1. 58.  1. 46. 40. 61.  1. 43. 45. 48. 50. 36.  9.  0. 43. 32.
  1. 46. 36. 48.  1. 50. 48. 32. 50. 50. 32. 48.  1. 35. 36. 42.  1. 33.
 36. 44.  1. 34. 39.  4. 40.  4.  1. 52. 40.  1. 50. 48. 45. 52. 32. 40.
  5.  0.]


100th target sequence:

[48. 32.  1. 49. 43. 32. 48. 48. 40. 50. 32.  7.  0.  0. 11. 39. 40.  1.
 47. 51. 32. 44. 50. 45.  1. 32.  1. 35. 40. 48.  1. 47. 51. 32. 

# Architecture

At this point, I can specify the RNN architecture with all its hyperparameters. An `Embedding()` layer will first learn a representation of each character; the sequence of chracters embedding will then be fed into an `LSTM()` layer, that will extract information from their sequence; `Dense()` layers at the end will produce the next character prediction.

The Network is structured to be fed with batches of data of fixed size.

In [10]:
# size of vocabulary
vocab_size = len(char2idx)

# size of mini batches during training
batch_size = 200  # 100


# size of training subset at each epoch
subset_size = batch_size * 100

# vector size of char embeddings
embedding_size = 300  # 250

len_input = 2048   # 200

hidden_size = 300  # for Dense() layers 250

In [11]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.activations import elu, relu, softmax

In [None]:
x = tf.constant([ 4, 40, 43, 43, 36, 38, 42, 40,  9,  1,  0,  0, 36,  1, 49, 59,  1, 34,
 45, 43, 36,  1, 35, 40,  1, 42, 36, 40,  1, 33, 36, 52, 52, 36,  1, 42,
 32,  1, 38, 48, 45, 44, 35, 31,  0, 35, 36,  1, 42, 36,  1, 46, 32, 42,
 46, 36, 33, 48, 36,  1, 43, 40, 36,  5,  1, 34, 45, 49, 59,  1, 43, 40,
  1, 46, 32, 48, 52, 36,  0, 35, 40,  1, 49, 51, 32,  1, 42, 51, 44, 38,
 39, 36, 55, 55, 32,  1, 35, 40, 52, 36, 44, 51, 50, 32,  1, 50, 45, 44,
 35, 32,  7,  1,  0,  0, 24, 45, 40,  5,  1, 34, 45, 43, 36,  1, 38, 36,
 44, 50, 36,  1, 49, 50, 32, 50, 32,  1, 49, 45, 50, 50, 45,  1, 42, 32,
 48, 52, 36,  5,  0, 34, 39, 36,  1, 46, 32, 48, 36,  1, 32, 42, 50, 48,
 45,  1, 34, 39, 36,  1, 46, 48, 40, 43, 32,  5,  1, 49, 36,  1, 49, 40,
  1, 49, 52, 36, 49, 50, 36,  0, 42, 32,  1, 49, 36, 43, 33, 40, 32, 44,
 55, 32], dtype='float32', shape=(1, 200))

y = tf.constant([40, 43, 43, 36, 38, 42, 40,  9,  1,  0,  0, 36,  1, 49, 59,  1, 34, 45,
 43, 36,  1, 35, 40,  1, 42, 36, 40,  1, 33, 36, 52, 52, 36,  1, 42, 32,
  1, 38, 48, 45, 44, 35, 32,  0, 35, 36,  1, 42, 36,  1, 46, 32, 42, 46,
 36, 33, 48, 36,  1, 43, 40, 36,  5,  1, 34, 45, 49, 59,  1, 43, 40,  1,
 46, 32, 48, 52, 36,  0, 35, 40,  1, 49, 51, 32,  1, 42, 51, 44, 38, 39,
 36, 55, 55, 32,  1, 35, 40, 52, 36, 44, 51, 50, 32,  1, 50, 45, 44, 35,
 32,  7,  1,  0,  0, 24, 45, 40,  5,  1, 34, 45, 43, 36,  1, 38, 36, 44,
 50, 36,  1, 49, 50, 32, 50, 32,  1, 49, 45, 50, 50, 45,  1, 42, 32, 48,
 52, 36,  5,  0, 34, 39, 36,  1, 46, 32, 48, 36,  1, 32, 42, 50, 48, 45,
  1, 34, 39, 36,  1, 46, 48, 40, 43, 32,  5,  1, 49, 36,  1, 49, 40,  1,
 49, 52, 36, 49, 50, 36,  0, 42, 32,  1, 49, 36, 43, 33, 40, 32, 44, 55,
 32,  1], dtype='float32', shape=(1, 200))


get_custom_loss(x, y)

<tf.Tensor: shape=(), dtype=float64, numpy=0.3333333333333333>

In [None]:
[[4, 40, 43, 43, 36, 38, 42, 40, 9, 1], 
 [36, 1, 49, 59, 1, 34, 45, 43, 36, 1, 35, 40, 1, 42, 36, 40, 1, 33, 36, 52, 52, 36, 1, 42, 32, 1, 38, 48, 45, 44, 35, 32], 
 [35, 36, 1, 42, 36, 1, 46, 32, 42, 46, 36, 33, 48, 36, 1, 43, 40, 36, 5, 1, 34, 45, 49, 59, 1, 43, 40, 1, 46, 32, 48, 52, 36], 
 [35, 40, 1, 49, 51, 32, 1, 42, 51, 44, 38, 39, 36, 55, 55, 32, 1, 35, 40, 52, 36, 44, 51, 50, 32, 1, 50, 45, 44, 35, 32, 7, 1], 
 [24, 45, 40, 5, 1, 34, 45, 43, 36, 1, 38, 36, 44, 50, 36, 1, 49, 50, 32, 50, 32, 1, 49, 45, 50, 50, 45, 1, 42, 32, 48, 52, 36, 5], 
 [34, 39, 36, 1, 46, 32, 48, 36, 1, 32, 42, 50, 48, 45, 1, 34, 39, 36, 1, 46, 48, 40, 43, 32, 5, 1, 49, 36, 1, 49, 40, 1, 49, 52, 36, 49, 50, 36]]


In [None]:
idx2char = { v: k for k, v in char2idx.items() }
text_generated = ""
for x in [[4, 40, 43, 43, 36, 38, 42, 40, 9, 1], 
          [36, 1, 49, 59, 1, 34, 45, 43, 36, 1, 35, 40, 1, 42, 36, 40, 1, 33, 36, 52, 52, 36, 1, 42, 32, 1, 38, 48, 45, 44, 35, 32], 
          [35, 36, 1, 42, 36, 1, 46, 32, 42, 46, 36, 33, 48, 36, 1, 43, 40, 36, 5, 1, 34, 45, 49, 59, 1, 43, 40, 1, 46, 32, 48, 52, 36], 
          [35, 40, 1, 49, 51, 32, 1, 42, 51, 44, 38, 39, 36, 55, 55, 32, 1, 35, 40, 52, 36, 44, 51, 50, 32, 1, 50, 45, 44, 35, 32, 7, 1], 
          [24, 45, 40, 5, 1, 34, 45, 43, 36, 1, 38, 36, 44, 50, 36, 1, 49, 50, 32, 50, 32, 1, 49, 45, 50, 50, 45, 1, 42, 32, 48, 52, 36, 5], 
          [34, 39, 36, 1, 46, 32, 48, 36, 1, 32, 42, 50, 48, 45, 1, 34, 39, 36, 1, 46, 48, 40, 43, 32, 5, 1, 49, 36, 1, 49, 40, 1, 49, 52, 36, 49, 50, 36]]:
  for predicted_id in x:
    text_generated += idx2char[predicted_id]
  text_generated += '\n'
print(text_generated)

In [12]:
'''
EXPERIMENT
CUSTOM LOSS
'''
def divide_versi(y):
  doppiozero = False

  y_divided = [[]]
  for ly in y:
    ly = int(ly)

    # devo pulire la lista dai segni di punteggiatura, 
    # in chartoidx significa i numeri da 1 a 10 compresi.
    if ly in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]:  #non posso perchè con i Tensor non funziona
    # if ly is 1 or ly is 2 or ly is 3 or ly is 4 or ly is 5 or ly is 6 or ly is 7 \
    #    or ly is 8 or ly is 9 or ly is 10:
        continue
    else:
      # se è zero vuol dire \n quindi aggiungo una nuova riga
      if ly is 0:
        if not doppiozero:
          y_divided.append([])
        doppiozero = True
        continue

      y_divided[-1].append(ly)
      doppiozero = False

  if y_divided is not []:
    if y[-1] != 0:
      # dato che l'ultima riga non finisce con 0 vuol dire che è incompleta e la rimuovo
      y_divided.pop()

    if len(y_divided[0]) < 4:
      # se la prima riga è minore di 4 non posso farci nulla quindi la elimino
      y_divided.pop(0)

  return y_divided

def rhymes_extractor(y_divided):
  # estraggo lo schema di rime da y
  rhymes = []
  for i in range(len(y_divided)):
    # con la fine del verso (ultime due lettere) controllo se le altre righe 
    # finiscono con le stesse lettere
    vy = y_divided[i]

    last_word_1 = vy[-2:]

    # ABA BCB CDC

    # devo controllare se la riga i fa rima con la riga i+2 
    if i+2 < len(y_divided):
      next_vy = y_divided[i+2]
      # print(vy[-2:])
      # print(next_vy[-2:])
      if last_word_1 == next_vy[-2:]:
        rhymes.append((i, i+2))
    
    if i+4 < len(y_divided):
      # print(vy[-2:])
      # print(next_vy[-2:])
      next_vy = y_divided[i+4]
      if last_word_1 == next_vy[-2:]:
        rhymes.append((i, i+4))

  # print(rhymes)
  return rhymes


def get_custom_loss(x_batch, y_batch):
  summed_custom_loss = 0
  # x_batch ha lo shape (200, 200) quindi ho 200 vettori con 200 lettere ognuno
  # le 200 lettere sono le feature

  x_bin_tot = np.ones(1)
  y_bin_tot = np.ones(1)

  # scorro i 200 vettori
  # for (x, y) in zip(x_batch, y_batch):  # Non funziona con i tensori
  for v in range(len(x_batch)):
    x = x_batch[v]
    y = y_batch[v]

    # dividio il vettore in versi utili
    x_divided = divide_versi(x)
    y_divided = divide_versi(y)

    # assicuro che il numero di versi siano uguali
    # !!! non posso perchè il generato può avere errori e quindi, per esempio,
    # avere più o meno versi
    # assert len(x_divided) == len(y_divided)

    # estraggo lo schema di rime
    x_rhymes = rhymes_extractor(x_divided)
    y_rhymes = rhymes_extractor(y_divided)

    # mi ritorna una lista con il numero delle righe che fanno rima
    # Esempio: [(1,3), (2,4)] significa che le righe 1 e 3 fanno rima e che le 
    # righe 2 e 4 pure 
    # TODO se avessimo due terzine intere si potrebbe valutare rime a 3 righe [aBaBcB]

    if x_rhymes == []:
      return 0.9  # max custom loss
    
    # se lo schema di rime del generato e di dante è uguale stop
    if x_rhymes == y_rhymes:
      return -0.2

    # creo un vettore di 1 per la y perchè le rime ci sono sempre
    y_bin = np.ones(len(y_rhymes))
    # creo un vettore di 0 per le rime generate, metterò 1 se la rima 
    # corrispondente è valida (cioè in dante)
    x_bin = np.zeros(len(y_rhymes))

    # se la rima generata è nelle rime originali di Dante allora la segno come valida
    for i in range(len(y_rhymes)):
      if y_rhymes[i] in x_rhymes:
        x_bin[i] = 1

    # concateno i vettori con l'encoding delle rime
    x_bin_tot = np.concatenate((x_bin_tot, x_bin))
    y_bin_tot = np.concatenate((y_bin_tot, y_bin))

  # MSE sui vettori
  return tf.keras.losses.mean_squared_error(y_bin_tot, x_bin_tot)

# NEW VERSION
# creo un vettore con le rime di y reale e di y generato
# Ex: in y reale se ho ABABC il vettore è [1,2,1,2,3] con o zero ad indicare nulla
# per y generato devo creare un vettore di lunghezza uguale per poi valutarlo con una sparse_crossentropy
# problema: non avrà mai le stesse righe


In [None]:
'''
EXPERIMENT
MODEL
'''
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Attention, Flatten, Input
from tensorflow.keras.activations import elu, relu, softmax
from tensorflow.keras.metrics import categorical_accuracy, sparse_categorical_crossentropy, categorical_crossentropy
# Define custom training utilities that are widely used for language modelling

n_epochs = 100

learning_rate = 0.001  # 0.0001
optimizer = tf.keras.optimizers.Adamax(learning_rate=learning_rate)  # Adam

def loss(y_true, y_pred):
    """Calculates categorical crossentropy as loss"""
    return categorical_crossentropy(y_true=y_true, y_pred=y_pred)


def perplexity(labels, logits):
    """Calculates perplexity metric = 2^(entropy) or e^(entropy)"""
    return pow(2, loss(y_true=labels, y_pred=logits))

# Input Layer
X = Input(shape=(None, ), batch_size=batch_size)  # 100 is the number of features

# Word-Embedding Layer
embedded = Embedding(vocab_size, embedding_size, 
                     batch_input_shape=(batch_size, None), 
                     embeddings_initializer=tf.keras.initializers.GlorotNormal(), 
                     embeddings_regularizer=tf.keras.regularizers.L1L2()
                     )(X)
embedded = Dense(embedding_size, relu)(embedded)
encoder_output, hidden_state, cell_state = LSTM(units=1024,
                                                         return_sequences=True,
                                                         return_state=True)(embedded)
#attention_input = [encoder_output, hidden_state]
encoder_output = Dropout(0.3)(encoder_output)
encoder_output = Dense(embedding_size, activation='relu')(encoder_output)

#encoder_output = Attention()(attention_input, training=True)

initial_state = [hidden_state, cell_state]

# initial_state_double = [tf.concat([hidden_state, hidden_state], 1), tf.concat([hidden_state, hidden_state], 1)]
encoder_output, hidden_state, cell_state = LSTM(units=1024,
                                                         return_sequences=True,
                                                         return_state=True)(encoder_output, initial_state=initial_state)
encoder_output = Dropout(0.3)(encoder_output)
#encoder_output = Flatten()(encoder_output)
encoder_output = Dense(hidden_size, activation='relu')(encoder_output)
# Prediction Layer
Y = Dense(units=vocab_size)(encoder_output)

# Compile model
model = Model(inputs=X, outputs=Y)
model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits = True), optimizer='adam', metrics=[perplexity, sparse_categorical_crossentropy])
print(model.summary())

# This is an Autograph function
# its decorator makes it a TF op - i.e. much faster
# @tf.function
def train_on_batch(x, y):
    with tf.GradientTape() as tape:
        current_loss = tf.reduce_mean(
            tf.keras.losses.sparse_categorical_crossentropy(
                y, model(x), from_logits = True)
            + get_custom_loss(x, y)
            )
    gradients = tape.gradient(current_loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return current_loss


loss_history = []

for epoch in range(n_epochs):
    start = time.time()
    
    # Take subsets of train and target
    sample = np.random.randint(0, text_matrix.shape[0]-1, subset_size)
    sample_train = text_matrix[ sample , : ]
    sample_target = text_matrix[ sample+1 , : ]


    #sample = list(range(subset_size*epoch, subset_size*(epoch+1)))
    #sample_train = text_matrix[ sample , : ]
    #next_sample = [x+1 for x in sample]
    #sample_target = text_matrix[ next_sample , : ]
    
    for iteration in range(sample_train.shape[0] // batch_size):
        take = iteration * batch_size
        x = sample_train[ take:take+batch_size , : ]
        y = sample_target[ take:take+batch_size , : ]

        current_loss = train_on_batch(x, y)
        loss_history.append(current_loss)
    
    print("{}.  \t  Loss: {}  \t  Time: {}sec/epoch".format(
        epoch+1, current_loss.numpy(), round(time.time()-start, 2)))
    




Model: "functional_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(200, None)]        0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (200, None, 300)     18600       input_1[0][0]                    
__________________________________________________________________________________________________
dense (Dense)                   (200, None, 300)     90300       embedding[0][0]                  
__________________________________________________________________________________________________
lstm (LSTM)                     [(200, None, 1024),  5427200     dense[0][0]                      
_______________________________________________________________________________________

In [None]:
model.save("model_custom_loss_00.h5")

In [None]:
'''
EXPERIMENT
GENERATOR
'''

# Input Layer
X = Input(shape=(None, ), batch_size=1)  # 100 is the number of features

# Word-Embedding Layer
embedded = Embedding(vocab_size, embedding_size)(X)
embedded = Dense(embedding_size, relu)(embedded)
encoder_output, hidden_state, cell_state = LSTM(units=1024,
                                                         return_sequences=True,
                                                         return_state=True,
                                              stateful=True)(embedded)
#attention_input = [encoder_output, hidden_state]

encoder_output = Dropout(0.3)(encoder_output)

encoder_output = Dense(embedding_size, activation='relu')(encoder_output)

# encoder_output = Attention()(attention_input, training=True)
initial_state = [hidden_state,  cell_state]

# initial_state_double = [tf.concat([hidden_state, hidden_state], 1), tf.concat([hidden_state, hidden_state], 1)]
encoder_output, hidden_state, cell_state = LSTM(units=1024,
                                                         return_sequences=True,
                                                         return_state=True,
                                                stateful=True)(encoder_output, initial_state=initial_state)
#encoder_output = Flatten()(encoder_output)
encoder_output = Dropout(0.3)(encoder_output)
encoder_output = Dense(hidden_size, activation='relu')(encoder_output)
# Prediction Layer
Y = Dense(units=vocab_size)(encoder_output)

# Compile model
generator = Model(inputs=X, outputs=Y)
generator.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits = True), optimizer='adam', metrics=[perplexity, sparse_categorical_crossentropy])
print(model.summary())


# Import trained weights from model to generator
generator.set_weights(model.get_weights())

def generate_text(start_string, num_generate = 1000, temperature = 1.0):
    
    # Vectorize input string
    input_eval = [char2idx[s] for s in start_string]  
    input_eval = tf.expand_dims(input_eval, 0)
    
    text_generated = [] # List to append predicted chars 
    
    idx2char = { v: k for k, v in char2idx.items() }  # invert char-index mapping
    
    generator.reset_states()
    
    for i in range(num_generate):
        predictions = generator(input_eval)
        predictions = tf.squeeze(predictions, 0)
        
        # sample next char based on distribution and temperature
        predictions = predictions / temperature
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()
        
        input_eval = tf.expand_dims([predicted_id], 0)

        text_generated.append(idx2char[predicted_id])
        
    return (start_string + ''.join(text_generated))


# Let's feed the first lines:
start_string = """
Nel mezzo del cammin di nostra vita
mi ritrovai per una selva oscura,
chè la diritta via era smarrita.

"""

for t in [0.1, 0.5, 1.0, 1.5, 2]:
    print("####### TEXT GENERATION - temperature = {}\n".format(t))
    print(generate_text(start_string = start_string, num_generate = 1000, temperature = t))
    print("\n\n\n")

In [None]:
RNN = Sequential([
    Embedding(vocab_size, embedding_size,
              batch_input_shape=(batch_size, None)),
    Dense(embedding_size, activation = relu),
    
    LSTM(len_input, return_sequences = True),

    Dropout(0.33),
    
    Dense(hidden_size, activation = relu), 

    Dropout(0.33),

    LSTM(len_input, return_sequences = True),
    
    Dense(vocab_size)
])

RNN.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (200, None, 300)          18600     
_________________________________________________________________
dense (Dense)                (200, None, 300)          90300     
_________________________________________________________________
lstm (LSTM)                  (200, None, 2048)         19243008  
_________________________________________________________________
dropout (Dropout)            (200, None, 2048)         0         
_________________________________________________________________
dense_1 (Dense)              (200, None, 300)          614700    
_________________________________________________________________
dropout_1 (Dropout)          (200, None, 300)          0         
_________________________________________________________________
lstm_1 (LSTM)                (200, None, 2048)         1

In [None]:
n_epochs = 150

learning_rate = 0.001  # 0.0001
optimizer = tf.keras.optimizers.Adamax(learning_rate = learning_rate)  # Adam

In [None]:
# This is an Autograph function
# its decorator makes it a TF op - i.e. much faster
# @tf.function
def train_on_batch(x, y):
    with  tf.GradientTape() as tape:
        # TODO: implementare la custom loss prendendo le rime da Y 
        # e controllando che schema di rime c'è
        # Avendo lo schema di rime controllare la X e dare un voto sugli ultimi
        # 3 caratteri. Dando un punteggio coerente con il numero di lettere 
        # che fanno rima, valutare un punteggio negativo

        current_loss = tf.reduce_mean(
            tf.keras.losses.sparse_categorical_crossentropy(
                y, RNN(x), from_logits = True) 
            + get_custom_loss(x, y))
    gradients = tape.gradient(current_loss, RNN.trainable_variables)
    optimizer.apply_gradients(zip(gradients, RNN.trainable_variables))
    return current_loss

In [None]:
loss_history = []

for epoch in range(n_epochs):
    start = time.time()
    
    # Take subsets of train and target
    sample = np.random.randint(0, text_matrix.shape[0]-1, subset_size)
    sample_train = text_matrix[ sample , : ]
    sample_target = text_matrix[ sample+1 , : ]

    # NEW SEQUENTIAL MODE
    # sample = list(range(subset_size*epoch, subset_size*(epoch+1)))
    # sample_train = text_matrix[ sample , : ]
    # next_sample = [x+1 for x in sample]
    # sample_target = text_matrix[ next_sample , : ]
    
    for iteration in range(sample_train.shape[0] // batch_size):
        take = iteration * batch_size
        x = sample_train[ take:take+batch_size , : ]
        y = sample_target[ take:take+batch_size , : ]

        current_loss = train_on_batch(x, y)
        loss_history.append(current_loss)
    
    print("{}.  \t  Loss: {}  \t  Time: {}ss".format(
        epoch+1, current_loss.numpy(), round(time.time()-start, 2)))

1.  	  Loss: 2.456529378890991  	  Time: 367.43ss
2.  	  Loss: 2.0141706466674805  	  Time: 373.03ss
3.  	  Loss: 1.8379888534545898  	  Time: 374.37ss
4.  	  Loss: 1.7415854930877686  	  Time: 372.65ss
5.  	  Loss: 1.6536414623260498  	  Time: 369.8ss
6.  	  Loss: 1.5538181066513062  	  Time: 373.9ss
7.  	  Loss: 1.4830265045166016  	  Time: 371.92ss
8.  	  Loss: 1.4170233011245728  	  Time: 370.39ss
9.  	  Loss: 1.3489421606063843  	  Time: 372.4ss
10.  	  Loss: 1.2993950843811035  	  Time: 372.0ss
11.  	  Loss: 1.2515488862991333  	  Time: 373.58ss
12.  	  Loss: 1.1995846033096313  	  Time: 370.65ss
13.  	  Loss: 1.1460455656051636  	  Time: 372.25ss
14.  	  Loss: 1.0806668996810913  	  Time: 373.65ss
15.  	  Loss: 1.0468740463256836  	  Time: 370.84ss
16.  	  Loss: 0.9778279066085815  	  Time: 370.92ss
17.  	  Loss: 0.9033732414245605  	  Time: 370.41ss
18.  	  Loss: 0.8289470672607422  	  Time: 374.46ss
19.  	  Loss: 0.7542098164558411  	  Time: 373.84ss
20.  	  Loss: 0.6742076873

In [None]:
plt.plot(loss_history)
plt.title("Training Loss")
plt.show()

In [None]:
RNN.save( "text_generator_RNN_03.h5")

# Text Generation

At this point, let's check how the model generates text. In order to do it, I must make some changes to my RNN architecture above.

First, I must change the fixed batch size. After training, I want to feed just one sentence into my Network to make it continue the character sequence. I will feed a string into the model, make it predict the next character, update the input sequence, and repeat the process until a long generated text is obtained. Because of this, the succession of input sequences is now different from training session, in which portions of text were sampled randomly. I now have to set `stateufl = True` in the `LSTM()` layer, so that each LSTM cell will keep in memory the internal state from the previous sequence. With this I hope the model will better remember sequential information while generating text.

I will instantiate a new `generator` RNN with these new features, and transfer the trained weights of my `RNN` into it.

In [None]:
generator = Sequential([
   Embedding(vocab_size, embedding_size,
              batch_input_shape=(1, None)),
    Dense(embedding_size, activation = relu),
    
    LSTM(len_input, return_sequences = True, stateful=1),

    Dropout(0.3),
    
    Dense(hidden_size, activation = relu), 

    Dropout(0.3),

    LSTM(len_input, return_sequences = True, stateful=1),
    
    Dense(vocab_size)
])

generator.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (1, None, 300)            18600     
_________________________________________________________________
dense_9 (Dense)              (1, None, 300)            90300     
_________________________________________________________________
lstm_6 (LSTM)                (1, None, 1024)           5427200   
_________________________________________________________________
dropout_6 (Dropout)          (1, None, 1024)           0         
_________________________________________________________________
dense_10 (Dense)             (1, None, 300)            307500    
_________________________________________________________________
dropout_7 (Dropout)          (1, None, 300)            0         
_________________________________________________________________
lstm_7 (LSTM)                (1, None, 1024)          

In [None]:
# Import trained weights from RNN to generator
generator.set_weights(RNN.get_weights())

In [None]:
def generate_text(start_string, num_generate = 1000, temperature = 1.0):
    
    # Vectorize input string
    input_eval = [char2idx[s] for s in start_string]  
    input_eval = tf.expand_dims(input_eval, 0)
    
    text_generated = [] # List to append predicted chars 
    
    idx2char = { v: k for k, v in char2idx.items() }  # invert char-index mapping
    
    generator.reset_states()
    
    for i in range(num_generate):
        predictions = generator(input_eval)
        predictions = tf.squeeze(predictions, 0)
        
        # sample next char based on distribution and temperature
        predictions = predictions / temperature
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()
        
        input_eval = tf.expand_dims([predicted_id], 0)

        text_generated.append(idx2char[predicted_id])
        
    return (start_string + ''.join(text_generated))


(This function is based on [this tutorial](https://www.tensorflow.org/tutorials/text/text_generation).)

In [None]:
# Let's feed the first lines:
start_string = """
Nel mezzo del cammin di nostra vita
mi ritrovai per una selva oscura,
chè la diritta via era smarrita.

"""

for t in [0.1, 0.5, 1.0, 1.5, 2]:
    print("####### TEXT GENERATION - temperature = {}\n".format(t))
    print(generate_text(start_string = start_string, num_generate = 1000, temperature = 1.0))
    print("\n\n\n")

####### TEXT GENERATION - temperature = 0.1


Nel mezzo del cammin di nostra vita
mi ritrovai per una selva oscura,
chè la diritta via era smarrita.



Ora questo che tenna con esso pianta,
fonno de li altri rispiose poscegno,
che idronemo e menera a fruttorna
nei perchè tanta inverno, tutte 'l puoi".



in vendi ' pietra il tanto parte era,
primandono entro, evei nè me si stesse
buono alqual ch'ogne non aperto leto
ch'el tanto inghiecco quell'ca divina. 

Quasi alloral lorar com'io duca;
e riguarrille, duo per malco a brace
me che già veglia, quanto purto colpa
fora d'un che, profodo scallegio
ch'io disse: "E questo fiate menore. 

Centor Berve fui che tant' io luoto:
chè, quando le cagion che s'accriga una;
la cima prei del vide, ed esse, anda
sodra, andarata drite, e il trascosto,
il parlarento e 'l pèotò come avveggia,
tal che d'angella sua leggia condetta,
sopr'alla voi di mezzo solo è benno. 

Io mi fua lantia, inverabon pria 'ntrima
di ruoi senti
la vente mio menando me ormi?",


The best generation is, IMHO, the one with `temperature = 1.5`. The sentences of course do not make sense, but it's amazing that such a simple model could achieve similar results, and generate absolutely Dante-esque text with just ~40 minutes of GPU training.

Many things could be done at this point:



*   Try fancier architectures, such as seq2seq. (I must say though that stacked RNNs didn't provide better results during prototyping.)
*   Try Attention models.
*   Longer training.
*   Adversarial training.

I'll try a lot of these techniques, alone and combined. My goal is to make a model that can learn the amazing structure of syllables and rhymes of the whole Comedy.



# NEW IDEAS

#### Training:
*   Cross validation
*   Insert Rhyme as feature to learn as haiku
*   Use syllable as input and not word
*   Different training on different dataset
* Use categorical_crossentropy instead of sparse_ but with one-hot encoded inputs
* Symbols for explicit start and end terzina
* training as classificator for structure: like "these two world are rhymes" or "this is a endecasillable and this not" or "this is a terzina and this not" then generation
* use dropout 
* use two lstm
* 

#### Presentation
* graphs over the vocabulary like distribution of used words




In [None]:
RNN = Sequential([
    Embedding(vocab_size, embedding_size,
              batch_input_shape=(batch_size, None)),
              
    Dense(embedding_size, activation = relu),
    
    LSTM(len_input, return_sequences = True),

    Dropout(0.3),
    
    Dense(hidden_size, activation = relu), 

    Dropout(0.3),
    
    Dense(vocab_size)
])

RNN.summary()

generator = Sequential([
    Embedding(vocab_size, embedding_size,batch_input_shape=(1, None)),

    Dense(embedding_size, activation = relu),
    
    LSTM(len_input, return_sequences = True, stateful=True),

    Dropout(0.3),
    
    Dense(hidden_size, activation = relu), 

    Dropout(0.3),
    
    Dense(vocab_size)


])

generator.summary()