# DanteGen: a tercet generator

This project builds a character-level generative model trained on Dante's Divine Comedy, loosely following what A. Karpathy [did](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) with Shakespeare's works.
I follow, step by step, what J. Howard [did](https://github.com/fastai/courses/blob/master/deeplearning1/nbs/char-rnn.ipynb) in one of his great fast.ai lessons.

## Getting the data

In [1]:
import numpy as np
from keras.models import Sequential, Model
from keras.layers import Input, Embedding, Reshape, merge, LSTM, Bidirectional
from keras.layers import TimeDistributed, Activation, SimpleRNN, GRU
from keras.layers.core import Flatten, Dense, Dropout, Lambda
from keras.optimizers import SGD, RMSprop, Adam
from keras.utils.data_utils import get_file
from numpy.random import random, permutation, randn, normal, uniform, choice

Using TensorFlow backend.


Let's start by getting the text from Project Gutenberg at the link below and stripping the header and footer.

The whole text is converted to lowercase to keep the character vocabulary small. It would be more interesting to keep the uppercase characters and see whether the model learns to use them correctly. For the same reason, vowels with an umlaut (ä, ë, ...) are converted to unaccented vowels, and the same replacement table also turns hyphens into dashes.

In [2]:
#path = get_file('divcomm.txt', origin="http://www.gutenberg.org/files/1012/1012-0.txt")
path = '../data/raw/divcomm.txt'
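with open(path, encoding='utf-8') as f:
    text = f.read().lower()
# Hard-coded offsets that cut the Project Gutenberg header and footer from this particular file
text = text[932:-19658]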
umlaut = {'ä':'a','ë':'e','ï':'i','ö':'o','ü':'u','-':'—'}
for old, new in umlaut.items():
    text = text.replace(old, new)
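
The slice offsets above only fit this exact copy of the file. As a rough alternative, the sketch below cuts the text at the `*** START OF ...` and `*** END OF ...` marker lines that Project Gutenberg files usually carry. The marker wording, the `strip_gutenberg` helper and the sample string are assumptions made for illustration, and the result may still include front matter that the hand-picked offsets removed. It has to run before `.lower()`, since the markers are uppercase.

```python
import re

# Assumed marker wording; check it against the actual file before relying on it.
START_RE = re.compile(r"\*\*\* ?START OF (THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*", re.S)
END_RE = re.compile(r"\*\*\* ?END OF (THE|THIS) PROJECT GUTENBERG EBOOK")


def strip_gutenberg(raw):
    """Return the text between the start and end markers (hypothetical helper)."""
    start = START_RE.search(raw)
    end = END_RE.search(raw)
    begin = start.end() if start else 0       # fall back to the whole text
    stop = end.start() if end else len(raw)
    return raw[begin:stop].strip()


sample = (
    "Some header\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK X ***\n"
    "nel mezzo\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK X ***\n"
    "license text"
)
body = strip_gutenberg(sample)
print(body)
```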


Let's print the beginning and end of the corpus, as well as the total length.

In [3]:
print(text[:500])

la divina commedia
  di dante alighieri





  inferno




  inferno • canto i


  nel mezzo del cammin di nostra vita
  mi ritrovai per una selva oscura,
  ché la diritta via era smarrita.

  ahi quanto a dir qual era è cosa dura
  esta selva selvaggia e aspra e forte
  che nel pensier rinova la paura!

  tant’ è amara che poco è più morte;
  ma per trattar del ben ch’i’ vi trovai,
  dirò de l’altre cose ch’i’ v’ho scorte.

  io non so ben ridir com’ i’ v’intrai,
  tant’ era pien di sonno a que


In [4]:
print(text[-500:])

mètra che tutto s’affige
  per misurar lo cerchio, e non ritrova,
  pensando, quel principio ond’ elli indige,

  tal era io a quella vista nova:
  veder voleva come si convenne
  l’imago al cerchio e come vi s’indova;

  ma non eran da ciò le proprie penne:
  se non che la mia mente fu percossa
  da un fulgore in che sua voglia venne.

  a l’alta fantasia qui mancò possa;
  ma già volgeva il mio disio e ’l velle,
  sì come rota ch’igualmente è mossa,

  l’amor che move il sole e l’altre stelle.


In [5]:
len(text)

561094

Let's now get the set of all characters in the text and print them out. We'll add a null character for padding.

In [6]:
chars = sorted(list(set(text)))
chars.insert(0, "\0")
vocab_size = len(chars)

In [7]:
vocab_size

50

In [8]:
"".join(chars)

'\x00\n !(),.:;?abcdefghijlmnopqrstuvxyz«»àèéìòóù—‘’“”•'

Now we have to assign an index to each character.

In [9]:
char_indices = {c: i for i, c in enumerate(chars)}
indices_char = {i: c for i, c in enumerate(chars)}

In [10]:
idx = [char_indices[c] for c in text]

Now idx contains the whole Divine Comedy text encoded with the indices we've just created.

In [11]:
idx[:18]  # "la divina commedia"

[21, 11, 2, 14, 19, 31, 19, 23, 11, 2, 13, 24, 22, 22, 15, 14, 19, 11]

In [12]:
print(''.join(indices_char[i] for i in idx[:18]))

la divina commedia


## Creating the model

Our model reads a window of n characters and, at every position, tries to predict the character that follows. Let's take n=40.

We'll extract every 40-character sequence in the text (sentences) and pair each one with the same sequence shifted forward by one character (next_chars).

In [13]:
maxlen = 40
sentences = []
next_chars = []
for i in range(0, len(idx) - maxlen+1):
    sentences.append(idx[i: i + maxlen])
    next_chars.append(idx[i+1: i+maxlen+1])
print('nb sequences:', len(sentences))

nb sequences: 561055


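Now we cast them to NumPy arrays and drop the last two sequences. The final target window runs past the end of the text and is one character short, so it cannot be stacked with the others.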

In [14]:
sentences = np.array(sentences[:-2])
next_chars = np.array(next_chars[:-2])

In [15]:
sentences.shape, next_chars.shape

((561053, 40), (561053, 40))
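
The loop above can also be written without Python-level iteration. The sketch below is one possible vectorized variant, assuming NumPy 1.20 or later for `sliding_window_view`; the `make_windows` helper and the toy data are illustrative only. It takes windows of `maxlen + 1` values and splits each into an input and its one-step-shifted target, so every target has full length. On the real corpus it would return `len(idx) - maxlen` pairs, one more than the cells above keep.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def make_windows(encoded, maxlen):
    """Build (inputs, targets) pairs where targets are inputs shifted by one step."""
    encoded = np.asarray(encoded)
    # Each row holds maxlen + 1 consecutive values.
    windows = sliding_window_view(encoded, maxlen + 1)
    inputs = windows[:, :-1]   # first maxlen values
    targets = windows[:, 1:]   # same window shifted forward by one
    return inputs, targets


toy = list(range(10))
toy_inputs, toy_targets = make_windows(toy, maxlen=4)
print(toy_inputs.shape, toy_targets.shape)
print(toy_inputs[0], toy_targets[0])
```

The returned arrays are read-only views on the original data, so copy them with `np.array(...)` if they need to be modified.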

In [16]:
n_fac = 24  # size of each character embedding vector
batch_size = 64
LSTM_units = 512

Next we define our model: an embedding layer, followed by two stacked LSTM layers (each with dropout) and a time-distributed fully connected layer with a softmax that yields a probability for every character at each position.

In [None]:
model = Sequential([
    Embedding(vocab_size, n_fac, input_length=maxlen),
    LSTM(LSTM_units, return_sequences=True, dropout=0.2, recurrent_dropout=0.2),
    Dropout(0.2),
    LSTM(LSTM_units, return_sequences=True, dropout=0.2, recurrent_dropout=0.2),
    Dropout(0.2),
    TimeDistributed(Dense(vocab_size)),
    Activation('softmax')
])

In [None]:
model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam())
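
The `sparse_categorical_crossentropy` loss takes integer character indices as targets instead of one-hot vectors, which is why the training calls later pass `np.expand_dims(next_chars, -1)`. The NumPy sketch below shows what the loss computes; it illustrates the formula and is not Keras's own implementation, and all names in it are made up for the example. It also gives a useful baseline: a model that assigns equal probability to all 50 characters scores ln(50), about 3.91.

```python
import numpy as np


def sparse_categorical_crossentropy(targets, probs):
    """Mean negative log-probability assigned to the true class.

    targets: integer array of shape (batch, time)
    probs:   float array of shape (batch, time, vocab), rows summing to 1
    """
    picked = np.take_along_axis(probs, targets[..., np.newaxis], axis=-1)
    return float(-np.log(picked).mean())


demo_vocab = 50
uniform_probs = np.full((2, 3, demo_vocab), 1.0 / demo_vocab)
demo_targets = np.array([[1, 2, 3], [4, 5, 6]])
baseline_loss = sparse_categorical_crossentropy(demo_targets, uniform_probs)
print(baseline_loss)
```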

In [None]:
model.summary()
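
As a cross-check for the summary, the parameter counts can be worked out by hand. The sketch below uses the standard LSTM count of four gates, each with an input kernel, a recurrent kernel and a bias, and plugs in the values from this notebook (50 characters, 24-dimensional embeddings, 512 units). The helper name and the variable names are illustrative.

```python
def lstm_params(input_dim, units):
    """Weights in one LSTM layer: 4 gates x (input kernel + recurrent kernel + bias)."""
    return 4 * (units * (input_dim + units) + units)


demo_vocab, demo_emb, demo_units = 50, 24, 512

embedding_params = demo_vocab * demo_emb              # one vector per character
lstm1_params = lstm_params(demo_emb, demo_units)      # reads the embeddings
lstm2_params = lstm_params(demo_units, demo_units)    # reads the first LSTM's output
dense_params = demo_units * demo_vocab + demo_vocab   # weights plus biases

total_params = embedding_params + lstm1_params + lstm2_params + dense_params
print(embedding_params, lstm1_params, lstm2_params, dense_params, total_params)
```

The two LSTM layers account for almost all of the weights, so the number of units is the setting that most affects model size and training time.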

## Training the model

Here's a function to print an example of text generated from the model.

In [None]:
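def print_example(n=500, seed_string="nel mezzo del cammin di nostra vita\n  mi"):
    for i in range(n):
        # Encode the last maxlen characters as a batch containing one sequence
        x = np.array([char_indices[c] for c in seed_string[-maxlen:]])[np.newaxis, :]
        # Keep only the distribution predicted after the final character
        preds = model.predict(x, verbose=0)[0][-1]
        # Renormalize so the probabilities sum to 1, as choice() requires
        preds = preds / np.sum(preds)
        next_char = choice(chars, p=preds)
        seed_string = seed_string + next_char
    print(seed_string)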
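
The function above samples directly from the predicted distribution. A common variation, not used in this notebook, is to rescale the distribution with a temperature before sampling: values below 1 make the output more conservative, values above 1 make it more varied. The sketch below shows one way to do that; `apply_temperature` and the demo values are hypothetical.

```python
import numpy as np


def apply_temperature(probs, temperature=1.0):
    """Rescale a probability vector: temperature < 1 sharpens it, > 1 flattens it."""
    logits = np.log(np.asarray(probs, dtype=np.float64) + 1e-12) / temperature
    logits -= logits.max()          # subtract the max for numerical stability
    scaled = np.exp(logits)
    return scaled / scaled.sum()    # renormalize to a valid distribution


demo_probs = np.array([0.5, 0.3, 0.2])
sharp = apply_temperature(demo_probs, 0.5)
flat = apply_temperature(demo_probs, 2.0)
print(sharp, flat)
```

To try it, the renormalization line in `print_example` could be swapped for something like `preds = apply_temperature(preds, 0.8)`.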

Now we start training the model.

In [None]:
model.fit(sentences, np.expand_dims(next_chars,-1), batch_size=64, epochs=3)

In [None]:
model.save_weights('dante1.h5')

In [None]:
model.load_weights('dante1.h5')

In [None]:
print_example()

This might not make any sense unless you know some Italian, but you can see that the model correctly uses punctuation like full stops and question marks.

In [None]:
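from keras import backend as K
# Update the existing learning-rate variable; a plain attribute assignment is not picked up once training has started
K.set_value(model.optimizer.lr, 0.001)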

In [None]:
model.fit(sentences, np.expand_dims(next_chars,-1), batch_size=64, epochs=1)

In [None]:
print_example()

In [None]:
from keras import backend as K
K.set_value(model.optimizer.lr, 0.0001)

In [None]:
model.fit(sentences, np.expand_dims(next_chars,-1), batch_size=64, epochs=1)

In [None]:
print_example()

In [None]:
from keras import backend as K
K.set_value(model.optimizer.lr, 0.00001)

In [None]:
model.fit(sentences, np.expand_dims(next_chars,-1), batch_size=64, epochs=3)

In [None]:
print_example()

In [None]:
model.save_weights('dante1.h5')

In [None]:
print_example()

In [None]:
print_example(1000)

After training for a few more epochs, the results are interesting in terms of the language used, but a few mistakes remain: passages of direct speech are sometimes opened and never closed, the three-line (tercet) pattern is mostly absent, and the rhyme scheme doesn't work either.
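
The tercet observation can be turned into a rough number. The sketch below splits a generated sample into stanzas at blank lines and reports the fraction that have exactly three lines. Both helpers are illustrative, and the measure ignores rhyme entirely.

```python
def stanza_lengths(generated):
    """Count the non-empty lines in each blank-line-separated stanza."""
    lengths = []
    current = 0
    for line in generated.split("\n"):
        if line.strip():
            current += 1
        elif current:
            lengths.append(current)
            current = 0
    if current:
        lengths.append(current)
    return lengths


def tercet_fraction(generated):
    """Fraction of stanzas that have exactly three lines."""
    lengths = stanza_lengths(generated)
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n == 3) / len(lengths)


sample = (
    "  nel mezzo del cammin di nostra vita\n"
    "  mi ritrovai per una selva oscura,\n"
    "  ché la diritta via era smarrita.\n"
    "\n"
    "  ahi quanto a dir qual era è cosa dura\n"
    "  esta selva selvaggia e aspra e forte\n"
)
print(stanza_lengths(sample), tercet_fraction(sample))
```

The last stanza of a generated sample is usually cut off mid-way, so it may be fairer to drop it before computing the fraction.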

Let's try a few combinations of the model parameters and see if we can get better results.

In [None]:
MAX_LEN = [40,60]
EMB = [24,32]
LSTM_UNITS = [512,1024]
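
These three lists define a grid of 2 × 2 × 2 = 8 configurations, each of which gets its own full training run. As a small sketch, `itertools.product` lists them in the same order as three nested loops would visit them; the lists are repeated so the snippet runs on its own, and `grid` is an illustrative name.

```python
from itertools import product

MAX_LEN = [40, 60]
EMB = [24, 32]
LSTM_UNITS = [512, 1024]

# Every (sequence length, embedding size, LSTM units) combination
grid = list(product(MAX_LEN, EMB, LSTM_UNITS))
for run, (seq_len, emb_size, n_units) in enumerate(grid, start=1):
    print(run, seq_len, emb_size, n_units)
```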

In [None]:
def get_sequences(maxlen):
    sentences = []
    next_chars = []
    for i in range(0, len(idx) - maxlen + 1):
        sentences.append(idx[i: i + maxlen])
        next_chars.append(idx[i + 1: i + maxlen + 1])
    # Drop the last two pairs: the final target is one character short
    sentences = np.array(sentences[:-2])
    next_chars = np.array(next_chars[:-2])
    return (sentences, next_chars)

def get_model(maxlen, embeddings, LSTM_units):
    model = Sequential([
        Embedding(vocab_size, embeddings, input_length=maxlen),
        LSTM(LSTM_units, return_sequences=True, dropout=0.2, recurrent_dropout=0.2,
             implementation=2),
        Dropout(0.2),
        LSTM(LSTM_units, return_sequences=True, dropout=0.2, recurrent_dropout=0.2,
             implementation=2),
        Dropout(0.2),
        TimeDistributed(Dense(vocab_size)),
        Activation('softmax')
    ])
    model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam())
    return model

In [None]:
from keras import backend as K
from keras.callbacks import History

k = 1  # incremented before use, so the weight files are numbered from dante2
losses = []
for maxlen in MAX_LEN:
    sentences, next_chars = get_sequences(maxlen)
    targets = np.expand_dims(next_chars, -1)
    for embeddings in EMB:
        for units in LSTM_UNITS:
            k = k + 1
            model = get_model(maxlen, embeddings, units)
            K.set_value(model.optimizer.lr, 0.01)
            history = History()
            loss = 3
            i = 0
            # Train at the high learning rate until the loss drops below 1.3, for at most 6 epochs
            while loss > 1.3 and i < 6:
                i = i + 1
                model.fit(sentences, targets, batch_size=64, epochs=1,
                          callbacks=[history])
                loss = history.history['loss'][-1]
            print_example()
            # Then lower the learning rate step by step, one epoch per value
            for lr in [0.001, 0.0001, 0.00001, 0.00001, 0.00001]:
                K.set_value(model.optimizer.lr, lr)
                model.fit(sentences, targets, batch_size=64, epochs=1,
                          callbacks=[history])
                print_example()
            losses.append(history.history['loss'][-1])
            model.save_weights('dante' + str(k))