<a href="https://colab.research.google.com/github/nurriol2/dgd_lyric_generation/blob/ft-rnn/Lyric_Generation_Dropout_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lyric Generation with a Recurrent Neural Network
## Experimenting with different model topologies ##  

---
# Dropout Topology #

This is a continuation of the previous notebook. Here, I focus on finding ways to improve the model by changing the model architecture.

The base model demonstrates the overall ML workflow for training an RNN. The remaining parts of this notebook focus on improving the predictive power of the RNN.

LSTMs can easily overfit the data. Adding dropout layers to the model topology can reduce the amount of overfitting.

Dropout layers can have different *dropout rates*, so adding a dropout layer also adds its dropout rate as a tunable hyperparameter.
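To make the effect of the dropout rate concrete, here is a minimal pure-Python sketch of "inverted" dropout, the scheme that (as far as I understand it) Keras `Dropout` applies during training: each value is zeroed with probability `rate` and the survivors are scaled by `1/(1 - rate)`. The helper name `dropout` and the toy activations are illustrative assumptions, not part of the notebook.

```python
import random

def dropout(values, rate, rng):
    """Inverted dropout: zero each value with probability `rate`,
    scale the surviving values by 1 / (1 - rate)."""
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

rng = random.Random(0)           # fixed seed so the mask is reproducible
activations = [1.0] * 10         # toy activations, all equal to 1
dropped = dropout(activations, rate=0.24, rng=rng)
print(dropped)
```

A higher rate zeroes more values and scales the remaining ones up more, which is why the rate is worth tuning.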

**Question** The `LSTM` layer has a `dropout` parameter, but I'm not sure whether it works the same way as a separate `Dropout` layer.

**Idea** It might be helpful to define layer-building functions, so that layers are added from a list of initialized layers. This approach might make a grid search over hyperparameters easier to automate.
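One possible way to sketch this idea without committing to a design: describe each topology as an ordered list of `(layer name, kwargs)` pairs, and generate one list per point of a hyperparameter grid. The function `layer_specs` and the grid values below are hypothetical; the keyword names are meant to match the Keras layer constructors used later in this notebook.

```python
from itertools import product

def layer_specs(rnn_units, dropout_rate, vocab_size=63, embedding_dim=256):
    """Return an ordered list of (layer_name, kwargs) pairs for one topology."""
    return [
        ("Embedding", {"input_dim": vocab_size, "output_dim": embedding_dim}),
        ("LSTM", {"units": rnn_units, "return_sequences": True, "stateful": True}),
        ("Dropout", {"rate": dropout_rate}),
        ("Dense", {"units": vocab_size}),
    ]

# hypothetical search grid
grid = {"rnn_units": [512, 1024], "dropout_rate": [0.1, 0.24, 0.5]}

# one candidate topology per combination of grid values
candidates = [
    layer_specs(units, rate)
    for units, rate in product(grid["rnn_units"], grid["dropout_rate"])
]
print(len(candidates), "candidate topologies")
```

Each pair could then be turned into a real layer with something like `getattr(tf.keras.layers, name)(**kwargs)` before being added to a `Sequential` model.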

In [1]:
import tensorflow as tf
import numpy as np
import os
import time
import requests

In [2]:
#a single .txt file containing Dance Gavin Dance lyrics
filepath = "https://raw.githubusercontent.com/nurriol2/dgd_lyric_generation/ft-rnn/dance_gavin_dance_lyrics.txt"
text = requests.get(filepath).text
#print the first few characters to check that this is the data we expect 
print(text[:250])

[Verse 1: Tilian & Jon Mess]
Do you crave a greater reason to exist?
Have you always known that symmetry is bliss?
We know you see the pattern
Lay in your lap, think of your path
Philosophy don't bother me, come back when you're trash
You are welcome


In [3]:
#normalization step: lowercase the text to reduce the vocabulary size
text = text.lower()

In [4]:
#total number of characters in the file
print ('Length of text: {} characters'.format(len(text)))

Length of text: 257869 characters


*vocabulary* - set of all elements that make up the sequence data 
- elements in this case are characters
- characters are case-sensitive:  A != a (which is why the text was lowercased above)
- needs to be converted to an ingestible form for the model (aka numbers)

In [5]:
vocab = sorted(set(text))
print ('{} unique characters'.format(len(vocab)))

63 unique characters


**Note 1**  
Apostrophes and commas are currently part of the vocabulary. The model might predict that the next character is a comma when (to a human) it would make more sense to predict the letter "m". So, a simple improvement might be to remove such characters.

In [6]:
#encoding characters as integers
char2idx = {u:i for i, u in enumerate(vocab)}
#a decode map to get text as output, instead of integers
idx2char = np.array(vocab)
#vectorize the text
text_as_int = np.array([char2idx[c] for c in text])
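A quick way to convince yourself that the encode/decode maps are inverses is a round trip on a toy string. This is only a sketch; `toy_text` and the other `toy_` names are made up for illustration and do not touch the real `vocab`.

```python
toy_text = "hello world"

# same recipe as above, applied to the toy string
toy_vocab = sorted(set(toy_text))
toy_char2idx = {ch: i for i, ch in enumerate(toy_vocab)}

# encode characters to integers, then decode back to characters
encoded = [toy_char2idx[ch] for ch in toy_text]
decoded = "".join(toy_vocab[i] for i in encoded)

print(encoded)
print(decoded == toy_text)
```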

## Overview of the problem workflow ##
- The model is fed *a sequence* with a specific length *n*
- The model tries to predict the next *most probable* character, **based on the last n characters**

## What the X_ and y_ look like ##
Pretend the sequence length is *n* = 4. Then an (input, output) pair might look like this:  
("Hell", "ello")  

The process of making a training-testing dataset with this format begins with
`tf.data.Dataset.from_tensor_slices`. The data is sliced along axis=0 to create a new `Dataset` object.
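The ("Hell", "ello") pairing can be reproduced in plain Python, which may help before looking at the `tf.data` version. This sketch cuts a toy string into non-overlapping chunks of *n*+1 characters and shifts each chunk by one; `toy_text` and `split_chunk` are illustrative names only.

```python
def split_chunk(chunk):
    """Input is everything but the last character, target is everything but the first."""
    return chunk[:-1], chunk[1:]

n = 4
toy_text = "Hello there"

# non-overlapping chunks of n+1 characters; an incomplete final chunk is dropped
num_chunks = len(toy_text) // (n + 1)
chunks = [toy_text[k * (n + 1):(k + 1) * (n + 1)] for k in range(num_chunks)]

pairs = [split_chunk(c) for c in chunks]
print(pairs)  # [('Hell', 'ello'), (' the', 'ther')]
```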

In [7]:
#the maximum length sentence we want for a single input in characters
seq_length = 87
#integer division: only this many full chunks of (seq_length+1) characters fit in the text
examples_per_epoch = len(text)//(seq_length+1)

### create (training examples, targets)###

#from_tensor_slices -> slice along axis=0
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

#peek at the first 5 elements of the Dataset
for i in char_dataset.take(5):
    print(idx2char[i.numpy()])

#combine consecutive elements from Dataset obj into another Dataset
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)

[
v
e
r
s


## What's going on with `drop_remainder=True`? ##
There is no guarantee that the quotient (length of dataset)/(sequence length + 1) is an integer. When the last batch is smaller than the desired chunk length, this parameter controls whether that batch is dropped (`True`) or kept (`False`).  

It might be interesting to check this quotient directly to see whether (for this data) the last batch is being dropped, and whether the model might perform better with the extra examples included.
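The check is a one-liner with `divmod`. The numbers below are copied from the outputs earlier in this notebook (257869 characters, `seq_length = 87`); the variable names are just for this sketch.

```python
text_length = 257869          # from the "Length of text" output above
seq_length = 87
chunk_size = seq_length + 1   # each chunk holds the input plus one shifted character

num_sequences, leftover_chars = divmod(text_length, chunk_size)
print(num_sequences, "full sequences,", leftover_chars, "characters dropped")
# 2930 full sequences, 29 characters dropped
```

So the last, incomplete chunk is indeed dropped, but it only contains 29 characters, which is small compared with the whole corpus.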

In [8]:
def split_input_target(chunk):
    """
    Form the input and target by shifting a fixed length window 1 character forward
    
    Args:
    chunk (tf.Tensor):  A sequence of seq_length+1 character indices
    
    Returns:
    (tuple):  A pair of tensors, (input indices, target indices), each of length seq_length
    """
    
    input_text = chunk[:-1]
    target_text = chunk[1:]

    return input_text, target_text

#apply this function to all sequences
dataset = sequences.map(split_input_target)

At this point, all we've done is create a labeled dataset that can be used to train the model.  

The printed text below is a human-readable example of what we want the model to do.

In [9]:
for input_example, target_example in  dataset.take(1):
    print ('Input data: ', repr(''.join(idx2char[input_example.numpy()])))
    print ('Target data:', repr(''.join(idx2char[target_example.numpy()])))

Input data:  '[verse 1: tilian & jon mess]\ndo you crave a greater reason to exist?\nhave you always kn'
Target data: 'verse 1: tilian & jon mess]\ndo you crave a greater reason to exist?\nhave you always kno'


*From TensorFlow Tutorial* - Understanding text as a time series

>Each index of these vectors are processed as one time step. For the input at time step 0, the model receives the index for "F" and trys to predict the index for "i" as the next character. At the next timestep, **it does the same thing but the RNN considers the previous step context in addition to the current input character**.

## Shuffling and splitting time series data ##  

**Shuffling** - Shuffling in this context has to be viewed differently than other sequential data. I don't actually care that the model learns patterns from "The Jiggler" before learning from "Prisoner". I would expect relatively the same performance from any order because **the well ordered temporal axis is NOT the order of the songs**. Instead, **the temporal axis is the order of the characters in each sequence**.  

*Key Point*: Shuffling the order of the sequences (not the characters inside them) **produces an equivalent representation of the dataset** - all of the songs are still there! In contrast, for a sequential dataset where the temporal axis is actually time (historical stock prices), shuffling those sequences **produces a fundamentally different dataset**. 
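A toy illustration of the key point, using made-up four-character "sequences": shuffling changes the order in which sequences are seen, but every sequence is still present and its internal character order is untouched.

```python
import random

toy_sequences = ["abcd", "efgh", "ijkl", "mnop"]   # stand-ins for lyric chunks

shuffled = toy_sequences[:]              # copy so the original order is kept
random.Random(42).shuffle(shuffled)      # seeded for reproducibility

# same set of sequences, each one intact
same_content = sorted(shuffled) == sorted(toy_sequences)
print(shuffled, same_content)
```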

**Note 2** Part of the tuning phase is balancing the number of epochs and the batch size. (As of right now, this is a heuristic.) $\Rightarrow$ Increasing the number of epochs and reducing the batch size both increase the number of weight updates, giving the model more opportunities to learn.
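The heuristic can be made concrete by counting weight updates: one update per batch, so updates = (batches per epoch) × epochs. The sketch below assumes 2930 sequences (257869 characters cut into chunks of 88) and that incomplete batches are dropped; `total_updates` is a hypothetical helper, and the batch size of 32 is only a comparison point.

```python
def total_updates(num_sequences, batch_size, epochs):
    """Number of optimizer steps when incomplete batches are dropped."""
    steps_per_epoch = num_sequences // batch_size
    return steps_per_epoch * epochs

num_sequences = 257869 // 88    # 2930 full sequences

print(total_updates(num_sequences, batch_size=65, epochs=39))  # 1755
print(total_updates(num_sequences, batch_size=32, epochs=39))  # 3549
```

Roughly halving the batch size roughly doubles the number of updates for the same number of epochs.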

# A different way of reshaping the input

We know that a recurrent layer expects data with this shape: `(number of samples, sequence length, number of features)`.    

In the block below, the data is batched into groups of `(number of samples, sequence length)`. The `Embedding` layer at the front of the model adds the feature dimension.

In [10]:
#the number of examples to propagate per training step
BATCH_SIZE = 65

# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).
BUFFER_SIZE = 10000

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

dataset

<BatchDataset shapes: ((65, 87), (65, 87)), types: (tf.int64, tf.int64)>

# Building the model #

In [11]:
#total number of characters in the corpus (not the number of training sequences)
nsamples = len(text)

#length of the vocabulary in chars
vocab_size = len(vocab)

#dimensionality of the embedding vectors
embedding_dim = 256

#number of RNN units
rnn_units = 1024

In [12]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dropout, Dense

def build_dropout_model(vocab_size, embedding_dim, rnn_units, batch_size, dropout_rate=0.24):
    model = Sequential()
    model.add(Embedding(vocab_size,
                        embedding_dim,
                        batch_input_shape=[batch_size, None]))
    model.add(LSTM(rnn_units,
                   return_sequences=True,
                   stateful=True,
                   recurrent_initializer="glorot_uniform"))
    model.add(Dropout(dropout_rate))
    model.add(Dense(vocab_size))    
    return model

In [13]:
#dropout model
model = build_dropout_model(
    vocab_size = len(vocab),
    embedding_dim=embedding_dim,
    rnn_units=rnn_units,
    batch_size=BATCH_SIZE)

## Output Shape ##

The model only takes in tensors of shape (batch, sequence length), and yet it outputs (batch, sequence length, vocab length). The i-th entry along the last dimension is the logit (unnormalized score) for the i-th character of the vocabulary being the next character.
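Since the final `Dense` layer has no activation and the loss later uses `from_logits=True`, the values along the last dimension are raw scores rather than probabilities. A softmax converts them; the sketch below does this in plain Python for a made-up three-character vocabulary.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.1]    # hypothetical scores for a 3-character vocabulary
probs = softmax(toy_logits)
print(probs)
```

The largest logit gets the largest probability, but the other characters keep a nonzero chance of being sampled.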

In [14]:
for input_example_batch, target_example_batch in dataset.take(1):
    print("EXAMPLE INPUT: {}".format(input_example_batch.shape))
    example_batch_predictions = model(input_example_batch)
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

EXAMPLE INPUT: (65, 87)
(65, 87, 63) # (batch_size, sequence_length, vocab_size)


In [15]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (65, None, 256)           16128     
_________________________________________________________________
lstm (LSTM)                  (65, None, 1024)          5246976   
_________________________________________________________________
dropout (Dropout)            (65, None, 1024)          0         
_________________________________________________________________
dense (Dense)                (65, None, 63)            64575     
Total params: 5,327,679
Trainable params: 5,327,679
Non-trainable params: 0
_________________________________________________________________
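The parameter counts in the summary can be reproduced by hand, which is a useful check on what each layer contains. The LSTM formula below assumes the standard four-gate layout with one bias vector per gate.

```python
vocab_size, embedding_dim, rnn_units = 63, 256, 1024

# one embedding vector per character
embedding_params = vocab_size * embedding_dim

# 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * ((embedding_dim + rnn_units + 1) * rnn_units)

# dropout has no weights; dense has a weight matrix plus a bias
dense_params = rnn_units * vocab_size + vocab_size

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 16128 5246976 64575 5327679
```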


In [16]:
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices,axis=-1).numpy()
sampled_indices

array([41, 38, 59, 49,  9, 28,  4, 44, 38, 14, 42, 23, 46, 22, 58, 44, 38,
       30, 30, 50, 28, 22, 57, 53,  8, 60, 34, 34, 33, 12, 10,  9, 36, 40,
       55,  6, 18,  9, 49, 44, 25, 30,  9, 19,  4, 37, 51,  2, 41, 31, 51,
       22, 50, 28,  5, 34, 57, 41,  5, 46, 47, 24, 21, 58, 53, 49, 29, 56,
       60, 52, 47, 26, 28,  6, 49, 20,  5,  8, 14, 25, 16, 45, 38, 48, 11,
       38, 52])

# Model Training

In [17]:
def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)
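For a single time step, sparse categorical cross-entropy from logits is the negative log of the softmax probability assigned to the true character. The pure-Python sketch below (with the hypothetical helper `sparse_crossentropy_from_logits`) also gives a handy sanity check: a model that spreads its belief uniformly over 63 characters should have a loss of ln(63) ≈ 4.14.

```python
import math

def sparse_crossentropy_from_logits(label, logits):
    """-log(softmax(logits)[label]), computed with the log-sum-exp trick."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]

# identical logits for every character == uniform prediction
uniform_loss = sparse_crossentropy_from_logits(0, [0.0] * 63)
print(round(uniform_loss, 4))  # 4.1431
```

A first-epoch loss close to this value would suggest the untrained model starts out near a uniform guess.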

In [18]:
model.compile(optimizer='adam', loss=loss)

In [19]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

# Using Best Weights for Prediction #

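`ModelCheckpoint` has a `save_best_only` option that works with the `monitor` parameter. That option is one way to ensure that predictions are made with the weights that minimize the monitored loss. Note that the callback defined above does not set it, so a checkpoint is written after every epoch.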
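The logic behind a "save best only" policy is small enough to sketch in plain Python: remember the best loss seen so far and save only when the current epoch improves on it. The loss values here are invented for illustration and are not results from this notebook.

```python
def best_only_epochs(losses):
    """Return the (1-based) epochs at which a checkpoint would be written
    if we only save when the monitored loss strictly improves."""
    best = float("inf")
    saved = []
    for epoch, loss_value in enumerate(losses, start=1):
        if loss_value < best:
            best = loss_value
            saved.append(epoch)
    return saved

hypothetical_losses = [2.9, 2.1, 1.8, 1.9, 1.7, 1.75]   # made-up training losses
saved_epochs = best_only_epochs(hypothetical_losses)
print(saved_epochs)  # [1, 2, 3, 5]
```

With this policy the most recently written checkpoint is also the best one, which is not true when every epoch is saved.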

In [20]:
EPOCHS=39
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

Epoch 1/39
...
Epoch 39/39


**Question**
When restoring a checkpoint (in the block below), where is the guarantee that the loaded weights are the ones with the minimum loss? Does this block just assume that the optimizer landed at the minimum on the final epoch?

**Note 3** It might be better to include the loss in the name of the checkpoint.
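As far as I understand the Keras documentation, the `filepath` given to `ModelCheckpoint` may contain named format fields that are filled in with the epoch number and the keys of the training logs (such as `loss`). The sketch below only demonstrates the string formatting itself with made-up values; the template name is an assumption, not what this notebook uses.

```python
import os

checkpoint_dir = "./training_checkpoints"

# hypothetical template: zero-padded epoch plus the loss to 4 decimal places
checkpoint_template = os.path.join(checkpoint_dir, "ckpt_{epoch:02d}_{loss:.4f}")

# made-up values, only to show what the resulting file name looks like
example_name = checkpoint_template.format(epoch=7, loss=1.23456)
print(example_name)
```

With names like this, the lowest-loss checkpoint can be identified by inspecting the file names instead of relying on the most recent one.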

In [21]:
tf.train.latest_checkpoint(checkpoint_dir)

'./training_checkpoints/ckpt_39'

In [22]:
model = build_dropout_model(vocab_size, embedding_dim, rnn_units, batch_size=1)

model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))

model.build(tf.TensorShape([1, None]))

In [23]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (1, None, 256)            16128     
_________________________________________________________________
lstm_1 (LSTM)                (1, None, 1024)           5246976   
_________________________________________________________________
dropout_1 (Dropout)          (1, None, 1024)           0         
_________________________________________________________________
dense_1 (Dense)              (1, None, 63)             64575     
Total params: 5,327,679
Trainable params: 5,327,679
Non-trainable params: 0
_________________________________________________________________


# Prediction Loop

In [24]:
def generate_text(model, start_string):
    """
    Generate text using a trained model
    
    Args:
    model (tensorflow.keras.Model):  A trained model
    start_string (str):  The starting input sequence
    
    Returns:
    (str):  The predicted text. Concatenation of start_string and following `num_generate` predicted characters.
    """

    #number of characters to generate
    num_generate = 1000

    #converting our start string to numbers (vectorizing)
    input_eval = [char2idx[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)

    #storing the predicted indices (wrt look-up table)
    text_generated = []

    # Low temperatures result in more predictable text.
    # Higher temperatures result in more surprising text.
    # Experiment to find the best setting.
    temperature = 0.97

    # Here batch size == 1
    model.reset_states()
    for i in range(num_generate):
        predictions = model(input_eval)
        # remove the batch dimension
        predictions = tf.squeeze(predictions, 0)

        # using a categorical distribution to predict the character returned by the model
        predictions = predictions / temperature
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

        # We pass the predicted character as the next input to the model
        # along with the previous hidden state
        input_eval = tf.expand_dims([predicted_id], 0)

        text_generated.append(idx2char[predicted_id])

    return (start_string + ''.join(text_generated))
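To see why the temperature changes how "surprising" the text is, it helps to look at the probabilities that result from dividing the logits by the temperature before the softmax. This is a plain-Python sketch with invented logits; `softmax_with_temperature` is a hypothetical helper, not a function used above.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.1]    # made-up scores for three characters

# probability of the most likely character at each temperature
top_prob = {t: max(softmax_with_temperature(toy_logits, t)) for t in (0.5, 1.0, 2.0)}
for t, p in top_prob.items():
    print("temperature", t, "-> top probability", round(p, 3))
```

Lower temperatures concentrate probability on the most likely character, while higher temperatures flatten the distribution so that less likely characters are sampled more often.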

In [25]:
print(generate_text(model, start_string=u"[intro: "))

[intro: tilian]
we could give guy

[verse 1: jon mess]
to my mocking cares
you would have been to clame
and wo freathe to sucticing sale
the sky we'll give this shit
why you don't need to know you’re farling what it see
that the dechored every ane believed, i'm find on your bodesta tway asking the festival)
shine alone, it's t're to live, let the get in a chipping a dient

[chorus: tilian]
when you away for long for mess]
bleed up some things that my mouth there

[verse 2: jon mess]
parling around, get stuck

[outro: tilian & jon mess]
my clostes, dip the page
in the sign
i she's seary time i fool
got let them of line
we were ereaging and check
my friend and sale ampone
i believe there's meaning)
we fut your jealousy survive it
sups me, come back here's meaning)
no, i believe there's nothing
i believe there's meaning
no, i believe there's nothing
i believe there's meaning
no, i believe there's meaning
no, i believe tteres in a lon clan you hear (oh-oh-oh-oh-oh-oh)
we're so high (oh-oh-