# Lyric Generation with a Recurrent Neural Network #
## Generating Dance Gavin Dance lyrics with machine learning ##

The starting point for this project is the official [TensorFlow Tutorial - Text generation with an RNN](https://www.tensorflow.org/tutorials/text/text_generation). I made several modifications to the base tutorial to meet my project goals. 

### Modifications ###
The first change has to do with the data itself. Since I am interested in generating song lyrics (instead of Shakespearean sonnets), I'm using a dataset of Dance Gavin Dance song lyrics that I scraped from the web. The web scraping program I wrote and the data itself are [on my project repo](https://github.com/nurriol2/dgd_lyric_generation). 

Additionally, the supporting text in this notebook highlights what I think is important *for my own understanding*. At this time, this project is not a tutorial for implementing RNNs (it's more of a project diary).

While the tutorial provides some guidance on improving the model, I found great ideas in [Jason Brownlee's post on text generation](https://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/). I implemented a few of these suggestions throughout this notebook.

---

I intend for this notebook **to run on [Google Colab](https://colab.research.google.com/)**.  

Here are several reasons I chose Google Colab:
1. Even with only a few epochs, my laptop cannot train a model quickly
2. Google Colab does not require any installations (Python or otherwise) to get started programming
3. The errors I had running the same code with AWS and Spell simply do not exist when using Google Colab  

If you have a powerful GPU or would like to try running this code on your own machine or somewhere besides Google Colab, everything is available on the project repo - including a `requirements.txt` file. If you do use a different platform to explore this project, I would love to hear about it! 

In [1]:
import tensorflow as tf
import numpy as np
import os
import time
import requests

In [2]:
#a single .txt file containing Dance Gavin Dance lyrics 
filepath = "https://raw.githubusercontent.com/nurriol2/dgd_lyric_generation/ft-rnn/dance_gavin_dance_lyrics.txt"
text = requests.get(filepath).text
#print the first few characters to check that this is the data we expect 
print(text[:250])

[Verse 1: Tilian & Jon Mess]
Do you crave a greater reason to exist?
Have you always known that symmetry is bliss?
We know you see the pattern
Lay in your lap, think of your path
Philosophy don't bother me, come back when you're trash
You are welcome


In [3]:
#total number of characters in the file
print ('Length of text: {} characters'.format(len(text)))

Length of text: 257869 characters


*vocabulary* - set of all elements that make up the sequence data 
- elements in this case are characters
- characters are case-sensitive:  A != a 
- needs to be converted to an ingestible form for the model (aka numbers)

In [4]:
vocab = sorted(set(text))
print ('{} unique characters'.format(len(vocab)))

89 unique characters


**Note 1**  
Apostrophes and commas are currently part of the vocabulary. The model might predict that the next character is a comma when (as a human) it would make more sense to predict the letter "m". So, a simple improvement might be removing such characters.
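
A minimal sketch of what that clean-up could look like. The helper name `simplify_text` and the choice of characters to drop are assumptions for illustration and are not part of the project repo. It also maps unusual whitespace characters (such as the `\u2005` that appears in the lyrics) to a plain space, which shrinks the vocabulary further.

```python
def simplify_text(raw, drop=",'"):
    """Drop the characters in `drop` and turn odd whitespace into plain spaces."""
    cleaned = []
    for ch in raw:
        if ch in drop:
            continue  # remove punctuation we do not want the model to predict
        if ch.isspace() and ch != "\n":
            ch = " "  # keep newlines, normalize every other kind of space
        cleaned.append(ch)
    return "".join(cleaned)

# Hypothetical sample in the style of the lyrics file
sample = "Philosophy don't bother me, come\u2005back"
simplified = simplify_text(sample)

print(repr(simplified))
print(len(set(sample)), "->", len(set(simplified)), "unique characters")
```

Whether dropping apostrophes actually helps is an open question, since "don't" becomes "dont" in the training data.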

In [5]:
#encoding characters as integers
char2idx = {u:i for i, u in enumerate(vocab)}
#a decode map to get text as output, instead of integers
idx2char = np.array(vocab)

text_as_int = np.array([char2idx[c] for c in text])
print(text[:9])
print(text_as_int[:9])

#V maps to 46; : maps to 22
print("V - {}\n: - {}".format(char2idx["V"], char2idx[":"]))

[Verse 1:
[51 46 57 70 71 57  1 13 22]
V - 46
: - 22


## Overview of the problem workflow ##
- The model is fed *a sequence* with a specific length *n*
- The model tries to predict the next *most probable* character, **based on the last n characters**

## What the X_ and y_ look like ##
Pretend the sequence length *n*==4. Then, an (input, output) pair might look like this  
("Hell", "ello")  

The process of making a training-testing dataset with this format begins with
`tf.data.Dataset.from_tensor_slices`. The data is sliced along axis=0 to create a new `Dataset` object.
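
The ("Hell", "ello") pair can be reproduced in plain Python. This is only a sketch of the idea with a hypothetical five-character text; the notebook itself works on integer-encoded tensors rather than strings.

```python
n = 4                      # sequence length used in the example above
sample = "Hello"           # hypothetical five-character text

chunk = sample[: n + 1]    # n input characters plus one extra for the shifted target
input_text = chunk[:-1]    # everything except the last character
target_text = chunk[1:]    # everything except the first character

# At each time step the model sees one input character
# and should predict the character that follows it.
for step, (seen, expected) in enumerate(zip(input_text, target_text)):
    print(step, repr(seen), "->", repr(expected))
```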

In [6]:
#the maximum length sentence we want for a single input in characters
seq_length = 100
#integer division: only this many full (seq_length+1)-character sequences fit in the text
examples_per_epoch = len(text)//(seq_length+1)

### create (training examples, targets)###

#from_tensor_slices -> slice along axis=0
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

#print the first 5 elements of the Dataset
for i in char_dataset.take(5):
    print(idx2char[i.numpy()])

#combine consecutive elements from Dataset obj into another Dataset
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)

for item in sequences.take(5):
    print(repr(''.join(idx2char[item.numpy()])))

[
V
e
r
s
'[Verse 1: Tilian & Jon Mess]\nDo you crave a greater reason to exist?\nHave you always known that symme'
"try is bliss?\nWe\u2005know\u2005you see the\u2005pattern\nLay in your lap, think of\u2005your path\nPhilosophy don't bother"
" me, come back when you're trash\nYou are welcome here but you must come alone\nYou know everything is "
'everywhere is home\nDo you see it?\n\n[Chorus: Tilian]\nPrisoner, prisoner\nWe found you\nWe feel you breat'
"hing\nAre you there?\nCan you hear us calling you?\nWe'll never judge you\n\n[Verse 2: Jon Mess & Tillian]"


## What's going on with `drop_remainder=True`? ##
There is no guarantee that the quotient (length of the text)/(`seq_length + 1`) is an integer. When the last batch is smaller than the desired length, this parameter decides whether that batch is dropped (`True`) or kept (`False`).  

It might be interesting to check this quotient directly to see whether (for this data) the last batch is being dropped, and whether the model might perform better with the extra examples included.
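
A quick check of that quotient, assuming the text length printed earlier in this notebook (257869 characters) and `seq_length = 100`:

```python
text_length = 257869   # taken from the "Length of text" output above
seq_length = 100

# Each sequence holds seq_length + 1 characters (input plus shifted target).
num_sequences, leftover_chars = divmod(text_length, seq_length + 1)

print("full sequences:    ", num_sequences)
print("characters dropped:", leftover_chars)
```

Under these assumptions the remainder is not zero, so the last partial sequence is being dropped. It is only a handful of characters, so including it is unlikely to change much.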

In [7]:
def split_input_target(chunk):
    """
    Form the input and target by shifting a fixed length window 1 character forward
    
    Args:
    chunk (tf.Tensor):  A sequence of seq_length+1 character indices
    
    Returns:
    (tuple):  A pair of tensors, (input sequence, target sequence)
    """
    
    input_text = chunk[:-1]
    target_text = chunk[1:]

    return input_text, target_text

#apply this function to all sequences
dataset = sequences.map(split_input_target)

At this point, all we've done is create a labeled dataset that can be used to train the model.  

The printed text below is a human-readable example of what we want the model to do. 

In [8]:
for input_example, target_example in  dataset.take(1):
    print ('Input data: ', repr(''.join(idx2char[input_example.numpy()])))
    print ('Target data:', repr(''.join(idx2char[target_example.numpy()])))

Input data:  '[Verse 1: Tilian & Jon Mess]\nDo you crave a greater reason to exist?\nHave you always known that symm'
Target data: 'Verse 1: Tilian & Jon Mess]\nDo you crave a greater reason to exist?\nHave you always known that symme'


*From TensorFlow Tutorial* - Understanding text as a time series

>Each index of these vectors are processed as one time step. For the input at time step 0, the model receives the index for "F" and trys to predict the index for "i" as the next character. At the next timestep, **it does the same thing but the RNN considers the previous step context in addition to the current input character**.

## Shuffling and splitting time series data ##  

**Shuffling** - Shuffling in this context has to be viewed differently than for other sequential data. I don't actually care whether the model learns patterns from "The Jiggler" before learning from "Prisoner". I would expect roughly the same performance from any order because **the well ordered temporal axis is NOT the order of the songs**. Instead, **the temporal axis is the order of the characters in each sequence**.  

*Key Point*: Shuffling the order of the sequences (not the characters inside them) **produces an equivalent representation of the dataset** - all of the songs are still there! In contrast, for a sequential dataset where the temporal axis is actually time (historical stock prices), shuffling the sequences **produces a fundamentally different dataset**. 
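
A toy illustration of the difference, using three short made-up sequences in place of the real 101-character chunks:

```python
import random

# Three consecutive fixed-length sequences (toy stand-ins for the real chunks).
sequences = ["We know you ", "see the patt", "ern\nLay in y"]

# Shuffle the ORDER of the sequences, with a fixed seed for repeatability.
shuffled_sequences = sequences[:]
random.Random(0).shuffle(shuffled_sequences)

# Every sequence survives intact, so the dataset is equivalent.
print(sorted(shuffled_sequences) == sorted(sequences))

# Shuffling the characters INSIDE a sequence is a different operation:
# it destroys the temporal axis the model learns from.
chars = list(sequences[0])
random.Random(0).shuffle(chars)
scrambled = "".join(chars)
print(repr(scrambled))
```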

**Note 2** Part of the tuning phase is balancing the number of epochs and the batch size. (As of right now, this is a heuristic.) $\Rightarrow$ Increasing the number of epochs and reducing the batch size gives the model more weight updates, and so more opportunities to learn.
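
To make the heuristic concrete, here is a rough count of weight updates. It assumes one update per batch, 2553 full sequences (257869 characters split into 101-character chunks), and the `BATCH_SIZE = 64` and `EPOCHS = 60` values used elsewhere in this notebook. The helper name `total_updates` is hypothetical.

```python
def total_updates(num_sequences, batch_size, epochs):
    """Number of weight updates, assuming one update per full batch."""
    steps_per_epoch = num_sequences // batch_size  # partial batch is dropped
    return steps_per_epoch * epochs

num_sequences = 257869 // 101  # full 101-character sequences in the text

baseline = total_updates(num_sequences, batch_size=64, epochs=60)
smaller_batches = total_updates(num_sequences, batch_size=32, epochs=60)
more_epochs = total_updates(num_sequences, batch_size=64, epochs=120)

print("baseline:       ", baseline)
print("smaller batches:", smaller_batches)
print("more epochs:    ", more_epochs)
```

Both changes roughly double the number of updates. More updates is not the same as a better model, so this is only a way to compare settings.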

# A different way of reshaping the input

We know that a recurrent layer expects data with this shape: `(number of samples, sequence length, number of features)`.    

In the block below, the data is being grouped into batches of shape `(number of samples, sequence length)`. The feature dimension is supplied later by the `Embedding` layer at the front of the model.

In [9]:
#the number of examples to propagate in one batch
BATCH_SIZE = 64

# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).
BUFFER_SIZE = 10000

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

dataset

<BatchDataset shapes: ((64, 100), (64, 100)), types: (tf.int64, tf.int64)>
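
The batches above have no feature dimension yet. A NumPy sketch of how an embedding lookup adds one: the random table below is only a stand-in for the weights a real embedding layer would learn, and the sizes are the ones used in this notebook.

```python
import numpy as np

batch_size, seq_length = 64, 100
vocab_size, embedding_dim = 89, 256

rng = np.random.default_rng(0)

# A batch of integer-encoded characters, shape (batch, sequence length).
ids = rng.integers(0, vocab_size, size=(batch_size, seq_length))

# Stand-in for learned embedding weights: one dense vector per character.
table = rng.normal(size=(vocab_size, embedding_dim))

# Looking up every index replaces each integer with its vector,
# which appends the feature dimension.
embedded = table[ids]

print(ids.shape, "->", embedded.shape)
```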

# Building the model #

In [10]:
#length of the vocabulary in chars
vocab_size = len(vocab)

#the embedding dimension (length of the dense vector for each character)
embedding_dim = 256

#number of RNN units
rnn_units = 1024

In [11]:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    """
    Build a Sequential model with a specific topology by specifying layer sizes.
    Topology:  Embedding - LSTM - Dense
    Notes:  Follows TF Tutorial exactly. Refer to this model as the "basic model" in notes. 
    
    Args:
    vocab_size (int):  The number of unique elements that comprise a dataset
    embedding_dim (int):  Dimension of the mapping between characters and a dense vector
    rnn_units (int):  The number of memory units (GRU or LSTM)
    batch_size (int):  The number of sequences in each batch
    
    Returns:
    (tensorflow.keras.Model):  A linear stack of neuron layers
    """
    
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                  batch_input_shape=[batch_size, None]),
        tf.keras.layers.LSTM(rnn_units,
                             return_sequences=True,
                             stateful=True,
                             recurrent_initializer='glorot_uniform'),
        tf.keras.layers.Dense(vocab_size)
    ])
    return model

In [12]:
#basic model
model = build_model(
    vocab_size = len(vocab),
    embedding_dim=embedding_dim,
    rnn_units=rnn_units,
    batch_size=BATCH_SIZE)

## Output Shape ##

The model takes input of shape `(batch, sequence length)` and yet outputs `(batch, sequence length, vocab length)`. The values along the last dimension are logits: the larger the i-th value, the more probable the model considers the i-th character of the vocabulary as the next character.

In [13]:
for input_example_batch, target_example_batch in dataset.take(1):
    print("EXAMPLE INPUT: {}".format(input_example_batch.shape))
    example_batch_predictions = model(input_example_batch)
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

EXAMPLE INPUT: (64, 100)
(64, 100, 89) # (batch_size, sequence_length, vocab_size)
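
The last dimension holds raw scores (logits) rather than probabilities, because the final `Dense` layer has no activation. A small pure-Python sketch of the softmax that turns such scores into probabilities, using a made-up three-character vocabulary:

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a three-character vocabulary
logits = [2.0, 0.5, -1.0]
probs = softmax(logits)

print([round(p, 3) for p in probs])
print("sum:", round(sum(probs), 6))
```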


In [14]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (64, None, 256)           22784     
_________________________________________________________________
lstm (LSTM)                  (64, None, 1024)          5246976   
_________________________________________________________________
dense (Dense)                (64, None, 89)            91225     
Total params: 5,360,985
Trainable params: 5,360,985
Non-trainable params: 0
_________________________________________________________________
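
The parameter counts in the summary can be checked by hand. This sketch assumes the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias:

```python
vocab_size, embedding_dim, rnn_units = 89, 256, 1024

# Embedding: one dense vector per character in the vocabulary.
embedding_params = vocab_size * embedding_dim

# LSTM: 4 gates, each with weights for the input and the previous
# hidden state, plus one bias per unit.
lstm_params = 4 * (rnn_units * (embedding_dim + rnn_units) + rnn_units)

# Dense: a weight per (unit, character) pair plus one bias per character.
dense_params = rnn_units * vocab_size + vocab_size

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```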


In [15]:
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices,axis=-1).numpy()
sampled_indices

array([76, 49, 14, 21, 51, 79,  5, 74, 30,  0, 29, 72, 18, 81, 59, 77, 30,
       11, 60,  3, 50, 35, 25, 46, 34, 70, 29, 37, 35, 49, 32, 65, 43, 68,
       43, 75, 88, 16, 45, 60, 68, 53, 78, 75, 77, 81, 83, 82, 72,  6,  7,
       55, 88,  5, 48, 46, 13, 84, 57, 26, 55, 38, 12, 71, 65, 16, 64,  7,
       36, 66, 69, 12, 43, 78, 55, 81, 84, 40, 35, 51, 18, 81, 84, 54, 44,
        0, 49,  0,  6, 86, 79, 10,  5, 60, 19, 67, 80, 87, 83, 26])

In [16]:
#one sampled index per position in the input sequence
len(sampled_indices)

100

In [17]:
#human readable decoding of the prediction
print("Input: \n", repr("".join(idx2char[input_example_batch[0]])))
print()
print("Next Char Predictions: \n", repr("".join(idx2char[sampled_indices ])))

Input: 
 "can't keep my mind open now\nOh-oh, whoa-oh-oh\n\n[Verse 2: Tilian]\nMake up your mind, we're running ou"

Next Char Predictions: 
 'xY29[ç\'vF\nEt6ígyF.h"ZKAVJrEMKYHmSpSw\u205f4Uhpazwyí\u2005út()c\u205f\'XV1\u200aeBcN0sm4l)Lnq0Szcí\u200aPK[6í\u200abT\nY\n(‚ç-\'h7oé“\u2005B'


# Model Training

In [18]:
def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

example_batch_loss  = loss(target_example_batch, example_batch_predictions)
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("scalar_loss:      ", example_batch_loss.numpy().mean())

Prediction shape:  (64, 100, 89)  # (batch_size, sequence_length, vocab_size)
scalar_loss:       4.4883547
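
A sanity check on that number: an untrained model should be close to guessing uniformly over the vocabulary, and the cross-entropy of a uniform guess over 89 characters is ln(89).

```python
import math

vocab_size = 89

# Cross-entropy of a uniform guess: -ln(1 / vocab_size) = ln(vocab_size)
uniform_loss = math.log(vocab_size)

observed_loss = 4.4883547  # scalar_loss printed above
print("uniform guess loss:", round(uniform_loss, 3))
print("difference:        ", round(abs(uniform_loss - observed_loss), 4))
```

The two values are very close, which suggests the freshly initialized model is behaving as expected before training.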


In [19]:
model.compile(optimizer='adam', loss=loss)

In [20]:
#directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
#name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

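#using the save_best_only option
#monitor the training loss: no validation data is passed to fit(), so the
#default monitor ("val_loss") would never be available
checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    monitor='loss',
    save_weights_only=True,
    save_best_only=True)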

In [21]:
EPOCHS=60
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

# Question #
When loading a checkpoint (in the block below), where is the guarantee that I am loading the weights with the minimum loss? Does this block just assume that the optimizer landed at the minimum? 

**Note 3** Might be better to include the loss in the name of the checkpoint
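
A sketch of what such a name could look like. Keras documents that the `ModelCheckpoint` `filepath` may contain named formatting fields that are filled in from the epoch number and the logged metrics. Plain `str.format` is used here to preview the result, and the assumption is that the training loss is logged under the key `loss`. The epoch and loss values are made up.

```python
import os

checkpoint_dir = "./training_checkpoints"

# Template with the epoch number and the loss embedded in the file name.
checkpoint_template = os.path.join(checkpoint_dir, "ckpt_{epoch:02d}_loss_{loss:.4f}")

# Preview of the name for a hypothetical epoch 7 with loss 1.23456.
example_name = checkpoint_template.format(epoch=7, loss=1.23456)
print(example_name)
```

With names like this, the checkpoint with the lowest loss can be picked by eye or by parsing the file names, instead of relying on whichever checkpoint was written last.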

In [26]:
tf.train.latest_checkpoint(checkpoint_dir)

In [None]:
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)

model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))

model.build(tf.TensorShape([1, None]))

In [None]:
model.summary()

# Prediction Loop

In [27]:
def generate_text(model, start_string):
    """
    Generate text using a trained model
    
    Args:
    model (tensorflow.keras.Model):  A trained model
    start_string (str):  The starting input sequence
    
    Returns:
    (str):  The predicted text. Concatenation of start_string and following `num_generate` predicted characters.
    """

    #number of characters to generate
    num_generate = 1000

    #converting our start string to numbers (vectorizing)
    input_eval = [char2idx[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)

    #storing the predicted indices (wrt look-up table)
    text_generated = []

    # Low temperatures result in more predictable text.
    # Higher temperatures result in more surprising text.
    # Experiment to find the best setting.
    temperature = 1.0

    # Here batch size == 1
    model.reset_states()
    for i in range(num_generate):
        predictions = model(input_eval)
        # remove the batch dimension
        predictions = tf.squeeze(predictions, 0)

        # using a categorical distribution to predict the character returned by the model
        predictions = predictions / temperature
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

        # We pass the predicted character as the next input to the model
        # along with the previous hidden state
        input_eval = tf.expand_dims([predicted_id], 0)

        text_generated.append(idx2char[predicted_id])

    return (start_string + ''.join(text_generated))
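
To see what the temperature division does, here is a pure-Python sketch that applies it to made-up logits for a three-character vocabulary and converts the result to probabilities. The helper name `softmax_with_temperature` is hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature, then apply softmax."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)                        # for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three characters

cold = softmax_with_temperature(logits, 0.5)     # sharper distribution
neutral = softmax_with_temperature(logits, 1.0)  # unchanged logits
hot = softmax_with_temperature(logits, 2.0)      # flatter distribution

for name, probs in [("T=0.5", cold), ("T=1.0", neutral), ("T=2.0", hot)]:
    print(name, [round(p, 3) for p in probs])
```

A low temperature concentrates probability on the top-scoring character, while a high temperature spreads it out, which is why sampled text becomes more surprising as the temperature rises.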

In [None]:
print(generate_text(model, start_string=u"[Intro: "))