<div align="center">
  <h1>4a - Lyrics Generation - RNN</h1> <a name="0-bullet"></a>
</div>

- [1. Setup](#1-bullet)
    * [1.1 Set the working directory](#11-bullet)
    * [1.2 Load the data](#12-bullet)
- [2. Preprocess the data](#2-bullet)
    * [2.1 Prepare the text](#21-bullet)
    * [2.2 Vectorize the text](#22-bullet)
- [3. Create the training dataset](#3-bullet)
    * [3.1 Create training examples and targets](#31-bullet)
    * [3.2 Create training batches](#32-bullet)
- [4. Build the model](#4-bullet)
- [5. Model training](#5-bullet)
    * [5.1 Configure checkpoint](#51-bullet)
    * [5.2 Train the model](#52-bullet)
    * [5.3 Export/load the model](#53-bullet)
- [6. Lyrics generation](#6-bullet)
    * [6.1 Lyrics generator model](#61-bullet)
    * [6.2 Export/load the generator](#62-bullet)
    * [6.3 Generate lyrics](#63-bullet)
    * [6.4 Calculate lyrics similarity](#64-bullet)
    * [6.5 Store lyrics to a text file](#65-bullet)

> References: 
> * [TensorFlow - Text generation with an RNN](https://www.tensorflow.org/tutorials/text/text_generation)

---

In [None]:
import os
import time
import json

import numpy as np
import pandas as pd
import pickle5 as pickle

import tensorflow as tf
from tensorflow.keras.layers.experimental import preprocessing

from sklearn.feature_extraction.text import TfidfVectorizer

---

# 1. Setup <a name="1-bullet"></a> <a href="#0-bullet"> <sup><sup><sup>^</sup></sup></sup></a>

## 1.1 Set the working directory <a name="11-bullet"></a> 

In [None]:
ROOT_DIR = "./eminem-lyrics-generator/notebooks/" 
IN_GOOGLE_COLAB = True

if IN_GOOGLE_COLAB:
    # mount google drive
    from google.colab import drive
    drive.mount('/content/gdrive')

    # change the current working directory
    %cd gdrive/'My Drive'

    # create the root directory (including missing parents) if there's none
    os.makedirs(ROOT_DIR, exist_ok=True)

    # change the current working directory
    %cd $ROOT_DIR

Mounted at /content/gdrive
/content/gdrive/My Drive
/content/gdrive/My Drive/eminem-lyrics-generator/notebooks


## 1.2 Load the data <a name="12-bullet"></a> 

In [None]:
# specifies paths to all files in the project
SETTINGS_FILE_PATH = os.path.join(os.path.abspath(".."), 'SETTINGS.json')
with open(SETTINGS_FILE_PATH) as f:
    settings = json.load(f)

In [None]:
DATA_FILE_DIR = settings['LYRICS_DF_SONGS_PATH']      # 'LYRICS_DF_ALL_PATH' or 'LYRICS_DF_SONGS_PATH'

with open(DATA_FILE_DIR, 'rb') as f:
    eminem_df = pickle.load(f)

In [None]:
eminem_df

| | title | lyrics |
|---|---|---|
| 0 | Rap God | [Intro] "Look, I was gonna go easy on you not ... |
| 1 | Killshot | [Intro] You sound like a bitch, bitch Shut the... |
| 2 | Godzilla | [Intro] Ugh, you're a monster [Verse 1: Emine... |
| 3 | Lose Yourself | [Intro] Look, if you had one shot or one oppor... |
| 4 | The Monster | [Intro: Rihanna] I'm friends with the monster ... |
| ... | ... | ... |
| 372 | Rap Game (Bump Heads) | [Intro: Eminem, DJ Butter & D12 Member] I am n... |
| 373 | Whoo Kid Freestyle | Step right up, i'm about to light up the skyli... |
| 374 | Hit ’Em Up | [Intro] "Aiyyo Head, that's why I fucked your... |
| 375 | The Wake Up Show Freestyle | [Verse 1] Met a retarded kid named Greg with a... |


# 2. Preprocess the data <a name="2-bullet"></a> <a href="#0-bullet"> <sup><sup><sup>^</sup></sup></sup></a>

## 2.1 Prepare the text <a name="21-bullet"></a> 

### a) filter out songs with no section headers in the lyrics

In [None]:
has_section_headers = eminem_df.lyrics.apply(lambda lyrics: "[" in lyrics)
eminem_df = eminem_df[has_section_headers].reset_index(drop=True)

### b) add titles to the beginning of lyrics

In [None]:
eminem_lyrics_df = eminem_df.apply(lambda song: "[Title]\n" + song.title + "\n\n" + song.lyrics, axis=1)

### c) replace triple newlines with double newlines 

In [None]:
eminem_lyrics_df = eminem_lyrics_df.apply(lambda lyrics: lyrics.replace('\n\n\n', '\n\n'))

### d) add prefix and suffix tokens to lyrics

In [None]:
SOT = "<SOT>"     # start of text
EOT = "<EOT>"     # end of text
eminem_lyrics_df = eminem_lyrics_df.apply(lambda lyrics: SOT + lyrics + EOT)

### e) join all lyrics to a text string

In [None]:
text = '\n\n\n'.join(eminem_lyrics_df.values)
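
Steps a) to e) can be followed end to end on a tiny corpus. The sketch below uses plain Python lists instead of the DataFrame, and the song titles and lyrics are made up for illustration only.

```python
# Toy stand-ins for three songs; titles and lyrics are invented for this example.
SOT, EOT = "<SOT>", "<EOT>"
toy_songs = [
    {"title": "Song A", "lyrics": "[Intro]\nline one\n\n\n[Verse 1]\nline two"},
    {"title": "Song B", "lyrics": "no section headers here"},
    {"title": "Song C", "lyrics": "[Verse 1]\nline three"},
]

# a) keep only songs whose lyrics contain a section header
kept = [s for s in toy_songs if "[" in s["lyrics"]]

prepared = []
for s in kept:
    # b) add the title to the beginning of the lyrics
    lyrics = "[Title]\n" + s["title"] + "\n\n" + s["lyrics"]
    # c) replace triple newlines with double newlines
    lyrics = lyrics.replace("\n\n\n", "\n\n")
    # d) add the prefix and suffix tokens
    prepared.append(SOT + lyrics + EOT)

# e) join all lyrics into one text string
toy_text = "\n\n\n".join(prepared)
print(repr(toy_text))
```

In this toy corpus the only triple newline left in `toy_text` is the one separating the two kept songs.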

## 2.2 Vectorize the text <a name="22-bullet"></a> 

In [None]:
# length of the text in chars
print('Length of text: {} characters'.format(len(text)))

# unique chars in the text
vocab = sorted(set(text))
print('{} unique characters'.format(len(vocab)))

Length of text: 1411222 characters
116 unique characters


In [None]:
# converting characters into ids and ids into characters
ids_from_chars = preprocessing.StringLookup(vocabulary=list(vocab))
chars_from_ids = preprocessing.StringLookup(vocabulary=ids_from_chars.get_vocabulary(), 
                                            invert=True)

In [None]:
# helper function to get text from ids
def text_from_ids(ids):
    return tf.strings.reduce_join(chars_from_ids(ids), axis=-1)
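
To make the round trip between characters and ids concrete, here is a plain-Python analogue of the two lookup layers. It is only a sketch: the `UNK` placeholder, the choice of index 0 for unknown characters, and the `encode`/`decode` helper names are assumptions made for this example, and the actual index layout of `StringLookup` depends on the TensorFlow version.

```python
# Plain-Python analogue of ids_from_chars / chars_from_ids (illustrative only).
UNK = "[UNK]"  # assumed placeholder for characters outside the vocabulary

toy_text = "hello world"
toy_vocab = sorted(set(toy_text))      # unique characters, sorted
id_to_char = [UNK] + toy_vocab         # assumption: index 0 is reserved for unknowns
char_to_id = {ch: i for i, ch in enumerate(id_to_char)}

def encode(s):
    """Map each character to its id, falling back to the UNK id."""
    return [char_to_id.get(ch, char_to_id[UNK]) for ch in s]

def decode(ids):
    """Map ids back to characters and join them into one string."""
    return "".join(id_to_char[i] for i in ids)

ids = encode("hello")
print(ids, decode(ids))
print(encode("z"))                     # a character that is not in the vocabulary
```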

# 3. Create the training dataset <a name="3-bullet"></a> <a href="#0-bullet"> <sup><sup><sup>^</sup></sup></sup></a>

## 3.1 Create training examples and targets <a name="31-bullet"></a> 

In [None]:
# split the text into chars and encode it to ids
all_ids = ids_from_chars(tf.strings.unicode_split(text, 'UTF-8'))
# create a dataset from the ids
ids_dataset = tf.data.Dataset.from_tensor_slices(all_ids)

# divide the dataset into sequences/examples
SEQ_LENGTH = 3000 
sequences = ids_dataset.batch(SEQ_LENGTH + 1, drop_remainder=True)

# create the train dataset by splitting each sequence into input text and target text 
dataset = sequences.map(lambda sequence: (sequence[:-1], sequence[1:]))
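
The split into input and target text is a shift by one position: at every step the model sees a character and is asked to predict the next one. A minimal sketch on a short string, with characters standing in for ids and hypothetical helper names `chunk` and `split_input_target`:

```python
def chunk(ids, seq_length):
    """Split ids into chunks of seq_length + 1 items, dropping the remainder."""
    size = seq_length + 1
    return [ids[i:i + size] for i in range(0, len(ids) - size + 1, size)]

def split_input_target(sequence):
    """Input is everything but the last item; target is the input shifted by one."""
    return sequence[:-1], sequence[1:]

toy = list("Lose Yourself")            # characters stand in for ids
pairs = [split_input_target(c) for c in chunk(toy, seq_length=4)]
for inp, tgt in pairs:
    print(repr("".join(inp)), "->", repr("".join(tgt)))
```

The trailing characters that do not fill a whole chunk are discarded, which mirrors `drop_remainder=True` in the cell above.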

## 3.2 Create training batches <a name="32-bullet"></a> 

In [None]:
# buffer size to shuffle the dataset
BUFFER_SIZE = 10000
# batch size
BATCH_SIZE = 64

dataset = dataset.shuffle(BUFFER_SIZE) \
                 .batch(BATCH_SIZE, drop_remainder=True) \
                 .prefetch(tf.data.experimental.AUTOTUNE)
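
With these settings the number of training steps per epoch is small. The arithmetic below is a sketch that assumes the text length printed in section 2.2 (1411222 characters) and the constants defined above.

```python
TEXT_LENGTH = 1411222   # taken from the "Length of text" output in section 2.2
SEQ_LENGTH = 3000
BATCH_SIZE = 64

# each sequence consumes SEQ_LENGTH + 1 characters; the remainder is dropped
num_sequences = TEXT_LENGTH // (SEQ_LENGTH + 1)
# full batches per epoch; incomplete batches are dropped as well
steps_per_epoch = num_sequences // BATCH_SIZE
dropped_sequences = num_sequences % BATCH_SIZE

print(num_sequences, steps_per_epoch, dropped_sequences)
```

Because the number of sequences is far below `BUFFER_SIZE`, the shuffle buffer can hold the whole dataset at once.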

# 4. Build the model <a name="4-bullet"></a> <a href="#0-bullet"> <sup><sup><sup>^</sup></sup></sup></a>

### a) construct the model

In [None]:
# embedding dimension
EMBEDDING_DIM = 256
# number of RNN units
RNN_UNITS = 1024

In [None]:
# length of the vocabulary in chars
vocab_size = len(vocab)

class MyModel(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, rnn_units):
        super().__init__()
        # input/embedding layer
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        # rnn layer
        self.gru = tf.keras.layers.GRU(rnn_units,
                                      return_sequences=True, 
                                      return_state=True)
        # output layer
        self.dense = tf.keras.layers.Dense(vocab_size)

    def call(self, inputs, states=None, return_state=False, training=False):
        # input/embedding layer output
        x = self.embedding(inputs, training=training)

        if states is None:
            states = self.gru.get_initial_state(x)
        # rnn layer output
        x, states = self.gru(x, initial_state=states, training=training)
        
        # output/dense layer output
        x = self.dense(x, training=training)

        if return_state:
            return x, states
        
        return x

### b) init the model 

In [None]:
model = MyModel(vocab_size=len(ids_from_chars.get_vocabulary()),
                embedding_dim=EMBEDDING_DIM,
                rnn_units=RNN_UNITS)

# 5. Model training <a name="5-bullet"></a> <a href="#0-bullet"> <sup><sup><sup>^</sup></sup></sup></a>

## 5.1 Configure checkpoint <a name="51-bullet"></a>

In [None]:
RNN_MODEL_DIR = settings['RNN_MODEL_DIR']
CHECKPOINT_DIR = os.path.join(RNN_MODEL_DIR, 'training_checkpoints')
CHECKPOINT_PREFIX = os.path.join(CHECKPOINT_DIR, "ckpt_{epoch}")

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(filepath=CHECKPOINT_PREFIX,
                                                         save_weights_only=True)
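
The `{epoch}` part of `CHECKPOINT_PREFIX` is an ordinary `str.format` field that the callback fills in with the epoch number each time it saves. The sketch below only shows how such a template expands; the directory is a hypothetical stand-in for `settings['RNN_MODEL_DIR']`.

```python
import os

# Hypothetical directory, used only for this illustration.
checkpoint_dir = os.path.join("models", "rnn_model", "training_checkpoints")
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

# expand the template for the first three epochs
names = [checkpoint_prefix.format(epoch=e) for e in (1, 2, 3)]
for name in names:
    print(name)
```

Weights saved with `save_weights_only=True` can typically be restored into a model of the same architecture with `model.load_weights(path)`.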

## 5.2 Train the model <a name="52-bullet"></a>

In [None]:
EPOCHS = 250    # note: the training log recorded below comes from a run with 350 epochs

# configure the training procedure
model.compile(optimizer='adam', 
              loss=tf.losses.SparseCategoricalCrossentropy(from_logits=True))

# fit the model
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

Epoch 1/350
Epoch 2/350
...
Epoch 77/350
Epoch 78
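
The loss used in `model.compile` compares the logits at each position with the id of the true next character. A minimal sketch of that computation for a single position, written in plain Python; the function name is hypothetical, and 116 (the number of unique characters reported in section 2.2) is used as an approximate vocabulary size.

```python
import math

def sparse_ce_from_logits(logits, target_id):
    """Cross-entropy for one position: -log(softmax(logits)[target_id])."""
    m = max(logits)                                   # subtract the max for stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_id]

# A model that scores every character equally has a loss of ln(vocabulary size).
vocab_size = 116                                      # approximation, see lead-in
uniform_loss = sparse_ce_from_logits([0.0] * vocab_size, target_id=3)
print(uniform_loss, math.log(vocab_size))

# A model that puts a high logit on the correct character has a loss near zero.
confident_loss = sparse_ce_from_logits([0.0, 5.0, 0.0], target_id=1)
print(confident_loss)
```

The uniform value is a useful reference point: a training loss well below it suggests the model has learned something about the character distribution.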

## 5.3 Export/load the model <a name="53-bullet"></a>

### a) Export the model

In [None]:
model.save(settings['RNN_MODEL_DIR'])

INFO:tensorflow:Assets written to: ../models/rnn_model/assets


### b) Load the model

In [None]:
model = tf.keras.models.load_model(settings['RNN_MODEL_DIR'])

# 6. Lyrics generation <a name="6-bullet"></a> <a href="#0-bullet"> <sup><sup><sup>^</sup></sup></sup></a>

## 6.1 Lyrics generator model <a name="61-bullet"></a> 

In [None]:
class OneStep(tf.keras.Model):
    def __init__(self, model, chars_from_ids, ids_from_chars, temperature=1.0):
        super().__init__()
        self.temperature=temperature
        self.model = model
        self.chars_from_ids = chars_from_ids
        self.ids_from_chars = ids_from_chars

        # create a mask to prevent "" or "[UNK]" from being generated
        skip_ids = self.ids_from_chars(['','[UNK]'])[:, None]
        sparse_mask = tf.SparseTensor(values=[-float('inf')]*len(skip_ids),     # put a -inf at each bad index
                                      indices=skip_ids,                         
                                      dense_shape=[len(ids_from_chars.get_vocabulary())])   # match the shape to the vocabulary
        self.prediction_mask = tf.sparse.to_dense(sparse_mask)

    @tf.function
    def generate_one_step(self, inputs, states=None):
        # convert strings to token IDs
        input_chars = tf.strings.unicode_split(inputs, 'UTF-8')
        input_ids = self.ids_from_chars(input_chars).to_tensor()

        # run the model
        # predicted_logits.shape is [batch, char, next_char_logits] 
        predicted_logits, states =  self.model(inputs=input_ids, 
                                               states=states, 
                                               return_state=True)
        # only use the last prediction
        predicted_logits = predicted_logits[:,-1, :]
        predicted_logits = predicted_logits / self.temperature
        # apply the prediction mask: prevent "" or "[UNK]" from being generated
        predicted_logits = predicted_logits + self.prediction_mask

        # sample the output logits to generate token IDs
        predicted_ids = tf.random.categorical(predicted_logits, num_samples=1)
        predicted_ids = tf.squeeze(predicted_ids, axis=-1)

        # convert from token ids to characters
        predicted_chars = self.chars_from_ids(predicted_ids)

        # return the characters and model state
        return predicted_chars, states

In [None]:
one_step_model = OneStep(model, chars_from_ids, ids_from_chars, temperature=0.6)
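
The effect of `temperature` and of the prediction mask can be seen on a handful of logits. The sketch below reimplements the sampling step in plain Python under simplifying assumptions (a single position, no batch dimension); the function names are hypothetical and not part of the notebook.

```python
import math
import random

def probabilities(logits, temperature):
    """Softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_with_temperature(logits, temperature=1.0, masked_ids=(), rng=random):
    """Sample an index from softmax(logits / temperature), never a masked id."""
    scaled = [z / temperature for z in logits]
    for i in masked_ids:
        scaled[i] = -math.inf                  # masked entries get weight exp(-inf) = 0
    m = max(scaled)
    weights = [math.exp(z - m) for z in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [2.0, 1.0, 0.5, 0.0]
p_default = probabilities(logits, 1.0)
p_sharp = probabilities(logits, 0.6)           # lower temperature: more mass on the top logit
print(p_default)
print(p_sharp)

rng = random.Random(0)
draws = [sample_with_temperature(logits, 0.6, masked_ids=[0], rng=rng) for _ in range(200)]
print(sorted(set(draws)))                      # index 0 is masked and never drawn
```

Temperatures below 1.0, such as the 0.6 used above, make the output more conservative; values above 1.0 make it more varied.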

## 6.2 Export/load the generator <a name="62-bullet"></a>

### a) Export the generator

In [None]:
tf.saved_model.save(one_step_model, settings['RNN_GENERATOR_DIR'])





INFO:tensorflow:Assets written to: ../models/one_step_generator/assets


### b) Load the generator

In [None]:
one_step_reloaded = tf.saved_model.load(settings['RNN_GENERATOR_DIR'])

## 6.3 Generate lyrics <a name="63-bullet"></a> 

In [None]:
# start and end of text tokens
SOT = "<SOT>"
EOT = "<EOT>"

# lyrics length in chars
LYRICS_LENGTH = 4800

SONG_TITLE = 'Artificial Intelligence'

next_char = tf.constant([SOT + '[Title]\n' + SONG_TITLE + '\n\n'])
result = [next_char]
states = None

# for tracking the last five generated chars - to detect the end of text token
last_5_chars = ""

for n in range(LYRICS_LENGTH):
    next_char, states = one_step_model.generate_one_step(next_char, states=states)
    result.append(next_char)

    # track the last len(EOT) chars 
    last_5_chars = (last_5_chars + next_char.numpy()[0].decode("utf-8"))[-len(EOT):]
    # stop the generation if EOT is detected
    if last_5_chars==EOT:
        break

# convert to string
result = tf.strings.join(result)[0].numpy().decode('utf-8')
# remove the SOT token
result = result[len(SOT):]
# remove the EOT token if the generation ended with one
gen_lyrics = result[:-len(EOT)] if last_5_chars==EOT else result

print(gen_lyrics)

[Title]
Artificial Intelligence

[Intro]
Uh, uh, uh, uh, alain, debridee
You know I just be sayin' that to get you mad
And when I rap about a buncha shit you wished you had
(A wish) Uh (Ban!)
We're bullets, you did it! I'm so sing and get to do again

[Verse 2: Eminem]
Yeah, you're gonna get in it
But I still wanna fight yeur of a Slim Shady irony
If we are Slamm in a Jordnes are going his own diant
Who's got these huggs that don't make is thinkin'
You motherfuckers 'cause he ain't shows with no remorse for you
We start on 'cause I am not a lastrone who want a Goddimate, Axact
Or blowin' up like ferty murdered the shells and leave you swallowed a little line
Ain't no one nappin' it off the feal in half
Caught in a chair, man, how come here?
Fay-back dum
All by once the fuck you think I want from me?
I'm a freak, I'm just thinky's simple as dellain and boys like
You hope the matter-window with you
Blockbate, keep ya head up a little like Conociance
Little faggots teacher did this well-b

## 6.4 Calculate lyrics similarity <a name="64-bullet"></a> 

In [None]:
# put the generated text to the top of the lyrics corpus
lyrics = np.concatenate([[gen_lyrics], eminem_lyrics_df.values])
# transform lyrics into TF-IDF vectors
tfidf = TfidfVectorizer(stop_words="english").fit_transform(lyrics)

# compute the cosine similarity  
pairwise_similarity = tfidf * tfidf.T 
# isolate only the top row (the row with similarities for the generated text)
pairwise_similarity = pairwise_similarity.toarray()[0]
# mask the diagonal element (the similarity to itself)
pairwise_similarity[0] = -1

# get the top 3 most similar lyrics to the generated text
most_similar_idxs = pairwise_similarity.argsort()[-3:][::-1] 

# list of things to print
output = [SONG_TITLE,
          ', '.join(eminem_df.iloc[most_similar_idxs - 1].title), 
          *pairwise_similarity[most_similar_idxs], 
          most_similar_idxs - 1]

print("Title: {}\nSimilar: {:50s}\nScores: {:.3f}, {:.3f}, {:.3f}\nCorpus: {}\n".format(*output))

Title: Artificial Intelligence
Similar: The Re-Up, 2004 Tim Westwood Freestyle, Eminem Freestyles on Tim Westwood | 2010
Scores: 0.345, 0.153, 0.152
Corpus: [255 309 198]
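
The product `tfidf * tfidf.T` yields cosine similarities because `TfidfVectorizer` scales every row to unit length by default (`norm='l2'`). The plain-Python sketch below checks this on made-up term weights and repeats the ranking step; all numbers and helper names are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy term-weight vectors (made-up numbers, one row per document).
docs = [[1.0, 2.0, 0.0], [2.0, 1.0, 1.0], [0.0, 0.0, 3.0]]
rows = [l2_normalize(d) for d in docs]

# for unit-length rows, rows @ rows.T holds the pairwise cosine similarities
similarity = [[dot(a, b) for b in rows] for a in rows]

# rank the other documents by similarity to document 0, masking document 0 itself
scores = similarity[0][:]
scores[0] = -1.0
ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:2]
print(ranked, [scores[i] for i in ranked])
```

Document 0 plays the role of the generated lyrics, which is why its own entry is masked before ranking.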



## 6.5 Store lyrics to a text file <a name="65-bullet"></a> 

In [None]:
GENERATED_LYRICS_DIR = settings['GENERATED_LYRICS_DIR']
FILE_NAME = "rnn_model_lyrics.txt"
DELIMITER = '\n\n\n'

with open(os.path.join(GENERATED_LYRICS_DIR, FILE_NAME), "a") as text_file:
    text_file.write(gen_lyrics + DELIMITER)
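
Since every run appends to the same file, the stored lyrics can later be recovered by splitting on the delimiter. A minimal sketch, assuming no stored text contains the triple-newline delimiter itself; `read_generated_lyrics` is a hypothetical helper and a temporary directory stands in for `GENERATED_LYRICS_DIR`.

```python
import os
import tempfile

DELIMITER = "\n\n\n"

def read_generated_lyrics(path, delimiter=DELIMITER):
    """Return the non-empty lyrics stored in an appended-to text file."""
    with open(path, encoding="utf-8") as f:
        return [chunk for chunk in f.read().split(delimiter) if chunk.strip()]

# demonstration with a temporary file instead of the project's output directory
samples = ("[Title]\nFirst\n\n[Verse 1]\n...", "[Title]\nSecond\n\n[Verse 1]\n...")
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "rnn_model_lyrics.txt")
    for lyrics in samples:
        with open(path, "a", encoding="utf-8") as f:
            f.write(lyrics + DELIMITER)
    stored = read_generated_lyrics(path)

print(len(stored))
```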