### **Introduction**


We plan to build a generative model for song lyrics: we train a machine learning model to learn the patterns and structures in a corpus of existing song lyrics, and then use it to generate new lyrics that are similar in style and content to that corpus.

We will follow these steps to build the generative model:

1. Gather a corpus: Collect song lyrics from our dataset. We may start with a specific genre or artist as a test case, and with English lyrics only.

2. Preprocess the data: Remove irrelevant information, then tokenize, stem, and lemmatize the text.

3. Train a language model: Use a deep learning architecture such as a Recurrent Neural Network (RNN) or a Long Short-Term Memory (LSTM) network to train a language model on the preprocessed lyrics. The model learns the relationships between the words and phrases in the corpus and can generate new lyrics based on them.

4. Generate new lyrics: Once the model is trained, give it a starting prompt or seed. The model then uses the patterns and structures it learned from the corpus to generate new lyrics that are similar in style and content to the original lyrics.

5. Evaluate the results: Check how well the generated lyrics match the style and content of the original corpus, using metrics such as perplexity or BLEU score, or human evaluation. We would also like to explore whether an LSTM generates better lyrics than the GRU-based RNN.

6. Refine the model: If the generated lyrics are not of high quality, adjust the hyperparameters or train on a larger or more diverse corpus of lyrics.

We anticipate that generating high-quality song lyrics can be challenging, as there are many nuances and complexities in the language and structure of lyrics. Therefore, it's important to carefully evaluate and refine the model to ensure that it generates high-quality lyrics.


### Import Packages

In [41]:
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics
import re
from nltk.stem import WordNetLemmatizer
nltk.download('omw-1.4')
nltk.download('stopwords')
nltk.download('wordnet')

import os
import time
import lzma
import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package omw-1.4 to /home/yyk/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package stopwords to /home/yyk/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/yyk/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [42]:
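colab = False
if colab:
    from google.colab import drive
    drive.mount('/content/drive')
    # file_path is only defined later (in the "Select an artist" section),
    # so joining it with the Drive root here would raise a NameError.
    # file_path = os.path.join('/content/drive/MyDrive/', file_path)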


### Getting the Data Ready







#### Compile a list of top US artists

In [34]:
df_all = pd.read_csv('lyrics-data.csv')

In [35]:
df_all.head()

Unnamed: 0,ALink,SName,SLink,Lyric,language
0,/ivete-sangalo/,Arerê,/ivete-sangalo/arere.html,"Tudo o que eu quero nessa vida,\nToda vida, é\...",pt
1,/ivete-sangalo/,Se Eu Não Te Amasse Tanto Assim,/ivete-sangalo/se-eu-nao-te-amasse-tanto-assim...,Meu coração\nSem direção\nVoando só por voar\n...,pt
2,/ivete-sangalo/,Céu da Boca,/ivete-sangalo/chupa-toda.html,É de babaixá!\nÉ de balacubaca!\nÉ de babaixá!...,pt
3,/ivete-sangalo/,Quando A Chuva Passar,/ivete-sangalo/quando-a-chuva-passar.html,Quando a chuva passar\n\nPra quê falar\nSe voc...,pt
4,/ivete-sangalo/,Sorte Grande,/ivete-sangalo/sorte-grande.html,A minha sorte grande foi você cair do céu\nMin...,pt


In [43]:
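# Derive a readable artist name from the link, e.g. '/ivete-sangalo/' -> 'ivete sangalo'.
# regex=False treats the patterns as literal characters regardless of the pandas version.
df_all['artist'] = df_all['ALink'].str.replace('/', '', regex=False)
df_all['artist'] = df_all['artist'].str.replace('-', ' ', regex=False)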
df_all = df_all[df_all.language == 'en']
df_all.head()

Unnamed: 0,ALink,SName,SLink,Lyric,language,artist
69,/ivete-sangalo/,Careless Whisper,/ivete-sangalo/careless-whisper.html,I feel so unsure\nAs I take your hand and lead...,en,ivete sangalo
86,/ivete-sangalo/,Could You Be Loved / Citação Musical do Rap: S...,/ivete-sangalo/could-you-be-loved-citacao-musi...,"Don't let them fool, ya\nOr even try to school...",en,ivete sangalo
88,/ivete-sangalo/,Cruisin' (Part. Saulo),/ivete-sangalo/cruisin-part-saulo.html,"Baby, let's cruise, away from here\nDon't be c...",en,ivete sangalo
111,/ivete-sangalo/,Easy,/ivete-sangalo/easy.html,"Know it sounds funny\nBut, I just can't stand ...",en,ivete sangalo
140,/ivete-sangalo/,For Your Babies (The Voice cover),/ivete-sangalo/for-your-babies-the-voice-cover...,You've got that look again\nThe one I hoped I ...,en,ivete sangalo


We started from the 50 artists with the most English-language songs in the dataset and aimed to keep only American artists. We excluded 18 entries that are either not from the United States or are collections of multiple artists, which leaves a refined list of 32 US artists.

In [44]:
# Count songs per artist (value_counts sorts in descending order)
artist_counts = df_all['artist'].value_counts()
top50 = artist_counts[:50]
# Filter out non-American singers/bands and multi-artist collections
exclusion = ['temas de filmes','matheus hardke','glee','hillsong united','elton john','bee gees','elvis costello','paul mccartney','vineyard','david bowie','the rolling stones','rod stewart','van morrison','kylie minogue','u2','the beatles','eric clapton','drake']
US_top = top50.loc[~top50.index.isin(exclusion)]

In [50]:
print('Number of Top US Artists:',len(US_top))
US_top

Number of Top US Artists: 32


frank sinatra        819
elvis presley        747
dolly parton         723
lil wayne            689
chris brown          623
guided by voices     620
prince               564
johnny cash          555
bob dylan            548
george jones         534
neil young           515
bruce springsteen    502
snoop dogg           485
eminem               484
50 cent              466
roy orbison          438
ella fitzgerald      421
taylor swift         385
waylon jennings      383
2pac tupac shakur    382
bb king              371
bon jovi             367
george strait        365
madonna              360
diana ross           355
bill monroe          351
beach boys           332
barry manilow        330
alice cooper         326
nas                  324
ray charles          322
beck                 320
Name: artist, dtype: int64

Starting from the list above, we performed additional preprocessing: we removed songs with featured artists and removed text inside () and []. See Kenny Tang's notebook for more details.
Now we import the resulting clean dataframe:

In [11]:
df = pd.read_csv('clean_lyrics_df.csv')

In [14]:
df.head()

Unnamed: 0.1,Unnamed: 0,ALink,SName,SLink,Lyric,language,features
0,5400,50 cent,In da Club,/50-cent/in-da-club.html,"go, go, go, go\ngo, go, go shawty\nit's your b...",en,False
1,5401,50 cent,21 Questions,/50-cent/21-questions.html,new york city!\nyou are now rapping...with 50 ...,en,False
2,5402,50 cent,P.I.M.P.,/50-cent/p-i-m-p.html,i don't know what you heard about me\nbut a b*...,en,False
3,5403,50 cent,Many Men (Wish Death),/50-cent/many-men-wish-death.html,man we gotta go get something to eat man\ni'm ...,en,False
4,5404,50 cent,Candy Shop,/50-cent/candy-shop.html,yeah...\nuh huh\nso seductive\ni'll take you t...,en,False


In [15]:
df.language.unique()

array(['en'], dtype=object)

In [22]:
df = df.drop('Unnamed: 0',axis = 1)

In [57]:
df.columns = ["artist", "songname", "songlink", "lyric", "language","features"]
df.head()

Unnamed: 0,artist,songname,songlink,lyric,language,features
0,50 cent,In da Club,/50-cent/in-da-club.html,"go, go, go, go\ngo, go, go shawty\nit's your b...",en,False
1,50 cent,21 Questions,/50-cent/21-questions.html,new york city!\nyou are now rapping...with 50 ...,en,False
2,50 cent,P.I.M.P.,/50-cent/p-i-m-p.html,i don't know what you heard about me\nbut a b*...,en,False
3,50 cent,Many Men (Wish Death),/50-cent/many-men-wish-death.html,man we gotta go get something to eat man\ni'm ...,en,False
4,50 cent,Candy Shop,/50-cent/candy-shop.html,yeah...\nuh huh\nso seductive\ni'll take you t...,en,False


In [58]:
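# .copy() makes df_en an independent dataframe, so adding columns to it later
# does not trigger pandas' SettingWithCopyWarning
df_en = df[df.language == 'en'].copy()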
df_en

Unnamed: 0,artist,songname,songlink,lyric,language,features
0,50 cent,In da Club,/50-cent/in-da-club.html,"go, go, go, go\ngo, go, go shawty\nit's your b...",en,False
1,50 cent,21 Questions,/50-cent/21-questions.html,new york city!\nyou are now rapping...with 50 ...,en,False
2,50 cent,P.I.M.P.,/50-cent/p-i-m-p.html,i don't know what you heard about me\nbut a b*...,en,False
3,50 cent,Many Men (Wish Death),/50-cent/many-men-wish-death.html,man we gotta go get something to eat man\ni'm ...,en,False
4,50 cent,Candy Shop,/50-cent/candy-shop.html,yeah...\nuh huh\nso seductive\ni'll take you t...,en,False
...,...,...,...,...,...,...
14323,barry manilow,You Oughta Be Home With Me,/barry-manilow/you-oughta-be-home-with-me.html,"everybody's here, spinnin' the bottle\neverybo...",en,False
14324,barry manilow,You're Leaving Too Soon,/barry-manilow/youre-leaving-too-soon.html,you're leavin' too soon\nyou oughta try believ...,en,False
14325,barry manilow,You're Looking Hot Tonight,/barry-manilow/youre-looking-hot-tonight.html,you're looking hot tonight\nbarry manilow\nby:...,en,False
14326,barry manilow,You're There,/barry-manilow/youre-there.html,our friends all use the past tense when they s...,en,False


In [59]:
# Export the DataFrame to a pickle file
pickle_file = "df_en.pickle"
df_en.to_pickle(pickle_file)

In some cases, stopwords may actually carry important contextual information and contribute to the overall meaning and tone of the lyrics. Removing them may result in the loss of nuance and the creation of less coherent lyrics. To generate more nuanced and complex lyrics, we may keep the stopwords.

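Lemmatization can also result in the loss of some information, as certain forms of a word may have different meanings and connotations. For example, "loving" and "loved" have different meanings and may be used in different contexts, so reducing both of them to "love" may lead to some loss of sentiment. Note that text_preprocess below calls the lemmatizer without a part-of-speech tag, so every token is treated as a noun and the main visible effect is stripping plurals (for example, "friends" becomes "friend").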
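
To make the stopword trade-off concrete, here is a minimal sketch that applies the same lowercase-and-strip-punctuation step to one lyric line, with and without stopword removal. `TOY_STOPWORDS` is a tiny hand-picked set and `toy_clean` is a hypothetical helper; both are assumptions made for illustration and are not part of the notebook, which uses the NLTK English stopword list.

```python
import re

# Hypothetical, hand-picked stopword subset for illustration only;
# the notebook itself uses nltk's stopwords.words('english').
TOY_STOPWORDS = {"i", "you", "what", "about", "me", "don", "t"}

def toy_clean(line, remove_stopwords):
    """Lowercase, replace non-alphanumerics with spaces, optionally drop stopwords."""
    tokens = re.sub(r"[^a-zA-Z0-9]", " ", line.lower()).split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in TOY_STOPWORDS]
    return " ".join(tokens)

line = "I don't know what you heard about me"
kept = toy_clean(line, remove_stopwords=False)
dropped = toy_clean(line, remove_stopwords=True)
print(kept)     # i don t know what you heard about me
print(dropped)  # know heard
```

With stopwords removed, the line loses its negation and its subject, which is the kind of nuance we would like a lyrics model to keep. The example also shows that replacing apostrophes with spaces splits contractions such as "don't" into "don" and "t".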

In [60]:
stop_word_list = set(stopwords.words('english'))  # a set gives fast membership tests
lemma = WordNetLemmatizer()

def text_preprocess(sentence, stopwords_removal=True):
    '''Lowercase a lyric string, replace non-alphanumeric characters with spaces,
    optionally drop stopwords, and lemmatize the remaining tokens.'''
    sentence = str(sentence)
    sentence = sentence.lower()  # lower case
    # Replace punctuation (including apostrophes and newlines) with spaces
    sentence = re.sub(r'[^a-zA-Z0-9]', r' ', sentence)
    # sentence = re.sub(r'lrb|rrb', r'', sentence)
    tokens = sentence.split()
    if stopwords_removal:
        tokens = [item for item in tokens if item not in stop_word_list]
    # lemmatize() treats each token as a noun unless a part of speech is given
    clean_text = [lemma.lemmatize(item) for item in tokens]
    return " ".join(clean_text)

In [61]:
df_en['cleaned_lyric'] = df_en['lyric'].apply(lambda x:text_preprocess(x,stopwords_removal = False)) 

In [62]:
df_en

Unnamed: 0,artist,songname,songlink,lyric,language,features,cleaned_lyric
0,50 cent,In da Club,/50-cent/in-da-club.html,"go, go, go, go\ngo, go, go shawty\nit's your b...",en,False,go go go go go go go shawty it s your birthday...
1,50 cent,21 Questions,/50-cent/21-questions.html,new york city!\nyou are now rapping...with 50 ...,en,False,new york city you are now rapping with 50 cent...
2,50 cent,P.I.M.P.,/50-cent/p-i-m-p.html,i don't know what you heard about me\nbut a b*...,en,False,i don t know what you heard about me but a b c...
3,50 cent,Many Men (Wish Death),/50-cent/many-men-wish-death.html,man we gotta go get something to eat man\ni'm ...,en,False,man we gotta go get something to eat man i m h...
4,50 cent,Candy Shop,/50-cent/candy-shop.html,yeah...\nuh huh\nso seductive\ni'll take you t...,en,False,yeah uh huh so seductive i ll take you to the ...
...,...,...,...,...,...,...,...
14323,barry manilow,You Oughta Be Home With Me,/barry-manilow/you-oughta-be-home-with-me.html,"everybody's here, spinnin' the bottle\neverybo...",en,False,everybody s here spinnin the bottle everybody ...
14324,barry manilow,You're Leaving Too Soon,/barry-manilow/youre-leaving-too-soon.html,you're leavin' too soon\nyou oughta try believ...,en,False,you re leavin too soon you oughta try believin...
14325,barry manilow,You're Looking Hot Tonight,/barry-manilow/youre-looking-hot-tonight.html,you're looking hot tonight\nbarry manilow\nby:...,en,False,you re looking hot tonight barry manilow by 1a...
14326,barry manilow,You're There,/barry-manilow/youre-there.html,our friends all use the past tense when they s...,en,False,our friend all use the past tense when they sp...


In [63]:
df_en.to_csv('NN_Test_Data/preprocessed_english_lyrics.csv', index=False)

#### Filter by Artist Name

In [75]:
data = df_en
data

Unnamed: 0,artist,songname,songlink,lyric,language,features,cleaned_lyric
0,50 cent,In da Club,/50-cent/in-da-club.html,"go, go, go, go\ngo, go, go shawty\nit's your b...",en,False,go go go go go go go shawty it s your birthday...
1,50 cent,21 Questions,/50-cent/21-questions.html,new york city!\nyou are now rapping...with 50 ...,en,False,new york city you are now rapping with 50 cent...
2,50 cent,P.I.M.P.,/50-cent/p-i-m-p.html,i don't know what you heard about me\nbut a b*...,en,False,i don t know what you heard about me but a b c...
3,50 cent,Many Men (Wish Death),/50-cent/many-men-wish-death.html,man we gotta go get something to eat man\ni'm ...,en,False,man we gotta go get something to eat man i m h...
4,50 cent,Candy Shop,/50-cent/candy-shop.html,yeah...\nuh huh\nso seductive\ni'll take you t...,en,False,yeah uh huh so seductive i ll take you to the ...
...,...,...,...,...,...,...,...
14323,barry manilow,You Oughta Be Home With Me,/barry-manilow/you-oughta-be-home-with-me.html,"everybody's here, spinnin' the bottle\neverybo...",en,False,everybody s here spinnin the bottle everybody ...
14324,barry manilow,You're Leaving Too Soon,/barry-manilow/youre-leaving-too-soon.html,you're leavin' too soon\nyou oughta try believ...,en,False,you re leavin too soon you oughta try believin...
14325,barry manilow,You're Looking Hot Tonight,/barry-manilow/youre-looking-hot-tonight.html,you're looking hot tonight\nbarry manilow\nby:...,en,False,you re looking hot tonight barry manilow by 1a...
14326,barry manilow,You're There,/barry-manilow/youre-there.html,our friends all use the past tense when they s...,en,False,our friend all use the past tense when they sp...


Although we have a cleaned version of the lyrics, we export the original lyrics for the generative model because it is character-based. We would like the model to learn the patterns of the original lyrics, including punctuation and line breaks.
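
As a rough illustration of what a character-based model sees, the sketch below builds a character vocabulary for a short raw snippet and for its cleaned counterpart, then forms the shifted input/target pair used for next-character prediction. The snippet is a shortened line modeled on the data shown above, and the names `char_to_id` and `id_to_char` are illustrative assumptions rather than names used in the notebook.

```python
# Shortened example strings modeled on the 'lyric' and 'cleaned_lyric' columns.
raw = "go, go, go shawty\nit's your birthday"
cleaned = "go go go shawty it s your birthday"

# The vocabulary of a character-level model is the set of unique characters.
vocab_raw = sorted(set(raw))
vocab_cleaned = sorted(set(cleaned))
print(len(vocab_raw), "unique characters in the raw text")
print(len(vocab_cleaned), "unique characters in the cleaned text")

# Map characters to integer ids and back (plain-Python stand-in for a lookup layer).
char_to_id = {ch: i for i, ch in enumerate(vocab_raw)}
id_to_char = {i: ch for ch, i in char_to_id.items()}
ids = [char_to_id[ch] for ch in raw]

# Next-character prediction: the target is the input shifted by one position.
input_ids, target_ids = ids[:-1], ids[1:]
print(repr(raw[:5]), "->", repr(raw[1:6]))
```

Only the raw text contains the newline, comma, and apostrophe characters, so a model trained on the cleaned text could never learn where lines break.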

In [66]:
# Get the top n most common names
top_artist = US_top.index
print(len(top_artist))
top_artist = ['frank sinatra', 'elvis presley', 'dolly parton', 'lil wayne',
       'chris brown', 'guided by voices', 'prince', 'johnny cash', 'bob dylan',
       'george jones', 'neil young', 'bruce springsteen', 'snoop dogg',
       'eminem', '50 cent', 'roy orbison', 'ella fitzgerald', 'taylor swift',
       'waylon jennings', '2pac tupac shakur', 'bb king', 'bon jovi',
       'george strait', 'madonna', 'diana ross', 'bill monroe', 'beach boys',
       'barry manilow', 'alice cooper', 'nas', 'ray charles', 'beck']

32


In [70]:
# Filter the dataframe by the top artists
data_topUS = data[data['artist'].isin(top_artist)]
print(len(data_topUS))


14328


In [72]:
# export the 'lyric' column to a text file
data_topUS['lyric'].to_csv('NN_Test_Data/topUS32.txt', header=False, index=False)

In [73]:
def extract_artist(data,artist_name):
    df_artist = data[data.artist == artist_name]
    df_artist['lyric'].to_csv('NN_Test_Data/{}.txt'.format(artist_name),mode='w',header=False, index=False)



for artist in top_artist:
    extract_artist(data,artist)

##### The section below needs to be cleaned up

### **RNN - GRU**

Recurrent Neural Networks (RNNs) can be used as a type of language model, as they are capable of predicting the probability of a sequence of words in a natural language.

In an RNN-based language model, the network takes in a sequence of tokens as input, one token at a time, and processes each token in the context of the previous tokens in the sequence. The tokens are often words; in this notebook they are individual characters. This allows the model to capture the dependencies and relationships between the tokens in the sequence.

The RNN language model consists of an input layer, an RNN layer, and an output layer. The RNN layer maintains a hidden state that is updated at each time step, and this hidden state serves as a memory that encodes the information from the previous words in the sequence. The output layer of the model predicts the probability distribution over the possible next words in the sequence.
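
The hidden-state update described above can be sketched in a few lines of NumPy. This is a plain (Elman-style) RNN cell with random, untrained weights, not the GRU layer used later in this notebook; the sizes and the names `W_xh`, `W_hh`, `W_hy`, and `rnn_step` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 5, 8  # toy sizes, chosen arbitrarily

# Randomly initialised (untrained) weights for a plain RNN cell.
W_xh = rng.normal(scale=0.1, size=(hidden_size, vocab_size))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden
W_hy = rng.normal(scale=0.1, size=(vocab_size, hidden_size))   # hidden -> output

def rnn_step(token_id, h):
    """Consume one token id and return (next-token probabilities, new hidden state)."""
    x = np.zeros(vocab_size)
    x[token_id] = 1.0                      # one-hot input
    h_new = np.tanh(W_xh @ x + W_hh @ h)   # the hidden state carries the memory
    logits = W_hy @ h_new
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    return probs, h_new

h = np.zeros(hidden_size)
for token_id in [0, 3, 1]:                 # a toy input sequence
    probs, h = rnn_step(token_id, h)
print(probs.shape, probs.sum())
```

After the loop, `probs` is the distribution over the next token given the whole sequence so far, and `h` summarises the tokens seen up to that point.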

Training an RNN language model involves minimizing the cross-entropy loss between the predicted probability distribution and the actual next token in the sequence. The gradients of this loss are computed with backpropagation through time (BPTT), and an optimizer (Adam in this notebook) uses them to adjust the model's parameters. Perplexity, the exponential of the mean cross-entropy, is a common way to report how well the model predicts the data.
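
The relationship between cross-entropy and perplexity can be checked on a toy example. The sketch below is a NumPy illustration under simple assumptions (probabilities are given directly rather than as logits, and `cross_entropy_and_perplexity` is a hypothetical helper, not a function from the notebook).

```python
import numpy as np

def cross_entropy_and_perplexity(probs, targets):
    """probs: (n_tokens, vocab_size) predicted distributions; targets: (n_tokens,) true ids."""
    probs = np.asarray(probs, dtype=float)
    targets = np.asarray(targets)
    # Negative log-probability assigned to each true next token.
    token_nll = -np.log(probs[np.arange(len(targets)), targets])
    mean_ce = token_nll.mean()
    # Perplexity is the exponential of the mean cross-entropy.
    return mean_ce, np.exp(mean_ce)

# A model that spreads its probability uniformly over 4 tokens.
uniform = np.full((3, 4), 0.25)
ce, ppl = cross_entropy_and_perplexity(uniform, [0, 2, 3])
print(ce, ppl)
```

A uniform guess over 4 tokens gives a perplexity of about 4, which matches the usual reading of perplexity as the effective number of equally likely choices the model is hesitating between.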

#### Import Processed Data and Setup Tensorflow

In [10]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

##### Select an artist

In [4]:
import ipywidgets as widgets

top_artist = ['frank sinatra', 'elvis presley', 'dolly parton', 'lil wayne',
       'chris brown', 'guided by voices', 'prince', 'johnny cash', 'bob dylan',
       'george jones', 'neil young', 'bruce springsteen', 'snoop dogg',
       'eminem', '50 cent', 'roy orbison', 'ella fitzgerald', 'taylor swift',
       'waylon jennings', '2pac tupac shakur', 'bb king', 'bon jovi',
       'george strait', 'madonna', 'diana ross', 'bill monroe', 'beach boys',
       'barry manilow', 'alice cooper', 'nas', 'ray charles', 'beck']
print(len(top_artist))
options = top_artist
dropdown = widgets.Dropdown(options=options, value=options[0], description='Select artist:')
display(dropdown)


32


Dropdown(description='Select artist:', options=('frank sinatra', 'elvis presley', 'dolly parton', 'lil wayne',…

In [5]:
print(dropdown.value)

taylor swift


In [6]:
artist_name = dropdown.value
print(artist_name)
file_path = '/content/drive/MyDrive/Capstone/NN_Test_Data/{}.txt'.format(artist_name)
print(file_path)
#file_path_topUS = "/content/drive/MyDrive/Capstone/NN_Test_Data/" + "original_topUS32.txt"


taylor swift
/content/drive/MyDrive/Capstone/NN_Test_Data/taylor swift.txt


In [1]:
def create_vocab(file_path):

    # Read, then decode for py2 compat.
    text = open(file_path, 'rb').read().decode(encoding='utf-8')
    # length of text is the number of characters in it
    print(f'Length of text: {len(text)} characters')
    # Take a look at the first 200 characters in text
    print(text[:200])

    # The unique characters in the file
    vocab = sorted(set(text))
    print(f'{len(vocab)} unique characters')

    return text, vocab 

text,vocab = create_vocab(file_path)
#### Text Vectorization
#Now create the tf.keras.layers.StringLookup layer:
ids_from_chars = tf.keras.layers.StringLookup(
    vocabulary=list(vocab), mask_token=None)
ids_from_chars
all_ids = ids_from_chars(tf.strings.unicode_split(text, 'UTF-8'))
all_ids
ids_dataset = tf.data.Dataset.from_tensor_slices(all_ids)
chars_from_ids = tf.keras.layers.StringLookup(
    vocabulary=ids_from_chars.get_vocabulary(), invert=True, mask_token=None)
for ids in ids_dataset.take(10):
    print(chars_from_ids(ids).numpy().decode('utf-8'))
#the model expects sequences of 100 tokens in length
seq_length = 100
sequences = ids_dataset.batch(seq_length+1, drop_remainder=True)

for seq in sequences.take(1):
  print(chars_from_ids(seq))


def text_from_ids(ids):
  return tf.strings.reduce_join(chars_from_ids(ids), axis=-1)
for seq in sequences.take(5):
  print(text_from_ids(seq).numpy())
#This function effectively splits each sequence in the dataset into an input sequence and a corresponding target sequence, which is a common preprocessing step in many natural language processing problems

def split_input_target(sequence):
    input_text = sequence[:-1]
    target_text = sequence[1:]
    return input_text, target_text
dataset = sequences.map(split_input_target)

# Batch size
BATCH_SIZE = 64

# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).
BUFFER_SIZE = 10000

dataset = (
    dataset
    .shuffle(BUFFER_SIZE)
    .batch(BATCH_SIZE, drop_remainder=True)
    .prefetch(tf.data.experimental.AUTOTUNE))

#### Build The GRU Model

#Main Parameters:
# Length of the vocabulary in StringLookup Layer
vocab_size = len(ids_from_chars.get_vocabulary())

# The embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 1024

class MyModel(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, rnn_units):
    super().__init__()  # do not pass self again; super() already binds it
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim) #The input layer. A trainable lookup table that will map each character-ID to a vector with embedding_dim dimensions;
    self.gru = tf.keras.layers.GRU(rnn_units,
                                   return_sequences=True,
                                   return_state=True) # A type of RNN with size units=rnn_units
    self.dense = tf.keras.layers.Dense(vocab_size)

  def call(self, inputs, states=None, return_state=False, training=False):
    x = inputs
    x = self.embedding(x, training=training)
    if states is None:
      states = self.gru.get_initial_state(x)
    x, states = self.gru(x, initial_state=states, training=training)
    x = self.dense(x, training=training)

    if return_state:
      return x, states
    else:
      return x

model = MyModel(
    vocab_size=vocab_size,
    embedding_dim=embedding_dim,
    rnn_units=rnn_units)

for input_example_batch, target_example_batch in dataset.take(1):
    example_batch_predictions = model(input_example_batch)
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices, axis=-1).numpy()
print("Input:\n", text_from_ids(input_example_batch[0]).numpy())
print()
print("Next Char Predictions:\n", text_from_ids(sampled_indices).numpy())


#### Train the GRU Model
#https://stackoverflow.com/questions/53515547/check-perplexity-of-a-language-model

from tensorflow.keras import backend as K

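def perplexity(y_true, y_pred):
    """
    The perplexity metric. Why isn't this part of Keras yet?!
    https://stackoverflow.com/questions/41881308/how-to-calculate-perplexity-of-rnn-in-tensorflow
    https://github.com/keras-team/keras/issues/8267
    """
    # The model outputs logits (the loss uses from_logits=True), so the metric must too.
    cross_entropy = K.sparse_categorical_crossentropy(y_true, y_pred, from_logits=True)
    # Perplexity is the exponential of the mean per-token cross-entropy.
    perplexity = K.exp(K.mean(cross_entropy))
    return perplexity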

loss = tf.losses.SparseCategoricalCrossentropy(from_logits=True)
example_batch_mean_loss = loss(target_example_batch, example_batch_predictions)
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("Mean loss:        ", example_batch_mean_loss)
tf.exp(example_batch_mean_loss).numpy()
# Define the model architecture and compile it, define the metrics
# model.compile(optimizer='adam', loss=loss)
model.compile(optimizer='adam', loss=loss, metrics=[perplexity]) # YJ: added custom metrics
# Define the path to the checkpoint directory
checkpoint_dir = '/content/drive/MyDrive/Capstone/training_checkpoints' #save to our google drive

# Create the checkpoint directory if it doesn't exist
if not os.path.exists(checkpoint_dir):
    os.makedirs(checkpoint_dir)

# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "my_model_{epoch}")

# Define a callback to save the model during training
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True,
    # save_best_only=True,
    # monitor='val_loss',
    # # mode='min',
    # save_freq=5,
    # overwrite=True
)                                #yj: added params for what to save
EPOCHS = 10
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])
class OneStep(tf.keras.Model):
  def __init__(self, model, chars_from_ids, ids_from_chars, temperature=1.0):
    super().__init__()
    self.temperature = temperature
    self.model = model
    self.chars_from_ids = chars_from_ids
    self.ids_from_chars = ids_from_chars

    # Create a mask to prevent "[UNK]" from being generated.
    skip_ids = self.ids_from_chars(['[UNK]'])[:, None]
    sparse_mask = tf.SparseTensor(
        # Put a -inf at each bad index.
        values=[-float('inf')]*len(skip_ids),
        indices=skip_ids,
        # Match the shape to the vocabulary
        dense_shape=[len(ids_from_chars.get_vocabulary())])
    self.prediction_mask = tf.sparse.to_dense(sparse_mask)

  @tf.function
  def generate_one_step(self, inputs, states=None):
    # Convert strings to token IDs.
    input_chars = tf.strings.unicode_split(inputs, 'UTF-8')
    input_ids = self.ids_from_chars(input_chars).to_tensor()

    # Run the model.
    # predicted_logits.shape is [batch, char, next_char_logits]
    predicted_logits, states = self.model(inputs=input_ids, states=states,
                                          return_state=True)
    # Only use the last prediction.
    predicted_logits = predicted_logits[:, -1, :]
    predicted_logits = predicted_logits/self.temperature
    # Apply the prediction mask: prevent "[UNK]" from being generated.
    predicted_logits = predicted_logits + self.prediction_mask

    # Sample the output logits to generate token IDs.
    predicted_ids = tf.random.categorical(predicted_logits, num_samples=1)
    predicted_ids = tf.squeeze(predicted_ids, axis=-1)

    # Convert from token ids to characters
    predicted_chars = self.chars_from_ids(predicted_ids)

    # Return the characters and model state.
    return predicted_chars, states
one_step_model = OneStep(model, chars_from_ids, ids_from_chars)
start = time.time()
states = None
next_char = tf.constant(['oh baby'])
result = [next_char]

for n in range(1000):
  next_char, states = one_step_model.generate_one_step(next_char, states=states)
  result.append(next_char)

result = tf.strings.join(result)
end = time.time()
print(result[0].numpy().decode('utf-8'), '\n\n' + '_'*80)
print('\nRun time:', end - start)

# Save the output to a text file

timestamp = time.strftime("%Y%m%d-%H%M%S")
filename = os.path.join("/content/drive/MyDrive/Capstone/MySavedModel/", f"{artist_name}_output_{timestamp}.txt")
with open(filename, 'w') as file:
    file.write(result[0].numpy().decode('utf-8'))
#### Save the model
# saved_model_path = '/content/drive/MyDrive/Capstone/MySavedModel/'
# tf.saved_model.save(one_step_model, saved_model_path)
# one_step_reloaded = tf.saved_model.load(saved_model_path)

saved_model_path = os.path.join("/content/drive/MyDrive/Capstone/MySavedModel", artist_name)
tf.saved_model.save(one_step_model, saved_model_path)

# Load the saved model (the reloaded model is used by the generation cell that follows)
one_step_reloaded = tf.saved_model.load(saved_model_path)

NameError: name 'file_path' is not defined

In [None]:
#Lyrics generation 
start = time.time()
states = None
next_char = tf.constant(['oh baby'])
result = [next_char]

for n in range(1000):
  next_char, states = one_step_reloaded.generate_one_step(next_char, states=states)
  result.append(next_char) 

result = tf.strings.join(result)
end = time.time()
print(result[0].numpy().decode('utf-8'), '\n\n' + '_'*80)
print('\nRun time:', end - start)

# Save the output to a text file

timestamp = time.strftime("%Y%m%d-%H%M%S")
filename = os.path.join(saved_model_path, f"{artist_name}_output_{timestamp}.txt")
with open(filename, 'w') as file:
    file.write(result[0].numpy().decode('utf-8'))
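
The OneStep class divides the logits by a temperature before sampling. The NumPy sketch below illustrates the effect of that division on the sampling distribution; the logits are made-up values and `softmax_with_temperature` is a hypothetical helper, not part of the notebook's model.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()        # subtract the max for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

logits = [2.0, 1.0, 0.1]          # made-up scores for three candidate characters
cold = softmax_with_temperature(logits, temperature=0.5)
neutral = softmax_with_temperature(logits, temperature=1.0)
hot = softmax_with_temperature(logits, temperature=2.0)
print(cold.round(3), neutral.round(3), hot.round(3))

# Sampling one character id from the chosen distribution.
rng = np.random.default_rng(0)
sampled_id = rng.choice(len(neutral), p=neutral)
```

A temperature below 1 concentrates probability on the most likely character, which tends to give more repetitive but safer text, while a temperature above 1 flattens the distribution and gives more varied, riskier text.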

### Create a Function for One-Step Training

In [None]:
def one_step_training(artist_name):
    #file_path = '/content/drive/MyDrive/Capstone/NN_Test_Data/{}.txt'.format(artist_name)
    file_path = 'Capstone/NN_Test_Data/{}.txt'.format(artist_name)
    # TODO: move the vocabulary, dataset, model, training and generation steps
    # from the cells above into this function; for now it only builds the path.

In [None]:
top_artist = ['frank sinatra', 'elvis presley', 'dolly parton', 'lil wayne',
              'chris brown', 'guided by voices', 'prince', 'johnny cash', 'bob dylan',
              'george jones', 'neil young', 'bruce springsteen', 'snoop dogg',
              'eminem', '50 cent', 'roy orbison', 'ella fitzgerald', 'taylor swift',
              'waylon jennings', '2pac tupac shakur', 'bb king', 'bon jovi',
              'george strait', 'madonna', 'diana ross', 'bill monroe', 'beach boys',
              'barry manilow', 'alice cooper', 'nas', 'ray charles', 'beck']
# one_step_training must be defined (previous cell) before this loop runs
for artist in top_artist:
    one_step_training(artist)


### Evaluate the GRU Model 




In [None]:
#https://stackoverflow.com/questions/53515547/check-perplexity-of-a-language-model
from tensorflow.keras import backend as K

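def perplexity(y_true, y_pred):
    """
    The perplexity metric. Why isn't this part of Keras yet?!
    https://stackoverflow.com/questions/41881308/how-to-calculate-perplexity-of-rnn-in-tensorflow
    https://github.com/keras-team/keras/issues/8267
    """
    # The model outputs logits (the loss uses from_logits=True), so the metric must too.
    cross_entropy = K.sparse_categorical_crossentropy(y_true, y_pred, from_logits=True)
    # Perplexity is the exponential of the mean per-token cross-entropy.
    perplexity = K.exp(K.mean(cross_entropy))
    return perplexity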


### LSTM 
LSTM (Long Short-Term Memory) networks are a kind of RNN (Recurrent Neural Network), and both plain RNNs and LSTMs can be used for text generation. However, there are some key differences between them.

RNNs are neural networks designed to work with sequential data, such as text. They process input data one token at a time and maintain a "memory" of the previous tokens they have seen. This makes RNNs well-suited for text generation tasks where the context of the previous tokens is important for predicting the next token.

LSTMs are specifically designed to address the "vanishing gradient" problem that can occur in standard RNNs. This problem makes it difficult for the RNN to learn long-term dependencies in the data, which can be important for text generation tasks. LSTMs use a more complex architecture that includes a "memory cell" and several "gates" that control the flow of information into and out of the cell. This allows LSTMs to better capture long-term dependencies in the data and can lead to improved performance for text generation tasks.

Overall, while both plain RNNs and LSTMs can be used for text generation, LSTMs are generally considered more effective for this task because they capture long-term dependencies better. The GRU used in the previous section is itself a gated RNN that targets the same problem with a simpler design, so whether the LSTM actually improves on it for lyrics generation is something we want to test rather than assume.
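
The memory cell and gates mentioned above can be sketched as a single LSTM step in NumPy. This follows the standard LSTM equations with random, untrained weights; the gate ordering inside the stacked matrices and the names `lstm_step`, `W`, `U`, and `b` are conventions assumed for this illustration and do not mirror the internals of the Keras layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step. Shapes: W (4H, D), U (4H, H), b (4H,).
    Gate order in the stacked weights (assumed here): input, forget, candidate, output."""
    H = h.shape[0]
    z = W @ x + U @ h + b
    i = sigmoid(z[0:H])          # input gate: how much new information to write
    f = sigmoid(z[H:2 * H])      # forget gate: how much of the old cell state to keep
    g = np.tanh(z[2 * H:3 * H])  # candidate values for the cell state
    o = sigmoid(z[3 * H:4 * H])  # output gate: how much of the cell state to expose
    c_new = f * c + i * g        # additive update of the memory cell
    h_new = o * np.tanh(c_new)   # hidden state passed to the next step and layer
    return h_new, c_new

rng = np.random.default_rng(0)
D, H = 3, 4                      # toy input and hidden sizes
W = rng.normal(size=(4 * H, D))
U = rng.normal(size=(4 * H, H))
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, D)):  # a toy sequence of 5 input vectors
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape, c.shape)
```

The cell state is updated additively (`f * c + i * g`), which is the part of the design that helps gradients survive over many time steps. The step returns two states, `h` and `c`, which is why an LSTM-based model has to carry a pair of states where a GRU carries one.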

#### Build the LSTM Model

Now let's modify the above model to use an LSTM layer:


In [None]:
# Length of the vocabulary in StringLookup Layer
vocab_size = len(ids_from_chars.get_vocabulary())

# The embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 1024


Modifications made: 

1. Replaced tf.keras.layers.GRU with tf.keras.layers.LSTM
2. The LSTM layer returns two states - state_h and state_c - corresponding to the hidden state and cell state respectively. We need to unpack these states from the output of the LSTM layer so we modified the if return_state block to return both state_h and state_c.

In [None]:
class MyLSTMModel(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, rnn_units):
    super().__init__()
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim) 
    self.lstm = tf.keras.layers.LSTM(rnn_units,
                                     return_sequences=True,
                                     return_state=True) 
    self.dense = tf.keras.layers.Dense(vocab_size)

  def call(self, inputs, states=None, return_state=False, training=False):
    x = inputs
    x = self.embedding(x, training=training)
    if states is None:
      states = self.lstm.get_initial_state(x)
    x, state_h, state_c = self.lstm(x, initial_state=states, training=training)
    x = self.dense(x, training=training)

    if return_state:
      return x, [state_h, state_c]
    else:
      return x

In [None]:
# What to tune
# embedding_dim sets the size of the character embeddings (vocab_size is fixed by the vocabulary)
# rnn_units controls the number of parameters in the LSTM layer; more units raise the risk of overfitting
# the loss function may also be changed

In [None]:
LSTM_model = MyLSTMModel(
    vocab_size=vocab_size,
    embedding_dim=embedding_dim,
    rnn_units=rnn_units)

In [None]:
for input_example_batch, target_example_batch in dataset.take(1):
    example_batch_predictions = LSTM_model(input_example_batch)
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

In [None]:
LSTM_model.summary()

In [None]:
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices, axis=-1).numpy()

In [None]:
sampled_indices

In [None]:
print("Input:\n", text_from_ids(input_example_batch[0]).numpy())
print()
print("Next Char Predictions:\n", text_from_ids(sampled_indices).numpy())

In [None]:
# # concatenate all the lyrics into a single string
# # all_britney = ' '.join(britney['cleaned_lyric'])

# # create a tokenizer and fit it on the concatenated string
# tokenizer = Tokenizer()
# tokenizer.fit_on_texts(britney['cleaned_lyric'])
# word_index = tokenizer.word_index
# sequences = tokenizer.texts_to_sequences(britney['cleaned_lyric'])

# # Create a sequence dataset
# max_sequence_length = 50
# sequences = np.array(pad_sequences(sequences, maxlen=max_sequence_length, padding='pre'))

# input_sequences = sequences[:, :-1]
# output_sequences = sequences[:, -1]

# dataset = tf.data.Dataset.from_tensor_slices((input_sequences, output_sequences))
# dataset = dataset.shuffle(len(input_sequences)).batch(64)

# # Build and train the model
# model = tf.keras.Sequential([
#     tf.keras.layers.Embedding(len(word_index)+1, 128, input_length=max_sequence_length-1),
#     tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(256, return_sequences=True)),
#     tf.keras.layers.Dropout(0.2),
#     tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128)),
#     tf.keras.layers.Dense(len(word_index)+1, activation='softmax')
# ])

# model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# model.fit(dataset, epochs=50)

# # Generate lyrics
# seed_text = "I wanna hold your hand"
# next_words = 100

# for _ in range(next_words):
#     sequence = tokenizer.texts_to_sequences([seed_text])[0]
#     sequence = pad_sequences([sequence], maxlen=max_sequence_length-1, padding='pre')
#     predicted = model.predict(sequence)[0]
#     predicted_word_index = np.argmax(predicted)
#     output_word = ''
#     for word, index in word_index.items():
#         if index == predicted_word_index:
#             output_word = word
#             break
#     seed_text += ' ' + output_word

# print(seed_text)

#### Train the LSTM Model

In [None]:
loss = tf.losses.SparseCategoricalCrossentropy(from_logits=True)

In [None]:
example_batch_mean_loss = loss(target_example_batch, example_batch_predictions)
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("Mean loss:        ", example_batch_mean_loss)

In [None]:
tf.exp(example_batch_mean_loss).numpy()
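
A quick sanity check on the value above: the exponential of the mean cross-entropy is the perplexity, and a freshly initialized model that spreads probability roughly evenly over the vocabulary should score close to the vocabulary size. The sketch below reproduces that relationship in plain Python; `mean_cross_entropy` and the vocabulary size of 66 are illustrative assumptions, not values taken from this notebook.

```python
import math


def mean_cross_entropy(probs_per_step, targets):
    """Average negative log-probability assigned to the target ids."""
    total = sum(-math.log(probs[t]) for probs, t in zip(probs_per_step, targets))
    return total / len(targets)


vocab = 66  # hypothetical vocabulary size
# A model that knows nothing assigns probability 1/vocab to every character.
uniform_predictions = [[1.0 / vocab] * vocab for _ in range(4)]
targets = [0, 5, 10, 65]

loss = mean_cross_entropy(uniform_predictions, targets)
perplexity = math.exp(loss)
print(loss, perplexity)
```

If the exponential of the initial loss is far above the vocabulary size, the untrained model is confidently wrong, which usually points to a problem with initialization or with the loss setup.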

In [None]:
LSTM_model.compile(optimizer='adam', loss=loss)

In [None]:
import os

# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

# Save the model weights after every epoch; the fit() call below passes this callback
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

In [None]:
EPOCHS = 1

In [None]:
history = LSTM_model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

In [None]:
# inference
class OneStep(tf.keras.Model):
  def __init__(self, model, chars_from_ids, ids_from_chars, temperature=1.0):
    super().__init__()
    self.temperature = temperature
    self.model = model
    self.chars_from_ids = chars_from_ids
    self.ids_from_chars = ids_from_chars

    # Create a mask to prevent "[UNK]" from being generated.
    skip_ids = self.ids_from_chars(['[UNK]'])[:, None]
    sparse_mask = tf.SparseTensor(
        # Put a -inf at each bad index.
        values=[-float('inf')]*len(skip_ids),
        indices=skip_ids,
        # Match the shape to the vocabulary
        dense_shape=[len(ids_from_chars.get_vocabulary())])
    self.prediction_mask = tf.sparse.to_dense(sparse_mask)

  @tf.function
  def generate_one_step(self, inputs, states=None):
    # Convert strings to token IDs.
    input_chars = tf.strings.unicode_split(inputs, 'UTF-8')
    input_ids = self.ids_from_chars(input_chars).to_tensor()

    # Run the model.
    # predicted_logits.shape is [batch, char, next_char_logits]
    predicted_logits, states = self.model(inputs=input_ids, states=states,
                                          return_state=True)
    # Only use the last prediction.
    predicted_logits = predicted_logits[:, -1, :]
    predicted_logits = predicted_logits/self.temperature
    # Apply the prediction mask: prevent "[UNK]" from being generated.
    predicted_logits = predicted_logits + self.prediction_mask

    # Sample the output logits to generate token IDs.
    predicted_ids = tf.random.categorical(predicted_logits, num_samples=1)
    predicted_ids = tf.squeeze(predicted_ids, axis=-1)

    # Convert from token ids to characters
    predicted_chars = self.chars_from_ids(predicted_ids)

    # Return the characters and model state.
    return predicted_chars, states
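
The effect of the temperature division and the [UNK] mask in generate_one_step can be seen in isolation with a small plain-Python sketch. `sample_with_temperature` and its `banned` parameter are hypothetical names introduced for illustration; the notebook itself does this with tf.random.categorical and a dense mask of -inf values.

```python
import math
import random


def sample_with_temperature(logits, temperature=1.0, banned=(), rng=random):
    """Sample an index from softmax(logits / temperature), never picking banned ids.

    Assumes at least one index is not banned.
    """
    # Banned ids get a logit of -inf, so their probability becomes exactly 0.
    scaled = [
        -math.inf if i in banned else logit / temperature
        for i, logit in enumerate(logits)
    ]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


rng = random.Random(0)
logits = [2.0, 1.0, 5.0]

# Index 2 has the highest logit but is masked out, so it is never drawn.
masked = [sample_with_temperature(logits, banned={2}, rng=rng) for _ in range(200)]
print(sorted(set(masked)))

# A very low temperature makes sampling behave almost like argmax.
cold = [sample_with_temperature(logits, temperature=0.01, rng=rng) for _ in range(50)]
print(sorted(set(cold)))
```

Lower temperatures concentrate probability on the highest logit and give more predictable text; higher temperatures flatten the distribution and give more varied output.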

In [None]:
one_step_model = OneStep(LSTM_model, chars_from_ids, ids_from_chars)

In [None]:
start = time.time()
states = None
next_char = tf.constant(['oh baby'])
result = [next_char]

for n in range(1000):
  next_char, states = one_step_model.generate_one_step(next_char, states=states)
  result.append(next_char)

result = tf.strings.join(result)
end = time.time()
print(result[0].numpy().decode('utf-8'), '\n\n' + '_'*80)
print('\nRun time:', end - start)

The commented-out cell in the model-building section above sketches a word-level alternative to the character-level LSTM trained here. In that version we first load and preprocess the lyrics data using Pandas. We then use the Tokenizer class to tokenize the cleaned lyrics, pad or truncate each sequence to 50 tokens, and build a tf.data.Dataset in which the first 49 tokens are the input and the last token is the target.

Next, we build a model architecture using TensorFlow's Keras API, which consists of an embedding layer, a bidirectional LSTM layer, a dropout layer, a second bidirectional LSTM layer, and a dense layer with a softmax activation function. The model is set up to train on the sequence dataset for 50 epochs.

Finally, the trained model generates new lyrics by starting with a seed text and iteratively predicting the next word: it takes the argmax of the output of the model's predict() method, looks that index up in the Tokenizer's word_index to recover the word, and appends the word to the seed. This process repeats for a specified number of output words.

### Ensemble Model

Steps to ensemble the per-artist models:
1. Train an individual model for each artist using the LSTM model above.
2. After training, save each individual model.
3. Load the saved models and create an ensemble model that takes an artist's name as input and outputs generated lyrics.
4. Use the ensemble model to generate lyrics for the specified artist.
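
The steps above can be sketched as a small lookup that routes a request to the model trained for the requested artist. This is only one possible design under stated assumptions: `ArtistRouter`, `register`, and `generate` are hypothetical names, and the stand-in lambdas take the place of per-artist generators that would, in this notebook, wrap a OneStep model built from restored weights.

```python
class ArtistRouter:
    """Route a generation request to the generator registered for an artist."""

    def __init__(self):
        self._generators = {}

    def register(self, artist, generate_fn):
        # Normalize names so that "Elvis Presley" and "elvis presley" match.
        self._generators[artist.strip().lower()] = generate_fn

    def generate(self, artist, seed, length=100):
        key = artist.strip().lower()
        if key not in self._generators:
            raise KeyError(f"No model registered for artist {artist!r}")
        return self._generators[key](seed, length)


router = ArtistRouter()
# Stand-in generators; real ones would call a trained per-artist model.
router.register("elvis presley", lambda seed, length: seed + " [elvis]" * length)
router.register("dolly parton", lambda seed, length: seed + " [dolly]" * length)

print(router.generate("Elvis Presley", "oh baby", length=2))
```

Keeping one generator per artist behind a single entry point means new artists can be added by training and registering another model, without touching the generation loop.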


In [None]:
artist_list = ['frank sinatra', 'elvis presley', 'dolly parton', 'matheus hardke',
       'lil wayne', 'glee', 'hillsong united', 'elton john', 'temas de filmes',
       'chris brown']
print(artist_list)
print(len(artist_list))

In [None]:
latest_checkpoint = tf.train.latest_checkpoint(checkpoint_dir)

In [None]:
print(latest_checkpoint)

In [None]:
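# The checkpoint holds weights only: rebuild the architecture, then restore the weights into it
model_1 = MyLSTMModel(vocab_size=vocab_size, embedding_dim=embedding_dim, rnn_units=rnn_units)
model_1.load_weights(latest_checkpoint)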