### Artificial lyric generator (single character, any artist)
- Uses a recurrent neural network (GRU)
- Mimics any popular artist in the dataset
- Uses single characters as sequential inputs
- Uses the Bee Gees as the example
#### Dawson Sargent, Chenghui Song, Kelvin Yi


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import os

### Create songlist from the datasets
Data source: https://www.kaggle.com/datasets/neisse/scrapped-lyrics-from-6-genres?select=lyrics-data.csv
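- Two CSV datasets are needed:
 - lyrics-data.csv: one row per song; the code below uses its ALink, SName, Lyric, and language columns
 - artists-data.csv: one row per artist; the code below uses its Artist, Genres, Popularity, Songs, and Link columns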

In [2]:
#curr_dir = 'C:\\Users\\cheng\\Downloads\\'
lyrics = pd.read_csv('lyrics-data.csv') 
lyrics=lyrics.query("language=='en'")
artists = pd.read_csv("artists-data.csv")
lyrics_df = pd.merge(lyrics,artists,left_on="ALink",right_on="Link")
lyrics_df = lyrics_df[["Artist","Genres","Popularity","Songs","SName","Lyric"]]
# Note: the Popularity score reflects how often each artist/lyric is accessed on the website
lyrics_popular=lyrics_df.query("Songs>500")
lyrics_popular=lyrics_popular.sort_values(['Songs','Artist'], ascending=[False,True])
#lyrics_hop = lyrics_df[lyrics_df["Genres"].str.contains("Hip Hop",na=False)]
print("Shape of songlist from popular artists:",lyrics_popular.shape)
lyrics_popular.head()

Shape of songlist from popular artists: (12330, 6)


Unnamed: 0,Artist,Genres,Popularity,Songs,SName,Lyric
120727,Frank Sinatra,Jazz; Clássico; Romântico,16.1,828.0,My Way,"And now the end is near,\nAnd so I face the fi..."
120728,Frank Sinatra,Jazz; Clássico; Romântico,16.1,828.0,Fly Me to the Moon,Fly me to the moon\nLet me play among the star...
120729,Frank Sinatra,Jazz; Clássico; Romântico,16.1,828.0,"New York, New York","Start spreading the news,\nI'm leaving today\n..."
120730,Frank Sinatra,Jazz; Clássico; Romântico,16.1,828.0,That's Life,"That's life, that's what all the people say.\n..."
120731,Frank Sinatra,Jazz; Clássico; Romântico,16.1,828.0,Days Of Wine and Roses,<with a jauntier melody than Andy Williams' ve...
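
The filter-and-merge step above can be illustrated on toy data. This is only a sketch: the two small frames and all of their values are invented stand-ins for the CSV files, and only the column names are taken from the code above.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (all values are made up for illustration)
toy_lyrics = pd.DataFrame({
    "ALink": ["/artist-a/", "/artist-a/", "/artist-b/"],
    "SName": ["Song 1", "Song 2", "Song 3"],
    "Lyric": ["la la", "na na", "oh oh"],
    "language": ["en", "pt", "en"],
})
toy_artists = pd.DataFrame({
    "Artist": ["Artist A", "Artist B"],
    "Link": ["/artist-a/", "/artist-b/"],
})

# Keep English lyrics only, then attach artist metadata via the link columns
toy_en = toy_lyrics.query("language=='en'")
toy_merged = pd.merge(toy_en, toy_artists, left_on="ALink", right_on="Link")
print(toy_merged[["Artist", "SName", "Lyric"]])
```

The default inner join drops any song whose ALink has no matching Link in the artist table.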


### Find the most popular artists by number of songs

In [3]:
# Find list of artists that have > 500 songs in the current dataset
artists = lyrics_popular.reset_index(drop=True)
artists = artists[["Artist","Genres","Popularity","Songs",'SName']]
# Count the songs actually present in the dataset for each artist
artists_songs = artists.groupby(['Artist','Genres','Popularity','Songs'])["SName"].count()
# Named artist_counts rather than list to avoid shadowing the Python built-in
artist_counts = artists_songs.reset_index()
artist_counts = artist_counts.drop(columns=['Songs'])
artist_counts = artist_counts.rename(columns={'SName':'Songs'})
artist_counts = artist_counts.query("Songs>500")
artist_counts = artist_counts.sort_values(['Songs'],ascending=False).reset_index(drop=True)
artist_counts

Unnamed: 0,Artist,Genres,Popularity,Songs
0,Frank Sinatra,Jazz; Clássico; Romântico,16.1,819
1,Elvis Presley,Rockabilly; Romântico; Rock,23.1,747
2,Dolly Parton,Country,1.3,723
3,Matheus Hardke,Pop/Rock,0.1,707
4,Lil Wayne,R&B; Black Music; Rap,4.3,689
5,Glee,Trilha Sonora; Pop/Rock; Pop,3.0,687
6,Hillsong United,Gospel/Religioso; Pop/Rock; Rock,25.8,646
7,Elton John,Soft Rock; Romântico; Pop/Rock,44.7,638
8,Temas de Filmes,COLETÂNEA; Trilha Sonora; Romântico; Instrumental,8.3,628
9,Chris Brown,Rap; Hip Hop; Pop,11.8,623


### Select an artist to create the lyric corpus

In [4]:
# Collect all lyrics from selected artist
selected_artist = "Bee Gees"
lyrics_artist = lyrics_popular[lyrics_popular.Artist==selected_artist]['Lyric'].values.tolist()
# Concatenate all lyrics, ending each song with a blank line
text = "".join(lyric + "\n\n" for lyric in lyrics_artist)
lines = text.split("\n")
with open("lyrics.txt",'w', encoding="utf-8") as f:
    for line in lines:
        f.write(line)
        f.write("\n")
print('Total number of characters in the corpus is:',len(text))
print('The first 1000 characters of the corpus are as follows:\n\n',text[:1000])

Total number of characters in the corpus is: 564378
The first 1000 characters of the corpus are as follows:

 I know your eyes in the morning sun
I feel you touch me in the pouring rain
And the moment that you wander far from me
I wanna feel you in my arms again

And you come to me on a summer breeze
Keep me warm in your love then you softly leave
And it's me you need to show
How deep is your love

How deep is your love, how deep is your love
I really mean to learn
'Cause we're living in a world of fools
Breaking us down
When they all should let us be
We belong to you and me

I believe in you
And you know the door to my very soul
You're the light in the deepest darkest hour
You're my savior when I fall

And you may not think
That I care for you
When you know down inside that I really do
And it's me you need to show
How deep is your love

How deep is your love, how deep is your love
I really mean to learn
'Cause we're living in a world of fools
Breaking us down
When they all should let 

### Vectorize the text corpus
- Create a vocabulary of all unique characters in the corpus
- Assign an integer to each character (char2idx)
- Use text_as_int to hold the vectorized text corpus
- Use idx2char (map from integer to character) for decoding

In [5]:
# The unique characters in the corpus
vocab = sorted(set(text))
print ('The number of unique characters in the corpus is', len(vocab))
print('A slice of the unique characters set:\n', vocab[:10])

The number of unique characters in the corpus is 93
A slice of the unique characters set:
 ['\n', ' ', '!', '"', '&', "'", '(', ')', '*', '+']


In [6]:
# Create a mapping from unique characters to indices
char2idx = {u:i for i, u in enumerate(vocab)}
# Keep the vocabulary as a NumPy array for decoding predicted indices back to characters
idx2char = np.array(vocab)
# Vectorize the text with a list comprehension
text_as_int = np.array([char2idx[c] for c in text])
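
As a quick sanity check on the mappings, the sketch below encodes a short string and decodes it again. The `toy_` names and the sample string are hypothetical; the logic mirrors char2idx and idx2char above.

```python
import numpy as np

toy_text = "how deep is your love"
toy_vocab = sorted(set(toy_text))
toy_char2idx = {ch: i for i, ch in enumerate(toy_vocab)}
toy_idx2char = np.array(toy_vocab)

# Encode characters to integers, then decode the integers back to characters
encoded = np.array([toy_char2idx[ch] for ch in toy_text])
decoded = "".join(toy_idx2char[encoded])

print(encoded[:8])
print(decoded)
```

If the round trip does not reproduce the original string, the two mappings are out of sync.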

In [7]:
# Create training examples / targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int) 
# for i in char_dataset.take(5): 
#   print(i.numpy())
seq_length = 100 # Number of characters in each input sequence
# examples_per_epoch = len(text)//(seq_length+1) # double-slash for “floor” division
sequences = char_dataset.batch(seq_length+1, drop_remainder=True) 
# for item in sequences.take(5): 
#   print(repr(''.join(idx2char[item.numpy()])))

In [8]:
def split_input_target(chunk):
  input_text = chunk[:-1]
  target_text = chunk[1:]
  return input_text, target_text

dataset = sequences.map(split_input_target)
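
To make the one-character shift concrete, here is a small sketch on a hypothetical five-character chunk. It uses plain Python lists instead of tensors, but the slicing is the same as in split_input_target.

```python
toy_chunk = list("Hello")

# Input drops the last character; target drops the first
toy_input, toy_target = toy_chunk[:-1], toy_chunk[1:]

for inp, tgt in zip(toy_input, toy_target):
    print(f"input {inp!r} -> target {tgt!r}")
```

At every position the target is simply the character that follows the input character, which is what the model learns to predict.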

In [9]:
BUFFER_SIZE = 10000 # Shuffle buffer size: tf.data shuffles by sampling from a buffer of this many elements

BATCH_SIZE = 64 # Batch size

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

print(dataset)

<BatchDataset element_spec=(TensorSpec(shape=(64, 100), dtype=tf.int64, name=None), TensorSpec(shape=(64, 100), dtype=tf.int64, name=None))>
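
The number of training sequences and batches per epoch follows from the corpus size printed earlier. The arithmetic below is a back-of-the-envelope sketch, assuming the corpus length of 564378 characters reported above.

```python
corpus_chars = 564378   # corpus length printed earlier
seq_length = 100
batch_size = 64

# Each training sequence consumes seq_length + 1 characters (input plus shifted target)
num_sequences = corpus_chars // (seq_length + 1)
# drop_remainder=True discards the final partial batch
steps_per_epoch = num_sequences // batch_size

print(num_sequences, steps_per_epoch)  # 5587 87
```

Under these numbers the shuffle buffer of 10000 is larger than the 5587 sequences, so the whole dataset fits in the buffer and is fully shuffled.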


### Build the model

In [10]:
# Length of the vocabulary in chars
vocab_size = len(vocab)
# The embedding dimension
embedding_dim = 256
# Number of RNN units
rnn_units = 1024

In [11]:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
  model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                              batch_input_shape=[batch_size, None]),
    tf.keras.layers.GRU(rnn_units,
                        return_sequences=True,
                        stateful=True,
                        recurrent_initializer='glorot_uniform'),
    tf.keras.layers.Dense(vocab_size)
  ])
  return model

In [12]:
model = build_model(
    vocab_size = len(vocab), # no. of unique characters
    embedding_dim=embedding_dim, # 256
    rnn_units=rnn_units, # 1024
    batch_size=BATCH_SIZE)  # 64 for the training

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (64, None, 256)           23808     
                                                                 
 gru (GRU)                   (64, None, 1024)          3938304   
                                                                 
 dense (Dense)               (64, None, 93)            95325     
                                                                 
Total params: 4,057,437
Trainable params: 4,057,437
Non-trainable params: 0
_________________________________________________________________


#### About the model layers:
- Embedding layer: serves as the input layer, accepting integer character indices and converting them into dense vectors of size embedding_dim.
- GRU layer: a recurrent layer with 1024 units, where GRU stands for Gated Recurrent Unit.
- Dense layer: the output layer, producing one logit per vocabulary character (vocab_size outputs).
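
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the GRU uses two bias vectors per weight set, which matches the Keras default (reset_after=True); with a single bias vector the GRU count would be smaller.

```python
vocab_size, embedding_dim, rnn_units = 93, 256, 1024

# One embedding_dim-sized vector per vocabulary entry
embedding_params = vocab_size * embedding_dim
# Three weight sets (update gate, reset gate, candidate state), each with
# input weights, recurrent weights, and two bias vectors (assumed default)
gru_params = 3 * (rnn_units * (embedding_dim + rnn_units) + 2 * rnn_units)
# Weight matrix plus one bias per output character
dense_params = rnn_units * vocab_size + vocab_size

print(embedding_params, gru_params, dense_params)
print(embedding_params + gru_params + dense_params)
```

These values agree with the 23808, 3938304, and 95325 parameters reported by model.summary().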

### Train the model and save the weights
- Adam as the optimizer
- Sparse categorical cross-entropy as the loss function
 - Sparse categorical cross-entropy is used when the true labels are integer encoded, as in this model
 - Categorical cross-entropy is used when the true labels are one-hot encoded, like [0,0,1]
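
The two losses compute the same quantity and differ only in how the labels are supplied. The NumPy sketch below illustrates this on made-up logits; it is not the TensorFlow implementation, and the `toy_` names are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    # Subtract the row maximum for numerical stability
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

toy_logits = np.array([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])
toy_labels = np.array([0, 2])            # integer-encoded targets
toy_one_hot = np.eye(3)[toy_labels]      # the same targets, one-hot encoded

log_probs = log_softmax(toy_logits)
# Sparse form: pick the log-probability at each integer label
sparse_ce = -log_probs[np.arange(len(toy_labels)), toy_labels]
# Dense form: multiply by the one-hot vector and sum
categorical_ce = -(toy_one_hot * log_probs).sum(axis=-1)

print(sparse_ce)
print(categorical_ce)
```

With a 93-character vocabulary the integer form avoids building a 93-wide one-hot vector for every character in the corpus.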

In [13]:
def loss(labels, logits):
  return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

# example_batch_loss  = loss(target_example_batch, example_batch_predictions)
# print("Prediction shape: ", example_batch_predictions.shape, " (batch_size, sequence_length, vocab_size)")
# print("scalar_loss:      ", example_batch_loss.numpy().mean())

model.compile(optimizer='adam', loss=loss)

In [14]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)
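
The {epoch} placeholder in the prefix is filled in by Keras each time a checkpoint is written. The sketch below only imitates that substitution with str.format to show the resulting file names; it does not write any checkpoints.

```python
import os

checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

# Keras substitutes the epoch number into the {epoch} placeholder;
# str.format is used here only to imitate that substitution
for epoch in (1, 15, 30):
    print(checkpoint_prefix.format(epoch=epoch))
```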

In [15]:
EPOCHS = 30
history = model.fit(dataset, 
                    epochs=EPOCHS, 
                    callbacks=[checkpoint_callback])

Epoch 1/30
...
Epoch 30/30


### Generate New Text
- Use a new model with batch_size = 1
- Load the saved weights
- Use temperature to adjust the variability of the predictions
 - The next character is sampled from a categorical distribution over the model's outputs
 - A higher temperature increases the probability of selecting a less likely character
 - A lower temperature makes the output more predictable
- Select the output length
- Select the start string
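
The effect of temperature can be seen on a small set of made-up logits. The sketch below uses NumPy rather than tf.random.categorical, and the helper name softmax_with_temperature is hypothetical, but the idea is the same: divide the logits by the temperature, then sample from the resulting distribution.

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    # Divide logits by the temperature before normalizing
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()  # numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

toy_logits = [2.0, 1.0, 0.1]
for t in (0.5, 1.0, 2.0):
    print(t, np.round(softmax_with_temperature(toy_logits, t), 3))

# Draw one index from the distribution at temperature 1.0
rng = np.random.default_rng(0)
sampled_id = rng.choice(len(toy_logits), p=softmax_with_temperature(toy_logits, 1.0))
```

At a low temperature the probability mass concentrates on the top logit, while at a high temperature the distribution flattens and unlikely characters are chosen more often.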

In [16]:
# Locate saved weights
tf.train.latest_checkpoint(checkpoint_dir)

'./training_checkpoints/ckpt_30'

In [17]:
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
model.build(tf.TensorShape([1, None]))
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (1, None, 256)            23808     
                                                                 
 gru_1 (GRU)                 (1, None, 1024)           3938304   
                                                                 
 dense_1 (Dense)             (1, None, 93)             95325     
                                                                 
Total params: 4,057,437
Trainable params: 4,057,437
Non-trainable params: 0
_________________________________________________________________


In [18]:
def generate_text(model, num_generate, temperature, start_string):
  input_eval = [char2idx[s] for s in start_string] # string to numbers (vectorizing)
  input_eval = tf.expand_dims(input_eval, 0) # dimension expansion
  text_generated = [] # Empty list to collect the generated characters
  model.reset_states() # Clears the hidden states in the RNN

  for i in range(num_generate): #Run a loop for number of characters to generate
    predictions = model(input_eval) # logits for each position of the current input
    predictions = tf.squeeze(predictions, 0) # remove the batch dimension

    # Sample the next character from a categorical distribution over the logits
    # A higher temperature increases the probability of selecting a less likely character
    # A lower temperature makes the output more predictable
    predictions = predictions / temperature
    predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

    # The predicted character as the next input to the model
    # along with the previous hidden state
    # So the model makes the next prediction based on the previous character
    input_eval = tf.expand_dims([predicted_id], 0) 
    # Also devectorize the number and add to the generated text
    text_generated.append(idx2char[predicted_id]) 

  return (start_string + ''.join(text_generated))

In [19]:
generated_text = generate_text(
                    model, 
                    num_generate=1000, 
                    temperature=1, 
                    start_string=u"Love")
print(generated_text)

Love is sungry

When I see you everywhere

Jimmett dakes
You're my angel on the sun
We turn to stand alone.



Hell on my mind , a stay.

When I see you every mornin,

I can see a miracle , a dialone to do theak to me
I can be strong, ohat, my story 'rn't like to be over.
Lire I can get you there

So forever haven't got a friend in me

Warm ride, warm ride, we can reach the highest silent nothing of life

And when you're out at nightta
If you're not here by me
It's understand what my friends.

I was there inside

Good man I don't change my way
There's a fire line tomorrow
de life , whoa
You did when she shakes all over me , in Can just get ignited,
Let this be my prayer, I will stay, I will go with the thoughts of leaves
I have feed, look at us now
We're all along with your life
Come on our shoulders for me

So want me, I can't let go of you
I'm just a clown that used to be
I stumble is just
Like the dreams , we'll be the secret -- goes living on

No, you can't keep a good man down
Whe

#### References:
- Create Your Own Artificial Shakespeare in 10 Minutes with Natural Language Processing<br>
https://towardsdatascience.com/create-your-own-artificial-shakespeare-in-10-minutes-with-natural-language-processing-1fde5edc8f28
- Dataset:<br>
https://www.kaggle.com/datasets/neisse/scrapped-lyrics-from-6-genres?select=lyrics-data.csv