## **Introduction**


Our goal is to build a generative model for song lyrics that can learn the patterns and structures that exist within a corpus of existing song lyrics, and then use this knowledge to generate new lyrics that are similar in style and content to the original corpus.

We follow five steps to build a generative model for song lyrics:

1. Gather a corpus of song lyrics from our dataset. Preprocess the data by removing irrelevant information.

2. Train a language model: Train deep learning models, namely a plain Recurrent Neural Network (RNN) and a Long Short-Term Memory (LSTM) network, on the preprocessed lyrics data. The model learns the relationships between the words and phrases in the corpus and can then generate new lyrics based on this knowledge.

3. Generate new lyrics: Once the model is trained, we can use it to generate new lyrics by giving it a starting prompt or seed. The model will then use its knowledge of the patterns and structures in the corpus to generate new lyrics that are similar in style and content to the original lyrics.

4. Evaluate the results: Evaluate the generated lyrics to see how well they match the style and content of the original corpus. We may use perplexity, custom metrics, or human evaluation to assess the quality of the generated lyrics. We would also like to explore whether an LSTM generates better lyrics than a plain RNN.

5. Refine the model: If the generated lyrics are not of high quality, we would refine the model by adjusting the hyperparameters or training it on a larger or more diverse corpus of lyrics.

We anticipate that generating high-quality song lyrics can be challenging, as there are many nuances and complexities in the language and structure of lyrics. Therefore, it's important to carefully evaluate and refine the model to ensure that it generates high-quality lyrics.
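
Perplexity, mentioned in step 4, can be illustrated with a short sketch. It assumes we already have the probability the model assigned to each true next token; the function name `perplexity` and the example probabilities are hypothetical and do not come from any trained model in this notebook.

```python
import numpy as np

def perplexity(token_probs):
    """Perplexity from the probabilities a model gave to the true next tokens."""
    probs = np.asarray(token_probs, dtype=float)
    if probs.size == 0 or np.any(probs <= 0) or np.any(probs > 1):
        raise ValueError("need a non-empty list of probabilities in (0, 1]")
    # exp of the average negative log-likelihood per token
    return float(np.exp(-np.mean(np.log(probs))))

# Illustrative values only: a model that always spreads its guess
# evenly over four options has a perplexity of about 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

Lower values mean the model was less "surprised" by the held-out lyrics, which makes perplexity a convenient number for comparing the RNN and the LSTM on the same test text.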


### Import Packages

In [2]:
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics
import re
from nltk.stem import WordNetLemmatizer
nltk.download('omw-1.4')
nltk.download('stopwords')
nltk.download('wordnet')

import os
import time
import lzma
import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package omw-1.4 to /home/yyk/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package stopwords to /home/yyk/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/yyk/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Getting the Data Ready







#### Compile a list of top US artists

Download the data from Kaggle here: https://www.kaggle.com/datasets/neisse/scrapped-lyrics-from-6-genres?select=lyrics-data.csv 

Then save it in the current directory. 

In [3]:
df_all = pd.read_csv('lyrics-data.csv') 

In [5]:
df_all.head()

Unnamed: 0,ALink,SName,SLink,Lyric,language
0,/ivete-sangalo/,Arerê,/ivete-sangalo/arere.html,"Tudo o que eu quero nessa vida,\nToda vida, é\...",pt
1,/ivete-sangalo/,Se Eu Não Te Amasse Tanto Assim,/ivete-sangalo/se-eu-nao-te-amasse-tanto-assim...,Meu coração\nSem direção\nVoando só por voar\n...,pt
2,/ivete-sangalo/,Céu da Boca,/ivete-sangalo/chupa-toda.html,É de babaixá!\nÉ de balacubaca!\nÉ de babaixá!...,pt
3,/ivete-sangalo/,Quando A Chuva Passar,/ivete-sangalo/quando-a-chuva-passar.html,Quando a chuva passar\n\nPra quê falar\nSe voc...,pt
4,/ivete-sangalo/,Sorte Grande,/ivete-sangalo/sorte-grande.html,A minha sorte grande foi você cair do céu\nMin...,pt


In [6]:
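# ALink looks like '/ivete-sangalo/': drop the slashes, then turn hyphens into spaces.
# regex=False makes the literal replacement explicit (recent pandas no longer treats the pattern as a regex by default).
df_all['artist'] = df_all['ALink'].str.replace('/', '', regex=False)
df_all['artist'] = df_all['artist'].str.replace('-', ' ', regex=False)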
df_all = df_all[df_all.language == 'en']
df_all.head()

Unnamed: 0,ALink,SName,SLink,Lyric,language,artist
69,/ivete-sangalo/,Careless Whisper,/ivete-sangalo/careless-whisper.html,I feel so unsure\nAs I take your hand and lead...,en,ivete sangalo
86,/ivete-sangalo/,Could You Be Loved / Citação Musical do Rap: S...,/ivete-sangalo/could-you-be-loved-citacao-musi...,"Don't let them fool, ya\nOr even try to school...",en,ivete sangalo
88,/ivete-sangalo/,Cruisin' (Part. Saulo),/ivete-sangalo/cruisin-part-saulo.html,"Baby, let's cruise, away from here\nDon't be c...",en,ivete sangalo
111,/ivete-sangalo/,Easy,/ivete-sangalo/easy.html,"Know it sounds funny\nBut, I just can't stand ...",en,ivete sangalo
140,/ivete-sangalo/,For Your Babies (The Voice cover),/ivete-sangalo/for-your-babies-the-voice-cover...,You've got that look again\nThe one I hoped I ...,en,ivete sangalo


We started from the 50 artists with the most English-language songs in the dataset and excluded those who are not from the United States or who are collectives of multiple artists. The refined list contains 32 US artists. Filtering the dataset this way gives a more cohesive subset for training, which should help the model learn the underlying structure and patterns.


In [7]:
# Count songs per artist and keep the 50 artists with the most songs
artist_counts = df_all['artist'].value_counts()
top50 = artist_counts[:50]
# Filter out non-American singers/bands and multi-artist collectives
exclusion = ['temas de filmes', 'matheus hardke', 'glee', 'hillsong united',
             'elton john', 'bee gees', 'elvis costello', 'paul mccartney',
             'vineyard', 'david bowie', 'the rolling stones', 'rod stewart',
             'van morrison', 'kylie minogue', 'u2', 'the beatles',
             'eric clapton', 'drake']
US_top = top50.loc[~top50.index.isin(exclusion)]

In [8]:
print('Number of Top US Artists:',len(US_top))
US_top

Number of Top US Artists: 32


frank sinatra        819
elvis presley        747
dolly parton         723
lil wayne            689
chris brown          623
guided by voices     620
prince               564
johnny cash          555
bob dylan            548
george jones         534
neil young           515
bruce springsteen    502
snoop dogg           485
eminem               484
50 cent              466
roy orbison          438
ella fitzgerald      421
taylor swift         385
waylon jennings      383
2pac tupac shakur    382
bb king              371
bon jovi             367
george strait        365
madonna              360
diana ross           355
bill monroe          351
beach boys           332
barry manilow        330
alice cooper         326
nas                  324
ray charles          322
beck                 320
Name: artist, dtype: int64

With the above list, we conducted further data preprocessing to remove songs featuring other artists and eliminated text enclosed within parentheses "()" and square brackets "[]". The reason for these adjustments is to simplify the dataset and reduce potential noise, ensuring a more focused and consistent training experience for the model. 
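
The cleaning itself lives in "data_cleaning.ipynb" and is not reproduced here. As a rough sketch of what such a step could look like, the helpers below drop songs whose title mentions a featured artist and strip text inside "()" and "[]". The function names, the regular expressions, and the choice of "feat", "ft", "featuring", and "Part." as markers are all assumptions for illustration, not the actual rules used.

```python
import re

import pandas as pd

# Hypothetical patterns; the real rules are in data_cleaning.ipynb.
BRACKETED = re.compile(r"\([^)]*\)|\[[^\]]*\]")
FEATURING = r"\b(?:feat|ft|featuring)\b|\bpart\."

def strip_bracketed(text):
    """Remove (...) and [...] spans, then tidy leftover spaces on each line."""
    without = BRACKETED.sub("", str(text))
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in without.split("\n")]
    return "\n".join(lines)

def drop_featured_songs(df, title_col="SName"):
    """Keep only rows whose song title does not mention a featured artist."""
    titles = df[title_col].astype(str)
    mask = titles.str.contains(FEATURING, case=False, regex=True)
    return df[~mask]

# Toy rows, not taken from the dataset.
songs = pd.DataFrame({
    "SName": ["Song A", "Song B (feat. Someone)"],
    "Lyric": ["hello (yeah) world [Chorus]\nnext  line", "unused"],
})
clean = drop_featured_songs(songs)
clean = clean.assign(Lyric=clean["Lyric"].map(strip_bracketed))
print(clean)
```

Line breaks are preserved on purpose, since they carry the verse structure that a lyrics model should be able to learn.
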
Now we import the new clean dataframe generated in "data_cleaning.ipynb": 

In [5]:
df = pd.read_csv('../clean_lyrics_df.csv')

In [3]:
df.head()

Unnamed: 0,ALink,SName,SLink,Lyric,language
0,50 cent,In da Club,/50-cent/in-da-club.html,"go, go, go, go\ngo, go, go shawty\nit's your b...",en
1,50 cent,21 Questions,/50-cent/21-questions.html,new york city!\nyou are now rapping...with 50 ...,en
2,50 cent,P.I.M.P.,/50-cent/p-i-m-p.html,i don't know what you heard about me\nbut a b*...,en
3,50 cent,Many Men (Wish Death),/50-cent/many-men-wish-death.html,man we gotta go get something to eat man\ni'm ...,en
4,50 cent,Candy Shop,/50-cent/candy-shop.html,yeah...\nuh huh\nso seductive\ni'll take you t...,en


In [12]:
df.language.unique()

array(['en'], dtype=object)

In [21]:
df.columns = ["artist", "songname", "songlink", "lyric", "language"]
df.head()

Unnamed: 0,artist,songname,songlink,lyric,language
0,50 cent,In da Club,/50-cent/in-da-club.html,"go, go, go, go\ngo, go, go shawty\nit's your b...",en
1,50 cent,21 Questions,/50-cent/21-questions.html,new york city!\nyou are now rapping...with 50 ...,en
2,50 cent,P.I.M.P.,/50-cent/p-i-m-p.html,i don't know what you heard about me\nbut a b*...,en
3,50 cent,Many Men (Wish Death),/50-cent/many-men-wish-death.html,man we gotta go get something to eat man\ni'm ...,en
4,50 cent,Candy Shop,/50-cent/candy-shop.html,yeah...\nuh huh\nso seductive\ni'll take you t...,en


In [24]:
#check if there is any lyric not in english
df[df.language != 'en']


Unnamed: 0,artist,songname,songlink,lyric,language


In [25]:
# Keep the English-only frame under the name used in the rest of the notebook
df_en = df.copy()
# Export the DataFrame to a pickle file
pickle_file = "df_en.pickle"
df_en.to_pickle(pickle_file)

Although we developed the function below, we did not go further with stopword removal or lemmatization for the lyrics used by the generative model. In some cases, stopwords carry important contextual information and contribute to the overall meaning and tone of the lyrics. Some artists may use certain stopwords more than others, and we hope to see that pattern in the generated lyrics. Lemmatization can also lose information. For example, "loving" and "loved" have different meanings and may be used in different contexts, so reducing both of them to "love" may lose some sentiment.
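
To make the point about artist-specific stopword usage concrete, here is a small sketch that measures each stopword's share of the tokens in a lyric. The tiny stopword set, the function name `stopword_profile`, and the example line are made up for illustration; the notebook itself uses NLTK's English stopword list.

```python
from collections import Counter

# Tiny hypothetical stopword set; NLTK's English list is much longer.
TOY_STOPWORDS = frozenset({"i", "you", "the", "a", "and", "to", "me", "my"})

def stopword_profile(lyric, stopwords=TOY_STOPWORDS):
    """Return each stopword's share of all whitespace-separated tokens."""
    tokens = lyric.lower().split()
    if not tokens:
        return {}
    counts = Counter(t for t in tokens if t in stopwords)
    return {word: n / len(tokens) for word, n in counts.items()}

# Made-up line, not taken from the dataset.
profile = stopword_profile("you and me and the road")
print(profile)
```

Comparing such profiles across artists is one way to check whether removing stopwords would erase a stylistic signal that the model could otherwise pick up.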


In [26]:
stop_word_list = set(stopwords.words('english'))
lemma = WordNetLemmatizer()
def text_preprocess(sentence, stopwords_removal=True):
    '''Lowercase one lyric string, replace non-alphanumeric characters with spaces,
    optionally drop stopwords, and lemmatize the remaining tokens.'''
    sentence = str(sentence).lower()  # lower case
    sentence = re.sub(r'[^a-z0-9]', ' ', sentence)  # replace punctuation with spaces
    tokens = sentence.split()
    clean_text = []
    for item in tokens:
        # Skip stopwords only when removal is requested
        if stopwords_removal and item in stop_word_list:
            continue
        clean_text.append(lemma.lemmatize(item))
    return " ".join(clean_text)

In [27]:
df_en['cleaned_lyric'] = df_en['lyric'].apply(lambda x:text_preprocess(x,stopwords_removal = False)) 

In [29]:
df_en.head()

Unnamed: 0,artist,songname,songlink,lyric,language,cleaned_lyric
0,50 cent,In da Club,/50-cent/in-da-club.html,"go, go, go, go\ngo, go, go shawty\nit's your b...",en,go go go go go go go shawty it s your birthday...
1,50 cent,21 Questions,/50-cent/21-questions.html,new york city!\nyou are now rapping...with 50 ...,en,new york city you are now rapping with 50 cent...
2,50 cent,P.I.M.P.,/50-cent/p-i-m-p.html,i don't know what you heard about me\nbut a b*...,en,i don t know what you heard about me but a b c...
3,50 cent,Many Men (Wish Death),/50-cent/many-men-wish-death.html,man we gotta go get something to eat man\ni'm ...,en,man we gotta go get something to eat man i m h...
4,50 cent,Candy Shop,/50-cent/candy-shop.html,yeah...\nuh huh\nso seductive\ni'll take you t...,en,yeah uh huh so seductive i ll take you to the ...


In [30]:
os.makedirs('NN_Test_Data', exist_ok=True)  # make sure the output folder exists
df_en.to_csv('NN_Test_Data/preprocessed_english_lyrics.csv', index=False)

#### Filter by Artist Name

In [31]:
data = df_en
data

Unnamed: 0,artist,songname,songlink,lyric,language,cleaned_lyric
0,50 cent,In da Club,/50-cent/in-da-club.html,"go, go, go, go\ngo, go, go shawty\nit's your b...",en,go go go go go go go shawty it s your birthday...
1,50 cent,21 Questions,/50-cent/21-questions.html,new york city!\nyou are now rapping...with 50 ...,en,new york city you are now rapping with 50 cent...
2,50 cent,P.I.M.P.,/50-cent/p-i-m-p.html,i don't know what you heard about me\nbut a b*...,en,i don t know what you heard about me but a b c...
3,50 cent,Many Men (Wish Death),/50-cent/many-men-wish-death.html,man we gotta go get something to eat man\ni'm ...,en,man we gotta go get something to eat man i m h...
4,50 cent,Candy Shop,/50-cent/candy-shop.html,yeah...\nuh huh\nso seductive\ni'll take you t...,en,yeah uh huh so seductive i ll take you to the ...
...,...,...,...,...,...,...
13942,barry manilow,You Oughta Be Home With Me,/barry-manilow/you-oughta-be-home-with-me.html,"everybody's here, spinnin' the bottle\neverybo...",en,everybody s here spinnin the bottle everybody ...
13943,barry manilow,You're Leaving Too Soon,/barry-manilow/youre-leaving-too-soon.html,you're leavin' too soon\nyou oughta try believ...,en,you re leavin too soon you oughta try believin...
13944,barry manilow,You're Looking Hot Tonight,/barry-manilow/youre-looking-hot-tonight.html,you're looking hot tonight\nbarry manilow\nby:...,en,you re looking hot tonight barry manilow by 1a...
13945,barry manilow,You're There,/barry-manilow/youre-there.html,our friends all use the past tense when they s...,en,our friend all use the past tense when they sp...


As discussed above, although we have a cleaned version of the lyrics, we export the original lyrics for the generative model because it is character-based. We would like the model to learn the patterns of the original lyrics.
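
For readers unfamiliar with character-based models, the sketch below shows one common way to turn raw lyric text into training pairs: every distinct character gets an integer id, and each window of `seq_len` characters is paired with the character that follows it. The function names and the window length of 10 are assumptions for illustration, not the settings used for the models in this project.

```python
import numpy as np

def build_char_vocab(text):
    """Map every distinct character to an integer id, and back."""
    chars = sorted(set(text))
    char_to_id = {ch: i for i, ch in enumerate(chars)}
    id_to_char = {i: ch for ch, i in char_to_id.items()}
    return char_to_id, id_to_char

def make_sequences(text, char_to_id, seq_len=10):
    """Slice text into (input window, next character) training pairs."""
    ids = np.array([char_to_id[ch] for ch in text], dtype=np.int64)
    n = len(ids) - seq_len
    if n <= 0:
        raise ValueError("text must be longer than seq_len")
    inputs = np.stack([ids[i:i + seq_len] for i in range(n)])
    targets = ids[seq_len:]
    return inputs, targets

# Opening of the first lyric shown in df.head() above.
sample = "go, go, go, go\ngo, go, go shawty\n"
char_to_id, id_to_char = build_char_vocab(sample)
inputs, targets = make_sequences(sample, char_to_id, seq_len=10)
print(inputs.shape, targets.shape)
```

Because commas and newlines get their own ids, the model sees punctuation and line breaks exactly as the artist wrote them, which is the reason for exporting the original rather than the cleaned lyrics.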

In [32]:
# Get the top n most common names
top_artist = US_top.index
print(len(top_artist))
top_artist = ['frank sinatra', 'elvis presley', 'dolly parton', 'lil wayne',
       'chris brown', 'guided by voices', 'prince', 'johnny cash', 'bob dylan',
       'george jones', 'neil young', 'bruce springsteen', 'snoop dogg',
       'eminem', '50 cent', 'roy orbison', 'ella fitzgerald', 'taylor swift',
       'waylon jennings', '2pac tupac shakur', 'bb king', 'bon jovi',
       'george strait', 'madonna', 'diana ross', 'bill monroe', 'beach boys',
       'barry manilow', 'alice cooper', 'nas', 'ray charles', 'beck']

32


In [34]:
# Filter the dataframe by the top artists
data_topUS = data[data['artist'].isin(top_artist)]
print(len(data_topUS))


13947


In [35]:
# export the 'lyric' column to a text file
data_topUS['lyric'].to_csv('NN_Test_Data/topUS32.txt', header=False, index=False)

### Extract the text file for original lyrics

In [36]:
def extract_artist(data, artist_name):
    '''Write one artist's original lyrics to NN_Test_Data/<artist_name>.txt.'''
    df_artist = data[data.artist == artist_name]
    df_artist['lyric'].to_csv('NN_Test_Data/{}.txt'.format(artist_name), mode='w', header=False, index=False)



In [37]:
for artist in top_artist:
    extract_artist(data,artist)