# Database Creation

As the name implies, the first step of any Data Science project consists of **gathering data**.

In this project, we set out to build a model able to recognize an artist from the lyrics of a song.

Therefore, to collect this data, we scraped the **Genius** website (https://genius.com/). **Genius** is a music media outlet covering the latest music releases and news. However, it is best known for hosting the lyrics, and lyric explanations, of almost every song ever released.

We scraped this website for a few chosen artists using *LyricsGenius*, a Python client for the Genius API.

* [Step 0 : Prerequisites](#first-bullet)
* [Step 1 : Data Scraping](#second-bullet)
* [Step 2 : Data Cleaning & Database Enrichment](#third-bullet)

### Step 0 : Prerequisites <a class="anchor" id="first-bullet"></a>

First off, we installed and imported the required packages.

*NB: `functions` is a Python file containing the custom functions needed to scrape our data and build our database.*

In [1]:
from lyricsgenius import Genius
import pandas as pd
import functions as fc

Using this API requires an access token. Here is the token created for this project.

In [2]:
# token used to scrape Genius
token = 'sxXw2RwH_IyZ_AYE4gvp8Myo7sT0z8B-wEErToK43kDfEXk7pLBf0X7nfauTmh0g'
genius = Genius(token,timeout=45,retries=3)
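A token written directly in a notebook is visible to anyone the notebook is shared with. As a minimal sketch of an alternative, the token could be read from an environment variable. The variable name `GENIUS_ACCESS_TOKEN` and the helper `load_token` are illustrative choices, not part of the original project:

```python
import os


def load_token(env_var="GENIUS_ACCESS_TOKEN", environ=None):
    """Return the access token stored in an environment variable.

    `env_var` and this helper are hypothetical names chosen for illustration.
    `environ` defaults to the real process environment; a plain dict can be
    passed instead, which is convenient for testing.
    """
    environ = os.environ if environ is None else environ
    token = environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return token
```

The returned value could then be passed to the constructor in place of the literal string, for example `Genius(load_token(), timeout=45, retries=3)`.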

We ran into a conflict between virtual environments when using the `genius.lyrics` function. We therefore had to define the `lyrics_for_df` function directly in this file.

In [3]:
def lyrics_for_df(id_):
    return genius.lyrics(song_id=id_)

### Step 1 : Data Scraping <a class="anchor" id="second-bullet"></a>

The following cell lets you type the name of the artist whose discography you want.

Please run the cell and take care not to misspell the artist's name when prompted.

In our examples, we scraped the discographies of the following artists:
- *Drake, Kanye West, Lana Del Rey, Bruno Mars, Adele, Eminem, Rihanna*.

In [5]:
print('Which artist do you want to scrap?')
artist_name = input("Artist name : ")
artist_id = genius.search_artists(artist_name)['sections'][0]['hits'][0]['result']['id']

# creation of a DataFrame containing data on every album from the given artist
albums_artist = fc.albums_data(artist_id, artist_name)


# initialization
df_artist = pd.DataFrame()
compteur = 1

for album_id in albums_artist['id']:
    try:
        # name of the album currently being processed
        album_name = albums_artist.loc[albums_artist['id'] == album_id, 'name'].values[0]
        print("Processing...", f" Album n° {compteur} / {len(albums_artist)}", '|', album_name)
        compteur += 1

        # creation of a DataFrame containing data on every song of the album
        tracklist = fc.tracklist_data(album_id, albums_artist)
        tracklist['Lyrics'] = tracklist['song_id'].apply(lyrics_for_df)

        # concatenation of every album's data to create the discography
        df_artist = pd.concat([df_artist, tracklist], axis=0)

    except Exception:
        # some albums may not include any lyrics (instrumental albums)
        print('Impossible to scrap this album')

df_artist = df_artist.dropna()

df_artist.to_csv(f'artist_data/discography_{artist_name}')
print('Scrapping done')

Which artist do you want to scrap?


Artist name :  Drake


Processing...  Album n° 1 / 19 | Certified Lover Boy
Scrapping done
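The cell above reads the artist id through a chain of fixed indices, which raises an error when the search returns no hit (for instance after a misspelled name). A minimal sketch of a more defensive lookup follows. The nested layout (`sections`, `hits`, `result`, `id`) is assumed from the indexing used in the cell, the helper `first_artist_id` is hypothetical, and unlike the original it returns the first hit found in any section:

```python
def first_artist_id(search_response):
    """Return the id of the first artist hit, or None when nothing matched.

    The nested layout (sections -> hits -> result -> id) is assumed from the
    indexing used in the scraping cell, not taken from the API documentation.
    """
    for section in search_response.get("sections", []):
        for hit in section.get("hits", []):
            result = hit.get("result", {})
            if "id" in result:
                return result["id"]
    return None


# Hand-written stand-in for a search response (the id is a placeholder)
sample = {"sections": [{"hits": [{"result": {"id": 123, "name": "Some Artist"}}]}]}
print(first_artist_id(sample))                        # 123
print(first_artist_id({"sections": [{"hits": []}]}))  # None
```

A `None` result can then be used to ask for the name again instead of letting the cell fail.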


The scraped data is stored in a CSV file with the following format:

In [6]:
df_artist.head(3)

| | Title | song_id | Featuring | Album | album_id | Release Date | Artist | Lyrics |
|---|---|---|---|---|---|---|---|---|
| 0 | Champagne Poetry | 6859296 | Drake | Certified Lover Boy | 585647 | {'year': 2021, 'month': 9, 'day': 3} | Drake | [Part I]\n\n[Intro]\nI love you, I love you, I... |
| 1 | Papi’s Home | 7165569 | Drake | Certified Lover Boy | 585647 | {'year': 2021, 'month': 9, 'day': 3} | Drake | [Intro: Drake & Montell Jordan]\nI know that I... |
| 2 | Girls Want Girls | 7165581 | Drake (Ft. Lil Baby) | Certified Lover Boy | 585647 | {'year': 2021, 'month': 9, 'day': 3} | Drake | [Intro: Drake]\nWoah, woah\nWoah, woah, woah\n... |


### Step 2 : Data Cleaning & Database Enrichment <a class="anchor" id="third-bullet"></a>

To exploit the data further, we must **preprocess it**. As in any Natural Language Processing project, we had to clean the textual data.

Here are the main steps of our text cleaning:
 - **remove the punctuation** and **lowercase** every character
 - **lemmatize** the words (reduce each word to its root form, since our models are statistics-based)
 - remove every **text pattern** specific to Genius
 
Usually, stopwords (words that carry little meaning on their own, such as "i", "you", "a", "an", "the"...) are also removed from the text. In our case, however, the statistical models rely on a TF-IDF matrix, which by construction gives very little weight to words that appear in nearly every document.
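To illustrate why words shared by every song carry little weight, here is a minimal sketch of the textbook inverse document frequency, `log(N / df)`, computed by hand on made-up data. The function name and the toy songs are illustrative only and are not the project's actual vectorizer:

```python
import math


def inverse_document_frequency(documents):
    """Textbook IDF: log(N / df), df being the number of documents containing the word."""
    n_docs = len(documents)
    doc_freq = {}
    for tokens in documents:
        for word in set(tokens):  # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


# Three toy "songs", already tokenized (made-up data)
songs = [
    ["i", "love", "you"],
    ["i", "miss", "you"],
    ["i", "run", "the", "town"],
]
idf = inverse_document_frequency(songs)
print(idf["i"])     # 0.0, since "i" appears in every song
print(idf["town"])  # log(3 / 1), roughly 1.0986
```

With this definition a word present in every document gets a weight of exactly zero. Library implementations often use a smoothed variant (scikit-learn's `TfidfVectorizer` does so by default), in which such words are strongly down-weighted rather than zeroed.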



In [7]:
df_artist['Clean Lyrics'] = df_artist['Lyrics'].apply(fc.lyrics_cleaning)
df_artist['Clean Tokenized Lyrics'] = df_artist['Clean Lyrics'].apply(fc.tokenized_lyrics)
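The bodies of `fc.lyrics_cleaning` and `fc.tokenized_lyrics` are not shown in this notebook. As a rough sketch of what such helpers might do, assuming Genius section headers are always written between square brackets, the functions below strip those headers, lowercase the text, and delete ASCII punctuation. Both function names are hypothetical, and lemmatization is left out because it requires an external NLP library:

```python
import re
import string


def clean_lyrics_sketch(lyrics):
    """Hypothetical stand-in for fc.lyrics_cleaning (lemmatization omitted)."""
    # Drop Genius section headers such as "[Intro: Drake]" or "[Chorus]"
    text = re.sub(r"\[[^\]]*\]", " ", lyrics)
    text = text.lower()
    # Delete ASCII punctuation (typographic quotes such as ’ are left untouched)
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse runs of whitespace, including newlines
    return " ".join(text.split())


def tokenize_sketch(clean_text):
    """Hypothetical stand-in for fc.tokenized_lyrics: split on whitespace."""
    return clean_text.split()


raw = "[Intro: Drake]\nWoah, woah\nWoah, woah, woah"
print(clean_lyrics_sketch(raw))                   # woah woah woah woah woah
print(tokenize_sketch(clean_lyrics_sketch(raw)))  # ['woah', 'woah', 'woah', 'woah', 'woah']
```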

We noticed that additional indicators could be derived from our data. They might be interesting for data visualization and might help improve our models (the number of verses, the length of the song, the presence of an intro or outro, etc.).

The following cell therefore enriches our database.

In [8]:
df_artist['Word Frequency in song'] = df_artist['Clean Tokenized Lyrics'].apply(fc.dict_freq_words)
df_artist['Release Date'] = df_artist['Release Date'].apply(fc.release_date)
df_artist['Intro'] = df_artist['Lyrics'].apply(fc.intro_detection)
df_artist['Interlude'] = df_artist['Lyrics'].apply(fc.nbr_interlude)
df_artist['Chorus'] = df_artist['Lyrics'].apply(fc.nbr_chorus)
df_artist['Bridge'] = df_artist['Lyrics'].apply(fc.nbr_bridge)
df_artist['Pre-Chorus'] = df_artist['Lyrics'].apply(fc.nbr_pre_chorus)
df_artist['Parts'] = df_artist['Lyrics'].apply(fc.nbr_parts)
df_artist['Verses'] = df_artist['Lyrics'].apply(fc.nbr_verses)
df_artist['Outro'] = df_artist['Lyrics'].apply(fc.outro_detection)
df_artist['Song Length'] = df_artist['Clean Tokenized Lyrics'].apply(fc.len_song)
df_artist['Featuring'] = df_artist[['Featuring', 'Artist']].apply(fc.featuring, axis=1)

df_artist.to_csv(f"artist_data/discography_{artist_name}")
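The counting helpers used above (`fc.nbr_chorus`, `fc.intro_detection`, and so on) live in the `functions` file and are not reproduced here. A minimal sketch of how such indicators could be computed, assuming each section is announced by a bracketed header like `[Chorus]` or `[Verse 1]` as in the lyrics sample shown earlier; `count_sections` and `has_section` are hypothetical names:

```python
import re


def count_sections(lyrics, label):
    """Count headers such as "[Chorus]" or "[Chorus: Drake]" for a given label."""
    # The label must directly follow "[", so "Chorus" does not match "[Pre-Chorus]"
    pattern = rf"\[{re.escape(label)}\b[^\]]*\]"
    return len(re.findall(pattern, lyrics, flags=re.IGNORECASE))


def has_section(lyrics, label):
    """Return 1 if at least one header with this label is present, else 0."""
    return int(count_sections(lyrics, label) > 0)


# Made-up lyrics skeleton
lyrics = (
    "[Intro]\nla la\n[Verse 1]\nwords\n[Pre-Chorus]\nmore\n"
    "[Chorus: Drake]\nhook\n[Verse 2]\nwords\n[Chorus]\nhook"
)
print(count_sections(lyrics, "Chorus"))      # 2
print(count_sections(lyrics, "Pre-Chorus"))  # 1
print(count_sections(lyrics, "Verse"))       # 2
print(has_section(lyrics, "Intro"))          # 1
print(has_section(lyrics, "Outro"))          # 0
```

Such a function would be applied to the raw `Lyrics` column rather than the cleaned one, since cleaning removes the bracketed headers.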

After this enrichment step, our database contains the following indicators:

In [13]:
df_artist.drop(['Lyrics','Clean Tokenized Lyrics','Word Frequency in song','Clean Lyrics','song_id','album_id'],axis=1).sample(frac=1).head(3)

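| | Title | Featuring | Album | Release Date | Artist | Intro | Interlude | Chorus | Bridge | Pre-Chorus | Parts | Verses | Outro | Song Length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Champagne Poetry | 0 | Certified Lover Boy | 2021 | Drake | 1 | 2 | 0 | 0 | 0 | 2 | 3 | 1 | 918 |
| 10 | Yebba’s Heartbreak | 1 | Certified Lover Boy | 2021 | Drake | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 133 |
| 13 | 7am on Bridle Path | 0 | Certified Lover Boy | 2021 | Drake | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 784 |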
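In this output the `Release Date` column holds only a year, whereas the scraped data stored a dictionary with `year`, `month` and `day` keys. A minimal sketch of such a conversion, which may differ from the actual `fc.release_date`; the helper name `release_year` is hypothetical, and it also accepts the string form the dictionary would take after a CSV round trip:

```python
import ast


def release_year(value):
    """Return the year from a {'year': ..., 'month': ..., 'day': ...} value.

    Accepts either a dict or its string representation.
    Returns None when no year can be found.
    """
    if isinstance(value, str):
        try:
            value = ast.literal_eval(value)  # safely parse the dict literal
        except (ValueError, SyntaxError):
            return None
    if isinstance(value, dict):
        return value.get("year")
    return None


print(release_year({"year": 2021, "month": 9, "day": 3}))    # 2021
print(release_year("{'year': 2021, 'month': 9, 'day': 3}"))  # 2021
print(release_year("unknown"))                               # None
```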


A CSV file is created for each artist and stored in the *artist_data* folder.

You can scrape data for other artists by repeating the operations from [this point](#second-bullet).