## Data Clean Up
Looking through the data collected from the Spotify and Genius APIs, there are several duplicate tracks as well as repeated lyrics, which need to be resolved. Some lyrics are in Spanish and need to be separated from the English lyrics in order to maintain an accurate representation of each language. There are also several null values for the lyrics, and several non-lyric values retrieved from the Genius API, which need to be removed.

The Spotify API did not return any genres for the songs, which makes the genres column insignificant to the analysis. However, using the artist data in ArtistDetails.csv, the genres for the songs can be filled in with each artist's main genres to get a more inclusive analysis. To do that, it is important to track the unique genres and create an algorithm that finds the main genre under which the artist would be categorized. For example, if an artist has 'soft pop' and 'canadian pop' as genres, then the main genre would be pop, and so on.

### Imports

In [1152]:
import pandas as pd
import numpy as np
import regex as re

First, the CSV containing the data is read and samples are taken in order to identify the distribution and attributes of the data collected.

In [1153]:
df = pd.read_csv('/Users/mariamtamer/VSCodeProjects/lyricalanalysis copy/All_Songs.csv')

In [1154]:
df.head()

Unnamed: 0,song_artists,uri,track_name,duration_ms,explicit,track_popularity,track_number,album_name,album_artist,album_release_date,...,loudness,mode,speechiness,tempo,time_signature,valence,song_lyrics,lyrics_page_views,cleaned_title,featured_artists
0,['Drake'],spotify:track:2HSmyk2qMN8WQjuGhaQgCk,Champagne Poetry,336511,True,82,1,Certified Lover Boy,Drake,2021-09-03,...,-7.012,0.0,0.326,86.743,4.0,0.496,"Champagne Poetry Lyrics\n\nI love you, I love ...",688853.0,Champagne Poetry,
1,['Drake'],spotify:track:6jy9yJfgCsMHdu2Oz4BGKX,Papi’s Home,178623,True,76,2,Certified Lover Boy,Drake,2021-09-03,...,-6.157,1.0,0.313,140.177,4.0,0.588,Papi’s Home Lyrics\nI know that I hurt you\nYe...,445883.0,Papi’s Home,
2,"['Drake', 'Lil Baby']",spotify:track:37Nqx7iavZpotJSDXZWbJ3,Girls Want Girls (with Lil Baby),221979,True,86,3,Certified Lover Boy,Drake,2021-09-03,...,-8.726,0.0,0.29,86.975,4.0,0.381,,,,
3,"['Drake', 'Lil Durk', 'Giveon']",spotify:track:61S79KIVA4I9FXbnsylEHT,In The Bible (with Lil Durk & Giveon),296568,True,79,4,Certified Lover Boy,Drake,2021-09-03,...,-8.35,0.0,0.297,143.07,4.0,0.147,"In The Bible Lyrics\nOkay, okay, okay\nCountin...",439186.0,In The Bible,"['GIVĒON', 'Lil Durk']"
4,"['Drake', 'JAY-Z']",spotify:track:4VCbgIdr8ptegWeJpqLVHH,Love All (with JAY-Z),228461,True,77,5,Certified Lover Boy,Drake,2021-09-03,...,-5.442,1.0,0.287,92.131,4.0,0.155,,,,


In [1155]:
df.sample(10)

Unnamed: 0,song_artists,uri,track_name,duration_ms,explicit,track_popularity,track_number,album_name,album_artist,album_release_date,...,loudness,mode,speechiness,tempo,time_signature,valence,song_lyrics,lyrics_page_views,cleaned_title,featured_artists
14072,['Demi Lovato'],spotify:track:6zBQ1w06kbswixA2LvnFIv,Butterfly,157252,False,44,18,Dancing With The Devil…The Art of Starting Over,Demi Lovato,2021-04-02,...,-5.229,1.0,0.0362,139.853,4.0,0.469,,,,
77069,['Alejandro Sanz'],spotify:track:5Vf0C2h6SgNO2WbWm6TNNh,Amiga mía,304013,False,50,3,MTV Unplugged,Alejandro Sanz,2001-11-19,...,-8.952,0.0,0.0311,180.343,3.0,0.128,"Amiga Mía Lyrics\nAmiga mía, lo sé, solo vives...",9838.0,Amiga Mía,
155101,"['Ella Fitzgerald', 'Louis Armstrong']",spotify:track:24jvRTDOvNENB2umZlBEIf,A Fine Romance,235973,False,28,10,The Complete Ella And Louis On Verve,Ella Fitzgerald,1997-05-20,...,-14.11,0.0,0.16,173.733,3.0,0.707,"A Fine Romance LyricsA fine romance, with no k...",6240.0,A Fine Romance,
39479,"['French Montana', 'Alkaline']",spotify:track:6VYHQu2gmnyrQhBcI3NzxY,Formula,234274,True,37,15,Jungle Rules,French Montana,2017-07-14,...,-23.435,0.0,0.33,124.603,4.0,0.664,"Formula Lyrics\nYes!\nEverything spicy, eeh?!\...",11188.0,Formula,['Alkaline']
92707,['Blake Shelton'],spotify:track:5kFN8BG2qO933G3gFxVvYv,I Can't Walk Away,221386,False,10,13,Pure BS (Deluxe Edition),Blake Shelton,2007-05-01,...,-7.997,1.0,0.0309,137.96,4.0,0.129,I Can’t Walk Away Lyrics\nThis morning when I ...,,I Can’t Walk Away,
132677,['Scorpions'],spotify:track:2Q8bBl1ZBQW61cdOuU7ZKb,Lorelei,272906,False,2,7,Sting In The Tail,Scorpions,2010-08-30,...,-5.676,0.0,0.0826,165.638,4.0,0.417,Lorelei Lyrics\nThere was a time when we saile...,6556.0,Lorelei,
60140,['Bruce Springsteen'],spotify:track:6iqlxdQjvNlAuUrHCgMau8,It's Hard To Be A Saint In The City - Live at ...,327773,False,27,8,"Hammersmith Odeon, London '75",Bruce Springsteen,2006-02-28,...,-6.028,1.0,0.34,158.569,4.0,0.407,,,,
23142,['Gunna'],spotify:track:5bWYPkdp7tPbp1dhE9AX43,Yao Ming,149613,True,55,6,Drip or Drown 2,Gunna,2019-02-22,...,-5.103,1.0,0.0742,169.751,4.0,0.308,"Yao Ming Lyrics\nYeah, yeah\nYeah, uh\n\nYao M...",52445.0,Yao Ming,
152318,['Dean Martin'],spotify:track:3Y2jaEYu4R9ECTjL2D260H,Brahms' Lullaby,180400,False,21,12,Sleep Warm,Dean Martin,1959,...,-6.536,1.0,0.09,116.123,4.0,0.164,Brahm’s Lullaby LyricsLullaby and good night\n...,,Brahm’s Lullaby,
31981,['The Rolling Stones'],spotify:track:3omdSLoDVN74I8mqOzHHDx,She's A Rainbow - Full Version / With Intro,275160,False,25,6,Their Satanic Majesties Request,The Rolling Stones,1967-12-08,...,-9.135,1.0,0.042,109.143,4.0,0.533,,,,


### Clean Up
As seen above, the column names are inconsistent and not descriptive, which can make them confusing to work with, so it is necessary to have column names that state exactly what each column holds.

Additionally, the columns can be reordered in a more logical fashion so that closely related attributes follow each other and can be viewed side by side.

#### Cleaning Column Names

In [1156]:
df.rename(columns = {'song_artists': 'track_artists', 'uri':'track_uri', 'duration_ms': 'track_duration_ms', 'explicit': 'track_is_explicit', 
'label': 'album_record_label', 'genres': 'track_genres', 'song_lyrics': 'track_lyrics', 'cleaned_title': 'cleaned_track_name', 
'acousticness': 'track_acousticness', 'danceability':'track_danceability', 'energy': 'track_energy', 'instrumentalness': 'track_instrumentalness', 
'key':'track_key', 'liveness': 'track_liveness' ,'loudness': 'track_loudness', 'mode': 'track_mode', 'speechiness': 'track_speechiness', 
'tempo': 'track_tempo', 'time_signature': 'track_time_signature', 'valence': 'track_valence'}, inplace = True)

In [1157]:
column_names = ['track_uri', 'track_name', 'cleaned_track_name', 'track_artists', 'featured_artists', 'track_is_explicit', 'track_popularity', 'track_genres', 'track_duration_ms', 'track_time_signature', 'track_acousticness', 
'track_danceability', 'track_energy', 'track_instrumentalness', 'track_key', 'track_mode', 'track_liveness', 'track_loudness', 'track_speechiness', 'track_tempo', 'track_valence', 
'track_lyrics', 'lyrics_page_views', 'track_number', 'album_name', 'album_artist', 'album_release_date', 'album_popularity','album_record_label', 'album_cover']

df = df.reindex(columns=column_names)

In [1158]:
df

Unnamed: 0,track_uri,track_name,cleaned_track_name,track_artists,featured_artists,track_is_explicit,track_popularity,track_genres,track_duration_ms,track_time_signature,...,track_valence,track_lyrics,lyrics_page_views,track_number,album_name,album_artist,album_release_date,album_popularity,album_record_label,album_cover
0,spotify:track:2HSmyk2qMN8WQjuGhaQgCk,Champagne Poetry,Champagne Poetry,['Drake'],,True,82,,336511,4.0,...,0.496,"Champagne Poetry Lyrics\n\nI love you, I love ...",688853.0,1,Certified Lover Boy,Drake,2021-09-03,95,OVO,https://i.scdn.co/image/ab67616d00001e02cd945b...
1,spotify:track:6jy9yJfgCsMHdu2Oz4BGKX,Papi’s Home,Papi’s Home,['Drake'],,True,76,,178623,4.0,...,0.588,Papi’s Home Lyrics\nI know that I hurt you\nYe...,445883.0,2,Certified Lover Boy,Drake,2021-09-03,95,OVO,https://i.scdn.co/image/ab67616d00001e02cd945b...
2,spotify:track:37Nqx7iavZpotJSDXZWbJ3,Girls Want Girls (with Lil Baby),,"['Drake', 'Lil Baby']",,True,86,,221979,4.0,...,0.381,,,3,Certified Lover Boy,Drake,2021-09-03,95,OVO,https://i.scdn.co/image/ab67616d00001e02cd945b...
3,spotify:track:61S79KIVA4I9FXbnsylEHT,In The Bible (with Lil Durk & Giveon),In The Bible,"['Drake', 'Lil Durk', 'Giveon']","['GIVĒON', 'Lil Durk']",True,79,,296568,4.0,...,0.147,"In The Bible Lyrics\nOkay, okay, okay\nCountin...",439186.0,4,Certified Lover Boy,Drake,2021-09-03,95,OVO,https://i.scdn.co/image/ab67616d00001e02cd945b...
4,spotify:track:4VCbgIdr8ptegWeJpqLVHH,Love All (with JAY-Z),,"['Drake', 'JAY-Z']",,True,77,,228461,4.0,...,0.155,,,5,Certified Lover Boy,Drake,2021-09-03,95,OVO,https://i.scdn.co/image/ab67616d00001e02cd945b...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
159226,spotify:track:0VxTtE5HoNMf9sp30j6c9V,Try Again,,['Westlife'],,False,47,,214866,3.0,...,0.381,,,14,Westlife,Westlife,1999-11-01,73,RCA Records Label,https://i.scdn.co/image/ab67616d00001e0244ead2...
159227,spotify:track:3EHx4H0FsTplZrcFSeuLeE,What I Want Is What I Got,What I Want Is What I’ve Got,['Westlife'],,False,46,,213066,4.0,...,0.744,What I Want Is What I’ve Got Lyrics\nAll that ...,,15,Westlife,Westlife,1999-11-01,73,RCA Records Label,https://i.scdn.co/image/ab67616d00001e0244ead2...
159228,spotify:track:4GfGx2zvY8pIwf2o2SAufU,We Are One,,['Westlife'],,False,45,,222893,4.0,...,0.426,,,16,Westlife,Westlife,1999-11-01,73,RCA Records Label,https://i.scdn.co/image/ab67616d00001e0244ead2...
159229,spotify:track:7dODnrD8danC9FD5xLb9Tu,Can't Lose What You Never Had,Can’t Lose What You Never Had,['Westlife'],,False,45,,264485,4.0,...,0.656,Can’t Lose What You Never Had Lyrics\nBaby you...,6644.0,17,Westlife,Westlife,1999-11-01,73,RCA Records Label,https://i.scdn.co/image/ab67616d00001e0244ead2...


In [1159]:
# Let's see df's dimensionality with df.shape

df.shape

(159231, 30)

#### Defining the Key Signature Column
According to the Spotify API documentation and some background knowledge in music, the key signature of a song is represented by integer values in two columns: key and mode. The combination of the key and the mode can be mapped to the more commonly known key signature names, which gives a more meaningful description of the data collected. Therefore, two dictionaries are defined: one for when the mode is 1 (major key) and another for when the mode is 0 (minor key). Each dictionary maps the key to its full key signature. These are used to create an extra column holding the key signature of each track.

In [1160]:
mode_1 = {0: 'C Major', 1: 'D♭ Major', 2: 'D Major', 3: 'E♭ Major', 4: 'E Major', 5: 'F Major', 6: 'F# Major', 7: 'G Major', 8: 'A♭ Major', 9: 'A Major', 10: 'B♭ Major', 11: 'B Major'}
mode_0 = {0: 'C Minor', 1: 'C# Minor', 2: 'D Minor', 3: 'D# Minor', 4: 'E Minor', 5: 'F Minor', 6: 'F# Minor', 7: 'G Minor', 8: 'G# Minor', 9: 'A Minor', 10: 'B♭ Minor', 11: 'B Minor'}

In [1161]:
def get_key_signature(mode, key):
    if (mode == 1):
        key_signature = mode_1.get(key)
    elif (mode == 0):
        key_signature = mode_0.get(key)
    else:
        key_signature = None
    return key_signature
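
A quick check of the lookup on a few made-up inputs can help confirm how it behaves at the edges. The sketch below redefines shortened copies of the two dictionaries (two keys each) so that it runs on its own; the sample values are hypothetical. It relies on the fact that Python treats `7.0` and `7` as the same dictionary key, which matters because the mode column appears as floats in the sample above.

```python
# Shortened copies of the notebook's dictionaries, for illustration only.
mode_1 = {0: 'C Major', 7: 'G Major'}
mode_0 = {0: 'C Minor', 9: 'A Minor'}

def get_key_signature(mode, key):
    """Return the key signature name, or None when mode or key is unknown."""
    if mode == 1:
        return mode_1.get(key)
    if mode == 0:
        return mode_0.get(key)
    return None

nan = float('nan')
results = [
    get_key_signature(1.0, 7.0),  # floats hash like the equal ints
    get_key_signature(0, 9),
    get_key_signature(nan, 0),    # missing mode: NaN equals neither 0 nor 1
    get_key_signature(1, 3),      # key absent from the shortened dictionary
]
print(results)  # ['G Major', 'A Minor', None, None]
```

Rows with a missing mode or key therefore end up with `None` in the new column rather than raising an error.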

In [1162]:
key_signature = df.apply(lambda x: get_key_signature(x.track_mode, x.track_key), axis= 1)
df.insert(13, 'track_key_signature', key_signature)
df

Unnamed: 0,track_uri,track_name,cleaned_track_name,track_artists,featured_artists,track_is_explicit,track_popularity,track_genres,track_duration_ms,track_time_signature,...,track_valence,track_lyrics,lyrics_page_views,track_number,album_name,album_artist,album_release_date,album_popularity,album_record_label,album_cover
0,spotify:track:2HSmyk2qMN8WQjuGhaQgCk,Champagne Poetry,Champagne Poetry,['Drake'],,True,82,,336511,4.0,...,0.496,"Champagne Poetry Lyrics\n\nI love you, I love ...",688853.0,1,Certified Lover Boy,Drake,2021-09-03,95,OVO,https://i.scdn.co/image/ab67616d00001e02cd945b...
1,spotify:track:6jy9yJfgCsMHdu2Oz4BGKX,Papi’s Home,Papi’s Home,['Drake'],,True,76,,178623,4.0,...,0.588,Papi’s Home Lyrics\nI know that I hurt you\nYe...,445883.0,2,Certified Lover Boy,Drake,2021-09-03,95,OVO,https://i.scdn.co/image/ab67616d00001e02cd945b...
2,spotify:track:37Nqx7iavZpotJSDXZWbJ3,Girls Want Girls (with Lil Baby),,"['Drake', 'Lil Baby']",,True,86,,221979,4.0,...,0.381,,,3,Certified Lover Boy,Drake,2021-09-03,95,OVO,https://i.scdn.co/image/ab67616d00001e02cd945b...
3,spotify:track:61S79KIVA4I9FXbnsylEHT,In The Bible (with Lil Durk & Giveon),In The Bible,"['Drake', 'Lil Durk', 'Giveon']","['GIVĒON', 'Lil Durk']",True,79,,296568,4.0,...,0.147,"In The Bible Lyrics\nOkay, okay, okay\nCountin...",439186.0,4,Certified Lover Boy,Drake,2021-09-03,95,OVO,https://i.scdn.co/image/ab67616d00001e02cd945b...
4,spotify:track:4VCbgIdr8ptegWeJpqLVHH,Love All (with JAY-Z),,"['Drake', 'JAY-Z']",,True,77,,228461,4.0,...,0.155,,,5,Certified Lover Boy,Drake,2021-09-03,95,OVO,https://i.scdn.co/image/ab67616d00001e02cd945b...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
159226,spotify:track:0VxTtE5HoNMf9sp30j6c9V,Try Again,,['Westlife'],,False,47,,214866,3.0,...,0.381,,,14,Westlife,Westlife,1999-11-01,73,RCA Records Label,https://i.scdn.co/image/ab67616d00001e0244ead2...
159227,spotify:track:3EHx4H0FsTplZrcFSeuLeE,What I Want Is What I Got,What I Want Is What I’ve Got,['Westlife'],,False,46,,213066,4.0,...,0.744,What I Want Is What I’ve Got Lyrics\nAll that ...,,15,Westlife,Westlife,1999-11-01,73,RCA Records Label,https://i.scdn.co/image/ab67616d00001e0244ead2...
159228,spotify:track:4GfGx2zvY8pIwf2o2SAufU,We Are One,,['Westlife'],,False,45,,222893,4.0,...,0.426,,,16,Westlife,Westlife,1999-11-01,73,RCA Records Label,https://i.scdn.co/image/ab67616d00001e0244ead2...
159229,spotify:track:7dODnrD8danC9FD5xLb9Tu,Can't Lose What You Never Had,Can’t Lose What You Never Had,['Westlife'],,False,45,,264485,4.0,...,0.656,Can’t Lose What You Never Had Lyrics\nBaby you...,6644.0,17,Westlife,Westlife,1999-11-01,73,RCA Records Label,https://i.scdn.co/image/ab67616d00001e0244ead2...


In [1163]:
# To find how many unique values columns have

df.nunique()

track_uri                 156480
track_name                105118
cleaned_track_name         57046
track_artists              17885
featured_artists            5782
track_is_explicit              2
track_popularity              98
track_genres                   0
track_duration_ms          53780
track_time_signature           5
track_acousticness          4155
track_danceability          1151
track_energy                1898
track_key_signature           24
track_instrumentalness      5290
track_key                     12
track_mode                     2
track_liveness              1668
track_loudness             18448
track_speechiness           1594
track_tempo                44084
track_valence               1650
track_lyrics               57965
lyrics_page_views          26616
track_number                  50
album_name                  9574
album_artist                 797
album_release_date          3772
album_popularity             100
album_record_label          1448
album_cove

In [1164]:
describ = df.describe() # assign describe to variable
null_sum = pd.concat([df.isnull().sum().rename('NullData'),describ.T],axis=1)

In [1165]:
null_sum

Unnamed: 0,NullData,count,mean,std,min,25%,50%,75%,max
track_uri,0,,,,,,,,
track_name,0,,,,,,,,
cleaned_track_name,67860,,,,,,,,
track_artists,0,,,,,,,,
featured_artists,144057,,,,,,,,
track_is_explicit,0,,,,,,,,
track_popularity,0,159231.0,30.132688,19.65021,0.0,14.0,29.0,44.0,97.0
track_genres,159231,0.0,,,,,,,
track_duration_ms,0,159231.0,226734.568445,109223.12257,3338.0,177263.0,214515.0,259120.0,4794398.0
track_time_signature,48,159183.0,3.900932,0.448415,0.0,4.0,4.0,4.0,5.0


#### Defining the Genres Column
Based on the above description of the data frame, none of the tracks have a genre. This is probably due to restrictions that Spotify imposes on the objects returned by different endpoints. To obtain the missing genre values, the artist genres from ArtistDetails.csv are used. Since an artist can have several subgenres of the same genre, the genres are first reduced to general genres, which are then attached to each artist and song.

##### Generating the Artist's Main Genres

In [1166]:
artist_df = pd.read_csv('/Users/mariamtamer/VSCodeProjects/lyricalanalysis copy/2_Artist_Data_Acquisition/ArtistDetails.csv')

In [1167]:
artist_df.head()

Unnamed: 0,uri,artist_name,artist_total_followers,artist_image,genres,popularity
0,spotify:artist:3TVXtAsR1Inumwj472S9r4,Drake,62430349,https://i.scdn.co/image/ab676161000051749e46a7...,"['canadian hip hop', 'canadian pop', 'hip hop'...",98
1,spotify:artist:6eUKZXaKkcviH0Ku9w2n3V,Ed Sheeran,95198498,https://i.scdn.co/image/ab6761610000517412a2ef...,"['pop', 'uk pop']",96
2,spotify:artist:4q3ewBCX7sLwd24euuV69X,Bad Bunny,46313053,https://i.scdn.co/image/ab676161000051746ad57a...,"['latin', 'reggaeton', 'trap latino']",100
3,spotify:artist:1Xyo4u8uXC1ZmMpatF05PJ,The Weeknd,43517807,https://i.scdn.co/image/ab676161000051742f71b6...,"['canadian contemporary r&b', 'canadian pop', ...",97
4,spotify:artist:66CXWjxzNUsdJxJ2JdwvnR,Ariana Grande,78088925,https://i.scdn.co/image/ab67616100005174cdce76...,"['dance pop', 'pop']",93


**Cleaning Column Names**

In [1168]:
artist_df.rename(columns = {'uri':'artist_uri', 'genres': 'artist_genres', 'popularity': 'artist_popularity'}, inplace = True)


The following function retrieves the unique genres from each row in the data frame and appends them to the list.

In [1169]:
genres = []
def get_all_genres(row):
    for i in row:
        if i not in genres:
            genres.append(i)
        # if i not in artists_genres[artist]:
        #     artists_genres[artist].append(i)
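
As an aside, the same order-preserving collection of unique genres can be sketched without a global list. This is an alternative, not the notebook's method; the `rows` data is hypothetical and assumes each row has already been parsed into a list of strings.

```python
from itertools import chain

# Hypothetical rows, each already parsed into a list of genre strings.
rows = [
    ['canadian hip hop', 'canadian pop', 'hip hop'],
    ['pop', 'uk pop'],
    ['dance pop', 'pop'],
]

# dict.fromkeys keeps the first occurrence of each key in insertion order,
# so this yields the unique genres in the order they are first seen.
unique_genres = list(dict.fromkeys(chain.from_iterable(rows)))
print(unique_genres)
# ['canadian hip hop', 'canadian pop', 'hip hop', 'pop', 'uk pop', 'dance pop']
```

A dictionary lookup avoids rescanning the whole list for every genre, which the `if i not in genres` test does.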


Before applying the function to the data frame, note that each genre array is stored as a string inside the data frame. The literal_eval function, which accepts a string containing a Python literal and parses it into the corresponding object, is used to turn each string into a usable list.

In [1170]:
# the array for the genres is saved as a string so it is important to turn it into an array before applying the function
from ast import literal_eval
artist_df['artist_genres'] = artist_df['artist_genres'].apply(literal_eval)
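
To make the conversion concrete, here is a minimal sketch on a single hypothetical cell value. It also shows that `literal_eval` only accepts literals and rejects arbitrary expressions, which is why it is a safer choice than `eval` for data read from a file.

```python
from ast import literal_eval

raw = "['pop', 'uk pop']"   # how a genre list looks when read back from CSV
parsed = literal_eval(raw)

print(type(raw).__name__)     # str
print(type(parsed).__name__)  # list
print(parsed[1])              # uk pop

# Anything that is not a plain literal (here, a function call) is rejected.
try:
    literal_eval("len('pop')")
    rejected = False
except ValueError:
    rejected = True
print(rejected)               # True
```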


The genres are retrieved here using a lambda function

In [1171]:
artist_df.apply(lambda x: get_all_genres(x.artist_genres), axis=1)
print(genres)

['canadian hip hop', 'canadian pop', 'hip hop', 'rap', 'toronto rap', 'pop', 'uk pop', 'latin', 'reggaeton', 'trap latino', 'canadian contemporary r&b', 'dance pop', 'detroit hip hop', 'dfw rap', 'melodic rap', 'k-pop', 'k-pop boy group', 'reggaeton colombiano', 'chicago rap', 'art pop', 'electropop', 'permanent wave', 'emo rap', 'miami hip hop', 'puerto rican pop', 'pop r&b', 'modern rock', 'rock', 'slap house', 'barbadian pop', 'pop rap', 'urban contemporary', 'pop rock', 'viral pop', 'big room', 'edm', 'pop dance', 'electro house', 'house', 'progressive house', 'uk dance', 'latin hip hop', 'classic rock', 'glam rock', 'conscious hip hop', 'west coast rap', 'tropical house', 'boy band', 'post-teen pop', 'talent show', 'r&b', 'atl hip hop', 'southern hip hop', 'trap', 'hip pop', 'queens hip hop', 'north carolina hip hop', 'etherpop', 'indie poptimism', 'reggaeton flow', 'trap boricua', 'british soul', 'pop soul', 'beatlesque', 'british invasion', 'merseybeat', 'psychedelic rock', 'aus


Since there are several subgenres for each genre (for example, 'canadian pop' and 'soft pop' both fall under pop), it is important to first isolate the one-word genres, which can work as standalone genres. This is done by checking whether the string contains no spaces and, if so, appending it to a new list.

In [1172]:
one_word_genres = []
for i in genres:
    if ' ' not in i:
        one_word_genres.append(i)
        
# for i in genres:
#     word = i.split(' ')
#     count = 0
#     for j in word:
#         if j in one_word_genres:
#             count +=1
#     if count == len(word):
#         one_word_genres.append(i)

In [1173]:
print(one_word_genres)

['rap', 'pop', 'latin', 'reggaeton', 'k-pop', 'electropop', 'rock', 'edm', 'house', 'r&b', 'trap', 'etherpop', 'beatlesque', 'merseybeat', 'brostep', 'post-grunge', 'soul', 'moombahton', 'indietronica', 'metropopolis', 'ninja', 'emo', 'metal', 'singer-songwriter', 'complextro', 'reggae', 'trance', 'punk', 'bachata', 'bolero', 'scandipop', 'lounge', 'grunge', 'arrocha', 'sertanejo', 'banda', 'norteno', 'electro', 'country', 'melancholia', 'plugg', 'pluggnb', 'rock-and-roll', 'rockabilly', 'pixie', 'mariachi', 'ranchera', 'neo-psychedelic', 'britpop', 'madchester', 'folk-pop', 'dancehall', 'europop', 'hollywood', 'industrial', 'metalcore', 'funk', 'motown', 'soundtrack', 'downtempo', 'forro', 'neo-singer-songwriter', 'tropical', 'cantautor', 'salsa', 'filmi', 'folk', 'disco', 'drill', 'ccm', 'worship', 'hoerspiel', 'nwobhm', 'champeta', 'vallenato', 'piseiro', 'grupera', 'sierreno', 'synthpop', 'proto-metal', 'nu-cumbia', 'grime', 'dembow', 'basshall', 'francoton', 'chillwave', 'neo-clas


The one-word genres are then removed from the genres list, along with any genres that contain a one-word genre as a substring. This leaves a small number of genres, which are sorted manually into a new list called new_genres. That list is then appended to one_word_genres.
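
The removal step described above can be sketched on a small hypothetical list. The sketch builds new lists instead of calling `remove` on a list while looping over it, since removing during iteration can skip elements; the variable names `without_one_word` and `leftover` are made up for this example.

```python
genres = ['pop', 'uk pop', 'rock', 'pop rock', 'big room', 'jam band']
one_word_genres = [g for g in genres if ' ' not in g]   # ['pop', 'rock']

# Step 1: drop the one-word genres themselves.
without_one_word = [g for g in genres if g not in one_word_genres]

# Step 2: drop every genre that contains a one-word genre as a substring.
leftover = [
    g for g in without_one_word
    if not any(word in g for word in one_word_genres)
]

print(without_one_word)  # ['uk pop', 'pop rock', 'big room', 'jam band']
print(leftover)          # ['big room', 'jam band']
```

What remains in `leftover` corresponds to the genres that would need manual sorting.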

In [1174]:
# for i in one_word_genres:
#     for j in genres:
#         if i in j:
#             genres.remove(j)

for i in one_word_genres:
    if i in genres:
        genres.remove(i)

new_genres =  ['hip hop', 'permanent wave', 'contemporary r&b', 'contemporary', 'big room', 'dance', 'boy band', 'talent show', 'british invasion', 'indie', 
'thrash', 'neo mellow', 'girl group', 'german techno', 'mellow gold', 'german dance', 'adult standards', 'musica mexicana', 'alternative', 
'easy listening', 'stomp and holler', 'americana', 'psych', 'alt z', 'glee club', 'neue deutsche harte', 'quiet storm', 'show tunes', 
'escape room', 'french hip hop', 'mexican hip hop', 'a cappella', 'modern bollywood', 'acoustic cover', 'dream smp', 'spanish hip hop', 
'urbano espanol', 'christian music', 'melodic dubstep', 'new wave', 'eau claire indie', 'hardcore', 'ska argentino', 'vocal jazz', 
'contemporary vocal jazz', 'cancion melodica', 'athens indie', 'electric blues', 'compositional ambient', 'italian hip hop', 'middle earth', 
'jazz blues', 'ska mexicano', 'canzone napoletana', 'italian tenor', 'lo-fi indie', 'modern blues', 'video game music', 'harlem renaissance', 
'jazz trumpet', 'new orleans jazz', 'brooklyn indie', 'rock nacional brasileiro', 'palm desert scene', 'lo-fi cover', 'lo-fi product', 
'colombian hip hop', 'turkish hip hop', 'el paso indie', 'norwegian indie', 'lo-fi chill', "women's music", 'white noise', 'new french touch', 
'veracruz indie', 'batidao romantico', 'zhongguo feng', 'jam band', 'nashville sound', 'roots', 'bases de freestyle', 'techno', 'early music', 
'boom bap espanol', 'venezuelan hip hop', 'italian underground hip hop', 'visual kei', 'lo-fi', 'hardstyle', 'k-pop boy group', 'reggaeton colombiano', 
'latin hip hop', 'trap latino', 'rap latina']

one_word_genres.extend(new_genres)

print(one_word_genres)

['rap', 'pop', 'latin', 'reggaeton', 'k-pop', 'electropop', 'rock', 'edm', 'house', 'r&b', 'trap', 'etherpop', 'beatlesque', 'merseybeat', 'brostep', 'post-grunge', 'soul', 'moombahton', 'indietronica', 'metropopolis', 'ninja', 'emo', 'metal', 'singer-songwriter', 'complextro', 'reggae', 'trance', 'punk', 'bachata', 'bolero', 'scandipop', 'lounge', 'grunge', 'arrocha', 'sertanejo', 'banda', 'norteno', 'electro', 'country', 'melancholia', 'plugg', 'pluggnb', 'rock-and-roll', 'rockabilly', 'pixie', 'mariachi', 'ranchera', 'neo-psychedelic', 'britpop', 'madchester', 'folk-pop', 'dancehall', 'europop', 'hollywood', 'industrial', 'metalcore', 'funk', 'motown', 'soundtrack', 'downtempo', 'forro', 'neo-singer-songwriter', 'tropical', 'cantautor', 'salsa', 'filmi', 'folk', 'disco', 'drill', 'ccm', 'worship', 'hoerspiel', 'nwobhm', 'champeta', 'vallenato', 'piseiro', 'grupera', 'sierreno', 'synthpop', 'proto-metal', 'nu-cumbia', 'grime', 'dembow', 'basshall', 'francoton', 'chillwave', 'neo-clas

The check is then rerun to bring back the compound genres whose words all appear in the one_word_genres list defined earlier, and the list is rechecked to ensure that no genres are lost.
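
A minimal sketch of this word-by-word check, on hypothetical lists, shows which compound genres are restored. The names `vocabulary`, `remaining` and `restored` are assumptions used only for illustration.

```python
vocabulary = ['pop', 'rock', 'rap', 'hip hop']   # hypothetical accepted genres
remaining = ['pop rock', 'uk pop', 'pop rap', 'canadian hip hop']

# Keep a compound genre only when every space-separated word is already accepted.
restored = [
    g for g in remaining
    if all(word in vocabulary for word in g.split(' '))
]
print(restored)  # ['pop rock', 'pop rap']
```

Because the check works on single words, a multi-word entry such as 'hip hop' never matches through this route ('hip' and 'hop' are not in the list on their own), so such genres rely on the substring matching used later.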

In [1175]:
for i in genres:
    word = i.split(' ')
    count = 0
    for j in word:
        if j in one_word_genres:
            count +=1
    if count == len(word):
        one_word_genres.append(i)

print(one_word_genres)

['rap', 'pop', 'latin', 'reggaeton', 'k-pop', 'electropop', 'rock', 'edm', 'house', 'r&b', 'trap', 'etherpop', 'beatlesque', 'merseybeat', 'brostep', 'post-grunge', 'soul', 'moombahton', 'indietronica', 'metropopolis', 'ninja', 'emo', 'metal', 'singer-songwriter', 'complextro', 'reggae', 'trance', 'punk', 'bachata', 'bolero', 'scandipop', 'lounge', 'grunge', 'arrocha', 'sertanejo', 'banda', 'norteno', 'electro', 'country', 'melancholia', 'plugg', 'pluggnb', 'rock-and-roll', 'rockabilly', 'pixie', 'mariachi', 'ranchera', 'neo-psychedelic', 'britpop', 'madchester', 'folk-pop', 'dancehall', 'europop', 'hollywood', 'industrial', 'metalcore', 'funk', 'motown', 'soundtrack', 'downtempo', 'forro', 'neo-singer-songwriter', 'tropical', 'cantautor', 'salsa', 'filmi', 'folk', 'disco', 'drill', 'ccm', 'worship', 'hoerspiel', 'nwobhm', 'champeta', 'vallenato', 'piseiro', 'grupera', 'sierreno', 'synthpop', 'proto-metal', 'nu-cumbia', 'grime', 'dembow', 'basshall', 'francoton', 'chillwave', 'neo-clas

To obtain the main genres for each artist from this list, note that the raw genre strings contain qualifier words (for example, 'canadian') that do not appear anywhere in the list. The first possible solution is to loop through each artist genre in the data frame, loop through the one_word_genres list, and append a genre whenever a match is found. However, since some genres combine two of the main categories, it is redundant to keep the separate categories as well as the compound category (for example, if an artist has 'canadian contemporary r&b', then their main genres would end up being 'contemporary', 'r&b' and 'contemporary r&b'). This would distort the analysis, since the artist would end up with three genres instead of one, inflating the number of songs counted under each genre.

Therefore, the solution implemented finds all the possible matches from the list defined above and collects them in a helper list before adding anything to the main genres list. This avoids adding both the single and the compound genres for a compound genre. The best match is selected with the max function, which iterates through the helper list and returns the genre with the longest string. The genres are returned as a tuple, an immutable and hashable type, which signals that the result should not be modified further and allows summary functions such as nunique to work on the data frame.

In [1176]:
def simplify_genre(genres):
    artist_genre = []
    for i in genres:
        if i in one_word_genres and i not in artist_genre:
            artist_genre.append(i)
        else: 
            helper_list = []
            for j in one_word_genres:
                if j in i and j not in artist_genre:
                    helper_list.append(j)
            if (len(helper_list) > 0):
                res = max(helper_list, key=len)
                artist_genre.append(res)
    return tuple(artist_genre)
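
To illustrate the longest-match rule, the sketch below repeats the same logic with the vocabulary passed in as a parameter so that it runs on its own. The short `vocabulary` list is a hypothetical stand-in for one_word_genres.

```python
def simplify_genre(genres, vocabulary):
    """Map each raw genre to its longest vocabulary match, skipping repeats."""
    artist_genre = []
    for raw in genres:
        if raw in vocabulary and raw not in artist_genre:
            artist_genre.append(raw)
        else:
            # Every vocabulary entry contained in the raw genre is a candidate.
            matches = [v for v in vocabulary if v in raw and v not in artist_genre]
            if matches:
                artist_genre.append(max(matches, key=len))
    return tuple(artist_genre)

vocabulary = ['pop', 'rap', 'trap', 'r&b', 'contemporary', 'contemporary r&b']

weeknd = simplify_genre(['canadian contemporary r&b', 'canadian pop', 'pop'], vocabulary)
atl = simplify_genre(['atl trap'], vocabulary)

print(weeknd)  # ('contemporary r&b', 'pop')
print(atl)     # ('trap',)
```

In the second call both 'rap' and 'trap' are substrings of 'atl trap', and the longer one wins. The same substring test can also produce false matches when a short genre name happens to appear inside an unrelated word, so the results are worth spot-checking.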

The function is then applied on the genres column and a new column is created with the simplified genres.

In [1177]:
main_genres = artist_df.apply(lambda x: simplify_genre(x.artist_genres), axis= 1)
artist_df.insert(4, 'artist_main_genres', main_genres)
artist_df

Unnamed: 0,artist_uri,artist_name,artist_total_followers,artist_image,artist_main_genres,artist_genres,artist_popularity
0,spotify:artist:3TVXtAsR1Inumwj472S9r4,Drake,62430349,https://i.scdn.co/image/ab676161000051749e46a7...,"(hip hop, pop, rap)","[canadian hip hop, canadian pop, hip hop, rap,...",98
1,spotify:artist:6eUKZXaKkcviH0Ku9w2n3V,Ed Sheeran,95198498,https://i.scdn.co/image/ab6761610000517412a2ef...,"(pop,)","[pop, uk pop]",96
2,spotify:artist:4q3ewBCX7sLwd24euuV69X,Bad Bunny,46313053,https://i.scdn.co/image/ab676161000051746ad57a...,"(latin, reggaeton, trap latino)","[latin, reggaeton, trap latino]",100
3,spotify:artist:1Xyo4u8uXC1ZmMpatF05PJ,The Weeknd,43517807,https://i.scdn.co/image/ab676161000051742f71b6...,"(contemporary r&b, pop)","[canadian contemporary r&b, canadian pop, pop]",97
4,spotify:artist:66CXWjxzNUsdJxJ2JdwvnR,Ariana Grande,78088925,https://i.scdn.co/image/ab67616100005174cdce76...,"(dance pop, pop)","[dance pop, pop]",93
...,...,...,...,...,...,...,...
994,spotify:artist:7FY5V3XMwlNBPitEjXowHQ,Darius Rucker,2161236,https://i.scdn.co/image/ab676161000051748e5582...,"(americana, contemporary country, country)","[black americana, contemporary country, countr...",70
995,spotify:artist:6QtgPSJPSzcnn7dPZ4VINp,King Von,1853606,https://i.scdn.co/image/ab676161000051745c0b21...,"(rap,)",[chicago rap],82
996,spotify:artist:66W9LaWS0DPdL7Sz8iYGYe,JP Saxe,339184,https://i.scdn.co/image/ab67616100005174e1963b...,"(alt z, contemporary r&b, pop)","[alt z, canadian contemporary r&b, pop]",73
997,spotify:artist:3gk0OYeLFWYupGFRHqLSR7,Showtek,452048,https://i.scdn.co/image/ab676161000051746ac094...,"(hardstyle, edm, electro house, pop dance, ele...","[classic hardstyle, edm, electro house, euphor...",67


In [1178]:
output_path="ArtistWithGenres.csv"
artist_df.to_csv(output_path)

##### Appending Genres to Track Based on Artist

The track artists are checked and their corresponding genres are appended to the genres list for the track. Again, the genres are returned as a tuple to maintain immutability.

In [1179]:
def get_genre(artist_list):
    genres = []
    for artist in artist_list:
        if artist in artist_df['artist_name'].values:
            genres_cell = artist_df.loc[artist_df['artist_name'] == artist, 'artist_main_genres'].values[0]
            for genre in genres_cell:
                if genre not in genres:
                    genres.append(genre)
    return tuple(genres)

#save = get_genre(['Drake', 'Ed Sheeran', 'Ariana Grande'])
#print(save)
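
Since the lookup above scans the artist column once per artist, a dictionary-based variant may be faster on a data set of this size. The following is a sketch under assumptions, not the notebook's method: the `artist_main_genres` dictionary is hypothetical sample data, and in the notebook a similar mapping could presumably be built with `dict(zip(artist_df['artist_name'], artist_df['artist_main_genres']))`.

```python
# Hypothetical mapping: artist name -> simplified genres.
artist_main_genres = {
    'Drake': ('hip hop', 'pop', 'rap'),
    'Ed Sheeran': ('pop',),
    'King Von': ('rap',),
}

def get_genre(artist_list, lookup):
    """Collect the genres of all listed artists, without duplicates, in order."""
    genres = []
    for artist in artist_list:
        for genre in lookup.get(artist, ()):   # unknown artists contribute nothing
            if genre not in genres:
                genres.append(genre)
    return tuple(genres)

first = get_genre(['Drake', 'Ed Sheeran', 'Unknown Artist'], artist_main_genres)
second = get_genre(['King Von', 'Ed Sheeran'], artist_main_genres)
print(first)   # ('hip hop', 'pop', 'rap')
print(second)  # ('rap', 'pop')
```

One difference to keep in mind: if two artists share a name, a dictionary built with `zip` keeps the last one, whereas `.values[0]` keeps the first.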

As with the genres in artist_df, the track artists array is stored as a string inside the data frame, so literal_eval is used to turn the string into a usable list.

In [1180]:
from ast import literal_eval
df['track_artists'] = df['track_artists'].apply(literal_eval)


The same is done with the featured artists. Beforehand, the np.nan values are replaced with the string 'None', since literal_eval only parses strings and would raise an error if it were passed a non-string value such as NaN. The string 'None' is parsed to the Python value None.

In [1181]:
df['featured_artists'] = df.featured_artists.replace(np.nan,'None')
df['featured_artists'] = df['featured_artists'].apply(literal_eval)
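
The effect of the placeholder can be checked on single values. The sketch below also shows an alternative that skips the replacement by testing the cell type first; `parse_artists` is a hypothetical helper, not part of the notebook.

```python
from ast import literal_eval

def parse_artists(cell):
    """Parse a stringified list; treat non-string cells (such as NaN) as missing."""
    if not isinstance(cell, str):
        return None
    return literal_eval(cell)

placeholder = literal_eval('None')                 # the string 'None' becomes None
featured = parse_artists("['GIVĒON', 'Lil Durk']")
missing = parse_artists(float('nan'))

print(placeholder)  # None
print(featured)     # ['GIVĒON', 'Lil Durk']
print(missing)      # None
```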

In [1182]:
df

Unnamed: 0,track_uri,track_name,cleaned_track_name,track_artists,featured_artists,track_is_explicit,track_popularity,track_genres,track_duration_ms,track_time_signature,...,track_valence,track_lyrics,lyrics_page_views,track_number,album_name,album_artist,album_release_date,album_popularity,album_record_label,album_cover
0,spotify:track:2HSmyk2qMN8WQjuGhaQgCk,Champagne Poetry,Champagne Poetry,[Drake],,True,82,,336511,4.0,...,0.496,"Champagne Poetry Lyrics\n\nI love you, I love ...",688853.0,1,Certified Lover Boy,Drake,2021-09-03,95,OVO,https://i.scdn.co/image/ab67616d00001e02cd945b...
1,spotify:track:6jy9yJfgCsMHdu2Oz4BGKX,Papi’s Home,Papi’s Home,[Drake],,True,76,,178623,4.0,...,0.588,Papi’s Home Lyrics\nI know that I hurt you\nYe...,445883.0,2,Certified Lover Boy,Drake,2021-09-03,95,OVO,https://i.scdn.co/image/ab67616d00001e02cd945b...
2,spotify:track:37Nqx7iavZpotJSDXZWbJ3,Girls Want Girls (with Lil Baby),,"[Drake, Lil Baby]",,True,86,,221979,4.0,...,0.381,,,3,Certified Lover Boy,Drake,2021-09-03,95,OVO,https://i.scdn.co/image/ab67616d00001e02cd945b...
3,spotify:track:61S79KIVA4I9FXbnsylEHT,In The Bible (with Lil Durk & Giveon),In The Bible,"[Drake, Lil Durk, Giveon]","[GIVĒON, Lil Durk]",True,79,,296568,4.0,...,0.147,"In The Bible Lyrics\nOkay, okay, okay\nCountin...",439186.0,4,Certified Lover Boy,Drake,2021-09-03,95,OVO,https://i.scdn.co/image/ab67616d00001e02cd945b...
4,spotify:track:4VCbgIdr8ptegWeJpqLVHH,Love All (with JAY-Z),,"[Drake, JAY-Z]",,True,77,,228461,4.0,...,0.155,,,5,Certified Lover Boy,Drake,2021-09-03,95,OVO,https://i.scdn.co/image/ab67616d00001e02cd945b...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
159226,spotify:track:0VxTtE5HoNMf9sp30j6c9V,Try Again,,[Westlife],,False,47,,214866,3.0,...,0.381,,,14,Westlife,Westlife,1999-11-01,73,RCA Records Label,https://i.scdn.co/image/ab67616d00001e0244ead2...
159227,spotify:track:3EHx4H0FsTplZrcFSeuLeE,What I Want Is What I Got,What I Want Is What I’ve Got,[Westlife],,False,46,,213066,4.0,...,0.744,What I Want Is What I’ve Got Lyrics\nAll that ...,,15,Westlife,Westlife,1999-11-01,73,RCA Records Label,https://i.scdn.co/image/ab67616d00001e0244ead2...
159228,spotify:track:4GfGx2zvY8pIwf2o2SAufU,We Are One,,[Westlife],,False,45,,222893,4.0,...,0.426,,,16,Westlife,Westlife,1999-11-01,73,RCA Records Label,https://i.scdn.co/image/ab67616d00001e0244ead2...
159229,spotify:track:7dODnrD8danC9FD5xLb9Tu,Can't Lose What You Never Had,Can’t Lose What You Never Had,[Westlife],,False,45,,264485,4.0,...,0.656,Can’t Lose What You Never Had Lyrics\nBaby you...,6644.0,17,Westlife,Westlife,1999-11-01,73,RCA Records Label,https://i.scdn.co/image/ab67616d00001e0244ead2...


The next two functions convert the track_artists and featured_artists column values into tuples. Tuples are immutable and hashable, which allows data frame description functions such as nunique to be used later without throwing errors.

In [1183]:
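def convert(values):
    # Tuples are hashable, unlike lists
    return tuple(values)

def convert_with_none(values):
    # literal_eval turned the 'None' placeholder into None; store it as NaN
    if values is None:
        return np.nan
    return tuple(values)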

The functions are applied to each of the columns. Since featured_artists contains None values, which are not iterable, the function applied to it checks for None and returns np.nan to keep the missing values in the dataframe consistent.

In [1184]:
df['track_artists'] = df['track_artists'].apply(convert)

df['featured_artists'] = df['featured_artists'].apply(convert_with_none)
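
Counting distinct values relies on hashing, which is why the conversion to tuples matters. The sketch below shows the difference with plain Python sets; the artist rows are toy values.

```python
rows_as_lists = [["Drake"], ["Drake", "Lil Baby"], ["Drake"]]

# Lists are mutable and therefore unhashable, so they cannot be placed in a set.
try:
    set(rows_as_lists)
    lists_hashable = True
except TypeError:
    lists_hashable = False

# Tuples of strings are hashable, so distinct values can be counted.
rows_as_tuples = [tuple(row) for row in rows_as_lists]
unique_count = len(set(rows_as_tuples))
print(lists_hashable, unique_count)
```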

The get_genre function is then applied to the dataframe row by row: it iterates through the track artists and returns their combined genres, which are stored in the track_genres column.

In [1185]:
track_genres = df.apply(lambda x: get_genre(x.track_artists), axis= 1)
df['track_genres'] = track_genres
df

Unnamed: 0,track_uri,track_name,cleaned_track_name,track_artists,featured_artists,track_is_explicit,track_popularity,track_genres,track_duration_ms,track_time_signature,...,track_valence,track_lyrics,lyrics_page_views,track_number,album_name,album_artist,album_release_date,album_popularity,album_record_label,album_cover
0,spotify:track:2HSmyk2qMN8WQjuGhaQgCk,Champagne Poetry,Champagne Poetry,"(Drake,)",,True,82,"(hip hop, pop, rap)",336511,4.0,...,0.496,"Champagne Poetry Lyrics\n\nI love you, I love ...",688853.0,1,Certified Lover Boy,Drake,2021-09-03,95,OVO,https://i.scdn.co/image/ab67616d00001e02cd945b...
1,spotify:track:6jy9yJfgCsMHdu2Oz4BGKX,Papi’s Home,Papi’s Home,"(Drake,)",,True,76,"(hip hop, pop, rap)",178623,4.0,...,0.588,Papi’s Home Lyrics\nI know that I hurt you\nYe...,445883.0,2,Certified Lover Boy,Drake,2021-09-03,95,OVO,https://i.scdn.co/image/ab67616d00001e02cd945b...
2,spotify:track:37Nqx7iavZpotJSDXZWbJ3,Girls Want Girls (with Lil Baby),,"(Drake, Lil Baby)",,True,86,"(hip hop, pop, rap, trap)",221979,4.0,...,0.381,,,3,Certified Lover Boy,Drake,2021-09-03,95,OVO,https://i.scdn.co/image/ab67616d00001e02cd945b...
3,spotify:track:61S79KIVA4I9FXbnsylEHT,In The Bible (with Lil Durk & Giveon),In The Bible,"(Drake, Lil Durk, Giveon)","(GIVĒON, Lil Durk)",True,79,"(hip hop, pop, rap, drill, trap, r&b)",296568,4.0,...,0.147,"In The Bible Lyrics\nOkay, okay, okay\nCountin...",439186.0,4,Certified Lover Boy,Drake,2021-09-03,95,OVO,https://i.scdn.co/image/ab67616d00001e02cd945b...
4,spotify:track:4VCbgIdr8ptegWeJpqLVHH,Love All (with JAY-Z),,"(Drake, JAY-Z)",,True,77,"(hip hop, pop, rap)",228461,4.0,...,0.155,,,5,Certified Lover Boy,Drake,2021-09-03,95,OVO,https://i.scdn.co/image/ab67616d00001e02cd945b...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
159226,spotify:track:0VxTtE5HoNMf9sp30j6c9V,Try Again,,"(Westlife,)",,False,47,"(boy band, dance pop, europop)",214866,3.0,...,0.381,,,14,Westlife,Westlife,1999-11-01,73,RCA Records Label,https://i.scdn.co/image/ab67616d00001e0244ead2...
159227,spotify:track:3EHx4H0FsTplZrcFSeuLeE,What I Want Is What I Got,What I Want Is What I’ve Got,"(Westlife,)",,False,46,"(boy band, dance pop, europop)",213066,4.0,...,0.744,What I Want Is What I’ve Got Lyrics\nAll that ...,,15,Westlife,Westlife,1999-11-01,73,RCA Records Label,https://i.scdn.co/image/ab67616d00001e0244ead2...
159228,spotify:track:4GfGx2zvY8pIwf2o2SAufU,We Are One,,"(Westlife,)",,False,45,"(boy band, dance pop, europop)",222893,4.0,...,0.426,,,16,Westlife,Westlife,1999-11-01,73,RCA Records Label,https://i.scdn.co/image/ab67616d00001e0244ead2...
159229,spotify:track:7dODnrD8danC9FD5xLb9Tu,Can't Lose What You Never Had,Can’t Lose What You Never Had,"(Westlife,)",,False,45,"(boy band, dance pop, europop)",264485,4.0,...,0.656,Can’t Lose What You Never Had Lyrics\nBaby you...,6644.0,17,Westlife,Westlife,1999-11-01,73,RCA Records Label,https://i.scdn.co/image/ab67616d00001e0244ead2...


In [1186]:
describ = df.describe() # assign describe to variable
null_sum = pd.concat([df.isnull().sum().rename('NullData'),describ.T],axis=1)

In [1187]:
null_sum

Unnamed: 0,NullData,count,mean,std,min,25%,50%,75%,max
track_uri,0,,,,,,,,
track_name,0,,,,,,,,
cleaned_track_name,67860,,,,,,,,
track_artists,0,,,,,,,,
featured_artists,144057,,,,,,,,
track_is_explicit,0,,,,,,,,
track_popularity,0,159231.0,30.132688,19.65021,0.0,14.0,29.0,44.0,97.0
track_genres,0,,,,,,,,
track_duration_ms,0,159231.0,226734.568445,109223.12257,3338.0,177263.0,214515.0,259120.0,4794398.0
track_time_signature,48,159183.0,3.900932,0.448415,0.0,4.0,4.0,4.0,5.0


As the data description shows, the genres are no longer null. However, there are still many null values for the lyrics, which may be due to the many instrumental songs, live versions, and remixes of songs that already exist in the list.

#### Lyrics Cleanup

Before beginning the lyrics cleanup, let's take a look at the number of unique values in each column of the dataframe.

In [1188]:
# To find how many unique values columns have
df.nunique()

track_uri                 156480
track_name                105118
cleaned_track_name         57046
track_artists              17885
featured_artists            5782
track_is_explicit              2
track_popularity              98
track_genres                2867
track_duration_ms          53780
track_time_signature           5
track_acousticness          4155
track_danceability          1151
track_energy                1898
track_key_signature           24
track_instrumentalness      5290
track_key                     12
track_mode                     2
track_liveness              1668
track_loudness             18448
track_speechiness           1594
track_tempo                44084
track_valence               1650
track_lyrics               57965
lyrics_page_views          26616
track_number                  50
album_name                  9574
album_artist                 797
album_release_date          3772
album_popularity             100
album_record_label          1448
album_cove

The number of unique track_uri values (156480) is lower than the total number of rows (159231). Since the URI is the unique identifier for a song according to the Spotify API documentation, some tracks must appear more than once. These duplicates are explored further below.

In [1189]:
df[df['track_uri'].duplicated(keep=False)].sort_values(by='track_uri')

Unnamed: 0,track_uri,track_name,cleaned_track_name,track_artists,featured_artists,track_is_explicit,track_popularity,track_genres,track_duration_ms,track_time_signature,...,track_valence,track_lyrics,lyrics_page_views,track_number,album_name,album_artist,album_release_date,album_popularity,album_record_label,album_cover
18746,spotify:track:00KjOnN3U40e3lXFUOue7h,10AM/Save The World (feat. Gucci Mane),10AM / Save the World,"(Metro Boomin, Gucci Mane)","(Gucci Mane,)",True,60,"(hip hop, pop rap, rap, trap)",226320,4.0,...,0.842,"10AM / Save the World Lyrics\nYeah\nYeah, good...",91702.0,1,NOT ALL HEROES WEAR CAPES (Deluxe),Metro Boomin,2018-11-06,84,Republic Records,https://i.scdn.co/image/ab67616d00001e022887f8...
26592,spotify:track:00KjOnN3U40e3lXFUOue7h,10AM/Save The World (feat. Gucci Mane),10AM / Save the World,"(Metro Boomin, Gucci Mane)","(Gucci Mane,)",True,60,"(hip hop, pop rap, rap, trap)",226320,4.0,...,0.130,"10AM / Save the World Lyrics\nYeah\nYeah, good...",91720.0,1,NOT ALL HEROES WEAR CAPES (Deluxe),Metro Boomin,2018-11-06,84,Republic Records,https://i.scdn.co/image/ab67616d00001e022887f8...
70784,spotify:track:00Qt1c8zamewy9XXUWwm7P,Rakata - Remix,Rakata (Remix),"(Wisin & Yandel, Ja Rule)","(N.O.R.E., Ja Rule, Pitbull)",False,52,"(electro, latin, latin hip hop, reggaeton, tra...",211186,4.0,...,0.314,Rakata (Remix) LyricsDJ any one w el sobrebivi...,,20,Pa'l Mundo,Wisin & Yandel,2005-11-08,56,Machete Music,https://i.scdn.co/image/ab67616d00001e023b2fe6...
73107,spotify:track:00Qt1c8zamewy9XXUWwm7P,Rakata - Remix,Rakata (Remix),"(Wisin & Yandel, Ja Rule)","(N.O.R.E., Ja Rule, Pitbull)",False,52,"(electro, latin, latin hip hop, reggaeton, tra...",211186,4.0,...,0.314,Rakata (Remix) LyricsDJ any one w el sobrebivi...,,20,Pa'l Mundo,Wisin & Yandel,2005-11-08,56,Machete Music,https://i.scdn.co/image/ab67616d00001e023b2fe6...
52502,spotify:track:00Y9yFHumsN6Cg4cK3wXkM,Cap (feat. Trouble) - From Jxmtro,,"(Rae Sremmurd, Swae Lee, Slim Jxmmi, Trouble)",,True,37,"(hip hop, rap, pop rap, trap)",192000,4.0,...,0.726,,,5,SR3MM,Rae Sremmurd,2018-05-04,72,Mike WiLL Made-It,https://i.scdn.co/image/ab67616d00001e02ba9015...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
72831,spotify:track:7zi8UxAGIpaaymfzfHvOB7,Tu Nombre,,"(Wisin & Yandel,)",,False,9,"(electro, latin, latin hip hop, reggaeton, tra...",271360,4.0,...,0.668,,,1,Líderes (Muve Sessions),Wisin & Yandel,2012-01-01,22,UMLE - Machete,https://i.scdn.co/image/ab67616d00001e02eb0282...
55218,spotify:track:7zkbsyd1GfiCtoF7uxLBS7,Changed,,"(YoungBoy Never Broke Again,)",,True,43,"(rap, trap)",191351,4.0,...,0.377,,,4,Mind of a Menace 3 (Reloaded),YoungBoy Never Broke Again,2016-10-21,50,"Never Broke Again, LLC",https://i.scdn.co/image/ab67616d00001e02f59909...
27564,spotify:track:7zkbsyd1GfiCtoF7uxLBS7,Changed,,"(YoungBoy Never Broke Again,)",,True,43,"(rap, trap)",191351,4.0,...,0.539,,,4,Mind of a Menace 3 (Reloaded),YoungBoy Never Broke Again,2016-10-21,50,"Never Broke Again, LLC",https://i.scdn.co/image/ab67616d00001e02f59909...
147637,spotify:track:7zkyxCoflcjvxjJaEPZ5J9,"Oh Bess, Oh Where's My Bess? (From ""Porgy and ...",,"(Louis Armstrong, Russell Garcia and His Orche...",,False,17,"(adult standards, dixieland, harlem renaissanc...",157480,4.0,...,0.298,,,14,"Milestones of a Jazz Legend, Vol. 6",Ella Fitzgerald,2021-10-22,27,Intense Media GmbH,https://i.scdn.co/image/ab67616d00001e0244550b...


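The 5395 rows identified share their track URI with at least one other row and were retrieved from the same track and lyric endpoints, so their descriptive fields match. The sample above shows that a few values, such as track_valence and lyrics_page_views, can still differ slightly between copies. It is therefore safe to drop the duplicates and keep only the first appearance of each track without affecting the analysis. The 5395 rows cover 2644 distinct URIs, so 2751 rows are dropped and the data frame shrinks from 159231 to 156480 rows. Among the duplicates, about 1880 tracks have no lyrics, either because the song is instrumental or because the lyrics have not been added yet.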

In [1190]:
df.drop_duplicates(subset='track_uri', keep="first", inplace=True, ignore_index=True) 
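
How the number of rows involved in duplication relates to the number of rows actually dropped can be checked on a small example. The sketch below uses toy URIs and the standard library only; the variable names are illustrative.

```python
from collections import Counter

uris = ["a", "b", "a", "c", "b", "a", "d"]  # toy URIs, not real data

counts = Counter(uris)
rows_involved = sum(n for n in counts.values() if n > 1)      # rows whose URI repeats
distinct_repeated = sum(1 for n in counts.values() if n > 1)  # URIs that repeat
rows_dropped = rows_involved - distinct_repeated              # copies beyond the first

# Order-preserving de-duplication that keeps the first occurrence of each URI.
deduplicated = list(dict.fromkeys(uris))
print(rows_involved, distinct_repeated, rows_dropped, len(deduplicated))
```

The same bookkeeping applies to the real data: the row count falls from 159231 to 156480, so 2751 rows are dropped.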

In [1191]:
df

Unnamed: 0,track_uri,track_name,cleaned_track_name,track_artists,featured_artists,track_is_explicit,track_popularity,track_genres,track_duration_ms,track_time_signature,...,track_valence,track_lyrics,lyrics_page_views,track_number,album_name,album_artist,album_release_date,album_popularity,album_record_label,album_cover
0,spotify:track:2HSmyk2qMN8WQjuGhaQgCk,Champagne Poetry,Champagne Poetry,"(Drake,)",,True,82,"(hip hop, pop, rap)",336511,4.0,...,0.496,"Champagne Poetry Lyrics\n\nI love you, I love ...",688853.0,1,Certified Lover Boy,Drake,2021-09-03,95,OVO,https://i.scdn.co/image/ab67616d00001e02cd945b...
1,spotify:track:6jy9yJfgCsMHdu2Oz4BGKX,Papi’s Home,Papi’s Home,"(Drake,)",,True,76,"(hip hop, pop, rap)",178623,4.0,...,0.588,Papi’s Home Lyrics\nI know that I hurt you\nYe...,445883.0,2,Certified Lover Boy,Drake,2021-09-03,95,OVO,https://i.scdn.co/image/ab67616d00001e02cd945b...
2,spotify:track:37Nqx7iavZpotJSDXZWbJ3,Girls Want Girls (with Lil Baby),,"(Drake, Lil Baby)",,True,86,"(hip hop, pop, rap, trap)",221979,4.0,...,0.381,,,3,Certified Lover Boy,Drake,2021-09-03,95,OVO,https://i.scdn.co/image/ab67616d00001e02cd945b...
3,spotify:track:61S79KIVA4I9FXbnsylEHT,In The Bible (with Lil Durk & Giveon),In The Bible,"(Drake, Lil Durk, Giveon)","(GIVĒON, Lil Durk)",True,79,"(hip hop, pop, rap, drill, trap, r&b)",296568,4.0,...,0.147,"In The Bible Lyrics\nOkay, okay, okay\nCountin...",439186.0,4,Certified Lover Boy,Drake,2021-09-03,95,OVO,https://i.scdn.co/image/ab67616d00001e02cd945b...
4,spotify:track:4VCbgIdr8ptegWeJpqLVHH,Love All (with JAY-Z),,"(Drake, JAY-Z)",,True,77,"(hip hop, pop, rap)",228461,4.0,...,0.155,,,5,Certified Lover Boy,Drake,2021-09-03,95,OVO,https://i.scdn.co/image/ab67616d00001e02cd945b...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
156475,spotify:track:0VxTtE5HoNMf9sp30j6c9V,Try Again,,"(Westlife,)",,False,47,"(boy band, dance pop, europop)",214866,3.0,...,0.381,,,14,Westlife,Westlife,1999-11-01,73,RCA Records Label,https://i.scdn.co/image/ab67616d00001e0244ead2...
156476,spotify:track:3EHx4H0FsTplZrcFSeuLeE,What I Want Is What I Got,What I Want Is What I’ve Got,"(Westlife,)",,False,46,"(boy band, dance pop, europop)",213066,4.0,...,0.744,What I Want Is What I’ve Got Lyrics\nAll that ...,,15,Westlife,Westlife,1999-11-01,73,RCA Records Label,https://i.scdn.co/image/ab67616d00001e0244ead2...
156477,spotify:track:4GfGx2zvY8pIwf2o2SAufU,We Are One,,"(Westlife,)",,False,45,"(boy band, dance pop, europop)",222893,4.0,...,0.426,,,16,Westlife,Westlife,1999-11-01,73,RCA Records Label,https://i.scdn.co/image/ab67616d00001e0244ead2...
156478,spotify:track:7dODnrD8danC9FD5xLb9Tu,Can't Lose What You Never Had,Can’t Lose What You Never Had,"(Westlife,)",,False,45,"(boy band, dance pop, europop)",264485,4.0,...,0.656,Can’t Lose What You Never Had Lyrics\nBaby you...,6644.0,17,Westlife,Westlife,1999-11-01,73,RCA Records Label,https://i.scdn.co/image/ab67616d00001e0244ead2...


In [1192]:
df.nunique()

track_uri                 156480
track_name                105118
cleaned_track_name         57046
track_artists              17885
featured_artists            5782
track_is_explicit              2
track_popularity              98
track_genres                2867
track_duration_ms          53780
track_time_signature           5
track_acousticness          4155
track_danceability          1151
track_energy                1896
track_key_signature           24
track_instrumentalness      5290
track_key                     12
track_mode                     2
track_liveness              1668
track_loudness             18413
track_speechiness           1594
track_tempo                44004
track_valence               1650
track_lyrics               57956
lyrics_page_views          25974
track_number                  50
album_name                  9574
album_artist                 797
album_release_date          3771
album_popularity             100
album_record_label          1448
album_cove

In [1193]:
describ = df.describe() # assign describe to variable
null_sum = pd.concat([df.isnull().sum().rename('NullData'),describ.T],axis=1)
null_sum

Unnamed: 0,NullData,count,mean,std,min,25%,50%,75%,max
track_uri,0,,,,,,,,
track_name,0,,,,,,,,
cleaned_track_name,66990,,,,,,,,
track_artists,0,,,,,,,,
featured_artists,141659,,,,,,,,
track_is_explicit,0,,,,,,,,
track_popularity,0,156480.0,29.935493,19.590452,0.0,14.0,29.0,44.0,97.0
track_genres,0,,,,,,,,
track_duration_ms,0,156480.0,227073.490523,109823.237372,3338.0,177413.0,214706.0,259733.0,4794398.0
track_time_signature,48,156432.0,3.900672,0.448843,0.0,4.0,4.0,4.0,5.0


##### Removing Non-Lyrics

The Genius API sometimes returns discographies or interviews for unreleased or unavailable lyrics instead of the lyrics for the specific song. To keep the analysis as accurate as possible, any data retrieved from the Genius API in that format must be removed. To begin, the track_lyrics column is searched for the words "unreleased", "Unreleased", "Discography", and "discography", which entries of this format usually have in common. This is done with str.contains. The na argument is set to False so that rows with missing lyrics count as non-matches and only the tracks that meet the condition are returned.

In [1194]:
contain_values = df[df['track_lyrics'].str.contains('unreleased|Unreleased|Discography|discography', na=False)]
contain_values[['cleaned_track_name','track_lyrics']]

Unnamed: 0,cleaned_track_name,track_lyrics
237,Drake Discography,Drake Discography LyricsProjects:2006:• Room f...
298,Unreleased Songs [Discography List],Unreleased Songs Lyrics2007Grow Back2008Fuldh...
952,Unreleased Songs [Discography List],Unreleased Songs LyricsBy Year:20072008200920...
977,Unreleased Songs [Discography List],Unreleased Songs LyricsBy Year:20072008200920...
1572,Eminem Discography List,Eminem Discography List LyricsProjects:1996:• ...
...,...,...
123466,Unreleased Songs [Discography List],Unreleased Songs Lyrics1998Give Unto Me (Evan...
142033,Unreleased Songs [Discography List],Unreleased Songs Lyrics2005Don't Lie (No No N...
150197,Unreleased Songs [Discography List],Unreleased Songs Lyrics2004Ahh! (Goodies outt...
150236,Unreleased Songs [Discography List],Unreleased Songs Lyrics2004Ahh! (Goodies outt...


A closer look at the 158 rows returned shows that 13 tracks actually contain the word "discography" as a lyric, and these tracks must not be removed from the dataframe. The cleaned_track_name column is therefore used instead, since almost all non-lyric entries share the substring "Discography" in their title. The few exceptions are added to the pattern below.

In [1195]:
values = df[df['cleaned_track_name'].str.contains('Anniversary Box Set|Discography|Cherry Bomb: The Documentary|The History of Iron Maiden - Part 1: The Early Days', na=False)]
values[['cleaned_track_name','track_lyrics']]

Unnamed: 0,cleaned_track_name,track_lyrics
237,Drake Discography,Drake Discography LyricsProjects:2006:• Room f...
298,Unreleased Songs [Discography List],Unreleased Songs Lyrics2007Grow Back2008Fuldh...
952,Unreleased Songs [Discography List],Unreleased Songs LyricsBy Year:20072008200920...
977,Unreleased Songs [Discography List],Unreleased Songs LyricsBy Year:20072008200920...
1572,Eminem Discography List,Eminem Discography List LyricsProjects:1996:• ...
...,...,...
123466,Unreleased Songs [Discography List],Unreleased Songs Lyrics1998Give Unto Me (Evan...
142033,Unreleased Songs [Discography List],Unreleased Songs Lyrics2005Don't Lie (No No N...
150197,Unreleased Songs [Discography List],Unreleased Songs Lyrics2004Ahh! (Goodies outt...
150236,Unreleased Songs [Discography List],Unreleased Songs Lyrics2004Ahh! (Goodies outt...
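
The difference between filtering on lyrics and filtering on titles can be illustrated with two toy records. Both records are invented for illustration, and the sketch uses the standard library `re` module rather than the `regex` package imported at the top of the notebook.

```python
import re

# Toy records: one real song that mentions the word, and one discography page.
records = [
    {"cleaned_track_name": "Some Song",
     "track_lyrics": "Some Song Lyrics\nI played her my whole discography"},
    {"cleaned_track_name": "Drake Discography",
     "track_lyrics": "Drake Discography LyricsProjects:2006"},
]

pattern = re.compile(r"Discography|discography")

# Matching on lyrics also flags the real song; matching on titles does not.
by_lyrics = [r["cleaned_track_name"] for r in records if pattern.search(r["track_lyrics"])]
by_title = [r["cleaned_track_name"] for r in records if pattern.search(r["cleaned_track_name"])]
print(by_lyrics, by_title)
```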


The rows returned after applying the conditions discussed above on the cleaned_track_name are all true non-lyric data which can safely be replaced with np.nan values in the dataframe.

In [1196]:
df.loc[df['cleaned_track_name'].str.contains('Anniversary Box Set|Discography|Cherry Bomb: The Documentary|The History of Iron Maiden - Part 1: The Early Days', na=False), ['cleaned_track_name', 'featured_artists', 'track_lyrics', 'lyrics_page_views']] = np.nan

# check that the values were successfully removed
#vals = df[df['cleaned_track_name'].str.contains('Anniversary Box Set|Discography|Cherry Bomb: The Documentary|The History of Iron Maiden - Part 1: The Early Days', na=False)]


Now the null counts can be displayed again to confirm the removal. The counts for 'cleaned_track_name', 'featured_artists', 'track_lyrics', and 'lyrics_page_views' went from 66990, 141659, 72209, and 111908 to 67135, 141659, 72354, and 111929 respectively. cleaned_track_name and track_lyrics each gained 145 nulls and lyrics_page_views gained 21, while featured_artists stayed the same, presumably because the affected rows had no featured artists to begin with. This confirms that the replacement was applied.

In [1197]:
describ = df.describe() # assign describe to variable
null_sum = pd.concat([df.isnull().sum().rename('NullData'),describ.T],axis=1)
null_sum

Unnamed: 0,NullData,count,mean,std,min,25%,50%,75%,max
track_uri,0,,,,,,,,
track_name,0,,,,,,,,
cleaned_track_name,67135,,,,,,,,
track_artists,0,,,,,,,,
featured_artists,141659,,,,,,,,
track_is_explicit,0,,,,,,,,
track_popularity,0,156480.0,29.935493,19.590452,0.0,14.0,29.0,44.0,97.0
track_genres,0,,,,,,,,
track_duration_ms,0,156480.0,227073.490523,109823.237372,3338.0,177413.0,214706.0,259733.0,4794398.0
track_time_signature,48,156432.0,3.900672,0.448843,0.0,4.0,4.0,4.0,5.0


After further research and examination of the dataframe, it was identified that the Genius API also returned some charts, playlists, tracklists, interviews, track credits and radio episodes instead of returning lyrics for some of the songs. The patterns that appeared in the lyrics were identified and used to retrieve song lyrics that contained any of the mentioned patterns using str.contains. 

In [1226]:
import warnings
warnings.filterwarnings("ignore", 'This pattern has match groups')

non_lyrics = df[df['track_lyrics'].str.contains('Tracklist|Radio Episode|Chart History|Billboard 200 Charts|Macklemore|Track List Lyrics|Music Videos Lyrics|Playlist Lyrics|Exclusive Playlist|Interview|23:55 Lyrics|HILLSONG UNITEDEmbed|Stones for a Thousand Years LyricsEmbed|A State Of Trance  - A State of Trance Year Mix 2014|In Rhythm LyricsEmbed|The Adventures of Bobby Ray  Lyrics|Fingertips, part 2 - live at the regal theater/1963/ single version', na=False)]
non_lyrics[['cleaned_track_name','track_lyrics']]

Unnamed: 0,cleaned_track_name,track_lyrics
54,OVO Sound Radio Episode 62 Tracklist,OVO Sound Radio Episode 62 Tracklist LyricsOli...
63,OVO Sound Radio Episode 11 Tracklist,OVO Sound Radio Episode 11 Tracklist LyricsILo...
92,OVO Sound Radio Episode 65 Tracklist,OVO Sound Radio Episode 65 Tracklist LyricsOct...
209,HYFR (Hell Ya Fucking Right),HYFR (Hell Ya Fucking Right) Lyrics\nGotta do ...
228,HYFR (Hell Ya Fucking Right),HYFR (Hell Ya Fucking Right) Lyrics\nGotta do ...
...,...,...
143218,Bang 3 (Part 1 & Part 2) [Album Art + Tracklist],Bang 3 (Part 1 & Part 2) LyricsPart 1Tracklis...
143230,Bang 3 (Part 1 & Part 2) [Album Art + Tracklist],Bang 3 (Part 1 & Part 2) LyricsPart 1Tracklis...
152185,Rockin’ In Rhythm,Rockin’ In Rhythm LyricsEmbed
153002,Rockin’ In Rhythm,Rockin’ In Rhythm LyricsEmbed


A closer look at the 233 rows returned shows that matching on lyric patterns is not ideal, as some actual song lyrics contain the patterns above. A better approach is to match on cleaned_track_name, which avoids removing real lyrics. The titles of the non-lyrics entries share common substrings such as \[Credits\], Tracklist, Music Videos, and some others, as shown below.

In [1227]:
non_lyrics_check = df[df['cleaned_track_name'].str.contains('OVO Sound Radio Episode|Chart History| Tracklist| Music Videos|The Singles Collection Tracklist| - Album Art| Track List| Album Art| Album Art/Tracklist|Hnscc| Album Cover| \[Tracklist + Album Cover\]|Yup (Bath Time Playlist)|The motherfucking future Playlist|Liam Payne: Exclusive Playlist|Evening Standard Magazine| Billboard Interview|23:55|Entertainment Weekly Interview|A State Of Trance \[ASOT 692\] - A State of Trance Year Mix 2014|Interview - 107.7 The End - Blue vs. Pinkerton|There Is Nothing Like| \[Tracklist & Artwork\]|Fingertips, part 2 - live at the regal theater/1963/ single version|Stones for a Thousand Years| \[Credits\]| \[Tracklist + Album Art\]| Album Art + Tracklist| \[Tracklist + Cover Art\]| In Rhythm', na=False)]
non_lyrics_check[['cleaned_track_name','track_lyrics']]

Unnamed: 0,cleaned_track_name,track_lyrics
54,OVO Sound Radio Episode 62 Tracklist,OVO Sound Radio Episode 62 Tracklist LyricsOli...
63,OVO Sound Radio Episode 11 Tracklist,OVO Sound Radio Episode 11 Tracklist LyricsILo...
92,OVO Sound Radio Episode 65 Tracklist,OVO Sound Radio Episode 65 Tracklist LyricsOct...
345,Ed Sheeran’s Chart History,Ed Sheeran’s Chart History LyricsAlbums & EPs ...
684,MEMENTO MORI Ep. 16 Tracklist,MEMENTO MORI Ep. 16 Tracklist LyricsThe Weeknd...
...,...,...
143218,Bang 3 (Part 1 & Part 2) [Album Art + Tracklist],Bang 3 (Part 1 & Part 2) LyricsPart 1Tracklis...
143230,Bang 3 (Part 1 & Part 2) [Album Art + Tracklist],Bang 3 (Part 1 & Part 2) LyricsPart 1Tracklis...
152185,Rockin’ In Rhythm,Rockin’ In Rhythm LyricsEmbed
153002,Rockin’ In Rhythm,Rockin’ In Rhythm LyricsEmbed
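
One caveat about the pattern above: characters such as `(`, `)`, `+` and `.` are regex metacharacters. An alternative like 'Yup (Bath Time Playlist)' is read as containing a capture group, and ' +' is read as "one or more spaces", so these alternatives do not match the literal titles. The sketch below shows one way to build the pattern from literal fragments with `re.escape`. The fragment list is a shortened, illustrative subset, and the standard library `re` module is used here rather than the `regex` package imported at the top of the notebook.

```python
import re

# A shortened, illustrative subset of the title fragments used above.
literal_fragments = [
    "Yup (Bath Time Playlist)",
    "[Tracklist + Album Cover]",
    " [Credits]",
]

# re.escape backslash-escapes metacharacters so each fragment matches literally.
safe_pattern = "|".join(re.escape(fragment) for fragment in literal_fragments)

title = "Yup (Bath Time Playlist)"
# Unescaped, the parentheses form a group and the literal "(" in the title is not matched.
unescaped_hit = re.search("Yup (Bath Time Playlist)", title) is not None
escaped_hit = re.search(safe_pattern, title) is not None
print(unescaped_hit, escaped_hit)
```

The resulting string can be passed to str.contains in the same way as the hand-written pattern. Since it contains no groups, it should also avoid the "match groups" warning that is silenced above.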


The rows returned by the cleaned_track_name patterns are all true non-lyrics data, which can safely be replaced with np.nan values in the dataframe.

In [1228]:
df.loc[df['cleaned_track_name'].str.contains('OVO Sound Radio Episode|Chart History| Tracklist| Music Videos|The Singles Collection Tracklist| - Album Art| Track List| Album Art| Album Art/Tracklist|Hnscc| Album Cover| \[Tracklist + Album Cover\]|Yup (Bath Time Playlist)|The motherfucking future Playlist|Liam Payne: Exclusive Playlist|Evening Standard Magazine| Billboard Interview|23:55|Entertainment Weekly Interview|A State Of Trance \[ASOT 692\] - A State of Trance Year Mix 2014|Interview - 107.7 The End - Blue vs. Pinkerton|There Is Nothing Like| \[Tracklist & Artwork\]|Fingertips, part 2 - live at the regal theater/1963/ single version|Stones for a Thousand Years| \[Credits\]| \[Tracklist + Album Art\]| Album Art + Tracklist| \[Tracklist + Cover Art\]| In Rhythm', na=False), ['cleaned_track_name', 'featured_artists', 'track_lyrics', 'lyrics_page_views']] = np.nan

Now the null counts can be displayed again to confirm the removal. The counts for 'cleaned_track_name', 'featured_artists', 'track_lyrics', and 'lyrics_page_views' went from 67135, 141659, 72354, and 111929 to 67368, 141661, 72587, and 111993 respectively. That is an increase of 233, 2, 233, and 64 nulls, which confirms that the replacement was applied.

In [1229]:
describ = df.describe() # assign describe to variable
null_sum = pd.concat([df.isnull().sum().rename('NullData'),describ.T],axis=1)
null_sum

Unnamed: 0,NullData,count,mean,std,min,25%,50%,75%,max
track_uri,0,,,,,,,,
track_name,0,,,,,,,,
cleaned_track_name,67368,,,,,,,,
track_artists,0,,,,,,,,
featured_artists,141661,,,,,,,,
track_is_explicit,0,,,,,,,,
track_popularity,0,156480.0,29.935493,19.590452,0.0,14.0,29.0,44.0,97.0
track_genres,0,,,,,,,,
track_duration_ms,0,156480.0,227073.490523,109823.237372,3338.0,177413.0,214706.0,259733.0,4794398.0
track_time_signature,48,156432.0,3.900672,0.448843,0.0,4.0,4.0,4.0,5.0
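
The null counts quoted in the text for the two cleanup passes can be compared side by side. The sketch below only re-uses the figures reported above; `null_deltas` is a hypothetical helper name.

```python
columns = ["cleaned_track_name", "featured_artists", "track_lyrics", "lyrics_page_views"]

# Null counts quoted in the text: after de-duplication, after the discography
# pass, and after the tracklist/interview pass.
snapshots = [
    (66990, 141659, 72209, 111908),
    (67135, 141659, 72354, 111929),
    (67368, 141661, 72587, 111993),
]

def null_deltas(before, after):
    """Map each column to the number of values newly set to null."""
    return {col: new - old for col, old, new in zip(columns, before, after)}

first_pass = null_deltas(snapshots[0], snapshots[1])
second_pass = null_deltas(snapshots[1], snapshots[2])
print(first_pass)
print(second_pass)
```

In both passes cleaned_track_name and track_lyrics change by the same amount, which is what one would expect when whole rows of Genius data are nulled together.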


The following filter checks how many of the tracks with null lyrics are instrumental, based on the track name.

In [1234]:
instrumental_lyrics = df[(df['track_name'].str.contains('Instrumental|instrumental', na=False)) & (df['track_lyrics'].isnull())]
instrumental_lyrics[['track_name','track_lyrics']]

Unnamed: 0,track_name,track_lyrics
1568,The Real Slim Shady - Instrumental,
1569,The Way I Am - Instrumental,
1616,Guilty Conscience - Instrumental,
1618,My Name Is - Instrumental,
1620,Just Don't Give A Fuck - Instrumental,
...,...,...
155173,If I Could Be With You (One Hour Tonight) - In...,
155174,I Hear Music - Instrumental; 1993 Digital Rema...,
155175,Tea For Two - Instrumental;1993 Digital Remaster,
155226,What Is This Thing Called Love? - Instrumental,
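
The pattern 'Instrumental|instrumental' covers two capitalisations only. A case-insensitive match would also catch variants such as all-caps titles. The sketch below shows the idea with the standard library; the track names are partly invented for illustration. In pandas, str.contains accepts `case=False` for the same purpose.

```python
import re

track_names = [
    "My Name Is - Instrumental",
    "Tea For Two - instrumental; 1993 Digital Remaster",
    "INSTRUMENTAL INTERLUDE",
    "Champagne Poetry",
]

# re.IGNORECASE matches every capitalisation of the word.
instrumental = [
    name for name in track_names
    if re.search("instrumental", name, re.IGNORECASE)
]
print(instrumental)
```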


The cleaned dataframe is exported as a CSV file without an index or header row so that it can be stored on Hadoop.

In [1237]:
df.to_csv('all_tracks_hadoop.csv', index=False, header=False)