## Data Clean Up
A look through the data collected from the Spotify and Genius APIs shows several duplicate tracks and repeated lyrics, which need to be resolved. Some lyrics are in Spanish and need to be separated from the English lyrics so that each language is represented accurately. There are also several null lyrics values and several non-lyrics values retrieved from the Genius API, which need to be removed.
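As a rough sketch of these cleanup steps, the snippet below removes repeated tracks, rows without lyrics, and text that does not look like lyrics. It runs on a made-up toy frame that reuses the original column names (`uri`, `cleaned_title`, `song_lyrics`). The helper `basic_cleanup` is hypothetical, and the "lyrics start with `<title> Lyrics`" rule is an assumption based on the samples shown in this notebook, not a guarantee of the Genius API. Language separation is not covered here.

```python
import pandas as pd

# Toy frame standing in for All_Songs.csv; the rows are made up for illustration.
songs = pd.DataFrame({
    "uri": ["a", "a", "b", "c", "d"],
    "cleaned_title": ["One", "One", "Two", None, "Four"],
    "song_lyrics": [
        "One Lyrics\nla la",
        "One Lyrics\nla la",
        "Two Lyrics\nna na",
        None,
        "Tracklist: 1. Four 2. Five",
    ],
})


def basic_cleanup(frame: pd.DataFrame) -> pd.DataFrame:
    """Drop repeated tracks, rows without lyrics, and text that does not look like lyrics."""
    out = frame.drop_duplicates(subset="uri")
    out = out.dropna(subset=["song_lyrics"])
    out = out.drop_duplicates(subset="song_lyrics")
    # Assumed heuristic: a genuine Genius result begins with "<title> Lyrics".
    expected_prefix = out["cleaned_title"].fillna("") + " Lyrics"
    looks_like_lyrics = [
        lyrics.startswith(prefix)
        for lyrics, prefix in zip(out["song_lyrics"], expected_prefix)
    ]
    return out.loc[looks_like_lyrics].reset_index(drop=True)


cleaned = basic_cleanup(songs)
print(cleaned["uri"].tolist())
```

On the real data, the prefix rule would likely need tuning (for example, case differences between the title and the lyrics header) before rows are actually dropped.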

The Spotify API did not return any genres for the songs, which makes the genres column insignificant to the analysis. However, using the artist data in ArtistDetails.csv, the genres for the songs can be filled in with each artist's main genre to get a more inclusive analysis. To do that, it is important to track the unique genres and create an algorithm that finds the main genre under which the artist would be categorized. For example, if an artist has 'soft pop' and 'canadian pop' listed as genres, then the main genre would be pop, and so on.
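One possible shape for such an algorithm is sketched below: count how many of an artist's sub-genre labels contain each broad genre and pick the most frequent one. The `MAIN_GENRES` list and the `main_genre` function are hypothetical names chosen for illustration. The real list of broad genres would have to be derived from the unique genres in ArtistDetails.csv.

```python
from collections import Counter

# Hypothetical list of broad genres; the real list would come from the
# unique genres found in ArtistDetails.csv.
MAIN_GENRES = ["hip hop", "pop", "rock", "rap", "r&b", "country", "jazz"]


def main_genre(artist_genres, main_genres=MAIN_GENRES):
    """Return the broad genre that appears in the most sub-genre labels, or None."""
    counts = Counter()
    for label in artist_genres:
        words = label.lower()
        for genre in main_genres:
            # Pad with spaces to match whole words, so "pop" does not match "popular".
            if f" {genre} " in f" {words} ":
                counts[genre] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


print(main_genre(["soft pop", "canadian pop"]))  # pop
```

Labels that mention two broad genres (such as 'pop rap') count toward both, so a tie-breaking rule would still need to be decided for artists whose counts are equal.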

### Imports

In [1]:
import pandas as pd
import numpy as np
import regex as re

In [6]:
df = pd.read_csv('/Users/mariamtamer/VSCodeProjects/lyricalanalysis copy/All_Songs.csv')

In [7]:
df.head()

Unnamed: 0,song_artists,uri,track_name,duration_ms,explicit,track_popularity,track_number,album_name,album_artist,album_release_date,...,loudness,mode,speechiness,tempo,time_signature,valence,song_lyrics,lyrics_page_views,cleaned_title,featured_artists
0,['Drake'],spotify:track:2HSmyk2qMN8WQjuGhaQgCk,Champagne Poetry,336511,True,82,1,Certified Lover Boy,Drake,2021-09-03,...,-7.012,0.0,0.326,86.743,4.0,0.496,"Champagne Poetry Lyrics\n\nI love you, I love ...",688853.0,Champagne Poetry,
1,['Drake'],spotify:track:6jy9yJfgCsMHdu2Oz4BGKX,Papi’s Home,178623,True,76,2,Certified Lover Boy,Drake,2021-09-03,...,-6.157,1.0,0.313,140.177,4.0,0.588,Papi’s Home Lyrics\nI know that I hurt you\nYe...,445883.0,Papi’s Home,
2,"['Drake', 'Lil Baby']",spotify:track:37Nqx7iavZpotJSDXZWbJ3,Girls Want Girls (with Lil Baby),221979,True,86,3,Certified Lover Boy,Drake,2021-09-03,...,-8.726,0.0,0.29,86.975,4.0,0.381,,,,
3,"['Drake', 'Lil Durk', 'Giveon']",spotify:track:61S79KIVA4I9FXbnsylEHT,In The Bible (with Lil Durk & Giveon),296568,True,79,4,Certified Lover Boy,Drake,2021-09-03,...,-8.35,0.0,0.297,143.07,4.0,0.147,"In The Bible Lyrics\nOkay, okay, okay\nCountin...",439186.0,In The Bible,"['GIVĒON', 'Lil Durk']"
4,"['Drake', 'JAY-Z']",spotify:track:4VCbgIdr8ptegWeJpqLVHH,Love All (with JAY-Z),228461,True,77,5,Certified Lover Boy,Drake,2021-09-03,...,-5.442,1.0,0.287,92.131,4.0,0.155,,,,


In [17]:
df.sample(10)

Unnamed: 0,song_artists,uri,track_name,duration_ms,explicit,track_popularity,track_number,album_name,album_artist,album_release_date,...,loudness,mode,speechiness,tempo,time_signature,valence,song_lyrics,lyrics_page_views,cleaned_title,featured_artists
72174,['Jeremih'],spotify:track:49Rh91AUiqsv8DbZtYfeml,773 Love,229856,False,46,8,Late Nights With Jeremih,Jeremih,2022-03-11,...,-4.672,0.0,0.371,90.753,4.0,0.317,"773 Love Lyrics\nEar Drummers\nHa, you know th...",46892.0,773 Love,
50972,"['Gucci Mane', 'Figg Panamera']",spotify:track:7wergCwlxtxaFCVrESpWhV,Spare Dat Bitch,138556,True,9,14,Fillmoelanta 3,Gucci Mane,2013-07-20,...,-6.332,1.0,0.051,97.014,4.0,0.0657,Spare Dat Bitch LyricsHook:\nScandalous ass bi...,,Spare Dat Bitch,
103626,['The Cure'],spotify:track:061LND8RZbg9PowQbf1uWp,The Walk - Live,208053,False,37,5,Anniversary: 1978 - 2018 Live In Hyde Park London,The Cure,2019-10-18,...,-5.983,1.0,0.119,107.669,5.0,0.133,"The walk - live at the palace, auburn hills, m...",,"The walk - live at the palace, auburn hills, m...",
18998,"['Diplo', 'Deorro', 'Steve Aoki', 'Steve Bays']",spotify:track:0GATmBanwPiHQBfl2mDq3o,Freak (feat. Steve Bays),281250,False,52,4,Random White Dude Be Everywhere,Diplo,2014-07-29,...,-3.593,0.0,0.0483,150.045,4.0,0.624,Freak LyricsHey!\nHey!\nHey!\nHey!\nHey!\nHey!...,21309.0,Freak,['Steve Bays']
108532,['The Beach Boys'],spotify:track:1hnTtn9CKhZdPIZHMJzYLc,Deirdre,209866,False,40,5,"""Feel Flows"" The Sunflower & Surf’s Up Session...",The Beach Boys,2021-08-27,...,-8.478,1.0,0.125,100.0,4.0,0.309,Deirdre Lyrics\nDeirdre\n\nThe trouble you had...,6451.0,Deirdre,
83723,['Dire Straits'],spotify:track:7LXxMApcyuh5A3sod0nV5s,You and Your Friend,359333,False,41,6,On Every Street (Remaster),Dire Straits,1991-09-10,...,-14.037,1.0,0.053,113.818,4.0,0.222,You and Your Friend Lyrics\n\nWill you and you...,6533.0,You and Your Friend,
157395,['Nat King Cole Trio'],spotify:track:3XOPpTTVZXK6hAQeblDnRG,(I Love You) For Sentimental Reasons - 2003 Re...,174760,False,13,5,The World Of Nat King Cole,Nat King Cole,2005-01-01,...,-8.626,1.0,0.0355,132.362,4.0,0.222,,,,
57108,['The Script'],spotify:track:2QWP8NYYplOqEFBYGCcq0S,Rain,209346,True,65,2,Freedom Child,The Script,2017-09-01,...,-12.188,1.0,0.0387,197.569,4.0,0.22,,,,
65486,['James Bay'],spotify:track:3ZQsCvbpp2HVXt9Mp46f8n,Let It Go,261533,False,29,3,Chaos And The Calm,James Bay,2015-03-25,...,-18.446,1.0,0.0408,121.615,4.0,0.727,Let It Go Lyrics\nFrom walking home and talkin...,913814.0,Let It Go,
125420,['Rod Wave'],spotify:track:0JjsAFEWLrv1b8nMFhQqhI,Ain't Mad At You,180733,True,54,25,Pray 4 Love (Deluxe),Rod Wave,2020-08-07,...,-9.666,0.0,0.0589,89.984,4.0,0.497,"Ain’t Mad at You Lyrics\n(Trillo Beats, you di...",32800.0,Ain’t Mad at You,


As seen above, the column names are inconsistent and not descriptive, which makes them confusing to work with, so it is important to have column names that state exactly what each column holds.

Additionally, the columns can be reordered more logically so that closely related attributes follow each other and can be viewed side by side.

In [25]:
df.rename(columns = {'song_artists': 'track_artists', 'uri':'track_uri', 'duration_ms': 'track_duration_ms', 'explicit': 'track_is_explicit', 
'label': 'album_record_label', 'genres': 'album_genres', 'song_lyrics': 'track_lyrics', 'cleaned_title': 'cleaned_track_name', 
'acousticness': 'track_acousticness', 'danceability':'track_danceability', 'energy': 'track_energy', 'instrumentalness': 'track_instrumentalness', 
'key':'track_key', 'liveness': 'track_liveness' ,'loudness': 'track_loudness', 'mode': 'track_mode', 'speechiness': 'track_speechiness', 
'tempo': 'track_tempo', 'time_signature': 'track_time_signature', 'valence': 'track_valence'}, inplace = True)

In [27]:
column_names = ['track_uri', 'track_name', 'cleaned_track_name', 'track_artists', 'featured_artists', 'track_is_explicit', 'track_popularity', 'track_duration_ms', 'track_time_signature', 'track_acousticness', 
'track_danceability', 'track_energy', 'track_instrumentalness', 'track_key', 'track_liveness', 'track_loudness', 'track_mode', 'track_speechiness', 'track_tempo', 'track_valence', 
'track_lyrics', 'lyrics_page_views', 'track_number', 'album_name', 'album_artist', 'album_release_date', 'album_genres', 'album_popularity','album_record_label', 'album_cover']

df = df.reindex(columns=column_names)
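One thing to keep in mind with `reindex(columns=...)` is that it does not complain about names that are absent from the frame: it adds them as all-NaN columns, and it silently drops any column that is not listed. A minimal guard, assuming a hypothetical helper called `reorder_columns`, could look like this on a toy frame:

```python
import pandas as pd


def reorder_columns(frame: pd.DataFrame, ordered_names: list) -> pd.DataFrame:
    """Reorder columns, failing loudly if a requested name is absent."""
    missing = [name for name in ordered_names if name not in frame.columns]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    dropped = [name for name in frame.columns if name not in ordered_names]
    if dropped:
        # Report columns that the new order leaves out instead of losing them silently.
        print(f"columns left out of the new order: {dropped}")
    return frame[ordered_names]


# Toy frame for illustration only.
toy = pd.DataFrame({"Unnamed: 0": [0], "uri": ["x"], "track_name": ["Song"]})
toy = toy.rename(columns={"uri": "track_uri"})
toy = reorder_columns(toy, ["track_name", "track_uri"])
print(toy.columns.tolist())
```

A check like this makes a typo in the rename mapping or in the column list show up as an error rather than as an unexpected empty column later in the analysis.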

In [28]:
df

Unnamed: 0,track_uri,track_name,cleaned_track_name,track_artists,featured_artists,track_is_explicit,track_popularity,track_duration_ms,track_time_signature,track_acousticness,...,track_lyrics,lyrics_page_views,track_number,album_name,album_artist,album_release_date,album_genres,album_popularity,album_record_label,album_cover
0,spotify:track:2HSmyk2qMN8WQjuGhaQgCk,Champagne Poetry,Champagne Poetry,['Drake'],,True,82,336511,4.0,0.758000,...,"Champagne Poetry Lyrics\n\nI love you, I love ...",688853.0,1,Certified Lover Boy,Drake,2021-09-03,,95,OVO,https://i.scdn.co/image/ab67616d00001e02cd945b...
1,spotify:track:6jy9yJfgCsMHdu2Oz4BGKX,Papi’s Home,Papi’s Home,['Drake'],,True,76,178623,4.0,0.112000,...,Papi’s Home Lyrics\nI know that I hurt you\nYe...,445883.0,2,Certified Lover Boy,Drake,2021-09-03,,95,OVO,https://i.scdn.co/image/ab67616d00001e02cd945b...
2,spotify:track:37Nqx7iavZpotJSDXZWbJ3,Girls Want Girls (with Lil Baby),,"['Drake', 'Lil Baby']",,True,86,221979,4.0,0.181000,...,,,3,Certified Lover Boy,Drake,2021-09-03,,95,OVO,https://i.scdn.co/image/ab67616d00001e02cd945b...
3,spotify:track:61S79KIVA4I9FXbnsylEHT,In The Bible (with Lil Durk & Giveon),In The Bible,"['Drake', 'Lil Durk', 'Giveon']","['GIVĒON', 'Lil Durk']",True,79,296568,4.0,0.614000,...,"In The Bible Lyrics\nOkay, okay, okay\nCountin...",439186.0,4,Certified Lover Boy,Drake,2021-09-03,,95,OVO,https://i.scdn.co/image/ab67616d00001e02cd945b...
4,spotify:track:4VCbgIdr8ptegWeJpqLVHH,Love All (with JAY-Z),,"['Drake', 'JAY-Z']",,True,77,228461,4.0,0.354000,...,,,5,Certified Lover Boy,Drake,2021-09-03,,95,OVO,https://i.scdn.co/image/ab67616d00001e02cd945b...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
159226,spotify:track:0VxTtE5HoNMf9sp30j6c9V,Try Again,,['Westlife'],,False,47,214866,3.0,0.000067,...,,,14,Westlife,Westlife,1999-11-01,,73,RCA Records Label,https://i.scdn.co/image/ab67616d00001e0244ead2...
159227,spotify:track:3EHx4H0FsTplZrcFSeuLeE,What I Want Is What I Got,What I Want Is What I’ve Got,['Westlife'],,False,46,213066,4.0,0.000084,...,What I Want Is What I’ve Got Lyrics\nAll that ...,,15,Westlife,Westlife,1999-11-01,,73,RCA Records Label,https://i.scdn.co/image/ab67616d00001e0244ead2...
159228,spotify:track:4GfGx2zvY8pIwf2o2SAufU,We Are One,,['Westlife'],,False,45,222893,4.0,0.012300,...,,,16,Westlife,Westlife,1999-11-01,,73,RCA Records Label,https://i.scdn.co/image/ab67616d00001e0244ead2...
159229,spotify:track:7dODnrD8danC9FD5xLb9Tu,Can't Lose What You Never Had,Can’t Lose What You Never Had,['Westlife'],,False,45,264485,4.0,0.009850,...,Can’t Lose What You Never Had Lyrics\nBaby you...,6644.0,17,Westlife,Westlife,1999-11-01,,73,RCA Records Label,https://i.scdn.co/image/ab67616d00001e0244ead2...


In [29]:
# Let's see df's dimensionality with df.shape

df.shape

(159231, 30)

In [33]:
# Count the unique values in each column

df.nunique()

track_uri                 156480
track_name                105118
cleaned_track_name         57046
track_artists              17885
featured_artists            5782
track_is_explicit              2
track_popularity              98
track_duration_ms          53780
track_time_signature           5
track_acousticness          4155
track_danceability          1151
track_energy                1898
track_instrumentalness      5290
track_key                     12
track_liveness              1668
track_loudness             18448
track_mode                     2
track_speechiness           1594
track_tempo                44084
track_valence               1650
track_lyrics               57965
lyrics_page_views          26616
track_number                  50
album_name                  9574
album_artist                 797
album_release_date          3772
album_genres                   0
album_popularity             100
album_record_label          1448
album_cover                 9752
dtype: int64
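Read against the 159231 rows reported by `df.shape`, these counts already point at two of the cleanup tasks: `track_uri` has only 156480 unique values, so 2751 rows repeat a URI that appears elsewhere, and `album_genres` has 0 unique values, so the column is entirely empty. The same checks can be sketched directly in pandas. The toy frame below is made up for illustration.

```python
import pandas as pd

# Toy frame for illustration only.
toy = pd.DataFrame({
    "track_uri": ["a", "a", "b", "c"],
    "track_lyrics": ["la", "la", None, "na"],
    "album_genres": [None, None, None, None],
})

# Rows whose track_uri already appeared earlier in the frame.
repeated_uris = int(toy.duplicated(subset="track_uri").sum())

# Columns with no non-null values at all (nunique() reports 0 for these).
unique_counts = toy.nunique()
empty_columns = unique_counts[unique_counts == 0].index.tolist()

print(repeated_uris)   # 1
print(empty_columns)   # ['album_genres']
```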

In [8]:
describ = df.describe() # assign describe to variable
null_sum = pd.concat([df.isnull().sum().rename('NullData'),describ.T],axis=1)

In [9]:
null_sum

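| | NullData | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|
| song_artists | 0 | | | | | | | | |
| uri | 0 | | | | | | | | |
| track_name | 0 | | | | | | | | |
| duration_ms | 0 | 159231.0 | 226734.568445 | 109223.12257 | 3338.0 | 177263.0 | 214515.0 | 259120.0 | 4794398.0 |
| explicit | 0 | | | | | | | | |
| track_popularity | 0 | 159231.0 | 30.132688 | 19.65021 | 0.0 | 14.0 | 29.0 | 44.0 | 97.0 |
| track_number | 0 | 159231.0 | 8.916128 | 6.524022 | 1.0 | 4.0 | 8.0 | 12.0 | 50.0 |
| album_name | 0 | | | | | | | | |
| album_artist | 0 | | | | | | | | |
| album_release_date | 0 | | | | | | | | |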
