## Data Clean Up
Looking through the data collected from the Spotify and Genius APIs, there are several duplicate tracks as well as repeated lyrics, which need to be resolved. Some lyrics are in Spanish and need to be separated from the English lyrics in order to keep an accurate representation of each language. There are also several null values for the lyrics and several non-lyric values retrieved from the Genius API, which need to be removed.
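As a rough illustration of the duplicate and null handling mentioned above, the sketch below works on a tiny made-up frame. The column names `uri` and `song_lyrics` match the raw CSV, but the rows, the variable names, and the choice of `uri` as the duplicate key are assumptions for illustration only, not the final cleaning logic.

```python
import pandas as pd

# Toy rows standing in for the collected tracks (values are invented).
toy = pd.DataFrame({
    "uri": ["t1", "t1", "t2", "t3"],
    "song_lyrics": ["la la", "la la", None, "na na"],
})

# Keep the first occurrence of each track URI.
deduped = toy.drop_duplicates(subset="uri", keep="first")

# Drop tracks whose lyrics are missing.
with_lyrics = deduped.dropna(subset=["song_lyrics"])

print(len(toy), len(deduped), len(with_lyrics))
```

Non-lyric values and language separation need content-based checks and are not covered by this sketch.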

The Spotify API did not return any genres for the songs, which makes the genres column insignificant to the analysis. However, using the artist data in ArtistDetails.csv, the genres for the songs can be filled with each artist's main genres to get a more inclusive analysis. To do that, it is important to track the unique genres and create an algorithm that finds the main genre under which the artist would be categorized. For example, if an artist has 'soft pop' and 'canadian pop' as genres, then the main genre would be pop, and so on.

### Imports

In [52]:
import pandas as pd
import numpy as np
import regex as re

First, the CSV containing the data is read and samples are taken in order to identify the distribution and attributes of the data collected.

In [53]:
df = pd.read_csv('/Users/mariamtamer/VSCodeProjects/lyricalanalysis copy/All_Songs.csv')

In [54]:
df.head()

Unnamed: 0,song_artists,uri,track_name,duration_ms,explicit,track_popularity,track_number,album_name,album_artist,album_release_date,...,loudness,mode,speechiness,tempo,time_signature,valence,song_lyrics,lyrics_page_views,cleaned_title,featured_artists
0,['Drake'],spotify:track:2HSmyk2qMN8WQjuGhaQgCk,Champagne Poetry,336511,True,82,1,Certified Lover Boy,Drake,2021-09-03,...,-7.012,0.0,0.326,86.743,4.0,0.496,"Champagne Poetry Lyrics\n\nI love you, I love ...",688853.0,Champagne Poetry,
1,['Drake'],spotify:track:6jy9yJfgCsMHdu2Oz4BGKX,Papi’s Home,178623,True,76,2,Certified Lover Boy,Drake,2021-09-03,...,-6.157,1.0,0.313,140.177,4.0,0.588,Papi’s Home Lyrics\nI know that I hurt you\nYe...,445883.0,Papi’s Home,
2,"['Drake', 'Lil Baby']",spotify:track:37Nqx7iavZpotJSDXZWbJ3,Girls Want Girls (with Lil Baby),221979,True,86,3,Certified Lover Boy,Drake,2021-09-03,...,-8.726,0.0,0.29,86.975,4.0,0.381,,,,
3,"['Drake', 'Lil Durk', 'Giveon']",spotify:track:61S79KIVA4I9FXbnsylEHT,In The Bible (with Lil Durk & Giveon),296568,True,79,4,Certified Lover Boy,Drake,2021-09-03,...,-8.35,0.0,0.297,143.07,4.0,0.147,"In The Bible Lyrics\nOkay, okay, okay\nCountin...",439186.0,In The Bible,"['GIVĒON', 'Lil Durk']"
4,"['Drake', 'JAY-Z']",spotify:track:4VCbgIdr8ptegWeJpqLVHH,Love All (with JAY-Z),228461,True,77,5,Certified Lover Boy,Drake,2021-09-03,...,-5.442,1.0,0.287,92.131,4.0,0.155,,,,


In [55]:
df.sample(10)

Unnamed: 0,song_artists,uri,track_name,duration_ms,explicit,track_popularity,track_number,album_name,album_artist,album_release_date,...,loudness,mode,speechiness,tempo,time_signature,valence,song_lyrics,lyrics_page_views,cleaned_title,featured_artists
102263,['Paul McCartney'],spotify:track:4HaODGY6Rs7ADVlvlGXpH7,Motor Of Love - Remastered 2017,387626,False,3,12,Flowers In The Dirt,Paul McCartney,2018-12-07,...,-6.969,1.0,0.0348,98.586,4.0,0.28,,,,
4766,['Maroon 5'],spotify:track:36oXJgMw917bhXdD6f7VBT,New Love,196546,True,14,8,V (Extended Edition),Maroon 5,2015-08-14,...,-5.216,0.0,0.076,165.012,4.0,0.824,New Love Lyrics\nI'll be your sun and your moo...,12113.0,New Love,
157086,"['Nat King Cole', 'Nelson Riddle Orchestra']",spotify:track:6AC8ABfBml4R4hS4DASrJr,"Darling, je vous aime beaucoup",169826,False,12,10,Unforgettable,Nat King Cole,2016-03-04,...,-12.883,1.0,0.0314,99.686,3.0,0.161,,,,
70188,"['Duki', 'NEGRO DUB']",spotify:track:3qSXDTVldnymCphR6Hqi9z,Vida Eterna,196000,True,56,4,24,Duki,2020-06-24,...,-3.323,1.0,0.234,93.431,4.0,0.527,"Vida Eterna Lyrics\n\nTierra, guerra\nFerra, v...",10252.0,Vida Eterna,['Negro Dub']
152513,['Dean Martin'],spotify:track:2ghmhqUMZ4Z2GqpwrH8yLj,"Let Me Go, Lover!",184426,False,22,21,"Dean Martin: The Capitol Recordings, Vol. 5 (1...",Dean Martin,1954,...,-8.6,0.0,0.456,178.13,4.0,0.657,,,,
86901,"['John Williams', 'London Symphony Orchestra']",spotify:track:7vYmFK7sjAhyGZiawKUXL7,Augie's Great Municipal Band and End Credits,577426,False,30,17,Star Wars: Die Dunkle Bedrohung (Original Film...,John Williams,1999-05-04,...,-15.683,1.0,0.0605,96.454,3.0,0.21,,,Augie’s great municipal band and end credits -...,
40807,['Meek Mill'],spotify:track:43dN6S41rPK6GJ4LTl2KYy,Get My Paper Right - Freestyle,50126,False,6,2,Phillystyles,Meek Mill,2013-01-30,...,-9.476,1.0,0.0389,81.408,4.0,0.306,,,,
70747,"['Wisin & Yandel', 'Don Omar']",spotify:track:1kL6o6t0tpWsgHwkEQpA7t,Las Cosas Cambiaron,226093,False,33,8,Los Extraterrestres - Otra Dimension,Wisin & Yandel,2007,...,-26.713,0.0,0.0533,77.387,4.0,0.27,Las Cosas Cambiaron Lyrics\n\nTengo noticias p...,,Las Cosas Cambiaron,['Don Omar']
76063,['Céline Dion'],spotify:track:65opQiguKJ30uwNlXudCSn,Bozo (Live in Quebec City),188533,False,25,13,Céline... Une seule fois / Live 2013,Céline Dion,2014-05-16,...,-10.848,1.0,0.041,82.963,3.0,0.23,,,,
90514,['Danny Ocean'],spotify:track:7ffqx174e4SK2bfUjVqLdE,ADO,174000,False,67,2,@dannocean,Danny Ocean,2022-02-17,...,-13.477,1.0,0.161,69.959,4.0,0.488,,,,


### Clean Up
As seen above, the column names are inconsistent and not descriptive, which makes them confusing to work with, so the columns are renamed to state exactly what each one holds.

Additionally, the column names can be reorganized in a more logical fashion to have the closely related attributes following each other and enable viewing them side by side.

#### Cleaning Column Names

In [56]:
df.rename(columns = {'song_artists': 'track_artists', 'uri':'track_uri', 'duration_ms': 'track_duration_ms', 'explicit': 'track_is_explicit', 
'label': 'album_record_label', 'genres': 'album_genres', 'song_lyrics': 'track_lyrics', 'cleaned_title': 'cleaned_track_name', 
'acousticness': 'track_acousticness', 'danceability':'track_danceability', 'energy': 'track_energy', 'instrumentalness': 'track_instrumentalness', 
'key':'track_key', 'liveness': 'track_liveness' ,'loudness': 'track_loudness', 'mode': 'track_mode', 'speechiness': 'track_speechiness', 
'tempo': 'track_tempo', 'time_signature': 'track_time_signature', 'valence': 'track_valence'}, inplace = True)

In [57]:
column_names = ['track_uri', 'track_name', 'cleaned_track_name', 'track_artists', 'featured_artists', 'track_is_explicit', 'track_popularity', 'track_duration_ms', 'track_time_signature', 'track_acousticness', 
'track_danceability', 'track_energy', 'track_instrumentalness', 'track_key', 'track_mode', 'track_liveness', 'track_loudness', 'track_speechiness', 'track_tempo', 'track_valence', 
'track_lyrics', 'lyrics_page_views', 'track_number', 'album_name', 'album_artist', 'album_release_date', 'album_genres', 'album_popularity','album_record_label', 'album_cover']

df = df.reindex(columns=column_names)
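
To make the effect of the rename and reorder steps concrete, here is a minimal sketch on a one-row frame. The toy column `extra` and the variable names are hypothetical. It shows two behaviours of `reindex(columns=...)` worth keeping in mind: columns not in the list are dropped, and listed columns that do not exist are created and filled with NaN.

```python
import pandas as pd

# One invented row with two raw column names and one column we do not list later.
toy = pd.DataFrame({"uri": ["t1"], "explicit": [True], "extra": [0]})

renamed = toy.rename(columns={"uri": "track_uri", "explicit": "track_is_explicit"})

# reindex keeps only the listed columns, in the listed order;
# a listed column that does not exist is created and filled with NaN.
ordered = renamed.reindex(columns=["track_is_explicit", "track_uri", "album_genres"])

print(list(ordered.columns))
print(ordered["album_genres"].isna().all())
```

One consequence is that a misspelled name in the column list would not raise an error; it would silently produce an all-NaN column, so it is worth checking the result after reordering.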

In [58]:
df

Unnamed: 0,track_uri,track_name,cleaned_track_name,track_artists,featured_artists,track_is_explicit,track_popularity,track_duration_ms,track_time_signature,track_acousticness,...,track_lyrics,lyrics_page_views,track_number,album_name,album_artist,album_release_date,album_genres,album_popularity,album_record_label,album_cover
0,spotify:track:2HSmyk2qMN8WQjuGhaQgCk,Champagne Poetry,Champagne Poetry,['Drake'],,True,82,336511,4.0,0.758000,...,"Champagne Poetry Lyrics\n\nI love you, I love ...",688853.0,1,Certified Lover Boy,Drake,2021-09-03,,95,OVO,https://i.scdn.co/image/ab67616d00001e02cd945b...
1,spotify:track:6jy9yJfgCsMHdu2Oz4BGKX,Papi’s Home,Papi’s Home,['Drake'],,True,76,178623,4.0,0.112000,...,Papi’s Home Lyrics\nI know that I hurt you\nYe...,445883.0,2,Certified Lover Boy,Drake,2021-09-03,,95,OVO,https://i.scdn.co/image/ab67616d00001e02cd945b...
2,spotify:track:37Nqx7iavZpotJSDXZWbJ3,Girls Want Girls (with Lil Baby),,"['Drake', 'Lil Baby']",,True,86,221979,4.0,0.181000,...,,,3,Certified Lover Boy,Drake,2021-09-03,,95,OVO,https://i.scdn.co/image/ab67616d00001e02cd945b...
3,spotify:track:61S79KIVA4I9FXbnsylEHT,In The Bible (with Lil Durk & Giveon),In The Bible,"['Drake', 'Lil Durk', 'Giveon']","['GIVĒON', 'Lil Durk']",True,79,296568,4.0,0.614000,...,"In The Bible Lyrics\nOkay, okay, okay\nCountin...",439186.0,4,Certified Lover Boy,Drake,2021-09-03,,95,OVO,https://i.scdn.co/image/ab67616d00001e02cd945b...
4,spotify:track:4VCbgIdr8ptegWeJpqLVHH,Love All (with JAY-Z),,"['Drake', 'JAY-Z']",,True,77,228461,4.0,0.354000,...,,,5,Certified Lover Boy,Drake,2021-09-03,,95,OVO,https://i.scdn.co/image/ab67616d00001e02cd945b...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
159226,spotify:track:0VxTtE5HoNMf9sp30j6c9V,Try Again,,['Westlife'],,False,47,214866,3.0,0.000067,...,,,14,Westlife,Westlife,1999-11-01,,73,RCA Records Label,https://i.scdn.co/image/ab67616d00001e0244ead2...
159227,spotify:track:3EHx4H0FsTplZrcFSeuLeE,What I Want Is What I Got,What I Want Is What I’ve Got,['Westlife'],,False,46,213066,4.0,0.000084,...,What I Want Is What I’ve Got Lyrics\nAll that ...,,15,Westlife,Westlife,1999-11-01,,73,RCA Records Label,https://i.scdn.co/image/ab67616d00001e0244ead2...
159228,spotify:track:4GfGx2zvY8pIwf2o2SAufU,We Are One,,['Westlife'],,False,45,222893,4.0,0.012300,...,,,16,Westlife,Westlife,1999-11-01,,73,RCA Records Label,https://i.scdn.co/image/ab67616d00001e0244ead2...
159229,spotify:track:7dODnrD8danC9FD5xLb9Tu,Can't Lose What You Never Had,Can’t Lose What You Never Had,['Westlife'],,False,45,264485,4.0,0.009850,...,Can’t Lose What You Never Had Lyrics\nBaby you...,6644.0,17,Westlife,Westlife,1999-11-01,,73,RCA Records Label,https://i.scdn.co/image/ab67616d00001e0244ead2...


In [59]:
# Let's see df's dimensionality with df.shape

df.shape

(159231, 30)

#### Defining the Key Signature Column
According to the Spotify API and some background knowledge in music, the key signature of a song is represented using integer values in two columns: key and mode. The combination of the key and the mode can be mapped to the more commonly known key signatures, which gives a more meaningful description of the data collected. Therefore, two dictionaries are defined: one for when the mode is 1 (major key) and another for when the mode is 0 (minor key). Each dictionary maps the key to its full key signature. These are used to create an extra column holding the key signature of each track.

In [60]:
mode_1 = {0: 'C Major', 1: 'D♭ Major', 2: 'D Major', 3: 'E♭ Major', 4: 'E Major', 5: 'F Major', 6: 'F# Major', 7: 'G Major', 8: 'A♭ Major', 9: 'A Major', 10: 'B♭ Major', 11: 'B Major'}
mode_0 = {0: 'C Minor', 1: 'C# Minor', 2: 'D Minor', 3: 'D# Minor', 4: 'E Minor', 5: 'F Minor', 6: 'F# Minor', 7: 'G Minor', 8: 'G# Minor', 9: 'A Minor', 10: 'B♭ Minor', 11: 'B Minor'}

In [61]:
def get_key_signature(mode, key):
    if (mode == 1):
        key_signature = mode_1.get(key)
    elif (mode == 0):
        key_signature = mode_0.get(key)
    else:
        key_signature = None
    return key_signature
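
The lookup can be checked in isolation with a small sketch. The dictionaries `major` and `minor` below are shortened excerpts of the two tables, and `key_signature` is an illustrative stand-in for the function above. The sketch also covers the edge cases a row can contain: float values (the mode column is displayed as 0.0/1.0 in the samples), a missing mode, and a key outside 0-11 (Spotify documents -1 for "no key detected").

```python
import math

# Excerpts of the two lookup tables (only a few keys, for brevity).
major = {0: "C Major", 7: "G Major"}
minor = {0: "C Minor", 9: "A Minor"}

def key_signature(mode, key):
    """Return the key signature, or None when mode or key is missing or unknown."""
    if mode == 1:
        return major.get(key)
    if mode == 0:
        return minor.get(key)
    return None

# 1.0 == 1 and hash(1.0) == hash(1), so float values still hit the int-keyed dicts.
print(key_signature(1.0, 7.0))       # G Major
print(key_signature(0.0, 9.0))       # A Minor
# NaN compares unequal to both 0 and 1, so a missing mode falls through to None.
print(key_signature(math.nan, 7.0))  # None
# A key that is not in the table also yields None via dict.get.
print(key_signature(1, -1))          # None
```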

In [62]:
key_signature = df.apply(lambda x: get_key_signature(x.track_mode, x.track_key), axis= 1)
df.insert(13, 'track_key_signature', key_signature)
df

Unnamed: 0,track_uri,track_name,cleaned_track_name,track_artists,featured_artists,track_is_explicit,track_popularity,track_duration_ms,track_time_signature,track_acousticness,...,track_lyrics,lyrics_page_views,track_number,album_name,album_artist,album_release_date,album_genres,album_popularity,album_record_label,album_cover
0,spotify:track:2HSmyk2qMN8WQjuGhaQgCk,Champagne Poetry,Champagne Poetry,['Drake'],,True,82,336511,4.0,0.758000,...,"Champagne Poetry Lyrics\n\nI love you, I love ...",688853.0,1,Certified Lover Boy,Drake,2021-09-03,,95,OVO,https://i.scdn.co/image/ab67616d00001e02cd945b...
1,spotify:track:6jy9yJfgCsMHdu2Oz4BGKX,Papi’s Home,Papi’s Home,['Drake'],,True,76,178623,4.0,0.112000,...,Papi’s Home Lyrics\nI know that I hurt you\nYe...,445883.0,2,Certified Lover Boy,Drake,2021-09-03,,95,OVO,https://i.scdn.co/image/ab67616d00001e02cd945b...
2,spotify:track:37Nqx7iavZpotJSDXZWbJ3,Girls Want Girls (with Lil Baby),,"['Drake', 'Lil Baby']",,True,86,221979,4.0,0.181000,...,,,3,Certified Lover Boy,Drake,2021-09-03,,95,OVO,https://i.scdn.co/image/ab67616d00001e02cd945b...
3,spotify:track:61S79KIVA4I9FXbnsylEHT,In The Bible (with Lil Durk & Giveon),In The Bible,"['Drake', 'Lil Durk', 'Giveon']","['GIVĒON', 'Lil Durk']",True,79,296568,4.0,0.614000,...,"In The Bible Lyrics\nOkay, okay, okay\nCountin...",439186.0,4,Certified Lover Boy,Drake,2021-09-03,,95,OVO,https://i.scdn.co/image/ab67616d00001e02cd945b...
4,spotify:track:4VCbgIdr8ptegWeJpqLVHH,Love All (with JAY-Z),,"['Drake', 'JAY-Z']",,True,77,228461,4.0,0.354000,...,,,5,Certified Lover Boy,Drake,2021-09-03,,95,OVO,https://i.scdn.co/image/ab67616d00001e02cd945b...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
159226,spotify:track:0VxTtE5HoNMf9sp30j6c9V,Try Again,,['Westlife'],,False,47,214866,3.0,0.000067,...,,,14,Westlife,Westlife,1999-11-01,,73,RCA Records Label,https://i.scdn.co/image/ab67616d00001e0244ead2...
159227,spotify:track:3EHx4H0FsTplZrcFSeuLeE,What I Want Is What I Got,What I Want Is What I’ve Got,['Westlife'],,False,46,213066,4.0,0.000084,...,What I Want Is What I’ve Got Lyrics\nAll that ...,,15,Westlife,Westlife,1999-11-01,,73,RCA Records Label,https://i.scdn.co/image/ab67616d00001e0244ead2...
159228,spotify:track:4GfGx2zvY8pIwf2o2SAufU,We Are One,,['Westlife'],,False,45,222893,4.0,0.012300,...,,,16,Westlife,Westlife,1999-11-01,,73,RCA Records Label,https://i.scdn.co/image/ab67616d00001e0244ead2...
159229,spotify:track:7dODnrD8danC9FD5xLb9Tu,Can't Lose What You Never Had,Can’t Lose What You Never Had,['Westlife'],,False,45,264485,4.0,0.009850,...,Can’t Lose What You Never Had Lyrics\nBaby you...,6644.0,17,Westlife,Westlife,1999-11-01,,73,RCA Records Label,https://i.scdn.co/image/ab67616d00001e0244ead2...


In [63]:
# To find how many unique values columns have

df.nunique()

track_uri                 156480
track_name                105118
cleaned_track_name         57046
track_artists              17885
featured_artists            5782
track_is_explicit              2
track_popularity              98
track_duration_ms          53780
track_time_signature           5
track_acousticness          4155
track_danceability          1151
track_energy                1898
track_instrumentalness      5290
track_key_signature           24
track_key                     12
track_mode                     2
track_liveness              1668
track_loudness             18448
track_speechiness           1594
track_tempo                44084
track_valence               1650
track_lyrics               57965
lyrics_page_views          26616
track_number                  50
album_name                  9574
album_artist                 797
album_release_date          3772
album_genres                   0
album_popularity             100
album_record_label          1448
album_cove

In [64]:
describ = df.describe() # assign describe to variable
null_sum = pd.concat([df.isnull().sum().rename('NullData'),describ.T],axis=1)

In [65]:
null_sum

Unnamed: 0,NullData,count,mean,std,min,25%,50%,75%,max
track_uri,0,,,,,,,,
track_name,0,,,,,,,,
cleaned_track_name,67860,,,,,,,,
track_artists,0,,,,,,,,
featured_artists,144057,,,,,,,,
track_is_explicit,0,,,,,,,,
track_popularity,0,159231.0,30.132688,19.65021,0.0,14.0,29.0,44.0,97.0
track_duration_ms,0,159231.0,226734.568445,109223.12257,3338.0,177263.0,214515.0,259120.0,4794398.0
track_time_signature,48,159183.0,3.900932,0.448415,0.0,4.0,4.0,4.0,5.0
track_acousticness,48,159183.0,0.318934,0.318564,0.0,0.0377,0.192,0.578,0.996


#### Defining the Genres Column
Based on the above description of the dataframe, none of the tracks have a genre. This is probably due to the restrictions that Spotify imposes on the objects returned by its different endpoints. In order to obtain the missing values for the genres, the artist genres from ArtistDetails.csv are used. Since an artist often has several subgenres of one genre, the genres are first reduced to general genres, which are then attached to each song and artist.

In [446]:
artist_df = pd.read_csv('/Users/mariamtamer/VSCodeProjects/lyricalanalysis copy/2_Artist_Data_Acquisition/ArtistDetails.csv')

In [447]:
artist_df.head()

Unnamed: 0,uri,artist_name,artist_total_followers,artist_image,genres,popularity
0,spotify:artist:3TVXtAsR1Inumwj472S9r4,Drake,62430349,https://i.scdn.co/image/ab676161000051749e46a7...,"['canadian hip hop', 'canadian pop', 'hip hop'...",98
1,spotify:artist:6eUKZXaKkcviH0Ku9w2n3V,Ed Sheeran,95198498,https://i.scdn.co/image/ab6761610000517412a2ef...,"['pop', 'uk pop']",96
2,spotify:artist:4q3ewBCX7sLwd24euuV69X,Bad Bunny,46313053,https://i.scdn.co/image/ab676161000051746ad57a...,"['latin', 'reggaeton', 'trap latino']",100
3,spotify:artist:1Xyo4u8uXC1ZmMpatF05PJ,The Weeknd,43517807,https://i.scdn.co/image/ab676161000051742f71b6...,"['canadian contemporary r&b', 'canadian pop', ...",97
4,spotify:artist:66CXWjxzNUsdJxJ2JdwvnR,Ariana Grande,78088925,https://i.scdn.co/image/ab67616100005174cdce76...,"['dance pop', 'pop']",93


The following function retrieves the unique genres from each row in the data frame and appends them to the list.

In [448]:
genres = []
def get_all_genres(row):
    for i in row:
        if i not in genres:
            genres.append(i)
        # if i not in artists_genres[artist]:
        #     artists_genres[artist].append(i)

Before applying the function on the data frame, note that each array is saved as a string inside the data frame. The literal_eval function, which accepts strings of Python literals and evaluates them into the corresponding objects, is used to turn each string into a usable list.

In [449]:
# the array for the genres is saved as a string so it is important to turn it into an array before applying the function
from ast import literal_eval
artist_df['genres'] = artist_df['genres'].apply(literal_eval)
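
A minimal sketch of what `literal_eval` does to a single cell, using an invented value in the same shape as the genres column:

```python
from ast import literal_eval

raw = "['pop', 'uk pop']"   # how a list looks after a CSV round trip
parsed = literal_eval(raw)

print(type(raw).__name__, type(parsed).__name__)
print(parsed[1])
```

Unlike `eval`, `literal_eval` only accepts Python literals (strings, numbers, lists, dicts and similar), so it does not execute arbitrary code found in the file.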

The genres are retrieved here using a lambda function

In [450]:
artist_df.apply(lambda x: get_all_genres(x.genres), axis=1)
print(genres)

['canadian hip hop', 'canadian pop', 'hip hop', 'rap', 'toronto rap', 'pop', 'uk pop', 'latin', 'reggaeton', 'trap latino', 'canadian contemporary r&b', 'dance pop', 'detroit hip hop', 'dfw rap', 'melodic rap', 'k-pop', 'k-pop boy group', 'reggaeton colombiano', 'chicago rap', 'art pop', 'electropop', 'permanent wave', 'emo rap', 'miami hip hop', 'puerto rican pop', 'pop r&b', 'modern rock', 'rock', 'slap house', 'barbadian pop', 'pop rap', 'urban contemporary', 'pop rock', 'viral pop', 'big room', 'edm', 'pop dance', 'electro house', 'house', 'progressive house', 'uk dance', 'latin hip hop', 'classic rock', 'glam rock', 'conscious hip hop', 'west coast rap', 'tropical house', 'boy band', 'post-teen pop', 'talent show', 'r&b', 'atl hip hop', 'southern hip hop', 'trap', 'hip pop', 'queens hip hop', 'north carolina hip hop', 'etherpop', 'indie poptimism', 'reggaeton flow', 'trap boricua', 'british soul', 'pop soul', 'beatlesque', 'british invasion', 'merseybeat', 'psychedelic rock', 'aus
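
As an alternative to the global list and the `not in` check, the same order-preserving collection of unique genres can be sketched with `dict.fromkeys`. The `rows` below are invented examples in the shape of the parsed genres column; this is an illustration of the idea, not a replacement that was used in the notebook.

```python
# Invented per-artist genre lists, already parsed into Python lists.
rows = [
    ["canadian hip hop", "canadian pop", "hip hop"],
    ["pop", "uk pop"],
    ["canadian pop", "pop"],
]

# dict keys are unique and keep insertion order, so this collects each genre
# once, in order of first appearance, without a linear `in` scan per item.
unique_genres = list(dict.fromkeys(g for row in rows for g in row))

print(unique_genres)
```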

Since there are several subgenres for each genre (for example, 'canadian pop' and 'soft pop' would both fall under pop), it is important to first isolate the one word genres, which can work as standalone genres. This is done by checking that the string contains no spaces and appending it to a new list.

In [451]:
one_word_genres = []
for i in genres:
    if ' ' not in i:
        one_word_genres.append(i)
        
# for i in genres:
#     word = i.split(' ')
#     count = 0
#     for j in word:
#         if j in one_word_genres:
#             count +=1
#     if count == len(word):
#         one_word_genres.append(i)

In [452]:
print(one_word_genres)

['rap', 'pop', 'latin', 'reggaeton', 'k-pop', 'electropop', 'rock', 'edm', 'house', 'r&b', 'trap', 'etherpop', 'beatlesque', 'merseybeat', 'brostep', 'post-grunge', 'soul', 'moombahton', 'indietronica', 'metropopolis', 'ninja', 'emo', 'metal', 'singer-songwriter', 'complextro', 'reggae', 'trance', 'punk', 'bachata', 'bolero', 'scandipop', 'lounge', 'grunge', 'arrocha', 'sertanejo', 'banda', 'norteno', 'electro', 'country', 'melancholia', 'plugg', 'pluggnb', 'rock-and-roll', 'rockabilly', 'pixie', 'mariachi', 'ranchera', 'neo-psychedelic', 'britpop', 'madchester', 'folk-pop', 'dancehall', 'europop', 'hollywood', 'industrial', 'metalcore', 'funk', 'motown', 'soundtrack', 'downtempo', 'forro', 'neo-singer-songwriter', 'tropical', 'cantautor', 'salsa', 'filmi', 'folk', 'disco', 'drill', 'ccm', 'worship', 'hoerspiel', 'nwobhm', 'champeta', 'vallenato', 'piseiro', 'grupera', 'sierreno', 'synthpop', 'proto-metal', 'nu-cumbia', 'grime', 'dembow', 'basshall', 'francoton', 'chillwave', 'neo-clas

The one word strings are removed from the genres list, and any genres containing a one word string as a substring are also removed (the loop for this second step is shown commented out in the cell below). This leaves a small number of genres, which are sorted manually into a new list called new_genres. That list is then added to one_word_genres.

In [453]:
# for i in one_word_genres:
#     for j in genres:
#         if i in j:
#             genres.remove(j)

for i in one_word_genres:
    if i in genres:
        genres.remove(i)

new_genres =  ['hip hop', 'permanent wave', 'contemporary r&b', 'contemporary', 'big room', 'dance', 'boy band', 'talent show', 'british invasion', 'indie', 
'thrash', 'neo mellow', 'girl group', 'german techno', 'mellow gold', 'german dance', 'adult standards', 'musica mexicana', 'alternative', 
'easy listening', 'stomp and holler', 'americana', 'psych', 'alt z', 'glee club', 'neue deutsche harte', 'quiet storm', 'show tunes', 
'escape room', 'french hip hop', 'mexican hip hop', 'a cappella', 'modern bollywood', 'acoustic cover', 'dream smp', 'spanish hip hop', 
'urbano espanol', 'christian music', 'melodic dubstep', 'new wave', 'eau claire indie', 'hardcore', 'ska argentino', 'vocal jazz', 
'contemporary vocal jazz', 'cancion melodica', 'athens indie', 'electric blues', 'compositional ambient', 'italian hip hop', 'middle earth', 
'jazz blues', 'ska mexicano', 'canzone napoletana', 'italian tenor', 'lo-fi indie', 'modern blues', 'video game music', 'harlem renaissance', 
'jazz trumpet', 'new orleans jazz', 'brooklyn indie', 'rock nacional brasileiro', 'palm desert scene', 'lo-fi cover', 'lo-fi product', 
'colombian hip hop', 'turkish hip hop', 'el paso indie', 'norwegian indie', 'lo-fi chill', "women's music", 'white noise', 'new french touch', 
'veracruz indie', 'batidao romantico', 'zhongguo feng', 'jam band', 'nashville sound', 'roots', 'bases de freestyle', 'techno', 'early music', 
'boom bap espanol', 'venezuelan hip hop', 'italian underground hip hop', 'visual kei', 'lo-fi', 'hardstyle', 'k-pop boy group', 'reggaeton colombiano', 
'latin hip hop', 'trap latino', 'rap latina']

one_word_genres.extend(new_genres)

print(one_word_genres)

['rap', 'pop', 'latin', 'reggaeton', 'k-pop', 'electropop', 'rock', 'edm', 'house', 'r&b', 'trap', 'etherpop', 'beatlesque', 'merseybeat', 'brostep', 'post-grunge', 'soul', 'moombahton', 'indietronica', 'metropopolis', 'ninja', 'emo', 'metal', 'singer-songwriter', 'complextro', 'reggae', 'trance', 'punk', 'bachata', 'bolero', 'scandipop', 'lounge', 'grunge', 'arrocha', 'sertanejo', 'banda', 'norteno', 'electro', 'country', 'melancholia', 'plugg', 'pluggnb', 'rock-and-roll', 'rockabilly', 'pixie', 'mariachi', 'ranchera', 'neo-psychedelic', 'britpop', 'madchester', 'folk-pop', 'dancehall', 'europop', 'hollywood', 'industrial', 'metalcore', 'funk', 'motown', 'soundtrack', 'downtempo', 'forro', 'neo-singer-songwriter', 'tropical', 'cantautor', 'salsa', 'filmi', 'folk', 'disco', 'drill', 'ccm', 'worship', 'hoerspiel', 'nwobhm', 'champeta', 'vallenato', 'piseiro', 'grupera', 'sierreno', 'synthpop', 'proto-metal', 'nu-cumbia', 'grime', 'dembow', 'basshall', 'francoton', 'chillwave', 'neo-clas

The check is then rerun to bring back the multi-word genres that are made up entirely of entries in the one_word_genres list defined earlier, and the list is rechecked to ensure no genres are lost.

In [454]:
for i in genres:
    word = i.split(' ')
    count = 0
    for j in word:
        if j in one_word_genres:
            count +=1
    if count == len(word):
        one_word_genres.append(i)

print(one_word_genres)

['rap', 'pop', 'latin', 'reggaeton', 'k-pop', 'electropop', 'rock', 'edm', 'house', 'r&b', 'trap', 'etherpop', 'beatlesque', 'merseybeat', 'brostep', 'post-grunge', 'soul', 'moombahton', 'indietronica', 'metropopolis', 'ninja', 'emo', 'metal', 'singer-songwriter', 'complextro', 'reggae', 'trance', 'punk', 'bachata', 'bolero', 'scandipop', 'lounge', 'grunge', 'arrocha', 'sertanejo', 'banda', 'norteno', 'electro', 'country', 'melancholia', 'plugg', 'pluggnb', 'rock-and-roll', 'rockabilly', 'pixie', 'mariachi', 'ranchera', 'neo-psychedelic', 'britpop', 'madchester', 'folk-pop', 'dancehall', 'europop', 'hollywood', 'industrial', 'metalcore', 'funk', 'motown', 'soundtrack', 'downtempo', 'forro', 'neo-singer-songwriter', 'tropical', 'cantautor', 'salsa', 'filmi', 'folk', 'disco', 'drill', 'ccm', 'worship', 'hoerspiel', 'nwobhm', 'champeta', 'vallenato', 'piseiro', 'grupera', 'sierreno', 'synthpop', 'proto-metal', 'nu-cumbia', 'grime', 'dembow', 'basshall', 'francoton', 'chillwave', 'neo-clas
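
The recombination rule (keep a multi-word genre only if every one of its words is already a known genre) can be sketched on a small invented sample. The names `single`, `candidates`, and `recombined` are hypothetical. The sketch builds a new list rather than appending to the list being searched; for this rule the result should be the same, because a word produced by `split(' ')` can never equal a multi-word entry.

```python
# Invented sample of standalone genres and multi-word candidates.
single = ["rap", "pop", "latin", "trap", "house", "electro"]
candidates = ["pop rap", "electro house", "canadian pop", "trap latino"]

# Keep a candidate only when each of its space-separated words is a known genre.
recombined = [
    g for g in candidates
    if all(word in single for word in g.split(" "))
]

print(recombined)
```

Note that the word match is exact: 'trap latino' is not kept here because 'latino' is not in the list even though 'latin' is, which is consistent with such genres having to be added manually.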

In order to obtain the main genres for each artist from the list defined, it is important to consider that the artist genres contain extra words that do not appear anywhere in the list. The first possible solution is to loop through each artist genre in the data frame, loop through the one_word_genres list, and append whenever a match is found. However, since some genres combine two of the main categories, this would add the separate categories as well as the compound category (for example, if an artist has 'canadian contemporary r&b', then the main genres would end up being 'contemporary', 'r&b', and 'contemporary r&b'). This would distort the analysis, since the artist would end up with 3 genres instead of one, which would change the total number of songs counted for each genre.

Therefore, the solution implemented finds all the possible matches for an artist genre in the one word list defined above and collects them in a helper list before anything is appended to the main genres list. This avoids adding both the single and the compound genres for a compound genre. The longest match is found with the max function, which iterates through the helper list and returns the genre with the longest string, in other words the best matched genre.

In [455]:
def simplify_genre(genres):
    artist_genre = []
    for i in genres:
        if i in one_word_genres and i not in artist_genre:
            artist_genre.append(i)
        else: 
            helper_list = []
            for j in one_word_genres:
                if j in i and j not in artist_genre:
                    helper_list.append(j)
            if (len(helper_list) > 0):
                res = max(helper_list, key=len)
                artist_genre.append(res)
    return artist_genre
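
To trace the longest-match rule by hand, the sketch below repeats the same logic with the vocabulary passed in as a parameter. The function name `simplify_genre_with` and the short `vocab` list are hypothetical stand-ins for the full one_word_genres list.

```python
def simplify_genre_with(genres, vocabulary):
    """Same logic as simplify_genre, with the vocabulary passed in explicitly."""
    artist_genre = []
    for raw in genres:
        if raw in vocabulary and raw not in artist_genre:
            artist_genre.append(raw)
        else:
            # Vocabulary entries found inside the raw genre, minus those already chosen.
            candidates = [v for v in vocabulary if v in raw and v not in artist_genre]
            if candidates:
                artist_genre.append(max(candidates, key=len))
    return artist_genre

vocab = ["pop", "r&b", "contemporary", "contemporary r&b", "electro", "house", "electro house"]

# 'canadian contemporary r&b' matches three entries; only the longest is kept.
weeknd = simplify_genre_with(["canadian contemporary r&b", "canadian pop", "pop"], vocab)
print(weeknd)    # ['contemporary r&b', 'pop']

# The best match is already chosen, so the next-longest candidate is added instead.
fallback = simplify_genre_with(["electro house", "progressive electro house"], vocab)
print(fallback)  # ['electro house', 'electro']
```

The second call shows a side effect of excluding already chosen genres before taking the longest match: when the best match is already in the list, a shorter match is added instead. Matching is also by substring, so a short entry such as 'rap' is found inside 'trap' as well; the longest-match rule resolves this only when the longer entry is in the vocabulary.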

The function is then applied on the genres column and a new column is created with the simplified genres.

In [456]:
main_genres = artist_df.apply(lambda x: simplify_genre(x.genres), axis= 1)
artist_df.insert(4, 'artist_main_genres', main_genres)
artist_df

Unnamed: 0,uri,artist_name,artist_total_followers,artist_image,artist_main_genres,genres,popularity
0,spotify:artist:3TVXtAsR1Inumwj472S9r4,Drake,62430349,https://i.scdn.co/image/ab676161000051749e46a7...,"[hip hop, pop, rap]","[canadian hip hop, canadian pop, hip hop, rap,...",98
1,spotify:artist:6eUKZXaKkcviH0Ku9w2n3V,Ed Sheeran,95198498,https://i.scdn.co/image/ab6761610000517412a2ef...,[pop],"[pop, uk pop]",96
2,spotify:artist:4q3ewBCX7sLwd24euuV69X,Bad Bunny,46313053,https://i.scdn.co/image/ab676161000051746ad57a...,"[latin, reggaeton, trap latino]","[latin, reggaeton, trap latino]",100
3,spotify:artist:1Xyo4u8uXC1ZmMpatF05PJ,The Weeknd,43517807,https://i.scdn.co/image/ab676161000051742f71b6...,"[contemporary r&b, pop]","[canadian contemporary r&b, canadian pop, pop]",97
4,spotify:artist:66CXWjxzNUsdJxJ2JdwvnR,Ariana Grande,78088925,https://i.scdn.co/image/ab67616100005174cdce76...,"[dance pop, pop]","[dance pop, pop]",93
...,...,...,...,...,...,...,...
994,spotify:artist:7FY5V3XMwlNBPitEjXowHQ,Darius Rucker,2161236,https://i.scdn.co/image/ab676161000051748e5582...,"[americana, contemporary country, country]","[black americana, contemporary country, countr...",70
995,spotify:artist:6QtgPSJPSzcnn7dPZ4VINp,King Von,1853606,https://i.scdn.co/image/ab676161000051745c0b21...,[rap],[chicago rap],82
996,spotify:artist:66W9LaWS0DPdL7Sz8iYGYe,JP Saxe,339184,https://i.scdn.co/image/ab67616100005174e1963b...,"[alt z, contemporary r&b, pop]","[alt z, canadian contemporary r&b, pop]",73
997,spotify:artist:3gk0OYeLFWYupGFRHqLSR7,Showtek,452048,https://i.scdn.co/image/ab676161000051746ac094...,"[hardstyle, edm, electro house, pop dance, ele...","[classic hardstyle, edm, electro house, euphor...",67


In [457]:
import os
output_path="ArtistWithGenres.csv"
artist_df.to_csv(output_path, mode='a', header=not os.path.exists(output_path), index=False)
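
One way the simplified genres could later be carried over to the tracks is a lookup by artist name. The sketch below is only an assumption about that step: it uses invented rows, and matching `album_artist` against `artist_name` is a hypothetical choice, since the join key is not specified in this section.

```python
import pandas as pd

# Invented rows in the shape of the artist and track frames.
artists = pd.DataFrame({
    "artist_name": ["Drake", "Ed Sheeran"],
    "artist_main_genres": [["hip hop", "pop", "rap"], ["pop"]],
})
tracks = pd.DataFrame({
    "track_name": ["Champagne Poetry", "Unknown Song"],
    "album_artist": ["Drake", "Someone Else"],
})

# Series indexed by artist name; map() looks each album_artist up in that index.
genre_by_artist = artists.set_index("artist_name")["artist_main_genres"]
tracks["album_genres"] = tracks["album_artist"].map(genre_by_artist)

print(tracks)
```

Tracks whose artist is not present in the artist frame are left as NaN rather than dropped, so they can be inspected separately.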