Introduction to the Spotify Musical Popularity Prediction System.

Business Understanding:

Many music-related businesses and services, including record labels (Rough 
Trade, EMI, Matador, etc.), music review publications (Pitchfork, Stereogum, 
Consequence of Sound, etc.), and radio stations (KEXP, NPR, The Current, KCRW, 
etc.), rely on music curation and playlist creation to establish their own 
presence, or their artists' presence, in their respective music scenes. A 
music-related company that delivers popular playlists with songs that appeal to 
a broad range of listeners is more likely to succeed than one that does not. 
I want to find out whether it is possible to build a music recommendation system 
that helps these businesses choose the best songs, sounds, and genres for their 
playlists, so that publications, radio stations, and record labels and their 
artists achieve maximum visibility and exposure.

In [1]:
#Imports (will import necessary packages as the project progresses)
import pandas as pd


Data Importing and Basic Analysis

In [2]:
#Importing the Spotify dataset from Kaggle, which contains over 230,000 songs along with their attributes.

df = pd.read_csv('SpotifyFeatures.csv')
df.head()



Unnamed: 0,genre,artist_name,track_name,track_id,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
0,Movie,Henri Salvador,C'est beau de faire un Show,0BRjO6ga9RKCKjfDqeFgWV,0,0.611,0.389,99373,0.91,0.0,C#,0.346,-1.828,Major,0.0525,166.969,4/4,0.814
1,Movie,Martin & les fées,Perdu d'avance (par Gad Elmaleh),0BjC1NfoEOOusryehmNudP,1,0.246,0.59,137373,0.737,0.0,F#,0.151,-5.559,Minor,0.0868,174.003,4/4,0.816
2,Movie,Joseph Williams,Don't Let Me Be Lonely Tonight,0CoSDzoNIKCRs124s9uTVy,3,0.952,0.663,170267,0.131,0.0,C,0.103,-13.879,Minor,0.0362,99.488,5/4,0.368
3,Movie,Henri Salvador,Dis-moi Monsieur Gordon Cooper,0Gc6TVm52BwZD07Ki6tIvf,0,0.703,0.24,152427,0.326,0.0,C#,0.0985,-12.178,Major,0.0395,171.758,4/4,0.227
4,Movie,Fabien Nataf,Ouverture,0IuslXpMROHdEPvSl1fTQK,4,0.95,0.331,82625,0.225,0.123,F,0.202,-21.15,Major,0.0456,140.576,4/4,0.39


In [3]:
#Viewing summary statistics for each numeric column in the dataset
df.describe()



Unnamed: 0,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence
count,232725.0,232725.0,232725.0,232725.0,232725.0,232725.0,232725.0,232725.0,232725.0,232725.0,232725.0
mean,41.127502,0.36856,0.554364,235122.3,0.570958,0.148301,0.215009,-9.569885,0.120765,117.666585,0.454917
std,18.189948,0.354768,0.185608,118935.9,0.263456,0.302768,0.198273,5.998204,0.185518,30.898907,0.260065
min,0.0,0.0,0.0569,15387.0,2e-05,0.0,0.00967,-52.457,0.0222,30.379,0.0
25%,29.0,0.0376,0.435,182857.0,0.385,0.0,0.0974,-11.771,0.0367,92.959,0.237
50%,43.0,0.232,0.571,220427.0,0.605,4.4e-05,0.128,-7.762,0.0501,115.778,0.444
75%,55.0,0.722,0.692,265768.0,0.787,0.0358,0.264,-5.501,0.105,139.054,0.66
max,100.0,0.996,0.989,5552917.0,0.999,0.999,1.0,3.744,0.967,242.903,1.0


In [4]:
#Viewing each column's data type (categorical vs. numerical) and non-null count
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 232725 entries, 0 to 232724
Data columns (total 18 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   genre             232725 non-null  object 
 1   artist_name       232725 non-null  object 
 2   track_name        232725 non-null  object 
 3   track_id          232725 non-null  object 
 4   popularity        232725 non-null  int64  
 5   acousticness      232725 non-null  float64
 6   danceability      232725 non-null  float64
 7   duration_ms       232725 non-null  int64  
 8   energy            232725 non-null  float64
 9   instrumentalness  232725 non-null  float64
 10  key               232725 non-null  object 
 11  liveness          232725 non-null  float64
 12  loudness          232725 non-null  float64
 13  mode              232725 non-null  object 
 14  speechiness       232725 non-null  float64
 15  tempo             232725 non-null  float64
 16  time_signature    23

In [5]:
#Viewing the various values contained within each of the columns
for col in df.columns:
    print(f"Column: {col}")
    print(df[col].value_counts())
    print("--------------------")



Column: genre
Comedy              9681
Soundtrack          9646
Indie               9543
Jazz                9441
Pop                 9386
Electronic          9377
Children’s Music    9353
Folk                9299
Hip-Hop             9295
Rock                9272
Alternative         9263
Classical           9256
Rap                 9232
World               9096
Soul                9089
Blues               9023
R&B                 8992
Anime               8936
Reggaeton           8927
Ska                 8874
Reggae              8771
Dance               8701
Country             8664
Opera               8280
Movie               7806
Children's Music    5403
A Capella            119
Name: genre, dtype: int64
--------------------
Column: artist_name
Giuseppe Verdi            1394
Giacomo Puccini           1137
Kimbo Children's Music     971
Nobuo Uematsu              825
Richard Wagner             804
                          ... 
Tom O'Connor                 1
Yung Fume                  

In [6]:
#Above, I found two issues that need to be considered during analysis and modeling.
#The dataset contains duplicated information: the genre "Children's Music" appears twice under two spellings, and the track_id column contains duplicated values.
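
A minimal sketch of how near-duplicate labels like the two "Children's Music" spellings could be flagged automatically rather than by eye. The toy series and the helper `normalize_label` are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for df['genre']; values chosen only to illustrate the idea.
genres = pd.Series(
    ["Children’s Music", "Children's Music", "Comedy", "Children’s Music"]
)

def normalize_label(label: str) -> str:
    """Hypothetical helper: fold typographic apostrophes, whitespace, and case."""
    return label.replace("\u2019", "'").strip().casefold()

normalized = genres.map(normalize_label)

# For each normalized label, collect the distinct raw spellings that map to it
variants = genres.groupby(normalized).unique()

# Keep only normalized labels that have more than one raw spelling
collisions = variants[variants.map(len) > 1]
print(collisions)
```

Any label that shows up in `collisions` has more than one raw spelling and is a candidate for merging.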

DATA CLEANING: SCRUBBING

In [7]:
#The first bit of cleaning addresses the duplicated "Children's Music" genre, as seen in the code and output below:
df['genre'].value_counts()


Comedy              9681
Soundtrack          9646
Indie               9543
Jazz                9441
Pop                 9386
Electronic          9377
Children’s Music    9353
Folk                9299
Hip-Hop             9295
Rock                9272
Alternative         9263
Classical           9256
Rap                 9232
World               9096
Soul                9089
Blues               9023
R&B                 8992
Anime               8936
Reggaeton           8927
Ska                 8874
Reggae              8771
Dance               8701
Country             8664
Opera               8280
Movie               7806
Children's Music    5403
A Capella            119
Name: genre, dtype: int64

In [8]:
#Since there are two genres called "Children's Music" (one spelled with a typographic apostrophe, one with a straight apostrophe), I will merge them so that the data is consistent
df.loc[df['genre']=="Children’s Music",'genre']="Children's Music"
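
The same merge can also be written as a string replacement on the apostrophe character itself, which would catch any other label using the typographic form. This is a sketch on a toy frame (`df_demo` is a made-up name), not a change to the cell above.

```python
import pandas as pd

# Toy frame standing in for the real data
df_demo = pd.DataFrame(
    {"genre": ["Children’s Music", "Children's Music", "Comedy"]}
)

# Replace the typographic apostrophe (U+2019) with a straight one, literally
df_demo["genre"] = df_demo["genre"].str.replace("\u2019", "'", regex=False)

print(df_demo["genre"].value_counts())
```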

In [9]:
#Double Checking to make sure the merge has successfully occurred
df['genre'].value_counts()

Children's Music    14756
Comedy               9681
Soundtrack           9646
Indie                9543
Jazz                 9441
Pop                  9386
Electronic           9377
Folk                 9299
Hip-Hop              9295
Rock                 9272
Alternative          9263
Classical            9256
Rap                  9232
World                9096
Soul                 9089
Blues                9023
R&B                  8992
Anime                8936
Reggaeton            8927
Ska                  8874
Reggae               8771
Dance                8701
Country              8664
Opera                8280
Movie                7806
A Capella             119
Name: genre, dtype: int64

In [10]:
#Now that I have merged the duplicative genres together, I am going to check for Missing Values
df.isna().sum()

genre               0
artist_name         0
track_name          0
track_id            0
popularity          0
acousticness        0
danceability        0
duration_ms         0
energy              0
instrumentalness    0
key                 0
liveness            0
loudness            0
mode                0
speechiness         0
tempo               0
time_signature      0
valence             0
dtype: int64

In [11]:
#Since there are no missing values, I will move on to the other issue: duplicated tracks.
df[df['track_id'].duplicated()]

Unnamed: 0,genre,artist_name,track_name,track_id,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
1348,Alternative,Doja Cat,Go To Town,6iOvnACn4ChlAw4lWUU4dd,64,0.07160,0.710,217813,0.710,0.000001,C,0.2060,-2.474,Major,0.0579,169.944,4/4,0.700
1385,Alternative,Frank Ocean,Seigfried,1BViPjTT585XAhkUUrkts0,61,0.97500,0.377,334570,0.255,0.000208,E,0.1020,-11.165,Minor,0.0387,125.004,5/4,0.370
1452,Alternative,Frank Ocean,Bad Religion,2pMPWE7PJH1PizfgGRMnR9,56,0.77900,0.276,175453,0.358,0.000003,A,0.0728,-7.684,Major,0.0443,81.977,4/4,0.130
1554,Alternative,Steve Lacy,Some,4riDfclV7kPDT9D58FpmHd,58,0.00548,0.784,118393,0.554,0.254000,G,0.0995,-6.417,Major,0.0300,104.010,4/4,0.634
1634,Alternative,tobi lou,Buff Baby,1F1QmI8TMHir9SUFrooq5F,59,0.19000,0.736,215385,0.643,0.000000,F,0.1060,-8.636,Major,0.0461,156.002,4/4,0.599
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
232715,Soul,Emily King,Down,5cA0vB8c9FMOVDWyJHgf26,42,0.55000,0.394,281853,0.346,0.000002,E,0.1290,-13.617,Major,0.0635,90.831,4/4,0.436
232718,Soul,Muddy Waters,I Just Want To Make Love To You - Electric Mud...,2HFczeynfKGiM9KF2z2K7K,43,0.01360,0.294,258267,0.739,0.004820,C,0.1380,-7.167,Major,0.0434,176.402,4/4,0.945
232720,Soul,Slave,Son Of Slide,2XGLdVl7lGeq8ksM6Al7jT,39,0.00384,0.687,326240,0.714,0.544000,D,0.0845,-10.626,Major,0.0316,115.542,4/4,0.962
232722,Soul,Muddy Waters,(I'm Your) Hoochie Coochie Man,2ziWXUmQLrXTiYjCg2fZ2t,47,0.90100,0.517,166960,0.419,0.000000,D,0.0945,-8.282,Major,0.1480,84.135,4/4,0.813
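
Note that `duplicated()` with its default `keep='first'` flags only the second and later occurrences of each id, so the first row of every repeated track is not in the table above. A small sketch on toy data (ids and genres are made up) showing how the different counts relate:

```python
import pandas as pd

toy = pd.DataFrame({
    "track_id": ["a", "b", "a", "c", "a", "b"],
    "genre": ["R&B", "Jazz", "Soul", "Pop", "Hip-Hop", "Soul"],
})

# Rows flagged by the default duplicated(): every occurrence after the first
extra_rows = toy["track_id"].duplicated().sum()

# Rows that belong to any repeated id, first occurrences included
rows_in_groups = toy["track_id"].duplicated(keep=False).sum()

# Number of distinct tracks that would remain after de-duplication
unique_tracks = toy["track_id"].nunique()

print(extra_rows, rows_in_groups, unique_tracks)  # 3 5 3
```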


In [12]:
#As seen above, there are 55,951 duplicated rows of data. Before fixing this, I need to look at the cause of the duplication.
#I start by looking at all rows that share a duplicated id to see how they differ. I will do this for a couple of tracks to check that the pattern is consistent.
df[df['track_id']=='2pMPWE7PJH1PizfgGRMnR9'] 


Unnamed: 0,genre,artist_name,track_name,track_id,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
277,R&B,Frank Ocean,Bad Religion,2pMPWE7PJH1PizfgGRMnR9,61,0.779,0.276,175453,0.358,3e-06,A,0.0728,-7.684,Major,0.0443,81.977,4/4,0.13
1452,Alternative,Frank Ocean,Bad Religion,2pMPWE7PJH1PizfgGRMnR9,56,0.779,0.276,175453,0.358,3e-06,A,0.0728,-7.684,Major,0.0443,81.977,4/4,0.13
68864,Hip-Hop,Frank Ocean,Bad Religion,2pMPWE7PJH1PizfgGRMnR9,61,0.779,0.276,175453,0.358,3e-06,A,0.0728,-7.684,Major,0.0443,81.977,4/4,0.13
77707,Children's Music,Frank Ocean,Bad Religion,2pMPWE7PJH1PizfgGRMnR9,61,0.779,0.276,175453,0.358,3e-06,A,0.0728,-7.684,Major,0.0443,81.977,4/4,0.13
192203,Soul,Frank Ocean,Bad Religion,2pMPWE7PJH1PizfgGRMnR9,61,0.779,0.276,175453,0.358,3e-06,A,0.0728,-7.684,Major,0.0443,81.977,4/4,0.13


In [13]:
df[df['track_id']=='2XGLdVl7lGeq8ksM6Al7jT']

Unnamed: 0,genre,artist_name,track_name,track_id,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
179212,Jazz,Slave,Son Of Slide,2XGLdVl7lGeq8ksM6Al7jT,39,0.00384,0.687,326240,0.714,0.544,D,0.0845,-10.626,Major,0.0316,115.542,4/4,0.962
232720,Soul,Slave,Son Of Slide,2XGLdVl7lGeq8ksM6Al7jT,39,0.00384,0.687,326240,0.714,0.544,D,0.0845,-10.626,Major,0.0316,115.542,4/4,0.962


In [14]:
#As seen in the two code blocks above, the attributes of rows sharing a track_id are identical except for 'genre' and, in some cases, 'popularity'.
#To handle this, I will create one column per genre name, with binary values indicating whether a song belongs to that genre (an approach I found while researching the problem).
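
Rather than inspecting individual ids, the same observation can be checked across all tracks at once by counting distinct values per column within each `track_id`. This is a sketch on toy data; the values loosely mirror the two tracks above, and the third row is invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "track_id": ["a", "a", "b", "b", "c"],
    "genre": ["R&B", "Alternative", "Jazz", "Soul", "Pop"],
    "popularity": [61, 56, 39, 39, 70],
    "tempo": [81.977, 81.977, 115.542, 115.542, 100.0],
})

# Number of distinct values in each column, per track_id
distinct_per_track = toy.groupby("track_id").nunique()

# Columns where at least one track has more than one distinct value
varying = distinct_per_track.columns[(distinct_per_track > 1).any()].tolist()
print(varying)  # ['genre', 'popularity']
```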


In [15]:
#Step one in this process is creating a list with the genre names
genre_list = list(df['genre'].unique())

In [16]:
#Next is creating the genre columns using this new genre list
for genre in genre_list:
    df[genre] = (df['genre']==genre).astype('int')
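
For comparison, pandas can build the same kind of binary genre columns in one call with `pd.get_dummies`. A sketch on a toy frame, offered as an alternative rather than what this notebook runs; note that the resulting columns come out in sorted order rather than in order of first appearance.

```python
import pandas as pd

toy = pd.DataFrame({
    "track_id": ["a", "a", "b"],
    "genre": ["R&B", "Soul", "Jazz"],
})

# One 0/1 column per distinct genre value
flags = pd.get_dummies(toy["genre"], dtype=int)

# Attach the flag columns next to the original columns
toy = pd.concat([toy, flags], axis=1)
print(toy)
```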

In [18]:
#Next is grouping the rows by track_id to remove the duplicates. Taking the maximum of each column keeps a 1 in every genre column a track belongs to and keeps the track's highest popularity value.
df=df.groupby(['track_id']).max()
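
Calling `.max()` on every column also computes a maximum over the text columns (`artist_name`, `track_name`, `key`, and so on), which are identical within a track anyway. A sketch of a more explicit aggregation on toy data: take the first value where rows agree and the maximum only for popularity and the genre flags. The column subset and the name `agg_map` are assumptions for illustration; whether this is faster on the full dataset has not been measured here.

```python
import pandas as pd

toy = pd.DataFrame({
    "track_id": ["a", "a", "b"],
    "artist_name": ["Frank Ocean", "Frank Ocean", "Slave"],
    "popularity": [61, 56, 39],
    "tempo": [81.977, 81.977, 115.542],
    "R&B": [1, 0, 0],
    "Alternative": [0, 1, 0],
    "Jazz": [0, 0, 1],
})
genre_cols = ["R&B", "Alternative", "Jazz"]

# 'first' for columns that are identical across duplicates
agg_map = {col: "first" for col in ["artist_name", "tempo"]}
# 'max' keeps the highest popularity and any genre flag set to 1
agg_map["popularity"] = "max"
agg_map.update({col: "max" for col in genre_cols})

deduped = toy.groupby("track_id", sort=False).agg(agg_map)
print(deduped)
```

Track `a` ends up with a single row, popularity 61, and both the `R&B` and `Alternative` flags set.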


In [19]:
#removing redundant genre column
df.drop('genre', axis=1, inplace=True)
df.head()

track_id,artist_name,track_name,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,...,Pop,Reggae,Reggaeton,Jazz,Rock,Ska,Comedy,Soul,Soundtrack,World
00021Wy6AyMbLP2tqij86e,Capcom Sound Team,Zangief's Theme,13,0.234,0.617,169173,0.862,0.976,G,0.141,...,0,0,0,0,0,0,0,0,0,0
000CzNKC8PEt1yC3L8dqwV,Henri Salvador,Coeur Brisé à Prendre - Remastered,5,0.249,0.518,130653,0.805,0.0,F,0.333,...,0,0,0,0,0,0,0,0,0,0
000DfZJww8KiixTKuk9usJ,Mike Love,Earthlings,30,0.366,0.631,357573,0.513,4e-06,D,0.109,...,0,1,0,0,0,0,0,0,0,0
000EWWBkYaREzsBplYjUag,Don Philippe,Fewerdolr,39,0.815,0.768,104924,0.137,0.922,C#,0.113,...,0,0,0,1,0,0,0,0,0,0
000xQL6tZNLJzIrtIgxqSl,ZAYN,Still Got Time,70,0.131,0.748,188491,0.627,0.0,G,0.0852,...,1,0,0,0,0,0,0,0,0,0
