Lab | Clustering the songs from the databases

Now it's time to cluster the songs of the hot_songs and not_hot_songs databases according to the songs' audio features. You will need to consider the following:

Are you going to use all the audio features? If not, which ones do you think make the most sense?
What is the optimal number of clusters (for methods that need to know this beforehand)?
What is the best distance to use?
Which clustering method provides better results?
Does the clustering method need a transformer (for example, a scaler)?
Be aware that this process can be very time-consuming. When testing different options, save each fitted model to disk so that you can reuse the best one later. You don't want to retrain the best model once you already know the optimal parameters for each method.
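As a rough illustration of the save-and-reload advice, here is a minimal sketch using pickle. The data is synthetic, and the folder and the file names (kmeans_3.pickle and so on) are made up for the example; they are assumptions, not part of the lab files.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled audio features (assumption: 200 songs, 9 features).
rng = np.random.default_rng(1234)
X_demo = rng.normal(size=(200, 9))

# Hypothetical output folder; in practice use a folder inside your project.
model_dir = Path(tempfile.mkdtemp())

# Fit one model per candidate setting and save each one right after fitting.
for k in (3, 5, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=1234).fit(X_demo)
    with open(model_dir / f"kmeans_{k}.pickle", "wb") as file:
        pickle.dump(model, file)

# Later: reload the chosen model instead of retraining it.
with open(model_dir / "kmeans_5.pickle", "rb") as file:
    best_model = pickle.load(file)

labels = best_model.predict(X_demo)
print(best_model.n_clusters, labels.shape)
```

Pickled scikit-learn models are generally only safe to reload with the same scikit-learn version that wrote them, so it may help to note the version next to the saved files.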

Add a new column to the hot_songs and not_hot_songs databases for each clustering method, holding each song's cluster membership under that method.
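One possible way to add a membership column per method is sketched below. It uses a small synthetic table, and the column names cluster_kmeans and cluster_agglomerative, as well as the choice of three clusters, are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the combined songs table (assumption: 3 features, 60 songs).
rng = np.random.default_rng(0)
features = ["danceability", "energy", "valence"]
songs_demo = pd.DataFrame(rng.random((60, 3)), columns=features)
songs_demo["original_df"] = ["hot"] * 10 + ["not_hot"] * 50

# Scale the features once and reuse the same matrix for every method.
X_demo = StandardScaler().fit_transform(songs_demo[features])

# One entry per clustering method; the parameters are illustrative only.
methods = {
    "kmeans": KMeans(n_clusters=3, n_init=10, random_state=1234),
    "agglomerative": AgglomerativeClustering(n_clusters=3),
}

# fit_predict returns one label per row, in row order, so it can be assigned directly.
for name, model in methods.items():
    songs_demo[f"cluster_{name}"] = model.fit_predict(X_demo)

# Split back into the two original tables, now carrying the new columns.
hot_demo = songs_demo[songs_demo["original_df"] == "hot"]
not_hot_demo = songs_demo[songs_demo["original_df"] == "not_hot"]
print(hot_demo.filter(like="cluster_").head())
```

Clustering the two tables together and splitting afterwards keeps the cluster labels comparable between hot and not-hot songs.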

In [1]:
# Importing all needed libraries

In [2]:
import pandas as pd
import numpy as np

In [3]:
from sklearn import datasets # sklearn comes with some toy datasets to practise
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_classification
from sklearn.cluster import DBSCAN
from sklearn.mixture import GaussianMixture
from scipy.spatial import distance_matrix
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [4]:
# importing datasets 

In [5]:
top_100 = pd.read_csv('top_100_audio_feat.csv')
top_100.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,title,artist,id,danceability,energy,key,loudness,mode,...,liveness,valence,tempo,type,id.1,uri,track_href,analysis_url,duration_ms,time_signature
0,0,0,As It Was,Harry Styles,4LRPiXqCikLlN15c3yImP7,0.52,0.731,6,-5.338,0,...,0.311,0.662,173.93,audio_features,4LRPiXqCikLlN15c3yImP7,spotify:track:4LRPiXqCikLlN15c3yImP7,https://api.spotify.com/v1/tracks/4LRPiXqCikLl...,https://api.spotify.com/v1/audio-analysis/4LRP...,167303,4
1,1,1,First Class,Jack Harlow,1rDQ4oMwGJI7B4tovsBOxc,0.905,0.563,8,-6.135,1,...,0.113,0.324,106.998,audio_features,1rDQ4oMwGJI7B4tovsBOxc,spotify:track:1rDQ4oMwGJI7B4tovsBOxc,https://api.spotify.com/v1/tracks/1rDQ4oMwGJI7...,https://api.spotify.com/v1/audio-analysis/1rDQ...,173948,4
2,2,2,Wait For U,Future Featuring Drake & Tems,59nOXPmaKlBfGMDeOVGrIK,0.463,0.642,1,-4.474,1,...,0.0686,0.339,83.389,audio_features,59nOXPmaKlBfGMDeOVGrIK,spotify:track:59nOXPmaKlBfGMDeOVGrIK,https://api.spotify.com/v1/tracks/59nOXPmaKlBf...,https://api.spotify.com/v1/audio-analysis/59nO...,189893,4
3,3,3,Late Night Talking,Harry Styles,1qEmFfgcLObUfQm0j1W2CK,0.714,0.728,10,-4.595,1,...,0.106,0.901,114.996,audio_features,1qEmFfgcLObUfQm0j1W2CK,spotify:track:1qEmFfgcLObUfQm0j1W2CK,https://api.spotify.com/v1/tracks/1qEmFfgcLObU...,https://api.spotify.com/v1/audio-analysis/1qEm...,177955,4
4,4,4,About Damn Time,Lizzo,1PckUlxKqWQs3RlWXVBLw3,0.836,0.743,10,-6.305,0,...,0.335,0.722,108.966,audio_features,1PckUlxKqWQs3RlWXVBLw3,spotify:track:1PckUlxKqWQs3RlWXVBLw3,https://api.spotify.com/v1/tracks/1PckUlxKqWQs...,https://api.spotify.com/v1/audio-analysis/1Pck...,191822,4


In [6]:
top_100.isna().sum()

Unnamed: 0          0
Unnamed: 0.1        0
title               0
artist              0
id                  0
danceability        0
energy              0
key                 0
loudness            0
mode                0
speechiness         0
acousticness        0
instrumentalness    0
liveness            0
valence             0
tempo               0
type                0
id.1                0
uri                 0
track_href          0
analysis_url        0
duration_ms         0
time_signature      0
dtype: int64

In [7]:
top_100.shape

(95, 23)

In [8]:
not_hot = pd.read_csv('not_hot_songs_alb.csv')
not_hot.shape

(5000, 23)

In [9]:
not_hot.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,singer,song,id,danceability,energy,key,loudness,mode,...,liveness,valence,tempo,type,id.1,uri,track_href,analysis_url,duration_ms,time_signature
0,0,0,Su ta gar,Abere izate ukatua,4tMbI2qkOIfc1kzsbw7X64,0.393,0.967,10.0,-5.775,0.0,...,0.147,0.469,98.016,audio_features,4tMbI2qkOIfc1kzsbw7X64,spotify:track:4tMbI2qkOIfc1kzsbw7X64,https://api.spotify.com/v1/tracks/4tMbI2qkOIfc...,https://api.spotify.com/v1/audio-analysis/4tMb...,243973.0,4.0
1,1,1,Spice Girls,Viva Forever (Tony Rich Remix Instrumental),1dehm5fgmBVxxur0YnZLMq,0.809,0.478,10.0,-7.371,0.0,...,0.125,0.594,168.897,audio_features,1dehm5fgmBVxxur0YnZLMq,spotify:track:1dehm5fgmBVxxur0YnZLMq,https://api.spotify.com/v1/tracks/1dehm5fgmBVx...,https://api.spotify.com/v1/audio-analysis/1deh...,342733.0,4.0
2,2,2,Dandy Livingstone,Think About That,5vWscUwJJSsIvKR0YJa8cV,0.759,0.654,0.0,-8.462,1.0,...,0.113,0.664,95.544,audio_features,5vWscUwJJSsIvKR0YJa8cV,spotify:track:5vWscUwJJSsIvKR0YJa8cV,https://api.spotify.com/v1/tracks/5vWscUwJJSsI...,https://api.spotify.com/v1/audio-analysis/5vWs...,172947.0,4.0
3,3,3,Orquesta Harlow,That Groovy Shingaling,2kTZKa7J0ykBpmzs5TMjWT,0.454,0.481,10.0,-10.895,1.0,...,0.158,0.757,122.281,audio_features,2kTZKa7J0ykBpmzs5TMjWT,spotify:track:2kTZKa7J0ykBpmzs5TMjWT,https://api.spotify.com/v1/tracks/2kTZKa7J0ykB...,https://api.spotify.com/v1/audio-analysis/2kTZ...,135187.0,4.0
4,4,4,King B.A.V.,Take It Off,77ihANWN9MuLGim3ZtYIti,0.706,0.58,7.0,-3.692,0.0,...,0.256,0.115,75.102,audio_features,77ihANWN9MuLGim3ZtYIti,spotify:track:77ihANWN9MuLGim3ZtYIti,https://api.spotify.com/v1/tracks/77ihANWN9MuL...,https://api.spotify.com/v1/audio-analysis/77ih...,244827.0,4.0


In [10]:
not_hot.isna().sum()

Unnamed: 0             0
Unnamed: 0.1           0
singer                 0
song                   0
id                  1417
danceability        1419
energy              1419
key                 1419
loudness            1419
mode                1419
speechiness         1419
acousticness        1419
instrumentalness    1419
liveness            1419
valence             1419
tempo               1419
type                1419
id.1                1419
uri                 1419
track_href          1419
analysis_url        1419
duration_ms         1419
time_signature      1419
dtype: int64

In [11]:
# Let's drop the rows that contain missing values:

In [12]:
not_hot.dropna(inplace=True)

In [13]:
not_hot.isna().sum()

Unnamed: 0          0
Unnamed: 0.1        0
singer              0
song                0
id                  0
danceability        0
energy              0
key                 0
loudness            0
mode                0
speechiness         0
acousticness        0
instrumentalness    0
liveness            0
valence             0
tempo               0
type                0
id.1                0
uri                 0
track_href          0
analysis_url        0
duration_ms         0
time_signature      0
dtype: int64

In [14]:
not_hot.shape

(3581, 23)

In [15]:
# Now let's align the column names and column order of both dataframes so we can concatenate them:

In [16]:
not_hot.columns

Index(['Unnamed: 0', 'Unnamed: 0.1', 'singer', 'song', 'id', 'danceability',
       'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness',
       'instrumentalness', 'liveness', 'valence', 'tempo', 'type', 'id.1',
       'uri', 'track_href', 'analysis_url', 'duration_ms', 'time_signature'],
      dtype='object')

In [17]:
top_100.columns

Index(['Unnamed: 0', 'Unnamed: 0.1', 'title', 'artist', 'id', 'danceability',
       'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness',
       'instrumentalness', 'liveness', 'valence', 'tempo', 'type', 'id.1',
       'uri', 'track_href', 'analysis_url', 'duration_ms', 'time_signature'],
      dtype='object')

In [18]:
not_hot.rename(columns = {'singer' : 'artist', 'song' : 'title'}, inplace = True)
not_hot.columns

Index(['Unnamed: 0', 'Unnamed: 0.1', 'artist', 'title', 'id', 'danceability',
       'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness',
       'instrumentalness', 'liveness', 'valence', 'tempo', 'type', 'id.1',
       'uri', 'track_href', 'analysis_url', 'duration_ms', 'time_signature'],
      dtype='object')

In [19]:
not_hot = not_hot[['title', 'artist', 'id', 'danceability',
       'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness',
       'instrumentalness', 'liveness', 'valence', 'tempo', 'type', 'id.1',
       'uri', 'track_href', 'analysis_url', 'duration_ms', 'time_signature']]

In [20]:
top_100 = top_100[['title', 'artist', 'id', 'danceability',
       'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness',
       'instrumentalness', 'liveness', 'valence', 'tempo', 'type', 'id.1',
       'uri', 'track_href', 'analysis_url', 'duration_ms', 'time_signature']]

In [21]:
# Let's add a new column recording which dataframe each row came from

In [22]:
not_hot['original_df'] = "not_hot"
top_100['original_df'] = "hot"

In [23]:
display(not_hot.head())
display(top_100.head())

Unnamed: 0,title,artist,id,danceability,energy,key,loudness,mode,speechiness,acousticness,...,valence,tempo,type,id.1,uri,track_href,analysis_url,duration_ms,time_signature,original_df
0,Abere izate ukatua,Su ta gar,4tMbI2qkOIfc1kzsbw7X64,0.393,0.967,10.0,-5.775,0.0,0.0511,1e-05,...,0.469,98.016,audio_features,4tMbI2qkOIfc1kzsbw7X64,spotify:track:4tMbI2qkOIfc1kzsbw7X64,https://api.spotify.com/v1/tracks/4tMbI2qkOIfc...,https://api.spotify.com/v1/audio-analysis/4tMb...,243973.0,4.0,not_hot
1,Viva Forever (Tony Rich Remix Instrumental),Spice Girls,1dehm5fgmBVxxur0YnZLMq,0.809,0.478,10.0,-7.371,0.0,0.187,0.0106,...,0.594,168.897,audio_features,1dehm5fgmBVxxur0YnZLMq,spotify:track:1dehm5fgmBVxxur0YnZLMq,https://api.spotify.com/v1/tracks/1dehm5fgmBVx...,https://api.spotify.com/v1/audio-analysis/1deh...,342733.0,4.0,not_hot
2,Think About That,Dandy Livingstone,5vWscUwJJSsIvKR0YJa8cV,0.759,0.654,0.0,-8.462,1.0,0.0523,0.134,...,0.664,95.544,audio_features,5vWscUwJJSsIvKR0YJa8cV,spotify:track:5vWscUwJJSsIvKR0YJa8cV,https://api.spotify.com/v1/tracks/5vWscUwJJSsI...,https://api.spotify.com/v1/audio-analysis/5vWs...,172947.0,4.0,not_hot
3,That Groovy Shingaling,Orquesta Harlow,2kTZKa7J0ykBpmzs5TMjWT,0.454,0.481,10.0,-10.895,1.0,0.0318,0.708,...,0.757,122.281,audio_features,2kTZKa7J0ykBpmzs5TMjWT,spotify:track:2kTZKa7J0ykBpmzs5TMjWT,https://api.spotify.com/v1/tracks/2kTZKa7J0ykB...,https://api.spotify.com/v1/audio-analysis/2kTZ...,135187.0,4.0,not_hot
4,Take It Off,King B.A.V.,77ihANWN9MuLGim3ZtYIti,0.706,0.58,7.0,-3.692,0.0,0.0423,0.000473,...,0.115,75.102,audio_features,77ihANWN9MuLGim3ZtYIti,spotify:track:77ihANWN9MuLGim3ZtYIti,https://api.spotify.com/v1/tracks/77ihANWN9MuL...,https://api.spotify.com/v1/audio-analysis/77ih...,244827.0,4.0,not_hot


Unnamed: 0,title,artist,id,danceability,energy,key,loudness,mode,speechiness,acousticness,...,valence,tempo,type,id.1,uri,track_href,analysis_url,duration_ms,time_signature,original_df
0,As It Was,Harry Styles,4LRPiXqCikLlN15c3yImP7,0.52,0.731,6,-5.338,0,0.0557,0.342,...,0.662,173.93,audio_features,4LRPiXqCikLlN15c3yImP7,spotify:track:4LRPiXqCikLlN15c3yImP7,https://api.spotify.com/v1/tracks/4LRPiXqCikLl...,https://api.spotify.com/v1/audio-analysis/4LRP...,167303,4,hot
1,First Class,Jack Harlow,1rDQ4oMwGJI7B4tovsBOxc,0.905,0.563,8,-6.135,1,0.102,0.0254,...,0.324,106.998,audio_features,1rDQ4oMwGJI7B4tovsBOxc,spotify:track:1rDQ4oMwGJI7B4tovsBOxc,https://api.spotify.com/v1/tracks/1rDQ4oMwGJI7...,https://api.spotify.com/v1/audio-analysis/1rDQ...,173948,4,hot
2,Wait For U,Future Featuring Drake & Tems,59nOXPmaKlBfGMDeOVGrIK,0.463,0.642,1,-4.474,1,0.34,0.314,...,0.339,83.389,audio_features,59nOXPmaKlBfGMDeOVGrIK,spotify:track:59nOXPmaKlBfGMDeOVGrIK,https://api.spotify.com/v1/tracks/59nOXPmaKlBf...,https://api.spotify.com/v1/audio-analysis/59nO...,189893,4,hot
3,Late Night Talking,Harry Styles,1qEmFfgcLObUfQm0j1W2CK,0.714,0.728,10,-4.595,1,0.0468,0.298,...,0.901,114.996,audio_features,1qEmFfgcLObUfQm0j1W2CK,spotify:track:1qEmFfgcLObUfQm0j1W2CK,https://api.spotify.com/v1/tracks/1qEmFfgcLObU...,https://api.spotify.com/v1/audio-analysis/1qEm...,177955,4,hot
4,About Damn Time,Lizzo,1PckUlxKqWQs3RlWXVBLw3,0.836,0.743,10,-6.305,0,0.0656,0.0995,...,0.722,108.966,audio_features,1PckUlxKqWQs3RlWXVBLw3,spotify:track:1PckUlxKqWQs3RlWXVBLw3,https://api.spotify.com/v1/tracks/1PckUlxKqWQs...,https://api.spotify.com/v1/audio-analysis/1Pck...,191822,4,hot


In [24]:
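# Now we can concatenate them together, vertically :-)
# ignore_index=True gives the combined frame a fresh 0..n-1 index instead of repeated labels
df_songs = pd.concat([not_hot, top_100], axis = 0, ignore_index = True)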
df_songs.head()

Unnamed: 0,title,artist,id,danceability,energy,key,loudness,mode,speechiness,acousticness,...,valence,tempo,type,id.1,uri,track_href,analysis_url,duration_ms,time_signature,original_df
0,Abere izate ukatua,Su ta gar,4tMbI2qkOIfc1kzsbw7X64,0.393,0.967,10.0,-5.775,0.0,0.0511,1e-05,...,0.469,98.016,audio_features,4tMbI2qkOIfc1kzsbw7X64,spotify:track:4tMbI2qkOIfc1kzsbw7X64,https://api.spotify.com/v1/tracks/4tMbI2qkOIfc...,https://api.spotify.com/v1/audio-analysis/4tMb...,243973.0,4.0,not_hot
1,Viva Forever (Tony Rich Remix Instrumental),Spice Girls,1dehm5fgmBVxxur0YnZLMq,0.809,0.478,10.0,-7.371,0.0,0.187,0.0106,...,0.594,168.897,audio_features,1dehm5fgmBVxxur0YnZLMq,spotify:track:1dehm5fgmBVxxur0YnZLMq,https://api.spotify.com/v1/tracks/1dehm5fgmBVx...,https://api.spotify.com/v1/audio-analysis/1deh...,342733.0,4.0,not_hot
2,Think About That,Dandy Livingstone,5vWscUwJJSsIvKR0YJa8cV,0.759,0.654,0.0,-8.462,1.0,0.0523,0.134,...,0.664,95.544,audio_features,5vWscUwJJSsIvKR0YJa8cV,spotify:track:5vWscUwJJSsIvKR0YJa8cV,https://api.spotify.com/v1/tracks/5vWscUwJJSsI...,https://api.spotify.com/v1/audio-analysis/5vWs...,172947.0,4.0,not_hot
3,That Groovy Shingaling,Orquesta Harlow,2kTZKa7J0ykBpmzs5TMjWT,0.454,0.481,10.0,-10.895,1.0,0.0318,0.708,...,0.757,122.281,audio_features,2kTZKa7J0ykBpmzs5TMjWT,spotify:track:2kTZKa7J0ykBpmzs5TMjWT,https://api.spotify.com/v1/tracks/2kTZKa7J0ykB...,https://api.spotify.com/v1/audio-analysis/2kTZ...,135187.0,4.0,not_hot
4,Take It Off,King B.A.V.,77ihANWN9MuLGim3ZtYIti,0.706,0.58,7.0,-3.692,0.0,0.0423,0.000473,...,0.115,75.102,audio_features,77ihANWN9MuLGim3ZtYIti,spotify:track:77ihANWN9MuLGim3ZtYIti,https://api.spotify.com/v1/tracks/77ihANWN9MuL...,https://api.spotify.com/v1/audio-analysis/77ih...,244827.0,4.0,not_hot


In [25]:
df_songs.shape

(3676, 22)
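One detail worth checking when stacking frames with pd.concat is the row index: by default each frame keeps its own labels, so the result can contain gaps (from dropped rows) and duplicates. The toy frames below are invented for illustration and only show the difference that ignore_index=True makes.

```python
import pandas as pd

# Toy frames (made-up values): the first one loses a row to dropna, leaving a gap.
a = pd.DataFrame({"title": ["A", "B", "C"], "energy": [0.9, None, 0.5]}).dropna()
b = pd.DataFrame({"title": ["D", "E"], "energy": [0.7, 0.6]})

kept = pd.concat([a, b], axis=0)                      # original labels are kept
reset = pd.concat([a, b], axis=0, ignore_index=True)  # fresh 0..n-1 labels

print(kept.index.tolist())
print(reset.index.tolist())
```

A unique, gap-free index makes it easier to line the songs table up with any frame built later from a NumPy array, which always starts from a fresh index.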

In [26]:
# Dropping columns we don't need for the models :
df_songs_for_model = df_songs.drop(['title', 'artist', 'id', 'key', 'mode', 'type', 'id.1', 'uri', 'track_href', 
                                    'analysis_url', 'duration_ms', 'time_signature', 'original_df'], axis = 1)

In [27]:
df_songs_for_model.columns

Index(['danceability', 'energy', 'loudness', 'speechiness', 'acousticness',
       'instrumentalness', 'liveness', 'valence', 'tempo'],
      dtype='object')

In [28]:
# Now we can start working on our clustering models with this new dataframe
# Scaling data

In [30]:
# Scaling data with StandardScaler:

X = df_songs_for_model

scaler = StandardScaler()
scaler.fit(X) # fitting with 9 features
X_scaled = scaler.transform(X)

# saving the scaler in a file
import pickle

filename = r"C:\Users\Akshaykumar\lab-web-scraping-single-page.pickle"
with open(filename, "wb") as file:
    pickle.dump(scaler, file)

# putting the scaled array X_scaled into a dataframe
df_scaled = pd.DataFrame(X_scaled, columns = X.columns)
display(X.head())
print()
display(df_scaled.head())

Unnamed: 0,danceability,energy,loudness,speechiness,acousticness,instrumentalness,liveness,valence,tempo
0,0.393,0.967,-5.775,0.0511,1e-05,0.0013,0.147,0.469,98.016
1,0.809,0.478,-7.371,0.187,0.0106,0.861,0.125,0.594,168.897
2,0.759,0.654,-8.462,0.0523,0.134,0.234,0.113,0.664,95.544
3,0.454,0.481,-10.895,0.0318,0.708,0.254,0.158,0.757,122.281
4,0.706,0.58,-3.692,0.0423,0.000473,0.0,0.256,0.115,75.102





Unnamed: 0,danceability,energy,loudness,speechiness,acousticness,instrumentalness,liveness,valence,tempo
0,-0.871333,1.364469,0.777323,-0.315499,-0.925261,-0.632742,-0.344534,-0.162416,-0.789471
1,1.479125,-0.566999,0.441499,0.873749,-0.894034,1.894959,-0.460769,0.309879,1.661844
2,1.196618,0.128171,0.211935,-0.304998,-0.530161,0.051446,-0.52417,0.574364,-0.874961
3,-0.526674,-0.55515,-0.300008,-0.484392,1.162409,0.11025,-0.286417,0.925752,0.049698
4,0.89716,-0.164116,1.21562,-0.392507,-0.923896,-0.636564,0.231358,-1.499955,-1.581918
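A quick sanity check on the scaling step is to confirm that every scaled column has mean 0 and standard deviation 1. The sketch below uses only the loudness and tempo values from the five rows displayed above; the names demo, scaler_demo and demo_scaled are made up for the example. StandardScaler divides by the population standard deviation, so the check passes ddof=0 to pandas.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Two of the nine features, using the five rows shown in the notebook output.
demo = pd.DataFrame({
    "loudness": [-5.775, -7.371, -8.462, -10.895, -3.692],
    "tempo": [98.016, 168.897, 95.544, 122.281, 75.102],
})

scaler_demo = StandardScaler().fit(demo)

# Passing index=demo.index keeps the scaled rows aligned with the original rows.
demo_scaled = pd.DataFrame(
    scaler_demo.transform(demo), columns=demo.columns, index=demo.index
)

# StandardScaler uses the population standard deviation, hence ddof=0 here.
print(demo_scaled.mean().round(6))
print(demo_scaled.std(ddof=0).round(6))
```

The scaled values here differ from the ones in the notebook output because this scaler is fitted on five rows only, not on the full dataset.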


In [31]:
# Clustering with K-Means

In [32]:
kmeans = KMeans(n_clusters=8, random_state=1234)
kmeans.fit(df_scaled) # the scaled dataframe created above is named df_scaled, not X_scaled_df
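The value n_clusters=8 is only a starting point. One common way to compare candidate values of k is to record the inertia and the silhouette score for each one. The sketch below does this on synthetic blobs rather than on the song data, so the range of k, the blob settings and the resulting best value are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled song features (assumption: 4 true groups, 9 features).
X_demo, _ = make_blobs(
    n_samples=300, centers=4, n_features=9, cluster_std=1.0, random_state=1234
)

# Fit K-Means for each candidate k and record both quality measures.
results = {}
for k in range(2, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=1234).fit(X_demo)
    results[k] = {
        "inertia": model.inertia_,
        "silhouette": silhouette_score(X_demo, model.labels_),
    }

# The silhouette score lies in [-1, 1]; higher means tighter, better separated clusters.
best_k = max(results, key=lambda k: results[k]["silhouette"])

for k, scores in results.items():
    print(f"k={k}  inertia={scores['inertia']:.1f}  silhouette={scores['silhouette']:.3f}")
print("k with the highest silhouette score:", best_k)
```

On real audio features the silhouette curve is usually much flatter than on well-separated blobs, so it is worth plotting both measures before settling on a value.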