# Clustering the songs from dataframes

Now it's time to cluster the songs in the hot_songs and not_hot_songs DataFrames according to their audio features. You will need to consider the following:

- Are you going to use all the audio features? If not, which ones do you think make the most sense?
- What is the optimal number of clusters (for methods that need to know this beforehand)?
- What is the best distance to use?
- Which clustering method gives the best results?
- Does the clustering method need a transformer?

Be aware that this process is extremely time-consuming! When testing different options, save each model to disk so that you can reuse the best one later. You don't want to retrain the best model once you already know its optimal parameters.

Add a new column to the hot_songs and not_hot_songs DataFrames for each clustering method, holding each song's cluster membership under that method.

## Importing the libraries

In [7]:
pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [8]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import pickle

In [9]:
all_songs = pd.read_csv('data/allsongsconcat_df.csv')

## Removing the unnecessary columns:
Most of these columns are not actually audio features, just links, identifiers and other information that has nothing to do with audio qualities.

In [10]:
all_songs_clean = all_songs.drop(['analysis_url', 'id', 'uri', 'track_href', 'duration_ms', 'time_signature', 'type'], axis=1)

## Store this cleaned DataFrame in a CSV file:

In [11]:
all_songs_clean.to_csv("data/all_songs_clean.csv", index=False)

## Numerical and Categorical split:
- X_num will be for Numerical columns
- X_cat will be for Categorical ones

In [12]:
X_num = all_songs_clean.drop(['songs', 'artists'], axis=1)

In [13]:
X_cat = all_songs_clean[['songs', 'artists']]

## Checking the Data types per column

In [14]:
X_num.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4228 entries, 0 to 4227
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   danceability      200 non-null    float64
 1   energy            200 non-null    float64
 2   key               200 non-null    float64
 3   loudness          200 non-null    float64
 4   mode              200 non-null    float64
 5   speechiness       200 non-null    float64
 6   acousticness      200 non-null    float64
 7   instrumentalness  200 non-null    float64
 8   liveness          200 non-null    float64
 9   valence           200 non-null    float64
 10  tempo             200 non-null    float64
dtypes: float64(11)
memory usage: 363.5 KB


In [15]:
X_cat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4228 entries, 0 to 4227
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   songs    4228 non-null   object
 1   artists  4228 non-null   object
dtypes: object(2)
memory usage: 66.2+ KB


## Scaling the features

In [16]:
scaler = StandardScaler()
scaler.fit(X_num)  # Fit only on the numerical features, not on the whole DataFrame
X_scaled = scaler.transform(X_num)
filename = "/Users/Hector_Martin/Documents/Labs/music_recommender_project/scalers/standardscaler.pickle"  # Path with filename
with open(filename, "wb") as file:
    pickle.dump(scaler, file)
X_scaled_df = pd.DataFrame(X_scaled, columns = X_num.columns)
print('Data before the transformation')
print('------------------------------')
display(X_num.head())#data before the transformation
print()
print('Data after the transformation')
print('------------------------------')
display(X_scaled_df.head())#data after the transformation

Data before the transformation
------------------------------


| | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.463 | 0.642 | 1.0 | -4.474 | 1.0 | 0.34 | 0.314 | 0.0 | 0.0686 | 0.339 | 83.389 |
| 1 | 0.52 | 0.731 | 6.0 | -5.338 | 0.0 | 0.0557 | 0.342 | 0.00101 | 0.311 | 0.662 | 173.93 |
| 2 | 0.905 | 0.563 | 8.0 | -6.135 | 1.0 | 0.102 | 0.0254 | 1e-05 | 0.113 | 0.324 | 106.998 |
| 3 | 0.883 | 0.657 | 8.0 | -5.748 | 1.0 | 0.305 | 0.0603 | 0.0 | 0.128 | 0.284 | 124.992 |
| 4 | 0.761 | 0.525 | 11.0 | -6.9 | 1.0 | 0.0944 | 0.44 | 7e-06 | 0.0921 | 0.531 | 80.87 |



Data after the transformation
------------------------------


| | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.58598 | 0.068788 | -1.156007 | 0.956141 | 0.79959 | 1.501557 | 0.70178 | -0.241779 | -0.857198 | -0.693443 | -1.184862 |
| 1 | -1.177056 | 0.688434 | 0.311007 | 0.483034 | -1.250641 | -0.686009 | 0.847332 | 0.50948 | 1.388129 | 0.750392 | 1.454368 |
| 2 | 1.584975 | -0.481235 | 0.897812 | 0.046615 | 0.79959 | -0.329751 | -0.798444 | -0.234556 | -0.445925 | -0.760495 | -0.49667 |
| 3 | 1.427145 | 0.173222 | 0.897812 | 0.258528 | 0.79959 | 1.232247 | -0.617024 | -0.241779 | -0.306982 | -0.939298 | 0.027847 |
| 4 | 0.551904 | -0.745803 | 1.778021 | -0.372281 | 0.79959 | -0.388229 | 1.356763 | -0.236795 | -0.63952 | 0.164812 | -1.25829 |
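
To make the transformation less of a black box, here is a minimal sketch of what `StandardScaler` computes: each column has its mean subtracted and is divided by its population standard deviation. The `toy` DataFrame and the `toy_*` names are hypothetical; the values are the first four `danceability` and `tempo` entries from the output above, so the scaled numbers will not match the notebook's, which were computed over the whole dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame standing in for X_num (only 4 rows and 2 columns)
toy = pd.DataFrame({
    "danceability": [0.463, 0.520, 0.905, 0.883],
    "tempo": [83.389, 173.930, 106.998, 124.992],
})

# Scale with scikit-learn
toy_scaler = StandardScaler().fit(toy)
toy_scaled = pd.DataFrame(toy_scaler.transform(toy), columns=toy.columns)

# Same computation by hand: z = (x - mean) / std, using the population std (ddof=0)
manual = (toy - toy.mean()) / toy.std(ddof=0)

# Both approaches should agree up to floating-point error
print(np.allclose(toy_scaled.values, manual.values))
```

After scaling, every column has mean 0 and standard deviation 1, so features measured on very different ranges (such as `tempo` and `danceability`) contribute comparably to the Euclidean distances that K-Means relies on.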


## Training Models with different K values to assess which offers the best performance:

In [22]:
import matplotlib.pyplot as plt
%matplotlib inline

def k_means_trainer(df):
    '''
    Train one K-Means model per value of k and plot their performance
    using the elbow (inertia) and silhouette methods.
    Each model is stored in its own pickle file.
    '''

    # We start with 2 because we need at least 2 clusters to compare.
    # range(2, 21) covers k = 2 to 20, so we compare models with up to 20 clusters.
    K = range(2, 21)
    inertia = []  # Store the inertia value of every model
    silhouette = []  # Store the silhouette score of every model

    for k in K:
        print("Training a K-Means model with {} clusters! ".format(k))
        print()
        kmeans = KMeans(n_clusters=k,
                        n_init=10,  # Run 10 initialisations; only the best run is kept
                        random_state=1234,
                        verbose=1)  # Display progress messages
        kmeans.fit(df)
        filename = "/Users/Hector_Martin/Documents/Labs/music_recommender_project/models/kmeans_" + str(k) + ".pickle"
        with open(filename, "wb") as file:
            pickle.dump(kmeans, file)
        inertia.append(kmeans.inertia_)
        # labels_ holds the cluster assigned to each training row
        silhouette.append(silhouette_score(df, kmeans.labels_))

    # Elbow method:
    fig, ax = plt.subplots(1, 2, figsize=(16, 8))
    ax[0].plot(K, inertia, 'bx-')
    ax[0].set_xlabel('k')
    ax[0].set_ylabel('inertia')
    ax[0].set_xticks(np.arange(min(K), max(K) + 1, 1.0))
    ax[0].set_title('Elbow Method showing the optimal k')

    # Silhouette method:
    ax[1].plot(K, silhouette, 'bx-')
    ax[1].set_xlabel('k')
    ax[1].set_ylabel('silhouette score')
    ax[1].set_xticks(np.arange(min(K), max(K) + 1, 1.0))
    ax[1].set_title('Silhouette Method showing the optimal k')

In [21]:
k_means_trainer(X_scaled_df)

Training a K-Means model with 2 clusters! 



ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
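
This error is consistent with the `X_num.info()` output above, which reports 4228 entries but only 200 non-null values per column. `StandardScaler` ignores NaN when fitting and passes them through when transforming, so the missing values only surface once `KMeans.fit` is called. One possible way around it, sketched below on made-up data, is to drop the incomplete rows before scaling. The `demo_*` names and the synthetic values are assumptions for illustration, and whether dropping or imputing is the right choice depends on why the features are missing.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for X_num: 30 songs, the last 10 without audio features
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(30, 3)),
                    columns=["danceability", "energy", "tempo"])
demo.iloc[20:, :] = np.nan

# Count the missing values per column before deciding what to do
print(demo.isna().sum())

# Keep only complete rows; the original index is preserved, so the rows
# can still be matched back to their song and artist names afterwards
demo_complete = demo.dropna()

# Scaling and clustering now run without the NaN error
demo_scaled = StandardScaler().fit_transform(demo_complete)
demo_model = KMeans(n_clusters=2, n_init=10, random_state=1234).fit(demo_scaled)
print(demo_model.labels_.shape)
```

In the notebook, the same idea would mean calling `dropna()` on the numerical DataFrame before fitting the scaler, and keeping the matching rows of the categorical DataFrame.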

### Loading the scaler and the best model:

In [None]:
def load(filename="filename.pickle"):  # Defaults to 'filename.pickle' when no file name is given
    try:
        with open(filename, "rb") as file:
            return pickle.load(file)
    except FileNotFoundError:
        print("File not found!")

#### Loading the scaler from a pickle file

In [None]:
scaler = load("scalers/standardscaler.pickle")
scaler

#### Loading the best_model from a pickle file:
Based on the Elbow method plot, we can determine that the best model is the one with k = 8.

In [None]:
best_model = load("models/kmeans_8.pickle")
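
With a fitted scaler and model available, the cluster membership asked for at the start can be added as a new column. The sketch below uses synthetic data and hypothetical names (`features`, `songs`, `demo_scaler`, `demo_best`, `kmeans_cluster`) in place of the notebook's loaded `scaler` and `best_model`; it is an illustration of the step, not the notebook's own code.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for the numerical features and the full song table
rng = np.random.default_rng(1)
features = pd.DataFrame(rng.normal(size=(40, 3)),
                        columns=["danceability", "energy", "tempo"])
songs = features.copy()
songs.insert(0, "songs", [f"song_{i}" for i in range(len(songs))])

# Stand-ins for the scaler and model that the notebook loads from pickle files
demo_scaler = StandardScaler().fit(features)
scaled = pd.DataFrame(demo_scaler.transform(features),
                      columns=features.columns, index=features.index)
demo_best = KMeans(n_clusters=4, n_init=10, random_state=1234).fit(scaled)

# Reuse the fitted scaler (transform, not fit_transform), then predict
# the cluster of every song and store it as a new column
songs["kmeans_cluster"] = demo_best.predict(scaled)
print(songs[["songs", "kmeans_cluster"]].head())
```

Using one column name per clustering method (for example a hypothetical `kmeans_cluster` next to columns for other methods) keeps the memberships from different methods side by side in the same DataFrame.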