# Clustering

## Based on the article [PATS](http://ismir2002.ircam.fr/proceedings/OKPROC02-FP07-4.pdf)

The idea is to establish a similarity metric between songs and then run a clustering process based on it.

In [1]:
# Importing libraries 
import pandas as pd 
import numpy as np 

from sklearn.cluster import AffinityPropagation, SpectralClustering, DBSCAN
from sklearn.preprocessing import MinMaxScaler
from scipy.spatial.distance import cdist, pdist, squareform

from seaborn import heatmap 
import matplotlib.pyplot as plt

from tqdm.notebook import tqdm
import glob
import os
import time

## Defining Important Features 

These features will be used to understand the data.

In [2]:
metadata =  ['playlist_id', 'duration_ms', 'explicit', 'id', 'album_type', 'popularity', 'album_id', 
             'album_release_date', 'artists_ids', 'name', 'artists_names']
audio_features = ['danceability', 'energy', 'loudness', 'key', 'mode', 'speechiness', 'acousticness', 
                  'instrumentalness', 'liveness', 'valence', 'tempo', 'time_signature', 'id']

## Playlist and tracks dataframes 

Loading the data collected from the Spotify API.

In [3]:
playlists_df = pd.read_pickle('../data/sp_playlists.pkl')[['id']]
playlists_df['playlist_id'] = playlists_df['id']
playlists_df = playlists_df[['playlist_id']]
audio_features_df = pd.read_pickle('../data/sp_audio_features.pkl')[audio_features]
tracks_df = pd.concat(
    [pd.read_pickle(file)[metadata] for file in glob.glob('../data/sp_tracks_ready_*.pkl')],
    ignore_index=True
)

In [4]:
tracks_df = audio_features_df.merge(tracks_df, on = 'id')
tracks_df = tracks_df.merge(playlists_df, on = 'playlist_id')

In [5]:
del audio_features_df
del playlists_df 

I convert the dates to datetime and use the year as a continuous value.

In [6]:
tracks_df['album_release_date'] = tracks_df['album_release_date'].apply(lambda x: 
                                                                        None if x == '0000' else pd.to_datetime(x))
tracks_df['years'] = tracks_df['album_release_date'].apply(lambda x: x.year + x.month/12 + x.day/365)
tracks_df.sample()

Unnamed: 0,danceability,energy,loudness,key,mode,speechiness,acousticness,instrumentalness,liveness,valence,...,duration_ms,explicit,album_type,popularity,album_id,album_release_date,artists_ids,name,artists_names,years
273121,0.552,0.615,-11.941,11,1,0.0605,0.417,0.000113,0.106,0.233,...,254106.0,False,album,0.0,2HVx2tiZnLX8xeaUthed1e,1976-09-28,[7guDJrEfX3qb6FEbdPA5qi],Summer Soft,[Stevie Wonder],1976.826712


There are 45 NaN values in the years column. Since so few values are missing, I fill them with the column mean.

In [7]:
tracks_df['years'] = tracks_df['years'].fillna(tracks_df['years'].mean())

I separate the categorical, numerical, and set-oriented features in order to build the similarity matrix.

In [8]:
features_categorical =  ['explicit', 'album_type', 'album_id', 'key', 'mode', 'time_signature']
features_numerical = ['duration_ms', 'popularity', 'danceability', 'energy', 'loudness', 'speechiness', 
                      'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'years']
features_set_oriented = ['artists_ids']

features = [] 
features.extend(features_categorical)
features.extend(features_numerical)
features.extend(features_set_oriented)

This only ensures that the numerical features have the correct type.

In [9]:
tracks_df[features_numerical] = tracks_df[features_numerical].astype(float)

Let's build the proposed metrics. First, I normalize the numerical data so that every feature lies in the range $[0,1]$.

In [10]:
scaler = MinMaxScaler()
tracks_df[features_numerical] = scaler.fit_transform(tracks_df[features_numerical])
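
The scaling matters because the numerical similarity used later, 1 - |x1 - x2|, only stays in $[0,1]$ when both values are in $[0,1]$. A minimal sketch on made-up values (the two columns are hypothetical stand-ins for a duration and a loudness feature):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical raw values: column 0 ~ duration in ms, column 1 ~ loudness in dB
raw = np.array([
    [100_000.0, -20.0],
    [200_000.0, -10.0],
    [300_000.0,   0.0],
])

# Each column is mapped to [0, 1] using its own minimum and maximum
scaled = MinMaxScaler().fit_transform(raw)

# After scaling, 1 - |difference| is a similarity in [0, 1] for every feature
sim_first_two = 1 - np.abs(scaled[0] - scaled[1])
```

Without the scaling, the duration column alone would dominate any difference-based score.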

In [11]:
metric_categorical = lambda x1,x2: x1 == x2
metric_set_oriented = lambda x1, x2: len(set(x1) & set(x2))/(max(len(x1), len(x2)))
metric_numerical = lambda x1, x2: 1 - abs(x1 - x2)

def metric_songs(x: np.array, y: np.array) -> float: 
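    # Arbitrary choice: equal weights for the 19 features, in the order of `features`
    # (indices 0-5 categorical, 6-17 numerical, 18 set-oriented)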
    weight = np.ones(19)/19
    
    similarity = 0
    similarity += sum(weight[0:6]*metric_categorical(x[0:6], y[0:6]))
    similarity += sum(weight[6:18]*metric_numerical(x[6:18], y[6:18]))
    similarity += weight[18]*metric_set_oriented(x[18], y[18])

    return similarity

## Simple example


Let's calculate a simple case with two songs.

In [12]:
x1 = np.array(tracks_df[features].iloc[10])
x2 = np.array(tracks_df[features].iloc[34])
metric_songs(x1, x2)

0.8309429212833058
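
To check the metric by hand, here is a small self-contained sketch of the same idea with only five features. The function name `mixed_similarity`, the feature layout, and the two tracks are hypothetical and only illustrate the equal-weight combination; this is not the notebook's `metric_songs`.

```python
def mixed_similarity(x, y, n_cat, n_num):
    """Equal-weight similarity: n_cat categorical features, then n_num
    numerical features in [0, 1], then one set-valued feature."""
    weight = 1.0 / (n_cat + n_num + 1)
    # Categorical: 1 when the values match, 0 otherwise
    cat = sum(float(a == b) for a, b in zip(x[:n_cat], y[:n_cat]))
    # Numerical: 1 minus the absolute difference
    num = sum(1.0 - abs(a - b)
              for a, b in zip(x[n_cat:n_cat + n_num], y[n_cat:n_cat + n_num]))
    # Set-oriented: shared elements over the size of the larger collection
    shared = len(set(x[-1]) & set(y[-1]))
    set_sim = shared / max(len(x[-1]), len(y[-1]))
    return weight * (cat + num + set_sim)

# Hypothetical tracks: [explicit, key, energy, valence, artists_ids]
a = [False, 5, 0.50, 0.20, ["artist_a", "artist_b"]]
b = [False, 7, 0.75, 0.20, ["artist_a"]]

sim_ab = mixed_similarity(a, b, n_cat=2, n_num=2)
sim_aa = mixed_similarity(a, a, n_cat=2, n_num=2)
```

Here the categorical part contributes 1 (only `explicit` matches), the numerical part 0.75 + 1, and the set part 1/2, so `sim_ab` is 3.25 / 5 = 0.65, while a track compared with itself scores 1.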

## Calculate the Similarity Matrix

I use the pdist function to compute the pairwise similarity between tracks. I sample only 2,000 songs, because comparing all songs would take a long time. Afterwards, I can convert this similarity matrix to a distance matrix if needed.

In [13]:
#tracks_selected = tracks_df.sample(2000, random_state = 1000)
#tracks_similarity = pdist(tracks_selected[features].values, metric = metric_songs)

In [14]:
# tracks_similarity = - lambda*np.log(lambda*tracks_similarity)
#tracks_similarity = squareform(tracks_similarity) + np.eye(2000)
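
The identity matrix is added because squareform fills the diagonal with zeros, whereas the similarity of a track with itself should be 1. A minimal sketch on numeric toy data, assuming a simplified similarity (`numeric_similarity` is a hypothetical stand-in for `metric_songs`) and using 1 - similarity as one possible distance:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def numeric_similarity(u, v):
    # Mean of 1 - |difference| over features already scaled to [0, 1]
    return float(np.mean(1.0 - np.abs(u - v)))

# Hypothetical tracks described by two scaled numerical features
X = np.array([
    [0.0, 0.0],
    [0.5, 1.0],
    [1.0, 1.0],
])

# pdist returns the condensed vector for the pairs (0,1), (0,2), (1,2)
condensed = pdist(X, metric=numeric_similarity)

# squareform rebuilds the symmetric matrix with zeros on the diagonal,
# so the identity restores the self-similarity of 1
similarity = squareform(condensed) + np.eye(len(X))

# One simple way to turn the similarity into a distance
distance = 1.0 - similarity
```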

In [15]:
#fig, ax = plt.subplots(figsize = (15,12))
#heatmap(tracks_similarity, vmin = 0, vmax = tracks_similarity.max(), ax = ax)

#plt.show()

## Clusterization Algorithms

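These algorithms work on a precomputed matrix. AffinityPropagation and SpectralClustering with affinity = 'precomputed' expect a similarity matrix, while DBSCAN with metric = 'precomputed' expects a distance matrix.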

In [16]:
#model = AffinityPropagation(affinity = 'precomputed').fit(tracks_similarity)

How many labels were produced?

In [17]:
#len(np.unique(model.labels_))

How many playlists actually existed?

In [18]:
#len(tracks_selected.playlist_id.unique())

In [19]:
#model2 = DBSCAN(eps = 0.25, metric = 'precomputed').fit(1 - tracks_similarity)

In [20]:
#len(model2.labels_[model2.labels_ == -1])

In [21]:
#len(np.unique(model2.labels_))

In [22]:
#model3 = SpectralClustering(n_clusters=27, affinity='precomputed').fit(tracks_similarity)
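
The difference between similarity and distance input can be seen on a toy case. This is only a sketch: the six one-dimensional "tracks" are made up, and 1 - similarity is assumed as the distance.

```python
import numpy as np
from sklearn.cluster import DBSCAN, SpectralClustering

# Hypothetical 1-D tracks forming two obvious groups
x = np.array([0.00, 0.02, 0.04, 0.90, 0.92, 0.94])

similarity = 1.0 - np.abs(x[:, None] - x[None, :])
distance = 1.0 - similarity

# DBSCAN with metric='precomputed' reads the matrix as distances
dbscan_labels = DBSCAN(
    eps=0.25, min_samples=2, metric="precomputed"
).fit(distance).labels_

# SpectralClustering with affinity='precomputed' reads it as similarities
spectral_labels = SpectralClustering(
    n_clusters=2, affinity="precomputed", random_state=0
).fit(similarity).labels_
```

Passing the similarity matrix to DBSCAN instead would treat the most similar tracks as the farthest apart, so the conversion step should not be skipped.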

## Recommendation based on similarity. 

I treat the similarity metric as a probability: the higher the similarity, the more likely it is that a song should be recommended. Specifically, I iteratively choose the song that maximizes the product of its similarities to the songs already in the playlist, which plays the role of a likelihood.
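
The greedy selection can be sketched on a small precomputed similarity matrix. The function name `greedy_playlist` and the 5-track matrix are hypothetical; the notebook's own functions compute the similarities on the fly with cdist instead.

```python
import numpy as np

def greedy_playlist(similarity: np.ndarray, seeds: list, n_new: int) -> list:
    """Extend `seeds` with `n_new` tracks, each one maximizing the product
    of its similarities to every track already in the playlist."""
    playlist = list(seeds)
    for _ in range(n_new):
        # Product over the playlist rows gives one score per candidate track
        scores = similarity[playlist].prod(axis=0)
        # Tracks already in the playlist must not be picked again
        scores[playlist] = -1.0
        playlist.append(int(np.argmax(scores)))
    return playlist

# Hypothetical similarity matrix (symmetric, ones on the diagonal)
S = np.array([
    [1.0, 0.9, 0.2, 0.8, 0.1],
    [0.9, 1.0, 0.3, 0.7, 0.2],
    [0.2, 0.3, 1.0, 0.4, 0.9],
    [0.8, 0.7, 0.4, 1.0, 0.3],
    [0.1, 0.2, 0.9, 0.3, 1.0],
])

result = greedy_playlist(S, seeds=[0], n_new=2)
```

Starting from track 0, the first pick is track 1 (similarity 0.9), and the second is track 3, whose product 0.8 * 0.7 = 0.56 beats the remaining candidates.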

#### Test

I initialize with a random song and will take 10 songs to build a playlist.

In [23]:
number_of_songs = 10

In [36]:
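# Functions

def get_similar_track(tracks_similarity: np.ndarray) -> int: 
    
    # Score each candidate track by the product of its similarities
    # to every song already in the playlist (one row per playlist song)
    interest_tracks = tracks_similarity.prod(axis = 0)
    song = np.argmax(interest_tracks)
    
    return song 

def get_playlist(number_of_songs: int, tracks: np.ndarray, songs: list) -> list: 
    
    # `songs` holds the row indices of the seed tracks and is extended in place
    n = len(songs)
    
    # Rows that are not filled yet stay at 1, so they do not change the product
    tracks_similarity = np.ones((number_of_songs + n, tracks.shape[0]))
    
    # Similarity between every seed song except the last one and all tracks
    tracks_similarity[0:n-1,:] = cdist(tracks[songs[:-1],:], tracks, metric = metric_songs)
    
    # A zero in a column zeroes its product, so seed songs are never recommended
    for s in range(n-1): 
        tracks_similarity[s,songs[s]] = 0
    
    for playlist_item in tqdm(range(n-1, number_of_songs + n-1)):

        # Similarity between the most recently added song and all tracks
        tracks_similarity[playlist_item,:] = cdist(tracks[songs[-1:],:], tracks, metric = metric_songs) 
        
        # Zero out the song just used so it cannot be selected again
        tracks_similarity[playlist_item, songs[-1]] = 0
        
        songs.append(get_similar_track(tracks_similarity))
    
    return songs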

In [25]:
tracks_df_drop = tracks_df.drop_duplicates('id', ignore_index=True) 

In [26]:
playlist_test = tracks_df.sample()['playlist_id'].iloc[0]
playlist_tracks = tracks_df[tracks_df['playlist_id'] == playlist_test]

In [27]:
songs = list(tracks_df_drop[tracks_df_drop['id'].isin(tracks_df.loc[playlist_tracks.index]['id'])].index)

In [38]:
selected_songs = list(np.random.choice(songs, 50, replace = False))
playlist = get_playlist(20, tracks_df_drop[features].values, selected_songs)

In [50]:
playlist_id_songs = set(tracks_df_drop.iloc[playlist].id)

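None of the recommended songs belongs to the original playlist: the intersection below has 50 elements, which are exactly the 50 seed songs.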

In [51]:
len(set(playlist_tracks.id) & playlist_id_songs)

50