# Similarity Matrix

## Based on the article [PATS](http://ismir2002.ircam.fr/proceedings/OKPROC02-FP07-4.pdf)

The idea is to establish a similarity metric between tracks and build a similarity matrix from which playlists can be generated.

In [1]:
# Importing libraries 
import pandas as pd 
import numpy as np 

from sklearn.cluster import AffinityPropagation, SpectralClustering, DBSCAN
from sklearn.preprocessing import MinMaxScaler
from scipy.spatial.distance import cdist, squareform, pdist
from scipy.sparse import csr_matrix, lil_matrix
from sklearn.model_selection import train_test_split

from seaborn import heatmap 
import matplotlib.pyplot as plt

from tqdm.notebook import tqdm
import glob
import os
import time

t0 = time.time()

## Defining Important Features 

These features will be used to understand the data.

In [2]:
metadata =  ['playlist_id', 'duration_ms', 'explicit', 'id', 'album_type', 'popularity', 'album_id', 
             'album_release_date', 'artists_ids', 'name', 'artists_names']
audio_features = ['danceability', 'energy', 'loudness', 'key', 'mode', 'speechiness', 'acousticness', 
                  'instrumentalness', 'liveness', 'valence', 'tempo', 'time_signature', 'id']

## Playlist and tracks dataframes 

Loading the data collected from the Spotify API.

In [3]:
playlists_df = pd.read_pickle('../data/sp_playlists.pkl')[['owner_id', 'id', 'tracks']]
playlists_df.rename(columns = {'id': 'playlist_id', 'tracks': 'n_tracks'}, inplace = True)

playlists_df.n_tracks = playlists_df.n_tracks.apply(lambda x: x['total'])

# Getting Playlists with at least 5 tracks and maximum of 500 tracks
playlists_df = playlists_df[(playlists_df.n_tracks >= 5) & (playlists_df.n_tracks <= 500)]

In [4]:
audio_features_df = pd.read_pickle('../data/sp_audio_features.pkl')[audio_features]
tracks_df = pd.concat(
    [pd.read_pickle(file)[metadata] for file in glob.glob('../data/sp_tracks_ready_*.pkl')],
    ignore_index=True
)
tracks_df = audio_features_df.merge(tracks_df, on = 'id')
del audio_features_df

## Treating the data

I convert the release dates to datetime and use the number of days since the earliest release date as a continuous value.

In [5]:
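# '0000' marks an unknown release date; replace it with NaN so that it becomes NaT below.
# (Passing value=None is treated as a forward fill by some pandas versions.)
tracks_df['album_release_date'] = tracks_df['album_release_date'].replace('0000', np.nan)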
tracks_df['album_release_date'] = pd.to_datetime(tracks_df['album_release_date'])
tracks_df['album_release_date'] = (tracks_df['album_release_date'] - tracks_df['album_release_date'].min())
tracks_df['days'] = tracks_df['album_release_date']/np.timedelta64(1,'D')
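
As a small illustration of the conversion above, here is a minimal sketch using only the standard library. The release dates are made up for the example and do not come from the dataset; the point is only to show what "days since the earliest release" looks like.

```python
from datetime import date

# Hypothetical release dates, for illustration only
release_dates = [date(1999, 1, 1), date(1999, 1, 31), date(2000, 1, 1)]

# Days elapsed since the earliest release, mirroring the 'days' column
earliest = min(release_dates)
days = [(d - earliest).days for d in release_dates]
print(days)  # [0, 30, 365]
```

The earliest track always gets the value 0, so the feature is non-negative before scaling.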

There are 45 NaN values in the `days` column. Since so few values are missing, I fill them with the column mean.

In [6]:
# Assign the result back instead of calling fillna in place on a column selection
tracks_df['days'] = tracks_df['days'].fillna(tracks_df['days'].mean())

Convert the artist ids to sets, so that the set-oriented metric presented below can be applied.

In [7]:
tracks_df.artists_ids = tracks_df.artists_ids.apply(set)

I separate the categorical, numerical and set-oriented features, since each group gets its own metric in the similarity matrix.

In [8]:
features_categorical =  ['explicit', 'album_type', 'album_id', 'key', 'mode', 'time_signature']
features_numerical = ['duration_ms', 'popularity', 'danceability', 'energy', 'loudness', 'speechiness', 
                      'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'days']
features_set_oriented = ['artists_ids']

features = [] 
features.extend(features_categorical)
features.extend(features_numerical)
features.extend(features_set_oriented)

This only ensures that the numerical features are stored as floats.

In [9]:
tracks_df[features_numerical] = tracks_df[features_numerical].astype(float)

Let's build the proposed metrics. First, I normalize the numerical data so that every feature lies in the range $[0,1]$.

In [10]:
scaler = MinMaxScaler()
tracks_df[features_numerical] = scaler.fit_transform(tracks_df[features_numerical])
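
With its default settings, `MinMaxScaler` rescales each column as $(x - \min)/(\max - \min)$. The sketch below reproduces that formula in plain Python on made-up tempo values; the function name `min_max_scale` and the handling of constant columns are assumptions of this example, not part of the notebook.

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1] as (v - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: map everything to 0.0 to avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

tempos = [60.0, 120.0, 180.0]  # hypothetical tempo values in BPM
scaled = min_max_scale(tempos)
print(scaled)  # [0.0, 0.5, 1.0]
```

Because every scaled feature lies in $[0,1]$, the absolute difference between two tracks on any numerical feature is at most 1, which keeps the numerical similarity defined in the next cell inside $[0,1]$ as well.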

In [11]:
metric_categorical = lambda x1,x2:  x1 == x2
metric_set_oriented = lambda x1, x2: len(x1 & x2)/(len(x1.union(x2)))
metric_numerical = lambda x1, x2: 1 - abs(x1 - x2)

# Idea: I give each feature an importance grade (1 to 5) based on my own experience. 
# This is an arbitrary choice. The order follows `features`: 6 categorical, 12 numerical, 1 set-oriented.
weights = [1, 1, 5, 2, 3, 3, 3, 2, 4, 4, 4, 4, 4, 4, 4, 4, 2, 2, 5]
weights = np.array(weights)/sum(weights)

def metric_songs(x: np.array, y: np.array) -> float: 
    
    similarity = 0
    similarity += np.dot(weights[0:6], metric_categorical(x[0:6], y[0:6]))
    similarity += np.dot(weights[6:18], metric_numerical(x[6:18], y[6:18]))
    similarity += weights[18]*metric_set_oriented(x[18], y[18])

    return similarity
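
To make the weighted sum concrete, here is a minimal sketch with one feature of each kind. The names `jaccard` and `toy_similarity`, the toy weights and the two songs are all hypothetical; returning 0 for two empty sets is also an assumption of this sketch (the lambda above would divide by zero in that case).

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index; returns 0.0 when both sets are empty (assumption of this sketch)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def toy_similarity(x, y, w):
    """x, y: (category, scaled_number, set_of_ids); w: three weights summing to 1."""
    categorical = 1.0 if x[0] == y[0] else 0.0
    numerical = 1.0 - abs(x[1] - y[1])
    return w[0] * categorical + w[1] * numerical + w[2] * jaccard(x[2], y[2])

toy_weights = [0.25, 0.25, 0.5]  # hypothetical weights, already normalized
song_a = ("album", 0.5, {"artist_1", "artist_2"})
song_b = ("album", 0.25, {"artist_2"})

sim_ab = toy_similarity(song_a, song_b, toy_weights)
print(sim_ab)  # 0.25*1 + 0.25*0.75 + 0.5*0.5 = 0.6875
```

Since the weights sum to 1 and each partial metric lies in $[0,1]$, the result also lies in $[0,1]$, and a song compared with itself scores exactly 1.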

## Simple example

Let's compute a simple case: the similarity between a batch of three songs and a batch of two songs, which gives a $3 \times 2$ matrix.

In [12]:
x1 = np.array(tracks_df[features].iloc[27:30])
x2 = np.array(tracks_df[features].iloc[1003:1005])
matrix = cdist(x1, x2, metric = metric_songs)
print(matrix)

[[0.6429353  0.67196831]
 [0.67909603 0.67121037]
 [0.71947156 0.67978537]]


## Recommendation based on similarity. 

We will use the metric described above. The similarity between two songs will be interpreted as a **probability**. We could build the whole track similarity matrix, but that requires too much computation, so I make a simple modification: the metric between two songs is computed only if they appear together in at least one playlist of the dataset. This should reduce the number of calculations considerably.

The result is a sparse matrix. To add tracks to a playlist, we proceed iteratively. Given a list of $n$ songs, we look up their similarities with all tracks; a similarity is zero when the two tracks never share a playlist. We multiply these probabilities over the given tracks (this is our likelihood) and maximize it. Note that `get_similar_track` in the implementation below scores candidates by their mean similarity rather than by the product.
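
The selection step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's implementation: the similarity values, the track names and the helpers `sim`, `log_likelihood` and `next_track` are all hypothetical. It works with log-probabilities so that the product becomes a sum.

```python
import math

# Hypothetical sparse similarities: only pairs that share a playlist are stored
similarity = {
    ("a", "c"): 0.9, ("b", "c"): 0.8,
    ("a", "d"): 0.95,                 # "d" never co-occurs with "b"
    ("a", "e"): 0.6, ("b", "e"): 0.7,
}

def sim(t1, t2):
    """Symmetric lookup; pairs that are not stored have similarity 0."""
    return similarity.get((t1, t2), similarity.get((t2, t1), 0.0))

def log_likelihood(candidate, playlist):
    """Sum of log-similarities, i.e. the log of the product of probabilities."""
    values = [sim(candidate, track) for track in playlist]
    if any(v == 0.0 for v in values):
        return -math.inf  # a single zero makes the whole product zero
    return sum(math.log(v) for v in values)

def next_track(playlist, candidates):
    """Pick the candidate that maximizes the likelihood given the playlist."""
    return max(candidates, key=lambda c: log_likelihood(c, playlist))

playlist = ["a", "b"]
best = next_track(playlist, ["c", "d", "e"])
print(best)  # c
```

Track `d` is very similar to `a` but has no stored similarity with `b`, so its product is zero and it is never chosen. This shows how strict the product is on a sparse matrix compared with an average.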

In [228]:
# Functions

class SimilarityModel:
    
    def __init__(self, tracks: pd.DataFrame, playlists: pd.DataFrame): 
        '''Implementation of the Similarity Model described above. 
           The metrics used are described in the PATS article. 
           - tracks: all the tracks in your world. 
           - playlists: the training playlists.
        '''
        
        self.tracks = tracks
        self.playlists = playlists
        # We will consider a dataframe with the unique tracks and create numerical indexes
        self.tracks_index = self.tracks[['id']].drop_duplicates('id').reset_index()
        self.playlists = self.playlists.set_index('playlist_id')
        
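    def get_similar_track(self, tracks_similarity, n_of_songs: int) -> np.ndarray: 
        '''Return the indexes of the n_of_songs tracks with the highest mean similarity 
           to the given tracks (tracks_similarity has one sparse row per given track). 
        '''
        # Disabled alternative: keep only the columns that have a value in every row
        #counting_elements = tracks_similarity.getnnz(axis = 0)
        #mask = counting_elements == tracks_similarity.shape[0]
        #interest_tracks = np.zeros(tracks_similarity.shape[1], dtype = float)
        # Mean similarity of every candidate track (column) over the given tracks (rows)
        interest_tracks = tracks_similarity.mean(axis = 0).A.flatten()
        # Indexes of the n_of_songs largest scores, in no particular order
        songs = np.argpartition(interest_tracks, -n_of_songs)[-n_of_songs:]
        return songs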
    
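    def _get_index(self, tracks_ids):
        '''Map track ids to their row positions in tracks_index, in the order given. 
           Keeping that order matters in fit(), where the positions must line up 
           with the rows of the pairwise matrix (an isin() filter would reorder them). 
        '''
        positions = pd.Series(self.tracks_index.index, index = self.tracks_index.id)
        
        return list(positions.loc[list(tracks_ids)])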
    
    def _get_track_number(self, index):
        
        track_id = self.tracks_index.loc[index]
        return track_id.id
    
    def accuracy_metric(self, yhat, ytrue,n, j):

        acc = (len(ytrue & yhat) - j)/(n - j)
        return acc
     
    def fit(self): 
        '''Builds the model from the available tracks and playlists. '''
        
        tracks_similarity = lil_matrix((len(self.tracks_index), len(self.tracks_index)), 
                                       dtype = float)
        
        for playlist_id in tqdm(self.playlists.index): 
            
            tracks_playlist = self.tracks[self.tracks.playlist_id == playlist_id]
            
            indexes = self._get_index(tracks_playlist.id)
            tracks_similarity[np.ix_(indexes, indexes)] = squareform(pdist(tracks_playlist, 
                                                                           metric = metric_songs))
        
        self.tracks_similarity = tracks_similarity
        
    def predict(self, given_tracks: pd.DataFrame, n_of_songs: int):
        
        n = len(given_tracks)
        
        indexes = self._get_index(given_tracks.id)
        similarity = self.tracks_similarity[indexes]
        tracks_chosen = self.get_similar_track(similarity, n_of_songs)
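        for track in tracks_chosen: 
            track_id = self._get_track_number(track)
            # DataFrame.append was removed in pandas 2.0, so concatenate a one-row frame instead
            new_row = self.tracks[self.tracks.id == track_id].iloc[[0]]
            given_tracks = pd.concat([given_tracks, new_row])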

        return given_tracks
    
    def training_accuracy(self, playlists: pd.DataFrame = None,rate = 0.7): 
        
        accuracy = []
        if playlists is None:
            playlists = self.playlists
        
        for playlist_id in tqdm(playlists.index): 
            
            testing_prediction = self.tracks[self.tracks.playlist_id == playlist_id]
            
            n = len(testing_prediction)
            if n == 0: 
                continue
            
            # Already known tracks
            j = int(rate*n)
            
            prediction = self.predict(testing_prediction.iloc[0:j], n - j)
            
            assert len(prediction) == n
        
            phat = set(prediction.id)       # Predicted tracks
            p = set(testing_prediction.id)  # Tracks actually in the playlist
            
            acc = self.accuracy_metric(phat, p, n, j)
            
            accuracy.append(acc)
        
        return np.mean(accuracy)
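
To make the accuracy measure concrete, here is a minimal sketch on a made-up playlist. The function `holdout_accuracy` repeats the formula of `accuracy_metric` outside the class; the track names and the suggestions are hypothetical.

```python
def holdout_accuracy(predicted: set, actual: set, n: int, j: int) -> float:
    """Fraction of the n - j hidden tracks that were recovered."""
    return (len(actual & predicted) - j) / (n - j)

actual = {f"t{i}" for i in range(10)}  # hypothetical playlist with n = 10 tracks
known = {f"t{i}" for i in range(7)}    # j = 7 tracks given to the model
guessed = {"t7", "t8", "x1"}           # 3 suggestions, 2 of them in the playlist

acc = holdout_accuracy(known | guessed, actual, n=10, j=7)
print(acc)  # 2 of the 3 hidden tracks were recovered
```

Subtracting `j` removes the tracks the model was given, so only the hidden part of the playlist counts towards the score.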

## Testing the Results

First, I split the `playlist_id` values into train and test sets. I also keep only the necessary features from the tracks. 
I drop the duplicates because I'm not interested in playlists with repeated tracks, given that I already know two equal songs have similarity 1. 

In [259]:
train, test = train_test_split(playlists_df.drop_duplicates().sample(frac = 0.01))
tracks_subset = tracks_df[features + ['id', 'playlist_id']]
tracks_subset = tracks_subset.drop_duplicates(['id', 'playlist_id'])

### Let's train the model

In [260]:
model = SimilarityModel(tracks_subset, train)

In [261]:
model.fit()

Checking the accuracy of the predicted songs for each playlist. 

In [263]:
acc = model.training_accuracy()
print(acc)

0.751512730425017


### Let's evaluate on the test set

The test index has to be set to `playlist_id` by hand, because `SimilarityModel` only does this automatically for the training playlists. 

In [264]:
test = test.set_index('playlist_id')

In [265]:
test_acc = model.training_accuracy(test)
print(test_acc)

0.001388888888888889
