# k-NN model

Given a playlist, we want to add more tracks to it: this is the **playlist continuation** problem. Following [Kelen et al.](https://dl.acm.org/doi/abs/10.1145/3267471.3267477), we define a similarity metric between two playlists, select the $k$ playlists most similar to ours, score every candidate track based on those neighbours, and recommend the highest-scoring tracks.

In [1]:
from scipy.sparse import lil_matrix
from tqdm.notebook import tqdm
import glob
import numpy as np
import pandas as pd

from spotipy.oauth2 import SpotifyClientCredentials
import spotipy

auth_manager = SpotifyClientCredentials()
sp = spotipy.Spotify(auth_manager=auth_manager)

## Treatment

Here we load and treat the tracks dataset.

In [2]:
tracks_dfs = (
    pd.read_pickle(file)[['id', 'playlist_id']] for file in glob.glob('../../data/sp_tracks_ready_*.pkl')
)
tracks_df = pd.concat(tracks_dfs, ignore_index=True)
tracks_df.dropna(inplace=True)

grouped_tracks = tracks_df.groupby('playlist_id')['id'].apply(list)
grouped_tracks

playlist_id
002tR8Orc24qAjOUA1WZky    [4nRPcD4a5QGH95ZnwGvjwr, 66e6O0B9GmSRNAnzB0Bm7...
005DlsAeMYKNkiH68qg42D    [1EzZKyFDupKMTkanau4nBz, 0sCNal67P5Nydob9zITXr...
005TzWiU0IWtwpvJRl0E2U    [5sjINZsTACv5hmwv6nSrST, 2kH9pYAE74wdKG80AtL73...
00CgYEsYFmBBMY2xMc8ll2    [1Y4URywn643uZPth46W1Y4, 3iKlGM4UeSOecjb7pj10n...
00CxBYCkFSb8rXXI2HuEDw    [1uxzFavoQSzR6NhzeSbHdM, 7MoJIfitwutdf2B9RbJdd...
                                                ...                        
7ziIxFZYxrk5PGjmTIHw8U    [2IHaGyfxNoFPLJnaEg4GTs, 45e8T07myBjutIx2xh8Kq...
7zlLMQZc0JP6yStjLopDk9    [1MJ5f5EYBC92ADD6xcz7nb, 2J4obtL5oOmstcOfq5UW3...
7znebC1MKbXiQiUqxDGVMx    [2Fxmhks0bxGSBdJ92vM42m, 2qT1uLXPVPzGgFOx4jtEu...
7zr3klk3xuOVCKyxV4USDO    [6x0jeG72v8tYQfpfox6dWh, 1zcWL2FZKee3ub0wbTfNG...
7zsK5nJQzCjJaTnjiAcvd0    [3ZqZ3IbrVbFTdTvGcK0vpm, 1KqpIQGNINpG0hDdbB9Pi...
Name: id, Length: 10638, dtype: object

Because we will use matrix multiplication, we have to map each playlist id to a row index and each track id to a column index of the matrix. We do this with dictionaries:

In [3]:
track_ids_go = dict(zip(set(tracks_df.id), range(len(set(tracks_df.id)))))
track_ids_back = dict(zip(track_ids_go.values(), track_ids_go.keys()))
playlist_ids = dict(zip(set(tracks_df.playlist_id), range(len(set(tracks_df.playlist_id)))))
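
The mappings above depend on the iteration order of a `set`, which for strings can differ from one Python process to the next, so the indices are not guaranteed to be reproducible across runs. As a minimal sketch of an alternative, assuming reproducible indices are wanted, the ids can be numbered in order of first appearance. The toy rows and the `toy_`-prefixed names below are hypothetical and are not part of the notebook.

```python
# Toy (track_id, playlist_id) pairs standing in for the rows of tracks_df.
toy_rows = [('t1', 'pA'), ('t2', 'pA'), ('t1', 'pB'), ('t3', 'pB')]

# dict.fromkeys keeps the first occurrence of each key in insertion order,
# so the resulting indices do not depend on set iteration order.
unique_tracks = list(dict.fromkeys(track for track, _ in toy_rows))
unique_playlists = list(dict.fromkeys(playlist for _, playlist in toy_rows))

toy_track_ids_go = {track: index for index, track in enumerate(unique_tracks)}
toy_track_ids_back = {index: track for track, index in toy_track_ids_go.items()}
toy_playlist_ids = {playlist: index for index, playlist in enumerate(unique_playlists)}

print(toy_track_ids_go)   # {'t1': 0, 't2': 1, 't3': 2}
print(toy_playlist_ids)   # {'pA': 0, 'pB': 1}
```

The forward dictionary is used when filling the matrix and the backward one when turning column indices back into track ids.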

We build the relevance matrix $R$. The entry $R_{ij} = r_{ij}$ indicates whether track $j$ is relevant to playlist $i$, that is, whether the track is in the playlist: $r_{ij} = 1$ if it is and $0$ otherwise.

In [4]:
R = lil_matrix((len(set(tracks_df.playlist_id)), len(set(tracks_df.id))))

for index, row in tqdm(tracks_df.iterrows(), total=len(tracks_df)):
    R[playlist_ids[row.playlist_id], track_ids_go[row.id]] = 1

## The model

The similarity between two playlists $u$ and $v$ is calculated by:
$$s_{uv} = \sum_{i \in I} \dfrac{r_{ui}r_{vi}}{||R_u||_2||R_v||_2}$$
$I$ is the set of tracks and $R_u$ is the vector of relevances $r_{ui}$ for the playlist $u$.

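This is the cosine similarity between the rows of $R$ for the two playlists. Because the relevances are binary, the numerator is the number of tracks in the intersection of the playlists, and each norm is the square root of the number of tracks in the corresponding playlist.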

In [5]:
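def similarity(playlist_1, playlist_2):
    """Calculate the cosine similarity between two playlists (lists of track ids)."""
    tracks_1, tracks_2 = set(playlist_1), set(playlist_2)
    if not tracks_1 or not tracks_2:
        return 0.0  # an empty playlist shares nothing with any other playlist
    # Relevances are binary, so the dot product is the intersection size
    # and each norm is the square root of the number of distinct tracks.
    return len(tracks_1 & tracks_2) / np.sqrt(len(tracks_1) * len(tracks_2))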
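
As a quick check of the claim that counting shared tracks is the same as the cosine formula, the sketch below computes the similarity both ways for two made-up playlists. The track ids are hypothetical, and the vectors are built only over the tracks that appear in either playlist, since every other entry is zero.

```python
import math

playlist_a = ['t1', 't2', 't3', 't4']
playlist_b = ['t%d' % n for n in range(2, 11)]   # t2 ... t10, nine tracks

# Set formulation: shared tracks over the square root of the product of sizes.
shared = len(set(playlist_a) & set(playlist_b))
set_similarity = shared / math.sqrt(len(set(playlist_a)) * len(set(playlist_b)))

# Vector formulation: cosine of the two binary relevance vectors.
all_tracks = sorted(set(playlist_a) | set(playlist_b))
r_a = [1 if track in playlist_a else 0 for track in all_tracks]
r_b = [1 if track in playlist_b else 0 for track in all_tracks]
dot = sum(x * y for x, y in zip(r_a, r_b))
norm_a = math.sqrt(sum(x * x for x in r_a))
norm_b = math.sqrt(sum(y * y for y in r_b))
vector_similarity = dot / (norm_a * norm_b)

print(set_similarity, vector_similarity)   # 0.5 0.5
```

The playlists share three tracks and have four and nine tracks, so both formulations give $3 / (2 \cdot 3) = 0.5$.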

Given a playlist $u$ to be continued, we compute its similarity to every existing playlist and select the $k$ most similar ones, which form the set $N_k(u)$. We then define the score of a track $i$ as a candidate for playlist $u$:
$$\hat{r}_{ui} = \dfrac{\sum_{v \in N_k(u)} s_{uv} \cdot r_{vi}}{\sum_{v \in N_k(u)} s_{uv}}$$

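The intuition is that a track gets a high score when it appears in many playlists that are highly similar to our playlist. We return the tracks ordered by score. The denominator is the same for every track, so it does not change the ordering, and the implementation skips it.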

In [6]:
def continuation(playlist, k):
    """Continue a playlist based on k most similar playlists."""
    # Similarity between the input playlist and every playlist in the dataset.
    s_u = lil_matrix((1, len(playlist_ids)))
    for alt_playlist_index, alt_playlist in grouped_tracks.items():
        s_u[0, playlist_ids[alt_playlist_index]] = similarity(playlist, alt_playlist)
    # Indices of the k most similar playlists, most similar first.
    sorted_similarities_indices = np.flip(np.argsort(s_u.toarray()[0]))
    top_k_similarities_indices = sorted_similarities_indices[:k]
    # Unnormalized scores: similarity-weighted sum of the neighbours' rows of R.
    scores = (s_u[0, top_k_similarities_indices]*R[top_k_similarities_indices, :]).toarray()[0]
    # All track ids, best score first; tracks already in the input are not filtered out.
    sorted_scores_indices = np.flip(np.argsort(scores))
    return [track_ids_back[index] for index in sorted_scores_indices]
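
To make the scoring formula concrete, here is a small worked example with dense NumPy arrays. The relevance matrix and the similarity values are invented for illustration; only the computation mirrors the formula for $\hat{r}_{ui}$. It also shows that dividing by the sum of similarities rescales the scores without changing their order.

```python
import numpy as np

# Toy relevance matrix: 4 playlists (rows) x 5 tracks (columns).
toy_R = np.array([
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1],
])
# Hypothetical similarities between our playlist and each toy playlist.
s_u = np.array([0.8, 0.5, 0.1, 0.0])
k = 2

# Indices of the k most similar playlists, most similar first.
neighbours = np.argsort(s_u)[::-1][:k]

# Similarity-weighted sum of the neighbours' relevance rows (the numerator),
# then division by the sum of the neighbours' similarities (the denominator).
raw_scores = s_u[neighbours] @ toy_R[neighbours, :]
scores = raw_scores / s_u[neighbours].sum()

ranking = np.argsort(scores)[::-1]
print(ranking[:3])   # [0 1 2]
```

Track 0 is in both neighbours and gets the maximum score of 1, track 1 is only in the most similar neighbour, and track 2 is only in the second one. Tracks 3 and 4 appear in no neighbour and score 0.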

## Sanity check

Before evaluating our model, it is worth doing a sanity check. We take the most-listened songs from The Beatles as our playlist and continue it with $k = 100$.

In [7]:
q = sp.search('The Beatles', type='artist')

Our playlist to be continued:

In [8]:
[track['name'] for track in sp.artist_top_tracks(q['artists']['items'][0]['id'])['tracks']]

['Here Comes The Sun - Remastered 2009',
 'Let It Be - Remastered 2009',
 'Come Together - Remastered 2009',
 'Yesterday - Remastered 2009',
 'Hey Jude - Remastered 2015',
 'Blackbird - Remastered 2009',
 'Twist And Shout - Remastered 2009',
 'I Want To Hold Your Hand - Remastered 2015',
 'In My Life - Remastered 2009',
 'Help! - Remastered 2009']

Continuing...

In [9]:
result = continuation([track['id'] for track in sp.artist_top_tracks(q['artists']['items'][0]['id'])['tracks']], 100)

We look up the first 50 recommended tracks:

In [10]:
q = sp.tracks(result[:50])

In [11]:
[(q['tracks'][i]['name'], q['tracks'][i]['artists'][0]['name']) for i in range(len(q['tracks']))]

[('Here Comes The Sun - Remastered 2009', 'The Beatles'),
 ('Come Together - Remastered 2009', 'The Beatles'),
 ('Hey Jude - Remastered 2015', 'The Beatles'),
 ('Let It Be - Remastered 2009', 'The Beatles'),
 ('I Want To Hold Your Hand - Remastered 2015', 'The Beatles'),
 ('Blackbird - Remastered 2009', 'The Beatles'),
 ('Yesterday - Remastered 2009', 'The Beatles'),
 ('Help! - Remastered 2009', 'The Beatles'),
 ('Got To Get You Into My Life - Remastered 2009', 'The Beatles'),
 ('While My Guitar Gently Weeps - Remastered 2009', 'The Beatles'),
 ('Billie Jean', 'Michael Jackson'),
 ('Strawberry Fields Forever - Remastered 2009', 'The Beatles'),
 ('Eleanor Rigby - Remastered 2009', 'The Beatles'),
 ('All You Need Is Love - Remastered 2009', 'The Beatles'),
 ('Bohemian Rhapsody - 2011 Mix', 'Queen'),
 ('Twist And Shout - Remastered 2009', 'The Beatles'),
 ('Beat It - Single Version', 'Michael Jackson'),
 ('All Along the Watchtower', 'Jimi Hendrix'),
 ('Lucy In The Sky With Diamonds - Rema

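Well, it seems to be working: the visible part of the list is dominated by The Beatles, mixed with other classic rock and pop artists such as Michael Jackson, Queen and Jimi Hendrix. Note that the seed tracks themselves appear near the top, because `continuation` does not exclude tracks that are already in the playlist.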