## ALS Example with the Implicit Python Library using the *Spotify Million Playlist* Dataset
#### The dataset can be found [here](https://www.aicrowd.com/challenges/spotify-million-playlist-dataset-challenge), along with details about the data, the RecSys challenge, etc.

The aim of this notebook is to implement Alternating Least Squares (ALS) matrix factorization on the Spotify Million Playlist dataset.

### Initial Comments:
The format is a nested JSON list: the outer level holds each playlist and its corresponding information (i.e. playlist title and other playlist meta data), and the nested portion holds all the songs in that playlist, each with its song title and other meta data.
1. How do I want to organize the initial dataframe to work with? How should the playlist and song meta data be represented? Is it even important enough?
2. If I want to somehow include the meta data for playlists and/or songs, how would this eventually be represented in the user-item matrix?
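
One possible answer to question 1 is to flatten the nested structure into one row per (playlist, song) pair. The sketch below uses `pd.json_normalize` on a hypothetical miniature of a slice file. The toy playlists and tracks are made up, and only a few of the real fields are included. In the real data both the playlist and its tracks carry a `duration_ms` field, so adding that field to `meta` would need a `meta_prefix` to avoid a name clash.

```python
import pandas as pd

# Hypothetical miniature of one slice file (made-up playlists and tracks)
sample = {
    "playlists": [
        {
            "name": "demo",
            "pid": 0,
            "tracks": [
                {"pos": 0, "track_uri": "spotify:track:a", "track_name": "Song A"},
                {"pos": 1, "track_uri": "spotify:track:b", "track_name": "Song B"},
            ],
        },
        {
            "name": "other",
            "pid": 1,
            "tracks": [
                {"pos": 0, "track_uri": "spotify:track:b", "track_name": "Song B"},
            ],
        },
    ]
}

# One row per (playlist, track); playlist-level fields are repeated on each row
flat = pd.json_normalize(sample["playlists"], record_path="tracks", meta=["pid", "name"])
print(flat[["pid", "name", "pos", "track_uri", "track_name"]])
```

This gives the same "each row is a song" layout that the cells below build by hand, without storing the raw `tracks` dictionary in a column.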

### Initial Approach:
To begin, I will not use any meta data in the initial item-user matrix. All I want to have initially is a playlist-song matrix, where the value in each playlist-song cell is binary: 0 if that song is **not** in that playlist, and 1 if it **is** in that playlist.

In [1]:
import sys, os, json
import pandas as pd
import numpy as np
import scipy.sparse as sparse
from scipy.sparse.linalg import spsolve
import random
from glob import glob
from sklearn.preprocessing import MinMaxScaler
pd.options.mode.chained_assignment = None  # default='warn'

In [2]:
import implicit

In [3]:
# Getting a list of all the individual JSON file paths
# Inspiration for this code from: https://stackoverflow.com/questions/57067551/how-to-read-multiple-json-files-into-pandas-dataframe
# glob already returns a list, so no explicit loop is needed
file_list = glob('spotify-data/spotify_million_playlist_dataset/data/*.json')

In [4]:
# Reading the data into a dataframe
# Note that the data is split up into multiple JSONs which all need to be read into the same dataframe

# Empty list to store the dataframes
dfs = [] 

# For initial testing purpose I will only use the first 10 JSON files
file_list_10 = file_list[0:10]

for file in file_list_10:
    # Use a context manager so each file handle is closed after reading
    with open(file) as f:
        data = json.load(f)
    for playlist in data["playlists"]:
        df = pd.DataFrame(playlist)
        dfs.append(df)

In [5]:
# Sanity check: there are 1000 playlists per file, and 10 files --> should have 10,000 dataframes (10,000 playlists)
print(len(dfs))

10000


In [6]:
# Still need to perform some processing because right now, each row is a song
dfs[0]

Unnamed: 0,name,collaborative,pid,modified_at,num_tracks,num_albums,num_followers,tracks,num_edits,duration_ms,num_artists
0,Bob Dylan,false,549000,1454803200,75,65,1,"{'pos': 0, 'artist_name': 'Bob Dylan', 'track_...",28,18425368,39
1,Bob Dylan,false,549000,1454803200,75,65,1,"{'pos': 1, 'artist_name': 'Bob Dylan', 'track_...",28,18425368,39
2,Bob Dylan,false,549000,1454803200,75,65,1,"{'pos': 2, 'artist_name': 'Loggins & Messina',...",28,18425368,39
3,Bob Dylan,false,549000,1454803200,75,65,1,"{'pos': 3, 'artist_name': 'Bob Dylan', 'track_...",28,18425368,39
4,Bob Dylan,false,549000,1454803200,75,65,1,"{'pos': 4, 'artist_name': 'Bob Dylan', 'track_...",28,18425368,39
...,...,...,...,...,...,...,...,...,...,...,...
70,Bob Dylan,false,549000,1454803200,75,65,1,"{'pos': 70, 'artist_name': 'Bruce Springsteen'...",28,18425368,39
71,Bob Dylan,false,549000,1454803200,75,65,1,"{'pos': 71, 'artist_name': 'Bruce Springsteen'...",28,18425368,39
72,Bob Dylan,false,549000,1454803200,75,65,1,"{'pos': 72, 'artist_name': 'Bruce Springsteen'...",28,18425368,39
73,Bob Dylan,false,549000,1454803200,75,65,1,"{'pos': 73, 'artist_name': 'Eric Church', 'tra...",28,18425368,39


In [7]:
# Make song and playlist dictionaries (ID -> name) because we will only use their IDs in ALS
# Also want to modify the existing dataframes to better represent song info
song_dict = {}
playlist_dict = {}
for df in dfs:
    key = df["pid"][0]
    playlist_dict[key] = df["name"][0]
    
    # Add song info columns
    df["pos"] = np.nan
    df["artist_name"] = np.nan
    df["track_uri"] = np.nan
    df["artist_uri"] = np.nan
    df["track_name"] = np.nan
    df["album_uri"] = np.nan
    df["duration_ms"] = np.nan
    df["album_name"] = np.nan
    
    # Iterate over each row - which is each song in this playlist
    for ind in df.index:
        song = df["tracks"][ind]
        if song["track_uri"] not in song_dict:
            song_dict[song["track_uri"]] = song["track_name"]
            
        # Add correctly formatted song info columns to the dataframe
        df["pos"][ind] = song["pos"]
        df["artist_name"][ind] = song["artist_name"]
        df["track_uri"][ind] = song["track_uri"]
        df["artist_uri"][ind] = song["artist_uri"]
        df["track_name"][ind] = song["track_name"]
        df["album_uri"][ind] = song["album_uri"]
        df["duration_ms"][ind] = song["duration_ms"]
        df["album_name"][ind] = song["album_name"]  

In [8]:
# Combine all the separate playlist dataframes into one big dataframe
full_data = pd.concat(dfs, ignore_index=True)

In [9]:
full_data

Unnamed: 0,name,collaborative,pid,modified_at,num_tracks,num_albums,num_followers,tracks,num_edits,duration_ms,num_artists,pos,artist_name,track_uri,artist_uri,track_name,album_uri,album_name,description
0,Bob Dylan,false,549000,1454803200,75,65,1,"{'pos': 0, 'artist_name': 'Bob Dylan', 'track_...",28,277106.0,39,0.0,Bob Dylan,spotify:track:6QHYEZlm9wyfXfEM1vSu1P,spotify:artist:74ASZWbe4lXaubB36ztrGX,Boots of Spanish Leather,spotify:album:7DZeLXvr9eTVpyI1OlqtcS,The Times They Are A-Changin',
1,Bob Dylan,false,549000,1454803200,75,65,1,"{'pos': 1, 'artist_name': 'Bob Dylan', 'track_...",28,330533.0,39,1.0,Bob Dylan,spotify:track:3RkQ3UwOyPqpIiIvGVewuU,spotify:artist:74ASZWbe4lXaubB36ztrGX,Mr. Tambourine Man,spotify:album:1lPoRKSgZHQAYXxzBsOQ7v,Bringing It All Back Home,
2,Bob Dylan,false,549000,1454803200,75,65,1,"{'pos': 2, 'artist_name': 'Loggins & Messina',...",28,254653.0,39,2.0,Loggins & Messina,spotify:track:0ju1jP0cSPJ8tmojYBEI89,spotify:artist:7emRV8AluG3d4e5T0DZiK9,Danny's Song,spotify:album:5BWgJaesMjpJWCTU9sgUPf,The Best: Loggins & Messina Sittin' In Again,
3,Bob Dylan,false,549000,1454803200,75,65,1,"{'pos': 3, 'artist_name': 'Bob Dylan', 'track_...",28,412200.0,39,3.0,Bob Dylan,spotify:track:7ny2ATvjtKszCpLpfsGnVQ,spotify:artist:74ASZWbe4lXaubB36ztrGX,A Hard Rain's A-Gonna Fall,spotify:album:0o1uFxZ1VTviqvNaYkTJek,The Freewheelin' Bob Dylan,
4,Bob Dylan,false,549000,1454803200,75,65,1,"{'pos': 4, 'artist_name': 'Bob Dylan', 'track_...",28,165426.0,39,4.0,Bob Dylan,spotify:track:18GiV1BaXzPVYpp9rmOg0E,spotify:artist:74ASZWbe4lXaubB36ztrGX,Blowin' In the Wind,spotify:album:0o1uFxZ1VTviqvNaYkTJek,The Freewheelin' Bob Dylan,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
666650,october,false,679999,1506816000,25,24,1,"{'pos': 20, 'artist_name': 'Sean Paul', 'track...",2,246706.0,22,20.0,Sean Paul,spotify:track:29LHe8kG3PraghUZOZYsw4,spotify:artist:3Isy6kedDrgPYoTS1dazA9,Baby Boy (feat. Beyonce),spotify:album:3UdSdz4TjW3tjmTZE03Ehv,Dutty Rock,
666651,october,false,679999,1506816000,25,24,1,"{'pos': 21, 'artist_name': 'Shakira', 'track_u...",2,218093.0,22,21.0,Shakira,spotify:track:3ZFTkvIE7kyPt6Nu3PEa7V,spotify:artist:0EmeFodog0BfCgMzAIvKQp,Hips Don't Lie,spotify:album:5ppnlEoj4HdRRdRihnY3jU,Oral Fixation Vol. 2,
666652,october,false,679999,1506816000,25,24,1,"{'pos': 22, 'artist_name': 'Destiny's Child', ...",2,254040.0,22,22.0,Destiny's Child,spotify:track:7qtAgn9mwxygsPOsUDVRRt,spotify:artist:1Y8cdNmUJH7yBTd9yOvr5i,Survivor,spotify:album:0IVseR3zfrrInlKJQNh294,Survivor,
666653,october,false,679999,1506816000,25,24,1,"{'pos': 23, 'artist_name': 'Destiny's Child', ...",2,245400.0,22,23.0,Destiny's Child,spotify:track:0FZvjrHpAmLKj574M4VwrF,spotify:artist:1Y8cdNmUJH7yBTd9yOvr5i,Cater 2 U,spotify:album:3xjdyJjSMNsSRkj3GTmBLi,Destiny Fulfilled,


In [10]:
# Sanity check
# The number of rows in the dataframe should be greater than or equal to the number of songs in the dictionary
# Dictionary should contain all the songs in the dataframe WITHOUT duplicates
# **** Check this/fix
print("song_dict length = ", len(song_dict))
print("full dataframe # of rows = ", len(full_data))

song_dict length =  173217
full dataframe # of rows =  666655
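
A stricter version of this check could compare the number of distinct track URIs in the dataframe with the dictionary size, and count how often the same song appears more than once in a single playlist. The sketch below runs on a made-up toy frame. `toy_full` and `toy_song_dict` are hypothetical stand-ins for `full_data` and `song_dict`.

```python
import pandas as pd

# Hypothetical stand-ins for full_data and song_dict
toy_full = pd.DataFrame({
    "pid": [0, 0, 1, 1],
    "track_uri": ["spotify:track:a", "spotify:track:b", "spotify:track:a", "spotify:track:a"],
})
toy_song_dict = {"spotify:track:a": "Song A", "spotify:track:b": "Song B"}

# Distinct songs in the dataframe should match the dictionary size exactly
n_unique = toy_full["track_uri"].nunique()
print("unique track URIs =", n_unique, "| dictionary size =", len(toy_song_dict))

# Rows that repeat an earlier (playlist, song) pair
n_repeats = int(toy_full.duplicated(subset=["pid", "track_uri"]).sum())
print("repeated (pid, track_uri) rows =", n_repeats)
```

Repeated pairs matter later: they do not break anything, but they affect whether the playlist-song matrix ends up strictly binary.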


In [30]:
# Drop the columns we don't want - everything except pid and track_uri
# Add a column with all 1s since each row is a song that is present in the corresponding playlist
# We need this when creating the sparse matrices (maybe there is a better way for this)****
data = full_data.copy()
data = data.drop(labels=["name", "collaborative", "modified_at", "num_tracks", "num_albums", "num_followers", "tracks",
                        "num_edits", "duration_ms", "num_artists", "pos", "artist_name", "artist_uri", "track_name",
                        "album_uri", "album_name", "description"], axis=1)
data["present"] = 1
data.head(99)

Unnamed: 0,pid,track_uri,present
0,549000,spotify:track:6QHYEZlm9wyfXfEM1vSu1P,1
1,549000,spotify:track:3RkQ3UwOyPqpIiIvGVewuU,1
2,549000,spotify:track:0ju1jP0cSPJ8tmojYBEI89,1
3,549000,spotify:track:7ny2ATvjtKszCpLpfsGnVQ,1
4,549000,spotify:track:18GiV1BaXzPVYpp9rmOg0E,1
...,...,...,...
94,549001,spotify:track:5LnEJ9AkSrp2uxISQtj0eo,1
95,549001,spotify:track:0R0zZnqPg7yOWb4PRmW8nC,1
96,549001,spotify:track:65gMVplMNmDkbE6wwsDBD4,1
97,549001,spotify:track:1FkgoPdajl8gwC1hlyvHtC,1


In [12]:
# Need to convert the track_uri column (categorical) to a numerical column
data['track_uri'] = data['track_uri'].astype("category")
data['track_num_code'] = data['track_uri'].cat.codes

In [13]:
data

Unnamed: 0,pid,track_uri,present,track_num_code
0,549000,spotify:track:6QHYEZlm9wyfXfEM1vSu1P,1,142739
1,549000,spotify:track:3RkQ3UwOyPqpIiIvGVewuU,1,76489
2,549000,spotify:track:0ju1jP0cSPJ8tmojYBEI89,1,16371
3,549000,spotify:track:7ny2ATvjtKszCpLpfsGnVQ,1,168946
4,549000,spotify:track:18GiV1BaXzPVYpp9rmOg0E,1,25071
...,...,...,...,...
666650,679999,spotify:track:29LHe8kG3PraghUZOZYsw4,1,47568
666651,679999,spotify:track:3ZFTkvIE7kyPt6Nu3PEa7V,1,79229
666652,679999,spotify:track:7qtAgn9mwxygsPOsUDVRRt,1,169947
666653,679999,spotify:track:0FZvjrHpAmLKj574M4VwrF,1,5658


`csr_matrix((data, (row_ind, col_ind)), [shape=(M, N)])`
where `data`, `row_ind` and `col_ind` satisfy the relationship `a[row_ind[k], col_ind[k]] = data[k]`.
[SciPy Documentation about `scipy.sparse.csr_matrix`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html)
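
To make the relationship concrete, here is a small sketch on toy indices (the arrays are made up). It also shows one detail relevant to the binary matrix described above: when the same `(row, col)` pair occurs more than once, the constructor sums the values, so a song that appears twice in a playlist would get a 2 rather than a 1.

```python
import numpy as np
import scipy.sparse as sparse

# Toy (playlist, song) pairs; the pair (0, 2) appears twice on purpose
row_ind = np.array([0, 0, 1, 0])
col_ind = np.array([1, 2, 0, 2])
values = np.ones(len(row_ind), dtype=np.float32)

m = sparse.csr_matrix((values, (row_ind, col_ind)), shape=(2, 3))
dense_before = m.toarray()
print(dense_before)  # the duplicated pair (0, 2) is summed to 2

# Reset every stored value to 1 to keep the matrix strictly binary
m.data[:] = 1
print(m.toarray())
```

Whether to keep the summed counts (as a rough confidence signal) or clip them to 1 is a modelling choice; the sketch only shows how to get the binary version.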

In [14]:
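# Build both orientations of the playlist-song matrix:
# (1) Item-user matrix (songs x playlists)
# (2) User-item matrix (playlists x songs) --> used below for fitting and for the recommendation
# Note: implicit versions before 0.5 expected the item-user matrix in fit(); newer versions take the user-item matrix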
sparse_item_user = sparse.csr_matrix((data['present'], (data['track_num_code'], data['pid'])))
sparse_user_item = sparse.csr_matrix((data['present'], (data['pid'], data['track_num_code'])))
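
Because the raw `pid` values are used directly as row indices, the matrix above gets one row for every integer up to the largest `pid` (about 680,000 rows for only 10,000 playlists). A possible alternative, sketched here on a made-up toy frame, is to encode `pid` with category codes in the same way as `track_uri`. The names `toy`, `pid_cat`, `track_cat` and `row_of_pid` are hypothetical.

```python
import pandas as pd
import scipy.sparse as sparse

# Toy data with the same pid range as the notebook (made-up tracks)
toy = pd.DataFrame({
    "pid": [549000, 549000, 549001, 679999],
    "track_uri": ["spotify:track:a", "spotify:track:b", "spotify:track:a", "spotify:track:c"],
})
toy["present"] = 1

# Compact 0..n-1 codes for both playlists and tracks
pid_cat = toy["pid"].astype("category")
track_cat = toy["track_uri"].astype("category")

user_item = sparse.csr_matrix(
    (toy["present"].to_numpy(),
     (pid_cat.cat.codes.to_numpy(), track_cat.cat.codes.to_numpy())),
    shape=(len(pid_cat.cat.categories), len(track_cat.cat.categories)),
)
print(user_item.shape)  # (3, 3) instead of (680000, 3) with raw pids

# Map a playlist ID to its row position
row_of_pid = {pid: i for i, pid in enumerate(pid_cat.cat.categories)}
print(row_of_pid[549001])
```

With this encoding, calls that take a playlist index would need `row_of_pid[pid]` rather than the raw `pid`.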

In [15]:
# Initialize the ALS model (it is fitted in the next cell)
model = implicit.als.AlternatingLeastSquares(factors=20, regularization=0.1, iterations=20)

  "Intel MKL BLAS detected. Its highly recommend to set the environment "


In [16]:
# Fit the model
model.fit(sparse_user_item)

  0%|          | 0/20 [00:00<?, ?it/s]
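
For intuition about what the fit is doing, here is a minimal dense NumPy sketch of ALS for implicit feedback, in the style of the confidence-weighted formulation (confidence `1 + alpha * r`, binary preference). It is not implicit's actual implementation, which is compiled and works on sparse data. The function `als_user_step`, the toy matrix `R` and the `alpha` value are assumptions for illustration only.

```python
import numpy as np

def als_user_step(R, Y, reg, alpha=1.0):
    """Solve for all row factors of R while the column factors Y are held fixed.

    R: dense (n_rows, n_cols) interaction matrix, Y: (n_cols, k) factors.
    Confidence is c = 1 + alpha * r and preference is p = 1 if r > 0 else 0.
    """
    n_rows, k = R.shape[0], Y.shape[1]
    X = np.zeros((n_rows, k))
    for u in range(n_rows):
        c = 1.0 + alpha * R[u]
        p = (R[u] > 0).astype(float)
        # Regularized weighted least squares: (Y^T C Y + reg * I) x = Y^T C p
        A = Y.T @ (c[:, None] * Y) + reg * np.eye(k)
        b = Y.T @ (c * p)
        X[u] = np.linalg.solve(A, b)
    return X

# Toy playlist-song matrix: 3 playlists x 4 songs
R = np.array([[1, 0, 1, 0],
              [0, 1, 1, 0],
              [1, 1, 0, 1]], dtype=float)

rng = np.random.default_rng(0)
k = 2  # number of latent factors
X = rng.normal(scale=0.1, size=(R.shape[0], k))  # playlist factors
Y = rng.normal(scale=0.1, size=(R.shape[1], k))  # song factors

for _ in range(20):
    X = als_user_step(R, Y, reg=0.1)    # update playlists with songs fixed
    Y = als_user_step(R.T, X, reg=0.1)  # same update with the roles swapped

print(np.round(X @ Y.T, 2))  # reconstructed preference scores
```

Each half-step is an exact least-squares solve, which is why the two sets of factors can simply be alternated. `factors`, `regularization` and `iterations` in the model above play the roles of `k`, `reg` and the loop count here.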

In [17]:
# Get song name based on code
song_name = name = data.loc[data["track_num_code"]==142739].iloc[0].track_uri
print(song_dict[name])

Boots of Spanish Leather
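
Filtering the whole dataframe for every lookup works, but the category codes can also be mapped back directly: `cat.categories[k]` is the value that was assigned code `k`. A small sketch on made-up data, where `toy` and `toy_song_dict` are hypothetical stand-ins for `data` and `song_dict`:

```python
import pandas as pd

# Hypothetical stand-in for the data frame with a categorical track_uri column
toy = pd.DataFrame({"track_uri": ["spotify:track:b", "spotify:track:a", "spotify:track:b"]})
toy["track_uri"] = toy["track_uri"].astype("category")
toy["track_num_code"] = toy["track_uri"].cat.codes

# Hypothetical stand-in for song_dict (track URI -> track name)
toy_song_dict = {"spotify:track:a": "Song A", "spotify:track:b": "Song B"}

# categories[k] is the URI that was assigned code k
code_to_uri = toy["track_uri"].cat.categories
print(toy_song_dict[code_to_uri[0]])
```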


In [18]:
# FIND SIMILAR ITEMS
# Find the 10 most similar songs to Boots of Spanish Leather
song_num_code = 142739 # Boots of Spanish Leather by Bob Dylan
n_similar = 10

In [19]:
# Use implicit to get similar items.
# This will output two arrays:
# (1) Array containing IDs of the top 10 songs
# (2) Array containing the corresponding song similarities (scores)
# Is this just using cosine similarity between the song vectors?
similar = model.similar_items(song_num_code, n_similar)

In [20]:
# Print the names of our most similar songs
# First array is the IDs of the songs
# Second array is the corresponding cosine similarities (I think - not 100% sure how this works internally)
print(similar)

(array([142739, 144729, 115904, 158164, 138045, 137681, 134518,  98385,
        95230,  87940], dtype=int32), array([1.        , 0.9533271 , 0.95306385, 0.9472533 , 0.9449832 ,
       0.9449832 , 0.9449832 , 0.9449832 , 0.9449832 , 0.9449832 ],
      dtype=float32))
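
The score of exactly 1.0 for the query song itself is consistent with cosine similarity between the learned song vectors. Here is a sketch of that computation in plain NumPy, using random vectors as a hypothetical stand-in for the fitted song factors. `cosine_top_n` is an illustrative helper, not part of implicit.

```python
import numpy as np

rng = np.random.default_rng(0)
item_factors = rng.normal(size=(5, 3))  # hypothetical stand-in for the fitted song factors

def cosine_top_n(factors, item_id, n):
    """Return (ids, scores) of the n items with the highest cosine similarity to item_id."""
    norms = np.linalg.norm(factors, axis=1)
    scores = factors @ factors[item_id] / (norms * norms[item_id])
    ids = np.argsort(-scores)[:n]
    return ids, scores[ids]

ids, scores = cosine_top_n(item_factors, item_id=2, n=3)
print(ids, scores)  # the query item comes first with a score of about 1.0
```

Running the same function on the real factors and comparing against `similar` would be one way to confirm the guess in the comments above.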


In [21]:
for song_num_id in similar[0]:
    name = data.loc[data["track_num_code"]==song_num_id].iloc[0].track_uri
    print(song_dict[name])

Boots of Spanish Leather
Kind & Generous
Have A Heart
Famous Blue Raincoat
You're The Boss
The Killing Moon - BBC Evening Session January 15, 1997
Whir
Danny Diamond
Plenty More
Wished For You


In [31]:
# Make a recommendation of songs for a playlist
recommendations = model.recommend(549001, sparse_user_item[549001])

In [32]:
print(recommendations)

(array([125139, 122885,  11980,  53578,  93886,   8684,  39189,  59664,
       120301, 119697], dtype=int32), array([0.09691456, 0.08740799, 0.08117663, 0.07241045, 0.07166649,
       0.0702515 , 0.06772389, 0.0672849 , 0.06702054, 0.0649409 ],
      dtype=float32))
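
As far as I understand it, the first array holds track codes and the second holds the matching scores, where each score is the dot product of the playlist's factor vector with a song's factor vector, and songs already in the playlist are skipped by default. The sketch below reproduces that logic with random factors. All arrays are hypothetical stand-ins, not values from the fitted model.

```python
import numpy as np

rng = np.random.default_rng(1)
user_factors = rng.normal(size=(2, 3))   # hypothetical playlist factors
item_factors = rng.normal(size=(6, 3))   # hypothetical song factors
already_in_playlist = np.array([0, 4])   # hypothetical songs already in playlist 0

# Score every song for playlist 0, then exclude the ones it already contains
scores = item_factors @ user_factors[0]
scores[already_in_playlist] = -np.inf

top_ids = np.argsort(-scores)[:3]
top_scores = scores[top_ids]
print(top_ids, top_scores)
```

The returned codes can be turned into song names the same way as for the similar songs above, by mapping each code back to its `track_uri` and looking it up in `song_dict`.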
