## ALS Implicit Cython Library Example using *lastfm-360K* Dataset
#### The dataset can be downloaded [here](http://mtg.upf.edu/static/datasets/last.fm/lastfm-dataset-360K.tar.gz); details about the data are [here](https://www.upf.edu/web/mtg/lastfm360k).
#### Some code is adapted from the [ALS Implicit Collaborative Filtering article](https://medium.com/radon-dev/als-implicit-collaborative-filtering-5ed653ba39fe).

In [11]:
import sys
import pandas as pd
import numpy as np
import scipy.sparse as sparse
from scipy.sparse.linalg import spsolve
import random
from sklearn.preprocessing import MinMaxScaler

In [12]:
import implicit

In [13]:
# Read the data
# Organized as --> user_id \t artist_id \t artist_name \t #_plays
# Info about the dataset: https://www.upf.edu/web/mtg/lastfm360k
# Note: the file has no header row, so the first record ends up as the column names below
raw_data = pd.read_table('lastfm-dataset-360K/usersha1-artmbid-artname-plays.tsv')
raw_data

Unnamed: 0,00000c289a1829a808ac09c00daf10bc3c4e223b,3bd73256-3905-4f3a-97e2-8b341527f805,betty blowtorch,2137
0,00000c289a1829a808ac09c00daf10bc3c4e223b,f2fb0ff0-5679-42ec-a55c-15109ce6e320,die Ärzte,1099
1,00000c289a1829a808ac09c00daf10bc3c4e223b,b3ae82c2-e60b-4551-a76d-6620f1b456aa,melissa etheridge,897
2,00000c289a1829a808ac09c00daf10bc3c4e223b,3d6bbeb7-f90e-4d10-b440-e153c0d10b53,elvenking,717
3,00000c289a1829a808ac09c00daf10bc3c4e223b,bbd2ffd7-17f4-4506-8572-c1ea58c3f9a8,juliette & the licks,706
4,00000c289a1829a808ac09c00daf10bc3c4e223b,8bfac288-ccc5-448d-9573-c33ea2aa5c30,red hot chili peppers,691
...,...,...,...,...
17535649,"sep 20, 2008",7ffd711a-b34d-4739-8aab-25e045c246da,turbostaat,12
17535650,"sep 20, 2008",9201190d-409f-426b-9339-9bd7492443e2,cuba missouri,11
17535651,"sep 20, 2008",e7cf7ff9-ed2f-4315-aca8-bcbd3b2bfa71,little man tate,11
17535652,"sep 20, 2008",f6f2326f-6b25-4170-b89d-e235b25508e8,sigur rós,10


In [14]:
# Drop the artist ID (MBID) column and name the remaining columns
raw_data = raw_data.drop(raw_data.columns[1], axis=1)
raw_data.columns = ['user', 'artist', 'plays']
raw_data

Unnamed: 0,user,artist,plays
0,00000c289a1829a808ac09c00daf10bc3c4e223b,die Ärzte,1099
1,00000c289a1829a808ac09c00daf10bc3c4e223b,melissa etheridge,897
2,00000c289a1829a808ac09c00daf10bc3c4e223b,elvenking,717
3,00000c289a1829a808ac09c00daf10bc3c4e223b,juliette & the licks,706
4,00000c289a1829a808ac09c00daf10bc3c4e223b,red hot chili peppers,691
...,...,...,...
17535649,"sep 20, 2008",turbostaat,12
17535650,"sep 20, 2008",cuba missouri,11
17535651,"sep 20, 2008",little man tate,11
17535652,"sep 20, 2008",sigur rós,10


In [15]:
# Check if there are any NaN values (we then drop those rows)
print(raw_data.isna().sum())

user        0
artist    204
plays       0
dtype: int64


In [16]:
# Drop the rows that contain NaN values (the 204 rows with a missing artist name)
data = raw_data.copy()
data = data.dropna()
data

Unnamed: 0,user,artist,plays
0,00000c289a1829a808ac09c00daf10bc3c4e223b,die Ärzte,1099
1,00000c289a1829a808ac09c00daf10bc3c4e223b,melissa etheridge,897
2,00000c289a1829a808ac09c00daf10bc3c4e223b,elvenking,717
3,00000c289a1829a808ac09c00daf10bc3c4e223b,juliette & the licks,706
4,00000c289a1829a808ac09c00daf10bc3c4e223b,red hot chili peppers,691
...,...,...,...
17535649,"sep 20, 2008",turbostaat,12
17535650,"sep 20, 2008",cuba missouri,11
17535651,"sep 20, 2008",little man tate,11
17535652,"sep 20, 2008",sigur rós,10


In [17]:
# Create numeric columns for categorical columns
data['user'] = data['user'].astype("category")
data['artist'] = data['artist'].astype("category")
data['user_id'] = data['user'].cat.codes
data['artist_id'] = data['artist'].cat.codes
data

Unnamed: 0,user,artist,plays,user_id,artist_id
0,00000c289a1829a808ac09c00daf10bc3c4e223b,die Ärzte,1099,0,90933
1,00000c289a1829a808ac09c00daf10bc3c4e223b,melissa etheridge,897,0,185367
2,00000c289a1829a808ac09c00daf10bc3c4e223b,elvenking,717,0,106704
3,00000c289a1829a808ac09c00daf10bc3c4e223b,juliette & the licks,706,0,155241
4,00000c289a1829a808ac09c00daf10bc3c4e223b,red hot chili peppers,691,0,220128
...,...,...,...,...,...
17535649,"sep 20, 2008",turbostaat,12,358867,271740
17535650,"sep 20, 2008",cuba missouri,11,358867,78482
17535651,"sep 20, 2008",little man tate,11,358867,171784
17535652,"sep 20, 2008",sigur rós,10,358867,235118
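The category codes can also be mapped back to names without filtering the whole DataFrame, because code `k` is the position of the value in `cat.categories`. The sketch below uses a made-up toy frame; `artist_name` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({"artist": ["jay-z", "elvenking", "jay-z", "sigur rós"]})
toy["artist"] = toy["artist"].astype("category")
toy["artist_id"] = toy["artist"].cat.codes

# Inferred categories are sorted; code k is the position of the value in this index
artist_names = toy["artist"].cat.categories
n_artists = len(artist_names)

def artist_name(code):
    """Return the artist name for a code, or None if the code is out of range."""
    if 0 <= code < n_artists:
        return artist_names[code]
    return None

print(list(toy["artist_id"]))
print(artist_name(1), artist_name(99))
```

Returning `None` for an out-of-range code makes it obvious when an ID does not correspond to any artist, instead of failing later with an indexing error.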


`csr_matrix((data, (row_ind, col_ind)), [shape=(M, N)])`
where `data`, `row_ind` and `col_ind` satisfy the relationship `a[row_ind[k], col_ind[k]] = data[k]`.
[SciPy Documentation about `scipy.sparse.csr_matrix`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html)
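As a small illustration of that constructor form, the sketch below builds a 3x4 matrix from three made-up values. The array names and numbers are placeholders chosen for this example only.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy values (made up): three play counts and their coordinates
plays = np.array([5.0, 2.0, 7.0])
row_ind = np.array([0, 0, 2])  # e.g. artist codes
col_ind = np.array([1, 3, 0])  # e.g. user codes

# The shape is given explicitly here; if omitted, SciPy infers it from the largest indices
toy = csr_matrix((plays, (row_ind, col_ind)), shape=(3, 4))

# Every stored entry satisfies dense[row_ind[k], col_ind[k]] == plays[k]
dense = toy.toarray()
print(dense)
```

All positions that are not listed in `row_ind`/`col_ind` stay implicit zeros, which is what keeps the full user-artist matrix small enough to hold in memory.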

In [18]:
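# Build the play-count matrix in both orientations:
# (1) Item-user matrix (rows = artists, columns = users) --> For fitting the model
# (2) User-item matrix (rows = users, columns = artists) --> For the recommendation
# Which orientation fit() expects can differ between implicit versions, so check the docs of the installed release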
sparse_item_user = sparse.csr_matrix((data['plays'].astype(float), (data['artist_id'], data['user_id'])))
sparse_user_item = sparse.csr_matrix((data['plays'].astype(float), (data['user_id'], data['artist_id'])))
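The two matrices built this way hold the same values and differ only in orientation: one is the transpose of the other. A minimal check on made-up toy data (all names and numbers below are illustrative):

```python
import numpy as np
from scipy import sparse

# Toy interactions (made up): 3 artists, 4 users
plays = np.array([3.0, 1.0, 4.0, 2.0])
artist_id = np.array([0, 1, 1, 2])
user_id = np.array([0, 0, 1, 3])

item_user = sparse.csr_matrix((plays, (artist_id, user_id)))  # rows = artists
user_item = sparse.csr_matrix((plays, (user_id, artist_id)))  # rows = users

# Transposing one orientation gives the other
same = np.array_equal(item_user.T.toarray(), user_item.toarray())
print(item_user.shape, user_item.shape, same)
```

Because of this, only one of the two needs to be constructed from the DataFrame; the other can be obtained with `.T.tocsr()`.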

In [19]:
# Initialize the ALS model (it is fit on the sparse item-user matrix in a later cell)
model = implicit.als.AlternatingLeastSquares(factors=20, regularization=0.1, iterations=20)

  "Intel MKL BLAS detected. Its highly recommend to set the environment "


In [20]:
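# Calculate the confidence by multiplying the play counts by our alpha value.
# Note: data_conf is not passed to model.fit below, so the model is fit on the raw play counts.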
alpha_val = 15
data_conf = (sparse_item_user * alpha_val).astype('double')

In [21]:
# Fit the model
model.fit(sparse_item_user)

  0%|          | 0/20 [00:00<?, ?it/s]

In [22]:
# FIND SIMILAR ITEMS
# Find the 10 most similar artists to Jay-Z 
item_id = 147068 # Jay-Z
n_similar = 10

In [23]:
# Use implicit to get similar items.
# This returns two arrays:
# (1) Array containing the IDs of the top 10 artists
# (2) Array containing the corresponding similarity scores
# Is this just using cosine similarity between the artist vectors?
similar = model.similar_items(item_id, n_similar)

In [24]:
for item in similar:
    print(item)

[147068 339186 263726 293418 247749 314480 152068  27393   5456 237108]
[1.0000001  0.9271978  0.9258615  0.91784364 0.91605586 0.91207266
 0.91083986 0.91011506 0.90678716 0.9051018 ]
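The first score of roughly 1.0 for the query item itself is at least consistent with cosine similarity. The sketch below shows what a cosine-based top-n lookup over item factors would compute; it is an assumption about the behaviour, not the library's actual implementation, and `cosine_top_n` and the random factors are made up for illustration.

```python
import numpy as np

def cosine_top_n(item_factors, item_id, n=10):
    """Return (ids, scores) of the n items most cosine-similar to item_id."""
    norms = np.linalg.norm(item_factors, axis=1)
    norms[norms == 0] = 1e-10  # avoid division by zero for all-zero vectors
    scores = item_factors @ item_factors[item_id] / (norms * norms[item_id])
    top = np.argsort(-scores)[:n]  # indices sorted by descending score
    return top, scores[top]

# Random stand-in for model.item_factors: 50 items, 20 latent factors
rng = np.random.default_rng(0)
toy_factors = rng.normal(size=(50, 20))

ids, scores = cosine_top_n(toy_factors, item_id=7, n=5)
print(ids)
print(scores)
```

As in the output above, the query item comes back first with a score of about 1, followed by the remaining items in descending order of similarity.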


In [32]:
# Print the raw result returned by similar_items
# The first array holds the IDs of the artists
# The second array holds the corresponding cosine similarities (I think; not 100% sure how this works internally)
print(similar)

(array([147068, 339186, 263726, 293418, 247749, 314480, 152068,  27393,
         5456, 237108], dtype=int32), array([1.0000001 , 0.9271978 , 0.9258615 , 0.91784364, 0.91605586,
       0.91207266, 0.91083986, 0.91011506, 0.90678716, 0.9051018 ],
      dtype=float32))


In [43]:
# Print the corresponding artist name for every artist_id in similar[0]
for artist_id in similar[0]:
    name = data.loc[data["artist_id"]==artist_id].iloc[0].artist
    print(name)

jay-z


IndexError: single positional indexer is out-of-bounds

**NOTE:** This part is weird. The trained model finds other artists similar to Jay-Z (I am guessing it uses cosine similarity across the item/artist vectors) and returns them in the variable called `similar`. This variable is a tuple of two arrays: one contains the artist_ids and the other contains the corresponding cosine similarities (or some other similarity scores).

The weird part comes when I try to print the artist name for each returned artist_id. Take `artist_id = 339186`, for example. When I check whether it exists in the artist_id column, the check returns `True` (see the cell below). But when I try to extract the name, or just return the rows with that artist_id, I get either an index-out-of-bounds error or an empty result. How is this possible?

In [58]:
339186 in data.artist_id

True
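One thing worth checking here: the `in` operator on a pandas Series tests the index labels, not the values. Since the row labels of `data` run up to 17535652, `339186 in data.artist_id` returns `True` whether or not any row has that artist_id. A minimal sketch on a made-up toy frame:

```python
import pandas as pd

# Toy frame (made up); the default index labels are 0..5
toy = pd.DataFrame({"artist_id": [3, 3, 1, 0, 2, 1]})

label_hit = 5 in toy["artist_id"]              # checks index labels: 5 is a row label
value_hit = 5 in toy["artist_id"].values       # checks the stored values
any_hit = bool(toy["artist_id"].eq(5).any())   # same check, staying in pandas

print(label_hit, value_hit, any_hit)
```

To test membership among the values, use `.values`, `.eq(...).any()` or `.isin([...])` instead of a bare `in`.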

In [60]:
# result = data.loc[data["artist_id"]==79461].iloc[0].artist
data.loc[data["artist_id"]==147068]
# print(result)

Unnamed: 0,user,artist,plays,user_id,artist_id
616,0001a88a7092846abb1b70dbcced05f914976371,jay-z,81,12,147068
833,00026e8fc41980c9605eac741cd97b8216d2dbbd,jay-z,100,16,147068
2746,0009194b405052f1ee09a9cce78d660c47832735,jay-z,256,55,147068
6246,0014e7ddfa8d8e75eae3615c28d34a5b66fe9bc4,jay-z,7,127,147068
6550,0015a0902387f912c4e05fe8294e88e3c46c3019,jay-z,155,133,147068
...,...,...,...,...,...
17528090,ffe32a034e6eaeb8a7a2139cb2048ef23c33059f,jay-z,7,358711,147068
17529239,ffe7359143a9fe15b3be2eaac57385e237f82e2c,jay-z,107,358735,147068
17529817,ffe92580b965ba38f7aa696f055af267960767f4,jay-z,730,358748,147068
17533153,fff6c0fff0e0bc03f7b5a3aa8a538dc9d887fa4a,jay-z,84,358817,147068


In [57]:
artist_id = 339186
data.loc[data["artist_id"]==artist_id].iloc[0].artist

IndexError: single positional indexer is out-of-bounds
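A possible explanation, offered as a hypothesis rather than a confirmed diagnosis: the tuple-of-arrays return value of `similar_items` suggests a newer release of implicit, and those releases (0.5 and later, as far as I know) expect `fit` to receive a user-item matrix. If so, fitting on `sparse_item_user` would make the model treat users as items, and `similar_items` would return user codes rather than artist codes. That would fit the symptoms, since user codes run up to 358867 and 339186 may simply be larger than any artist code. The sketch below uses made-up toy data and hypothetical returned IDs to show a quick range check that would flag this.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `data` (made up): 3 artists, 5 users
toy = pd.DataFrame({
    "user_id":   [0, 1, 2, 3, 4, 4],
    "artist_id": [0, 1, 2, 0, 1, 2],
})
n_artists = toy["artist_id"].nunique()
n_users = toy["user_id"].nunique()

# Hypothetical IDs of the kind similar_items might return
returned_ids = np.array([1, 4, 3])

# Codes are 0..n_artists-1, so any ID >= n_artists cannot be an artist code
not_artists = returned_ids[returned_ids >= n_artists]
print(f"{n_artists} artists, {n_users} users, suspicious IDs: {not_artists}")
```

On the real data, the equivalent check would compare `similar[0]` against `data['artist_id'].max()`; if some IDs exceed it, refitting on the other orientation of the matrix would be the next thing to try.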