# Collaborative Filtering Systems

We also built a way to recommend artists based on artist similarity, using a collaborative filtering system.

Collaborative filtering rests on the idea that when two users share the same taste in one item, they are likely to share it in another. For example, if users A and B both love movie 1, and A also loves movie 2, then B might love movie 2 as well. In this notebook the items are artists rather than movies.
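The A/B example above can be sketched in a few lines. This is only a toy illustration with made-up users and items (the `likes` dictionary and the `recommend_for` helper are hypothetical and are not part of the notebook's pipeline):

```python
# Toy illustration of the collaborative filtering idea (hypothetical data).
likes = {
    "A": {"movie1", "movie2"},
    "B": {"movie1"},
    "C": {"movie3"},
}


def recommend_for(user, likes):
    """Suggest items liked by users who share at least one liked item with `user`."""
    suggestions = set()
    for other, other_items in likes.items():
        if other == user:
            continue
        if likes[user] & other_items:  # the two users overlap in taste
            suggestions |= other_items - likes[user]
    return sorted(suggestions)


print(recommend_for("B", likes))  # B shares movie1 with A, so A's other item is suggested
```

The rest of the notebook applies the same idea to artists, replacing the simple overlap test with cosine similarity.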

In [146]:
import sys
sys.path.append("../../")
import time
import os
import shutil
import papermill as pm
import pandas as pd
import numpy as np
import tensorflow as tf
from reco_utils.recommender.ncf.ncf_singlenode import NCF
from reco_utils.recommender.ncf.dataset import Dataset as NCFDataset
from reco_utils.dataset.python_splitters import python_random_split
from reco_utils.evaluation.python_evaluation import (rmse, mae, rsquared, exp_var, map_at_k, ndcg_at_k, precision_at_k, 
                                                     recall_at_k, get_top_k_items)
from reco_utils.common.constants import SEED as DEFAULT_SEED
from reco_utils.dataset.python_splitters import python_chrono_split


print("System version: {}".format(sys.version))
print("Pandas version: {}".format(pd.__version__))
print("Tensorflow version: {}".format(tf.__version__))

System version: 3.6.8 |Anaconda, Inc.| (default, Dec 29 2018, 19:04:46) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
Pandas version: 0.25.3
Tensorflow version: 1.12.0


## Data processing

### Load data

In [6]:
path = '/Users/guoyixin/Desktop/NEU/7374/hetrec2011-lastfm-2k/'

In [247]:
import csv
artists= data = pd.read_csv(path+'artists.dat',header=0,encoding='utf-8',delimiter="\t",quoting=csv.QUOTE_NONE)

In [248]:
artists.head()

Unnamed: 0,id,name,url,pictureURL
0,1,MALICE MIZER,http://www.last.fm/music/MALICE+MIZER,http://userserve-ak.last.fm/serve/252/10808.jpg
1,2,Diary of Dreams,http://www.last.fm/music/Diary+of+Dreams,http://userserve-ak.last.fm/serve/252/3052066.jpg
2,3,Carpathian Forest,http://www.last.fm/music/Carpathian+Forest,http://userserve-ak.last.fm/serve/252/40222717...
3,4,Moi dix Mois,http://www.last.fm/music/Moi+dix+Mois,http://userserve-ak.last.fm/serve/252/54697835...
4,5,Bella Morte,http://www.last.fm/music/Bella+Morte,http://userserve-ak.last.fm/serve/252/14789013...


In [249]:
ua = pd.read_table(path+'user_artists.dat',sep = '\t',header=0,engine='python')

In [294]:
ua.head(20)

Unnamed: 0,userID,artistID,weight
0,2,51,13883
1,2,52,11690
2,2,53,11351
3,2,54,10300
4,2,55,8983
5,2,56,6152
6,2,57,5955
7,2,58,4616
8,2,59,4337
9,2,60,4147


We decided to use `user_artists.dat` as our target dataset because its `weight` column records how much each user likes each artist.

In [205]:
ua.sort_values('weight')

Unnamed: 0,userID,artistID,weight
82318,1859,429,1
10646,229,4553,1
10645,229,4552,1
10644,229,4551,1
10636,229,2905,1
...,...,...,...
73745,1664,498,227829
84249,1905,203,257978
49304,1094,511,320725
91659,2071,792,324663


The weights span a very wide range, from 1 to roughly 350,000, so we apply min-max normalization. We scale to [0, 5000] rather than [0, 1] so that the normalized weights do not become too small.

In [209]:
rating = lambda x: (x-np.min(x))*5000/(np.max(x)-np.min(x))

In [210]:
ua['weight'] = ua[['weight']].apply(rating)
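To make the scaling step concrete, here is a minimal sketch of the same min-max formula on a tiny made-up column (the `toy` frame and the `min_max_scale` helper are illustrative assumptions, not the notebook's data):

```python
import pandas as pd

# Hypothetical listening counts, chosen so the arithmetic is easy to follow.
toy = pd.DataFrame({"weight": [1, 101, 501, 1001]})


def min_max_scale(values, upper=5000):
    """Linearly map values so the minimum becomes 0 and the maximum becomes `upper`."""
    lo, hi = values.min(), values.max()
    return (values - lo) * upper / (hi - lo)


toy["scaled"] = min_max_scale(toy["weight"])
print(toy)
```

Min-max scaling is linear, so it keeps the relative spacing of the weights. A heavily skewed column stays skewed after scaling, which is consistent with a median far below the maximum of 5000.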

In [212]:
ua.describe()

Unnamed: 0,userID,artistID,weight
count,92834.0,92834.0,92834.0
mean,1037.010481,3331.123145,10.550755
std,610.870436,4383.590502,53.180522
min,2.0,1.0,0.0
25%,502.0,436.0,1.502706
50%,1029.0,1246.0,3.671707
75%,1568.0,4350.0,8.690179
max,2100.0,18745.0,5000.0


In [213]:
ua_ratings = ua.pivot(
    index= 'artistID',
    columns = 'userID',
    values = 'weight').fillna(0)
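As a small sketch of what the pivot produces, the following uses three made-up rows (the values are hypothetical) to show how the long `(userID, artistID, weight)` table becomes an artist-by-user matrix, with missing pairs filled with 0:

```python
import pandas as pd

# Hypothetical long-format interactions: one row per (user, artist) pair.
toy = pd.DataFrame({
    "userID": [2, 2, 3],
    "artistID": [51, 52, 51],
    "weight": [4.0, 1.5, 2.0],
})

# Rows become artists, columns become users; unseen pairs are NaN, then 0.
matrix = toy.pivot(index="artistID", columns="userID", values="weight").fillna(0)
print(matrix)
```

Each row is then one artist's vector of listener weights, which is the representation compared in the similarity step.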

## Make recommendations



Cosine similarity is generally used as a metric for measuring distance when the magnitude of the vectors does not matter.
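A minimal sketch of the formula, cos(u, v) = u·v / (‖u‖ ‖v‖), computed by hand with NumPy on made-up vectors (the `cosine_sim` helper is illustrative; the notebook itself uses scikit-learn's `cosine_similarity`):

```python
import numpy as np


def cosine_sim(u, v):
    """Cosine of the angle between u and v: dot product over the product of norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


a = np.array([1.0, 2.0, 0.0])
b = np.array([10.0, 20.0, 0.0])  # same direction as a, ten times the magnitude
c = np.array([0.0, 0.0, 5.0])    # shares no nonzero entry with a

print(cosine_sim(a, b))  # close to 1: magnitude is ignored
print(cosine_sim(a, c))  # 0: no overlap at all
```

In the artist-by-user matrix, this means two artists with no listeners in common always have a similarity of 0, regardless of how large their weights are.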

In [274]:

from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors


In [275]:
artist_similarity = cosine_similarity(ua_ratings,dense_output=True)
artist_similarity

array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 1., 0.],
       [0., 0., 0., ..., 1., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])

Convert the dense matrix into a sparse matrix with `csr_matrix()`. Because most entries are zero, the sparse format uses far less memory, and libraries such as scikit-learn accept it directly.

In [276]:
ua_train = csr_matrix(ua_ratings.values)
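The conversion can be sketched on a tiny made-up array (the `dense` values are hypothetical). Only the nonzero entries are stored, and the original array can be recovered with `toarray()`:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 3x3 matrix with only two nonzero entries.
dense = np.array([
    [0.0, 0.0, 3.0],
    [0.0, 0.0, 0.0],
    [1.5, 0.0, 0.0],
])

sparse = csr_matrix(dense)

# Fraction of cells that actually hold a value.
density = sparse.nnz / (dense.shape[0] * dense.shape[1])
print(sparse.nnz, round(density, 3))
```

The same density check on `ua_train` would be one way to quantify how sparse the artist-by-user matrix is.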

Create a nearest-neighbors model that uses cosine distance with brute-force search.

In [277]:
artist_neighbors = NearestNeighbors(metric='cosine', algorithm='brute')

Fit data to the KNN model

In [278]:
artist_neighbors.fit(ua_train)

NearestNeighbors(algorithm='brute', leaf_size=30, metric='cosine',
                 metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                 radius=1.0)

Randomly select an artist, standing in for the artist a user would pick before asking for recommendations. `query_index` is the artist's row position in `ua_ratings`, not its artistID.

In [279]:
query_index = np.random.choice(ua_ratings.shape[0])

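Calculate the distances to the nearest neighbors and their row indices with the `.kneighbors` method from scikit-learn. We request 6 neighbors because the closest one is expected to be the query artist itself.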

In [280]:
distances, indices = artist_neighbors.kneighbors(ua_ratings.iloc[query_index,:].values.reshape(1,-1), n_neighbors = 6)

In [281]:
distances.flatten()

array([3.33066907e-16, 2.16943272e-04, 2.16943272e-04, 2.16943272e-04,
       2.16943272e-04, 2.16943272e-04])
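With `metric='cosine'`, the distances returned by `.kneighbors` are cosine distances, that is, 1 minus the cosine similarity. A minimal sketch on made-up vectors (the `items` array is hypothetical) checks this relationship:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors

# Hypothetical item vectors: rows 0 and 1 point in almost the same direction.
items = np.array([
    [1.0, 0.0, 0.0],
    [2.0, 0.1, 0.0],
    [0.0, 0.0, 1.0],
])

knn = NearestNeighbors(metric="cosine", algorithm="brute").fit(items)

# Query with row 0; the first neighbor returned is row 0 itself.
distances, indices = knn.kneighbors(items[[0]], n_neighbors=2)

sims = cosine_similarity(items)
print(indices[0], distances[0], 1 - sims[0, indices[0]])
```

This is why a distance of about 2.169e-04 corresponds to a similarity of about 0.999783.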

Iterate over the calculated distances, map each neighbor's row index back to its artistID through `ua_ratings.index`, and collect the recommended artistIDs in a list.

In [282]:
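recommend_list = []
flat_distances = distances.flatten()
flat_indices = indices.flatten()

for i in range(len(flat_distances)):
    if i == 0:
        # Position 0 is treated as the query artist itself (distance ~0), so only print its name
        print('Recommendations for {0}:\n'.format(artists.loc[artists['id']==ua_ratings.index[query_index], 'name'].iloc[0]))
    else:
        # Map the neighbor's row position back to its artistID
        artist_id = ua_ratings.index[flat_indices[i]]
        recommend_list.append(artist_id)
        print ('{0}:{1}, with distance of {2}:'.format(i, artist_id, flat_distances[i]))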

Recommendations for Charlie Musselwhite:

1:8819, with distance of 0.00021694327188803708:
2:8806, with distance of 0.00021694327188803708:
3:8809, with distance of 0.00021694327188803708:
4:8818, with distance of 0.00021694327188803708:
5:8825, with distance of 0.00021694327188803708:


In [283]:
recommend_list

[8819, 8806, 8809, 8818, 8825]

Show the result more clearly, together with the artist details

In [284]:
recommend_res = pd.DataFrame(recommend_list, columns=['id'])
recommend_res

Unnamed: 0,id
0,8819
1,8806
2,8809
3,8818
4,8825


In [285]:
recommend_res = pd.merge(recommend_res, artists, on='id')
recommend_res

Unnamed: 0,id,name,url,pictureURL
0,8819,Alberto Lizzio/Baroque Festival Orchestra,http://www.last.fm/music/Alberto%2BLizzio%252F...,
1,8806,The Drifters,http://www.last.fm/music/The+Drifters,http://userserve-ak.last.fm/serve/252/562266.jpg
2,8809,Rita Pavone,http://www.last.fm/music/Rita+Pavone,http://userserve-ak.last.fm/serve/252/42053213...
3,8818,Harry Belafonte,http://www.last.fm/music/Harry+Belafonte,http://userserve-ak.last.fm/serve/252/115203.jpg
4,8825,Julio Sosa,http://www.last.fm/music/Julio+Sosa,http://userserve-ak.last.fm/serve/252/57449221...


### Next, with the recommended artists determined, we compare them against the artist similarity matrix to check that the result is logically consistent

In [289]:
similarity_compare = pd.DataFrame({'Recommend Similarity':artist_similarity[query_index],'Selected ArtistId':ua_ratings.index[query_index]})
similarity_compare = similarity_compare.set_index(ua_ratings.index)
similarity_compare = similarity_compare.sort_values('Recommend Similarity',ascending=False)
similarity_compare_pivot = similarity_compare.T
similarity_compare

artistID,Recommend Similarity,Selected ArtistId
8821,1.000000,8821
8825,0.999783,8821
8804,0.999783,8821
8806,0.999783,8821
8809,0.999783,8821
...,...,...
6023,0.000000,8821
6024,0.000000,8821
6025,0.000000,8821
6026,0.000000,8821


In [290]:
similarity_compare.describe()

Unnamed: 0,Recommend Similarity,Selected ArtistId
count,17632.0,17632.0
mean,0.001524,8821.0
std,0.035213,0.0
min,0.0,8821.0
25%,0.0,8821.0
50%,0.0,8821.0
75%,0.0,8821.0
max,1.0,8821.0


## Conclusion and Reconsideration

The overall process worked, although the result looks unbalanced because the data is very sparse: at least 75% of the similarities to the selected artist are exactly 0.
We also loaded another dataset, in which users have access to all items, and the model worked really well. The good similarity results are shown in the following figure:
<center>
<img src="image/screenshot.png" width=300 />
</center>

Anyway, it was a nice journey in this lab!