# Collaborative filtering


Now, we're going to use the collaborative filtering method to solve our problem of recommending new artists to users.

But first, what is collaborative filtering?

At a very high level, and sticking to the music recommendation scenario, there are two ideas behind collaborative filtering.

### User based collaborative filtering

#### Intuition

If two people usually like the same artists and dislike the same artists, they will probably feel the same way about a new artist.

#### Algorithm

1. establish similarities between users
2. for each user, define a set of their most similar users, called neighbors
3. in order to recommend a new artist to a user, we average the preferences of the user's neighbors and recommend the artist that those neighbors like the most
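
To make the three steps concrete, here's a minimal sketch on a made-up play count matrix. Everything in it (the `plays` array, the `cosine` helper, the choice of cosine similarity and of a single neighbor) is an assumption for illustration; it isn't the method used later in this notebook, which is item based.

```python
import numpy as np

# Toy play counts: rows are users, columns are artists (0 = never listened).
plays = np.array([
    [5, 4, 0, 0],   # user 0: the user we recommend for
    [4, 5, 3, 0],   # user 1: tastes close to user 0
    [0, 0, 1, 5],   # user 2: very different tastes
], dtype=float)

def cosine(a, b):
    # Cosine of the angle between two play count vectors.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = 0

# Step 1: similarity between the target user and every user.
sims = np.array([cosine(plays[target], plays[u]) for u in range(len(plays))])
sims[target] = -1.0  # a user must not be their own neighbor

# Step 2: keep the n_neighbors most similar users.
n_neighbors = 1
neighbors = np.argsort(sims)[::-1][:n_neighbors]

# Step 3: average the neighbors' play counts and hide artists already known.
scores = plays[neighbors].mean(axis=0)
scores[plays[target] > 0] = -np.inf
recommended = int(np.argmax(scores))
print(recommended)
```

User 1 ends up as the only neighbor, so the recommendation is the one artist user 1 played that user 0 hasn't heard yet.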

### Item based collaborative filtering

#### Intuition

If many people feel the same way about two artists, then the two artists are probably similar; a user who only knows one of the artists and liked it will probably like the other artist too.

#### Algorithm

1. establish similarities between artists
2. each user has a preferences profile, based on how much they liked the artists
3. in order to recommend a new artist to a user, we pick the new artists that are the most similar to the user's preferences profile
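
Here's a small sketch of these steps on toy data, assuming cosine similarity between artists and raw play counts as the preferences profile. The array `plays` and all other names are made up for illustration.

```python
import numpy as np

# Toy play counts: rows are artists, columns are users.
plays = np.array([
    [5, 4, 0],   # artist 0
    [4, 5, 0],   # artist 1: played by the same users as artist 0
    [0, 3, 1],   # artist 2
    [0, 0, 5],   # artist 3
], dtype=float)

# Step 1: cosine similarity between every pair of artists.
norms = np.linalg.norm(plays, axis=1, keepdims=True)
unit = plays / norms
sim = unit @ unit.T

# Step 2: the user's profile is their column of play counts.
user = 0
profile = plays[:, user]

# Step 3: score each artist against the profile and hide the known ones.
scores = sim @ profile
scores[profile > 0] = -np.inf
best = int(np.argmax(scores))
print(best)
```

Artist 2 wins for user 0 because it shares a listener with the two artists that user already plays, while artist 3 shares none.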

### When should each method be used?

* If your system has many more users than items, go for item based (and vice-versa). Computing similarities is probably the most expensive operation you'll be doing, so do it on the smaller set.

* If your items are more stable than your users, i.e., if you add users more frequently than items, go for item based (and vice-versa). You'll want to recompute the similarity matrix as rarely as possible (while keeping accurate results, of course!).

* Do you need serendipity? If yes, go for user based. With item based, the results are derived from the user's own actions, so they will be less surprising to that user.

* Do you need to justify your recommendations? If yes, go for item based. It's easier to justify a recommendation with: "We're showing you this because you liked that" than with: "We're showing you this because Jon Doe (who you don't know!) also liked it and your tastes are very similar".


## Let's move to the fun part!

First, let's import the train dataset.

In [1]:
import pandas as pd

train = pd.read_csv('../data/train.csv', names=['user_id', 'artist_id', 'artist_name', 'playcounts'])
train.head()

Unnamed: 0,user_id,artist_id,artist_name,playcounts
0,00000c289a1829a808ac09c00daf10bc3c4e223b,b3ae82c2-e60b-4551-a76d-6620f1b456aa,melissa etheridge,897
1,00000c289a1829a808ac09c00daf10bc3c4e223b,bbd2ffd7-17f4-4506-8572-c1ea58c3f9a8,juliette & the licks,706
2,00000c289a1829a808ac09c00daf10bc3c4e223b,21f3573f-10cf-44b3-aeaa-26cccd8448b5,the black dahlia murder,507
3,00000c289a1829a808ac09c00daf10bc3c4e223b,a342964d-ca53-4e54-96dc-e8501851e77f,walls of jericho,393
4,00000c289a1829a808ac09c00daf10bc3c4e223b,f779ed95-66c8-4493-9f46-3967eba785a8,letzte instanz,387


Now, let's transform the user and artist ids into integers. This will help us when we convert our data into sparse matrices.

In [2]:
from sklearn.preprocessing import LabelEncoder

le_users = LabelEncoder()
le_users.fit(train.user_id.values)
train['user_label'] = le_users.transform(train.user_id.values)

le_artists = LabelEncoder()
le_artists.fit(train.artist_id.values)
train['artist_label'] = le_artists.transform(train.artist_id.values)

train.head()

Unnamed: 0,user_id,artist_id,artist_name,playcounts,user_label,artist_label
0,00000c289a1829a808ac09c00daf10bc3c4e223b,b3ae82c2-e60b-4551-a76d-6620f1b456aa,melissa etheridge,897,0,1677
1,00000c289a1829a808ac09c00daf10bc3c4e223b,bbd2ffd7-17f4-4506-8572-c1ea58c3f9a8,juliette & the licks,706,0,1761
2,00000c289a1829a808ac09c00daf10bc3c4e223b,21f3573f-10cf-44b3-aeaa-26cccd8448b5,the black dahlia murder,507,0,295
3,00000c289a1829a808ac09c00daf10bc3c4e223b,a342964d-ca53-4e54-96dc-e8501851e77f,walls of jericho,393,0,1505
4,00000c289a1829a808ac09c00daf10bc3c4e223b,f779ed95-66c8-4493-9f46-3967eba785a8,letzte instanz,387,0,2335
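
For intuition, the kind of mapping that LabelEncoder builds can be reproduced on a toy array with NumPy alone. This is only a sketch with made-up ids, not the notebook's data: the sorted unique values play the role of the encoder's classes, and each id's position in that sorted array is its integer label.

```python
import numpy as np

# Hypothetical ids, standing in for the long hashes in the dataset.
ids = np.array(['u-b', 'u-a', 'u-b', 'u-c'])

# classes: sorted unique ids; labels: index of each id within classes.
classes, labels = np.unique(ids, return_inverse=True)
print(classes)
print(labels)

# Going back from labels to ids is a plain lookup.
restored = classes[labels]
```

The same lookup in reverse is what lets us turn recommended artist labels back into artist ids at the end.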


Here we'll store our data in a sparse matrix where each row corresponds to an artist and each column corresponds to a user.

In [3]:
import numpy as np
from scipy.sparse import coo_matrix

row  = np.array(train.artist_label.values)
col  = np.array(train.user_label.values)
data = np.array(train.playcounts.values)
m = coo_matrix((data, (row, col)), shape=(train.artist_label.nunique(), train.user_label.nunique()))

This is a helper function to show us the density of a sparse matrix.

In [4]:
def print_matrix_density(m):
    # this is the number of non-zero values in the matrix
    nnz = m.nnz

    # get the matrix shape
    shape = m.shape

    # the number of elements in the matrix is n_rows * n_cols
    n_elems = shape[0] * shape[1]

    # the density of the matrix is the number of non-zero elements over all the elements
    print('Density of sparse matrix: {}%'.format(round(nnz/n_elems*100, 2)))

In [5]:
print_matrix_density(m)

Density of sparse matrix: 0.48%
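
To see how the (data, (row, col)) triplets and the density figure relate, here's a toy matrix small enough to check by hand. The values are made up; only `coo_matrix`, `nnz`, `shape` and `toarray` come from SciPy.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Three (artist, user, playcount) triples for a 2-artist x 3-user matrix.
row = np.array([0, 0, 1])      # artist labels
col = np.array([0, 2, 1])      # user labels
data = np.array([10, 3, 7])    # play counts

toy = coo_matrix((data, (row, col)), shape=(2, 3))
dense = toy.toarray()
print(dense)

# Density: stored non-zero values over all cells, as a percentage.
density = toy.nnz / (toy.shape[0] * toy.shape[1]) * 100
print(density)
```

Three stored values out of six cells gives 50%, whereas our real matrix stores fewer than 1 cell in 200.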


In order to compute similarities between artists, we'll use scikit-learn's cosine_similarity function. It compares the rows of m, i.e. the artists. Passing dense_output=False keeps the output sparse when the input is sparse.

In [6]:
from sklearn.metrics.pairwise import cosine_similarity

S = cosine_similarity(m, dense_output=False)
print(type(S))
print_matrix_density(S)

<class 'scipy.sparse.csr.csr_matrix'>
Density of sparse matrix: 68.72%
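
The cosine similarity of two artists is the dot product of their rows divided by the product of the rows' norms. The sketch below computes it by hand with NumPy on a made-up dense matrix (`m_toy` and `S_toy` are illustrative names), which also hints at why S is so much denser than m: with non-negative play counts, any two artists that share at least one listener get a non-zero similarity.

```python
import numpy as np

# Toy artist x user play counts.
m_toy = np.array([
    [2.0, 0.0, 1.0],
    [4.0, 0.0, 2.0],   # same direction as artist 0, just more plays
    [0.0, 3.0, 0.0],   # no listeners in common with the others
])

# Dot product of every pair of rows, divided by the product of their norms.
norms = np.linalg.norm(m_toy, axis=1)
S_toy = (m_toy @ m_toy.T) / np.outer(norms, norms)
print(np.round(S_toy, 2))
```

Artists 0 and 1 get a similarity of 1 even though their play counts differ, because cosine similarity only looks at the direction of the vectors, not their magnitude.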


Now, let's produce recommendations! First, we'll read a test set and then we'll encode the user_ids, as we did before.

In [7]:
test = pd.read_csv('../data/test.csv', names=['user_id'])
test.head()

Unnamed: 0,user_id
0,002b5d4a2aea86f2b52103393398409aa28d435a
1,00a41bc4ce19c07116b93430e8ccd41984fb2e2d
2,01ac389171616b01a392118dab5a8f56bcfbdb3b
3,0208feef15a26e8c93c776022a8f63c00c4fa7e9
4,03030070193c32afdb21bea4b3597b4d2d549bd8


In [8]:
test_labels = le_users.transform(test.user_id.values)
test_labels[:10]

array([ 154,  589, 1594, 1936, 2801, 3119, 3999, 4634, 5804, 6021])

Then, we'll select from the initial matrix m the columns that correspond to users in the test set.

In [9]:
m = m.tocsc()
test_users = m[:,test_labels]
print(type(test_users))

<class 'scipy.sparse.csc.csc_matrix'>


Here is where Math attacks!
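We'll get the preferences of the users in the test set by computing the dot product between the similarity matrix and the test users' play counts: each artist's score for a user is the sum of that user's play counts, weighted by how similar each played artist is to the candidate artist.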

In [10]:
preferences = S.dot(test_users)
print_matrix_density(preferences)
preferences[:,0].toarray()

Density of sparse matrix: 99.95%


array([[   1.83029269],
       [   1.64725937],
       [ 141.30845304],
       ..., 
       [ 106.26176781],
       [   0.22112047],
       [  16.40485559]])

The preferences matrix is not so sparse anymore... But that's ok, because this matrix is much smaller, with just 300 columns (the number of users in the test set).

Now, we want to remove from the preferences matrix the artists that the users already listened to.

In [11]:
test_users_coo = test_users.tocoo()

for i, j in zip(test_users_coo.row, test_users_coo.col):
    preferences[i, j] = 0

preferences[:,0].toarray()

array([[   1.83029269],
       [   1.64725937],
       [   0.        ],
       ..., 
       [ 106.26176781],
       [   0.22112047],
       [  16.40485559]])
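
Writing into a sparse matrix one element at a time, as the loop above does, can get slow on large matrices. One possible alternative, sketched here on small dense NumPy arrays rather than on the notebook's sparse matrices, is to build a boolean mask from the play counts. The arrays `prefs_toy` and `plays_toy` are made up for illustration.

```python
import numpy as np

# Toy scores and play counts, both artist x user.
prefs_toy = np.array([
    [1.8, 2.3],
    [1.6, 7.0],
    [141.3, 7.6],
])
plays_toy = np.array([
    [0, 12],
    [0, 0],
    [897, 0],
])

# Zero every score whose (artist, user) pair has a non-zero play count.
masked = np.where(plays_toy > 0, 0.0, prefs_toy)
print(masked)
```

Since our preferences matrix is almost fully dense anyway, converting it to a dense array before masking would be a reasonable choice here.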

And now, let's move to pandas...

In [12]:
df_preferences = pd.DataFrame(preferences.toarray())
df_preferences = df_preferences.transpose()
df_preferences.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2410,2411,2412,2413,2414,2415,2416,2417,2418,2419
0,1.830293,1.647259,0.0,13.486969,2.679935,3.801377,0.259463,0.930958,6.805027,1.941573,...,7.073527,0.494677,1.243961,1.976835,6.900427,0.291134,0.237262,106.261768,0.22112,16.404856
1,2.353086,7.059208,7.631676,11.471106,1.615172,0.126451,1.297599,0.697074,6.371153,36.267973,...,16.185189,5.46452,0.337963,1.697208,1.015752,0.933111,1.867684,3.058141,7.163797,86.52104
2,0.33493,0.523145,0.714925,0.982454,0.142186,0.0,0.60732,1.057052,1.229495,0.565041,...,5.2572,3.081868,0.241466,11.184143,0.046592,0.287489,0.421334,0.695675,0.429554,4.022889
3,7.22717,2.559279,5.53845,14.323927,1.255134,0.572675,6.485822,0.397746,14.111103,34.079862,...,15.625716,13.77488,5.720238,20.830233,0.178316,1.919642,12.034838,3.932393,6.147588,35.506348
4,2.720481,16.069489,5.177597,0.586048,32.122461,2.428006,9.722055,38.803199,77.276059,6.459918,...,12.662701,9.440265,14.207742,3.119378,7.059925,3.386277,46.198454,1.723172,7.005515,4.474405


... and select the top 50 artists for each user.

In [13]:
k = 50
results = []
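for index, row in df_preferences.iterrows():
    # artist labels of the k highest scores for this user, best first
    top_k = row.sort_values(ascending=False).head(k).index.values
    results.append(top_k)

# recent scikit-learn versions expect a 1-D array here, so decode one user at a time
results_ids = [le_artists.inverse_transform(top_k) for top_k in results]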
results_ids[0][:10]

array(['94dbfe2e-ca48-4e08-a5a8-e1e74136c63d',
       '4e024037-14b7-4aea-99ad-c6ace63b9620',
       '4e954b02-fae2-4bd7-9547-e055a6ac0527',
       'c71abd83-9d66-4c7b-9f0d-c9c36e85a955',
       'd75d1f08-bbb8-4eae-9877-399ca9121197',
       'b614843c-bec3-421f-9af1-03169cdd4b63',
       '6593c2bc-9327-4adf-a8b2-b315e4f5c0bb',
       'd50548a0-3cfd-4d7a-964b-0aef6545d819',
       'ceef10f5-324d-4a04-8db7-1a4181e19ab3',
       '1f1f6737-b930-46fc-8d25-110bb99f7490'], dtype=object)
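
Sorting every row in full is fine at this size. If it ever becomes a bottleneck, a partial sort is one option: the sketch below uses NumPy's `argpartition` on a made-up score vector (`scores` and `k_toy` are illustrative names) to find the top entries without ordering the rest.

```python
import numpy as np

scores = np.array([0.2, 9.1, 3.3, 7.7, 0.0, 5.4])
k_toy = 3

# argpartition puts the k largest scores in the last k slots, in arbitrary order.
candidates = np.argpartition(scores, -k_toy)[-k_toy:]

# Sort only those k candidates, best first.
top = candidates[np.argsort(scores[candidates])[::-1]]
print(top)
```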

But how good is our model?
Let's evaluate it.

In the test_true set, we have 5 artists that each user listened to. So evaluating our model basically means checking whether the top 50 we predicted for each user contains the artists in the test_true dataset.

We'll use the mean average precision metric, which checks whether our predictions are in the test_true dataset and how well they are ordered.
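
To show what the metric rewards, here's a plain-Python sketch of mean average precision at k, following a common definition. The function names and the toy lists are made up; the notebook itself relies on the ml_metrics package, whose behavior may differ in edge cases.

```python
def average_precision_at_k(actual, predicted, k):
    """Average precision of one ranked list against the relevant items."""
    if not actual:
        return 0.0
    hits = 0
    score = 0.0
    seen = set()
    for rank, item in enumerate(predicted[:k], start=1):
        # Count each relevant item once, at the rank where it first appears.
        if item in actual and item not in seen:
            hits += 1
            score += hits / rank   # precision at this rank
        seen.add(item)
    return score / min(len(actual), k)

def mean_average_precision_at_k(actual_lists, predicted_lists, k):
    """Mean of the per-user average precisions."""
    scores = [average_precision_at_k(a, p, k)
              for a, p in zip(actual_lists, predicted_lists)]
    return sum(scores) / len(scores)

# Toy example with made-up artist names.
actual = [['a', 'b'], ['c']]
predicted = [['a', 'x', 'b'], ['y', 'c', 'z']]
value = mean_average_precision_at_k(actual, predicted, k=3)
print(value)
```

A hit near the top of the list adds more to the score than the same hit further down, so the metric cares about ordering as well as membership.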

In [14]:
test_true = pd.read_csv('../data/test_true.csv', header=None, names=['user_id', 'top_artists'])
test_true.head()

Unnamed: 0,user_id,top_artists
0,002b5d4a2aea86f2b52103393398409aa28d435a,"['6ec73176-6ea6-49d3-87e8-35b5fe6813f5', 'b614..."
1,00a41bc4ce19c07116b93430e8ccd41984fb2e2d,"['1f8e4ffd-b81b-4d10-91f4-461585bf9d16', '2228..."
2,01ac389171616b01a392118dab5a8f56bcfbdb3b,"['79ad5f01-b181-448d-b776-a252949d61af', 'a558..."
3,0208feef15a26e8c93c776022a8f63c00c4fa7e9,"['b9a20306-a4f5-4d3c-8680-e9cdc7e3af5b', '3d6b..."
4,03030070193c32afdb21bea4b3597b4d2d549bd8,"['1397d045-1603-41fc-80b9-712c18360145', 'ad38..."


In [15]:
import json

test_true.top_artists = test_true.top_artists.str.replace('\'', '"')
test_true.top_artists = test_true.apply(lambda x: json.loads(x.top_artists), axis=1)
test_true.head()

Unnamed: 0,user_id,top_artists
0,002b5d4a2aea86f2b52103393398409aa28d435a,"[6ec73176-6ea6-49d3-87e8-35b5fe6813f5, b614843..."
1,00a41bc4ce19c07116b93430e8ccd41984fb2e2d,"[1f8e4ffd-b81b-4d10-91f4-461585bf9d16, 2228019..."
2,01ac389171616b01a392118dab5a8f56bcfbdb3b,"[79ad5f01-b181-448d-b776-a252949d61af, a558ed3..."
3,0208feef15a26e8c93c776022a8f63c00c4fa7e9,"[b9a20306-a4f5-4d3c-8680-e9cdc7e3af5b, 3d6bbeb..."
4,03030070193c32afdb21bea4b3597b4d2d549bd8,"[1397d045-1603-41fc-80b9-712c18360145, ad38670..."


In [16]:
import ml_metrics as metrics

metrics.mapk(test_true.top_artists.values, results_ids, k)

0.15531719983948952