# How to Build a Recommender in Python


### by Maria Dominguez (aka Chi) & João Ascensão

## Overview:

#### Leveraging Python's data stack to build a music recommender, not unlike Spotify's Discover Weekly.

Notes:

* Implicit and explicit feedback
* Uniplaces and Data Science Academy logos at the end

## Recommender Systems:

Recommender systems present *items* that are likely to interest the *user*, by comparing the user's profile (inferred from *data*) to reference characteristics.

Thus, our general framework contains three main components:

* **Users**: the people in our system, who generate data and will receive item recommendations
* **Items**: the things in our system, with which the users interact and that we want to recommend to them
* **Data**: the different ways a user can express opinions about our items, i.e. where users and items meet.

## Music Recommendations


[Word out there](https://hackernoon.com/spotifys-discover-weekly-how-machine-learning-finds-your-new-music-19a41ab76efe) is that Spotify's Discover Weekly mixes together two well-known types of recommenders:

* **Content-based filtering**: combines data (e.g. clicks, playcounts) and item attributes (e.g. content, text, tags, metadata) to create *user profiles*

    * **Natural language processing (NLP)**: representing artists and tracks with features extracted from text (e.g. description, comments, tags)
    * **Audio models**: extracting features by analyzing the raw audio tracks.
    
    
* **Collaborative filtering**: analyzes user historical behaviour to find similarities across pairs of users and/or items.

## Building the Recommender

Using data from the lastfm dataset, our prototype will include:

* A content-based filtering branch, based on processing the tags assigned by the users to each artist (no raw audio analysis today, bummer)
* A collaborative-filtering approach, using user/artist playcounts.

## Content-based filtering

We will import a file we prepared, containing tags for the artists in the dataset.

### Data preparation

#### Preparing the train dataset

We want to read the dataset into a `train` dataframe, containing `user_id`, `artist_id`, `artist_name` and `play_counts`.

In [1]:
import pandas as pd
import numpy as np


train_cols = ['user_id', 'artist_id', 'artist_name', 'play_counts']
train = pd.read_csv("../data/train.csv", header=None, names=train_cols)
# We checked, there are no missing values :)
train.head(n=3)

Unnamed: 0,user_id,artist_id,artist_name,play_counts
0,00000c289a1829a808ac09c00daf10bc3c4e223b,b3ae82c2-e60b-4551-a76d-6620f1b456aa,melissa etheridge,897
1,00000c289a1829a808ac09c00daf10bc3c4e223b,bbd2ffd7-17f4-4506-8572-c1ea58c3f9a8,juliette & the licks,706
2,00000c289a1829a808ac09c00daf10bc3c4e223b,21f3573f-10cf-44b3-aeaa-26cccd8448b5,the black dahlia murder,507


Also, we don't need the artist name for now. We will store it in a separate dataframe to use in the last step, to provide nicer looking recommendations.

In [2]:
# We will set the dataframe index to be the artist id, so that we can easily
# look up the artist name in the future
artist_name = train[['artist_id', 'artist_name']].set_index('artist_id')
train = train.drop(['artist_name'], axis=1)
train.head(n=3)

Unnamed: 0,user_id,artist_id,play_counts
0,00000c289a1829a808ac09c00daf10bc3c4e223b,b3ae82c2-e60b-4551-a76d-6620f1b456aa,897
1,00000c289a1829a808ac09c00daf10bc3c4e223b,bbd2ffd7-17f4-4506-8572-c1ea58c3f9a8,706
2,00000c289a1829a808ac09c00daf10bc3c4e223b,21f3573f-10cf-44b3-aeaa-26cccd8448b5,507


#### Preparing the tags dataset

The `tags` dataframe has `artist_id` and one row per `tag`. Rows with empty tags, if any, are of no use to us, so we will get rid of them.

In [3]:
tags_cols = ['artist_id', 'tag']
tags = pd.read_csv("../data/tags.csv", header=None, names=tags_cols)
# Null tags are of no use, and if there are some they must be removed.
tags_null = tags['tag'].isnull().sum()
print("There are {} tags missing. We must drop them.".format(tags_null))
tags = tags.dropna()
tags.head(n=3)

There are 0 tags missing. We must drop them.


Unnamed: 0,artist_id,tag
0,3bd73256-3905-4f3a-97e2-8b341527f805,90s
1,3bd73256-3905-4f3a-97e2-8b341527f805,alternative
2,3bd73256-3905-4f3a-97e2-8b341527f805,alternative punk rock


#### Include artists in both datasets only

Now, we can only build artist profiles for artists with at least one tag assigned to them, otherwise there's nothing we can use to *describe* them.

Similarly, artists with no playcounts are also useless to build user profiles. For convenience, we will remove them, *but could they be useful at some point*?

In [4]:
# We could try to use pandas.DataFrame.merge, why don't we?
artists_intersect = np.intersect1d(tags['artist_id'].values, 
                                   train['artist_id'].values)

To remove unnecessary artists from both dataframes, we will *select by label* the artists we want.

In [5]:
# Selecting by label only artists in tags that are also in train.
tags = tags.set_index('artist_id').loc[artists_intersect].reset_index()

# Selecting by label only artists in train that are also in tags.
train = train.set_index('artist_id').loc[artists_intersect].reset_index()
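
As an alternative to selecting by label, a boolean mask built with `Series.isin` keeps the same artists without touching the index. This is only a minimal sketch on toy data: the `toy_tags` and `toy_train` frames and their values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real dataframes (hypothetical values).
toy_tags = pd.DataFrame({"artist_id": ["a", "a", "b", "c"],
                         "tag": ["rock", "pop", "jazz", "folk"]})
toy_train = pd.DataFrame({"user_id": ["u1", "u1", "u2"],
                          "artist_id": ["a", "d", "b"],
                          "play_counts": [10, 3, 7]})

# Artists present on both sides.
common = np.intersect1d(toy_tags["artist_id"].values,
                        toy_train["artist_id"].values)

# Keep only the rows whose artist is in the intersection.
toy_tags = toy_tags[toy_tags["artist_id"].isin(common)].reset_index(drop=True)
toy_train = toy_train[toy_train["artist_id"].isin(common)].reset_index(drop=True)
```

Unlike `.loc`, the mask keeps the rows in their original order instead of reordering them to follow `common`.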

#### Label encode the `artist_id` on both dataframes

While the reasoning behind this step is not immediately clear, it's important that `artist_id` is encoded as an integer.

In [6]:
from sklearn.preprocessing import LabelEncoder


# We will use the sklearn.preprocessing.LabelEncoder for this but other 
# approaches are valid. This will fit an integer to each unique string.
artist_encoder = LabelEncoder()
tags['artist_id'] = artist_encoder.fit_transform(tags['artist_id'])

# It is essential that we use the same encoder to transform the artists
# in the train dataset, so that they have the same label on both sides.
train['artist_id'] = artist_encoder.transform(train['artist_id'])
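
To see what the encoder does, here is a small sketch on made-up ids (the short strings below are hypothetical, not real artist ids). The codes are contiguous integers starting at 0, which is what lets them double as row or column positions in a matrix later on.

```python
from sklearn.preprocessing import LabelEncoder

toy_encoder = LabelEncoder()
codes = toy_encoder.fit_transform(["b3ae", "21f3", "bbd2", "21f3"])

# classes_ holds the sorted unique ids; each code is a position in it.
classes = list(toy_encoder.classes_)   # ['21f3', 'b3ae', 'bbd2']
codes = list(codes)                    # [1, 0, 2, 0]

# The mapping can be reversed to recover the original ids.
decoded = list(toy_encoder.inverse_transform([1, 0]))
```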

#### Sort the tags dataframe by `artist_id`

Again, while it's not immediately clear why, this is a critical step: we need to ensure that the `tags` dataframe is sorted according to the *encoded* `artist_id`.

In [7]:
tags = tags.set_index('artist_id').sort_index().reset_index()
tags.head(n=3)

Unnamed: 0,artist_id,tag
0,0,90s
1,0,acoustic
2,0,alternative


### Building the artist profiles

We should check how many unique artists and tags we have.

In [8]:
unique_artists = len(tags['artist_id'].unique())
unique_tags = len(tags['tag'].unique())
print("There are {} unique tags for {} unique artists.".format(unique_tags, 
                                                               unique_artists))

There are 2169 unique tags for 2420 unique artists.


Let's look at the top 3 most common tags.

In [9]:
tags_count = tags['tag'].value_counts()
print(tags_count[:3])

rock          2349
pop           1863
electronic    1539
Name: tag, dtype: int64


Now, let's take a random sample of 5 of the most obscure tags, just for fun.

In [10]:
tags_count_one = tags_count[tags_count == 1]
np.random.seed(1)
# We will randomize the order of the tags with count one so we can select 5 
# obscure tags at random.
randomize_order = np.random.permutation(len(tags_count_one))
# Use .iloc because randomize_order holds positions, not index labels.
tags_count_one.iloc[randomize_order][:5]

oldie              1
mo wax             1
90s emo            1
north east         1
barcelona sound    1
Name: tag, dtype: int64

#### Merge all tags in a single *document* per artist

The thing is, instead of a single tag per row, we need a single entry per artist, holding a *collection* of tags that resembles a *document*.

In [11]:
# There are probably less hacky ways to do this but it's not too expensive,
# it's interpretable and serves the purpose well.
# We are concatenating all tags as words in a single sentence.
tags = tags.groupby('artist_id')['tag'].apply(' '.join).reset_index()
# Some words, previously associated with particular tags, will appear more
# than once (e.g. 'alternative' in the second row).
tags.head(n=3)

Unnamed: 0,artist_id,tag
0,0,90s acoustic alternative alternative rock braz...
1,1,00s acoustic alternative alternative rock beau...
2,2,acid jazz afrobeat alternative rock ambient bi...


When talking about musical genres, people tend to use 'alternative rock' and 'alternative-rock' interchangeably.

In [12]:
# At this point, do things that don't scale. Try to spot possible strategies to
# improve the quality of your documents.
tags['tag'] = tags['tag'].str.replace("-", " ")

#### Structuring the tags data (aka vectorization)

And no, we are not using word2vec.

A vector exists in a space, in this particular case, a high-dimensional space corresponding to all the tags in our dataset. 

We take each word appearing in the documents as a dimension, or a feature. Each document is then represented as a *keyword vector*, typically with counts.
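
A minimal sketch of keyword vectors with plain counts, using scikit-learn's `CountVectorizer` on three made-up documents (the notebook itself uses a TF-IDF vectorizer; this only illustrates the counting step):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three hypothetical artist documents.
toy_docs = ["rock alternative rock", "pop rock", "jazz"]

count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each word to its column; sort the words by column index.
columns = sorted(count_vectorizer.vocabulary_,
                 key=count_vectorizer.vocabulary_.get)

# columns is ['alternative', 'jazz', 'pop', 'rock'] and the first document
# becomes the keyword vector [1, 0, 0, 2].
dense_counts = counts.toarray()
```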

#### Weighting different words

TF-IDF stands for Term Frequency - Inverse Document Frequency and is a *weighting* function.

Why do we need it? Because not all terms are equally relevant to describe an artist. In short, we measure the term frequency, weighted by its rarity.

$$ IDF_{term} = \log\left(\frac{TotalDocuments}{DocumentsWithTerm}\right) $$

Thus:

$$ TFIDF_{term} = TF_{term} \times IDF_{term} $$
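
The two formulas can be worked through by hand with NumPy. The counts below are made up; the point is only to show that a term appearing in fewer documents gets a larger weight.

```python
import numpy as np

# Term counts per document (rows) for the terms rock, pop, jazz (columns).
tf = np.array([[2, 0, 0],
               [1, 1, 0],
               [0, 0, 1]], dtype=float)

total_documents = tf.shape[0]
documents_with_term = (tf > 0).sum(axis=0)   # rock: 2, pop: 1, jazz: 1

# IDF = log(TotalDocuments / DocumentsWithTerm), natural log here.
idf = np.log(total_documents / documents_with_term)

# TFIDF = TF * IDF, broadcast over the rows.
tfidf = tf * idf
```

In the second document, 'pop' ends up with a higher weight than 'rock' even though both appear once, because 'rock' is the more common term. Note that scikit-learn's `TfidfVectorizer` uses a smoothed variant of the IDF by default and L2-normalizes each row, so its numbers will differ from this textbook version.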

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.snowball import EnglishStemmer


# Add language stemmer to TfidfVectorizer by overriding the built-in 
# build_analyzer()
class StemmedTfidfVectorizer(TfidfVectorizer):
    # Stemming reduces words to their most basic form, to minimize the
    # feature space.
    def build_analyzer(self):
        stemmer = EnglishStemmer()
        analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()
        return lambda doc: ([stemmer.stem(w) for w in analyzer(doc)])

#### Bag of words representation with TF-IDF

We will now build our dictionary of tags and project our artist documents in our vector space.

In [14]:
# Bag of words representation, considering words with document frequency
# of at least 3 documents, otherwise they are not useful.
vectorizer = StemmedTfidfVectorizer(min_df=3, analyzer='word', 
                                    stop_words='english', lowercase=True)

# Build internal dictionary.
vectorizer.fit(tags['tag'])

# We call the dictionary the collection of distinct words.
vocabulary = list(vectorizer.vocabulary_)
print("Sample of the vocabulary {}.".format(vocabulary[:9]))
print("There are {} unique tags.".format(len(vocabulary)))

Sample of the vocabulary ['90s', 'acoust', 'altern', 'rock', 'brazil', 'brazilian', 'pop', 'music', 'dancehal'].
There are 774 unique tags.


#### Generating a sparse matrix of keyword vectors

We built our dictionary; now we need to transform our `tags` dataframe into keyword vectors, with all the vocabulary words as features.

As you can see, the result is a sparse matrix. But what is that, right?

The reason we sorted the artists is that our vectorizer maintains the order of the rows, and we will perform some algebra on top of this very soon.

(This is a hint!)

In [15]:
# This will generate a sparse matrix with our artists in the rows, and
# the tags as columns.
artists_matrix = vectorizer.transform(tags['tag'])
print(artists_matrix.shape, type(artists_matrix))

(2420, 774) <class 'scipy.sparse.csr.csr_matrix'>


#### A note about normalization

Normalizing the artist keyword vectors is key, otherwise artists with more tags will have more weight on user profiles.

By normalizing, we mean making all vectors length 1.

We accomplish this with the following formula:

$$ \vec{u} = {\frac{\vec{v}}{\parallel\vec{v}\parallel}} $$

Where:

$$ \parallel\vec{v}\parallel = \sqrt{v_1^2 + v_2^2 + ... + v_n^2} $$

So, we take the non-normalized vector and divide it (i.e. all its components) by its own magnitude, also called length, or *norm*.
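
As a quick sketch of the formula, with a made-up two-dimensional vector:

```python
import numpy as np

v = np.array([3.0, 4.0])

# Norm: square root of the sum of squared components (5.0 here).
norm = np.sqrt((v ** 2).sum())

# Dividing every component by the norm gives a vector of length 1.
u = v / norm   # [0.6, 0.8]
```

`np.linalg.norm(v)` computes the same norm. Also worth knowing: `TfidfVectorizer` applies this L2 normalization to each row by default (`norm='l2'`), so the rows of `artists_matrix` should already have length 1.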

### Building user profiles

#### Generating a sparse matrix with users, items and playcounts


A sparse matrix is a big matrix where most of its elements are 0. 

The idea behind sparse matrices is to just store the information relative to the non-zero elements of the matrix. 

This makes working with these matrices way easier, and more performant.
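
To make this concrete, here is a small sketch that builds a sparse matrix from (row, column, value) triplets. The `toy_*` names and values are made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Three non-zero entries given as (row, column, value) triplets.
toy_rows = np.array([0, 1, 2])
toy_cols = np.array([1, 0, 3])
toy_data = np.array([5, 2, 7])

toy_sparse = csr_matrix((toy_data, (toy_rows, toy_cols)), shape=(3, 4))

# Only 3 of the 12 cells are actually stored.
stored = toy_sparse.nnz

# A dense view is fine for a toy matrix, but defeats the purpose on real data.
dense = toy_sparse.toarray()
```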

In [16]:
from scipy.sparse import csr_matrix


user_encoder = LabelEncoder()

# We also need to transform the user_id into an integer.
# The row index in our sparse matrix.
rows = user_encoder.fit_transform(train['user_id'])
# The column index in our sparse matrix.
cols = train['artist_id'].values
data = train['play_counts'].values

# users_matrix[rows[k], cols[k]] = data[k]
users_matrix = csr_matrix((data, (rows, cols)), shape=(rows.max() + 1, cols.max() + 1))

In [17]:
users_matrix.shape

(239929, 2420)

In [18]:
user_profiles = users_matrix.dot(artists_matrix)

user_profiles.shape

(239929, 774)
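
The dot product above is worth a toy example: each user profile is the sum of the artist vectors, weighted by that user's playcounts. The numbers below are made up.

```python
import numpy as np

# 2 users x 3 artists: playcounts (hypothetical).
toy_plays = np.array([[10, 0, 0],
                      [1, 0, 3]], dtype=float)

# 3 artists x 2 tags: artist profiles (hypothetical weights).
toy_artists = np.array([[1.0, 0.0],
                        [0.0, 1.0],
                        [0.6, 0.8]])

# Column j of toy_plays multiplies row j of toy_artists.
toy_profiles = toy_plays.dot(toy_artists)
# User 0 only listens to artist 0, so the profile is 10 * [1.0, 0.0].
# User 1 mixes artists 0 and 2: 1 * [1.0, 0.0] + 3 * [0.6, 0.8] = [2.8, 2.4].
```

This is also where the earlier sorting pays off: the product only makes sense if row `j` of the artist matrix describes the same artist as column `j` of the user matrix.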

#### Making predictions (and another note about normalization)

An alternative to normalizing our vectors is to just compute the cosine between the artist profile and the user taste.

This is because the cosine *is a dot-product that is already normalized*:

$$ cos(\vec{u}, \vec{v}) = 
\frac{\langle \vec{u}, \vec{v} \rangle}{\parallel\vec{u}\parallel\parallel\vec{v}\parallel} $$

In fact, the predictions will be *scaled by the user taste vector norm*, but that's ok because the order between the items will be kept unchanged.
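
A small sketch of why the user norm does not matter for ranking, using a hand-written `cosine` helper (hypothetical, for illustration only) and made-up unit-length artist vectors:

```python
import numpy as np

def cosine(u, v):
    # Dot product divided by the product of the two norms.
    return u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

user = np.array([2.8, 2.4])
artists = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [0.6, 0.8]])   # each row has length 1

scores = np.array([cosine(user, a) for a in artists])
scores_scaled = np.array([cosine(100 * user, a) for a in artists])

# Plain dot products differ from the cosines only by the user norm.
dots = artists.dot(user)

best = int(scores.argmax())   # artist 2
```

Scaling the user vector leaves the cosines untouched, and the plain dot products rank the artists in the same order as the cosines.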

Let's build our prediction function.

In [19]:
from sklearn.metrics.pairwise import cosine_similarity


def make_one_prediction(user_id):
    encoded_user_id = user_encoder.transform(user_id)
    user_preferences = user_profiles[encoded_user_id, :]
    encoded_artist = cosine_similarity(user_preferences, artists_matrix).argmax()
    # inverse_transform expects a 1d array-like, so wrap the scalar in a list.
    artist_id = artist_encoder.inverse_transform([encoded_artist])
    artist = artist_name.loc[artist_id].values[0][0]
    return artist

make_one_prediction(['00000c289a1829a808ac09c00daf10bc3c4e223b'])

'stabbing westward'

## Collaborative filtering


Now, we're going to use the collaborative filtering method to solve our problem of recommending new artists to users.

But first, what is collaborative filtering?

At a very high level, and using the music recommendation scenario, there are two ideas behind collaborative filtering.

### User based collaborative filtering

#### Intuition

If two people usually like the same artists and dislike the same artists, then they will probably feel the same way about a new artist.

#### Algorithm

1. establish similarities between users
2. for each user, define a set of their most similar users, called neighbors
3. in order to recommend a new artist to a user, we average the preferences of their neighbors and recommend the artist that the neighbors like the most
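
The three steps above can be sketched on a tiny playcount matrix. Everything here is made up for illustration, including the `recommend_user_based` helper and the choice of cosine similarity and a single neighbor.

```python
import numpy as np

# Rows are users, columns are artists (hypothetical playcounts).
plays = np.array([[5, 4, 0, 0],
                  [4, 5, 3, 0],
                  [0, 0, 4, 5]], dtype=float)

# 1. Cosine similarity between every pair of users.
unit_rows = plays / np.linalg.norm(plays, axis=1, keepdims=True)
user_sim = unit_rows.dot(unit_rows.T)

def recommend_user_based(user, n_neighbors=1):
    # 2. Most similar users, excluding the user themselves.
    sims = user_sim[user].copy()
    sims[user] = -1.0
    neighbors = sims.argsort()[::-1][:n_neighbors]
    # 3. Average the neighbors' preferences and hide already known artists.
    scores = plays[neighbors].mean(axis=0)
    scores[plays[user] > 0] = -1.0
    return int(scores.argmax())

recommended = recommend_user_based(0)   # artist 2, via the similar user 1
```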

### Item based collaborative filtering

#### Intuition

If many people feel the same way about two artists, then the two artists are probably similar; a user who only knows one of the artists and liked it will probably like the other artist too.

#### Algorithm

1. establish similarities between artists
2. each user has a preferences profile, based on how much they liked the artists
3. in order to recommend a new artist to a user, we pick the new artists that are the most similar to the user's preferences profile
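
The same toy matrix works for a sketch of the item based steps. Again, the data and the `recommend_item_based` helper are hypothetical, and cosine similarity is just one possible choice.

```python
import numpy as np

# Rows are users, columns are artists (hypothetical playcounts).
plays = np.array([[5, 4, 0, 0],
                  [4, 5, 3, 0],
                  [0, 0, 4, 5]], dtype=float)

# 1. Cosine similarity between every pair of artists (columns).
unit_cols = plays / np.linalg.norm(plays, axis=0, keepdims=True)
item_sim = unit_cols.T.dot(unit_cols)

def recommend_item_based(user):
    # 2. The preferences profile is the user's row of playcounts.
    profile = plays[user]
    # 3. Score each artist by its similarity to the profile, then hide
    #    the artists the user already knows.
    scores = item_sim.dot(profile)
    scores[profile > 0] = -np.inf
    return int(scores.argmax())

first = recommend_item_based(0)   # artist 2
last = recommend_item_based(2)    # artist 1
```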

### When should each method be used?

* If our system has many more users than items, go for item based (and vice-versa). Computing similarities is probably the most expensive operation you'll be doing, so do it on the smaller set.

* If your items are more stable than your users, i.e. if you add users more frequently than items, go for item based (and vice-versa). You'll want to recompute the similarity matrix as rarely as possible (while keeping accurate results, of course!).

* Do you need serendipity? If yes, go for user based. With item based, the results are derived from the user's own actions, so they will be less surprising to them.

* Do you need to justify your recommendations? If yes, go for item based. It's easier to justify a recommendation with "We're showing you this because you liked that" than with "We're showing you this because Jon Doe (who you don't know!) also liked it and your tastes are very similar".


## Let's move to the fun part!

First, let's import the train dataset.

In [20]:
import pandas as pd

train = pd.read_csv('../data/train.csv', names=['user_id', 'artist_id', 'artist_name', 'playcounts'])
train.head()

Unnamed: 0,user_id,artist_id,artist_name,playcounts
0,00000c289a1829a808ac09c00daf10bc3c4e223b,b3ae82c2-e60b-4551-a76d-6620f1b456aa,melissa etheridge,897
1,00000c289a1829a808ac09c00daf10bc3c4e223b,bbd2ffd7-17f4-4506-8572-c1ea58c3f9a8,juliette & the licks,706
2,00000c289a1829a808ac09c00daf10bc3c4e223b,21f3573f-10cf-44b3-aeaa-26cccd8448b5,the black dahlia murder,507
3,00000c289a1829a808ac09c00daf10bc3c4e223b,a342964d-ca53-4e54-96dc-e8501851e77f,walls of jericho,393
4,00000c289a1829a808ac09c00daf10bc3c4e223b,f779ed95-66c8-4493-9f46-3967eba785a8,letzte instanz,387


Now, let's transform the user and artist ids into integers. This will help us when we convert our data into sparse matrices.

In [21]:
from sklearn.preprocessing import LabelEncoder

le_users = LabelEncoder()
le_users.fit(train.user_id.values)
train['user_label'] = le_users.transform(train.user_id.values)

le_artists = LabelEncoder()
le_artists.fit(train.artist_id.values)
train['artist_label'] = le_artists.transform(train.artist_id.values)

train.head()

Unnamed: 0,user_id,artist_id,artist_name,playcounts,user_label,artist_label
0,00000c289a1829a808ac09c00daf10bc3c4e223b,b3ae82c2-e60b-4551-a76d-6620f1b456aa,melissa etheridge,897,0,1677
1,00000c289a1829a808ac09c00daf10bc3c4e223b,bbd2ffd7-17f4-4506-8572-c1ea58c3f9a8,juliette & the licks,706,0,1761
2,00000c289a1829a808ac09c00daf10bc3c4e223b,21f3573f-10cf-44b3-aeaa-26cccd8448b5,the black dahlia murder,507,0,295
3,00000c289a1829a808ac09c00daf10bc3c4e223b,a342964d-ca53-4e54-96dc-e8501851e77f,walls of jericho,393,0,1505
4,00000c289a1829a808ac09c00daf10bc3c4e223b,f779ed95-66c8-4493-9f46-3967eba785a8,letzte instanz,387,0,2335


Here we'll store our data in a sparse matrix where each row corresponds to an artist and each column corresponds to a user.

In [22]:
import numpy as np
from scipy.sparse import coo_matrix

row  = np.array(train.artist_label.values)
col  = np.array(train.user_label.values)
data = np.array(train.playcounts.values)
m = coo_matrix((data, (row, col)), shape=(train.artist_label.nunique(), train.user_label.nunique()))

This is a helper function to show us the density of a sparse matrix.

In [23]:
def print_matrix_density(m):
    # this is the number of non-zero values in the matrix
    nnz = m.nnz

    # get the matrix shape
    shape = m.shape

    # the number of elements in the matrix is n_rows * n_cols
    n_elems = shape[0] * shape[1]

    # the density of the matrix is the number of non-zero elements over all the elements
    print('Density of sparse matrix: {}%'.format(round(nnz/n_elems*100, 2)))

In [24]:
print_matrix_density(m)

Density of sparse matrix: 0.48%


In order to compute similarities between artists, we'll use scikit-learn's `cosine_similarity` function. Since `m` is sparse, passing `dense_output=False` keeps the output sparse as well.

In [25]:
from sklearn.metrics.pairwise import cosine_similarity

S = cosine_similarity(m, dense_output=False)
print(type(S))
print_matrix_density(S)

<class 'scipy.sparse.csr.csr_matrix'>
Density of sparse matrix: 68.72%


Now, let's produce recommendations! First, we'll read a test set and then we'll encode the user_ids, as we did before.

In [26]:
test = pd.read_csv('../data/test.csv', names=['user_id'])
test.head()

Unnamed: 0,user_id
0,002b5d4a2aea86f2b52103393398409aa28d435a
1,00a41bc4ce19c07116b93430e8ccd41984fb2e2d
2,01ac389171616b01a392118dab5a8f56bcfbdb3b
3,0208feef15a26e8c93c776022a8f63c00c4fa7e9
4,03030070193c32afdb21bea4b3597b4d2d549bd8


In [27]:
test_labels = le_users.transform(test.user_id.values)
test_labels[:10]

array([ 154,  589, 1594, 1936, 2801, 3119, 3999, 4634, 5804, 6021])

Then, we'll select from the initial matrix `m` the columns that correspond to users in the test set.

In [28]:
m = m.tocsc()
test_users = m[:,test_labels]
print(type(test_users))

<class 'scipy.sparse.csc.csc_matrix'>


Here is where Math attacks!
We'll get the preferences of the users in the test set by computing the dot product between the similarities matrix and the users (or the artists that they like).

In [29]:
preferences = S.dot(test_users)
print('Density of sparse matrix: {}%'.format(round(preferences.nnz/(preferences.shape[0]*preferences.shape[1])*100, 2)))
preferences[:,0].toarray()

Density of sparse matrix: 99.95%


array([[   1.83029269],
       [   1.64725937],
       [ 141.30845304],
       ..., 
       [ 106.26176781],
       [   0.22112047],
       [  16.40485559]])

The preferences matrix is not so sparse anymore... But that's ok, because this matrix is much smaller, with just 300 columns (the number of users in the test set).

Now, we want to remove from the preferences matrix the artists that the users already listened to.

In [30]:
test_users_coo = test_users.tocoo()

for i, j in zip(test_users_coo.row, test_users_coo.col):
    preferences[i, j] = 0

preferences[:,0].toarray()

array([[   1.83029269],
       [   1.64725937],
       [   0.        ],
       ..., 
       [ 106.26176781],
       [   0.22112047],
       [  16.40485559]])
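
Since the preferences matrix is almost dense anyway, one alternative to the element-by-element loop is to mask a dense array in a single indexing step. This is a sketch on made-up data, assuming the dense array fits in memory; the `toy_*` names are hypothetical.

```python
import numpy as np
from scipy.sparse import csc_matrix

# Artists in the rows, users in the columns (hypothetical playcounts).
toy_listened = csc_matrix(np.array([[3, 0],
                                    [0, 0],
                                    [0, 1]]))

# Dense preference scores with the same shape.
toy_scores = np.array([[0.9, 0.2],
                       [0.5, 0.4],
                       [0.1, 0.8]])

# Positions of the artists each user already listened to.
listened_rows, listened_cols = toy_listened.nonzero()

# Zero them all out at once.
toy_scores[listened_rows, listened_cols] = 0
```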

And now, let's move to pandas...

In [31]:
df_preferences = pd.DataFrame(preferences.toarray())
df_preferences = df_preferences.transpose()
df_preferences.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2410,2411,2412,2413,2414,2415,2416,2417,2418,2419
0,1.830293,1.647259,0.0,13.486969,2.679935,3.801377,0.259463,0.930958,6.805027,1.941573,...,7.073527,0.494677,1.243961,1.976835,6.900427,0.291134,0.237262,106.261768,0.22112,16.404856
1,2.353086,7.059208,7.631676,11.471106,1.615172,0.126451,1.297599,0.697074,6.371153,36.267973,...,16.185189,5.46452,0.337963,1.697208,1.015752,0.933111,1.867684,3.058141,7.163797,86.52104
2,0.33493,0.523145,0.714925,0.982454,0.142186,0.0,0.60732,1.057052,1.229495,0.565041,...,5.2572,3.081868,0.241466,11.184143,0.046592,0.287489,0.421334,0.695675,0.429554,4.022889
3,7.22717,2.559279,5.53845,14.323927,1.255134,0.572675,6.485822,0.397746,14.111103,34.079862,...,15.625716,13.77488,5.720238,20.830233,0.178316,1.919642,12.034838,3.932393,6.147588,35.506348
4,2.720481,16.069489,5.177597,0.586048,32.122461,2.428006,9.722055,38.803199,77.276059,6.459918,...,12.662701,9.440265,14.207742,3.119378,7.059925,3.386277,46.198454,1.723172,7.005515,4.474405


... and select the top 50 artists for each user.

In [32]:
k = 50
results = []
for index, row in df_preferences.iterrows():
    top_k = row.sort_values(ascending=False).head(k).index.values
    results.append(top_k)
    
# inverse_transform expects a 1d array, so decode one user's top k at a time.
results_ids = [le_artists.inverse_transform(top_k) for top_k in results]
results_ids[0][:10]

array(['94dbfe2e-ca48-4e08-a5a8-e1e74136c63d',
       '4e024037-14b7-4aea-99ad-c6ace63b9620',
       '4e954b02-fae2-4bd7-9547-e055a6ac0527',
       'c71abd83-9d66-4c7b-9f0d-c9c36e85a955',
       'd75d1f08-bbb8-4eae-9877-399ca9121197',
       'b614843c-bec3-421f-9af1-03169cdd4b63',
       '6593c2bc-9327-4adf-a8b2-b315e4f5c0bb',
       'd50548a0-3cfd-4d7a-964b-0aef6545d819',
       'ceef10f5-324d-4a04-8db7-1a4181e19ab3',
       '1f1f6737-b930-46fc-8d25-110bb99f7490'], dtype=object)
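
The top-k selection can also be done without pandas. A minimal sketch with `np.argsort` on a made-up score matrix (users in the rows, artists in the columns; `toy_preferences` and `top_n` are hypothetical names):

```python
import numpy as np

toy_preferences = np.array([[0.1, 0.9, 0.5, 0.0],
                            [0.7, 0.2, 0.0, 0.8]])
top_n = 2

# argsort is ascending, so negate the scores to get the largest first,
# then keep the first top_n column indices of every row.
toy_top = np.argsort(-toy_preferences, axis=1)[:, :top_n]
# toy_top is [[1, 2], [3, 0]]
```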

But how good is our model?
Let's evaluate it.

In the test_true set, we have 5 artists that each user listened to. So evaluating our model basically means checking whether the top 50 we predicted for each user contains the artists in the test_true dataset.

We'll use the mean average precision metric, which checks whether our predictions are in the test_true dataset and how well they are ordered.
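
To get a feel for the metric, here is a short pure-Python sketch of mean average precision at k. It follows the common definition and is not the code of any particular library; the function names and the toy lists are made up.

```python
def average_precision_at_k(actual, predicted, k):
    hits = 0
    score = 0.0
    for rank, item in enumerate(predicted[:k], start=1):
        # Count each relevant item once, at the rank where it first appears.
        if item in actual and item not in predicted[:rank - 1]:
            hits += 1
            score += hits / rank
    return score / min(len(actual), k) if actual else 0.0


def mean_average_precision_at_k(actual_lists, predicted_lists, k):
    scores = [average_precision_at_k(a, p, k)
              for a, p in zip(actual_lists, predicted_lists)]
    return sum(scores) / len(scores)


# User 1: hits at ranks 1 and 3 -> (1/1 + 2/3) / 2 = 5/6
# User 2: one hit at rank 2     -> (1/2) / 1 = 1/2
toy_actual = [["a", "b"], ["c"]]
toy_predicted = [["a", "x", "b"], ["x", "c"]]
toy_map = mean_average_precision_at_k(toy_actual, toy_predicted, k=3)
```

Hits near the top of the list contribute more than hits further down, which is how the metric rewards a good ordering.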

In [33]:
test_true = pd.read_csv('../data/test_true.csv', header=None, names=['user_id', 'top_artists'])
test_true.head()

Unnamed: 0,user_id,top_artists
0,002b5d4a2aea86f2b52103393398409aa28d435a,"['6ec73176-6ea6-49d3-87e8-35b5fe6813f5', 'b614..."
1,00a41bc4ce19c07116b93430e8ccd41984fb2e2d,"['1f8e4ffd-b81b-4d10-91f4-461585bf9d16', '2228..."
2,01ac389171616b01a392118dab5a8f56bcfbdb3b,"['79ad5f01-b181-448d-b776-a252949d61af', 'a558..."
3,0208feef15a26e8c93c776022a8f63c00c4fa7e9,"['b9a20306-a4f5-4d3c-8680-e9cdc7e3af5b', '3d6b..."
4,03030070193c32afdb21bea4b3597b4d2d549bd8,"['1397d045-1603-41fc-80b9-712c18360145', 'ad38..."


In [34]:
import json

test_true.top_artists = test_true.top_artists.str.replace('\'', '"')
test_true.top_artists = test_true.apply(lambda x: json.loads(x.top_artists), axis=1)
test_true.head()

Unnamed: 0,user_id,top_artists
0,002b5d4a2aea86f2b52103393398409aa28d435a,"[6ec73176-6ea6-49d3-87e8-35b5fe6813f5, b614843..."
1,00a41bc4ce19c07116b93430e8ccd41984fb2e2d,"[1f8e4ffd-b81b-4d10-91f4-461585bf9d16, 2228019..."
2,01ac389171616b01a392118dab5a8f56bcfbdb3b,"[79ad5f01-b181-448d-b776-a252949d61af, a558ed3..."
3,0208feef15a26e8c93c776022a8f63c00c4fa7e9,"[b9a20306-a4f5-4d3c-8680-e9cdc7e3af5b, 3d6bbeb..."
4,03030070193c32afdb21bea4b3597b4d2d549bd8,"[1397d045-1603-41fc-80b9-712c18360145, ad38670..."


In [35]:
import ml_metrics as metrics

metrics.mapk(test_true.top_artists.values, results_ids, k)

0.15531719983948952

# That's it!

# Thank you!