# How to Build a Recommender in Python


### by Maria Dominguez (aka Chi) & João Ascensão

## Overview:

#### Leveraging Python's data stack to build a music recommender, not unlike Spotify's Discover Weekly.

Notes:

* Implicit and explicit feedback
* Uniplaces and Data Science Academy logos at the end

## Recommender Systems:

Recommender systems present *items* that are likely to interest the *user*, by comparing the user's profile (inferred from *data*) to reference characteristics.

Thus, our general framework contains three main components:

* **Users**: the people in our system, who generate data and will receive item recommendations
* **Items**: the things in our system, with which the users interact and that we want to recommend to them
* **Data**: the different ways a user can express opinions about our items, i.e. where users and items meet.

## Music Recommendations:

[Word out there](https://hackernoon.com/spotifys-discover-weekly-how-machine-learning-finds-your-new-music-19a41ab76efe) is that Spotify's Discover Weekly mixes together two well-known types of recommenders:

* **Content-based filtering**: combines data (e.g. clicks, playcounts) and item attributes (e.g. content, text, tags, metadata) to create *user profiles*

    * **Natural language processing (NLP)**: representing artists and tracks with features extracted from text (e.g. description, comments, tags)
    * **Audio models**: extracting features by analyzing the raw audio tracks.
    
    
* **Collaborative filtering**: analyzes user historical behaviour to find similarities across pairs of users and/or items.
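
To make the collaborative idea concrete, here is a minimal sketch of item-to-item similarity computed from play counts alone. The play counts are made up, and cosine similarity is only one of several possible similarity measures; this is an illustration, not necessarily the approach used by Spotify or by the collaborative branch of this prototype.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy play counts (made up): rows are users, columns are artists.
play_counts = np.array([
    [10, 8, 0],
    [5, 4, 0],
    [0, 0, 7],
])

# Compare artists by the users who listen to them: transpose so that
# each row describes one artist.
artist_similarity = cosine_similarity(play_counts.T)
print(np.round(artist_similarity, 2))
```

Artists 0 and 1 are played by the same users in the same proportions, so they come out as highly similar, while artist 2 shares no listeners with either of them.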

## Building the Recommender:

Using data from the lastfm dataset, our prototype will include:

* A content-based filtering branch, based on processing the tags assigned by the users to each artist (no raw audio analysis today, bummer)
* A collaborative-filtering approach, using user/artist playcounts.

## Content-based filtering:

We will import a file we prepared, containing tags for the artists in the dataset.

### Data preparation:

#### Preparing the train dataset

We want to read the dataset into a `train` dataframe, containing `user_id`, `artist_id`, `artist_name` and `play_counts`.

In [1]:
import pandas as pd
import numpy as np


train_cols = ['user_id', 'artist_id', 'artist_name', 'play_counts']
train = pd.read_csv("../data/data/train.csv", header=None, names=train_cols)
# We checked, there are no missing values :)
train.head(n=3)

Unnamed: 0,user_id,artist_id,artist_name,play_counts
0,00000c289a1829a808ac09c00daf10bc3c4e223b,b3ae82c2-e60b-4551-a76d-6620f1b456aa,melissa etheridge,897
1,00000c289a1829a808ac09c00daf10bc3c4e223b,bbd2ffd7-17f4-4506-8572-c1ea58c3f9a8,juliette & the licks,706
2,00000c289a1829a808ac09c00daf10bc3c4e223b,21f3573f-10cf-44b3-aeaa-26cccd8448b5,the black dahlia murder,507


Also, we don't need the artist name for now. We will store it in a separate dataframe to use in the last step, to provide nicer looking recommendations.

In [2]:
# We will set the dataframe index to be the artist id for easy reference in
# the future
artist_name = train[['artist_id', 'artist_name']].set_index('artist_id')
train = train.drop(['artist_name'], axis=1)
train.head(n=3)

Unnamed: 0,user_id,artist_id,play_counts
0,00000c289a1829a808ac09c00daf10bc3c4e223b,b3ae82c2-e60b-4551-a76d-6620f1b456aa,897
1,00000c289a1829a808ac09c00daf10bc3c4e223b,bbd2ffd7-17f4-4506-8572-c1ea58c3f9a8,706
2,00000c289a1829a808ac09c00daf10bc3c4e223b,21f3573f-10cf-44b3-aeaa-26cccd8448b5,507


#### Preparing the tags dataset

The `tags` dataframe has `artist_id` and one row per `tag`. There may be some rows with empty tags; we will get rid of those, as we have no use for them.

In [3]:
tags_cols = ['artist_id', 'tag']
tags = pd.read_csv("../data/data/tags.csv", header=None, names=tags_cols)
# Null tags are of no use, and if there are some they must be removed.
tags_null = tags['tag'].isnull().sum()
print("There are {} tags missing. We must drop them.".format(tags_null))
tags = tags.dropna()
tags.head(n=3)

There are 0 tags missing. We must drop them.


Unnamed: 0,artist_id,tag
0,3bd73256-3905-4f3a-97e2-8b341527f805,90s
1,3bd73256-3905-4f3a-97e2-8b341527f805,alternative
2,3bd73256-3905-4f3a-97e2-8b341527f805,alternative punk rock


#### Include artists in both datasets only

Now, we can only build artist profiles for artists with at least one tag assigned to them, otherwise there's nothing we can use to *describe* them.

Similarly, artists with no playcounts are of no use for building user profiles. For convenience, we will remove them, *but could they be useful at some point*?

In [4]:
# We could try to use pandas.DataFrame.merge, why don't we?
artists_intersect = np.intersect1d(tags['artist_id'].values, 
                                   train['artist_id'].values)

To remove unnecessary artists from both dataframes, we will *select by label* the artists we want.

In [5]:
# Selecting by label only artists in tags that are also in train.
tags = tags.set_index('artist_id').loc[artists_intersect].reset_index()

In [6]:
# Selecting by label only artists in train that are also in tags.
train = train.set_index('artist_id').loc[artists_intersect].reset_index()
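
An alternative to selecting by label is a boolean mask built with `Series.isin`, which avoids the round trip through the index. The sketch below uses made-up toy dataframes (`toy_tags`, `toy_train`) rather than the real data, so treat it as an illustration of the idea only.

```python
import pandas as pd

# Toy stand-ins for the `tags` and `train` dataframes (made-up values).
toy_tags = pd.DataFrame({'artist_id': ['a', 'a', 'b'],
                         'tag': ['rock', 'pop', 'jazz']})
toy_train = pd.DataFrame({'user_id': ['u1', 'u1', 'u2'],
                          'artist_id': ['a', 'c', 'b'],
                          'play_counts': [10, 3, 7]})

# Artists present in both dataframes.
common = set(toy_tags['artist_id']) & set(toy_train['artist_id'])

# Boolean masks keep the original row order and index.
toy_tags = toy_tags[toy_tags['artist_id'].isin(common)]
toy_train = toy_train[toy_train['artist_id'].isin(common)]
print(toy_train)
```

Artist `c` has play counts but no tags, so its row is dropped from `toy_train`.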

#### Label encode the `artist_id` on both dataframes

While the reasoning behind this step is not immediately clear, it's important that `artist_id` is encoded as an integer.

In [7]:
from sklearn.preprocessing import LabelEncoder


# We will use the sklearn.preprocessing.LabelEncoder for this but other 
# approaches are valid. This will assign an integer to each unique string.
artist_encoder = LabelEncoder()
tags['artist_id'] = artist_encoder.fit_transform(tags['artist_id'])

# It is essential that we use the same encoder to transform the artists
# in the train dataset, so that they have the same label on both sides.
train['artist_id'] = artist_encoder.transform(train['artist_id'])
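
As a small illustration of what the encoder does, the sketch below runs it on a few shortened, made-up ids. The integers it produces can later serve as row and column positions in a matrix, which is why the same fitted encoder has to be reused on both dataframes.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

# fit_transform learns the sorted set of unique labels and maps them to 0..n-1.
tags_side = encoder.fit_transform(['b3ae', '21f3', 'b3ae', 'bbd2'])

# transform reuses the learned mapping, so ids agree across dataframes.
train_side = encoder.transform(['bbd2', '21f3'])

# inverse_transform recovers the original strings from the integers.
original = encoder.inverse_transform(train_side)

print(tags_side, train_side, original)
```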

#### Sort the tags dataframe by `artist_id`

Again, while not immediately clear why, this is a critical step: we need to ensure that the `tags` dataframe is sorted according to the *encoded* `artist_id`.

In [8]:
tags = tags.set_index('artist_id').sort_index().reset_index()
tags.head(n=3)

Unnamed: 0,artist_id,tag
0,0,90s
1,0,acoustic
2,0,alternative


### Building the artist profiles:

We should check how many unique artists and tags we have.

In [9]:
unique_artists = len(tags['artist_id'].unique())
unique_tags = len(tags['tag'].unique())
print("There are {} unique tags for {} unique artists.".format(unique_tags, 
                                                               unique_artists))

There are 2169 unique tags for 2420 unique artists.


Let's look at the top 3 most common tags.

In [10]:
tags_count = tags['tag'].value_counts()
print(tags_count[:3])

rock          2349
pop           1863
electronic    1539
Name: tag, dtype: int64


Now, let's take a random sample of 5 of the most obscure tags, just for fun.

In [11]:
tags_count_one = tags_count[tags_count == 1]
np.random.seed(1)
# We will randomize the order of the tags with count one so we can select 5 
# obscure tags at random.
randomize_order = np.random.permutation(len(tags_count_one))
tags_count_one.iloc[randomize_order][:5]

latin alternative                                        1
electric delta blues                                     1
1 7 186240 183 23558 41608 89158 111733 150833 169883    1
harp                                                     1
dark folk                                                1
Name: tag, dtype: int64

#### Merge all tags in a single *document* per artist

The thing is, instead of a single tag per row, we need a single entry per artist, to have a *collection* of tags per artist, resembling a *document*.

In [12]:
# There are probably less hacky ways to do this but it's not too expensive,
# it's interpretable and serves the purpose well.
# We are concatenating all tags as words in a single sentence.
tags = tags.groupby('artist_id')['tag'].apply(' '.join).reset_index()
# Some words, previously associated with particular tags, will appear more
# than once (e.g. 'alternative' in the second row).
tags.head(n=3)

Unnamed: 0,artist_id,tag
0,0,90s acoustic alternative alternative rock braz...
1,1,00s acoustic alternative alternative rock beau...
2,2,acid jazz afrobeat alternative rock ambient bi...


When talking about musical genres, people tend to use 'alternative rock' and 'alternative-rock' interchangeably.

In [13]:
# At this point, do things that don't scale. Try to spot possible strategies to
# improve the quality of your documents.
tags['tag'] = tags['tag'].str.replace("-", " ")

#### Structuring the tags data (aka vectorization)

And no, we are not using word2vec.

A vector exists in a space, in this particular case, a high-dimensional space corresponding to all the tags in our dataset. 

We take each word appearing in the documents as a dimension, or a feature. Each document is then represented as a *keyword vector*, typically with counts.
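
As a rough illustration of keyword vectors with plain counts, the sketch below runs scikit-learn's `CountVectorizer` on three made-up tag documents. It is not the vectorizer the notebook ends up using; it is only meant to show what "one dimension per word" looks like.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up artist documents.
docs = ['alternative rock rock', 'pop rock', 'jazz']

count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(docs)

# Order the words by their column index in the matrix.
vocabulary = count_vectorizer.vocabulary_
features = sorted(vocabulary, key=vocabulary.get)

print(features)
print(counts.toarray())
```

Each row is one document and each column one word, so the first document has a 2 in the `rock` column and a 1 in the `alternative` column.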

#### Weighting different words

TF-IDF stands for Term Frequency - Inverse Document Frequency and is a *weighting* function.

Why do we need it? Because not all terms are equally relevant to describe an artist. In short, we measure the term frequency, weighted by its rarity.

$$ IDF _{term} = \log\left({\frac{TotalDocuments}{DocumentsWithTerm}} \right) $$

Thus:

$$ TFIDF _{term} = TF _{term} \times IDF _{term} $$
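
To make the two formulas concrete, here is a minimal sketch that applies them to made-up term counts with NumPy. Note that scikit-learn's `TfidfVectorizer` uses a smoothed variant of the IDF by default (`smooth_idf=True`) and normalizes each row, so its numbers will differ from this textbook version.

```python
import numpy as np

# Toy term counts (made up): 3 artist documents x 3 terms.
terms = ['rock', 'pop', 'harp']
tf = np.array([
    [2, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
])

total_documents = tf.shape[0]
documents_with_term = (tf > 0).sum(axis=0)  # [3, 2, 1]

# A term present in every document gets an IDF of log(1) = 0.
idf = np.log(total_documents / documents_with_term)
tfidf = tf * idf

print(dict(zip(terms, np.round(idf, 3))))
print(np.round(tfidf, 3))
```

The ubiquitous `rock` ends up with zero weight, while the rare `harp` gets the largest one.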

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.snowball import EnglishStemmer


# Add language stemmer to TfidfVectorizer by overriding the built-in 
# build_analyzer()
class StemmedTfidfVectorizer(TfidfVectorizer):
    # Stemming reduces words to their most basic form, to minimize the
    # feature space.
    def build_analyzer(self):
        stemmer = EnglishStemmer()
        analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()
        return lambda doc: ([stemmer.stem(w) for w in analyzer(doc)])

#### Bag of words representation with TF-IDF

We will now build our dictionary of tags and project our artist documents in our vector space.

In [15]:
# Bag of words representation, considering words with document frequency
# of at least 3 documents, otherwise they are not useful.
vectorizer = StemmedTfidfVectorizer(min_df=3, analyzer='word', 
                                    stop_words='english', lowercase=True)

# Build internal dictionary.
vectorizer.fit(tags['tag'])

# The dictionary is the collection of distinct words.
vocabulary = list(vectorizer.vocabulary_)
print("Sample of the vocabulary {}.".format(vocabulary[:9]))
print("There are {} unique tags.".format(len(vocabulary)))

Sample of the vocabulary ['90s', 'acoust', 'altern', 'rock', 'brazil', 'brazilian', 'pop', 'music', 'dancehal'].
There are 774 unique tags.


#### Generating a sparse matrix of keyword vectors

We trained our dictionary; now we need to transform our `tags` dataframe into keyword vectors, with all the dictionary terms as features.

As the output below shows, the result is a sparse matrix. We explain what that is when we build the user profiles.

We sorted the artists because our vectorizer maintains the order of the rows, so row *i* of the matrix corresponds to the encoded `artist_id` *i*, and we will perform some algebra on top of this very soon.

(This is a hint!)

In [16]:
# This will generate a sparse matrix with our artists in the rows, and
# the tags as columns.
artists_matrix = vectorizer.transform(tags['tag'])
print(artists_matrix.shape, type(artists_matrix))

(2420, 774) <class 'scipy.sparse.csr.csr_matrix'>


#### A note about normalization

Normalizing the artist keyword vectors is key, otherwise artists with more tags will have more weight on user profiles.

By normalizing, we mean making all vectors length 1.

We accomplish this with the following formula:

$$ \vec{u} = {\frac{\vec{v}}{\parallel\vec{v}\parallel}} $$

Where:

$$ \parallel\vec{v}\parallel = \sqrt{v_1^2 + v_2^2 + ... + v_n^2} $$

So, we take the non-normalized vector and divide it (i.e. all its components) by its own magnitude, also called length, or *norm*.
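
A minimal sketch of the formula on a made-up two-dimensional vector, checked against scikit-learn's `normalize`. Worth knowing: `TfidfVectorizer` applies this L2 normalization to every row by default (`norm='l2'`), so a matrix produced with default settings already has unit-length rows.

```python
import numpy as np
from sklearn.preprocessing import normalize

# A made-up, non-normalized vector (as a single-row matrix).
v = np.array([[3.0, 4.0]])

# Norm: square root of the sum of squared components.
norm = np.sqrt((v ** 2).sum())

# Divide every component by the norm to get a unit-length vector.
u = v / norm

print(norm, u)
print(np.allclose(u, normalize(v, norm='l2')))
```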

### Building user profiles:

#### Generating a sparse matrix with users, items and playcounts:

A sparse matrix is a big matrix in which most elements are 0.

The idea behind sparse matrices is to store only the information relative to the non-zero elements of the matrix.

This makes working with these matrices way easier, and more performant.
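
To see what "storing only the non-zero elements" means in practice, the sketch below converts a small made-up dense matrix to SciPy's CSR format and inspects the three arrays it keeps internally.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A made-up matrix with only two non-zero elements.
dense = np.array([
    [0, 0, 5],
    [0, 0, 0],
    [2, 0, 0],
])
sparse = csr_matrix(dense)

print(sparse.nnz)      # number of stored (non-zero) elements
print(sparse.data)     # the non-zero values, row by row
print(sparse.indices)  # the column index of each stored value
print(sparse.indptr)   # where each row starts and ends within `data`
```

Only two values are stored instead of nine. For a matrix with hundreds of thousands of users and thousands of artists, where each user has listened to only a few artists, the saving is far larger.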

In [17]:
from scipy.sparse import csr_matrix


user_encoder = LabelEncoder()

# We also need to transform the user_id into an integer.
# The row index in our sparse matrix.
rows = user_encoder.fit_transform(train['user_id'])
# The column index in our sparse matrix.
cols = train['artist_id'].values
data = train['play_counts'].values

# csr_matrix sets users_matrix[rows[k], cols[k]] = data[k], summing duplicates.
users_matrix = csr_matrix((data, (rows, cols)),
                          shape=(rows.max() + 1, cols.max() + 1))

In [18]:
users_matrix.shape

(239929, 2420)

In [19]:
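# Each user profile is the sum of the keyword vectors of the artists the
# user listened to, weighted by the play counts.
user_profiles = users_matrix.dot(artists_matrix)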

In [20]:
user_profiles.shape

(239929, 774)
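
The product above can be read as a weighted sum: each user profile adds up the keyword vectors of the artists the user played, weighted by play counts. A minimal sketch with made-up toy matrices (`toy_users`, `toy_artists`):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy play counts (made up): 2 users x 3 artists.
toy_users = csr_matrix(np.array([[10, 0, 0],
                                 [1, 0, 3]]))

# Toy keyword vectors (made up): 3 artists x 2 tag features.
toy_artists = csr_matrix(np.array([[1.0, 0.0],
                                   [0.0, 1.0],
                                   [0.6, 0.8]]))

# (users x artists) . (artists x features) -> (users x features)
toy_profiles = toy_users.dot(toy_artists)
print(toy_profiles.toarray())
```

The second user's profile is `1 * [1.0, 0.0] + 3 * [0.6, 0.8]`, so the artist played three times dominates the result.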

#### Making predictions (and another note about normalization)

An alternative to normalizing our vectors is to just compute the cosine between the artist profile and the user taste.

This is because the cosine *is a dot-product that is already normalized*:

$$ cos(\vec{u}, \vec{v}) = 
\frac{\langle \vec{u}, \vec{v} \rangle}{\parallel\vec{u}\parallel\parallel\vec{v}\parallel} $$

In fact, the predictions will be *scaled by the user taste vector norm*, but that's ok because the order between the items will be kept unchanged.
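
A quick numerical check of this claim, using made-up unit-length artist vectors and a made-up user profile: the plain dot product equals the cosine multiplied by the user norm, and rescaling the user vector leaves the ranking untouched.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy data (made up): three unit-length artist vectors and one user profile.
artists = np.array([[1.0, 0.0],
                    [0.6, 0.8],
                    [0.0, 1.0]])
user = np.array([[2.8, 2.4]])

cosine_scores = cosine_similarity(user, artists)
dot_scores = user.dot(artists.T)

# With unit-length artists, the dot product is the cosine times the user norm.
user_norm = np.linalg.norm(user)
print(np.allclose(dot_scores, cosine_scores * user_norm))

# Scaling the user vector changes the dot products but not the ranking.
ranking_cosine = np.argsort(-cosine_scores.ravel())
ranking_dot = np.argsort(-(10 * user).dot(artists.T).ravel())
print(ranking_cosine, ranking_dot)
```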

Let's build our prediction function.

In [27]:
from sklearn.metrics.pairwise import cosine_similarity


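def make_one_prediction(user_id):
    # user_id must be a list-like holding a single raw user id.
    encoded_user_id = user_encoder.transform(user_id)
    user_preferences = user_profiles[encoded_user_id, :]
    # Column index of the artist most similar to the user profile.
    encoded_artist = cosine_similarity(user_preferences, artists_matrix).argmax()
    # inverse_transform expects a 1-d array, so wrap the scalar in a list.
    artist_id = artist_encoder.inverse_transform([encoded_artist])[0]
    # artist_name has one row per (user, artist) pair; take the first match.
    artist = artist_name.loc[[artist_id], 'artist_name'].iloc[0]
    return artist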

make_one_prediction(['00000c289a1829a808ac09c00daf10bc3c4e223b'])

'stabbing westward'
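
The function above returns only the single best match, which may be an artist the user already listens to. One possible extension, sketched here on made-up toy vectors, is to rank all artists and skip the ones already played. The helper `top_k_artists` and its parameters are hypothetical names introduced for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def top_k_artists(user_profile, artist_vectors, already_played, k=2):
    """Hypothetical helper: indices of the k most similar unplayed artists."""
    scores = cosine_similarity(user_profile, artist_vectors).ravel()
    # Push artists the user already knows to the bottom of the ranking.
    scores[list(already_played)] = -np.inf
    # Sort by descending score and keep the first k indices.
    return np.argsort(-scores)[:k]


# Toy data (made up): four unit-length artist vectors and one user profile.
artist_vectors = np.array([[1.0, 0.0],
                           [0.6, 0.8],
                           [0.0, 1.0],
                           [0.8, 0.6]])
user_profile = np.array([[10.0, 0.0]])

recommended = top_k_artists(user_profile, artist_vectors, already_played={0})
print(recommended)
```

Artist 0 is the closest match but is excluded because the user has already played it, so the next two most similar artists are returned instead.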