# How to Build a Recommender in Python


### by Maria Dominguez (aka Chi) & João Ascensão

## Overview:

#### Leveraging Python's data stack to build a music recommender, not unlike Spotify's Discover Weekly.


### Recommender Systems:

Recommender systems present *items* that are likely to interest the *user*, by comparing the user's profile (inferred from *data*) to reference characteristics.

Thus, our general framework contains three main components:

* **Users**: the people in our system, who generate data and will receive item recommendations
* **Items**: the things in our system, with which the users interact and that we want to recommend to them
* **Data**: the different ways a user can express opinions about our items, i.e. where users and items meet.
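
As a rough illustration of how the three components fit together, the sketch below stores a few made-up play events and groups them per user. The names `interactions` and `build_profiles`, and all the values, are hypothetical and are not taken from the dataset used later.

```python
from collections import defaultdict

# Hypothetical play events: (user, item, playcount). Made-up values.
interactions = [
    ("alice", "Radiohead", 12),
    ("alice", "Portishead", 3),
    ("bob", "Radiohead", 5),
    ("bob", "Daft Punk", 8),
]

def build_profiles(events):
    """Group the data by user: user -> {item: total playcount}."""
    profiles = defaultdict(dict)
    for user, item, count in events:
        profiles[user][item] = profiles[user].get(item, 0) + count
    return dict(profiles)

profiles = build_profiles(interactions)
users = sorted(profiles)                                # the users
items = sorted({item for _, item, _ in interactions})   # the items

print(users)
print(items)
print(profiles["alice"])
```

Here the data is the only place where users and items meet: every recommendation we make later is derived from records of this shape.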

### Music Recommendations:

[Word out there](https://hackernoon.com/spotifys-discover-weekly-how-machine-learning-finds-your-new-music-19a41ab76efe) is that Spotify's Discover Weekly mixes together two well-known types of recommenders:

* **Content-based filtering**: combines data (e.g. clicks, playcounts) and item attributes (e.g. content, text, tags, metadata) to create *user profiles*

    * **Natural language processing (NLP)**: representing artists and tracks with features extracted from text (e.g. description, comments, tags)
    * **Audio models**: extracting features by analyzing the raw audio tracks.
    
    
* **Collaborative filtering**: analyzes users' historical behaviour to find similarities across pairs of users and/or items.
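
To make the collaborative filtering idea more concrete, here is a minimal sketch that computes item-to-item cosine similarity from a tiny, made-up playcount matrix. The matrix values and the helper name `item_cosine_similarity` are assumptions for illustration only; this is one common way to measure similarity, not necessarily what Spotify does.

```python
import numpy as np

# Hypothetical user x artist playcount matrix (rows: users, columns: artists).
artists = ["A", "B", "C"]
playcounts = np.array([
    [10, 8, 0],
    [5, 4, 0],
    [0, 1, 9],
], dtype=float)

def item_cosine_similarity(matrix):
    """Cosine similarity between the columns (items) of a user x item matrix."""
    norms = np.linalg.norm(matrix, axis=0)
    norms[norms == 0] = 1.0  # avoid dividing by zero for items nobody played
    normalized = matrix / norms
    return normalized.T @ normalized

sim = item_cosine_similarity(playcounts)
print(np.round(sim, 2))
```

Artists A and B are played by the same users in similar proportions, so they end up much closer to each other than either is to C.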

### Building the Recommender:

Using data from the Last.fm dataset, our prototype will include:

* A content-based filtering branch, based on processing the tags assigned by the users to each artist (no raw audio analysis today, bummer)
* A collaborative-filtering approach, using user/artist playcounts.

### Content-based filtering:

We will import a file we prepared containing a bunch of tags related to the artists in the dataset.

In [1]:
import pandas as pd
import numpy as np

cols = ['artist_id', 'tag']
tags = pd.read_csv("../data/lastfm/tags.csv", header=None, names=cols)
tags = tags.dropna()
tags.head(n=5)

| | artist_id | tag |
|---|---|---|
| 0 | 3bd73256-3905-4f3a-97e2-8b341527f805 | 90s |
| 1 | 3bd73256-3905-4f3a-97e2-8b341527f805 | alternative |
| 2 | 3bd73256-3905-4f3a-97e2-8b341527f805 | alternative punk rock |
| 3 | 3bd73256-3905-4f3a-97e2-8b341527f805 | alternative rock |
| 4 | 3bd73256-3905-4f3a-97e2-8b341527f805 | ambient |


We should check how many unique tags we have.

In [2]:
total_artists = len(tags['artist_id'].unique())
total_tags = len(tags['tag'].unique())
print("There are {} unique artists and {} unique tags in the dataset.".format(total_artists, total_tags))

There are 32452 unique artists and 8722 unique tags in the dataset.


As for the most common tags, let's check the top 5.

In [3]:
tags_count = tags['tag'].value_counts()
print(tags_count[:5])

rock                23187
electronic          20383
pop                 16523
alternative rock    13617
hip hop             11318
Name: tag, dtype: int64


Now, let's take a random sample of 5 of the most obscure tags, using a random permutation over the tags that appear only once.

In [4]:
tags_count_one = tags_count[tags_count == 1]
# Use positional indexing (.iloc): the Series is indexed by tag name, not by integer
tags_count_one.iloc[np.random.permutation(len(tags_count_one))[:5]]

bodhran                      1
bellingham                   1
przystanek woodstock 2006    1
brithish                     1
icelandic indie              1
Name: tag, dtype: int64

Next, we merge each artist's tags into a single space-separated string and replace hyphens with spaces, so that every artist becomes one text document.

In [25]:
tags = tags.groupby('artist_id')['tag'].apply(' '.join).reset_index()
tags['tag'] = tags['tag'].str.replace("-", " ")
tags.head(n=5)

| | artist_id | tag |
|---|---|---|
| 0 | 0002f649-8285-4a72-b847-b3854e1a449c | acoustic alternative emo pop folk jam pop pop ... |
| 1 | 00034ede-a1f1-4219-be39-02f36853373e | 90s acoustic alternative alternative rock braz... |
| 2 | 00039b8a-3da6-4cb2-85e3-f93e30f43049 | blues funk fusion jazz pop sexy soul |
| 3 | 0004533f-77b7-468d-8657-40db6adec34f | acoustic alternative american ballad banda bea... |
| 4 | 0004537a-4b12-43eb-a023-04009e738d2e | acid jazz alternative rock blues rock breakbea... |


A reasonable strategy might be splitting the tag strings into separate words, as there are many variations of each genre (e.g. rock and alternative rock).

Stemming, in turn, reduces each word to its root form (e.g. alternative becomes altern), so that variants of the same word are counted together.
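
To illustrate why splitting tags into words helps, the short sketch below counts features for three made-up tags, first as whole strings and then as individual words. The variable names and the tag list are hypothetical, and stemming is left out to keep the example dependency-free.

```python
from collections import Counter

# Hypothetical tags, similar in shape to the ones in the dataset.
artist_tags = ["rock", "alternative rock", "alternative punk rock"]

# Treating each tag as one unit: three distinct features with no overlap.
whole_tag_counts = Counter(artist_tags)

# Splitting on whitespace: the shared words become shared features.
word_counts = Counter(word for tag in artist_tags for word in tag.split())

print(whole_tag_counts)
print(word_counts)
```

As whole strings the three tags have nothing in common, while at the word level they all share the `rock` feature.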

In [6]:
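from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem.snowball import EnglishStemmer

# Build the stemmer and the default word analyzer once, instead of once per document.
stemmer = EnglishStemmer()
analyzer = CountVectorizer().build_analyzer()  # lowercases and tokenizes

def stem_word(doc):
    """Tokenize a tag string and yield the stem of each token."""
    return (stemmer.stem(word) for word in analyzer(doc))

# With a callable analyzer, CountVectorizer ignores `stop_words` and `lowercase`,
# so they are omitted here; lowercasing already happens inside `analyzer`.
cv = CountVectorizer(min_df=3, analyzer=stem_word)
cv.fit(tags['tag'])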

vocabulary = list(cv.vocabulary_)
print("Sample of the vocabulary:", vocabulary[:9])
print("There are {} unique tags".format(len(vocabulary)))

Sample of the vocabulary: ['acoust', 'altern', 'emo', 'pop', 'folk', 'jam', 'punk', 'rock', '90s']
There are 2964 unique tags


In [23]:
tag_count_matrix = cv.transform(tags['tag'])
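
To give an idea of what a count matrix like `tag_count_matrix` contains, here is a hand-rolled sketch on three made-up, already-stemmed tag strings. It is a simplified stand-in, not the `CountVectorizer` implementation: the names `docs`, `vocab` and `counts` are hypothetical, and it builds a dense NumPy array, whereas `cv.transform` returns a SciPy sparse matrix.

```python
import numpy as np

# Hypothetical, already-stemmed tag strings, one per artist.
docs = ["rock altern rock", "pop rock", "jazz"]

# Vocabulary: every distinct term, sorted and mapped to a column index.
terms = sorted({term for doc in docs for term in doc.split()})
vocab = {term: index for index, term in enumerate(terms)}

# Count matrix: one row per artist, one column per term.
counts = np.zeros((len(docs), len(vocab)), dtype=int)
for row, doc in enumerate(docs):
    for term in doc.split():
        counts[row, vocab[term]] += 1

print(vocab)
print(counts)
```

Each row is a numeric description of one artist, which is the form that similarity measures between artists can work with.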

### Collaborative filtering: