# How to Build a Recommender in Python


### by Maria Dominguez (aka Chi) & João Ascensão

## Overview:

#### Leveraging Python's data stack to build a music recommender, not unlike Spotify's Discover Weekly.


### Recommender Systems:

Recommender systems present *items* that are likely to interest the *user*, by comparing the user's profile (inferred from *data*) to reference characteristics.

Thus, our general framework contains three main components:

* **Users**: the people in our system, who generate data and receive item recommendations
* **Items**: the things in our system that users interact with and that we want to recommend to them
* **Data**: the different ways a user can express opinions about our items, i.e. where users and items meet.
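
As a minimal sketch of how the three components fit together, assume a toy interaction table. The user names, artists and play counts below are made up for illustration and do not come from the dataset used later.

```python
import pandas as pd

# Hypothetical toy data: each row is one interaction, where a user meets an item.
interactions = pd.DataFrame({
    "user": ["ana", "ana", "rui", "rui"],
    "item": ["Radiohead", "Portishead", "Radiohead", "Metallica"],
    "play_count": [12, 3, 7, 25],
})

users = sorted(interactions["user"].unique())
items = sorted(interactions["item"].unique())

# Pivot into a user x item table; 0 means the user never played the artist.
user_item = interactions.pivot_table(
    index="user", columns="item", values="play_count", fill_value=0
)
print(user_item)
```

Most cells in such a table are zero in practice, which is why the notebook later switches to sparse matrices.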

### Music Recommendations:

[Word out there](https://hackernoon.com/spotifys-discover-weekly-how-machine-learning-finds-your-new-music-19a41ab76efe) is that Spotify's Discover Weekly mixes together two well-known types of recommenders:

* **Content-based filtering**: combines data (e.g. clicks, play counts) and item attributes (e.g. content, text, tags, metadata) to create *user profiles*

    * **Natural language processing (NLP)**: representing artists and tracks with features extracted from text (e.g. description, comments, tags)
    * **Audio models**: extracting features by analyzing the raw audio tracks.
    
    
* **Collaborative filtering**: analyzes users' historical behaviour to find similarities across pairs of users and/or items.
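
To make the collaborative idea concrete, here is a minimal sketch on a made-up play-count matrix. Cosine similarity is used as one common choice of similarity measure; it is an assumption for illustration, not a claim about what Discover Weekly does.

```python
import numpy as np

# Hypothetical play counts: rows are users, columns are artists A, B and C.
plays = np.array([
    [10, 8, 0],
    [5, 4, 0],
    [0, 0, 9],
], dtype=float)


def cosine_similarity_columns(matrix):
    """Cosine similarity between every pair of columns (items)."""
    norms = np.linalg.norm(matrix, axis=0)
    normalized = matrix / norms
    return normalized.T @ normalized


item_similarity = cosine_similarity_columns(plays)
print(np.round(item_similarity, 2))
```

Artists A and B are played in the same proportions by the same users, so they come out as similar, while C shares no listeners with either and gets a similarity of zero. No item attributes are needed, only behaviour.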

### Building the Recommender:

Using data from the Last.fm dataset, our prototype will include:

* A content-based filtering branch, based on processing the tags assigned by the users to each artist (no raw audio analysis today, bummer)
* A collaborative-filtering approach, using user/artist playcounts.

### Content-based filtering:

We start by importing a file we prepared, containing the tags assigned to the artists in the dataset.

#### Data preparation:

##### Train

In [None]:
import pandas as pd
import numpy as np


train_cols = ['user_id', 'artist_id', 'artist_name', 'play_counts']
train = pd.read_csv("../data/lastfm/train.csv", header=None, names=train_cols)
# We checked, there are no missing values
train.head(n=3)

##### Tags

In [None]:
tags_cols = ['artist_id', 'tag']
tags = pd.read_csv("../data/lastfm/tags.csv", header=None, names=tags_cols)
print("There are {} tags missing. We must drop them.".format(tags['tag'].isnull().sum()))
tags = tags.dropna()
tags.head(n=3)

We need to find out which artists appear in both datasets.

In [None]:
artists_intersect = np.intersect1d(tags['artist_id'].values, train['artist_id'].values)

In [None]:
from sklearn.preprocessing import LabelEncoder

le1 = LabelEncoder()
le2 = LabelEncoder()

tags = tags.set_index('artist_id').loc[artists_intersect].reset_index()
tags['artist_id'] = le1.fit_transform(tags['artist_id'])
tags = tags.set_index('artist_id').sort_index().reset_index()
tags.head(n=3)

In [None]:
train = train.set_index('artist_id').loc[artists_intersect].reset_index()
train['artist_id'] = le1.transform(train['artist_id'])  # reuse the mapping fitted on tags
train['user_id'] = le2.fit_transform(train['user_id'])
train = train.set_index('artist_id').sort_index().reset_index()
train.head(n=3)
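
The encoding step matters because both dataframes must agree on which integer stands for which artist. A minimal sketch, using made-up raw ids, of how fitting a `LabelEncoder` once and reusing it keeps the two in sync:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical raw artist ids, as they might appear in two different files.
tag_artist_ids = ["b10", "a42", "c07", "a42"]
train_artist_ids = ["c07", "b10", "a42"]

encoder = LabelEncoder()
encoded_tags = encoder.fit_transform(tag_artist_ids)  # learn the mapping once
encoded_train = encoder.transform(train_artist_ids)   # reuse the same mapping

print(list(encoder.classes_))
print(list(encoded_tags), list(encoded_train))
```

`LabelEncoder` sorts the unique values before numbering them, so the integers run from 0 to the number of artists minus one and can later serve directly as matrix indices.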

#### Building item profiles:

We should check if we have the same number of artists in both dataframes and how many unique tags we have.

In [None]:
total_artists_with_tags = len(tags['artist_id'].unique())
total_artists_with_playcounts = len(train['artist_id'].unique())
if total_artists_with_tags == total_artists_with_playcounts:
    print("Both dataframes contain the same number of unique artists, a job well done!\n")

total_tags = len(tags['tag'].unique())
print("There are {} unique tags for {} unique artists.".format(total_tags, 
                                                               total_artists_with_tags))

As for the most common tags, let's check the top 5.

In [None]:
tags_count = tags['tag'].value_counts()
print(tags_count[:5])

Now, let's take a random sample of 5 of the most obscure tags, using a random permutation over tags with one appearance.

In [None]:
tags_count_one = tags_count[tags_count == 1]
# Use .iloc for positional indexing: the Series is indexed by tag name, not by integer
tags_count_one.iloc[np.random.permutation(len(tags_count_one))[:5]]

The thing is, instead of a single tag per row, we need a *collection* of tags per artist. We can accomplish this by concatenating all tags per artist.

The result resembles a document per artist, in which some words (e.g. `alternative`) that were part of several different tags now appear more than once.
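
A minimal sketch of this concatenation on a made-up dataframe (the artists and tags are hypothetical):

```python
import pandas as pd

# Hypothetical tags: one row per (artist, tag) pair.
toy_tags = pd.DataFrame({
    "artist_id": [0, 0, 0, 1, 1],
    "tag": ["alternative", "alternative-rock", "indie", "metal", "thrash metal"],
})

# Join all tags of an artist into a single space-separated document.
documents = toy_tags.groupby("artist_id")["tag"].apply(" ".join).reset_index()
print(documents)
```

Artist 1 ends up with the document `metal thrash metal`, where `metal` appears twice because it was part of two different tags.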

In [None]:
tags = tags.groupby('artist_id')['tag'].apply(' '.join).reset_index()

We also want to split strings with "-" into multiple words (like `alternative-rock` to `alternative rock`). 

**At this point, do things that don't scale, like manually going through a good sample of documents to spot possible strategies.**

In [None]:
tags['tag'] = tags['tag'].str.replace("-", " ", regex=False)  # literal replacement, not a regex
tags.head(n=5)

We can now apply something like a bag-of-words strategy to these documents, transforming them into *keyword vectors*.

A reasonable strategy might be splitting strings into separate words, as there are many variations for each genre (e.g. rock and alternative rock).

Stemming, on the other hand, reduces words to their most basic form, so that variants of the same word (e.g. `rocking` and `rock`) count as a single term.
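
A quick sketch of what the stemmer does to a few words. The word list is made up for illustration; `EnglishStemmer` is the same Snowball stemmer imported in the next cell.

```python
from nltk.stem.snowball import EnglishStemmer

stemmer = EnglishStemmer()

# Inflected forms collapse onto a shared stem.
words = ["rock", "rocking", "singer", "singers"]
stems = [stemmer.stem(word) for word in words]
print(dict(zip(words, stems)))
```

Stems are not always dictionary words, which is fine here: they only need to be consistent, so that variants of a tag land in the same column of the keyword vector.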

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.snowball import EnglishStemmer


# Add language stemmer to TfidfVectorizer by overriding build_analyzer()
class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        stemmer = EnglishStemmer()
        analyzer = super().build_analyzer()
        return lambda doc: ([stemmer.stem(w) for w in analyzer(doc)])

vectorizer = StemmedTfidfVectorizer(min_df=3, analyzer='word', 
                                    stop_words='english', lowercase=True)

In [None]:
vectorizer.fit(tags['tag'])

vocabulary = list(vectorizer.vocabulary_)
print("Sample of the vocabulary {}.".format(vocabulary[:9]))
print("There are {} unique stemmed terms in the vocabulary.".format(len(vocabulary)))

In [None]:
item_tag_matrix = vectorizer.transform(tags['tag'])
print(item_tag_matrix.shape, type(item_tag_matrix))
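
To see what the resulting matrix contains, here is a minimal sketch on three made-up artist documents. It uses a plain `TfidfVectorizer` with default settings rather than the stemmed subclass above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical artist "documents" built from concatenated tags.
toy_documents = [
    "rock alternative rock indie",
    "metal thrash metal",
    "indie folk",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_documents)

print(sorted(toy_vectorizer.vocabulary_))  # one column per distinct word
print(toy_matrix.shape)                    # (number of artists, number of words)
print(toy_matrix.toarray().round(2))
```

Each row is one artist's keyword vector. A word weighs more when it is repeated within an artist's document and less when it shows up in many artists' documents. With the default settings each row is also scaled to unit length.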

In [None]:
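from scipy.sparse import csr_matrix

# user_id and artist_id were already label-encoded above, so they can be used
# directly as row and column indices. Refitting le1 here would overwrite the
# artist mapping it holds.
rows = train['user_id'].values
cols = train['artist_id'].values
data = train['play_counts'].values

# User x artist play-count matrix: m[rows[k], cols[k]] = data[k]
# (duplicate (row, col) pairs are summed)
m = csr_matrix((data, (rows, cols)), shape=(rows.max()+1, cols.max()+1))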

In [None]:
m.shape

In [None]:
# User profiles: each artist's TF-IDF vector weighted by the user's play counts
res = m.dot(item_tag_matrix)

In [None]:
res.shape
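
The product above can be read as building one tag profile per user. A minimal sketch with made-up numbers (two users, three artists, two tag features):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical play counts: 2 users x 3 artists.
play_counts = csr_matrix(np.array([
    [10, 0, 2],
    [0, 5, 0],
]))

# Hypothetical item profiles: 3 artists x 2 tag features (say "rock" and "jazz").
item_profiles = csr_matrix(np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.5, 0.5],
]))

# Each user profile is the play-count-weighted sum of the artists they played.
user_profiles = play_counts.dot(item_profiles)
print(user_profiles.toarray())
```

The first user, who mostly played the "rock" artist, ends up with a profile dominated by the first feature. The result has one row per user and one column per tag feature, so it lives in the same space as the item profiles.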

### Collaborative filtering: