# Building a recommender system web service from scratch 

In this notebook I'm going to show how to build a small and simple recommender system web service from scratch. First, we will look at the data that we're going to use, which is about songs. Second, we will analyse different recommender implementations that might suit this data and compare their metrics to choose the best option. Third, we will build a home-made item-based recommender system to get an intuition of how everything works at a low level. Finally, we will see how to start the web service and how to keep the structures it uses up to date.

Let's get started!

## 1. Data Analysis

First of all we are going to take a look at our data. We will use the [Million Song Dataset](https://labrosa.ee.columbia.edu/millionsong/), which is a huge dataset of songs with their titles and authors. It also provides a file with the number of times each user has played each song. We will also use the data from the [Million Song Dataset Kaggle Challenge](https://www.kaggle.com/c/msdchallenge/data), which contains some useful files. Specifically, we are going to use the following files:
* *train_triplets.txt*: this file is provided by [*The Echo Nest Taste Profile Subset*](https://labrosa.ee.columbia.edu/millionsong/tasteprofile). It has three fields separated by tabs: user_id, song_id, and play count. Each row is an interaction between a user and an item, namely a song. Since users never rate songs, the play count is implicit rather than explicit feedback. This will pave the way for the recommender systems that we will implement.
* *unique_tracks.txt*: list of artists and titles for all tracks in a [text file](http://labrosa.ee.columbia.edu/millionsong/sites/default/files/AdditionalFiles/unique_tracks.txt).
* *taste_profile_song_to_tracks.txt*: some of the files refer to the songs by the song_id, but others do so with the track_id. This file maps each song_id to the corresponding track_id.
* *kaggle_users*: list of all the users, provided by the Kaggle challenge's website.
* *kaggle_songs*: list of all the songs, provided by the Kaggle challenge's website.

First we will fire up all the libraries that we are going to use.

In [2]:
import pandas as pd
import numpy as np
import time
import math
import warnings
warnings.filterwarnings('ignore')

Now it's time to load the dataset. The dataset is quite heavy, so we are going to load just a subset of it, to analyse the data.

In [7]:
song_triplets = pd.read_csv('train_triplets.txt', sep='\t', names=['user_id', 'song_id', 'count'], nrows=5000000)
song_triplets.head()

Unnamed: 0,user_id,song_id,count
0,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOAKIMP12A8C130995,1
1,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOAPDEY12A81C210A9,1
2,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBBMDR12A8C13253B,2
3,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBFNSP12AF72A0E22,1
4,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBFOVM12A58A7D494,1


Now we are going to load the *unique_tracks.txt* file and we are going to join that file with the triplets that we already have.

In [11]:
unique_tracks = pd.read_csv('unique_tracks.txt', sep='<SEP>', names=['track_id', 'song_id', 'author', 'title'])
song_triplets.merge(unique_tracks, on='song_id', how='left')

Unnamed: 0,user_id,song_id,count,track_id,author,title
0,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOAKIMP12A8C130995,1,TRIQAUQ128F42435AD,Jack Johnson,The Cove
1,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOAPDEY12A81C210A9,1,TRIRLYL128F42539D1,Billy Preston,Nothing from Nothing
2,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBBMDR12A8C13253B,2,TRMHBXZ128F4238406,Paco De Lucia,Entre Dos Aguas
3,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBFNSP12AF72A0E22,1,TRYQMNI128F147C1C7,Josh Rouse,Under Cold Blue Stars
4,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBFOVM12A58A7D494,1,TRAHZNE128F9341B86,The Dead 60s,Riot Radio (Soundtrack Version)
5,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBNZDC12A6D4FC103,1,TRJPXGD128F92F17D7,Amset,Sin límites (I)
6,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBSUJE12A6D4F8CF5,2,TRPLAXZ128F4292406,Jorge Drexler,12 segundos de oscuridad
7,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBVFZR12A6D4F8AE3,1,TREGAVI128F147C1CA,Josh Rouse,Ears To The Ground (Album Version)
8,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBXALG12A8C13C108,1,TRZYZWL128F4277AD2,Eric Hutchinson,Food Chain (Album Version)
9,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBXHDL12A81C204C0,1,TRHNCIR128F42334A5,Kanye West,Stronger


Let's see how many users and songs the dataset has.

In [34]:
print "Number of unique users:", len(song_triplets['user_id'].unique())
print "Number of unique songs:", pd.read_csv('kaggle_songs.txt', sep='\t', names=['song_id', 'index']).shape[0]

Number of unique users: 104474
Number of unique songs: 386213


The data is quite simple: we have the classic interaction between users and items, plus some song information such as the author and title. We are ready to move on to building the recommender systems.

## 2. Recommender Systems benchmarking

In this section we are going to build several recommender systems using different algorithms to compare them and choose our best option. When it comes to building a recommender system for a web service there are several things that we should take into account:
* **Accuracy**: of course, we want our recommender system to recommend truly interesting items so that we keep the user engaged with our system (in this case a music player, such as Spotify). To measure this we are going to use the two most classic metrics: precision and recall.
* **Response time**: as it is a web service, it should return the recommendations very quickly, say in less than a second. This is crucial, since we can build very complex and sophisticated recommender systems with very high accuracy but very poor performance. Imagine that you are using Spotify, you ask the system for some recommendations, and it takes more than 3 seconds to respond. You will probably leave and start doing something else. To keep the user engaged, it is pivotal that the recommendations are given in a very short time.
* **Updatable**: as previously mentioned, there are several recommender algorithms that can be used; for some of them the update is straightforward, while for others it is not that simple. This is very important to consider, since our recommender should be continuously updated so that the latest interactions of a user are also taken into account for the recommendation.
* **Time to build the recommender**: closely related to the previous point is the time it takes to create the recommender system. If the system cannot be updated easily, we will have to rebuild it periodically, so it is important to consider how much time that task requires.
* **Memory usage**: most recommender systems work in memory in order to return a recommendation very quickly. However, if we don't have a large amount of memory available, this becomes a serious constraint that we should also take into account.
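
To make the accuracy criterion concrete, here is a minimal sketch of precision and recall at a cutoff k for a single user. The function name and the toy song ids are made up for illustration, and GraphLab's own evaluation may differ in details such as how it treats users with few validation items.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Return (precision, recall) for the first k recommended items."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    # Precision: fraction of the k recommendations that the user actually played.
    precision = hits / float(k) if k > 0 else 0.0
    # Recall: fraction of the user's played songs that were recommended.
    recall = hits / float(len(relevant)) if relevant else 0.0
    return precision, recall

recommended = ["s1", "s2", "s3", "s4"]  # hypothetical ranked output of a recommender
relevant = {"s2", "s4", "s9"}           # hypothetical songs found in the validation set
precision, recall = precision_recall_at_k(recommended, relevant, k=4)
print(precision, recall)
```

Two of the four recommendations are relevant, so the precision is 0.5, and two of the three relevant songs were found, so the recall is about 0.67.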

Now that we have defined all the variables to look at when we make the comparison, we will split the dataset into a training set, which we will use to create the recommender, and a test set, which will be used to compute the metrics.

For this section we will use [GraphLab Create](https://turi.com/), a very handy library that implements very efficient Machine Learning algorithms. It provides an academic license, which I obtained when I completed the [Machine Learning Specialization](https://www.coursera.org/specializations/machine-learning). GraphLab provides several implementations of recommender systems, all of which are very efficient in terms of memory. They also provide a method called *recommend* that accepts new observation data, which is very handy for getting an updated recommendation.

In [20]:
import graphlab as gl
song_triplets = gl.SFrame.read_csv('train_triplets.txt', nrows=5000000, delimiter='\t', verbose=False, header=False)
song_triplets = song_triplets.rename({'X1': 'user_id', 'X2': 'song_id', 'X3':'count'})
training_data, validation_data = gl.recommender.util.random_split_by_user(song_triplets, 'user_id', 'song_id', 
                                                                            max_num_users=75000, random_seed=0)

Let's see how many rows we have for each set.

In [21]:
print "Training length:", training_data.shape[0]
print "Validation length:", validation_data.shape[0]

Training length: 4282570
Validation length: 717430


Approximately 14% of the dataset is held out for validation, which is a common proportion.
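
The idea of splitting by user, so that each validation user still has some history in the training set, can be sketched in plain Python. This is only an illustration of the concept: the function `split_by_user` and its `test_fraction` parameter are hypothetical, and it is not a reimplementation of GraphLab's `random_split_by_user`.

```python
import random
from collections import defaultdict

def split_by_user(interactions, test_fraction=0.2, seed=0):
    """Split (user, item) pairs so that a fraction of each user's items is held out."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for user, item in interactions:
        by_user[user].append(item)
    train, test = [], []
    for user in sorted(by_user):
        items = by_user[user][:]
        rng.shuffle(items)
        # Hold out the first n_test shuffled items, keep the rest for training.
        n_test = int(len(items) * test_fraction)
        test.extend((user, item) for item in items[:n_test])
        train.extend((user, item) for item in items[n_test:])
    return train, test

# Toy data: user u1 played 10 songs, user u2 played 5 songs.
toy = [("u1", "s%d" % i) for i in range(10)] + [("u2", "s%d" % i) for i in range(5)]
train, test = split_by_user(toy)
print(len(train), len(test))
```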

Now it's time to build our first model! 

#### 2.1. Popularity recommender

Let's start with the simplest recommender possible: the popularity recommender. It simply recommends the most popular songs to every user. There is no personalization or intelligence behind this model.
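
Conceptually, a popularity recommender only needs to count listeners per song. The sketch below shows that idea on toy data. The function `most_popular` and the sample interactions are hypothetical and are not GraphLab's implementation.

```python
from collections import Counter

def most_popular(interactions, k=3, exclude=()):
    """Return the k songs with the most listeners, skipping the ones in `exclude`."""
    # Assumes each (user, song) pair appears at most once.
    counts = Counter(song for _user, song in interactions)
    ranked = [song for song, _n in counts.most_common() if song not in exclude]
    return ranked[:k]

toy = [("u1", "a"), ("u2", "a"), ("u3", "a"),
       ("u1", "b"), ("u2", "b"),
       ("u3", "c")]
top_for_anyone = most_popular(toy, k=2)
top_for_u1 = most_popular(toy, k=2, exclude={"a", "b"})  # u1 already played a and b
print(top_for_anyone, top_for_u1)
```

Every user gets the same ranking, minus the songs they have already played, which is why this model serves only as a baseline.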

In [24]:
ini = time.time()
m = gl.recommender.popularity_recommender.create(observation_data=training_data, item_id='song_id', 
                                                 user_id='user_id', verbose=False)
print "Time to create the recommender:", time.time()-ini

Time to create the recommender: 3.8022840023


Quite fast, isn't it? Let's now take a look at the other variables that we are analysing.

In [25]:
ini = time.time()
m.recommend(users=[training_data['user_id'][0]], k=10)
print "Time to return a recommendation:", time.time()-ini

Time to return a recommendation: 0.0277328491211


In [29]:
m.evaluate_precision_recall(validation_data, verbose=False)

{'precision_recall_by_user': Columns:
 	user_id	str
 	cutoff	int
 	precision	float
 	recall	float
 	count	int
 
 Rows: 1326708
 
 Data:
 +-------------------------------+--------+-----------+--------+-------+
 |            user_id            | cutoff | precision | recall | count |
 +-------------------------------+--------+-----------+--------+-------+
 | 00003a4459f33b92906be11abe... |   1    |    0.0    |  0.0   |   4   |
 | 00003a4459f33b92906be11abe... |   2    |    0.0    |  0.0   |   4   |
 | 00003a4459f33b92906be11abe... |   3    |    0.0    |  0.0   |   4   |
 | 00003a4459f33b92906be11abe... |   4    |    0.0    |  0.0   |   4   |
 | 00003a4459f33b92906be11abe... |   5    |    0.0    |  0.0   |   4   |
 | 00003a4459f33b92906be11abe... |   6    |    0.0    |  0.0   |   4   |
 | 00003a4459f33b92906be11abe... |   7    |    0.0    |  0.0   |   4   |
 | 00003a4459f33b92906be11abe... |   8    |    0.0    |  0.0   |   4   |
 | 00003a4459f33b92906be11abe... |   9    |    0.0    |  0.0 

As we can see the time to create this recommender is pretty low, so is the time to give a recommendation. However, the precision and recall are not that high: only 3% of precision for the first recommendation or recommendation at 1. This is a very 

Let's now look at a slightly more complex model that is far more suitable for this dataset: the item-based recommender system.

#### 2.2. Item-Based recommender system

When the number of users is greater than the number of songs, it is advisable to create an item-item recommender system. We will talk more about this model later when we build it. GraphLab provides an implementation of such a model, and it is very configurable for the data we are dealing with (see the [API documentation](https://turi.com/products/create/docs/generated/graphlab.recommender.item_similarity_recommender.create.html#graphlab.recommender.item_similarity_recommender.create) for more details). One of these settings is the similarity measure to use, which by default is the *jaccard* similarity. Let's start with the *cosine* similarity.

In [35]:
ini = time.time()
m = gl.recommender.item_similarity_recommender.create(observation_data=training_data, item_id='song_id',
                                                      user_id='user_id', target='count',
                                                      similarity_type='cosine', verbose=False)
print "Time to create the recommender:", time.time()-ini

Time to create the recommender: 46.616230011


Although it isn't as fast as the previous one, creating a model in less than a minute is still really fast. Let's see the time to make recommendations.

In [38]:
ini = time.time()
m.recommend(users=[training_data['user_id'][0]], k=10)
print "Time to return a recommendation:", time.time()-ini

Time to return a recommendation: 0.0313990116119


Quite fast! Let's see the precision and recall.

In [39]:
m.evaluate_precision_recall(validation_data, verbose=False)

{'precision_recall_by_user': Columns:
 	user_id	str
 	cutoff	int
 	precision	float
 	recall	float
 	count	int
 
 Rows: 1326708
 
 Data:
 +-------------------------------+--------+-----------+--------+-------+
 |            user_id            | cutoff | precision | recall | count |
 +-------------------------------+--------+-----------+--------+-------+
 | 00003a4459f33b92906be11abe... |   1    |    0.0    |  0.0   |   4   |
 | 00003a4459f33b92906be11abe... |   2    |    0.0    |  0.0   |   4   |
 | 00003a4459f33b92906be11abe... |   3    |    0.0    |  0.0   |   4   |
 | 00003a4459f33b92906be11abe... |   4    |    0.0    |  0.0   |   4   |
 | 00003a4459f33b92906be11abe... |   5    |    0.0    |  0.0   |   4   |
 | 00003a4459f33b92906be11abe... |   6    |    0.0    |  0.0   |   4   |
 | 00003a4459f33b92906be11abe... |   7    |    0.0    |  0.0   |   4   |
 | 00003a4459f33b92906be11abe... |   8    |    0.0    |  0.0   |   4   |
 | 00003a4459f33b92906be11abe... |   9    |    0.0    |  0.0 

It takes longer to build than the baseline, and its accuracy is poorer. Clearly, we should discard this version and try another option.

Let's now try the *pearson* similarity to see if the accuracy gets higher.

In [40]:
ini = time.time()
m = gl.recommender.item_similarity_recommender.create(observation_data=training_data, item_id='song_id',
                                                      user_id='user_id', target='count',
                                                      similarity_type='pearson', verbose=False)
print "Time to create the recommender:", time.time()-ini

Time to create the recommender: 39.0279419422


In [41]:
m.evaluate_precision_recall(validation_data, verbose=False)

{'precision_recall_by_user': Columns:
 	user_id	str
 	cutoff	int
 	precision	float
 	recall	float
 	count	int
 
 Rows: 1326708
 
 Data:
 +-------------------------------+--------+-----------+--------+-------+
 |            user_id            | cutoff | precision | recall | count |
 +-------------------------------+--------+-----------+--------+-------+
 | 00003a4459f33b92906be11abe... |   1    |    0.0    |  0.0   |   4   |
 | 00003a4459f33b92906be11abe... |   2    |    0.0    |  0.0   |   4   |
 | 00003a4459f33b92906be11abe... |   3    |    0.0    |  0.0   |   4   |
 | 00003a4459f33b92906be11abe... |   4    |    0.0    |  0.0   |   4   |
 | 00003a4459f33b92906be11abe... |   5    |    0.0    |  0.0   |   4   |
 | 00003a4459f33b92906be11abe... |   6    |    0.0    |  0.0   |   4   |
 | 00003a4459f33b92906be11abe... |   7    |    0.0    |  0.0   |   4   |
 | 00003a4459f33b92906be11abe... |   8    |    0.0    |  0.0   |   4   |
 | 00003a4459f33b92906be11abe... |   9    |    0.0    |  0.0 

It doesn't seem to do the trick: the metrics are even worse. Let's now try our last option for the item-based recommender system, the *jaccard* similarity. With this similarity we don't need the target.

In [42]:
ini = time.time()
m = gl.recommender.item_similarity_recommender.create(observation_data=training_data, item_id='song_id',
                                                      user_id='user_id', similarity_type='jaccard', verbose=False)
print "Time to create the recommender:", time.time()-ini

Time to create the recommender: 39.9650559425


In [43]:
m.evaluate_precision_recall(validation_data, verbose=False)

{'precision_recall_by_user': Columns:
 	user_id	str
 	cutoff	int
 	precision	float
 	recall	float
 	count	int
 
 Rows: 1326708
 
 Data:
 +-------------------------------+--------+-----------+--------+-------+
 |            user_id            | cutoff | precision | recall | count |
 +-------------------------------+--------+-----------+--------+-------+
 | 00003a4459f33b92906be11abe... |   1    |    0.0    |  0.0   |   4   |
 | 00003a4459f33b92906be11abe... |   2    |    0.0    |  0.0   |   4   |
 | 00003a4459f33b92906be11abe... |   3    |    0.0    |  0.0   |   4   |
 | 00003a4459f33b92906be11abe... |   4    |    0.0    |  0.0   |   4   |
 | 00003a4459f33b92906be11abe... |   5    |    0.0    |  0.0   |   4   |
 | 00003a4459f33b92906be11abe... |   6    |    0.0    |  0.0   |   4   |
 | 00003a4459f33b92906be11abe... |   7    |    0.0    |  0.0   |   4   |
 | 00003a4459f33b92906be11abe... |   8    |    0.0    |  0.0   |   4   |
 | 00003a4459f33b92906be11abe... |   9    |    0.0    |  0.0 

Sweet! These are great numbers! It is our best option so far. This is not magic, though: its accuracy comes from the fact that we are working with implicit data. Users just listen to songs, one or more times, but they don't give an explicit score to each song they listen to, so using the play count (the *count* column) as a target to estimate would be a mistake. That's why the previous two options didn't work well. The *jaccard* similarity only considers whether a user listened to a song, not how many times, which captures the relationship in implicit data well. You can see a more detailed explanation [here](https://turi.com/products/create/docs/generated/graphlab.recommender.item_similarity_recommender.ItemSimilarityRecommender.html#graphlab.recommender.item_similarity_recommender.ItemSimilarityRecommender).
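
One possible illustration of why raw play counts can mislead a count-based similarity is the toy comparison below. The two songs and their listeners are invented, and the helper functions are plain textbook definitions of Jaccard and cosine similarity, not GraphLab's internal code.

```python
import math

def jaccard(users_a, users_b):
    """Jaccard similarity between two sets of listeners."""
    union = users_a | users_b
    return len(users_a & users_b) / float(len(union)) if union else 0.0

def cosine(counts_a, counts_b):
    """Cosine similarity between two {user: play_count} dictionaries."""
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[u] * counts_b[u] for u in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Two songs that share two of their three listeners; u3 played song_x 50 times.
song_x = {"u1": 1, "u2": 1, "u3": 50}
song_y = {"u1": 1, "u2": 1, "u4": 1}

jaccard_xy = jaccard(set(song_x), set(song_y))
cosine_xy = cosine(song_x, song_y)
print(jaccard_xy, cosine_xy)
```

The Jaccard similarity is 0.5 because half of the combined listeners are shared, while the cosine similarity on counts drops to roughly 0.02 because a single heavy listener dominates the norm of `song_x`.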

Finally, we are going to analyse one last option, the ranking factorization recommender system that also works well on implicit data.

#### 2.3. Ranking Factorization recommender system

Our last approach is the Ranking Factorization, which is more complex than the strategies that we have seen so far. It learns a number of latent factors representing both users and items, and combines them to get a ranking rather than a score (for explicit data, where there are scores, the *Factorization Recommender* should work better).

There are quite a few settings that should be configured to get a good result from this model. First of all, we are going to use the Implicit Alternating Least Squares (*ials*) solver, which usually works better on implicit data. The *num_factors* parameter indicates the number of factors used to represent the latent vectors of users and items; the greater this number, the longer it takes to create the model and, usually, the greater the accuracy. Finally, the *ials_confidence_scaling_factor* controls the scaling of the confidence weights used by the *ials* solver.
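
For intuition, the widely used implicit ALS formulation by Hu, Koren and Volinsky can be written as follows. Whether GraphLab's *ials* solver matches it term by term is an assumption here, so take it as a sketch of the general idea rather than a description of the library.

```latex
c_{ui} = 1 + \alpha \, r_{ui},
\qquad
\min_{x, y} \; \sum_{u, i} c_{ui} \left( p_{ui} - x_u^{\top} y_i \right)^2
  + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)
```

Here $r_{ui}$ is the play count of song $i$ by user $u$, $p_{ui}$ is 1 if $r_{ui} > 0$ and 0 otherwise, $x_u$ and $y_i$ are the latent vectors (of length *num_factors*), $\lambda$ is a regularization weight, and $\alpha$ plays the role of a confidence scaling factor: the larger it is, the more the observed plays weigh against the unobserved pairs.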

In [59]:
ini = time.time()
m = gl.recommender.ranking_factorization_recommender.create(training_data, solver='ials',
                                                user_id='user_id', item_id='song_id', num_factors=32,
                                                verbose=False, ials_confidence_scaling_factor=1.2)
print "Time to create the model:", time.time()-ini

Time to create the model: 153.299302101


It took much longer than the previous approaches. Let's see how much time it takes to give a recommendation.

In [60]:
ini = time.time()
m.recommend(users=[training_data['user_id'][0]], k=10)
print "Time to return a recommendation:", time.time()-ini

Time to return a recommendation: 0.0583698749542


It is still really fast. Let's now see the accuracy.

In [61]:
m.evaluate_precision_recall(validation_data, verbose=False)

{'precision_recall_by_user': Columns:
 	user_id	str
 	cutoff	int
 	precision	float
 	recall	float
 	count	int
 
 Rows: 1326708
 
 Data:
 +-------------------------------+--------+-----------+--------+-------+
 |            user_id            | cutoff | precision | recall | count |
 +-------------------------------+--------+-----------+--------+-------+
 | 00003a4459f33b92906be11abe... |   1    |    0.0    |  0.0   |   4   |
 | 00003a4459f33b92906be11abe... |   2    |    0.0    |  0.0   |   4   |
 | 00003a4459f33b92906be11abe... |   3    |    0.0    |  0.0   |   4   |
 | 00003a4459f33b92906be11abe... |   4    |    0.0    |  0.0   |   4   |
 | 00003a4459f33b92906be11abe... |   5    |    0.0    |  0.0   |   4   |
 | 00003a4459f33b92906be11abe... |   6    |    0.0    |  0.0   |   4   |
 | 00003a4459f33b92906be11abe... |   7    |    0.0    |  0.0   |   4   |
 | 00003a4459f33b92906be11abe... |   8    |    0.0    |  0.0   |   4   |
 | 00003a4459f33b92906be11abe... |   9    |    0.0    |  0.0 

Well, it's a good accuracy, but worse than the item-based recommender with *jaccard* similarity.

A side note about the recommender systems implemented by GraphLab: they accept side data for items and users, which helps the recommender give better recommendations, especially when there is a cold start.

The decision is made: we are going to use an item-based recommender system :D

## 3. Home-made item-based recommender system

Now that we have decided on an item-based recommender system, we are going to give some intuition of how everything works. We are going to use a modified version of the classic algorithm, based on Fabio Aiolli's work [1]. However, we will make further modifications, because Aiolli's work is oriented towards getting a good top-k recommendation accuracy, and producing its recommendations may take a long time (longer than a second).

[1] Aiolli, Fabio. "A Preliminary Study on a Recommender System for the Million Songs Dataset Challenge." IIR. 2013.

For the implementation we will need a dictionary of users per song, i.e., for each song we will have a list of all the users that have listened to that song. Conversely, we will also need a dictionary of songs per user, which basically contains the user's history.

In [81]:
song_triplets = pd.read_csv('train_triplets.txt', sep='\t', names=['user', 'song', 'count'], nrows=10000000)
users_per_song = song_triplets[['song', 'user']].groupby('song')['user'].apply(list)
songs_per_user = song_triplets[['song', 'user']].groupby('user')['song'].apply(list)

Now that we have all the data, we will define two functions to make the recommendations. The first one computes the score, i.e., the similarity between two songs. To do so, we compute the conditional probability that a user listens to one song given that they have listened to another song. The second function computes the scores against a number of songs that the user has recently listened to, adds them up, and returns the songs with the highest accumulated score.
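
Before looking at the pandas version, the two dictionaries and the conditional-probability score can be worked through by hand on a tiny, made-up set of plays. The names `plays` and `conditional_score` are hypothetical and exist only for this illustration.

```python
from collections import defaultdict

# Toy (user, song) plays, invented for this example.
plays = [("u1", "a"), ("u2", "a"), ("u3", "a"),
         ("u1", "b"), ("u2", "b"),
         ("u4", "c")]

users_per_song = defaultdict(set)   # song -> users that listened to it
songs_per_user = defaultdict(list)  # user -> listening history
for user, song in plays:
    users_per_song[song].add(user)
    songs_per_user[user].append(song)

def conditional_score(users_i, users_j, q=3):
    """P(listened to i | listened to j), raised to the power q."""
    if not users_j:
        return 0.0
    return (len(users_i & users_j) / float(len(users_j))) ** q

# Everyone who played b also played a, so P(a | b) = 2/2 = 1.
score_a_given_b = conditional_score(users_per_song["a"], users_per_song["b"])
# Only two of the three listeners of a played b, so P(b | a) = 2/3.
score_b_given_a = conditional_score(users_per_song["b"], users_per_song["a"])
print(score_a_given_b, score_b_given_a)
```

Note that the score is not symmetric, and that raising it to the power q = 3 shrinks the weaker relation from about 0.67 to about 0.30 while leaving the perfect one at 1.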

In [94]:
def get_score(set_users_song, other_users, q=3):
    """
    This function returns the conditional probability that a user listens to song i given that they
    listened to song j, raised to the power q.
    - set_users_song: is the set of users that have listened to song i.
    - other_users: list of users that have listened to song j.
    - q: exponent applied to the probability to penalize the very low scores.
    """
    numerator = len(set_users_song.intersection(other_users))
    denominator = len(other_users)
    return 0 if numerator == 0 else math.pow(numerator/float(denominator), q)

def get_recommendation(user, n_recommendations=5, history=5):
    # If it is a new user, it returns the most popular items.
    # (songs_per_user.loc[user] would raise a KeyError, so we check the index first.)
    if user not in songs_per_user.index:
        return users_per_song.apply(len).nlargest(n_recommendations).index.tolist()
    # We get all the songs that the user has listened to so far.
    user_songs = songs_per_user.loc[user]
    
    # We set a 0 score to each song at the beginning.
    scores = pd.Series(np.zeros(users_per_song.shape[0]), index=users_per_song.index)
    # We only take into account the last `history` songs the user has listened to.
    recent_songs = user_songs[-history:]
    # For each of those songs, we score every song in the dataset against it and we add
    # all the scores up to obtain the accumulated score.
    for song in recent_songs:
        set_users_song = set(users_per_song.loc[song])
        song_scores = users_per_song.apply(lambda l_users: get_score(set_users_song, l_users))
        scores = scores.add(song_scores)
    # We drop the songs that the user has already listened to (drop returns a new Series).
    scores = scores.drop(user_songs)
    
    return scores.nlargest(n_recommendations).index.tolist()

Let's now see how long our home-made function takes to return the recommendations.

In [95]:
user = song_triplets['user'].iloc[0]
ini = time.time()
get_recommendation(user, history=5)
print "Time to make recommendations:", time.time()-ini

Time to make recommendations: 3.88573718071


We can see how much slower this approach is than the ones that we have seen so far: more than 3 seconds, considering only 10M rows. What takes the most time is computing the distance between one song and all the other ones in the dataset. Thus, if we wanted to make it faster and more responsive, we should pre-compute the distances between the items, since they are pretty stable over time, and use those values when making the recommendations. However, this makes the system far less updatable: we would have to re-compute the distances between songs periodically to keep them up to date. GraphLab has taken all these facts into account and provides a very complete and efficient implementation.
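
The pre-computation idea can be sketched as follows: build, once, a table with the best-scoring neighbours of every song, and answer each request with cheap dictionary lookups. This is a simplified illustration on toy data, with hypothetical names (`build_neighbors`, `recommend`, `top_n`), and it is not how GraphLab stores its similarities.

```python
import heapq
from collections import defaultdict

def build_neighbors(users_per_song, top_n=2, q=3):
    """Offline step: for every song keep its top_n best-scoring candidate songs."""
    neighbors = {}
    for played, played_users in users_per_song.items():
        scored = []
        for candidate, candidate_users in users_per_song.items():
            overlap = len(played_users & candidate_users)
            if candidate != played and overlap:
                # Same conditional-probability score used in this notebook.
                score = (overlap / float(len(candidate_users))) ** q
                scored.append((score, candidate))
        neighbors[played] = heapq.nlargest(top_n, scored)
    return neighbors

def recommend(history, neighbors, k=2):
    """Online step: add up the pre-computed scores of the songs in the history."""
    played = set(history)
    totals = defaultdict(float)
    for song in history:
        for score, candidate in neighbors.get(song, []):
            if candidate not in played:
                totals[candidate] += score
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [song for song, _score in ranked][:k]

toy_users_per_song = {
    "a": {"u1", "u2", "u3"},
    "b": {"u1", "u2"},
    "c": {"u3", "u4"},
    "d": {"u4"},
}
neighbors = build_neighbors(toy_users_per_song)
recommended = recommend(["a"], neighbors)
print(neighbors["a"], recommended)
```

The offline step is still quadratic in the number of songs, so in practice it would run periodically in the background, which is exactly the trade-off between response time and updatability.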

## 4. Building the web service

So far, we have looked at our data, seen which recommender algorithm suits it best, and seen how to implement a recommender algorithm from scratch. Now it's time to put all the pieces together and build a web service providing two functions:
* *add_data(userId, itemId)*: adds a new user-item interaction. This is the case when a user listens to a song and it is added to their history.
* *get_recommendations(userId)*: given a user id, returns a set of recommended songs based on their history.

Before we do so, it's a good idea to create a class that holds all the recommendation functionality. Thus, our first task is to create such a class.

In [3]:
import graphlab as gl

class Recommender:
    def __init__(self, path_to_data):
        self.__path_to_data = path_to_data
        self.__new_data = None
        self.create_model()
        self.__update_threshold = 10000
        print "Recommender initialized"
    
    def get_recommendations(self, user, k=4):
        """
        This function returns a set of k recommendations for a given user. If
        the user is not in the system yet, it returns the most popular songs.
        - user: id of the user we want recommendations for.
        - k: number of recommendations returned.
        """
        print "Recommendations for user", user
        
        if self.__new_data is not None:
            return self.__model__.recommend(users=[user], k=k, verbose=True,
                                      new_observation_data=self.__new_data)['song_id']

        return self.__model__.recommend(users=[user], k=k, verbose=True)['song_id']
    
    def add_data(self, user_id, song_id):
        """
        Given a user and a song interaction, this function adds the new interaction
        to the new data. If the number of new rows exceeds a previously specified
        threshold, it updates the model with the new data.
        """
        assert user_id is not None, "The user_id is null"
        assert song_id is not None, "The song_id is null"
        #print user_id, song_id, self.__new_data
        if self.__new_data is None:
            self.__new_data = gl.SFrame({'user_id':[user_id], 'song_id':[song_id]})
        else:
            self.__new_data = gl.SFrame({'user_id':[user_id], 
                                     'song_id':[song_id]}).append(self.__new_data)
        if (self.__new_data.shape[0] > self.__update_threshold):
            print "It's gonna be updated"
            self.update()
        return 0
        
    def update(self):
        """
        When the number of added rows exceeds the threshold, the model is updated.
        Once everything is updated, the new_data is deleted and set to None 
        again.
        """
        self.create_model()
        del self.__new_data
        self.__new_data = None
        return 0
        
    def create_model(self):
        """
        This function creates the item-based GraphLab model. In order to do so
        it first reads the history data stored, it then appends the new_data
        to the read data and creates the model. To back-up it also saves the 
        model, and overwrites the data stored with the new set of data. Finally,
        it deletes the data that is useless for the execution.
        """
        self.__data = gl.SFrame.read_csv(self.__path_to_data, delimiter=',', 
                           verbose=False, header=True)
        #self.__data = self.__data.rename({'X1': 'user_id', 'X2': 'song_id'})
        if self.__new_data is not None:
            self.__data = self.__new_data.append(self.__data)
        self.__model__ = gl.recommender.item_similarity_recommender.create(self.__data, 
                                item_id='song_id', user_id='user_id', 
                                similarity_type='jaccard', verbose=False)
        self.__model__.save('jaccard_recommender')
        self.__data[['user_id', 'song_id']].export_csv(self.__path_to_data, header=True)
        self.__users = self.__data['user_id'].unique()
        del self.__data
        print "Recommender created"
        
        return 0
    
    def get_songs(self):
        """
        Returns the list of all the songs loaded in the recommender.
        """
        songs = self.__model__.recommend(users=['a'], k=self.__model__.get('num_items'))
        return list(songs['song_id'])
    
    def get_users(self):
        """
        Returns the list of all the users loaded in the recommender.
        """
        return list(self.__users)

As you can see, there are some extra features that we haven't seen so far. The recommender provides two methods to get the whole list of songs and users. It also has a method to update the recommender once a certain number of interactions has been added. The other functions are the ones needed for the web service.
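
The update strategy used by the class, buffering new interactions and rebuilding the model only when the buffer exceeds a threshold, can be isolated in a few lines. In this sketch `BufferedModel` and `build_fn` are hypothetical names, and `len` stands in for the expensive model training so that the example runs without GraphLab.

```python
class BufferedModel(object):
    """Keeps new rows in a buffer and rebuilds the model when it grows too large."""

    def __init__(self, build_fn, initial_rows, update_threshold=3):
        self.build_fn = build_fn
        self.rows = list(initial_rows)
        self.pending = []
        self.update_threshold = update_threshold
        self.model = build_fn(self.rows)
        self.rebuilds = 1

    def add(self, user_id, song_id):
        self.pending.append((user_id, song_id))
        # Same rule as the Recommender class: rebuild once the buffer exceeds the threshold.
        if len(self.pending) > self.update_threshold:
            self.rows.extend(self.pending)
            self.pending = []
            self.model = self.build_fn(self.rows)
            self.rebuilds += 1

# `len` is a stand-in for training: the "model" is just the number of rows seen.
store = BufferedModel(len, [("u1", "a"), ("u2", "b")], update_threshold=3)
for i in range(3):
    store.add("u3", "s%d" % i)
rebuilds_before = store.rebuilds   # still the initial build only
store.add("u3", "s3")              # fourth new row exceeds the threshold
print(rebuilds_before, store.rebuilds, store.model)
```

A larger threshold means fewer expensive rebuilds, at the cost of keeping more interactions outside the trained model between updates.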

Now it's time to see the web service itself.

In [None]:
import web
from recommender import Recommender
import json

urls = (
    '/data/recommendations', 'get_recommendations',
    '/data/add', 'add_data',
    '/data/songs', 'list_songs',
    '/data/users', 'list_users'
)

recommender = Recommender('to_train.csv')
app = web.application(urls, globals())

class get_recommendations:
    """
    This class defines the functionality to get the recommendations for a given
    user. It first gets the user_id given in the querystring with the following
    format: [ip]:[port]/data/recommendations?user_id=[user_id]. Once the
    querystring is parsed, the user_id is obtained and the recommender object
    is invoked.
    It finally returns the recommendations in a JSON response.
    """
    def GET(self):
        input_data = web.input(user_id=None)
        user_id = input_data['user_id']
        #if user_id is None:
        #    return web.internalerror("To get a recommendation, a user_id should be given in the query-string.")
        user_id = str(user_id)
        recs = recommender.get_recommendations(user_id, 4)
        response = {'recommendations': list(recs)}
        
        web.header('Content-Type', 'application/json')
        return json.dumps(response)

class add_data:
    """
    This class implements the POST function to add a new user-song interaction.
    Whenever a user plays a song it is added to the service so that the 
    recommendations are kept up to date. The post body should contain two
    key-value fields: user_id=[user_id] and song_id=[song_id].
    Once all the data is obtained and parsed, this function calls the recommender's 
    add_data function, and if everything is ok it returns an OK JSON.
    """
    def POST(self):
        dic = web.input(user_id=None, song_id=None)
        user_id = dic['user_id']
        song_id = dic['song_id']
        #if user_id is None or song_id is None:
        #    return web.internalerror("There was an error parsing the query-string. A valid user_id and song_id should be given in the query-string.")
        
        user_id = str(user_id)
        song_id = str(song_id)
        recommender.add_data(user_id, song_id)
        response = {'user_id': user_id, 'song_id': song_id, 'response': 'interaction successfully added'}
        
        web.header('Content-Type', 'application/json')
        return json.dumps(response)
        
class list_songs:
    """
    It returns the whole list of songs already loaded in the system.
    """
    def GET(self):
        songs = recommender.get_songs()
        response = {'songs': songs}
        
        web.header('Content-Type', 'application/json')
        return json.dumps(response)

class list_users:
    """
    It returns the whole list of users already loaded in the system.
    """
    def GET(self):
        users = recommender.get_users()
        response = {'users': users}
        
        web.header('Content-Type', 'application/json')
        return json.dumps(response)

#if __name__ == "__main__":
app.run()

And that's it! We are done!

In the folder I will also leave a script to test the web service, which also feeds the recommender with the rest of the interactions in the dataset.
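
As a rough idea of what a client for this service could look like (this is not the script mentioned above), the sketch below only builds the URL and the form-encoded body for the two main endpoints. `BASE_URL` assumes the service runs locally on port 8080, and the helper names are hypothetical.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode        # Python 2

BASE_URL = "http://localhost:8080"  # assumption: service running on the local machine

def recommendations_url(user_id, base_url=BASE_URL):
    """URL for a GET request to /data/recommendations."""
    return "%s/data/recommendations?%s" % (base_url, urlencode({"user_id": user_id}))

def add_data_request(user_id, song_id, base_url=BASE_URL):
    """Return (url, form-encoded body) for a POST request to /data/add."""
    body = urlencode([("user_id", user_id), ("song_id", song_id)])
    return "%s/data/add" % base_url, body

get_url = recommendations_url("abc")
post_url, post_body = add_data_request("abc", "SOAKIMP12A8C130995")
print(get_url)
print(post_url, post_body)
```

With the service running, the GET URL can be opened with any HTTP client, and the POST body can be sent as form data to the `/data/add` URL.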

Thanks for reading this far!