# LastFM Recommender
The aim of this exercise is to build an ensemble recommender for music artists, using data made available [here](http://mtg.upf.edu/static/datasets/last.fm/lastfm-dataset-360K.tar.gz).

I'll follow the methods described in [Jeremy Howard's video](https://www.youtube.com/watch?v=V2h3IOBDvrA&t=5761s).

This dataset is a single table covering roughly 360k users, organised into the following columns:
- UserID 
- ArtistID 
- ArtistName 
- PlayCount

Not all of the artists have a valid ArtistID, so we will use the artist name as a unique id.

In [3]:
from theano.sandbox import cuda
%matplotlib inline
import utils; reload(utils)
from utils import *
from __future__ import division, print_function

In [4]:
path = 'data/lastfm-dataset-360K/'
fulldata_file = 'usersha1-artmbid-artname-plays.tsv'
data_file = 'fulldata.tsv'
sample_file = 'sampledata.tsv'

def read_fulldata_file():
    return pd.read_csv(path + fulldata_file, 
                       sep='\t',
                       usecols=[0,2,3],
                       names=['user', 'artist','plays'])

def read_sample_file():
    return pd.read_csv(path + sample_file,
                       sep='\t')

# Create the sample dataset of 1000 users
if not os.path.isfile(path + sample_file):
    df = read_fulldata_file()
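    # Sample from distinct users; sampling df.user directly draws rows and can repeat a user
    users_to_sample = df.user.drop_duplicates().sample(n=1000)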
    rows_to_sample = df[df.user.isin(users_to_sample)]
    rows_to_sample.to_csv(path + sample_file,
                          index=False,
                          sep='\t')

In [5]:
df = read_sample_file()
# df = read_fulldata_file()

First, we need to transform the data:
- Map UserID and artist name to contiguous integers.
- Convert the play count for each (user, artist) pair into a normalized value representing how much the user likes that artist compared to other artists.

We assign each (user, artist) pair the fraction of that user's total plays that the artist accounts for. This leaves the value normalized between 0 and 1.

In [6]:
userid2ids = {o:i for i,o in enumerate(df.user.unique())}
artistid2ids = {o:i for i,o in enumerate(df.artist.unique())}
plays_per_user = df.groupby(['user'])['plays'].sum().to_dict()

def normalize(row):
    row['plays'] = row['plays'] / plays_per_user[row['user']]
    row['user'] = userid2ids[row['user']]
    row['artist'] = artistid2ids[row['artist']]
    return row

norm_df = df.apply(normalize, axis=1)
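
Row-wise `apply` gets slow on the full dataset. Below is a vectorized sketch of the same transformation. It assumes a DataFrame with the `user`, `artist` and `plays` columns used above; the toy data is made up for illustration, `normalize_vectorized` is a hypothetical helper name, and `pd.factorize` is swapped in for the dictionary lookups.

```python
import pandas as pd

# Toy stand-in for the sample file (values are made up for illustration).
toy = pd.DataFrame({
    'user':   ['u1', 'u1', 'u2', 'u2', 'u2'],
    'artist': ['a', 'b', 'a', 'c', 'd'],
    'plays':  [30, 10, 5, 5, 10],
})

def normalize_vectorized(frame):
    """Return a copy with integer ids and per-user play fractions."""
    out = frame.copy()
    # Fraction of each user's total plays that went to this artist.
    totals = out.groupby('user')['plays'].transform('sum')
    out['plays'] = out['plays'].astype(float) / totals
    # factorize assigns contiguous integers in order of first appearance,
    # which matches enumerate(series.unique()).
    out['user'] = pd.factorize(out['user'])[0]
    out['artist'] = pd.factorize(out['artist'])[0]
    return out

toy_norm = normalize_vectorized(toy)
print(toy_norm)
```

A quick check on the result is that the `plays` column sums to 1 within each user.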

Now we can decide on a number of latent factors and split the data into training and validation sets. We also create a few variables that we will need later.

In [8]:
n_factors = 40
np.random.seed(42)  # call the function; assigning to it would not seed the generator
msk = np.random.rand(len(norm_df)) < 0.8
trn = norm_df[msk]
val = norm_df[~msk]

n_users = norm_df.user.nunique()
n_artists = norm_df.artist.nunique()
n_users, n_artists

(1000, 15593)
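
Seeding only takes effect when `np.random.seed` is called as a function; assigning a value to it replaces the function and leaves the generator unseeded. As a minimal sketch of a reproducible 80/20 mask, the block below uses a local `RandomState` instead of the global seed (a substitution on my part), and `split_mask` is a hypothetical helper name.

```python
import numpy as np

def split_mask(n_rows, train_frac=0.8, seed=42):
    """Boolean mask, True for rows that go to the training set."""
    rng = np.random.RandomState(seed)  # local generator, global state untouched
    return rng.rand(n_rows) < train_frac

mask_a = split_mask(10000)
mask_b = split_mask(10000)
print((mask_a == mask_b).all())  # same seed, so the two splits are identical
print(mask_a.mean())             # close to 0.8, but not exactly 0.8
```

Because each row is assigned independently, the training share is only approximately 80%.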

As per the original example, we'll do a quick cross-tabulation of the top artists against the most prolific users as a sanity check on the data so far.

In [9]:
g_artists = norm_df.groupby('artist')['plays'].count()
top_artists = g_artists.sort_values(ascending=False)[:15]
g_users = norm_df.groupby('user')['artist'].count()
top_users = g_users.sort_values(ascending=False)[:15]

top = norm_df.join(top_users, rsuffix='_r', how='inner', on='user')
top = top.join(top_artists, rsuffix='_r', how='inner', on='artist')
pd.crosstab(top.user, top.artist, top.plays, aggfunc = np.sum)

| user \ artist | 168 | 301 | 539 | 578 | 615 | 632 | 643 | 649 | 654 | 706 |
|---|---|---|---|---|---|---|---|---|---|---|
| 14 | 0.010375 | | 0.00457 | 0.02915 | | | | 0.003335 | 0.003458 | 0.009387 |
| 244 | 0.007231 | | | 0.023077 | | | | | | |
| 268 | | | | | | | 0.005792 | | | 0.013514 |
| 493 | 0.023529 | | | | | | | | | |
| 550 | | | | | | | 0.017322 | 0.003997 | | |
| 578 | | 0.033956 | | | | | | | | |
| 733 | | | | | | | | 0.009524 | | |
| 753 | | | | | | | | 0.012701 | | |
| 816 | | | | 0.05641 | 0.011111 | 0.012821 | | | | |


The resulting table is much sparser than the equivalent table for movie titles. I'm guessing this is because there are far more distinct artists, relative to the overall size of the dataset, than there are movies in the movie dataset.
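
One way to put a number on that sparsity is the fraction of possible (user, artist) pairs that actually occur. The sketch below uses the counts reported in this notebook (1000 users, 15593 artists, and 40352 training plus 10404 validation rows) and assumes each row is a distinct (user, artist) pair.

```python
n_users, n_artists = 1000, 15593
n_rows = 40352 + 10404  # training + validation rows

possible_pairs = n_users * n_artists
# Assumes one row per (user, artist) pair.
density = n_rows / float(possible_pairs)
print('density: {:.4%}'.format(density))
```

That works out to roughly 0.3% of the cells in the full user-by-artist matrix being filled.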

# Dot product
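The most basic model, as per the original example: each user and each artist gets an embedding of `n_factors` values, and the prediction is the dot product of the two.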

In [10]:
user_in = Input(shape=(1,), dtype='int64', name='user_in')
u = Embedding(n_users, n_factors, input_length=1, W_regularizer=l2(1e-4))(user_in)
artist_in = Input(shape=(1,), dtype='int64', name='artist_in')
a = Embedding(n_artists, n_factors, input_length=1, W_regularizer=l2(1e-4))(artist_in)

In [11]:
x = merge([u, a], mode='dot')
x = Flatten()(x)
model = Model([user_in, artist_in], x)
model.compile(Adam(0.001), loss='mse')
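
To make explicit what this model computes, here is a NumPy sketch of the forward pass only: each id selects a row of an embedding matrix, and the prediction is the dot product of the two rows. The matrices are random stand-ins rather than trained weights, the sizes are toy values, and `predict` and `mse` are hypothetical helper names. The L2 penalty and the Adam updates are left out.

```python
import numpy as np

rng = np.random.RandomState(0)
n_users, n_artists, n_factors = 5, 7, 4  # toy sizes, not the notebook's

# Random stand-ins for the two embedding tables (untrained).
user_emb = rng.normal(scale=0.05, size=(n_users, n_factors))
artist_emb = rng.normal(scale=0.05, size=(n_artists, n_factors))

def predict(user_ids, artist_ids):
    """Dot product of the looked-up user and artist vectors, one per pair."""
    u = user_emb[user_ids]      # shape (batch, n_factors)
    a = artist_emb[artist_ids]  # shape (batch, n_factors)
    return (u * a).sum(axis=1)  # shape (batch,)

def mse(pred, target):
    """Mean squared error, the loss the model is compiled with."""
    return np.mean((pred - target) ** 2)

users = np.array([0, 1, 4])
artists = np.array([2, 2, 6])
target = np.array([0.02, 0.01, 0.05])  # made-up play fractions

pred = predict(users, artists)
print(pred.shape, mse(pred, target))
```

Training adjusts the rows of both tables so that these dot products move closer to the observed play fractions.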

In [12]:
model.fit([trn.user, trn.artist], 
          trn.plays, 
          batch_size=64, 
          nb_epoch=3, 
          validation_data=([val.user, val.artist], val.plays))

Train on 40352 samples, validate on 10404 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7f46198c0650>

In [None]:
model.optimizer.lr = 0.001
model.fit([trn.user, trn.artist], 
          trn.plays, 
          batch_size=64, 
          nb_epoch=6,
          validation_data=([val.user, val.artist], val.plays))