# Recommendation Engine

In this tutorial we are going to build a simple recommender system using collaborative filtering.

## 1. The import statements

In [1]:
import numpy as np
import pandas as pd
import sklearn.metrics.pairwise

## 2. The data

We will use the data for Germany from the [Last.fm Dataset](https://labrosa.ee.columbia.edu/millionsong/lastfm). To read and explore the data we will use the [pandas library](https://pandas.pydata.org/):
+ [`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html): reads a csv file and returns a [`pandas.DataFrame`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html), a two-dimensional data structure with labeled rows and columns.
+ [`pandas.DataFrame.set_index`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html): sets the DataFrame index (the row labels).

Pandas enables the use of method chaining: the *read_csv* call returns a DataFrame, on which we can immediately apply the *set_index* method by chaining it via dot notation.

In [2]:
data = pd.read_csv('data/lastfm-matrix-germany.csv').set_index('user')
data.head()

| user | a perfect circle | abba | ac/dc | adam green | aerosmith | afi | air | alanis morissette | alexisonfire | alicia keys | ... | timbaland | tom waits | tool | tori amos | travis | trivium | u2 | underoath | volbeat | yann tiersen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 33 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 42 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 51 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 62 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [3]:
data.shape

(1257, 285)

The resulting DataFrame contains a row for each user and a column for each artist. The values indicate whether the user listened to a song by that artist (1) or not (0). Note that the number of times a person listened to a specific artist is not recorded.
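To make the shape of this data concrete, here is a minimal sketch that loads a tiny listening matrix the same way. The CSV text, user ids and values are made up for illustration and are not taken from the Last.fm file.

```python
import io

import pandas as pd

# Hypothetical miniature of the listening matrix (made-up users and values).
csv_text = """user,abba,ac/dc,air
1,0,1,0
33,1,0,0
42,0,1,1
"""

# read_csv returns a DataFrame; set_index is chained directly onto that result.
toy = pd.read_csv(io.StringIO(csv_text)).set_index('user')

print(toy.shape)    # (number of users, number of artists)
print(toy.loc[42])  # listening history of the user with id 42
```

Each row is one user and each column one artist, so `toy.loc[42]` returns that user's 0/1 history as a Series.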

## 3. Determining artist similarity

We want to figure out which artist to recommend to which user. Since we know which users listened to which artists, we can look for similar artists or for similar users. Human listening preferences can be very complex, which makes users hard to group. Artists, on the other hand, are usually much easier to group. We will therefore look for similarities between artists rather than between users.

To determine if two artists are similar, you can use many different similarity metrics. Finding the best metric is a whole research topic on its own. In many cases though the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) is used. The implementation we will use here is the [`sklearn.metrics.pairwise.cosine_similarity`](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html).

In [4]:
### BEGIN SOLUTION
similarity_matrix = sklearn.metrics.pairwise.cosine_similarity(data.T)
### END SOLUTION
# similarity_matrix = sklearn.metrics.pairwise.cosine_similarity( ? )

assert similarity_matrix.shape == (285, 285)
print(type(similarity_matrix))
print(similarity_matrix.ndim)

<class 'numpy.ndarray'>
2
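To see what the cosine similarity measures, the following sketch computes it by hand with numpy on a small made-up listening matrix. It is an illustration of the formula, not the scikit-learn implementation, and it assumes every artist has at least one listener (otherwise the norm is zero and the division is undefined).

```python
import numpy as np

# Hypothetical listening matrix: rows are users, columns are artists.
listens = np.array([
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 0],
    [1, 0, 1],
])

# One row per artist, mirroring the data.T used in the tutorial.
artists = listens.T.astype(float)

# Cosine similarity: dot product divided by the product of the vector lengths.
norms = np.linalg.norm(artists, axis=1)
toy_similarity = (artists @ artists.T) / np.outer(norms, norms)

print(toy_similarity.round(3))
```

The diagonal is 1 because every artist vector points in the same direction as itself, and two artists without any common listener get a similarity of 0.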


The *cosine_similarity* function returned a 2-dimensional [`numpy array`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.html). This array contains all the similarity values we need, but it is not labelled. Since the entire array will not fit the screen, we will use [`slicing`](https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html) to print a subset of the result.

In [5]:
similarity_matrix[:5, :5]

array([[1.        , 0.        , 0.01791723, 0.05155393, 0.06277648],
       [0.        , 1.        , 0.05227877, 0.02507061, 0.06105625],
       [0.01791723, 0.05227877, 1.        , 0.11315371, 0.177153  ],
       [0.05155393, 0.02507061, 0.11315371, 1.        , 0.05663655],
       [0.06277648, 0.06105625, 0.177153  , 0.05663655, 1.        ]])

The artist names are both the row and column labels for the similarity_matrix. We can add these labels by creating a new DataFrame based on the numpy array. By using the [`pandas.DataFrame.iloc`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html) integer-location based indexer, we get the same slice as above, but with added labels.

In [6]:
### BEGIN SOLUTION
artist_similarities = pd.DataFrame(similarity_matrix, index=data.columns, columns=data.columns)
### END SOLUTION
# artist_similarities = pd.DataFrame( ? , index=data.columns, columns= ? )

assert np.array_equal(artist_similarities.columns, data.columns)
assert artist_similarities.shape == similarity_matrix.shape

artist_similarities.iloc[:5, :5]

| | a perfect circle | abba | ac/dc | adam green | aerosmith |
|---|---|---|---|---|---|
| a perfect circle | 1.0 | 0.0 | 0.017917 | 0.051554 | 0.062776 |
| abba | 0.0 | 1.0 | 0.052279 | 0.025071 | 0.061056 |
| ac/dc | 0.017917 | 0.052279 | 1.0 | 0.113154 | 0.177153 |
| adam green | 0.051554 | 0.025071 | 0.113154 | 1.0 | 0.056637 |
| aerosmith | 0.062776 | 0.061056 | 0.177153 | 0.056637 | 1.0 |


Pandas also provides a label based indexer, [`pandas.DataFrame.loc`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc), which we can use to get a slice based on label values.

In [10]:
slice_artists = ['ac/dc', 'madonna', 'metallica', 'rihanna', 'the white stripes']

artist_similarities.loc[slice_artists, slice_artists]

| | ac/dc | madonna | metallica | rihanna | the white stripes |
|---|---|---|---|---|---|
| ac/dc | 1.0 | 0.02833 | 0.2683 | 0.024813 | 0.146223 |
| madonna | 0.02833 | 1.0 | 0.054554 | 0.234604 | 0.013167 |
| metallica | 0.2683 | 0.054554 | 1.0 | 0.066895 | 0.160904 |
| rihanna | 0.024813 | 0.234604 | 0.066895 | 1.0 | 0.04613 |
| the white stripes | 0.146223 | 0.013167 | 0.160904 | 0.04613 | 1.0 |


As you can see above, every artist has a similarity of 1 with itself, and The White Stripes have almost nothing in common with Madonna (0.013).

We can further increase the usability of this data by making it a [tidy dataset](https://en.wikipedia.org/wiki/Tidy_data). This means we'll put each variable in a column, and each observation in a row. There are three variables in our dataset:
+ first artist
+ second artist
+ cosine similarity

In our current DataFrame the second artist is determined by the column labels, and as a consequence the cosine similarity observations are spread over multiple columns. The [`pandas.DataFrame.melt`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html) method will fix this.

In [11]:
similarities = (
    # start from wide DataFrame
    artist_similarities
    # add a name to the index
    .rename_axis(index='artist')
    # artist needs to be a column for melt
    .reset_index()
    # create the tidy dataset
    .melt(id_vars='artist', var_name='compared_with', value_name='cosine_similarity')
    # artist compared with itself not needed, keep rows where artist and compared_with are not equal.
    .query('artist != compared_with')
    # set identifying observations to index
    .set_index(['artist', 'compared_with'])
    # sort the index
    .sort_index()
)
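The effect of each step in this chain is easier to follow on a small input. The sketch below applies the same chain to a made-up 3x3 similarity matrix; the artist names are reused from the dataset but the values are invented for illustration.

```python
import pandas as pd

# Hypothetical 3x3 similarity matrix with made-up values.
names = ['abba', 'air', 'tool']
wide = pd.DataFrame(
    [[1.0, 0.2, 0.0],
     [0.2, 1.0, 0.1],
     [0.0, 0.1, 1.0]],
    index=names,
    columns=names,
)

tidy = (
    wide
    .rename_axis(index='artist')          # name the row labels
    .reset_index()                        # turn them into a regular column
    .melt(id_vars='artist', var_name='compared_with', value_name='cosine_similarity')
    .query('artist != compared_with')     # drop the self-comparisons
    .set_index(['artist', 'compared_with'])
    .sort_index()
)

print(tidy)
```

The 3x3 matrix has nine cells; after dropping the three self-comparisons, the tidy result has six rows, one per ordered pair of distinct artists.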

To view the first n rows, we can use the [`pandas.DataFrame.head`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) method; the default value for n is 5.

In [12]:
similarities.head()

| artist | compared_with | cosine_similarity |
|---|---|---|
| a perfect circle | abba | 0.0 |
| a perfect circle | ac/dc | 0.017917 |
| a perfect circle | adam green | 0.051554 |
| a perfect circle | aerosmith | 0.062776 |
| a perfect circle | afi | 0.0 |


Note that we created a [`MultiIndex`](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html#advanced-hierarchical) by specifying two columns in the set_index call.

In [13]:
similarities.index

MultiIndex([('a perfect circle',              'abba'),
            ('a perfect circle',             'ac/dc'),
            ('a perfect circle',        'adam green'),
            ('a perfect circle',         'aerosmith'),
            ('a perfect circle',               'afi'),
            ('a perfect circle',               'air'),
            ('a perfect circle', 'alanis morissette'),
            ('a perfect circle',      'alexisonfire'),
            ('a perfect circle',       'alicia keys'),
            ('a perfect circle',  'all that remains'),
            ...
            (    'yann tiersen',  'three days grace'),
            (    'yann tiersen',         'timbaland'),
            (    'yann tiersen',         'tom waits'),
            (    'yann tiersen',              'tool'),
            (    'yann tiersen',         'tori amos'),
            (    'yann tiersen',            'travis'),
            (    'yann tiersen',           'trivium'),
            (    'yann tiersen',                '

The use of the MultiIndex enables flexible access to the data. If we index with a single artist name, we get all compared artists. To view the last n rows for this result, we can use the [`pandas.DataFrame.tail`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html#pandas.DataFrame.tail) method.

In [14]:
similarities.loc['the beatles', :].tail()

| compared_with | cosine_similarity |
|---|---|
| trivium | 0.028837 |
| u2 | 0.139252 |
| underoath | 0.0 |
| volbeat | 0.0406 |
| yann tiersen | 0.151911 |


We can index on multiple levels by providing a tuple of indexes:

In [15]:
similarities.loc[('abba', 'madonna'), :]

cosine_similarity    0.241656
Name: (abba, madonna), dtype: float64

In [16]:
print(slice_artists)
similarities.loc[('abba', slice_artists), :]

['ac/dc', 'madonna', 'metallica', 'rihanna', 'the white stripes']


| artist | compared_with | cosine_similarity |
|---|---|---|
| abba | ac/dc | 0.052279 |
| abba | madonna | 0.241656 |
| abba | metallica | 0.067116 |
| abba | rihanna | 0.13469 |
| abba | the white stripes | 0.0 |
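The three access patterns used above return differently shaped results. The following sketch shows them side by side on a small, made-up MultiIndex DataFrame (the similarity values are invented).

```python
import pandas as pd

# Hypothetical tidy similarities, already sorted on both index levels.
index = pd.MultiIndex.from_tuples(
    [('abba', 'air'), ('abba', 'tool'), ('air', 'abba'), ('air', 'tool')],
    names=['artist', 'compared_with'],
)
toy = pd.DataFrame({'cosine_similarity': [0.2, 0.0, 0.2, 0.1]}, index=index)

# First level only: a DataFrame indexed by the remaining level.
all_for_abba = toy.loc['abba', :]

# A full (artist, compared_with) tuple: a Series for that single pair.
one_pair = toy.loc[('abba', 'air'), :]

# A label plus a list of labels: a DataFrame that keeps both index levels.
some_pairs = toy.loc[('abba', ['air', 'tool']), :]

print(all_for_abba, one_pair, some_pairs, sep='\n\n')
```

Keeping the index sorted, as the `sort_index` call in the tutorial does, is what makes this kind of label-based selection on a MultiIndex efficient.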


## 4. Picking the best matches

Even though many of the artists above have a similarity close to 0, there might be some artists that seem to be slightly similar because somebody with a complex taste listened to them both. To remove this noise from the dataset we are going to limit the number of matches.

Let's first try this with the first artist in the list: *a perfect circle*.

In [17]:
artist = 'a perfect circle'
n_artists = 10
### BEGIN SOLUTION
top_n = similarities.loc[artist, :].sort_values('cosine_similarity').tail(n_artists)
### END SOLUTION
# top_n = similarities.loc[?, :].sort_values('cosine_similarity') ?
print(top_n)

assert len(top_n) == 10
assert type(top_n) == pd.DataFrame

                       cosine_similarity
compared_with                           
radiohead                       0.173384
the smashing pumpkins           0.174456
opeth                           0.187083
system of a down                0.199205
incubus                         0.200839
nine inch nails                 0.214669
porcupine tree                  0.223607
deftones                        0.323669
dredg                           0.347440
tool                            0.394709


We can transform the task of getting the most similar bands for a given band to a function.

In [18]:
def most_similar_artists(artist, n_artists=10):
    """Get the most similar artists for a given artist.
    
    Parameters
    ----------
    artist: str
        The artist for which to get similar artists
    n_artists: int, optional
        The number of similar artists to return
    
    Returns
    -------
    pandas.DataFrame
        A DataFrame with the similar artists and their cosine_similarity to
        the given artist
    """
    ### BEGIN SOLUTION
    return similarities.loc[artist, :].sort_values('cosine_similarity').tail(n_artists)
    ### END SOLUTION
    # return similarities.loc[ ? ].sort_values( ? ) ?

print(most_similar_artists('a perfect circle'))

assert top_n.equals(most_similar_artists('a perfect circle'))
assert most_similar_artists('abba', n_artists=15).shape == (15, 1)

                       cosine_similarity
compared_with                           
radiohead                       0.173384
the smashing pumpkins           0.174456
opeth                           0.187083
system of a down                0.199205
incubus                         0.200839
nine inch nails                 0.214669
porcupine tree                  0.223607
deftones                        0.323669
dredg                           0.347440
tool                            0.394709
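Sorting and taking the tail returns the matches in ascending order, with the best match last. As a possible alternative, [`pandas.DataFrame.nlargest`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nlargest.html) returns the same rows with the best match first. The sketch below compares both on made-up values; it is not part of the tutorial's solution.

```python
import pandas as pd

# Hypothetical similarities for one artist (made-up values).
toy = pd.DataFrame(
    {'cosine_similarity': [0.10, 0.39, 0.00, 0.32, 0.17]},
    index=pd.Index(['opeth', 'tool', 'abba', 'deftones', 'radiohead'], name='compared_with'),
)

# Approach used in the tutorial: ascending sort, then keep the last rows.
via_sort = toy.sort_values('cosine_similarity').tail(3)

# Alternative: select the largest values directly, in descending order.
via_nlargest = toy.nlargest(3, 'cosine_similarity')

print(via_sort)
print(via_nlargest)
```

Both select the same three artists; only the row order differs, which matters if later code relies on the position of the best match.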


## 5. Get the listening history

To quantify the recommendation score for an artist, we'll want to know whether a user listened to many similar artists. We know which artists are similar to a given artist, but we still need to figure out if any of these similar artists are in the listening history of the user. The index labels for a DataFrame can be retrieved by using the [`pandas.DataFrame.index`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.index.html) attribute.

In [19]:
artist = 'the beatles'
### BEGIN SOLUTION
similar_labels = most_similar_artists(artist).index
### END SOLUTION
# similar_labels = most_similar_artists( ? ). ?
similar_labels

Index(['oasis', 'ac/dc', 'johnny cash', 'david bowie', 'tom waits',
       'pink floyd', 'bob dylan', 'the rolling stones',
       'red hot chili peppers', 'the doors'],
      dtype='object', name='compared_with')

In [244]:
def most_similar_bands_history(band, user_id):
    similar_bands = most_similar_bands(band)
    history = data.loc[user_id, similar_bands.index].rename('listening_history')
    return pd.concat([similar_bands, history], axis=1)

example = most_similar_bands_history('abba', 42)

assert example.columns.to_list() == ['cosine_similarity', 'listening_history']

example

| compared_band | cosine_similarity | listening_history |
|---|---|---|
| madonna | 0.241656 | 0 |
| robbie williams | 0.205398 | 0 |
| elvis presley | 0.191799 | 0 |
| michael jackson | 0.187885 | 0 |
| queen | 0.179427 | 0 |
| the beatles | 0.173025 | 0 |
| kelly clarkson | 0.164399 | 0 |
| groove coverage | 0.161206 | 1 |
| duffy | 0.150075 | 0 |
| mika | 0.140971 | 0 |


## 6. Find which artists to advise.

Now that we can easily find which artists are similar, we have to figure out which artists to advise to whom. To do this we need to determine how the listening history of a user matches that of artists they didn't listen to yet. For this we will use the following similarity score:

In [72]:
# Function to compute the similarity scores
def similarity_score(listening_history, similarities):
    return sum(listening_history * similarities) / sum(similarities)

For each band we sum the similarities of bands the user also listened to. In the end we divide by the total sum of similarities to normalise the score.

Suppose a user listened to 1 of 3 similar bands, for example `[0, 1, 0]`, and their respective similarity scores are `[0.3, 0.2, 0.1]`. You then get the following score:

In [73]:
listening_history = np.array([0, 1, 0]) 
similarities = np.array([0.3, 0.2, 0.1])
print(f'{similarity_score(listening_history, similarities):.3f}')

0.333
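The result is 0.2 / (0.3 + 0.2 + 0.1): only the similarity of the band the user listened to counts in the numerator. As a sketch, the same score can be written with a dot product and a guard for the case where all similarities are zero. The name `weighted_score` and the choice to return 0.0 in that case are assumptions made for this illustration, not part of the tutorial.

```python
import numpy as np


def weighted_score(listening_history, weights):
    """Similarity-weighted share of the similar bands the user listened to."""
    listening_history = np.asarray(listening_history, dtype=float)
    weights = np.asarray(weights, dtype=float)
    total = weights.sum()
    if total == 0:
        # Assumption: without any similar band there is no evidence, so score 0.
        return 0.0
    return float(listening_history @ weights / total)


print(round(weighted_score([0, 1, 0], [0.3, 0.2, 0.1]), 3))
```

A user who listened to all similar bands gets a score of 1, and a user who listened to none gets 0, so the score can be read as a weighted fraction.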


Now let's compute the score for each band for the user with ID 1.

In [274]:
user = 1

# a list of all the scores
scores = []

for band_index in range(len(band_similarities.columns)):
    band = band_similarities.columns[band_index]
    
    # For bands the user already listened to we set the score to 0
    if data.loc[user, band] == 1:
        scores.append(0)
    else:
        # Most similar bands to this one
### BEGIN SOLUTION
        most_similar_band_names = band_similarities.loc[band].sort_values(ascending=False)[1:n_best].index
### END SOLUTION
        # most_similar_band_names = band_similarities.loc[band].sort_values(ascending= ? ) ?
        # Get the similarity score of these bands
### BEGIN SOLUTION
        most_similar_band_scores = band_similarities.loc[band].sort_values(ascending=False)[1:n_best]
### END SOLUTION
        # most_similar_band_scores = band_similarities.loc[band].sort_values(ascending= ? ) ?
        # Get the listening history for these bands
        user_listening_history = data.loc[user, most_similar_band_names]

        scores.append(similarity_score(user_listening_history, most_similar_band_scores))

3.75 s ± 156 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Now let's print the top 5 bands to recommend to this user:

In [75]:
print(f'For user with id {user} we advise:')
pd.DataFrame(scores, index=band_similarities.columns).sort_values(0, ascending=False).head()

For user with id 1 we advise:


| | 0 |
|---|---|
| flogging molly | 0.364458 |
| coldplay | 0.3432 |
| aerosmith | 0.290306 |
| the beatles | 0.239779 |
| mando diao | 0.22593 |


Now try this for other users as well.

In [153]:
most_similar_bands('abba').merge(data.loc[42, :], left_index=True, right_index=True)

| | cosine_similarity | 42 |
|---|---|---|
| madonna | 0.241656 | 0 |
| robbie williams | 0.205398 | 0 |
| elvis presley | 0.191799 | 0 |
| michael jackson | 0.187885 | 0 |
| queen | 0.179427 | 0 |
| the beatles | 0.173025 | 0 |
| kelly clarkson | 0.164399 | 0 |
| groove coverage | 0.161206 | 1 |
| duffy | 0.150075 | 0 |
| mika | 0.140971 | 0 |


In [218]:
most_similar_bands('abba')

| compared_band | cosine_similarity |
|---|---|
| madonna | 0.241656 |
| robbie williams | 0.205398 |
| elvis presley | 0.191799 |
| michael jackson | 0.187885 |
| queen | 0.179427 |
| the beatles | 0.173025 |
| kelly clarkson | 0.164399 |
| groove coverage | 0.161206 |
| duffy | 0.150075 |
| mika | 0.140971 |


In [282]:
def similarity_score(band, user_id):
    df = most_similar_bands_history(band, user_id)
    return df.product(axis=1).sum(axis=0) / df['cosine_similarity'].sum()

assert np.allclose(similarity_score('abba', 42), 0.08976655361839528)
assert np.allclose(similarity_score('the white stripes', 1), 0.09492796371597861)

similarity_score('abba', 42)

0.08976655361839528

In [283]:
%%timeit
similarity_score('abba', 42)

6.64 ms ± 157 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [284]:
history = data.loc[42, :]
history.loc[history == 0]

a perfect circle    0
abba                0
ac/dc               0
adam green          0
aerosmith           0
                   ..
trivium             0
u2                  0
underoath           0
volbeat             0
yann tiersen        0
Name: 42, Length: 278, dtype: int64

In [285]:
def unknown_artists(user_id):
    history = data.loc[user_id, :]
    return history.loc[history == 0].index

example = unknown_artists(42)

example

Index(['a perfect circle', 'abba', 'ac/dc', 'adam green', 'aerosmith', 'afi',
       'air', 'alanis morissette', 'alexisonfire', 'alicia keys',
       ...
       'timbaland', 'tom waits', 'tool', 'tori amos', 'travis', 'trivium',
       'u2', 'underoath', 'volbeat', 'yann tiersen'],
      dtype='object', length=278)

In [286]:
user_id = 42
scores = pd.DataFrame(
    [{'band': band, 'score': similarity_score(band, user_id)} for band in unknown_artists(user_id)]
)

In [287]:
def score_unknown_artists(user_id):
    artists = unknown_artists(user_id)
    return [{'artist': artist, 'score': similarity_score(artist, user_id)} for artist in artists]

In [288]:
def recommendations(user_id, n_rec=5):
    scores = score_unknown_artists(user_id)
    return pd.DataFrame(scores).sort_values('score', ascending=False).head(n_rec).reset_index(drop=True)

In [289]:
recommendations(42)

| | artist | score |
|---|---|---|
| 0 | oomph! | 0.426284 |
| 1 | lacuna coil | 0.352066 |
| 2 | rammstein | 0.296665 |
| 3 | schandmaul | 0.243382 |
| 4 | sonata arctica | 0.229894 |


In [291]:
recommendations(1)

| | artist | score |
|---|---|---|
| 0 | flogging molly | 0.332939 |
| 1 | coldplay | 0.311746 |
| 2 | aerosmith | 0.266175 |
| 3 | the beatles | 0.218716 |
| 4 | moby | 0.217263 |


In [272]:
%%timeit
recommendations(42)

2.26 s ± 173 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
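A run time of a couple of seconds per user suggests that most of the work is spent in repeated pandas lookups rather than in arithmetic. One possible way to reduce this is to score all artists at once with numpy. The sketch below is an assumption-laden illustration, not the tutorial's implementation: the function name `recommend_vectorized` and its parameters are hypothetical, it assumes `similarity` uses the same artist labels as `data.columns`, that `n_similar` is smaller than the number of artists, and that no artist has neighbour similarities summing to zero. Ties may also be broken differently than in a sort-based version.

```python
import numpy as np
import pandas as pd


def recommend_vectorized(data, similarity, user_id, n_similar=10, n_rec=5):
    """Score all unheard artists for one user in a single pass (sketch).

    data: users x artists DataFrame of 0/1 values
    similarity: artists x artists DataFrame labelled like data.columns
    """
    sim = similarity.to_numpy(dtype=float).copy()
    np.fill_diagonal(sim, -np.inf)  # never pick an artist as its own neighbour

    # Column positions of the n_similar largest similarities in every row.
    top = np.argpartition(-sim, n_similar - 1, axis=1)[:, :n_similar]
    weights = np.take_along_axis(sim, top, axis=1)

    history = data.loc[user_id].to_numpy(dtype=float)
    heard = history[top]  # 0/1 history for each artist's neighbours

    scores = (weights * heard).sum(axis=1) / weights.sum(axis=1)
    result = pd.Series(scores, index=data.columns, name='score')
    # Keep only artists the user has not listened to yet.
    return result[history == 0].sort_values(ascending=False).head(n_rec)


# Made-up toy data: two users, four artists.
artists = ['abba', 'air', 'tool', 'opeth']
toy_data = pd.DataFrame([[1, 0, 0, 0], [0, 0, 1, 1]], index=[1, 2], columns=artists)
toy_sim = pd.DataFrame(
    [[1.0, 0.5, 0.1, 0.0],
     [0.5, 1.0, 0.2, 0.1],
     [0.1, 0.2, 1.0, 0.8],
     [0.0, 0.1, 0.8, 1.0]],
    index=artists,
    columns=artists,
)
toy_result = recommend_vectorized(toy_data, toy_sim, user_id=1, n_similar=2, n_rec=2)
print(toy_result)
```

In the toy data, user 1 only listened to abba, so air scores highest: its two nearest neighbours are abba (0.5) and tool (0.2), giving 0.5 / 0.7.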


In [267]:
%prun recommendations(42)

 

         1433161 function calls (1413378 primitive calls) in 2.815 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   212718    0.141    0.000    0.315    0.000 {built-in method builtins.isinstance}
   144473    0.134    0.000    0.143    0.000 {built-in method builtins.getattr}
    92606    0.082    0.000    0.172    0.000 generic.py:7(_check)
     6401    0.074    0.000    0.074    0.000 {method 'reduce' of 'numpy.ufunc' objects}
93597/74652    0.050    0.000    0.074    0.000 {built-in method builtins.len}
     1394    0.049    0.000    0.120    0.000 managers.py:216(_rebuild_blknos_and_blklocs)
    18927    0.045    0.000    0.078    0.000 abc.py:180(__instancecheck__)
    17584    0.035    0.000    0.193    0.000 base.py:231(is_dtype)
    15610    0.035    0.000    0.071    0.000 common.py:1886(_is_dtype_type)
    29504    0.034    0.000    0.034    0.000 _weakrefset.py:70(__contains__)
  837/559    0.030    0.000    0

We only want to score the bands the user didn't listen to yet:

In [94]:
user = 1
test = data.loc[user, ~data.loc[user, :].astype(bool)]

In [95]:
test

a perfect circle    0
abba                0
ac/dc               0
adam green          0
aerosmith           0
                   ..
trivium             0
u2                  0
underoath           0
volbeat             0
yann tiersen        0
Name: 1, Length: 274, dtype: int64

In [93]:
~data.loc[user, :].astype(bool)

a perfect circle    True
abba                True
ac/dc               True
adam green          True
aerosmith           True
                    ... 
trivium             True
u2                  True
underoath           True
volbeat             True
yann tiersen        True
Name: 1, Length: 285, dtype: bool