# Recommendation Engine

In this part we are going to build a simple recommender system using collaborative filtering.

## 1. The import statements

In [None]:
import numpy as np
import pandas as pd
import sklearn.metrics.pairwise

## 2. The data

We will use the data for Germany from the [Last.fm Dataset](https://labrosa.ee.columbia.edu/millionsong/lastfm). To read and print the data we will use the [pandas library](https://pandas.pydata.org/):
+ [`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html): reads a csv file and returns a [`pandas.DataFrame`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html).
+ [`pandas.DataFrame.set_index`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html): sets the DataFrame index (the row identifiers).

Pandas supports method chaining: the `read_csv` call returns a DataFrame, on which we can immediately call the `set_index` method by chaining it with dot notation.
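
To make the chaining concrete, here is a minimal sketch that reads a tiny in-memory CSV instead of the real file. The CSV content and the variable names (`csv_text`, `frame`, `indexed`, `chained`) are made up for illustration; only `read_csv` and `set_index` come from the text above.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the real data file
csv_text = "user,abba,radiohead\n1,0,1\n2,1,1\n"

# Two separate steps: read the file, then set the index ...
frame = pd.read_csv(io.StringIO(csv_text))
indexed = frame.set_index('user')

# ... and the same thing written as a single chain
chained = pd.read_csv(io.StringIO(csv_text)).set_index('user')

print(chained)
```

Both versions produce the same DataFrame; the chained form simply avoids the intermediate variable.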

In [None]:
data = pd.read_csv('data/lastfm-matrix-germany.csv').set_index('user')
data.head()

As you can see, the data records for each user which bands they listened to on Last.fm. Note that the number of times a person listened to a specific band is not listed.

## 3. Band similarity

We want to figure out which band to recommend to which user. Since we know which user listened to which band, we can look for bands or users that are similar. Humans can have vastly complex listening preferences and are very hard to group. Bands, on the other hand, are usually much easier to group. So we will look for similarities between bands rather than between users.

To determine whether two bands are similar, you can use many different similarity metrics. Finding the best metric is a whole research topic on its own. In many cases, though, the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) is used. The implementation we will use here is [`sklearn.metrics.pairwise.cosine_similarity`](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html).
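
To see what the metric computes, the sketch below calculates the cosine similarity between the columns of a small made-up listening matrix using only NumPy. The matrix `listens` and the helper `cosine_similarity_columns` are hypothetical and not part of the notebook. The sketch also assumes every band has at least one listener, since a column of zeros would cause a division by zero.

```python
import numpy as np

# Toy listening matrix (hypothetical): rows are users, columns are bands
listens = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [1, 1, 0],
])


def cosine_similarity_columns(matrix):
    """Cosine similarity between every pair of columns of `matrix`."""
    norms = np.linalg.norm(matrix, axis=0)  # Euclidean length of each column
    dot_products = matrix.T @ matrix        # dot product of every column pair
    return dot_products / np.outer(norms, norms)


band_sim = cosine_similarity_columns(listens)
print(np.round(band_sim, 3))
```

The scikit-learn function computes the same quantity between the rows of its input, which is why a users-by-bands table has to be transposed before it is passed in.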

In [None]:
### BEGIN SOLUTION
similarity_matrix = sklearn.metrics.pairwise.cosine_similarity(data.T)
### END SOLUTION
# similarity_matrix = sklearn.metrics.pairwise.cosine_similarity( ? )

assert similarity_matrix.shape == (285, 285)

To display the similarity matrix with the band names as row and column labels, we wrap it in a pandas DataFrame:

In [None]:
band_similarities = pd.DataFrame(similarity_matrix, index=data.columns, columns=data.columns)
band_similarities.head()

As you can see above, bands are 100% similar to themselves and The White Stripes are nothing like Abba.

## 4. Picking the best matches

Even though many of the bands above have a similarity close to 0, there might be some bands that seem to be slightly similar because for some reason somebody with a very complex taste listened to them both. To remove this noise from the dataset we are going to select only the 10 best matches.

Let's first try this with the first band in the list.

In [None]:
n_best = 10
### BEGIN SOLUTION
top_n = band_similarities.iloc[:,0].sort_values(ascending=False)[:n_best]
### END SOLUTION
# top_n = band_similarities.iloc[:,0].sort_values(ascending= ? )[:?]
print(top_n)

assert len(top_n) == 10

If we only want the names, we can get them through the `.index`.

In [None]:
n_best = 10
### BEGIN SOLUTION
top_n = band_similarities.iloc[:,0].sort_values(ascending=False)[:n_best].index
### END SOLUTION
# top_n = band_similarities.iloc[:,0].sort_values(ascending= ? ) ?
print(top_n)

assert len(top_n) == 10 and isinstance(top_n, pd.Index)

We can now turn this into a function, which makes it easy to apply the same step to all bands:

In [None]:
def get_similar_bands(series, n=10):
    ### BEGIN SOLUTION
    return series.sort_values(ascending=False)[:n].index
    ### END SOLUTION
    # return series.sort_values( ? ) ?

example_band = band_similarities.loc['a perfect circle', :]
print(get_similar_bands(example_band))
    
assert np.array_equal(top_n, get_similar_bands(example_band))
assert isinstance(get_similar_bands(example_band), pd.Index)
assert len(get_similar_bands(example_band, n=15)) == 15

Now let's do this for all bands, using the [`pandas.DataFrame.apply`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html) method on the `band_similarities` DataFrame. By default (`axis=0`) this method applies the function to each column, so the result has one column per band. We want the bands in the index instead, so the final DataFrame should have 285 rows and 10 columns.
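
The shape bookkeeping is easier to follow on a small example. The sketch below uses a hand-written 4x4 similarity matrix and a hypothetical helper `top_two`; neither is part of the notebook, and the values are chosen without ties so the ordering is unambiguous.

```python
import pandas as pd

bands = ['a', 'b', 'c', 'd']

# Hand-written symmetric similarity matrix (hypothetical values)
sim = pd.DataFrame(
    [[1.0, 0.8, 0.1, 0.3],
     [0.8, 1.0, 0.2, 0.4],
     [0.1, 0.2, 1.0, 0.6],
     [0.3, 0.4, 0.6, 1.0]],
    index=bands, columns=bands,
)


def top_two(column):
    """Names of the two most similar bands for one column."""
    return column.sort_values(ascending=False).index[:2]


per_column = sim.apply(top_two)  # one column per band: shape (2, 4)
result = per_column.T            # one row per band: shape (4, 2)

print(per_column.shape, result.shape)
print(result)
```

Each band's best match is the band itself, because the diagonal of a similarity matrix is 1.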

In [None]:
### BEGIN SOLUTION
top_n_similar_bands = band_similarities.apply(get_similar_bands).T
### END SOLUTION
#top_n_similar_bands = band_similarities.apply( ? ) ?

print(top_n_similar_bands.shape)

assert top_n_similar_bands.shape == (285, 10)

top_n_similar_bands.head()

## 5. Find which bands to advise

Now that we know which bands are similar, we have to figure out which bands to advise to whom. To do this we need to determine how the listening history of a user matches that of bands they didn't listen to yet. For this we will use the following similarity score.

In [None]:
# Function to compute the similarity scores
def similarity_score(listening_history, similarities):
    return sum(listening_history * similarities) / sum(similarities)

For each band we sum the similarities of bands the user also listened to. In the end we divide by the total sum of similarities to normalise the score.

Suppose a user listened to 1 of 3 similar bands, so their listening history is `[0, 1, 0]`, and the respective similarity scores of those bands are `[0.3, 0.2, 0.1]`. You then get the following score:

In [None]:
listening_history = np.array([0, 1, 0]) 
similarities = np.array([0.3, 0.2, 0.1])
print(f'{similarity_score(listening_history, similarities):.3f}')
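
Worked out by hand, writing $h$ for the listening history and $s$ for the similarities (shorthand introduced here, not notation used by the notebook), the same calculation reads:

$$
\text{score} = \frac{\sum_i h_i s_i}{\sum_i s_i} = \frac{0 \cdot 0.3 + 1 \cdot 0.2 + 0 \cdot 0.1}{0.3 + 0.2 + 0.1} = \frac{0.2}{0.6} \approx 0.333
$$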

Now let's compute the score for each band for user 1 (with index 0).

In [None]:
user = 1

# a list of all the scores
scores = []

for band_index in range(len(band_similarities.columns)):
    band = band_similarities.columns[band_index]
    
    # For bands the user already listened to we set the score to 0
    if data.loc[user, band] == 1:
        scores.append(0)
    else:
        # Most similar bands to this one
### BEGIN SOLUTION
        most_similar_band_names = band_similarities.loc[band].sort_values(ascending=False)[1:n_best].index
### END SOLUTION
        # most_similar_band_names = band_similarities.loc[band].sort_values(ascending= ? ) ?
        # Get the similarity score of these bands
### BEGIN SOLUTION
        most_similar_band_scores = band_similarities.loc[band].sort_values(ascending=False)[1:n_best]
### END SOLUTION
        # most_similar_band_scores = band_similarities.loc[band].sort_values(ascending= ? ) ?
        # Get the listening history for these bands
        user_listening_history = data.loc[user, most_similar_band_names]

        scores.append(similarity_score(user_listening_history, most_similar_band_scores))

Now let's print the top 5 bands to advise to this user:

In [None]:
print(f'For user with id {user} we advise:')
pd.DataFrame(scores, index=band_similarities.columns).sort_values(0, ascending=False).head()

Now try this for other users as well.
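
One way to do that is to wrap the scoring loop in a function that takes the user as a parameter. The following is a minimal sketch on toy data, not the notebook's solution: the function `recommend`, its parameters `n_neighbours` and `top`, and both toy tables are hypothetical. It keeps the convention of scoring already-known bands as 0 and excludes each band from its own neighbour list.

```python
import pandas as pd

bands = ['a', 'b', 'c', 'd']

# Toy listening table (hypothetical): rows are user ids, columns are bands
listening = pd.DataFrame(
    [[1, 0, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1]],
    index=pd.Index([1, 2, 3], name='user'), columns=bands,
)

# Hand-written band similarity matrix (hypothetical values)
similarities = pd.DataFrame(
    [[1.0, 0.8, 0.1, 0.3],
     [0.8, 1.0, 0.2, 0.4],
     [0.1, 0.2, 1.0, 0.6],
     [0.3, 0.4, 0.6, 1.0]],
    index=bands, columns=bands,
)


def recommend(listening, similarities, user, n_neighbours=2, top=3):
    """Return the `top` highest-scoring bands for one user."""
    scores = {}
    for band in similarities.columns:
        if listening.loc[user, band] == 1:
            scores[band] = 0.0  # the user already knows this band
            continue
        # Most similar bands, excluding the band itself
        neighbours = similarities[band].drop(band).nlargest(n_neighbours)
        history = listening.loc[user, neighbours.index]
        scores[band] = (history * neighbours).sum() / neighbours.sum()
    ranked = pd.Series(scores, dtype=float).sort_values(ascending=False)
    return ranked.head(top)


for user_id in listening.index:
    print(f'For user with id {user_id} we advise:')
    print(recommend(listening, similarities, user_id))
```

On this toy data, user 1 has only listened to band `a`, so band `b` (the band most similar to `a`) comes out on top.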