# Recommendation Engine
In this tutorial we are going to build a simple recommender system using collaborative filtering. You'll be learning about the popular data analysis package [pandas](https://pandas.pydata.org/) along the way.


## 1. The import statements

In [None]:
import numpy as np
import pandas as pd
import sklearn.metrics.pairwise

## 2. The data

We will use the data for Germany from the [Last.fm Dataset](https://labrosa.ee.columbia.edu/millionsong/lastfm). To read and explore the data we will use the [pandas library](https://pandas.pydata.org/):
+ [`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html): reads a csv file and returns a [`pandas.DataFrame`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html), a two-dimensional data structure with labelled rows and columns.
+ [`pandas.DataFrame.set_index`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html): sets the DataFrame index (the row labels).

Pandas enables method chaining: the `read_csv` call returns a DataFrame, on which we can immediately apply the `set_index` method by chaining it with dot notation.

In [None]:
data = pd.read_csv('data/lastfm-matrix-germany.csv').set_index('user')
data.head()

In [None]:
data.shape

The resulting DataFrame contains a row for each user, and each column represents an artist. The values indicate whether the user listened to a song by that artist (1) or not (0). Note that the number of times a person listened to a specific artist is not recorded.
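
To make the layout concrete without the real file, here is a minimal sketch that reads a tiny made-up CSV from a string. The data and the names `csv_text` and `toy` are ours, purely for illustration; only the `user` column name is taken from the tutorial.

```python
import io

import pandas as pd

# Made-up stand-in for the real CSV file: one row per user, one column per artist.
csv_text = """user,abba,ac/dc,madonna
1,0,1,0
2,1,0,1
3,0,1,1
"""

# read_csv returns a DataFrame, so set_index can be chained directly onto it.
toy = pd.read_csv(io.StringIO(csv_text)).set_index('user')

print(toy.shape)              # (3, 3): 3 users, 3 artists
print(toy.loc[2, 'madonna'])  # 1: user 2 listened to madonna
```

The user ids end up as row labels rather than as a data column, which is why `shape` only counts the artist columns.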

## 3. Determining artist similarity

We want to figure out which artist to recommend to which user. Since we know which user listened to which artists we can look for artists or users that are similar. Humans can have vastly complex listening preferences and are very hard to group. Artists on the other hand are usually much easier to group. So it is best to look for similarities between artists rather than between users.

To determine if two artists are similar, you can use many different similarity metrics. Finding the best metric is a whole research topic on its own. In many cases though, the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) is used. The implementation we will use here is the [`sklearn.metrics.pairwise.cosine_similarity`](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html).

This function will create a matrix of similarity scores between elements in the first dimension of the input. In our dataset the first dimension holds the different users and the second the different artists. You can switch these dimensions with [np.transpose()](https://docs.scipy.org/doc/numpy/reference/generated/numpy.transpose.html).
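
To see what is being computed, here is a numpy-only sketch on a made-up listening matrix. It is not how scikit-learn implements the function internally, and the names `listens`, `unit` and `sim` are ours. It only illustrates the definition: the dot product of two vectors divided by the product of their lengths, applied to every pair of rows.

```python
import numpy as np

# Made-up listening matrix: 3 users (rows) x 3 artists (columns).
listens = np.array([
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 0],
], dtype=float)

# Similarities are computed between rows, so transpose to get one row per artist.
artists = listens.T

# Scale every artist vector to length 1; the dot products are then the cosines.
norms = np.linalg.norm(artists, axis=1, keepdims=True)
unit = artists / norms
sim = unit @ unit.T

print(sim.shape)         # (3, 3): one row and one column per artist
print(np.round(sim, 3))
```

Without the transpose, the result would have one row and column per user instead of per artist.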

In [None]:
### BEGIN SOLUTION
similarity_matrix = sklearn.metrics.pairwise.cosine_similarity(np.transpose(data))
### END SOLUTION
# similarity_matrix = sklearn.metrics.pairwise.cosine_similarity( ? )

assert similarity_matrix.shape == (285, 285)
print(similarity_matrix.ndim)

The `cosine_similarity` function returned a 2-dimensional [`numpy array`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.html). This array contains all the similarity values we need, but it is not labelled. Since the entire array will not fit the screen, we will use [`slicing`](https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html) to print a subset of the result.

In [None]:
similarity_matrix[:5, :5]

The artist names are both the row and column labels for the similarity_matrix. We can add these labels by creating a new DataFrame based on the numpy array. By using the [`pandas.DataFrame.iloc`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html) integer-location based indexer, we get the same slice as above, but with added labels.

In [None]:
### BEGIN SOLUTION
artist_similarities = pd.DataFrame(similarity_matrix, index=data.columns, columns=data.columns)
### END SOLUTION
# artist_similarities = pd.DataFrame( ? , index=data.columns, columns= ? )

assert np.array_equal(artist_similarities.columns, data.columns)
assert artist_similarities.shape == similarity_matrix.shape

artist_similarities.iloc[:5, :5]
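
The same idea on a small scale, as a sketch with made-up similarity values (the names `values`, `names` and `labelled` are ours):

```python
import numpy as np
import pandas as pd

# Made-up similarity values for three artists.
values = np.array([
    [1.0, 0.2, 0.0],
    [0.2, 1.0, 0.5],
    [0.0, 0.5, 1.0],
])
names = pd.Index(['abba', 'ac/dc', 'madonna'])

# Use the artist names as both row and column labels.
labelled = pd.DataFrame(values, index=names, columns=names)

# iloc selects by position, just like slicing the underlying array,
# but the result keeps its row and column labels.
top_left = labelled.iloc[:2, :2]
print(top_left)
```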

Pandas also provides a label based indexer, [`pandas.DataFrame.loc`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc), which we can use to get a slice based on label values.

In [None]:
slice_artists = ['ac/dc', 'madonna', 'metallica', 'rihanna', 'the white stripes']

artist_similarities.loc[slice_artists, slice_artists]

As you can see above, bands are 100% similar to themselves and The White Stripes are nothing like Abba. 

We can further increase the usability of this data by making it a [tidy dataset](https://en.wikipedia.org/wiki/Tidy_data). This means we'll put each variable in a column, and each observation in a row. There are three variables in our dataset:
+ first artist
+ second artist
+ cosine similarity

In our current DataFrame the second artist is determined by the column labels, and as a consequence the cosine similarity observations are spread over multiple columns. The [`pandas.DataFrame.melt`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html) method will fix this. We make extensive use of method chaining for this reshaping of the DataFrame. If you want to know the effect of the different methods, you can comment or uncomment them and check how the result changes.

In [None]:
similarities = (
    # start from untidy DataFrame
    artist_similarities
    # add a name to the index
    .rename_axis(index='artist')
    # artist needs to be a column for melt
    .reset_index()
    # create the tidy dataset
    .melt(id_vars='artist', var_name='compared_with', value_name='cosine_similarity')
    # artist compared with itself not needed, keep rows where artist and compared_with are not equal.
    .query('artist != compared_with')
    # set identifying observations to index
    .set_index(['artist', 'compared_with'])
    # sort the index
    .sort_index()
)
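
If the chain is hard to follow on the full dataset, a two-artist sketch with a made-up similarity value shows what `melt` and `query` do (the names `wide`, `tidy` and `pairs` are ours):

```python
import pandas as pd

names = ['abba', 'madonna']
# Made-up wide similarity table: 2 artists x 2 artists.
wide = pd.DataFrame([[1.0, 0.3], [0.3, 1.0]], index=names, columns=names)

tidy = (
    wide
    .rename_axis(index='artist')
    .reset_index()
    # every former column becomes a value in the 'compared_with' column
    .melt(id_vars='artist', var_name='compared_with', value_name='cosine_similarity')
)
print(tidy)

# Drop the self-comparisons, as in the chain above.
pairs = tidy.query('artist != compared_with')
print(pairs)
```

The 2 x 2 table turns into 4 rows of 3 columns, of which 2 rows remain after removing the self-comparisons.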

To view the first n rows, we can use the [`pandas.DataFrame.head`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) method; the default value for n is 5.

In [None]:
similarities.head()

Note that we created a [`MultiIndex`](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html#advanced-hierarchical) by specifying two columns in the set_index call.

In [None]:
similarities.index

The use of the MultiIndex enables flexible access to the data. If we index with a single artist name, we get all compared artists. To view the last n rows for this result, we can use the [`pandas.DataFrame.tail`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html#pandas.DataFrame.tail) method.

In [None]:
similarities.loc['the beatles', :].tail()

We can index on multiple levels by providing a tuple of indexes:

In [None]:
similarities.loc[('abba', 'madonna'), :]

In [None]:
print(slice_artists)
similarities.loc[('abba', slice_artists), :]
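
The three access patterns side by side, as a sketch on a tiny made-up MultiIndex DataFrame (the values and the names `toy`, `per_artist`, `one_pair` and `several` are ours):

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('abba', 'ac/dc'), ('abba', 'madonna'), ('madonna', 'abba'), ('madonna', 'ac/dc')],
    names=['artist', 'compared_with'],
)
# Made-up similarity values.
toy = pd.DataFrame({'cosine_similarity': [0.1, 0.4, 0.4, 0.2]}, index=index).sort_index()

# A single label selects on the first level; that level is dropped from the result.
per_artist = toy.loc['abba', :]

# A tuple selects on both levels at once; adding a column label yields a single value.
one_pair = toy.loc[('abba', 'madonna'), 'cosine_similarity']

# A list inside the tuple selects several labels on the second level.
several = toy.loc[('abba', ['ac/dc', 'madonna']), :]

print(per_artist)
print(one_pair)
print(several)
```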

## 4. Picking the best matches

Even though many of the artists above have a similarity close to 0, there might be some artists that seem to be slightly similar because somebody with a complex taste listened to them both. To remove this noise from the dataset we are going to limit the number of matches.

Let's first try this with the first artist in the list: `a perfect circle`.

In [None]:
artist = 'a perfect circle'
n_artists = 10
### BEGIN SOLUTION
top_n = similarities.loc[artist, :].sort_values('cosine_similarity').tail(n_artists)
### END SOLUTION
# top_n = similarities.loc[?, :].sort_values('cosine_similarity') ?
print(top_n)

assert len(top_n) == 10
assert type(top_n) == pd.DataFrame
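
Note that `sort_values` sorts in ascending order by default, so `tail` returns the best matches with the very best one last. A small sketch with made-up scores shows this, next to `nlargest` as an alternative that lists the best match first (the names `scores`, `top_2` and `top_2_desc` are ours):

```python
import pandas as pd

# Made-up similarities to some artist.
scores = pd.DataFrame(
    {'cosine_similarity': [0.10, 0.45, 0.05, 0.30]},
    index=['ac/dc', 'madonna', 'metallica', 'rihanna'],
)

# Ascending sort: the best matches end up at the bottom.
top_2 = scores.sort_values('cosine_similarity').tail(2)

# nlargest selects the same rows, ordered from best to worst.
top_2_desc = scores.nlargest(2, 'cosine_similarity')

print(top_2.index.tolist())       # ['rihanna', 'madonna']
print(top_2_desc.index.tolist())  # ['madonna', 'rihanna']
```

For the recommendation score later on, the order of the rows does not matter, only which rows are selected.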

We can turn the task of getting the most similar artists for a given artist into a function.

In [None]:
def most_similar_artists(artist, n_artists=10):
    """Get the most similar artists for a given artist.
    
    Parameters
    ----------
    artist: str
        The artist for which to get similar artists
    n_artists: int, optional
        The number of similar artists to return
    
    Returns
    -------
    pandas.DataFrame
        A DataFrame with the similar artists and their cosine_similarity to
        the given artist
    """
    ### BEGIN SOLUTION
    return similarities.loc[artist, :].sort_values('cosine_similarity').tail(n_artists)
    ### END SOLUTION
    # return similarities.loc[ ? ].sort_values( ? ) ?

print(most_similar_artists('a perfect circle'))

assert top_n.equals(most_similar_artists('a perfect circle'))
assert most_similar_artists('abba', n_artists=15).shape == (15, 1)

Note that we also defined a docstring for this function, which we can view by using `help()` or `shift + tab` in a Jupyter notebook.

In [None]:
help(most_similar_artists)

## 5. Get the listening history

To determine the recommendation score for an artist, we'll want to know whether a user listened to many similar artists. We know which artists are similar to a given artist, but we still need to figure out if any of these similar artists are in the listening history of the user. The listening history of a single user can be acquired by entering the user id with the `.loc` indexer.

In [None]:
user_id = 42
### BEGIN SOLUTION
user_history = data.loc[user_id, :]
### END SOLUTION
# user_history = data.loc[ ? , ?]

print(user_history)

assert user_history.name == user_id
assert len(user_history) == 285

We now have the complete listening history, but we only need the history for the similar artists. For this we can use the index labels from the DataFrame returned by the `most_similar_artists` function. Index labels for a DataFrame can be retrieved by using the [`pandas.DataFrame.index`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.index.html) attribute.

In [None]:
artist = 'the beatles'
### BEGIN SOLUTION
similar_labels = most_similar_artists(artist).index
### END SOLUTION
# similar_labels = most_similar_artists( ? ). ?

print(similar_labels)

assert len(similar_labels) == 10
assert type(similar_labels) == pd.Index

We can combine the user id and similar labels in the `.loc` indexer to get the listening history for the most similar artists.

In [None]:
user_id = 42
### BEGIN SOLUTION
similar_history = data.loc[user_id, similar_labels]
### END SOLUTION
# similar_history = data.loc[?, ?]

assert similar_history.name == user_id

print(similar_history)

Let's make a function to get the most similar artists and their listening history for a given artist and user. The function creates two DataFrames with the same index, and then uses [`pandas.concat`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) to create a single DataFrame from them.

In [None]:
def most_similar_artists_history(artist, user_id):
    """Get most similar artists and their listening history.
    
    Parameters
    ----------
    artist: str
        The artist for which to get the most similar bands
    user_id: int
        The user for which to get the listening history
        
    Returns
    -------
    pandas.DataFrame
        A DataFrame containing the most similar artists for the given artist,
        with their cosine similarities and their listening history status for
        the given user.
    """
    ### BEGIN SOLUTION
    artists = most_similar_artists(artist)
    history = data.loc[user_id, artists.index].rename('listening_history')
    ### END SOLUTION
    # artists = most_similar_artists( ? )
    # history = data.loc[ ? , ? ].rename('listening_history')
    return pd.concat([artists, history], axis=1)

example = most_similar_artists_history('abba', 42)

assert example.columns.to_list() == ['cosine_similarity', 'listening_history']

example
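
What `pd.concat` does here can be sketched on made-up data: with `axis=1` the objects are placed side by side and the rows are matched on their index labels, not on their position. The names `artists`, `history` and `combined` below are local to this sketch.

```python
import pandas as pd

# Made-up similar artists with their similarity scores.
artists = pd.DataFrame(
    {'cosine_similarity': [0.3, 0.2, 0.1]},
    index=['madonna', 'rihanna', 'ac/dc'],
)
# Made-up history for one user; the order differs from `artists` on purpose.
history = pd.Series({'ac/dc': 0, 'madonna': 1, 'rihanna': 0}, name='listening_history')

# The name of the Series becomes the name of the new column.
combined = pd.concat([artists, history], axis=1)
print(combined)
```

Because the rows are aligned on the artist names, `madonna` gets its own history value even though it is listed in a different position in the Series.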

## 6. Calculate the recommendation score

Now that we have the `most_similar_artists_history` function, we can start to figure out which artists to recommend to whom. We want to quantify how well the listening history of a user matches the artists similar to an artist they haven't listened to yet. For this purpose we will use the following recommendation score:
+ We start with the similar artists for a given artist, and their listening history for the user.
+ Then we sum the cosine similarities of artists the user listened to. 
+ In the end we divide by the total sum of similarities to normalize the score.
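
The same steps can be written as a formula. The notation is introduced here only for illustration: $N(a)$ is the set of artists most similar to artist $a$, $s(a, b)$ is the cosine similarity between artists $a$ and $b$, and $h_u(b)$ is 1 if user $u$ listened to artist $b$ and 0 otherwise.

$$
\text{score}(u, a) = \frac{\sum_{b \in N(a)} s(a, b) \, h_u(b)}{\sum_{b \in N(a)} s(a, b)}
$$

The score is 0 when the user listened to none of the similar artists and 1 when they listened to all of them.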

So when a user listened to 1 of 3 similar artists, for example `[0, 1, 0]`, and their respective similarity scores are `[0.3, 0.2, 0.1]`, you get the following recommendation score:

In [None]:
listening_history = np.array([0, 1, 0]) 
similarity_scores = np.array([0.3, 0.2, 0.1])
recommendation_score = sum(listening_history * similarity_scores) / sum(similarity_scores)
print(f'{recommendation_score:.3f}')

Remember what the DataFrame returned by the `most_similar_artists_history` function looks like:

In [None]:
user_id = 42
artist = 'abba'
most_similar_artists_history(artist, user_id)

Pandas provides methods for column or row aggregation, such as [`pandas.DataFrame.product`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.product.html). This method multiplies all values in a column or row. The direction can be chosen with the `axis` parameter. As we need the product of the values in each row (similarity \* history), we specify `axis=1`.

In [None]:
most_similar_artists_history(artist, user_id).product(axis=1)

Then there's [`pandas.DataFrame.sum`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sum.html) which does the same thing for summing the values. As we want the sum for all values in the column we would have to specify `axis=0`. Since 0 is the default value for the `axis` parameter we don't have to add it to the method call.

In [None]:
most_similar_artists_history(artist, user_id).product(axis=1).sum()
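
The effect of the `axis` parameter is easiest to see on a small made-up DataFrame with the same two columns (the values and the names `per_artist`, `listened_total` and `column_totals` are ours):

```python
import pandas as pd

df = pd.DataFrame(
    {'cosine_similarity': [0.3, 0.2, 0.1], 'listening_history': [0, 1, 0]},
    index=['madonna', 'rihanna', 'ac/dc'],
)

# axis=1: multiply across the columns, giving one value per row (artist).
per_artist = df.product(axis=1)

# Summing that Series adds up the similarities of the artists the user listened to.
listened_total = per_artist.sum()

# The default axis=0 aggregates down each column, giving one value per column.
column_totals = df.sum()

print(per_artist)
print(listened_total)
print(column_totals)
```

Only `rihanna` contributes to `listened_total`, because the other rows are multiplied by a listening history of 0.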

Knowing these methods, it is only a small step to define the scoring function based on the output of `most_similar_artists_history`.

In [None]:
def recommendation_score(artist, user_id):
    """Calculate recommendation score.
    
    Parameters
    ----------
    artist: str
        The artist for which to calculate the recommendation score.
    user_id: int
        The user for which to calculate the recommendation score.
        
    Returns
    -------
    float
        Recommendation score
    """
    df = most_similar_artists_history(artist, user_id)
    ### BEGIN SOLUTION
    return df.product(axis=1).sum() / df.loc[:, 'cosine_similarity'].sum()
    ### END SOLUTION
    # return df.?(axis=1).?() / df.loc[:, ? ].sum()

assert np.allclose(recommendation_score('abba', 42), 0.08976655361839528)
assert np.allclose(recommendation_score('the white stripes', 1), 0.09492796371597861)

recommendation_score('abba', 42)

## 7. Determine artists to recommend
We only want to recommend artists the user didn't listen to yet, which we'll determine by using the listening history.

In [None]:
def unknown_artists(user_id):
    """Get artists the user hasn't listened to.
    
    Parameters
    ----------
    user_id: int
        User for which to get unknown artists
        
    Returns
    -------
    pandas.Index
        Collection of artists the user hasn't listened to.
    """
    ### BEGIN SOLUTION
    history = data.loc[user_id, :]
    return history.loc[history == 0].index
    ### END SOLUTION
    # history = data.loc[ ? , :]
    # return history.loc[ ? == 0].index
    
print(unknown_artists(42))

assert len(unknown_artists(42)) == 278
assert type(unknown_artists(42)) == pd.Index
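
The selection inside `unknown_artists` relies on boolean indexing. A minimal sketch with a made-up listening history (the names `history`, `mask` and `unknown` are local to this sketch):

```python
import pandas as pd

# Made-up listening history of one user, indexed by artist.
history = pd.Series({'abba': 0, 'ac/dc': 1, 'madonna': 0}, name=42)

# Comparing a Series with a value gives a boolean Series with the same index.
mask = history == 0

# .loc keeps only the entries where the mask is True; .index gives their labels.
unknown = history.loc[mask].index

print(mask.tolist())     # [True, False, True]
print(unknown.tolist())  # ['abba', 'madonna']
```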

The last requirement for our recommender engine is a function that can score all unknown artists for a given user. We will make this function return a list of dictionaries, which can be easily converted to a DataFrame later on. The list will be generated using a [`list comprehension`](https://docs.python.org/3.6/tutorial/datastructures.html#list-comprehensions).

In [None]:
def score_unknown_artists(user_id):
    """Score all unknown artists for a given user.
    
    Parameters
    ----------
    user_id: int
        User for which to get unknown artists
        
    Returns
    -------
    list of dict
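        One dictionary per unknown artist, with the keys 'recommendation'
        (the artist name) and 'score' (the recommendation score).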
    """
    ### BEGIN SOLUTION
    artists = unknown_artists(user_id)
    return [{'recommendation': artist, 'score': recommendation_score(artist, user_id)} for artist in artists]
    ### END SOLUTION
    # artists = unknown_artists( ? )
    # return [{'recommendation': artist, 'score': recommendation_score( ? , user_id)} for artist in ?]

assert np.allclose(score_unknown_artists(42)[1]['score'], 0.08976655361839528)
assert np.allclose(score_unknown_artists(313)[137]['score'], 0.20616395469219984)

score_unknown_artists(42)[:5]
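
Why a list of dictionaries is convenient can be shown with a short sketch. The scores below are made up and stand in for the output of `recommendation_score`; the names `toy_scores`, `records` and `table` are ours.

```python
import pandas as pd

# Made-up scores per artist.
toy_scores = {'abba': 0.09, 'madonna': 0.31, 'rihanna': 0.17}

# List comprehension: build one dictionary per artist.
records = [
    {'recommendation': artist, 'score': score}
    for artist, score in toy_scores.items()
]

# The dictionary keys become the column names, each dictionary becomes a row.
table = pd.DataFrame(records)
print(table)
```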

From the scored artists we can easily derive the best recommendations for a given user.

In [None]:
def user_recommendations(user_id, n_rec=5):
    """Recommend new artists for a user.
    
    Parameters
    ----------
    user_id: int
        User for which to get recommended artists
    n_rec: int, optional
        Number of recommendations to make
        
    Returns
    -------
    pandas.DataFrame
        A DataFrame containing artist recommendations for the given user,
        with their recommendation score.
    """
    scores = score_unknown_artists(user_id)
    ### BEGIN SOLUTION
    return (
        pd.DataFrame(scores)
        .sort_values('score', ascending=False)
        .head(n_rec)
        .reset_index(drop=True)
    )
    ### END SOLUTION
    # return (
    #     pd.DataFrame( ? )
    #     .sort_values( ? , ascending=False)
    #     . ? (n_rec)
    #     .reset_index(drop=True)
    # )

assert user_recommendations(313).loc[4, 'recommendation'] == 'jose gonzalez'
assert len(user_recommendations(1, n_rec=10)) == 10

user_recommendations(642)
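
The chain inside `user_recommendations` can be followed step by step on a small made-up table (the values and the names `table` and `best` are ours):

```python
import pandas as pd

table = pd.DataFrame({
    'recommendation': ['abba', 'madonna', 'rihanna', 'ac/dc'],
    'score': [0.09, 0.31, 0.17, 0.02],
})

best = (
    table
    # highest score first
    .sort_values('score', ascending=False)
    # keep the top 2 rows
    .head(2)
    # replace the old row labels (1 and 2) by a fresh 0, 1 numbering
    .reset_index(drop=True)
)
print(best)
```

Thanks to `reset_index(drop=True)`, the row label of a recommendation equals its rank minus one, which is what makes a lookup such as `.loc[4, 'recommendation']` return the fifth-best recommendation.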

With this final function, it is a small step to get recommendations for multiple users. As our code hasn't been optimized for performance, we limit ourselves to the first 10 users here.

In [None]:
recommendations = [user_recommendations(user).loc[:, 'recommendation'].rename(user) for user in data.index[:10]]

We can now use the `concat` function again to get a nice overview of the recommended artists.

In [None]:
np.transpose(pd.concat(recommendations, axis=1))
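
How the overview takes shape can be sketched with two made-up users. The Series and the names `per_user` and `overview` are ours, and `.T` is used here as a shorthand for transposing a DataFrame.

```python
import pandas as pd

# Made-up recommendations for two users; each Series is named after its user id.
per_user = [
    pd.Series(['madonna', 'rihanna'], name=1),
    pd.Series(['abba', 'ac/dc'], name=2),
]

# axis=1 gives one column per user; transposing turns that into one row per user.
overview = pd.concat(per_user, axis=1).T
print(overview)
```

The rows are labelled with the user ids and the columns with the rank of the recommendation, starting at 0.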

In [None]:
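# Sanity check: recompute the recommendation score for 'gorillaz' and user 642 step by step.
# sim2 holds similarity * listening history for each similar artist.
g_s = most_similar_artists_history('gorillaz', 642).assign(sim2=lambda x: x.product(axis=1))
# numerator: summed similarity of the similar artists the user listened to
r_1 = g_s.sim2.sum()
# denominator: summed similarity of all similar artists
total = g_s.cosine_similarity.sum()
print(total)
r_1 / total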


In [None]:
g_s