# Music Recommendations

In this project you will build an unsupervised system that recommends music based on a given song.

All of the data given to you is from Spotify. For definitions of some of the columns see https://developer.spotify.com/documentation/web-api/reference/#/operations/get-several-audio-features. The dataset has been partially processed to remove very unusual artists.

You will need to go through the entire machine learning process, adapted for unsupervised learning (including big picture, exploration, ...). You will *not* split off a training and testing set, and you will not use cross-validation (see the example in a class notebook on how to convince `GridSearchCV` not to use CV). The ultimate goal is to find which 'group' an audio track (which can be music, an audio book, or another recording) belongs to. We also want to make sure that none of the clusters is too small, so that when we ask for a related song there is significant variability in the songs we get back.
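
One common way to make `GridSearchCV` skip cross-validation is to pass a single "split" whose training and testing indices both cover the full dataset. The sketch below is an assumption about how this could be done, not necessarily the approach from the class notebook. The toy data and the silhouette-based scorer are stand-ins for the real tracks and `MusicScorer`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.model_selection import GridSearchCV

# Toy data: three well-separated blobs (stand-in for the real track features).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0, 5, 10)])

# A single "split" where train and test are both every row, so no CV happens.
all_rows = np.arange(len(X))
no_cv = [(all_rows, all_rows)]


def silhouette(estimator, X, y=None):
    """Stand-in scorer; the project would use MusicScorer(data, pairs)."""
    return silhouette_score(X, estimator.labels_)


search = GridSearchCV(
    KMeans(n_init=10, random_state=0),
    {'n_clusters': [2, 3, 4]},
    scoring=silhouette,
    cv=no_cv,
)
search.fit(X)
print(search.best_params_)
```

Because the estimator is fitted on every row, `labels_` lines up with the full dataset, which is what a scorer that reads `labels_` expects.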

To help with making good clusters, you should form clusters for the artists and then integrate that information into each track's data. As with all preprocessing, you should try with and without this step (along with different clusterings of artists).
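
As a rough sketch of that idea, the example below clusters a toy artists table and attaches the most common artist cluster to each track. The column names (`artists`, `energy`, `tempo`), the choice of `KMeans`, and the new `artist_cluster` column are all assumptions for illustration; the real `artists.csv` may look different.

```python
import statistics

import pandas as pd
from sklearn.cluster import KMeans

# Toy stand-ins for the real data; all column names here are hypothetical.
artists = pd.DataFrame({
    'artists': ['A', 'B', 'C', 'D'],
    'energy': [0.10, 0.15, 0.90, 0.95],
    'tempo': [80.0, 82.0, 150.0, 148.0],
})
tracks = pd.DataFrame({
    'id': ['t1', 't2', 't3'],
    'artists': [['A'], ['C', 'D'], ['A', 'B', 'C']],
})

# Cluster the artists on their numeric features.
artist_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    artists[['energy', 'tempo']])

# Map each artist name to its cluster number.
trans = dict(zip(artists['artists'], artist_labels))

# For each track, keep the most common cluster among its artists.
tracks['artist_cluster'] = tracks.artists.apply(
    lambda names: statistics.mode([trans[name] for name in names]))
print(tracks)
```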


## Notes

### Artists Feature

The feature `artists` is actually a series of Python lists. Working with them is a bit awkward, but here are a few examples:
  * Get the length of the list for each row:  
    `data.artists.str.len()`
  * Get the first element of the list for each row:  
    `data.artists.apply(lambda artists:artists[0])`
  * Get the second element of the list for each row, or None if there is only one:   
    `data.artists.apply(lambda artists:artists[1] if len(artists) > 1 else None)`
  * To transform each artist in the list based on a dictionary named `trans` that has artists as its keys (a list comprehension is used so that each row holds a list rather than a generator):  
    `data.artists.apply(lambda artists:[trans[artist] for artist in artists])`
  * To get the most common value from each list (once converted into something like numbers; requires `import statistics`):  
    `data.artists.apply(statistics.mode)`
  * To make the list into multiple columns (filled with `None` for rows with fewer than max artists):  
    `pd.DataFrame({f'artists_{i}':data.artists.apply(lambda artists:artists[i] if len(artists) > i else None) for i in range(data.artists.str.len().max())})`

Also, other methods like `explode()` may be useful.
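
For instance, `explode()` turns each list element into its own row, which makes per-artist grouping straightforward. The following is a small sketch on toy data; the track ids and artist names are made up.

```python
import pandas as pd

# Toy data with the same shape as the real `artists` column (lists per row).
data = pd.DataFrame({
    'id': ['t1', 't2', 't3'],
    'artists': [['A'], ['B', 'C'], ['A', 'C']],
})

# One row per (track, artist) pair; the original row index is repeated.
exploded = data.explode('artists')

# Per-artist aggregate, e.g. the number of tracks each artist appears on.
tracks_per_artist = exploded.groupby('artists').id.nunique()

# Collapse back to one list per track by grouping on the original index.
regrouped = exploded.groupby(level=0).artists.agg(list)
print(exploded)
```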


### Scoring

To evaluate our model, we need a custom scorer that works with `GridSearchCV` and `RandomizedSearchCV`. The scorer prefers clusters that contain pairs of tracks that people put together in their personal playlists. This playlist data is in the `pairs` data. **This dataset must only be used for scoring.** Additionally, the scorer greatly penalizes clusters that have fewer than 200 songs.

It can be used like this:

```python
GridSearchCV(..., scoring=MusicScorer(data, pairs), ...)
```

where `data` and `pairs` are the full dataset and the pairs dataset from `load_data()`.
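
To get a feel for the small-cluster penalty, the sketch below reproduces only the divisor logic from `music_score()` in this notebook. The helper name `effective_denominator` and its `min_size` parameter are hypothetical; the original hard-codes 200.

```python
def effective_denominator(n_tracks, min_size=200):
    """Divisor applied to a cluster's pair count, mirroring `music_score`."""
    if n_tracks < min_size:
        # Small clusters get a divisor that grows as the cluster shrinks.
        return min_size**4 / n_tracks**3
    return n_tracks


for n in (50, 100, 199, 200, 400):
    print(n, effective_denominator(n))
```

A cluster of 100 tracks is divided by 1600 rather than 100, so its contribution shrinks by a factor of 16, while a cluster of exactly 200 tracks is simply divided by 200.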


### Manual Testing

The function `recommendations()` can be used to perform manual testing. It can be called like this:

```python
recommendations(data, clusters, ["50woGYhAqV3KXvO1LG4zLg", "6pmuu4qSz2WrtGkBjUfyuz", "3dmqIB2Qxe2XZobw9gXxJ6"])
```

where `data` is a `DataFrame` of all of the tracks (minimally the track ids) and `clusters` is a sequence of cluster numbers that line up with `data`.
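
Since `recommendations()` samples `recommendations_per` rows from a track's cluster, it can help to look at the cluster sizes before testing by hand. This is a rough sketch with toy labels; the `min_size` name is hypothetical, and 200 (the scorer's cutoff) would be the natural value with the real data.

```python
import numpy as np
import pandas as pd

# Toy cluster labels; with the real data these come from the fitted model.
clusters = np.array([0, 0, 0, 1, 1, 2])
min_size = 3  # hypothetical threshold; use 200 with the real data

# Number of tracks in each cluster, ordered by cluster number.
sizes = pd.Series(clusters).value_counts().sort_index()

# Clusters that fall below the threshold.
too_small = sizes[sizes < min_size]
print(too_small)
```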

In [None]:
import ast

import numpy as np
import scipy.sparse
import pandas as pd

In [None]:
def load_data():
    """
    Returns the track data, the artist data, and which pairs of tracks show up
    together in playlists. The pairs dataset is extremely large, so it is stored
    as a sparse matrix. It cannot be used directly.
    """
    data = pd.read_csv('data.csv', converters={'artists':ast.literal_eval})
    artists = pd.read_csv('artists.csv')
    tracks = np.unique(data.id)
    pairs = pd.DataFrame.sparse.from_spmatrix(scipy.sparse.load_npz('track_pairs.npz'), index=tracks, columns=tracks)
    return data, artists, pairs


def music_score(data, clusters, pairs):
    """
    Scores a set of clusters based on the track data, the clusters they are assigned to,
    and the pairs data.
    """
    summation = 0
    # Labels run from 0 to clusters.max(), so there are max + 1 clusters
    n_clusters = clusters.max() + 1
    for i in range(n_clusters):
        tracks = data.id[clusters == i]
        sub = pairs.loc[tracks, tracks]
        denominator = len(tracks)
        if denominator < 200:
            denominator = (200*200*200*200)/(denominator*denominator*denominator)
        summation += sub.values.sum() / denominator
    return summation / n_clusters


class MusicScorer:
    """
    This is the actual scorer 'function' to use with `GridSearchCV`.
    It is used like:

    GridSearchCV(..., scoring=MusicScorer(data, pairs), ...)

    where `data` and `pairs` are the full dataset and the pairs
    dataset from `load_data()`.
    """

    def __init__(self, data, pairs):
        self.data = data
        self.pairs = pairs

    def __call__(self, estimator, X, y=None):
        # Get the cluster labels
        if hasattr(estimator, 'labels_'):
            labels = estimator.labels_
        elif hasattr(estimator, 'predict'):
            labels = estimator.predict(X)
        else:
            labels = estimator.fit_predict(X)

        # Compute the score
        return music_score(self.data, labels, self.pairs)


def recommendations(all_tracks, clusters, tracks, recommendations_per=5):
    """
    Given the complete data set (`all_tracks` is a DataFrame of track data or a
    series/array of track ids) along with the `clusters` they belong to (a
    series/array of cluster numbers), look up the given tracks (by their ids
    only) and return `recommendations_per` recommendations for each of them.

    If this is given a single track (as a string), this will return a single set
    of rows from `all_tracks`. If given a list of tracks (as a list of strings),
    this will return a list of sets of rows from `all_tracks`. The number of rows
    in each set is given by `recommendations_per`.
    """
    # force types
    full_data = len(all_tracks.shape) == 2
    all_tracks = pd.DataFrame(all_tracks) if full_data else pd.Series(all_tracks)
    all_tracks = all_tracks.reset_index(drop=True)
    clusters = pd.Series(clusters).reset_index(drop=True)
    single = isinstance(tracks, str)
    if single:
        tracks = [tracks]

    # get each track's cluster
    matches = all_tracks.id.isin(tracks) if full_data else all_tracks.isin(tracks)
    cluster_nums = clusters[matches]

    # sample from each cluster
    if single:
        return all_tracks.loc[clusters[clusters == cluster_nums.iloc[0]].sample(recommendations_per).index]
    return [all_tracks.loc[clusters[clusters == cluster].sample(recommendations_per).index] for cluster in cluster_nums]
