# Recommendation test sets

Build a subset of the data to test the hottracks song recommender.

This creates the train and test sets used by the hottracks recommender.

In [None]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
import pandas as pd
import numpy as np
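import matplotlib as plt  # binds the top-level package, so plotting calls below go through plt.pyplot
import matplotlib.pyplot  # load the pyplot submodule explicitly so plt.pyplot resolves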
import importlib
from pyspark.ml.feature import Tokenizer, CountVectorizer, MinHashLSH
from pyspark.sql.types import IntegerType, StringType, ArrayType

import mpd

In [None]:
# Will allow us to embed images in the notebook
%matplotlib inline
# change default plot size
plt.rcParams['figure.figsize'] = (15,10)

## Load and prep data

* Load the full data set
* Load the picked k=100 approximate nearest-neighbor results
* Build song recommendations based on the songs in the nearest playlists

In [None]:
mpd_all=mpd.load(spark, "onebig", 1)

### Load challenge data set

In [None]:
mpd_test=spark.read.json("../mpd-challenge/challenge_set.json", multiLine=True)

In [None]:
mpd_test.printSchema()

In [None]:
cpl=mpd_test.select(f.explode("playlists").alias("playlist"))

In [None]:
cpl.printSchema()

In [None]:
cpl.show(5)

In [None]:
recdf=cpl.select("playlist.name", "playlist.num_holdouts", "playlist.pid", "playlist.num_tracks", "playlist.tracks", "playlist.num_samples")

In [None]:
recdf.printSchema()

In [None]:
recdf.select("pid", recdf.tracks.artist_uri, recdf.tracks.track_uri).show(5)

In [None]:
recdf.select("pid", "name", f.explode("tracks")).show()

## Split data into test and train

Extract the playlists into a data frame that can be split. We work directly with the playlist vector and want a data set similar in structure to the challenge set.

The result needs to look like the input of the challenge set. The working model for the mpd load has been a pre-processed JSON that does not have the full hierarchy of the raw input, so the playlist wrapper will have to be recreated, which should be possible after model selection.

This will enable using the validate_submission.py utility and other tools for working with the original data and challenge set.

The actual test sets will need to be exported as playlists with some of their data withheld.
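As a rough sketch of what withholding data could look like, the function below keeps the first few tracks of a playlist as seeds and sets the rest aside. The field names (`pid`, `name`, `num_tracks`, `num_samples`, `num_holdouts`, `tracks`) come from the challenge set schema selected above; the function name `withhold_tracks`, the "keep the first n" rule, and the example playlist are assumptions made for illustration only.

```python
def withhold_tracks(playlist, n_seed):
    """Return (challenge_playlist, held_out_tracks) for one playlist.

    Assumption: the first n_seed tracks are kept as seeds and the rest withheld.
    """
    tracks = playlist["tracks"]
    seeds = tracks[:n_seed]
    held_out = tracks[n_seed:]
    challenge = {
        "pid": playlist["pid"],
        "name": playlist["name"],
        "num_tracks": len(tracks),      # length of the full, original playlist
        "num_samples": len(seeds),      # tracks left visible to the recommender
        "num_holdouts": len(held_out),  # tracks the recommender must predict
        "tracks": seeds,
    }
    return challenge, held_out


# Made-up playlist, for illustration only.
example = {
    "pid": 1,
    "name": "road trip",
    "tracks": [{"track_uri": "track:%d" % i} for i in range(8)],
}

challenge, held_out = withhold_tracks(example, 5)
print(challenge["num_samples"], challenge["num_holdouts"], len(held_out))
```

Keeping the withheld tracks separately is what makes it possible to score recommendations later.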

We use the full 1,000,000 playlists less 10k as the training set, holding out 10k as a test set that matches the challenge set size. Because our "training" is really just a k-NN search, we want the number of playlists available to that search to stay as close as possible to the original count.

In [None]:
mpd_all.printSchema()

In [None]:
mpd_all.count()

In [None]:
train, test = mpd_all.randomSplit([1000000.0-10000.0, 10000.0], 1244)

In [None]:
train.count()

In [None]:
test.count()

Interesting: we don't get exactly 10k test examples. `randomSplit` normalizes the weights into fractions and assigns rows to the splits at random, so the split sizes are only approximate and vary with the seed value. The seed 1244 was picked after a simple search for a value that brings the test set close to the 10k playlists in the challenge set.
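The variation in split size can be illustrated without Spark. The sketch below is a plain-Python simulation, not Spark's implementation: it assumes each row is assigned to the test split independently with probability 10,000 / 1,000,000, which makes the test count binomially distributed. The row count is scaled down to keep it fast, and the seeds 0 to 4 are arbitrary.

```python
import math
import random


def bernoulli_split_count(n_rows, test_fraction, seed):
    """Count rows landing in the test split when each row is assigned independently."""
    rng = random.Random(seed)
    return sum(1 for _ in range(n_rows) if rng.random() < test_fraction)


n_rows = 100_000                      # scaled down from 1,000,000 for speed
test_fraction = 10_000 / 1_000_000    # normalized weight of the test split

# Mean and standard deviation of a Binomial(n_rows, test_fraction) count.
expected = n_rows * test_fraction
std_dev = math.sqrt(n_rows * test_fraction * (1 - test_fraction))

# A simple seed search: keep the seed whose test count is closest to the target.
counts = {seed: bernoulli_split_count(n_rows, test_fraction, seed) for seed in range(5)}
best_seed = min(counts, key=lambda s: abs(counts[s] - expected))

print("expected:", expected, "std dev:", round(std_dev, 1))
print("counts by seed:", counts, "closest seed:", best_seed)
```

Under the same assumption, the full 1,000,000 rows would give a standard deviation of roughly 100 playlists around 10k, so a handful of seeds is usually enough to land close to the target.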

In [None]:
train.printSchema()

## Compare distributions of original, test, and challenge set

See how similar the distribution of playlists in the sub-samples is to that of the original challenge set.

### Original data set

In [None]:
mpd.plothist(mpd_all, "num_tracks", 11)

### Training and Test

In [None]:
mpd.plothist(train, "num_tracks", 20)

In [None]:
mpd.plothist(test, "num_tracks", 20)

### Challenge dataset

In [None]:
mpd.plothist(recdf, "num_tracks", 20)

Interesting. The random split preserves the character of the overall data set, but the challenge set has a bimodal shape. It is not obvious how to reproduce that shape with simple random sampling.

In [None]:
mpd.plothist(recdf, "num_holdouts", 20)

Interestingly, the distribution of the number of holdouts aligns more closely with the shape of the playlist lengths in the original data set.

In [None]:
mpd.plothist(recdf, "num_samples", 20)

With a more granular histogram the challenge set looks exactly as expected, with an even split across the five challenge types of seed tracks [0, 5, 10, 25, 95].

### Check for correlations between playlist length and selection for the different seed categories

In [None]:
X=recdf.select("num_tracks").toPandas()

In [None]:
Y=recdf.select("num_holdouts").toPandas()

In [None]:
plt.pyplot.scatter(X,Y)
plt.pyplot.xlabel("Playlist Length")
plt.pyplot.ylabel("Holdouts")

Clearly there is a correlation: as the playlist length increases, there are more holdouts. It is interesting to see that there are about four groupings, with a playlist length of 100 being a dividing point between two sets.
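The visual impression can be backed by a number. The following is a minimal sketch that computes the Pearson correlation coefficient in plain Python. The rows are made up for illustration, and the relation `num_holdouts = num_tracks - num_samples` is an assumption about how the challenge set was built, not something verified in this notebook.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up (num_tracks, num_samples) pairs, for illustration only.
rows = [(30, 5), (60, 10), (120, 25), (200, 100), (45, 0), (150, 25)]
num_tracks = [t for t, _ in rows]
num_holdouts = [t - s for t, s in rows]  # assumed relation, see lead-in

r = pearson(num_tracks, num_holdouts)
print("Pearson r:", round(r, 3))
```

On the real data the same statistic should be available directly in Spark through `recdf.stat.corr("num_tracks", "num_holdouts")`, which avoids collecting the columns to the driver.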

In [None]:
plt.pyplot.scatter(X, recdf.select("num_samples").toPandas())
plt.pyplot.xlabel("Playlist Length")
plt.pyplot.ylabel("Sample count")

Here it is clear where the challenge playlists come from. We can sample from different playlist length categories to build our own challenge set. Again, the 100-song playlist length is a clear division between the groups for the 25 and 100 sample count challenge sets.
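Sampling by playlist length category could be sketched as follows. This is a plain-Python illustration under stated assumptions: the two buckets split at 100 tracks mirror the division seen in the scatter plot, while the names `length_bucket` and `stratified_sample`, the bucket labels, the per-bucket sizes, and the toy playlists are all hypothetical.

```python
import random
from collections import defaultdict


def length_bucket(num_tracks):
    """Hypothetical two-way split at the 100-track division seen in the plot."""
    return "long" if num_tracks >= 100 else "short"


def stratified_sample(playlists, per_bucket, seed):
    """Draw up to per_bucket[name] playlists from each length bucket."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for playlist in playlists:
        buckets[length_bucket(playlist["num_tracks"])].append(playlist)
    return {
        name: rng.sample(members, min(per_bucket.get(name, 0), len(members)))
        for name, members in sorted(buckets.items())
    }


# Toy playlists with lengths 10, 20, ..., 200, for illustration only.
playlists = [{"pid": pid, "num_tracks": 10 + 10 * pid} for pid in range(20)]

sample = stratified_sample(playlists, {"short": 3, "long": 2}, seed=1244)
print({name: [p["pid"] for p in members] for name, members in sample.items()})
```

Each sampled bucket could then be assigned the seed-track counts that fit its playlist lengths, for example reserving the largest seed counts for the "long" bucket.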