# Create a test and training split for the mpd.

Build a test data set from the mpd using the playlist distribution found in the official challenge set.

This extracts 10k playlists from the mpd as a test set that substitutes for the original challenge set. It saves the original mpd data files as a new training set with the test set removed. Keeping the structure of the original file set simplifies running code that expects that input.

The constructed splits will be stored in a directory named like mpd-split-<description> that contains test-set.json and a data subdirectory with the mpd slices.

The challenge set will then need to be constructed from test-set.json so that code can process a challenge set with withheld data. Additional downstream processing will rate results submitted against the split.
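As a rough illustration of what "constructing a challenge set with withheld data" could look like, here is a minimal sketch. The helper name `make_challenge_playlist` and its `num_seeds` parameter are hypothetical and not part of this notebook; only the `pid`, `name`, `tracks` and `track_uri` field names come from the mpd playlist layout.

```python
import copy

def make_challenge_playlist(playlist, num_seeds):
    """Hypothetical helper: keep the first `num_seeds` tracks visible and
    return the remaining track URIs as withheld ground truth."""
    visible = copy.deepcopy(playlist)
    visible["tracks"] = playlist["tracks"][:num_seeds]
    withheld = [t["track_uri"] for t in playlist["tracks"][num_seeds:]]
    return visible, withheld

# Toy playlist that mimics the mpd field names.
toy_playlist = {
    "pid": 7,
    "name": "road trip",
    "tracks": [
        {"track_uri": "spotify:track:a"},
        {"track_uri": "spotify:track:b"},
        {"track_uri": "spotify:track:c"},
    ],
}
challenge, truth = make_challenge_playlist(toy_playlist, num_seeds=1)
print(challenge["tracks"], truth)
```

The withheld list is what a scorer would compare recommendations against; the original playlist is left unmodified.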

In [None]:
import sys
import json
import re
import collections
import os
import datetime
import pandas as pd
import numpy as np

## Load the mpd slice files

Create one big data frame to make it simple to select the random samples.

In [None]:
playlists = pd.DataFrame()
tracks = pd.DataFrame()

In [None]:
debug = True
quick = True
max_files_for_quick_processing = 20

# random state
seed = 1

In [None]:
def process_mpd(path):
    """Load mpd slice files from `path` into the global playlists and tracks frames."""
    global playlists, tracks

    playlist_frames = []
    track_frames = []
    count = 0
    for filename in sorted(os.listdir(path)):
        if filename.startswith("mpd.slice.") and filename.endswith(".json"):
            fullpath = os.path.join(path, filename)
            with open(fullpath) as f:
                mpd_slice = json.load(f)
            if debug: print("loaded {}:".format(fullpath))
            # Flatten data
            # extract slice info to keep association with original training files.
            slice_info = mpd_slice['info']['slice']
            slice_playlists = pd.json_normalize(mpd_slice, record_path=['playlists'])
            slice_playlists["slice"] = slice_info
            if debug: print("slice length {}:".format(len(slice_playlists)))
            slice_tracks = pd.json_normalize(mpd_slice['playlists'], record_path=['tracks'], meta=['pid'])
            # Keep the tracks column in the playlist dataframe: dropping it saves
            # little space and makes it harder to reconstruct the playlist.
            playlist_frames.append(slice_playlists)
            track_frames.append(slice_tracks)
            count += 1

            # Note: with ">" this stops after max_files_for_quick_processing + 1 files.
            if quick and count > max_files_for_quick_processing:
                break

    # DataFrame.append was removed in pandas 2.0; concatenate once at the end.
    if playlist_frames:
        playlists = pd.concat(playlist_frames)
        tracks = pd.concat(track_frames)


In [None]:
%%time
process_mpd("data/mpd/data")

Set a new index for playlists so each row has a unique id. After reading the slice files the index values repeat for each slice.

The preference is not to use the pid as the index, since that would remove pid as a data column.
Instead, create a new column of integers, one per row, and set that as the index.

In [None]:
playlists["newidx"]=range(len(playlists))

playlists.set_index("newidx", inplace=True)

In [None]:
[d.get("track_uri") for d in playlists[playlists["pid"]==0].tracks[0]]

# Series.iteritems() was removed in pandas 2.0; items() is the replacement.
for i, l in playlists.tracks.explode().items():
    print("i: {} type: {}".format(i, type(l)))

In [None]:
pl = playlists.copy()

In [None]:
pl = playlists[["pid","tracks"]].explode("tracks")

In [None]:
pl["track_uri"] = [d.get("track_uri") for d in pl.tracks]

In [None]:
pl["artist_name"] = [d.get("artist_name") for d in pl.tracks]

The expanded one-row-per-track representation shows we have 1.4 million songs (rows). The row index has 21k distinct entries, which matches the 21k playlists in the training set.
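The relationship between row count and index entries can be seen on toy data. This is a minimal sketch with made-up playlists; the `toy_` names are illustrative only and do not touch the notebook's own frames.

```python
import pandas as pd

toy_playlists = pd.DataFrame({
    "pid": [0, 1],
    "tracks": [
        [{"track_uri": "spotify:track:a", "artist_name": "X"},
         {"track_uri": "spotify:track:b", "artist_name": "Y"}],
        [{"track_uri": "spotify:track:a", "artist_name": "X"}],
    ],
})

# explode() emits one row per list element and repeats the playlist's index label.
toy_long = toy_playlists.explode("tracks")
toy_long["track_uri"] = [d.get("track_uri") for d in toy_long["tracks"]]

print(len(toy_long))               # number of track rows
print(toy_long.index.nunique())    # number of distinct playlists
```

The row count is the total number of tracks, while the number of distinct index labels is the number of playlists.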

In [None]:
pl

### Check memory usage

In [None]:
tracks = pl[["track_uri"]]

In [None]:
tracks.memory_usage(deep=True)

In [None]:
tracks.memory_usage()

In [None]:
pd.__version__

In [None]:
tracks.info()

Following the example in the [pandas sparse data types page](https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html), use memory_usage().sum(). It is not clear why the example divides by 1000; since memory_usage() reports bytes, that should make the result kilobytes, which is how it is labelled below.

In [None]:
'dense : {:0.2f} kbytes'.format(tracks.memory_usage().sum() / 1e3)
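To make the difference between the two memory_usage() calls above concrete, here is a small sketch on a toy frame (the `toy_tracks` frame is made up). With `deep=True` pandas also inspects the contents of object columns, so the deep figure is at least as large as the shallow one.

```python
import pandas as pd

toy_tracks = pd.DataFrame(
    {"track_uri": ["spotify:track:a", "spotify:track:b"] * 1000}
)

shallow = toy_tracks.memory_usage().sum()         # bytes, without string contents
deep = toy_tracks.memory_usage(deep=True).sum()   # bytes, including string contents

# Dividing a byte count by 1e3 gives kilobytes.
print("shallow: {:0.2f} kbytes".format(shallow / 1e3))
print("deep   : {:0.2f} kbytes".format(deep / 1e3))
```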

## One Hot encode playlists

Attempting to use get_dummies() works in the dense space and tries to build a dataframe of 100k by 1.4 million songs. Not sure why so many rows, but it's still too big for RAM at 300+G.

trackhots = pd.get_dummies(tracks, dtype=bool)
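As a rough check on the 300+G figure, here is a back-of-the-envelope sketch. It assumes a dense boolean frame with one row per exploded track (about 1.4 million) and one column per unique track_uri (about 269k, the count found later in this notebook). Both shape values are assumptions, since the dimensions reported by pandas are not fully explained here.

```python
n_rows = 1_400_000    # assumed: one row per playlist track
n_cols = 269_000      # assumed: approximate number of unique track_uri values
bytes_per_cell = 1    # a NumPy bool occupies one byte

dense_gb = n_rows * n_cols * bytes_per_cell / 1e9
print("dense one-hot frame: about {:.0f} GB".format(dense_gb))
```

Under those assumptions the dense frame lands in the same range as the 300+G that was observed, which is why a sparse representation is needed.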

sklearn has a one-hot encoder that is a preprocessor for many of its routines. See if we can fit the tracks to this representation.

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MultiLabelBinarizer


In [None]:
pl[["pid", "track_uri"]].info()

In [None]:
trackhots = OneHotEncoder()

In [None]:
trackhots.fit(pl[["pid", "track_uri"]])

In [None]:
trackhots.categories_

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
trackhots.get_feature_names_out()

Transform the original data into a matrix representation.

Here again is the 1.4M x 290k representation. The 1.4M is the songs, so rows in the original matrix, but it is not clear where the 290k comes from. Would expect 21k for the playlists.

In [None]:
th = trackhots.transform(pl[["pid","track_uri"]])

In [None]:
th

In [None]:
pl["pid"].max()

In [None]:
playlists["num_tracks"].sum()

Hmm, there are some problems in the transformation. The 1.4 million comes from the total number of tracks in training. The number of unique tracks is much smaller.

In [None]:
pl["track_uri"].drop_duplicates().count()

I'd expect a transformed data set to be 21k by 269k.

Ah, the one-hot encoder wants a feature set where each record carries its own distinct features.
https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-categorical-features

In this case it's rows of track_uri,
so each row will map to the idx value and will just have tracks.
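A toy run shows how OneHotEncoder treats a two-column input: each column is encoded as its own categorical feature and the resulting blocks sit side by side. That would be consistent with the 290k columns seen earlier, roughly 21k pid categories plus 269k track_uri categories, although that breakdown is an inference and was not checked in this notebook. The `toy_` names below are made up.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy_pl = pd.DataFrame({
    "pid": [0, 0, 1],
    "track_uri": ["spotify:track:a", "spotify:track:b", "spotify:track:a"],
})

toy_enc = OneHotEncoder()
toy_encoded = toy_enc.fit_transform(toy_pl[["pid", "track_uri"]])

# One output row per input row; columns = pid categories + track_uri categories.
print(toy_encoded.shape)
print([len(c) for c in toy_enc.categories_])
```

The output keeps one row per (pid, track) record, so it never collapses to one row per playlist.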

In [None]:
playlists.head(1)

In [None]:
import ast

In [None]:
pl[["track_uri"]]

Try converting each tracks string to a data type 

https://www.geeksforgeeks.org/python-convert-string-dictionary-to-dictionary/

In [None]:
playlists[["tracks"]].tracks

We need a list of lists. This is pretty easy to construct with a list comprehension to wrap the lists into a list.

In [None]:
pltracks = [d for d in playlists[["tracks"]].tracks.apply((lambda s: [d["track_uri"] for d in s]))]

In [None]:
len(pltracks)

In [None]:
type(pltracks)

What we are really trying to do is train the encoding and then transform each row.

This is more like having a vocabulary and different sentences.
I need to map each sentence to its one-hot encoding of the vocabulary.

This example shows moving from an integer encoding to a one-hot encoding:
https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/

Reading the docs leads to the multi-label binarizer, which appears to be closer to what I want.
https://scikit-learn.org/stable/modules/preprocessing_targets.html#multilabelbinarizer

In [None]:
mlb = MultiLabelBinarizer(sparse_output=True)

In [None]:
pltracks = mlb.fit_transform(pltracks)

We finally have a list of 21k playlists encoded with the 269k unique tracks.

In [None]:
pltracks
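For reference, a minimal sketch of what MultiLabelBinarizer does with a list of lists, using three made-up playlists. Each playlist becomes one row and each distinct track becomes one column. The `toy_` names are illustrative only.

```python
from sklearn.preprocessing import MultiLabelBinarizer

toy_lists = [
    ["spotify:track:a", "spotify:track:b"],
    ["spotify:track:b", "spotify:track:c"],
    ["spotify:track:c"],
]

toy_mlb = MultiLabelBinarizer(sparse_output=True)
toy_matrix = toy_mlb.fit_transform(toy_lists)   # scipy sparse, playlists x tracks

print(toy_matrix.shape)
print(toy_mlb.classes_)       # column order: the sorted unique track URIs
print(toy_matrix.toarray())   # dense view is only sensible for toy data
```

The `classes_` array is what maps a column index back to a track URI later on.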

## Get Cosine similarity

https://stackoverflow.com/a/27046041/8928529

In [None]:
from sklearn.metrics.pairwise import cosine_similarity


In [None]:
sim = cosine_similarity(pltracks)

In [None]:
sim.shape

We want to do a matrix multiply for user-user similarity: score = sim * ratings

https://stackoverflow.com/a/16754459/8928529
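The score = sim * ratings step can be sketched on a 3 x 3 toy matrix. One assumption worth flagging: this sketch passes `dense_output=False` to `cosine_similarity` so the similarity matrix stays sparse from the start, which is a variation on the approach in this notebook (compute dense, then cast to CSR) and has not been tried on the full data here.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# Rows are playlists, columns are tracks (1 = track is in the playlist).
toy_ratings = sparse.csr_matrix(np.array([
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
]))

# Playlist-to-playlist similarity, kept sparse.
toy_sim = cosine_similarity(toy_ratings, dense_output=False)

# Each score row is a similarity-weighted sum of the other playlists' tracks.
toy_score = toy_sim @ toy_ratings

print(toy_sim.shape, toy_score.shape)
print(np.round(toy_score.toarray(), 2))
```

The score matrix has the same playlists x tracks shape as the input, so each row can be ranked to get candidate tracks for that playlist.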

In [None]:
from scipy import sparse

Cast the similarity matrix into a compressed sparse row format so matrix multiplication doesn't explode the RAM.

In [None]:
sim = sparse.csr_matrix(sim)

In [None]:
sim

In [None]:
score = sparse.csr_matrix.dot(sim, pltracks)

The result is a score matrix in the original dimensions that is about 17% dense, with 973 million stored entries out of roughly 5 billion possible.
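The 17% figure is the fraction of stored (nonzero) entries, i.e. nnz divided by rows times columns. A small sketch of that calculation follows; the 21k and 269k dimensions are the rounded values quoted in this notebook, so the result is approximate.

```python
import numpy as np
from scipy import sparse

# Density of a toy sparse matrix: stored entries / total cells.
toy = sparse.csr_matrix(np.array([[1.0, 0.0, 0.5],
                                  [0.0, 0.0, 2.0]]))
toy_density = toy.nnz / (toy.shape[0] * toy.shape[1])
print("toy density: {:.0%}".format(toy_density))

# Same formula with the rounded figures quoted for the score matrix.
quoted_density = 973e6 / (21_000 * 269_000)
print("score density: {:.0%}".format(quoted_density))
```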

In [None]:
score

In [None]:
score.shape

## Get Challenge set distribution

Just read the data distribution from the challenge set file directly.

In [None]:
# load data using Python JSON module
with open('data/challenge_set.json','r') as f:
    data = json.loads(f.read())

In [None]:
# Flatten data
challenge_playlists = pd.json_normalize(data, record_path=['playlists'])

In [None]:
[challenge_playlists["tracks"]]

In [None]:
chtracks = [d for d in challenge_playlists[["tracks"]].tracks.apply((lambda s: [d["track_uri"] for d in s]))]

In [None]:
chtracks[1002]

In [None]:
pltracks = [d for d in playlists[["tracks"]].tracks.apply((lambda s: [d["track_uri"] for d in s]))]

In [None]:
pltracks[0]

In [None]:
chtracks[1000]

In [None]:
alltracks = list()

In [None]:
alltracks = pltracks + chtracks[1000:]

In [None]:
len(alltracks)

In [None]:
allmpb = mlb.fit_transform(alltracks)

In [None]:
allmpb

In [None]:
sim = cosine_similarity(allmpb)

In [None]:
sim = sparse.csr_matrix(sim)

Memory use in virt jumps from 15g to 30g in the next operation.

In [None]:
score = sparse.csr_matrix.dot(sim, allmpb)

Manage memory by deleting data we don't need after the score is computed.

In [None]:
del allmpb
del sim

In [None]:
score

In [None]:
score.shape

In [None]:
print(score[22000])

In [None]:
print(score[22000].sorted_indices())

In [None]:
import itertools

In [None]:
zip

In [None]:
#from itertools import izip

def sort_csr(m):
    tuples = zip(m.indices, m.data)
    return sorted(tuples, key=lambda x: (x[1]), reverse=True)


In [None]:
sorted(score[22000].data, reverse=True)[0:500]

In [None]:
score[22000].shape

In [None]:
score[22000].indices

In [None]:
cantracks = sort_csr(score[22000])

In [None]:
type(cantracks[0])

In [None]:
len(mlb.classes_)

In [None]:
rectracks=[mlb.classes_[i[0]] for i in cantracks]

In [None]:
len(rectracks)

In [None]:
chtracks[1000]

In [None]:
mlb.classes_

In [None]:
[item for item in chtracks[1000] if item in mlb.classes_ ]

In [None]:
[item for item in chtracks[1000] if item in rectracks]

In [None]:
[item for item in chtracks[1000] if item in ['spotify:track:66U0ASk1VHZsqIkpMjKX3B']] #rectracks]

In [None]:
[item for item in rectracks if item in ['spotify:track:66U0ASk1VHZsqIkpMjKX3B']] #rectracks]

Remove the challenge tracks from the recommended set.
Use a simple loop for now to keep the code simple.
This also allows us to inspect where the original songs are in the recommendation set.
For the current playlist, the positions for the first 5 songs are above the 500 song rec limit.
This suggests we will see fairly poor rprec and ndcg performance for pure user-user knn.
That makes sense, since this is really just a most-popular-songs-among-similar-users strategy:
a user-focused popularity ranking rather than a global popularity ranking.
It suggests the need for the boosting strategies we see in the actual top performers.
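The rank-then-remove step can be sketched on a single toy score row. The track URIs, scores and `toy_` names are made up; the idea is to sort the stored entries by score, note where each seed track landed, and then drop the seeds from the recommendations.

```python
import numpy as np
from scipy import sparse

toy_classes = np.array(["spotify:track:a", "spotify:track:b",
                        "spotify:track:c", "spotify:track:d"])
toy_row = sparse.csr_matrix(np.array([[0.9, 0.0, 0.4, 0.7]]))
toy_seeds = ["spotify:track:a"]   # tracks already in the challenge playlist

# Rank stored entries by score, highest first.
order = np.argsort(-toy_row.data, kind="stable")
toy_ranked = [str(toy_classes[i]) for i in toy_row.indices[order]]

# Record the rank of each seed track, then remove the seeds.
seed_set = set(toy_seeds)
seed_positions = {s: toy_ranked.index(s) for s in toy_seeds if s in toy_ranked}
toy_recs = [t for t in toy_ranked if t not in seed_set]

print(seed_positions)
print(toy_recs)
```

Tracks with a score of zero are not stored in the sparse row, so they never appear in the ranking at all.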

In [None]:
for challenge_track in chtracks[1000]:
    print("remove track pos: {}".format(rectracks.index(challenge_track)))
    rectracks.remove(challenge_track)

In [None]:
len(rectracks)

Clean up the entire recommendation set. These are lists 21000-30000 in the current method. No index math would be needed if we shifted to putting the challenge tracks at the start.

Trim the recommendation set out of the score results.

Even with the del earlier, resident memory (RES) remains at 28g, which helps explain why the next step kills the kernel.

    98895 jpr       20   0   29.9g  28.7g  29564 S   0.0 15.3   4:06.54 python

Even with explicit garbage collection this doesn't free up the ram.

https://stackoverflow.com/questions/1316767/how-can-i-explicitly-free-memory-in-python

Advice is to use a subprocess.
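A minimal sketch of the subprocess idea, under the assumption that the heavy work can be expressed as a separate script: the child process does the large allocation and prints only a small summary, and its memory is returned to the operating system when it exits. The child code here is a stand-in and does not compute the real similarity matrices.

```python
import subprocess
import sys

# Hypothetical child script: the big list stands in for the sim/score matrices.
child_code = """
big = list(range(1_000_000))
print(sum(big))
"""

result = subprocess.run(
    [sys.executable, "-c", child_code],
    capture_output=True, text=True, check=True,
)

# Only the small summary crosses back into the notebook process.
summary = int(result.stdout.strip())
print(summary)
```

For real data the child would more likely write its results to disk than print them, but the memory behaviour is the same.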

In [None]:
import gc

gc.collect()

Maybe it means I need to save out the results and reload them either in a new notebook or after the kernel barfs.

In [None]:
import pickle

In [None]:
with open("save_score.p", "wb") as f:
    pickle.dump(score, f)

Even pickle kills the kernel. 
Maybe best to just add some ram.
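One alternative that might be worth trying before adding RAM is scipy's own sparse file format, which stores the CSR arrays directly. This is only a sketch on a toy matrix; it has not been tested against the full score matrix here, so it may well run into the same memory limit.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

toy_score = sparse.csr_matrix(np.array([[0.0, 1.5],
                                        [0.25, 0.0]]))

# Write to a temporary directory so the sketch leaves the working directory alone.
toy_path = os.path.join(tempfile.mkdtemp(), "save_score.npz")
sparse.save_npz(toy_path, toy_score)
toy_reloaded = sparse.load_npz(toy_path)

# An empty "not equal" matrix means the round trip preserved every entry.
print((toy_reloaded != toy_score).nnz == 0)
```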

In [None]:
score = score[21000:]

In [None]:
score.shape

In [None]:
reclist = list()
indexdist = list() #pd.DataFrame(columns=["index"])
misses = 0
tooshort = 0

for idx in range(9000):
    cantracks = sort_csr(score[idx])
    rectracks=[mlb.classes_[i[0]] for i in cantracks]
    if (idx % 1000) == 0: print("challenge: {}".format(idx))
    # score row idx corresponds to chtracks[1000 + idx], because
    # alltracks = pltracks + chtracks[1000:] and score was trimmed to score[21000:].
    for challenge_track in chtracks[1000 + idx]:
        try:
            indexdist.append(rectracks.index(challenge_track))
            #print("remove track pos: {}".format(rectracks.index(challenge_track)))
            rectracks.remove(challenge_track)
        except (ValueError, AttributeError):
            #print("didn't find in rectracks: {}".format(challenge_track))
            misses += 1
    # count recommendation lists that fall short of the 500 track limit
    if len(rectracks) < 500:
        tooshort += 1

    reclist.append(rectracks[0:500])

In [None]:
indexdist = pd.DataFrame(indexdist)

The distribution of removals shows that the vast majority are well above the 500 reclist limit

In [None]:
indexdist.describe()

In [None]:
len(reclist)

In [None]:
misses

In [None]:
len(reclist[8995])

In [None]:
submission = pd.DataFrame(reclist)

In [None]:
submission