# Process and split the raw data

This notebook converts the Last.FM dataset into a set of similarity pairings between artists. The dataset lists users, artists, and the number of plays each user has for each artist. My method builds a similarity metric for every possible pairing of two artists from the co-occurrence of listeners, that is, the number of people who have listened to both artists at least once. I then try several ways of converting that count into a proportion.

In [1]:
import pandas as pd
import numpy as np
import json
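# tqdm.tqdm_notebook is deprecated; import the notebook progress bar under the same name
from tqdm.notebook import tqdm as tqdm_notebook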
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

with open('params.json') as f:
    params = json.load(f)

## Load and clean the data

This part presumes that you have downloaded the dataset from http://mtg.upf.edu/static/datasets/last.fm/lastfm-dataset-360K.tar.gz and extracted the contents to the `dataset` directory.

In [4]:
playscols = ['usersha1', 'mbid', 'artistname', 'plays']
playsdf = pd.read_csv('dataset/usersha1-artmbid-artname-plays.tsv', sep='\t', names=playscols, index_col=False)
profilecols = ['usersha1', 'gender', 'age', 'country', 'registration']
profiledf = pd.read_csv('dataset/usersha1-profile.tsv', sep='\t', names=profilecols, index_col=False)
df = playsdf.merge(profiledf, on=['usersha1'], how='left')

# Clean the data by removing artists without mbid (usually nonsense)
# and profiles without registration (also usually nonsense, maybe dataset artifact)
# Also delete entries with unknown artists, and artists that only show up once
df = df[df['mbid'].notnull() & df['registration'].notnull() 
        & (df['artistname'] != '[unknown]') & df.duplicated(subset='artistname', keep=False)]
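
To make the combined filter easier to follow, here is a minimal sketch on a hypothetical five-row dataframe (the rows and values are made up; only the column names mirror the real data). Note that `duplicated(..., keep=False)` is evaluated on the unfiltered frame, just as in the cell above.

```python
import pandas as pd

# Hypothetical toy rows; column names mirror the real dataframe
toy = pd.DataFrame({
    'usersha1':     ['u1', 'u1', 'u2', 'u2', 'u3'],
    'mbid':         ['m1', None, 'm1', 'm2', 'm3'],
    'artistname':   ['a', 'b', 'a', '[unknown]', 'c'],
    'registration': ['2007', '2007', '2008', '2008', None],
})

keep = (toy['mbid'].notnull()                               # drop rows without an MBID
        & toy['registration'].notnull()                     # drop rows without a registration date
        & (toy['artistname'] != '[unknown]')                # drop unknown artists
        & toy.duplicated(subset='artistname', keep=False))  # drop artists that appear only once
cleaned = toy[keep]
print(cleaned)
```

Only the two rows for artist `a` survive: every other row fails at least one of the four conditions.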

In [5]:
# Display data sample and some stats
display(df.head(10))
display("Unique Artists: {}".format(len(df['artistname'].unique())))
display("Unique Users: {}".format(len(df['usersha1'].unique())))
display("Entries: {}".format(len(df)))

index,usersha1,mbid,artistname,plays,gender,age,country,registration
0,00000c289a1829a808ac09c00daf10bc3c4e223b,3bd73256-3905-4f3a-97e2-8b341527f805,betty blowtorch,2137,f,22.0,Germany,"Feb 1, 2007"
1,00000c289a1829a808ac09c00daf10bc3c4e223b,f2fb0ff0-5679-42ec-a55c-15109ce6e320,die Ärzte,1099,f,22.0,Germany,"Feb 1, 2007"
2,00000c289a1829a808ac09c00daf10bc3c4e223b,b3ae82c2-e60b-4551-a76d-6620f1b456aa,melissa etheridge,897,f,22.0,Germany,"Feb 1, 2007"
3,00000c289a1829a808ac09c00daf10bc3c4e223b,3d6bbeb7-f90e-4d10-b440-e153c0d10b53,elvenking,717,f,22.0,Germany,"Feb 1, 2007"
4,00000c289a1829a808ac09c00daf10bc3c4e223b,bbd2ffd7-17f4-4506-8572-c1ea58c3f9a8,juliette & the licks,706,f,22.0,Germany,"Feb 1, 2007"
5,00000c289a1829a808ac09c00daf10bc3c4e223b,8bfac288-ccc5-448d-9573-c33ea2aa5c30,red hot chili peppers,691,f,22.0,Germany,"Feb 1, 2007"
6,00000c289a1829a808ac09c00daf10bc3c4e223b,6531c8b1-76ea-4141-b270-eb1ac5b41375,magica,545,f,22.0,Germany,"Feb 1, 2007"
7,00000c289a1829a808ac09c00daf10bc3c4e223b,21f3573f-10cf-44b3-aeaa-26cccd8448b5,the black dahlia murder,507,f,22.0,Germany,"Feb 1, 2007"
8,00000c289a1829a808ac09c00daf10bc3c4e223b,c5db90c4-580d-4f33-b364-fbaa5a3a58b5,the murmurs,424,f,22.0,Germany,"Feb 1, 2007"
9,00000c289a1829a808ac09c00daf10bc3c4e223b,0639533a-0402-40ba-b6e0-18b067198b73,lunachicks,403,f,22.0,Germany,"Feb 1, 2007"


'Unique Artists: 133086'

'Unique Users: 358854'

'Entries: 17238522'

## MusicBrainz ID (MBID) to Artist Mapping

In [6]:
mbid_to_artist = df[['mbid', 'artistname']].drop_duplicates('mbid').set_index('mbid')

## Only look at the top artists

We take the `numArtists` parameter and look at that many artists, currently set to 1000 (the actual number ends up slightly lower, 975 in this run, since we can't get audio samples for every artist). This still covers a large number of artists and yields a large amount of data. Generating pairings for the entire dataset would produce a prohibitively large amount of data and risk overfitting to extremely obscure artists. We do this by taking the previously generated list of artists for which we have audio data and filtering the rest out of the dataset.
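
The selection logic can be sketched on toy data as follows. The listener counts, the `track_urls` dictionary, and `num_artists` are hypothetical stand-ins for `uniqueartists`, the contents of `tracks.json`, and `params['numArtists']`.

```python
import pandas as pd

# Hypothetical listener counts, indexed by (fake) MBID
listeners = pd.DataFrame({'listeners': [50, 300, 120, 80]},
                         index=['m1', 'm2', 'm3', 'm4'])
# Hypothetical audio lookup: m3 has no audio sample
track_urls = {'m1': 'url1', 'm2': 'url2', 'm4': 'url4'}
num_artists = 3

# Take the most-listened artists first, then drop those without audio
top = listeners.sort_values(by='listeners', ascending=False).head(num_artists).index
top = top[top.isin(list(track_urls))]
print(list(top))
```

Because the audio filter runs after `head()`, the final count can be smaller than `num_artists`, which is why the real run ends up with fewer than 1000 artists.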

In [7]:
uniqueartists = pd.read_csv('dataset/uniqueartists.csv', index_col = 0)
display(uniqueartists.head(10))

mbid,listeners
a74b1b7f-71a5-4011-9441-d0b5e4122711,77253
b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d,76270
cc197bad-dc9c-440d-a5b5-d52ba2e14234,66658
8bfac288-ccc5-448d-9573-c33ea2aa5c30,48926
9c9f1380-2516-4fc9-a3e6-f9f61941d090,46954
65f4f0c5-ef9e-490c-aee3-909e7ae6b2ab,45233
83d91898-7763-47d7-b03b-b92132375c47,44443
95e1ead9-4d31-4808-a7ac-32c3614c116b,41229
f59c5520-5f46-4d2c-b2c4-822eabf53419,39774
cc0b7089-c08d-4c10-b6b0-873582c17fd6,37269


In [10]:
# Load the artist-to-track-URL mapping; the context manager closes the file
with open("dataset/tracks.json", "r") as f:
    flat_track_urls = json.load(f)

# Only look at top numArtists artists, otherwise too much data to parse through in a reasonable timeframe
topartists = uniqueartists.sort_values(ascending=False, by='listeners').head(params['numArtists']).index
topartists = topartists[topartists.isin(flat_track_urls.keys())]

subsetdf = df[df['mbid'].isin(topartists)]
subsetdf.head(10)
display(len(topartists))

975

## Calculate crosstab of the dataset

This creates a symmetrical matrix with every artist MBID as column and row headers, and cells containing the number of users who listen to both artists. MBIDs are used because several artists share the same name, while the MBID is a guaranteed unique identifier. For example, the cell corresponding to `crosstab['the beatles']['radiohead']` (substitute MBIDs for band names) contains the number of users who have listened to both Radiohead and The Beatles at least once. Along the diagonal is the number of users who listen to each given artist.
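
As a small illustration of the self-merge plus `pd.crosstab` technique used in the next cell, here is a sketch on a hypothetical six-row play log with three users and three artists (the short labels `A`, `B`, `C` stand in for MBIDs).

```python
import pandas as pd

# Hypothetical play log: one row per (user, artist)
plays = pd.DataFrame({
    'usersha1': ['u1', 'u1', 'u2', 'u2', 'u2', 'u3'],
    'mbid':     ['A',  'B',  'A',  'B',  'C',  'B'],
})

# Self-merge on user: every artist of a user is paired with every artist of that user
pairs = pd.merge(plays, plays, on='usersha1')
# Count the pairs: off-diagonal cells are shared listeners, the diagonal is listeners per artist
cooc = pd.crosstab(pairs['mbid_x'], pairs['mbid_y'])
print(cooc)
```

Here `A` and `B` share two listeners (`u1` and `u2`), and the diagonal entry for `B` is 3 because all three users listen to it.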

In [35]:
# Create empty crosstab dataframe
crosstab = pd.DataFrame(np.zeros((len(topartists), len(topartists))), columns=topartists, index=topartists)

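# Split subset dataframe into chunks, then crosstab each chunk and add it to the total DF
# Splitting into chunks is necessary to avoid running into memory issues
# Note: chunks are cut by row count, so a user whose rows straddle a chunk
# boundary loses the pairings between the two parts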
for g, chunkdf in tqdm_notebook(subsetdf.groupby(np.arange(len(subsetdf)) // 1000000)):
    # Merges chunk with itself on userID. This creates a new DF with each 
    # artist entry for a given user coupled with all other artist entries for that user
    dd = pd.merge(chunkdf, chunkdf, on='usersha1')
    # Crosstab method to create co-occurrence matrix
    crosstab_tmp = pd.crosstab(dd['mbid_x'], dd['mbid_y'])
    crosstab = crosstab.add(crosstab_tmp, fill_value=0)

crosstab





mbid,000fc734-b7e1-4a01-92d1-f544261b43f5,0039c7ae-e1a7-4a7d-9b49-0cbc716821a6,004e5eed-e267-46ea-b504-54526f1f377d,00a9f935-ba93-4fc8-a33a-993abe9c936b,00eeed6b-5897-4359-8347-b8cd28375331,0103c1cc-4a09-4a5d-a344-56ad99a77193,0110e63e-0a9b-4818-af8e-41e180c20b9a,01252145-c9e8-4de5-a480-9b2bed05450a,012e3432-71d3-4317-9ce5-b60cb6cdc38f,013fa897-86db-41d3-8e9f-386c8a34f4e6,...,fd1baeb3-0ee9-4838-b4c7-615c78d68d10,fd429857-5ace-4609-ae54-1502c3bdac11,fe1a873d-2000-4789-a895-4187fe756203,ff6e677f-91dd-4986-a174-8db0474b1799,ff865aa0-4603-4f79-ae8b-8735332e2cfa,ff95eb47-41c4-4f7f-a104-cdc30f02e872,ffb18e19-64a4-4a65-b4ce-979e00c3c69d,ffb2d3e3-a4cc-48cf-8fb0-f2f846e9d7b9,ffb390b8-8df4-4b72-97d1-7b2fc008a452,ffe9ec08-6b6b-4993-9394-e280b429dbfd
000fc734-b7e1-4a01-92d1-f544261b43f5,5121.0,376.0,142.0,118.0,18.0,35.0,273.0,94.0,12.0,695.0,...,8.0,10.0,53.0,106.0,7.0,631.0,138.0,96.0,4.0,81.0
0039c7ae-e1a7-4a7d-9b49-0cbc716821a6,376.0,31482.0,132.0,548.0,43.0,1247.0,1532.0,250.0,175.0,1604.0,...,350.0,1248.0,119.0,5303.0,978.0,355.0,1601.0,1253.0,743.0,162.0
004e5eed-e267-46ea-b504-54526f1f377d,142.0,132.0,4019.0,1315.0,136.0,149.0,33.0,68.0,65.0,103.0,...,16.0,49.0,24.0,70.0,12.0,43.0,46.0,67.0,5.0,65.0
00a9f935-ba93-4fc8-a33a-993abe9c936b,118.0,548.0,1315.0,24222.0,2401.0,2384.0,37.0,141.0,1531.0,215.0,...,349.0,1363.0,47.0,564.0,201.0,73.0,53.0,570.0,96.0,296.0
00eeed6b-5897-4359-8347-b8cd28375331,18.0,43.0,136.0,2401.0,6364.0,197.0,10.0,8.0,112.0,48.0,...,12.0,213.0,6.0,87.0,10.0,16.0,7.0,56.0,7.0,59.0
0103c1cc-4a09-4a5d-a344-56ad99a77193,35.0,1247.0,149.0,2384.0,197.0,17763.0,34.0,128.0,1078.0,151.0,...,1617.0,1481.0,34.0,1373.0,641.0,34.0,60.0,270.0,515.0,75.0
0110e63e-0a9b-4818-af8e-41e180c20b9a,273.0,1532.0,33.0,37.0,10.0,34.0,9653.0,153.0,5.0,577.0,...,8.0,41.0,75.0,817.0,22.0,530.0,352.0,123.0,17.0,77.0
01252145-c9e8-4de5-a480-9b2bed05450a,94.0,250.0,68.0,141.0,8.0,128.0,153.0,3130.0,18.0,121.0,...,38.0,36.0,135.0,236.0,11.0,54.0,82.0,46.0,15.0,50.0
012e3432-71d3-4317-9ce5-b60cb6cdc38f,12.0,175.0,65.0,1531.0,112.0,1078.0,5.0,18.0,4820.0,49.0,...,197.0,574.0,12.0,139.0,89.0,3.0,6.0,158.0,113.0,15.0
013fa897-86db-41d3-8e9f-386c8a34f4e6,695.0,1604.0,103.0,215.0,48.0,151.0,577.0,121.0,49.0,10931.0,...,30.0,147.0,66.0,586.0,22.0,414.0,161.0,650.0,29.0,66.0


## Save/Load data

In [36]:
# Save to HDF5 file
crosstab.to_hdf('dataset/crosstab.hd5', key='artists')

In [2]:
# Load from HDF5 file
crosstab = pd.read_hdf('dataset/crosstab.hd5', key='artists')

## Normalize co-occurrences by number of listeners of each artist

As the diagonal contains the number of listeners for each artist, we divide each column of the dataframe by the corresponding diagonal entry. This gives a normalized value: the proportion of listeners of the column's artist who also listen to the row's artist. For instance, if artist A has 10 listeners who all listen to artist B, and artist B has 20 listeners, then `crosstab_norm[A][B] == 1.0` and `crosstab_norm[B][A] == 0.5` (equivalently, `crosstab_norm.loc[B, A] == 1.0` and `crosstab_norm.loc[A, B] == 0.5`).

In [37]:
counts = pd.Series(np.diag(crosstab), index=crosstab.index)
crosstab_norm = crosstab.div(counts, axis=1)
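
The direction of the division is easy to get backwards, so here is a minimal check on the two-artist example (10 listeners of A who all listen to B, 20 listeners of B). The toy matrix is hypothetical; the `div(..., axis=1)` call is the same as above.

```python
import numpy as np
import pandas as pd

# Hypothetical co-occurrence matrix: A has 10 listeners, B has 20, 10 listen to both
cooc = pd.DataFrame([[10.0, 10.0],
                     [10.0, 20.0]], index=['A', 'B'], columns=['A', 'B'])

counts = pd.Series(np.diag(cooc.values), index=cooc.index)
# axis=1 matches the Series index to the columns, so column X is divided by counts[X]
norm = cooc.div(counts, axis=1)
print(norm)
```

Column `A` holds proportions of A's listeners, so `norm['A']['B']` (row B, column A) is 1.0, while `norm['B']['A']` is 0.5.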

## Remove diagonals and normalize values

Set the diagonal to 0 to remove the 1.0 self-pairing values, then min-max scale each column to the range [0, 1]. Finally, replace the diagonal with NaN, so that pairings of artists with themselves can be dropped later.

In [38]:
np.fill_diagonal(crosstab_norm.values, 0)
vals = crosstab_norm.values
min_max_scaler = preprocessing.MinMaxScaler()
vals_scaled = min_max_scaler.fit_transform(vals)
crosstab_norm_scaled = pd.DataFrame(vals_scaled, columns=crosstab_norm.columns, index=crosstab_norm.index)
np.fill_diagonal(crosstab_norm_scaled.values, np.nan)
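
A sketch of the same three steps on a hypothetical 3x3 matrix follows. It works on an explicit NumPy copy rather than on `.values` in place, which is an assumption made here to avoid depending on `.values` returning a writable view.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Hypothetical normalized matrix with 1.0 on the diagonal
norm = pd.DataFrame([[1.0, 0.2, 0.5],
                     [0.4, 1.0, 0.25],
                     [0.8, 0.1, 1.0]], index=list('ABC'), columns=list('ABC'))

vals = norm.to_numpy(copy=True)
np.fill_diagonal(vals, 0)            # 1. zero the self-pairings so they don't dominate the max
scaled = preprocessing.MinMaxScaler().fit_transform(vals)  # 2. scale each column to [0, 1]
np.fill_diagonal(scaled, np.nan)     # 3. mark self-pairings for later removal
scaled_df = pd.DataFrame(scaled, index=norm.index, columns=norm.columns)
print(scaled_df)
```

After scaling, the largest off-diagonal value in every column is 1.0, which is why each artist's top match always shows a value of 1.0 in the outputs of this notebook.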

## Stack dataframe

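Stacks the dataframe, converting the grid into a list of (artist, artist) pairs with one value each. This is done to simplify splitting the dataset into training, dev, and test sets. Stacking also drops the NaN diagonal (the default behavior of `stack()` in the pandas version used here), which removes the self-pairings and leaves 975 × 974 = 949650 entries.

NOTE: The stacked Series is indexed as `(row, column)`, so `crosstab_norm_scaled_stack.loc[A, B]` equals `crosstab_norm_scaled.loc[A, B]`, which is `crosstab_norm_scaled[B][A]` in column-first indexing. This reverses the A/B order relative to the `df[A][B]` notation.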

In [39]:
crosstab_norm_scaled_stack = crosstab_norm_scaled.stack()
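
The index order after stacking can be checked on a hypothetical 2x2 grid. The explicit `dropna()` is an addition for this sketch: as far as I know, newer pandas versions may keep NaN entries when stacking, so calling it makes the result unambiguous either way.

```python
import numpy as np
import pandas as pd

# Hypothetical scaled grid with NaN on the diagonal
grid = pd.DataFrame([[np.nan, 1.0],
                     [0.5, np.nan]], index=['A', 'B'], columns=['A', 'B'])

# Stack into a Series indexed by (row, column); drop the NaN self-pairings explicitly
stacked = grid.stack().dropna()
print(stacked)
```

The stacked value at `('A', 'B')` is the cell in row A, column B, which column-first indexing would write as `grid['B']['A']`.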

In [41]:
the_beatles_mbid = mbid_to_artist[mbid_to_artist['artistname'] == 'the beatles'].index[0]
radiohead_mbid = mbid_to_artist[mbid_to_artist['artistname'] == 'radiohead'].index[0]
display(crosstab_norm_scaled.loc[the_beatles_mbid,radiohead_mbid])
display(crosstab_norm_scaled_stack.loc[the_beatles_mbid,radiohead_mbid])
display(crosstab_norm_scaled_stack.loc[radiohead_mbid,the_beatles_mbid])
display(crosstab_norm_scaled[the_beatles_mbid].sort_values(ascending=False).head(10).to_frame(name = 'val').join(mbid_to_artist))

'b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d'

'a74b1b7f-71a5-4011-9441-d0b5e4122711'

1.0

1.0

0.9999999999999999

mbid,val,artistname
a74b1b7f-71a5-4011-9441-d0b5e4122711,1.0,radiohead
cc197bad-dc9c-440d-a5b5-d52ba2e14234,0.709293,coldplay
83d91898-7763-47d7-b03b-b92132375c47,0.707252,pink floyd
72c536dc-7137-4477-a521-567eeb840fa8,0.623486,bob dylan
b071f9fa-14b0-4217-8e97-eb41da73f598,0.58207,the rolling stones
678d88b2-87b0-403b-b63d-5da7465aecc3,0.577503,led zeppelin
8bfac288-ccc5-448d-9573-c33ea2aa5c30,0.563179,red hot chili peppers
5441c29d-3602-4898-b1a1-b77fa23b8e50,0.543284,david bowie
0383dadf-2a4e-4d10-a46a-e9e041da8eb3,0.509653,queen
9c9f1380-2516-4fc9-a3e6-f9f61941d090,0.470694,muse


## Split the dataset and save

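Split the dataset by random sampling: 98% training set, 1% dev set, and 1% test set. The first split holds out 2% of the pairs, and the second split divides that held-out portion in half.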

In [42]:
train, devtest = train_test_split(crosstab_norm_scaled_stack, test_size=0.02, random_state = 1)
dev, test = train_test_split(devtest, test_size=0.5, random_state = 2)
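
The two-stage split can be verified on a hypothetical Series of 1000 values, using the same `test_size` and `random_state` arguments as above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the stacked similarity Series
pairs = pd.Series(np.linspace(0, 1, 1000))

# Hold out 2%, then split the held-out part evenly into dev and test
train, devtest = train_test_split(pairs, test_size=0.02, random_state=1)
dev, test = train_test_split(devtest, test_size=0.5, random_state=2)
print(len(train), len(dev), len(test))
```

The three parts are disjoint and together cover every pair, giving the 98/1/1 proportions.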

In [43]:
display("Total dataset size:", len(crosstab_norm_scaled_stack))
display("Training set size:", len(train))
display("Dev set size:", len(dev))
display("Test set size:", len(test))

'Total dataset size:'

949650

'Training set size:'

930657

'Dev set size:'

9496

'Test set size:'

9497

In [44]:
# Save all to HDF5 files
train.to_hdf('dataset/train.hd5', key='artists')
dev.to_hdf('dataset/dev.hd5', key='artists')
test.to_hdf('dataset/test.hd5', key='artists')

## Alternate metric using minimum then normalization across artist

The following sections process the data in an alternative way: take `min(df[A,B], df[B,A])` as the value for both orderings of each pair, then normalize along columns. An issue with the previous method is that very popular bands rank highly among the similar artists of almost everyone. Nearly all rock bands will have The Beatles or Radiohead near the top, when a listener would presumably prefer to have more obscure artists suggested. Taking the minimum should make obscure artists more prominent among other obscure artists. This may or may not subjectively work better.

All files and variables have the `_min` suffix.

**This is the method I ultimately went with, since I ended up training a siamese network, where input order would not have mattered.**

In [53]:
counts = pd.Series(np.diag(crosstab), index=crosstab.index)
crosstab_norm = crosstab.div(counts, axis=1)
minidx = crosstab_norm < crosstab_norm.T
crosstab_norm_min = crosstab_norm[minidx].fillna(0) + crosstab_norm.T[~minidx].fillna(0)
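
The masking expression above amounts to an element-wise minimum of the matrix and its transpose. A minimal sketch on a hypothetical 3x3 matrix compares it against `np.minimum`, which is offered here only as an equivalent formulation, not as what the notebook ran.

```python
import numpy as np
import pandas as pd

# Hypothetical asymmetric normalized matrix
norm = pd.DataFrame([[1.0, 0.5, 0.2],
                     [1.0, 1.0, 0.3],
                     [0.4, 0.6, 1.0]], index=list('ABC'), columns=list('ABC'))

# Masking approach from the cell above
minidx = norm < norm.T
masked = norm[minidx].fillna(0) + norm.T[~minidx].fillna(0)

# Equivalent element-wise minimum of the matrix and its transpose
direct = pd.DataFrame(np.minimum(norm.values, norm.T.values),
                      index=norm.index, columns=norm.columns)
print(direct)
```

Both versions produce the same symmetric matrix, so either ordering of a pair gets the same value before the column-wise scaling step.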

In [54]:
np.fill_diagonal(crosstab_norm_min.values, 0)
vals = crosstab_norm_min.values
min_max_scaler = preprocessing.MinMaxScaler()
vals_scaled = min_max_scaler.fit_transform(vals)
crosstab_norm_min_scaled = pd.DataFrame(vals_scaled, columns=crosstab_norm_min.columns, index=crosstab_norm_min.index)
np.fill_diagonal(crosstab_norm_min_scaled.values, np.nan)

In [47]:
crosstab_norm_min_scaled_stack = crosstab_norm_min_scaled.stack()

In [48]:
train_min, devtest_min = train_test_split(crosstab_norm_min_scaled_stack, test_size=0.02, random_state = 1)
dev_min, test_min = train_test_split(devtest_min, test_size=0.5, random_state = 2)

In [49]:
display("Total dataset size:", len(crosstab_norm_min_scaled_stack))
display("Training set size:", len(train_min))
display("Dev set size:", len(dev_min))
display("Test set size:", len(test_min))

'Total dataset size:'

949650

'Training set size:'

930657

'Dev set size:'

9496

'Test set size:'

9497

In [52]:
# Save all to HDF5 files
train_min.to_hdf('dataset/train_min.hd5', key='artists')
dev_min.to_hdf('dataset/dev_min.hd5', key='artists')
test_min.to_hdf('dataset/test_min.hd5', key='artists')

In [65]:
# Compare the two methods for a somewhat niche artist I'm personally familiar with.
# The second "min" method looks subjectively better!
mbid = mbid_to_artist[mbid_to_artist['artistname'] == 'in flames'].index[0]
display(crosstab_norm_scaled[mbid].sort_values(ascending=False).head(10).to_frame(name = 'val').join(mbid_to_artist))
display(crosstab_norm_min_scaled[mbid].sort_values(ascending=False).head(10).to_frame(name = 'val').join(mbid_to_artist))

mbid,val,artistname
65f4f0c5-ef9e-490c-aee3-909e7ae6b2ab,1.0,metallica
cc0b7089-c08d-4c10-b6b0-873582c17fd6,0.759507,system of a down
f57e14e4-b030-467c-b202-539453f504ec,0.754825,children of bodom
ca891d65-d9b0-4258-89f7-e6ba29d83767,0.641658,iron maiden
00a9f935-ba93-4fc8-a33a-993abe9c936b,0.605915,nightwish
a466c2a2-6517-42fb-a160-1087c3bafd9f,0.602604,slipknot
b2d122f9-eadb-4930-a196-8f221eeb0c66,0.585474,rammstein
4bb4e4e4-5f66-4509-98af-62dbb90c45c5,0.550417,disturbed
9d30e408-1559-448b-b491-2f8de1583ccf,0.533059,dark tranquillity
f59c5520-5f46-4d2c-b2c4-822eabf53419,0.499144,linkin park


mbid,val,artistname
f57e14e4-b030-467c-b202-539453f504ec,1.0,children of bodom
a466c2a2-6517-42fb-a160-1087c3bafd9f,0.798336,slipknot
4bb4e4e4-5f66-4509-98af-62dbb90c45c5,0.729198,disturbed
9d30e408-1559-448b-b491-2f8de1583ccf,0.706203,dark tranquillity
ca891d65-d9b0-4258-89f7-e6ba29d83767,0.6695,iron maiden
e631bb92-3e2b-43e3-a2cb-b605e2fb53bd,0.658548,arch enemy
00a9f935-ba93-4fc8-a33a-993abe9c936b,0.655514,nightwish
5b687684-ad34-4a9f-b425-0e7aa81fbd38,0.642209,amon amarth
d8d1b067-78bb-4db7-8f91-db2ff9a83ee5,0.642057,soilwork
c14b4180-dc87-481e-b17a-64e4150f90f6,0.63646,opeth


## Alternate method using plays

The following sections repeat all the above steps, but calculate the cross-tabulation from plays rather than from co-occurrences. Instead of the number of people who listen to artist A and also listen to artist B, the values show the number of plays of artist B per play of artist A. This may or may not end up working better. It is also nicer for model training, because it results in a flatter distribution of similarity scores that isn't as sharply exponential.

All files are saved with the `_alt` suffix.
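
To show what the `values=` and `aggfunc='sum'` arguments change, here is a sketch on a hypothetical play log with three users and two artists. Each cell sums the plays of the column artist over users who listened to both the row and the column artist.

```python
import pandas as pd

# Hypothetical play log with play counts
plays = pd.DataFrame({
    'usersha1': ['u1', 'u1', 'u2', 'u2', 'u3'],
    'mbid':     ['A',  'B',  'A',  'B',  'A'],
    'plays':    [10,   5,    2,    8,    7],
})

# Self-merge on user; the duplicated 'plays' column becomes plays_x / plays_y
pairs = pd.merge(plays, plays, on='usersha1')
# Sum plays of the second artist (mbid_y) for every (mbid_x, mbid_y) pair
weighted = pd.crosstab(pairs['mbid_x'], pairs['mbid_y'],
                       values=pairs['plays_y'], aggfunc='sum')
print(weighted)
```

Unlike the count-based crosstab, this matrix is not symmetric: row A, column B sums plays of B (5 + 8 = 13), while row B, column A sums plays of A by the same users (10 + 2 = 12). The diagonal holds each artist's total plays.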

In [64]:
# Create empty crosstab dataframe
crosstab_alt = pd.DataFrame(np.zeros((len(topartists), len(topartists))), columns=topartists, index=topartists)

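# Split subset dataframe into chunks, then crosstab each chunk and add it to the total DF
# Splitting into chunks is necessary to avoid running into memory issues
# Note: chunks are cut by row count, so a user whose rows straddle a chunk
# boundary loses the pairings between the two parts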
for g, chunkdf in tqdm_notebook(subsetdf.groupby(np.arange(len(subsetdf)) // 1000000)):
    # Merges chunk with itself on userID. This creates a new DF with each 
    # artist entry for a given user coupled with all other artist entries for that user
    dd = pd.merge(chunkdf, chunkdf, on='usersha1')
    # Crosstab that sums the plays of the second artist (plays_y) over
    # all users who listened to both artists
    crosstab_tmp = pd.crosstab(dd['mbid_x'], dd['mbid_y'], values=dd['plays_y'], aggfunc='sum')
    crosstab_alt = crosstab_alt.add(crosstab_tmp, fill_value=0)

crosstab_alt


mbid,000fc734-b7e1-4a01-92d1-f544261b43f5,0039c7ae-e1a7-4a7d-9b49-0cbc716821a6,004e5eed-e267-46ea-b504-54526f1f377d,00a9f935-ba93-4fc8-a33a-993abe9c936b,00eeed6b-5897-4359-8347-b8cd28375331,0103c1cc-4a09-4a5d-a344-56ad99a77193,0110e63e-0a9b-4818-af8e-41e180c20b9a,01252145-c9e8-4de5-a480-9b2bed05450a,012e3432-71d3-4317-9ce5-b60cb6cdc38f,013fa897-86db-41d3-8e9f-386c8a34f4e6,...,fd1baeb3-0ee9-4838-b4c7-615c78d68d10,fd429857-5ace-4609-ae54-1502c3bdac11,fe1a873d-2000-4789-a895-4187fe756203,ff6e677f-91dd-4986-a174-8db0474b1799,ff865aa0-4603-4f79-ae8b-8735332e2cfa,ff95eb47-41c4-4f7f-a104-cdc30f02e872,ffb18e19-64a4-4a65-b4ce-979e00c3c69d,ffb2d3e3-a4cc-48cf-8fb0-f2f846e9d7b9,ffb390b8-8df4-4b72-97d1-7b2fc008a452,ffe9ec08-6b6b-4993-9394-e280b429dbfd
000fc734-b7e1-4a01-92d1-f544261b43f5,1502792.0,112980.0,50336.0,35575.0,5207.0,6135.0,70121.0,23814.0,1875.0,244285.0,...,2283.0,1380.0,9024.0,23177.0,1287.0,196690.0,28241.0,27909.0,2128.0,19726.0
0039c7ae-e1a7-4a7d-9b49-0cbc716821a6,121585.0,10571527.0,48246.0,193099.0,12852.0,294218.0,412642.0,63477.0,39775.0,517818.0,...,88659.0,369058.0,26688.0,1685788.0,332170.0,116682.0,516907.0,940032.0,220746.0,45877.0
004e5eed-e267-46ea-b504-54526f1f377d,43444.0,35743.0,1364165.0,690493.0,38872.0,48096.0,6700.0,30446.0,13800.0,29659.0,...,6331.0,12281.0,6454.0,21613.0,3219.0,13504.0,17466.0,47322.0,1307.0,22433.0
00a9f935-ba93-4fc8-a33a-993abe9c936b,38806.0,216252.0,490106.0,9754161.0,893008.0,732377.0,8158.0,41113.0,382985.0,73183.0,...,92775.0,371625.0,13193.0,165611.0,52946.0,8849.0,8941.0,273070.0,34177.0,99926.0
00eeed6b-5897-4359-8347-b8cd28375331,14600.0,17228.0,32192.0,1071208.0,2000885.0,50021.0,1462.0,2061.0,21415.0,17903.0,...,7106.0,43106.0,3151.0,18762.0,1396.0,3534.0,1123.0,21566.0,3156.0,13474.0
0103c1cc-4a09-4a5d-a344-56ad99a77193,6912.0,410295.0,43247.0,1088675.0,57880.0,4584028.0,7428.0,35528.0,269291.0,44115.0,...,534335.0,396017.0,5392.0,351019.0,149635.0,4862.0,24797.0,122761.0,202083.0,18809.0
0110e63e-0a9b-4818-af8e-41e180c20b9a,77951.0,457206.0,9562.0,7232.0,1852.0,4805.0,2366807.0,45582.0,755.0,154781.0,...,1250.0,8033.0,13313.0,206250.0,4789.0,134217.0,104363.0,48023.0,4402.0,16232.0
01252145-c9e8-4de5-a480-9b2bed05450a,25966.0,72767.0,38680.0,59010.0,4126.0,31556.0,36125.0,713673.0,6500.0,32947.0,...,9994.0,6204.0,25696.0,60320.0,1410.0,10805.0,18466.0,20810.0,3150.0,17541.0
012e3432-71d3-4317-9ce5-b60cb6cdc38f,2878.0,61041.0,18082.0,766576.0,25718.0,327092.0,1161.0,6693.0,1156417.0,12045.0,...,56684.0,145401.0,1887.0,32148.0,14074.0,294.0,454.0,87436.0,33273.0,3395.0
013fa897-86db-41d3-8e9f-386c8a34f4e6,269713.0,480122.0,48394.0,64957.0,13763.0,29278.0,151036.0,27637.0,12204.0,3338687.0,...,5490.0,29714.0,9828.0,126222.0,2545.0,99105.0,40706.0,375279.0,9012.0,17515.0


## Take the minimum, similar to the `_min` method, then normalize

This is especially necessary for the plays-based data. Otherwise the values would not reflect true preferences, and would favour artists with many listens.

In [66]:
counts = pd.Series(np.diag(crosstab_alt), index=crosstab_alt.index)
crosstab_alt_norm = crosstab_alt.div(counts, axis=1)
minidx = crosstab_alt_norm < crosstab_alt_norm.T
crosstab_alt_norm = crosstab_alt_norm[minidx].fillna(0) + crosstab_alt_norm.T[~minidx].fillna(0)

In [68]:
np.fill_diagonal(crosstab_alt_norm.values, 0)
vals = crosstab_alt_norm.values
min_max_scaler = preprocessing.MinMaxScaler()
vals_scaled = min_max_scaler.fit_transform(vals)
crosstab_alt_norm_scaled = pd.DataFrame(vals_scaled, columns=crosstab_alt_norm.columns, index=crosstab_alt_norm.index)
np.fill_diagonal(crosstab_alt_norm_scaled.values, np.nan)

In [71]:
mbid = mbid_to_artist[mbid_to_artist['artistname'] == 'in flames'].index[0]
display(crosstab_norm_min_scaled[mbid].sort_values(ascending=False).head(10).to_frame(name = 'val').join(mbid_to_artist))
display(crosstab_alt_norm_scaled[mbid].sort_values(ascending=False).head(10).to_frame(name = 'val').join(mbid_to_artist))

mbid,val,artistname
f57e14e4-b030-467c-b202-539453f504ec,1.0,children of bodom
a466c2a2-6517-42fb-a160-1087c3bafd9f,0.798336,slipknot
4bb4e4e4-5f66-4509-98af-62dbb90c45c5,0.729198,disturbed
9d30e408-1559-448b-b491-2f8de1583ccf,0.706203,dark tranquillity
ca891d65-d9b0-4258-89f7-e6ba29d83767,0.6695,iron maiden
e631bb92-3e2b-43e3-a2cb-b605e2fb53bd,0.658548,arch enemy
00a9f935-ba93-4fc8-a33a-993abe9c936b,0.655514,nightwish
5b687684-ad34-4a9f-b425-0e7aa81fbd38,0.642209,amon amarth
d8d1b067-78bb-4db7-8f91-db2ff9a83ee5,0.642057,soilwork
c14b4180-dc87-481e-b17a-64e4150f90f6,0.63646,opeth


mbid,val,artistname
f57e14e4-b030-467c-b202-539453f504ec,1.0,children of bodom
d8d1b067-78bb-4db7-8f91-db2ff9a83ee5,0.8207,soilwork
9d30e408-1559-448b-b491-2f8de1583ccf,0.763648,dark tranquillity
4bb4e4e4-5f66-4509-98af-62dbb90c45c5,0.713585,disturbed
a466c2a2-6517-42fb-a160-1087c3bafd9f,0.708062,slipknot
ca891d65-d9b0-4258-89f7-e6ba29d83767,0.636357,iron maiden
e631bb92-3e2b-43e3-a2cb-b605e2fb53bd,0.615191,arch enemy
8295ee00-0096-461d-95c7-c2263d2a4c6d,0.614306,killswitch engage
65f4f0c5-ef9e-490c-aee3-909e7ae6b2ab,0.592871,metallica
ac865b2e-bba8-4f5a-8756-dd40d5e39f46,0.569613,koЯn


In [72]:
crosstab_alt_norm_scaled_stack = crosstab_alt_norm_scaled.stack()

In [74]:
train_alt, devtest_alt = train_test_split(crosstab_alt_norm_scaled_stack, test_size=0.02, random_state = 1)
dev_alt, test_alt = train_test_split(devtest_alt, test_size=0.5, random_state = 2)

In [75]:
display("Total dataset size:", len(crosstab_alt_norm_scaled_stack))
display("Training set size:", len(train_alt))
display("Dev set size:", len(dev_alt))
display("Test set size:", len(test_alt))

'Total dataset size:'

949650

'Training set size:'

930657

'Dev set size:'

9496

'Test set size:'

9497

In [76]:
# Save all to HDF5 files
train_alt.to_hdf('dataset/train_alt.hd5', key='artists')
dev_alt.to_hdf('dataset/dev_alt.hd5', key='artists')
test_alt.to_hdf('dataset/test_alt.hd5', key='artists')