# Process and split the raw data

This notebook converts the Last.FM dataset into a set of similarity pairings between artists. The dataset lists users, artists, and the number of plays each user has for each artist. My method builds a similarity metric for every possible pairing of two artists by counting the co-occurrence of listeners, that is, how many people have listened to both artists at least once. I then try several ways of converting that count into a proportion.

In [1]:
import pandas as pd
import numpy as np
import json
from tqdm import tqdm_notebook 
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from utils import load_json, make_logger
import logging

params = load_json('params.json')
logger = make_logger('gen_split_dataset', 'log/gen_split_dataset.log')

## Load and clean the data

This part presumes that you have downloaded the dataset at http://mtg.upf.edu/static/datasets/last.fm/lastfm-dataset-360K.tar.gz and extracted the contents to the dataset directory.

In [2]:
logger.info('Loading Last.FM data...')

playscols = ['usersha1', 'mbid', 'artistname', 'plays']
playsdf = pd.read_csv('dataset/usersha1-artmbid-artname-plays.tsv', sep='\t', names=playscols, index_col=False)
profilecols = ['usersha1', 'gender', 'age', 'country', 'registration']
profiledf = pd.read_csv('dataset/usersha1-profile.tsv', sep='\t', names=profilecols, index_col=False)
df = playsdf.merge(profiledf, on=['usersha1'], how='left')

# Clean the data by removing artists without mbid (usually nonsense)
# and profiles without registration (also usually nonsense, maybe dataset artifact)
# Also delete entries with unknown artists, and artists that only show up once
df = df[df['mbid'].notnull() & df['registration'].notnull() 
        & (df['artistname'] != '[unknown]') & df.duplicated(subset='artistname', keep=False)]

logger.info('Loaded Last.FM data')

Loading Last.FM data...
Loaded Last.FM data


In [3]:
# Display data sample and some stats
display(df.head(10))
logger.info('Unique Artists: {}'.format(len(df['artistname'].unique())))
logger.info('Unique Users: {}'.format(len(df['usersha1'].unique())))
logger.info('Entries: {}'.format(len(df)))

Unnamed: 0,usersha1,mbid,artistname,plays,gender,age,country,registration
0,00000c289a1829a808ac09c00daf10bc3c4e223b,3bd73256-3905-4f3a-97e2-8b341527f805,betty blowtorch,2137,f,22.0,Germany,"Feb 1, 2007"
1,00000c289a1829a808ac09c00daf10bc3c4e223b,f2fb0ff0-5679-42ec-a55c-15109ce6e320,die Ärzte,1099,f,22.0,Germany,"Feb 1, 2007"
2,00000c289a1829a808ac09c00daf10bc3c4e223b,b3ae82c2-e60b-4551-a76d-6620f1b456aa,melissa etheridge,897,f,22.0,Germany,"Feb 1, 2007"
3,00000c289a1829a808ac09c00daf10bc3c4e223b,3d6bbeb7-f90e-4d10-b440-e153c0d10b53,elvenking,717,f,22.0,Germany,"Feb 1, 2007"
4,00000c289a1829a808ac09c00daf10bc3c4e223b,bbd2ffd7-17f4-4506-8572-c1ea58c3f9a8,juliette & the licks,706,f,22.0,Germany,"Feb 1, 2007"
5,00000c289a1829a808ac09c00daf10bc3c4e223b,8bfac288-ccc5-448d-9573-c33ea2aa5c30,red hot chili peppers,691,f,22.0,Germany,"Feb 1, 2007"
6,00000c289a1829a808ac09c00daf10bc3c4e223b,6531c8b1-76ea-4141-b270-eb1ac5b41375,magica,545,f,22.0,Germany,"Feb 1, 2007"
7,00000c289a1829a808ac09c00daf10bc3c4e223b,21f3573f-10cf-44b3-aeaa-26cccd8448b5,the black dahlia murder,507,f,22.0,Germany,"Feb 1, 2007"
8,00000c289a1829a808ac09c00daf10bc3c4e223b,c5db90c4-580d-4f33-b364-fbaa5a3a58b5,the murmurs,424,f,22.0,Germany,"Feb 1, 2007"
9,00000c289a1829a808ac09c00daf10bc3c4e223b,0639533a-0402-40ba-b6e0-18b067198b73,lunachicks,403,f,22.0,Germany,"Feb 1, 2007"


Unique Artists: 133086
Unique Users: 358854
Entries: 17238522


## MusicBrainz ID (MBID) to Artist Mapping

In [4]:
mbid_to_artist = pd.read_csv('dataset/mbid_to_artist.csv', index_col=0)

logger.info('Loaded mbid_to_artist.csv')

Loaded mbid_to_artist.csv


## Only look at the top artists

We take the `numArtists` parameter and look at that many of the most-listened artists. It is currently set to 1000, though the actual number comes out slightly lower (975 in this run) since we can't get audio samples for every artist. This still provides a massive amount of data while covering a large number of artists. Generating pairings for the entire dataset would produce a prohibitively large amount of data and create the potential for overfitting to extremely obscure artists. We do this by taking the previously generated list of artists, keeping only those for which we have audio data, and filtering the rest out of the dataset.
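
As a rough illustration of the filtering described above, here is a minimal sketch on toy data. The MBIDs, listener counts, and the `tracks` dictionary are hypothetical stand-ins for `uniqueartists.csv` and `tracks.json`, and `num_artists` plays the role of `params['numArtists']`.

```python
import pandas as pd

# Hypothetical listener counts indexed by MBID (toy values, not from the dataset)
listeners = pd.DataFrame(
    {"listeners": [500, 900, 100, 700]},
    index=["mbid-a", "mbid-b", "mbid-c", "mbid-d"],
)
# Hypothetical stand-in for tracks.json: MBIDs that have audio samples
tracks = {"mbid-b": ["url1"], "mbid-a": ["url2"], "mbid-c": ["url3"]}

num_artists = 3  # plays the role of params['numArtists']

# Rank by listeners, keep the top num_artists, then drop any without audio
top = listeners.sort_values(by="listeners", ascending=False).head(num_artists).index
top = top[top.isin(list(tracks.keys()))]

print(list(top))
```

Here `mbid-d` makes the top three by listeners but is dropped because it has no audio entry, which is why the final artist count ends up below `num_artists`.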

In [7]:
uniqueartists = pd.read_csv('dataset/uniqueartists.csv', index_col = 0)

logger.info('Loaded uniqueartists.csv')
display(uniqueartists.head(10))

Loaded uniqueartists.csv


Unnamed: 0,listeners
a74b1b7f-71a5-4011-9441-d0b5e4122711,77253
b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d,76270
cc197bad-dc9c-440d-a5b5-d52ba2e14234,66658
8bfac288-ccc5-448d-9573-c33ea2aa5c30,48926
9c9f1380-2516-4fc9-a3e6-f9f61941d090,46954
65f4f0c5-ef9e-490c-aee3-909e7ae6b2ab,45233
83d91898-7763-47d7-b03b-b92132375c47,44443
95e1ead9-4d31-4808-a7ac-32c3614c116b,41229
f59c5520-5f46-4d2c-b2c4-822eabf53419,39774
cc0b7089-c08d-4c10-b6b0-873582c17fd6,37269


In [8]:
logger.info('Loading tracks.json and filtering out artists which don\'t have any audio data')

with open('dataset/tracks.json', 'r') as f:
    flat_track_urls = json.load(f)

# Only look at top numArtists artists, otherwise too much data to parse through in a reasonable timeframe
topartists = uniqueartists.sort_values(ascending=False, by='listeners').head(params['numArtists']).index
topartists = topartists[topartists.isin(flat_track_urls.keys())]

subsetdf = df[df['mbid'].isin(topartists)]
subsetdf.head(10)

logger.info('Artists filtered, filtered dataset contains {} artists'.format(len(topartists)))

Loading tracks.json and filtering out artists which don't have any audio data
Artists filtered, filtered dataset contains 975 artists


## Calculate crosstab of the dataset

This creates a symmetric matrix with artist MBIDs as both column and row headers, where each cell contains the number of users who listen to both artists. MBIDs are used because several artists share the same name, while the MBID is a guaranteed unique identifier. For example, the cell corresponding to `df['the beatles']['radiohead']` (substitute MBIDs for band names) contains the number of users who have listened to both Radiohead and The Beatles at least once. Along the diagonal is the number of users who listen to each given artist.
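
To make the idea concrete, here is a minimal sketch of the self-merge and crosstab on a toy listening log. The users and artist labels are made up, and the sketch assumes each (user, artist) pair appears in exactly one row.

```python
import pandas as pd

# Toy listening log: u1 hears A and B, u2 hears A only, u3 hears A, B and C
plays = pd.DataFrame({
    "usersha1": ["u1", "u1", "u2", "u3", "u3", "u3"],
    "mbid":     ["A",  "B",  "A",  "A",  "B",  "C"],
})

# Self-merge on the user pairs every artist with every artist the same user played
pairs = pd.merge(plays, plays, on="usersha1")

# Counting the (mbid_x, mbid_y) pairs gives the co-occurrence matrix
cooc = pd.crosstab(pairs["mbid_x"], pairs["mbid_y"])
print(cooc)
```

The diagonal holds each artist's listener count (3 for A), and the off-diagonal cells hold the number of shared listeners (2 for A and B).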

In [9]:
logger.info('Calculating dataset crosstab')

# Create empty crosstab dataframe
crosstab = pd.DataFrame(np.zeros((len(topartists), len(topartists))), columns=topartists, index=topartists)

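# Split subset dataframe into chunks, then crosstab each chunk and add it to the total DF
# Splitting into chunks is necessary to avoid running into memory issues
# Caveat: chunks are cut by row count, so a user whose rows span a chunk boundary
# loses the artist pairs that fall in different chunks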
for g, chunkdf in tqdm_notebook(subsetdf.groupby(np.arange(len(subsetdf)) // 1000000)):
    # Merges chunk with itself on userID. This creates a new DF with each 
    # artist entry for a given user coupled with all other artist entries for that user
    dd = pd.merge(chunkdf, chunkdf, on='usersha1')
    # Crosstab method to create co-occurrence matrix
    crosstab_tmp = pd.crosstab(dd['mbid_x'], dd['mbid_y'])
    crosstab = crosstab.add(crosstab_tmp, fill_value=0)

crosstab

logger.info('Crosstab calculation complete')

Calculating dataset crosstab



Crosstab calculation complete
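
The accumulation step in the loop above relies on `DataFrame.add` with `fill_value=0`. A minimal sketch with two hypothetical per-chunk crosstabs shows why: a chunk that is missing an artist simply contributes zero for that artist instead of turning the running total into NaN.

```python
import numpy as np
import pandas as pd

artists = ["A", "B", "C"]
# Running total starts as an all-zero square matrix over every artist
total = pd.DataFrame(np.zeros((3, 3)), index=artists, columns=artists)

# Two hypothetical per-chunk crosstabs; each only covers artists seen in that chunk
chunk1 = pd.DataFrame([[2, 1], [1, 1]], index=["A", "B"], columns=["A", "B"])
chunk2 = pd.DataFrame([[1, 1], [1, 3]], index=["A", "C"], columns=["A", "C"])

for part in (chunk1, chunk2):
    # fill_value=0 treats cells missing from the chunk as zero instead of NaN
    total = total.add(part, fill_value=0)

print(total)
```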





## Save/Load data

In [10]:
# Save to HDF5 file
crosstab.to_hdf('dataset/crosstab.hd5', key='artists')

logger.info('Saved crosstab.hd5')

Saved crosstab.hd5


In [11]:
# Load from HDF5 file
crosstab = pd.read_hdf('dataset/crosstab.hd5', key='artists')

logger.info('Loaded crosstab.hd5')

Loaded crosstab.hd5


## Normalize co-occurrences by number of listeners of each artist

As the diagonal contains the number of listeners for each artist, we divide each column of the dataframe by that column's diagonal value. This gives the proportion of a given artist's listeners (the column) who also listen to another artist (the row). For instance, if artist A has 10 listeners who all listen to artist B, and artist B has 20 listeners, then `df.loc[B, A] == 1.0` and `df.loc[A, B] == 0.5`.
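
The orientation is easy to get backwards, so here is a small sketch of the A/B example with the same `div(..., axis=1)` call. With `axis=1` the counts are aligned with the column labels, so the proportions for artist A's listeners end up in column A.

```python
import numpy as np
import pandas as pd

# A has 10 listeners, all of whom also listen to B; B has 20 listeners
crosstab = pd.DataFrame([[10, 10], [10, 20]], index=["A", "B"], columns=["A", "B"])

counts = pd.Series(np.diag(crosstab), index=crosstab.index)
# axis=1 aligns counts with the column labels, so column X is divided by count(X)
norm = crosstab.div(counts, axis=1)

print(norm.loc["B", "A"])  # share of A's listeners who also listen to B
print(norm.loc["A", "B"])  # share of B's listeners who also listen to A
```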

In [12]:
logger.info('Normalizing co-occurrences')

counts = pd.Series(np.diag(crosstab), index=crosstab.index)
crosstab_norm = crosstab.div(counts, axis=1)

logger.info('Co-occurrences normalized')

Normalizing co-occurrences
Co-occurrences normalized


## Remove diagonals and scale values

Set the diagonal to zero so the 1.0 self-pairing values do not affect scaling, then min-max scale each column. Afterwards, replace the diagonal with NaN so that pairings of artists with themselves can be dropped later.
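
As a sketch of what this step does, the snippet below applies the per-column min-max formula by hand with NumPy on a toy matrix. It is meant to mirror what `MinMaxScaler` does with its default settings (each column scaled to [0, 1] independently), not to replace it.

```python
import numpy as np

# Toy normalized matrix; the diagonal is always 1.0 (an artist against itself)
vals = np.array([
    [1.0, 0.5, 0.2],
    [0.4, 1.0, 0.6],
    [0.1, 0.3, 1.0],
])

# Zero the diagonal so the self-pairings do not set each column's maximum
np.fill_diagonal(vals, 0)

# Per-column min-max: (x - col_min) / (col_max - col_min)
col_min = vals.min(axis=0)
col_max = vals.max(axis=0)
scaled = (vals - col_min) / (col_max - col_min)

# Mark the self-pairings as missing so they can be dropped later
np.fill_diagonal(scaled, np.nan)
print(scaled)
```

Each column now has its strongest other artist at exactly 1.0, which is why the top neighbour in every per-artist ranking scores 1.0.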

In [13]:
logger.info('Scaling co-occurrence matrix')

np.fill_diagonal(crosstab_norm.values, 0)
vals = crosstab_norm.values
min_max_scaler = preprocessing.MinMaxScaler()
vals_scaled = min_max_scaler.fit_transform(vals)
crosstab_norm_scaled = pd.DataFrame(vals_scaled, columns=crosstab_norm.columns, index=crosstab_norm.index)
np.fill_diagonal(crosstab_norm_scaled.values, np.nan)

logger.info('Co-occurrence matrix scaled')

Scaling co-occurrence matrix
Co-occurrence matrix scaled


## Stack dataframe

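Stack the dataframe, converting the grid into a long list of (artist, artist) pairs with one value each. This simplifies splitting the dataset into training, dev, and test sets.

NOTE: The stacked index is ordered (row, column), so `stack.loc[A, B]` equals `df.loc[A, B]`, which is `df[B][A]` when indexing by column first.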
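
A small sketch of how stacking orders its index, using a made-up 3x3 grid with NaN on the diagonal. The explicit `dropna()` is my addition so that the NaN self-pairings are removed regardless of which `stack()` defaults the installed pandas version uses.

```python
import numpy as np
import pandas as pd

labels = ["A", "B", "C"]
grid = pd.DataFrame(
    [[np.nan, 0.5, 0.2],
     [1.0, np.nan, 0.6],
     [0.3, 0.4, np.nan]],
    index=labels, columns=labels,
)

# stack() turns the grid into a Series keyed by (row label, column label);
# dropna() is called explicitly so the NaN self-pairings are always removed
stacked = grid.stack().dropna()

print(len(stacked))             # 3 * 2 ordered pairs
print(stacked.loc[("A", "B")])  # same cell as grid.loc["A", "B"]
```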

In [14]:
logger.info('Stacking co-occurrence matrix')

crosstab_norm_scaled_stack = crosstab_norm_scaled.stack()

logger.info('Co-occurrence matrix stacked')

Stacking co-occurrence matrix
Co-occurrence matrix stacked


In [15]:
the_beatles_mbid = mbid_to_artist[mbid_to_artist['artistname'] == 'the beatles'].index[0]
radiohead_mbid = mbid_to_artist[mbid_to_artist['artistname'] == 'radiohead'].index[0]
display(crosstab_norm_scaled.loc[the_beatles_mbid,radiohead_mbid])
display(crosstab_norm_scaled_stack.loc[the_beatles_mbid,radiohead_mbid])
display(crosstab_norm_scaled_stack.loc[radiohead_mbid,the_beatles_mbid])
display(crosstab_norm_scaled[the_beatles_mbid].sort_values(ascending=False).head(10).to_frame(name = 'val').join(mbid_to_artist))

1.0

1.0

0.9999999999999999

Unnamed: 0,val,artistname
a74b1b7f-71a5-4011-9441-d0b5e4122711,1.0,radiohead
cc197bad-dc9c-440d-a5b5-d52ba2e14234,0.709293,coldplay
83d91898-7763-47d7-b03b-b92132375c47,0.707252,pink floyd
72c536dc-7137-4477-a521-567eeb840fa8,0.623486,bob dylan
b071f9fa-14b0-4217-8e97-eb41da73f598,0.58207,the rolling stones
678d88b2-87b0-403b-b63d-5da7465aecc3,0.577503,led zeppelin
8bfac288-ccc5-448d-9573-c33ea2aa5c30,0.563179,red hot chili peppers
5441c29d-3602-4898-b1a1-b77fa23b8e50,0.543284,david bowie
0383dadf-2a4e-4d10-a46a-e9e041da8eb3,0.509653,queen
9c9f1380-2516-4fc9-a3e6-f9f61941d090,0.470694,muse


## Split the dataset and save

Split the dataset by random sampling: 98% training set, 1% dev set, 1% test set.
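
The expected set sizes can be worked out by hand. The sketch below assumes the held-out portion is rounded up at each stage, which is my understanding of how scikit-learn treats a float `test_size`; `split_sizes` is a hypothetical helper, not part of the notebook.

```python
import math

def split_sizes(n, holdout=0.02, test_share=0.5):
    """Return (train, dev, test) sizes for a two-stage random split."""
    # Assumption: held-out sizes are rounded up, as scikit-learn does for a float test_size
    n_holdout = math.ceil(n * holdout)
    n_test = math.ceil(n_holdout * test_share)
    return n - n_holdout, n_holdout - n_test, n_test

# 975 artists give 975 * 974 ordered pairs once self-pairings are removed
print(split_sizes(975 * 974))
```

For 975 artists this gives 930657 training, 9496 dev, and 9497 test pairs.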

In [16]:
logger.info('Splitting dataset into train/dev/test sets')

train, devtest = train_test_split(crosstab_norm_scaled_stack, test_size=0.02, random_state = 1)
dev, test = train_test_split(devtest, test_size=0.5, random_state = 2)

Splitting dataset into train/dev/test sets


In [18]:
logger.info('Total dataset size: {}'.format(len(crosstab_norm_scaled_stack)))
logger.info('Training set size: {}'.format(len(train)))
logger.info('Dev set size: {}'.format(len(dev)))
logger.info('Test set size: {}'.format(len(test)))

Total dataset size: 949650
Training set size: 930657
Dev set size: 9496
Test set size: 9497


In [19]:
# Save all to HDF5 files
logger.info('Saving split dataset')

train.to_hdf('dataset/train.hd5', key='artists')
dev.to_hdf('dataset/dev.hd5', key='artists')
test.to_hdf('dataset/test.hd5', key='artists')

logger.info('Split dataset saved')

Saving split dataset
Split dataset saved


## Alternate metric using minimum then normalization across artist

The following sections process the data in an alternative way: the normalized matrix is computed as before, then `min(df[A,B], df[B,A])` is used as the value for both orderings of each pair, and the result is rescaled with a single global minimum and maximum. This should make obscure artists more prominent among other obscure artists. An issue with the first method is that very popular bands rank highly among the similar artists of almost everyone: nearly all rock bands will have The Beatles or Radiohead ranked highly, when the listener would presumably prefer to have more obscure artists suggested. This may or may not subjectively work better.
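
Taking the smaller of the two directed values can be sketched with `np.minimum` on a toy matrix. The notebook itself uses a boolean mask and addition; `np.minimum` is swapped in here only because it is shorter, and it should give the same element-wise result.

```python
import numpy as np

# Toy directed proportions: norm[i, j] is the share of artist j's listeners
# who also listen to artist i (columns are the "given" artist)
norm = np.array([
    [1.0, 0.5, 0.1],
    [1.0, 1.0, 0.4],
    [0.8, 0.2, 1.0],
])

# Element-wise minimum of the matrix and its transpose keeps the smaller
# of the two directed values for each pair, so the result is symmetric
sym = np.minimum(norm, norm.T)
print(sym)
```

A popular artist paired with an obscure one gets the small value in both directions, which is what pushes very popular artists down the rankings.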

Prior to stacking, the upper triangle of the dataframe (including the diagonal) is set to NA to remove duplicate entries, since `df[A][B] == df[B][A]` after taking the minimum.
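
Here is a minimal sketch of the triangle masking on a hypothetical 4x4 symmetric matrix. It uses the builtin `bool` as the dtype and an explicit `dropna()` after stacking, both of which are my choices for portability across NumPy and pandas versions.

```python
import numpy as np
import pandas as pd

labels = ["A", "B", "C", "D"]
# Hypothetical symmetric similarity matrix
sym = pd.DataFrame(np.full((4, 4), 0.5), index=labels, columns=labels)

# True on and above the diagonal; where() keeps only cells where the mask is False
upper = np.triu(np.ones(sym.shape)).astype(bool)
lower_only = sym.where(~upper)

# Explicit dropna() so the masked cells are removed regardless of stack() defaults
pairs = lower_only.stack().dropna()
print(len(pairs))  # n * (n - 1) / 2 unordered pairs
```

With 975 artists the same formula gives 975 * 974 / 2 = 474825 pairs, half the size of the ordered-pair dataset.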

All files and variables have the `_min` suffix.

**This is the method I ultimately went with, since I ended up training a siamese network, where input order would not have mattered.**

In [45]:
logger.info('Re-calculating and normalizing dataset using alternative minimum-based metric')

counts = pd.Series(np.diag(crosstab), index=crosstab.index)
crosstab_norm = crosstab.div(counts, axis=1)
minidx = crosstab_norm < crosstab_norm.T
crosstab_norm_min = crosstab_norm[minidx].fillna(0) + crosstab_norm.T[~minidx].fillna(0)

logger.info('Re-calculation and normalization complete')

Re-calculating and normalizing dataset using alternative minimum-based metric
Re-calculation and normalization complete


In [73]:
logger.info('Re-scaling min dataset')

np.fill_diagonal(crosstab_norm_min.values, 0)
vals = crosstab_norm_min.values
vals_scaled = (vals - np.nanmin(vals)) / (np.nanmax(vals)- np.nanmin(vals))
crosstab_norm_min_scaled = pd.DataFrame(vals_scaled, columns=crosstab_norm_min.columns, index=crosstab_norm_min.index)
np.fill_diagonal(crosstab_norm_min_scaled.values, np.nan)

logger.info('Min dataset re-scaled')

Re-scaling min dataset
Min dataset re-scaled
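
Note that the re-scaling cell above uses one global minimum and maximum rather than the per-column `MinMaxScaler`. A minimal sketch on a toy symmetric matrix shows the practical consequence: because every cell goes through the same transform, the matrix stays symmetric.

```python
import numpy as np

vals = np.array([
    [0.0, 0.5, 0.1],
    [0.5, 0.0, 0.2],
    [0.1, 0.2, 0.0],
])

# One minimum and one maximum for the whole matrix, ignoring any NaNs
lo, hi = np.nanmin(vals), np.nanmax(vals)
scaled = (vals - lo) / (hi - lo)

# A symmetric input stays symmetric because every cell gets the same transform
print(scaled)
```

Per-column scaling would instead stretch each column by a different factor and break the symmetry that the upper-triangle masking relies on.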


In [74]:
logger.info('Stacking min dataset')

# np.bool was removed in NumPy 1.24; the builtin bool is equivalent here
crosstab_norm_min_scaled = crosstab_norm_min_scaled.where(~np.triu(np.ones(crosstab_norm_min_scaled.shape)).astype(bool))
crosstab_norm_min_scaled_stack = crosstab_norm_min_scaled.stack()

logger.info('Min dataset stacked')

Stacking min dataset
Min dataset stacked


In [75]:
logger.info('Splitting min dataset into train/dev/test sets')

train_min, devtest_min = train_test_split(crosstab_norm_min_scaled_stack, test_size=0.02, random_state = 1)
dev_min, test_min = train_test_split(devtest_min, test_size=0.5, random_state = 2)

Splitting min dataset into train/dev/test sets


In [76]:
logger.info('Total min dataset size: {}'.format(len(crosstab_norm_min_scaled_stack)))
logger.info('Min training set size: {}'.format(len(train_min)))
logger.info('Min dev set size: {}'.format(len(dev_min)))
logger.info('Min test set size: {}'.format(len(test_min)))

Total min dataset size: 474825
Min training set size: 465328
Min dev set size: 4748
Min test set size: 4749


In [50]:
logger.info('Saving split min dataset')

# Save all to HDF5 files
train_min.to_hdf('dataset/train_min.hd5', key='artists')
dev_min.to_hdf('dataset/dev_min.hd5', key='artists')
test_min.to_hdf('dataset/test_min.hd5', key='artists')

logger.info('Split min dataset saved')

Saving split min dataset
Split min dataset saved


In [61]:
# Compare the top-10 neighbours under the two methods for a single artist (Radiohead here).
# The second 'min' method looks subjectively better!
mbid = mbid_to_artist[mbid_to_artist['artistname'] == 'radiohead'].index[0]
display(crosstab_norm_scaled[mbid].sort_values(ascending=False).head(10).to_frame(name = 'val').join(mbid_to_artist))
display(crosstab_norm_min_scaled[mbid].sort_values(ascending=False).head(10).to_frame(name = 'val').join(mbid_to_artist))

Unnamed: 0,val,artistname
b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d,1.0,the beatles
cc197bad-dc9c-440d-a5b5-d52ba2e14234,0.901426,coldplay
9c9f1380-2516-4fc9-a3e6-f9f61941d090,0.726732,muse
83d91898-7763-47d7-b03b-b92132375c47,0.566881,pink floyd
f6f2326f-6b25-4170-b89d-e235b25508e8,0.556743,sigur rós
8bfac288-ccc5-448d-9573-c33ea2aa5c30,0.525327,red hot chili peppers
95e1ead9-4d31-4808-a7ac-32c3614c116b,0.507993,the killers
0039c7ae-e1a7-4a7d-9b49-0cbc716821a6,0.488236,death cab for cutie
8f6bd1e4-fbe1-4f50-aa9b-94c450ec0f11,0.483461,portishead
52074ba6-e495-4ef3-9bb4-0703888a9f68,0.464916,arcade fire


Unnamed: 0,val,artistname
b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d,0.374121,the beatles
cc197bad-dc9c-440d-a5b5-d52ba2e14234,0.337243,coldplay
f6f2326f-6b25-4170-b89d-e235b25508e8,0.20829,sigur rós
ba0d6274-db14-4ef5-b28d-657ebde1a396,0.172628,the smashing pumpkins
ada7a83c-e3e1-40f1-93f9-3e73dbc9298a,0.160292,arctic monkeys
b7ffd2af-418f-4be2-bdd1-22f8b48613da,0.152447,nine inch nails
b23e8a63-8f47-4882-b55b-df2c92ef400e,0.15145,interpol
a96ac800-bfcb-412a-8a63-0a98df600700,0.139969,modest mouse
aa7a2827-f74b-473c-bd79-03d065835cf7,0.132551,franz ferdinand
f181961b-20f7-459e-89de-920ef03c7ed0,0.131399,the strokes


## Alternate method using plays

The following sections repeat all of the above steps, but the cross-tabulation sums play counts instead of counting co-occurrences. Where the earlier values were based on the number of people who listen to artist A and also listen to artist B, these are based on how many plays artist B receives from listeners of artist A. This may or may not end up working better. It is also nicer for model training, because it results in a flatter distribution of similarity scores that isn't as sharply exponential.
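
A toy sketch of the plays-weighted crosstab, using made-up users and play counts. It passes `values` and `aggfunc='sum'` to `pd.crosstab` in the same way as the cell below.

```python
import pandas as pd

# Toy log: u1 plays A ten times and B five times; u2 plays A twice
plays = pd.DataFrame({
    "usersha1": ["u1", "u1", "u2"],
    "mbid":     ["A",  "B",  "A"],
    "plays":    [10,   5,    2],
})

pairs = pd.merge(plays, plays, on="usersha1")

# Sum plays_y per (mbid_x, mbid_y): plays of artist y by users who listen to x
weighted = pd.crosstab(pairs["mbid_x"], pairs["mbid_y"],
                       values=pairs["plays_y"], aggfunc="sum")
print(weighted)
```

Unlike the listener-count crosstab, this matrix is not symmetric: listeners of A give B 5 plays, while listeners of B give A 10 plays. The diagonal holds each artist's total plays.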

All files are saved with the `_alt` suffix.

In [27]:
logger.info('Calculating alt (plays-based) dataset crosstab')

# Create empty crosstab dataframe
crosstab_alt = pd.DataFrame(np.zeros((len(topartists), len(topartists))), columns=topartists, index=topartists)

# Split subset dataframe into chunks, then crosstab each chunk and add it to the total DF
# Splitting into chunks is necessary to avoid running into memory issues
for g, chunkdf in tqdm_notebook(subsetdf.groupby(np.arange(len(subsetdf)) // 1000000)):
    # Merges chunk with itself on userID. This creates a new DF with each 
    # artist entry for a given user coupled with all other artist entries for that user
    dd = pd.merge(chunkdf, chunkdf, on='usersha1')
    # Crosstab summing plays_y: plays of artist y by users who also listen to artist x
    crosstab_tmp = pd.crosstab(dd['mbid_x'], dd['mbid_y'], values=dd['plays_y'], aggfunc='sum')
    crosstab_alt = crosstab_alt.add(crosstab_tmp, fill_value=0)

crosstab_alt

logger.info('Calculated alt dataset crosstab')

Calculating alt (plays-based) dataset crosstab



Calculated alt dataset crosstab





## Take the minimum, similar to the `_min` method, then normalize

This is especially necessary for the plays-based matrix: without it, the values would not reflect true preferences and would favour artists with many plays.

In [28]:
logger.info('Normalizing alt co-occurrences')

counts = pd.Series(np.diag(crosstab_alt), index=crosstab_alt.index)
crosstab_alt_norm = crosstab_alt.div(counts, axis=1)
minidx = crosstab_alt_norm < crosstab_alt_norm.T
crosstab_alt_norm = crosstab_alt_norm[minidx].fillna(0) + crosstab_alt_norm.T[~minidx].fillna(0)

logger.info('Alt co-occurrences normalized')

Normalizing alt co-occurrences
Alt co-occurrences normalized


In [29]:
logger.info('Scaling alt co-occurrences')

np.fill_diagonal(crosstab_alt_norm.values, 0)
vals = crosstab_alt_norm.values
vals_scaled = (vals - np.nanmin(vals)) / (np.nanmax(vals)- np.nanmin(vals))
crosstab_alt_norm_scaled = pd.DataFrame(vals_scaled, columns=crosstab_alt_norm.columns, index=crosstab_alt_norm.index)
np.fill_diagonal(crosstab_alt_norm_scaled.values, np.nan)

logger.info('Alt co-occurrences scaled')

Scaling alt co-occurrences
Alt co-occurrences scaled


In [42]:
mbid = mbid_to_artist[mbid_to_artist['artistname'] == 'in flames'].index[0]
display(crosstab_norm_min_scaled[mbid].sort_values(ascending=False).head(10).to_frame(name = 'val').join(mbid_to_artist))
display(crosstab_alt_norm_scaled[mbid].sort_values(ascending=False).head(10).to_frame(name = 'val').join(mbid_to_artist))

Unnamed: 0,val,artistname
b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d,1.0,the beatles
cc197bad-dc9c-440d-a5b5-d52ba2e14234,0.901426,coldplay
f6f2326f-6b25-4170-b89d-e235b25508e8,0.556743,sigur rós
ba0d6274-db14-4ef5-b28d-657ebde1a396,0.461421,the smashing pumpkins
ada7a83c-e3e1-40f1-93f9-3e73dbc9298a,0.428448,arctic monkeys
b7ffd2af-418f-4be2-bdd1-22f8b48613da,0.40748,nine inch nails
b23e8a63-8f47-4882-b55b-df2c92ef400e,0.404816,interpol
a96ac800-bfcb-412a-8a63-0a98df600700,0.374126,modest mouse
aa7a2827-f74b-473c-bd79-03d065835cf7,0.354301,franz ferdinand
f181961b-20f7-459e-89de-920ef03c7ed0,0.351221,the strokes


Unnamed: 0,val,artistname
b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d,1.0,the beatles
cc197bad-dc9c-440d-a5b5-d52ba2e14234,0.928343,coldplay
f6f2326f-6b25-4170-b89d-e235b25508e8,0.716264,sigur rós
ba0d6274-db14-4ef5-b28d-657ebde1a396,0.558528,the smashing pumpkins
b23e8a63-8f47-4882-b55b-df2c92ef400e,0.525252,interpol
b7ffd2af-418f-4be2-bdd1-22f8b48613da,0.492788,nine inch nails
ada7a83c-e3e1-40f1-93f9-3e73dbc9298a,0.463321,arctic monkeys
a96ac800-bfcb-412a-8a63-0a98df600700,0.458346,modest mouse
f181961b-20f7-459e-89de-920ef03c7ed0,0.416454,the strokes
b6b2bb8d-54a9-491f-9607-7b546023b433,0.391124,pixies


In [31]:
logger.info('Stacking alt co-occurrences')

# np.bool was removed in NumPy 1.24; the builtin bool is equivalent here
crosstab_alt_norm_scaled = crosstab_alt_norm_scaled.where(~np.triu(np.ones(crosstab_alt_norm_scaled.shape)).astype(bool))
crosstab_alt_norm_scaled_stack = crosstab_alt_norm_scaled.stack()

logger.info('Alt co-occurrences stacked')

Stacking alt co-occurrences
Alt co-occurrences stacked


In [32]:
logger.info('Splitting alt dataset into train/dev/test sets')

train_alt, devtest_alt = train_test_split(crosstab_alt_norm_scaled_stack, test_size=0.02, random_state = 1)
dev_alt, test_alt = train_test_split(devtest_alt, test_size=0.5, random_state = 2)

Splitting alt dataset into train/dev/test sets


In [33]:
logger.info('Total alt dataset size: {}'.format(len(crosstab_alt_norm_scaled_stack)))
logger.info('Alt training set size: {}'.format(len(train_alt)))
logger.info('Alt dev set size: {}'.format(len(dev_alt)))
logger.info('Alt test set size: {}'.format(len(test_alt)))

Total alt dataset size: 474825
Alt training set size: 465328
Alt dev set size: 4748
Alt test set size: 4749


In [34]:
logger.info('Saving split alt dataset')

# Save all to HDF5 files
train_alt.to_hdf('dataset/train_alt.hd5', key='artists')
dev_alt.to_hdf('dataset/dev_alt.hd5', key='artists')
test_alt.to_hdf('dataset/test_alt.hd5', key='artists')

logger.info('Split alt dataset saved')

Saving split alt dataset
Split alt dataset saved

