# Sample | Train | Validation Data Sets

As we know, we have exactly one million playlists available.

However, processing that much data is a challenge in itself, and our goal is not to reach the highest possible accuracy and win the challenge, but to experiment quickly with a few song recommendation strategies. We will therefore select a sample of the data set for the next steps of the project.

We may go back to the complete data set for the final training, but since many tests will be run, it is better to work with less data for now.

We will also set aside some of the playlists to validate our predictions.

<br><br>
<hr>
@author: [Pedro Correia](https://github.com/pfcor)

In [1]:
import numpy as np
import pandas as pd
import feather as ft
import os

In [2]:
RANDOM_STATE = 433
TRAIN_SIZE = 100000
TEST_SIZE = 1000

## Playlists

In [3]:
%%time

# loading playlists
df_playlists = ft.read_dataframe("data/formatted/playlists.fthr")

Wall time: 678 ms


In [4]:
# keeping only playlists with at least 25 tracks as candidates for selection
df_playlists = df_playlists[df_playlists["num_tracks"] >= 25].copy()

In [5]:
# shuffling playlists
df_playlists = df_playlists.sample(frac=1, random_state=RANDOM_STATE)

In [6]:
# selecting observations
df_playlists_train  = df_playlists.iloc[:TRAIN_SIZE].reset_index(drop=True) # 100k playlists for training
df_playlists_test   = df_playlists.iloc[TRAIN_SIZE:TRAIN_SIZE+TEST_SIZE].reset_index(drop=True) # 1k playlists for validation
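
A quick sanity check on the split above: both subsets are consecutive slices of the same shuffled frame, so they should not share any playlist. The sketch below reproduces the shuffle-then-slice pattern on a small synthetic frame. The `toy` names, the toy sizes, and the `pid` / `num_tracks` values are made up for illustration and are not the real data.

```python
import pandas as pd

RANDOM_STATE = 433
TRAIN_SIZE = 8   # toy sizes, much smaller than the real 100000 / 1000
TEST_SIZE = 2

# synthetic stand-in for df_playlists (values are illustrative only)
toy = pd.DataFrame({"pid": range(20), "num_tracks": [30] * 20})

# same shuffle-then-slice pattern as in the cell above
toy = toy.sample(frac=1, random_state=RANDOM_STATE)
toy_train = toy.iloc[:TRAIN_SIZE].reset_index(drop=True)
toy_test = toy.iloc[TRAIN_SIZE:TRAIN_SIZE + TEST_SIZE].reset_index(drop=True)

# the two slices never overlap, so no playlist lands in both sets
overlap = set(toy_train["pid"]) & set(toy_test["pid"])
print(len(toy_train), len(toy_test), len(overlap))
```

The same intersection check could be run on the real `pid` columns before saving, assuming they fit in memory as Python sets.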

In [7]:
# saving sample to disk
df_playlists_train.to_feather("data/train-test/playlists_train.fthr")
df_playlists_test.to_feather("data/train-test/playlists_test.fthr")

In [8]:
# keeping the selected playlist ids, then cleaning the namespace
pids_train = df_playlists_train["pid"].values
pids_test = df_playlists_test["pid"].values

del df_playlists
del df_playlists_train
del df_playlists_test

## Playlists - Tracks

In [None]:
%%time

# loading playlist_tracks
df_playlist_tracks = ft.read_dataframe("data/formatted/playlist_track.fthr")

In [10]:
# filtering by the selected playlists
df_playlist_tracks_train = df_playlist_tracks[df_playlist_tracks["pid"].isin(pids_train)].reset_index(drop=True)
df_playlist_tracks_test  = df_playlist_tracks[df_playlist_tracks["pid"].isin(pids_test)].reset_index(drop=True)
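
To make the `isin` filter concrete, here is a minimal sketch on a tiny synthetic playlist-track table. The `toy_pt` frame and the `toy_pids_train` list are hypothetical stand-ins for the real data and only illustrate the pattern.

```python
import pandas as pd

# synthetic playlist-track pairs (illustrative values only)
toy_pt = pd.DataFrame({
    "pid": [0, 0, 1, 2, 2, 3],
    "tid": [10, 11, 10, 12, 13, 14],
})
toy_pids_train = [0, 2]  # hypothetical selected playlists

# keep only the rows whose pid is among the selected playlists
toy_pt_train = toy_pt[toy_pt["pid"].isin(toy_pids_train)].reset_index(drop=True)

print(toy_pt_train["pid"].tolist())  # [0, 0, 2, 2]
print(toy_pt_train["tid"].tolist())  # [10, 11, 12, 13]
```

Rows from playlists 1 and 3 are dropped, and `reset_index(drop=True)` restores a default index on the filtered frame.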

In [11]:
# saving sample to disk
df_playlist_tracks_train.to_feather("data/train-test/playlist_track_train.fthr")
df_playlist_tracks_test.to_feather("data/train-test/playlist_track_test.fthr")

In [12]:
# keeping the artist and track ids in use, then cleaning the namespace

aids_train = df_playlist_tracks_train["aid"].unique()
aids_test = df_playlist_tracks_test["aid"].unique()

tids_train = df_playlist_tracks_train["tid"].unique()
tids_test = df_playlist_tracks_test["tid"].unique()

del df_playlist_tracks
del df_playlist_tracks_train
del df_playlist_tracks_test
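
Because the train and validation id arrays are built independently, some validation tracks may never appear in a training playlist. A minimal sketch of how that could be measured with NumPy follows. The `toy_tids_*` arrays are made-up values, not the real ids.

```python
import numpy as np

# synthetic track ids (illustrative values only)
toy_tids_train = np.array([10, 11, 12, 13])
toy_tids_test = np.array([12, 14])

# track ids present in validation playlists but in no training playlist
unseen = np.setdiff1d(toy_tids_test, toy_tids_train)

print(unseen.tolist())  # [14]
```

Applied to the real `tids_test` and `tids_train`, the length of this difference would indicate how many validation tracks a model trained only on the sample has never seen.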

## Artists & Tracks

In [3]:
%%time

# loading artists and tracks
df_artists = ft.read_dataframe("data/formatted/artists.fthr")
df_tracks = ft.read_dataframe("data/formatted/tracks.fthr")

Wall time: 9.41 s


In [14]:
# selecting only artists and tracks that appear in the selected playlists

df_artists_train = df_artists[df_artists["aid"].isin(aids_train)].reset_index(drop=True)
df_artists_test = df_artists[df_artists["aid"].isin(aids_test)].reset_index(drop=True)

df_tracks_train = df_tracks[df_tracks["tid"].isin(tids_train)].reset_index(drop=True)
df_tracks_test = df_tracks[df_tracks["tid"].isin(tids_test)].reset_index(drop=True)

In [15]:
# saving sample to disk

df_artists_train.to_feather("data/train-test/artists_train.fthr")
df_artists_test.to_feather("data/train-test/artists_test.fthr")

df_tracks_train.to_feather("data/train-test/tracks_train.fthr")
df_tracks_test.to_feather("data/train-test/tracks_test.fthr")

In [16]:
# cleaning namespace
del df_artists
del df_artists_train
del df_artists_test

del df_tracks
del df_tracks_train
del df_tracks_test

<hr>