To download the dataset, follow the instructions here:
- https://www.kaggle.com/c/kkbox-music-recommendation-challenge/data?select=members.csv.7z

If you are running Arch Linux:
- git clone https://aur.archlinux.org/kaggle-api.git
- cd kaggle-api
- makepkg -si
- Go to the first link above, create a Kaggle account, and accept the competition rules
- Go to your account page on Kaggle, create an API key, and save the kaggle.json file in the folder ~/.kaggle/
- kaggle competitions download -c kkbox-music-recommendation-challenge
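
Before running the download command, it can help to confirm that the API key file is where the Kaggle CLI looks for it. The sketch below is a hypothetical helper, not part of the Kaggle tooling. It assumes kaggle.json is a JSON object with `username` and `key` fields, and that the file should not be readable by other users. The name `check_kaggle_credentials` is made up for illustration.

```python
import json
import os
import stat
from typing import List


def check_kaggle_credentials(path: str) -> List[str]:
    """Return a list of problems found with a kaggle.json file (empty if none)."""
    if not os.path.isfile(path):
        return [f"{path} does not exist"]
    try:
        with open(path) as handle:
            content = json.load(handle)
    except json.JSONDecodeError:
        return [f"{path} is not valid JSON"]
    if not isinstance(content, dict):
        return [f"{path} does not contain a JSON object"]

    problems: List[str] = []
    # Assumption: the key file carries these two fields
    for field in ("username", "key"):
        if field not in content:
            problems.append(f"{path} is missing the '{field}' field")
    # Assumption: the file should only be accessible by its owner (chmod 600)
    if os.name == "posix" and stat.S_IMODE(os.stat(path).st_mode) & 0o077:
        problems.append(f"{path} is accessible by other users; consider chmod 600")
    return problems


# Check the default location mentioned in the steps above
print(check_kaggle_credentials(os.path.expanduser("~/.kaggle/kaggle.json")))
```

An empty list means none of the checked problems were found. It does not guarantee that the key itself is still valid.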

In [44]:
import os
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow.data as tfd


from loguru import logger
from tqdm import tqdm

from typing import List, Any, Tuple, Optional, Dict

In [45]:
datapath: str = os.path.join('..', 'data')

In [46]:
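def list_files(directory: str, extension: str) -> List[str]:
    """Return paths of the files in `directory` with the given extension (passed without the dot)."""
    all_files = os.listdir(directory)
    # splitext avoids matching a dot-less entry that is literally named e.g. 'csv'
    return [os.path.join(directory, file) for file in all_files if os.path.splitext(file)[1] == f'.{extension}']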

In [50]:
datasets: Dict[str, pd.DataFrame] = dict()
for filepath in tqdm(list_files(datapath, 'csv'), ascii=True, desc="Loading data from disk"):
    datasets[os.path.basename(filepath).split('.')[0]] = pd.read_csv(filepath)

Loading data from disk: 100%|##########| 6/6 [00:19<00:00,  3.22s/it]


In [52]:
print(f"Loaded {len(datasets)} csv files")
for key, value in datasets.items():
    print(f"length of dataset {key} is {len(value)}")

Loaded 6 csv files
length of dataset sample_submission is 2556790
length of dataset members is 34403
length of dataset test is 2556790
length of dataset train is 7377418
length of dataset songs is 2296320
length of dataset song_extra_info is 2295971


In [53]:
# Let's remove the 'sample_submission' dataset
_ = datasets.pop('sample_submission')

In [62]:
for key, value in datasets.items():
    print(f"Information for dataset: {key}")
    print("Description")
    print(value.describe())
    print('\n')
    print("dataframe 'head'")
    print(value.head())
    print('\n\n-------------------------------------------------------\n\n')

Information for dataset: members
Description
               city            bd  registered_via  registration_init_time  \
count  34403.000000  34403.000000    34403.000000            3.440300e+04   
mean       5.371276     12.280935        5.953376            2.013994e+07   
std        6.243929     18.170251        2.287534            2.954015e+04   
min        1.000000    -43.000000        3.000000            2.004033e+07   
25%        1.000000      0.000000        4.000000            2.012103e+07   
50%        1.000000      0.000000        7.000000            2.015090e+07   
75%       10.000000     25.000000        9.000000            2.016110e+07   
max       22.000000   1051.000000       16.000000            2.017023e+07   

       expiration_date  
count     3.440300e+04  
mean      2.016901e+07  
std       7.320925e+03  
min       1.970010e+07  
25%       2.017020e+07  
50%       2.017091e+07  
75%       2.017093e+07  
max       2.020102e+07  


dataframe 'head'
                 

### In English:
- we have a list of users, their personal information, the songs that they liked and didn't like, and where they accessed each song
- we also have metadata about each song in the dataset
- the dataframe describe function only summarizes numeric columns by default, which is why some columns are missing from its output (it is not a Jupyter Lab bug), so I also printed the "head" of each dataframe
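
As a minimal sketch of how the missing columns could be brought into the summary, pandas' `describe` accepts `include="all"`, which adds rows such as `unique`, `top` and `freq` for non-numeric columns. The toy frame and its column names below are made up for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for one of the loaded datasets
toy = pd.DataFrame({
    "user": ["a", "b", "c"],  # non-numeric column (made-up name)
    "age": [25, 0, 31],       # numeric column (made-up name)
})

numeric_only = toy.describe()             # default: numeric columns only
everything = toy.describe(include="all")  # also summarizes non-numeric columns

print(list(numeric_only.columns))
print(list(everything.columns))
```

With `include="all"`, statistics that do not apply to a column (for example `mean` for a string column) are reported as NaN.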

In [None]:
# The first task is creating one dataframe that can hold all of our song information robustly
# And another which can handle our user data robustly
# And pos