To download the dataset, follow the instructions here:
- https://www.kaggle.com/c/kkbox-music-recommendation-challenge/data?select=members.csv.7z

If you are running Arch Linux:
- git clone https://aur.archlinux.org/kaggle-api.git
- cd kaggle-api
- makepkg -si
- Go to the first link, create a Kaggle account, and agree to the competition rules
- Go to your account page on Kaggle, create an API key, and save the kaggle.json file in the folder ~/.kaggle/
- kaggle competitions download -c kkbox-music-recommendation-challenge

In [1]:
import os
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow.data as tfd


from loguru import logger
from tqdm import tqdm

from typing import List, Any, Tuple, Optional, Dict

In [2]:
datapath: str = os.path.join('..', 'data')

In [3]:
def list_files(directory: str, extension: str) -> List[str]:
    all_files = os.listdir(directory)
    return [os.path.join(directory, file) for file in all_files if file.split('.')[-1] == extension]

In [4]:
datasets: Dict[str, pd.DataFrame] = dict()
for filepath in tqdm(list_files(datapath, 'csv'), ascii=True, desc="Loading data from disk"):
    datasets[os.path.basename(filepath).split('.')[0]] = pd.read_csv(filepath)

Loading data from disk: 100%|############################################################| 6/6 [00:17<00:00,  2.96s/it]


In [5]:
print(f"Loaded {len(datasets)} csv files")
for key, value in datasets.items():
    print(f"length of dataset {key} is {len(value)}")

Loaded 6 csv files
length of dataset members is 34403
length of dataset sample_submission is 2556790
length of dataset songs is 2296320
length of dataset song_extra_info is 2295971
length of dataset test is 2556790
length of dataset train is 7377418


In [6]:
# Let's remove the 'sample_submission' dataset
_ = datasets.pop('sample_submission')

In [7]:
for key, value in datasets.items():
    print(f"Information for dataset: {key}")
    print("Description")
    print(value.describe())
    print('\n')
    print("dataframe 'head'")
    print(value.head())
    print('\n\n-------------------------------------------------------\n\n')

Information for dataset: members
Description
               city            bd  registered_via  registration_init_time  \
count  34403.000000  34403.000000    34403.000000            3.440300e+04   
mean       5.371276     12.280935        5.953376            2.013994e+07   
std        6.243929     18.170251        2.287534            2.954015e+04   
min        1.000000    -43.000000        3.000000            2.004033e+07   
25%        1.000000      0.000000        4.000000            2.012103e+07   
50%        1.000000      0.000000        7.000000            2.015090e+07   
75%       10.000000     25.000000        9.000000            2.016110e+07   
max       22.000000   1051.000000       16.000000            2.017023e+07   

       expiration_date  
count     3.440300e+04  
mean      2.016901e+07  
std       7.320925e+03  
min       1.970010e+07  
25%       2.017020e+07  
50%       2.017091e+07  
75%       2.017093e+07  
max       2.020102e+07  


dataframe 'head'
                 

### In English:
- we have a list of users, their personal information, the songs that they liked and didn't like, and where they accessed each song
- we also have metadata about each song in the dataset
- the dataframe `describe` output appears to be truncated in JupyterLab and does not show all of the columns, so I also printed the head of each dataframe

# The first task is creating one dataframe that can hold all of our song information robustly
# And another which can handle our user data robustly
If your data contains both numeric and categorical variables, a practical way to cluster it is to compute principal components of the dataset and use the principal component scores as input to the clustering.

Remember that you can always get principal components for categorical variables using a multiple correspondence analysis (MCA). You can then run a separate PCA on the numerical variables and use the combined components as input to your clustering.

Alternatively, you could use the R packages FactoMineR or PCAmixdata to carry out a factor analysis of mixed data (FAMD); the output is again a set of principal components, which you then use as input to your clustering.

Remember that, in a simplistic sense, clustering and principal components do similar things: both are built on distances between observations, e.g. Euclidean distance.
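
The MCA-plus-PCA idea above can be sketched on toy data. This is only an illustration under stated assumptions: the table and its `play_count` column are made up, and a one-hot encoding followed by `TruncatedSVD` is used as a rough stand-in for MCA (true MCA also reweights the indicator matrix), so treat the result as an approximation rather than the method used later in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for a mixed-type song table (all values are made up).
toy = pd.DataFrame({
    "genre": ["pop", "pop", "rock", "rock", "jazz", "jazz"],
    "language": ["zh", "zh", "en", "en", "en", "zh"],
    "song_length": [240.0, 250.0, 180.0, 200.0, 420.0, 400.0],
    "play_count": [1000.0, 900.0, 300.0, 350.0, 50.0, 40.0],  # hypothetical column
})
categorical_cols = ["genre", "language"]
numerical_cols = ["song_length", "play_count"]

# Categorical part: one-hot indicator matrix reduced with truncated SVD.
# This approximates MCA; it is not a full MCA implementation.
indicators = OneHotEncoder().fit_transform(toy[categorical_cols])
cat_scores = TruncatedSVD(n_components=2, random_state=0).fit_transform(indicators)

# Numerical part: standardise, then ordinary PCA.
scaled = StandardScaler().fit_transform(toy[numerical_cols])
num_scores = PCA(n_components=2, random_state=0).fit_transform(scaled)

# Concatenate both sets of component scores and cluster on them.
combined = np.hstack([cat_scores, num_scores])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(combined)
print(combined.shape, labels)
```

Each row of `combined` is one song described by two categorical and two numerical component scores; any distance-based clustering algorithm could replace `KMeans` here.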

Dataset descriptions:
- members.csv
    - user ID, location (city), gender, bd, registered_via, registration_init_time, expiration_date
- song_extra_info + songs.csv
    - song_id, song name, artist name, composer, lyricist, genre id, song length, language
- train.csv:
    - user ID, song ID, where they found the song, whether they liked it or not
    

Task theoreticals:
    - theoretical
        - cluster by attributes and label by user ID, i.e. each data point is a user with a certain value assigned
    - practical
        - difficult, as some of the most important information is artist and composer, and it is hard to assign a numerical value to those attributes with the information at hand

    - theoretical
        - create an embedding between the user ID, song ID, composers, and how the user is related to the song
        - cluster on the embedding

    - theoretical
        - run a multiple correspondence analysis (MCA) on the categorical variables and a separate PCA on the numerical variables, use the combined components as input to a clustering algorithm (possibly spectral, because the dataset most probably has a complex but ordered topology), and then recommend users songs from the clusters they are interested in
    - practical
        - could end up recommending too many songs, but we can tune the "looking distance" of the final recommender
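
One way to read the "looking distance" above is as a radius around a song the user liked in whatever embedding or component space the songs end up in. The sketch below assumes a made-up 2-D embedding and a hypothetical helper `recommend_within`; it uses scikit-learn's `NearestNeighbors` radius query and is not the recommender built in this notebook.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical 2-D song embedding; row i belongs to song_ids[i].
song_ids = np.array(["a", "b", "c", "d", "e"])
embedding = np.array([
    [0.0, 0.0],
    [0.1, 0.0],
    [0.0, 0.3],
    [2.0, 2.0],
    [2.1, 2.0],
])


def recommend_within(liked_song: str, looking_distance: float) -> list:
    """Return the songs whose embedding lies within looking_distance of liked_song."""
    index = NearestNeighbors(radius=looking_distance).fit(embedding)
    query = embedding[song_ids == liked_song]
    neighbours = index.radius_neighbors(query, return_distance=False)[0]
    # Drop the liked song itself and return plain Python strings.
    return sorted(str(s) for s in song_ids[neighbours] if s != liked_song)


print(recommend_within("a", 0.2))  # small radius: few recommendations
print(recommend_within("a", 0.5))  # larger radius: more recommendations
```

Shrinking the radius trades coverage for precision, which is the tuning knob the last bullet refers to.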
        
EITHER CLUSTER THE SONGS AND THEN DO A SECONDARY PROCESS TO RECOMMEND TO USERS, OR CLUSTER THE USERS AND RECOMMEND ENTIRE LIBRARIES

New Dataset:
- categorical:
    - artist name, composer, lyricist, genre_id, language
- numerical:
    - song length
- label:
    - song id
    
This will be used to create clusters of songs, or song groups

Recommendation dataset:
- user ID, song rating, song cluster, song feature vector
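
A minimal sketch of how such a recommendation dataset could be assembled once every song has a cluster. The column names `msno` (user ID) and `target` (rating) are assumptions about train.csv, the cluster assignments are invented, and `recommend` is a hypothetical helper.

```python
import pandas as pd

# Toy interaction log. 'msno' (user id) and 'target' (1 = liked) are
# assumed column names; the values are made up.
train = pd.DataFrame({
    "msno": ["u1", "u1", "u1", "u2", "u2"],
    "song_id": ["s1", "s2", "s3", "s1", "s4"],
    "target": [1, 1, 0, 0, 1],
})

# Hypothetical cluster assignment for every song.
song_clusters = pd.DataFrame({
    "song_id": ["s1", "s2", "s6", "s3", "s4", "s5"],
    "cluster": [0, 0, 0, 1, 1, 1],
})

# Attach each interaction to the cluster of its song.
interactions = train.merge(song_clusters, on="song_id", how="inner")

# Mean rating per (user, cluster): a crude measure of how much a user likes a cluster.
affinity = (
    interactions.groupby(["msno", "cluster"])["target"]
    .mean()
    .unstack(fill_value=0.0)
)


def recommend(user: str) -> list:
    """Songs from the user's best-rated cluster that the user has not seen yet."""
    best_cluster = affinity.loc[user].idxmax()
    seen = set(train.loc[train["msno"] == user, "song_id"])
    in_cluster = song_clusters.loc[song_clusters["cluster"] == best_cluster, "song_id"]
    return [song for song in in_cluster if song not in seen]


print(recommend("u1"))
print(recommend("u2"))
```

The song feature vector from the list above is left out here; it could be merged onto `interactions` in the same way as the cluster label.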

In [8]:
len(datasets['songs'])
datasets["songs"]

Unnamed: 0,song_id,song_length,genre_ids,artist_name,composer,lyricist,language
0,CXoTN1eb7AI+DntdU1vbcwGRV4SCIDxZu+YD8JP8r4E=,247640,465,張信哲 (Jeff Chang),董貞,何啟弘,3.0
1,o0kFgae9QtnYgRkVPqLJwa05zIhRlUjfF7O1tDw0ZDU=,197328,444,BLACKPINK,TEDDY| FUTURE BOUNCE| Bekuh BOOM,TEDDY,31.0
2,DwVvVurfpuz+XPuFvucclVQEyPqcpUkHR0ne1RQzPs0=,231781,465,SUPER JUNIOR,,,31.0
3,dKMBWoZyScdxSkihKG+Vf47nc18N9q4m58+b4e7dSSE=,273554,465,S.H.E,湯小康,徐世珍,3.0
4,W3bqWd3T+VeHFzHAUfARgW9AvVRaF4N5Yzm4Mr6Eo/o=,140329,726,貴族精選,Traditional,Traditional,52.0
...,...,...,...,...,...,...,...
2296315,lg6rn7eV/ZNg0+P+x77kHUL7GDMfoL4eMtXxncseLNA=,20192,958,Catherine Collard,Robert Schumann (1810-1856),,-1.0
2296316,nXi1lrSJe+gLoTTNky7If0mNPrIyCQCLwagwR6XopGU=,273391,465,紀文惠 (Justine Chi),,,3.0
2296317,9KxSvIjbJyJzfEVWnkMbgR6dyn6d54ot0N5FKyKqii8=,445172,1609,Various Artists,,,52.0
2296318,UO8Y2MR2sjOn2q/Tp8/lzZTGKmLEvwZ20oWanG4XnYc=,172669,465,Peter Paul & Mary,,,52.0


In [9]:
raw = pd.DataFrame()
# categorical variables
raw["artist_name"] = datasets["songs"]["artist_name"]
raw["composer"] = datasets["songs"]["composer"] 
raw["lyricist"] = datasets["songs"]["lyricist"]
raw["genre_id"] = datasets["songs"]["genre_ids"]
raw["language"] = datasets["songs"]["language"]
# numerical variable
raw["song_length"] = datasets["songs"]["song_length"]
# label
raw["song_id"] = datasets["songs"]["song_id"]

In [10]:
raw

Unnamed: 0,artist_name,composer,lyricist,genre_id,language,song_length,song_id
0,張信哲 (Jeff Chang),董貞,何啟弘,465,3.0,247640,CXoTN1eb7AI+DntdU1vbcwGRV4SCIDxZu+YD8JP8r4E=
1,BLACKPINK,TEDDY| FUTURE BOUNCE| Bekuh BOOM,TEDDY,444,31.0,197328,o0kFgae9QtnYgRkVPqLJwa05zIhRlUjfF7O1tDw0ZDU=
2,SUPER JUNIOR,,,465,31.0,231781,DwVvVurfpuz+XPuFvucclVQEyPqcpUkHR0ne1RQzPs0=
3,S.H.E,湯小康,徐世珍,465,3.0,273554,dKMBWoZyScdxSkihKG+Vf47nc18N9q4m58+b4e7dSSE=
4,貴族精選,Traditional,Traditional,726,52.0,140329,W3bqWd3T+VeHFzHAUfARgW9AvVRaF4N5Yzm4Mr6Eo/o=
...,...,...,...,...,...,...,...
2296315,Catherine Collard,Robert Schumann (1810-1856),,958,-1.0,20192,lg6rn7eV/ZNg0+P+x77kHUL7GDMfoL4eMtXxncseLNA=
2296316,紀文惠 (Justine Chi),,,465,3.0,273391,nXi1lrSJe+gLoTTNky7If0mNPrIyCQCLwagwR6XopGU=
2296317,Various Artists,,,1609,52.0,445172,9KxSvIjbJyJzfEVWnkMbgR6dyn6d54ot0N5FKyKqii8=
2296318,Peter Paul & Mary,,,465,52.0,172669,UO8Y2MR2sjOn2q/Tp8/lzZTGKmLEvwZ20oWanG4XnYc=


In [11]:
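import re


def only_letters(tested_string) -> bool:
    """True if the value, as a string, contains only ASCII letters (or is empty)."""
    # Note: a missing value becomes the string 'nan', which passes this check.
    return re.fullmatch(r"[a-zA-Z]*", str(tested_string)) is not None


def only_numbers(tested_string) -> bool:
    """True if the value, as a string, contains only ASCII digits (or is empty)."""
    return re.fullmatch(r"[0-9]*", str(tested_string)) is not None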

In [None]:
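# Keep only values that pass the check; everything else becomes NaN.
# This also discards names with spaces, punctuation, or non-Latin characters.
raw['artist_name'] = raw['artist_name'].where(raw['artist_name'].apply(only_letters))
raw['lyricist'] = raw['lyricist'].where(raw['lyricist'].apply(only_letters))
raw['composer'] = raw['composer'].where(raw['composer'].apply(only_letters))
raw['genre_id'] = raw['genre_id'].where(raw['genre_id'].apply(only_numbers))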

In [None]:
raw

We are working with a truly massive dataset: allocating enough memory to run an MCA on even just the categorical columns of `raw` would take over 1.4 TiB,
so let's drop every row that contains an NA.
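
To see why the memory requirement grows so quickly, here is a rough estimate of the size of a dense one-hot (indicator) matrix, which is what an MCA conceptually operates on. The helper `dense_indicator_bytes` and the toy frame are hypothetical, and the 8 bytes per cell assumes float64 storage; the sketch does not attempt to reproduce the 1.4 TiB figure.

```python
import pandas as pd


def dense_indicator_bytes(frame: pd.DataFrame, columns: list, bytes_per_cell: int = 8) -> int:
    """Bytes needed for a dense one-hot matrix over the given categorical columns."""
    # One indicator column per distinct value of each categorical column.
    n_indicator_columns = sum(frame[col].nunique() for col in columns)
    return len(frame) * n_indicator_columns * bytes_per_cell


# Toy frame (made-up values): 3 distinct artists + 3 distinct languages = 6 indicator columns.
toy = pd.DataFrame({
    "artist_name": ["a", "b", "c", "a"],
    "language": [3.0, 31.0, 3.0, 52.0],
})

# 4 rows * 6 indicator columns * 8 bytes
print(dense_indicator_bytes(toy, ["artist_name", "language"]))
```

The cost is the row count times the total number of distinct categories, so reducing either the rows or the number of distinct artists, composers, and lyricists shrinks it directly.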

In [None]:
raw.dropna(axis=0, how='any', inplace=True)

In [None]:
raw

In [None]:
raw = raw[(raw.artist_name != 'Various Artists') & (raw.artist_name != 'Various')]
raw = raw[raw.language != -1]

Another way to reduce the number of categories for MCA is to keep only the most frequent values in each categorical column, for example the top 100 artists.

In [None]:
num_artists = 100
num_composers = 500
num_lyricists = 500
num_genres = 500
num_languages = 3

In [None]:
raw

In [None]:
raw = raw[raw.artist_name.isin(raw['artist_name'].value_counts(sort=True).index.tolist()[:num_artists])]
raw = raw[raw.composer.isin(raw['composer'].value_counts(sort=True).index.tolist()[:num_composers])]
raw = raw[raw.lyricist.isin(raw['lyricist'].value_counts(sort=True).index.tolist()[:num_lyricists])]
raw = raw[raw.genre_id.isin(raw['genre_id'].value_counts(sort=True).index.tolist()[:num_genres])]
raw = raw[raw.language.isin(raw['language'].value_counts(sort=True).index.tolist()[:num_languages])]

In [None]:
raw

In [None]:
raw.artist_name.unique()

In [None]:
raw.composer.unique()

In [None]:
raw.lyricist.unique()

In [None]:
raw.genre_id.unique()

In [None]:
raw.language.unique()

In [None]:
import prince
famd = prince.FAMD(
    n_components=3,
    n_iter=5,
    copy=True,
    check_input=True,
    engine='auto',
    random_state=42
)
famd = famd.fit(raw.drop(['song_length', 'song_id'], axis='columns'))

In [None]:
raw

In [None]:
# Pass the same columns the model was fitted on
famd.row_coordinates(raw.drop(['song_length', 'song_id'], axis='columns'))

In [None]:
ax = famd.plot_row_coordinates(
    X=raw.drop(['song_length', 'song_id'], axis='columns'),  # same columns the model was fitted on
    ax=None,
    figsize=(6, 6),
    x_component=0,
    y_component=2,
    color_labels=[f"Artist: {t}" for t in raw["artist_name"]],
    ellipse_outline=False,
    ellipse_fill=True,
    show_points=True
)

In [None]:
ax.get_figure().savefig('famd_row_coordinates.svg')

In [None]:
famd.eigenvalues_

In [None]:
# Pass the same columns the model was fitted on
famd.column_correlations(raw.drop(['song_length', 'song_id'], axis='columns'))

In [None]:
import scipy
import scipy.cluster.hierarchy as sch

In [None]:
from sklearn.preprocessing import LabelEncoder

# Integer-encode every categorical column; song_id and song_length are skipped here.
le = LabelEncoder()
encoded = pd.DataFrame()
for colname in raw.columns:
    if colname in ("song_id", "song_length"):
        continue
    encoded[colname] = le.fit_transform(raw[colname])

In [None]:
encoded["song_id"] = raw.song_id.values
encoded["song_length"] = raw.song_length.values

In [None]:
encoded.song_id

In [None]:
encoded

In [None]:
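from scipy.spatial.distance import pdist

# Hamming distance: the fraction of categorical attributes on which two songs differ
d = pdist(encoded.drop(['song_id', 'song_length'], axis=1), 'hamming')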

In [None]:
L = sch.linkage(d, method='complete')

In [None]:
ind = sch.fcluster(L, t=0.9, criterion='distance')

In [None]:
ind

In [None]:
np.unique(ind)

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(15, 15))
# The palette must be set before the dendrogram is drawn to have any effect
sch.set_link_color_palette(['m', 'c', 'y', 'k'])
dn = sch.dendrogram(L, p=7, truncate_mode='level')
sch.set_link_color_palette(None)  # reset to default after use
plt.show()

In [None]:
cluster_memberships = pd.DataFrame({"song_id": raw["song_id"], "cluster_membership": ind})

In [None]:
cluster_memberships