In [2]:
import numpy as np
import pandas as pd
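import scipy.sparse  # import the submodule explicitly; on older SciPy versions a bare "import scipy" does not expose scipy.sparse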

The one-hot encoded version of the user data is saved as a SciPy sparse matrix. Feel free to perform EDA on the DataFrame version, but training requires the sparse matrix. Below is a very brief explanation of how sparse matrices work.

In [3]:
user_df = pd.read_csv('../data/filtered_user_df.csv')
sparse_matrix = scipy.sparse.load_npz('../data/user_data.npz')

Sparse matrices store only the nonzero entries together with the indices that locate them, which minimizes the storage needed. As a result, we can hold a massive one-hot encoded matrix that would ordinarily take hundreds of gigabytes to store densely. However, the matrix itself only knows integer positions, so we need a way to convert its indices back into meaningful information (usernames and artist names), as well as a way to convert those names into indices when querying the matrix. We define those mappings below.
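
As a rough illustration of the storage idea, here is a minimal sketch on a tiny, made-up matrix (`dense` and `toy` are hypothetical and have nothing to do with the real data). It assumes the CSR format, which may or may not be the format used in `user_data.npz`.

```python
import numpy as np
from scipy import sparse

# Toy play-count matrix: 3 users x 5 artists, mostly zeros (hypothetical data)
dense = np.array([
    [7, 0, 0, 0, 2],
    [0, 0, 3, 0, 0],
    [0, 1, 0, 0, 0],
])

toy = sparse.csr_matrix(dense)

# CSR keeps only the nonzero values plus the indices needed to locate them
print(toy.data)     # nonzero values, row by row: [7 2 3 1]
print(toy.indices)  # column index of each stored value: [0 4 2 1]
print(toy.indptr)   # where each row starts and ends in data/indices: [0 2 3 4]

# Stored entries versus total cells in the dense layout
print(toy.nnz, dense.size)  # 4 15
```

Only 4 of the 15 cells are stored here; the saving grows with the share of zeros in the matrix.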

In [4]:
user_ids = user_df['user'].unique()
artist_names = user_df['artist_name'].unique()

user_to_index = {user: i for i, user in enumerate(user_ids)}
artist_to_index = {artist: j for j, artist in enumerate(artist_names)}

index_to_user = {i: user for user, i in user_to_index.items()}
index_to_artist = {j: artist for artist, j in artist_to_index.items()}
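
These mappings only line up with the matrix if its rows and columns were laid out in the same order as `unique()` returns here. The sketch below shows one way such a matrix could be built so that the two agree. It is an assumption about the construction, not a record of how `user_data.npz` was actually produced; `toy_df`, the `play_count` column name, and the `toy_*` names are all hypothetical.

```python
import pandas as pd
from scipy import sparse

# Hypothetical long-format data; 'play_count' is an assumed column name
toy_df = pd.DataFrame({
    'user': ['ana', 'ana', 'ben', 'cy'],
    'artist_name': ['Eminem', 'Watsky', 'Eminem', 'Linkin Park'],
    'play_count': [10, 4, 7, 2],
})

# Same mapping pattern as above, applied to the toy data
toy_user_to_index = {u: i for i, u in enumerate(toy_df['user'].unique())}
toy_artist_to_index = {a: j for j, a in enumerate(toy_df['artist_name'].unique())}

# Translate each record's names into (row, column) positions
rows = toy_df['user'].map(toy_user_to_index).to_numpy()
cols = toy_df['artist_name'].map(toy_artist_to_index).to_numpy()

toy_matrix = sparse.coo_matrix(
    (toy_df['play_count'].to_numpy(), (rows, cols)),
    shape=(len(toy_user_to_index), len(toy_artist_to_index)),
).tocsr()

# Look up one cell by name rather than by position
plays = toy_matrix[toy_user_to_index['ben'], toy_artist_to_index['Eminem']]
print(plays)  # 7
```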

To demonstrate how these mappings can be used, let's take a look at a function that retrieves a user's play counts.

In [5]:
def get_user_play_ct(user):
    """Return a Series of play counts for `user`, indexed by artist name."""
    # Map the username to its row index in the matrix
    user_index = user_to_index[user]

    # Pull the row at the user index in the matrix (still sparse at this point)
    user_play_counts = sparse_matrix.getrow(user_index)

    # Convert to a dense (non-sparse) 1-D NumPy array
    user_play_counts_dense = user_play_counts.toarray().ravel()

    return pd.Series(user_play_counts_dense, index=artist_names)

In [6]:
get_user_play_ct('nyancrimew')

Jasmine Thompson                 7559
Eminem                           5851
Watsky                           3044
Linkin Park                      2938
twenty one pilots                1849
                                 ... 
Ray Conniff and His Orchestra       0
Dorit Chrysler                      0
Skintone                            0
Royale                              0
DJ Stickle                          0
Length: 97310, dtype: int64
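
Most of the 97,310 entries above are zeros. If only the artists a user has actually played are of interest, the stored values can be read straight from the sparse row without densifying it. This is a minimal sketch on made-up data, assuming a CSR matrix; `nonzero_play_counts`, `toy_artists`, and `toy_matrix` are hypothetical names standing in for the real objects.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical stand-ins for artist_names and sparse_matrix
toy_artists = np.array(['Eminem', 'Watsky', 'Linkin Park', 'Royale'])
toy_matrix = sparse.csr_matrix(np.array([
    [5, 0, 9, 0],
    [0, 2, 0, 0],
]))

def nonzero_play_counts(matrix, row_index, names):
    """Return the nonzero play counts for one row, largest first."""
    row = matrix.getrow(row_index)
    # row.indices holds the column positions of the values stored in row.data
    counts = pd.Series(row.data, index=names[row.indices])
    return counts.sort_values(ascending=False)

top = nonzero_play_counts(toy_matrix, 0, toy_artists)
print(top)
```

For the first toy row this yields Linkin Park (9) followed by Eminem (5), with the zero-count artists left out.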

While the sparseness of this matrix makes it well suited to training a model, it is awkward to perform EDA on. As such, **please use the DataFrame version of the data (filtered_user_df.csv) for EDA!** When training a model, however, the sparse matrix should be your go-to.