# Recommendations Using Embarrassingly Shallow Autoencoders for Sparse Data (EASE)

**Objective:**
To recommend tracks for a playlist based on similar playlists.

**Dataset Description**: 

The Million Playlist Dataset (MPD) contains 1,000,000 playlists created by users on the Spotify platform. Each playlist in the MPD contains a playlist title, the track list (including track metadata), editing information (last edit time, number of playlist edits), and other miscellaneous information about the playlist.

See the ***references*** section for more details.



In [1]:
# Importing the libraries
import json
import csv
import pandas as pd
import numpy as np
import os
import sys
import logging
import torch


The Million Playlist Dataset consists of 1,000 slice files. For example, the first 1,000 playlists in the MPD are in a file called 
`mpd.slice.0-999.json`. Each slice file is a JSON dictionary with two fields: *info* and *playlists*.

We need to parse the JSON files to extract the track list and the playlist title.
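
As a rough illustration of that layout, the sketch below parses a hand-written stand-in for a slice file and pulls out each playlist title with its track count. The contents of `info` and the trimmed set of track fields are assumptions made for the example; real slices hold 1,000 playlists and more fields per track.

```python
import json

# Hand-written stand-in for one slice file (illustrative values only)
raw = json.dumps({
    "info": {"slice": "0-999"},  # assumed contents, for illustration
    "playlists": [
        {
            "name": "Throwbacks",
            "pid": 0,
            "tracks": [
                {"pos": 0, "track_uri": "spotify:track:0UaMYEvWZi0ZqiDOoHU3YI"},
                {"pos": 1, "track_uri": "spotify:track:6I9VzXrHxO9rA9A5euc8Ak"},
            ],
        }
    ],
})

current_slice = json.loads(raw)

# The two top-level fields of a slice
top_level_keys = sorted(current_slice.keys())

# Playlist title and number of tracks for every playlist in the slice
titles_and_sizes = [(p["name"], len(p["tracks"])) for p in current_slice["playlists"]]

print(top_level_keys)
print(titles_and_sizes)
```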

Data Preprocessing
------------------


In [2]:
def json_to_df(path, no_of_files):
    '''
    Loads up to `no_of_files` MPD slice files from `path` and returns
    their playlists as a single dataframe (one row per playlist)
    '''
    mpd_playlists = []
    count = 0
    # Note: sorted() orders file names lexicographically, not by slice number
    for fname in sorted(os.listdir(path)):
        # Skip anything that is not a slice file so stray files are not counted
        if not (fname.startswith("mpd.slice.") and fname.endswith(".json")):
            continue
        with open(os.path.join(path, fname)) as f:
            current_slice = json.load(f)
        mpd_playlists.extend(current_slice['playlists'])
        count += 1
        if count == no_of_files:
            break
    return pd.DataFrame(mpd_playlists)


In [3]:
df_playlists = json_to_df(path = 'spotify_million_playlist_dataset/data/', no_of_files = 1)
df_playlists.head()

| | name | collaborative | pid | modified_at | num_tracks | num_albums | num_followers | tracks | num_edits | duration_ms | num_artists | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Throwbacks | False | 0 | 1493424000 | 52 | 47 | 1 | [{'pos': 0, 'artist_name': 'Missy Elliott', 't... | 6 | 11532414 | 37 | |
| 1 | Awesome Playlist | False | 1 | 1506556800 | 39 | 23 | 1 | [{'pos': 0, 'artist_name': 'Survivor', 'track_... | 5 | 11656470 | 21 | |
| 2 | korean | False | 2 | 1505692800 | 64 | 51 | 1 | [{'pos': 0, 'artist_name': 'Hoody', 'track_uri... | 18 | 14039958 | 31 | |
| 3 | mat | False | 3 | 1501027200 | 126 | 107 | 1 | [{'pos': 0, 'artist_name': 'Camille Saint-Saën... | 4 | 28926058 | 86 | |
| 4 | 90s | False | 4 | 1401667200 | 17 | 16 | 2 | [{'pos': 0, 'artist_name': 'The Smashing Pumpk... | 7 | 4335282 | 16 | |


The dataframe above holds one row per playlist. The tracks of each playlist are nested as a list of dictionaries in the `tracks` column, so they need to be flattened into a separate dataframe with one row per track.
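
One way to picture the flattening is with `pd.json_normalize`, which expands a nested list of records while carrying selected parent fields along. This is only a sketch on toy data (the second playlist's track name is a placeholder), not the approach used in the next cell.

```python
import pandas as pd

# Toy playlists dataframe with a nested `tracks` column (illustrative values)
df_toy = pd.DataFrame({
    "pid": [0, 1],
    "name": ["Throwbacks", "Awesome Playlist"],
    "tracks": [
        [{"pos": 0, "track_name": "Toxic"}, {"pos": 1, "track_name": "Crazy In Love"}],
        [{"pos": 0, "track_name": "placeholder track"}],
    ],
})

# record_path names the nested list to expand into rows;
# meta lists the playlist-level fields to repeat on every track row
flat = pd.json_normalize(
    df_toy.to_dict("records"),
    record_path="tracks",
    meta=["pid", "name"],
).rename(columns={"name": "playlist_name"})

print(flat)
```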

In [4]:
def tracksToDataFrame(df):
    '''
    Converts the nested tracks column to a dataframe with one row per track
    '''
    columns = ['pid',
    'pos',
    'playlist_name',
    'artist_name',
    'track_uri',
    'artist_uri',
    'track_name',
    'album_uri',
    'duration_ms',
    'album_name']

    # Build one dataframe per playlist and concatenate once at the end.
    # DataFrame.append was removed in pandas 2.0 and was slow inside a loop.
    frames = []
    for pid, name, tracks in zip(df['pid'], df['name'], df['tracks']):
        temp = pd.DataFrame(tracks) # One row per track in this playlist
        temp.insert(0, 'pid', pid) # Tag every track with its playlist id
        temp.insert(1, 'playlist_name', name) # ... and with its playlist name
        frames.append(temp)

    if not frames:
        return pd.DataFrame(columns=columns)
    return pd.concat(frames).reindex(columns=columns)

In [5]:
# Creating a dataframe with all the tracks and their playlists id
df_tracks = tracksToDataFrame(df_playlists)
df_tracks.head()

| | pid | pos | playlist_name | artist_name | track_uri | artist_uri | track_name | album_uri | duration_ms | album_name |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | Throwbacks | Missy Elliott | spotify:track:0UaMYEvWZi0ZqiDOoHU3YI | spotify:artist:2wIVse2owClT7go1WT98tk | Lose Control (feat. Ciara & Fat Man Scoop) | spotify:album:6vV5UrXcfyQD1wu4Qo2I9K | 226863 | The Cookbook |
| 1 | 0 | 1 | Throwbacks | Britney Spears | spotify:track:6I9VzXrHxO9rA9A5euc8Ak | spotify:artist:26dSoYclwsYLMAKD3tpOr4 | Toxic | spotify:album:0z7pVBGOD7HCIB7S8eLkLI | 198800 | In The Zone |
| 2 | 0 | 2 | Throwbacks | Beyoncé | spotify:track:0WqIKmW4BTrj3eJFmnCKMv | spotify:artist:6vWDO969PvNqNYHIOW5v0m | Crazy In Love | spotify:album:25hVFAxTlDvXbx2X2QkUkE | 235933 | Dangerously In Love (Alben für die Ewigkeit) |
| 3 | 0 | 3 | Throwbacks | Justin Timberlake | spotify:track:1AWQoqb9bSvzTjaLralEkT | spotify:artist:31TPClRtHm23RisEBtV3X7 | Rock Your Body | spotify:album:6QPkyl04rXwTGlGlcYaRoW | 267266 | Justified |
| 4 | 0 | 4 | Throwbacks | Shaggy | spotify:track:1lzr43nnXAijIGYnCT8M8H | spotify:artist:5EvFsr3kj42KNv97ZEnqij | It Wasn't Me | spotify:album:6NmFmPX56pcLBOFMhIiKvF | 227600 | Hot Shot |


In [6]:
# Save to csv
df_tracks.to_csv('data_CSV/tracks_playlist.csv')  

We can now use the above dataframe to build the utility matrix for the recommendation system.

Feature Engineering
--------------------

In [7]:
# Building the utility matrix
def build_utility_matrix(df_tracks):
    cols = ['pid', 'track_uri']
    df_utility_matrix = df_tracks[cols]
    return df_utility_matrix

In [8]:
df_utility_matrix = build_utility_matrix(df_tracks)
df_utility_matrix.head()

| | pid | track_uri |
|---|---|---|
| 0 | 0 | spotify:track:0UaMYEvWZi0ZqiDOoHU3YI |
| 1 | 0 | spotify:track:6I9VzXrHxO9rA9A5euc8Ak |
| 2 | 0 | spotify:track:0WqIKmW4BTrj3eJFmnCKMv |
| 3 | 0 | spotify:track:1AWQoqb9bSvzTjaLralEkT |
| 4 | 0 | spotify:track:1lzr43nnXAijIGYnCT8M8H |
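
The output above is a long list of (playlist, track) pairs rather than a matrix. The sketch below shows, on made-up pairs with shortened placeholder URIs, how such pairs correspond to a binary playlist-by-track matrix. It uses `pd.factorize` and a dense NumPy array for readability, which is an assumption of this example and would not scale to the full dataset.

```python
import numpy as np
import pandas as pd

# Toy (pid, track_uri) pairs; the URIs are shortened placeholders
pairs = pd.DataFrame({
    "pid": [0, 0, 1, 1, 2],
    "track_uri": ["t:a", "t:b", "t:b", "t:c", "t:a"],
})

# factorize assigns each distinct value a contiguous integer code,
# in order of first appearance
row_idx, playlists = pd.factorize(pairs["pid"])
col_idx, tracks = pd.factorize(pairs["track_uri"])

# Rows are playlists, columns are tracks; 1.0 marks "track is in playlist"
X = np.zeros((len(playlists), len(tracks)), dtype=np.float32)
X[row_idx, col_idx] = 1.0

print(list(tracks))
print(X)
```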


Model Architecture
------------------

In [10]:
class TorchEASE:
    def __init__(
        self, train, user_col="user_id", item_col="item_id", score_col=None, reg=0.05
    ):
        """

        :param train: Training DataFrame of user, item, score(optional) values
        :param user_col: Column name for users
        :param item_col: Column name for items
        :param score_col: Column name for scores. Implicit feedback otherwise
        :param reg: Regularization parameter
        """
        # Logging
        logging.basicConfig(
            format="%(asctime)s [%(levelname)s] %(name)s - %(message)s",
            level=logging.INFO,
            datefmt="%Y-%m-%d %H:%M:%S",
            stream=sys.stdout,
        ) 

        self.logger = logging.getLogger("notebook") 
        self.logger.info("Building user + item lookup") 
        
        #Regularization parameter
        self.reg = reg 

        # User and item columns
        self.user_col = user_col
        self.item_col = item_col

        self.user_id_col = user_col + "_id"
        self.item_id_col = item_col + "_id"

        
        self.user_lookup = self.generate_labels(train, self.user_col)
        self.item_lookup = self.generate_labels(train, self.item_col)

        # item index (matrix column) -> original item value
        self.item_map = {} 
        self.logger.info("Building item hashmap")
        for _item, _item_id in self.item_lookup.values:
            self.item_map[_item_id] = _item
        
        train = pd.merge(train, self.user_lookup, on=[self.user_col])
        train = pd.merge(train, self.item_lookup, on=[self.item_col])
        self.logger.info("User + item lookup complete")
        self.indices = torch.LongTensor(
            train[[self.user_id_col, self.item_id_col]].values
        )

        if not score_col:
            # Implicit values only
            self.values = torch.ones(self.indices.shape[0])
        else:
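            self.values = torch.tensor(train[score_col].values, dtype=torch.float32)

        # torch.sparse.FloatTensor is deprecated; sparse_coo_tensor is its replacement
        self.sparse = torch.sparse_coo_tensor(self.indices.t(), self.values)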

        self.logger.info("Sparse data built")

    def generate_labels(self, df, col):
        dist_labels = df[[col]].drop_duplicates()
        dist_labels[col + "_id"] = dist_labels[col].astype("category").cat.codes

        return dist_labels

    
    def fit(self): 
        self.logger.info("Building G Matrix")
        G = self.sparse.to_dense().t() @ self.sparse.to_dense()
        G += torch.eye(G.shape[0]) * self.reg

        P = G.inverse()

        self.logger.info("Building B matrix")
        B = P / (-1 * P.diag())
        B = B + torch.eye(B.shape[0])
        self.B = B

        return


    def predict_all(self, pred_df, k=5, remove_owned=True):
        """
        :param pred_df: DataFrame of users that need predictions
        :param k: Number of items to recommend to each user
        :param remove_owned: Whether to exclude items the user has already interacted with
        :return: DataFrame of users + their predictions in sorted order
        """
        pred_df = pred_df[[self.user_col]].drop_duplicates()
        n_orig = pred_df.shape[0]

        # Alert to number of dropped users in prediction set
        pred_df = pd.merge(pred_df, self.user_lookup, on=[self.user_col])
        n_curr = pred_df.shape[0]
        if n_orig - n_curr:
            self.logger.info(
                "Number of unknown users from prediction data = %i" % (n_orig - n_curr)
            )

        _output_preds = []
        # Select only user_ids in our user data
        _user_tensor = self.sparse.to_dense().index_select(
            dim=0, index=torch.LongTensor(pred_df[self.user_id_col])
        )

        # Make our (raw) predictions
        _preds_tensor = _user_tensor @ self.B
        self.logger.info("Predictions are made")
        if remove_owned:
            # Subtract each owned item's interaction value (1.0 for implicit data)
            # from its score to push it down the ranking (faster than a list comp.)
            self.logger.info("Removing owned items")
            _preds_tensor += -1.0 * _user_tensor

        self.logger.info("TopK selected per user")
        for _preds in _preds_tensor:
            # .topk() returns the k best indices per row, already sorted by score
            _output_preds.append(
                [self.item_map[_id] for _id in _preds.topk(k).indices.tolist()]
            )

        pred_df["predicted_items"] = _output_preds
        self.logger.info("Predictions are returned to user")
        return pred_df
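
For reference, `fit` computes the closed-form EASE solution: with $P = (X^\top X + \lambda I)^{-1}$, the item-item weights are $B_{ij} = -P_{ij} / P_{jj}$ for $i \neq j$ and $B_{jj} = 0$. The sketch below reproduces that computation on a toy matrix. It uses NumPy instead of PyTorch for brevity, and the helper name `ease_weights` is hypothetical. Unlike `predict_all`, which subtracts the interaction value from owned items, this sketch masks them with `-inf`.

```python
import numpy as np

def ease_weights(X, reg):
    """Closed-form EASE item-item weights for a dense interaction matrix X."""
    G = X.T @ X + reg * np.eye(X.shape[1])  # regularised Gram matrix
    P = np.linalg.inv(G)
    B = P / (-np.diag(P))                   # divide column j by -P[j, j]
    np.fill_diagonal(B, 0.0)                # zero-diagonal constraint
    return B

# Toy binary playlist-by-track matrix (3 playlists, 3 tracks)
X = np.array([[1, 1, 0],
              [0, 1, 1],
              [1, 0, 0]], dtype=float)

B = ease_weights(X, reg=0.05)

# Score every track for every playlist, then hide tracks already owned
scores = X @ B
scores[X > 0] = -np.inf
top1 = scores.argmax(axis=1)  # best unowned track per playlist

print(B.round(3))
print(top1)
```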

In [11]:
# Initialize the model with the utility matrix - 500 playlists
te = TorchEASE(df_utility_matrix[df_utility_matrix["pid"]<500], user_col="pid", item_col="track_uri") 

2022-06-01 14:00:28 [INFO] notebook - Building user + item lookup
2022-06-01 14:00:28 [INFO] notebook - Building item hashmap
2022-06-01 14:00:28 [INFO] notebook - User + item lookup complete
2022-06-01 14:00:29 [INFO] notebook - Sparse data built


In [12]:
te.fit()

2022-06-01 14:00:30 [INFO] notebook - Building G Matrix
2022-06-01 14:08:10 [INFO] notebook - Building B matrix


In [25]:
# Get the top 10 recommendations for each user
out = te.predict_all(df_utility_matrix[df_utility_matrix["pid"]<500], k=10)

2022-06-01 14:25:05 [INFO] notebook - Predictions are made
2022-06-01 14:25:05 [INFO] notebook - Removing owned items
2022-06-01 14:25:05 [INFO] notebook - TopK selected per user
2022-06-01 14:25:06 [INFO] notebook - Predictions are returned to user


In [40]:
out.head()

| | pid | pid_id | predicted_items |
|---|---|---|---|
| 0 | 0 | 0 | [spotify:track:5dNfHmqgr128gMY2tc5CeJ, spotify... |
| 1 | 1 | 1 | [spotify:track:3yrSvpt2l1xhsV9Em88Pul, spotify... |
| 2 | 2 | 2 | [spotify:track:0UDCfleTgwihlnOUxbzokR, spotify... |
| 3 | 3 | 3 | [spotify:track:1GgYxWBBOdLFWJ4EEHIZSR, spotify... |
| 4 | 4 | 4 | [spotify:track:4EnkwZd0UJAuHpNMMemQaA, spotify... |


The helper functions below map the playlist ids and track URIs in the predictions back to playlist titles and track names.

In [41]:
# Function to get the playlist name
def get_playlist_name(pid):
    return df_playlists[df_playlists['pid'] == pid]['name'].values[0]

# Function to get the track name
def get_track_name(track_uri):
    return df_tracks[df_tracks['track_uri'] == track_uri]['track_name'].values[0]


def rec_dataframe(_out): 
    '''
    Input: The predictions dataframe (pid, pid_id, predicted_items) from predict_all
    Output: One row per recommended track, with playlist_name and track_name added
    '''
    out = _out.explode('predicted_items').reset_index(drop=True)
    out['playlist_name'] = out['pid'].apply(get_playlist_name)
    out['track_name'] = out['predicted_items'].apply(get_track_name)
    out.drop(['pid_id'], axis=1, inplace=True)
    return out    
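
Each call to `get_track_name` filters the whole tracks dataframe, which may become slow as the number of recommendations grows. A possible alternative, sketched here on toy data with placeholder URIs, is to build a URI-to-name dictionary once and apply it with `Series.map`. The names `tracks_toy`, `recs_toy`, and `uri_to_name` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for df_tracks and the exploded predictions
tracks_toy = pd.DataFrame({
    "track_uri": ["t:a", "t:b", "t:a"],
    "track_name": ["Toxic", "Umbrella", "Toxic"],
})
recs_toy = pd.DataFrame({"pid": [0, 0], "predicted_items": ["t:b", "t:a"]})

# Build the lookup once: one entry per distinct track URI
uri_to_name = (
    tracks_toy.drop_duplicates("track_uri")
    .set_index("track_uri")["track_name"]
    .to_dict()
)

# map() performs a dictionary lookup per row instead of a dataframe scan
recs_toy["track_name"] = recs_toy["predicted_items"].map(uri_to_name)

print(recs_toy)
```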

In [46]:
result = rec_dataframe(out)
result.head()

Unnamed: 0,pid,predicted_items,playlist_name,track_name
0,0,spotify:track:5dNfHmqgr128gMY2tc5CeJ,Throwbacks,Ignition - Remix
1,0,spotify:track:7uKcScNXuO3MWw6LowBjW1,Throwbacks,"One, Two Step"
2,0,spotify:track:5i66xrvSh1MjjyDd6zcwgj,Throwbacks,Umbrella
3,0,spotify:track:39qcvV4f0uqDMHxIkSb7tE,Throwbacks,Rich Girl
4,0,spotify:track:0VRh0HgB1RsgqjH7YswsJK,Throwbacks,Best Friend's Brother


In [47]:
result.to_pickle("res.pkl") 

References
-----------

[1] Embarrassingly Shallow Autoencoders for Sparse Data (EASE). Harald Steck, WWW 2019 https://arxiv.org/pdf/1905.03375.pdf

[2] Variational Autoencoders for Collaborative Filtering. Dawen Liang, Rahul G. Krishnan, Matthew D. Hoffman and Tony Jebara, WWW 2018 https://arxiv.org/abs/1802.05814

[3] Spotify MPD https://www.aicrowd.com/challenges/spotify-million-playlist-dataset-challenge
