# Parsing Spotify's MPD Dataset
The MPD is distributed as JSON slice files. Parsing and preparing it involves:

1. Convert the JSON slices to CSV slices. Outputs (`XXXX-XXXX` is the slice range taken from the JSON filename):
    * `mpd_playlists_XXXX-XXXX.csv`
    * `mpd_tracks_XXXX-XXXX.csv`
2. De-duplicate the `mpd_tracks_XXXX-XXXX.csv` slices and save the result as `unique_tracks.csv`
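
As a minimal sketch of the naming convention above, the slice range can be read from the JSON filename and reused for both CSV names. The helper name `csv_names_for_slice` is hypothetical and is not part of this notebook; the split on `.` is the same one the run cell uses later on.

```python
def csv_names_for_slice(json_filename):
    """Return (playlists_csv, tracks_csv) for a slice file such as
    'mpd.slice.549000-549999.json'. Hypothetical helper, for illustration."""
    # The slice range is the second-to-last dot-separated token
    suffix = json_filename.split('.')[-2]
    return f"mpd_playlists_{suffix}.csv", f"mpd_tracks_{suffix}.csv"


names = csv_names_for_slice("mpd.slice.549000-549999.json")
print(names)
```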

In [1]:
import numpy as np
import pandas as pd
import os
import json
import csv

NOTE: Make sure that the `spotify_million_playlist_dataset` and its `data` directory are located in the parent folder before proceeding. Otherwise, update the `path` below.

In [2]:
# Input directory
path = "../spotify_million_playlist_dataset/data/"
mpd_filepath = os.path.relpath(path)
print('Input directory:')
print(mpd_filepath)

Input directory:
../spotify_million_playlist_dataset/data


In [3]:
# Output directories

# Base output directory
path = '../data'
output_filepath_root = os.path.relpath(path)

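# Output playlists into playlists directory
playlist_output = os.path.join(output_filepath_root, 'playlists')

# Output tracks into tracks directory
track_output = os.path.join(output_filepath_root, 'tracks')

# Create the output directories if they do not exist yet
os.makedirs(playlist_output, exist_ok=True)
os.makedirs(track_output, exist_ok=True)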

print('Output directories:')
print(output_filepath_root)
print(playlist_output)
print(track_output)

Output directories:
../data
../data/playlists
../data/tracks


# 1. Convert JSON slices to CSV slices

### JSON to CSV Parsing Function

In [20]:
def parse_spotify_mpd_json_to_csv(input_filepath, output_filepaths, playlist_csv_name, tracks_csv_name, tracks_per_playlist=15):
    """Parse through Spotify's Million Playlist Dataset (MPD) JSON slice file and produce the following CSV files:
    
        - {playlist_csv_name}.csv
        - {tracks_csv_name}.csv
        
        * Note: '.csv' suffix not required in argument string.
        
        
    The MPD JSON has a *PLAYLIST* field for each playlist row. The *INFO* field is ignored for each slice file.
    
    The JSON is structured in the following format:
    
        ### `info` Field (THE INFO FIELD IS IGNORED)
        

        ### `playlists` field 
        This is an array that typically contains 1,000 playlists. Each playlist is a dictionary that contains the following fields:

            * ***pid*** - integer - playlist id - the MPD ID of this playlist.
            * ***name*** - string - the name of the playlist 
            * ***description*** - optional string - if present, the description given to the playlist.
            * ***modified_at*** - seconds - timestamp (in seconds since the epoch) when this playlist was last updated.
            * ***num_artists*** - the total number of unique artists for the tracks in the playlist.
            * ***num_albums*** - the number of unique albums for the tracks in the playlist
            * ***num_tracks*** - the number of tracks in the playlist
            * ***num_followers*** - the number of followers this playlist had at the time the MPD was created.
            * ***num_edits*** - the number of separate editing sessions.
            * ***duration_ms*** - the total duration of all the tracks in the playlist (in milliseconds)
            * ***collaborative*** -  boolean - if true, the playlist is a collaborative playlist. 
            * ***tracks*** - an array of information about each track in the playlist. Each element is a dictionary with:
               * ***track_name*** - the name of the track
               * ***track_uri*** - the Spotify URI of the track
               * ***album_name*** - the name of the track's album
               * ***album_uri*** - the Spotify URI of the album
               * ***artist_name*** - the name of the track's primary artist
               * ***artist_uri*** - the Spotify URI of track's primary artist
               * ***duration_ms*** - the duration of the track in milliseconds
               * ***pos*** - the position of the track in the playlist (zero-based)
        
    
    This function does not process the rows further; it simply converts the JSON to CSVs.
    Rows are not sorted, duplicates are not removed.
    
    """
    
    # Load JSON from provided filepath
    with open(input_filepath, 'r', encoding='utf-8') as file:
        j = json.load(file)
        
    # Lists to gather from nested JSON
    playlists_json_list = []
    tracks_json_list = []
    
    # The MPD JSON stores its playlists as a list under the top-level 'playlists' key
    # Step through each playlist
    for playlist in j['playlists']:
        
        # -- Create this playlist's playlist row as JSON for CSV creation
        playlist_json = {}
        playlist_json['pid'] = playlist['pid']
        playlist_json['name'] = playlist['name']

        # Description is an optional field
        if 'description' in playlist:
            playlist_json['description'] = playlist['description']
        else:
            playlist_json['description'] = ''
            
        playlist_json['modified_at'] = playlist['modified_at']
        playlist_json['num_artists'] = playlist['num_artists']
        playlist_json['num_albums'] = playlist['num_albums']
        playlist_json['num_tracks'] = playlist['num_tracks']
        playlist_json['num_followers'] = playlist['num_followers']
        playlist_json['num_edits'] = playlist['num_edits']
        playlist_json['duration_ms'] = playlist['duration_ms']
        playlist_json['collaborative'] = playlist['collaborative']
        
        playlists_json_list.append(playlist_json)
        
        
        # -- Creating tracks csv
        # Each playlist has any number of tracks.
        # Use the *PID* to affiliate tracks with playlists
        
        # -- Saving tracks for each playlist
        # Add n tracks as features to the playlist csv
        # Where n is the input tracks_per_playlist
        # Format: track_1_uri, track_2_uri,...
        #         track_1_album_uri
        #         track_1_artist_uri
        #
        # Use padding where track does not exist (only 5 tracks in playlist when we want 15)

        track_counter = 1
        
        for track in playlist['tracks']:
            
            if track_counter <= tracks_per_playlist:
                
                # Track URI
                track_feature_name = 'track_' + str(track_counter) + '_uri'
                playlist_json[track_feature_name] = track['track_uri']
                
                # Album URI
                album_feature_name = 'track_' + str(track_counter) + '_album_uri'
                playlist_json[album_feature_name] = track['album_uri']
                
                # Artist URI
                artist_feature_name = 'track_' + str(track_counter) + '_artist_uri'
                playlist_json[artist_feature_name] = track['artist_uri']
            
            track_counter += 1

            # Create tracks CSV
            tracks_json = {}
            tracks_json['pid'] = playlist['pid']
            tracks_json['track_name'] = track['track_name']
            tracks_json['track_uri'] = track['track_uri']
            tracks_json['album_name'] = track['album_name']
            tracks_json['album_uri'] = track['album_uri']
            tracks_json['artist_name'] = track['artist_name']
            tracks_json['artist_uri'] = track['artist_uri']
            tracks_json['duration_ms'] = track['duration_ms']
            tracks_json['pos'] = track['pos']
            
            tracks_json_list.append(tracks_json)
            
        # Check if there weren't enough tracks and we need to add padding
        while track_counter <= tracks_per_playlist:
            # Resume track_counter, but now we assign '0' values
            # Track URI
            track_feature_name = 'track_' + str(track_counter) + '_uri'
            playlist_json[track_feature_name] = 0

            # Album URI
            album_feature_name = 'track_' + str(track_counter) + '_album_uri'
            playlist_json[album_feature_name] = 0

            # Artist URI
            artist_feature_name = 'track_' + str(track_counter) + '_artist_uri'
            playlist_json[artist_feature_name] = 0
            
            track_counter += 1
            
#         diff = tracks_per_playlist - len(playlist['tracks'])
#         if diff > 0:
#             resume_numbering = tracks_per_playlist - diff
            
            
            

    # Now that we've parsed through the JSON, save data to CSV
    
    # mpd_playlists.csv
    with open(f"{output_filepaths[0]}/{playlist_csv_name}.csv", 'w', newline='', encoding='utf-8') as file:
        dw = csv.DictWriter(file, playlists_json_list[0].keys())
        dw.writeheader()
        dw.writerows(playlists_json_list)
        print(f"{playlist_csv_name}.csv saved to {output_filepaths[0]}")
    
    # mpd_tracks.csv
    with open(f"{output_filepaths[1]}/{tracks_csv_name}.csv", 'w', newline='', encoding='utf-8') as file:
        dw = csv.DictWriter(file, tracks_json_list[0].keys())
        dw.writeheader()
        dw.writerows(tracks_json_list)
        print(f"{tracks_csv_name}.csv saved to {output_filepaths[1]}")
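
To make the padding behaviour easier to see in isolation, here is a small sketch that flattens the first `n` tracks of a playlist into `track_<i>_*` columns and fills missing positions with `0`, the same placeholder the parser uses. The helper name `first_n_track_features` and the toy URIs are made up for illustration; this is a simplified restatement, not the function above.

```python
def first_n_track_features(tracks, n):
    """Flatten the first n tracks into track_<i>_* columns, padding with 0."""
    row = {}
    for i in range(1, n + 1):
        # Positions past the end of the playlist have no track
        track = tracks[i - 1] if i <= len(tracks) else None
        row[f"track_{i}_uri"] = track["track_uri"] if track else 0
        row[f"track_{i}_album_uri"] = track["album_uri"] if track else 0
        row[f"track_{i}_artist_uri"] = track["artist_uri"] if track else 0
    return row


# Made-up URIs, for illustration only
toy_tracks = [
    {"track_uri": "spotify:track:A", "album_uri": "spotify:album:A", "artist_uri": "spotify:artist:A"},
    {"track_uri": "spotify:track:B", "album_uri": "spotify:album:B", "artist_uri": "spotify:artist:B"},
]

# Ask for 3 tracks from a 2-track playlist: position 3 is padded
row = first_n_track_features(toy_tracks, n=3)
print(row)
```

Because every playlist row ends up with the same `3 * n` track columns, the CSV header can safely be taken from the first row.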
        

## Run Parser (User Input)
If parsing was interrupted, or you'd like to parse the raw data from scratch for any reason, begin by deleting any existing files from the `playlists` and `tracks` output directories.

In [21]:
# Prompt user if files are already found in the cleaned data directories

if len(os.listdir(playlist_output)) > 0 or len(os.listdir(track_output)) > 0:
    if input("Files detected in output directories. Shall I delete them for you? (Y/N) ").lower() == 'y':
        for file in os.listdir(playlist_output):
            path = os.path.join(playlist_output, file)
            
            try:
                os.remove(path)
            except OSError as e:
                print("Error: %s : %s" % (path, e.strerror))
        print(f'Playlist data files deleted from {playlist_output}.')
            
        for file in os.listdir(track_output):
            path = os.path.join(track_output, file)
            
            try:
                os.remove(path)
            except OSError as e:
                print("Error: %s : %s" % (path, e.strerror))
                
        print(f'Track data files deleted from {track_output}.')
        
    else:
        print("No files were deleted.")
else:
    print("No files detected in output folders. You're ready to parse the raw data.")

Files detected in output directories. Shall I delete them for you? (Y/N)  y


Playlist data files deleted from ../data/playlists.
Track data files deleted from ../data/tracks.
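
The two deletion loops above do the same thing for different folders, so they could be folded into one helper. The sketch below is one possible shape, with a hypothetical function name `clear_directory`; it is demonstrated on a temporary directory rather than the real output folders.

```python
import os
import tempfile


def clear_directory(directory):
    """Delete every regular file in `directory`; return the number deleted."""
    deleted = 0
    for name in os.listdir(directory):
        file_path = os.path.join(directory, name)
        if not os.path.isfile(file_path):
            continue  # leave subdirectories alone
        try:
            os.remove(file_path)
            deleted += 1
        except OSError as e:
            print(f"Error: {file_path} : {e.strerror}")
    return deleted


# Demonstrate on a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.csv", "b.csv"):
        open(os.path.join(tmp, name), "w").close()
    n_deleted = clear_directory(tmp)
    remaining = os.listdir(tmp)

print(n_deleted, remaining)
```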


# 2. Convert Playlists

In [22]:
# ------ WARNING!
# This cell can take a while to run (~10 mins)

# Input for number of tracks to include as features for each playlist
num_tracks = 15

# Begin parsing JSON files
print('\nParsing JSONs...\n')
length = len(os.listdir(mpd_filepath))

for ii,ff in enumerate(os.listdir(mpd_filepath)):
    print(f"Slice {ii + 1} of {length}: {ff}")
    suffix = str(ff.split('.')[-2])
    parse_spotify_mpd_json_to_csv(input_filepath=os.path.join(mpd_filepath, ff),
                                  output_filepaths=[playlist_output, track_output],
                                  playlist_csv_name='mpd_playlists_' + suffix,
                                  tracks_csv_name='mpd_tracks_' + suffix,
                                  tracks_per_playlist=num_tracks)
    print()



Parsing JSONs...

Slice 1 of 1000: mpd.slice.549000-549999.json
mpd_playlists_549000-549999.csv saved to ../data/playlists
mpd_tracks_549000-549999.csv saved to ../data/tracks

Slice 2 of 1000: mpd.slice.613000-613999.json
mpd_playlists_613000-613999.csv saved to ../data/playlists
mpd_tracks_613000-613999.csv saved to ../data/tracks

Slice 3 of 1000: mpd.slice.115000-115999.json
mpd_playlists_115000-115999.csv saved to ../data/playlists
mpd_tracks_115000-115999.csv saved to ../data/tracks

Slice 4 of 1000: mpd.slice.778000-778999.json
mpd_playlists_778000-778999.csv saved to ../data/playlists
mpd_tracks_778000-778999.csv saved to ../data/tracks

Slice 5 of 1000: mpd.slice.290000-290999.json
mpd_playlists_290000-290999.csv saved to ../data/playlists
mpd_tracks_290000-290999.csv saved to ../data/tracks

Slice 6 of 1000: mpd.slice.596000-596999.json
mpd_playlists_596000-596999.csv saved to ../data/playlists
mpd_tracks_596000-596999.csv saved to ../data/tracks

Slice 7 of 1000: mpd.slice.

In [10]:
# Load all CSV slices into DF that will be exported for future easy loading
playlist_slices = os.listdir(playlist_output)

# Load the first one to get the DF formatted
df = pd.read_csv(os.path.join(playlist_output, playlist_slices.pop()))

length = len(playlist_slices) + 1

# Loop through and load the rest of the files
for num, playlist in enumerate(playlist_slices):
    print(f"Concatenating slice #{num + 2} of {length}")
    df2 = pd.read_csv(os.path.join(playlist_output, playlist))
    df = pd.concat([df, df2], ignore_index=True)

print('\nDone.\n')

Concatenating slice #2 of 1000
Concatenating slice #3 of 1000
Concatenating slice #4 of 1000
Concatenating slice #5 of 1000
Concatenating slice #6 of 1000
Concatenating slice #7 of 1000
Concatenating slice #8 of 1000
Concatenating slice #9 of 1000
Concatenating slice #10 of 1000
Concatenating slice #11 of 1000
Concatenating slice #12 of 1000
Concatenating slice #13 of 1000
Concatenating slice #14 of 1000
Concatenating slice #15 of 1000
Concatenating slice #16 of 1000
Concatenating slice #17 of 1000
Concatenating slice #18 of 1000
Concatenating slice #19 of 1000
Concatenating slice #20 of 1000
Concatenating slice #21 of 1000
Concatenating slice #22 of 1000
Concatenating slice #23 of 1000
Concatenating slice #24 of 1000
Concatenating slice #25 of 1000
Concatenating slice #26 of 1000
Concatenating slice #27 of 1000
Concatenating slice #28 of 1000
Concatenating slice #29 of 1000
Concatenating slice #30 of 1000
Concatenating slice #31 of 1000
Concatenating slice #32 of 1000
Concatenating sl
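
The loop above calls `pd.concat` once per slice, which copies the accumulated DataFrame each time. A common alternative is to collect the slices in a list and concatenate once. The sketch below uses small made-up DataFrames in place of the `pd.read_csv` results, so the values are illustrative only.

```python
import pandas as pd

# Toy stand-ins for the per-slice playlist DataFrames (made-up values)
slices = [
    pd.DataFrame({"pid": [0, 1], "name": ["a", "b"]}),
    pd.DataFrame({"pid": [2], "name": ["c"]}),
    pd.DataFrame({"pid": [3, 4], "name": ["d", "e"]}),
]

# A single concat over the whole list copies each row only once
combined = pd.concat(slices, ignore_index=True)
print(combined.shape)
```

With the real data, the list would presumably be built from `pd.read_csv` over the files in `playlist_output`.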

In [11]:
df

Unnamed: 0,pid,name,description,modified_at,num_artists,num_albums,num_tracks,num_followers,num_edits,duration_ms,...,track_12_artist_uri,track_13_uri,track_13_album_uri,track_13_artist_uri,track_14_uri,track_14_album_uri,track_14_artist_uri,track_15_uri,track_15_album_uri,track_15_artist_uri
0,434000,Sad,,1488240000,24,26,27,1,6,6081757,...,spotify:artist:5lKZWd6HiSCLfnDGrq9RAm,spotify:track:1aBO5KPwxqLESNTTJBR6VP,spotify:album:0KMy4eY3BziwZPkVfFHP5v,spotify:artist:5UOLfDoNQJBGlGAKQg9Iwc,spotify:track:2iFvY1l5o2mmUAjBq1L9Mh,spotify:album:4M9Ti6t5h54aDMX4SizDfT,spotify:artist:4vVfuZfXWu18vk5Z4C7wbm,spotify:track:3yrVRdwCbEeKODZgG2mVZX,spotify:album:3SCJmoy3Z45p84IfuaM9YQ,spotify:artist:2EO56JK4txid1Pss9GVbOL
1,434001,pb&j,faves tbh,1487808000,35,38,39,1,9,8959761,...,spotify:artist:5T0MSzX9RC5NA6gAI6irSn,spotify:track:7BHPGtpuuWWsvE7cCaMuEU,spotify:album:03JPFQvZRnHHysSZrSFmKY,spotify:artist:1GLtl8uqKmnyCWxHmw9tL4,spotify:track:3ZOEytgrvLwQaqXreDs2Jx,spotify:album:6deiaArbeoqp1xPEGdEKp1,spotify:artist:0L8ExT028jH3ddEcZwqJJ5,spotify:track:5E30LdtzQTGqRvNd7l6kG5,spotify:album:18iFxjZugvKhuNNMbLjZJF,spotify:artist:77SW9BnxLY8rJ0RciFqkHh
2,434002,Sexy Time,,1473379200,36,57,67,21,43,17493934,...,spotify:artist:1P8IfcNKwrkQP5xJWuhaOC,spotify:track:6Ms01Gqi8gVBs14YrNUlVZ,spotify:album:1MXR2vMRldZITKc1Zk6bLe,spotify:artist:336vr2M3Va0FjyvB55lJEd,spotify:track:7cvkXf3AwPGT041PyOi5VX,spotify:album:1gIC63gC3B7o7FfpPACZQJ,spotify:artist:6vWDO969PvNqNYHIOW5v0m,spotify:track:5KyznPsIMzQpzPcNnb67rd,spotify:album:0NrP6lZ1RSuqE7xpmeKlNa,spotify:artist:0ZrpamOxcZybMHGg1AYtHP
3,434003,bedroom,,1436572800,14,15,15,1,12,3334671,...,spotify:artist:2iojnBLj0qIMiKPvVhLnsH,spotify:track:4xNbgOhQHofZJySyySbPG1,spotify:album:0DMQC8fJn3TR3xUgLU6jDg,spotify:artist:1Xfmvd48oOhEWkscWyEbh9,spotify:track:3XVBdLihbNbxUwZosxcGuJ,spotify:album:6TqRKHLjDu5QZuC8u5Woij,spotify:artist:3DiDSECUqqY1AuBP8qtaIa,spotify:track:2U5NrHirVZusBLQlohTbnZ,spotify:album:2p8MgaWfMYTrDtJxJMeuqG,spotify:artist:23zg3TcAtWQy7J6upgbUnj
4,434004,Whatever,,1506816000,36,57,79,1,46,18874072,...,spotify:artist:6l3HvQ5sa6mXTsMTB19rO5,spotify:track:1yYzuNd0KRyHVJ3NH8apBt,spotify:album:2kuF4nm4R2bcwBKvnvRbTW,spotify:artist:4LLpKhyESsyAXpc4laK94U,spotify:track:6GnhWMhgJb7uyiiPEiEkDA,spotify:album:2Tyx5dLhHYkx6zeAdVaTzN,spotify:artist:4LLpKhyESsyAXpc4laK94U,spotify:track:0htTZnlk6okQ1HIq4EvFQ6,spotify:album:6liIoWzpvrff945pUI7fHt,spotify:artist:02kJSzxNuaWGqwubyUba0Z
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
999995,546995,feels,,1509408000,49,56,63,6,36,14977345,...,spotify:artist:6vWDO969PvNqNYHIOW5v0m,spotify:track:3OMh7VdOoWgtKhJimQQywz,spotify:album:07WV4Vf6503k6K1yGNWG8X,spotify:artist:6qqNVTkY8uBg9cP3Jd7DAH,spotify:track:0fioLzGM8ngbD1w6fMmm45,spotify:album:2Jg7JZ0ZXOGje1bkq7CVgK,spotify:artist:2wY79sveU1sp5g7SokKOiI,spotify:track:4sPmO7WMQUAf45kwMOtONw,spotify:album:0K4pIOOsfJ9lK8OjrZfXzd,spotify:artist:4dpARuHxo51G3z768sgnrY
999996,546996,Fall '15,,1452902400,126,135,144,2,58,31938990,...,spotify:artist:4OtFSn9UYWzNX9bGUf5G9W,spotify:track:7BC0mX45eSWGubU13m6dH8,spotify:album:1oQUa18tr6i27gKmO7LX1J,spotify:artist:536osqBGKzeozje8BfcGsa,spotify:track:3LfO09R3zh96w0qGURCvrr,spotify:album:6V7UTQKQXzg6jwNTxPxd2w,spotify:artist:6ZUjdwG0NvY6MT7vvmluhV,spotify:track:0LIU3eOAwXCZlWOiHuItkj,spotify:album:6HizrEcHCJCX8JEfD50L7Z,spotify:artist:0O7NhieDairfQvi9jr66Cx
999997,546997,oldies,,1507161600,52,69,91,4,25,20987893,...,spotify:artist:6jJ0s89eD6GaHleKKya26X,spotify:track:5fVZC9GiM4e8vu99W0Xf6J,spotify:album:1IM3GwptCGYjRkzCBolyFK,spotify:artist:0zOcE3mg9nS6l3yxt1Y0bK,spotify:track:3Q4WeJmzxuDpzMu9QjQqbM,spotify:album:0jwuTvP3hp2jFY08VLgvnD,spotify:artist:3r17AfJCCUqC9Lf0OAc73G,spotify:track:6C7RJEIUDqKkJRZVWdkfkH,spotify:album:3SZr5Pco2oqKFORCP3WNj9,spotify:artist:5K4W6rqBFWDnAN6FQUkS6x
999998,546998,Summer 2014,,1408579200,8,8,8,1,4,1659303,...,0,0,0,0,0,0,0,0,0,0


In [12]:
# Save to csv
print('Saving to csv.')
df.to_csv(f"{output_filepath_root}/playlists.csv", index=False)
print(f'Saved to {output_filepath_root}/playlists.csv')

Saving to csv.
Saved to ../data/playlists.csv


# 3. Create the Unique Tracks DataFrame — In Pieces!

In [5]:
# Number of batches to split the track slices into.
# e.g. 8 means each batch covers 1/8th of the slices, leaving 8 DFs to merge at the end.

num_data_batches = 20

# Get list of all files in data/tracks
track_filenames = os.listdir(track_output)

# Break into N batches with numpy's array_split
batched_track_filenames = np.array_split(track_filenames, num_data_batches)

# Save each batch's DF in a list
batch_dfs = []

# Loop through N batches
for batch_num, batch in enumerate(batched_track_filenames):
    
    print(f"\nBeginning batch #{batch_num + 1} of {num_data_batches}\n")
    
    # Get the filenames for this batch
    track_filenames = batch.tolist()
    
    # Load the first file in this batch into a DataFrame to get started
    df = pd.read_csv(os.path.join(track_output, track_filenames[0]))
    print(f"df beginning with shape {df.shape}\n")

    # pid and pos are no longer relevant & de-duplicate
    df.drop(columns=['pid', 'pos'], inplace=True)
    df.drop_duplicates(subset=['track_uri'], inplace=True)

    length = len(track_filenames)

    # Loop through every track file in the batch and de-duplicate as we go
    # (the first file is read again here, so it reports 0 unique tracks added)
    for ii,ff in enumerate(track_filenames):
        print(f'Batch {batch_num + 1}/{num_data_batches} | Processing file {ii + 1} of {length}')

        # Keep track of this file's shape for reporting
        df_s0 = df.shape

        # Load new slice
        df2 = pd.read_csv(os.path.join(track_output, ff))

        # Drop pid, pos columns & de-duplicate
        df2.drop(columns=['pid', 'pos'], inplace=True)
        df2.drop_duplicates(subset=['track_uri'], inplace=True)

        # Concatenate this slice to our existing DF
        # df = pd.concat([df,df2], ignore_index=True, verify_integrity=True)
        df = pd.concat([df, df2], ignore_index=True).drop_duplicates().reset_index(drop=True)


        # Report
        df_s1 = df.shape
        print(f'{df2.shape[0]} tracks processed, {df_s1[0] - df_s0[0]} unique tracks added.\n')
        
    print(f"\nBatch {batch_num + 1} complete.\n")
    print(f"df final shape {df.shape}.")
    print('___________________________________________________________________\n')
    batch_dfs.append(df)

print("\nNow combining batches...\n")
    
# Combine all DFs in the collected list
df_join = batch_dfs.pop()

for d in batch_dfs:
    df_join = pd.concat([df_join, d], ignore_index=True).drop_duplicates().reset_index(drop=True)

# Save df to single csv
print("\nNow saving unique tracks to csv...")
try:
    df_join.to_csv(f"{output_filepath_root}/unique_tracks.csv", index=False)
    print(f"unique_tracks.csv saved to {output_filepath_root}")
except OSError as e:
    print(f"Failed to save csv: {e}")


Beginning batch #1 of 20

df beginning with shape (63574, 9)

Batch 1/20 | Processing file 1 of 50
32791 tracks processed, 0 unique tracks added.

Batch 1/20 | Processing file 2 of 50
34583 tracks processed, 22571 unique tracks added.

Batch 1/20 | Processing file 3 of 50
35516 tracks processed, 19527 unique tracks added.

Batch 1/20 | Processing file 4 of 50
34239 tracks processed, 16465 unique tracks added.

Batch 1/20 | Processing file 5 of 50
34630 tracks processed, 15015 unique tracks added.

Batch 1/20 | Processing file 6 of 50
35332 tracks processed, 14173 unique tracks added.

Batch 1/20 | Processing file 7 of 50
34926 tracks processed, 12934 unique tracks added.

Batch 1/20 | Processing file 8 of 50
36207 tracks processed, 13366 unique tracks added.

Batch 1/20 | Processing file 9 of 50
36489 tracks processed, 12399 unique tracks added.

Batch 1/20 | Processing file 10 of 50
36754 tracks processed, 13204 unique tracks added.

Batch 1/20 | Processing file 11 of 50
35497 tracks
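
The two ideas in the cell above, batching with `np.array_split` and de-duplicating after each concatenation, can be checked on toy inputs. In this sketch the filenames and track URIs are made up; only the pandas and NumPy calls match the notebook.

```python
import numpy as np
import pandas as pd

# Batching: 1,000 hypothetical filenames split into 20 batches
filenames = [f"mpd_tracks_{i}.csv" for i in range(1000)]
batches = np.array_split(filenames, 20)
batch_sizes = [len(b) for b in batches]

# Incremental de-duplication on two toy slices (made-up URIs)
a = pd.DataFrame({"track_uri": ["t1", "t2"], "track_name": ["x", "y"]})
b = pd.DataFrame({"track_uri": ["t2", "t3"], "track_name": ["y", "z"]})
merged = pd.concat([a, b], ignore_index=True).drop_duplicates().reset_index(drop=True)

print(batch_sizes[:3], merged["track_uri"].tolist())
```

Note that `drop_duplicates()` without `subset` only removes rows that match in every column; passing `subset=['track_uri']` would instead keep one row per URI even if other fields differ.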

In [6]:
pd.read_csv(f"{output_filepath_root}/unique_tracks.csv")

Unnamed: 0,track_name,track_uri,album_name,album_uri,artist_name,artist_uri,duration_ms
0,FUCKING BEST SONG EVERRR,spotify:track:6SbAbLqAWf2tnTdUy6Gmm5,FUCKING BEST SONG EVERRR,spotify:album:1hmvZb81DAeTx67G1FaTjZ,Wallpaper.,spotify:artist:6NMcnx3vKGSAeqSMbySlpw,217800
1,#STUPiDFACEDD,spotify:track:1MvpPH6BTP3IrLnTjEA2gw,#STUPiDFACEDD,spotify:album:1c7wJm9mghFyIKnQJOobW8,Wallpaper.,spotify:artist:6NMcnx3vKGSAeqSMbySlpw,184026
2,"I Got Soul, I'm So Wasted",spotify:track:1wBsDNh4BMUyjmGc0fgHJv,Doodoo Face,spotify:album:0zSZPABfVbwm2Eo06M7DLV,Wallpaper.,spotify:artist:6NMcnx3vKGSAeqSMbySlpw,188666
3,We Are Young (feat. Janelle Monáe) - feat. Jan...,spotify:track:5rgy6ghBq1eRApCkeUdJXf,Some Nights,spotify:album:7m7F7SQ3BXvIpvOgjW51Gp,fun.,spotify:artist:5nCi3BB41mBaMH9gfr6Su0,250626
4,Boyfriend,spotify:track:07dYGGSrzPeg6a3KZjWX65,Believe,spotify:album:7BWK3eXcbAdwYeulyQj5Kw,Justin Bieber,spotify:artist:1uNFoZAHBGtllmzznpCI3s,171333
...,...,...,...,...,...,...,...
2262287,Now You're a Dreamer,spotify:track:45t7ZbypVdTYbrs0WfL1sI,Tattoo,spotify:album:7xqtwaOCctSSqsHUxRiD16,Raindeer,spotify:artist:6DtbmyUQDSdTgEer4Ov0mk,228000
2262288,The Least of Your Brothers,spotify:track:464nM8hYCcKhTsZIYHNjSH,Leverage Models,spotify:album:6iPAonht2Qkv9GM8Zg9hZU,Leverage Models,spotify:artist:29f5yeDOR0NP9IwDpWhRYl,287413
2262289,Blood Red,spotify:track:2gnbjeFwGoWjF86uJfmPVA,Seasons In The Abyss,spotify:album:1OV8gLNyoCSAijC54d1RAM,Slayer,spotify:artist:1IQ2e1buppatiN1bxUVkrk,167946
2262290,Poseidon's Fury Unleashed,spotify:track:2CTu7Hk7ejfTNh3922pFrc,Through the Gale,spotify:album:2s5iW3fhyjy3blw1XTJ5qN,Asaf Avidan & the Mojos,spotify:artist:2TwepUY7feaTuipStcyzLZ,308825


# 4. Create URI Dictionaries

In [7]:
# Load df of unique tracks
df = pd.read_csv(f"{output_filepath_root}/unique_tracks.csv")

unique_albums = df.drop_duplicates(subset=['album_uri']).drop(columns=['track_name', 'track_uri', 'artist_name', 'artist_uri', 'duration_ms'])
display(unique_albums)

unique_artists = df.drop_duplicates(subset=['artist_uri']).drop(columns=['track_name', 'track_uri', 'album_name', 'album_uri', 'duration_ms'])
display(unique_artists)

# Save these new DFs to csv; a subsequent notebook loads them and converts them to dictionaries
try:
    unique_albums.to_csv(f"{output_filepath_root}/unique_albums.csv", index=False)
    print(f"unique_albums.csv saved to {output_filepath_root}")
    unique_artists.to_csv(f"{output_filepath_root}/unique_artists.csv", index=False)
    print(f"unique_artists.csv saved to {output_filepath_root}")
except OSError as e:
    print(f"CSV saving failed: {e}")

Unnamed: 0,album_name,album_uri
0,FUCKING BEST SONG EVERRR,spotify:album:1hmvZb81DAeTx67G1FaTjZ
1,#STUPiDFACEDD,spotify:album:1c7wJm9mghFyIKnQJOobW8
2,Doodoo Face,spotify:album:0zSZPABfVbwm2Eo06M7DLV
3,Some Nights,spotify:album:7m7F7SQ3BXvIpvOgjW51Gp
4,Believe,spotify:album:7BWK3eXcbAdwYeulyQj5Kw
...,...,...
2262276,"Nos plus belles années en karaoké (2010, Vol. 3)",spotify:album:3DbpnzDllFgVrYokPoiAoB
2262278,Hinos Cristãos,spotify:album:0Lv5WPtHWyTUZZNl2ISpww
2262280,"Projeto Sola, Vol. 1",spotify:album:0kDdi8xm92orRihrMbilRV
2262281,Eu Sou Livre,spotify:album:23O2Sf7BGe62TzjYdVP8wu


Unnamed: 0,artist_name,artist_uri
0,Wallpaper.,spotify:artist:6NMcnx3vKGSAeqSMbySlpw
3,fun.,spotify:artist:5nCi3BB41mBaMH9gfr6Su0
4,Justin Bieber,spotify:artist:1uNFoZAHBGtllmzznpCI3s
5,Flo Rida,spotify:artist:0jnsk9HBra6NMjO2oANoPY
6,One Direction,spotify:artist:4AK6F7OLvEQ5QYCBNiQWHq
...,...,...
2262261,Patrin,spotify:artist:1RDa9v2OMcvWYoHRoL29tw
2262262,M.A.S 0094,spotify:artist:5nrFv8M9OSiVAAbTdZobyR
2262265,H.M.A.K.,spotify:artist:6Ulu5YIzdOHWyOJnd9kodK
2262278,Agostinho Silva,spotify:artist:0TXLDLJB5yPsXs3lhshSrs


unique_albums.csv saved to ../data
unique_artists.csv saved to ../data
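
The conversion to dictionaries happens in a later notebook that is not shown here, so the following is only a guess at what it might look like. It builds a URI-to-name lookup and a URI-to-index mapping from a two-row stand-in for `unique_albums.csv` (rows copied from the preview above); the variable names are assumptions.

```python
import pandas as pd

# Two-row stand-in for unique_albums.csv
unique_albums = pd.DataFrame({
    "album_name": ["Some Nights", "Believe"],
    "album_uri": [
        "spotify:album:7m7F7SQ3BXvIpvOgjW51Gp",
        "spotify:album:7BWK3eXcbAdwYeulyQj5Kw",
    ],
})

# URI -> album name lookup
album_uri_to_name = dict(zip(unique_albums["album_uri"], unique_albums["album_name"]))

# URI -> integer index, e.g. for encoding URIs as numeric features
album_uri_to_index = {uri: i for i, uri in enumerate(unique_albums["album_uri"])}

print(album_uri_to_name["spotify:album:7m7F7SQ3BXvIpvOgjW51Gp"])
```

The same pattern would apply to `unique_artists.csv` with the `artist_uri` and `artist_name` columns.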


___