# Acquisition of Playlist Data + List of Unique Track IDs

Note: each cell can be run on its own, as long as the pickle files it depends on already exist in the `/data/processed/` directory. Running all cells from scratch can take a substantial amount of time.

## Constants and Imports for the Notebook

In [1]:
import os
import json
import pickle

FILE_COUNT = 1000
PLAYLISTS_PER_FILE = 1000
MIN_FOLLOWER_THRESHOLD = 2

DIR_DATA_RAW = os.path.join("..", "data", "raw")
DIR_DATA_PROCESSED = os.path.join("..", "data", "processed")

PLAYLIST_COUNT = FILE_COUNT * PLAYLISTS_PER_FILE

PLAYLIST_LIST_PATH = os.path.join(DIR_DATA_PROCESSED, "playlists" + str(PLAYLIST_COUNT) + ".pkl")
PLAYLIST_OVERVIEW_PATH = os.path.join(DIR_DATA_PROCESSED, "playlists_overview" + str(PLAYLIST_COUNT) + ".pkl")
TRACK_URIS_PATH = os.path.join(DIR_DATA_PROCESSED, "unique_track_ids" + str(PLAYLIST_COUNT) + ".pkl")
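
Because later cells depend on pickle files written by earlier ones, it can help to check for them before running a cell in isolation. The following is a minimal sketch of such a check; the helper name `missing_prerequisites` and the example path are hypothetical and not part of the original notebook.

```python
import os


def missing_prerequisites(paths):
    """Return the paths from `paths` that are not existing files on disk."""
    return [path for path in paths if not os.path.isfile(path)]


# Hypothetical path, used only to demonstrate the check.
example_missing = missing_prerequisites([os.path.join("no_such_dir", "playlists.pkl")])
```

In the notebook, a call such as `missing_prerequisites([PLAYLIST_LIST_PATH])` before the overview or track ID cells would return an empty list once the extraction cell has been run.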

## Extracting Playlist Data

The 1,000,000 Spotify playlists are stored in the `/data/raw/playlists/` folder, downloaded from [AIcrowd](https://www.aicrowd.com/challenges/spotify-million-playlist-dataset-challenge/dataset_files) (account needed). They are split across 1000 `.json` files with 1000 playlists each.
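
Each file covers a contiguous range of playlist IDs, and the range is encoded in the file name. As a rough illustration of the naming scheme used by the extraction cell, the sketch below builds the path for a given file index; the function name `slice_path` and its parameters are assumptions made for this example.

```python
import os


def slice_path(raw_dir, file_index, playlists_per_file=1000):
    """Build the path of the slice file that holds the given block of playlists."""
    start_id = file_index * playlists_per_file
    end_id = start_id + playlists_per_file - 1
    file_name = "mpd.slice.{}-{}.json".format(start_id, end_id)
    return os.path.join(raw_dir, "playlists", file_name)


first_slice = os.path.basename(slice_path("raw", 0))
last_slice = os.path.basename(slice_path("raw", 999))
```

With the defaults, the first file is `mpd.slice.0-999.json` and the last is `mpd.slice.999000-999999.json`.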

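Read all JSON files and merge their playlist information into a single list. Playlists with fewer than `MIN_FOLLOWER_THRESHOLD` followers are skipped, and each playlist's track objects are reduced to a list of track IDs. The resulting Python list is saved as a pickle file to serve as an intermediate representation: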

In [2]:
playlist_list = []

for file_index in range(FILE_COUNT):
    start_playlist_id = file_index * PLAYLISTS_PER_FILE
    end_playlist_id = (file_index + 1) * PLAYLISTS_PER_FILE - 1
    playlist_json_path = os.path.join(DIR_DATA_RAW, "playlists", "mpd.slice." + str(start_playlist_id) + "-" + str(end_playlist_id) + ".json")

    # Open with an explicit encoding so the platform default is not used for the JSON text.
    with open(playlist_json_path, "r", encoding="utf-8") as json_file:
        data = json.load(json_file)
    
    playlist_data = data.get("playlists", [])

    for playlist in playlist_data:
        if playlist["num_followers"] < MIN_FOLLOWER_THRESHOLD:
            continue
        
        # Keep only the bare track ID by stripping the "spotify:track:" URI prefix.
        track_id_list = []
        for track in playlist["tracks"]:
            track_id = track["track_uri"][len("spotify:track:"):]
            track_id_list.append(track_id)
        del playlist["duration_ms"]
        del playlist["tracks"]
        playlist["track_ids"] = track_id_list

        # Only some playlists have a description; drop it when present.
        playlist.pop("description", None)

        playlist_list.append(playlist)

    print("Files processed: {}/{}".format(file_index+1, FILE_COUNT), end="\r")

with open(PLAYLIST_LIST_PATH, "wb") as fout:
    pickle.dump(playlist_list, fout, protocol = pickle.HIGHEST_PROTOCOL)

Files processed: 1000/1000
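
To make the per-playlist transformation concrete, the sketch below applies the same steps to a single made-up record. The function name `compact_playlist` and all sample values are hypothetical; unlike the cell above, this version builds a new dictionary instead of modifying the input in place.

```python
TRACK_URI_PREFIX = "spotify:track:"
DROPPED_KEYS = ("tracks", "duration_ms", "description")


def compact_playlist(playlist):
    """Return a copy of `playlist` with its tracks reduced to a list of track IDs."""
    compact = {key: value for key, value in playlist.items() if key not in DROPPED_KEYS}
    compact["track_ids"] = [
        track["track_uri"][len(TRACK_URI_PREFIX):] for track in playlist["tracks"]
    ]
    return compact


# Made-up record for illustration only.
sample = {
    "name": "example",
    "pid": 0,
    "num_followers": 3,
    "duration_ms": 1000,
    "tracks": [{"track_uri": "spotify:track:abc123", "track_name": "Song"}],
}
result = compact_playlist(sample)
```

Here `result` keeps `name`, `pid`, and `num_followers`, and gains `track_ids` equal to `["abc123"]`, while `sample` itself is left unchanged.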

Generate an overview version of the playlists without the track lists:

In [None]:
with open(PLAYLIST_LIST_PATH, "rb") as fin:
    playlist_data = pickle.load(fin)

for playlist in playlist_data:
    del playlist["track_ids"]

with open(PLAYLIST_OVERVIEW_PATH, "wb") as fout:
    pickle.dump(playlist_data, fout, protocol = pickle.HIGHEST_PROTOCOL)

Generate a list of unique track IDs across all playlists:

In [None]:
with open(PLAYLIST_LIST_PATH, "rb") as fin:
    playlist_data = pickle.load(fin)

track_id_set = set()
for playlist in playlist_data:
    # set.update adds every track ID of the playlist; duplicates are ignored.
    track_id_set.update(playlist["track_ids"])

track_id_list = list(track_id_set)

with open(TRACK_URIS_PATH, "wb") as fout:
    pickle.dump(track_id_list, fout, protocol = pickle.HIGHEST_PROTOCOL)
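
The deduplication step can be illustrated on a small made-up input. In the sketch below, the function name `unique_track_ids` and the toy data are hypothetical. Sorting the result is an addition for this example: the cell above stores the IDs in whatever order the set yields them, which is not guaranteed to be stable between runs.

```python
def unique_track_ids(playlists):
    """Collect the distinct track IDs across all playlists, in sorted order."""
    track_ids = set()
    for playlist in playlists:
        track_ids.update(playlist["track_ids"])
    # Sorting is an assumption added here to get a reproducible order.
    return sorted(track_ids)


# Toy playlists sharing one track ("b").
toy_playlists = [
    {"track_ids": ["a", "b"]},
    {"track_ids": ["b", "c"]},
]
toy_unique = unique_track_ids(toy_playlists)
```

For the toy input, the shared ID `"b"` appears once in `toy_unique`.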