# DSC 207R: Building the Dataset

John Tayag

In [1]:
# Install spotipy
!pip install spotipy --upgrade



In [10]:
# Import modules
import pandas as pd
import spotipy as sp
import json
import os
import time

from spotipy.oauth2 import SpotifyClientCredentials
from dotenv import load_dotenv, find_dotenv

## Get the list of genres and subgenres

This approach was inspired by [this article](https://www.kaylinpavlik.com/classifying-songs-genres/) but rewritten in Python instead of R. I also added more genres to the queries and updated the subgenres. It relies on [Every Noise](https://everynoise.com/), a collection of Spotify's genre category names arranged in a genre space so that similar subgenres sit close together.

For this project, the list of genre seeds is defined by the user. Each seed genre is entered as the root in [Every Noise](https://everynoise.com/everynoise1d.cgi?root=pop&scope=mainstream%20only), which returns a list of the most popular subgenres similar to the seed. The genre name itself and the top 5 returned subgenres are selected for each genre and organized manually into a data frame. These values are used as the search keywords for playlists on Spotify.

Because Every Noise ranks its results by similarity, the genre categories themselves are often listed as subgenres of other genres. To limit overlap and avoid duplicate search queries, I skipped any returned subgenre that is itself one of the seed genres.
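The selection rule described above can be sketched in a few lines. This is only an illustration, not the code used to build the table: the helper `pick_subgenres` is hypothetical, the candidate list is typed in by hand in a made-up order rather than scraped from Every Noise, and "is a seed genre" is assumed to mean an exact name match.

```python
# Seed genres used in this notebook
GENRE_SEEDS = ["pop", "r&b", "rock", "hip hop", "rap", "edm", "jazz", "country", "classical"]

def pick_subgenres(seed, candidates, seeds=GENRE_SEEDS, n=5):
    """Return [seed] followed by up to n candidates that are not seed genres.

    Hypothetical helper: exact-name matching is an assumption.
    """
    kept = []
    for name in candidates:
        # Skip seed genres and repeated names
        if name in seeds or name in kept:
            continue
        kept.append(name)
        if len(kept) == n:
            break
    return [seed] + kept

# Illustrative candidate list (order invented for this example)
candidates = ["dance pop", "rap", "post-teen pop", "rock", "boy band", "uk pop", "alt z", "hip hop"]
print(pick_subgenres("pop", candidates))
# ['pop', 'dance pop', 'post-teen pop', 'boy band', 'uk pop', 'alt z']
```

With exact matching, names such as "pop rap" or "country rock" are kept, since they mention a seed genre without being one.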

In [11]:
# Build the genre data frame
genre_dict = {
    "pop": ["pop", "dance pop", "post-teen pop", "boy band", "uk pop", "alt z"],
    "r&b": ["r&b", "urban contemporary", "contemporary r&b", "neo soul", "alternative r&b", "uk contemporary r&b"],
    "rock": ["rock", "album rock", "classic rock", "heartland rock", "hard rock", "permanent wave"],
    "hip hop": ["hip hop", "urban contemporary", "southern hip hop", "atl hip hop", "hardcore hip hop", "west coast hip hop"],
    "rap": ["rap", "trap", "pop rap", "gangster rap", "dirty south rap", "melodic rap"],
    "edm": ["edm", "pop dance", "electro house", "dutch house", "progressive electro house", "brostep"],
    "jazz": ["jazz", "vocal jazz", "adult standards", "lounge", "movie tunes", "soul"],
    "country": ["country", "country road", "contemporary country", "country rock", "modern country rock", "country dawn"],
    "classical": ["classical", "compositional ambient", "orchestral soundtrack", "soundtrack", "healing hz", "easy listening"],
}

# Each column is a genre; row 0 is the genre itself, rows 1-5 are its subgenres
search_queries = pd.DataFrame(genre_dict)

search_queries

| | pop | r&b | rock | hip hop | rap | edm | jazz | country | classical |
|---|---|---|---|---|---|---|---|---|---|
| 0 | pop | r&b | rock | hip hop | rap | edm | jazz | country | classical |
| 1 | dance pop | urban contemporary | album rock | urban contemporary | trap | pop dance | vocal jazz | country road | compositional ambient |
| 2 | post-teen pop | contemporary r&b | classic rock | southern hip hop | pop rap | electro house | adult standards | contemporary country | orchestral soundtrack |
| 3 | boy band | neo soul | heartland rock | atl hip hop | gangster rap | dutch house | lounge | country rock | soundtrack |
| 4 | uk pop | alternative r&b | hard rock | hardcore hip hop | dirty south rap | progressive electro house | movie tunes | modern country rock | healing hz |
| 5 | alt z | uk contemporary r&b | permanent wave | west coast hip hop | melodic rap | brostep | soul | country dawn | easy listening |


Authenticating API requests requires client credentials from the Spotify Web API dashboard. This notebook reads them from a `.env` file.
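As a rough sketch of what the `.env` file is expected to contain, and of a check that can be run before authenticating: the variable names are the ones read by the next cell, the values shown are placeholders, and the helper `missing_credentials` is hypothetical.

```python
import os

# Expected contents of the .env file (placeholder values; never commit real ones):
#   SPOTIFY_CLIENT_ID=your-client-id
#   SPOTIFY_CLIENT_SECRET=your-client-secret

REQUIRED = ("SPOTIFY_CLIENT_ID", "SPOTIFY_CLIENT_SECRET")

def missing_credentials(env=os.environ, names=REQUIRED):
    """Return the names of required variables that are unset or empty."""
    return [name for name in names if not env.get(name)]

# A plain dict stands in for the environment in this example
print(missing_credentials({"SPOTIFY_CLIENT_ID": "abc"}))
# ['SPOTIFY_CLIENT_SECRET']
```

Calling `missing_credentials()` with no arguments after `load_dotenv(find_dotenv())` would check the real environment, which makes a missing `.env` file visible before the first API request fails.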

In [12]:
# Get authentication details and set up the spotipy object
load_dotenv(find_dotenv())
client_id, client_secret = os.getenv("SPOTIFY_CLIENT_ID"), os.getenv("SPOTIFY_CLIENT_SECRET")

auth_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
spot = sp.Spotify(auth_manager=auth_manager)

In [20]:
# Run num_queries searches of up to q_limit playlists for each genre-subgenre pair
playlist_ids = {}
q_limit = 20 # Limit on # of playlists received per query
num_queries = 5

for genre, subgenres in genre_dict.items():
    for subgenre in subgenres:
        for i in range(num_queries):
            try:
                # Note: offset=i moves the result window by one playlist per query,
                # so consecutive queries overlap; offset=i*q_limit would page instead
                results = spot.search(q=subgenre, limit=q_limit, market="US", offset=i, type="playlist")

                for item in results["playlists"]["items"]:
                    # Flag whether the playlist is curated by Spotify or by a user
                    owner = "spotify" if item["owner"]["id"] == "spotify" else "user"

                    # Keyed by playlist ID, so a playlist returned more than once is stored once
                    playlist_ids[item["id"]] = {
                        "playlist_genre": genre,
                        "playlist_subgenre": subgenre,
                        "playlist_id": item["id"],
                        "playlist_name": item["name"],
                        "playlist_owner": owner,
                        "playlist_description": item["description"],
                        "playlist_numtracks": item["tracks"]["total"],
                    }
            except Exception:
                # Skip queries that fail or return malformed items
                continue

playlist_df = pd.DataFrame.from_dict(playlist_ids, orient="index")
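The `offset` parameter of a Spotify search is the index of the first result to return, so how it is stepped determines how many distinct results a series of queries can reach. The sketch below compares two ways of stepping it using plain arithmetic, with no API calls; `page_offsets` and `covered_results` are hypothetical helpers.

```python
def page_offsets(num_queries, limit):
    """Offsets for non-overlapping pages: 0, limit, 2*limit, ..."""
    return [page * limit for page in range(num_queries)]

def covered_results(offsets, limit):
    """Set of result indices reachable by requests at the given offsets."""
    return {offset + k for offset in offsets for k in range(limit)}

q_limit, num_queries = 20, 5

# Offsets 0, 1, 2, 3, 4: each query shifts the window by one result
print(len(covered_results(range(num_queries), q_limit)))   # 24

# Offsets 0, 20, 40, 60, 80: each query returns a new page
print(page_offsets(num_queries, q_limit))                  # [0, 20, 40, 60, 80]
print(len(covered_results(page_offsets(num_queries, q_limit), q_limit)))  # 100
```

Stepping the offset by one keeps the sample close to the top-ranked playlists for each keyword, while stepping by the page size reaches further down the ranking. Which is preferable depends on how many playlists per subgenre the dataset should contain.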

### Beware: The following query takes a **very** long time to process

In [35]:
# For each playlist, record track IDs and relevant details
# Skip if:
#    query returns an error
#    track type is episode (podcast)
#    track id is missing (most likely user-uploaded local file not in Spotify's database)

# time.sleep() is used to slow down the query to avoid hitting rate limits

#playlist_df = pd.read_csv("playlist_df.csv", index_col=0)
results_dict = {}
q_limit = 50 # Limit on # of tracks received per playlist
pause_length = 1.5 #s

for i, playlist_id in enumerate(playlist_df["playlist_id"]):
    start = time.time()
    time.sleep(pause_length)
    print(f"{i}: {playlist_id}") # To see how many IDs have been processed

    try:
        results = spot.playlist_items(playlist_id=playlist_id, limit=q_limit, additional_types=["track"])
        num_results = len(results["items"])
        print(f"{num_results} Tracks found")

        for item in results["items"]:
            track = item["track"]
            if track is None or track["type"] == "episode" or track["id"] is None:
                continue

            # Skip duplicate tracks from same playlist
            key = playlist_id + "++" + track["id"]
            if key in results_dict:
                continue

            results_dict[key] = {
                "playlist_id": playlist_id,
                "track_id": track["id"],
                "track_name": track["name"],
                "track_popularity": track["popularity"],
                "album_id": track["album"]["id"],
                "album_name": track["album"]["name"],
                "album_type": track["album"]["album_type"],
                "album_release_date": track["album"]["release_date"],
                "album_markets": track["album"]["available_markets"],
                "album_total_tracks": track["album"]["total_tracks"],
                # Only the first listed artist is recorded
                "artist_id": track["artists"][0]["id"],
                "artist_name": track["artists"][0]["name"],
            }
        print("Success")
    except Exception:
        print("########################### Skipped")
    print(f"{time.time()-start}s elapsed")

track_df = pd.DataFrame.from_dict(results_dict, orient="index")

0: 6mtYuOxzl58vSGnEDtZ9uB
50 Tracks found
Success
2.452423572540283s elapsed
1: 37i9dQZF1EQncLwOalG3K7
50 Tracks found
Success
2.2669811248779297s elapsed
2: 37i9dQZF1DWUa8ZRTfalHk
50 Tracks found
Success
1.855689525604248s elapsed
3: 5TDtuKDbOhrfW7C58XnriZ
50 Tracks found
Success
2.085399866104126s elapsed
4: 5mHjseWRbEExCej6J1qsQJ
44 Tracks found
Success
2.2779860496520996s elapsed
5: 37i9dQZF1EIg1p0x6beBBb
50 Tracks found
Success
1.866835594177246s elapsed
6: 37i9dQZF1DX0A8zVl7p82B
50 Tracks found
Success
2.0830132961273193s elapsed
7: 37i9dQZF1DXadasIcsfbqh
50 Tracks found
Success
2.5248429775238037s elapsed
8: 37i9dQZF1EIhG5F4iZSTGg
50 Tracks found
Success
2.0584399700164795s elapsed
9: 37i9dQZF1DXaPCIWxzZwR1
50 Tracks found
Success
2.069795846939087s elapsed
10: 1Cgey68pUlQGsCPI2wJuxr
50 Tracks found
Success
1.954932689666748s elapsed
11: 37i9dQZF1DWSVpJBtEkFud
50 Tracks found
Success
2.505650043487549s elapsed
12: 7l4sdtYsHVypensTVz8rb3
48 Tracks found
Success
1.9538383483886719

HTTP Error for GET to https://api.spotify.com/v1/playlists/37i9dQZF1EIcpx4EgCbll4/tracks with Params: {'limit': 50, 'offset': 0, 'fields': None, 'market': None, 'additional_types': 'track'} returned 404 due to Not found.


154: 37i9dQZF1EIcpx4EgCbll4
########################### Skipped
1.6407649517059326s elapsed
155: 37i9dQZF1DX4SBhb3fqCJd
50 Tracks found
Success
1.947451114654541s elapsed
156: 4h9dvquufDVDyGuYRHeu73
50 Tracks found
Success
2.5058791637420654s elapsed
157: 37i9dQZF1DX04mASjTsvf0
50 Tracks found
Success
2.4968276023864746s elapsed
158: 37i9dQZF1DX2hNQN2Fv6Cy
50 Tracks found
Success
2.4395740032196045s elapsed
159: 1KwEY91x7R1InzrLGSvwGF
50 Tracks found
Success
1.806098222732544s elapsed
160: 37i9dQZF1DX2WkIBRaChxW
50 Tracks found
Success
1.930781602859497s elapsed
161: 2yErHwW3rFeEtrhlGV3Zol
47 Tracks found
Success
2.31512188911438s elapsed
162: 37i9dQZF1DWUbo613Z2iWO
50 Tracks found
Success
1.8372623920440674s elapsed
163: 5tSbO6M5AT7r91ytDd4nNP
50 Tracks found
Success
1.8979010581970215s elapsed
164: 3JODOAqoApc0PUIyaewbZN
50 Tracks found
Success
1.8975586891174316s elapsed
165: 6qm25ncYRw2ymgCiLw9Y8n
50 Tracks found
Success
1.7816646099090576s elapsed
166: 37i9dQZF1DWViBxWcYEI1b
50 Tr
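
The skip rules listed in the comments of the track query can be isolated in a small function, which makes them easy to check without calling the API. This is a sketch only: `usable_track` is a hypothetical helper, and the sample items are invented dictionaries that mimic just the fields the rules look at.

```python
def usable_track(item):
    """Return the track dict if the playlist item passes the skip rules, else None."""
    track = item.get("track")
    if track is None:                   # item has no track payload
        return None
    if track.get("type") == "episode":  # podcast episode
        return None
    if track.get("id") is None:         # most likely a local file
        return None
    return track

# Invented items covering each rule
items = [
    {"track": None},
    {"track": {"type": "episode", "id": "ep1"}},
    {"track": {"type": "track", "id": None}},
    {"track": {"type": "track", "id": "t1", "name": "Example"}},
]

kept = [track["id"] for track in map(usable_track, items) if track is not None]
print(kept)  # ['t1']
```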

In [36]:
# Query Spotify for audio features for each track ID in track_df
drop_cols = ["type", "uri", "track_href", "analysis_url"]
track_ids = track_df["track_id"].tolist()
feature_frames = []

# Split the track IDs into batches of 100
for batch_start in range(0, len(track_ids), 100):
    track_batch = track_ids[batch_start:batch_start+100]
    try:
        # Tracks without audio features come back as None and are dropped
        features = [f for f in spot.audio_features(track_batch) if f is not None]
    except Exception:
        # If error, individually query for the songs in the batch
        features = []
        for track in track_batch:
            try:
                features += [f for f in spot.audio_features([track]) if f is not None]
            except Exception:
                continue

    if features:
        feature_frames.append(pd.DataFrame(features).drop(columns=drop_cols))

audio_df = pd.concat(feature_frames, axis=0) if feature_frames else pd.DataFrame()
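
The batching step can be illustrated on its own. spotipy's `audio_features` accepts at most 100 track IDs per call, so the ID list has to be cut into slices of that size. The generator `batches` below is a hypothetical helper, and the IDs are placeholders rather than real Spotify IDs.

```python
def batches(ids, size=100):
    """Yield consecutive slices of ids with at most `size` elements each."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

# Placeholder IDs standing in for track_df["track_id"].tolist()
track_ids = [f"id{n}" for n in range(250)]

print([len(batch) for batch in batches(track_ids)])  # [100, 100, 50]
```

Slicing a plain list avoids label-based indexing on a pandas Series, where positions and index labels stop matching after the first batch.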
                

In [53]:
playlist_df.sort_values(by=["playlist_genre", "playlist_subgenre"]).reset_index(drop=True)

| | playlist_genre | playlist_subgenre | playlist_id | playlist_name | playlist_owner | playlist_description | playlist_numtracks |
|---|---|---|---|---|---|---|---|
| 0 | classical | classical | 37i9dQZF1DWUPafHP1BJw1 | Pop Goes Classical | spotify | Your favorite pop songs, classically reimagined. | 111 |
| 1 | classical | classical | 27Zm1P410dPfedsdoO9fqm | Classical Bangers 🎹🎻 | user | Best classical music to study, chill, and rela... | 250 |
| 2 | classical | classical | 37i9dQZF1DWWEJlAGA9gs0 | Classical Essentials | spotify | A selection of the greatest classical tunes; t... | 159 |
| 3 | classical | classical | 1h0CEZCm6IbFTbxThn6Xcs | Best Classical Music | user | Essential tracks from the most notable classic... | 201 |
| 4 | classical | classical | 37i9dQZF1DWVFeEut75IAL | Calming Classical | spotify | The most calming classical music. | 67 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1159 | rock | rock | 0nlB4rZ5Ng82YoDvOHYqaf | Rock en Español - Hits 💯% EXITOS | user | @cast_music.1 - Solo éxitos con la mejor selec... | 62 |
| 1160 | rock | rock | 37i9dQZF1DWWsq4e0rDzty | Rock School | spotify | For an education in rock. | 100 |
| 1161 | rock | rock | 3zoXL386C98ack345Ys9aK | Rocky Horror Picture Show Soundtrack | user | BEAUTIFUL | 20 |
| 1162 | rock | rock | 37i9dQZF1DX8YNmLOBjUmx | Rock editors' picks: Best Rock & Alt Songs of ... | spotify | Our editors' picks for the best Rock & Alterna... | 100 |


In [38]:
playlist_df.to_csv("playlist_df.csv")
track_df.to_csv("track_df.csv")
audio_df.to_csv("audio_df.csv")