# Introduction/Motivation

Hello and welcome! This notebook is the first installment in a series of three Jupyter notebooks (plus one .py script for the interactive dashboard) that make up the *Vince Staples Audio Feature Analysis* data project. The project showcases what I've learned in my data science studies: acquiring a dataset, parsing it and performing exploratory data analysis, and, where appropriate, applying machine learning methods to draw insights from the data.

While browsing the many interesting datasets available online, I came across Spotify's public-facing web API, through which anyone can access data on artists, tracks, albums, and more. Most notably, it exposes numerical audio features that quantify key characteristics of each track. As a massive hip-hop enthusiast, I was drawn to the idea of taking the Spotify tracks of one of my favorite artists - in this case, the renowned rap artist *Vince Staples* - and analyzing and visualizing those audio features. Hence this project was born.

Naturally, the first step in this data exploration exercise is to retrieve the data. As the title implies, this notebook is solely concerned with data acquisition. The feature analysis and machine learning implementation are discussed in the subsequent notebooks; for now, let's focus on pulling the data using Spotify's API and preparing it in a format suited to our analysis in later steps.

Although I've already decided that the artist of interest is *Vince Staples*, for the sake of reproducibility I've structured the code in this notebook (though not necessarily the markdown commentary) so that the same data pull can be performed for any other musical artist. We would only need to modify the artist string passed to the search query in Step 2 below.

Official documentation regarding how to work with Spotify's web API, which I used to guide me through the steps below, can be found [here](https://developer.spotify.com/documentation/general/guides/authorization/client-credentials/). Additionally, I would like to acknowledge the CodingEntrepreneurs channel on YouTube for creating an amazingly clear and helpful video that walks through how to authorize calls to the API and work with the tool in Python, which can be found [here](https://youtu.be/xdq6Gz33khQ). I highly recommend that anyone interested in understanding/reproducing this work visit their YouTube video, because I would not have been able to accomplish the data acquisition step as seamlessly without it!

# Step 0: Import Libraries

In [1]:
# Import libraries to be used throughout this exercise

import pandas as pd
import numpy as np
import requests
import datetime

# Step 1: Acquire an API access token

To do this, I first went to Spotify's developer site and created an app within the "Dashboard" section (https://developer.spotify.com/dashboard/applications).

My app is subsequently provided a "client ID" and "client secret", which I'll use to create a token for my session.
In other words, I'm using these fields to authorize my session with the API, after which I can perform calls to it as I wish.

In [2]:
client_id = '4eac4887de094047a2d06887ae745137'
client_secret = 'f6ed1b9d48e542c4bf091d6bf6fceeaf'
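Hard-coding the client ID and secret in a shared notebook exposes them to anyone who reads it. One common alternative, sketched below, is to read them from environment variables. The variable names `SPOTIFY_CLIENT_ID` and `SPOTIFY_CLIENT_SECRET` and the helper `load_credentials` are placeholders chosen for illustration, not something Spotify prescribes.

```python
import os

def load_credentials(env=None):
    """Return (client_id, client_secret) read from environment variables.

    The variable names are hypothetical; use whatever names you exported.
    """
    env = os.environ if env is None else env
    try:
        return env["SPOTIFY_CLIENT_ID"], env["SPOTIFY_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None
```

With this in place, the cell above could become `client_id, client_secret = load_credentials()`.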

Since my client is only going to be accessing public information from the service itself (not any particular user data),
I opt to use the "Client Credentials Flow" method of authorization. 

Essentially, I'll provide my client's ID and secret key in exchange for an access token, which will enable me to call the API for a specified period of time.
Spotify provides extensive documentation surrounding this method here: https://developer.spotify.com/documentation/general/guides/authorization/client-credentials/


In [3]:
# First, request an access token via a POST request (containing my client authentication) to the token generating url.

token_url = 'https://accounts.spotify.com/api/token'
method = 'POST'

# Create the body of the request 
token_body = {
    'grant_type' : 'client_credentials' # This specification is required, according to the documentation
}

In [4]:
# The header of the request will contain the credentials of my client. 
# Spotify notes that the credentials must be in the following form: <base64 encoded client_id:client_secret>

# Since we require base64 encoding, we'll need the base64 library to perform that string conversion
import base64

# Create the standard credentials string
client_credentials = f"{client_id}:{client_secret}"
    
# Encode the credentials in base 64
client_credentials_b64 = base64.b64encode(client_credentials.encode())

# Create the request header using the b64 credentials
token_header = {
    'Authorization' : f'Basic {client_credentials_b64.decode()}'
}
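The encoding steps above can be wrapped in a small helper so they are easy to reuse and check. This is a minimal sketch; `build_basic_auth_header` is a hypothetical name, and the ID and secret in the usage line are dummy values.

```python
import base64

def build_basic_auth_header(client_id, client_secret):
    """Build the 'Basic <base64(client_id:client_secret)>' Authorization header."""
    raw = f"{client_id}:{client_secret}".encode("utf-8")
    encoded = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {encoded}"}

# Dummy values, for illustration only
example_header = build_basic_auth_header("my-id", "my-secret")
```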

In [5]:
# Perform the POST request - the response is a json data structure that contains our access token, the token type, and the time to expiry (in seconds).

token_response = requests.post(token_url, data = token_body, headers = token_header)
print(token_response.status_code) # Check the status of the response; 200-299 indicate success

# Convert the response to a dictionary for parsing
token_response = token_response.json()

200


In [6]:
# Acquire access token and expiry from the response
access_token = token_response['access_token']
token_expire_time_secs = datetime.timedelta(seconds = token_response['expires_in'])

In [7]:
# Determine when the access token expires 

current_time = datetime.datetime.now()
token_expire_time = current_time + token_expire_time_secs

print('The current time is:', current_time.time(), '\n')
print('The access token expires at:', token_expire_time.time())

The current time is: 15:47:25.309373 

The access token expires at: 16:47:25.309373
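Since the token is only valid for the number of seconds given in `expires_in`, a longer session may need to check whether it has lapsed before each call. Here is a minimal sketch, assuming the response dictionary has the `access_token` and `expires_in` keys used above; the helper names and the 60-second safety margin are illustrative choices.

```python
import datetime

def parse_token_response(payload, now=None):
    """Return (access_token, expires_at) from the token endpoint's JSON payload."""
    now = datetime.datetime.now() if now is None else now
    expires_at = now + datetime.timedelta(seconds=payload["expires_in"])
    return payload["access_token"], expires_at

def token_is_expired(expires_at, now=None, margin_secs=60):
    """Return True if the token has expired or will expire within margin_secs."""
    now = datetime.datetime.now() if now is None else now
    return now >= expires_at - datetime.timedelta(seconds=margin_secs)
```

If `token_is_expired` returns True, the POST request from Step 1 can simply be repeated to obtain a fresh token.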


# Step 2: Perform a search for the artist and acquire their unique Spotify ID 

In [8]:
# Now that we have our access token, let's perform actual API calls

# First we'll perform a search query to find the artist data for Vince Staples

header = {
    'Authorization': f'Bearer {access_token}',
}

# As noted in the documentation, search requests require the query parameters to be embedded within the endpoint url, not passed through the body of the request
# Therefore, I'll define the query elements separately, and then consolidate them into the final search url

q = 'Vince Staples'
search_type = 'artist'
search_url = f'https://api.spotify.com/v1/search?q={q}&type={search_type}' # Other elements can be added to this string as necessary, but this is all that's needed to acquire a unique record for the artist

# Perform the API call and save the response in artist_search
artist_search_response = requests.get(search_url, headers = header)
if artist_search_response.status_code in range(200, 300): # 2xx status codes indicate success (range excludes its upper bound)
    artist_search = artist_search_response.json()
    artist_id = artist_search['artists']['items'][0]['id'] # Acquire the artist's unique Spotify ID from the first search result, which I'll use to pull their albums
    print('Search successful.')
else:
    print('Search failed. Error details:\n', artist_search_response.json()['error'])

Search successful.
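Artist names can contain spaces, ampersands, or slashes, so it is safer to percent-encode the query than to interpolate it directly into the URL. The sketch below uses the standard library's `urlencode`; `build_search_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://api.spotify.com/v1/search"

def build_search_url(query, search_type="artist"):
    """Return the search URL with the query parameters percent-encoded."""
    params = {"q": query, "type": search_type}
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"

print(build_search_url("Vince Staples"))
```

Alternatively, `requests.get` accepts a `params` dictionary and performs the same encoding, so the URL never has to be assembled by hand.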


# Step 3: Acquire all streamable albums from the artist, using their Spotify ID

In [9]:
# Instantiate a dictionary that will hold all albums' metadata
album_dict = {
    'name' : [],
    'album_id' : [],
    'release_date' : [],
    'release_date_precision' : [],
    'total_tracks' : [],
    'album_type' : [],
    'available_markets' : [],
    'album_cover_url' : [],
    'album_cover_url_large' : []
    
}

get_album_endpoint = f'https://api.spotify.com/v1/artists/{artist_id}/albums?include_groups=album'

get_album_response = requests.get(get_album_endpoint, headers = header)

# Loop through all albums and collect metadata; store in album_dict
if get_album_response.status_code in range(200, 300):
    for album in get_album_response.json()['items']:
        album_dict['name'].append(album['name'])
        album_dict['album_id'].append(album['id'])
        album_dict['release_date'].append(album['release_date'])
        album_dict['release_date_precision'].append(album['release_date_precision'])
        album_dict['total_tracks'].append(album['total_tracks'])
        album_dict['album_type'].append(album['album_type']) # 'album_type' holds the release type; 'type' is only the object type
        album_dict['available_markets'].append(album['available_markets'])
        album_dict['album_cover_url'].append(album['images'][-1]['url']) # Get the smallest album cover image (images are sorted from largest to smallest)
        album_dict['album_cover_url_large'].append(album['images'][1]['url']) # Get the second largest album cover image (For dashboard)
        
        
else: 
    print('Get request failed. Error details:\n', get_album_response.json()['error'])
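Indexing `album['images']` by position assumes every album has at least two images. A more defensive option is to choose by the `width` value that accompanies each image URL. This is only a sketch: `pick_image_url` and the 300-pixel target are illustrative, and it assumes each image dictionary carries `url` and `width` keys.

```python
def pick_image_url(images, target_width=300):
    """Return the URL of the image whose width is closest to target_width.

    Returns None when the list is empty. A missing width is treated as 0.
    """
    if not images:
        return None
    closest = min(images, key=lambda img: abs((img.get("width") or 0) - target_width))
    return closest["url"]
```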

In [10]:
# Create a dataframe with all albums and their metadata
albums_df = pd.DataFrame(album_dict)
albums_df

Unnamed: 0,name,album_id,release_date,release_date_precision,total_tracks,album_type,available_markets,album_cover_url,album_cover_url_large
0,RAMONA PARK BROKE MY HEART,2G549zeda2XNICgLmU0pNW,2022-04-08,day,16,album,"[AD, AE, AG, AL, AM, AO, AR, AT, AU, AZ, BA, B...",https://i.scdn.co/image/ab67616d000048519fd6f5...,https://i.scdn.co/image/ab67616d00001e029fd6f5...
1,RAMONA PARK BROKE MY HEART,05gPgBp2MxXHwcuF4zXW7h,2022-04-08,day,16,album,"[AD, AE, AG, AL, AM, AO, AR, AT, AU, AZ, BA, B...",https://i.scdn.co/image/ab67616d00004851e9ad8a...,https://i.scdn.co/image/ab67616d00001e02e9ad8a...
2,Vince Staples,2suR5CCbtL2Wq8ShFo8rFr,2021-07-09,day,10,album,"[AD, AE, AG, AL, AM, AO, AR, AT, AU, AZ, BA, B...",https://i.scdn.co/image/ab67616d00004851ab1b13...,https://i.scdn.co/image/ab67616d00001e02ab1b13...
3,Vince Staples,2JWfKp8TLFmtrt142Ra9qP,2021-07-09,day,10,album,"[AD, AE, AG, AL, AM, AO, AR, AT, AU, AZ, BA, B...",https://i.scdn.co/image/ab67616d00004851b92f75...,https://i.scdn.co/image/ab67616d00001e02b92f75...
4,FM!,1XGGeqLZxjOMdCJhmamIn8,2018-11-02,day,11,album,"[AD, AE, AG, AL, AM, AO, AR, AT, AU, AZ, BA, B...",https://i.scdn.co/image/ab67616d000048519b2f8a...,https://i.scdn.co/image/ab67616d00001e029b2f8a...
5,FM!,1HB4naIXCKcUtvBK4Gmcke,2018-11-02,day,11,album,"[AE, BB, BF, BH, BN, BY, CA, DZ, EG, GB, GH, I...",https://i.scdn.co/image/ab67616d0000485138dffd...,https://i.scdn.co/image/ab67616d00001e0238dffd...
6,Big Fish Theory,5h3WJG0aZjNOrayFu3MhCS,2017-06-23,day,12,album,"[AD, AE, AG, AL, AM, AO, AR, AT, AU, AZ, BA, B...",https://i.scdn.co/image/ab67616d000048517ba7b4...,https://i.scdn.co/image/ab67616d00001e027ba7b4...
7,Big Fish Theory,7qYJbYwpR4NDyG4OLb6tSb,2017-06-23,day,12,album,"[AD, AE, AG, AL, AM, AO, AR, AT, AU, AZ, BA, B...",https://i.scdn.co/image/ab67616d0000485196cbe3...,https://i.scdn.co/image/ab67616d00001e0296cbe3...
8,Prima Donna,2haR5qnQopCdVASZ92YTGn,2016-08-25,day,7,album,"[AD, AE, AG, AL, AM, AO, AR, AT, AU, AZ, BA, B...",https://i.scdn.co/image/ab67616d000048514e48a1...,https://i.scdn.co/image/ab67616d00001e024e48a1...
9,Summertime '06,4Csoz10NhNJOrCTUoPBdUD,2015-06-30,day,20,album,"[AD, AE, AG, AL, AM, AO, AR, AT, AU, AZ, BA, B...",https://i.scdn.co/image/ab67616d0000485186f51d...,https://i.scdn.co/image/ab67616d00001e0286f51d...


Oddly enough, most albums (all but Prima Donna and Summertime '06 in the table above) appear twice under different album IDs. These are not deluxe versions of the albums (notice that the number of tracks is the same between duplicates). After some digging, it turns out that albums can have multiple release versions targeted at different regions of the world. In this case, though, the 'available_markets' column appears identical between versions for nearly every duplicated album (FM! is the exception).

What's more likely is that these are re-uploaded versions, which is apparently expected behavior from distributors; more details here: https://community.spotify.com/t5/iOS-iPhone-iPad/Duplicates-of-the-same-albums/td-p/4542505  

In any case, I'll want to remove these duplicates before acquiring all corresponding track metadata. Since there's no discernible difference between using either album ID, our selection when de-duping is arbitrary, which makes things simple - we'll just keep the first instance of each album.
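One way to check whether same-named releases actually differ is to group the album records by name and compare their market lists. Here is a minimal sketch in plain Python, assuming each record is a dictionary with `name` and `available_markets` keys as in the API response; `albums_with_differing_markets` is a hypothetical helper.

```python
from collections import defaultdict

def albums_with_differing_markets(albums):
    """Return the names of albums whose duplicate entries list different markets."""
    markets_by_name = defaultdict(set)
    for album in albums:
        # frozenset ignores ordering, so only genuine differences are flagged
        markets_by_name[album["name"]].add(frozenset(album["available_markets"]))
    return sorted(name for name, variants in markets_by_name.items() if len(variants) > 1)
```

An empty result would mean the choice of which duplicate to keep really is arbitrary as far as markets are concerned.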

In [11]:
# Remove album duplicates; prioritize first instance
albums_df.drop_duplicates(subset = 'name', keep = 'first', inplace = True)
albums_df.reset_index(drop=True, inplace=True)
albums_df

Unnamed: 0,name,album_id,release_date,release_date_precision,total_tracks,album_type,available_markets,album_cover_url,album_cover_url_large
0,RAMONA PARK BROKE MY HEART,2G549zeda2XNICgLmU0pNW,2022-04-08,day,16,album,"[AD, AE, AG, AL, AM, AO, AR, AT, AU, AZ, BA, B...",https://i.scdn.co/image/ab67616d000048519fd6f5...,https://i.scdn.co/image/ab67616d00001e029fd6f5...
1,Vince Staples,2suR5CCbtL2Wq8ShFo8rFr,2021-07-09,day,10,album,"[AD, AE, AG, AL, AM, AO, AR, AT, AU, AZ, BA, B...",https://i.scdn.co/image/ab67616d00004851ab1b13...,https://i.scdn.co/image/ab67616d00001e02ab1b13...
2,FM!,1XGGeqLZxjOMdCJhmamIn8,2018-11-02,day,11,album,"[AD, AE, AG, AL, AM, AO, AR, AT, AU, AZ, BA, B...",https://i.scdn.co/image/ab67616d000048519b2f8a...,https://i.scdn.co/image/ab67616d00001e029b2f8a...
3,Big Fish Theory,5h3WJG0aZjNOrayFu3MhCS,2017-06-23,day,12,album,"[AD, AE, AG, AL, AM, AO, AR, AT, AU, AZ, BA, B...",https://i.scdn.co/image/ab67616d000048517ba7b4...,https://i.scdn.co/image/ab67616d00001e027ba7b4...
4,Prima Donna,2haR5qnQopCdVASZ92YTGn,2016-08-25,day,7,album,"[AD, AE, AG, AL, AM, AO, AR, AT, AU, AZ, BA, B...",https://i.scdn.co/image/ab67616d000048514e48a1...,https://i.scdn.co/image/ab67616d00001e024e48a1...
5,Summertime '06,4Csoz10NhNJOrCTUoPBdUD,2015-06-30,day,20,album,"[AD, AE, AG, AL, AM, AO, AR, AT, AU, AZ, BA, B...",https://i.scdn.co/image/ab67616d0000485186f51d...,https://i.scdn.co/image/ab67616d00001e0286f51d...
6,Hell Can Wait,7mxpMxmMM8RN39YRlo08v7,2014-09-23,day,7,album,"[AD, AE, AG, AL, AM, AO, AR, AT, AU, AZ, BA, B...",https://i.scdn.co/image/ab67616d00004851299c72...,https://i.scdn.co/image/ab67616d00001e02299c72...


# Step 4: Acquire all albums' track IDs, using albums' spotify IDs

Awesome, our album dataframe looks good and comprehensive now, and is also conveniently sorted by release date (from latest to oldest). 

The next step is to get the full set of tracks for each album, which we'll store in a separate dataframe. According to Spotify's API documentation, we can do this by performing a GET request from the following URL, replacing {album_id} with the album ID: https://api.spotify.com/v1/albums/{album_id}/tracks.

There is more track metadata available at https://api.spotify.com/v1/tracks/{track_id}, so I'll acquire all track IDs from the former endpoint, then use those track IDs to look up further track metadata from the latter.

In the query string of our request, we can specify the number of results to return via the limit parameter. Notice from albums_df that the largest album in Vince's discography is Summertime '06, with 20 tracks, which happens to coincide with the default number of results returned. Regardless, we'll set this parameter explicitly (and to encourage code reusability). Note that the maximum possible value for limit is 50. If an album had more than 50 tracks (perhaps in non-musical volumes), we'd need to perform repeated queries that also specify an offset (the index of the first result to return), so that we could retrieve all pages of results.
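The paging logic described above could look roughly like this. To keep the sketch independent of the network, it takes a `fetch_page(offset, limit)` callable that is assumed to return a dictionary with `items` and `total` keys, which is the shape of Spotify's paged responses; the function names are illustrative.

```python
def fetch_all_items(fetch_page, limit=50):
    """Collect every item from a paged endpoint by advancing the offset."""
    items = []
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        items.extend(page["items"])
        offset += limit
        # Stop once we have passed the reported total or a page comes back empty
        if offset >= page["total"] or not page["items"]:
            break
    return items
```

In this notebook, `fetch_page` would wrap a `requests.get` call to the album tracks endpoint with `limit` and `offset` appended to the query string.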

In [12]:
# Instantiate a dictionary to hold all album and track IDs
album_tracks_dict = {
    'album' : [],
    'album_id' : [],
    'album_release_date' : [],
    'album_cover_url' : [],
    'album_cover_url_large' : [],
    'track_id' : [],
    'track_name' : [],
    'track_number' : [],
}

# The number of tracks to return
results_returned = 20

# Loop through all albums in albums_df
for i, row in albums_df.iterrows():
    
    album_id = row['album_id']
    get_tracks_endpoint = f'https://api.spotify.com/v1/albums/{album_id}/tracks?limit={results_returned}' 

    # Get all tracks in the current album
    get_tracks_response = requests.get(get_tracks_endpoint, headers = header)
    
    if get_tracks_response.status_code in range(200, 300):
    
        # Loop through all returned tracks in the current album
        for track_result in get_tracks_response.json()['items']:
            album_tracks_dict['album'].append(row['name'])
            album_tracks_dict['album_id'].append(row['album_id'])
            album_tracks_dict['album_release_date'].append(row['release_date'])
            album_tracks_dict['album_cover_url'].append(row['album_cover_url'])
            album_tracks_dict['album_cover_url_large'].append(row['album_cover_url_large'])
            album_tracks_dict['track_id'].append(track_result['id'])
            album_tracks_dict['track_name'].append(track_result['name'])
            album_tracks_dict['track_number'].append(track_result['track_number'])
    else:
        print('Track retrieval failed. Details:\n', get_tracks_response.json()['error'])

In [13]:
# Create a tracks dataframe from our new dictionary
album_tracks_df = pd.DataFrame(album_tracks_dict)

album_tracks_df

Unnamed: 0,album,album_id,album_release_date,album_cover_url,album_cover_url_large,track_id,track_name,track_number
0,RAMONA PARK BROKE MY HEART,2G549zeda2XNICgLmU0pNW,2022-04-08,https://i.scdn.co/image/ab67616d000048519fd6f5...,https://i.scdn.co/image/ab67616d00001e029fd6f5...,0lqAn1YfFVQ3SdoF7tRZO2,THE BEACH,1
1,RAMONA PARK BROKE MY HEART,2G549zeda2XNICgLmU0pNW,2022-04-08,https://i.scdn.co/image/ab67616d000048519fd6f5...,https://i.scdn.co/image/ab67616d00001e029fd6f5...,7CvtBcThQ4piVKkfUXieig,AYE! (FREE THE HOMIES),2
2,RAMONA PARK BROKE MY HEART,2G549zeda2XNICgLmU0pNW,2022-04-08,https://i.scdn.co/image/ab67616d000048519fd6f5...,https://i.scdn.co/image/ab67616d00001e029fd6f5...,3MVFHHeksQCnVuKOjPN01M,DJ QUIK,3
3,RAMONA PARK BROKE MY HEART,2G549zeda2XNICgLmU0pNW,2022-04-08,https://i.scdn.co/image/ab67616d000048519fd6f5...,https://i.scdn.co/image/ab67616d00001e029fd6f5...,7jN5Abri3a1crehbnlWa1F,MAGIC,4
4,RAMONA PARK BROKE MY HEART,2G549zeda2XNICgLmU0pNW,2022-04-08,https://i.scdn.co/image/ab67616d000048519fd6f5...,https://i.scdn.co/image/ab67616d00001e029fd6f5...,77JeMQGOagAhWcMd99RYCO,NAMELESS,5
...,...,...,...,...,...,...,...,...
78,Hell Can Wait,7mxpMxmMM8RN39YRlo08v7,2014-09-23,https://i.scdn.co/image/ab67616d00004851299c72...,https://i.scdn.co/image/ab67616d00001e02299c72...,7nUsDCJPnJTqNUqxipnNy4,Screen Door,3
79,Hell Can Wait,7mxpMxmMM8RN39YRlo08v7,2014-09-23,https://i.scdn.co/image/ab67616d00004851299c72...,https://i.scdn.co/image/ab67616d00001e02299c72...,2eb51UvA5hKfZOC1cy1oA1,Hands Up,4
80,Hell Can Wait,7mxpMxmMM8RN39YRlo08v7,2014-09-23,https://i.scdn.co/image/ab67616d00004851299c72...,https://i.scdn.co/image/ab67616d00001e02299c72...,6w6SW8zyEcyxwSR7Wya45a,Blue Suede,5
81,Hell Can Wait,7mxpMxmMM8RN39YRlo08v7,2014-09-23,https://i.scdn.co/image/ab67616d00004851299c72...,https://i.scdn.co/image/ab67616d00001e02299c72...,5nFur5SMzdpP4obWLrBFD3,Limos,6


# Step 5: Acquire detailed track metadata using track IDs

Nice, we're getting close! The final two steps are to: 
1. Acquire additional descriptors for each track (e.g. popularity, song duration, explicit language flag)
2. Acquire the numerical audio features corresponding to each track (e.g. valence, acousticness, danceability)

We'll set up a single dataframe to house the data from both of these steps, called tracks_df.

In [14]:
tracks_dict = {
    'album' : [],
    'album_id' : [],
    'album_release_date' : [],
    'album_cover_url' : [],
    'album_cover_url_large' : [],
    'track_id' : [],
    'track_name' : [],
    'track_number' : [],
    'duration_ms' : [],
    'disc_number' : [],
    'explicit_flag' : [],
#    'restrictions' : [], N/A for this artist
    'track_popularity' : []
}

# Loop through each track in the dataframe
for i, row in album_tracks_df.iterrows():
    
    # Construct the track-specific endpoint which will contain all of the metadata we need
    track_id = row['track_id']
    get_track_data_endpoint = f'https://api.spotify.com/v1/tracks/{track_id}'
    get_track_data_response = requests.get(get_track_data_endpoint, headers = header)
    
    if get_track_data_response.status_code in range(200, 300):

        # Collect the response
        track_data = get_track_data_response.json()
        
        # Populate tracks_dict
        tracks_dict['album'].append(row['album'])
        tracks_dict['album_id'].append(row['album_id'])
        tracks_dict['album_release_date'].append(row['album_release_date'])
        tracks_dict['album_cover_url'].append(row['album_cover_url'])
        tracks_dict['album_cover_url_large'].append(row['album_cover_url_large'])
        tracks_dict['track_id'].append(row['track_id'])
        tracks_dict['track_name'].append(row['track_name'])
        tracks_dict['track_number'].append(row['track_number'])
        tracks_dict['duration_ms'].append(track_data['duration_ms'])
        tracks_dict['disc_number'].append(track_data['disc_number'])
        tracks_dict['explicit_flag'].append(track_data['explicit'])
#        tracks_dict['restrictions'].append(track_data['restrictions']['reason']) N/A for this artist
        tracks_dict['track_popularity'].append(track_data['popularity'])
    else:
        print('Track retrieval failed. Details:\n', get_track_data_response.json()['error'])

# Create a new dataframe for all of our tracks + the acquired metadata
tracks_df = pd.DataFrame(tracks_dict)
tracks_df.head()

Looking good. As a last step, let's convert the track duration column from milliseconds to seconds.

In [16]:
tracks_df['track_duration_s'] = tracks_df['duration_ms']/1000
tracks_df.drop(columns = ['duration_ms'], inplace = True)

# Step 6: Add each track's audio features to tracks_df

As previously mentioned, the last piece of our dataset is each track's audio features, which will be central to our exploratory analysis (more on that later).

I'll handle this step slightly differently. Instead of iterating through the rows of tracks_df and calling the API once per row (fairly inefficient), I'll pass the full list of track IDs in the query string of a single request. The response is a list of feature dictionaries, one for each track, which I'll then merge into our dataframe using track_id as the key. This way, we obtain all of our audio features in one sweep.
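Batch endpoints cap the number of IDs per request; to my knowledge the audio-features endpoint accepts up to 100, which comfortably covers the tracks here. For a larger discography, the ID list could be split into chunks first. This is a minimal sketch, where `chunked`, `build_audio_feature_urls`, and the default batch size are assumptions for illustration.

```python
def chunked(values, size):
    """Yield consecutive slices of `values`, each with at most `size` elements."""
    for start in range(0, len(values), size):
        yield values[start:start + size]

def build_audio_feature_urls(track_ids, batch_size=100):
    """Return one audio-features URL per batch of track IDs."""
    base = "https://api.spotify.com/v1/audio-features"
    return [f"{base}?ids={','.join(batch)}" for batch in chunked(track_ids, batch_size)]
```

Each URL would then be requested in turn and the resulting feature lists concatenated.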

In [17]:
# First create a single string consisting of all unique track_ids, delimited by commas
track_ids = tracks_df['track_id'].to_list()
delimiter = ','
ids = delimiter.join(track_ids)

# Embed the string within the API endpoint
audio_features_endpoint = f'https://api.spotify.com/v1/audio-features?ids={ids}'
audio_features_response = requests.get(audio_features_endpoint, headers = header)
if audio_features_response.status_code in range(200, 300):
    
    audio_features = audio_features_response.json()['audio_features']
    
    # Check that we have as many sets of features as we have tracks
    print('Tracks: ', tracks_df.shape[0])
    print('Audio Features: ', len(audio_features))
else: 
    print('Audio feature retrieval failed. Details:\n', audio_features_response.json()['error'])

Tracks:  83
Audio Features:  83
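Matching counts do not guarantee that every track actually has features: as I understand the API reference, the response keeps one position per requested ID and fills it with a null when no features exist. A short check along these lines would surface such gaps before building the dataframe. `split_missing_features` is a hypothetical helper, and the sample data in the usage lines is made up.

```python
def split_missing_features(track_ids, audio_features):
    """Separate usable feature dictionaries from IDs that came back empty.

    Assumes audio_features is in the same order as track_ids, with None
    standing in for any track that has no features.
    """
    found = [f for f in audio_features if f is not None]
    missing = [tid for tid, f in zip(track_ids, audio_features) if f is None]
    return found, missing

# Made-up example: the second track has no features
found, missing = split_missing_features(["a", "b"], [{"id": "a", "tempo": 90.0}, None])
print(missing)  # ['b']
```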


In [18]:
# Create a dataframe from the response
audio_features_df = pd.DataFrame(audio_features)
audio_features_df

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,type,id,uri,track_href,analysis_url,duration_ms,time_signature
0,0.593,0.947,4,-6.206,0,0.6310,0.88800,0.000000,0.8230,0.6820,157.850,audio_features,0lqAn1YfFVQ3SdoF7tRZO2,spotify:track:0lqAn1YfFVQ3SdoF7tRZO2,https://api.spotify.com/v1/tracks/0lqAn1YfFVQ3...,https://api.spotify.com/v1/audio-analysis/0lqA...,67210,4
1,0.745,0.690,3,-7.063,1,0.0628,0.21500,0.000000,0.0779,0.2870,95.022,audio_features,7CvtBcThQ4piVKkfUXieig,spotify:track:7CvtBcThQ4piVKkfUXieig,https://api.spotify.com/v1/tracks/7CvtBcThQ4pi...,https://api.spotify.com/v1/audio-analysis/7Cvt...,185880,4
2,0.829,0.593,1,-11.431,0,0.1380,0.42200,0.000290,0.1920,0.8180,86.007,audio_features,3MVFHHeksQCnVuKOjPN01M,spotify:track:3MVFHHeksQCnVuKOjPN01M,https://api.spotify.com/v1/tracks/3MVFHHeksQCn...,https://api.spotify.com/v1/audio-analysis/3MVF...,140297,4
3,0.743,0.601,4,-6.684,0,0.1810,0.27200,0.000097,0.4960,0.2840,89.970,audio_features,7jN5Abri3a1crehbnlWa1F,spotify:track:7jN5Abri3a1crehbnlWa1F,https://api.spotify.com/v1/tracks/7jN5Abri3a1c...,https://api.spotify.com/v1/audio-analysis/7jN5...,226006,4
4,0.553,0.341,11,-11.554,0,0.0958,0.88700,0.434000,0.1400,0.3980,66.202,audio_features,77JeMQGOagAhWcMd99RYCO,spotify:track:77JeMQGOagAhWcMd99RYCO,https://api.spotify.com/v1/tracks/77JeMQGOagAh...,https://api.spotify.com/v1/audio-analysis/77Je...,48225,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
78,0.566,0.497,11,-9.951,0,0.2340,0.00199,0.000000,0.1130,0.0706,140.085,audio_features,7nUsDCJPnJTqNUqxipnNy4,spotify:track:7nUsDCJPnJTqNUqxipnNy4,https://api.spotify.com/v1/tracks/7nUsDCJPnJTq...,https://api.spotify.com/v1/audio-analysis/7nUs...,247013,4
79,0.609,0.712,0,-6.478,1,0.2520,0.00501,0.143000,0.6290,0.2300,88.149,audio_features,2eb51UvA5hKfZOC1cy1oA1,spotify:track:2eb51UvA5hKfZOC1cy1oA1,https://api.spotify.com/v1/tracks/2eb51UvA5hKf...,https://api.spotify.com/v1/audio-analysis/2eb5...,199800,4
80,0.415,0.745,7,-5.902,1,0.1740,0.10800,0.000067,0.5260,0.4680,84.926,audio_features,6w6SW8zyEcyxwSR7Wya45a,spotify:track:6w6SW8zyEcyxwSR7Wya45a,https://api.spotify.com/v1/tracks/6w6SW8zyEcyx...,https://api.spotify.com/v1/audio-analysis/6w6S...,218040,4
81,0.594,0.732,11,-5.852,0,0.4170,0.30700,0.000043,0.1110,0.5580,175.979,audio_features,5nFur5SMzdpP4obWLrBFD3,spotify:track:5nFur5SMzdpP4obWLrBFD3,https://api.spotify.com/v1/tracks/5nFur5SMzdpP...,https://api.spotify.com/v1/audio-analysis/5nFu...,212360,4


Now that was much quicker and easier. The last step will be to merge these columns into tracks_df - but first, let's drop some of the extraneous columns that we won't be needing.

In [19]:
audio_features_df.drop(columns = ['type', 'uri', 'track_href', 'analysis_url', 'duration_ms'], inplace = True)
audio_features_df.head()

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,id,time_signature
0,0.593,0.947,4,-6.206,0,0.631,0.888,0.0,0.823,0.682,157.85,0lqAn1YfFVQ3SdoF7tRZO2,4
1,0.745,0.69,3,-7.063,1,0.0628,0.215,0.0,0.0779,0.287,95.022,7CvtBcThQ4piVKkfUXieig,4
2,0.829,0.593,1,-11.431,0,0.138,0.422,0.00029,0.192,0.818,86.007,3MVFHHeksQCnVuKOjPN01M,4
3,0.743,0.601,4,-6.684,0,0.181,0.272,9.7e-05,0.496,0.284,89.97,7jN5Abri3a1crehbnlWa1F,4
4,0.553,0.341,11,-11.554,0,0.0958,0.887,0.434,0.14,0.398,66.202,77JeMQGOagAhWcMd99RYCO,3


# Step 7: Merge audio features into tracks_df  

In [20]:
# This step will essentially be a SQL left-join. 
# Since both dataframes are at the same level of detail and contain the same records, this is equivalent to performing an inner-join
tracks_df = tracks_df.merge(right = audio_features_df, how = 'left', left_on = 'track_id', right_on = 'id')
tracks_df.head()

Unnamed: 0,album,album_id,album_release_date,album_cover_url,album_cover_url_large,track_id,track_name,track_number,disc_number,explicit_flag,...,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,id,time_signature
0,RAMONA PARK BROKE MY HEART,2G549zeda2XNICgLmU0pNW,2022-04-08,https://i.scdn.co/image/ab67616d000048519fd6f5...,https://i.scdn.co/image/ab67616d00001e029fd6f5...,0lqAn1YfFVQ3SdoF7tRZO2,THE BEACH,1,1,True,...,-6.206,0,0.631,0.888,0.0,0.823,0.682,157.85,0lqAn1YfFVQ3SdoF7tRZO2,4
1,RAMONA PARK BROKE MY HEART,2G549zeda2XNICgLmU0pNW,2022-04-08,https://i.scdn.co/image/ab67616d000048519fd6f5...,https://i.scdn.co/image/ab67616d00001e029fd6f5...,7CvtBcThQ4piVKkfUXieig,AYE! (FREE THE HOMIES),2,1,True,...,-7.063,1,0.0628,0.215,0.0,0.0779,0.287,95.022,7CvtBcThQ4piVKkfUXieig,4
2,RAMONA PARK BROKE MY HEART,2G549zeda2XNICgLmU0pNW,2022-04-08,https://i.scdn.co/image/ab67616d000048519fd6f5...,https://i.scdn.co/image/ab67616d00001e029fd6f5...,3MVFHHeksQCnVuKOjPN01M,DJ QUIK,3,1,True,...,-11.431,0,0.138,0.422,0.00029,0.192,0.818,86.007,3MVFHHeksQCnVuKOjPN01M,4
3,RAMONA PARK BROKE MY HEART,2G549zeda2XNICgLmU0pNW,2022-04-08,https://i.scdn.co/image/ab67616d000048519fd6f5...,https://i.scdn.co/image/ab67616d00001e029fd6f5...,7jN5Abri3a1crehbnlWa1F,MAGIC,4,1,True,...,-6.684,0,0.181,0.272,9.7e-05,0.496,0.284,89.97,7jN5Abri3a1crehbnlWa1F,4
4,RAMONA PARK BROKE MY HEART,2G549zeda2XNICgLmU0pNW,2022-04-08,https://i.scdn.co/image/ab67616d000048519fd6f5...,https://i.scdn.co/image/ab67616d00001e029fd6f5...,77JeMQGOagAhWcMd99RYCO,NAMELESS,5,1,False,...,-11.554,0,0.0958,0.887,0.434,0.14,0.398,66.202,77JeMQGOagAhWcMd99RYCO,3


In [21]:
# Validate our merge
print('Tracks post-merge: ', tracks_df.shape[0])

# Check population of the merged features
print('\nNulls in merged columns:')
audio_columns = audio_features_df.columns.to_list()
print(tracks_df.loc[:, audio_columns].isnull().sum())

Tracks post-merge:  83

Nulls in merged columns:
danceability        0
energy              0
key                 0
loudness            0
mode                0
speechiness         0
acousticness        0
instrumentalness    0
liveness            0
valence             0
tempo               0
id                  0
time_signature      0
dtype: int64


In [22]:
# Looks good! Finally, let's remove the 'id' column to reduce redundancy.
tracks_df.drop(columns = ['id'], inplace = True)

# Data collection is complete! 

Now that we have our fully fledged dataset in tracks_df, we can move on to the fun part - data exploration. See you in the next notebook.

In [23]:
# Save off the final tracks dataframe as a CSV file within the project folder in the user's Documents 
import os
savepath = os.path.join('~/Documents/Vince Staples Audio Feature Analysis/','tracks.csv')
tracks_df.to_csv(savepath)