# Data Wrangling | Modeling Spotify album popularity
## Leo Evancie, Springboard Data Science Career Track

This is the first step in a capstone project to model album popularity on Spotify, a popular music streaming service. Further project details and rationale can be found in the document 'Proposal.pdf'.

Note: This notebook makes use of <i>Spotipy</i>, a Python library designed for Spotify API calls. I referred to the Medium article "<a href="https://medium.com/@maxtingle/getting-started-with-spotifys-api-spotipy-197c3dc6353b">Getting Started with Spotify's API & Spotipy</a>" by Max Tingle (2019), as well as the <a href="https://spotipy.readthedocs.io/en/latest/#"><i>Spotipy</i> documentation.</a>

### 1. Data collection

First, import the relevant libraries:

In [1]:
import pandas as pd
import numpy as np
import time
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from dotenv import load_dotenv

Load my Spotify developer credentials, stored as environment variables, and instantiate the Spotify API client.

In [2]:
load_dotenv()
sp = spotipy.Spotify(client_credentials_manager=SpotifyClientCredentials())
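
If the credentials are missing, the client call above fails with an authentication error, so a quick pre-check can help. The sketch below assumes the `.env` file defines `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET`, the variable names Spotipy's client-credentials manager reads by default. The helper `missing_credentials` is hypothetical and not part of Spotipy.

```python
import os

# Environment variable names read by Spotipy's client-credentials flow by default
REQUIRED_VARS = ("SPOTIPY_CLIENT_ID", "SPOTIPY_CLIENT_SECRET")


def missing_credentials(env=os.environ):
    """Return the required variable names that are absent or empty (hypothetical helper)."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Demonstration with stand-in dictionaries rather than real secrets
print(missing_credentials({"SPOTIPY_CLIENT_ID": "abc"}))
print(missing_credentials({"SPOTIPY_CLIENT_ID": "abc", "SPOTIPY_CLIENT_SECRET": "xyz"}))
```

In the notebook itself, calling `missing_credentials()` right after `load_dotenv()` would check the real environment.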

Retrieve a sampling of 20 albums (the default response limit) from the year 2021 to get a sense of the response structure.

In [3]:
album_results = sp.search(q='year:2021', type='album')

In [4]:
type(album_results)

dict

In [5]:
album_results['albums']['items'][0].keys()

dict_keys(['album_type', 'artists', 'available_markets', 'external_urls', 'href', 'id', 'images', 'name', 'release_date', 'release_date_precision', 'total_tracks', 'type', 'uri'])

I notice that <i>popularity</i> is not listed among the returned data. This is the variable I want my model to predict. Let's try a search for individual tracks instead.

In [6]:
track_results = sp.search(q='year:2021', type='track')
track_results = track_results['tracks']['items']
track_results[0].keys()

dict_keys(['album', 'artists', 'available_markets', 'disc_number', 'duration_ms', 'explicit', 'external_ids', 'external_urls', 'href', 'id', 'is_local', 'name', 'popularity', 'preview_url', 'track_number', 'type', 'uri'])

Bingo. Since <i>popularity</i> is reported for tracks, I will need to search for a large quantity of tracks, rather than albums, in order to develop a model for predicting popularity. I expect that <i>album</i> and <i>artists</i> will heavily influence popularity. Hopefully, I can identify plenty of other influential features; an artist can't change their identity in the hope of boosting their music's popularity score!

I'll load the sample results into a DataFrame to get a clearer picture of the values, keeping only those variables that seem relevant to the question at hand. For example, I will exclude <i>preview_url</i> and other link fields, keeping only the track's <i>id</i> (stored as <i>track_id</i>) as a unique identifier.

In [7]:
track_id = []
name = []
artist_name = []
popularity = []
duration_ms = []
explicit = []
track_number = []
available_markets = []
disc_number = []
album_type = []

for track in track_results:
    track_id.append(track['id'])
    name.append(track['name'])
    artist_name.append(track['artists'][0]['name'])
    popularity.append(track['popularity'])
    duration_ms.append(track['duration_ms'])
    explicit.append(track['explicit'])
    track_number.append(track['track_number'])
    available_markets.append(track['available_markets'])
    disc_number.append(track['disc_number'])
    album_type.append(track['album']['album_type'])
    
track_df = pd.DataFrame({
    'track_id':track_id,
    'name':name,
    'artist_name':artist_name,
    'popularity':popularity,
    'duration_ms':duration_ms,
    'explicit':explicit,
    'track_number':track_number,
    'disc_number':disc_number,
    'available_markets':available_markets,
    'album_type':album_type
})

track_df

| | track_id | name | artist_name | popularity | duration_ms | explicit | track_number | disc_number | available_markets | album_type |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 748mdHapucXQri7IAO8yFK | Kiss Me More (feat. SZA) | Doja Cat | 99 | 208866 | True | 1 | 1 | [AD, AE, AG, AL, AM, AO, AR, AT, AU, AZ, BA, B... | single |
| 1 | 08ejYlzduA6O82FJgnFKQz | Year 2020 | Leonardo Makno | 37 | 133281 | False | 2 | 1 | [AD, AE, AG, AL, AM, AO, AR, AT, AU, AZ, BA, B... | single |
| 2 | 0UsmyJDsst2xhX1ZiFF3JW | Year,2015 | Schoolgirl Byebye | 25 | 74301 | False | 1 | 1 | [AD, AE, AG, AL, AM, AO, AR, AT, AU, AZ, BA, B... | single |
| 3 | 0koiLB5LKB5fXOuZKNmQvc | King | Years & Years | 0 | 214834 | False | 19 | 1 | [AE, AL, AR, AT, AU, BA, BE, BG, BH, BO, BR, B... | album |
| 4 | 6PERP62TejQjgHu81OHxgM | good 4 u | Olivia Rodrigo | 98 | 178147 | True | 1 | 1 | [AD, AE, AG, AL, AM, AO, AR, AT, AU, AZ, BA, B... | single |
| 5 | 3wLWURwOlQZ7O3VWcV4BJJ | Ingrato amor - Live Streaming audio and video ... | Amaranta | 5 | 352647 | False | 13 | 1 | [AD, AE, AG, AL, AM, AO, AR, AT, AU, AZ, BA, B... | album |
| 6 | 7BFk1nSfwfkDOgnNNeY7Yn | wish i dropped out like brakence interlude | Satoshi Love | 0 | 83000 | False | 4 | 1 | [AD, AE, AG, AL, AM, AO, AR, AT, AU, AZ, BA, B... | album |
| 7 | 5BJs3aDrKYBRkzrhaVxf89 | After Me - 2021 Version | Saliva | 47 | 234288 | False | 3 | 1 | [AD, AE, AG, AL, AM, AO, AR, AT, AU, AZ, BA, B... | single |
| 8 | 43PGPuHIlVOc04jrZVh9L6 | RAPSTAR | Polo G | 95 | 165925 | True | 1 | 1 | [AD, AE, AG, AL, AM, AO, AR, AT, AU, AZ, BA, B... | single |
| 9 | 6EPUCU04NW6c9eCwfTcxIQ | Medley Picaflor de Los Andes - Live Streaming ... | Amaranta | 7 | 1035420 | False | 3 | 1 | [AD, AE, AG, AL, AM, AO, AR, AT, AU, AZ, BA, B... | album |


A few questions occur to me at this stage. What is to be made of the <i>available_markets</i> field? Are most songs available in the same large set of markets, or perhaps one of a few groups of markets, or is there greater variation?
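
One way to start answering the market question is to count how many tracks share each distinct set of markets. This is only a sketch: `sample_markets` is an invented stand-in for the `available_markets` list, and its contents are made up for illustration.

```python
from collections import Counter

# Invented stand-in for the available_markets list (one list of country codes per track)
sample_markets = [
    ["AD", "AE", "US"],
    ["US", "AD", "AE"],  # same markets as the first track, in a different order
    ["CA", "US"],
    [],                  # a track with no listed markets
]

# Lists are unhashable, so convert each one to a sorted tuple before counting
market_sets = [tuple(sorted(markets)) for markets in sample_markets]
set_counts = Counter(market_sets)

# Number of markets per track, a simpler numeric summary of the same field
n_markets = [len(markets) for markets in sample_markets]

print(set_counts.most_common())
print(n_markets)
```

If a handful of sets cover most tracks, the field could be treated as a categorical variable. If variation is greater, the per-track market count may be the more useful feature.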

Also, I see that one sampled track credits a featured artist in its name: "Kiss Me More (feat. SZA)". I could use a bit of string analysis to create a new Boolean field, <i>feature</i>, in case the presence of a featured artist has some bearing on popularity.
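
A minimal sketch of that string analysis, assuming featured artists are marked with "feat." or "ft." in the track name. The pattern is my own guess at a reasonable convention, not something Spotify documents, and the last two titles below are invented test cases.

```python
import pandas as pd

names = pd.Series([
    "Kiss Me More (feat. SZA)",
    "good 4 u",
    "RAPSTAR",
    "Some Song (with Another Artist)",  # invented title: "with" credits are not caught
    "Defeat. The Sequel",               # invented title: "feat." inside a word must not match
])

# Assumed pattern: "feat." or "ft." at the start of the name or after a space or bracket.
# Non-capturing groups avoid the pandas warning about match groups in str.contains.
FEATURE_PATTERN = r"(?:^|[\s(\[])(?:feat|ft)\."

has_feature = names.str.contains(FEATURE_PATTERN, case=False, regex=True)
print(has_feature.tolist())
```

The same call could be applied to the `name` column of the track DataFrame to create the Boolean `feature` field.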

Perhaps most importantly, several of the sampled tracks have a popularity score of zero. What accounts for zero popularity? Does zero stand in for a missing value, does it reflect a track being quite new on the platform, or both? I will need to do some serious thinking about how to handle such scores in my analysis.
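
Before deciding how to treat zeros, it helps to quantify them. The sketch below reuses the popularity and album type values from the ten sampled tracks shown above. The variable names are my own.

```python
import pandas as pd

# Popularity and album_type values copied from the ten sampled tracks above
sample = pd.DataFrame({
    "popularity": [99, 37, 25, 0, 98, 5, 0, 47, 95, 7],
    "album_type": ["single", "single", "single", "album", "single",
                   "album", "album", "single", "single", "album"],
})

zero_mask = sample["popularity"] == 0

# Share of tracks with a zero score, and how the zeros split by album type
zero_share = zero_mask.mean()
zeros_by_type = zero_mask.groupby(sample["album_type"]).sum()

print(zero_share)
print(zeros_by_type)
```

A sample of ten tracks is far too small to draw conclusions from, but the same summary on the full dataset would show whether zeros cluster in particular groups.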

Meanwhile, I do not have enough continuous variables to build an effective model. Luckily, Spotify also offers an API endpoint for audio features, including quantities like <i>danceability</i> and <i>tempo</i>. Let's look at a sample response from such a search. I'll use a <i>track_id</i> from the above DataFrame.

In [11]:
sample_audio_feature = sp.audio_features(track_df['track_id'][0])
sample_audio_feature

[{'danceability': 0.762,
  'energy': 0.701,
  'key': 8,
  'loudness': -3.541,
  'mode': 1,
  'speechiness': 0.0286,
  'acousticness': 0.235,
  'instrumentalness': 0.000158,
  'liveness': 0.123,
  'valence': 0.742,
  'tempo': 110.968,
  'type': 'audio_features',
  'id': '748mdHapucXQri7IAO8yFK',
  'uri': 'spotify:track:748mdHapucXQri7IAO8yFK',
  'track_href': 'https://api.spotify.com/v1/tracks/748mdHapucXQri7IAO8yFK',
  'analysis_url': 'https://api.spotify.com/v1/audio-analysis/748mdHapucXQri7IAO8yFK',
  'duration_ms': 208867,
  'time_signature': 4}]

Now that I know how to retrieve all the fields I want, as well as how to convert my search results into a DataFrame, I will generate my full dataset. Spotify's search API allows a maximum offset of 1,000 per search, with a maximum of 50 records per call, so a single query yields roughly 1,000 tracks at most. I probably need about ten times that number of tracks for my model, so I will execute the search 10 times, each for a different year. (This introduces the possibility that the release year itself might affect popularity; maybe audiences respond to different aspects of music in different eras.) After retrieving each batch of 50 tracks, I will pass their IDs to <i>sp.audio_features</i>, storing all relevant results as I go.
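
The paging arithmetic behind this plan can be checked with a short sketch. The constant names are my own, and the limits are the ones stated in the paragraph above.

```python
PAGE_SIZE = 50            # maximum records per search call
RESULTS_PER_QUERY = 1000  # tracks retrieved per query before hitting the offset limit
YEARS = range(2012, 2022) # 2012 through 2021 inclusive

# Offsets 0, 50, ..., 950: one search call per offset
offsets = list(range(0, RESULTS_PER_QUERY, PAGE_SIZE))

calls_per_year = len(offsets)
total_search_calls = calls_per_year * len(YEARS)
total_tracks = total_search_calls * PAGE_SIZE

print(offsets[:3], offsets[-1])
print(calls_per_year, total_search_calls, total_tracks)
```

Each search call is paired with one audio features call, so the full collection makes about twice `total_search_calls` requests.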

In [8]:
#initializing empty lists, which will become DataFrame columns
track_id = []
name = []
artist_name = []
year = []
popularity = []
duration_ms = []
explicit = []
track_number = []
available_markets = []
disc_number = []
album_type = []
danceability = []
energy = []
loudness = []
speechiness = []
acousticness = []
instrumentalness = []
liveness = []
valence = []
tempo = []
time_signature = []

#looping through the years 2012 through 2021:
for yr in range(2012,2022):
    #due to search API constraints, searching for chunks of 50 tracks up to 1000 total for the given year:
    for i in range(0,1000,50):
        track_results = sp.search(q='year:{}'.format(yr), type='track', limit=50, offset=i)
        track_results = track_results['tracks']['items']
        #initialize a list to collect track IDs, to be used in the audio feature search
        track_ids = []
        #appending track info to lists
        for track in track_results:
            track_id.append(track['id'])
            name.append(track['name'])
            artist_name.append(track['artists'][0]['name'])
            popularity.append(track['popularity'])
            duration_ms.append(track['duration_ms'])
            explicit.append(track['explicit'])
            track_number.append(track['track_number'])
            available_markets.append(track['available_markets'])
            disc_number.append(track['disc_number'])
            album_type.append(track['album']['album_type'])
            track_ids.append(track['id'])
            year.append(str(yr))
        #searching for corresponding audio features based on the track IDs from the search above
        feature_results = sp.audio_features(track_ids)
        for feature in feature_results:
            #while debugging, I found that two tracks do not have audio feature data available
            if feature is None:
                danceability.append(np.nan)
                energy.append(np.nan)
                loudness.append(np.nan)
                speechiness.append(np.nan)
                acousticness.append(np.nan)
                instrumentalness.append(np.nan)
                liveness.append(np.nan)
                valence.append(np.nan)
                tempo.append(np.nan)
                time_signature.append(np.nan)
            else:
                danceability.append(feature['danceability'])
                energy.append(feature['energy'])
                loudness.append(feature['loudness'])
                speechiness.append(feature['speechiness'])
                acousticness.append(feature['acousticness'])
                instrumentalness.append(feature['instrumentalness'])
                liveness.append(feature['liveness'])
                valence.append(feature['valence'])
                tempo.append(feature['tempo'])
                time_signature.append(feature['time_signature'])
            
#constructing DataFrame from the now-complete lists
df = pd.DataFrame({
    'track_id':track_id,
    'name':name,
    'artist_name':artist_name,
    'year':year,
    'popularity':popularity,
    'duration_ms':duration_ms,
    'explicit':explicit,
    'track_number':track_number,
    'disc_number':disc_number,
    'available_markets':available_markets,
    'album_type':album_type,
    'danceability':danceability,
    'energy':energy,
    'loudness':loudness,
    'speechiness':speechiness,
    'acousticness':acousticness,
    'instrumentalness':instrumentalness,
    'liveness':liveness,
    'valence':valence,
    'tempo':tempo,
    'time_signature':time_signature
}) 
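
As an aside, the same table can be assembled with less bookkeeping by building one dictionary per track and constructing the DataFrame from the list of rows. This is an alternative sketch, not the approach used above: `flatten_track` and `AUDIO_FIELDS` are hypothetical names, and the two tracks are invented stand-ins for API responses.

```python
import numpy as np
import pandas as pd

AUDIO_FIELDS = ["danceability", "energy", "loudness", "speechiness", "acousticness",
                "instrumentalness", "liveness", "valence", "tempo", "time_signature"]


def flatten_track(track, feature, yr):
    """Combine one search result and its audio features into a flat row (hypothetical helper)."""
    row = {
        "track_id": track["id"],
        "name": track["name"],
        "artist_name": track["artists"][0]["name"],
        "year": str(yr),
        "popularity": track["popularity"],
        "duration_ms": track["duration_ms"],
        "explicit": track["explicit"],
        "track_number": track["track_number"],
        "disc_number": track["disc_number"],
        "available_markets": track["available_markets"],
        "album_type": track["album"]["album_type"],
    }
    # Tracks without audio feature data come back as None; fill those fields with NaN
    for field in AUDIO_FIELDS:
        row[field] = np.nan if feature is None else feature[field]
    return row


# Invented stand-ins shaped like the API responses used above
fake_tracks = [
    {"id": "t1", "name": "Song A", "artists": [{"name": "Artist A"}], "popularity": 50,
     "duration_ms": 200000, "explicit": False, "track_number": 1, "disc_number": 1,
     "available_markets": ["CA", "US"], "album": {"album_type": "single"}},
    {"id": "t2", "name": "Song B", "artists": [{"name": "Artist B"}], "popularity": 0,
     "duration_ms": 180000, "explicit": True, "track_number": 2, "disc_number": 1,
     "available_markets": ["US"], "album": {"album_type": "album"}},
]
# The second track has no audio feature data
fake_features = [{field: 0.5 for field in AUDIO_FIELDS}, None]

rows = [flatten_track(t, f, 2012) for t, f in zip(fake_tracks, fake_features)]
sketch_df = pd.DataFrame(rows)
print(sketch_df.shape)
```

Keeping each track's fields together in one row also makes it harder for the parallel lists to drift out of alignment if one call returns fewer results than expected.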

In [9]:
df.head()

| | track_id | name | artist_name | year | popularity | duration_ms | explicit | track_number | disc_number | available_markets | ... | danceability | energy | loudness | speechiness | acousticness | instrumentalness | liveness | valence | tempo | time_signature |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2iUmqdfGZcHIhS3b9E9EWq | Everybody Talks | Neon Trees | 2012 | 82 | 177280 | False | 3 | 1 | [AD, AE, AG, AL, AM, AO, AR, AT, AU, AZ, BA, B... | ... | 0.471 | 0.924 | -3.906 | 0.0586 | 0.00301 | 0.0 | 0.313 | 0.725 | 154.961 | 4.0 |
| 1 | 08ejYlzduA6O82FJgnFKQz | Year 2020 | Leonardo Makno | 2012 | 37 | 133281 | False | 2 | 1 | [AD, AE, AG, AL, AM, AO, AR, AT, AU, AZ, BA, B... | ... | 0.805 | 0.417 | -11.722 | 0.0462 | 0.508 | 0.0201 | 0.0903 | 0.548 | 129.984 | 4.0 |
| 2 | 52a6VcF23v5HB7KfDEmBHq | Carried Away | Passion Pit | 2012 | 60 | 221973 | False | 3 | 1 | [AD, AE, AG, AL, AM, AO, AR, AT, AU, AZ, BA, B... | ... | 0.772 | 0.826 | -5.135 | 0.0344 | 0.0176 | 9e-06 | 0.383 | 0.871 | 119.995 | 4.0 |
| 3 | 1JIzFhI9Lt5FyslawmHCBi | Five Years - 2012 Remaster | David Bowie | 2012 | 58 | 283752 | False | 1 | 1 | [AD, AE, AG, AL, AM, AO, AR, AT, AU, AZ, BA, B... | ... | 0.46 | 0.326 | -10.699 | 0.0417 | 0.142 | 1e-05 | 0.0449 | 0.321 | 152.531 | 3.0 |
| 4 | 0ZFBKLOZLIM16RAUb5eomN | Bubblegum Bitch | MARINA | 2012 | 76 | 154666 | True | 1 | 1 | [CA, US] | ... | 0.495 | 0.856 | -5.123 | 0.0311 | 0.000219 | 0.0 | 0.103 | 0.609 | 158.024 | 4.0 |


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 21 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   track_id           10000 non-null  object 
 1   name               10000 non-null  object 
 2   artist_name        10000 non-null  object 
 3   year               10000 non-null  object 
 4   popularity         10000 non-null  int64  
 5   duration_ms        10000 non-null  int64  
 6   explicit           10000 non-null  bool   
 7   track_number       10000 non-null  int64  
 8   disc_number        10000 non-null  int64  
 9   available_markets  10000 non-null  object 
 10  album_type         10000 non-null  object 
 11  danceability       9998 non-null   float64
 12  energy             9998 non-null   float64
 13  loudness           9998 non-null   float64
 14  speechiness        9998 non-null   float64
 15  acousticness       9998 non-null   float64
 16  instrumentalness   9998