# Data Wrangling | Modeling Spotify album popularity
## Leo Evancie, Springboard Data Science Career Track

This is the first step in a capstone project to model album popularity on Spotify, a popular music streaming service. Further project details and rationale can be found in the document 'Proposal.pdf'.

Note: This notebook makes use of <i>Spotipy</i>, a Python library designed for Spotify API calls. I referred to the Medium article "<a href="https://medium.com/@maxtingle/getting-started-with-spotifys-api-spotipy-197c3dc6353b">Getting Started with Spotify's API & Spotipy</a>" by Max Tingle (2019), as well as the <a href="https://spotipy.readthedocs.io/en/latest/#"><i>Spotipy</i> documentation.</a>

### 1. Data collection

First, import the relevant libraries:

In [1]:
import pandas as pd
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from dotenv import load_dotenv

Load my Spotify developer credentials, which are stored as environment variables, then instantiate the Spotify API client.

In [2]:
load_dotenv()
sp = spotipy.Spotify(client_credentials_manager=SpotifyClientCredentials())
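
If the credentials are missing, the client call above fails with an authentication error that can be hard to read. As a minimal sketch, the check below reports which variables are unset before the client is created. It assumes the client-credentials flow reads `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET`; the helper name `missing_credentials` is my own and not part of <i>Spotipy</i>.

```python
import os

# Assumption: these are the two variables the client-credentials flow reads.
REQUIRED_VARS = ("SPOTIPY_CLIENT_ID", "SPOTIPY_CLIENT_SECRET")


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Hypothetical environment with only one of the two variables set.
print(missing_credentials({"SPOTIPY_CLIENT_ID": "abc"}))
```

Calling `missing_credentials()` with no argument checks the real environment, so it can be run right after `load_dotenv()`.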

Retrieve a sample of 10 albums (the default response limit) from the year 2021 to get a sense of the response structure.

In [3]:
album_results = sp.search(q='year:2021', type='album')

In [4]:
type(album_results)

dict

In [5]:
album_results

{'albums': {'href': 'https://api.spotify.com/v1/search?query=year%3A2021&type=album&offset=0&limit=10',
  'items': [{'album_type': 'album',
    'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/1McMsnEElThX1knmY4oliG'},
      'href': 'https://api.spotify.com/v1/artists/1McMsnEElThX1knmY4oliG',
      'id': '1McMsnEElThX1knmY4oliG',
      'name': 'Olivia Rodrigo',
      'type': 'artist',
      'uri': 'spotify:artist:1McMsnEElThX1knmY4oliG'}],
    'available_markets': ['AD',
     'AE',
     'AG',
     'AL',
     'AM',
     'AO',
     'AR',
     'AT',
     'AU',
     'AZ',
     'BA',
     'BB',
     'BD',
     'BE',
     'BF',
     'BG',
     'BH',
     'BI',
     'BJ',
     'BN',
     'BO',
     'BR',
     'BS',
     'BT',
     'BW',
     'BY',
     'BZ',
     'CA',
     'CH',
     'CI',
     'CL',
     'CM',
     'CO',
     'CR',
     'CV',
     'CW',
     'CY',
     'CZ',
     'DE',
     'DJ',
     'DK',
     'DM',
     'DO',
     'DZ',
     'EC',
     'EE',
     ...

In [6]:
album_results['albums']['items'][0].keys()

dict_keys(['album_type', 'artists', 'available_markets', 'external_urls', 'href', 'id', 'images', 'name', 'release_date', 'release_date_precision', 'total_tracks', 'type', 'uri'])

I notice that <i>popularity</i> is not listed among the returned data. This is the variable I want my model to predict. Let's try a search for individual tracks instead.

In [7]:
track_results = sp.search(q='year:2021', type='track')
track_results = track_results['tracks']['items']
track_results[0].keys()

dict_keys(['album', 'artists', 'available_markets', 'disc_number', 'duration_ms', 'explicit', 'external_ids', 'external_urls', 'href', 'id', 'is_local', 'name', 'popularity', 'preview_url', 'track_number', 'type', 'uri'])

Bingo. Because these search results report <i>popularity</i> per track rather than per album, I will need to collect a large number of tracks in order to develop a model for predicting popularity. I expect that <i>album</i> and <i>artists</i> will heavily influence popularity. Hopefully, I can identify plenty of other influential features; an artist can't change their identity in the hope of boosting their music's popularity score!

I'll look at the sample results in a DataFrame to get a clearer picture of the values, choosing only those variables that seem relevant to the question at hand. For example, I will exclude <i>preview_url</i> and the other URL-type references, keeping only the track's <i>id</i> (stored as <i>track_id</i>) as a unique identifier.

In [8]:
track_id = []
name = []
artist_name = []
popularity = []
duration_ms = []
explicit = []
track_number = []
available_markets = []
disc_number = []
album_type = []

for track in track_results:
    track_id.append(track['id'])
    name.append(track['name'])
    artist_name.append(track['artists'][0]['name'])  # first listed artist only
    popularity.append(track['popularity'])
    duration_ms.append(track['duration_ms'])
    explicit.append(track['explicit'])
    track_number.append(track['track_number'])
    available_markets.append(track['available_markets'])
    disc_number.append(track['disc_number'])
    album_type.append(track['album']['album_type'])

In [9]:
track_df = pd.DataFrame({
    'track_id':track_id,
    'name':name,
    'artist_name':artist_name,
    'popularity':popularity,
    'duration_ms':duration_ms,
    'explicit':explicit,
    'track_number':track_number,
    'disc_number':disc_number,
    'available_markets':available_markets,
    'album_type':album_type
})
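
The ten parallel lists above can also be written as one dictionary per track, which keeps each field name next to the value it holds. This is only a sketch of an alternative, not a change to the approach: `flatten_track` is a hypothetical helper, and `sample_track` is a hand-built item that copies values from the first sampled track, with its market list shortened for illustration.

```python
def flatten_track(track):
    """Pick the fields used in this notebook from one track item."""
    return {
        "track_id": track["id"],
        "name": track["name"],
        "artist_name": track["artists"][0]["name"],  # first listed artist only
        "popularity": track["popularity"],
        "duration_ms": track["duration_ms"],
        "explicit": track["explicit"],
        "track_number": track["track_number"],
        "disc_number": track["disc_number"],
        "available_markets": track["available_markets"],
        "album_type": track["album"]["album_type"],
    }


# Hypothetical item shaped like one entry of the search response.
sample_track = {
    "id": "748mdHapucXQri7IAO8yFK",
    "name": "Kiss Me More (feat. SZA)",
    "artists": [{"name": "Doja Cat"}],
    "popularity": 99,
    "duration_ms": 208866,
    "explicit": True,
    "track_number": 1,
    "disc_number": 1,
    "available_markets": ["AD", "AE"],  # shortened for illustration
    "album": {"album_type": "single"},
}

rows = [flatten_track(sample_track)]
print(rows[0]["artist_name"], rows[0]["album_type"])
```

A list of such dictionaries can be passed straight to `pd.DataFrame(rows)`, which should give the same columns as the cell above.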

In [10]:
track_df

| | track_id | name | artist_name | popularity | duration_ms | explicit | track_number | disc_number | available_markets | album_type |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 748mdHapucXQri7IAO8yFK | Kiss Me More (feat. SZA) | Doja Cat | 99 | 208866 | True | 1 | 1 | [AD, AE, AG, AL, AM, AO, AR, AT, AU, AZ, BA, B... | single |
| 1 | 08ejYlzduA6O82FJgnFKQz | Year 2020 | Leonardo Makno | 37 | 133281 | False | 2 | 1 | [AD, AE, AG, AL, AM, AO, AR, AT, AU, AZ, BA, B... | single |
| 2 | 0UsmyJDsst2xhX1ZiFF3JW | Year,2015 | Schoolgirl Byebye | 25 | 74301 | False | 1 | 1 | [AD, AE, AG, AL, AM, AO, AR, AT, AU, AZ, BA, B... | single |
| 3 | 0koiLB5LKB5fXOuZKNmQvc | King | Years & Years | 0 | 214834 | False | 19 | 1 | [AE, AL, AT, AU, BA, BE, BG, BH, BY, CH, CY, C... | album |
| 4 | 6PERP62TejQjgHu81OHxgM | good 4 u | Olivia Rodrigo | 98 | 178147 | True | 1 | 1 | [AD, AE, AG, AL, AM, AO, AR, AT, AU, AZ, BA, B... | single |
| 5 | 3wLWURwOlQZ7O3VWcV4BJJ | Ingrato amor - Live Streaming audio and video ... | Amaranta | 3 | 352647 | False | 13 | 1 | [AD, AE, AG, AL, AM, AO, AR, AT, AU, AZ, BA, B... | album |
| 6 | 7BFk1nSfwfkDOgnNNeY7Yn | wish i dropped out like brakence interlude | Satoshi Love | 0 | 83000 | False | 4 | 1 | [AD, AE, AG, AL, AM, AO, AR, AT, AU, AZ, BA, B... | album |
| 7 | 41cHUAANqM4s6e55Mo8e08 | Shine | Years & Years | 0 | 255506 | False | 16 | 1 | [AE, AL, AR, AU, BA, BE, BG, BH, BO, BR, BY, C... | album |
| 8 | 43PGPuHIlVOc04jrZVh9L6 | RAPSTAR | Polo G | 95 | 165925 | True | 1 | 1 | [AD, AE, AG, AL, AM, AO, AR, AT, AU, AZ, BA, B... | single |
| 9 | 0Bkir9qbwWtydC0VJQTSOD | Me siento sola - Live Streaming audio and vide... | Amaranta | 4 | 375485 | False | 7 | 1 | [AD, AE, AG, AL, AM, AO, AR, AT, AU, AZ, BA, B... | album |


A few questions occur to me at this stage. What is to be made of the <i>available_markets</i> field? Are most songs available in the same large set of markets, or perhaps one of a few groups of markets, or is there greater variation?
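
One way to start on the markets question is to count how many tracks share each exact set of markets. The sketch below assumes each entry is a list of country codes, as in the sample output; `market_set_counts` is a hypothetical helper and the example lists are made up and shortened.

```python
from collections import Counter


def market_set_counts(market_lists):
    """Count how many tracks share each exact set of markets."""
    # frozenset ignores ordering, so ['AD', 'AE'] and ['AE', 'AD'] count together.
    return Counter(frozenset(markets) for markets in market_lists)


# Hypothetical, shortened market lists for illustration only.
example = [["AD", "AE", "AG"], ["AE", "AD", "AG"], ["AE", "AL"]]
counts = market_set_counts(example)
print(len(counts))                  # number of distinct market sets: 2
print(counts.most_common(1)[0][1])  # size of the largest group: 2
```

Passing `track_df['available_markets']` to the same function should work, since iterating over the column yields one list per track. A small number of distinct sets would suggest a few market groups; a number close to the row count would suggest much greater variation.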

Also, I see that one sampled track, "Kiss Me More (feat. SZA)", credits a featured artist in its name. I could use a bit of string analysis to create a new Boolean field, <i>feature</i>, in case the presence of a featured artist has some bearing on popularity.
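
A rough sketch of that string analysis follows. It assumes featured artists are marked in the track name with "feat.", "ft." or "featuring"; other conventions (for example "with", or a feature credited only in the artist list) would be missed, so the pattern is a starting point rather than a complete rule. The names `FEATURE_PATTERN` and `has_feature` are my own.

```python
import re

# Assumption: a featured artist is signalled by "feat.", "ft." or "featuring".
# The leading \b keeps words such as "Soft." or "Defeat." from matching.
FEATURE_PATTERN = re.compile(r"\b(?:feat\.|ft\.|featuring\b)", re.IGNORECASE)


def has_feature(track_name):
    """Return True if the track name appears to credit a featured artist."""
    return bool(FEATURE_PATTERN.search(track_name))


print(has_feature("Kiss Me More (feat. SZA)"))  # True
print(has_feature("good 4 u"))                  # False
```

On the DataFrame, `track_df['name'].apply(has_feature)` would produce the Boolean column.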

Perhaps most importantly, several of the sampled tracks have a popularity score of zero. What accounts for zero popularity? Does it stand in for a missing value, does it reflect a track being quite new on the platform, or both? I will need to do some serious thinking about how to handle such scores in my analysis.
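
Before deciding how to treat these scores, it helps to know how common they are. The sketch below computes the share of zero scores; `zero_share` is a hypothetical helper, and the input is the popularity column of the ten sampled tracks shown above.

```python
def zero_share(popularity_scores):
    """Return the fraction of scores that are exactly zero."""
    scores = list(popularity_scores)
    if not scores:
        return 0.0
    return sum(1 for score in scores if score == 0) / len(scores)


# Popularity values of the ten sampled tracks.
sample_scores = [99, 37, 25, 0, 98, 3, 0, 0, 95, 4]
print(zero_share(sample_scores))  # 0.3
```

Three of the ten sampled tracks score zero. Ten rows are far too few to generalize from, so the same check is worth repeating on the full dataset.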

Before digging into any of these questions, I will generate my full dataset. Spotify's API search calls return a maximum of 50 records per call and allow a maximum offset of 1,000, so I will make 20 calls of 50 tracks each (offsets 0 through 950) for a total of 1,000 tracks.
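
The paging arithmetic can be sketched independently of the API. In the example below, `collect_pages` and `fake_fetch` are hypothetical names, and the "API" is a plain list of 120 fake records. The early exit on a short page is an extra safeguard of mine, based on the assumption that a page with fewer items than requested means there is nothing left to fetch.

```python
def collect_pages(fetch_page, page_size=50, max_records=1000):
    """Call fetch_page(limit, offset) until max_records or a short page."""
    items = []
    for offset in range(0, max_records, page_size):
        page = fetch_page(limit=page_size, offset=offset)
        items.extend(page)
        if len(page) < page_size:  # fewer results than requested: stop early
            break
    return items


# Hypothetical stand-in for the API: 120 fake records, served in slices.
FAKE_RECORDS = list(range(120))


def fake_fetch(limit, offset):
    return FAKE_RECORDS[offset:offset + limit]


print(len(collect_pages(fake_fetch)))  # 120
```

With <i>Spotipy</i>, `fetch_page` could wrap `sp.search(q='year:2021', type='track', limit=limit, offset=offset)['tracks']['items']`.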

In [11]:
track_id = []
name = []
artist_name = []
popularity = []
duration_ms = []
explicit = []
track_number = []
available_markets = []
disc_number = []
album_type = []

# 20 calls x 50 tracks per call = 1,000 tracks (offsets 0, 50, ..., 950)
for i in range(0,1000,50):
    track_results = sp.search(q='year:2021', type='track', limit=50, offset=i)
    track_results = track_results['tracks']['items']
    for track in track_results:
        track_id.append(track['id'])
        name.append(track['name'])
        artist_name.append(track['artists'][0]['name'])
        popularity.append(track['popularity'])
        duration_ms.append(track['duration_ms'])
        explicit.append(track['explicit'])
        track_number.append(track['track_number'])
        available_markets.append(track['available_markets'])
        disc_number.append(track['disc_number'])
        album_type.append(track['album']['album_type'])

In [12]:
track_df = pd.DataFrame({
    'track_id':track_id,
    'name':name,
    'artist_name':artist_name,
    'popularity':popularity,
    'duration_ms':duration_ms,
    'explicit':explicit,
    'track_number':track_number,
    'disc_number':disc_number,
    'available_markets':available_markets,
    'album_type':album_type
})
track_df.head()

| | track_id | name | artist_name | popularity | duration_ms | explicit | track_number | disc_number | available_markets | album_type |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 748mdHapucXQri7IAO8yFK | Kiss Me More (feat. SZA) | Doja Cat | 99 | 208866 | True | 1 | 1 | [AD, AE, AG, AL, AM, AO, AR, AT, AU, AZ, BA, B... | single |
| 1 | 08ejYlzduA6O82FJgnFKQz | Year 2020 | Leonardo Makno | 37 | 133281 | False | 2 | 1 | [AD, AE, AG, AL, AM, AO, AR, AT, AU, AZ, BA, B... | single |
| 2 | 0UsmyJDsst2xhX1ZiFF3JW | Year,2015 | Schoolgirl Byebye | 25 | 74301 | False | 1 | 1 | [AD, AE, AG, AL, AM, AO, AR, AT, AU, AZ, BA, B... | single |
| 3 | 0koiLB5LKB5fXOuZKNmQvc | King | Years & Years | 0 | 214834 | False | 19 | 1 | [AE, AL, AT, AU, BA, BE, BG, BH, BY, CH, CY, C... | album |
| 4 | 6PERP62TejQjgHu81OHxgM | good 4 u | Olivia Rodrigo | 98 | 178147 | True | 1 | 1 | [AD, AE, AG, AL, AM, AO, AR, AT, AU, AZ, BA, B... | single |


In [13]:
track_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   track_id           1000 non-null   object
 1   name               1000 non-null   object
 2   artist_name        1000 non-null   object
 3   popularity         1000 non-null   int64 
 4   duration_ms        1000 non-null   int64 
 5   explicit           1000 non-null   bool  
 6   track_number       1000 non-null   int64 
 7   disc_number        1000 non-null   int64 
 8   available_markets  1000 non-null   object
 9   album_type         1000 non-null   object
dtypes: bool(1), int64(4), object(5)
memory usage: 71.4+ KB
