### Spotify Initial 2021 Data Collection, Observation, and Cleaning ###

This notebook uses Spotipy, a Python package that wraps Spotify's Web API, to gather track-level data maintained by Spotify. It produces a dataset describing each of the 50 most popular songs on Spotify in the United States in 2021, as measured by Spotify and released to the public via an [ordered playlist](https://open.spotify.com/user/spotify/playlist/37i9dQZF1DXaqCgtv7ZR3L?si=K4BLJsyXSy-j0cSED9FjgQ) at the end of the year.

The dataset is then inspected and cleaned.

Source: Spotify Web API, [Spotify Top Tracks of 2021 USA](https://open.spotify.com/playlist/37i9dQZF1DXbJMiQ53rTyJ?si=b42dbe5c50d545a3)

Downloaded: 12/13/2021

Srinidhi Ramakrishna

In [1]:
# Importing packages
import spotipy
import time
import pandas as pd
from spotipy.oauth2 import SpotifyClientCredentials

In [2]:
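# Authenticating with my developer client ID and client secret.
# The credentials are read from the SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET
# environment variables rather than being hardcoded in the notebook.
import os
cid = os.environ['SPOTIPY_CLIENT_ID']
secret = os.environ['SPOTIPY_CLIENT_SECRET']
client_credentials_manager = SpotifyClientCredentials(client_id=cid, client_secret=secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)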

In [3]:
# Collecting track IDs, in playlist order, based on the playlist ID
def getTrackIDs(user, playlist_id):
    ids = []
    playlist = sp.user_playlist(user, playlist_id)
    for item in playlist['tracks']['items']:
        track = item['track']
        ids.append(track['id'])
    return ids

ids = getTrackIDs('spotify', '37i9dQZF1DXbJMiQ53rTyJ')
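
The playlist response is a nested dictionary, and `getTrackIDs` walks it from `tracks` to `items` to `track` to `id`. The sketch below shows that traversal on a hand-made stand-in payload, with no API call. The function name `extract_track_ids` and the IDs are placeholders, and the guard against a missing track is an assumption about how unavailable entries might appear, not something this playlist is known to need.

```python
def extract_track_ids(playlist):
    """Return track IDs in playlist order, skipping entries with no track."""
    ids = []
    for item in playlist['tracks']['items']:
        track = item.get('track')
        # Assumption: an unavailable entry may carry no track object or no ID
        if track is None or track.get('id') is None:
            continue
        ids.append(track['id'])
    return ids


# Hypothetical payload shaped like the playlist response used above
sample_playlist = {
    'tracks': {
        'items': [
            {'track': {'id': 'id_first', 'name': 'Song A'}},
            {'track': None},
            {'track': {'id': 'id_second', 'name': 'Song B'}},
        ]
    }
}

print(extract_track_ids(sample_playlist))
```

Because the list is built in iteration order, the position of each ID matches the song's position in the playlist, which is what later allows the row index to serve as a rank.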

In [4]:
# Collecting metadata and audio features for a single track
def getTrackFeatures(id):
    meta = sp.track(id)
    features = sp.audio_features(id)  # a list with one entry per requested track

    # meta
    name = meta['name']
    album = meta['album']['name']
    artist = meta['album']['artists'][0]['name']  # first artist credited on the album
    release_date = meta['album']['release_date']
    duration_ms = meta['duration_ms']
    popularity = meta['popularity']
    explicit = meta['explicit']

    # features
    acousticness = features[0]['acousticness']
    danceability = features[0]['danceability']
    energy = features[0]['energy']
    instrumentalness = features[0]['instrumentalness']
    liveness = features[0]['liveness']
    loudness = features[0]['loudness']
    speechiness = features[0]['speechiness']
    tempo = features[0]['tempo']
    time_signature = features[0]['time_signature']
    valence = features[0]['valence']

    # Note: danceability is listed twice here and in the column names used to
    # build the dataset; the duplicate column is dropped during cleaning.
    track = [name, album, artist, release_date, duration_ms, popularity, danceability, acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, tempo, time_signature, valence, explicit]
    return track
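
Because each value is listed by hand, a field can easily end up in the row twice, as `danceability` does above. One possible alternative is to drive the row from a single list of keys, so each field appears exactly once. This is only a sketch: the names `FEATURE_KEYS` and `build_row` are hypothetical, and the sample dictionaries hold made-up values shaped like the responses used above.

```python
# Audio-feature fields to copy, each listed exactly once
FEATURE_KEYS = [
    'acousticness', 'danceability', 'energy', 'instrumentalness', 'liveness',
    'loudness', 'speechiness', 'tempo', 'time_signature', 'valence',
]


def build_row(meta, features):
    """Combine track metadata and audio features into one dict per track."""
    row = {
        'name': meta['name'],
        'album': meta['album']['name'],
        'artist': meta['album']['artists'][0]['name'],
        'release_date': meta['album']['release_date'],
        'duration_ms': meta['duration_ms'],
        'popularity': meta['popularity'],
        'explicit': meta['explicit'],
    }
    for key in FEATURE_KEYS:
        row[key] = features[key]
    return row


# Made-up stand-ins for the metadata and audio-feature dictionaries
sample_meta = {
    'name': 'Song A',
    'album': {
        'name': 'Album A',
        'artists': [{'name': 'Artist A'}],
        'release_date': '2021-01-01',
    },
    'duration_ms': 200000,
    'popularity': 90,
    'explicit': False,
}
sample_features = {key: 0.5 for key in FEATURE_KEYS}

row = build_row(sample_meta, sample_features)
print(list(row))
```

A list of such dictionaries can be passed straight to `pd.DataFrame`, in which case the column names come from the keys and cannot drift out of step with the values.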

In [5]:
# Looping over track IDs and collecting each track's metrics as a new row
tracks = []
for track_id in ids:
    time.sleep(.5)  # brief pause between API requests
    track = getTrackFeatures(track_id)
    tracks.append(track)
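
The loop above makes two requests per track (`sp.track` and `sp.audio_features`), separated by a half-second pause. If the number of requests ever becomes a concern, Spotipy's `audio_features` and `tracks` methods also accept lists of IDs, documented as up to 100 and 50 IDs per call respectively, so the IDs could be grouped into batches first. The helper below is a hypothetical sketch of the grouping step only; it makes no API calls, and `chunked` and the placeholder IDs are not part of the original notebook.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Placeholder IDs standing in for the `ids` list collected earlier
placeholder_ids = [f"id_{n}" for n in range(120)]

batches = list(chunked(placeholder_ids, 50))
print([len(batch) for batch in batches])  # [50, 50, 20]
```

For the 50-track playlist used here a single batch would suffice, so the per-track loop is a reasonable choice at this scale.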

In [6]:
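# Creating the dataset and saving the raw data; the row index, which reflects playlist order, is written as the first CSV column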
df = pd.DataFrame(tracks, columns = ['name', 'album', 'artist', 'release_date', 'duration_ms', 'popularity', 'danceability', 'acousticness', 'danceability', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness', 'tempo', 'time_signature', 'valence', 'explicit'])
df.to_csv("../../data/Raw/spotify2021raw.csv", sep = ',')

#### Data Observation

In [7]:
spotifyraw2021 = pd.read_csv("../../data/Raw/spotify2021raw.csv")

Let's take a basic look at the dimensions of this dataset, as well as the meanings of the rows and columns. 

In [8]:
spotifyraw2021.shape

(50, 19)

In [9]:
spotifyraw2021.head()

,Unnamed: 0,name,album,artist,release_date,duration_ms,popularity,danceability,acousticness,danceability.1,energy,instrumentalness,liveness,loudness,speechiness,tempo,time_signature,valence,explicit
0,0,drivers license,SOUR,Olivia Rodrigo,2021-05-21,242013,93,0.561,0.768,0.561,0.431,1.4e-05,0.106,-8.81,0.0578,143.875,4,0.137,True
1,1,good 4 u,SOUR,Olivia Rodrigo,2021-05-21,178146,96,0.563,0.335,0.563,0.664,0.0,0.0849,-5.044,0.154,166.928,4,0.688,True
2,2,Kiss Me More (feat. SZA),Planet Her,Doja Cat,2021-06-25,208666,88,0.764,0.259,0.764,0.705,8.9e-05,0.12,-3.463,0.0284,110.97,4,0.781,True
3,3,Heat Waves,Dreamland (+ Bonus Levels),Glass Animals,2020-08-06,238805,93,0.761,0.44,0.761,0.525,7e-06,0.0921,-6.9,0.0944,80.87,4,0.531,False
4,4,Levitating (feat. DaBaby),Future Nostalgia,Dua Lipa,2020-03-27,203064,89,0.702,0.00883,0.702,0.825,0.0,0.0674,-3.787,0.0601,102.977,4,0.915,False


This dataset has 50 rows and 19 columns, with each row representing a song. Rows are ordered by popularity in the US in 2021, as determined by Spotify. The columns hold a variety of track-level metrics. The structure matches that of the 2017 and 2020 Spotify datasets, and full descriptions of the rows and columns can be found in the notebooks "Spotify Initial 2017 Data Collection, Observation, and Cleaning.ipynb" and "Spotify Initial 2020 Data Collection, Observation, and Cleaning.ipynb."
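
The count of 19 columns, one more than the 18 names passed to `pd.DataFrame`, comes from the CSV round trip. By default `to_csv` writes the row index as an unnamed first column, which `read_csv` labels `Unnamed: 0`, and `read_csv` also renames a repeated header such as `danceability` to `danceability.1`. The sketch below reproduces both effects on a toy frame with made-up values (not the Spotify data), using an in-memory buffer in place of a file.

```python
import io

import pandas as pd

# Toy frame with a deliberately repeated column name
rows = [['Song A', 0.5, 0.5], ['Song B', 0.7, 0.7]]
toy = pd.DataFrame(rows, columns=['name', 'danceability', 'danceability'])

# Default behaviour: the index is written as an unnamed first column
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)
round_trip = pd.read_csv(buffer)
print(list(round_trip.columns))
# ['Unnamed: 0', 'name', 'danceability', 'danceability.1']

# With index=False the index column is not written
buffer_no_index = io.StringIO()
toy.to_csv(buffer_no_index, index=False)
buffer_no_index.seek(0)
no_index = pd.read_csv(buffer_no_index)
print(list(no_index.columns))
# ['name', 'danceability', 'danceability.1']
```

In this notebook the saved index is useful rather than a nuisance: it preserves playlist order and becomes the `rank` column during cleaning.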

#### Data Cleaning

Here, I perform the same data cleaning steps that were performed on the 2017 and 2020 Spotify datasets. Explanations for these steps can be found in the notebooks mentioned above.

In [10]:
# Dropping popularity, time_signature, and the duplicated danceability column (danceability.1)
spotifytop2021 = spotifyraw2021.drop(['popularity', 'time_signature', 'danceability.1'], axis = 1)


In [11]:
# Renaming the saved index to rank and shifting it to start at 1, since rows are in playlist order
spotifytop2021 = spotifytop2021.rename(columns = {"Unnamed: 0": "rank"})
spotifytop2021['rank'] = spotifytop2021['rank'] + 1

In [12]:
# Converting duration from milliseconds to seconds
spotifytop2021['duration_sec'] = spotifytop2021['duration_ms']/1000
spotifytop2021 = spotifytop2021.drop(['duration_ms'], axis = 1)

In [13]:
spotifytop2021.sample(10)

,rank,name,album,artist,release_date,danceability,acousticness,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,explicit,duration_sec
3,4,Heat Waves,Dreamland (+ Bonus Levels),Glass Animals,2020-08-06,0.761,0.44,0.525,7e-06,0.0921,-6.9,0.0944,80.87,0.531,False,238.805
0,1,drivers license,SOUR,Olivia Rodrigo,2021-05-21,0.561,0.768,0.431,1.4e-05,0.106,-8.81,0.0578,143.875,0.137,True,242.013
13,14,Blinding Lights,After Hours,The Weeknd,2020-03-20,0.514,0.00146,0.73,9.5e-05,0.0897,-5.934,0.0598,171.005,0.334,False,200.04
37,38,favorite crime,SOUR,Olivia Rodrigo,2021-05-21,0.369,0.866,0.272,0.0,0.147,-10.497,0.0364,172.929,0.218,False,152.666
45,46,positions,Positions,Ariana Grande,2020-10-30,0.737,0.468,0.802,0.0,0.0931,-4.771,0.0878,144.015,0.682,True,172.324
24,25,DÁKITI,EL ÚLTIMO TOUR DEL MUNDO,Bad Bunny,2020-11-27,0.731,0.401,0.573,5.2e-05,0.113,-10.059,0.0544,109.928,0.145,True,205.09
6,7,STAY (with Justin Bieber),F*CK LOVE 3: OVER YOU,The Kid LAROI,2021-07-23,0.591,0.0383,0.764,0.0,0.103,-5.484,0.0483,169.928,0.478,True,141.805
26,27,Best Friend (feat. Doja Cat),Best Friend (feat. Doja Cat) [Remix EP] [Exten...,Saweetie,2021-04-23,0.84,0.00302,0.766,4e-06,0.0684,-4.12,0.136,94.018,0.402,True,155.883
7,8,RAPSTAR,Hall of Fame,Polo G,2021-06-11,0.789,0.41,0.536,0.0,0.129,-6.862,0.242,81.039,0.437,True,165.925
47,48,Come & Go (with Marshmello),Legends Never Die,Juice WRLD,2020-07-10,0.625,0.0172,0.814,0.0,0.158,-5.181,0.0657,144.991,0.535,True,205.484


With a few small changes, the dataset is more readable!

In [14]:
spotifytop2021.to_csv("../../data/Clean/spotifytop2021cleaned.csv", sep = ',')