### Spotify Initial 2020 Data Collection, Observation, and Cleaning ###

This notebook uses Spotipy, a Python client for the Spotify Web API, to gather track-level data maintained by Spotify. It produces a dataset describing each of the 50 most popular songs on Spotify in the United States in 2020, as measured by Spotify and released to the public via an [ordered playlist](https://open.spotify.com/user/spotify/playlist/37i9dQZF1DXaqCgtv7ZR3L?si=K4BLJsyXSy-j0cSED9FjgQ) at the end of the year.

The dataset is then inspected and cleaned.

Source: Spotify Web API, [Spotify Top Tracks of 2020 USA](https://open.spotify.com/user/spotify/playlist/37i9dQZF1DXaqCgtv7ZR3L?si=eAq2hBqrTR-s5M99y-vQeQ)

Downloaded: 11/14/2021

Srinidhi Ramakrishna

In [1]:
# Importing packages
import spotipy
import time
import pandas as pd
from spotipy.oauth2 import SpotifyClientCredentials

In [2]:
# Loading my client ID and client secret as a developer.
# They are read from environment variables (the names Spotipy itself looks for)
# so that the credentials are not stored in the notebook.
import os
cid = os.environ['SPOTIPY_CLIENT_ID']
secret = os.environ['SPOTIPY_CLIENT_SECRET']
client_credentials_manager = SpotifyClientCredentials(client_id=cid, client_secret=secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

In [3]:
# Collecting track IDs from a playlist, identified by its owner and playlist ID
def getTrackIDs(user, playlist_id):
    ids = []
    playlist = sp.user_playlist(user, playlist_id)
    for item in playlist['tracks']['items']:
        track = item['track']
        ids.append(track['id'])
    return ids

ids = getTrackIDs('spotify', '37i9dQZF1DXaqCgtv7ZR3L')
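
The extraction step above assumes every playlist entry carries a track with an ID. A minimal sketch of a more defensive version follows, assuming that an entry's `track` (or its `id`) can be missing for unavailable or local tracks. The helper name `extract_track_ids` and the sample dictionary are hypothetical and only mimic the shape of the response used above.

```python
def extract_track_ids(playlist):
    """Return track IDs from a playlist-shaped dict, skipping empty entries."""
    ids = []
    for item in playlist['tracks']['items']:
        track = item.get('track')
        # Skip entries with no track object or no ID (assumed possible for
        # unavailable or local tracks)
        if track and track.get('id'):
            ids.append(track['id'])
    return ids


# Hypothetical response fragment, shaped like the one iterated over above
sample_playlist = {
    'tracks': {
        'items': [
            {'track': {'id': 'id_one'}},
            {'track': None},
            {'track': {'id': None}},
            {'track': {'id': 'id_two'}},
        ]
    }
}

sample_ids = extract_track_ids(sample_playlist)
print(sample_ids)
```

For a 50-track playlist the difference is unlikely to matter, but the check avoids a `TypeError` if an entry ever comes back empty.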

In [4]:
# Collecting track metadata and audio features for each song
def getTrackFeatures(track_id):
    meta = sp.track(track_id)
    # audio_features returns a list, so take the entry for this track
    features = sp.audio_features(track_id)[0]

    # meta
    name = meta['name']
    album = meta['album']['name']
    artist = meta['album']['artists'][0]['name']  # first credited artist on the album
    release_date = meta['album']['release_date']
    duration_ms = meta['duration_ms']
    popularity = meta['popularity']
    explicit = meta['explicit']

    # features
    acousticness = features['acousticness']
    danceability = features['danceability']
    energy = features['energy']
    instrumentalness = features['instrumentalness']
    liveness = features['liveness']
    loudness = features['loudness']
    speechiness = features['speechiness']
    tempo = features['tempo']
    time_signature = features['time_signature']
    valence = features['valence']

    # Note: danceability appears twice in this list (and in the column names
    # used to build the DataFrame), which produces the duplicate column that
    # is removed during cleaning.
    track = [name, album, artist, release_date, duration_ms, popularity, danceability, acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, tempo, time_signature, valence, explicit]
    return track

In [5]:
# Looping over track IDs and appending each track's metrics as a new row
tracks = []
for track_id in ids:
    time.sleep(.5)  # brief pause between requests to avoid hitting API rate limits
    track = getTrackFeatures(track_id)
    tracks.append(track)
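
The loop above makes one audio-features request per track. Spotipy's `audio_features` also accepts a list of track IDs, so the features could instead be fetched in batches. The sketch below is one possible way to do that, not the method used in this notebook. The helper names `chunked` and `get_audio_features_batched` are hypothetical, and the batch size of 100 is an assumption about the API's per-request limit.

```python
import time


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def get_audio_features_batched(client, track_ids, batch_size=100, pause=0.5):
    """Fetch audio features in batches instead of one request per track.

    `client` is any object exposing audio_features(list_of_ids), such as the
    `sp` object created earlier in this notebook.
    """
    features = []
    for i, batch in enumerate(chunked(track_ids, batch_size)):
        if i:
            time.sleep(pause)  # pause only between batches
        features.extend(client.audio_features(batch))
    return features
```

With the client from this notebook, the call would look like `all_features = get_audio_features_batched(sp, ids)`, which needs a single request for 50 tracks.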

In [6]:
# Creating dataset
df = pd.DataFrame(tracks, columns = ['name', 'album', 'artist', 'release_date', 'duration_ms', 'popularity', 'danceability', 'acousticness', 'danceability', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness', 'tempo', 'time_signature', 'valence', 'explicit'])
df.to_csv("../../data/Raw/spotify2020raw.csv", sep = ',')

#### Data Observation ####

In [7]:
spotifyraw2020 = pd.read_csv("../../data/Raw/spotify2020raw.csv")

Let's take a basic look at the dimensions of this dataset, as well as the meanings of the rows and columns. 

In [8]:
spotifyraw2020.shape

(50, 19)

In [9]:
spotifyraw2020.head()

| | Unnamed: 0 | name | album | artist | release_date | duration_ms | popularity | danceability | acousticness | danceability.1 | energy | instrumentalness | liveness | loudness | speechiness | tempo | time_signature | valence | explicit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | The Box | Please Excuse Me for Being Antisocial | Roddy Ricch | 2019-12-06 | 196652 | 86 | 0.896 | 0.104 | 0.896 | 0.586 | 0.0 | 0.79 | -6.687 | 0.0559 | 116.971 | 4 | 0.642 | True |
| 1 | 1 | Blinding Lights | After Hours | The Weeknd | 2020-03-20 | 200040 | 94 | 0.514 | 0.00146 | 0.514 | 0.73 | 9.5e-05 | 0.0897 | -5.934 | 0.0598 | 171.005 | 4 | 0.334 | False |
| 2 | 2 | Blueberry Faygo | Certified Hitmaker | Lil Mosey | 2020-02-06 | 162546 | 81 | 0.774 | 0.207 | 0.774 | 0.554 | 0.0 | 0.132 | -7.909 | 0.0383 | 99.034 | 4 | 0.349 | True |
| 3 | 3 | ROCKSTAR (feat. Roddy Ricch) | BLAME IT ON BABY | DaBaby | 2020-04-17 | 181733 | 86 | 0.746 | 0.247 | 0.746 | 0.69 | 0.0 | 0.101 | -7.956 | 0.164 | 89.977 | 4 | 0.497 | True |
| 4 | 4 | Life Is Good (feat. Drake) | High Off Life | Future | 2020-05-15 | 237918 | 77 | 0.795 | 0.067 | 0.795 | 0.574 | 0.0 | 0.15 | -6.903 | 0.487 | 142.053 | 4 | 0.537 | True |


This dataset has 50 rows and 19 columns, with each row representing a song. Rows are ordered by popularity in the US in 2020 as determined by Spotify (i.e., "The Box" was the most-streamed song in the US in 2020). This ordering is corroborated by several [external sources](https://sports.yahoo.com/spotifys-top-artists-songs-albums-050653496.html) reporting that the playlist is ordered by popularity.

The columns represent a variety of track-level metrics. "Unnamed: 0" is the row index that was written to the CSV when the raw data was saved; because the rows follow the playlist order, it can be read as the popularity ranking of the top 50 songs at the end of 2020 (it starts at 0, so its values range from 0 to 49). Several other columns are self-explanatory labels or objective measures, such as song name, album title, artist, release date, and duration_ms (the duration of the song in milliseconds).
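
To see where "Unnamed: 0" comes from, here is a small sketch that writes a toy DataFrame to an in-memory CSV and reads it back, mirroring the `to_csv` / `read_csv` round trip used in this notebook. The toy frame holds only the first three song names from the table above, and the in-memory buffer stands in for the file on disk.

```python
import io

import pandas as pd

# Toy frame in playlist order (first three songs from the table above)
toy = pd.DataFrame({'name': ['The Box', 'Blinding Lights', 'Blueberry Faygo']})

buffer = io.StringIO()
toy.to_csv(buffer)   # the default index is written as an unnamed first column
buffer.seek(0)
round_trip = pd.read_csv(buffer)

print(list(round_trip.columns))            # ['Unnamed: 0', 'name']
print(round_trip['Unnamed: 0'].tolist())   # [0, 1, 2]
```

Because the index was assigned in playlist order, the recovered column preserves the ranking, starting from 0.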

Other columns straddle the line between objective and subjective measures, likely involving automated or machine learning methods on Spotify's end. The column 'explicit' is a categorical variable denoting whether a track has explicit lyrics: true means yes, and false means no or unknown. The column 'tempo' gives the speed of the track in beats per minute. The column 'time_signature' gives the meter of the song. However, I can already see significant mistakes in the time signature labels, which suggests that Spotify's machine learning models may be unreliable for this metric, so I will likely remove this column.

A significant portion of the columns, such as danceability, energy, instrumentalness, and valence (positiveness), are Spotify-defined metrics determined entirely by machine learning algorithms. Fuller descriptions of these variables and the others previously mentioned can be found [here](https://rpubs.com/PeterDola/SpotifyTracks).

The column 'popularity' is important to understand. It contains values between 0 and 100, with 100 being the most popular. It is calculated algorithmically from the number of plays a track has and how recent those plays are. Because this dataset was gathered nearly a year after the end of 2020 (meaning that the 'popularity' algorithm has taken 2021 listening patterns into account), the values in the 'popularity' column no longer reflect 2020 popularity and do not match the ranked order of songs in the dataset. Thus, I will also remove this column.

#### Data Cleaning ####

As discussed above, I will now remove the 'popularity' and 'time_signature' columns. There is also a duplicate column: danceability.1 contains the same values as danceability, because danceability was listed twice when the track data was collected. I will therefore remove danceability.1 as well.
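
Before dropping a suspected duplicate, it is worth confirming that the two columns really match. The sketch below reproduces the situation on a two-row toy frame, with values copied from the table above: a column name repeated at creation time is renamed with a `.1` suffix when the CSV is read back, and `Series.equals` confirms the copies are identical. The in-memory buffer is a stand-in for the raw CSV file.

```python
import io

import pandas as pd

# Two rows from the table above, with 'danceability' listed twice on purpose
rows = [
    ['The Box', 0.896, 0.104, 0.896],
    ['Blinding Lights', 0.514, 0.00146, 0.514],
]
toy = pd.DataFrame(rows, columns=['name', 'danceability', 'acousticness', 'danceability'])

buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)   # the repeated name becomes 'danceability.1'

print(list(reloaded.columns))
duplicate_is_identical = reloaded['danceability'].equals(reloaded['danceability.1'])
print(duplicate_is_identical)
```

On the full dataset, the equivalent check would be `spotifyraw2020['danceability'].equals(spotifyraw2020['danceability.1'])`.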

Next, I will rename the column 'Unnamed: 0' to 'rank' to indicate that it represents the popularity ranking of the songs in the dataset. Since the most popular song is denoted as 0, I will add 1 to every value in this column so that the most popular song is ranked 1, the 50th most popular song is ranked 50, and so on.

In addition, I will convert the duration_ms column from milliseconds to seconds for readability. 

In [10]:
spotifytop2020 = spotifyraw2020.drop(['popularity', 'time_signature', 'danceability.1'], axis = 1)

In [11]:
spotifytop2020 = spotifytop2020.rename(columns = {"Unnamed: 0": "rank"})
spotifytop2020['rank'] = spotifytop2020['rank'] + 1

In [12]:
spotifytop2020['duration_sec'] = spotifytop2020['duration_ms']/1000
spotifytop2020 = spotifytop2020.drop(['duration_ms'], axis = 1)
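
A few sanity checks can confirm that the cleaning steps behaved as intended. The sketch below applies the same rename, shift, and unit conversion to a three-row excerpt (values copied from the raw table above, limited to the columns involved) and asserts the properties one would expect. The variable names `excerpt` and `cleaned` are illustrative only.

```python
import pandas as pd

# Three rows copied from the raw table above (only the columns needed here)
excerpt = pd.DataFrame({
    'Unnamed: 0': [0, 1, 2],
    'name': ['The Box', 'Blinding Lights', 'Blueberry Faygo'],
    'duration_ms': [196652, 200040, 162546],
})

# Same cleaning steps as in the cells above
cleaned = excerpt.rename(columns={'Unnamed: 0': 'rank'})
cleaned['rank'] = cleaned['rank'] + 1
cleaned['duration_sec'] = cleaned['duration_ms'] / 1000
cleaned = cleaned.drop(['duration_ms'], axis=1)

# Ranks should run 1..n without gaps or repeats
assert cleaned['rank'].tolist() == list(range(1, len(cleaned) + 1))
assert cleaned['rank'].is_unique
# Durations should be positive, and the millisecond column should be gone
assert (cleaned['duration_sec'] > 0).all()
assert 'duration_ms' not in cleaned.columns
```

The same assertions can be run against `spotifytop2020` directly, where the expected ranks are 1 through 50.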

In [13]:
spotifytop2020.sample(10)

| | rank | name | album | artist | release_date | danceability | acousticness | energy | instrumentalness | liveness | loudness | speechiness | tempo | valence | explicit | duration_sec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 3 | Blueberry Faygo | Certified Hitmaker | Lil Mosey | 2020-02-06 | 0.774 | 0.207 | 0.554 | 0.0 | 0.132 | -7.909 | 0.0383 | 99.034 | 0.349 | True | 162.546 |
| 14 | 15 | Say So | Hot Pink | Doja Cat | 2019-11-07 | 0.787 | 0.264 | 0.673 | 3e-06 | 0.0904 | -4.583 | 0.159 | 110.962 | 0.779 | True | 237.893 |
| 30 | 31 | everything i wanted | everything i wanted | Billie Eilish | 2019-11-13 | 0.704 | 0.902 | 0.225 | 0.657 | 0.106 | -14.454 | 0.0994 | 120.006 | 0.243 | False | 245.425 |
| 18 | 19 | death bed (coffee for your head) | death bed (coffee for your head) | Powfu | 2020-02-08 | 0.726 | 0.731 | 0.431 | 0.0 | 0.696 | -8.765 | 0.135 | 144.026 | 0.348 | False | 173.333 |
| 20 | 21 | Godzilla (feat. Juice WRLD) | Music To Be Murdered By | Eminem | 2020-01-17 | 0.808 | 0.145 | 0.745 | 0.0 | 0.292 | -5.26 | 0.342 | 165.995 | 0.829 | True | 210.8 |
| 10 | 11 | WHATS POPPIN | Sweet Action | Jack Harlow | 2020-03-13 | 0.923 | 0.017 | 0.604 | 0.0 | 0.272 | -6.671 | 0.245 | 145.062 | 0.826 | True | 139.741 |
| 49 | 50 | SICKO MODE | ASTROWORLD | Travis Scott | 2018-08-03 | 0.834 | 0.00513 | 0.73 | 0.0 | 0.124 | -3.714 | 0.222 | 155.008 | 0.446 | True | 312.82 |
| 32 | 33 | For The Night (feat. Lil Baby & DaBaby) | Shoot For The Stars Aim For The Moon | Pop Smoke | 2020-07-03 | 0.823 | 0.114 | 0.586 | 0.0 | 0.193 | -6.606 | 0.2 | 125.971 | 0.347 | True | 190.476 |
| 13 | 14 | Intentions (feat. Quavo) | Changes | Justin Bieber | 2020-02-14 | 0.806 | 0.3 | 0.546 | 0.0 | 0.102 | -6.637 | 0.0575 | 147.986 | 0.874 | False | 212.866 |
| 44 | 45 | goosebumps | Birds In The Trap Sing McKnight | Travis Scott | 2016-09-16 | 0.841 | 0.0847 | 0.728 | 0.0 | 0.149 | -3.37 | 0.0484 | 130.049 | 0.43 | True | 243.836 |


With a few small changes, the dataset is more readable!

In [14]:
spotifytop2020.to_csv("../../data/Clean/spotifytop2020cleaned.csv", sep = ',')