### Spotify Initial 2017 Data Collection, Observation, and Cleaning ###

This notebook uses Spotipy, a Python client for the Spotify Web API, to gather track-level data maintained by Spotify. It produces a dataset with information on the most popular songs on Spotify in the United States in 2017, as measured by Spotify and released to the public via an [ordered playlist](https://open.spotify.com/user/spotify/playlist/37i9dQZF1DX7Axsg3uaDZb?si=Yf6l20lBTWu9BzquG35UKg) at the end of the year. The version collected here contains 98 tracks.

The dataset is then inspected and cleaned.

Source: Spotify Web API, [Top Tracks of 2017: USA](https://open.spotify.com/user/spotify/playlist/37i9dQZF1DX7Axsg3uaDZb?si=Yf6l20lBTWu9BzquG35UKg)

Downloaded: 11/22/2021

Srinidhi Ramakrishna

In [1]:
# Importing packages
import spotipy
import time
import pandas as pd
from spotipy.oauth2 import SpotifyClientCredentials

In [2]:
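# Loading my unique client ID and client secret as a developer
# (assumes they are stored in the environment variables SPOTIPY_CLIENT_ID and
# SPOTIPY_CLIENT_SECRET, so that credentials are not hard-coded in the notebook)
import os
cid = os.environ['SPOTIPY_CLIENT_ID']
secret = os.environ['SPOTIPY_CLIENT_SECRET']
client_credentials_manager = SpotifyClientCredentials(client_id=cid, client_secret=secret)
sp = spotipy.Spotify(client_credentials_manager = client_credentials_manager)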

In [3]:
# Collecting track IDs based on the playlist URL
def getTrackIDs(user, playlist_id):
    ids = []
    playlist = sp.user_playlist(user, playlist_id)
    for item in playlist['tracks']['items']:
        track = item['track']
        ids.append(track['id'])
    return ids

ids = getTrackIDs('spotify', '37i9dQZF1DX7Axsg3uaDZb')
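
The playlist response used above carries a single page of tracks (up to 100), which is enough for this 98-track playlist. For longer playlists, a minimal sketch of a paginated version might look like the following. It assumes a Spotipy client is passed in as `client` and relies on its `playlist_items` and `next` methods; the function name `collect_track_ids` is hypothetical and not part of the original notebook.

```python
def collect_track_ids(client, playlist_id):
    """Return every track ID in a playlist, following pagination links."""
    ids = []
    page = client.playlist_items(playlist_id)
    while page is not None:
        for item in page['items']:
            track = item.get('track')
            # Entries may lack track data (for example, unavailable tracks); skip them
            if track and track.get('id'):
                ids.append(track['id'])
        # Fetch the next page only if the response advertises one
        page = client.next(page) if page.get('next') else None
    return ids
```

With the client defined earlier, this could be called as `collect_track_ids(sp, '37i9dQZF1DX7Axsg3uaDZb')`.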

In [4]:
# Collecting track metadata and audio features for each song
def getTrackFeatures(track_id):
    meta = sp.track(track_id)
    # audio_features returns a list; take the entry for this track
    features = sp.audio_features(track_id)[0]

    # meta
    name = meta['name']
    album = meta['album']['name']
    artist = meta['album']['artists'][0]['name']  # first artist listed on the album
    release_date = meta['album']['release_date']
    duration_ms = meta['duration_ms']
    popularity = meta['popularity']
    explicit = meta['explicit']

    # features
    acousticness = features['acousticness']
    danceability = features['danceability']
    energy = features['energy']
    instrumentalness = features['instrumentalness']
    liveness = features['liveness']
    loudness = features['loudness']
    speechiness = features['speechiness']
    tempo = features['tempo']
    time_signature = features['time_signature']
    valence = features['valence']

    # Note: danceability appears twice in this list (and in the column names used to
    # build the DataFrame), which produces the duplicate danceability.1 column
    # that is dropped during cleaning.
    track = [name, album, artist, release_date, duration_ms, popularity, danceability, acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, tempo, time_signature, valence, explicit]
    return track

In [5]:
# Looping over track IDs, appending each track's metrics as a new row
tracks = []
for track_id in ids:
    time.sleep(.5)  # short pause between requests
    tracks.append(getTrackFeatures(track_id))

In [6]:
# Creating dataset
df = pd.DataFrame(tracks, columns = ['name', 'album', 'artist', 'release_date', 'duration_ms', 'popularity', 'danceability', 'acousticness', 'danceability', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness', 'tempo', 'time_signature', 'valence', 'explicit'])
df.to_csv("../../data/Raw/spotify2017raw.csv", sep = ',')

#### Data Observation

In [7]:
spotifyraw2017 = pd.read_csv("../../data/Raw/spotify2017raw.csv")

Let's take a basic look at the dimensions of this dataset, as well as the meanings of the rows and columns. 

In [8]:
spotifyraw2017.shape

(98, 19)

In [9]:
spotifyraw2017.head()

| | Unnamed: 0 | name | album | artist | release_date | duration_ms | popularity | danceability | acousticness | danceability.1 | energy | instrumentalness | liveness | loudness | speechiness | tempo | time_signature | valence | explicit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | HUMBLE. | DAMN. | Kendrick Lamar | 2017-04-14 | 177000 | 85 | 0.908 | 0.000282 | 0.908 | 0.621 | 5.4e-05 | 0.0958 | -6.638 | 0.102 | 150.011 | 4 | 0.421 | True |
| 1 | 1 | XO Tour Llif3 | Luv Is Rage 2 | Lil Uzi Vert | 2017-08-25 | 182706 | 86 | 0.732 | 0.00264 | 0.732 | 0.75 | 0.0 | 0.109 | -6.366 | 0.231 | 155.096 | 4 | 0.401 | True |
| 2 | 2 | Shape of You | ÷ (Deluxe) | Ed Sheeran | 2017-03-03 | 233712 | 88 | 0.825 | 0.581 | 0.825 | 0.652 | 0.0 | 0.0931 | -3.183 | 0.0802 | 95.977 | 4 | 0.931 | False |
| 3 | 3 | Congratulations | Stoney (Deluxe) | Post Malone | 2016-12-09 | 220293 | 84 | 0.63 | 0.215 | 0.63 | 0.804 | 0.0 | 0.253 | -4.183 | 0.0363 | 123.146 | 4 | 0.492 | True |
| 4 | 4 | Despacito - Remix | Despacito Feat. Justin Bieber (Remix) | Luis Fonsi | 2017-04-17 | 228826 | 4 | 0.694 | 0.229 | 0.694 | 0.815 | 0.0 | 0.0924 | -4.328 | 0.12 | 88.931 | 4 | 0.813 | False |


This dataset has 98 rows and 19 columns, with each row representing a song. Rows are ordered by popularity in the US in 2017 as determined by Spotify (i.e., HUMBLE. was the most-streamed song in the US in 2017). This ordering was confirmed by several [external sources](https://time.com/5050155/spotify-2017-most-streamed-music/) reporting that the playlist was ordered by popularity.

The columns represent a variety of track-level metrics. "Unnamed: 0" is the row index that was saved with the raw CSV; because the rows follow playlist order, it can be read as the popularity ranking of the 98 songs at the end of 2017 (it begins at 0, so its values range from 0 to 97). Several other columns are self-explanatory labels or objective measures: song name, album title, artist, release date, and duration_ms (the duration of the song in milliseconds).

Other columns straddle the line between objective and subjectively determined measures, likely involving some automated or machine learning methods on Spotify's end. The column 'explicit' is a categorical variable denoting whether a track has explicit lyrics: true means yes, and false means no or unknown. The column 'tempo' represents the speed of the track in beats per minute. The column 'time_signature' represents the meter of the song. However, I can already tell there are significant mistakes in the time signature labeling of songs, which suggests that Spotify's machine learning models may be unreliable for this metric, so I will likely remove this column.

Many of the columns, such as danceability, energy, instrumentalness, and valence, are Spotify-defined metrics computed entirely by machine learning algorithms. Fuller descriptions of these variables and the others previously mentioned can be found [here](https://rpubs.com/PeterDola/SpotifyTracks).

The column 'popularity' is important to understand. It contains values between 0 and 100, with 100 being the most popular. It is calculated algorithmically from the number of plays a track has and how recent those plays are. Because this dataset was gathered long after 2017 (meaning that the 'popularity' algorithm has taken into account listening patterns from 2018-2021), the values in the 'popularity' column are no longer relevant and no longer match the ranked order of songs in the dataset. Thus, I will also disregard and remove this column.
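
To make the mismatch concrete, here is a small sketch that compares playlist rank with the 'popularity' score for the first five rows of the raw dataset. The values are copied from the `head()` output above; the `top5` frame and the `rank_by_popularity` column are illustrative names, not part of the original notebook.

```python
import pandas as pd

# First five rows of the raw dataset, copied from the head() output
top5 = pd.DataFrame({
    'name': ['HUMBLE.', 'XO Tour Llif3', 'Shape of You', 'Congratulations', 'Despacito - Remix'],
    'popularity': [85, 86, 88, 84, 4],
})
top5['rank'] = top5.index + 1  # playlist order, 1 = most streamed in 2017

# If popularity still tracked the 2017 ranking, it would never increase down the list
print(top5['popularity'].is_monotonic_decreasing)

# Rank each song by its current popularity score, for comparison with playlist rank
top5['rank_by_popularity'] = top5['popularity'].rank(ascending=False).astype(int)
print(top5[['rank', 'name', 'popularity', 'rank_by_popularity']])
```

Even within these five rows the two orderings disagree: the song ranked third in the playlist has the highest popularity score, and the fifth has a score of only 4.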

#### Data Cleaning

As discussed above, I will now remove the 'popularity' and 'time_signature' columns. There is also a duplicate column (danceability.1 contains the same values as danceability), so I will remove the danceability.1 column as well.

Next, I will rename the column Unnamed: 0 to 'rank' to indicate that it represents the popularity rankings of the songs in the dataset. Since the most popular song is denoted as '0', I will add 1 to all values in this column so that the most popular song is denoted as '1', the 50th most popular song is denoted as '50', and so on.

In addition, I will convert the duration_ms column from milliseconds to seconds for readability. 
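
Both the "Unnamed: 0" and "danceability.1" columns come from how pandas writes and reads CSV files. The sketch below reproduces this on a toy one-row frame (an assumption for illustration, using an in-memory buffer instead of the real file): the index is written as an unnamed first column, and a repeated column name is given a `.1` suffix when the file is read back.

```python
import io
import pandas as pd

# Toy frame with a repeated column name, mirroring the raw collection step
toy = pd.DataFrame([[0.908, 0.000282, 0.908]],
                   columns=['danceability', 'acousticness', 'danceability'])

buffer = io.StringIO()
toy.to_csv(buffer)  # the index is written as an unnamed first column
buffer.seek(0)
round_trip = pd.read_csv(buffer)
print(list(round_trip.columns))

# Writing without the index avoids the extra column on the next read
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
no_index = pd.read_csv(buffer)
print(list(no_index.columns))
```

In this notebook the saved index is useful, since it doubles as the playlist ranking; the second variant is shown only for contrast.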

In [10]:
# Dropping popularity, time signature, and the duplicate danceability column
spotifytop2017 = spotifyraw2017.drop(['popularity', 'time_signature', 'danceability.1'], axis = 1)

# Setting rank column
spotifytop2017 = spotifytop2017.rename(columns = {"Unnamed: 0": "rank"})
spotifytop2017['rank'] = spotifytop2017['rank'] + 1

# Converting and creating duration column
spotifytop2017['duration_sec'] = spotifytop2017['duration_ms']/1000
spotifytop2017 = spotifytop2017.drop(['duration_ms'], axis = 1)
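
A few lightweight checks can confirm that the cleaning steps did what was intended. The helper below is a hypothetical sketch, not part of the original notebook; it is demonstrated on a two-row toy frame whose values are derived from the first two rows of the raw data.

```python
import pandas as pd

def check_cleaned(df):
    """Lightweight checks on a cleaned frame; raises AssertionError on failure."""
    # rank should run 1..n with no gaps or repeats
    assert list(df['rank']) == list(range(1, len(df) + 1))
    # dropped columns should be gone
    dropped = {'popularity', 'time_signature', 'danceability.1', 'duration_ms'}
    assert dropped.isdisjoint(df.columns)
    # durations should be positive numbers of seconds
    assert (df['duration_sec'] > 0).all()
    return True

# Toy example: durations are 177000 ms and 182706 ms converted to seconds
toy_clean = pd.DataFrame({
    'rank': [1, 2],
    'name': ['HUMBLE.', 'XO Tour Llif3'],
    'duration_sec': [177.0, 182.706],
})
print(check_cleaned(toy_clean))
```

The same function could be applied to the full frame with `check_cleaned(spotifytop2017)`.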

In [11]:
spotifytop2017.sample(10)

| | rank | name | album | artist | release_date | danceability | acousticness | energy | instrumentalness | liveness | loudness | speechiness | tempo | valence | explicit | duration_sec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 7 | iSpy (feat. Lil Yachty) | iSpy (feat. Lil Yachty) | KYLE | 2016-12-02 | 0.746 | 0.378 | 0.653 | 0.0 | 0.229 | -6.745 | 0.289 | 75.016 | 0.672 | True | 253.106 |
| 23 | 24 | Tunnel Vision | Painting Pictures | Kodak Black | 2017-03-31 | 0.497 | 0.0576 | 0.489 | 9.9e-05 | 0.122 | -7.724 | 0.294 | 171.853 | 0.231 | True | 268.186 |
| 13 | 14 | goosebumps | Birds In The Trap Sing McKnight | Travis Scott | 2016-09-16 | 0.841 | 0.0847 | 0.728 | 0.0 | 0.149 | -3.37 | 0.0484 | 130.049 | 0.43 | True | 243.836 |
| 29 | 30 | Slippery (feat. Gucci Mane) | Culture | Migos | 2017-01-27 | 0.92 | 0.307 | 0.674 | 0.0 | 0.104 | -5.662 | 0.264 | 141.967 | 0.741 | True | 304.041 |
| 40 | 41 | Swang | SremmLife 2 (Deluxe) | Rae Sremmurd | 2016-08-12 | 0.681 | 0.2 | 0.314 | 1e-05 | 0.1 | -9.319 | 0.0581 | 139.992 | 0.166 | True | 208.12 |
| 75 | 76 | Love Galore (feat. Travis Scott) | Ctrl | SZA | 2017-06-09 | 0.795 | 0.112 | 0.594 | 0.0 | 0.162 | -6.2 | 0.0748 | 135.002 | 0.409 | True | 275.08 |
| 74 | 75 | Feels (feat. Pharrell Williams, Katy Perry & B... | Funk Wav Bounces Vol.1 | Calvin Harris | 2017-06-30 | 0.893 | 0.0642 | 0.745 | 0.0 | 0.0943 | -3.105 | 0.0571 | 101.018 | 0.872 | True | 223.413 |
| 77 | 78 | It's A Vibe | Pretty Girls Like Trap Music | 2 Chainz | 2017-06-16 | 0.822 | 0.0312 | 0.502 | 0.000887 | 0.114 | -7.38 | 0.148 | 73.003 | 0.525 | True | 210.2 |
| 50 | 51 | Thunder | Evolve | Imagine Dragons | 2017-06-23 | 0.6 | 0.00683 | 0.81 | 0.21 | 0.155 | -4.749 | 0.0479 | 167.88 | 0.298 | False | 187.146 |
| 44 | 45 | Caroline | Good For You | Aminé | 2017-07-28 | 0.94 | 0.17 | 0.335 | 0.0 | 0.262 | -10.179 | 0.505 | 120.04 | 0.707 | True | 209.64 |


With a few small changes, the dataset is more readable!

In [12]:
spotifytop2017.to_csv("../../data/Clean/spotifytop2017cleaned.csv", sep = ',')