# Applied Machine Learning 2021 - Assignment 3

## Dissecting Spotify Valence

In this assignment we will dissect Spotify's Valence metric.

This notebook explores how the Spotify API works. I chose a [dataset from Kaggle](https://www.kaggle.com/zaheenhamidani/ultimate-spotify-tracks-db) and used its track ids to get the audio features from the Spotify API with my own credentials. I picked this dataset because it contains 232,725 tracks in total, with 176,774 unique `track_id` values. It also covers 26 genres, which should help our further analysis because many different genres are represented.

* The audio features were included in the initial dataset, but I chose to repeat the procedure in order to explore how the Spotify API works.
---

> Student Name: Aikaterini Dimatou <br />
> AM: 8180199 <br />
> University: Athens University of Economics and Business <br />
> Email: t8180199@aueb.gr

### Setting up the Spotify API

* To set up the Spotify API, I created an application at http://developer.spotify.com.

* I used my Spotify account in order to sign in.

* I created an app from the dashboard.



In [1]:
import pandas as pd

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

%matplotlib inline

* For storing my credentials, I created a file `spotify_config.py` with the following contents:

  ```
  config = {
      'client_id' : 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX',
      'client_secret' :'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
  }
  ```

In [2]:
from spotify_config import config

client_credentials_manager = SpotifyClientCredentials(config['client_id'],
                                                      config['client_secret'])
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
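As an alternative to a Python file, the same `config` dictionary could be built from environment variables, which keeps the secrets out of the project folder. The sketch below assumes the variables are named `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET`; the helper `load_spotify_config` is hypothetical and is not part of the original notebook.

```python
import os


def load_spotify_config(environ=os.environ):
    """Build a dict shaped like `config` from environment variables.

    The variable names are an assumption made for this sketch.
    """
    try:
        return {
            'client_id': environ['SPOTIPY_CLIENT_ID'],
            'client_secret': environ['SPOTIPY_CLIENT_SECRET'],
        }
    except KeyError as missing:
        raise RuntimeError(f'missing environment variable: {missing}') from None


# Usage with a stand-in mapping (placeholder values, not real credentials).
config = load_spotify_config({
    'SPOTIPY_CLIENT_ID': 'X' * 32,
    'SPOTIPY_CLIENT_SECRET': 'Y' * 32,
})
```

The resulting dictionary has the same keys as the one in `spotify_config.py`, so the cell above would not need to change.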

* We will read the Kaggle dataset CSV.

In [3]:
tracks = pd.read_csv('SpotifyFeatures.csv')
tracks

Unnamed: 0,genre,artist_name,track_name,track_id,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
0,Movie,Henri Salvador,C'est beau de faire un Show,0BRjO6ga9RKCKjfDqeFgWV,0,0.61100,0.389,99373,0.910,0.000000,C#,0.3460,-1.828,Major,0.0525,166.969,4/4,0.814
1,Movie,Martin & les fées,Perdu d'avance (par Gad Elmaleh),0BjC1NfoEOOusryehmNudP,1,0.24600,0.590,137373,0.737,0.000000,F#,0.1510,-5.559,Minor,0.0868,174.003,4/4,0.816
2,Movie,Joseph Williams,Don't Let Me Be Lonely Tonight,0CoSDzoNIKCRs124s9uTVy,3,0.95200,0.663,170267,0.131,0.000000,C,0.1030,-13.879,Minor,0.0362,99.488,5/4,0.368
3,Movie,Henri Salvador,Dis-moi Monsieur Gordon Cooper,0Gc6TVm52BwZD07Ki6tIvf,0,0.70300,0.240,152427,0.326,0.000000,C#,0.0985,-12.178,Major,0.0395,171.758,4/4,0.227
4,Movie,Fabien Nataf,Ouverture,0IuslXpMROHdEPvSl1fTQK,4,0.95000,0.331,82625,0.225,0.123000,F,0.2020,-21.150,Major,0.0456,140.576,4/4,0.390
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
232720,Soul,Slave,Son Of Slide,2XGLdVl7lGeq8ksM6Al7jT,39,0.00384,0.687,326240,0.714,0.544000,D,0.0845,-10.626,Major,0.0316,115.542,4/4,0.962
232721,Soul,Jr Thomas & The Volcanos,Burning Fire,1qWZdkBl4UVPj9lK6HuuFM,38,0.03290,0.785,282447,0.683,0.000880,E,0.2370,-6.944,Minor,0.0337,113.830,4/4,0.969
232722,Soul,Muddy Waters,(I'm Your) Hoochie Coochie Man,2ziWXUmQLrXTiYjCg2fZ2t,47,0.90100,0.517,166960,0.419,0.000000,D,0.0945,-8.282,Major,0.1480,84.135,4/4,0.813
232723,Soul,R.LUM.R,With My Words,6EFsue2YbIG4Qkq8Zr9Rir,44,0.26200,0.745,222442,0.704,0.000000,A,0.3330,-7.137,Major,0.1460,100.031,4/4,0.489


We will find the unique `track_id` values, which we will use to get the audio features from the Spotify API.

In [4]:
len(tracks['track_id'].unique())

176774
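The gap between the number of rows and the number of unique ids means that some ids appear more than once. One plausible reason, which I have not verified here, is that the same track is listed under more than one genre. The sketch below uses a tiny made-up `DataFrame` (the values are illustrative, not taken from the dataset) to show how such repeats could be inspected.

```python
import pandas as pd

# Toy data: track 'a1' is listed under two genres.
toy = pd.DataFrame({
    'genre': ['Pop', 'Dance', 'Rock', 'Pop'],
    'track_id': ['a1', 'a1', 'b2', 'c3'],
})

n_rows = len(toy)                      # total number of rows
n_unique = toy['track_id'].nunique()   # same as len(toy['track_id'].unique())

# keep=False marks every occurrence of a repeated id, not only the later ones.
repeated = toy[toy.duplicated('track_id', keep=False)]
```

Running the same `duplicated` filter on `tracks` would show which ids repeat and in which genres.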

* We will get, for each of the tracks, its [audio features](https://developer.spotify.com/documentation/web-api/reference/#endpoint-get-several-audio-features).

* To do that, we'll create a dictionary keyed by `track_id`, with values being the audio features for the specific track.

In [5]:
features = {}
all_track_ids = list(tracks['track_id'].unique())

* The call for getting several audio features accepts no more than 100 track ids at a time, so we'll have to iterate in batches.

In [6]:
start = 0
num_tracks = 100
while start < len(all_track_ids):
    print(f'getting from {start} to {start+num_tracks}')
    tracks_batch = all_track_ids[start:start+num_tracks]
    features_batch = sp.audio_features(tracks_batch)
    features.update({ track_id : track_features 
                     for track_id, track_features in zip(tracks_batch, features_batch) })
    start += num_tracks

getting from 0 to 100
getting from 100 to 200
getting from 200 to 300
getting from 300 to 400
getting from 400 to 500
getting from 500 to 600
getting from 600 to 700
getting from 700 to 800
getting from 800 to 900
getting from 900 to 1000
getting from 1000 to 1100
getting from 1100 to 1200
getting from 1200 to 1300
getting from 1300 to 1400
getting from 1400 to 1500
getting from 1500 to 1600
getting from 1600 to 1700
getting from 1700 to 1800
getting from 1800 to 1900
getting from 1900 to 2000
getting from 2000 to 2100
getting from 2100 to 2200
getting from 2200 to 2300
getting from 2300 to 2400
getting from 2400 to 2500
getting from 2500 to 2600
getting from 2600 to 2700
getting from 2700 to 2800
getting from 2800 to 2900
getting from 2900 to 3000
getting from 3000 to 3100
getting from 3100 to 3200
getting from 3200 to 3300
getting from 3300 to 3400
getting from 3400 to 3500
getting from 3500 to 3600
getting from 3600 to 3700
getting from 3700 to 3800
getting from 3800 to 3900
...
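The batching logic of the loop above can also be written as a small generator. This is only a sketch: `batches`, `collect_features` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for `sp.audio_features` so that the example runs without credentials or network access.

```python
def batches(items, size=100):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def collect_features(track_ids, fetch, size=100):
    """Call `fetch` once per batch and key each result by its track id."""
    features = {}
    for batch in batches(track_ids, size):
        # zip pairs each id with the result at the same position in the batch.
        features.update(zip(batch, fetch(batch)))
    return features


def fake_fetch(batch):
    """Stand-in for sp.audio_features: one dict per requested id."""
    return [{'id': track_id} for track_id in batch]


ids = [f'track{i}' for i in range(250)]
sizes = [len(batch) for batch in batches(ids)]
features = collect_features(ids, fake_fetch)
```

With 250 ids the last batch is shorter than the others, which is the same situation the `while` loop handles through slicing.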

In [11]:
len(features)

176774

In [12]:
features['7qiZfU4dY1lWllzX7mPBI3']

{'danceability': 0.825,
 'energy': 0.652,
 'key': 1,
 'loudness': -3.183,
 'mode': 0,
 'speechiness': 0.0802,
 'acousticness': 0.581,
 'instrumentalness': 0,
 'liveness': 0.0931,
 'valence': 0.931,
 'tempo': 95.977,
 'type': 'audio_features',
 'id': '7qiZfU4dY1lWllzX7mPBI3',
 'uri': 'spotify:track:7qiZfU4dY1lWllzX7mPBI3',
 'track_href': 'https://api.spotify.com/v1/tracks/7qiZfU4dY1lWllzX7mPBI3',
 'analysis_url': 'https://api.spotify.com/v1/audio-analysis/7qiZfU4dY1lWllzX7mPBI3',
 'duration_ms': 233713,
 'time_signature': 4}

* We'll turn the dictionary into a `DataFrame`.

In [13]:
tracks = pd.DataFrame.from_dict(features, orient='index')
tracks

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,type,id,uri,track_href,analysis_url,duration_ms,time_signature
0BRjO6ga9RKCKjfDqeFgWV,0.389,0.910,1,-1.828,1,0.0525,0.6110,0.000000,0.3460,0.816,166.969,audio_features,0BRjO6ga9RKCKjfDqeFgWV,spotify:track:0BRjO6ga9RKCKjfDqeFgWV,https://api.spotify.com/v1/tracks/0BRjO6ga9RKC...,https://api.spotify.com/v1/audio-analysis/0BRj...,99373,4
0BjC1NfoEOOusryehmNudP,0.591,0.737,6,-5.559,0,0.0877,0.2460,0.000000,0.1510,0.815,174.003,audio_features,0BjC1NfoEOOusryehmNudP,spotify:track:0BjC1NfoEOOusryehmNudP,https://api.spotify.com/v1/tracks/0BjC1NfoEOOu...,https://api.spotify.com/v1/audio-analysis/0BjC...,137373,4
0CoSDzoNIKCRs124s9uTVy,0.663,0.131,0,-13.879,0,0.0362,0.9520,0.000000,0.1030,0.368,99.488,audio_features,0CoSDzoNIKCRs124s9uTVy,spotify:track:0CoSDzoNIKCRs124s9uTVy,https://api.spotify.com/v1/tracks/0CoSDzoNIKCR...,https://api.spotify.com/v1/audio-analysis/0CoS...,170267,5
0Gc6TVm52BwZD07Ki6tIvf,0.241,0.326,1,-12.178,1,0.0390,0.7030,0.000000,0.0985,0.226,171.782,audio_features,0Gc6TVm52BwZD07Ki6tIvf,spotify:track:0Gc6TVm52BwZD07Ki6tIvf,https://api.spotify.com/v1/tracks/0Gc6TVm52BwZ...,https://api.spotify.com/v1/audio-analysis/0Gc6...,152427,4
0IuslXpMROHdEPvSl1fTQK,0.331,0.225,5,-21.150,1,0.0456,0.9500,0.123000,0.2020,0.390,140.576,audio_features,0IuslXpMROHdEPvSl1fTQK,spotify:track:0IuslXpMROHdEPvSl1fTQK,https://api.spotify.com/v1/tracks/0IuslXpMROHd...,https://api.spotify.com/v1/audio-analysis/0Ius...,82625,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1U0OMWvR89Cm20vCNar50f,0.736,0.701,10,-4.345,0,0.1000,0.2310,0.000000,0.2030,0.770,99.991,audio_features,1U0OMWvR89Cm20vCNar50f,spotify:track:1U0OMWvR89Cm20vCNar50f,https://api.spotify.com/v1/tracks/1U0OMWvR89Cm...,https://api.spotify.com/v1/audio-analysis/1U0O...,222667,4
2gGqKJWfWbToha2YmDxnnj,0.802,0.516,2,-9.014,1,0.2140,0.1040,0.000472,0.1050,0.482,175.663,audio_features,2gGqKJWfWbToha2YmDxnnj,spotify:track:2gGqKJWfWbToha2YmDxnnj,https://api.spotify.com/v1/tracks/2gGqKJWfWbTo...,https://api.spotify.com/v1/audio-analysis/2gGq...,201173,4
2iZf3EUedz9MPqbAvXdpdA,0.423,0.337,10,-13.092,0,0.0436,0.5660,0.000000,0.2760,0.497,80.023,audio_features,2iZf3EUedz9MPqbAvXdpdA,spotify:track:2iZf3EUedz9MPqbAvXdpdA,https://api.spotify.com/v1/tracks/2iZf3EUedz9M...,https://api.spotify.com/v1/audio-analysis/2iZf...,144667,4
1qWZdkBl4UVPj9lK6HuuFM,0.785,0.683,4,-6.944,0,0.0337,0.0329,0.000880,0.2370,0.969,113.830,audio_features,1qWZdkBl4UVPj9lK6HuuFM,spotify:track:1qWZdkBl4UVPj9lK6HuuFM,https://api.spotify.com/v1/tracks/1qWZdkBl4UVP...,https://api.spotify.com/v1/audio-analysis/1qWZ...,282447,4
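A small sketch of the same conversion on made-up data. It assumes that the value for an id could be `None` when no features were returned for it; I have not checked whether that happens in this dataset, so the filtering step is a precaution rather than something the notebook does.

```python
import pandas as pd

# Toy dictionary keyed by track id; the values are illustrative.
features = {
    'id_a': {'id': 'id_a', 'valence': 0.9, 'tempo': 120.0},
    'id_b': None,  # assumed case: no features returned for this id
    'id_c': {'id': 'id_c', 'valence': 0.2, 'tempo': 80.0},
}

# Drop the ids without features before building the DataFrame.
available = {track_id: feats
             for track_id, feats in features.items()
             if feats is not None}

# orient='index' turns each dictionary key into a row label.
df = pd.DataFrame.from_dict(available, orient='index')

# The id is also stored as a column, so the row labels can be discarded.
df = df.reset_index(drop=True).rename(columns={'id': 'song_id'})
```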


In [14]:
tracks = tracks.reset_index(drop=True).rename(columns={'id' : 'song_id'})
tracks

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,type,song_id,uri,track_href,analysis_url,duration_ms,time_signature
0,0.389,0.910,1,-1.828,1,0.0525,0.6110,0.000000,0.3460,0.816,166.969,audio_features,0BRjO6ga9RKCKjfDqeFgWV,spotify:track:0BRjO6ga9RKCKjfDqeFgWV,https://api.spotify.com/v1/tracks/0BRjO6ga9RKC...,https://api.spotify.com/v1/audio-analysis/0BRj...,99373,4
1,0.591,0.737,6,-5.559,0,0.0877,0.2460,0.000000,0.1510,0.815,174.003,audio_features,0BjC1NfoEOOusryehmNudP,spotify:track:0BjC1NfoEOOusryehmNudP,https://api.spotify.com/v1/tracks/0BjC1NfoEOOu...,https://api.spotify.com/v1/audio-analysis/0BjC...,137373,4
2,0.663,0.131,0,-13.879,0,0.0362,0.9520,0.000000,0.1030,0.368,99.488,audio_features,0CoSDzoNIKCRs124s9uTVy,spotify:track:0CoSDzoNIKCRs124s9uTVy,https://api.spotify.com/v1/tracks/0CoSDzoNIKCR...,https://api.spotify.com/v1/audio-analysis/0CoS...,170267,5
3,0.241,0.326,1,-12.178,1,0.0390,0.7030,0.000000,0.0985,0.226,171.782,audio_features,0Gc6TVm52BwZD07Ki6tIvf,spotify:track:0Gc6TVm52BwZD07Ki6tIvf,https://api.spotify.com/v1/tracks/0Gc6TVm52BwZ...,https://api.spotify.com/v1/audio-analysis/0Gc6...,152427,4
4,0.331,0.225,5,-21.150,1,0.0456,0.9500,0.123000,0.2020,0.390,140.576,audio_features,0IuslXpMROHdEPvSl1fTQK,spotify:track:0IuslXpMROHdEPvSl1fTQK,https://api.spotify.com/v1/tracks/0IuslXpMROHd...,https://api.spotify.com/v1/audio-analysis/0Ius...,82625,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
176769,0.736,0.701,10,-4.345,0,0.1000,0.2310,0.000000,0.2030,0.770,99.991,audio_features,1U0OMWvR89Cm20vCNar50f,spotify:track:1U0OMWvR89Cm20vCNar50f,https://api.spotify.com/v1/tracks/1U0OMWvR89Cm...,https://api.spotify.com/v1/audio-analysis/1U0O...,222667,4
176770,0.802,0.516,2,-9.014,1,0.2140,0.1040,0.000472,0.1050,0.482,175.663,audio_features,2gGqKJWfWbToha2YmDxnnj,spotify:track:2gGqKJWfWbToha2YmDxnnj,https://api.spotify.com/v1/tracks/2gGqKJWfWbTo...,https://api.spotify.com/v1/audio-analysis/2gGq...,201173,4
176771,0.423,0.337,10,-13.092,0,0.0436,0.5660,0.000000,0.2760,0.497,80.023,audio_features,2iZf3EUedz9MPqbAvXdpdA,spotify:track:2iZf3EUedz9MPqbAvXdpdA,https://api.spotify.com/v1/tracks/2iZf3EUedz9M...,https://api.spotify.com/v1/audio-analysis/2iZf...,144667,4
176772,0.785,0.683,4,-6.944,0,0.0337,0.0329,0.000880,0.2370,0.969,113.830,audio_features,1qWZdkBl4UVPj9lK6HuuFM,spotify:track:1qWZdkBl4UVPj9lK6HuuFM,https://api.spotify.com/v1/tracks/1qWZdkBl4UVP...,https://api.spotify.com/v1/audio-analysis/1qWZ...,282447,4


Then, to avoid fetching the data again, we will store it in a CSV file and read it from that file from now on.

In [19]:
tracks.to_csv('spotifyAPI_tracks_features.csv', index=False)
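Reading the file back is a single `pd.read_csv` call. The sketch below shows the round trip on a tiny made-up `DataFrame` written to a temporary directory, so it does not touch `spotifyAPI_tracks_features.csv`; the file name `features.csv` is only for illustration.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'song_id': ['id_a', 'id_c'], 'valence': [0.9, 0.2]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'features.csv')
    # index=False keeps the row labels out of the file, so reading it back
    # does not produce an extra 'Unnamed: 0' column.
    df.to_csv(path, index=False)
    restored = pd.read_csv(path)
```

In later notebooks the equivalent call would be `pd.read_csv('spotifyAPI_tracks_features.csv')`.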