# Dissecting Spotify Valence

In this assignment you will dissect Spotify's Valence metric.

---

> Panos Louridas, Associate Professor <br />
> Department of Management Science and Technology <br />
> Athens University of Economics and Business <br />
> louridas@aueb.gr

Spotify uses a metric called *valence* to measure the happiness of a track. The metric itself, however, was not developed by Spotify. It was originally developed by Echo Nest, a company that was bought by Spotify in 2014. We don't know exactly how valence is calculated. Some details are given by a blog post, which you can find here:

https://web.archive.org/web/20170422195736/http://blog.echonest.com/post/66097438564/plotting-musics-emotional-valence-1950-2013

Your task is to untangle the mystery behind valence and propose how this is derived.

Spotify offers the following information that may be relevant to your task:

* [Get Track's Audio Features](https://developer.spotify.com/documentation/web-api/reference/#/operations/get-audio-features) and [Get Tracks' Audio Features](https://developer.spotify.com/documentation/web-api/reference/#/operations/get-several-audio-features).

* [Get Track's Audio Analysis](https://developer.spotify.com/documentation/web-api/reference/#/operations/get-audio-analysis).

To tackle the problem you can use the Spotify charts data from Zenodo at https://doi.org/10.5281/zenodo.4778562, but you are not limited to that. In fact, you are encouraged to use additional data that you may find or collect yourself.

The rank of your assignment will contribute to the grade. That is, if $x$ is the unranked grade of an assignment and $r$ is the respective model’s ranking among $n$ students according to Q2 below, then the final grade will be computed as $0.75x + 2.5[1 - (r-1)/(n-1)]$. 
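
As a rough illustration of the grading formula above, the following sketch computes the final grade for two hypothetical cases. The function name and the example values (`x=8`, `n=50`) are assumptions made only for this example.

```python
def final_grade(x, r, n):
    """Final grade from unranked grade x and rank r (1 = best) among n students."""
    if n < 2:
        raise ValueError("the formula needs at least two students")
    # The rank term ranges from 2.5 (r = 1) down to 0 (r = n)
    rank_bonus = 2.5 * (1 - (r - 1) / (n - 1))
    return 0.75 * x + rank_bonus


# Hypothetical unranked grade of 8 in a class of 50 students
print(final_grade(x=8, r=1, n=50))   # best rank: full 2.5 bonus
print(final_grade(x=8, r=50, n=50))  # worst rank: no bonus
```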

The ranking will be performed on a test dataset that will be provided in due course. The ranking metric will be Mean Absolute Error (MAE).
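
For reference, the MAE is the mean of the absolute differences between true and predicted values. The sketch below computes it with NumPy; the three prediction values are made up for illustration and are not results from any model.

```python
import numpy as np


def mean_absolute_error_np(y_true, y_pred):
    """Mean of |y_true - y_pred| over all samples."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


# Illustrative values only: three valence scores and three made-up predictions
valence_true = [0.535, 0.742, 0.661]
valence_pred = [0.500, 0.700, 0.700]
mae = mean_absolute_error_np(valence_true, valence_pred)
print(round(mae, 4))
```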

## Questions


### Q1: Explore which Track Features Influence Valence

You will use inferential statistical methods to study how track features influence valence. You must find the best possible model for explaining valence based on the features that you find significant.

### Q2: Predict Valence

Use Machine Learning techniques to predict valence based on track features:

* You will use at least three different methods. For each method you should ensure that you tune your hyperparameters as best as you can.

* Once you identify the best method and hyperparameters, explain, to the extent that is possible, which features influence the valence metric.

* You will evaluate your predictions on a holdout testing dataset that will be provided to you. Your evaluation and the value of the MAE on the holdout testing dataset must be included at the end of your submission.

In [1]:
# Importing the libraries I need
import glob
import os
import re
import time
from datetime import datetime
from math import sqrt

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats
from sklearn import neighbors
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeRegressor
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
from termcolor import cprint

%matplotlib inline
sns.set(style = "ticks", context = "talk")

# Reading the data
dir_location = "data"


In [2]:
track_dataset_1 = pd.read_csv(dir_location + r"\KaggleDataset\tracks.csv")
artists2 = pd.read_csv(dir_location + r"\KaggleDataset\artists.csv")
genres2 = pd.read_csv(dir_location + r"\KaggleDataset\genres_v2.csv")
holdout_ids = pd.read_csv(r"C:\Users\jason\My Drive\msc_data_science\1_2 Practical Data Science\Assignment 3\spotify_ids_holdout.txt", header=None, names=["id"])


print(track_dataset_1.shape)
print(artists2.shape)
print(genres2.shape)

(586672, 20)
(1104349, 5)
(42305, 22)

In [3]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from spotify_config import config
client_credentials_manager = SpotifyClientCredentials(config['client_id'],config['client_secret'])
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

def get_song_features(song_ids: list) -> pd.DataFrame:
    """
    Get the audio features for a list of song ids
    """
    features = {}
    num_tracks = 100  # number of ids sent per request
    for start in range(0, len(song_ids), num_tracks):
        print(f'getting from {start} to {start+num_tracks}')
        tracks_batch = song_ids[start:start+num_tracks]
        features_batch = sp.audio_features(tracks_batch)
        # skip ids for which no features were returned (None entries)
        features.update({ track_id : track_features 
                        for track_id, track_features in zip(tracks_batch, features_batch)
                        if track_features is not None })

    tracks = pd.DataFrame.from_dict(features, orient='index')
    # the returned features already contain an 'id' column, so the index can be dropped
    tracks = tracks.reset_index(drop=True)
    # keep only the columns shared with the Kaggle tracks dataset
    # (assumption: `track_dataset_1` is the reference frame for the shared columns)
    tracks = tracks.drop(columns=[x for x in tracks.columns if x not in track_dataset_1.columns])
    return tracks

The function above is based on code from one of the course lectures.
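
The batching logic of the function above can be isolated in a small helper. This is only a sketch: the name `batched` and the placeholder ids are hypothetical, and the batch size of 100 simply mirrors the `num_tracks` value used above.

```python
def batched(items, size=100):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Placeholder ids, not real Spotify track ids
ids = [f"track_{i}" for i in range(250)]
print([len(batch) for batch in batched(ids)])  # [100, 100, 50]
```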

Next, I read all the chart data in order to collect the song features afterwards.

In [5]:
header = 0
dfs = []
for file in glob.glob(dir_location + '\\Charts\\*\\201?\\*.csv'):    
    # Splitting the file path to get the region and dates
    filenamelist = file.split('\\')

    charts_dir_ind = filenamelist.index('Charts')

    region = filenamelist[charts_dir_ind+1]
    dates = re.findall(r'\d{4}-\d{2}-\d{2}', filenamelist[-1])
    weekly_chart = pd.read_csv(file, header=header, sep='\t')
    weekly_chart['week_start'] = datetime.strptime(dates[0], '%Y-%m-%d')
    weekly_chart['week_end'] = datetime.strptime(dates[1], '%Y-%m-%d')
    weekly_chart['region'] = region
    dfs.append(weekly_chart)

all_charts = pd.concat(dfs)
all_charts

Unnamed: 0,position,song_id,song_name,artist,streams,last_week_position,weeks_on_chart,peak_position,position_status,week_start,week_end,region
0,1,5aAx2yezTd8zXrkmtKl66Z,Starboy,The Weeknd,947261,,1,1,new,2016-12-30,2017-01-06,au
1,2,5knuzwU65gJK7IF5yJsuaW,Rockabye (feat. Sean Paul & Anne-Marie),Clean Bandit,893107,,1,2,new,2016-12-30,2017-01-06,au
2,3,7BKLCZ1jbUBVqRi2FVlTVw,Closer,The Chainsmokers,871617,,1,3,new,2016-12-30,2017-01-06,au
3,4,3NdDpSvN911VPGivFlV5d0,I Don’t Wanna Live Forever (Fifty Shades Darke...,ZAYN,791592,,1,4,new,2016-12-30,2017-01-06,au
4,5,78rIJddV4X0HkNAInEcYde,Call On Me - Ryan Riback Extended Remix,Starley,743490,,1,5,new,2016-12-30,2017-01-06,au
...,...,...,...,...,...,...,...,...,...,...,...,...
195,196,7f5trao56t7sB7f14QDTmp,Juicy,Doja Cat,1920454,146.0,8,66,-50,2019-12-20,2019-12-27,us
196,197,5JiH89mHrv9oWHlD0T326z,To Be So Lonely,Harry Styles,1912267,32.0,2,32,-165,2019-12-20,2019-12-27,us
197,198,7GX5flRQZVHRAGd6B4TmDO,XO Tour Llif3,Lil Uzi Vert,1902239,154.0,121,2,-44,2019-12-20,2019-12-27,us
198,199,2dpaYNEQHiRxtZbfNsse99,Happier,Marshmello,1899623,173.0,71,8,-26,2019-12-20,2019-12-27,us
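
The chart-loading cell above splits paths on backslashes, which only works on Windows. A possible alternative, sketched below, uses `pathlib` so that the separator of the running operating system is handled automatically. The function name `parse_chart_path` and the example file name are hypothetical; only the `Charts/<region>/<year>/` layout and the two dates in the file name are taken from the code above.

```python
import re
from datetime import datetime
from pathlib import PurePath

DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def parse_chart_path(path):
    """Return (region, week_start, week_end) for a chart file path."""
    parts = PurePath(path).parts
    # The region is the directory right below 'Charts'
    region = parts[parts.index("Charts") + 1]
    # The file name is expected to contain the start and end dates, in that order
    dates = DATE_PATTERN.findall(parts[-1])
    week_start = datetime.strptime(dates[0], "%Y-%m-%d")
    week_end = datetime.strptime(dates[1], "%Y-%m-%d")
    return region, week_start, week_end


# Hypothetical file name following the assumed layout
example = "data/Charts/au/2017/chart-2016-12-30--2017-01-06.csv"
print(parse_chart_path(example))
```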


In [4]:
track_dataset_1

Unnamed: 0,id,name,popularity,duration_ms,explicit,artists,id_artists,release_date,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
0,35iwgR4jXetI318WEWsa1Q,Carve,6,126903,0,['Uli'],['45tIt06XoI0Iio4LBEVpls'],1922-02-22,0.645,0.4450,0,-13.338,1,0.4510,0.674,0.744000,0.1510,0.1270,104.851,3
1,021ht4sdgPcrDgSk7JTbKY,Capítulo 2.16 - Banquero Anarquista,0,98200,0,['Fernando Pessoa'],['14jtPCOoNZwquk5wd9DxrY'],1922-06-01,0.695,0.2630,0,-22.136,1,0.9570,0.797,0.000000,0.1480,0.6550,102.009,1
2,07A5yehtSnoedViJAZkNnc,Vivo para Quererte - Remasterizado,0,181640,0,['Ignacio Corsini'],['5LiOoJbxVSAMkBS2fUm3X2'],1922-03-21,0.434,0.1770,1,-21.180,1,0.0512,0.994,0.021800,0.2120,0.4570,130.418,5
3,08FmqUhxtyLTn6pAh6bk45,El Prisionero - Remasterizado,0,176907,0,['Ignacio Corsini'],['5LiOoJbxVSAMkBS2fUm3X2'],1922-03-21,0.321,0.0946,7,-27.961,1,0.0504,0.995,0.918000,0.1040,0.3970,169.980,3
4,08y9GfoqCWfOGsKdwojr5e,Lady of the Evening,0,163080,0,['Dick Haymes'],['3BiJGZsyX9sJchTqcSA7Su'],1922,0.402,0.1580,3,-16.900,0,0.0390,0.989,0.130000,0.3110,0.1960,103.220,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
586667,5rgu12WBIHQtvej2MdHSH0,云与海,50,258267,0,['阿YueYue'],['1QLBXKM5GCpyQQSVMNZqrZ'],2020-09-26,0.560,0.5180,0,-7.471,0,0.0292,0.785,0.000000,0.0648,0.2110,131.896,4
586668,0NuWgxEp51CutD2pJoF4OM,blind,72,153293,0,['ROLE MODEL'],['1dy5WNgIKQU6ezkpZs4y8z'],2020-10-21,0.765,0.6630,0,-5.223,1,0.0652,0.141,0.000297,0.0924,0.6860,150.091,4
586669,27Y1N4Q4U3EfDU5Ubw8ws2,What They'll Say About Us,70,187601,0,['FINNEAS'],['37M5pPGs6V1fchFJSgCguX'],2020-09-02,0.535,0.3140,7,-12.823,0,0.0408,0.895,0.000150,0.0874,0.0663,145.095,4
586670,45XJsGpFTyzbzeWK8VzR8S,A Day At A Time,58,142003,0,"['Gentle Bones', 'Clara Benin']","['4jGPdu95icCKVF31CcFKbS', '5ebPSE9YI5aLeZ1Z2g...",2021-03-05,0.696,0.6150,10,-6.212,1,0.0345,0.206,0.000003,0.3050,0.4380,90.029,4


In [12]:
track_dataset_2 = get_song_features(song_ids=all_charts.song_id.unique().tolist())

getting from 0 to 100
getting from 100 to 200
getting from 200 to 300
getting from 300 to 400
getting from 400 to 500
getting from 500 to 600
getting from 600 to 700
getting from 700 to 800
getting from 800 to 900
getting from 900 to 1000
getting from 1000 to 1100
getting from 1100 to 1200
getting from 1200 to 1300
getting from 1300 to 1400
getting from 1400 to 1500
getting from 1500 to 1600
getting from 1600 to 1700
getting from 1700 to 1800
getting from 1800 to 1900
getting from 1900 to 2000
getting from 2000 to 2100
getting from 2100 to 2200
getting from 2200 to 2300
getting from 2300 to 2400
getting from 2400 to 2500
getting from 2500 to 2600
getting from 2600 to 2700
getting from 2700 to 2800
getting from 2800 to 2900
getting from 2900 to 3000
getting from 3000 to 3100
getting from 3100 to 3200
getting from 3200 to 3300
getting from 3300 to 3400
getting from 3400 to 3500
getting from 3500 to 3600
getting from 3600 to 3700
getting from 3700 to 3800
getting from 3800 to 3900
getting

In [None]:
track_dataset_2 = pd.read_csv(dir_location + r"\DataUsed\genres_v2.csv")
holdout_dataset = pd.read_csv(dir_location + r"\KaggleDataset\genres_v2.csv")

In [28]:

holdout_dataset =  get_song_features(song_ids=holdout_ids.id.unique().tolist())
holdout_dataset

getting from 0 to 100
getting from 100 to 200
getting from 200 to 300
getting from 300 to 400
getting from 400 to 500
getting from 500 to 600
getting from 600 to 700
getting from 700 to 800
getting from 800 to 900
getting from 900 to 1000
getting from 1000 to 1100


Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,id,duration_ms,time_signature
0,0.802,0.832,11,-4.107,1,0.0434,0.31100,0.000000,0.0815,0.8900,124.997,7x9aauaA9cu6tyfpHnqDLo,185427,4
1,0.864,0.556,2,-7.683,0,0.1940,0.25500,0.000004,0.1120,0.7260,99.974,56y1jOTK0XSvJzVv9vHQBK,230480,4
2,0.750,0.733,6,-3.180,0,0.0319,0.25600,0.000000,0.1140,0.8440,111.018,3rUGC1vUpkDG9CZFHMur1t,131872,1
3,0.853,0.824,1,-3.287,1,0.1030,0.03220,0.000000,0.0859,0.8880,108.044,01qFKNWq73UfEslI0GvumE,201812,4
4,0.663,0.670,8,-8.399,1,0.2710,0.04640,0.000089,0.2050,0.1380,136.952,2YSzYUF3jWqb9YP9VXmpjE,260111,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1011,0.538,0.783,11,-2.565,0,0.2220,0.01130,0.000000,0.2810,0.5530,75.388,3fDTzkvrOo5xQIO480Qmsb,259080,4
1012,0.485,0.545,11,-7.924,1,0.0336,0.06510,0.005470,0.0642,0.0385,150.185,4Ls53fBNVfaXTROBi6X8Hw,123891,4
1013,0.600,0.600,2,-7.715,0,0.1150,0.00388,0.000000,0.1220,0.5950,90.435,5PLqXnvHH7Gh6CcfiUEr7e,194920,3
1014,0.750,0.830,0,-3.544,1,0.0683,0.11500,0.000000,0.7650,0.6880,104.937,0hDE81j4N2DPLbEY4tiCDs,182857,4


In [27]:

subset_tracks1 = track_dataset_1[track_dataset_2.columns.tolist()]
# Concatenating the two DataFrames
merged_Dataframe = pd.concat([track_dataset_2, subset_tracks1])

# Removing duplicates from the concatenated DataFrame
merged_Dataframe = merged_Dataframe.drop_duplicates()
merged_Dataframe

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,id,duration_ms,time_signature
0,0.681,0.594,7,-7.028,1,0.2820,0.1650,0.000003,0.1340,0.5350,186.054,5aAx2yezTd8zXrkmtKl66Z,230453,4
1,0.720,0.763,9,-4.068,0,0.0523,0.4060,0.000000,0.1800,0.7420,101.965,5knuzwU65gJK7IF5yJsuaW,251088,4
2,0.748,0.524,8,-5.599,1,0.0338,0.4140,0.000000,0.1110,0.6610,95.010,7BKLCZ1jbUBVqRi2FVlTVw,244960,4
3,0.735,0.451,0,-8.374,1,0.0585,0.0631,0.000013,0.3250,0.0862,117.973,3NdDpSvN911VPGivFlV5d0,245200,4
4,0.670,0.838,0,-4.031,1,0.0362,0.0604,0.000611,0.1590,0.7170,104.998,78rIJddV4X0HkNAInEcYde,222041,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
586667,0.560,0.518,0,-7.471,0,0.0292,0.7850,0.000000,0.0648,0.2110,131.896,5rgu12WBIHQtvej2MdHSH0,258267,4
586668,0.765,0.663,0,-5.223,1,0.0652,0.1410,0.000297,0.0924,0.6860,150.091,0NuWgxEp51CutD2pJoF4OM,153293,4
586669,0.535,0.314,7,-12.823,0,0.0408,0.8950,0.000150,0.0874,0.0663,145.095,27Y1N4Q4U3EfDU5Ubw8ws2,187601,4
586670,0.696,0.615,10,-6.212,1,0.0345,0.2060,0.000003,0.3050,0.4380,90.029,45XJsGpFTyzbzeWK8VzR8S,142003,4


In [32]:
# Subselecting DataFrame where 'id' is not in ids_to_exclude
train_validation_dataset = merged_Dataframe[~merged_Dataframe['id'].isin(holdout_dataset.id.tolist())]

In [33]:
train_validation_dataset

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,id,duration_ms,time_signature
0,0.681,0.594,7,-7.028,1,0.2820,0.1650,0.000003,0.1340,0.5350,186.054,5aAx2yezTd8zXrkmtKl66Z,230453,4
1,0.720,0.763,9,-4.068,0,0.0523,0.4060,0.000000,0.1800,0.7420,101.965,5knuzwU65gJK7IF5yJsuaW,251088,4
2,0.748,0.524,8,-5.599,1,0.0338,0.4140,0.000000,0.1110,0.6610,95.010,7BKLCZ1jbUBVqRi2FVlTVw,244960,4
3,0.735,0.451,0,-8.374,1,0.0585,0.0631,0.000013,0.3250,0.0862,117.973,3NdDpSvN911VPGivFlV5d0,245200,4
4,0.670,0.838,0,-4.031,1,0.0362,0.0604,0.000611,0.1590,0.7170,104.998,78rIJddV4X0HkNAInEcYde,222041,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
586667,0.560,0.518,0,-7.471,0,0.0292,0.7850,0.000000,0.0648,0.2110,131.896,5rgu12WBIHQtvej2MdHSH0,258267,4
586668,0.765,0.663,0,-5.223,1,0.0652,0.1410,0.000297,0.0924,0.6860,150.091,0NuWgxEp51CutD2pJoF4OM,153293,4
586669,0.535,0.314,7,-12.823,0,0.0408,0.8950,0.000150,0.0874,0.0663,145.095,27Y1N4Q4U3EfDU5Ubw8ws2,187601,4
586670,0.696,0.615,10,-6.212,1,0.0345,0.2060,0.000003,0.3050,0.4380,90.029,45XJsGpFTyzbzeWK8VzR8S,142003,4


### Generalized Linear Model
As the response variable ("valence") takes values in the range [0, 1], a simple Linear Regression model is not well suited to it, so I will use a Generalized Linear Model.
More precisely, I will fit a GLM with family = "Binomial" and the "logit" link function, which keeps the fitted values inside the (0, 1) range.
Strictly speaking, this is a fractional logit (quasi-binomial) model rather than a Beta Regression, although both are used for response variables that are proportions.

In [76]:
def fit_glm_model(fields_to_use, target_field, dataset):

    formula_string = target_field + ' ~ ' + ' + '.join(fields_to_use)

    y = dataset[target_field]
    x = dataset[fields_to_use]
    data_reg = pd.concat([y, x], axis=1)

    # Binomial family with a logit link on a continuous response in [0, 1]: a fractional logit model,
    # used here in place of a "Beta Regression".

    fit = smf.glm(formula=formula_string, 
                data=data_reg, 
                family=sm.families.Binomial(link=sm.families.links.Logit())).fit()

    return fit

In [72]:
fields_to_use = ["duration_ms", "danceability", "energy", "loudness", "speechiness", "acousticness", "instrumentalness", "liveness", "tempo", "time_signature"]
target_field = 'valence'

fit_1 = fit_glm_model(fields_to_use=fields_to_use, target_field=target_field, dataset=train_validation_dataset)



### Checking for Multicollinearity
Now, I will check for multicollinearity between the independent variables (features). To do so, I construct a function named "compute_vif" that takes as inputs the independent variables used in the model and the dataset. Variables with a VIF score >= 10 will be removed, and the model will be re-fitted with the remaining variables.
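
As a reminder of what the VIF measures: for feature j it equals 1 / (1 - R²_j), where R²_j comes from regressing feature j on all the other features. The sketch below computes this directly with NumPy on synthetic data; the function name `vif_from_r2` and the data are assumptions for illustration only.

```python
import numpy as np


def vif_from_r2(X):
    """VIF of each column: 1 / (1 - R^2) from regressing it on the other columns."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        target = X[:, j]
        # Regress column j on an intercept plus all remaining columns
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1 - resid.var() / target.var()
        vifs.append(1 / (1 - r2))
    return np.array(vifs)


rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)  # nearly a copy of a
vifs = vif_from_r2(np.column_stack([a, b, c]))
print(vifs)  # a and c get large VIFs, b stays close to 1
```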

In [41]:
def compute_vif(considered_features, data):
    
    '''This function takes as input all the independent variables used in GLM's fitting,
    and calculates the variance inflation factor of each one of them.'''
    
    X = data[considered_features].copy()
    # calculation of variance inflation requires a constant
    X['intercept'] = 1
    
    # creating a dataframe to store vif values
    vif = pd.DataFrame()
    vif["Variable"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    vif = vif[vif['Variable']!='intercept']
    return vif

In [73]:
compute_vif(considered_features=fields_to_use, data=train_validation_dataset)



Unnamed: 0,Variable,VIF
0,duration_ms,1.0394
1,danceability,1.270421
2,energy,3.948349
3,loudness,2.871593
4,speechiness,1.235017
5,acousticness,2.162843
6,instrumentalness,1.212353
7,liveness,1.124396
8,tempo,1.076496
9,time_signature,1.070469


In [75]:
fit_1.summary2() 

0,1,2,3
Model:,GLM,AIC:,543001.6872
Link Function:,logit,BIC:,-7815189.1991
Dependent Variable:,valence,Log-Likelihood:,-271490.0
Date:,2024-01-06 13:05,LL-Null:,-305020.0
No. Observations:,596120,Deviance:,111990.0
Df Model:,10,Pearson chi2:,108000.0
Df Residuals:,596109,Scale:,1.0
Method:,IRLS,,

0,1,2,3,4,5,6
,Coef.,Std.Err.,z,P>|z|,[0.025,0.975]
Intercept,-4.0310,0.0369,-109.3760,0.0000,-4.1032,-3.9587
duration_ms,-0.0000,0.0000,-42.2124,0.0000,-0.0000,-0.0000
danceability,3.4303,0.0195,175.8501,0.0000,3.3921,3.4686
energy,2.5243,0.0224,112.8994,0.0000,2.4804,2.5681
loudness,-0.0438,0.0009,-47.0799,0.0000,-0.0456,-0.0420
speechiness,-0.6190,0.0168,-36.7428,0.0000,-0.6520,-0.5860
acousticness,0.9095,0.0119,76.5548,0.0000,0.8862,0.9328
instrumentalness,-0.3095,0.0115,-26.8443,0.0000,-0.3321,-0.2869
liveness,0.0699,0.0159,4.3978,0.0000,0.0387,0.1010


In [46]:
# 1st Goodness of Fit test: Deviance to Df residuals ratio (values from the fit_1 summary above)
round(111990/596109, 3)

0.188

### Feature Selection
To select a smaller set of variables, I will use the SelectKBest method from sklearn with "f_regression" as the scoring function. The output is a dataframe listing every feature with its "f_regression" score. The higher the score, the stronger the (univariate, linear) association between the feature and valence.
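
To my understanding, the "f_regression" score of a single feature is derived from its Pearson correlation r with the target, as F = r² / (1 - r²) · (n - 2). The sketch below reproduces that calculation with NumPy on synthetic data; the function name and the data are assumptions for illustration, and it is not a replacement for the sklearn implementation.

```python
import numpy as np


def univariate_f_score(x, y):
    """F statistic of a simple linear regression of y on x, computed from the correlation."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r = np.corrcoef(x, y)[0, 1]
    dof = len(x) - 2
    return r**2 / (1 - r**2) * dof


rng = np.random.default_rng(0)
target = rng.normal(size=1000)
strong = target + 0.5 * rng.normal(size=1000)  # strongly related to the target
weak = rng.normal(size=1000)                   # unrelated to the target

f_strong = univariate_f_score(strong, target)
f_weak = univariate_f_score(weak, target)
print(f_strong, f_weak)
```

Because the score only looks at one feature at a time, it says nothing about non-linear effects or about features that matter only in combination with others.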

In [78]:
y = train_validation_dataset[target_field]
x = train_validation_dataset[fields_to_use]

names = pd.DataFrame(x.columns)

model = SelectKBest(score_func=f_regression, k=10)
results = model.fit(x, y)
results_df=pd.DataFrame(results.scores_)

scored=pd.concat([names,results_df], axis=1)
scored.columns = ["Feature", "Score"]
scored = scored.sort_values(by=['Score'], ascending=False)
display(scored)

Unnamed: 0,Feature,Score
1,danceability,219988.905289
2,energy,94283.230161
3,loudness,47291.915953
5,acousticness,19231.960338
6,instrumentalness,18340.133485
0,duration_ms,16026.400732
8,tempo,10814.840457
9,time_signature,6516.532614
4,speechiness,1183.783243
7,liveness,1.159948


In [85]:
scored[:5].Feature.tolist()

['danceability', 'energy', 'loudness', 'acousticness', 'instrumentalness']

### Fitting a new model
Now, I choose the five highest-scoring variables and fit the model with them.

In [82]:
import random

top_features = scored[:5].Feature.tolist()
random.shuffle(top_features)  # shuffles the named list in place

In [86]:
fit_2 = fit_glm_model(fields_to_use=scored[:5].Feature.tolist(), target_field=target_field, dataset=train_validation_dataset)



In [88]:
fit_2.summary2()

0,1,2,3
Model:,GLM,AIC:,548236.1707
Link Function:,logit,BIC:,-7810011.2065
Dependent Variable:,valence,Log-Likelihood:,-274110.0
Date:,2024-01-06 13:09,LL-Null:,-305020.0
No. Observations:,596120,Deviance:,117230.0
Df Model:,5,Pearson chi2:,111000.0
Df Residuals:,596114,Scale:,1.0
Method:,IRLS,,

0,1,2,3,4,5,6
,Coef.,Std.Err.,z,P>|z|,[0.025,0.975]
Intercept,-3.6806,0.0247,-149.0409,0.0000,-3.7291,-3.6322
danceability,3.2456,0.0181,179.1311,0.0000,3.2101,3.2812
energy,2.4903,0.0215,115.9524,0.0000,2.4482,2.5324
loudness,-0.0339,0.0009,-38.1102,0.0000,-0.0356,-0.0321
acousticness,0.8898,0.0116,76.5502,0.0000,0.8670,0.9125
instrumentalness,-0.2739,0.0113,-24.2927,0.0000,-0.2960,-0.2518
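
Because the model uses a logit link, the coefficients act on the log-odds scale. A prediction is obtained by applying the inverse logit to the linear predictor. The sketch below does this by hand with the coefficients from the fit_2 summary above and the feature values of the first row of train_validation_dataset; the helper name `predict_valence` is an assumption for the example.

```python
import math

# Coefficients copied from the fit_2 summary above
intercept = -3.6806
coefficients = {
    "danceability": 3.2456,
    "energy": 2.4903,
    "loudness": -0.0339,
    "acousticness": 0.8898,
    "instrumentalness": -0.2739,
}


def predict_valence(track):
    """Inverse logit of the linear predictor for one track."""
    eta = intercept + sum(coefficients[name] * track[name] for name in coefficients)
    return 1 / (1 + math.exp(-eta))


# Feature values of the first row of train_validation_dataset (observed valence: 0.535)
track = {
    "danceability": 0.681,
    "energy": 0.594,
    "loudness": -7.028,
    "acousticness": 0.1650,
    "instrumentalness": 0.000003,
}
predicted = predict_valence(track)
print(round(predicted, 3))
```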


## Predicting Valence


In [92]:
# selecting the prediction target
y = train_validation_dataset.valence

# Choosing the features

X = train_validation_dataset[fields_to_use]

# Splitting data into training data and validation data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 1, test_size = 0.25)

In [93]:
train_validation_dataset

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,id,duration_ms,time_signature
0,0.681,0.594,7,-7.028,1,0.2820,0.1650,0.000003,0.1340,0.5350,186.054,5aAx2yezTd8zXrkmtKl66Z,230453,4
1,0.720,0.763,9,-4.068,0,0.0523,0.4060,0.000000,0.1800,0.7420,101.965,5knuzwU65gJK7IF5yJsuaW,251088,4
2,0.748,0.524,8,-5.599,1,0.0338,0.4140,0.000000,0.1110,0.6610,95.010,7BKLCZ1jbUBVqRi2FVlTVw,244960,4
3,0.735,0.451,0,-8.374,1,0.0585,0.0631,0.000013,0.3250,0.0862,117.973,3NdDpSvN911VPGivFlV5d0,245200,4
4,0.670,0.838,0,-4.031,1,0.0362,0.0604,0.000611,0.1590,0.7170,104.998,78rIJddV4X0HkNAInEcYde,222041,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
586667,0.560,0.518,0,-7.471,0,0.0292,0.7850,0.000000,0.0648,0.2110,131.896,5rgu12WBIHQtvej2MdHSH0,258267,4
586668,0.765,0.663,0,-5.223,1,0.0652,0.1410,0.000297,0.0924,0.6860,150.091,0NuWgxEp51CutD2pJoF4OM,153293,4
586669,0.535,0.314,7,-12.823,0,0.0408,0.8950,0.000150,0.0874,0.0663,145.095,27Y1N4Q4U3EfDU5Ubw8ws2,187601,4
586670,0.696,0.615,10,-6.212,1,0.0345,0.2060,0.000003,0.3050,0.4380,90.029,45XJsGpFTyzbzeWK8VzR8S,142003,4


### K-Nearest Neighbors
The first ML algorithm is K-Nearest Neighbors. After manually comparing this method's MAE with and without scaling, I chose to scale the data, as this gave a better out-of-sample MAE. The scaling range is (0, 1).
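
One way to combine scaling and the search over K is a scikit-learn Pipeline inside GridSearchCV, so that the scaler is fitted only on the training folds and the search optimises MAE directly. The sketch below runs on small synthetic data; the data, the grid of K values and the variable names are assumptions for illustration and do not reproduce the results of this notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
# Three synthetic features on very different scales
X = rng.uniform(size=(600, 3)) * [1.0, 100.0, 10.0]
y = np.clip(0.6 * X[:, 0] + 0.003 * X[:, 1] + rng.normal(0, 0.05, 600), 0, 1)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=1)

# The scaler is part of the pipeline, so it is re-fitted on each training fold only
pipe = Pipeline([("scale", MinMaxScaler()), ("knn", KNeighborsRegressor())])
search = GridSearchCV(
    pipe,
    {"knn__n_neighbors": [5, 15, 25]},
    scoring="neg_mean_absolute_error",
    cv=3,
)
search.fit(X_train, y_train)

val_mae = mean_absolute_error(y_val, search.predict(X_val))
print(search.best_params_, round(val_mae, 4))
```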

In [94]:
scaler = MinMaxScaler(feature_range=(0, 1))

x_train_scaled = scaler.fit_transform(train_X)
x_train_sc = pd.DataFrame(x_train_scaled)

x_test_scaled = scaler.transform(val_X)  # reuse the scaling fitted on the training data
x_test_sc = pd.DataFrame(x_test_scaled)

In [95]:
mae_val = [] 
start_time = time.time()
for K in range(25, 31):
    model = neighbors.KNeighborsRegressor(n_neighbors = K)

    model.fit(x_train_sc, train_y)  # fit the model
    pred = model.predict(x_test_sc)  # make predictions on the validation set
    error = round(mean_absolute_error(val_y, pred), 6)
    mae_val.append(error)
    print('MAE value for K=' , K , 'is:', error)
duration_knn = round((time.time() - start_time)/60, 2) 

KeyboardInterrupt: 

### XGBoost
The second model I will fit and use for prediction is XGBoost. I will use GridSearchCV for some basic hyperparameter tuning. This model does not require any feature scaling, as it is tree-based.

In [101]:
from xgboost import XGBRegressor

spotify_xgboost_model = XGBRegressor() 

parameters = {
              'eval_metric':['mae'],
              'gamma':  [0, 0.2 ], 
              'max_depth': [6, 7, 8],
              }

xgb_grid = GridSearchCV(spotify_xgboost_model,
                        parameters,
                        cv = 2,
                        verbose=False,
                        n_jobs=-1   
                        )

start_time = time.time()
xgb_grid.fit(train_X, train_y)
duration_xgb = round((time.time() - start_time)/60, 2)

print(f"XGBOOST HyperParameter Tuning  {duration_xgb} minutes ---")
print("Best Parameters: ", xgb_grid.best_params_)
predictions_xgb = xgb_grid.predict(val_X)
print("Mean Absolute Error: " + str(round(mean_absolute_error(val_y, predictions_xgb), 3)))

XGBOOST HyperParameter Tuning  3.16 minutes ---
Best Parameters:  {'eval_metric': 'mae', 'gamma': 0, 'max_depth': 8}
Mean Absolute Error: 0.134
