# Danceability

**Exploration and linear regression analysis of a Spotify dataset.**
*Later, we will turn it into a binary classification problem.*

This dataset was generated using the Spotify API. It contains a subset of the songs
available on the platform, along with characteristics of each song such as tempo, loudness, time signature, and genre.
Reference: https://developer.spotify.com/documentation/web-api/reference/#/operations/get-audio-analysis.


In [1]:
import pandas as pd
import numpy as np

In [2]:
spotify_data = pd.read_csv("spotify_data.csv")

## Data Exploration

Before fitting any machine learning model, it is important to explore and prepare the training set carefully,
looking at characteristics such as null-value counts and data types.

**Summary of data types and null values**

In [3]:
spotify_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42305 entries, 0 to 42304
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   danceability      42305 non-null  float64
 1   energy            42305 non-null  float64
 2   key               42305 non-null  int64  
 3   loudness          42305 non-null  float64
 4   speechiness       42305 non-null  float64
 5   acousticness      42305 non-null  float64
 6   instrumentalness  42305 non-null  float64
 7   liveness          42305 non-null  float64
 8   valence           42305 non-null  float64
 9   tempo             42305 non-null  float64
 10  type              42305 non-null  object 
 11  id                42305 non-null  object 
 12  duration_ms       42305 non-null  int64  
 13  time_signature    42305 non-null  int64  
 14  genre             42305 non-null  object 
dtypes: float64(9), int64(3), object(3)
memory usage: 4.8+ MB
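
The same two checks (null counts and data types) can also be read off programmatically. The following is a minimal sketch on a small made-up frame, not the Spotify data; the name `toy` and its values are hypothetical and only illustrate the calls.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for spotify_data (values are made up).
toy = pd.DataFrame({
    "danceability": [0.65, 0.71, np.nan],
    "tempo": [120.0, 98.5, 140.2],
    "genre": ["trap", "techno", None],
})

# Number of missing values in each column.
null_counts = toy.isna().sum()

# Names of the columns that select_dtypes would keep as numeric.
numeric_cols = toy.select_dtypes(include=np.number).columns.tolist()

print(null_counts)
print(numeric_cols)
```

On the real data, `spotify_data.isna().sum()` would be expected to show zero for every column, since `info()` reports 42305 non-null values everywhere.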


**Removing non-numeric columns**

In [4]:
spotify_data_for_ml = spotify_data.select_dtypes(include=np.number)

# Equivalent alternative: spotify_data_for_ml = spotify_data.drop(['type', 'id', 'genre'], axis=1)
# spotify_data_for_ml.info()

**Get summary statistics for the numeric columns**

In [5]:
spotify_data_for_ml.describe()

| | danceability | energy | key | loudness | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms | time_signature |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 42305.0 | 42305.0 | 42305.0 | 42305.0 | 42305.0 | 42305.0 | 42305.0 | 42305.0 | 42305.0 | 42305.0 | 42305.0 | 42305.0 |
| mean | 0.639364 | 0.762516 | 5.37024 | -6.465442 | 0.136561 | 0.09616 | 0.283048 | 0.214079 | 0.357101 | 147.474056 | 250865.846685 | 3.97258 |
| std | 0.156617 | 0.183823 | 3.666145 | 2.941165 | 0.126168 | 0.170827 | 0.370791 | 0.175576 | 0.2332 | 23.844623 | 102957.713571 | 0.268342 |
| min | 0.0651 | 0.000243 | 0.0 | -33.357 | 0.0227 | 1e-06 | 0.0 | 0.0107 | 0.0187 | 57.967 | 25600.0 | 1.0 |
| 25% | 0.524 | 0.632 | 1.0 | -8.161 | 0.0491 | 0.00173 | 0.0 | 0.0996 | 0.161 | 129.931 | 179840.0 | 4.0 |
| 50% | 0.646 | 0.803 | 6.0 | -6.234 | 0.0755 | 0.0164 | 0.00594 | 0.135 | 0.322 | 144.973 | 224760.0 | 4.0 |
| 75% | 0.766 | 0.923 | 9.0 | -4.513 | 0.193 | 0.107 | 0.722 | 0.294 | 0.522 | 161.464 | 301133.0 | 4.0 |
| max | 0.988 | 1.0 | 11.0 | 3.148 | 0.946 | 0.988 | 0.989 | 0.988 | 0.988 | 220.29 | 913052.0 | 5.0 |


**Create X (a predictor matrix) and y (the outcome to be predicted)**

We use "danceability" as the outcome to predict; this will be "y".
The remaining numeric features will be "X" and will be used to predict "danceability".

In [8]:
X = spotify_data_for_ml.iloc[:, 1:12]
y = spotify_data_for_ml.iloc[:, 0]
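
Position-based slicing with `iloc` relies on "danceability" being the first column. As a hedged alternative, the split can be written by column name, which keeps working if the column order changes. The sketch below uses a made-up three-column frame; `toy`, `X_toy`, and `y_toy` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy frame with the target in the first column (values are made up).
toy = pd.DataFrame({
    "danceability": [0.65, 0.71, 0.52],
    "energy": [0.80, 0.63, 0.91],
    "tempo": [120.0, 98.5, 140.2],
})

target = "danceability"

# y is the target column; X is every other column, selected by name.
y_toy = toy[target]
X_toy = toy.drop(columns=[target])

print(list(X_toy.columns))
print(y_toy.name)
```

Applied to the notebook's data, the equivalent would be `spotify_data_for_ml.drop(columns=["danceability"])` for "X" and `spotify_data_for_ml["danceability"]` for "y".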

## Preparation of Data for ML 

**Splitting the data into train and test sets (random_state = 42, test_size = .20)**

In [6]:
from sklearn.model_selection import train_test_split

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.20, random_state = 42)


**Scaling the training and testing data**

In [10]:
from sklearn.preprocessing import StandardScaler
s = StandardScaler()
X_train_scaled = pd.DataFrame(s.fit_transform(X_train), columns = X_train.columns)
X_train_scaled

| | energy | key | loudness | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms | time_signature |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.392062 | -1.197591 | -0.319946 | 1.839910 | -0.491316 | -0.767334 | -0.715319 | 1.317208 | 1.781719 | -0.957013 | 0.100091 |
| 1 | 1.085896 | -0.924416 | 0.217197 | -0.051699 | -0.549142 | -0.331527 | 2.360748 | -0.998500 | -0.394293 | -0.311599 | 0.100091 |
| 2 | -0.386609 | -1.470766 | 0.306550 | 1.475835 | 2.263485 | -0.767334 | -0.257060 | 1.248721 | 1.106304 | 0.174947 | -3.630617 |
| 3 | 0.698682 | 0.441460 | 1.230888 | -0.787765 | -0.456749 | 1.425152 | -0.171882 | -1.378173 | 1.157778 | 0.663642 | 0.100091 |
| 4 | 0.616876 | -0.378065 | 1.291826 | -0.281225 | -0.358321 | -0.767334 | -0.506916 | -0.459167 | 2.293609 | -0.279109 | 0.100091 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 33839 | -0.964703 | 0.168285 | -0.227855 | 0.241144 | 0.611308 | -0.767334 | 0.219937 | 1.685324 | -0.389175 | -1.607209 | 0.100091 |
| 33840 | -0.201182 | -1.470766 | 0.596518 | -0.564571 | -0.222983 | -0.767334 | -0.924857 | 2.263181 | -0.315299 | -1.269999 | 0.100091 |
| 33841 | 0.409635 | 0.714635 | -0.100502 | -0.763230 | -0.559565 | 1.266432 | -0.835704 | -0.882929 | 1.115323 | 0.958771 | 0.100091 |
| 33842 | -0.942888 | -1.470766 | 0.520174 | 1.760764 | 0.699189 | -0.767334 | -0.450131 | -1.357627 | 2.625192 | -1.290672 | 0.100091 |


In [11]:
X_test_scaled = pd.DataFrame(s.transform(X_test), columns = X_test.columns)
X_test_scaled

| | energy | key | loudness | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms | time_signature |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.158764 | -0.924416 | 0.842666 | 0.035362 | -0.179628 | -0.767334 | 0.486829 | 1.749530 | 2.474838 | 0.247108 | 0.100091 |
| 1 | -0.784730 | -1.197591 | -0.270306 | 0.486499 | 1.138598 | -0.766336 | -0.663076 | -0.026844 | 1.873593 | -0.998678 | 0.100091 |
| 2 | 0.387820 | -0.651240 | 1.054921 | 0.043277 | 0.968694 | -0.767334 | -0.695444 | 0.683705 | -0.731330 | -0.558811 | 0.100091 |
| 3 | 1.205878 | -1.197591 | 0.561598 | -0.597813 | -0.555036 | 0.612722 | 2.167678 | 1.706726 | -0.857058 | -0.456715 | 0.100091 |
| 4 | 0.191486 | 0.168285 | 0.611239 | -0.809926 | -0.483699 | -0.767125 | -0.586416 | 2.027757 | -0.562769 | -0.557794 | 3.830799 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8456 | 1.096804 | -1.470766 | 0.046023 | -0.729988 | -0.558645 | 1.228770 | 1.287503 | -1.173997 | -0.100760 | 1.925262 | 0.100091 |
| 8457 | 0.856840 | -1.197591 | 1.269915 | -0.550325 | -0.538186 | -0.097483 | -0.692037 | -1.101230 | 0.107528 | -0.542764 | 0.100091 |
| 8458 | 0.687775 | 0.987810 | 0.644789 | -0.831296 | -0.533733 | 1.236841 | -0.580737 | 1.830858 | -0.480125 | -0.108113 | 0.100091 |
| 8459 | -1.766400 | -1.470766 | -1.659552 | 2.021948 | 3.786769 | -0.767334 | -0.699986 | 0.050203 | 1.499975 | -0.195634 | 0.100091 |
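
In the cells above, the scaler is fitted on the training set only and then applied to the test set, so test-set statistics stay out of preprocessing. As a hedged illustration, scikit-learn's `make_pipeline` can bundle the scaling step and a model so that this ordering is applied automatically. The sketch below runs on synthetic data rather than the Spotify data, and all variable names in it are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (assumption: stands in for real data).
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 3)) * [1.0, 50.0, 1000.0]
y_syn = X_syn @ [0.5, 0.01, 0.0002] + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

# Manual route: fit the scaler on the training data, reuse it on the test data.
scaler = StandardScaler().fit(X_tr)
manual_model = LinearRegression().fit(scaler.transform(X_tr), y_tr)
manual_r2 = manual_model.score(scaler.transform(X_te), y_te)

# Pipeline route: the same two steps, fitted together on the training data.
pipe = make_pipeline(StandardScaler(), LinearRegression()).fit(X_tr, y_tr)
pipe_r2 = pipe.score(X_te, y_te)

print(manual_r2, pipe_r2)
```

Both routes should give the same test R2, because the pipeline performs the same fit-on-train, transform-on-test sequence internally.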


## Tuning and Evaluation of the Model 

**Fitting a linear regression model and evaluating it on the test set**

In [12]:
# Fit a linear regression on the scaled training data and report R2 on the test set
from sklearn.linear_model import LinearRegression
linear_model = LinearRegression()
linear_model.fit(X_train_scaled, y_train)
linear_model.score(X_test_scaled, y_test)

# R2 = 0.3433: the features in X explain about 34.3% of the variance in danceability (y) on the test set

0.3432842565997639

**Importing the KNeighborsRegressor**

In [13]:
from sklearn.neighbors import KNeighborsRegressor

**Regression model performance metrics (e.g. MSE and R2)**

In [16]:
# Evaluate the linear model's test-set predictions with different metrics
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
y_pred_linear = linear_model.predict(X_test_scaled)
results = {'MSE': mean_squared_error(y_test, y_pred_linear),
           'R2': r2_score(y_test, y_pred_linear)}
results

{'MSE': 0.0163553196339477, 'R2': 0.3432842565997639}
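
For reference, the three regression metrics imported in this notebook (MSE, MAE, and R2) can be computed by hand from the residuals. The sketch below uses four made-up values, not results from the Spotify model, and checks the hand-computed numbers against scikit-learn; the array names are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up true values and predictions, for illustration only.
y_true = np.array([0.50, 0.70, 0.60, 0.80])
y_hat = np.array([0.55, 0.65, 0.60, 0.70])

residuals = y_true - y_hat

# MSE: mean of squared residuals. MAE: mean of absolute residuals.
mse = np.mean(residuals ** 2)
mae = np.mean(np.abs(residuals))

# R2: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# The hand-computed values agree with scikit-learn's implementations.
assert np.isclose(mse, mean_squared_error(y_true, y_hat))
assert np.isclose(mae, mean_absolute_error(y_true, y_hat))
assert np.isclose(r2, r2_score(y_true, y_hat))
```

Because MSE squares the residuals and MAE does not, the two are on different scales and are not interchangeable, which is why each value should be reported under its own name.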

**Improving the model with a KNN regressor**

In [17]:

results = {}
for i in np.linspace(3, 25, 5, dtype = int):
    knn_model = KNeighborsRegressor(n_neighbors=i)
    knn_model.fit(X_train_scaled, y_train)
    y_pred = knn_model.predict(X_test_scaled)
    
    results[i] = [mean_squared_error(y_test,y_pred),
                  mean_absolute_error(y_test, y_pred), 
                  r2_score(y_test, y_pred),
                  knn_model.score(X_test_scaled,y_test)
                 ]
    
pd.DataFrame.from_dict(results, orient = "index", columns = ["MSE", "MAE", "R2_from_r2_score", "R2_from_.score"])

| n_neighbors | MSE | MAE | R2_from_r2_score | R2_from_.score |
|---|---|---|---|---|
| 3 | 0.013486 | 0.08608 | 0.458502 | 0.458502 |
| 8 | 0.012839 | 0.086317 | 0.484465 | 0.484465 |
| 14 | 0.012587 | 0.086134 | 0.494588 | 0.494588 |
| 19 | 0.012563 | 0.085956 | 0.495563 | 0.495563 |
| 25 | 0.01259 | 0.08628 | 0.494458 | 0.494458 |


In [18]:
from sklearn.metrics import mean_squared_error
results = {}
for i in np.linspace(3, 25, 5, dtype = int):
    knn_model = KNeighborsRegressor(n_neighbors=i)
    knn_model.fit(X_train_scaled, y_train)
    y_pred = knn_model.predict(X_test_scaled)
    results[i] = (mean_squared_error(y_test,y_pred),knn_model.score(X_test_scaled,y_test))
pd.DataFrame.from_dict(results, orient = "index", columns = ["MSE","R2"])

| n_neighbors | MSE | R2 |
|---|---|---|
| 3 | 0.013486 | 0.458502 |
| 8 | 0.012839 | 0.484465 |
| 14 | 0.012587 | 0.494588 |
| 19 | 0.012563 | 0.495563 |
| 25 | 0.01259 | 0.494458 |


**Describing the model performance**

**Model Improvement**

Changing the value of the weights parameter from the default, "uniform," to "distance."

**Notes**: k-NN belongs to a family of algorithms that rely on distance metrics to generate predictions. Data points are placed in a feature space, where observations (here, songs) are near or far relative to each other; this naturally leads to the concept of "neighbors." To make a prediction for a new data point, place it in this space, find its "k" nearest neighbors, and take the average of their danceability scores. That average is the prediction. The n_neighbors parameter therefore sets how many neighbors (songs) go into each prediction. The weights parameter controls how much each neighbor contributes: equally ("uniform") or in inverse proportion to its distance ("distance").
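
The prediction rule described in the notes can be sketched by hand on a tiny made-up dataset and compared with `KNeighborsRegressor`. This is an illustration under simplifying assumptions (one feature, Euclidean distance, no query point that coincides exactly with a training point), not the library's internal implementation; all names and values are hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up training "songs" with one feature and a danceability-like target.
X_toy = np.array([[0.0], [1.0], [2.0], [4.0], [7.0]])
y_toy = np.array([0.20, 0.40, 0.50, 0.80, 0.90])
x_new = np.array([[1.5]])  # new point to predict for
k = 3

# Distance from the new point to every training point, then the k closest.
dists = np.linalg.norm(X_toy - x_new, axis=1)
nearest = np.argsort(dists)[:k]

# weights="uniform": plain average of the k neighbors' targets.
uniform_pred = y_toy[nearest].mean()

# weights="distance": average weighted by 1 / distance (assumes no zero distance).
inv = 1.0 / dists[nearest]
distance_pred = np.sum(inv * y_toy[nearest]) / np.sum(inv)

# Compare with scikit-learn's regressor for both settings.
sk_uniform = KNeighborsRegressor(n_neighbors=k, weights="uniform")
sk_uniform_pred = sk_uniform.fit(X_toy, y_toy).predict(x_new)[0]
sk_distance = KNeighborsRegressor(n_neighbors=k, weights="distance")
sk_distance_pred = sk_distance.fit(X_toy, y_toy).predict(x_new)[0]

print(uniform_pred, sk_uniform_pred)
print(distance_pred, sk_distance_pred)
```

In this toy case the two closest points dominate the distance-weighted prediction, while the third, more distant neighbor pulls the uniform average further toward its own value.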

With linear regression, R2 is 0.343: the features in X explain about 34.3% of the variance in danceability (y) on the test set.
With KNN and the default uniform weights, R2 rises to about 0.50 (0.4956 at best, with 19 neighbors).

In [16]:
#KNN Documentation
help(KNeighborsRegressor)

Help on class KNeighborsRegressor in module sklearn.neighbors._regression:

class KNeighborsRegressor(sklearn.neighbors._base.KNeighborsMixin, sklearn.base.RegressorMixin, sklearn.neighbors._base.NeighborsBase)
 |  KNeighborsRegressor(n_neighbors=5, *, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None, **kwargs)
 |  
 |  Regression based on k-nearest neighbors.
 |  
 |  The target is predicted by local interpolation of the targets
 |  associated of the nearest neighbors in the training set.
 |  
 |  Read more in the :ref:`User Guide <regression>`.
 |  
 |  .. versionadded:: 0.9
 |  
 |  Parameters
 |  ----------
 |  n_neighbors : int, default=5
 |      Number of neighbors to use by default for :meth:`kneighbors` queries.
 |  
 |  weights : {'uniform', 'distance'} or callable, default='uniform'
 |      weight function used in prediction.  Possible values:
 |  
 |      - 'uniform' : uniform weights.  All points in each neighborhoo

**Setting the weights parameter to distance**

In [17]:
results = {}
for i in np.linspace(3, 25, 5, dtype = int):
    knn_model = KNeighborsRegressor(n_neighbors=i, weights='distance')
    knn_model.fit(X_train_scaled, y_train)
    results[i] = knn_model.score(X_test_scaled, y_test)
pd.DataFrame.from_dict(results, orient = "index", columns = ["R2"])

| n_neighbors | R2 |
|---|---|
| 3 | 0.52243 |
| 8 | 0.587074 |
| 14 | 0.606481 |
| 19 | 0.611732 |
| 25 | 0.614999 |


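R2 (the share of variance explained) has improved with the parameter change: from at most 0.496 with uniform weights to 0.615 with distance weights and 25 neighbors.

**Interpretation of the model performance (referring to the documentation)**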

This improvement may be explained by the fact that some songs are similar in ways that have nothing to do with the danceability score. With uniform weights, every one of the "k" neighbors counts equally, so these loosely related songs introduce some noise into the prediction. By weighting more heavily the songs that are closest to the song being predicted, we may indirectly capture the genre and reduce that noise.