## Supervised Learning: Classifying Sunny and Rainy Day Songs

In [1]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import spotipy.util as util

import requests
import pandas as pd
import numpy as np
import json
import os
import dotenv
import sys
sys.tracebacklimit = 0 # turn off the error tracebacks

from colorthief import ColorThief
from urllib.request import urlopen
import io

from mvspotifyhelper.mvspotifyhelper import MV

from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, ConfusionMatrixDisplay

In [2]:
dotenv.load_dotenv()

username = 'michael_vaden'

spot_id = os.getenv('spot_id')
spot_secret = os.getenv('spot_secret')
redirect_uri= 'https://www.virginia.edu/'

client_credentials_manager = SpotifyClientCredentials(client_id=spot_id, client_secret=spot_secret)

scope = "playlist-modify-public playlist-modify-private playlist-read-private playlist-read-collaborative user-library-modify"

token = util.prompt_for_user_token(username, scope, spot_id, spot_secret, redirect_uri, show_dialog=True)

sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager, auth=token)

### Select Sunny and Rainy Themed Playlists

*To begin, and to make sure that this approach holds water*, I selected several publicly available Spotify playlists for each category: four rainy and seven sunny. These playlists all had more than 10k followers and were named appropriately (rainy day, rainy days music, sunny day, summer party, etc.). Many of them are authored by Spotify. Once I had roughly 1,100 total songs and an approximately balanced dataset, I used the helper functions from my own module (mvspotifyhelper) to extract the tracks (songs) and track features from these playlists and compile them into a **training set**.

In [3]:
rainy1 = MV.get_tracks_from_url('spotify', "https://open.spotify.com/playlist/37i9dQZF1DXbvABJXBIyiY?si=e2da83ce834b40c5") # rainy day
rainy2 = MV.get_tracks_from_url('Circles Records', "https://open.spotify.com/playlist/3r82Jvzw3SSGKKiKf3dXMM?si=01242f93e6524a23").loc[:210,] # rainy days music
rainy3 = MV.get_tracks_from_url('tiarafernando', "https://open.spotify.com/playlist/47S4MBG0EEXwA0GdJUA4Ur?si=0e19386208f944c3") # a playlist for rainy days
rainy4 = MV.get_tracks_from_url('spotify', "https://open.spotify.com/playlist/37i9dQZF1DX3YSRoSdA634?si=1308b6681c504105") # life sucks

In [4]:
sunny1 = MV.get_tracks_from_url('spotify', "https://open.spotify.com/playlist/37i9dQZF1DX1BzILRveYHb?si=ac80c28e8a104400") # sunny day
sunny2 = MV.get_tracks_from_url('spotify', "https://open.spotify.com/playlist/37i9dQZF1DXd1MXcE8WTXq?si=8662accbcb644c0e") # summer throwbacks
sunny3 = MV.get_tracks_from_url('spotify', "https://open.spotify.com/playlist/37i9dQZF1DX5Ozry5U6G0d?si=953f1b53a9584ce7") # summer party
sunny4 = MV.get_tracks_from_url('spotify', "https://open.spotify.com/playlist/37i9dQZF1DX9fZ7amiNVu6?si=bf527d1910ff4453") # feel good summer
sunny5 = MV.get_tracks_from_url('spotify', "https://open.spotify.com/playlist/37i9dQZF1DXdPec7aLTmlC?si=63fcdc473f954614") # happy hits
sunny6 = MV.get_tracks_from_url('spotify', "https://open.spotify.com/playlist/37i9dQZF1DX8FwnYE6PRvL?si=031aaab855c24690") # rock party
sunny7 = MV.get_tracks_from_url('spotify', "https://open.spotify.com/playlist/37i9dQZF1DX0UrRvztWcAU?si=3cf9b8d0d5d84ae1") # wake up happy

In [5]:
rainy = pd.concat([rainy1, rainy2, rainy3, rainy4]).drop_duplicates('track.uri')
#rainy

In [6]:
sunny = pd.concat([sunny1, sunny2, sunny3, sunny4, sunny5, sunny6, sunny7]).drop_duplicates('track.uri')
#sunny

In [7]:
#rainy.describe()
rainy = rainy[['track.uri', 'track.id', 'track.name', 'track.popularity']].dropna()

In [8]:
#sunny.describe()
sunny = sunny[['track.uri', 'track.id', 'track.name', 'track.popularity']].dropna()

In [9]:
sunny_features = MV.add_song_features(sunny).drop(['id', 'uri', 'track_href', 'analysis_url', 'type'], axis=1)
rainy_features = MV.add_song_features(rainy).drop(['id', 'uri', 'track_href', 'analysis_url', 'type'], axis=1)

In [11]:
sunny_features['weather_type'] = 'sun'
rainy_features['weather_type'] = 'rain'
weather_songs = pd.concat([sunny_features, rainy_features])
weather_songs

index,track.uri,track.id,track.name,track.popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,weather_type
0,spotify:track:3ZpQiJ78LKINrW9SQTgbXd,3ZpQiJ78LKINrW9SQTgbXd,All I Wanna Do,74.0,0.820,0.528,9,-11.179,1,0.0321,0.1110,0.01860,0.2570,0.931,120.091,272107,4,sun
1,spotify:track:3XKIUb7HzIF1Vu9usunMzc,3XKIUb7HzIF1Vu9usunMzc,Maria Maria (feat. The Product G&B),82.0,0.777,0.601,2,-5.931,1,0.1260,0.0406,0.00201,0.0348,0.680,97.911,261973,4,sun
2,spotify:track:0ofHAoxe9vBkTCp2UQIavz,0ofHAoxe9vBkTCp2UQIavz,Dreams - 2004 Remaster,87.0,0.828,0.492,0,-9.744,1,0.0276,0.0644,0.00428,0.1280,0.789,120.151,257800,4,sun
3,spotify:track:0bRXwKfigvpKZUurwqAlEh,0bRXwKfigvpKZUurwqAlEh,Lovely Day,81.0,0.692,0.651,9,-8.267,1,0.0324,0.2920,0.00241,0.1050,0.706,97.923,254560,4,sun
4,spotify:track:1YLJVmuzeM2YSUkCCaTNUB,1YLJVmuzeM2YSUkCCaTNUB,Dog Days Are Over,79.0,0.492,0.810,7,-5.315,1,0.0847,0.0416,0.00379,0.1170,0.245,149.954,251840,4,sun
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
554,spotify:track:5jx8tCxiO0uIbo2uNia23K,5jx8tCxiO0uIbo2uNia23K,The One That Got Away - Acoustic,68.0,0.714,0.434,4,-11.542,1,0.0396,0.6970,0.00000,0.0919,0.352,123.942,259040,4,rain
555,spotify:track:7gY3cyGcB2wnk2xDXiA0pe,7gY3cyGcB2wnk2xDXiA0pe,Head Above Water,66.0,0.578,0.694,5,-5.351,1,0.0706,0.0112,0.00000,0.1020,0.268,129.921,220850,4,rain
556,spotify:track:6D2tzc8kRnZb7P1lNwMBLH,6D2tzc8kRnZb7P1lNwMBLH,Close To You,62.0,0.494,0.170,10,-11.368,1,0.0337,0.8950,0.00000,0.1390,0.145,80.098,223213,4,rain
557,spotify:track:2bdqU7C4softKNcMYDFi96,2bdqU7C4softKNcMYDFi96,chaotic,66.0,0.412,0.382,2,-7.023,0,0.0497,0.5900,0.00000,0.0931,0.318,81.493,178721,3,rain


In [12]:
weather_songs['weather_type'].value_counts()

sun     579
rain    559
Name: weather_type, dtype: int64
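
As a reference point for the accuracies reported later, it helps to know what a trivial classifier would score on these class counts. The sketch below uses only the counts printed above and computes the accuracy of always predicting the majority class. The variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above.
counts = {"sun": 579, "rain": 559}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of a classifier that always predicts the majority class.
baseline_accuracy = counts[majority_label] / total

print(total, majority_label, round(baseline_accuracy, 3))
```

Any model worth keeping should clear this baseline of roughly 51% by a comfortable margin.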

In [13]:
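# Drop the identifier columns (URI, ID, name). The .copy() makes this an
# independent frame, so the assignment below does not write to a slice.
weather_for_ML = weather_songs.iloc[:, 3:].copy()
weather_for_ML['weather_type'] = weather_for_ML['weather_type'].map({'rain': 0, 'sun': 1})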
weather_for_ML

index,track.popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,weather_type
0,74.0,0.820,0.528,9,-11.179,1,0.0321,0.1110,0.01860,0.2570,0.931,120.091,272107,4,1
1,82.0,0.777,0.601,2,-5.931,1,0.1260,0.0406,0.00201,0.0348,0.680,97.911,261973,4,1
2,87.0,0.828,0.492,0,-9.744,1,0.0276,0.0644,0.00428,0.1280,0.789,120.151,257800,4,1
3,81.0,0.692,0.651,9,-8.267,1,0.0324,0.2920,0.00241,0.1050,0.706,97.923,254560,4,1
4,79.0,0.492,0.810,7,-5.315,1,0.0847,0.0416,0.00379,0.1170,0.245,149.954,251840,4,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
554,68.0,0.714,0.434,4,-11.542,1,0.0396,0.6970,0.00000,0.0919,0.352,123.942,259040,4,0
555,66.0,0.578,0.694,5,-5.351,1,0.0706,0.0112,0.00000,0.1020,0.268,129.921,220850,4,0
556,62.0,0.494,0.170,10,-11.368,1,0.0337,0.8950,0.00000,0.1390,0.145,80.098,223213,4,0
557,66.0,0.412,0.382,2,-7.023,0,0.0497,0.5900,0.00000,0.0931,0.318,81.493,178721,3,0


We need to account for the discrete variables that we have in our dataset. Specifically, we need to make sure that key, mode (major or minor), and time signature are all treated as categorical descriptions of songs rather than numerical ones.

In [59]:
weather_for_ML[['key', 'mode', 'time_signature', 'weather_type']] = weather_for_ML[['key', 'mode', 'time_signature', 'weather_type']].astype('category')

weather_for_ML['zeropad'] = 0

weather_for_ML.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1138 entries, 0 to 558
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   track.popularity  1138 non-null   float64 
 1   danceability      1138 non-null   float64 
 2   energy            1138 non-null   float64 
 3   key               1138 non-null   category
 4   loudness          1138 non-null   float64 
 5   mode              1138 non-null   category
 6   speechiness       1138 non-null   float64 
 7   acousticness      1138 non-null   float64 
 8   instrumentalness  1138 non-null   float64 
 9   liveness          1138 non-null   float64 
 10  valence           1138 non-null   float64 
 11  tempo             1138 non-null   float64 
 12  duration_ms       1138 non-null   int64   
 13  time_signature    1138 non-null   category
 14  weather_type      1138 non-null   category
 15  zeropad           1138 non-null   int64   
dtypes: category(4), float64(10), int64(2)
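
One reason `key` in particular should not be treated as a number: Spotify reports it in pitch-class notation (0 = C, 1 = C♯/D♭, ..., 11 = B), which wraps around at the octave. The short sketch below shows the mismatch between numeric distance and musical distance. The helper names are hypothetical and used only for illustration.

```python
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def numeric_distance(a, b):
    """Distance a model sees if key is treated as an ordinary number."""
    return abs(a - b)

def semitone_distance(a, b):
    """Distance on the 12-note pitch-class circle."""
    diff = abs(a - b) % 12
    return min(diff, 12 - diff)

# B (11) and C (0) are neighbouring notes, yet numerically far apart.
print(PITCH_CLASSES[11], PITCH_CLASSES[0])
print(numeric_distance(11, 0))   # 11
print(semitone_distance(11, 0))  # 1
```

Encoding `key` as a category avoids implying an ordering that the music does not have.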

### Supervised Learning Classification Models

After compiling a complete dataset of roughly 1100 songs that can be categorized as 'sun' or 'rain' songs, I created a train and test set from this data and used multiple machine learning approaches to attempt to correctly classify these songs.

In [66]:
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(weather_for_ML, test_size=0.2, random_state=42)

#test_set, validation_set = train_test_split(test_set, test_size = 0.5, random_state=42)

X_train = train_set.drop("weather_type", axis=1)
y_train = train_set["weather_type"].copy()

X_test = test_set.drop("weather_type", axis=1)
y_test = test_set["weather_type"].copy()

#X_validation = validation_set.drop("weather_type", axis=1)
#y_validation = validation_set["weather_type"].copy()

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(910, 15)
(228, 15)
(910,)
(228,)


To improve the performance of some of our classifier models, we standardize all of our numerical values in the dataset and one-hot-encode each of our categorical variables.

In [67]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

num_cols = list(X_train.select_dtypes(include=[np.number]))
cat_cols = list(X_train.select_dtypes(exclude=[np.number])) # key, mode, time signature

pipeline=ColumnTransformer([
    ('num',StandardScaler(),num_cols),
    ('cat',OneHotEncoder(handle_unknown="ignore"),cat_cols),
])

X_train = pipeline.fit_transform(X_train)

X_test = pipeline.transform(X_test)
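
To make the two transformations concrete, here is a minimal standard-library sketch of what standardizing and one-hot encoding do to a single column each. It is a simplified illustration rather than scikit-learn's implementation; `standardize` and `one_hot` are hypothetical helpers, and the toy values are the first five `tempo` and `key` entries from the song table shown earlier.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Z-score a column: subtract the mean, divide by the population std."""
    mean = fmean(values)
    std = pstdev(values) or 1.0  # avoid dividing by zero for a constant column
    return [(v - mean) / std for v in values]

def one_hot(values):
    """Return the sorted categories and one 0/1 indicator vector per value."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded

tempo = [120.091, 97.911, 120.151, 97.923, 149.954]
key = [9, 2, 0, 9, 7]

scaled_tempo = standardize(tempo)
key_categories, key_encoded = one_hot(key)

print([round(z, 3) for z in scaled_tempo])
print(key_categories)   # [0, 2, 7, 9]
print(key_encoded[0])   # [0, 0, 0, 1]
```

After scaling, the column has mean 0 and unit variance, so features measured on very different scales (for example `duration_ms` and `valence`) contribute comparably to models that are sensitive to feature magnitude.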

Below are *Random Forest, XGBoost, AdaBoost, and Neural Network* approaches. Decision Trees, Logistic Regression, and Naive Bayes were also implemented, but the four methods included here yielded the best initial results. When classifying sunny and rainy songs, a false positive is no more costly than a false negative. As a result, for the first iteration of models, our measure of choice is simply accuracy.

### Random Forest

In [68]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

from scipy.stats import randint 

forest_cls = RandomForestClassifier()

param_dist = {
  'n_estimators': randint(low=50, high=200),
  'max_features': randint(low=5, high=15),
  'max_depth': randint(low=5, high=15)
}

forest_rnd_search = RandomizedSearchCV(
    estimator=forest_cls, 
    param_distributions=param_dist,
    n_iter=50, 
    cv=5
)

forest_rnd_search.fit(X_train, y_train)


In [69]:
print(forest_rnd_search.best_params_)

{'max_depth': 7, 'max_features': 11, 'n_estimators': 101}


In [70]:
y_predict_RF = forest_rnd_search.best_estimator_.predict(X_test)

In [71]:

# scikit-learn metrics expect (y_true, y_pred)
accuracy_score(y_test, y_predict_RF)

0.8333333333333334

In [72]:
# (y_true, y_pred): rows are true labels, columns are predicted labels
confusion_matrix(y_test, y_predict_RF)

array([[ 80,  19],
       [ 19, 110]])
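
Because the matrix above happens to be symmetric, the derived metrics are the same whichever axis holds the true labels. The sketch below shows how accuracy, precision, and recall follow from the four counts. Treating 'sun' (1) as the positive class is an assumption made for illustration.

```python
# Counts copied from the random forest confusion matrix above.
tn, fp = 80, 19
fn, tp = 19, 110

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # of the songs predicted sunny, the share that are sunny
recall = tp / (tp + fn)     # of the sunny songs, the share that were found

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

The accuracy reproduces the 0.8333 reported above, and precision and recall are equal here only because the two off-diagonal counts match.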

In [73]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

cv = KFold(n_splits=10, shuffle=True)
forest_scores = cross_val_score(forest_rnd_search.best_estimator_, weather_for_ML.drop('weather_type', axis=1).copy(), 
                              weather_for_ML['weather_type'], scoring='accuracy', cv=cv)

print("Mean:", forest_scores.mean())
print("Standard deviation:", forest_scores.std())

Mean: 0.876991150442478
Standard deviation: 0.027418872012169537
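
Note that the cross-validation call above receives the raw `weather_for_ML` frame, not the scaled and encoded arrays used during the search. One way to keep preprocessing and model together, so that every fold fits its own scaler and encoder on its training portion only, is to wrap both in a scikit-learn `Pipeline`. The following is a sketch on synthetic data, not a rerun of the notebook: `make_toy_songs` and its columns are hypothetical stand-ins for `weather_for_ML`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def make_toy_songs(n=200, seed=42):
    """Hypothetical stand-in for weather_for_ML: two numeric features, one categorical."""
    rng = np.random.default_rng(seed)
    label = rng.integers(0, 2, size=n)
    return pd.DataFrame({
        "energy": rng.normal(0.4 + 0.3 * label, 0.1),
        "valence": rng.normal(0.3 + 0.3 * label, 0.1),
        "key": pd.Categorical(rng.integers(0, 12, size=n)),
        "weather_type": label,
    })

songs = make_toy_songs()
X = songs.drop(columns="weather_type")
y = songs["weather_type"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["energy", "valence"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["key"]),
])

# The pipeline refits the preprocessing inside every cross-validation fold.
model = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=42)),
])

cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, scoring="accuracy", cv=cv)

print("Mean:", scores.mean())
print("Standard deviation:", scores.std())
```

Fixing `random_state` on both the folds and the forest also makes the reported mean reproducible between runs.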


### XGBoost

In [74]:
import xgboost as xgb
from sklearn.model_selection import GridSearchCV


# Candidate values sampled by RandomizedSearchCV below
param_grid = {
    "max_depth": list(range(5, 16)),
    "learning_rate": [.001, .005, .01, .05, .1, .2, .3, .5, 1],
    "gamma": [0, 0.25, 0.5, 1],
    "reg_lambda": [0, 1],
    "scale_pos_weight": [1, 2],
    "subsample": [0.6, 0.7, 0.8, 0.9],
    "colsample_bytree": [0.6, 0.7, 0.8, 0.9],
}

xgb_cls = xgb.XGBClassifier(tree_method="hist", enable_categorical=True)

xgb_rnd_search = RandomizedSearchCV(
    estimator=xgb_cls, 
    param_distributions=param_grid, 
    n_iter=200,
    cv=5
)

xgb_rnd_search.fit(X_train, y_train)

In [75]:
print(xgb_rnd_search.best_params_)

y_predict_XG = xgb_rnd_search.best_estimator_.predict(X_test)

accuracy_score(y_test, y_predict_XG)  # (y_true, y_pred)

{'subsample': 0.9, 'scale_pos_weight': 1, 'reg_lambda': 1, 'max_depth': 10, 'learning_rate': 0.1, 'gamma': 0.25, 'colsample_bytree': 0.8}


0.8421052631578947

In [76]:
cv = KFold(n_splits=10, shuffle=True)
xgb_scores = cross_val_score(xgb_rnd_search.best_estimator_, weather_for_ML.drop('weather_type', axis=1).copy(), 
                              weather_for_ML['weather_type'], scoring='accuracy', cv=cv)

print("Mean:", xgb_scores.mean())
print("Standard deviation:", xgb_scores.std())

Mean: 0.8725741344511722
Standard deviation: 0.030092009245071476


### AdaBoost

In [77]:
from sklearn.ensemble import AdaBoostClassifier

ada_cls = AdaBoostClassifier()


param_grid = {
    'n_estimators': [i*10 for i in range(1, 11)],
    'learning_rate': [i/100 for i in range(1, 101)]
}

ada_rnd_search = RandomizedSearchCV(
    estimator=ada_cls, 
    param_distributions=param_grid, 
    n_iter=200,
    cv=5
)

ada_rnd_search.fit(X_train, y_train)

In [78]:
print(ada_rnd_search.best_params_)

y_predict_AD = ada_rnd_search.best_estimator_.predict(X_test)

print(accuracy_score(y_test, y_predict_AD))  # (y_true, y_pred)

{'n_estimators': 50, 'learning_rate': 0.45}
0.8421052631578947


In [79]:
cv = KFold(n_splits=10, shuffle=True)
ada_scores = cross_val_score(ada_rnd_search.best_estimator_, weather_for_ML.drop('weather_type', axis=1).copy(), 
                              weather_for_ML['weather_type'], scoring='accuracy', cv=cv)

print("Mean:", ada_scores.mean())
print("Standard deviation:", ada_scores.std())

Mean: 0.8787067225586089
Standard deviation: 0.031697115195194386


### Neural Network

In [80]:
import tensorflow as tf
from tensorflow import keras

tf.keras.utils.set_random_seed(42)

def run_neural_net(X_train, y_train, X_test, y_test):
    model = keras.Sequential([
        keras.layers.Flatten(input_shape=(X_train.shape[1],)),
        keras.layers.Dense(20, activation=tf.nn.relu),
        keras.layers.Dense(12, activation=tf.nn.relu),
        keras.layers.Dense(1, activation=tf.nn.sigmoid),
    ])

    model.compile(
        loss="binary_crossentropy",
        optimizer=tf.keras.optimizers.Adam(),
        metrics=['accuracy'],
    )

    model.fit(X_train, y_train, epochs=10, batch_size=5, verbose=2)

    loss, accuracy = model.evaluate(X_test, y_test)

    return [loss, accuracy, model.predict(X_test), model]

nn1 = run_neural_net(X_train, y_train, X_test, y_test)

print(f' Model loss on the test set: {nn1[0]}')
print(f' Model accuracy on the test set: {100*nn1[1]}')

Epoch 1/10
182/182 - 1s - loss: 0.5987 - accuracy: 0.6429 - 1s/epoch - 7ms/step
Epoch 2/10
182/182 - 0s - loss: 0.3587 - accuracy: 0.8725 - 222ms/epoch - 1ms/step
Epoch 3/10
182/182 - 0s - loss: 0.2900 - accuracy: 0.8802 - 203ms/epoch - 1ms/step
Epoch 4/10
182/182 - 0s - loss: 0.2717 - accuracy: 0.8857 - 190ms/epoch - 1ms/step
Epoch 5/10
182/182 - 0s - loss: 0.2627 - accuracy: 0.8956 - 195ms/epoch - 1ms/step
Epoch 6/10
182/182 - 0s - loss: 0.2556 - accuracy: 0.8967 - 187ms/epoch - 1ms/step
Epoch 7/10
182/182 - 0s - loss: 0.2485 - accuracy: 0.8989 - 188ms/epoch - 1ms/step
Epoch 8/10
182/182 - 0s - loss: 0.2416 - accuracy: 0.9055 - 190ms/epoch - 1ms/step
Epoch 9/10
182/182 - 0s - loss: 0.2376 - accuracy: 0.9022 - 185ms/epoch - 1ms/step
Epoch 10/10
182/182 - 0s - loss: 0.2336 - accuracy: 0.9099 - 185ms/epoch - 1ms/step
 Model loss on the test set: 0.3404121398925781
 Model accuracy on the test set: 85.52631735801697


In [81]:
nn_cls = nn1[3]

In [82]:
# (y_true, y_pred): rows are true labels, columns are predicted labels
confusion_matrix(y_test, np.rint(nn1[2]))

array([[ 83,  16],
       [ 17, 112]])
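
`np.rint` turns the sigmoid outputs into labels by rounding, which amounts to a 0.5 decision threshold. If a later iteration ends up valuing one error type over the other, the threshold can be made explicit and adjusted. The sketch below uses made-up probabilities and a hypothetical `to_labels` helper.

```python
def to_labels(probabilities, threshold=0.5):
    """Label a song 1 ('sun') when its predicted probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probabilities]

# Made-up sigmoid outputs, for illustration only.
probs = [0.08, 0.46, 0.55, 0.91]

print(to_labels(probs))                 # [0, 0, 1, 1]
print(to_labels(probs, threshold=0.4))  # [0, 1, 1, 1]
```

Lowering the threshold labels more songs as sunny, trading rainy-song errors for fewer missed sunny songs.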

Based on the models above, the random forest, XGBoost, and AdaBoost approaches produce very similar results, with test-set accuracies of about 83% to 84% and 10-fold cross-validation means of about 87% to 88% on the full dataset. The neural network has the highest test-set accuracy (about 85.5%), although it was not cross-validated like the other approaches, so the comparison is not conclusive.

Other models such as logistic regression, decision trees, and Naive Bayes were also attempted, but the ones above yielded the best initial results.

There are many potential next steps, including more thorough hyperparameter tuning of the above models, cross-validating the neural network, and exploring new options such as KNN and more advanced deep learning models. However, because accuracy here only applies to our roughly 1,100 labeled songs, I want to implement a weather playlist generator and look at the songs it selects before I iterate further. Based on the generated playlists, I can judge whether the training data needs to be expanded or changed, and whether there is a better performance measure than accuracy.

In [83]:
import joblib

joblib.dump(ada_rnd_search.best_estimator_, 'adaboost1.pkl')

['adaboost1.pkl']

In [84]:
joblib.dump(nn_cls, 'neuralnet1.pkl')

['neuralnet1.pkl']

In [85]:
joblib.dump(xgb_rnd_search.best_estimator_, 'xgboost1.pkl')

['xgboost1.pkl']

In [86]:
joblib.dump(forest_rnd_search.best_estimator_, 'forest1.pkl')

['forest1.pkl']