By: @kggold4 @TalSomech

# Spotify dataset classification

Dataset link

Mark: 96
The main project is in the `spotify_classification.ipynb` notebook; see also `utils.py` for our utility functions.
The goal of this project is to train supervised machine learning models that classify the popularity of a Spotify song into three classes:
- high popular
- medium popular
- non popular
Features:
- acousticness (Ranges from 0 to 1)
- artists (List of artists mentioned)
- danceability (Ranges from 0 to 1)
- duration_ms (Integer typically ranging from 200k to 300k)
- energy (Ranges from 0 to 1)
- explicit (0 = No explicit content, 1 = Explicit content) - Categorical.
- id (Id of track generated by Spotify) - Categorical.
- id_artists.
- instrumentalness (Ranges from 0 to 1).
- key (All keys on octave encoded as values ranging from 0 to 11, starting on C as 0, C# as 1 and so on…).
- liveness (Ranges from 0 to 1).
- loudness (Float typically ranging from -60 to 0).
- mode (0 = Minor, 1 = Major).
- name (Name of the song).
- popularity (Ranges from 0 to 100).
- release_date (Date of release mostly in yyyy-mm-dd format, however precision of date may vary).
- speechiness (Ranges from 0 to 1).
- tempo (Float typically ranging from 50 to 150).
- time_signature.
- valence (Ranges from 0 to 1).
NOTE: when preparing the data we map the popularity score to classes in the following format:
class | real value | class value |
---|---|---|
high popular | 70 <= x | 2 |
medium popular | 40 <= x < 70 | 1 |
non popular | x < 40 | 0 |
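The mapping above can be sketched with pandas; the small frame below is a hypothetical stand-in for the real dataset (only the `popularity` column from the feature list is assumed):

```python
import pandas as pd

# Hypothetical mini-frame standing in for the Spotify dataset;
# the column name "popularity" matches the feature list above.
df = pd.DataFrame({"popularity": [85, 55, 12, 70, 39]})

def popularity_class(p):
    """Map a 0-100 popularity score to the project's three classes."""
    if p >= 70:
        return 2  # high popular
    elif p >= 40:
        return 1  # medium popular
    return 0      # non popular

df["popularity_class"] = df["popularity"].apply(popularity_class)
print(df["popularity_class"].tolist())  # → [2, 1, 0, 2, 0]
```

Note that the boundary value 70 falls into the high-popular class and 40 into the medium-popular class, matching the `<=` signs in the table.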
The number of tracks per popularity class in the dataset is unbalanced.
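A quick way to inspect the class distribution is `value_counts`; the labels below are a hypothetical sample, while in the real notebook they come from mapping the full dataset's `popularity` column:

```python
import pandas as pd

# Hypothetical class labels after the popularity mapping.
labels = pd.Series([0, 0, 0, 0, 1, 1, 2])

counts = labels.value_counts().sort_index()
print(counts.to_dict())  # → {0: 4, 1: 2, 2: 1}
```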
Model accuracy when training on the unbalanced data:

model | accuracy |
---|---|
KNeighbors Classifier | 74.20 % |
Logistic Regression | 72.32 % |
XGB Classifier | 77.74 % |
MLP Classifier | 70.82 % |
Model accuracy when training on the balanced data:

model | accuracy |
---|---|
KNeighbors Classifier | 59.35 % |
Logistic Regression | 60.06 % |
XGB Classifier | 65.41 % |
MLP Classifier | 64.16 % |
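Each model was scored on held-out data; a minimal sketch of that evaluation loop is below, using synthetic 3-class data in place of the real audio features and one of the listed models (KNeighborsClassifier), so the printed accuracy is not comparable to the table:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the audio features: 3 classes, 8 numeric features.
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Hold-out evaluation: fit on the training split, score on the test split.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
acc = accuracy_score(y_test, knn.predict(X_test))
print(f"accuracy: {acc:.2%}")
```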
It is very difficult to predict the popularity of Spotify tracks with the data in our dataset. Even after cleaning and normalizing the data, and training our models on both balanced and unbalanced training sets, the accuracy of our models remains moderate.
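One simple way to build a balanced training set is random undersampling, shown below on a hypothetical unbalanced frame; this is a sketch of the idea, not necessarily the exact balancing method used in the notebook:

```python
import pandas as pd

# Hypothetical unbalanced frame: 6 non-popular, 3 medium, 2 high.
df = pd.DataFrame({"popularity_class": [0] * 6 + [1] * 3 + [2] * 2,
                   "energy": range(11)})

# Undersample every class down to the size of the smallest class.
n_min = df["popularity_class"].value_counts().min()
balanced = df.groupby("popularity_class").sample(n=n_min, random_state=0)

print(balanced["popularity_class"].value_counts().sort_index().to_dict())
# → {0: 2, 1: 2, 2: 2}
```

The trade-off is visible in the tables above: balancing removes the majority-class shortcut, so accuracy drops but the score is a more honest measure across all three classes.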