# Using Machine Learning to Predict a Pop Song's Subgenre  

## Project Background

### Research Question
This project asks the question *“Can we predict a pop song's subgenre using its audio features?”*

### Findings
We found that *audio features can predict a pop song's subgenre with 56% accuracy.*

With 7 subgenres, a uniformly random guess picks the correct subgenre about 14% of the time (1 in 7). The model we fit with the Gradient Boosting Classifier had the highest accuracy, approximately 56%, which makes it roughly 4 times more likely than random chance to assign a pop song to its correct subgenre.
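The comparison against chance can be checked with a few lines of arithmetic. This is a minimal sketch: the variable names are illustrative, the 0.56 figure is the accuracy reported above, and the baseline assumes a guess drawn uniformly from the 7 subgenres (a majority-class baseline could be higher if the classes are imbalanced).

```python
# Chance-level accuracy for a guess drawn uniformly from the 7 subgenres.
n_classes = 7
baseline = 1 / n_classes

# Best accuracy reported in this project (Gradient Boosting model).
model_accuracy = 0.56

# How many times better than chance the model is.
lift = model_accuracy / baseline

print(f"baseline = {baseline:.1%}, lift = {lift:.2f}x")
```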

### Dataset
The dataset was retrieved from Spotify’s API and contains audio feature information for *13,988 pop songs.*

### Target Variable 
Our target variable is a multi-class categorical variable covering 7 pop subgenres:
- 1 = dance-pop
- 2 = rap-pop
- 3 = folk-pop
- 4 = electro-pop
- 5 = rock-pop
- 6 = indie-pop
- 7 = EDM-pop
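The integer codes above can be turned back into readable names with a simple lookup. The sketch below is illustrative only: `SUBGENRE_LABELS` and `decode_labels` are hypothetical names, not identifiers from the project's notebook.

```python
# Mapping from the integer target codes listed above to subgenre names.
# (Hypothetical helper; the project's own notebook may store this differently.)
SUBGENRE_LABELS = {
    1: "dance-pop",
    2: "rap-pop",
    3: "folk-pop",
    4: "electro-pop",
    5: "rock-pop",
    6: "indie-pop",
    7: "EDM-pop",
}


def decode_labels(codes):
    """Translate an iterable of integer class codes into subgenre names."""
    return [SUBGENRE_LABELS[code] for code in codes]


print(decode_labels([1, 4, 7]))
```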

### Independent Variables
Our independent variables are audio features: metrics that measure each song's
- acousticness
- danceability
- duration
- energy
- instrumentalness
- key
- liveness
- loudness
- mode (major or minor key)
- speechiness
- tempo
- time signature
- valence
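Column names such as `key_10.0` and `time_signature_4.0` in the sample output further down suggest that the categorical features (key and time signature) were one-hot encoded. The notebook does not show how this was done, so the following is only a sketch of one common approach using `pandas.get_dummies`, applied to a tiny hand-built frame whose values are copied from the first three preview rows.

```python
import pandas as pd

# Three example songs; tempo, key and time signature are taken from the preview rows.
songs = pd.DataFrame(
    {
        "tempo": [96.976, 126.991, 137.16],
        "key": [10.0, 11.0, 4.0],
        "time_signature": [4.0, 4.0, 4.0],
    }
)

# One-hot encode the categorical columns; numeric columns pass through unchanged.
encoded = pd.get_dummies(songs, columns=["time_signature", "key"], dtype=int)

print(encoded.columns.tolist())
```

Each song ends up with a 1 in exactly one `key_*` column and one `time_signature_*` column, which matches the layout of the preview table.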

### Modeling with Machine Learning Algorithms
We fit a model with each of the following machine learning algorithms and used grid search with cross-validation to tune each model's hyperparameters:
- Logistic Regression
- K-Nearest Neighbors
- Decision Tree Classifier
- Random Forest Classifier
- AdaBoost Classifier
- Gradient Boosting Classifier
- eXtreme Gradient Boosting (XGBoost) Classifier
- Support Vector Machine
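As an illustration of the tuning step, here is a minimal sketch of grid search with cross-validation using scikit-learn's `GridSearchCV` and a gradient boosting classifier. It runs on synthetic 7-class data from `make_classification`, not the Spotify dataset, and the parameter grid, fold count, and sample size are assumptions chosen to keep the example fast. They are not the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the audio-feature matrix: 13 features, 7 classes.
X, y = make_classification(
    n_samples=350,
    n_features=13,
    n_informative=8,
    n_classes=7,
    n_clusters_per_class=1,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Small illustrative grid; a real search would cover more values.
param_grid = {"n_estimators": [20, 40], "max_depth": [2, 3]}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print(search.best_params_)
print(f"held-out accuracy: {search.score(X_test, y_test):.2f}")
```

The same pattern applies to the other algorithms in the list by swapping the estimator and its parameter grid.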

*Not included in our technical notebook: we further tuned each model by 1) limiting features based on feature importance rankings from the Decision Tree Classifier, and 2) applying Principal Component Analysis for dimensionality reduction. Neither of these two adjustments yielded better-performing models.*
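Since those two experiments are not in the notebook, the sketch below shows only what they might look like with scikit-learn. It again uses synthetic data. The median importance threshold, the 95% variance target for PCA, and the logistic regression classifier at the end of each pipeline are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the audio-feature matrix: 13 features, 7 classes.
X, y = make_classification(
    n_samples=350,
    n_features=13,
    n_informative=8,
    n_classes=7,
    n_clusters_per_class=1,
    random_state=0,
)

pipelines = {
    # 1) Keep only features whose decision-tree importance is at or above the median.
    "tree-importance selection": make_pipeline(
        StandardScaler(),
        SelectFromModel(DecisionTreeClassifier(random_state=0), threshold="median"),
        LogisticRegression(max_iter=1000),
    ),
    # 2) Project onto the principal components explaining 95% of the variance.
    "PCA (95% variance)": make_pipeline(
        StandardScaler(),
        PCA(n_components=0.95),
        LogisticRegression(max_iter=1000),
    ),
}

# Mean 3-fold cross-validated accuracy for each reduced-feature pipeline.
results = {
    name: cross_val_score(pipe, X, y, cv=3).mean() for name, pipe in pipelines.items()
}
for name, score in results.items():
    print(f"{name}: mean CV accuracy = {score:.2f}")
```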

In [7]:
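import warnings

import numpy as np
import pandas as pd

# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

# Set pandas display options so wide feature tables are not truncated
pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 300)
pd.set_option('display.width', 1000)

# Load the presentation test set and preview the first five rows
pres_test = pd.read_csv('data/presentation_test.csv')
pres_test.head()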

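| Unnamed: 0 | track_id | acousticness | danceability | duration_ms | energy | instrumentalness | liveness | loudness | mode_feat | speechiness | tempo | valence | time_signature_1.0 | time_signature_3.0 | time_signature_4.0 | time_signature_5.0 | key_0.0 | key_1.0 | key_2.0 | key_3.0 | key_4.0 | key_5.0 | key_6.0 | key_7.0 | key_8.0 | key_9.0 | key_10.0 | key_11.0 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0171XsIM2xyeXRr6wsugEI | 0.0358 | 0.717 | 171333 | 0.55 | 0.00196 | 0.126 | -6.019 | 0 | 0.0521 | 96.976 | 0.332 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 2jSwKQBouf0brIcxGfA9CZ | 0.269 | 0.72 | 200013 | 0.861 | 5e-06 | 0.601 | -4.339 | 1 | 0.209 | 126.991 | 0.669 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 3wAX3qn53iQUFE84hpfeen | 0.665 | 0.409 | 256800 | 0.264 | 0.00016 | 0.102 | -16.273 | 1 | 0.0363 | 137.16 | 0.355 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1LZSLMw7OXJULj75J7ko3q | 9.7e-05 | 0.607 | 214373 | 0.766 | 0.000288 | 0.635 | -7.558 | 1 | 0.0383 | 125.076 | 0.453 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 0tihGB0FIaNco6mRqI8nqI | 0.00139 | 0.61 | 175067 | 0.923 | 5.3e-05 | 0.0985 | -4.654 | 1 | 0.0567 | 116.0 | 0.607 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |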
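The `Unnamed: 0` column in the preview is what pandas produces when a CSV was saved with its index and then read back without `index_col`. The sketch below reproduces this on a two-row in-memory CSV (values copied from the preview) and shows one way to get a clean feature matrix. Whether the project dropped these columns this way is an assumption.

```python
import io

import pandas as pd

# Tiny in-memory CSV that mimics a file saved with DataFrame.to_csv() and its index.
csv_text = (
    ",track_id,tempo\n"
    "0,0171XsIM2xyeXRr6wsugEI,96.976\n"
    "1,2jSwKQBouf0brIcxGfA9CZ,126.991\n"
)

# Default read: the saved index comes back as an extra 'Unnamed: 0' column.
default_read = pd.read_csv(io.StringIO(csv_text))

# Reading with index_col=0 restores it as the index instead.
indexed_read = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Drop the identifier so only numeric features remain for modeling.
X = indexed_read.drop(columns=["track_id"])

print(default_read.columns.tolist())
print(X.columns.tolist())
```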
