<a href="https://colab.research.google.com/github/jwils133/ml-class/blob/master/projects/Jada_Bach_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Bach Chorales Project

XGBoost is a widely used gradient-boosted tree algorithm that performs well on tabular data, so I chose to build an XGBoost classifier to predict the chord labels. According to my research, XGBoost is not recommended for datasets in which the number of features is significantly greater than the number of rows. That is not the case here (the dataset has 5,665 rows and only 14 feature columns), so an XGBoost classifier is appropriate.

I ran my code on the following Tesla T4 Graphics Processing Unit (GPU):

In [2]:
!nvidia-smi

Sat Oct 22 04:57:28 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   41C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## <font color='#FF10F0'>1. Load the Data Set</font>

I downloaded the zipped file, unzipped it, and uploaded the CSV file to Google Colab. Then, I used pandas to load the file into a DataFrame.

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split

bach = pd.read_csv('bach.csv')
bach

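| | choral_ID | event_number | C | C# | D | D# | E | F | F# | G | G# | A | A# | B | bass | meter | chord_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 000106b_ | 1 | YES | NO | NO | NO | NO | YES | NO | NO | NO | YES | NO | NO | F | 3 | F_M |
| 1 | 000106b_ | 2 | YES | NO | NO | NO | YES | NO | NO | YES | NO | NO | NO | NO | E | 5 | C_M |
| 2 | 000106b_ | 3 | YES | NO | NO | NO | YES | NO | NO | YES | NO | NO | NO | NO | E | 2 | C_M |
| 3 | 000106b_ | 4 | YES | NO | NO | NO | NO | YES | NO | NO | NO | YES | NO | NO | F | 3 | F_M |
| 4 | 000106b_ | 5 | YES | NO | NO | NO | NO | YES | NO | NO | NO | YES | NO | NO | F | 2 | F_M |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5660 | 015505b_ | 105 | NO | NO | YES | NO | NO | NO | NO | YES | NO | NO | YES | NO | G | 4 | G_m |
| 5661 | 015505b_ | 106 | NO | NO | YES | NO | NO | NO | NO | YES | NO | YES | NO | NO | G | 3 | G_m |
| 5662 | 015505b_ | 107 | YES | NO | NO | NO | YES | NO | NO | YES | NO | NO | NO | NO | C | 5 | C_M |
| 5663 | 015505b_ | 108 | YES | NO | NO | NO | YES | NO | NO | YES | NO | NO | YES | NO | C | 3 | C_M |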


I checked whether any column contained only one unique value. Such a column would have the same value for every observation, so it would add nothing to the model and could safely be dropped. I found no such columns, so none were dropped for this reason.

In [4]:
print(bach.nunique())

choral_ID        60
event_number    207
C                 2
C#                2
D                 2
D#                2
E                 2
F                 2
F#                2
G                 2
G#                2
A                 2
A#                2
B                 2
bass             16
meter             5
chord_label     102
dtype: int64
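The check above could also be automated. The following is a minimal sketch on a toy DataFrame; the data and the names `toy` and `constant_cols` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy data (hypothetical): 'key' holds the same value in every row.
toy = pd.DataFrame({
    "C": ["YES", "NO", "YES"],
    "key": ["major", "major", "major"],
    "meter": [3, 5, 2],
})

# nunique() counts distinct values per column; a count of 1 marks a constant column.
counts = toy.nunique()
constant_cols = counts[counts <= 1].index.tolist()

# Drop the constant columns, since they carry no information for a classifier.
toy_reduced = toy.drop(columns=constant_cols)

print(constant_cols)
print(toy_reduced.columns.tolist())
```

On the Bach data, `constant_cols` would come back empty, because every column has at least 2 unique values.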


## <font color='#FF10F0'>2. Divide Dataset into Features and Labels</font>

I created two objects: a DataFrame for the features and a Series for the labels. The choral_ID column only identifies which chorale an event belongs to, so it is irrelevant for determining the chord_label, and I dropped it from the features. I also dropped event_number (the index of an event within its chorale) and chord_label, because it is the label that we are trying to predict.

In [5]:
bach_clean = bach.drop('choral_ID', axis=1)
bach_clean

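| | event_number | C | C# | D | D# | E | F | F# | G | G# | A | A# | B | bass | meter | chord_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | YES | NO | NO | NO | NO | YES | NO | NO | NO | YES | NO | NO | F | 3 | F_M |
| 1 | 2 | YES | NO | NO | NO | YES | NO | NO | YES | NO | NO | NO | NO | E | 5 | C_M |
| 2 | 3 | YES | NO | NO | NO | YES | NO | NO | YES | NO | NO | NO | NO | E | 2 | C_M |
| 3 | 4 | YES | NO | NO | NO | NO | YES | NO | NO | NO | YES | NO | NO | F | 3 | F_M |
| 4 | 5 | YES | NO | NO | NO | NO | YES | NO | NO | NO | YES | NO | NO | F | 2 | F_M |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5660 | 105 | NO | NO | YES | NO | NO | NO | NO | YES | NO | NO | YES | NO | G | 4 | G_m |
| 5661 | 106 | NO | NO | YES | NO | NO | NO | NO | YES | NO | YES | NO | NO | G | 3 | G_m |
| 5662 | 107 | YES | NO | NO | NO | YES | NO | NO | YES | NO | NO | NO | NO | C | 5 | C_M |
| 5663 | 108 | YES | NO | NO | NO | YES | NO | NO | YES | NO | NO | YES | NO | C | 3 | C_M |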


In [6]:
bach_features = bach_clean.drop('chord_label', axis=1)
bach_features = bach_features.drop('event_number', axis=1)
bach_labels = bach_clean['chord_label']
bach_features

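| | C | C# | D | D# | E | F | F# | G | G# | A | A# | B | bass | meter |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | YES | NO | NO | NO | NO | YES | NO | NO | NO | YES | NO | NO | F | 3 |
| 1 | YES | NO | NO | NO | YES | NO | NO | YES | NO | NO | NO | NO | E | 5 |
| 2 | YES | NO | NO | NO | YES | NO | NO | YES | NO | NO | NO | NO | E | 2 |
| 3 | YES | NO | NO | NO | NO | YES | NO | NO | NO | YES | NO | NO | F | 3 |
| 4 | YES | NO | NO | NO | NO | YES | NO | NO | NO | YES | NO | NO | F | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5660 | NO | NO | YES | NO | NO | NO | NO | YES | NO | NO | YES | NO | G | 4 |
| 5661 | NO | NO | YES | NO | NO | NO | NO | YES | NO | YES | NO | NO | G | 3 |
| 5662 | YES | NO | NO | NO | YES | NO | NO | YES | NO | NO | NO | NO | C | 5 |
| 5663 | YES | NO | NO | NO | YES | NO | NO | YES | NO | NO | YES | NO | C | 3 |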
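
The labels here are strings such as `F_M`, which the XGBoost version in this notebook accepted directly. As far as I know, newer XGBoost releases expect class labels encoded as integers starting at 0. The following is a minimal sketch of that encoding with scikit-learn's `LabelEncoder`, assuming a toy list of labels; `toy_labels` and `label_enc` are illustrative names, not part of the original notebook.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels (hypothetical), mimicking values from the chord_label column.
toy_labels = ["F_M", "C_M", "C_M", "F_M", "G_m"]

label_enc = LabelEncoder()
# fit_transform assigns integers in sorted order of the distinct labels.
encoded = label_enc.fit_transform(toy_labels)

print(label_enc.classes_.tolist())
print(encoded.tolist())

# inverse_transform maps integers (for example, model predictions) back to chord names.
decoded = label_enc.inverse_transform(encoded).tolist()
```

With this approach, predictions from the model would need to pass through `inverse_transform` before being compared with the original string labels.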


## <font color='#FF10F0'>3. One-hot Encode Categorical (yes/no) Values in Feature Columns</font> 

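XGBoost, as used here, needs numeric input and cannot be applied to string-valued categorical data. So, I used sklearn's OneHotEncoder class to transform every feature column (the YES/NO pitch columns as well as bass and meter) into one-hot encoded numerical values. The result is a sparse matrix with 45 columns.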

In [7]:
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')
bachSparse = enc.fit_transform(bach_features)
bachSparse

<5665x45 sparse matrix of type '<class 'numpy.float64'>'
	with 79310 stored elements in Compressed Sparse Row format>
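
To make the shape of that output easier to follow, here is a minimal sketch on a toy DataFrame (the data and the names `toy_features` and `toy_sparse` are illustrative assumptions). It shows that the encoder creates one output column per distinct value in each input column, and that each row stores exactly one nonzero entry per input column.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy features (hypothetical), with the same kinds of columns as bach_features.
toy_features = pd.DataFrame({
    "C": ["YES", "YES", "NO"],
    "D": ["NO", "NO", "YES"],
    "bass": ["F", "E", "G"],
    "meter": [3, 5, 3],
})

enc = OneHotEncoder(handle_unknown="ignore")
toy_sparse = enc.fit_transform(toy_features)

# categories_ holds the distinct values found in each input column.
n_categories = [len(c) for c in enc.categories_]
print(n_categories, toy_sparse.shape, toy_sparse.nnz)

# Same arithmetic for the full dataset, using the nunique() counts shown earlier:
# 12 pitch columns with 2 values each, 16 bass values, 5 meter values.
full_width = 12 * 2 + 16 + 5
# One stored element per row and input column: 5665 rows times 14 columns.
full_stored = 5665 * 14
print(full_width, full_stored)
```

The two computed values agree with the `5665x45` shape and the 79310 stored elements reported above.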

## <font color='#FF10F0'>4. Separate Data into Training and Test Sets </font> 

I used the code given in the instructions to split the data into training and test sets, holding out 20% of the rows for testing. Each set has two parts: a sparse matrix of features and a Series of labels.

In [8]:

from sklearn.model_selection import train_test_split
bach_train_features, bach_test_features, bach_train_labels, bach_test_labels = train_test_split(bachSparse, bach_labels, test_size = 0.2, random_state=42)
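
As a minimal sketch of what this call does, the example below splits 100 toy rows (hypothetical data, not the Bach features) with the same `test_size` and `random_state`. It shows the 80/20 row counts and that a fixed seed makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 rows, 2 features, 2 balanced classes.
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Repeating the call with the same random_state yields the same rows.
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
print(np.array_equal(X_test, X_test2))
```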

## <font color='#FF10F0'>5. Create XGBoost Classifier </font> 

I created an XGBoost classifier called model with the following parameters:

* `tree_method: gpu_hist`
* `predictor: gpu_predictor`

In [9]:
from xgboost import XGBClassifier
params = { 'tree_method': 'gpu_hist', 'predictor': 'gpu_predictor' }

model = XGBClassifier(**params)
model

XGBClassifier(predictor='gpu_predictor', tree_method='gpu_hist')
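
One caveat for anyone re-running this notebook: as far as I know, XGBoost 2.0 and later deprecate `gpu_hist` and `gpu_predictor` in favor of `tree_method='hist'` combined with `device='cuda'`. The helper below is a minimal sketch under that assumption; `gpu_params` is a hypothetical name, not an XGBoost function, and it does not import XGBoost itself.

```python
def gpu_params(xgb_version: str) -> dict:
    """Return GPU training parameters for the given XGBoost version string.

    Assumption: versions 2.x and later use the `device` parameter, while
    older versions use the `gpu_hist` / `gpu_predictor` settings from this notebook.
    """
    major = int(xgb_version.split(".")[0])
    if major >= 2:
        return {"tree_method": "hist", "device": "cuda"}
    return {"tree_method": "gpu_hist", "predictor": "gpu_predictor"}


print(gpu_params("1.6.2"))
print(gpu_params("2.0.3"))
```

In a notebook, this could be used as `XGBClassifier(**gpu_params(xgboost.__version__))` after `import xgboost`.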

## <font color='#FF10F0'>6. Create Parameter Grid & Find Best Values for Hyperparameters </font> 

Now, I want to find the best hyperparameter values for

*   n_estimators: let's try 100, 125, 150, 200
*   max_depth: let's try 4, 6, 8

So, I made a param_grid and used random search, which samples 5 of the 12 possible combinations and scores each one with 5-fold stratified cross-validation, to determine the best values for these hyperparameters.

In [10]:
param_grid = { 'n_estimators': [100, 125, 150, 200], 'max_depth':[4, 6, 8] }

In [11]:
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import RandomizedSearchCV
param_comb = 5
folds=5
skf = StratifiedKFold(n_splits=folds, shuffle = True, random_state = 1001)
random_search = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=param_comb,  n_jobs=-1, 
                                   cv=skf.split(bach_train_features,bach_train_labels), verbose=3)
random_search

RandomizedSearchCV(cv=<generator object _BaseKFold.split at 0x7f3960e980d0>,
                   estimator=XGBClassifier(predictor='gpu_predictor',
                                           tree_method='gpu_hist'),
                   n_iter=5, n_jobs=-1,
                   param_distributions={'max_depth': [4, 6, 8],
                                        'n_estimators': [100, 125, 150, 200]},
                   verbose=3)
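
To show how the search budget works out, here is a minimal sketch that enumerates the grid with scikit-learn's `ParameterGrid` and draws 5 candidates with `ParameterSampler`. The `random_state=0` seed is an assumption added for reproducibility; the search above does not set one, so its sampled candidates can differ between runs.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

param_grid = {"n_estimators": [100, 125, 150, 200], "max_depth": [4, 6, 8]}

# Every combination in the grid: 4 values times 3 values.
all_combos = list(ParameterGrid(param_grid))

# Random search evaluates only n_iter of them (seed chosen here for illustration).
sampled = list(ParameterSampler(param_grid, n_iter=5, random_state=0))

# Each sampled candidate is trained once per cross-validation fold.
folds = 5
n_fits = len(sampled) * folds

print(len(all_combos), len(sampled), n_fits)
```

A full grid search over the same values would need 12 candidates times 5 folds, or 60 fits, instead of 25.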

## <font color='#FF10F0'>7. Fit XGBoost Classifier Using Best Values for Hyperparameters </font> 

Fitting the random search trains each sampled candidate and then refits the best one on the full training data. I used that refit model (best_estimator_) to generate predictions on the test set. Finally, I determined the accuracy of the model, which came out to about 73.6%.

In [12]:
%%time 
grid_result = random_search.fit(bach_train_features, bach_train_labels)

Fitting 5 folds for each of 5 candidates, totalling 25 fits
CPU times: user 59.6 s, sys: 1.04 s, total: 1min
Wall time: 28min 8s


In [13]:
random_search.best_params_

{'n_estimators': 125, 'max_depth': 8}
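
The same search workflow can be tried quickly without a GPU. The sketch below is a stand-in, not the model used in this notebook: it swaps in scikit-learn's `DecisionTreeClassifier` and synthetic data from `make_classification`, and it passes the `StratifiedKFold` object itself as `cv` rather than the generator returned by `skf.split(...)`, since a generator can only be consumed once. All data and parameter values here are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data (hypothetical stand-in for the encoded Bach features).
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=3, random_state=0
)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1001)
search = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),  # CPU stand-in for XGBClassifier
    param_distributions={"max_depth": [2, 4, 6, 8], "min_samples_leaf": [1, 2, 4]},
    n_iter=5,
    cv=skf,  # the splitter object can be reused, unlike a one-shot generator
    random_state=0,
)
search.fit(X, y)

# best_params_ holds the winning combination; best_estimator_ is refit on all of X, y.
print(search.best_params_)
predictions = search.best_estimator_.predict(X)
```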

In [14]:
predictions = random_search.best_estimator_.predict(bach_test_features)

In [15]:
from sklearn.metrics import accuracy_score
accuracy_score(bach_test_labels, predictions)

0.736098852603707
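
For reference, accuracy is simply the fraction of predictions that match the true label exactly. The following is a minimal sketch on toy labels (hypothetical values) comparing a manual computation with `accuracy_score`. The last lines rest on an assumption: if the test split holds 1133 rows (20% of 5665), the reported accuracy would correspond to 834 correct predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy labels (hypothetical): 3 of the 5 predictions match.
y_true = np.array(["C_M", "F_M", "G_m", "C_M", "F_M"])
y_pred = np.array(["C_M", "F_M", "C_M", "C_M", "G_m"])

# Manual accuracy: mean of the element-wise equality mask.
manual = float(np.mean(y_true == y_pred))
library = accuracy_score(y_true, y_pred)
print(manual, library)

# Assumed counts for this notebook's test set: 834 correct out of 1133 rows.
implied = 834 / 1133
print(implied)
```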