# Preprocessing and Modeling

This notebook trains, tests, and validates various preprocessing and modeling methods. The goal is to formulate the best possible model for classifying the genre of an EDM song.

## Imports

In [114]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
from sklearn.preprocessing import StandardScaler, PowerTransformer, PolynomialFeatures
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, GradientBoostingClassifier, AdaBoostClassifier, VotingClassifier
from xgboost import XGBClassifier

## Reading In The Data

In [2]:
# Read in the data
songs = pd.read_csv('data/songs_clean.csv')
val = pd.read_csv('data/val_clean.csv')

## Setting Features And Target Variables

In [3]:
# Set features and target for modeling
X = songs.drop('genre', axis=1)
y = songs['genre']

## Train Test Split

In [4]:
# Split the modeling data into 85% training and 15% testing sets, stratified by genre
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.15, random_state=72, stratify=y)
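
To illustrate what `stratify=y` does, here is a minimal pure-Python sketch of a stratified split. It is a simplified stand-in, not scikit-learn's implementation, and the genre labels and the helper name `stratified_split` are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.15, seed=72):
    """Return (train_idx, test_idx) so each class keeps roughly the same share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # shuffle within each class
        n_test = round(len(indices) * test_size)   # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical labels: 100 songs for each of three genres
labels = ["house"] * 100 + ["techno"] * 100 + ["trance"] * 100
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Because the quota is taken per class, every genre contributes the same fraction of its songs to the test set, which keeps the class balance of the two sets comparable.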

## Preprocessing

Several options for preprocessing will be compared when testing subsequent models.

### Polynomial Features

Training the models with interaction columns and polynomial features may improve accuracy, at the cost of a much larger feature set.

In [5]:
# Create polynomial features up to degree 5
pf = PolynomialFeatures(degree=5)
# Training data
X_train = pf.fit_transform(X_train)
# Test data
X_test = pf.transform(X_test)
# Validation data
X_val = pf.transform(val.drop('genre', axis=1))

In [6]:
# Create a list to store all the new polynomial feature names
# (on scikit-learn >= 1.0, pf.get_feature_names_out(X.columns) is the replacement)
poly_feat = pf.get_feature_names(X.columns)
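
A degree-5 expansion grows the feature set quickly. The sketch below counts the columns produced by a full polynomial expansion (all monomials up to the given degree, bias column included by default). The feature counts in the loop are hypothetical, since the number of columns in `songs_clean.csv` is not shown here.

```python
from math import comb

def n_polynomial_features(n_features, degree, include_bias=True):
    """Count the monomials of total degree <= `degree` in `n_features` variables."""
    total = comb(n_features + degree, degree)
    return total if include_bias else total - 1

# Hypothetical input widths, expanded to degree 5
for n in (5, 10, 13):
    print(n, "input features ->", n_polynomial_features(n, 5), "polynomial features")
```

With this many derived columns, the slower runtimes and the gap between training and test scores seen later are not surprising.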

### Scaling Options

Depending on the type of model, standardizing or using a power transformer to scale the variables may improve the performance of the algorithm. Both a standard scaler and a power transformer are set up here so that they can be compared with each other.

In [7]:
# Run the data through a standard scaler
ss = StandardScaler()
# Training data
X_tr_sc = ss.fit_transform(X_train)
# Test data
X_te_sc = ss.transform(X_test)
# Validation data
X_val_sc = ss.transform(X_val)
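
The pattern above (fit on training data, then only transform test and validation data) can be sketched for a single column in plain Python. The tempo values are hypothetical; the point is that the mean and standard deviation come from the training set alone.

```python
from statistics import mean, pstdev

def fit_scaler(column):
    """Learn the mean and population standard deviation from training values."""
    return mean(column), pstdev(column)

def apply_scaler(column, mu, sigma):
    """Standardize values with statistics learned elsewhere."""
    return [(x - mu) / sigma for x in column]

train_tempo = [120.0, 128.0, 140.0, 174.0]  # hypothetical BPM values
test_tempo = [126.0, 150.0]

mu, sigma = fit_scaler(train_tempo)                  # statistics from training data only
train_scaled = apply_scaler(train_tempo, mu, sigma)
test_scaled = apply_scaler(test_tempo, mu, sigma)    # reuse the training statistics
print(train_scaled, test_scaled)
```

Reusing the training statistics keeps information about the test and validation sets from leaking into preprocessing.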

In [8]:
# Run the data through a power transformer
pt = PowerTransformer()
# Training data
X_tr_pt = pt.fit_transform(X_train)
# Test data
X_te_pt = pt.transform(X_test)
# Validation data
X_val_pt = pt.transform(X_val)
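
By default, `PowerTransformer` applies the Yeo-Johnson transform, estimates one lambda per feature from the training data, and then standardizes the result. The sketch below shows only the transform itself for a hand-picked lambda, which is an assumption made for illustration.

```python
import math

def yeo_johnson(x, lmbda):
    """Yeo-Johnson transform of a single value for a fixed lambda."""
    if x >= 0:
        if abs(lmbda) < 1e-12:
            return math.log1p(x)
        return ((x + 1) ** lmbda - 1) / lmbda
    if abs(lmbda - 2) < 1e-12:
        return -math.log1p(-x)
    return -(((-x + 1) ** (2 - lmbda)) - 1) / (2 - lmbda)

# Hypothetical right-skewed values; a lambda below 1 compresses the long tail
skewed = [0.0, 1.0, 4.0, 25.0, 400.0]
print([round(yeo_johnson(v, 0.2), 3) for v in skewed])
```

With lambda equal to 1 the transform leaves values unchanged, so the fitted lambda indicates how much reshaping a feature needed.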



## Baseline Models

A series of baseline models will be fit on unscaled, standard-scaled, and power-transformed data. This will give a general idea of which preprocessing method works best and which models show the most potential and should be tuned further.
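
The baseline cells in this section all repeat the same fit-and-score pattern. As a sketch of how that could be factored out, the hypothetical helper `score_on_splits` below works with any object exposing `fit` and `score`; the `MajorityClassifier` is a toy stand-in so the example runs without the song data.

```python
def score_on_splits(model, splits, fit_on="train"):
    """Fit `model` on one split and return its score on every split."""
    X_fit, y_fit = splits[fit_on]
    model.fit(X_fit, y_fit)
    return {name: model.score(X, y) for name, (X, y) in splits.items()}

class MajorityClassifier:
    """Toy estimator that always predicts the most common training label."""
    def fit(self, X, y):
        self.label_ = max(set(y), key=list(y).count)
        return self

    def score(self, X, y):
        return sum(label == self.label_ for label in y) / len(y)

# Hypothetical miniature splits
splits = {
    "train": ([[0], [1], [2], [3]], ["house", "house", "house", "techno"]),
    "test": ([[4], [5]], ["house", "techno"]),
}
scores = score_on_splits(MajorityClassifier(), splits)
for name, value in scores.items():
    print(f"{name.title()} Score: {value}")
```

With the real data, `splits` could hold the training, test, and validation arrays for one preprocessing option at a time.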

### Logistic Regression

In [9]:
# Logistic regression baseline model
lr = LogisticRegression(random_state=72)
lr.fit(X_train, y_train)
print(f"Training Score: {lr.score(X_train, y_train)}")
print(f"Test Score: {lr.score(X_test, y_test)}")
print(f"Validation Score: {lr.score(X_val, val['genre'])}")



Training Score: 0.37481525273425953
Test Score: 0.35845896147403683
Validation Score: 0.28


In [10]:
# Logistic regression baseline model with standard scaler
lr.fit(X_tr_sc, y_train)
print(f"Training Score: {lr.score(X_tr_sc, y_train)}")
print(f"Test Score: {lr.score(X_te_sc, y_test)}")
print(f"Validation Score: {lr.score(X_val_sc, val['genre'])}")



Training Score: 0.7360331067100206
Test Score: 0.6834170854271356
Validation Score: 0.85


In [11]:
# Logistic regression baseline model with power transformer
lr.fit(X_tr_pt, y_train)
print(f"Training Score: {lr.score(X_tr_pt, y_train)}")
print(f"Test Score: {lr.score(X_te_pt, y_test)}")
print(f"Validation Score: {lr.score(X_val_pt, val['genre'])}")



Training Score: 0.7375110848359444
Test Score: 0.6666666666666666
Validation Score: 0.8


### K-Nearest Neighbor

In [12]:
# KNN baseline model
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
print(f"Training Score: {knn.score(X_train, y_train)}")
print(f"Test Score: {knn.score(X_test, y_test)}")
print(f"Validation Score: {knn.score(X_val, val['genre'])}")

Training Score: 0.5900088678687555
Test Score: 0.37018425460636517
Validation Score: 0.33


In [13]:
# KNN baseline model with standard scaler
knn.fit(X_tr_sc, y_train)
print(f"Training Score: {knn.score(X_tr_sc, y_train)}")
print(f"Test Score: {knn.score(X_te_sc, y_test)}")
print(f"Validation Score: {knn.score(X_val_sc, val['genre'])}")

Training Score: 0.7783032811114395
Test Score: 0.6515912897822446
Validation Score: 0.84


In [14]:
# KNN baseline model with power transformer
knn.fit(X_tr_pt, y_train)
print(f"Training Score: {knn.score(X_tr_pt, y_train)}")
print(f"Test Score: {knn.score(X_te_pt, y_test)}")
print(f"Validation Score: {knn.score(X_val_pt, val['genre'])}")

Training Score: 0.7821460242388413
Test Score: 0.661641541038526
Validation Score: 0.78


### Decision Tree

In [15]:
# Decision tree baseline model (unscaled data)
dt = DecisionTreeClassifier(random_state=72)
dt.fit(X_train, y_train)
print(f"Training Score: {dt.score(X_train, y_train)}")
print(f"Test Score: {dt.score(X_test, y_test)}")
print(f"Validation Score: {dt.score(X_val, val['genre'])}")

Training Score: 0.9234407330771505
Test Score: 0.6197654941373534
Validation Score: 0.76


In [16]:
# Decision tree baseline model with standard scaler
dt.fit(X_tr_sc, y_train)
print(f"Training Score: {dt.score(X_tr_sc, y_train)}")
print(f"Test Score: {dt.score(X_te_sc, y_test)}")
print(f"Validation Score: {dt.score(X_val_sc, val['genre'])}")

Training Score: 0.9234407330771505
Test Score: 0.6197654941373534
Validation Score: 0.76


In [17]:
# Decision tree baseline model with power transformer
dt.fit(X_tr_pt, y_train)
print(f"Training Score: {dt.score(X_tr_pt, y_train)}")
print(f"Test Score: {dt.score(X_te_pt, y_test)}")
print(f"Validation Score: {dt.score(X_val_pt, val['genre'])}")

Training Score: 0.9234407330771505
Test Score: 0.6164154103852596
Validation Score: 0.75


### Bagging

In [18]:
# Bagging baseline model (unscaled data)
bag = BaggingClassifier(random_state=72)
bag.fit(X_train, y_train)
print(f"Training Score: {bag.score(X_train, y_train)}")
print(f"Test Score: {bag.score(X_test, y_test)}")
print(f"Validation Score: {bag.score(X_val, val['genre'])}")

Training Score: 0.9154596511971623
Test Score: 0.7035175879396985
Validation Score: 0.88


In [19]:
# Bagging baseline model with standard scaler
bag.fit(X_tr_sc, y_train)
print(f"Training Score: {bag.score(X_tr_sc, y_train)}")
print(f"Test Score: {bag.score(X_te_sc, y_test)}")
print(f"Validation Score: {bag.score(X_val_sc, val['genre'])}")

Training Score: 0.9154596511971623
Test Score: 0.7035175879396985
Validation Score: 0.88


In [20]:
# Bagging baseline model with power transformer
bag.fit(X_tr_pt, y_train)
print(f"Training Score: {bag.score(X_tr_pt, y_train)}")
print(f"Test Score: {bag.score(X_te_pt, y_test)}")
print(f"Validation Score: {bag.score(X_val_pt, val['genre'])}")

Training Score: 0.9160508424475318
Test Score: 0.7035175879396985
Validation Score: 0.91


### Random Forest

In [21]:
# Random forest baseline model
rf = RandomForestClassifier(random_state=72)
rf.fit(X_train, y_train)
print(f"Training Score: {rf.score(X_train, y_train)}")
print(f"Test Score: {rf.score(X_test, y_test)}")
print(f"Validation Score: {rf.score(X_val, val['genre'])}")



Training Score: 0.9160508424475318
Test Score: 0.6968174204355109
Validation Score: 0.86


In [22]:
# Random forest baseline model with standard scaler
rf.fit(X_tr_sc, y_train)
print(f"Training Score: {rf.score(X_tr_sc, y_train)}")
print(f"Test Score: {rf.score(X_te_sc, y_test)}")
print(f"Validation Score: {rf.score(X_val_sc, val['genre'])}")

Training Score: 0.9160508424475318
Test Score: 0.6968174204355109
Validation Score: 0.86


In [23]:
# Random forest baseline model with power transformer
rf.fit(X_tr_pt, y_train)
print(f"Training Score: {rf.score(X_tr_pt, y_train)}")
print(f"Test Score: {rf.score(X_te_pt, y_test)}")
print(f"Validation Score: {rf.score(X_val_pt, val['genre'])}")

Training Score: 0.914572864321608
Test Score: 0.6901172529313233
Validation Score: 0.82


### AdaBoost

In [24]:
# Adaboost baseline model
ada = AdaBoostClassifier(random_state=72)
ada.fit(X_train, y_train)
print(f"Training Score: {ada.score(X_train, y_train)}")
print(f"Test Score: {ada.score(X_test, y_test)}")
print(f"Validation Score: {ada.score(X_val, val['genre'])}")

Training Score: 0.7652970736033107
Test Score: 0.7169179229480737
Validation Score: 0.84


In [25]:
# Adaboost baseline model with standard scaler
ada.fit(X_tr_sc, y_train)
print(f"Training Score: {ada.score(X_tr_sc, y_train)}")
print(f"Test Score: {ada.score(X_te_sc, y_test)}")
print(f"Validation Score: {ada.score(X_val_sc, val['genre'])}")

Training Score: 0.7652970736033107
Test Score: 0.7169179229480737
Validation Score: 0.84


In [26]:
# AdaBoost baseline model with power transformer
ada.fit(X_tr_pt, y_train)
print(f"Training Score: {ada.score(X_tr_pt, y_train)}")
print(f"Test Score: {ada.score(X_te_pt, y_test)}")
print(f"Validation Score: {ada.score(X_val_pt, val['genre'])}")

Training Score: 0.7611587348507242
Test Score: 0.7386934673366834
Validation Score: 0.91


### Gradient Boost

In [27]:
# Gradient Boost baseline model
gb = GradientBoostingClassifier(random_state=72)
gb.fit(X_train, y_train)
print(f"Training Score: {gb.score(X_train, y_train)}")
print(f"Test Score: {gb.score(X_test, y_test)}")
print(f"Validation Score: {gb.score(X_val, val['genre'])}")

Training Score: 0.8604788649127993
Test Score: 0.7420435510887772
Validation Score: 0.9


In [28]:
# Gradient Boost baseline model with standard scaler
gb.fit(X_tr_sc, y_train)
print(f"Training Score: {gb.score(X_tr_sc, y_train)}")
print(f"Test Score: {gb.score(X_te_sc, y_test)}")
print(f"Validation Score: {gb.score(X_val_sc, val['genre'])}")

Training Score: 0.8604788649127993
Test Score: 0.7420435510887772
Validation Score: 0.9


In [29]:
# Gradient Boost baseline model with power transformer
gb.fit(X_tr_pt, y_train)
print(f"Training Score: {gb.score(X_tr_pt, y_train)}")
print(f"Test Score: {gb.score(X_te_pt, y_test)}")
print(f"Validation Score: {gb.score(X_val_pt, val['genre'])}")

Training Score: 0.857227313035767
Test Score: 0.7353433835845896
Validation Score: 0.91


### XGBoost

In [30]:
# XGBoost baseline model
xgb = XGBClassifier(random_state=72)
xgb.fit(X_train, y_train)
print(f"Training Score: {xgb.score(X_train, y_train)}")
print(f"Test Score: {xgb.score(X_test, y_test)}")
print(f"Validation Score: {xgb.score(X_val, val['genre'])}")

Training Score: 0.8294413242684008
Test Score: 0.7487437185929648
Validation Score: 0.91


In [31]:
# XGBoost baseline model with standard scaler
xgb.fit(X_tr_sc, y_train)
print(f"Training Score: {xgb.score(X_tr_sc, y_train)}")
print(f"Test Score: {xgb.score(X_te_sc, y_test)}")
print(f"Validation Score: {xgb.score(X_val_sc, val['genre'])}")

Training Score: 0.8294413242684008
Test Score: 0.7487437185929648
Validation Score: 0.91


In [32]:
# XGBoost baseline model with power transformer
xgb.fit(X_tr_pt, y_train)
print(f"Training Score: {xgb.score(X_tr_pt, y_train)}")
print(f"Test Score: {xgb.score(X_te_pt, y_test)}")
print(f"Validation Score: {xgb.score(X_val_pt, val['genre'])}")

Training Score: 0.8323972805202483
Test Score: 0.7403685092127303
Validation Score: 0.9


Based on the results above, particularly the validation scores, the models appear to perform best overall when the data is run through a power transformer. In addition, the most promising models seem to be Bagging, Gradient Boost and XGBoost. The next step is to improve the tree-based and ensemble algorithms with parameter tuning.
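
To make the comparison easier to scan, the sketch below collects the test and validation scores from the power transformer runs above (rounded) and sorts them. Ranking by validation score first and test score second is an assumption made for illustration, not a rule the notebook states.

```python
# (test score, validation score) from the power transformer baselines, rounded
results = {
    "Logistic Regression": (0.667, 0.80),
    "KNN": (0.662, 0.78),
    "Decision Tree": (0.616, 0.75),
    "Bagging": (0.704, 0.91),
    "Random Forest": (0.690, 0.82),
    "AdaBoost": (0.739, 0.91),
    "Gradient Boost": (0.735, 0.91),
    "XGBoost": (0.740, 0.90),
}

# Sort by validation score, breaking ties with the test score
ranked = sorted(results.items(), key=lambda item: (item[1][1], item[1][0]), reverse=True)
for name, (test, validation) in ranked:
    print(f"{name:<20} test={test:.3f}  validation={validation:.2f}")
```

Several models are separated by only 0.01 on validation, so small differences in this ranking should be read with caution.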

## Model Tuning

### Decision Tree

In [33]:
# Set parameter distributions for a randomized search over a decision tree
params = {
    'max_depth': range(1,25),
    'min_samples_split': range(2,25),
    'min_samples_leaf': range(1,25),
    'min_weight_fraction_leaf': np.linspace(0,.5,100),
    'max_features': np.linspace(.001,1.0,100),
    'max_leaf_nodes': range(2,100),
    'min_impurity_decrease': np.linspace(0,1,100),
    'presort': range(2)
}
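
A randomized search samples only a small part of a space like this. As a rough sketch, the block below counts the combinations defined by the decision tree distributions above and compares that to the 500 iterations used in the next cell.

```python
from math import prod

# Number of candidate values per parameter in the distributions above
n_values = {
    "max_depth": len(range(1, 25)),
    "min_samples_split": len(range(2, 25)),
    "min_samples_leaf": len(range(1, 25)),
    "min_weight_fraction_leaf": 100,   # np.linspace(0, .5, 100)
    "max_features": 100,               # np.linspace(.001, 1.0, 100)
    "max_leaf_nodes": len(range(2, 100)),
    "min_impurity_decrease": 100,      # np.linspace(0, 1, 100)
    "presort": len(range(2)),
}

total = prod(n_values.values())
n_iter = 500
print(f"{total:,} combinations, {n_iter} sampled ({n_iter / total:.2e} of the space)")
```

An exhaustive grid over this space would be impractical, which is why the randomized search is used first and a narrow grid search follows.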

In [34]:
# Instantiate and fit a randomized search with 3 folds
rs = RandomizedSearchCV(dt, params, n_iter=500, cv=3, random_state=72)
rs.fit(X_tr_pt, y_train)

RandomizedSearchCV(cv=3, error_score='raise-deprecating',
          estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=72,
            splitter='best'),
          fit_params=None, iid='warn', n_iter=500, n_jobs=None,
          param_distributions={'max_depth': range(1, 25), 'min_samples_split': range(2, 25), 'min_samples_leaf': range(1, 25), 'min_weight_fraction_leaf': array([0.     , 0.00505, ..., 0.49495, 0.5    ]), 'max_features': array([0.001  , 0.01109, ..., 0.98991, 1.     ]), 'max_leaf_nodes': range(2, 100), 'min_impurity_decrease': array([0.    , 0.0101, ..., 0.9899, 1.    ]), 'presort': range(0, 2)},
          pre_dispatch='2*n_jobs', random_state=72, refit=True,
          return_train_score='warn', scoring=

In [35]:
# Display the best performing parameters
rs.best_params_

{'presort': 0,
 'min_weight_fraction_leaf': 0.0,
 'min_samples_split': 5,
 'min_samples_leaf': 21,
 'min_impurity_decrease': 0.020202020202020204,
 'max_leaf_nodes': 92,
 'max_features': 0.5963636363636364,
 'max_depth': 3}

In [36]:
# Score the best decision tree estimator on training, test, and validation
print(f"Training Score: {rs.best_estimator_.score(X_tr_pt, y_train)}")
print(f"Test Score: {rs.best_estimator_.score(X_te_pt, y_test)}")
print(f"Validation Score: {rs.best_estimator_.score(X_val_pt, val['genre'])}")

Training Score: 0.7658882648536801
Test Score: 0.7135678391959799
Validation Score: 0.91


The randomized search produced a model with a validation score of 0.91 and a much smaller gap between training and test scores than the baseline decision tree. The best parameters from this search will now be refined with a grid search in an attempt to improve the model further.
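
One way to build such a follow-up grid is to take a small neighborhood around each best value from the randomized search. The sketch below uses the best parameters reported above; the helper name `around` is hypothetical, and the float-valued parameters are written out by hand.

```python
from itertools import product

def around(value, step=1):
    """Return three candidates centred on `value`."""
    return [value - step, value, value + step]

# Integer-valued best parameters from the randomized search
best = {"min_samples_split": 5, "min_samples_leaf": 21, "max_leaf_nodes": 92, "max_depth": 3}
grid = {name: around(value) for name, value in best.items()}

# Float-valued parameters, rounded to a coarse neighborhood by hand
grid["min_impurity_decrease"] = [0.01, 0.02, 0.03]
grid["max_features"] = [0.5, 0.6, 0.7]

n_candidates = len(list(product(*grid.values())))
n_fits = n_candidates * 3  # 3-fold cross-validation, before the final refit
print(n_candidates, "candidates,", n_fits, "fits")
```

Three values for each of six parameters keeps the exhaustive search small enough to run quickly.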

In [37]:
# Set the parameter grid for a grid search over a decision tree
params = {
    'min_samples_split': range(4,7),
    'min_samples_leaf': range(20,23),
    'min_impurity_decrease': [.01,.02,.03],
    'max_leaf_nodes': range(91,94),
    'max_features': [.5,.6,.7],
    'max_depth': range(2,5)
}

In [38]:
# Instantiate and fit a grid search with 3 folds
gs = GridSearchCV(rs.best_estimator_, params, cv=3)
gs.fit(X_tr_pt, y_train)

GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=0.5963636363636364, max_leaf_nodes=92,
            min_impurity_decrease=0.020202020202020204,
            min_impurity_split=None, min_samples_leaf=21,
            min_samples_split=5, min_weight_fraction_leaf=0.0, presort=0,
            random_state=72, splitter='best'),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'min_samples_split': range(4, 7), 'min_samples_leaf': range(20, 23), 'min_impurity_decrease': [0.01, 0.02, 0.03], 'max_leaf_nodes': range(91, 94), 'max_features': [0.5, 0.6, 0.7], 'max_depth': range(2, 5)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [39]:
# Display the best performing parameters
gs.best_params_

{'max_depth': 4,
 'max_features': 0.5,
 'max_leaf_nodes': 91,
 'min_impurity_decrease': 0.01,
 'min_samples_leaf': 20,
 'min_samples_split': 4}

In [40]:
# Score the best decision tree estimator on training, test, and validation
print(f"Training Score: {gs.best_estimator_.score(X_tr_pt, y_train)}")
print(f"Test Score: {gs.best_estimator_.score(X_te_pt, y_test)}")
print(f"Validation Score: {gs.best_estimator_.score(X_val_pt, val['genre'])}")

Training Score: 0.7803724504877327
Test Score: 0.7303182579564489
Validation Score: 0.92


In [41]:
# Set the parameter grid for a second grid search over a decision tree
params = {
    'min_samples_split': range(3,6),
    'min_samples_leaf': range(19,22),
    'min_impurity_decrease': [0,.01,.02],
    'max_leaf_nodes': range(90,93),
    'max_features': [.4,.5,.6],
    'max_depth': range(3,6)
}

In [42]:
# Instantiate and fit a grid search with 3 folds
gs2 = GridSearchCV(rs.best_estimator_, params, cv=3)
gs2.fit(X_tr_pt, y_train)

GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=0.5963636363636364, max_leaf_nodes=92,
            min_impurity_decrease=0.020202020202020204,
            min_impurity_split=None, min_samples_leaf=21,
            min_samples_split=5, min_weight_fraction_leaf=0.0, presort=0,
            random_state=72, splitter='best'),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'min_samples_split': range(3, 6), 'min_samples_leaf': range(19, 22), 'min_impurity_decrease': [0, 0.01, 0.02], 'max_leaf_nodes': range(90, 93), 'max_features': [0.4, 0.5, 0.6], 'max_depth': range(3, 6)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [43]:
# Display the best performing parameters
gs2.best_params_

{'max_depth': 4,
 'max_features': 0.4,
 'max_leaf_nodes': 90,
 'min_impurity_decrease': 0.01,
 'min_samples_leaf': 19,
 'min_samples_split': 3}

In [44]:
# Score the best decision tree estimator on training, test, and validation
print(f"Training Score: {gs2.best_estimator_.score(X_tr_pt, y_train)}")
print(f"Test Score: {gs2.best_estimator_.score(X_te_pt, y_test)}")
print(f"Validation Score: {gs2.best_estimator_.score(X_val_pt, val['genre'])}")

Training Score: 0.7777120898610701
Test Score: 0.7269681742043551
Validation Score: 0.91


In [45]:
# Save the best estimator as the final decision tree model
dt = gs.best_estimator_

### Bagging

In [46]:
# Set parameter distributions for a randomized search over a bagging classifier
params = {
    'n_estimators': range(1,100),
    'max_samples': np.linspace(.001,1,100),
    'max_features': np.linspace(.001,1,100),
    'bootstrap': range(2),
    'bootstrap_features': range(2),
    'warm_start': range(2)
}

In [47]:
# Instantiate and fit a randomized search with 3 folds
rs = RandomizedSearchCV(bag, params, n_iter=50, cv=3, random_state=72)
rs.fit(X_tr_pt, y_train)

RandomizedSearchCV(cv=3, error_score='raise-deprecating',
          estimator=BaggingClassifier(base_estimator=None, bootstrap=True,
         bootstrap_features=False, max_features=1.0, max_samples=1.0,
         n_estimators=10, n_jobs=None, oob_score=False, random_state=72,
         verbose=0, warm_start=False),
          fit_params=None, iid='warn', n_iter=50, n_jobs=None,
          param_distributions={'n_estimators': range(1, 100), 'max_samples': array([0.001  , 0.01109, ..., 0.98991, 1.     ]), 'max_features': array([0.001  , 0.01109, ..., 0.98991, 1.     ]), 'bootstrap': range(0, 2), 'bootstrap_features': range(0, 2), 'warm_start': range(0, 2)},
          pre_dispatch='2*n_jobs', random_state=72, refit=True,
          return_train_score='warn', scoring=None, verbose=0)

In [48]:
# Display the best performing parameters
rs.best_params_

{'warm_start': 0,
 'n_estimators': 96,
 'max_samples': 0.25327272727272726,
 'max_features': 0.4752727272727273,
 'bootstrap_features': 1,
 'bootstrap': 0}

In [49]:
# Score the best bagging estimator on training, test, and validation
print(f"Training Score: {rs.best_estimator_.score(X_tr_pt, y_train)}")
print(f"Test Score: {rs.best_estimator_.score(X_te_pt, y_test)}")
print(f"Validation Score: {rs.best_estimator_.score(X_val_pt, val['genre'])}")

Training Score: 0.8486550399054094
Test Score: 0.7453936348408711
Validation Score: 0.93


In [50]:
# Set the parameter grid for a grid search over a bagging classifier
params = {
    'n_estimators': range(58,63),
    'bootstrap': range(2),
    'bootstrap_features': range(2),
    'warm_start': range(2)
}

In [51]:
# Instantiate and fit a grid search with 3 folds
gs = GridSearchCV(rs.best_estimator_, params, cv=3)
gs.fit(X_tr_pt, y_train)

GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=BaggingClassifier(base_estimator=None, bootstrap=0, bootstrap_features=1,
         max_features=0.4752727272727273, max_samples=0.25327272727272726,
         n_estimators=96, n_jobs=None, oob_score=False, random_state=72,
         verbose=0, warm_start=0),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'n_estimators': range(58, 63), 'bootstrap': range(0, 2), 'bootstrap_features': range(0, 2), 'warm_start': range(0, 2)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [52]:
# Display the best performing parameters
gs.best_params_

{'bootstrap': 1, 'bootstrap_features': 0, 'n_estimators': 60, 'warm_start': 0}

In [53]:
# Score the best bagging estimator on training, test, and validation
print(f"Training Score: {gs.best_estimator_.score(X_tr_pt, y_train)}")
print(f"Test Score: {gs.best_estimator_.score(X_te_pt, y_test)}")
print(f"Validation Score: {gs.best_estimator_.score(X_val_pt, val['genre'])}")

Training Score: 0.8424475317765298
Test Score: 0.7437185929648241
Validation Score: 0.92


In [54]:
# Save the best estimator as the final bagging model
bag = rs.best_estimator_

### Random Forest

In [55]:
# Set parameter distributions for a randomized search over a random forest
params = {
    'n_estimators': range(1,200),
    'criterion': ['gini','entropy'],
    'max_depth': range(1,25),
    'min_samples_split': range(2,25),
    'min_samples_leaf': range(1,25),
    'min_weight_fraction_leaf': np.linspace(0,.5,100),
    'max_features': np.linspace(.001,1.0,100),
    'max_leaf_nodes': range(2,100),
    'min_impurity_decrease': np.linspace(0,1,100),
    'bootstrap': range(2),
    'warm_start': range(2)
}

In [56]:
# Instantiate and fit a randomized search with 3 folds
rs = RandomizedSearchCV(rf, params, n_iter=50, cv=3, random_state=72)
rs.fit(X_tr_pt, y_train)

RandomizedSearchCV(cv=3, error_score='raise-deprecating',
          estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=72, verbose=0, warm_start=False),
          fit_params=None, iid='warn', n_iter=50, n_jobs=None,
          param_distributions={'n_estimators': range(1, 200), 'criterion': ['gini', 'entropy'], 'max_depth': range(1, 25), 'min_samples_split': range(2, 25), 'min_samples_leaf': range(1, 25), 'min_weight_fraction_leaf': array([0.     , 0.00505, ..., 0.49495, 0.5    ]), 'max_features': array([0.001  , 0.01109, ..., 0.98991, 1.     ]), 'max_leaf_nodes': range(2, 100), 'min_impurity_decrease': array([0.    , 0.0101, ..., 0.9899, 1.    ]), 'bootstr

In [57]:
# Display the best performing parameters
rs.best_params_

{'warm_start': 1,
 'n_estimators': 125,
 'min_weight_fraction_leaf': 0.19191919191919193,
 'min_samples_split': 9,
 'min_samples_leaf': 22,
 'min_impurity_decrease': 0.0,
 'max_leaf_nodes': 26,
 'max_features': 0.3440909090909091,
 'max_depth': 5,
 'criterion': 'entropy',
 'bootstrap': 1}

In [58]:
# Score the best random forest estimator on training, test, and validation
print(f"Training Score: {rs.best_estimator_.score(X_tr_pt, y_train)}")
print(f"Test Score: {rs.best_estimator_.score(X_te_pt, y_test)}")
print(f"Validation Score: {rs.best_estimator_.score(X_val_pt, val['genre'])}")

Training Score: 0.7496305054685191
Test Score: 0.7269681742043551
Validation Score: 0.89


In [59]:
# Set the parameter grid for a grid search over a random forest
params = {
    'n_estimators': [100,125,150],
    'criterion': ['gini','entropy'],
    'max_depth': [4,5,6],
    'min_samples_leaf': [20,22,24],
    'warm_start': range(2)
}

In [60]:
# Instantiate and fit a grid search with 3 folds
gs = GridSearchCV(rs.best_estimator_, params, cv=3)
gs.fit(X_tr_pt, y_train)

GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=RandomForestClassifier(bootstrap=1, class_weight=None, criterion='entropy',
            max_depth=5, max_features=0.3440909090909091,
            max_leaf_nodes=26, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=22,
            min_samples_split=9,
            min_weight_fraction_leaf=0.19191919191919193, n_estimators=125,
            n_jobs=None, oob_score=False, random_state=72, verbose=0,
            warm_start=1),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'n_estimators': [100, 125, 150], 'criterion': ['gini', 'entropy'], 'max_depth': [4, 5, 6], 'min_samples_leaf': [20, 22, 24], 'warm_start': range(0, 2)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [61]:
# Display the best performing parameters
gs.best_params_

{'criterion': 'entropy',
 'max_depth': 4,
 'min_samples_leaf': 20,
 'n_estimators': 125,
 'warm_start': 0}

In [62]:
# Score the best random forest estimator on training, test, and validation
print(f"Training Score: {gs.best_estimator_.score(X_tr_pt, y_train)}")
print(f"Test Score: {gs.best_estimator_.score(X_te_pt, y_test)}")
print(f"Validation Score: {gs.best_estimator_.score(X_val_pt, val['genre'])}")

Training Score: 0.7496305054685191
Test Score: 0.7269681742043551
Validation Score: 0.89


### AdaBoost

In [63]:
# Set parameter distributions for a randomized search over adaboost
params = {
    'learning_rate': np.linspace(.001,5,100),
    'n_estimators': range(1,100),
    'algorithm': ['SAMME', 'SAMME.R']
}

In [64]:
# Instantiate and fit a randomized search with 3 folds
rs = RandomizedSearchCV(ada, params, n_iter=50, cv=3, random_state=72)
rs.fit(X_tr_pt, y_train)


RandomizedSearchCV(cv=3, error_score='raise-deprecating',
          estimator=AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=72),
          fit_params=None, iid='warn', n_iter=50, n_jobs=None,
          param_distributions={'learning_rate': array([1.00000e-03, 5.14949e-02, ..., 4.94951e+00, 5.00000e+00]), 'n_estimators': range(1, 100), 'algorithm': ['SAMME', 'SAMME.R']},
          pre_dispatch='2*n_jobs', random_state=72, refit=True,
          return_train_score='warn', scoring=None, verbose=0)

In [65]:
# Display the best performing parameters
rs.best_params_

{'n_estimators': 65, 'learning_rate': 0.7584242424242423, 'algorithm': 'SAMME'}

In [66]:
# Score the best adaboost estimator on training, test, and validation
print(f"Training Score: {rs.best_estimator_.score(X_tr_pt, y_train)}")
print(f"Test Score: {rs.best_estimator_.score(X_te_pt, y_test)}")
print(f"Validation Score: {rs.best_estimator_.score(X_val_pt, val['genre'])}")

Training Score: 0.7723913686077446
Test Score: 0.7286432160804021
Validation Score: 0.86


In [67]:
# Set the parameter grid for a grid search over adaboost
params = {
    'learning_rate': [.5,.75,1],
    'n_estimators': range(60,71),
    'algorithm': ['SAMME', 'SAMME.R']
}

In [68]:
# Instantiate and fit a grid search with 3 folds
gs = GridSearchCV(rs.best_estimator_, params, cv=3)
gs.fit(X_tr_pt, y_train)

GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=AdaBoostClassifier(algorithm='SAMME', base_estimator=None,
          learning_rate=0.7584242424242423, n_estimators=65,
          random_state=72),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'learning_rate': [0.5, 0.75, 1], 'n_estimators': range(60, 71), 'algorithm': ['SAMME', 'SAMME.R']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [69]:
# Display the best performing parameters
gs.best_params_

{'algorithm': 'SAMME', 'learning_rate': 1, 'n_estimators': 68}

In [70]:
# Score the best adaboost estimator on training, test, and validation
print(f"Training Score: {gs.best_estimator_.score(X_tr_pt, y_train)}")
print(f"Test Score: {gs.best_estimator_.score(X_te_pt, y_test)}")
print(f"Validation Score: {gs.best_estimator_.score(X_val_pt, val['genre'])}")

Training Score: 0.7650014779781259
Test Score: 0.7252931323283082
Validation Score: 0.8


### Gradient Boost

In [72]:
# Set parameter distributions for a randomized search over gradient boost
params = {
    'learning_rate': np.linspace(.001,.5,100),
    'n_estimators': range(1,200),
    'subsample': np.linspace(0,1,100),  # note: 0 itself is not a valid subsample fraction
    'min_samples_split': range(2,10),
    'min_samples_leaf': range(1,10),
    'max_depth': range(1,50),
    'min_impurity_decrease': np.linspace(.001,1)
}

In [73]:
# Instantiate and fit a randomized search with 3 folds
rs = RandomizedSearchCV(gb, params, n_iter=5, cv=3, random_state=72)
rs.fit(X_tr_pt, y_train)

RandomizedSearchCV(cv=3, error_score='raise-deprecating',
          estimator=GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_sampl...      subsample=1.0, tol=0.0001, validation_fraction=0.1,
              verbose=0, warm_start=False),
          fit_params=None, iid='warn', n_iter=5, n_jobs=None,
          param_distributions={'learning_rate': array([0.001  , 0.00604, ..., 0.49496, 0.5    ]), 'n_estimators': range(1, 200), 'subsample': array([0.    , 0.0101, ..., 0.9899, 1.    ]), 'min_samples_split': range(2, 10), 'min_samples_leaf': range(1, 10), 'max_depth': range(1, 50), 'min_impurity_decrease': a...51, 0.8369 ,
       0.85729, 0.87767, 0.89806, 0.91845, 0.93884, 0.95922, 0.97961,
       1.     ])},
          pre_dispatch='2*n_jobs', random_state=72

In [74]:
# Display the best performing parameters
rs.best_params_

{'subsample': 0.29292929292929293,
 'n_estimators': 131,
 'min_samples_split': 4,
 'min_samples_leaf': 6,
 'min_impurity_decrease': 0.18448979591836737,
 'max_depth': 18,
 'learning_rate': 0.046363636363636364}

In [75]:
# Score the best gradient boost estimator on training, test, and validation
print(f"Training Score: {rs.best_estimator_.score(X_tr_pt, y_train)}")
print(f"Test Score: {rs.best_estimator_.score(X_te_pt, y_test)}")
print(f"Validation Score: {rs.best_estimator_.score(X_val_pt, val['genre'])}")

Training Score: 0.9074785693171741
Test Score: 0.7085427135678392
Validation Score: 0.93


In [76]:
# Set the parameter grid for a grid search over gradient boost
params = {
    'criterion': ['friedman_mse','mse','mae'],
    'warm_start': range(2),
    'presort': range(2)
}

In [77]:
# Instantiate and fit a grid search with 3 folds
gs = GridSearchCV(rs.best_estimator_, params, cv=3)
gs.fit(X_tr_pt, y_train)

GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.046363636363636364, loss='deviance',
              max_depth=18, max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.18448979591836737,
              min_impurity_split=N...0.29292929292929293, tol=0.0001,
              validation_fraction=0.1, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'criterion': ['friedman_mse', 'mse', 'mae'], 'warm_start': range(0, 2), 'presort': range(0, 2)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [78]:
# Display the best performing parameters
gs.best_params_

{'criterion': 'friedman_mse', 'presort': 0, 'warm_start': 0}

In [79]:
# Score the best gradient boost estimator on training, test, and validation
print(f"Training Score: {gs.best_estimator_.score(X_tr_pt, y_train)}")
print(f"Test Score: {gs.best_estimator_.score(X_te_pt, y_test)}")
print(f"Validation Score: {gs.best_estimator_.score(X_val_pt, val['genre'])}")

Training Score: 0.9071829736919894
Test Score: 0.711892797319933
Validation Score: 0.92


In [80]:
# Set the parameter grid for a second grid search over gradient boost
params = {
    'min_weight_fraction_leaf': [0.,.25,.5],
    'max_features': [None,'auto','log2'],
    'max_leaf_nodes': [None,25]
}

In [81]:
# Instantiate and fit a grid search with 3 folds
gs2 = GridSearchCV(rs.best_estimator_, params, cv=3)
gs2.fit(X_tr_pt, y_train)

GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.046363636363636364, loss='deviance',
              max_depth=18, max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.18448979591836737,
              min_impurity_split=N...0.29292929292929293, tol=0.0001,
              validation_fraction=0.1, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'min_weight_fraction_leaf': [0.0, 0.25, 0.5], 'max_features': [None, 'auto', 'log2'], 'max_leaf_nodes': [None, 25]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [82]:
# Display the best performing parameters
gs2.best_params_

{'max_features': None,
 'max_leaf_nodes': None,
 'min_weight_fraction_leaf': 0.25}

In [83]:
# Score the best gradient boost estimator on training, test, and validation
print(f"Training Score: {gs2.best_estimator_.score(X_tr_pt, y_train)}")
print(f"Test Score: {gs2.best_estimator_.score(X_te_pt, y_test)}")
print(f"Validation Score: {gs2.best_estimator_.score(X_val_pt, val['genre'])}")

Training Score: 0.7853975761158735
Test Score: 0.7353433835845896
Validation Score: 0.89


In [84]:
# Save the best estimator as the final gradient boost model
gb = rs.best_estimator_
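
The three gradient boost candidates above can be compared by the gap between training and test accuracy as well as by test accuracy alone. The sketch below is only an illustration: the scores are copied from the outputs above and rounded to four decimals, and the names `scores`, `gaps`, `best_test`, and `smallest_gap` are hypothetical helpers that do not appear elsewhere in the notebook.

```python
# Scores copied (rounded) from the randomized search and the two grid searches above
scores = {
    'rs':  {'train': 0.9075, 'test': 0.7085},
    'gs':  {'train': 0.9072, 'test': 0.7119},
    'gs2': {'train': 0.7854, 'test': 0.7353},
}

# A large train/test gap suggests the estimator is overfitting the training data
gaps = {name: round(s['train'] - s['test'], 4) for name, s in scores.items()}

# Candidate with the highest test accuracy and candidate with the smallest gap
best_test = max(scores, key=lambda name: scores[name]['test'])
smallest_gap = min(gaps, key=gaps.get)

for name in scores:
    print(f"{name}: test={scores[name]['test']:.4f}, gap={gaps[name]:.4f}")
```

A comparison like this is one possible input when choosing which estimator to keep; validation accuracy and training cost may matter as well.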

### XGBoost

In [85]:
# Set parameter distributions for a randomized search over XGBoost
params = {
    'learning_rate': np.linspace(.001,.5,25),
    'n_estimators': range(1,200),
    'gamma': range(10),
    'min_child_weight': range(1,10),
    'max_delta_step': range(10),
    'subsample': np.linspace(0,1,25),
    'colsample_bytree': np.linspace(0,1,25),
    'colsample_bylevel': np.linspace(0,1,25),
    'colsample_bynode': np.linspace(0,1,25),
    'max_depth': range(1,25),
    'reg_alpha': range(5),
    'reg_lambda': range(5),
    'scale_pos_weight': np.linspace(0,1,25),
    'base_score': np.linspace(0,1,25)
}
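
Several of the ranges above start at 0 (`np.linspace(0,1,25)`), so a row or column sampling fraction of exactly 0 can be drawn. XGBoost documents `subsample` and the `colsample_by*` parameters as fractions in (0, 1], so a draw of 0 is not a usable setting. A minimal sketch of one alternative follows; the lower bound `LOWER` and the names `safer` and `sampling_params` are assumptions made for illustration, not values used in this notebook.

```python
import numpy as np

# Original-style range: 25 evenly spaced values from 0 to 1, so 0.0 is a candidate
original = np.linspace(0, 1, 25)

# Hypothetical alternative: start at a small positive fraction instead of 0
LOWER = 0.1  # assumed lower bound, chosen only for illustration
safer = np.linspace(LOWER, 1, 25)

# Apply the safer range to the sampling-fraction parameters only
sampling_params = {
    name: safer
    for name in ('subsample', 'colsample_bytree',
                 'colsample_bylevel', 'colsample_bynode')
}

print(original.min(), safer.min())
```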

In [86]:
# Instantiate and fit a randomized search with 3 folds
rs = RandomizedSearchCV(xgb, params, n_iter=50, cv=3, random_state=72)
rs.fit(X_tr_pt, y_train)

RandomizedSearchCV(cv=3, error_score='raise-deprecating',
          estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=0.1,
       max_delta_step=0, max_depth=3, min_child_weight=1, missing=None,
       n_estimators=100, n_jobs=1, nthread=None,
       objective='multi:softprob', random_state=72, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,
       subsample=1, verbosity=1),
          fit_params=None, iid='warn', n_iter=50, n_jobs=None,
          param_distributions={'learning_rate': array([0.001  , 0.02179, 0.04258, 0.06338, 0.08417, 0.10496, 0.12575,
       0.14654, 0.16733, 0.18812, 0.20892, 0.22971, 0.2505 , 0.27129,
       0.29208, 0.31288, 0.33367, 0.35446, 0.37525, 0.39604, 0.41683,
       0.43762, 0.45842, 0.47921, 0.5    ]), 'n_esti..., 0.625  , 0.66667, 0.70833, 0.75   , 0.79167, 0.83333,
       0.875  , 0.91667, 0.95833, 1.     ])},
          pre_d

In [87]:
# Display the best performing parameters
rs.best_params_

{'subsample': 0.41666666666666663,
 'scale_pos_weight': 0.4583333333333333,
 'reg_lambda': 0,
 'reg_alpha': 2,
 'n_estimators': 91,
 'min_child_weight': 9,
 'max_depth': 15,
 'max_delta_step': 3,
 'learning_rate': 0.021791666666666668,
 'gamma': 0,
 'colsample_bytree': 0.6666666666666666,
 'colsample_bynode': 0.3333333333333333,
 'colsample_bylevel': 1.0,
 'base_score': 0.08333333333333333}

In [88]:
# Score the best XGBoost estimator on training, test, and validation
print(f"Training Score: {rs.best_estimator_.score(X_tr_pt, y_train)}")
print(f"Test Score: {rs.best_estimator_.score(X_te_pt, y_test)}")
print(f"Validation Score: {rs.best_estimator_.score(X_val_pt, val['genre'])}")

Training Score: 0.8016553355010346
Test Score: 0.7470686767169179
Validation Score: 0.92


In [89]:
# Set the parameter grid for a grid search over XGBoost
params = {
    'booster': ['gbtree','gblinear','dart'],
    'learning_rate': [.01,.05],
    'n_estimators': [80,100],
    'max_depth': [10,20],
    'reg_alpha': range(3),
    'reg_lambda': range(3),
}
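
A grid search fits every combination in the grid once per fold, so the cost grows multiplicatively with each added parameter. The sketch below counts the fits implied by the grid above with 3 folds. The names `CV_FOLDS`, `n_candidates`, and `n_fits` are introduced here for illustration only, and the final refit on the full training set is not counted.

```python
from math import prod

# Same grid as the cell above
params = {
    'booster': ['gbtree', 'gblinear', 'dart'],
    'learning_rate': [.01, .05],
    'n_estimators': [80, 100],
    'max_depth': [10, 20],
    'reg_alpha': range(3),
    'reg_lambda': range(3),
}

CV_FOLDS = 3  # matches cv=3 in the grid search cells

# Number of parameter combinations is the product of the option counts
n_candidates = prod(len(values) for values in params.values())

# Each combination is fit once per fold
n_fits = n_candidates * CV_FOLDS

print(n_candidates, n_fits)  # 216 648
```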

In [90]:
# Instantiate and fit a grid search with 3 folds
gs = GridSearchCV(rs.best_estimator_, params, cv=3)
gs.fit(X_tr_pt, y_train)

GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=XGBClassifier(base_score=0.08333333333333333, booster='gbtree',
       colsample_bylevel=1.0, colsample_bynode=0.3333333333333333,
       colsample_bytree=0.6666666666666666, gamma=0,
       learning_rate=0.021791666666666668, max_delta_step=3, max_depth=15,
       min_child_weight=9, miss...eight=0.4583333333333333,
       seed=None, silent=None, subsample=0.41666666666666663, verbosity=1),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'booster': ['gbtree', 'gblinear', 'dart'], 'learning_rate': [0.01, 0.05], 'n_estimators': [80, 100], 'max_depth': [10, 20], 'reg_alpha': range(0, 3), 'reg_lambda': range(0, 3)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [91]:
# Display the best performing parameters
gs.best_params_

{'booster': 'gbtree',
 'learning_rate': 0.01,
 'max_depth': 10,
 'n_estimators': 100,
 'reg_alpha': 1,
 'reg_lambda': 1}

In [92]:
# Score the best XGBoost estimator on training, test, and validation
print(f"Training Score: {gs.best_estimator_.score(X_tr_pt, y_train)}")
print(f"Test Score: {gs.best_estimator_.score(X_te_pt, y_test)}")
print(f"Validation Score: {gs.best_estimator_.score(X_val_pt, val['genre'])}")

Training Score: 0.7986993792491871
Test Score: 0.7420435510887772
Validation Score: 0.92


In [93]:
# Set the parameter grid for a second grid search over XGBoost
params = {
    'learning_rate': [.01,.02],
    'n_estimators': [90,100],
    'max_depth': [10,15],
    'reg_alpha': range(3),
    'reg_lambda': range(3),
}

In [94]:
# Instantiate and fit a grid search with 3 folds
gs2 = GridSearchCV(gs.best_estimator_, params, cv=3)
gs2.fit(X_tr_pt, y_train)

GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=XGBClassifier(base_score=0.08333333333333333, booster='gbtree',
       colsample_bylevel=1.0, colsample_bynode=0.3333333333333333,
       colsample_bytree=0.6666666666666666, gamma=0, learning_rate=0.01,
       max_delta_step=3, max_depth=10, min_child_weight=9, missing=None,
       n_esti...eight=0.4583333333333333, seed=None,
       silent=None, subsample=0.41666666666666663, verbosity=1),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'learning_rate': [0.01, 0.02], 'n_estimators': [90, 100], 'max_depth': [10, 15], 'reg_alpha': range(0, 3), 'reg_lambda': range(0, 3)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [95]:
# Display the best performing parameters
gs2.best_params_

{'learning_rate': 0.01,
 'max_depth': 10,
 'n_estimators': 90,
 'reg_alpha': 1,
 'reg_lambda': 2}

In [96]:
# Score the best XGBoost estimator on training, test, and validation
print(f"Training Score: {gs2.best_estimator_.score(X_tr_pt, y_train)}")
print(f"Test Score: {gs2.best_estimator_.score(X_te_pt, y_test)}")
print(f"Validation Score: {gs2.best_estimator_.score(X_val_pt, val['genre'])}")

Training Score: 0.7989949748743719
Test Score: 0.7437185929648241
Validation Score: 0.92


In [98]:
# Set the parameter grid for a third grid search over XGBoost
params = {
    'learning_rate': [.005,.01],
    'max_depth': [5,10],
    'reg_alpha': range(3),
    'reg_lambda': range(3),
}

In [99]:
# Instantiate and fit a grid search with 3 folds
gs3 = GridSearchCV(gs.best_estimator_, params, cv=3)
gs3.fit(X_tr_pt, y_train)

GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=XGBClassifier(base_score=0.08333333333333333, booster='gbtree',
       colsample_bylevel=1.0, colsample_bynode=0.3333333333333333,
       colsample_bytree=0.6666666666666666, gamma=0, learning_rate=0.01,
       max_delta_step=3, max_depth=10, min_child_weight=9, missing=None,
       n_esti...eight=0.4583333333333333, seed=None,
       silent=None, subsample=0.41666666666666663, verbosity=1),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'learning_rate': [0.005, 0.01], 'max_depth': [5, 10], 'reg_alpha': range(0, 3), 'reg_lambda': range(0, 3)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [100]:
# Display the best performing parameters
gs3.best_params_

{'learning_rate': 0.005, 'max_depth': 5, 'reg_alpha': 0, 'reg_lambda': 0}

In [101]:
# Score the best XGBoost estimator on training, test, and validation
print(f"Training Score: {gs3.best_estimator_.score(X_tr_pt, y_train)}")
print(f"Test Score: {gs3.best_estimator_.score(X_te_pt, y_test)}")
print(f"Validation Score: {gs3.best_estimator_.score(X_val_pt, val['genre'])}")

Training Score: 0.7945610404966007
Test Score: 0.7487437185929648
Validation Score: 0.91


In [102]:
# Save the first grid search's best estimator as the final XGBoost model
xgb = gs.best_estimator_

### Voting

In [103]:
# Set a list of estimators for the voting classifier
models = [
    ('dt', dt),
    ('bag', bag),
    ('ada', ada),
    ('gb', gb),
    ('xgb', xgb)
]

In [106]:
# Hard voting on the power-transformed features
vc = VotingClassifier(models)
vc.fit(X_tr_pt, y_train)
print(f"Training Score: {vc.score(X_tr_pt, y_train)}")
print(f"Test Score: {vc.score(X_te_pt, y_test)}")
print(f"Validation Score: {vc.score(X_val_pt, val['genre'])}")

Training Score: 0.8226426248891516
Test Score: 0.7453936348408711
Validation Score: 0.93


In [107]:
# Soft voting on the power-transformed features
vc = VotingClassifier(models, voting='soft')
vc.fit(X_tr_pt, y_train)
print(f"Training Score: {vc.score(X_tr_pt, y_train)}")
print(f"Test Score: {vc.score(X_te_pt, y_test)}")
print(f"Validation Score: {vc.score(X_val_pt, val['genre'])}")

Training Score: 0.8312148980195093
Test Score: 0.7437185929648241
Validation Score: 0.93
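
The two cells above differ only in the voting rule. Hard voting takes the majority of the predicted labels, while soft voting averages the predicted class probabilities and picks the largest average, which requires every estimator to provide `predict_proba`. The toy sketch below uses made-up probabilities for three hypothetical models and three classes to show how the two rules can disagree; none of the numbers come from this notebook.

```python
import numpy as np

# Hypothetical class probabilities for one song: rows are models, columns are classes
probas = np.array([
    [0.5, 0.4, 0.1],  # model A narrowly prefers class 0
    [0.5, 0.4, 0.1],  # model B narrowly prefers class 0
    [0.1, 0.8, 0.1],  # model C strongly prefers class 1
])

# Hard voting: each model casts one vote for its most likely class
votes = probas.argmax(axis=1)
hard_vote = np.bincount(votes, minlength=probas.shape[1]).argmax()

# Soft voting: average the probabilities, then take the most likely class
soft_vote = probas.mean(axis=0).argmax()

print(hard_vote, soft_vote)  # 0 1
```

Here two lukewarm votes outnumber one confident vote under hard voting, while the confident model tips the average under soft voting.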


In [108]:
# Hard voting on the standard-scaled features
vc = VotingClassifier(models)
vc.fit(X_tr_sc, y_train)
print(f"Training Score: {vc.score(X_tr_sc, y_train)}")
print(f"Test Score: {vc.score(X_te_sc, y_test)}")
print(f"Validation Score: {vc.score(X_val_sc, val['genre'])}")

Training Score: 0.8229382205143364
Test Score: 0.7520938023450586
Validation Score: 0.93


In [109]:
# Soft voting on the standard-scaled features
vc = VotingClassifier(models, voting='soft')
vc.fit(X_tr_sc, y_train)
print(f"Training Score: {vc.score(X_tr_sc, y_train)}")
print(f"Test Score: {vc.score(X_te_sc, y_test)}")
print(f"Validation Score: {vc.score(X_val_sc, val['genre'])}")

Training Score: 0.8329884717706177
Test Score: 0.7420435510887772
Validation Score: 0.93


In [110]:
# Hard voting on the unscaled polynomial features
vc = VotingClassifier(models)
vc.fit(X_train, y_train)
print(f"Training Score: {vc.score(X_train, y_train)}")
print(f"Test Score: {vc.score(X_test, y_test)}")
print(f"Validation Score: {vc.score(X_val, val['genre'])}")

Training Score: 0.8229382205143364
Test Score: 0.7504187604690117
Validation Score: 0.93


In [111]:
# Soft voting on the unscaled polynomial features
vc = VotingClassifier(models, voting='soft')
vc.fit(X_train, y_train)
print(f"Training Score: {vc.score(X_train, y_train)}")
print(f"Test Score: {vc.score(X_test, y_test)}")
print(f"Validation Score: {vc.score(X_val, val['genre'])}")

Training Score: 0.8329884717706177
Test Score: 0.7437185929648241
Validation Score: 0.93


In [115]:
# Persist the final voting classifier (soft voting, fit on the unscaled polynomial features)
with open('final_model.pkl', 'wb') as f:
    pickle.dump(vc, f)
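
The pickled voting classifier was fit on polynomial features, so anyone loading `final_model.pkl` would also need the fitted `PolynomialFeatures` object to prepare new songs. One option, sketched below under assumptions, is to pickle the transformer and the model together. The toy data, the `LogisticRegression` stand-in, and the names `pf_toy`, `model_toy`, and `bundle` are all hypothetical and are not the notebook's actual objects.

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

# Toy stand-ins for the notebook's data and fitted objects (illustrative only)
rng = np.random.RandomState(72)
X_toy = rng.rand(40, 2)
y_toy = (X_toy[:, 0] > X_toy[:, 1]).astype(int)

pf_toy = PolynomialFeatures(degree=2)
model_toy = LogisticRegression().fit(pf_toy.fit_transform(X_toy), y_toy)

# Bundle the transformer with the model so both travel together
payload = pickle.dumps({'pf': pf_toy, 'model': model_toy})

# Later, possibly in another session: restore both and predict from raw features
bundle = pickle.loads(payload)
preds = bundle['model'].predict(bundle['pf'].transform(X_toy))

# The restored objects should reproduce the original predictions
same = np.array_equal(preds, model_toy.predict(pf_toy.transform(X_toy)))
print(same)
```

Pickles should only be loaded from trusted sources, and ideally with the same library versions that wrote them.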

In [61]:
# feat_imp = pd.DataFrame(xgb.feature_importances_, index=poly_feat).sort_values(0, ascending=False)
# feat_imp

                                      0
duration_ms^3 loudness         0.084715
duration_ms^4 loudness         0.061337
energy tempo^4                 0.047956
energy loudness tempo          0.040677
energy^3 loudness tempo        0.033887
energy loudness^2 tempo        0.030042
energy^2 tempo^3               0.021533
loudness tempo^4               0.019903
duration_ms tempo              0.018461
duration_ms^3 energy loudness  0.016275


In [48]:
# plt.figure(figsize=(10,10))
# plt.barh(feat_imp.index, feat_imp[0])
# plt.gca().invert_yaxis()
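
The commented-out cells above pair each polynomial feature name with its importance and sort the result. The sketch below reproduces that idea on toy data so it can run on its own. It swaps in scikit-learn's `GradientBoostingClassifier` for the notebook's XGBoost model, and the data and the names `X_toy`, `pf_toy`, and `model_toy` are assumptions for illustration. Newer scikit-learn releases expose `get_feature_names_out` in place of `get_feature_names`, so the sketch checks for both.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import PolynomialFeatures

# Toy data with two made-up audio features (illustrative only)
rng = np.random.RandomState(72)
X_toy = pd.DataFrame(rng.rand(60, 2), columns=['energy', 'tempo'])
y_toy = (X_toy['energy'] * X_toy['tempo'] > 0.25).astype(int)

pf_toy = PolynomialFeatures(degree=2)
X_poly = pf_toy.fit_transform(X_toy)

# Use whichever feature-name method the installed scikit-learn provides
if hasattr(pf_toy, 'get_feature_names_out'):
    names = pf_toy.get_feature_names_out(X_toy.columns)
else:
    names = pf_toy.get_feature_names(X_toy.columns)

model_toy = GradientBoostingClassifier(n_estimators=20, random_state=72)
model_toy.fit(X_poly, y_toy)

# Pair each feature name with its importance and sort from most to least important
feat_imp = (
    pd.Series(model_toy.feature_importances_, index=names, name='importance')
    .sort_values(ascending=False)
)
print(feat_imp.head(10))
```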