# Modeling Spotify track popularity
## Leo Evancie, Springboard Data Science Career Track

This is the fourth step in a capstone project to model music popularity on Spotify, a popular streaming service. Further project details and rationale can be found in the document 'Proposal.pdf'.

In this notebook, I will apply my cleaned and processed data to a number of models. For each type of model, I will perform hyperparameter tuning with cross-validation. Ultimately, I will determine which model performs best in predicting whether a given track is popular on Spotify.

## Table of Contents:

1. [Load libraries and data](#load)
2. [Logistic regression](#logistic)
3. [K-nearest neighbors](#knn)
4. [Random forest](#rf)
5. [Gradient boosting](#gb)
6. [Comparing model performance](#compare)
7. [Conclusions and next steps](#conclusion)

## 1. Load libraries and data <a class="anchor" id="load"></a>

First, I will read in the data, already split into train and test chunks from the preprocessing stage:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from time import time
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV, cross_val_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, roc_curve, roc_auc_score

In [2]:
X_train = pd.read_csv('../data/X_train.csv', index_col=0)
y_train = np.ravel(pd.read_csv('../data/y_train.csv', index_col=0))
X_test = pd.read_csv('../data/X_test.csv', index_col=0)
y_test = np.ravel(pd.read_csv('../data/y_test.csv', index_col=0))

In [3]:
# Reminder of the training data features
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6896 entries, 9671 to 2191
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   single            6896 non-null   int64  
 1   danceability      6896 non-null   float64
 2   energy            6896 non-null   float64
 3   instrumentalness  6896 non-null   float64
 4   explicit          6896 non-null   int64  
 5   collab            6896 non-null   int64  
 6   timesig_0         6896 non-null   int64  
 7   timesig_1         6896 non-null   int64  
 8   timesig_3         6896 non-null   int64  
 9   timesig_4         6896 non-null   int64  
 10  timesig_5         6896 non-null   int64  
 11  track_number      6896 non-null   float64
 12  duration_s        6896 non-null   float64
dtypes: float64(5), int64(8)
memory usage: 754.2 KB


## 2. Logistic regression <a class="anchor" id="logistic"></a>

Logistic regression is a standard classification model, so I will run a baseline logistic regression first:

In [4]:
lr = LogisticRegression(random_state=42)
lr.fit(X_train, y_train)
# Generate predictions from training data for training scores
y_train_pred = lr.predict(X_train)
# Save baseline f1 training score for comparison to testing & tuned scores
lr_baseline_f1_train = f1_score(y_train, y_train_pred)
print('Baseline LR classification report (training):\n')
print(classification_report(y_train, y_train_pred))

Baseline LR classification report (training):

              precision    recall  f1-score   support

           0       0.83      0.76      0.79      3558
           1       0.76      0.83      0.79      3338

    accuracy                           0.79      6896
   macro avg       0.79      0.79      0.79      6896
weighted avg       0.80      0.79      0.79      6896



In [5]:
# Generate predictions from testing data for testing scores
y_test_pred = lr.predict(X_test)
lr_baseline_f1_test = f1_score(y_test, y_test_pred)
print('Baseline LR classification report (testing):\n')
print(classification_report(y_test, y_test_pred))

Baseline LR classification report (testing):

              precision    recall  f1-score   support

           0       0.83      0.75      0.79      1509
           1       0.76      0.84      0.80      1447

    accuracy                           0.79      2956
   macro avg       0.80      0.79      0.79      2956
weighted avg       0.80      0.79      0.79      2956



With a simple baseline logistic regression model, I see an f1 test score of 0.80. With similar scores for training and testing, there is no evidence of overfitting.

Now, I will tune hyperparameters. The most straightforward method of hyperparameter tuning is grid search with cross-validation. We provide a set of values for any or all model parameters, and the GridSearchCV object systematically tests every possible combination of those given parameters (i.e., it steps through a multidimensional 'grid' of parameter values).

The 'CV' in GridSearchCV stands for cross-validation, which is a useful method to stave off overfitting. GridSearchCV applies cross-validation at each step of the grid search. The training data is split into k "folds", or equally-sized chunks (the default is k = 5 in scikit-learn 0.22 and later, and was 3 in earlier versions; it can be changed with the `cv` argument). A model is instantiated with the current set of parameters and fit to k-1 folds, leaving the kth aside to act as a validation set. A model score is produced and saved. Then, a new model is fit to a different set of k-1 folds (using the same current set of parameters from the grid search), using a different chunk as the validation set, producing a second score. This process is performed a total of k times, each iteration using a different chunk as the holdout set, until we have k scores. Those scores are then averaged. This provides us with a score corresponding to a certain set of parameters, and we can be more confident in this score, because the cross-validation lowers the chances that our score was impacted too much by the incidental nature of any single split. We have, in a sense, simulated the train-test split k times, without ever exposing the model to the real test data.
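The fold-by-fold procedure described above can be sketched by hand. This is a minimal illustration on synthetic data (the `X_demo`/`y_demo` arrays and the choice of `C=1.0` are assumptions for the example, not part of this project), and it uses a plain `KFold` split for simplicity, whereas GridSearchCV stratifies the folds by class for classifiers:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# Synthetic stand-in for the training data (hypothetical, not the Spotify set)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)

k = 5
kfold = KFold(n_splits=k, shuffle=True, random_state=42)
fold_scores = []

for fit_idx, holdout_idx in kfold.split(X_demo):
    # A fresh model with the same candidate parameters for every fold
    model = LogisticRegression(C=1.0, max_iter=1000)
    model.fit(X_demo[fit_idx], y_demo[fit_idx])
    # Score on the one fold that was held out of fitting
    preds = model.predict(X_demo[holdout_idx])
    fold_scores.append(accuracy_score(y_demo[holdout_idx], preds))

# The average over the k holdout scores is the score for this parameter set
mean_cv_score = float(np.mean(fold_scores))
print(f"{k} fold scores:", np.round(fold_scores, 3))
print("Mean CV score:", round(mean_cv_score, 3))
```

A grid search repeats this loop once per parameter combination and keeps the combination with the best mean score.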

Here, I will only test a range of regularization parameters, along with a random state for the sake of reproducibility.

In [6]:
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100,],
    'random_state':[42]
}

lr = LogisticRegression()
lr_grid = GridSearchCV(estimator=lr, param_grid=param_grid)
start_time = time()
lr_grid.fit(X_train, y_train)
# Capture wall time for fitting the tuned model for later comparison
lr_grid_fit_time = time() - start_time
# Display best parameters from grid search
lr_grid.best_params_

{'C': 10, 'random_state': 42}

In [7]:
y_train_pred = lr_grid.predict(X_train)
lr_tuned_f1_train = f1_score(y_train, y_train_pred)
print('Tuned LR classification report (training):')
print(classification_report(y_train, y_train_pred))

Tuned LR classification report (training):
              precision    recall  f1-score   support

           0       0.83      0.76      0.79      3558
           1       0.76      0.83      0.79      3338

    accuracy                           0.79      6896
   macro avg       0.79      0.79      0.79      6896
weighted avg       0.79      0.79      0.79      6896



In [8]:
start_time = time()
y_test_pred = lr_grid.predict(X_test)
# Capture wall time for making predictions for later comparison
lr_grid_pred_time = time() - start_time
lr_tuned_f1_test = f1_score(y_test, y_test_pred)
print('Tuned LR classification report (testing):')
print(classification_report(y_test, y_test_pred))

Tuned LR classification report (testing):
              precision    recall  f1-score   support

           0       0.83      0.75      0.79      1509
           1       0.76      0.84      0.80      1447

    accuracy                           0.79      2956
   macro avg       0.80      0.80      0.79      2956
weighted avg       0.80      0.79      0.79      2956



In [9]:
print('Change in LR f1 score from parameter tuning:', round(lr_tuned_f1_test-lr_baseline_f1_test, 3))

Change in LR f1 score from parameter tuning: 0.001


There was barely any improvement, which is not entirely unexpected given the fact that I only tested a small number of values for one single parameter.

## 3. K-nearest neighbors <a class="anchor" id="knn"></a>

The k-nearest neighbors model classifies a point by looking at the k training points closest to it in feature space and taking a vote among their classes. The number of neighbors, k, is set by the user, and there is no real training step beyond storing the training data. Here's a baseline model:

In [10]:
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_train_pred = knn.predict(X_train)
knn_baseline_f1_train = f1_score(y_train, y_train_pred)
print('Baseline KNN classification report (training):')
print(classification_report(y_train, y_train_pred))

Baseline KNN classification report (training):
              precision    recall  f1-score   support

           0       0.88      0.81      0.84      3558
           1       0.81      0.89      0.85      3338

    accuracy                           0.85      6896
   macro avg       0.85      0.85      0.85      6896
weighted avg       0.85      0.85      0.85      6896



In [11]:
y_test_pred = knn.predict(X_test)
knn_baseline_f1_test = f1_score(y_test, y_test_pred)
print('Baseline KNN classification report (testing):')
print(classification_report(y_test, y_test_pred))

Baseline KNN classification report (testing):
              precision    recall  f1-score   support

           0       0.82      0.74      0.77      1509
           1       0.75      0.83      0.79      1447

    accuracy                           0.78      2956
   macro avg       0.78      0.78      0.78      2956
weighted avg       0.78      0.78      0.78      2956



The baseline KNN model displayed some overfitting.
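Before tuning, it may help to see what the two main KNN settings do. The sketch below is a from-scratch illustration of the voting rule, not scikit-learn's implementation; the function name `knn_predict` and the toy arrays are hypothetical. It shows how the number of neighbors and the choice between uniform and distance-based weights can change a prediction:

```python
import numpy as np

def knn_predict(X_fit, y_fit, x_new, k=5, weights="uniform"):
    """Predict a 0/1 label for x_new by a vote among its k nearest training points."""
    distances = np.linalg.norm(X_fit - x_new, axis=1)
    nearest = np.argsort(distances)[:k]
    if weights == "uniform":
        # Every neighbor gets one equal vote
        vote_weights = np.ones(k)
    else:
        # Closer neighbors count more; the small epsilon avoids division by zero
        vote_weights = 1.0 / (distances[nearest] + 1e-12)
    votes_for_1 = vote_weights[y_fit[nearest] == 1].sum()
    votes_for_0 = vote_weights[y_fit[nearest] == 0].sum()
    return int(votes_for_1 > votes_for_0)

# Toy one-dimensional data: one class-1 point very close to the query,
# two class-0 points somewhat farther away
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
y_toy = np.array([1, 0, 0, 0, 1])
x_query = np.array([0.1])

uniform_pred = knn_predict(X_toy, y_toy, x_query, k=3, weights="uniform")
distance_pred = knn_predict(X_toy, y_toy, x_query, k=3, weights="distance")
print("uniform:", uniform_pred, "| distance:", distance_pred)
```

With uniform weights the two class-0 neighbors outvote the single class-1 neighbor; with distance weights the very close class-1 neighbor dominates.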

Next, I will tune parameters by performing a grid search over a range of numbers of neighbors, as well as uniform vs. distance-based weights:

In [12]:
param_grid = {
    'n_neighbors': [1,3,5,7,9],
    'weights': ['uniform', 'distance']
}

knn = KNeighborsClassifier()
knn_grid = GridSearchCV(estimator=knn, param_grid=param_grid)
start_time = time()
knn_grid.fit(X_train, y_train)
knn_grid_fit_time = time() - start_time
knn_grid.best_params_

{'n_neighbors': 7, 'weights': 'distance'}

In [13]:
y_train_pred = knn_grid.predict(X_train)
knn_tuned_f1_train = f1_score(y_train, y_train_pred)
print('Tuned KNN classification report (training):')
print(classification_report(y_train, y_train_pred))

Tuned KNN classification report (training):
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3558
           1       1.00      1.00      1.00      3338

    accuracy                           1.00      6896
   macro avg       1.00      1.00      1.00      6896
weighted avg       1.00      1.00      1.00      6896



In [14]:
start_time = time()
y_test_pred = knn_grid.predict(X_test)
knn_grid_pred_time = time() - start_time
knn_tuned_f1_test = f1_score(y_test, y_test_pred)
print('Tuned KNN classification report (testing):')
print(classification_report(y_test, y_test_pred))

Tuned KNN classification report (testing):
              precision    recall  f1-score   support

           0       0.82      0.74      0.78      1509
           1       0.75      0.83      0.79      1447

    accuracy                           0.78      2956
   macro avg       0.79      0.78      0.78      2956
weighted avg       0.79      0.78      0.78      2956



In [15]:
print('Change in f1 score from parameter tuning:', round(knn_tuned_f1_test-knn_baseline_f1_test, 3))

Change in f1 score from parameter tuning: 0.002


The model has memorized our training data, producing 100% training scores. And with an f1 test score of 0.79, the tuned KNN performed a little worse than the tuned LR model.
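The near-perfect training score is what one would expect from `weights='distance'`: when predicting on the training set, each point is its own nearest neighbor at distance zero, so its own label dominates the vote. The sketch below reproduces the effect on synthetic data (the arrays are assumptions for illustration; the result relies on no two identical rows carrying different labels, which may be why the real training f1 falls just short of 1.0):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Noisy synthetic data: 20% of labels are assigned at random (flip_y),
# so a perfect score cannot come from genuinely learnable structure
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, flip_y=0.2, random_state=0
)
X_fit, X_holdout, y_fit, y_holdout = train_test_split(X_demo, y_demo, random_state=0)

knn_demo = KNeighborsClassifier(n_neighbors=7, weights="distance")
knn_demo.fit(X_fit, y_fit)

train_acc = knn_demo.score(X_fit, y_fit)
holdout_acc = knn_demo.score(X_holdout, y_holdout)
print("training accuracy:", train_acc)
print("holdout accuracy: ", round(holdout_acc, 3))
```

For a distance-weighted KNN, the training score therefore says little about generalization; the cross-validated or test score is the one to read.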

Logistic regression and k-nearest neighbors are each single models, whereby classification is determined by a single calculation, and that calculation is shaped by the parameters the model learns from the training data. Such models can perform quite well under many circumstances.

But we also have "ensemble models", which produce a large group of similarly-structured models, all of which get a say in the final classification decision. Ensemble models are often more robust, because while a given individual model in the ensemble may err in a certain direction, another model will likely err in a different direction. With enough members in the ensemble, those individual errors tend to get smoothed out.
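The smoothing effect can be illustrated with a small simulation. This is an idealized sketch: it assumes 101 voters that are each correct 60% of the time and fully independent of one another, which real ensemble members only approximate (all numbers here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(42)
n_cases, n_voters, p_correct = 10_000, 101, 0.6

# Each entry is True when a given voter classifies a given case correctly
correct = rng.random((n_cases, n_voters)) < p_correct

# Accuracy of one voter on its own
single_accuracy = correct[:, 0].mean()
# Accuracy of the majority vote across all voters
majority_accuracy = (correct.sum(axis=1) > n_voters / 2).mean()

print("single voter accuracy:", round(float(single_accuracy), 3))
print("majority vote accuracy:", round(float(majority_accuracy), 3))
```

The less independent the members are, the smaller this gain becomes, which is why tree ensembles inject randomness through bootstrapped samples and feature subsets.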

## 4. Random forest <a class="anchor" id="rf"></a>

Random forests are a popular ensemble model, where many decision trees are trained on bootstrapped samples from the training data. Classification is determined by a vote among the resulting "forest" of individual trees. In order to set a baseline, I will first create a random forest with all default parameters (setting the random state for reproducibility).

In [16]:
rfc = RandomForestClassifier(random_state=42)
rfc.fit(X_train, y_train)
y_train_pred = rfc.predict(X_train)
rfc_baseline_f1_train = f1_score(y_train, y_train_pred)
print('Baseline RFC classification report (training):')
print(classification_report(y_train, y_train_pred))

Baseline RFC classification report (training):
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3558
           1       1.00      1.00      1.00      3338

    accuracy                           1.00      6896
   macro avg       1.00      1.00      1.00      6896
weighted avg       1.00      1.00      1.00      6896



In [17]:
y_test_pred = rfc.predict(X_test)
rfc_baseline_f1_test = f1_score(y_test, y_test_pred)
print('Baseline RFC classification report (testing):')
print(classification_report(y_test, y_test_pred))

Baseline RFC classification report (testing):
              precision    recall  f1-score   support

           0       0.85      0.81      0.83      1509
           1       0.81      0.85      0.83      1447

    accuracy                           0.83      2956
   macro avg       0.83      0.83      0.83      2956
weighted avg       0.83      0.83      0.83      2956



While we still see some overfitting to the training data, even the out-of-the-box random forest performs better on the test data than the tuned versions of our previous model candidates. Let's see what happens after tuning.

When building a model with relatively few important parameters, and/or when you have reason to only test a small number of values for your parameters, GridSearchCV is feasible. However, as the number of parameters and values-per-parameter increases, grid searching quickly becomes infeasible in terms of computing capacity and time. This is when we turn to a related method: RandomizedSearchCV. Rather than testing every possible combination of parameter values, RandomizedSearchCV selects a specified number of combinations, chosen at random, which drastically reduces the number of models evaluated. And yet this process has been shown to perform nearly as well as the brute-force grid search. This is likely due to the fact that only a subset of parameters tends to produce much difference in a model's performance on any given dataset. To test every possible combination would mean producing a lot of practically redundant models. The randomized search does a pretty good job capturing, or approximating, the actual variation in performance for the provided parameter grid.

Since the RandomForestClassifier has so many potentially impactful parameters, I will employ RandomizedSearchCV. Guidance for this section came from sklearn documentation and this blog post: https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74. Since there are about 1,000 parameter combinations in my grid, I will test `n_iter = 100` combinations, or roughly 10% of the grid space.
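The size of the grid can be checked without fitting anything. The sketch below uses scikit-learn's `ParameterGrid` and `ParameterSampler` helpers on a copy of the grid from the next cell (the name `rf_grid` is just for this example); neither helper validates the values, so this only counts and samples combinations:

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Copy of the random forest grid used in the search cell
rf_grid = {
    'n_estimators': [100, 500, 800, 1000, 1500, 2000],
    'criterion': ['gini', 'entropy'],
    'max_depth': [20, 50, 70, 100, None],
    'max_features': ['auto', 'sqrt'],
    'min_samples_leaf': [1, 2, 4],
    'min_samples_split': [2, 5, 10],
    'random_state': [42],
}

# Total number of combinations a full grid search would have to evaluate
n_combinations = len(ParameterGrid(rf_grid))

# The 100 combinations a randomized search with this seed would draw
sampled = list(ParameterSampler(rf_grid, n_iter=100, random_state=42))
fraction = len(sampled) / n_combinations

print("combinations in full grid:", n_combinations)
print("fraction sampled:", round(fraction, 3))
```

One caveat: in the scikit-learn versions that accept it, `max_features='auto'` means the same thing as `'sqrt'` for a random forest classifier, so only half of these combinations are distinct models.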

In [18]:
param_grid = {
    'n_estimators': [100,500,800,1000,1500,2000],
    'criterion': ['gini','entropy'],
    'max_depth': [20,50,70,100,None],
    'max_features': ['auto','sqrt'],
    'min_samples_leaf': [1,2,4],
    'min_samples_split': [2,5,10],
    'random_state':[42]
}

rfc = RandomForestClassifier()
rfc_random = RandomizedSearchCV(
    estimator=rfc,
    param_distributions=param_grid,
    n_iter=100,
    cv=5,
    random_state=42
)

start_time = time()
rfc_random.fit(X_train, y_train)
rfc_random_fit_time = time() - start_time
rfc_random.best_params_

{'random_state': 42,
 'n_estimators': 2000,
 'min_samples_split': 10,
 'min_samples_leaf': 1,
 'max_features': 'auto',
 'max_depth': 100,
 'criterion': 'entropy'}

The random search process took considerable time. I would not even attempt a full GridSearchCV with the above parameter grid. Let's see how the tuned model performs.

In [19]:
y_train_pred = rfc_random.predict(X_train)
rfc_tuned_f1_train = f1_score(y_train, y_train_pred)
print('Tuned RFC classification report (training):')
print(classification_report(y_train, y_train_pred))

Tuned RFC classification report (training):
              precision    recall  f1-score   support

           0       0.96      0.93      0.95      3558
           1       0.93      0.96      0.95      3338

    accuracy                           0.95      6896
   macro avg       0.95      0.95      0.95      6896
weighted avg       0.95      0.95      0.95      6896



In [20]:
start_time = time()
y_test_pred = rfc_random.predict(X_test)
rfc_random_pred_time = time() - start_time
rfc_tuned_f1_test = f1_score(y_test, y_test_pred)
print('Tuned RFC classification report (testing):')
print(classification_report(y_test, y_test_pred))

Tuned RFC classification report (testing):
              precision    recall  f1-score   support

           0       0.85      0.80      0.82      1509
           1       0.80      0.85      0.83      1447

    accuracy                           0.83      2956
   macro avg       0.83      0.83      0.83      2956
weighted avg       0.83      0.83      0.83      2956



In [21]:
print('Change in f1 score from parameter tuning:', round(rfc_tuned_f1_test-rfc_baseline_f1_test, 3))

Change in f1 score from parameter tuning: 0.001


While the random forest model showed the best baseline metrics so far, we see barely any improvement from parameter tuning. We see less overfitting in the tuned model, though, with the training scores around 95% instead of 100%. 

But is this the best we can do?

## 5. Gradient boosting <a class="anchor" id="gb"></a>

The process of training a random forest classifier involves creating any number of _independent_ decision trees. The gradient boosting classifier, also a tree-based ensemble model, involves training a number of trees _in sequence_, where each subsequent tree is fitted to the remaining error of the ensemble built so far. In this way, each tree compensates for specific shortcomings of its predecessors. The final classification comes from the sum of all trees' contributions, each scaled by the learning rate, rather than from any single tree. Let's see how a baseline GB model performs:

In [22]:
gbc = GradientBoostingClassifier()
gbc.fit(X_train, y_train)
y_train_pred = gbc.predict(X_train)
gbc_baseline_f1_train = f1_score(y_train, y_train_pred)
print('Baseline GBC classification report (training):')
print(classification_report(y_train, y_train_pred))

Baseline GBC classification report (training):
              precision    recall  f1-score   support

           0       0.87      0.81      0.84      3558
           1       0.81      0.88      0.84      3338

    accuracy                           0.84      6896
   macro avg       0.84      0.84      0.84      6896
weighted avg       0.84      0.84      0.84      6896



In [23]:
y_test_pred = gbc.predict(X_test)
gbc_baseline_f1_test = f1_score(y_test, y_test_pred)
print('Baseline GBC classification report (testing):')
print(classification_report(y_test, y_test_pred))

Baseline GBC classification report (testing):
              precision    recall  f1-score   support

           0       0.85      0.78      0.82      1509
           1       0.79      0.86      0.82      1447

    accuracy                           0.82      2956
   macro avg       0.82      0.82      0.82      2956
weighted avg       0.82      0.82      0.82      2956



The testing scores of the RFC and GBC were quite similar. However, we see _much_ less overfitting in the GBC model. This is not surprising, since the default GBC builds shallow trees (`max_depth=3`), which limits how closely the ensemble can fit the training data. Hopefully, we can improve performance with tuning, so that we have the sweet spot of good testing scores and no overfitting.
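The sequential fitting behind gradient boosting can be sketched in a few lines. This is a simplified regression version with squared-error loss on synthetic data, assumed here for illustration; GradientBoostingClassifier applies the same idea to the gradient of a classification loss, and its internals differ in detail:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-feature regression problem (hypothetical data)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate, n_trees = 0.1, 50

# Stage 0: a constant guess at the mean of the target
prediction = np.full_like(y_demo, y_demo.mean())
initial_mse = np.mean((y_demo - prediction) ** 2)

trees = []
for _ in range(n_trees):
    # Residuals are the negative gradient of the squared-error loss
    residuals = y_demo - prediction
    tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_demo, residuals)
    # Every tree adds a small, scaled correction to the running prediction
    prediction += learning_rate * tree.predict(X_demo)
    trees.append(tree)

final_mse = np.mean((y_demo - prediction) ** 2)
print("MSE before boosting:", round(float(initial_mse), 4))
print("MSE after boosting: ", round(float(final_mse), 4))
```

The final prediction is the accumulated sum over all stages, which is why `learning_rate` and `n_estimators` trade off against each other in the tuning grid.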

I will again employ the randomized search with cross-validation. This grid could produce about 5,000 parameter combinations, so I will use a higher `n_iter` value to still capture a good portion of the space.

In [24]:
param_grid = {
    'loss': ['deviance','exponential'],
    'learning_rate': [0.01,0.1,1],
    'n_estimators': [50, 100, 150, 200],
    'criterion': ['friedman_mse','mse'],
    'min_samples_leaf': [1,2,4],
    'min_samples_split': [2,5,10],
    'max_depth': [2,3,4,5],
    'random_state': [42],
    'max_features': ['auto','sqrt','log2']
}

gbc = GradientBoostingClassifier()
gbc_random = RandomizedSearchCV(
    estimator=gbc,
    param_distributions=param_grid,
    n_iter=500,
    cv=5,
    random_state=42
)

start_time = time()
gbc_random.fit(X_train, y_train)
gbc_random_fit_time = time() - start_time
gbc_random.best_params_

{'random_state': 42,
 'n_estimators': 50,
 'min_samples_split': 10,
 'min_samples_leaf': 4,
 'max_features': 'auto',
 'max_depth': 5,
 'loss': 'deviance',
 'learning_rate': 0.1,
 'criterion': 'friedman_mse'}

In [25]:
y_train_pred = gbc_random.predict(X_train)
gbc_tuned_f1_train = f1_score(y_train, y_train_pred)
print('Tuned GBC classification report (training):')
print(classification_report(y_train, y_train_pred))

Tuned GBC classification report (training):
              precision    recall  f1-score   support

           0       0.89      0.83      0.86      3558
           1       0.83      0.89      0.86      3338

    accuracy                           0.86      6896
   macro avg       0.86      0.86      0.86      6896
weighted avg       0.86      0.86      0.86      6896



In [26]:
start_time = time()
y_test_pred = gbc_random.predict(X_test)
gbc_random_pred_time = time() - start_time
gbc_tuned_f1_test = f1_score(y_test, y_test_pred)
print('Tuned GBC classification report (testing):')
print(classification_report(y_test, y_test_pred))

Tuned GBC classification report (testing):
              precision    recall  f1-score   support

           0       0.85      0.78      0.81      1509
           1       0.79      0.85      0.82      1447

    accuracy                           0.82      2956
   macro avg       0.82      0.82      0.82      2956
weighted avg       0.82      0.82      0.82      2956



In [27]:
print('Change in f1 score from parameter tuning:', round(gbc_tuned_f1_test-gbc_baseline_f1_test, 3))

Change in f1 score from parameter tuning: -0.003


In [28]:
gbc_tuned_f1_test, gbc_baseline_f1_test

(0.8216539355695782, 0.8244628099173554)

The GBC was much more resistant to overfitting to the training data. Curiously, parameter tuning led to a slight decrease in f1 score.
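One possible contributor to the decrease: unless a `scoring` argument is given, RandomizedSearchCV ranks classifier candidates by cross-validated accuracy, not f1, and it selects on the training folds rather than the test set. The sketch below shows how a search could be pointed at f1 directly. It is an illustration on synthetic data with a deliberately tiny grid (`demo_grid` and the data are assumptions), not a rerun of the search above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic problem so the example runs quickly
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

demo_grid = {
    'learning_rate': [0.01, 0.1],
    'max_depth': [2, 3],
    'n_estimators': [20, 40],
}

search_f1 = RandomizedSearchCV(
    estimator=GradientBoostingClassifier(random_state=42),
    param_distributions=demo_grid,
    n_iter=4,
    scoring='f1',   # rank candidates by cross-validated f1 instead of accuracy
    cv=3,
    random_state=42,
)
search_f1.fit(X_demo, y_demo)

best_cv_f1 = search_f1.best_score_
print("best parameters:", search_f1.best_params_)
print("best mean CV f1:", round(best_cv_f1, 3))
```

Even with matching metrics, a difference of a few thousandths between baseline and tuned test scores is within the noise one would expect from a single train/test split.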

## 6. Comparing model performance <a class="anchor" id="compare"></a>

Throughout the previous sections, I checked the effect of tuning on a key model metric. I chose the f1 score as the key metric, because it balances precision and recall (it is their harmonic mean). In this use case, there is no reason to think we need to give special priority to reducing either false positives or false negatives, so we do not particularly need to optimize precision or recall.
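To make the harmonic-mean relationship concrete, here is a small worked example in plain Python. The helper `precision_recall_f1` and the toy labels are hypothetical, and the helper assumes at least one predicted positive and one actual positive so that no division by zero occurs:

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and f1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)   # share of predicted positives that are right
    recall = tp / (tp + fn)      # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Toy labels: 2 true positives, 1 false positive, 2 false negatives
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 0, 1, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true_toy, y_pred_toy)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a model cannot earn a high f1 by excelling at precision while neglecting recall, or vice versa.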

Now, I will organize f1 scores (baseline and tuned, training and testing) for each model, as well as fit and prediction times, into a DataFrame for easy comparison.

In [29]:
metrics = pd.DataFrame(
    {
        'baseline_f1_train':[lr_baseline_f1_train, knn_baseline_f1_train, rfc_baseline_f1_train, gbc_baseline_f1_train],
        'baseline_f1_test':[lr_baseline_f1_test, knn_baseline_f1_test, rfc_baseline_f1_test, gbc_baseline_f1_test],
        'tuned_f1_train':[lr_tuned_f1_train, knn_tuned_f1_train, rfc_tuned_f1_train, gbc_tuned_f1_train],
        'tuned_f1_test':[lr_tuned_f1_test, knn_tuned_f1_test, rfc_tuned_f1_test, gbc_tuned_f1_test],
        'fit_time':[lr_grid_fit_time, knn_grid_fit_time, rfc_random_fit_time, gbc_random_fit_time],
        'prediction_time':[lr_grid_pred_time, knn_grid_pred_time, rfc_random_pred_time, gbc_random_pred_time]
    },
    index=['LogReg','KNN','RFC','GBC']
)

metrics

| Model | baseline_f1_train | baseline_f1_test | tuned_f1_train | tuned_f1_test | fit_time | prediction_time |
|---|---|---|---|---|---|---|
| LogReg | 0.794721 | 0.799737 | 0.79432 | 0.80092 | 1.213788 | 0.004575 |
| KNN | 0.84753 | 0.786702 | 0.99985 | 0.788936 | 7.004531 | 0.25944 |
| RFC | 0.99985 | 0.826586 | 0.945139 | 0.827748 | 3944.163748 | 1.931742 |
| GBC | 0.842378 | 0.824463 | 0.856274 | 0.821654 | 1796.987758 | 0.016631 |


In every case except for the GBC, parameter tuning yielded a small improvement in test f1 (no more than about 0.002). Training f1 did not move uniformly: it rose for the KNN and GBC, dropped slightly for the logistic regression, and dropped substantially for the RFC, from nearly 1.00 to 0.945. Although the tuned RFC showed the highest test f1 score (by a slight margin), its gap between training and test scores indicates a greater degree of overfitting than the GBC. Furthermore, compared to the RFC, the GBC required a little under half of the wall time to fit the data, and roughly 1/100 of the time to generate predictions.
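The comparison can be made explicit by computing the train/test gap and the time ratios directly. The sketch below re-enters the tuned-model figures from the table above by hand (in the notebook itself one would operate on the `metrics` DataFrame instead; the `tuned` name and the added column are just for this example):

```python
import pandas as pd

# Tuned-model figures copied from the comparison table
tuned = pd.DataFrame(
    {
        'tuned_f1_train': [0.79432, 0.99985, 0.945139, 0.856274],
        'tuned_f1_test': [0.80092, 0.788936, 0.827748, 0.821654],
        'fit_time': [1.213788, 7.004531, 3944.163748, 1796.987758],
        'prediction_time': [0.004575, 0.25944, 1.931742, 0.016631],
    },
    index=['LogReg', 'KNN', 'RFC', 'GBC'],
)

# A larger gap between training and test f1 suggests more overfitting
tuned['train_test_gap'] = tuned['tuned_f1_train'] - tuned['tuned_f1_test']

# GBC cost relative to RFC
fit_ratio = tuned.loc['GBC', 'fit_time'] / tuned.loc['RFC', 'fit_time']
pred_ratio = tuned.loc['GBC', 'prediction_time'] / tuned.loc['RFC', 'prediction_time']

print(tuned[['tuned_f1_test', 'train_test_gap']].round(3))
print("GBC/RFC fit time ratio:", round(fit_ratio, 2))
print("GBC/RFC prediction time ratio:", round(pred_ratio, 4))
```

Note that the fit times include the whole hyperparameter search, and the two searches used different `n_iter` values, so the fit-time ratio is only a rough guide to the cost of the models themselves.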

## 7. Conclusions and next steps <a class="anchor" id="conclusion"></a>

After collecting ten thousand tracks from Spotify's API, performing exploratory data analysis, cleaning and preprocessing the data, and eliminating what appeared to be irrelevant/redundant features, I ended up with a set of 9,852 tracks, each containing the following features:

- `single` - (Bool) Standalone single or part of an album/compilation
- `danceability`, `energy`, `instrumentalness` - (Float) Spotify-provided measurements of musical qualities, scaled 0 to 1
- `explicit` - (Bool) Includes explicit lyrics
- `collab` - (Bool) Includes one or more guest artists
- `timesig` - (Category) Time signature; a rhythmic measurement
- `duration_s` - (Float) Length in seconds, scaled 0 to 1
- `track_number` - (Float) Position of track on corresponding album, scaled 0 to 1
- __`popularity`__ (Bool) Target

I tried two different classification models: Logistic regression and k-nearest neighbors. Then, I tried two ensemble classification models: Random forest and gradient boosting. For all four model types, I applied some hyperparameter-tuning in an attempt to optimize performance.

In addition to viewing classification reports for all models, I specifically recorded the train f1 and test f1 scores, singling these out as the primary metrics for comparison. I chose the f1 score due to its nature as a harmonic mean of precision and recall; there is no apparent reason to optimize either individually in this use-case.  I also recorded wall-times for the fitting and predicting steps of each model, for a gauge of practicality in deployment.

The random forest classifier yielded a test f1 score of 82.8%, to the gradient boosting classifier's 82.2%. This difference of 0.6 percentage points is not trivial, especially given the big-data context of Spotify's library and user base. However, the RFC's train f1 score is 94.5% to the GBC's 85.6%, indicating much more overfitting in the RFC. This means that we can have somewhat less confidence in the RFC's ability to generalize to new data (beyond the test set), especially with such a slim margin between the test f1 scores. Moreover, the fitting and predicting times for the RFC were drastically higher than the GBC's, and that is an important practical consideration when deploying an ML model.

__Overall, I judge the tuned gradient boosting classifier to be the superior model among these candidates__. With further time, one could dig deeper into tuning the random forest, perhaps with Bayesian optimization, in the hopes of bringing that test f1 score closer to the train f1 score. If the RFC's f1 score could be improved sufficiently higher than that of the GBC, that could outweigh its lengthy fit and predict times and make it the superior model instead.

In [78]:
# Write a plain-text summary of the chosen model. The file handle gets its own
# name so that it does not overwrite the `metrics` DataFrame defined above.
model_features = [
    'Features:\n\n',
    'single - (Bool) Standalone single or part of an album/compilation \n',
    'danceability - (Float) Spotify-provided measurement of musical quality, scaled 0 to 1 \n',
    'energy - see above \n',
    'instrumentalness - see above \n',
    'explicit - (Bool) Includes explicit lyrics \n',
    'collab - (Bool) Includes one or more guest artists \n',
    'timesig - (Category) Time signature; a rhythmic measurement \n',
    'duration_s - (Float) Length in seconds, scaled 0 to 1 \n',
    'track_number - (Float) Track number, max 25, scaled 0 to 1 \n',
    'popularity - (Bool, TARGET) 1 if raw popularity score > 50, else 0 \n\n\n']

# One line per whitespace-separated token of the best estimator's repr
model_params = [x + ' \n' for x in str(gbc_random.best_estimator_).split()]

# The context manager closes the file even if a write fails
with open('../model_summary.txt', 'w') as summary_file:
    summary_file.writelines(model_features)
    summary_file.write('Parameters: \n\n')
    summary_file.writelines(model_params)
    summary_file.write('\n\n')
    summary_file.write('F1 test score: ' + str(round(gbc_tuned_f1_test, 3)))