# Modeling | Modeling Spotify track popularity
## Leo Evancie, Springboard Data Science Career Track

This is the fourth step in a capstone project to model music popularity on Spotify, a popular streaming service. Further project details and rationale can be found in the document 'Proposal.pdf'.

In this notebook, I will apply my cleaned and processed data to a number of models. For each type of model, I will perform hyperparameter tuning with cross-validation. Ultimately, I will determine which model performs best in predicting whether a given track is popular on Spotify.

First, I will read in the data, already split into train and test chunks from the preprocessing stage:

In [31]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

In [4]:
X_train = pd.read_csv('../data/X_train.csv', index_col=0)
y_train = np.ravel(pd.read_csv('../data/y_train.csv', index_col=0))
X_test = pd.read_csv('../data/X_test.csv', index_col=0)
y_test = np.ravel(pd.read_csv('../data/y_test.csv', index_col=0))

### Logistic regression

Baseline model first:

In [18]:
lr = LogisticRegression(random_state=42)
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
#save baseline f1 score for later comparison to tuned model
lr_baseline_f1 = f1_score(y_test, y_pred)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.73      0.79      0.76      1509
           1       0.76      0.70      0.73      1447

    accuracy                           0.74      2956
   macro avg       0.75      0.74      0.74      2956
weighted avg       0.75      0.74      0.74      2956



Now I will tune hyperparameters. The most straightforward method of hyperparameter tuning is grid search with cross-validation. We provide a set of values for any or all model parameters, and the GridSearchCV object systematically tests every possible combination of those given parameters (i.e., it steps through a multidimensional 'grid' of parameter values).
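To make the "grid" idea concrete, here is a minimal sketch of how every combination in a parameter grid can be enumerated. The grid below and the helper name `expand_grid` are hypothetical and for illustration only; GridSearchCV does this bookkeeping internally.

```python
from itertools import product

# Hypothetical two-parameter grid, for illustration only.
param_grid = {
    "C": [0.1, 1, 10],
    "penalty": ["l1", "l2"],
}

def expand_grid(grid):
    """Return every combination of parameter values as a list of dicts."""
    names = sorted(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]

combos = expand_grid(param_grid)
print(len(combos))  # 3 values of C x 2 penalties = 6 combinations
for combo in combos:
    print(combo)
```

The number of combinations is the product of the list lengths, which is why a grid grows so quickly as parameters are added.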

The 'CV' in GridSearchCV stands for cross-validation, a useful method for guarding against overfitting to one particular split of the data. GridSearchCV applies cross-validation at each step of the grid search. The training data is split into k "folds", or roughly equal-sized chunks (the default was k=3 in older scikit-learn releases and has been k=5 since version 0.22; it can be changed with the cv argument). A model is configured with the current set of parameters and fit to k-1 folds, with the remaining fold set aside to act as a test set. A model score is produced and saved. Then a new model, using the same parameters, is fit to a different set of k-1 folds and scored on a different holdout fold, producing a second score. This process is performed a total of k times, each iteration using a different fold as the holdout set, until we have k scores. Those scores are then averaged.

This gives us a single score for each set of parameters, and we can be more confident in it, because averaging over folds lowers the chance that the score was unduly influenced by the incidental nature of any one train/test split. We have, in a sense, simulated the train-test split k times, without ever exposing the model to the real test data.
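The fold-by-fold procedure can be sketched in plain Python. This is a simplified illustration, not how scikit-learn implements it: the toy data, the threshold "model", and the helper names (`k_fold_indices`, `cross_validate`, `fit_threshold`, `accuracy`) are all assumptions made for the example.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the row indices and deal them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(fit, score, X, y, k=5):
    """Fit on k-1 folds, score on the held-out fold, and repeat k times."""
    scores = []
    for holdout in k_fold_indices(len(X), k):
        held_out = set(holdout)
        train = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train], [y[i] for i in train])
        scores.append(score(model, [X[i] for i in holdout], [y[i] for i in holdout]))
    return scores

# Toy stand-in for a real classifier: predict 1 when x exceeds the midpoint
# between the two class means seen in the training folds.
def fit_threshold(X, y):
    mean0 = sum(x for x, label in zip(X, y) if label == 0) / y.count(0)
    mean1 = sum(x for x, label in zip(X, y) if label == 1) / y.count(1)
    return (mean0 + mean1) / 2

def accuracy(threshold, X, y):
    return sum(int(x > threshold) == label for x, label in zip(X, y)) / len(y)

# Made-up one-feature dataset: 20 rows of class 0 followed by 20 rows of class 1.
X = [0.1 * i for i in range(40)]
y = [0] * 20 + [1] * 20

scores = cross_validate(fit_threshold, accuracy, X, y, k=5)
mean_score = sum(scores) / len(scores)
print(scores, round(mean_score, 3))
```

Each row lands in the holdout fold exactly once, so the averaged score uses all of the training data for evaluation without touching the real test set.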

Here, I will only test a range of values for the regularization parameter C (the inverse of regularization strength), along with a fixed random state for the sake of reproducibility.

In [19]:
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'random_state':[42]
}

lr = LogisticRegression()
lr_grid = GridSearchCV(estimator=lr, param_grid=param_grid)
lr_grid.fit(X_train, y_train)
lr_grid.best_params_

{'C': 10, 'random_state': 42}

In [20]:
y_pred = lr_grid.predict(X_test)
lr_tuned_f1 = f1_score(y_test, y_pred)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.73      0.78      0.76      1509
           1       0.76      0.70      0.73      1447

    accuracy                           0.74      2956
   macro avg       0.74      0.74      0.74      2956
weighted avg       0.74      0.74      0.74      2956



In [21]:
print('Change in f1 score from parameter tuning:', round(lr_tuned_f1-lr_baseline_f1, 3))

Change in f1 score from parameter tuning: 0.001


There was barely any improvement, which is not entirely unexpected given that I tested only a small number of values for a single parameter.

### K-nearest neighbors

Baseline model:

In [22]:
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
knn_baseline_f1 = f1_score(y_test, y_pred)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.80      0.72      0.76      1509
           1       0.73      0.81      0.77      1447

    accuracy                           0.76      2956
   macro avg       0.77      0.77      0.76      2956
weighted avg       0.77      0.76      0.76      2956



Next, I perform a grid search over a range of numbers of neighbors, as well as uniform vs. distance-based weights:

In [23]:
param_grid = {
    'n_neighbors': [1,3,5,7,9],
    'weights': ['uniform', 'distance']
}

knn = KNeighborsClassifier()
knn_grid = GridSearchCV(estimator=knn, param_grid=param_grid)
knn_grid.fit(X_train, y_train)
knn_grid.best_params_

{'n_neighbors': 9, 'weights': 'distance'}
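The search favored distance-based weights. As a rough illustration of the difference between the two options, the sketch below votes over a made-up set of neighbors in both ways. The helper `knn_vote` and the neighbor distances are hypothetical, and the sketch assumes no neighbor sits at distance zero.

```python
from collections import defaultdict

def knn_vote(neighbors, weights="uniform"):
    """Pick a label from (distance, label) pairs for the k nearest points."""
    totals = defaultdict(float)
    for distance, label in neighbors:
        if weights == "uniform":
            totals[label] += 1.0             # every neighbor gets one vote
        else:
            totals[label] += 1.0 / distance  # closer neighbors count for more
    return max(totals, key=totals.get)

# Hypothetical query with k=3: one very close popular track (label 1)
# and two more distant unpopular tracks (label 0).
neighbors = [(0.1, 1), (0.9, 0), (1.0, 0)]

print(knn_vote(neighbors, "uniform"))   # 0: two votes against one
print(knn_vote(neighbors, "distance"))  # 1: weight 10.0 against about 2.1
```

With uniform weights the two distant neighbors outvote the close one; with inverse-distance weights the single close neighbor dominates.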

In [24]:
y_pred = knn_grid.predict(X_test)
knn_tuned_f1 = f1_score(y_test, y_pred)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.81      0.73      0.76      1509
           1       0.74      0.82      0.78      1447

    accuracy                           0.77      2956
   macro avg       0.77      0.77      0.77      2956
weighted avg       0.77      0.77      0.77      2956



In [26]:
print('Change in f1 score from parameter tuning:', round(knn_tuned_f1-knn_baseline_f1, 3))

Change in f1 score from parameter tuning: 0.006


### Random Forest

Random forests are a popular ensemble model, in which many decision trees are each trained on a bootstrapped sample of the training data, considering a random subset of features at each split. Classification is determined by pooling the decisions from the resulting "forest" of individual trees. In order to set a baseline, I will first create a random forest with all default parameters (setting the random state for reproducibility).
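Before fitting the baseline, the bootstrap-and-vote idea can be sketched in a few lines of plain Python. This is a deliberately reduced illustration, not the RandomForestClassifier algorithm: each "tree" here is a single-split stump on one made-up feature, there is no random feature selection, and all names and data are assumptions for the example.

```python
import random
from collections import Counter

def fit_stump(X, y):
    """One-split 'tree' on 1-D data: choose the threshold with the fewest errors."""
    best = None
    for t in sorted(set(X)):
        # The stump predicts 1 when x > t, otherwise 0.
        errors = sum((x > t) != bool(label) for x, label in zip(X, y))
        if best is None or errors < best[0]:
            best = (errors, t)
    return best[1]

def fit_bagged_stumps(X, y, n_trees=25, seed=42):
    """Train each stump on its own bootstrap sample of the training rows."""
    rng = random.Random(seed)
    n = len(X)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # n rows drawn with replacement
        stumps.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return stumps

def predict(stumps, x):
    """Majority vote over the individual stump predictions."""
    votes = Counter(int(x > t) for t in stumps)
    return votes.most_common(1)[0][0]

# Made-up, slightly noisy data: larger values are mostly class 1.
X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

forest = fit_bagged_stumps(X, y)
print(sorted(forest))                           # thresholds differ between stumps
print(predict(forest, 2), predict(forest, 9))
```

Because every stump sees a different resample, the learned thresholds vary, and the vote smooths over the noise that any single stump might latch onto.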

In [27]:
rfc = RandomForestClassifier(random_state=42)
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)
rfc_baseline_f1 = f1_score(y_test, y_pred)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.81      0.80      0.80      1509
           1       0.79      0.81      0.80      1447

    accuracy                           0.80      2956
   macro avg       0.80      0.80      0.80      2956
weighted avg       0.80      0.80      0.80      2956



Even the out-of-the-box random forest beats the tuned versions of our previous model candidates on accuracy and f1 score. Let's see what happens after tuning.

GridSearchCV is feasible when a model has relatively few important parameters, or when there is reason to test only a small number of values for each. However, as the number of parameters and values per parameter increases, grid searching quickly becomes infeasible in terms of computing capacity and time. This is when we turn to a related method: RandomizedSearchCV. Rather than testing every possible combination of parameter values, RandomizedSearchCV evaluates a specified number of combinations (n_iter), chosen at random, which drastically reduces the number of models evaluated. And yet this process has often been found to perform nearly as well as the brute-force grid search. This is likely because only a subset of parameters tends to make much difference in a model's performance on any given dataset, so testing every possible combination means producing a lot of practically redundant models. The randomized search does a reasonably good job of approximating the actual variation in performance across the provided parameter grid.

Since the RandomForestClassifier has so many potentially impactful parameters, I will employ the more cost-effective RandomizedSearchCV. Guidance for this section came from the sklearn documentation and this blog post: https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74.
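To put rough numbers on that cost difference, the sketch below counts the combinations in the random forest grid used in the next cell and compares the number of cross-validation fits for an exhaustive search against a 60-iteration random search. The sampling shown is only an illustration using Python's random module; it will not reproduce the exact combinations that RandomizedSearchCV draws.

```python
import random
from itertools import product
from math import prod

# Same values as the random forest grid searched in this notebook.
param_grid = {
    'n_estimators': [100, 500, 800, 1000, 1500, 2000],
    'criterion': ['gini', 'entropy'],
    'max_depth': [20, 50, 70, 100, None],
    'max_features': ['auto', 'sqrt'],
    'min_samples_leaf': [1, 2, 4],
    'min_samples_split': [2, 5, 10],
    'random_state': [42],
}

n_combinations = prod(len(values) for values in param_grid.values())
n_iter, cv = 60, 5

print(n_combinations)       # 1080 distinct parameter combinations
print(n_combinations * cv)  # 5400 cross-validation fits for a full grid search
print(n_iter * cv)          # 300 cross-validation fits for the randomized search

# Illustrative random draw of 60 distinct combinations from the grid.
names = list(param_grid)
all_combos = [dict(zip(names, values)) for values in product(*param_grid.values())]
sampled = random.Random(42).sample(all_combos, n_iter)
print(sampled[0])
```

The randomized search therefore does roughly one eighteenth of the fitting work of the full grid here.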

In [28]:
param_grid = {
    'n_estimators': [100,500,800,1000,1500,2000],
    'criterion': ['gini','entropy'],
    'max_depth': [20,50,70,100,None],
    # 'auto' means the same as 'sqrt' for classifiers; newer scikit-learn releases no longer accept it
    'max_features': ['auto','sqrt'],
    'min_samples_leaf': [1,2,4],
    'min_samples_split': [2,5,10],
    'random_state':[42]
}

rfc = RandomForestClassifier()
rfc_random = RandomizedSearchCV(
    estimator=rfc,
    param_distributions=param_grid,
    n_iter=60,
    cv=5,
    random_state=42
)

rfc_random.fit(X_train, y_train)
rfc_random.best_params_

{'random_state': 42,
 'n_estimators': 800,
 'min_samples_split': 2,
 'min_samples_leaf': 2,
 'max_features': 'auto',
 'max_depth': None,
 'criterion': 'entropy'}

Even the more time-friendly random search process consumed several minutes. I would not even attempt a GridSearchCV with the above parameter grid. Let's see how the tuned model performs.

In [29]:
y_pred = rfc_random.predict(X_test)
rfc_tuned_f1 = f1_score(y_test, y_pred)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.83      0.78      0.80      1509
           1       0.79      0.83      0.81      1447

    accuracy                           0.81      2956
   macro avg       0.81      0.81      0.81      2956
weighted avg       0.81      0.81      0.81      2956



In [30]:
print('Change in f1 score from parameter tuning:', round(rfc_tuned_f1-rfc_baseline_f1, 3))

Change in f1 score from parameter tuning: 0.009


Nearly a full percentage-point increase in the f1 score. That means that not only did the random forest model have the best baseline metrics, it also responded most drastically to our parameter tuning. This is a good illustration of the power of ensemble models, drawing on the "wisdom of crowds" to generate classifications by surveying outputs from hundreds of models.

But is this the best we can do?

### GradientBoostingClassifier

The process of training a random forest classifier involves creating any number of _independent_ decision trees. The gradient boosting classifier, also a tree-based ensemble model, involves training a number of trees _in sequence_, where each subsequent tree is fitted to the remaining errors of the ensemble built so far. In this way, each new tree compensates for the specific shortcomings of the trees before it. Let's see how a baseline GB model performs.
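Before fitting it, the "in sequence" idea can be sketched with a simplified example. The code below boosts one-split regression stumps on squared error, where each new stump is fit to the current residuals. This is an assumption-laden simplification: GradientBoostingClassifier works with a classification loss and deeper trees, and the data and helper names here are made up for illustration.

```python
def fit_stump(X, residuals):
    """Best single split on 1-D data under squared error.

    Returns (threshold, left_value, right_value)."""
    best = None
    for t in sorted(set(X))[:-1]:  # splitting at the max would leave the right side empty
        left = [r for x, r in zip(X, residuals) if x <= t]
        right = [r for x, r in zip(X, residuals) if x > t]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        sse = (sum((r - left_value) ** 2 for r in left)
               + sum((r - right_value) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_value, right_value)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value

def boost(X, y, n_rounds=20, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to what is still unexplained."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps, mse_history = [], []
    for _ in range(n_rounds):
        residuals = [target - p for target, p in zip(y, preds)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x) for p, x in zip(preds, X)]
        mse_history.append(sum((target - p) ** 2 for target, p in zip(y, preds)) / len(y))
    return base, stumps, mse_history

# Made-up data: larger x mostly means y = 1.
X = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 0, 1, 0, 1, 1, 1]

base, stumps, mse_history = boost(X, y)
print(round(mse_history[0], 4), round(mse_history[-1], 4))
```

The training error shrinks round by round because every stump targets only the part of the signal the earlier stumps missed, and the learning rate controls how large each corrective step is.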

In [32]:
gbc = GradientBoostingClassifier()
gbc.fit(X_train, y_train)
y_pred = gbc.predict(X_test)
gbc_baseline_f1 = f1_score(y_test, y_pred)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.83      0.76      0.79      1509
           1       0.77      0.84      0.80      1447

    accuracy                           0.80      2956
   macro avg       0.80      0.80      0.80      2956
weighted avg       0.80      0.80      0.80      2956



Similar performance to the baseline RF classifier. Now, let's tune. I will again employ the randomized search with cross-validation:

In [34]:
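# These option names match the scikit-learn release used for this notebook.
# Newer releases rename 'deviance' to 'log_loss' and 'mse' to 'squared_error',
# and no longer accept 'auto' for max_features.
param_grid = {
    'loss': ['deviance','exponential'],
    'learning_rate': [0.01,0.1,1],
    'n_estimators': [50, 100, 150, 200],
    'criterion': ['friedman_mse','mse'],
    'min_samples_leaf': [1,2,4],
    'min_samples_split': [2,5,10],
    'max_depth': [2,3,4,5],
    'random_state': [42],
    'max_features': ['auto','sqrt','log2']
}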

gbc = GradientBoostingClassifier()
gbc_random = RandomizedSearchCV(
    estimator=gbc,
    param_distributions=param_grid,
    n_iter=60,
    cv=5,
    random_state=42
)

# %timeit repeats the full search several times to average the timing; %time would run it once
%timeit gbc_random.fit(X_train, y_train)
gbc_random.best_params_

2min 59s ± 6.96 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


{'random_state': 42,
 'n_estimators': 100,
 'min_samples_split': 5,
 'min_samples_leaf': 4,
 'max_features': 'sqrt',
 'max_depth': 5,
 'loss': 'deviance',
 'learning_rate': 0.1,
 'criterion': 'friedman_mse'}

In [35]:
y_pred = gbc_random.predict(X_test)
gbc_tuned_f1 = f1_score(y_test, y_pred)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.83      0.76      0.79      1509
           1       0.77      0.84      0.80      1447

    accuracy                           0.80      2956
   macro avg       0.80      0.80      0.80      2956
weighted avg       0.80      0.80      0.80      2956



In [36]:
print('Change in f1 score from parameter tuning:', round(gbc_tuned_f1-gbc_baseline_f1, 3))

Change in f1 score from parameter tuning: 0.002


So, while the default GB model performed similarly to the default RF model, the GB model was not improved nearly as much by parameter tuning as the RF model was.

## Comparing model performance and the effects of parameter tuning

Throughout the previous section, I checked the effect of tuning on a key model metric. I chose the f1 score because it balances precision and recall: it is their harmonic mean, so it is high only when both are high. (In this use case, there is no reason to think that false positives are costlier than false negatives or vice versa, so we do not need to optimize specifically for precision or recall.)
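As a quick check on the harmonic-mean definition, the sketch below computes f1 from confusion counts and from a precision/recall pair. The counts are hypothetical; the 0.76 and 0.70 values are the rounded class-1 precision and recall from the baseline logistic regression report above. The helper name `f1_from_counts` is an assumption for the example.

```python
def f1_from_counts(tp, fp, fn):
    """Precision, recall, and their harmonic mean from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts, for illustration only.
precision, recall, f1 = f1_from_counts(tp=70, fp=20, fn=30)
print(round(precision, 3), round(recall, 3), round(f1, 3))

# Rounded class-1 precision and recall from the baseline logistic regression report.
p, r = 0.76, 0.70
f1_report = 2 * p * r / (p + r)
print(round(f1_report, 2))  # 0.73, matching the f1-score column of that report

# The harmonic mean punishes imbalance far more than a simple average would.
lopsided_f1 = 2 * 1.0 * 0.1 / (1.0 + 0.1)
print(round(lopsided_f1, 3), (1.0 + 0.1) / 2)
```

A model with perfect precision but 10% recall gets an f1 of about 0.18 even though the plain average of the two is 0.55, which is why f1 rewards models that do reasonably well on both.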

Now, I will organize both the baseline and tuned f1 scores for each model into a DataFrame for easy comparison and visualization.

In [38]:
metrics = pd.DataFrame(
    {
        'baseline_f1':[lr_baseline_f1, knn_baseline_f1, rfc_baseline_f1, gbc_baseline_f1],
        'tuned_f1':[lr_tuned_f1, knn_tuned_f1, rfc_tuned_f1, gbc_tuned_f1]
    },
    index=['LogReg','KNN','RFC','GBC']
)

metrics

| | baseline_f1 | tuned_f1 |
|---|---|---|
| LogReg | 0.728058 | 0.729159 |
| KNN | 0.771100 | 0.777485 |
| RFC | 0.797672 | 0.806321 |
| GBC | 0.800265 | ≈0.802 |
