# Modeling Spotify track popularity
## Leo Evancie, Springboard Data Science Career Track

This is the fourth step in a capstone project to model music popularity on Spotify, a popular streaming service. Further project details and rationale can be found in the document 'Proposal.pdf'.

In this notebook, I will fit a number of models to my cleaned and processed data. For each type of model, I will perform hyperparameter tuning with cross-validation. Ultimately, I will determine which model performs best in predicting whether a given track is popular on Spotify.

First, I will read in the data, already split into train and test chunks from the preprocessing stage:

In [22]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [2]:
X_train = pd.read_csv('../data/X_train.csv', index_col=0)
y_train = pd.read_csv('../data/y_train.csv', index_col=0)
X_test = pd.read_csv('../data/X_test.csv', index_col=0)
y_test = pd.read_csv('../data/y_test.csv', index_col=0)

### Random Forest

Random forests are a popular ensemble model in which many decision trees are each trained on a bootstrapped sample of the training data. The final classification is determined by aggregating the predictions of the individual trees in the resulting "forest". To set a baseline, I will first create a random forest with all default parameters (setting the random state for reproducibility).

In [14]:
rfc = RandomForestClassifier(random_state=42)
rfc.fit(X_train, np.ravel(y_train))
y_pred = rfc.predict(X_test)
rfc_baseline_accuracy = accuracy_score(y_test, y_pred)
print('Accuracy score for baseline RandomForestClassifier:', rfc_baseline_accuracy)

Accuracy score for baseline RandomForestClassifier: 0.8000676589986468
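
As an aside, the bootstrap-and-aggregate idea behind the forest can be sketched by hand. This is only an illustration on synthetic data (the `*_demo` names, the tree count, and the dataset are my own assumptions, not part of the project), and it uses a plain majority vote, whereas scikit-learn's RandomForestClassifier combines trees by averaging their predicted probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: NOT the Spotify features).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

rng = np.random.default_rng(42)
n_trees = 25  # odd, so a binary majority vote cannot tie
votes = []
for _ in range(n_trees):
    # Bootstrap: draw len(X_tr) row indices with replacement.
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    tree = DecisionTreeClassifier(
        max_features='sqrt',  # consider a random subset of features at each split
        random_state=int(rng.integers(1_000_000)),
    )
    tree.fit(X_tr[idx], y_tr[idx])
    votes.append(tree.predict(X_te))

votes = np.array(votes)  # shape: (n_trees, n_test_samples)
# Majority vote across trees for binary labels 0/1.
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)
demo_accuracy = (forest_pred == y_te).mean()
print('Hand-rolled forest accuracy on synthetic data:', demo_accuracy)
```

Each tree sees a slightly different resampling of the rows, which is what makes the trees' errors partly independent and the aggregate more stable than any single tree.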


Now I will tune hyperparameters. The most straightforward method of hyperparameter tuning is grid search with cross-validation. We provide a set of values for any or all model parameters, and the GridSearchCV object systematically tests every possible combination of those given parameters (i.e., it steps through a multidimensional 'grid' of parameter values).

The 'CV' in GridSearchCV stands for cross-validation, a useful method for guarding against overfitting. GridSearchCV applies cross-validation at each step of the grid search. The training data is split into k "folds", or roughly equal-sized chunks (the default value for k in GridSearchCV is 5 as of scikit-learn 0.22, and was 3 in earlier versions; it can be changed with the `cv` argument). A model is configured with the current set of parameters and fit to k-1 folds, leaving the kth aside to act as a validation set. A model score is produced and saved. Then a new model with the same parameters is fit to a different set of k-1 folds, with a different fold held out, producing a second score. This process is performed a total of k times, each iteration holding out a different fold, until we have k scores. Those scores are then averaged.

This gives us a single score for a given set of parameters, and we can be more confident in it, because cross-validation lowers the chance that the score was skewed by the incidental nature of any one split. We have, in a sense, simulated the train-test split k times, without ever exposing the model to the real test data.
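
The fold-by-fold procedure described above can be written out as an explicit loop. This is a minimal sketch on synthetic data, with one hypothetical parameter setting (`C=1.0`) standing in for a single point on the grid; the variable names are illustrative. It uses StratifiedKFold because that is what scikit-learn applies for classifiers when `cv` is given as an integer.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the training data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

k = 5
params = {'C': 1.0}  # one candidate from a hypothetical grid
folds = StratifiedKFold(n_splits=k)

fold_scores = []
for fit_idx, holdout_idx in folds.split(X_demo, y_demo):
    # Fit a fresh model on k-1 folds, score it on the held-out fold.
    model = LogisticRegression(max_iter=1000, **params)
    model.fit(X_demo[fit_idx], y_demo[fit_idx])
    fold_scores.append(model.score(X_demo[holdout_idx], y_demo[holdout_idx]))

mean_score = np.mean(fold_scores)
print('Per-fold accuracy:', np.round(fold_scores, 3))
print('Mean CV accuracy:', round(mean_score, 3))

# The built-in helper performs the same loop.
helper_scores = cross_val_score(
    LogisticRegression(max_iter=1000, **params), X_demo, y_demo, cv=folds
)
```

A grid search repeats this loop once per parameter combination and keeps the combination with the highest mean score.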

When building a model with relatively few important parameters, and/or when you have reason to test only a small number of values for each parameter, GridSearchCV is feasible. However, as the number and cardinality of your tested parameters increase, grid searching becomes computationally infeasible. This is when we turn to a related method: RandomizedSearchCV. Rather than testing every possible combination of parameter values, RandomizedSearchCV evaluates a specified number of combinations, chosen at random, which can drastically reduce the number of models evaluated. And yet this process has been shown to perform nearly as well as the brute-force grid search in many cases. This is, in part, because only a subset of parameters is likely to make much difference in a model's performance on any given dataset, so testing every possible combination would mean producing a lot of practically redundant models. The random search does a pretty good job of capturing, or approximating, the actual variation in performance across the provided parameter grid.
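
To put rough numbers on the cost difference, the sketch below counts candidates for a search space that mirrors the random forest grid used later in this notebook. Nothing is fitted here; `search_space` and `n_folds` are illustrative names, and the count ignores the single final refit on the full training set.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same values as the random forest grid below; only used for counting.
search_space = {
    'n_estimators': [100, 500, 800, 1000, 1500, 2000],
    'criterion': ['gini', 'entropy'],
    'max_depth': [20, 50, 70, 100, None],
    'max_features': ['auto', 'sqrt'],
    'min_samples_leaf': [1, 2, 4],
    'min_samples_split': [2, 5, 10],
}
n_folds = 5

# Exhaustive grid: every combination of every value.
n_grid = len(ParameterGrid(search_space))

# Random search: a fixed budget of combinations drawn from the same space.
sampled = list(ParameterSampler(search_space, n_iter=60, random_state=42))

print('Grid search candidates:', n_grid, '->', n_grid * n_folds, 'fits')
print('Random search candidates:', len(sampled), '->', len(sampled) * n_folds, 'fits')
```

With 5-fold cross-validation, the exhaustive grid would require 5,400 model fits, against 300 for a 60-candidate random search.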

Since the RandomForestClassifier has so many potentially impactful parameters, instead of testing every possible combination with a grid search, I will employ the more cost-effective RandomizedSearchCV. Guidance for this section came from the sklearn documentation and this blog post: https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74.

In [15]:
param_grid = {
    'n_estimators': [100,500,800,1000,1500,2000],
    'criterion': ['gini','entropy'],
    'max_depth': [20,50,70,100,None],
    'max_features': ['auto','sqrt'],  # for classifiers 'auto' meant the same as 'sqrt'; it was removed in scikit-learn 1.3
    'min_samples_leaf': [1,2,4],
    'min_samples_split': [2,5,10],
    'random_state':[42]
}

rfc = RandomForestClassifier()
rfc_random = RandomizedSearchCV(
    estimator=rfc,
    param_distributions=param_grid,
    n_iter=60,
    cv=5,
    random_state=42
)

rfc_random.fit(X_train, np.ravel(y_train))
rfc_random.best_params_

{'random_state': 42,
 'n_estimators': 800,
 'min_samples_split': 2,
 'min_samples_leaf': 2,
 'max_features': 'auto',
 'max_depth': None,
 'criterion': 'entropy'}

Even the more time-friendly random search took several minutes. Above, we can see the parameters corresponding to the best-performing model. Because the search refits that model on the full training set by default, we can use the search object's `.predict()` method directly, rather than having to create a new model with the resulting best parameters.
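
As a quick sanity check of that shortcut, the sketch below runs a deliberately tiny search on synthetic data (the data, the two-parameter space, and the variable names are assumptions for illustration) and confirms that calling `.predict()` on the search object gives the same result as calling it on `best_estimator_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic dataset so the search finishes in seconds (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=1)

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions={'n_estimators': [10, 20, 30], 'max_depth': [3, 5, None]},
    n_iter=4,
    cv=3,
    random_state=42,
)
search.fit(X_demo, y_demo)  # refit=True by default: the best model is retrained on all of X_demo

via_search = search.predict(X_demo)
via_best = search.best_estimator_.predict(X_demo)
same_predictions = np.array_equal(via_search, via_best)

print('Best params:', search.best_params_)
print('search.predict matches best_estimator_.predict:', same_predictions)
```
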

In [17]:
y_pred = rfc_random.predict(X_test)
rfc_tuned_accuracy = accuracy_score(y_test, y_pred)
print('Accuracy score for tuned RandomForestClassifier:', rfc_tuned_accuracy)

print('Accuracy gained from hyperparameter tuning:', rfc_tuned_accuracy-rfc_baseline_accuracy)

Accuracy score for tuned RandomForestClassifier: 0.8051420838971584
Accuracy gained from hyperparameter tuning: 0.005074424898511509


All of that for an additional half a percentage point of accuracy! Let's look at some of the other metrics:

In [18]:
print('Accuracy:', rfc_tuned_accuracy)
print('Recall:', recall_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))
print('F1:', f1_score(y_test,y_pred))

Accuracy: 0.8051420838971584
Recall: 0.8286109191430546
Precision: 0.785199738048461
F1: 0.8063214525891056


Overall, tuning did improve model performance over the baseline, but not by enough to stop here, so I will consider further candidates.
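
For readers less familiar with the metrics printed above, here is a small worked example showing how each one is derived from the four confusion-matrix counts. The labels are made up for illustration and have nothing to do with the Spotify data.

```python
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score
)

# Toy labels (assumption): 4 popular tracks (1) and 6 unpopular tracks (0).
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # of predicted positives, share that are truly positive
recall = tp / (tp + fn)                     # of true positives, share that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print('TP, FP, FN, TN:', tp, fp, fn, tn)
print('Accuracy:', accuracy)
print('Precision:', precision)
print('Recall:', recall)
print('F1:', f1)
```

Here TP=3, FP=2, FN=1, and TN=4, so accuracy is 0.7, precision is 0.6, recall is 0.75, and F1 is about 0.667.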

### Logistic regression

Baseline model first:

In [19]:
lr = LogisticRegression(random_state=42)
lr.fit(X_train, np.ravel(y_train))
y_pred = lr.predict(X_test)
lr_baseline_accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', lr_baseline_accuracy)
print('Recall:', recall_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))
print('F1:', f1_score(y_test,y_pred))

Accuracy: 0.7442489851150202
Recall: 0.6993780234968902
Precision: 0.7591897974493623
F1: 0.7280575539568346


Next, use simple grid search to test a range of regularization parameters:

In [20]:
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'random_state':[42]
}

lr = LogisticRegression()
lr_grid = GridSearchCV(estimator=lr, param_grid=param_grid)
lr_grid.fit(X_train, np.ravel(y_train))
lr_grid.best_params_

{'C': 10, 'random_state': 42}

In [21]:
y_pred = lr_grid.predict(X_test)
lr_tuned_accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', lr_tuned_accuracy)
print('Recall:', recall_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))
print('F1:', f1_score(y_test,y_pred))
print('Accuracy gained from tuning:', lr_tuned_accuracy-lr_baseline_accuracy)

Accuracy: 0.7439106901217862
Recall: 0.7042156185210781
Precision: 0.7559347181008902
F1: 0.7291592128801431
Accuracy gained from tuning: -0.0003382949932340118


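Here's an interesting result. LogisticRegression's default value for C is 1. GridSearchCV tested a range of C values, including 1, and judged 10 to be a better value. But here, we see that C=10 actually produced very slightly lower accuracy on the test set. The explanation is that GridSearchCV never sees the test set: for a classifier, it ranks candidates by mean cross-validated accuracy computed on folds of the training data (its default scoring is the estimator's own `score` method). C=10 scored best by that measure, and a difference this small (about 0.0003) can easily arise simply because the test set is a different sample from the training folds.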

Looking at the other metrics, we can see that, compared with the baseline, the tuned model yielded slightly lower accuracy and precision, but slightly higher recall and F1 score.
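
For reference, the selection rule can be inspected directly. GridSearchCV ranks candidates by their mean cross-validated score on the training folds; with `scoring=None` (the default) that score comes from the estimator's own `score` method, which is accuracy for classifiers. The sketch below uses synthetic data and illustrative names, and also shows how a different metric such as F1 could be requested.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
c_values = {'C': [0.001, 0.01, 0.1, 1, 10, 100]}

grid = GridSearchCV(estimator=LogisticRegression(max_iter=1000), param_grid=c_values)
grid.fit(X_demo, y_demo)

# 'mean_test_score' is the mean over held-out folds, not the real test set.
mean_cv = grid.cv_results_['mean_test_score']
for c, score in zip(grid.cv_results_['param_C'], mean_cv):
    print(f'C={c}: mean CV accuracy = {score:.4f}')
print('Selected:', grid.best_params_, 'with mean CV score', round(grid.best_score_, 4))

# To select on F1 instead of accuracy, pass a scoring string.
f1_grid = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000), param_grid=c_values, scoring='f1'
)
f1_grid.fit(X_demo, y_demo)
print('Selected by F1:', f1_grid.best_params_)
```
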

### K-nearest neighbors

Baseline model:

In [24]:
knn = KNeighborsClassifier()
knn.fit(X_train, np.ravel(y_train))
y_pred = knn.predict(X_test)
knn_baseline_f1 = f1_score(y_test, y_pred)
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Recall:', recall_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))
print('F1:', knn_baseline_f1)

Accuracy: 0.7642083897158322
Recall: 0.8113337940566689
Precision: 0.7346683354192741
F1: 0.7711001642036126


Next, perform a grid search over several values for the number of neighbors, as well as uniform vs. distance-based weights:

In [25]:
param_grid = {
    'n_neighbors': [1,3,5,7,9],
    'weights': ['uniform', 'distance']
}

knn = KNeighborsClassifier()
knn_grid = GridSearchCV(estimator=knn, param_grid=param_grid)
knn_grid.fit(X_train, np.ravel(y_train))
knn_grid.best_params_

{'n_neighbors': 9, 'weights': 'distance'}

In [27]:
y_pred = knn_grid.predict(X_test)
knn_tuned_f1 = f1_score(y_test, y_pred)
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Recall:', recall_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))
print('F1:', knn_tuned_f1)
print('F1 gained from tuning:', knn_tuned_f1-knn_baseline_f1)

Accuracy: 0.7713125845737483
Recall: 0.816171389080857
Precision: 0.7423004399748586
F1: 0.7774851876234364
F1 gained from tuning: 0.006385023419823832
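
To make the `weights` option from the grid above concrete, here is a tiny one-dimensional example (the points and labels are invented for illustration). With uniform weights, each of the k neighbors gets one vote; with distance weights, each neighbor's vote is scaled by the inverse of its distance to the query point.

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy training set (assumption): one nearby point of class 1, two farther points of class 0.
X_toy = [[0.0], [2.0], [3.0]]
y_toy = [1, 0, 0]
query = [[0.5]]  # distances to the training points: 0.5, 1.5, 2.5

uniform_knn = KNeighborsClassifier(n_neighbors=3, weights='uniform').fit(X_toy, y_toy)
distance_knn = KNeighborsClassifier(n_neighbors=3, weights='distance').fit(X_toy, y_toy)

# Uniform: class 0 wins two votes to one.
uniform_pred = uniform_knn.predict(query)
# Distance: class 1 has weight 1/0.5 = 2.0; class 0 has 1/1.5 + 1/2.5, about 1.07.
distance_pred = distance_knn.predict(query)

print('Uniform weights predict:', uniform_pred[0])
print('Distance weights predict:', distance_pred[0])
```

The same three neighbors produce different predictions under the two schemes, which is why the setting is worth including in the search.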
