# Spotify: Top 200 Weekly (Global) Song Analysis

This notebook takes an in-depth look at the hyperparameters tuned for several different models when running a grid search. The models that were run can be seen [here](./02_Preprocessing_And_Modeling).

In [1]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV

In [3]:
# Read in the Dataset

df = pd.read_csv('./datasets/spotify_df.csv')
final_df = pd.read_csv('./datasets/final_df.csv')

In [7]:
# Defining X and y variables
X = df.drop(columns=['popularity', 'chord'])
y = df['popularity']

In [10]:
# Create Train Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)

In [15]:
# Instantiate Pipelines
random_state = 42

# Linear Regression
pipe_lr = Pipeline([('ss', StandardScaler()),
                    ('lr', LinearRegression())])

# Decision Tree
pipe_dt = Pipeline([('ss', StandardScaler()),
                    ('dt', DecisionTreeRegressor(random_state=random_state))])

# Random Forest
pipe_rf = Pipeline([('ss', StandardScaler()),
                    ('rf', RandomForestRegressor(random_state=random_state))])

# Fitting Pipelines
pipe_lr.fit(X_train, y_train)
pipe_dt.fit(X_train, y_train)
pipe_rf.fit(X_train, y_train)

Pipeline(steps=[('ss', StandardScaler()),
                ('rf', RandomForestRegressor(random_state=42))])

#### Decision Tree Gridsearch

I used max_depth, min_samples_leaf, max_features, and max_leaf_nodes as starting parameters to prune the decision tree and help combat overfitting.
- max_depth:
I am tuning max_depth because its default is None, which lets the tree keep growing until its leaves are pure or too small to split. The deeper the tree, the more splits it has and the more information it captures about the data, but more splits also make overfitting more likely. Because my initial decision tree model was very overfit (Train: 1.0, Test: 0.38), I want to reduce the overfitting while still capturing important information.

- min_samples_leaf:
The minimum number of samples required to be in a leaf node. A larger minimum yields fewer leaves, and the fewer leaves a tree has, the less prone it is to overfitting.

- max_features:
The number of features to consider when looking for the best split. With this hyperparameter, I check whether using a reduced number of features helps "increase the stability of the tree and reduce variance and over-fitting."

- max_leaf_nodes:
The maximum number of leaf nodes the decision tree can have. Capping it limits how finely the tree can partition the training data.

[Source 1](https://towardsdatascience.com/how-to-tune-a-decision-tree-f03721801680), [Source 2](https://www.youtube.com/watch?time_continue=361&v=XABw4Y3GBR4&feature=emb_title)
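
To make the depth-versus-overfitting trade-off concrete, here is a minimal sketch on synthetic data. The data from `make_regression` and the names `X_demo`, `y_demo`, and `depth_scores` are assumptions for illustration only; this is not the Spotify dataset, and the scores will differ from the ones reported in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: NOT the Spotify dataset)
X_demo, y_demo = make_regression(n_samples=500, n_features=10,
                                 noise=25.0, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=42)

# Compare a shallow tree, a medium tree, and an unrestricted tree
depth_scores = {}
for depth in [2, 6, None]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(Xd_train, yd_train)
    # score() returns R^2 for regressors
    depth_scores[depth] = (tree.score(Xd_train, yd_train),
                           tree.score(Xd_test, yd_test))

for depth, (train_r2, test_r2) in depth_scores.items():
    print(f"max_depth={depth}: train R^2={train_r2:.2f}, test R^2={test_r2:.2f}")
```

With max_depth=None the tree can memorize the training set, so the gap between the train and test scores is the thing to watch as the depth limit is loosened.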

In [16]:
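# # Decision Tree Parameters
# # Note: 'auto' was removed from max_features in scikit-learn 1.3. For tree
# # regressors it meant "use all features", the same as None. This also
# # applies to the forest grids further down.
# params_dt = {'dt__max_depth': [1,2,3,4,5,6,7,9,11,12],
#              'dt__min_samples_leaf': [1,2,3,4,5,6,7,8,9,10],
#              'dt__max_features': ['auto','log2','sqrt', None],
#              'dt__max_leaf_nodes': [None,10,15,20,25,30,40,50,60,70,80,90]
# }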

# # Instantiate Gridsearch
# gs_dt = GridSearchCV(pipe_dt, 
#                   param_grid=params_dt,
#                   cv=5) 
# # Fit Gridsearch
# gs_dt.fit(X_train, y_train)

{'dt__max_depth': 6,
 'dt__max_features': 'auto',
 'dt__max_leaf_nodes': 20,
 'dt__min_samples_leaf': 8}
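
The dictionary above has the shape of a fitted grid search's `best_params_` attribute. Assuming that is where it came from, the sketch below shows how such a dictionary can be read back after fitting. It uses a much smaller grid and synthetic data so that it runs quickly; `demo_pipe`, `demo_params`, and `demo_gs` are hypothetical names, not objects from this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: NOT the Spotify dataset)
X_demo, y_demo = make_regression(n_samples=300, n_features=8,
                                 noise=20.0, random_state=42)

demo_pipe = Pipeline([('ss', StandardScaler()),
                      ('dt', DecisionTreeRegressor(random_state=42))])

# The 'dt__' prefix routes each parameter to the pipeline step named 'dt'
demo_params = {'dt__max_depth': [2, 4, 6],
               'dt__min_samples_leaf': [1, 5, 10]}

demo_gs = GridSearchCV(demo_pipe, param_grid=demo_params, cv=5)
demo_gs.fit(X_demo, y_demo)

print(demo_gs.best_params_)           # best combination found
print(round(demo_gs.best_score_, 3))  # its mean cross-validated R^2
```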

#### Random Forest Gridsearch

The hyperparameters I used to help with overfitting are:
- n_estimators:
The number of trees in the model. More trees may increase accuracy, but they also increase computation time.

- max_depth:
The deeper the tree, the more splits it has and the more information it captures about the data. But more splits also make overfitting more likely.

- min_samples_split:
Increasing this value reduces the number of splits in each tree, which helps reduce overfitting.

- min_samples_leaf:
Requiring a minimum number of samples in each leaf node can help prevent "the growth of the tree", which also helps prevent overfitting.

- max_features:
The number of features to consider when looking for the best split. With this hyperparameter, I check whether using a reduced number of features helps "increase the stability of the tree and reduce variance and over-fitting."

[Source](https://rspiro9.github.io/hyperparameter_tuning_for_random_forest)
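
As a rough illustration of how min_samples_leaf limits tree growth, the sketch below fits small forests on synthetic data and reports the average number of leaves per tree. The data and the names `X_demo`, `y_demo`, and `leaf_counts` are assumptions for illustration; the forests here are deliberately small and are not the models tuned in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption: NOT the Spotify dataset)
X_demo, y_demo = make_regression(n_samples=400, n_features=10,
                                 noise=20.0, random_state=42)

leaf_counts = {}
for min_leaf in [1, 4, 16]:
    forest = RandomForestRegressor(n_estimators=20,
                                   min_samples_leaf=min_leaf,
                                   random_state=42)
    forest.fit(X_demo, y_demo)
    # Average number of leaves across the fitted trees in the forest
    leaf_counts[min_leaf] = (sum(t.get_n_leaves() for t in forest.estimators_)
                             / len(forest.estimators_))

for min_leaf, avg_leaves in leaf_counts.items():
    print(f"min_samples_leaf={min_leaf}: avg leaves per tree={avg_leaves:.1f}")
```

Larger values of min_samples_leaf should give noticeably smaller trees, which is the mechanism behind the reduced overfitting described above.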

In [21]:
# # Random Forest Parameters
# params_rf = {
#     'rf__n_estimators': [200, 400, 600, 800, 1000],
#     'rf__max_depth': [10, 20, 30, 40, 50, None],
#     'rf__min_samples_split': [2, 5, 10],
#     'rf__min_samples_leaf': [1,2,4],
#     'rf__max_features': ['auto','log2','sqrt', None],
# }

# # Instantiate Gridsearch
# gs_rf = GridSearchCV(pipe_rf,
#                     param_grid=params_rf,
#                     cv=3, n_jobs = -1)

# # Fit Gridsearch
# gs_rf.fit(X_train, y_train)

{'rf__max_depth': 20,
 'rf__max_features': 'auto',
 'rf__min_samples_leaf': 4,
 'rf__min_samples_split': 2,
 'rf__n_estimators': 400}
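
To get a sense of how expensive the random forest search is, the size of the grid can be counted without fitting anything. This sketch copies the `params_rf` grid from the cell above and uses scikit-learn's `ParameterGrid`; the names `n_candidates` and `n_cv_fits` are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as the commented-out random forest cell above
params_rf = {
    'rf__n_estimators': [200, 400, 600, 800, 1000],
    'rf__max_depth': [10, 20, 30, 40, 50, None],
    'rf__min_samples_split': [2, 5, 10],
    'rf__min_samples_leaf': [1, 2, 4],
    'rf__max_features': ['auto', 'log2', 'sqrt', None],
}

# ParameterGrid enumerates every combination; it does not validate the values
n_candidates = len(ParameterGrid(params_rf))  # 5 * 6 * 3 * 3 * 4
n_cv_fits = n_candidates * 3                  # cv=3 folds per candidate

print(n_candidates, n_cv_fits)  # 1080 3240
```

That is 3,240 cross-validation fits of forests with 200 to 1,000 trees each, which is likely why the grid search cells are commented out and only the resulting parameters are shown.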

#### ExtraTrees Gridsearch

In [None]:
# # ExtraTrees Parameters
# params_xt = {
#     'xt__n_estimators': [200, 400, 600, 800, 1000],
#     'xt__max_depth': [10, 20, 30, 40, 50, None],
#     'xt__min_samples_split': [2, 5, 10],
#     'xt__min_samples_leaf': [1,2,4],
#     'xt__max_features': ['auto','log2','sqrt', None],
# }

# # Instantiate Gridsearch
# gs_xt = GridSearchCV(pipe_xt,
#                   param_grid=params_xt,
#                   cv=3, n_jobs = -1)

# # Fit Gridsearch
# gs_xt.fit(X_train, y_train)
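
The ExtraTrees cell refers to `pipe_xt`, which is not defined anywhere in this notebook. A minimal sketch of what it could look like, assuming it follows the same pattern as `pipe_rf` with a step named 'xt' (so that the `xt__` prefixes in `params_xt` resolve), is:

```python
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

random_state = 42

# Assumed definition: mirrors pipe_rf, with the step named 'xt'
pipe_xt = Pipeline([('ss', StandardScaler()),
                    ('xt', ExtraTreesRegressor(random_state=random_state))])

# Confirm that the grid keys used in params_xt are valid pipeline parameters
grid_keys = ['xt__n_estimators', 'xt__max_depth', 'xt__min_samples_split',
             'xt__min_samples_leaf', 'xt__max_features']
print(all(key in pipe_xt.get_params() for key in grid_keys))
```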


#### Gridsearch with Chosen Features from Polynomial and Categorical Features

In [4]:
# Define X
X = final_df[['highest_charting_position', 'number_of_times_charted',
       'streams',
       'artist_followers', 'loudness', 'speechiness',
       'tempo', 'valence',
       'contemporary', 'edm', 'electropop', 'hip hop', 'house',
       'indie', 'latin', 'other', 'pop', 'rap', 'reggaeton', 'rock', 'trap',
       'danceability^2', 'energy loudness', 'chord', 
       'release_date_year']]
# Dummify `chord`              
X = pd.get_dummies(columns=['chord'], drop_first=True, data=X)

# Drop chord features that have 0 coefficient
X = X.drop(columns=['chord_A#/Bb', 'chord_C#/Db', 'chord_D', 'chord_E'])

# Define Y
y = final_df['popularity']
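
For readers unfamiliar with `drop_first=True`, here is a small sketch of what the dummification step does to a categorical column. The toy DataFrame and its values are made up for illustration and are not rows from `final_df`.

```python
import pandas as pd

# Toy data (assumption: illustrative values only)
toy = pd.DataFrame({'tempo': [120, 95, 140, 100],
                    'chord': ['A', 'C', 'D', 'A']})

# One indicator column per chord, minus the first category ('A'),
# which becomes the implicit baseline
toy_dummies = pd.get_dummies(columns=['chord'], drop_first=True, data=toy)

print(list(toy_dummies.columns))  # ['tempo', 'chord_C', 'chord_D']
```

Dropping one level avoids a redundant column, since a row with zeros in every remaining chord column must belong to the dropped category.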

In [5]:
# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)

In [7]:
# Instantiate Random Forest Pipeline
random_state=42
pipe_rf_coefs = Pipeline([('ss', StandardScaler()),
                  ('rf', RandomForestRegressor(random_state=random_state))])

# Fit Pipeline
pipe_rf_coefs.fit(X_train, y_train)

Pipeline(steps=[('ss', StandardScaler()),
                ('rf', RandomForestRegressor(random_state=42))])

The hyperparameters I used to help with overfitting are:
- n_estimators:
The number of trees in the model. More trees may increase accuracy, but they also increase computation time.

- max_depth:
The deeper the tree, the more splits it has and the more information it captures about the data. But more splits also make overfitting more likely.

- min_samples_split:
Increasing this value reduces the number of splits in each tree, which helps reduce overfitting.

- min_samples_leaf:
Requiring a minimum number of samples in each leaf node can help prevent "the growth of the tree", which also helps prevent overfitting.

- max_features:
The number of features to consider when looking for the best split. With this hyperparameter, I check whether using a reduced number of features helps "increase the stability of the tree and reduce variance and over-fitting."

[Source](https://rspiro9.github.io/hyperparameter_tuning_for_random_forest)

In [8]:
# # Parameters
# params_rf = {
#     'rf__n_estimators': [200, 400, 600, 800, 1000],
#     'rf__max_depth': [10, 20, 30, 40, 50, None],
#     'rf__min_samples_split': [2, 5, 10],
#     'rf__min_samples_leaf': [1,2,4],
#     'rf__max_features': ['auto','log2','sqrt', None],
# }

# # Instantiate Random Forest Gridsearch
# gs_rf = GridSearchCV(pipe_rf_coefs,
#                     param_grid=params_rf,
#                     cv=3, n_jobs = -1)

# # Fit Gridsearch
# gs_rf.fit(X_train, y_train)

{'rf__max_depth': 10,
 'rf__max_features': 'auto',
 'rf__min_samples_leaf': 1,
 'rf__min_samples_split': 2,
 'rf__n_estimators': 600}