In [10]:
import pandas as pd

df1 = pd.read_csv("Popular_Clean.csv")
df2 = pd.read_csv("Song_Analytics.csv")

In [11]:
df1.head(2)

Unnamed: 0,track_name,artist_name,danceability,energy,speechiness,acousticness,instrumentalness,liveness,valence,tempo,...,highest_peak_value,normalized_likes,normalized_views,normalized_us_hit,normalized_high_peak,likes_per_day,views_per_day,genre_averaged_likes,normalized_num_peak_periods,popularity_score_scaled
0,Blinding Lights,The Weeknd,0.514,0.73,0.0598,0.00146,9.5e-05,0.0897,0.334,171.005,...,100.0,16.078045,20.463609,3.258097,4.615121,0.011874,0.015113,0.025767,0.333333,43.660471
1,Dance Monkey,Tones And I,0.824,0.587,0.0937,0.69,0.000105,0.149,0.514,98.029,...,12.0,16.55482,21.45255,2.484907,2.564949,0.010971,0.014216,0.029205,0.111111,41.465744


In [12]:
df2.head(2)

Unnamed: 0,track_name,tempo,mean_zcr,median_zcr,std_zcr,max_zcr,aboveThr_zcr,mean_sc,median_sc,std_sc,...,contrast_del_mean,contrast_avg_sd,Tone1,Tone2,Tone3,Tone4,Tone5,Tone6,Tone_deltaMean,Tone_avg_sd
0,0800 HEAVEN,143.554688,0.0918,0.0913,0.0527,0.6323,0.4358,2196.7699,2259.6592,862.0266,...,0.6804,6.128429,0.1183,0.1248,0.1614,0.127,0.0527,0.0522,0.00906,0.106067
1,1 2 3 feat Jason Derulo De La Ghetto,95.703125,0.1278,0.1172,0.0813,0.6621,0.6041,2811.9937,2768.6767,1057.0624,...,0.7764,6.554386,0.1012,0.0931,0.1172,0.1136,0.0526,0.0427,0.008302,0.086733


In [13]:
# Merging 'popularity_score_scaled' and 'time_frame' from df1 into df2
df_model = df2.merge(df1[['track_name', 'popularity_score_scaled', 'time_frame']], 
                       on='track_name', 
                       how='left')
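
A left merge on `track_name` can silently multiply rows if a title appears more than once in `df1`, and it leaves NaN where a title has no match. The sketch below shows one way to check both, using small made-up frames in place of the real CSVs. The names `left`, `right` and the values are hypothetical, and keeping the first duplicate is an arbitrary choice.

```python
import pandas as pd

# Toy stand-ins for df2 (left) and df1 (right); values are made up
left = pd.DataFrame({"track_name": ["A", "B", "C"],
                     "tempo": [120.0, 95.5, 140.2]})
right = pd.DataFrame({"track_name": ["A", "A", "B"],
                      "popularity_score_scaled": [43.7, 40.1, 41.5]})

# Count repeated keys on the right side; each repeat would duplicate a left row
dup_keys = int(right["track_name"].duplicated().sum())

# Keep one row per track before merging (which row to keep is a judgment call)
right_unique = right.drop_duplicates(subset="track_name", keep="first")

# validate raises if the right side still has duplicate keys;
# indicator adds a '_merge' column showing which rows found a match
merged = left.merge(right_unique, on="track_name", how="left",
                    validate="many_to_one", indicator=True)
unmatched = int((merged["_merge"] == "left_only").sum())

print(f"duplicate keys: {dup_keys}, merged rows: {len(merged)}, unmatched: {unmatched}")
```

Rows counted as unmatched are the ones a later `dropna()` would remove.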


In [17]:
df_model = df_model.dropna()

In [20]:
### PREPARING THE DATA

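# Features and target
# Note: X keeps every other column of df_model, including the leftover CSV
# index 'Unnamed: 0' from df2 and 'time_frame' from df1.
X = df_model.drop(['track_name', 'popularity_score_scaled'], axis=1)
y = df_model['popularity_score_scaled']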
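
Because only `track_name` and the target are dropped, the `Unnamed: 0` column shown in `df2.head(2)` ends up in `X` as a feature. A minimal sketch of a stricter feature selection follows, on a made-up frame. The frame `df_toy` and its values are hypothetical, and the numeric `time_frame` is an assumption, since its dtype is not shown in the notebook.

```python
import pandas as pd

# Toy frame mimicking the merged table; values are made up
df_toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "track_name": ["A", "B", "C"],
    "tempo": [120.0, 95.5, 140.2],
    "time_frame": [365, 400, 90],  # assumed numeric for this sketch
    "popularity_score_scaled": [43.7, 41.5, 12.0],
})

# Leftover CSV index columns carry row order, not audio information
index_cols = [c for c in df_toy.columns if c.startswith("Unnamed")]

X_toy = df_toy.drop(columns=["track_name", "popularity_score_scaled", *index_cols])
y_toy = df_toy["popularity_score_scaled"]

# Any non-numeric column would need encoding before fitting the models
non_numeric = X_toy.select_dtypes(exclude="number").columns.tolist()

print("features:", list(X_toy.columns))
print("non-numeric columns:", non_numeric)
```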


In [21]:
### SPLITTING THE DATASET

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


Model Selection<br/>
Try different algorithms to see which performs best. Common choices for regression problems include:<br/>

Linear Regression<br/>
Decision Tree Regression<br/>
Random Forest Regression<br/>
Gradient Boosting Machines (like XGBoost or LightGBM)<br/>
Support Vector Machines (SVM; for regression, Support Vector Regression, SVR)<br/>
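
The list above includes Decision Tree Regression, which the next cell does not train. As a rough sketch, a decision tree and a constant mean baseline could be fitted the same way. The data here is synthetic (`make_regression`), the dictionary name `extra_models` is hypothetical, and `max_depth=5` is an arbitrary setting.

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for X and y
X_demo, y_demo = make_regression(n_samples=300, n_features=8,
                                 noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=42)

extra_models = {
    # Always predicts the training mean; a floor that real models should beat
    "Mean baseline": DummyRegressor(strategy="mean"),
    "Decision Tree": DecisionTreeRegressor(max_depth=5, random_state=42),
}

scores = {}
for name, model in extra_models.items():
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_te, y_te)  # R2 on the held-out rows
    print(f"{name} - R2: {scores[name]:.3f}")
```

The baseline gives a reference point: an R² near zero means a model is doing about as well as always predicting the mean.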

In [22]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from xgboost import XGBRegressor

# Initialize models
models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(),
    "Support Vector Machine": SVR(),
    "XGBoost": XGBRegressor()
}

# Train models
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name} trained.")


Linear Regression trained.
Random Forest trained.
Support Vector Machine trained.
XGBoost trained.
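
One thing worth checking about the `SVR()` above: it is fitted on unscaled features, and the columns in `df2.head(2)` range from about 0.09 (`mean_zcr`) to over 2000 (`mean_sc`). Kernel methods are sensitive to such differences in scale. The sketch below wraps SVR in a scaling pipeline on synthetic data with similarly mismatched scales. The data and the name `svr_scaled` are illustrative assumptions, not results from this dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic features on very different scales (roughly 1, 100 and 2000)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 2000.0])
y_demo = 0.5 * X_demo[:, 0] + 0.01 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=42)

# The scaler is fitted on the training rows only, then applied before SVR
svr_scaled = make_pipeline(StandardScaler(), SVR())
svr_scaled.fit(X_tr, y_tr)

print("R2 with scaling:", svr_scaled.score(X_te, y_te))
```

A pipeline like this can replace the plain `SVR()` entry in the `models` dictionary, since it exposes the same `fit` and `predict` methods.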


In [23]:
### MODEL EVALUATION

from sklearn.metrics import mean_squared_error, r2_score

for name, model in models.items():
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    print(f"{name} - MSE: {mse}, R2: {r2}")


Linear Regression - MSE: 232.2233027153926, R2: 0.046319133218791886
Random Forest - MSE: 188.11057073122677, R2: 0.22747868087325673
Support Vector Machine - MSE: 239.72273652360624, R2: 0.015520903881097503
XGBoost - MSE: 208.5506806497821, R2: 0.1435364514919032
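
To put these numbers in the units of the target, the reported MSE values can be converted to RMSE. Since R² = 1 - MSE / Var(y_test), each row also implies the variance of the test target, which is the MSE a constant mean prediction would get. The sketch below only does arithmetic on the figures printed above.

```python
import math

# (MSE, R2) pairs copied from the evaluation output above
results = {
    "Linear Regression": (232.2233027153926, 0.046319133218791886),
    "Random Forest": (188.11057073122677, 0.22747868087325673),
    "Support Vector Machine": (239.72273652360624, 0.015520903881097503),
    "XGBoost": (208.5506806497821, 0.1435364514919032),
}

# RMSE is in the same units as popularity_score_scaled
rmse = {name: math.sqrt(mse) for name, (mse, _) in results.items()}

# R2 = 1 - MSE / Var(y_test), so MSE / (1 - R2) recovers the test-set variance
implied_var = {name: mse / (1.0 - r2) for name, (mse, r2) in results.items()}

for name in results:
    print(f"{name}: RMSE = {rmse[name]:.2f}, "
          f"implied target variance = {implied_var[name]:.1f}")
```

All four rows imply the same variance, roughly 243.5 (a standard deviation near 15.6), so the Random Forest's RMSE of about 13.7 is only a modest improvement over predicting the mean.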


Random Forest Regressor:<br/>
- MSE: 188.11
- R2: 0.227

This model has the lowest MSE and the highest R² of the four, so it is the best of the models tested at predicting 'popularity_score_scaled'. Even so, an R² of 0.227 means it explains only about 23% of the variance in the test set.<br/>

Linear Regression and Support Vector Machine:<br/>

These models have higher MSEs and R² scores close to zero (0.046 and 0.016), so with default settings they capture very little of the structure in your data and do barely better than predicting the mean.<br/>

XGBoost:<br/>

XGBoost (MSE 208.55, R² 0.144) comes second but is still outperformed by the Random Forest. It might improve with hyperparameter tuning.<br/>
Next Steps:<br/>
1. Hyperparameter Tuning:<br/>
The performance of the Random Forest and XGBoost models might be significantly improved by tuning their hyperparameters. You can use techniques like Grid Search or Random Search for this purpose.
2. Feature Importance Analysis:<br/>
Especially for the Random Forest and XGBoost models, check which features are most important for predicting popularity. This could provide insights into your dataset and might even suggest further feature engineering or selection.
3. Cross-Validation:<br/>
Consider using cross-validation to assess the model's performance. This will give you a more robust understanding of how well the model might perform on unseen data.
4. Model Refinement:<br/>
Based on the results of hyperparameter tuning and feature importance, refine your models. You may also want to revisit data preprocessing steps if you think there's more room for improvement.
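5. Final Model Selection:<br/>
Once you've tuned the models and reassessed their performance, choose the one with the best cross-validated score on the training data, then report its performance on the held-out test data once. Picking the winner by repeatedly comparing test scores makes the final estimate optimistic.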
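
Steps 2 and 3 above can be sketched roughly as follows. This is a minimal example on synthetic data, not the notebook's dataset: the feature names `feat_0` to `feat_5` and the variable names are hypothetical, and 5 folds is an arbitrary choice.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data with named columns, standing in for X_train and y_train
X_arr, y_demo = make_regression(n_samples=300, n_features=6, n_informative=3,
                                noise=10.0, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(6)])

rf_demo = RandomForestRegressor(n_estimators=100, random_state=42)

# Cross-validation: one R2 per fold instead of a single train/test split
cv_r2 = cross_val_score(rf_demo, X_demo, y_demo, cv=5, scoring="r2")
print(f"CV R2: {cv_r2.mean():.3f} +/- {cv_r2.std():.3f}")

# Feature importance: impurity-based importances from a fitted forest
rf_demo.fit(X_demo, y_demo)
importances = pd.Series(rf_demo.feature_importances_,
                        index=X_demo.columns).sort_values(ascending=False)
print(importances.head(3))
```

The spread across folds gives a sense of how much a single split, such as the R² of 0.227 above, could vary. Impurity-based importances sum to 1 and can favor high-cardinality features, so treat the ranking as a starting point.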

In [24]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor

# Define the search space (candidate settings are sampled from these lists)
param_grid = {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_depth': [10, 20, 30, 40, 50, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}

# Create a base model
rf = RandomForestRegressor()

# Instantiate the randomized search: 100 sampled settings, each scored with 3-fold CV
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=param_grid,
                               n_iter=100, cv=3, verbose=2, random_state=42, n_jobs=-1)

# Fit the random search model
rf_random.fit(X_train, y_train)

# Best parameters
print("Best parameters:", rf_random.best_params_)


Fitting 3 folds for each of 100 candidates, totalling 300 fits


In [None]:
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBRegressor

# Define the search space (candidate settings are sampled from these lists)
param_grid = {
    'n_estimators': [100, 200, 300, 400, 500],
    'learning_rate': [0.01, 0.05, 0.1, 0.2, 0.3],
    'max_depth': [3, 4, 5, 6, 7, 8],
    'colsample_bytree': [0.3, 0.5, 0.7, 0.9, 1],
    'subsample': [0.6, 0.7, 0.8, 0.9, 1]
}

# Create a base model
xgb = XGBRegressor()

# Instantiate the randomized search: 100 sampled settings, each scored with 3-fold CV
xgb_random = RandomizedSearchCV(estimator=xgb, param_distributions=param_grid,
                                n_iter=100, cv=3, verbose=2, random_state=42, n_jobs=-1)

# Fit the random search model
xgb_random.fit(X_train, y_train)

# Best parameters
print("Best parameters:", xgb_random.best_params_)
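
After a search finishes, the tuned model still needs to be scored on the held-out test set to see whether tuning helped. A minimal sketch of that step follows, using a small Random Forest search on synthetic data so it runs quickly. The reduced search space, `n_iter=5`, and the variable names are illustrative assumptions, not the settings used above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic data standing in for X and y
X_demo, y_demo = make_regression(n_samples=300, n_features=8,
                                 noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=42)

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions={
        "n_estimators": [50, 100],
        "max_depth": [5, 10, None],
        "min_samples_leaf": [1, 2, 4],
    },
    n_iter=5, cv=3, scoring="r2", random_state=42,
)
search.fit(X_tr, y_tr)

# best_estimator_ is refit on all of X_tr with the best settings (refit=True)
best_model = search.best_estimator_
y_pred = best_model.predict(X_te)

print("Best parameters:", search.best_params_)
print("Mean CV R2 of best setting:", search.best_score_)
print("Test MSE:", mean_squared_error(y_te, y_pred))
print("Test R2:", r2_score(y_te, y_pred))
```

With the notebook's own objects, the same pattern would apply to `rf_random.best_estimator_` and `xgb_random.best_estimator_`, scored on `X_test` and `y_test`.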
