### <span style="color:blueviolet"> 1. Introduction</span>

This notebook presents a solution to the [Playground Series - Season 5, Episode 9](https://www.kaggle.com/competitions/playground-series-s5e9) Kaggle competition, held in September 2025. The goal is to predict a song's beats-per-minute, with submissions evaluated using the Root Mean Squared Error (RMSE) between predicted and observed targets.
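
For reference, the metric can be written as follows, where $n$ is the number of songs, $y_i$ the observed tempo, and $\hat{y}_i$ the predicted tempo (the symbols are notation introduced here, not taken from the competition page):

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
$$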

The workflow begins with importing the necessary libraries, followed by loading the training and testing datasets. A basic exploratory data analysis (EDA) is then performed, including examining shapes, structure, summary statistics, and other key information for both DataFrames.

Next, feature engineering is carried out. Pairwise and triplet combinations of columns are created to generate additional features, and each column's values are binned into 4 and 10 equal-width intervals (stored as "quartile" and "decile" features).

In the modeling stage, XGBoost and LightGBM models are defined with fixed hyperparameters and trained using 5-fold cross-validation. In each fold, both models are fitted on the training portion, and predictions are made for the held-out validation portion and for the test data (the latter averaged across folds). The out-of-fold predictions are then compared to the true target values to calculate the cross-validation RMSE.

Finally, predictions from both models are blended by averaging, and a new cross-validation RMSE is computed. A CSV file containing the averaged test set predictions is created for submission to the competition.

### <span style="color:blueviolet"> 2. Import Libraries</span>

First, we import all the libraries required for this notebook. NumPy is imported for numerical operations, and Pandas for data manipulation. From itertools, combinations is imported to generate feature interactions during feature engineering. XGBRegressor and LGBMRegressor are imported from xgboost and lightgbm, respectively, for model training. Finally, KFold and mean_squared_error are imported from scikit-learn to perform cross-validation and evaluate regression models.

In [1]:
# ===== Import Libraries =====
import numpy as np
import pandas as pd
from itertools import combinations
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

### <span style="color:blueviolet"> 3. Load Data</span>

In this step, the training and testing datasets are loaded from files as pandas DataFrames. The id column is set as the index of both DataFrames to ensure unique identification and alignment.

In [2]:
# ===== Load Data =====
X = pd.read_csv('/kaggle/input/playground-series-s5e9/train.csv').set_index('id')
X_test = pd.read_csv('/kaggle/input/playground-series-s5e9/test.csv').set_index('id')

### <span style="color:blueviolet"> 4. Explore Data</span>

Next, we perform basic exploratory data analysis (EDA). For both the training and testing datasets, we examine their shapes, summary statistics of numerical values, and additional details such as data types, unique values, and missing values per column. A dedicated function is used to display the EDA, which is applied to each dataset in a loop.

In [3]:
# ===== Explore Data =====
def display_eda(df, name):
    print(f"{'='*50}\n{name} | SHAPE = {df.shape}\n{'='*50}")
    print(f"HEAD:")
    display(df.head())
    print(f"{'-'*50}\nDESCRIPTION:")
    display(df.describe().round(2))
    print(f"{'-'*50}\nINFORMATION:")
    info_df = pd.DataFrame({
        'TYPE': df.dtypes,
        'UNIQUE': df.nunique(),
        'MISSING': df.isna().sum()
    })
    display(info_df)

for name, df in [('TRAINING DATA', X), ('TESTING DATA', X_test)]:
    display_eda(df, name)

TRAINING DATA | SHAPE = (524164, 10)
HEAD:


| id | RhythmScore | AudioLoudness | VocalContent | AcousticQuality | InstrumentalScore | LivePerformanceLikelihood | MoodScore | TrackDurationMs | Energy | BeatsPerMinute |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.60361 | -7.636942 | 0.0235 | 5e-06 | 1e-06 | 0.051385 | 0.409866 | 290715.645 | 0.826267 | 147.5302 |
| 1 | 0.639451 | -16.267598 | 0.07152 | 0.444929 | 0.349414 | 0.170522 | 0.65101 | 164519.5174 | 0.1454 | 136.15963 |
| 2 | 0.514538 | -15.953575 | 0.110715 | 0.173699 | 0.453814 | 0.029576 | 0.423865 | 174495.5667 | 0.624667 | 55.31989 |
| 3 | 0.734463 | -1.357 | 0.052965 | 0.001651 | 0.159717 | 0.086366 | 0.278745 | 225567.4651 | 0.487467 | 147.91212 |
| 4 | 0.532968 | -13.056437 | 0.0235 | 0.068687 | 1e-06 | 0.331345 | 0.477769 | 213960.6789 | 0.947333 | 89.58511 |


--------------------------------------------------
DESCRIPTION:


|  | RhythmScore | AudioLoudness | VocalContent | AcousticQuality | InstrumentalScore | LivePerformanceLikelihood | MoodScore | TrackDurationMs | Energy | BeatsPerMinute |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 524164.0 | 524164.0 | 524164.0 | 524164.0 | 524164.0 | 524164.0 | 524164.0 | 524164.0 | 524164.0 | 524164.0 |
| mean | 0.63 | -8.38 | 0.07 | 0.26 | 0.12 | 0.18 | 0.56 | 241903.69 | 0.5 | 119.03 |
| std | 0.16 | 4.62 | 0.05 | 0.22 | 0.13 | 0.12 | 0.23 | 59326.6 | 0.29 | 26.47 |
| min | 0.08 | -27.51 | 0.02 | 0.0 | 0.0 | 0.02 | 0.03 | 63973.0 | 0.0 | 46.72 |
| 25% | 0.52 | -11.55 | 0.02 | 0.07 | 0.0 | 0.08 | 0.4 | 207099.88 | 0.25 | 101.07 |
| 50% | 0.63 | -8.25 | 0.07 | 0.24 | 0.07 | 0.17 | 0.56 | 243684.06 | 0.51 | 118.75 |
| 75% | 0.74 | -4.91 | 0.11 | 0.4 | 0.2 | 0.27 | 0.72 | 281851.66 | 0.75 | 136.69 |
| max | 0.98 | -1.36 | 0.26 | 1.0 | 0.87 | 0.6 | 0.98 | 464723.23 | 1.0 | 206.04 |


--------------------------------------------------
INFORMATION:


|  | TYPE | UNIQUE | MISSING |
|---|---|---|---|
| RhythmScore | float64 | 322528 | 0 |
| AudioLoudness | float64 | 310411 | 0 |
| VocalContent | float64 | 229305 | 0 |
| AcousticQuality | float64 | 270478 | 0 |
| InstrumentalScore | float64 | 218979 | 0 |
| LivePerformanceLikelihood | float64 | 279591 | 0 |
| MoodScore | float64 | 306504 | 0 |
| TrackDurationMs | float64 | 377442 | 0 |
| Energy | float64 | 11606 | 0 |
| BeatsPerMinute | float64 | 14622 | 0 |


TESTING DATA | SHAPE = (174722, 9)
HEAD:


| id | RhythmScore | AudioLoudness | VocalContent | AcousticQuality | InstrumentalScore | LivePerformanceLikelihood | MoodScore | TrackDurationMs | Energy |
|---|---|---|---|---|---|---|---|---|---|
| 524164 | 0.410013 | -16.794967 | 0.0235 | 0.23291 | 0.012689 | 0.271585 | 0.664321 | 302901.5498 | 0.424867 |
| 524165 | 0.463071 | -1.357 | 0.141818 | 0.057725 | 0.257942 | 0.097624 | 0.829552 | 221995.6643 | 0.846 |
| 524166 | 0.686569 | -3.368928 | 0.167851 | 0.287823 | 0.210915 | 0.325909 | 0.304978 | 357724.0127 | 0.134067 |
| 524167 | 0.885793 | -5.598049 | 0.118488 | 5e-06 | 0.376906 | 0.134435 | 0.48774 | 271790.3989 | 0.316467 |
| 524168 | 0.637391 | -7.06816 | 0.126099 | 0.539073 | 0.06895 | 0.0243 | 0.591248 | 277728.5383 | 0.481067 |


--------------------------------------------------
DESCRIPTION:


|  | RhythmScore | AudioLoudness | VocalContent | AcousticQuality | InstrumentalScore | LivePerformanceLikelihood | MoodScore | TrackDurationMs | Energy |
|---|---|---|---|---|---|---|---|---|---|
| count | 174722.0 | 174722.0 | 174722.0 | 174722.0 | 174722.0 | 174722.0 | 174722.0 | 174722.0 | 174722.0 |
| mean | 0.63 | -8.38 | 0.07 | 0.26 | 0.12 | 0.18 | 0.56 | 241753.74 | 0.5 |
| std | 0.16 | 4.62 | 0.05 | 0.22 | 0.13 | 0.12 | 0.23 | 59103.9 | 0.29 |
| min | 0.14 | -27.44 | 0.02 | 0.0 | 0.0 | 0.02 | 0.03 | 63973.0 | 0.0 |
| 25% | 0.51 | -11.55 | 0.02 | 0.07 | 0.0 | 0.08 | 0.4 | 207518.15 | 0.25 |
| 50% | 0.63 | -8.25 | 0.07 | 0.24 | 0.07 | 0.17 | 0.57 | 243584.59 | 0.51 |
| 75% | 0.74 | -4.9 | 0.11 | 0.4 | 0.2 | 0.27 | 0.72 | 281737.45 | 0.75 |
| max | 0.98 | -1.36 | 0.26 | 1.0 | 0.68 | 0.6 | 0.98 | 449288.81 | 1.0 |


--------------------------------------------------
INFORMATION:


|  | TYPE | UNIQUE | MISSING |
|---|---|---|---|
| RhythmScore | float64 | 116151 | 0 |
| AudioLoudness | float64 | 110402 | 0 |
| VocalContent | float64 | 84370 | 0 |
| AcousticQuality | float64 | 97364 | 0 |
| InstrumentalScore | float64 | 79221 | 0 |
| LivePerformanceLikelihood | float64 | 101149 | 0 |
| MoodScore | float64 | 109993 | 0 |
| TrackDurationMs | float64 | 133624 | 0 |
| Energy | float64 | 10465 | 0 |


### <span style="color:blueviolet"> 5. Prepare Features</span>

This section focuses on feature engineering before modeling. An additional function, add_features, is used for this purpose.

First, a dictionary called new_features is created to store the generated columns. We loop over all pairwise combinations of existing features, performing multiplication and division (adding 1e-6 to denominators to avoid division by zero). Another loop generates new columns by multiplying all triplet combinations of features.

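Finally, for each column, two binned features are created using pd.cut with 4 and 10 bins, respectively. Although the features are named quartile and decile, pd.cut with an integer bin count splits the column's range into equal-width intervals, so these are equal-width bin indices rather than true quantile groups (which pd.qcut would produce). Passing labels=False returns integer bin codes, and include_lowest=True keeps the minimum value inside the first bin. This works because all columns are numerical and contain no missing values. Because the function is applied to each dataset separately, the bin edges come from that dataset's own minimum and maximum; where the ranges differ (for example, InstrumentalScore peaks at 0.87 in training but 0.68 in testing), the same bin code can cover different value ranges in the two datasets.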

Before applying this function to both training and testing datasets, the target variable is stored separately and removed from the training dataset to ensure it is excluded from feature calculations.
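
As a rough check on how wide the resulting DataFrames become, the number of columns implied by the loops described above can be counted directly. This is a small sketch that assumes the nine input columns shown in the testing data; the variable names are illustrative only.

```python
from math import comb

n_base = 9                        # input columns left after dropping BeatsPerMinute

n_pairs = comb(n_base, 2)         # unordered column pairs
n_triplets = comb(n_base, 3)      # unordered column triplets

n_pair_features = 2 * n_pairs     # one product and one ratio per pair
n_triplet_features = n_triplets   # one product per triplet
n_bin_features = 2 * n_base       # a 4-bin and a 10-bin code per column

n_total = n_base + n_pair_features + n_triplet_features + n_bin_features
print(n_pairs, n_triplets, n_total)
```

Since combinations yields unordered pairs, each ratio is computed in one direction only (col1 / col2, never col2 / col1).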

In [4]:
# ===== Prepare Features =====
def add_features(df):
    new_features = {}
    for col1, col2 in list(combinations(df.columns, 2)):
        new_features[f"{col1}_m_{col2}"] = df[col1] * df[col2]
        new_features[f"{col1}_d_{col2}"] = df[col1] / (df[col2] + 1e-6)

    for col1, col2, col3 in list(combinations(df.columns, 3)):
        new_features[f"{col1}_m_{col2}_m_{col3}"] = df[col1] * df[col2] * df[col3]

    for col in df.columns:
        new_features[f"{col}_quartile"] = pd.cut(df[col], bins=4, labels=False, include_lowest=True)
        new_features[f"{col}_decile"] = pd.cut(df[col], bins=10, labels=False, include_lowest=True)

    df = pd.concat([df, pd.DataFrame(new_features, index=df.index)], axis=1)
    return df

y = X['BeatsPerMinute']
X = X.drop(['BeatsPerMinute'], axis=1)

X = add_features(X)
X_test = add_features(X_test)
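
The difference between equal-width binning (pd.cut with an integer bin count) and quantile binning (pd.qcut) can be seen on a skewed toy column. The sketch below uses synthetic data, not the competition files, and also shows one possible way to learn the bin edges on the training column and reuse them on the test column; that variant is an assumption about what might be preferable, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Toy, right-skewed columns standing in for one feature (not competition data).
rng = np.random.default_rng(0)
train_col = pd.Series(rng.exponential(scale=1.0, size=1000))
test_col = pd.Series(rng.exponential(scale=1.0, size=300))

# pd.cut: four equal-width intervals over the observed range.
width_codes = pd.cut(train_col, bins=4, labels=False, include_lowest=True)

# pd.qcut: four intervals holding roughly equal numbers of rows.
quantile_codes = pd.qcut(train_col, q=4, labels=False)

# Rows per bin: heavily unbalanced for pd.cut, balanced for pd.qcut.
print(width_codes.value_counts().sort_index().tolist())
print(quantile_codes.value_counts().sort_index().tolist())

# Learn the edges once on the training column, then reuse them on the test column.
_, edges = pd.cut(train_col, bins=4, retbins=True, include_lowest=True)
edges[0], edges[-1] = -np.inf, np.inf   # keep out-of-range test values in the end bins
test_codes = pd.cut(test_col, bins=edges, labels=False)
```

Reusing the training edges means a given bin code refers to the same value range in both datasets.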

### <span style="color:blueviolet"> 6. XGBoost Model</span>

In this step, the XGBoost model is defined. The objective is regression with squared error as the loss function, and the evaluation metric is set to root mean squared error (RMSE); this metric is only reported when an evaluation set is passed to fit, which this notebook does not do. The model uses 1000 trees, each limited to a maximum depth of 6 to control complexity. A learning rate of 0.002 shrinks the contribution of each tree, so learning is slow and stable.

Column subsampling is applied at two levels: colsample_bytree draws 67% of the features for each tree, and colsample_bynode draws 67% of those again at each split, so a single split considers roughly 45% of all features. L1 and L2 regularization are applied by setting reg_alpha to 2.50 and reg_lambda to 0.85 to penalize large leaf outputs. Finally, a random state is set for reproducibility.

In [5]:
# ===== XGBoost Model =====
xgb = XGBRegressor(
    objective = 'reg:squarederror',
    eval_metric = 'rmse',
    n_estimators = 1000,
    max_depth = 6,
    learning_rate = 0.002,
    colsample_bytree = 0.67,
    colsample_bynode = 0.67,
    reg_alpha = 2.50,
    reg_lambda = 0.85,
    random_state = 42
)
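
The arithmetic behind the two column-sampling settings can be sketched as follows. The feature count of 183 is an assumption derived from the add_features loops (9 inputs plus 72 pairwise, 84 triplet, and 18 binned columns), and the per-split count is approximate because XGBoost's exact rounding is not reproduced here.

```python
colsample_bytree = 0.67   # fraction of columns drawn once per tree
colsample_bynode = 0.67   # fraction of that tree's columns drawn again per split

# The two samplers are applied one after the other, so their fractions multiply.
effective_fraction = colsample_bytree * colsample_bynode

n_features = 183          # assumed: 9 inputs plus the engineered columns
approx_per_split = effective_fraction * n_features

print(f"{effective_fraction:.4f} of columns, about {approx_per_split:.0f} per split")
```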

### <span style="color:blueviolet"> 7. LightGBM Model</span>

In this step, we define the hyperparameters for the LightGBM model. The model uses 1000 trees. The depth of each tree is limited to 14 and its number of leaves to 85, which is relatively high and allows the model to capture complex interactions. A low learning rate of 0.0015 shrinks the contribution of each tree, so learning is slow and stable.

For each tree, 90% of the features are used, controlled by the feature_fraction parameter. This introduces randomness and can improve generalization. The subsample parameter is also set to 0.90, but LightGBM only performs row subsampling when subsample_freq (alias bagging_freq) is greater than zero; since it is left at its default of 0 here, every tree is trained on all rows. Large leaf outputs are slightly penalized by setting reg_alpha and reg_lambda to 0.0001. Finally, a random state is set for reproducibility, and verbosity is set to -1 to silence output during training.

In [6]:
# ===== LightGBM Model =====
lgbm = LGBMRegressor(
    n_estimators = 1000,
    max_depth = 14,
    num_leaves = 85,
    learning_rate = 0.0015,
    feature_fraction = 0.90,
    subsample = 0.90,
    reg_alpha = 0.0001,
    reg_lambda = 0.0001,
    random_state = 42,
    verbosity = -1
)
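
A variant of this configuration that would activate row subsampling is sketched below. It assumes that resampling at every iteration (subsample_freq=1) is what was intended, which the notebook does not state; LightGBM ignores subsample while subsample_freq is 0. The parameters are held in a plain dictionary (lgbm_params is an illustrative name) so the sketch runs without LightGBM installed.

```python
# Keyword arguments mirroring the cell above, plus one addition.
lgbm_params = dict(
    n_estimators=1000,
    max_depth=14,
    num_leaves=85,
    learning_rate=0.0015,
    feature_fraction=0.90,
    subsample=0.90,
    subsample_freq=1,      # assumption: resample rows at every iteration
    reg_alpha=0.0001,
    reg_lambda=0.0001,
    random_state=42,
    verbosity=-1,
)

# In the notebook this would be passed as LGBMRegressor(**lgbm_params).
```

Changing this setting changes the fitted model, so the scores reported later in this notebook would not carry over to the variant.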

### <span style="color:blueviolet"> 8. 5-Fold Cross Validation</span>

We use 5-fold cross-validation to train the models. First, two dictionaries are created to store predictions: oof_preds will hold out-of-fold predictions on the training data, and test_preds will store predictions on the testing data, averaged across folds. Each dictionary has a key for each model, with values as NumPy arrays of appropriate lengths initialized to zero.

Next, an instance of KFold is created with 5 splits, shuffling enabled so that fold membership does not depend on row order, and a fixed random state for reproducibility.

For each fold, the data is split into training and validation sets. The XGBoost and LightGBM models are fitted on the training set, with XGBoost’s verbose output silenced. We then loop over both models to generate out-of-fold predictions on the validation set and predictions on the test data, which are averaged across folds.

After all folds are completed, the out-of-fold predictions of both models are compared with the true targets, and the cross-validation RMSE is calculated.

In [7]:
# ===== 5-Fold Cross-Validation =====
oof_preds = {'XGBOOST': np.zeros(len(X)), 'LIGHTGBM': np.zeros(len(X))}
test_preds = {'XGBOOST': np.zeros(len(X_test)), 'LIGHTGBM': np.zeros(len(X_test))}

kf = KFold(n_splits=5, shuffle=True, random_state=42)

for (train_idx, valid_idx) in kf.split(X, y):
    X_train, X_valid = X.iloc[train_idx], X.iloc[valid_idx]
    y_train, y_valid = y.iloc[train_idx], y.iloc[valid_idx]

    xgb.fit(X_train, y_train, verbose=False)
    lgbm.fit(X_train, y_train)

    for name, model in [('XGBOOST', xgb), ('LIGHTGBM', lgbm)]:
        oof_preds[name][valid_idx] = model.predict(X_valid)
        test_preds[name] += model.predict(X_test) / kf.n_splits

for model in ['XGBOOST', 'LIGHTGBM']:
    print(f'{model} PREDICTIONS - RMSE SCORE: {np.sqrt(mean_squared_error(y, oof_preds[model])):.6f}')

XGBOOST PREDICTIONS - RMSE SCORE: 26.461237
LIGHTGBM PREDICTIONS - RMSE SCORE: 26.460378
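
To put these scores in context, it can help to compare them with a constant baseline that predicts the mean target for every row. The sketch below uses a synthetic target (y_toy, an illustrative stand-in drawn to resemble the mean and spread reported in the summary statistics) and shows that the RMSE of such a baseline equals the population standard deviation of the target.

```python
import numpy as np

# Synthetic stand-in for the target, drawn to resemble the reported mean and spread.
rng = np.random.default_rng(42)
y_toy = rng.normal(loc=119.03, scale=26.47, size=10_000)

# Baseline: predict the mean of the target for every row.
baseline_pred = np.full_like(y_toy, y_toy.mean())
baseline_rmse = np.sqrt(np.mean((y_toy - baseline_pred) ** 2))

# The RMSE of a constant-mean prediction equals the population standard deviation.
print(f"baseline RMSE: {baseline_rmse:.4f} | std: {y_toy.std():.4f}")
```

In the notebook, the same comparison would use y in place of y_toy. The cross-validation scores of about 26.46 sit just below the target's standard deviation of 26.47 shown in the training summary, which suggests that the features carry only a weak signal for this target.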


### <span style="color:blueviolet"> 9. Blending</span>

In this step, both the out-of-fold predictions and the test-set predictions are averaged across the two models. We then calculate the new cross-validation RMSE by comparing the averaged out-of-fold predictions with the true target values.

In [8]:
# ===== Blending =====
avg_oof_preds = (oof_preds['XGBOOST'] + oof_preds['LIGHTGBM']) / 2
avg_test_preds = (test_preds['XGBOOST'] + test_preds['LIGHTGBM']) / 2

print(f'AVERAGED PREDICTIONS - RMSE SCORE: {np.sqrt(mean_squared_error(y, avg_oof_preds)):.6f}')

AVERAGED PREDICTIONS - RMSE SCORE: 26.460300
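
Averaging two sets of predictions can never give an RMSE worse than the mean of the two individual RMSEs, and the gain grows as the models' errors become less correlated. The sketch below illustrates this with two synthetic predictors whose errors are independent; the data and the rmse helper are illustrative and unrelated to the competition files.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.normal(size=5_000)

# Two toy predictors with independent errors (stand-ins for the two models).
pred_a = y_true + rng.normal(scale=1.0, size=y_true.size)
pred_b = y_true + rng.normal(scale=1.0, size=y_true.size)

def rmse(y, p):
    """Root mean squared error between targets y and predictions p."""
    return float(np.sqrt(np.mean((y - p) ** 2)))

# Equal-weight blend, as used in this notebook.
blend = (pred_a + pred_b) / 2
print(rmse(y_true, pred_a), rmse(y_true, pred_b), rmse(y_true, blend))
```

The improvement in this notebook is small (26.460300 against 26.460378 for the better single model), which would be consistent with the two models making highly correlated errors.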


### <span style="color:blueviolet"> 10. Create Submission File</span>

The final step writes the blended test-set predictions, together with their ids, to a CSV file for submission to the competition.

In [9]:
# ===== Create Submission File =====
output = pd.DataFrame({'id': X_test.index, 'y': avg_test_preds})
output.to_csv('submission.csv', index=False)
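
Before uploading, a few basic checks on the submission frame can catch common mistakes. The helper below, check_submission, is a hypothetical function written for this sketch, and the toy frame merely stands in for output.

```python
import numpy as np
import pandas as pd

def check_submission(sub: pd.DataFrame, expected_rows: int) -> None:
    """Raise AssertionError if the submission frame looks malformed."""
    assert len(sub) == expected_rows, "unexpected number of rows"
    assert sub['id'].is_unique, "duplicate ids"
    pred_col = sub.columns[-1]   # assumes the predictions are in the last column
    assert np.isfinite(sub[pred_col]).all(), "missing or infinite predictions"

# Toy frame standing in for `output`; the ids follow the first rows of the test set.
toy = pd.DataFrame({'id': [524164, 524165, 524166], 'y': [118.9, 119.2, 119.0]})
check_submission(toy, expected_rows=3)
```

In the notebook, the call would be check_submission(output, expected_rows=174722), matching the shape of the testing data. It is also worth comparing the column names against the competition's sample submission file, since the expected header is not verified here.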