#### **1. INTRODUCTION**

This notebook presents a solution to the [Playground Series - Season 5, Episode 9](https://www.kaggle.com/competitions/playground-series-s5e9) Kaggle competition, held in September 2025. The goal is to predict a song's beats-per-minute, with submissions evaluated using the Root Mean Squared Error (RMSE) between predicted and observed targets.
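For reference, the evaluation metric can be sketched in a few lines of NumPy. The function name `rmse` and the toy values below are illustrative choices, not part of the competition's tooling.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values for illustration only (not competition data): errors of 3 and 4.
print(rmse([120.0, 90.0], [123.0, 86.0]))
```

Because the errors are squared before averaging, a few large misses raise the score more than many small ones.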

The notebook begins by importing libraries, loading the train and test datasets, and running basic exploratory data analysis (shapes, structure, and summary statistics). Feature engineering generates pairwise and triplet combinations of columns, along with 4-bin and 10-bin discretized versions of each column (named quartile and decile in the code). XGBoost and LightGBM models are trained with 5-fold cross-validation, producing out-of-fold predictions and fold-averaged test predictions. Finally, predictions from both models are blended and written to a CSV file for submission.

#### **2. IMPORT LIBRARIES**

First, we import the necessary libraries: NumPy for numerical operations and Pandas for data manipulation. Colorama's Fore and Style are used for colored output, and combinations from itertools generates the feature interactions. XGBRegressor and LGBMRegressor are used for model training, while KFold and mean_squared_error from scikit-learn handle cross-validation and regression evaluation.

In [58]:
# ===== IMPORT LIBRARIES =====
import numpy as np, pandas as pd
from colorama import Fore, Style
from itertools import combinations
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

#### **3. LOAD DATA**

In this step, the training and testing datasets are loaded from files as pandas DataFrames. The id column is set as the index of both DataFrames to ensure unique identification and alignment.

In [59]:
# ===== LOAD DATA =====
X = pd.read_csv('/kaggle/input/playground-series-s5e9/train.csv', index_col='id')
X_test = pd.read_csv('/kaggle/input/playground-series-s5e9/test.csv', index_col='id')

#### **4. EXPLORE DATA**

Next, we perform basic exploratory data analysis (EDA) on the training and testing datasets, inspecting each one's shape, first rows, column types, summary statistics, and counts of unique and missing values.


In [60]:
# ===== EXPLORE DATA =====
def PrintColor(text, color=Fore.BLUE, lines=True):
    if lines: print(f"{Style.BRIGHT}{color}{'-'*50}{Style.RESET_ALL}")
    print(f"{Style.BRIGHT}{color}{text}{Style.RESET_ALL}")

PrintColor(f"Training data shape = {X.shape} | Testing data shape = {X_test.shape}")
for name, df in [('Training data', X), ('Testing data', X_test)]:
    PrintColor(f"{name} head:", color=Fore.CYAN)
    display(df.head())

PrintColor("Information and description", color=Fore.MAGENTA)
for name, df in [('Training data', X), ('Testing data', X_test)]:
    PrintColor(f"{name} description:")
    display(df.drop(columns=['BeatsPerMinute'], errors='ignore').describe().round(2))
    
    PrintColor(f"{name} information:")
    df.info()  # info() prints its report directly and returns None, so display() is not needed

PrintColor("Unique and null values:")
info_df = pd.concat([X.drop(columns=['BeatsPerMinute']).nunique(), X_test.nunique(),
                     X.drop(columns=['BeatsPerMinute']).isna().sum(),X_test.isna().sum()],
                     keys=['Training_Nunq','Testing_Nunq','Training_Nulls','Testing_Nulls'], axis=1)
display(info_df.T)

[1m[34m--------------------------------------------------[0m
[1m[34mTraining data shape = (524164, 10) | Testing data shape shape = (174722, 9)[0m
[1m[36m--------------------------------------------------[0m
[1m[36mTraining data head:[0m


Unnamed: 0_level_0,RhythmScore,AudioLoudness,VocalContent,AcousticQuality,InstrumentalScore,LivePerformanceLikelihood,MoodScore,TrackDurationMs,Energy,BeatsPerMinute
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
0,0.60361,-7.636942,0.0235,5e-06,1e-06,0.051385,0.409866,290715.645,0.826267,147.5302
1,0.639451,-16.267598,0.07152,0.444929,0.349414,0.170522,0.65101,164519.5174,0.1454,136.15963
2,0.514538,-15.953575,0.110715,0.173699,0.453814,0.029576,0.423865,174495.5667,0.624667,55.31989
3,0.734463,-1.357,0.052965,0.001651,0.159717,0.086366,0.278745,225567.4651,0.487467,147.91212
4,0.532968,-13.056437,0.0235,0.068687,1e-06,0.331345,0.477769,213960.6789,0.947333,89.58511


[1m[36m--------------------------------------------------[0m
[1m[36mTesting data head:[0m


Unnamed: 0_level_0,RhythmScore,AudioLoudness,VocalContent,AcousticQuality,InstrumentalScore,LivePerformanceLikelihood,MoodScore,TrackDurationMs,Energy
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
524164,0.410013,-16.794967,0.0235,0.23291,0.012689,0.271585,0.664321,302901.5498,0.424867
524165,0.463071,-1.357,0.141818,0.057725,0.257942,0.097624,0.829552,221995.6643,0.846
524166,0.686569,-3.368928,0.167851,0.287823,0.210915,0.325909,0.304978,357724.0127,0.134067
524167,0.885793,-5.598049,0.118488,5e-06,0.376906,0.134435,0.48774,271790.3989,0.316467
524168,0.637391,-7.06816,0.126099,0.539073,0.06895,0.0243,0.591248,277728.5383,0.481067


[1m[35m--------------------------------------------------[0m
[1m[35mInformation and description[0m
[1m[34m--------------------------------------------------[0m
[1m[34mTraining data description:[0m


Unnamed: 0,RhythmScore,AudioLoudness,VocalContent,AcousticQuality,InstrumentalScore,LivePerformanceLikelihood,MoodScore,TrackDurationMs,Energy
count,524164.0,524164.0,524164.0,524164.0,524164.0,524164.0,524164.0,524164.0,524164.0
mean,0.63,-8.38,0.07,0.26,0.12,0.18,0.56,241903.69,0.5
std,0.16,4.62,0.05,0.22,0.13,0.12,0.23,59326.6,0.29
min,0.08,-27.51,0.02,0.0,0.0,0.02,0.03,63973.0,0.0
25%,0.52,-11.55,0.02,0.07,0.0,0.08,0.4,207099.88,0.25
50%,0.63,-8.25,0.07,0.24,0.07,0.17,0.56,243684.06,0.51
75%,0.74,-4.91,0.11,0.4,0.2,0.27,0.72,281851.66,0.75
max,0.98,-1.36,0.26,1.0,0.87,0.6,0.98,464723.23,1.0


[1m[34m--------------------------------------------------[0m
[1m[34mTraining data information:[0m
<class 'pandas.core.frame.DataFrame'>
Index: 524164 entries, 0 to 524163
Data columns (total 10 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   RhythmScore                524164 non-null  float64
 1   AudioLoudness              524164 non-null  float64
 2   VocalContent               524164 non-null  float64
 3   AcousticQuality            524164 non-null  float64
 4   InstrumentalScore          524164 non-null  float64
 5   LivePerformanceLikelihood  524164 non-null  float64
 6   MoodScore                  524164 non-null  float64
 7   TrackDurationMs            524164 non-null  float64
 8   Energy                     524164 non-null  float64
 9   BeatsPerMinute             524164 non-null  float64
dtypes: float64(10)
memory usage: 44.0 MB



[1m[34m--------------------------------------------------[0m
[1m[34mTesting data description:[0m


Unnamed: 0,RhythmScore,AudioLoudness,VocalContent,AcousticQuality,InstrumentalScore,LivePerformanceLikelihood,MoodScore,TrackDurationMs,Energy
count,174722.0,174722.0,174722.0,174722.0,174722.0,174722.0,174722.0,174722.0,174722.0
mean,0.63,-8.38,0.07,0.26,0.12,0.18,0.56,241753.74,0.5
std,0.16,4.62,0.05,0.22,0.13,0.12,0.23,59103.9,0.29
min,0.14,-27.44,0.02,0.0,0.0,0.02,0.03,63973.0,0.0
25%,0.51,-11.55,0.02,0.07,0.0,0.08,0.4,207518.15,0.25
50%,0.63,-8.25,0.07,0.24,0.07,0.17,0.57,243584.59,0.51
75%,0.74,-4.9,0.11,0.4,0.2,0.27,0.72,281737.45,0.75
max,0.98,-1.36,0.26,1.0,0.68,0.6,0.98,449288.81,1.0


[1m[34m--------------------------------------------------[0m
[1m[34mTesting data information:[0m
<class 'pandas.core.frame.DataFrame'>
Index: 174722 entries, 524164 to 698885
Data columns (total 9 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   RhythmScore                174722 non-null  float64
 1   AudioLoudness              174722 non-null  float64
 2   VocalContent               174722 non-null  float64
 3   AcousticQuality            174722 non-null  float64
 4   InstrumentalScore          174722 non-null  float64
 5   LivePerformanceLikelihood  174722 non-null  float64
 6   MoodScore                  174722 non-null  float64
 7   TrackDurationMs            174722 non-null  float64
 8   Energy                     174722 non-null  float64
dtypes: float64(9)
memory usage: 13.3 MB



[1m[34m--------------------------------------------------[0m
[1m[34mUnique and null values:[0m


Unnamed: 0,RhythmScore,AudioLoudness,VocalContent,AcousticQuality,InstrumentalScore,LivePerformanceLikelihood,MoodScore,TrackDurationMs,Energy
Training_Nunq,322528,310411,229305,270478,218979,279591,306504,377442,11606
Testing_Nunq,116151,110402,84370,97364,79221,101149,109993,133624,10465
Training_Nulls,0,0,0,0,0,0,0,0,0
Testing_Nulls,0,0,0,0,0,0,0,0,0


#### **5. PREPARE FEATURES**

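Feature engineering is performed by the AddFeatures function. For every pair of columns it adds their product and one ratio (with 1e-6 added to the denominator to guard against division by zero), and for every triplet of columns it adds their product. Each column is also discretized into 4 and 10 bins with pd.cut. Although these columns are named quartile and decile, pd.cut with an integer bins argument creates equal-width intervals over the column's observed range rather than quantile-based groups (pd.qcut would give those). Because the function is applied to the training and testing sets separately, the bin edges are derived from each set's own minimum and maximum.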

The created features are stored in a dictionary, converted to a DataFrame with the same index as the dataset, and concatenated as new columns. The target variable is stored separately and removed from the training data to avoid including it in feature calculations.
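As a quick sanity check on the size of the resulting feature matrix, the count of columns produced by AddFeatures can be worked out from the description above. This is a sketch that only counts combinations; the column list is copied from the data preview, and the variable names are illustrative.

```python
from itertools import combinations

# The nine input columns shown in the data preview (target excluded).
base_cols = ['RhythmScore', 'AudioLoudness', 'VocalContent', 'AcousticQuality',
             'InstrumentalScore', 'LivePerformanceLikelihood', 'MoodScore',
             'TrackDurationMs', 'Energy']

n_pairs = len(list(combinations(base_cols, 2)))     # unordered pairs
n_triplets = len(list(combinations(base_cols, 3)))  # unordered triplets

n_pair_features = 2 * n_pairs        # one product and one ratio per pair
n_triplet_features = n_triplets      # one product per triplet
n_bin_features = 2 * len(base_cols)  # one 4-bin and one 10-bin column each

total = len(base_cols) + n_pair_features + n_triplet_features + n_bin_features
print(n_pairs, n_triplets, total)  # 36 84 183
```

Since combinations yields unordered pairs, each pair contributes a single ratio (the first column divided by the second), not both directions.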


In [61]:
# ===== PREPARE FEATURES =====
def AddFeatures(df):
    new_features = {}
    for col1, col2 in combinations(df.columns, 2):
        new_features[f"{col1}_m_{col2}"] = df[col1] * df[col2]
        new_features[f"{col1}_d_{col2}"] = df[col1] / (df[col2] + 1e-6)

    for col1, col2, col3 in combinations(df.columns, 3):
        new_features[f"{col1}_m_{col2}_m_{col3}"] = df[col1] * df[col2] * df[col3]

    for col in df.columns:
        new_features[f"{col}_quartile"] = pd.cut(df[col], bins=4, labels=False, include_lowest=True)
        new_features[f"{col}_decile"] = pd.cut(df[col], bins=10, labels=False, include_lowest=True)

    return pd.concat([df, pd.DataFrame(new_features, index=df.index)], axis=1)

X, y = X.drop('BeatsPerMinute', axis=1), X['BeatsPerMinute']
X, X_test = AddFeatures(X), AddFeatures(X_test)
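
The binned features above use pd.cut with an integer number of bins, which yields equal-width intervals computed from whichever DataFrame is passed in. The summary statistics show that the two sets do not share the same ranges (for example, InstrumentalScore peaks at 0.87 in training and 0.68 in testing), so the same bin code may cover different value ranges in each set. The sketch below, on synthetic data, contrasts pd.cut with pd.qcut and shows one possible way to reuse training edges on the test set. It is an illustration under those assumptions, not the notebook's method, and all variable names are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic, right-skewed stand-ins for one feature column (not competition data).
train_col = pd.Series(rng.exponential(size=100))
test_col = pd.Series(rng.exponential(size=50) * 2.0)  # deliberately wider spread

# pd.cut splits the value range into equal-width intervals, so skewed data
# ends up with unevenly populated bins.
width_codes = pd.cut(train_col, bins=4, labels=False, include_lowest=True)

# pd.qcut places the edges at quantiles, so bins hold (nearly) equal row counts.
quantile_codes = pd.qcut(train_col, q=4, labels=False)

print(width_codes.value_counts().sort_index().tolist())
print(quantile_codes.value_counts().sort_index().tolist())

# Reusing the training edges keeps bin codes comparable between train and test.
# The outer edges are opened to +/- infinity so that test values outside the
# training range still receive a bin instead of NaN.
_, edges = pd.cut(train_col, bins=4, retbins=True, include_lowest=True)
shared_edges = np.concatenate(([-np.inf], edges[1:-1], [np.inf]))
train_codes = pd.cut(train_col, bins=shared_edges, labels=False)
test_codes = pd.cut(test_col, bins=shared_edges, labels=False)
```

Tree models split on thresholds anyway, so whether shared edges change the score here is an empirical question; the point is only that the codes then mean the same thing in both sets.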

#### **6. XGBOOST MODEL**

The XGBoost model is defined for regression using squared error, with RMSE as the evaluation metric. It uses 1000 trees with a maximum depth of 6 and a learning rate of 0.002 for stable learning.

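Column subsampling is applied at two levels: colsample_bytree=0.67 draws about two-thirds of the features for each tree, and colsample_bynode=0.67 draws two-thirds of those at every split. Because XGBoost applies these fractions cumulatively, each split considers roughly 45% of all features. L1 and L2 regularization (reg_alpha=2.50, reg_lambda=0.85) penalize large leaf weights, and a fixed random state makes runs reproducible.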

In [62]:
# ===== XGBOOST MODEL =====
xgb = XGBRegressor(
    objective = 'reg:squarederror',
    eval_metric = 'rmse',
    n_estimators = 1000,
    max_depth = 6,
    learning_rate = 0.002,
    colsample_bytree = 0.67,
    colsample_bynode = 0.67,
    reg_alpha = 2.50,
    reg_lambda = 0.85,
    random_state = 42
)
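
The two column-sampling fractions compound, which is easy to overlook. A small arithmetic sketch follows; the feature count of 183 is an assumption derived from the AddFeatures logic (9 original, 72 pairwise, 84 triplet, and 18 binned columns), and XGBoost's exact per-split count may differ slightly because of internal rounding.

```python
colsample_bytree = 0.67
colsample_bynode = 0.67
n_features = 183  # assumed: 9 original + 72 pairwise + 84 triplet + 18 binned columns

# XGBoost applies the colsample_by* fractions cumulatively:
# columns are first sampled per tree, then sampled again at each split.
per_tree = n_features * colsample_bytree
per_split = per_tree * colsample_bynode

print(f"about {per_tree:.0f} features per tree, "
      f"about {per_split:.0f} candidate features per split "
      f"({per_split / n_features:.0%} of all features)")
```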

#### **7. LIGHTGBM MODEL**

The LightGBM model is defined with 1000 trees, maximum depth of 14, and 85 leaves to capture complex interactions. A low learning rate of 0.0015 ensures slow and stable learning.

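Each tree samples 90% of the features via feature_fraction. subsample is set to 0.90, but in LightGBM row subsampling only takes effect when subsample_freq (alias bagging_freq) is greater than 0; it is left at its default of 0 here, so each tree is trained on all rows. Large leaf outputs are only lightly penalized, with reg_alpha and reg_lambda both set to 0.0001. A random state ensures reproducibility, and verbosity is set to -1 to silence training output.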

In [63]:
# ===== LIGHTGBM MODEL =====
lgbm = LGBMRegressor(
    n_estimators = 1000,
    max_depth = 14,
    num_leaves = 85,
    learning_rate = 0.0015,
    feature_fraction = 0.90,
    subsample = 0.90,
    reg_alpha = 0.0001,
    reg_lambda = 0.0001,
    random_state = 42,
    verbosity = -1
)
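
To get a feel for how conservative these learning rates are over 1000 trees, consider an idealized bound. With squared-error loss, if every tree fitted the current residuals exactly, each round would shrink them by a factor of (1 - learning_rate). The helper below (a hypothetical name) computes what would remain after all rounds. Real trees fit the residuals far less than perfectly, so this is an optimistic limit rather than a prediction of model quality.

```python
def residual_fraction(learning_rate, n_estimators):
    """Fraction of the initial residual left after boosting, in the idealized
    case where every tree fits the current residuals exactly (squared error)."""
    return (1.0 - learning_rate) ** n_estimators

# Learning rates and tree counts taken from the two model definitions.
for name, lr in [('XGBoost', 0.002), ('LightGBM', 0.0015)]:
    print(f"{name}: {residual_fraction(lr, 1000):.3f} of the initial residual remains")
```

Under this idealization about 0.135 (XGBoost) and 0.223 (LightGBM) of the initial residual would be left, which suggests that more trees or a somewhat higher learning rate could be worth testing.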

#### **8. 5-FOLD CROSS-VALIDATION**

We use 5-fold cross-validation to train the models. Two dictionaries store predictions: oof_preds for out-of-fold training predictions and test_preds for averaged test predictions. Each dictionary has a key per model, with values as zero-initialized NumPy arrays.

The data is split into five shuffled folds, with a fixed random state so that the split is reproducible. In each iteration, XGBoost and LightGBM are fitted on the four training folds and then predict both the held-out fold and the full test set; the test predictions are divided by the number of folds so that they accumulate into an average. After all folds, the out-of-fold predictions are compared with the true targets to compute the cross-validation RMSE.

In [65]:
# ===== 5-FOLD CROSS-VALIDATION =====
oof_preds = {name: np.zeros(len(X)) for name in ['XGBoost','LightGBM']}
test_preds = {name: np.zeros(len(X_test)) for name in ['XGBoost','LightGBM']}

kf = KFold(n_splits=5, shuffle=True, random_state=42)

for train_idx, valid_idx in kf.split(X, y):
    X_train, X_valid = X.iloc[train_idx], X.iloc[valid_idx]
    y_train, y_valid = y.iloc[train_idx], y.iloc[valid_idx]

    xgb.fit(X_train, y_train, verbose=False)
    lgbm.fit(X_train, y_train)
    
    for name, model in [('XGBoost', xgb), ('LightGBM', lgbm)]:
        oof_preds[name][valid_idx] = model.predict(X_valid)
        test_preds[name] += model.predict(X_test)/kf.n_splits

for name in ['XGBoost','LightGBM']:
    rmse = np.sqrt(mean_squared_error(y, oof_preds[name]))
    PrintColor(f'{name} predictions - RMSE score: {rmse:.6f}', Fore.RED, lines=False)

[1m[31mXGBoost predictions - RMSE score: 26.461237[0m
[1m[31mLightGBM predictions - RMSE score: 26.460378[0m


#### **9. BLENDING**

In this step, both the out-of-fold predictions and the test-set predictions are averaged across the two models. We then calculate the new cross-validation RMSE by comparing the averaged out-of-fold predictions with the true target values.

In [66]:
# ===== BLENDING =====
avg_oof_preds = (oof_preds['XGBoost'] + oof_preds['LightGBM']) / 2
avg_test_preds = (test_preds['XGBoost'] + test_preds['LightGBM']) / 2

rmse = np.sqrt(mean_squared_error(y, avg_oof_preds))
PrintColor(f'Averaged predictions - RMSE score: {rmse:.6f}', Fore.RED, lines=False)

[1m[31mAveraged predictions - RMSE score: 26.460300[0m


#### **10. CREATE SUBMISSION FILE**

The final step is creating a CSV file for submission to the competition.

In [67]:
# ===== CREATE SUBMISSION FILE =====
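# Name the prediction column after the training target; this is assumed to match
# the header of the competition's sample submission file.
output = pd.DataFrame({'id': X_test.index, 'BeatsPerMinute': avg_test_preds})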
output.to_csv('submission.csv', index=False)
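
Before uploading, it can help to verify the file's basic shape. The helper below is a hypothetical sketch: `check_submission` and its messages are illustrative, and the expected header (`id` plus a target column) is an assumption that should be checked against the competition's sample submission file.

```python
import numpy as np
import pandas as pd

def check_submission(submission, expected_ids, target_col):
    """Return a list of problems found in a submission frame (empty if none)."""
    problems = []
    if list(submission.columns) != ['id', target_col]:
        problems.append(f"unexpected columns: {list(submission.columns)}")
        return problems  # the remaining checks need both columns
    if len(submission) != len(expected_ids):
        problems.append("row count does not match the test set")
    elif (submission['id'].to_numpy() != np.asarray(expected_ids)).any():
        problems.append("ids differ from the test index")
    if submission[target_col].isna().any():
        problems.append("missing predictions")
    return problems

# Toy frame for illustration only (not real predictions).
demo = pd.DataFrame({'id': [10, 11, 12], 'BeatsPerMinute': [118.2, 121.9, 119.4]})
print(check_submission(demo, expected_ids=[10, 11, 12], target_col='BeatsPerMinute'))
```

In the notebook, `expected_ids` would be `X_test.index` and `submission` the DataFrame written to submission.csv.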