#### **1. INTRODUCTION**

This notebook presents a solution to the [Playground Series - Season 5, Episode 10](https://www.kaggle.com/competitions/playground-series-s5e10) competition on Kaggle, hosted in October 2025. The goal is to predict the likelihood of accidents on different types of roads, with submissions evaluated using the root mean squared error (RMSE) between the predicted and observed targets.
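For reference, the standard definition of RMSE is given below, where $n$ is the number of rows, $\hat{y}_i$ is the predicted accident risk and $y_i$ the observed one (the symbols are introduced here for illustration only):

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}
$$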

The workflow begins by importing the necessary libraries, loading the data, and performing exploratory data analysis (EDA). Feature engineering is performed using the original dataset to create target-encoded features, which are applied to both the training and test sets. XGBoost and CatBoost models are trained using 5-fold cross-validation, and performance is evaluated by comparing out-of-fold predictions with the true target values for each model as well as their averaged results. Finally, predictions from an external notebook are imported and blended with the current models’ outputs before generating a CSV file ready for submission.

#### **2. IMPORT LIBRARIES**

The necessary libraries for this project are imported. NumPy and Pandas handle basic data manipulation. Fore and Style from Colorama are used for colorful and readable print outputs. XGBRegressor and CatBoostRegressor are imported from xgboost and catboost, respectively, for modeling. Finally, KFold and mean_squared_error are imported from scikit-learn for model validation.

In [1]:
# ===== Import Libraries =====
import numpy as np, pandas as pd
from colorama import Fore, Style
from xgboost import XGBRegressor
from catboost import CatBoostRegressor
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

#### **3. LOAD DATA**

The training and testing datasets are loaded into pandas DataFrames. This data was generated by a deep learning model trained on the [Simulated Roads Accident](https://www.kaggle.com/datasets/ianktoo/simulated-roads-accident-data/) dataset. The original dataset is also loaded in the following code.

In [2]:
# ===== Load Data =====
X = pd.read_csv('/kaggle/input/playground-series-s5e10/train.csv', index_col='id')
X_test = pd.read_csv('/kaggle/input/playground-series-s5e10/test.csv', index_col='id')
original = pd.read_csv(
    '/kaggle/input/simulated-roads-accident-data/synthetic_road_accidents_100k.csv'
).rename_axis('id')

#### **4. EXPLORE DATA**

In this section, we perform basic exploratory data analysis (EDA). We start by examining the shapes and heads of the three DataFrames to understand their structure. Next, we inspect the information and statistics of numerical columns to learn their data types, memory usage, and basic descriptive statistics. Finally, we check each DataFrame for the number of unique and null values in every column.

In [3]:
# ===== Explore Data =====
def print_color(text, color=Fore.BLUE, lines=True):
    if lines: print(f"{Style.BRIGHT}{color}{'-' * 50}{Style.RESET_ALL}")
    print(f"{Style.BRIGHT}{color}{text}{Style.RESET_ALL}")
    
print_color(f"Shapes of training - testing - original: {X.shape} {X_test.shape} {original.shape}")
for name, df in [('Training data', X), ('Testing data', X_test), ('Original data', original)]:
    print_color(f"{name} head:", color=Fore.CYAN)
    display(df.head())

print_color("Information and description", color=Fore.MAGENTA)
for name, df in [('Training data', X), ('Testing data', X_test), ('Original data', original)]:
    print_color(f"{name} description:")
    display(df.drop(columns=['accident_risk'], errors='ignore')
            .describe(percentiles=[0.05, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99])
            .drop(index='count').round(2).T)

    print_color(f"{name} information:")
    df.info()

print_color("Unique and null values:")
display(pd.concat([X.drop('accident_risk', axis=1).nunique(), X_test.nunique(),
                   original.drop('accident_risk', axis=1).nunique(),
                   X.drop('accident_risk', axis=1).isna().sum(), X_test.isna().sum(),
                   original.drop('accident_risk', axis=1).isna().sum()],
                   keys=['Train_Nunq', 'Test_Nunq', 'Orig_Nunq',
                         'Train_Nulls', 'Test_Nulls', 'Orig_Nulls'], axis=1).T)

[1m[34m--------------------------------------------------[0m
[1m[34mShapes of training - testing - original: (517754, 13) (172585, 12) (100000, 13)[0m
[1m[36m--------------------------------------------------[0m
[1m[36mTraining data head:[0m


| id | road_type | num_lanes | curvature | speed_limit | lighting | weather | road_signs_present | public_road | time_of_day | holiday | school_season | num_reported_accidents | accident_risk |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | urban | 2 | 0.06 | 35 | daylight | rainy | False | True | afternoon | False | True | 1 | 0.13 |
| 1 | urban | 4 | 0.99 | 35 | daylight | clear | True | False | evening | True | True | 0 | 0.35 |
| 2 | rural | 4 | 0.63 | 70 | dim | clear | False | True | morning | True | False | 2 | 0.3 |
| 3 | highway | 4 | 0.07 | 35 | dim | rainy | True | True | morning | False | False | 1 | 0.21 |
| 4 | rural | 1 | 0.58 | 60 | daylight | foggy | False | False | evening | True | False | 1 | 0.56 |


[1m[36m--------------------------------------------------[0m
[1m[36mTesting data head:[0m


| id | road_type | num_lanes | curvature | speed_limit | lighting | weather | road_signs_present | public_road | time_of_day | holiday | school_season | num_reported_accidents |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 517754 | highway | 2 | 0.34 | 45 | night | clear | True | True | afternoon | True | True | 1 |
| 517755 | urban | 3 | 0.04 | 45 | dim | foggy | True | False | afternoon | True | False | 0 |
| 517756 | urban | 2 | 0.59 | 35 | dim | clear | True | False | afternoon | True | True | 1 |
| 517757 | rural | 4 | 0.95 | 35 | daylight | rainy | False | False | afternoon | False | False | 2 |
| 517758 | highway | 2 | 0.86 | 35 | daylight | clear | True | False | evening | False | True | 3 |


[1m[36m--------------------------------------------------[0m
[1m[36mOriginal data head:[0m


| id | road_type | num_lanes | curvature | speed_limit | lighting | weather | road_signs_present | public_road | time_of_day | holiday | school_season | num_reported_accidents | accident_risk |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | rural | 2 | 0.29 | 70 | night | rainy | False | True | evening | False | False | 1 | 0.64 |
| 1 | highway | 1 | 0.34 | 25 | dim | clear | False | False | morning | False | False | 3 | 0.27 |
| 2 | rural | 2 | 0.76 | 70 | night | foggy | True | False | evening | True | True | 1 | 0.76 |
| 3 | rural | 3 | 0.37 | 70 | night | foggy | True | False | morning | False | True | 0 | 0.6 |
| 4 | highway | 3 | 0.39 | 45 | dim | rainy | False | True | morning | False | False | 0 | 0.17 |


[1m[35m--------------------------------------------------[0m
[1m[35mInformation and description[0m
[1m[34m--------------------------------------------------[0m
[1m[34mTraining data description:[0m


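|  | mean | std | min | 5% | 25% | 50% | 75% | 90% | 95% | 99% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| num_lanes | 2.49 | 1.12 | 1.0 | 1.0 | 1.0 | 2.0 | 3.0 | 4.0 | 4.0 | 4.0 | 4.0 |
| curvature | 0.49 | 0.27 | 0.0 | 0.05 | 0.26 | 0.51 | 0.71 | 0.86 | 0.92 | 0.98 | 1.0 |
| speed_limit | 46.11 | 15.79 | 25.0 | 25.0 | 35.0 | 45.0 | 60.0 | 70.0 | 70.0 | 70.0 | 70.0 |
| num_reported_accidents | 1.19 | 0.9 | 0.0 | 0.0 | 1.0 | 1.0 | 2.0 | 2.0 | 3.0 | 3.0 | 7.0 |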


[1m[34m--------------------------------------------------[0m
[1m[34mTraining data information:[0m
<class 'pandas.core.frame.DataFrame'>
Index: 517754 entries, 0 to 517753
Data columns (total 13 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   road_type               517754 non-null  object 
 1   num_lanes               517754 non-null  int64  
 2   curvature               517754 non-null  float64
 3   speed_limit             517754 non-null  int64  
 4   lighting                517754 non-null  object 
 5   weather                 517754 non-null  object 
 6   road_signs_present      517754 non-null  bool   
 7   public_road             517754 non-null  bool   
 8   time_of_day             517754 non-null  object 
 9   holiday                 517754 non-null  bool   
 10  school_season           517754 non-null  bool   
 11  num_reported_accidents  517754 non-null  int64  
 12  accident_risk           51775

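|  | mean | std | min | 5% | 25% | 50% | 75% | 90% | 95% | 99% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| num_lanes | 2.49 | 1.12 | 1.0 | 1.0 | 1.0 | 2.0 | 3.0 | 4.0 | 4.0 | 4.0 | 4.0 |
| curvature | 0.49 | 0.27 | 0.0 | 0.05 | 0.26 | 0.51 | 0.71 | 0.86 | 0.92 | 0.98 | 1.0 |
| speed_limit | 46.1 | 15.79 | 25.0 | 25.0 | 35.0 | 45.0 | 60.0 | 70.0 | 70.0 | 70.0 | 70.0 |
| num_reported_accidents | 1.19 | 0.9 | 0.0 | 0.0 | 1.0 | 1.0 | 2.0 | 2.0 | 3.0 | 3.0 | 7.0 |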


[1m[34m--------------------------------------------------[0m
[1m[34mTesting data information:[0m
<class 'pandas.core.frame.DataFrame'>
Index: 172585 entries, 517754 to 690338
Data columns (total 12 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   road_type               172585 non-null  object 
 1   num_lanes               172585 non-null  int64  
 2   curvature               172585 non-null  float64
 3   speed_limit             172585 non-null  int64  
 4   lighting                172585 non-null  object 
 5   weather                 172585 non-null  object 
 6   road_signs_present      172585 non-null  bool   
 7   public_road             172585 non-null  bool   
 8   time_of_day             172585 non-null  object 
 9   holiday                 172585 non-null  bool   
 10  school_season           172585 non-null  bool   
 11  num_reported_accidents  172585 non-null  int64  
dtypes: bool(4), float64(1), i

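|  | mean | std | min | 5% | 25% | 50% | 75% | 90% | 95% | 99% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| num_lanes | 2.5 | 1.12 | 1.0 | 1.0 | 1.0 | 2.0 | 3.0 | 4.0 | 4.0 | 4.0 | 4.0 |
| curvature | 0.5 | 0.29 | 0.0 | 0.05 | 0.25 | 0.5 | 0.75 | 0.9 | 0.95 | 0.99 | 1.0 |
| speed_limit | 47.05 | 16.32 | 25.0 | 25.0 | 35.0 | 45.0 | 60.0 | 70.0 | 70.0 | 70.0 | 70.0 |
| num_reported_accidents | 1.5 | 1.23 | 0.0 | 0.0 | 1.0 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 10.0 |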


[1m[34m--------------------------------------------------[0m
[1m[34mOriginal data information:[0m
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 13 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   road_type               100000 non-null  object 
 1   num_lanes               100000 non-null  int64  
 2   curvature               100000 non-null  float64
 3   speed_limit             100000 non-null  int64  
 4   lighting                100000 non-null  object 
 5   weather                 100000 non-null  object 
 6   road_signs_present      100000 non-null  bool   
 7   public_road             100000 non-null  bool   
 8   time_of_day             100000 non-null  object 
 9   holiday                 100000 non-null  bool   
 10  school_season           100000 non-null  bool   
 11  num_reported_accidents  100000 non-null  int64  
 12  accident_risk           1

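|  | road_type | num_lanes | curvature | speed_limit | lighting | weather | road_signs_present | public_road | time_of_day | holiday | school_season | num_reported_accidents |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Train_Nunq | 3 | 4 | 261 | 5 | 3 | 3 | 2 | 2 | 3 | 2 | 2 | 8 |
| Test_Nunq | 3 | 4 | 195 | 5 | 3 | 3 | 2 | 2 | 3 | 2 | 2 | 8 |
| Orig_Nunq | 3 | 4 | 101 | 5 | 3 | 3 | 2 | 2 | 3 | 2 | 2 | 11 |
| Train_Nulls | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Test_Nulls | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Orig_Nulls | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |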


#### **5. FEATURE ENGINEERING**

In this section, we perform basic feature engineering using the original dataset. We define a function that, for each feature in a DataFrame, adds a new column holding the mean target value observed for that feature's value in the original dataset. Concretely, the original dataset is grouped by the feature's values, the mean of the target column is calculated for each group, and these group means are mapped back onto the rows of the DataFrame. This approach works well because each feature has relatively few unique values compared to the dataset length. Values that do not occur in the original dataset (for example, curvature has 261 unique values in the training data but only 101 in the original data) are mapped to NaN.

Before applying the function to the training and testing datasets, the target column is stored separately so that it is not encoded as a feature. Finally, all object columns are converted to the pandas categorical dtype. XGBoost does not accept object columns and needs this dtype when enable_categorical=True, and the CatBoost cell later selects its categorical features by dtype.

In [4]:
# ===== Feature Engineering =====
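def target_encode(df):
    # Snapshot the column list first, because new columns are added inside the loop
    for col in list(df.columns):
        # Map each value to the mean accident_risk of that value in the original data;
        # values that never occur in the original data become NaN
        df[f'original_mean_{col}'] = df[col].map(original.groupby(col)['accident_risk'].mean())
    return df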

X, y = X.drop(columns=['accident_risk']), X['accident_risk']
X, X_test = target_encode(X), target_encode(X_test)

for col in X.select_dtypes(include='object').columns:
    X[col] = X[col].astype('category')
    X_test[col] = X_test[col].astype('category')
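
To make the mapping step in target_encode concrete, here is a minimal sketch on invented data. The frames reference and toy and all their values are hypothetical stand-ins for the original dataset and the frame being encoded; only the groupby-then-map pattern mirrors the cell above.

```python
import pandas as pd

# Hypothetical stand-in for the original dataset (values invented for illustration).
reference = pd.DataFrame({
    "road_type": ["urban", "urban", "rural", "highway"],
    "accident_risk": [0.2, 0.4, 0.6, 0.1],
})

# Hypothetical frame to encode; "bridge" never occurs in the reference data.
toy = pd.DataFrame({"road_type": ["urban", "rural", "bridge"]})

# Mean target per category, computed on the reference data only.
group_means = reference.groupby("road_type")["accident_risk"].mean()

# Map each row's category to its reference mean; unseen categories become NaN.
toy["original_mean_road_type"] = toy["road_type"].map(group_means)

print(toy)
```

The "urban" row receives the mean of its two reference rows, while the unseen "bridge" row is left as NaN. Both XGBoost and CatBoost accept missing values in numeric features, so these gaps do not need to be filled before training.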

#### **6. XGBOOST MODEL**

Next, we define the hyperparameters for the models we will train. The XGBoost model is set up for regression with a squared-error objective and evaluated using root mean squared error (RMSE). Each tree has a maximum depth of 8 and is built on a random 90% of the rows and 75% of the features to improve generalization, while L1 and L2 regularization penalize large leaf weights. A learning rate of 0.01 keeps each boosting step small over up to 10,000 rounds, with training stopped early once the evaluation metric has not improved for 200 rounds. A random state is set for reproducibility, and native handling of categorical features is enabled.

In [5]:
# ===== XGBoost Model =====
xgb = XGBRegressor(
    objective='reg:squarederror',
    eval_metric='rmse',
    max_depth=8,
    subsample=0.90,
    colsample_bytree=0.75,
    reg_alpha=0.001,
    reg_lambda=0.75,
    learning_rate=0.01,
    n_estimators=10000,
    early_stopping_rounds=200,
    random_state=42,
    enable_categorical=True
)
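
The early-stopping rule described above can be illustrated with a simplified sketch of the general patience idea. This is not XGBoost's internal implementation; the function name best_round_with_patience and the validation scores are invented for illustration.

```python
def best_round_with_patience(val_scores, patience):
    """Return (best_round, stop_round) for a lower-is-better metric.

    Training halts once `patience` consecutive rounds pass without a new best.
    """
    best_score = float("inf")
    best_round = -1
    for round_idx, score in enumerate(val_scores):
        if score < best_score:
            # New best score: remember it and reset the patience window.
            best_score, best_round = score, round_idx
        elif round_idx - best_round >= patience:
            # No improvement for `patience` rounds: stop here.
            return best_round, round_idx
    # The scores ran out before the patience window was exhausted.
    return best_round, len(val_scores) - 1


# Invented validation RMSE values, one per boosting round.
scores = [0.070, 0.062, 0.058, 0.057, 0.0575, 0.0572, 0.0571, 0.0580]
best_round, stop_round = best_round_with_patience(scores, patience=3)
print(best_round, stop_round)
```

With a patience of 3, the best score occurs at round 3 and training stops at round 6, after three rounds without improvement. In the notebook the same idea applies with a patience of 200 rounds and a cap of 10,000.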

#### **7. CATBOOST MODEL**

The CatBoost model also uses RMSE as its loss function, with trees limited to a maximum depth of 6, 70% of the features sampled at each split selection, and an L2 leaf regularization coefficient of 1.25 to improve generalization. A learning rate of 0.02 is applied over up to 10,000 iterations with early stopping after 200 rounds without improvement, so training continues only while validation performance keeps improving. A random state is set for reproducibility, and the categorical and boolean columns are passed as categorical features.

In [6]:
# ===== CatBoost Model =====
cb = CatBoostRegressor(
    loss_function='RMSE',
    max_depth=6,
    colsample_bylevel=0.7,
    l2_leaf_reg=1.25,
    learning_rate=0.02,
    iterations=10000,
    early_stopping_rounds=200,
    random_state=42,
    cat_features=[col for col in X.columns if X[col].dtype in ['category', 'bool']]
)
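
The list comprehension above picks out the categorical and boolean columns by dtype. As a sketch of an alternative, the same selection can be written with the pandas select_dtypes method; the frame toy below is a hypothetical miniature of the feature frame, not the competition data.

```python
import pandas as pd

# Hypothetical miniature of the feature frame after the dtype conversion.
toy = pd.DataFrame({
    "road_type": pd.Series(["urban", "rural"], dtype="category"),
    "holiday": [True, False],
    "num_lanes": [2, 4],
    "curvature": [0.06, 0.63],
})

# Columns whose dtype is categorical or boolean, in their original order.
cat_cols = toy.select_dtypes(include=["category", "bool"]).columns.tolist()
print(cat_cols)
```

Numeric columns such as num_lanes and curvature are left out of the list, as are the float-valued target-encoded columns in the notebook.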

#### **8. 5-FOLD CROSS-VALIDATION**

We use 5-fold cross-validation to train our models. Two dictionaries are created to store out-of-fold and test predictions, with keys representing the model names and values as NumPy arrays of appropriate length initialized to zero. An instance of KFold is created with 5 splits, shuffling enabled to introduce randomness, and a fixed random state for reproducibility.

For each fold, the training data is split into training and validation subsets. Both models are trained on the training subset, using the validation subset for early stopping. Predictions are then made on the out-of-fold set and on the test set, with test predictions averaged across folds.

In [7]:
# ===== 5-Fold Cross-Validation =====
oof_preds = {name: np.zeros(len(X)) for name in ['XGBoost', 'CatBoost']}
test_preds = {name: np.zeros(len(X_test)) for name in ['XGBoost', 'CatBoost']}

kf = KFold(n_splits=5, shuffle=True, random_state=42)

for train_idx, valid_idx in kf.split(X, y):
    X_train, X_valid = X.iloc[train_idx], X.iloc[valid_idx]
    y_train, y_valid = y.iloc[train_idx], y.iloc[valid_idx]

    xgb.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)
    cb.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=0)

    for name, model in [('XGBoost', xgb), ('CatBoost', cb)]:
        oof_preds[name][valid_idx] = model.predict(X_valid)
        test_preds[name] += model.predict(X_test) / kf.n_splits
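
The out-of-fold and test-averaging pattern in the loop above can be sketched on synthetic data. In this minimal example an ordinary least-squares fit stands in for the boosted models, and the arrays X_toy, y_toy and X_new are invented for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)

# Synthetic data, invented for illustration: y is a noisy linear function of X.
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.05, size=200)
X_new = rng.normal(size=(50, 3))  # plays the role of the test set

oof = np.zeros(len(X_toy))
test_pred = np.zeros(len(X_new))
kf = KFold(n_splits=5, shuffle=True, random_state=42)

for train_idx, valid_idx in kf.split(X_toy):
    # Stand-in model: ordinary least squares fitted on the training fold only.
    coef, *_ = np.linalg.lstsq(X_toy[train_idx], y_toy[train_idx], rcond=None)

    # Each row is predicted exactly once, by the model that did not see it.
    oof[valid_idx] = X_toy[valid_idx] @ coef

    # Test predictions are the mean over the five fold models.
    test_pred += (X_new @ coef) / kf.n_splits

oof_rmse = np.sqrt(np.mean((y_toy - oof) ** 2))
print(f"OOF RMSE: {oof_rmse:.4f}")
```

Because every out-of-fold prediction comes from a model that never saw that row, the resulting RMSE is an estimate of performance on unseen data rather than of training fit.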

#### **9. MODEL EVALUATION**

In this step, we evaluate each model's performance by comparing the out-of-fold predictions with the true target values. We then average the out-of-fold and test set predictions from both models and compute the RMSE of the averaged predictions to assess overall performance.
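
To see why averaging two models can help, consider a toy sketch with invented numbers in which the two models err in different directions. The helper rmse and the arrays below are illustrative only.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length arrays."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Invented numbers: model A overshoots where model B undershoots.
y_true = np.array([0.10, 0.30, 0.50, 0.70])
pred_a = np.array([0.14, 0.34, 0.46, 0.66])
pred_b = np.array([0.06, 0.28, 0.52, 0.74])

pred_avg = (pred_a + pred_b) / 2

for name, pred in [("A", pred_a), ("B", pred_b), ("Average", pred_avg)]:
    print(f"{name}: {rmse(y_true, pred):.4f}")
```

By the triangle inequality, the RMSE of an equal-weight average never exceeds the mean of the two individual RMSE values. It falls below the better model only when the models' errors partly cancel, as they do in this constructed example.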

In [8]:
# ===== Model Evaluation =====
for name in ['XGBoost', 'CatBoost']:
    rmse = np.sqrt(mean_squared_error(y, oof_preds[name]))
    print_color(f'{name} predictions - RMSE score: {rmse:.6f}', color=Fore.RED, lines=False)

avg_oof_preds = (oof_preds['XGBoost'] + oof_preds['CatBoost']) / 2
avg_test_preds = (test_preds['XGBoost'] + test_preds['CatBoost']) / 2

rmse = np.sqrt(mean_squared_error(y, avg_oof_preds))
print_color(f'Averaged predictions - RMSE score: {rmse:.6f}', color=Fore.RED)

[1m[31mXGBoost predictions - RMSE score: 0.055998[0m
[1m[31mCatBoost predictions - RMSE score: 0.056053[0m
[1m[31m--------------------------------------------------[0m
[1m[31mAveraged predictions - RMSE score: 0.055988[0m


#### **10. CLOSURE**

Before creating the submission file, we combine our predictions with those from Masaya Kawamata's notebook [S5E10 | NN Stacking - Baseline](https://www.kaggle.com/code/masayakawamata/s5e10-nn-stacking-baseline), which uses out-of-fold and test predictions from several models as features to train a neural network that achieves a cross-validation RMSE score of 0.05588. Blending predictions from different models tends to improve robustness and can lower the overall error. The final prediction is a weighted average, with a weight of 0.20 on our averaged predictions and 0.80 on the external predictions. This approach is permitted in the competition, and all usage complies with copyright rules and proper attribution.

In [9]:
# ===== Closure =====
ext_preds = pd.read_csv('/kaggle/input/s5e10-nn-stacking-baseline/test_nn_ensemble.csv')
# Positional blend: assumes the external file lists rows in the same order as X_test
final_preds = 0.20 * avg_test_preds + 0.80 * ext_preds['accident_risk']

output = pd.DataFrame({'id': X_test.index, 'accident_risk': final_preds})
output.to_csv('submission.csv', index=False)
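
The blend above combines the two prediction sets row by row, so it relies on both being in the same order. If the external file carried an id column (an assumption, since its layout is not shown here), the predictions could be aligned on id before blending. The ids and values below are invented for illustration.

```python
import pandas as pd

# Hypothetical ids and predictions, invented for illustration.
ours = pd.Series([0.30, 0.50, 0.10], index=pd.Index([10, 11, 12], name="id"))

# External predictions arrive with an id column but in a different row order.
external = pd.DataFrame({"id": [12, 10, 11], "accident_risk": [0.20, 0.40, 0.60]})

# Reindex the external predictions to our id order before blending.
ext_aligned = external.set_index("id")["accident_risk"].reindex(ours.index)
assert not ext_aligned.isna().any(), "external file is missing some ids"

# Same 0.20 / 0.80 weights as in the notebook, now matched on id.
blend = 0.20 * ours + 0.80 * ext_aligned
submission = blend.rename("accident_risk").reset_index()
print(submission)
```

Aligning on id makes the blend robust to a reordered external file, and the explicit check fails loudly if any id is missing instead of silently writing NaN to the submission.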