# Stepwise hyperparameter tuning of XGBoost regression with Optuna

## Highlights

> 1. Load and clean dataset
> 2. Separate in multiple datasets for train, validation, and test
> 3. Tune hyperparameter of XGBoost
> 4. Save submission file with correct format

* [1. Import and Clean Data](#import)
    * [1.1. Renaming columns](#columns_renaming)
    * [1.2. Missing values](#missing)
    * [1.3. Deleting features](#delete)
    * [1.4. Modifying date_of_registration and it features](#modifications)
    * [1.5. Encode categorical data](#encode)
* [2. Data preprocessing](#processing)
    * [2.1. Train-Test split](#splitOne)
    * [2.2. Train-Validation split](#splitTwo)
    * [2.3. Imputation](#imputation)
    * [2.4. Outliers binning](#binning)
* [3. Utility function](#utilities)
* [4. Evaluation metric](#metric)
* [5. XGBoost with default parameters](#default_xgboost)
* [6. Stepwize Hyperparameter Tuning](#stepwise)
    * [6.1. Resuming the optimization process in Optuna](#resume)
* [7. Model comparison](#comparison)

In [1]:
%pip install --upgrade polars scikit-learn seaborn numpy optbinning ortools

Note: you may need to restart the kernel to use updated packages.


In [2]:
import os, sys

import polars as pl
import sklearn

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from unidecode import unidecode
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import IsolationForest



In [3]:
print('Seaborn', sns.__version__)
print('Polars', pl.__version__)
print('Sklearn', sklearn.__version__)

Seaborn 0.13.0
Polars 0.20.2
Sklearn 1.3.2


In [4]:
pl.Config.set_tbl_width_chars(300).set_tbl_cols(10).set_tbl_rows(30)

polars.config.Config

<span id="import"></span>

## 1. Import and Clean Data

In [5]:
types = {
    'Electric range (km)': pl.Float64,
    'Erwltp (g/km)': pl.Float64,
    'Enedc (g/km)': pl.Float64,
}

In [6]:
train_df = pl.read_csv('/kaggle/input/estimate-co2-emissions-from-cars/train.csv', dtypes=types)
submission_df = pl.read_csv('/kaggle/input/estimate-co2-emissions-from-cars/test.csv', dtypes=types)

print(f'Loaded {train_df.shape[0]} lines and {train_df.shape[1]} columns in train dataset')

Loaded 7571649 lines and 37 columns in train dataset


<span id="columns_renaming"></span>

### 1.1. Renaming comumns

In [7]:
def rename_column(colname):
    return colname.lower().split(' (')[0].strip().replace(' ', '_')

In [8]:
train_df.columns = list(map(rename_column, train_df.columns))
submission_df.columns = list(map(rename_column, submission_df.columns))

<span id="missing"></span>

### 1.2. Missing values

In [9]:
def get_missing(df, colname='missing'):
    return (
        df.select((pl.all().is_null().sum() / train_df.shape[0] * 100).round(2))
        .melt(value_name=colname)
        .filter(pl.col(colname) > 0)
        .sort(by=colname, descending=True)
    )

def get_static(df, colname='unique'):
    return (
        train_df.select(pl.all().n_unique())
        .melt(value_name=colname)
        .filter(pl.col(colname) == 1)
    )

missing_train = get_missing(train_df)
static_train = get_static(train_df)

missing_submission = get_missing(submission_df)
static_submission = get_static(submission_df)

print('Missing rows in train:')
print(missing_train)
print('Static columns in train:')
print(static_train)
print('Missing rows in submission:')
print(missing_submission)
print('Static columns in submission:')
print(static_submission)

Missing rows in train:
shape: (25, 2)
┌──────────────────────┬─────────┐
│ variable             ┆ missing │
│ ---                  ┆ ---     │
│ str                  ┆ f64     │
╞══════════════════════╪═════════╡
│ mms                  ┆ 100.0   │
│ ernedc               ┆ 100.0   │
│ de                   ┆ 100.0   │
│ vf                   ┆ 100.0   │
│ enedc                ┆ 83.84   │
│ electric_range       ┆ 82.96   │
│ z                    ┆ 77.98   │
│ erwltp               ┆ 46.48   │
│ it                   ┆ 37.78   │
│ fuel_consumption     ┆ 23.51   │
│ ec                   ┆ 13.51   │
│ mt                   ┆ 11.1    │
│ vfn                  ┆ 8.61    │
│ mp                   ┆ 6.41    │
│ at2                  ┆ 2.34    │
│ at1                  ┆ 2.19    │
│ date_of_registration ┆ 1.7     │
│ cn                   ┆ 1.52    │
│ ve                   ┆ 0.42    │
│ ep                   ┆ 0.25    │
│ tan                  ┆ 0.16    │
│ va                   ┆ 0.16    │
│ ct             

<span id="delete"></span>

### 1.3. Deleting features

In [10]:
# Gather all features with only one value or missing all rows
features_to_delete = (
    static_train['variable'].to_list() +
    static_submission['variable'].to_list() +
    missing_train.filter(pl.col('missing') == 100.0)['variable'].to_list() +
    missing_submission.filter(pl.col('missing') == 100.0)['variable'].to_list() +
    ['id']
)

train_df = train_df.drop(features_to_delete)
submission_df = submission_df.drop(features_to_delete)

In [11]:
# Resulting columns
new_missing_in_train = get_missing(train_df)
new_missing_in_submission = get_missing(submission_df)

print('Rows missing in train after drop:')
print(new_missing_in_train)
print('Rows missing in test after drop:')
print(new_missing_in_submission)

Rows missing in train after drop:
shape: (21, 2)
┌──────────────────────┬─────────┐
│ variable             ┆ missing │
│ ---                  ┆ ---     │
│ str                  ┆ f64     │
╞══════════════════════╪═════════╡
│ enedc                ┆ 83.84   │
│ electric_range       ┆ 82.96   │
│ z                    ┆ 77.98   │
│ erwltp               ┆ 46.48   │
│ it                   ┆ 37.78   │
│ fuel_consumption     ┆ 23.51   │
│ ec                   ┆ 13.51   │
│ mt                   ┆ 11.1    │
│ vfn                  ┆ 8.61    │
│ mp                   ┆ 6.41    │
│ at2                  ┆ 2.34    │
│ at1                  ┆ 2.19    │
│ date_of_registration ┆ 1.7     │
│ cn                   ┆ 1.52    │
│ ve                   ┆ 0.42    │
│ ep                   ┆ 0.25    │
│ tan                  ┆ 0.16    │
│ va                   ┆ 0.16    │
│ ct                   ┆ 0.16    │
│ w                    ┆ 0.16    │
│ t                    ┆ 0.02    │
└──────────────────────┴─────────┘
Rows m

<span id="modifications"></span>

### 1.4. Modifying date_of_registration and it features

In [12]:
def modify_features(df):
    return df.with_columns(
        pl.col('date_of_registration').str.to_datetime('%Y-%m-%d').dt.timestamp(time_unit='ms'),
        pl.col('it').replace('', 'none'),
    )

In [13]:
train_df = modify_features(train_df)
submission_df = modify_features(submission_df)

<span id="encode"></span>

### 1.5. Encode categorical data

In [14]:
categorical_columns = train_df.select(pl.col(pl.Utf8)).columns
numerical_columns = [i for i in train_df.columns if i not in categorical_columns]
print(f'Detected {len(categorical_columns)} categorical columns')
print(f'Detected {len(numerical_columns)} numerical columns')

Detected 16 categorical columns
Detected 14 numerical columns


In [15]:
def handle_encoding(value):
    return unidecode(value)

In [16]:
# Clean categories in columns
modifications = {i: pl.col(i).str.to_lowercase().str.strip_chars().str.replace_all(' ', '_').map_elements(handle_encoding) for i in categorical_columns}

train_df = train_df.with_columns(**modifications)
submission_df = submission_df.with_columns(**modifications)

<span id="processing"></span>

## 2. Data preprocessing

Do a univeriate statistical analysis of every column before preprocessing

In [17]:
# For every categories, print graph of every categories or top 20 if more
"""
for colname in categorical_columns:
    category_counts = train_df[colname].value_counts(sort=True).top_k(20, by='count')
    
    sns.barplot(
        y=[str(i) for i in category_counts[colname].to_numpy()],
        x=category_counts['count'].log().to_numpy(),
    )
    plt.title(f'Category counts in column {colname}')
    plt.xlabel('Log of counts')
    plt.show()
"""

"\nfor colname in categorical_columns:\n    category_counts = train_df[colname].value_counts(sort=True).top_k(20, by='count')\n    \n    sns.barplot(\n        y=[str(i) for i in category_counts[colname].to_numpy()],\n        x=category_counts['count'].log().to_numpy(),\n    )\n    plt.title(f'Category counts in column {colname}')\n    plt.xlabel('Log of counts')\n    plt.show()\n"

In [18]:
# For every numeric column, print graph of repartition
"""
for colname in [i for i in train_df.columns if i not in categorical_columns]:
    sns.displot(
        train_df,
        x=colname
    )
    plt.title(f'Distribution of column {colname}')
    plt.xlabel('Value')
    plt.show()
"""

"\nfor colname in [i for i in train_df.columns if i not in categorical_columns]:\n    sns.displot(\n        train_df,\n        x=colname\n    )\n    plt.title(f'Distribution of column {colname}')\n    plt.xlabel('Value')\n    plt.show()\n"

<span id="splitOne"></span>

### 2.1. Train-Test split

In [19]:
from sklearn.model_selection import train_test_split

In [20]:
train, test = train_test_split(train_df, shuffle=True, test_size=0.05, stratify=train_df['ft'])

<span id="splitTwo"></span>

### 2.2. Train-Validation split

In [21]:
train, val = train_test_split(train, shuffle=True, test_size=0.05, stratify=train['ft'])

<span id="imputation"></span>

### 2.3. Imputation

In [22]:
# Impute with zeros
columns_to_impute_with_zeros = ['electric_range', 'fuel_consumption', 'erwltp', 'z', 'ec', 'enedc']

modifications = [pl.col(i).fill_null(value=0) for i in train.columns if i in columns_to_impute_with_zeros]

train = train.with_columns(*modifications)
val = val.with_columns(*modifications)
test = test.with_columns(*modifications)
submission_df = submission_df.with_columns(*modifications)

In [23]:
columns_to_impute_by_global_mean = ['at2', 'at1', 'w', 'mt', 'date_of_registration']

imputers = {i: train.select(pl.col(i).mean())[i][0] for i in train.columns if i in columns_to_impute_by_global_mean}

modifications = [pl.col(i).fill_null(value=imputers[i]) for i in imputers]

# train = train.with_columns(*modifications)
# val = val.with_columns(*modifications)
# test = test.with_columns(*modifications)
# submission_df = submission_df.with_columns(*modifications)

In [24]:
imputers_cn = train.group_by(['va', 've']).agg(pl.col('cn').mode().first()).with_columns(pl.concat_str([pl.col('va'), pl.col('ve')], separator=' ').alias('va_ve'))[['va_ve', 'cn']].to_dict(as_series=False)
imputers_cn = {i: j for (i, j) in zip(imputers_cn['va_ve'], imputers_cn['cn'])}

def impute_cn(df, imputers):
    df = df.with_columns(pl.concat_str([pl.col('va'), pl.col('ve')], separator=' ').alias('va_ve'))
    return df.with_columns(pl.col('cn').fill_null(pl.col('va_ve').map_dict(imputers))).drop('va_ve')

train = impute_cn(train, imputers_cn)
val = impute_cn(val, imputers_cn)
test = impute_cn(test, imputers_cn)
submission_df = impute_cn(submission_df, imputers_cn)

  return df.with_columns(pl.col('cn').fill_null(pl.col('va_ve').map_dict(imputers))).drop('va_ve')


In [25]:
imputers_ve = train.group_by(['cn', 'va']).agg(pl.col('ve').mode().first()).with_columns(pl.concat_str([pl.col('cn'), pl.col('va')], separator=' ').alias('cn_va'))[['cn_va', 've']].to_dict(as_series=False)
imputers_ve = {i: j for (i, j) in zip(imputers_ve['cn_va'], imputers_ve['ve'])}

def impute_ve(df, imputers):
    df = df.with_columns(pl.concat_str([pl.col('cn'), pl.col('va')], separator=' ').alias('cn_va'))
    return df.with_columns(pl.col('ve').fill_null(pl.col('cn_va').map_dict(imputers))).drop('cn_va')

train = impute_ve(train, imputers_ve)
val = impute_ve(val, imputers_ve)
test = impute_ve(test, imputers_ve)
submission_df = impute_ve(submission_df, imputers_ve)

  return df.with_columns(pl.col('ve').fill_null(pl.col('cn_va').map_dict(imputers))).drop('cn_va')


In [26]:
imputers_va = train.group_by(['cn', 've']).agg(pl.col('va').mode().first()).with_columns(pl.concat_str([pl.col('cn'), pl.col('ve')], separator=' ').alias('cn_ve'))[['cn_ve', 'va']].to_dict(as_series=False)
imputers_va = {i: j for (i, j) in zip(imputers_va['cn_ve'], imputers_va['va'])}

def impute_va(df, imputers):
    df = df.with_columns(pl.concat_str([pl.col('cn'), pl.col('ve')], separator=' ').alias('cn_ve'))
    return df.with_columns(pl.col('va').fill_null(pl.col('cn_ve').map_dict(imputers))).drop('cn_ve')

train = impute_va(train, imputers_va)
val = impute_va(val, imputers_va)
test = impute_va(test, imputers_va)
submission_df = impute_va(submission_df, imputers_va)

  return df.with_columns(pl.col('va').fill_null(pl.col('cn_ve').map_dict(imputers))).drop('cn_ve')


In [27]:
columns_to_impute_by_mode = ['it', 'vfn', 'mp', 'tan', 'ct', 't']
columns_to_use_for_grouping = ['va', 've', 'cn']

imputers = {i: train.group_by(columns_to_use_for_grouping).agg(pl.col(i).mode().first()).with_columns(pl.concat_str(columns_to_use_for_grouping, separator=' ').alias('grouped_by'))[['grouped_by', i]].to_dict(as_series=False) for i in columns_to_impute_by_mode}
imputers = {i: {j: k for (j, k) in zip(imputers[i]['grouped_by'], imputers[i][i])} for i in imputers}

modifications = [pl.col(i).fill_null(pl.col('grouped_by').map_dict(imputers[i])) for i in imputers]

def impute_mod(df, modifications):
    df = df.with_columns(pl.concat_str(columns_to_use_for_grouping, separator=' ').alias('grouped_by'))
    return df.with_columns(*modifications).drop('grouped_by')

train = impute_mod(train, modifications)
val = impute_mod(val, modifications)
test = impute_mod(test, modifications)
submission_df = impute_mod(submission_df, modifications)

  modifications = [pl.col(i).fill_null(pl.col('grouped_by').map_dict(imputers[i])) for i in imputers]


In [28]:
train = train.with_columns(pl.col('it').fill_null(value=""))
val = val.with_columns(pl.col('it').fill_null(value=""))
test = test.with_columns(pl.col('it').fill_null(value=""))
submission_df = submission_df.with_columns(pl.col('it').fill_null(value=""))

In [29]:
columns_to_impute_by_global_mode = ['va', 've', 'mp']

imputers = {i: train.select(pl.col(i).mode().first())[i][0] for i in train.columns if i in columns_to_impute_by_global_mode}

modifications = [pl.col(i).fill_null(value=imputers[i]) for i in imputers]

# train = train.with_columns(*modifications)
# val = val.with_columns(*modifications)
# test = test.with_columns(*modifications)
# submission_df = submission_df.with_columns(*modifications)

In [30]:
# Impute by mean grouped by column
columns_to_impute_by_mean = ['at2', 'at1', 'date_of_registration', 'ep', 'w', 'm', 'mt']
column_to_use_for_grouping = ['cn', 'va', 've']

imputers = {i: train.group_by(column_to_use_for_grouping).agg(pl.col(i).mean()).with_columns(pl.concat_str(columns_to_use_for_grouping, separator=' ').alias('grouped_by'))[['grouped_by', i]].to_dict(as_series=False) for i in train.columns if i in columns_to_impute_by_mean}
imputers = {i: {j: k for (j, k) in zip(imputers[i]['grouped_by'], imputers[i][i])} for i in imputers}

modifications = [pl.col(i).fill_null(pl.col('grouped_by').map_dict(imputers[i])) for i in imputers]

def impute_by_group(df, modifications):
    df = df.with_columns(pl.concat_str(columns_to_use_for_grouping, separator=' ').alias('grouped_by'))
    return df.with_columns(*modifications).drop('grouped_by')

train = impute_by_group(train, modifications)
val = impute_by_group(val, modifications)
test = impute_by_group(test, modifications)
submission_df = impute_by_group(submission_df, modifications)

  modifications = [pl.col(i).fill_null(pl.col('grouped_by').map_dict(imputers[i])) for i in imputers]


In [31]:
# Impute by mean grouped by column
columns_to_impute_by_mean = ['at2', 'at1', 'date_of_registration', 'ep', 'w', 'm', 'mt']
column_to_use_for_grouping = 'tan'

imputers = {i: train.group_by(column_to_use_for_grouping).agg(pl.col(i).mean()).to_dict(as_series=False) for i in train.columns if i in columns_to_impute_by_mean}
imputers = {i: {j: k for (j, k) in zip(imputers[i]['tan'], imputers[i][i])} for i in imputers}

modifications = [pl.col(i).fill_null(pl.col(column_to_use_for_grouping).map_dict(imputers[i])) for i in imputers]

train = train.with_columns(*modifications)
val = val.with_columns(*modifications)
test = test.with_columns(*modifications)
submission_df = submission_df.with_columns(*modifications)

  modifications = [pl.col(i).fill_null(pl.col(column_to_use_for_grouping).map_dict(imputers[i])) for i in imputers]


In [32]:
# Impute by mean grouped by column
columns_to_impute_by_mean = ['at2', 'at1', 'date_of_registration', 'ep', 'w', 'm', 'mt']
column_to_use_for_grouping = 'cn'

imputers = {i: train.group_by(column_to_use_for_grouping).agg(pl.col(i).mean()).to_dict(as_series=False) for i in train.columns if i in columns_to_impute_by_mean}
imputers = {i: {j: k for (j, k) in zip(imputers[i]['cn'], imputers[i][i])} for i in imputers}

modifications = [pl.col(i).fill_null(pl.col(column_to_use_for_grouping).map_dict(imputers[i])) for i in imputers]

train = train.with_columns(*modifications)
val = val.with_columns(*modifications)
test = test.with_columns(*modifications)
submission_df = submission_df.with_columns(*modifications)

  modifications = [pl.col(i).fill_null(pl.col(column_to_use_for_grouping).map_dict(imputers[i])) for i in imputers]


In [33]:
columns_to_impute_by_global_mean = ['at2', 'at1', 'w', 'mt', 'date_of_registration']

imputers = {i: train.select(pl.col(i).mean())[i][0] for i in train.columns if i in columns_to_impute_by_global_mean}

modifications = [pl.col(i).fill_null(value=imputers[i]) for i in imputers]

train = train.with_columns(*modifications)
val = val.with_columns(*modifications)
test = test.with_columns(*modifications)
submission_df = submission_df.with_columns(*modifications)

In [34]:
# Show missing values
print(get_missing(train))
print(get_missing(val))
print(get_missing(test))
print(get_missing(submission_df))

shape: (8, 2)
┌──────────┬─────────┐
│ variable ┆ missing │
│ ---      ┆ ---     │
│ str      ┆ f64     │
╞══════════╪═════════╡
│ mp       ┆ 5.56    │
│ vfn      ┆ 3.1     │
│ ve       ┆ 0.15    │
│ tan      ┆ 0.14    │
│ va       ┆ 0.14    │
│ cn       ┆ 0.07    │
│ ct       ┆ 0.04    │
│ t        ┆ 0.02    │
└──────────┴─────────┘
shape: (5, 2)
┌──────────┬─────────┐
│ variable ┆ missing │
│ ---      ┆ ---     │
│ str      ┆ f64     │
╞══════════╪═════════╡
│ mp       ┆ 0.29    │
│ vfn      ┆ 0.16    │
│ tan      ┆ 0.01    │
│ va       ┆ 0.01    │
│ ve       ┆ 0.01    │
└──────────┴─────────┘
shape: (5, 2)
┌──────────┬─────────┐
│ variable ┆ missing │
│ ---      ┆ ---     │
│ str      ┆ f64     │
╞══════════╪═════════╡
│ mp       ┆ 0.31    │
│ vfn      ┆ 0.17    │
│ tan      ┆ 0.01    │
│ va       ┆ 0.01    │
│ ve       ┆ 0.01    │
└──────────┴─────────┘
shape: (7, 2)
┌──────────┬─────────┐
│ variable ┆ missing │
│ ---      ┆ ---     │
│ str      ┆ f64     │
╞══════════╪═════════╡
│

In [35]:
# Encoding
whole_dataset = pl.concat([train[categorical_columns], val[categorical_columns], test[categorical_columns], submission_df[categorical_columns]], how='vertical')

for colname in categorical_columns:
    lbl = LabelEncoder()
    lbl.fit(whole_dataset[colname])
    
    print(f'Encoding column {colname} with {len(lbl.classes_)} classes')
    train = train.with_columns(pl.Series(lbl.transform(train[colname])).alias(colname))
    val = val.with_columns(pl.Series(lbl.transform(val[colname])).alias(colname))
    test = test.with_columns(pl.Series(lbl.transform(test[colname])).alias(colname))
    submission_df = submission_df.with_columns(pl.Series(lbl.transform(submission_df[colname])).alias(colname))

Encoding column country with 29 classes
Encoding column vfn with 9025 classes
Encoding column mp with 11 classes
Encoding column mh with 96 classes
Encoding column man with 105 classes
Encoding column tan with 4500 classes
Encoding column t with 1599 classes
Encoding column va with 5704 classes
Encoding column ve with 26766 classes
Encoding column mk with 719 classes
Encoding column cn with 7916 classes
Encoding column ct with 6 classes
Encoding column cr with 3 classes
Encoding column ft with 11 classes
Encoding column fm with 7 classes
Encoding column it with 318 classes


<span id="binning"></span>

### 2.4. Outliers binning

In [36]:
clf = IsolationForest(n_estimators=1000, contamination=0.01)
# m and ep contains NaN
# train = train.drop_nulls()
# clf.fit(train)

In [37]:
# train = train.with_columns(
#     pl.Series(clf.decision_function(train)).alias('scores'),
#     pl.Series(clf.predict(train)).alias('anomaly'),
# )

In [38]:
for i in range(len(train.columns) // 10 + 1):
    print(train[train.columns[i*10:(i+1)*10]])

shape: (6_833_412, 10)
┌─────────┬──────┬─────┬─────┬─────┬──────┬──────┬──────┬───────┬─────┐
│ country ┆ vfn  ┆ mp  ┆ mh  ┆ man ┆ tan  ┆ t    ┆ va   ┆ ve    ┆ mk  │
│ ---     ┆ ---  ┆ --- ┆ --- ┆ --- ┆ ---  ┆ ---  ┆ ---  ┆ ---   ┆ --- │
│ i64     ┆ i64  ┆ i64 ┆ i64 ┆ i64 ┆ i64  ┆ i64  ┆ i64  ┆ i64   ┆ i64 │
╞═════════╪══════╪═════╪═════╪═════╪══════╪══════╪══════╪═══════╪═════╡
│ 27      ┆ 3598 ┆ 8   ┆ 44  ┆ 45  ┆ 3460 ┆ 953  ┆ 4347 ┆ 2443  ┆ 391 │
│ 16      ┆ 5603 ┆ 6   ┆ 77  ┆ 86  ┆ 2637 ┆ 1163 ┆ 1874 ┆ 3352  ┆ 535 │
│ 10      ┆ 1148 ┆ 10  ┆ 91  ┆ 99  ┆ 3859 ┆ 1461 ┆ 4165 ┆ 17797 ┆ 616 │
│ 5       ┆ 3121 ┆ 5   ┆ 62  ┆ 71  ┆ 1016 ┆ 661  ┆ 458  ┆ 14089 ┆ 438 │
│ 16      ┆ 4103 ┆ 9   ┆ 75  ┆ 27  ┆ 1725 ┆ 269  ┆ 4980 ┆ 826   ┆ 516 │
│ 10      ┆ 5783 ┆ 6   ┆ 77  ┆ 86  ┆ 2300 ┆ 1235 ┆ 2490 ┆ 16378 ┆ 114 │
│ 6       ┆ 4450 ┆ 1   ┆ 33  ┆ 35  ┆ 2110 ┆ 963  ┆ 2090 ┆ 1541  ┆ 176 │
│ 5       ┆ 6377 ┆ 9   ┆ 7   ┆ 7   ┆ 836  ┆ 657  ┆ 3871 ┆ 22875 ┆ 27  │
│ 8       ┆ 6888 ┆ 9   ┆ 7   ┆ 7   ┆ 968 

In [39]:
# train = train.filter(pl.col('anomaly') == 1).drop(['scores', 'anomaly'])
# train

<span id="utilities"></span>

## 3. Utility function

In [40]:
import xgboost as xgb
from optuna import create_study, logging
from optuna.pruners import MedianPruner
from optuna.integration import XGBoostPruningCallback
from sklearn.metrics import mean_absolute_error

In [41]:
feature_names = [i for i in train.columns if i != 'ewltp']
target = 'ewltp'

X_tr = train[feature_names]
y_tr = train[target]
X_val = val[feature_names]
y_val = val[target]
X_test = test[feature_names]
y_test = test[target]

In [42]:
def objective(trial, group, params=dict()):
    ## Initial learning parameters
    params['learning_rate'] = 0.1
    params['n_estimators'] = 1000
    
    if group == '1':
        params['max_depth'] = trial.suggest_int('max_depth', 6, 17)
        params['min_child_weight'] = trial.suggest_float('min_child_weight', 1e-10, 1e10, log=True)

    if group == '2':
        params['subsample'] = trial.suggest_float('subsample', 0, 1)
        params['colsample_bytree'] = trial.suggest_float('colsample_bytree', 0, 1)
        params['colsample_bylevel'] = trial.suggest_float('colsample_bylevel', 0, 1)
        params['colsample_bynode'] = trial.suggest_float('colsample_bynode', 0, 1)
        # params['gamma'] = trial.suggest_float('gamma', 0, 0.5)

    if group == '3':
        params['reg_alpha'] = trial.suggest_categorical('reg_alpha', [0, 1e-5, 1e-2, 0.1, 1, 100])
        params['reg_lambda'] = trial.suggest_float('reg_lambda', 1, 10)

    if group == '4':
        params['learning_rate'] = trial.suggest_float('learning_rate', 0, 0.3)
        params['n_estimators'] = trial.suggest_int('n_estimators', 100, 1500)

    # Maybe add tree_method='hist'
    reg = xgb.XGBRegressor(**params, eval_metric='mae', device='cuda', early_stopping_rounds=10)
    reg.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
    
    y_preds = reg.predict(X_test)
    score = mean_absolute_error(y_test, y_preds)
    print(f'Attendu lors du submit: {score}')
    
    return score

def execute_optimization(study_name, group, trials, params=dict(), direction='minimize'):
    logging.set_verbosity(logging.ERROR)
    
    pruner = MedianPruner(n_warmup_steps=5)
    
    study = create_study(
        direction=direction,
        study_name=study_name,
        # storage='sqlite:///optuna.db',
        # load_if_exists=True,
        pruner=pruner,
    )
    
    study.optimize(lambda trial: objective(trial, group, params), n_trials=trials, n_jobs=-1)
    
    print("STUDY NAME: ", study_name)
    print('------------------------------------------------')
    print("BEST MAE SCORE", study.best_value)
    print('------------------------------------------------')
    print(f"OPTIMAL GROUP - {group} PARAMS: ", study.best_params)
    print('------------------------------------------------')
    print("BEST TRIAL", study.best_trial)
    print('------------------------------------------------')
    
    return study.best_params

def stepwise_optimization(trials=10):
    final_params = dict()
    for g in ['1', '2', '3', '4']:
        print(f"=========================== Optimizing Group - {g} ============================")
        update_params = execute_optimization('xgboost', g, trials, params=final_params, direction='minimize')
        final_params.update(update_params)
        print(f"PARAMS after optimizing GROUP - {g}: ", final_params)
        print()
        print()
        
    print("=========================== FINAL OPTIMAL PARAMETERS ============================")
    print(final_params)
    
    return final_params

<span id="stepwise"></span>

## 4. Stepwize Hyperparameter tuning

In [None]:
params = stepwise_optimization(20)

Attendu lors du submit: 3.5474746118264235
Attendu lors du submit: 3.0981615931081006
Attendu lors du submit: 20.840586891244726
Attendu lors du submit: 3.0981615931081006
Attendu lors du submit: 2.9145036621649485
Attendu lors du submit: 2.8527994258629334
Attendu lors du submit: 2.8600724307408005
Attendu lors du submit: 3.567921994616316
Attendu lors du submit: 2.853810908266448
Attendu lors du submit: 4.361952789614769
Attendu lors du submit: 2.849384276461304
Attendu lors du submit: 2.9346478804908998
Attendu lors du submit: 2.882331636126398
Attendu lors du submit: 2.853810908266448
Attendu lors du submit: 2.853810908266448
Attendu lors du submit: 138.48776693971016
Attendu lors du submit: 2.853810908266448
Attendu lors du submit: 2.849384276461304
Attendu lors du submit: 2.8527994258629334
STUDY NAME:  xgboost
------------------------------------------------
BEST MAE SCORE 2.849384276461304
------------------------------------------------
OPTIMAL GROUP - 1 PARAMS:  {'max_depth':

In [None]:
X_submission = submission_df[feature_names]

In [None]:
reg = xgb.XGBRegressor(**params, eval_metric='mae', tree_method='hist', device='cuda', early_stopping_rounds=10)
reg.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)

In [None]:
y_preds = reg.predict(X_test)
print(f'Attendu lors du submit: {mean_absolute_error(y_test, y_preds)}')

<span id="comparison"></span>

## 5. Model prediction

In [None]:
import pandas as pd

In [None]:
y_submission = reg.predict(X_submission)

In [None]:
submission_csv = pd.DataFrame({'Ewltp (g/km)': y_submission})
submission_csv.index += 8000000
submission_csv.to_csv('/kaggle/working/submission.csv', index_label='ID')