**The goal** of this competition is to predict which clients are more likely to default on their loans. The evaluation is based on gini stability metric and will favor solutions that are stable over time. A separate chalenge is to deal with large data sizes and constantly monitor and reduce memory usage to not exceed allocated amount of RAM.  

**Structure**

There are several tables classified by 'depth':
* depth=0 - These are static features directly tied to a specific case_id.
* depth=1 - Each case_id has an associated historical record, indexed by num_group1.
* depth=2 - Each case_id has an associated historical record, indexed by both num_group1 and num_group2.

Various predictors were transformed, therefore we have the following notation for similar groups of transformations

* P - Transform DPD (Days past due)
* M - Masking categories
* A - Transform amount
* D - Transform date
* T - Unspecified Transform
* L - Unspecified Transform

Transformations within a group are denoted by a capital letter at the end of the predictor name (e.g., maxdbddpdtollast6m_4187119P)

**Strategy**

Following approach for data processing will be used:
* depth=2 files will be aggregated, grouped by case_id and num_group1. For numerical columns we will calculate the average between num_group2 values and for categoricals we will get the most frequent value. Like this they will become as depth=1 files and the same processing function as for depth=1 files will be applied to them.
* for depth=1 files we will use an aggregation grouped by case_id. For numerical columns we will calculate the mean, std, min and max between num_group1 values. For categoricals we will get the most frequent value. After this the files will become as depth=0.
* at depth=0 level we will merge all the files together on case_id as all the case_id are unique at this level. 

After data analysis, time delta features will be created based on timeseries columns and all the preparations for modeling will be made using a processing pipeline. Competition's stability metric will be integrated during Optuna tuning process and also in the model training as evaluation metric.

In [1]:
pip install scikit-learn==1.4.2

Collecting scikit-learn==1.4.2
  Downloading scikit_learn-1.4.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Downloading scikit_learn-1.4.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.1 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.1/12.1 MB[0m [31m70.4 MB/s[0m eta [36m0:00:00[0m:00:01[0m:01[0m
[?25hInstalling collected packages: scikit-learn
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 1.2.2
    Uninstalling scikit-learn-1.2.2:
      Successfully uninstalled scikit-learn-1.2.2
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
spopt 0.6.0 requires shapely>=2.0.1, but you have shapely 1.8.5.post1 which is incompatible.[0m[31m
[0mSuccessfully installed scikit-learn-1.4.2
Note: you may need to restart the kernel to use updated packages.


In [1]:
import os, glob
import gc
import pandas as pd
import polars as pl
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import optuna
import joblib
import warnings
warnings.filterwarnings('ignore')

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.metrics import roc_auc_score
from sklearn.utils import resample, shuffle
from sklearn.decomposition import PCA
from sklearn.preprocessing import (
    OneHotEncoder, PowerTransformer, OrdinalEncoder
)
from sklearn.preprocessing import TargetEncoder
from sklearn.model_selection import (
    train_test_split, StratifiedGroupKFold
)
from lightgbm import (
    LGBMClassifier, early_stopping, log_evaluation, plot_importance,
)
from xgboost import XGBClassifier, DMatrix
from xgboost.callback import EarlyStopping
from catboost import CatBoostClassifier, Pool

pd.options.display.max_colwidth = None

#### Pandas functions

In [2]:
def downcast(df):
    """
    Reduce memory usage of a Pandas DataFrame by converting 
    object types to categories and downcasting numeric columns
    """
    # Column types
    object_cols, int_cols, float_cols = [], [], []
    for col, dtype in df.dtypes.items():
        if pd.api.types.is_object_dtype(dtype):
            object_cols.append(col)
        elif pd.api.types.is_integer_dtype(dtype):
            int_cols.append(col)
        elif pd.api.types.is_float_dtype(dtype):
            float_cols.append(col)
        
    # Convert object columns to category
    df[object_cols] = df[object_cols].astype('category')

    # Downcast integer columns
    df[int_cols] = df[int_cols].apply(pd.to_numeric, downcast='integer')
   
    # Downcast float columns
    df[float_cols] = df[float_cols].apply(pd.to_numeric, downcast='float')
        
    return df

def cols_types(df):
    """
    Create lists of feature names dtype
    """
    date_cols, num_cols, cat_cols = [], [], []
    for col, dtype in df.dtypes.items():
        if pd.api.types.is_bool_dtype(dtype):
            cat_cols.append(col)
        elif pd.api.types.is_datetime64_dtype(dtype):
            date_cols.append(col)
        elif pd.api.types.is_numeric_dtype(dtype):
            num_cols.append(col)
        else:
            cat_cols.append(col)
            
    return date_cols, num_cols, cat_cols

#### Polars functions to read and preprocess data
Many aggregation functions are commented out, but available for any experiments

In [3]:
def pl_cols_types(df):
    """
    (Polars version)
    Create lists of feature names dtype
    """
    date_cols, num_cols, cat_cols = [], [], []
    
    num_cols = df.select(pl.col(pl.NUMERIC_DTYPES)).columns
    date_cols = df.select(pl.col(pl.Date)).columns
    cat_cols = [col for col in df.columns 
                if col not in num_cols and col not in date_cols]
            
    return date_cols, num_cols, cat_cols

def aggregate_depth1(df):
    """
    (Polars version)
    Aggregate depth=1 dataframe and return a depth=0 dataframe 
    """
    # Drop 'num_group1' column
    df = df.drop('num_group1')

    # Create aggregation dataframe and count repetitive case_id
    col_to_count = df.columns[1]
    all_agg = df.groupby('case_id').agg(
        pl.col(col_to_count).count().alias(f'{col_to_count}_count')
    )
    # Columns types 
    date_cols, num_cols, cat_cols = pl_cols_types(df)
    num_cols.remove('case_id')

    # Aggregate categorical columns
    if len(cat_cols) > 0:
        for col in cat_cols:
            if not df[col].is_null().all():
                cat_agg = df.groupby('case_id').agg(
                    pl.col(col).drop_nulls().mode().first().alias(f'{col}_mode'),
#                     pl.n_unique(col).alias(f'{col}_n_unique'),
#                     pl.col(col).first().alias(f'{col}_first'),
#                     pl.col(col).drop_nulls().last().alias(f'{col}_last'),
                )
                # Drop aggregated column to free memory
                df = df.drop(col)

                # Merge with aggregated dataframe
                all_agg = all_agg.join(cat_agg, on='case_id', how='left')

                # Free memory
                del cat_agg

    # Aggregate date columns
    if len(date_cols) > 0:
        for col in date_cols:
            date_agg = df.groupby('case_id').agg(
                pl.mean(col).alias(f'{col}_mean'), 
#                 pl.col(col).first().alias(f'{col}_first'),
#                 pl.col(col).drop_nulls().last().alias(f'{col}_last'),
            )
            # Drop aggregated column to free memory
            df = df.drop(col)

            # Merge with aggregated dataframe
            all_agg = all_agg.join(date_agg, on='case_id', how='left')

            # Free memory
            del date_agg

    # Aggregate numeric columns 
    if len(num_cols) > 0:
        for col in num_cols:
            num_agg = df.groupby('case_id').agg(
                pl.mean(col).alias(f'{col}_mean'), 
#                 pl.median(col).alias(f'{col}_median'), 
#                 pl.min(col).alias(f'{col}_min'), 
#                 pl.max(col).alias(f'{col}_max'), 
#                 pl.col(col).first().alias(f'{col}_first'),
#                 pl.col(col).drop_nulls().last().alias(f'{col}_last'),
            )
            # Drop aggregated column to free memory
            df = df.drop(col)

            # Merge with aggregated dataframe
            all_agg = all_agg.join(num_agg, on='case_id', how='left')

            # Free memory
            del num_agg
  
    print('Depth1 aggregation finished')    
    return all_agg

def aggregate_depth2(df):
    """
    (Polars version)
    Aggregate depth=2 dataframe to level depth=1 and then apply 
    aggregate_depth1 function to return a depth=0 dataframe
    """
    df = df.drop('num_group2')
    
    # Columns types
    groupby_cols = ['case_id', 'num_group1']
    date_cols, num_cols, cat_cols = pl_cols_types(df)
    num_cols = [col for col in num_cols if col not in groupby_cols]
    
    # Create aggregation dataframe
    all_agg = df[groupby_cols]
    all_agg = all_agg.unique(groupby_cols, maintain_order=True)
    
    # Aggregate categoricals 
    if len(cat_cols) > 0:
        for col in cat_cols:
            if not df[col].is_null().all():
                cat_agg = df.groupby('case_id').agg(
                    pl.col(col).drop_nulls().mode().first().alias(f'{col}_mode'),
#                     pl.col(col).drop_nulls().first().alias(f'{col}_first'),
                )
                # Drop aggregated column to free memory
                df = df.drop(col)

                # Merge with aggregated dataframe
                all_agg = all_agg.join(cat_agg, on='case_id', how='left')

                # Free memory
                del cat_agg
    
    # Aggregate date columns
    if len(date_cols) > 0:
        for col in date_cols:
            date_agg = df.groupby('case_id').agg(
                pl.mean(col).alias(f'{col}_mean'), 
#                 pl.col(col).drop_nulls().first().alias(f'{col}_first'),
            )
            # Drop aggregated column to free memory
            df = df.drop(col)

            # Merge with aggregated dataframe
            all_agg = all_agg.join(date_agg, on='case_id', how='left')

            # Free memory
            del date_agg
    
    # Aggregate numeric columns if any
    if len(num_cols) > 0:
        for col in num_cols:
            num_agg = df.groupby('case_id').agg(
                pl.mean(col).alias(f'{col}_mean'), 
#                 pl.median(col).alias(f'{col}_median'),
#                 pl.col(col).drop_nulls().first().alias(f'{col}_first'),
            )
            # Drop aggregated column to free memory
            df = df.drop(col)

            # Merge with aggregated dataframe
            all_agg = all_agg.join(num_agg, on='case_id', how='left')

            # Free memory
            del num_agg
 
    del df
    print('Depth2 aggregation finished') 
    return aggregate_depth1(all_agg)

def create_df_from(path, file_name, depth=0):
    """
    (Polars version)
    Preprocess files in chunks 
    """
    dfs = []
    for i, file_path in enumerate(
        glob.glob(path + '*' + file_name + '*.parquet')
    ):
        df = pl.read_parquet(file_path)

        for col in df.columns:
            if (col[-1] == 'D') or (col == 'date_decision'):
                df = df.with_columns(pl.col(col).cast(pl.Date))
            elif col in ['case_id', 'WEEK_NUM', 'num_group1', 'num_group2']:
                df = df.with_columns(pl.col(col).cast(pl.Int32))
            elif 'person' in col:
                df = df.with_columns(
                    pl.col(col).cast(pl.String).cast(pl.Categorical))
            elif 'month' in col and 'T' in col:
                df = df.with_columns(
                    pl.col(col).cast(pl.String).cast(pl.Categorical))
        
        if depth == 2:
            df = aggregate_depth2(df)
            
        elif depth == 1:
            df = aggregate_depth1(df) 

        dfs.append(df)
        print(f'Chunk {i} added to list')
    
    return pl.concat(dfs, how='diagonal_relaxed')

def read_prepare_all(path, files_dict):
    """
    (Polars version)
    Read, preprocess and merge all the files together
    Return a pandas dataframe
    """
    # Read base data frame
    df_all = create_df_from(path, 'base')
    print(f'base created')

    # Read and aggregate 
    for depth, files_list in files_dict.items():
        for file in files_list:
            # Create dataframe from file chunks
            print(f'### Start read {file}')
            df = create_df_from(path, file, depth)
            
            # Join with the main dataframe
            df_all = df_all.join(df, how='left', on='case_id')
            print(f'=== {file} merged to df_all')
            
            # Convert to Categorical to free memory
            df_all = df_all.with_columns(
                pl.col(pl.String).cast(pl.Categorical))
            df_all = df_all.with_columns(
                pl.col(pl.Float64).cast(pl.Float32))
        
    # Free memory
    del df
    
    # Columns types
    date_cols, num_cols, cat_cols = pl_cols_types(df_all)
    
    # Convert to pandas in chunks to not explode memory use
    df_pd = df_all.select(pl.col(num_cols)).to_pandas()
    df_all = df_all.drop(num_cols)
    df_pd = df_pd.join(df_all.select(pl.col(date_cols)).to_pandas())
    df_all = df_all.drop(date_cols)
    df_pd = df_pd.join(df_all.select(pl.col(cat_cols)).to_pandas())
    del df_all
    print('df converted to pandas')
    
    # Create time features
    df_pd['birth_year'] = df_pd.birth_259D_mean.dt.year
    df_pd['decision_year'] = df_pd.date_decision.dt.year
    df_pd['decision_quarter'] = (
        df_pd.date_decision.dt.quarter.astype(str).astype('category'))
    df_pd['decision_month_of_year'] = (
        df_pd.date_decision.dt.month.astype(str).astype('category'))
    df_pd['decision_day_of_month'] = df_pd.date_decision.dt.day
    df_pd['decision_day_of_year'] = df_pd.date_decision.dt.dayofyear
    df_pd['decision_week_of_year'] = df_pd.date_decision.dt.isocalendar().week
    df_pd['decision_day_of_week'] = (
        (df_pd.date_decision.dt.dayofweek + 1).astype(str).astype('category'))
    
    return downcast(df_pd)

#### Read and preprocess data

In [4]:
# Filepaths
main_path = '/kaggle/input/home-credit-credit-risk-model-stability/'
train_path = main_path + 'parquet_files/train/'
test_path = main_path + 'parquet_files/test/'

# Read info files
feat_def = pd.read_csv(main_path + 'feature_definitions.csv')
submit = pd.read_csv(main_path + 'sample_submission.csv')

# Lists of file names
files_dict = {
    0: ['static_0', 'static_cb_0'],
    1: ['credit_bureau_a_1', 'credit_bureau_b_1', 'applprev_1', 
        'debitcard_1', 'deposit_1', 'other_1', 'person_1', 
        'tax_registry_a_1', 'tax_registry_b_1', 'tax_registry_c_1'],
    2: ['credit_bureau_a_2', 'credit_bureau_b_2', 'applprev_2', 'person_2']
}

In [None]:
%%time

# Process / restore point
process = True

if process:
    # Read, preprocess and merge all the training files together
    X = read_prepare_all(train_path, files_dict)
    
    # Backup processed data
    X.to_parquet('X.parquet')
    
else:
    # Restore processed data from backup
    X = pd.read_parquet('/kaggle/input/creditrisk-data/X.parquet') 

X.info()

Chunk 0 added to list
base created
### Start read static_0
Chunk 0 added to list
Chunk 1 added to list
=== static_0 merged to df_all
### Start read static_cb_0
Chunk 0 added to list
=== static_cb_0 merged to df_all
### Start read credit_bureau_a_1
Depth1 aggregation finished
Chunk 0 added to list
Depth1 aggregation finished
Chunk 1 added to list
Depth1 aggregation finished
Chunk 2 added to list
Depth1 aggregation finished
Chunk 3 added to list
=== credit_bureau_a_1 merged to df_all
### Start read credit_bureau_b_1
Depth1 aggregation finished
Chunk 0 added to list
=== credit_bureau_b_1 merged to df_all
### Start read applprev_1
Depth1 aggregation finished
Chunk 0 added to list
Depth1 aggregation finished
Chunk 1 added to list
=== applprev_1 merged to df_all
### Start read debitcard_1
Depth1 aggregation finished
Chunk 0 added to list
=== debitcard_1 merged to df_all
### Start read deposit_1
Depth1 aggregation finished
Chunk 0 added to list
=== deposit_1 merged to df_all
### Start read ot

#### Analyze data

In [None]:
# Function for date features
def make_time_features(df, date_col):
    plot_data = df[['target', 'WEEK_NUM']]
    plot_data['day_of_year'] = df[date_col].dt.dayofyear.astype('int16')
    plot_data['month_of_year'] = df[date_col].dt.month.astype('int8')
    plot_data['day_of_month'] = df[date_col].dt.day.astype('int8')
    plot_data['day_of_week'] = (df[date_col].dt.dayofweek + 1).astype('int8')
    return plot_data

# Create time related features for analysis
plot_data = make_time_features(X, 'date_decision')

In [None]:
# Target distribution
plt.figure(figsize=(5, 1))
plt.title('Target distribution')
sns.countplot(data=plot_data, y='target')
plt.show()
print(f'Targets == 1 are: {plot_data.target.mean().round(2) * 100}% from train data')

* Target is unbalanced (97/3)

In [None]:
# Target evolution in time
plt.figure(figsize=(12, 3))
plt.title('Target evolution in time')
g = sns.lineplot(data=plot_data, x='WEEK_NUM', y='target')
g.set(xticks=np.arange(0, 93, 4))
plt.grid() 
plt.show()

* A big drop of total defaults was registered between weeks 62 - 65 

In [None]:
# Target distribution by month_of_year
plt.figure(figsize=(8, 3))
plt.title('Target distribution by month of the year')
g = sns.lineplot(data=plot_data, x='month_of_year', y='target')
g.set(xticks=np.arange(1, 13, 1))
plt.grid()
plt.show()

In [None]:
# Target distribution by day_of_year
plt.figure(figsize=(12, 3))
plt.title('Target distribution by day of the year')
g = sns.lineplot(data=plot_data, x='day_of_year', y='target')
g.set(xticks=np.arange(1, 366, 30))
plt.grid()
plt.show()

* Strange behaviour on day 335, the only day without any target == 1

In [None]:
# Target distribution by day_of_month
plt.figure(figsize=(12, 3))
plt.title('Target distribution by day of month')
g = sns.lineplot(data=plot_data, x='day_of_month', y='target')
g.set(xticks=np.arange(1, 32, 1))
plt.grid()
plt.show()

In [None]:
# Target distribution by day_of_week
plt.figure(figsize=(8, 3))
plt.title('Target distribution by day of week')
g = sns.lineplot(data=plot_data, x='day_of_week', y='target')
g.set(xticks=np.arange(1, 8, 1))
plt.grid()
plt.show()

In [None]:
# Free memory
del plot_data

In [None]:
# Check the weeks with the highest missing values ratio
df_missing = X.groupby('WEEK_NUM').case_id.count().reset_index()
df_missing = df_missing.rename(columns={'case_id': 'Count'})

for week in X.WEEK_NUM.unique():
    missing = X[X.WEEK_NUM.eq(week)].isna().sum().sum()
    size = X[X.WEEK_NUM.eq(week)].size
    df_missing.loc[df_missing.WEEK_NUM.eq(week), 'Missing_ratio'] = missing / size

df_missing.sort_values('Missing_ratio', ascending=False)

* week 0 has very high missing ratio comparing to other weeks

In [None]:
# Check which weeks has the most maximum values by columns
max_df = X.select_dtypes(np.number).groupby('WEEK_NUM').max()
max_week_list = []
for col in max_df.columns:
    max_week = max_df.sort_values(col, ascending=False).iloc[0].name
    max_week_list.append(max_week)
max_week_df = pd.DataFrame(max_week_list)
max_week_df.value_counts()

* weeks 0 and 91 have very much maximum values comparing to other weeks

#### Define pipeline transformers

In [7]:
class TimeFeatTransformer(BaseEstimator, TransformerMixin):
    """
    Transformer to create time related features
    """
    def fit(self, df, y=None):
        self.ref_cols = ['birth_259D_mean', 'date_decision']
        self.original_cols = [col for col in df.columns 
                              if col not in self.ref_cols]
        return self
    
    def transform(self, df):
        # Create time delta features
        for col in self.original_cols:
            delta_col_0 = f'delta_{col}_{self.ref_cols[0]}'
            df[delta_col_0] = abs(df[col] - df[self.ref_cols[0]]).dt.days

            delta_col_1 = f'delta_{col}_{self.ref_cols[1]}'
            df[delta_col_1] = abs(df[col] - df[self.ref_cols[1]]).dt.days
            
        delta_col_0_1 = f'delta_{self.ref_cols[1]}_{self.ref_cols[0]}'      
        df[delta_col_0_1] = abs(df[self.ref_cols[1]] - df[self.ref_cols[0]]).dt.days
        
        # Drop used cols 
        df = df.drop(self.ref_cols + self.original_cols, axis=1)
        
        self.delta_cols = df.columns.to_list()
        
        return df
   
    def get_feature_names_out(self, input_features=None):
        return self.delta_cols

class NumFeatTransformer(BaseEstimator, TransformerMixin):
    """
    Transformer to create numeric related features
    """
    def fit(self, df, y=None):
        self.ref_cols = ['birth_year', 'decision_year']
        self.year_cols = []
        for col in df.columns:
            if 'year' in col and 'T' in col:
                self.year_cols.append(col)
        return self
    
    def transform(self, df):
        # Create year delta features
        for col in self.year_cols:
            delta_col_0 = f'delta_{col}_{self.ref_cols[0]}'
            df[delta_col_0] = abs(df[col] - df[self.ref_cols[0]])

            delta_col_1 = f'delta_{col}_{self.ref_cols[1]}'
            df[delta_col_1] = abs(df[col] - df[self.ref_cols[1]])

        delta_col_0_1 = f'delta_{self.ref_cols[1]}_{self.ref_cols[0]}'   
        df[delta_col_0_1] = abs(df[self.ref_cols[1]] - df[self.ref_cols[0]])
        
        df = df.drop(self.ref_cols + self.year_cols, axis=1)
        
        self.all_cols = df.columns.to_list()

        return df.astype(float)
   
    def get_feature_names_out(self, input_features=None):
        return self.all_cols

class BadColsDropTransformer(BaseEstimator, TransformerMixin): 
    """
    Transformer to drop unuseful columns
    """
    def fit(self, df, y=None):
        # Columns with many missing values 
        self.missing_values = df.isna().mean().sort_values(ascending=False)
        self.cols_to_drop = set(
            self.missing_values[self.missing_values.gt(0.95)].index
        )
        # Columns with one higly dominant value
        for col in df.columns:
            if (df[col].value_counts(normalize=True) > 0.95).any():
                self.cols_to_drop.add(col)
                
        # Columns with identical values  
        for col1 in df.columns[:-1]:
            if col1 not in self.cols_to_drop:
                for col2 in df.columns[df.columns.get_loc(col1) + 1:]:
                    if df[col1].equals(df[col2]):
                        self.cols_to_drop.add(col2)
        return self
        
    def transform(self, df):
        return df.drop(list(self.cols_to_drop), axis=1)
    
    def get_feature_names_out(self, input_features=None):
        return [col for col in input_features 
                if col not in self.cols_to_drop]
    
class HighCorrDropTransformer(BaseEstimator, TransformerMixin):
    """
    Transformer to drop highly correlated numerical columns
    """
    def fit(self, df, y=None):
        self.corr_matrix = df.corr()
        self.cols_to_drop = set()
        for col1 in self.corr_matrix.columns:
            for col2 in self.corr_matrix.columns:
                if col1 != col2:
                    # Check for high correlation
                    if abs(self.corr_matrix.loc[col1, col2]) >= 0.90:
                        # Check which column has more missing values
                        if df[col1].isna().sum() > df[col2].isna().sum():
                            self.cols_to_drop.add(col1)
                        else:
                            self.cols_to_drop.add(col2) 
        return self
    
    def transform(self, df):
        return df.drop(list(self.cols_to_drop), axis=1)
    
    def get_feature_names_out(self, input_features=None):
        return [col for col in input_features if col not in self.cols_to_drop]
        
class LowFreqTransformer(BaseEstimator, TransformerMixin):
    """
    Transformer to process categorical, boolean and object columns
    Fill missing and convert infrequent values
    """
    def fit(self, df, y=None):
        self.original_cols = df.columns
        self.frequencies = {}
        self.threshold = {}
        for col in df.columns:
            self.frequencies[col] = df[col].value_counts(normalize=True, 
                                                         ascending=True)
            self.threshold[col] = self.frequencies[col][
                (self.frequencies[col].cumsum() > 0.05).idxmax()
                ]
        return self
    
    def transform(self, df):
        for col in self.original_cols:
            df[col] = df[col].astype(str)
            
            infrequent_mask = (df[col].isin(
                self.frequencies[col].index[
                    self.frequencies[col] < self.threshold[col]
                ]))
            # Convert low frequency categoricals to 'infrequent'
            df.loc[infrequent_mask, col] = 'infrequent'
        return df
    
    def get_feature_names_out(self, input_features=None):
        return input_features

#### Processing pipeline

In [None]:
# Drop outlier weeks after analysis
#X = X[~X.WEEK_NUM.eq(0)]

# Drop day 335 as an outlier, the only date without any target=1
#X = X[~X['date_decision'].dt.dayofyear.eq(335)]

# Separate target
y = X.pop('target')

# Keep a copy of WEEK_NUM for group splitting in modeling
week_num = X.WEEK_NUM

# Drop unuseful and duplicate columns
cols_to_drop = [
    'case_id', 'MONTH', 'WEEK_NUM', 'birthdate_574D', 'dateofbirth_337D',  
]
X = X.drop(cols_to_drop, axis=1)

# Keep the structure of X to match it later with X_test 
X_structure = X.drop(X.index)

X.info()

In [8]:
# Separate columns by type
date_cols, num_cols, cat_cols = cols_types(X)
cat_unique = X[cat_cols].nunique()
low_card_cols = list(cat_unique.index[cat_unique.le(12)])
med_card_cols = list(cat_unique.index[cat_unique.gt(12) & cat_unique.le(200)])
high_card_cols = list(cat_unique.index[cat_unique.gt(200)])

# Pipeline to process date columns
date_pipeline = make_pipeline(
    TimeFeatTransformer(),
    BadColsDropTransformer(),
    HighCorrDropTransformer(),
    PowerTransformer(copy=False),
    )
# Pipeline to process numerical columns
num_pipeline = make_pipeline(
    NumFeatTransformer(),
    BadColsDropTransformer(),
    HighCorrDropTransformer(),
    PowerTransformer(copy=False),
    )
# Pipeline to process low cardinality columns
low_card_pipeline = make_pipeline(
    BadColsDropTransformer(),
    LowFreqTransformer(),
    OneHotEncoder(
        dtype=np.int8, drop='if_binary', sparse_output=False,
        min_frequency=0.02, handle_unknown='infrequent_if_exist'),
    )
# Pipeline to process medium cardinality columns
med_card_pipeline = make_pipeline(
    BadColsDropTransformer(),
    LowFreqTransformer(),
    OrdinalEncoder(handle_unknown='use_encoded_value',
                   unknown_value=np.nan,
                   dtype=np.float32),
    )
# Pipeline to process high cardinality columns
high_card_pipeline = make_pipeline(
    BadColsDropTransformer(),
    LowFreqTransformer(),
    TargetEncoder(target_type='binary', smooth='auto', shuffle=True),
    PowerTransformer(copy=False),
    )
# Define column transformer
processor = make_column_transformer(
    (date_pipeline, date_cols),
    (num_pipeline, num_cols),
    (low_card_pipeline, low_card_cols),
    (med_card_pipeline, med_card_cols),
    (high_card_pipeline, high_card_cols),
    verbose=True,
    )
processor

NameError: name 'X' is not defined

In [None]:
X = pd.DataFrame(processor.fit_transform(X, y), 
                       columns=processor.get_feature_names_out(),
                       index=X.index)
X = downcast(X)
enc_med_card_cols = []
for col in med_card_cols:
    if 'pipeline-4__' + col in X.columns:
        enc_med_card_cols.append('pipeline-4__' + col)

X[enc_med_card_cols] = X[enc_med_card_cols].astype('str').astype('category')

In [None]:
X.to_parquet('/kaggle/working/train_v4.parquet')

# HYPERPARAM TUNING

In [4]:
import data_proc as dp
base, X, y = dp.load_data('data/train_v3_filled_woe.parquet')
X4 = pd.read_parquet('data/train_v4.parquet')
X = X.merge(X4, left_index=True, right_index=True, how='left').astype(float)

del X4
gc.collect()

0

In [5]:
%%time 
#Train / validation split
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.8, stratify=y)
del X, y

# Process data
"""X_train = pd.DataFrame(processor.fit_transform(X_train, y_train), 
                       columns=processor.get_feature_names_out(),
                       index=X_train.index)
X_val = pd.DataFrame(processor.transform(X_val), 
                     columns=processor.get_feature_names_out(),
                     index=X_val.index)   

# Free memory
X_train = downcast(X_train)
X_val = downcast(X_val)  
"""
date_cols, num_cols, cat_cols = cols_types(X_train)
cat_unique = X_train[cat_cols].nunique()
low_card_cols = list(cat_unique.index[cat_unique.le(12)])
med_card_cols = list(cat_unique.index[cat_unique.gt(12) & cat_unique.le(200)])
high_card_cols = list(cat_unique.index[cat_unique.gt(200)])
# Convert med_card_cols to category to pass them unencoded to the model 
enc_med_card_cols = []
for col in med_card_cols:
    if 'pipeline-4__' + col in X_train.columns:
        enc_med_card_cols.append('pipeline-4__' + col)

X_train[enc_med_card_cols] = X_train[enc_med_card_cols].astype('str').astype('category')
X_val[enc_med_card_cols] = X_val[enc_med_card_cols].astype('str').astype('category')

X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1221229 entries, 157793 to 658664
Columns: 969 entries, month_decision to pipeline-5__employername_160M_mode
dtypes: float64(969)
memory usage: 8.8 GB
CPU times: total: 13.6 s
Wall time: 13.6 s


#### Optuna tuning

In [6]:
week_num = base['WEEK_NUM']
class StabilityMetric:
    """
    Stability metric for model optimization during training
    """
    def __init__(self, X_val):
        self.X_val = X_val
        
    def lgbm_stability_metric(self, y_true, y_pred):
        gini_in_time = []
        weeks_to_score = week_num[self.X_val.index].reset_index(drop=True)
        
        for week in weeks_to_score.unique():
            week_idx = weeks_to_score.eq(week)
            gini = np.array(2 * roc_auc_score(y_true[week_idx], y_pred[week_idx]) - 1)
            gini_in_time.append(gini)

        w_fallingrate = 88.0
        w_resstd = -0.5
        x = np.arange(len(gini_in_time))
        y = np.array(gini_in_time)
        a, b = np.polyfit(x, y, 1)
        y_hat = a * x + b
        residuals = y - y_hat
        res_std = np.std(residuals)
        avg_gini = np.mean(y)
        stability_score = avg_gini + w_fallingrate * min(0, a) + w_resstd * res_std
        is_higher_better = True

        return 'stability_score', stability_score, is_higher_better

    def xgb_stability_metric(self, y_pred, y_true):
        if isinstance(y_true, DMatrix):
            y_true = y_true.get_label()
            
        gini_in_time = []
        weeks_to_score = week_num[self.X_val.index].reset_index(drop=True)
        
        for week in weeks_to_score.unique():
            week_idx = weeks_to_score.eq(week)
            gini = np.array(2 * roc_auc_score(y_true[week_idx], y_pred[week_idx]) - 1)
            gini_in_time.append(gini)

        w_fallingrate = 88.0
        w_resstd = -0.5
        x = np.arange(len(gini_in_time))
        y = np.array(gini_in_time)
        a, b = np.polyfit(x, y, 1)
        y_hat = a * x + b
        residuals = y - y_hat
        res_std = np.std(residuals)
        avg_gini = np.mean(y)
        stability_score = avg_gini + w_fallingrate * min(0, a) + w_resstd * res_std

        return 'stability_score', stability_score

def lgbm_objective(trial):
    """
    LGBMClassifier parameters search
    """
    # Target ratio for unbalanced data
    y_ratio = np.sum(y_train == 0) / np.sum(y_train == 1)
    params = {
        'n_estimators': 5000,
        'num_leaves': trial.suggest_int('num_leaves', 2, 300),
        'min_child_samples': trial.suggest_int('min_data_in_leaf', 20, 2000),
        'learning_rate': trial.suggest_uniform('learning_rate', 0.005, 0.1),
        'reg_alpha': trial.suggest_loguniform('lambda_l1', 1e-8, 10),
        'reg_lambda': trial.suggest_loguniform('lambda_l2', 1e-8, 10),
        'colsample_bytree': trial.suggest_uniform('feature_fraction', 0.5, 1),
        'subsample': trial.suggest_uniform('bagging_fraction', 0.5, 1),
        'subsample_freq': trial.suggest_int('bagging_freq', 0, 10),
        
        'objective': 'binary',
        'metric': 'None',
        'verbosity': -1,
        'scale_pos_weight': y_ratio,
        'device': 'gpu',
        'max_bin': 255,
        'n_jobs': -1,
    }
    custom_metric = StabilityMetric(X_val)

    model = LGBMClassifier(**params)  
    model.fit(X_train, y_train,
              eval_set=[(X_val, y_val)],
              eval_metric=custom_metric.lgbm_stability_metric,
              callbacks=[log_evaluation(100), early_stopping(100)],
              )
    y_pred = model.predict_proba(X_val)[:, 1]
    _, stability_score, _ = custom_metric.lgbm_stability_metric(
        np.array(y_val), np.array(y_pred)
    )
    return stability_score

def xgb_objective(trial):
    """
    XGBClassifier parameters search
    """
    # Target ratio for unbalanced data
    y_ratio = np.sum(y_train == 0) / np.sum(y_train == 1)
    params = {
        'n_estimators': 5000,
        'learning_rate': trial.suggest_uniform('learning_rate', 0.005, 0.1),
        'max_depth': trial.suggest_int('max_depth', 4, 20),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 3000),
        'max_delta_setp': trial.suggest_int('max_delta_setp', 0, 10),
        'subsample': trial.suggest_uniform('subsample', 0.5, 1),
        'colsample_bytree': trial.suggest_uniform('colsample_bytree', 0.5, 1),
        'lambda': trial.suggest_loguniform('reg_lambda', 1e-8, 10),
        'alpha': trial.suggest_loguniform('reg_alpha', 1e-8, 10),
        'gamma': trial.suggest_loguniform('gamma', 1e-8, 10),
        
        'objective': 'binary:logistic',
#         'eval_metric': 'auc',
        'tree_method': 'gpu_hist',
        'enable_categorical': True,
        'verbosity': 1, 
        'scale_pos_weight': y_ratio,
        'device': 'cuda',
        'n_jobs': -1,
    }
    es = EarlyStopping(rounds=100, 
                       maximize=True,
                       save_best=True,
                      )
    custom_metric = StabilityMetric(X_val)
    model = XGBClassifier(**params)
    model.fit(X_train, y_train,
              eval_set=[(X_val, y_val)],
              eval_metric=custom_metric.xgb_stability_metric,
              callbacks=[es],
             )
    y_pred = model.predict_proba(X_val)[:, 1]
    _, stability_score = custom_metric.xgb_stability_metric(
        np.array(y_pred), np.array(y_val)
    )
    return stability_score

def cb_objective(trial):
    """
    CatBoostClassifier parameters search
    """
    # Target ratio for unbalanced data
    y_ratio = np.sum(y_train == 0) / np.sum(y_train == 1)
    params = {
        'iterations': 5000,
        'learning_rate': trial.suggest_uniform('learning_rate', 0.005, 0.1),
        'l2_leaf_reg': trial.suggest_loguniform('l2_leaf_reg', 1e-8, 100),
        'bagging_temperature': trial.suggest_uniform('bagging_temperature', 0.0, 10),
        'random_strength': trial.suggest_loguniform('random_strength', 1e-8, 10),
        'depth': trial.suggest_int('depth', 4, 16),
        'min_data_in_leaf': trial.suggest_int('min_data_in_leaf', 1, 100),
        
        'scale_pos_weight': y_ratio,
        'objective': 'Logloss',
        'eval_metric': 'Logloss', # auc and custom functions are not working on gpu
        'verbose': 50, 
        'task_type': 'GPU',
        'thread_count': -1,
    }
    train_data = Pool(data=X_train, 
                      label=y_train, 
                      cat_features=enc_med_card_cols,
                      )
    val_data = Pool(data=X_val, 
                    label=y_val, 
                    cat_features=enc_med_card_cols,
                    )    
    model = CatBoostClassifier(**params)  
    model.fit(train_data,
              eval_set=val_data,
              early_stopping_rounds=100,
              )   
    y_pred = model.predict_proba(X_val)[:, 1]
    
    return roc_auc_score(y_val, y_pred)

In [7]:
%%time

# Optuna study
objective = lgbm_objective
sampler = optuna.samplers.TPESampler(multivariate=True)
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=30)

# Show best results
trial = study.best_trial

print('Number of finished trials: ', len(study.trials))
print('Best trial:')
print('Value:', trial.value)
print('Params:')

for key, value in trial.params.items():
     print('{}: {}'.format(key, value))

[I 2024-05-21 23:36:37,115] A new study created in memory with name: no-name-5f0e8dad-6cce-43d3-8d6a-27544142aa83


Training until validation scores don't improve for 100 rounds
[100]	valid_0's stability_score: 0.667382
[200]	valid_0's stability_score: 0.673653
[300]	valid_0's stability_score: 0.673731
[400]	valid_0's stability_score: 0.672637
Early stopping, best iteration is:
[303]	valid_0's stability_score: 0.674164


[I 2024-05-21 23:39:42,327] Trial 0 finished with value: 0.6741638619108079 and parameters: {'num_leaves': 85, 'min_data_in_leaf': 1509, 'learning_rate': 0.06893725206124518, 'lambda_l1': 0.03934534060838883, 'lambda_l2': 1.0654041584742898e-06, 'feature_fraction': 0.646291688350873, 'bagging_fraction': 0.5030451802618435, 'bagging_freq': 7}. Best is trial 0 with value: 0.6741638619108079.


Training until validation scores don't improve for 100 rounds
[100]	valid_0's stability_score: 0.663352
[200]	valid_0's stability_score: 0.66897
[300]	valid_0's stability_score: 0.668483
Early stopping, best iteration is:
[252]	valid_0's stability_score: 0.670039


[I 2024-05-21 23:42:23,950] Trial 1 finished with value: 0.6700394944847379 and parameters: {'num_leaves': 63, 'min_data_in_leaf': 199, 'learning_rate': 0.09846487264604088, 'lambda_l1': 0.9481996528578365, 'lambda_l2': 3.9781014295706225e-06, 'feature_fraction': 0.5806800636275802, 'bagging_fraction': 0.6415633421693626, 'bagging_freq': 3}. Best is trial 0 with value: 0.6741638619108079.


Training until validation scores don't improve for 100 rounds
[100]	valid_0's stability_score: 0.664416
[200]	valid_0's stability_score: 0.673758
[300]	valid_0's stability_score: 0.678775
[400]	valid_0's stability_score: 0.679488
Early stopping, best iteration is:
[348]	valid_0's stability_score: 0.680051


[I 2024-05-21 23:46:31,923] Trial 2 finished with value: 0.6800507264111265 and parameters: {'num_leaves': 134, 'min_data_in_leaf': 1651, 'learning_rate': 0.05820191335984723, 'lambda_l1': 2.274365396073248e-08, 'lambda_l2': 7.432069631290359, 'feature_fraction': 0.938662897502058, 'bagging_fraction': 0.8183782639490413, 'bagging_freq': 5}. Best is trial 2 with value: 0.6800507264111265.


Training until validation scores don't improve for 100 rounds
[100]	valid_0's stability_score: 0.656709
[200]	valid_0's stability_score: 0.672128
[300]	valid_0's stability_score: 0.678303
[400]	valid_0's stability_score: 0.680437
[500]	valid_0's stability_score: 0.680775
[600]	valid_0's stability_score: 0.680718
[700]	valid_0's stability_score: 0.680785
[800]	valid_0's stability_score: 0.681711
Early stopping, best iteration is:
[759]	valid_0's stability_score: 0.681857


[I 2024-05-21 23:52:17,191] Trial 3 finished with value: 0.6818573122901658 and parameters: {'num_leaves': 54, 'min_data_in_leaf': 433, 'learning_rate': 0.049213655604152265, 'lambda_l1': 2.638393420147147, 'lambda_l2': 3.9290503858941405e-05, 'feature_fraction': 0.9094345170837402, 'bagging_fraction': 0.7189961556188009, 'bagging_freq': 8}. Best is trial 3 with value: 0.6818573122901658.


Training until validation scores don't improve for 100 rounds
[100]	valid_0's stability_score: 0.663975
[200]	valid_0's stability_score: 0.670428
[300]	valid_0's stability_score: 0.672454
[400]	valid_0's stability_score: 0.6727
[500]	valid_0's stability_score: 0.67358
Early stopping, best iteration is:
[484]	valid_0's stability_score: 0.674405


[I 2024-05-21 23:58:19,355] Trial 4 finished with value: 0.6744048029105112 and parameters: {'num_leaves': 194, 'min_data_in_leaf': 1746, 'learning_rate': 0.06713693832103466, 'lambda_l1': 0.003753780333883439, 'lambda_l2': 1.6062406972291433e-06, 'feature_fraction': 0.9650385712655265, 'bagging_fraction': 0.7905960243193841, 'bagging_freq': 1}. Best is trial 3 with value: 0.6818573122901658.


Training until validation scores don't improve for 100 rounds
[100]	valid_0's stability_score: 0.656513
[200]	valid_0's stability_score: 0.665094
[300]	valid_0's stability_score: 0.664979
Early stopping, best iteration is:
[259]	valid_0's stability_score: 0.665986


[I 2024-05-22 00:02:03,710] Trial 5 finished with value: 0.6659857525946383 and parameters: {'num_leaves': 254, 'min_data_in_leaf': 246, 'learning_rate': 0.051079624779120494, 'lambda_l1': 1.060832363851471e-07, 'lambda_l2': 1.135559954855427e-05, 'feature_fraction': 0.5446837652424679, 'bagging_fraction': 0.5063830428783496, 'bagging_freq': 8}. Best is trial 3 with value: 0.6818573122901658.


Training until validation scores don't improve for 100 rounds
[100]	valid_0's stability_score: 0.632358
[200]	valid_0's stability_score: 0.640417
[300]	valid_0's stability_score: 0.646238
[400]	valid_0's stability_score: 0.65254
[500]	valid_0's stability_score: 0.658666
[600]	valid_0's stability_score: 0.663697
[700]	valid_0's stability_score: 0.667829
[800]	valid_0's stability_score: 0.671147
[900]	valid_0's stability_score: 0.673834
[1000]	valid_0's stability_score: 0.67634
[1100]	valid_0's stability_score: 0.678222
[1200]	valid_0's stability_score: 0.679106
[1300]	valid_0's stability_score: 0.680601
[1400]	valid_0's stability_score: 0.681768
[1500]	valid_0's stability_score: 0.682683
[1600]	valid_0's stability_score: 0.683007
[1700]	valid_0's stability_score: 0.6838
[1800]	valid_0's stability_score: 0.684438
[1900]	valid_0's stability_score: 0.685343
[2000]	valid_0's stability_score: 0.685656
[2100]	valid_0's stability_score: 0.685951
[2200]	valid_0's stability_score: 0.68623
[2300]

[I 2024-05-22 00:19:50,440] Trial 6 finished with value: 0.6864117589786688 and parameters: {'num_leaves': 88, 'min_data_in_leaf': 1398, 'learning_rate': 0.007527971787524835, 'lambda_l1': 2.5902094144625828e-05, 'lambda_l2': 2.4865532359136905e-07, 'feature_fraction': 0.617023068364577, 'bagging_fraction': 0.8869282625503894, 'bagging_freq': 6}. Best is trial 6 with value: 0.6864117589786688.


Training until validation scores don't improve for 100 rounds
[100]	valid_0's stability_score: 0.665018
[200]	valid_0's stability_score: 0.676899
[300]	valid_0's stability_score: 0.678953
[400]	valid_0's stability_score: 0.680454
Early stopping, best iteration is:
[383]	valid_0's stability_score: 0.680928


[I 2024-05-22 00:24:16,342] Trial 7 finished with value: 0.6809276961201459 and parameters: {'num_leaves': 147, 'min_data_in_leaf': 1947, 'learning_rate': 0.049627949283823476, 'lambda_l1': 8.318181157521204e-05, 'lambda_l2': 0.004581246312273876, 'feature_fraction': 0.6299965918565626, 'bagging_fraction': 0.5697193612234792, 'bagging_freq': 5}. Best is trial 6 with value: 0.6864117589786688.


Training until validation scores don't improve for 100 rounds
[100]	valid_0's stability_score: 0.631773
[200]	valid_0's stability_score: 0.642116
[300]	valid_0's stability_score: 0.651709
[400]	valid_0's stability_score: 0.660318
[500]	valid_0's stability_score: 0.666318
[600]	valid_0's stability_score: 0.670656
[700]	valid_0's stability_score: 0.674037
[800]	valid_0's stability_score: 0.676445
[900]	valid_0's stability_score: 0.678801
[1000]	valid_0's stability_score: 0.680878
[1100]	valid_0's stability_score: 0.681612
[1200]	valid_0's stability_score: 0.682256
[1300]	valid_0's stability_score: 0.682837
[1400]	valid_0's stability_score: 0.683556
[1500]	valid_0's stability_score: 0.684856
[1600]	valid_0's stability_score: 0.68516
[1700]	valid_0's stability_score: 0.685786
[1800]	valid_0's stability_score: 0.685936
[1900]	valid_0's stability_score: 0.686347
[2000]	valid_0's stability_score: 0.686417
[2100]	valid_0's stability_score: 0.686662
[2200]	valid_0's stability_score: 0.686931
[2

[I 2024-05-22 00:40:47,375] Trial 8 finished with value: 0.6871954517209767 and parameters: {'num_leaves': 87, 'min_data_in_leaf': 1221, 'learning_rate': 0.009281623487755453, 'lambda_l1': 0.00038939235514527735, 'lambda_l2': 2.6803464736049585e-06, 'feature_fraction': 0.7132340937648285, 'bagging_fraction': 0.5044042539937792, 'bagging_freq': 7}. Best is trial 8 with value: 0.6871954517209767.


Training until validation scores don't improve for 100 rounds
[100]	valid_0's stability_score: 0.638858
[200]	valid_0's stability_score: 0.650643
[300]	valid_0's stability_score: 0.662126
[400]	valid_0's stability_score: 0.669412
[500]	valid_0's stability_score: 0.673899
[600]	valid_0's stability_score: 0.677231
[700]	valid_0's stability_score: 0.679587
[800]	valid_0's stability_score: 0.681343
[900]	valid_0's stability_score: 0.682706
[1000]	valid_0's stability_score: 0.684057
[1100]	valid_0's stability_score: 0.68517
[1200]	valid_0's stability_score: 0.685986
[1300]	valid_0's stability_score: 0.686568
[1400]	valid_0's stability_score: 0.686445
Early stopping, best iteration is:
[1312]	valid_0's stability_score: 0.686789


[I 2024-05-22 00:53:46,189] Trial 9 finished with value: 0.6867891798519102 and parameters: {'num_leaves': 152, 'min_data_in_leaf': 1454, 'learning_rate': 0.01310319072581281, 'lambda_l1': 5.193513049558476, 'lambda_l2': 0.025609497452307235, 'feature_fraction': 0.730019876586063, 'bagging_fraction': 0.866873253261149, 'bagging_freq': 2}. Best is trial 8 with value: 0.6871954517209767.


Training until validation scores don't improve for 100 rounds
[100]	valid_0's stability_score: 0.607766
[200]	valid_0's stability_score: 0.63331
[300]	valid_0's stability_score: 0.646745
[400]	valid_0's stability_score: 0.654849
[500]	valid_0's stability_score: 0.66028
[600]	valid_0's stability_score: 0.664017
[700]	valid_0's stability_score: 0.667619
[800]	valid_0's stability_score: 0.670448
[900]	valid_0's stability_score: 0.672913
[1000]	valid_0's stability_score: 0.674192
[1100]	valid_0's stability_score: 0.675188
[1200]	valid_0's stability_score: 0.675957
[1300]	valid_0's stability_score: 0.677107
[1400]	valid_0's stability_score: 0.67789
[1500]	valid_0's stability_score: 0.67862
[1600]	valid_0's stability_score: 0.679044
[1700]	valid_0's stability_score: 0.67927
[1800]	valid_0's stability_score: 0.679777
[1900]	valid_0's stability_score: 0.680223
[2000]	valid_0's stability_score: 0.680479
[2100]	valid_0's stability_score: 0.680886
[2200]	valid_0's stability_score: 0.680888
Early 

[I 2024-05-22 01:05:54,416] Trial 10 finished with value: 0.6809446480277598 and parameters: {'num_leaves': 9, 'min_data_in_leaf': 858, 'learning_rate': 0.029240087063338898, 'lambda_l1': 2.9410347401104765e-06, 'lambda_l2': 1.719225089403035e-08, 'feature_fraction': 0.8171730790402636, 'bagging_fraction': 0.9728773253775156, 'bagging_freq': 10}. Best is trial 8 with value: 0.6871954517209767.


Training until validation scores don't improve for 100 rounds
[100]	valid_0's stability_score: 0.631856
[200]	valid_0's stability_score: 0.639911
[300]	valid_0's stability_score: 0.645207
[400]	valid_0's stability_score: 0.650446
[500]	valid_0's stability_score: 0.654884
[600]	valid_0's stability_score: 0.659454
[700]	valid_0's stability_score: 0.663693
[800]	valid_0's stability_score: 0.667366
[900]	valid_0's stability_score: 0.670429
[1000]	valid_0's stability_score: 0.673051
[1100]	valid_0's stability_score: 0.675191
[1200]	valid_0's stability_score: 0.676798
[1300]	valid_0's stability_score: 0.678259
[1400]	valid_0's stability_score: 0.679709
[1500]	valid_0's stability_score: 0.680713
[1600]	valid_0's stability_score: 0.681643
[1700]	valid_0's stability_score: 0.682643
[1800]	valid_0's stability_score: 0.683314
[1900]	valid_0's stability_score: 0.683684
[2000]	valid_0's stability_score: 0.684153
[2100]	valid_0's stability_score: 0.684817
[2200]	valid_0's stability_score: 0.685226
[

[I 2024-05-22 01:39:40,458] Trial 11 finished with value: 0.6881509914750644 and parameters: {'num_leaves': 204, 'min_data_in_leaf': 1107, 'learning_rate': 0.005322296350733101, 'lambda_l1': 0.01464950362843525, 'lambda_l2': 0.006133498136111812, 'feature_fraction': 0.7482148842739891, 'bagging_fraction': 0.897325244068229, 'bagging_freq': 0}. Best is trial 11 with value: 0.6881509914750644.


Training until validation scores don't improve for 100 rounds
[100]	valid_0's stability_score: 0.652852
[200]	valid_0's stability_score: 0.670734
[300]	valid_0's stability_score: 0.678516
[400]	valid_0's stability_score: 0.682212
[500]	valid_0's stability_score: 0.683277
[600]	valid_0's stability_score: 0.684394
[700]	valid_0's stability_score: 0.684712
[800]	valid_0's stability_score: 0.684882
Early stopping, best iteration is:
[779]	valid_0's stability_score: 0.685107


[I 2024-05-22 01:49:46,782] Trial 12 finished with value: 0.6851072031434673 and parameters: {'num_leaves': 280, 'min_data_in_leaf': 1049, 'learning_rate': 0.024874099408989835, 'lambda_l1': 0.0028090451066594287, 'lambda_l2': 0.0010761573739551868, 'feature_fraction': 0.7554237569246546, 'bagging_fraction': 0.9827222211667748, 'bagging_freq': 0}. Best is trial 11 with value: 0.6881509914750644.


Training until validation scores don't improve for 100 rounds
[100]	valid_0's stability_score: 0.652874
[200]	valid_0's stability_score: 0.670588
[300]	valid_0's stability_score: 0.678303
[400]	valid_0's stability_score: 0.681142
[500]	valid_0's stability_score: 0.682555
[600]	valid_0's stability_score: 0.683449
[700]	valid_0's stability_score: 0.683719
Early stopping, best iteration is:
[636]	valid_0's stability_score: 0.68402


[I 2024-05-22 01:57:15,107] Trial 13 finished with value: 0.6840200283740652 and parameters: {'num_leaves': 207, 'min_data_in_leaf': 1035, 'learning_rate': 0.027579901938644045, 'lambda_l1': 0.053870769944021395, 'lambda_l2': 0.3114035827802884, 'feature_fraction': 0.7173090065504052, 'bagging_fraction': 0.699894328924575, 'bagging_freq': 3}. Best is trial 11 with value: 0.6881509914750644.


Training until validation scores don't improve for 100 rounds
[100]	valid_0's stability_score: 0.632256
[200]	valid_0's stability_score: 0.640401
[300]	valid_0's stability_score: 0.648044
[400]	valid_0's stability_score: 0.65462
[500]	valid_0's stability_score: 0.660564
[600]	valid_0's stability_score: 0.665119
[700]	valid_0's stability_score: 0.668838
[800]	valid_0's stability_score: 0.671061
[900]	valid_0's stability_score: 0.673727
[1000]	valid_0's stability_score: 0.675583
[1100]	valid_0's stability_score: 0.676916
[1200]	valid_0's stability_score: 0.677782
[1300]	valid_0's stability_score: 0.679238
[1400]	valid_0's stability_score: 0.679956
[1500]	valid_0's stability_score: 0.680414
[1600]	valid_0's stability_score: 0.681323
[1700]	valid_0's stability_score: 0.681627
[1800]	valid_0's stability_score: 0.682284
[1900]	valid_0's stability_score: 0.682549
[2000]	valid_0's stability_score: 0.682872
Early stopping, best iteration is:
[1945]	valid_0's stability_score: 0.683048


[I 2024-05-22 02:17:45,034] Trial 14 finished with value: 0.6830484885260103 and parameters: {'num_leaves': 213, 'min_data_in_leaf': 711, 'learning_rate': 0.006415675556420283, 'lambda_l1': 0.0006888267182939708, 'lambda_l2': 0.0001229769304326429, 'feature_fraction': 0.8340146148260409, 'bagging_fraction': 0.6245804018174352, 'bagging_freq': 10}. Best is trial 11 with value: 0.6881509914750644.


Training until validation scores don't improve for 100 rounds
[100]	valid_0's stability_score: 0.642949
[200]	valid_0's stability_score: 0.660754
[300]	valid_0's stability_score: 0.670942
[400]	valid_0's stability_score: 0.676704
[500]	valid_0's stability_score: 0.680579
[600]	valid_0's stability_score: 0.682949
[700]	valid_0's stability_score: 0.684459
[800]	valid_0's stability_score: 0.685368
[900]	valid_0's stability_score: 0.685808
[1000]	valid_0's stability_score: 0.686377
[1100]	valid_0's stability_score: 0.686463
Early stopping, best iteration is:
[1046]	valid_0's stability_score: 0.686804


[I 2024-05-22 02:27:20,542] Trial 15 finished with value: 0.6868041879304275 and parameters: {'num_leaves': 120, 'min_data_in_leaf': 1298, 'learning_rate': 0.01935565583741258, 'lambda_l1': 0.07203418590601807, 'lambda_l2': 0.025192956157022807, 'feature_fraction': 0.8286394463238029, 'bagging_fraction': 0.9116164688911964, 'bagging_freq': 4}. Best is trial 11 with value: 0.6881509914750644.


Training until validation scores don't improve for 100 rounds
[100]	valid_0's stability_score: 0.614973
[200]	valid_0's stability_score: 0.63946
[300]	valid_0's stability_score: 0.651427
[400]	valid_0's stability_score: 0.65907
[500]	valid_0's stability_score: 0.664443
[600]	valid_0's stability_score: 0.66869
[700]	valid_0's stability_score: 0.67217
[800]	valid_0's stability_score: 0.674039
[900]	valid_0's stability_score: 0.675785
[1000]	valid_0's stability_score: 0.676902
[1100]	valid_0's stability_score: 0.678274
[1200]	valid_0's stability_score: 0.678739
[1300]	valid_0's stability_score: 0.679631
[1400]	valid_0's stability_score: 0.680311
[1500]	valid_0's stability_score: 0.680791
[1600]	valid_0's stability_score: 0.681317
[1700]	valid_0's stability_score: 0.681625
[1800]	valid_0's stability_score: 0.681971
[1900]	valid_0's stability_score: 0.68221
[2000]	valid_0's stability_score: 0.68245
[2100]	valid_0's stability_score: 0.68261
[2200]	valid_0's stability_score: 0.683153
[2300]	v

[I 2024-05-22 02:41:31,816] Trial 16 finished with value: 0.6841371761393052 and parameters: {'num_leaves': 9, 'min_data_in_leaf': 1128, 'learning_rate': 0.03655978641017954, 'lambda_l1': 3.2104327875576224e-06, 'lambda_l2': 3.8742244244167625e-08, 'feature_fraction': 0.7778742978185861, 'bagging_fraction': 0.7805895097779325, 'bagging_freq': 0}. Best is trial 11 with value: 0.6881509914750644.


Training until validation scores don't improve for 100 rounds
[100]	valid_0's stability_score: 0.657915
[200]	valid_0's stability_score: 0.673653
[300]	valid_0's stability_score: 0.678829
[400]	valid_0's stability_score: 0.679601
[500]	valid_0's stability_score: 0.679743
Early stopping, best iteration is:
[471]	valid_0's stability_score: 0.680966


[I 2024-05-22 02:46:49,734] Trial 17 finished with value: 0.6809656363480742 and parameters: {'num_leaves': 176, 'min_data_in_leaf': 641, 'learning_rate': 0.03878370641770844, 'lambda_l1': 0.00465658698470873, 'lambda_l2': 0.0004614227254377971, 'feature_fraction': 0.6867290090763803, 'bagging_fraction': 0.6531407102621324, 'bagging_freq': 8}. Best is trial 11 with value: 0.6881509914750644.


Training until validation scores don't improve for 100 rounds
[100]	valid_0's stability_score: 0.666553
[200]	valid_0's stability_score: 0.66793
Early stopping, best iteration is:
[167]	valid_0's stability_score: 0.669924


[I 2024-05-22 02:49:41,388] Trial 18 finished with value: 0.669923588525244 and parameters: {'num_leaves': 243, 'min_data_in_leaf': 1228, 'learning_rate': 0.09568065787011715, 'lambda_l1': 0.00010711592035504625, 'lambda_l2': 0.8560338823071783, 'feature_fraction': 0.5048929080435853, 'bagging_fraction': 0.8399636987119552, 'bagging_freq': 6}. Best is trial 11 with value: 0.6881509914750644.


Training until validation scores don't improve for 100 rounds
[100]	valid_0's stability_score: 0.642193
[200]	valid_0's stability_score: 0.659022
[300]	valid_0's stability_score: 0.669763
[400]	valid_0's stability_score: 0.675575
[500]	valid_0's stability_score: 0.679997
[600]	valid_0's stability_score: 0.682436
[700]	valid_0's stability_score: 0.684149
[800]	valid_0's stability_score: 0.684794
[900]	valid_0's stability_score: 0.685883
[1000]	valid_0's stability_score: 0.686222
[1100]	valid_0's stability_score: 0.686674
[1200]	valid_0's stability_score: 0.686633
[1300]	valid_0's stability_score: 0.686974
[1400]	valid_0's stability_score: 0.687076
[1500]	valid_0's stability_score: 0.686948
Early stopping, best iteration is:
[1419]	valid_0's stability_score: 0.687323


[I 2024-05-22 03:01:22,240] Trial 19 finished with value: 0.6873232938255093 and parameters: {'num_leaves': 105, 'min_data_in_leaf': 862, 'learning_rate': 0.01797319199061028, 'lambda_l1': 5.291214484925113e-06, 'lambda_l2': 0.006617328796403234, 'feature_fraction': 0.6752046756824283, 'bagging_fraction': 0.9229435720316811, 'bagging_freq': 2}. Best is trial 11 with value: 0.6881509914750644.


Training until validation scores don't improve for 100 rounds
[100]	valid_0's stability_score: 0.644464
[200]	valid_0's stability_score: 0.663826
[300]	valid_0's stability_score: 0.672705
[400]	valid_0's stability_score: 0.677639
[500]	valid_0's stability_score: 0.680757
[600]	valid_0's stability_score: 0.68215
[700]	valid_0's stability_score: 0.683302
[800]	valid_0's stability_score: 0.68413
[900]	valid_0's stability_score: 0.684204
Early stopping, best iteration is:
[858]	valid_0's stability_score: 0.684403


[I 2024-05-22 03:10:35,422] Trial 20 finished with value: 0.6844029066646435 and parameters: {'num_leaves': 172, 'min_data_in_leaf': 841, 'learning_rate': 0.02058904875119336, 'lambda_l1': 7.934442885755701e-07, 'lambda_l2': 0.00860083881382397, 'feature_fraction': 0.8900113280076428, 'bagging_fraction': 0.9373832157005724, 'bagging_freq': 1}. Best is trial 11 with value: 0.6881509914750644.


Training until validation scores don't improve for 100 rounds
[100]	valid_0's stability_score: 0.637495
[200]	valid_0's stability_score: 0.651704
[300]	valid_0's stability_score: 0.663266
[400]	valid_0's stability_score: 0.670736
[500]	valid_0's stability_score: 0.675317
[600]	valid_0's stability_score: 0.678747
[700]	valid_0's stability_score: 0.681762
[800]	valid_0's stability_score: 0.683404
[900]	valid_0's stability_score: 0.684429
[1000]	valid_0's stability_score: 0.685313
[1100]	valid_0's stability_score: 0.685764
[1200]	valid_0's stability_score: 0.686164
[1300]	valid_0's stability_score: 0.686596
[1400]	valid_0's stability_score: 0.68712
[1500]	valid_0's stability_score: 0.687199
[1600]	valid_0's stability_score: 0.687459
[1700]	valid_0's stability_score: 0.687394
Early stopping, best iteration is:
[1627]	valid_0's stability_score: 0.687651


[I 2024-05-22 03:24:02,436] Trial 21 finished with value: 0.687651035931788 and parameters: {'num_leaves': 105, 'min_data_in_leaf': 875, 'learning_rate': 0.014533682968274977, 'lambda_l1': 2.3796334791986652e-05, 'lambda_l2': 0.17078278275340716, 'feature_fraction': 0.677384761541142, 'bagging_fraction': 0.9394025914188204, 'bagging_freq': 2}. Best is trial 11 with value: 0.6881509914750644.


Training until validation scores don't improve for 100 rounds
[100]	valid_0's stability_score: 0.63903
[200]	valid_0's stability_score: 0.650684
[300]	valid_0's stability_score: 0.661726
[400]	valid_0's stability_score: 0.668315
[500]	valid_0's stability_score: 0.673466
[600]	valid_0's stability_score: 0.67718
[700]	valid_0's stability_score: 0.67909
[800]	valid_0's stability_score: 0.680963
[900]	valid_0's stability_score: 0.682474
[1000]	valid_0's stability_score: 0.683501
[1100]	valid_0's stability_score: 0.684029
[1200]	valid_0's stability_score: 0.68442
[1300]	valid_0's stability_score: 0.685385
[1400]	valid_0's stability_score: 0.685593
[1500]	valid_0's stability_score: 0.685828
Early stopping, best iteration is:
[1468]	valid_0's stability_score: 0.686129


[I 2024-05-22 03:36:43,436] Trial 22 finished with value: 0.686129163963065 and parameters: {'num_leaves': 111, 'min_data_in_leaf': 602, 'learning_rate': 0.01378454366413464, 'lambda_l1': 1.8771363967985088e-05, 'lambda_l2': 0.19537042598612422, 'feature_fraction': 0.6777250270575366, 'bagging_fraction': 0.9391941169012458, 'bagging_freq': 2}. Best is trial 11 with value: 0.6881509914750644.


Training until validation scores don't improve for 100 rounds
[100]	valid_0's stability_score: 0.650467
[200]	valid_0's stability_score: 0.669342
[300]	valid_0's stability_score: 0.676789
[400]	valid_0's stability_score: 0.680255
[500]	valid_0's stability_score: 0.682666
[600]	valid_0's stability_score: 0.683428
[700]	valid_0's stability_score: 0.684549
[800]	valid_0's stability_score: 0.684706
[900]	valid_0's stability_score: 0.685326
[1000]	valid_0's stability_score: 0.685858
[1100]	valid_0's stability_score: 0.685966
Early stopping, best iteration is:
[1023]	valid_0's stability_score: 0.68622


[I 2024-05-22 03:44:13,799] Trial 23 finished with value: 0.6862199166070175 and parameters: {'num_leaves': 53, 'min_data_in_leaf': 879, 'learning_rate': 0.03682969982300936, 'lambda_l1': 1.6943554120821722e-06, 'lambda_l2': 0.08343015304786718, 'feature_fraction': 0.779247060377412, 'bagging_fraction': 0.9957269786376554, 'bagging_freq': 1}. Best is trial 11 with value: 0.6881509914750644.


Training until validation scores don't improve for 100 rounds
[100]	valid_0's stability_score: 0.642539
[200]	valid_0's stability_score: 0.659211
[300]	valid_0's stability_score: 0.67027
[400]	valid_0's stability_score: 0.6759
[500]	valid_0's stability_score: 0.680243
[600]	valid_0's stability_score: 0.682785
[700]	valid_0's stability_score: 0.684132
[800]	valid_0's stability_score: 0.68474
[900]	valid_0's stability_score: 0.684968
[1000]	valid_0's stability_score: 0.685192
[1100]	valid_0's stability_score: 0.685505
[1200]	valid_0's stability_score: 0.686118
[1300]	valid_0's stability_score: 0.685766
Early stopping, best iteration is:
[1246]	valid_0's stability_score: 0.686384


[I 2024-05-22 03:54:52,487] Trial 24 finished with value: 0.6863836439607981 and parameters: {'num_leaves': 111, 'min_data_in_leaf': 529, 'learning_rate': 0.018696394902865537, 'lambda_l1': 1.2821900089281313e-05, 'lambda_l2': 0.002011136802779837, 'feature_fraction': 0.6527572309370945, 'bagging_fraction': 0.8694597053313206, 'bagging_freq': 3}. Best is trial 11 with value: 0.6881509914750644.


Training until validation scores don't improve for 100 rounds
[100]	valid_0's stability_score: 0.658067
[200]	valid_0's stability_score: 0.673022
[300]	valid_0's stability_score: 0.678918
[400]	valid_0's stability_score: 0.68099
[500]	valid_0's stability_score: 0.682504
[600]	valid_0's stability_score: 0.682679
[700]	valid_0's stability_score: 0.682772
[800]	valid_0's stability_score: 0.683133
[900]	valid_0's stability_score: 0.682826
Early stopping, best iteration is:
[828]	valid_0's stability_score: 0.683358


[I 2024-05-22 04:03:55,877] Trial 25 finished with value: 0.6833580504990489 and parameters: {'num_leaves': 227, 'min_data_in_leaf': 905, 'learning_rate': 0.030262166914726003, 'lambda_l1': 1.9281737116659274e-07, 'lambda_l2': 2.74804322716665, 'feature_fraction': 0.5762977987959823, 'bagging_fraction': 0.9299341575541547, 'bagging_freq': 0}. Best is trial 11 with value: 0.6881509914750644.


Training until validation scores don't improve for 100 rounds
[100]	valid_0's stability_score: 0.668966
[200]	valid_0's stability_score: 0.671395
Early stopping, best iteration is:
[179]	valid_0's stability_score: 0.672566


[I 2024-05-22 04:06:43,595] Trial 26 finished with value: 0.672565826128374 and parameters: {'num_leaves': 167, 'min_data_in_leaf': 332, 'learning_rate': 0.08148015797025349, 'lambda_l1': 0.2685286193957568, 'lambda_l2': 0.044110050127223216, 'feature_fraction': 0.6783207103361464, 'bagging_fraction': 0.8974379182934872, 'bagging_freq': 2}. Best is trial 11 with value: 0.6881509914750644.


Training until validation scores don't improve for 100 rounds
[100]	valid_0's stability_score: 0.649931
[200]	valid_0's stability_score: 0.661703
[300]	valid_0's stability_score: 0.671316
[400]	valid_0's stability_score: 0.67686
[500]	valid_0's stability_score: 0.680634
[600]	valid_0's stability_score: 0.683112
[700]	valid_0's stability_score: 0.684704
[800]	valid_0's stability_score: 0.685731
[900]	valid_0's stability_score: 0.686588
[1000]	valid_0's stability_score: 0.687367
[1100]	valid_0's stability_score: 0.687148
Early stopping, best iteration is:
[1048]	valid_0's stability_score: 0.687506


[I 2024-05-22 04:19:18,032] Trial 27 finished with value: 0.6875060726074638 and parameters: {'num_leaves': 279, 'min_data_in_leaf': 735, 'learning_rate': 0.015098347645446007, 'lambda_l1': 0.001081767382483351, 'lambda_l2': 0.7651053862211805, 'feature_fraction': 0.5963108301047251, 'bagging_fraction': 0.836117107999597, 'bagging_freq': 4}. Best is trial 11 with value: 0.6881509914750644.


Training until validation scores don't improve for 100 rounds
[100]	valid_0's stability_score: 0.657486
[200]	valid_0's stability_score: 0.66622
[300]	valid_0's stability_score: 0.665723
Early stopping, best iteration is:
[222]	valid_0's stability_score: 0.667099


[I 2024-05-22 04:23:16,091] Trial 28 finished with value: 0.6670986053189413 and parameters: {'num_leaves': 299, 'min_data_in_leaf': 52, 'learning_rate': 0.042563402407733764, 'lambda_l1': 0.014672690656013227, 'lambda_l2': 1.0517803209949552, 'feature_fraction': 0.6088945000943293, 'bagging_fraction': 0.8363090910540283, 'bagging_freq': 4}. Best is trial 11 with value: 0.6881509914750644.


Training until validation scores don't improve for 100 rounds
[100]	valid_0's stability_score: 0.64221
[200]	valid_0's stability_score: 0.649151
[300]	valid_0's stability_score: 0.654685
[400]	valid_0's stability_score: 0.659165
[500]	valid_0's stability_score: 0.66353
[600]	valid_0's stability_score: 0.66721
[700]	valid_0's stability_score: 0.670501
[800]	valid_0's stability_score: 0.673444
[900]	valid_0's stability_score: 0.675577
[1000]	valid_0's stability_score: 0.677934
[1100]	valid_0's stability_score: 0.679847
[1200]	valid_0's stability_score: 0.681112
[1300]	valid_0's stability_score: 0.682371
[1400]	valid_0's stability_score: 0.68359
[1500]	valid_0's stability_score: 0.684438
[1600]	valid_0's stability_score: 0.685142
[1700]	valid_0's stability_score: 0.685983
[1800]	valid_0's stability_score: 0.68678
[1900]	valid_0's stability_score: 0.687272
[2000]	valid_0's stability_score: 0.687307
[2100]	valid_0's stability_score: 0.687717
[2200]	valid_0's stability_score: 0.68808
[2300]	

[I 2024-05-22 04:48:28,857] Trial 29 finished with value: 0.6884790581859022 and parameters: {'num_leaves': 266, 'min_data_in_leaf': 731, 'learning_rate': 0.005901835737658873, 'lambda_l1': 0.0010634148733296234, 'lambda_l2': 5.835448638038832, 'feature_fraction': 0.5106655665557869, 'bagging_fraction': 0.7534811540507564, 'bagging_freq': 4}. Best is trial 29 with value: 0.6884790581859022.


Number of finished trials:  30
Best trial:
Value: 0.6884790581859022
Params:
num_leaves: 266
min_data_in_leaf: 731
learning_rate: 0.005901835737658873
lambda_l1: 0.0010634148733296234
lambda_l2: 5.835448638038832
feature_fraction: 0.5106655665557869
bagging_fraction: 0.7534811540507564
bagging_freq: 4
CPU times: total: 4d 2h 32min 41s
Wall time: 5h 11min 51s


In [None]:
%%time

# Optuna study
objective = cb_objective
sampler = optuna.samplers.TPESampler(multivariate=True)
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)

# Show best results
trial = study.best_trial

print('Number of finished trials: ', len(study.trials))
print('Best trial:')
print('Value:', trial.value)
print('Params:')

for key, value in trial.params.items():
     print('{}: {}'.format(key, value))

[I 2024-05-22 05:19:07,111] A new study created in memory with name: no-name-387bd64c-5ad7-43f1-9f56-972e3aeb3e3a


0:	learn: 0.6855418	test: 0.6867555	best: 0.6867555 (0)	total: 238ms	remaining: 19m 49s
50:	learn: 0.5135213	test: 0.5788604	best: 0.5788604 (50)	total: 11.9s	remaining: 19m 11s
100:	learn: 0.4556692	test: 0.5743165	best: 0.5731124 (81)	total: 23.1s	remaining: 18m 41s
150:	learn: 0.4115648	test: 0.5816089	best: 0.5731124 (81)	total: 34.3s	remaining: 18m 21s
bestTest = 0.5731123517
bestIteration = 81
Shrink model to first 82 iterations.


[I 2024-05-22 05:19:53,426] Trial 0 finished with value: 0.786542864918172 and parameters: {'learning_rate': 0.02853574682203823, 'l2_leaf_reg': 1.4611042789271413e-08, 'bagging_temperature': 8.670084696965707, 'random_strength': 3.842297697655097e-06, 'depth': 12, 'min_data_in_leaf': 36}. Best is trial 0 with value: 0.786542864918172.


In [None]:
%%time

# Optuna study
objective = xgb_objective
sampler = optuna.samplers.TPESampler(multivariate=True)
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)

# Show best results
trial = study.best_trial

print('Number of finished trials: ', len(study.trials))
print('Best trial:')
print('Value:', trial.value)
print('Params:')

for key, value in trial.params.items():
     print('{}: {}'.format(key, value))

#### Models' best parameters

In [None]:
# LGBM best parameters
lgbm_params = {
    'n_estimators': 5000,
    'num_leaves': 214, 
    'min_data_in_leaf': 1831, 
    'learning_rate': 0.018016752095308213, 
    'lambda_l1': 0.039542276491157664, 
    'lambda_l2': 0.0028839017821387612, 
    'feature_fraction': 0.8606139936996, 
    'bagging_fraction': 0.6505054297397137,
    'bagging_freq': 4,
    'objective': 'binary',
    'metric': 'None', # stability metric is used as eval_metric
    'verbosity': -1,
    'device': 'gpu',
    'n_jobs': -1,
    'max_bin': 255,
    }

#{'num_leaves': 266, 'min_data_in_leaf': 731, 'learning_rate': 0.005901835737658873, 'lambda_l1': 0.0010634148733296234,
#'lambda_l2': 5.835448638038832, 'feature_fraction': 0.5106655665557869, 'bagging_fraction': 0.7534811540507564, 'bagging_freq': 4}
# XGB best parameters
xgb_params = {
    'n_estimators': 5000,
    'learning_rate': 0.02563139404535397, 
    'max_depth': 17, 
    'min_child_weight': 1446, 
    'max_delta_setp': 0,
    'subsample': 0.8564298564731834, 
    'colsample_bytree': 0.9235898601649664, 
    'reg_lambda': 1.1289608466365559e-08, 
    'reg_alpha': 4.911518931389214e-07, 
    'gamma': 0.00035869889925144383, 
    
    'objective': 'binary:logistic',
#     'eval_metric': 'auc',
    'tree_method': 'gpu_hist',
    'enable_categorical': True,
    'verbosity': 0,
    'device': 'cuda',
    'n_jobs': -1,
    }
# CatBoost best parameters
cb_params = {
    'iterations': 5000,
    'learning_rate': 0.08, 
    'l2_leaf_reg': 57.87508612048416, 
    'bagging_temperature': 0.737099486243173, 
    'random_strength': 0.0017723437301297412, 
    'depth': 6, 
    'min_data_in_leaf': 66,
    
    'objective': 'Logloss',
    'eval_metric': 'Logloss', # auc and custom functions are not working on gpu
    'verbose': 50, 
    'task_type': 'GPU',
    'thread_count': -1,
    }
# {'learning_rate': 0.06141273570840567, 'l2_leaf_reg': 17.946406453936355, 'bagging_temperature': 1.2822776449667406, 'random_strength': 0.07236918779653291, 'depth': 10, 'min_data_in_leaf': 48}
# {'learning_rate': 0.07746171181807705, 'l2_leaf_reg': 0.16856620630061742, 'bagging_temperature': 0.4939947354720309, 'random_strength': 0.10365883602086806, 'depth': 5, 'min_data_in_leaf': 90}

#### Create model

In [None]:
def train_model(X_train, y_train, X_val, y_val, model_type, params):
    """
    Train a specific model type with predefined parameters
    """
    y_ratio = np.sum(y_train == 0) / np.sum(y_train == 1)
    params['scale_pos_weight'] = y_ratio
    custom_metric = StabilityMetric(X_val)

    if model_type == 'lgbm':
        model = LGBMClassifier(**params)
        model.fit(X_train, y_train,
                  eval_set=[(X_val, y_val)],
                  eval_metric=custom_metric.lgbm_stability_metric,
                  callbacks=[log_evaluation(100), early_stopping(100)],
                 )
    elif model_type == 'xgb':
        es = EarlyStopping(rounds=100, 
                           maximize=True,
                           save_best=True,)
        model = XGBClassifier(**params)
        model.fit(X_train, y_train,
                  eval_set=[(X_val, y_val)],
                  eval_metric=custom_metric.xgb_stability_metric,
                  callbacks=[es],
                 )
    elif model_type == 'cb':
        train_data = Pool(data=X_train, label=y_train, 
                          cat_features=enc_med_card_cols,
                          )
        val_data = Pool(data=X_val, label=y_val, 
                        cat_features=enc_med_card_cols,
                        )
        model = CatBoostClassifier(**params)  
        model.fit(train_data,
                  eval_set=val_data,
                  early_stopping_rounds=100,
                  )
    return model

In [None]:
%%time

# Train model
X_train, y_train = shuffle(X_train, y_train)
lgbm_model = train_model(X_train, y_train, X_val, y_val, 'lgbm', lgbm_params)

In [None]:
# Plot LGBM model features importance
plot_importance(lgbm_model, 
                importance_type='gain', 
                max_num_features=20, 
                height=0.5,
                grid=False,
                precision=0,
                )
plt.show()

#### Read and prepare test set

In [None]:
# Free memory
del X_train, X_val, y_train, y_val

# Read, preprocess and merge all the test files together
X_test = read_prepare_all(test_path, files_dict)

# Match X_test columns with X columns
for col in X_structure.columns:
    if col not in X_test.columns:
        X_test[col] = X_structure[col]
        print(f'{col} added to X_test')
        
X_test = X_test[X_structure.columns]

X_test.info()

#### Predict and submit

In [None]:
def predict_proba_in_batches(X_test, final_model, batch_size=10000):
    """
    Process the test set and predict in batches
    """
    num_samples = len(X_test)
    num_batches = (num_samples // batch_size) + 1
    y_pred = []
    
    for batch_idx in range(num_batches):
        start_idx = batch_idx * batch_size
        end_idx = min((batch_idx + 1) * batch_size, num_samples)
        X_batch = X_test.iloc[start_idx:end_idx]
        X_batch = pd.DataFrame(
            processor.transform(X_batch), 
            columns=processor.get_feature_names_out(),
            index=X_batch.index)
        X_batch[enc_med_card_cols] = (X_batch[enc_med_card_cols]
                                      .astype(str).astype('category'))
        batch_preds = []
        
        for model_type, model in final_model.items():
            if 'single' in model_type:
                batch_model_preds = model.predict_proba(X_batch)[:, 1]

            elif 'ensemble' in model_type: 
                # Predict the average from a list of estimators
                batch_model_preds = mean_predict_proba(
                    X_batch, estimators=model)
                
            batch_preds.append(batch_model_preds)
            
        batch_preds = np.vstack(batch_preds)
        y_pred.append(batch_preds) 
        
    # The average from all predictions
    y_pred = np.hstack(y_pred)
    y_pred = np.mean(y_pred, axis=0) 
        
    return y_pred

In [None]:
# Predict
final_model = {
    'single': lgbm_model,
#     'single': xgb_model,
#     'ensemble': lgbm_bag_week_model,
}
y_pred = predict_proba_in_batches(X_test, final_model, batch_size=10000)

# Submit
submit = pd.read_csv(main_path + 'sample_submission.csv')
submit.score = y_pred

submit.to_csv('submission.csv', index=False)

submit