## Prerequisites

### Install libraries


In [1]:
# pip install ibis-ml "ibis-framework[duckdb]"

#### Download dataset

- Option 1: Log in to your Kaggle account and download all the data from this [link](https://www.kaggle.com/competitions/home-credit-credit-risk-model-stability/data)
- Option 2: Kaggle API
    1. Go to your Kaggle account settings: [Kaggle Account Settings](https://www.kaggle.com/account).
    2. Under the "API" section, click on "Create New API Token". This will download the `kaggle.json` file to your computer.
    3. Place the `kaggle.json` file in the directory where the Kaggle CLI looks for it, normally `~/.kaggle` under your home directory:
        ```bash
        mkdir -p ~/.kaggle
        mv ~/Downloads/kaggle.json ~/.kaggle
        ```
    4. Install Kaggle CLI and download the data:
        ```bash
        pip install kaggle
        kaggle competitions download -c home-credit-credit-risk-model-stability
        unzip home-credit-credit-risk-model-stability.zip
        ```


### Data paths

Define the root directory for the downloaded data and set up the paths to the train and test directories.

In [2]:
from pathlib import Path
# Change the root path to yours
ROOT            = Path("/Users/claypot/Downloads/home-credit-credit-risk-model-stability")
TRAIN_DIR       = ROOT / "parquet_files" / "train"
TEST_DIR        = ROOT / "parquet_files" / "test"
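
Because every later cell reads from these two directories, a quick existence check can save a confusing error further down. The sketch below uses only `pathlib`; the helper name `missing_data_dirs` and the example path are hypothetical and not part of the original notebook.

```python
from pathlib import Path


def missing_data_dirs(root: Path) -> list[str]:
    """Return the names of the expected split directories that do not exist."""
    expected = {
        "train": root / "parquet_files" / "train",
        "test": root / "parquet_files" / "test",
    }
    return [name for name, path in expected.items() if not path.is_dir()]


# Hypothetical usage: an empty list means both directories were found.
print(missing_data_dirs(Path("path/to/home-credit-credit-risk-model-stability")))
```

If the list is not empty, the root path most likely still points at someone else's download location.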

### Import libraries

In [3]:
from glob import glob

import ibis
import ibis.expr.datatypes as dt
from ibis import _
import ibis_ml as ml

ibis.options.interactive = True

## 0. Competition overview

### 0.1 Objective
The aim of this competition is to predict which clients are more likely to default on their loans, leveraging both internal and external information available for each client. 

In this example, we will showcase how to use Ibis and IbisML for feature engineering and last-mile data preprocessing. Subsequently, we will build three separate classifiers:

- A neural network using PyTorch
- XGBoost
- LightGBM  

### 0.2 Data description
The dataset comprises numerous files due to the utilization of diverse data sources and varying levels of data aggregation during preparation. 

Overall, there are about 1.5 million rows and 465 columns, amounting to roughly 1.1 GB of Parquet data.

##### 0.2.1 Base tables
The base table (`train_base.parquet`) contains fundamental details about the training samples, each uniquely identified by a `case_id`. This identifier serves as the key for joining the other feature tables.
- `case_id` - This is the unique identifier for each credit case. You'll need this ID to join relevant tables to the base table. There are about 1.5 million unique loans.
- `date_decision` - This refers to the date when a decision was made regarding the approval of the loan.
- `WEEK_NUM` - This is the week number used for aggregation. In the test sample, WEEK_NUM continues sequentially from the last training value of WEEK_NUM.
- `MONTH` - This column represents the month and is intended for aggregation purposes.
- `target` - This is the target value, determined after a certain period based on whether or not the client defaulted on the specific credit case (loan).

##### 0.2.2 Feature tables

There are roughly 450 unique features derived from previous applications and external credit bureaus. All features can be used as predictors; their definitions can be found in the file `feature_definitions.csv` from this [link](https://www.kaggle.com/competitions/home-credit-credit-risk-model-stability/data).
For depth=0 tables, predictors can be used directly as features. For tables with depth>0, however, you need aggregation functions that condense the historical records associated with each `case_id` into a single feature. When `num_group1` or `num_group2` stands for a person index (the predictor definitions make this clear), the zero index has a special meaning: `num_groupN=0` denotes the applicant (the person who applied for the loan).

In [4]:
# Example of table with depth = 0
ibis.read_parquet(TRAIN_DIR / "train_static_cb_0.parquet").head()

In [5]:
# Example of table with depth = 1
ibis.read_parquet(TRAIN_DIR / "train_credit_bureau_b_1.parquet").relocate("num_group1").head()
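
To make the depth and `num_group1` conventions concrete, here is a plain-Python sketch over made-up records. The values are invented for illustration and `amount` is a hypothetical column. A depth-1 table holds several rows per `case_id`; when the group is a person index, the row with `num_group1 == 0` belongs to the applicant; and an aggregation condenses the history to one value per case.

```python
# Toy depth-1 records: several rows per case_id, indexed by num_group1.
records = [
    {"case_id": 1, "num_group1": 0, "amount": 100.0},
    {"case_id": 1, "num_group1": 1, "amount": 250.0},
    {"case_id": 2, "num_group1": 0, "amount": 80.0},
]

# When num_group1 is a person index, index 0 is the applicant.
applicant_rows = [r for r in records if r["num_group1"] == 0]

# Condense the history into a single feature per case_id (here: the maximum).
max_amount = {}
for r in records:
    cid = r["case_id"]
    max_amount[cid] = max(max_amount.get(cid, r["amount"]), r["amount"])

print(applicant_rows)
print(max_amount)
```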

## 1. Data loading and processing

We'll use Ibis to load the Parquet files (the same data is also available in CSV format) and process the data using IbisML.

While loading data into memory, we will perform the following data processing steps:

- **Union Datasets**: Combine multiple sub-files of the same dataset into one table, as some datasets are split into multiple sub-files with a common prefix.
- **Convert Data Types**: Ensure consistency by converting data types, as the same features in different sub-files may have different types.
- **Aggregate Features**: For tables with depth greater than 0, aggregate features based on `case_id`, including calculations for maximum values. You can also collect other statistics such as mean, median, mode, minimum, standard deviation, and others based on your needs.

#### 1.1 Cast columns to correct types

We'll use IbisML to create a series of Cast steps, forming a recipe for type conversion across the dataset. This conversion is based on the provided information extracted from column names. Predictors that underwent similar transformations are indicated by a capital letter at the end of their names.

- P - Transform DPD (Days past due)
- M - Masking categories
- A - Transform amount
- D - Transform date
- T - Unspecified Transform
- L - Unspecified Transform

We generate the `data_type_recipes` for data processing.

In [6]:
# Convert columns ending with P to float
step_cast_P_to_float = ml.Cast(ml.endswith("P"), dt.float64)
# Convert columns ending with A to float
step_cast_A_to_float = ml.Cast(ml.endswith("A"), dt.float64)
# Convert columns ending with D to date
step_cast_D_to_date = ml.Cast(ml.endswith("D"), dt.date)
# Convert columns ending with M to string
step_cast_M_to_str = ml.Cast(ml.endswith("M"), dt.str)

data_type_recipes = ml.Recipe(
    step_cast_P_to_float,
    step_cast_D_to_date,
    step_cast_M_to_str,
    step_cast_A_to_float,
    # Cast some special columns
    ml.Cast(["date_decision"], "date"),
    ml.Cast(["case_id", "WEEK_NUM", "num_group1", "num_group2"], dt.int64),
    ml.Cast([
        "cardtype_51L",
        "credacc_status_367L",
        "requesttype_4525192L",
        "riskassesment_302T",
        "max_periodicityofpmts_997L",
        # "max_empl_employedtotal_800L",
        # "max_empl_industry_691L"
        ], 
        dt.str
    ),
    ml.Cast(["isbidproductrequest_292L", "isdebitcard_527L", "equalityempfrom_62L"], dt.int64),
)
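
The suffix rules in the recipe can be read as a simple lookup from the last character of a column name to a target type. The following plain-Python sketch mirrors the four `endswith` casts only; it does not use IbisML, and the names `SUFFIX_TO_DTYPE` and `target_dtype` are hypothetical. Columns ending in `T` or `L` have no blanket rule, which is why the recipe casts a few of them by name.

```python
from typing import Optional

# Suffix -> target type, mirroring the four suffix-based Cast steps.
SUFFIX_TO_DTYPE = {"P": "float64", "A": "float64", "D": "date", "M": "string"}


def target_dtype(column_name: str) -> Optional[str]:
    """Return the type a column would be cast to, or None if its suffix has no rule."""
    return SUFFIX_TO_DTYPE.get(column_name[-1:])


for name in ["maxannuity_4075009A", "dateofbirth_342D", "education_88M", "cardtype_51L"]:
    print(name, "->", target_dtype(name))
```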

> IbisML offers a powerful set of column selectors, allowing you to select columns based on names, types, and patterns. For more information, you can refer to the IbisML column selectors [documentation](https://ibis-project.github.io/ibis-ml/reference/selectors.html).

#### 1.2 Data aggregation based on case_id

For tables with a depth greater than 0, which cannot be joined directly with the base table, we need to aggregate the features by `case_id`. You could compute the minimum, maximum, mean, median, and standard deviation for numeric columns, and the maximum and minimum for non-numeric columns.

Here, we use the maximum as an example:

In [7]:
def agg_by_id(table):
    return table.group_by("case_id").agg(
            [
                table[col_name].max().name(f"max_{col_name}")
                for col_name in table.columns 
                if col_name[-1] in ("T", "L", "P", "A", "D", "M")
            ]
    )
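
For readers who want to see what this aggregation produces without running a backend, here is a plain-Python sketch of the same idea: group rows by `case_id`, keep only columns whose name ends in one of the feature suffixes, and take the maximum while skipping missing values (as SQL `MAX` does with `NULL`). The function name `max_by_case_id` and the toy column names are hypothetical.

```python
SUFFIXES = ("T", "L", "P", "A", "D", "M")


def max_by_case_id(rows):
    """Group rows (dicts) by case_id and keep the maximum of each suffixed column.

    None stands in for a missing value and is ignored unless every value is missing.
    """
    result = {}
    for row in rows:
        agg = result.setdefault(row["case_id"], {})
        for col, value in row.items():
            if col[-1] not in SUFFIXES:
                continue  # skips case_id, num_group1, and other non-feature columns
            key = f"max_{col}"
            current = agg.get(key)
            if value is not None and (current is None or value > current):
                agg[key] = value
            else:
                agg.setdefault(key, None)
    return result


toy_rows = [
    {"case_id": 1, "num_group1": 0, "toy_amount_A": 10.0, "toy_status_L": "A"},
    {"case_id": 1, "num_group1": 1, "toy_amount_A": None, "toy_status_L": "K"},
    {"case_id": 2, "num_group1": 0, "toy_amount_A": None, "toy_status_L": "D"},
]
print(max_by_case_id(toy_rows))
```

Note that the maximum of a string column is its lexicographically largest value, which is rarely meaningful on its own; that is one reason to consider other statistics for non-numeric columns.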

#### 1.3 Read and process the data files

In [8]:

def read_and_process_files(file_path, depth=None, is_regex=False):
    """
    Read and process Parquet files, either a single file or multiple files matching a glob pattern.
    
    Args:
        file_path (str or Path): Path to the file, or a glob pattern (e.g. "train_static_0_*.parquet") to match files.
        depth (int, optional): Depth of the table. If 1 or 2, the features are aggregated by case_id. Defaults to None.
        is_regex (bool, optional): Whether file_path is a glob pattern (despite the name, it is not a regular expression). Defaults to False.
    
    Returns:
        ibis.Table: The processed Ibis table.
    """
    if is_regex:
        # Read and union multiple files
        chunks = []
        for path in glob(str(file_path)):
            chunk = ibis.read_parquet(path)
            # Transform table using IbisML
            chunk = data_type_recipes.fit(chunk).to_ibis(chunk)
            chunks.append(chunk)
        table = ibis.union(*chunks)
    else:
        # Read a single file
        table = ibis.read_parquet(file_path)
    
        # Transform table using IbisML
        table = data_type_recipes.fit(table).to_ibis(table)
    
    if depth in [1, 2]:
        # Perform aggregation if depth is 1 or 2
        table = agg_by_id(table)
    
    return table
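
Despite the `is_regex` name, the function passes the pattern to `glob`, so `*` is a shell-style wildcard rather than a regular-expression quantifier. The sketch below illustrates the matching rule with the standard-library `fnmatch` module, which applies the same wildcard rules to file names. The sub-file names in `candidates` are assumptions about how the split files are named.

```python
from fnmatch import fnmatch

pattern = "train_static_0_*.parquet"  # shell-style pattern for the split static files
candidates = [
    "train_static_0_0.parquet",   # assumed sub-file name
    "train_static_0_1.parquet",   # assumed sub-file name
    "train_static_cb_0.parquet",  # a different table, should not match
]

# Keep only the names that the wildcard pattern matches.
matched = [name for name in candidates if fnmatch(name, pattern)]
print(matched)
```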



In [9]:
train_data_store = {
    "df_base": read_and_process_files(TRAIN_DIR / "train_base.parquet"),
    "depth_0": [
        read_and_process_files(TRAIN_DIR / "train_static_cb_0.parquet"),
        read_and_process_files(TRAIN_DIR / "train_static_0_*.parquet", is_regex=True),
    ],
    "depth_1": [
        read_and_process_files(TRAIN_DIR / "train_applprev_1_*.parquet", 1, is_regex=True),
        read_and_process_files(TRAIN_DIR / "train_tax_registry_a_1.parquet", 1),
        read_and_process_files(TRAIN_DIR / "train_tax_registry_b_1.parquet", 1),
        read_and_process_files(TRAIN_DIR / "train_tax_registry_c_1.parquet", 1),
        read_and_process_files(TRAIN_DIR / "train_credit_bureau_b_1.parquet", 1),
        read_and_process_files(TRAIN_DIR / "train_other_1.parquet", 1),
        read_and_process_files(TRAIN_DIR / "train_person_1.parquet", 1),
        read_and_process_files(TRAIN_DIR / "train_deposit_1.parquet", 1),
        read_and_process_files(TRAIN_DIR / "train_debitcard_1.parquet", 1),
    ],
    "depth_2": [
        read_and_process_files(TRAIN_DIR / "train_credit_bureau_b_2.parquet", 2),
    ]
}

test_data_store = {
    "df_base": read_and_process_files(TEST_DIR / "test_base.parquet"),
    "depth_0": [
        read_and_process_files(TEST_DIR / "test_static_cb_0.parquet"),
        read_and_process_files(TEST_DIR / "test_static_0_*.parquet", is_regex=True),
    ],
    "depth_1": [
        read_and_process_files(TEST_DIR / "test_applprev_1_*.parquet", 1, is_regex=True),
        read_and_process_files(TEST_DIR / "test_tax_registry_a_1.parquet", 1),
        read_and_process_files(TEST_DIR / "test_tax_registry_b_1.parquet", 1),
        read_and_process_files(TEST_DIR / "test_tax_registry_c_1.parquet", 1),
        read_and_process_files(TEST_DIR / "test_credit_bureau_b_1.parquet", 1),
        read_and_process_files(TEST_DIR / "test_other_1.parquet", 1),
        read_and_process_files(TEST_DIR / "test_person_1.parquet", 1),
        read_and_process_files(TEST_DIR / "test_deposit_1.parquet", 1),
        read_and_process_files(TEST_DIR / "test_debitcard_1.parquet", 1),
    ],
    "depth_2": [
        read_and_process_files(TEST_DIR / "test_credit_bureau_b_2.parquet", 2),
    ]
}

### 1.4 Join features with the base data

Join all the features from different sources to the base table.

In [10]:
def join_data(df_base, depth_0, depth_1, depth_2):
    for i, df in enumerate(depth_0 + depth_1 + depth_2):
        df_base = df_base.join(df, "case_id", how="left", rname="{name}_right" + f"_{i}" )
    return df_base

In [11]:
df_train = join_data(**train_data_store)
df_test = join_data(**test_data_store)
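
The `rname` argument is a format template. Assuming Ibis substitutes the colliding column name for `{name}` (which is what names such as `case_id_right_6` later in this notebook suggest), the resulting names can be reproduced with plain string formatting. The helper `right_name` is hypothetical.

```python
def right_name(name: str, i: int) -> str:
    """Build the renamed column the same way the `rname` argument is assembled."""
    rname = "{name}_right" + f"_{i}"  # e.g. "{name}_right_6" for the table at index 6
    return rname.format(name=name)


print(right_name("case_id", 6))
```

Including the loop index in the suffix keeps the renamed columns unique across successive joins.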

### 1.5 Remove less meaningful columns

We filter out columns based on the following criteria:

- Remove columns where the proportion of null values exceeds 0.95.
- Remove string (categorical) columns that have only one unique value or more than 50 unique values.
- Remove `month_` and `year_` columns whose names do not end in `P`, `A`, `L`, or `M`.

In [12]:
# Use this to find removed_cols. It takes a couple of minutes, so you can directly use the output in the next cell instead.
# removed_cols = []
# for colname in df_train.columns:
 
#     null_frac = df_train[colname].isnull().mean().execute()
#     freq = df_train[colname].nunique().execute()
#     if colname not in ["target", "case_id", "WEEK_NUM"] and null_frac > 0.95:
#         removed_cols.append(colname)
#     if colname not in ["target", "case_id", "WEEK_NUM"] and str(df_train[colname].type()) ==  "string":
#         if (freq == 1) | (freq > 50):
#             removed_cols.append(colname)
#     if (colname[-1] not in ["P", "A", "L", "M"]) and (('month_' in colname) or ('year_' in colname)):
#         removed_cols.append(colname)
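
The first two rules can be illustrated on plain Python lists, with `None` standing in for a null. This is only a sketch of the decision logic under the thresholds used above; the function name `should_remove` is hypothetical, it expects a non-empty column, and distinct counts ignore nulls as a SQL `COUNT(DISTINCT ...)` would.

```python
def should_remove(values, is_string, max_null_frac=0.95, max_unique=50):
    """Apply the null-fraction and cardinality rules to one column's values."""
    null_frac = sum(v is None for v in values) / len(values)
    if null_frac > max_null_frac:
        return True
    if is_string:
        n_unique = len({v for v in values if v is not None})
        return n_unique == 1 or n_unique > max_unique
    return False


print(should_remove([None] * 99 + [1.0], is_string=False))  # mostly null
print(should_remove(["a", "a", None], is_string=True))      # a single category
print(should_remove(["a", "b", None], is_string=True))      # kept
```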

In [13]:
removed_cols = [
    'assignmentdate_4955616D',
    'dateofbirth_342D',
    'for3years_128L',
    'for3years_504L',
    'for3years_584L',
    'formonth_118L',
    'formonth_206L',
    'formonth_535L',
    'forquarter_1017L',
    'forquarter_462L',
    'forquarter_634L',
    'fortoday_1092L',
    'forweek_1077L',
    'forweek_528L',
    'forweek_601L',
    'foryear_618L',
    'foryear_818L',
    'foryear_850L',
    'pmtaverage_4955615A',
    'pmtcount_4955617L',
    'riskassesment_302T',
    'riskassesment_940T',
    'bankacctype_710L',
    'clientscnt_136L',
    'equalityempfrom_62L',
    'interestrategrace_34L',
    'isbidproductrequest_292L',
    'lastapprcommoditytypec_5251766M',
    'lastcancelreason_561M',
    'lastdependentsnum_448L',
    'lastotherinc_902A',
    'lastotherlnsexpense_631A',
    'lastrejectcommodtypec_5251769M',
    'lastrepayingdate_696D',
    'maxannuity_4075009A',
    'paytype1st_925L',
    'paytype_783L',
    'payvacationpostpone_4187118D',
    'previouscontdistrict_112M',
    'typesuite_864L',
    'max_cancelreason_3545846M',
    'max_district_544M',
    'max_profession_152M',
    'max_name_4527232M',
    'max_name_4917606M',
    'max_employername_160M',
    'case_id_right_6',
    'max_amount_1115A',
    'max_classificationofcontr_1114M',
    'max_contractdate_551D',
    'max_contractmaturitydate_151D',
    'max_contractst_516M',
    'max_contracttype_653M',
    'max_credlmt_1052A',
    'max_credlmt_228A',
    'max_credlmt_3940954A',
    'max_credor_3940957M',
    'max_credor_3940957M',
    'max_credquantity_1099L',
    'max_credquantity_984L',
    'max_debtpastduevalue_732A',
    'max_debtvalue_227A',
    'max_dpd_550P',
    'max_dpd_733P',
    'max_dpdmax_851P',
    'max_dpdmaxdatemonth_804T',
    'max_dpdmaxdatemonth_804T',
    'max_dpdmaxdateyear_742T',
    'max_dpdmaxdateyear_742T',
    'max_installmentamount_644A',
    'max_installmentamount_833A',
    'max_instlamount_892A',
    'max_interesteffectiverate_369L',
    'max_interestrateyearly_538L',
    'max_lastupdate_260D',
    'max_maxdebtpduevalodued_3940955A',
    'max_numberofinstls_810L',
    'max_overdueamountmax_950A',
    'max_overdueamountmaxdatemonth_494T',
    'max_overdueamountmaxdatemonth_494T',
    'max_overdueamountmaxdateyear_432T',
    'max_overdueamountmaxdateyear_432T',
    'max_periodicityofpmts_997L',
    'max_periodicityofpmts_997M',
    'max_pmtdaysoverdue_1135P',
    'max_pmtmethod_731M',
    'max_pmtnumpending_403L',
    'max_purposeofcred_722M',
    'max_residualamount_1093A',
    'max_residualamount_127A',
    'max_residualamount_3940956A',
    'max_subjectrole_326M',
    'max_subjectrole_43M',
    'max_totalamount_503A',
    'max_totalamount_881A',
    'case_id_right_7',
    'max_amtdebitincoming_4809443A',
    'max_amtdebitoutgoing_4809440A',
    'max_amtdepositbalance_4809441A',
    'max_amtdepositincoming_4809444A',
    'max_amtdepositoutgoing_4809442A',
    'max_birthdate_87D',
    'max_childnum_185L',
    'max_contaddr_district_15M',
    'max_contaddr_zipcode_807M',
    'max_gender_992L',
    'max_housingtype_772L',
    'max_isreference_387L',
    'max_maritalst_703L',
    'max_registaddr_district_1083M',
    'max_registaddr_zipcode_184M',
    'max_role_993L',
    'max_role_993L',
    'max_contractenddate_991D',
    'max_last180dayaveragebalance_704A',
    'max_last180dayturnover_1134A',
    'max_last30dayturnover_651A',
    'case_id_right_11',
    'max_pmts_date_1107D',
    'max_pmts_dpdvalue_108P',
    'max_pmts_pmtsoverdue_635A',
    "max_empl_employedtotal_800L",
    "max_empl_industry_691L",
 ]

In [14]:

df_train = df_train.drop(removed_cols)
df_train


In [15]:

df_test = df_test.drop(removed_cols)
df_test

#### Double-check the schema between the train and test datasets

In [16]:
td = dict(df_train.schema())
tt = dict(df_test.schema())
for col, t in td.items():
    if col in tt:
        f = tt[col]
        if f != t:
            print(col, t, f)

## 2. Last-mile data preprocessing

Last-mile preprocessing is a crucial part of the machine learning (ML) workflow because it puts the data into a format the model can consume, which supports accurate and efficient training and prediction. We will perform the following transformations before feeding the data to the models:
- Drop features with zero variance and redundant features
- Missing value imputation
- Encoding categorical variables
- Handling date variables
- Handling outliers
- Scaling and normalization



### 2.1 Drop features

In [17]:
# Drop zero-variance features across all columns:
# - for numerical columns, drop features with 0 variance
# - for non-numerical columns, drop features with one or fewer unique values
step_drop_zero_variance = ml.DropZeroVariance(ml.everything())

# Drop redundant case_id_* 
step_drop_reduncant_id = ml.Drop(ml.startswith("case_id_right_"))

### 2.2 Imputing

In [18]:
step_impute_mode = ml.ImputeMode(ml.string())
step_impute_median = ml.ImputeMedian(ml.numeric())
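
Conceptually, both imputation steps learn a single fill value per column from the training data and then substitute it for missing entries. The plain-Python sketch below shows that idea with the standard `statistics` module; the helper names are hypothetical and this is not IbisML's implementation.

```python
from statistics import median, mode


def fit_fill_value(train_values, strategy):
    """Learn the fill value from the non-missing training values."""
    observed = [v for v in train_values if v is not None]
    return median(observed) if strategy == "median" else mode(observed)


def impute(values, fill_value):
    """Replace missing entries (None) with the learned fill value."""
    return [fill_value if v is None else v for v in values]


numeric_fill = fit_fill_value([1.0, None, 3.0, 10.0], "median")
string_fill = fit_fill_value(["a", "b", "a", None], "mode")
print(impute([None, 2.0], numeric_fill), impute([None, "b"], string_fill))
```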

### 2.3 Encoding categorical variables

In [19]:
target_encoding_step = ml.TargetEncode([
    "education_1103M",
    "education_88M",
    "max_education_1138M",
    "max_employername_160M",
    "max_district_544M",
    "previouscontdistrict_112M",
    "lastapprcommoditycat_1041M",
    "lastcancelreason_561M",
    "lastrejectcommoditycat_161M",
    "lastrejectreason_759M",
])
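
Target encoding replaces each category with a statistic of the target observed for that category in the training data. The sketch below shows the simplest unsmoothed form, the per-category mean of a binary target; IbisML's `TargetEncode` may apply smoothing or handle unseen categories differently, so treat this as an illustration of the idea only. The helper names and the fallback to a default value are assumptions.

```python
from collections import defaultdict


def fit_target_means(categories, targets):
    """Map each category to the mean of the binary target observed for it."""
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    return {cat: sums[cat] / counts[cat] for cat in sums}


def encode(categories, means, default):
    """Replace categories by their learned mean; unseen ones get the default."""
    return [means.get(cat, default) for cat in categories]


means = fit_target_means(["a", "a", "b", "b"], [1, 0, 0, 0])
print(means)
print(encode(["a", "c"], means, default=0.25))
```

Because one number replaces the whole category, this encoding suits high-cardinality columns where one-hot encoding would create too many features.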

In [20]:
ohe_step = ml.OneHotEncode([
    "bankacctype_710L",
    "cardtype_51L",
    "credtype_322L",
    "disbursementtype_67L",
    "inittransactioncode_186L",
    # "lastapprcommoditycat_1041M",
    "lastapprcommoditytypec_5251766M",
    #"lastcancelreason_561M",
    # "lastrejectcommoditycat_161M",
    "lastrejectcommodtypec_5251769M",
    #"lastrejectreason_759M",
    "lastrejectreasonclient_4145040M",
    "lastst_736L",
    "paytype1st_925L",
    "paytype_783L",
    #"previouscontdistrict_112M",
    "twobodfilling_608L",
    "max_credtype_587L",
    # "max_district_544M",
    "max_familystate_726L",
    "max_inittransactioncode_279L",
    "max_postype_4733339M",
    "max_profession_152M",
    "max_rejectreason_755M",
    "max_rejectreasonclient_4145042M",
    "max_status_219L",
    # "max_employername_160M",
    "max_classificationofcontr_1114M",
    "max_contractst_516M",
])
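
One-hot encoding turns a categorical column into one 0/1 indicator per category seen during fitting. The following plain-Python sketch illustrates this; the category values are made up, the helper names are hypothetical, and mapping unseen or missing values to all zeros is an assumption rather than a statement about IbisML's behavior.

```python
def fit_categories(train_values):
    """Collect the distinct non-missing categories seen in training, in sorted order."""
    return sorted({v for v in train_values if v is not None})


def one_hot(value, categories):
    """Return one 0/1 indicator per known category; unseen or missing values give all zeros."""
    return [int(value == cat) for cat in categories]


categories = fit_categories(["CASH", "CARD", None, "CASH"])
print(categories)
print(one_hot("CASH", categories), one_hot("OTHER", categories))
```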

### 2.4 Handling date variables

In [21]:
# Calculate all the days difference between any date columns and the column `date_decision`
date_cols = [col_name for col_name in df_train.columns if col_name[-1] == "D"]
days_to_decision_expr = {
        f"{col}_date_decision_diff": (_.date_decision.epoch_seconds() - getattr(_, col).epoch_seconds()) / (60 * 60 * 24)
        for col in date_cols
}
days_to_decision_step = ml.Mutate(days_to_decision_expr)
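
The expression above subtracts epoch seconds and divides by the number of seconds in a day. The same arithmetic can be checked on two concrete dates with the standard library; the helper names are hypothetical and the dates are arbitrary examples.

```python
from datetime import date, datetime, timezone

SECONDS_PER_DAY = 60 * 60 * 24


def epoch_seconds(d: date) -> float:
    """Seconds since 1970-01-01 UTC for midnight of the given date."""
    return datetime(d.year, d.month, d.day, tzinfo=timezone.utc).timestamp()


def days_before_decision(date_decision: date, other: date) -> float:
    """Positive when `other` lies before the decision date."""
    return (epoch_seconds(date_decision) - epoch_seconds(other)) / SECONDS_PER_DAY


print(days_before_decision(date(2020, 1, 10), date(2020, 1, 1)))  # 9.0
```

The sign convention matters: dates after the decision date yield negative differences.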


In [22]:
# Extract information from the date columns
expand_date_step = ml.ExpandDate(ml.date(), ["week", "day"])  # dow and month are set to category, but they are int

### 2.5 Construct last-mile preprocessing recipe

In [23]:
last_mile_ml_recipes = ml.Recipe(
    # Drop cols with 0 variance
    step_drop_zero_variance,
    # remove extra case_id_right_*
    step_drop_reduncant_id,

    # handle date cols
    days_to_decision_step,
    expand_date_step,
    ml.Drop(ml.date()),
    ml.Drop(["MONTH", "WEEK_NUM"]),

    # handle string columns
    ohe_step,
    target_encoding_step,
    step_impute_mode,
    ml.Drop(ml.string()),

    # handle numeric cols
    # Capping outliers
    ml.HandleUnivariateOutliers(["days180_256L"]), 
    ml.ImputeMedian(ml.numeric()),
    ml.ScaleMinMax(ml.numeric()),

    # cast bool to int
    ml.Cast(ml.has_type("bool"), "float32"),
    ml.FillNA(ml.numeric(), 0),
    ml.Cast(ml.numeric(), "float32"),
)
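
The outlier step in the recipe caps extreme values instead of dropping rows. As a rough illustration, the sketch below caps values at the mean plus or minus a multiple of the standard deviation, with the bounds learned from training data. The choice of a z-score rule, the factor of 3, and the use of the population standard deviation are assumptions for this sketch, not a description of `HandleUnivariateOutliers` internals.

```python
from statistics import mean, pstdev


def fit_zscore_bounds(train_values, deviation_factor=3.0):
    """Learn lower/upper bounds as mean +/- deviation_factor * standard deviation."""
    mu, sigma = mean(train_values), pstdev(train_values)
    return mu - deviation_factor * sigma, mu + deviation_factor * sigma


def cap(values, lower, upper):
    """Clip every value into the [lower, upper] range."""
    return [min(max(v, lower), upper) for v in values]


lower, upper = fit_zscore_bounds([0.0, 2.0], deviation_factor=1.0)
print(lower, upper)
print(cap([-5.0, 1.0, 9.0], lower, upper))
```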

## 3 Modeling

After completing data preprocessing with Ibis and IbisML, we proceed to the modeling phase. Here are two approaches:

1. IbisML can be used for data preprocessing and integrates seamlessly with various modeling frameworks, including:
     - Neural network
     - XGBoost
     - LightGBM
2. IbisML recipes can be incorporated as components within an sklearn Pipeline. You can find more information in this [tutorial](https://ibis-project.github.io/ibis-ml/tutorial/xgboost.html).

We will focus on option 1 in this example.

### 3.1 Train and test data splitting

In [24]:
# Put 3/4 of the data into the training set
df_train = df_train.mutate(
    train=df_train.case_id.hash().abs() % 4 < 3
)

# Create data frames for the two sets:
train_data = df_train[df_train.train].drop("train")
test_data = df_train[~df_train.train].drop("train")

X_train = train_data.drop("target")
y_train = train_data.target.cast(dt.int64).name("target")

X_test = test_data.drop("target")
y_test = test_data.target
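
Splitting on a hash of `case_id` makes the assignment deterministic: the same case always lands in the same split, with no random seed to manage. The sketch below reproduces the idea in plain Python, using SHA-256 as a stand-in for the backend's hash function (an assumption; the actual hash differs, so the individual assignments will not match the notebook's). The helper name `in_train_split` is hypothetical.

```python
import hashlib


def in_train_split(case_id: int, buckets: int = 4, train_buckets: int = 3) -> bool:
    """Assign a case to the training split based on a stable hash of its id."""
    digest = hashlib.sha256(str(case_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets < train_buckets


ids = range(10_000)
train_ids = [i for i in ids if in_train_split(i)]
print(len(train_ids) / len(ids))  # close to 0.75
```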

### 3.2 IbisML preprocessing fit and transform

In [25]:
# Train preprocessing recipe using training dataset
last_mile_ml_recipes.fit(X_train, y_train)


In the previous cell, we fitted the recipe on the training dataset. Now we transform both the train and test datasets for modeling. IbisML recipes can output data in various formats, such as NumPy, pandas, Polars, PyArrow, Dask DataFrame, and XGBoost DMatrix, which makes them compatible with a range of modeling frameworks.

In [None]:
# Transform the train and test datasets using the IbisML recipe
X_train_transformed = last_mile_ml_recipes.transform(X_train)
X_test_transformed = last_mile_ml_recipes.transform(X_test)
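
The reason for fitting on the training split and only transforming the held-out split is that every learned statistic (medians, category sets, scaling ranges) must come from training data alone. A minimal plain-Python sketch with min-max scaling shows the consequence; the helper names are hypothetical and the numbers are made up.

```python
def fit_min_max(train_values):
    """Learn the scaling range from the training data only."""
    return min(train_values), max(train_values)


def scale_min_max(values, lo, hi):
    """Rescale with the learned range; a constant column maps to 0.0."""
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


lo, hi = fit_min_max([10.0, 20.0, 30.0])          # statistics come from the training split
print(scale_min_max([10.0, 20.0, 30.0], lo, hi))  # [0.0, 0.5, 1.0]
print(scale_min_max([40.0], lo, hi))              # [1.5]: held-out values can fall outside [0, 1]
```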


### 3.3 Neural network classifier using PyTorch

In [None]:
# pip install --upgrade pytorch-lightning torch torchvision

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl
from pytorch_lightning import Trainer

class SimpleNeuralNetClassifier(pl.LightningModule):
    def __init__(self, input_dim, hidden_dim=64, output_dim=1):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim)
        )
        self.loss = nn.BCEWithLogitsLoss()  # Binary cross-entropy loss
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = self.loss(y_hat.squeeze(), y)
        self.log('train_loss', loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = self.loss(y_hat.squeeze(), y)
        self.log('val_loss', loss)
        return loss

    def configure_optimizers(self):
        return optim.Adam(self.parameters(), lr=0.001)
    
    def predict_proba(self, x):
        # Return the probability predictions
        with torch.no_grad():
            self.eval()
            return self.sigmoid(self(x))

# Initialize your LightningModule
nn_classifier = SimpleNeuralNetClassifier(input_dim=X_train_transformed.shape[1])

# Create PyTorch Lightning data loaders
train_dataset = TensorDataset(
    torch.Tensor(X_train_transformed),
    torch.Tensor(y_train.to_pandas().to_numpy())
)
val_dataset = TensorDataset(
    torch.Tensor(X_test_transformed),
    torch.Tensor(y_test.to_pandas().to_numpy())
)
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=128, shuffle=False)

# Initialize a Trainer
trainer = Trainer(max_epochs=4, log_every_n_steps=100)

# Train the model
trainer.fit(nn_classifier, train_loader, val_loader)


GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/Users/claypot/miniconda3/envs/ibisml-dev/lib/python3.12/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `pytorch_lightning` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default

  | Name    | Type              | Params | Mode 
------------------------------------------------------
0 | model   | Sequential        | 24.3 K | train
1 | loss    | BCEWithLogitsLoss | 0      | train
2 | sigmoid | Sigmoid           | 0      | train
------------------------------------------------------
24.3 K    Trainable par

`Trainer.fit` stopped: `max_epochs=2` reached.


In [None]:

y_pred = nn_classifier.predict_proba(torch.Tensor(X_test_transformed))
y_pred

tensor([[0.0337],
        [0.0358],
        [0.0350],
        ...,
        [0.0203],
        [0.0208],
        [0.0198]])
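
The classifier above outputs raw logits during training, pairs them with `BCEWithLogitsLoss`, and applies the sigmoid only in `predict_proba`. The plain-Python sketch below shows why the two fit together: the loss can be computed directly from the logit in a numerically stable form, and it agrees with the cross-entropy of the sigmoid probability. The function names are hypothetical, and this is a scalar illustration rather than PyTorch's implementation.

```python
import math


def sigmoid(z: float) -> float:
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def bce_with_logits(z: float, y: float) -> float:
    """Binary cross-entropy computed directly from the logit, in a numerically stable form."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


# For a positive example, the loss equals -log(sigmoid(z)).
z = 2.0
print(bce_with_logits(z, 1.0), -math.log(sigmoid(z)))
```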

### 3.4 XGBoost


In [None]:
# pip install xgboost

In [None]:
import xgboost as xgb

# Build a simple XGBoost classifier
xgboost = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=4,
    learning_rate=0.05, 
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42
)
# Fit the model using training dataset
xgboost.fit(X_train_transformed, y_train)


array([0.03383248, 0.03363751, 0.03548482, ..., 0.02272419, 0.02288411,
       0.02252967], dtype=float32)

In [None]:
# Predict the probability of the positive class (here, on the training data)
xgboost.predict_proba(X_train_transformed)[:, 1]

### 3.5 LightGBM

In [None]:
# pip install lightgbm

In [None]:
from lightgbm import LGBMClassifier

lgbm = LGBMClassifier(
    objective='binary',     
    random_state=5,        
    learning_rate=0.05,     
    n_estimators=100,        
    max_depth=4,             
    subsample=0.8,          
    colsample_bytree=0.8,    
)
# Train
lgbm.fit(X_train_transformed, y_train)


In [None]:
# Predict the probability of the positive class (here, on the training data)
lgbm.predict_proba(X_train_transformed)[:, 1]

## References
- [1st Place Solution](https://www.kaggle.com/code/yuuniekiri/fork-of-home-credit-risk-lightgbm)
- [home-credit-2024-starter-notebook](https://www.kaggle.com/code/jetakow/home-credit-2024-starter-notebook)
- [EDA and Submission](https://www.kaggle.com/competitions/home-credit-credit-risk-model-stability/discussion/508337)
- [Home Credit Baseline](https://www.kaggle.com/code/greysky/home-credit-baseline)