## LightGBM on Lending Club Dataset

In this notebook, we train a LightGBM binary classifier on the Lending Club dataset.

In [1]:
import lightgbm as lgb
import shap
import numpy as np
import pandas as pd
import onnxruntime as rt

from onnxmltools import convert_lightgbm, convert_sklearn
from onnxmltools.convert.common.data_types import FloatTensorType
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from collections import Counter

SEED = 2022

In [5]:
# Read the dataset; the label is the last column, the rest are features
data = pd.read_csv('../data/lending-club.csv')
y_all = data['loan_approval']
x_all = data.iloc[:, :-1]

# One-hot encode the categorical columns
x_all = pd.get_dummies(x_all, prefix_sep='-')

# Train test split (6:4)
x_train, x_test, y_train, y_test = train_test_split(
    x_all, y_all, test_size=0.4, random_state=SEED
)

# Convert data frames to numpy arrays
feature_names = x_all.columns
x_train_matrix = x_train.to_numpy()
x_test_matrix = x_test.to_numpy()
y_train_array = y_train.to_numpy()
y_test_array = y_test.to_numpy()

# Create lightgbm dataset
d_train = lgb.Dataset(x_train, label=y_train)
d_test = lgb.Dataset(x_test, label=y_test)
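
The one-hot step in the cell above relies on `pd.get_dummies` naming each indicator column `<column><prefix_sep><category>`. The pure-Python sketch below mimics only that naming convention on a tiny made-up table. The column names (`loan_amount`, `home_ownership`), the values, and the helper `one_hot` are hypothetical and are not taken from the Lending Club file or from pandas.

```python
def one_hot(rows, categorical, prefix_sep="-"):
    """Expand each categorical column into 0/1 indicator columns."""
    encoded = [dict() for _ in rows]
    # Non-categorical columns are copied through unchanged.
    for out, row in zip(encoded, rows):
        for name, value in row.items():
            if name not in categorical:
                out[name] = value
    # One indicator column per (column, category) pair, categories sorted.
    for name in categorical:
        for category in sorted({row[name] for row in rows}):
            key = f"{name}{prefix_sep}{category}"
            for out, row in zip(encoded, rows):
                out[key] = int(row[name] == category)
    return encoded


# Hypothetical two-row table with one numeric and one categorical column.
rows = [
    {"loan_amount": 5000, "home_ownership": "RENT"},
    {"loan_amount": 12000, "home_ownership": "OWN"},
]
encoded = one_hot(rows, categorical=["home_ownership"])
print(list(encoded[0]))
```

Using `-` as the separator makes it straightforward to split an encoded column name back into its original feature and category, provided the original feature names do not themselves contain `-`.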

In [4]:
params = {
    "verbose": 0,
    "learning_rate": 0.01,
    "max_bin": 512,
    "boosting_type": "gbdt",
    "objective": "binary",
    "metric": "binary_logloss",
    # Row subsampling (bagging) fraction, intended to reduce overfitting.
    # Note: LightGBM only applies it when "bagging_freq" is also set to a positive value.
    "subsample": 0.5,
    "min_data": 100,
    "boost_from_average": True,
}

model = lgb.train(
    params,
    d_train,
    5000,
    valid_sets=[d_test],
    early_stopping_rounds=50,
)

'early_stopping_rounds' argument is deprecated and will be removed in a future release of LightGBM. Pass 'early_stopping()' callback via 'callbacks' argument instead.


You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[1]	valid_0's binary_logloss: 0.508791
Training until validation scores don't improve for 50 rounds
[2]	valid_0's binary_logloss: 0.508113
[3]	valid_0's binary_logloss: 0.507385
[4]	valid_0's binary_logloss: 0.506748
[5]	valid_0's binary_logloss: 0.506072
[6]	valid_0's binary_logloss: 0.505469
[7]	valid_0's binary_logloss: 0.504854
[8]	valid_0's binary_logloss: 0.504265
[9]	valid_0's binary_logloss: 0.503715
[10]	valid_0's binary_logloss: 0.503128
[11]	valid_0's binary_logloss: 0.502621
[12]	valid_0's binary_logloss: 0.502099
[13]	valid_0's binary_logloss: 0.501605
[14]	valid_0's binary_logloss: 0.501079
[15]	valid_0's binary_logloss: 0.50057
[16]	valid_0's binary_logloss: 0.500097
[17]	valid_0's binary_logloss: 0.499595
[18]	valid_0's binary_logloss: 0.499108
[19]	valid_0's binary_logloss: 0.49863
[20]	valid_0's binary_logloss: 0.498122
[21]	valid_0's binary_loglos
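
The log line "Training until validation scores don't improve for 50 rounds" describes a patience rule: training stops once the validation loss has gone a fixed number of rounds without a new minimum, and the best round is kept. The sketch below illustrates that rule in plain Python on made-up loss values. The function name `best_iteration` and the numbers are assumptions for illustration, and this is not LightGBM's actual implementation.

```python
def best_iteration(losses, patience):
    """Return (round, loss) of the best score under a patience rule.

    Rounds are 1-indexed, matching the training log.
    """
    best_loss = float("inf")
    best_round = 0
    for current_round, loss in enumerate(losses, start=1):
        if loss < best_loss:
            # New minimum: remember it and keep training.
            best_loss, best_round = loss, current_round
        elif current_round - best_round >= patience:
            # No improvement for `patience` rounds: stop early.
            break
    return best_round, best_loss


# Hypothetical validation losses; the last value is never reached.
losses = [0.50, 0.49, 0.48, 0.485, 0.49, 0.47]
print(best_iteration(losses, patience=2))
```

As the deprecation warning in the output notes, newer LightGBM releases expect this behaviour to be requested with `callbacks=[lgb.early_stopping(stopping_rounds=50)]` rather than the `early_stopping_rounds` argument.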

In [5]:
# Predicted probabilities of the positive class, thresholded at 0.5
train_pred_label = (model.predict(x_train) >= 0.5).astype(int)
train_acc = accuracy_score(y_train, train_pred_label)

test_pred_label = (model.predict(x_test) >= 0.5).astype(int)
test_acc = accuracy_score(y_test, test_pred_label)

print(f'Train accuracy: {train_acc:.4}, Test accuracy: {test_acc:.4}')

Train accuracy: 0.8183, Test accuracy: 0.7945


### Export the Model as ONNX

To use this model from Rust and on the web, we export it to ONNX.

In [19]:
initial_types = [("float_input", FloatTensorType([None, x_train.shape[1]]))]
model_onnx = convert_lightgbm(model, initial_types=initial_types)

The maximum opset needed by this model is only 9.


In [20]:
# Save the ONNX model
with open('./lending-club-lightgbm.onnx', 'wb') as fp:
    fp.write(model_onnx.SerializeToString())

### ONNX Model Inference

We run inference with the saved ONNX model and check that its test accuracy matches the LightGBM model's.


In [6]:
session = rt.InferenceSession('./lending-club-lightgbm.onnx')
y_pred_onnx = session.run(None, {'float_input': x_test.astype(np.float32).to_numpy()})

2023-01-31 13:10:33.806247 [W:onnxruntime:, execution_frame.cc:828 VerifyOutputSizes] Expected shape from model of {1} does not match actual shape of {2000} for output label


In [7]:
y_pred_onnx

[array([1, 1, 1, ..., 1, 1, 1], dtype=int64),
 [{0: 0.1581752896308899, 1: 0.8418247103691101},
  {0: 0.22155171632766724, 1: 0.7784482836723328},
  {0: 0.2402471899986267, 1: 0.7597528100013733},
  {0: 0.10018116235733032, 1: 0.8998188376426697},
  {0: 0.11543935537338257, 1: 0.8845606446266174},
  {0: 0.17755955457687378, 1: 0.8224404454231262},
  {0: 0.12994492053985596, 1: 0.870055079460144},
  {0: 0.1724655032157898, 1: 0.8275344967842102},
  {0: 0.1221851110458374, 1: 0.8778148889541626},
  {0: 0.23944509029388428, 1: 0.7605549097061157},
  {0: 0.25275862216949463, 1: 0.7472413778305054},
  {0: 0.07190161943435669, 1: 0.9280983805656433},
  {0: 0.13511121273040771, 1: 0.8648887872695923},
  {0: 0.14658993482589722, 1: 0.8534100651741028},
  {0: 0.19418931007385254, 1: 0.8058106899261475},
  {0: 0.2746114134788513, 1: 0.7253885865211487},
  {0: 0.07353192567825317, 1: 0.9264680743217468},
  {0: 0.30164414644241333, 1: 0.6983558535575867},
  {0: 0.12612593173980713, 1: 0.8738740682

In [9]:
test_acc = accuracy_score(y_test, y_pred_onnx[0])
print(f'ONNX Test accuracy: {test_acc:.4}')

ONNX Test accuracy: 0.7945
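
The ONNX session returns two outputs: an array of predicted labels and, for each row, a dictionary mapping every class to its probability. Below is a minimal sketch of turning those dictionaries into positive-class scores and re-deriving the labels, using the first three dictionaries printed above. The variable names are illustrative, and the 0.5 threshold is assumed to mirror the one applied to the LightGBM predictions earlier.

```python
# First three probability dictionaries copied from the ONNX output above.
probabilities = [
    {0: 0.1581752896308899, 1: 0.8418247103691101},
    {0: 0.22155171632766724, 1: 0.7784482836723328},
    {0: 0.2402471899986267, 1: 0.7597528100013733},
]

# Probability of the positive class (label 1) for each row.
positive_scores = [p[1] for p in probabilities]

# Re-derive hard labels with a 0.5 threshold.
labels = [1 if score >= 0.5 else 0 for score in positive_scores]

print(labels)
```

The first score agrees, to the printed precision, with the `0.8418247` that `model.predict` returns for the first test row later in the notebook.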


### SHAP to Explain the Model

In [24]:
rng = np.random.RandomState(SEED)
random_indexes = rng.choice(x_train_matrix.shape[0], 100, replace=False)
background_data = x_train_matrix[random_indexes, :]

explainer = shap.KernelExplainer(model.predict, background_data)

In [27]:
model.predict(x_test_matrix[0:1, :])

Usage of np.ndarray subset (sliced data) is not recommended due to it will double the peak memory cost in LightGBM.


array([0.8418247])

In [31]:
explainer.explain(x_test_matrix[0:1, :])

Usage of np.ndarray subset (sliced data) is not recommended due to it will double the peak memory cost in LightGBM.
The default of 'normalize' will be set to False in version 1.2 and deprecated in version 1.4.
If you wish to scale the data, use Pipeline with a StandardScaler in a preprocessing stage. To reproduce the previous behavior:

from sklearn.pipeline import make_pipeline

model = make_pipeline(StandardScaler(with_mean=False), LassoLarsIC())

If you wish to pass a sample_weight parameter, you need to pass it as a fit parameter to each step of the pipeline as follows:

kwargs = {s[0] + '__sample_weight': sample_weight for s in model.steps}
model.fit(X, y, **kwargs)

Set parameter alpha to: original_alpha * np.sqrt(n_samples). 


array([-0.0227419 , -0.00977796,  0.00387448, -0.0067414 ,  0.00593194,
       -0.00344492, -0.01235467,  0.00115505,  0.04335813,  0.03315957,
        0.        ,  0.        , -0.00621136,  0.        ,  0.        ,
        0.        , -0.00356393,  0.01245515,  0.00214403,  0.00077031,
        0.00170855,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.00115944,  0.        ,  0.        ,  0.        ,
        0.        ])
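
Kernel SHAP attributions are additive: the base value (the mean model output over the background data) plus the sum of the attributions equals the model output for the explained row. The notebook does not print the base value, so the sketch below only derives the value implied by the two outputs above. Treat it as a consistency check under that additivity assumption, not as a number reported in the notebook.

```python
# SHAP values printed above for the first test row (31 features).
shap_values = [
    -0.0227419, -0.00977796, 0.00387448, -0.0067414, 0.00593194,
    -0.00344492, -0.01235467, 0.00115505, 0.04335813, 0.03315957,
    0.0, 0.0, -0.00621136, 0.0, 0.0,
    0.0, -0.00356393, 0.01245515, 0.00214403, 0.00077031,
    0.00170855, 0.0, 0.0, 0.0, 0.0,
    0.0, 0.00115944, 0.0, 0.0, 0.0,
    0.0,
]

# model.predict output for the same row, as printed above.
prediction = 0.8418247

# Additivity: prediction = base value + sum of attributions.
total_attribution = sum(shap_values)
implied_base_value = prediction - total_attribution

print(f"sum of SHAP values: {total_attribution:.6f}")
print(f"implied base value: {implied_base_value:.6f}")
```

With the live objects, the same check would compare `explainer.expected_value` plus the summed attributions against `model.predict` for that row.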

### Export Data as JSON

In [1]:
x_test_matrix.shape

NameError: name 'x_test_matrix' is not defined

### Create a Data Subset and Export CSV

In [4]:
# data = np.load('../data/lending-club-data-5000.npz', allow_pickle=True)

# x_all = data['x_all']
# y_all = data['y_all']
# feature_names = data['feature_names']
# feature_types = data['feature_types']
# cont_index = data['cont_index']
# cat_index = data['cat_index']

# for i in cat_index:
#     counter = Counter(x_all[:, i])
#     print(i, feature_names[i], len(counter))
    

# # Create a data subset with only essential categorical features
# selected_feature_indexes = cont_index.tolist()
# selected_feature_indexes.extend([1, 2, 3, 5, 10, 14, 16])

# data_df_dict = {}
# for i in selected_feature_indexes:
#     name = feature_names[i]
#     data_df_dict[name] = x_all[:, i]
# data_df_dict['loan_approval'] = y_all

# data_df = pd.DataFrame(data_df_dict)
# data_df.to_csv('../data/lending-club.csv', index=False)
