Used in IGANN Paper (Case Study 2)

See IGANN Appendix i:
"The dataset is taken from the FICO Explainable Machine Learning Challenge. It contains 10,459 samples with 21 continuous features, two categorical features, and a binary target variable stating whether or not an individual defaulted on the loan"

The data contains anonymized applications for HELOCs (home equity lines of credit), a type of loan collateralized by the customer’s property.

The appendix also describes the preprocessing: only 10 features plus the target are kept.

Link: https://www.kaggle.com/datasets/averkiyoliabev/home-equity-line-of-creditheloc


Model trained on raw dataset

***GPT Analysis of "MSinceMostRecentDelq" (index = 8)***

The feature **MSinceMostRecentDelq** (Months Since Most Recent Delinquency) reflects how long it has been since the borrower was last delinquent on a payment. In the context of loan approval, a higher value typically indicates more time has passed since the last delinquency, which is generally seen as a positive sign.

Key domain knowledge contradictions in the shape function:

1. **Negative values for recent delinquencies**: For values close to zero (e.g., the bins from "(-9.0, -7.5)" to "(2.5, 5.5)"), the function returns negative contributions, which is expected, as recent delinquencies are risky. However, the contributions keep getting **worse** below zero, reaching a minimum at "(-9.0, -7.5)". Negative values make no sense as a month count: a delinquency cannot lie 9 months in the future.

2. **Improving outcomes with very high values**: The contributions for MSinceMostRecentDelq increase significantly after 20 months (e.g., ranges like "(30.5, 31.5)" or higher). While it makes sense that outcomes improve with more time since the last delinquency, the large positive values beyond 60 months seem unrealistic. The shape function suggests **extreme optimism** for borrowers who have not had a delinquency for several years, even though such borrowers might still have other risk factors.

3. **Inconsistent pattern near 70 months**: Around 70 months, the contribution suddenly **drops** (e.g., from "(65.5, 66.5)" to "(73.5, 74.5)"). This contradicts the expectation that the likelihood of loan repayment should keep improving as the time since the last delinquency increases. The drop could indicate a flaw in the data or the model.

In summary, the model suggests extreme penalties for very recent delinquencies (including impossible values) and overly optimistic predictions for very old delinquencies. Additionally, the drop near 70 months is unexpected.
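
The third point in the list above can be checked mechanically once a shape function is available as a list of bins with one contribution per bin. The sketch below is a minimal illustration: `find_decreasing_steps` is a hypothetical helper (not part of any library), and the scores in the example are made-up numbers, not values read from the fitted model. It lists every step where the contribution falls from one bin to the next.

```python
from typing import List, Tuple

Bin = Tuple[float, float]


def find_decreasing_steps(
    bins: List[Bin], scores: List[float], tolerance: float = 0.0
) -> List[Tuple[Bin, Bin, float]]:
    """Return (previous_bin, next_bin, drop) for each step where the score falls.

    bins   -- (lower, upper) intervals sorted by lower bound
    scores -- one contribution per bin, in the same order
    """
    if len(bins) != len(scores):
        raise ValueError("bins and scores must have the same length")
    violations = []
    for i in range(1, len(scores)):
        drop = scores[i - 1] - scores[i]
        if drop > tolerance:
            violations.append((bins[i - 1], bins[i], drop))
    return violations


# Made-up contributions for illustration only (not taken from the fitted model).
example_bins = [(-9.0, -7.5), (2.5, 5.5), (30.5, 31.5), (65.5, 66.5), (73.5, 74.5)]
example_scores = [-0.40, -0.25, 0.10, 0.50, 0.20]

for prev_bin, next_bin, drop in find_decreasing_steps(example_bins, example_scores):
    print(f"contribution drops by {drop:.2f} between {prev_bin} and {next_bin}")
```

A non-zero `tolerance` can be used to ignore small wiggles, so that only drops large enough to matter for the domain-knowledge check are reported.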

In [2]:
import igann_helpers
import pandas as pd

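# Load the FICO HELOC data through the local igann_helpers module.
dataset = igann_helpers.load_fico_data()
# Copy so that adding the target does not modify the frame held in `dataset`.
X_df = dataset["full"]["X"].copy()
X_df["RiskPerformance"] = dataset["full"]["y"]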

# X_df.to_csv("heloc_preprocessed.csv", index=False)

X_df
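
Since the GPT analysis above flags bins at negative month counts, it may help to count how many rows actually hold negative values before interpreting those bins. The sketch below assumes, without confirmation from these notes, that negative entries are special codes rather than real measurements. `count_negative_values` is a hypothetical helper, and the toy frame with invented values stands in for `X_df`.

```python
import pandas as pd


def count_negative_values(df: pd.DataFrame) -> pd.Series:
    """Count, per numeric column, how many entries are below zero."""
    numeric = df.select_dtypes(include="number")
    return (numeric < 0).sum()


# Toy stand-in for X_df; the values are invented for illustration.
toy_frame = pd.DataFrame(
    {
        "MSinceMostRecentDelq": [-7, 2, 35, -8, 70],
        "ExternalRiskEstimate": [55, 61, -9, 67, 80],
    }
)

print(count_negative_values(toy_frame))
```

On the real data this would be called as `count_negative_values(X_df)`.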

In [None]:
simple_feature_names = ["Overall Credit Risk Score", "Months Since First Credit Account", "Average Age of Credit Accounts", "Number of Well-Maintained Accounts", "Percentage of Accounts Never Late",
                            "Months Since Last Missed Payment", "Percentage of Installment vs Revolving Loans", "Time Since Last Credit Application", "Credit Utilization Ratio", "Number of Active Credit Cards/Lines", "Loan Repaid"]

In [4]:
import pandas as pd
df2 = pd.read_csv("ds_description.csv")
print(dict(df2))

{'Column Name': 0             ExternalRiskEstimate
1            MSinceOldestTradeOpen
2                   AverageMInFile
3            NumSatisfactoryTrades
4           PercentTradesNeverDelq
5             MSinceMostRecentDelq
6             PercentInstallTrades
7     MSinceMostRecentInqexcl7days
8       NetFractionRevolvingBurden
9       NumRevolvingTradesWBalance
10                 RiskPerformance
Name: Column Name, dtype: object, 'Description': 0                  Consolidated version of risk markers
1                        Months since oldest trade open
2                                Average months in file
3                         Number of satisfactory trades
4                 Percentage of trades never delinquent
5                  Months since most recent delinquency
6                      Percentage of installment trades
7         Months since most recent inquiry excl. 7 days
8     Net fraction revolving burden (= revolving bal...
9               Number of revolving trades wit

In [16]:
import pandas as pd

df = pd.read_csv("heloc_preprocessed.csv")

simple_feature_names = ["Overall Credit Risk Score", "Months Since First Credit Account", "Average Age of Credit Accounts", "Number of Well-Maintained Accounts", "Percentage of Accounts Never Late",
                            "Months Since Last Missed Payment", "Percentage of Installment vs Revolving Loans", "Time Since Last Credit Application", "Credit Utilization Ratio", "Number of Active Credit Cards/Lines", "Loan Repaid"]

df_simple = df.copy()
df_simple.columns = simple_feature_names
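
Assigning `df_simple.columns` positionally only works while the CSV column order matches the list. A slightly more defensive sketch renames through an explicit mapping and fails loudly if a column is missing. It assumes the original column names are those listed in `ds_description.csv` and that they pair with the simple names in the same order; `SIMPLE_NAMES` and `rename_to_simple` are hypothetical names introduced here.

```python
import pandas as pd

# Pairing assumed from the order of the two lists in this notebook.
SIMPLE_NAMES = {
    "ExternalRiskEstimate": "Overall Credit Risk Score",
    "MSinceOldestTradeOpen": "Months Since First Credit Account",
    "AverageMInFile": "Average Age of Credit Accounts",
    "NumSatisfactoryTrades": "Number of Well-Maintained Accounts",
    "PercentTradesNeverDelq": "Percentage of Accounts Never Late",
    "MSinceMostRecentDelq": "Months Since Last Missed Payment",
    "PercentInstallTrades": "Percentage of Installment vs Revolving Loans",
    "MSinceMostRecentInqexcl7days": "Time Since Last Credit Application",
    "NetFractionRevolvingBurden": "Credit Utilization Ratio",
    "NumRevolvingTradesWBalance": "Number of Active Credit Cards/Lines",
    "RiskPerformance": "Loan Repaid",
}


def rename_to_simple(df: pd.DataFrame) -> pd.DataFrame:
    """Rename columns by name, independent of their order in the frame."""
    missing = [name for name in SIMPLE_NAMES if name not in df.columns]
    if missing:
        raise KeyError(f"columns missing from the frame: {missing}")
    return df.rename(columns=SIMPLE_NAMES)


# Tiny frame with reversed column order and placeholder values.
toy_raw = pd.DataFrame({name: [0] for name in reversed(list(SIMPLE_NAMES))})
renamed = rename_to_simple(toy_raw)
print(list(renamed.columns))
```

With the frame from the cell above, this would be `df_simple = rename_to_simple(df)`.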

In [19]:
from interpret.glassbox import ExplainableBoostingClassifier
from interpret import show
from sklearn.model_selection import train_test_split
import joblib
seed = 42

y = df_simple["Loan Repaid"]
X = df_simple.drop(columns="Loan Repaid")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

test_data = X_test.copy()
test_data["Loan Repaid"] = y_test
#test_data.to_csv("heloc_test.csv", index=False)

ebm_loan = ExplainableBoostingClassifier(random_state=seed, n_jobs=1)
ebm_loan.fit(X_train, y_train)
show(ebm_loan.explain_global())

#joblib.dump(ebm_loan, "ebm_heloc.pkl")

['ebm_heloc.pkl']
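
The cell above fits the model and shows the global explanation but does not score it on the held-out split. A minimal sketch of that step follows. `evaluate_classifier` is a hypothetical helper, and the logistic regression on synthetic data is only a stand-in so that the block runs without the HELOC files. The helper relies only on `predict` and `predict_proba`, which `ExplainableBoostingClassifier` also provides.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split


def evaluate_classifier(model, X_test, y_test) -> dict:
    """Return accuracy and ROC AUC of a fitted binary classifier."""
    predictions = model.predict(X_test)
    # Column 1 holds the probability of the second entry in model.classes_.
    positive_scores = model.predict_proba(X_test)[:, 1]
    return {
        "accuracy": accuracy_score(y_test, predictions),
        "roc_auc": roc_auc_score(y_test, positive_scores),
    }


# Stand-in data and model so the sketch is runnable on its own.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
demo_model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

metrics = evaluate_classifier(demo_model, X_te, y_te)
print(metrics)
```

With the objects from the training cell, the call would be `evaluate_classifier(ebm_loan, X_test, y_test)`.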