We are going to use UMAP, so let's install it using pip.

In [1]:
!pip install --quiet umap-learn

Now let's load the data and do a tiny amount of feature engineering.

In [2]:
import pandas as pd

LOAN = '/kaggle/input/financial-risk-for-loan-approval/Loan.csv'

df = pd.read_csv(filepath_or_buffer=LOAN, parse_dates=['ApplicationDate'])
df['LoanApproved'] = df['LoanApproved'] == 1
df.head()

Unnamed: 0,ApplicationDate,Age,AnnualIncome,CreditScore,EmploymentStatus,EducationLevel,Experience,LoanAmount,LoanDuration,MaritalStatus,...,MonthlyIncome,UtilityBillsPaymentHistory,JobTenure,NetWorth,BaseInterestRate,InterestRate,MonthlyLoanPayment,TotalDebtToIncomeRatio,LoanApproved,RiskScore
0,2018-01-01,45,39948,617,Employed,Master,22,13152,48,Married,...,3329.0,0.724972,11,126928,0.199652,0.22759,419.805992,0.181077,False,49.0
1,2018-01-02,38,39709,628,Employed,Associate,15,26045,48,Single,...,3309.083333,0.935132,3,43609,0.207045,0.201077,794.054238,0.389852,False,52.0
2,2018-01-03,47,40724,570,Employed,Bachelor,26,17627,36,Married,...,3393.666667,0.872241,6,5205,0.217627,0.212548,666.406688,0.462157,False,52.0
3,2018-01-04,58,69084,545,Employed,High School,34,37898,96,Single,...,5757.0,0.896155,5,99452,0.300398,0.300911,1047.50698,0.313098,False,54.0
4,2018-01-05,37,103264,594,Employed,Associate,17,9184,36,Married,...,8605.333333,0.941369,5,227019,0.197184,0.17599,330.17914,0.07021,True,36.0


Are the classes in our target variable balanced? We would expect not.

In [3]:
df['LoanApproved'].value_counts(normalize=True, dropna=False).to_dict()

{False: 0.761, True: 0.239}

Only about a quarter of loans are approved.
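Because the classes are imbalanced, it helps to know how well we would do with no model at all. The sketch below copies the class shares from the output above and works out the accuracy of always predicting the majority class; the names `class_share`, `majority_class` and `baseline_accuracy` are ours, just for illustration.

```python
import pandas as pd

# Class shares copied from the value_counts output above.
class_share = pd.Series({False: 0.761, True: 0.239})

# A classifier that always predicts the most common class is right
# exactly as often as that class occurs.
majority_class = class_share.idxmax()
baseline_accuracy = class_share.max()

print('always predict {}: accuracy {:.3f}'.format(majority_class, baseline_accuracy))
```

Any model we build later should beat this number comfortably before we read much into its accuracy.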

How well does the risk score predict loan approval?

In [4]:
from plotly import express

express.histogram(data_frame=df, x='RiskScore', color='LoanApproved', nbins=100)

This is an odd-looking distribution; we might expect risk scores to be normally distributed, but our distribution is roughly bimodal.

Let's do the dumb thing and just look at our numerical data.

In [5]:
# Numeric feature columns only; the target and RiskScore are left out.
columns = [column for column, dtype in df.dtypes.items() if str(dtype) in {'int64', 'float64'} and column not in {'LoanApproved', 'RiskScore'}]

Let's use dimension reduction to build a scatter plot; we might expect loan approvals/disapprovals to cluster together, and if we're fortunate, the two clusters are distinct. 

In [6]:
import arrow
from umap import UMAP

time_start = arrow.now()
umap = UMAP(random_state=2024, verbose=False, n_jobs=1, low_memory=False, n_epochs=201)
df[['x', 'y']] = umap.fit_transform(X=df[columns])
print('done with UMAP in {}'.format(arrow.now() - time_start))

done with UMAP in 0:00:47.542732
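One caveat about doing the dumb thing: UMAP's default metric is Euclidean, and we passed in raw columns with very different ranges. The sketch below is a rough illustration only. It takes three columns from rows 0 and 3 of the `df.head()` output above and shows how much each column contributes to the squared distance between the two rows. The choice of rows and columns is ours and arbitrary.

```python
import numpy as np

# Three columns from rows 0 and 3 of the df.head() output above.
features = ['Age', 'AnnualIncome', 'CreditScore']
row_0 = np.array([45.0, 39948.0, 617.0])
row_3 = np.array([58.0, 69084.0, 545.0])

# Each column's share of the squared Euclidean distance between the rows.
squared_diff = (row_0 - row_3) ** 2
share = squared_diff / squared_diff.sum()

for name, value in zip(features, share):
    print('{:>12}: {:.6f}'.format(name, value))
```

For this pair, income accounts for nearly all of the distance, so standardizing the columns first might give a different embedding. We have not tried that here.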


In [7]:
express.scatter(data_frame=df, x='x', y='y', color='LoanApproved', height=800)

We are only somewhat fortunate; we see broad regions of approvals and broad regions of disapprovals, and we also see areas that are mixed. Let's build a model.

In [8]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df[columns], df['LoanApproved'], test_size=0.2, random_state=2024, stratify=df['LoanApproved'])

logreg = LogisticRegression(max_iter=10000, tol=1e-12).fit(X_train, y_train)
y_pred = logreg.predict(X=X_test)  # predict once and reuse for every metric
print('model fit in {} iterations'.format(logreg.n_iter_[0]))
print('accuracy: {:5.4f} f1: {:5.4f}'.format(accuracy_score(y_true=y_test, y_pred=y_pred), f1_score(average='weighted', y_true=y_test, y_pred=y_pred, zero_division=0)))
print(classification_report(y_true=y_test, y_pred=y_pred, zero_division=0))

model fit in 3155 iterations
accuracy: 0.9097 f1: 0.9087
              precision    recall  f1-score   support

       False       0.93      0.95      0.94      3044
        True       0.83      0.78      0.80       956

    accuracy                           0.91      4000
   macro avg       0.88      0.86      0.87      4000
weighted avg       0.91      0.91      0.91      4000
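To see how the summary rows of the report relate to the per-class rows, here is a small sketch that recomputes them from the rounded figures printed above. Since the inputs are already rounded to two decimals, the results only agree with the report to about that precision; the variable names are ours.

```python
# Rounded per-class figures copied from the classification report above.
precision_true, recall_true = 0.83, 0.78
f1_false, f1_true = 0.94, 0.80
support_false, support_true = 3044, 956

# F1 is the harmonic mean of precision and recall.
f1_true_check = 2 * precision_true * recall_true / (precision_true + recall_true)

# Macro average: plain mean of the per-class scores.
macro_f1 = (f1_false + f1_true) / 2

# Weighted average: per-class scores weighted by support.
weighted_f1 = (f1_false * support_false + f1_true * support_true) / (support_false + support_true)

print('{:.2f} {:.2f} {:.2f}'.format(f1_true_check, macro_f1, weighted_f1))
```

The weighted average sits close to the score for the False class because that class makes up about three quarters of the test set.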



In [9]:
probability_df = pd.DataFrame(data=logreg.predict_proba(X=X_test).max(axis=1), columns=['probability'])
probability_df['prediction'] = logreg.predict(X=X_test)
probability_df['actual'] = y_test.tolist()
probability_df['correct'] = probability_df['actual'] == probability_df['prediction']
probability_df[['x', 'y']] = umap.transform(X=X_test)

probability_df.head()

Unnamed: 0,probability,prediction,actual,correct,x,y
0,0.998894,False,False,True,6.882908,6.901273
1,0.998954,True,True,True,6.626804,9.397749
2,0.988409,False,False,True,11.26814,1.030051
3,0.942935,True,False,False,15.794761,8.189176
4,0.84911,True,True,True,7.772244,14.461901


In [10]:
from plotly import express

express.scatter(data_frame=probability_df, x='x', y='y', color='probability', facet_col='correct', hover_name='actual')

We see a fair number of cases where the model is correct despite a relatively low predicted probability, but almost no cases where it is incorrect with a high predicted probability.
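The scatter plot gives a visual impression; one way to put numbers on it is to band the model probability and look at the share of correct predictions in each band. The sketch below is a minimal version of that idea. The helper `accuracy_by_confidence` and the band edges are ours, and the demo input is just the five rows shown by `probability_df.head()` above, far too few to conclude anything.

```python
import pandas as pd

def accuracy_by_confidence(frame, edges):
    """Share of correct predictions within each band of model probability."""
    bands = pd.cut(frame['probability'], bins=edges, include_lowest=True)
    correct = frame['correct'].astype(float)
    return correct.groupby(bands, observed=False).agg(['mean', 'count'])

# The five rows shown by probability_df.head() above, used only as a demo input.
sample = pd.DataFrame({
    'probability': [0.998894, 0.998954, 0.988409, 0.942935, 0.849110],
    'correct': [True, True, True, False, True],
})

summary = accuracy_by_confidence(sample, edges=[0.5, 0.9, 0.95, 1.0])
print(summary)
```

Passing the full `probability_df` instead of `sample` should give a per-band accuracy table for the whole test set.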