## Data dictionary

|    | Variable          | Explanation                                                                                                             |
|---:|:------------------|:------------------------------------------------------------------------------------------------------------------------|
|  0 | credit_policy     | 1 if the customer meets the credit underwriting criteria; 0 otherwise.                                                  |
|  1 | purpose           | The purpose of the loan.                                                                                                |
|  2 | int_rate          | The interest rate of the loan (more risky borrowers are assigned higher interest rates).                                |
|  3 | installment       | The monthly installments owed by the borrower if the loan is funded.                                                    |
|  4 | log_annual_inc    | The natural log of the self-reported annual income of the borrower.                                                     |
|  5 | dti               | The debt-to-income ratio of the borrower (amount of debt divided by annual income).                                     |
|  6 | fico              | The FICO credit score of the borrower.                                                                                  |
|  7 | days_with_cr_line | The number of days the borrower has had a credit line.                                                                  |
|  8 | revol_bal         | The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).                           |
|  9 | revol_util        | The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available). |
| 10 | inq_last_6mths    | The borrower's number of inquiries by creditors in the last 6 months.                                                   |
| 11 | delinq_2yrs       | The number of times the borrower had been 30+ days past due on a payment in the past 2 years.                           |
| 12 | pub_rec           | The borrower's number of derogatory public records.                                                                     |
| 13 | not_fully_paid    | 1 if the loan is not fully paid; 0 otherwise.                                                                           |

[Source](https://www.kaggle.com/itssuru/loan-data) of dataset.

In [128]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight
from imblearn.over_sampling import SMOTE
from sklearn.metrics import accuracy_score, classification_report
import joblib


In [129]:
df = pd.read_csv("loan_data.csv")
df.head()

Unnamed: 0,credit.policy,purpose,int.rate,installment,log.annual.inc,dti,fico,days.with.cr.line,revol.bal,revol.util,inq.last.6mths,delinq.2yrs,pub.rec,not.fully.paid
0,1,debt_consolidation,0.1189,829.1,11.350407,19.48,737,5639.958333,28854,52.1,0,0,0,0
1,1,credit_card,0.1071,228.22,11.082143,14.29,707,2760.0,33623,76.7,0,0,0,0
2,1,debt_consolidation,0.1357,366.86,10.373491,11.63,682,4710.0,3511,25.6,1,0,0,0
3,1,debt_consolidation,0.1008,162.34,11.350407,8.1,712,2699.958333,33667,73.2,1,0,0,0
4,1,credit_card,0.1426,102.92,11.299732,14.97,667,4066.0,4740,39.5,0,1,0,0
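The CSV headers use dots (`credit.policy`), while the data dictionary uses underscores (`credit_policy`). The rest of this notebook keeps the dotted names. If the dictionary's spelling is preferred, the rename could be sketched as follows, shown on a two-column stand-in frame rather than the real file:

```python
import pandas as pd

# Stand-in frame with two of the dotted headers from loan_data.csv
# (values copied from the first two rows shown above).
sample = pd.DataFrame({"credit.policy": [1, 1], "int.rate": [0.1189, 0.1071]})

# Replace every dot in a column name with an underscore.
renamed = sample.rename(columns=lambda name: name.replace(".", "_"))
print(list(renamed.columns))
```

Applying such a rename to `df` would also require changing every later column reference, including `columns_order` in the inference cell.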


In [130]:
print(df['credit.policy'].value_counts())
print("=====================================")
print(df['purpose'].value_counts())



credit.policy
1    7710
0    1868
Name: count, dtype: int64
=====================================
purpose
debt_consolidation    3957
all_other             2331
credit_card           1262
home_improvement       629
small_business         619
major_purchase         437
educational            343
Name: count, dtype: int64


In [131]:
df.dropna(inplace=True)

In [132]:
X = df.drop(['not.fully.paid', 'int.rate'], axis=1)
y = df['not.fully.paid']


In [133]:
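# Encode the categorical `purpose` column as integer codes (alphabetical order)
encoder = LabelEncoder()
X['purpose'] = encoder.fit_transform(X['purpose'])

# Standardize every feature to zero mean and unit variance.
# Note: fitting on all rows before the split lets test-set statistics leak into training.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Persist the fitted scaler; the inference cell later loads it from 'scaler.pkl'
_ = joblib.dump(scaler, 'scaler.pkl')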

In [134]:
encoding_mapping = dict(zip(encoder.classes_, encoder.transform(encoder.classes_)))
encoding_mapping

{'all_other': 0,
 'credit_card': 1,
 'debt_consolidation': 2,
 'educational': 3,
 'home_improvement': 4,
 'major_purchase': 5,
 'small_business': 6}
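Label encoding assigns the integers 0 to 6 in alphabetical order, and a linear model such as logistic regression treats those integers as an ordered scale. One-hot encoding is a common alternative that avoids the implied ordering. This is not what the notebook does; it is a minimal sketch on a toy frame, using `pd.get_dummies`:

```python
import pandas as pd

# Toy frame with three of the purposes listed above.
toy = pd.DataFrame({
    "purpose": ["credit_card", "all_other", "small_business"],
    "fico": [707, 737, 682],
})

# One indicator column per category; no ordering between purposes is implied.
encoded = pd.get_dummies(toy, columns=["purpose"], dtype=int)
print(encoded.columns.tolist())
```

Switching to this encoding would change the number of input columns, so the saved model and the `mapping` dictionary used at inference time would no longer apply as written.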

In [135]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2)
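The split above is neither seeded nor stratified, so the class mix in the test set and all downstream scores vary from run to run. A sketch of a reproducible, stratified split, run on synthetic stand-in data with a made-up 16% positive rate:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 16% positives (an illustrative imbalance).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 160 + [0] * 840)

# stratify keeps the class proportions equal in both parts;
# random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)
print(y_tr.mean(), y_te.mean())
```

In this notebook the equivalent call would pass `stratify=y` and a fixed `random_state`; the reported metrics would then differ from the ones printed below, which come from an unseeded run.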

In [136]:
# Initialize SMOTE
smote = SMOTE(random_state=42)

# Resample the training data using SMOTE
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

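# Compute class weights on the resampled labels
import numpy as np  # `classes` is documented as an ndarray, so pass an array rather than a list
class_weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y_train_resampled)
class_weight_dict = {0: class_weights[0], 1: class_weights[1]}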
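With its default settings SMOTE oversamples the minority class until both classes have the same count, so "balanced" weights computed afterwards come out equal. The formula is `n_samples / (n_classes * count_per_class)`. A small check on a hand-made balanced label vector:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# A perfectly balanced label vector, like the labels SMOTE returns by default.
y_balanced = np.array([0] * 500 + [1] * 500)

# 1000 / (2 * 500) = 1.0 for each class.
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y_balanced)
print(weights)
```

Under that assumption, passing `class_weight_dict` to the model after SMOTE has no additional effect; resampling and class weighting are usually treated as alternatives.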

In [137]:
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear']
}
grid_search = GridSearchCV(LogisticRegression(max_iter=1000, random_state=42), param_grid, cv=5, scoring='f1', n_jobs=-1)
grid_search.fit(X_train_resampled, y_train_resampled)

best = grid_search.best_params_

In [138]:
model = LogisticRegression(class_weight=class_weight_dict, max_iter=10000, C=best['C'], penalty=best['penalty'], solver=best['solver'], random_state=42)
model.fit(X_train_resampled, y_train_resampled)

In [139]:
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

In [140]:
print(f"Accuracy: {accuracy:.2f}")
print("Classification Report:\n", classification_rep)

Accuracy: 0.61
Classification Report:
               precision    recall  f1-score   support

           0       0.90      0.60      0.72      1625
           1       0.22      0.63      0.33       291

    accuracy                           0.61      1916
   macro avg       0.56      0.62      0.53      1916
weighted avg       0.80      0.61      0.66      1916
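Two quick checks help read this report. First, a model that always predicts class 0 would already reach the majority-class share in accuracy, so 0.61 is below that baseline; the model gives up accuracy to gain recall on class 1. Second, the class-1 F1 can be recomputed from the printed precision and recall. Both use only figures from the report:

```python
# Figures copied from the classification report above.
support_0, support_1 = 1625, 291
precision_1, recall_1 = 0.22, 0.63

# Accuracy of a model that always predicts class 0 (the majority class).
baseline_accuracy = support_0 / (support_0 + support_1)

# F1 is the harmonic mean of precision and recall.
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"baseline accuracy: {baseline_accuracy:.2f}")
print(f"class-1 F1: {f1_1:.2f}")
```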



In [141]:
joblib.dump(model, 'loan_predict_resampled.joblib')

['loan_predict_resampled.joblib']

## Random forest

In [142]:
rf = RandomForestClassifier()
rf.fit(X_train_resampled, y_train_resampled)

# Predictions
rf_preds = rf.predict(X_test)

# Performance

print("Acc: ", accuracy_score(y_test, rf_preds))
print("Classification Report:\n", classification_report(y_test, rf_preds))

Acc:  0.8136743215031316
Classification Report:
               precision    recall  f1-score   support

           0       0.86      0.94      0.89      1625
           1       0.27      0.13      0.18       291

    accuracy                           0.81      1916
   macro avg       0.56      0.53      0.54      1916
weighted avg       0.77      0.81      0.79      1916
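The forest reaches higher accuracy but finds only 13% of the class-1 loans. One common way to trade precision for recall is to lower the decision cut-off applied to `predict_proba`. The sketch below uses synthetic data from `make_classification`, and the 0.3 cut-off is an arbitrary illustration, not a tuned value:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 15% positives) standing in for the loan data.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.85, 0.15], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
proba = forest.predict_proba(X_te)[:, 1]  # estimated P(class 1)

# Recall on class 1 at a 0.5 cut-off and at a lower, illustrative one.
recalls = {t: recall_score(y_te, (proba >= t).astype(int)) for t in (0.5, 0.3)}
print(recalls)
```

Lowering the cut-off can only keep or raise recall, because every row flagged at 0.5 is also flagged at 0.3; the cost is more false positives.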



In [161]:
mapping = {
    'all_other': 0,
    'credit_card': 1,
    'debt_consolidation': 2,
    'educational': 3,
    'home_improvement': 4,
    'major_purchase': 5,
    'small_business': 6
}

columns_order = [
    "credit.policy", "purpose", "installment", "log.annual.inc",
    "dti", "fico", "days.with.cr.line", "revol.bal",
    "revol.util", "inq.last.6mths", "delinq.2yrs", "pub.rec"
]

data = pd.read_json("d.json").to_dict()["data"]
data['purpose'] = mapping[data['purpose']]
input_df = pd.DataFrame([data], columns=columns_order)

# scale the input
scaler = joblib.load('scaler.pkl')
input_df = scaler.transform(input_df)

In [162]:
model2 = joblib.load('loan_predict_resampled.joblib')
prob_default2 = model2.predict_proba(input_df)[:, 1][0]
prob_default2

0.4644914874427539
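The inference steps above (encode `purpose`, fix the column order, scale, score) can be wrapped in one function. The helper name `predict_default_probability` is hypothetical, and the scaler and model below are stand-ins fitted on random numbers purely so the sketch runs end to end; in practice the objects loaded with `joblib.load` would be passed in instead.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

PURPOSE_CODES = {
    "all_other": 0, "credit_card": 1, "debt_consolidation": 2, "educational": 3,
    "home_improvement": 4, "major_purchase": 5, "small_business": 6,
}
COLUMNS = [
    "credit.policy", "purpose", "installment", "log.annual.inc",
    "dti", "fico", "days.with.cr.line", "revol.bal",
    "revol.util", "inq.last.6mths", "delinq.2yrs", "pub.rec",
]

def predict_default_probability(record, scaler, model):
    """Hypothetical helper: encode, order, scale, and score one loan record."""
    row = dict(record)  # copy so the caller's dict is left untouched
    row["purpose"] = PURPOSE_CODES[row["purpose"]]
    frame = pd.DataFrame([row], columns=COLUMNS)
    return float(model.predict_proba(scaler.transform(frame))[0, 1])

# Stand-in scaler and model (random data), not the notebook's trained objects.
rng = np.random.default_rng(0)
train = pd.DataFrame(rng.normal(size=(200, len(COLUMNS))), columns=COLUMNS)
labels = rng.integers(0, 2, size=200)
demo_scaler = StandardScaler().fit(train)
demo_model = LogisticRegression().fit(demo_scaler.transform(train), labels)

# Feature values from the first row of df.head() above.
record = {
    "credit.policy": 1, "purpose": "debt_consolidation", "installment": 829.1,
    "log.annual.inc": 11.350407, "dti": 19.48, "fico": 737,
    "days.with.cr.line": 5639.958333, "revol.bal": 28854, "revol.util": 52.1,
    "inq.last.6mths": 0, "delinq.2yrs": 0, "pub.rec": 0,
}
p = predict_default_probability(record, demo_scaler, demo_model)
print(0.0 <= p <= 1.0)
```

Copying the record before encoding means the same dictionary can be scored repeatedly; the original cell overwrites `data['purpose']` in place, so running it twice on the same `data` would raise a `KeyError`.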

In [None]:
## Predicted default probabilities for sample loans whose true label is 1 (defaulting) =>
0.5312726506253583
0.7207761780424421
0.6726161018329725
0.7087672787847078

## Predicted default probabilities for sample loans whose true label is 0 (paying) =>
0.6859344671049403
0.5229682831152656
0.36356808745522956
0.4644914874427539
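The eight probabilities above can be tallied against a cut-off. This assumes that all eight come from the same model, that the two group headings give the true labels, and that 0.5 is the cut-off of interest:

```python
# Probabilities copied from the cell above, grouped by true label.
defaulted = [0.5312726506253583, 0.7207761780424421,
             0.6726161018329725, 0.7087672787847078]
paid = [0.6859344671049403, 0.5229682831152656,
        0.36356808745522956, 0.4644914874427539]

threshold = 0.5  # assumed cut-off

true_positives = sum(p >= threshold for p in defaulted)  # defaulters flagged
true_negatives = sum(p < threshold for p in paid)        # payers cleared
correct = true_positives + true_negatives

print(true_positives, true_negatives, correct / (len(defaulted) + len(paid)))
```

On these samples all four defaulters are flagged but two of the four payers are flagged as well, which matches the high-recall, low-precision pattern in the logistic regression report.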