# Home Credit Loan Default Prediction - Phase 4 

## Implementing Deep Learning


**Abstract.** Predicting loan defaults is a critical task for financial institutions. This project explored various machine learning approaches, including Logistic Regression, Decision Trees, Random Forests, Gradient Boosting, Light Gradient Boosting (LightGBM), XGBoost, and Multi-Layer Perceptron (MLP) architectures. Following feature engineering, models were evaluated using both a full feature set and a reduced feature set derived through Principal Component Analysis (PCA), feature importance analysis, and correlation analysis. Hyperparameter tuning was conducted on the best-performing model. XGBoost with the full feature set and default parameter values was our best-performing model, with a test accuracy of 91.59% and good performance in identifying non-defaulters and minimizing false positives. However, the severe class imbalance in the dataset posed challenges, resulting in low recall and F1-scores for the minority class. We conclude by recommending techniques such as SMOTE to mitigate class imbalance and improve model performance for the minority class.

## Table of Contents
- [1. Team and Project Management](#team)
- [2. Project Report](#report)
  - [2.1 Introduction](#introduction)
  - [2.2 Methods](#methods)
  - [2.3 Results](#results)
  - [2.4 Discussion, Conclusion and Gap Analysis](#discussion)
  - [2.5 References](#references)
- [3. Code Appendix](#appendix)

## 1. Team and Project Management <a class="anchor" id="team"></a>

**Team Members**

| Name           | Email             | Role              | Photo                           |
|----------------|-------------------|-------------------|---------------------------------|
| Lexi Colwell   | alecolwe@iu.edu   | Phase 1 Lead      | <img src="https://iu.instructure.com/images/thumbnails/177973973/0g37V233Y26RuFrO4SpckMxeO237lTphfcshwCPB" width="50">   |
| Nasheed Jafri  | njafri@iu.edu     | Phase 2 Lead      | <img src="https://iu.instructure.com/images/thumbnails/126314916/QRUIEEkso27JL5B1T8aFYUsG72QVtz5EJD4gfe1Z" width="50">|
| Cassie Cagwin  | cacagwin@iu.edu   | Phase 3 Lead      | <img src="https://www.widsworldwide.org/wp-content/uploads/2023/10/1-6-scaled.jpeg" width="50"> |
| Maria Aroca    | mparoca@iu.edu    | Phase 4 Lead      | <img src="https://media.licdn.com/dms/image/v2/C4E03AQHBCzQVjfUjYA/profile-displayphoto-shrink_800_800/profile-displayphoto-shrink_800_800/0/1589987573488?e=1736985600&v=beta&t=RIOl6PYoXLUqfTFXW5VMpFm9l1zGIvd2u5-y0k59jCk" width="50">  |

**Phase Task Description**

| Phase     	| Task Description                                                                                 	| Project Manager	|
|---------------|------------------------------------------------------------------------------------------------------|--------------------|
| Phase 1   	| Project planning and proposal, including data sources, metrics, and baseline models              	| Lexi Colwell   	|
| Phase 2   	| Data exploration (EDA), baseline pipeline, feature engineering, and initial hyperparameter tuning	| Nasheed Jafri  	|
| Phase 3   	| Advanced feature engineering, hyperparameter tuning, feature selection, and ensemble methods     	| Cassie Cagwin  	|
| Phase 4   	| Final model integration, implementing advanced architectures, and project report completion      	| Maria Aroca    	|

**Phase 4 Credit Assignment Plan Summary**

| Name            | Task Description                                                                                  | Estimated Time |
|-----------------|---------------------------------------------------------------------------------------------------|----------------|
| Lexi Colwell    | Document the modeling pipeline, including verifying data for potential leakage issues.            | 15 hrs         |
| Nasheed Jafri   | Create exploratory data analysis (EDA) visualizations and organize insights for the final report. | 15 hrs         |
| Cassie Cagwin   | Develop models, create visuals for model performance, conduct experiments with MLP, and prepare the Kaggle submission and submission code. | 15 hrs         |
| Maria Aroca     | Perform MLP model experiments and organize results for inclusion in the final submission.         | 15 hrs         |

## 2. Project Report <a class="anchor" id="report"></a>

## 2.1 Introduction <a class="anchor" id="introduction"></a>

Many individuals face significant barriers to accessing loans due to insufficient or non-existent credit histories, often leaving them vulnerable to predatory lending practices. This project aims to address this issue by expanding financial inclusion for the unbanked population through the use of alternative data sources and the development of predictive models using machine learning techniques. Our primary objective was to predict loan repayment ability by employing a comprehensive modeling process to analyze credit default risk across an extensive historical collection of datasets.

This report has the following structure: In the **Methods** section, we briefly describe the data and detail our feature engineering and modeling approach. Feature engineering played a crucial role in this process, transforming raw and heterogeneous data into meaningful features to create a final unified dataset as the base for our predictive models. We also describe the models we experimented with. This task was framed as a standard supervised classification problem, where the goal is to train models to predict loan repayment labels based on historical features.

The modeling process began with a baseline Logistic Regression model for its simplicity and interpretability and advanced to more complex models, including Decision Trees, Random Forests, Gradient Boosting, XGBoost, Light Gradient Boosting (LightGBM) and Support Vector Machines (SVM). To address the dataset's high dimensionality, we applied feature selection techniques such as correlation analysis, tree-based feature importance, and Principal Component Analysis (PCA), reducing the dataset to 20 relevant features. Ensemble methods, such as a voting classifier and a stacking classifier, were implemented to combine the predictions of multiple models. We also performed hyperparameter tuning on our best model. Additionally, we explored neural network-based approaches, implementing and evaluating Multi-Layer Perceptron (MLP) models with different architectures, including Sparse Input MLP, Simple Deep MLP, MLP with Dropout, and Wider MLP.

In the **Results** section, we present the performance metrics of the models, highlighting the challenges and successes of the modeling process. Overall, XGBoost with default parameters trained on the full feature set emerged as the best-performing model, achieving a test accuracy of 91.59%. However, class imbalance in the dataset posed significant challenges, leading to low recall and F1 scores for the minority class (loan defaulters).

Finally, in the **Discussion** section, we address potential next steps to deal with the class imbalance problem, such as employing advanced resampling techniques or cost-sensitive methods. Our full code is presented in the **Code Appendix** at the end of the document.


## 2.2 Methods <a class="anchor" id="methods"></a>

### **2.2.1 Data**

In this project, we analyzed the data published on the [Home Credit Default Risk](https://www.kaggle.com/competitions/home-credit-default-risk/data) Kaggle competition page. The data represents a comprehensive and diverse collection of sources related to credit applications and repayment behaviors, making it both rich in detail and complex to work with.


The dataset used in this study has a total size of **2.68 GB** and comprises several CSV files, each containing distinct but complementary information related to loan applicants and their financial histories. Two primary files, ***application_train.csv*** and ***application_test.csv***, provide detailed information about loans and loan applicants, serving as the foundation for training and testing the models.


The ***POS_CASH_balance.csv*** file contains monthly balance snapshots of previous point-of-sale (POS) transactions and cash loans associated with the applicants. The ***bureau.csv*** file captures data from previous loans reported to credit bureaus, while ***bureau_balance.csv*** offers monthly balance snapshots of credits tracked by these bureaus.


Additionally, the ***credit_card_balance.csv*** file includes monthly balance data for credit cards that applicants have with Home Credit. The ***installments_payments.csv*** file records historical payment data for loan installments, providing insight into payment behaviors. Finally, the ***previous_application.csv*** file contains application data for clients' previous loans with Home Credit. Together, these files provide a comprehensive view of applicant financial profiles and loan histories, enabling robust modeling and analysis.


*Table 1* shows the dimensions and size of each of the files. For a detailed description of the columns of these CSV files, please see the *data description* section of the Appendix.

**Table 1. Data Dimensions**

| Dataset                  | Dimensions (Rows x Columns) | Size    |
|--------------------------|-----------------------------|---------|
| application_train        | 307,511 x 122              | 158 MB  |
| application_test         | 48,744 x 121               | 25 MB   |
| bureau                   | 1,716,428 x 17             | 162 MB  |
| bureau_balance           | 27,299,925 x 3             | 358 MB  |
| credit_card_balance      | 3,840,312 x 23             | 405 MB  |
| installments_payments    | 13,605,401 x 8             | 690 MB  |
| previous_application     | 1,670,214 x 37             | 386 MB  |
| POS_CASH_balance         | 10,001,358 x 8             | 375 MB  |

We accessed the dataset using the Kaggle API and created a function to unzip the downloaded files. The extracted data was then loaded and organized into a dictionary, allowing for efficient retrieval during the analysis. In *Figure 1*, we provide a code snippet illustrating this process.

> **Figure 1. Code Snippet of Data Download**
> ```python
> import os
> import zipfile
> import warnings
>
> import pandas as pd
> from IPython.display import display
>
> warnings.filterwarnings('ignore')
>
> # Set Kaggle API credentials
> os.environ['KAGGLE_USERNAME'] = "your username"
> os.environ['KAGGLE_KEY'] = "your key"
>
> # Download the dataset (IPython/Jupyter shell command)
> !kaggle competitions download -c home-credit-default-risk -p /root/shared/I526_AML_Student/Assignments/Data/home-credit-default-risk
>
> # Define the data directory path
> DATA_DIR = "../../../Data/home-credit-default-risk"  # Ensure this points to the download location above
>
> # Unzip the dataset
> unzippingReq = True  # Set to True if unzipping is required
> if unzippingReq:
>     # The context manager closes the archive even if extraction fails
>     with zipfile.ZipFile(f'{DATA_DIR}/home-credit-default-risk.zip', 'r') as zip_ref:
>         zip_ref.extractall(DATA_DIR)  # Extract all files to the data directory
>
> # Function to load data
> def load_data(in_path, name):
>     """Read one CSV file, print a short summary, and return the DataFrame."""
>     df = pd.read_csv(in_path)
>     print(f"{name}: shape is {df.shape}")
>     df.info()  # info() prints directly and returns None
>     display(df.head(5))
>     return df
>
> # Load the dataset into a dictionary
> datasets = {}  # Store datasets in a dictionary for easier tracking
> ds_name = 'application_train'
> datasets[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)
> ```  
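
Figure 1 loads only `application_train`. As a minimal sketch of how the remaining files could be read into the same kind of dictionary, the loop below iterates over the file names from Table 1. The helper name `load_all_datasets` and the throwaway demo file are illustrative assumptions, not part of our notebook.

```python
import os
import tempfile

import pandas as pd

# File stems taken from Table 1; the helper name is hypothetical.
DATASET_NAMES = [
    "application_train", "application_test", "bureau", "bureau_balance",
    "credit_card_balance", "installments_payments",
    "previous_application", "POS_CASH_balance",
]

def load_all_datasets(data_dir, names=DATASET_NAMES):
    """Read every available <name>.csv in data_dir into a dict keyed by name."""
    loaded = {}
    for name in names:
        path = os.path.join(data_dir, f"{name}.csv")
        if not os.path.exists(path):
            print(f"Skipping {name}: {path} not found")
            continue
        loaded[name] = pd.read_csv(path)
        print(f"{name}: shape is {loaded[name].shape}")
    return loaded

# Tiny demonstration with a temporary directory and a made-up two-row file.
with tempfile.TemporaryDirectory() as tmp_dir:
    pd.DataFrame({"SK_ID_CURR": [1, 2], "TARGET": [0, 1]}).to_csv(
        os.path.join(tmp_dir, "application_train.csv"), index=False
    )
    demo = load_all_datasets(tmp_dir)
```

In practice the function would be called with `DATA_DIR`; missing files are skipped with a message rather than raising.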


We conducted **exploratory data analysis (EDA)** to gain an initial understanding of the datasets, identify missing values, and examine key trends and relationships. This process involved generating summary statistics, visualizing data distributions, and exploring correlations between variables to inform feature engineering decisions. Findings from this analysis, some of which are described in the *Results* section, provided valuable insights into the structure of the data.

### **2.2.2 Data Lineage and Feature Engineering**

Managing the heterogeneity of the data was one of the primary challenges in this project. Each dataset captured unique aspects of a client’s financial behavior, often in different formats, requiring careful alignment to ensure consistency. The data included both static attributes (e.g., demographic details) and time-series data (e.g., monthly balances and installment payments), introducing a mix of snapshot and longitudinal records. Furthermore, inconsistent collection periods across datasets complicated synchronization, as data for the same client did not always span identical timeframes.


To address these challenges, the data pipeline was designed to systematically process and integrate information from multiple sources. A critical step in this pipeline was aggregating datasets to align with the unique loan application identifier (`SK_ID_CURR`). For example, datasets such as **installments_payments**, **credit_card_balance**, and **POS_CASH_balance** contained multiple entries per client, representing either monthly snapshots or individual transactions. These datasets were aggregated using statistical summaries to generate a single, consistent view for each loan application. This integration step ensured that all data sources were harmonized and usable for downstream predictive modeling. *Figure 2* illustrates the data lineage pipeline, which integrates multiple data sources into a coherent dataset for analysis.



**Figure 2. Data Lineage**

<img src="https://i.ibb.co/pbj98kF/data-lineage.png" alt="Data lineage diagram">

Feature engineering played a vital role in transforming raw data into meaningful predictors of client creditworthiness. Below, we describe the feature engineering process for each secondary dataset:

- **Previous Applications**:
  The ***previous_application*** dataset, containing historical loan applications, was processed using the **prevAppsFeaturesAggregater** transformer. Features were aggregated using statistical summaries (e.g., min, max, mean) on numeric attributes like `AMT_APPLICATION` and `AMT_ANNUITY`. A new feature, `range_AMT_APPLICATION`, representing the variability in loan request amounts, was created to capture fluctuations in financial needs, which can be indicative of credit risk.

- **Credit Bureau**:
  The **BureauFeaturesAggregater** transformer processed the ***bureau*** dataset, which contains records of clients’ previous credits with other institutions, including features such as `AMT_CREDIT_SUM` and `AMT_CREDIT_SUM_OVERDUE`. A new feature, `OVERDUE_CREDIT_RATIO`, was created to represent the ratio of overdue credit to total credit. This ratio is a key indicator of financial risk, as a high value suggests difficulty in managing credit obligations.

- **Credit Card Balances**:
  The ***credit_card_balance*** dataset, containing monthly credit card records, was processed using the **CreditCardBalanceFeatureEngineer** and **CreditCardBalanceAggregator** transformers. Features like `CREDIT_UTIL_RATIO` (ratio of balance to credit limit) and `DPD_MAX` (maximum days past due) were engineered to capture repayment behavior and credit utilization patterns.

- **Installment Payments**:
  The ***installments_payments*** dataset provides details on individual loan installment records. Using the **InstallmentsFeaturesAggregater** transformer, features such as `payment_delay_days` (difference between actual and scheduled payment dates) and `payment_to_installment_ratio` (proportion of payment made relative to the installment amount) were created. These features help capture repayment consistency over time.

- **POS and Cash Loans**:
  The ***POS_CASH_balance*** dataset, tracking POS and cash loans, was processed using the **POSCashBalanceFeaturesAggregater** transformer. New features, such as `POS_REMAINING_INSTALMENTS_RATIO` (ratio of remaining installments to total installments), were engineered to measure repayment progress.
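
The ratio-style features described above share one pattern: aggregate per `SK_ID_CURR`, then divide. The following is a minimal sketch of that pattern for `OVERDUE_CREDIT_RATIO` on made-up records. It assumes the ratio is taken over per-client totals and that a zero denominator should yield a missing value; the actual aggregater may differ on both points.

```python
import numpy as np
import pandas as pd

# Made-up bureau records: two clients, three previous credits.
bureau_toy = pd.DataFrame({
    "SK_ID_CURR": [100, 100, 200],
    "AMT_CREDIT_SUM": [1000.0, 3000.0, 0.0],
    "AMT_CREDIT_SUM_OVERDUE": [0.0, 400.0, 0.0],
})

# Sum both amounts per client, then take the ratio of the totals.
totals = bureau_toy.groupby("SK_ID_CURR")[
    ["AMT_CREDIT_SUM", "AMT_CREDIT_SUM_OVERDUE"]
].sum()

# Replace a zero denominator with NaN so the ratio is missing rather than infinite.
totals["OVERDUE_CREDIT_RATIO"] = (
    totals["AMT_CREDIT_SUM_OVERDUE"] / totals["AMT_CREDIT_SUM"].replace(0, np.nan)
)
print(totals.reset_index())
```

Leaving the ratio missing for clients with no recorded credit lets the downstream imputation step decide how to treat them, instead of passing infinite values to the models.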

To illustrate how we approached feature engineering, *Figure 3* provides a code example of one of our aggregators.

> **Figure 3. Code Snippet Feature Engineering on Installments Payments Example**
> ```python
> import pandas as pd
> from sklearn.base import BaseEstimator, TransformerMixin
> from sklearn.pipeline import make_pipeline
>
> class InstallmentsFeaturesAggregater(BaseEstimator, TransformerMixin):
>     """Aggregate installment-level payment records to one row per SK_ID_CURR."""
>
>     def __init__(self):
>         self.delay_agg_op = {
>             'payment_delay_days': ['mean', 'max']  # Average and maximum delay
>         }
>         self.payment_agg_op = {
>             'amt_payment_difference': ['sum', 'std'],  # Total and variability of the payment difference
>             'payment_to_installment_ratio': ['mean', 'max']  # Average and maximum payment ratio
>         }
>
>     def fit(self, X, y=None):
>         # Stateless transformer: nothing to learn
>         return self
>
>     def transform(self, X, y=None):
>         X = X.copy()  # Avoid adding columns to the caller's DataFrame
>
>         # Create new features
>         X['payment_delay_days'] = X['DAYS_ENTRY_PAYMENT'] - X['DAYS_INSTALMENT']
>         X['amt_payment_difference'] = X['AMT_INSTALMENT'] - X['AMT_PAYMENT']
>         # Note: rows with AMT_INSTALMENT == 0 produce inf or NaN here
>         X['payment_to_installment_ratio'] = X['AMT_PAYMENT'] / X['AMT_INSTALMENT']
>
>         # Aggregate delay features
>         delay_features = X.groupby('SK_ID_CURR').agg(self.delay_agg_op)
>         delay_features.columns = ['_'.join(col).strip() for col in delay_features.columns]
>
>         # Aggregate payment features
>         payment_features = X.groupby('SK_ID_CURR').agg(self.payment_agg_op)
>         payment_features.columns = ['_'.join(col).strip() for col in payment_features.columns]
>
>         # Combine both sets of aggregated features (aligned on SK_ID_CURR)
>         aggregated_features = pd.concat([delay_features, payment_features], axis=1)
>
>         return aggregated_features.reset_index()
>
> # Test function for the updated InstallmentsFeaturesAggregater
> def test_driver_InstallmentsFeaturesAggregater(df):
>     print(f"df.shape: {df.shape}\n")
>     print(f"Sample df columns: {df.columns}")
>     test_pipeline = make_pipeline(InstallmentsFeaturesAggregater())
>     result = test_pipeline.fit_transform(df)
>     return result
> ```

We used the aggregator classes we created to build a pipeline for each of these datasets. We then merged the aggregated datasets sequentially into the primary application dataset using a left join on `SK_ID_CURR`, so that each merge appended the aggregated features from one secondary dataset while keeping every application row. *Figure 4* shows a code example of this process.

> **Figure 4. Code Snippet of Final Dataset Creation**
> ```python
> import pandas as pd
> from sklearn.pipeline import Pipeline
> from sklearn.base import BaseEstimator, TransformerMixin
>
> # Initialize pipelines for each dataset
> bureau_feature_pipeline = Pipeline([
>     ('bureau_aggregater', BureauFeaturesAggregater(bureau_features))
> ])
>
> bureau_balance_feature_pipeline = Pipeline([
>     ('bureau_balance_aggregater', BureauBalanceFeaturesAggregater(features=bureau_balance_features))
> ])
>
> credit_card_feature_pipeline = Pipeline([
>     ('credit_card_aggregater', CreditCardFeaturesAggregater(features=ccb_features))
> ])
>
> installments_feature_pipeline = Pipeline([
>     ('installments_aggregater', InstallmentsFeaturesAggregater())
> ])
>
> pos_cash_feature_pipeline = Pipeline([
>     ('pos_cash_aggregater', POSCashBalanceFeaturesAggregater())
> ])
>
> prev_apps_feature_pipeline = Pipeline([
>     ('prev_apps_aggregater', prevAppsFeaturesAggregater(features))
> ])
>
> # Datasets
> X_train = datasets['application_train']  # Primary dataset
> bureau = datasets['bureau']
> # bureau_balance has no SK_ID_CURR, so map it in through SK_ID_BUREAU
> ID_combos = bureau[['SK_ID_CURR', 'SK_ID_BUREAU']].drop_duplicates()
> bureau_balance = pd.merge(datasets['bureau_balance'], ID_combos, on='SK_ID_BUREAU', how='left')
> credit_card_balance = datasets['credit_card_balance']
> installments = datasets['installments_payments']
> pos_cash = datasets['POS_CASH_balance']
> previous_apps = datasets['previous_application']
>
> merge_all_data = True
>
> if merge_all_data:
>     # 1. Transform and merge Bureau/Bureau Balance data
>     bureau_aggregated = bureau_feature_pipeline.transform(bureau)
>     X_train = X_train.merge(bureau_aggregated, how='left', on='SK_ID_CURR')
>     bureau_balance_aggregated = bureau_balance_feature_pipeline.transform(bureau_balance)
>     X_train = X_train.merge(bureau_balance_aggregated, how='left', on='SK_ID_CURR')
>
>     # 2. Transform and merge Credit Card Balance data
>     credit_card_aggregated = credit_card_feature_pipeline.transform(credit_card_balance)
>     X_train = X_train.merge(credit_card_aggregated, how='left', on='SK_ID_CURR')
>
>     # 3. Transform and merge Installments data
>     installments_aggregated = installments_feature_pipeline.transform(installments)
>     X_train = X_train.merge(installments_aggregated, how='left', on='SK_ID_CURR')
>
>     # 4. Transform and merge POS Cash Balance data
>     pos_cash_aggregated = pos_cash_feature_pipeline.transform(pos_cash)
>     X_train = X_train.merge(pos_cash_aggregated, how='left', on='SK_ID_CURR')
>
>     # 5. Transform and merge Previous Applications data
>     previous_apps_aggregated = prev_apps_feature_pipeline.transform(previous_apps)
>     X_train = X_train.merge(previous_apps_aggregated, how='left', on='SK_ID_CURR')
> ```
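
A left join only preserves the number of application rows if `SK_ID_CURR` is unique in the aggregated table. The sketch below, on made-up data, shows one way to guard each merge using the `validate` argument of `DataFrame.merge`. This check is a suggestion and is not part of the pipeline in Figure 4.

```python
import pandas as pd

# Made-up primary table (one row per application) and aggregated features.
applications = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})
aggregated = pd.DataFrame(
    {"SK_ID_CURR": [1, 3], "payment_delay_days_mean": [-2.5, 4.0]}
)

rows_before = len(applications)

# validate="one_to_one" makes pandas raise MergeError if SK_ID_CURR is
# duplicated on either side, which would otherwise multiply rows silently.
merged = applications.merge(
    aggregated, how="left", on="SK_ID_CURR", validate="one_to_one"
)

assert len(merged) == rows_before  # a left join on unique keys keeps the row count
print(merged)
```

Applications without a match in the secondary table (here `SK_ID_CURR == 2`) keep their row and receive missing values for the appended features.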

### **2.2.3 Feature Selection**

To reduce the dimensionality of the data, we initially tested **Principal Component Analysis (PCA)** alone. However, no single component explained a large proportion of the variance, so many components (108) were still needed to explain a significant proportion of it, as shown in *Figure 5*.

**Figure 5. Initial PCA dimensionality reduction**

<img src="https://i.ibb.co/MhBwsMR/pca.png" alt="Initial PCA dimensionality reduction">

This led us to use another approach in which we combined **correlation analysis**, **feature importance ranking**, and **Principal Component Analysis (PCA)** to select the most relevant features. With this new approach, 20 PCA components explained 95% of the variance, a substantial improvement.

To do this, we first used **correlation analysis** to identify highly correlated feature pairs (absolute correlation > 0.9); from each pair, only the feature with the higher importance score was retained, to reduce multicollinearity. **Feature importance** was computed using **Decision Trees** and **Random Forests**, with the two scores averaged to prioritize impactful features.

Finally, **PCA** was applied to the reduced dataset, retaining components that explained 95% of the variance. This reduced the dataset to **20 features**, which, alongside the full set of features, were used in subsequent experiments. *Figure 6* illustrates this process.

> **Figure 6. Code Snippet of Feature Selection**
> ```python
> import pandas as pd
> from sklearn.decomposition import PCA
> from sklearn.ensemble import RandomForestClassifier
> from sklearn.pipeline import Pipeline
> from sklearn.preprocessing import StandardScaler
> from sklearn.tree import DecisionTreeClassifier
>
> # X_train, X_valid, X_test, y_train, numerical_features, and preprocessor
> # are assumed to be defined earlier in the notebook.
>
> # Correlation Analysis
> def correlation_analysis(X, correlation_threshold=0.9):
>     """Return the set of feature pairs whose absolute correlation exceeds the threshold."""
>     correlation_matrix = X.corr()
>
>     # Identify highly correlated pairs (lower triangle only, so each pair appears once)
>     correlated_features = set()
>     for i in range(len(correlation_matrix.columns)):
>         for j in range(i):
>             if abs(correlation_matrix.iloc[i, j]) > correlation_threshold:
>                 colname1 = correlation_matrix.columns[i]
>                 colname2 = correlation_matrix.columns[j]
>                 correlated_features.add((colname1, colname2))
>
>     print(f"Highly correlated feature pairs (|correlation| > {correlation_threshold}):")
>     for pair in correlated_features:
>         print(pair)
>
>     return correlated_features
>
> # Feature Importance
> def compute_feature_importance(X_train, y_train, preprocessor):
>     """Compute feature importance using Decision Tree and Random Forest models."""
>     # Decision Tree
>     dt_model = Pipeline([
>         ('preprocessor', preprocessor),
>         ('decision_tree', DecisionTreeClassifier(random_state=42))
>     ])
>     dt_model.fit(X_train, y_train)
>     dt_importances = dt_model.named_steps['decision_tree'].feature_importances_
>
>     # Random Forest
>     rf_model = Pipeline([
>         ('preprocessor', preprocessor),
>         ('random_forest', RandomForestClassifier(random_state=42))
>     ])
>     rf_model.fit(X_train, y_train)
>     rf_importances = rf_model.named_steps['random_forest'].feature_importances_
>
>     # Extract feature names from the fitted preprocessor
>     preprocessed_features = preprocessor.get_feature_names_out().tolist()
>
>     # Ensure lengths match
>     if len(preprocessed_features) != len(dt_importances):
>         raise ValueError("Mismatch between preprocessed features and importance values.")
>
>     # Combine results and average the two importance scores
>     importance_df = pd.DataFrame({
>         'Feature': preprocessed_features,
>         'DT_Importance': dt_importances,
>         'RF_Importance': rf_importances
>     })
>     importance_df['Average_Importance'] = importance_df[['DT_Importance', 'RF_Importance']].mean(axis=1)
>     importance_df.sort_values(by='Average_Importance', ascending=False, inplace=True)
>
>     return importance_df
>
> # Combine Correlation and Feature Importance
> def combine_correlation_and_importance(correlated_features, importance_df):
>     """For each correlated pair, mark the feature with lower average importance for removal."""
>     features_to_drop = set()
>     for col1, col2 in correlated_features:
>         avg_importance_col1 = importance_df.loc[importance_df['Feature'] == col1, 'Average_Importance'].values
>         avg_importance_col2 = importance_df.loc[importance_df['Feature'] == col2, 'Average_Importance'].values
>
>         # Pairs whose names do not appear in importance_df are skipped
>         if len(avg_importance_col1) > 0 and len(avg_importance_col2) > 0:
>             if avg_importance_col1[0] < avg_importance_col2[0]:
>                 features_to_drop.add(col1)
>             else:
>                 features_to_drop.add(col2)
>
>     return features_to_drop
>
> # Feature Selection
> def select_final_features(importance_df, features_to_drop, top_features_cutoff=20):
>     """Keep the top-ranked features, excluding those flagged by the correlation analysis."""
>     top_features = set(importance_df['Feature'].head(top_features_cutoff))
>     final_features = top_features - features_to_drop
>     return final_features
>
> # Reapply PCA
> def reapply_pca(X_train_reduced):
>     """Standardize the reduced dataset and keep the components explaining 95% of the variance."""
>     scaler = StandardScaler()
>     X_scaled = scaler.fit_transform(X_train_reduced)
>
>     pca = PCA(n_components=0.95, svd_solver='full', random_state=42)
>     X_pca = pca.fit_transform(X_scaled)
>
>     print(f"PCA explained variance ratio: {pca.explained_variance_ratio_}")
>     print(f"Number of components to retain 95% variance: {pca.n_components_}")
>     return X_pca, pca
>
> # Correlation Analysis
> correlated_features = correlation_analysis(X_train[numerical_features])
>
> # Feature Importance
> importance_df = compute_feature_importance(X_train, y_train, preprocessor)
>
> # Combine Correlation and Importance
> features_to_drop = combine_correlation_and_importance(correlated_features, importance_df)
> print(f"Features to drop: {features_to_drop}")
>
> # Select Final Features
> final_features = select_final_features(importance_df, features_to_drop)
> print(f"Final Selected Features: {final_features}")
>
> # Get transformed feature names
> transformed_feature_names = preprocessor.get_feature_names_out()
>
> # Map selected features to transformed names
> final_selected_features = [feature for feature in final_features if feature in transformed_feature_names]
>
> # Debugging: Check for mismatches
> missing_features = final_features - set(final_selected_features)
> if missing_features:
>     print(f"Features not found in transformed feature names: {missing_features}")
>
> # Filter Dataset
> X_train_reduced = pd.DataFrame(preprocessor.transform(X_train), columns=transformed_feature_names)[final_selected_features]
> X_test_reduced = pd.DataFrame(preprocessor.transform(X_test), columns=transformed_feature_names)[final_selected_features]
> X_valid_reduced = pd.DataFrame(preprocessor.transform(X_valid), columns=transformed_feature_names)[final_selected_features]
>
> # Verify the reduced dataset
> print(f"Original Feature Count: {X_train.shape[1]}")
> print(f"Reduced Feature Count: {X_train_reduced.shape[1]}")
> ```
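
The component counts reported in this section can be read off the cumulative explained-variance curve. The following is a minimal sketch of that calculation on synthetic data generated with `make_classification`; the data is a stand-in, so the resulting count has no relation to the 108 or 20 components found on the HCDR features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the engineered feature matrix (not the HCDR data).
X_demo, _ = make_classification(
    n_samples=500, n_features=30, n_informative=10, random_state=42
)

# PCA is scale-sensitive, so standardize first.
X_scaled = StandardScaler().fit_transform(X_demo)

# Fit all components and accumulate their explained-variance ratios.
pca = PCA(random_state=42).fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative explained variance reaches 95%.
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)
print(f"Components needed for 95% variance: {n_needed} of {X_scaled.shape[1]}")
```

Passing `n_components=0.95` with `svd_solver='full'`, as in Figure 6, makes scikit-learn perform this selection internally; computing the curve explicitly is mainly useful for plots such as Figure 5.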

### **2.2.4 Modeling Experiments**

Initial experiments included running Logistic Regression (our baseline model), Decision Trees, Random Forests, Gradient Boosting, XGBoost, and LightGBM on all 208 features. The same models were re-trained on the reduced dataset.

Due to efficiency constraints, we trained a Support Vector Machine classifier (SVC) on a 20% sample of the training data, using the reduced features.

We also experimented with voting and stacking ensemble models on both the complete and the reduced features.
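
As a minimal sketch of this kind of subsampling, the example below draws a 20% subsample with `train_test_split` and fits an `SVC` on it. The synthetic data, the 8% positive rate, and the use of stratification are assumptions for illustration; they do not describe how our sample was actually drawn.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the reduced training set (about 8% positives).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, weights=[0.92, 0.08], random_state=42
)

# Keep 20% of the rows; stratify so the subsample preserves the class ratio.
X_sub, _, y_sub, _ = train_test_split(
    X_demo, y_demo, train_size=0.2, stratify=y_demo, random_state=42
)

# SVC training cost grows faster than linearly with the number of samples,
# which is why subsampling helps; features are scaled before fitting.
svc_model = make_pipeline(StandardScaler(), SVC(random_state=42))
svc_model.fit(X_sub, y_sub)
print(f"Subsample size: {len(X_sub)}, positive rate: {y_sub.mean():.3f}")
```

Stratifying matters under class imbalance: a purely random 20% draw could shift the already small share of defaulters.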

These models were selected for their suitability for binary classification tasks and ability to handle large, diverse datasets with a mix of categorical and numerical features.

**Table 2. Model Implementation**

| Algorithm                | Implementation                                       | Loss Function                           |
|--------------------------|-----------------------------------------------------|------------------------------------------|
| Logistic Regression       | `sklearn.linear_model.LogisticRegression`           | Log Loss (Binary Cross-Entropy)          |
| Decision Tree Classifier  | `sklearn.tree.DecisionTreeClassifier`               | Gini Impurity / Entropy                 |
| Random Forest             | `sklearn.ensemble.RandomForestClassifier`           | Gini Impurity / Entropy                 |
| Support Vector Machine    | `sklearn.svm.SVC`                                   | Hinge Loss (SVM Loss)                   |
| Gradient Boosting         | `sklearn.ensemble.GradientBoostingClassifier`       | Log Loss (Binary Cross-Entropy)         |
| Extreme Gradient Boosting | `xgboost.XGBClassifier`                             | Log Loss (Binary Cross-Entropy)         |
| Light Gradient Boosting   | `lightgbm.LGBMClassifier`                           | Log Loss (Binary Cross-Entropy)         |
| Voting Classifier         | `sklearn.ensemble.VotingClassifier`                 | Weighted Average of Component Losses    |
| Stacking Classifier       | `sklearn.ensemble.StackingClassifier`               | Meta-Model Loss                         |


- **Logistic Regression:** A straightforward and interpretable model for binary classification, used as a baseline to evaluate the performance of more advanced models.
- **Decision Tree Classifier:** Creates interpretable decision pathways by splitting on features. Helps understand which specific features indicate repayment likelihood.
- **Random Forest:** An ensemble of decision trees, offering robustness to overfitting and capturing complex, non-linear feature interactions.
- **Support Vector Machine (SVM):** Finds the optimal hyperplane for class separation, making it effective for high-dimensional feature spaces.
- **Gradient Boosting:** Sequentially builds models to correct the errors of prior models. Effective for structured data with complex, non-linear relationships.
- **Extreme Gradient Boosting (XGBoost):** An optimized version of gradient boosting with advanced features like regularization, missing value handling, and parallel processing for scalability.
- **Light Gradient Boosting (LightGBM):** Designed for speed and memory efficiency; handles large datasets and high-cardinality categorical features well.
- **Voting Classifier:** Combines predictions from multiple models using a soft voting strategy, leveraging the strengths of each model for improved performance.
- **Stacking Classifier:** Uses a meta-model to combine the outputs of several base models, often leading to better generalization by exploiting model complementarities.
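
Stacking is the one ensemble strategy in the list above without a code figure in this section, so a minimal sketch may help. It uses a small synthetic dataset from `make_classification` and two arbitrarily chosen base models; the data, the base models, and all variable names are assumptions for illustration rather than the project's actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced toy data (assumption: stands in for the real features)
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Base models produce out-of-fold probabilities; the meta-model learns from them
stacking_clf = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=42)),
        ("forest", RandomForestClassifier(n_estimators=50, random_state=42)),
    ],
    final_estimator=LogisticRegression(),  # the meta-model
    cv=3,
    stack_method="predict_proba",
)
stacking_clf.fit(X_tr, y_tr)

# Probability of the positive class for each held-out sample
proba = stacking_clf.predict_proba(X_te)[:, 1]
print(proba[:5])
```

Because the meta-model is trained on cross-validated predictions of the base models, it does not see predictions that the base models made on their own training rows.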

**Figure 7. Summary of Modeling Pipeline**
<img src=https://i.imgur.com/MIdjkKf.png/>

To evaluate our models for the HCDR dataset we used accuracy, precision, recall, F1 score, AUC-ROC, and log loss metrics, which are common metrics used to evaluate classification models. The results were logged in an experiments table. *Table 3* contains a description and formula for each metric.

**Table 3. Evaluation Metrics**

| Name                               | Description                                                                                     | Formula                                                                                       |
|------------------------------------|-------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------|
| Accuracy                           | Proportion of correctly predicted samples out of the total samples                              | $$\text{Accuracy} = \frac{\text{True Positives + True Negatives}}{\text{True Positives + True Negatives + False Positives + False Negatives}}$$ |
| Precision                          | Proportion of true positive predictions out of all positive predictions                         | $$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$ |
| Recall (Sensitivity)               | Proportion of actual positives correctly identified                                             | $$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$ |
| F1 Score                           | Harmonic mean of precision and recall, balancing the two metrics                                | $$\text{F1 Score} = 2 \cdot \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$ |
| AUC-ROC                            | Area under the Receiver Operating Characteristic curve, measuring the model's ability to distinguish between classes | $$\text{AUC-ROC} = \int_{0}^{1} \text{TPR} \; d(\text{FPR})$$ |
| Log Loss                           | Logarithmic loss penalizes wrong predictions more as they deviate from true class probabilities | $$\text{Log Loss} = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} y_k^{(i)} \log(\hat{p}_k^{(i)})$$ |
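
As a worked example of the count-based formulas in *Table 3*, the sketch below computes each metric from a set of confusion-matrix counts, plus the binary case ($K = 2$) of the log loss. The counts and probabilities are hypothetical and are not results from this project; the function names are our own.

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Compute the count-based metrics from Table 3 for the positive class."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

def binary_log_loss(y_true, p_hat):
    """Binary case of the log loss formula; p_hat holds P(class = 1)."""
    total = 0.0
    for y, p in zip(y_true, p_hat):
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

# Hypothetical confusion-matrix counts (illustration only)
metrics = classification_metrics(tp=30, fp=20, tn=900, fn=50)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")

# The confident wrong prediction (0.05 for a true positive) dominates the loss
print(f"log loss: {binary_log_loss([1, 0, 1], [0.9, 0.2, 0.05]):.4f}")
```

With these counts, accuracy is high while recall stays low, which is the pattern to watch for on imbalanced data.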


- **Accuracy** is the number of correctly predicted samples divided by the total number of samples, i.e., how often the model predicts the correct outcome. This metric is misleading when the data set is imbalanced, meaning the overwhelming majority of samples share the same target class: a model that always predicts the majority class still scores highly.

- **Precision** is the ratio of true positives to all predicted positives. It is a good metric choice for projects where false positives are costly and should be minimized.

- **Recall** is the ratio of true positives to all actual positives (true positives plus false negatives). It is a good metric choice for projects where false negatives are costly.

- **F1 Score** is the harmonic mean of precision and recall. It is a good metric choice for projects where false positives and false negatives carry similar weight.

- **AUC-ROC** is the area under the Receiver Operating Characteristic curve, which measures a model's ability to distinguish between classes across all classification thresholds. Because it does not depend on a single threshold, AUC-ROC is more informative than accuracy for imbalanced data sets.

- **Log (Logarithmic) loss** penalizes wrong predictions more as they deviate from true class probabilities. Log loss is a good metric choice for balanced and imbalanced data sets but can be sensitive to outliers.
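
To make the caveat about accuracy concrete, the sketch below scores a trivial "model" that always predicts the majority class. The labels are hypothetical (8 positives among 100 samples), and `roc_auc` is our own helper that computes AUC as the fraction of positive/negative pairs ranked correctly, which is equivalent to the area under the ROC curve.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied pairs count as half a correct ranking.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical imbalanced labels: 8 defaulters (1) among 100 applicants
y_true = [1] * 8 + [0] * 92

# A "model" that always predicts the majority class with a constant score
y_pred = [0] * 100
constant_scores = [0.0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_pos / sum(y_true)
auc = roc_auc(y_true, constant_scores)

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}, auc={auc:.2f}")
```

The constant predictor reaches high accuracy while identifying no defaulters, and its AUC sits at the chance level, which is why we report recall, F1, and AUC-ROC alongside accuracy.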

*Figure 8* gives an example of the code used to train and evaluate our models on the full feature set, while *Figure 9* shows a code sample of our voting ensemble classifier. The full code of our experiments can be found in the Code Appendix.   

> **Figure 8. Code Snippet of Modelling and Evaluation**
> ```python
> import pandas as pd
> import numpy as np
> from sklearn.model_selection import train_test_split, GridSearchCV
> from sklearn.tree import DecisionTreeClassifier
> from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
> from sklearn.linear_model import LogisticRegression
> from xgboost import XGBClassifier
> from lightgbm import LGBMClassifier
> from sklearn.pipeline import Pipeline
> from sklearn.compose import ColumnTransformer
> from sklearn.preprocessing import StandardScaler
> from sklearn.impute import SimpleImputer
> from category_encoders import TargetEncoder
> from sklearn.metrics import classification_report, accuracy_score
> from sklearn.metrics import roc_auc_score, precision_score, recall_score, f1_score, log_loss
>
> # Define a dictionary of models
> models = {
>     # "SVC": SVC(probability=True, random_state=42), # excluding due to high resource requirements
>     "LogisticRegression": LogisticRegression(random_state=42),
>     "DecisionTree": DecisionTreeClassifier(random_state=42),
>     "RandomForest": RandomForestClassifier(random_state=42),
>     "GradientBoosting": GradientBoostingClassifier(random_state=42),
>     "XGBoosting": XGBClassifier(random_state=42),
>     "LightGradientBoosting": LGBMClassifier(random_state=42)
> }
>
> # Split features into categorical and numerical groups
> categorical_features = X_train.select_dtypes(include=['object', 'category']).columns.tolist()
> numerical_features = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
>
> # Preprocessing pipelines for categorical and numerical data
> categorical_transformer = Pipeline(steps=[
>     ('target_encoder', TargetEncoder(handle_unknown='ignore'))  # Encode categorical features
> ])
>
> numerical_transformer = Pipeline(steps=[
>     ('imputer', SimpleImputer(strategy='mean')),  # Impute missing values with mean
>     ('scaler', StandardScaler())  # Standardize numerical features
> ])
>
> # Combine preprocessing steps
> preprocessor = ColumnTransformer(
>     transformers=[
>         ('num', numerical_transformer, numerical_features),
>         ('cat', categorical_transformer, categorical_features)
>     ]
> )
>
> # Create pipelines for each model
> pipelines = {}
> for model_name, model in models.items():
>     pipelines[model_name] = Pipeline(steps=[
>         ('preprocessor', preprocessor),
>         ('model', model)
>     ])
>
> # Train each pipeline and store the trained models
> # (in the notebook this step is timed with the %%time cell magic,
> #  which must be the first line of its own cell)
> best_models = {}
> for model_name, pipeline in pipelines.items():
>     print(f"Training {model_name}...")
>     pipeline.fit(X_train, y_train)
>     best_models[model_name] = pipeline
>
> # Initialize experiment log DataFrame
> try:
>     expLog
> except NameError:
>     expLog = pd.DataFrame(columns=[
>         'Experiment_Name', 'Train_Accuracy', 'Valid_Accuracy', 'Test_Accuracy',
>         'Train_ROC_AUC', 'Valid_ROC_AUC', 'Test_ROC_AUC',
>         'Train_Precision', 'Valid_Precision', 'Test_Precision',
>         'Train_Recall', 'Valid_Recall', 'Test_Recall',
>         'Train_F1', 'Valid_F1', 'Test_F1',
>         'Train_Log_Loss', 'Valid_Log_Loss', 'Test_Log_Loss'
>     ])
>
> # Function to log model performance metrics
> # (reads y_train, y_valid and y_test from the notebook's global scope)
> def populate_expLog(X_train, X_valid, X_test, model_name, model):
>     exp_name = f"{model_name}_{X_train.shape[1]}_features"
>     global expLog
>     expLog.loc[len(expLog)] = [exp_name] + list(np.round(
>         [
>             accuracy_score(y_train, model.predict(X_train)),
>             accuracy_score(y_valid, model.predict(X_valid)),
>             accuracy_score(y_test, model.predict(X_test)),
>             roc_auc_score(y_train, model.predict_proba(X_train)[:, 1]),
>             roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1]),
>             roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]),
>             precision_score(y_train, model.predict(X_train)),
>             precision_score(y_valid, model.predict(X_valid)),
>             precision_score(y_test, model.predict(X_test)),
>             recall_score(y_train, model.predict(X_train)),
>             recall_score(y_valid, model.predict(X_valid)),
>             recall_score(y_test, model.predict(X_test)),
>             f1_score(y_train, model.predict(X_train)),
>             f1_score(y_valid, model.predict(X_valid)),
>             f1_score(y_test, model.predict(X_test)),
>             log_loss(y_train, model.predict_proba(X_train)[:, 1]),
>             log_loss(y_valid, model.predict_proba(X_valid)[:, 1]),
>             log_loss(y_test, model.predict_proba(X_test)[:, 1])
>         ],
>         4
>     ))
>
> # Evaluate each model and log results
> for model_name, model in best_models.items():
>     y_pred = model.predict(X_test)
>     print(f"\n{model_name} Performance:")
>     print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
>     print(classification_report(y_test, y_pred))
>     populate_expLog(X_train, X_valid, X_test, model_name, model)
> ```

> **Figure 9. Code Snippet of Voting Ensemble Model**
> ```python
> import pandas as pd
> import numpy as np
> from sklearn.pipeline import Pipeline
> from sklearn.compose import ColumnTransformer
> from sklearn.preprocessing import StandardScaler
> from sklearn.impute import SimpleImputer
> from category_encoders import TargetEncoder
> from sklearn.linear_model import LogisticRegression
> from sklearn.tree import DecisionTreeClassifier
> from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
> from xgboost import XGBClassifier
> from lightgbm import LGBMClassifier
> from sklearn.metrics import classification_report, accuracy_score
>
> # Define the best models for the voting classifier
> best_models = {
>     # "SVC": SVC(probability=True, random_state=42), # Excluded due to resource constraints
>     "LogisticRegression": LogisticRegression(random_state=42),
>     "DecisionTree": DecisionTreeClassifier(random_state=42),
>     "RandomForest": RandomForestClassifier(random_state=42),
>     "GradientBoosting": GradientBoostingClassifier(random_state=42),
>     "XGBoosting": XGBClassifier(random_state=42),
>     "LightGradientBoosting": LGBMClassifier(random_state=42)
> }
>
> # Define feature types for preprocessing
> categorical_features = X_train.select_dtypes(include=['object', 'category']).columns.tolist()
> numerical_features = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
>
> # Create preprocessing pipelines for categorical and numerical data
> categorical_transformer = Pipeline(steps=[
>     ('target_encoder', TargetEncoder(handle_unknown='ignore'))  # Encode categorical features
> ])
>
> numerical_transformer = Pipeline(steps=[
>     ('imputer', SimpleImputer(strategy='mean')),  # Impute missing values with mean
>     ('scaler', StandardScaler())  # Standardize numerical features
> ])
>
> # Combine the preprocessing pipelines into a single preprocessor
> voting_preprocessor = ColumnTransformer(
>     transformers=[
>         ('num', numerical_transformer, numerical_features),
>         ('cat', categorical_transformer, categorical_features)
>     ]
> )
>
> # Create pipelines for each model with preprocessing
> voting_pipelines = {}
> for model_name, model in best_models.items():
>     voting_pipelines[model_name] = Pipeline(steps=[
>         ('preprocessor', voting_preprocessor),
>         ('model', model)
>     ])
>
> # Combine the per-model pipelines into a soft-voting classifier
> # (VotingClassifier clones each estimator before fitting it)
> voting_estimators = list(voting_pipelines.items())
> voting_clf = VotingClassifier(estimators=voting_estimators, voting='soft')
>
> # Train the voting classifier
> # (in the notebook this step is timed with the %%time cell magic,
> #  which must be the first line of its own cell)
> voting_clf.fit(X_train, y_train)
>
> # Evaluate the voting classifier on the test set
> y_pred_voting = voting_clf.predict(X_test)
> print("\nVoting Classifier Performance:")
> print(f"Accuracy: {accuracy_score(y_test, y_pred_voting):.4f}")
> print(classification_report(y_test, y_pred_voting))
>
> # Log performance metrics (populate_expLog is defined in Figure 8)
> populate_expLog(X_train, X_valid, X_test, "Voting_ensemble", voting_clf)
> ```

We performed **hyperparameter tuning** on our best-performing model, XGBoost, which we selected for its combination of high accuracy (91.59%), computational efficiency, and scalability during training. While other models, such as Light Gradient Boosting and its ensemble variations, performed strongly across multiple metrics, XGBoost's rapid training time and ability to handle large datasets made it the most practical choice for our workflow.



We chose the following hyperparameters to tune based on their function and typical impact on classification performance:

- **`learning_rate`**: This parameter scales the contribution of each new tree (the shrinkage step size), controlling how quickly the model learns. We chose values `[0.1, 0.2]` to test a moderate learning rate range, balancing convergence speed and potential overfitting.

- **`max_depth`**: This determines the maximum depth of the trees, controlling the model's capacity to capture complex interactions. Values `[3, 5, 7]` were selected to explore shallow to moderately deep trees, balancing interpretability and overfitting risk.

- **`subsample`**: This specifies the fraction of training data used to grow trees, helping to reduce overfitting. We tested `[0.8, 1.0]` to compare the impact of using all data versus a slightly reduced sample.

- **`colsample_bytree`**: This parameter defines the fraction of features randomly sampled for each tree. Values `[0.8, 1.0]` were chosen to assess how limiting features per tree affects model performance and generalization.
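
To give a sense of the search cost, the sketch below enumerates the grid implied by the values listed above. The 3-fold setting is taken from the `cv=3` argument in *Figure 10*; the variable names are our own.

```python
from itertools import product

# Search space as listed above
param_grid = {
    "learning_rate": [0.1, 0.2],
    "max_depth": [3, 5, 7],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.8, 1.0],
}
cv_folds = 3  # number of cross-validation folds

# Cartesian product of all candidate values, one dict per configuration
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]
n_fits = len(combinations) * cv_folds

print(f"{len(combinations)} combinations x {cv_folds} folds = {n_fits} model fits")
print("first candidate:", combinations[0])
```

An exhaustive grid search fits one model per configuration and fold, and `GridSearchCV` by default refits the best configuration once more on the full training set, so adding values for another parameter such as `n_estimators` multiplies the cost.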

*Figure 10* shows a code snippet from our hyperparameter tuning on XGBoost.


> **Figure 10. Code Snippet of Hyperparameter Tuning**
> ```python
> import pandas as pd
> import numpy as np
> from sklearn.pipeline import Pipeline
> from sklearn.compose import ColumnTransformer
> from sklearn.preprocessing import StandardScaler
> from sklearn.impute import SimpleImputer
> from category_encoders import TargetEncoder
> from sklearn.model_selection import GridSearchCV
> from xgboost import XGBClassifier
>
> # Define the model and hyperparameter grid for tuning
> models_PCA_tuning = {
>     "XGBoost": {
>         "model": XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42),
>         "params": {
>             "learning_rate": [0.1, 0.2],  # Learning rate values to test
>             # "n_estimators": [100, 200, 500],  # Uncomment if testing more estimators
>             "max_depth": [3, 5, 7],  # Depth of the trees
>             "subsample": [0.8, 1.0],  # Subsampling ratio of the training instances
>             "colsample_bytree": [0.8, 1.0]  # Subsampling ratio of columns by tree
>         }
>     }
> }
>
> # Identify categorical and numerical features
> categorical_features = X_train_reduced.select_dtypes(include=['object', 'category']).columns.tolist()
> numerical_features = X_train_reduced.select_dtypes(include=['int64', 'float64']).columns.tolist()
>
> # Preprocessing pipelines for categorical and numerical data
> categorical_transformer = Pipeline(steps=[
>     ('target_encoder', TargetEncoder(handle_unknown='ignore'))  # Encode categorical variables
> ])
>
> numerical_transformer = Pipeline(steps=[
>     ('imputer', SimpleImputer(strategy='mean')),  # Fill missing values with mean
>     ('scaler', StandardScaler())  # Standardize numerical features
> ])
>
> # Combine the preprocessing steps
> preprocessor = ColumnTransformer(
>     transformers=[
>         ('num', numerical_transformer, numerical_features),
>         ('cat', categorical_transformer, categorical_features)
>     ]
> )
>
> # Create pipelines for models and tune using GridSearchCV
> pipelines = []
> for model_name, model_info in models_PCA_tuning.items():
>     pipeline = Pipeline(steps=[
>         ('preprocessor', preprocessor),  # Apply preprocessing
>         ('model', model_info['model'])  # Add model
>     ])
>
>     # Prepare the parameter grid for the model
>     param_grid = {f"model__{param_name}": param_values for param_name, param_values in model_info['params'].items()}
>
>     # Perform GridSearchCV to find the best parameters
>     grid_search = GridSearchCV(
>         estimator=pipeline,
>         param_grid=param_grid,
>         cv=3,  # 3-fold cross-validation
>         scoring='accuracy',  # Optimize for accuracy
>         n_jobs=-1  # Use all available cores
>     )
>
>     pipelines.append((model_name, grid_search))
>
> # Train models and store the best estimators
> best_models_PCA_tuning = {}
> for model_name, grid_search in pipelines:
>     print(f"Training {model_name}...")
>     grid_search.fit(X_train_reduced, y_train)
>     best_models_PCA_tuning[model_name] = grid_search.best_estimator_  # Store the best model
>
> # Extract best parameters for each model
> best_params = []
> for model_name, grid_search in pipelines:
>     best_params.append({
>         'Model': model_name,
>         **grid_search.best_params_  # Add best parameters to the dictionary
>     })
>
> # Convert best parameters to a DataFrame for visualization
> best_params_df = pd.DataFrame(best_params)
>
> # Print the best parameters in a formatted output
> print("Best Parameters for Each Model:\n")
> for index, row in best_params_df.iterrows():
>     print(f"Model: {row['Model']}")
>     for param, value in row.items():
>         if param != 'Model':
>             print(f"  {param}: {value}")
>     print("-" * 40)  # Separator for readability
> ```

### **2.2.5 MLP Implementation**

For our Multi-Layer Perceptron (MLP) experiments, we explored multiple architectures to benchmark performance and evaluate different strategies for improving model efficacy. These included a **Sparse Input MLP** for baseline comparisons, a **Simple Deep MLP** with additional depth, an **MLP with Dropout** to mitigate overfitting, and a **Wider MLP** designed to enhance representational capacity.

We implemented an MLP using **PyTorch Lightning**. The baseline architecture consisted of two hidden layers with 64 and 32 units, respectively, each followed by ReLU activation. The output layer was configured for binary classification. Preprocessing was handled through a pipeline, where numerical features were standardized, and categorical variables were encoded before training.

The dataset was split into training, validation, and test sets, with data loading managed dynamically using a **LightningDataModule** to ensure scalability and reproducibility. The **Adam optimizer** with a learning rate of 0.001 was used for optimization, and a checkpoint callback monitored validation accuracy, saving the best-performing model during training.

Models were trained for 20 epochs with a batch size of 64, utilizing GPU acceleration to improve computational efficiency. After training, the best checkpoint was reloaded and evaluated on the test set to measure final performance. **TensorBoard** was employed to track training metrics and visualize the learning process.

To test different architectures, we ran the following experiments using the `run_experiment` function:
- **Sparse Input MLP:** Two hidden layers with `(64, 32)` units.
- **Simple Deep MLP:** Three hidden layers with `(128, 64, 32)` units.
- **MLP with Dropout:** Three hidden layers with `(128, 64, 32)` units and a dropout rate of 0.2.
- **Wider MLP:** Three hidden layers with `(512, 256, 128)` units.

Results from each experiment were logged to ensure consistent evaluation and comparison across architectures. These experiments provided valuable insights into the impact of architectural variations on model performance.
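
To compare the capacity of the four architectures, the sketch below counts the trainable parameters (weights plus biases) of each fully connected network with a two-unit output layer. The input dimension of 100 is a hypothetical placeholder, not the project's actual feature count, and `count_mlp_parameters` is our own helper.

```python
def count_mlp_parameters(input_dim, hidden_units, output_dim=2):
    """Count weights and biases of a fully connected network."""
    total = 0
    previous = input_dim
    for units in list(hidden_units) + [output_dim]:
        total += previous * units + units  # weight matrix + bias vector
        previous = units
    return total

INPUT_DIM = 100  # hypothetical feature count, for illustration only

architectures = {
    "Sparse Input MLP": (64, 32),
    "Simple Deep MLP": (128, 64, 32),
    "MLP with Dropout": (128, 64, 32),  # dropout adds no trainable parameters
    "Wider MLP": (512, 256, 128),
}
param_counts = {
    name: count_mlp_parameters(INPUT_DIM, units)
    for name, units in architectures.items()
}
for name, count in param_counts.items():
    print(f"{name}: {count:,} parameters")
```

Under this assumption the Wider MLP has roughly an order of magnitude more parameters than the Simple Deep MLP, which is worth keeping in mind when comparing their tendency to overfit.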

In *Figure 11* we show a diagram of our MLP pipeline and in *Figure 12* a sample of the code implementation. The full implementation can be found in the Code Appendix.


**Figure 11. MLP Pipeline**
<img src=https://i.ibb.co/GtQx7kZ/mlp-pipeline.png>

> **Figure 12. Code Snippet of MLP Implementation**
> ```python
> # Install necessary dependencies
> !pip install pytorch-lightning torch torchvision torchmetrics
>
> # Import required libraries
> import pytorch_lightning as pl
> import torch
> import torch.nn as nn
> from torchmetrics import Accuracy
> from torchmetrics.functional import precision, recall, f1_score, auroc
> from pkg_resources import parse_version
> from torchmetrics import __version__ as torchmetrics_version
> from torch.utils.data import DataLoader, TensorDataset, random_split
> import pandas as pd
> from sklearn.preprocessing import StandardScaler
>
> class MultiLayerPerceptron(pl.LightningModule):
>     def __init__(self, input_dim, hidden_units=(64, 32)):
>         super().__init__()
>
>         # Define accuracy metrics based on the installed torchmetrics version
>         if parse_version(torchmetrics_version) > parse_version("0.8"):
>             self.train_acc = Accuracy(task="multiclass", num_classes=2)  # Adjust num_classes
>             self.valid_acc = Accuracy(task="multiclass", num_classes=2)
>             self.test_acc = Accuracy(task="multiclass", num_classes=2)
>         else:
>             self.train_acc = Accuracy()
>             self.valid_acc = Accuracy()
>             self.test_acc = Accuracy()
>
>         # Define MLP architecture
>         all_layers = [nn.Flatten()]
>         for hidden_unit in hidden_units:
>             all_layers.append(nn.Linear(input_dim, hidden_unit))
>             all_layers.append(nn.ReLU())
>             input_dim = hidden_unit
>         all_layers.append(nn.Linear(hidden_units[-1], 2))  # Output layer for binary classification
>         self.model = nn.Sequential(*all_layers)
>
>     def forward(self, x):
>         return self.model(x)
>
>     def training_step(self, batch, batch_idx):
>         x, y = batch
>         logits = self(x)
>         loss = nn.functional.cross_entropy(logits, y)
>         preds = torch.argmax(logits, dim=1)
>         self.train_acc.update(preds, y)
>         self.log("train_loss", loss, prog_bar=True)
>         self.log("train_acc", self.train_acc.compute(), prog_bar=True)  # Log train accuracy
>
>         train_precision = precision(preds, y, task="binary")
>         train_recall = recall(preds, y, task="binary")
>         train_f1 = f1_score(preds, y, task="binary")
>         train_roc_auc = auroc(logits.softmax(dim=-1)[:, 1], y, task="binary")
>         self.log("train_precision", train_precision, prog_bar=True)
>         self.log("train_recall", train_recall, prog_bar=True)
>         self.log("train_f1", train_f1, prog_bar=True)
>         self.log("train_roc_auc", train_roc_auc, prog_bar=True)
>
>         return loss
>
>     def validation_step(self, batch, batch_idx):
>         x, y = batch
>         logits = self(x)
>         loss = nn.functional.cross_entropy(logits, y)
>         preds = torch.argmax(logits, dim=1)
>         self.valid_acc.update(preds, y)
>         self.log("valid_loss", loss, prog_bar=True)
>         self.log("valid_acc", self.valid_acc.compute(), prog_bar=True)  # Log validation accuracy
>
>         valid_precision = precision(preds, y, task="binary")
>         valid_recall = recall(preds, y, task="binary")
>         valid_f1 = f1_score(preds, y, task="binary")
>         valid_roc_auc = auroc(logits.softmax(dim=-1)[:, 1], y, task="binary")
>         self.log("valid_precision", valid_precision, prog_bar=True)
>         self.log("valid_recall", valid_recall, prog_bar=True)
>         self.log("valid_f1", valid_f1, prog_bar=True)
>         self.log("valid_roc_auc", valid_roc_auc, prog_bar=True)
>
>         return loss
>
>     def test_step(self, batch, batch_idx):
>         x, y = batch
>         logits = self(x)
>         loss = nn.functional.cross_entropy(logits, y)
>         preds = torch.argmax(logits, dim=1)
>         self.test_acc.update(preds, y)
>         self.log("test_loss", loss, prog_bar=True)
>         self.log("test_acc", self.test_acc.compute(), prog_bar=True)  # Log test accuracy
>
>         test_precision = precision(preds, y, task="binary")
>         test_recall = recall(preds, y, task="binary")
>         test_f1 = f1_score(preds, y, task="binary")
>         test_roc_auc = auroc(logits.softmax(dim=-1)[:, 1], y, task="binary")
>         self.log("test_precision", test_precision, prog_bar=True)
>         self.log("test_recall", test_recall, prog_bar=True)
>         self.log("test_f1", test_f1, prog_bar=True)
>         self.log("test_roc_auc", test_roc_auc, prog_bar=True)
>
>         return loss
>
>     def configure_optimizers(self):
>         return torch.optim.Adam(self.parameters(), lr=0.001)
>
> # Data preprocessing: fit the scaler on the training data only
> scaler = StandardScaler()
> X_train_reduced = pd.DataFrame(scaler.fit_transform(X_train_reduced), columns=X_train_reduced.columns)
> X_test_reduced = pd.DataFrame(scaler.transform(X_test_reduced), columns=X_test_reduced.columns)
>
> # Convert data to PyTorch tensors
> X_train_tensor = torch.tensor(X_train_reduced.values, dtype=torch.float32)
> y_train_tensor = torch.tensor(y_train.values, dtype=torch.long)
> X_test_tensor = torch.tensor(X_test_reduced.values, dtype=torch.float32)
> y_test_tensor = torch.tensor(y_test.values, dtype=torch.long)
>
> # Create datasets and dataloaders
> batch_size = 64
> dataset = TensorDataset(X_train_tensor, y_train_tensor)
> train_size = int(0.8 * len(dataset))
> val_size = len(dataset) - train_size
> train_dataset, val_dataset = random_split(dataset, [train_size, val_size])
> test_dataset = TensorDataset(X_test_tensor, y_test_tensor)
>
> train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
> val_loader = DataLoader(val_dataset, batch_size=batch_size)
> test_loader = DataLoader(test_dataset, batch_size=batch_size)
>
> # Define a custom PyTorch Lightning DataModule
> class CustomDataModule(pl.LightningDataModule):
>     def __init__(self, X_train, X_test, y_train, y_test, batch_size=64):
>         super().__init__()
>         self.X_train = X_train
>         self.X_test = X_test
>         self.y_train = y_train
>         self.y_test = y_test
>         self.batch_size = batch_size
>         self.input_dim = X_train.shape[1]
>
>     def setup(self, stage=None):
>         dataset = TensorDataset(torch.tensor(self.X_train.values, dtype=torch.float32),
>                                  torch.tensor(self.y_train.values, dtype=torch.long))
>         train_size = int(0.8 * len(dataset))
>         val_size = len(dataset) - train_size
>         self.train_dataset, self.val_dataset = random_split(dataset, [train_size, val_size])
>         self.test_dataset = TensorDataset(torch.tensor(self.X_test.values, dtype=torch.float32),
>                                           torch.tensor(self.y_test.values, dtype=torch.long))
>
>     def train_dataloader(self):
>         return DataLoader(self.train_dataset, batch_size=self.batch_size, shuffle=True)
>
>     def val_dataloader(self):
>         return DataLoader(self.val_dataset, batch_size=self.batch_size)
>
>     def test_dataloader(self):
>         return DataLoader(self.test_dataset, batch_size=self.batch_size)
>
> # Initialize the DataModule
> data_module = CustomDataModule(X_train_reduced, X_test_reduced, y_train, y_test, batch_size=batch_size)
>
> # Define and run experiments
> experiment_logs = []
> def run_experiment(model_name, hidden_units, dropout_rate=None):
>     if dropout_rate:
>         # Same architecture as the base MLP, with dropout after each ReLU
>         class DropoutMLP(MultiLayerPerceptron):
>             def __init__(self, input_dim, hidden_units, dropout_rate):
>                 super().__init__(input_dim, hidden_units)
>                 layers = []
>                 for hidden_unit in hidden_units:
>                     layers.append(nn.Linear(input_dim, hidden_unit))
>                     layers.append(nn.ReLU())
>                     layers.append(nn.Dropout(p=dropout_rate))
>                     input_dim = hidden_unit
>                 layers.append(nn.Linear(hidden_units[-1], 2))
>                 self.model = nn.Sequential(*layers)
>         model = DropoutMLP(input_dim=data_module.input_dim, hidden_units=hidden_units, dropout_rate=dropout_rate)
>     else:
>         model = MultiLayerPerceptron(input_dim=data_module.input_dim, hidden_units=hidden_units)
>
>     trainer = pl.Trainer(
>         max_epochs=20,
>         accelerator="gpu" if torch.cuda.is_available() else "cpu",
>         devices=1,  # Use a single device (GPU if available, otherwise CPU)
>     )
>     trainer.fit(model, datamodule=data_module)
>     results = trainer.test(model=model, dataloaders=data_module.test_dataloader(), verbose=False)
>
>     metrics = {
>         "Experiment_Name": model_name,
>         "Hidden_Units": hidden_units,
>         "Dropout_Rate": dropout_rate,
>         "Test_Accuracy": results[0]["test_acc"],
>         "Test_Loss": results[0]["test_loss"],
>         "Test_Precision": results[0]["test_precision"],
>         "Test_Recall": results[0]["test_recall"],
>         "Test_ROC_AUC": results[0]["test_roc_auc"],
>         "Test_F1": results[0]["test_f1"],
>     }
>     experiment_logs.append(metrics)
>     print(f"Experiment {model_name} complete!")
>
> # Run experiments
> run_experiment("Sparse Input MLP", hidden_units=(64, 32))
> run_experiment("Simple Deep MLP", hidden_units=(128, 64, 32))
> run_experiment("MLP with Dropout", hidden_units=(128, 64, 32), dropout_rate=0.2)
> run_experiment("Wider MLP", hidden_units=(512, 256, 128))
>
> # Start TensorBoard
> %load_ext tensorboard
> %tensorboard --logdir lightning_logs/
> ```


**Figure 12. TensorBoard use**
<img src=https://i.ibb.co/Y7YSjb4/tensorboard.png/>

### **2.2.6 Data Leakage Mitigation**

**Data Leakage**

Data leakage occurs when model training uses information from outside the training dataset (e.g., test data, future data, or features derived from the target variable). It often arises when the train/test split happens too late in the pipeline, so that information from the test data reaches the training step. Data leakage is an issue because it makes the model appear to perform better than it actually will in a real-world application. The model is likely to overfit because it is using all of the data to train instead of just the training set. Thus, the test metrics will likely be high, but these scores are not a good reflection of how the model will perform on unseen data.

Data leakage is a common error in machine learning because it usually occurs subtly. Machine learning pipelines are often complex with many different steps where data is processed or analyzed, so it can be difficult to detect leakage. Below, we discuss what can be done to detect and fix data leakage, as well as what we did to minimize data leakage in our project.

**Ways to detect and fix leakage:**
One way to identify data leakage is to compare the training and test metrics. Large gaps between the two, or test metrics that look too good to be true, are a sign that the pipeline deserves a closer look, although a training score far above the test score can also reflect ordinary overfitting. Another sign of possible data leakage is a strong unexpected or illogical correlation between a feature and the target variable. This strong association can occur if information from the target variable was used to create other features. Finally, if the model shows inconsistent performance across different datasets, there may be a data leakage issue: the model will likely perform well on unseen data similar to the test set, but poorly on more varied datasets.

If you suspect leakage in your pipeline, you should first ensure that the train/test split is performed before any preprocessing of the data. Feature engineering steps should also be checked to make sure that future data (data that would not be available at the time of training/prediction) and target variable information (e.g., features derived from the target variable) are not included in your model training. You can then remove any features that are found to have illogically high feature importance or strong illogical correlations with the target variable.
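The correlation check described above can be sketched as follows. This is a minimal illustration, not part of our pipeline: the helper name `flag_suspicious_features`, the 0.9 threshold, and the toy columns `noise` and `leaky` are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd


def flag_suspicious_features(X, y, threshold=0.9):
    """Return features whose absolute correlation with the target exceeds
    a (hypothetical) threshold, sorted from strongest to weakest."""
    corr = X.corrwith(y).abs()
    return corr[corr > threshold].sort_values(ascending=False)


# Toy data: one pure-noise feature and one feature built from the target.
rng = np.random.default_rng(0)
y = pd.Series(rng.integers(0, 2, size=200), name="TARGET")
X = pd.DataFrame({
    "noise": rng.normal(size=200),
    "leaky": y + rng.normal(scale=0.01, size=200),  # derived from TARGET
})

suspicious = flag_suspicious_features(X, y)
print(suspicious)
```

Only the `leaky` column is flagged here. On real data a flagged feature is a prompt to review how it was built, not proof of leakage, since a legitimate feature can also be highly predictive.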

**What we did to minimize leakage:**
An important step in minimizing data leakage is to ensure the train/test split is performed before any preprocessing is done on the data. Preprocessing includes operations such as scaling, imputing missing values, and encoding categorical variables. In our pipeline, we performed the train/test split after merging the secondary tables with the primary data table (Application Train). The split occurs before the preprocessing done by the Categorical and Numerical Transformers. Splitting the data before preprocessing is important because none of the values from the test data should be used to scale/impute/encode the values in the training data. The preprocessing is performed separately on the training and test sets so no information is shared between the two sets.
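As an illustration of the split-then-preprocess order, the sketch below uses scikit-learn on a toy table. The column names, the toy values, and the specific imputation and scaling choices are assumptions for the example and do not reproduce our actual Categorical and Numerical Transformers. The point it shows is one common way to keep test data out of preprocessing: the transformers learn their statistics from the training split only and are then applied unchanged to the test split.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the merged application table (values are made up).
df = pd.DataFrame({
    "income": [50.0, 60.0, np.nan, 80.0, 55.0, 70.0, 65.0, 90.0],
    "occupation": ["Laborers", "IT staff", "Laborers", "HR staff",
                   "Laborers", "IT staff", "Laborers", "Laborers"],
    "TARGET": [0, 0, 1, 0, 0, 1, 0, 0],
})
X, y = df.drop(columns="TARGET"), df["TARGET"]

# 1. Split first, before any statistic is computed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 2. Define preprocessing for numerical and categorical columns.
numeric = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
categorical = OneHotEncoder(handle_unknown="ignore")
preprocessor = ColumnTransformer([
    ("num", numeric, ["income"]),
    ("cat", categorical, ["occupation"]),
])

# 3. Fit on the training split only, then reuse the fitted transformers.
X_train_prepared = preprocessor.fit_transform(X_train)
X_test_prepared = preprocessor.transform(X_test)

# The imputation value comes from the training rows alone.
learned_median = preprocessor.named_transformers_["num"].named_steps["imputer"].statistics_[0]
```

Setting `handle_unknown="ignore"` lets the encoder cope with categories that appear only in the test split without having seen them during fitting.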

Additionally, the target variable was not used to create any new features during our feature engineering on the secondary tables, so no features were knowingly derived from the target variable. We did not encounter any extremely high or illogical feature importances during feature selection that would raise concern that target information was inadvertently used as a feature in training. Finally, there were no abnormally large differences between training and test metrics on the models we tested that would indicate a leakage issue.


In the next section, we present the main findings of our experiments.

## 2.3 Results <a class="anchor" id="results"></a>

### **2.3.1 Exploratory Data Analysis Key Insights**

The data reveals a significant class imbalance in the target variable, with 92% of applicants successfully repaying their loans and only 8% encountering payment difficulties (Figure 13).

**Figure 13. Distribution of the Target Variable**

<img src=https://i.ibb.co/hsGB9q6/eda1.png/>

An analysis of applicant occupations (*Figure 14*) shows that most clients were laborers, while the fewest were IT and HR professionals. This distribution reflects the applicant population: applicants with limited credit history are typically more prevalent in labor-intensive occupations.

**Figure 14. Loan Application Occupation**

<img src=https://i.ibb.co/pvLBy4Q/eda2.png/>

Default rates varied significantly across income type, education level, family status, and housing type (*Figure 15*). Individuals on maternity leave or unemployed exhibited the highest default rates, suggesting increased financial vulnerability in these groups. Conversely, those with academic degrees had the lowest default rates, which may reflect a stabilizing effect of higher education on financial behavior. These findings suggest that income stability, education level, and family status are key factors influencing repayment outcomes.

**Figure 15. Default Percentages by Income, Education, Family Status and Housing Type**

<img src=https://i.ibb.co/GHN5Rf3/eda3.png/>

*Figure 16* illustrates the distribution of credit card balance records relative to the months before the loan application. The data shows an increasing number of entries as the timeline approaches the loan application date, with the highest number of records occurring in the final months prior to the application. This trend suggests that credit card activity is most actively recorded closer to the application, possibly reflecting a focus on recent financial behavior in creditworthiness assessments.

**Figure 16. Credit Card Balances Months Relative to Application**

<img src=https://i.ibb.co/pr4d459/eda5.png/>

*Figure 17* shows the distribution of credit card history categorized into three groups: long-term histories (greater than 7 years), medium-term histories (3 to 7 years), and short-term histories (less than 3 years). The majority of clients have long credit card histories, accounting for the largest share of the data. Medium-term and short-term histories are represented in nearly equal proportions but are significantly fewer than long-term histories. This distribution indicates that most clients possess well-established credit histories, which may serve as a strong indicator for creditworthiness evaluations.

**Figure 17. Distribution of Credit Card History**

<img src=https://i.ibb.co/9smdgw9/eda6.png/>

*Figure 18* presents the histograms of DAYS_INSTALLMENT and DAYS_ENTRY_PAYMENT, representing the days relative to the loan application date when installments were scheduled and payments were made, respectively. Both distributions show that the frequency of entries increases as the timeline approaches zero, with a sharp peak close to the application date. This indicates that most installment-related activities occur shortly before or near the loan application.

Similarly, *Figure 19* shows a steady increase in entries as the timeline approaches the loan application date. CNT_INSTALMENT_FUTURE (remaining installments) highlights a sharp decline as the number of future payments decreases.

**Figure 18. Distribution of Days Installment and Days Entry Payment**

<img src=https://i.ibb.co/2yn83bB/eda7.png/>

**Figure 19. Distributions of POS_CASH_balance**

<img src=https://i.ibb.co/37p0B3C/eda8.png/>

The exploratory data analysis provided critical insights into the structure and characteristics of the data, guiding the necessary feature engineering and preprocessing steps. These insights directly informed the development of the modeling pipeline, enabling the transformation of raw data into inputs suitable for predictive analysis. Next, we present the results of the models.

### **2.3.2 Modeling Results**

Our modeling experiments evaluated nine algorithms: Logistic Regression, Decision Trees, Random Forests, Gradient Boosting, XGBoost, Light Gradient Boosting (LightGBM), SVM trained on 20% of the training data, and Voting and Stacking ensemble models. Each model was tested on both the full feature set (208 features) and a reduced feature set (20 features), except for SVM, which was tested only on the reduced feature set. The evaluation metrics are presented in *Table 4*, and *Figure 20* compares the key evaluation metrics across models. MLP results are presented later in this subsection.


**Table 4. Metrics Across Experiments without MLP**

|  | **Experiment_Name** | **Train_Accuracy** | **Valid_Accuracy** | **Test_Accuracy** | **Train_ROC_AUC** | **Valid_ROC_AUC** | **Test_ROC_AUC** | **Train_Precision** | **Valid_Precision** | **Test_Precision** | **Train_Recall** | **Valid_Recall** | **Test_Recall** | **Train_F1** | **Valid_F1** | **Test_F1** | **Train_Log_Loss** | **Valid_Log_Loss** | **Test_Log_Loss** |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LogisticRegression_208_features | 0.9197 | 0.9197 | 0.916 | 0.7608 | 0.7537 | 0.7594 | 0.487 | 0.5448 | 0.4955 | 0.0179 | 0.0196 | 0.0167 | 0.0345 | 0.0379 | 0.0323 | 0.2451 | 0.2478 | 0.254 |
| 1 | DecisionTree_208_features | 1 | 0.8546 | 0.8499 | 1 | 0.5443 | 0.5416 | 1 | 0.1511 | 0.1515 | 1 | 0.1743 | 0.171 | 1 | 0.1619 | 0.1606 | 0 | 5.2424 | 5.4091 |
| 2 | RandomForest_208_features | 1 | 0.9195 | 0.9162 | 1 | 0.7212 | 0.7237 | 1 | 0.75 | 0.9 | 0.9998 | 0.0024 | 0.0027 | 0.9999 | 0.0048 | 0.0054 | 0.06 | 0.2772 | 0.2828 |
| 3 | GradientBoosting_208_features | 0.9208 | 0.9198 | 0.9162 | 0.7784 | 0.7656 | 0.7678 | 0.687 | 0.5645 | 0.5357 | 0.0228 | 0.0188 | 0.0182 | 0.0441 | 0.0364 | 0.0352 | 0.2394 | 0.2437 | 0.2518 |
| 4 | XGBoosting_208_features | 0.9291 | 0.9192 | 0.9149 | 0.8847 | 0.7647 | 0.7688 | 0.8561 | 0.491 | 0.4357 | 0.1395 | 0.0584 | 0.0452 | 0.24 | 0.1044 | 0.082 | 0.194 | 0.2448 | 0.2523 |
| 5 | LightGradientBoosting_208_features | 0.9225 | 0.9204 | 0.9165 | 0.8249 | 0.7743 | 0.7739 | 0.7655 | 0.5966 | 0.5538 | 0.0491 | 0.0374 | 0.0313 | 0.0923 | 0.0704 | 0.0592 | 0.2229 | 0.2404 | 0.2491 |
| 6 | LightGradientBoosting_208_features_Voting_ensemble | 0.934 | 0.9199 | 0.9163 | 0.9994 | 0.7538 | 0.7554 | 1 | 0.566 | 0.5385 | 0.1765 | 0.0242 | 0.0234 | 0.3 | 0.0464 | 0.0448 | 0.1231 | 0.2491 | 0.2571 |
| 7 | LightGradientBoosting_208_features_Stacking_ensemble | 1 | 0.9198 | 0.9161 | 1 | 0.7213 | 0.7235 | 1 | 0.545 | 0.5177 | 1 | 0.0293 | 0.0222 | 1 | 0.0557 | 0.0425 | 0.0426 | 0.2544 | 0.2626 |
| 8 | LogisticRegression_20_features | 0.9197 | 0.9193 | 0.916 | 0.7266 | 0.7263 | 0.7257 | 0.4467 | 0.4375 | 0.4737 | 0.0049 | 0.0038 | 0.0055 | 0.0098 | 0.0075 | 0.0108 | 0.254 | 0.2553 | 0.2626 |
| 9 | DecisionTree_20_features | 1 | 0.8553 | 0.8489 | 1 | 0.5401 | 0.5344 | 1 | 0.1462 | 0.1407 | 1 | 0.1644 | 0.1564 | 1 | 0.1548 | 0.1481 | 0 | 5.2151 | 5.445 |
| 10 | RandomForest_20_features | 1 | 0.9196 | 0.9162 | 1 | 0.7189 | 0.7107 | 1 | 0.6389 | 0.625 | 0.9994 | 0.0062 | 0.0046 | 0.9997 | 0.0123 | 0.009 | 0.0599 | 0.2777 | 0.2977 |
| 11 | GradientBoosting_20_features | 0.9204 | 0.9196 | 0.916 | 0.7552 | 0.745 | 0.7474 | 0.6704 | 0.5455 | 0.4833 | 0.0136 | 0.0113 | 0.0088 | 0.0266 | 0.0221 | 0.0173 | 0.2462 | 0.2495 | 0.2574 |
| 12 | XGBoosting_20_features | 0.9256 | 0.9189 | 0.9151 | 0.8621 | 0.7421 | 0.7394 | 0.8825 | 0.4552 | 0.4167 | 0.0831 | 0.0328 | 0.0273 | 0.1519 | 0.0612 | 0.0513 | 0.2076 | 0.2517 | 0.2608 |
| 13 | LightGradientBoosting_20_features | 0.9209 | 0.9194 | 0.9159 | 0.7957 | 0.7483 | 0.7498 | 0.7905 | 0.4835 | 0.4667 | 0.0186 | 0.0118 | 0.0085 | 0.0364 | 0.0231 | 0.0167 | 0.234 | 0.2483 | 0.2565 |
| 14 | LightGradientBoosting_20_features_Voting_ensemble | 0.9293 | 0.9195 | 0.9156 | 0.9996 | 0.733 | 0.7305 | 1 | 0.5207 | 0.4096 | 0.1184 | 0.0169 | 0.0103 | 0.2117 | 0.0328 | 0.0201 | 0.1274 | 0.2554 | 0.2648 |
| 15 | LightGradientBoosting_20_features_Stacking_ensemble | 1 | 0.9191 | 0.9156 | 1 | 0.719 | 0.7108 | 1 | 0.4667 | 0.4476 | 1 | 0.0245 | 0.0194 | 1 | 0.0465 | 0.0373 | 0.0517 | 0.2563 | 0.2661 |
| 16 | SVC_20%_20_features | 0.9185 | 0.9194 | 0.916 | 0.9035 | 0.6113 | 0.6167 | 1 | 0 | 0 | 0.0003 | 0 | 0 | 0.0006 | 0 | 0 | 0.2503 | 0.2765 | 0.2845 |
| 17 | XGBoost_hyperparamtune_20_features | 0.9205 | 0.9194 | 0.9158 | 0.7799 | 0.7483 | 0.7507 | 0.7221 | 0.5 | 0.4211 | 0.0149 | 0.01 | 0.0073 | 0.0291 | 0.0195 | 0.0143 | 0.2387 | 0.2483 | 0.2562 |

**Figure 20. Model Evaluation without MLP**

<img src=https://i.ibb.co/1vNNZJ7/metric-comparison-no-mlp.png/>

XGBoost with all features was selected as the optimal model due to its combination of high accuracy, computational efficiency, and scalability. *Figure 21* illustrates the ROC curves for several models, with XGBoost achieving one of the highest AUC scores (0.77) alongside LightGBM. These metrics demonstrate XGBoost's ability to capture patterns in the data effectively, making it suitable for our task. However, it is important to note the challenges posed by the severe class imbalance, which led to low recall and F1-scores as observed in *Table 4*.


**Figure 21. ROC Curve Across Models**

<img src=https://i.ibb.co/WHGmPC1/roc-curve.png/>

To further enhance XGBoost's performance, we conducted hyperparameter tuning using the full set of 208 features. The best parameters obtained were:

- colsample_bytree: 0.8
- learning_rate: 0.1
- max_depth: 5
- subsample: 1.0

With these optimized parameters, XGBoost delivered slightly improved results on the test set. Feature importance analysis (*Figure 22*) highlighted the external credit sources (EXT_SOURCE_2, EXT_SOURCE_3, and EXT_SOURCE_1) as the most predictive variables, along with demographic features like DAYS_BIRTH and DAYS_EMPLOYED.


**Figure 22. Feature Importance XGBoost with Hyperparameter Tuning**

<img src=https://i.ibb.co/khzszcm/feature-importance-xgboost.png/>

Despite its overall strong performance, XGBoost struggled to address the imbalance in the data. As shown in *Figure 23*, the model accurately identified the majority of negative samples but performed poorly in predicting positive (minority class) samples, resulting in an F1-score of 0.02 for the minority class. This emphasizes the need for targeted strategies to improve the model's ability to handle imbalanced data.

**Figure 23. Confusion Matrix XGBoost with Hyperparameter Tuning**

<img src=https://i.ibb.co/Ykh5NpZ/Confusion-Matrix-XGBoost.png/>
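To make the link between high accuracy and a low minority-class F1-score concrete, the sketch below works through a small hypothetical example with the same 92%/8% class split as our data. The 100 labels and predictions are invented for illustration and are not our model's actual outputs.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical labels: 92 non-defaulters (0) and 8 defaulters (1).
y_true = [0] * 92 + [1] * 8
# Hypothetical predictions: almost everything is predicted as non-default.
# One non-defaulter is wrongly flagged and only one defaulter is caught.
y_pred = [0] * 91 + [1] + [1] + [0] * 7

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = accuracy_score(y_true, y_pred)    # (91 + 1) / 100 = 0.92
precision = precision_score(y_true, y_pred)  # 1 / (1 + 1) = 0.5
recall = recall_score(y_true, y_pred)        # 1 / (1 + 7) = 0.125
f1 = f1_score(y_true, y_pred)                # 2 * 1 / (2 * 1 + 1 + 7) = 0.2

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

A classifier that misses seven of eight defaulters still reaches 92% accuracy here, which is why accuracy alone is a weak guide on this dataset.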

For our MLP experiments, we tested various architectures, including a Sparse Input MLP for benchmarking, a Simple Deep MLP for added depth, an MLP with Dropout to mitigate overfitting, and a Wider MLP for greater representational power. *Table 5* and *Figure 24* compare the architectures and their respective performance metrics, while *Figure 25* shows the metrics in the context of all other experiments.

Overall, the MLP models demonstrated comparable performance to traditional machine learning models but did not show significant improvement over models like XGBoost and LightGBM. For instance, while the Wider MLP achieved slightly higher recall, its overall test accuracy (91.26%) and F1-score remained below those of the tuned XGBoost model. Additionally, the effects of the severe class imbalance persisted across all MLP experiments, resulting in low precision, recall, and F1-scores for the minority class.


**Figure 24. MLP Architecture Evaluation**

<img src=https://i.ibb.co/61y1KDh/mlp-arch-comparison.png/>

**Figure 25. Evaluation Metrics All Models**

<img src=https://i.ibb.co/vzFSRWS/metric-comparison-all-experiments.png/>

Despite testing more complex architectures, the MLP models struggled to handle the inherent data challenges, including class imbalance and the need for feature engineering.

The results were submitted to Kaggle.

<img src=https://i.ibb.co/h2SL9jn/kaggle.png/>

## 2.4 Discussion, Conclusion and Gap Analysis <a class="anchor" id="discussion"></a>

#### **2.4.1 Conclusion**

This study evaluated multiple models for predicting loan defaults and found that XGBoost with the full feature set and without hyperparameter tuning achieved the highest accuracy of 91.59% on the test set. Across our experiments, many models performed similarly, regardless of whether we used the full feature set or applied feature selection. Notably, while MLP models performed comparably to traditional models, such as Logistic Regression and Decision Trees, they did not outperform methods like XGBoost and LightGBM, which remained the top performers.

While XGBoost excelled at predicting non-defaulters, it struggled with the severe class imbalance, as reflected in its low recall and F1-score for defaulters. Addressing this imbalance through oversampling, cost-sensitive learning, or advanced ensemble techniques is crucial for improving performance.


#### **2.4.2 Gap Analysis**



In this phase, we continued our experiments by incorporating Multi-Layer Perceptron (MLP) models to evaluate their performance against traditional and ensemble-based machine learning models. Despite exploring various architectures, including Sparse Input MLP, Simple Deep MLP, MLP with Dropout, and Wider MLP, the results reaffirmed XGBoost as the best-performing model, achieving a test accuracy of 91.59% and a test ROC AUC of 0.7518. While MLP models demonstrated comparable performance, they did not surpass XGBoost or LightGBM, particularly in handling the class imbalance, where F1-scores for the minority class remained low.

**Compared to other groups** that have uploaded their results in Phase 4, our XGBoost model remains competitive. Group 2, for example, utilized Gradient Boosting and achieved a public AUC score of 0.75268 and a private AUC score of 0.7699, with significant contributions from engineered features such as EXT_SOURCE_MEAN and ANNUITY_CREDIT_RATIO. Group 6 experimented with an MLP architecture featuring multiple hidden layers (129 → 64 → 32 → 16 → 2) and achieved similar performance to their Logistic Regression baseline, which aligns with our findings on MLP's limited improvement over simpler models.

Like other groups, we continue to face challenges with class imbalance, which significantly impacts recall and F1-scores for the minority class. This highlights an area for improvement in subsequent phases.

Group 2's feature engineering approach, including aggregated credit and annuity ratios, presents a valuable avenue for refining our feature set. Moreover, the diversity of architectures explored by Group 6 could inspire further MLP experiments.

#### **2.4.3 Discussion**

Our project revealed several insights and challenges in modeling loan default prediction. A major challenge throughout our project was the severe class imbalance, which impacted our ability to effectively model the data. From a consumer perspective, it was interesting to develop models that, by using more data and ML techniques, can help reduce the chance that applicants are unfairly rejected. However, from a business perspective, the model's difficulty in accurately predicting defaulters is an important limitation, because failing to identify applicants who may default poses a financial risk.


To address this imbalance, we recommend implementing techniques such as SMOTE (Synthetic Minority Oversampling Technique) to generate synthetic samples for the minority class, or employing cost-sensitive learning, which penalizes misclassifications of the minority class more heavily during training. Ensemble techniques like balanced bagging or boosting could also improve model performance.
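As a rough illustration of the cost-sensitive option, the sketch below compares an unweighted and a class-weighted Logistic Regression in scikit-learn. It runs on synthetic data generated with `make_classification`, not on the Home Credit data, and the model choice and settings are assumptions made for the example; we have not run this experiment in the project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

# Synthetic stand-in with roughly 92% negatives and 8% positives.
X, y = make_classification(
    n_samples=5000, n_features=20, weights=[0.92, 0.08], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# "balanced" weights are n_samples / (n_classes * class_count),
# so the rare class receives the larger weight.
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y_train)

plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_train, y_train)

recall_plain = recall_score(y_test, plain.predict(X_test))
recall_weighted = recall_score(y_test, weighted.predict(X_test))

# The same idea for XGBoost is its scale_pos_weight parameter,
# commonly set to the negative/positive count ratio.
scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()

print(f"class weights: {weights}")
print(f"recall without weights: {recall_plain:.3f}, with weights: {recall_weighted:.3f}")
print(f"negative/positive ratio: {scale_pos_weight:.2f}")
```

Weighting typically raises minority-class recall at the cost of precision, so the trade-off would need to be checked against the business cost of each error type.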

Future work could also experiment with a multitask loss function in PyTorch to predict both the loan default class (default or non-default) and the length of time before defaulting. This would involve creating a multi-headed model combining classification and regression tasks, using a combined loss function (e.g., cross-entropy and mean squared error) to enhance both predictive performance and interpretability.
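One possible shape for such a model is sketched below. Everything in it is hypothetical: the class name `MultiTaskMLP`, the function `multitask_loss`, the weighting factor `alpha`, the layer sizes, and the choice to compute the regression term only on defaulted samples are assumptions for illustration, and the inputs are random tensors rather than our features.

```python
import torch
from torch import nn


class MultiTaskMLP(nn.Module):
    """Shared trunk with a classification head and a regression head."""

    def __init__(self, input_dim, hidden_dim=32):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.class_head = nn.Linear(hidden_dim, 2)  # logits: non-default, default
        self.time_head = nn.Linear(hidden_dim, 1)   # predicted time before default

    def forward(self, x):
        h = self.trunk(x)
        return self.class_head(h), self.time_head(h).squeeze(-1)


def multitask_loss(logits, time_pred, y_class, y_time, alpha=0.5):
    """Cross-entropy on the class plus alpha * MSE on time-to-default.

    The regression term is computed only for defaulted samples, because
    time-to-default is undefined for loans that were repaid.
    """
    ce = nn.functional.cross_entropy(logits, y_class)
    mask = y_class == 1
    if mask.any():
        mse = nn.functional.mse_loss(time_pred[mask], y_time[mask])
    else:
        mse = torch.zeros((), device=logits.device)
    return ce + alpha * mse


# Tiny smoke run on random data.
torch.manual_seed(0)
x = torch.randn(16, 20)
y_class = torch.tensor([0, 1] * 8)
y_time = torch.rand(16)  # placeholder time-to-default targets

model = MultiTaskMLP(input_dim=20)
logits, time_pred = model(x)
loss = multitask_loss(logits, time_pred, y_class, y_time)
loss.backward()
print(f"combined loss: {loss.item():.4f}")
```

The weight `alpha` controls how much the regression task influences the shared trunk and would need tuning, since the two loss terms are on different scales.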


Our project also highlighted several challenges that impacted the workflow. The complexity of dimensionality reduction, combined with the need to process both static and time-series data, required significant computational resources. Despite using Google Colab Pro, we faced limitations in computational capacity, which restricted experimentation with larger datasets and more complex models. Furthermore, collaborative editing of shared files introduced additional logistical hurdles. Ideally, implementing a local or cloud server with more computational resources and robust version control systems, such as Git, would streamline collaboration and prevent conflicts.


## 2.5 References <a class="anchor" id="references"></a>

- Scikit-learn. "Logistic Regression Documentation." Available at: [https://scikit-learn.org/dev/modules/linear_model.html](https://scikit-learn.org/dev/modules/linear_model.html).

- Scikit-learn. "Support Vector Classification (SVC) Documentation." Available at: [https://scikit-learn.org/dev/modules/svm.html](https://scikit-learn.org/dev/modules/svm.html).

- Scikit-learn. "DecisionTreeClassifier Documentation." Available at: [https://scikit-learn.org/dev/modules/generated/sklearn.tree.DecisionTreeClassifier.html](https://scikit-learn.org/dev/modules/generated/sklearn.tree.DecisionTreeClassifier.html).

- Scikit-learn. "RandomForestClassifier Documentation." Available at: [https://scikit-learn.org/1.5/modules/generated/sklearn.ensemble.RandomForestClassifier.html](https://scikit-learn.org/1.5/modules/generated/sklearn.ensemble.RandomForestClassifier.html).

- Scikit-learn. "GradientBoostingClassifier Documentation." Available at: [https://scikit-learn.org/dev/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html](https://scikit-learn.org/dev/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html).

- LightGBM. "LightGBM Parameters Documentation." Available at: [https://lightgbm.readthedocs.io/en/latest/Parameters.html](https://lightgbm.readthedocs.io/en/latest/Parameters.html).

- Towards Data Science. "Beginner’s Guide to the Must-Know LightGBM Hyperparameters." Medium. Available at: [https://towardsdatascience.com/beginners-guide-to-the-must-know-lightgbm-hyperparameters-a0005a812702](https://towardsdatascience.com/beginners-guide-to-the-must-know-lightgbm-hyperparameters-a0005a812702).

- XGBoost. "XGBoost Parameters Documentation." Available at: [https://xgboost.readthedocs.io/en/stable/parameter.html](https://xgboost.readthedocs.io/en/stable/parameter.html).

- Scikit-learn. "VotingClassifier Documentation." Available at: [https://scikit-learn.org/0.18/modules/generated/sklearn.ensemble.VotingClassifier.html](https://scikit-learn.org/0.18/modules/generated/sklearn.ensemble.VotingClassifier.html).

- Scikit-learn. "StackingClassifier Documentation." Available at: [https://scikit-learn.org/dev/modules/generated/sklearn.ensemble.StackingClassifier.html](https://scikit-learn.org/dev/modules/generated/sklearn.ensemble.StackingClassifier.html).

- TrainInData Blog. "Machine Learning with Imbalanced Data." Available at: [https://www.blog.trainindata.com/machine-learning-with-imbalanced-data/](https://www.blog.trainindata.com/machine-learning-with-imbalanced-data/).

- Machine Learning Mastery. "SMOTE Oversampling for Imbalanced Classification." Available at: [https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/](https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/).

- PyTorch Lightning. "PyTorch Lightning Documentation." Available at: [https://lightning.ai/docs/pytorch/stable/](https://lightning.ai/docs/pytorch/stable/).


