# **Home Credit Default Risk Group4 Phase 1**
**Leveraging Machine Learning to Predict Loan Defaults**

### **Team Members** <a class="anchor" id="team"></a>

We are a team of diverse individuals from quite different backgrounds. Cassie leads a team of data scientists and engineers focused on cybersecurity. Maria, with a PhD in Political Science, works as a data scientist at a center for artificial intelligence in healthcare. Lexi is a Research Data Analyst at IU Bloomington’s Cognitive Development Lab, where she studies infant cognitive development. Nasheed is a PhD candidate in Math with a focus on Applied Statistics. Together, we bring a unique blend of expertise and perspectives to this project.



| Name           | Email             | Role              | Photo                           |
|----------------|-------------------|-------------------|---------------------------------|
| Lexi Colwell   | alecolwe@iu.edu   | Phase 1 Lead      | <img src="https://iu.instructure.com/images/thumbnails/177973973/0g37V233Y26RuFrO4SpckMxeO237lTphfcshwCPB" width="50">   |
| Nasheed Jafri  | njafri@iu.edu     | Phase 2 Lead      | <img src="https://iu.instructure.com/images/thumbnails/126314916/QRUIEEkso27JL5B1T8aFYUsG72QVtz5EJD4gfe1Z" width="50">|
| Cassie Cagwin  | cacagwin@iu.edu   | Phase 3 Lead      | <img src="https://www.widsworldwide.org/wp-content/uploads/2023/10/1-6-scaled.jpeg" width="50"> |
| Maria Aroca    | mparoca@iu.edu    | Phase 4 Lead      | <img src="https://media.licdn.com/dms/image/v2/C4E03AQHBCzQVjfUjYA/profile-displayphoto-shrink_800_800/profile-displayphoto-shrink_800_800/0/1589987573488?e=1736985600&v=beta&t=RIOl6PYoXLUqfTFXW5VMpFm9l1zGIvd2u5-y0k59jCk" width="50">  |


## **Table of Contents**
- [Phase Leader Plan](#phase_leader_plan)
- [Credit Assignment Plan (SMART Goals)](#credit_assignment)
- [Project Proposal](#project_proposal)
  - [Project Title](#title)
  - [Project Abstract](#abstract)
  - [Data Description](#data)
  - [Machine Learning Algorithms](#ML)
  - [Evaluation Metrics](#metrics)
  - [Pipeline Steps](#pipeline)
  - [Project Timeline](#timeline)


## **1. Phase Leader Plan** <a class="anchor" id="phase_leader_plan"></a>

The project will be delivered in multiple phases, each with specific milestones. Each team member has been assigned as the leader responsible for one phase. The table below outlines the phases and their respective project managers.

| Phase         | Task Description                                                                                     | Project Manager    |
|---------------|------------------------------------------------------------------------------------------------------|--------------------|
| Phase 1       | Project planning and proposal, including data sources, metrics, and baseline models                  | Lexi Colwell       |
| Phase 2       | Data exploration (EDA), baseline pipeline, feature engineering, and initial hyperparameter tuning    | Nasheed Jafri      |
| Phase 3       | Advanced feature engineering, hyperparameter tuning, feature selection, and ensemble methods         | Cassie Cagwin      |
| Phase 4       | Final model integration, implementing advanced architectures, and project report completion          | Maria Aroca        |



## **2. Credit Assignment Plan (SMART Goals)** <a class="anchor" id="credit_assignment"></a>

In this section, we outline the project’s major tasks organized by phase, detailing specific goals, deadlines, and responsibilities. Each task follows the SMART goal framework to ensure that our objectives are specific, measurable, achievable, relevant, and time-bound. The first table below summarizes each member's Phase 1 tasks and estimated time; the second lists every task across all phases with its assignee, SMART goal, and deadline.

<br>

**<p>Phase 1 Credit Assignment Plan Summary</p>**

| Name     | Task Description                                                                                     | Estimated Time    |
|---------------|------------------------------------------------------------------------------------------------------|--------------------|
| Lexi Colwell      | Data exploration, Initial pipeline and metrics write up, Abstract, Submit proposal                 | 5 hrs      |
| Nasheed Jafri       | Data exploration, Initial pipeline and metrics write up    | 5 hrs      |
| Cassie Cagwin      | Data exploration, Initial pipeline and metrics write up         | 5 hrs     |
| Maria Aroca       | Data exploration, Initial pipeline and metrics write up          | 5 hrs       |

<br>
<br>


**<p>Detailed Credit Assignment Plan</p>**


| Phase | Task Description                                                                                                 | Assignee        | SMART Goal                                                                                                                                                | Due Date    |
|-------|------------------------------------------------------------------------------------------------------------------|-----------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|-------------|
| 1     | Write abstract                                                                                                   | Everyone        | Our goal is to write a 150-200 word abstract summarizing the project scope and goals. Everyone will contribute by drafting, reviewing, and finalizing it. This will clarify our project’s purpose for all stakeholders. | 11/11/2024 |
| 1     | Initial data exploration for main application dataset                                                            | Everyone        | Our goal is to identify initial insights from the main application data by analyzing key variables. Each team member will explore different aspects to ensure comprehensive data understanding. | 11/10/2024 |
| 1     | Data exploration for bureau data                                                                                 | Lexi            | Lexi will complete an initial exploration of bureau data by analyzing its structure and trends, providing insights that will inform our feature engineering phase. | 11/10/2024 |
| 1     | Data exploration for bureau balance data                                                                         | Lexi            | Lexi will analyze bureau balance data to capture monthly trends and summarize key patterns, which will enhance our understanding of client credit behavior. | 11/10/2024 |
| 1     | Data exploration for POS and cash loan balances                                                                  | Maria, Lexi     | Our goal is for Maria and Lexi to summarize characteristics and trends in POS and cash loan data, helping to identify client financial behavior patterns. | 11/10/2024 |
| 1     | Data exploration for credit card balances                                                                        | Nasheed         | Nasheed will explore credit card balance data, documenting key patterns to better understand client spending habits and repayment behavior.                 | 11/10/2024 |
| 1     | Data exploration for previous applications                                                                       | Cassie          | Cassie will analyze previous applications, identifying primary data points that will inform features for modeling loan approvals.                         | 11/10/2024 |
| 1     | Data exploration for installment payments data                                                                   | Maria, Nasheed  | Our goal is for Maria and Nasheed to describe repayment history patterns in installments data, providing insights into repayment consistency and delays.   | 11/10/2024 |
| 1     | Writeup on machine learning algorithms and metrics                                                               | Everyone        | The team will draft a 200-300 word section explaining chosen machine learning algorithms and metrics, ensuring a clear approach to modeling and evaluation. | 11/11/2024 |
| 1     | Design initial machine learning pipelines                                                                        | Everyone        | Our goal is to create a pipeline framework outlining preprocessing and modeling steps. Each member will contribute to ensure robust data flow and model implementation. | 11/11/2024 |
| 1     | Develop baseline model pipeline                                                                                  | Everyone        | The team will implement and test a baseline model pipeline, ensuring an initial benchmark for model performance to guide improvements in future phases. | 11/11/2024 |
| 1     | Plan for additional pipelines                                                                                    | Everyone        | The team will outline advanced pipelines and additional models to explore, enhancing our modeling strategy for upcoming phases.                            | 11/11/2024 |
| 1     | Finalize project proposal writeup                                                                                | Everyone        | The team will complete the full project proposal, including the phase leader plan and SMART goals, creating a structured plan for project execution.      | 11/12/2024 |
| 1     | Submit project proposal                                                                                          | Lexi Colwell    | Lexi will submit the project proposal on Canvas, ensuring all components are complete and accurately presented.                                           | 11/12/2024 |
| 1     | Submit initial discussion post                                                                                   | Lexi Colwell    | Lexi will post the project proposal update on the Canvas discussion to engage classmates and instructors with our project direction.                       | 11/12/2024 |
| 2     | Conduct EDA for application data                                                                                 | Everyone        | Our goal is to perform exploratory data analysis, identify patterns, and summarize initial insights to set a foundation for feature engineering.           | 11/15/2024 |
| 2     | Conduct EDA for bureau data                                                                                      | Lexi            | Lexi will complete the bureau data EDA, providing a summary of findings that will guide feature extraction relevant to client credit history.              | 11/15/2024 |
| 2     | Conduct EDA for bureau balance data                                                                              | Lexi            | Lexi will perform EDA for bureau balance data and summarize key trends, enabling us to integrate time-series insights into modeling.                      | 11/15/2024 |
| 2     | Define evaluation metrics                                                                                        | Everyone        | The team will select and outline metrics for model evaluation, ensuring that model performance aligns with project objectives.                             | 11/15/2024 |
| 2     | Develop baseline models                                                                                          | Everyone        | The team will develop and evaluate initial baseline models, setting performance benchmarks for comparison with advanced models.                            | 11/19/2024 |
| 2     | Conduct initial feature engineering                                                                              | Everyone        | Our goal is to create new features from data to improve input representation, enhancing model accuracy and relevance.                                     | 11/19/2024 |
| 2     | Perform initial hyperparameter tuning                                                                            | Everyone        | The team will tune model hyperparameters to optimize baseline performance, aiming to establish solid parameters for future iterations.                     | 11/19/2024 |
| 2     | Compile brief report for project update                                                                          | Everyone        | The team will summarize Phase 2 work in a report and prepare slides, providing a clear update on progress for peer review.                                 | 11/19/2024 |
| 2     | Submit slides, notebook, and video presentation                                                                  | Nasheed         | Nasheed will upload project update materials to Canvas, ensuring accessible and organized documentation for the phase.                                    | 11/19/2024 |
| 3     | Conduct second round of feature engineering                                                                      | Everyone        | Our goal is to refine model features based on initial findings, further optimizing data input for improved accuracy and performance.                     | 11/29/2024 |
| 3     | Perform second round of hyperparameter tuning                                                                    | Everyone        | The team will fine-tune hyperparameters, aiming to boost model performance with optimized settings.                                                      | 11/29/2024 |
| 3     | Conduct feature selection                                                                                        | Everyone        | The team will select relevant features, reducing model complexity while maintaining accuracy.                                                            | 11/29/2024 |
| 3     | Implement ensemble methods                                                                                       | Everyone        | Our goal is to test ensemble methods to enhance model robustness and performance, aiming to achieve more consistent results.                             | 12/3/2024  |
| 3     | Prepare project update report                                                                                    | Everyone        | The team will compile a brief report and slides for project discussion, clearly outlining Phase 3 progress.                                               | 12/3/2024  |
| 3     | Submit project update materials                                                                                  | Cassie Cagwin   | Cassie will upload the updated project materials (slides, notebook, video) to Canvas, ensuring all files are accessible.                                 | 12/3/2024  |
| 4     | Implement Neural Network model                                                                                   | Everyone        | The team will build and evaluate a Neural Network as the final model, exploring deeper learning methods for improved performance.                        | 12/6/2024  |
| 4     | Write and compile final project report                                                                           | Everyone        | The team will complete the project report with analysis, conclusions, and final results, providing comprehensive project documentation.                  | 12/9/2024  |
| 4     | Submit final slides, notebook, and video presentation                                                            | Maria Aroca     | Maria will upload the final materials on Canvas, marking the completion of project submissions.                                                         | 12/11/2024 |



## **3. Project Proposal** <a class="anchor" id="project_proposal"></a>


In this section, we provide an overview of our project's scope, objectives, and approach. This proposal includes a project title, a brief abstract summarizing our goals and methodology, and a description of the data that we'll be using. We detail the machine learning algorithms we plan to implement and our reasoning behind these choices, as well as the metrics we will use to evaluate our model's performance. A Gantt chart outlines the timeline and key phases of the project. Additionally, we describe the steps involved in our proposed data pipeline and include contact details and photos of each team member.

### **3.1 Project Title** <a class="anchor" id="title"></a>

"Leveraging Machine Learning to Predict Loan Defaults"

### **3.2 Project Abstract** <a class="anchor" id="abstract"></a>
The purpose of this project is to use the Home Credit Default Risk dataset to build the best-performing model for predicting whether customers will repay a loan. We will conduct exploratory data analysis to investigate the data types present in the dataset, examine relationships between features to determine which are most useful for predicting loan repayment, and inform the creation of new features. We will first implement a baseline pipeline that uses Logistic Regression and perform initial feature engineering. We will evaluate all planned models using accuracy, precision, recall, F1 score, AUC-ROC, and log loss. After evaluating the baseline model, we will optimize it through hyperparameter tuning. We will then compare the Logistic Regression models to Decision Tree Classifier, Random Forest, Support Vector Machine, Gradient Boosting, Extreme Gradient Boosting, Light Gradient Boosting, and Neural Network models. Throughout these comparisons, we will continue feature engineering and hyperparameter tuning to improve model performance. After all models have been tested, we will compile our findings, present the final results, and discuss model insights.

### **3.3 Data Description** <a class="anchor" id="data"></a>
We will use the data published on the [Home Credit Default Risk](https://www.kaggle.com/competitions/home-credit-default-risk/data) Kaggle competition page.

The data for this project represents a comprehensive and diverse collection of data sources related to credit applications and repayment behaviors, making it both rich in detail and complex to work with. This data includes current loan applications with static demographic and financial information, Credit Bureau reports with historical data from other financial institutions, monthly balances on various credit accounts such as credit cards and POS (point of sale) loans, and detailed repayment histories that capture installment payments, including missed or late payments, across multiple credit types. Additionally, it includes information on previous loan applications within Home Credit, providing an in-depth look at each client’s financial profile and credit history over time. The combination of these sources results in a broad dataset that offers valuable insights into repayment patterns and client behavior but also presents significant challenges.

A primary challenge with this data is its heterogeneity; each source captures unique aspects of a client’s financial behavior, often in different formats. Some datasets provide static information, while others track monthly balances, leading to a mix of snapshot and time-series data that must be carefully aligned. Additionally, the data spans inconsistent collection periods, complicating synchronization across timeframes. This complexity requires thoughtful feature engineering to capture each client’s creditworthiness effectively. Below, we describe each of the tables in detail.


#### ***application_train.csv*** and ***application_test.csv***


`application_train.csv` (307k rows, 122 columns) is the primary dataset for this project. It contains data about each client and their current loan application. A similar `application_test.csv` file contains our test data (49k rows, 121 columns; it does not include `TARGET`). After dropping some redundant/highly correlated columns, the following columns might be retained from this data:

| Variable                     | Description |
|------------------------------|-------------|
| AMT_ANNUITY                  | Loan annuity |
| AMT_CREDIT                   | Credit amount of the loan |
| AMT_INCOME_TOTAL             | Income of the client |
| AMT_REQ_CREDIT_BUREAU_DAY    | Number of enquiries to Credit Bureau about the client one day before application (excluding one hour before application) |
| AMT_REQ_CREDIT_BUREAU_HOUR   | Number of enquiries to Credit Bureau about the client one hour before application |
| AMT_REQ_CREDIT_BUREAU_MON    | Number of enquiries to Credit Bureau about the client one month before application (excluding one week before application) |
| AMT_REQ_CREDIT_BUREAU_QRT    | Number of enquiries to Credit Bureau about the client 3 months before application (excluding one month before application) |
| AMT_REQ_CREDIT_BUREAU_WEEK   | Number of enquiries to Credit Bureau about the client one week before application (excluding one day before application) |
| AMT_REQ_CREDIT_BUREAU_YEAR   | Number of enquiries to Credit Bureau about the client one year before application (excluding last 3 months before application) |
| APARTMENTS_AVG               | Normalized information about building where the client lives. What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floors |
| BASEMENTAREA_AVG             | Normalized information about building where the client lives. What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floors |
| CNT_CHILDREN                 | Number of children the client has |
| CODE_GENDER                  | Gender of the client |
| COMMONAREA_AVG               | Normalized information about building where the client lives. What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floors |
| DAYS_BIRTH                   | Client's age in days at the time of application |
| DAYS_EMPLOYED                | How many days before the application the person started current employment |
| DAYS_ID_PUBLISH              | How many days before the application did client change the identity document with which they applied for the loan |
| DAYS_LAST_PHONE_CHANGE       | How many days before application did client change phone |
| DAYS_REGISTRATION            | How many days before the application did client change their registration |
| DEF_30_CNT_SOCIAL_CIRCLE     | How many observations of client's social surroundings defaulted on 30 DPD (days past due) |
| ENTRANCES_AVG                | Normalized information about building where the client lives. What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floors |
| EXT_SOURCE_1 - 3             | Normalized score from external data source |
| FLAG_CONT_MOBILE             | Was mobile phone reachable (1=YES, 0=NO) |
| FLAG_DOCUMENT_1 - 21         | Did client provide document 1 - 21 |
| FLAG_EMAIL                   | Did client provide email (1=YES, 0=NO) |
| FLAG_OWN_CAR                 | Flag if the client owns a car |
| FLAG_OWN_REALTY              | Flag if client owns a house or flat |
| FLAG_PHONE                   | Did client provide home phone (1=YES, 0=NO) |
| FLAG_WORK_PHONE              | Did client provide work phone (1=YES, 0=NO) |
| FLOORSMAX_AVG                | Normalized information about building where the client lives. What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floors |
| FLOORSMIN_AVG                | Normalized information about building where the client lives. What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floors |
| LANDAREA_AVG                 | Normalized information about building where the client lives. What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floors |
| LIVE_CITY_NOT_WORK_CITY      | Flag if client's contact address does not match work address (1=different, 0=same, at city level) |
| NAME_CONTRACT_TYPE           | Identification if loan is cash or revolving |
| NAME_EDUCATION_TYPE          | Level of highest education the client achieved |
| NAME_FAMILY_STATUS           | Family status of the client |
| NAME_HOUSING_TYPE            | What is the housing situation of the client (renting, living with parents, etc.) |
| NAME_INCOME_TYPE             | Client's income type (businessman, working, maternity leave, etc.) |
| NAME_TYPE_SUITE              | Who was accompanying client when they applied for the loan |
| NONLIVINGAPARTMENTS_AVG      | Normalized information about building where the client lives. What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floors |
| NONLIVINGAREA_AVG            | Normalized information about building where the client lives. What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floors |
| OBS_30_CNT_SOCIAL_CIRCLE     | How many observations of client's social surroundings with observable 30 DPD (days past due) default |
| OCCUPATION_TYPE              | What kind of occupation the client has |
| ORGANIZATION_TYPE            | Type of organization where client works |
| OWN_CAR_AGE                  | Age of client's car |
| REG_REGION_NOT_WORK_REGION   | Flag if client's permanent address does not match work address (1=different, 0=same, at region level) |
| REGION_POPULATION_RELATIVE   | Normalized population of region where client lives (higher number means the client lives in a more populated region) |
| REGION_RATING_CLIENT         | Rating of the region where client lives (1,2,3) |
| SK_ID_CURR                   | ID of loan in our sample |
| TARGET                       | Target variable (1 - client with payment difficulties: they had a late payment more than X days on at least one of the first Y installments of the loan in our sample, 0 - all other cases) |
| YEARS_BEGINEXPLUATATION_AVG  | Normalized information about building where the client lives. What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floors |
| YEARS_BUILD_AVG              | Normalized information about building where the client lives. What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floors |
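
The step of dropping redundant/highly correlated columns could be sketched as follows. This is only an illustration under stated assumptions: the helper name `drop_highly_correlated`, the 0.9 threshold, and the toy values are hypothetical and not taken from our pipeline, and the real selection may also weigh missingness and domain relevance.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop the later column of every numeric pair whose |Pearson r| exceeds threshold."""
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)


# Toy stand-in for application_train.csv (values are made up).
toy = pd.DataFrame({
    "APARTMENTS_AVG":   [0.10, 0.20, 0.30, 0.40, 0.50],
    "APARTMENTS_MEDI":  [0.11, 0.21, 0.29, 0.41, 0.50],  # near-duplicate of _AVG
    "AMT_INCOME_TOTAL": [90000, 45000, 120000, 60000, 75000],
})

reduced = drop_highly_correlated(toy, threshold=0.9)
print(list(reduced.columns))
```

In this toy frame, the `_MEDI` column is removed because it is almost perfectly correlated with its `_AVG` counterpart, which mirrors why only the `_AVG` building variables appear in the table above.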

#### ***credit_card_balance.csv***
(3.8 million rows, 23 columns) This dataset contains monthly data about previous credit cards clients have had with Home Credit. Each row is one month of a credit card balance, and a single credit card can have many rows. After dropping some redundant/highly correlated columns, the following columns might be retained from this data:

| Variable                     | Description |
|------------------------------|-------------|
| SK_ID_PREV                   | ID of previous credit in Home Credit related to loan in our sample. (One loan in our sample can have 0, 1, 2, or more previous loans in Home Credit) |
| SK_ID_CURR                   | ID of loan in our sample |
| MONTHS_BALANCE               | Month of balance relative to application date (-1 means the freshest balance date) |
| AMT_BALANCE                  | Balance during the month of previous credit |
| AMT_CREDIT_LIMIT_ACTUAL      | Credit card limit during the month of the previous credit |
| AMT_DRAWINGS_CURRENT         | Amount drawing during the month of the previous credit |
| AMT_PAYMENT_TOTAL_CURRENT    | How much did the client pay during the month in total on the previous credit |
| CNT_DRAWINGS_CURRENT         | Number of drawings during this month on the previous credit |
| CNT_INSTALMENT_MATURE_CUM    | Number of paid installments on the previous credit |
| NAME_CONTRACT_STATUS         | Contract status (Active, Approved, Completed, Demand, Refused, Sent proposal, Signed) on the previous credit |
| SK_DPD_DEF                   | DPD (Days past due) during the month with tolerance (debts with low loan amounts are ignored) of the previous credit |
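
Because each credit card contributes many monthly rows, this table has to be collapsed to one row per `SK_ID_CURR` before it can be joined to the application data. A minimal sketch of that aggregation is shown below, assuming pandas; the `CC_*` feature names and the toy values are hypothetical, and the final feature set may differ.

```python
import pandas as pd

# Toy stand-in for credit_card_balance.csv (values are made up).
cc = pd.DataFrame({
    "SK_ID_PREV":              [10, 10, 10, 20, 20],
    "SK_ID_CURR":              [1, 1, 1, 2, 2],
    "MONTHS_BALANCE":          [-3, -2, -1, -2, -1],
    "AMT_BALANCE":             [500.0, 400.0, 300.0, 0.0, 1000.0],
    "AMT_CREDIT_LIMIT_ACTUAL": [1000.0, 1000.0, 1000.0, 2000.0, 2000.0],
    "SK_DPD_DEF":              [0, 5, 0, 0, 0],
})

# Monthly utilization; a zero limit becomes NaN to avoid division by zero.
limit = cc["AMT_CREDIT_LIMIT_ACTUAL"].where(cc["AMT_CREDIT_LIMIT_ACTUAL"] > 0)
cc["CC_UTILIZATION"] = cc["AMT_BALANCE"] / limit

# Collapse the monthly rows to one row per current loan.
cc_features = (
    cc.groupby("SK_ID_CURR")
    .agg(
        CC_MONTHS_COUNT=("MONTHS_BALANCE", "count"),
        CC_BALANCE_MEAN=("AMT_BALANCE", "mean"),
        CC_UTILIZATION_MAX=("CC_UTILIZATION", "max"),
        CC_DPD_DEF_MAX=("SK_DPD_DEF", "max"),
    )
    .reset_index()
)
print(cc_features)
```

The resulting frame has one row per `SK_ID_CURR`, so it can be left-joined onto the application table without duplicating loans.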


#### ***bureau.csv***
(1.7 million rows, 17 columns) This table contains information about clients’ previous credits provided by other financial institutions. One SK_ID_CURR can have multiple rows of previous credit in this table. After dropping some redundant/highly correlated columns, the following columns might be retained from this data:

| Variable                     | Description |
|------------------------------|-------------|
| SK_ID_CURR                   | ID of loan in the sample |
| SK_ID_BUREAU                 | ID used to join with `bureau_balance` features if needed |
| CREDIT_ACTIVE                | Status of Credit Bureau reported credits |
| DAYS_CREDIT                  | Number of days before the current application when the client applied for Credit Bureau credit |
| CREDIT_DAY_OVERDUE           | Number of days past due on Credit Bureau credit at the time of application for the related loan in the sample |
| DAYS_CREDIT_ENDDATE          | Remaining duration of CB credit (in days) at the time of application |
| AMT_CREDIT_MAX_OVERDUE       | Maximum amount overdue on the Credit Bureau credit at the application date of the loan in the sample |
| CNT_CREDIT_PROLONG           | Number of times the Credit Bureau credit was prolonged |
| AMT_CREDIT_SUM               | Current credit amount for the Credit Bureau credit |
| AMT_CREDIT_SUM_DEBT          | Current debt on the Credit Bureau credit |
| AMT_CREDIT_SUM_LIMIT         | Current credit limit of a credit card reported in Credit Bureau |
| AMT_CREDIT_SUM_OVERDUE       | Current amount overdue on the Credit Bureau credit |
| CREDIT_TYPE                  | Type of credit (e.g., car loan, cash loan, etc.) |
| AMT_ANNUITY                  | Annuity of the Credit Bureau credit |


#### ***bureau_balance.csv***
(27 million rows, 3 columns) This table contains monthly balances for previous credits. There is one row for each month of history for every previous credit that was reported to the credit bureau.

| Variable         | Description |
|------------------|-------------|
| SK_ID_BUREAU     | ID used to join with bureau features if needed |
| MONTHS_BALANCE   | Month of balance relative to application date |
| STATUS           | Status of the credit loan during the month |
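
The two bureau tables are linked in two hops: monthly rows roll up to a bureau credit, and bureau credits roll up to a current loan. The sketch below illustrates one way to do this with pandas. It assumes the join key is named `SK_ID_BUREAU` as in the Kaggle files, and that `STATUS` codes `"1"` through `"5"` mark past-due months while `"0"`, `"C"`, and `"X"` do not; the `BB_*` and `BUREAU_*` names and the toy values are hypothetical.

```python
import pandas as pd

# Toy stand-ins for bureau.csv and bureau_balance.csv (values are made up).
bureau = pd.DataFrame({
    "SK_ID_CURR":    [1, 1, 2],
    "SK_ID_BUREAU":  [100, 101, 200],
    "CREDIT_ACTIVE": ["Active", "Closed", "Active"],
})
bureau_balance = pd.DataFrame({
    "SK_ID_BUREAU":   [100, 100, 100, 101, 101, 200],
    "MONTHS_BALANCE": [-2, -1, 0, -1, 0, 0],
    "STATUS":         ["0", "1", "0", "C", "C", "X"],
})

# Assumption: digit statuses "1".."5" indicate a month with days past due.
bureau_balance["IS_DPD"] = bureau_balance["STATUS"].isin(["1", "2", "3", "4", "5"])

# Hop 1: monthly rows -> one row per bureau credit.
per_credit = (
    bureau_balance.groupby("SK_ID_BUREAU")
    .agg(BB_MONTHS=("MONTHS_BALANCE", "count"), BB_DPD_MONTHS=("IS_DPD", "sum"))
    .reset_index()
)

# Hop 2: bureau credits -> one row per current loan.
per_client = (
    bureau.merge(per_credit, on="SK_ID_BUREAU", how="left")
    .groupby("SK_ID_CURR")
    .agg(
        BUREAU_CREDIT_COUNT=("SK_ID_BUREAU", "nunique"),
        BUREAU_DPD_MONTHS_SUM=("BB_DPD_MONTHS", "sum"),
    )
    .reset_index()
)
print(per_client)
```

A left merge is used so that bureau credits without any monthly history are kept rather than silently dropped.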

#### ***previous_application.csv***
(1.7 million rows, 37 columns) This table contains information about previous Home Credit loan applications made by clients who have loans in the train and test datasets. There is one row for each previous application. One SK_ID_CURR can have multiple rows of previous applications in this table. After dropping some redundant/highly correlated columns and aggregating to one row per SK_ID_CURR, the following columns might be retained from this data:

| Variable                          | Description |
|-----------------------------------|-------------|
| SK_ID_CURR                        | ID of loan in our sample |
| PREV_AMT_APPLICATION_MEAN         | Mean of how much credit the client asked for on previous applications |
| PREV_AMT_APPLICATION_SUM          | Total amount of how much credit the client asked for on all previous applications |
| PREV_AMT_CREDIT_MEAN              | Mean of how much credit the client was approved for on previous applications |
| PREV_AMT_CREDIT_SUM               | Total amount of how much credit the client was approved for on previous applications |
| PREV_AMT_DOWN_PAYMENT_MEAN        | Mean of down payments on all previous applications |
| PREV_AMT_DOWN_PAYMENT_SUM         | Total amount of down payments on all previous applications |
| PREV_AMT_ANNUITY_MEAN             | Mean of annuity for all previous applications |
| PREV_AMT_ANNUITY_SUM              | Total sum of annuity for all previous applications |
| PREV_CNT_PAYMENT_MEAN             | Mean of terms of previous credit at application of the previous applications |
| PREV_CNT_PAYMENT_SUM              | Total sum of terms of previous credit at application of the previous applications |
| PREV_APPROVED_SUM                 | Total number of approved applications for `SK_ID_CURR` |
| PREV_REFUSED_SUM                  | Total number of refused applications for `SK_ID_CURR` |
| PREV_CANCELED_SUM                 | Total number of canceled applications for `SK_ID_CURR` |
| PREV_APPROVAL_RATE                | Total number of approved applications / Total number of applications |
| PREV_CREDIT_APPLICATION_RATIO_MEAN | `PREV_AMT_CREDIT_MEAN` / `PREV_AMT_APPLICATION_MEAN` |
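
As a rough illustration of how the aggregated features above could be derived, the following sketch groups a made-up sample of previous applications by `SK_ID_CURR`. It assumes the raw columns `AMT_APPLICATION`, `AMT_CREDIT`, and `NAME_CONTRACT_STATUS`; the helper column `PREV_APP_COUNT` and the `IS_*` indicators are hypothetical names introduced only for this example, and only a subset of the features is shown.

```python
import pandas as pd

# Made-up sample of previous_application.csv (values are illustrative only).
prev = pd.DataFrame({
    "SK_ID_CURR": [100, 100, 100, 200],
    "AMT_APPLICATION": [1000.0, 2000.0, 3000.0, 500.0],
    "AMT_CREDIT": [900.0, 2000.0, 0.0, 500.0],
    "NAME_CONTRACT_STATUS": ["Approved", "Approved", "Refused", "Canceled"],
})

# One 0/1 indicator column per application outcome.
for status in ["Approved", "Refused", "Canceled"]:
    prev[f"IS_{status.upper()}"] = (prev["NAME_CONTRACT_STATUS"] == status).astype(int)

# Aggregate to one row per current loan.
prev_agg = prev.groupby("SK_ID_CURR").agg(
    PREV_AMT_APPLICATION_MEAN=("AMT_APPLICATION", "mean"),
    PREV_AMT_APPLICATION_SUM=("AMT_APPLICATION", "sum"),
    PREV_AMT_CREDIT_MEAN=("AMT_CREDIT", "mean"),
    PREV_AMT_CREDIT_SUM=("AMT_CREDIT", "sum"),
    PREV_APPROVED_SUM=("IS_APPROVED", "sum"),
    PREV_REFUSED_SUM=("IS_REFUSED", "sum"),
    PREV_CANCELED_SUM=("IS_CANCELED", "sum"),
    PREV_APP_COUNT=("NAME_CONTRACT_STATUS", "size"),  # hypothetical helper
).reset_index()

# Ratio features defined in the table above.
prev_agg["PREV_APPROVAL_RATE"] = (
    prev_agg["PREV_APPROVED_SUM"] / prev_agg["PREV_APP_COUNT"]
)
prev_agg["PREV_CREDIT_APPLICATION_RATIO_MEAN"] = (
    prev_agg["PREV_AMT_CREDIT_MEAN"] / prev_agg["PREV_AMT_APPLICATION_MEAN"]
)
print(prev_agg)
```

On the real data, a client whose applications all requested an amount of zero would produce a division by zero in the last ratio, so that case would need separate handling.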


#### ***POS_CASH_balance.csv***
(10 million rows, 8 columns) This dataset provides monthly balance snapshots for previous Point of Sale (POS) and cash loans that the applicants had with Home Credit. Each row represents one month of history for one previous credit (consumer credit or cash loan) linked to a loan in the current sample, and records the credit's status, its remaining installments, and any days past due.

| Variable               | Description |
|------------------------|-------------|
| SK_ID_PREV             | Identifier for each previous credit associated with the loans in the sample. This allows linking multiple rows to a single previous credit, recording monthly snapshots of the credit status. |
| SK_ID_CURR             | Identifier for the current loan in the sample, connecting each record back to the specific loan application. |
| MONTHS_BALANCE         | The number of months relative to the loan application date for each balance snapshot. A value of -1 represents the most recent monthly snapshot. This field allows for chronological tracking of credit history. |
| CNT_INSTALMENT         | The number of installments in the original term of the previous credit. |
| CNT_INSTALMENT_FUTURE  | The number of remaining installments on the previous credit. This variable helps track repayment progress and indicates how much of the loan is still unpaid. |
| NAME_CONTRACT_STATUS   | The status of the credit contract during each monthly snapshot. Possible statuses include “Active” and “Completed,” among others. This variable is useful for tracking whether the loan is still ongoing or has been completed. |
| SK_DPD                 | Days Past Due (DPD) during the month for the previous credit. This field indicates the number of days by which the payment was overdue in that month, if any, providing insight into the client’s payment discipline. |
| SK_DPD_DEF             | Days Past Due with a tolerance level (small overdue amounts are ignored) for the previous credit during the month. This metric helps identify significant overdue periods while ignoring minor payment delays that may not indicate real risk. |
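
The monthly snapshots described above can be summarized per client in several ways. The sketch below is one possibility, run on made-up data: it takes the worst delay and the number of months with a delay from `SK_DPD`, and the share of the original term still outstanding from the most recent snapshot of each credit. The feature names prefixed with `POS_` and the helper columns `HAS_DPD` and `REMAINING_RATIO` are hypothetical.

```python
import pandas as pd

# Made-up sample of POS_CASH_balance.csv (values are illustrative only).
pos = pd.DataFrame({
    "SK_ID_PREV": [10, 10, 10, 11, 11],
    "SK_ID_CURR": [100, 100, 100, 100, 100],
    "MONTHS_BALANCE": [-3, -2, -1, -2, -1],
    "CNT_INSTALMENT": [12, 12, 12, 6, 6],
    "CNT_INSTALMENT_FUTURE": [5, 4, 3, 1, 0],
    "SK_DPD": [0, 10, 0, 0, 0],
})

# Delay features over the full monthly history of each client.
pos = pos.assign(HAS_DPD=(pos["SK_DPD"] > 0).astype(int))
dpd_agg = pos.groupby("SK_ID_CURR").agg(
    POS_SK_DPD_MAX=("SK_DPD", "max"),
    POS_MONTHS_WITH_DPD=("HAS_DPD", "sum"),
)

# Most recent snapshot of each previous credit (largest MONTHS_BALANCE).
latest = pos.sort_values("MONTHS_BALANCE").groupby("SK_ID_PREV").tail(1)
latest = latest.assign(
    REMAINING_RATIO=latest["CNT_INSTALMENT_FUTURE"] / latest["CNT_INSTALMENT"]
)
remaining = (
    latest.groupby("SK_ID_CURR")["REMAINING_RATIO"]
    .mean()
    .rename("POS_REMAINING_RATIO_MEAN")
)

# Both pieces are indexed by SK_ID_CURR, so an index join lines them up.
pos_features = dpd_agg.join(remaining).reset_index()
print(pos_features)
```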


#### ***Installments_payments.csv***
(13.6 million rows, 8 columns) This table contains the repayment history for previously disbursed Home Credit credits related to the loans in our sample. Each row records either one payment made toward an installment or one missed installment.

| Variable               | Description |
|------------------------|-------------|
| SK_ID_PREV             | An identifier for each previous credit related to the loans in the sample. Multiple rows may be associated with a single credit, reflecting each installment or payment related to that credit. |
| SK_ID_CURR             | A unique identifier for the current loan in the sample. This links each installment record back to the specific loan application. |
| NUM_INSTALMENT_VERSION | Indicates the version of the installment calendar. A value of 0 denotes a credit card-related installment. |
| NUM_INSTALMENT_NUMBER  | The specific installment number within the credit’s payment schedule. This indicates the position of the installment within the entire repayment plan. |
| DAYS_INSTALMENT        | The scheduled day for the installment payment, relative to the application date of the current loan. A negative value represents days before the loan application date. |
| DAYS_ENTRY_PAYMENT     | The actual day the payment was made, relative to the application date of the current loan. This allows for comparisons between the scheduled payment date and the actual payment date to assess payment timeliness. |
| AMT_INSTALMENT         | The prescribed (expected) payment amount for this particular installment. This reflects the amount due according to the credit agreement. |
| AMT_PAYMENT            | The actual payment amount made by the client for the installment. Comparing `AMT_PAYMENT` with `AMT_INSTALMENT` provides insight into whether clients overpaid, underpaid, or missed payments. |
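
The comparisons mentioned in the table (scheduled versus actual payment day, and expected versus actual amount) can be turned into features directly. The following is a minimal sketch on made-up data; the derived names `DAYS_LATE`, `AMT_SHORTFALL`, `IS_LATE`, and the `INST_` aggregates are hypothetical, and clipping early payments and overpayments to zero is a design choice made for this example, not something fixed by the data.

```python
import pandas as pd

# Made-up sample of installments_payments.csv (values are illustrative only).
inst = pd.DataFrame({
    "SK_ID_CURR": [100, 100, 100],
    "DAYS_INSTALMENT": [-90, -60, -30],
    "DAYS_ENTRY_PAYMENT": [-92, -55, -30],
    "AMT_INSTALMENT": [500.0, 500.0, 500.0],
    "AMT_PAYMENT": [500.0, 400.0, 500.0],
})

# Days paid after the due date; early or on-time payments count as 0.
inst["DAYS_LATE"] = (inst["DAYS_ENTRY_PAYMENT"] - inst["DAYS_INSTALMENT"]).clip(lower=0)
# Amount by which the payment fell short; overpayments count as 0.
inst["AMT_SHORTFALL"] = (inst["AMT_INSTALMENT"] - inst["AMT_PAYMENT"]).clip(lower=0)
inst["IS_LATE"] = (inst["DAYS_LATE"] > 0).astype(int)

# Aggregate to one row per current loan.
inst_agg = inst.groupby("SK_ID_CURR").agg(
    INST_DAYS_LATE_MAX=("DAYS_LATE", "max"),
    INST_LATE_COUNT=("IS_LATE", "sum"),
    INST_SHORTFALL_SUM=("AMT_SHORTFALL", "sum"),
).reset_index()
print(inst_agg)
```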


### 3.4 **Machine Learning Algorithms** <a class="anchor" id="ML"></a>
We plan to test Logistic Regression, Decision Tree Classifier, Random Forest, Support Vector Machine (SVM), Gradient Boosting, Extreme Gradient Boosting (XGBoost), Light Gradient Boosting (LightGBM), and Neural Network models:

- **Logistic Regression** is straightforward, interpretable, and effective for binary classification tasks, so it will serve as a good baseline.
- **Decision Trees** create easily interpretable decision pathways based on feature splits, making them helpful for identifying which specific alternative data may indicate a higher likelihood of repayment.
- **Random Forest**, an ensemble of decision trees, can handle complex, non-linear relationships and capture interactions among features.
- **SVMs** find the hyperplane that best separates data points of different classes. They work well for datasets with high-dimensional feature spaces, such as the HCDR dataset, which involves multiple alternative data sources.
- **Gradient Boosting** is an ensemble technique that builds models sequentially, each one correcting the errors of the previous ones. It is highly effective for structured data and can capture complex, non-linear relationships.
- **XGBoost** is an efficient, scalable implementation of gradient boosting that includes regularization and handles missing values, which makes it a good fit for large, diverse datasets.
- **LightGBM** is another gradient boosting implementation, optimized for speed and memory efficiency. It offers regularization options to help control overfitting and can handle high-cardinality categorical features well.
- **Neural Networks** consist of multiple layers of interconnected neurons that can learn complex, non-linear relationships in data, so they are well suited to data where feature interactions are non-obvious or non-linear. They may help us uncover deeper patterns once simpler models are exhausted.

We will identify the best final model based on our evaluation metrics.


| Algorithm                 | Implementation                                | Loss Function / Split Criterion           |
|---------------------------|-----------------------------------------------|-------------------------------------------|
| Logistic Regression       | `sklearn.linear_model.LogisticRegression`     | Log Loss (Binary Cross-Entropy)           |
| Decision Tree Classifier  | `sklearn.tree.DecisionTreeClassifier`         | Gini Impurity / Entropy (split criterion) |
| Random Forest             | `sklearn.ensemble.RandomForestClassifier`     | Gini Impurity / Entropy (split criterion) |
| Support Vector Machine    | `sklearn.svm.SVC`                             | Hinge Loss (SVM Loss)                     |
| Gradient Boosting         | `sklearn.ensemble.GradientBoostingClassifier` | Log Loss (Binary Cross-Entropy)           |
| Extreme Gradient Boosting | `xgboost.XGBClassifier`                       | Log Loss (Binary Cross-Entropy)           |
| Light Gradient Boosting   | `lightgbm.LGBMClassifier`                     | Log Loss (Binary Cross-Entropy)           |
| Neural Network            | `torch.nn.Module`                             | Log Loss (Binary Cross-Entropy)           |
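
To give a feel for how several of the candidate models in the table might be compared side by side, here is a minimal sketch that fits the scikit-learn estimators on a synthetic, imbalanced dataset and records a test AUC for each. The data, the hyperparameter values, and the `auc_by_model` dictionary are assumptions for illustration only. `xgboost.XGBClassifier` and `lightgbm.LGBMClassifier` follow the same `fit` / `predict_proba` interface and could be added to the dictionary in the same way, while the PyTorch network would need its own training loop.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the HCDR features: roughly 10% positive class.
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=0
)

# Candidate models; hyperparameters are placeholders, not tuned values.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": SVC(probability=True, random_state=0),  # probability=True enables predict_proba
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Fit each model and score it on the held-out split.
auc_by_model = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    auc_by_model[name] = roc_auc_score(y_test, proba)

for name, auc in sorted(auc_by_model.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: AUC = {auc:.3f}")
```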


### 3.5 **Evaluation Metrics** <a class="anchor" id="metrics"></a>
To evaluate our models for the HCDR dataset, we are using accuracy, precision, recall, F1 score, AUC-ROC, and log loss metrics, which are common metrics used to evaluate classification models. The table below contains a description and formula for each metric.


| Name                               | Description                                                                                     | Formula                                                                                       |
|------------------------------------|-------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------|
| Accuracy                           | Proportion of correctly predicted samples out of the total samples                              | $$\text{Accuracy} = \frac{\text{True Positives + True Negatives}}{\text{True Positives + True Negatives + False Positives + False Negatives}}$$ |
| Precision                          | Proportion of true positive predictions out of all positive predictions                         | $$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$ |
| Recall (Sensitivity)               | Proportion of actual positives correctly identified                                             | $$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$ |
| F1 Score                           | Harmonic mean of precision and recall, balancing the two metrics                                | $$\text{F1 Score} = 2 \cdot \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$ |
| AUC-ROC                            | Area under the Receiver Operating Characteristic curve, which plots the true positive rate (TPR) against the false positive rate (FPR); measures the model's ability to distinguish between classes | $$\text{AUC-ROC} = \int_0^1 \text{TPR} \, d(\text{FPR})$$ |
| Log Loss                           | Penalizes predicted probabilities more heavily the further they are from the true class labels, averaged over $m$ samples and $K$ classes | $$\text{Log Loss} = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} y_k^{(i)} \log(\hat{p}_k^{(i)})$$ |



Each of these six metrics suits a different situation:

- **Accuracy** measures how often the model predicts the outcome correctly. It is misleading when the data set is imbalanced, meaning the overwhelming majority of `TARGET` values belong to one class, because always predicting the majority class already scores highly.
- **Precision** is the proportion of predicted positives that are truly positive. It is a good choice for projects where false positives are the costlier error.
- **Recall** is the proportion of actual positives (true positives plus false negatives) that the model identifies. It is a good choice for projects where false negatives are the costlier error.
- **F1 Score** is the harmonic mean of precision and recall. It is a good choice when false positives and false negatives are weighted similarly.
- **AUC-ROC** is the area under the Receiver Operating Characteristic curve, which measures the model's ability to distinguish between classes across all classification thresholds. Because it does not depend on a single threshold, it is more informative than accuracy on imbalanced data sets.
- **Log (Logarithmic) Loss** penalizes predicted probabilities more heavily the further they are from the true class. It can be used for balanced and imbalanced data sets but is sensitive to confident wrong predictions, which incur very large penalties.
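
The six metrics can be computed with `sklearn.metrics`. The sketch below uses a small made-up set of labels and predicted probabilities, and the 0.5 decision threshold is an assumption for the example rather than a tuned value. Note that accuracy, precision, recall, and F1 need hard class predictions, while AUC-ROC and log loss take the predicted probabilities.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    log_loss,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Made-up labels (1 = default) and predicted default probabilities.
y_true = np.array([1, 0, 0, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.6, 0.4, 0.1, 0.3, 0.8, 0.05])

# Assumed decision threshold of 0.5 for the label-based metrics.
y_pred = (y_prob >= 0.5).astype(int)

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    "roc_auc": roc_auc_score(y_true, y_prob),   # uses probabilities
    "log_loss": log_loss(y_true, y_prob),       # uses probabilities
}

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this toy example there are 2 true positives, 1 false positive, 1 false negative, and 4 true negatives, so accuracy is 6/8 while precision, recall, and F1 are each 2/3.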


### 3.6 **Pipeline Steps** <a class="anchor" id="pipeline"></a>

![Pipeline Diagram](https://i.imgur.com/WMXsTpL.png)

1. **Data Exploration**: Understand the dataset structure, features, and missing values.
2. **Data Preprocessing**: Clean data, handle missing values, encode categorical variables.
3. **Feature Engineering**: Create and select features to improve model performance.
4. **Model Selection**: Evaluate multiple algorithms to identify the most suitable one.
5. **Model Training**: Train the selected model on the training dataset.
6. **Model Evaluation**: Assess model performance using chosen metrics.
7. **Hyperparameter Tuning**: Optimize model hyperparameters to improve performance on the chosen metrics.
8. **Final Report**: Compile findings, present results, and discuss model insights.
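
The preprocessing, training, evaluation, and tuning steps above could be wired together with a scikit-learn `Pipeline`, roughly as sketched below. This is an illustration under stated assumptions, not our final implementation: the data is synthetic, only three columns (named after `application_train.csv` fields) are used, the imputation strategies and the `C` grid are placeholders, and Logistic Regression stands in for whichever model is selected.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the application data (values are made up).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "AMT_INCOME_TOTAL": rng.normal(150_000, 50_000, n),
    "AMT_CREDIT": rng.normal(500_000, 150_000, n),
    "NAME_CONTRACT_TYPE": rng.choice(["Cash loans", "Revolving loans"], n),
})
df.loc[rng.choice(n, 40, replace=False), "AMT_INCOME_TOTAL"] = np.nan
df["TARGET"] = (df["AMT_CREDIT"] / 500_000 + rng.normal(0, 1, n) > 1.8).astype(int)

numeric = ["AMT_INCOME_TOTAL", "AMT_CREDIT"]
categorical = ["NAME_CONTRACT_TYPE"]

# Step 2: handle missing values, scale numerics, encode categoricals.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

# Steps 4-5: a single candidate model attached to the preprocessing.
pipe = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df[numeric + categorical], df["TARGET"],
    stratify=df["TARGET"], test_size=0.25, random_state=0,
)

# Step 7: tune the regularization strength with cross-validated AUC.
search = GridSearchCV(pipe, {"model__C": [0.1, 1.0, 10.0]}, scoring="roc_auc", cv=3)
search.fit(X_train, y_train)

# Step 6: evaluate the tuned pipeline on the held-out split.
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print(search.best_params_, f"test AUC = {test_auc:.3f}")
```

Keeping the preprocessing inside the pipeline means the imputer and scaler are refit on each cross-validation fold, which avoids leaking information from the validation folds into training.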


### 3.7 **Project Timeline** <a class="anchor" id="timeline"></a>
Below is the estimated timeline for our project phases:

<center>
<div style="text-align: center;">
    <img src="https://i.ibb.co/Lpz7YV6/gantt.png" alt="Project timeline Gantt chart" width="800"/>
</div>
</center>

