# FP Phase 1 - Home Credit Default Risk

Spring 2024

**Team Members:**
- Glen Colletti
- Alex Bordanca
- Paul Miller



## Abstract
>"The project is based on the Home Credit Default Risk (HCDR) Kaggle Competition. The goal of the competition is to predict whether or not a client will repay a loan. In order to make sure that people who struggle to get loans due to insufficient or non-existent credit histories have a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities."

The team will investigate the features in the training data and in the other provided files to find reliable predictors of repayment. Feature selection will include both forward and backward selection. Pipelines will be implemented to standardize the inputs for both categorical and numerical features, and optimal hyperparameters will be determined for model fitting. The data will be fit using a variety of machine learning algorithms, including XGBoost, Logistic Regression, and a KNN classifier.

## Data Description

The HCDR dataset contains 122 feature columns and about 307,000 records. The target variable for prediction, "TARGET", is a binary vector: 0 represents full repayment, while 1 represents an unpaid loan.

Correlating each feature with the target variable shows that most features have only minimal correlation. The top ten positive correlation coefficients range from approximately 0.08 down to about 0.04. The negative correlations are somewhat stronger, ranging from about 0.18 down to 0.03 in absolute value.

![image.png](attachment:486752be-c7e3-4219-a8e0-b55bcb45eddb.png)
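
A ranking like the one above could be produced along the following lines. This is a minimal sketch on synthetic data, not the notebook's actual code: the columns `feat_pos`, `feat_neg`, and `feat_noise` are made-up placeholders, and only the `TARGET` column name comes from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the application data. Column names other than
# TARGET are hypothetical placeholders, not real HCDR features.
rng = np.random.default_rng(0)
n = 1000
target = rng.integers(0, 2, size=n)
df = pd.DataFrame({
    "TARGET": target,
    "feat_pos": target + rng.normal(0, 3, size=n),   # weak positive signal
    "feat_neg": -target + rng.normal(0, 3, size=n),  # weak negative signal
    "feat_noise": rng.normal(0, 1, size=n),          # no signal
})

# Pearson correlation of every numeric column with TARGET, sorted ascending.
correlations = (
    df.select_dtypes(include="number")
    .corr()["TARGET"]
    .drop("TARGET")
    .sort_values()
)

print(correlations.head(10))  # most negative correlations
print(correlations.tail(10))  # most positive correlations
```

Sorting once and reading both ends of the series gives the strongest negative and positive correlations without computing anything twice.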

By occupation, the majority of applicants are laborers, sales staff, or core staff.

![image.png](attachment:3e4ce93f-776f-4ca5-ba9d-b56418578d58.png)

Most of the loans in the training data were repaid.

![image.png](attachment:c1fc96cd-00d1-4897-9789-7a311d06fd28.png)

Of the 122 features, 58 are missing 12% or more of their values, while the remaining 64 are each missing less than 1%. The highest missing rate for any single feature is about 69%.
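
The missing-data summary above can be computed per column with pandas. The sketch below uses a tiny made-up frame (all column names except `TARGET` are hypothetical) and the 12% cut-off quoted in the text.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame; column names other than TARGET are hypothetical.
df = pd.DataFrame({
    "TARGET": [0, 1, 0, 0, 1],
    "mostly_present": [1.0, 2.0, 3.0, 4.0, 5.0],
    "some_missing": [1.0, np.nan, 3.0, np.nan, 5.0],
    "mostly_missing": [np.nan, np.nan, np.nan, np.nan, 5.0],
})

# Percentage of missing values per column, highest first.
missing_pct = df.isna().mean().mul(100).sort_values(ascending=False)

# Columns at or above the 12% threshold mentioned in the text.
high_missing = missing_pct[missing_pct >= 12].index.tolist()

print(missing_pct)
print(high_missing)
```

Columns in `high_missing` are the natural candidates for imputation or removal during feature selection.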

## Machine Learning Algorithms and Metrics

### Models
- **Logistic Regression**: A linear model for binary classification that predicts the probability that a data point belongs to a particular class. It applies the sigmoid function to a linear combination of the features, which maps the output to the interval (0, 1) so that it can be read as the probability of the positive class. The loss function used is log loss (cross-entropy loss), which measures the difference between the predicted probabilities and the actual binary labels and penalizes confident wrong predictions most heavily.
- **XGBoost**: XGBoost uses a gradient boosting framework in which each new tree is fit to the negative gradient of the loss function, evaluated at the current ensemble's predictions. In our case the loss is logistic loss, and we will try it with and without L1 and L2 regularization. The regularization terms help prevent overfitting, while parallelized tree construction speeds up training.
- **KNNClassifier**: Predicts the class of a data point from the majority class among its K nearest neighbors in feature space. It does not directly optimize a loss function; instead, its performance is evaluated and tuned using metrics such as accuracy and F1 score.
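
To illustrate how log loss treats confident mistakes in the logistic regression description above, here is a small pure-Python sketch. The helper names `sigmoid` and `log_loss` are defined here for illustration only and are not taken from the project code.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# One actual default (y = 1), scored with three different linear outputs.
confident_right = log_loss([1], [sigmoid(3.0)])   # high probability of default
uncertain = log_loss([1], [sigmoid(0.0)])         # probability of exactly 0.5
confident_wrong = log_loss([1], [sigmoid(-3.0)])  # low probability of default

print(confident_right, uncertain, confident_wrong)
```

The loss grows as the predicted probability for the true class shrinks, so the confident wrong prediction costs far more than the uncertain one.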
 
### Metrics
- **F1 Score**: The harmonic mean of precision and recall: $$ F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} = \frac{2\,\text{TP}}{2\,\text{TP} + \text{FP} + \text{FN}} $$
- **RMSE (Root Mean Squared Error)**: The square root of the mean of the squared differences between the observed values $y_i$ and the predicted values $\hat{y}_i$, i.e., the residuals. $$ \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} $$
- **R2 Score (Coefficient of Determination)**: The proportion of the variation of the dependent variable that is explained by or predictable from the independent variable(s), where $y_i$ are the observed values, $\hat{y}_i$ the predicted values, and $\bar{y}$ the mean of the observed values. $$ R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} $$
- **Accuracy**: Measures how often a model correctly predicts the outcome. $$ \text{acc} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}} $$
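
As a worked example of the formulas above, the following pure-Python sketch computes each metric on a small set of made-up labels and predictions. The data are illustrative only and are not results from the project.

```python
import math

# Made-up binary labels and hard predictions, for illustration only.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]
pairs = list(zip(y_true, y_pred))

# Confusion-matrix counts.
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)

accuracy = (tp + tn) / (tp + tn + fp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)

# RMSE and R2 use the residuals between observed and predicted values.
n = len(y_true)
ss_res = sum((t - p) ** 2 for t, p in pairs)
y_mean = sum(y_true) / n
ss_tot = sum((t - y_mean) ** 2 for t in y_true)

rmse = math.sqrt(ss_res / n)
r2 = 1 - ss_res / ss_tot

print(accuracy, f1, rmse, r2)
```

Note that R2 can come out negative when the predictions fit worse than always predicting the mean of the observed values.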

## Pipelines

We plan to implement the following steps as described by this block diagram:

![Image](Project_pipelines.png)
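
One way such a pipeline could look in scikit-learn is sketched below. It is an assumption about the general shape only, not a transcription of the block diagram: the toy columns (`income`, `credit`, `occupation`), the imputation strategies, and the choice of logistic regression as the final step are all illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with hypothetical column names and some missing values.
X = pd.DataFrame({
    "income": [50.0, 80.0, np.nan, 120.0, 65.0, 40.0, 95.0, 70.0],
    "credit": [200.0, 150.0, 300.0, np.nan, 180.0, 260.0, 120.0, 210.0],
    "occupation": ["Laborers", "Sales staff", "Core staff", np.nan,
                   "Laborers", "Laborers", "Sales staff", "Core staff"],
})
y = np.array([0, 0, 1, 0, 0, 1, 0, 1])

numeric_features = ["income", "credit"]
categorical_features = ["occupation"]

# Numerical branch: fill gaps with the median, then standardize.
numeric_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical branch: fill gaps with the most frequent value, then one-hot encode.
categorical_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric_pipeline, numeric_features),
    ("cat", categorical_pipeline, categorical_features),
])

# Full pipeline: preprocessing followed by a classifier.
model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression(max_iter=1000)),
])

model.fit(X, y)
proba = model.predict_proba(X)[:, 1]  # estimated probability of class 1
print(proba)
```

Keeping preprocessing inside the pipeline means the same imputation and scaling are fit on training data only and reapplied unchanged at prediction time, and the final step can be swapped for another classifier without touching the preprocessing.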


## Gantt Chart & Phase Leader Plan

![Image](Phase_1_2_Gantt.png)
![Image](Phase_3_4_Gantt.png)

## Credit Assignment Plan

![Image](Phase_1_credit_assignment.png)