<h3 align="Center">
    <img alt="Logo" title="#logo" width="250px" src="https://www.nerdwallet.com/cdn-cgi/image/quality=85/cdn/loans/edu/stride.png">
    <br>
</h3>

# <center> **Early warning model for credit default**
## <center> **Executive report**
### <center> **Funding Credit team**
### <center> Analysts:
### <center> *Kevin M. Figueroa*
### <center> *September 19th 2022*

_________________________________________________________________________________________________________________________________________________

# <center> Executive report

## 1. Fintechs' challenge in the growing student loan market



Recent advances in digital technology and big data have allowed FinTech (financial technology) lending to emerge as a potentially promising solution to reduce the cost of credit and increase financial inclusion. FinTech credit has the potential to enhance financial inclusion and outperform traditional credit scoring by (1) leveraging nontraditional data sources to improve the assessment of the borrower’s track record; (2) appraising collateral value; (3) forecasting income prospects; and (4) predicting changes in general conditions. However, because of the central role of data in ML-based analysis, data relevance should be ensured, especially in situations when a deep structural change occurs, when borrowers could counterfeit certain indicators, and when agency problems arising from information asymmetry could not be resolved. (IMF 2019)

Student debt has been one of the sectors that have grown exponentially with the addition of more FinTechs to the market willing to fund young people's education. Student loan debt is now the second highest consumer debt category - behind only mortgage debt - and higher than both credit cards and auto loans. In 2019 there were more than 44 million borrowers who collectively owed USD 1.5 trillion in student loan debt in the U.S. alone. The average student in the Class of 2016 has USD 37,172 in student loan debt. (CNBC, 2020)

<h3 align="Center">
    <img alt="Logo" title="#logo" width="400px" src="http://cdn.statcdn.com/Infographic/images/normal/17777.jpeg">
    <br>
</h3>

While it is important to broaden access to capital to fund education, it is equally important to ensure that every loan is allocated to individuals who will have the financial capacity to repay the debt in the future without substantially affecting their well-being. Student loan FinTechs, probably more than any other type of lender, rely on the ability to forecast the situation borrowers will be in over the long term. While other types of loans, such as business loans or mortgages, can use past information to predict the future with reasonable confidence, a student loan is an investment that can change an individual's future. If a business gets a loan to fund a project, there is probably a project evaluation with future cash flow estimations, and the remaining operations of the business will most likely stay relatively consistent. Similarly, after the 2008 mortgage crisis, mortgage loans have been more carefully allocated to individuals who are already financially settled. Student loans, on the other hand, fund the journey that young people will navigate over the next few years, in some cases moving to new cities or even countries, drastically changing their skill set and critical thinking, and substantially increasing the level of income they can access. Consider that the gap between approval and first payment is 1-3 months for traditional loans but can be as long as 4 years for student loans, and the additional complexity of evaluating a student loan becomes clear. On top of the difficulty of predicting a very dynamic future, student loan terms are typically 10-15 years, compared to 1-5 years for digital business loans.

All in all, student loan FinTechs face the challenge of predicting a future that will probably look significantly different from the past, and of doing so for a very long period of time.



## 2. Project's purpose

 

Knowing the complexity of the challenge, it is crucial to leverage every piece of information that FinTechs can get access to, and that is what this project is all about. Stride is a FinTech company that competes in the student loan market, and its mission is to make education affordable and accessible. To achieve this mission, Stride is in constant search of ways to improve the quality of the methods used to evaluate risks. Recently, the Stride Funding Credit team gained access to a promising set of data for predicting loan defaults. Unfortunately, the complete documentation of the data was lost while renewing the internal database structure. However, it wouldn't be wise to just discard such a valuable piece of information. 

The purpose of this project is to assess the potential of this data and obtain as much insight as possible that could be useful to improve Stride's operations. To fulfill this purpose, the following specific objectives were pursued:

- Perform a deep exploratory data analysis process to understand as much of the data as possible. 
- Clean the data. 
- Produce a model that can help identify loans at risk of default.
- Predict the default status for a set of data that has no information on it.
- Propose a business strategy that can take advantage of the predictive power of the model. 




## 3. Data exploration results

### Description of the data

 

Because the documentation of the data is not available, the first task is to provide a description of the information we have available. The dataset contains the following variables:

- *id*: A unique identifier of each of the clients.

    There are no repeated ids, implying that each client has been granted only one loan.

- *date_of_birth*: Birth date of the client.

- *number_dependants*: Number of people that depend directly on the client's income.

- *credit_utilization*: % of the credit limit that the individual has used. 

- *debt_to_income_ratio*: Ratio of debt to income.

- *monthly_income*: Total income that the client receives every month.

- *number_open_credit_lines*: Number of open credit lines that the client has. 

- *number_open_loans*: Number of loans that the client has received that haven't been paid off

- *number_90_days_past_due*: Number of accounts in which the client is 90+ days past due

- *number_charged_off*: Number of debt accounts in which the client has been charged off

- *score1 and score2*: The data contains two credit scores, and it is not clear what the difference between them is. There are two possibilities:

    - H1: Scores are collected from different credit reporting institutions: It is difficult to test this hypothesis since the three main credit institutions use similar ranges. Equifax -> 250-850; Experian -> 300-850; Transunion -> 300-850

    - H2: Scores are collected at different moments in time: It could be the case that one of the scores was taken before the loan was disbursed and the other at some specific point during the duration of the loan. In this case, the difference would represent the evolution of the credit performance of the client. If this were the case, it would be assumed that *score1* was collected before *score2*.

    **The correlation analysis showed that the first hypothesis was the most likely to be true. Therefore, H1 is the official description.**
    
    (See [Methodologic report](https://github.com/kevinmiguel97/Credit-default-model/blob/main/Credit_default_model.ipynb): Feature selection section)

- *target*: Binary variable that shows if an individual defaulted (1 if defaulted and 0 if not)

From the original set of variables, the following variables were generated:

- *age*: Age of the individual at the end of the loan or time of default.

- *avg_loan* (*total_debt* / *number_accounts*): Average unpaid loan amount of the individual.

    Knowing the original amount of each account would be preferable, as it would allow deriving variables such as monthly payments.

- *90_days_pct* (*number_90_days_past_due* / *number_accounts*): Percentage of accounts with 90 days past-due.

- *charged_off_pct* (*number_charged_off* / *number_accounts*): Percentage of accounts charged off.

- *avg_score* ((*score1* + *score2*) / 2): Average score.

- *score_change* (*score2* - *score1*): Change of the credit score.
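
The ratio and score features listed above can be sketched with pandas. This is a minimal illustration, not the code from the methodologic report: it assumes `total_debt` and `number_accounts` exist as columns (as the formulas imply), the helper name `add_derived_features` is hypothetical, and the two example rows are made up. The *age* feature is left out because the reference date used to compute it is not stated here.

```python
import pandas as pd


def add_derived_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the ratio and score features added."""
    out = df.copy()
    # Assumption: zero accounts is treated as missing, so ratios become NaN
    # instead of dividing by zero.
    accounts = out["number_accounts"].where(out["number_accounts"] != 0)
    out["avg_loan"] = out["total_debt"] / accounts
    out["90_days_pct"] = out["number_90_days_past_due"] / accounts
    out["charged_off_pct"] = out["number_charged_off"] / accounts
    out["avg_score"] = (out["score1"] + out["score2"]) / 2
    out["score_change"] = out["score2"] - out["score1"]
    return out


# Made-up rows, only to show the shape of the input and output.
example = pd.DataFrame(
    {
        "total_debt": [12000.0, 5000.0],
        "number_accounts": [4, 0],
        "number_90_days_past_due": [1, 0],
        "number_charged_off": [0, 0],
        "score1": [640, 700],
        "score2": [660, 690],
    }
)
features = add_derived_features(example)
print(features[["avg_loan", "90_days_pct", "avg_score", "score_change"]])
```
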

Table 1 shows the descriptive statistics of the variables.

<h3 align="Center">
    <img alt="Logo" title="#logo" width="500px" src="assets/Table1.png">
    <br>
</h3>



Just by analyzing Table 1 we can draw the following conclusions:

- The minimum value of the number of dependants variable is -1, which makes no logical sense. The most likely explanation is that missing values were replaced with -1 in a previous data cleaning step, which is a convention in some processes.

- The age variable goes as high as 193. Given that, according to Guinness World Records, the oldest living person known is 118 years old, any value above that must be a mistake. The histogram of ages showed that the mistake came from a typo in the year of birth. (See [Methodologic report](https://github.com/kevinmiguel97/Credit-default-model/blob/main/Credit_default_model.ipynb), Figure 6)

- Scales are different among variables, so a standardization process will likely be needed. 



### Target distribution
 

The variable *target* is the one the model will try to predict. It takes the value 0 if the loan was successfully repaid and 1 if the loan defaulted. Because we are trying to predict a binary variable, it is important to know the proportion of defaulted loans. Figure 1 shows the percentage of defaulted vs. non-defaulted loans. The sample is quite unbalanced, with non-default loans representing around 90% of the cases and default loans only 10%.

Trying to preserve these proportions will be important when splitting the data to test the model's performance.



<h3 align="Center">
    <img alt="Logo" title="#logo" width="400px" src="assets/Figure1.png">
    <br>
</h3>


### Features distributions


Figure 2 shows the boxplots for the original quantitative variables. Not only does it confirm that a standardization process is needed, but it also shows that we have to deal with outliers. Luckily, standardization will help with that as well.

<h3 align="Center">
    <img alt="Logo" title="#logo" width="400px" src="assets/Figure2.png">
    <br>
</h3>

Figure 3.1 shows the Kernel Density Estimate (KDE) of the distribution of each variable, conditioned on the type of repayer. It provides important information about the problem we are facing. Ideally, the green and red curves would not overlap, which would allow us to build models that accurately predict the target with the available features. In this case, most of the curves lie almost on top of each other. The only variables with a clearer separation between defaulters and non-defaulters are the credit scores. We can already anticipate that any potential model will have to rely heavily on these features.



<h3 align="Center">
    <img alt="Logo" title="#logo" width="800px" src="assets/Figure3.1.png">
    <br>
</h3>

### Correlation between features
After plotting the correlation matrix, our previous analysis was confirmed: there are no strong correlations between the features and the target. Based on these correlation values, and accounting for multicollinearity problems, the final set of features used to train the model was selected. (See [Methodologic report](https://github.com/kevinmiguel97/Credit-default-model/blob/main/Credit_default_model.ipynb): Feature selection)

Figure 4 shows the correlation matrix of the selected variables. The feature most correlated with the target is the average score (+0.18), and there is no longer a multicollinearity problem.

<h3 align="Center">
    <img alt="Logo" title="#logo" width="800px" src="assets/Figure4.png">
    <br>
</h3>

## 4. Evaluating if both samples were drawn from the same population



Before doing any further analysis, it is important to verify that the training and testing samples were drawn from populations with the same distribution. If this weren't the case, it would be like trying to predict the behavior of people in the US by studying data on people from the UK, and our analysis would lack validity and consistency.


To perform this evaluation, the two-sample Kolmogorov-Smirnov test (KS2) was used. This test evaluates the null hypothesis that both samples were drawn from the same distribution. Figure 5 shows the KDE of every variable for each sample. Just by scanning the graphs, it is clear that both samples show similar distributions for all the variables involved. The KS2 p-values support this: at a 5% significance level (95% confidence), as long as the p-value is higher than 0.05, we cannot reject the hypothesis that both samples were drawn from the same distribution. In this case, all the p-values obtained are greater than 0.05. Hence, we can move forward with reasonable confidence that results obtained on the training sample will apply to our testing set.

<h3 align="Center">
    <img alt="Logo" title="#logo" width="800px" src="assets/Figure5.png">
    <br>
</h3>
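
The KS2 check described above can be sketched with `scipy.stats.ks_2samp`, run once per variable. This is an illustrative sketch rather than the report's code: the helper name `ks_report` is hypothetical, and the two small frames are deterministic toy data (one column identical across samples, one shifted) chosen only to show both outcomes of the decision rule.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp


def ks_report(train, test, columns, alpha=0.05):
    """Run a two-sample KS test per column and flag rejections at level alpha."""
    rows = []
    for col in columns:
        result = ks_2samp(train[col].to_numpy(), test[col].to_numpy())
        rows.append(
            {
                "variable": col,
                "statistic": result.statistic,
                "p_value": result.pvalue,
                # True means we reject "same distribution" for this variable.
                "reject_same_dist": bool(result.pvalue < alpha),
            }
        )
    return pd.DataFrame(rows).set_index("variable")


# Toy data: monthly_income is identical in both samples, avg_score is shifted.
train = pd.DataFrame(
    {"monthly_income": np.arange(100.0), "avg_score": np.arange(100.0)}
)
test = pd.DataFrame(
    {"monthly_income": np.arange(100.0), "avg_score": np.arange(100.0) + 1000.0}
)
report = ks_report(train, test, ["monthly_income", "avg_score"])
print(report)
```

A p-value above 0.05 does not prove the distributions are equal; it only means the test found no evidence against it.
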



## 5. What can be done with this dataset? Early Warnings for Collections (EWC) tool

Since we are in a student loan context, we would expect the ages of borrowers to lie between 20 and 40 years. However, only 25% of the observations in the age variable are below 41 years, suggesting that it represents the age at some point during the life of the loan rather than at the beginning. A similar concern arises with the rest of the variables. This issue would make any model built with this data meaningless for underwriting strategies.

That doesn't mean that nothing can be done with this information. It will be useful to create a tool that tracks loans and predicts if an already existing loan is likely to default any time soon. The model built will serve as an Early Warning Collection (EWC) tool for the collections department, so that the team can direct the efforts to those loans that will likely default soon and potentially recover a higher amount.

## 6. Additional variables to consider

If we were to build a model to be used before the underwriting process, it would be useful to consider the following variables:

(Potential sources for these variables are shown inside parenthesis)

- Number of loans with Stride (internal data): It is safe to assume that if a client has already performed well in a loan with Stride, the probability of a default in additional credits is lower. In my experience, only 7% of repeat clients defaulted on their subsequent loans. Since the description of the data mentioned that each observation represented a client, it was assumed that no individual had repeated loans, but it would be something worth looking into.

- Terms of the loan (internal data): Knowing the general characteristics of the loans is crucial when explaining why someone defaulted. The amount might have been higher than the client was able to pay, the interest rate could have been too expensive, or the term might have been so long that issues such as deaths or accidents became more likely. Adding variables such as loan_amount, interest_rate, and term would allow controlling for these situations.

- Initial date (internal data): As mentioned before, this dataset is ambiguous about when the information was collected. Adding the start date of the loan would allow us to derive information such as the age of the client at origination, and to control for environmental effects such as a recession in the economy or in the client's working sector.

- Information about the program to attend (require acceptance letter during the application): Following the reasoning above, knowing more about the program that the client intends to attend would allow more precise predictions of information such as grad_income (income after graduation), the sector where the client could work, and the school where the program will be taken.

- GPA (require transcripts of previous academic levels and follow-up after graduation): Tracking the academic performance of the client would help to gauge potential future job opportunities.

- Gym membership (ask during application): A phenomenon that I studied in my previous position was the relationship between payment behavior and subscriptions, especially for people without a credit record. The hypothesis was that subscriptions resemble a loan in that they require a periodic payment. While most subscriptions, such as streaming services and cellphone plans, were not significant, having a gym membership was. The reasoning behind this is behavioral. In simple terms, having a gym membership can be used as a proxy for "level of responsibility": not only does it require periodic payments, but it also implies that (in most cases) the individual attends the gym on a regular basis, which affects many areas such as health (lower likelihood of death or illness) and self-esteem.

- Loan percentage repaid (internal data): We can divide the life of a loan into two main phases. On one hand, the origination phase contains all the processes carried out before disbursing the loan (promotion, credit processing, risk analysis). On the other hand, the collections phase contains all the actions taken after disbursement. With this in mind, there are only two reasons why a loan can default. Either the loan was incorrectly approved from the beginning, in which case we have an origination issue, or the loan was correctly approved but something went wrong afterward, in which case we have a collections issue. A FinTech needs to know where its defaulters are coming from, and the percentage of the loan that was repaid is a great indicator for that. In general, loans that default within the first 20% of the payments imply a higher origination responsibility. If the loan defaults later, for example after paying 80% of the debt, the origination process was most likely adequate and the issue came from the collections side.

## 7. Data Cleaning

### Feature selection implemented
A new dataset was generated containing only the variables retained by the feature selection.

### Filling missing values
Number of dependants is the only column that contains missing values, represented by -1. Instead of simply dropping these observations, missing values are filled with the mode of the number of dependants in the sample. The mode is used instead of the mean to prevent extreme values from skewing the model. In future projects, more sophisticated methods could be used to fill missing values, such as the conditional mode of the age group. For simplicity, the overall mode was used.

### Fixing ages
Date of birth was the only variable with a typo, and it was affecting the age variable. To fix this issue, 100 years are subtracted from observations with an age over 118.
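
The two cleaning steps above (filling the -1 placeholders and correcting ages) could look roughly like the following pandas sketch. The helper name `clean` and the example rows are hypothetical, and computing the mode only over the non-missing values is an assumption, since the text does not say whether the -1 entries were excluded first.

```python
import pandas as pd

MAX_PLAUSIBLE_AGE = 118  # threshold taken from the text


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Fill -1 dependants with the mode and fix ages above the threshold."""
    out = df.copy()

    # Assumption: the mode is computed over valid (non -1) values only.
    dependants = out["number_dependants"]
    mode_value = dependants[dependants != -1].mode().iloc[0]
    out["number_dependants"] = dependants.replace(-1, mode_value)

    # Ages above the threshold come from a birth-year typo: shift by 100 years.
    too_old = out["age"] > MAX_PLAUSIBLE_AGE
    out.loc[too_old, "age"] = out.loc[too_old, "age"] - 100
    return out


# Made-up rows covering both problems.
raw = pd.DataFrame(
    {"number_dependants": [0, 2, -1, 0, 1], "age": [35, 193, 52, 118, 47]}
)
cleaned = clean(raw)
print(cleaned)
```
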

### Standardization
MinMaxScaler will be applied to the data after splitting it into training and testing sets, as transforming the data beforehand could cause data leakage.

Table 3 shows the statistical summary of the new dataset.

<h3 align="Center">
    <img alt="Logo" title="#logo" width="500px" src="assets/Table3.png">
    <br>
</h3>

## 8. Model creation

### Contextualizing concepts

**Groups**

Because the class we are trying to predict is defaulted loans, we have the following groups:

- Positive observations: Defaulted loans (contain the target we want to predict)

- Negative observations: Non-defaulted loans (do not contain the target we want to predict)

**Possible outcomes**

See Table 4 below.

<h3 align="Left">
    <img alt="Logo" title="#logo" width="400px" src="assets/Table4.png">
    <br>
</h3>

**Evaluation metrics**

- Accuracy: How well is the model predicting both defaulters and non-defaulters?

- Precision: What percentage of the default cases predicted were actual defaulters?

- Recall: What percentage of the actual defaulters was the model able to predict?
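
The three metrics above can be written in terms of the four possible outcomes, assuming these are the usual true/false positives and negatives with "positive" meaning a defaulted loan. The function name and the counts below are invented purely for illustration.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        # Share of all loans classified correctly.
        "accuracy": (tp + tn) / total,
        # Share of predicted defaulters that actually defaulted.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Share of actual defaulters that the model flagged.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up counts for 1,000 loans, 100 of which defaulted.
metrics = classification_metrics(tp=30, fp=20, tn=880, fn=70)
print(metrics)
```

With these illustrative counts, accuracy is high (0.91) even though only 30% of the defaulters are caught, which is why accuracy alone can be misleading on an unbalanced target.
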

**Best evaluation metric**

From a business perspective, there is a single trade-off to keep in mind when developing risk models to evaluate loan applications: underwriting amount vs. delinquent portfolio. Models that predict default loans with high accuracy will usually approve fewer loans, leading to lower portfolio levels. On the other hand, models that approve a large number of applications will likely let more potential defaulters into the portfolio, leading to higher delinquent portfolio levels. There is no right or wrong answer as to which kind of model to prefer; it depends on the risk appetite and the stage of the financial institution. For new FinTech startups that are just starting to grow, conservative models are usually preferred, as many Institutional Lenders (IL) give a high weight to delinquent portfolio levels when investing capital in a firm. Riskier models are more suitable for companies that have spent some years in their markets, have a solid portfolio level, and are trying to increase their market share.

Because this won't be a model to evaluate applications before granting the loans, but rather a tool to predict potential defaults in existing loans, the choice of metric is less complex. For this type of tool, the most important goal is to identify as many potential defaulters as possible, since the loans have already been granted, and that is what the **Recall** metric measures. However, we also want to take the rest of the metrics into consideration, as the trade-off in this case involves the effort of the collections department. If we flagged all cases as potential defaulters, recall would be perfect, but the team could not prioritize its efforts and the model would be useless.

### Splitting and standardizing training data into training and testing sets

To test the accuracy of the models we built, we are going to allocate a portion of the training data (which contains the true labels) to train our models, and the rest of the data will be used to evaluate the performance metrics on "unseen" data.

The proportion will be:

- 80% of the observations (13,336) will be used for training purposes.

- 20% (3,335) of the observations for testing purposes.

Because of the imbalanced situation that we face, we are going to keep the same proportion of positive observations (defaults) in both sets to avoid potential skews.
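
A stratified 80/20 split followed by scaling can be sketched with scikit-learn as below. The data is synthetic (1,000 rows with 10% positives) and stands in for the real features; the point is that `stratify=y` preserves the default rate in both sets and that `MinMaxScaler` is fitted on the training portion only, so no information from the test set leaks into the transformation.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the cleaned features and the target (10% defaults).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.array([1] * 100 + [0] * 900)

# stratify=y keeps the same share of defaults in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Fit the scaler on the training data only, then reuse it on the test data.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(y_train.mean(), y_test.mean())
```

Because the scaler only sees training data, scaled test values can fall slightly outside the [0, 1] range; that is expected and is the price of avoiding leakage.
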

### Null model

This is the simplest model we can build: we simply predict the most common case. Recall that we had an unbalanced target variable in which around 90% of the observations were non-defaulters and 10% were defaulters. It serves as a simple example to understand how to evaluate a model.

- In this case, the model has an accuracy of 89.91%, because it is only accurately predicting non-defaulters.

- Recall and precision are 0% since it does not predict any positives.

- The confusion matrix values show the prediction percentage of a certain case. You can interpret each of the rows as follows:

    - 100% of non-defaulters were classified as non-defaulters, and 0% of non-defaulters were classified as defaulters.

    - 100% of defaulters were classified as non-defaulters, and 0% of defaulters were classified as defaulters.

- You would hope that the diagonal contains the highest values of every row.

<h3 align="Center">
    <img alt="Logo" title="#logo" width="300px" src="assets/Figure8.png">
    <br>
</h3>
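
The behavior of a null model can be reproduced with scikit-learn's `DummyClassifier`. This is a sketch on a made-up target with exactly 90% non-defaulters, so the accuracy comes out at 90% rather than the 89.91% reported above; whether the report built its null model this way is an assumption.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Made-up target: 90 non-defaulters (0) and 10 defaulters (1).
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # features are ignored by the null model

null_model = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = null_model.predict(X)

accuracy = accuracy_score(y, y_pred)
# zero_division=0 avoids a warning when no positives are predicted.
precision = precision_score(y, y_pred, zero_division=0)
recall = recall_score(y, y_pred, zero_division=0)
# normalize="true" expresses each row as a share of the actual class.
row_normalized = confusion_matrix(y, y_pred, normalize="true")

print(accuracy, precision, recall)
print(row_normalized)
```
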

### PyCaret setup

To create our Machine Learning (ML) models, we are going to use the [PyCaret](https://pycaret.readthedocs.io/en/latest/index.html) library.  PyCaret is an open-source, low-code machine learning library in Python that automates machine learning workflows. (See Methodologic report: PyCaret SetUp)

### Selecting best models
The compare_models function trains and evaluates all the estimators available in the model library using 10-fold cross-validation and selects the best n models based on the average metrics across folds. We sort the results by Recall and ask for the best two models.

In this case, the best-performing models are the Decision Tree Classifier and the K-Nearest Neighbors Classifier (KNN). Because the difference between Decision Tree and KNN is quite large, we select the Decision Tree Classifier as our working model. As expected, the models have trouble classifying default loans: our highest recall score is only 15.79%.

<h3 align="Center">
    <img alt="Logo" title="#logo" width="400px" src="assets/Imagen12.png">
    <br>
</h3>
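
To make the selection procedure concrete, the idea of ranking candidate models by mean cross-validated recall can be sketched with plain scikit-learn. This is not PyCaret's implementation and it covers only three hand-picked estimators on synthetic, unbalanced data, so the resulting ranking says nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with roughly 10% positives, standing in for the loan data.
X, y = make_classification(
    n_samples=1000, n_features=6, weights=[0.9, 0.1], random_state=0
)

# Hypothetical shortlist; PyCaret's model library is much larger.
candidates = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "knn": KNeighborsClassifier(),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# 10 stratified folds, scored by recall on the positive (default) class.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
mean_recall = {
    name: cross_val_score(model, X, y, cv=cv, scoring="recall").mean()
    for name, model in candidates.items()
}

# Keep the two models with the highest mean recall.
top_two = sorted(mean_recall, key=mean_recall.get, reverse=True)[:2]
print(mean_recall)
print(top_two)
```
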

#### Understanding best model
The model that we selected is a Decision Tree Classifier with the following parameters:

    DecisionTreeClassifier(
        ccp_alpha=0.0,
        class_weight=None,
        criterion='gini',
        max_depth=None,
        max_features=None,
        max_leaf_nodes=None,
        min_impurity_decrease=0.0,
        min_impurity_split=None,
        min_samples_leaf=1,
        min_samples_split=2,
        min_weight_fraction_leaf=0.0,
        presort='deprecated',
        random_state=0,
        splitter='best'
    )

### Creating model
The create_model function trains and evaluates the performance of a given estimator using cross-validation. In this case we are training the Decision Tree Classifier from above. Table 7 shows the metrics obtained for each of the 10 folds of the training process.

<h3 align="Center">
    <img alt="Logo" title="#logo" width="400px" src="assets/Table7.png">
    <br>
</h3>

### Understanding the model

As expected, average score represents the variable that is used the most by the model. 

<h3 align="Center">
    <img alt="Logo" title="#logo" width="900px" src="tree.png">
    <br>
</h3>
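
One way to check which variables a fitted tree relies on is its `feature_importances_` attribute. The sketch below trains a tree on synthetic data and attaches a hypothetical subset of this report's feature names to the columns, so the ranking it prints is illustrative only and does not reproduce the result above.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Hypothetical subset of feature names; the columns themselves are synthetic.
feature_names = ["avg_score", "score_change", "monthly_income", "debt_to_income_ratio"]
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=2, n_redundant=0, random_state=0
)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Importances are normalized to sum to 1; higher means more impurity reduction.
importances = sorted(
    zip(feature_names, tree.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, value in importances:
    print(f"{name}: {value:.3f}")
```
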