# Ever Delinquent Prediction Write Up

##  Problem Statement and Summary of Key Results

The goal of this project is to predict a given loan's ever-delinquent probability based on some loan characteristics. In the primary mortgage insurance (PMI) industry, knowing the likelihood of delinquency upfront will help insurance companies properly price the premium and determine the reserve amount to cover potential loss caused by claims.

The key takeaways are as follows:

1) This project is a typical classification problem. Among the machine learning algorithms tested (Logistic Regression, SVM, Kernel SVM, Random Forest and Decision Tree), logistic regression performed best, with an AUROC of 0.685 and an accuracy of 69%. This is a 4% accuracy improvement over the baseline model, which uses only the borrower's credit score (FICO) to predict ever-delinquent probability. In other words, adding features increased the model's predictive power for this problem.

2) Loan characteristics such as FICO, Debt to Income Ratio (DTI), Loan to Value Ratio (LTV), Number of Borrowers, Property Type, Housing Occupancy Status, First-time Home Buyer (Y/N) and Loan Purpose are selected features in my predictive model. 

3) Since loan characteristics are provided by lenders upfront, before the insurance underwriting process, this model helps the business understand how likely a loan is to become delinquent and therefore offer lenders pricing that the company is comfortable with. (Note: the pricing scorecard is based on expected claim rate. The higher the expected claim rate, the higher the insurance premium.)

4) The output of the predictive model could also be used to calculate the reserve dollar amount needed on a given loan to cover potential claim losses. The formula below is a simplified calculation; in practice, we would also consider other factors such as reinsurance deals, REO, and claim document completeness.

$$\text{Reserve Amount (\$)} = \text{Original UPB} \times \text{MI Coverage} \times \text{Ever-Delinquent Probability} \times \text{Claim Probability on Ever Delinquent}$$
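As a worked illustration of the simplified reserve formula, here is a minimal sketch. The function name `reserve_amount` and all input values are hypothetical, chosen only to make the arithmetic easy to follow; they are not taken from the project data.

```python
def reserve_amount(original_upb, mi_coverage, p_ever_delinquent, p_claim_given_delinquent):
    """Simplified reserve: UPB x MI coverage x P(ever delinquent) x P(claim | ever delinquent)."""
    return original_upb * mi_coverage * p_ever_delinquent * p_claim_given_delinquent


# Hypothetical loan: $200,000 original UPB, 25% MI coverage,
# 10% ever-delinquent probability, 20% claim probability once delinquent.
reserve = reserve_amount(200_000, 0.25, 0.10, 0.20)
print(f"Reserve amount: ${reserve:,.2f}")  # prints "Reserve amount: $1,000.00"
```

In this sketch the ever-delinquent probability is the only input that would come from the predictive model; the other three would be supplied from loan and claims data.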

## Exploratory Data Analysis

The population of this project meets the following criteria:
- Loans closed after 2010, to avoid data noise from the financial crisis (2007-2008)
- Loans closed before 2013, so that at least four years of loan performance data are available

There are 481,218 rows and 19 columns in my original data pull, but I kept only **10** columns/features for modeling. The columns I dropped met at least one of the following criteria:

- ID columns
- Values under a particular column are nearly identical (>99% of population share the same value)
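
The two drop criteria above can be sketched with pandas as follows. This is an illustration rather than the code used in the project: the helper `drop_uninformative_columns`, the toy data, and the column names are all hypothetical.

```python
import pandas as pd


def drop_uninformative_columns(df, id_columns, max_share=0.99):
    """Drop ID columns and columns where a single value covers more than max_share of rows."""
    kept = df.drop(columns=id_columns)
    near_constant = [
        col for col in kept.columns
        # value_counts sorts descending, so the first entry is the most common value's share
        if kept[col].value_counts(normalize=True, dropna=False).iloc[0] > max_share
    ]
    return kept.drop(columns=near_constant), near_constant


# Toy data with hypothetical column names
toy = pd.DataFrame({
    "loan_id": range(1000),
    "fico": [620 + (i % 200) for i in range(1000)],
    "product_type": ["FRM"] * 995 + ["ARM"] * 5,          # 99.5% share one value
    "occupancy": ["Primary"] * 965 + ["Secondary"] * 35,  # 96.5% share one value
})
cleaned, dropped = drop_uninformative_columns(toy, id_columns=["loan_id"])
print(dropped)                # ['product_type']
print(list(cleaned.columns))  # ['fico', 'occupancy']
```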

*The distributions of the **7 categorical features** are shown below.*


| Orig Yr | % to All | Claim % |
|---------|----------|---------|
| 2010 | 10.6% | 0.6% |
| 2011 | 15.0% | 0.3% |
| 2012 | 34.5% | 0.1% |
| 2013 | 40.0% | 0.1% |

| Loan Purpose | % to All |
|--------------|----------|
| Refi Cash Out | 2.6% |
| Refi Payoff Lien | 32.4% |
| Purchase | 65.1% |

| Property Type | % to All |
|---------------|----------|
| Co-op or Condo | 9.8% |
| Manufactured Housing | 0.3% |
| Single Family | 89.9% |

| Occupancy Status     | % to All  
|--------------|-----------
| Primary Resident     |   96.5%   
| Secondary Resident    |  3.5% 

| First Time Home Buyer     | % to All  
|--------------|-----------
| Y     |   31.7%   
| N    |  68.3% 

| Multi Borrower    | % to All  
|--------------|-----------
| Y     |   49.2%   
| N    |  50.8% 

| MI Channel    | % to All  
|--------------|-----------
| Delegated     |   66.7%   
| Non-Delegated    |  33.3% 

**3** features (FICO, DTI and LTV) were originally continuous. To simplify dummy variable creation later on, I grouped their values into buckets, so these 3 features are now categorical as well.

![my_image](files/1.png)
![my_image](files/3.png)
![my_image](files/2.png)
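
The bucketing and dummy variable creation described above might look roughly like the sketch below. The bucket edges, labels, and column names are illustrative assumptions, not the cut points actually used in this project.

```python
import pandas as pd

# Toy loans with hypothetical column names
loans = pd.DataFrame({"fico": [640, 702, 745, 790], "ltv": [95.0, 85.0, 90.0, 97.0]})

# Bucket edges are illustrative assumptions; intervals are closed on the right
loans["fico_bucket"] = pd.cut(
    loans["fico"],
    bins=[0, 679, 719, 759, 850],
    labels=["<680", "680-719", "720-759", "760+"],
)
loans["ltv_bucket"] = pd.cut(
    loans["ltv"],
    bins=[0, 85, 90, 95, 100],
    labels=["<=85", "85.01-90", "90.01-95", "95.01-100"],
)

# One-hot encode the buckets; drop_first removes one level per feature
# so the dummy columns are not perfectly collinear
dummies = pd.get_dummies(loans[["fico_bucket", "ltv_bucket"]], drop_first=True)
print(dummies.columns.tolist())
```

Dropping the first level of each feature is a design choice that matters for logistic regression, where a full set of dummies plus an intercept is redundant.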

Because the data is highly imbalanced with respect to the delinquency target (2.5% ever delinquent vs. 97.5% never delinquent), I downsampled the "never delinquent" population. At the end of the data exploration step, I kept 10,000 loans for modeling: **5,000 never delinquent and 5,000 ever delinquent**.
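
A balanced sample like the one described above can be drawn along these lines. This is a minimal sketch on toy data; the helper `sample_balanced` and the column name `ever_delinquent` are hypothetical.

```python
import pandas as pd


def sample_balanced(df, target, n_per_class, seed=0):
    """Randomly draw n_per_class rows from each class of the target column."""
    parts = [
        group.sample(n=n_per_class, random_state=seed)
        for _, group in df.groupby(target)
    ]
    # Shuffle so the two classes are not stored in contiguous blocks
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)


# Toy data: 2.5% ever delinquent, mirroring the imbalance in the project data
toy = pd.DataFrame({"ever_delinquent": [1] * 25 + [0] * 975, "fico": range(1000)})
balanced = sample_balanced(toy, "ever_delinquent", n_per_class=25)
print(balanced["ever_delinquent"].value_counts())
```

One caveat worth keeping in mind: probabilities from a model trained on a 50/50 sample reflect that mix rather than the 2.5% base rate, so they would likely need recalibration before being used as actual delinquency probabilities.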

## Feature Selection, Model Selection and Regularization

Logistic Regression outperformed the SVM, Decision Tree and Random Forest algorithms in predicting ever-delinquent probabilities, with a **68.6% accuracy score**. The Logistic Regression results are shown below.

In [None]:
from sklearn.linear_model import LogisticRegression
# Fit logistic regression on the training set
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)
# Predict class labels (0/1) for the test set
y_pred = classifier.predict(X_test)
y_pred

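In [None]:
import pandas as pd
from sklearn.metrics import roc_curve, roc_auc_score
from ggplot import *
# Use the predicted probability of the positive class as the score;
# hard 0/1 labels would collapse the ROC curve to a single operating point
y_score = classifier.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_score)
print("AUROC:", roc_auc_score(y_test, y_score))
# Plot the ROC curve against the diagonal (random classifier) reference line
df_eval = pd.DataFrame(dict(fpr=fpr, tpr=tpr))
ggplot(df_eval, aes(x='fpr', y='tpr')) +\
      geom_line() +\
      geom_abline(linetype='dashed')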

![my_image](files/download.png)

In [None]:
# Applying K-Fold Cross Validation (10 folds)
from sklearn.model_selection import cross_val_score
accuracies1 = cross_val_score(estimator=classifier, X=X_train, y=y_train, cv=10)
# Mean and standard deviation of accuracy across the folds
print("accuracy mean:", accuracies1.mean())
print("accuracy std:", accuracies1.std())

accuracy mean: 0.68612388710894023
accuracy std: 0.013107884669360552
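
For reference, the comparison across algorithms mentioned at the start of this section could be run with the same 10-fold cross-validation, roughly as sketched below. The data here is a synthetic stand-in generated with `make_classification`, and the hyperparameters are assumptions (mostly scikit-learn defaults), so the printed scores say nothing about the project results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loan feature matrix (not the project data)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

candidates = {
    "Logistic Regression": LogisticRegression(random_state=0, max_iter=1000),
    "SVM (linear)": SVC(kernel="linear", random_state=0),
    "Kernel SVM (RBF)": SVC(kernel="rbf", random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Score every candidate with the same folds and metric so the numbers are comparable
results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=10, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```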

I also applied Lasso (L1) and Ridge (L2) regularization when fitting the model on the training set. The resulting accuracy is very close to the accuracy before regularization.

In [None]:
# Fitting Logistic Regression with L2 (Ridge) regularization to the training set
from sklearn.linear_model import LogisticRegression
# C is the inverse of regularization strength, so larger values mean weaker regularization
classifier = LogisticRegression(penalty='l2', C=100, random_state=0)
classifier.fit(X_train, y_train)
# Applying K-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies1 = cross_val_score(estimator=classifier, X=X_train, y=y_train, cv=10)
accuracies1.mean()

accuracy: 0.68537408215296181
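
To compare L1 and L2 penalties side by side, a small grid over the penalty type and `C` can be cross-validated as in the sketch below. It runs on synthetic data, and the `C` values and the `liblinear` solver are assumptions rather than the settings used in this project. In scikit-learn, `C` is the inverse of regularization strength, and `LogisticRegression` already applies an L2 penalty with `C=1.0` by default, so `C=100` is weaker regularization than the default fit.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the loan feature matrix (not the project data)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

reg_results = {}
for penalty in ("l1", "l2"):
    for C in (0.01, 1, 100):
        # The liblinear solver supports both L1 and L2 penalties
        model = LogisticRegression(penalty=penalty, C=C, solver="liblinear", random_state=0)
        scores = cross_val_score(model, X_demo, y_demo, cv=10)
        reg_results[(penalty, C)] = scores.mean()
        print(f"penalty={penalty}, C={C}: accuracy={scores.mean():.3f}")
```

If accuracy barely moves across this grid, that would be consistent with the observation above that regularization makes little difference for this feature set.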

##### I will add the bell curve charts later, after we discuss the message. I do not yet fully understand what I am seeing.

## Next Steps/Future Work

- Create utility code for future use: model fitting, model evaluation and regularization pieces
- This model rests on the fairly simplified assumption that ever-delinquent status relates to claim rate. In the real world, a loan's claim status also depends on how long the loan stays delinquent (60 days, 120 days, 360 days, etc.) and how often it becomes delinquent. Transactional data would help answer this question.
- Some other features might be relevant in predicting ever-delinquent probability: property state, property value, borrower years of employment, self-employed (Y/N), etc. Some of this data is not available to me today, but it is worth recording these ideas.