## Credit Card Default Prediction 

This notebook is a case study examining methods for forecasting the financial impact of credit card defaults. The goal is to model a simple scenario in which the dataset is unbalanced, the available attributes are insufficient to attain high classification accuracy, and the variables of interest have large variance. In such a case, how can we most accurately project the financial outcomes for this dataset?

### Methodology

The best-performing method from those explored in the `CCDefaults_Classification` notebook is used to generate default predictions. Those predictions are then used here to compare two methods for projecting a simplified model of cash flow for the following month.

In the first method, each customer is assigned to a quantile and the median balance for that quantile is used to estimate financial impacts.
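As a rough illustration of the quantile idea, the sketch below bins customers by average bill and substitutes each bin's median for the individual amounts. The toy data, the column names (`avg_bill`, `pred_default`), the choice of four bins, and the choice to bin on the bill are all assumptions made for this example; they are not taken from the notebook's actual preprocessing.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative, not the notebook's.
toy = pd.DataFrame({
    "avg_bill": [100.0, 250.0, 400.0, 900.0, 1500.0, 3000.0, 5000.0, 8000.0],
    "pred_default": [0, 0, 1, 0, 1, 0, 0, 1],
})

# Assign each customer to one of four quantile bins of the average bill.
toy["bin"] = pd.qcut(toy["avg_bill"], q=4, labels=False)

# The median bill within each bin stands in for every customer in that bin.
toy["bin_median"] = toy.groupby("bin")["avg_bill"].transform("median")

# Projected losses: sum of bin medians over customers predicted to default.
projected_losses = toy.loc[toy["pred_default"] == 1, "bin_median"].sum()
print(projected_losses)
```

Using the bin median rather than the individual value trades per-customer detail for robustness to the large variance in balances.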

The second method uses a hurdle-type approach in which a regression is used to estimate the amount defaulted for each customer.
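A hurdle-type model can be sketched in two stages: one model for whether a default occurs, and a second model for the amount given that it does. The example below is only a minimal sketch on synthetic data. Using `LogisticRegression` for the first stage is an assumption for illustration, and although `GammaRegressor` is imported in this notebook, how the notebook actually applies it is not shown here.

```python
import numpy as np
from sklearn.linear_model import GammaRegressor, LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in data (hypothetical): two features, a minority of defaulters.
n = 400
X_demo = rng.normal(size=(n, 2))
p_true = 1.0 / (1.0 + np.exp(-(X_demo[:, 0] - 1.2)))
defaulted = (rng.random(n) < p_true).astype(int)
amount = np.where(defaulted == 1, rng.gamma(shape=2.0, scale=500.0, size=n), 0.0)

# Stage 1 (the hurdle): probability that a customer defaults at all.
clf = LogisticRegression().fit(X_demo, defaulted)
p_default = clf.predict_proba(X_demo)[:, 1]

# Stage 2: amount model fit only on customers who actually defaulted,
# so every target is strictly positive, as the Gamma family requires.
mask = defaulted == 1
reg = GammaRegressor().fit(X_demo[mask], amount[mask])
amount_if_default = reg.predict(X_demo)

# Expected loss per customer = P(default) * E[amount | default].
expected_loss = p_default * amount_if_default
total_expected_loss = expected_loss.sum()
```

Fitting the second stage on defaulters only keeps the many zero amounts from dragging the regression toward zero.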

The "truth" that both methods are compared against is determined by

$$Receivables = \sum_{i} \mathbb{I}(D_i = 0) \cdot Balance_{oct,i} \cdot (1 + Interest)$$

$$Revenue = \sum_{i} \mathbb{I}(D_i = 0) \cdot Payment_i$$

$$Losses = \sum_{i} \mathbb{I}(D_i = 1) \cdot Bill_i$$

where $\mathbb{I}(\cdot)$ is the indicator function and $D_i$ is 1 if customer $i$ defaulted and 0 otherwise.

Payment, Bill, and Balance are estimated for each customer by averaging the respective variable from the previous three months. No interest information is present in the dataset, so a value of 15% is used for all customers.
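The three sums can be computed directly with boolean masks. The sketch below uses a hypothetical four-customer frame whose columns (`default`, `balance`, `payment`, `bill`) are assumed to already hold the three-month averages; the names are illustrative and do not match the dataset's actual columns.

```python
import pandas as pd

INTEREST = 0.15  # flat rate applied to all customers, as stated in the text

# Hypothetical toy frame; column names are illustrative only.
toy = pd.DataFrame({
    "default": [0, 1, 0, 1],
    "balance": [1000.0, 2000.0, 500.0, 300.0],
    "payment": [100.0, 0.0, 50.0, 10.0],
    "bill": [1100.0, 2100.0, 550.0, 320.0],
})

# Boolean mask playing the role of the indicator I(D_i = 0).
paid = toy["default"] == 0

receivables = (toy.loc[paid, "balance"] * (1 + INTEREST)).sum()
revenue = toy.loc[paid, "payment"].sum()
losses = toy.loc[~paid, "bill"].sum()
```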


### Data
The dataset used in this analysis is the [Default of Credit Card Clients](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients) dataset provided by Yeh, I. C., & Lien, C. H. (2009) via the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php). 

It contains demographic information, a 6-month payment history, and an indication of default in the following month for each record. 

In [2]:
import pandas as pd
import numpy as np
import statsmodels.api as sm

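from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import GammaRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler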

In [None]:
ccdefaults = pd.read_parquet("CCDefaults.parquet")

In [None]:
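X = ccdefaults.drop(columns = ["default_oct", "avg_bill_3", "percentile_bin"])
# Target assumed to be the October default indicator dropped from X above
y = ccdefaults["default_oct"]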

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.50)

In [None]:
y_train.value_counts()

In [None]:
y_test.value_counts()

In [None]:
scaler = StandardScaler()

scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

Just when you thought it was over, it wasn't! In a world where decision boundaries are hazy ...

[Credit Card Defaults 3: Tomek Link's Awakening - In theaters Christmas 2021](CCDefaults_imblearn)