# **Principle of Data Learning**

# 1 Deep dive
pg. 17-24

### Aim of this book:
- Understand the role of **liquidity, equity** and many other key banking features;
- Engineer and select features;
- Predict defaults, payoffs, loss rates and exposures;
- Predict downturn and crisis outcomes using pre-crisis features;
- Understand the implications of COVID-19;
- Apply innovative **sampling techniques** for model training and validation;
- Deep-learn from Logit Classifiers to **Random Forests** and Neural Networks;
- Do unsupervised Clustering, Principal Components and **Bayesian Techniques**;
- Build **multi-period** models for CECL, IFRS 9 and CCAR;
- Build credit portfolio correlation models for value-at-risk and expected shortfall; and
- Run over 1,500 lines of Python code (`pandas`, `statsmodels`, `scikit-learn`, ...);
- Access real credit data and much more . . .

### Target reader
**Credit analysts** in financial institutions, fin-techs and prudential regulators

### Credit risk information

+ Internal data:
    - origination / underwriting data: from **LOS**;
    - performance: monthly / quarterly / annual reviews, covering the period from origination to the latest review, from **LMS**;
    - modification: e.g. *restructure???*;
    - payoff / retention;
    - maturity: relates to events at maturity time, such as *release of collateral* or various accounting activities;
    - default / workout: data from default to resolution or collection, can cover up to 10 years!
+ External data:
    - macro: time-varying information that is identical for all borrowers in a given period, possibly **stratified** by country, state, statistical area, etc.;
    - population stats;
    - other sources, such as business filings, data from social networks, ratings agencies, property appraisers, activity profiles of payment systems or transport systems.

--> Panel data:
+ features (*also known as covariates, risk factors, explanatory variables, independent variables and right-hand side variables*);
+ risk-outcomes (*also known as responses, outputs, dependent variables and left-hand side variables*): default, payoff, loss rates, exposures;
+ for each loan (`i`) and time (`t`).
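
As a minimal sketch of what "one row per loan `i` and time `t`" looks like in pandas, the toy frame below uses made-up values; only the column names `id`, `time`, `LTV_time` and `default_time` follow the dataset described in these notes.

```python
import pandas as pd

# Hypothetical toy panel: two loans observed over a few periods.
panel = pd.DataFrame(
    {
        "id": [1, 1, 1, 2, 2],
        "time": [25, 26, 27, 26, 27],
        "LTV_time": [80.0, 82.5, 90.1, 65.0, 66.2],  # feature
        "default_time": [0, 0, 1, 0, 0],             # risk outcome
    }
)

# Index by loan i and time t so that each row is one (i, t) observation.
panel = panel.set_index(["id", "time"]).sort_index()

# History of loan 1 across all of its observation periods.
loan_1 = panel.loc[1]

# Cross-section of all loans observed in period 27.
period_27 = panel.xs(27, level="time")

print(loan_1)
print(period_27)
```

Indexing on both identifiers makes it easy to switch between a loan's history and a period's cross-section, which are the two views a panel model relies on.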

### Things to consider
- Any information used in a model must be **measurable** through a **sensor**, since the model is used to **predict risk outcomes** for new borrowers and loans;
- Identifiers:
  + *borrowers* or *loans*;
  + time: application time, origination time, observation time, payoff time, default time and maturity time;
  + think about the scenario that the borrower may be part of a larger holding structure, family, or benefit from guarantees that have credit-risk relevant relationships.
- Relationships between features and outcomes can be reciprocal or one-way (???).

### The dataset
+ Panel form: 5k residential US mortgage borrowers over 60 periods (quarters);
+ Central vars: `id`, and `time`;
+ Starts at the beginning of the millennium, includes the Global Financial Crisis (GFC) in period 27 approximately;
+ Origination times prior to the start of the observation period have negative numbers;
+ Default, payoff and status events are observed one period after features in the same row;
+ LGD and related recoveries are observed between the default and resolution time;
+ The loans are not observed immediately after origination;
+ Order of information:
    - Borrower IDs;
    - Time stamps;
    - Information features at observation time;
    - Information features at loan origination;
    - Outcome observations.
+ Key variables:
    - `id`: borrower id;
    - `time`: time stamp of observation;
    - `orig_time`: time stamp for origination;
    - `first_time`: time stamp for first observation;
    - `mat_time`: time stamp for maturity;
    - `res_time`: time stamp for resolution;
    - `balance_time`: outstanding balance at observation time;
    - `LTV_time`: loan to value ratio at observation time, in %;
    - `interest_rate_time`: interest rate at observation time, in %;
    - `rate_time`: risk-free rate at observation time, in %;
    - `hpi_time`: house price index at observation time, base year=100;
    - `gdp_time`: GDP growth at observation time, in %;
    - `uer_time`: unemployment rate at observation time, in %;
    - `REtype_CO_orig_time`: real estate type — condominium: 1, otherwise: 0;
    - `REtype_PU_orig_time`: real estate type — planned urban developments: 1, otherwise: 0;
    - `REtype_SF_orig_time`: real estate type — single family home: 1, otherwise: 0;
    - `investor_orig_time`: investor borrower: 1, otherwise: 0;
    - `balance_orig_time`: outstanding balance at origination time;
    - `FICO_orig_time`: FICO score at origination time;
    - `LTV_orig_time`: loan to value ratio at origination time, in %;
    - `Interest_Rate_orig_time`: interest rate at origination time, in %;
    - `state_orig_time`: US state in which the property is located;
    - `hpi_orig_time`: house price index at origination time, base year=100;
    - `default_time`: default outcome at observation time;
    - `payoff_time`: payoff outcome at observation time;
    - `status_time`: default (1), payoff (2) and non-default/non-payoff (0) outcome at observation time;
    - `lgd_time`: LGD outcome, at default time, assuming no discounting of cash flows;
    - `recovery_res`: sum of all outcome cash flows received during resolution period.
+ `lgd_time`, `recovery_res` and `res_time` are only observed for `default_time` = 1 and if the resolution process is complete;
+ `LTV_time` = `balance_time` / house price at time;
+ house price at time = house price at origination * ( `hpi_time` / `hpi_orig_time` );
+ house price at origination = `balance_orig_time` / `LTV_orig_time`.
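
The three relations above can be chained into a single calculation. The sketch below is an illustration with made-up numbers, not code from the book; the function name `ltv_at_time` and its arguments are hypothetical. Because `LTV_orig_time` is in percent, the intermediate "house price" values are the true prices divided by 100, and that factor cancels in the final ratio, so the result is again in percent.

```python
def ltv_at_time(balance_time, balance_orig, ltv_orig, hpi_time, hpi_orig):
    """Current LTV (in %) from the three relations in the notes.

    ltv_orig is in percent, so the intermediate "price" values are the true
    house prices divided by 100; the factor cancels in the final ratio.
    """
    price_orig = balance_orig / ltv_orig             # house price at origination
    price_time = price_orig * (hpi_time / hpi_orig)  # house price at time
    return balance_time / price_time                 # LTV_time


# Hypothetical loan: 160k at 80% LTV, house price index down 20% since origination.
ltv = ltv_at_time(
    balance_time=150_000,
    balance_orig=160_000,
    ltv_orig=80.0,
    hpi_time=160.0,
    hpi_orig=200.0,
)
print(round(ltv, 2))  # 93.75
```

In this example the balance was paid down slightly, but the fall in the house price index still pushes the LTV from 80% to 93.75%.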

### Basel, CECL, IFRS 9, DFAST, CCAR and Stress Tests

Critical standards:
- Basel: **minimum** amount of required Tier I and Tier II capital. Basel may include the various reforms (Basel I to Basel III), and a number of nationally issued guidance notes;
- Current Expected Credit Loss (CECL), IFRS 9: loan loss **provisioning** and **eligible amount** of available Tier I capital;
- National stress tests (e.g., Dodd-Frank Act Stress Test (DFAST) or Federal Reserve Bank (FRB) stress tests in the US): requirement of **additional capital buffers**;
- Comprehensive Capital Analysis and Review (CCAR): requirement of **additional capital buffers**.

Regulations might differ:
- Basel requires through-the-cycle PDs, downturn EADs and downturn LGDs;
- CCAR, CECL and IFRS 9 require lifetime PDs and EADs/LGDs that are based on current economic circumstances and are forward-looking, i.e., take future expectations into account;
- DFAST and FRB stress tests require stressed PDs, LGDs and EADs.

### Lessons from the COVID-19 Crisis

- The impact seems to be more critical than that of the GFC;
- Actions by banks and governments might temporarily reduce risk in the short term, but might increase it in the long term;
- Variable: `cep_time`;
- Equity next to liquidity is a central aspect, variable `equity_time`;
- Time effects;
- Challenges:
  + *Calculating crisis PDs without downturn data* -> Model-based measurement of crisis PDs, parameter-based stress-testing (margin of conservatism, Bayesian approach); or scenario-based stress-testing, parameter-based stress-testing (regime-switching models);
  + *Liquidity as a driver of default* -> Estimation of models with liquidity as feature; Inclusion of additional liquidity feature (e.g., income over non-discretionary expenses);
  + *Impact of time effects* -> TVA analysis: control for vintage and age effects through dummy variables or other features that describe the origination process, and for time effects through macroeconomic features;
  + *Low default portfolios* -> Most prudent estimators/Margin of conservatism;
  + *Validation of pre-crisis models* -> Backtesting: split training and validation sample along time dimension;
  + *Ability of machine learning models to predict defaults for severe downturns* -> Backtesting of machine learning approaches;
  + *Adequacy of model estimates for Basel requirements* -> Comparison of Basel capital with expected loss.
- Main elements of the approach in the book:
  + model calibrations should include the latest credit data;
  + features that are identified as drivers of credit risk outcomes should be included in the models;
  + validation should focus on backtesting and include a train-test split along the time line;
  + the adequacy of model estimates for applications like capital adequacy, loan loss provisioning and loan pricing needs to be vetted.
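
A train-test split along the time line can be sketched as follows. This is only an illustration on a made-up toy panel; the helper `split_by_time` and the cutoff period are assumptions, not the book's code.

```python
import pandas as pd

# Hypothetical toy panel; in the dataset above, `time` runs over 60 quarters.
panel = pd.DataFrame(
    {
        "id": [1, 1, 2, 2, 3, 3],
        "time": [25, 26, 26, 27, 27, 28],
        "default_time": [0, 0, 0, 1, 0, 1],
    }
)


def split_by_time(df, cutoff, time_col="time"):
    """Train on periods up to and including `cutoff`, test on later periods."""
    train = df[df[time_col] <= cutoff]
    test = df[df[time_col] > cutoff]
    return train, test


# Cutoff 26 is an arbitrary choice for illustration.
train, test = split_by_time(panel, cutoff=26)
print(len(train), len(test))
```

Unlike a random split, this keeps every test observation strictly later than the training data, which is what backtesting a pre-crisis model against crisis outcomes requires.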

### Machine Learning

- lower variable costs;
- higher degree of automation;
- credit risk generally materializes with a time lag;
- advanced models can adjust quicker to the new risk levels than traditional models.

# 2 Python Literacy
pg. 25-46

### What Python is used for in this book:

- Describing data;
- Plotting data;
- Generating new variables;
- Transforming variables;
- Subsetting data;
- Combining data;
- Regression models.

### Packages

Data processing:
+ `pandas`: Processing data structures: series (1D) and dataframes (2D); see The pandas development team (2020). What pandas offers:
  - indexing based on labels (`.loc`) as well as positions (`.iloc`);
  - data sub-setting;
  - dataset splitting, merging and joining;
  - time-series functionality;
+ `numpy`: Processing of n-dimensional array objects, see Harris et al. (2020);
+ `scipy`: Scientific computing, used here mainly for its statistics submodule (`scipy.stats`), see Virtanen et al. (2020);
+ `matplotlib`: Plotting library, see Hunter (2007);
+ `math`: Mathematical functions, see VanRossum and Drake (2009);
+ `random`: Random number generator, see VanRossum and Drake (2009);
+ `tabulate`: Printing tabular data;
+ `joblib`: Running functions as pipeline jobs, see Joblib Development Team (2020);
+ `pickle`: Converting an object to a file for saving, see VanRossum and Drake (2009);
+ `scikit-learn`: Machine learning techniques, see Pedregosa et al. (2011);
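
To make the difference between label-based and position-based indexing concrete, here is a small sketch on a made-up frame; the loan ids used as row labels are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {"LTV_time": [80.0, 95.0, 60.0], "default_time": [0, 1, 0]},
    index=[101, 102, 103],  # hypothetical loan ids used as row labels
)

by_label = df.loc[102, "LTV_time"]      # row label 102, column label "LTV_time"
by_position = df.iloc[1, 0]             # second row, first column
high_ltv = df.loc[df["LTV_time"] > 75]  # sub-setting with a boolean mask

print(by_label, by_position)
print(high_ltv)
```

Both lookups return the same cell here; `.loc` stays correct if rows are reordered, while `.iloc` always refers to whatever sits at that position.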

Building models:
+ `statsmodels`: Fitting statistical models. Interacts with pandas data frames to fit statistical models, see Seabold and Perktold (2010);
+ `IPython`: Interactive computing, see Pérez and Granger (2007);
+ `pydot` and `graphviz`: Plotting of decision trees;
+ `pymc3`: Probabilistic programming, see Salvatier et al. (2016);
+ `lifelines`: Survival analysis, see Davidson-Pilon (2019);
+ `lightgbm`: Gradient boosting, see Ke et al. (2017).

### First look

```python
import warnings
warnings.simplefilter('ignore')  # suppress warnings in the notebook output

import matplotlib
import matplotlib.pyplot as plt
from dcr import *  # imports a dataframe named "data"

# Configure the plot output (%matplotlib is an IPython/Jupyter magic)
%matplotlib inline
plt.rcParams['figure.dpi'] = 300
plt.rcParams['figure.figsize'] = (16, 9)
plt.rcParams.update({'font.size': 16})
```

# 3 Risk-based Learning
pg. 47-67

# 4 Machine Learning
pg. 68-90