<a href="https://colab.research.google.com/github/pkuSapphire/CreditRiskManagement/blob/main/Project_2_Time_Series_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project 2: Time Series Model

By Bosen Li (bl3097), Wenyu Luo (wl2905), Edward Zhang (yz4756)

## Step 1

Read the data into Python and convert them into charge-off percentages. Use the Augmented Dickie Fuller test to see if these series are stationary. If they are not, take the first differences.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller

cre = pd.read_excel('CRE.xlsx')
cre.set_index('date',inplace=True)
card = pd.read_excel('card.xlsx')
card.set_index('date',inplace=True)

cre['pct'] = 100 *  cre['chargeoffs']/cre['loans']
card['pct'] = 100 * card['chargeoffs']/card['loans']

print(f"The p-value of CRE chargeoffs percentage's Augmented Dickie Fuller test is \
{adfuller(cre['pct'])[1]:.3f}, \
indicating that this is not stationary.")
print(f"The p-value of card chargeoffs percentage'sAugmented Dickie Fuller test is \
{adfuller(card['pct'])[1]:.3f}, \
indicating that this is not stationary too.\n")

cre['pct_diff'] = cre['pct'].diff()
card['pct_diff'] = card['pct'].diff()

cre.dropna(inplace=True)
card.dropna(inplace=True)

print('After taking the first difference,')
print(f"The p-value of CRE chargeoffs percentage's Augmented Dickie Fuller test is \
{adfuller(cre['pct_diff'])[1]:.3f}, \
thus that this is stationary.")
print(f"The p-value of card chargeoffs percentage'sAugmented Dickie Fuller test is \
{adfuller(card['pct_diff'])[1]:.3f}, \
thus that this is stationary too.")


The p-value of CRE chargeoffs percentage's Augmented Dickie Fuller test is 0.489, indicating that this is not stationary.
The p-value of card chargeoffs percentage'sAugmented Dickie Fuller test is 0.053, indicating that this is not stationary too.

After taking the first difference,
The p-value of CRE chargeoffs percentage's Augmented Dickie Fuller test is 0.013, thus that this is stationary.
The p-value of card chargeoffs percentage'sAugmented Dickie Fuller test is 0.002, thus that this is stationary too.


## Step 2

- Download unemployment data (UNRATE), oil prices (DCOILBRENTEU), US GDP (GDP), 10-year minus 2-year treasury rates (T10Y2Y), and a volatility series of your choice.
- Pull the economic data into a pandas data frame.
- Create a GDP growth variable (GDP_t – GDP_t-1) / GDP_t-1.
- Combine this data with the charge-off data, making sure that none of the times in the economic series are later than the beginning dates for the charge-offs. (Clarification - each charge-off data point is measured over a full quarter. The reporting date is the end of the quarter. We want to ensure that the explanatory data does not overlap with the data point's quarter.)
- Find the augmented Dickey-Fuller statistics for each economic time series. - If indicated, take the first difference and use that instead.
- Run all possible AR1, three-factor models (one lag and three factors).
- Choose the best model based on r-squared and comment on the results.

In [9]:
import pandas_datareader.data as web

unrate = web.DataReader("UNRATE","fred",start='2000-01-01')
oil = web.DataReader("DCOILBRENTEU","fred",start='2000-01-01')
gdp = web.DataReader("GDP","fred",start='2000-01-01')
t10y2y = web.DataReader("T10Y2Y","fred",start='2000-01-01')
volatity = web.DataReader("VIXCLS","fred",start='2000-01-01')

# cleaning data

unrate.index = unrate.index-pd.DateOffset(days=1) # This will not cause overlap, because the data itself is one day lag
unrate = unrate.resample('QE').last()
gdp.index = gdp.index-pd.DateOffset(days=1) # This will not cause overlap, because the data itself is one day lag
gdp = gdp.resample('QE').last()
oil = oil.resample('QE').mean()
t10y2y = t10y2y.resample('QE').mean()
volatity = volatity.resample('QE').mean()

econ = pd.concat([unrate,oil,gdp,t10y2y,volatity],axis=1)
econ['gdp_growth'] = econ['GDP'].pct_change(fill_method=None) # econ['gdp_growth'] = econ['GDP'].diff()/econ['GDP'].shift(1), thank new features
econ.dropna(inplace=True)

for col in ['UNRATE', 'DCOILBRENTEU', 'GDP', 'gdp_growth', 'T10Y2Y', 'VIXCLS']:
  if adfuller(econ[col])[1] > 0.05 and col != 'GDP':
    econ[col+'_diff'] = econ[col].diff() # create first diff
  econ[col]=econ[col].shift(-1) # avoid overlap


def add_prefix(df, prefix): # use this function so everytime I run this part, it will not add prefix everytime
    df.columns = [prefix + col if not col.startswith(prefix) else col for col in df.columns]
    return df

cre = add_prefix(cre, 'cre_')
card = add_prefix(card, 'card_')

df = pd.merge(cre,card,left_index=True,right_index=True)
df = pd.merge(df,econ,left_index=True,right_index=True,how='inner') # The pct_diff starts from 2001-06-30, so use the inner to discard other rows

df = df[['cre_loans', 'cre_chargeoffs', 'cre_pct', 'cre_pct_diff', \
         'card_loans', 'card_chargeoffs', 'card_pct', 'card_pct_diff', \
         'UNRATE', 'DCOILBRENTEU','DCOILBRENTEU_diff', 'GDP', 'gdp_growth', 'T10Y2Y', 'VIXCLS']]
dfstat = df[['cre_pct_diff','card_pct_diff','UNRATE', 'DCOILBRENTEU_diff', 'gdp_growth', 'T10Y2Y', 'VIXCLS']] # to conduct an AR model, we need stationary data


In [16]:
import itertools
import statsmodels.api as sm


explanatory_vars = ['UNRATE', 'DCOILBRENTEU_diff', 'gdp_growth', 'T10Y2Y', 'VIXCLS']
factor_combinations = list(itertools.combinations(explanatory_vars, 3))

def lag_and_regression(dependent_var, explanatory_vars, df):
    df.loc[:, dependent_var + '_lag1'] = df[dependent_var].shift(1)
    dfc = df.dropna()
    X = dfc[[dependent_var + '_lag1'] + list(explanatory_vars)].dropna()
    y = dfc[dependent_var]
    X = sm.add_constant(X)
    return sm.OLS(y, X).fit().rsquared


results = [{'combo': combo,
            'r2_cre': lag_and_regression('cre_pct_diff', combo, dfstat),
            'r2_card': lag_and_regression('card_pct_diff', combo, dfstat)}
           for combo in factor_combinations]

top_cre_results = sorted(results, key=lambda x: x['r2_cre'], reverse=True)[:3]
top_card_results = sorted(results, key=lambda x: x['r2_card'], reverse=True)[:3]
print("Top 3 models for 'cre_pct_diff' based on R²:")
for result in top_cre_results:
    print(f"Factors: {result['combo']}, R²: {result['r2_cre']:.3f}")
print("\nTop 3 models for 'card_pct_diff' based on R²:")
for result in top_card_results:
    print(f"Factors: {result['combo']}, R²: {result['r2_card']:.3f}")

Top 3 models for 'cre_pct_diff' based on R²:
Factors: ('DCOILBRENTEU_diff', 'T10Y2Y', 'VIXCLS'), R²: 0.254
Factors: ('UNRATE', 'DCOILBRENTEU_diff', 'VIXCLS'), R²: 0.253
Factors: ('DCOILBRENTEU_diff', 'gdp_growth', 'VIXCLS'), R²: 0.252

Top 3 models for 'card_pct_diff' based on R²:
Factors: ('DCOILBRENTEU_diff', 'gdp_growth', 'VIXCLS'), R²: 0.022
Factors: ('DCOILBRENTEU_diff', 'T10Y2Y', 'VIXCLS'), R²: 0.021
Factors: ('UNRATE', 'DCOILBRENTEU_diff', 'VIXCLS'), R²: 0.020


In [21]:
dfstat.index.freq = 'QE'

import statsmodels.api as sm
from statsmodels.tsa.arima.model import ARIMA
import itertools

def arima_with_factors(dependent_var, factor_combination, df):
    X = df[list(factor_combination)]
    model = ARIMA(df[dependent_var], exog=X, order=(1, 0, 0))
    result = model.fit()
    return {'combo': factor_combination,
            'loglikelihood': result.llf,
            'aic': result.aic,
            'bic': result.bic}

explanatory_vars = ['UNRATE', 'DCOILBRENTEU_diff', 'gdp_growth', 'T10Y2Y', 'VIXCLS']
factor_combinations = list(itertools.combinations(explanatory_vars, 3))
results_cre = []
results_card = []

for combo in factor_combinations:
    cre_result = arima_with_factors('cre_pct_diff', combo, dfstat)
    card_result = arima_with_factors('card_pct_diff', combo, dfstat)
    results_cre.append(cre_result)
    results_card.append(card_result)

best_cre_loglikelihood = sorted(results_cre, key=lambda x: x['loglikelihood'], reverse=True)[0]
best_cre_aic = sorted(results_cre, key=lambda x: x['aic'])[0]
best_cre_bic = sorted(results_cre, key=lambda x: x['bic'])[0]
best_card_loglikelihood = sorted(results_card, key=lambda x: x['loglikelihood'], reverse=True)[0]
best_card_aic = sorted(results_card, key=lambda x: x['aic'])[0]
best_card_bic = sorted(results_card, key=lambda x: x['bic'])[0]

print("Best model for 'cre_pct_diff' based on log-likelihood, AIC and BIC:")
print(f"Factors: {best_cre_loglikelihood['combo']}, Log-likelihood: {best_cre_loglikelihood['loglikelihood']:.3f}\
AIC: {best_cre_aic['aic']:.3f}, BIC: {best_cre_bic['bic']:.3f}")
print("\nBest model for 'card_pct_diff' based on log-likelihood, AIC and BIC:")
print(f"Factors: {best_card_loglikelihood['combo']}, Log-likelihood: {best_card_loglikelihood['loglikelihood']:.3f}\
AIC: {best_card_aic['aic']:.3f}, BIC: {best_card_bic['bic']:.3f}")
print("Two methods, using OLS R-square or using loglikihood, AIC and BIC shows the same result.")



Best model for 'cre_pct_diff' based on log-likelihood, AIC and BIC:
Factors: ('DCOILBRENTEU_diff', 'T10Y2Y', 'VIXCLS'), Log-likelihood: 112.577AIC: -213.153, BIC: -199.248

Best model for 'card_pct_diff' based on log-likelihood, AIC and BIC:
Factors: ('DCOILBRENTEU_diff', 'gdp_growth', 'VIXCLS'), Log-likelihood: 21.052AIC: -30.103, BIC: -16.199
Two methods, using OLS R-square or using loglikihood, AIC and BIC shows the same result.


In [23]:
pip install pmdarima

Collecting pmdarima
  Downloading pmdarima-2.0.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_28_x86_64.whl.metadata (7.8 kB)
Downloading pmdarima-2.0.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_28_x86_64.whl (2.1 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.1/2.1 MB[0m [31m21.2 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: pmdarima
Successfully installed pmdarima-2.0.4


In [30]:
# Using other models may have better result.
from pmdarima import auto_arima

def auto_arima_with_factors(dependent_var, factor_combination, df):
    X = df[list(factor_combination)]
    model = auto_arima(df[dependent_var], exogenous=X, seasonal=False, stepwise=True, trace=False)
    arima_model = model.arima_res_
    return {'combo': factor_combination,
            'order': model.order,
            'loglikelihood': arima_model.llf,
            'aic': arima_model.aic,
            'bic': arima_model.bic}

results_cre, results_card = [], []

for combo in factor_combinations:
    results_cre.append(auto_arima_with_factors('cre_pct_diff', combo, dfstat))
    results_card.append(auto_arima_with_factors('card_pct_diff', combo, dfstat))
best_cre_ll = sorted(results_cre, key=lambda x: x['loglikelihood'], reverse=True)[0]
best_card_ll = sorted(results_card, key=lambda x: x['loglikelihood'], reverse=True)[0]
print(f"Best model for 'cre_pct_diff' based on log-likelihood:\nFactors: {best_cre_ll['combo']}, ARIMA Order: {best_cre_ll['order']}, Log-Likelihood: {best_cre_ll['loglikelihood']:.3f}, AIC: {best_cre_ll['aic']:.3f}, BIC: {best_cre_ll['bic']:.3f}")
print(f"\nBest model for 'card_pct_diff' based on log-likelihood:\nFactors: {best_card_ll['combo']}, ARIMA Order: {best_card_ll['order']}, Log-Likelihood: {best_card_ll['loglikelihood']:.3f}, AIC: {best_card_ll['aic']:.3f}, BIC: {best_card_ll['bic']:.3f}")
print("For cre information, we have higher log-likelihood, but not for the card data.")

Best model for 'cre_pct_diff' based on log-likelihood:
Factors: ('UNRATE', 'DCOILBRENTEU_diff', 'gdp_growth'), ARIMA Order: (1, 0, 1), Log-Likelihood: 116.355, AIC: -226.709, BIC: -219.757

Best model for 'card_pct_diff' based on log-likelihood:
Factors: ('UNRATE', 'DCOILBRENTEU_diff', 'gdp_growth'), ARIMA Order: (0, 0, 0), Log-Likelihood: 20.410, AIC: -38.819, BIC: -36.502


## Step 3

Key variables that could impact charge-off rates include interest rates, such as the Federal Funds Rate, the housing price index (HPI), consumer credit growth, the debt-to-income ratio (DTI), and initial jobless claims.

-	Interest rates or consumer price index (CPI) fluctuations can significantly affect consumers’ ability to service their credit card debts.
-	The housing price index is a critical factor influencing the probability of charge-offs in commercial real estate (CRE) loans.
-	For banks, when facing financial strain, they may resort to charge-offs to reduce their exposure to risky assets or to improve liquidity management.

To produce accurate forecasts using these models, I would need reliable macroeconomic forecasts, insights into potential monetary and fiscal policy changes, and industry-specific trends, such as CRE vacancy rates or shifts in consumer behavior for credit cards. These inputs would significantly enhance the accuracy and reliability of future predictions.