# Summary

This is a high-level overview of my data science project “Forecast Probability of Default with Logistic Regression and Random Forest Models for City National Bank Personal Loan Portfolio”, together with the key Python functions I used. It was a model validation project that I carried out in Q4 2018 while working with the Credit Portfolio Risk department.

In the model, the dependent variable is a borrower-level default indicator (0, 1) for the personal loan portfolio. The independent variables are macroeconomic variables (unemployment rate, interest rate, credit spread…).
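
For reference, a logistic regression PD model of this kind takes the standard form below. The notation ($x_{k,t}$ for the $k$-th macroeconomic variable in period $t$, $\beta_k$ for its coefficient) is assumed here for illustration and is not taken from the project documentation.

$$
\Pr(y_{i,t} = 1 \mid x_t) = \frac{1}{1 + \exp\!\left(-\left(\beta_0 + \sum_{k=1}^{K} \beta_k x_{k,t}\right)\right)}
$$

where $y_{i,t}$ is the default indicator for borrower $i$ in period $t$.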

The project follows a standardized model validation workflow: challenging the modeling approach, testing data quality with exploratory data analysis (EDA), and analyzing model outcomes. Beyond the standard workflow, my contribution to this project was an alternative modeling approach and analysis using a machine learning model (Random Forest).


# Business Purpose

This model is mainly used for CCAR annual regulatory reporting. The Comprehensive Capital Analysis and Review (CCAR) is an annual exercise by the Federal Reserve to ensure that financial institutions have well-defined and forward-looking capital planning processes that account for their unique risks and sufficient capital to continue operations through times of economic and financial stress.  

The Federal Reserve provides stress scenarios for macroeconomic variables, and the bank forecasts its potential capital losses under those scenarios.

# Import Libraries

In [None]:
# -*- coding: utf-8 -*-
"""
Created on Tue Jan  8 09:57:43 2019

@author: albhsu
"""

#####This is an on-the-job self-development project. Only a sample of my code is provided#####

import warnings
warnings.simplefilter('ignore')

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve, classification_report
from sklearn import metrics

# Import Dataset

In [None]:
data = pd.read_csv('PG_test.csv')
data.head()
data.info()

# Check for missing data and outliers
print(data.isnull().values.any())
data[data['US_DUNEM_CA_QOQ'] > 0.2]
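
The checks above can be extended to every column at once. The following is a minimal sketch on synthetic data, not the project's actual code: the `RATE_QOQ` column, the injected values, and the 1.5 × IQR rule are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the macro dataset; RATE_QOQ is a hypothetical column.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "US_DUNEM_CA_QOQ": rng.normal(0.0, 0.05, 200),
    "RATE_QOQ": rng.normal(0.0, 0.01, 200),
})
demo.loc[3, "RATE_QOQ"] = np.nan        # inject one missing value
demo.loc[10, "US_DUNEM_CA_QOQ"] = 0.9   # inject one extreme value

# Number of missing values in each column
missing_counts = demo.isnull().sum()

# Flag values lying more than 1.5 * IQR outside the quartiles of their column
q1, q3 = demo.quantile(0.25), demo.quantile(0.75)
iqr = q3 - q1
outlier_mask = (demo < q1 - 1.5 * iqr) | (demo > q3 + 1.5 * iqr)
outlier_counts = outlier_mask.sum()

print(missing_counts)
print(outlier_counts)
```

A per-column count makes it easier to see which variables need attention than a single `any()` flag does.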

# Challenge Modeling Approach

In [None]:
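#Build Logistic Regression model object and fit it on the full dataset
features = data.columns[6:67]  # same feature columns as used in the next section
clf = LogisticRegression()
clf.fit(data[features], data['PGDEF'])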

# Machine Learning

In [None]:
#Split dataset into training and testing sets (roughly 75% / 25%)
np.random.seed(0)  # fixed seed so the random split is reproducible
data['is_train'] = np.random.uniform(0, 1, len(data)) <= .75
train, test = data[data['is_train']], data[~data['is_train']]
features = data.columns[6:67]
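
Because defaults are usually rare, a purely random split can leave the training and testing sets with different default rates. One alternative is a stratified split with scikit-learn's `train_test_split`. This is a minimal sketch on synthetic data: the `x1` and `x2` columns and the 10% default rate are assumptions, and only the `PGDEF` name comes from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic borrower-level data with a rare default flag (hypothetical values)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "x1": rng.normal(size=400),
    "x2": rng.normal(size=400),
    "PGDEF": (rng.uniform(size=400) < 0.1).astype(int),
})

# stratify keeps the share of defaults nearly equal in both subsets
train_demo, test_demo = train_test_split(
    demo, test_size=0.25, stratify=demo["PGDEF"], random_state=0
)

print(train_demo["PGDEF"].mean(), test_demo["PGDEF"].mean())
```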

In [None]:
#Run Logistic Regression model
#Use the 0/1 labels directly: pd.factorize numbers classes by first appearance and can swap them
y_train = train['PGDEF']
clf = LogisticRegression(random_state=0, solver='lbfgs')  # binary target, so multi_class is not needed
clf.fit(train[features], y_train)
y_pred = clf.predict(test[features])
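
A Random Forest challenger can be fit in the same way as the logistic regression. The following is a minimal sketch on synthetic data, not the project's actual code: the dataset from `make_classification`, `n_estimators=200`, and `class_weight="balanced"` are all assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the loan portfolio (about 10% positives)
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=4,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hyperparameters are illustrative, not tuned
rf = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=0
)
rf.fit(X_train, y_train)

# Predicted probability of the positive class (default) for each test row
pd_hat = rf.predict_proba(X_test)[:, 1]

# Impurity-based feature importances; they sum to 1 across features
importances = rf.feature_importances_
print(np.round(importances, 3))
```

Unlike `LogisticRegression`, the fitted forest exposes `feature_importances_` rather than `coef_`.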

# Model Performance

In [None]:
#Classifier score (accuracy) and Confusion Matrix
print(clf.score(test[features], test['PGDEF']))
cm = metrics.confusion_matrix(test['PGDEF'], y_pred)
print(cm)

In [None]:
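#Feature importance (needs a tree-based model; LogisticRegression has no feature_importances_)
rf = RandomForestClassifier(n_estimators=100, random_state=0)  # n_estimators is an assumed setting
rf.fit(train[features], train['PGDEF'])
importance = pd.DataFrame(index = features, data = rf.feature_importances_, columns = ['importance'])
importance.sort_values(by = 'importance', ascending = True, inplace = True)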

In [None]:
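#Plot the five most important features
ax = importance[-5:].plot.barh()
plt.show()

#Plot the logistic regression coefficients in a separate figure
y_pos = np.arange(len(features))
plt.barh(y_pos, clf.coef_.ravel())
plt.yticks(y_pos, features)
plt.show()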

In [None]:
#Another way to generate Confusion matrix
pd.crosstab(test['PGDEF'], y_pred, rownames=['actual'], colnames=['predicted'])

#Classification report
target_names = ['0', '1']
y_pred = clf.predict(test[features])
y_actual = test['PGDEF']
print(classification_report(y_actual, y_pred, target_names=target_names))
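#ROC curve and AUC: pass the true labels first, then the predicted probability of default (class 1)
y_score = clf.predict_proba(test[features])[:, 1]
fpr, tpr, thresholds = roc_curve(y_actual, y_score, pos_label=1)
print(auc(fpr, tpr))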
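
An ROC curve is traced by sweeping a threshold over predicted scores, so it is computed from predicted probabilities rather than hard 0/1 predictions. The following is a minimal sketch on synthetic data, with the dataset and the `max_iter` setting as assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (hypothetical stand-in for the loan data)
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=4,
    n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Probability of the positive class, used as the ranking score
scores = model.predict_proba(X_test)[:, 1]

# roc_curve takes the true labels first, then the scores
fpr, tpr, thresholds = roc_curve(y_test, scores, pos_label=1)
roc_auc = auc(fpr, tpr)
print(round(roc_auc, 3))
```

The area under the curve from `auc(fpr, tpr)` matches `roc_auc_score(y_test, scores)`, which computes the same quantity in one call.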