# bank-sentinel

# section 1

A machine learning project by Joey Shiu and Noelle Koliadko in partial fulfillment of the requirements for CPTR 435: Machine Learning. 

This project is a binary classification problem: predicting whether a bank will fail from the financials it reports to the FDIC. The input consists of 26 numerical ratios derived from bank financial data. Insofar as possible, we have used the features outlined by Le and Viviani (2018); these features are explained further in a later section. The output is a binary class, where 1 represents survival and 0 represents failure. 

We have created a vanilla Artificial Neural Network architecture using the Keras/TensorFlow framework. We are evaluating the model on accuracy, precision, recall, and the F1 score. 
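
As a minimal sketch of how these four metrics relate, they can be computed from confusion-matrix counts. The label lists `y_true` and `y_pred` below are hypothetical (not project results), the helper name `binary_metrics` is illustrative, and treating failure (0) as the class of interest is an assumption of this sketch.

```python
def binary_metrics(y_true, y_pred, positive=0):
    """Accuracy, precision, recall and F1 for one class of a binary problem.

    `positive` selects the class of interest; 0 (failure) is an assumption
    made for this sketch, not something the project prescribes.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels: 1 = survived, 0 = failed
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
```

With these made-up labels there are three true positives, one false positive and one false negative for the failure class, so all four metrics come out to 0.75.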

Data is extracted from the FDIC's public BankFind API, which is documented here: https://banks.data.fdic.gov/docs/

Bibliography:

>Le, H. H., & Viviani, J.-L. (2018). Predicting bank failure: An improvement by implementing a machine-learning approach to classical financial ratios. *Research in International Business and Finance, 44*, 16–25. https://doi.org/10.1016/j.ribaf.2017.07.104 

>Serrano-Cinca, C., & Gutiérrez-Nieto, B. (2013). Partial Least Square Discriminant Analysis for bankruptcy prediction. *Decision Support Systems, 54*(3), 1245–1255. https://doi.org/10.1016/j.dss.2012.11.015 

>Zhao, H., Sinha, A. P., & Ge, W. (2009). Effects of feature construction on classification performance: An empirical study in bank failure prediction. *Expert Systems with Applications, 36*(2), 2633–2644. https://doi.org/10.1016/j.eswa.2008.01.053 


## section 2

This section extracts year-end 2021 financials for active banks and, for failed banks, year-end financials from the year prior to failure. We begin by defining functions that allow us to make API calls using the Elasticsearch query syntax.

In [1]:
import requests
import pandas as pd
import math
from io import StringIO
from urllib.parse import quote_plus

### Construct API call

In [2]:
# function to get data from BankFind API (n = row limit, k = row offset)
def getData(url: str, filter: str, fields: str, sortby: str = 'CERT', order: str = 'ASC',
            n: int = 10000, k: int = 0,
            suffix: str = '&format=csv&download=false&filename=data_file') -> pd.DataFrame:
    query = ('filters=' + quote_plus(filter) + '&fields=' + quote_plus(fields)
             + '&sort_by=' + sortby + '&sort_order=' + order
             + '&limit=' + str(n) + '&offset=' + str(k) + suffix)
    return pd.read_csv(StringIO(requests.get(url + query).text))
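
The request that `getData` sends can be inspected without touching the network. The sketch below rebuilds the same query string with `urllib.parse.urlencode`, which applies `quote_plus` to each value by default. The helper name `build_query_url` is hypothetical; the parameter names are simply those already used in `getData` above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_query_url(base_url, filters, fields, sort_by="CERT",
                    sort_order="ASC", limit=10000, offset=0):
    """Assemble a BankFind-style URL; mirrors the string built in getData."""
    params = {
        "filters": filters,
        "fields": fields,
        "sort_by": sort_by,
        "sort_order": sort_order,
        "limit": limit,
        "offset": offset,
        "format": "csv",
        "download": "false",
        "filename": "data_file",
    }
    # base_url is expected to end with '?', as in the notebook
    return base_url + urlencode(params)


url = build_query_url(
    "https://banks.data.fdic.gov/api/institutions?",
    'ACTIVE:0 AND DATEUPDT:["2006-01-01" TO "2023-12-31"]',
    "STALP,NAME,ACTIVE,CERT,DATEUPDT",
    limit=5,
)
# Decode the query string again to confirm the filter survives encoding
decoded = parse_qs(urlsplit(url).query)
```

Printing `url` before sending a request is a cheap way to spot a malformed filter, since the quotes, brackets and spaces in the filter are all percent-encoded.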

In [3]:
# reset index and drop redundant columns for bank dataframe
def cleanbankdata(df: pd.DataFrame) -> None:
    df.set_index('CERT', inplace=True)
    df.drop('ID', axis = 1, inplace = True)

### Get lists of all banks 2006-present

In [4]:
# number of banks to get 
# note: in the selected time frame there are fewer than 5000 in each group, so the value of n is only for constraining test cases
n = 5000

In [6]:
# strings for getting lists of active and failed banks
institutionurl = 'https://banks.data.fdic.gov/api/institutions?'
filtersfailed = 'ACTIVE:0 AND DATEUPDT:[\"2006-01-01\" TO \"2023-12-31\"]'
filtersactive = 'ACTIVE:1 AND DATEUPDT:[\"2006-01-01\" TO \"2023-12-31\"]'
bankfields = 'STALP,NAME,ACTIVE,CERT,DATEUPDT'


# get failed banks
failedbanks = getData(institutionurl, filtersfailed, bankfields, n = n)
cleanbankdata(failedbanks)

# get active banks
activebanks = getData(institutionurl, filtersactive, bankfields, n = n)
cleanbankdata(activebanks)

In [7]:
failedbanks.head()

CERT,ACTIVE,DATEUPDT,NAME,STALP
9,0,02/27/2008,Union Trust Company,ME
46,0,01/07/2022,Merchants Bank of Alabama,AL
47,0,07/10/2020,Traders & Farmers Bank,AL
57,0,11/21/2006,Community Bank,AL
59,0,01/09/2013,The Citizens Bank,AL


In [8]:
activebanks.head()

CERT,ACTIVE,DATEUPDT,NAME,STALP
14,1,06/05/2023,State Street Bank and Trust Company,MA
35,1,09/02/2022,AuburnBank,AL
39,1,03/28/2023,Robertson Banking Company,AL
41,1,08/31/2022,Phenix-Girard Bank,AL
49,1,08/31/2022,Bank of Evergreen,AL


In [9]:
print(f'number of failed banks: {len(failedbanks)}')
print(f'number of active banks: {len(activebanks)}')

number of failed banks: 4948
number of active banks: 4611


### Get financials of banks

In [10]:
# global information
# financials

# base URL for financial reports; the year-end 2021 filter for active banks is set in a later cell
financialsurl = 'https://banks.data.fdic.gov/api/financials?'

featureslist =  ['NAME,RISDATE,CERT,REPYEAR,',
                'LNATRESR,ELNLOS,NIM,EAMINTAN,LNLSGRS,NTLNLS,EQ,ASSET5,' ,
                'RBCT1,IDT1RWAJR,EQTOTR,EQV,LNLSNET,LIAB,LIABEQR,LIABEQ,DEP,',
                'NIMY,NIMR,NONIXR,PTAXNETINCR,ITAX,',
                'ROA,ROE,NETINC,EEFFR,CHBAL,ASSET,BKPREM']
features = ''.join(featureslist)

featurenames = {'LNATRESR': 'LOAN LOSS RESERVE/GROSS LN&LS',
                'ELNLOS' : 'PROVISIONS FOR LN & LEASE LOSSES',
                'NIM' : 'NET INTEREST INCOME',
                'EAMINTAN' : 'AMORT & IMPAIR LOSS AST',
                'LNLSGRS' : 'LOANS AND LEASES, GROSS',
                'NTLNLS' : 'TOTAL LN&LS NET CHARGE-OFFS',
                'EQ' : 'Equity Capital',
                'ASSET5' : 'TOTAL ASSETS-CAVG5',
                'RBCT1' : 'TIER 1 RBC-PCA',
                'IDT1RWAJR' : 'TIER 1 RISK-BASED CAPITAL RATIO',
                'EQTOTR' : 'TOTAL EQUITY CAPITAL RATIO',
                'EQV' : 'BANK EQUITY CAPITAL/ASSETS',
                'LNLSNET' : 'LOANS AND LEASES-NET',
                'CUSLI' : 'CUSTOMERS ACCEPTANCES',
                'LIAB' : 'TOTAL LIABILITIES',
                'LIABEQR' : 'TOTAL LIABILITIES & CAPITAL RATIO',
                'LIABEQ' : 'TOTAL LIABILITIES & CAPITAL',
                'DEP' : 'Total deposits',
                'NIMY' : 'NET INTEREST MARGIN',
                'NIMR' : 'NET INTEREST INCOME RATIO',
                'IOTHFEE' : 'OTHER FEE INCOME',
                'NONIXR' : 'TOTAL NONINTEREST EXPENSE RATIO',
                'PTAXNETINCR' : 'PRE-TAX NET INCOME OPERATING INCOME RATIO',
                'ITAX' : 'APPLICABLE INCOME TAXES',
                'ROA' : 'Return on assets (ROA)',
                'ROE' : 'Return on equity (ROE)',
                'NETINC' : 'Net income',
                'EEFFR' : 'EFFICIENCY RATIO',
                'CHBAL' : 'CASH & DUE FROM DEPOSITORY INST',
                'ASSET' : 'Total assets',
                'BKPREM' : 'PREMISES AND FIXED ASSETS'
                }

#### Active Banks

In [11]:
activefilters = 'RISDATE:20211231'
activefinancials = getData(financialsurl, activefilters, features)

In [12]:
activefinancials.isna().sum()

ASSET           0
ASSET5          0
BKPREM         17
CERT            0
CHBAL           0
DEP             0
EAMINTAN       17
EEFFR           0
ELNLOS         17
EQ             17
EQTOTR          0
EQV             0
ID              0
IDT1RWAJR       0
ITAX           17
LIAB            0
LIABEQ          0
LIABEQR         0
LNATRESR        0
LNLSGRS         0
LNLSNET         0
NAME            0
NETINC         17
NIM            17
NIMR            0
NIMY            0
NONIXR          0
NTLNLS         17
PTAXNETINCR     0
RBCT1          17
REPYEAR         0
RISDATE         0
ROA             0
ROE            17
dtype: int64

In [13]:
print(f'number of financial reports (banks): {len(activefinancials)}')

number of financial reports (banks): 4904


In [14]:
# join bank data with financials
activedata = activebanks.merge(activefinancials, on = 'CERT', how = 'left', suffixes=['_b', '_f'])
# drop NAs
activedata.dropna(inplace= True)

In [16]:
activedata.head()

Unnamed: 0,CERT,ACTIVE,DATEUPDT,NAME_b,STALP,ASSET,ASSET5,BKPREM,CHBAL,DEP,...,NIMR,NIMY,NONIXR,NTLNLS,PTAXNETINCR,RBCT1,REPYEAR,RISDATE,ROA,ROE
0,14,1,06/05/2023,State Street Bank and Trust Company,MA,311063000.0,315584400.0,2250000.0,109322000.0,260805000.0,...,0.632477,0.776841,2.573955,2000.0,1.007971,18845000.0,2021.0,20211231.0,0.84193,9.88
1,35,1,09/02/2022,AuburnBank,AL,1104523.0,1030987.0,41786.0,78839.0,996948.0,...,2.331261,2.514126,1.911178,79.0,0.868488,100059.0,2021.0,20211231.0,0.741522,7.52
2,39,1,03/28/2023,Robertson Banking Company,AL,412189.0,392534.0,3964.0,54893.0,374169.0,...,3.080243,3.411668,2.381705,-8.0,1.680874,36507.0,2021.0,20211231.0,1.570055,16.73
3,41,1,08/31/2022,Phenix-Girard Bank,AL,285239.0,261810.0,1689.0,49033.0,242418.0,...,2.62786,2.780162,2.351705,41.0,1.134792,36019.0,2021.0,20211231.0,1.678698,12.24
4,49,1,08/31/2022,Bank of Evergreen,AL,75734.0,71653.2,897.0,5253.0,66576.0,...,3.077322,3.282217,2.641892,13.0,0.880631,8959.0,2021.0,20211231.0,0.736883,5.88


#### Failed banks

In [17]:
# create a new column containing one year prior to the year of failure

failedbanks['prevYr'] = failedbanks.DATEUPDT.str.rsplit('/', expand = True, n = 1)[1].astype(int) - 1
failedbanks['targetdate'] = failedbanks.prevYr * 10000 + 1231
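
To make the date arithmetic concrete: `DATEUPDT` arrives as an `MM/DD/YYYY` string, while `RISDATE` is an integer of the form `YYYYMMDD`. A plain-Python sketch of the same conversion follows; the helper name `prior_year_end` is hypothetical.

```python
def prior_year_end(dateupdt: str) -> int:
    """Map 'MM/DD/YYYY' to the YYYYMMDD integer for Dec 31 of the prior year."""
    year = int(dateupdt.rsplit("/", 1)[1])
    return (year - 1) * 10000 + 1231


# Union Trust Company (CERT 9) was last updated on 02/27/2008
target = prior_year_end("02/27/2008")
```

Here `target` is `20071231`, the year-end report date that the later join looks for.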


In [18]:
## this cell takes > 1 minute to execute

# get all financial data in the selected date range for the banks in the list of failed banks
# BankFind does not like this query! It balks when you ask for too much data.
# We will make multiple queries, store them in a list, then pd.concat them at the end
failedfinancialslist = []

# This query is split into two nested loops:
# first we divide the banks into groups of 1000
n_failed = len(failedbanks)
n_bins = math.ceil(n_failed / 1000)
cert = failedbanks.index.to_series().reset_index(drop = True).astype(str)


for i in range(n_bins):
    # develop the query for each set of 1000 banks
    bankIDstring = ' OR '.join(cert.loc[i*1000 : min(i*1000 + 999, n_failed - 1)])
    failedfilter = f'CERT:({bankIDstring}) AND REPYEAR:[2005 TO 2023]'

    # second (nested) loop to ensure all the data is collected
    # BankFind cuts off at 10k rows, but since the date range spans up to 76 quarters (2005 through 2023), there will be many more entries. 
    # The following loop pulls data in chunks of 10k until all data is collected. 
    j = 0
    while True:
        try:
            # for optimal performance, append to python list and perform pd.concat at the end
            failedfinancialslist.append(getData(financialsurl, failedfilter, features, k = j))
            j += 10000
        except Exception:
            # stop paging once a request fails to return parseable data
            break

# concatenate all into one dataframe
failedfinancials = pd.concat(failedfinancialslist)
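
The batching logic can be exercised offline. The sketch below splits a list of certificate numbers into groups of 1000, builds the `CERT:(... OR ...)` filter for each group, and pages through results until a page comes back short. `fetch_page` is a hypothetical stand-in for `getData` that returns a list of rows. Stopping on a short page is an assumption of this sketch; the notebook instead stops when a request raises an exception.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def cert_filter(certs, first_year=2005, last_year=2023):
    """Build the filter string for one group of certificate numbers."""
    ids = " OR ".join(str(c) for c in certs)
    return f"CERT:({ids}) AND REPYEAR:[{first_year} TO {last_year}]"


def fetch_all(fetch_page, page_size=10000):
    """Collect pages from fetch_page(offset) until a short page signals the end."""
    rows, offset = [], 0
    while True:
        page = fetch_page(offset)
        rows.extend(page)
        if len(page) < page_size:
            return rows
        offset += page_size


# Hypothetical data: 2500 certificate numbers and a fake 25,000-row result set
certs = list(range(1, 2501))
filters = [cert_filter(group) for group in chunked(certs, 1000)]

fake_rows = list(range(25000))
collected = fetch_all(lambda offset: fake_rows[offset:offset + 10000])
```

An explicit stop condition like this avoids treating a network error as the end of the data, which a catch-all `except` cannot distinguish.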

In [19]:
failedfinancials.head()

Unnamed: 0,ASSET,ASSET5,BKPREM,CERT,CHBAL,DEP,EAMINTAN,EEFFR,ELNLOS,EQ,...,NIMR,NIMY,NONIXR,NTLNLS,PTAXNETINCR,RBCT1,REPYEAR,RISDATE,ROA,ROE
0,509057,495183.5,5454.0,9,7217,290464,0.0,70.380792,-215.0,33724.0,...,3.271514,3.541336,3.030796,21.0,1.44916,34379.0,2005,20050331,1.011342,14.71
1,518990,503119.0,5865.0,9,9381,306989,0.0,68.484681,-215.0,34610.0,...,3.299816,3.575826,2.949998,42.0,1.442999,34978.0,2005,20050630,1.012882,14.89
2,530632,509997.25,5842.0,9,15625,335075,0.0,68.420729,-215.0,34179.0,...,3.247571,3.53103,2.909092,46.0,1.398873,34845.0,2005,20050930,0.982203,14.64
3,523535,512704.8,5994.0,9,9724,335239,0.0,69.585168,-215.0,34053.0,...,3.187019,3.460747,2.928196,41.0,1.321813,35439.0,2005,20051231,0.95357,14.3
4,536475,530005.0,6212.0,9,11061,317671,0.0,77.258687,0.0,36862.0,...,2.946387,3.189305,3.020349,-34.0,0.889048,38325.0,2006,20060331,0.664145,9.93


In [20]:
# join bank data with financials
faileddata = failedbanks.merge(failedfinancials, left_on = ['CERT', 'targetdate'], right_on = ['CERT', 'RISDATE'], how = 'left', suffixes=['_b', '_f'])
# drop NA values
faileddata.dropna(inplace = True)
# drop unneeded columns
faileddata.drop(['prevYr', 'targetdate'], axis = 1, inplace = True)
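
A toy illustration of this join, using made-up rows and values: each bank is matched only to the quarterly report whose `RISDATE` equals its `targetdate`, and banks with no such report end up with `NaN` values and are dropped.

```python
import pandas as pd

# Hypothetical bank list with precomputed target dates
banks = pd.DataFrame({
    "CERT": [9, 57, 99],
    "targetdate": [20071231, 20051231, 20101231],
})

# Hypothetical quarterly financials; CERT 99 has no year-end report
financials = pd.DataFrame({
    "CERT": [9, 9, 57, 99],
    "RISDATE": [20070930, 20071231, 20051231, 20100930],
    "ROA": [0.41, 0.40, 0.42, -1.20],
})

# Left join on the (bank, date) pair keeps every bank, matched or not
joined = banks.merge(
    financials,
    left_on=["CERT", "targetdate"],
    right_on=["CERT", "RISDATE"],
    how="left",
)
# Unmatched banks carry NaN in the financial columns and are removed here
matched = joined.dropna().drop(columns=["targetdate"])
```

Comparing `len(joined)` with `len(matched)` shows how many banks were lost for lack of a matching year-end report.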

In [21]:
faileddata.head()

Unnamed: 0,CERT,ACTIVE,DATEUPDT,NAME_b,STALP,ASSET,ASSET5,BKPREM,CHBAL,DEP,...,NIMR,NIMY,NONIXR,NTLNLS,PTAXNETINCR,RBCT1,REPYEAR,RISDATE,ROA,ROE
0,9,0,02/27/2008,Union Trust Company,ME,539169.0,542956.8,8232.0,15627.0,342440.0,...,2.866158,3.121476,3.365829,402.0,0.465783,31137.0,2007.0,20071231.0,0.395796,5.72
1,46,0,01/07/2022,Merchants Bank of Alabama,AL,382952.0,384777.4,3344.0,28545.0,345793.0,...,2.898039,3.075171,2.77849,45.0,0.87921,35597.0,2021.0,20211231.0,0.692868,7.46
2,47,0,07/10/2020,Traders & Farmers Bank,AL,366379.0,366703.4,8819.0,7459.0,288635.0,...,3.368117,3.649053,2.641644,193.0,0.93427,59738.0,2019.0,20191231.0,0.773104,4.97
3,57,0,11/21/2006,Community Bank,AL,564639.0,548678.0,22558.0,18650.0,443226.0,...,3.769971,4.374382,4.175309,685.0,0.695854,39579.0,2005.0,20051231.0,0.423564,5.18
4,59,0,01/09/2013,The Citizens Bank,AL,172397.0,169502.0,426.0,41938.0,143109.0,...,2.962207,3.320306,1.479039,366.0,1.674317,28606.0,2012.0,20121231.0,1.13686,6.84


### Recombine the data

In [22]:
# combine active bank data and failed data into one df
alldata = pd.concat([activedata, faileddata])
# drop columns that will not be part of the ML model
alldata.drop(['CERT', 'DATEUPDT', 'NAME_b', 'STALP', 'ID', 'NAME_f', 'REPYEAR', 'RISDATE'], axis = 1, inplace = True)

In [23]:
alldata

Unnamed: 0,ACTIVE,ASSET,ASSET5,BKPREM,CHBAL,DEP,EAMINTAN,EEFFR,ELNLOS,EQ,...,NETINC,NIM,NIMR,NIMY,NONIXR,NTLNLS,PTAXNETINCR,RBCT1,ROA,ROE
0,1,311063000.0,315584400.0,2250000.0,109322000.0,260805000.0,243000.0,69.913938,-29000.0,27821000.0,...,2657000.0,1996000.0,0.632477,0.776841,2.573955,2000.0,1.007971,18845000.0,0.841930,9.88
1,1,1104523.0,1030987.0,41786.0,78839.0,996948.0,0.0,70.225961,-600.0,100951.0,...,7645.0,24035.0,2.331261,2.514126,1.911178,79.0,0.868488,100059.0,0.741522,7.52
2,1,412189.0,392534.0,3964.0,54893.0,374169.0,0.0,57.542931,300.0,36189.0,...,6163.0,12091.0,3.080243,3.411668,2.381705,-8.0,1.680874,36507.0,1.570055,16.73
3,1,285239.0,261810.0,1689.0,49033.0,242418.0,0.0,66.720850,100.0,36064.0,...,4395.0,6880.0,2.627860,2.780162,2.351705,41.0,1.134792,36019.0,1.678698,12.24
4,1,75734.0,71653.2,897.0,5253.0,66576.0,0.0,73.829953,40.0,8994.0,...,528.0,2205.0,3.077322,3.282217,2.641892,13.0,0.880631,8959.0,0.736883,5.88
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4942,0,238443.0,233908.8,4206.0,17551.0,213291.0,0.0,63.086740,0.0,25002.0,...,3463.0,7839.0,3.351306,3.537314,2.493707,281.0,1.459116,22961.0,1.480492,14.21
4943,0,435050.0,448144.2,11383.0,15911.0,370924.0,0.0,93.608679,2457.0,34331.0,...,-304.0,17805.0,3.973052,4.523389,3.889150,1459.0,-0.282721,35096.0,-0.067835,-0.88
4944,0,145799.0,142465.8,4655.0,23540.0,130522.0,0.0,60.822862,-300.0,14148.0,...,2233.0,4375.0,3.070912,3.333547,2.106470,-115.0,1.567394,14696.0,1.567394,15.02
4946,0,2064652.0,2120586.0,2631.0,425.0,1875835.0,0.0,82.751328,0.0,155471.0,...,14095.0,80390.0,3.790933,3.893613,5.185123,0.0,1.080786,143700.0,0.664675,9.44


In [24]:
# store cached data for use in another notebook
alldata.to_csv('alldata2.csv')

## section 3-4

In [26]:
import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout

Not all of the features we use are directly available from the FDIC database, so we perform feature engineering to create them. Le and Viviani (2018) divided their features into five categories representing bank health: Loan Quality, Capital Quality, Operations, Profitability, and Liquidity. 

##### Loan Quality
+ LNATRESR is Loan Loss reserve / Gross Loans and leases
+ LLPNIR is Loan loss provision / net interest revenue
+ ILGL is Impaired losses (loans) / gross loans
+ NCOGL is Net charge off / Gross loans
+ ILEQ is Impaired loans / equity

##### Capital Quality
+ IDT1RWAJR is the Tier 1 capital ratio
+ EQTOTR is the Total equity capital ratio
+ EQV is Equity / Assets
+ EQNL is Equity / Net Loans
+ EQLIAB is Equity / Liabilities
+ LIABEQR is Total Capital / Assets
+ TCNL is Total capital / Net Loans
+ TCDEP is Total capital / deposits
+ TCLIAB is Total capital / Liabilities

##### Operations
+ NIMY is the Net Interest Margin
+ NIMR is the Net Interest Income ratio
+ NONIXR is the Non-Interest expense ratio
+ PTAXNETINCR is the Pre-tax Operating Income ratio
+ TAVAST is Income taxes / avg assets

##### Profitability 
+ ROA is Return on Assets
+ ROE is Return on Equity
+ NIEQ is net income / equity
+ EEFFR is the Efficiency Ratio

##### Liquidity
+ NLTA is net loans / total assets
+ NLTD is net loans / total deposits
+ LATD is (assets - fixed assets) / total deposits
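
As a worked example, a few of the engineered ratios can be checked by hand against one row of the data. The figures below are those shown for Bank of Evergreen (CERT 49) in the combined table of section 2; the dictionary name `bank` is just for illustration.

```python
# Year-end 2021 values for Bank of Evergreen (CERT 49), as listed above
bank = {
    "ASSET": 75734.0,   # total assets
    "BKPREM": 897.0,    # premises and fixed assets
    "DEP": 66576.0,     # total deposits
    "EQ": 8994.0,       # equity capital
    "NETINC": 528.0,    # net income
    "ELNLOS": 40.0,     # provisions for loan and lease losses
    "NIM": 2205.0,      # net interest income
}

LLPNIR = bank["ELNLOS"] / bank["NIM"]                   # loan loss provision / net interest revenue
NIEQ = bank["NETINC"] / bank["EQ"]                      # net income / equity
LATD = (bank["ASSET"] - bank["BKPREM"]) / bank["DEP"]   # (assets - fixed assets) / deposits

print(f"LLPNIR={LLPNIR:.4f}  NIEQ={NIEQ:.4f}  LATD={LATD:.4f}")
```

The resulting NIEQ of roughly 0.059 is close to the ROE of 5.88 (a percentage) reported for the same row, which is a useful sanity check on the units.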


In [None]:
class CustomFeatures(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self # nothing to fit
    
    def transform(self, X, y=None):
        # Z2, Loan loss provision / net interest revenue
        LLPNIR = X.ELNLOS / X.NIM

        # Z3, Impaired loans / gross loans
        ILGL = X.EAMINTAN / X.LNLSGRS

        # Z4, Net charge off / Gross loans
        NCOGL = X.NTLNLS / X.LNLSGRS

        # Z5, Impaired loans / equity
        ILEQ = X.EAMINTAN / X.EQ

        # Z9, Equity / Net Loans
        EQNL = X.EQ / X.LNLSNET
        
        # no data for Z10

        # Z11, Equity / Liabilities
        EQLIAB = X.EQ / X.LIAB

        # Z13, Total capital / Net Loans
        TCNL = X.LIABEQ / X.LNLSNET

        # Z14, Total capital / deposits
        TCDEP = X.LIABEQ / X.DEP

        # Z15, Total capital / Liabilities
        TCLIAB = X.LIABEQ / X.LIAB

        # no data for Z18

        # Z21, taxes / avg assets
        TAVAST = X.ITAX / X.ASSET5

        # Z24, net income / equity
        NIEQ = X.NETINC / X.EQ

        # no data for Z25 and Z26, use Efficiency ratio instead

        # Z27 , net loans / total assets
        NLTA = X.LNLSNET / X.ASSET

        # not sure how Z28 and Z29 are different, use net loans / total deposits
        NLTD = X.LNLSNET / X.DEP

        # not sure how Z30 and 31 are different, use liquid assets / total deposits = (assets - fixed assets) / total deposits
        LATD = (X.ASSET - X.BKPREM) / X.DEP

        # construct final table
        finalTable = pd.concat([X.LNATRESR, LLPNIR, ILGL, NCOGL, ILEQ, X.IDT1RWAJR, X.EQTOTR, X.EQV, EQNL, EQLIAB, X.LIABEQR, TCNL, TCDEP, TCLIAB, X.NIMY, X.NIMR, X.NONIXR, X.PTAXNETINCR, TAVAST, X.ROA, X.ROE, NIEQ, X.EEFFR, NLTA, NLTD, LATD, X.ACTIVE], axis = 1)
        
        
        # drop rows that have resulted in division by zero
        finalTable.replace([np.inf, -np.inf], np.nan, inplace=True)
        finalTable.dropna(inplace = True)

        finalTable.columns = ['LNATRESR', 'LLPNIR', 'ILGL', 'NCOGL', 'ILEQ', 'IDT1RWAJR', 'EQTOTR', 'EQV', 'EQNL', 'EQLIAB', 'LIABEQR', 'TCNL', 'TCDEP', 'TCLIAB', 'NIMY', 'NIMR', 'NONIXR', 'PTAXNETINCR', 'TAVAST', 'ROA', 'ROE', 'NIEQ', 'EEFFR', 'NLTA', 'NLTD', 'LATD', 'ACTIVE']
        return finalTable
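
The division-by-zero cleanup at the end of `transform` can be seen in isolation. With floating-point columns, pandas returns `inf` for a nonzero value divided by zero and `NaN` for `0/0` rather than raising an error, so both must be removed explicitly. The column values below are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: the second bank reports no deposits, the third has 0/0
df = pd.DataFrame({
    "LNLSNET": [500.0, 200.0, 0.0],
    "DEP": [1000.0, 0.0, 0.0],
})

NLTD = df.LNLSNET / df.DEP   # -> 0.5, inf, NaN

# Treat infinities as missing, then drop every incomplete entry
cleaned = NLTD.replace([np.inf, -np.inf], np.nan).dropna()
```

Only the first bank survives the cleanup, which is why the final table can have fewer rows than the input.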