In [None]:
# Machine Learning model to predict whether financial statements indicate a <6% Tier 1 Capital Ratio

1. See what columns are essential
    * Load in the first two rows (codes and descriptions) of every file
    * Sketch of the loader:

          def load_one(path):
              # Read the file (tab-delimited; row 0 holds the codes, the next row the descriptions)
              df = pd.read_csv(
                  path,
                  sep="\t",
                  header=0,
                  nrows=2,
                  dtype=str
              )
2. Look at the descriptions and at how many files have each column
3. Determine essential columns to keep and notate
4. Any files that don't have essential columns are dropped
5. Read in all files, merge to single DataFrame
6. Force essential columns to float32
    * for c in essential_cols: df[c] = pd.to_numeric(df[c], errors="coerce").astype(np.float32)
    * Some models force conversion to float64 anyway; sklearn LogisticRegression does this
7. Train/test split
8. Predict
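
The float32 step in the list above could look roughly like the sketch below. The sample frame and the `essential_cols` list are made up for illustration; the real list would come out of the column review. `RCFA7206` and `RCFA2170` are used only because they appear in the 2024 file's header.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: values arrive as strings because the files are read with dtype=str.
df = pd.DataFrame({
    "IDRSSD": ["37", "242", "279"],
    "RCFA7206": ["12.5", "", "5.1"],      # one blank cell
    "RCFA2170": ["1000", "n/a", "250"],   # one non-numeric cell
})

# Assumed list of essential columns (placeholder, not the final selection).
essential_cols = ["RCFA7206", "RCFA2170"]

for c in essential_cols:
    # errors="coerce" turns anything that cannot be parsed into NaN instead of raising.
    df[c] = pd.to_numeric(df[c], errors="coerce").astype(np.float32)

print(df.dtypes)
print(df[essential_cols].isna().sum())  # how many values were lost to coercion
```

Counting the NaNs right after coercion is a cheap way to notice columns where a lot of values silently failed to parse.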

FFIEC Website: https://cdr.ffiec.gov/public/ManageFacsimiles.aspx

Bulk Data: https://cdr.ffiec.gov/public/PWS/DownloadBulkData.aspx

Silent trip-ups:

- Having non-proportional amounts of what you're classifying (imbalanced classes).

- Not having enough samples.
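
Both trip-ups can be checked before any training with a quick count of the labels. This is a minimal sketch: the labels are made up, and the `min_per_class` threshold of 50 is an arbitrary placeholder, not a rule.

```python
from collections import Counter

def class_balance(labels, min_per_class=50):
    """Return per-class counts, per-class shares, and the classes below min_per_class."""
    counts = Counter(labels)
    total = sum(counts.values())
    shares = {cls: n / total for cls, n in counts.items()}
    too_small = [cls for cls, n in counts.items() if n < min_per_class]
    return counts, shares, too_small

# Made-up labels: 1 = ratio below 6%, 0 = otherwise.
labels = [0] * 970 + [1] * 30
counts, shares, too_small = class_balance(labels)

print("Counts:", dict(counts))
print("Shares:", shares)
print("Classes with too few samples:", too_small)
```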

In [17]:
import pandas as pd
import glob
from pathlib import Path

In [18]:
# Folder with your 62 RCRI text files
folder = Path("RCRI Schedules")

# Collect info
files_info = {}

for path in sorted(folder.glob("*.txt")):
    try:
        # Read just the first 2 rows (codes + descriptions)
        df_head = pd.read_csv(
            path,
            sep="\t",
            header=None,
            nrows=2,
            engine="python",
            dtype=str
        )
        # Fill NaNs with empty string
        df_head = df_head.fillna("")
        # Drop columns where both code and description are empty
        keep = ~((df_head.iloc[0] == "") & (df_head.iloc[1] == ""))
        df_head = df_head.loc[:, keep]

        # Store results
        codes = list(df_head.iloc[0])
        descs = list(df_head.iloc[1])
        files_info[path.name] = {
            "codes": codes,
            "descs": descs
        }
    except Exception as e:
        print(f"Error reading {path.name}: {e}")

# Example: print one file’s header + descriptions
first_file = list(files_info.keys())[0]
print("File:", first_file)
for c, d in zip(files_info[first_file]["codes"], files_info[first_file]["descs"]):
    print(f"{c:<12} {d}")


File: FFIEC CDR Call Schedule RCR 03312010(1 of 2).txt
IDRSSD       
RCFD0010     CASH AND DUE FROM DEPOSITORIES
RCFD1395     TIER 3 CPTL ALLOCATD FOR MARKET RISK
RCFD1651     MARKET RISK EQUIVALENT ASSETS
RCFD1754     HELD-TO-MATURITY SECURITIES
RCFD1773     AVAILABLE-FOR-SALE SECURITIES
RCFD2170     TOTAL ASSETS
RCFD2221     UNREALIZED GAINS ON A-F-S EQUITY SEC
RCFD3123     ALLL_AMT
RCFD3128     ALLOCATED TRANSFER RISK RESERVES
RCFD3210     TOTAL EQUITY CAPITAL
RCFD3368     QTLY AVG OF TOTAL ASSETS
RCFD3411     COMMERCIAL LETTERS OF CREDIT
RCFD3429     PART IN ACCEPT ACQ BY NONACCEPT BK
RCFD3433     COMMIT&CONTINGENCIES/SECURITIES LENT
RCFD3545     TOTAL TRADING ASSETS
RCFD3792     TOTAL RISK-BASED CAPITAL
RCFD3809     INT RATE CONTRACTS-MAT 1 YR OR LESS
RCFD3812     FGN EXCHG RATE CONTRACTS-MAT 1 YR OR
RCFD3821     PERFORMANCE STANDBY LETTERS OF CR
RCFD3833     UNUSED COMMITMENTS MAT EXCEED 1 YEAR
RCFD4336     ACCMLTD NT GN(LOSS) ON CSH FLW HDGS
RCFD5306     QUALIFYING SUBORDINATED 

In [None]:
# Good start, but need to print all columns and see which ones we don't need and where they are from.
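
One way to see every column and where it comes from is to count, for each code, how many files contain it. The sketch below assumes a dict shaped like `files_info` from the cell above; the file names and contents in `sample_info` are illustrative stand-ins, not real data.

```python
from collections import Counter

# Stand-in for files_info (hypothetical file names and a handful of codes).
sample_info = {
    "file_a.txt": {"codes": ["IDRSSD", "RCFD2170", "RCFD3792"],
                   "descs": ["", "TOTAL ASSETS", "TOTAL RISK-BASED CAPITAL"]},
    "file_b.txt": {"codes": ["IDRSSD", "RCFD2170"],
                   "descs": ["", "TOTAL ASSETS"]},
    "file_c.txt": {"codes": ["IDRSSD", "RCFD3792"],
                   "descs": ["", "TOTAL RISK-BASED CAPITAL"]},
}

def column_coverage(info):
    """Map each column code to (number of files containing it, first description seen)."""
    counts = Counter()
    first_desc = {}
    for entry in info.values():
        for code, desc in zip(entry["codes"], entry["descs"]):
            first_desc.setdefault(code, desc)
        # Use a set so a code repeated inside one file is counted once.
        counts.update(set(entry["codes"]))
    return {code: (n, first_desc[code]) for code, n in counts.most_common()}

coverage = column_coverage(sample_info)
for code, (n, desc) in coverage.items():
    print(f"{code:<12} {n:>3}/{len(sample_info)}  {desc}")
```

Codes that show up in only a few files are the first candidates to drop, or the files lacking an essential code are the ones to drop.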

In [7]:
df_rc = pd.read_csv(
    "RCRI Schedules/FFIEC CDR Call Schedule RCRI 12312024.txt",  # forward slash avoids the invalid "\F" escape
    sep="\t",
    header=0,      # Use the first line as headers (the RCFD codes)
    low_memory=False # Parses the whole file at once, then infers dtypes, so no column ends up with mixed dtypes.
)
df_rc

Unnamed: 0,IDRSSD,RCFA2170,RCFA3128,RCFA3792,RCFA5310,RCFA5311,RCFA7204,RCFA7205,RCFA7206,RCFA8274,...,RCOAP866,RCOAP867,RCOAP868,RCOAP870,RCOAP872,RCOAP875,RCOAQ258,RCOAS540,RCOWH312,Unnamed: 154
0,,TOTAL ASSETS,ALLOCATED TRANSFER RISK RESERVES,TOTAL RISK-BASED CAPITAL,AACL INCLUDIBLE IN TIER2 CAPITAL,TIER 2 (SUPPLEMENTARY) CAPITAL,TIER 1 LEVERAGE CAPITAL RATIO,TOTAL RISK-BASED CAPITAL RATIO,TIER 1 RISK-BASED CAPITAL RATIO,TIER 1 CPTL ALLWBL UNDR RISK-BASED,...,TIER 2 CAP INSTRMNTS PLUS RLTD SRPLS,NNQLIFY CAP INSTR PHSOUT TIER2 CAP,CAP MNRTY INTRST NOT INCL TIER 1 CAP,TIER 2 CAPITAL BEFORE DEDUCTIONS,TOTAL TIER 2 CAPITAL DEDUCTIONS,DEDS COMEQTY TIER1 CAP ADD TIER1 CAP,CUMULATIVE CHANGE IN FAIR VALUE,UNCONDITIONALLY OR OTHER AMOUNT,TOTAL APPLICABLE CAPITAL BUFFER,
1,37.0,,,,,,,,,,...,,,,,,0,0,0,,
2,242.0,,,,,,,,,,...,,,,,,0,0,0,,
3,279.0,,,,,,,,,,...,,,,,,132,0,0,,
4,354.0,,,,,,,,,,...,,,,,,587,0,0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4539,5879201.0,,,,,,,,,,...,0,0,0,186,0,0,0,,,
4540,5887420.0,,,,,,,,,,...,0,0,0,0,0,0,0,,,
4541,5903517.0,,,,,,,,,,...,,,,,,0,0,105,,
4542,5972661.0,,,,,,,,,,...,0,0,0,0,0,0,0,,,


In [None]:
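def load_one(path: Path) -> pd.DataFrame:
    """Load one tab-delimited schedule file with every column read as a string."""
    df = pd.read_csv(
        path,
        sep="\t",
        header=0,       # codes in row 0
        low_memory=False,
        dtype=str,      # read as string first to avoid mixed-type surprises
    )
    # Note: the description row is still present as the first data row.
    return df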

F1-Score - a machine learning evaluation metric that measures a classifier's accuracy by taking the harmonic mean of precision and recall. It's a particularly useful metric for evaluating models, especially with imbalanced datasets, because it balances the trade-off between high precision and high recall. A perfect F1 score is 1 (100%), while the worst score is 0.
https://www.google.com/search?client=firefox-b-1-d&q=machine+learning+f1+score
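
As a worked example of the definition above, F1 can be computed directly from confusion-matrix counts. The counts below are made up; `f1_from_counts` is a helper written for this note, not a library function.

```python
def f1_from_counts(tp, fp, fn):
    """Harmonic mean of precision and recall, from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Made-up counts: 8 true positives, 2 false positives, 8 false negatives.
# precision = 8 / 10 = 0.8, recall = 8 / 16 = 0.5
f1 = f1_from_counts(tp=8, fp=2, fn=8)
print(f"F1 = {f1:.4f}")
```

Because the harmonic mean is pulled toward the smaller of the two values, a model cannot get a high F1 by maximizing precision alone or recall alone.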

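Mean Absolute Error - Average absolute difference between the model's predictions and the actual values (for a fitted trend line, the average vertical distance of the data points from the line)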
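
A small worked example with made-up numbers, using a helper named `mae` that is written here for illustration:

```python
def mae(y_true, y_pred):
    """Mean of the absolute differences between actual and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [10.0, 12.0, 8.0, 6.0]
y_pred = [11.0, 12.0, 6.0, 7.0]

# Absolute errors are 1, 0, 2, 1, so the mean is 4 / 4.
error = mae(y_true, y_pred)
print(error)
```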

Validation Accuracy (For Random Forest) - How often the model is correct overall on the validation set. Less useful when the classes are not evenly split (50/50 for two classes, 25/25/25/25 for four, and so on)
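
To illustrate why accuracy misleads on uneven classes, the sketch below scores a "model" that always predicts the majority class. The labels are invented and the helper functions are written for this note.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall_for(y_true, y_pred, positive=1):
    """Fraction of actual positives that the predictions caught."""
    preds_on_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_on_positives:
        return 0.0
    return sum(p == positive for p in preds_on_positives) / len(preds_on_positives)

# Made-up validation set: 97 negatives, 3 positives (1 = ratio below 6%).
y_true = [0] * 97 + [1] * 3
# A useless model that never predicts the positive class.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
rec = recall_for(y_true, y_pred)
print(f"accuracy = {acc:.2f}, recall on positives = {rec:.2f}")
```

The accuracy looks strong even though not a single positive case is found, which is the situation F1 is meant to expose.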

In [None]:
# Read in one file. 
# List columns and descriptions
# List any file that does not contain those exact same columns
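
A possible shape for that comparison, assuming the column codes of each file are already available as lists (for example from `files_info`). The file names and codes below are placeholders, and the comparison ignores column order.

```python
def files_with_different_columns(columns_by_file, reference_file):
    """Return, for each file whose column set differs from the reference, what is missing and what is extra."""
    reference = set(columns_by_file[reference_file])
    diffs = {}
    for name, codes in columns_by_file.items():
        current = set(codes)
        if current != reference:
            diffs[name] = {
                "missing": sorted(reference - current),
                "extra": sorted(current - reference),
            }
    return diffs

# Hypothetical inputs: file name -> list of column codes.
columns_by_file = {
    "q1.txt": ["IDRSSD", "RCFD2170", "RCFD3792"],
    "q2.txt": ["IDRSSD", "RCFD3792", "RCFD2170"],   # same columns, different order
    "q3.txt": ["IDRSSD", "RCFD2170", "RCFA7206"],   # one missing, one extra
}

diffs = files_with_different_columns(columns_by_file, reference_file="q1.txt")
for name, d in diffs.items():
    print(name, "missing:", d["missing"], "extra:", d["extra"])
```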

In [None]:
# Need more quarters' worth of data; get at least 50 samples with a <6% ratio. Just download as many as possible.
# Look at all columns for each file
# Any that are exactly the same, great
# If not, do they have what we need?
# Check dtypes