## Financial Statements Coding Tool

By Ken Burchfiel

Released under the MIT license

All trademarks (American Express, Truist, Visa, etc.) remain the property of their respective owners.

This script is designed to help code financial expenses. It first imports financial statements (such as .csv files that you have downloaded from your checking and/or credit card accounts), then assigns codes to each transaction based on the description provided for that transaction.

This Jupyter Notebook shows how the script can code and analyze simulated financial data. To make it fit your own needs, you will want to tweak the expense categories and the descriptions that correspond with each category. You will probably also need to modify the statement import functions to match your own bank and/or credit card provider. The time you put into these updates should help speed up your future financial analyses.

The script will first import various libraries and designate various folder paths:

In [1]:
import pandas as pd
import numpy as np
import os
import time
start_time = time.time()
import matplotlib.pyplot as plt
statements_top_folder = 'simulated_spending_data' # This folder contains 
# financial statements that the script will import.
data_output_folder = 'coded_transactions' # The coded version of the financial
# statements will go here.
expense_codes_path = 'sample_finance_codes.csv' # This file contains a list 
# of financial statement codes. You'll probably want to replace this file with
# a list of your own codes, but you can certainly use it as a starting point.

The following cell imports a sample list of expense codes. Note that each row has both a code and a subcode. For instance, 'C' is the 'Car' code, and 'C-M' is a code for car maintenance expenses.

In [2]:
expense_codes = pd.read_csv(expense_codes_path).drop('Notes', axis = 1)

expense_codes.head(20)

Unnamed: 0,Code,Subcode,Code_Description,Subcode_Description
0,A,A-X,ATM,ATM Withdrawals
1,B,B-C,Basics,Convenience Stores
2,B,B-B,Basics,Books
3,B,B-F,Basics,Pets
4,B,B-G,Basics,Gifts
5,B,B-P,Basics,Photo Prints
6,B,B-S,Basics,Shipping
7,B,B-T,Basics,Local Public Transit
8,B,B-R,Basics,Cabs and Rideshare
9,B,B-V,Basics,Cash App Payments


## Reading in Financial Statements

The next cell creates a list of all financial statements to be imported into the program. These statements are stored within the path assigned to the statements_top_folder variable. 

The script also creates 'Account_Type' and 'file_type' values for each statement. file_type refers to the format of the original data, and tells the script how to interpret a given financial statement. Account_Type is more specific: it distinguishes individual accounts that share the same file type.

For instance, say you have two different Visa credit card accounts. Both of your account statements will have the same file type (which you can label as 'visa'), but in order to distinguish them, you can assign different account types to these statements. For instance, if your first Visa card ends in 1234 and your second card ends in 5678, you could use 'visa_1234' and 'visa_5678' as your Account_Type names. 

The script treats everything preceding the first *hyphen* in the file name as the *account type* and everything preceding the first *underscore* as the *file type*. Therefore, you'll want to rename your statements so that the account types and file types are interpreted correctly by the script. For example, you could name the statement for your Visa card ending in 1234 'visa_1234-2022_data.csv'. 
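
As a quick illustration of this naming rule, the following minimal sketch applies the same two splits to a single string. The file name is the hypothetical Visa example from the paragraph above, not one of the simulated statements.

```python
# Hypothetical file name that follows the naming convention described above
file_name = 'visa_1234-2022_data.csv'

account_type = file_name.split('-')[0]  # text before the first hyphen
file_type = file_name.split('_')[0]     # text before the first underscore

print(account_type)  # visa_1234
print(file_type)     # visa
```

Note that any underscores or hyphens appearing later in the file name (as in '2022_data') don't affect the result, since only the text before the first separator is kept.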

In [3]:
statements_file_list = []
for root, dirs, files in os.walk(statements_top_folder): 
    statements_file_list.extend([{"name":name, "path":os.path.join(
        root, name)} for name in files])
# See https://docs.python.org/3/library/os.html#os.walk
df_statements = pd.DataFrame(statements_file_list)

df_statements['Account_Type'] = df_statements['name'].str.split('-').str[0]
df_statements['file_type'] = df_statements['name'].str.split('_').str[0]
df_statements.reset_index(drop=True,inplace=True)
df_statements

Unnamed: 0,name,path,Account_Type,file_type
0,amex_1-simulated_2022_data.csv,simulated_spending_data\amex_1-simulated_2022_...,amex_1,amex
1,truist_1-simulated_2022_data.csv,simulated_spending_data\truist_1-simulated_202...,truist_1,truist


## Processing different financial statement types:

Different banks and credit card companies will have different formats for their statements. One bank might choose to list dates first, followed by descriptions and amounts, whereas other banks may choose to list descriptions first, followed by amounts and dates. (In addition, one bank may choose to make transactions negative, whereas another may use positive numbers for transactions.) 

In order to correctly analyze data, each of these statements needs to be converted to the same format. That is the purpose of the following cells, which convert American Express and Truist data into a common format that the rest of the script can process correctly. You'll need to add other functions for your own credit card and bank statements (unless they happen to use the exact same format as American Express or Truist).

The following cells show how to process American Express and Truist records. 

In [4]:
def process_amex_statement(file_path, account_type = None):
    '''This function takes American Express financial statement 
    data and converts it into a format that the script can use.
    It also adds in an Account_Type column to help distinguish
    this data from other financial statement data.
    '''
    expected_column_list = ['Date', 'Description', 'Amount']
    df_csv = pd.read_csv(file_path)
    # The following if statement raises an error if the columns present
    # in the data aren't the ones expected.
    if df_csv.columns.to_list() != expected_column_list:
        raise ValueError("Columns in .csv file are different than expected")
    new_column_order = ['Date', 'Amount', 'Description']
    df_csv = df_csv[new_column_order] # This line rearranges the columns
    # in the original data to match the column format used by this script.
    df_csv['Date'] = pd.to_datetime(df_csv['Date'])
    df_csv['Account_Type'] = account_type

    return df_csv

# Here is an example of the function in use:
df_amex_example = process_amex_statement(df_statements.iloc[0,1], 
account_type = df_statements.iloc[0,2])
df_amex_example

Unnamed: 0,Date,Amount,Description,Account_Type
0,2022-05-06,32.94,310 BOWERY 0000 NEW YORK NY,amex_1
1,2022-01-09,132.30,AIRBNB SAN FRANCISCO CA,amex_1
2,2022-11-06,3.20,ALHEGAZY HALAL FOOD NEW YORK NY,amex_1
3,2022-11-08,11.03,Arts and Crafts BeerNew York NY,amex_1
4,2022-11-23,47.54,BEST BUY NEW YORK NY,amex_1
...,...,...,...,...
259,2022-09-28,86.06,UNIVERSITY HARDWARE NEW YORK NY,amex_1
260,2022-07-20,21.95,VIZCAYA MUSEUM AND GMIAMI FL,amex_1
261,2022-04-21,11.87,WESTSIDE MARKET,amex_1
262,2022-02-14,2.92,WESTSIDE MARKET 2840NEW YORK NY,amex_1


Truist records require a bit more processing, since amounts contain parentheses (in the case of negative numbers) and dollar signs. Therefore, the following function is lengthier than the function for American Express statements.

In [5]:
def process_truist_statement(file_path, account_type = None):
    '''This function takes Truist financial statement 
    data and converts it into a format that the script can use.
    It also adds in an Account_Type column to help distinguish
    this data from other financial statement data.'''
    expected_column_list = ['Date',
 'Transaction Type',
 'Check/Serial #',
 'Description',
 'Amount',
 'Daily Posted Balance']
    df_csv = pd.read_csv(file_path)
    if df_csv.columns.to_list() != expected_column_list:
        raise ValueError("Columns in .csv file are different than expected")
    new_column_list = ['Date', 'Amount', 'Description']
    df_csv = df_csv[new_column_list] # This line both reorders the columns
    # and removes columns that are not necessary for the financial statement
    # analysis.
    
    # Truist represents negative numbers with parentheses. The 
    # following lines replace the opening parenthesis with a negative
    # sign and remove the closing parenthesis and the dollar sign so
    # that the values can be converted to numbers.
    df_csv['Amount'] = df_csv['Amount'].astype('str')
    df_csv['Amount'] = df_csv['Amount'].str.replace(')','', regex = False) 
    df_csv['Amount'] = df_csv['Amount'].str.replace('$','', regex = False)
    df_csv['Amount'] = df_csv['Amount'].str.replace('(','-', regex = False)
    df_csv['Amount'] = df_csv['Amount'].astype(float)
    df_csv['Amount'] = df_csv['Amount']*-1 # Truist makes expenses 
    # negative by default, but I'm changing them to positive numbers. 
    # (Otherwise, bar and line charts showing spending amounts will tend
    # to be oriented negatively.)
    df_csv['Date'] = pd.to_datetime(df_csv['Date'])
    df_csv['Account_Type'] = account_type

    return df_csv

# Here's an example of this function at work:
df_truist_example = process_truist_statement(df_statements.iloc[1,1], account_type = df_statements.iloc[1,2])
df_truist_example

Unnamed: 0,Date,Amount,Description,Account_Type
0,2022-09-29,-0.52,NEW YORK NY JOES GOURMET DELI FOREIGN ATM SURC...,truist_1
1,2022-07-11,62.23,Roman Catholic C INTERNET PAYMENT,truist_1
2,2022-05-30,386.15,ACH PMT AMEX EPAYMENT INTERNET PAYMENT,truist_1
3,2022-07-10,70.52,ACH PMT AMEX EPAYMENT INTERNET PAYMENT,truist_1
4,2022-07-19,139.45,ACH PMT AMEX EPAYMENT INTERNET PAYMENT,truist_1
...,...,...,...,...
57,2022-12-11,1.11,RIO METRO REGIONAL ALBUQUERQUE NM DEBIT CARD ...,truist_1
58,2022-01-22,73.04,SOCIETY LITTLE FLO DEBIT CARD PURCHASE,truist_1
59,2022-10-23,28.48,ST PATRICK CHURCH DEBIT CARD PURCHASE,truist_1
60,2022-07-03,2.40,TAXI CORDOBA CORDOBA DEBIT CARD PURCHASE,truist_1
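
To see the amount cleanup in isolation, here is a minimal plain-Python sketch that applies the same replacements to a few made-up strings. The helper name `parse_statement_amount` and the sample values are hypothetical, and the comma handling is an assumption: the simulated data doesn't show how larger amounts are formatted, but a thousands separator would otherwise prevent the conversion to a number.

```python
def parse_statement_amount(text):
    """Convert a string like '($1,234.50)' or '$20.00' to a float.

    Hypothetical helper; the comma handling is an assumption, since the
    simulated data does not show amounts with thousands separators.
    """
    cleaned = (text.replace(')', '')    # drop the closing parenthesis
                   .replace('$', '')    # drop the dollar sign
                   .replace(',', '')    # drop thousands separators (assumed)
                   .replace('(', '-'))  # opening parenthesis -> minus sign
    return float(cleaned)

# Flip the sign afterwards so that expenses end up positive,
# as process_truist_statement() does.
for raw in ['($12.50)', '$100.00', '($1,234.50)']:
    print(raw, '->', parse_statement_amount(raw) * -1)
```

The same chain of replacements could be applied to a whole column with the pandas .str.replace() calls used in process_truist_statement().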


## Creating a single DataFrame to store all financial transactions:

It's now finally time to convert the financial statements stored in statements_top_folder into a unified DataFrame so that they can be coded together.

In [6]:
def create_expenses_df(df_statements):
    ''' This function creates a list of DataFrames with financial statement
    data that share the same format. It does so by first 
    reading the original data and then calling functions 
    that can convert that data into a shared format.'''
    df_list = []
    for i in range(len(df_statements)):
        file_type = df_statements.loc[i, 'file_type']
        path = df_statements.loc[i, 'path']
        account_type = df_statements.loc[i, 'Account_Type']
        # This function uses the file_type column to
        # determine how to interpret a given financial 
        # statement. You'll probably need to update this
        # list of file_type values to match your own 
        # statements. (You'll probably also need to create
        # corresponding functions that can process those
        # records.)
        if file_type == 'amex':
            df_list.append(process_amex_statement(
                file_path = path, account_type = account_type))
        elif file_type == 'truist':
            df_list.append(
                process_truist_statement(
                    file_path = path, account_type = account_type))
        else:
            print("Function not in place for file type. Skipping statement")
    return df_list

This function will now be called for this project's simulated financial data.

In [7]:
df_expenses_list = create_expenses_df(df_statements=df_statements)
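
If you end up supporting many statement formats, the if/elif chain inside create_expenses_df could be swapped for a dictionary lookup. The following is only a sketch of that idea: the two stub functions stand in for process_amex_statement() and process_truist_statement(), and the names `processors` and `process_statement` are hypothetical.

```python
def process_amex_stub(file_path, account_type=None):
    # Stand-in for process_amex_statement(); returns a tuple instead
    # of a DataFrame so that the sketch runs without any input files.
    return ('amex', file_path, account_type)

def process_truist_stub(file_path, account_type=None):
    # Stand-in for process_truist_statement()
    return ('truist', file_path, account_type)

# Maps each file_type value to the function that can process it.
# Supporting a new bank would mean adding one entry here.
processors = {'amex': process_amex_stub, 'truist': process_truist_stub}

def process_statement(file_type, file_path, account_type):
    processor = processors.get(file_type)
    if processor is None:
        print(f"Function not in place for file type '{file_type}'. Skipping.")
        return None
    return processor(file_path=file_path, account_type=account_type)

print(process_statement('amex', 'amex_1-data.csv', 'amex_1'))
print(process_statement('visa', 'visa_1234-data.csv', 'visa_1234'))
```

With this layout, the loop body only needs to call the lookup function and append any result that isn't None.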

Since all of the financial statement DataFrames are now in the same format, they can be concatenated into a single DataFrame, which the following cell accomplishes.

In [8]:
df_finances = pd.concat(df_expenses_list)
# df_finances.drop_duplicates(inplace=True) # Note: this was dropped because
# sometimes duplicate expenses are valid (e.g. you might use the subway
# twice in one day, or you might have two identical flight records if you
# book two plane tickets.)
df_finances['Subcode'] = '' # Expense subcodes for many rows 
# will be added in soon. 
df_finances['Month'] = df_finances['Date'].dt.month # This column will make it
# easier to compare spending for different months.
df_finances

Unnamed: 0,Date,Amount,Description,Account_Type,Subcode,Month
0,2022-05-06,32.94,310 BOWERY 0000 NEW YORK NY,amex_1,,5
1,2022-01-09,132.30,AIRBNB SAN FRANCISCO CA,amex_1,,1
2,2022-11-06,3.20,ALHEGAZY HALAL FOOD NEW YORK NY,amex_1,,11
3,2022-11-08,11.03,Arts and Crafts BeerNew York NY,amex_1,,11
4,2022-11-23,47.54,BEST BUY NEW YORK NY,amex_1,,11
...,...,...,...,...,...,...
57,2022-12-11,1.11,RIO METRO REGIONAL ALBUQUERQUE NM DEBIT CARD ...,truist_1,,12
58,2022-01-22,73.04,SOCIETY LITTLE FLO DEBIT CARD PURCHASE,truist_1,,1
59,2022-10-23,28.48,ST PATRICK CHURCH DEBIT CARD PURCHASE,truist_1,,10
60,2022-07-03,2.40,TAXI CORDOBA CORDOBA DEBIT CARD PURCHASE,truist_1,,7


## Coding expenses

Now that all expenses have been added together, it's time to begin assigning different codes to them. These codes will make it easier to analyze your spending patterns.

Many of your transactions will probably take place at the same locations each year. By entering these locations into this script, you can automatically assign codes to them. For less-frequent transactions (such as a payment at a restaurant you'll never go to again), it may be easier to manually code these later on. 

The following cell shows which spending descriptions appear the most in the simulated financial data. The highest-ranking transaction, "NYCT PAYGO NEW YORK NY" (which refers to a NYC subway ticket purchase), appears 105 times. Therefore, it makes sense to add this description to the list of expenses to automatically categorize.

In [25]:
purchases_by_description = df_finances.groupby('Description')['Amount'].count().reset_index()
purchases_by_description.sort_values('Amount', ascending = False, inplace = True)
purchases_by_description.head(10)

Unnamed: 0,Description,Amount
75,NYCT PAYGO NEW YORK NY,105
1,ACH PMT AMEX EPAYMENT INTERNET PAYMENT,19
17,CITY FRESH MARKET 00ASTORIA NY,15
82,RENT PAYMENT VIA ZELLE,12
77,ONLINE PAYMENT - THANK YOU,11
79,PAYMENT VENMO INTERNET PAYMENT,10
27,DELTA AIR LINES,6
20,CMSVEND*CV FARMINGDAAMITYVILLE NY,5
22,CURB TAXI APP CURB TLONG IS CITY NY,4
45,GglPay NYCT PAYGO NEW YORK NY,4


The following cell defines a function for assigning a subcode to purchases that fall within a certain category. See the commentary within the function for more information.

In [12]:
def add_subcodes(df, subcode, expense_descriptors):
    '''This function assigns subcodes to purchases that fall within a 
    particular category.

    df refers to the DataFrame that contains financial statements to be
    categorized.
    
    subcode refers to the financial transaction code that should be assigned
    to rows whose description contains one of the strings found within
    expense_descriptors.
    
    expense_descriptors is a string that contains values that represent
    the subcode. For instance, 'restaurant' might be used to categorize
    a 'dining out' subcode. The function would then assign this subcode
    to descriptions like "Seaside Restaurant" and "DTW Restaurants Group"
    because these descriptions contain 'restaurant'. (The strings are not
    case-sensitive.) 
    You'll normally have more than one string for a given category. For 
    instance, suppose you often shop at 3 grocery stores: Great Value, 
    Tasty Village, and PriceSlash. In order to assign descriptions to your 
    'food-groceries' subcode, you could make your expense_descriptors argument
    'great value|tasty village|priceslash'. This argument makes use of the 
    regex '|' (or) operator to specify that descriptions containing 
    'great value', 'tasty village', or 'priceslash' should be assigned to 
    the 'food-groceries' subcode.
    For the use of '|' to check multiple examples, visit the following link:
    https://pandas.pydata.org/docs/reference/api/pandas.Series.str.contains.html
    '''
    
    initial_count = df.query("Subcode != ''")['Subcode'].count() # The function
    # counts the rows that have any subcode both before and after the update
    # so that it can report how many previously uncoded rows received the
    # specified subcode. (Rows whose existing subcode gets overwritten 
    # aren't included in this count.)
    df['Subcode'] = np.where(df['Description'].str.lower().str.contains(
        expense_descriptors.lower(), regex = True) == True, 
        subcode, df['Subcode']) # This line uses np.where(), str.lower(),
        # and str.contains() to assign subcodes to rows within the DataFrame.
        # Rows whose description contains one of the values contained within
        # expense_descriptors will be assigned the subcode; rows that 
        # don't contain one of those values will remain unmodified. 
        # str.lower() is used to convert both the description and the
        # expense_descriptor value to lowercase so that this process
        # won't be case sensitive.
    new_count = df.query("Subcode != ''")['Subcode'].count()
    print(f"Added {subcode} to {new_count - initial_count} entries.") 
    return df

# An alternative to np.where is Series.where. Regarding Series.where syntax, 
# see: https://pandas.pydata.org/docs/reference/api/pandas.Series.where.html
# As described in the documentation, this syntax is somewhat the opposite 
# of np.where. In rows where the condition is true, the original value
# will be kept; in rows where it is false, the value will be replaced.
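
For comparison, here is a minimal sketch of what a Series.where version of the assignment might look like. The function name `add_subcodes_with_where` and the two-row demo DataFrame are hypothetical, and the sketch leaves out the counting and printing that add_subcodes() performs.

```python
import pandas as pd

def add_subcodes_with_where(df, subcode, expense_descriptors):
    # Hypothetical variant of add_subcodes() built on Series.where
    matches = df['Description'].str.lower().str.contains(
        expense_descriptors.lower(), regex=True)
    # Series.where keeps values where the condition is True and replaces
    # the rest, so the condition passed in is "row does NOT match".
    df['Subcode'] = df['Subcode'].where(~matches, subcode)
    return df

# Made-up demo data with two descriptions from the simulated statements
df_demo = pd.DataFrame({
    'Description': ['NYCT PAYGO NEW YORK NY', 'BEST BUY NEW YORK NY'],
    'Subcode': ['', '']})
df_demo = add_subcodes_with_where(df_demo, 'B-T', 'nyct paygo')
print(df_demo['Subcode'].tolist())  # ['B-T', '']
```

This sketch assumes that no descriptions are missing; the rest of this notebook continues to use add_subcodes().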

For the above function to correctly process your financial data, you'll need to tell the program which expense descriptors match which subcodes. That's what the following cell accomplishes. Note the use of the '|' regex operator to specify multiple expense descriptors for certain categories.

Be careful to escape (with a backslash) any regex symbols present in the descriptions you enter. For a list of these symbols, see: https://docs.python.org/3/library/re.html
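
One way to avoid escaping symbols by hand is Python's re.escape() function. The sketch below is an illustration rather than part of the script: it builds a '|'-separated pattern from a hypothetical list of descriptors, one of which contains '*', and checks it against a description from the simulated data.

```python
import re

# Descriptors as they might be typed by hand; the first one contains '*',
# which is a regex quantifier rather than a literal character.
descriptors = ['cmsvend*cv', 'city limits vending']

unescaped_pattern = '|'.join(descriptors)
escaped_pattern = '|'.join(re.escape(d) for d in descriptors)

description = 'CMSVEND*CV FARMINGDAAMITYVILLE NY'.lower()

# Unescaped, 'd*' means "zero or more d characters", so the literal
# asterisk in the description prevents a match.
print(bool(re.search(unescaped_pattern, description)))  # False
print(bool(re.search(escaped_pattern, description)))    # True
```

Escaping every descriptor does mean giving up intentional regex features such as '.*', so descriptors that rely on those features would need to be added to the pattern without escaping.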

At this point, you might be asking how to figure out which expense descriptors to use for each subcode. I'll explain this in more detail a few cells later.

In [13]:
code_description_pairs_list = [] # This list will store multiple tuples. The 
# first item in each tuple will be a given subcode, and the second item 
# will be a string representing the descriptors that represent that subcode.

# ATM withdrawals:
code_description_pairs_list.append(("A-X", " atm "))

# Convenience stores:
code_description_pairs_list.append(("B-C", "duane reade|cvs|rite aid"))

# Books:
code_description_pairs_list.append(("B-B", "libreria san pablo|abebooks|magnificat"))

# Pets:
code_description_pairs_list.append(("B-F", "petco"))

# Gifts:
code_description_pairs_list.append(("B-G", "flowers by valli"))

# Taxis and ridesharing:
code_description_pairs_list.append(("B-R", "uber|lyft|curb taxi app|gett|taxi cordoba|nyc taxi|taxi tmc"))

# Shipping: 
code_description_pairs_list.append(("B-S", "usps"))

# Subway and other public transit (in area where you live):
code_description_pairs_list.append(("B-T", "nyct paygo|omnypyg|mnr etix ticket|nyc ferry"))

# Other basics:
code_description_pairs_list.append(("B-X", "klean kanteen"))


# Dining out:
code_description_pairs_list.append(("D-X", "pizza|koronet|dostoros|dos toros|king of falafel|jin 00|columbia university new york|arts and crafts beer|mel's burger bar|diginn|sanfords|junzi kitchen|nyc bagel & coffee|madisoncafe|madison cafe|haagen-dazs|milano market|pita hot|taste of italy|angry orchard cidery|astoria taco factory|baum 00034|blue bottle coffee|butterfunnew|kennedy fried chicke|palace restaurant|flafel on broadway|junko sushi|birch coffee|by the way bakery|cafe metro|citi field concessio|connollys pub|daruma|first on first deli|five guys ny|food hall d|frankie's dogs on the|friedmans|garnet wines and liq|gglpay columbia univ|sip a cup|hamilton deli|hooda halal|hula poke|katsuhama|lili and loo|La Salle Dumpling|mottley kitchen|mottsu|arepas grill|athens grill|peking garden|shake shack         new york|samad's|senza gluten|sweetgreen columbia|toms restaurant|famousfamiglia|symposium restaurantnew york|shake shack - 1197|ample hills|cho dang gol 30new york|dear mama- westn|go go curry 300n|le pain quotidinew york|la masa|alhegazy halal food|310 bowery|osaka japanese|the bronx breweb|the heights ny new york|the thirsty koaastoria|baum 30052"))

# Dining out (using '.*' to match any characters between two words):
code_description_pairs_list.append(("D-X", "orale.*tacos"))

# Sporting events:
code_description_pairs_list.append(("E-S", "stubhub"))

# Newspapers:
code_description_pairs_list.append(("E-W", "wall-st-journal"))

# Alcohol:
code_description_pairs_list.append(("F-A", "international wines|angela's wine"))

# Grocery stores:
code_description_pairs_list.append(("F-G", "westside mark|morton willia|trade fair|city fresh market|h mart|tokyo market"))

# Vending machines:
code_description_pairs_list.append(("F-V", "cmsvend|city limits vending"))

# Home storage:
code_description_pairs_list.append(("H-C", "the container store"))

# House furnishings:
code_description_pairs_list.append(("H-F", "university hardware|home depot|homedepot|ikea|lowes"))

# Moving:
code_description_pairs_list.append(("H-M", "u-haul"))

# Rent:
code_description_pairs_list.append(("H-R", "rent payment"))

# Investment withdrawals:
code_description_pairs_list.append(("J-S", "moneyline.*credit"))

# YouTube earnings:
code_description_pairs_list.append(("J-Y", "adsense|youtube_pa"))

# Phone payments:
code_description_pairs_list.append(("K-P", "t-mobile"))

# Crowdsourced donations:
code_description_pairs_list.append(("L-G", "gofundme"))

# Charitable donations:
code_description_pairs_list.append(("L-X", "catholic charities|Consumer myeoffering|catholic relief se|focus|roman cathol    new york|Roman Catholic C|educando by worldf|society little flo"))

# Eye care:
code_description_pairs_list.append(("M-E", "lenscrafters|sp lenses for less|gammarayoptix"))

# Video/Computer Games:
code_description_pairs_list.append(("P-G", "nintendo|pmdg simulations|steampowered|typeracer"))

# Air travel:
code_description_pairs_list.append(("T-A", "delta|united airlines|southwest airlines|american airlines|lufthansa|frontier|jetblue"))

# Bus Travel:
code_description_pairs_list.append(("T-B", "njt pabt|wanderu|washington deluxe bub"))

# Travel-related entertainment:
code_description_pairs_list.append(("T-E", "frost science|vizcaya museum|sky views miami"))

# Food during trips:
code_description_pairs_list.append(("T-F", "fish tales|peak food court|1914 by kolben|farmer's fridge|paschals concourse batlanta|cafe castro|grapevine micro-mark|cafe on the ave|comida buena|farmer brown|frontera grill|herberer term|hooked on the vine|hotel alhambra|duke's lake union ch|hotel on booking|k-1 cafeteria|kroger mid atl|lgad bisoux|moxies grill and barmiami|whataburger|smashburger b7c ewr|tomasita's|izanami 0004|northside socia|atl 6065 low country"))

# Hotels:
code_description_pairs_list.append(("T-H", "airbnb"))

# Parking on trips:
code_description_pairs_list.append(("T-P", "dfw airport parking"))

# Public transit as part of trips:
code_description_pairs_list.append(("T-S", "njt rail|renfe virtual internmadrid|metro de madrid|metrocard/airtrain|smartrip|sound transit|kfir train|clipper systems|dc transit service cmiami"))

# Clothing:
code_description_pairs_list.append(("W-C", "robbie & co"))

# Shoes:
code_description_pairs_list.append(("W-S", "clarks"))


# Items to exclude from budget analysis (e.g. so as not to incur double counting):
code_description_pairs_list.append(("Z-C", "payment - thank you|ach pmt amex epayment|payment to credit card"))

code_description_pairs_list.append(("Z-X", "mobile deposit"))


# code_description_pairs_list.append(("", ""))

# code_description_pairs_list.append(("", ""))



The next cell uses this list of tuples to add subcodes into df_finances. It also reports how many times each subcode was added.

In [14]:
for pair in code_description_pairs_list:
    df_finances = add_subcodes(df = df_finances, subcode = pair[0], expense_descriptors = pair[1])

Added A-X to 3 entries.
Added B-C to 7 entries.
Added B-B to 2 entries.
Added B-F to 0 entries.
Added B-G to 4 entries.
Added B-R to 5 entries.
Added B-S to 0 entries.
Added B-T to 115 entries.
Added B-X to 1 entries.
Added D-X to 38 entries.
Added D-X to 0 entries.
Added E-S to 1 entries.
Added E-W to 3 entries.
Added F-A to 2 entries.
Added F-G to 22 entries.
Added F-V to 8 entries.
Added H-C to 0 entries.
Added H-F to 8 entries.
Added H-M to 1 entries.
Added H-R to 12 entries.
Added J-S to 3 entries.
Added J-Y to 0 entries.
Added K-P to 0 entries.
Added L-G to 1 entries.
Added L-X to 11 entries.
Added M-E to 1 entries.
Added P-G to 2 entries.
Added T-A to 9 entries.
Added T-B to 0 entries.
Added T-E to 1 entries.
Added T-F to 4 entries.
Added T-H to 1 entries.
Added T-P to 0 entries.
Added T-S to 4 entries.
Added W-C to 1 entries.
Added W-S to 0 entries.
Added Z-C to 32 entries.
Added Z-X to 1 entries.


The following cell reports the most frequently occurring descriptions that have no subcode added to them. This data should prove very useful as you build out your list of subcode/descriptor pairs, since it will help you determine which descriptions are worth adding to your script and which aren't. The process will be as follows:

1. Run the above cells and the following cell
2. Create a list of the most frequently occurring descriptions that don't have a designated subcode
3. Add these descriptions to the matching subcode entries in the list above, if possible
4. Repeat the above 3 steps until you're satisfied with the level of automation built into the script

It's OK to leave some descriptions uncategorized, either because they occur less frequently or because the code can't be determined from the description alone. (For instance, 'PAYMENT VENMO' could represent both paying your rent and reimbursing friends for restaurant meals, so it may not be possible to automatically categorize those expenses.) You can manually enter the subcodes for these descriptions after this script runs.

In [15]:
df_finances.query("Subcode == ''")['Description'].value_counts().head(40)

PAYMENT VENMO INTERNET PAYMENT                            10
BEST BUY            NEW YORK            NY                 1
CHIPOTLE 0921 0000  NEW YORK            NY                 1
DELL INC            ROUND ROCK          TX                 1
EL TORO ROJO        ASTORIA             NY                 1
FRANKIE S DOGS ON THNEW YORK            NY                 1
LOS ALJIBES ALHAMBRAGRANADA             ES                 1
MANUEL J REGALOS 208CORDOBA             ES                 1
RUBYS CANDY & GROCER                                       1
T2 BOOK STORE       QUEENS              NY                 1
TST* FRANKLIN PARK                                         1
RIO METRO REGIONAL ALBUQUERQUE NM  DEBIT CARD PURCHASE     1
ST PATRICK CHURCH DEBIT CARD PURCHASE                      1
TREAS DRCT TREASURY DIRECT AAAA  ACH DEBIT                 1
Name: Description, dtype: int64

Here's what df_finances looks like after the above cells have run. Note that many, but not all, lines have a Subcode entry. These subcodes match those shown in the sample_finance_codes.csv file.

In [16]:
df_finances

Unnamed: 0,Date,Amount,Description,Account_Type,Subcode,Month
0,2022-05-06,32.94,310 BOWERY 0000 NEW YORK NY,amex_1,D-X,5
1,2022-01-09,132.30,AIRBNB SAN FRANCISCO CA,amex_1,T-H,1
2,2022-11-06,3.20,ALHEGAZY HALAL FOOD NEW YORK NY,amex_1,D-X,11
3,2022-11-08,11.03,Arts and Crafts BeerNew York NY,amex_1,D-X,11
4,2022-11-23,47.54,BEST BUY NEW YORK NY,amex_1,,11
...,...,...,...,...,...,...
57,2022-12-11,1.11,RIO METRO REGIONAL ALBUQUERQUE NM DEBIT CARD ...,truist_1,,12
58,2022-01-22,73.04,SOCIETY LITTLE FLO DEBIT CARD PURCHASE,truist_1,L-X,1
59,2022-10-23,28.48,ST PATRICK CHURCH DEBIT CARD PURCHASE,truist_1,,10
60,2022-07-03,2.40,TAXI CORDOBA CORDOBA DEBIT CARD PURCHASE,truist_1,B-R,7
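
Because the subcodes are typed by hand in the list of subcode/descriptor pairs, it can be worth checking that each assigned subcode actually appears in the codes file. The sketch below shows one possible check on made-up data: both demo DataFrames are hypothetical stand-ins for expense_codes and df_finances, and 'B-Z' plays the role of a mistyped subcode.

```python
import pandas as pd

# Hypothetical stand-ins for expense_codes and df_finances
expense_codes_demo = pd.DataFrame({'Subcode': ['B-T', 'D-X', 'T-H']})
df_demo = pd.DataFrame({'Subcode': ['D-X', 'T-H', '', 'B-Z']})

# Ignore rows that haven't been coded yet
assigned = set(df_demo['Subcode']) - {''}

# Subcodes that were assigned but don't appear in the codes list
unknown = sorted(assigned - set(expense_codes_demo['Subcode']))
print(unknown)  # ['B-Z']
```

An empty list would indicate that every assigned subcode has a matching row in the codes list.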


The next cell shows the number of expenses coded so far, along with the percentage of all expenses that have been assigned a code.

In [17]:

print("Expenses Coded:",df_finances.query("Subcode != ''")['Subcode'].count())
print("{:.2%}".format(df_finances.query("Subcode != ''")['Subcode'].count()/len(df_finances)))
print("Uncoded Expenses:", df_finances.query("Subcode == ''")['Subcode'].count())

Expenses Coded: 303
92.94%
Uncoded Expenses: 23


In [18]:
df_finances.sort_values(['Subcode', 'Description'], inplace = True)
df_finances

Unnamed: 0,Date,Amount,Description,Account_Type,Subcode,Month
4,2022-11-23,47.54,BEST BUY NEW YORK NY,amex_1,,11
12,2022-08-26,13.52,CHIPOTLE 0921 0000 NEW YORK NY,amex_1,,8
47,2022-05-17,55.03,DELL INC ROUND ROCK TX,amex_1,,5
59,2022-06-16,4.51,EL TORO ROJO ASTORIA NY,amex_1,,6
66,2022-09-16,7.00,FRANKIE S DOGS ON THNEW YORK NY,amex_1,,9
...,...,...,...,...,...,...
223,2022-05-04,-17.34,ONLINE PAYMENT - THANK YOU,amex_1,Z-C,5
224,2022-07-03,-123.91,ONLINE PAYMENT - THANK YOU,amex_1,Z-C,7
225,2022-10-29,-163.57,ONLINE PAYMENT - THANK YOU,amex_1,Z-C,10
226,2022-09-21,-78.51,ONLINE PAYMENT - THANK YOU,amex_1,Z-C,9


The following cell creates a list of all duplicated entries. You can check over this list to help ensure that you didn't have any incorrect duplicate entries in your original data. 

Note that duplicate expenses are often valid. For instance, if you pay for two subway or airplane tickets on the same day, the date, amount, and description may all be the same. If you do find an incorrect duplicate, you can remove it from its corresponding financial statements .csv file or from the output of this script.

In [27]:
duplicate_entries = df_finances[df_finances.duplicated(keep = False)].copy()
# keep = False instructs the function to show all instances of the
# duplicated entry.
duplicate_entries.sort_values(['Subcode', 'Date', 'Amount', 'Description'], inplace = True)
duplicate_entries.to_csv(os.path.join(data_output_folder, 'duplicated_rows.csv'), index = False)
duplicate_entries

Unnamed: 0,Date,Amount,Description,Account_Type,Subcode,Month
165,2022-01-27,2.75,NYCT PAYGO NEW YORK NY,amex_1,B-T,1
200,2022-01-27,2.75,NYCT PAYGO NEW YORK NY,amex_1,B-T,1
127,2022-03-01,2.75,NYCT PAYGO NEW YORK NY,amex_1,B-T,3
197,2022-03-01,2.75,NYCT PAYGO NEW YORK NY,amex_1,B-T,3
145,2022-03-02,2.75,NYCT PAYGO NEW YORK NY,amex_1,B-T,3
162,2022-03-02,2.75,NYCT PAYGO NEW YORK NY,amex_1,B-T,3
115,2022-03-09,2.75,NYCT PAYGO NEW YORK NY,amex_1,B-T,3
199,2022-03-09,2.75,NYCT PAYGO NEW YORK NY,amex_1,B-T,3
133,2022-03-10,2.75,NYCT PAYGO NEW YORK NY,amex_1,B-T,3
198,2022-03-10,2.75,NYCT PAYGO NEW YORK NY,amex_1,B-T,3
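
To make the effect of keep = False more concrete, the following sketch compares it with the default setting on a small made-up DataFrame (the three rows are hypothetical, loosely modeled on the subway fares above).

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Date': ['2022-01-27', '2022-01-27', '2022-03-01'],
    'Amount': [2.75, 2.75, 2.75]})

# By default, only the second and later copies of a row are flagged
print(df_demo.duplicated().tolist())            # [False, True, False]

# With keep=False, every copy of a duplicated row is flagged
print(df_demo.duplicated(keep=False).tolist())  # [True, True, False]
```

Flagging every copy makes it easier to review duplicated transactions side by side, which is why the cell above uses keep = False.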


Finally, the script outputs the updated copy of df_finances to a .csv file. Your next step will be to go into this file and add in subcodes for expenses that are still uncategorized. (Save it under a different name so as not to overwrite it when re-running this script.) Once you've added in those subcodes, you can then import that updated file into the financial_statements_analyzer Jupyter notebook.

In [20]:
df_finances.to_csv(os.path.join(data_output_folder, 'finances_updated_in_python.csv'), index = False)

In [28]:
end_time = time.time()

run_time = end_time - start_time
run_minutes = run_time // 60
run_seconds = run_time % 60
print("Completed run at",time.ctime(end_time),"(local time)")
print("Total run time:",'{:.3f}'.format(run_time),
"second(s) ("+str(run_minutes),"minute(s) and",
'{:.3f}'.format(run_seconds),
"second(s))") # Only valid when the program is run nonstop from start to finish

Completed run at Sat Nov 26 14:25:26 2022 (local time)
Total run time: 3188.473 second(s) (53.0 minute(s) and 8.473 second(s))
