# Parsing Adjusted Gross Income Tax Tables
John Mays | maysj@omb.nyc.gov | Created: 03/11/25 | Last Updated: 03/11/25

Data is from the "Individual income tax returns with exemptions and itemized deductions > Publication 1304" category on the [IRS.gov website](https://www.irs.gov/statistics/soi-tax-stats-individual-statistical-tables-by-size-of-adjusted-gross-income).

In [12]:
import pandas as pd
import re
from pathlib import Path
from tqdm import tqdm # cool Arabic word here: taqadum (meaning: progress) = تقدم

In [13]:
data_directory = Path('../data')

In [14]:
list(data_directory.glob('*.xl*'))

[WindowsPath('../data/22in21id.xls'), WindowsPath('../data/99in21id.xls')]
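The two file names above appear to start with a two-digit year (`22`, `99`). As a hedged sketch, assuming that prefix really is the tax year and that prefixes of 50 or more belong to the 1900s, a helper like the one below could label each sheet by year. The function name `guess_tax_year` and the `pivot` parameter are hypothetical and not part of the original notebook; the naming convention should be checked against the IRS download page before relying on it.

```python
import re
from typing import Optional


def guess_tax_year(filename: str, pivot: int = 50) -> Optional[int]:
    """Guess a four-digit year from names like '22in21id.xls'.

    Assumption: the first two digits are the tax year, followed by 'in'.
    Assumption: two-digit values >= pivot are 19xx, the rest are 20xx.
    """
    match = re.match(r"^(\d{2})in", filename)
    if match is None:
        return None  # name does not follow the assumed pattern
    two_digit = int(match.group(1))
    century = 1900 if two_digit >= pivot else 2000
    return century + two_digit


years = {name: guess_tax_year(name) for name in ["22in21id.xls", "99in21id.xls"]}
print(years)
```

Returning `None` for names that do not fit the pattern keeps unexpected files visible instead of silently mislabeling them.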

## Collecting all of the files into dataframes:

In [15]:
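def collect_files(directory: Path) -> dict:
    """Read every Excel file in `directory` into a DataFrame, keyed by file name."""
    sheets = {}
    # Use the argument rather than the global data_directory, and sort so the
    # order of the files (and of sheets.keys()) does not depend on the OS.
    for sheet_path in tqdm(sorted(directory.glob('*.xl*'))):
        # header=None keeps the first row as data instead of using it as column names.
        sheets[sheet_path.name] = pd.read_excel(sheet_path, header=None)
    return sheets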

In [16]:
sheets = collect_files(data_directory)

100%|██████████| 2/2 [00:00<00:00, 53.71it/s]


## Finding the Total Returns Cells:

In [22]:
sheet_names = list(sheets.keys())

In [24]:
sheet = sheets[sheet_names[1]]

In [26]:
def find_total_returns_cells(sheet: pd.DataFrame) -> list:
    """Return (row, column) pairs for cells whose text starts with 'Taxable returns, total'."""
    indices = []
    for column in sheet.columns:
        # Cast to str first: the .str accessor raises AttributeError on purely numeric columns.
        col_matches = sheet[column].astype(str).str.match(
            r"^taxable[, ]*returns[, ]*total", flags=re.IGNORECASE, na=False
        )
        indices += [(r, column) for r in sheet.index[col_matches]]
    return indices


In [27]:
total_returns_cells = find_total_returns_cells(sheet)

In [28]:
for row, col in total_returns_cells:
    print(f'row: {row}, col: {col} -- {sheet.iloc[row, col]}')

row: 26, col: 0 -- Taxable returns, total
row: 53, col: 0 -- Taxable returns, total
row: 84, col: 0 -- Taxable returns, total
row: 111, col: 0 -- Taxable returns, total
row: 143, col: 0 -- Taxable returns, total
row: 172, col: 0 -- Taxable returns, total
row: 205, col: 0 -- Taxable returns, total
row: 233, col: 0 -- Taxable returns, total
row: 266, col: 0 -- Taxable returns, total
row: 292, col: 0 -- Taxable returns, total
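To see which labels the pattern accepts and rejects, here is a minimal sketch on a toy column. The labels and numbers are made up for illustration and do not come from the IRS files; only the regular expression is taken from the function above.

```python
import re

import pandas as pd

# Toy stand-in for a parsed sheet; every label and number here is invented.
toy_sheet = pd.DataFrame(
    [
        ["All returns, total", 100, 200],
        ["Taxable returns, total", 80, 150],
        ["taxable returns total [1]", 70, 140],
        ["Nontaxable returns, total", 20, 50],
        [None, None, None],
    ]
)

pattern = r"^taxable[, ]*returns[, ]*total"

# str.match anchors at the start of each string, so "Nontaxable ..." is rejected;
# na=False turns missing cells into False instead of NaN.
mask = toy_sheet[0].str.match(pattern, flags=re.IGNORECASE, na=False)
matched_rows = list(toy_sheet.index[mask])
print(matched_rows)
```

Because the pattern only constrains the start of the cell, trailing text such as a footnote marker still matches, while any prefix before "taxable" does not.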


## Finding the values of the total returns:

In [30]:
cell_index = total_returns_cells[0]
cell = sheet.iloc[cell_index]  # a (row, col) tuple indexes directly; star-unpacking in [] needs Python 3.11+

In [37]:
tr_row, tr_column = cell_index

In [40]:
possible_numeric_indices = [(tr_row, col) for col in sheet.columns if col > tr_column]

In [51]:
numeric_indices_and_values = {}

In [52]:
for index in possible_numeric_indices:
    value = sheet.iloc[index]
    # NaN (an empty cell) is a float, so check for missing values explicitly
    if isinstance(value, (int, float)) and pd.notna(value): # then the value is numeric & valid
        numeric_indices_and_values[index] = value
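The cells above handle a single matched cell. As a rough sketch of how the same steps might be wrapped into one reusable function and applied to every matched row, consider the helper below. The name `numeric_values_right_of` is hypothetical, the toy data is invented, and the check uses `numbers.Real` (instead of `int`/`float`) as a swapped-in way to also accept NumPy integer scalars. Like the cells above, it assumes column labels equal column positions, which holds when the sheet is read with `header=None`.

```python
import numbers

import pandas as pd


def numeric_values_right_of(sheet: pd.DataFrame, cell_index: tuple) -> dict:
    """Map (row, col) -> value for numeric, non-missing cells right of cell_index."""
    row, start_col = cell_index
    values = {}
    for col in sheet.columns:
        if col <= start_col:
            continue  # only look to the right of the label cell
        value = sheet.iloc[row, col]
        # Skip text, booleans (bool is a subclass of int) and NaN (a float).
        if (
            isinstance(value, numbers.Real)
            and not isinstance(value, bool)
            and pd.notna(value)
        ):
            values[(row, col)] = value
    return values


# Toy sheet with invented values: a footnote marker and an empty cell in row 0.
toy_sheet = pd.DataFrame(
    [
        ["Taxable returns, total", 80, "[1]", float("nan"), 150.5],
        ["Some other row", 1, 2, 3, 4],
    ]
)

result = numeric_values_right_of(toy_sheet, (0, 0))
print(result)
```

With the real data, the same call could be made once per entry in `total_returns_cells` to collect the values for every table block in the sheet.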