# Parsing Adjusted Gross Income Tax Tables
John Mays | maysj@omb.nyc.gov | Created: 03/11/25 | Last Updated: 03/11/25

Data is from the "Individual income tax returns with exemptions and itemized deductions > Publication 1304" category on the [IRS.gov website](https://www.irs.gov/statistics/soi-tax-stats-individual-statistical-tables-by-size-of-adjusted-gross-income).

In [1]:
import pandas as pd
import re
from pathlib import Path
from tqdm import tqdm # cool Arabic word here: taqadum (meaning: progress) = تقدم

In [2]:
data_directory = Path('../data')

In [3]:
list(data_directory.glob('*.xl*'))

[WindowsPath('../data/22in21id.xls'), WindowsPath('../data/99in21id.xls')]
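The two file names above appear to begin with a two-digit tax year (`22` for 2022, `99` for 1999). That reading is an assumption drawn only from these two names, not something the source states. If it holds, a small helper along these lines could recover the year for labelling each sheet later. The name `guess_tax_year` and the `pivot` cutoff are hypothetical and introduced here purely for illustration.

```python
import re


def guess_tax_year(filename: str, pivot: int = 50) -> int:
    """Guess a four-digit year from names like '22in21id.xls'.

    Assumption: the name starts with a two-digit year followed by 'in'.
    Two-digit values at or above `pivot` are read as 19xx, the rest as 20xx.
    """
    match = re.match(r"(\d{2})in", filename)
    if match is None:
        raise ValueError(f"Unexpected file name: {filename!r}")
    two_digit = int(match.group(1))
    century = 1900 if two_digit >= pivot else 2000
    return century + two_digit


years = {name: guess_tax_year(name) for name in ["22in21id.xls", "99in21id.xls"]}
print(years)
```

The cutoff of 50 is arbitrary and would need revisiting if the files ever span more than a century.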

## Collecting all of the files into dataframes:

In [4]:
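def collect_files(directory: Path) -> dict:
    """Read every Excel file in `directory` into a DataFrame, keyed by file name."""
    sheets = {}
    # Glob the directory passed in, not the global `data_directory`.
    for sheet_path in tqdm(list(directory.glob('*.xl*'))):
        # header=None keeps every row as data, so no row is consumed as column names.
        sheets[sheet_path.name] = pd.read_excel(sheet_path, header=None)
    return sheets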

In [5]:
sheets = collect_files(data_directory)

100%|██████████| 2/2 [00:00<00:00, 14.79it/s]


## Finding the Total Returns Cells:

In [6]:
sheet = next(iter(sheets.values()))

In [8]:
def find_total_returns_cells(sheet: pd.DataFrame) -> list:
    """Return (row, column) pairs for cells starting with 'Taxable returns, total'."""
    indices = []
    for column in sheet.columns:
        # Cast to str first: the .str accessor raises on purely numeric columns.
        col_matches = sheet[column].astype(str).str.match(
            r"^taxable[, ]*returns[, ]*total", flags=re.IGNORECASE, na=False
        )
        row_indices = list(sheet.index[col_matches])
        indices += [(r, column) for r in row_indices]
    return indices


In [9]:
total_returns_cells = find_total_returns_cells(sheet)

In [10]:
for row, col in total_returns_cells:
    print(sheet.iloc[row, col])

Taxable returns, total
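To sanity-check the cell search without the IRS workbooks on hand, the same idea can be run against a small synthetic frame. This is a minimal sketch: the frame `toy` and its numbers are made up, and `find_matching_cells` is a hypothetical generalisation of the function above that takes the pattern as an argument.

```python
import re

import pandas as pd


def find_matching_cells(frame: pd.DataFrame, pattern: str) -> list:
    """Return (row, column) labels of cells whose text starts with `pattern`."""
    cells = []
    for column in frame.columns:
        # Cast to str so numeric or empty columns do not break the .str accessor.
        hits = frame[column].astype(str).str.match(
            pattern, flags=re.IGNORECASE, na=False
        )
        cells += [(row, column) for row in frame.index[hits.to_numpy()]]
    return cells


# Made-up stand-in for a sheet read with header=None.
toy = pd.DataFrame(
    [
        ["Size of adjusted gross income", None, None],
        ["All returns, total", 100.0, 5.0],
        ["Taxable returns, total", 80.0, 4.0],
    ]
)

toy_cells = find_matching_cells(toy, r"taxable[, ]*returns[, ]*total")
for row, col in toy_cells:
    print(row, col, toy.loc[row, col])
```

Only the row whose label starts with the pattern is returned; "All returns, total" is skipped because `str.match` anchors at the start of the string.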
