# Parse {year,type,county} subtotals from NJSP PDFs

NJSP pages for each year (example: [2023](https://www.nj.gov/njsp/info/fatalacc/2023-stats.shtml)) include links to two PDFs:
- Year to Date Comparative ([example](https://www.nj.gov/njsp/info/fatalacc/pdf/swfcs2_23.pdf))
- Victim Classification by County ([example](https://www.nj.gov/njsp/info/fatalacc/pdf/ptccr_23.pdf))

These provide some info that's not available elsewhere:
- {year, victim type, county} subtotals
- {year, victim type, age range} subtotals

These PDFs are available for every year since 2008 (the earliest NJSP data) except four: 2008, 2009, 2017, and 2018. Later in this notebook we recover {year,type} subtotals (without the "county" facet) for those years. [read_gpt_csvs.ipynb](../annual-reports/year-type-county/read_gpt_csvs.ipynb) also recovers {year,type,county} stats for them.

NJSP's per-crash records (`data/FAUQStats*.xml` files) only include "victim type" info since 2020, so this notebook backfills {year,type} subtotals for 2008-2019. It produces `data/year_types.csv`, which `njsp-plots.ipynb` uses as part of a daily GitHub Action.

In [1]:
from utz import *
from tabula import read_pdf
from njsp.paths import CRASHES_PQT, ANNUAL_SUMMARIES, ANNUAL_SUMMARIES_YT_CSV, ANNUAL_SUMMARIES_YTC_CSV, MISSING_YTC

Tabula helpers

In [2]:
def load_rects(tpl_name):
    tpl_path = f'{ANNUAL_SUMMARIES}/{tpl_name}.json'
    with open(tpl_path, 'r') as f:
        tpl = json.load(f)
    return tpl

def load_pdf_tbl(rect, pdf_path):
    # tabula's `area` is ordered [top, left, bottom, right]
    area = [rect[k] for k in ['y1', 'x1', 'y2', 'x2']]
    [tbl] = read_pdf(pdf_path, area=area, pages='all')
    return tbl
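
As a rough illustration of what `load_pdf_tbl` does with a template rectangle, the sketch below reorders one rectangle into the `[top, left, bottom, right]` list that tabula's `area` argument expects. The JSON string and its coordinate values are hypothetical stand-ins for an exported Tabula template, and `rect_to_area` is a helper name introduced only for this example.

```python
import json

# Hypothetical template contents; a real template is exported from the Tabula app
# and its coordinates will differ.
template_json = (
    '[{"page": 1, "extraction_method": "guess", '
    '"x1": 30.0, "x2": 580.0, "y1": 100.0, "y2": 500.0, '
    '"width": 550.0, "height": 400.0}]'
)

def rect_to_area(rect):
    """Order a template rectangle as [top, left, bottom, right]."""
    return [rect[k] for k in ('y1', 'x1', 'y2', 'x2')]

# The destructuring assignment fails loudly unless there is exactly one rectangle,
# mirroring `[ptccr_rect] = load_rects(...)` in this notebook.
[rect] = json.loads(template_json)
area = rect_to_area(rect)
print(area)  # [100.0, 30.0, 500.0, 580.0]
```

The single-element unpacking is a cheap guard: if a template ever contains zero or several rectangles, the notebook stops at load time instead of silently parsing the wrong region.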

## Load "Victim Classification by County" data

In [3]:
[ptccr_rect] = load_rects('ptccr_23.tabula-template')
def load_ptccr(year):
    pdf_path = f'{ANNUAL_SUMMARIES}/ptccr_%02d.pdf' % (year % 100)
    tbl = load_pdf_tbl(ptccr_rect, pdf_path)
    tbl['year'] = year
    tbl = tbl.set_index('County')
    return tbl
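
The PDF filenames carry a two-digit year suffix, which `load_ptccr` derives with `year % 100` and `%02d`. A minimal sketch of just that formatting step (the helper name `ptccr_filename` is made up for this example, and it only builds a name; it does not check that the file exists):

```python
def ptccr_filename(year):
    """Build the 'Victim Classification by County' PDF name for a 4-digit year."""
    return 'ptccr_%02d.pdf' % (year % 100)

print(ptccr_filename(2023))  # ptccr_23.pdf
print(ptccr_filename(2010))  # ptccr_10.pdf
# `%02d` zero-pads single-digit suffixes (formatting only; 2008 has no such PDF)
print(ptccr_filename(2008))  # ptccr_08.pdf
```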

In [4]:
cur_year = now().year
cur_year

2024

In [5]:
%%time
start_year = 2008
missing_years = [ 2008, 2009, 2017, 2018 ]
summaries = pd.concat([
    load_ptccr(year)
    for year in range(start_year, cur_year)
    if year not in missing_years
])
summaries.columns = summaries.columns.str.lower()
summaries = summaries.rename(columns={
    'pedalcyclist': 'cyclist',
})
summaries.index.name = 'county'
# Keep `county` as the index and `year` as a column; later cells rely on this layout
summaries

CPU times: user 26.3 ms, sys: 39.8 ms, total: 66 ms
Wall time: 8.95 s


county,driver,passenger,cyclist,pedestrian,fatalities,crashes,year
Atlantic,10,7,1,6,24,22,2010
Bergen,17,5,0,15,37,36,2010
Burlington,22,7,0,5,34,33,2010
Camden,17,11,3,10,41,37,2010
Cape May,4,1,0,0,5,4,2010
...,...,...,...,...,...,...,...
Somerset,14,4,0,6,24,22,2023
Sussex,6,2,0,1,9,6,2023
Union,13,6,2,15,36,34,2023
Warren,8,1,0,3,12,12,2023


Verify "fatalities" is the sum of the 4 types:

In [6]:
type_cols = ['driver', 'passenger', 'cyclist', 'pedestrian']
assert (summaries[type_cols].sum(1) == summaries.fatalities).all()
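
If this check ever fails, it helps to see which rows are off rather than a bare `AssertionError`. The sketch below runs the same row-sum comparison on the two 2010 rows shown above (Atlantic and Bergen) and collects any mismatching rows; the `toy` and `mismatches` names are introduced for this example only.

```python
import pandas as pd

type_cols = ['driver', 'passenger', 'cyclist', 'pedestrian']

# Two rows copied from the 2010 output above
toy = pd.DataFrame(
    {
        'driver': [10, 17],
        'passenger': [7, 5],
        'cyclist': [1, 0],
        'pedestrian': [6, 15],
        'fatalities': [24, 37],
    },
    index=pd.Index(['Atlantic', 'Bergen'], name='county'),
)

row_sums = toy[type_cols].sum(axis=1)
mismatches = toy[row_sums != toy.fatalities]
# Passing the offending rows as the assert message makes a failure self-explanatory
assert mismatches.empty, mismatches
```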

Verify "Total" (per year) rows:

In [7]:
county_sums = summaries.drop(index='Total').groupby('year').sum()
reported_totals = summaries.set_index('year', append=True).loc['Total']
assert (county_sums == reported_totals).all().all()
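
To make the shape of this comparison easier to follow, here is a sketch on made-up data: two placeholder counties (`A`, `B`) plus a `Total` row for each of two years. All numbers are invented for illustration and are not NJSP figures.

```python
import pandas as pd

# Toy frame in the same layout as `summaries`: county index, `year` as a column
toy = pd.DataFrame(
    {
        'driver':     [1, 2, 3, 4, 5, 9],
        'pedestrian': [0, 1, 1, 2, 2, 4],
        'year':       [2022, 2022, 2022, 2023, 2023, 2023],
    },
    index=pd.Index(['A', 'B', 'Total', 'A', 'B', 'Total'], name='county'),
)

# Left side: sum the county rows per year
county_sums = toy.drop(index='Total').groupby('year').sum()
# Right side: pull out the reported "Total" rows, indexed by year
reported = toy.set_index('year', append=True).loc['Total']

# Both frames are indexed by year with identical columns, so `==` is element-wise
matches = county_sums == reported
assert matches.all().all()
```

The double `.all()` first reduces each column to a single boolean, then reduces those to one.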

In [8]:
missing_ytc = read_csv(MISSING_YTC).set_index(['year', 'county']).astype(int)
missing_ytc

year,county,crashes,driver,passenger,pedestrian,cyclist
2008,Atlantic,30,17,8,6,0
2008,Bergen,22,10,5,7,1
2008,Burlington,45,23,6,12,4
2008,Camden,42,25,4,15,0
2008,Cape May,11,8,3,0,0
2008,Cumberland,23,15,5,2,1
2008,Essex,43,23,11,14,1
2008,Gloucester,29,14,13,5,1
2008,Hudson,24,13,4,6,4
2008,Hunterdon,9,6,2,3,0


In [9]:
ytc = pd.concat([
    summaries.drop(index='Total').drop(columns='fatalities').reset_index().set_index(['year', 'county']),
    missing_ytc,
]).sort_index()
ytc

year,county,driver,passenger,cyclist,pedestrian,crashes
2008,Atlantic,17,8,0,6,30
2008,Bergen,10,5,1,7,22
2008,Burlington,23,6,4,12,45
2008,Camden,25,4,0,15,42
2008,Cape May,8,3,0,0,11
...,...,...,...,...,...,...
2023,Salem,8,2,0,2,11
2023,Somerset,14,4,0,6,22
2023,Sussex,6,2,0,1,6
2023,Union,13,6,2,15,34
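
Note that the two inputs list their columns in different orders (`missing_ytc` starts with `crashes`, the PDF-derived frame ends with it). `pd.concat` matches columns by name, not by position, so this is harmless. A small sketch using the Atlantic rows for 2008 and 2010 shown above (frame names are introduced for this example, and only three of the columns are included):

```python
import pandas as pd

idx = ['year', 'county']

# Columns deliberately listed in different orders, as in the two real inputs
from_pdfs = pd.DataFrame(
    {'year': [2010], 'county': ['Atlantic'], 'driver': [10], 'passenger': [7], 'crashes': [22]}
).set_index(idx)
backfill = pd.DataFrame(
    {'year': [2008], 'county': ['Atlantic'], 'crashes': [30], 'driver': [17], 'passenger': [8]}
).set_index(idx)

combined = pd.concat([from_pdfs, backfill]).sort_index()
print(combined)

# Values land under the right column names regardless of input order
assert combined.loc[(2008, 'Atlantic'), 'driver'] == 17
assert combined.loc[(2008, 'Atlantic'), 'crashes'] == 30
```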


In [10]:
ytc.reset_index(level=1).county.value_counts()

Atlantic      16
Middlesex     16
Union         16
Sussex        16
Somerset      16
Salem         16
Passaic       16
Ocean         16
Morris        16
Monmouth      16
Mercer        16
Bergen        16
Hunterdon     16
Hudson        16
Gloucester    16
Essex         16
Cumberland    16
Cape May      16
Camden        16
Burlington    16
Warren        16
Name: county, dtype: int64

In [11]:
ytc.to_csv(ANNUAL_SUMMARIES_YTC_CSV)

In [12]:
yt = ytc.groupby(level='year').sum()
yt

year,driver,passenger,cyclist,pedestrian,crashes
2008,320,112,20,138,555
2009,315,98,14,157,550
2010,303,99,13,141,530
2011,362,105,17,143,586
2012,309,103,14,163,553
2013,304,92,14,132,508
2014,295,80,11,170,523
2015,276,96,17,173,522
2016,330,89,17,166,570
2017,339,85,17,183,591
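
Collapsing the county level into per-year totals can be sketched on a small subset. The rows below are the Atlantic and Bergen figures for 2008 and 2010 from the tables above, so the sums are two-county subtotals, not statewide totals.

```python
import pandas as pd

toy = pd.DataFrame(
    {'driver': [17, 10, 10, 17], 'crashes': [30, 22, 22, 36]},
    index=pd.MultiIndex.from_tuples(
        [(2008, 'Atlantic'), (2008, 'Bergen'), (2010, 'Atlantic'), (2010, 'Bergen')],
        names=['year', 'county'],
    ),
)

# Group on the named index level; the result stays indexed by `year`
by_year = toy.groupby(level='year').sum()
print(by_year)
```

Grouping by level name keeps `year` as the index name, which in turn becomes the header of the first column when the frame is written with `to_csv`.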


In [13]:
yt.to_csv(ANNUAL_SUMMARIES_YT_CSV)

## Compare to per-crash records
NJSP's per-crash datasets include victim-type info for crashes since 2020.

This allows for a sanity-check of the {year,type,county} totals reported in the "Victim Classification by County" PDFs (for 2020-2023):

In [14]:
sp = read_parquet(CRASHES_PQT)
sp

ACCID,CCODE,CNAME,MCODE,MNAME,HIGHWAY,LOCATION,FATALITIES,INJURIES,STREET,FATAL_D,FATAL_P,FATAL_T,FATAL_B,dt
1703,01,Atlantic,0102,Atlantic City,446,State/Interstate Authority 446 S MP 1,1.0,1.0,,,,,,2008-01-01 00:35:00
1681,09,Hudson,0910,Union City,,Bergenline Ave S MP 0 at 6th St,1.0,,Bergenline Ave,,,,,2008-01-01 04:11:00
1659,04,Camden,0415,Gloucester Twsp,42,State Highway 42 N MP 8.2,1.0,1.0,,,,,,2008-01-01 06:46:00
1661,20,Union,2004,Elizabeth City,624,County 624 W MP 2.2 at Ikea Dr,1.0,1.0,,,,,,2008-01-01 12:29:00
1811,07,Essex,0716,Nutley Town,648,County 648 E MP .87 at Franklin Ave,1.0,,,,,,,2008-01-01 18:53:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12940,02,Bergen,0219,Fort Lee Boro,,Bruce Reynolds Blvd,1.0,,Bruce Reynolds Blvd,0.0,0.0,1.0,0.0,2024-01-10 05:09:00
12939,01,Atlantic,0121,Somers Point City,,Bay Ave,1.0,,Bay Ave,1.0,0.0,0.0,0.0,2024-01-10 11:08:00
12942,15,Ocean,1506,Brick Twsp,,nj 35/sea breeze way,1.0,,nj 35/sea breeze way,0.0,0.0,1.0,0.0,2024-01-13 09:17:00
12943,01,Atlantic,0111,Galloway Twsp,,61 w jimmie leeds road parking lot,1.0,,61 w jimmie leeds road parking lot,0.0,0.0,1.0,0.0,2024-01-13 14:01:00


In [15]:
cols = [ 'FATALITIES', 'FATAL_D', 'FATAL_P', 'FATAL_T', 'FATAL_B', ]
y = sp.dt.dt.year.rename('year')
c = sp.CNAME.rename('county')
gb = sp.groupby([y, c])
agg = gb[cols].sum(numeric_only=True).astype(int)
agg['crashes'] = gb.size()
agg = agg.rename(columns={
    'FATALITIES': 'fatalities',
    'FATAL_D': 'driver',
    'FATAL_P': 'passenger',
    'FATAL_T': 'pedestrian',
    'FATAL_B': 'cyclist',
})
agg20 = agg.reset_index(level=1)
agg20 = agg20[agg20.index >= 2020]
assert (agg20[type_cols].sum(1) == agg20.fatalities).all()
agg = agg.drop(columns='fatalities')
agg

year,county,driver,passenger,pedestrian,cyclist,crashes
2008,Atlantic,0,0,0,0,30
2008,Bergen,0,0,0,0,22
2008,Burlington,0,0,0,0,45
2008,Camden,0,0,0,0,42
2008,Cape May,0,0,0,0,11
...,...,...,...,...,...,...
2024,Mercer,1,0,0,0,1
2024,Middlesex,0,0,2,0,2
2024,Monmouth,0,2,1,0,3
2024,Ocean,0,0,3,0,3
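
The zeros in the 2008 rows above do not mean "no driver or pedestrian deaths": the `FATAL_*` columns are empty before 2020, and pandas' grouped `sum` skips NaN, so an all-NaN group sums to 0. The sketch below reproduces this with four rows taken from the per-crash table (two from 2008, two from 2024), and shows `min_count=1` as one possible way to keep such groups as NaN. Treat it as an illustration, not as a change to the notebook's logic.

```python
import pandas as pd

nan = float('nan')
toy = pd.DataFrame({
    'dt': pd.to_datetime([
        '2008-01-01 00:35', '2008-01-01 06:46', '2024-01-10 11:08', '2024-01-13 14:01',
    ]),
    'CNAME': ['Atlantic', 'Camden', 'Atlantic', 'Atlantic'],
    'FATAL_D': [nan, nan, 1.0, 0.0],
    'FATAL_T': [nan, nan, 0.0, 1.0],
})

year = toy['dt'].dt.year.rename('year')
county = toy['CNAME'].rename('county')
gb = toy.groupby([year, county])
cols = ['FATAL_D', 'FATAL_T']

# Default: NaN is skipped, so all-NaN groups sum to 0
agg = gb[cols].sum().astype(int)
agg['crashes'] = gb.size()
print(agg)

# With min_count=1, groups that have no non-NaN values stay NaN
with_gaps = gb[cols].sum(min_count=1)
print(with_gaps)
```

This is why the comparison that follows restricts itself to 2020 and later.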


### Combine aggregate stats from per-crash records vs. stats from PDFs

In [16]:
m = agg20.reset_index()
m = m.merge(summaries.reset_index(), how='left', on=['year', 'county'], suffixes=['_sp', '_stats']).dropna()
m = m.set_index(['year', 'county'])
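# Left-join gaps made the `_stats` columns floats; after dropna() cast back to int
m = m.astype(int)
# 'driver_sp' -> ('sp', 'driver'), so each source becomes a top-level column group
m.columns = pd.MultiIndex.from_tuples([ tuple(reversed(col.split('_'))) for col in m.columns ])
m = m[m.columns.sort_values()]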
m

,,sp,sp,sp,sp,sp,sp,stats,stats,stats,stats,stats,stats
year,county,crashes,cyclist,driver,fatalities,passenger,pedestrian,crashes,cyclist,driver,fatalities,passenger,pedestrian
2020,Atlantic,38,0,26,40,5,9,38,0,26,40,5,9
2020,Bergen,38,0,14,43,9,20,38,0,14,43,9,20
2020,Burlington,40,3,26,42,4,9,40,3,26,42,4,9
2020,Camden,36,1,19,38,5,13,36,1,19,38,5,13
2020,Cape May,8,1,5,9,0,3,8,1,5,9,0,3
2020,Cumberland,22,0,14,24,5,5,22,0,14,24,5,5
2020,Essex,39,3,16,45,12,14,39,3,16,45,12,14
2020,Gloucester,33,2,21,35,5,7,33,2,21,35,5,7
2020,Hudson,24,1,11,24,1,11,24,1,11,24,1,11
2020,Hunterdon,12,0,7,12,2,3,12,0,7,12,2,3
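
The reshaping in this cell (suffixing overlapping columns in the merge, then turning `name_suffix` into a two-level column index) can be sketched on two of the 2020 rows shown above, limited to the `crashes` and `driver` columns. The frame names `sp_toy` and `stats_toy` are introduced for this example.

```python
import pandas as pd

keys = ['year', 'county']
rows = {'year': [2020, 2020], 'county': ['Atlantic', 'Bergen'], 'crashes': [38, 38], 'driver': [26, 14]}
sp_toy = pd.DataFrame(rows)     # stands in for the per-crash aggregates
stats_toy = pd.DataFrame(rows)  # stands in for the PDF-derived stats

# Overlapping non-key columns get the suffixes: crashes_sp, crashes_stats, ...
merged = sp_toy.merge(stats_toy, how='left', on=keys, suffixes=('_sp', '_stats')).set_index(keys)

# 'crashes_sp' -> ('sp', 'crashes'); this assumes column names contain no other '_'
merged.columns = pd.MultiIndex.from_tuples(
    [tuple(reversed(col.split('_'))) for col in merged.columns]
)
merged = merged[merged.columns.sort_values()]
print(merged)

# Each source is now addressable as a sub-frame with identical column labels
assert (merged['sp'] == merged['stats']).all().all()
```

Splitting on `_` works here because none of the measure names contain an underscore; a column such as `total_crashes` would need `rsplit('_', 1)` or a different separator.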


In [17]:
# Keep only rows where the per-crash aggregates and the PDF stats disagree
rows_match = (m['sp'] == m['stats']).all(axis=1)
row_diffs = rows_match[~rows_match]
row_diffs

Series([], dtype: bool)

In [18]:
assert row_diffs.empty, row_diffs
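
Should the two sources ever disagree, a per-column difference is more useful than a list of `False` flags. The sketch below builds a tiny frame in the same two-level layout and deliberately alters one value (Bergen's `stats` driver count is changed from 14 to a hypothetical 15) to show what a mismatch report could look like.

```python
import pandas as pd

toy = pd.DataFrame(
    [[38, 26, 38, 26],
     [38, 14, 38, 15]],  # 15 is a made-up discrepancy, for illustration only
    index=pd.MultiIndex.from_tuples(
        [(2020, 'Atlantic'), (2020, 'Bergen')], names=['year', 'county'],
    ),
    columns=pd.MultiIndex.from_product([['sp', 'stats'], ['crashes', 'driver']]),
)

rows_match = (toy['sp'] == toy['stats']).all(axis=1)
mismatched = toy[~rows_match]

# Per-column difference (per-crash minus PDF) for the rows that disagree
diff = (toy['sp'] - toy['stats'])[~rows_match]
print(diff)
```

Here `diff` has a single row, `(2020, 'Bergen')`, with `crashes` 0 and `driver` -1.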