Why this data?
The CDCR (California Department of Corrections and Rehabilitation) releases monthly reports on the number of people in state prisons around California. These reports also include the designed capacity of each prison, and how the current population compares to that capacity. California prisons have had extreme overcrowding issues for a long time; see, e.g., Brown v. Plata, the US Supreme Court case that finally sparked a concerted effort to reduce overcrowding.
Unfortunately, the reports provided by the CDCR are only available in PDF format, with one PDF provided per month. Trying to analyze those numbers over many months and years is very difficult in such a format. This repository changes that, parsing the monthly PDF reports from 1996 to the present day to provide these data in one CSV.
The data look something like this:
    head data/monthly_cdcr_population.csv | column -t -s,
    year  month  institution_name                  population_felons  civil_addict  total_population  designed_capacity  percent_occupied  staffed_capacity  source_pdf_name
    1996  01     VSP (VALLEY SP)                   2294               0             2294              1980               115.9             1980              TPOP1Ad9601.pdf
    1996  01     SCC (SIERRA CONSERVATION CENTER)  322                0             322               320                100.6             320               TPOP1Ad9601.pdf
    1996  01     NCWF (NO CAL WOMEN'S FACIL)       786                4             790               400                197.5             760               TPOP1Ad9601.pdf
    1996  01     CCWF (CENTRAL CA WOMEN'S FAC)     2846               13            2859              2004               142.7             3224              TPOP1Ad9601.pdf
    ...
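As a rough illustration of working with these columns, here is a minimal Python sketch that reads rows with the standard-library csv module and recomputes occupancy. It assumes, based only on the four sample rows shown above, that percent_occupied is total_population divided by designed_capacity, times 100, rounded to one decimal. The helper names (read_rows, occupancy_percent, combined_occupancy) are made up for this example and are not part of the repository. The sample is embedded so the sketch runs on its own.

```python
import csv
import io

# Sample rows copied from the excerpt above. To use the real data, open
# data/monthly_cdcr_population.csv instead of this in-memory string.
SAMPLE_CSV = """\
year,month,institution_name,population_felons,civil_addict,total_population,designed_capacity,percent_occupied,staffed_capacity,source_pdf_name
1996,01,VSP (VALLEY SP),2294,0,2294,1980,115.9,1980,TPOP1Ad9601.pdf
1996,01,SCC (SIERRA CONSERVATION CENTER),322,0,322,320,100.6,320,TPOP1Ad9601.pdf
1996,01,NCWF (NO CAL WOMEN'S FACIL),786,4,790,400,197.5,760,TPOP1Ad9601.pdf
1996,01,CCWF (CENTRAL CA WOMEN'S FAC),2846,13,2859,2004,142.7,3224,TPOP1Ad9601.pdf
"""


def read_rows(fileobj):
    """Return the CSV rows as a list of dicts keyed by column name.

    All values stay strings, so month keeps its leading zero ("01").
    """
    return list(csv.DictReader(fileobj))


def occupancy_percent(row):
    """Recompute occupancy for one row (assumed formula, see lead-in)."""
    population = int(row["total_population"])
    capacity = int(row["designed_capacity"])
    return round(100 * population / capacity, 1)


def combined_occupancy(rows):
    """Occupancy across several institutions: summed population over summed capacity."""
    population = sum(int(r["total_population"]) for r in rows)
    capacity = sum(int(r["designed_capacity"]) for r in rows)
    return round(100 * population / capacity, 1)


rows = read_rows(io.StringIO(SAMPLE_CSV))
for row in rows:
    # Print the reported value next to the recomputed one for comparison.
    print(row["institution_name"], row["percent_occupied"], occupancy_percent(row))
print("combined:", combined_occupancy(rows))
```

Rows in the full file may have blank or zero capacities that this sketch does not handle, so a real analysis would need to guard the int() conversions and the division.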
I've gone through a number of the PDFs by hand to double-check that the numbers are correct, but if you spot mistakes or otherwise think something is wrong, please create a GitHub issue. If you're not familiar with GitHub and want to report a bug, please reach out via email (email@example.com).
Data come from the PDFs of monthly archives at: https://www.cdcr.ca.gov/research/monthly-total-population-report-archive/
The PDFs themselves are downloaded and checked into this repository under data/raw_monthly_pdfs/. The names of the PDFs have not been changed. Before 2019, they were downloaded by running a script (datacleaning/scrape_from_cdcr.py). The naming of files and directories changed somewhat starting in 2019, so to add new PDFs, go to the website, download the newest month manually, and save it to the data/raw_monthly_pdfs/ directory.
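Before downloading new PDFs by hand, it can help to know which months are not yet covered. The following is a small sketch, not part of the repository: missing_months is a hypothetical helper that takes the (year, month) pairs already present (for example, built from the year and month columns of the CSV) and lists the gaps in a given range. The pairs in the example are invented for illustration.

```python
def missing_months(present, start, end):
    """Return the (year, month) pairs from start to end, inclusive, not in present."""
    have = set(present)
    year, month = start
    gaps = []
    while (year, month) <= end:
        if (year, month) not in have:
            gaps.append((year, month))
        # Step to the next calendar month, rolling over into the next year.
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return gaps


# Invented example: in practice these pairs could be built from the CSV,
# e.g. {(int(r["year"]), int(r["month"])) for r in rows}.
present = [(2018, 11), (2018, 12), (2019, 2)]
gaps = missing_months(present, (2018, 11), (2019, 3))
print(gaps)
```

Each reported pair corresponds to one monthly PDF that would still need to be saved to data/raw_monthly_pdfs/ and parsed.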
Parsing the PDFs
The PDFs are parsed using tools in the datacleaning directory in the root of this repo. The result of their parsing is in this directory at
To re-parse / re-generate that CSV, run:
python datacleaning/bulk_parse_pdfs.py --verbose
There are tests! Run them with