![logo](https://resolvephilly.org/themes/custom/resolvephl-ci/logo.svg)

# Scraping and standardizing Pennsylvania Act 146 Quarterly Reports

**Author:** Julie Christie | Director of Data & Impact

**Partnering Team:** Our Kids

**Date:** March 28, 2024

## Background

Pennsylvania publishes reports on child fatalities and near fatalities that the state determined were the result of child abuse. The reports include the age, sex, county, and date of the incident, as well as whether the family was previously known to the local department of human/family services. The detailed individual reports are all published online; however, scraping those may prove more complicated than scraping the quarterly reports. The quarterly reports are written in a narrative format that is consistent within a single quarterly report, but the structure of the narrative summaries can change from one report to the next, which complicates scraping them as well.

### Goal of Analysis

Specifically, Resolve is looking to understand the frequency at which children who experience abuse that results in their death/near death are already known to the system. We are exploring these rates at the county level to understand what the statewide trend is, and how Philadelphia measures up to that trend.

### Glossary

-   **Act 146** -- *"Act 146 of 2006 went into effect on May 8, 2007. A major provision of this law requires that the department prepare a non-identifying summary for the governor and the General Assembly of findings for each case of substantiated child abuse or neglect that has resulted in a child fatality or near fatality."*
-   **Near fatality** -- *Definition TKTK, which is determined by the "certifying physician" from the state.*
-   **DA De-certification** -- *This gets assigned to a report when the District Attorney determines that the incident was not a result of child abuse.*

### Data

-   [Child Fatality/Near Fatality Quarterly Reports](https://www.pa.gov/en/agencies/dhs/resources/data-reports/quarterly-summaries-child-abuse.html) --- A collection of brief summaries of fatalities/near fatalities of children due to abuse. | No metadata available

### Tools

-   [Python](https://www.python.org/) -- *Base code to facilitate scraping*
-   [Pandas](https://pandas.pydata.org/) -- *More robust data analysis*
-   [Regex](https://developers.google.com/edu/python/regular-expressions) -- *Regular Expressions, or Regex, to parse out patterns of characters*
-   [PDF Plumber](https://github.com/jsvine/pdfplumber) -- *Parse information from .pdf files*
-   [Excel](https://www.microsoft.com/en-us/microsoft-365/p/excel/cfq7ttc0hr4r?activetab=pivot:overviewtab) -- *Clean and analyze tabulated data*

### Limitations
- A "certifying physician" makes an individual call on whether a child's death/near death is the result of abuse, meaning that human error may result in cases not being documented in these reports.
- Child fatalities and near fatalities as a result of abuse are an incredibly small and extreme subset of the overall abuse that children face. This analysis does not constitute a full picture, but rather is a snapshot of what the state deemed the most egregious cases.
- These quarterly reports may not contain all instances. A previous scrape of individual reports rendered about 2,400 reports. This scrape yielded TKTK summaries.

## Cleaning

1. Download all the reports from the Pennsylvania DHS site. (See Data for direct link.)
2. Rename the files to have a standard structure.
3. Make sure that you convert anything that was downloaded as a `.docx` file into a `.pdf` file.
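
Step 2 does not spell out a naming convention, so the following is only a sketch of one way to standardize the names. The `standardize_name` and `rename_reports` helpers and the lowercase-with-underscores pattern are assumptions made for illustration, not the convention used in this project.

```python
import re
from pathlib import Path


def standardize_name(filename: str) -> str:
    """Return a lowercase, underscore-separated file name (hypothetical convention)."""
    path = Path(filename)
    stem = path.stem.lower()
    # Drop download suffixes such as " (1)" at the end of the name
    stem = re.sub(r"\s*\(\d+\)$", "", stem)
    # Collapse every run of characters that are not letters or digits into one underscore
    stem = re.sub(r"[^a-z0-9]+", "_", stem).strip("_")
    return stem + path.suffix.lower()


def rename_reports(folder: str) -> None:
    """Rename every file in `folder` in place, skipping names that would collide."""
    for path in list(Path(folder).iterdir()):
        if not path.is_file():
            continue
        target = path.with_name(standardize_name(path.name))
        if target != path and not target.exists():
            path.rename(target)
```

Working on a copy of the downloaded folder keeps the original file names available in case a report needs to be traced back to the DHS site.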

### Overview of process

Each report is put together with a basic structure of: 

```
Fatalities
    County 1
        1. Incident description
        2. Incident description
        3. Incident description
        ...
    County 2 
    ...
    County 67
        ...

Near Fatalities
    County 1
        1. Incident description
        2. Incident description
        3. Incident description
        ...
    County 2 
    ...
    County 67
        ...
```

And within that, each incident description is roughly structured as:

> 1. A `##-age-old` `sex` child `died/nearly died` on `date` as a result of .... `Agency Name` indicated the report on ... naming the victim child's `identifier for relationship` as the perpetrator(s). ... Further details of the incident are written out. ...  The family `was/was not known` to child welfare.

However, the phrasing varies. Some summaries instead open with the date, as in "On `date` a `##-age-old` `sex` child `died/nearly died` ..."

The regex must also take into account any instances where a sibling is mentioned with a similar structure, like "the victim's ##-age-old sibling was present at the time."
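
As a rough illustration of the structure described above, the sketch below pulls the age, sex, and "known to child welfare" flag out of one narrative. The pattern names, the `parse_incident` helper, and the sample sentence are all made up for this example, and the sketch assumes the reports write the sex as "male" or "female" directly after the age. Requiring that word right after "-old" is what lets it skip a phrase like "7-year-old sibling"; a phrase such as "7-year-old male sibling" would still need extra handling.

```python
import re

# Victim age and sex: the first "<age>-<unit>-old <sex>" phrase in the narrative
VICTIM_RE = re.compile(
    r"(?P<age>\d{1,2})-(?P<unit>day|week|month|year)-old\s+(?P<sex>male|female)",
    re.IGNORECASE,
)
# Whether the family was previously known to child welfare
KNOWN_RE = re.compile(r"family\s+was\s+(?P<neg>not\s+)?known", re.IGNORECASE)


def parse_incident(narrative: str) -> dict:
    """Extract a few fields from one incident narrative; missing fields are None."""
    victim = VICTIM_RE.search(narrative)
    known = KNOWN_RE.search(narrative)
    return {
        "age": int(victim.group("age")) if victim else None,
        "age_unit": victim.group("unit").lower() if victim else None,
        "sex": victim.group("sex").lower() if victim else None,
        "known_to_agency": (known.group("neg") is None) if known else None,
    }


# Invented sample text that follows the structure above; not taken from a report
sample = (
    "1. A 2-year-old female child nearly died on January 5, 2020, as a result of "
    "physical abuse. The victim's 7-year-old sibling was present at the time. "
    "The family was not known to child welfare."
)
print(parse_incident(sample))
```

Because both patterns use `search`, the same helper also handles summaries that open with the date instead of the age.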

### Prepare Python Environment

In [1]:
pip install pdfplumber

Note: you may need to restart the kernel to use updated packages.


In [None]:
# pip install pandas
# pip install pypdf2

Import the needed libraries

In [8]:
import pdfplumber       # PDF Plumber to scrape through .pdf files
import re               # Regular Expressions
import csv              # Comma Separated Values
import glob             # To make a list of all files in a folder
from PyPDF2 import PdfReader    # To import file reader
import os               # To help with accessing directories
import argparse         # To read command-line arguments in the PDF check script

import pandas as pd                     # Needed by search_files() below
from collections import namedtuple     # Needed to define the Line record below

### Repair PDF files

Some of the files may be broken when converting from a Word document to a PDF document.

Source: https://stackoverflow.com/questions/58807673/best-way-to-check-the-pdf-file-is-corrupt-using-python

In [None]:
# Create a function that returns True when a file opens as a PDF and has metadata, and False otherwise
def check_file(fullfile):
    with open(fullfile, 'rb') as f:
        try:
            pdf = PdfReader(f)
            info = pdf.metadata
            if info:
                return True
            else:
                return False
        except Exception as e:
            return False


def search_files(dirpath: str) -> pd.DataFrame:
    # __file__ is not defined inside a notebook, so report the working directory instead
    pwdpath = os.getcwd()
    print("Running path : %s" %pwdpath)
    files = []
    if os.access(dirpath, os.R_OK):
        print("Path %s validation OK \n" %dirpath)
        listfiles = os.listdir(dirpath)
        for f in listfiles:
            fullfile = os.path.join(dirpath, f)
            if check_file(fullfile):
                print("OK " + fullfile + "\n################")
                files.append((f, fullfile, 'good'))
            else:
                print("ERROR " + fullfile + "\n################")
                files.append((f, fullfile, 'corrupted'))
    else:
        print("Path is not valid")

    df = pd.DataFrame(files, columns=['filename', 'fullpath', 'status'])
    return df


def main(args):
    df = search_files(args.dirpath)
    df.to_csv(args.output, index=False)
    print(f'Final report saved to {args.output}')
    print(df['status'].value_counts())


if __name__ == '__main__':
    """ Command line script for finding corrupted PDFs in a directory. """
    parser = argparse.ArgumentParser()
    parser.add_argument('--dirpath', type=str, required=True, help='Path to directory containing PDFs.')
    parser.add_argument('--output', type=str, required=True, help='Path to output CSV file.')
    args = parser.parse_args()
    main(args)
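
The script above is written to be run from a terminal with `--dirpath` and `--output`. For a quick first pass inside a notebook, the sketch below only checks whether each file starts with the `%PDF-` signature that PDF files begin with. The `looks_like_pdf` and `find_suspect_files` names are hypothetical, and this check is weaker than the `PdfReader` test: a file can carry the signature and still be damaged further in.

```python
from pathlib import Path


def looks_like_pdf(path) -> bool:
    """Return True when the file starts with the PDF signature."""
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"


def find_suspect_files(folder) -> list:
    """List the names of files in `folder` that do not start with the PDF signature."""
    return sorted(
        p.name
        for p in Path(folder).iterdir()
        if p.is_file() and not looks_like_pdf(p)
    )
```

Files flagged here are candidates for being converted from `.docx` again before scraping.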


### Parse incidents into a `.csv` file

This code was written by [Maggie Lee](http://maggielee.net/)

In [7]:
# Set location of the files to scrape and the destination file for the data
directory = r"/Users/juliechristie/Desktop/OK — CUA System/act_33_quarterly"
csv_output_file = 'incidents.csv'

# Regex that identifies the different types of numbered line starts in the documents
numbered_line_re = re.compile(r"(?:\d{1,2}-\d{1,2}|\d{1,2})(?:\.\)|\.|\))")

# output is going to be a list of lists
#  each list in there will be a row of output: report type, county and narrative
# It is created once, before the loop, so that rows from every file are kept
output = []

for filename in glob.glob(f"{directory}/*"):

    text_of_a_single_pdf = ''

    #  this opens the pdf, loops through every page in the pdf and puts the text of all pages together in `text_of_a_single_pdf`
    with pdfplumber.open(filename) as pdf:
        for page in pdf.pages:
            # extract_text() can come back empty for a page without a text layer
            text = page.extract_text() or ''
            # keep a line break between pages so lines from two pages do not run together
            text_of_a_single_pdf = text_of_a_single_pdf + text + '\n'

    text_as_lines = text_of_a_single_pdf.split('\n')

    #  default report type will be fatality, this assumes fatalities always come first
    report_type = 'fatality'
    county = ''
    new_row = []
    for line in text_as_lines:

        if 'Near Fatalities:' in line:
            # when parsing, if you come to the line 'near fatalities', the variable 'report_type' will change
            report_type = 'near fatality'
        elif 'County:' in line:
            # same with county, the county will stay the same, line after line, until the parser sees a new county name
            county = line
        # Compare the beginning of the line with the regex for numbered line starts
        elif numbered_line_re.match(line.strip()):
            # a numbered paragraph starts a new row, so log the old row first (if there is one)
            if new_row:
                output.append(new_row)
            #  and start a new row with the numbered line as the start of the narrative
            new_row = [report_type, county, line]
        elif new_row:
            # any other line continues the current narrative; lines before the first
            # numbered paragraph (titles, headers) are skipped
            new_row[2] = new_row[2] + ' ' + line

    #  then log the very last paragraph of this file
    if new_row:
        output.append(new_row)

# write the rows from all files once every file has been parsed
with open(csv_output_file, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(output)

PDFSyntaxError: No /Root object! - Is this really a PDF?

Create the column names for the data that you are extracting from the pdf.

In [None]:
Line = namedtuple('Line', 'fatality county age sex date cause perpetrator indicated_date known_to_agency')

Create a list of all 67 counties in PA to match with the different headers of the reports.

In [None]:
pa_counties = ["Adams", "Allegheny", "Armstrong", "Beaver", "Bedford", "Berks", "Blair", "Bradford", "Bucks", "Butler", "Cambria", "Cameron", "Carbon", "Centre", "Chester", "Clarion", "Clearfield", "Clinton", "Columbia", "Crawford", "Cumberland", "Dauphin", "Delaware", "Elk", "Erie", "Fayette", "Forest", "Franklin", "Fulton", "Greene", "Huntingdon", "Indiana", "Jefferson", "Juniata", "Lackawanna", "Lancaster", "Lawrence", "Lebanon", "Lehigh", "Luzerne", "Lycoming", "McKean", "Mercer", "Mifflin", "Monroe", "Montgomery", "Montour", "Northampton", "Northumberland", "Perry", "Philadelphia", "Pike", "Potter", "Schuylkill", "Snyder", "Somerset", "Sullivan", "Susquehanna", "Tioga", "Union", "Venango", "Warren", "Washington", "Wayne", "Westmoreland", "Wyoming", "York"]
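
One way to use this list is to accept a line as a county header only when the whole line is a county name. The sketch below assumes headers look like "Philadelphia County:" (based on the `'County:'` check in the earlier cell) and uses a short subset of the list so that it runs on its own. The `county_from_header` helper and the sample lines are hypothetical.

```python
import re

# A short subset of the full list above, so the sketch runs on its own
sample_counties = ["Adams", "Allegheny", "McKean", "Philadelphia", "York"]

# Map lowercase names back to their canonical spelling (e.g. "mckean" -> "McKean")
canonical = {name.lower(): name for name in sample_counties}

alternatives = "|".join(re.escape(name) for name in sample_counties)
# The whole line must be a county name, optionally followed by "County" and a colon
county_header_re = re.compile(
    rf"^\s*({alternatives})(?:\s+County)?\s*:?\s*$", re.IGNORECASE
)


def county_from_header(line):
    """Return the county name if the whole line is a county header, else None."""
    match = county_header_re.match(line)
    return canonical[match.group(1).lower()] if match else None


print(county_from_header("Philadelphia County:"))
print(county_from_header("York County Children and Youth indicated the report"))
```

Matching the whole line, rather than only the start, keeps narrative lines that happen to begin with a county name from being read as headers.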

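Every document groups its cases under nested headers: first by whether the case was a fatality or a near fatality, then by county. The code below compiles two regular expressions: `fatality_re` identifies the section headers, and `line_re` identifies the numbered lines that start each incident.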

In [None]:
fatality_re = re.compile(r'(Fatalities|Near Fatalities)')
line_re = re.compile(r'\d{1,2}((\.|(\)))|\.(\)))\s')

In [None]:
line_re.search('')
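
Searching an empty string returns `None`, so a more informative check is to run the pattern against a few line starts. The sketch below uses invented sample lines and an anchored variant of the pattern, on the assumption that incident numbers always sit at the start of a line. Without the anchor, a pattern like `line_re` also matches the "19. " inside a sentence such as "moved in 2019. The agency".

```python
import re

# Anchored variant: one or two digits, then ".", ")" or ".)", then whitespace
numbered_start_re = re.compile(r"^\s*\d{1,2}(?:\.\)|\.|\))\s")


def is_numbered_start(line: str) -> bool:
    """Return True when the line begins with an incident number."""
    return numbered_start_re.search(line) is not None


# Made-up examples, not taken from a report
samples = [
    "1. A 3-year-old male child died",
    "12) A 5-month-old female child nearly died",
    "3.) On March 2, a child nearly died",
    "The family moved in 2019. The agency",
    "Philadelphia County:",
]

for text in samples:
    print(is_numbered_start(text), "|", text)
```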

Set the file for your search

In [None]:
file = 'act_33_quarterly/1st Quarter Summaries of Child Fatalities Near Fatalities (1).pdf'

Parse out the data in the report

In [None]:
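lines = []

# Start with empty values so the first incident never hits an undefined name
fatality = ''
county = ''

with pdfplumber.open(file) as pdf:
    for page in pdf.pages:
        # extract_text() can come back empty for a page without a text layer
        text = page.extract_text() or ''
        # str.split() does not accept a regex, so split on line breaks
        # and let line_re pick out the numbered incident lines
        for line in text.split('\n'):
            print(line)
            fatality_set = fatality_re.search(line)
            if fatality_set:
                fatality = fatality_set.group(1)

            elif line.startswith(tuple(pa_counties)):
                county = line

            elif line_re.match(line):
                # Only the section and county are known at this point; the other
                # columns still have to be parsed out of the narrative text
                unparsed = [None] * (len(Line._fields) - 2)
                lines.append(Line(fatality, county, *unparsed))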
