![logo](https://resolvephilly.org/themes/custom/resolvephl-ci/logo.svg)

# Scraping and standardizing Pennsylvania Act 146 Quarterly Reports

**Author:** Julie Christie | Director of Data & Impact

**Partnering Team:** Our Kids

**Date:** March 28, 2024

## Background

Pennsylvania publishes reports on child fatalities and near fatalities that the state determined were the result of child abuse. The reports include the age, sex, county, and date of the incident, as well as whether the family was previously known to the local department of human/family services. The detailed individual reports are all published online, but scraping them may prove more complicated than scraping the quarterly reports. The quarterly reports are written in a narrative format that is consistent within a given quarterly report. However, the structure of these narrative summaries can change between reports, which complicates scraping them as well.

### Goal of Analysis

Specifically, Resolve wants to understand how often children whose abuse results in death or near death were already known to the system. We are exploring these rates at the county level to understand the statewide trend and how Philadelphia compares to it.

### Glossary

-   **Act 146** -- *"Act 146 of 2006 went into effect on May 8, 2007. A major provision of this law requires that the department prepare a non-identifying summary for the governor and the General Assembly of findings for each case of substantiated child abuse or neglect that has resulted in a child fatality or near fatality."*
-   **Near fatality** -- *Definition TKTK, which is determined by the "certifying physician" from the state.*
-   **DA De-certification** -- *This gets assigned to a report when the District Attorney determines that the incident was not a result of child abuse.*

### Data

-   [Child Fatality/Near Fatality Quarterly Reports](https://www.pa.gov/en/agencies/dhs/resources/data-reports/quarterly-summaries-child-abuse.html) --- A collection of brief summaries of fatalities/near fatalities of children due to abuse. | No metadata available

### Tools

-   [Python](https://www.python.org/) -- *Base code to facilitate scraping*
-   [Pandas](https://pandas.pydata.org/) -- *More robust data analysis*
-   [Regex](https://developers.google.com/edu/python/regular-expressions) -- *Regular Expressions, or Regex, to parse out patterns of characters*
-   [PDF Plumber](https://github.com/jsvine/pdfplumber) -- *Parse information from .pdf files*
-   [Excel](https://www.microsoft.com/en-us/microsoft-365/p/excel/cfq7ttc0hr4r?activetab=pivot:overviewtab) -- *Clean and analyze tabulated data*
-   [R](https://www.r-project.org/) -- *Clean and analyze data*

### Limitations
- A "certifying physician" makes an individual call on whether a child's death/near death is the result of abuse, meaning that human error may result in cases not being documented in these reports.
- Child fatalities and near fatalities as a result of abuse are an incredibly small and extreme subset of the overall abuse that children face. This analysis does not constitute a full picture, but rather is a snapshot of what the state deemed the most egregious cases.
- These quarterly reports may not contain all incidents. A previous scrape of individual reports yielded about 2,400 reports, while this scrape yielded 1,434 reports from the quarterly summaries.
- The set of quarterly reports is incomplete. The following quarters are missing:
  - 2011 Q4
  - 2012 Q4
  - 2014 Q4
  - 2015 Q4
  - 2023 Q2-Q4
  - 2024 Q1-Q2

## Cleaning

1. Download all the reports from the Pennsylvania DHS site. (See Data for direct link.)
2. Rename the files to have a standard structure.
3. Make sure that you convert anything that was downloaded as a `.docx` file into a `.pdf` file.
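
Once the files follow one naming structure, a quick check can show which quarters are absent from the download folder. The sketch below is only an illustration: it assumes the `act146_YYYY_QN.pdf` convention seen in `act146_2023_Q1.pdf`, and `find_missing_quarters` and the example file list are hypothetical names, not part of the original workflow.

```python
import re

# Assumed naming convention, modeled on 'act146_2023_Q1.pdf'
FILENAME_PATTERN = re.compile(r"act146_(\d{4})_Q([1-4])\.pdf$")


def find_missing_quarters(filenames, first_year, last_year):
    """Return (year, quarter) pairs in the range that have no matching file."""
    found = set()
    for name in filenames:
        match = FILENAME_PATTERN.search(name)
        if match:
            found.add((int(match.group(1)), int(match.group(2))))
    expected = [
        (year, quarter)
        for year in range(first_year, last_year + 1)
        for quarter in range(1, 5)
    ]
    return [pair for pair in expected if pair not in found]


# Hypothetical file list, for illustration only
example_files = ["act146_2022_Q1.pdf", "act146_2022_Q2.pdf", "act146_2022_Q4.pdf"]
print(find_missing_quarters(example_files, 2022, 2022))  # → [(2022, 3)]
```

Files that do not match the pattern are ignored, so a misnamed file shows up as a missing quarter, which is a useful prompt to rename it.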

### Overview of process

Each report is put together with a basic structure of: 

```
Fatalities
    County 1
        1. Incident description
        2. Incident description
        3. Incident description
        ...
    County 2 
    ...
    County 67
        ...

Near Fatalities
    County 1
        1. Incident description
        2. Incident description
        3. Incident description
        ...
    County 2 
    ...
    County 67
        ...
```

And within that, each incident description is roughly structured as:

> 1. A `##-age-old` `sex` child `died/nearly died` on `date` as a result of .... `Agency Name` indicated the report on ... naming the victim child's `identifier for relationship` as the perpetrator(s). ... Further details of the incident are written out. ...  The family `was/was not known` to child welfare.

However, the phrasing varies between reports, for example: "On `date` a `##-age-old` `sex` child `died/nearly died` ..."

The regex must also account for sentences that mention a sibling with a similar structure, like "the victim's ##-age-old sibling was present at the time."
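
As a rough illustration of that constraint, the sketch below pulls the age, sex, and outcome from either phrasing while skipping sibling mentions, because it requires the words "child died" or "child nearly died" after the age. The age units, the exact wording, the `extract_victim` helper, and the sample sentences are all assumptions made for this example; the real reports will need more patterns.

```python
import re

# Matches e.g. "3-month-old male child died" or "2-year-old female child nearly died".
# The units and wording are assumptions based on the template described above.
VICTIM_PATTERN = re.compile(
    r"(?P<age>\d{1,2})-(?P<unit>day|week|month|year)-old\s+"
    r"(?P<sex>male|female)\s+child\s+(?P<outcome>nearly died|died)",
    re.IGNORECASE,
)


def extract_victim(narrative):
    """Return a dict describing the victim, or None if the template is not found."""
    match = VICTIM_PATTERN.search(narrative)
    if match is None:
        return None
    return {
        "age": int(match.group("age")),
        "unit": match.group("unit").lower(),
        "sex": match.group("sex").lower(),
        "outcome": match.group("outcome").lower(),
    }


# Invented sentences that follow the two phrasings, for illustration only
first = "1. A 3-month-old male child died on March 4, 2022, as a result of physical abuse."
second = (
    "2. On May 9, 2022, a 2-year-old female child nearly died as a result of neglect. "
    "The victim's 4-year-old sibling was present at the time."
)
print(extract_victim(first))
print(extract_victim(second))
```

A narrative that returns `None` is a candidate for manual review rather than proof that the information is missing.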

### Prepare Python Environment

In [None]:
# pip install pdfplumber

In [None]:
# pip install pandas
# pip install pypdf2

Import the needed libraries

In [1]:
import pdfplumber       # PDF Plumber to scrape through .pdf files
import re               # Regular Expressions
import csv              # Comma Separated Values
import os               # To help with accessing directories

### Parse incidents into a `.csv` file

This code was written by [Maggie Lee](http://maggielee.net/)

This code should eventually loop over every report. In the interest of time, however, I am manually creating a new .csv for each report. Not all of the parses will be accurate, so I will fix those manually as well.

In [45]:
# Set location of file to scrape and destination file for data

#directory = r"/Users/juliechristie/Desktop/OK — CUA System/act_33_quarterly"

#for filename in glob.glob(f"{directory}/*"):


filename = 'act_33_quarterly/act146_2023_Q1.pdf'

# Get the base name without the extension
base_name = os.path.splitext(filename)[0]

# Create the CSV output file name with a .csv extension
csv_output_file = f'{base_name}.csv'

text_of_a_single_pdf = ''

# output is going to be a list of lists
# each list in there will be a list of output: report type, county and narrative
output = []

#  this opens the pdf, and loops through every page in the pdf and puts the text of all pages together in `text_of_a_single_pdf`

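with pdfplumber.open(filename) as pdf:
	pages = pdf.pages
	for page in pages:
		# extract_text() can come back empty for pages without text, so fall back to ''
		text = page.extract_text() or ''
		# add a newline so the last line of one page does not merge with the first line of the next
		text_of_a_single_pdf = text_of_a_single_pdf + text + '\n'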


text_as_lines = text_of_a_single_pdf.split('\n')

#  default report type will be fatality, this assumes fatalities always come first
report_type = 'fatality'
county = ''
narrative = ''
new_row = []
for line in text_as_lines:

	if 'Near Fatalities:' in line:
		# when parsing, if you come to the line 'near fatalities', the variable 'report_type' will change
		report_type = 'near fatality'
	elif 'County:' in line:
		# same with county: the county will stay the same, line after line, until the parser sees a new county name
		county = line
	# Compare the beginning of the line with a regex that identifies the different types of numbered line starts in the document, such as "1.", "1)", "1.)" or "1-2."
	elif re.match(r"\s*(?:\d{1,2}-\d{1,2}|\d{1,2})(?:\.\)|\.|\))", line):
		# a numbered paragraph starts a new incident, so log the previous row first
		# (new_row is still empty before the first incident, so there is nothing to log yet)
		if new_row:
			output.append(new_row)
		# and start a new row that already holds the first line of the narrative
		narrative = line
		new_row = [report_type, county, narrative]
	elif new_row:
		# continuation of the current incident; join with a space so words don't run together
		narrative = narrative + ' ' + line
		new_row[2] = narrative
	# any other line comes before the first incident (titles, the 'Fatalities:' heading) and is skipped

#  then log the very last paragraph
output.append(new_row)



# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open(csv_output_file, 'w', newline='') as f:
	writer = csv.writer(f)
	for row in output:
		writer.writerow(row)
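
The cell above handles one file at a time. One possible way to turn it into a loop is sketched below. It is not the code that produced the files in this project: `parse_lines` and `scrape_directory` are hypothetical helper names, the parsing rules are a simplified restatement of the cell above, and the sample lines are invented.

```python
import csv
import glob
import os
import re

# Numbered line starts such as "1.", "1)", "1.)" or "1-2."
NUMBERED_START = re.compile(r"\s*\d{1,2}(?:-\d{1,2})?(?:\.\)|\.|\))")


def parse_lines(lines):
    """Turn the text lines of one report into [report_type, county, narrative] rows."""
    rows = []
    report_type, county, current = "fatality", "", None
    for line in lines:
        if "Near Fatalities:" in line:
            report_type = "near fatality"
        elif "County:" in line:
            county = line
        elif NUMBERED_START.match(line):
            # A numbered line starts a new incident; store the previous one first
            if current is not None:
                rows.append(current)
            current = [report_type, county, line]
        elif current is not None:
            # Continuation line of the current incident
            current[2] = current[2] + " " + line
    if current is not None:
        rows.append(current)
    return rows


def scrape_directory(directory):
    """Write one .csv next to every .pdf found in `directory`."""
    import pdfplumber  # imported here so parse_lines can be tried without it

    for filename in sorted(glob.glob(os.path.join(directory, "*.pdf"))):
        with pdfplumber.open(filename) as pdf:
            text = "\n".join(page.extract_text() or "" for page in pdf.pages)
        rows = parse_lines(text.split("\n"))
        csv_output_file = os.path.splitext(filename)[0] + ".csv"
        with open(csv_output_file, "w", newline="") as f:
            csv.writer(f).writerows(rows)


# Invented lines that mimic the report layout, for illustration only
sample_lines = [
    "Fatalities:",
    "Adams County:",
    "1. A child died",
    "on a date.",
    "Near Fatalities:",
    "Berks County:",
    "1) Another child",
    "nearly died.",
]
for row in parse_lines(sample_lines):
    print(row)
```

Keeping the line parsing in its own function means it can be tried on a handful of pasted lines before running it against every PDF.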

### Combine individual scraped data

The first part of the process consisted of manually going through each generated .csv file, correcting county names that were not scraped properly (which was most cases), and repairing reports that had been split across multiple rows of data.

Then, I ensured that each .csv had identical headers before importing them into R and combining them.
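
The header check can also be done with a few lines of code instead of opening each file. This is a minimal sketch, assuming the header is `status`, `county`, `narrative` (inferred from the column names the R code in this notebook relies on); `find_header_mismatches` is a hypothetical helper, not something used in the original process.

```python
import csv

# Assumed header, inferred from the columns used in the R code (status, county, narrative)
EXPECTED_HEADER = ["status", "county", "narrative"]


def find_header_mismatches(paths, expected=EXPECTED_HEADER):
    """Return {path: header} for every file whose first row differs from `expected`."""
    mismatches = {}
    for path in paths:
        with open(path, newline="", encoding="utf-8") as f:
            # An empty file has no first row, so fall back to an empty header
            header = next(csv.reader(f), [])
        if header != expected:
            mismatches[path] = header
    return mismatches
```

Calling `find_header_mismatches(glob.glob("act_33_quarterly/csv files/*.csv"))` (after `import glob`) would list only the files that still need their header row fixed.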

```{r}
# Libraries
library(data.table)
library(tidyverse)

# Import all reports and then combine them into one file
all_reports <-
  list.files(path = "act_33_quarterly/csv files/",
             pattern = "\\.csv$",
             full.names = TRUE) %>%
  map_df(~read_csv(., col_types = cols(.default = "c")))

# Remove rows where there is no county
all_reports <- all_reports %>%
  filter(!is.na(county))

# Clean up the name referencing the county
all_reports$county <- word(all_reports$county, 1)
```

Next, create a new column that identifies the incident date. Go through any NA values manually and fix the original .csv file for problems, such as a missing space, that break the regex.

```{r}
# Define a function to extract the first date from a string
extract_incident_date <- function(narrative) {
  # Regular expression to match dates in the format Month/Mon. DD, YYYY
  date_pattern <- "\\b(?:Jan\\.|Feb\\.|Mar\\.|Apr\\.|May|Jun\\.|Jul\\.|Aug\\.|Sep\\.|Sept\\.|Oct\\.|Nov\\.|Dec\\.)\\s\\d{1,2},\\s\\d{4}\\b|\\b(?:January|February|March|April|May|June|July|August|September|October|November|December)\\s\\d{1,2},\\s\\d{4}\\b"
  
  # Use str_extract to get the first match
  incident_date <- str_extract(narrative, date_pattern)
  
  return(incident_date)
}

# Apply the function to the 'narrative' column and create a new column 'incident_date'
# (str_extract is vectorized, so the function can take the whole column at once)
all_reports$incident_date <- extract_incident_date(all_reports$narrative)
```
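
For readers who would rather do this step in Python, the same idea can be sketched as follows. This is a variation, not a port: it returns a `datetime.date` instead of a string, which makes pulling out the year straightforward, and `extract_incident_date` plus the sample sentences are written for this example only.

```python
import re
from datetime import date

MONTHS = {
    "jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
    "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12,
}

# Full or abbreviated month name, optional period, then "DD, YYYY"
DATE_PATTERN = re.compile(
    r"\b(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|"
    r"Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)"
    r"\.?\s+(\d{1,2}),\s+(\d{4})\b"
)


def extract_incident_date(narrative):
    """Return the first date in the narrative as a date object, or None."""
    match = DATE_PATTERN.search(narrative)
    if match is None:
        return None
    month = MONTHS[match.group(1)[:3].lower()]
    try:
        return date(int(match.group(3)), month, int(match.group(2)))
    except ValueError:
        # The text looked like a date but is not a valid calendar day
        return None


# Invented sentences, for illustration only
print(extract_incident_date("A 2-year-old male child died on March 4, 2022, as a result of abuse."))
print(extract_incident_date("The child nearly died on Sept. 12, 2019, as a result of neglect."))
```

Narratives that return `None` are the ones to inspect by hand, mirroring the NA check described above.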

🚧 Then, using an almost identical process, identify the portion of the narrative that identifies whether the family was known to the agency.

```{r}
# Define a function to pull the first sentence that has "known" or "prior" in it
extract_prior_status <- function(narrative) {
  # Regular expression to match the words
  prior_pattern <- "[^.]* (?:(?:was known|was not known)|(?:had prior|had no prior)|history|involved) [^.]*\\."

  #Use str_extract to get the first match
  prior_status <- str_extract(narrative, prior_pattern)

  return(prior_status)
}

# Apply function to 'narrative' column and create a new column 'prior_status'
# (str_extract is vectorized, so the function can take the whole column at once)
all_reports$prior_status <- extract_prior_status(all_reports$narrative)
```
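
The extracted sentence still has to be turned into a yes/no value before rates can be calculated. The sketch below shows one cautious way to do that in Python. It only trusts the four explicit phrasings from the pattern above and marks everything else, including sentences that only mention "history" or "involved", as unclear so they can be read by a person. `classify_prior_status` and the sample sentences are hypothetical.

```python
import re

# Check the negative phrasings first so "was not known" is never read as "known"
NOT_KNOWN = re.compile(r"\b(?:was not known|had no prior)\b", re.IGNORECASE)
KNOWN = re.compile(r"\b(?:was known|had prior)\b", re.IGNORECASE)


def classify_prior_status(sentence):
    """Return 'not known', 'known', or 'unclear' for an extracted sentence."""
    if not sentence:
        # Covers missing values and empty strings
        return "unclear"
    if NOT_KNOWN.search(sentence):
        return "not known"
    if KNOWN.search(sentence):
        return "known"
    return "unclear"


# Invented sentences, for illustration only
print(classify_prior_status("The family was not known to child welfare."))  # → not known
print(classify_prior_status("The family was known to child welfare."))      # → known
print(classify_prior_status("The family had a history with the agency."))   # → unclear
```

Counting the "unclear" rows first gives a sense of how much manual review remains before any county-level rate is reported.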

## 🚧 Analysis

Here is some preliminary analysis.

### Total incidents for all counties

```{r}
county_totals <- all_reports %>%
  group_by(county) %>%
  summarise(total = n())
```

See the results here.

### Philadelphia annual totals

```{r}
# Filter for Philadelphia reports and create a timeline table
phl_reports <- all_reports %>%
  filter(county == "Philadelphia") %>%
  mutate(year = substr(incident_date, nchar(incident_date) - 3, nchar(incident_date)))

phl_timeline <- phl_reports %>%
  group_by(year) %>%
  summarise(cases = n())

# Show the overall timeline of incidents in Philadelphia
ggplot(data = phl_timeline,
       aes(x = year, y = cases, group = 1)) +
  geom_line(linewidth = 1.5, alpha = .75) +
  #Labels
  labs(
    y = "Number of cases",
    x = "Year",
    title = "Philadelphia fatalities and near fatalities")

# Show the Philadelphia incidents by fatality and near fatality
phl_timeline2 <- phl_reports %>%
  group_by(year, status) %>%
  summarise(cases = n())

ggplot(data = phl_timeline2,
       aes(x = year, y = cases, group = status, color = status)) +
  geom_line(linewidth = 1.5, alpha = .75) +
  #Labels
  labs(
    y = "Number of cases",
    x = "Year",
    title = "Philadelphia fatalities and near fatalities")
```

![Overall timeline](https://resolvephilly.org/themes/custom/resolvephl-ci/logo.svg)

![Fatality type timeline](https://resolvephilly.org/themes/custom/resolvephl-ci/logo.svg)

Read through the .csv files generated here:

- All reports
- County totals
- Philadelphia reports
- Overall timeline for Philadelphia
- Fatality type timeline for Philadelphia
