# **DSC 190 – PDF Parser Notebook**

This notebook converts the original **UCSD Police Report PDFs** into a clean,
tabular CSV file that can be used in the EDA and statistical analysis notebooks.

The parsing logic here is *only* for data wrangling; no statistical analysis is
performed in this notebook.

## 1. Install PDF parsing dependency

We install `pdfplumber`, a Python library that lets us extract text from PDF
pages while preserving line breaks. This is needed to read the police report
PDFs line by line.

In [1]:
!pip install pdfplumber

Collecting pdfplumber
  Downloading pdfplumber-0.11.8-py3-none-any.whl.metadata (43 kB)
Collecting pdfminer.six==20251107 (from pdfplumber)
  Downloading pdfminer_six-20251107-py3-none-any.whl.metadata (4.2 kB)
Collecting pypdfium2>=4.18.0 (from pdfplumber)
  Downloading pypdfium2-5.1.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (67 kB)
Downloading pdfplumber-0.11.8-py3-none-any.whl (60 kB)
Downloading pdfminer_six-20251107-py3-none-any.whl (5.6 MB)

## 2. Mount Google Drive and import libraries

This cell mounts Google Drive, where the report PDFs are stored, and imports:

- `os`, `Path`, and `re` for file paths and regular expressions,
- `pdfplumber` to read PDF text,
- `pandas` to store the parsed records in a DataFrame.

In [2]:
from google.colab import drive
import os
import re
import pdfplumber
import pandas as pd
from pathlib import Path

drive.mount('/content/drive')

Mounted at /content/drive


## 3. Parse police report PDFs into a structured table

### 3.1 `read_pdf_lines(pdf_path)`

- Opens a single PDF file with `pdfplumber`.
- Iterates over **each page** and extracts the raw text.
- Splits the text into lines, trims each line, and collapses any run of
  whitespace (spaces, tabs) into a single space with a regular expression.
- Returns a list of cleaned text lines, preserving the original line order.

This function turns each PDF into a simple list of strings that we can search
and slice.
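
To make the normalization step concrete, here is a minimal sketch that applies the same `re.sub` call to an in-memory string instead of a PDF page. The helper name `normalize_lines` and the sample text are made up for illustration; they are not part of the notebook.

```python
import re

def normalize_lines(text):
    """Split text into lines and collapse whitespace runs, mirroring read_pdf_lines."""
    return [re.sub(r"\s+", " ", raw.strip()) for raw in text.splitlines()]

# Hypothetical page text with irregular spacing (not copied from a real report).
sample_text = "Fire Alarm\n  Tamarack   Apartments \nDate Reported\t8/10/2025"

cleaned = normalize_lines(sample_text)
print(cleaned)
# ['Fire Alarm', 'Tamarack Apartments', 'Date Reported 8/10/2025']
```

Because tabs and repeated spaces all become a single space, later prefix checks such as `startswith("Date Occurred ")` can rely on exactly one space after each label.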

In [3]:
pdf_folder_path = '/content/drive/MyDrive/DSC 190 Project/Police Reports'

def read_pdf_lines(pdf_path: Path):
    """Extract lines page-by-page, preserving line breaks and normalizing spaces."""
    lines = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            text = page.extract_text() or ""
            for raw in text.splitlines():
                ln = re.sub(r"\s+", " ", raw.strip())
                lines.append(ln)
    return lines

### 3.2 `parse_records(lines)`

- Scans through the list of lines produced by `read_pdf_lines`.
- Uses a regular expression anchored on `"Date Reported mm/dd/yyyy"` to find the
  **start of each incident entry**.
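- From each anchor, reads a fixed block of surrounding lines. The two lines
  before the anchor hold the incident type and the location; the five lines
  after it hold:
  - Incident/Case number  
  - Date Occurred  
  - Time Occurred  
  - Summary text  
  - Disposition / outcome
- Skips any anchor whose following lines do not start with the expected field
  labels.
- Stores each incident as a dictionary with one key per field.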
- Returns a list of records, one per incident in the PDF.
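
The anchoring idea can be sketched on a hand-written list of lines. The pattern below is the one `parse_records` uses; the sample lines are an assumption about how one entry looks after `read_pdf_lines`, reconstructed from the field labels the parser checks for rather than copied from a PDF.

```python
import re

# Same anchor pattern that parse_records uses.
ANCHOR = re.compile(r"^Date Reported\s+(\d{1,2}/\d{1,2}/\d{4})$")

# Hypothetical entry laid out in the order the parser expects.
sample_lines = [
    "Fire Alarm",                  # i-2: incident type
    "Tamarack Apartments",         # i-1: location
    "Date Reported 8/10/2025",     # i  : anchor line
    "Incident/Case# 2508100010",   # i+1
    "Date Occurred 8/10/2025",     # i+2
    "Time Occurred 10:31 AM",      # i+3
    "Summary:",                    # i+4 (empty summary)
    "Disposition: False Alarm",    # i+5
]

# Collect (index, date) pairs for every line that matches the anchor.
anchor_hits = []
for i, ln in enumerate(sample_lines):
    m = ANCHOR.match(ln)
    if m:
        anchor_hits.append((i, m.group(1)))

# Read fields at fixed offsets from the first anchor.
i, date_reported = anchor_hits[0]
incident_type = sample_lines[i - 2]
case_number = sample_lines[i + 1].split("Incident/Case# ", 1)[1]
summary = sample_lines[i + 4].split("Summary:", 1)[1].strip()

print(anchor_hits, incident_type, case_number, repr(summary))
```

An entry with no summary text still passes the `startswith("Summary:")` check and yields an empty string, which is why some rows in the final table have a blank `Summary`.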

In [4]:
def parse_records(lines):
    """
    Find each entry by anchoring on 'Date Reported mm/dd/yyyy' and reading the fixed block:
      i-2: Incident type
      i-1: Location
      i  : Date Reported ...
      i+1: Incident/Case# ...
      i+2: Date Occurred ...
      i+3: Time Occurred ...
      i+4: Summary:
      i+5: Disposition:
    """
    recs = []
    for i, ln in enumerate(lines):
        m = re.match(r"^Date Reported\s+(\d{1,2}/\d{1,2}/\d{4})$", ln)
        if not m:
            continue
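        # Guard the offsets explicitly: a negative index such as lines[i-2]
        # would wrap around to the end of the list instead of raising IndexError.
        if i < 2 or i + 5 >= len(lines):
            continue
        incident_type    = lines[i-2].strip()
        location         = lines[i-1].strip()
        inc_case_ln      = lines[i+1].strip()
        date_occ_ln      = lines[i+2].strip()
        time_occ_ln      = lines[i+3].strip()
        summary_ln       = lines[i+4].strip()
        disposition_ln   = lines[i+5].strip()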

        if not inc_case_ln.startswith("Incident/Case# "):  continue
        if not date_occ_ln.startswith("Date Occurred "):    continue
        if not time_occ_ln.startswith("Time Occurred "):    continue
        if not summary_ln.startswith("Summary:"):           continue
        if not disposition_ln.startswith("Disposition:"):   continue

        date_reported = m.group(1)
        incident_case = inc_case_ln.split("Incident/Case# ", 1)[1].strip()
        date_occurred = date_occ_ln.split("Date Occurred ", 1)[1].strip()
        time_occurred = time_occ_ln.split("Time Occurred ", 1)[1].strip()
        summary       = summary_ln.split("Summary:", 1)[1].strip()
        disposition   = disposition_ln.split("Disposition:", 1)[1].strip()

        recs.append({
            "Incident type": incident_type,
            "Location": location,
            "Date Reported": date_reported,
            "Incident/Case#": incident_case,
            "Date Occurred": date_occurred,
            "Time Occurred": time_occurred,
            "Summary": summary,
            "Disposition": disposition,
        })
    return recs

### 3.3 Loop over all PDFs, create DataFrame, and save CSV

- Lists every `*.pdf` file directly inside `pdf_folder_path` (the Police Reports
  directory), in sorted filename order.
- For each PDF:
  - Calls `read_pdf_lines` to get the lines.
  - Calls `parse_records` to extract structured incident records.
  - Extends a master list of all records found across files.
- Converts the full list of dictionaries into a `pandas.DataFrame` with columns:

  - `"Incident type"`
  - `"Location"`
  - `"Date Reported"`
  - `"Incident/Case#"`
  - `"Date Occurred"`
  - `"Time Occurred"`
  - `"Summary"`
  - `"Disposition"`

- Writes the DataFrame out to:

  `NEW_police_logs_parsed_EXACT.csv`

  in the same folder as the PDFs.

The printed messages at the end report how many incidents were parsed and show
the first few rows of the resulting table, to verify that the parsing worked
correctly.
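
When the CSV is loaded in a downstream notebook, `pandas.read_csv` will by default infer `Incident/Case#` as an integer and turn empty summaries into `NaN`. The sketch below shows one way to round-trip the file while keeping case numbers as strings and empty summaries as empty strings. It uses two illustrative rows and a temporary file with a made-up name (`demo_police_logs.csv`), not the real output path.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Two illustrative rows in the same column layout as the parsed table.
rows = [
    {"Incident type": "Fire Alarm", "Location": "Tamarack Apartments",
     "Date Reported": "8/10/2025", "Incident/Case#": "2508100010",
     "Date Occurred": "8/10/2025", "Time Occurred": "10:31 AM",
     "Summary": "", "Disposition": "False Alarm"},
    {"Incident type": "Animal Call", "Location": "SIO Pier",
     "Date Reported": "8/10/2025", "Incident/Case#": "2508100008",
     "Date Occurred": "8/10/2025", "Time Occurred": "8:47 AM",
     "Summary": "People touching leopard sharks",
     "Disposition": "Referred to Other Agency"},
]
df_demo = pd.DataFrame(rows)

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "demo_police_logs.csv"  # hypothetical file name
    df_demo.to_csv(csv_path, index=False)
    # dtype keeps case numbers as text; keep_default_na=False keeps "" as "".
    df_back = pd.read_csv(
        csv_path,
        dtype={"Incident/Case#": str},
        keep_default_na=False,
    )

case_numbers = df_back["Incident/Case#"].tolist()
summaries = df_back["Summary"].tolist()
print(case_numbers)
print(summaries)
```

Whether to keep empty summaries as `""` or as `NaN` is a choice for the analysis notebooks; the point here is only that the reader, not the CSV, decides the column types.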

In [5]:
pdf_paths = sorted(Path(pdf_folder_path).glob("*.pdf"))

all_rows = []
for p in pdf_paths:
    lines = read_pdf_lines(p)
    all_rows.extend(parse_records(lines))

df = pd.DataFrame(all_rows, columns=[
    "Incident type",
    "Location",
    "Date Reported",
    "Incident/Case#",
    "Date Occurred",
    "Time Occurred",
    "Summary",
    "Disposition",
])

out_csv = str(Path(pdf_folder_path) / "NEW_police_logs_parsed_EXACT.csv")
df.to_csv(out_csv, index=False)

print(f"Parsed {len(df)} entries")
print(f"Saved: {out_csv}")

display(df.head(10))


Parsed 2456 entries
Saved: /content/drive/MyDrive/DSC 190 Project/Police Reports/NEW_police_logs_parsed_EXACT.csv


| | Incident type | Location | Date Reported | Incident/Case# | Date Occurred | Time Occurred | Summary | Disposition |
|---|---|---|---|---|---|---|---|---|
| 0 | Suspicious Person | One Miramar Street, Building 2 | 8/10/2025 | 2508100004 | 8/10/2025 | 4:30 AM | Subject possibly carrying a folded cardboard box | Unable to Locate |
| 1 | Incomplete/Accidental Wireless 911 | Tioga Hall | 8/10/2025 | 2508100006 | 8/10/2025 | 7:17 AM | | Logged Event |
| 2 | Elevator Problem | South Parking Structure | 8/10/2025 | 2508100007 | 8/10/2025 | 8:35 AM | Person stuck inside southwest elevator | Referred to Other Department (UCSD) |
| 3 | Animal Call | SIO Pier | 8/10/2025 | 2508100008 | 8/10/2025 | 8:47 AM | People touching leopard sharks | Referred to Other Agency |
| 4 | Fire Alarm | Tamarack Apartments | 8/10/2025 | 2508100010 | 8/10/2025 | 10:31 AM | | False Alarm |
| 5 | Suspicious Person | One Miramar Street, Building 3 | 8/10/2025 | 2508100013 | 8/10/2025 | 12:59 PM | Male came to reporting party's front door, ask... | Gone on Arrival |
| 6 | Gas/Water/Sewer Leak | Regents Road/Regents Park Row | 8/10/2025 | 2508100018 | 8/10/2025 | 3:51 PM | Leaking hydrant | Referred to Other Agency |
| 7 | Incomplete/Accidental Landline 911 | Brisa | 8/10/2025 | 2508100021 | 8/10/2025 | 4:30 PM | | Logged Event |
| 8 | Suspicious Person | Center Hall | 8/10/2025 | 2508100023 | 8/10/2025 | 5:09 PM | Male pulling stuff out of construction site | Gone on Arrival |
| 9 | Incomplete/Accidental Landline 911 | Brisa | 8/10/2025 | 2508100025 | 8/10/2025 | 5:29 PM | | Logged Event |
