# Workshop: Web, DOCX, and PDF Scraping in Python  
### Case Study: KDHE Boil Water Advisories and Orders

## Introduction

In this workshop, we will build a data pipeline to collect public notices from the Kansas Department of Health and Environment (KDHE) website. Our focus is on boil water advisories (BWA) and boil water orders (BWO), which are issued when contamination is confirmed or suspected in a public water system.

Government agencies often publish important public health data as web pages, PDFs, and document files rather than as clean machine-readable datasets. Scraping helps us convert these public notices into structured data we can analyze over time.

## Why Scraping?

Scraping is useful when:
- Data is public but not downloadable in a single table
- Information is split across many pages/documents



## Project Plan

This workshop focuses on building a web-scraping data collection workflow for KDHE Consumer Confidence Report (CCR) .docx files.
<!-- #BWA/BWO notices. -->

### Steps


We will:
1. Understand the layout of the documents we will gather data from
2. Set up a document scraper in Python to collect the relevant information
3. Extract core fields and parse them into a pandas DataFrame
4. Run checks on the DataFrame to ensure the data is clean
5. Save results to a CSV

<!-- 1. Scrape a KDHE listing page to collect notice detail URLs
2. Parse each detail page for core fields (title, category, posted date, body text)
3. Extract additional structured fields from notice text when possible:
   - notice type (advisory/order/rescinded)
   - affected area
   - start/end timing clues
   - precaution language
   - contamination reason
4. Save results to a tidy CSV for analysis -->


## Core Python libraries

- **requests**: download web pages
- **BeautifulSoup (bs4)** + **lxml**: parse and navigate HTML
- **re** (regex): pattern matching for dates, IDs, and key phrases
- **python-dateutil** (`dateutil.parser`): convert date strings to datetime objects
- **pandas**: tabular data cleaning and export
- **time**: delays between requests
- **pypdf / pdfplumber / pymupdf**: parse PDF notices
- **python-docx**: parse Word documents
<!-- - **waybackpy or CDX API workflow**: historical backfill from the Internet Archive -->

### Environment

We will work in a conda environment and proceed step by step in a Jupyter Notebook, so that each step is:
- explained in markdown,
- implemented in code cells,
- and validated with intermediate outputs.


## Overview of the Project Steps

We will follow a pipeline approach:

1. **Define source URLs and scraping rules**
   - listing page URL
   - detail-page URL pattern
   - request headers and other scraping settings

2. **Fetch and parse listing page**
   - collect links
   - keep only valid detail-page links
   - deduplicate while preserving order

3. **Fetch and parse detail pages**
   - extract title, metadata (posted/updated), and body text
   - normalize fields into a consistent record format

4. **Build a structured dataset**
   - combine records into a pandas DataFrame
   - sort, inspect missing values, and standardize dates

5. **Export and validate**
   - save CSV
   - run basic checks (required columns, duplicate URLs, date check)

6. **(Optional) Extend to documents and historical coverage**
   - parse linked PDFs/DOCX files where relevant
   <!-- - incorporate archived pages for missing historical notices -->
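
The de-duplication and validation steps in the outline above can be sketched without any third-party packages. This is a minimal illustration on plain dictionaries, not the workshop's actual implementation: the helper names `dedupe_preserving_order` and `validate_records`, the sample URLs, and the required field names are all assumptions chosen for the example.

```python
from datetime import datetime


def dedupe_preserving_order(items):
    """Drop repeated items while keeping the first occurrence of each."""
    # dict keys are unique and remember insertion order (Python 3.7+)
    return list(dict.fromkeys(items))


def validate_records(records, required=("url", "title", "posted")):
    """Return a list of human-readable problems found in the records."""
    problems = []
    seen_urls = set()
    for i, rec in enumerate(records):
        # Required-field check: every field must be present and non-empty
        for field in required:
            if not rec.get(field):
                problems.append(f"record {i}: missing {field}")
        # Duplicate-URL check
        url = rec.get("url")
        if url and url in seen_urls:
            problems.append(f"record {i}: duplicate url {url}")
        if url:
            seen_urls.add(url)
        # Date check: a posted date in the future is suspicious
        posted = rec.get("posted")
        if isinstance(posted, datetime) and posted > datetime.now():
            problems.append(f"record {i}: posted date is in the future")
    return problems


# Hypothetical sample data, for illustration only
urls = ["https://example.com/a", "https://example.com/b", "https://example.com/a"]
print(dedupe_preserving_order(urls))

records = [
    {"url": "https://example.com/a", "title": "Notice A", "posted": datetime(2024, 1, 5)},
    {"url": "https://example.com/a", "title": "", "posted": datetime(2024, 1, 6)},
]
for problem in validate_records(records):
    print(problem)
```

The same checks translate directly to pandas once the records are in a DataFrame, for example with `df.duplicated("url")` and `df.isna().sum()`.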


In [1]:
from docx import Document
import pandas as pd
import os
import re

In [147]:
# Path to where documents have been saved
basedir = r"\\resfs.home.ku.edu\groups_hipaa\PSYC\kdsc_ClassData\KDSC-CDL-Project2\Data\Full Set of CCR Doc Files" 
subdir = r"ccrs2025\kdhe_A_E"

folder = os.path.join(basedir, subdir)

doc_files = [file for file in os.listdir(folder) if file.lower().endswith(".docx")]

print('first file:', doc_files[0])
testdoc = "ABBYVILLE-CITY-OF-KS2015512-DOCX.docx"
doc_path = os.path.join(basedir, subdir, testdoc)
doc = Document(doc_path)


first file: ABBYVILLE-CITY-OF-KS2015512-DOCX.docx


The python-docx package lets us open and edit a Word document entirely from Python. Since we are only scraping the data, we want to extract the information into a structured DataFrame and **NOT** change the original file. python-docx only writes to disk when we call `doc.save(doc_path)`, so all we need to do is never call it.

The `doc` variable holds a `Document` object. Its attributes contain the text, numbers, and tables, and we need to parse through them to get the information we need.

doc.paragraphs 

attrs = dir(doc)
print(attrs)

What we need to collect:
1. PWS NAME
2. PWS ID
3. Testing results
3.1. Regulated Contaminants
3.2. Lead and Copper
3.3. Chlorine/Chloramines Maximum Disinfection Level
3.4. Secondary Contaminants -- Non-Health Based Contaminants - No Federal Maximum Contaminant Level (MCL)
3.5. Compliance Period

In [None]:
# tables = doc.tables
# print(len(tables)) # note: a table can span pages (it begins near the bottom of one page and spills onto the next). This looks like a separate table (with new headers), but it is still indexed as part of the first table
# print(tables[0].cell(0,0).text) # there are multiple ways to index cells and read their .text, e.g. cell(0,0)
# print(tables[0].rows[0].cells[0].text)
# print(len(tables[0].rows)) # number of rows in the table
# print(len(tables[0].rows[0].cells)) # number of columns in the table
# print(tables[0].cell(0,0).text.strip())

In [None]:
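paragraphs = doc.paragraphs  # defined here so the cell runs on its own
p0 = paragraphs[3]
print(p0)

pws_text = p0.text

regex_search = re.search(  # the two ( ) groups capture the PWS name and the PWS ID; \s* allows optional whitespace between the labels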
    r'pws\s*name\s*:\s*(.*?)\s+pws\s*id\s*:\s*([A-Za-z0-9\-]+)',
    str(pws_text),
    flags=re.IGNORECASE
)

print(str(pws_text.strip()))
print(regex_search.group(1))
print(regex_search.group(2))
# print(expected_headers[0].strip().casefold())
# print(str(pws_text).strip().casefold().find(expected_headers[0].strip().casefold()))

<docx.text.paragraph.Paragraph object at 0x000002746B208190>
PWS NAME:	CITY OF ADMIRE			PWS ID: KS2011103
CITY OF ADMIRE
KS2011103
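
Because `re.search` returns `None` when a paragraph does not match, calling `.group()` directly raises an `AttributeError` on any document whose layout differs. The sketch below wraps the same pattern in a small helper that returns `None` instead. The function name `parse_pws_line` is an assumption made for illustration.

```python
import re

# Same pattern as in the cell above, compiled once for reuse
PWS_RE = re.compile(
    r"pws\s*name\s*:\s*(.*?)\s+pws\s*id\s*:\s*([A-Za-z0-9\-]+)",
    flags=re.IGNORECASE,
)


def parse_pws_line(text):
    """Return (pws_name, pws_id) from a header line, or None if absent."""
    match = PWS_RE.search(text)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2)


line = "PWS NAME:\tCITY OF ADMIRE\t\t\tPWS ID: KS2011103"
print(parse_pws_line(line))            # ('CITY OF ADMIRE', 'KS2011103')
print(parse_pws_line("no ids here"))   # None
```

With a helper like this, the hard-coded `paragraphs[3]` could be replaced by a scan over all paragraphs that keeps the first non-`None` result, which may be more robust if some reports place the PWS line elsewhere.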


In [209]:
# Table extraction for Regulated Contaminants
# expected_headers = ['pws name', 'pws id', 'compliance period', 'regulated contaminants', 'collection date', 'highest value', 'range\n(low/high)','unit','mcl','mclg','typical source']
expected_headers = ['pws name', 'pws id', 'regulated contaminants', 'collection date', 'highest value', 'range\n(low/high)','unit','mcl','mclg','typical source']
df = pd.DataFrame(columns=expected_headers)
# df.header = expected_headers
for test_range in range(3):
    doc_path = os.path.join(basedir, subdir, doc_files[test_range])
    doc = Document(doc_path)
    ###################### Paragraph Extraction #####################
    paragraphs = doc.paragraphs #get all the paragraphs in the object
    p0 = paragraphs[3]
    print(p0)

    pws_text = p0.text
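    regex_search = re.search(
        r'pws\s*name\s*:\s*(.*?)\s+pws\s*id\s*:\s*([A-Za-z0-9\-]+)',
        str(pws_text),
        flags=re.IGNORECASE
    )
    if regex_search is None:  # skip documents whose fourth paragraph lacks the PWS line
        print('no PWS name/ID found in', doc_files[test_range])
        continue
    pws_name = regex_search.group(1)
    pws_id = regex_search.group(2)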
        
    ###################### Table Extraction #########################
    tables = doc.tables
    rows_out = [] # setup variable for row data
    for i, table in enumerate(tables):
        # print(i)
        

        if table.cell(0, 0).text.strip().casefold() == "regulated contaminants":  # strip removes whitespace and casefold avoids capitalization issues
            raw = table.cell(0, 5).text  # header cell of the sixth column (expected to be MCL)
            clean = raw.replace('\xa0', ' ').strip()  # replace non-breaking spaces before trimming
            print("raw   :", repr(raw))
            print("clean :", repr(clean))

            # print([ord(ch) for ch in raw])
            # print(repr(table.cell(0, 5).text))
            headers = [cell.text.strip().casefold() for cell in table.rows[0].cells] #get all the headers of the current table. We can use these for verification with our pandas df
            for r in range(1, len(table.rows)):
                row_data = {headers[c]: table.cell(r, c).text.strip() for c in range(len(headers))}
                print(row_data)
                rows_out.append(row_data)

    table_df = pd.DataFrame(rows_out)
    table_df["pws name"] = pws_name
    table_df["pws id"] = pws_id
    # col_map = {h: idx for idx, h in enumerate(headers)} #this maps the headers to a dictionary for lookup like col_map.get("Containment")
    # for r, rows in enumerate(tables[i].rows):
    #     for c, cell in enumerate(rows.cells):
            # print(r, c, cell.text)
    # print('table ', i, 'is correct')
    df = pd.concat([df, table_df], ignore_index=True)
# display(df)
# display(row_data)
# display(table_df)
display(df)


<docx.text.paragraph.Paragraph object at 0x00000274690F1E90>
raw   : ''
clean : ''
{'regulated contaminants': 'BARIUM', 'collection date': '1/24/2024', 'highest value': '0.16', 'range\n(low/high)': '0.16', 'unit': 'ppm', '': '2', 'mclg': '2', 'typical source': 'Discharge from metal refineries'}
{'regulated contaminants': 'CHROMIUM', 'collection date': '1/24/2024', 'highest value': '1.4', 'range\n(low/high)': '1.4', 'unit': 'ppb', '': '100', 'mclg': '100', 'typical source': 'Discharge from steel and pulp mills'}
{'regulated contaminants': 'FLUORIDE', 'collection date': '1/24/2024', 'highest value': '0.49', 'range\n(low/high)': '0.49', 'unit': 'ppm', '': '4', 'mclg': '4', 'typical source': 'Natural deposits; Water additive which promotes strong teeth.'}
{'regulated contaminants': 'NITRATE', 'collection date': '1/24/2024', 'highest value': '8.4', 'range\n(low/high)': '8 - 8.4', 'unit': 'ppm', '': '10', 'mclg': '10', 'typical source': 'Runoff from fertilizer use'}
{'regulated contaminants'

Unnamed: 0,pws name,pws id,regulated contaminants,collection date,highest value,range\n(low/high),unit,mcl,mclg,typical source,Unnamed: 11,water system
0,CITY OF ABBYVILLE,KS2015512,BARIUM,1/24/2024,0.16,0.16,ppm,,2,Discharge from metal refineries,2,
1,CITY OF ABBYVILLE,KS2015512,CHROMIUM,1/24/2024,1.4,1.4,ppb,,100,Discharge from steel and pulp mills,100,
2,CITY OF ABBYVILLE,KS2015512,FLUORIDE,1/24/2024,0.49,0.49,ppm,,4,Natural deposits; Water additive which promote...,4,
3,CITY OF ABBYVILLE,KS2015512,NITRATE,1/24/2024,8.4,8 - 8.4,ppm,,10,Runoff from fertilizer use,10,
4,CITY OF ABBYVILLE,KS2015512,SELENIUM,1/24/2024,1.5,1.5,ppb,,50,Erosion of natural deposits,50,
5,CITY OF ABILENE,KS2004112,BARIUM,4/16/2024,0.063,0.063,ppm,,2,Discharge from metal refineries,2,
6,CITY OF ABILENE,KS2004112,FLUORIDE,4/16/2024,0.81,0 - 0.81,ppm,,4,Natural deposits; Water additive which promote...,4,
7,CITY OF ABILENE,KS2004112,NITRATE,1/8/2024,1.0,0.93 - 1,ppm,,10,Runoff from fertilizer use,10,
8,CITY OF ABILENE,KS2004112,SELENIUM,4/16/2024,1.8,1.8,ppb,,50,Erosion of natural deposits,50,
9,CITY OF ADMIRE,KS2011103,BARIUM,4/15/2024,0.015,0.015,ppm,,2,Discharge from metal refineries,2,CITY OF EMPORIA
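
In the output above, the MCL values land under an empty-string key and the `mcl` column stays empty, because the header cell in that position is blank in the source table. One possible workaround, sketched here on plain lists standing in for the `cell.text` values, is to fall back to the expected header name by position whenever the extracted header is empty. The helper names `normalize` and `resolve_headers` are assumptions, the capitalization of the sample header row is made up for illustration, and the positional fallback only holds if every document uses the same column order.

```python
def normalize(text):
    """Replace non-breaking spaces, trim, and casefold for comparison."""
    return text.replace("\xa0", " ").strip().casefold()


def resolve_headers(raw_headers, expected):
    """Use the extracted header when present, else the expected one by position."""
    resolved = []
    for idx, raw in enumerate(raw_headers):
        name = normalize(raw)
        if not name and idx < len(expected):
            name = expected[idx]  # blank header cell: fall back by position
        resolved.append(name)
    return resolved


# Expected table columns, in order (the table part of expected_headers)
expected = ["regulated contaminants", "collection date", "highest value",
            "range\n(low/high)", "unit", "mcl", "mclg", "typical source"]

# Header row as it might be extracted, with the blank MCL cell
raw_headers = ["Regulated Contaminants", "Collection Date", "Highest Value",
               "Range\n(low/high)", "Unit", "", "MCLG", "Typical Source"]
row = ["BARIUM", "1/24/2024", "0.16", "0.16", "ppm", "2", "2",
       "Discharge from metal refineries"]

headers = resolve_headers(raw_headers, expected)
record = dict(zip(headers, row))
print(record["mcl"])  # 2
```

Comparing `headers` against `expected` after this step is also a cheap way to detect documents whose table layout differs.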


In [161]:
paragraphs = doc.paragraphs #get all the paragraphs in the object
p0 = paragraphs[3]
print(p0)

text = p0.text
print(text) # gives us the fourth paragraph (the line holding the PWS name and PWS ID)

# tables = doc.tables
# t0 = tables[0] #this indexes the first table in the doc
# print(t0) # no information

# #need to index into the cells for the table data
# rows = t0.rows
# r0 = rows[0]
# print(r0)
# cells = r0.cells
# c00 = cells[0]
# print(c00)
# cell_text = c00.text
# print(t0.cell(5,0).text)




<docx.text.paragraph.Paragraph object at 0x000002746B208190>
PWS NAME:	CITY OF ADMIRE			PWS ID: KS2011103   


In [7]:
# If needed, uncomment:
# !pip install pymupdf pdfplumber pypdf matplotlib

from pathlib import Path
import fitz  # PyMuPDF
import pdfplumber
from pypdf import PdfReader
import matplotlib.pyplot as plt

PDF_PATH = Path(r"C:\Users\Joey\Desktop\KU Classes\KDSC\Boil Water Advisories (Level 3)\Compliance_Report_PDFs\Kansas-Annual-Compliance-Report-2023_202406201048536415.pdf")  # <-- change this
assert PDF_PATH.exists(), f"File not found: {PDF_PATH}"
print("Using:", PDF_PATH.resolve())


Using: C:\Users\Joey\Desktop\KU Classes\KDSC\Boil Water Advisories (Level 3)\Compliance_Report_PDFs\Kansas-Annual-Compliance-Report-2023_202406201048536415.pdf


In [8]:
reader = PdfReader(str(PDF_PATH))
print("=== Basic Info (pypdf) ===")
print("Pages:", len(reader.pages))
print("Metadata:", reader.metadata)

print("\n=== Quick text length per page (first 15 pages) ===")
for i, page in enumerate(reader.pages[:15]):
    txt = page.extract_text() or ""
    print(f"Page {i+1:>3}: {len(txt):>6} chars")


=== Basic Info (pypdf) ===
Pages: 57
Metadata: {'/Author': 'KANSAS DEPARTMNT O HEALTH AND ENVIRONMENT', '/CreationDate': "D:20240618090351-05'00'", '/Creator': 'Adobe Acrobat Pro 2017 17.12.30262', '/ModDate': "D:20240618090351-05'00'", '/Producer': 'Adobe Acrobat Pro 2017 17.12.30262', '/Title': ''}

=== Quick text length per page (first 15 pages) ===
Page   1:    264 chars
Page   2:   5374 chars
Page   3:   4354 chars
Page   4:    344 chars
Page   5:   3765 chars
Page   6:   2483 chars
Page   7:   2312 chars
Page   8:    778 chars
Page   9:   1636 chars
Page  10:   2192 chars
Page  11:   3167 chars
Page  12:   1228 chars
Page  13:   1276 chars
Page  14:   2303 chars
Page  15:   3325 chars
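
The per-page character counts give a quick way to spot pages that carry little extractable text, such as covers, section dividers, or scanned images. A minimal sketch using the counts printed above follows. The 500-character threshold and the name `flag_sparse_pages` are arbitrary assumptions, and a low count is only a hint that a page deserves a visual check.

```python
def flag_sparse_pages(char_counts, threshold=500):
    """Return 1-based page numbers whose extracted text is shorter than threshold."""
    return [page for page, n in enumerate(char_counts, start=1) if n < threshold]


# Character counts for the first 15 pages, copied from the output above
counts = [264, 5374, 4354, 344, 3765, 2483, 2312, 778,
          1636, 2192, 3167, 1228, 1276, 2303, 3325]
print(flag_sparse_pages(counts))  # [1, 4]
```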


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import fitz

PAGE_NUM = 53
zoom = 2.0

doc = fitz.open(str(PDF_PATH))
page = doc[PAGE_NUM - 1]
pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom), alpha=False)

# Convert pixmap samples -> numpy image array
img = np.frombuffer(pix.samples, dtype=np.uint8).reshape(pix.height, pix.width, pix.n)

plt.figure(figsize=(10, 12))
if pix.n == 1:
    plt.imshow(img, cmap="gray")
else:
    plt.imshow(img)  # RGB
plt.axis("off")
plt.title(f"Page {PAGE_NUM} visual layout")
plt.show()


In [14]:
PAGE_NUM = 53
doc = fitz.open(str(PDF_PATH))
page = doc[PAGE_NUM - 1]

blocks = page.get_text("blocks")  # (x0, y0, x1, y1, text, block_no, block_type)
print(f"Found {len(blocks)} blocks on page {PAGE_NUM}\n")

# Sort top-to-bottom, then left-to-right
blocks_sorted = sorted(blocks, key=lambda b: (round(b[1], 1), round(b[0], 1)))

for idx, b in enumerate(blocks_sorted[:30], 1):  # print first 30 blocks
    x0, y0, x1, y1, text, bno, btype = b
    snippet = " ".join((text or "").split())[:120]
    print(f"{idx:>2}. bbox=({x0:7.1f},{y0:7.1f},{x1:7.1f},{y1:7.1f}) type={btype} text='{snippet}'")


Found 12 blocks on page 53

 1. bbox=(  346.1,   36.3,  445.0,   46.9) type=0 text='2023 BOIL WATER ADVISORIES'
 2. bbox=(   43.2,   72.2,  699.8,  105.9) type=0 text='Federal ID System Name Issued Rescinded County District POP Reason Alternative Type/Comments KS2009505 Norwich 1/2/2023 '
 3. bbox=(   42.8,  104.7,  726.1,  339.3) type=0 text='Leaking fire hydrant, can't make repairs immediately, shutting off water KS2000708 Sharon 1/5/2023 1/6/2023 Barber SWD 1'
 4. bbox=(   43.2,  108.7,  513.3,  121.9) type=0 text='KS2001515 Leon 1/3/2023 1/5/2023 Butler SCD 667 Loss of Pressure'
 5. bbox=(   43.2,  338.1,  743.7,  379.4) type=0 text='System upgrades causing loss of pressure, second 75 day bwa sent to system to hand issue. KS2005527 Southwind Subdivisio'
 6. bbox=(   43.2,  346.3,  512.6,  355.3) type=0 text='KS2001102 Fulton, City of 6/7/2023 10/12/2023 Bourbon SED 165 Loss of pressure'
 7. bbox=(   43.2,  386.3,  741.8,  419.5) type=0 text='KS2007304 Fall River, City of 6/19/2023 
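
Several blocks above contain a complete table row in the order given by the header block (Federal ID, System Name, Issued, Rescinded, County, District, POP, Reason). A regular expression can split such a row into fields. This is a rough sketch, not a complete parser: it assumes an ID of `KS` plus seven digits, both dates present, and one-word county and district values. Rows with no rescinded date or with multi-word counties would need extra handling. The pattern and the name `parse_bwa_row` are illustrative assumptions.

```python
import re

DATE = r"\d{1,2}/\d{1,2}/\d{4}"
ROW_RE = re.compile(
    rf"^(?P<federal_id>KS\d{{7}})\s+"
    rf"(?P<system_name>.+?)\s+"      # lazy: stops at the first date
    rf"(?P<issued>{DATE})\s+"
    rf"(?P<rescinded>{DATE})\s+"
    rf"(?P<county>\S+)\s+"
    rf"(?P<district>\S+)\s+"
    rf"(?P<pop>\d+)\s+"
    rf"(?P<reason>.+)$"
)


def parse_bwa_row(text):
    """Split one advisory row into a dict of fields, or return None."""
    match = ROW_RE.match(text.strip())
    return match.groupdict() if match else None


# Two rows copied from the block output above
row_a = "KS2001515 Leon 1/3/2023 1/5/2023 Butler SCD 667 Loss of Pressure"
row_b = "KS2001102 Fulton, City of 6/7/2023 10/12/2023 Bourbon SED 165 Loss of pressure"

print(parse_bwa_row(row_a))
print(parse_bwa_row(row_b)["system_name"])  # Fulton, City of
```

Rows that return `None` are worth collecting in a separate list for manual review rather than dropping silently.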

In [15]:
PAGE_NUM = 1
doc = fitz.open(str(PDF_PATH))
page = doc[PAGE_NUM - 1]

words = page.get_text("words")  
# format: (x0, y0, x1, y1, "word", block_no, line_no, word_no)

print(f"Total words on page {PAGE_NUM}: {len(words)}")
print("First 40 words with coordinates:\n")
for w in words[:40]:
    x0, y0, x1, y1, word, block_no, line_no, word_no = w
    print(f"({x0:7.1f},{y0:7.1f},{x1:7.1f},{y1:7.1f})  {word}")


Total words on page 1: 33
First 40 words with coordinates:

(  304.4,  728.7,  307.5,  743.5)  i
(  109.9,   47.6,  182.9,   72.5)  KANSAS
(  187.4,   47.6,  255.4,   72.5)  PUBLIC
(  259.9,   47.6,  327.9,   72.5)  WATER
(  332.4,   47.6,  402.5,   72.5)  SUPPLY
(  407.0,   47.6,  501.9,   72.5)  PROGRAM
(   67.6,   94.3,  195.9,  135.8)  ANNUAL
(  203.5,   94.3,  411.8,  135.8)  COMPLIANCE
(  419.3,   94.3,  544.3,  135.8)  REPORT
(  207.0,  151.0,  308.9,  176.0)  CALENDAR
(  313.4,  151.0,  364.5,  176.0)  YEAR
(  369.0,  151.0,  405.0,  176.0)  2023
(  251.6,  668.3,  276.2,  684.4)  Janet
(  279.2,  668.3,  311.9,  684.4)  Stanek
(  314.9,  668.3,  360.1,  684.4)  Secretary
(  181.2,  682.1,  199.8,  698.2)  Leo
(  202.8,  682.1,  214.4,  698.2)  G.
(  217.4,  682.1,  261.7,  698.2)  Henning,
(  264.7,  682.1,  307.8,  698.2)  Director,
(  310.8,  682.1,  352.2,  698.2)  Division
(  355.2,  682.1,  365.2,  698.2)  of
(  368.2,  682.1,  430.8,  698.2)  Environment
(  213.8,  695.9
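
Word-level coordinates make it possible to rebuild lines of text by grouping words that share roughly the same `y0`. The sketch below works on plain tuples in the `(x0, y0, x1, y1, word, ...)` layout shown above, so it runs without PyMuPDF. The function name `group_words_into_lines` and the 2-point tolerance are assumptions, and the sample tuples are copied from the output with the trailing block, line, and word numbers omitted.

```python
def group_words_into_lines(words, y_tol=2.0):
    """Group (x0, y0, x1, y1, text, ...) word tuples into lines of text."""
    lines = []  # each entry: [reference_y0, [(x0, text), ...]]
    # Visit words top to bottom, then left to right
    for w in sorted(words, key=lambda w: (w[1], w[0])):
        x0, y0, text = w[0], w[1], w[4]
        if lines and abs(lines[-1][0] - y0) <= y_tol:
            lines[-1][1].append((x0, text))   # same line as the previous word
        else:
            lines.append([y0, [(x0, text)]])  # start a new line
    # Within each line, order words left to right and join them
    return [" ".join(t for _, t in sorted(items)) for _, items in lines]


sample_words = [
    (304.4, 728.7, 307.5, 743.5, "i"),
    (109.9, 47.6, 182.9, 72.5, "KANSAS"),
    (187.4, 47.6, 255.4, 72.5, "PUBLIC"),
    (259.9, 47.6, 327.9, 72.5, "WATER"),
    (332.4, 47.6, 402.5, 72.5, "SUPPLY"),
    (407.0, 47.6, 501.9, 72.5, "PROGRAM"),
    (67.6, 94.3, 195.9, 135.8, "ANNUAL"),
    (203.5, 94.3, 411.8, 135.8, "COMPLIANCE"),
    (419.3, 94.3, 544.3, 135.8, "REPORT"),
    (207.0, 151.0, 308.9, 176.0, "CALENDAR"),
    (313.4, 151.0, 364.5, 176.0, "YEAR"),
    (369.0, 151.0, 405.0, 176.0, "2023"),
]

for line in group_words_into_lines(sample_words):
    print(line)
```

Note that the page-number word `i` appears first in the raw word list even though it sits at the bottom of the page, which is why the sketch sorts by position instead of relying on extraction order.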

In [3]:
import re
import time
from urllib.parse import urljoin

import pandas as pd
import requests
from bs4 import BeautifulSoup
from dateutil import parser as dateparser



In [None]:

LIST_URL = "https://www.kdhe.ks.gov/m/newsflash?cat=29"
BASE_URL = "https://www.kdhe.ks.gov"

HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; kdhe-newsflash-scraper/1.0; +https://example.com/)"
}

DETAIL_HREF_RE = re.compile(r"^/m/newsflash/Home/Detail/\d+")


def fetch_soup(session: requests.Session, url: str) -> BeautifulSoup:
    resp = session.get(url, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return BeautifulSoup(resp.text, "lxml")


def extract_detail_links(list_soup: BeautifulSoup) -> list[str]:
    links = []
    for a in list_soup.select("a[href]"):
        href = a.get("href", "").strip()
        if DETAIL_HREF_RE.match(href):
            links.append(urljoin(BASE_URL, href))
    # de-dupe while preserving order
    seen = set()
    out = []
    for u in links:
        if u not in seen:
            seen.add(u)
            out.append(u)
    return out


def parse_detail_page(detail_soup: BeautifulSoup, url: str) -> dict:
    # Title: on these pages there can be more than one H1 ("News Flash" + the actual item title),
    # so taking the LAST H1 is usually the article title.
    h1s = detail_soup.find_all("h1")
    title = h1s[-1].get_text(" ", strip=True) if h1s else None

    # Find the line that contains "Posted on ..."
    text_lines = [ln.strip() for ln in detail_soup.get_text("\n").splitlines() if ln.strip()]
    meta_line = next((ln for ln in text_lines if "Posted on" in ln), "")

    # Example meta line looks like:
    # "Press Releases   Posted on January 30, 2026 | Last Updated on January 30, 2026"
    category = None
    posted_dt = None
    updated_dt = None

    if meta_line:
        category = meta_line.split("Posted on")[0].strip() or None

        m_posted = re.search(r"Posted on\s+([^|]+)", meta_line)
        if m_posted:
            posted_dt = dateparser.parse(m_posted.group(1).strip())

        m_updated = re.search(r"Last Updated on\s+(.+)$", meta_line)
        if m_updated:
            updated_dt = dateparser.parse(m_updated.group(1).strip())

    # Body: collect text after the title until "Related News" (which appears on detail pages).
    body = ""
    title_tag = h1s[-1] if h1s else None
    if title_tag:
        chunks = []
        for sib in title_tag.next_siblings:
            if not hasattr(sib, "get_text"):
                continue
            t = sib.get_text(" ", strip=True)
            if not t:
                continue
            if "Related News" in t:
                break
            # Skip repeating the meta line if it gets picked up
            if "Posted on" in t and "Last Updated" in t:
                continue
            chunks.append(t)
        body = "\n".join(chunks).strip()

    return {
        "url": url,
        "title": title,
        "category": category,
        "posted": posted_dt,
        "last_updated": updated_dt,
        "body": body,
    }


def scrape_kdhe_newsflash(max_items: int = 25, sleep_s: float = 0.5) -> pd.DataFrame:
    with requests.Session() as session:
        list_soup = fetch_soup(session, LIST_URL)
        detail_urls = extract_detail_links(list_soup)

        rows = []
        for u in detail_urls[:max_items]:
            detail_soup = fetch_soup(session, u)
            rows.append(parse_detail_page(detail_soup, u))
            time.sleep(sleep_s)  # be polite

    df = pd.DataFrame(rows)
    # Optional: consistent ordering
    df = df.sort_values(["posted", "title"], ascending=[False, True], na_position="last").reset_index(drop=True)
    return df


if __name__ == "__main__":
    df = scrape_kdhe_newsflash(max_items=20)
    print(df[["posted", "title", "url"]].head(10))
    df.to_csv("kdhe_boil_water_advisories.csv", index=False)
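
The metadata parsing inside `parse_detail_page` can be checked offline, without any network request. The sketch below applies the same two regular expressions to the example meta line from the code comments, but swaps in the standard library's `datetime.strptime` for `dateutil` so it has no dependencies. The helper name `parse_meta_line` is an assumption, and so is the fixed `Month D, YYYY` date format; `dateutil` is more forgiving of other formats.

```python
import re
from datetime import datetime

DATE_FORMAT = "%B %d, %Y"  # assumed format, e.g. "January 30, 2026"


def parse_meta_line(meta_line):
    """Extract category, posted date, and last-updated date from a meta line."""
    if "Posted on" not in meta_line:
        return None
    category = meta_line.split("Posted on")[0].strip() or None
    posted = updated = None

    m_posted = re.search(r"Posted on\s+([^|]+)", meta_line)
    if m_posted:
        posted = datetime.strptime(m_posted.group(1).strip(), DATE_FORMAT)

    m_updated = re.search(r"Last Updated on\s+(.+)$", meta_line)
    if m_updated:
        updated = datetime.strptime(m_updated.group(1).strip(), DATE_FORMAT)

    return {"category": category, "posted": posted, "last_updated": updated}


example = "Press Releases   Posted on January 30, 2026 | Last Updated on January 30, 2026"
meta = parse_meta_line(example)
print(meta["category"])           # Press Releases
print(meta["posted"].date())      # 2026-01-30
```

Keeping the parsing logic in small functions like this makes it possible to test against saved example strings before running the scraper against the live site.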
