# Workshop: DOCX Scraping in Python  
### Case Study: KDHE Consumer Confidence Reports

## Introduction

In this workshop, we will build a pipeline to extract structured data from Kansas Department of Health and Environment (KDHE) Consumer Confidence Report (CCR) documents.

Public health data is often published as Word/PDF files rather than clean machine-readable tables. Scraping lets us convert these semi-structured documents into analyzable datasets.

## When is scraping useful?

Use scraping when:
- data is publicly available but not downloadable as a single table,
- information is spread across many files/pages,
- repeated document layouts contain fields you can systematically parse.


## Project Plan

This workshop focuses on building a document-scraping workflow for KDHE Consumer Confidence Report `.docx` files.

### Steps

We will:

1. Inspect the report layout and identify where key fields appear (paragraphs vs tables).
2. Review the Python tools used for document parsing and cleaning.
3. Build a scraper that extracts relevant fields from each report.
4. Parse and standardize extracted values into a pandas DataFrame.
5. Run quality checks for missing or inconsistent values.
6. Export a tidy CSV for later analysis.


## Core Python Libraries

- **os**: file paths, folder traversal, and file management  
- **re**: regular expressions for parsing IDs, dates, units, and text patterns  
- **python-docx**: read Word (`.docx`) structure (paragraphs, runs, tables, rows, cells)  
- **pandas**: tabular wrangling, validation, and CSV export  
- **numpy**: helper operations for missing values and numeric transformations  

## Environment

We will work step-by-step in a Jupyter Notebook, so each stage is:

- explained in markdown,
- implemented in code cells,
- and validated with intermediate outputs.


In [2]:
from docx import Document
import pandas as pd
import os
import re

In [5]:
# Path to where documents have been saved
basedir = r"\\resfs.home.ku.edu\groups_hipaa\PSYC\kdsc_ClassData\KDSC-CDL-Project2\Data\Full Set of CCR Doc Files" 
subdir = r"ccrs2025\kdhe_A_E"

folder = os.path.join(basedir, subdir)

doc_files = [file for file in os.listdir(folder) if file.lower().endswith(".docx")]

print('first file:', doc_files[0])
testdoc = "ABBYVILLE-CITY-OF-KS2015512-DOCX.docx"
doc_path = os.path.join(basedir, subdir, testdoc)
doc = Document(doc_path)


first file: ABBYVILLE-CITY-OF-KS2015512-DOCX.docx
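The listing above relies on `os.listdir`, which returns names in arbitrary order, so `doc_files[0]` is not guaranteed to be the same file on every machine. Word also creates hidden lock files whose names start with `~$` and end in `.docx` while a document is open. A minimal sketch of a more defensive listing follows; the helper name `list_docx_files` is hypothetical and not part of the workshop code.

```python
import os

def list_docx_files(folder):
    """Return sorted .docx filenames in `folder`, skipping Word lock files."""
    names = []
    for name in os.listdir(folder):
        # Keep only Word documents (case-insensitive extension check)
        if not name.lower().endswith(".docx"):
            continue
        # Skip the temporary "~$..." lock files Word creates for open documents
        if name.startswith("~$"):
            continue
        names.append(name)
    # Sort so the order is reproducible across runs and machines
    return sorted(names)
```

With this helper, `doc_files = list_docx_files(folder)` could stand in for the list comprehension above.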


## `python-docx` overview

`python-docx` lets us open and read Word documents as structured Python objects.

For this document-scraping workflow, we are **extracting data only** and building a new DataFrame.  
We do **not** modify source files unless we explicitly call `doc.save(...)`.

### Accessing text in paragraphs

After loading a file (for example, `doc = Document(path)`), paragraph text is available in:

- `doc.paragraphs` → a list of paragraph objects
- `doc.paragraphs[i].text` → the text content of a specific paragraph

---insert pic of word doc showing paragraph examples---

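A regular expression (regex) is a compact pattern language for matching text. Python's built-in `re` module lets us describe the shape of the text we want (for example, a label followed by a value) and capture the parts we need in parenthesized groups.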

This is useful when fields like **PWS Name** and **PWS ID** appear in normal text or headings instead of tables.
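As a rough illustration of how such a pattern is put together, the sketch below writes the PWS pattern used in this workshop in verbose mode, with one comment per piece. The `sample` string is copied from the Abbyville paragraph shown in this notebook; using `re.VERBOSE` is a presentation choice for this sketch, not something the workshop code requires.

```python
import re

# Paragraph text as it appears in the Abbyville report (tabs between fields)
sample = "PWS NAME:\tCITY OF ABBYVILLE\t\t\tPWS ID: KS2015512"

pattern = re.compile(
    r"""
    pws\s*name\s*:\s*     # the label "PWS Name:", with flexible spacing
    (.*?)                 # group 1: the name, as few characters as possible
    \s+pws\s*id\s*:\s*    # whitespace, then the label "PWS ID:"
    ([A-Za-z0-9\-]+)      # group 2: the ID (letters, digits, hyphens)
    """,
    flags=re.IGNORECASE | re.VERBOSE,
)

match = pattern.search(sample)
if match:
    print("PWS Name:", match.group(1))
    print("PWS ID:", match.group(2))
```

The lazy quantifier `(.*?)` matters here: a greedy `(.*)` would still work on this line, but the lazy form stops at the first "PWS ID:" label if the text ever contains more than one.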


In [10]:
# Accessing paragraph text example

# 1. Select one paragraph from the paragraph list.
paragraphs = doc.paragraphs
pws_paragraph = paragraphs[3]  # in this report, the PWS line is the fourth paragraph
# 2. Read its raw `.text` content.
pws_text = pws_paragraph.text

# 3. Use a regex pattern to extract:
#    group(1) -> PWS Name: text after "PWS Name:" up to "PWS ID:"
#    group(2) -> PWS ID: letters, digits, or hyphens after "PWS ID:"
pattern = r"pws\s*name\s*:\s*(.*?)\s+pws\s*id\s*:\s*([A-Za-z0-9\-]+)"

regex_search = re.search(pattern, pws_text, flags=re.IGNORECASE)


# 4. Print the original paragraph text and the extracted values.
print("Raw paragraph text:")
print(pws_text.strip())

# re.search returns None when the pattern is absent, so check before calling .group()
if regex_search is None:
    raise ValueError("PWS name/ID pattern not found in the selected paragraph")

pws_name = regex_search.group(1).strip()
pws_id = regex_search.group(2).strip()

print("\nExtracted values:")
print("PWS Name:", pws_name)
print("PWS ID:", pws_id)


Raw paragraph text:
PWS NAME:	CITY OF ABBYVILLE			PWS ID: KS2015512

Extracted values:
PWS Name: CITY OF ABBYVILLE
PWS ID: KS2015512


### Accessing text within tables

In `python-docx`, tables are stored separately from paragraphs:

- `doc.tables` → list of table objects  
- `tables[i].rows` → row objects for table `i`  
- `tables[i].rows[r].cells[c]` or `tables[i].cell(r, c)` → specific cell  
- `.text` → text content of a cell

This is useful when data are presented in a typical row/column layout.

> Note: A table that visually continues onto the next page in Word is often still part of the same table object in `python-docx`, even if repeated headers make it look like a new table.


In [17]:
# Accessing table data example
# 1. Loads all tables from the document.
tables = doc.tables

print("\nTotal of ", len(tables), "tables found in the document")
# ---need to add pic of the tables and how to iterate over document to get table to use
# print(len(tables)) 

# 2. Access same cell in two equivalent ways
print("\nFirst cell (cell method):")
print(tables[0].cell(0, 0).text)

# Can also index by calling rows and cells
# print("\nFirst cell (rows/cells indexing):")
# print(tables[0].rows[0].cells[0].text)

# 3. Table dimensions
print("\nRow count in first table:", len(tables[0].rows))
print("Column count in first row:", len(tables[0].rows[0].cells))

# 4. Cleaned text
print("\nFirst cell text:")
print(tables[0].cell(0, 0).text.strip())


Total of  5 tables found in the document

First cell (cell method):
Regulated Contaminants

Row count in first table: 6
Column count in first row: 8

First cell text:
Regulated Contaminants
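To decide which table to parse, it can help to print a one-line summary of every table in the document first. The sketch below assumes only the attributes already used above (`.rows`, `.cells`, `.cell(r, c)`, `.text`); the function name `summarize_tables` is hypothetical.

```python
def summarize_tables(tables):
    """Return (index, row count, column count, first-cell text) for each table."""
    summary = []
    for i, table in enumerate(tables):
        n_rows = len(table.rows)
        # Column count is taken from the first row; later rows may differ
        n_cols = len(table.rows[0].cells) if n_rows else 0
        # Replace non-breaking spaces before stripping so titles compare cleanly
        first = table.cell(0, 0).text.replace("\xa0", " ").strip() if n_rows else ""
        summary.append((i, n_rows, n_cols, first))
    return summary
```

In the notebook this would be called as `for row in summarize_tables(doc.tables): print(row)`, which lists each table's index next to its top-left cell so the right table can be picked by title rather than by position.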


## Outline: contaminant-data extraction

These reports include both metadata and water-quality measurements.  
Our goal is to extract a consistent set of fields across documents.

### Core fields to collect

1. **PWS Name** (Public Water System name)  
2. **PWS ID**  
3. **Testing results**, including:
   - Regulated contaminants
   - Lead and copper
   - Chlorine/chloramines (disinfection residual / MRDL-related section)
   - Secondary contaminants (non-health-based standards)
   - Compliance period (when reported values apply)

### Parsing notes

- Some values appear in paragraph text, others in tables.
- Section titles may vary slightly across reports, so pattern matching and defensive parsing of text are important.
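One way to make title matching more tolerant is to normalize text before comparing it. The sketch below is an assumption-laden illustration: only the "Regulated Contaminants" title is confirmed by the sample report, and the other keyword lists in `SECTION_KEYWORDS` are hypothetical guesses based on the field list above that would need checking against real documents.

```python
import re

def normalize(text):
    """Replace non-breaking spaces, collapse whitespace runs, and casefold."""
    text = text.replace("\xa0", " ")
    text = re.sub(r"\s+", " ", text)
    return text.strip().casefold()

# Hypothetical keyword lists; adjust after inspecting real section titles.
SECTION_KEYWORDS = {
    "regulated": ["regulated contaminants"],
    "lead_copper": ["lead and copper"],
    "disinfection": ["chlorine", "chloramines"],
    "secondary": ["secondary contaminants"],
}

def classify_section(title):
    """Return a section label for a table title, or None if nothing matches."""
    cleaned = normalize(title)
    for label, keywords in SECTION_KEYWORDS.items():
        # startswith (not "in") so "Unregulated Contaminants" is not
        # mistaken for "Regulated Contaminants"
        if any(cleaned.startswith(k) for k in keywords):
            return label
    return None
```

Returning `None` for unknown titles keeps the failure visible, so unexpected layouts can be logged and reviewed instead of being parsed under the wrong label.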


## Step 1: Extract PWS metadata from one document (paragraph text)

Before processing many files, we first test extraction on a single document.

In this step we collect the information needed from paragraphs of the document:

1. Open one `.docx` file.
2. Access `doc.paragraphs`.
3. Select the paragraph that contains `PWS Name` and `PWS ID`.
4. Use a regex pattern to extract:
   - `pws name`
   - `pws id`


In [18]:
# Step 1: Paragraph extraction from ONE document (no file loop)

doc_path = os.path.join(basedir, subdir, doc_files[0])  # pick one file for demonstration
doc = Document(doc_path)

paragraphs = doc.paragraphs
p0 = paragraphs[3]  # adjust if needed for different document layouts
pws_text = p0.text

print("Selected paragraph:")
print(pws_text.strip())

pattern = r"pws\s*name\s*:\s*(.*?)\s+pws\s*id\s*:\s*([A-Za-z0-9\-]+)"
match = re.search(pattern, pws_text, flags=re.IGNORECASE)

if match:
    pws_name = match.group(1).strip()
    pws_id = match.group(2).strip()
    print("\nExtracted:")
    print("pws name:", pws_name)
    print("pws id  :", pws_id)
else:
    pws_name, pws_id = None, None
    print("\nPattern not found in selected paragraph.")


Selected paragraph:
PWS NAME:	CITY OF ABBYVILLE			PWS ID: KS2015512

Extracted:
pws name: CITY OF ABBYVILLE
pws id  : KS2015512
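The hard-coded index `paragraphs[3]` works for this report but may not hold for every layout. A possible alternative, sketched below, is to scan all paragraph texts and return the first match. The function name `find_pws_metadata` is hypothetical; the pattern is the same one used above.

```python
import re

PWS_PATTERN = re.compile(
    r"pws\s*name\s*:\s*(.*?)\s+pws\s*id\s*:\s*([A-Za-z0-9\-]+)",
    flags=re.IGNORECASE,
)

def find_pws_metadata(paragraph_texts):
    """Return (pws_name, pws_id) from the first matching text, else (None, None)."""
    for text in paragraph_texts:
        match = PWS_PATTERN.search(text)
        if match:
            return match.group(1).strip(), match.group(2).strip()
    return None, None
```

In the notebook this could be called as `pws_name, pws_id = find_pws_metadata(p.text for p in doc.paragraphs)`, which removes the dependence on paragraph position.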


## Step 2: Extract regulated contaminant rows from one document (table text)

Now we extract table rows from the same single report.

In this step we:

1. Access `doc.tables`.
2. Find the table whose top-left cell is `"regulated contaminants"`.
3. Read header cells from row 0.
4. Read each data row into a dictionary.
5. Convert extracted rows into a DataFrame.
6. Add `pws name` and `pws id` columns from Step 1.



In [None]:
# Step 2: Table extraction from ONE document

expected_headers = [
    'pws name', 'pws id', 'regulated contaminants', 'collection date',
    'highest value', 'range\n(low/high)', 'unit', 'mcl', 'mclg', 'typical source'
]

tables = doc.tables
rows_out = []
target_table = None

# Find the regulated contaminants table
for table in tables:
    first_cell = table.cell(0, 0).text.replace('\xa0', ' ').strip().casefold()
    if first_cell == "regulated contaminants":
        target_table = table
        break

if target_table is None:
    print("No 'regulated contaminants' table found.")
    table_df = pd.DataFrame(columns=expected_headers)
else:
    # Clean headers
    headers = [
        cell.text.replace('\xa0', ' ').strip().casefold()
        for cell in target_table.rows[0].cells
    ]

    # Extract data rows (skip header row at index 0)
    for r in range(1, len(target_table.rows)):
        row_data = {
            headers[c]: target_table.cell(r, c).text.replace('\xa0', ' ').strip()
            for c in range(len(headers))
        }
        rows_out.append(row_data)

    table_df = pd.DataFrame(rows_out)

    # Add metadata from paragraph extraction
    table_df["pws name"] = pws_name
    table_df["pws id"] = pws_id

    # Ensure expected columns exist and order them
    for col in expected_headers:
        if col not in table_df.columns:
            table_df[col] = pd.NA
    table_df = table_df[expected_headers]

display(table_df)


| Unnamed: 0 | pws name | pws id | regulated contaminants | collection date | highest value | range\n(low/high) | unit | mcl | mclg | typical source |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CITY OF ABBYVILLE | KS2015512 | BARIUM | 1/24/2024 | 0.16 | 0.16 | ppm | | 2 | Discharge from metal refineries |
| 1 | CITY OF ABBYVILLE | KS2015512 | CHROMIUM | 1/24/2024 | 1.4 | 1.4 | ppb | | 100 | Discharge from steel and pulp mills |
| 2 | CITY OF ABBYVILLE | KS2015512 | FLUORIDE | 1/24/2024 | 0.49 | 0.49 | ppm | | 4 | Natural deposits; Water additive which promote... |
| 3 | CITY OF ABBYVILLE | KS2015512 | NITRATE | 1/24/2024 | 8.4 | 8 - 8.4 | ppm | | 10 | Runoff from fertilizer use |
| 4 | CITY OF ABBYVILLE | KS2015512 | SELENIUM | 1/24/2024 | 1.5 | 1.5 | ppb | | 50 | Erosion of natural deposits |
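Note that the `mcl` column in this output is empty. The debug output printed in Step 3 shows why: the header cell at column 5 of the table is blank, so its values land under an empty-string key (`'': '2'`) instead of `mcl`. One possible fix, sketched below, fills blank header names by position from the list of columns we expect. Treating the blank column as `mcl` is an assumption based on its position between `unit` and `mclg`, and `fill_blank_headers` is a hypothetical helper.

```python
def fill_blank_headers(headers, expected):
    """Fill empty header names from `expected`, matched by position."""
    filled = []
    for i, name in enumerate(headers):
        # Only fill when the expected name is not already used elsewhere
        if name == "" and i < len(expected) and expected[i] not in headers:
            filled.append(expected[i])
        else:
            filled.append(name)
    return filled

# Expected table columns in document order (metadata columns excluded)
table_columns = [
    "regulated contaminants", "collection date", "highest value",
    "range\n(low/high)", "unit", "mcl", "mclg", "typical source",
]
# Headers as read from the Abbyville table: position 5 is blank
headers = [
    "regulated contaminants", "collection date", "highest value",
    "range\n(low/high)", "unit", "", "mclg", "typical source",
]
fixed = fill_blank_headers(headers, table_columns)
print(fixed[5])
```

Applying this to `headers` before building `row_data` would place those values under `mcl`. Tables with extra or reordered columns would still need manual review, since the fill is purely positional.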


## Step 3: Scale up to multiple files

After validating extraction on one report, we can apply the same logic across many documents.

This full-loop version:

1. Iterates through selected files.
2. Extracts `pws name`/`pws id` from paragraphs.
3. Finds and parses the `"regulated contaminants"` table.
4. Appends each file’s rows into one combined DataFrame.
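When the loop runs over hundreds of reports, one malformed file should not stop the whole run. The sketch below shows one way to structure that, assuming the per-file logic has been wrapped in a function that returns a list of row dictionaries. Both `scrape_many` and the `extract_one` callable are hypothetical names introduced for illustration.

```python
def scrape_many(paths, extract_one):
    """Apply `extract_one` to each path; collect rows and record failures."""
    all_rows, failures = [], []
    for path in paths:
        try:
            rows = extract_one(path)
        except Exception as exc:  # broad on purpose: log the file and keep going
            failures.append((path, repr(exc)))
            continue
        for row in rows:
            # Tag each row with its source so problems can be traced back
            all_rows.append({**row, "source file": path})
    return all_rows, failures
```

Collecting plain dictionaries and calling `pd.DataFrame(all_rows)` once at the end also avoids repeated `pd.concat` calls inside the loop, and the `failures` list gives a ready-made report of files that need a closer look.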


In [20]:
# Table extraction for regulated contaminants across several files
expected_headers = ['pws name', 'pws id', 'regulated contaminants', 'collection date', 'highest value', 'range\n(low/high)', 'unit', 'mcl', 'mclg', 'typical source']
pws_pattern = r'pws\s*name\s*:\s*(.*?)\s+pws\s*id\s*:\s*([A-Za-z0-9\-]+)'
df = pd.DataFrame(columns=expected_headers)

for file_index in range(3):  # first three files only, as a test
    doc_path = os.path.join(basedir, subdir, doc_files[file_index])
    doc = Document(doc_path)

    ###################### Paragraph Extraction #####################
    paragraphs = doc.paragraphs  # all paragraphs in the document
    p0 = paragraphs[3]
    print(p0)  # debug: prints the Paragraph object itself, not its text

    pws_text = p0.text
    regex_search = re.search(pws_pattern, pws_text, flags=re.IGNORECASE)
    if regex_search is None:
        # Skip files whose fourth paragraph does not hold the PWS line
        print("PWS name/ID not found in:", doc_files[file_index])
        continue
    pws_name = regex_search.group(1)
    pws_id = regex_search.group(2)

    ###################### Table Extraction #########################
    tables = doc.tables
    rows_out = []  # row dictionaries for this file
    for table in tables:
        # strip() removes surrounding whitespace; casefold() avoids capitalization issues
        if table.cell(0, 0).text.strip().casefold() == "regulated contaminants":
            # Debug: inspect the header cell in column 5 before and after cleaning
            raw = table.cell(0, 5).text
            clean = raw.replace('\xa0', ' ').strip()
            print("raw   :", repr(raw))
            print("clean :", repr(clean))

            # Header names become the dictionary keys (and later the DataFrame columns)
            headers = [cell.text.strip().casefold() for cell in table.rows[0].cells]
            for r in range(1, len(table.rows)):
                row_data = {headers[c]: table.cell(r, c).text.strip() for c in range(len(headers))}
                print(row_data)
                rows_out.append(row_data)

    # Combine this file's rows with its metadata, then append to the running result
    table_df = pd.DataFrame(rows_out)
    table_df["pws name"] = pws_name
    table_df["pws id"] = pws_id
    df = pd.concat([df, table_df], ignore_index=True)

display(df)


<docx.text.paragraph.Paragraph object at 0x000001F21AA72250>
raw   : ''
clean : ''
{'regulated contaminants': 'BARIUM', 'collection date': '1/24/2024', 'highest value': '0.16', 'range\n(low/high)': '0.16', 'unit': 'ppm', '': '2', 'mclg': '2', 'typical source': 'Discharge from metal refineries'}
{'regulated contaminants': 'CHROMIUM', 'collection date': '1/24/2024', 'highest value': '1.4', 'range\n(low/high)': '1.4', 'unit': 'ppb', '': '100', 'mclg': '100', 'typical source': 'Discharge from steel and pulp mills'}
{'regulated contaminants': 'FLUORIDE', 'collection date': '1/24/2024', 'highest value': '0.49', 'range\n(low/high)': '0.49', 'unit': 'ppm', '': '4', 'mclg': '4', 'typical source': 'Natural deposits; Water additive which promotes strong teeth.'}
{'regulated contaminants': 'NITRATE', 'collection date': '1/24/2024', 'highest value': '8.4', 'range\n(low/high)': '8 - 8.4', 'unit': 'ppm', '': '10', 'mclg': '10', 'typical source': 'Runoff from fertilizer use'}
{'regulated contaminants'

| Unnamed: 0 | pws name | pws id | regulated contaminants | collection date | highest value | range\n(low/high) | unit | mcl | mclg | typical source | Unnamed: 11 | water system |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CITY OF ABBYVILLE | KS2015512 | BARIUM | 1/24/2024 | 0.16 | 0.16 | ppm | | 2 | Discharge from metal refineries | 2 | |
| 1 | CITY OF ABBYVILLE | KS2015512 | CHROMIUM | 1/24/2024 | 1.4 | 1.4 | ppb | | 100 | Discharge from steel and pulp mills | 100 | |
| 2 | CITY OF ABBYVILLE | KS2015512 | FLUORIDE | 1/24/2024 | 0.49 | 0.49 | ppm | | 4 | Natural deposits; Water additive which promote... | 4 | |
| 3 | CITY OF ABBYVILLE | KS2015512 | NITRATE | 1/24/2024 | 8.4 | 8 - 8.4 | ppm | | 10 | Runoff from fertilizer use | 10 | |
| 4 | CITY OF ABBYVILLE | KS2015512 | SELENIUM | 1/24/2024 | 1.5 | 1.5 | ppb | | 50 | Erosion of natural deposits | 50 | |
| 5 | CITY OF ABILENE | KS2004112 | BARIUM | 4/16/2024 | 0.063 | 0.063 | ppm | | 2 | Discharge from metal refineries | 2 | |
| 6 | CITY OF ABILENE | KS2004112 | FLUORIDE | 4/16/2024 | 0.81 | 0 - 0.81 | ppm | | 4 | Natural deposits; Water additive which promote... | 4 | |
| 7 | CITY OF ABILENE | KS2004112 | NITRATE | 1/8/2024 | 1.0 | 0.93 - 1 | ppm | | 10 | Runoff from fertilizer use | 10 | |
| 8 | CITY OF ABILENE | KS2004112 | SELENIUM | 4/16/2024 | 1.8 | 1.8 | ppb | | 50 | Erosion of natural deposits | 50 | |
| 9 | CITY OF ADMIRE | KS2011103 | BARIUM | 4/15/2024 | 0.015 | 0.015 | ppm | | 2 | Discharge from metal refineries | 2 | CITY OF EMPORIA |
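The remaining steps of the project plan (standardizing values, quality checks, and CSV export) could look roughly like the sketch below. It runs on a small hand-built DataFrame whose rows are copied from the output above, so it does not depend on the source files. The helper `split_range`, the new column names, and the choice of checks are all assumptions made for illustration. The sketch also assumes values are non-negative, so the hyphen in `8 - 8.4` is always a separator.

```python
import io
import re
import pandas as pd

def split_range(value):
    """Split '8 - 8.4' into ('8', '8.4'); a single value gives (x, x)."""
    numbers = re.findall(r"\d*\.?\d+", str(value))
    if not numbers:
        return (None, None)
    return (numbers[0], numbers[-1])

# Rows copied from the combined output above (subset of columns)
sample = pd.DataFrame({
    "pws id": ["KS2015512", "KS2015512", "KS2004112"],
    "regulated contaminants": ["BARIUM", "NITRATE", "FLUORIDE"],
    "collection date": ["1/24/2024", "1/24/2024", "4/16/2024"],
    "highest value": ["0.16", "8.4", "0.81"],
    "range\n(low/high)": ["0.16", "8 - 8.4", "0 - 0.81"],
    "unit": ["ppm", "ppm", "ppm"],
})

# Standardize: simpler column name, numeric values, real dates
tidy = sample.rename(columns={"range\n(low/high)": "range"})
tidy["highest value"] = pd.to_numeric(tidy["highest value"], errors="coerce")
tidy["collection date"] = pd.to_datetime(
    tidy["collection date"], format="%m/%d/%Y", errors="coerce"
)
parts = [split_range(v) for v in tidy["range"]]
tidy["range low"] = pd.to_numeric(
    pd.Series([p[0] for p in parts], index=tidy.index), errors="coerce"
)
tidy["range high"] = pd.to_numeric(
    pd.Series([p[1] for p in parts], index=tidy.index), errors="coerce"
)

# Quality checks: missing values per column, and rows where the reported
# highest value disagrees with the top of the range (missing values are
# flagged too, because NaN never compares equal)
missing_counts = tidy.isna().sum()
inconsistent = tidy[tidy["highest value"] != tidy["range high"]]
print(missing_counts)
print("Inconsistent rows:", len(inconsistent))

# Export: written to an in-memory buffer here; pass a file path in practice
buffer = io.StringIO()
tidy.to_csv(buffer, index=False)
```

Using `errors="coerce"` turns unparseable entries into missing values rather than raising, which lets the `isna()` check surface them in one place for review.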
