# Chapter 30: HTML Processing and Configuration Formats

This notebook covers HTML escaping with the `html` module, HTML parsing with `html.parser`, and configuration files: `configparser` reads and writes INI-style files, while `tomllib` reads TOML but cannot write it.

## Key Concepts
- **HTML escaping**: Preventing XSS by converting special characters to entities
- **HTMLParser**: Event-driven parser for extracting data from HTML
- **configparser**: Reading and writing INI-style configuration files
- **tomllib**: Reading TOML configuration files (Python 3.11+)

## Section 1: HTML Escaping and Unescaping

The `html` module provides `escape()` and `unescape()` for converting between raw text and entity-encoded strings. Escaping untrusted text before inserting it into HTML is a core defense against cross-site scripting (XSS) attacks.

In [None]:
from html import escape, unescape

# Escaping dangerous HTML content
dangerous: str = '<script>alert("xss")</script>'
safe: str = escape(dangerous)

print(f"Original:  {dangerous}")
print(f"Escaped:   {safe}")
print(f"Contains <script>: {'<script>' in safe}")
print(f"Contains &lt;script&gt;: {'&lt;script&gt;' in safe}")

In [None]:
# Unescaping reverses the process
escaped_html: str = "&lt;b&gt;bold&lt;/b&gt;"
original: str = unescape(escaped_html)

print(f"Escaped:   {escaped_html}")
print(f"Unescaped: {original}")
print(f"Match: {original == '<b>bold</b>'}")

In [None]:
# With quote=True (the default), escape() converts &, <, >, " and ' to entities
special_chars: str = '5 > 3 & 2 < 4 with "quotes" and \'apostrophes\''
escaped: str = escape(special_chars, quote=True)
print(f"Original: {special_chars}")
print(f"Escaped:  {escaped}")

# Round-trip: escape -> unescape recovers the original
recovered: str = unescape(escaped)
print(f"\nRound-trip matches: {recovered == special_chars}")
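
The `quote` argument matters most when untrusted text lands inside an attribute value. The following is a minimal sketch of that case, assuming a hypothetical `render_profile_link` helper and made-up input strings; neither is part of the `html` module.

```python
from html import escape


def render_profile_link(display_name: str, url: str) -> str:
    """Hypothetical helper: build an <a> element from untrusted strings."""
    # quote=True (the default) also escapes " and ', which keeps the value
    # from breaking out of the double-quoted href attribute.
    safe_url: str = escape(url, quote=True)
    # Element text only needs &, < and > escaped, so quote=False is enough.
    safe_name: str = escape(display_name, quote=False)
    return f'<a href="{safe_url}">{safe_name}</a>'


link_html: str = render_profile_link(
    display_name='Eve <script>alert("x")</script>',
    url='https://example.com/?q="onmouseover="alert(1)',
)
print(link_html)
```

Escaping does not validate the URL itself. A `javascript:` scheme, for example, would pass through unchanged and needs a separate check.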

## Section 2: Parsing HTML with HTMLParser

`html.parser.HTMLParser` is an event-driven parser. You subclass it and override handler methods like `handle_starttag`, `handle_endtag`, and `handle_data` to process different parts of the HTML.

In [None]:
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects all start tags encountered in the HTML."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: list[str] = []

    def handle_starttag(self, tag: str, attrs: list[tuple[str, str | None]]) -> None:
        self.tags.append(tag)


# Extract all tags from HTML
html_content: str = "<html><body><p>Hello</p><a href='#'>Link</a></body></html>"

collector = TagCollector()
collector.feed(html_content)

print(f"Tags found: {collector.tags}")
print(f"Contains 'html': {'html' in collector.tags}")
print(f"Contains 'p': {'p' in collector.tags}")
print(f"Contains 'a': {'a' in collector.tags}")

In [None]:
class TextCollector(HTMLParser):
    """Extracts text content from HTML, stripping all tags."""

    def __init__(self) -> None:
        super().__init__()
        self.texts: list[str] = []

    def handle_data(self, data: str) -> None:
        stripped: str = data.strip()
        if stripped:
            self.texts.append(stripped)


# Extract text content from HTML
html_content = "<p>Hello</p><p>World</p>"

text_parser = TextCollector()
text_parser.feed(html_content)

print(f"Texts: {text_parser.texts}")
print(f"Match: {text_parser.texts == ['Hello', 'World']}")

In [None]:
class LinkExtractor(HTMLParser):
    """Extracts all hyperlinks and their text from HTML."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[dict[str, str]] = []
        self._current_href: str | None = None
        self._current_text: str = ""

    def handle_starttag(self, tag: str, attrs: list[tuple[str, str | None]]) -> None:
        if tag == "a":
            attr_dict: dict[str, str | None] = dict(attrs)
            self._current_href = attr_dict.get("href")
            self._current_text = ""

    def handle_data(self, data: str) -> None:
        if self._current_href is not None:
            self._current_text += data

    def handle_endtag(self, tag: str) -> None:
        if tag == "a" and self._current_href is not None:
            self.links.append({
                "href": self._current_href,
                "text": self._current_text.strip(),
            })
            self._current_href = None


# Extract links from a page
page_html: str = """
<html><body>
    <a href="https://python.org">Python</a>
    <a href="https://docs.python.org">Docs</a>
    <p>Some text without links</p>
    <a href="https://pypi.org">PyPI</a>
</body></html>
"""

extractor = LinkExtractor()
extractor.feed(page_html)

print("Links found:")
for link in extractor.links:
    print(f"  {link['text']} -> {link['href']}")

## Section 3: Configuration with configparser

The `configparser` module reads and writes INI-style configuration files with sections, keys, and values. It supports type conversion, default values, and interpolation.

In [None]:
import configparser

# Parse a configuration string
config_str: str = """
[database]
host = localhost
port = 5432
name = mydb

[debug]
enabled = true
"""

config = configparser.ConfigParser()
config.read_string(config_str)

# Read values as strings (default)
host: str = config["database"]["host"]
print(f"Host: {host}")

# Typed accessors convert the stored string when it is read
port: int = config.getint("database", "port")
debug_enabled: bool = config.getboolean("debug", "enabled")

print(f"Port: {port} (type: {type(port).__name__})")
print(f"Debug enabled: {debug_enabled} (type: {type(debug_enabled).__name__})")

# List all sections and keys
print(f"\nSections: {config.sections()}")
print(f"Database keys: {list(config['database'].keys())}")
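
`getboolean()` accepts more spellings than `true` and `false`, and it raises `ValueError` for anything it does not recognize. A brief sketch with made-up feature flags:

```python
import configparser

flags = configparser.ConfigParser()
flags.read_string("""
[features]
cache = yes
metrics = Off
beta = 1
legacy = maybe
""")

# yes/no, on/off, true/false and 1/0 are recognized, case-insensitively
results: dict[str, bool] = {
    key: flags.getboolean("features", key) for key in ("cache", "metrics", "beta")
}
print(results)

# Any other spelling is rejected rather than silently treated as False
rejected: bool = False
try:
    flags.getboolean("features", "legacy")
except ValueError as exc:
    rejected = True
    print(f"legacy rejected: {exc}")
```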

In [None]:
# Default values and fallbacks
config = configparser.ConfigParser()
config.read_string("[section]\nkey = value")

# Access existing key
existing: str = config.get("section", "key")
print(f"Existing key: {existing}")

# Use fallback for missing keys
missing: str = config.get("section", "missing", fallback="default")
print(f"Missing key with fallback: {missing}")

# getint and getboolean also support fallbacks
timeout: int = config.getint("section", "timeout", fallback=30)
verbose: bool = config.getboolean("section", "verbose", fallback=False)
print(f"Timeout (fallback): {timeout}")
print(f"Verbose (fallback): {verbose}")
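
File-level defaults are a separate mechanism from `fallback=`: keys in a `[DEFAULT]` section are visible from every other section. A minimal sketch with hypothetical section names and values:

```python
import configparser

defaults_demo = configparser.ConfigParser()
defaults_demo.read_string("""
[DEFAULT]
timeout = 30

[api]
host = api.example.com

[worker]
timeout = 5
""")

# [api] has no timeout of its own, so it inherits the DEFAULT value
api_timeout: int = defaults_demo.getint("api", "timeout")
# [worker] overrides the DEFAULT value
worker_timeout: int = defaults_demo.getint("worker", "timeout")
# A DEFAULT value takes precedence over fallback=
shadowed: int = defaults_demo.getint("api", "timeout", fallback=99)

print(f"api timeout:    {api_timeout}")
print(f"worker timeout: {worker_timeout}")
print(f"with fallback:  {shadowed}")
# DEFAULT is not reported as a regular section
print(f"sections:       {defaults_demo.sections()}")
```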

In [None]:
import io

# Writing configuration files
config = configparser.ConfigParser()
config["app"] = {"name": "myapp", "version": "1.0"}
config["logging"] = {"level": "INFO", "file": "/var/log/app.log"}

# Write to a string buffer
output = io.StringIO()
config.write(output)
written: str = output.getvalue()

print("Generated INI file:")
print(written)

# Verify the output contains expected values
print(f"Contains 'myapp': {'myapp' in written}")
print(f"Contains '1.0': {'1.0' in written}")
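
A quick way to check written output is to parse it again and compare. The sketch below does that round trip in memory; `as_dict` is a hypothetical helper introduced only for the comparison.

```python
import configparser
import io

original_cfg = configparser.ConfigParser()
original_cfg["app"] = {"name": "myapp", "version": "1.0"}

# Serialize to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
original_cfg.write(buffer)

# Parse the serialized text with a fresh parser
reloaded_cfg = configparser.ConfigParser()
reloaded_cfg.read_string(buffer.getvalue())


def as_dict(cfg: configparser.ConfigParser) -> dict[str, dict[str, str]]:
    """Hypothetical helper: flatten a parser into plain nested dicts."""
    return {section: dict(cfg[section]) for section in cfg.sections()}


print(as_dict(reloaded_cfg))
print(f"Round-trip matches: {as_dict(original_cfg) == as_dict(reloaded_cfg)}")
```

Comments from a file that was read are not kept when it is written back, so `write()` suits generated configuration better than hand-edited files.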

In [None]:
# Interpolation: referencing other values within the config
config_with_interpolation: str = """
[paths]
base_dir = /opt/myapp
data_dir = %(base_dir)s/data
log_dir = %(base_dir)s/logs
"""

config = configparser.ConfigParser()
config.read_string(config_with_interpolation)

# Values are interpolated automatically
base: str = config.get("paths", "base_dir")
data: str = config.get("paths", "data_dir")
logs: str = config.get("paths", "log_dir")

print(f"Base: {base}")
print(f"Data: {data}")
print(f"Logs: {logs}")

# Raw access without interpolation
raw_data: str = config.get("paths", "data_dir", raw=True)
print(f"\nRaw value: {raw_data}")
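
The `%(key)s` syntax above only sees keys in the same section or in `[DEFAULT]`. For references across sections, `configparser` also ships `ExtendedInterpolation`, which uses `${section:key}`. A short sketch with hypothetical paths:

```python
import configparser

extended_cfg = configparser.ConfigParser(
    interpolation=configparser.ExtendedInterpolation()
)
extended_cfg.read_string("""
[paths]
base_dir = /opt/myapp

[logging]
log_file = ${paths:base_dir}/logs/app.log
""")

# ${section:key} pulls a value from another section
log_file: str = extended_cfg.get("logging", "log_file")
raw_log_file: str = extended_cfg.get("logging", "log_file", raw=True)

print(f"Interpolated: {log_file}")
print(f"Raw:          {raw_log_file}")
```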

## Section 4: TOML Configuration with tomllib

TOML (Tom's Obvious Minimal Language) is the modern configuration format used by `pyproject.toml`. Python 3.11+ includes `tomllib` for reading TOML files. TOML preserves types natively (integers stay integers, booleans stay booleans).

In [None]:
import sys

if sys.version_info >= (3, 11):
    import tomllib

    # Parse a TOML string (loads() takes a str; load() takes a binary file)
    toml_data: str = """
[project]
name = "my-package"
version = "1.0.0"
description = "A sample project"
requires-python = ">=3.11"

[project.optional-dependencies]
dev = ["pytest", "mypy"]

[tool.mypy]
strict = true
warn_return_any = true
"""

    parsed: dict = tomllib.loads(toml_data)

    # TOML preserves Python types
    print(f"Project name: {parsed['project']['name']}")
    print(f"Version: {parsed['project']['version']}")
    print(f"Dev deps: {parsed['project']['optional-dependencies']['dev']}")

    # Booleans are native Python bools
    strict: bool = parsed["tool"]["mypy"]["strict"]
    print(f"\nMypy strict: {strict} (type: {type(strict).__name__})")
else:
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} detected.")
    print("tomllib requires Python 3.11+. Use 'tomli' package as a fallback.")
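
For files on disk, `tomllib.load()` is the counterpart of `loads()` and expects a file opened in binary mode. A minimal sketch, assuming a hypothetical `settings.toml` written to a temporary directory just for the demonstration:

```python
import sys
import tempfile
from pathlib import Path

if sys.version_info >= (3, 11):
    import tomllib

    with tempfile.TemporaryDirectory() as tmp_dir:
        # Hypothetical file name and contents, created only for this demo
        config_path = Path(tmp_dir) / "settings.toml"
        config_path.write_text('title = "demo"\nretries = 3\n', encoding="utf-8")

        # load() expects a binary file object, so open with "rb"
        with config_path.open("rb") as toml_file:
            settings: dict = tomllib.load(toml_file)
        print(f"Loaded: {settings}")

        # A text-mode file object is rejected with TypeError
        text_mode_rejected: bool = False
        try:
            with config_path.open("r", encoding="utf-8") as text_file:
                tomllib.load(text_file)
        except TypeError as exc:
            text_mode_rejected = True
            print(f"Text mode rejected: {exc}")
else:
    print("Skipped: tomllib requires Python 3.11+")
```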

In [None]:
if sys.version_info >= (3, 11):
    import tomllib

    # TOML supports rich data types
    rich_toml: str = """
[server]
host = "0.0.0.0"
port = 8080
debug = false
workers = 4
allowed_hosts = ["localhost", "example.com"]

[database]
url = "postgresql://localhost/mydb"
pool_size = 10
timeout = 30.5
"""

    data: dict = tomllib.loads(rich_toml)

    # Types are preserved from the TOML source
    print("Server config:")
    for key, value in data["server"].items():
        print(f"  {key} = {value!r} ({type(value).__name__})")

    print("\nDatabase config:")
    for key, value in data["database"].items():
        print(f"  {key} = {value!r} ({type(value).__name__})")
else:
    print("Skipped: tomllib requires Python 3.11+")
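
Because TOML values are typed, strings must be quoted. An INI-style bare value is a syntax error, which `tomllib` reports as `tomllib.TOMLDecodeError` (a subclass of `ValueError`). A small sketch with a deliberately broken input:

```python
import sys

if sys.version_info >= (3, 11):
    import tomllib

    # Valid as INI, invalid as TOML: the string value is not quoted
    broken_toml: str = "[server]\nhost = localhost\n"

    parse_error: Exception | None = None
    try:
        tomllib.loads(broken_toml)
    except tomllib.TOMLDecodeError as exc:
        parse_error = exc
        print(f"Invalid TOML: {exc}")

    # The quoted form parses without error
    fixed: dict = tomllib.loads('[server]\nhost = "localhost"\n')
    print(f"Fixed: {fixed}")
else:
    print("Skipped: tomllib requires Python 3.11+")
```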

## Section 5: Comparing INI and TOML

Both formats store configuration, but they differ in type support and in how much structure they can express.

In [None]:
# INI: all values are strings, must convert manually
ini_str: str = """
[server]
port = 8080
debug = true
"""

ini_config = configparser.ConfigParser()
ini_config.read_string(ini_str)

# Raw value is always a string
raw_port: str = ini_config["server"]["port"]
print(f"INI port type: {type(raw_port).__name__} -> {raw_port!r}")

# Must use getint/getboolean for typed access
typed_port: int = ini_config.getint("server", "port")
typed_debug: bool = ini_config.getboolean("server", "debug")
print(f"INI typed port: {type(typed_port).__name__} -> {typed_port}")
print(f"INI typed debug: {type(typed_debug).__name__} -> {typed_debug}")

print("\n--- Comparison ---")
print("INI: All values stored as strings, manual type conversion needed")
print("TOML: Native types (int, bool, float, list, datetime)")
print("INI: Simple flat sections")
print("TOML: Nested tables and arrays of tables")
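
The datetime values and arrays of tables mentioned in the comparison have no INI equivalent. A brief sketch of both, using made-up release data:

```python
import datetime
import sys

if sys.version_info >= (3, 11):
    import tomllib

    # Hypothetical data: one datetime value and an array of tables
    release_toml: str = """
released = 2024-01-15T10:30:00Z

[[servers]]
name = "alpha"
port = 8001

[[servers]]
name = "beta"
port = 8002
"""

    release: dict = tomllib.loads(release_toml)

    # An offset datetime becomes a timezone-aware datetime.datetime
    released: datetime.datetime = release["released"]
    print(f"Released: {released.isoformat()} ({type(released).__name__})")

    # Each [[servers]] header appends one dict to a list
    servers: list[dict] = release["servers"]
    for server in servers:
        print(f"  {server['name']} -> port {server['port']}")
else:
    print("Skipped: tomllib requires Python 3.11+")
```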

## Summary

### HTML Processing
- **`html.escape(s)`**: Convert `<`, `>`, `&`, and (with the default `quote=True`) `"` and `'` to HTML entities -- prevents XSS
- **`html.unescape(s)`**: Convert entities back to characters
- **`HTMLParser`**: Subclass and override `handle_starttag`, `handle_data`, `handle_endtag`
- Event-driven: no DOM tree built, low memory overhead

### configparser (INI Files)
- **`read_string()` / `read()`**: Parse configuration
- **`get()` / `getint()` / `getboolean()`**: Value access; `get()` returns a string, the others convert it
- **`fallback=`**: Default values for missing keys
- **`%(key)s`**: Value interpolation from the same section or `[DEFAULT]`
- **`write()`**: Serialize config back to INI format

### tomllib (TOML Files)
- **`tomllib.loads(s)`**: Parse TOML from a `str` (Python 3.11+)
- **`tomllib.load(f)`**: Parse TOML from a file opened in binary mode (`"rb"`)
- Native type preservation: int, float, bool, list, datetime
- Used by `pyproject.toml` for Python project configuration
- Read-only: use `tomli-w` or `tomlkit` for writing TOML