# Chapter 30: XML Processing

This notebook covers XML parsing and generation using Python's built-in `xml.etree.ElementTree` module. XML (eXtensible Markup Language) remains widely used in configuration files, web services (SOAP), and data exchange formats.

## Key Concepts
- **Parsing XML**: Converting XML strings or files into navigable tree structures
- **Building XML**: Creating elements programmatically with `Element` and `SubElement`
- **XPath searches**: Finding elements using path expressions
- **Namespaces**: Working with XML namespaces to avoid name collisions
- **Serialization**: Converting element trees back to XML strings

## Section 1: Parsing XML from Strings

`ET.fromstring()` parses an XML string and returns the root `Element`. Each element has a tag, optional text content, attributes, and child elements.

In [None]:
import xml.etree.ElementTree as ET

# Parse a simple XML string
xml_data: str = "<root><item>hello</item><item>world</item></root>"
root: ET.Element = ET.fromstring(xml_data)

print(f"Root tag: {root.tag}")
print(f"Number of children: {len(root)}")

# Access child elements
items: list[ET.Element] = root.findall("item")
for item in items:
    print(f"  <{item.tag}> = {item.text}")

In [None]:
# Parse XML with attributes
xml_with_attrs: str = """
<library>
    <book isbn="978-0134685991" lang="en">
        <title>Effective Java</title>
        <author>Joshua Bloch</author>
        <year>2018</year>
    </book>
    <book isbn="978-0596009205" lang="en">
        <title>Head First Design Patterns</title>
        <author>Eric Freeman</author>
        <year>2004</year>
    </book>
</library>
"""

root = ET.fromstring(xml_with_attrs)

for book in root.findall("book"):
    isbn: str | None = book.get("isbn")
    title: str | None = book.findtext("title")
    author: str | None = book.findtext("author")
    print(f"  [{isbn}] {title} by {author}")

## Section 2: Parsing XML from Files

`ET.parse()` reads an XML file and returns an `ElementTree` object. The root element is obtained via `.getroot()`. For this demo, we write a temporary file first.

In [None]:
import tempfile
from pathlib import Path

# Write a sample XML file
sample_xml: str = """<?xml version="1.0" encoding="UTF-8"?>
<employees>
    <employee id="101">
        <name>Alice</name>
        <department>Engineering</department>
    </employee>
    <employee id="102">
        <name>Bob</name>
        <department>Marketing</department>
    </employee>
</employees>
"""

# Write with an explicit encoding so the file matches its XML declaration
with tempfile.NamedTemporaryFile(
    mode="w", suffix=".xml", delete=False, encoding="utf-8"
) as f:
    f.write(sample_xml)
    temp_path: str = f.name

# Parse the file
tree: ET.ElementTree = ET.parse(temp_path)
root = tree.getroot()

print(f"Root tag: {root.tag}")
for emp in root.findall("employee"):
    emp_id: str | None = emp.get("id")
    name: str | None = emp.findtext("name")
    dept: str | None = emp.findtext("department")
    print(f"  Employee {emp_id}: {name} in {dept}")

# Clean up
Path(temp_path).unlink()
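
The demo above writes the file by hand and cleans up with `unlink()`. As an alternative sketch, `tempfile.TemporaryDirectory` removes the file automatically, and `ElementTree.write()` produces the file from a tree rather than from a string. The file name `employees.xml` and the sample data are illustrative assumptions.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Build a small tree in memory
employees = ET.Element("employees")
alice = ET.SubElement(employees, "employee", id="101")
ET.SubElement(alice, "name").text = "Alice"

with tempfile.TemporaryDirectory() as tmp_dir:
    xml_path = Path(tmp_dir) / "employees.xml"  # hypothetical file name

    # Write the tree with an XML declaration, then read it back
    ET.ElementTree(employees).write(
        str(xml_path), encoding="utf-8", xml_declaration=True
    )
    reloaded = ET.parse(str(xml_path)).getroot()

    names = [emp.findtext("name") for emp in reloaded.findall("employee")]
    first_line = xml_path.read_text(encoding="utf-8").splitlines()[0]

print(names)
print(first_line)
```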

## Section 3: Building XML Programmatically

Use `ET.Element()` for the root and `ET.SubElement()` to add children. Attributes are passed as a dictionary or keyword arguments.

In [None]:
# Build a catalog XML document from scratch
catalog: ET.Element = ET.Element("catalog")

# Add a book with attributes and child elements
book: ET.Element = ET.SubElement(catalog, "book", attrib={"id": "1"})
title: ET.Element = ET.SubElement(book, "title")
title.text = "Python"
price: ET.Element = ET.SubElement(book, "price")
price.text = "39.99"

# Add a second book
book2: ET.Element = ET.SubElement(catalog, "book", attrib={"id": "2"})
title2: ET.Element = ET.SubElement(book2, "title")
title2.text = "Algorithms"
price2: ET.Element = ET.SubElement(book2, "price")
price2.text = "49.99"

# Verify the structure (find() returns None when nothing matches)
found_book: ET.Element | None = catalog.find("book")
if found_book is not None:
    print(f"First book title: {found_book.findtext('title')}")
    print(f"First book id: {found_book.get('id')}")

# List all books
for b in catalog.findall("book"):
    print(f"  Book {b.get('id')}: {b.findtext('title')} - ${b.findtext('price')}")
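
The cell above repeats the same `SubElement` calls for each book. When the data already lives in Python structures, a loop keeps the construction compact. The sketch below assumes a hypothetical `build_catalog` helper and illustrative rows.

```python
import xml.etree.ElementTree as ET


def build_catalog(rows: list[dict[str, str]]) -> ET.Element:
    """Build a <catalog> element from rows with 'id', 'title', and 'price' keys."""
    catalog_root = ET.Element("catalog")
    for row in rows:
        book_el = ET.SubElement(catalog_root, "book", attrib={"id": row["id"]})
        ET.SubElement(book_el, "title").text = row["title"]
        ET.SubElement(book_el, "price").text = row["price"]
    return catalog_root


rows = [
    {"id": "1", "title": "Python", "price": "39.99"},
    {"id": "2", "title": "Algorithms", "price": "49.99"},
]
built = build_catalog(rows)
print(ET.tostring(built, encoding="unicode"))
```

Attribute values and `.text` must be strings, so numeric data needs `str()` before it goes into the tree.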

## Section 4: Serializing XML to Strings

`ET.tostring()` converts an element tree back to an XML byte string. Use `encoding="unicode"` to get a regular string instead of bytes.

In [None]:
# Create a simple element
msg: ET.Element = ET.Element("msg")
msg.text = "hello"

# Serialize to bytes (default)
xml_bytes: bytes = ET.tostring(msg)
print(f"Bytes: {xml_bytes}")
print(f"Type: {type(xml_bytes).__name__}")

# Serialize to unicode string
xml_str: str = ET.tostring(msg, encoding="unicode")
print(f"\nString: {xml_str}")
print(f"Type: {type(xml_str).__name__}")

# Serialize the full catalog (encoding="unicode" emits no XML declaration by default)
catalog_str: str = ET.tostring(catalog, encoding="unicode")
print(f"\nCatalog XML:\n{catalog_str}")

In [None]:
# Pretty-printing with indent (Python 3.9+)
ET.indent(catalog, space="  ")
pretty_xml: str = ET.tostring(catalog, encoding="unicode")
print("Pretty-printed catalog:")
print(pretty_xml)
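
With `encoding="unicode"` and default settings, `ET.tostring()` does not emit an XML declaration. One way to get a declaration is the `xml_declaration` parameter (accepted by `tostring()` since Python 3.8) together with a byte encoding, as in this sketch:

```python
import xml.etree.ElementTree as ET

note = ET.Element("note")
note.text = "hello"

# Without a declaration: just the element markup, as a str
plain = ET.tostring(note, encoding="unicode")

# With a declaration: bytes in the requested encoding
declared = ET.tostring(note, encoding="utf-8", xml_declaration=True)

print(plain)
print(declared.decode("utf-8"))
```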

## Section 5: XPath Expressions

ElementTree supports a subset of XPath for searching the tree. Common patterns:
- `tag` -- direct children with given tag
- `.//tag` -- all descendants with given tag
- `*/tag` -- grandchildren with given tag
- `[@attrib]` -- elements with a specific attribute
- `[@attrib='value']` -- elements where attribute equals value

In [None]:
store_xml: str = """
<store>
    <book category="fiction">
        <title>Novel</title>
        <price>12.99</price>
    </book>
    <book category="tech">
        <title>Python</title>
        <price>39.99</price>
    </book>
    <book category="fiction">
        <title>Mystery</title>
        <price>9.99</price>
    </book>
</store>
"""

store: ET.Element = ET.fromstring(store_xml)

# Find all titles anywhere in the tree
all_titles: list[str] = [el.text for el in store.findall(".//title")]
print(f"All titles: {all_titles}")

# Find books by attribute
fiction_books: list[ET.Element] = store.findall("book[@category='fiction']")
print("\nFiction books:")
for book in fiction_books:
    print(f"  {book.findtext('title')} - ${book.findtext('price')}")

# Find all prices
prices: list[float] = [float(el.text) for el in store.findall(".//price")]
print(f"\nAll prices: {prices}")
print(f"Total: ${sum(prices):.2f}")

In [None]:
# More XPath examples with nested structures
nested_xml: str = """
<company>
    <department name="Engineering">
        <team name="Backend">
            <member role="lead">Alice</member>
            <member role="dev">Bob</member>
        </team>
        <team name="Frontend">
            <member role="lead">Charlie</member>
        </team>
    </department>
    <department name="Marketing">
        <team name="Content">
            <member role="lead">Diana</member>
        </team>
    </department>
</company>
"""

company: ET.Element = ET.fromstring(nested_xml)

# All members in the entire company
all_members: list[str] = [m.text for m in company.findall(".//member")]
print(f"All members: {all_members}")

# Only team leads
leads: list[str] = [m.text for m in company.findall(".//member[@role='lead']")]
print(f"Team leads: {leads}")

# Members in grandchild teams (department/team/member)
grandchild_members: list[str] = [
    m.text for m in company.findall("department/team/member")
]
print(f"Via path: {grandchild_members}")
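
Beyond attribute tests, ElementTree's XPath subset also supports child-text predicates, positional predicates, and `..` for stepping to the parent. Here is a small sketch on an illustrative `<store>` document:

```python
import xml.etree.ElementTree as ET

shop = ET.fromstring(
    "<store>"
    "<book category='fiction'><title>Novel</title></book>"
    "<book category='tech'><title>Python</title></book>"
    "<book category='fiction'><title>Mystery</title></book>"
    "</store>"
)

# [tag='text']: the book whose <title> child has the given text
python_book = shop.find("book[title='Python']")

# [n] and [last()]: positions are 1-based
first_title = shop.findtext("book[1]/title")
last_title = shop.findtext("book[last()]/title")

# ..: step from each <title> back up to its parent <book>
parents = shop.findall(".//title/..")

if python_book is not None:
    print(python_book.get("category"))
print(first_title, last_title)
print([p.tag for p in parents])
```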

## Section 6: Modifying XML Trees

Elements are mutable: after creation you can change their text and attributes, and add or remove children.

In [None]:
# Start with a simple XML tree
xml_data = "<config><setting name='debug'>false</setting></config>"
config: ET.Element = ET.fromstring(xml_data)

# Modify text content
setting: ET.Element | None = config.find("setting")
if setting is not None:
    print(f"Before: {setting.text}")
    setting.text = "true"
    print(f"After: {setting.text}")

# Modify attributes
if setting is not None:
    setting.set("name", "verbose")
    setting.set("type", "boolean")
    print(f"Attributes: {setting.attrib}")

# Add a new element
new_setting: ET.Element = ET.SubElement(config, "setting", name="timeout")
new_setting.text = "30"

# Remove the original element (remove() takes the element itself, not a path)
if setting is not None:
    config.remove(setting)

result: str = ET.tostring(config, encoding="unicode")
print(f"\nFinal XML: {result}")
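
Removing several children at once needs some care: mutating an element while iterating over it directly can skip items. `findall()` returns a separate list, so looping over its result is safe. This is sketched below with an illustrative `<config>` document and a made-up `deprecated` attribute.

```python
import xml.etree.ElementTree as ET

settings = ET.fromstring(
    "<config>"
    "<setting name='debug' deprecated='true'>false</setting>"
    "<setting name='timeout'>30</setting>"
    "<setting name='legacy_mode' deprecated='true'>on</setting>"
    "</config>"
)

# findall() returns a list, so removing from `settings` inside the loop is safe
for old in settings.findall("setting[@deprecated='true']"):
    settings.remove(old)

remaining = [s.get("name") for s in settings]
print(remaining)
print(ET.tostring(settings, encoding="unicode"))
```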

## Section 7: XML Namespaces

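Namespaces prevent name collisions when combining XML from different sources. They are declared with `xmlns` attributes and appear as prefixes on tag names in the source document. After parsing, ElementTree stores each namespaced tag in Clark notation, `{uri}local`, so searches need either that form or a prefix-to-URI mapping.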

In [None]:
# XML with namespaces
ns_xml: str = """
<root xmlns:h="http://www.w3.org/1999/xhtml"
      xmlns:f="http://www.w3.org/2002/xforms">
    <h:table>
        <h:tr>
            <h:td>Cell 1</h:td>
            <h:td>Cell 2</h:td>
        </h:tr>
    </h:table>
    <f:form>
        <f:input name="user"/>
    </f:form>
</root>
"""

# Define namespace map for searching
namespaces: dict[str, str] = {
    "h": "http://www.w3.org/1999/xhtml",
    "f": "http://www.w3.org/2002/xforms",
}

root = ET.fromstring(ns_xml)

# Find elements using namespace prefix
table: ET.Element | None = root.find("h:table", namespaces)
if table is not None:
    print(f"Found table element: {table.tag}")

cells: list[ET.Element] = root.findall(".//h:td", namespaces)
for cell in cells:
    print(f"  Cell: {cell.text}")

# Find form elements
form_input: ET.Element | None = root.find(".//f:input", namespaces)
if form_input is not None:
    print(f"\nForm input name: {form_input.get('name')}")

In [None]:
# Register namespaces to preserve prefixes during serialization
ET.register_namespace("h", "http://www.w3.org/1999/xhtml")
ET.register_namespace("f", "http://www.w3.org/2002/xforms")

# Build XML with namespaces using Clark notation {uri}local
XHTML: str = "http://www.w3.org/1999/xhtml"
doc: ET.Element = ET.Element("page")
heading: ET.Element = ET.SubElement(doc, f"{{{XHTML}}}h1")
heading.text = "Welcome"

result = ET.tostring(doc, encoding="unicode")
print(f"Namespaced XML: {result}")
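
Because parsed tags carry the full namespace URI, code that only cares about the local name often needs to split them. A sketch of one way to do that follows; the helper name `split_clark` is hypothetical.

```python
import xml.etree.ElementTree as ET


def split_clark(tag: str) -> tuple[str, str]:
    """Split '{uri}local' into (uri, local); the uri is '' for plain tags."""
    if tag.startswith("{"):
        uri, _, local = tag[1:].partition("}")
        return uri, local
    return "", tag


ns_doc = ET.fromstring(
    '<root xmlns:h="http://www.w3.org/1999/xhtml"><h:table/><plain/></root>'
)

parts = [split_clark(child.tag) for child in ns_doc]
for uri, local in parts:
    print(f"uri={uri!r} local={local!r}")
```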

## Section 8: Iterative Parsing with iterparse

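For large XML files, `ET.iterparse()` yields elements as they are parsed, so you can process and discard them incrementally. It still builds the tree as it goes, so clearing elements you have finished with is what keeps memory use low.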

In [None]:
import io

# Simulate a large XML file with a StringIO stream
large_xml: str = """
<log>
    <entry level="INFO">Application started</entry>
    <entry level="WARNING">Low memory</entry>
    <entry level="ERROR">Connection failed</entry>
    <entry level="INFO">Application stopped</entry>
</log>
"""

# Use iterparse to process events
warnings_and_errors: list[str] = []

for event, elem in ET.iterparse(io.StringIO(large_xml), events=("end",)):
    if elem.tag == "entry":
        level: str | None = elem.get("level")
        if level in ("WARNING", "ERROR"):
            warnings_and_errors.append(f"[{level}] {elem.text}")
        # Clear element to free memory in large files
        elem.clear()

print("Warnings and errors found:")
for entry in warnings_and_errors:
    print(f"  {entry}")
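
Calling `elem.clear()` empties each processed element, but the emptied elements remain attached to the root. For very large inputs, a common refinement is to capture the root from the first `start` event and clear it as you go. Here is a sketch under that assumption, counting entries per level with illustrative log data:

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

log_bytes = (
    b"<log>"
    b"<entry level='INFO'>Application started</entry>"
    b"<entry level='WARNING'>Low memory</entry>"
    b"<entry level='ERROR'>Connection failed</entry>"
    b"<entry level='INFO'>Application stopped</entry>"
    b"</log>"
)

level_counts = Counter()

events = iter(ET.iterparse(io.BytesIO(log_bytes), events=("start", "end")))
_, log_root = next(events)  # the first event is the start of the root element

for event, elem in events:
    if event == "end" and elem.tag == "entry":
        level_counts[elem.get("level", "UNKNOWN")] += 1
        # Detach every child processed so far so the tree does not grow
        log_root.clear()

print(dict(level_counts))
```

Note that `log_root.clear()` also resets the root's own attributes and text, so read anything you need from the root before the loop.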

## Section 9: Practical Example -- Converting XML to Dictionaries

A common pattern is converting XML data into Python dictionaries for easier manipulation.

In [None]:
from typing import Any


def xml_to_dict(element: ET.Element) -> dict[str, Any] | str:
    """Recursively convert an XML element to a dictionary.

    A leaf element with text and no attributes collapses to a plain string.
    """
    result: dict[str, Any] = {}

    # Include attributes with @ prefix
    for key, value in element.attrib.items():
        result[f"@{key}"] = value

    # Include text content
    if element.text and element.text.strip():
        if len(element) == 0 and not element.attrib:
            return element.text.strip()
        result["#text"] = element.text.strip()

    # Process child elements
    for child in element:
        child_data: dict[str, Any] | str = xml_to_dict(child)
        if child.tag in result:
            # Convert to list for duplicate tags
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(child_data)
        else:
            result[child.tag] = child_data

    return result


# Convert a sample XML to dict
xml_data = """
<user id="42">
    <name>Alice</name>
    <email>alice@example.com</email>
    <roles>
        <role>admin</role>
        <role>editor</role>
    </roles>
</user>
"""

root = ET.fromstring(xml_data)
user_dict: dict[str, Any] | str = xml_to_dict(root)

import json
print(json.dumps(user_dict, indent=2))

## Summary

### Parsing XML
- **`ET.fromstring()`**: Parse XML from a string, returns root `Element`
- **`ET.parse()`**: Parse XML from a file, returns `ElementTree`
- **`ET.iterparse()`**: Incremental parsing for large files

### Building XML
- **`ET.Element(tag, attrib)`**: Create a root element
- **`ET.SubElement(parent, tag)`**: Add a child element
- Set `.text` for element content, `.attrib` for attributes

### Searching
- **`find(path)`**: First matching element
- **`findall(path)`**: All matching elements
- **`findtext(path)`**: Text of first matching element
- XPath subset: `.//tag`, `[@attr='val']`, `tag/subtag`

### Serialization
- **`ET.tostring(elem, encoding="unicode")`**: Convert to string
- **`ET.indent(elem)`**: Pretty-print formatting (Python 3.9+)

### Namespaces
- Use namespace dictionaries with `find()`/`findall()`
- Clark notation `{uri}local` for building namespaced elements
- `ET.register_namespace()` for clean serialization