lol-html-py

Async streaming Python bindings for lol_html, the low-output-latency HTML rewriter from Cloudflare. Built on PyO3 and Tokio, it integrates with asyncio via pyo3-async-runtimes.

Why

lol_html is designed for streaming: it can begin emitting output before the full input has been received. Exposing that property to Python takes more than a thin FFI wrapper, because per-chunk round trips across the PyO3 boundary dominate runtime when chunks are small. This package addresses that by running the rewriter on a long-lived Tokio task, coalescing output in Rust, and crossing the FFI boundary only at channel endpoints, with symmetric backpressure on both input and output. The channel capacities and flush settings that govern this behavior are all configurable.

Install

pip install lol-html-py

Usage

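import asyncio
from lol_html import AsyncRewriter

async def main():
    # Default configuration; see Configuration for the tunable parameters.
    rw = AsyncRewriter()

    async def produce():
        # feed() awaits when the bounded input channel is full.
        await rw.feed(b"<html><script>bad()</script><p>ok</p></html>")
        # End of input, so the consumer's loop below can finish.
        rw.close()

    async def consume():
        out = bytearray()
        # Iterating the rewriter yields output chunks as they are flushed.
        async for chunk in rw:
            out += chunk
        return bytes(out)

    # Run both sides concurrently so feeding and draining overlap.
    _, result = await asyncio.gather(produce(), consume())
    print(result)

asyncio.run(main())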

Configuration

The I/O and flushing parameters below control the latency/throughput trade-off. Each has a sensible default and can be overridden per instance or via an environment variable.

Constructor arg	Env var	Default	Meaning
`input_capacity`	`LOL_HTML_INPUT_CAPACITY`	`8`	Bounded input channel depth (producer backpressure)
`output_capacity`	`LOL_HTML_OUTPUT_CAPACITY`	`8`	Bounded output channel depth (consumer backpressure)
`flush_threshold`	`LOL_HTML_FLUSH_THRESHOLD`	`16384`	Bytes accumulated before flushing to the output channel
`flush_every_chunk`	`LOL_HTML_FLUSH_EVERY_CHUNK`	`false`	Force a flush after every input chunk, regardless of buffer size

Note that environment variables are read at library load time via a Rust constructor. Per-instance constructor arguments always take priority over env vars.
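The precedence described above (constructor argument, then environment variable, then default) can be sketched in plain Python. This is only a model: `resolve_setting` and `DEFAULTS` are hypothetical names, the real lookup happens in Rust at library load time rather than per call, and the accepted spelling of boolean env values is an assumption.

```python
import os

# Defaults copied from the configuration table above.
DEFAULTS = {
    "input_capacity": 8,
    "output_capacity": 8,
    "flush_threshold": 16384,
    "flush_every_chunk": False,
}


def resolve_setting(name, explicit=None, environ=None):
    """Hypothetical helper: explicit argument > env var > default."""
    environ = os.environ if environ is None else environ
    if explicit is not None:
        return explicit  # per-instance arguments always win
    default = DEFAULTS[name]
    raw = environ.get("LOL_HTML_" + name.upper())
    if raw is None:
        return default
    if isinstance(default, bool):
        # Assumption: booleans are spelled "true"/"false" (case-insensitive).
        return raw.strip().lower() == "true"
    return int(raw)
```

For example, `resolve_setting("flush_threshold", explicit=1)` returns `1` even when `LOL_HTML_FLUSH_THRESHOLD` is set.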

Operating points

Goal	`flush_threshold`	`flush_every_chunk`
SSE / per-token streaming	`1`	`True`
General low-latency streaming	`4096`–`16384`	`False`
High-throughput batch	`65536`+	`False`
Single output blob at end	`0`	`False`

flush_threshold=0 disables size-based flushing entirely; output is emitted only at end of stream or on a forced per-chunk flush.
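One way to read the flush rules in the table is as a single predicate evaluated after each input chunk. The function below is a model of that reading, not the library's code: `should_flush` is a hypothetical name, and both the `>=` comparison and the "never send an empty buffer" rule are assumptions.

```python
def should_flush(buffered, flush_threshold, flush_every_chunk, end_of_stream=False):
    """Model of the flush decision after one input chunk has been processed."""
    if buffered == 0:
        return False  # assumption: an empty buffer is never sent
    if end_of_stream or flush_every_chunk:
        return True  # forced flush
    # A threshold of 0 disables size-based flushing entirely.
    return flush_threshold > 0 and buffered >= flush_threshold
```

Under this model, the "single output blob" row (`flush_threshold=0`, `flush_every_chunk=False`) returns `False` for every chunk until `end_of_stream` is set.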

Architecture

Python producer ──feed()──▶ [input channel, bounded] ──▶ parser task
                                                           │
                                                           ├─ HtmlRewriter.write()
                                                           ├─ sink coalesces into Vec
                                                           └─ flush ──▶ [output channel, bounded]
                                                                            │
                                                       Python consumer ◀────┘ (async for)

The parser task runs on a shared multi-threaded Tokio runtime.
FFI crossings happen only at channel enqueue/dequeue points.
Backpressure is symmetric: slow consumer → full output channel → parser blocks → full input channel → feed() awaits.
Cancellation via cancel() or Drop triggers a CancellationToken that unblocks every select arm, ensuring clean shutdown.
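The backpressure chain can be modelled with two bounded `asyncio.Queue` objects standing in for the Tokio channels. This is an illustrative sketch only: it runs entirely in Python, the `pipeline` function and its `capacity` parameter are made up for the example, and `bytes.upper()` is a stand-in for the real rewriting step.

```python
import asyncio


async def pipeline(chunks, capacity=2):
    """Producer -> bounded queue -> 'parser' -> bounded queue -> slow consumer."""
    inbox = asyncio.Queue(maxsize=capacity)
    outbox = asyncio.Queue(maxsize=capacity)
    done = object()  # end-of-stream sentinel
    max_depth = 0

    async def producer():
        nonlocal max_depth
        for chunk in chunks:
            await inbox.put(chunk)  # suspends while the input queue is full
            max_depth = max(max_depth, inbox.qsize())
        await inbox.put(done)

    async def parser():
        while (chunk := await inbox.get()) is not done:
            await outbox.put(chunk.upper())  # suspends while the output queue is full
        await outbox.put(done)

    async def consumer():
        out = []
        while (chunk := await outbox.get()) is not done:
            await asyncio.sleep(0.001)  # deliberately slow consumer
            out.append(chunk)
        return out

    _, _, result = await asyncio.gather(producer(), parser(), consumer())
    return result, max_depth
```

Because both queues are bounded, the producer can never run more than a fixed number of chunks ahead of the slow consumer, which is the property the real channels provide across the FFI boundary.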

Performance

Benchmarks are split across two harnesses to isolate parse cost from PyO3 FFI + asyncio cost:

just bench-compare   # Run both back-to-back for direct comparison
just bench-native    # Pure Rust baseline
just bench-python    # Python async pipeline

On a 110 KB HTML payload (medium in the bench suite):

Scenario	Config	Native Rust	Python bindings	Python / Native
`null_copy`	memcpy ceiling	~58,000 MB/s	—	—
SSE / per-chunk flush	`chunk=256, thresh=1, flush_every_chunk=T`	158 MB/s	4 MB/s	2.5%
General streaming	`chunk=4096, thresh=16384`	165 MB/s	71 MB/s	43%
High-throughput batch	`chunk=65536, thresh=65536`	158 MB/s	113 MB/s	72%

The gap between native and Python comes from crossing the FFI boundary, sending data over Tokio channels, asyncio scheduling, and allocating a PyBytes object for each output chunk. Most of that cost is per FFI crossing, not per byte — so it compounds when you do many small crossings (SSE: 860 round-trips per document) and amortizes when you do few (batch: 2 round-trips per document).

How input chunk size affects throughput

In the native Rust bench, feeding the payload in 256-byte writes or 64 KB writes both measure ~164 MB/s — chunk size doesn't affect native performance.

In the Python bench, 256-byte chunks run at ~5.7 MB/s; 16 KB chunks run at ~113 MB/s. That's a 20× difference, and it's entirely FFI overhead: each await rw.feed(chunk) crosses into Rust, sends on a channel, and returns an awaitable. At 256 bytes per call you're doing ~430 of those round-trips per document; at 16 KB you're doing ~7.
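The round-trip counts quoted above follow directly from the payload and chunk sizes. A quick check, assuming "110 KB" means 110,000 bytes (the exact payload size is not stated) and using a hypothetical helper name:

```python
import math

PAYLOAD_BYTES = 110_000  # assumption: 110 KB taken as 110,000 bytes


def feed_calls(payload_bytes, chunk_size):
    """Number of feed() calls needed to push the payload in fixed-size chunks."""
    return math.ceil(payload_bytes / chunk_size)


for chunk_size in (256, 16384, 65536):
    print(chunk_size, feed_calls(PAYLOAD_BYTES, chunk_size))
# 256 430
# 16384 7
# 65536 2
```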

If you're feeding the rewriter from an HTTP response or a file read, chunks are already ≥ 4 KB by default and this isn't something you need to think about. If you're feeding it from a source that produces tiny writes (a tokenizer, an SSE stream with small events), buffer them in Python before calling feed().
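A minimal sketch of that buffering step, written as an async generator that wraps any async source of small `bytes` pieces. The name `coalesce` and the `min_size` default are illustrative choices, not part of this package.

```python
import asyncio


async def coalesce(source, min_size=4096):
    """Batch small pieces from an async iterable into chunks of at least min_size bytes."""
    buf = bytearray()
    async for piece in source:
        buf += piece
        if len(buf) >= min_size:
            yield bytes(buf)
            buf.clear()
    if buf:
        yield bytes(buf)  # flush whatever is left at end of input
```

It would be used as `async for chunk in coalesce(tiny_source): await rw.feed(chunk)`. The trade-off is added latency: nothing reaches the rewriter until `min_size` bytes have accumulated, so this is a poor fit when each tiny write must be forwarded immediately.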

How flush policy affects throughput

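With flush_every_chunk=True and flush_threshold=1, the bench runs at ~4 MB/s, against ~71 MB/s for the general streaming configuration: roughly 17× slower. The two table rows also differ in input chunk size (256 bytes vs 4096 bytes), so part of that gap is feed() overhead rather than flushing. With eager flushing, every input chunk causes one output channel send, one PyBytes allocation, and one GIL reacquisition on the Python side.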

With the defaults (threshold 16 KB, no per-chunk flush) the rewriter batches output until it has ~16 KB to send, which comes to roughly seven sends for a 110 KB document, assuming the output is about the same size as the input.

Pick eager flushing when you specifically need to forward each input chunk's output before the next one arrives — server-sent events, incremental rendering, anything the downstream consumer treats as a real-time stream. Otherwise the default will be faster.

In the Rust bench the same eager flushing costs essentially nothing (SSE and batch both take ~695 μs), because on the Rust side a "flush" is a buffer clear, not an FFI crossing. The entire cost of eager mode is the Python integration.
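The ~695 μs figure is consistent with the throughput table: per-document time is just payload size divided by throughput. The conversion below assumes 110 KB = 110,000 bytes and 1 MB = 1,000,000 bytes, and `micros_per_document` is a hypothetical helper name.

```python
PAYLOAD_BYTES = 110_000  # assumption: 110 KB taken as 110,000 bytes


def micros_per_document(throughput_mb_s, payload_bytes=PAYLOAD_BYTES):
    """Microseconds to process one payload at the given throughput.

    bytes / (MB/s) gives microseconds when 1 MB = 1,000,000 bytes.
    """
    return payload_bytes / throughput_mb_s


for label, mb_s in [("native", 158), ("python batch", 113), ("python SSE", 4)]:
    print(f"{label}: {micros_per_document(mb_s):.0f} us")
# native: 696 us
# python batch: 973 us
# python SSE: 27500 us
```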

Is 113 MB/s good?

Compared to raw memcpy (~58,000 MB/s), native lol_html is 355× slower. That reflects the work of tokenizing HTML and matching element handlers — lol_html isn't memory-bound, it's CPU-bound on parse logic. The Python bindings achieve ~70% of native throughput in the batch configuration, so the async layer is taking a meaningful but not dominant share of total time when you're not forcing per-chunk flushes.

Development

Build from source, with uv:

uv sync

or with pip:

git clone https://github.com/lmmx/lol-html
cd lol-html
pip install maturin
maturin develop --release
