dom-node-classifier

A Graph Attention Network (GATv2) that classifies every node of an HTML DOM into one of 14 semantic classes. Designed as a lightweight perception layer for browser agents and DOM annotation pipelines.

What it does

Given a structured representation of an HTML DOM (nodes + edges + metadata), the model outputs a predicted class and confidence score for each node. It replaces raw HTML as the input signal for downstream tasks like automated browser interaction, accessibility audits, or dataset labeling.

Why a graph model? DOM structure is inherently a tree. GATv2 lets each node attend over its neighbors (parent, children, siblings), capturing layout context that element-level classifiers miss. For example, a <div> inside a <nav> is likely navigation, not noise.
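To make the neighbor relations concrete, here is a minimal sketch of how a DOM tree could be turned into the parent, child, and sibling edges a graph model attends over. The `build_edges` helper, the parent-pointer input, and the edge type names are assumptions for illustration; the repository's actual edge schema is defined by examples/sample_page.json and may differ.

```python
from typing import Dict, List, Optional, Tuple

Edge = Tuple[int, int, str]  # (source index, target index, edge type)


def build_edges(parents: List[Optional[int]]) -> List[Edge]:
    """Turn a parent-pointer tree into typed, directed edges.

    parents[i] is the index of node i's parent, or None for the root.
    Edge type names are hypothetical, not taken from the repository.
    """
    edges: List[Edge] = []
    children: Dict[int, List[int]] = {}
    for node, parent in enumerate(parents):
        if parent is None:
            continue
        edges.append((parent, node, "child"))   # parent -> child
        edges.append((node, parent, "parent"))  # child -> parent
        children.setdefault(parent, []).append(node)
    # Link adjacent siblings in both directions.
    for siblings in children.values():
        for left, right in zip(siblings, siblings[1:]):
            edges.append((left, right, "sibling"))
            edges.append((right, left, "sibling"))
    return edges


# <nav><div><a/></div><div/></nav> with nav=0, div=1, a=2, div=3
edges = build_edges([None, 0, 1, 0])
for edge in edges:
    print(edge)
```

With edges like these, the <div> at index 1 can receive messages from its <nav> parent, which is the context an element-level classifier would not see.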

14-class taxonomy

Group	Class	Meaning
Action	`action_input`	Text / search / email field
	`action_select`	Dropdown
	`action_button`	Button, CTA, submit
	`action_link_internal`	Link to same-site page
	`action_link_external`	Link to external domain
Structure	`structure_navigation`	Navbar, breadcrumb, pagination
	`structure_region`	Header, footer, sidebar
	`structure_dismissible`	Modal, cookie banner, overlay
	`structure_card`	Repeated card (product, listing)
	`structure_list_item`	List row
Content	`content_heading`	h1–h6
	`content_text`	Paragraph, label text
	`content_media`	Image, video, svg
Other	`noise`	Decorative, hidden, or irrelevant
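For downstream code it can help to have the table above as a lookup. The sketch below encodes the four groups and 14 class names exactly as listed. The names `TAXONOMY`, `CLASS_TO_GROUP`, and `is_actionable` are illustrative and not part of the repository, and the ordering here is not guaranteed to match the model's label indices.

```python
# Group -> class names, copied from the taxonomy table.
TAXONOMY = {
    "Action": [
        "action_input",
        "action_select",
        "action_button",
        "action_link_internal",
        "action_link_external",
    ],
    "Structure": [
        "structure_navigation",
        "structure_region",
        "structure_dismissible",
        "structure_card",
        "structure_list_item",
    ],
    "Content": ["content_heading", "content_text", "content_media"],
    "Other": ["noise"],
}

# Reverse lookup: class name -> group.
CLASS_TO_GROUP = {
    cls: group for group, classes in TAXONOMY.items() for cls in classes
}


def is_actionable(class_name: str) -> bool:
    """True if the class belongs to the Action group."""
    return CLASS_TO_GROUP.get(class_name) == "Action"


print(len(CLASS_TO_GROUP))             # number of classes
print(is_actionable("action_button"))  # Action group member
print(is_actionable("content_text"))   # Content group member
```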

Quickstart

git clone https://github.com/lucydjo/dom-node-classifier
cd dom-node-classifier
pip install -r requirements.txt

# Optional: for live URL classification (browser-based extraction)
pip install playwright && playwright install chromium

# Classify a live URL end-to-end (~8s on first run, ~2s after model is warm)
python examples/quickstart.py https://news.ycombinator.com --action-only --min-confidence 0.6

# Or run on the included sample page (no browser needed)
python examples/quickstart.py examples/sample_page.json

Weights are downloaded automatically from the Hugging Face Hub on first run.

Minimal usage

from model.inference import DOMClassifier

clf = DOMClassifier.from_checkpoint("checkpoints_final/model.safetensors")
predictions = clf.classify_page(raw_page)
# → [{"selector": "button#login", "class": "action_button", "confidence": 0.97, ...}, ...]

# Filter to interactable elements only
actionable = clf.classify_page(raw_page, action_only=True, min_confidence=0.6)

raw_page is a dict with keys url, viewport, nodes, and edges. See examples/sample_page.json for the exact format.
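If you already have a full prediction list and want to filter it yourself, the sketch below shows one plausible reading of the action_only and min_confidence options, using only the selector, class, and confidence keys shown in the output above. The `filter_predictions` helper is hypothetical and is not the library's implementation. Apart from the button#login entry, the sample predictions are made up for illustration.

```python
from typing import Any, Dict, List

Prediction = Dict[str, Any]


def filter_predictions(
    predictions: List[Prediction],
    action_only: bool = False,
    min_confidence: float = 0.0,
) -> List[Prediction]:
    """Keep predictions that pass both filters, most confident first.

    Assumes "action only" means the class name starts with "action_".
    """
    kept = []
    for pred in predictions:
        if action_only and not pred["class"].startswith("action_"):
            continue
        if pred["confidence"] < min_confidence:
            continue
        kept.append(pred)
    return sorted(kept, key=lambda p: p["confidence"], reverse=True)


# Illustrative data only; selectors and scores below are invented.
sample = [
    {"selector": "a.more", "class": "action_link_internal", "confidence": 0.55},
    {"selector": "button#login", "class": "action_button", "confidence": 0.97},
    {"selector": "h1.title", "class": "content_heading", "confidence": 0.99},
    {"selector": "input#q", "class": "action_input", "confidence": 0.71},
]

actionable = filter_predictions(sample, action_only=True, min_confidence=0.6)
print([p["selector"] for p in actionable])  # → ['button#login', 'input#q']
```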

Evaluation

Evaluated on a held-out test set across 5 independent training seeds (1, 7, 13, 42, 99).

Metric	Mean ± std	Min	Max
Macro F1	0.825 ± 0.026	0.797	0.865
Weighted F1	0.917 ± 0.032	0.882	0.965

Per-class F1 — best seed (seed 7):

Class	F1	Support
`action_input`	0.704	25
`action_select`	0.696	8
`action_button`	0.980	1 577
`action_link_internal`	0.999	3 119
`action_link_external`	1.000	327
`structure_navigation`	0.862	52
`structure_region`	0.887	52
`structure_dismissible`	0.471	158
`structure_card`	0.850	1 045
`structure_list_item`	0.971	3 885
`content_heading`	0.981	525
`content_text`	0.797	322
`content_media`	0.943	1 319
`noise`	0.972	18 345
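The gap between macro and weighted F1 follows directly from the per-class table. Macro F1 averages the 14 classes equally, so low-scoring classes such as structure_dismissible pull it down. Weighted F1 weights each class by its support, so it is dominated by noise (18 345 of 30 759 test nodes). The sketch below recomputes both from the rounded seed-7 values above. It is a consistency check on the table, not the repository's evaluation code.

```python
# (class, F1, support) copied from the per-class table for seed 7.
PER_CLASS = [
    ("action_input", 0.704, 25),
    ("action_select", 0.696, 8),
    ("action_button", 0.980, 1577),
    ("action_link_internal", 0.999, 3119),
    ("action_link_external", 1.000, 327),
    ("structure_navigation", 0.862, 52),
    ("structure_region", 0.887, 52),
    ("structure_dismissible", 0.471, 158),
    ("structure_card", 0.850, 1045),
    ("structure_list_item", 0.971, 3885),
    ("content_heading", 0.981, 525),
    ("content_text", 0.797, 322),
    ("content_media", 0.943, 1319),
    ("noise", 0.972, 18345),
]


def macro_f1(rows):
    """Unweighted mean of per-class F1."""
    return sum(f1 for _, f1, _ in rows) / len(rows)


def weighted_f1(rows):
    """Mean of per-class F1 weighted by class support."""
    total = sum(support for _, _, support in rows)
    return sum(f1 * support for _, f1, support in rows) / total


macro = macro_f1(PER_CLASS)
weighted = weighted_f1(PER_CLASS)
print(f"macro F1    = {macro:.3f}")     # → 0.865
print(f"weighted F1 = {weighted:.3f}")  # → 0.965
```

Both values match the Max column of the summary table, which is consistent with seed 7 being the best seed.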

Limitations

Thin classes. action_input (n=25) and action_select (n=8) have very low test support — F1 for these classes should be interpreted cautiously.
Hardest class. structure_dismissible (cookie banners, modals) has a mean F1 of 0.363 ± 0.073 across seeds, and reaches only 0.471 on the best seed. Cookie consent implementations vary widely across sites.
No price class. Numeric price strings (e.g. "129€") are classified as noise. A content_price class is the most requested addition.
Heuristic labels. Training labels were generated by deterministic rules, not human annotation. Label noise exists, particularly for ambiguous elements at class boundaries.
Static snapshot. The model classifies a DOM snapshot. Dynamic behavior (JavaScript mutations, lazy-loaded content) is not modeled.
Dataset size. Trained on ~135 pages. Performance on highly specialized or unusual page layouts may be lower than benchmark numbers.

Architecture

Model: GATv2 (3 layers, 4 attention heads, hidden dim 128)
Input features: 618 dims/node — tag one-hot, class hashing (Tailwind-robust), attribute presence flags, computed CSS, bbox, topology, link semantics, MiniLM-L6-v2 text embedding
Parameters: 1.27M
Text encoder: sentence-transformers/all-MiniLM-L6-v2 (frozen at inference)
Training: AdamW, cosine LR schedule, early stopping, sqrt-inverse class weighting
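As an illustration of sqrt-inverse class weighting, the sketch below assigns each class a weight proportional to 1/sqrt(count), then rescales so the mean weight is 1. The rescaling step and the `sqrt_inverse_weights` helper are assumptions, since the exact formula is not stated here. The counts are the test-set supports from the Evaluation table, used only as stand-ins because training counts are not listed.

```python
import math
from typing import Dict


def sqrt_inverse_weights(counts: Dict[str, int]) -> Dict[str, float]:
    """Weight each class by 1/sqrt(count), rescaled to a mean of 1.

    The mean-1 normalization is an assumption for this sketch.
    """
    raw = {cls: 1.0 / math.sqrt(n) for cls, n in counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {cls: w / mean for cls, w in raw.items()}


# Stand-in counts: test supports from the Evaluation table, not training counts.
counts = {
    "action_select": 8,
    "structure_dismissible": 158,
    "action_button": 1577,
    "noise": 18345,
}

weights = sqrt_inverse_weights(counts)
for cls, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{cls:24s} {w:.3f}")

# The square root compresses the imbalance: the rarest and most common
# classes differ by sqrt(18345 / 8) in weight rather than 18345 / 8.
ratio = weights["action_select"] / weights["noise"]
print(f"rare/common weight ratio = {ratio:.1f}")
```

Compared with plain inverse-frequency weighting, the square root keeps rare classes from dominating the loss while still boosting them over noise.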

See MODEL_CARD.md for full training details.

License

Apache 2.0 — see LICENSE.

Contact

Lucy Paureau · lmi.rest · lucy.paureau@gmail.com
