A Graph Attention Network (GATv2) that classifies every node of an HTML DOM into one of 14 semantic classes. Designed as a lightweight perception layer for browser agents and DOM annotation pipelines.
Given a structured representation of an HTML DOM (nodes + edges + metadata), the model outputs a predicted class and confidence score for each node. It replaces raw HTML as the input signal for downstream tasks like automated browser interaction, accessibility audits, or dataset labeling.
Why a graph model? DOM structure is inherently a tree. GATv2 lets each node attend over its neighbors (parent, children, siblings), capturing layout context that element-level classifiers miss — e.g., a <div> inside a <nav> is likely navigation, not noise.
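As a rough illustration of the neighborhood described above, the sketch below turns a parent-pointer description of a DOM tree into directed edges between parents, children, and adjacent siblings. This is not the repository's extraction code: the `build_edges` helper, the parent-list input, and the choice to link only adjacent siblings are assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple


def build_edges(parents: List[Optional[int]]) -> List[Tuple[int, int]]:
    """Return directed (source, target) edges for a DOM tree.

    parents[i] is the index of node i's parent, or None for the root.
    Hypothetical helper: the real edge schema may differ.
    """
    children: Dict[int, List[int]] = {}
    edges: List[Tuple[int, int]] = []
    for node, parent in enumerate(parents):
        if parent is None:
            continue
        children.setdefault(parent, []).append(node)
        edges.append((parent, node))  # parent -> child
        edges.append((node, parent))  # child -> parent
    # Link adjacent siblings in document order, in both directions.
    for siblings in children.values():
        for left, right in zip(siblings, siblings[1:]):
            edges.append((left, right))
            edges.append((right, left))
    return edges


# Toy DOM: body(0) > nav(1) > [div(2), div(3)]
parents = [None, 0, 1, 1]
edges = build_edges(parents)

# Nodes whose messages reach the first <div>: its <nav> parent and its sibling.
neighbors_of_div = sorted(src for src, dst in edges if dst == 2)
print(neighbors_of_div)
```

With edges like these, an attention layer can weight the `<nav>` parent heavily when deciding how to label the `<div>`.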
| Group | Class | Meaning |
|---|---|---|
| Action | `action_input` | Text / search / email field |
| | `action_select` | Dropdown |
| | `action_button` | Button, CTA, submit |
| | `action_link_internal` | Link to same-site page |
| | `action_link_external` | Link to external domain |
| Structure | `structure_navigation` | Navbar, breadcrumb, pagination |
| | `structure_region` | Header, footer, sidebar |
| | `structure_dismissible` | Modal, cookie banner, overlay |
| | `structure_card` | Repeated card (product, listing) |
| | `structure_list_item` | List row |
| Content | `content_heading` | h1–h6 |
| | `content_text` | Paragraph, label text |
| | `content_media` | Image, video, svg |
| Other | `noise` | Decorative, hidden, or irrelevant |
git clone https://github.com/lucydjo/dom-node-classifier
cd dom-node-classifier
pip install -r requirements.txt
# Optional: for live URL classification (browser-based extraction)
pip install playwright && playwright install chromium
# Classify a live URL end-to-end (~8s on first run, ~2s after model is warm)
python examples/quickstart.py https://news.ycombinator.com --action-only --min-confidence 0.6
# Or run on the included sample page (no browser needed)
python examples/quickstart.py examples/sample_page.json

Weights are downloaded automatically from the Hugging Face Hub on first run.
from model.inference import DOMClassifier
clf = DOMClassifier.from_checkpoint("checkpoints_final/model.safetensors")
predictions = clf.classify_page(raw_page)
# → [{"selector": "button#login", "class": "action_button", "confidence": 0.97, ...}, ...]
# Filter to interactable elements only
actionable = clf.classify_page(raw_page, action_only=True, min_confidence=0.6)

`raw_page` is a dict with keys `url`, `viewport`, `nodes`, and `edges`. See `examples/sample_page.json` for the exact format.
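If predictions are already in memory, similar filtering can be applied after the fact. The sketch below is not the library's implementation: `filter_predictions` is a hypothetical helper, it assumes that `action_only` means "keep classes whose name starts with `action_`", and every sample entry except the first is made up for illustration.

```python
from typing import Any, Dict, List


def filter_predictions(
    predictions: List[Dict[str, Any]],
    action_only: bool = False,
    min_confidence: float = 0.0,
) -> List[Dict[str, Any]]:
    """Keep predictions that pass the class and confidence filters."""
    kept = []
    for pred in predictions:
        # Assumption: actionable classes are exactly the "action_*" group.
        if action_only and not pred["class"].startswith("action_"):
            continue
        if pred["confidence"] < min_confidence:
            continue
        kept.append(pred)
    return kept


# Prediction dicts shaped like the classify_page output shown above.
sample = [
    {"selector": "button#login", "class": "action_button", "confidence": 0.97},
    {"selector": "div.banner", "class": "noise", "confidence": 0.99},
    {"selector": "a.more", "class": "action_link_internal", "confidence": 0.41},
]

actionable = filter_predictions(sample, action_only=True, min_confidence=0.6)
print([p["selector"] for p in actionable])
```

Filtering this way lets an agent run the model once and apply different confidence thresholds per task.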
Evaluated on a held-out test set across 5 independent seeds (1, 7, 13, 42, 99).
| Metric | Mean ± std | Min | Max |
|---|---|---|---|
| Macro F1 | 0.825 ± 0.026 | 0.797 | 0.865 |
| Weighted F1 | 0.917 ± 0.032 | 0.882 | 0.965 |
Per-class F1 — best seed (seed 7):
| Class | F1 | Support |
|---|---|---|
| `action_input` | 0.704 | 25 |
| `action_select` | 0.696 | 8 |
| `action_button` | 0.980 | 1 577 |
| `action_link_internal` | 0.999 | 3 119 |
| `action_link_external` | 1.000 | 327 |
| `structure_navigation` | 0.862 | 52 |
| `structure_region` | 0.887 | 52 |
| `structure_dismissible` | 0.471 | 158 |
| `structure_card` | 0.850 | 1 045 |
| `structure_list_item` | 0.971 | 3 885 |
| `content_heading` | 0.981 | 525 |
| `content_text` | 0.797 | 322 |
| `content_media` | 0.943 | 1 319 |
| `noise` | 0.972 | 18 345 |
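The two aggregate metrics differ only in how the per-class scores are averaged: macro F1 gives every class equal weight, while weighted F1 weights each class by its support. A minimal sketch using the seed 7 values from the table above (the variable names are illustrative, not taken from the repository's evaluation code):

```python
# (F1, support) per class, copied from the seed 7 table.
per_class = {
    "action_input": (0.704, 25),
    "action_select": (0.696, 8),
    "action_button": (0.980, 1577),
    "action_link_internal": (0.999, 3119),
    "action_link_external": (1.000, 327),
    "structure_navigation": (0.862, 52),
    "structure_region": (0.887, 52),
    "structure_dismissible": (0.471, 158),
    "structure_card": (0.850, 1045),
    "structure_list_item": (0.971, 3885),
    "content_heading": (0.981, 525),
    "content_text": (0.797, 322),
    "content_media": (0.943, 1319),
    "noise": (0.972, 18345),
}

# Macro F1: unweighted mean over classes.
macro_f1 = sum(f1 for f1, _ in per_class.values()) / len(per_class)

# Weighted F1: mean over classes, weighted by test support.
total_support = sum(n for _, n in per_class.values())
weighted_f1 = sum(f1 * n for f1, n in per_class.values()) / total_support

print(f"macro F1    = {macro_f1:.3f}")
print(f"weighted F1 = {weighted_f1:.3f}")
```

Rounded to three decimals, these come out to 0.865 and 0.965, matching the Max column of the summary table. The gap between them reflects how heavily `noise` (18 345 of 30 759 test nodes) dominates the weighted score.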
- Thin classes. `action_input` (n=25) and `action_select` (n=8) have very low test support, so F1 for these classes should be interpreted cautiously.
- `structure_dismissible` (cookie banners, modals) is the hardest class: mean F1 0.363 ± 0.073 across seeds. Cookie consent implementations vary widely across sites.
- No price class. Numeric price strings (e.g. `"129€"`) are classified as `noise`. A `content_price` class is the most requested addition.
- Heuristic labels. Training labels were generated by deterministic rules, not human annotation. Label noise exists, particularly for ambiguous elements at class boundaries.
- Static snapshot. The model classifies a DOM snapshot. Dynamic behavior (JavaScript mutations, lazy-loaded content) is not modeled.
- Dataset size. Trained on ~135 pages. Performance on highly specialized or unusual page layouts may be lower than benchmark numbers.
- Model: GATv2 (3 layers, 4 attention heads, hidden dim 128)
- Input features: 618 dims/node — tag one-hot, class hashing (Tailwind-robust), attribute presence flags, computed CSS, bbox, topology, link semantics, MiniLM-L6-v2 text embedding
- Parameters: 1.27M
- Text encoder: `sentence-transformers/all-MiniLM-L6-v2` (frozen at inference)
- Training: AdamW, cosine LR schedule, early stopping, sqrt-inverse class weighting
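One plausible reading of "class hashing" in the feature list is the hashing trick: each CSS class token is hashed into a fixed-size vector, so utility-class vocabularies such as Tailwind's need no predefined dictionary. The sketch below illustrates that idea only. The `hash_classes` helper, the MD5 hash, and the 32-bucket width are assumptions, not the repository's actual feature extractor.

```python
import hashlib
from typing import List


def hash_classes(class_attr: str, dim: int = 32) -> List[float]:
    """Map a space-separated class attribute to a fixed-size count vector.

    Hypothetical helper: dim=32 and MD5 are illustrative choices.
    """
    vec = [0.0] * dim
    for token in class_attr.split():
        # A stable hash, so the same token lands in the same bucket on every run.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "little") % dim
        vec[bucket] += 1.0
    return vec


features = hash_classes("flex items-center px-4")
print(len(features), sum(features))
```

Because tokens are hashed independently, the vector does not depend on class order, and unseen class names still map to some bucket at inference time.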
See MODEL_CARD.md for full training details.
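Sqrt-inverse class weighting can be sketched as follows: each class gets a loss weight proportional to 1/sqrt(count), which boosts rare classes less aggressively than plain inverse frequency. The training-set counts are not given here, so the example borrows three test-support values as stand-ins. Normalizing the weights to a mean of 1 is also an assumption.

```python
import math
from typing import Dict


def sqrt_inverse_weights(counts: Dict[str, int]) -> Dict[str, float]:
    """Weight each class by 1/sqrt(count), rescaled so the mean weight is 1."""
    raw = {name: 1.0 / math.sqrt(n) for name, n in counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {name: w / mean for name, w in raw.items()}


# Stand-in counts (test supports from the table above, NOT training counts).
counts = {"noise": 18345, "action_button": 1577, "action_select": 8}
weights = sqrt_inverse_weights(counts)

# How much more a rare-class error costs than a noise error.
ratio = weights["action_select"] / weights["noise"]
for name, w in weights.items():
    print(f"{name:15s} {w:.3f}")
print(f"action_select / noise = {ratio:.1f}")
```

With these stand-in counts the rarest class is weighted roughly 48 times more than `noise`, versus roughly 2 300 times under plain inverse-frequency weighting.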
Apache 2.0 — see LICENSE.
Lucy Paureau · lmi.rest · lucy.paureau@gmail.com