dom-node-classifier


A Graph Attention Network (GATv2) that classifies every node of an HTML DOM into one of 14 semantic classes. Designed as a lightweight perception layer for browser agents and DOM annotation pipelines.


What it does

Given a structured representation of an HTML DOM (nodes + edges + metadata), the model outputs a predicted class and confidence score for each node. It replaces raw HTML as the input signal for downstream tasks like automated browser interaction, accessibility audits, or dataset labeling.
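
As a rough illustration of what "predicted class and confidence score" can mean, the sketch below decodes one node's raw scores into a label and a probability. It assumes the confidence is the softmax probability of the top class and that class indices follow the taxonomy order in this README; neither detail is confirmed here, and `decode_node` is a hypothetical helper, not part of the package.

```python
import math

# The 14 class names from the taxonomy in this README.
# Assumption: the index order used here is illustrative only.
CLASSES = [
    "action_input", "action_select", "action_button",
    "action_link_internal", "action_link_external",
    "structure_navigation", "structure_region", "structure_dismissible",
    "structure_card", "structure_list_item",
    "content_heading", "content_text", "content_media",
    "noise",
]


def decode_node(logits):
    """Turn one node's raw scores into (class_name, confidence).

    Assumption: confidence is the softmax probability of the top class.
    """
    if len(logits) != len(CLASSES):
        raise ValueError(f"expected {len(CLASSES)} scores, got {len(logits)}")
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASSES[best], probs[best]


# Hypothetical scores for a single node: one class clearly dominates.
logits = [0.0] * len(CLASSES)
logits[2] = 5.0
label, confidence = decode_node(logits)
print(label, round(confidence, 3))
```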

Why a graph model? DOM structure is inherently a tree. GATv2 lets each node attend over its neighbors (parent, children, siblings), capturing layout context that element-level classifiers miss — e.g., a <div> inside a <nav> is likely navigation, not noise.
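
To make the neighbor idea concrete, here is a minimal sketch of turning a DOM tree into the kind of edge list a graph attention layer consumes. The `build_edges` helper and the `parent_of` mapping are hypothetical; the edge format this repository actually uses is the one in examples/sample_page.json.

```python
from collections import defaultdict


def build_edges(parent_of):
    """Build directed (src, dst) edges from a node -> parent mapping.

    Adds parent->child, child->parent, and adjacent-sibling edges in both
    directions. Sibling order follows the insertion order of `parent_of`,
    which is assumed to match document order.
    """
    children = defaultdict(list)
    for node, parent in parent_of.items():
        if parent is not None:
            children[parent].append(node)

    edges = set()
    for parent, kids in children.items():
        for kid in kids:
            edges.add((parent, kid))
            edges.add((kid, parent))
        for left, right in zip(kids, kids[1:]):
            edges.add((left, right))
            edges.add((right, left))
    return sorted(edges)


# Toy tree: <nav>(0) contains <div>(1) and <a>(2); the <div> contains <span>(3).
parent_of = {0: None, 1: 0, 2: 0, 3: 1}
edges = build_edges(parent_of)

# Nodes the <div> can attend over: its parent, its sibling, and its child.
neighbors_of_div = sorted(dst for src, dst in edges if src == 1)
print(neighbors_of_div)  # [0, 2, 3]
```

With edges like these, the `<div>` sees that its parent is a `<nav>`, which is the context an element-level classifier lacks.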


14-class taxonomy

| Group     | Class                 | Meaning                          |
|-----------|-----------------------|----------------------------------|
| Action    | action_input          | Text / search / email field      |
| Action    | action_select         | Dropdown                         |
| Action    | action_button         | Button, CTA, submit              |
| Action    | action_link_internal  | Link to same-site page           |
| Action    | action_link_external  | Link to external domain          |
| Structure | structure_navigation  | Navbar, breadcrumb, pagination   |
| Structure | structure_region      | Header, footer, sidebar          |
| Structure | structure_dismissible | Modal, cookie banner, overlay    |
| Structure | structure_card        | Repeated card (product, listing) |
| Structure | structure_list_item   | List row                         |
| Content   | content_heading       | h1–h6                            |
| Content   | content_text          | Paragraph, label text            |
| Content   | content_media         | Image, video, svg                |
| Other     | noise                 | Decorative, hidden, or irrelevant |

Quickstart

git clone https://github.com/lucydjo/dom-node-classifier
cd dom-node-classifier
pip install -r requirements.txt

# Optional: for live URL classification (browser-based extraction)
pip install playwright && playwright install chromium

# Classify a live URL end-to-end (~8s on first run, ~2s after model is warm)
python examples/quickstart.py https://news.ycombinator.com --action-only --min-confidence 0.6

# Or run on the included sample page (no browser needed)
python examples/quickstart.py examples/sample_page.json

Weights are downloaded automatically from HuggingFace Hub on first run.

Minimal usage

import json

from model.inference import DOMClassifier

# Load a page dict (keys: url, viewport, nodes, edges); the bundled sample is used here
with open("examples/sample_page.json") as f:
    raw_page = json.load(f)

clf = DOMClassifier.from_checkpoint("checkpoints_final/model.safetensors")
predictions = clf.classify_page(raw_page)
# → [{"selector": "button#login", "class": "action_button", "confidence": 0.97, ...}, ...]

# Filter to interactable elements only
actionable = clf.classify_page(raw_page, action_only=True, min_confidence=0.6)

raw_page is a dict with keys url, viewport, nodes, and edges. See examples/sample_page.json for the exact format.
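
If predictions have already been computed, the same kind of filtering can be applied afterwards to the returned list. The sketch below is a stand-alone approximation that assumes `action_only` keeps the `action_*` classes and `min_confidence` is an inclusive lower bound. `filter_predictions` is a hypothetical helper and the sample predictions are made up for illustration.

```python
def filter_predictions(predictions, action_only=False, min_confidence=0.0):
    """Filter prediction dicts shaped like the classify_page output.

    Assumption: "action only" means the class name starts with "action_".
    """
    kept = []
    for pred in predictions:
        if action_only and not pred["class"].startswith("action_"):
            continue
        if pred["confidence"] < min_confidence:
            continue
        kept.append(pred)
    return kept


# Made-up predictions in the documented output shape.
predictions = [
    {"selector": "button#login", "class": "action_button", "confidence": 0.97},
    {"selector": "div.banner", "class": "structure_dismissible", "confidence": 0.81},
    {"selector": "a.more", "class": "action_link_internal", "confidence": 0.42},
]

actionable = filter_predictions(predictions, action_only=True, min_confidence=0.6)
print([p["selector"] for p in actionable])  # ['button#login']
```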


Evaluation

Evaluated on a held-out test set across 5 independent seeds (1, 7, 13, 42, 99). Mean, standard deviation, minimum, and maximum are taken over those 5 runs.

| Metric      | Mean ± std    | Min   | Max   |
|-------------|---------------|-------|-------|
| Macro F1    | 0.825 ± 0.026 | 0.797 | 0.865 |
| Weighted F1 | 0.917 ± 0.032 | 0.882 | 0.965 |

Per-class F1 — best seed (seed 7):

| Class                 | F1    | Support |
|-----------------------|-------|---------|
| action_input          | 0.704 | 25      |
| action_select         | 0.696 | 8       |
| action_button         | 0.980 | 1,577   |
| action_link_internal  | 0.999 | 3,119   |
| action_link_external  | 1.000 | 327     |
| structure_navigation  | 0.862 | 52      |
| structure_region      | 0.887 | 52      |
| structure_dismissible | 0.471 | 158     |
| structure_card        | 0.850 | 1,045   |
| structure_list_item   | 0.971 | 3,885   |
| content_heading       | 0.981 | 525     |
| content_text          | 0.797 | 322     |
| content_media         | 0.943 | 1,319   |
| noise                 | 0.972 | 18,345  |
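
The two headline metrics can be recomputed from the per-class table as a sanity check. The sketch below uses the standard definitions (macro F1 is the unweighted mean of per-class F1, weighted F1 is the support-weighted mean) on the seed 7 rows. The inputs are rounded to three decimals, so only three decimals of the result are meaningful.

```python
# (class, F1, support) rows copied from the seed 7 table above.
ROWS = [
    ("action_input", 0.704, 25),
    ("action_select", 0.696, 8),
    ("action_button", 0.980, 1577),
    ("action_link_internal", 0.999, 3119),
    ("action_link_external", 1.000, 327),
    ("structure_navigation", 0.862, 52),
    ("structure_region", 0.887, 52),
    ("structure_dismissible", 0.471, 158),
    ("structure_card", 0.850, 1045),
    ("structure_list_item", 0.971, 3885),
    ("content_heading", 0.981, 525),
    ("content_text", 0.797, 322),
    ("content_media", 0.943, 1319),
    ("noise", 0.972, 18345),
]

# Macro F1: every class counts equally, however rare it is.
macro_f1 = sum(f1 for _, f1, _ in ROWS) / len(ROWS)

# Weighted F1: each class counts in proportion to its test support.
total_support = sum(n for _, _, n in ROWS)
weighted_f1 = sum(f1 * n for _, f1, n in ROWS) / total_support

print(f"macro F1    = {macro_f1:.3f}")     # 0.865
print(f"weighted F1 = {weighted_f1:.3f}")  # 0.965
print(f"test nodes  = {total_support}")    # 30759
```

Both values match the Max column of the summary table, consistent with seed 7 being the best seed. The gap between them comes mostly from noise, which makes up about 60% of the test nodes and scores well, while the weakest classes have little support.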

Limitations

  • Thin classes. action_input (n=25) and action_select (n=8) have very low test support — F1 for these classes should be interpreted cautiously.
  • Hard class. structure_dismissible (cookie banners, modals) is the hardest class: mean F1 0.363 ± 0.073 across seeds, and 0.471 even on the best seed. Cookie consent implementations vary widely across sites.
  • No price class. Numeric price strings (e.g. "129€") are classified as noise. A content_price class is the most requested addition.
  • Heuristic labels. Training labels were generated by deterministic rules, not human annotation. Label noise exists, particularly for ambiguous elements at class boundaries.
  • Static snapshot. The model classifies a DOM snapshot. Dynamic behavior (JavaScript mutations, lazy-loaded content) is not modeled.
  • Dataset size. Trained on ~135 pages. Performance on highly specialized or unusual page layouts may be lower than benchmark numbers.

Architecture

  • Model: GATv2 (3 layers, 4 attention heads, hidden dim 128)
  • Input features: 618 dims/node — tag one-hot, class hashing (Tailwind-robust), attribute presence flags, computed CSS, bbox, topology, link semantics, MiniLM-L6-v2 text embedding
  • Parameters: 1.27M
  • Text encoder: sentence-transformers/all-MiniLM-L6-v2 (frozen at inference)
  • Training: AdamW, cosine LR schedule, early stopping, sqrt-inverse class weighting
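
For readers unfamiliar with sqrt-inverse class weighting, the sketch below shows one common form: each class weight is proportional to 1/sqrt(count). The normalization to mean 1 is an assumption, `sqrt_inverse_weights` is a hypothetical helper, and the counts are the test-set supports from the evaluation table used as stand-ins, since training counts are not listed in this README.

```python
import math


def sqrt_inverse_weights(counts):
    """Return per-class loss weights proportional to 1 / sqrt(count).

    Assumption: weights are rescaled so that their mean is 1.
    """
    raw = {name: 1.0 / math.sqrt(n) for name, n in counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {name: w / mean for name, w in raw.items()}


# Stand-in counts (test supports from the evaluation table, not training counts).
counts = {"action_select": 8, "structure_dismissible": 158, "noise": 18345}
weights = sqrt_inverse_weights(counts)

for name, w in weights.items():
    print(f"{name:24s} {w:.3f}")

# How much more a rare-class error costs than a noise error.
ratio = weights["action_select"] / weights["noise"]
print(f"action_select / noise weight ratio: {ratio:.1f}")
```

The square root softens the correction: with these counts the rarest class is weighted about 48 times more than noise, whereas plain inverse-frequency weighting would give a factor of about 2,293.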

See MODEL_CARD.md for full training details.
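
Among the input features listed above, "class hashing" is the least self-explanatory. One plausible reading is the hashing trick applied to CSS class tokens, which gives a fixed-width vector without a vocabulary of class names. The sketch below illustrates that technique only. The bucket count, the hash function, and the `hash_classes` helper are assumptions, not the repository's implementation.

```python
import hashlib

NUM_BUCKETS = 64  # hypothetical width; the real feature size is not stated here


def hash_classes(class_attr, num_buckets=NUM_BUCKETS):
    """Map a class attribute string to a fixed-size multi-hot vector.

    Each whitespace-separated token is hashed into one bucket. md5 is used
    only because it is stable across runs, unlike Python's built-in hash().
    """
    vec = [0.0] * num_buckets
    for token in class_attr.split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % num_buckets
        vec[bucket] = 1.0
    return vec


# Utility-class strings of any length map to the same fixed width,
# and token order does not matter.
a = hash_classes("flex items-center px-4")
b = hash_classes("px-4 flex items-center")
print(len(a), a == b)
```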


License

Apache 2.0 — see LICENSE.


Contact

Lucy Paureau · lmi.rest · lucy.paureau@gmail.com
