Temporal Fact Extraction from Web Page DOM Evolution for Time-Aware Retrieval-Augmented Generation
TempWebRAG extracts timestamped facts from the structural evolution of HTML DOM trees, enabling RAG systems to answer temporal queries that no existing system supports:
- "When did the price drop?"
- "Was there a sale last month?"
- "Has availability changed?"
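
A minimal sketch of the idea, not the code in `temporal.py`: each snapshot of a page is reduced to a map from positional XPath to leaf text, and every path whose text differs between two snapshots becomes a timestamped fact. The names `TemporalFact`, `leaf_texts`, and `diff_snapshots` are hypothetical, and the example assumes well-formed markup so that the standard-library XML parser can read it; real pages would need a tolerant HTML parser.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional
import xml.etree.ElementTree as ET


@dataclass(frozen=True)
class TemporalFact:
    xpath: str
    old: Optional[str]   # None if the node did not exist before
    new: Optional[str]   # None if the node was removed
    observed: date       # date of the snapshot in which the change appears


def leaf_texts(markup: str) -> Dict[str, str]:
    """Map a positional XPath to the text of every leaf element."""
    out: Dict[str, str] = {}

    def walk(node: ET.Element, path: str) -> None:
        children = list(node)
        if not children and (node.text or "").strip():
            out[path] = node.text.strip()
        counts: Dict[str, int] = {}
        for child in children:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")

    root = ET.fromstring(markup)
    walk(root, f"/{root.tag}[1]")
    return out


def diff_snapshots(old_markup: str, new_markup: str, observed: date) -> List[TemporalFact]:
    """Emit one fact per XPath whose leaf text was added, removed, or changed."""
    old, new = leaf_texts(old_markup), leaf_texts(new_markup)
    return [
        TemporalFact(xpath, old.get(xpath), new.get(xpath), observed)
        for xpath in sorted(old.keys() | new.keys())
        if old.get(xpath) != new.get(xpath)
    ]


# Two illustrative snapshots of the same product page (made-up data).
jan = "<div><h1>Widget</h1><span>$20.00</span><p>In stock</p></div>"
feb = "<div><h1>Widget</h1><span>$15.00</span><p>In stock</p></div>"

facts = diff_snapshots(jan, feb, date(2026, 2, 1))
for fact in facts:
    print(f"{fact.observed}: {fact.xpath} changed from {fact.old} to {fact.new}")
```

A query such as "When did the price drop?" can then be answered by looking up the `observed` date of the fact attached to the price node.
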
| Metric | Value | 95% CI |
|---|---|---|
| Recall | 100% | [85.2%, 100%] |
| Precision | 70.0% | [55.4%, 82.1%] |
| F1 | 82.4% | — |
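
The F1 value in the table is the harmonic mean of the reported precision and recall. A short check, with `f1_score` as an illustrative helper name:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1 = f1_score(precision=0.70, recall=1.00)
print(f"{f1:.1%}")  # 82.4%
```
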
No existing RAG system, CSS heuristic, or web scraper can answer temporal queries.
| Method | Top-1 | Top-3 | MRR | p-value |
|---|---|---|---|---|
| Text-only (baseline) | 24.2% | 71.0% | 0.506 | — |
| Text+Structure (ours) | 29.0% | 74.2% | 0.542 | 0.0036 |
| CSS heuristic | 93.5% | 96.8% | 0.929 | — |
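Structure-aware retrieval yields a statistically significant improvement over text-only retrieval (p = 0.0036), although the absolute gain is modest (Top-1 rises from 24.2% to 29.0%). The CSS heuristic outperforms both neural methods by a wide margin on well-structured sites (93.5% Top-1).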
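
For readers unfamiliar with the columns above, the following sketch shows how Top-k accuracy and MRR are conventionally computed from the rank of the first correct result per query. The ranks below are made up for illustration and are not the project's evaluation data; `top_k_accuracy` and `mean_reciprocal_rank` are hypothetical helper names rather than functions from `comprehensive_eval.py`.

```python
from typing import List, Optional


def top_k_accuracy(ranks: List[Optional[int]], k: int) -> float:
    """Fraction of queries whose first correct result is ranked k or better."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


def mean_reciprocal_rank(ranks: List[Optional[int]]) -> float:
    """Average of 1/rank; queries with no correct result contribute 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


# 1-based rank of the first correct node per query; None means it was not retrieved.
ranks = [1, 3, 2, None, 1]

top1 = top_k_accuracy(ranks, k=1)
top3 = top_k_accuracy(ranks, k=3)
mrr = mean_reciprocal_rank(ranks)
print(f"Top-1 {top1:.1%}, Top-3 {top3:.1%}, MRR {mrr:.3f}")
```
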
├── src/webtkgrag/ # Core library
│ ├── dom_parser.py # HTML → DOM Knowledge Graph
│ ├── embedding.py # Structure-aware node embeddings
│ ├── retrieval.py # Tree traversal retrieval
│ ├── temporal.py # Temporal DOM diffing + fact extraction
│ └── pipeline.py # End-to-end RAG (Bedrock/Mock LLM)
├── eval/ # Evaluation
│ ├── comprehensive_eval.py # 37-query eval with 3 baselines
│ ├── temporal_eval_v2.py # Temporal fact extraction eval
│ ├── test_reproducibility.py # 10 reproducibility tests
│ └── results.md # Complete results log (40 review iterations)
├── data/ # Test data
│ ├── ground_truth.py # 37 ground-truth queries
│ └── test_pages/ # 8 locally-saved HTML pages
├── paper/main.tex # LaTeX paper
└── docs/ # Research documentation
pip install -r requirements.txt
# Run reproducibility tests (no model loading, <1s)
PYTHONPATH=src python eval/test_reproducibility.py
# Run full evaluation (requires sentence-transformers, ~15s)
PYTHONPATH=src python eval/comprehensive_eval.py
# Run temporal evaluation
PYTHONPATH=src python eval/temporal_eval_v2.py

- Tested on practice websites only (50-200x smaller than real e-commerce)
- Temporal data is simulated (needs validation against Wayback Machine snapshots)
- No end-to-end LLM answer evaluation
- XPath matching breaks on structural DOM changes
- Single-product pages only
- No visual features (bounding box, font size)
- JavaScript-rendered pages not supported
- Hand-coded query profiles and relation inference
See paper Section 6 and eval/results.md for full discussion.
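
The XPath limitation can be illustrated with a small, hypothetical example: when a new element is inserted ahead of the price node, the node's positional path shifts, so a lookup by the old path no longer finds it. The helper `text_paths` and both snapshots are assumptions made for this sketch, which again relies on well-formed markup and the standard-library XML parser.

```python
from typing import Dict
import xml.etree.ElementTree as ET


def text_paths(markup: str) -> Dict[str, str]:
    """Map a positional path to the text of every leaf element."""
    out: Dict[str, str] = {}

    def walk(node: ET.Element, path: str) -> None:
        children = list(node)
        if not children and (node.text or "").strip():
            out[path] = node.text.strip()
        counts: Dict[str, int] = {}
        for child in children:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")

    root = ET.fromstring(markup)
    walk(root, f"/{root.tag}[1]")
    return out


# The second snapshot adds a banner <div> before the price container.
before = "<body><div><span>$20.00</span></div></body>"
after = "<body><div>Sale!</div><div><span>$15.00</span></div></body>"

before_paths = text_paths(before)
after_paths = text_paths(after)

price_path = "/body[1]/div[1]/span[1]"
print(price_path in before_paths)  # True
print(price_path in after_paths)   # False: the price now lives under div[2]
```

Matching purely on the path would report the old price as removed and a new, unrelated node as added, rather than a single price change.
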
@article{tempwebrag2026,
title={Temporal Fact Extraction from Web Page DOM Evolution
for Time-Aware Retrieval-Augmented Generation},
author={Gaurav Kumar},
year={2026}
}

License: MIT