TempWebRAG

Temporal Fact Extraction from Web Page DOM Evolution for Time-Aware Retrieval-Augmented Generation

Overview

TempWebRAG extracts timestamped facts from the structural evolution of HTML DOM trees, enabling RAG systems to answer temporal queries that no existing system supports:

"When did the price drop?"
"Was there a sale last month?"
"Has availability changed?"

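As a rough illustration of the idea (this is not the code in src/webtkgrag/temporal.py, whose API is not shown in this README), the sketch below diffs two timestamped snapshots of a page and emits one fact for each text value that changed at the same DOM path. The class and function names, the tag-path scheme, the fact tuple layout, and the sample HTML and dates are all assumptions made for this example.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are kept off the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class PathTextParser(HTMLParser):
    """Collect {path: text} for text nodes, keyed by a simple tag path.

    If several text nodes share a path, the last one wins (fine for a sketch).
    """

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, e.g. ["div", "span.price"]
        self.texts = {}   # path -> text

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        cls = dict(attrs).get("class")
        self.stack.append(f"{tag}.{cls}" if cls else tag)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts["/" + "/".join(self.stack)] = text


def extract_texts(html):
    parser = PathTextParser()
    parser.feed(html)
    parser.close()
    return parser.texts


def diff_snapshots(old, new):
    """old/new are (timestamp, html) pairs; return a list of change facts.

    Each fact is (path, old_text, new_text, old_timestamp, new_timestamp).
    """
    (t_old, html_old), (t_new, html_new) = old, new
    before, after = extract_texts(html_old), extract_texts(html_new)
    facts = []
    for path in sorted(before.keys() | after.keys()):
        a, b = before.get(path), after.get(path)
        if a != b:
            facts.append((path, a, b, t_old, t_new))
    return facts


# Illustrative snapshots only; not taken from data/test_pages/.
snap_1 = ("2026-01-01",
          '<div><span class="price">$20.00</span><p class="stock">In stock</p></div>')
snap_2 = ("2026-02-01",
          '<div><span class="price">$15.00</span><p class="stock">In stock</p></div>')
facts = diff_snapshots(snap_1, snap_2)
```

Here `facts` holds a single entry for the price node, which is the kind of record a query such as "When did the price drop?" would be answered from. The unchanged stock text produces no fact.
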
Key Results

Temporal Fact Extraction (Primary Contribution)

Metric	Value	95% CI
Recall	100%	[85.2%, 100%]
Precision	70.0%	[55.4%, 82.1%]
F1	82.4%	—

No existing RAG system, CSS heuristic, or web scraper can answer temporal queries.
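
The F1 value in the table above is the harmonic mean of the reported precision and recall. A quick check (the function name is just for this example):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1 = f1_score(precision=0.70, recall=1.00)
print(f"F1 = {f1:.1%}")  # F1 = 82.4%
```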

Static Retrieval (Secondary Contribution)

Method	Top-1	Top-3	MRR	p-value
Text-only (baseline)	24.2%	71.0%	0.506	—
Text+Structure (ours)	29.0%	74.2%	0.542	0.0036
CSS heuristic	93.5%	96.8%	0.929	—

Structure-aware retrieval provides a statistically significant improvement over the text-only baseline (p = 0.0036, i.e. p < 0.01). The CSS heuristic outperforms both neural methods on well-structured sites.
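
For readers unfamiliar with the metrics in the table, Top-k accuracy and MRR can both be computed from the 1-based rank at which the correct node appears for each query. The sketch below uses made-up ranks, not the project's evaluation data, and the function names are illustrative rather than taken from eval/comprehensive_eval.py.

```python
def top_k_accuracy(ranks, k):
    """Fraction of queries whose correct result is ranked within the top k.

    ranks: 1-based rank of the correct result per query, or None if it
    was not retrieved at all.
    """
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries; a missing result contributes 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


# Made-up ranks for five queries (illustration only).
example_ranks = [1, 3, 2, None, 1]
top1 = top_k_accuracy(example_ranks, 1)     # 2 of 5 queries
top3 = top_k_accuracy(example_ranks, 3)     # 4 of 5 queries
mrr = mean_reciprocal_rank(example_ranks)   # (1 + 1/3 + 1/2 + 0 + 1) / 5
```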

Project Structure

├── src/webtkgrag/           # Core library
│   ├── dom_parser.py        # HTML → DOM Knowledge Graph
│   ├── embedding.py         # Structure-aware node embeddings
│   ├── retrieval.py         # Tree traversal retrieval
│   ├── temporal.py          # Temporal DOM diffing + fact extraction
│   └── pipeline.py          # End-to-end RAG (Bedrock/Mock LLM)
├── eval/                    # Evaluation
│   ├── comprehensive_eval.py    # 37-query eval with 3 baselines
│   ├── temporal_eval_v2.py      # Temporal fact extraction eval
│   ├── test_reproducibility.py  # 10 reproducibility tests
│   └── results.md               # Complete results log (40 review iterations)
├── data/                    # Test data
│   ├── ground_truth.py      # 37 ground-truth queries
│   └── test_pages/          # 8 locally-saved HTML pages
├── paper/main.tex           # LaTeX paper
└── docs/                    # Research documentation

Quick Start

pip install -r requirements.txt

# Run reproducibility tests (no model loading, <1s)
PYTHONPATH=src python eval/test_reproducibility.py

# Run full evaluation (requires sentence-transformers, ~15s)
PYTHONPATH=src python eval/comprehensive_eval.py

# Run temporal evaluation
PYTHONPATH=src python eval/temporal_eval_v2.py

Limitations (Honestly Stated)

Tested on practice websites only (50-200x smaller than real e-commerce)
Temporal data is simulated (Wayback Machine validation is still needed)
No end-to-end LLM answer evaluation
XPath matching breaks on structural DOM changes
Single-product pages only
No visual features (bounding box, font size)
JavaScript-rendered pages not supported
Hand-coded query profiles and relation inference

See paper Section 6 and eval/results.md for full discussion.
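
To make the XPath limitation above concrete, here is a small sketch using the standard library's `xml.etree.ElementTree`. It is not the project's matching code; the sample markup and both path expressions are assumptions chosen for illustration. A position-based path that locates the price in one snapshot stops matching once an unrelated element is inserted above it, whereas an attribute-based path still finds the node in this particular case.

```python
import xml.etree.ElementTree as ET

old_page = ET.fromstring(
    '<body><div>Header</div>'
    '<div><span class="price">$20.00</span></div></body>'
)
# Same page later: a banner was inserted, shifting every sibling index by one.
new_page = ET.fromstring(
    '<body><div>Sale banner</div><div>Header</div>'
    '<div><span class="price">$15.00</span></div></body>'
)

positional = "./div[2]/span"           # position-based path taken from old_page
by_class = ".//span[@class='price']"   # attribute-based path (hypothetical alternative)

old_price = old_page.find(positional).text   # matches the price span
moved = new_page.find(positional)            # None: div[2] is now the header
new_price = new_page.find(by_class).text     # still matches the price span
```

The attribute-based lookup is shown only as a contrast; it has its own failure modes (for example, renamed classes) and is not something this README claims the library implements.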

Citation

@article{tempwebrag2026,
  title={Temporal Fact Extraction from Web Page DOM Evolution
         for Time-Aware Retrieval-Augmented Generation},
  author={Gaurav Kumar},
  year={2026}
}

License

MIT
