deltagraph

Delta-Lake-style versioning, incremental MERGE, and time travel for graph-native columnar formats.

Storage and Cypher query come from LadybugDB and its Icebug format — a graph-native, columnar, Parquet-backed layout with direct Cypher query, no ingestion step. deltagraph adds the layer they don't have: an incremental write path, merge-on-read upsert/delete, compaction, and versioned snapshots with time travel (via LakeFS).

The mutation layer runs entirely on Arrow node/relationship tables before any Icebug encoding happens, so it isn't tied to Icebug specifically — the same code applies to any CSR-in-Parquet graph format.

Why this exists

Graph-native columnar formats are read-oriented: fast Cypher queries directly over Parquet, but no concept of a snapshot, no way to ask "what did this graph look like last Tuesday," and no cheap way to apply a stream of upserts without re-encoding the whole graph. Meanwhile, general lakehouse versioning (Delta Lake, Iceberg, LakeFS) is graph-blind — it versions files, not node/edge identities, so it can't tell you a node's incident edges vanished when the node was deleted.

deltagraph sits in between: MERGE semantics (latest-wins upsert, tombstone delete, node-delete cascades to edges) at the identity level — node primary keys, (src, dst) edge pairs — layered on top of whatever columnar graph format and versioning backend you already have. Concretely, that buys you:

Time travel over a live graph. query(cypher, ref=<commit>) runs against the exact graph as of that commit, no separate snapshot export step.
Cheap incremental writes. Upserts land in a small delta region; you don't re-encode the whole graph on every write. compact() folds the delta into the base on your own schedule.
A change feed for downstream consumers. changes(from_ref, to_ref) tells you exactly which nodes/edges were added, removed, or updated between two versions — useful for incremental re-indexing (e.g. re-embedding only the nodes that changed for a GraphRAG pipeline) instead of a full rebuild.
Reproducible snapshots. Pin a Cypher query to a tagged commit and get the same answer every time, even after later writes and compactions — handy for audits, regression tests, or citing "the graph as it was" in a paper.

Status

Early (v0.1), core + extended features implemented against LocalBackend:

Core: create, upsert (row and bulk CSV/Parquet), delete with node-delete cascade, compaction, Cypher query with time travel, tagging. Covered by a hypothesis-based correctness property that checks any interleaving of upsert/delete/compact against an independent reference model.
Extended: a graph change feed (changes), restore, and retention (expire) are all implemented and tested.
Not yet built: graph-aware branch merge (merge_branch, the stretch milestone — needs real multi-branch backend support first), snapshots(), branch(), and the LakeFSBackend (production backend; only LocalBackend is wired up today).

Install

# users
pip install deltagraph
# or
uv add deltagraph

# contributors
uv sync
pre-commit install
docker compose up -d   # only if exercising the optional LakeFS backend

LocalBackend needs no server and no Docker — it's a plain versioned directory, and it's what the whole test suite runs against.

Quick start

from deltagraph import GraphStore
from deltagraph.versioning import LocalBackend

store = GraphStore.create(LocalBackend("./demo"), schema_cypher="""
    CREATE NODE TABLE Drug(name STRING PRIMARY KEY, dose INT64);
    CREATE NODE TABLE Condition(name STRING PRIMARY KEY);
    CREATE REL TABLE TREATS(FROM Drug TO Condition);
""")

store.upsert_nodes("Drug", [{"name": "Aspirin", "dose": 100}])
store.upsert_nodes("Condition", [{"name": "Pain"}, {"name": "Fever"}])
v1 = store.upsert_edges("TREATS", [
    {"src": "Aspirin", "dst": "Pain"},
    {"src": "Aspirin", "dst": "Fever"},
])

store.query("MATCH (d:Drug)-[:TREATS]->(c) RETURN d.name, c.name")            # head
store.query("MATCH (d:Drug)-[:TREATS]->(c) RETURN d.name, c.name", ref=v1)    # time travel

store.upsert_nodes("Drug", [{"name": "Aspirin", "dose": 250}])  # idempotent, latest-wins
store.delete_nodes("Drug", ["Aspirin"])                          # cascades to incident edges
store.compact()                                                  # folds delta into base

A full runnable walkthrough with real output is in examples/quickstart.py — see Examples below.

Change feed, restore, and retention

v1 = store.upsert_nodes("Drug", [{"name": "Aspirin", "dose": 100}])
store.tag("v1", v1)

store.upsert_nodes("Drug", [{"name": "Aspirin", "dose": 250}])   # update
store.upsert_nodes("Drug", [{"name": "Ibuprofen", "dose": 50}])  # add

store.changes("v1")["nodes"]["Drug"]
# {'added': [{'name': 'Ibuprofen', 'dose': 50}],
#  'removed': [],
#  'updated': [{'before': {'name': 'Aspirin', 'dose': 100},
#               'after':  {'name': 'Aspirin', 'dose': 250}}]}

store.restore("v1")   # new commit whose state equals v1, exactly
store.query("MATCH (d:Drug) RETURN d.name, d.dose")
# [('Aspirin', 100)]  -- Ibuprofen and the dose update are gone from head;
#                        both are still reachable at any ref before the restore

store.expire(before=some_cutoff_timestamp)  # drop changelog + old snapshots

changes() is the piece that makes incremental re-indexing practical — e.g. in a GraphRAG pipeline, re-embed only the nodes reported as added/updated since your last sync, instead of re-embedding the whole graph. expire() reclaims space by dropping commit snapshots and changelog history older than a cutoff; anything tagged before that cutoff becomes unreadable, so keep the cutoff older than any tag you still need.

API

Every mutation is one versioning-backend commit; the returned ref is the graph version you can hand back to query(ref=...) for time travel.

Method	What it does
`GraphStore.create(backend, schema_cypher)`	Define node/rel tables via Cypher DDL. One commit. Returns a `GraphStore`.
`.upsert_nodes(table, records)`	Insert-or-update node rows, keyed by the table's declared primary key.
`.upsert_edges(table, records)`	Insert-or-update edge rows; each record carries `src`/`dst` endpoint keys.
`.upsert_nodes_from(table, path, key=None)`	Bulk-load a CSV/Parquet file straight to Arrow (no Python-dict round-trip) into the delta. Idempotent — re-ingesting the same file is a no-op beyond dedup.
`.upsert_edges_from(table, path, src=, dst=, edge_key=None)`	Same, for edge files; `src`/`dst` name the file's endpoint columns.
`.delete_nodes(table, keys)`	Tombstone nodes by primary key. Incident edges disappear on the next read, even though the edge rows themselves aren't touched.
`.delete_edges(table, keys)`	Tombstone edges by `{"src":.., "dst":..}` (or `+key_prop` for edge tables that declare one).
`.compact(tables=None)`	Fold the accumulated delta into a fresh base and clear it. Read/write results are identical before and after.
`.query(cypher, ref=None)`	Run read-only Cypher against the graph as of `ref` (default: head). Returns a PyArrow `Table`. Write clauses (`CREATE`/`MERGE`/`SET`/`DELETE`/`REMOVE`) are rejected — mutate through the methods above instead.
`.tag(name, ref=None)`	Name a commit (default: head) for later reference — `query(ref="v1")`.
`.changes(from_ref, to_ref=None)`	Net node/edge diff (`added`/`removed`/`updated` per table) between two commits — the change feed. Works across a compaction boundary.
`.restore(ref)`	Write a new commit whose state equals an old `ref`. Returns the new ref.
`.expire(before)`	Drop changelog history and superseded commit snapshots older than `before`. Makes any tag older than the cutoff unreadable — keep `before` older than any tag you still need.

Identity, not surrogate IDs. Nodes upsert/delete on their declared primary key. Edges have no primary key in Ladybug, so deltagraph defines edge identity as (src, dst) by default, or (src, dst, key_prop) if the rel table declares a key property.

Versioning backends implement one shared interface (write_file/read_file/list/commit/tag/resolve_time), so the same GraphStore code runs against LocalBackend (default, no server) or LakeFSBackend (production, needs a running LakeFS server).

Extended, git-like features — a change feed, restore, and retention (changes/restore/expire above) are implemented. A graph-aware branch merge with identity-level conflict detection is designed (build brief 6.10) but not yet built — it needs real multi-branch backend support first, which doesn't exist yet either.

Examples

File	Demonstrates
`examples/quickstart.py`	create → upsert → query → re-upsert (latest-wins) → delete (cascade) → time travel → compact
`scripts/ldbc_snb_tiny_loader.py`	bulk ingestion (`upsert_nodes_from`/`upsert_edges_from`) with a synthetic LDBC-SNB-shaped Person/KNOWS graph
`scripts/bench.py`	upsert throughput, compaction cost, head vs. historical query latency, and a `compaction_threshold_ratio` sweep — `uv run python scripts/bench.py --persons 2000 --knows-per-person 10`

bench.py can also plot results to a PNG (needs uv sync --extra bench for matplotlib):

uv run --extra bench python scripts/bench.py --persons 2000 --knows-per-person 10 --chart docs/img/bench_results.png

Run it:

uv run python examples/quickstart.py

Output:

-- head: all TREATS edges (Aspirin is the only Drug so far) --
d.name  c.name
Aspirin  Pain
Aspirin  Fever

-- head after re-upsert: dose updated in place --
d.name  d.dose
Aspirin  250

-- head after delete_nodes: Aspirin and its edges are gone --
(no rows)

-- ref=0000004: pre-delete, pre-update snapshot --
d.name  d.dose  c.name
Aspirin  100  Pain
Aspirin  100  Fever

-- ref=0000004 after compact(): time travel still works --
d.name  d.dose  c.name
Aspirin  100  Pain
Aspirin  100  Fever

Note the last two blocks are identical — compaction never changes what a query returns, at head or at any historical ref.

What `LocalBackend` writes to disk

Point LocalBackend at ./demo instead of a temp dir and run the same script — each mutating call is one copy-on-commit snapshot under commits/<ref>/. Delta parts accumulate per table until compact() folds them into base/ and clears delta/:

demo/
├── refs.json                          # head pointer, tags, ordered commit log
├── staging/                           # writes not yet committed (invisible to reads)
└── commits/
    ├── 0000001/                       # create(): schema only
    │   ├── schema.cypher
    │   └── meta.json
    ├── 0000002/                       # upsert_nodes("Drug", ...)
    │   ├── delta/nodes/Drug/part-0000001.parquet
    │   ├── schema.cypher
    │   └── meta.json
    ├── 0000004/                       # upsert_edges("TREATS", ...)
    │   ├── delta/nodes/Condition/part-0000002.parquet
    │   ├── delta/nodes/Drug/part-0000001.parquet
    │   ├── delta/edges/TREATS/part-0000003.parquet
    │   └── meta.json
    ├── 0000006/                       # delete_nodes("Drug", ["Aspirin"])
    │   ├── delta/nodes/Drug/part-0000001.parquet
    │   ├── delta/nodes/Drug/part-0000004.parquet   # re-upsert part
    │   ├── delta/tombstones/nodes/Drug/part-0000005.parquet
    │   └── ...
    └── 0000007/                       # compact(): delta folded into base
        ├── base/nodes/Drug.parquet
        ├── base/nodes/Condition.parquet
        ├── base/edges/TREATS.parquet
        ├── delta/                      # empty -- cleared by compaction
        └── meta.json

Every commit is a whole-graph snapshot, not a diff: commits/0000006/ still carries every delta part from 0000002 onward, which is what makes query(ref="0000004") an exact historical reconstruction regardless of what happens later. demo/ itself is .gitignored — it's generated state, not source.

Development

uv sync                    # installs the dev group: pytest, hypothesis, ruff, mypy, pre-commit
uv run pytest              # full suite, including the correctness property
uv run pytest -q           # quiet
uv run ruff check .        # lint
uv run ruff format .       # format
uv run mypy deltagraph     # type check

FAQ

Why not just use Ladybug's native MERGE on a .lbdb database? A .lbdb file is a monolithic binary with no file-level structure — versioning it through a backend means storing an opaque blob per commit, with no dedup and no partial-snapshot reads. deltagraph operates on plain Parquet node/rel files instead, so snapshots are Parquet-granular and dedup-friendly on object storage. That granularity is the reason to implement MERGE at the file level rather than defer to the query engine.

Does this support multigraphs (more than one edge between the same node pair)? Not in v0.1. Edge identity defaults to (src, dst), so a second edge between the same pair collapses into the first on upsert. Declaring a key_prop on the rel table (see the API table's identity note) gives each edge its own identity, but multigraph support beyond that isn't built yet — known limitation, not a silent bug.

Can I write Cypher MERGE/CREATE/DELETE directly through query()? No, by design. query() runs against a throwaway materialization built fresh per call (see "What LocalBackend writes to disk" above) — a write executed there would just vanish uncommitted. Mutate through upsert_*/delete_* instead; query() rejects write clauses outright rather than silently no-opping them.

Can I add tables to an existing graph? Not yet — schema.cypher is written once by create() and there's no schema-migration method today. The build brief scopes additive node/rel tables and additive properties as in-scope for v0.1, but that migration path isn't implemented; treat the schema as fixed for now.

Attribution

Storage and Cypher query: LadybugDB / Icebug. Versioning: LakeFS. deltagraph is the incremental-update, merge, and time-travel layer between them — it does not reimplement CSR encoding, traversal, or Cypher.

License

MIT. See LICENSE.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

deltagraph

Why this exists

Status

Install

Quick start

Change feed, restore, and retention

API

Examples

What `LocalBackend` writes to disk

Development

FAQ

Attribution

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 26 Commits
.github/workflows		.github/workflows
deltagraph		deltagraph
docs/img		docs/img
examples		examples
scripts		scripts
tests		tests
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
LICENSE		LICENSE
README.md		README.md
docker-compose.yml		docker-compose.yml
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Folders and files

Latest commit

History

Repository files navigation

deltagraph

Why this exists

Status

Install

Quick start

Change feed, restore, and retention

API

Examples

What LocalBackend writes to disk

Development

FAQ

Attribution

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

What `LocalBackend` writes to disk

Packages