graphsense-lib 2.13.0
[2.13.0] 2026-05-13
Library (v2.13.0)
Added
- PySpark Delta Lake → Cassandra bulk-ingest transformation (src/graphsenselib/transformation/). New CLI graphsense-cli transformation run --env <env> --currency <c> reads raw blockchain data from Delta Lake tables and writes it to a Cassandra raw keyspace via the spark-cassandra-connector. Supports BTC/LTC/BCH/ZEC (UTXO) and ETH/TRX (account) schema types; UTXO transformation derives transaction_spending, transaction_spent_in, block_transactions, and tx_prefix lookups from base transactions, account transformation handles varint binary columns. Options include --start-block, --end-block, --create-schema, --raw-keyspace override, --delta-lake-path override, --local (Spark local mode), --debug-write-audit (per-Spark-partition row counts and PK skew), and --patch for account-chain incremental runs (rejected for UTXO because spend tables are not window-local). Two-phase locking: phase 1 pins a top-block snapshot under the delta-ingest lock to avoid tearing concurrent ingest, phase 2 holds the transformed-keyspace lock for the Spark run (ingest is not blocked once phase 1 releases). New [transformation] extra (pyspark>=3.5,<4.0), separate Dockerfile.transformation, and Java JRE baked into the main Docker image so the main entrypoint can launch Spark without a sidecar.
- One-off UTXO address clustering CLI graphsense-cli transformation cluster --env <env> --currency <c> (src/graphsenselib/transformation/clustering.py). Reads transactions via point/range queries in --chunk-size-block chunks (default 1000), feeds them to the Rust clustering engine, and streams the resulting mapping back to fresh_address_cluster / fresh_cluster_addresses in the transformed keyspace. No PySpark dependency. Options: --start-block, --end-block (auto-detected from raw keyspace if omitted), --concurrency (default 100), --write-chunk (default 100 000). Gated behind GRAPHSENSE_FRESH_CLUSTERING_ENABLED; the prior PySpark clustering path was retired in favour of this one.
- graphsense-clustering Rust crate (rust/gs_clustering/, PyO3 + maturin) shipped as abi3 PyPI wheels. Public Python surface: Clustering class with process_transactions, get_mapping, rebuild_from_mapping, get_diff. New [clustering] extra in pyproject.toml (graphsense-clustering>=0.1.0); local checkouts build the crate from source via an editable tool.uv.sources entry.
- Incremental fresh clustering inside UTXO delta update. When GRAPHSENSE_FRESH_CLUSTERING_ENABLED=true, run_fresh_clustering runs once per update range (not per batch), reads only affected clusters with dense ID remapping, uses real exchange rates and address IDs from the transformed keyspace, and writes to the fresh_* tables. CQL moved to TransformedDb; raw CQL removed from update logic. Disabled by default: runtime behaviour matches develop with the env var unset (no writes, no reads, no Rust import).
- Fresh-clustering schema and migrations. New UTXO transformed tables fresh_address_cluster and fresh_cluster_addresses; new fresh_cluster_id field on the address API endpoint. Transformed-keyspace migrations are now applied on startup (GraphsenseSchemas().apply_migrations(..., keyspace_type="transformed")); the first transformed migration transformed_utxo_0_to_1 ships in this release.
- Raw UTXO tx schema additions: sequence, version, lock_time projected from Delta to Cassandra and surfaced via the new transformation pipeline. New is_rbf_signaled BIP125 predicate in graphsenselib.utils.
- Auto-catch-up of diverged sinks before forward run (src/graphsenselib/ingest/). When a mixed --sinks delta --sinks cassandra append finds the registered sinks at different highest blocks, the runner now executes a single-sink IngestRunner over [laggard_h+1, target] for each laggard (sharing source/transformer instances and the outer lock stack) before falling through to the forward run. Regression coverage in tests/regressions/ for catch-up-vs-sync-from-start equivalence, merge-boundary chain-truth/equivalence for ETH, and replaying a trx_raw mid-chunk gap.
- --patch mode for account-chain transformation (also surfaced on ingest from-node via merge write-mode and shared _run_auto_compact helper). Lifts the empty-keyspace guard so the transformation can extend or repair an existing account / account_trx raw keyspace via PK-upsert writes; rows outside [start-block, end-block] are untouched. Account chains only: UTXO is rejected because spend-link tables are computed over the full block range loaded by Spark. New --auto-compact / --auto-compact-last-n options on ingest from-node mirror the soon-to-be-deprecated ingest delta-lake ingest flags.
- Per-resource locking across ingest, delta-update, and transformation (src/graphsenselib/utils/locking.py). Lock keys that previously mixed reader and writer identity (compound {raw_ks}_{transformed_ks}, currency-based delta_ingest_{currency}) are replaced with locks keyed on the actual mutated resource. New delta_ingest_lock_name(delta_lake_path, currency) helper makes the delta-side lock derivable from the path so transformation and ingest agree on the key without sharing config.
- ingest_complete marker write-ordering and rename. The bootstrap-marker state table introduced in 2.11.0 is now written as the last PySpark transformation write, so its presence is an atomic "this keyspace is usable" signal even if the run is aborted mid-stream. The table itself is renamed from bootstrap_marker → ingest_complete, the constant and row builder are centralised in src/graphsenselib/db/, and the configuration seed now uses the target keyspace name (not the prefix).
- Transformation startup banner logging env, currency / schema type, delta source (with bucket + endpoint for S3 paths), target keyspace (with (override) marker), Cassandra nodes, block range, pinned top block, Spark mode (local[*] vs cluster), and patch flag, printed before the Spark session opens so cluster runs are diagnosable from the driver log alone.
- Per-partition write audit (--debug-write-audit) prints per-Spark-partition row counts and partition-key skew before each Cassandra write to diagnose stragglers; adds one shuffle per write. Cassandra write metrics emitted on completion.
- Curve TokenExchange events added to swap detection (src/graphsenselib/datatypes/abi.py). Adds the four canonical event variants (StableSwap and CryptoSwap, plus their underlying variants) tagged ["curve", "swap"], so Curve pool swaps (3pool, tricrypto, …) are no longer resolved as UNKNOWN. Regression test covers a USDC→USDT 3pool swap.
- UTXO delta-update cross-version regression test suite under tests/regressions/. Ingests a BTC range, runs PySpark Delta Lake → Cassandra transformation, then runs the UTXO delta-updater with the local checkout and a reference release (default v2.12.3) into separate transformed keyspaces and diffs the result. Captures per-side wall time and works against arbitrary previous releases via the RELEASE_REF env var. Shared lib/ package, conftest fixture factories, and slimmer per-module test files factored out across the regressions tree.
- MCP (Model Context Protocol) server mounted inside the existing FastAPI app at /mcp (override via GS_MCP_PATH). LLM clients (Claude Code, Claude Desktop, Cursor, custom agents) can query graphsense directly without a separate process. Auto-attached in create_app, create_app_from_dict, and create_spec_app via _maybe_attach_mcp; silent no-op when the [mcp] extra is not installed. Transport: streamable-http, stateless_http=True by default (set GS_MCP_STATELESS_HTTP=false to opt in to stateful). Disable entirely with GS_MCP_ENABLED=false. Implementation in src/graphsenselib/mcp/.
- Curated MCP tool surface driven by a positive-list YAML at src/graphsenselib/mcp/curation/tools.yaml. Out of FastAPI's 44 routes, 17 are surfaced (18 with search_neighbors configured): 11 passthroughs (get_statistics, search, get_block, get_block_by_date, list_block_txs, list_tx_flows, get_exchange_rates, list_supported_tokens, get_actor, list_taxonomies, list_concepts), 6 hand-written consolidated tools that collapse common chains (lookup_address, lookup_cluster, lookup_tx_details, list_neighbors, list_txs_for, list_tags_by_address), and an optional external forward to the proprietary search_neighbors service. Curation drift is caught at boot and via the CI gate graphsense-cli mcp validate-curation.
- graphsense-cli mcp validate-curation: CI-friendly subcommand that validates the curation YAML against the live FastAPI app (uses the minimal spec app, no DB required) and exits non-zero on drift.
- Pathfinder deep-link instructions for MCP clients. Server-side instructions (the MCP analogue of a system prompt) are sourced from curation/instructions.md and substituted with the configured pathfinder_base_url (default https://app.iknaio.com) so LLMs can build links like {base}/pathfinder/btc/address/<addr>. Override via GS_MCP_INSTRUCTIONS / GS_MCP_INSTRUCTIONS_FILE / GS_MCP_PATHFINDER_BASE_URL.
- External request routing for the MCP fan-out wrappers. By default, consolidated tools dispatch in-process via httpx ASGITransport; set GS_MCP_INTERNAL_BASE_URL to route fan-out calls through a real HTTP client so each call traverses upstream middleware. Originating MCP request headers are forwarded on every internal call in both modes.
- New [mcp] extra in pyproject.toml (fastmcp>=3.2,<4.0, pyyaml>=6.0, transitively pulls [web]). Also added to the [all] extra.
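
For readers unfamiliar with the BIP125 predicate mentioned under the raw UTXO schema additions, the opt-in rule itself is small: a transaction signals replaceability when at least one input carries a sequence number below 0xfffffffe. The following is a minimal sketch of that rule only. The function name signals_rbf and its signature are hypothetical and are not taken from graphsenselib.utils, whose actual is_rbf_signaled interface is not described in these notes.

```python
from typing import Iterable

# BIP125: an input sequence number below 0xfffffffe (at most 0xfffffffd)
# marks the transaction as opting in to replace-by-fee.
MAX_BIP125_RBF_SEQUENCE = 0xFFFFFFFD


def signals_rbf(input_sequences: Iterable[int]) -> bool:
    """Hypothetical helper: True if any input opts in to RBF per BIP125.

    This only covers explicit signaling; inherited signaling via
    unconfirmed ancestors is out of scope for this sketch.
    """
    return any(seq <= MAX_BIP125_RBF_SEQUENCE for seq in input_sequences)
```

Because the rule only needs the per-input sequence values, projecting the sequence field into the raw keyspace is enough to evaluate it without going back to the node.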
Changed
- block table is now written last in every ingest path (src/graphsenselib/ingest/). get_highest_block() reads MAX(block_id) from the block table as the resume marker, so a mid-chunk crash (e.g. Cassandra coordinator timeout) could otherwise advance the marker past partially-written side tables. The transformer dicts previously emitted block first; sinks now write it after all dependent tables land.
- Delta auto-compact scoped to recent partitions. optimize.compact now accepts a last_n_partitions argument and forwards it as a partition_filters predicate, so weekly auto-compact only rewrites partitions that could plausibly have received writes since the last run. Older raw-data partitions are immutable and no longer touched. deltalake bumped to 1.5.1.
- graphsense-cli transformation run --s3-config NAME is now required for S3 delta paths. The transformation CLI no longer derives S3 credentials from the delta sink's s3_config field or the top-level s3_credentials fallback; users pick a named entry from s3_configs explicitly. Missing/unknown names raise an error listing the available choices.
- Spark app name renamed to graphsense-bulk-ingest-{currency}-{env} (src/graphsenselib/transformation/factory.py) so cluster dashboards group the new transformation runs separately from the Scala lineage.
- Docker image: main runtime image shrunk from 5.3 GB → 2.1 GB while gaining the Rust clustering crate and the Java JRE needed for PySpark. Regression tests now use the main Dockerfile directly; the previous separate test image was retired.
- Dependencies refreshed (uv.lock); pyproject.toml constraints bumped where appropriate. See "Dependencies" below.
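
The reasoning behind writing the block table last can be shown with a toy in-memory store. This is a sketch under stated assumptions, not the ingest code: write_order, write_chunk, and the dict-based store are hypothetical stand-ins, and only the idea that the resume marker is the highest block_id in the block table comes from the notes above.

```python
from typing import Dict, List, Optional

# Hypothetical model: table name -> {primary key -> row}.
# dict.update stands in for Cassandra's upsert semantics.
Table = Dict[int, dict]


def write_order(tables: Dict[str, Table]) -> List[str]:
    """Return table names with 'block' moved to the end."""
    side_tables = [name for name in tables if name != "block"]
    return side_tables + (["block"] if "block" in tables else [])


def get_highest_block(store: Dict[str, Table]) -> int:
    """Resume marker: highest block_id in the block table, -1 if empty."""
    return max(store.get("block", {}), default=-1)


def write_chunk(store: Dict[str, Table], tables: Dict[str, Table],
                crash_before: Optional[str] = None) -> None:
    """Write one chunk; optionally simulate a crash before a given table."""
    for name in write_order(tables):
        if name == crash_before:
            raise RuntimeError(f"simulated crash before writing {name!r}")
        store.setdefault(name, {}).update(tables[name])
```

If the simulated crash happens after the side tables but before block, the marker still points at the previous chunk, so the next run replays the chunk and the upserts overwrite the partial rows.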
Fixed
- get_latest_tx_id_before_block could restart _next_tx_id at 0 on a non-empty keyspace. When the immediately preceding block_transactions row was missing but the block table had advanced past that gap, the function returned -1 and the next allocation silently overwrote existing tx_ids. Now distinguishes a fresh keyspace from a gap in block_transactions and refuses to allocate at 0 when prior data is present.
- apply_migrations used the wrong PK column for transformed-config tables. The version-bump UPDATE was built with WHERE id = …, but transformed configuration tables key on keyspace_name (only raw configurations have id). The first transformed migration (transformed_utxo_0_to_1) blew up with AttributeError: 'Row' object has no attribute 'id'; now selects the correct PK column per keyspace type.
- Legacy ingest UDT shape and lock_time naming reconciled with the new schema fields.
- access_list.storageKeys → storage_keys in the PySpark transformation output (Cassandra column name).
- Transformation runs previously read S3 credentials from the wrong sink config; now resolved from the per-sink s3_config reference, with Spark packages aligned to the iknaio cluster defaults and all Cassandra nodes passed from config (not just the first).
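
The tx_id fix above hinges on telling "nothing has been written yet" apart from "the preceding row is missing". A minimal sketch of that distinction follows. It assumes a simplified in-memory view of the keyspace; the function name, its parameters, and TxIdGapError are hypothetical and do not reflect the real get_latest_tx_id_before_block signature.

```python
from typing import Dict, List


class TxIdGapError(RuntimeError):
    """Hypothetical error: prior data exists but the preceding row is missing."""


def latest_tx_id_before(block_id: int, highest_block: int,
                        block_transactions: Dict[int, List[int]]) -> int:
    """Return the last tx_id allocated before block_id.

    -1 is returned only for a fresh keyspace (no block written yet), so the
    caller may start allocating at 0. A missing or empty preceding row on a
    non-empty keyspace raises instead of silently restarting at 0.
    """
    if highest_block < 0:
        return -1  # fresh keyspace: nothing to overwrite
    previous = block_transactions.get(block_id - 1)
    if not previous:
        raise TxIdGapError(
            f"no block_transactions row for block {block_id - 1}, "
            f"but the block table has reached {highest_block}"
        )
    return max(previous)
```

Raising rather than returning a sentinel keeps the failure visible at the point where overwriting existing tx_ids would otherwise begin.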
Performance
- run_fresh_clustering rewritten with targeted point reads and dense ID remapping; reads only the clusters affected by the update range instead of scanning the full transformed keyspace.
- Spark transformation throughput: Arrow-optimized UDFs enabled, transaction writes repartitioned by partition key (not range), Cassandra writes tuned with parallel table writes, SinglePartition bottleneck in tx_id computation eliminated. Net: per-partition write audit shows balanced shards on production-sized BTC runs.
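
Dense ID remapping, as referenced for run_fresh_clustering, generally means assigning compact 0..n-1 indices to a sparse set of IDs so that only the affected entries need to be held in array-like structures. The sketch below illustrates the general technique; dense_remap is a hypothetical helper and the notes do not say how the library implements it.

```python
from typing import Dict, Iterable, List, Tuple


def dense_remap(sparse_ids: Iterable[int]) -> Tuple[Dict[int, int], List[int]]:
    """Map sparse IDs to dense indices in first-seen order.

    Returns (to_dense, to_sparse): to_dense[sparse_id] gives the dense index,
    and to_sparse[dense_index] recovers the original ID.
    """
    to_dense: Dict[int, int] = {}
    to_sparse: List[int] = []
    for sparse_id in sparse_ids:
        if sparse_id not in to_dense:
            to_dense[sparse_id] = len(to_sparse)
            to_sparse.append(sparse_id)
    return to_dense, to_sparse
```

With such a mapping, work over a handful of affected clusters scales with the number of touched IDs rather than with the largest ID in the keyspace.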
Web API + Python client (webapi-2.13.0)
Added
- graphsense gs CLI group for reading .gs save files (Pathfinder / Graph dashboards) without installing graphsenselib. Subcommands: txs FILE and addresses FILE emit a uniform {"network", "id"} shape that pipes directly into lookup-tx/lookup-address (via the standard --address-jq '[].id' --network-jq '[].network' selectors), enabling one-line re-hydration of every reference in a saved graph. decode FILE (optionally --raw) and summary FILE round out the group. Records are deduped by (network, id) by default; --no-dedupe retains repeats.
- graphsense.gs_files Python API: pure-stdlib decoder/encoder for .gs files, vendored from src/graphsenselib/convert/gs_files/ so the standalone graphsense-python package picks up the reader without adding graphsenselib as a runtime dependency. Public surface mirrors the source: decode_gs, structure, summarize, to_jsonable, GsBuilder, plus the typed dataclasses (PathfinderData, GraphData, …).
- Sync tooling for the vendored module. clients/python/scripts/sync_gs_files.py copies the source verbatim with a DO NOT EDIT header on each file; make -C clients/python sync-gs-files writes, make -C clients/python check-gs-files is the drift check. A repo-level pre-commit hook (sync-gs-files) runs the write step automatically when either the source dir, the vendored copy, or the sync script changes. cli.py is excluded from the sync: the client wires its own rich_click-integrated CLI in graphsense/cli/gs.py so it inherits the global -f / -o / -d / --input plumbing.
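
The default deduplication by (network, id) described for the gs subcommands can be pictured as an order-preserving filter over the emitted records. This is an illustrative sketch only: dedupe_records is a hypothetical name, and the record shape is assumed from the {"network", "id"} description above rather than from the client source.

```python
from typing import Dict, Iterable, List


def dedupe_records(records: Iterable[Dict[str, str]],
                   dedupe: bool = True) -> List[Dict[str, str]]:
    """Keep the first record per (network, id); pass dedupe=False to keep repeats."""
    if not dedupe:
        return list(records)
    seen = set()
    unique: List[Dict[str, str]] = []
    for record in records:
        key = (record["network"], record["id"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

Keying on the pair rather than on id alone matters because the same identifier string can legitimately appear on more than one network.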
Changed
- clients/python/.openapi-generator-ignore now also covers graphsense/gs_files/* and scripts/* to keep the vendored copy and sync utilities out of the generator's overwrite path.
- ext.client.lookup_address never folds the best address tag into the cluster. The convenience client now passes include_best_address_tag=False when fetching the parent cluster so the cluster summary is not contaminated by the address-level best tag of the address being looked up.
- ext.io input/output plumbing cleaned up (jq selector behaviour, error handling, and dedup logic exercised by new tests in tests/test_ext_io.py).
Dependencies
Changed
- See commit 43aa309 (update dependencies) and the follow-up bump in this release window. uv.lock regenerated.
What's Changed
Full Changelog: v2.12.5...v2.13.0