graphsense-lib 2.13.0

@github-actions released this 13 May 13:11 · 92 commits to master since this release

[2.13.0] 2026-05-13

Library (v2.13.0)

Added

  • PySpark Delta Lake → Cassandra bulk-ingest transformation (src/graphsenselib/transformation/). New CLI graphsense-cli transformation run --env <env> --currency <c> reads raw blockchain data from Delta Lake tables and writes it to a Cassandra raw keyspace via the spark-cassandra-connector. Supports BTC/LTC/BCH/ZEC (UTXO) and ETH/TRX (account) schema types; the UTXO transformation derives transaction_spending, transaction_spent_in, block_transactions, and tx_prefix lookups from base transactions, while the account transformation handles varint binary columns. Options include --start-block, --end-block, --create-schema, --raw-keyspace override, --delta-lake-path override, --local (Spark local mode), --debug-write-audit (per-Spark-partition row counts and PK skew), and --patch for account-chain incremental runs (rejected for UTXO because spend tables are not window-local). Two-phase locking: phase 1 pins a top-block snapshot under the delta-ingest lock to avoid tearing under concurrent ingest; phase 2 holds the transformed-keyspace lock for the Spark run (ingest is not blocked once phase 1 releases). New [transformation] extra (pyspark>=3.5,<4.0), separate Dockerfile.transformation, and Java JRE baked into the main Docker image so the main entrypoint can launch Spark without a sidecar.
  • One-off UTXO address clustering CLI graphsense-cli transformation cluster --env <env> --currency <c> (src/graphsenselib/transformation/clustering.py). Reads transactions via point/range queries in --chunk-size-block chunks (default 1000), feeds them to the Rust clustering engine, and streams the resulting mapping back to fresh_address_cluster / fresh_cluster_addresses in the transformed keyspace. No PySpark dependency. Options: --start-block, --end-block (auto-detected from raw keyspace if omitted), --concurrency (default 100), --write-chunk (default 100 000). Gated behind GRAPHSENSE_FRESH_CLUSTERING_ENABLED; the prior PySpark clustering path was retired in favour of this one.
  • graphsense-clustering Rust crate (rust/gs_clustering/, PyO3 + maturin) shipped as abi3 PyPI wheels. Public Python surface: Clustering class with process_transactions, get_mapping, rebuild_from_mapping, get_diff. New [clustering] extra in pyproject.toml (graphsense-clustering>=0.1.0); local checkouts build the crate from source via an editable tool.uv.sources entry.
  • Incremental fresh clustering inside UTXO delta update. When GRAPHSENSE_FRESH_CLUSTERING_ENABLED=true, run_fresh_clustering runs once per update range (not per batch), reads only affected clusters with dense ID remapping, uses real exchange rates and address IDs from the transformed keyspace, and writes to the fresh_* tables. CQL moved to TransformedDb; raw CQL removed from update logic. Disabled by default — runtime behaviour matches develop with the env var unset (no writes, no reads, no Rust import).
  • Fresh-clustering schema and migrations. New UTXO transformed tables fresh_address_cluster and fresh_cluster_addresses; new fresh_cluster_id field on the address API endpoint. Transformed-keyspace migrations are now applied on startup (GraphsenseSchemas().apply_migrations(..., keyspace_type="transformed")); the first transformed migration transformed_utxo_0_to_1 ships in this release.
  • Raw UTXO tx schema additions: sequence, version, lock_time projected from Delta to Cassandra and surfaced via the new transformation pipeline. New is_rbf_signaled BIP125 predicate in graphsenselib.utils.
  • Auto-catch-up of diverged sinks before forward run (src/graphsenselib/ingest/). When a mixed --sinks delta --sinks cassandra append finds the registered sinks at different highest blocks, the runner now executes a single-sink IngestRunner over [laggard_h+1, target] for each laggard (sharing source/transformer instances and the outer lock stack) before falling through to the forward run. Regression coverage in tests/regressions/ for catch-up-vs-sync-from-start equivalence, merge-boundary chain-truth/equivalence for ETH, and replaying a trx_raw mid-chunk gap.
  • --patch mode for account-chain transformation (also surfaced on ingest from-node via merge write-mode and shared _run_auto_compact helper). Lifts the empty-keyspace guard so the transformation can extend or repair an existing account / account_trx raw keyspace via PK-upsert writes; rows outside [start-block, end-block] are untouched. Account chains only — UTXO is rejected because spend-link tables are computed over the full block range loaded by Spark. New --auto-compact / --auto-compact-last-n options on ingest from-node mirror the soon-to-be-deprecated ingest delta-lake ingest flags.
  • Per-resource locking across ingest, delta-update, and transformation (src/graphsenselib/utils/locking.py). Lock keys that previously mixed reader and writer identity (compound {raw_ks}_{transformed_ks}, currency-based delta_ingest_{currency}) are replaced with locks keyed on the actual mutated resource. New delta_ingest_lock_name(delta_lake_path, currency) helper makes the delta-side lock derivable from the path so transformation and ingest agree on the key without sharing config.
  • ingest_complete marker write-ordering and rename. The bootstrap-marker state table introduced in 2.11.0 is now written as the last PySpark transformation write, so its presence is an atomic "this keyspace is usable" signal even if the run is aborted mid-stream. The table itself is renamed from bootstrap_marker to ingest_complete, the constant and row builder are centralised in src/graphsenselib/db/, and the configuration seed now uses the target keyspace name (not the prefix).
  • Transformation startup banner logging env, currency / schema type, delta source (with bucket + endpoint for S3 paths), target keyspace (with (override) marker), Cassandra nodes, block range, pinned top block, Spark mode (local[*] vs cluster), and patch flag — printed before the Spark session opens so cluster runs are diagnosable from the driver log alone.
  • Per-partition write audit (--debug-write-audit) prints per-Spark-partition row counts and partition-key skew before each Cassandra write to diagnose stragglers; adds one shuffle per write. Cassandra write metrics emitted on completion.
  • Curve TokenExchange events added to swap detection (src/graphsenselib/datatypes/abi.py). Adds the four canonical event variants (StableSwap and CryptoSwap, plus their underlying variants) tagged ["curve", "swap"], so Curve pool swaps (3pool, tricrypto, …) are no longer resolved as UNKNOWN. Regression test covers a USDC→USDT 3pool swap.
  • UTXO delta-update cross-version regression test suite under tests/regressions/. Ingests a BTC range, runs PySpark Delta Lake → Cassandra transformation, then runs the UTXO delta-updater with the local checkout and a reference release (default v2.12.3) into separate transformed keyspaces and diffs the result. Captures per-side wall time and works against arbitrary previous releases via the RELEASE_REF env var. Shared lib/ package, conftest fixture factories, and slimmer per-module test files factored out across the regressions tree.
  • MCP (Model Context Protocol) server mounted inside the existing FastAPI app at /mcp (override via GS_MCP_PATH). LLM clients (Claude Code, Claude Desktop, Cursor, custom agents) can query graphsense directly without a separate process. Auto-attached in create_app, create_app_from_dict, and create_spec_app via _maybe_attach_mcp; silent no-op when the [mcp] extra is not installed. Transport: streamable-http, stateless_http=True by default (set GS_MCP_STATELESS_HTTP=false to opt in to stateful). Disable entirely with GS_MCP_ENABLED=false. Implementation in src/graphsenselib/mcp/.
  • Curated MCP tool surface driven by a positive-list YAML at src/graphsenselib/mcp/curation/tools.yaml. Out of FastAPI's 44 routes, 17 are surfaced (18 with search_neighbors configured): 11 passthroughs (get_statistics, search, get_block, get_block_by_date, list_block_txs, list_tx_flows, get_exchange_rates, list_supported_tokens, get_actor, list_taxonomies, list_concepts), 6 hand-written consolidated tools that collapse common chains (lookup_address, lookup_cluster, lookup_tx_details, list_neighbors, list_txs_for, list_tags_by_address), and an optional external forward to the proprietary search_neighbors service. Curation drift is caught at boot and via the CI gate graphsense-cli mcp validate-curation.
  • graphsense-cli mcp validate-curation — CI-friendly subcommand that validates the curation YAML against the live FastAPI app (uses the minimal spec app, no DB required) and exits non-zero on drift.
  • Pathfinder deep-link instructions for MCP clients. Server-side instructions (the MCP analogue of a system prompt) are sourced from curation/instructions.md and substituted with the configured pathfinder_base_url (default https://app.iknaio.com) so LLMs can build links like {base}/pathfinder/btc/address/<addr>. Override via GS_MCP_INSTRUCTIONS / GS_MCP_INSTRUCTIONS_FILE / GS_MCP_PATHFINDER_BASE_URL.
  • External request routing for the MCP fan-out wrappers. By default, consolidated tools dispatch in-process via httpx ASGITransport; set GS_MCP_INTERNAL_BASE_URL to route fan-out calls through a real HTTP client so each call traverses upstream middleware. Originating MCP request headers are forwarded on every internal call in both modes.
  • New [mcp] extra in pyproject.toml (fastmcp>=3.2,<4.0, pyyaml>=6.0, transitively pulls [web]). Also added to the [all] extra.
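
For the BIP125 predicate listed under the raw UTXO tx schema additions above, the underlying rule can be sketched as follows. BIP125 treats a transaction as explicitly signalling replaceability when at least one input carries a sequence number below 0xfffffffe. The function name signals_rbf and its list-of-integers input are illustrative assumptions; the actual signature of is_rbf_signaled in graphsenselib.utils is not given in these notes and may differ.

```python
from typing import Iterable

# BIP125: an input sequence at or below this value opts in to replacement.
MAX_BIP125_RBF_SEQUENCE = 0xFFFFFFFD


def signals_rbf(input_sequences: Iterable[int]) -> bool:
    """Hypothetical helper: True if any input sequence signals RBF.

    Covers explicit signalling only; inherited signalling through
    unconfirmed ancestors is out of scope for this sketch.
    """
    return any(seq <= MAX_BIP125_RBF_SEQUENCE for seq in input_sequences)
```

A transaction whose inputs all use 0xffffffff or 0xfffffffe does not signal; a single input at 0xfffffffd or lower is enough to signal.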

Changed

  • block table is now written last in every ingest path (src/graphsenselib/ingest/). get_highest_block() reads MAX(block_id) from the block table as the resume marker, so a mid-chunk crash (e.g. Cassandra coordinator timeout) could otherwise advance the marker past partially-written side tables. The transformer dicts previously emitted block first; sinks now write it after all dependent tables land.
  • Delta auto-compact scoped to recent partitions. optimize.compact now accepts a last_n_partitions argument and forwards it as a partition_filters predicate, so weekly auto-compact only rewrites partitions that could plausibly have received writes since the last run. Older raw-data partitions are immutable and no longer touched. deltalake bumped to 1.5.1.
  • graphsense-cli transformation run --s3-config NAME is now required for S3 delta paths. The transformation CLI no longer derives S3 credentials from the delta sink's s3_config field or the top-level s3_credentials fallback; users pick a named entry from s3_configs explicitly. Missing/unknown names raise an error listing the available choices.
  • Spark app name renamed to graphsense-bulk-ingest-{currency}-{env} (src/graphsenselib/transformation/factory.py) so cluster dashboards group the new transformation runs separately from the Scala lineage.
  • Docker image: main runtime image shrunk from 5.3 GB → 2.1 GB while gaining the Rust clustering crate and the Java JRE needed for PySpark. Regression tests now use the main Dockerfile directly; the previous separate test image was retired.
  • Dependencies refreshed (uv.lock); pyproject.toml constraints bumped where appropriate. See "Dependencies" below.
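
The reasoning behind writing the block table last can be illustrated with a small in-memory sketch. It assumes a sink modelled as a plain dict of table name to rows; write_order, write_chunk, and highest_block are hypothetical helpers written for this illustration and are not the library's ingest code.

```python
from typing import Dict, List

# Table used as the resume marker (the notes call it the block table).
MARKER_TABLE = "block"


def write_order(tables: Dict[str, List[dict]]) -> List[str]:
    """Return table names with the marker table moved to the end."""
    dependents = [name for name in tables if name != MARKER_TABLE]
    return dependents + ([MARKER_TABLE] if MARKER_TABLE in tables else [])


def write_chunk(sink: Dict[str, List[dict]],
                tables: Dict[str, List[dict]],
                fail_on: str = "") -> None:
    """Write tables in marker-last order; raise before `fail_on` to simulate a crash."""
    for name in write_order(tables):
        if name == fail_on:
            raise TimeoutError(f"simulated coordinator timeout before {name}")
        sink.setdefault(name, []).extend(tables[name])


def highest_block(sink: Dict[str, List[dict]]) -> int:
    """Resume marker: max block_id in the marker table, or -1 if it is empty."""
    return max((row["block_id"] for row in sink.get(MARKER_TABLE, [])), default=-1)
```

If a simulated crash interrupts the chunk before a side table is written, the marker table has not been touched yet, so highest_block still reports the previous value and a resumed run rewrites the whole chunk.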

Fixed

  • get_latest_tx_id_before_block could restart _next_tx_id at 0 on a non-empty keyspace. When the immediately preceding block_transactions row was missing but the block table had advanced past that gap, the function returned -1 and the next allocation silently overwrote existing tx_ids. Now distinguishes a fresh keyspace from a gap in block_transactions and refuses to allocate at 0 when prior data is present.
  • apply_migrations used the wrong PK column for transformed-config tables. The version-bump UPDATE was built with WHERE id = …, but transformed configuration tables key on keyspace_name (only raw configurations have id). The first transformed migration (transformed_utxo_0_to_1) blew up with AttributeError: 'Row' object has no attribute 'id'; now selects the correct PK column per keyspace type.
  • Legacy ingest UDT shape and lock_time naming reconciled with the new schema fields.
  • access_list.storageKeys → storage_keys in the PySpark transformation output (Cassandra column name).
  • Transformation runs previously read S3 credentials from the wrong sink config; now resolved from the per-sink s3_config reference, with Spark packages aligned to the iknaio cluster defaults and all Cassandra nodes passed from config (not just the first).
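
The get_latest_tx_id_before_block fix above comes down to telling "no data yet" apart from "data exists but the lookup row is missing". A minimal sketch of that distinction follows, assuming a mapping that stands in for the block_transactions lookup and an optional integer that stands in for the block table's highest block. The name next_tx_id and its parameters are hypothetical and do not reflect the library's real signature.

```python
from typing import Mapping, Optional


def next_tx_id(block_id: int,
               last_tx_id_by_block: Mapping[int, int],
               highest_block: Optional[int]) -> int:
    """Return the first free tx_id for `block_id`, refusing to guess on a gap.

    last_tx_id_by_block: stand-in for the block_transactions lookup.
    highest_block: stand-in for the block table marker (None = empty keyspace).
    """
    previous = last_tx_id_by_block.get(block_id - 1)
    if previous is not None:
        return previous + 1
    if highest_block is None:
        # Fresh keyspace: starting the counter at 0 cannot overwrite anything.
        return 0
    # Prior data exists but the preceding row is missing: allocating at 0
    # would silently overwrite existing tx_ids, so fail loudly instead.
    raise RuntimeError(
        f"block_transactions has no row for block {block_id - 1}, "
        f"but the keyspace already holds blocks up to {highest_block}"
    )
```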

Performance

  • run_fresh_clustering rewritten with targeted point reads and dense ID remapping — reads only the clusters affected by the update range instead of scanning the full transformed keyspace.
  • Spark transformation throughput: Arrow-optimized UDFs enabled, transaction writes repartitioned by partition key (not range), Cassandra writes tuned with parallel table writes, SinglePartition bottleneck in tx_id computation eliminated. Net: per-partition write audit shows balanced shards on production-sized BTC runs.
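
Dense ID remapping, as mentioned for run_fresh_clustering above, generally means mapping a sparse set of identifiers onto 0..n-1 so that array-backed structures stay compact. The sketch below shows the generic technique only; dense_remap is a hypothetical name, and how the library or the Rust engine performs the remapping is not described in these notes.

```python
from typing import Dict, Iterable, List, Tuple


def dense_remap(sparse_ids: Iterable[int]) -> Tuple[Dict[int, int], List[int]]:
    """Map arbitrary IDs to 0..n-1 in first-seen order.

    Returns (forward, inverse): forward[sparse] is the dense ID and
    inverse[dense] is the original sparse ID.
    """
    forward: Dict[int, int] = {}
    inverse: List[int] = []
    for sparse in sparse_ids:
        if sparse not in forward:
            forward[sparse] = len(inverse)
            inverse.append(sparse)
    return forward, inverse
```

Results computed on dense IDs can be translated back through the inverse list before they are written out.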

Web API + Python client (webapi-2.13.0)

Added

  • graphsense gs CLI group for reading .gs save files (Pathfinder / Graph dashboards) without installing graphsenselib. Subcommands: txs FILE and addresses FILE emit a uniform {"network", "id"} shape that pipes directly into lookup-tx / lookup-address (via the standard --address-jq '[].id' --network-jq '[].network' selectors), enabling one-line re-hydration of every reference in a saved graph. decode FILE (optionally --raw) and summary FILE round out the group. Records are deduped by (network, id) by default; --no-dedupe retains repeats.
  • graphsense.gs_files Python API — pure-stdlib decoder/encoder for .gs files, vendored from src/graphsenselib/convert/gs_files/ so the standalone graphsense-python package picks up the reader without adding graphsenselib as a runtime dependency. Public surface mirrors the source: decode_gs, structure, summarize, to_jsonable, GsBuilder, plus the typed dataclasses (PathfinderData, GraphData, …).
  • Sync tooling for the vendored module. clients/python/scripts/sync_gs_files.py copies the source verbatim with a DO NOT EDIT header on each file; make -C clients/python sync-gs-files writes, make -C clients/python check-gs-files is the drift check. A repo-level pre-commit hook (sync-gs-files) runs the write step automatically when either the source dir, the vendored copy, or the sync script changes. cli.py is excluded from the sync — the client wires its own rich_click-integrated CLI in graphsense/cli/gs.py so it inherits the global -f / -o / -d / --input plumbing.
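
The default deduplication by (network, id) described for the gs subcommands above can be pictured as an order-preserving filter over the uniform {"network", "id"} records. This is a sketch under that assumption; dedupe_records is a hypothetical helper, not the client's implementation, and its dedupe flag only mimics the effect of --no-dedupe.

```python
from typing import Dict, Iterable, List


def dedupe_records(records: Iterable[Dict[str, str]],
                   dedupe: bool = True) -> List[Dict[str, str]]:
    """Keep the first record per (network, id) pair, preserving input order."""
    if not dedupe:
        return list(records)
    seen = set()
    unique: List[Dict[str, str]] = []
    for rec in records:
        key = (rec["network"], rec["id"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```

The same id on two different networks counts as two distinct records, since the key includes the network.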

Changed

  • clients/python/.openapi-generator-ignore now also covers graphsense/gs_files/* and scripts/* to keep the vendored copy and sync utilities out of the generator's overwrite path.
  • ext.client.lookup_address never folds the best address tag into the cluster. The convenience client now passes include_best_address_tag=False when fetching the parent cluster so the cluster summary is not contaminated by the address-level best tag of the address being looked up.
  • ext.io input/output plumbing cleaned up (jq selector behaviour, error handling, and dedup logic exercised by new tests in tests/test_ext_io.py).

Dependencies

Changed

  • See commit 43aa309 (update dependencies) and the follow-up bump in this release window. uv.lock regenerated.

What's Changed

Full Changelog: v2.12.5...v2.13.0