graphsense-lib 2.14.0
[2.14.0] - 2026-06-09
Library (v2.14.0)
Added
- transformation raw-to-transformed: drive the external graphsense-spark Scala job from graphsenselib. ⚠️ ALPHA: the command interface and behaviour may change, and invoking it prints an alpha warning. New CLI command that replaces the standalone bash spark-submit driver: it creates a fresh transformed keyspace, downloads the graphsense-spark release jar from a public GitHub Release asset (cached locally), and launches the job via spark-submit. Supports self-contained or slim jars, an optional Cassandra Sidecar bulk-write path, and a dry run that prints the resolved command without side effects. The command is backend-neutral, so a future native-PySpark backend can be selected without changing how it is invoked.
- transformation pubkey-update / pubkey-compact / pubkey-detect / pubkey-load: cross-chain pubkey → address lookup. ⚠️ ALPHA: the command interface and behaviour may change, and invoking these prints an alpha warning. pubkey-update reads new transactions from a currency's source Delta Lake, extracts signing pubkeys into a shared cross-chain Delta store, and writes derived addresses for any pubkey newly observed on 2+ chains to either Cassandra or a Delta table. pubkey-compact deduplicates that store between runs; for multi-chain backfills, pubkey-detect runs the cross-chain detection once over the fully-appended store; and pubkey-load loads a delta-only run's result into Cassandra, so the heavy extraction can run without production-Cassandra stress and be reviewed before the throttled write. Extraction covers the common UTXO input/output script types and the ETH/TRX account side (recovering the signing key, with a from-address self-check on ETH), and address derivation adds Bitcoin Cash CashAddr. BCH defaults its start block to the fork height so shared pre-fork BTC history isn't re-extracted into trivial cross-chain collisions. secp256k1 public-key validation now uses libsecp256k1 by default, far faster than the pure-Python path. The cross-chain materialisation persists only pubkeys it successfully derived at least one address for, so off-curve / special keys are retried on the next pass rather than silently consumed.
- Environment-variable substitution in config files. String values in config files may now reference environment variables (with optional defaults and an escape for literals), resolved at load time across all config-loading paths (CLI, REST, and the typed loader). Useful for keeping secrets like DB URLs and credentials out of committed config files. Backwards compatible: configs without placeholders are unchanged.
- Overridable Spark Maven packages. The Spark transformation session's Maven packages (previously hardcoded) can now be overridden per-package via config, so only the ones you want to change need to be specified while the defaults stand. The S3 connector is still added only when S3 credentials are present, and the existing full-replace escape hatch still wins.
- Baseline inheritance for s3_configs and spark_config. Both config sections can now define a shared baseline entry that every other named entry inherits from (its own keys win), removing per-entry duplication: for example, a common S3 endpoint/region with only the credentials differing, or a shared Spark baseline with named profiles overriding it. spark_config still accepts its legacy flat form unchanged. Fully backwards compatible with configs that omit baseline. Every transformation command that starts a Spark session also takes a --spark-profile flag to select a profile per run (defaulting to the baseline).
- Reader can merge cross-chain pubkey mappings from multiple keyspaces. The REST reader's cross-chain pubkey keyspace setting now accepts a list as well as a single keyspace. When several are configured it looks the queried address up in each, derives addresses from every key found, and merges the results, so a validated new keyspace can be served alongside the legacy one, which still holds keys the new pipeline cannot reproduce exactly (e.g. doge-sourced cross-chain keys). Keyspaces lacking the lookup table are skipped, and the feature is enabled when at least one has it. On startup the service logs the resolved set once (cross-chain pubkey lookup active on keyspaces: [...], or a disabled line), so it is visible on a running instance which keyspaces are actually used. Fully backwards compatible: a single keyspace behaves exactly as before.
- Trivial cross-chain address detection for BTC↔BCH and TRX↔ETH, independent of the pubkey table. Cross-chain address lookups now also surface the script-equivalent address on the paired chain even when the pubkey table has no entry for the queried address: BTC↔BCH via legacy/cashaddr normalisation (segwit addresses are correctly excluded, as they are not script-equivalent across the fork), and TRX↔ETH via address-format conversion in both directions. Results are deduped against pubkey-backed entries, and the API wire format is unchanged.
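As an illustration of the environment-variable substitution entry above, the following is a minimal sketch of load-time placeholder resolution. The release notes do not state the placeholder syntax, so the `${NAME}`, `${NAME:-default}` and `$${NAME}` forms used here, as well as the `substitute` function itself, are assumptions made for this example and not graphsenselib's actual implementation.

```python
import os
import re

# Hypothetical placeholder syntax, assumed for illustration only:
#   ${NAME}           -> value of NAME (error if unset)
#   ${NAME:-fallback} -> value of NAME, or "fallback" if unset
#   $${NAME}          -> the literal text ${NAME}
_PLACEHOLDER = re.compile(r"(\$?)\$\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}")


def substitute(value, env=os.environ):
    """Recursively resolve placeholders in strings nested in dicts and lists."""
    if isinstance(value, dict):
        return {key: substitute(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [substitute(item, env) for item in value]
    if not isinstance(value, str):
        return value  # numbers, booleans and None pass through untouched

    def _resolve(match):
        escaped, name, default = match.groups()
        if escaped:
            # "$${NAME}" is an escape: drop one "$" and keep the rest literally.
            return match.group(0)[1:]
        if name in env:
            return env[name]
        if default is not None:
            return default
        raise KeyError(f"environment variable {name} is not set")

    return _PLACEHOLDER.sub(_resolve, value)


if __name__ == "__main__":
    config = {"db_url": "postgresql://${DB_USER:-reader}@${DB_HOST:-localhost}/tags"}
    print(substitute(config))
```

Strings without placeholders come back unchanged, which is the backwards-compatibility property the entry describes.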
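The baseline inheritance described above for s3_configs and spark_config can be pictured as a dictionary merge in which each named entry's keys override the shared baseline. This is only a sketch: the function name `resolve_profiles`, the shallow (non-recursive) merge, and the example keys are assumptions, and the real loader may behave differently for nested values.

```python
def resolve_profiles(section, baseline_key="baseline"):
    """Merge every named entry on top of the shared baseline entry.

    Assumes a shallow merge: a key set in the entry replaces the baseline's
    value for that key. Sections without a baseline are returned as-is.
    """
    baseline = section.get(baseline_key, {})
    return {name: {**baseline, **entry} for name, entry in section.items()}


if __name__ == "__main__":
    # Hypothetical section: shared endpoint/region, per-entry credentials.
    s3_configs = {
        "baseline": {"endpoint": "https://s3.example.org", "region": "eu"},
        "archive": {"access_key": "AAA"},
        "scratch": {"access_key": "BBB", "region": "us"},
    }
    for name, entry in resolve_profiles(s3_configs).items():
        print(name, entry)
```

The baseline entry stays selectable under its own name, which matches the idea of a profile flag that defaults to the baseline.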
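The TRX↔ETH part of the trivial cross-chain detection above relies on the fact that a Tron address is the Base58Check encoding of the byte 0x41 followed by the same 20-byte account identifier that Ethereum shows as hex. The sketch below demonstrates that conversion in both directions using only the standard library. The function names are made up for this example, and the Ethereum side is emitted as lowercase hex without EIP-55 checksum casing.

```python
import hashlib

B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"
TRON_PREFIX = b"\x41"  # version byte of Tron mainnet addresses


def _checksum(payload: bytes) -> bytes:
    """First four bytes of double SHA-256, as used by Base58Check."""
    return hashlib.sha256(hashlib.sha256(payload).digest()).digest()[:4]


def b58check_encode(payload: bytes) -> str:
    raw = payload + _checksum(payload)
    number = int.from_bytes(raw, "big")
    encoded = ""
    while number:
        number, remainder = divmod(number, 58)
        encoded = B58_ALPHABET[remainder] + encoded
    leading_zeros = len(raw) - len(raw.lstrip(b"\0"))
    return "1" * leading_zeros + encoded


def b58check_decode(text: str) -> bytes:
    number = 0
    for char in text:
        number = number * 58 + B58_ALPHABET.index(char)  # ValueError on bad char
    leading_ones = len(text) - len(text.lstrip("1"))
    raw = b"\0" * leading_ones + number.to_bytes((number.bit_length() + 7) // 8, "big")
    payload, checksum = raw[:-4], raw[-4:]
    if _checksum(payload) != checksum:
        raise ValueError("bad Base58Check checksum")
    return payload


def eth_to_trx(eth_address: str) -> str:
    body = bytes.fromhex(eth_address.removeprefix("0x"))
    if len(body) != 20:
        raise ValueError("expected a 20-byte Ethereum address")
    return b58check_encode(TRON_PREFIX + body)


def trx_to_eth(trx_address: str) -> str:
    payload = b58check_decode(trx_address)
    if len(payload) != 21 or payload[:1] != TRON_PREFIX:
        raise ValueError("not a Tron mainnet address")
    return "0x" + payload[1:].hex()


if __name__ == "__main__":
    eth = "0x" + "ab" * 20  # arbitrary example bytes, not a real account
    print(eth_to_trx(eth), trx_to_eth(eth_to_trx(eth)) == eth)
```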
Changed
- transformation commands renamed for clarity. The two job commands now name their source → destination directly: run → delta-to-raw (loads the Cassandra raw keyspace from Delta Lake) and run-full-transform → raw-to-transformed (raw → transformed via the graphsense-spark job). Both are renamed outright with no aliases, so update any scripts that invoked transformation run.
- Docker image: multi-stage build, runtime shrunk from ~2.3 GB to ~1.7 GB. The Dockerfile was split into a builder stage and a fresh python:3.13-slim-bookworm runtime stage. All build-time tooling (gcc/g++/make/cmake, the Rust toolchain, curl, and libpq-dev headers) now lives only in the builder and is verified absent from the shipped image; the runtime stage COPYs just the two pre-built wheels (graphsense_lib + graphsense_clustering) out of the builder. This replaces the previous single-stage build that installed the toolchain and then tried to apt-get purge / rustup self uninstall it back out within one layer. Runtime OS deps are now scoped to exactly what runs in-container: openjdk-17-jre-headless (for PySpark; Java 17, since Java 21 dropped the DirectByteBuffer(long,int) constructor that Arrow 12 needs), libpq5 (psycopg runtime lib, not the -dev headers), and git / git-lfs / openssh-client (GitPython + tagpack repo operations). numpy's bundled OpenBLAS .so files are deliberately left un-stripped (stripping corrupts their page-aligned LOAD segments and breaks import numpy); __pycache__ dirs are dropped instead. Verified in the built image: git 2.39.5, git-lfs 3.3.0 (git lfs install applied), OpenJDK 17, gs_clustering imports, duckdb httpfs pre-installed, and a real local SparkSession (PySpark 3.5.8) starts and runs.
Fixed
- Date scalars inside a tagpack/actorpack context block no longer break parsing. When a context field is written as a YAML mapping, a bare date value (e.g. valid_from: 2022-09-23) is parsed by PyYAML into a datetime.date. Tag/Actor construction serialized that mapping to a JSON string via json.dumps without a default handler, raising "Object of type date is not JSON serializable". Both call sites (tagpack.py, actorpack.py) now pass default=str, matching the existing to_json() helpers, so dates become ISO strings.
- Taproot / P2TR (bc1p…) addresses no longer flagged "possible invalid" in tagpack validation. The native segwit validator (utils/address.py:bech32_validate, used by BTC/LTC) only accepted the bech32 checksum constant (witness v0), so witness v1+ addresses, which BIP-350 encodes with bech32m (constant 0x2BC830A3), failed the checksum and were reported as possible invalid by TagPack.verify_addresses. Validation is now BIP-173/BIP-350-aware: it decodes the witness version, requires bech32 for v0 (with a 20- or 32-byte program) and bech32m for v1–v16, and validates the witness-program length via a bech32_convertbits helper. Mixed-case and over-length (>90) inputs are also rejected per spec. Verified against the BIP-350 test vectors (Taproot and v16 now pass; v0-with-bech32m and corrupted checksums correctly fail).
- py4j DEBUG chatter silenced in verbose mode. configure_logging now pins the py4j logger to INFO, so running a PySpark job at -vvv (DEBUG) no longer floods output with a pair of "Answer received: …" / "Command to send: …" lines for every JVM↔Python call, while graphsenselib's own DEBUG logs are kept. py4j warnings/errors still surface.
- get_spent_in_txs / get_spending_txs now raise BadUserInputException instead of an unhandled ValueError for non-hex transaction hashes. When passed a non-hex string (e.g. a BTC address like bc1q… or 35usp…) as the transaction hash, both methods in db/asynchronous/cassandra.py called bytearray.fromhex(tx_hash) directly, raising an uncaught ValueError (which surfaced as an HTTP 500 in the REST layer). They now convert the hash once inside a guard that raises BadUserInputException("<hash> does not look like a valid transaction hash."), matching the existing behaviour of get_tx_by_hash.
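The date-scalar fix above can be reproduced with the standard library alone. In this sketch the context mapping is built by hand (PyYAML would produce the same datetime.date for a bare valid_from: 2022-09-23); the variable names and the extra note key are illustrative, not taken from the tagpack code.

```python
import json
from datetime import date

# What PyYAML yields for a context mapping containing a bare date scalar.
context = {"valid_from": date(2022, 9, 23), "note": "example"}

# Without a default handler, json.dumps rejects the date object.
try:
    json.dumps(context)
    error = None
except TypeError as exc:
    error = str(exc)

# With default=str, unsupported objects are passed through str(),
# which for datetime.date gives the ISO form YYYY-MM-DD.
serialized = json.dumps(context, default=str)

if __name__ == "__main__":
    print(error)
    print(serialized)
```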
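To make the Taproot validation fix above concrete, here is a compact sketch of segwit address validation following the BIP-173 and BIP-350 reference algorithms: bech32 (checksum constant 1) is required for witness v0, bech32m (constant 0x2BC830A3) for v1 to v16. It is not the code in utils/address.py; the function names and the default hrp argument are assumptions for this example.

```python
CHARSET = "qpzry9x8gf2tvdw0s3jn54khce6mua7l"
BECH32_CONST = 1
BECH32M_CONST = 0x2BC830A3


def polymod(values):
    """BCH checksum over 5-bit values, as specified in BIP-173."""
    generator = [0x3B6A57B2, 0x26508E6D, 0x1EA119FA, 0x3D4233DD, 0x2A1462B3]
    chk = 1
    for value in values:
        top = chk >> 25
        chk = ((chk & 0x1FFFFFF) << 5) ^ value
        for i in range(5):
            if (top >> i) & 1:
                chk ^= generator[i]
    return chk


def hrp_expand(hrp):
    return [ord(c) >> 5 for c in hrp] + [0] + [ord(c) & 31 for c in hrp]


def convertbits(data, frombits, tobits, pad):
    """Regroup bits; returns None on invalid input or non-zero padding."""
    acc, bits, out = 0, 0, []
    maxv = (1 << tobits) - 1
    for value in data:
        if value < 0 or value >> frombits:
            return None
        acc = (acc << frombits) | value
        bits += frombits
        while bits >= tobits:
            bits -= tobits
            out.append((acc >> bits) & maxv)
    if pad:
        if bits:
            out.append((acc << (tobits - bits)) & maxv)
    elif bits >= frombits or ((acc << (tobits - bits)) & maxv):
        return None
    return out


def is_valid_segwit(address, hrp="bc"):
    # Reject over-length and mixed-case input before decoding.
    if len(address) > 90 or (address.lower() != address and address.upper() != address):
        return False
    address = address.lower()
    pos = address.rfind("1")
    if pos < 1 or pos + 7 > len(address) or address[:pos] != hrp:
        return False
    if any(c not in CHARSET for c in address[pos + 1:]):
        return False
    data = [CHARSET.find(c) for c in address[pos + 1:]]
    const = polymod(hrp_expand(hrp) + data)
    payload = data[:-6]  # strip the six checksum symbols
    if not payload:
        return False
    version = payload[0]
    program = convertbits(payload[1:], 5, 8, False)
    if program is None or not 2 <= len(program) <= 40 or version > 16:
        return False
    if version == 0:
        return const == BECH32_CONST and len(program) in (20, 32)
    return const == BECH32M_CONST


if __name__ == "__main__":
    # Witness v0 test vector from BIP-173.
    print(is_valid_segwit("BC1QW508D6QEJXTDG4Y5R3ZARVARY0C5XW7KV8F3T4"))
```

Checking the witness version before choosing the checksum constant is the key step: a validator that only knows the bech32 constant rejects every v1+ address, which is the behaviour the fix removes.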
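The py4j logging fix above comes down to giving the py4j logger its own level, so that it stops inheriting DEBUG from the root logger. A minimal sketch with the standard logging module follows; `configure_logging_sketch` and its `level` parameter are stand-ins for illustration and do not mirror the real configure_logging signature.

```python
import logging


def configure_logging_sketch(level: int) -> None:
    """Set the root level, then pin py4j so its DEBUG records are dropped."""
    logging.basicConfig(level=level, force=True)
    # Child loggers such as "py4j.clientserver" inherit this level,
    # while warnings and errors from py4j still pass through.
    logging.getLogger("py4j").setLevel(logging.INFO)


if __name__ == "__main__":
    configure_logging_sketch(logging.DEBUG)
    logging.getLogger("graphsenselib").debug("kept: application DEBUG output")
    logging.getLogger("py4j.clientserver").debug("dropped: JVM call chatter")
    logging.getLogger("py4j.clientserver").warning("kept: py4j warning")
```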
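The transaction-hash guard described above can be sketched as a single conversion wrapped in a try/except. The exception class below is a local stand-in for graphsenselib's BadUserInputException, and `tx_hash_to_bytes` is a hypothetical helper name; only the error message format is taken from the entry.

```python
class BadUserInputException(Exception):
    """Local stand-in for the library's exception of the same name."""


def tx_hash_to_bytes(tx_hash: str) -> bytearray:
    """Convert a hex transaction hash once, mapping bad input to a user error."""
    try:
        return bytearray.fromhex(tx_hash)
    except ValueError:
        # Raised for non-hex characters and odd-length strings alike.
        raise BadUserInputException(
            f"{tx_hash} does not look like a valid transaction hash."
        ) from None


if __name__ == "__main__":
    print(len(tx_hash_to_bytes("ab" * 32)))
    try:
        tx_hash_to_bytes("not-a-hash")
    except BadUserInputException as exc:
        print(exc)
```

Raising a dedicated user-input exception lets the REST layer answer with a client error instead of the HTTP 500 that the uncaught ValueError produced.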
Web API + Python client (webapi-2.13.5)
No changes.
Full Changelog: v2.13.5...v2.14.0