graphsense-lib 2.14.0

github-actions released this 09 Jun 16:00 · 4 commits to master since this release

[2.14.0] - 2026-06-09

Library (v2.14.0)

Added

  • transformation raw-to-transformed — drive the external graphsense-spark Scala job from graphsenselib. ⚠️ ALPHA — the command interface and behaviour may change, and invoking it prints an alpha warning. New CLI command that replaces the standalone bash spark-submit driver: it creates a fresh transformed keyspace, downloads the graphsense-spark release jar from a public GitHub Release asset (cached locally), and launches the job via spark-submit. Supports self-contained or slim jars, an optional Cassandra Sidecar bulk-write path, and a dry run that prints the resolved command without side effects. The command is backend-neutral, so a future native-PySpark backend can be selected without changing how it is invoked.
  • transformation pubkey-update / pubkey-compact / pubkey-detect / pubkey-load — cross-chain pubkey → address lookup. ⚠️ ALPHA — the command interface and behaviour may change, and invoking these prints an alpha warning. pubkey-update reads new transactions from a currency's source Delta Lake, extracts signing pubkeys into a shared cross-chain Delta store, and writes derived addresses for any pubkey newly observed on 2+ chains to either Cassandra or a Delta table. pubkey-compact deduplicates that store between runs; for multi-chain backfills pubkey-detect runs the cross-chain detection once over the fully-appended store; and pubkey-load loads a delta-only run's result into Cassandra, so the heavy extraction can run without production-Cassandra stress and be reviewed before the throttled write. Extraction covers the common UTXO input/output script types and the ETH/TRX account side (recovering the signing key, with a from-address self-check on ETH), and address derivation adds Bitcoin Cash CashAddr. BCH defaults its start block to the fork height so shared pre-fork BTC history isn't re-extracted into trivial cross-chain collisions. secp256k1 public-key validation now uses libsecp256k1 by default, far faster than the pure-Python path. The cross-chain materialisation persists only pubkeys it successfully derived at least one address for, so off-curve / special keys are retried on the next pass rather than silently consumed.
  • Environment-variable substitution in config files. String values in config files may now reference environment variables (with optional defaults and an escape for literals), resolved at load time across all config-loading paths (CLI, REST, and the typed loader). Useful for keeping secrets like DB URLs and credentials out of committed config files. Backwards compatible: configs without placeholders are unchanged.
  • Overridable Spark Maven packages. The Spark transformation session's Maven packages (previously hardcoded) can now be overridden per-package via config, so only the ones you want to change need to be specified while the defaults stand. The S3 connector is still added only when S3 credentials are present, and the existing full-replace escape hatch still wins.
  • Baseline inheritance for s3_configs and spark_config. Both config sections can now define a shared baseline entry that every other named entry inherits from (its own keys win), removing per-entry duplication — e.g. a common S3 endpoint/region with only the credentials differing, or a shared Spark baseline with named profiles overriding it. spark_config still accepts its legacy flat form unchanged. Fully backwards compatible with configs that omit baseline. Every transformation command that starts a Spark session also takes a --spark-profile flag to select a profile per run (defaulting to the baseline).
  • Reader can merge cross-chain pubkey mappings from multiple keyspaces. The REST reader's cross-chain pubkey keyspace setting now accepts a list as well as a single keyspace. When several are configured it looks the queried address up in each, derives addresses from every key found, and merges the results — so a validated new keyspace can be served alongside the legacy one, which still holds keys the new pipeline cannot reproduce exactly (e.g. doge-sourced cross-chain keys). Keyspaces lacking the lookup table are skipped, and the feature enables when at least one has it. On startup the service logs the resolved set once (cross-chain pubkey lookup active on keyspaces: [...], or a disabled line), so it is visible on a running instance which keyspaces are actually used. Fully backwards compatible: a single keyspace behaves exactly as before.
  • Trivial cross-chain address detection for BTC↔BCH and TRX↔ETH, independent of the pubkey table. Cross-chain address lookups now also surface the script-equivalent address on the paired chain even when the pubkey table has no entry for the queried address: BTC↔BCH via legacy/cashaddr normalisation (segwit addresses are correctly excluded, as they are not script-equivalent across the fork), and TRX↔ETH via address-format conversion in both directions. Results are deduped against pubkey-backed entries, and the API wire format is unchanged.
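
The TRX↔ETH format conversion mentioned in the last item can be sketched as follows. This is a minimal illustration, not graphsenselib's implementation: it assumes the standard TRON address layout (Base58Check over the byte 0x41 followed by the 20-byte account hash that Ethereum displays as hex), and the function names eth_to_tron and tron_to_eth are hypothetical.

```python
import hashlib

B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"
TRON_PREFIX = b"\x41"  # version byte of TRON mainnet addresses


def _checksum(payload: bytes) -> bytes:
    """First four bytes of double SHA-256, as used by Base58Check."""
    return hashlib.sha256(hashlib.sha256(payload).digest()).digest()[:4]


def b58check_encode(payload: bytes) -> str:
    data = payload + _checksum(payload)
    num = int.from_bytes(data, "big")
    out = ""
    while num > 0:
        num, rem = divmod(num, 58)
        out = B58_ALPHABET[rem] + out
    # Each leading zero byte is written as a leading '1'.
    pad = len(data) - len(data.lstrip(b"\x00"))
    return "1" * pad + out


def b58check_decode(text: str) -> bytes:
    num = 0
    for ch in text:
        num = num * 58 + B58_ALPHABET.index(ch)  # ValueError on a bad character
    pad = len(text) - len(text.lstrip("1"))
    data = b"\x00" * pad + num.to_bytes((num.bit_length() + 7) // 8, "big")
    payload, checksum = data[:-4], data[-4:]
    if _checksum(payload) != checksum:
        raise ValueError("bad Base58Check checksum")
    return payload


def eth_to_tron(eth_address: str) -> str:
    """Hypothetical helper: 0x-prefixed hex address -> TRON Base58Check address."""
    hex_part = eth_address[2:] if eth_address.lower().startswith("0x") else eth_address
    raw = bytes.fromhex(hex_part)
    if len(raw) != 20:
        raise ValueError("expected a 20-byte address")
    return b58check_encode(TRON_PREFIX + raw)


def tron_to_eth(tron_address: str) -> str:
    """Hypothetical helper: TRON Base58Check address -> lowercase 0x hex address."""
    payload = b58check_decode(tron_address)
    if len(payload) != 21 or payload[:1] != TRON_PREFIX:
        raise ValueError("not a TRON mainnet address")
    return "0x" + payload[1:].hex()
```

Because both formats carry the same 20 bytes, the conversion is lossless in both directions; only the EIP-55 mixed-case checksum of an Ethereum address is dropped, since the sketch returns lowercase hex.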

Changed

  • transformation commands renamed for clarity. The two job commands now name their source → destination directly: run → delta-to-raw (loads the Cassandra raw keyspace from Delta Lake) and run-full-transform → raw-to-transformed (raw → transformed via the graphsense-spark job). Both are renamed outright with no aliases; update any scripts that invoked transformation run.
  • Docker image: multi-stage build, runtime shrunk from ~2.3 GB → ~1.7 GB. The Dockerfile was split into a builder stage and a fresh python:3.13-slim-bookworm runtime stage. All build-time tooling — gcc/g++/make/cmake, the Rust toolchain, curl, and libpq-dev headers — now lives only in the builder and is verified absent from the shipped image; the runtime stage COPYs just the two pre-built wheels (graphsense_lib + graphsense_clustering) out of the builder. This replaces the previous single-stage build that installed the toolchain and then tried to apt-get purge / rustup self uninstall it back out within one layer. Runtime OS deps are now scoped to exactly what runs in-container: openjdk-17-jre-headless (PySpark — Java 17, since Java 21 dropped the DirectByteBuffer(long,int) Arrow 12 needs), libpq5 (psycopg runtime lib, not the -dev headers), and git / git-lfs / openssh-client (GitPython + tagpack repo operations). numpy's bundled OpenBLAS .so files are deliberately left un-stripped (stripping corrupts their page-aligned LOAD segments and breaks import numpy); __pycache__ dirs are dropped instead. Verified in the built image: git 2.39.5, git-lfs 3.3.0 (git lfs install applied), OpenJDK 17, gs_clustering imports, duckdb httpfs pre-installed, and a real local SparkSession (PySpark 3.5.8) starts and runs.

Fixed

  • Date scalars inside a tagpack/actorpack context block no longer break parsing. When a context field is written as a YAML mapping, a bare date value (e.g. valid_from: 2022-09-23) is parsed by PyYAML into a datetime.date. Tag/Actor construction serialized that mapping to a JSON string via json.dumps without a default handler, raising Object of type date is not JSON serializable. Both call sites (tagpack.py, actorpack.py) now pass default=str, matching the existing to_json() helpers, so dates become ISO strings.
  • Taproot / P2TR (bc1p…) addresses no longer flagged "possible invalid" in tagpack validation. The native segwit validator (utils/address.py:bech32_validate, used by BTC/LTC) only accepted the bech32 checksum constant (witness v0), so witness v1+ addresses — which BIP-350 encodes with bech32m (constant 0x2BC830A3) — failed the checksum and were reported as possible invalid by TagPack.verify_addresses. Validation is now BIP-173/BIP-350-aware: it decodes the witness version, requires bech32 for v0 (with a 20- or 32-byte program) and bech32m for v1–v16, and validates the witness-program length via a bech32_convertbits helper. Mixed-case and over-length (>90) inputs are also rejected per spec. Verified against the BIP-350 test vectors (Taproot and v16 now pass; v0-with-bech32m and corrupted checksums correctly fail).
  • py4j DEBUG chatter silenced in verbose mode. configure_logging now pins the py4j logger to INFO, so running a PySpark job at -vvv (DEBUG) no longer floods output with a pair of Answer received: … / Command to send: … lines for every JVM↔Python call, while graphsenselib's own DEBUG logs are kept. py4j warnings/errors still surface.
  • get_spent_in_txs / get_spending_txs now raise BadUserInputException instead of an unhandled ValueError for non-hex transaction hashes. When passed a non-hex string (e.g. a BTC address like bc1q… or 35usp…) as the transaction hash, both methods in db/asynchronous/cassandra.py called bytearray.fromhex(tx_hash) directly, raising an uncaught ValueError (which surfaced as an HTTP 500 in the REST layer). They now convert the hash once inside a guard that raises BadUserInputException ("<hash> does not look like a valid transaction hash."), matching the existing behaviour of get_tx_by_hash.
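
The guard described in the last item can be sketched roughly as below. This is an illustration under assumptions, not the code in db/asynchronous/cassandra.py: the exception class is a local stand-in for graphsenselib's BadUserInputException, and the helper name tx_hash_to_bytes is hypothetical.

```python
class BadUserInputException(Exception):
    """Local stand-in for graphsenselib's exception of the same name."""


def tx_hash_to_bytes(tx_hash: str) -> bytearray:
    """Convert a hex transaction hash once, mapping bad input to a user error."""
    try:
        return bytearray.fromhex(tx_hash)
    except ValueError:
        # Non-hex input (e.g. an address pasted by mistake) is a client error,
        # not a server fault, so it is re-raised as a user-input exception.
        raise BadUserInputException(
            f"{tx_hash} does not look like a valid transaction hash."
        ) from None
```

Doing the conversion in one guarded place means a REST layer can map the exception to a client-error response instead of letting a ValueError surface as an HTTP 500.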

Web API + Python client (webapi-2.13.5)

No changes.

Full Changelog: v2.13.5...v2.14.0