Implement DeleteCorruptAcsSnapshotTrigger #1096

rautenrieth-da · 2025-06-11T22:33:32Z

Second attempt at making deleting corrupt snapshots faster. This time we accept that this is a very long procedure and move it to a trigger.

The new trigger must complete before update history backfilling starts, because otherwise we could backfill import updates before finishing cleaning up corrupt snapshots, and the remaining corrupt snapshots would appear as valid.

rautenrieth-da · 2025-06-12T07:48:33Z

apps/common/src/main/scala/org/lfdecentralizedtrust/splice/store/UpdateHistory.scala

  case class State(
-      historyId: Option[Long]
+      historyId: Option[Long],
+      corruptSnapshotsDeleted: Boolean,


We could also remove this flag and have ScanHistoryBackfillingTrigger independently check for corrupt snapshots, at the cost of calling migrationsWithCorruptSnapshots more often.

migrationsWithCorruptSnapshots executes 3 independent SQL queries, each of them taking <10ms to run on CILR.

Copilot

Pull Request Overview

This PR introduces a new long-running trigger to remove corrupt ACS snapshots before backfilling, adds support in the store to delete snapshots, and wires everything into the automation and tests.

Adds deleteSnapshot API to AcsSnapshotStore and integrates it into a new DeleteCorruptAcsSnapshotTrigger
Extends UpdateHistory with an in-memory flag and query for corrupt snapshots
Updates automation service and end-to-end tests to register and validate the new trigger

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
apps/scan/src/main/scala/.../AcsSnapshotStore.scala	Added `deleteSnapshot` implementation to remove corrupt data
apps/scan/src/main/scala/.../DeleteCorruptAcsSnapshotTrigger.scala	New trigger to poll for and delete corrupt snapshots
apps/common/src/main/scala/.../UpdateHistory.scala	Introduced `corruptAcsSnapshotsDeleted` flag and query
apps/common/src/main/scala/.../HistoryMetrics.scala	Added metrics for corrupt snapshot deletion
apps/scan/src/test/scala/.../AcsSnapshotStoreTest.scala	Updated tests to use the new deletion API and query
apps/app/src/test/scala/.../ScanHistoryBackfillingIntegrationTest.scala	Registered new trigger and added initial state assertions

Comments suppressed due to low confidence (1)

apps/app/src/test/scala/org/lfdecentralizedtrust/splice/integration/tests/ScanHistoryBackfillingIntegrationTest.scala:267

This integration test verifies the in-memory flag but does not check that the corrupt snapshots are actually deleted from the database. Add assertions to confirm the absence of those snapshots after the trigger runs.

actAndCheck("Run trigger that checks for corrupt snapshots once on all scans", {

apps/scan/src/main/scala/org/lfdecentralizedtrust/splice/scan/store/AcsSnapshotStore.scala

apps/common/src/main/scala/org/lfdecentralizedtrust/splice/store/UpdateHistory.scala

ray-roestenburg-da

LGTM, thanks!

OriolMunoz-da

thanks!

apps/scan/src/test/scala/org/lfdecentralizedtrust/splice/store/db/AcsSnapshotStoreTest.scala

...in/scala/org/lfdecentralizedtrust/splice/scan/automation/ScanHistoryBackfillingTrigger.scala

.../scala/org/lfdecentralizedtrust/splice/scan/automation/DeleteCorruptAcsSnapshotTrigger.scala

rautenrieth-da · 2025-08-18T14:27:05Z

docs/src/release_notes.rst

    <https://github.com/hyperledger-labs/splice/blob/0.4.12/Vagrantfile>`_ for
    details.

+- Scan app


@moritzkiefer-da @martinflorian-da Can you please check if this makes sense? I have intentionally used

deleted snapshots will be recomputed eventually

because the process for deleting snapshots and the process for backfilling missing import updates are separate, and this PR only enables the first process. The intention is to enable import update backfilling (already implemented but behind a feature flag) on CILR later today and by default before the next release. In worst case however, we could end up with a release that deletes corrupt snapshots but does not backfill the fixed ones. I would like to avoid using more feature flags, because I'm worried about breaking existing apps with some combination of flags that I did not think through.

Given that we want to enable both at once on prod, why not put them behind the same feature flag as well? You can then set that feature flag on CILR to true and when you change the default to true you can add the release notes at the same time for both.

As of the current state of this PR, all backfilling would be paused until the new trigger completes, so putting the new trigger behind a feature flag would turn off all backfilling, unless I change how backfilling depends on the new feature flag (i.e., allow regular backfilling but not import update backfilling while deleting snapshots). It would probably work, but I am bit worried there was a deeper reason why it's the way it is right now.

Nevermind, I have put the new trigger behind the same feature flag as import update backfilling.

[ci] Signed-off-by: Robert Autenrieth <robert.autenrieth@digitalasset.com>

@martinflorian-da

* Tweak fluent bit configuration (#1940) [static] Based on CILR experience - fix severity parsing - truncate long log messages because otherise stack driver gets angry - make time parsing more lenient - make fluent bit parse its own logs better Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> * Reduce multi-validator deployment parallelism to 2 (#1938) Signed-off-by: Julien Tinguely <julien.tinguely@digitalasset.com> * Bump Canton for KMS resilience fix (#1941) Fixes DACH-NY/canton-network-internal#1337 [ci] Signed-off-by: Martin Florian <martin.florian@digitalasset.com> * Refactor some form components in sv ui (#1936) - Make form errors a re-usable form component - Make EffectiveField a re-usable field component - Upgrade tanstack-form Signed-off-by: fayi-da <fayimora.femibalogun@digitalasset.com> * Docs: Clarifications around validator DR (#1937) Inspired by questions on Slack: https://daholdings.slack.com/archives/C08AP9QR7K4/p1755245551957219?thread_ts=1753278207.186399&cid=C08AP9QR7K4 [static] Signed-off-by: Martin Florian <martin.florian@digitalasset.com> * Fix tag prefix in stackdriver export (#1944) [static] Don't ask me why fluentbit has mutually incompatible defaults between different filters and outputs … Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> * query to aggregate traffic purchases over a time period (#1926) * fork part of total supply query * aggregate .amuletPaid to a separate value * use a bracketed start time as well --------- Signed-off-by: Stephen Compall <stephen.compall@digitalasset.com> * Implement DeleteCorruptAcsSnapshotTrigger (#1096) * Implement DeleteCorruptAcsSnapshotTrigger Signed-off-by: Robert Autenrieth <robert.autenrieth@digitalasset.com> * [static] increase multi validators parallelism to 5 (#1949) Signed-off-by: Julien Tinguely <julien.tinguely@digitalasset.com> * Write how-to docs for token standard usage (#1872) --------- Signed-off-by: Oriol Muñoz <oriol.munoz@digitalasset.com> * Reduce gcp logging components (#1951) I think I accidentally turned on too much when I tried to disable workloads in favor of our own fluentbit. [static] Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> * WalletSurviveCantonRestartIntegrationTest: bump wait on participant init (#1952) Fixes DACH-NY/cn-test-failures#5417 The participant did come up eventually and AFAICT the validator app would have continued init if we hadn't stopped that. [static] Signed-off-by: Martin Florian <martin.florian@digitalasset.com> * Bump cometbft mempool and cache size (#1953) fixes #1934 [ci] I honestly don't have a great reason for choosing these specific values. Doubling seems as good as anything else 🤷 See https://github.com/DACH-NY/canton-network-node/pull/17821/files for an earlier change we made in the same direction. Note that I didn't bump the TTL because I don't see a compelling reason why that helps with anything. Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> * [static] Add istio rate limits to pulumi (#1798) Signed-off-by: Nicu Reut <nicu.reut@digitalasset.com> * Implement Amulet Rules Proposal Form in new SV UI (#1945) --------- Signed-off-by: fayi-da <fayimora.femibalogun@digitalasset.com> * Fix fluentbit log truncation (#1959) [static] I should not be allowed to write lua Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> * [static] include rate of sequencer events processed in the participant dashboard (#1960) Signed-off-by: Nicu Reut <nicu.reut@digitalasset.com> * move pulumi npm packages into lfdt namespace (#1848) * don't alert a Slack channel unless explicitly set in .envrc.vars (#1913) * don't alert a Slack channel unless explicitly set in .envrc.vars The default for alerting was #team-canton-network-internal-alerts. Now that default is removed; only long-running, production and near-production clusters like dev/test/main should now alert. * also don't default SLACK_ALERT_NOTIFICATION_CHANNEL - suggested by @martinflorian-da; thanks * fail if SLACK_ALERT_NOTIFICATION_CHANNEL defined but not FULL_NAME --------- Signed-off-by: Stephen Compall <stephen.compall@digitalasset.com> * Support running static tests on gh-hosted runners (#1668) Signed-off-by: Itai Segall <itai.segall@digitalasset.com> Co-authored-by: Stephen Compall <stephen.compall@digitalasset.com> * Revert "Support running static tests on gh-hosted runners (#1668)" (#1966) This reverts commit 13bcefe. Signed-off-by: Itai Segall <itai.segall@digitalasset.com> * Make pulumi stack parallelism configurable (#1967) * Make pulumi stack parallelism configurable [static] Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> * fmt [static] Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> --------- Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> * [static] Make the cluster node pools sizes configurable (#1957) Signed-off-by: Nicu Reut <nicu.reut@digitalasset.com> * Try to fix grafana alert expansion (#1970) [static] We still get tons of spam from logger=ngalert.state.manager rule_uid=ady2ks9ehbw1sb org_id=1 t=2025-08-20T07:37:44.687289759Z level=error msg="Error in expanding template" error="failed to expand template '{{- $labels := .Labels -}}{{- $values := .Values -}}{{- $value := .Value -}}{{- if (gt $values.runs.Value 2) -}}\ncritical\n{{- else -}}\nwarning\n{{- end -}}': error executing template __alert_Busy task-based automation: template: __alert_Busy task-based automation:1:84: executing \"__alert_Busy task-based automation\" at <gt $values.runs.Value 2>: error calling gt: incompatible types for comparison" and for the other one. My current theory is: go templates seem to distinguish integers and floats. And we have one missing null check. Would be too easy if it actually told you the mismatching types … Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> * [ci] More lenient scan rate limit test (#1971) Signed-off-by: Nicu Reut <nicu.reut@digitalasset.com> * Match package name on template filter (#1955) --------- Signed-off-by: Oriol Muñoz <oriol.munoz@digitalasset.com> * Document routing of the JSON API (#1973) [static] Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> * Synchronize on scan processing lock archival (#1969) [ci] fixes DACH-NY/cn-test-failures#5415 Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> * Add config rendering helper function and enhance splice-participant helm template (#1939) Fixes #1915 Signed-off-by: timpel-fcs <tim.pelzer@finoa.io> * Remove migrate-istio (#1977) Deletes code, must be good. More seriously this was added 7 months ago so we can pretty confidently assume everything is migrated by now. [static] Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> * mention BFT success requirement in validator onboarding doc (#1979) We explain the tradeoffs already where we document how to do it, so not going to repeat all that, just linking to it. Onboarding real production nodes shouldn't do this anyway. Reifies this comment <global-synchronizer-foundation/docs#8 (reply in thread)> from @martinflorian-da. Signed-off-by: Stephen Compall <stephen.compall@digitalasset.com> * shorter output/timeout/portability in validator onboarding test scriptlets (#1982) - overall max-time for curl calls - don't try to jq 4xx responses, just fail - jq portability Adapted from this comment <global-synchronizer-foundation/docs#8 (comment)> from @stas-sbi. * grpcurl output has quotes --------- Signed-off-by: Stephen Compall <stephen.compall@digitalasset.com> * Support running static tests on gh-hosted runners (#1978) Signed-off-by: Itai Segall <itai.segall@digitalasset.com> * Make workflow ids of import updates consistent (#1981) Signed-off-by: Robert Autenrieth <robert.autenrieth@digitalasset.com> * Further clarify safe ways of bypassing the party limit (#1984) [static] Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> * Remove todo artifacts (#1986) With the new static test job, CI on forks now fails as it conflicts between that job and the main job. Rather than trying to make it conditional or rename it to avoid the conflict, this just removes the step. Noone has used this for years afaik. [static] Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> * Mention existing transfer preapproval proposal (#1987) [static] Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> * vagrant: Restart nix-daemon after mounting cache (#1985) - Makes the initial boot more predictable. - Allows recovering after deleting the cache file without re-creating the VM. To recover run `vagrant up --provision`. Signed-off-by: Stanislav German-Evtushenko <ginermail@gmail.com> * Filter pr_cluster_test for pull requests (#1988) [static] Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> * [static] Update release notes for 0.4.12 (#1989) Signed-off-by: Nicu Reut <nicu.reut@digitalasset.com> * stop triggering ciupgrade tests (#1983) Signed-off-by: Itai Segall <itai.segall@digitalasset.com> * Upgrade Canton to 3.3.0-snapshot.20250821.16057.0.v3719b9e9 (#1994) [ci] Includes the fix for the initial topology validator that is blocking sv runbook reonboarding on cilr atm. Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> * [ci] Update VERSION to 0.4.13 (#1995) Signed-off-by: Nicu Reut <nicu.reut@digitalasset.com> * run BigQuery integration test daily (#1873) * add run scheduled for 2:17am CET, allow manual run * flexible version selection * log service account email when setting up BQ test --------- Signed-off-by: Stephen Compall <stephen.compall@digitalasset.com> * Add missing CO_TransferPreapprovalSend case in UserWalletTxLogParser (#2006) Signed-off-by: Oriol Muñoz <oriol.munoz@digitalasset.com> --------- Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> Signed-off-by: Julien Tinguely <julien.tinguely@digitalasset.com> Signed-off-by: Martin Florian <martin.florian@digitalasset.com> Signed-off-by: fayi-da <fayimora.femibalogun@digitalasset.com> Signed-off-by: Stephen Compall <stephen.compall@digitalasset.com> Signed-off-by: Robert Autenrieth <robert.autenrieth@digitalasset.com> Signed-off-by: Oriol Muñoz <oriol.munoz@digitalasset.com> Signed-off-by: Nicu Reut <nicu.reut@digitalasset.com> Signed-off-by: Itai Segall <itai.segall@digitalasset.com> Signed-off-by: timpel-fcs <tim.pelzer@finoa.io> Signed-off-by: Stanislav German-Evtushenko <ginermail@gmail.com> Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> Co-authored-by: Julien Tinguely <julien.tinguely@digitalasset.com> Co-authored-by: Martin Florian <martin.florian@digitalasset.com> Co-authored-by: fayi-da <112705750+fayi-da@users.noreply.github.com> Co-authored-by: Stephen Compall <stephen.compall@digitalasset.com> Co-authored-by: Robert Autenrieth <31539813+rautenrieth-da@users.noreply.github.com> Co-authored-by: Oriol Muñoz <oriol.munoz@digitalasset.com> Co-authored-by: Nicu Reut <nicu.reut@digitalasset.com> Co-authored-by: Itai Segall <itai.segall@digitalasset.com> Co-authored-by: Tim <tim.pelzer@finoa.io> Co-authored-by: Stanislav German-Evtushenko <ginermail@gmail.com>

* Implement DeleteCorruptAcsSnapshotTrigger Signed-off-by: Robert Autenrieth <robert.autenrieth@digitalasset.com> Signed-off-by: hrischuk-da <curtis.hrischuk@digitalasset.com>

rautenrieth-da force-pushed the rautenrieth-da/delete-acs-snapshots-trigger branch 2 times, most recently from d9977df to a685d2a Compare June 12, 2025 06:50

rautenrieth-da marked this pull request as ready for review June 12, 2025 07:38

rautenrieth-da commented Jun 12, 2025

View reviewed changes

rautenrieth-da requested review from OriolMunoz-da, Copilot and ray-roestenburg-da June 12, 2025 07:49

Copilot AI reviewed Jun 12, 2025

View reviewed changes

apps/scan/src/main/scala/org/lfdecentralizedtrust/splice/scan/store/AcsSnapshotStore.scala Show resolved Hide resolved

apps/common/src/main/scala/org/lfdecentralizedtrust/splice/store/UpdateHistory.scala Show resolved Hide resolved

ray-roestenburg-da approved these changes Jun 12, 2025

View reviewed changes

OriolMunoz-da approved these changes Jun 12, 2025

View reviewed changes

rautenrieth-da force-pushed the rautenrieth-da/delete-acs-snapshots-trigger branch 4 times, most recently from 4450e31 to 737c6d4 Compare August 18, 2025 12:58

rautenrieth-da commented Aug 18, 2025

View reviewed changes

rautenrieth-da added 2 commits August 18, 2025 20:29

Implement DeleteCorruptAcsSnapshotTrigger

069ee9a

[ci] Signed-off-by: Robert Autenrieth <robert.autenrieth@digitalasset.com>

Hide behind feature flag

7b162b2

[ci] Signed-off-by: Robert Autenrieth <robert.autenrieth@digitalasset.com>

rautenrieth-da force-pushed the rautenrieth-da/delete-acs-snapshots-trigger branch from 737c6d4 to 7b162b2 Compare August 18, 2025 20:30

rautenrieth-da merged commit 28d1769 into main Aug 18, 2025
56 checks passed

rautenrieth-da deleted the rautenrieth-da/delete-acs-snapshots-trigger branch August 18, 2025 21:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Implement DeleteCorruptAcsSnapshotTrigger #1096

Implement DeleteCorruptAcsSnapshotTrigger #1096

Uh oh!

rautenrieth-da commented Jun 11, 2025

Uh oh!

rautenrieth-da Jun 12, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

ray-roestenburg-da left a comment

Uh oh!

OriolMunoz-da left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

rautenrieth-da Aug 18, 2025

Uh oh!

moritzkiefer-da Aug 18, 2025

Uh oh!

rautenrieth-da Aug 18, 2025

Uh oh!

rautenrieth-da Aug 18, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Implement DeleteCorruptAcsSnapshotTrigger #1096

Implement DeleteCorruptAcsSnapshotTrigger #1096

Uh oh!

Conversation

rautenrieth-da commented Jun 11, 2025

Uh oh!

rautenrieth-da Jun 12, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Uh oh!

Uh oh!

ray-roestenburg-da left a comment

Choose a reason for hiding this comment

Uh oh!

OriolMunoz-da left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

rautenrieth-da Aug 18, 2025

Choose a reason for hiding this comment

Uh oh!

moritzkiefer-da Aug 18, 2025

Choose a reason for hiding this comment

Uh oh!

rautenrieth-da Aug 18, 2025

Choose a reason for hiding this comment

Uh oh!

rautenrieth-da Aug 18, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants