Skip to content

Conversation

@rautenrieth-da
Copy link
Contributor

Second attempt at making deleting corrupt snapshots faster. This time we accept that this is a very long procedure and move it to a trigger.

The new trigger must complete before update history backfilling starts, because otherwise we could backfill import updates before finishing cleaning up corrupt snapshots, and the remaining corrupt snapshots would appear as valid.

@rautenrieth-da rautenrieth-da force-pushed the rautenrieth-da/delete-acs-snapshots-trigger branch 2 times, most recently from d9977df to a685d2a Compare June 12, 2025 06:50
@rautenrieth-da rautenrieth-da marked this pull request as ready for review June 12, 2025 07:38
case class State(
historyId: Option[Long]
historyId: Option[Long],
corruptSnapshotsDeleted: Boolean,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also remove this flag and have ScanHistoryBackfillingTrigger independently check for corrupt snapshots, at the cost of calling migrationsWithCorruptSnapshots more often.

migrationsWithCorruptSnapshots executes 3 independent SQL queries, each of them taking <10ms to run on CILR.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces a new long-running trigger to remove corrupt ACS snapshots before backfilling, adds support in the store to delete snapshots, and wires everything into the automation and tests.

  • Adds deleteSnapshot API to AcsSnapshotStore and integrates it into a new DeleteCorruptAcsSnapshotTrigger
  • Extends UpdateHistory with an in-memory flag and query for corrupt snapshots
  • Updates automation service and end-to-end tests to register and validate the new trigger

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
apps/scan/src/main/scala/.../AcsSnapshotStore.scala Added deleteSnapshot implementation to remove corrupt data
apps/scan/src/main/scala/.../DeleteCorruptAcsSnapshotTrigger.scala New trigger to poll for and delete corrupt snapshots
apps/common/src/main/scala/.../UpdateHistory.scala Introduced corruptAcsSnapshotsDeleted flag and query
apps/common/src/main/scala/.../HistoryMetrics.scala Added metrics for corrupt snapshot deletion
apps/scan/src/test/scala/.../AcsSnapshotStoreTest.scala Updated tests to use the new deletion API and query
apps/app/src/test/scala/.../ScanHistoryBackfillingIntegrationTest.scala Registered new trigger and added initial state assertions
Comments suppressed due to low confidence (1)

apps/app/src/test/scala/org/lfdecentralizedtrust/splice/integration/tests/ScanHistoryBackfillingIntegrationTest.scala:267

  • This integration test verifies the in-memory flag but does not check that the corrupt snapshots are actually deleted from the database. Add assertions to confirm the absence of those snapshots after the trigger runs.
actAndCheck("Run trigger that checks for corrupt snapshots once on all scans", {

Copy link
Contributor

@ray-roestenburg-da ray-roestenburg-da left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

Copy link
Contributor

@OriolMunoz-da OriolMunoz-da left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!

@rautenrieth-da rautenrieth-da force-pushed the rautenrieth-da/delete-acs-snapshots-trigger branch 4 times, most recently from 4450e31 to 737c6d4 Compare August 18, 2025 12:58
<https://github.com/hyperledger-labs/splice/blob/0.4.12/Vagrantfile>`_ for
details.

- Scan app
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@moritzkiefer-da @martinflorian-da Can you please check if this makes sense? I have intentionally used

deleted snapshots will be recomputed eventually

because the process for deleting snapshots and the process for backfilling missing import updates are separate, and this PR only enables the first process. The intention is to enable import update backfilling (already implemented but behind a feature flag) on CILR later today and by default before the next release. In worst case however, we could end up with a release that deletes corrupt snapshots but does not backfill the fixed ones. I would like to avoid using more feature flags, because I'm worried about breaking existing apps with some combination of flags that I did not think through.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we want to enable both at once on prod, why not put them behind the same feature flag as well? You can then set that feature flag on CILR to true and when you change the default to true you can add the release notes at the same time for both.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As of the current state of this PR, all backfilling would be paused until the new trigger completes, so putting the new trigger behind a feature flag would turn off all backfilling, unless I change how backfilling depends on the new feature flag (i.e., allow regular backfilling but not import update backfilling while deleting snapshots). It would probably work, but I am bit worried there was a deeper reason why it's the way it is right now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nevermind, I have put the new trigger behind the same feature flag as import update backfilling.

[ci]

Signed-off-by: Robert Autenrieth <robert.autenrieth@digitalasset.com>
[ci]

Signed-off-by: Robert Autenrieth <robert.autenrieth@digitalasset.com>
@rautenrieth-da rautenrieth-da force-pushed the rautenrieth-da/delete-acs-snapshots-trigger branch from 737c6d4 to 7b162b2 Compare August 18, 2025 20:30
@rautenrieth-da rautenrieth-da merged commit 28d1769 into main Aug 18, 2025
56 checks passed
@rautenrieth-da rautenrieth-da deleted the rautenrieth-da/delete-acs-snapshots-trigger branch August 18, 2025 21:55
moritzkiefer-da added a commit that referenced this pull request Aug 25, 2025
* Tweak fluent bit configuration (#1940)

[static]

Based on CILR experience

- fix severity parsing
- truncate long log messages because otherise stack driver gets angry
- make time parsing more lenient
- make fluent bit parse its own logs better

Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org>
Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org>

* Reduce multi-validator deployment parallelism to 2 (#1938)



Signed-off-by: Julien Tinguely <julien.tinguely@digitalasset.com>

* Bump Canton for KMS resilience fix (#1941)

Fixes DACH-NY/canton-network-internal#1337

[ci]

Signed-off-by: Martin Florian <martin.florian@digitalasset.com>

* Refactor some form components in sv ui (#1936)

- Make form errors a re-usable form component
- Make EffectiveField a re-usable field component
- Upgrade tanstack-form

Signed-off-by: fayi-da <fayimora.femibalogun@digitalasset.com>

* Docs: Clarifications around validator DR (#1937)

Inspired by questions on Slack: https://daholdings.slack.com/archives/C08AP9QR7K4/p1755245551957219?thread_ts=1753278207.186399&cid=C08AP9QR7K4

[static]

Signed-off-by: Martin Florian <martin.florian@digitalasset.com>

* Fix tag prefix in stackdriver export (#1944)

[static]

Don't ask me why fluentbit has mutually incompatible defaults between
different filters and outputs …

Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org>
Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org>

* query to aggregate traffic purchases over a time period (#1926)

* fork part of total supply query

* aggregate .amuletPaid to a separate value

* use a bracketed start time as well

---------

Signed-off-by: Stephen Compall <stephen.compall@digitalasset.com>

* Implement DeleteCorruptAcsSnapshotTrigger (#1096)

* Implement DeleteCorruptAcsSnapshotTrigger

Signed-off-by: Robert Autenrieth <robert.autenrieth@digitalasset.com>

* [static] increase multi validators parallelism to 5 (#1949)

Signed-off-by: Julien Tinguely <julien.tinguely@digitalasset.com>

* Write how-to docs for token standard usage (#1872)


---------

Signed-off-by: Oriol Muñoz <oriol.munoz@digitalasset.com>

* Reduce gcp logging components (#1951)

I think I accidentally turned on too much when I tried to disable
workloads in favor of our own fluentbit.

[static]

Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org>
Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org>

* WalletSurviveCantonRestartIntegrationTest: bump wait on participant init (#1952)

Fixes DACH-NY/cn-test-failures#5417

The participant did come up eventually and AFAICT the validator app would have continued init if we hadn't stopped that.

[static]

Signed-off-by: Martin Florian <martin.florian@digitalasset.com>

* Bump cometbft mempool and cache size (#1953)

fixes #1934

[ci]

I honestly don't have a great reason for choosing these specific
values. Doubling seems as good as anything else 🤷

See https://github.com/DACH-NY/canton-network-node/pull/17821/files
for an earlier change we made in the same direction.

Note that I didn't bump the TTL because I don't see a compelling
reason why that helps with anything.

Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org>
Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org>

* [static] Add istio rate limits to pulumi (#1798)

Signed-off-by: Nicu Reut <nicu.reut@digitalasset.com>

* Implement Amulet Rules Proposal Form in new SV UI (#1945)


---------

Signed-off-by: fayi-da <fayimora.femibalogun@digitalasset.com>

* Fix fluentbit log truncation (#1959)

[static]

I should not be allowed to write lua

Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org>
Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org>

* [static] include rate of sequencer events processed in the participant dashboard (#1960)

Signed-off-by: Nicu Reut <nicu.reut@digitalasset.com>

* move pulumi npm packages into lfdt namespace  (#1848)

* don't alert a Slack channel unless explicitly set in .envrc.vars (#1913)

* don't alert a Slack channel unless explicitly set in .envrc.vars

The default for alerting was #team-canton-network-internal-alerts. Now
that default is removed; only long-running, production and
near-production clusters like dev/test/main should now alert.

* also don't default SLACK_ALERT_NOTIFICATION_CHANNEL

- suggested by @martinflorian-da; thanks

* fail if SLACK_ALERT_NOTIFICATION_CHANNEL defined but not FULL_NAME

---------

Signed-off-by: Stephen Compall <stephen.compall@digitalasset.com>

* Support running static tests on gh-hosted runners (#1668)


Signed-off-by: Itai Segall <itai.segall@digitalasset.com>
Co-authored-by: Stephen Compall <stephen.compall@digitalasset.com>

* Revert "Support running static tests on gh-hosted runners (#1668)" (#1966)

This reverts commit 13bcefe.

Signed-off-by: Itai Segall <itai.segall@digitalasset.com>

* Make pulumi stack parallelism configurable (#1967)

* Make pulumi stack parallelism configurable

[static]

Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org>

* fmt

[static]

Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org>

---------

Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org>
Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org>

* [static] Make the cluster node pools sizes configurable (#1957)


Signed-off-by: Nicu Reut <nicu.reut@digitalasset.com>

* Try to fix grafana alert expansion (#1970)

[static]

We still get tons of spam from

logger=ngalert.state.manager rule_uid=ady2ks9ehbw1sb org_id=1
t=2025-08-20T07:37:44.687289759Z level=error msg="Error in expanding
template" error="failed to expand template '{{- $labels := .Labels
-}}{{- $values := .Values -}}{{- $value := .Value -}}{{- if (gt
$values.runs.Value 2) -}}\ncritical\n{{- else -}}\nwarning\n{{- end
-}}': error executing template __alert_Busy task-based automation:
template: __alert_Busy task-based automation:1:84: executing
\"__alert_Busy task-based automation\" at <gt $values.runs.Value 2>:
error calling gt: incompatible types for comparison"

and for the other one.

My current theory is: go templates seem to distinguish integers and
floats. And we have one missing null check. Would be too easy if it
actually told you the mismatching types …

Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org>
Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org>

* [ci] More lenient scan rate limit test (#1971)

Signed-off-by: Nicu Reut <nicu.reut@digitalasset.com>

* Match package name on template filter (#1955)


---------

Signed-off-by: Oriol Muñoz <oriol.munoz@digitalasset.com>

* Document routing of the JSON API (#1973)

[static]

Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org>
Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org>

* Synchronize on scan processing lock archival (#1969)

[ci]

fixes DACH-NY/cn-test-failures#5415

Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org>
Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org>

* Add config rendering helper function and enhance splice-participant helm template (#1939)

Fixes #1915

Signed-off-by: timpel-fcs <tim.pelzer@finoa.io>

* Remove migrate-istio (#1977)

Deletes code, must be good.

More seriously this was added 7 months ago so we can pretty
confidently assume everything is migrated by now.

[static]

Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org>
Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org>

* mention BFT success requirement in validator onboarding doc (#1979)

We explain the tradeoffs already where we document how to do it, so not
going to repeat all that, just linking to it. Onboarding real production
nodes shouldn't do this anyway.

Reifies this comment
<global-synchronizer-foundation/docs#8 (reply in thread)>
from @martinflorian-da.

Signed-off-by: Stephen Compall <stephen.compall@digitalasset.com>

* shorter output/timeout/portability in validator onboarding test scriptlets (#1982)

- overall max-time for curl calls
- don't try to jq 4xx responses, just fail
- jq portability

Adapted from this comment
<global-synchronizer-foundation/docs#8 (comment)>
from @stas-sbi.

* grpcurl output has quotes

---------

Signed-off-by: Stephen Compall <stephen.compall@digitalasset.com>

* Support running static tests on gh-hosted runners (#1978)


Signed-off-by: Itai Segall <itai.segall@digitalasset.com>

* Make workflow ids of import updates consistent (#1981)


Signed-off-by: Robert Autenrieth <robert.autenrieth@digitalasset.com>

* Further clarify safe ways of bypassing the party limit (#1984)

[static]

Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org>
Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org>

* Remove todo artifacts (#1986)

With the new static test job, CI on forks now fails as it conflicts
between that job and the main job.

Rather than trying to make it conditional or rename it to avoid the conflict, this just
removes the step. Noone has used this for years afaik.

[static]

Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org>
Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org>

* Mention existing transfer preapproval proposal (#1987)

[static]

Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org>
Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org>

* vagrant: Restart nix-daemon after mounting cache (#1985)

- Makes the initial boot more predictable.
- Allows recovering after deleting the cache file without re-creating
  the VM. To recover run `vagrant up --provision`.

Signed-off-by: Stanislav German-Evtushenko <ginermail@gmail.com>

* Filter pr_cluster_test for pull requests (#1988)

[static]

Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org>
Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org>

* [static] Update release notes for 0.4.12 (#1989)

Signed-off-by: Nicu Reut <nicu.reut@digitalasset.com>

* stop triggering ciupgrade tests (#1983)

Signed-off-by: Itai Segall <itai.segall@digitalasset.com>

* Upgrade Canton to 3.3.0-snapshot.20250821.16057.0.v3719b9e9 (#1994)

[ci]

Includes the fix for the initial topology validator that is blocking
sv runbook reonboarding on cilr atm.

Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org>
Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org>

* [ci] Update VERSION to 0.4.13 (#1995)

Signed-off-by: Nicu Reut <nicu.reut@digitalasset.com>

* run BigQuery integration test daily (#1873)

* add run scheduled for 2:17am CET, allow manual run

* flexible version selection

* log service account email when setting up BQ test

---------

Signed-off-by: Stephen Compall <stephen.compall@digitalasset.com>

* Add missing CO_TransferPreapprovalSend case in UserWalletTxLogParser (#2006)



Signed-off-by: Oriol Muñoz <oriol.munoz@digitalasset.com>

---------

Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org>
Signed-off-by: Julien Tinguely <julien.tinguely@digitalasset.com>
Signed-off-by: Martin Florian <martin.florian@digitalasset.com>
Signed-off-by: fayi-da <fayimora.femibalogun@digitalasset.com>
Signed-off-by: Stephen Compall <stephen.compall@digitalasset.com>
Signed-off-by: Robert Autenrieth <robert.autenrieth@digitalasset.com>
Signed-off-by: Oriol Muñoz <oriol.munoz@digitalasset.com>
Signed-off-by: Nicu Reut <nicu.reut@digitalasset.com>
Signed-off-by: Itai Segall <itai.segall@digitalasset.com>
Signed-off-by: timpel-fcs <tim.pelzer@finoa.io>
Signed-off-by: Stanislav German-Evtushenko <ginermail@gmail.com>
Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org>
Co-authored-by: Julien Tinguely <julien.tinguely@digitalasset.com>
Co-authored-by: Martin Florian <martin.florian@digitalasset.com>
Co-authored-by: fayi-da <112705750+fayi-da@users.noreply.github.com>
Co-authored-by: Stephen Compall <stephen.compall@digitalasset.com>
Co-authored-by: Robert Autenrieth <31539813+rautenrieth-da@users.noreply.github.com>
Co-authored-by: Oriol Muñoz <oriol.munoz@digitalasset.com>
Co-authored-by: Nicu Reut <nicu.reut@digitalasset.com>
Co-authored-by: Itai Segall <itai.segall@digitalasset.com>
Co-authored-by: Tim <tim.pelzer@finoa.io>
Co-authored-by: Stanislav German-Evtushenko <ginermail@gmail.com>
hrischuk-da pushed a commit to hrischuk-da/splice that referenced this pull request Aug 29, 2025
* Implement DeleteCorruptAcsSnapshotTrigger

Signed-off-by: Robert Autenrieth <robert.autenrieth@digitalasset.com>
Signed-off-by: hrischuk-da <curtis.hrischuk@digitalasset.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants