-
Notifications
You must be signed in to change notification settings - Fork 44
Implement DeleteCorruptAcsSnapshotTrigger #1096
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement DeleteCorruptAcsSnapshotTrigger #1096
Conversation
d9977df to
a685d2a
Compare
| case class State( | ||
| historyId: Option[Long] | ||
| historyId: Option[Long], | ||
| corruptSnapshotsDeleted: Boolean, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also remove this flag and have ScanHistoryBackfillingTrigger independently check for corrupt snapshots, at the cost of calling migrationsWithCorruptSnapshots more often.
migrationsWithCorruptSnapshots executes 3 independent SQL queries, each of them taking <10ms to run on CILR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces a new long-running trigger to remove corrupt ACS snapshots before backfilling, adds support in the store to delete snapshots, and wires everything into the automation and tests.
- Adds
deleteSnapshotAPI toAcsSnapshotStoreand integrates it into a newDeleteCorruptAcsSnapshotTrigger - Extends
UpdateHistorywith an in-memory flag and query for corrupt snapshots - Updates automation service and end-to-end tests to register and validate the new trigger
Reviewed Changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| apps/scan/src/main/scala/.../AcsSnapshotStore.scala | Added deleteSnapshot implementation to remove corrupt data |
| apps/scan/src/main/scala/.../DeleteCorruptAcsSnapshotTrigger.scala | New trigger to poll for and delete corrupt snapshots |
| apps/common/src/main/scala/.../UpdateHistory.scala | Introduced corruptAcsSnapshotsDeleted flag and query |
| apps/common/src/main/scala/.../HistoryMetrics.scala | Added metrics for corrupt snapshot deletion |
| apps/scan/src/test/scala/.../AcsSnapshotStoreTest.scala | Updated tests to use the new deletion API and query |
| apps/app/src/test/scala/.../ScanHistoryBackfillingIntegrationTest.scala | Registered new trigger and added initial state assertions |
Comments suppressed due to low confidence (1)
apps/app/src/test/scala/org/lfdecentralizedtrust/splice/integration/tests/ScanHistoryBackfillingIntegrationTest.scala:267
- This integration test verifies the in-memory flag but does not check that the corrupt snapshots are actually deleted from the database. Add assertions to confirm the absence of those snapshots after the trigger runs.
actAndCheck("Run trigger that checks for corrupt snapshots once on all scans", {
apps/scan/src/main/scala/org/lfdecentralizedtrust/splice/scan/store/AcsSnapshotStore.scala
Show resolved
Hide resolved
apps/common/src/main/scala/org/lfdecentralizedtrust/splice/store/UpdateHistory.scala
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks!
apps/scan/src/test/scala/org/lfdecentralizedtrust/splice/store/db/AcsSnapshotStoreTest.scala
Show resolved
Hide resolved
...in/scala/org/lfdecentralizedtrust/splice/scan/automation/ScanHistoryBackfillingTrigger.scala
Outdated
Show resolved
Hide resolved
.../scala/org/lfdecentralizedtrust/splice/scan/automation/DeleteCorruptAcsSnapshotTrigger.scala
Show resolved
Hide resolved
.../scala/org/lfdecentralizedtrust/splice/scan/automation/DeleteCorruptAcsSnapshotTrigger.scala
Show resolved
Hide resolved
4450e31 to
737c6d4
Compare
docs/src/release_notes.rst
Outdated
| <https://github.com/hyperledger-labs/splice/blob/0.4.12/Vagrantfile>`_ for | ||
| details. | ||
|
|
||
| - Scan app |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@moritzkiefer-da @martinflorian-da Can you please check if this makes sense? I have intentionally used
deleted snapshots will be recomputed eventually
because the process for deleting snapshots and the process for backfilling missing import updates are separate, and this PR only enables the first process. The intention is to enable import update backfilling (already implemented but behind a feature flag) on CILR later today and by default before the next release. In worst case however, we could end up with a release that deletes corrupt snapshots but does not backfill the fixed ones. I would like to avoid using more feature flags, because I'm worried about breaking existing apps with some combination of flags that I did not think through.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that we want to enable both at once on prod, why not put them behind the same feature flag as well? You can then set that feature flag on CILR to true and when you change the default to true you can add the release notes at the same time for both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As of the current state of this PR, all backfilling would be paused until the new trigger completes, so putting the new trigger behind a feature flag would turn off all backfilling, unless I change how backfilling depends on the new feature flag (i.e., allow regular backfilling but not import update backfilling while deleting snapshots). It would probably work, but I am bit worried there was a deeper reason why it's the way it is right now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nevermind, I have put the new trigger behind the same feature flag as import update backfilling.
[ci] Signed-off-by: Robert Autenrieth <robert.autenrieth@digitalasset.com>
[ci] Signed-off-by: Robert Autenrieth <robert.autenrieth@digitalasset.com>
737c6d4 to
7b162b2
Compare
* Tweak fluent bit configuration (#1940) [static] Based on CILR experience - fix severity parsing - truncate long log messages because otherise stack driver gets angry - make time parsing more lenient - make fluent bit parse its own logs better Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> * Reduce multi-validator deployment parallelism to 2 (#1938) Signed-off-by: Julien Tinguely <julien.tinguely@digitalasset.com> * Bump Canton for KMS resilience fix (#1941) Fixes DACH-NY/canton-network-internal#1337 [ci] Signed-off-by: Martin Florian <martin.florian@digitalasset.com> * Refactor some form components in sv ui (#1936) - Make form errors a re-usable form component - Make EffectiveField a re-usable field component - Upgrade tanstack-form Signed-off-by: fayi-da <fayimora.femibalogun@digitalasset.com> * Docs: Clarifications around validator DR (#1937) Inspired by questions on Slack: https://daholdings.slack.com/archives/C08AP9QR7K4/p1755245551957219?thread_ts=1753278207.186399&cid=C08AP9QR7K4 [static] Signed-off-by: Martin Florian <martin.florian@digitalasset.com> * Fix tag prefix in stackdriver export (#1944) [static] Don't ask me why fluentbit has mutually incompatible defaults between different filters and outputs … Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> * query to aggregate traffic purchases over a time period (#1926) * fork part of total supply query * aggregate .amuletPaid to a separate value * use a bracketed start time as well --------- Signed-off-by: Stephen Compall <stephen.compall@digitalasset.com> * Implement DeleteCorruptAcsSnapshotTrigger (#1096) * Implement DeleteCorruptAcsSnapshotTrigger Signed-off-by: Robert Autenrieth <robert.autenrieth@digitalasset.com> * [static] increase multi validators parallelism to 5 (#1949) Signed-off-by: Julien Tinguely <julien.tinguely@digitalasset.com> * Write how-to docs for token standard usage (#1872) --------- Signed-off-by: Oriol Muñoz <oriol.munoz@digitalasset.com> * Reduce gcp logging components (#1951) I think I accidentally turned on too much when I tried to disable workloads in favor of our own fluentbit. [static] Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> * WalletSurviveCantonRestartIntegrationTest: bump wait on participant init (#1952) Fixes DACH-NY/cn-test-failures#5417 The participant did come up eventually and AFAICT the validator app would have continued init if we hadn't stopped that. [static] Signed-off-by: Martin Florian <martin.florian@digitalasset.com> * Bump cometbft mempool and cache size (#1953) fixes #1934 [ci] I honestly don't have a great reason for choosing these specific values. Doubling seems as good as anything else 🤷 See https://github.com/DACH-NY/canton-network-node/pull/17821/files for an earlier change we made in the same direction. Note that I didn't bump the TTL because I don't see a compelling reason why that helps with anything. Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> * [static] Add istio rate limits to pulumi (#1798) Signed-off-by: Nicu Reut <nicu.reut@digitalasset.com> * Implement Amulet Rules Proposal Form in new SV UI (#1945) --------- Signed-off-by: fayi-da <fayimora.femibalogun@digitalasset.com> * Fix fluentbit log truncation (#1959) [static] I should not be allowed to write lua Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> * [static] include rate of sequencer events processed in the participant dashboard (#1960) Signed-off-by: Nicu Reut <nicu.reut@digitalasset.com> * move pulumi npm packages into lfdt namespace (#1848) * don't alert a Slack channel unless explicitly set in .envrc.vars (#1913) * don't alert a Slack channel unless explicitly set in .envrc.vars The default for alerting was #team-canton-network-internal-alerts. Now that default is removed; only long-running, production and near-production clusters like dev/test/main should now alert. * also don't default SLACK_ALERT_NOTIFICATION_CHANNEL - suggested by @martinflorian-da; thanks * fail if SLACK_ALERT_NOTIFICATION_CHANNEL defined but not FULL_NAME --------- Signed-off-by: Stephen Compall <stephen.compall@digitalasset.com> * Support running static tests on gh-hosted runners (#1668) Signed-off-by: Itai Segall <itai.segall@digitalasset.com> Co-authored-by: Stephen Compall <stephen.compall@digitalasset.com> * Revert "Support running static tests on gh-hosted runners (#1668)" (#1966) This reverts commit 13bcefe. Signed-off-by: Itai Segall <itai.segall@digitalasset.com> * Make pulumi stack parallelism configurable (#1967) * Make pulumi stack parallelism configurable [static] Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> * fmt [static] Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> --------- Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> * [static] Make the cluster node pools sizes configurable (#1957) Signed-off-by: Nicu Reut <nicu.reut@digitalasset.com> * Try to fix grafana alert expansion (#1970) [static] We still get tons of spam from logger=ngalert.state.manager rule_uid=ady2ks9ehbw1sb org_id=1 t=2025-08-20T07:37:44.687289759Z level=error msg="Error in expanding template" error="failed to expand template '{{- $labels := .Labels -}}{{- $values := .Values -}}{{- $value := .Value -}}{{- if (gt $values.runs.Value 2) -}}\ncritical\n{{- else -}}\nwarning\n{{- end -}}': error executing template __alert_Busy task-based automation: template: __alert_Busy task-based automation:1:84: executing \"__alert_Busy task-based automation\" at <gt $values.runs.Value 2>: error calling gt: incompatible types for comparison" and for the other one. My current theory is: go templates seem to distinguish integers and floats. And we have one missing null check. Would be too easy if it actually told you the mismatching types … Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> * [ci] More lenient scan rate limit test (#1971) Signed-off-by: Nicu Reut <nicu.reut@digitalasset.com> * Match package name on template filter (#1955) --------- Signed-off-by: Oriol Muñoz <oriol.munoz@digitalasset.com> * Document routing of the JSON API (#1973) [static] Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> * Synchronize on scan processing lock archival (#1969) [ci] fixes DACH-NY/cn-test-failures#5415 Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> * Add config rendering helper function and enhance splice-participant helm template (#1939) Fixes #1915 Signed-off-by: timpel-fcs <tim.pelzer@finoa.io> * Remove migrate-istio (#1977) Deletes code, must be good. More seriously this was added 7 months ago so we can pretty confidently assume everything is migrated by now. [static] Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> * mention BFT success requirement in validator onboarding doc (#1979) We explain the tradeoffs already where we document how to do it, so not going to repeat all that, just linking to it. Onboarding real production nodes shouldn't do this anyway. Reifies this comment <global-synchronizer-foundation/docs#8 (reply in thread)> from @martinflorian-da. Signed-off-by: Stephen Compall <stephen.compall@digitalasset.com> * shorter output/timeout/portability in validator onboarding test scriptlets (#1982) - overall max-time for curl calls - don't try to jq 4xx responses, just fail - jq portability Adapted from this comment <global-synchronizer-foundation/docs#8 (comment)> from @stas-sbi. * grpcurl output has quotes --------- Signed-off-by: Stephen Compall <stephen.compall@digitalasset.com> * Support running static tests on gh-hosted runners (#1978) Signed-off-by: Itai Segall <itai.segall@digitalasset.com> * Make workflow ids of import updates consistent (#1981) Signed-off-by: Robert Autenrieth <robert.autenrieth@digitalasset.com> * Further clarify safe ways of bypassing the party limit (#1984) [static] Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> * Remove todo artifacts (#1986) With the new static test job, CI on forks now fails as it conflicts between that job and the main job. Rather than trying to make it conditional or rename it to avoid the conflict, this just removes the step. Noone has used this for years afaik. [static] Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> * Mention existing transfer preapproval proposal (#1987) [static] Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> * vagrant: Restart nix-daemon after mounting cache (#1985) - Makes the initial boot more predictable. - Allows recovering after deleting the cache file without re-creating the VM. To recover run `vagrant up --provision`. Signed-off-by: Stanislav German-Evtushenko <ginermail@gmail.com> * Filter pr_cluster_test for pull requests (#1988) [static] Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> * [static] Update release notes for 0.4.12 (#1989) Signed-off-by: Nicu Reut <nicu.reut@digitalasset.com> * stop triggering ciupgrade tests (#1983) Signed-off-by: Itai Segall <itai.segall@digitalasset.com> * Upgrade Canton to 3.3.0-snapshot.20250821.16057.0.v3719b9e9 (#1994) [ci] Includes the fix for the initial topology validator that is blocking sv runbook reonboarding on cilr atm. Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> * [ci] Update VERSION to 0.4.13 (#1995) Signed-off-by: Nicu Reut <nicu.reut@digitalasset.com> * run BigQuery integration test daily (#1873) * add run scheduled for 2:17am CET, allow manual run * flexible version selection * log service account email when setting up BQ test --------- Signed-off-by: Stephen Compall <stephen.compall@digitalasset.com> * Add missing CO_TransferPreapprovalSend case in UserWalletTxLogParser (#2006) Signed-off-by: Oriol Muñoz <oriol.munoz@digitalasset.com> --------- Signed-off-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> Signed-off-by: Julien Tinguely <julien.tinguely@digitalasset.com> Signed-off-by: Martin Florian <martin.florian@digitalasset.com> Signed-off-by: fayi-da <fayimora.femibalogun@digitalasset.com> Signed-off-by: Stephen Compall <stephen.compall@digitalasset.com> Signed-off-by: Robert Autenrieth <robert.autenrieth@digitalasset.com> Signed-off-by: Oriol Muñoz <oriol.munoz@digitalasset.com> Signed-off-by: Nicu Reut <nicu.reut@digitalasset.com> Signed-off-by: Itai Segall <itai.segall@digitalasset.com> Signed-off-by: timpel-fcs <tim.pelzer@finoa.io> Signed-off-by: Stanislav German-Evtushenko <ginermail@gmail.com> Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org> Co-authored-by: Julien Tinguely <julien.tinguely@digitalasset.com> Co-authored-by: Martin Florian <martin.florian@digitalasset.com> Co-authored-by: fayi-da <112705750+fayi-da@users.noreply.github.com> Co-authored-by: Stephen Compall <stephen.compall@digitalasset.com> Co-authored-by: Robert Autenrieth <31539813+rautenrieth-da@users.noreply.github.com> Co-authored-by: Oriol Muñoz <oriol.munoz@digitalasset.com> Co-authored-by: Nicu Reut <nicu.reut@digitalasset.com> Co-authored-by: Itai Segall <itai.segall@digitalasset.com> Co-authored-by: Tim <tim.pelzer@finoa.io> Co-authored-by: Stanislav German-Evtushenko <ginermail@gmail.com>
* Implement DeleteCorruptAcsSnapshotTrigger Signed-off-by: Robert Autenrieth <robert.autenrieth@digitalasset.com> Signed-off-by: hrischuk-da <curtis.hrischuk@digitalasset.com>
Second attempt at making deleting corrupt snapshots faster. This time we accept that this is a very long procedure and move it to a trigger.
The new trigger must complete before update history backfilling starts, because otherwise we could backfill import updates before finishing cleaning up corrupt snapshots, and the remaining corrupt snapshots would appear as valid.