fix(pull-mode): do not commit helm stage when a chart failed in the cycle#1728
Closed
thecodeassassin wants to merge 1 commit intoprojectsveltos:mainfrom
Closed
Conversation
handleCharts in controllers/handlers_helm.go was committing the staged ConfigurationBundles even when walkChartsAndDeploy returned an error for one of the charts in the profile. Because staging had aborted before the failing chart was added to the in-memory staging manager, the commit produced a ConfigurationGroup missing the bundles for that chart. The applier-manager on the managed cluster then treats the missing bundles as "this release was removed from the profile" and uninstalls the chart, removing the previously deployed resources from the managed cluster. The user-visible effect is that any pull-mode reconcile which fails inside helm (helm values schema validation, template instantiation error, unreachable chart repo, invalid semantic version from a poisoned cache, etc.) does not just fail loudly: it silently tears down the deployment that was working moments earlier. Issue projectsveltos#1724 reported this for cert-manager when the values ConfigMap referenced by valuesFrom was updated with a values document that did not match the chart schema. The two sibling handlers already handle this correctly: * handlers_resources.go:151-155 returns on deployError before CommitStagedResourcesForDeployment. * handlers_kustomize.go:196-197 does the same. This change aligns handlers_helm.go with that pattern: if walkChartsAndDeploy returned a deployError, return it immediately instead of committing a partial stage. The previously committed ConfigurationGroup is left in place so the agent continues to run the last known good state until a clean reconcile can complete. updateClusterReportWithHelmReports is a no-op outside of DryRun sync mode, so moving the deployError check ahead of it does not affect non-DryRun reconciles. Staged-but-not-committed bundles are cleared by DiscardStagedResourcesForDeployment at the top of the next deployHelmCharts invocation (handlers_helm.go:159), so there is no leak of in-memory staged state. Refs: projectsveltos#1724
Member
|
Fixed here #1725 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #1724.
The scenario reported in #1724
A pull-mode ClusterSummary deploys cert-manager via a Profile that reads its values from a ConfigMap. When that ConfigMap is updated with a values document that has wrong indentation (
replicaCountnested undercrdsinstead of at the top level), helm correctly rejects the new values at install time:Expected behavior is for the reconcile to fail in place and leave the previously deployed cert-manager untouched on the managed cluster. Actual behavior is that every resource cert-manager had deployed (
ServiceAccount,CustomResourceDefinition,ClusterRole,ClusterRoleBinding,Role,RoleBinding,Service,Deployment,MutatingWebhookConfiguration,ValidatingWebhookConfiguration,Job) is completely removed from the managed cluster. The ClusterSummary still lists cert-manager asManagingwithStatus: Failed, but the workloads are gone.Root cause
In
controllers/handlers_helm.go,handleChartsrunswalkChartsAndDeployto stage each helm chart's rendered resources into the in-memory pullmode staging manager, then commits the staged set for the agent to consume:When any chart fails inside
walkChartsAndDeploy(in this scenario,handleInstallreturning a helm schema-validation error), that chart never reachesstageHelmResourcesForDeployment. For a single-chart profile the in-memory staging manager ends this reconcile with zero helm bundles.commitStagedResourcesForDeploymentthen publishes a ConfigurationGroup that references only the bundles currently in memory. Because the in-memory set is empty, the ConfigurationGroup no longer references the bundles that were published by the previous successful reconcile. The applier-manager on the managed cluster watches the ConfigurationGroup, sees those bundles as removed from the profile, and deletes the resources they previously deployed. The entire cert-manager release is uninstalled.Anything that prevents a chart from reaching
stageHelmResourcesForDeploymentproduces the same outcome. Other examples:failed to instantiated template: ... map has no entry for key "data"when atemplateResourceRefsConfigMap is renamed or has a key removed.Each of these silently tears down the healthy deployment that was running until the reconcile failed.
The fix
The two sibling handlers already handle this correctly:
controllers/handlers_resources.go:151-155returns ondeployErrorbefore callingpullmode.CommitStagedResourcesForDeployment.controllers/handlers_kustomize.go:196-197does the same.handlers_helm.gowas the only handler still committing a partial staged set on error. This PR moves thedeployErrorcheck ahead ofcommitStagedResourcesForDeployment, matching the existing pattern:When any chart in the cycle fails to stage, the reconcile returns the error without touching the ConfigurationGroup. The previously committed ConfigurationGroup continues to reference the last known good bundles, so the agent keeps running the current deployment instead of tearing it down.
updateClusterReportWithHelmReportsis a no-op outsideSyncModeDryRun, so moving thedeployErrorcheck ahead of it has no effect on non-DryRun reconciles. The in-memory partial staged bundles are cleared bypullmode.DiscardStagedResourcesForDeploymentat the top of the nextdeployHelmChartsinvocation (handlers_helm.go:159), so there is no in-memory leak across reconciles.Scope and behavior
handleCharts. Non pull-mode behavior is unchanged.continueOnError: true, any chart failing to stage still blocks the commit for this cycle. The safer trade is to pause all updates until the failing chart is healthy, rather than risk removing a healthy chart because its sibling failed mid-cycle. This can be revisited if there is demand for per-chart staging granularity, which would require API changes inlibsveltos/lib/pullmode.Test plan
make testwith envtest set up. The existinghandlers_helm_test.gosuite should continue to pass; there is no behavior change on the success path.featureSummaries, and confirm the cert-manager deployment on the managed cluster is still present.chartVersionwhose referenced ConfigMap key is missing, and an unreachable chart repository.Related