fix(pull-mode): do not commit helm stage when a chart failed in the cycle #1728

Closed
thecodeassassin wants to merge 1 commit into projectsveltos:main from Cloud-Exit:fix/pull-mode-partial-stage-wipes-deployments

Conversation

thecodeassassin (Contributor) commented Apr 18, 2026

Fixes #1724.

The scenario reported in #1724

A pull-mode ClusterSummary deploys cert-manager via a Profile that reads its values from a ConfigMap. When that ConfigMap is updated with a values document that has wrong indentation (replicaCount nested under crds instead of at the top level), helm correctly rejects the new values at install time:

values don't meet the specifications of the schema(s) in the following chart(s):
cert-manager:
- at '/crds': additional properties 'replicaCount' not allowed
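
For reference, the values document from the report would look roughly like this (illustrative reconstruction; key names other than replicaCount and crds are not from the original report):

```yaml
# Intended (valid): replicaCount at the top level of the values document.
# replicaCount: 2
# crds:
#   enabled: true

# What was written (invalid): replicaCount indented under crds,
# which cert-manager's values schema rejects as an additional property.
crds:
  enabled: true
  replicaCount: 2
```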

Expected behavior is for the reconcile to fail in place and leave the previously deployed cert-manager untouched on the managed cluster. Actual behavior is that every resource cert-manager had deployed (ServiceAccount, CustomResourceDefinition, ClusterRole, ClusterRoleBinding, Role, RoleBinding, Service, Deployment, MutatingWebhookConfiguration, ValidatingWebhookConfiguration, Job) is completely removed from the managed cluster. The ClusterSummary still lists cert-manager as Managing with Status: Failed, but the workloads are gone.

Root cause

In controllers/handlers_helm.go, handleCharts runs walkChartsAndDeploy to stage each helm chart's rendered resources into the in-memory pullmode staging manager, then commits the staged set for the agent to consume:

releaseReports, chartDeployed, deployError := walkChartsAndDeploy(ctx, c, dCtx, kubeconfig, isPullMode, logger)
// Even if there is a deployment error do not return just yet. Update various status and clean stale resources.

clusterSummary, err = updateStatusForNonReferencedHelmReleases(ctx, c, dCtx, logger)
if err != nil {
    return err
}

if isPullMode {
    err = commitStagedResourcesForDeployment(ctx, clusterSummary, configurationHash, mgmtResources, logger)
    ...
    if deployError != nil {
        return deployError
    }
}

When any chart fails inside walkChartsAndDeploy (in this scenario, handleInstall returning a helm schema-validation error), that chart never reaches stageHelmResourcesForDeployment. For a single-chart profile the in-memory staging manager ends this reconcile with zero helm bundles.

commitStagedResourcesForDeployment then publishes a ConfigurationGroup that references only the bundles currently in memory. Because the in-memory set is empty, the ConfigurationGroup no longer references the bundles that were published by the previous successful reconcile. The applier-manager on the managed cluster watches the ConfigurationGroup, sees those bundles as removed from the profile, and deletes the resources they previously deployed. The entire cert-manager release is uninstalled.
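
The deletion mechanics reduce to a set difference. A minimal sketch, assuming a simplified model of the applier-manager's pruning decision (names and types here are illustrative, not the actual Sveltos API):

```go
package main

import "fmt"

// bundlesToDelete sketches the pruning decision described above: any bundle
// referenced by the previously published ConfigurationGroup but absent from
// the newly committed one is treated as removed from the profile and its
// resources are deleted from the managed cluster.
func bundlesToDelete(previous, current []string) []string {
	inCurrent := make(map[string]bool, len(current))
	for _, b := range current {
		inCurrent[b] = true
	}
	var deleted []string
	for _, b := range previous {
		if !inCurrent[b] {
			deleted = append(deleted, b)
		}
	}
	return deleted
}

func main() {
	// Previous successful reconcile published the cert-manager bundle.
	previous := []string{"helm-cert-manager"}
	// The failed reconcile committed an empty staged set.
	current := []string{}
	fmt.Println(bundlesToDelete(previous, current)) // [helm-cert-manager]
}
```

With an empty current set, every previously deployed bundle lands in the deletion list, which is exactly the wipe observed in #1724.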

Anything that prevents a chart from reaching stageHelmResourcesForDeployment produces the same outcome. Other examples:

  • failed to instantiated template: ... map has no entry for key "data" when a templateResourceRefs ConfigMap is renamed or has a key removed.
  • An unreachable chart repository.
  • Any transient helm error before staging.

Each of these silently tears down the healthy deployment that was running until the reconcile failed.

The fix

The two sibling handlers already handle this correctly:

  • controllers/handlers_resources.go:151-155 returns on deployError before calling pullmode.CommitStagedResourcesForDeployment.
  • controllers/handlers_kustomize.go:196-197 does the same.

handlers_helm.go was the only handler still committing a partial staged set on error. This PR moves the deployError check ahead of commitStagedResourcesForDeployment, matching the existing pattern:

if isPullMode {
    if deployError != nil {
        return deployError
    }

    err = commitStagedResourcesForDeployment(ctx, clusterSummary, configurationHash, mgmtResources, logger)
    ...
}

When any chart in the cycle fails to stage, the reconcile returns the error without touching the ConfigurationGroup. The previously committed ConfigurationGroup continues to reference the last known good bundles, so the agent keeps running the current deployment instead of tearing it down.

updateClusterReportWithHelmReports is a no-op outside SyncModeDryRun, so moving the deployError check ahead of it has no effect on non-DryRun reconciles. The in-memory partial staged bundles are cleared by pullmode.DiscardStagedResourcesForDeployment at the top of the next deployHelmCharts invocation (handlers_helm.go:159), so there is no in-memory leak across reconciles.

Scope and behavior

  • Applies only to the pull-mode branch of handleCharts. Non pull-mode behavior is unchanged.
  • Does not reorder or change the non-error path. Successful reconciles stage, commit, and report exactly as before.
  • Does not recover clusters that were already wiped by a prior partial commit. Those need a clean reconcile (for example after fixing the values document) before the agent will redeploy. This change prevents further occurrences.
  • For profiles with continueOnError: true, any chart failing to stage still blocks the commit for this cycle. The safer trade-off is to pause all updates until the failing chart is healthy rather than to risk removing a healthy chart because its sibling failed mid-cycle. This can be revisited if there is demand for per-chart staging granularity, which would require API changes in libsveltos/lib/pullmode.

Test plan

  • Run make test with envtest set up. The existing handlers_helm_test.go suite should continue to pass; there is no behavior change on the success path.
  • Manually reproduce the scenario from #1724 (BUG: Updating a chart in pull mode completely removes deployed artifacts) against a cluster running the patched controller: write an invalid values document into the referenced ConfigMap, observe that the ClusterSummary reports the helm schema error on featureSummaries, and confirm the cert-manager deployment on the managed cluster is still present.
  • Verify the same protection for the other two failure modes: a templated chartVersion whose referenced ConfigMap key is missing, and an unreachable chart repository.

Related

handleCharts in controllers/handlers_helm.go was committing the staged
ConfigurationBundles even when walkChartsAndDeploy returned an error
for one of the charts in the profile. Because staging had aborted
before the failing chart was added to the in-memory staging manager,
the commit produced a ConfigurationGroup missing the bundles for that
chart. The applier-manager on the managed cluster then treats the
missing bundles as "this release was removed from the profile" and
uninstalls the chart, removing the previously deployed resources from
the managed cluster.

The user-visible effect is that any pull-mode reconcile which fails
inside helm (helm values schema validation, template instantiation
error, unreachable chart repo, invalid semantic version from a
poisoned cache, etc.) does not just fail loudly: it silently tears
down the deployment that was working moments earlier. Issue
projectsveltos#1724 reported this for cert-manager
when the values ConfigMap referenced by valuesFrom was updated with a
values document that did not match the chart schema.

The two sibling handlers already handle this correctly:

  * handlers_resources.go:151-155 returns on deployError before
    CommitStagedResourcesForDeployment.
  * handlers_kustomize.go:196-197 does the same.

This change aligns handlers_helm.go with that pattern: if
walkChartsAndDeploy returned a deployError, return it immediately
instead of committing a partial stage. The previously committed
ConfigurationGroup is left in place so the agent continues to run the
last known good state until a clean reconcile can complete.

updateClusterReportWithHelmReports is a no-op outside of DryRun sync
mode, so moving the deployError check ahead of it does not affect
non-DryRun reconciles. Staged-but-not-committed bundles are cleared
by DiscardStagedResourcesForDeployment at the top of the next
deployHelmCharts invocation (handlers_helm.go:159), so there is no
leak of in-memory staged state.

Refs: projectsveltos#1724
gianlucam76 (Member) commented:

Fixed here #1725
