fix(pull-mode): do not commit helm stage when a chart failed in the cycle #1728

Closed
thecodeassassin wants to merge 1 commit into projectsveltos:main from Cloud-Exit:fix/pull-mode-partial-stage-wipes-deployments

Conversation

thecodeassassin (Contributor) commented Apr 18, 2026

Fixes #1724.

The scenario reported in #1724

A pull-mode ClusterSummary deploys cert-manager via a Profile that reads its values from a ConfigMap. When that ConfigMap is updated with a values document that has wrong indentation (replicaCount nested under crds instead of at the top level), helm correctly rejects the new values at install time:

values don't meet the specifications of the schema(s) in the following chart(s):
cert-manager:
- at '/crds': additional properties 'replicaCount' not allowed
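
For reference, the values document from the report would look roughly like this (illustrative reconstruction; key names other than replicaCount and crds are not from the original report):

```yaml
# Intended (valid): replicaCount at the top level of the values document.
# replicaCount: 2
# crds:
#   enabled: true

# What was written (invalid): replicaCount indented under crds,
# which cert-manager's values schema rejects as an additional property.
crds:
  enabled: true
  replicaCount: 2
```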

Expected behavior is for the reconcile to fail in place and leave the previously deployed cert-manager untouched on the managed cluster. Actual behavior is that every resource cert-manager had deployed (ServiceAccount, CustomResourceDefinition, ClusterRole, ClusterRoleBinding, Role, RoleBinding, Service, Deployment, MutatingWebhookConfiguration, ValidatingWebhookConfiguration, Job) is completely removed from the managed cluster. The ClusterSummary still lists cert-manager as Managing with Status: Failed, but the workloads are gone.

Root cause

In controllers/handlers_helm.go, handleCharts runs walkChartsAndDeploy to stage each helm chart's rendered resources into the in-memory pullmode staging manager, then commits the staged set for the agent to consume:

releaseReports, chartDeployed, deployError := walkChartsAndDeploy(ctx, c, dCtx, kubeconfig, isPullMode, logger)
// Even if there is a deployment error do not return just yet. Update various status and clean stale resources.

clusterSummary, err = updateStatusForNonReferencedHelmReleases(ctx, c, dCtx, logger)
if err != nil {
    return err
}

if isPullMode {
    err = commitStagedResourcesForDeployment(ctx, clusterSummary, configurationHash, mgmtResources, logger)
    ...
    if deployError != nil {
        return deployError
    }
}

When any chart fails inside walkChartsAndDeploy (in this scenario, handleInstall returning a helm schema-validation error), that chart never reaches stageHelmResourcesForDeployment. For a single-chart profile the in-memory staging manager ends this reconcile with zero helm bundles.

commitStagedResourcesForDeployment then publishes a ConfigurationGroup that references only the bundles currently in memory. Because the in-memory set is empty, the ConfigurationGroup no longer references the bundles that were published by the previous successful reconcile. The applier-manager on the managed cluster watches the ConfigurationGroup, sees those bundles as removed from the profile, and deletes the resources they previously deployed. The entire cert-manager release is uninstalled.
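
The deletion mechanics reduce to a set difference. A minimal sketch, assuming a simplified model of the applier-manager's pruning decision (names and types here are illustrative, not the actual Sveltos API):

```go
package main

import "fmt"

// bundlesToDelete sketches the pruning decision described above: any bundle
// referenced by the previously published ConfigurationGroup but absent from
// the newly committed one is treated as removed from the profile and its
// resources are deleted from the managed cluster.
func bundlesToDelete(previous, current []string) []string {
	inCurrent := make(map[string]bool, len(current))
	for _, b := range current {
		inCurrent[b] = true
	}
	var deleted []string
	for _, b := range previous {
		if !inCurrent[b] {
			deleted = append(deleted, b)
		}
	}
	return deleted
}

func main() {
	// Previous successful reconcile published the cert-manager bundle.
	previous := []string{"helm-cert-manager"}
	// The failed reconcile committed an empty staged set.
	current := []string{}
	fmt.Println(bundlesToDelete(previous, current)) // [helm-cert-manager]
}
```

With an empty current set, every previously deployed bundle lands in the deletion list, which is exactly the wipe observed in #1724.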

Anything that prevents a chart from reaching stageHelmResourcesForDeployment produces the same outcome. Other examples:

  • failed to instantiated template: ... map has no entry for key "data" when a templateResourceRefs ConfigMap is renamed or has a key removed.
  • An unreachable chart repository.
  • Any transient helm error before staging.

Each of these silently tears down the healthy deployment that was running until the reconcile failed.

The fix

The two sibling handlers already handle this correctly:

  • controllers/handlers_resources.go:151-155 returns on deployError before calling pullmode.CommitStagedResourcesForDeployment.
  • controllers/handlers_kustomize.go:196-197 does the same.

handlers_helm.go was the only handler still committing a partial staged set on error. This PR moves the deployError check ahead of commitStagedResourcesForDeployment, matching the existing pattern:

if isPullMode {
    if deployError != nil {
        return deployError
    }

    err = commitStagedResourcesForDeployment(ctx, clusterSummary, configurationHash, mgmtResources, logger)
    ...
}

When any chart in the cycle fails to stage, the reconcile returns the error without touching the ConfigurationGroup. The previously committed ConfigurationGroup continues to reference the last known good bundles, so the agent keeps running the current deployment instead of tearing it down.

updateClusterReportWithHelmReports is a no-op outside SyncModeDryRun, so moving the deployError check ahead of it has no effect on non-DryRun reconciles. The in-memory partial staged bundles are cleared by pullmode.DiscardStagedResourcesForDeployment at the top of the next deployHelmCharts invocation (handlers_helm.go:159), so there is no in-memory leak across reconciles.

Scope and behavior

  • Applies only to the pull-mode branch of handleCharts. Non pull-mode behavior is unchanged.
  • Does not reorder or change the non-error path. Successful reconciles stage, commit, and report exactly as before.
  • Does not recover clusters that were already wiped by a prior partial commit. Those need a clean reconcile (for example after fixing the values document) before the agent will redeploy. This change prevents further occurrences.
  • For profiles with continueOnError: true, any chart failing to stage still blocks the commit for this cycle. The safer trade-off is to pause all updates until the failing chart is healthy rather than to risk removing a healthy chart because its sibling failed mid-cycle. This can be revisited if there is demand for per-chart staging granularity, which would require API changes in libsveltos/lib/pullmode.

Test plan

  • Run make test with envtest set up. The existing handlers_helm_test.go suite should continue to pass; there is no behavior change on the success path.
  • Manually reproduce the scenario from #1724 (BUG: Updating a chart in pull mode completely removes deployed artifacts) against a cluster running the patched controller: write an invalid values document into the referenced ConfigMap, observe that the ClusterSummary reports the helm schema error on featureSummaries, and confirm the cert-manager deployment on the managed cluster is still present.
  • Verify the same protection for the other two failure modes: a templated chartVersion whose referenced ConfigMap key is missing, and an unreachable chart repository.

Related

handleCharts in controllers/handlers_helm.go was committing the staged
ConfigurationBundles even when walkChartsAndDeploy returned an error
for one of the charts in the profile. Because staging had aborted
before the failing chart was added to the in-memory staging manager,
the commit produced a ConfigurationGroup missing the bundles for that
chart. The applier-manager on the managed cluster then treats the
missing bundles as "this release was removed from the profile" and
uninstalls the chart, removing the previously deployed resources from
the managed cluster.

The user-visible effect is that any pull-mode reconcile which fails
inside helm (helm values schema validation, template instantiation
error, unreachable chart repo, invalid semantic version from a
poisoned cache, etc.) does not just fail loudly: it silently tears
down the deployment that was working moments earlier. Issue
projectsveltos#1724 reported this for cert-manager
when the values ConfigMap referenced by valuesFrom was updated with a
values document that did not match the chart schema.

The two sibling handlers already handle this correctly:

  * handlers_resources.go:151-155 returns on deployError before
    CommitStagedResourcesForDeployment.
  * handlers_kustomize.go:196-197 does the same.

This change aligns handlers_helm.go with that pattern: if
walkChartsAndDeploy returned a deployError, return it immediately
instead of committing a partial stage. The previously committed
ConfigurationGroup is left in place so the agent continues to run the
last known good state until a clean reconcile can complete.

updateClusterReportWithHelmReports is a no-op outside of DryRun sync
mode, so moving the deployError check ahead of it does not affect
non-DryRun reconciles. Staged-but-not-committed bundles are cleared
by DiscardStagedResourcesForDeployment at the top of the next
deployHelmCharts invocation (handlers_helm.go:159), so there is no
leak of in-memory staged state.

Refs: projectsveltos#1724
gianlucam76 (Member) commented:

Fixed here #1725
