Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid picking old helm-delete job #232

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

w13915984028
Copy link

@w13915984028 w13915984028 commented Apr 15, 2024

Fix issue #177.

The helm-delete job may be created multi times and remaining job will affect next round of helm-chart removing.

The current code depends on below block to clean the helm-delete job.

				// note: an empty apply removes all resources owned by this chart
				err = generic.ConfigureApplyForObject(c.apply, chart, &generic.GeneratingHandlerOptions{
					AllowClusterScoped: true,
				}).
					WithOwner(chart).
					WithSetID("helm-chart-registration").
					ApplyObjects()

But there are 2 parts not covering:
(1) Below code may fail, due to chart object is gone

		chartCopy.Status.JobName = job.Name
		newChart, err := c.helms.Update(chartCopy)

the return error is like:

Operation cannot be fulfilled on helmcharts.helm.cattle.io "rancher-logging": StorageError: invalid object, Code: 4, Key: /registry/helm.cattle.io/helmcharts/cattle-logging-system/rancher-logging, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 78e8d91d-430e-4c7d-a3f9-156c461eedab, UID in object meta: , requeuing"

The second commit can fix it (validated).

In the test, the orphaned helm-delete job can stay a long time.

harv41:/home/rancher # kk get jobs -A
NAMESPACE               NAME                               COMPLETIONS   DURATION   AGE
cattle-logging-system   helm-delete-rancher-logging        1/1           3s         62m

kube-system             helm-install-rke2-canal            1/1           13s        2d15h

(2) When the POD/running process is killed and the generic.ConfigureApplyForObject is not executed.
The current code sololy depends on the executing of generic.ConfigureApplyForObject to clean objects in the last step of OnRemove.

The first commit will fix this, it always checks the existing helm-delete job to avoid picking the orphaned jobs.

On k8s, the object deleting is a bit different than creating, the reconciler can re-run to reach the target finally. But the OnRemove may not always have chance to re-run.

Signed-off-by: Jian Wang <w13915984028@gmail.com>
Comment on lines +238 to +239
// remove old helm-delete job if it was left
job, err := c.jobCache.Get(chart.Namespace, expectedJob.Name)
Copy link
Contributor

@brandond brandond Apr 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need to do this here, the job patcher already deletes and re-creates the job as necessary, as part of the Apply logic:

  • func (c *Controller) jobPatcher(namespace, name string, pt types.PatchType, data []byte) (runtime.Object, error) {
    err := c.jobs.Delete(namespace, name, &metav1.DeleteOptions{PropagationPolicy: &deletePolicy})
    if err == nil || apierrors.IsNotFound(err) {
    return nil, fmt.Errorf("create or replace job")
    }
    return nil, err
    }
  • c.apply = apply.
    WithCacheTypes(helms, confs, jobs, crbs, sas, cm, s).
    WithStrictCaching().
    WithPatcher(jobs.GroupVersionKind(), c.jobPatcher)
  • // if the job already exists but is tied to an install or upgrade, there will be a need to patch so
    // the apply will execute the jobPatcher, which will delete the install/upgrade job and recreate a uninstall job
    err = generic.ConfigureApplyForObject(c.apply, chart, &generic.GeneratingHandlerOptions{

Is there some other bit of logic that the patcher needs to capture?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the bug is due to WithOwner(chart). in

First round of create & delete a HelmChart X

HelmChart X with ID A 
  helm-delete job 1
  helm-delete job 2 (and possiblely left orphaned)

Second round of create & delete a HelmChart X

Chart X with ID B
  helm-delete job 2 will not be replaced due to `WithOwner`

But

job, err := c.jobCache.Get(chart.Namespace, expectedJob.Name)
will be hit and also
c.recorder.Eventf(chart, corev1.EventTypeNormal, "RemoveJob", "Uninstalled HelmChart using Job %s/%s, removing resources", job.Namespace, job.Name)

That's the gap in the logic.

Copy link
Author

@w13915984028 w13915984028 Apr 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we manually create a same namespaced & named helm-delete job before deleting a HelmChart, I guess it will also shortcut the HelmChart's real deleting.

Copy link
Contributor

@brandond brandond Apr 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I remember correctly, the WithOwner stuff uses the owning object's name, namespace, and GVK to track resources across namespaces. As long as the owning HelmChart object shares those same attributes across delete/create cycles it should track it properly. If you run the controller with --debug you should get debug logs from the DesiredSet stuff showing you what it's trying to do - does that shed any light on the situation?

Copy link
Author

@w13915984028 w13915984028 Apr 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@brandond

I add some debug #177 (comment) on Harvester environment, the logs /events can clearly show that, the generic.ConfigureApplyForObject is not able to correctly identify the old left helm-delete job.

From the debug, it looks essential to remove the old job.

Btw the created helm-delete job, has no Owner field; the apply has a interl owner field.

harv41:/home/rancher # kk get jobs -n cattle-logging-system helm-delete-rancher-logging -oyaml
apiVersion: batch/v1
kind: Job
metadata:
  annotations:
    batch.kubernetes.io/job-tracking: ""
    objectset.rio.cattle.io/applied: ...
    objectset.rio.cattle.io/id: helm-chart-registration
    objectset.rio.cattle.io/owner-gvk: helm.cattle.io/v1, Kind=HelmChart
    objectset.rio.cattle.io/owner-name: rancher-logging
    objectset.rio.cattle.io/owner-namespace: cattle-logging-system
  creationTimestamp: "2024-04-19T08:32:28Z"
  finalizers:
  - wrangler.cattle.io/promote-node-controller
  generation: 1
  labels:
    helmcharts.helm.cattle.io/chart: rancher-logging
    objectset.rio.cattle.io/hash: ea9e4622f05ef91896488127ad4be76f709ee325
  name: helm-delete-rancher-logging
  namespace: cattle-logging-system
  resourceVersion: "739101"
  uid: bcc55913-0ad5-48e9-8dcf-8b380f037ee8
spec:
  backoffLimit: 1000
  completionMode: NonIndexed
  completions: 1
  parallelism: 1
  selector:
    matchLabels:
      batch.kubernetes.io/controller-uid: bcc55913-0ad5-48e9-8dcf-8b380f037ee8
  suspend: false
  template:
    metadata:
      annotations:
        helmcharts.helm.cattle.io/configHash: SHA256=0832CE8EBAB899509A3AB82C345F3DF72C4D3300D5CDB2E0C47BB943F9630B34
      creationTimestamp: null
      labels:
        batch.kubernetes.io/controller-uid: bcc55913-0ad5-48e9-8dcf-8b380f037ee8
        batch.kubernetes.io/job-name: helm-delete-rancher-logging
        controller-uid: bcc55913-0ad5-48e9-8dcf-8b380f037ee8
        helmcharts.helm.cattle.io/chart: rancher-logging
        job-name: helm-delete-rancher-logging
    spec:
...  

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The orphaned job, was left per this return

	// once we have run the above logic, we can now check if the job is complete
	job, err = c.jobCache.Get(chart.Namespace, expectedJob.Name)
	if apierrors.IsNotFound(err) {
		// the above apply should have created it, something is wrong.
		// if you are here, there must be a bug in the code.
	@@ -269,7 +289,7 @@ func (c *Controller) OnRemove(key string, chart *v1.HelmChart) (*v1.HelmChart, e
		chartCopy.Status.JobName = job.Name
// the `chart` may be on on apiserver side
		newChart, err := c.helms.Update(chartCopy)
		if err != nil {
			return chart, fmt.Errorf("unable to update status of helm chart to add uninstall job name %s, %w", chartCopy.Status.JobName, err)
		}

And the normal running path has this call

	// note: an empty apply removes all resources owned by this chart
	err = generic.ConfigureApplyForObject(c.apply, chart, &generic.GeneratingHandlerOptions{
		AllowClusterScoped: true,
	}).
		WithOwner(chart).
		WithSetID("helm-chart-registration").
		ApplyObjects()

I will test if add this call to the above return works.

Signed-off-by: Jian Wang <w13915984028@gmail.com>
@@ -269,7 +289,21 @@ func (c *Controller) OnRemove(key string, chart *v1.HelmChart) (*v1.HelmChart, e
chartCopy.Status.JobName = job.Name
newChart, err := c.helms.Update(chartCopy)
if err != nil {
return chart, fmt.Errorf("unable to update status of helm chart to add uninstall job name %s", chartCopy.Status.JobName)
// if chart is gone, clean resources
if apierrors.IsNotFound(err) || strings.Contains(err.Error(), "StorageError") {
Copy link
Contributor

@brandond brandond Apr 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the error you shared: StorageError: invalid object, Code: 4, Key: /registry/helm.cattle.io/helmcharts/cattle-logging-system/rancher-logging, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 78e8d91d-430e-4c7d-a3f9-156c461eedab, UID in object meta: , requeuing"

I think you can use https://pkg.go.dev/k8s.io/apiserver/pkg/storage#IsInvalidObj

Suggested change
if apierrors.IsNotFound(err) || strings.Contains(err.Error(), "StorageError") {
if apierrors.IsNotFound(err) || storage.IsInvalidObj(err) {

It is odd that there's not an apierror for this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried, it is not like normal k8s package for client usage.

apierrors "k8s.io/apimachinery/pkg/api/errors"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see anything that prevents you from importing "k8s.io/apiserver/pkg/storage"?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait my validation, it may not work as expected when using this package.

The log shows it does not go to the expected if, and the job is still left.

image

harv41:/home/rancher # kk get jobs -A
NAMESPACE               NAME                               COMPLETIONS   DURATION   AGE
cattle-logging-system   helm-delete-rancher-logging        1/1           4s         4m4s

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

storage.IsInvalidObj is not working as expected, revert first.

@brandond
Copy link
Contributor

brandond commented Apr 19, 2024

if job.Status.Succeeded <= 0 {
// temporarily recreate the chart, but keep the deletion timestamp
chartCopy := chart.DeepCopy()
chartCopy.Status.JobName = job.Name
newChart, err := c.helms.Update(chartCopy)

The comment from @aiyengar2 here about temporarily recreate the chart, but keep the deletion timestamp doesn't make any sense to me - how can we call Update to patch a resource that doesn't exist? I feel like we're fixing a symptom here, instead of fixing the incorrect logic that gets us here in the first place.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants