Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Duplicate HelmReleaseProxy resources after move - potential race #188

Closed
jimmidyson opened this issue Mar 13, 2024 · 5 comments · Fixed by #189
Closed

Duplicate HelmReleaseProxy resources after move - potential race #188

jimmidyson opened this issue Mar 13, 2024 · 5 comments · Fixed by #189

Comments

@jimmidyson
Copy link
Member

jimmidyson commented Mar 13, 2024

Can you help me understand why this reconciles on ClusterUnpaused OR ClusterControlPlaneInitialized and not on both of those (AND)? I've seen what looks like a race on move where I find multiple HelmReleaseProxies for the same HelmChartProxy and cluster - labels are set up correctly and created at same time (albeit at 1s fidelity from CreationTimestamp so can't tell which is created first).

I think that changing this logic to when a cluster is not paused AND the control plane has been initialized would solve this problem. I see that wouldn't work because the ClusterUnpaused explicitly requires the cluster to have previously been in unpaused state before to continue on update, which wouldn't be the case just following cluster control plane initialized.

From the move logs you can see the order that causes this issue:

Creating Cluster="self-hosted-18gxrd" Namespace="self-hosted-wokpvl"
...
Creating HelmChartProxy="cluster-autoscaler-self-hosted-18gxrd" Namespace="self-hosted-wokpvl"
...
Creating HelmReleaseProxy="cluster-autoscaler-self-hosted-18gxrd-7q6p4" Namespace="self-hosted-wokpvl"

By the time the HelmReleaseProxy has been moved, a new HelmReleaseProxy has already been created it seems.

@jimmidyson
Copy link
Member Author

jimmidyson commented Mar 13, 2024

Looking at the resources that are created again, I see that the HelmReleaseProxy was moved before the new HelmReleaseProxy was created. Labels are the same on both resources and as such there should only be one. This is a common issue when using GenerateName

The moved HelmReleaseProxy has an earlier resourceVersion than the duplicate HelmReleaseProxy.

The moved resource:

metadata:
  name: cluster-autoscaler-self-hosted-18gxrd-7q6p4
  namespace: self-hosted-wokpvl
  labels:
    cluster.x-k8s.io/cluster-name: self-hosted-18gxrd
    helmreleaseproxy.addons.cluster.x-k8s.io/helmchartproxy-name: cluster-autoscaler-self-hosted-18gxrd
  ...
  resourceVersion: "2729"

The duplicate resource:

metadata:
  name: cluster-autoscaler-self-hosted-18gxrd-kmc7h
  namespace: self-hosted-wokpvl
  ...
  labels:
    cluster.x-k8s.io/cluster-name: self-hosted-18gxrd
    helmreleaseproxy.addons.cluster.x-k8s.io/helmchartproxy-name: cluster-autoscaler-self-hosted-18gxrd
  resourceVersion: "2730"

Is this a race in the cache? Should this check use a non-caching client perhaps? Should reconciliation detect duplicates and consolidate here?

@jimmidyson
Copy link
Member Author

Thinking about this, I think the HelmChartProxy controller should skip clusters that are paused, just as they do for clusters that do not yet have control plane initialized. I will send in a PR - let me know what you think.

@Jont828
Copy link
Contributor

Jont828 commented Mar 13, 2024

Thanks for bringing this to my attention. At a first glance, I'm not sure what's going on with the race condition, but I think it's fair to skip reconciliation for paused Clusters. I figured we'd already be doing that but it must have been an oversight. Did skipping reconciliation for paused Clusters fix the race condition, or do you think that's a separate issue?

@Jont828
Copy link
Contributor

Jont828 commented Mar 13, 2024

If the issue is with the name generation, an easy workaround could be to have the HCP controller set HRP.Spec.ReleaseName when it gets created, rather than the HRP controller setting it on its own and updating.

@jimmidyson
Copy link
Member Author

jimmidyson commented Mar 14, 2024

This is not a problem with HRP.Spec.ReleaseName but the HRP.Metadata.Name which always uses a generated name. I think move is the only way that this race can be hit (2 clients creating potentially duplicate HRPs at the same time with different names).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants