CAPM3 v1.6.1 CAPI/CAPM3 machine name changed after rolling upgrade while nodeReuse set to True #1584
Comments
This issue is currently awaiting triage. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Hi @jparkash2, thanks for the report! I think for this to be actionable it needs to focus only on the expected/desired CAPM3 behavior, since handling of the disks by Longhorn is outside the scope of the CAPM3 component. You mentioned that Longhorn does not reuse the old replicas; how the data is handled by Longhorn (or any other layered dependency) is not controlled via CAPM3, but I can see how replacing the Machine/Node CRs could potentially cause problems. So perhaps we can first determine whether CAPM3 is behaving unexpectedly, or whether it is working as designed (but causes undesired side effects due to the way CAPI upgrades work, e.g. via Machine replacement)?
/triage needs-information
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so with /close. /lifecycle stale
@jparkash2 were you able to solve that issue? Is Longhorn able to reuse the data on a different disk after CAPI did a re-provisioning? I think Longhorn is not able to do that, but maybe I am wrong. I created a feature request for Longhorn: longhorn/longhorn#8362
@guettli Thanks, we can close this thread, as this was achieved by making changes on the CAPM3 side.
@jparkash2 how did you solve that? The issue at Longhorn is still open: longhorn/longhorn#8362 |
We can cross-reference the related Sylva issues, which I think contain some details around how this was resolved.
I think this is not a CAPM3 issue, but steps must be taken to ensure that cleaning is disabled and the Node name is reused, which should perhaps be better documented somewhere.
@hardys we use constant node names. Unfortunately, I don't see how to solve that. The Longhorn node resource contains the config which you set via the GUI. If Cluster API upgrades the node, the Kubernetes node gets deleted, and the Longhorn node object gets deleted, too. Configuration such as tags is then lost. We are considering this approach: a controller syncs the Longhorn node config into a new CRD or a ConfigMap. When the node gets created, we can provide this config via the Longhorn default config annotations, by updating the Longhorn node resource, or by using the Longhorn Python client. All of this is doable; I am just surprised that we seem to be the first to automate it. I like software development, but I am also happy if it's not needed :-) How do you attach the data disk to Longhorn? Is a manual re-attach needed after CAPI upgraded the machine?
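For illustration only, here is a minimal sketch of the annotation-based approach mentioned above. It assumes Longhorn's "Create Default Disk on Labeled Nodes" setting is enabled and uses the `node.longhorn.io/create-default-disk` label together with the default-disks/default-node-tags annotations; the node name, disk path, and tags are placeholders, not values from this thread:

```yaml
# Sketch only: pre-seed Longhorn node configuration on a (re)provisioned node so it
# can be applied when the Longhorn node object is recreated after a CAPI upgrade.
apiVersion: v1
kind: Node
metadata:
  name: worker-0                                    # placeholder: assumes a constant node name
  labels:
    node.longhorn.io/create-default-disk: "config"  # tell Longhorn to read the annotations below
  annotations:
    # JSON-encoded disk config and tags, e.g. restored from a ConfigMap kept by a sync controller
    node.longhorn.io/default-disks-config: '[{"path":"/var/lib/longhorn","allowScheduling":true}]'
    node.longhorn.io/default-node-tags: '["storage","fast"]'
```

As far as I understand, these defaults are only applied when Longhorn first creates the node object, which would match the recreate-after-upgrade scenario described above.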
What steps did you take and what happened:
Longhorn doesn't reuse old replicas after rolling upgrades and creates new replicas instead.
We conducted the rolling upgrade with CAPI using specific configurations: `nodeReuse` set to `true` and `automatedCleaningMode` set to `disabled` on the Metal3 side, and `replica-replenishment-wait-interval` set to `3600` on the Longhorn side. However, despite these settings, Longhorn did not use the existing data. Instead, it created a new replica copy from an existing copy after the `replica-replenishment-wait-interval` timed out, even though the node rejoined the Longhorn/Kubernetes cluster within that interval, albeit with a new name. This behaviour was unexpected and not in line with our testing expectations.
After the CAPI rolling upgrade, we observed that Longhorn waited for the `replica-replenishment-wait-interval` and then rebuilt new replicas instead of reusing the old ones.
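For context, a minimal sketch of where these settings typically live, assuming the workers are defined via a Metal3MachineTemplate and Longhorn is configured via its Setting CRs; the resource names, namespace, and image URLs are placeholders rather than values taken from this report:

```yaml
# Sketch only: Metal3MachineTemplate excerpt with nodeReuse and automatedCleaningMode
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: Metal3MachineTemplate
metadata:
  name: worker-template                   # placeholder name
spec:
  nodeReuse: true                         # reuse the same BareMetalHost for the replacement Machine
  template:
    spec:
      automatedCleaningMode: disabled     # skip disk cleaning on deprovisioning
      image:
        url: http://example.com/node-image.qcow2                # placeholder
        checksum: http://example.com/node-image.qcow2.sha256sum # placeholder
---
# Sketch only: Longhorn setting controlling how long to wait before rebuilding replicas elsewhere
apiVersion: longhorn.io/v1beta2
kind: Setting
metadata:
  name: replica-replenishment-wait-interval
  namespace: longhorn-system
value: "3600"                             # seconds
```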
Upon further investigation, we identified the root cause of the issue: during the rolling upgrade, the CAPI Machine name changed, resulting in the Kubernetes/Longhorn cluster treating the upgraded node as a new node rather than the existing one.
What did you expect to happen:
After the rolling upgrade, the Kubernetes cluster node name should not change, so that the existing data on disk is used to rebuild the replicas instead of creating new copies of the replicas.
Anything else you would like to add:
https://gitlab.com/sylva-projects/sylva-core/-/issues/1141
Environment:
Kubernetes version (use `kubectl version`): 1.27.10
/kind bug