Machine with cloud-init 23.3.0 or newer fails to join cluster #4745

Open
dlipovetsky opened this issue Jan 18, 2024 · 5 comments · May be fixed by #4746
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@dlipovetsky
Contributor

/kind bug

What steps did you take and what happened:

I used https://github.com/kubernetes-sigs/image-builder/ to create an Ubuntu 20.04 AMI with the latest available cloud-init package, 23.3.3. The machine fails to join the cluster.

What did you expect to happen:

The machine should join the cluster.

Anything else you would like to add:

In #1490, CAPA began writing sensitive user-data to AWS Secrets Manager (#1924 added support for an alternative, the SSM Parameter Store). CAPA replaced the user-data produced by CABPK with a mechanism to fetch the user-data from the service. This mechanism relied on an "include" that would, by design, fail the first time cloud-init ran. CAPA relied on cloud-init ignoring the failure.
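
To make the "fetch the user-data from the service" step concrete, here is a minimal sketch, not CAPA's actual implementation. It assumes the real user-data is stored as a single secret, optionally gzip-compressed, and that `client` exposes the boto3 Secrets Manager `get_secret_value(SecretId=...)` call. The function name and the secret layout are hypothetical.

```python
import gzip


def fetch_user_data(client, secret_id):
    """Return the user-data stored under secret_id as text.

    `client` is any object with boto3's Secrets Manager method
    get_secret_value(SecretId=...). Storing the payload as one secret,
    optionally gzip-compressed, is an assumption made for this sketch.
    """
    response = client.get_secret_value(SecretId=secret_id)

    # Secrets Manager returns either SecretBinary (bytes) or SecretString (str).
    if "SecretBinary" in response:
        payload = response["SecretBinary"]
    else:
        payload = response["SecretString"].encode("utf-8")

    # Decompress only if the payload starts with the gzip magic number.
    if payload[:2] == b"\x1f\x8b":
        payload = gzip.decompress(payload)

    return payload.decode("utf-8")
```

On a real instance this would be called with something like `boto3.client("secretsmanager")` and whatever secret identifier was written into the placeholder user-data. Until that fetch has run, the content that the "include" points at does not exist, which is why the first cloud-init pass fails by design.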

As of canonical/cloud-init#367, cloud-init stopped ignoring the failure by default, but introduced a feature flag that restored the old behavior of ignoring it. With the default settings, the cloud-init boot failed, so kubernetes-sigs/image-builder#406 used the feature flag as a workaround.

More recently, as of canonical/cloud-init#4228, the feature flag itself was removed. Without the feature flag, the existing workaround has no effect, and cloud-init boot fails.

@supershal and I looked into this issue, and filed kubernetes-sigs/image-builder#1333. We finally understand the root cause.

Most CAPA-maintained AMIs were created with cloud-init 22.4.2, instead of the default cloud-init version.

Environment:

  • Cluster-api-provider-aws version: main
  • Kubernetes version: (use kubectl version): v1.27.8
  • OS (e.g. from /etc/os-release): Ubuntu 20.04
@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-priority needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 18, 2024
@dlipovetsky
Contributor Author

/triage accepted
/priority important-soon

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority labels Jan 18, 2024
@dlipovetsky
Contributor Author

/assign @dlipovetsky

@dlipovetsky
Contributor Author

dlipovetsky commented Mar 2, 2024

This affects cloud-init v23.3.0 and newer. See https://github.com/canonical/cloud-init/blob/23.3.x/ChangeLog#L98
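
A quick way to check whether a given image falls in the affected range is to compare its cloud-init version against 23.3.0. The sketch below assumes the version string starts with dotted numbers (for example `23.3.3` or a package version such as `23.3.3-0ubuntu0~20.04.1`). The helper names are hypothetical.

```python
import re

# First cloud-init release reported as affected in this issue.
FIRST_AFFECTED = (23, 3, 0)


def parse_version(text):
    """Parse the leading 'major.minor[.patch]' of a version string into a tuple."""
    match = re.match(r"v?(\d+)\.(\d+)(?:\.(\d+))?", text.strip())
    if match is None:
        raise ValueError(f"unrecognized cloud-init version: {text!r}")
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))


def is_affected(version_text):
    """Return True if the version is 23.3.0 or newer."""
    return parse_version(version_text) >= FIRST_AFFECTED
```

For the versions mentioned in this issue, `is_affected("23.3.3")` is true and `is_affected("22.4.2")` is false.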

@dlipovetsky
Contributor Author

#4746 is a hack, but it's arguably an improvement over #1490, which (eventually) required us to modify cloud-init internals in order to work.

Frankly, if we don't like #4746, let's consider reverting the functionality in #1490 and #1924. By design, the bootstrap provider passes secrets in user-data, and the infrastructure provider is not in a position to interpose, without hacks. I think this is something to be discussed at the bootstrap provider level. This is, after all, a problem that affects all infra providers that rely on cloud-init user-data.

@dlipovetsky
Contributor Author

We would not need to interpose on cloud-init if the user-data did not contain sensitive data (the bootstrap token). See kubernetes-sigs/cluster-api#5294 and kubernetes-sigs/cluster-api#9631
