Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Configure EC2 instance metadata options #4037

Merged
merged 4 commits into from
Mar 7, 2023

Conversation

muraee
Copy link
Contributor

@muraee muraee commented Feb 3, 2023

What type of PR is this?
/kind feature

What this PR does / why we need it:
Allows configuration of EC2 instance metadata options to add IMDSv2 support.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #3744

Special notes for your reviewer:

Checklist:

  • squashed commits
  • includes documentation
  • adds unit tests
  • adds or updates e2e tests

Release Note:

Add support to allow users to configure EC2 instance metadata options

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Feb 3, 2023
@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Feb 3, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: muraee / name: Mulham Raee (faa884552e21dfb93b928dcd02ebdd5f168b300e)

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. needs-priority labels Feb 3, 2023
@k8s-ci-robot
Copy link
Contributor

Welcome @muraee!

It looks like this is your first PR to kubernetes-sigs/cluster-api-provider-aws 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/cluster-api-provider-aws has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 3, 2023
@k8s-ci-robot
Copy link
Contributor

Hi @muraee. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Feb 3, 2023
@Skarlso
Copy link
Contributor

Skarlso commented Feb 3, 2023

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 3, 2023
@muraee
Copy link
Contributor Author

muraee commented Feb 3, 2023

@Skarlso do I need to include the new API field in vbeta1 as well?

@Skarlso
Copy link
Contributor

Skarlso commented Feb 3, 2023

@Skarlso do I need to include the new API field in vbeta1 as well?

Nope, there is a bunch of stuff regarding conversions.

I'll include something tomorrow, I'm exhausted for today. Try looking up the git history of the conversion.go file and *_conversion.go files in the v1 folder. That will give you some hints as to how to proceed with them.

@muraee
Copy link
Contributor Author

muraee commented Feb 6, 2023

Nope, there is a bunch of stuff regarding conversions.

I'll include something tomorrow, I'm exhausted for today. Try looking up the git history of the conversion.go file and *_conversion.go files in the v1 folder. That will give you some hints as to how to proceed with them.

I fixed the conversion issues, I hope this is the correct way.

@Skarlso
Copy link
Contributor

Skarlso commented Feb 7, 2023

Nope, sadly, it's not. :D I'll try to point you to the right thing today. Sorry, I'm just so frigging busy.

@muraee
Copy link
Contributor Author

muraee commented Feb 14, 2023

@Skarlso could you please take a look if you have got time this week. Thanks :)

@Skarlso
Copy link
Contributor

Skarlso commented Feb 14, 2023

We have a code freeze atm to fix the testing infrastructure. But I'll try to tackle this this week.

@@ -43,6 +43,14 @@ func Convert_v1beta2_NetworkStatus_To_v1beta1_NetworkStatus(in *v1beta2.NetworkS
return autoConvert_v1beta2_NetworkStatus_To_v1beta1_NetworkStatus(in, out, s)
}

func Convert_v1beta2_AWSMachineSpec_To_v1beta1_AWSMachineSpec(in *v1beta2.AWSMachineSpec, out *AWSMachineSpec, s conversion.Scope) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what I believe is missing with the conversion that @Skarlso mentioned is that the conversion currently discards the new fields when converting down to v1beta1, so a conversion down to v1beta1 and back to v1beta2 will result in data loss.

To fix this, the additional fields that are not present on v1beta1 but exist on v1beta2 are encoded as a value on an annotation, and extracted back out when converting from v1beta1 to v1beta2.

Here's an example from one of my earlier PR's where @Skarlso helped me previously with the same sort of conversion logic: https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/3757/files#r989941779 and "sigs.k8s.io/controller-runtime/pkg/conversion" contains a number of helper methods related to conversion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you Cameron. Sorry, I have my hands full with trying to fix this frigging test infra. :D

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the hint @cnmcavoy

@muraee
Copy link
Contributor Author

muraee commented Feb 23, 2023

@Skarlso all tests passed, could I have a review?

@Skarlso
Copy link
Contributor

Skarlso commented Feb 23, 2023

Okay this is looking better now. I'll give it a better look later today.

@Skarlso
Copy link
Contributor

Skarlso commented Feb 23, 2023

/assign

Copy link
Member

@Ankitasw Ankitasw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM. Is there a way we could verify this in one of the tests in E2E?
It would also be good if we add documentation regarding this being enabled by default.

@k8s-ci-robot
Copy link
Contributor

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Ankitasw
Copy link
Member

/test pull-cluster-api-provider-aws-e2e

@Ankitasw
Copy link
Member

Ankitasw commented Mar 1, 2023

@muraee there is error in new assertion added, could you please check the e2e failure?

@muraee
Copy link
Contributor Author

muraee commented Mar 2, 2023

/test pull-cluster-api-provider-aws-e2e

@muraee
Copy link
Contributor Author

muraee commented Mar 2, 2023

/test pull-cluster-api-provider-aws-e2e

@muraee
Copy link
Contributor Author

muraee commented Mar 2, 2023

/test pull-cluster-api-provider-aws-e2e

@muraee
Copy link
Contributor Author

muraee commented Mar 3, 2023

/test pull-cluster-api-provider-aws-e2e

@k8s-ci-robot
Copy link
Contributor

k8s-ci-robot commented Mar 3, 2023

@muraee: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-cluster-api-provider-aws-apidiff-main 2b5062f link false /test pull-cluster-api-provider-aws-apidiff-main

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@muraee
Copy link
Contributor Author

muraee commented Mar 3, 2023

@Ankitasw e2e passed when options specified explicitly. Seems the Set_Defaults() is not working, any ideas why?

@Ankitasw
Copy link
Member

Ankitasw commented Mar 7, 2023

@muraee could you please squash your commits before we do the merge?

@Ankitasw Ankitasw merged commit aa7d627 into kubernetes-sigs:main Mar 7, 2023
// ModifyInstanceMetadataOptions modifies the metadata options of the given EC2 instance.
func (s *Service) ModifyInstanceMetadataOptions(instanceID string, options *infrav1.InstanceMetadataOptions) error {
input := &ec2.ModifyInstanceMetadataOptionsInput{
HttpEndpoint: aws.String(string(options.HTTPEndpoint)),
Copy link
Contributor

@wyike wyike Mar 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When creating a cluster, I feel I meet a regression exceptions continuously with latest code related to the options here. Paste the trace for your awareness...

E0308 07:31:06.070432      17 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 499 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x26b8c20, 0x440c890})
	/Users/yikew/Projects/pkg/mod/k8s.io/apimachinery@v0.25.5/pkg/util/runtime/runtime.go:75 +0xe0
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
	/Users/yikew/Projects/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.1/pkg/internal/controller/controller.go:110 +0xf0
panic({0x26b8c20, 0x440c890})
	/usr/local/go/src/runtime/panic.go:890 +0x264
sigs.k8s.io/cluster-api-provider-aws/v2/pkg/cloud/services/ec2.(*Service).ModifyInstanceMetadataOptions(0x400134be60, {0x4001053fc8, 0x13}, 0x0)
	/Users/yikew/Projects/src/github.com/kubernetes-sigs/cluster-api-provider-aws/pkg/cloud/services/ec2/instances.go:971 +0x34
sigs.k8s.io/cluster-api-provider-aws/v2/controllers.(*AWSMachineReconciler).ensureInstanceMetadataOptions(0x40005e06c0, {0x2ffb9b0, 0x400134be60}, 0x4000f734a0, 0x40011222c0)
	/Users/yikew/Projects/src/github.com/kubernetes-sigs/cluster-api-provider-aws/controllers/awsmachine_controller.go:1121 +0xf0
sigs.k8s.io/cluster-api-provider-aws/v2/controllers.(*AWSMachineReconciler).reconcileOperationalState(0x40005e06c0, {0x2ffb9b0, 0x400134be60}, 0x40015731a0, 0x4000f734a0)
	/Users/yikew/Projects/src/github.com/kubernetes-sigs/cluster-api-provider-aws/controllers/awsmachine_controller.go:616 +0x28c
sigs.k8s.io/cluster-api-provider-aws/v2/controllers.(*AWSMachineReconciler).reconcileNormal(0x40005e06c0, {0x40005b5090, 0x40010988e8}, 0x40015731a0, {0x2ffb7f0, 0x400020c310}, {0x3003088, 0x400020c310}, {0x3001be0, 0x400020c310}, ...)
	/Users/yikew/Projects/src/github.com/kubernetes-sigs/cluster-api-provider-aws/controllers/awsmachine_controller.go:588 +0x1624
sigs.k8s.io/cluster-api-provider-aws/v2/controllers.(*AWSMachineReconciler).Reconcile(0x40005e06c0, {0x2fee338, 0x4000b39680}, {{{0x40016b7a20, 0x7}, {0x4001030840, 0x16}}})
	/Users/yikew/Projects/src/github.com/kubernetes-sigs/cluster-api-provider-aws/controllers/awsmachine_controller.go:224 +0x93c
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x4000330c80, {0x2fee338, 0x4000b39680}, {{{0x40016b7a20, 0x7}, {0x4001030840, 0x16}}})
	/Users/yikew/Projects/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.1/pkg/internal/controller/controller.go:121 +0xec
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0x4000330c80, {0x2fee338, 0x4000b39680}, {0x27acba0, 0x40000f5f20})
	/Users/yikew/Projects/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.1/pkg/internal/controller/controller.go:320 +0x2f8
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0x4000330c80, {0x2fee290, 0x40001a4300})
	/Users/yikew/Projects/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.1/pkg/internal/controller/controller.go:273 +0x2b8
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
	/Users/yikew/Projects/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.1/pkg/internal/controller/controller.go:234 +0xa8
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
	/Users/yikew/Projects/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.1/pkg/internal/controller/controller.go:230 +0x400
E0308 07:31:06.070538      17 controller.go:326] "Reconciler error" err="panic: runtime error: invalid memory address or nil pointer dereference [recovered]" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="default/c6-control-plane-bswb4" namespace="default" name="c6-control-plane-bswb4" reconcileID=18c05f9f-f837-4366-ab61-e6ce9c225db8

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weird that this is not caught in E2E tests. @muraee could you please take a look.

Copy link
Contributor

@wyike wyike Mar 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It won't fail the cluster creation, only raise exceptions continuously in the capa logs.

Copy link
Member

@Ankitasw Ankitasw Mar 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its not getting set by defaults, I will raise another PR to fix this, might have missed author's comment regarding this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more thing is we noticed another issue with awsmachine webhooks so that could be the reason, thius defaults are not getting set.
I will wait for that issue to fix first.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx @Ankitasw, yes it seems the defaults are not getting set for some reason, Although I tried that before and it was working.

Copy link
Contributor

@cnmcavoy cnmcavoy Mar 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe I hit this exact issue in a separate change. Assuming this is the case, then using the defaulting webhook is not enough. This is roughly similar to the code I think is missing: https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/4112/files#diff-0b559bbd149f0e6d54d789235423f66fa8906fdc6ee9c99b9e85db912912011eR192

SetDefaults_AWSMachineSpec needs to be invoked from the AWSMachine controller reconciliation loop.

The patch site is also very important, see the comment.

	// The defaulting must happen before `NewMachineScope` is called since otherwise we keep detecting
	// differences that result in patch operations.

	... do defaulting

Specifically, what is suspect is happening is existing AWSMachines that were created before your change can't receive the defaults, since they already exist. So we need to also set the defaults in the controller, at least until the resource is upgraded to another major apiVersion, then we can stop defaulting in the controller.

e2e tests would have trouble catching this kind of issue, because you can only see them if you create a cluster on the previous version, then upgrade to this change. I guess AWSMachine controller reconciliation isn't tested in an upgrade scenario in the e2e?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#4147 should fix it

@wyike
Copy link
Contributor

wyike commented May 6, 2023

@muraee I didn't see where bastion machine is configured to IMDSv2 as the default. From my testing, it is still IMDSv1. Could you help double confirm? Do we need it to be IMDSv2 as well? If yes, I'll create an issue.

@muraee
Copy link
Contributor Author

muraee commented May 16, 2023

@wyike right, the bastion is not configured to use IMDSv2 in this change since it is created/managed by the awscluster_controller.
I am not sure if we want the bastion to default to IMDSv2. Please open an issue for further discussions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. kind/feature Categorizes issue or PR as related to a new feature. needs-priority ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Configuration of the EC2 metadata endpoint
6 participants