
kops-grid-scenario-aws-cloud-controller-manager failing because CSI controller is not becoming ready #169

Closed
nckturner opened this issue Jan 30, 2021 · 12 comments
Labels
kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test.

Comments

@nckturner
Contributor

Which jobs are failing:
kops-grid-scenario-aws-cloud-controller-manager

Which test(s) are failing:
All (cluster setup is failing)

Since when has it been failing:
2021/1/12

Testgrid link:

Reason for failure:
The ebs-csi-controller does not become ready.

Pod	kube-system/ebs-csi-controller-857f6c78bb-kbcvh	system-cluster-critical pod "ebs-csi-controller-857f6c78bb-kbcvh" is not ready (ebs-plugin)

Anything else we need to know:

/kind failing-test

@k8s-ci-robot k8s-ci-robot added the kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. label Jan 30, 2021
@rifelpet
Member

kubernetes/test-infra#20680 will provide logs from the ccm and csi controller pods which should help with troubleshooting this.

@nckturner
Contributor Author

@rifelpet that would be helpful. Is there any other way you'd recommend investigating this one? I'm going to try to figure out how to run kubernetes_e2e.py with similar arguments; let me know if you have advice there.

@rifelpet
Member

You should be able to skip all of the kubernetes_e2e.py machinery and just recreate some of the commands it shells out to:

export KOPS_BASE_URL=https://storage.googleapis.com/kops-ci/bin/1.20.0-alpha.2+43d294f4bd

wget $KOPS_BASE_URL/linux/amd64/kops
chmod +x ./kops

./kops create cluster --name e2e-kops-scenario-aw--8aec44b8ad.test-cncf-aws.k8s.io --ssh-public-key /workspace/.ssh/kube_aws_rsa.pub --node-count 4 --node-volume-size 48 --master-volume-size 48 --master-count 1 --zones eu-west-1a --master-size c5.large --kubernetes-version https://storage.googleapis.com/kubernetes-release/release/v1.19.7 --admin-access 34.68.122.92/32 --image 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20201201 --cloud aws --container-runtime=docker --override=cluster.spec.cloudControllerManager.cloudProvider=aws --override=cluster.spec.cloudConfig.awsEBSCSIDriver.enabled=true --override cluster.spec.nodePortAccess=0.0.0.0/0 --yes
./kops validate cluster e2e-kops-scenario-aw--8aec44b8ad.test-cncf-aws.k8s.io --wait 15m

Adjust the IP address, AZ, and cluster domain name as appropriate.

@nckturner
Contributor Author

Ah, that's easier. I'll give it a try.

@nckturner
Contributor Author

$ kubectl logs -n kube-system ebs-csi-controller-6fbfbcd7b-7cmfd ebs-plugin
I0131 09:04:32.598675       1 driver.go:68] Driver: ebs.csi.aws.com Version: v0.8.0
panic: EC2 instance metadata is not available

goroutine 1 [running]:
github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver.newControllerService(0xc0000af140, 0xc0000aaa20, 0x0, 0x16)
	/go/src/github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver/controller.go:78 +0x101
github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver.NewDriver(0xc00019ff48, 0x6, 0x6, 0xc0000aa9c0, 0xc000230ae0, 0xc00019ff18)
	/go/src/github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver/driver.go:88 +0x4e0
main.main()
	/go/src/github.com/kubernetes-sigs/aws-ebs-csi-driver/cmd/main.go:31 +0x1b9
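The panic above is the driver giving up because it cannot reach the EC2 instance metadata service. The symptom can be reproduced by hand from inside an affected pod by probing IMDSv2 directly (a sketch using the standard IMDSv2 token endpoint; with a hop limit of 1, the PUT response cannot make the extra network hop back into a non-hostNetwork pod, so the token request times out):

```shell
#!/bin/sh
# Probe IMDSv2 the same way the AWS SDK does: fetch a session token via
# PUT, then use it to read metadata. Falls through to an error message
# when the metadata endpoint is unreachable (e.g. hop limit too low).
probe_imds() {
  if TOKEN=$(curl -sf -m 5 -X PUT "http://169.254.169.254/latest/api/token" \
        -H "X-aws-ec2-metadata-token-ttl-seconds: 60"); then
    curl -sf -m 5 -H "X-aws-ec2-metadata-token: $TOKEN" \
        "http://169.254.169.254/latest/meta-data/instance-id"
  else
    echo "EC2 instance metadata is not available"
  fi
}
probe_imds
```

Run from a hostNetwork pod (or the node itself) this prints the instance ID; from a pod on the pod network behind a hop limit of 1 it prints the failure message, matching the driver's panic.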

@rifelpet
Member

rifelpet commented Jan 31, 2021

Ah, kops recently changed the default settings for IMDSv2 so I'm guessing the csi driver is built with an older version of aws-sdk-go and needs updating.

EDIT: actually, it looks like v0.8.0 was built with an SDK version from December 2020, so that is likely not the issue.

@olemarkus
Member

This happens because we set httpPutResponseHopLimit: 1 by default on the control plane as well. Before the CSI controller, everything that used instance metadata ran with host networking. Setting the hop limit to 2 makes the controller work again (on kubenet, at least).
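For anyone hitting the same problem, kops exposes the hop limit on the instance group spec; a sketch of the override (field names as documented in recent kops versions, so verify against the version in use):

```yaml
# kops edit ig <control-plane-instance-group>, then under spec:
spec:
  instanceMetadata:
    httpTokens: required
    httpPutResponseHopLimit: 2   # allow IMDSv2 responses to reach pod-network pods
```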

@nckturner
Contributor Author

Makes sense, thanks @olemarkus

@olemarkus
Member

Tests look better now. The cluster runs, and the tests that are failing now seem to be the same ones that failed earlier.

@nckturner
Contributor Author

Thanks @olemarkus!

@nckturner
Contributor Author

/close

@k8s-ci-robot
Contributor

@nckturner: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
