
kops-grid-scenario-aws-cloud-controller-manager failing because CSI controller is not becoming ready #169

Closed
nckturner opened this issue Jan 30, 2021 · 12 comments
Labels
kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test.

Comments

@nckturner
Contributor

Which jobs are failing:
kops-grid-scenario-aws-cloud-controller-manager

Which test(s) are failing:
All (cluster setup is failing)

Since when has it been failing:
2021/1/12

Testgrid link:

Reason for failure:
The ebs-csi-controller does not become ready.

Pod	kube-system/ebs-csi-controller-857f6c78bb-kbcvh	system-cluster-critical pod "ebs-csi-controller-857f6c78bb-kbcvh" is not ready (ebs-plugin)

Anything else we need to know:

/kind failing-test

@k8s-ci-robot k8s-ci-robot added the kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. label Jan 30, 2021
@rifelpet
Member

kubernetes/test-infra#20680 will provide logs from the ccm and csi controller pods which should help with troubleshooting this.

@nckturner
Contributor Author

@rifelpet that would be helpful. Is there any other way you'd recommend investigating this one? I'm going to try to figure out how to run kubernetes_e2e.py with similar arguments; let me know if you have advice there.

@rifelpet
Member

You should be able to skip all of the kubernetes_e2e.py machinery and just recreate some of the commands it shells out to:

export KOPS_BASE_URL=https://storage.googleapis.com/kops-ci/bin/1.20.0-alpha.2+43d294f4bd

wget $KOPS_BASE_URL/linux/amd64/kops
chmod +x ./kops

./kops create cluster --name e2e-kops-scenario-aw--8aec44b8ad.test-cncf-aws.k8s.io --ssh-public-key /workspace/.ssh/kube_aws_rsa.pub --node-count 4 --node-volume-size 48 --master-volume-size 48 --master-count 1 --zones eu-west-1a --master-size c5.large --kubernetes-version https://storage.googleapis.com/kubernetes-release/release/v1.19.7 --admin-access 34.68.122.92/32 --image 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20201201 --cloud aws --container-runtime=docker --override=cluster.spec.cloudControllerManager.cloudProvider=aws --override=cluster.spec.cloudConfig.awsEBSCSIDriver.enabled=true --override cluster.spec.nodePortAccess=0.0.0.0/0 --yes
./kops validate cluster e2e-kops-scenario-aw--8aec44b8ad.test-cncf-aws.k8s.io --wait 15m

Adjust the IP address, AZ, and cluster domain name as appropriate.

@nckturner
Contributor Author

Ah, that's easier. I'll give it a try.

@nckturner
Contributor Author

$ kubectl logs -n kube-system ebs-csi-controller-6fbfbcd7b-7cmfd ebs-plugin
I0131 09:04:32.598675       1 driver.go:68] Driver: ebs.csi.aws.com Version: v0.8.0
panic: EC2 instance metadata is not available

goroutine 1 [running]:
github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver.newControllerService(0xc0000af140, 0xc0000aaa20, 0x0, 0x16)
	/go/src/github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver/controller.go:78 +0x101
github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver.NewDriver(0xc00019ff48, 0x6, 0x6, 0xc0000aa9c0, 0xc000230ae0, 0xc00019ff18)
	/go/src/github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver/driver.go:88 +0x4e0
main.main()
	/go/src/github.com/kubernetes-sigs/aws-ebs-csi-driver/cmd/main.go:31 +0x1b9
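The panic above is the driver giving up because it cannot reach the EC2 instance metadata service. The symptom can be reproduced by hand from inside an affected pod by probing IMDSv2 directly (a sketch using the standard IMDSv2 token endpoint; with a hop limit of 1, the PUT response cannot make the extra network hop back into a non-hostNetwork pod, so the token request times out):

```shell
#!/bin/sh
# Probe IMDSv2 the same way the AWS SDK does: fetch a session token via
# PUT, then use it to read metadata. Falls through to an error message
# when the metadata endpoint is unreachable (e.g. hop limit too low).
probe_imds() {
  if TOKEN=$(curl -sf -m 5 -X PUT "http://169.254.169.254/latest/api/token" \
        -H "X-aws-ec2-metadata-token-ttl-seconds: 60"); then
    curl -sf -m 5 -H "X-aws-ec2-metadata-token: $TOKEN" \
        "http://169.254.169.254/latest/meta-data/instance-id"
  else
    echo "EC2 instance metadata is not available"
  fi
}
probe_imds
```

Run from a hostNetwork pod (or the node itself) this prints the instance ID; from a pod on the pod network behind a hop limit of 1 it prints the failure message, matching the driver's panic.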

@rifelpet
Member

rifelpet commented Jan 31, 2021

Ah, kops recently changed the default settings for IMDSv2 so I'm guessing the csi driver is built with an older version of aws-sdk-go and needs updating.

EDIT: actually, it looks like v0.8.0 was built with an SDK version from December 2020, so that is likely not the issue.

@olemarkus
Member

This happens because we set httpPutResponseHopLimit: 1 by default on the control plane as well. Before the CSI controller, everything that used instance metadata ran with host networking. Setting the hop limit to 2 makes the controller work again (on kubenet, at least).
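For anyone hitting the same problem, kops exposes the hop limit on the instance group spec; a sketch of the override (field names as documented in recent kops versions, so verify against the version in use):

```yaml
# kops edit ig <control-plane-instance-group>, then under spec:
spec:
  instanceMetadata:
    httpTokens: required
    httpPutResponseHopLimit: 2   # allow IMDSv2 responses to reach pod-network pods
```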

@nckturner
Contributor Author

Makes sense, thanks @olemarkus

@olemarkus
Member

Tests look better now. The cluster runs, and the tests that are failing now seem to be the same ones that failed earlier.

@nckturner
Contributor Author

Thanks @olemarkus!

@nckturner
Contributor Author

/close

@k8s-ci-robot
Contributor

@nckturner: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
