
AWS cluster fails to create - ebs-csi-controller stays pending #15335

Closed · daniejstriata opened this issue Apr 18, 2023 · 12 comments

Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@daniejstriata

/kind bug

1. What kops version are you running? The command kops version will display
this information.

1.25.4
1.26.2

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

Client Version: v1.26.4
Kustomize Version: v4.5.7
Server Version: v1.25.8

3. What cloud provider are you using?
aws
4. What commands did you run? What is the simplest way to reproduce this issue?
kops-1.25.4 create cluster --name=${NAME} --cloud=aws --zones=us-east-2a --discovery-store=s3://k8s-oidc-store --ssh-public-key ~/.ssh/srv.k8s.pub --yes
5. What happened after the commands executed?
The cluster came up with a control-plane node and a worker node, but creation never completed because validation does not get past:
Pod kube-system/ebs-csi-controller-6c85d9666b-6bbk7 system-cluster-critical pod "ebs-csi-controller-6c85d9666b-6bbk7" is pending

6. What did you expect to happen?
Creation of cluster
7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

```
W0418 16:58:49.238805 13028 get.go:78] kops get [CLUSTER] is deprecated: use `kops get all [CLUSTER]`
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2023-04-18T20:47:23Z"
  name: k8s.com
spec:
  api:
    dns: {}
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://kubert-store/k8s.com
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-us-east-2a
      name: a
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-us-east-2a
      name: a
    memoryRequest: 100Mi
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
    useServiceAccountExternalPermissions: true
  kubelet:
    anonymousAuth: false
  kubernetesApiAccess:
  - 0.0.0.0/0
  - ::/0
  kubernetesVersion: 1.25.8
  masterPublicName: api.k8s.com
  networkCIDR: 172.20.0.0/16
  networking:
    kubenet: {}
  nonMasqueradeCIDR: 100.64.0.0/10
  serviceAccountIssuerDiscovery:
    discoveryStore: s3://kubert-oidc-store/k8s.com
    enableAWSOIDCProvider: true
  sshAccess:
  - 0.0.0.0/0
  - ::/0
  subnets:
  - cidr: 172.20.32.0/19
    name: us-east-2a
    type: Public
    zone: us-east-2a
  topology:
    dns:
      type: Public
    masters: public
    nodes: public

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2023-04-18T20:47:23Z"
  labels:
    kops.k8s.io/cluster: k8s.com
  name: master-us-east-2a
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20230302
  instanceMetadata:
    httpPutResponseHopLimit: 3
    httpTokens: required
  machineType: t3.medium
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - us-east-2a

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2023-04-18T20:47:23Z"
  labels:
    kops.k8s.io/cluster: k8s.com
  name: nodes-us-east-2a
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20230302
  instanceMetadata:
    httpPutResponseHopLimit: 1
    httpTokens: required
  machineType: t3.medium
  maxSize: 1
  minSize: 1
  role: Node
  subnets:
  - us-east-2a
```
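As the deprecation warning at the top of this output notes, the same manifest dump can be produced with the newer command form; a minimal sketch using the reporter's cluster name:

```sh
# Non-deprecated form suggested by the warning above
kops get all k8s.com -o yaml
```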

8. Please run the commands with most verbose logging by adding the `-v 10` flag.
Paste the logs into this report, or in a gist and provide the gist link here.
https://gist.github.com/daniejstriata/a4a004ab5ccb9b69e161c0e7069cb37f

9. Anything else we need to know?
This used to work fine for me, but I can no longer create clusters using the same account I have always used.
@k8s-ci-robot added the kind/bug label on Apr 18, 2023
@olemarkus
Member

Can you describe the pod to see why it's pending? It may be that your cluster doesn't have enough capacity.
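For anyone else hitting this, a minimal sketch of how to follow that suggestion. The pod name is the one from this report and will differ per cluster, and the `app=ebs-csi-controller` label selector is an assumption based on the upstream aws-ebs-csi-driver labels:

```sh
# Locate the pending controller pod in kube-system
# (label selector is an assumption; `kubectl -n kube-system get pods | grep ebs-csi` also works)
kubectl -n kube-system get pods -l app=ebs-csi-controller -o wide

# Describe it and read the Events section for FailedScheduling reasons
kubectl -n kube-system describe pod ebs-csi-controller-6c85d9666b-6bbk7

# Check node capacity to rule out simple resource pressure
kubectl describe nodes | grep -A 7 "Allocated resources"
```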

@siddharth-sable

I'm facing the same issue. The cluster probably has enough capacity; it's a new cluster.

Here is the events section of the pod description:

[screenshot: pod events]

@michaelrosejr


I'm seeing the same error as @siddharth-sable and @daniejstriata as well. I've run the install process 3x in different regions and accounts. I ran through the same steps as above. My versions are a bit different though.

kops version: 1.27.0
kubectl version:
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.3", GitCommit:"25b4e43193bcda6c7328a6d147b1fb73a33f1598", GitTreeState:"clean", BuildDate:"2023-06-14T09:47:40Z", GoVersion:"go1.20.5", Compiler:"gc", Platform:"linux/amd64"}

@jmlineb

jmlineb commented Sep 19, 2023

I have the very same issue. After all these months, why is this still a problem? How do we overcome it? The cluster was launched in us-west-1b.

NODE STATUS
NAME                 ROLE           READY
i-0365de0b05ed828dc  control-plane  True
i-08f63ae3f85b8ea2e  node           True

VALIDATION ERRORS
KIND  NAME                                             MESSAGE
Pod   kube-system/ebs-csi-controller-7b87d58cdb-dlzbq  system-cluster-critical pod "ebs-csi-controller-7b87d58cdb-dlzbq" is pending

Validation Failed
W0919 13:13:38.115139 2445 validate_cluster.go:232] (will retry): cluster not yet healthy

Client Version: v1.28.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.27.5

@jmlineb

jmlineb commented Sep 19, 2023

Describing the pod shows an untolerated taint, just like the error message above. This seems to be a kops bug. How can it be overcome?

Type     Reason            Age                    From               Message
----     ------            ----                   ----               -------
Warning  FailedScheduling  10m                    default-scheduler  0/1 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
Warning  FailedScheduling  4m26s (x4 over 9m57s)  default-scheduler  0/2 nodes are available: 1 node(s) didn't match pod topology spread constraints, 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }. preemption: 0/2 nodes are available: 1 No preemption victims found for incoming pod, 1 Preemption is not helpful for scheduling..
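Those two messages suggest the pod's only candidate is the single worker node, and that node is rejected by the deployment's pod topology spread constraints. A hedged sketch for inspecting what the cluster actually rendered (standard kubectl field paths; the exact constraints kops generates may vary by version):

```sh
# Show the tolerations and topology spread constraints on the controller deployment
kubectl -n kube-system get deployment ebs-csi-controller \
  -o jsonpath='{.spec.template.spec.tolerations}{"\n"}{.spec.template.spec.topologySpreadConstraints}{"\n"}'

# Confirm which nodes carry the control-plane taint
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'
```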

@jmlineb

jmlineb commented Sep 19, 2023

I used kops to stand the cluster up in a different region, us-east-2a, and hit the very same error. This seems to be a pervasive kops issue, not an AWS region status issue.

NODE STATUS
NAME                 ROLE           READY
i-027f747e4ff07f556  control-plane  True
i-08905e9d6e5773fb0  node           True

VALIDATION ERRORS
KIND  NAME                                             MESSAGE
Pod   kube-system/ebs-csi-controller-7b87d58cdb-qw94l  system-cluster-critical pod "ebs-csi-controller-7b87d58cdb-qw94l" is pending

Validation Failed
W0919 13:40:37.456252 3239 validate_cluster.go:232] (will retry): cluster not yet healthy

@jmlineb

jmlineb commented Sep 19, 2023

OK, I think I found a workaround. You must specify more than one availability zone when you stand up the cluster. When I specified three instead of one, it worked!

INSTANCE GROUPS
NAME                      ROLE          MACHINETYPE  MIN  MAX  SUBNETS
control-plane-us-east-2a  ControlPlane  t3.medium    1    1    us-east-2a
nodes-us-east-2a          Node          t3.medium    1    1    us-east-2a
nodes-us-east-2b          Node          t3.medium    1    1    us-east-2b
nodes-us-east-2c          Node          t3.medium    1    1    us-east-2c

NODE STATUS
NAME                 ROLE           READY
i-043291d54fb0ba01e  node           True
i-05235191c1c007b93  node           True
i-061864c6ccf118702  control-plane  True
i-07be32219bc60db57  node           True

Your cluster myfirstcluster.k8s.local is ready
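For reference, a sketch of that workaround as a single command. It reuses the flags from the original report (the S3 bucket and SSH key path are the reporter's own values) and only widens --zones to three AZs, which gives the scheduler more than one candidate worker node:

```sh
# Same as the failing command in this report, but with three zones instead of one
kops create cluster \
  --name=${NAME} \
  --cloud=aws \
  --zones=us-east-2a,us-east-2b,us-east-2c \
  --discovery-store=s3://k8s-oidc-store \
  --ssh-public-key ~/.ssh/srv.k8s.pub \
  --yes
```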

@mmadrid

mmadrid commented Sep 22, 2023

Take a look at #15852

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Jan 28, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Feb 27, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot closed this as not planned (won't fix, can't repro, duplicate, stale) on Mar 28, 2024
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
