
AWS cluster fails to create - ebs-csi-controller stays pending #15335

Closed · daniejstriata opened this issue Apr 18, 2023 · 12 comments

Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@daniejstriata

/kind bug

1. What kops version are you running? The command kops version will display
this information.

1.25.4
1.26.2

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

Client Version: v1.26.4
Kustomize Version: v4.5.7
Server Version: v1.25.8

3. What cloud provider are you using?
aws
4. What commands did you run? What is the simplest way to reproduce this issue?
kops-1.25.4 create cluster --name=${NAME} --cloud=aws --zones=us-east-2a --discovery-store=s3://k8s-oidc-store --ssh-public-key ~/.ssh/srv.k8s.pub --yes
5. What happened after the commands executed?
The cluster came up with a control-plane node and a worker node, but creation never completed because validation does not get past:
Pod kube-system/ebs-csi-controller-6c85d9666b-6bbk7 system-cluster-critical pod "ebs-csi-controller-6c85d9666b-6bbk7" is pending

6. What did you expect to happen?
Creation of cluster
7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

```
W0418 16:58:49.238805 13028 get.go:78] kops get [CLUSTER] is deprecated: use `kops get all [CLUSTER]`
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2023-04-18T20:47:23Z"
  name: k8s.com
spec:
  api:
    dns: {}
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://kubert-store/k8s.com
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-us-east-2a
      name: a
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-us-east-2a
      name: a
    memoryRequest: 100Mi
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
    useServiceAccountExternalPermissions: true
  kubelet:
    anonymousAuth: false
  kubernetesApiAccess:
  - 0.0.0.0/0
  - ::/0
  kubernetesVersion: 1.25.8
  masterPublicName: api.k8s.com
  networkCIDR: 172.20.0.0/16
  networking:
    kubenet: {}
  nonMasqueradeCIDR: 100.64.0.0/10
  serviceAccountIssuerDiscovery:
    discoveryStore: s3://kubert-oidc-store/k8s.com
    enableAWSOIDCProvider: true
  sshAccess:
  - 0.0.0.0/0
  - ::/0
  subnets:
  - cidr: 172.20.32.0/19
    name: us-east-2a
    type: Public
    zone: us-east-2a
  topology:
    dns:
      type: Public
    masters: public
    nodes: public

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2023-04-18T20:47:23Z"
  labels:
    kops.k8s.io/cluster: k8s.com
  name: master-us-east-2a
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20230302
  instanceMetadata:
    httpPutResponseHopLimit: 3
    httpTokens: required
  machineType: t3.medium
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - us-east-2a

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2023-04-18T20:47:23Z"
  labels:
    kops.k8s.io/cluster: k8s.com
  name: nodes-us-east-2a
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20230302
  instanceMetadata:
    httpPutResponseHopLimit: 1
    httpTokens: required
  machineType: t3.medium
  maxSize: 1
  minSize: 1
  role: Node
  subnets:
  - us-east-2a
```
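As the deprecation warning at the top of this output notes, the same manifest dump can be produced with the newer command form; a minimal sketch using the reporter's cluster name:

```sh
# Non-deprecated form suggested by the warning above
kops get all k8s.com -o yaml
```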

8. Please run the commands with most verbose logging by adding the `-v 10` flag.
Paste the logs into this report, or in a gist and provide the gist link here.
https://gist.github.com/daniejstriata/a4a004ab5ccb9b69e161c0e7069cb37f

9. Anything else we need to know?
This used to work fine for me, but I can no longer create clusters using the same account I have always used.
@k8s-ci-robot added the kind/bug label on Apr 18, 2023
@olemarkus
Member

Can you describe the pod to see why it's pending? It may be that your cluster doesn't have enough capacity.
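For anyone else hitting this, a minimal sketch of how to follow that suggestion. The pod name is the one from this report and will differ per cluster, and the `app=ebs-csi-controller` label selector is an assumption based on the upstream aws-ebs-csi-driver labels:

```sh
# Locate the pending controller pod in kube-system
# (label selector is an assumption; `kubectl -n kube-system get pods | grep ebs-csi` also works)
kubectl -n kube-system get pods -l app=ebs-csi-controller -o wide

# Describe it and read the Events section for FailedScheduling reasons
kubectl -n kube-system describe pod ebs-csi-controller-6c85d9666b-6bbk7

# Check node capacity to rule out simple resource pressure
kubectl describe nodes | grep -A 7 "Allocated resources"
```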

@siddharth-sable

I'm facing the same issue. The cluster probably has enough capacity; it's a new cluster.

Here is the events section of the pod description:

[screenshot: pod events]

@michaelrosejr


I'm seeing the same error as @siddharth-sable and @daniejstriata as well. I've run the install process 3x in different regions and accounts. I ran through the same steps as above. My versions are a bit different though.

kops version: 1.27.0
kubectl version:
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.3", GitCommit:"25b4e43193bcda6c7328a6d147b1fb73a33f1598", GitTreeState:"clean", BuildDate:"2023-06-14T09:47:40Z", GoVersion:"go1.20.5", Compiler:"gc", Platform:"linux/amd64"}

@jmlineb

jmlineb commented Sep 19, 2023

I have the very same issue. After all these months, why is this still a problem? How do we overcome it? The cluster was launched in us-west-1b.

NODE STATUS
NAME                 ROLE           READY
i-0365de0b05ed828dc  control-plane  True
i-08f63ae3f85b8ea2e  node           True

VALIDATION ERRORS
KIND  NAME                                             MESSAGE
Pod   kube-system/ebs-csi-controller-7b87d58cdb-dlzbq  system-cluster-critical pod "ebs-csi-controller-7b87d58cdb-dlzbq" is pending

Validation Failed
W0919 13:13:38.115139 2445 validate_cluster.go:232] (will retry): cluster not yet healthy

Client Version: v1.28.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.27.5

@jmlineb

jmlineb commented Sep 19, 2023

Describing the pod shows an untolerated taint, just like the error message above. This seems to be a kops bug. How can it be overcome?

Type     Reason            Age                    From               Message
----     ------            ----                   ----               -------
Warning  FailedScheduling  10m                    default-scheduler  0/1 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
Warning  FailedScheduling  4m26s (x4 over 9m57s)  default-scheduler  0/2 nodes are available: 1 node(s) didn't match pod topology spread constraints, 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }. preemption: 0/2 nodes are available: 1 No preemption victims found for incoming pod, 1 Preemption is not helpful for scheduling..
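Those two messages suggest the pod's only candidate is the single worker node, and that node is rejected by the deployment's pod topology spread constraints. A hedged sketch for inspecting what the cluster actually rendered (standard kubectl field paths; the exact constraints kops generates may vary by version):

```sh
# Show the tolerations and topology spread constraints on the controller deployment
kubectl -n kube-system get deployment ebs-csi-controller \
  -o jsonpath='{.spec.template.spec.tolerations}{"\n"}{.spec.template.spec.topologySpreadConstraints}{"\n"}'

# Confirm which nodes carry the control-plane taint
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'
```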

@jmlineb

jmlineb commented Sep 19, 2023

I used kops to stand the cluster up in a different region, us-east-2a, and hit the very same error. This seems to be a pervasive kops issue, not an AWS region status issue.

NODE STATUS
NAME                 ROLE           READY
i-027f747e4ff07f556  control-plane  True
i-08905e9d6e5773fb0  node           True

VALIDATION ERRORS
KIND  NAME                                             MESSAGE
Pod   kube-system/ebs-csi-controller-7b87d58cdb-qw94l  system-cluster-critical pod "ebs-csi-controller-7b87d58cdb-qw94l" is pending

Validation Failed
W0919 13:40:37.456252 3239 validate_cluster.go:232] (will retry): cluster not yet healthy

@jmlineb

jmlineb commented Sep 19, 2023

OK, I think I found a workaround. You must specify more than one availability zone when you stand up the cluster. When I specified three instead of one, it worked!

INSTANCE GROUPS
NAME                      ROLE          MACHINETYPE  MIN  MAX  SUBNETS
control-plane-us-east-2a  ControlPlane  t3.medium    1    1    us-east-2a
nodes-us-east-2a          Node          t3.medium    1    1    us-east-2a
nodes-us-east-2b          Node          t3.medium    1    1    us-east-2b
nodes-us-east-2c          Node          t3.medium    1    1    us-east-2c

NODE STATUS
NAME                 ROLE           READY
i-043291d54fb0ba01e  node           True
i-05235191c1c007b93  node           True
i-061864c6ccf118702  control-plane  True
i-07be32219bc60db57  node           True

Your cluster myfirstcluster.k8s.local is ready
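For reference, a sketch of that workaround as a single command. It reuses the flags from the original report (the S3 bucket and SSH key path are the reporter's own values) and only widens --zones to three AZs, which gives the scheduler more than one candidate worker node:

```sh
# Same as the failing command in this report, but with three zones instead of one
kops create cluster \
  --name=${NAME} \
  --cloud=aws \
  --zones=us-east-2a,us-east-2b,us-east-2c \
  --discovery-store=s3://k8s-oidc-store \
  --ssh-public-key ~/.ssh/srv.k8s.pub \
  --yes
```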

@mmadrid

mmadrid commented Sep 22, 2023

Take a look at #15852

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Jan 28, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Feb 27, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot closed this as not planned (won't fix, can't repro, duplicate, stale) on Mar 28, 2024
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
