
PersistentVolume on EBS can be created in availability zones with no nodes #34583

Closed
buildmaster opened this Issue Oct 12, 2016 · 33 comments


buildmaster commented Oct 12, 2016

Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"4", GitVersion:"v1.4.1", GitCommit:"33cf7b9acbb2cb7c9c72a10d6636321fb180b159", GitTreeState:"clean", BuildDate:"2016-10-10T18:19:49Z", GoVersion:"go1.7.1", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"4", GitVersion:"v1.4.0", GitCommit:"a16c0a7f71a6f93c7e0f222d961f4675cd97a46b", GitTreeState:"clean", BuildDate:"2016-09-26T18:10:32Z", GoVersion:"go1.6.3", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Cloud provider or hardware configuration: aws
  • Install tools: kops default install

What happened:
An AWS EBS-backed PersistentVolumeClaim created a volume in an AZ where there was a master but no nodes, making the volume unusable.

What you expected to happen:
Zones with a master but no nodes shouldn't be used to create the EBS volume.

How to reproduce it (as minimally and precisely as possible):

  • AWS setup across 3 AZs in one region
  • Masters in all three AZs
  • Nodes in 2 AZs (so one AZ has a Master but no Nodes)
  • Create a StorageClass that uses EBS without specifying an AZ:

kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
  name: ebs
provisioner: kubernetes.io/aws-ebs

  • Create a PVC against the StorageClass:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: data
  annotations:
    volume.beta.kubernetes.io/storage-class: "ebs"
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 8Gi

  • The PVC can create a volume in an availability zone with no nodes

Anything else we need to know:
Mentioned on the #sig-aws Slack channel with @justinsb, who asked me to raise this.

mengqiy (Contributor) commented Oct 13, 2016

@kubernetes/sig-storage

thockin (Member) commented Oct 13, 2016

@jsafrane


jsafrane (Member) commented Oct 13, 2016

Running a Kubernetes master in a zone that has no running nodes is quite unusual, and I don't think it's a typical case we should support.

The AWS cloud provider lists all machines in the cluster and chooses a random zone that has at least one running instance; it does not seem to distinguish master and node machines. IMO, GCE has the same problem: it lists all zones in the region where the master runs and does not distinguish masters and nodes either.

We could either introduce a new AWS instance tag that records the master/node role of the instance, or move the zone decision into the volume plugin itself, where we have a kubeclient and can filter nodes relatively easily; however, that would depend on an admission controller adding a zone label to nodes.

@justinsb, thoughts?

As a third option, we could rely on the Kubernetes admin to list all usable zones in the StorageClass in this unusual case. Right now we support only a single zone in the "zone" parameter; we could allow a list:

kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
  name: ebs
provisioner: kubernetes.io/aws-ebs
parameters:
  zone: "us-east-1a,us-east-1b,us-east-1c"

buildmaster commented Oct 16, 2016

Just to add to this: the reason it's a big deal is that I don't actually have full control over where the nodes run, since they're in an Auto Scaling group spanning all AZs. You could end up with a master in AZ a and nodes in b and c simply because that's what the Auto Scaling group picked.

stevesloka (Contributor) commented Jan 15, 2017

I just ran into this; however, I only have nodes/masters in 3 zones, and the PV was created in a 4th zone. Right now I can't even spin up a node in that last AZ, since my infrastructure doesn't support it.

~/dev ❯❯❯ kubectl get no --show-labels
NAME                          STATUS    AGE       LABELS
ip-10-0-70-253.ec2.internal   Ready     2d        beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=t2.large,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1c,kubernetes.io/hostname=ip-10-0-70-253.ec2.internal
ip-10-0-71-238.ec2.internal   Ready     2d        beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=t2.large,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1d,kubernetes.io/hostname=ip-10-0-71-238.ec2.internal
ip-10-0-72-59.ec2.internal    Ready     2d        beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=t2.large,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1e,kubernetes.io/hostname=ip-10-0-72-59.ec2.internal

~/dev ❯❯❯ kubectl describe pv pvc-288e7b9c-d9d8-11e6-9d58-0a8828a2b648
Name:		pvc-288e7b9c-d9d8-11e6-9d58-0a8828a2b648
Labels:		failure-domain.beta.kubernetes.io/region=us-east-1
		failure-domain.beta.kubernetes.io/zone=us-east-1b
StorageClass:	default
Status:		Bound
Claim:		default/es-data-es-data-0
Reclaim Policy:	Delete
Access Modes:	RWO
Capacity:	100Gi
Message:	
Source:
    Type:	AWSElasticBlockStore (a Persistent Disk resource in AWS)
    VolumeID:	aws://us-east-1b/vol-0f2110cfa5904e4ba
    FSType:	ext4
    Partition:	0
    ReadOnly:	false
No events.

~/dev ❯❯❯ kubectl describe po es-data-0
...
  FirstSeen	LastSeen	Count	From			SubObjectPath	Type		Reason			Message
  ---------	--------	-----	----			-------------	--------	------			-------
  1d		1m		6055	{default-scheduler }			Warning		FailedScheduling	pod (es-data-0) failed to fit in any node
fit failure summary on nodes : NoVolumeZoneConflict (3)

mastercactapus commented Jan 15, 2017

@jsafrane I think it may be about more than just whether an AZ has nodes or only masters.

As an additional use case, consider pods that must run on a "dedicated"-tenancy node in AWS (which carries a baseline cost per zone) for compliance. Since masters are supposed to be an odd number (i.e., across 3 zones), it makes sense to have 3 zones instead of 2. But for a pod that has to be scheduled on a "dedicated"-tenancy node, there may be only 2 zones with "dedicated" nodes because of that cost. Even if there are nodes in all zones, a pod may need to be scheduled on a specific node or subset of nodes.

More information about node selection:
https://kubernetes.io/docs/user-guide/node-selection/

More information about why this specific case would be necessary here:
https://d0.awsstatic.com/whitepapers/compliance/AWS_HIPAA_Compliance_Whitepaper.pdf

If PVs are used by pods, and pods must be able to access them from whatever nodes they are constrained or scheduled to, then whatever solution is implemented needs to account for that.
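
To make that constraint concrete, here is a minimal sketch of the kind of pod described above, assuming a user-applied tenancy: dedicated node label (the label, pod name, image, and claim name are all hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: compliance-app        # hypothetical name
spec:
  nodeSelector:
    tenancy: dedicated        # hypothetical user-applied node label
  containers:
  - name: app
    image: nginx
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: data         # hypothetical PVC name
```

If the dynamically provisioned volume behind that claim lands in a zone with no dedicated-tenancy nodes, the pod can never be scheduled.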

jsafrane (Member) commented Jan 16, 2017

@stevesloka, do you have any other non-Kubernetes instances in us-east-1b in your AWS account?

Did you try tagging the AWS instances that run Kubernetes with the tag KubernetesCluster=steve? All AWS instances in one Kubernetes cluster must share the same tag value, so that multiple Kubernetes clusters can run in one AWS account. This tag is not documented; see #39178.

stevesloka (Contributor) commented Jan 17, 2017

@jsafrane There is another non-k8s instance running in us-east-1b, but it's not related (created via Beanstalk).

I added labels to all my nodes/controller as you suggested, and it still created the PV in us-east-1b.

I may bail on StatefulSet for now and manage my PVs by hand, since I'm building out an operator and it won't be that painful, but I would like to see this addressed.

I'm also willing to help out if you can point me to the places where this logic lives. I'll also try to dig in and see if I can find it myself.

naphthalene commented Jan 27, 2017

It would be ideal for the StorageClass to accept an expression or wildcard zone under the aws provisioner. Ideally I shouldn't have to care where my pods are running or where the EBS volumes are created.

jsafrane (Member) commented Jan 31, 2017

We have PR #38505, which adds a "zones:" parameter taking a list of zones instead of the single "zone:". No wildcards, though.
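
For reference, a StorageClass using the list form might look something like this (the class name and zone values are illustrative, and assume the zones parameter lands as proposed in #38505):

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
  name: ebs
provisioner: kubernetes.io/aws-ebs
parameters:
  # list only the zones that actually have schedulable nodes
  zones: "us-east-1a,us-east-1b"
```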

dforste commented Feb 1, 2017

I just hit this as well. I think moving the "zone" decision closer to pod scheduling would be best: perhaps wait until a PVC is consumed by a pod before creating the volume in a zone. That way corner cases like this are not a problem.

jingxu97 (Contributor) commented Feb 9, 2017

Just want to check: is PR #38505 good enough to solve the issue raised here?

redbaron (Contributor) commented Feb 18, 2017

@dforste is right. Another use case: if a pod consumes more than one PVC, the associated PVs might be created in different zones, and the pod then becomes unschedulable, since no node belongs to more than one AZ. The zone decision when provisioning a PV must be made as late as possible, whereas currently it seems to be the very first thing done.

justinsb added a commit to justinsb/kubernetes that referenced this issue Feb 19, 2017

AWS: Skip instances that are tagged as a master
We recognize a few AWS tags, and skip over masters when finding zones
for dynamic volumes.  This will fix kubernetes#34583.

This is not perfect, in that really the scheduler is the only component
that can correctly choose the zone, but should address the common
problem.

k8s-merge-robot added a commit that referenced this issue Mar 1, 2017

Merge pull request #41702 from justinsb/fix_34583
Automatic merge from submit-queue (batch tested with PRs 38676, 41765, 42103, 41833, 41702)

AWS: Skip instances that are tagged as a master

We recognize a few AWS tags, and skip over masters when finding zones
for dynamic volumes.  This will fix #34583.

This is not perfect, in that really the scheduler is the only component
that can correctly choose the zone, but should address the common
problem.

```release-note
AWS: Do not consider master instance zones for dynamic volume creation
```

redbaron (Contributor) commented Mar 1, 2017

I don't think this can be closed yet, for two reasons:

  1. Those magic tags are not documented anywhere; how are people going to know to use them?
  2. Can't it still create a volume in a zone where a pod with a given PVC can never be scheduled (say it has node affinity configured for a node in a particular AZ)? See the sketch below.
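
On point 2, a minimal sketch of the conflict (the pod name, image, claim name, and zones are hypothetical): a pod pinned by node affinity to one zone can never run if its dynamically provisioned volume is created in another.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: zone-pinned            # hypothetical name
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: failure-domain.beta.kubernetes.io/zone
            operator: In
            values:
            - us-east-1a       # the pod may only run here...
  containers:
  - name: app
    image: nginx
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: data          # ...but this claim's volume may be provisioned in us-east-1b
```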

mccormd commented Jun 8, 2017

Would it make sense to say that dynamic creation of a volume should happen only once a PVC has been attached to a pod that has been successfully scheduled to a node? Then the availability zone of the volume would match that of the selected node.

dforste commented Jun 8, 2017

Yes, I would agree with that statement. The PVC should be created once we know the node and the AZ it is a part of.

colinmorelli commented Feb 8, 2018

Just to add another use case to this: I just spun up a new cluster in AWS, with masters spanning us-east-1a, us-east-1b, and us-east-1d. After creating the ASG for nodes, I learned that AWS is out of instance capacity in us-east-1d for that particular instance type, so I currently don't have nodes in us-east-1d (until capacity is restored and I can move an instance over).

The reason this is different is that, in this case, I don't want to tell k8s to use only a subset of AZs. I want it to deploy across all AZs that are available; it just so happens that at this moment not all AZs are available.

@mccormd's suggestion above would work, I believe: schedule the pod first, then provision the volume. That might cause other problems, though.

ashb commented Jun 27, 2018

I just ran into this today: the PVC was created in eu-west-1c, but the only node with enough memory to run the pod was in eu-west-1b.

Events:
  Type     Reason            Age              From               Message
  ----     ------            ----             ----               -------
  Warning  FailedScheduling  2s (x9 over 2m)  default-scheduler  0/3 nodes are available: 1 Insufficient memory, 1 node(s) had no available volume zone, 1 node(s) had taints that the pod didn't tolerate.

This was a new pod and a new volume claim. Can we have this re-opened at least please?

ValeryNo commented Jun 27, 2018

We ultimately had to get rid of all persistent volume claims on AWS because of this.
For really large clusters with lots of reserved capacity in all availability zones that wouldn't be such a big deal, but I'd imagine most companies are trying to optimize their costs and don't have that many nodes pre-provisioned just for the sake of it.
Still looking for a solution that is easy to configure and AZ-failure friendly.

deitch commented Jun 27, 2018

Still looking for a solution that is easy to configure and AZ-failure friendly

Right now, there isn't really. You have a few options:

  • EFS - it is regional, not zonal. You have to be OK with NFS as the protocol and its potential performance implications. To be fair, NFS has gotten quite good; I know companies using it to store critical data. I am not an NFS expert, though, so don't ask my opinions here...
  • Commercial solutions - Portworx and StorageOS both provide local storage, backed by local disk or EBS, but replicated at the app level (their app) between nodes, across zones, so you don't depend on zonal resources (like EBS).
  • DIY - have fun. Ceph, some other solutions.

For one use case with which I am involved, we are living with it, and using EFS where necessary. We don't love it, but the resources to implement something else aren't there now.

ashb commented Jun 27, 2018

Would volumeBindingMode: WaitForFirstConsumer in a StorageClass go some way toward fixing the issue, at least for new claims?

From https://kubernetes.io/blog/2018/04/13/local-persistent-volumes-beta/

Edit: I tried this, and the kube-controller-manager logs say:

'WaitForFirstConsumer' waiting for first consumer to be created before binding

but the pod still fails scheduling with "2 node(s) didn't find available persistent volumes to bind."

stevesloka (Contributor) commented Jun 27, 2018

I solved this in the Elastic Operator by creating one StorageClass per zone and spreading work across zones that way. You need to make sure your app is zone-tolerant. For example, the Elastic operator sets up sharding and replication across zones, so if one AZ goes down, the other two still have enough of the data to keep functioning.
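
A rough sketch of that per-zone approach, assuming the aws-ebs provisioner's zone parameter (the class name, type, and zone are illustrative): one such class is created per zone, and each member of the workload is pointed at the class for its zone.

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: ebs-us-east-1c         # one class per zone; name is illustrative
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
  zone: us-east-1c             # pin volumes from this class to a single zone
```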

deitch commented Jun 27, 2018

@stevesloka Doesn't each pod then need access to storage from a different class (appropriate for the zone it is in)? How does that work?

willtrking commented Sep 13, 2018

Just ran into this as well: running workload-specific nodes in only 2 zones for development (to save cost over 3), I had an EBS volume provisioned in the zone with no instance.

ddebroy (Member) commented Sep 13, 2018

#65730 should address the above scenarios when you set volumeBindingMode: WaitForFirstConsumer in the StorageClass. The plumbing for WaitForFirstConsumer went into EBS in 1.12 as a beta feature.
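
For anyone trying this, a StorageClass with delayed binding would look roughly like the following (the class name and type parameter are illustrative); with this mode the EBS volume is provisioned only after the consuming pod is scheduled, so it is created in that node's zone.

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: ebs-wait               # illustrative name
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
# delay provisioning until a pod using the claim is scheduled,
# so the volume lands in that node's zone
volumeBindingMode: WaitForFirstConsumer
```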

nicroto commented Oct 10, 2018

I have a similar issue:

  • 3 masters + 3 nodes in 3 regions. I have a PV (created in region A) for data backups written by a CronJob. When the cron job gets scheduled on a node in a different region (region A's CPU is almost fully allocated, so the master chooses a node in a different region), it can't start because the PV is not accessible.

Is there a way to avoid this issue?

chandresh-pancholi commented Oct 15, 2018

I am also facing the same issue with Kubernetes 1.10.
Nodes are running in two regions, and the PV got created in the same region; still, I am getting:

0/2 nodes are available: 2 node(s) had no available volume zone.

lachlancooper commented Oct 15, 2018

As mentioned above, this feature is now in beta in 1.12: https://kubernetes.io/docs/concepts/storage/storage-classes/#volume-binding-mode

frankgu968 commented Oct 23, 2018

Are there any workarounds for Amazon EKS to address the above issue?

I am trying to make a multi-AZ JupyterHub deployment, and changing zones causes the "no available volume zone" error...

markyjackson-taulia commented Oct 29, 2018

Is this still in progress? I just ran into this issue as well.

chandresh-pancholi commented Oct 29, 2018

We moved to GCP with version 1.10 and didn't face the issue.

KristianWindsor commented Jan 1, 2019

I'm also having this issue...

  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  14s (x15 over 3m20s)  default-scheduler  0/2 nodes are available: 2 node(s) had no available volume zone.

So there's no solution to this?

AdeOpe commented Jan 15, 2019

I'm also having this issue on AWS EKS:

Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  2m (x841 over 1h)  default-scheduler  0/2 nodes are available: 1 Insufficient pods, 1 node(s) had no available volume zone.

Losing hope over here :(
