
volumes/nfs example: service name instead hardcoded IP #44528

Closed
z1nkum opened this issue Apr 15, 2017 · 44 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/storage Categorizes an issue or PR as relevant to SIG Storage.

Comments

@z1nkum

z1nkum commented Apr 15, 2017

Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.1", GitCommit:"82450d03cb057bab0950214ef122b67c83fb11df", GitTreeState:"clean", BuildDate:"2016-12-14T00:57:05Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.6", GitCommit:"114f8911f9597be669a747ab72787e0bd74c9359", GitTreeState:"clean", BuildDate:"2017-03-28T13:36:31Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Cloud provider or hardware configuration: GCP

What happened:
I've tried this great nfs example here

I tried to use the service name instead of a hardcoded IP in the PersistentVolume configuration, but it does not work: the pod status shows "Failed to resolve server nfs-server.default.svc.cluster.local: Temporary failure in name resolution". At the same time, I can ping this FQDN (and just the host nfs-server) from other containers.

I saw this note in README.md:

In the future, we'll be able to tie these together using the service names, but for now, you have to hardcode the IP.

So, what are your plans for this feature?
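
For reference, a minimal sketch of the PersistentVolume I'm describing (volume name, size, and path are illustrative, not taken from the example):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv
spec:
  capacity:
    storage: 1Mi
  accessModes:
    - ReadWriteMany
  nfs:
    # Using the service DNS name instead of the service's cluster IP;
    # this is what fails with "Temporary failure in name resolution".
    server: nfs-server.default.svc.cluster.local
    path: "/"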

@wongma7
Contributor

wongma7 commented Apr 17, 2017

Old issue here: #8735. The message in the README is about as old. I would be interested in this too :)

@saad-ali saad-ali added sig/network Categorizes an issue or PR as relevant to SIG Network. sig/storage Categorizes an issue or PR as relevant to SIG Storage. labels Apr 21, 2017
@saad-ali
Member

CC @kubernetes/sig-storage-feature-requests

@thockin thockin removed the sig/network Categorizes an issue or PR as relevant to SIG Network. label May 19, 2017
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 24, 2017
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 23, 2018
@jtgans

jtgans commented Feb 4, 2018

Ran into this myself while attempting to configure my StorageClasses to speak to a heketi GlusterFS pod by service name. When I rebooted my cluster, the cluster IP changed, which broke my GlusterFS storage solution.

@msau42
Member

msau42 commented Feb 4, 2018

I believe the issue is that the node's resolv.conf needs to be configured to point to Kubernetes' DNS service.

This is a host configuration that needs to be done per deployment. I believe we do it automatically for GCE/GKE but I'm unsure about other environments. cc @jingxu97

@jtgans

jtgans commented Feb 4, 2018 via email

@msau42
Member

msau42 commented Feb 5, 2018

Hm maybe @kubernetes/sig-network-misc knows something that can be done here.

The problem is that volume mounts are done by the kubelet, so the NFS server IP/hostname needs to be resolvable and reachable from the kubelet's network.

@k8s-ci-robot k8s-ci-robot added the sig/network Categorizes an issue or PR as relevant to SIG Network. label Feb 5, 2018
@jtgans

jtgans commented Feb 5, 2018

Which on a baremetal deployment should be the node's/host's network, correct?

@msau42
Member

msau42 commented Feb 5, 2018

Correct. So if your NFS server is being provided by a Pod, then you need the node/host network to be able to access the Pod's network, which, like you pointed out, could be tricky depending on how you've configured your network.

@jtgans

jtgans commented Feb 5, 2018

Okay, so I see two solutions to this problem, then:

  1. Adjust the Service for kube-dns from ClusterIP to NodePort and adjust the node's /etc/resolv.conf to point to the local IPs to get name resolution working.
  2. Adjust the Service for NFS or GlusterFS (in my case) from ClusterIP to NodePort, and then change the StorageClass to point to one of the node's static IPs in the node's subnet.

Of the two, the first seems like the more generic solution for getting name resolution working across the cluster, but it may have unintended side effects if things are set up to expect kube-dns as a ClusterIP service. The second solves this direct problem. I'll try the second option when I get home tonight and report back.

We may want to update the public-facing docs to mention that StorageClass definitions use the kubelet's network so that others don't run aground when trying to set this up.

@yaumeg

yaumeg commented Feb 20, 2018

Did you have a chance to experiment a bit? For solution 2, NodePort assigns ports in the 30000-32767 range, so it seems it would also be necessary to change the NFS PV's default ports (2049 & 111)?

I have a baremetal setup with flannel, and editing resolv.conf doesn't work because, like you pointed out, my nodes don't have access to the containers' network.

@jtgans

jtgans commented Feb 25, 2018

Unfortunately, I can't /edit/ the restUrl for my storageclass because "updating parameters is illegal"... I may end up losing data by deleting the storage class and recreating it with the right url.

@jtgans

jtgans commented Feb 25, 2018

Fortunately my analysis was wrong -- didn't lose any data at all, thankfully.

So what I've done is update my Services to be NodePorts, exposed them on port 32708, and set the resturls in my StorageClasses to http://<random-node-ip-from-cluster>:32708. This allows things to continue to work, but there are two major downsides now:

  1. My REST endpoint is available from outside of the cluster.
  2. If the node that I chose in the rest URL falls over, my storage provisioner becomes unreachable, because I can't edit the resturls once I've set them.

But things work at least.
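
For reference, a rough sketch of the shape this ends up as (the class name, node IP, and port are placeholders, not my actual values):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: glusterfs
provisioner: kubernetes.io/glusterfs
parameters:
  # Heketi REST endpoint exposed via a NodePort Service;
  # the node IP and port below are placeholders.
  resturl: "http://10.0.0.12:32708"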

@MrHohn
Member

MrHohn commented Feb 25, 2018

When I rebooted my cluster, the cluster IP changed, which broke my glusterfs storage solution.

@jtgans Sorry if I'm missing the point, but why does rebooting the cluster change the cluster IP? Did the NFS service get deleted and recreated? If so, what about giving the NFS service a fixed IP in the manifest and keeping it as type=ClusterIP?
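
Something along these lines (untested sketch; the IP is only an example and must be a free address inside your cluster's service CIDR, and the selector/ports depend on your NFS server deployment):

apiVersion: v1
kind: Service
metadata:
  name: nfs-server
spec:
  type: ClusterIP
  # Pinning the cluster IP so it survives deletion/recreation of the Service.
  clusterIP: 10.96.100.100
  ports:
    - name: nfs
      port: 2049
    - name: mountd
      port: 20048
    - name: rpcbind
      port: 111
  selector:
    role: nfs-server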

@jtgans

jtgans commented Feb 25, 2018

I actually didn't realize that ClusterIP services could specify their IPs in the definition. I'll give this a try today.

Pretty sure I didn't recreate the service post reboot, but it's been a while since I restarted the cluster.

@mtricolici

Use the full service name; it works fine:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-zuzu
spec:
  capacity:
    storage: 1Mi
  accessModes:
    - ReadWriteMany
  nfs:
    server: nfs-server.build.svc.cluster.local
    path: "/"

@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@digglife

@mtricolici Tried. Didn't work.

@sjerdo

sjerdo commented Aug 6, 2018

@mtricolici @digglife 'build' should be replaced by the namespace.
E.g., when using no namespace (the default), the value for server should be nfs-server.default.svc.cluster.local.

@PaulMazzuca

I have also tried using the service name, and that did not work. Can you clarify the kubectl command that gets the "full" service name that resolves?

@PaulMazzuca

Take that back. I just got it to work using "{service-name}.{namespace}.svc.cluster.local". I did not realize that svc.cluster.local was always the same.

@wsourdin

I just tried {service-name}.{namespace}.svc.cluster.local on EKS and it doesn't work.

e.g. service name nfs-service on default namespace

I got

mount.nfs: Failed to resolve server nfs-service.default.svc.cluster.local: Name or service not known

Is there still no fix or clean workaround available?

@bmbferreira

I just tried {service-name}.{namespace}.svc.cluster.local on EKS and it doesn't work.

e.g. service name nfs-service on default namespace

I got

mount.nfs: Failed to resolve server nfs-service.default.svc.cluster.local: Name or service not known

Is there still no fix or clean workaround available?

I'm also having this issue on EKS.

@MikeDaniel18

Also having this issue on Digital Ocean.

@rjohnson3

/reopen

@k8s-ci-robot
Contributor

@rjohnson3: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kendru

kendru commented Jul 10, 2019

Also having this issue on GKE

@noeliajimenezg

noeliajimenezg commented Jul 30, 2019

Same issue on OpenStack.

@linclaus

Same issue on NFS

@will-beta

I just tried {service-name}.{namespace}.svc.cluster.local on EKS and it doesn't work.

e.g. service name nfs-service on default namespace

I got

mount.nfs: Failed to resolve server nfs-service.default.svc.cluster.local: Name or service not known

Is there still no fix or clean workaround available?

I'm also having this issue on AKS.

@k8s-ci-robot
Contributor

@will-beta: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cl4u2

cl4u2 commented Nov 4, 2019

Same issue on CDK.

@pentago

pentago commented Feb 17, 2020

From what I just learned, this is an issue only on non-GKE Kubernetes. Can't wait for an upstream fix so we can get proper service DNS name resolution on all providers.

Any progress perhaps?

@colearendt

colearendt commented Feb 23, 2020

Worth noting that the example highlighted here is now at: https://github.com/kubernetes/examples/tree/master/staging/volumes/nfs

@pentago

pentago commented Feb 23, 2020

Yeah, regardless, the question still applies, as there was no upstream solution at the time the docs were written.

What does GKE do that allows for this different behavior from upstream Kubernetes?

@Marek00Malik

Has there been any work on this?

@h0jeZvgoxFepBQ2C

@saad-ali @thockin Could you reopen this issue please?

It's not possible for non-maintainers to reopen it.

@h0jeZvgoxFepBQ2C

Did anyone find a solution for this? It really surprises me that this (imo big) issue hasn't been resolved in 3 years. How do people cope with this situation if they want to run multiple NFS servers - you can't always hardcode the IPs?

Any suggestions on how to work around this? Specifying nfs-service.default.svc.cluster.local didn't work out for us.

@pentago

pentago commented Dec 15, 2020

I'm not sure, but could using an ExternalName service be a viable solution?

For example, using the NFS server service's DNS name in the externalName property?

I was under the impression that this particular object type was created to solve these issues. I haven't tried it yet but would welcome feedback from those who did, regardless of the outcome.
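
Something like this, I imagine (untested; note the kubelet would still need to resolve the target name through cluster DNS, so I'm not sure it actually sidesteps the original problem):

apiVersion: v1
kind: Service
metadata:
  name: nfs-fixed-name
spec:
  type: ExternalName
  # Returns a CNAME pointing at the real NFS Service's DNS name.
  externalName: nfs-server.default.svc.cluster.local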

@colearendt

colearendt commented Dec 15, 2020

FWIW I have worked around this by using the nfs-server-provisioner helm chart, persistent volume claims, and moving on with my life. I will say, something has changed about helm's website and now there seem to be two (identical?) options for this, which is a bit weird.

https://artifacthub.io/packages/helm/kvaps/nfs-server-provisioner

Hope that helps! It has worked well enough for me! It would definitely be nice to have a fix though!

A coworker dug into the source and suspected the bug was here, in case it is any help: https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/nfs/nfs.go#L256
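
For anyone going this route, the claim side looks roughly like this, assuming the chart was installed with a StorageClass named nfs (that name depends on your chart values):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data
spec:
  accessModes:
    - ReadWriteMany
  # StorageClass created by the nfs-server-provisioner chart; the name here is an assumption.
  storageClassName: nfs
  resources:
    requests:
      storage: 1Gi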

@msau42
Member

msau42 commented Dec 15, 2020

The issue is that in some environments the kubelet's host network does not have access to the cluster DNS. Using https://github.com/kubernetes-csi/csi-driver-nfs should resolve this because it runs as a Pod and therefore has access to cluster services.
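
A rough sketch of a statically provisioned PV with that driver, using its server/share volume attributes (names and size are illustrative):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-csi
spec:
  capacity:
    storage: 1Mi
  accessModes:
    - ReadWriteMany
  csi:
    driver: nfs.csi.k8s.io
    # volumeHandle just needs to be unique across volumes.
    volumeHandle: nfs-server.default.svc.cluster.local/share
    volumeAttributes:
      # The mount is performed by the driver Pod, which is why (per the comment above)
      # a service DNS name can resolve here.
      server: nfs-server.default.svc.cluster.local
      share: /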

@h0jeZvgoxFepBQ2C

This does not apply to us, as far as I understand it: when I connect via shell, I can ping the NFS server directly via nfs-server-service and the IP resolution works fine. So the kube proxy knows where our NFS server is; it's just that the volume mount doesn't.

@andre-lx

andre-lx commented Jan 18, 2021

As @msau42 mentioned, I solved this issue using https://github.com/kubernetes-csi/csi-driver-nfs.
