
Initial proposal for node-local services #28637

Closed
therc wants to merge 2 commits

Conversation

therc
Member

@therc therc commented Jul 7, 2016

First step in #28610

cc @thockin



@k8s-github-robot k8s-github-robot added kind/design Categorizes issue or PR as related to design. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. release-note-label-needed labels Jul 7, 2016
@therc therc mentioned this pull request Jul 7, 2016
@rata
Member

rata commented Jul 7, 2016

@therc it seems nice to me, and the goal is something I really think needs to be solved. But what are the other alternatives? For example, DaemonSet + host network (the pod then connects to the node IP, which it can know; the port is a problem with this, and unacceptable, but just to name something). Or why not extend DaemonSet in some way?

I'm not saying that any of these should be done; I'm just curious about the alternatives not mentioned in the proposal. The Service is becoming something that does tons of different things depending on its parameters. I'm not saying it's not worth it; in fact I think it probably is, and there is no other alternative right now. It just makes me wonder harder about the alternatives :)

Thanks!

@therc
Member Author

therc commented Jul 7, 2016

The alternative I currently employ is a DaemonSet with hostPort (not the whole host network), but there is no way at the moment to obtain the node IP, so I use a magic external IP address. The nodes are set up outside of Kubernetes with a PREROUTING DNAT rule so that the magic address is redirected to the host. Applications are compiled with the magic IP address built-in.
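
For concreteness, here is a minimal sketch of the DaemonSet side of that workaround, written with the current k8s.io/api Go types; the image name and port are made up, and the magic external IP plus the PREROUTING DNAT rule live outside Kubernetes and are not shown.

```go
// Sketch only: an agent container exposed via hostPort on every node, as in
// the workaround described above. Image and port are placeholders.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func agentContainer() corev1.Container {
	return corev1.Container{
		Name:  "node-agent",
		Image: "example/node-agent:latest", // hypothetical image
		Ports: []corev1.ContainerPort{{
			ContainerPort: 9000,
			HostPort:      9000, // reachable on the node's own address
		}},
	}
}

func main() {
	fmt.Printf("%+v\n", agentContainer())
}
```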

DaemonSet could be extended, but that would just duplicate, in multiple places, logic already handled for services: VIP allocation in the controller manager, DNS exporting in kube-dns, endpoint watching and iptables management in kube-proxy. Those code paths only worry about Services right now and would have to start caring about DaemonSets too, while trying to avoid race conditions and clashes. And what hostnames would kube-dns have to return, if not X.svc.*, since it would no longer be a Service?

@rata
Member

rata commented Jul 7, 2016

@therc Ohh, I see. Yeah, that alternative sucks :-/. I think you can get the node name via the downward API, since it is in the pod YAML, and that resolves to the IP in my AWS cluster at least. But I'm not sure the downward API exposes it and, in any case, it doesn't really solve the issue.

Yeah, very good points! I'm more convinced now! Sorry for the noise, I was honestly curious :-)

@therc
Member Author

therc commented Jul 7, 2016

Re: downward API, #26160 was closed and #27880 is not approved yet.

@rata
Member

rata commented Jul 7, 2016


Ohh, great. My bad then, thanks!

@smarterclayton
Contributor

I think we have consensus; I just need to unify the downward API. I would like for the service to be able to handle locality cleanly.

@smarterclayton
Contributor

Could we make an affinity rule instead, for services that prefer local endpoints (vs. remote ones)?

@therc
Member Author

therc commented Jul 8, 2016

Something like adding a new ServiceAffinity, RequireNodeLocal? Or even a second one, PreferNodeLocal. Because we're talking about potentially long-lived connections, some applications might prefer being told temporarily that there's no endpoint, rather than sticking with a suboptimal one for an indefinite amount of time. The latter scenario is guaranteed to happen when a daemonset gets updated.
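
To make the shape of this concrete, here is a sketch of what such values could look like next to the existing ServiceAffinity constants; the RequireNodeLocal/PreferNodeLocal names are only the ones floated in this comment, not anything that was merged.

```go
// Sketch only: the first two constants exist in the Kubernetes core API today;
// the last two are hypothetical values discussed in this thread.
package v1

// ServiceAffinity mirrors the existing core API string type.
type ServiceAffinity string

const (
	// Existing values.
	ServiceAffinityClientIP ServiceAffinity = "ClientIP"
	ServiceAffinityNone     ServiceAffinity = "None"

	// Hypothetical: only ever use an endpoint on the client's node; if there
	// is none, report no endpoints at all.
	ServiceAffinityRequireNodeLocal ServiceAffinity = "RequireNodeLocal"
	// Hypothetical: prefer a node-local endpoint, but fall back to any other.
	ServiceAffinityPreferNodeLocal ServiceAffinity = "PreferNodeLocal"
)
```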

@smarterclayton
Contributor

I agree there's potentially value in two affinity settings. I guess you might prefer local (for lots of reasons), but once you get it you want to stick to it. So: SessionAffinity and EndpointAffinity (the former is what happens once you connect, the latter is which one you select).
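
A hypothetical ServiceSpec fragment may help illustrate that split; the EndpointAffinity field below is invented purely for this discussion and does not exist in the API.

```go
// Sketch only: SessionAffinity exists today; EndpointAffinity is an invented
// field illustrating the selection-vs-stickiness distinction described above.
package v1

type ServiceAffinity string

type ServiceSpec struct {
	// SessionAffinity (existing field): once a client has connected, keep
	// routing it to the same endpoint ("ClientIP" or "None").
	SessionAffinity ServiceAffinity `json:"sessionAffinity,omitempty"`

	// EndpointAffinity (hypothetical): which endpoints are eligible to be
	// selected in the first place, e.g. "PreferNodeLocal" or "RequireNodeLocal".
	EndpointAffinity ServiceAffinity `json:"endpointAffinity,omitempty"`

	// ...other ServiceSpec fields elided...
}
```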


@therc
Member Author

therc commented Jul 12, 2016

I updated the document to use a new ServiceAffinity type and mention more use cases. No more "magic" daemon=X selector, which should make the code changes even less invasive.

@therc therc force-pushed the node-local-svc branch 2 times, most recently from dc6e5a7 to 516b626 on July 17, 2016
@therc
Member Author

therc commented Jul 17, 2016

Replaced all "hostlocal" with "nodelocal" (if even I can't keep them straight...), rebased and squashed.

@therc
Member Author

therc commented Jul 18, 2016

I started to implement a prototype and the main issue seems to be that kube-proxy has no clue what the node's IP is. There's also a TODO about at least one optimization where plumbing the IP address into kube-proxy would be beneficial. The big question for me is what to do with DHCP machines, in particular when for some reason a lease doesn't get renewed or the daemon gets restarted and a new address is issued (unlikely, I know).
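
For illustration only, one crude way a node agent could discover its own primary IP without being told is a route lookup against an arbitrary external address; this is not what kube-proxy ended up doing, and it would still go stale if a DHCP lease handed out a new address.

```go
// Sketch: ask the kernel which source address it would pick for an outbound
// connection. net.Dial with UDP does not send any packets; it only resolves
// a route, so this works without generating network traffic.
package main

import (
	"fmt"
	"net"
)

func primaryNodeIP() (net.IP, error) {
	conn, err := net.Dial("udp", "192.0.2.1:53") // TEST-NET address, never actually contacted
	if err != nil {
		return nil, err
	}
	defer conn.Close()
	return conn.LocalAddr().(*net.UDPAddr).IP, nil
}

func main() {
	ip, err := primaryNodeIP()
	if err != nil {
		panic(err)
	}
	fmt.Println("node IP:", ip)
}
```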

@erictune
Member

I like this. One thought: the client of a node-local service should not care whether the service is node-local, a "classic" service, or some other kind of implementation. To the client, it should just look like a service that does the right thing.

@erictune
Member

The assumption that the service and DaemonSet are in the same namespace seems restrictive. What motivates this assumption?

@therc
Member Author

therc commented Jul 26, 2016

The namespace assumption was first mentioned when I had the magic selector, I think. Still, even with a regular service, pointing to pods in a different namespace is not trivial, right? I'll just strike that sentence out.

And yes, for clients, this should look just like any other service.

@k8s-github-robot k8s-github-robot added the do-not-merge DEPRECATED. Indicates that a PR should not merge. Label can only be manually applied/removed. label Jul 28, 2016
@thockin
Member

thockin commented Dec 28, 2016

Also, this needs to move to the community repo, but only AFTER LGTM.

@timothysc
Member

/cc @marun

@davidopp
Member

ref/ #15675

@k8s-github-robot

[APPROVALNOTIFIER] Needs approval from an approver in each of these OWNERS Files:

We suggest the following people:
cc @thockin
You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@renaudguerin

renaudguerin commented Mar 31, 2017

Could someone please clarify whether annotating a ClusterIP service with externalTraffic=OnlyLocal is enough to ensure that client pods will only be directed to same-node service endpoints?

If not, what is the currently recommended way to achieve that in 1.5? DaemonSet + hostPort + getting status.hostIP from the downward API, and pointing the client at that?

My use case is simply to send metrics from application pods to their local statsd/Datadog agent, launched on every node by a DaemonSet.
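
Not an authoritative answer, but here is a sketch of the second approach mentioned above (DaemonSet + hostPort + downward API) using the k8s.io/api Go types. It assumes a Kubernetes version where status.hostIP is exposed through the downward API; the env var name, image and port are made up.

```go
// Sketch only: inject the node's IP into the client pod via the downward API,
// so the app can reach an agent DaemonSet exposing a hostPort on every node.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func appContainer() corev1.Container {
	return corev1.Container{
		Name:  "app",
		Image: "example/app:latest", // hypothetical image
		Env: []corev1.EnvVar{{
			// The app sends StatsD metrics to $(DOGSTATSD_HOST):8125, which the
			// agent DaemonSet serves via hostPort 8125 on each node.
			Name: "DOGSTATSD_HOST",
			ValueFrom: &corev1.EnvVarSource{
				FieldRef: &corev1.ObjectFieldSelector{FieldPath: "status.hostIP"},
			},
		}},
	}
}

func main() {
	fmt.Printf("%+v\n", appContainer())
}
```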

@thockin
Member

thockin commented Mar 31, 2017 via email

- logging agents such as fluentd
- authenticating proxies such as [aws-es-proxy](https://github.com/kopeio/aws-es-proxy),
[kube2iam](https://github.com/jtblin/kube2iam) or loasd ([#2209](https://github.com/kubernetes/kubernetes/issues/2209))

Contributor

Also: - VM-launching pods speaking to a libvirtd pod on the same node, as used by [KubeVirt](https://github.com/kubevirt)


## Detailed discussion

Node-local services can reuse most of the existing plumbing.
Contributor

+1 on taking more than the host as a locality measure. I also like that it references a label to use as the locality value.


and happens to be crazy (if it can be made to work reliably at all).

## Implementation plan
Contributor

If we only supported DNS-based lookups, couldn't this be solved at the node level if the kubelet or some other component were a DNS resolver? The node-local DNS resolver would resolve an FQDN to a local address whenever the backing service is considered node-local. Otherwise it would just dispatch the request to kube-dns.
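
Purely to illustrate the dispatch idea (this is not an existing kubelet feature, as the reply below notes), the resolver's core decision could be as simple as the following; the local table and upstream function are stand-ins for whatever component would actually maintain them.

```go
// Conceptual sketch of a node-level resolver's dispatch logic: answer locally
// for node-local services, forward everything else to kube-dns.
package main

import (
	"fmt"
	"net"
)

// resolve answers from the node-local table when possible and otherwise asks
// the upstream (kube-dns) resolver.
func resolve(fqdn string, nodeLocal map[string]net.IP, upstream func(string) ([]net.IP, error)) ([]net.IP, error) {
	if ip, ok := nodeLocal[fqdn]; ok {
		return []net.IP{ip}, nil
	}
	return upstream(fqdn)
}

func main() {
	local := map[string]net.IP{
		"agent.kube-system.svc.cluster.local.": net.ParseIP("10.0.0.5"), // hypothetical node-local endpoint
	}
	upstream := func(name string) ([]net.IP, error) { return net.LookupIP(name) }
	ips, _ := resolve("agent.kube-system.svc.cluster.local.", local, upstream)
	fmt.Println(ips)
}
```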

Member Author

@fabiand there is nothing right now in kubelet that works as the kind of DNS proxy that you describe, although it's an interesting idea.

@klausenbusk
Contributor

Should this be limited to DaemonSets? My use case is that I have a 3-replica MariaDB Galera cluster (StatefulSet). When I want to create a backup (with xtrabackup), I first need access to the data files (easy, as I use hostPath [1]) and I need to know the IP of the MariaDB instance in control of the data files (for locking and so on).
Currently I use this as part of my backup script:

KUBE_TOKEN="$(</var/run/secrets/kubernetes.io/serviceaccount/token)"
HOST="$(curl -sS --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt -H "Authorization: Bearer ${KUBE_TOKEN}" "https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_PORT_443_TCP_PORT}/api/v1/pods?fieldSelector=spec.nodeName=${NODE_NAME}&labelSelector=app=mysql-production" | jq -r .items[0].metadata.name)"

Before I switched to k8s I just used 172.17.42.1, but as I understand this proposal, it could potentially solve this issue by providing a hostname for the "local" MariaDB Galera pod (?).

[1]
{
  "podAffinity": {
    "requiredDuringSchedulingIgnoredDuringExecution": [
      {
        "labelSelector": {
          "matchExpressions": [
            { "key": "app", "operator": "In", "values": ["mysql-production"] }
          ]
        },
        "topologyKey": "kubernetes.io/hostname"
      }
    ]
  }
}

@therc
Member Author

therc commented Apr 25, 2017

@klausenbusk DaemonSets make the most sense, as the proposal mentions. But your example shows that it could also work for regular pods with taints and tolerations.

@cmluciano

We are still missing an LGTM. Do we have consensus on whether the scope should be increased to address some of the open questions?

@gtaylor
Contributor

gtaylor commented May 4, 2017

Might we benefit from keeping the scope more narrow for the initial implementation and experimentation?

@apenney

apenney commented May 4, 2017

I'm an outsider to this conversation but I vote in favor of a narrow scope for initial implementation and feedback. At least to determine how useful this is (I need it, for one).

@thockin
Member

thockin commented May 5, 2017 via email

@fabiand
Contributor

fabiand commented May 5, 2017

@thockin do you have a pointer to the work in the storage-land you are thinking of?

@therc
Member Author

therc commented May 5, 2017

@thockin are you hinting at kubernetes/community#306 ?

@thockin
Member

thockin commented May 5, 2017 via email

[pgbouncer](https://pgbouncer.github.io/) and
[synapse](https://github.com/airbnb/synapse)
- logging agents such as fluentd
- authenticating proxies such as [aws-es-proxy](https://github.com/kopeio/aws-es-proxy),
Member

For this use case and others, some form of caller ID would be very useful.

@smarterclayton
Contributor

This appears to have fallen silent. Is there a volunteer from sig-network, sig-apps, or sig-node who can help push this forward?

The use case is pretty clear (maybe not in the top tier of complaints, but increasingly commonly mentioned). It appears that there isn't a ton of resistance to the simpler initial implementation. This has missed the window for 1.8, but getting the proposal marshaled would give it a chance to hit 1.9 as an alpha.

@k8s-github-robot

Adding do-not-merge/release-note-label-needed because the release note process has not been followed.
One of the following labels is required "release-note", "release-note-action-required", "release-note-experimental" or "release-note-none".
Please see: https://github.com/kubernetes/community/blob/master/contributors/devel/pull-requests.md#write-release-notes-if-needed.

@k8s-github-robot k8s-github-robot added the do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. label Aug 30, 2017
@igorpeshansky

igorpeshansky commented Nov 9, 2017

This would be very useful for a sharded service that keeps per-node local information in each shard.

@m1093782566
Contributor

m1093782566 commented Nov 29, 2017

@smarterclayton I am from sig-network. I like this idea and have some bandwidth to help push it forward if possible.

@therc I am familiar with the IPVS-based kube-proxy and can take care of the IPVS proxier changes if you are happy with that.

@m1093782566
Contributor

m1093782566 commented Dec 30, 2017

There is a proposal that wants to figure out a more generic way; see kubernetes/community#1551

We need more use cases to refine the API, so please feel free to add your comments there :) Thanks!

@thockin
Member

thockin commented Jan 2, 2018

I'm closing this in favor of the discussion around #41442

@thockin thockin closed this Jan 2, 2018