
Initial proposal for node-local services #28637

Closed
therc wants to merge 2 commits

Conversation

therc
Member

@therc therc commented Jul 7, 2016

First step in #28610

cc @thockin



@k8s-github-robot k8s-github-robot added kind/design Categorizes issue or PR as related to design. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. release-note-label-needed labels Jul 7, 2016
@therc therc mentioned this pull request Jul 7, 2016
@rata
Member

rata commented Jul 7, 2016

@therc it seems nice to me, and the goal is something I really think needs to be solved. But what are the other alternatives? For example, DaemonSet + host network (the pod then connects to the node IP, which it can know; the port is a problem with this, and unacceptable, but just to name something). Or why not extend DaemonSet in some way?

I'm not saying that any of these should be done; I'm just curious about the alternatives not mentioned in the proposal. The Service is becoming something that does tons of different things depending on its parameters. I'm not saying it's not worth it; in fact I think it probably is, and there is no other alternative right now. It just makes me wonder harder about the alternatives :)

Thanks!

@therc
Member Author

therc commented Jul 7, 2016

The alternative I currently employ is a DaemonSet with hostPort (not the whole host network), but there is no way at the moment to obtain the node IP, so I use a magic external IP address. The nodes are set up outside of Kubernetes with a PREROUTING DNAT rule so that the magic address is redirected to the host. Applications are compiled with the magic IP address built-in.
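
For concreteness, here is a minimal sketch of the DaemonSet side of that workaround, written with the current k8s.io/api Go types; the image name and port are made up, and the magic external IP plus the PREROUTING DNAT rule live outside Kubernetes and are not shown.

```go
// Sketch only: an agent container exposed via hostPort on every node, as in
// the workaround described above. Image and port are placeholders.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func agentContainer() corev1.Container {
	return corev1.Container{
		Name:  "node-agent",
		Image: "example/node-agent:latest", // hypothetical image
		Ports: []corev1.ContainerPort{{
			ContainerPort: 9000,
			HostPort:      9000, // reachable on the node's own address
		}},
	}
}

func main() {
	fmt.Printf("%+v\n", agentContainer())
}
```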

DaemonSet could be extended, but that would just duplicate, in multiple places, logic already handled for services: VIP allocation in the controller manager, DNS exporting in kube-dns, endpoint watching and iptables management in kube-proxy. Those code paths only worry about Services right now and would have to start caring about DaemonSets too, while trying to avoid race conditions and clashes. And what hostnames would kube-dns have to return, if not X.svc.*, since it would no longer be a Service?

@rata
Member

rata commented Jul 7, 2016

@therc Ohh, I see. Yeah, that alternative sucks :-/. I think you can get the node name via the downward API, since it is in the pod YAML, and that resolves to the IP in my AWS cluster at least. But I'm not sure the downward API exposes it and, in any case, it doesn't really solve the issue.

Yeah, very good points! I'm more convinced now! Sorry for the noise, I was honestly curious :-)

@therc
Member Author

therc commented Jul 7, 2016

Re: downward API, #26160 was closed and #27880 is not approved yet.

@rata
Member

rata commented Jul 7, 2016


Ohh, great. My bad then, thanks!

@smarterclayton
Contributor

I think we have consensus; I just need to unify the downward API. I would like for the service to be able to handle locality cleanly.

@smarterclayton
Contributor

Could we make an affinity rule instead, for services that prefer local endpoints (vs. remote ones)?

@therc
Member Author

therc commented Jul 8, 2016

Something like adding a new ServiceAffinity, RequireNodeLocal? Or even a second one, PreferNodeLocal. Because we're talking about potentially long-lived connections, some applications might prefer being told temporarily that there's no endpoint, rather than sticking with a suboptimal one for an indefinite amount of time. The latter scenario is guaranteed to happen when a daemonset gets updated.
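
To make the shape of this concrete, here is a sketch of what such values could look like next to the existing ServiceAffinity constants; the RequireNodeLocal/PreferNodeLocal names are only the ones floated in this comment, not anything that was merged.

```go
// Sketch only: the first two constants exist in the Kubernetes core API today;
// the last two are hypothetical values discussed in this thread.
package v1

// ServiceAffinity mirrors the existing core API string type.
type ServiceAffinity string

const (
	// Existing values.
	ServiceAffinityClientIP ServiceAffinity = "ClientIP"
	ServiceAffinityNone     ServiceAffinity = "None"

	// Hypothetical: only ever use an endpoint on the client's node; if there
	// is none, report no endpoints at all.
	ServiceAffinityRequireNodeLocal ServiceAffinity = "RequireNodeLocal"
	// Hypothetical: prefer a node-local endpoint, but fall back to any other.
	ServiceAffinityPreferNodeLocal ServiceAffinity = "PreferNodeLocal"
)
```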

@smarterclayton
Contributor

I agree there's potentially value in two affinity settings. I guess you might prefer local (for lots of reasons), but once you get it you want to stick to it. So: SessionAffinity and EndpointAffinity (the former is what happens once you connect, the latter is which one you select).
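
A hypothetical ServiceSpec fragment may help illustrate that split; the EndpointAffinity field below is invented purely for this discussion and does not exist in the API.

```go
// Sketch only: SessionAffinity exists today; EndpointAffinity is an invented
// field illustrating the selection-vs-stickiness distinction described above.
package v1

type ServiceAffinity string

type ServiceSpec struct {
	// SessionAffinity (existing field): once a client has connected, keep
	// routing it to the same endpoint ("ClientIP" or "None").
	SessionAffinity ServiceAffinity `json:"sessionAffinity,omitempty"`

	// EndpointAffinity (hypothetical): which endpoints are eligible to be
	// selected in the first place, e.g. "PreferNodeLocal" or "RequireNodeLocal".
	EndpointAffinity ServiceAffinity `json:"endpointAffinity,omitempty"`

	// ...other ServiceSpec fields elided...
}
```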


@therc
Member Author

therc commented Jul 12, 2016

I updated the document to use a new ServiceAffinity type and mention more use cases. No more "magic" daemon=X selector, which should make the code changes even less invasive.

@therc therc force-pushed the node-local-svc branch 2 times, most recently from dc6e5a7 to 516b626 on July 17, 2016
@therc
Member Author

therc commented Jul 17, 2016

Replaced all "hostlocal" with "nodelocal" (if even I can't keep them straight...), rebased and squashed.

@therc
Member Author

therc commented Jul 18, 2016

I started to implement a prototype and the main issue seems to be that kube-proxy has no clue what the node's IP is. There's also a TODO about at least one optimization where plumbing the IP address into kube-proxy would be beneficial. The big question for me is what to do with DHCP machines, in particular when for some reason a lease doesn't get renewed or the daemon gets restarted and a new address is issued (unlikely, I know).
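
For illustration only, one crude way a node agent could discover its own primary IP without being told is a route lookup against an arbitrary external address; this is not what kube-proxy ended up doing, and it would still go stale if a DHCP lease handed out a new address.

```go
// Sketch: ask the kernel which source address it would pick for an outbound
// connection. net.Dial with UDP does not send any packets; it only resolves
// a route, so this works without generating network traffic.
package main

import (
	"fmt"
	"net"
)

func primaryNodeIP() (net.IP, error) {
	conn, err := net.Dial("udp", "192.0.2.1:53") // TEST-NET address, never actually contacted
	if err != nil {
		return nil, err
	}
	defer conn.Close()
	return conn.LocalAddr().(*net.UDPAddr).IP, nil
}

func main() {
	ip, err := primaryNodeIP()
	if err != nil {
		panic(err)
	}
	fmt.Println("node IP:", ip)
}
```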

@erictune
Member

I like this. One thought: the client of a node-local service should not care whether the service is node-local, a "classic" service, or some other kind of implementation. To the client, it should just look like a service that does the right thing.

@erictune
Member

The assumption that the service and DaemonSet are in the same namespace seems restrictive. What motivates this assumption?

@therc
Member Author

therc commented Jul 26, 2016

The namespace assumption was first mentioned when I had the magic selector, I think. Still, even with a regular service, pointing to pods in a different namespace is not trivial, right? I'll just strike that sentence out.

And yes, for clients, this should look just like any other service.

@k8s-github-robot k8s-github-robot added the do-not-merge DEPRECATED. Indicates that a PR should not merge. Label can only be manually applied/removed. label Jul 28, 2016
@thockin
Member

thockin commented Dec 28, 2016

Also, this needs to move to the community repo, but only AFTER LGTM.

@timothysc
Member

/cc @marun

@davidopp
Member

ref/ #15675

@k8s-github-robot

[APPROVALNOTIFIER] Needs approval from an approver in each of these OWNERS Files:

We suggest the following people:
cc @thockin
You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@renaudguerin

renaudguerin commented Mar 31, 2017

Could someone please clarify whether annotating a ClusterIP service with externalTraffic=OnlyLocal is enough to ensure that client pods will only be directed to same-node service endpoints?

If not, what is the currently recommended way to achieve that in 1.5? DaemonSet + hostPort + getting status.hostIP from the downward API, and pointing the client at that?

My use case is simply to send metrics from application pods to their local statsd/Datadog agent, launched on every node by a DaemonSet.
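
Not an authoritative answer, but here is a sketch of the second approach mentioned above (DaemonSet + hostPort + downward API) using the k8s.io/api Go types. It assumes a Kubernetes version where status.hostIP is exposed through the downward API; the env var name, image and port are made up.

```go
// Sketch only: inject the node's IP into the client pod via the downward API,
// so the app can reach an agent DaemonSet exposing a hostPort on every node.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func appContainer() corev1.Container {
	return corev1.Container{
		Name:  "app",
		Image: "example/app:latest", // hypothetical image
		Env: []corev1.EnvVar{{
			// The app sends StatsD metrics to $(DOGSTATSD_HOST):8125, which the
			// agent DaemonSet serves via hostPort 8125 on each node.
			Name: "DOGSTATSD_HOST",
			ValueFrom: &corev1.EnvVarSource{
				FieldRef: &corev1.ObjectFieldSelector{FieldPath: "status.hostIP"},
			},
		}},
	}
}

func main() {
	fmt.Printf("%+v\n", appContainer())
}
```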

@thockin
Member

thockin commented Mar 31, 2017 via email

- logging agents such as fluentd
- authenticating proxies such as [aws-es-proxy](https://github.com/kopeio/aws-es-proxy),
[kube2iam](https://github.com/jtblin/kube2iam) or loasd ([#2209](https://github.com/kubernetes/kubernetes/issues/2209))

Contributor

Also: - VM-launching pods speaking to a libvirtd pod on the same node, as used by [KubeVirt](https://github.com/kubevirt)


## Detailed discussion

Node-local services can reuse most of the existing plumbing.
Contributor

+1 on taking more than the host as a locality measure. I also like that it references a label to use as the locality value.


and happens to be crazy (if it can be made to work reliably at all).

## Implementation plan
Contributor

If we only supported DNS-based lookups, couldn't this be solved at the node level if the kubelet or some other component were a DNS resolver? The node-local DNS resolver would resolve an FQDN to a local address whenever the backing service is considered node-local. Otherwise it would just dispatch the request to kube-dns.
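
Purely to illustrate the dispatch idea (this is not an existing kubelet feature, as the reply below notes), the resolver's core decision could be as simple as the following; the local table and upstream function are stand-ins for whatever component would actually maintain them.

```go
// Conceptual sketch of a node-level resolver's dispatch logic: answer locally
// for node-local services, forward everything else to kube-dns.
package main

import (
	"fmt"
	"net"
)

// resolve answers from the node-local table when possible and otherwise asks
// the upstream (kube-dns) resolver.
func resolve(fqdn string, nodeLocal map[string]net.IP, upstream func(string) ([]net.IP, error)) ([]net.IP, error) {
	if ip, ok := nodeLocal[fqdn]; ok {
		return []net.IP{ip}, nil
	}
	return upstream(fqdn)
}

func main() {
	local := map[string]net.IP{
		"agent.kube-system.svc.cluster.local.": net.ParseIP("10.0.0.5"), // hypothetical node-local endpoint
	}
	upstream := func(name string) ([]net.IP, error) { return net.LookupIP(name) }
	ips, _ := resolve("agent.kube-system.svc.cluster.local.", local, upstream)
	fmt.Println(ips)
}
```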

Member Author

@fabiand there is nothing right now in kubelet that works as the kind of DNS proxy that you describe, although it's an interesting idea.

@klausenbusk
Contributor

Should this be limited to DaemonSets? My use case is that I have a 3-replica MariaDB Galera cluster (StatefulSet). When I want to create a backup (with xtrabackup), I first need access to the data files (easy, as I use hostPath [1]) and I need to know the IP of the MariaDB instance in control of the data files (for locking and so on).
Currently I use this as part of my backup script:

KUBE_TOKEN="$(</var/run/secrets/kubernetes.io/serviceaccount/token)"
HOST="$(curl -sS --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt -H "Authorization: Bearer ${KUBE_TOKEN}" "https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_PORT_443_TCP_PORT}/api/v1/pods?fieldSelector=spec.nodeName=${NODE_NAME}&labelSelector=app=mysql-production" | jq -r .items[0].metadata.name)"

Before I switched to k8s I just used 172.17.42.1, but as I understand this proposal, it could potentially solve this issue by providing a hostname for the "local" MariaDB Galera pod (?).

[1]
{
  "podAffinity": {
    "requiredDuringSchedulingIgnoredDuringExecution": [
      {
        "labelSelector": {
          "matchExpressions": [
            { "key": "app", "operator": "In", "values": ["mysql-production"] }
          ]
        },
        "topologyKey": "kubernetes.io/hostname"
      }
    ]
  }
}

@therc
Member Author

therc commented Apr 25, 2017

@klausenbusk DaemonSets make the most sense, as the proposal mentions. But your example shows that it could also work for regular pods with taints and tolerations.

@cmluciano

We are still missing an LGTM. Do we have consensus on whether the scope should be increased to address some of the open questions?

@gtaylor
Contributor

gtaylor commented May 4, 2017

Might we benefit from keeping the scope more narrow for the initial implementation and experimentation?

@apenney

apenney commented May 4, 2017

I'm an outsider to this conversation but I vote in favor of a narrow scope for initial implementation and feedback. At least to determine how useful this is (I need it, for one).

@thockin
Member

thockin commented May 5, 2017 via email

@fabiand
Contributor

fabiand commented May 5, 2017

@thockin do you have a pointer to the work in the storage-land you are thinking of?

@therc
Member Author

therc commented May 5, 2017

@thockin are you hinting at kubernetes/community#306 ?

@thockin
Member

thockin commented May 5, 2017 via email

[pgbouncer](https://pgbouncer.github.io/) and
[synapse](https://github.com/airbnb/synapse)
- logging agents such as fluentd
- authenticating proxies such as [aws-es-proxy](https://github.com/kopeio/aws-es-proxy),
Member

For this use case and others, some form of caller ID would be very useful.

@smarterclayton
Contributor

This appears to have fallen silent. Is there a volunteer from sig-network, sig-apps, or sig-node who can help push this forward?

The use case is pretty clear (maybe not in the top tier of complaints, but increasingly commonly mentioned). It appears that there isn't a ton of resistance to the simpler initial implementation. This has missed the window for 1.8, but getting the proposal marshaled would give it a chance to hit 1.9 as an alpha.

@k8s-github-robot

Adding do-not-merge/release-note-label-needed because the release note process has not been followed.
One of the following labels is required "release-note", "release-note-action-required", "release-note-experimental" or "release-note-none".
Please see: https://github.com/kubernetes/community/blob/master/contributors/devel/pull-requests.md#write-release-notes-if-needed.

@k8s-github-robot k8s-github-robot added the do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. label Aug 30, 2017
@igorpeshansky

igorpeshansky commented Nov 9, 2017

This would be very useful for a sharded service that keeps per-node local information in each shard.

@m1093782566
Contributor

m1093782566 commented Nov 29, 2017

@smarterclayton I am from sig-network. I like this idea and have some bandwidth to help push it forward if possible.

@therc I am familiar with the IPVS-based kube-proxy and can take care of the IPVS proxier changes if you are happy with that.

@m1093782566
Contributor

m1093782566 commented Dec 30, 2017

There is a proposal that wants to figure out a more generic way; see kubernetes/community#1551

We need more use cases to refine the API, so please feel free to add your comments there :) Thanks!

@thockin
Member

thockin commented Jan 2, 2018

I'm closing this in favor of the discussion around #41442

@thockin thockin closed this Jan 2, 2018