DESIGN: Services v2 #1107

Closed
thockin opened this Issue Aug 29, 2014 · 37 comments

Owner

thockin commented Aug 29, 2014

Goal

To evaluate options for enhancing the Kubernetes Service abstraction.

Non-Goals

To discuss external IPs bridging into kubernetes clusters. To quibble about names (not yet).

Background

The kubernetes Service abstraction defines a group of pods that can be accessed through a single IP and port, with a policy describing how to access the pods. For example, when a client connects to a service’s IP:port (which it finds through environment variables), a local proxy will round-robin accesses to the constituent pods. This is the only policy supported today, but we envision more before too long. For example, one can easily imagine “real” load-balanced services which have an HAProxy (or similar) in front of them with a real pod IP.
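
As a rough illustration of the round-robin policy described above, here is a minimal Python sketch. The `RoundRobin` class and the backend addresses are hypothetical and are not taken from kube-proxy; this only shows the rotation idea.

```python
import itertools


class RoundRobin:
    """Hands out backends in strict rotation (hypothetical helper)."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("a service needs at least one backend pod")
        self._cycle = itertools.cycle(list(backends))

    def next_backend(self):
        # Each call returns the next (ip, port) pair, wrapping around.
        return next(self._cycle)


# Example pod endpoints; the addresses are made up for illustration.
rr = RoundRobin([("10.244.1.5", 80), ("10.244.2.7", 80)])
picks = [rr.next_backend() for _ in range(3)]
```

With two backends, the third pick wraps back around to the first one.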

Today this is implemented as a per-minion proxy process which listens on the minion’s primary IP for every service in the cluster, on each service’s port. To be clear: the IP assigned to a service is the IP of the minion the caller is running on. This is exposed as an environment variable, but is effectively a constant. This has a number of drawbacks. First, the proxy is inherently multi-tenant and its resources are not charged to any pod. Second, it forces all services in a cluster to have different port numbers - if any service tries to use a previously consumed port it will fail, but this can not be known a priori. Third, service ports potentially collide with any pods that use HostPorts and any daemons that run on the minions. Fourth, environment variables can not be dynamically updated, so running pods can not know about services started after the pod itself. Fifth, should a pod ever live-migrate it may have to take two network hops to reach a service instead of one, and will be forever subject to the availability of the first minion on which it ran.

For these reasons, I think we can do better. Kubernetes should take the stance that we NEVER make users concern themselves with shared port spaces unless they own all of the shared elements. We started down this path with IP-per-Pod semantics. Service ports are the last violation of this principle.

There is an orthogonal concern that impacts this design. Today, any pod can access any service. It is not part of any API to be able to specify which services a pod might want to connect to. There have been some arguments that we might want to make that part of the API. For example: “this pod will want to connect to the service named ‘foo’”.

Design

I see a few options that could make this system more elegant.

Terminology

Pod: A kubernetes pod, running 1 or more containers (no special meaning above the normal definition)

Service: A group of Pods, as determined by a label selector, which all offer a common port name/number that serves a single purpose. The canonical example is a pool of HTTP servers that all have the same content available. Services can conceptually have different policies for accessing them, but the only one implemented today is load-balanced.
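
To make "a group of Pods, as determined by a label selector" concrete, here is a small Python sketch of equality-based selector matching. The pod names and labels are hypothetical; the selector mirrors the service JSON used in this issue. This is an assumption about how matching could work, not the actual implementation.

```python
def matches(selector, labels):
    """True when every key/value in the selector appears in the pod's labels."""
    return all(labels.get(k) == v for k, v in selector.items())


def select_pods(selector, pods):
    """Return the names of pods whose labels satisfy the selector."""
    return [name for name, labels in pods.items() if matches(selector, labels)]


# Hypothetical pods, keyed by name, each with a dict of labels.
pods = {
    "frontend-1": {"role": "my-app-frontend", "tier": "web"},
    "frontend-2": {"role": "my-app-frontend"},
    "db-1": {"role": "my-app-db"},
}
members = select_pods({"role": "my-app-frontend"}, pods)
```

Extra labels on a pod do not prevent a match; only the keys named in the selector are compared.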

Ambassador: A piece of executable logic, hosted somewhere, which understands kubernetes label selector groups and implements the policy for a Service. This might be in a cloud-provider service, or in a standalone pod (e.g. an enlightened haproxy), or in a per-node shared process (e.g. kube-proxy). The Ambassador is how clients access a Service (kubernetes-native apps may choose to link or implement an Ambassador directly into their app).

Portal: A stable IP:port pair which grants access to an Ambassador. When a client connects to a Portal, the packets are transported to the Ambassador without the client needing to understand how the Ambassador is implemented. "Stable" means that neither the IP nor port can change for the lifetime of a client of the Service.

Option 1) IP-per-Service, shared ambassador

When a Service is created, we allocate an IP from a special range of IPs. This IP is the portal IP. We broadcast this Service, along with its port number, to all of the (one-per-minion) kube-proxy instances along with the Portal IP. The kube-proxy sets up iptables rules to “steal” traffic to the Portal [IP, port], and redirect it back to itself on a random/ephemeral port. The kube-proxy acts as the Ambassador (the same as today) - and will round-robin traffic across the constituents of the Service.
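
A minimal sketch of the portal IP allocation step, assuming the master keeps a simple in-memory map. The `PortalAllocator` class and the `10.0.0.0/24` range are hypothetical and chosen only for illustration; releasing IPs is left out.

```python
import ipaddress


class PortalAllocator:
    """Allocates one portal IP per service from a reserved range (sketch)."""

    def __init__(self, cidr):
        self._free = iter(ipaddress.ip_network(cidr).hosts())
        self._by_service = {}

    def allocate(self, service):
        # Re-use the existing IP so the portal stays stable for clients.
        if service not in self._by_service:
            try:
                self._by_service[service] = str(next(self._free))
            except StopIteration:
                raise RuntimeError("portal range exhausted") from None
        return self._by_service[service]

    def checkpoint(self):
        # The state the master would need to persist across restarts.
        return dict(self._by_service)


alloc = PortalAllocator("10.0.0.0/24")  # range chosen only for illustration
ip_a = alloc.allocate("my-service")
ip_b = alloc.allocate("other-service")
```

Asking again for the same service returns the same IP, which is the "stable" property a Portal needs.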

JSON for a service:

{
    "name": "my-service",
    "port": 9376,
    "containerPort": 80,
    "selector": {"role":"my-app-frontend"}
}

Client pseudocode:

if (use_dns) {
  ip = gethostbyname("my-service")
} else {
  ip = getenv("MY_SERVICE_IP")
}
socket = connect(ip, 9376)
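
The client pseudocode above could look roughly like this in Python. The rule for deriving the variable name (upper-case, dashes to underscores, `_IP` suffix) is an assumption inferred from the `MY_SERVICE_IP` example, and the helper names are hypothetical.

```python
import os
import socket


def resolve_service(name, use_dns, environ=os.environ):
    """Return the portal IP for a service via DNS or an environment variable."""
    if use_dns:
        return socket.gethostbyname(name)
    # Assumed naming rule: "my-service" -> "MY_SERVICE_IP".
    var = name.upper().replace("-", "_") + "_IP"
    return environ[var]


def connect_to_service(name, port, use_dns=False):
    # create_connection raises OSError if the portal is unreachable.
    return socket.create_connection((resolve_service(name, use_dns), port))


# Simulated environment; 10.0.0.2 is a made-up portal IP.
ip = resolve_service("my-service", use_dns=False, environ={"MY_SERVICE_IP": "10.0.0.2"})
```

Either way the client ends up with one IP and the constant service port, so application code does not need to know how the Ambassador is implemented.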

Pros:

  • No risk of port collisions anywhere
  • Services can get DNS A (forward) and PTR (reverse) records since the IP does not change
  • DNS SRV records are easy (the port is constant)
  • The iptables rules are configured in the root namespace, so never need updating even if a pod restarts
  • Does not require pods to pre-declare which services they want to access (can be implemented sooner)

Cons:

  • Kube-proxy is still multi-tenant
  • Traffic from kube-proxy has a source IP that is not the calling pod’s IP
  • Requires virtual IP space to be put aside for portals
  • Requires the master to track and checkpoint allocated portal IPs
  • Will probably not scale well to O(thousands) of services

Option 2) IP-per-Service, private ambassador

Similar to option 1, a Portal IP is allocated for each Service. Unlike option 1, though, the Ambassador is private* to each Pod. This requires that Pods declare which Services they want to access up front (or else the kubelet or other root-namespace, true-root user agent will need to change into each pod namespace for each service add/remove in the whole cluster [iptables rules require true root] -- I assume this is a non-starter), so that the iptables rules can be established in the Pod namespace.

(*) "private" could mean "runs in" for now, but it is really more abstract. If we have different "kinds" of services, some might have real load-balancers as Ambassadors, so the Portal would be just an iptables forwarding rule.

JSON for a service:

{
    "name": "my-service",
    "port": 9376,
    "containerPort": 80,
    "selector": {"role":"my-app-frontend"}
}

YAML for a pod:

containers:
  - name: frontend
    image: nginx
    ports:
      - name: http
        containerPort: 80
portals:
  - destination: my-service

Client pseudocode:

if (use_dns) {
  ip = gethostbyname("my-service")
} else {
  ip = getenv("MY_SERVICE_IP")
}
socket = connect(ip, 9376)

Pros:

  • No risk of port collisions anywhere
  • Services can get DNS A (forward) and PTR (reverse) records since the IP does not change
  • DNS SRV records are easy (the port is constant)
  • The proxy is not multi-tenant.
  • Traffic from kube-proxy has a source IP that is the calling pod’s IP
  • Easy migration from option 1
  • Requires pods to pre-declare service portals (good for structure)

Cons:

  • The iptables rules are configured in the pod namespaces, and must be re-run if a pod’s network namespace restarts
  • Requires virtual IP space to be put aside for portals
  • Requires the master to track and checkpoint allocated portal IPs
  • Requires pods to pre-declare service portals (not yet implemented)

Option 3) Localhost portals, private ambassadors

Instead of allocating an IP for each Service, this option requires that Pods declare which Services they want to access up front, and that they specify an unused port number on localhost which will become the Portal. Like option 2, the Ambassador will be private to each pod, which could mean running a kube-proxy or configuring iptables or other implementations.

JSON for a service:

{
    "name": "my-service",
    "containerPort": 80,
    "selector": {"role":"my-app-frontend"}
}

YAML for a pod:

containers:
  - name: frontend
    image: nginx
    ports:
      - name: http
        containerPort: 80
portals:
  - destination: my-service
    localPort: 12345

Client pseudocode:

socket = connect("localhost", 12345)
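
Since option 3 asks users to pick an unused localhost port per portal, some validation would probably be needed. The following Python sketch checks a pod's `portals` against its own container ports; `check_portals` and the second portal are hypothetical, and the dict layout simply mirrors the YAML above.

```python
def check_portals(pod):
    """Return a list of problems with a pod's localhost portal declarations."""
    problems = []
    used = {}
    # Containers in a pod share one port space, so their ports count as used.
    for c in pod.get("containers", []):
        for p in c.get("ports", []):
            used[p["containerPort"]] = "container " + c["name"]
    for portal in pod.get("portals", []):
        port = portal["localPort"]
        owner = "portal to " + portal["destination"]
        if port in used:
            problems.append(f"port {port}: {owner} collides with {used[port]}")
        else:
            used[port] = owner
    return problems


pod = {
    "containers": [{"name": "frontend", "ports": [{"name": "http", "containerPort": 80}]}],
    "portals": [
        {"destination": "my-service", "localPort": 12345},
        {"destination": "other-service", "localPort": 80},  # deliberately bad
    ],
}
problems = check_portals(pod)
```

Because the check only looks at a single pod, it stays consistent with the principle that users reason about port spaces they own.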

Pros:

  • No risk of port collisions anywhere
  • Services can get DNS A (forward) records since the IP does not change
  • The proxy is not multi-tenant.
  • Traffic from kube-proxy has a source IP that is the calling pod’s IP
  • No IP space needed, nor tracking of allocated IPs
  • Users get to control ports

Cons:

  • Services can not get DNS PTR (reverse) records because the IP is always 127.0.0.1
  • Services can not get DNS SRV records because the port is pod-specific (unless we serve source-specific DNS that is different per-pod)
  • Requires pods to pre-declare service portals (not yet implemented)
  • Users have to think about what ports they want for portals

Decision

We are going to pursue option 1 in the short term. It solves the problem of port collisions without requiring all users to pre-declare their needed Services. We will probably proceed to option 2 or even option 3 (or some combination thereof) later.

Notes for later work:

If we implement “real load balanced” services in options 1 or 2, we need to DNAT portal traffic to the load-balancer IP. The iptables rules to steal traffic and redirect it to a different IP:

iptables -t nat -A PREROUTING -d 10.0.0.2/32 -p tcp -m tcp --dport 93 -j DNAT --to-destination 10.240.5.25:9376

iptables -t nat -A OUTPUT -d 10.0.0.2/32 -p tcp -m tcp --dport 93 -j DNAT --to-destination 10.240.5.25:9376

If we implement “real load-balanced” services in option 3, we need to DNAT localhost traffic destined for the portal to the load-balancer IP. The iptables rules to steal localhost traffic and redirect it to a different IP (they have to run in-namespace):

iptables -t nat -A OUTPUT -d 127.0.0.1/32 -p tcp -m tcp --dport 9378 -j DNAT --to-destination 10.240.5.25:9376

iptables -t nat -A POSTROUTING -d 10.240.5.25/32 -p tcp -m tcp --dport 9376 -j MASQUERADE
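
As a sketch of how such rules might be generated per service, the Python below builds the two DNAT commands for options 1 and 2 as strings. It uses only the flags that appear in the commands above; the helper names are hypothetical and nothing is executed.

```python
def dnat_rule(chain, dest_ip, dport, to_ip, to_port):
    """Build an iptables DNAT command like the ones listed above."""
    return (
        f"iptables -t nat -A {chain} -d {dest_ip}/32 -p tcp -m tcp "
        f"--dport {dport} -j DNAT --to-destination {to_ip}:{to_port}"
    )


def portal_rules(portal_ip, portal_port, lb_ip, lb_port):
    # PREROUTING catches traffic arriving from other interfaces; OUTPUT
    # catches traffic generated in the namespace where the rule is installed.
    return [
        dnat_rule(chain, portal_ip, portal_port, lb_ip, lb_port)
        for chain in ("PREROUTING", "OUTPUT")
    ]


rules = portal_rules("10.0.0.2", 93, "10.240.5.25", 9376)
```

The generated strings match the first two commands in these notes; the option 3 variant would additionally need the MASQUERADE rule shown above.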

thockin added the design label Aug 29, 2014

Contributor

proppy commented Aug 29, 2014

In Options 2) and 3):

Pros:

  • Traffic from kube-proxy has a source IP that is not the calling pod’s IP

s/that is not/that is/ ?

Owner

thockin commented Aug 29, 2014

@proppy fixed

Contributor

smarterclayton commented Aug 29, 2014

Requires pods to pre-declare service portals (not yet implemented)
Users have to think about what ports they want for portals

These are listed as cons but they can be pros. Would list the former as requiring a new API object (which we might want to do anyway). I think that the second one is technically a pro because the user can rebind a port for their own reasons.

Contributor

smarterclayton commented Aug 29, 2014

I assume the "ambassador definition" being added to a pod definition has merit on its own - in that services are "magic" and clients can't define their own names for those env vars, whereas with an ambassador def you can choose the port and the env vars that get used.

Owner

thockin commented Aug 29, 2014

@smarterclayton re "can be pros". I listed them as cons because they add time and complexity as prerequisites. Will reword.

Owner

thockin commented Aug 29, 2014

I think it's clear we want to move to private ambassadors, but it is extra work and complexity that we can defer in some of these options. Specifically, we can move from option 1 to option 2 easily. Option 3 requires more up-front work.

Owner

dchen1107 commented Aug 29, 2014

Besides load balancing, are there any other use cases for the service object? Is it possible for a set of pods to declare more than one service? For example, a pod could declare a service for end users while also providing an admin service. If that is the case, in the models with private ambassadors, do two kube-proxies have to be added to the pod to listen on those Portal IPs, or is there still a single one that listens on both Portal IPs?

Owner

lavalamp commented Aug 30, 2014

I vote for option one. Where is the reasoning for private ambassadors? They sound like an anti-feature to me.

I think that the proxy being multi-tenant is a feature, not a bug. I don't think we should force users to change their pod specs to deal with our weird IPTables magic because we're hacking together IP-per-pod.

(Note that I don't think proxy should stay multi-tenant, because I think it should go away--I'd prefer to see us rejigger the network layer such that it's just plain not necessary. And if we're going to attempt that feat, then we definitely shouldn't be forcing users to predeclare service usage and add slots for ambassadors.)

Owner

thockin commented Aug 30, 2014

You can imagine services that are not load balanced but sharded or master
elected or ...

Yes, pods can be in more than one service, though I am not sure why. You
would have multiple ambassadors, though you could opportunistically
collapse them. So N kube-proxy ambassadors would flatten to a single one.

Owner

thockin commented Aug 30, 2014

Multi-tenant is one reason to go away from single proxies. The other is
simply being explicit. There is some advantage to comprehension if you can
see your dependencies. For example, people have invented this several
times internally.

We actually CAN make the network magic enough for simple round robin, but
not much else. Anything more advanced requires code to run.

Contributor

smarterclayton commented Aug 30, 2014

Where is the reasoning for private ambassadors? They sound like an anti-feature to me.

@lavalamp What parts of private ambassadors are anti-feature to you? Having to have a real container in your pod that's acting as the ambassador (vs. having the infrastructure provide a virtual ambassador)? If so I agree on that point - I don't think we should require modeling relationships / dependencies as actual proxies in containers.

The default ambassador pattern should be whatever can provide the most flexibility / reliability out of the list above. A real (actual in a container) ambassador is something you can always add to your pod if you want - bonus points if you can easily parameterize the ambassador container from the same info the infra uses.

It may also be of value to enable types of ambassadors for the user to select from - even if the first implementation is one of the above, I'm not so sure that being able to have the ambassador qualify its needs (I need PTR records to work, I need secure TLS tunneling, I need this traffic to be high bandwidth) won't be valuable. It's then up to the infrastructure / plugins to satisfy those needs if it can.

we definitely shouldn't be forcing users to predeclare service usage and add slots for ambassadors

Are you saying that pods (which are explicitly Kubernetes concepts) shouldn't declare what they depend on, but should automatically have dependencies injected by the existence of a service?

I'd argue that it's very important to enable authors to use images that are not dependent on Kubernetes concepts (isolated from knowing about the environment they run in). A part of that is enabling pod creators to declare how a dependency is manifest. An example is saying:

I want to expose a mysql database, and oh by the way, I want it to be on localhost:3306 because my code in the image already expects it at that address and also set the database password and username as environment variables named DBUSER and DBPASS.

Contributor

jbeda commented Sep 4, 2014

Some comments:

  • It would be great to have snippets of YAML for defining both the service definition and what the consumer sees/does. After we narrow down the choices that'll really help make this concrete.
  • For 2/3: We should think of the ambassador/portal as a flexible idea: we can have built-in types that can be efficiently implemented, perhaps by the network/cloud. Specifically, when the user says "I want to talk to service foo using a round robin TCP ambassador" we can implement that by either starting a new container in the pod or having some other built-in magic (program the cloud if it supports internal LB). If the user says "I want to talk to service foo using an ambassador implemented by docker image xyz/abc" then we will explicitly run the code in their pod. We could conceivably do (1) for the first case and blend these together.
  • The DNS SRV cons in (3) assume that we are serving the same DNS per pod. If instead we have a custom DNS per pod (have DNS results be a function of the calling pod) we can return correct results in the localhost case.
  • Note that the "must explicitly predeclare which services are being called" only applies to non-enlightened workloads. At some point we can have an inward facing API for workloads where they can dynamically look up destination information and implement the ambassador/portal as a library/feature of that binary. A dynamic portal/ambassador like this would be necessary for a generic HTTP router service. It would want to dynamically start talking to new services without having to be restarted/rescheduled.
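
To illustrate the per-pod DNS idea from the third bullet, here is a hedged Python sketch in which the answer to an SRV-style query depends on which pod is asking. The `srv_lookup` function, the pod names, and the second port are hypothetical; a real DNS server would need some way to identify the caller.

```python
def srv_lookup(service, caller_pod, pod_portals):
    """Answer an SRV-style query with the caller's own localhost portal.

    pod_portals maps pod name -> {service name: localPort}.
    Returns (target, port), or None if the pod declared no such portal.
    """
    port = pod_portals.get(caller_pod, {}).get(service)
    if port is None:
        return None
    return ("localhost", port)


# Two hypothetical pods that picked different local ports for the same service.
pod_portals = {
    "pod-a": {"my-service": 12345},
    "pod-b": {"my-service": 23456},
}
answer_a = srv_lookup("my-service", "pod-a", pod_portals)
answer_b = srv_lookup("my-service", "pod-b", pod_portals)
answer_c = srv_lookup("my-service", "pod-c", pod_portals)
```

The same query yields different ports for different callers, which is exactly why a single cluster-wide SRV record cannot work for option 3.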
Contributor

smarterclayton commented Sep 4, 2014

At some point we can have an inward facing API for workloads

This was one of the original mission goals of libswarm - to offer a discovery endpoint within a container that was standard to all Docker runtime environments that arbitrary code could introspect. The host environment could then offer arbitrary service discovery as well. I think it's a good objective and fits in with what the Docker ecosystem can accomplish by setting conventions.

Owner

thockin commented Sep 4, 2014

@jbeda

On Thu, Sep 4, 2014 at 2:23 PM, Joe Beda notifications@github.com wrote:

Some comments:

It would be great to have snippets of YAML for defining both the service definition and what the consumer sees/does. After we narrow down the choices that'll really help make this concrete.

Added JSON for services and YAML for pods. Does this capture what you
were looking for?

For 2/3: We should think of the ambassador/portal as a flexible idea: we can have built-in types that can be efficiently implemented, perhaps by the network/cloud. Specifically, when the user says "I want to talk to service foo using a round robin TCP ambassador" we can implement that by either starting a new container in the pod or having some other built-in magic (program the cloud if it supports internal LB). If the user says "I want to talk to service foo using an ambassador implemented by docker image xyz/abc" then we will explicitly run the code in their pod. We could conceivably do (1) for the first case and blend these together.

Commented on this

The DNS SRV cons in (3) assume that we are serving the same DNS per pod. If instead we have a custom DNS per pod (have DNS results be a function of the calling pod) we can return correct results in the localhost case.

commented

Note that the "must explicitly predeclare which services are being called" only applies to non-enlightened workloads. At some point we can have an inward facing API for workloads where they can dynamically look up destination information and implement the ambassador/portal as a library/feature of that binary. A dynamic portal/ambassador like this would be necessary for a generic HTTP router service. It would want to dynamically start talking to new services without having to be restarted/rescheduled.

That's true, but I am trying to cover the more common case in this
discussion, since that is what will feel the impact of the decision.

Owner

lavalamp commented Sep 5, 2014

More on why private ambassadors are an anti-feature:

Eventually, it seems like the awesomest way to end up is with the service IP being a real IP address that the network fabric understands and not IPTables magic. In that case, the ambassador becomes needless bloat. So let's not ever introduce it if possible. (IPTables magic is an anti-feature IMO and should not be the intended end state.)

If multi-tenancy is the issue, and true network isolation is required, that should be accomplished by putting every tenant on their own isolated virtual network, not via options 2 or 3, which are security through obscurity if I understand it correctly.

Additionally now that I see some JSON for this, it seems arcane to force the user to specify the port in both service definition and also repeat the port number in their pod in order to use the service. Port number should be passed to the pod along with the IP address via whatever discovery method we come up with. With IP-per-service port number effectively doesn't matter.

Contributor

jbeda commented Sep 5, 2014

@thockin Thanks for the YAML/JSON - it helps a lot.

On 3: would we need a portal.targetPort in the pod definition?

More questions:

  • Can I have multiple portals to the same service within a pod? Perhaps I have a replicated database service -- one portal could go to the master only and another could go across all of the read replicas. In that case, the naming of the portal in the pod would be separate from the name of the service that it is targeting. I think some of this comes down to how "thick" the ambassador is.
  • Is there a "type" on the portal? Or is that something on the service? Where does "random start round robin" get specified?
Contributor

jbeda commented Sep 5, 2014

@lavalamp if we end up with ambassadors/portals that are "thicker" in terms of implementing some protocol specific logic, we'll want that to be versioned and run like other user code.

I think we can split this up different ways:

  • Splitting out the target of a "service" from the mechanism to direct a single connection to one of N backends.
  • The code that implements that policy -- it may be thin and simple (round robin TCP forwarding) or it may be protocol aware (understanding a db query and applying sharding function).
    • Simple generic mechanisms can be supported more natively and don't need to be implemented as code running in the pod. (called a built in portal?)
    • More complex policies (anything deeply protocol aware) should probably be done in code run like any other user code.
  • How we present the portal/ambassador to the consumer code can be done with a true IP or port on localhost.

My gut is to keep things simple for now and so I like (3) but I can understand the desire to put the portal/ambassador/proxy/forwarder on its own IP.

Owner

thockin commented Sep 6, 2014

@jbeda The "target" port is captured by the extant Service.containerPort
(which could do with a rename).

Multiple portals to the same service: As currently spec'ed, portals target
Services. So you could define two Services with different policies which
resolve to the same set of Pods, and then have one portal to each Service.
The alternative, as you hint at, is a thicker ambassador that knows the
difference between reads and writes as part of the Service policy. Or we
could extend Service to have multiple policies, but I am not convinced
that's a good idea.

"Type" would be a property of the Service, I think, though you could maybe
contrive examples where that isn't fair. For example, if my service
requires you to mine a new bitcoin for each request, and I embed that in my
policy-specific ambassador, which you have to run in your pod, is that OK?
I would guess not.

As we open the door to different kinds of policies, I think we will need to
consider what rights we give to a client. Suppose this mysql policy you
described - reads go to any slave, but writes go to master. Someone has to
write that code. Do clients HAVE to run the code that we wrote, or can
they substitute any ambassador implementation they like? Do they have to
run it in-pod or could they run it as a different pod? Eventually, we'll
need to allow more than just "dumb" round robin. But at the same time,
simple cases need to stay simple. I hope we can squint at one of these
designs and agree it's possible, and then ignore it for now.


Owner

thockin commented Sep 6, 2014

On Fri, Sep 5, 2014 at 11:01 AM, Daniel Smith notifications@github.com wrote:

More on why private ambassadors are an anti-feature:

Eventually, it seems like the awesomest way to end up is with the service IP being a real IP address that the network fabric understands and not IPTables magic. In that case, the ambassador becomes needless bloat. So let's not ever introduce it if possible. (IPTables magic is an anti-feature IMO and should not be the intended end state.)

longer term:

Some cloudproviders will provide "true" VIPs for internal
load-balancing that are stable "real" IPs.

Some cloudproviders will not, which forces us to schedule an haproxy
or something. That haproxy is a pod, and has an unstable IP, so we do
not want to expose that to consumers. To dodge this, we can assign a
stable portal IP (or use localhost) that simply forwards to the
current pod IP.

In neither case is there a need for a real process acting as
ambassador in each pod, because there is a real process (either a
cloud load-balancer or an haproxy or ...) filling that role.

The decision of whether clients should declare their desire to be able
to connect to a given service is pretty orthogonal to whether portals
are global or per-pod. Maybe I should try to define "portal" better.

If multi-tenancy is the issue, and true network isolation is required, that should be accomplished by putting every tenant on their own isolated virtual network, not via options 2 or 3, which are security through obscurity if I understand it correctly.

It's not through obscurity by necessity, it could be implemented as a
set of fabric routing rules that ONLY allow declared portal
connections. For example, if pod p1 wants to connect to service s1
(which is made up of pods p2 and p3), I could make firewall rules to
the effect of:

  1. ALLOW src=p1 dest=(p2, p3)
  2. DENY everything else

Some hypothetical pod p4 would not be able to connect to s1, nor would
p2 or p3 be able to connect to p1. That's pretty draconian, but
private ambassadors allow for it, whereas global ambassadors do not.
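
The ALLOW/DENY rules above can be sketched in Python as an allow-list derived from declared portals. The function names and data layout are hypothetical; this only models the decision, not how a network fabric would enforce it.

```python
def build_allow_list(portals, service_members):
    """Expand declared portals into (src_pod, dest_pod) ALLOW pairs."""
    allowed = set()
    for src, services in portals.items():
        for svc in services:
            for dest in service_members.get(svc, []):
                allowed.add((src, dest))
    return allowed


def permitted(allowed, src, dest):
    # Anything not explicitly allowed falls through to the final DENY.
    return (src, dest) in allowed


# p1 declares a portal to s1, which is made up of p2 and p3.
allowed = build_allow_list({"p1": ["s1"]}, {"s1": ["p2", "p3"]})
```

A pod that declared no portal to s1 gets no ALLOW pairs, and the rules are one-directional, so the service's pods cannot open connections back to p1.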

Additionally now that I see some JSON for this, it seems arcane to force the user to specify the port in both service definition and also repeat the port number in their pod in order to use the service. Port number should be passed to the pod along with the IP address via whatever discovery method we come up with. With IP-per-service port number effectively doesn't matter.

That's a bug - there is no user-visible service port in this model. Fixed.


Owner

thockin commented Sep 6, 2014

I have more or less kept my opinion to myself, but here are my thoughts.

I think option 3 is marginally simpler over all, but requires a lot more
work to achieve in the near term as compared to option 1. And option 1 can
convert to option 2 pretty gracefully. The difference between option 2 and
option 3 is pretty small, to me.

For that reason I think #1 is a better solution for right now, with growth
into option 2 later, which could conceivably migrate to option 3 even later.


Contributor

smarterclayton commented Sep 6, 2014

... It could be implemented as a set of fabric routing rules that ONLY allow declared portal connections. That's pretty draconian, but private ambassadors allow for it, whereas global ambassadors do not.

We already know this is a customer requirement we plan to satisfy with ovs and private vlans (or similar), where namespaces/projects/policy defines a set of routing rules that allow subdivision of the internal kube network. It's primarily for medium trust multitenant environments where you want shared resources but an extra layer of defense in the event of a container compromise.

Contributor

jbeda commented Sep 6, 2014

@thockin I actually think that option 3 is pretty easy to get done:

  • Make a supported inward API for doing service lookup. Let code running in a pod ask where service backends are located.
  • Retool the proxy so that it can run inside the pod. It now takes a config (on command line? env variable?) and implements just that config. It'll only proxy the declared services.
  • Have translation magic that takes the portals parts of the yaml, transforms it and launches the proxy.
  • [optional] hide the existence of the proxy container from folks when they inspect the pod from the API.
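
The "translation magic" step might look something like the Python sketch below, which turns a pod's `portals` section into arguments for an in-pod proxy. The `--portal` flag and its `localPort:selector:containerPort` format are invented for this sketch and do not correspond to any real kube-proxy option.

```python
def proxy_args(pod, services):
    """Turn a pod's declared portals into arguments for an in-pod proxy.

    The --portal flag and its format are invented for this sketch.
    """
    args = []
    for portal in pod.get("portals", []):
        svc = services[portal["destination"]]
        selector = ",".join(f"{k}={v}" for k, v in sorted(svc["selector"].items()))
        args.append(
            f"--portal={portal['localPort']}:{selector}:{svc['containerPort']}"
        )
    return args


services = {
    "my-service": {"containerPort": 80, "selector": {"role": "my-app-frontend"}},
}
pod = {"portals": [{"destination": "my-service", "localPort": 12345}]}
args = proxy_args(pod, services)
```

Only declared services end up in the config, so the proxy in each pod proxies just that pod's portals.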
Owner

thockin commented Sep 6, 2014

That's still notably more work than option 1, with deeper impact on users
(not that we have too many of those) - every single user needs to re-tool
their configs. I don't see a more graceful transition.

Contributor

smarterclayton commented Sep 7, 2014

We should have a quick discussion on whether container predeclaration of dependencies is the right pattern, make the pros and cons clear, and try to set a direction. That's probably orthogonal to this discussion, but I think it's more important to what we're trying to achieve with building containerized applications than some of the mechanical details of how interconnections should work (this issue, no offense meant Tim), and I don't get the feeling from this thread that we have total consensus.

I don't think we have an issue to contain it but I'm fine spawning one separately.

Services are globally injectable today, and clients (pods) cannot mutate the form in which they appear in the container, nor can they adapt the logical entity the service represents (a collection of pods) to their needs. Services shield pods from knowing about changes to other pods, but pods aren't the only thing that a client container might want intermediated (things outside Kube, external IPs or DNS). Services are global, but that only scales to a fairly low bound; it's unlikely that all pods in the scope care about all services, yet they all must worry about collisions.

Some of the solutions to those issues potentially reduce the migration / use pattern change we are worried about between 1 and the others.

Contributor

jbeda commented Sep 7, 2014

Related topic: do we have multiple protocols/ports specified in a service target spec? See #1205.

Contributor

jbeda commented Sep 8, 2014

One thing I'd like to make sure we have out of the next version of services (based on discussions on IRC) is that an application referencing a service can come up before that service is defined.

Owner

thockin commented Sep 8, 2014

Agree that ordering should be irrelevant. Agree we need some consensus on
whether pre-declare is a net win or not.

That said, I think we can move forward with this proposal option 1, and
migrate it into option 2 or option 3. If there's no more debate on that
point, I'd love to call that decided and maybe make time to implement, or
else dole it out to someone.

Contributor

jbeda commented Sep 9, 2014

Do we have confidence that we can make this work in every network environment? If so, I'm okay with doing (1) to start.

Contributor

filbranden commented Sep 9, 2014

I'm against option (3) of using portals on the localhost IP to connect to services.

One reason is that I can't keep a 1:1 mapping of ports in the general case. Consider the case where I have multiple services running MySQL, one with a user database and another with a comments database. Both run in separate pools of pods, and each MySQL server listens on port 3306 of its own pod.

I'd like to be able to connect to userdb:3306 to get to the users database and to commentsdb:3306 to get to the comments database. But if I'm using localhost portals, then I can't have both of them use port 3306, which means I have to start using non-standard ports.
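
To make the collision concrete, here is a small model of it (a sketch, not how any proxy is implemented): a portal is identified by the (IP, port) pair a client connects to, and two services cannot own the same pair. `assign_portals` and the 10.254.0.x addresses are made up for illustration.

```python
from typing import Dict, Tuple

Endpoint = Tuple[str, int]  # (IP, port) that a client connects to


def assign_portals(wanted: Dict[str, Endpoint]) -> Dict[Endpoint, str]:
    """Map each (IP, port) to the service that owns it; fail on a collision."""
    owners: Dict[Endpoint, str] = {}
    for service, endpoint in wanted.items():
        if endpoint in owners:
            raise ValueError(
                f"{service} collides with {owners[endpoint]} on {endpoint[0]}:{endpoint[1]}"
            )
        owners[endpoint] = service
    return owners


# Localhost portals: both databases want 127.0.0.1:3306, so one has to move.
localhost_portals = {"userdb": ("127.0.0.1", 3306), "commentsdb": ("127.0.0.1", 3306)}

# One IP per service (hypothetical addresses): both keep the standard port.
service_ip_portals = {"userdb": ("10.254.0.1", 3306), "commentsdb": ("10.254.0.2", 3306)}

try:
    assign_portals(localhost_portals)
    localhost_ok = True
except ValueError:
    localhost_ok = False

service_ip_ok = len(assign_portals(service_ip_portals)) == 2
```

With localhost the second database is rejected; with one IP per service both keep port 3306.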

Another problem with starting to depend on localhost portals is that it doesn't scale: as soon as I want to scale out my service and use a real load balancer for my MySQL hosts, I want to be able to connect to the real IP of the load balancer directly, in which case using a localhost portal only creates the need to keep networking magic around when it's no longer doing any useful work...

I'd say aim for this:

  • Design for the scale out case, try to avoid any iptables etc. in that case;
  • Use iptables etc. for the scaled-down case, where using a local proxy in place of a load balancer will be enough.
  • This should even make it easier to scale out without a restart: just create a load balancer under the service IP (assuming you're able to request it on a specific IP), remove the iptables rule, and traffic will start hitting the load balancer. You can also do it in the other direction to scale down.

Re (1) or (2), I don't have a strong preference; I see both of them as solutions for "toy" setups, so I'm not sure they deserve a lot of consideration. I think I lean towards (2), since then the private proxy can run on the same pod/machine and it doesn't require coordination of external resources.

Contributor

smarterclayton commented Sep 9, 2014

I'd like to be able to connect to userdb:3306 to get to the users database and to commentsdb:3306 to get to the comments database. But if I'm using localhost portals, then I can't have both of them use port 3306, which means I have to start using non-standard ports.

Agree portals shouldn't force you to localhost, or to change ports. I don't think that forcing localhost portals should be part of the solution. However, localhost:3306 is strictly better than <random_ip>:3306 for use cases where you want to connect to a single database.

Contributor

smarterclayton commented Sep 9, 2014

Another problem with start depending on localhost portals is that it doesn't scale, as soon as I want to scale out my service and use a real load balancer to my MySQL hosts, I want to be able to connect to the real IP of the load balancer directly, in which case using a localhost portal only creates the need to keep networking magic around when it's no longer doing any useful work...

If localhost portals are optional then you can just change your pod config to drop them. How those portals are exposed into your code shouldn't have to change.

EDIT of EDIT: you're advocating using the service IP directly. I think that's valuable for some cases, but the value in localhost is that for most apps in most cases you don't need to do anything to your code, whereas for a service IP you still have to configure your pod / code to connect to it.

Owner

thockin commented Sep 10, 2014

I'm 99.5% sure that these portal IPs never touch the wire. The only
hardship is that people need to decide what range of IPs to allocate from.
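
As a rough illustration of that one decision, here is a sketch of handing out one portal IP per service from an operator-chosen range. The thread does not specify an allocation policy; `PortalAllocator` and the `10.254.0.0/30` range are assumptions made for this example.

```python
import ipaddress
from typing import Dict


class PortalAllocator:
    """Hands out one IP per service from an operator-chosen range (hypothetical)."""

    def __init__(self, cidr: str) -> None:
        # hosts() yields the usable addresses of the network, in order.
        self._free = iter(ipaddress.ip_network(cidr).hosts())
        self._assigned: Dict[str, object] = {}

    def allocate(self, service: str):
        # A service keeps its IP once assigned, so repeated calls are stable.
        if service not in self._assigned:
            try:
                self._assigned[service] = next(self._free)
            except StopIteration:
                raise RuntimeError("portal range exhausted") from None
        return self._assigned[service]


alloc = PortalAllocator("10.254.0.0/30")  # made-up range with two usable addresses
userdb_ip = alloc.allocate("userdb")
commentsdb_ip = alloc.allocate("commentsdb")
```

The range only has to avoid overlapping addresses that are actually routed; a third service in this deliberately tiny range would raise `RuntimeError`.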

Owner

thockin commented Sep 10, 2014

Regarding "toy" solutions - I think we'll find it sufficient for a large
number of real situations. Your "scale out" case (without iptables) is
predicated on "real" ambassadors with stable IP addresses. You might see
that for load-balancing in some cloud providers, but certainly not all
providers and certainly not for potential other policies. I think options
1 and 2 both give graceful empowerment of more advanced infrastructure.

Owner

thockin commented Sep 16, 2014

Added a short note on decision. Will flesh out the text and go back to implementation and naming bikeshedding.

thockin was assigned by bgrant0607 Sep 26, 2014

bgrant0607 added this to the v0.5 milestone Sep 26, 2014

Contributor

anguslees commented Oct 8, 2014

With a shared ambassador (or suitably privileged private ambassadors) running on the local host, there should be no need to actually modify the packet headers, and adding DNAT rules will just double the amount of connection tracking performed by the local kernel.

Add the service IPs to the right veth interfaces/namespaces (or mess with "local" routes in the routing table) and the ambassador should be able to bind to the right IP+ports and just listen/reply directly.

Owner

thockin commented Oct 16, 2014

#1402 is in. Closing this doc now, though I am sure we will revisit it when private ambassadors come up next.

thockin closed this Oct 16, 2014

skarap commented Feb 20, 2015

Iptables DNAT to a load balancer in option 3 will not work: you can't DNAT localhost traffic to another host. Even if you somehow manage to do it using policy routing rules, changing the "local" routing table and such (though I couldn't, and wasn't able to find anyone who could), it will still be an unsupported solution which can break in the next kernel release.

vishh pushed a commit to vishh/kubernetes that referenced this issue Apr 6, 2016

pwittrock Merge pull request #1107 from pwittrock/fix-integration-test
fix integration tests always passing because of obscure golang variab…
967e6bb