Connect Envoy Command #4735

Merged
merged 3 commits into f-envoy from envoy-command on Oct 3, 2018

Conversation

@banks (Member) commented Oct 1, 2018

Note: this PR targets merging into f-envoy and depends directly on #4731. This is part of a series of PRs that were developed together but split for easier review.

This PR includes:

  • Plumbing the xDS server and the proxycfg manager into the agent startup.
  • A new consul connect envoy command to run Envoy as a Connect sidecar.

Not Included

  • Documentation: this is left for later since the whole of the f-envoy branch needs a large docs overhaul. That will all be addressed separately once this code is in review.
  • Tests won't pass yet; they depend on another cache bug fix that comes in a later PR.
  • Prepared query upstreams don't work yet. Punted on to get this out; I think it's OK to release 1.3 without them, but they should be done shortly after.
  • Cache size limits. Currently the cache can grow a lot; I think there are several aspects we should address here, though I'm not 100% sure how:
    • Currently all our background-refresh cache types (certs, intentions, service disco) have a TTL of 3 days, chosen to ensure we keep them in memory during a weekend-long server outage. proxycfg.Manager actually removes that need, since any resources required by currently registered proxies are constantly blocked on and so are never considered inactive, which means we could set TTLs much more aggressively.
    • This is super important because of the partition-by-token behaviour: a well-behaved proxy that rotates its ACL token every hour will generate a whole new set of cache entries each hour, but we'll keep the old ones around for 3 days (72 hours), so currently we consume up to 72x the memory we actually need.
    • In general it would be nice to have a configurable limit on cache memory, ideally in bytes but at least in number of entries, with LRU eviction to protect against accidental or malicious pathological cases that churn through cached requests (e.g. something using ?cached service discovery queries is configured to watch thousands of different service prefixes across tens of datacenters and rotates its token every few hours...). This could end up consuming as much RAM as the whole Raft state multiplied by the number of token rotations/distinct client tokens used to fetch within the TTL.
  • Built-in proxy update to use ?cached: not a show-stopper, but we'll gain full fail-static behaviour once we do this.

TODO before landing:

  • write some tests for the command's gRPC config, which I only remembered needed adding while writing up the demo below

Demo 🎉

Register a couple of services with sidecar proxies:

$ cat demo.hcl
services {
  name = "web"
  port = 8080
  connect {
    sidecar_service {
      proxy {
        upstreams {
          destination_name = "db"
          local_bind_port = 9191
        }
      }
    }
  }
}
services {
  name = "db"
  port = 9090
  connect { sidecar_service {} }
}

Run the agent with gRPC enabled:

$ consul agent -dev -hcl 'ports { grpc = 8502 }' -config-file demo.hcl

In another terminal, run the "web" envoy sidecar (this assumes envoy 1.7.1 or later is installed in your $PATH):

$ consul connect envoy -sidecar-for web

And in another terminal run the db sidecar (note that the admin port needs overriding if you are on the same host, since each Envoy needs its own admin port to bind):

$ consul connect envoy -sidecar-for db -admin-bind localhost:19001

Note: you can append -- -l debug to the envoy invocations to pass arbitrary flags through to Envoy; this one will show debug-level logs, including all the config received via xDS.
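For example, combined with the web sidecar invocation above:

$ consul connect envoy -sidecar-for web -- -l debug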

In another terminal run an echo server:

$ socat -v tcp-l:9090,fork exec:"/bin/cat"

You should now be able to connect to it as the "web" app by talking to the web sidecar's upstream:

$ nc localhost 9191
hello world
hello world

You can verify connect AuthZ by closing the netcat session, and using:

$ consul intention create -deny  web db 
Created: web => db (deny)
$ nc localhost 9191
hello?
$

Notice that unlike with the built-in proxy, netcat won't return until you actually try to send something; this seems to just be how Envoy handles TCP proxying and TLS handshakes. In the built-in proxy we explicitly force an immediate handshake.
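You can also confirm the deny decision without opening a connection at all, using the intention check subcommand:

$ consul intention check web db
Denied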

You can get back to a working state with:

$ consul intention delete web db
Intention deleted.
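With the deny intention removed, the echo round trip from earlier should work again (assuming the socat echo server is still running):

$ nc localhost 9191
hello again
hello again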

@banks banks requested a review from a team October 1, 2018 19:29
@banks banks changed the base branch from master to xds-server October 1, 2018 19:30
@mitchellh mitchellh added this to the 1.3.0 milestone Oct 1, 2018
@mitchellh (Contributor) left a comment


Good for now but your CLI needs tests :)

Willing to punt that a bit to get to a clean tree though.

I did leave a couple comments worth thinking about.

Review threads (now resolved) on command/connect/envoy/envoy.go and agent/connect_auth.go.
@banks banks changed the base branch from xds-server to f-envoy October 3, 2018 19:28
@banks banks changed the base branch from f-envoy to xds-server October 3, 2018 19:31
@banks banks changed the base branch from xds-server to f-envoy October 3, 2018 19:31
@banks (Member, Author) commented Oct 3, 2018

Going to punt on the CLI tests for now because this PR chain is blocking a couple other tasks.

@banks banks merged commit 4dbb9a9 into f-envoy Oct 3, 2018
@banks banks deleted the envoy-command branch October 3, 2018 20:13
@roboll commented Oct 3, 2018

Hey, I'm trying to understand how Envoy maps traffic from a local port to a service destination. It seems like that is configured in demo.hcl here, a config file passed to the Consul agent.

Is this scoped to the agent, i.e. more or less global?

@banks (Member, Author) commented Oct 4, 2018

Hey @roboll, we'll be writing up the real docs to go with this over the next few days, so that should help.

Currently, as in this example, the config is passed as part of the (proxy) service registration, so it's entirely local: every Envoy sidecar could configure upstreams differently, on different local ports, etc. In practice we'd expect this model to be used with a config management tool or scheduler doing that registration, so it would be "centrally" managed that way.
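For illustration, a rough sketch of what that registration looks like when a scheduler or config management tool does it via the HTTP API instead of a config file, mirroring demo.hcl above (field names follow the agent's /v1/agent/service/register endpoint):

$ cat web.json
{
  "Name": "web",
  "Port": 8080,
  "Connect": {
    "SidecarService": {
      "Proxy": {
        "Upstreams": [
          { "DestinationName": "db", "LocalBindPort": 9191 }
        ]
      }
    }
  }
}
$ curl -sS -X PUT --data @web.json http://localhost:8500/v1/agent/service/register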

FYI, we have plans in the near future to make this kind of thing more centrally managed and pushed out via Consul itself, but that won't be in this release.

Its scope is local to the agent and the proxy instance: you may have many different Envoys on a single node/agent, with different configs and identities, by registering them as different services.

Finally, we provide an "escape hatch" mechanism where you can supply a completely custom Envoy listener config and we'll just inject the certs and the authz filter. Later we'll support richer config options in a nicer/more native way, but this should allow more flexibility for advanced usages right away; it is still provided through the service registration. Details to come soon in the docs!

banks added a commit that referenced this pull request Oct 4, 2018
* Plumb xDS server and proxycfg into the agent startup

* Add `consul connect envoy` command to allow running Envoy as a connect sidecar.

* Add test for help tabs; typos and style fixups from review
@roboll commented Oct 4, 2018

@banks Cool, thanks.

As far as having port mappings scoped to the sidecar as opposed to the agent, 👍 that's what I hoped. I like the explicit nature of shipping a config file with apps that declares their dependencies and how they should be accessed (over which local port, I mean).

Love the escape hatch for advanced use cases. 💯

$ consul agent -dev -hcl 'ports { grpc = 8502 }' -config-file demo.hcl

This seems to be running the Consul agent with this config, so I assumed it was scoped to the agent. Whereas here, the Envoy sidecar is just referencing config that the agent already has:

$ consul connect envoy -sidecar-for web

Assuming agent-per-node, we'll definitely have many Envoys per agent, and it seems like the agent will need to be aware of all of the service definitions with the local port mappings. I'm not sure what the commands should look like to scope this to the sidecar only (e.g. pod A may declare a dependency on service C at port 1000, and pod B may declare a dependency on service C at port 2000).

If docs are coming this week I can wait for them though. 😄

@banks (Member, Author) commented Oct 4, 2018

-hcl 'ports { grpc = 8502 }'

This is only setting the agent's gRPC server port (since it's not enabled by default). Actually, since this PR landed it is redundant in -dev mode, where we now default to enabling it, but in production you'd need to explicitly configure agents to listen for gRPC traffic.
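For production agents the same setting would typically live in the agent's config file rather than on the command line; a minimal HCL fragment:

ports {
  grpc = 8502
}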

I'll wait for the docs to go into more detail, but for many services and sidecars per node it would look something like this:

services {
  name = "a"
  port = 1111
  connect {
    sidecar_service {
      proxy {
        upstreams {
          ...
        }
      }
    }
  }
}
services {
  name = "b"
  port = 2222
  connect { sidecar_service {} }
}
services {
  name = "c"
  port = 3333
  connect { sidecar_service {} }
}

That defines 3 different service instances and 3 different sidecar proxy registrations.

Then each proxy would be run with:

$ consul connect envoy -sidecar-for a
$ consul connect envoy -sidecar-for b
$ consul connect envoy -sidecar-for c

And each envoy will get configured with whatever defaults are chosen for the specific sidecar instance. If upstreams are specified, they are specific to a proxy instance.
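As in the demo above, if these sidecars all run on the same host, each Envoy after the first needs its admin port overridden so it can bind, e.g.:

$ consul connect envoy -sidecar-for a
$ consul connect envoy -sidecar-for b -admin-bind localhost:19001
$ consul connect envoy -sidecar-for c -admin-bind localhost:19002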

In this magical defaults case, the proxy public TLS ports themselves are being auto-selected from a configurable range (21000, 21500).
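A sketch of adjusting that range in the agent config, assuming the sidecar_min_port and sidecar_max_port keys under the ports block:

ports {
  sidecar_min_port = 21000
  sidecar_max_port = 21500
}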

Note that in this model, anything that can talk to the upstream listeners of a proxy is effectively being granted permission to "act" as that service per the security model. So you'd only deploy them unprotected on a node where all services share common access rights and can be treated as one from a security perspective; otherwise you need to deploy the sidecar and application into network namespaces, or use iptables rules to ensure only the specific app and sidecar can talk to each other.

Hope that helps.

@roboll commented Oct 4, 2018

I think I left out a critical piece of information: we're running on Kubernetes, so our Consul agent doesn't know what services the pods on the host could potentially be running. In your example it appears it has to.

It seems there would need to be a consistent global service config that all agents are configured with as opposed to each consumer (sidecar) choosing how to expose things. Is my understanding correct?

(Feel free to tell me to wait for the docs if this is explained in there.)

FWIW: an alternative model (and what I was actually looking for) might be something like consul connect envoy -config-file demo.hcl where each sidecar provides a config for what services it exposes in both directions (services it receives traffic for and proxies locally, and services it exposes locally and proxies to remotes).

@mitchellh (Contributor) commented

@roboll This is exactly the problem that our Connect injector will solve via https://github.com/hashicorp/consul-k8s (installable via our Helm chart). It's also why we did #4732, so that pods can have a postStart hook to register and a preStop hook to deregister, and the sidecar's consul connect envoy -sidecar-for invocation will work properly. Does that solve your needs?

@banks (Member, Author) commented Oct 5, 2018

Yeah the workflow Mitchell describes should enable what you envisage.

The reason we do it this way, rather than just requiring a config file for each Envoy invocation directly, is that later we want to support much more centralised config via Consul, so we want the Envoy process to declare its purpose/identity and let Consul decide what config it should have.

The per-instance registration config will always work too, for cases where different instances of the same service actually need different configs, but Mitchell's k8s work should get you pretty close right away; that should all be documented and ready within a week or two.

@roboll commented Oct 5, 2018

Cool, glad to see it has been considered.

I'm still working on grokking the consul-k8s repo, but if anything isn't covered I'll open an issue. Thanks both!
