TLS cert provisioning #5597

Merged
merged 102 commits into from
Jun 27, 2019

@hanshasselberg (Member) commented Apr 3, 2019

Intro

This PR introduces automatic certificate provisioning for Consul clients so that they can communicate securely via RPC. The goal is to stop provisioning the client certificate and key manually and to get them from the servers instead. Depending on the client configuration, the setup is either as secure as manually provisioning certificates or less secure.

  • secure: ACL enabled, CA provided, verify_server_hostname enabled
  • less secure: ACL disabled or CA and verify_server_hostname not provided
  • insecure: No ACL enabled, no CA provided
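As a rough illustration, the three tiers above can be written as a small Go classifier. This is a hypothetical helper written for this description; the function and parameter names are assumptions and do not exist in Consul.

```go
package main

// securityLevel maps a client's auto_encrypt setup onto the three tiers
// listed in the description. Hypothetical helper for illustration only.
func securityLevel(aclEnabled, hasCA, verifyServerHostname bool) string {
	switch {
	case aclEnabled && hasCA && verifyServerHostname:
		// ACL enabled, CA provided, verify_server_hostname enabled.
		return "secure"
	case !aclEnabled && !hasCA:
		// No ACL enabled and no CA provided.
		return "insecure"
	default:
		// ACL disabled, or CA / verify_server_hostname missing.
		return "less secure"
	}
}
```

For example, a client with a CA and verify_server_hostname but without ACLs falls into the "less secure" tier.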

Even with an insecure configuration, the setup ends up more secure than running without TLS, because TLS is enabled regardless.

Implementation

New RPCType

When calling AutoEncrypt.Sign, the client has a CA at best, and maybe not even that. But this endpoint has to have TLS enabled no matter what. Since this is different from every other existing RPC, a new RPCType was necessary: RPCTLSInsecure. On the server side, every TLS connection is accepted and no client cert is required. The client establishes the TLS connection even though it might not have the server CA and certainly doesn't have a client cert.

New RPCEndpoint

There is a new endpoint, AutoEncrypt.Sign, which returns the combined result of ConnectCA.Roots and ConnectCA.Sign. It is the only endpoint so far that always requires TLS and that accepts TLS connections without a client certificate.
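A simplified sketch of how such an endpoint could combine both results into one reply. `CARoots`, `IssuedCert`, `SignResponse`, and the `rootsFunc`/`signFunc` stand-ins for ConnectCA.Roots and ConnectCA.Sign are hypothetical simplifications, not Consul's actual types; the fields follow what this description says the reply carries (Connect CA roots, the manual CA, the issued certificate and verify_server_hostname).

```go
package main

import "errors"

// Hypothetical, simplified stand-ins for the real request/response types.
type CARoots struct{ Roots []string }
type IssuedCert struct{ CertPEM, PrivateKeyPEM string }

type SignResponse struct {
	IssuedCert           IssuedCert
	ConnectCARoots       CARoots
	ManualCARoots        []string
	VerifyServerHostname bool
}

// rootsFunc and signFunc stand in for ConnectCA.Roots and ConnectCA.Sign.
type rootsFunc func(datacenter string, reply *CARoots) error
type signFunc func(csr string, reply *IssuedCert) error

// autoEncryptSign calls both endpoints and merges their results. It uses
// value variables and passes their addresses, as suggested in review.
func autoEncryptSign(dc, csr string, manualCA []string, verifyHostname bool,
	roots rootsFunc, sign signFunc) (*SignResponse, error) {
	if csr == "" {
		return nil, errors.New("missing CSR")
	}
	var caRoots CARoots
	if err := roots(dc, &caRoots); err != nil {
		return nil, err
	}
	var cert IssuedCert
	if err := sign(csr, &cert); err != nil {
		return nil, err
	}
	return &SignResponse{
		IssuedCert:           cert,
		ConnectCARoots:       caRoots,
		ManualCARoots:        manualCA,
		VerifyServerHostname: verifyHostname,
	}, nil
}
```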

Client Startup

If auto_encrypt.tls is set on the client, it retries indefinitely to get certificates from the servers and does not continue startup until it succeeds. Because the operator set the configuration, we assume TLS is wanted and the client won't proceed without it. The client does not rely on discovering servers with serf; instead it uses every address in -join and -retry-join. That means at least one of these addresses needs to be a server in order for the client to fetch certificates.

Also, the Consul startup banner now announces that the agent is about to start instead of announcing a successful start. I had to change that to be able to display the attempts to reach the AutoEncrypt endpoint.

Configuration

Prerequisites

TLS needs to be turned on for RPC on the server. (There is now a dedicated warning for this, because the configuration is invalid otherwise.)

New options

This PR introduces a new option for servers:

{
	"auto_encrypt": {
		"allow_tls": true
	}
}

auto_encrypt.allow_tls enables connect and the endpoint that serves the CA and the certificates to the clients. If disabled, the endpoint is not available and a client configured with auto_encrypt.tls cannot start. If enabled, the server accepts incoming connections with certificates from either the manual CA or the connect CA. It always presents only its manually configured CA and certificate though, which the client can verify using the CA it got from the auto_encrypt endpoint. Defaults to false.

And one for clients:

{
	"auto_encrypt": {
		"tls": true
	}
}

auto_encrypt.tls makes the client request the CA and certificates for encrypting RPC communication from the servers provided in -join or -retry-join. It only works if at least one of these addresses is a server.

auto_encrypt.allow_tls has to be enabled on the server in order to activate the RPC endpoint, and auto_encrypt.tls has to be enabled on every client that should provision certificates automatically. If the -server-port is not the default one, it has to be provided to the clients as well. Usually this is discovered through serf, but auto_encrypt happens earlier and cannot use that information.
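The client-side condition can be sketched by parsing only the `auto_encrypt` block. The structs are hypothetical and cover just the keys shown above plus `server`; the final check mirrors the `a.config.AutoEncryptTLS && !a.config.ServerMode` condition in the agent.go excerpt in the review.

```go
package main

import "encoding/json"

// Hypothetical structs covering only the keys used here; these are not
// Consul's own config types.
type AutoEncrypt struct {
	TLS      bool `json:"tls"`       // client: request CA and certificates from servers
	AllowTLS bool `json:"allow_tls"` // server: serve the AutoEncrypt endpoint
}

type agentConfig struct {
	ServerMode  bool        `json:"server"`
	AutoEncrypt AutoEncrypt `json:"auto_encrypt"`
}

// shouldRequestCerts reports whether an agent with this JSON config would
// request certificates at startup: only clients with auto_encrypt.tls do.
func shouldRequestCerts(raw []byte) (bool, error) {
	var cfg agentConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return false, err
	}
	return cfg.AutoEncrypt.TLS && !cfg.ServerMode, nil
}
```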

The most secure auto_encrypt setup is when the client is provided with the CA, verify_server_hostname is turned on, ACLs are enabled, and the client has a token with node.write permissions.

It is also possible to use auto_encrypt with a CA and ACLs but without verify_server_hostname, with ACLs only, with a CA and verify_server_hostname only, with a CA only, or finally without a CA and without ACLs.

In any case, the communication to the auto_encrypt endpoint is always TLS encrypted.

Enabling auto_encrypt (auto_encrypt.allow_tls on servers, auto_encrypt.tls on clients) has the following implications:

  • server:
    • connect is turned on because its certificate capabilities are used internally
  • client:
    • before auto_encrypt was successful:
      • if a CA is provided, verify_outgoing is assumed
    • after auto_encrypt was successful:
      • verify_outgoing is enabled
      • the server response also has a field for verify_server_hostname, and the client turns it on if it is enabled in the response. In other words, if verify_server_hostname is turned on for the server, it is turned on for the clients as well.
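The client-side flag changes listed above, sketched as two pure functions over a hypothetical `clientTLSState` struct. This illustrates the described behavior and is not the PR's actual tlsutil code.

```go
package main

// clientTLSState is a hypothetical summary of a client's outgoing TLS settings.
type clientTLSState struct {
	HasCA                bool
	VerifyOutgoing       bool
	VerifyServerHostname bool
}

// beforeAutoEncrypt: if a CA is provided, verify_outgoing is assumed.
func beforeAutoEncrypt(s clientTLSState) clientTLSState {
	if s.HasCA {
		s.VerifyOutgoing = true
	}
	return s
}

// afterAutoEncrypt: the response carries the CA roots, verify_outgoing is
// enabled, and verify_server_hostname is turned on if the server's
// response has it enabled (it is never turned off here).
func afterAutoEncrypt(s clientTLSState, serverVerifyHostname bool) clientTLSState {
	s.HasCA = true
	s.VerifyOutgoing = true
	if serverVerifyHostname {
		s.VerifyServerHostname = true
	}
	return s
}
```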

If the server_port is not the default 8300, it has to be provided to the agent. Since auto_encrypt can happen before the agent joins the serf cluster, it cannot rely on that data coming from serf.
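One way to derive the RPC addresses for auto_encrypt from the join addresses, assuming any port given in a -join/-retry-join entry is the serf LAN port and has to be replaced by the configured server port. The helper is hypothetical.

```go
package main

import (
	"net"
	"strconv"
)

// serverRPCAddrs pairs every join host with the server RPC port. Any port
// already present in a join entry is assumed to be the serf port and dropped.
func serverRPCAddrs(joinAddrs []string, serverPort int) []string {
	out := make([]string, 0, len(joinAddrs))
	for _, a := range joinAddrs {
		host := a
		if h, _, err := net.SplitHostPort(a); err == nil {
			host = h
		}
		// JoinHostPort adds brackets for IPv6 hosts.
		out = append(out, net.JoinHostPort(host, strconv.Itoa(serverPort)))
	}
	return out
}
```

For example, `serverRPCAddrs([]string{"10.0.0.2:8301"}, 8300)` yields `10.0.0.2:8300`.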

Example configurations

Most secure auto_encrypt setup with ACL token and CA for the client.

# server.json
{
	"ca_file": "./consul-agent-ca.pem",
	"cert_file": "./dc1-server-consul-0.pem",
	"key_file": "./dc1-server-consul-0-key.pem",
	"verify_outgoing": true,
	"verify_incoming": true,
	"verify_server_hostname": true,
	"auto_encrypt": {
		"allow_tls": true
	},
	"acl": {
		"enabled": true,
		"default_policy": "deny",
		"enable_token_persistence": true
	}
}

# client.json
{
	"ca_file": "./consul-agent-ca.pem",
	"verify_server_hostname": true,
	"auto_encrypt": {
		"tls": true
	},
	"acl": {
		"enabled": true,
		"tokens": {
			"default": "212721ff-d0bc-d44a-bfed-7c886a878609"
		}
	}
}

Less secure setup without ACL, but still with CA:

# server.json
{
	"ca_file": "./consul-agent-ca.pem",
	"cert_file": "./dc1-server-consul-0.pem",
	"key_file": "./dc1-server-consul-0-key.pem",
	"verify_outgoing": true,
	"verify_incoming": true,
	"verify_server_hostname": true,
	"auto_encrypt": {
		"allow_tls": true
	}
}

# client.json
{
	"ca_file": "./consul-agent-ca.pem",
	"auto_encrypt": {
		"tls": true
	}
}

Least secure setup without ACL and without CA:

# server.json
{
	"ca_file": "./consul-agent-ca.pem",
	"cert_file": "./dc1-server-consul-0.pem",
	"key_file": "./dc1-server-consul-0-key.pem",
	"verify_outgoing": true,
	"verify_incoming": true,
	"auto_encrypt": {
		"allow_tls": true
	}
}

# client.json
{
	"auto_encrypt": {
		"tls": true
	}
}

verify_incoming

This PR sets up TLS between client and server; on the client side it only affects verify_outgoing.

Todo

Server

  • maybe choose a better name for AutoEncrypt.Sign and Insecure
  • add root connect CA to AutoEncrypt reply
  • add manual CA to AutoEncrypt reply
  • accept incoming connect certs on the server
  • send verify_server_hostname configuration
  • send correct certs
  • auto_encrypt implies connect
  • send gossip key or don't allow configuration
  • add a.config.RetryJoinLAN, but make sure to deal with autodiscover
  • check spiffeid when accepting connect certs

Client

  • if CA present, enable verify_server_hostname
  • make sure verify_incoming_rpc and verify_incoming_https have a higher precedence than the "imported" verify_incoming.
  • start watching only after serf came up
  • call RPC on client side for cert retrieval
  • do we store certs? (Only in memory on the client, they are refetched on restart)
  • send correct certs
  • on startup the client says it doesn't have TLS, but it gets it afterwards.
  • make sure verify_server_hostname with ca_{file,path} works.
  • allow disabling it or don't turn it on by default
  • watch and renew certificates
  • when to close the context
  • make sure agents certs are not accepted as service certs
  • move queryServer into client.go and use CallWithCodec
  • check if ACL is working as intended
  • use retry logic
  • prepopulate cache
  • there are no logs about the initial auto_encrypt run because logging is gated and only opened up after agent.Start() which is too late. Needs thinking how to solve that.
  • mention that server_port needs to be configured on the agent

Docs

  • guide on how to use it and how to migrate to it or in it
  • document new config options
  • document implications of different setups in terms of security

@pearkes removed this from the 1.5.0 milestone Apr 29, 2019
@hanshasselberg force-pushed the cert_provisioning branch 2 times, most recently from 3cc0983 to ab9bcb5 May 7, 2019 20:28
@hanshasselberg force-pushed the cert_provisioning branch 2 times, most recently from ff421f1 to db86194 June 6, 2019 15:38
@hanshasselberg marked this pull request as ready for review June 6, 2019 15:39

@freddygv (Contributor) left a comment

Nice work, this looks good to me. 🎉

Just have minor comments inline.

agent/agent.go Outdated
}
a.serviceManager = NewServiceManager(a)

if err := a.initializeACLs(); err != nil {
return nil, err
}

a.logger = logger
Contributor:

How come the logger isn't added above when a is initialized?

Member Author:

good catch!

agent/agent.go Outdated
@@ -433,6 +436,18 @@ func (a *Agent) Start() error {
// populated from above.
a.registerCache()

if a.config.AutoEncryptTLS && !a.config.ServerMode {
var reply *structs.SignResponse
Contributor:

How come you're declaring the reply/err variables here rather than using the short variable declaration :=?

Member Author (@hanshasselberg, Jun 18, 2019):

good catch! I did it so that I can use the if on the same line. But I guess it doesn't matter as much.

retryJitterWindow = 30 * time.Second
)

func (c *Client) AutoEncrypt(servers []string, port int, token string, interruptCh chan struct{}) (*structs.SignResponse, string, error) {
Contributor:

This function feels like it should have a different name. From what I understand, it just makes a request to sign a certificate and then returns that certificate.

Maybe something like RequestCert

Member Author:

I settled on RequestAutoEncryptCerts.

// exactly what is needed.
c := &ConnectCA{srv: a.srv}

rootsArgs := &structs.DCSpecificRequest{Datacenter: args.Datacenter}
Contributor:

This is another case where initializing directly to a pointer looks a bit odd to me.

I don't have strong feelings about it, but I would prefer to use &rootsArgs and &roots as the inputs for c.Roots.

Same below in c.Sign with &cert. Then we can do reply.IssuedCert = cert rather than reply.IssuedCert = *cert

Member Author:

I will change it to what you suggested.

root := "../../test/ca/root.cer"
badRoot := "../../test/ca_path/cert1.crt"

variants := []variant{
Contributor:

One improvement that could be made here is to use subtests, where the name of each subtest is something like VerifyOutgoing fails.

Here is a short read on table driven tests with subtests:
https://blog.golang.org/subtests

There are a few benefits to table-driven subtests:

  • You can run individual subtests on their own: go test -run TestAutoEncryptSign/VerifyOutgoing_fails
  • The test case information is displayed in the test output, which makes it easier to line up what you expect vs what you got

Here's an example from envconsul where the name of the subtest is embedded in the test case struct:
https://github.com/hashicorp/envconsul/blob/master/runner_test.go#L181

This talk from Mitchell covers them as well:
https://www.youtube.com/watch?v=8hQG7QlcLBk

Member Author:

Done!

expected []string
}

variants := []variant{
Contributor:

Same comment as in TestAutoEncryptSign

if err != nil {
return nil, err
}
wrap := configurator.OutgoingRPCWrapper()
Contributor:

Renaming this to wrapper might be a little clearer, since wrapper is also used in DialTimeoutInsecure

Member Author:

✔️

@hanshasselberg force-pushed the cert_provisioning branch 2 times, most recently from 80fde56 to 20b9752 June 19, 2019 21:32
@hanshasselberg requested a review from a team June 19, 2019 22:29

@mkeeler (Member) left a comment

This looks really good. Just the one minor comment/question.

@@ -853,6 +853,15 @@ default will automatically work with some tooling.
contents of each entry.


* <a name="auto_encrypt"></a><a href="#auto_encrypt">`auto_encrypt`</a>
Contributor:

This page is (mostly) alphabetized, can we move this just under Autopilot?

@mkeeler (Member) left a comment

LGTM


* <a name="tls"></a><a href="#tls">`tls`</a> makes the client request the CA and certificates for encrypting RPC communication from the servers provided in `-join` or `-retry-join`. It only works if at least one of these addresses is a server. `auto_encrypt.allow_tls` has to be enabled on the server in order to activate the RPC endpoint, and `auto_encrypt.tls` on every client that should provision certificates automatically. If the `-server-port` is not the default one, it has to be provided to the clients as well. Usually this is discovered through serf, but `auto_encrypt` happens earlier and cannot use that information. The most secure `auto_encrypt` setup is when the client is provided with the CA, `verify_server_hostname` is turned on, ACLs are enabled, and the client has a token with `node.write` permissions. It is also possible to use `auto_encrypt` with a CA and ACLs but without `verify_server_hostname`, with ACLs only, with a CA and `verify_server_hostname` only, with a CA only, or finally without a CA and without ACLs. In any case, the communication to the `auto_encrypt` endpoint is always TLS encrypted. Defaults to false.

* <a name="allow_tls"></a><a href="#allow_tls">`allow_tls`</a> enables connect and the endpoint that serves the CA and the certificates for `auto_encrypt` to the clients. If disabled, the endpoint is not available and a client configured with `auto_encrypt.tls` cannot start. If enabled, the server accepts incoming connections with certificates from either the manual CA or the connect CA. It always presents only its manually configured CA and certificate though, which the client can verify using the CA it got from the `auto_encrypt` endpoint. Defaults to false.

Contributor (@kaitlincart, Jun 27, 2019):

I would move this setting above auto_encrypt.tls since auto_encrypt.tls will be affected by auto_encrypt.allow_tls.


The following sub-keys are available:

Contributor:

What if users don't have a server listed in join or retry_join?

Member Author (@hanshasselberg, Jun 27, 2019):

It won't work: the agent never finishes starting up and keeps retrying indefinitely.

hanshasselberg and others added 3 commits June 27, 2019 20:54
Co-Authored-By: kaitlincarter-hc <43049322+kaitlincarter-hc@users.noreply.github.com>
@freddygv (Contributor) left a comment

:shipit:

@hanshasselberg merged commit 33a7df3 into master Jun 27, 2019
@hanshasselberg deleted the cert_provisioning branch June 27, 2019 20:22