Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GEP-1897 Explicit Backend TLS Connection Configuration (was TLS from Gateway to Backend...) Update with API details #2113

Merged

Conversation

candita
Copy link
Contributor

@candita candita commented Jun 14, 2023

What type of PR is this?

/kind gep

What this PR does / why we need it:

This PR adds the API details for GEP-1897 TLS from Gateway to Backend for ingress. GEP-1897 has already had the questions answered - what do we want to do, and why do we want to do this. This update attempts to answer - how do we do this.

Which issue(s) this PR fixes:

Fixes #1897

Does this PR introduce a user-facing change?:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/gep PRs related to Gateway Enhancement Proposal(GEP) cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 14, 2023
@k8s-ci-robot
Copy link
Contributor

Hi @candita. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 14, 2023
geps/gep-1897.md Outdated
//
// This field is required for any TLSConnectionPolicyConfig.
//
// Support: Core - A single reference to a Kubernetes Secret of type kubernetes.io/tls
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

afaik a kubernetes.io/tls forces the secret to contain tls.key and tls.crt, but the tls.key is not required for TLS origination

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arkodg do you have a suggestion or change for the Support:?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we could try and relax this constraint, i.e. a kubernetes.io/tls is valid if a tls.key OR tls.crt key is set.
cc @robscott

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or the CA Cert can be inputted as a ConfigMap (local to the namespace), and in the case of mTLS, the client's key and cert can be inputted as a kubernetes.io/tls Secret

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Elsewhere I commented that we should allow/support mTLS to be originated from the gateway so I like the idea of kubernetes.io/tls Secret

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, I just realized an mTLS client requires 3 things: a private key, identity cert (public key) and a trust store to validate peer (server) cert. If kubernetes.io/tls contains tls.key and tls.crt then we need one more thing (may be the Cluster trust bundle @howardjohn mentioned?) for full mTLS client side cert config.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@robscott

I like the idea of using something other than secret to story a CA Cert since that's not really a secret. One of the simplest paths forward would be to use a ConfigMap for CA Cert and a kubernetes.io/tls Secret for the client key and cert for mTLS.

We could change this to ConfigMap for CA Certs now and add kubernetes.io/tls Secret to a different PolicyAttachment, like mTLSConnectionPolicy I mentioned earlier.

Alternatively, I'd chatted with @enj in the past about developing a different kind of resource under the scope of Gateway API that was meant for storing certs. This could potentially help with the issue we have today where most Ingress/Gateway controllers are deployed with read access to all secrets in the cluster by default. I think the ongoing ReferenceGrant work in sig-auth may make this idea irrelevant longer term, but thought it was at least mentioning in this context.

It feels out of scope for this proposal, but I can open another discussion or issue to develop a different kind of resource for certs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that makes sense to me - a well-defined field in a ConfigMap for now with a documented plan to move to a new resource type longer term.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to ClusterTrustBundle in the long-term, but well-formatted configmap for now makes sense

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made some changes for this.

geps/gep-1897.md Outdated Show resolved Hide resolved
geps/gep-1897.md Outdated Show resolved Hide resolved
@sanjaypujare
Copy link
Contributor

On line 19 (I could not comment on that line), I see an incomplete sentence: "In terms of the Gateway API personas, only the application developer persona in this solution."

Should it be "In terms of the Gateway API personas, only the application developer persona is considered in this solution." ?

@candita
Copy link
Contributor Author

candita commented Jun 15, 2023

On line 19 (I could not comment on that line), I see an incomplete sentence: "In terms of the Gateway API personas, only the application developer persona in this solution."

Should it be "In terms of the Gateway API personas, only the application developer persona is considered in this solution." ?

@sanjaypujare I changed it to "In terms of the Gateway API personas, only the application developer persona applies in this solution."

@@ -16,7 +16,7 @@ so in order to drive resolution this GEP focuses only on this single piece of fu
1. The solution must satisfy the following use case: the backend pod has its own
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking of a use-case where a backend (pod/service) is also receiving mesh traffic (in addition to traffic from the gateway) and is configured to only accept mTLS connections. Can we enhance the TLSConnectionPolicy struct to add:

  • a mode field (to specify mTLS or TLS)
  • a field similar to TrustedCACertRefs - say ClientCertRefs - which the client will use to provide client cert in case of mTLS

So that this API can be directly used for the mesh-case and also cases where the Gateway could or would like to provide a client-cert with an mTLS connection?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think mTLS is intentionally out of scope for this phase of the GEP, but we want to ensure we have room to add it in the future. So it's probably worth sketching out some possible future extensions for mTLS here, but I don't want to block on deciding on a solution here, just as long as we agree that we've left room for a future solution.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally want to keep mesh transport out of scope forever, not just short term. Note that is not necessarily the same as "mtls", though.

Mesh transport is NOT a service level concern though, its an infrastructural concern. You would not configure ipsec per service, for example.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mesh case is also intentionally out of scope for this tightly-scoped proposal. In the case we'd need to do something for mTLS, I would rather propose a new mTLSConnectionPolicy.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess a new mTLSConnectionPolicy type does not sound that bad after all although a single type like TLSConnectionPolicy supporting mTLS is preferable in which case we should make sure it's possible to add it later.

I disagree that mTLS is not a service level concern so should be left out. mTLS = TLS + client certificate. If TLS is a service level concern, then so is mTLS. Also mTLS provides client identity which is used in authz and other(?) mesh configs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @howardjohn that mesh transport is an infrastructure concern; in fact, most meshes go out of their way to make the mTLS transparent to the application. Ideally, the application developer doesn't necessarily know or care that their application has mTLS configured. IMO the only use-case for a mTLS policy in Gateway API would be application-enforced mTLS which is certainly a use-case, but not in scope for this GEP

Copy link
Contributor

@youngnick youngnick Jul 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the issue here is one of term overloading and us again not being able to be explicit enough.

@sanjaypujare appears to be using mTLS to mean "a connection between a Gateway and backend where the Gateway verifies the backend and the backend verifies the Gateway", or in other words a mutual TLS handshake where the certificate management is done by the owner of the service, whereas I think @keithmattix and @howardjohn are more thinking about the Service Mesh use of mTLS to mean "a mutual auth handshake where the certificate management is performed silently and transparently by infrastructure". In the latter case, it's important that Gateway API constructs are not required, because if they are, then the certificate management is no longer transparent.

The important part here is that this GEP is not about magic infrastructure that automatically upgrades connections to TLS on your behalf, it's about an app owner being able to say that their app expects TLS and has no magic to make it work transparently.

This is why, in the Listener case, I've pushed very hard for us to call similar functionality "client certificate configuration" rather than "mTLS", because the use of the acronym can sometimes, depending on who's using it, carry a lot of extra meaning that others may not know is there.

Now, can I see a use for Client Certificate settings in the future in TLSConnectionPolicy? Possibly, but I think it's important that we keep in mind personas here. The whole idea of having mutual authentication of certificates is that it's expected that the two parties spoke independently to the Certificate Authority, and took some steps to prove that they are who they say they are, and then the Certificate Authority provides a method for the two parties to check that the CA that they both trust says the other is okay.

Having the service or backend owner provide the client certificate details completely bypasses this - at that point, there is very little gain in using TLS at all.

For Gateway API to handle the "service/backend owner is doing their own TLS", and "service or backend owner wants to validate the clients connecting to it", it's on the service or backend owner to provide a method for the Gateway owner to retrieve a certificate, and on us, as Gateway API designers, to provide a place on the Gateway to configure the client keypair that should be used (since it's both a private and a public key that are required in the client certificate case).

So, what's the summary of this rambling diatribe?

  • We've excluded "mTLS" from this precisely because the fact that different people can use it to mean different things means we end up having discussions exactly like this one, over and over again.
  • This GEP is about explicit TLS configuration, not the implicit, automatic TLS configuration you get with a Service Mesh or similar infrastructure tooling. Perhaps we should update some of the preamble to this effect if this makes sense to people other than me, and we can avoid yet another round of this discussion.
  • Using an explicitly configured client certificate for the Gateway end of a TLS connection is a reasonable use case, but that should be configured on the Gateway, not as part of the TLSConnectionPolicy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nicely put. I'll add some of this to the preamble.

geps/gep-1897.md Outdated
// if the connection should be TLS, the targetRef’s certificate
// should be validated by the certs in TrustedCACertRefs, and a
// status delivered in the response for validation failures.
Port PortNumber `json:port,omitempty`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the target of this policy is a Service and as far as I know the Service will have a targetPort (or defaulted to service port?). Can we just use that instead so this port is not needed? Otherwise how does the implementation work when the targetPort in the Service and the Port here are different?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Gateway API, a port reference is traditionally part of a Service reference (BackendRef is best example) and it refers to the Port the Service is listening on, not the targetPort (port the Service backends are listening on). The reason this is needed is that it's fairly common for Kubernetes Services to listen on multiple ports and protocols.

So given the following example, this would be referring to port 80, and specifically backendRefs that are targeting this Service on port 80.

apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app.kubernetes.io/name: MyApp
  ports:
    - protocol: TCP
      port: 80
      targetPort: 9376

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...it's fairly common for Kubernetes Services to listen on multiple ports and protocols.

Got it. So in the case where a Kubernetes service listens on a single port, the Port field here (on line 243) has to match the service port. Put another way, the port can be inferred in such a case. Just confirming my understanding

Copy link
Member

@robscott robscott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the work on this @candita! Haven't quite made it through everything, but took an initial pass.

/ok-to-test

@@ -16,7 +16,7 @@ so in order to drive resolution this GEP focuses only on this single piece of fu
1. The solution must satisfy the following use case: the backend pod has its own
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think mTLS is intentionally out of scope for this phase of the GEP, but we want to ensure we have room to add it in the future. So it's probably worth sketching out some possible future extensions for mTLS here, but I don't want to block on deciding on a solution here, just as long as we agree that we've left room for a future solution.

geps/gep-1897.md Outdated Show resolved Hide resolved
geps/gep-1897.md Outdated Show resolved Hide resolved
geps/gep-1897.md Outdated Show resolved Hide resolved
geps/gep-1897.md Outdated
// if the connection should be TLS, the targetRef’s certificate
// should be validated by the certs in TrustedCACertRefs, and a
// status delivered in the response for validation failures.
Port PortNumber `json:port,omitempty`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Gateway API, a port reference is traditionally part of a Service reference (BackendRef is best example) and it refers to the Port the Service is listening on, not the targetPort (port the Service backends are listening on). The reason this is needed is that it's fairly common for Kubernetes Services to listen on multiple ports and protocols.

So given the following example, this would be referring to port 80, and specifically backendRefs that are targeting this Service on port 80.

apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app.kubernetes.io/name: MyApp
  ports:
    - protocol: TCP
      port: 80
      targetPort: 9376

geps/gep-1897.md Show resolved Hide resolved
geps/gep-1897.md Outdated
Comment on lines 236 to 238
// Port is the network port that the implementation watches to
// know if the connection should be TLS and the targetRef’s
// certificate should be validated by the certs in TrustedCACertRefs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a minor nit, I think implementations will essentially apply this policy to anywhere they're mapping a BackendRef that matches the combination of the TLSConnectionPolicy TargetRef and Port. This is not great, but maybe something like this would be a bit clearer, very open to improvements/changes though:

Suggested change
// Port is the network port that the implementation watches to
// know if the connection should be TLS and the targetRef’s
// certificate should be validated by the certs in TrustedCACertRefs
// Port is the network port of the target. When a target matches a BackendRef,
// this Policy should apply, resulting in the certificate served by the backend
// being validated by the certs in TrustedCACertRefs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pending a decision on removing Port.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

geps/gep-1897.md Outdated Show resolved Hide resolved
@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 15, 2023
Copy link
Contributor

@howardjohn howardjohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@@ -16,7 +16,7 @@ so in order to drive resolution this GEP focuses only on this single piece of fu
1. The solution must satisfy the following use case: the backend pod has its own
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally want to keep mesh transport out of scope forever, not just short term. Note that is not necessarily the same as "mtls", though.

Mesh transport is NOT a service level concern though, its an infrastructural concern. You would not configure ipsec per service, for example.

geps/gep-1897.md Show resolved Hide resolved
geps/gep-1897.md Outdated
//
// This field is required for any TLSConnectionPolicyConfig.
//
// Support: Core - A single reference to a Kubernetes Secret of type kubernetes.io/tls
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/kubernetes/enhancements/tree/master/keps/sig-auth/3257-trust-anchor-sets#clustertrustbundle-object ? Its in 1.27 I think, alpha though. Its also cluster scoped which is good or bad depending on how you look at it..

geps/gep-1897.md Show resolved Hide resolved
geps/gep-1897.md Outdated
type TLSConnectionPolicyConfig struct {
// TrustedCACertRefs contains one or more references to
// Kubernetes objects that contain TLS certificates, which are
// used to establish a TLS handshake between the gateway and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs more specificity on what we are supposed to do with it. Is it a client certificate? Is it a CA? Either? Both?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends, so it can be both. It is one or more reference to a certificate. When there is a chain of certificates, one or more can be CA/IntermediateCA.

geps/gep-1897.md Outdated
Comment on lines 311 to 314
| TLSRoute | Listener Mode: Passthrough | Yes | No |
| TLSRoute | Listener Mode: Terminate | Yes | Not supported |
| TLSRoute | Listener Mode: Passthrough | No | No |
| TLSRoute | Listener Mode: Terminate | No | No |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs to take mesh clients into account (or rather, another table for mesh). Should they be following TLSConnectionPolicy?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the mesh case is not covered in this proposal.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should be accepting new APIs into this repo that don't take mesh (now a core use case of the project) into account. I am not talking about handling "mesh transport", but we can't just have no answer for what mesh client that is implementing the API should do when they API is there. Even if it's saying "should always ignore it"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I spelled it out in Longer Term Goals that service mesh use cases are a worthy goal, but it's a goal that may need a different GEP for proper attention. I hope we don't get into a habit where we approve/merge a preliminary GEP with clearly stated goals and then request to change the goals later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a small clarification to longer term goals to say that implementations that support those 3 use-cases should ignore TLSConnectionPolicy?

geps/gep-1897.md Outdated
This table describes the effect that a TLSConnectionPolicy has on a Route. There are only two cases where the
TLSConnectionPolicy will signal a Route to connect to a backend using TLS, an HTTPRoute with a backend that is targeted
by a TLSConnectionPolicy, either with or without listener TLS configured. (There are a few other cases where it may be
possible, but is purposely marked “not supported” due to a desire for less confusion on the assigned purpose of each of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does an implementation do for "not supported"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my understanding, these are invalid configurations, so they wouldn't pass validation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how can they fail validation? The configuration and the route are separate objects

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is true, I didn't think it through. So for "not supported", the intention really is that this is not supported, at least not in this proposal, where we're only covering HTTPRoute.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like something that would benefit from both nested conditions (as suggested in https://github.com/kubernetes-sigs/gateway-api/pull/2113/files#r1239031972) and the SupportedFeatures GEP from @LiorLieberman. We can say that every implementation that claims supports for this policy needs to populate status on the policy describing which Routes it is being implemented for (and maybe also which Routes it is not being implemented for).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about,

Every implementation that claims supports for TLSConnectionPolicy should document for which Routes it is being implemented.

geps/gep-1897.md Outdated
| UDPRoute | No listener TLS | N/A | No |
| UDPRoute | Listener TLS | N/A | No |
| UDPRoute | No listener TLS | N/A | No |
| GRPCRoute | N/A | N/A | No |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should probably be the same as HTTP?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not proposing any other changes in behavior except for HTTPRoute, so I'm not in a hurry to make GRPCRoute have the same behavior as HTTPRoute and be changed as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That feels fine for provisional but GRPCRoute should probably be supported for implementable. Not blocking right now though IMO

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would request to wait until GRPCRoute itself is no longer experimental. https://gateway-api.sigs.k8s.io/api-types/grpcroute/

Is that acceptable?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this may actually be pretty straightforward and identical to HTTP as @howardjohn suggested. @gnossen can correct me here, but I think we should expect the same config from TLS from Gateway -> Backend to apply equally to HTTPRoute and GRPCRoute.

As far as the experimental status of GRPCRoute, it's hard to get right. We've gotten into situations in upstream Kubernetes where we let in a bunch of incompatible alpha features and it became quite a pain to graduate any of them past alpha. In this case with GRPCRoute being fairly stable at this point, I'd rather cover it here unless it ends up adding significant complexity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will mark it "implementation-dependent" intstead of "Yes" in the "Connect to backend with TLS?" column.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @robscott . Disregarding any issues of experimental state, I think the values for GRPCRoute should be the same as for HTTPRoute in this table.

geps/gep-1897.md Outdated Show resolved Hide resolved
geps/gep-1897.md Outdated
these are currently implementation-dependent, with the following recommended defaults:
- Server Name Indication: enables passing of the server name through server name indication, in the TLS transaction, to
assist with selection of certificates when several hosts share the same IP address. (default to enabled)
- Subject Alternative Name certificates: enable the use of a single certificate that can serve multiple domains.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is "enabled"? this is a property of the verification. So does enabled mean we should verify it (we ought to by default or it's the equivalent of -k in curl).

If so what are we verifying? It's not just a bool.

Browsers handle this by asserting the DNS name matches the certificate, but a Kubernetes service doesn't really have as strict of a mapping.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what "enabled" means is that a SAN (multi-domain) certificate will be accepted when true. When false, a certificate containing SANs will be rejected right away. The verification is always done (unless disabled otherwise with a different setting).

Regarding what we are verifying it's a good question: in Kubernetes (or service meshes like Istio) certificates may contain SPIFFE ids but the client needs to be provided with a mapping (service to valid SPIFFE id) to authorize the server.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SAN is the only non-deprecated way to encode a DNS name as far as I know (CN was deprecated 20 years ago). So it's not a question of if we verify San or not - we MUST.

The question is what SAN do we verify. Tou don't have to get into spife even, the pod could have any certificate..

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the simplest case would be to assume svc.ns.svc.cluster.local but I am not sure that is sufficient, or even a common usage

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that SAN certs have been around for long and CN-only certs have been deprecated so this setting is not needed/useful.

Regarding "what SAN do we verify": may be that needs to be an additional field here with defaults for inferred names like "svc.ns.svc.cluster.local"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The HTTPRoute hostname is the public internet facing hostname, not the internal service. So I'm a route from example.com->example-svc, I would Not expect the workload to have a cert for example.com

it could be example-svc.ns.svc.cluster.local maybe - but there is no standard

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't that very service-mesh specific to not expect the pod to have a cert for example.com? Alternatively, with SANs, it could list example.com and example-svc.ns.svc.cluster.local.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not related to service mesh in any way. I wouldn't expect them to have a cert for example.com since the Gateway itself has that. And it's not clear why you would serve the same cert in 2 places?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be fair they could have any cert - even google.com (with their own ca, of course). So anything is possible

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The internal workload on the pod can serve requests either from a Gateway or another internal workload (the mesh scenario) and in both cases is likely to serve the same certificate (unless SNI is used?) which is more likely to be example-svc and not example.com.

candita and others added 3 commits June 15, 2023 22:55
Co-authored-by: Rob Scott <rob.scott87@gmail.com>
Co-authored-by: Rob Scott <rob.scott87@gmail.com>
@candita candita force-pushed the gep-1897-backend-tls-update branch from 12a6b24 to 5d84379 Compare June 16, 2023 15:16
geps/gep-1897.md Show resolved Hide resolved
//
// * “Accepted”
//
// Possible reasons for this condition to be False are:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can ResolvedRefs condition be also used here? since the policy addresses external references - service, secrets

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds reasonable, unless there are privacy issues. As an implementor, what kind of condition reasons would you think should be there? Should there be a general validation on the certificate so it could have an invalid cert reason? Or does this store results of an analysis of reference grants between the policy and targets?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think most implementations would only look at this if they were implementing a Service that was targeted by one of these policies, so I don't think ResolvedRefs would be relevant in that case, but I agree that it likely would be very useful for the secret reference(s) that will be included.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one I will pass on unless there is further interest.

geps/gep-1897.md Outdated
TargetRef gatewayv1a2.PolicyTargetReference `json:"targetRef"`

// TLS contains TLS connection policy configuration.
TLS *TLSConnectionPolicyConfig `json:”tls”`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be better as a list that's keyed off the PortNumber

Then it's possible to have different TLS configurations for different ports in a single Service. Unless you were specifically looking to have those be distinct policies?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The guidelines are only one target ref per policy attachment. Maybe this also depends on whether we create a new type of target ref that includes port or ports.
x-ref https://github.com/kubernetes-sigs/gateway-api/pull/2113/files/5d84379d2aa570958ce6dbd3fb8cf187447e9f95#r1231594914

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To close this one out, we're adding sectionName to Policy TargetRef - which will allow you to target the port's name field.

geps/gep-1897.md Outdated
// if the connection should be TLS, the targetRef’s certificate
// should be validated by the certs in TrustedCACertRefs, and a
// status delivered in the response for validation failures.
Port PortNumber `json:port,omitempty`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we care about letting policy authors use port names and not just numbers?

Similar to how K8s Service ports.targetPort is IntOrString ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's decide on whether we need a new object that includes Port, then we can decide on this.

@robscott
Copy link
Member

why not have a valid list of hostnames and have the first entry be used for SNI. This would save you from needing to do a migration in the future (eg. projectcontour/contour#5520)

The goal here is to start with the simplest possible API and also ensure portability as broadly as possible. As I mentioned in my 4th point, this is going to be the first API we've introduced where partial support is unlikely to suffice. Implementations will likely either need to fully implement the resource or consider the backend invalid.

I think our migration path is also a bit simpler than what Contour would be dealing with. In a future world we might have something like this:

sni: This field will configure the SNI the Gateway should use to connect to the backend. This field will configure the SNI the Gateway should use to connect to the backend. Implementations MUST validate that at least one name in the certificate served by the backend matches this field unless the subjectNames field is configured.
subjectNames: Implementations MUST validate that at least one name in the certificate served by the backend matches one name specified in this list.

@dprotaso
Copy link
Contributor

@robscott sounds good - subjectNames support can be an subsequent follow up GEP

@youngnick
Copy link
Contributor

Thanks for all the work on this @candita! I've got some high level thoughts and suggestions on this one:

  1. Split trustedCACertRefs into trustedCA.certRefs and trustedCA.wellKnownRoots. The wellKnownRoots field will be an enum that allows users to specify NONE or SYSTEM values. When SYSTEM is specified, the implementation will use the set of CAs trusted by the Gateway. The default value for wellKnownRoots would be NONE, users must specify either SYSTEM for wellKnownRoots or provide ref(s) to custom CA Certs.

Rob and I have spoken about this, and I agree with these points - except that I think that the default value for wellKnownRoots should be "", but otherwise have the same behavior as NONE - in that you must specify either system (note lower case instead of all caps to match our usual constant value types), in which case the implementation must use whatever system provided CAs are available, or, probably more commonly, you provide at least one entry in trustedCA.certRefs to a valid Configmap or Secret containing a PEM-encoded CA certificate bundle.

@mikemorris
Copy link
Contributor

Transition from allowedSANs to a new sni or hostname field that will be required. This field will configure the SNI the Gateway should use to connect to the backend. Implementations MUST validate that at least one name in the certificate served by the backend matches this field. In the future we may need to support a variation between SNI and allowed SAN(s), but to start it feels like we can cover >90% of use cases with this simpler configuration.

It looks like the existing PreciseHostname API type would be the right fit for the type of this field (as opposed to Hostname which allows wildcard values).

@candita
Copy link
Contributor Author

candita commented Aug 30, 2023

3. Split trustedCACertRefs into trustedCA.certRefs and trustedCA.wellKnownRoots. The wellKnownRoots field will be an enum that allows users to specify NONE or SYSTEM values. When SYSTEM is specified, the implementation will use the set of CAs trusted by the Gateway. The default value for wellKnownRoots would be NONE, users must specify either SYSTEM for wellKnownRoots or provide ref(s) to custom CA Certs.

I'm going to rename the wellKnownRoots to alternateCerts, or maybe caCerts, which I feel are more slightly more descriptive for this peculiar field.

Edit - I changed this a little, adding CertRefs and CACertRefs in the place of trustedCACertRefs. Otherwise, I adopted this suggestion.

@candita
Copy link
Contributor Author

candita commented Aug 30, 2023

@robscott @youngnick @costinm, I Just pushed requested changes, PTAL when you get a chance.

  • Change goals to remove references to TLS versions and cipher suites (use case # 5)
  • Move TLS versions into Longer Term Goals
  • Clear up language in various places to support changes to API field trustedCACerts
  • Add reference to issue Update TargetREf in Policy GEP #2147
  • Add 3 new API fields to TLS - CertRefs, CACertRefs, and Hostname
  • Remove from TLS - TrustedCACerts, AllowedSans, TLSVersions
  • Rename Open Questions to Answered Questions and move some unanswered questions to Graduation Criteria
  • Add Graduation Criteria section

geps/gep-1897.md Outdated Show resolved Hide resolved
geps/gep-1897.md Outdated
The names of the fields were chosen to facilitate discussion, but may be substituted without blocking acceptance of the
content of the API change.

The `TrustedCA` contains the `CertRefs` and `CACertRefs` fields. CertRefs is a slice of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know, naming is hard, but I think it's important that the field currently called CACertRefs does not end with Refs - that is reserved in the API conventions for structs that contain a list of references to objects.

How about instead of CACertRefs, we have UseExtraCerts - I suggest this because it has two possible values (YAML shown for ease of typing):

  • trustedCAs.useExtraCerts: "" (or unspecified) - don't use any extra certificates for Trusted CAs.
  • trustedCAs.useExtraCerts: "system" - use the system certificates as extra trusted CAs.

This name implies that it's possible to use CertRefs with UseExtraCerts - which I think has undefined behavior at this time.

Other names I can think of:

  • AdditionalCerts (YAML trustedCAs.additionalCerts) Also implies a merging behavior between CertRefs and AdditionalCerts.
  • AlternativeCerts (YAML trustedCAs.alternativeCerts) Implies that you can only choose either CertRefs or AdditionalCerts - they are mutually exclusive.

Whatever name we choose, we need to ensure that the behavior when you set both is defined. ("That's a syntax error" is okay, as is "the lists must be merged" or other possible answers, we just need to pick one, which we can also mark as temporary - given that this is Provisional.)

Again, naming is hard. ping @robscott as well as @candita.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I've struggled with naming on this one as well. Of course I'd originally proposed wellKnownRoots but obviously that has it's flaws, and not all systems are necessarily going to be using "well-known" values. So maybe we should just narrow down the scope of this field a bit. It seems very unlikely that Kubernetes or Gateway API will ever try to maintain a common set of certificates that should be used for this purpose, so we can likely say that whatever values are supported here other than none/empty, are going to be at least somewhat implementation-specific.

I think we should use the following guidelines when choosing a name here:

  1. Avoid a name that locks us into whether/how this combines with certRefs. My personal opinion is that we should start with validation that ensures that only certRefs or this field can be set, and we can loosen that later if needed. I think this would rule out additional, alternative, or extra.
  2. Avoid a boolean value so we're not conflicting with the relevant Kubernetes API Conventions.
  3. Building on 2, allow for some theoretical room for expansion. I think the most likely value here would be domain-prefixed values if for some reasons implementations want to support predefined sets of CAs. Even this seems highly unlikely at this point, but want to at least leave room for it so we're not boxed in years from now.

With all that said, I'm struggling to actually find a good name that meets this criteria. Some more ideas:

A. trustedCAs.systemCerts: "System"
B. trustedCAs.predefinedCerts: "System"
C. trustedCAs.standardCerts: "System"

As far as whether the empty value should be explicit ("None") or just empty (nil pointer or ""), I tend to lean towards a default of "None" so there's absolutely no ambiguity here. If we'd rather avoid "None", I think I'd lean towards a nil pointer instead of "", but open to what others think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@youngnick I changed the name of CACertRefs, didn't realize the Refs suffix had a semantic association. Should I also change the name of CertRefs, which a slice of ConfigMapNames, or is that fitting?

@robscott and I had this conversation last night and I originally proposed alternativeCerts (amongst others) but I get Rob's point about preferring to use a name that doesn't lock us into how this field combines with certRefs. I chose standardCerts. certRefs and standardCerts are intended to be mutually exclusive in the first round.

Regarding using a nil pointer instead of "" for standardCerts, I'd rather leave it "" so that the field is visible when yaml is examined. If we don't normally do that for our users, I can make it a pointer.

Copy link
Contributor

@youngnick youngnick Aug 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I missed that CertRefs is a slice of strings that are ConfigMap names. Yes, that should be a []ObjectReference instead, with a default of Configmap as the Kind (which puts "", meaning "core" as the Group). That way, you can specify a list of certificates in Configmaps like this:

    ...
    certRefs:
    - name: CAcert1
    - name: CAcert2
    ...

or, you can also specify other objects like Secrets if you want. Anything not a ConfigMap should be Implementation Specific support in the initial version though. This lets us change that much more easily lately. (Practically, to be able to set the defaults, you'll need to create another type like SecretObjectReference - probably ConfigmapObjectReference).

Using a nil pointer or "" as the default value makes no difference for display - because the field is optional and will have omitempty set, then a nil pointer (for a *string type) or an empty string (for a string type) will both be not necessary in input YAMLs, and hidden in output YAMLs. The only way you'll see the standardCerts field (which I agree is a good name) is if it's set to system. However, if you do set standardCerts: system, then the CertRefs field also must be empty in this version, so it will also be hidden on output.

So you'll only ever see:

    ...
    certRefs:
    - name: CAcert1
    ...

or

    ...
	standardCerts: "system"
    ...

Which I think is what we want.

However, there's also an interaction with Enum validation, in that we can't specify "" as an Enum value. So, I recommend changing standardCerts to a *string, because then we can have an Enum with only one value (system).
This lets us add further Enum values later, while keeping the above behavior.

So, my requested changes here are:

  • CertRefs should be a []ConfigmapObjectReference, where that is an ObjectReference struct that defaults to Configmap.
  • standardCerts should be a string pointer field with Enum validation, and one valid value: system (or System if that's what we prefer - personally, I prefer these constants to be all lower-case, it makes validation easier).

Copy link
Contributor Author

@candita candita Sep 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@youngnick we discussed offline, and I made the requested change to CertRefs and changed StandardCerts like this:

  • standardCerts should be a StandardCertType pointer field with Enum validation, and one valid value: System. An informal review showed that alias string values usually start with a capital letter in this API, particularly condition types and reasons, so I used System rather than system.

geps/gep-1897.md Outdated
Comment on lines 276 to 279
// If the CertRefs is empty, then the system trusted
// certificates should be used. If there are none, or the
// implementation doesn't define system trusted certificates,
// then a TLS connection must fail.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should remove this because this behavior is now handled by the other field (CACertRefs here, but as I said above, I think it needs a new name).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a user could accidentally try to use System certs but the system doesn't have them, so I think it should stay in.

Copy link
Contributor

@youngnick youngnick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Github, for being so slow I duplicated my comment.

Copy link
Contributor

@youngnick youngnick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have one concern about field naming, and a small fix to the struct details, but aside from that, this LGTM.

@candita
Copy link
Contributor Author

candita commented Aug 30, 2023

@youngnick @robscott I made the updates you suggesed and added a couple of CEL validations to TLSConnectionPolicyConfig. PTAL the changes in the latest commit when you get a chance.

Copy link
Member

@robscott robscott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the work on this @candita! A couple tiny nits and a suggestion to move back to a backend focused policy name but otherwise I think this is good to go.

geps/gep-1897.md Outdated Show resolved Hide resolved
geps/gep-1897.md Outdated Show resolved Hide resolved
geps/gep-1897.md Outdated Show resolved Hide resolved
@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 6, 2023
@candita
Copy link
Contributor Author

candita commented Sep 6, 2023

/label tide/merge-method-squash

@k8s-ci-robot k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Sep 6, 2023
@robscott
Copy link
Member

robscott commented Sep 6, 2023

🎉 Thanks for all the work on this @candita!

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 6, 2023
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: candita, keithmattix, robscott

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit d09b142 into kubernetes-sigs:main Sep 6, 2023
9 checks passed
@youngnick
Copy link
Contributor

🎉 Thanks for all your work on this @candita!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/gep PRs related to Gateway Enhancement Proposal(GEP) lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

GEP: TLS from Gateway to Backend for ingress (backend TLS termination)