-
Notifications
You must be signed in to change notification settings - Fork 49
Tenants added via UI after cluster creation are using invalid certificate for cortex/dd/loki ingress #923
Comments
Thanks for the detailed write-up, Nick! Want to say why I didn't notice this before: for writing https://opstrace.com/blog/introducing-dynamic-tenant-addition-and-token-management I used an Opstrace instance with LE staging certs and actually did A not-so-surprising lesson in feature qualification testing :). "what's not tested is broken", I want to repeat that for us as often as I can. This was untested, and therefore was likely to be broken -- but it was not really a conscious choice, and that is an important insight. Our CI uses LE staging certs and disables certificate verification basically everywhere. We need to be super mindful about this testing blind spot. Thanks for testing this, Nick. |
@jgehrcke We have unit tests to check the https-cert secret is copied over to the tenant namespaces but clearly it wasn't enough to catch this earlier.
@nickbp The secret is copied over to the tenant namespace here but maybe there's a bug in the secret equality check in the reconcile loop or in the secret update request that prevents the controller from updating the secret. |
Looking through the I'm thinking something like this might be happening:
An easy fix would be to just remove the If that's not feasible, another option would be to validate the SANs in the certificate before copying it, but that might be expensive or complicated? |
Actually I'm starting to think we could just create/manage the tenant certificates as distinct objects, rather than adding everything into the "main" In terms of LetsEncrypt rate limiting it sounds like both the current combined certificate and keeping the certs separate would be the same, since both are resulting in a renewal request whenever a tenant is added:
If anything, the new "separate certs" structure wouldn't trigger a cert renewal in the case of a tenant being deleted, while in today's "single cert" structure I'm guessing that it does renew against the new shortened SANs. However this depends on what cert-manager does in this scenario. However there is the caveat that when a cluster is initially created, there would be N(tenants) cert creations instead of 1 cert creation. |
The controller currently creates one certificate/secret up-front, and then copies the secret to the tenant and application namespaces. The secret is then marked immutable, which leads to problems if the certificate is updated later or if tenant SANs are added asynchronously to the cert (see #923). This PR switches to a model of one cert object per tenant + one for the application namespace. The downside is that this means Ntenants+1 certs will be created with LetsEncrypt when the cluster is first created, when previously it was one cert for everyone. However it avoids the need to copy certificates across namespaces and keeps each tenant on a granular certificate. It also removes the prior immutable setting, allowing certs to be renewed automatically when they're approaching their expiration. An alternate strategy would be to create a single `*.cluster.domain.io` wildcard cert to be shared by everybody, thereby reducing the number of interactions with LetsEncrypt. However we would still likely want to remove the immutable flag on the secret in order to allow expiration updates. Signed-off-by: Nick Parker <nick@opstrace.com>
* controller: manage tenant certs separately The controller currently creates one certificate/secret up-front, and then copies the secret to the tenant and application namespaces. The secret is then marked immutable, which leads to problems if the certificate is updated later or if tenant SANs are added asynchronously to the cert (see #923). This PR switches to a model of one cert object per tenant + one for the application namespace. The downside is that this means Ntenants+1 certs will be created with LetsEncrypt when the cluster is first created, when previously it was one cert for everyone. However it avoids the need to copy certificates across namespaces and keeps each tenant on a granular certificate. It also removes the prior immutable setting, allowing certs to be renewed automatically when they're approaching their expiration. An alternate strategy would be to create a single `*.cluster.domain.io` wildcard cert to be shared by everybody, thereby reducing the number of interactions with LetsEncrypt. However we would still likely want to remove the immutable flag on the secret in order to allow expiration updates. Signed-off-by: Nick Parker <nick@opstrace.com> * controller: need to specify type=ClusterIssuer in certificate objects Signed-off-by: Nick Parker <nick@opstrace.com> * controller: skip calculating diff if debug is not enabled Signed-off-by: Nick Parker <nick@opstrace.com> * controller: avoid setting annotation to undefined value Signed-off-by: Nick Parker <nick@opstrace.com> * controller: subscribe to v1 format CRD stream instead of v1beta1 Signed-off-by: Nick Parker <nick@opstrace.com>
Fixed via above PR. After creating a tenant, it can take a couple minutes for the new cert to arrive, but once it has arrived the new endpoints work as expected:
then later ...
As seen above the other certs are left as-is when this tenant is added. |
In a cluster named
ship
, I created a set of tenants after the cluster was already running. This was done via following the blog post to create the keys, as well as adding the tenant entities in the UI. I saw that the new tenant namespaces were created and looked like they were working fine.Afterwards, I was seeing errors when attempting to push metrics to
cortex.<tenant>
endpoints from an external prometheus instance:After disabling cert validation, I saw that the metrics were arriving in cortex successfully, so it looks like the ingress itself was correctly configured, but it was just using the wrong certificate.
I then checked one of the ingress controller pods and saw that it was flooding the logs with cert file validation warnings like this:
I then checked the configured Certificate object and saw that it mentioned all of the new tenants (
labels-*
,metrics-*
,series-*
) alongside the tenants that were started when the cluster was first created (dev
,prod
,staging
,system
):Going back to the error mentioned by the ingress controller, I looked at the openssl debug dump of the cert secret that it was complaining about (just stuck with viewing the public cert for the purposes of this ticket):
The above is where things start looking wrong - the SANs listed in the above certificate info do not include the newly created credentials.
If we check the
https-cert
copy from theingress
namespace, we see that the SANs there do include the new tenants:Summary:
https-cert
in each of the per-tenant namespaces. However, that copy doesn't have updated SANs with the new tenants added after the cluster was created, so the ingress controller logs a warning and falls back to a defaultingress.local
certificate for those tenants.https-cert
secret in theingress
namespace appears to have the updated SANs.ingress
namespace copy is coming from? (note: Was originally thinking that the per-tenant namespace copies could just go away entirely in favor of theingress
namespace copy, but it looks like per-namespace Ingresses reference this object, so that might not be possible - or is it?)Steps to reproduce:
curl -v https://cortex.newtenant.mycluster.opstrace.io
openssl
debug dump of a certificate mentioned by the ingress controllerWorkarounds:
The text was updated successfully, but these errors were encountered: