This repository has been archived by the owner on Jan 12, 2022. It is now read-only.

Tenants added via UI after cluster creation are using invalid certificate for cortex/dd/loki ingress #923

Closed
nickbp opened this issue Jun 22, 2021 · 5 comments
Labels
type: bug user-facing problem (describe the problem in the title!)

Comments

@nickbp
Contributor

nickbp commented Jun 22, 2021

In a cluster named ship, I created a set of tenants after the cluster was already running. I did this by following the blog post to create the keys and then adding the tenant entities in the UI. The new tenant namespaces were created and looked like they were working fine.

Afterwards, I was seeing errors when attempting to push metrics to cortex.<tenant> endpoints from an external prometheus instance:

ts=2021-06-22T00:34:03.583Z caller=dedupe.go:112 component=remote level=warn remote_name=78c788 url=https://cortex.labels-a.ship.opstrace.io/api/v1/push msg="Failed to send batch, retrying" err="Post \"https://cortex.labels-a.ship.opstrace.io/api/v1/push\": x509: certificate is valid for ingress.local, not cortex.labels-a.ship.opstrace.io"

After disabling cert validation, I saw that the metrics were arriving in cortex successfully, so it looks like the ingress itself was correctly configured, but it was just using the wrong certificate.

I then checked one of the ingress controller pods and saw that it was flooding the logs with cert file validation warnings like this:

$ kubectl logs -n ingress nginx-ingress-controller-api-5ljjq
[...]
W0621 23:05:49.576164 6 controller.go:1181] Validating certificate against DNS names. This will be deprecated in a future version
W0621 23:05:49.576174 6 controller.go:1186] SSL certificate "series-b-tenant/https-cert" does not contain a Common Name or Subject Alternative Name for server "loki.series-b.ship.opstrace.io": x509: certificate is valid for *.dev.ship.opstrace.io, *.prod.ship.opstrace.io, *.staging.ship.opstrace.io, *.system.ship.opstrace.io, dev.ship.opstrace.io, prod.ship.opstrace.io, ship.opstrace.io, staging.ship.opstrace.io, system.ship.opstrace.io, not loki.series-b.ship.opstrace.io
W0621 23:05:49.576184 6 controller.go:1187] Using default certificate
W0621 23:05:49.576198 6 controller.go:1180] Unexpected error validating SSL certificate "series-b-tenant/https-cert" for server "dd.series-b.ship.opstrace.io": x509: certificate is valid for *.dev.ship.opstrace.io, *.prod.ship.opstrace.io, *.staging.ship.opstrace.io, *.system.ship.opstrace.io, dev.ship.opstrace.io, prod.ship.opstrace.io, ship.opstrace.io, staging.ship.opstrace.io, system.ship.opstrace.io, not dd.series-b.ship.opstrace.io
[... the errors continue looping across cortex/dd/loki endpoints for each newly added tenant ...]
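
To make the warning above concrete, here is a minimal sketch of the hostname check that fails. It assumes the usual rule that a wildcard SAN covers exactly one leftmost label; the function names are made up for illustration and this is not the ingress controller's actual code.

```typescript
// Returns true if `host` is covered by a single SAN entry.
// Assumption: a wildcard is only honoured as the whole leftmost label
// and stands in for exactly one label.
function sanEntryMatches(host: string, san: string): boolean {
  const h = host.toLowerCase();
  const s = san.toLowerCase();
  if (!s.startsWith("*.")) {
    return h === s;
  }
  const suffix = s.slice(1); // e.g. ".dev.ship.opstrace.io"
  if (!h.endsWith(suffix)) {
    return false;
  }
  const label = h.slice(0, h.length - suffix.length);
  return label.length > 0 && !label.includes(".");
}

// Returns true if any SAN in the list covers the host.
function hostCoveredBy(host: string, sans: string[]): boolean {
  return sans.some((san) => sanEntryMatches(host, san));
}

// SANs reported in the log for series-b-tenant/https-cert.
const staleSans: string[] = [
  "*.dev.ship.opstrace.io",
  "*.prod.ship.opstrace.io",
  "*.staging.ship.opstrace.io",
  "*.system.ship.opstrace.io",
  "dev.ship.opstrace.io",
  "prod.ship.opstrace.io",
  "ship.opstrace.io",
  "staging.ship.opstrace.io",
  "system.ship.opstrace.io",
];

console.log(hostCoveredBy("loki.series-b.ship.opstrace.io", staleSans)); // false
console.log(hostCoveredBy("loki.dev.ship.opstrace.io", staleSans)); // true
```

The new tenant's hostname is not covered by any entry, which is consistent with the controller falling back to its default certificate.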

I then checked the configured Certificate object and saw that it mentioned all of the new tenants (labels-*, metrics-*, series-*) alongside the tenants that were started when the cluster was first created (dev, prod, staging, system):

$ kubectl describe certificate -n ingress https-cert
Name:         https-cert
Namespace:    ingress
Labels:       <none>
Annotations:  opstrace: owned
API Version:  cert-manager.io/v1
Kind:         Certificate
Metadata:
  Creation Timestamp:  2021-06-15T21:34:11Z
  Generation:          15
  [...]
  Resource Version:  2262716
  Self Link:         /apis/cert-manager.io/v1/namespaces/ingress/certificates/https-cert
  UID:               40543cf7-4d9a-47b0-bddf-47f236d77cfa
Spec:
  Dns Names:
    ship.opstrace.io
    dev.ship.opstrace.io
    *.dev.ship.opstrace.io
    labels-a.ship.opstrace.io
    *.labels-a.ship.opstrace.io
    labels-b.ship.opstrace.io
    *.labels-b.ship.opstrace.io
    labels-c.ship.opstrace.io
    *.labels-c.ship.opstrace.io
    labels-d.ship.opstrace.io
    *.labels-d.ship.opstrace.io
    metrics-a.ship.opstrace.io
    *.metrics-a.ship.opstrace.io
    metrics-b.ship.opstrace.io
    *.metrics-b.ship.opstrace.io
    metrics-c.ship.opstrace.io
    *.metrics-c.ship.opstrace.io
    metrics-d.ship.opstrace.io
    *.metrics-d.ship.opstrace.io
    prod.ship.opstrace.io
    *.prod.ship.opstrace.io
    series-a.ship.opstrace.io
    *.series-a.ship.opstrace.io
    series-b.ship.opstrace.io
    *.series-b.ship.opstrace.io
    staging.ship.opstrace.io
    *.staging.ship.opstrace.io
    system.ship.opstrace.io
    *.system.ship.opstrace.io
  Issuer Ref:
    Kind:       Issuer
    Name:       letsencrypt-prod
  Secret Name:  https-cert
Status:
  Conditions:
    Last Transition Time:  2021-06-21T05:05:48Z
    Message:               Certificate is up to date and has not expired
    Reason:                Ready
    Status:                True
    Type:                  Ready
  Not After:               2021-09-19T04:05:46Z
  Not Before:              2021-06-21T04:05:47Z
  Renewal Time:            2021-08-20T04:05:46Z
  Revision:                3
Events:                    <none>

Going back to the error mentioned by the ingress controller, I looked at the openssl debug dump of the cert secret that it was complaining about (only the public cert is shown for the purposes of this ticket):

$ kubectl get secret -n series-b-tenant https-cert -o json | jq -r '.data["tls.crt"]' | base64 -d | openssl x509 -noout -text
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            04:8d:e1:47:3b:11:91:c6:38:5d:1a:07:6c:74:b1:85:c7:6c
        Signature Algorithm: sha256WithRSAEncryption
        Issuer: C = US, O = Let's Encrypt, CN = R3
        Validity
            Not Before: Jun 21 03:56:11 2021 GMT
            Not After : Sep 19 03:56:10 2021 GMT
        Subject: CN = ship.opstrace.io
        Subject Public Key Info:
            Public Key Algorithm: rsaEncryption
                RSA Public-Key: (2048 bit)
                Modulus:
                    [...]
        X509v3 extensions:
            X509v3 Key Usage: critical
                Digital Signature, Key Encipherment
            X509v3 Extended Key Usage: 
                TLS Web Server Authentication, TLS Web Client Authentication
            X509v3 Basic Constraints: critical
                CA:FALSE
            X509v3 Subject Key Identifier: 
                70:84:A1:42:70:99:68:36:A7:96:B9:4F:8B:90:77:92:7E:1A:72:12
            X509v3 Authority Key Identifier: 
                keyid:14:2E:B3:17:B7:58:56:CB:AE:50:09:40:E6:1F:AF:9D:8B:14:C2:C6

            Authority Information Access: 
                OCSP - URI:http://r3.o.lencr.org
                CA Issuers - URI:http://r3.i.lencr.org/

            X509v3 Subject Alternative Name: 
                DNS:*.dev.ship.opstrace.io, DNS:*.prod.ship.opstrace.io, DNS:*.staging.ship.opstrace.io, DNS:*.system.ship.opstrace.io, DNS:dev.ship.opstrace.io, DNS:prod.ship.opstrace.io, DNS:ship.opstrace.io, DNS:staging.ship.opstrace.io, DNS:system.ship.opstrace.io
            X509v3 Certificate Policies: 
                Policy: 2.23.140.1.2.1
                Policy: 1.3.6.1.4.1.44947.1.1.1
                  CPS: http://cps.letsencrypt.org

[...]

The above is where things start looking wrong: the SANs listed in the above certificate info do not include the newly created tenants.

If we check the https-cert copy from the ingress namespace, we see that the SANs there do include the new tenants:

$ kubectl get secret -n ingress https-cert -o json | jq -r '.data["tls.crt"]' | base64 -d | openssl x509 -noout -text
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            04:a9:e1:33:eb:1d:c4:e1:52:5f:fd:2d:79:7b:60:b9:17:36
        Signature Algorithm: sha256WithRSAEncryption
        Issuer: C = US, O = Let's Encrypt, CN = R3
        Validity
            Not Before: Jun 21 04:05:47 2021 GMT
            Not After : Sep 19 04:05:46 2021 GMT
        Subject: CN = ship.opstrace.io
        Subject Public Key Info:
            Public Key Algorithm: rsaEncryption
                RSA Public-Key: (2048 bit)
                Modulus:
                    [...]
        X509v3 extensions:
            X509v3 Key Usage: critical
                Digital Signature, Key Encipherment
            X509v3 Extended Key Usage: 
                TLS Web Server Authentication, TLS Web Client Authentication
            X509v3 Basic Constraints: critical
                CA:FALSE
            X509v3 Subject Key Identifier: 
                70:84:A1:42:70:99:68:36:A7:96:B9:4F:8B:90:77:92:7E:1A:72:12
            X509v3 Authority Key Identifier: 
                keyid:14:2E:B3:17:B7:58:56:CB:AE:50:09:40:E6:1F:AF:9D:8B:14:C2:C6

            Authority Information Access: 
                OCSP - URI:http://r3.o.lencr.org
                CA Issuers - URI:http://r3.i.lencr.org/

            X509v3 Subject Alternative Name: 
                DNS:*.dev.ship.opstrace.io, DNS:*.labels-a.ship.opstrace.io, DNS:*.labels-b.ship.opstrace.io, DNS:*.labels-c.ship.opstrace.io, DNS:*.labels-d.ship.opstrace.io, DNS:*.metrics-a.ship.opstrace.io, DNS:*.metrics-b.ship.opstrace.io, DNS:*.metrics-c.ship.opstrace.io, DNS:*.metrics-d.ship.opstrace.io, DNS:*.prod.ship.opstrace.io, DNS:*.series-a.ship.opstrace.io, DNS:*.series-b.ship.opstrace.io, DNS:*.staging.ship.opstrace.io, DNS:*.system.ship.opstrace.io, DNS:dev.ship.opstrace.io, DNS:labels-a.ship.opstrace.io, DNS:labels-b.ship.opstrace.io, DNS:labels-c.ship.opstrace.io, DNS:labels-d.ship.opstrace.io, DNS:metrics-a.ship.opstrace.io, DNS:metrics-b.ship.opstrace.io, DNS:metrics-c.ship.opstrace.io, DNS:metrics-d.ship.opstrace.io, DNS:prod.ship.opstrace.io, DNS:series-a.ship.opstrace.io, DNS:series-b.ship.opstrace.io, DNS:ship.opstrace.io, DNS:staging.ship.opstrace.io, DNS:system.ship.opstrace.io
            X509v3 Certificate Policies: 
                Policy: 2.23.140.1.2.1
                Policy: 1.3.6.1.4.1.44947.1.1.1
                  CPS: http://cps.letsencrypt.org

[...]
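
For comparing the two dumps without eyeballing the SAN lines, something like the following sketch could help. It assumes the text layout printed by `openssl x509 -noout -text` as shown above (the SAN list on the line directly after the extension header). The helper names and the abbreviated sample inputs are illustrative only.

```typescript
// Extracts DNS SAN entries from `openssl x509 -noout -text` output.
// Assumption: the SAN list sits on the line right after the extension header.
function parseDnsSans(opensslText: string): string[] {
  const lines = opensslText.split("\n");
  const idx = lines.findIndex((l) => l.includes("X509v3 Subject Alternative Name"));
  if (idx < 0) {
    return [];
  }
  const sanLine = lines[idx + 1] ?? "";
  return sanLine
    .split(",")
    .map((entry) => entry.trim())
    .filter((entry) => entry.startsWith("DNS:"))
    .map((entry) => entry.slice("DNS:".length));
}

// Names present in `reference` but absent from `candidate`.
function missingSans(reference: string[], candidate: string[]): string[] {
  const have = new Set(candidate);
  return reference.filter((name) => !have.has(name));
}

// Abbreviated samples modelled on the two dumps above (not the full SAN lists).
const tenantCopyText = `
            X509v3 Subject Alternative Name:
                DNS:*.dev.ship.opstrace.io, DNS:dev.ship.opstrace.io, DNS:ship.opstrace.io
`;
const ingressCopyText = `
            X509v3 Subject Alternative Name:
                DNS:*.dev.ship.opstrace.io, DNS:*.series-b.ship.opstrace.io, DNS:dev.ship.opstrace.io, DNS:series-b.ship.opstrace.io, DNS:ship.opstrace.io
`;

const missing = missingSans(parseDnsSans(ingressCopyText), parseDnsSans(tenantCopyText));
console.log(JSON.stringify(missing));
// ["*.series-b.ship.opstrace.io","series-b.ship.opstrace.io"]
```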

Summary:

  • It looks like the ingress controller is looking at https-cert in each of the per-tenant namespaces. However, that copy doesn't have updated SANs with the new tenants added after the cluster was created, so the ingress controller logs a warning and falls back to a default ingress.local certificate for those tenants.
  • Meanwhile, the https-cert secret in the ingress namespace appears to have the updated SANs.
  • Possible solution: Ensure the per-tenant namespace copies are kept up to date with wherever the ingress namespace copy is coming from? (note: Was originally thinking that the per-tenant namespace copies could just go away entirely in favor of the ingress namespace copy, but it looks like per-namespace Ingresses reference this object, so that might not be possible - or is it?)

Steps to reproduce:

  • Add a tenant via the UI (tenants created via cluster init do not have this problem)
  • curl -v https://cortex.newtenant.mycluster.opstrace.io
  • Check the logs of the ingress controller pods for cert errors
  • Check the openssl debug dump of a certificate mentioned by the ingress controller

Workarounds:

  • Can disable cert validation in the client, but this exposes the tenant auth token sent by the client to potential MITM
@nickbp nickbp added the type: bug user-facing problem (describe the problem in the title!) label Jun 22, 2021
@jgehrcke
Contributor

jgehrcke commented Jun 22, 2021

Thanks for the detailed write-up, Nick!

Want to say why I didn't notice this before: for writing https://opstrace.com/blog/introducing-dynamic-tenant-addition-and-token-management I used an Opstrace instance with LE staging certs and actually did curl -vk ... for all curl-based interaction with the new endpoints, i.e. skipping certificate verification.

A not-so-surprising lesson in feature qualification testing :). "what's not tested is broken", I want to repeat that for us as often as I can.

This was untested, and therefore was likely to be broken -- but it was not really a conscious choice, and that is an important insight.

Our CI uses LE staging certs and disables certificate verification basically everywhere. We need to be super mindful about this testing blind spot. Thanks for testing this, Nick.

@sreis
Contributor

sreis commented Jun 22, 2021

A not-so-surprising lesson in feature qualification testing :). "what's not tested is broken", I want to repeat that for us as often as I can.
This was untested, and therefore was likely to be broken -- but it was not really a conscious choice, and that is an important insight.

@jgehrcke We have unit tests that check the https-cert secret is copied over to the tenant namespaces, but clearly they weren't enough to catch this earlier.

Possible solution: Ensure the per-tenant namespace copies are kept up to date with wherever the ingress namespace copy is coming from? (note: Was originally thinking that the per-tenant namespace copies could just go away entirely in favor of the ingress namespace copy, but it looks like per-namespace Ingresses reference this object, so that might not be possible - or is it?)

@nickbp The secret is copied over to the tenant namespace here but maybe there's a bug in the secret equality check in the reconcile loop or in the secret update request that prevents the controller from updating the secret.

@nickbp nickbp self-assigned this Jul 8, 2021
@nickbp nickbp added the in progress state (used by codetree) label Jul 8, 2021
@nickbp
Contributor Author

nickbp commented Sep 7, 2021

Looking through the getCertSecretCopy code, the copy is marked immutable via setImmutable() when it is created: https://github.com/opstrace/opstrace/blob/ec86ca0/packages/controller/src/resources/utils.ts#L60

I'm thinking something like this might be happening:

  1. Tenant is added
  2. ingress cert is copied to tenant namespace, copy is marked immutable
  3. ingress cert is updated with additional SANs (following LetsEncrypt negotiation)
  4. copy is not updated because it's immutable
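
The suspected sequence above can be sketched as a toy model. This is not the controller's reconcile code: `SecretCopy`, `reconcileCopy` and `simulate` are hypothetical names, and the only behaviour borrowed from Kubernetes is that updates to an immutable Secret's data are rejected.

```typescript
// Toy stand-in for the per-tenant copy of the ingress cert secret.
interface SecretCopy {
  sans: string[];
  immutable: boolean;
}

// Toy reconcile step: refresh the copy from the source unless it is immutable.
// Returns true if the copy was changed.
function reconcileCopy(source: string[], copy: SecretCopy): boolean {
  const same =
    source.length === copy.sans.length &&
    source.every((san, i) => san === copy.sans[i]);
  if (same) {
    return false;
  }
  if (copy.immutable) {
    return false; // models the update being rejected for an immutable object
  }
  copy.sans = [...source];
  return true;
}

function simulate(markImmutable: boolean): string[] {
  // Steps 1-2: tenant is added and the current ingress cert is copied over.
  let ingressSans = ["ship.opstrace.io", "*.dev.ship.opstrace.io"];
  const copy: SecretCopy = { sans: [...ingressSans], immutable: markImmutable };
  // Step 3: the ingress cert is reissued with the new tenant's SANs.
  ingressSans = [...ingressSans, "*.series-b.ship.opstrace.io"];
  // Step 4: the next reconcile pass tries to refresh the copy.
  reconcileCopy(ingressSans, copy);
  return copy.sans;
}

console.log(simulate(true).includes("*.series-b.ship.opstrace.io")); // false
console.log(simulate(false).includes("*.series-b.ship.opstrace.io")); // true
```

With the immutable flag set, the copy keeps the SANs it had at copy time, which would match the stale certificate seen in the tenant namespaces.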

An easy fix would be to just remove the setImmutable(). That might also better deal with e.g. updates when the cert gets renewed. Going to try that and see if there are any obvious side effects...

If that's not feasible, another option would be to validate the SANs in the certificate before copying it, but that might be expensive or complicated?

@nickbp
Contributor Author

nickbp commented Sep 7, 2021

Actually I'm starting to think we could just create/manage the tenant certificates as distinct objects, rather than adding everything into the "main" ingress cert and copying that around to the tenants. Then when a tenant is added (or removed) we are creating or deleting a separate certificate request for the tenant rather than updating a single common certificate. Similarly, the SANs for the tenants wouldn't be getting included in the main ingress cert and vice versa.
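
As a rough sketch of what a distinct per-tenant certificate object could look like, the function below builds a cert-manager `Certificate` manifest for one tenant. The apiVersion, secret name, issuer name and `<tenant>-tenant` namespace pattern are taken from the output earlier in this issue. The function name, the choice to include both the bare and wildcard tenant names, and the `ClusterIssuer` kind are assumptions for illustration.

```typescript
// Minimal shape of a cert-manager.io/v1 Certificate, limited to the fields used here.
interface CertificateManifest {
  apiVersion: string;
  kind: string;
  metadata: { name: string; namespace: string };
  spec: {
    secretName: string;
    dnsNames: string[];
    issuerRef: { kind: string; name: string };
  };
}

// Hypothetical helper: one Certificate per tenant, living in the tenant namespace.
function tenantCertificate(tenant: string, clusterDomain: string): CertificateManifest {
  const tenantDomain = `${tenant}.${clusterDomain}`;
  return {
    apiVersion: "cert-manager.io/v1",
    kind: "Certificate",
    metadata: { name: "https-cert", namespace: `${tenant}-tenant` },
    spec: {
      secretName: "https-cert",
      // Assumption: keep both the bare and wildcard names, as in the combined cert.
      dnsNames: [tenantDomain, `*.${tenantDomain}`],
      // Assumption: a cluster-scoped issuer, so tenant namespaces can reference it.
      issuerRef: { kind: "ClusterIssuer", name: "letsencrypt-prod" },
    },
  };
}

console.log(JSON.stringify(tenantCertificate("series-b", "ship.opstrace.io"), null, 2));
```

Because each manifest only lists its own tenant's names, adding or removing a tenant would not touch any other tenant's certificate.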

In terms of LetsEncrypt rate limiting, it sounds like the current combined certificate and separate per-tenant certs would behave the same, since both result in a renewal request whenever a tenant is added:

The main limit is Certificates per Registered Domain (50 per week). A registered domain is, generally speaking, the part of the domain you purchased from your domain name registrar. For instance, in the name www.example.com, the registered domain is example.com.

If anything, the new "separate certs" structure wouldn't trigger a cert renewal in the case of a tenant being deleted, while in today's "single cert" structure I'm guessing that it does renew against the new shortened SANs. However this depends on what cert-manager does in this scenario.

However there is the caveat that when a cluster is initially created, there would be N(tenants) cert creations instead of 1 cert creation.
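
A back-of-the-envelope sketch of that caveat, taking the quoted limit of 50 certificates per registered domain per week at face value. It ignores renewals and later tenant additions, and it counts one extra certificate for the application namespace, which is an assumption about how the separate-certs model would be laid out.

```typescript
type CertModel = "single-cert" | "per-tenant";

// Quoted above: Certificates per Registered Domain, per week.
const WEEKLY_CERT_LIMIT = 50;

// Certificates requested when a cluster is first created.
function certsAtClusterCreation(model: CertModel, initialTenants: number): number {
  // Assumption: per-tenant model also needs one cert for the application namespace.
  return model === "single-cert" ? 1 : initialTenants + 1;
}

// How many fresh clusters under one registered domain fit into the weekly limit.
function clustersPerWeek(model: CertModel, initialTenants: number): number {
  return Math.floor(WEEKLY_CERT_LIMIT / certsAtClusterCreation(model, initialTenants));
}

// Four initial tenants, as in the cluster above (dev, prod, staging, system).
console.log(certsAtClusterCreation("single-cert", 4)); // 1
console.log(certsAtClusterCreation("per-tenant", 4)); // 5
console.log(clustersPerWeek("single-cert", 4)); // 50
console.log(clustersPerWeek("per-tenant", 4)); // 10
```

So the difference mostly matters for setups that create many clusters under the same registered domain in a short time, such as CI.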

nickbp pushed a commit that referenced this issue Sep 7, 2021
The controller currently creates one certificate/secret up-front, and then copies the secret to the
tenant and application namespaces. The secret is then marked immutable, which leads to problems if
the certificate is updated later or if tenant SANs are added asynchronously to the cert (see #923).

This PR switches to a model of one cert object per tenant + one for the application namespace.
The downside is that this means Ntenants+1 certs will be created with LetsEncrypt when the cluster
is first created, when previously it was one cert for everyone. However it avoids the need to copy
certificates across namespaces and keeps each tenant on a granular certificate. It also removes the
prior immutable setting, allowing certs to be renewed automatically when they're approaching their
expiration.

An alternate strategy would be to create a single `*.cluster.domain.io` wildcard cert to be shared
by everybody, thereby reducing the number of interactions with LetsEncrypt. However we would still
likely want to remove the immutable flag on the secret in order to allow expiration updates.

Signed-off-by: Nick Parker <nick@opstrace.com>
nickbp added a commit that referenced this issue Sep 9, 2021
* controller: manage tenant certs separately

The controller currently creates one certificate/secret up-front, and then copies the secret to the tenant and application namespaces. The secret is then marked immutable, which leads to problems if the certificate is updated later or if tenant SANs are added asynchronously to the cert (see #923).

This PR switches to a model of one cert object per tenant + one for the application namespace. The downside is that this means Ntenants+1 certs will be created with LetsEncrypt when the cluster is first created, when previously it was one cert for everyone. However it avoids the need to copy certificates across namespaces and keeps each tenant on a granular certificate. It also removes the prior immutable setting, allowing certs to be renewed automatically when they're approaching their expiration.

An alternate strategy would be to create a single `*.cluster.domain.io` wildcard cert to be shared by everybody, thereby reducing the number of interactions with LetsEncrypt. However we would still likely want to remove the immutable flag on the secret in order to allow expiration updates.

Signed-off-by: Nick Parker <nick@opstrace.com>

* controller: need to specify type=ClusterIssuer in certificate objects

Signed-off-by: Nick Parker <nick@opstrace.com>

* controller: skip calculating diff if debug is not enabled

Signed-off-by: Nick Parker <nick@opstrace.com>

* controller: avoid setting annotation to undefined value

Signed-off-by: Nick Parker <nick@opstrace.com>

* controller: subscribe to v1 format CRD stream instead of v1beta1

Signed-off-by: Nick Parker <nick@opstrace.com>
@nickbp
Contributor Author

nickbp commented Sep 9, 2021

Fixed via above PR.

After creating a tenant, it can take a couple minutes for the new cert to arrive, but once it has arrived the new endpoints work as expected:

$ kubectl get certificates -A
NAMESPACE                NAME                           READY   SECRET                AGE
application              https-cert                     True    https-cert            33h
cortex-operator-system   cortex-operator-serving-cert   True    webhook-server-cert   33h
default-tenant           https-cert                     True    https-cert            33h
newtenant-tenant         https-cert                     False   https-cert            89s
system-tenant            https-cert                     True    https-cert            33h
$ curl -vH "Authorization: Bearer $(cat token-showdown-newtenant.jwt)" \
    https://cortex.newtenant.nick-test.opstrace.io/api/v1/labels
[...]
curl: (60) SSL certificate problem: self signed certificate
More details here: https://curl.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.

then later ...

$ kubectl get certificates -A
NAMESPACE                NAME                           READY   SECRET                AGE
application              https-cert                     True    https-cert            33h
cortex-operator-system   cortex-operator-serving-cert   True    webhook-server-cert   33h
default-tenant           https-cert                     True    https-cert            33h
newtenant-tenant         https-cert                     True    https-cert            3m
system-tenant            https-cert                     True    https-cert            33h
$ curl -vH "Authorization: Bearer $(cat token-showdown-newtenant.jwt)" \
    https://cortex.newtenant.nick-test.opstrace.io/api/v1/labels
[...]
{"status":"success","data":[]}

As seen above, the other certs are left as-is when this tenant is added.
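
Since the new cert takes a couple of minutes to arrive, a script that adds tenants might want to check readiness before using the new endpoints. Below is a small sketch that parses the `kubectl get certificates -A` table shown above and lists certificates that are not yet ready. The helper names are hypothetical, and the column order is assumed to match the output in this comment.

```typescript
interface CertRow {
  namespace: string;
  name: string;
  ready: boolean;
}

// Parses the plain-text table printed by `kubectl get certificates -A`.
// Assumption: columns are NAMESPACE, NAME, READY, ... separated by whitespace.
function parseCertificateTable(output: string): CertRow[] {
  const rows: CertRow[] = [];
  for (const rawLine of output.split("\n")) {
    const line = rawLine.trim();
    if (line.length === 0 || line.startsWith("NAMESPACE")) {
      continue; // skip blank lines and the header row
    }
    const [namespace = "", name = "", ready = ""] = line.split(/\s+/);
    rows.push({ namespace, name, ready: ready === "True" });
  }
  return rows;
}

// Returns "namespace/name" for every certificate that is not Ready.
function notReady(rows: CertRow[]): string[] {
  return rows.filter((row) => !row.ready).map((row) => `${row.namespace}/${row.name}`);
}

// First listing from this comment, taken shortly after the tenant was added.
const firstListing = `
NAMESPACE                NAME                           READY   SECRET                AGE
application              https-cert                     True    https-cert            33h
cortex-operator-system   cortex-operator-serving-cert   True    webhook-server-cert   33h
default-tenant           https-cert                     True    https-cert            33h
newtenant-tenant         https-cert                     False   https-cert            89s
system-tenant            https-cert                     True    https-cert            33h
`;

console.log(JSON.stringify(notReady(parseCertificateTable(firstListing))));
// ["newtenant-tenant/https-cert"]
```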

@nickbp nickbp closed this as completed Sep 9, 2021
@opstracy opstracy removed the in progress state (used by codetree) label Sep 9, 2021