Add checks to provide feedback during cert rotation #3781

Closed
ihcsim opened this issue Dec 3, 2019 · 6 comments

@ihcsim
Contributor

ihcsim commented Dec 3, 2019

The purpose of this issue is to introduce additional checks that give users helpful feedback during the trust root rotation process. (See linkerd/website#595.) This will help users determine the current state of the control plane and data plane without running low-level kubectl queries.

@grampelberg @zaharidichev LMKWYT.

The proposed workflow adds the following items to #3696:

  1. Specify the supported algorithm and key length
  2. Specify the certs' expiry date
  3. Publish (before/after) expiry events to the k8s event bus so that services like Dive can pick them up
  4. Provide feedback on expired certs via linkerd check [--proxy]. Currently, in #3696 (Add checks for issuer certificate validation), linkerd check [--proxy] returns a 503 error (tested with an expired trust root and/or an expired issuer cert)

Let the users determine the expiry dates of the trust root and issuer certs:

$ linkerd check
linkerd-mtls
------------------
√ control plane is using ECDSA trust root certificate with a P-256 key
√ identity is using ECDSA issuer certificate with a P-256 key
√ trust root does not expire in 60 days (expiry date: <notAfter>)
√ issuer certificate does not expire in 60 days (expiry date: <notAfter>)

$ linkerd check --proxy
linkerd-data-plane-mtls
------------------
√ data plane is using ECDSA trust root certificate with a P-256 key
√ data plane trust root matches control plane's trust root
√ data plane trust root does not expire in 60 days (expiry date: <notAfter>)

As the expiry date draws closer, we should publish warning events to the k8s event bus (with identity as the event owner), so that services like Dive can pick them up in the future. linkerd check should also link to the relevant documentation:

$ linkerd check
linkerd-mtls
------------------
√ control plane is using ECDSA trust root certificate with a P-256 key
√ identity is using ECDSA issuer certificate with a P-256 key
‼ trust root does not expire in 60 days (expiry date: <notAfter>)
   trust root expires on <notAfter>
   see https://linkerd.io/checks/#l5d-certs-rotation for hints
‼ issuer certificate does not expire in 60 days (expiry date: <notAfter>)
   issuer certificate expires on <notAfter>
   see https://linkerd.io/checks/#l5d-certs-rotation for help

$ linkerd check --proxy
linkerd-data-plane-mtls
------------------
√ data plane is using ECDSA trust root certificate with a P-256 key
√ data plane trust root matches control plane's trust root
‼ data plane trust root does not expire in 60 days (expiry date: <notAfter>)
   trust root expires on <notAfter>
   affected namespaces: <ns1>, <ns2>, <ns3> etc.
   see https://linkerd.io/checks/#l5d-certs-rotation for help
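
The warning events mentioned above could take roughly the following shape. This is only a sketch: it builds a plain core/v1 Event manifest as a Python dict and does not submit it to a cluster. The function name, the `CertificateExpiringSoon` reason string, the `linkerd-identity` Deployment as the involved object, and the `linkerd` namespace are all assumptions made for illustration, not something this issue specifies.

```python
from datetime import datetime, timezone


def expiry_warning_event(cert_name, not_after, namespace="linkerd", now=None):
    """Build a core/v1 Event manifest warning that a certificate expires soon."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    days_left = (not_after - now).days
    return {
        "apiVersion": "v1",
        "kind": "Event",
        "metadata": {"generateName": "linkerd-identity-cert-", "namespace": namespace},
        # Assumption: the identity Deployment is the object that owns the event.
        "involvedObject": {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "name": "linkerd-identity",
            "namespace": namespace,
        },
        "type": "Warning",
        "reason": "CertificateExpiringSoon",  # hypothetical reason string
        "message": f"{cert_name} expires on {not_after:%Y-%m-%d} ({days_left} days left)",
        "source": {"component": "linkerd-identity"},
        "firstTimestamp": stamp,
        "lastTimestamp": stamp,
        "count": 1,
    }
```

A consumer watching for events of type Warning would then see the same expiry date that linkerd check prints.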

Once the trust root and issuer certificate are rotated, linkerd check will show the new expiry dates with √. If the data plane trust root hasn't been rotated, linkerd check --proxy will issue a warning:

$ linkerd check --proxy
linkerd-data-plane-mtls
------------------
√ data plane is using ECDSA trust root certificate with a P-256 key
‼ data plane trust root matches control plane's trust root
   unknown trust roots in data plane
   affected namespaces: <ns1>, <ns2>, <ns3> etc.
   see https://linkerd.io/checks/#l5d-certs-rotation for help
‼ data plane trust root does not expire in 60 days (expiry date: <notAfter>)
   trust root expires on <notAfter>
   affected namespaces: <ns1>, <ns2>, <ns3> etc.
   see https://linkerd.io/checks/#l5d-certs-rotation for help

Upon restarting the data plane, all checks will show up as √.
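
As a rough illustration of the "data plane trust root matches control plane's trust root" check, the sketch below compares each pod's trust root against the set trusted by the control plane and reports only the affected namespaces. How the per-pod PEM is collected is left out, and the function names and the (namespace, pod, pem) input shape are assumptions for this example, not the actual implementation.

```python
def normalize_pem(pem):
    """Drop whitespace differences so identical certificates compare equal."""
    return "\n".join(line.strip() for line in pem.strip().splitlines() if line.strip())


def namespaces_with_unknown_roots(control_plane_roots, pod_roots):
    """Return the sorted namespaces that have a pod with an unknown trust root.

    control_plane_roots: iterable of PEM strings trusted by the control plane.
    pod_roots: iterable of (namespace, pod_name, pem) tuples from the data plane.
    """
    known = {normalize_pem(pem) for pem in control_plane_roots}
    affected = {ns for ns, _pod, pem in pod_roots if normalize_pem(pem) not in known}
    return sorted(affected)


def format_result(affected):
    """Render the check in the style of the proposed output."""
    if not affected:
        return "√ data plane trust root matches control plane's trust root"
    return "\n".join([
        "‼ data plane trust root matches control plane's trust root",
        "   unknown trust roots in data plane",
        "   affected namespaces: " + ", ".join(affected),
    ])
```

Pods that were started before the rotation keep reporting the old root until they are restarted, which is why the warning clears only after the data plane restart.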

When the trust root and/or issuer certificate has expired, the linkerd check and linkerd check --proxy commands should report the errors:

$ linkerd check
linkerd-mtls
------------------
√ control plane is using ECDSA trust root certificate with a P-256 key
√ identity is using ECDSA issuer certificate with a P-256 key
× trust root does not expire in 60 days (expiry date: <notAfter>)
   trust root has expired!
   see https://linkerd.io/checks/#l5d-certs-rotation for help
× issuer certificate does not expire in 60 days (expiry date: <notAfter>)
   issuer certificate has expired!
   see https://linkerd.io/checks/#l5d-certs-rotation for help

$ linkerd check --proxy
linkerd-data-plane-mtls
------------------
√ data plane is using ECDSA trust root certificate with a P-256 key
√ data plane trust root matches control plane's trust root
× data plane trust root does not expire in 60 days (expiry date: <notAfter>)
   trust root has expired!
   affected namespaces: <ns1>, <ns2>, <ns3> etc.
   see https://linkerd.io/checks/#l5d-certs-rotation for help
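
The three states shown so far (√, ‼ and ×) come down to comparing the certificate's notAfter with the current time and a 60-day window. A minimal sketch of that classification follows; `check_expiry` and `WARN_WINDOW` are names invented for this example, and the date format is an arbitrary choice.

```python
from datetime import datetime, timedelta, timezone

WARN_WINDOW = timedelta(days=60)  # the 60-day window from the check description


def check_expiry(name, not_after, now=None):
    """Return (marker, lines) for one certificate, mirroring the proposed output."""
    now = now or datetime.now(timezone.utc)
    title = f"{name} does not expire in 60 days (expiry date: {not_after:%Y-%m-%d})"
    if not_after <= now:
        # Already expired: this is an error, not a warning.
        return "×", [title, f"{name} has expired!"]
    if not_after - now <= WARN_WINDOW:
        # Still valid, but inside the warning window.
        return "‼", [title, f"{name} expires on {not_after:%Y-%m-%d}"]
    return "√", [title]
```

The same function can serve the trust root, the issuer certificate and the data plane trust root, since only the name and the notAfter value differ.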

Currently, linkerd check [--proxy] returns a 503 error in #3696 when the cert(s) have expired, which doesn't tell the users what has gone wrong:

$ linkerd check --proxy
linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
× [prometheus] control plane can talk to Prometheus
    Error calling Prometheus from the control plane: server_error: server error: 503
    see https://linkerd.io/checks/#l5d-api-control-api for hints

$ bin/linkerd logs --control-plane-component=prometheus
linkerd-prometheus-7f789fb7d9-8n5hr linkerd-proxy ERR! [   685.744059s] rustls::session TLS alert received: Message {
linkerd-prometheus-7f789fb7d9-8n5hr linkerd-proxy     typ: Alert,
linkerd-prometheus-7f789fb7d9-8n5hr linkerd-proxy     version: TLSv1_2,
linkerd-prometheus-7f789fb7d9-8n5hr linkerd-proxy     payload: Alert(
linkerd-prometheus-7f789fb7d9-8n5hr linkerd-proxy         AlertMessagePayload {
linkerd-prometheus-7f789fb7d9-8n5hr linkerd-proxy             level: Fatal,
linkerd-prometheus-7f789fb7d9-8n5hr linkerd-proxy             description: BadCertificate,
linkerd-prometheus-7f789fb7d9-8n5hr linkerd-proxy         },
linkerd-prometheus-7f789fb7d9-8n5hr linkerd-proxy     ),
linkerd-prometheus-7f789fb7d9-8n5hr linkerd-proxy }
linkerd-prometheus-7f789fb7d9-8n5hr linkerd-proxy ERR! [   686.245931s] rustls::session TLS alert received: Message {
linkerd-prometheus-7f789fb7d9-8n5hr linkerd-proxy     typ: Alert,
linkerd-prometheus-7f789fb7d9-8n5hr linkerd-proxy     version: TLSv1_2,
linkerd-prometheus-7f789fb7d9-8n5hr linkerd-proxy     payload: Alert(
linkerd-prometheus-7f789fb7d9-8n5hr linkerd-proxy         AlertMessagePayload {
linkerd-prometheus-7f789fb7d9-8n5hr linkerd-proxy             level: Fatal,
linkerd-prometheus-7f789fb7d9-8n5hr linkerd-proxy             description: BadCertificate,
linkerd-prometheus-7f789fb7d9-8n5hr linkerd-proxy         },
linkerd-prometheus-7f789fb7d9-8n5hr linkerd-proxy     ),
linkerd-prometheus-7f789fb7d9-8n5hr linkerd-proxy }

I think the certificate checks might need to happen before the linkerd-api check.

@grampelberg
Contributor

👍 from me! I love seeing the success and failure messages. Mind doing ones for the others as well?

@zaharidichev
Member

> I think the certificate checks might need to happen before the linkerd-api check.

That is correct.
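
To illustrate why the ordering matters, here is a small sketch of a runner that executes check categories in order and stops after the first failing category, so an expired certificate is reported directly instead of surfacing later as a 503 from linkerd-api. The `run_categories` function and its input shape are hypothetical and are not taken from the linkerd codebase.

```python
def run_categories(categories):
    """Run check categories in order and stop after the first one that fails.

    categories: list of (category_name, [(description, check_fn), ...]) where
    check_fn returns None on success or an error message string.
    Returns the output lines.
    """
    lines = []
    for name, checks in categories:
        lines += [name, "-" * len(name)]
        failed = False
        for description, check_fn in checks:
            error = check_fn()
            if error is None:
                lines.append(f"√ {description}")
            else:
                lines += [f"× {description}", f"    {error}"]
                failed = True
        if failed:
            break  # later categories depend on this one, so skip them
    return lines
```

With the certificate category listed first, the API checks never run against a control plane whose certificates are already known to be bad.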

@ihcsim
Contributor Author

ihcsim commented Dec 4, 2019

@grampelberg @zaharidichev Some additional error messages, per our standup conversation:

$ linkerd check
linkerd-mtls
------------------
× control plane is using supported trust root
   trust root must be signed by an ECDSA P-256 key
   see https://linkerd.io/checks/#l5d-supported-certs-type for help # points to https://linkerd.io/2/tasks/generate-certificates/
× identity is using supported issuer certificate
   issuer certificate must be signed by an ECDSA P-256 key
   see https://linkerd.io/checks/#l5d-supported-certs-type for help # points to https://linkerd.io/2/tasks/generate-certificates/

$ linkerd check --proxy
linkerd-data-plane-mtls
------------------
× data plane is using supported trust root certificate
   trust root must be signed by an ECDSA P-256 key
   affected namespaces: <ns1>, <ns2>, <ns3> etc.
   see https://linkerd.io/checks/#l5d-supported-certs-type for help # points to https://linkerd.io/2/tasks/generate-certificates/

$ linkerd check --proxy
linkerd-data-plane-mtls
------------------
√ data plane is using supported trust root certificate
‼ data plane trust root matches control plane's trust root
   unknown trust roots in data plane
   affected namespaces: <ns1>, <ns2>, <ns3> etc.
   see https://linkerd.io/checks/#l5d-certs-rotation for help

@zaharidichev LMK how difficult it is to split up the checks into the control plane and data plane categories. #3696 currently has all the checks grouped under the data plane category.

Also, for data plane check warnings and errors, I listed the affected namespaces to help users locate the source of the errors. I am not sure if that is sufficient or if we want to list all the pods (which can look cluttered when there are many pods). LMK if this is doable without requiring a massive refactoring.

@ihcsim
Contributor Author

ihcsim commented Dec 5, 2019

@zaharidichev Depending on the amount of effort required, we can make the event publication stuff optional. At this point, completing #3677 and #3696 is more important. LMK if I can help you out in any way. Thanks.

@zaharidichev
Member

I agree that listing namespaces is better than dumping a potentially huge list of pods.

@ihcsim
Contributor Author

ihcsim commented Dec 12, 2019

See #3810, #3811 and #3813 for implementation details.

@ihcsim ihcsim closed this as completed Dec 12, 2019
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 17, 2021