Random errors: x509: certificate signed by unknown authority #3497

Closed
KIVagant opened this issue Sep 30, 2019 · 18 comments · Fixed by #3673

@KIVagant
Contributor

KIVagant commented Sep 30, 2019

Bug Report

What is the issue?

I don't fully understand the details, but I periodically see this error in different places, even though Linkerd generally works. The error appears randomly. Restarting the pods resolves it, but I don't consider that a good workaround.

➜ linkerd top deployment/application --namespace default
Error: HTTP error, status Code [503] (unexpected API response: Error: 'x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "linkerd-tap.linkerd.svc")'
Trying to reach: 'https://10.56.79.145:8089/apis/tap.linkerd.io/v1alpha1/watch/namespaces/default/deployments/application/tap')
Usage:
  linkerd top [flags] (RESOURCE)
#...

kubectl rollout restart -n linkerd deployment/linkerd-tap
# ...
linkerd top deployment/application --namespace default

# now it works, but after a while the problem returns

How can it be reproduced?

Logs, error output, etc

[linkerd-tap-86c9f7cc98-p49b5 tap] 2019/09/30 14:58:25 http: TLS handshake error from 127.0.0.1:33188: remote error: tls: bad certificate
[linkerd-tap-86c9f7cc98-psztb tap] 2019/09/30 14:58:25 http: TLS handshake error from 127.0.0.1:37118: remote error: tls: bad certificate
[linkerd-tap-86c9f7cc98-psztb tap] 2019/09/30 14:58:26 http: TLS handshake error from 127.0.0.1:37198: remote error: tls: bad certificate

I didn't find any other errors in the other Linkerd pods.

NAME                                      READY   STATUS    RESTARTS   AGE
linkerd-controller-784c8ddfbd-6l7zv       2/2     Running   0          8h
linkerd-controller-784c8ddfbd-b67s2       2/2     Running   0          47m
linkerd-controller-784c8ddfbd-m95ll       2/2     Running   0          8h
linkerd-destination-7655c8bc7c-4zcxm      2/2     Running   0          8h
linkerd-destination-7655c8bc7c-q4jwz      2/2     Running   0          8h
linkerd-destination-7655c8bc7c-xlx9g      2/2     Running   0          8h
linkerd-grafana-86df8766f8-xlxld          2/2     Running   0          8h
linkerd-identity-59f8fbf6fc-ll597         2/2     Running   0          47m
linkerd-identity-59f8fbf6fc-wgcpj         2/2     Running   0          8h
linkerd-identity-59f8fbf6fc-z66p7         2/2     Running   0          8h
linkerd-prometheus-98c96c5d5-jc2lz        2/2     Running   0          8h
linkerd-proxy-injector-67f7db5566-9wdls   2/2     Running   0          8h
linkerd-proxy-injector-67f7db5566-hc2kv   2/2     Running   0          8h
linkerd-proxy-injector-67f7db5566-t225x   2/2     Running   0          8h
linkerd-sp-validator-c4c598c49-djhv7      2/2     Running   0          47m
linkerd-sp-validator-c4c598c49-ktmdw      2/2     Running   0          8h
linkerd-sp-validator-c4c598c49-lb7jv      2/2     Running   0          30m
linkerd-tap-86c9f7cc98-h8c2d              2/2     Running   0          7h31m
linkerd-tap-86c9f7cc98-p49b5              2/2     Running   0          7h31m
linkerd-tap-86c9f7cc98-psztb              2/2     Running   0          7h30m
linkerd-web-549f59496c-sm6p9              2/2     Running   0          47m

linkerd check output

➜ linkerd check
kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API

linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ no invalid service profiles

linkerd-version
---------------
√ can determine the latest version
‼ cli is up-to-date
    is running version 19.9.3 but the latest edge version is 19.9.4
    see https://linkerd.io/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 19.9.3 but the latest edge version is 19.9.4
    see https://linkerd.io/checks/#l5d-version-control for hints
√ control plane and cli versions match

Status check results are √

Environment

  • Kubernetes Version: v1.12.10-eks-825e5d
  • Cluster Environment: EKS
  • Host OS: Amazon Linux
  • Linkerd version: 19.9.3

Possible solution

Additional context

@ihcsim
Contributor

ihcsim commented Sep 30, 2019

@KIVagant IIRC, you installed Linkerd using the Helm templates, right? Did you override the Tap TLS cert and key in your values.yaml? If yes, how was the cert generated?

I also wonder if this verification error is caused by clock skew on your servers. Can you confirm? Do you have ntpd etc. installed on your server?

@KIVagant
Contributor Author

These are still the same certs that were generated in #3414 (comment), taking into account linkerd/website#516.

Once the problem with the extra newlines was solved, Linkerd worked well, but we randomly started getting the TLS handshake error.

if this verification error is caused by clock skew on your servers

That's a nice point. I will try to check this tomorrow (UTC+3).

@grampelberg
Contributor

This is referring to an APIService which uses a certificate that is part of the resource configuration (caBundle) on the server side and a configmap from kube-system on the client side (extension-apiserver-authentication). Are you doing any kind of certificate rotation?

The fact that restarting the pod fixes it leads me to believe that extension-apiserver-authentication is being updated.

@KIVagant
Contributor Author

KIVagant commented Oct 1, 2019

Are you doing any kind of certificate rotation?

No. I created the cert only once, added it to secret storage, and that's it. I update the Helm chart periodically from upstream, which could cause secrets to be regenerated, but the content of the secret stays the same.

➜ kgsecn linkerd
NAME                                 TYPE                                  DATA   AGE
default-token-w7k9t                  kubernetes.io/service-account-token   3      18d
linkerd-controller-token-vj8p4       kubernetes.io/service-account-token   3      18d
linkerd-destination-token-9hm89      kubernetes.io/service-account-token   3      7d11h
linkerd-grafana-token-8j8mq          kubernetes.io/service-account-token   3      18d
linkerd-heartbeat-token-kv9hc        kubernetes.io/service-account-token   3      18d
linkerd-identity-issuer              Opaque                                2      18d
linkerd-identity-token-fwj7c         kubernetes.io/service-account-token   3      18d
linkerd-prometheus-token-khz72       kubernetes.io/service-account-token   3      18d
linkerd-proxy-injector-tls           Opaque                                2      18d
linkerd-proxy-injector-token-5sdmb   kubernetes.io/service-account-token   3      18d
linkerd-sp-validator-tls             Opaque                                2      18d
linkerd-sp-validator-token-s9b79     kubernetes.io/service-account-token   3      18d
linkerd-tap-tls                      Opaque                                2      18d
linkerd-tap-token-x4skt              kubernetes.io/service-account-token   3      18d
linkerd-web-token-x24x6              kubernetes.io/service-account-token   3      18d

I will try to detect if there's a clock skew when the problem appears, as @ihcsim suggested.

@grampelberg
Contributor

@KIVagant the certificate in question isn't part of the trust chain at all. ca.pem in linkerd-tap-tls should match caBundle in apiservice/v1alpha1.tap.linkerd.io. I still think extension-apiserver-authentication is being rotated by the api-server though.

@grampelberg
Contributor

@KIVagant any new details?

@KIVagant
Contributor Author

@grampelberg, sorry, not yet. I am busy with other tickets, but I still see the error (I upgraded Linkerd to 2.6.0 stable). I will report back when I find more. Please don't close this if that's okay with you.

@KIVagant
Contributor Author

KIVagant commented Oct 16, 2019

My findings:

  1. All nodes have a running ntpd service, and ntpstat reports something close to this:
synchronised to NTP server (185.144.157.134) at stratum 3
   time correct to within 143 ms
   polling server every 1024 s
  2. Debug output from linkerd tap:
➜ linkerd tap pod/tools-web-796dc6fb95-5gcv7 --namespace devops --verbose
DEBU[0001] Response from [https://3......A.sk1.us-east-1.eks.amazonaws.com/apis/tap.linkerd.io/v1alpha1/watch/namespaces/devops/pods/tools-web-796dc6fb95-5gcv7/tap] had headers: map[Audit-Id:[35d5289d-a532-431a-97f9-b89cc7112de9] Content-Length:[327] Content-Type:[text/plain; charset=utf-8] Date:[Wed, 16 Oct 2019 13:11:59 GMT]]
Error: HTTP error, status Code [503] (unexpected API response: Error: 'x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "linkerd-tap.linkerd.svc")'
Trying to reach: 'https://10.56.73.221:8089/apis/tap.linkerd.io/v1alpha1/watch/namespaces/devops/pods/tools-web-796dc6fb95-5gcv7/tap')
  3. Certificate information from apis/tap.linkerd.io:
# (the cert for https://10.56.73.221:8089/apis/tap.linkerd.io)
# echo | openssl s_client -showcerts -servername 10.56.73.221 -connect 10.56.73.221:8089 2>/dev/null | openssl x509 -inform pem -noout -text
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            75:0f:b6:09:3e:ed:94:49:e0:8e:be:65:6b:35:c6:01
        Signature Algorithm: sha256WithRSAEncryption
        Issuer: CN = linkerd-tap.linkerd.svc
        Validity
            Not Before: Oct 15 11:54:18 2019 GMT
            Not After : Oct 14 11:54:18 2020 GMT
        Subject: CN = linkerd-tap.linkerd.svc
        Subject Public Key Info:
            Public Key Algorithm: rsaEncryption
                RSA Public-Key: (2048 bit)
                Modulus:
                    00:d1:9e:71:48:02:88:eb:78:8a:eb:d5:7c:31:d7:
...
                    b7:0f
                Exponent: 65537 (0x10001)
        X509v3 extensions:
            X509v3 Key Usage: critical
                Digital Signature, Key Encipherment, Certificate Sign
            X509v3 Extended Key Usage:
                TLS Web Server Authentication, TLS Web Client Authentication
            X509v3 Basic Constraints: critical
                CA:TRUE
    Signature Algorithm: sha256WithRSAEncryption
         69:05:9c:fb:c5:bf:f9:2a:2e:1e:f9:ea:d8:87:28:d4:42:fa:
...
         b8:c3:13:e1
  4. Regarding "I still think extension-apiserver-authentication is being rotated by the api-server though":

I cannot confirm that this is correct. From what I see (if I understand it right), the cert was created a long time ago.

➜ k get configmaps -n kube-system |grep extension-apiserver-authentication
extension-apiserver-authentication   5      212d

➜ k get configmaps -n kube-system extension-apiserver-authentication -o json |jq -r '.data["requestheader-client-ca-file"]' | openssl x509 -inform pem -noout -text
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 0 (0x0)
    Signature Algorithm: sha256WithRSAEncryption
        Issuer: CN=kubernetes
        Validity
            Not Before: Mar 18 11:30:28 2019 GMT
            Not After : Mar 15 11:30:28 2029 GMT
        Subject: CN=kubernetes
  5. If I manually call the API, I get a different error:
curl --insecure https://10.56.73.221:8089/apis/tap.linkerd.io/v1alpha1/watch/namespaces/devops/pods/tools-web-796dc6fb95-5gcv7/tap
{"error":"no valid CN found. allowed names: [front-proxy-client], client names: []"}

@KIVagant
Contributor Author

KIVagant commented Oct 16, 2019

After running kubectl rollout restart -n linkerd deployment/linkerd-tap:

# (the cert for https://10.56.47.53:8089/apis/tap.linkerd.io)
# echo | openssl s_client -showcerts -servername 10.56.47.53 -connect 10.56.47.53:8089 2>/dev/null | openssl x509 -inform pem -noout -text
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            3e:27:68:b7:1f:e3:c4:c0:ef:16:0c:fe:c6:13:93:5e
        Signature Algorithm: sha256WithRSAEncryption
        Issuer: CN = linkerd-tap.linkerd.svc
        Validity
            Not Before: Oct 15 12:28:58 2019 GMT
            Not After : Oct 14 12:28:58 2020 GMT
        Subject: CN = linkerd-tap.linkerd.svc
        Subject Public Key Info:
            Public Key Algorithm: rsaEncryption
                RSA Public-Key: (2048 bit)
                Modulus:
                    00:d3:b6:8e:77:9e:59:8e:84:c5:64:62:5d:dc:f3:
...
                    78:b1
                Exponent: 65537 (0x10001)
        X509v3 extensions:
            X509v3 Key Usage: critical
                Digital Signature, Key Encipherment, Certificate Sign
            X509v3 Extended Key Usage:
                TLS Web Server Authentication, TLS Web Client Authentication
            X509v3 Basic Constraints: critical
                CA:TRUE
    Signature Algorithm: sha256WithRSAEncryption
         31:cb:56:40:f6:04:ff:7d:f9:05:0a:be:94:0a:22:c1:98:11:
...
         d4:d5:21:df

So, this is the difference:

  • before the restart:
            Not Before: Oct 15 11:54:18 2019 GMT
            Not After : Oct 14 11:54:18 2020 GMT
  • after the restart:
            Not Before: Oct 15 12:28:58 2019 GMT
            Not After : Oct 14 12:28:58 2020 GMT

And the new date is equal to the last chart update (LAST DEPLOYED: Tue Oct 15 12:28:58 2019), which makes me think that helm upgrade does not restart everything that must be restarted:

# _Identity.TrustAnchorsPEM.tmp and the other files are loaded from secret storage; their content is always the same

helm upgrade --install --namespace=linkerd --values ./values/linkerd2/values.yaml --set-file=Identity.TrustAnchorsPEM=_Identity.TrustAnchorsPEM.tmp --set-file=Identity.Issuer.TLS.CrtPEM=_Identity.Issuer.TLS.CrtPEM.tmp --set-file=Identity.Issuer.TLS.KeyPEM=_Identity.Issuer.TLS.KeyPEM.tmp linkerd2 ./linkerd2-2.6.0-f90805b8.tgz
Release "linkerd2" has been upgraded.
LAST DEPLOYED: Tue Oct 15 12:28:58 2019
NAMESPACE: linkerd
STATUS: DEPLOYED

RESOURCES:
==> v1/APIService
NAME                     AGE
v1alpha1.tap.linkerd.io  33d

==> v1/ClusterRole
NAME                            AGE
linkerd-linkerd-controller      33d
linkerd-linkerd-destination     21d
linkerd-linkerd-identity        33d
linkerd-linkerd-prometheus      33d
linkerd-linkerd-proxy-injector  33d
linkerd-linkerd-sp-validator    33d
linkerd-linkerd-tap             33d
linkerd-linkerd-tap-admin       33d

==> v1/ClusterRoleBinding
NAME                                AGE
linkerd-linkerd-controller          33d
linkerd-linkerd-destination         21d
linkerd-linkerd-identity            33d
linkerd-linkerd-prometheus          33d
linkerd-linkerd-proxy-injector      33d
linkerd-linkerd-sp-validator        33d
linkerd-linkerd-tap                 33d
linkerd-linkerd-tap-auth-delegator  33d
linkerd-linkerd-web-admin           33d

==> v1/ConfigMap
NAME                       DATA  AGE
linkerd-config             3     33d
linkerd-grafana-config     3     33d
linkerd-prometheus-config  1     33d

==> v1/Deployment
NAME                    READY  UP-TO-DATE  AVAILABLE  AGE
linkerd-controller      3/3    3           3          33d
linkerd-destination     3/3    3           3          21d
linkerd-grafana         1/1    1           1          33d
linkerd-identity        3/3    3           3          33d
linkerd-prometheus      1/1    1           1          33d
linkerd-proxy-injector  3/3    3           3          33d
linkerd-sp-validator    3/3    3           3          33d
linkerd-tap             3/3    3           3          33d
linkerd-web             1/1    1           1          33d

==> v1/Pod(related)
NAME                                     READY  STATUS   RESTARTS  AGE
linkerd-controller-6dbb9f99c7-8zq9p      3/3    Running  0         34m
linkerd-controller-6dbb9f99c7-cf47c      3/3    Running  0         34m
linkerd-controller-6dbb9f99c7-z8dc8      3/3    Running  0         34m
linkerd-destination-5f85657cdf-fbfgh     2/2    Running  0         34m
linkerd-destination-5f85657cdf-jmq9h     2/2    Running  0         34m
linkerd-destination-5f85657cdf-p49mg     2/2    Running  0         34m
linkerd-grafana-9fd8b57cf-hw28q          2/2    Running  0         34m
linkerd-identity-54789dd4dd-ngt8f        2/2    Running  0         34m
linkerd-identity-54789dd4dd-r9dmt        2/2    Running  0         34m
linkerd-identity-54789dd4dd-sz94l        2/2    Running  0         34m
linkerd-prometheus-7947675d6d-kpkht      2/2    Running  0         34m
linkerd-proxy-injector-5847d54cbc-cpwnj  2/2    Running  0         34m
linkerd-proxy-injector-5847d54cbc-dgpt8  2/2    Running  0         34m
linkerd-proxy-injector-5847d54cbc-p6cnv  2/2    Running  0         34m
linkerd-sp-validator-57c89c6dd4-6d29w    2/2    Running  0         34m
linkerd-sp-validator-57c89c6dd4-nzwmf    2/2    Running  0         34m
linkerd-sp-validator-57c89c6dd4-w2dns    2/2    Running  0         34m
linkerd-tap-5d4454b48b-d6x7j             2/2    Running  0         34m
linkerd-tap-5d4454b48b-hx7h7             2/2    Running  0         34m
linkerd-tap-5d4454b48b-qpntm             2/2    Running  0         34m
linkerd-web-77b64597d8-qdxxs             2/2    Running  0         34m

==> v1/Role
NAME               AGE
linkerd-heartbeat  33d
linkerd-psp        33d

==> v1/RoleBinding
NAME                             AGE
linkerd-heartbeat                33d
linkerd-linkerd-tap-auth-reader  33d
linkerd-psp                      33d

==> v1/Secret
NAME                        TYPE    DATA  AGE
linkerd-identity-issuer     Opaque  2     33d
linkerd-proxy-injector-tls  Opaque  2     33d
linkerd-sp-validator-tls    Opaque  2     33d
linkerd-tap-tls             Opaque  2     33d

==> v1/Service
NAME                    TYPE       CLUSTER-IP      EXTERNAL-IP  PORT(S)            AGE
linkerd-controller-api  ClusterIP  172.20.234.99   <none>       8085/TCP           33d
linkerd-destination     ClusterIP  172.20.118.59   <none>       8086/TCP           33d
linkerd-dst             ClusterIP  172.20.199.84   <none>       8086/TCP           34m
linkerd-grafana         ClusterIP  172.20.17.98    <none>       3000/TCP           33d
linkerd-identity        ClusterIP  172.20.187.42   <none>       8080/TCP           33d
linkerd-prometheus      ClusterIP  172.20.199.230  <none>       9090/TCP           33d
linkerd-proxy-injector  ClusterIP  172.20.199.105  <none>       443/TCP            33d
linkerd-sp-validator    ClusterIP  172.20.98.94    <none>       443/TCP            33d
linkerd-tap             ClusterIP  172.20.246.46   <none>       8088/TCP,443/TCP   33d
linkerd-web             ClusterIP  172.20.42.162   <none>       8084/TCP,9994/TCP  33d

==> v1/ServiceAccount
NAME                    SECRETS  AGE
linkerd-controller      1        33d
linkerd-destination     1        21d
linkerd-grafana         1        33d
linkerd-heartbeat       1        33d
linkerd-identity        1        33d
linkerd-prometheus      1        33d
linkerd-proxy-injector  1        33d
linkerd-sp-validator    1        33d
linkerd-tap             1        33d
linkerd-web             1        33d

==> v1beta1/CronJob
NAME               SCHEDULE   SUSPEND  ACTIVE  LAST SCHEDULE  AGE
linkerd-heartbeat  0 0 * * *  False    0       12h            33d

==> v1beta1/CustomResourceDefinition
NAME                             AGE
serviceprofiles.linkerd.io       33d
trafficsplits.split.smi-spec.io  33d

==> v1beta1/MutatingWebhookConfiguration
NAME                                   AGE
linkerd-proxy-injector-webhook-config  33d

==> v1beta1/PodSecurityPolicy
NAME                           PRIV   CAPS               SELINUX   RUNASUSER  FSGROUP    SUPGROUP   READONLYROOTFS  VOLUMES
linkerd-linkerd-control-plane  false  NET_ADMIN,NET_RAW  RunAsAny  RunAsAny   MustRunAs  MustRunAs  true            configMap,emptyDir,secret,projected,downwardAPI,persistentVolumeClaim

==> v1beta1/ValidatingWebhookConfiguration
NAME                                 AGE
linkerd-sp-validator-webhook-config  33d

NOTES:
...

At this moment I can't find any recently changed secrets:

➜ k get secret,configmap -n linkerd
NAME                                        TYPE                                  DATA   AGE
secret/default-token-w7k9t                  kubernetes.io/service-account-token   3      34d
secret/linkerd-controller-token-vj8p4       kubernetes.io/service-account-token   3      34d
secret/linkerd-destination-token-9hm89      kubernetes.io/service-account-token   3      22d
secret/linkerd-grafana-token-8j8mq          kubernetes.io/service-account-token   3      34d
secret/linkerd-heartbeat-token-kv9hc        kubernetes.io/service-account-token   3      34d
secret/linkerd-identity-issuer              Opaque                                2      34d
secret/linkerd-identity-token-fwj7c         kubernetes.io/service-account-token   3      34d
secret/linkerd-prometheus-token-khz72       kubernetes.io/service-account-token   3      34d
secret/linkerd-proxy-injector-tls           Opaque                                2      34d
secret/linkerd-proxy-injector-token-5sdmb   kubernetes.io/service-account-token   3      34d
secret/linkerd-sp-validator-tls             Opaque                                2      34d
secret/linkerd-sp-validator-token-s9b79     kubernetes.io/service-account-token   3      34d
secret/linkerd-tap-tls                      Opaque                                2      34d
secret/linkerd-tap-token-x4skt              kubernetes.io/service-account-token   3      34d
secret/linkerd-web-token-x24x6              kubernetes.io/service-account-token   3      34d

NAME                                  DATA   AGE
configmap/linkerd-config              3      34d
configmap/linkerd-grafana-config      3      34d
configmap/linkerd-prometheus-config   1      34d

So I see a correlation between the last deploy and the certificate's issue date (Not Before), but I don't see why the linkerd-tap.linkerd.svc certificate was changed.

I guess this can be fixed if Helm always restarts tap pods (and maybe others).
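The failure mode can be reproduced locally without a cluster. The following sketch (filenames are illustrative) generates two independent self-signed CAs with the same CN, simulating the cert that the chart regenerates on every helm upgrade; a cert issued by the new CA fails verification against the old one, which is the same class of error Go reports as "certificate signed by unknown authority ... while trying to verify candidate authority certificate":

```shell
# Two self-signed CAs for the same CN, generated independently —
# as happens when `helm upgrade` re-runs the chart's cert generation.
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -subj "/CN=linkerd-tap.linkerd.svc" -keyout old.key -out old-ca.crt
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -subj "/CN=linkerd-tap.linkerd.svc" -keyout new.key -out new-ca.crt

# Same subject, different key: the new cert only verifies against its
# own CA, not against the stale one a client may still trust.
openssl verify -CAfile new-ca.crt new-ca.crt   # OK
openssl verify -CAfile old-ca.crt new-ca.crt   # verification fails
```

This matches the observation above: the server cert changed at deploy time while clients (the APIService caBundle consumers) kept verifying against the old CA until the tap pods were rolled.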

@grampelberg
Contributor

Ooooh, you're totally right! That's it. This'd be a really simple PR using Helm's sha256sum function.

@grampelberg
Contributor

This should happen for at least tap, proxy-injector, and sp-validator, as those all use certificates associated with the API server.

grampelberg added this to To do in 2.7 - Release via automation Oct 17, 2019
zaharidichev self-assigned this Nov 1, 2019
@zaharidichev
Member

@grampelberg I took a look at that but have a few questions. It looks like we need to hash (in the case of linkerd-tap) against the certs defined in the v1alpha1.tap.linkerd.io resource. But in order to do that, we need to define these certs as partials so that we can refer to them from tap.yaml. And since the certs are generated on the fly, every time we include this partial the certs will be different (i.e. in both tap-rbac.yaml and tap.yaml). Maybe I am not very well versed in Helm syntax, but I cannot see a good way to handle this, given that we generate the certs on the fly. Can we refer to the value of caBundle defined in tap-rbac.yaml from tap.yaml as the template is being rendered?

If the former proves too cumbersome, can't we use --recreate-pods to solve the issue?

@kivagant-ba

kivagant-ba commented Nov 4, 2019

@zaharidichev, --recreate-pods will immediately kill all of them: helm/helm#5218
I'm not sure whether this changed in Helm v3, but in my experience it is not the best idea to use it. kubectl rollout restart does the job better, but it must be called explicitly in CI. Another option is an environment variable like RESTART_ME, or an annotation for the same purpose. None of these solutions looks mature enough. The simplest solution would be to add random characters somewhere so that Kubernetes performs a RollingUpdate after any Helm chart update (randAlphaNum). But some people may not be happy with that solution in production.
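The randAlphaNum idea can be sketched as a pod-template annotation. (The annotation name below is made up; randAlphaNum is a Sprig function available in Helm templates.)

```yaml
spec:
  template:
    metadata:
      annotations:
        # A fresh random value on every render forces a RollingUpdate
        # on every `helm upgrade`, whether or not anything changed.
        rollme: {{ randAlphaNum 8 | quote }}
```

Because the value changes on every render, this unconditionally rolls the pods on every upgrade, which is exactly the production trade-off mentioned above; a content-based checksum annotation only rolls when the inputs actually change.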

@grampelberg
Contributor

@zaharidichev I'm reasonably sure this'll do it:

{{ include (print $.Template.BasePath "/tap-rbac.yaml") . | sha256sum }}

Now, while that is the "correct" way, it might be easier to just do something like:

{{ .Release.Time }}

As we're creating new certs every time, that'll just roll it every time.

@zaharidichev
Member

@grampelberg Yes, I tried that exact thing, and it seems that we always render tap-rbac.yaml as an empty string here. So if you add this annotation to the spec and run the install twice, you get the same hash every time. That does not seem to be what we want:

➜  linkerd2 git:(master) ✗ bin/linkerd install --ignore-cluster | grep checksum
        checksum/config: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
➜  linkerd2 git:(master) ✗ bin/linkerd install --ignore-cluster | grep checksum
        checksum/config: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855

Am I missing something here? Alternatively, yes, we can simply use a timestamp.
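The checksum in the output above is itself a clue: it is the SHA-256 hash of the empty string, which confirms that the include rendered tap-rbac.yaml as "":

```shell
# SHA-256 of empty input matches the checksum/config value above,
# so the included template contributed no content to the hash.
printf '' | sha256sum
# e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  -
```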

@grampelberg
Contributor

Hmmm, that's not how I would have expected it to work. Let's just use the timestamp.

@zaharidichev
Member

Yes, I will give it a shot, although I have an uneasy feeling that it will bring a different set of problems with respect to testing and the .golden outputs, but let's see.

@ihcsim
Contributor

ihcsim commented Nov 4, 2019

So it turns out that we need to make sure include uses the right scope (i.e. $ instead of .). The . refers to the current scope, which was changed to {{.Values}} at the start of the template. Using $, we make sure the tap-rbac.yaml template is included with the global scope, so all of its variables are rendered correctly.

Also, we need to make sure the annotation is added to the pod template, not the deployment. This diff works for me:

diff --git a/charts/linkerd2/templates/tap.yaml b/charts/linkerd2/templates/tap.yaml
index 42d6cd71..d6ed4256 100644
--- a/charts/linkerd2/templates/tap.yaml
+++ b/charts/linkerd2/templates/tap.yaml
@@ -49,6 +49,7 @@ spec:
   template:
     metadata:
       annotations:
+        linkerd.io/config-checksum: {{ include (print $.Template.BasePath "/tap-rbac.yaml") $ | sha256sum }}
         {{.CreatedByAnnotation}}: {{default (printf "linkerd/helm %s" .LinkerdVersion) .CliVersion}}
         {{- include "partials.proxy.annotations" .Proxy| nindent 8}}
       labels:

To reproduce this problem, run:

$ helm upgrade --install linkerd2 charts/linkerd2 --set-file Identity.TrustAnchorsPEM=<crt.pem> --set-file Identity.Issuer.TLS.KeyPEM=<key.pem> --set-file Identity.Issuer.TLS.CrtPEM=<crt.pem> --set Identity.Issuer.CrtExpiry=<crt-expiry-date>

# when control plane is ready, repeat the same command
$ helm upgrade --install linkerd2 charts/linkerd2 --set-file Identity.TrustAnchorsPEM=<crt.pem> --set-file Identity.Issuer.TLS.KeyPEM=<key.pem> --set-file Identity.Issuer.TLS.CrtPEM=<crt.pem> --set Identity.Issuer.CrtExpiry=<crt-expiry-date>

$ linkerd -n linkerd tap deploy
Error: HTTP error, status Code [503] (unexpected API response: Error: 'x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "linkerd-tap.linkerd.svc")'
Trying to reach: 'https://10.106.41.71:443/apis/tap.linkerd.io/v1alpha1/watch/namespaces/linkerd/pods//tap')

With the new annotation, the tap pod will get restarted after the second upgrade --install command.

github-actions bot locked as resolved and limited conversation to collaborators Jul 17, 2021