
cache-deployer Pod fails for kubeflow 1.5 #2165
Closed · akartsky opened this issue Mar 9, 2022 · 20 comments · Fixed by #2223

@akartsky commented Mar 9, 2022

When installing Kubeflow from the latest master and the rc2 release, the cache-deployer-deployment pod fails:

kubeflow       cache-deployer-deployment-6f4bcc969-jxq2f        1/2     CrashLoopBackOff   23         102m
kubeflow       cache-server-575d97c95-md8bg                     0/2     Init:0/1           0          102m

Error logs for cache-deployer:

+ for x in $(seq 10)
++ kubectl get csr cache-server.kubeflow -o 'jsonpath={.status.certificate}'
+ serverCert=
+ [[ '' != '' ]]
+ sleep 1
+ [[ '' == '' ]]
+ echo 'ERROR: After approving csr cache-server.kubeflow, the signed certificate did not appear on the resource. Giving up after 10 attempts.'
ERROR: After approving csr cache-server.kubeflow, the signed certificate did not appear on the resource. Giving up after 10 attempts.
+ exit 1

The cache-server.kubeflow CSR looks like this:

$ kubectl get csr cache-server.kubeflow -o yaml

apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
metadata:
  creationTimestamp: "2022-03-09T03:05:10Z"
  name: cache-server.kubeflow
  resourceVersion: "306173"
  selfLink: /apis/certificates.k8s.io/v1/certificatesigningrequests/cache-server.kubeflow
  uid: f6179659-fa58-4c3b-a9af-d69543667415
spec:
  groups:
  - system:serviceaccounts
  - system:serviceaccounts:kubeflow
  - system:authenticated
  request: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0KTUlJREdEQ0NBZ0FDQVFBd1NERXZNQzBHQTFVRUF3d21jM2x6ZEdWdE9tNXZaR1U2WTJGamFHVXRjMlZ5ZG1WeQpMbXQxWW1WbWJHOTNMbk4yWXpzeEZUQVRCZ05WQkFvTURITjVjM1JsYlRwdWIyUmxjekNDQVNJd0RRWUpLb1pJCmh2Y05BUUVCQlFBRGdnRVBBRENDQVFvQ2dnRUJBT0lHanhzbS83NmdqRy8yWC91OXJpdVR0TUV6b1VJdDVmZFUKQWdIK0xGbVB5Vll4V1VGSmpEbTVhbUNmcmo1aUpIRll1NmxYMitER2gxWmNsMEVBZXJvYi83dm1VYUFOTUJaRwpVaUExMXFRN1ZkMGVuNll1ZnBqZ29CUVN0MTZnWC9FNTNralhQK1FXQ0ErRUNyY1czZG1UdHN4T09kSW9BODdRCjB5U0VhdFhSZFI4ODU5TjF4V0VFTnhXcElzUDdoL1lJSXJ1ZERDV1RiOU5qWWN1bDJCaU9FRXpuMzE2NkREditzQ1VPNW5NTkdkY0JEaVlaMU9SmRJWS9ycTN1MGkrTDNBUFN3aCtoL3V5S3U2Y0Jrc1dMT1ZpdlJHOXFXRnZhQWtDQXdFQUFhQ0JpakNCCmh3WUpLb1pJaHZjTkFRa09NWG93ZURBSkJnTlZIUk1FQWpBQU1Bc0dBMVVkRHdRRUF3SUY0REFUQmdOVkhTVUUKRERBS0JnZ3JCZ0VGQlFjREFUQkpCZ05WSFJFRVFqQkFnZ3hqWVdOb1pTMXpaWEoyWlhLQ0ZXTmhZMmhsTFhObApjblpsY2k1cmRXSmxabXh2ZDRJWlkyRmphR1V0YzJWeWRtVnlMbXQxWW1WbWJHOTNMbk4yWXpBTkJna3Foa2lHCjl3MEJBUXNGQUFPQ0FRRUF3OCtoMjlWdkk1UjhIMVNvdFFiV0FDTXIrQ2puNGJIaHpwWmh2VDZXcHNuRWZUMWUKbjdRY2RFTi9RZzZHbzJ6bi90eXRoZ0lFb2FlNnc4QTM0Ly9RUjRzbW54U3ZianZIM1BoR0wwY25DV0ZIZEI1bwpVSGZaWlAyS3RDeENPYzFKSXpYYzRURzZCNEg1WWU4OE16NUXXXXXXXXXXTdIeEdpY2EvU0J4Ck01UFBKcU5aUmFVckd6YUt4aStaVjFBYlkzRXVPZ1MrRG8vTFMvNmVmN2I0Z3JWYlBWVkxCVlVEUUFpVU8wK3EKbnRaa1BxTm1sSmIzS0xoWmpKbTJjRGlDaVlOZDBiOGNsUEtwNEZHSnhjMWpmQU1QY3kwangwU05USVhkZ05sVQorTU14anl6Q2wrQkZmWDN3dVV2K0E2ekFiUFBIUjRJaHZCWWprdz09Ci0tLS0tRU5EIENFUlRJRklDQVRFIFJFUVVFU1QtLS0tLQo=
  signerName: kubernetes.io/kubelet-serving
  uid: e98f0dd7-1480-42dd-b621-3b5790f53ce0
  usages:
  - digital signature
  - key encipherment
  - server auth
  username: system:serviceaccount:kubeflow:kubeflow-pipelines-cache-deployer-sa
status:
  conditions:
  - lastTransitionTime: "2022-03-09T03:05:10Z"
    lastUpdateTime: "2022-03-09T03:05:10Z"
    message: This CSR was approved by kubectl certificate approve.
    reason: KubectlApprove
    status: "True"
    type: Approved

Due to this, the cache-server pod is stuck in the Init state:

en-xl2s8 webhook-tls-certs istiod-ca-cert istio-data istio-envoy]: timed out waiting for the condition
  Warning  FailedMount  99s (x107 over 3h23m)  kubelet  MountVolume.SetUp failed for volume "webhook-tls-certs" : secret "webhook-server-tls" not found

(This doesn't seem to affect the ability to run pipelines/notebooks.)

Tested on:
EKS 1.19, 1.20 and 1.21

@akartsky (Author) commented Mar 9, 2022

Steps to replicate the issue:

eksctl create cluster --name check-kf-new-release-20 --region us-east-2 --nodes=5 --version="1.20" 

git clone https://github.com/kubeflow/manifests.git && cd manifests
while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done

# then just watch for the `cache-deployer` pod
# it initially goes into running state before crashing 
kubectl get pods -n kubeflow
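
To inspect the failure once the pod starts crash-looping, these two commands should be enough (a sketch; the deployment and CSR names are taken from the output above):

# logs from the cache-deployer pod (all containers), to see the CSR wait loop give up
kubectl -n kubeflow logs deployment/cache-deployer-deployment --all-containers

# the CSR the deployer waits on; on EKS .status.certificate stays empty even after approval
kubectl get csr cache-server.kubeflow -o yaml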

@surajkota (Contributor)

I see another EKS 1.21 user has reported the same error in this issue: kubeflow/kubeflow#5248 (comment), while trying to use Kubeflow master 20 days ago (so maybe 1.5.0-rc1). The original issue is quite old and links to these open issues: kubeflow/pipelines#4505 and kubeflow/pipelines#4695.

But we can see that the cache-deployer-deployment and cache-server pods are stable in KF 1.4.1. So the next step is to investigate how much this component has changed between Kubeflow Pipelines 1.7.0 and 1.8.1. Can someone from @kubeflow/pipelines with more context comment on this?

@akartsky (Author) commented Mar 9, 2022

Confirmed with the EKS team that EKS does not issue certificates for CSRs with signerName kubernetes.io/kubelet-serving unless the CSR was actually requested by a kubelet.

One suggested solution: if we need a certificate for a non-kubelet application on 1.21 and below, we can use the CSR v1beta1 API and signerName "legacy-unknown".

I will update this ticket with more info, but for now this can be tagged as a known issue for KF 1.5 on EKS.
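
For reference, a minimal sketch of what such a request could look like on 1.21 and below, assuming the same CSR name and usages as the failing one above (the server.csr file stands in for whatever the deployer script generates):

# create the CSR with the v1beta1 API and the legacy-unknown signer
cat <<EOF | kubectl create -f -
apiVersion: certificates.k8s.io/v1beta1
kind: CertificateSigningRequest
metadata:
  name: cache-server.kubeflow
spec:
  groups:
  - system:authenticated
  request: $(base64 < server.csr | tr -d '\n')
  signerName: kubernetes.io/legacy-unknown
  usages:
  - digital signature
  - key encipherment
  - server auth
EOF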

@surajkota (Contributor) commented Mar 10, 2022

Both PRs (kubeflow/pipelines#6668, kubeflow/pipelines#7273) are related. The cache-deployer-deployment pod requests a certificate for a CSR with signerName kubernetes.io/kubelet-serving.

EKS only issues certificates for CSRs with signerName kubernetes.io/kubelet-serving to actual kubelets, based on the official K8s documentation:

kubernetes.io/kubelet-serving: signs serving certificates that are honored as a valid kubelet serving certificate by the API server, but has no other guarantees. Never auto-approved by kube-controller-manager.

It is not supported because it is not recommended by Kubernetes upstream, and EKS considers allowing it unsafe. Kubernetes recommends using a cert-manager controller instead, which is already being discussed in kubeflow/pipelines#4695. IMO this is the right long-term fix.

But given the timeframe, I am not sure if it is feasible to complete this. Since this Kubeflow release does not aim to support 1.22, an alternative for this release is to revert both the PRs and use CSR v1beta1 API and signerName legacy-unknown. This would mean pipelines would only work in K8s 1.21 and below since kubernetes.io/legacy-unknown is not supported in stable v1 API of CSR and hence it will not work for K8s 1.22 and above. kubeflow/pipelines#4695 will need to be addressed for K8s 1.22 and above.

Another alternative is to release 1.5.1 with the right fix i.e. using cert manager if other distributions do not see this as an issue.

@juliusvonkohout (Member)

But given the timeframe, I am not sure if it is feasible to complete this. Since this Kubeflow release does not aim to support 1.22, an alternative for this release is to revert both the PRs and use CSR v1beta1 API and signerName legacy-unknown. This would mean pipelines would only work in K8s 1.21 and below since kubernetes.io/legacy-unknown is not supported in stable v1 API of CSR and hence it will not work for K8s 1.22 and above. kubeflow/pipelines#4695 will need to be addressed for K8s 1.22 and above.

Another alternative is to release 1.5.1 with the right fix i.e. using cert manager if other distributions do not see this as an issue.

I am in favor of fixing it with Kubeflow 1.5.1. According to @annajung in #2112 (comment), only notebooks do not support Kubernetes 1.22. So you could use 1.5.1 to fully support Kubernetes 1.22 and fix the certificates properly.

@kimwnasptd (Member)

Thanks for the detailed summary @surajkota!

From what we know about this bug:

  1. It depends on the K8s server of the installation
  2. Until now we haven't heard from other distros having this problem
  3. It's not a blocking issue. In the worst-case scenario the caching mechanism of KFP will not work, but pipelines can still be submitted and run

I'll move on with the KF 1.5.0 release as is. Let's keep the discussion on how to fix this for KF 1.5.1.
cc @kubeflow/release-team

@kimwnasptd (Member) commented Mar 10, 2022

Regarding next steps, indeed using CertManager looks like a step in the right direction. Kubebuilder is also using this for handling the controller/webhook certificates.

But we will need to hear the feedback from the KFP folks, on whether they would be OK with such a change for the KF 1.5.1 release. Which also means a new patch version for KFP. cc @zijianjoy @chensun

If using CertManager is not an immediate option, even for KF 1.5.1, I think another alternative would be to investigate why we should use signerName: kubernetes.io/kubelet-serving in the first place. Do we need to generate TLS certificates for a webhook to have the identity of a kubelet?

My understanding so far is:

  1. The initial PR ([Backend] Initial execution cache, pipelines#3036) used extendedKeyUsage = serverAuth and usages: [..., "server auth"]
  2. After KF 1.4, in the KFP effort for K8s 1.22 support, there was a PR (chore: Update CertificateSigningRequest to current api version, pipelines#6668) that:
    1. Used the v1 version of CertificateSigningRequest
    2. Explicitly set signerName: kubernetes.io/kube-apiserver-client, which is required for K8s 1.22 (https://kubernetes.io/docs/reference/using-api/deprecation-guide/#certificatesigningrequest-v122)
  3. This caused the issue "[backend] cache-deployer generate CSR with wrong usage" (pipelines#7093), which was resolved by using signerName: kubernetes.io/kubelet-serving ("fix(backend): make cache-deployer generate CSR using kubelet-serving signerName", pipelines#7273)

So my question remains: why use server auth and signerName: kubernetes.io/kubelet-serving in the first place? Why not just go with signerName: kubernetes.io/kube-apiserver-client and client auth, instead of server auth?
https://kubernetes.io/docs/reference/access-authn-authz/certificate-signing-requests/#kubernetes-signers

@surajkota (Contributor) commented Mar 10, 2022

Why not just go with signerName: kubernetes.io/kube-apiserver-client and client auth, instead of server auth?

Thanks for the further deep dive on this. I have been curious about this as well; we will wait for a response from the Pipelines WG. I have also added this to the agenda for the next pipelines community meeting on 03/16.

@Tomcli (Member) commented Mar 10, 2022

Regarding next steps, indeed using CertManager looks like a step in the right direction. Kubebuilder is also using this for handling the controller/webhook certificates.

But we will need to hear the feedback from the KFP folks, on whether they would be OK with such a change for the KF 1.5.1 release. Which also means a new patch version for KFP. cc @zijianjoy @chensun

If using CertManager is not an immediate option, even for KF 1.5.1, I think another alternative would be to investigate why should we use signerName: kubernetes.io/kubelet-serving in the first place. Do we need to generate TLS certificates for a webhook to have an identity of a kubelet?

My understanding so far is:

  1. The initial PR ([Backend] Initial execution cache, pipelines#3036) used extendedKeyUsage = serverAuth and usages: [..., "server auth"]

  2. After KF 1.4, in the KFP effort for K8s 1.22 support, there was a PR (chore: Update CertificateSigningRequest to current api version, pipelines#6668) that:

    1. Used the v1 version of CertificateSigningRequest
    2. Explicitly set signerName: kubernetes.io/kube-apiserver-client, which is required for K8s 1.22 (https://kubernetes.io/docs/reference/using-api/deprecation-guide/#certificatesigningrequest-v122)
  3. This caused the issue "[backend] cache-deployer generate CSR with wrong usage" (pipelines#7093), which was resolved by using signerName: kubernetes.io/kubelet-serving ("fix(backend): make cache-deployer generate CSR using kubelet-serving signerName", pipelines#7273)

So my question remains: why use server auth and signerName: kubernetes.io/kubelet-serving in the first place? Why not just go with signerName: kubernetes.io/kube-apiserver-client and client auth, instead of server auth? https://kubernetes.io/docs/reference/access-authn-authz/certificate-signing-requests/#kubernetes-signers

FYI, we did try to use client auth. There was a PR to switch to client auth, but that did not work for us: kubeflow/pipelines#7094

With client auth we get the following error on the cache server:

2022/02/07 23:12:31 http: TLS handshake error from 172.30.4.65:56336: remote error: tls: bad certificate

@akartsky (Author)

Thanks @Tomcli

So it looks like we need both client auth and server auth. Then using the CSR v1beta1 API and signerName kubernetes.io/legacy-unknown is the right way to do it for now:
https://kubernetes.io/docs/reference/access-authn-authz/certificate-signing-requests/#kubernetes-signers

@Tomcli (Member) commented Mar 11, 2022

Thanks @Tomcli

So looks like we need both client auth and server auth then using CSR v1beta1 API and signerName kubernetes.io/legacy-unknown is the right way to do it for now https://kubernetes.io/docs/reference/access-authn-authz/certificate-signing-requests/#kubernetes-signers

@akartsky Can kubernetes.io/legacy-unknown work with CSR v1? I think the main motivation for moving to CSR v1 is to support Kubernetes 1.22 and above.

@ryansteakley

@Tomcli According to https://kubernetes.io/docs/reference/access-authn-authz/certificate-signing-requests/#kubernetes-signers, the stable CertificateSigningRequest API (certificates.k8s.io/v1 and later) does not allow the signerName to be set to kubernetes.io/legacy-unknown.

@surajkota (Contributor) commented Mar 17, 2022

Summary of discussion from Kubeflow pipeline meeting on 03/16:

Here is a consolidated list of docs which indicate using kubelet-serving signerName is a bad practice for non-kubelet applications:
https://github.com/kubernetes/enhancements/blob/master/keps/sig-auth/1513-certificate-signing-request/README.md#importance-of-specsignername
kubernetes/kubernetes#98146 (comment)
https://kubernetes.io/docs/reference/access-authn-authz/certificate-signing-requests/#kubernetes-signers

  • Cache v1 is an alpha feature and will eventually be deprecated when KFP v1 is deprecated. Cache v2 will not need a certificate anymore since it will use Argo. Since cache v1 is an alpha feature, there is no guarantee of support. There is no timeline for deprecating KFP v1 as of today, which means we are looking at a 1+ year timeline.
  • The Pipelines WG thinks that introducing a cert-manager dependency is hard to maintain and upgrade. It would be fine to introduce the cert-manager solution as an overlay.
  • Next steps to investigate:
    • Do we need a server auth certificate? "fix(backend): make cache-deployer generate CSR using kubelet-serving signerName" (pipelines#7273) seems to suggest it is needed, but we did not have a definite answer during the meeting
    • If we need server auth, we can take two routes for a patch release in KFP 1.8 and Kubeflow 1.5.1:
      1. Modify the cache-deployer code and create an overlay that uses the legacy-unknown signerName for EKS, if other distributions are fine with the kubelet-serving signerName. We (AWS) will need to see if there will be some other workaround in EKS 1.22 when it is released
      2. Implement the cert-manager overlay and make it the default overlay in Kubeflow (see the sketch after this list). Since Kubeflow already depends on cert-manager, there will be no impact. However, users installing Kubeflow Pipelines standalone will need to install cert-manager as well. cache-server will need to be modified for this
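
A rough sketch of what option 2 could look like, purely for illustration (resource names and DNS names below are assumptions, not the manifests that were eventually merged; the secret name matches the one the cache-server fails to mount above):

apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: kfp-cache-selfsigned-issuer   # hypothetical name
  namespace: kubeflow
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: cache-server-cert             # hypothetical name
  namespace: kubeflow
spec:
  secretName: webhook-server-tls      # secret the cache-server pod mounts (see the FailedMount event above)
  dnsNames:
  - cache-server
  - cache-server.kubeflow
  - cache-server.kubeflow.svc
  issuerRef:
    name: kfp-cache-selfsigned-issuer
    kind: Issuer

The webhook configuration's caBundle would also have to point at the issuing CA (e.g. via cert-manager's CA injector), which is presumably part of the cache-server changes mentioned above.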

Timelines
KFPv1 - Current: Stable, End of life: Not announced
KFPv2 - Current: Alpha (released 03/16), Stable: Not announced
EKS-1.22: March 2022
Kubeflow-1.5: March 2022
Kubeflow-1.6: ~July 2022

| KFP | Version | Status | EKS | Affected | Sol1: legacy-unknown | Sol2: cert-manager |
|---|---|---|---|---|---|---|
| 1.8.1 | v1 | Stable | 1.19, 1.20, 1.21 | Y | Y | Y |
| 1.8.1 | v2 | Alpha | 1.19, 1.20, 1.21 | N | N/A | N/A |
| 1.8.1 | v1 | Stable | 1.22+ | Y | N | Y |
| 1.8.1 | v2 | Alpha | 1.22+ | N | N/A | N/A |

@kimwnasptd (Member)

I also think that using kubelet-serving is technically incorrect here. Unfortunately, there's currently no signer for pod/service serving certificates as shown in the links @surajkota provided above.

So we'll have to go with one of the above approaches.

@surajkota @akartsky can you help submit a PR for this?

@akartsky (Author)

Yes, I can work on the PR which creates an overlay for AWS with cert-manager.

@konsloiz

Hello everyone. I am re-posting kubeflow/pipelines#7437 at @Linchin's suggestion.

Caching is one of the most crucial features of KFP. Each time a pipeline step is identical to one already executed, its results are loaded from the cache server. Caching is accomplished in KFP via two interdependent modules: the cache deployer and the cache server.

While trying to set up these modules in an enterprise cluster (Mercedes-Benz AG), we noticed that the installation couldn't be completed. The reason is that the cache deployer is built to generate a signed certificate for the cache server via the Kubernetes CertificateSigningRequest API.

...

# create  server cert/key CSR and  send to k8s API
cat <<EOF | kubectl create -f -
apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
metadata:
  name: ${csrName}
spec:
  groups:
  - system:authenticated
  request: $(cat ${tmpdir}/server.csr | base64 | tr -d '\n')
  signerName: kubernetes.io/kubelet-serving
  usages:
  - digital signature
  - key encipherment
  - server auth
EOF

..

The use of API server certificates in our enterprise environment is restricted because they allow privilege escalation. The security risk is critical: by using this API, users can request certificates that let them impersonate both the Kubernetes control plane and cluster team accounts.

To adjust the cache deployer's certificate-generation process without affecting its functionality, and to avoid loosening the security restrictions, we used the widely known OpenSSL.
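
For illustration, a rough sketch of that OpenSSL-based approach, using the service and secret names visible in this issue (cache-server in namespace kubeflow, secret webhook-server-tls); the webhook configuration name and the secret key names the server expects are guesses, and this is not the exact procedure we used:

# throwaway CA (depending on your openssl.cnf you may need to add CA:TRUE basicConstraints explicitly)
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
  -keyout ca.key -out ca.crt -subj "/CN=cache-server-ca"

# server key + CSR for the cache-server Service
openssl req -newkey rsa:2048 -nodes \
  -keyout server.key -out server.csr -subj "/CN=cache-server.kubeflow.svc"

# sign the server cert with the CA, including the Service DNS names as SANs
openssl x509 -req -in server.csr -CA ca.crt -CAkey ca.key -CAcreateserial \
  -days 365 -out server.crt \
  -extfile <(printf "subjectAltName=DNS:cache-server,DNS:cache-server.kubeflow,DNS:cache-server.kubeflow.svc")

# secret the cache-server pod mounts (the key names it expects may differ from tls.crt/tls.key)
kubectl -n kubeflow create secret tls webhook-server-tls --cert=server.crt --key=server.key

# trust the CA in the webhook configuration; the configuration name here is a guess
kubectl patch mutatingwebhookconfiguration cache-webhook-kubeflow --type=json \
  -p "[{\"op\":\"replace\",\"path\":\"/webhooks/0/clientConfig/caBundle\",\"value\":\"$(base64 < ca.crt | tr -d '\n')\"}]"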

Is there any specific reason for using the K8s API?
If not, would the community be interested in an upstream contribution?

@surajkota (Contributor)

People facing this issue can use Kubeflow 1.5.1, which uses cert-manager for the cache-server certificate.

@andre-lx commented Sep 1, 2022

People facing this issue can use Kubeflow-1.5.1 which uses cert-manager for cache-server certificate.

Please ensure that you deploy using the correct path in your kustomization.yaml file:

# 1.4.1
- - ../apps/pipeline/upstream/env/platform-agnostic-multi-user

# 1.5.1
+ - ../apps/pipeline/upstream/env/cert-manager/platform-agnostic-multi-user
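
In other words, the example kustomization.yaml should contain something like the following (other resource entries omitted; the path is the one from the diff above):

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
# ... other Kubeflow 1.5.1 components ...
- ../apps/pipeline/upstream/env/cert-manager/platform-agnostic-multi-user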
