cache-deployer Pod fails for kubeflow 1.5 #2165
Comments
Steps to replicate the issue :
|
I see another EKS 1.21 user has reported the same error in this issue: kubeflow/kubeflow#5248 (comment), on trying to use Kubeflow master 20 days ago (so maybe 1.5.0-rc1). The original issue is pretty old and there are links to these open issues: kubeflow/pipelines#4505 and kubeflow/pipelines#4695. But we can see the |
Confirmed with the EKS team that EKS does not issue certificates for CSRs with signerName kubernetes.io/kubelet-serving unless the CSR was actually requested by a kubelet. One suggested solution, if we need a certificate for a non-kubelet application: for 1.21 and below, use the CSR v1beta1 API and signerName "legacy-unknown". I will update this ticket with more info, but for now this can be tagged as a known issue for KF 1.5 on EKS. |
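For readers following along, here is a minimal sketch of what the suggested workaround could look like. It only renders a manifest and sends nothing to the cluster. The function name `render_legacy_csr` and its two arguments are hypothetical and not part of the cache-deployer scripts. Note that `certificates.k8s.io/v1beta1` was removed in Kubernetes 1.22, so this would only apply to 1.21 and below.

```shell
#!/bin/sh
# Sketch only: render a v1beta1 CSR manifest that uses the legacy-unknown signer.
# render_legacy_csr, csr_name and csr_file are hypothetical names for illustration.
render_legacy_csr() {
  csr_name="$1"   # name of the CertificateSigningRequest object
  csr_file="$2"   # path to a PEM-encoded certificate request
  cat <<EOF
apiVersion: certificates.k8s.io/v1beta1
kind: CertificateSigningRequest
metadata:
  name: ${csr_name}
spec:
  groups:
  - system:authenticated
  request: $(base64 < "${csr_file}" | tr -d '\n')
  signerName: kubernetes.io/legacy-unknown
  usages:
  - digital signature
  - key encipherment
  - server auth
EOF
}
```

Assuming a request file already exists, the output could be piped to the cluster with something like `render_legacy_csr cache-server.kubeflow server.csr | kubectl create -f -`, after which the CSR would still need to be approved.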
I am not super familiar with cache-deployer, but can the following PRs be related? |
Both PRs (kubeflow/pipelines#6668, kubeflow/pipelines#7273) are related. EKS only issues certificates for CSRs with signerName kubernetes.io/kubelet-serving when they are requested by a kubelet. From the Kubernetes docs, kubernetes.io/kubelet-serving: signs serving certificates that are honored as a valid kubelet serving certificate by the API server, but has no other guarantees. Never auto-approved by kube-controller-manager. It is not supported since it is not recommended in Kubernetes upstream and EKS believes allowing this is unsafe. Kubernetes recommends using the cert-manager controller instead, which is already being discussed here: kubeflow/pipelines#4695. IMO this is the right long-term fix. But given the timeframe, I am not sure if it is feasible to complete this. Since this Kubeflow release does not aim to support 1.22, an alternative for this release is to revert both PRs and use CSR [...]. Another alternative is to release 1.5.1 with the right fix, i.e. using cert-manager, if other distributions do not see this as an issue. |
I am in favor of fixing it with Kubeflow 1.5.1. According to @annajung (#2112 (comment)), only notebooks do not support Kubernetes 1.22, so you could use 1.5.1 to fully support Kubernetes 1.22 and fix the certificates properly. |
Thanks for the detailed summary @surajkota! From what we know about this bug, that:
I'll move on with the KF 1.5.0 release as is. Let's keep the discussion on how to fix this for KF 1.5.1 |
Regarding next steps, indeed using CertManager looks like a step in the right direction. Kubebuilder is also using this for handling the controller/webhook certificates. But we will need to hear the feedback from the KFP folks, on whether they would be OK with such a change for the KF 1.5.1 release. Which also means a new patch version for KFP. cc @zijianjoy @chensun If using CertManager is not an immediate option, even for KF 1.5.1, I think another alternative would be to investigate why should we use My understanding so far is:
So my question remains: why use |
Thanks for the further deep dive on this. I have been curious about this as well, we will wait for response from pipelines-wg. I have also added this to the agenda in next pipelines community meeting 03/16. |
Found some links on why we should not be using |
FYI we did try to use client auth. There was a PR to make it use client auth, but that doesn't work for us. With client auth we get the errors below on the cache server:
|
Thanks @Tomcli So looks like we need both |
@akartsky Can |
@Tomcli Looks like according to https://kubernetes.io/docs/reference/access-authn-authz/certificate-signing-requests/#kubernetes-signers . The stable |
Summary of discussion from Kubeflow pipeline meeting on 03/16: Here is a consolidated list of docs which indicate using
Timelines
|
I also think that using So we'll have to go with one of the above approaches. @surajkota @akartsky can you help submit a PR for this? |
ya I can work on the PR which creates an overlay for AWS with cert-manager |
Hello everyone. I am re-posting kubeflow/pipelines/issues/7437 after @Linchin's suggestion. Caching is one of the most crucial features of KFP. Each time a pipeline step is the same as one already executed, the results are loaded from the cache server. Caching is accomplished in KFP via two interdependent modules: the cache deployer and the cache server. While trying to set up the modules in an enterprise cluster (Mercedes-Benz AG), we noted that the installation couldn't be completed. The reason was that the cache deployer is built to generate a signed certificate for the cache server by using the Kubernetes CertificateSigningRequest API. ...
# create server cert/key CSR and send to k8s API
cat <<EOF | kubectl create -f -
apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
metadata:
  name: ${csrName}
spec:
  groups:
  - system:authenticated
  request: $(cat ${tmpdir}/server.csr | base64 | tr -d '\n')
  signerName: kubernetes.io/kubelet-serving
  usages:
  - digital signature
  - key encipherment
  - server auth
EOF
..
The usage of API server certificates in our enterprise environment is restricted because those allow permission escalation. The security risk is critical: by using this API, users can order certificates that let them impersonate both Kubernetes control plane and cluster team access. To avoid loosening the security restrictions, we adjusted the cache deployer's certificate generation process to use the widely known OpenSSL, without affecting the actual functionality. Is there any specific reason for using the K8s API? |
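The OpenSSL approach described above was not posted in the thread, so the following is only a sketch of how such a flow might look, not the poster's actual change. It creates a throwaway CA and signs a server certificate for the cache server locally, with no call to the Kubernetes CSR API. The service name, namespace, key size, and validity period are assumptions.

```shell
#!/bin/sh
set -eu
# Sketch only: sign a server certificate locally with OpenSSL instead of the CSR API.
# service, namespace and the 365-day validity are assumptions for illustration.
service="cache-server"
namespace="kubeflow"
workdir="$(mktemp -d)"

# 1. Create a self-signed CA key and certificate.
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
  -subj "/CN=${service}-ca" \
  -keyout "${workdir}/ca.key" -out "${workdir}/ca.crt" 2>/dev/null

# 2. Create the server key and a certificate request for the service DNS name.
openssl req -newkey rsa:2048 -nodes \
  -subj "/CN=${service}.${namespace}.svc" \
  -keyout "${workdir}/server.key" -out "${workdir}/server.csr" 2>/dev/null

# 3. Describe the extensions the signed certificate should carry.
cat > "${workdir}/san.ext" <<EOF
subjectAltName = DNS:${service}, DNS:${service}.${namespace}, DNS:${service}.${namespace}.svc
extendedKeyUsage = serverAuth
EOF

# 4. Sign the request with the CA.
openssl x509 -req -in "${workdir}/server.csr" \
  -CA "${workdir}/ca.crt" -CAkey "${workdir}/ca.key" -CAcreateserial \
  -days 365 -extfile "${workdir}/san.ext" \
  -out "${workdir}/server.crt" 2>/dev/null
```

With this approach the server key and certificate would presumably be stored in a Secret for the cache server, and the CA certificate would have to be supplied to the webhook configuration as its CA bundle, since the cluster no longer signs the certificate itself.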
People facing this issue can use Kubeflow-1.5.1 which uses cert-manager for cache-server certificate. |
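As a rough illustration of the cert-manager approach, the sketch below renders a self-signed Issuer and a Certificate for the cache server. The resource names, the secret name, and the function `render_cache_cert` are hypothetical and may differ from what the Kubeflow 1.5.1 manifests actually ship. It assumes cert-manager is already installed in the cluster.

```shell
#!/bin/sh
# Sketch only: render cert-manager resources for a cache-server certificate.
# render_cache_cert and all resource/secret names below are hypothetical.
render_cache_cert() {
  namespace="$1"   # namespace where the cache server runs
  cat <<EOF
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: cache-selfsigned-issuer
  namespace: ${namespace}
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: cache-server-cert
  namespace: ${namespace}
spec:
  secretName: cache-server-tls
  dnsNames:
  - cache-server
  - cache-server.${namespace}
  - cache-server.${namespace}.svc
  issuerRef:
    kind: Issuer
    name: cache-selfsigned-issuer
EOF
}
```

The output could be applied with something like `render_cache_cert kubeflow | kubectl apply -f -`. cert-manager would then write the key pair into the named Secret, so no CertificateSigningRequest with a Kubernetes signer is involved.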
Please ensure that you run the deploy using the correct path in your
# 1.4.1
- - ../apps/pipeline/upstream/env/platform-agnostic-multi-user
# 1.5.1
+ - ../apps/pipeline/upstream/env/cert-manager/platform-agnostic-multi-user
|
When installing Kubeflow from latest master and the rc2 release, cache-deployer-deployment fails.
Error logs for cache-deployer:
cache-server.kubeflow looks like this:
Due to this, the cache-server pod is stuck in Init state:
(This doesn't seem to affect the ability to run pipelines/notebook)
Tested on:
EKS 1.19, 1.20 and 1.21