cache-deployer Pod fails for kubeflow 1.5  #2165

@akartsky

When installing Kubeflow from the latest master or from the rc2 release, the cache-deployer-deployment pod fails:

kubeflow       cache-deployer-deployment-6f4bcc969-jxq2f        1/2     CrashLoopBackOff   23         102m
kubeflow       cache-server-575d97c95-md8bg                     0/2     Init:0/1           0          102m

Error logs for cache-deployer:

+ for x in $(seq 10)
++ kubectl get csr cache-server.kubeflow -o 'jsonpath={.status.certificate}'
+ serverCert=
+ [[ '' != '' ]]
+ sleep 1
+ [[ '' == '' ]]
+ echo 'ERROR: After approving csr cache-server.kubeflow, the signed certificate did not appear on the resource. Giving up after 10 attempts.'
ERROR: After approving csr cache-server.kubeflow, the signed certificate did not appear on the resource. Giving up after 10 attempts.
+ exit 1
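
For readers following the trace, the retry logic it implies can be sketched roughly as below. This is a reconstruction from the log lines only, not the actual deployer script; the function names `fetch_cert` and `wait_for_cert` and the attempt/delay parameters are hypothetical. The `kubectl` query itself is the one shown in the trace.

```shell
# Sketch of the polling loop implied by the trace (names are hypothetical).

fetch_cert() {
  # Same query as in the trace: prints the signed certificate, or nothing.
  kubectl get csr "$1" -o 'jsonpath={.status.certificate}'
}

wait_for_cert() {
  # $1: CSR name, $2: max attempts (default 10), $3: delay in seconds (default 1)
  csr_name="$1"
  attempts="${2:-10}"
  delay="${3:-1}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    cert="$(fetch_cert "$csr_name")" || cert=""
    if [ -n "$cert" ]; then
      # Certificate has been issued: print it and stop polling.
      printf '%s\n' "$cert"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "ERROR: no signed certificate on csr ${csr_name} after ${attempts} attempts" >&2
  return 1
}

# Example (requires cluster access):
#   wait_for_cert cache-server.kubeflow 10 1
```

In the failing case reported here, every poll returns an empty string, so a loop of this shape exhausts its attempts and exits non-zero, which matches the CrashLoopBackOff.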

The cache-server.kubeflow CSR looks like this:

$ kubectl get csr cache-server.kubeflow -o yaml

apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
metadata:
  creationTimestamp: "2022-03-09T03:05:10Z"
  name: cache-server.kubeflow
  resourceVersion: "306173"
  selfLink: /apis/certificates.k8s.io/v1/certificatesigningrequests/cache-server.kubeflow
  uid: f6179659-fa58-4c3b-a9af-d69543667415
spec:
  groups:
  - system:serviceaccounts
  - system:serviceaccounts:kubeflow
  - system:authenticated
  request: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0KTUlJREdEQ0NBZ0FDQVFBd1NERXZNQzBHQTFVRUF3d21jM2x6ZEdWdE9tNXZaR1U2WTJGamFHVXRjMlZ5ZG1WeQpMbXQxWW1WbWJHOTNMbk4yWXpzeEZUQVRCZ05WQkFvTURITjVjM1JsYlRwdWIyUmxjekNDQVNJd0RRWUpLb1pJCmh2Y05BUUVCQlFBRGdnRVBBRENDQVFvQ2dnRUJBT0lHanhzbS83NmdqRy8yWC91OXJpdVR0TUV6b1VJdDVmZFUKQWdIK0xGbVB5Vll4V1VGSmpEbTVhbUNmcmo1aUpIRll1NmxYMitER2gxWmNsMEVBZXJvYi83dm1VYUFOTUJaRwpVaUExMXFRN1ZkMGVuNll1ZnBqZ29CUVN0MTZnWC9FNTNralhQK1FXQ0ErRUNyY1czZG1UdHN4T09kSW9BODdRCjB5U0VhdFhSZFI4ODU5TjF4V0VFTnhXcElzUDdoL1lJSXJ1ZERDV1RiOU5qWWN1bDJCaU9FRXpuMzE2NkREditzQ1VPNW5NTkdkY0JEaVlaMU9SmRJWS9ycTN1MGkrTDNBUFN3aCtoL3V5S3U2Y0Jrc1dMT1ZpdlJHOXFXRnZhQWtDQXdFQUFhQ0JpakNCCmh3WUpLb1pJaHZjTkFRa09NWG93ZURBSkJnTlZIUk1FQWpBQU1Bc0dBMVVkRHdRRUF3SUY0REFUQmdOVkhTVUUKRERBS0JnZ3JCZ0VGQlFjREFUQkpCZ05WSFJFRVFqQkFnZ3hqWVdOb1pTMXpaWEoyWlhLQ0ZXTmhZMmhsTFhObApjblpsY2k1cmRXSmxabXh2ZDRJWlkyRmphR1V0YzJWeWRtVnlMbXQxWW1WbWJHOTNMbk4yWXpBTkJna3Foa2lHCjl3MEJBUXNGQUFPQ0FRRUF3OCtoMjlWdkk1UjhIMVNvdFFiV0FDTXIrQ2puNGJIaHpwWmh2VDZXcHNuRWZUMWUKbjdRY2RFTi9RZzZHbzJ6bi90eXRoZ0lFb2FlNnc4QTM0Ly9RUjRzbW54U3ZianZIM1BoR0wwY25DV0ZIZEI1bwpVSGZaWlAyS3RDeENPYzFKSXpYYzRURzZCNEg1WWU4OE16NUXXXXXXXXXXTdIeEdpY2EvU0J4Ck01UFBKcU5aUmFVckd6YUt4aStaVjFBYlkzRXVPZ1MrRG8vTFMvNmVmN2I0Z3JWYlBWVkxCVlVEUUFpVU8wK3EKbnRaa1BxTm1sSmIzS0xoWmpKbTJjRGlDaVlOZDBiOGNsUEtwNEZHSnhjMWpmQU1QY3kwangwU05USVhkZ05sVQorTU14anl6Q2wrQkZmWDN3dVV2K0E2ekFiUFBIUjRJaHZCWWprdz09Ci0tLS0tRU5EIENFUlRJRklDQVRFIFJFUVVFU1QtLS0tLQo=
  signerName: kubernetes.io/kubelet-serving
  uid: e98f0dd7-1480-42dd-b621-3b5790f53ce0
  usages:
  - digital signature
  - key encipherment
  - server auth
  username: system:serviceaccount:kubeflow:kubeflow-pipelines-cache-deployer-sa
status:
  conditions:
  - lastTransitionTime: "2022-03-09T03:05:10Z"
    lastUpdateTime: "2022-03-09T03:05:10Z"
    message: This CSR was approved by kubectl certificate approve.
    reason: KubectlApprove
    status: "True"
    type: Approved
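
Note that the status above carries an Approved condition but no `certificate` field, which is exactly what the deployer loop is waiting for. A small helper like the following could make that state explicit when debugging. It is only a sketch: `csr_state` and its output labels are hypothetical, and it assumes the standard condition types of the `certificates.k8s.io/v1` API (Approved, Denied, Failed).

```shell
# Classify a CSR from its condition types and its .status.certificate value.
# csr_state and its labels are illustrative, not part of kubectl or Kubeflow.

csr_state() {
  # $1: space-separated condition types, $2: value of .status.certificate
  conditions="$1"
  certificate="$2"
  case " $conditions " in
    *" Denied "*) echo "denied"; return 0 ;;
    *" Failed "*) echo "failed"; return 0 ;;
  esac
  case " $conditions " in
    *" Approved "*)
      if [ -n "$certificate" ]; then
        echo "issued"
      else
        # Approved, but the signer has not (yet) produced a certificate.
        echo "approved-not-signed"
      fi
      ;;
    *) echo "pending" ;;
  esac
}

# Example (requires cluster access):
#   csr_state \
#     "$(kubectl get csr cache-server.kubeflow -o 'jsonpath={.status.conditions[*].type}')" \
#     "$(kubectl get csr cache-server.kubeflow -o 'jsonpath={.status.certificate}')"
```

For the CSR shown above, the inputs would be `Approved` and an empty string, so the helper reports `approved-not-signed`.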

Because of this, the cache-server pod is stuck in the Init state:

en-xl2s8 webhook-tls-certs istiod-ca-cert istio-data istio-envoy]: timed out waiting for the condition
  Warning  FailedMount  99s (x107 over 3h23m)  kubelet  MountVolume.SetUp failed for volume "webhook-tls-certs" : secret "webhook-server-tls" not found
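
The FailedMount event can be cross-checked by looking for the secret directly. The snippet below is a minimal sketch; the helper names `secret_exists` and `report_secret` are hypothetical, and the namespace is assumed to be `kubeflow` based on the pod listing above.

```shell
# Check whether the secret the cache-server pod is waiting for exists.
# Helper names are illustrative only.

secret_exists() {
  # $1: secret name, $2: namespace; kubectl exits non-zero if it is not found.
  kubectl get secret "$1" -n "$2" >/dev/null 2>&1
}

report_secret() {
  if secret_exists "$1" "$2"; then
    echo "secret $1 present in namespace $2"
  else
    echo "secret $1 missing in namespace $2"
  fi
}

# Example (requires cluster access):
#   report_secret webhook-server-tls kubeflow
```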

(This doesn't seem to affect the ability to run pipelines or notebooks.)

Tested on:
EKS 1.19, 1.20, and 1.21
