
Error when using microk8s: failed to save outputs: Error response from daemon: No such container: #4302

Closed
tomalbrecht opened this issue Jul 31, 2020 · 24 comments

Comments

@tomalbrecht

tomalbrecht commented Jul 31, 2020

What steps did you take:

I ran /samples/core/lightweight_component/lightweight_component.ipynb. Running all cells seemed fine. I did not change the tutorial code.

image

What happened:

When viewing the run, the pipeline crashed within the "add" function.
image

What did you expect to happen:

I expected the pipeline to complete without errors.

Environment:

  • Kubeflow: 1.0.2
  • Microk8s: v1.18.6-1+64f53401f200a7
  • Linux: Ubuntu 18.04
  • Notebook Pipeline Version: 1.0.0 (checked in the notebook server via print("kfp Version: {}".format(kfp.__version__)))

How did you deploy Kubeflow Pipelines (KFP)?
As part of a full Kubeflow 1.0.2 deployment.

KFP version: Build commit: 743746b

KFP SDK version:

  • kfp 1.0.0
  • kfp-server-api 1.0.0

Anything else you would like to add:

logs-from-wait-in-calculation-pipeline-g7df4-1345834039.txt

Any workarounds or ideas how to further debug and fix it?

/kind bug

@Bobgy
Contributor

Bobgy commented Jul 31, 2020

Can you click the step and see what the error message is?

@tomalbrecht
Author

tomalbrecht commented Jul 31, 2020

This step is in Error state with this message: failed to save outputs: Error response from daemon: No such container: d98f03a40af58bc2f1a623005d9da0a6b6838822ebf0d11fcf0894291d3d5782

image

# Define a Python function
def add(a: float, b: float) -> float:
    '''Calculates the sum of two arguments'''
    print("in add function")
    return a + b

Please see included container log file above for more details.

@alfsuse

alfsuse commented Aug 3, 2020

Hi @tomalbrecht, I think I've managed to replicate the error; in my case it was due to missing capabilities in a pod security policy.
Do you have, by any chance, one or more PSPs deployed? If so, the default PSP may be missing some of these allowed capabilities:

  • SYS_PTRACE
  • NET_ADMIN
  • NET_RAW

(In the PSP YAML these appear as '-' list items.)
Alternatively, does your PSP have something like "drop capabilities: ALL" (requiredDropCapabilities) somewhere in it?

A quick & dirty test: if you have a PSP deployed, you may simply add:

allowedCapabilities:
- '*'

@Bobgy maybe we should add this to the troubleshooting section for Jupyter/KFP? I can do that.

@tomalbrecht
Author

tomalbrecht commented Aug 4, 2020

@alfsuse Thanks for your help.

I deployed these manifests from kubeflow.org on microk8s. Within the kustomize directory (/opt/kubeflow/config/kustomize/kubeflow-roles/base) I found cluster-roles.yaml including several cluster roles.

  • Which specific role should be extended, as you suggested, to run the mentioned tutorial?

I am running the code above as a regular user (not admin), added as suggested here. The run calc_pipeline 2020-07-31 05-25-28 is created in the namespace kubeflow (-n kubeflow).

  • Is this expected, or should the pipeline be executed within the user's namespace so that I do not have to change any PSPs?

Edit: Running the code as the 'admin@kubeflow.org' user yields the same results as described above, so running as admin is not a workaround.
Edit 2: kf1.0.2-cluster-roles.txt

@alfsuse

alfsuse commented Aug 4, 2020

@tomalbrecht I'm not sure the cluster roles or the user are related in this context. I think the issue relies on the presence of a PSP. Can you run kubectl get psp -o yaml and paste the result here? You should have something like this:

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: privileged
  annotations:
    seccomp.security.alpha.kubernetes.io/allowedProfileNames: '*'
spec:
  privileged: true
  allowPrivilegeEscalation: true
# this is the section that may be missing in your PSP
  allowedCapabilities: 
  - '*'
########## end of section
  volumes:
  - '*'
  hostNetwork: true
  hostPorts:
  - min: 0
    max: 65535
  hostIPC: true
  hostPID: true
  runAsUser:
    rule: 'RunAsAny'
  seLinux:
    rule: 'RunAsAny'
  supplementalGroups:
    rule: 'RunAsAny'
  fsGroup:
    rule: 'RunAsAny'

@tomalbrecht
Author

tomalbrecht commented Aug 4, 2020

@alfsuse I installed kubeflow with RBAC enabled, if that matters. But PodSecurityPolicy seems to be a cluster-level resource, so maybe it doesn't. Anyway, here's the output.

    allowedCapabilities:
    - NET_ADMIN
    - NET_RAW
    - SYS_ADMIN

Complete kubectl get psp -o yaml output:

apiVersion: v1
items:
- apiVersion: policy/v1beta1
  kind: PodSecurityPolicy
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"policy/v1beta1","kind":"PodSecurityPolicy","metadata":{"annotations":{},"labels":{"app":"metallb"},"name":"speaker"},"spec":{"allowPrivilegeEscalation":false,"allowedCapabilities":["NET_ADMIN","NET_RAW","SYS_ADMIN"],"fsGroup":{"rule":"RunAsAny"},"hostNetwork":true,"hostPorts":[{"max":7472,"min":7472}],"privileged":true,"runAsUser":{"rule":"RunAsAny"},"seLinux":{"rule":"RunAsAny"},"supplementalGroups":{"rule":"RunAsAny"},"volumes":["*"]}}
    creationTimestamp: "2020-07-24T12:23:56Z"
    labels:
      app: metallb
    managedFields:
    - apiVersion: policy/v1beta1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            .: {}
            f:kubectl.kubernetes.io/last-applied-configuration: {}
          f:labels:
            .: {}
            f:app: {}
        f:spec:
          f:allowPrivilegeEscalation: {}
          f:allowedCapabilities: {}
          f:fsGroup:
            f:rule: {}
          f:hostNetwork: {}
          f:hostPorts: {}
          f:privileged: {}
          f:runAsUser:
            f:rule: {}
          f:seLinux:
            f:rule: {}
          f:supplementalGroups:
            f:rule: {}
          f:volumes: {}
      manager: kubectl
      operation: Update
      time: "2020-07-24T12:23:56Z"
    name: speaker
    resourceVersion: "9528"
    selfLink: /apis/policy/v1beta1/podsecuritypolicies/speaker
    uid: 0641d4e3-6f60-4b46-9d08-767b80148b1f
  spec:
    allowPrivilegeEscalation: false
    allowedCapabilities:
    - NET_ADMIN
    - NET_RAW
    - SYS_ADMIN
    fsGroup:
      rule: RunAsAny
    hostNetwork: true
    hostPorts:
    - max: 7472
      min: 7472
    privileged: true
    runAsUser:
      rule: RunAsAny
    seLinux:
      rule: RunAsAny
    supplementalGroups:
      rule: RunAsAny
    volumes:
    - '*'
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

I edited the 'allowedCapabilities' field as you suggested.

image
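For reference, this kind of edit can also be made non-interactively. A rough sketch (assuming the metallb "speaker" PSP shown above, with SYS_PTRACE as an example capability to add):

kubectl patch psp speaker --type=json \
  -p='[{"op": "add", "path": "/spec/allowedCapabilities/-", "value": "SYS_PTRACE"}]'
# or simply open the policy in an editor:
kubectl edit psp speaker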

After this I ran the code again. It took some time, during which I got this message.
image

But then I got the same result:
image

I created a fresh notebook server, so I assume the new policies were picked up. Any ideas?

@tomalbrecht
Author

tomalbrecht commented Aug 4, 2020

More information. This time it seems several different resources failed.

image

  • pod/metadata-deployment-5c7df888b9-cqvcx: Readiness probe failed: HTTP probe failed with statuscode: 500
  • deployment/metadata-deployment: failed

Because of this, I redeployed deployment/metadata-deployment. That at least fixed the failed metadata containers. Then I recreated the notebook and ran the test again. Nothing changed; the pipeline container still fails.
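A minimal sketch of one way to bounce that deployment from the CLI (assuming it lives in the kubeflow namespace):

kubectl -n kubeflow rollout restart deployment/metadata-deployment
kubectl -n kubeflow get pods | grep metadata   # wait until the new pod is Running/Ready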

@alfsuse

alfsuse commented Aug 4, 2020

Thanks @tomalbrecht, a couple more questions:

  1. I see you have Kubernetes 1.18; that is probably the root cause of some of the pod failures. Is cache-deployer running? If not, you need to change the cache-deployer clusterrole as indicated here: [K3S] Can't start cache-deployer-deployment #4138
  2. The PSP should be applied immediately. I wonder if you have any other PSPs in place besides the one you showed me; keep in mind that PSPs are applied in alphabetical order, so the first one that applies hits the container.

I'm currently working on 3 different clusters with 1.18.4, 1.18.2 and 1.16.9, with and without PSPs, and it works fine. I can try to re-deploy with the same manifests as you and see if something changes.

@tomalbrecht
Author

tomalbrecht commented Aug 4, 2020

@alfsuse

Thanks, I will take a look at #4138 in case I stick with Kubernetes 1.18.6.

I am not aware of any PSP deployments. Right now I am testing in a single-user environment with full access, so if there are any, they were deployed by microk8s (juju) or the kubeflow manifests.

My install process (scripted) is as follows:

  1. install microk8s ('snap install microk8s --classic --channel=${CHANNEL}') and enabling several components ('microk8s.enable dns storage gpu'). Storing config 'microk8s.kubectl config view --raw > /root/.kube/config'
  2. patch trustworthy JWTs in kube-controller-manager (fixes 'MountVolume.SetUp failed for volume "istio-token": failed to fetch token: the server could not find the requested resource', manifests#959)
  3. restart microk8s
  4. install kubeflow (https://www.kubeflow.org/docs/started/k8s/kfctl-istio-dex/)
  5. setup loadbalancer (https://www.kubeflow.org/docs/started/k8s/kfctl-istio-dex/#expose-kubeflow)
  6. patching minio server for aws (https://www.kubeflow.org/docs/aws/pipeline/#s3-access-from-kubeflow-pipelines)
  7. patching several configurations (dashboard timeout, dex-users, ingress-gateway for self signed ssl, tensorboard, dex-port)
  8. enabling local registry (patching docker before) 'microk8s enable registry'
  9. building and pushing some custom images to local registry

I will install an older version of microk8s for now and see if this will help.

@tomalbrecht
Author

Older versions of microk8s in combination with kubeflow 1.0.2 don't work either; the mentioned tutorial still fails.

# list available versions of microk8s
snap info microk8s
channels:
  latest/stable:    v1.18.6         2020-07-25 (1551) 215MB classic
  latest/candidate: v1.18.6         2020-07-16 (1551) 215MB classic
  latest/beta:      v1.18.6         2020-07-16 (1551) 215MB classic
  latest/edge:      v1.18.6         2020-07-31 (1584) 215MB classic
  dqlite/stable:    –                                       
  dqlite/candidate: –                                       
  dqlite/beta:      –                                       
  dqlite/edge:      v1.16.2         2019-11-07 (1038) 189MB classic
  1.19/stable:      –                                       
  1.19/candidate:   v1.19.0-rc.3    2020-07-29 (1573) 220MB classic
  1.19/beta:        v1.19.0-beta.2  2020-06-13 (1478) 213MB classic
  1.19/edge:        v1.19.0-alpha.3 2020-06-05 (1457) 210MB classic
  1.18/stable:      v1.18.6         2020-07-26 (1550) 201MB classic
  1.18/candidate:   v1.18.6         2020-07-22 (1550) 201MB classic
  1.18/beta:        v1.18.6         2020-07-22 (1550) 201MB classic
  1.18/edge:        v1.18.6         2020-07-15 (1550) 201MB classic
  1.17/stable:      v1.17.9         2020-07-26 (1549) 179MB classic
  1.17/candidate:   v1.17.9         2020-07-21 (1549) 179MB classic
  1.17/beta:        v1.17.9         2020-07-21 (1549) 179MB classic
  1.17/edge:        v1.17.9         2020-07-31 (1586) 179MB classic
  1.16/stable:      v1.16.8         2020-03-27 (1302) 179MB classic
  1.16/candidate:   v1.16.13        2020-08-02 (1587) 179MB classic
  1.16/beta:        v1.16.13        2020-08-02 (1587) 179MB classic
  1.16/edge:        v1.16.13        2020-07-31 (1587) 179MB classic
  1.15/stable:      v1.15.11        2020-03-27 (1301) 171MB classic
  1.15/candidate:   v1.15.11        2020-03-27 (1301) 171MB classic
  1.15/beta:        v1.15.11        2020-03-27 (1301) 171MB classic
  1.15/edge:        v1.15.11        2020-03-26 (1301) 171MB classic
  1.14/stable:      v1.14.10        2020-01-06 (1120) 217MB classic
...

Test 1: microk8s 1.16/stable - v1.16.8 FAILED "No such container"
image

Test 2: microk8s 1.15/stable - v1.15.11 Installation FAILED. It gets stuck at "Waiting for kubernetes core services to be ready.." and kubectl cannot connect to the kubernetes API.

Test 3: microk8s 1.14/stable - v1.14.10 FAILED "No such container"
image

I'd like to take a deeper look into it. But what does the process look like, so I can check it? Where do I have to look?

  1. Start a notebook server from the kubeflow dashboard and work within the notebook container: '-n tom test-cpu'
  2. At the end, kfp.Client().create_run_from_pipeline_func(calc_pipeline, arguments=arguments) creates another container for the pipeline: '-n kubeflow calculation-pipeline-8525m-853130966'

Question: What other containers and services are involved in this process?
Question: Which container is the pipeline complaining about? Where can I see which other containers the pipeline tries to access?
Question: Which log files would be interesting for digging deeper into the problem?

I cannot see any difference using older versions of microk8s, so I'll keep debugging microk8s v1.18.6. Any help is appreciated.
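For anyone debugging this later: each pipeline step runs as an Argo workflow pod with a "main" container (the user code) and a "wait" sidecar (which saves outputs and is where the "No such container" error surfaces). A rough sketch of what to inspect (pod names are placeholders; the namespace matches the run above):

kubectl -n kubeflow get pods | grep calculation-pipeline   # find the step pod
kubectl -n kubeflow logs <step-pod> -c main                # user code container
kubectl -n kubeflow logs <step-pod> -c wait                # Argo sidecar that reports "failed to save outputs"
kubectl -n kubeflow logs deploy/workflow-controller        # Argo workflow-controller logs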

@alfsuse

alfsuse commented Aug 4, 2020

Hi @tomalbrecht, I finally replicated the error on microk8s and found that it is directly related to microk8s itself. I found the same error here: jupyterhub/zero-to-jupyterhub-k8s#1189 (comment). The problem seems to be related to the way microk8s manages the security context.
As mentioned, I have successful installations with kind, k3s, and minikube. I will try once more to install kubeflow via microk8s' built-in add-on rather than applying the manifests directly.

@tomalbrecht
Author

Wrap up for microk8s 1.18.6 and kubeflow 1.0.2.

  • Patching PSP 'allowedCapabilities' as described here did not help
  • Patching the cache-deployer deployment as described here - couldn't find manifests/kustomize/cluster-scoped-resources in kubeflow/manifests (https://github.com/kubeflow/manifests)
  • Using older versions of kubernetes (microk8s) as described here did not help

@alfsuse

alfsuse commented Aug 4, 2020

https://github.com/kubeflow/manifests/blob/master/pipeline/upstream/base/cache-deployer/cache-deployer-deployment.yaml
should be the right one..

@alfsuse

alfsuse commented Aug 4, 2020

So I reinstalled microk8s with Kubernetes v1.18.6 and enabled kubeflow (only) directly from microk8s (microk8s enable dns kubeflow storage), which installs everything else. Then I deployed a notebook from within kubeflow and ran the sample.

  • The first attempt failed because the PSPs that are present are missing some capabilities
  • I changed them to '*' and re-ran the demo, changing the image in the sample to "tensorflow/tensorflow2"; everything ran smoothly.
    I noticed that in the pipeline example the image used is tensorflow 1.11 while the code uses tensorflow 1.8, so I also tried that change:
    divmod_op = comp.func_to_container_op(my_divmod, base_image='tensorflow/tensorflow:1.8.0-py3')

I think the issue at this point is that microk8s uses some specific configurations that do not permit the standard installation via manifests.

@tomalbrecht
Author

I wonder how to proceed from here. microk8s.enable kubeflow does not seem to be the right solution for us: Kubeflow would then be managed by juju (within microk8s), so it's not easily possible to patch kubeflow.

@alfsuse

alfsuse commented Aug 5, 2020

I don't know your specific use case, but if you need to manage the manifests yourself, minikube is probably a good solution (in case you need GPU support), or kind/k3s.

  • minikube works on docker or KVM, Hyper-V, etc.
  • kind and k3s are DinD solutions and work on top of docker

All of the above allow you to install from manifests and support istio, ingress, etc.
One further note: with kind and k3s you'll have containerd rather than docker as the container runtime, so you need to adjust the Argo executor.

There's a new manifest that has been released for this case: https://github.com/kubeflow/pipelines/tree/master/manifests/kustomize/env/platform-agnostic-pns
and documentation on how to use kind, k3s, and k3s on WSL2 (and shortly minikube) is on its way to being published.
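For a standalone KFP deployment, that overlay is applied roughly like this (a sketch following the documented standalone-install pattern; PIPELINE_VERSION is a placeholder for the release you want):

export PIPELINE_VERSION=<release tag>
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=$PIPELINE_VERSION"
kubectl wait --for condition=established --timeout=60s crd/applications.app.k8s.io
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic-pns?ref=$PIPELINE_VERSION"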

@tomalbrecht
Author

Thanks again. Yes, I'd like to manage the manifests. I'll take a look.

@alfsuse

alfsuse commented Aug 5, 2020

Ping me on the Slack KF channels if you need any help with those.

@tomalbrecht
Author

Running the same tutorial code on minikube works (without the patches mentioned above). I opened an issue on the microk8s repository: canonical/microk8s#1472

@Ark-kun changed the title from 'Tutorial "Lightweight python components" not working' to 'Error when using microk8s: failed to save outputs: Error response from daemon: No such container:' on Aug 12, 2020
@tomalbrecht
Author

tomalbrecht commented Aug 12, 2020

@alfsuse Patching containerRuntimeExecutor: pns in the ConfigMap workflow-controller-configmap does solve the problem for my microk8s installation. The kubeflow pipeline now succeeds. Thanks for your help.

image

image

Maybe it would help to extend the troubleshooting documentation. I patched the config map within the Kubernetes dashboard; I had no idea how to merge the manifest (#4302 (comment)) into my existing kubeflow manifests (https://github.com/kubeflow/manifests/blob/master/kfdef/kfctl_istio_dex.v1.0.2.yaml).
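For reference, the same change can also be made from the command line instead of the dashboard; a minimal sketch (assuming the ConfigMap sits in the kubeflow namespace, as in a full Kubeflow deployment):

kubectl -n kubeflow patch configmap workflow-controller-configmap \
  --type merge -p '{"data":{"containerRuntimeExecutor":"pns"}}'
# the workflow-controller may need a restart to pick up the change:
kubectl -n kubeflow rollout restart deployment/workflow-controller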

@Ark-kun
Contributor

Ark-kun commented Aug 13, 2020

Patching containerRuntimeExecutor: pns in the ConfigMap workflow-controller-configmap does solve the problem for my microk8s installation.

Interesting. I wanted to propose PNS, but then I remembered that people trying to use PNS executor get errors about outputs and emptyDir volumes, so I decided not to have you jump through that hoop. It's interesting that it worked for you.

@tomalbrecht
Author

@Ark-kun: It does work on microk8s v1.18.6 and k3s v1.16.13+k3s1 for me. Minikube v1.2.0 runs without PNS when the driver is set to None during installation (which means that minikube uses the host docker installation). Everything is set up under Ubuntu 18.04 with the latest updates.

@Bobgy
Contributor

Bobgy commented Aug 14, 2020

FYI, some contributors have documented pns executor for local k8s clusters: kubeflow/website#2056.

@Bobgy
Contributor

Bobgy commented Aug 14, 2020

Ohhh, they are actually already commenting above.
Looks like the issue is solved and we already have public documentation.

@Bobgy closed this as completed on Aug 14, 2020