Docker image building workflows are failing #1135

Closed · jlewi opened this issue Jul 6, 2018 · 3 comments

jlewi commented Jul 6, 2018

See #1130

It looks like the latest runs of most of the release workflows for our images failed, e.g. pytorch-operator, tf-serving-release, and notebook-release.
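
A quick way to enumerate which workflows failed (a sketch; assumes the Argo Workflow CRD is installed and that the release namespace is kubeflow-releasing, per the service account in the logs below):

# List workflows with their phase; failures show phase=Failed.
kubectl get workflows -n kubeflow-releasing \
  -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,FINISHED:.status.finishedAt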


jlewi commented Jul 6, 2018

It looks to me like the dind sidecar received a SIGTERM.


jlewi commented Jul 6, 2018

main container logs

gcloud beta logging read --project=${PROJECT} --freshness=1d "resource.labels.cluster_name=\"${CLUSTER}\"  resource.labels.container_name=\"${CONTAINER}\" resource.labels.pod_id=\"${POD_ID}\"" --format="table(timestamp,textPayload)" --order=asc
TIMESTAMP             TEXT_PAYLOAD
2018-07-06T03:43:49Z  + DOCKERFILE=/mnt/test-data-volume/image-release-pytorch-operator-release-a50c2fb-3dd6-1f9d/src/kubeflow/pytorch-operator/Dockerfile
2018-07-06T03:43:49Z  ++ dirname /mnt/test-data-volume/image-release-pytorch-operator-release-a50c2fb-3dd6-1f9d/src/kubeflow/pytorch-operator/Dockerfile
2018-07-06T03:43:49Z  + CONTEXT_DIR=/mnt/test-data-volume/image-release-pytorch-operator-release-a50c2fb-3dd6-1f9d/src/kubeflow/pytorch-operator
2018-07-06T03:43:49Z  + IMAGE=gcr.io/kubeflow-images-public/pytorch-operator
2018-07-06T03:43:49Z  + TAG=v20180706-a50c2fb0
2018-07-06T03:43:49Z  + ROOT_DIR=/mnt
2018-07-06T03:43:49Z  + export GOPATH=/mnt
2018-07-06T03:43:49Z  + GOPATH=/mnt
2018-07-06T03:43:49Z  + GO_DIR=/mnt/src/github.com/kubeflow/pytorch-operator
2018-07-06T03:43:49Z  + gcloud auth activate-service-account --key-file=/secret/gcp-credentials/key.json
2018-07-06T03:43:51Z  Activated service account credentials for: [kubeflow-releasing@kubeflow-releasing.iam.gserviceaccount.com]
2018-07-06T03:43:52Z  + export PATH=/mnt/bin:/usr/local/go/bin:/usr/local/go/bin:/google-cloud-sdk/bin:/workspace:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
2018-07-06T03:43:52Z  + PATH=/mnt/bin:/usr/local/go/bin:/usr/local/go/bin:/google-cloud-sdk/bin:/workspace:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
2018-07-06T03:43:52Z  + echo 'Create symlink to GOPATH'
2018-07-06T03:43:52Z  + mkdir -p /mnt/src/github.com/kubeflow
2018-07-06T03:43:52Z  + ln -s /mnt/test-data-volume/image-release-pytorch-operator-release-a50c2fb-3dd6-1f9d/src/kubeflow/pytorch-operator /mnt/src/github.com/kubeflow/pytorch-operator
2018-07-06T03:43:52Z  Create symlink to GOPATH
2018-07-06T03:43:52Z  + cd /mnt/src/github.com/kubeflow/pytorch-operator
2018-07-06T03:43:52Z  + echo 'Build operator binary'
2018-07-06T03:43:52Z  Build operator binary
2018-07-06T03:43:52Z  + go build github.com/kubeflow/pytorch-operator/cmd/pytorch-operator

sidecar logs
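
The output below was presumably pulled with the same gcloud query, with CONTAINER pointed at the dind sidecar (the exact container name here is an assumption):

CONTAINER=dind  # hypothetical sidecar container name
gcloud beta logging read --project=${PROJECT} --freshness=1d "resource.labels.cluster_name=\"${CLUSTER}\"  resource.labels.container_name=\"${CONTAINER}\" resource.labels.pod_id=\"${POD_ID}\"" --format="table(timestamp,textPayload)" --order=asc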

TIMESTAMP             TEXT_PAYLOAD
2018-07-06T03:43:50Z  time="2018-07-06T03:43:50.157517623Z" level=warning msg="could not change group /var/run/docker.sock to docker: group docker not found"
2018-07-06T03:43:50Z  time="2018-07-06T03:43:50.157727645Z" level=warning msg="[!] DON'T BIND ON ANY IP ADDRESS WITHOUT setting --tlsverify IF YOU DON'T KNOW WHAT YOU'RE DOING [!]"
2018-07-06T03:43:50Z  time="2018-07-06T03:43:50.158923756Z" level=info msg="libcontainerd: new containerd process, pid: 25"
2018-07-06T03:43:51Z  time="2018-07-06T03:43:51.294852242Z" level=info msg="Graph migration to content-addressability took 0.00 seconds"
2018-07-06T03:43:51Z  time="2018-07-06T03:43:51.296157521Z" level=info msg="Loading containers: start."
2018-07-06T03:43:51Z  time="2018-07-06T03:43:51.320595687Z" level=warning msg="Running modprobe bridge br_netfilter failed with message: ip: can't find device 'bridge'\nip: can't find device 'br_netfilter'\nbr_netfilter           24576  0 \nmodprobe: can't change directory to '/lib/modules': No such file or directory\n, error: exit status 1"
2018-07-06T03:43:51Z  time="2018-07-06T03:43:51.328545461Z" level=warning msg="Running modprobe nf_nat failed with message: `ip: can't find device 'nf_nat'\nnf_nat_masquerade_ipv4    16384  1 ipt_MASQUERADE\nnf_nat_ipv4            16384  1 iptable_nat\nnf_nat                 24576  3 xt_nat,nf_nat_masquerade_ipv4,nf_nat_ipv4\nmodprobe: can't change directory to '/lib/modules': No such file or directory`, error: exit status 1"
2018-07-06T03:43:51Z  time="2018-07-06T03:43:51.341025111Z" level=warning msg="Running modprobe xt_conntrack failed with message: `ip: can't find device 'xt_conntrack'\nmodprobe: can't change directory to '/lib/modules': No such file or directory`, error: exit status 1"
2018-07-06T03:43:51Z  time="2018-07-06T03:43:51.641431263Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
2018-07-06T03:43:51Z  time="2018-07-06T03:43:51.726742083Z" level=info msg="Loading containers: done."
2018-07-06T03:43:51Z  time="2018-07-06T03:43:51.761291641Z" level=info msg="Docker daemon" commit=f4ffd25 graphdriver(s)=overlay2 version=17.10.0-ce
2018-07-06T03:43:51Z  time="2018-07-06T03:43:51.761598804Z" level=info msg="Daemon has completed initialization"
2018-07-06T03:43:51Z  time="2018-07-06T03:43:51.801535920Z" level=info msg="API listen on [::]:2375"
2018-07-06T03:43:51Z  time="2018-07-06T03:43:51.801628580Z" level=info msg="API listen on /var/run/docker.sock"
2018-07-06T03:47:49Z  time="2018-07-06T03:47:49.181198827Z" level=info msg="Processing signal 'terminated'"
2018-07-06T03:47:49Z  time="2018-07-06T03:47:49.185554231Z" level=info msg="stopping containerd after receiving terminated"
2018-07-06T03:47:49Z  time="2018-07-06T03:47:49.185970947Z" level=error msg="libcontainerd: failed to receive event from containerd: rpc error: code = Internal desc = transport is closing"
2018-07-06T03:47:49Z  time="2018-07-06T03:47:49.202342634Z" level=warning msg="libcontainerd: failed to get events from containerd: \"rpc error: code = Internal desc = grpc: the client connection is closing\""

It looks like the main container was in the middle of a build at 03:47 when the dind container received a SIGTERM.
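
To tell whether the kubelet (rather than Argo itself) killed the sidecar, the cluster events and the pod's termination state are the next place to look; a sketch (the pod name is hypothetical, derived from the workflow paths above, and the namespace is an assumption):

# Recent events sorted by time; look for Killing/Evicted entries near 03:47.
kubectl get events -n kubeflow-releasing --sort-by=.lastTimestamp
# Per-container termination reason and exit code for the failed pod.
kubectl describe pod image-release-pytorch-operator-release-a50c2fb-3dd6-1f9d -n kubeflow-releasing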


jlewi commented Jul 6, 2018

Looking at the events for the cluster, I see recent events indicating disk pressure, but these don't match the timestamp of the pytorch-operator failure:

The node was low on resource: [DiskPressure].
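
That correlation is straightforward to check directly (a sketch; assumes kubectl access to the release cluster):

# DiskPressure condition per node, with its last transition time.
kubectl describe nodes | grep -B1 -A1 DiskPressure
# Pods the kubelet evicted to reclaim disk.
kubectl get events --all-namespaces --field-selector reason=Evicted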

jlewi added a commit to jlewi/kubeflow that referenced this issue Jul 6, 2018
* See if this makes the build more reliable and faster

* We should really be setting disk space because it looks like (see kubeflow#1135)
  that is one resource that is under pressure. Need to figure out
  if that is easy to control with K8s.

Related to:
  kubeflow#1135 Docker image building workflows are failing
  kubeflow#1132 Building Jupyter images took over 4 hours.
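
On the open question of controlling disk with K8s: Kubernetes does expose local disk as a first-class resource (ephemeral-storage, beta as of 1.10), so it can be requested and limited per container like cpu and memory; a sketch (sizes are hypothetical):

# Confirm the resources field is served by this cluster's API:
kubectl explain pod.spec.containers.resources
# Then, in the workflow step's container spec, something like:
#   resources:
#     requests:
#       ephemeral-storage: "10Gi"   # hypothetical size
#     limits:
#       ephemeral-storage: "20Gi"   # hypothetical size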
k8s-ci-robot pushed a commit that referenced this issue Jul 6, 2018 (#1136), repeating the commit message above.
jlewi closed this as completed Sep 19, 2018
saffaalvi pushed a commit to StatCan/kubeflow that referenced this issue Feb 11, 2021 (kubeflow#1136), repeating the commit message above.
surajkota pushed a commit to surajkota/kubeflow that referenced this issue Jun 13, 2022
* Fix CNRM cluster package.

* namespace should be set in kustomization.yaml, not in the actual
  resource. If we set it in the resource, then kustomize subpackages
  that try to patch the resource won't be able to find the original
  resource.

* Change the name of the setter from "cluster-name" to "name"

* Divide up the GCP resources into various subpackages.

* Change cluster-name to name in asm.
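
A minimal sketch of the kustomization.yaml pattern described in the first bullet (file names and namespace are hypothetical):

# Set the namespace once in kustomization.yaml and leave it off the base
# resource, so overlay patches can still match the resource's original identity.
cat <<'EOF' > kustomization.yaml
namespace: my-namespace
resources:
- cluster.yaml
EOF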