Docker image building workflows are failing #1135

Closed · jlewi opened this issue Jul 6, 2018 · 3 comments

jlewi commented Jul 6, 2018

See #1130

It looks like the latest runs of most of the release workflows for our images failed, e.g. pytorch-operator, tf-serving-release, and notebook-release.
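
A quick way to enumerate which workflows failed (a sketch; assumes the Argo Workflow CRD is installed and that the release namespace is kubeflow-releasing, per the service account in the logs below):

# List workflows with their phase; failures show phase=Failed.
kubectl get workflows -n kubeflow-releasing \
  -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,FINISHED:.status.finishedAt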


jlewi commented Jul 6, 2018

It looks to me like the dind sidecar received a SIGTERM.


jlewi commented Jul 6, 2018

main container logs

gcloud beta logging read --project=${PROJECT} --freshness=1d "resource.labels.cluster_name=\"${CLUSTER}\"  resource.labels.container_name=\"${CONTAINER}\" resource.labels.pod_id=\"${POD_ID}\"" --format="table(timestamp,textPayload)" --order=asc
TIMESTAMP             TEXT_PAYLOAD
2018-07-06T03:43:49Z  + DOCKERFILE=/mnt/test-data-volume/image-release-pytorch-operator-release-a50c2fb-3dd6-1f9d/src/kubeflow/pytorch-operator/Dockerfile
2018-07-06T03:43:49Z  ++ dirname /mnt/test-data-volume/image-release-pytorch-operator-release-a50c2fb-3dd6-1f9d/src/kubeflow/pytorch-operator/Dockerfile
2018-07-06T03:43:49Z  + CONTEXT_DIR=/mnt/test-data-volume/image-release-pytorch-operator-release-a50c2fb-3dd6-1f9d/src/kubeflow/pytorch-operator
2018-07-06T03:43:49Z  + IMAGE=gcr.io/kubeflow-images-public/pytorch-operator
2018-07-06T03:43:49Z  + TAG=v20180706-a50c2fb0
2018-07-06T03:43:49Z  + ROOT_DIR=/mnt
2018-07-06T03:43:49Z  + export GOPATH=/mnt
2018-07-06T03:43:49Z  + GOPATH=/mnt
2018-07-06T03:43:49Z  + GO_DIR=/mnt/src/github.com/kubeflow/pytorch-operator
2018-07-06T03:43:49Z  + gcloud auth activate-service-account --key-file=/secret/gcp-credentials/key.json
2018-07-06T03:43:51Z  Activated service account credentials for: [kubeflow-releasing@kubeflow-releasing.iam.gserviceaccount.com]
2018-07-06T03:43:52Z  + export PATH=/mnt/bin:/usr/local/go/bin:/usr/local/go/bin:/google-cloud-sdk/bin:/workspace:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
2018-07-06T03:43:52Z  + PATH=/mnt/bin:/usr/local/go/bin:/usr/local/go/bin:/google-cloud-sdk/bin:/workspace:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
2018-07-06T03:43:52Z  + echo 'Create symlink to GOPATH'
2018-07-06T03:43:52Z  + mkdir -p /mnt/src/github.com/kubeflow
2018-07-06T03:43:52Z  + ln -s /mnt/test-data-volume/image-release-pytorch-operator-release-a50c2fb-3dd6-1f9d/src/kubeflow/pytorch-operator /mnt/src/github.com/kubeflow/pytorch-operator
2018-07-06T03:43:52Z  Create symlink to GOPATH
2018-07-06T03:43:52Z  + cd /mnt/src/github.com/kubeflow/pytorch-operator
2018-07-06T03:43:52Z  + echo 'Build operator binary'
2018-07-06T03:43:52Z  Build operator binary
2018-07-06T03:43:52Z  + go build github.com/kubeflow/pytorch-operator/cmd/pytorch-operator

sidecar logs
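
The output below was presumably pulled with the same gcloud query, with CONTAINER pointed at the dind sidecar (the exact container name here is an assumption):

CONTAINER=dind  # hypothetical sidecar container name
gcloud beta logging read --project=${PROJECT} --freshness=1d "resource.labels.cluster_name=\"${CLUSTER}\"  resource.labels.container_name=\"${CONTAINER}\" resource.labels.pod_id=\"${POD_ID}\"" --format="table(timestamp,textPayload)" --order=asc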

TIMESTAMP             TEXT_PAYLOAD
2018-07-06T03:43:50Z  time="2018-07-06T03:43:50.157517623Z" level=warning msg="could not change group /var/run/docker.sock to docker: group docker not found"
2018-07-06T03:43:50Z  time="2018-07-06T03:43:50.157727645Z" level=warning msg="[!] DON'T BIND ON ANY IP ADDRESS WITHOUT setting --tlsverify IF YOU DON'T KNOW WHAT YOU'RE DOING [!]"
2018-07-06T03:43:50Z  time="2018-07-06T03:43:50.158923756Z" level=info msg="libcontainerd: new containerd process, pid: 25"
2018-07-06T03:43:51Z  time="2018-07-06T03:43:51.294852242Z" level=info msg="Graph migration to content-addressability took 0.00 seconds"
2018-07-06T03:43:51Z  time="2018-07-06T03:43:51.296157521Z" level=info msg="Loading containers: start."
2018-07-06T03:43:51Z  time="2018-07-06T03:43:51.320595687Z" level=warning msg="Running modprobe bridge br_netfilter failed with message: ip: can't find device 'bridge'\nip: can't find device 'br_netfilter'\nbr_netfilter           24576  0 \nmodprobe: can't change directory to '/lib/modules': No such file or directory\n, error: exit status 1"
2018-07-06T03:43:51Z  time="2018-07-06T03:43:51.328545461Z" level=warning msg="Running modprobe nf_nat failed with message: `ip: can't find device 'nf_nat'\nnf_nat_masquerade_ipv4    16384  1 ipt_MASQUERADE\nnf_nat_ipv4            16384  1 iptable_nat\nnf_nat                 24576  3 xt_nat,nf_nat_masquerade_ipv4,nf_nat_ipv4\nmodprobe: can't change directory to '/lib/modules': No such file or directory`, error: exit status 1"
2018-07-06T03:43:51Z  time="2018-07-06T03:43:51.341025111Z" level=warning msg="Running modprobe xt_conntrack failed with message: `ip: can't find device 'xt_conntrack'\nmodprobe: can't change directory to '/lib/modules': No such file or directory`, error: exit status 1"
2018-07-06T03:43:51Z  time="2018-07-06T03:43:51.641431263Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
2018-07-06T03:43:51Z  time="2018-07-06T03:43:51.726742083Z" level=info msg="Loading containers: done."
2018-07-06T03:43:51Z  time="2018-07-06T03:43:51.761291641Z" level=info msg="Docker daemon" commit=f4ffd25 graphdriver(s)=overlay2 version=17.10.0-ce
2018-07-06T03:43:51Z  time="2018-07-06T03:43:51.761598804Z" level=info msg="Daemon has completed initialization"
2018-07-06T03:43:51Z  time="2018-07-06T03:43:51.801535920Z" level=info msg="API listen on [::]:2375"
2018-07-06T03:43:51Z  time="2018-07-06T03:43:51.801628580Z" level=info msg="API listen on /var/run/docker.sock"
2018-07-06T03:47:49Z  time="2018-07-06T03:47:49.181198827Z" level=info msg="Processing signal 'terminated'"
2018-07-06T03:47:49Z  time="2018-07-06T03:47:49.185554231Z" level=info msg="stopping containerd after receiving terminated"
2018-07-06T03:47:49Z  time="2018-07-06T03:47:49.185970947Z" level=error msg="libcontainerd: failed to receive event from containerd: rpc error: code = Internal desc = transport is closing"
2018-07-06T03:47:49Z  time="2018-07-06T03:47:49.202342634Z" level=warning msg="libcontainerd: failed to get events from containerd: \"rpc error: code = Internal desc = grpc: the client connection is closing\""

It looks like the main container was in the middle of a build at 03:47 when the dind container received a SIGTERM.
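
To tell whether the kubelet (rather than Argo itself) killed the sidecar, the cluster events and the pod's termination state are the next place to look; a sketch (the pod name is hypothetical, derived from the workflow paths above, and the namespace is an assumption):

# Recent events sorted by time; look for Killing/Evicted entries near 03:47.
kubectl get events -n kubeflow-releasing --sort-by=.lastTimestamp
# Per-container termination reason and exit code for the failed pod.
kubectl describe pod image-release-pytorch-operator-release-a50c2fb-3dd6-1f9d -n kubeflow-releasing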


jlewi commented Jul 6, 2018

Looking at the events for the cluster, I see recent events indicating disk pressure, but these don't match the timestamp of the pytorch-operator failure:

The node was low on resource: [DiskPressure].
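
That correlation is straightforward to check directly (a sketch; assumes kubectl access to the release cluster):

# DiskPressure condition per node, with its last transition time.
kubectl describe nodes | grep -B1 -A1 DiskPressure
# Pods the kubelet evicted to reclaim disk.
kubectl get events --all-namespaces --field-selector reason=Evicted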

jlewi added a commit to jlewi/kubeflow that referenced this issue Jul 6, 2018
* See if this makes the build more reliable and faster

* We should really be setting disk space because it looks like (see kubeflow#1135)
  that is one resource that is under pressure. Need to figure out
  if that is easy to control with K8s.

Related to:
  kubeflow#1135 Docker image building workflows are failing
  kubeflow#1132 Building Jupyter images took over 4 hours.
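
On the open question of controlling disk with K8s: Kubernetes does expose local disk as a first-class resource (ephemeral-storage, beta as of 1.10), so it can be requested and limited per container like cpu and memory; a sketch (sizes are hypothetical):

# Confirm the resources field is served by this cluster's API:
kubectl explain pod.spec.containers.resources
# Then, in the workflow step's container spec, something like:
#   resources:
#     requests:
#       ephemeral-storage: "10Gi"   # hypothetical size
#     limits:
#       ephemeral-storage: "20Gi"   # hypothetical size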
k8s-ci-robot pushed a commit that referenced this issue Jul 6, 2018 (#1136), repeating the commit message above.
jlewi closed this as completed Sep 19, 2018
saffaalvi pushed a commit to StatCan/kubeflow that referenced this issue Feb 11, 2021 (kubeflow#1136), repeating the commit message above.
surajkota pushed a commit to surajkota/kubeflow that referenced this issue Jun 13, 2022
* Fix CNRM cluster package.

* namespace should be set in kustomization.yaml, not in the actual
  resource. If we set it in the resource, then kustomize subpackages
  that try to patch the resource won't be able to find the original
  resource.

* Change the name of the setter from "cluster-name" to "name"

* Divide up the GCP resources into various subpackages.

* Change cluster-name to name in asm.
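
A minimal sketch of the kustomization.yaml pattern described in the first bullet (file names and namespace are hypothetical):

# Set the namespace once in kustomization.yaml and leave it off the base
# resource, so overlay patches can still match the resource's original identity.
cat <<'EOF' > kustomization.yaml
namespace: my-namespace
resources:
- cluster.yaml
EOF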