Timeout when pulling Docker images taking more than 1 minute to extract #13122

Closed
AlbertoPeon opened this Issue Feb 27, 2017 · 23 comments

Contributor

AlbertoPeon commented Feb 27, 2017

Version

$ oc version
oc v1.4.1+3f9807a
kubernetes v1.4.0+776c994

OpenShift/Kubernetes fails to pull images whose layers take more than one minute to extract.

$ oc get events -w
Pod                                                   Normal    Scheduled           {default-scheduler }             Successfully assigned gitlab-ee-1-3jso0 to oonodedev-001
Pod                     spec.containers{gitlab-ee}    Normal    Pulling             {kubelet oonodedev-001}   pulling image "gitlab/gitlab-ee@sha256:fa58a6765b5431f716ba82f5002a81041224e7430ef2c29b7fdea993a4a96aff"
Pod                   Warning   FailedSync   {kubelet oonodedev-001}   Error syncing pod, skipping: failed to "StartContainer" for "gitlab-ee" with ErrImagePull: "net/http: request canceled"
Pod       spec.containers{gitlab-ee}   Warning   Failed    {kubelet oonodedev-001}   Failed to pull image "gitlab/gitlab-ee@sha256:fa58a6765b5431f716ba82f5002a81041224e7430ef2c29b7fdea993a4a96aff": net/http: request canceled

and in the Origin logs:

Feb 24 15:21:45 oonodedev-001 origin-node[20126] kube_docker_client.go:313] Cancel pulling image "gitlab/gitlab-ee@sha256:fa58a6765b5431f716ba82f5002a81041224e7430ef2c29b7fdea993a4a96aff" because of no progress for 1m0s, latest progress: "ac990a380700: Extracting [==================================================>] 288.7 MB/288.7 MB"

The last layer of this particular image (i.e. gitlab/gitlab-ee:8.16.4-ee.0) takes several minutes to extract, so with the default timeout of 1 minute the pull never completes. A normal docker pull works.

The one-minute value seems to come from defaultImagePullingStuckTimeout (ref. https://github.com/kubernetes/kubernetes/blob/v1.4.0/pkg/kubelet/dockertools/kube_docker_client.go#L81), which is hardcoded and can't be changed. I also see that this has been changed in Kubernetes 1.6, where the value appears to be customizable.
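
To illustrate why a slow extraction trips this timeout, here is a minimal sketch of a progress-deadline check. It is a simplified model, not the kubelet's actual code: it assumes the deadline is measured from the last progress message received, and the function name and timings are made up for illustration.

```python
def pull_is_cancelled(message_times, finished_at, deadline=60.0):
    """Return True if any silent gap during the pull exceeds the deadline.

    message_times: seconds (since pull start) at which a progress message arrived.
    finished_at:   second at which the pull would complete if left alone.
    deadline:      maximum allowed gap without progress, in seconds (assumed model).
    """
    last = 0.0
    for t in sorted(message_times) + [finished_at]:
        if t - last > deadline:
            return True  # no progress for longer than the deadline
        last = t
    return False


# Hypothetical timings: download messages arrive every few seconds,
# then the last layer extracts for four minutes without a new message.
times = [5, 10, 20, 30]
print(pull_is_cancelled(times, finished_at=270))                  # 1m deadline
print(pull_is_cancelled(times, finished_at=270, deadline=600.0))  # 10m deadline
```

Under this model the pull is cancelled with a 1-minute deadline and survives with a 10-minute one, which matches the behavior described above.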

Could you suggest a possible workaround for the time being? If not, could we increase the default timeout (to something like 10 minutes) and backport it to Origin 1.4 and Origin 1.5?

Member

pweil- commented Feb 28, 2017

@derekwaynecarr setting to p1 for triage in case we need to pull this in to 1.5 before the close.

Member

pweil- commented Feb 28, 2017

Member

derekwaynecarr commented Mar 6, 2017

For reference, the PR change for Kubernetes 1.6 is here:
kubernetes/kubernetes#36887

For the 1.4 and 1.5 releases, changing the default timeout from 1 to 10 minutes may help this situation but would potentially hurt other situations. Is it possible for the large image to be pre-pulled to your nodes instead in the interim?
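
A pre-pull along these lines could be scripted roughly as follows. This is only a sketch under assumptions not stated in the thread: SSH access to each node as the given user, and the Docker CLI available there. The helper name and the second node name are hypothetical.

```python
import shlex


def prepull_commands(nodes, image, ssh_user="root"):
    """Build one `ssh <user>@<node> 'docker pull <image>'` argv per node."""
    remote = "docker pull " + shlex.quote(image)
    return [["ssh", f"{ssh_user}@{node}", remote] for node in nodes]


# oonodedev-001 appears in the events above; oonodedev-002 is a made-up example.
nodes = ["oonodedev-001", "oonodedev-002"]
for cmd in prepull_commands(nodes, "gitlab/gitlab-ee:8.16.4-ee.0"):
    # Printed rather than executed; pass each argv to subprocess.run(cmd, check=True)
    # to actually run it.
    print(" ".join(cmd))
```

Once the image is already present on a node, the kubelet does not need to extract the large layer at pod start, unless the pull policy forces a fresh pull.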

Contributor Author

AlbertoPeon commented Mar 6, 2017

Well, it is not only happening with this particular image.

We have seen this issue already in two images (the gitlab image mentioned above and one generated using S2I for one of our users), so I am afraid we could see this again in the future.

ksemaev commented May 31, 2017

The same is happening to me with all images larger than 150 MB or with more than 5 layers (for example, the official tomcat image). So we can't increase this timeout?

chenww commented Jul 11, 2017

I am facing the same issue, and the error seems to be random with Kubernetes 1.6. Here is what I observed; an explanation would be appreciated:

  1. Kubernetes needs to pull 5 images from the internet, and only the postgres one failed. I could only fix it by deleting and recreating the pod manually, after which it became Running right away.
  2. Kubernetes needs to pull 10+ images from a local private registry, and only one image failed. However, Kubernetes retried later and succeeded without issue.

Error syncing pod, skipping: failed to "StartContainer" for "postgresql" with ErrImagePull: "net/http: request canceled"

24m 24m 1 kubelet, master0 spec.containers{postgresql} Normal BackOff Back-off pulling image "sameersbn/postgresql:9.6-2"
24m 24m 1 kubelet, master0 Warning FailedSync Error syncing pod, skipping: failed to "StartContainer" for "postgresql" with ImagePullBackOff: "Back-off pulling image "sameersbn/postgresql:9.6-2""

mliker commented Jul 13, 2017

+1

innovia commented Jul 14, 2017

+1
Happens to me on 1.6.2 using kops.

From what I was able to strace, the pull is stuck on FUTEX_WAIT, so some other process seems to be deadlocking it.

alifa20 commented Jul 18, 2017

+1
Happens to me on 1.5.7 using kops.
I am getting ErrImagePull: "net/http: request canceled" when it tries to pull the image from AWS ECR.
Any ideas, guys?

innovia commented Jul 18, 2017

I think, but I'm not sure, that it happened to me because I only had 1 node. Again, I'm not sure, because I deleted and rebuilt my cluster yesterday.

alifa20 commented Jul 18, 2017

Changing the EC2 nodes from t2.medium to m3.large fixed the problem.

stevenmirabito commented Jul 24, 2017

Also ran into this issue with the GitLab image. 😞

alikhajeh1 commented Sep 4, 2017

Is there an option to customize the timeout in Origin 3.6? I couldn't find anything in the docs for it, but maybe I was searching for the wrong things.

bbrfkr commented Sep 5, 2017

+1

rickbliss commented Sep 10, 2017

+1

yanhongwang commented Sep 18, 2017

+1

Contributor Author

AlbertoPeon commented Sep 18, 2017

@alikhajeh1 @bbrfkr @rickbliss @yanhongwang

For Origin 3.6 you can set image-pull-progress-deadline to a meaningful value (e.g. 10m) in the kubeletArguments section of the node-config.yaml on all your nodes.

This is working for us.

Contributor Author

AlbertoPeon commented Sep 18, 2017

Actually, I am happy to close the issue now that this is configurable in Origin 3.6.

xqianwang commented Nov 18, 2017

@AlbertoPeon so in kubeletArguments, we set image-pull-progress-deadline=10m?

bbrfkr commented Dec 1, 2017

@xqianwang
Yes. We can set the parameter image-pull-progress-deadline in /etc/origin/node/node-config.yaml as follows:

kubeletArguments:
  image-pull-progress-deadline:
  - "10m"

This configuration works fine in my OpenShift Origin environment.

xqianwang commented Dec 14, 2017

@bbrfkr Thanks a lot!

k8s-github-robot pushed a commit to kubernetes/kops that referenced this issue Dec 17, 2017

Kubernetes Submit Queue
Merge pull request #4046 from artsy/master
Automatic merge from submit-queue.

add imagePullProgressDeadline to kubelet config

Support the kubelet runtime flag `--image-pull-progress-deadline` by mapping the config key `imagePullProgressDeadline`

This supports extending the deadline to pull new images, as detailed in [this issue](openshift/origin#13122)

sbadakhc commented Mar 27, 2018

Does this still have to be set manually? I'm seeing this in 3.7 and was wondering whether it exists as a configurable option in the Ansible inventory.

Contributor Author

AlbertoPeon commented Mar 27, 2018

Yes, you can set it in openshift_node_kubelet_args. Note that this has to be JSON-formatted, so something like:

openshift_node_kubelet_args='{"image-pull-progress-deadline":["10m"]}'
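
Since the value has to be valid JSON with each argument given as a list of strings, one way to avoid quoting mistakes is to generate the inventory line rather than type it by hand. A minimal sketch, where the helper name is hypothetical and only the standard json module is used:

```python
import json


def kubelet_args_inventory_line(deadline="10m"):
    """Render the openshift_node_kubelet_args inventory line for a given deadline."""
    args = {"image-pull-progress-deadline": [deadline]}
    # Compact separators reproduce the spacing-free form shown above.
    return "openshift_node_kubelet_args='" + json.dumps(args, separators=(",", ":")) + "'"


print(kubelet_args_inventory_line())
# openshift_node_kubelet_args='{"image-pull-progress-deadline":["10m"]}'
```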