Sporadically get ImagePullBackOff #55691

Closed
ITler opened this Issue Nov 14, 2017 · 6 comments

ITler commented Nov 14, 2017

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:
I run a Kubernetes 1.7.5 cluster (the issue was also seen in older versions and reproduced with 1.7.9) on AzureGermanCloud. (K8s is provisioned via a Resource Manager template created by acs-engine.) During pod deployment, several images are pulled from one dedicated AWS EC2 Container Registry (ECR). To be able to log in there, I use aws ecr get-login [...] to obtain login credentials and provide them as a K8s docker-registry secret in the dedicated Kubernetes namespace in which I deploy my applications. Usually this works perfectly, but sporadically different deployments fail in this manner:

Warning  Failed      4m (x17 over 1h)   kubelet, k8s-agentpool1-41067869-0  Failed to pull image "URLtoAwsEC2/some_image:tag": rpc error: code = 2 desc = unauthorized: authentication required
Warning  FailedSync  4s (x302 over 1h)  kubelet, k8s-agentpool1-41067869-0  Error syncing pod
Normal   BackOff     4s (x285 over 1h)  kubelet, k8s-agentpool1-41067869-0  Back-off pulling image "URLtoAwsEC2/some_image:tag"

It seems to happen randomly. There is no change to the registry URL or the image names, and the image tag has no influence on the behavior (it can be latest as well as a custom one). Sometimes only one image can't be pulled; another day it is a different image, or even several images can't be pulled. I can't see a pattern. Pulling locally with Docker works like a charm, which is why I assume this is not an issue on the AWS side. Nothing changes when I recreate the secrets, individually delete the affected pods and deployments, or even tear down and re-provision the whole Kubernetes cluster.
What could I check? How could I debug this further?
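
So far my own checks look roughly like this (a sketch only; the pod name is a placeholder, while the secret name registry and namespace demo are the ones from my cluster, visible in the logs in a later comment):

# Events of a failing pod show the exact image reference and the pull error.
kubectl --namespace demo describe pod <failing-pod>

# Recent events in the namespace, newest last.
kubectl --namespace demo get events --sort-by=.lastTimestamp

# Confirm the docker-registry secret exists and inspect its (base64-encoded) payload.
kubectl --namespace demo get secret registry -o yaml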

What you expected to happen:
Image pulls should always work.

How to reproduce it (as minimally and precisely as possible):
Reproduction can't be forced; it happens randomly. I have been experiencing this for a while, but usually it magically disappeared again. This week is the first exception: I have had issues with the same image since yesterday.

  1. Bootstrap Kubernetes
  2. Put a secret with login credentials for the EC2 container registry in place (see the sketch below this list)
  3. Deploy the pods: kubectl apply -f somefile.yaml
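
The secret in step 2 is created roughly as follows (a minimal sketch; the region, account ID and e-mail are illustrative placeholders, the secret name registry and namespace demo are the ones from my cluster, and the awk extraction assumes the "docker login -u AWS -p <token> <registry>" output of aws ecr get-login):

# Extract the ECR token from the docker login command printed by the AWS CLI.
TOKEN=$(aws ecr get-login --no-include-email --region eu-central-1 | awk '{print $6}')

# Store it as an image pull secret in the deployment namespace.
# --docker-email is required by some older kubectl versions; any value works.
kubectl create secret docker-registry registry \
  --namespace=demo \
  --docker-server=https://<aws_account_id>.dkr.ecr.eu-central-1.amazonaws.com \
  --docker-username=AWS \
  --docker-password="$TOKEN" \
  --docker-email=none@example.com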

Anything else we need to know?:
Nothing comes to mind; please ask.

ITler commented Nov 14, 2017

@kubernetes/sig-azure-bugs

Contributor

k8s-ci-robot commented Nov 14, 2017

@ITler: Reiterating the mentions to trigger a notification:
@kubernetes/sig-azure-bugs

In response to this:

@kubernetes/sig-azure-bugs

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

ITler commented Nov 15, 2017

I can't say whether this is relevant, but I found the following:

master's /var/log/containers/kube-apiserver*.log

{"log":"I1115 07:09:49.472755       1 wrap.go:42] GET /api/v1/namespaces/demo/secrets/registry?resourceVersion=0: (800.44µs) 200 [[hyperkube/v1.7.9 (linux/amd64) kubernetes/19fe919] 172.20.10.4:54810]\n","stream":"stderr","time":"2017-11-
{"log":"I1115 07:09:49.910348       1 logs.go:41] http: TLS handshake error from 168.63.129.16:59459: EOF\n","stream":"stderr","time":"2017-11-15T07:09:49.911205518Z"}

I can't judge whether both entries belong together. However, the GET points to the relevant secret. 172.20.10.4 is the IP of the K8s agent node (currently I use only one node for testing purposes). I have no clue about 168.63.129.16, but whois says it belongs to a Microsoft network.

...which is interesting. However, this is not the major point of this issue, and the whois information might be outdated. Still, why should K8s contact a Microsoft server? Or does this come from the Azure layer? If so, that is interesting, too. As far as I have learned, AzureGermanCloud should only follow the Azure specification rather than using Microsoft infrastructure at the implementation level.

agent's (172.20.10.4) /var/log/syslog

Nov 15 09:49:36 k8s-agentpool1-41067869-0 docker[11197]: time="2017-11-15T09:49:36.476380458Z" level=error msg="Handler for GET /v1.24/images/URLtoAwsEC2/some_image:tag/json returned error: No such image: URLtoAwsEC2/some_image:tag"

I know this looks like a spelling error in the image URL, but I checked it 10+ times and even copy-pasted the URL. This cannot be a typo; it worked last week and is unchanged.
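
One way to tell a stale-credential problem from a genuinely missing image (a sketch; the region and registry URL are placeholders like in the logs above, and it assumes the AWS CLI is available where it runs):

# Pull manually with a freshly issued ECR token; if this succeeds while the kubelet
# still reports "unauthorized", the token stored in the cluster secret is likely stale.
TOKEN=$(aws ecr get-login --no-include-email --region eu-central-1 | awk '{print $6}')
docker login -u AWS -p "$TOKEN" https://URLtoAwsEC2
docker pull URLtoAwsEC2/some_image:tag

# For comparison, dump the docker config currently stored in the cluster secret.
kubectl --namespace demo get secret registry -o yaml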

ITler commented Nov 15, 2017

Error on our side. No issue here.

ITler closed this Nov 15, 2017

mcwienczek commented Mar 22, 2018

Could you explain what the problem was? We are having a similar problem.

ITler commented Mar 27, 2018

@mcwienczek I'm not sure anymore, but as far as I can remember, this was a problem on our side and had nothing to do with k8s.
Double-check that you really deploy the docker-registry secret to the correct namespace. Keep in mind that the ECR login credentials are only valid for 12 hours.
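
Because the ECR token expires, the image pull secret has to be recreated periodically, for example from a cron job. A minimal sketch of such a refresh, using the same placeholder region, registry URL, secret name and namespace as in the earlier snippets:

# Re-issue the ECR token and replace the docker-registry secret with a fresh one.
TOKEN=$(aws ecr get-login --no-include-email --region eu-central-1 | awk '{print $6}')
kubectl --namespace demo delete secret registry --ignore-not-found
kubectl --namespace demo create secret docker-registry registry \
  --docker-server=https://<aws_account_id>.dkr.ecr.eu-central-1.amazonaws.com \
  --docker-username=AWS \
  --docker-password="$TOKEN" \
  --docker-email=none@example.com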
