
Sporadic "too old resource version" errors from master pods #22024

Closed
DreadPirateShawn opened this issue Feb 25, 2016 · 33 comments

@DreadPirateShawn DreadPirateShawn commented Feb 25, 2016

We see sporadic "too old resource version" errors from the master pods.

(We understand from http://stackoverflow.com/questions/34322969/cause-of-apiserver-received-an-error-that-is-not-an-unversioned-errors-from-ku/34330607 that this could be expected behavior during an upgrade -- can we get confirmation of that, and will those be changed from "error" level to "warning" level at any point?)

Our bigger concern is that these errors continue to occur -- not constantly, but routinely, anywhere from minutes to hours apart.

Is there something we need to do to alleviate this error spam?

Specifically, when we restart our HA masters (e.g. for an upgrade), we delete the master nodes, regenerate the pod YAMLs, restart hyperkube, then patch the master nodes to be unschedulable:

kubectl --server=https://10.1.4.41:6443 --kubeconfig=... delete node sea5m1kmaster1
node "sea5m1kmaster1" deleted

kubectl --server=https://10.1.4.41:6443 --kubeconfig=... delete node sea5m1kmaster2
node "sea5m1kmaster2" deleted

# generate fresh yamls for apiserver / podmaster / scheduler / controller manager
# restart hyperkube

kubectl --server=https://10.1.4.41:6443 --kubeconfig=... patch node sea5m1kmaster1 -p {\"spec\":{\"unschedulable\":true}}
"sea5m1kmaster1" patched

kubectl --server=https://10.1.4.41:6443 --kubeconfig=... patch node sea5m1kmaster2 -p {\"spec\":{\"unschedulable\":true}}
"sea5m1kmaster2" patched

Every now and then, unrelated to any restarts or other visible errors, we get 1-2 of these errors on our masters:

2016-02-25T18:42:48.7850+00:00 tuk6r2kmaster2 [err] [docker] E0225 18:42:48.785758       1 errors.go:62] apiserver received an error that is not an unversioned.Status: too old resource version: 1785219 (1787027)
2016-02-25T18:42:48.7870+00:00 tuk6r2kmaster2 [err] [docker] E0225 18:42:48.787036       1 reflector.go:227] /usr/src/go/src/runtime/asm_amd64.s:2232: Failed to watch *api.Pod: too old resource version: 1785219 (1787027)
2016-02-25T18:46:00.7180+00:00 tuk6r2kmaster2 [err] [docker] E0225 18:46:00.718118       1 errors.go:62] apiserver received an error that is not an unversioned.Status: too old resource version: 1785219 (1787027)
2016-02-25T18:46:00.6850+00:00 tuk6r2kmaster1 [err] [docker] E0225 18:46:00.685606       1 errors.go:62] apiserver received an error that is not an unversioned.Status: too old resource version: 1785219 (1787027)
2016-02-25T18:46:00.7880+00:00 tuk6r2kmaster1 [err] [kubelet] E0225 18:46:00.716662    3999 reflector.go:227] pkg/kubelet/config/apiserver.go:43: Failed to watch *api.Pod: too old resource version: 1785219 (1787027)
2016-02-25T18:46:01.1700+00:00 tuk6r2kmaster2 [err] [kubelet] E0225 18:46:00.688601   62642 reflector.go:227] pkg/kubelet/config/apiserver.go:43: Failed to watch *api.Pod: too old resource version: 1785219 (1787027)

Is this indicative of a problem, perhaps in our upgrade flow? Or is this expected routine log behavior?

Version Info:

$ kubectl --server=https://10.1.4.41:6443 --... version
Client Version: version.Info{Major:"1", Minor:"1", GitVersion:"v1.1.3", GitCommit:"6a81b50c7e97bbe0ade075de55ab4fa34f049dc2", GitTreeState:"clean"}
Server Version: version.Info{Major:"1", Minor:"1", GitVersion:"v1.1.3", GitCommit:"6a81b50c7e97bbe0ade075de55ab4fa34f049dc2", GitTreeState:"clean"}
@bgrant0607 bgrant0607 commented Mar 9, 2016

An example from @TermiT on Slack: https://gist.github.com/roman-shuhov/fb3fe7605152a7e9e7f7

I0309 05:56:57.564279       1 handlers.go:131] GET /api/v1/watch/endpoints?resourceVersion=14145456: (468.125µs) 500
goroutine 297 [running]:
k8s.io/kubernetes/pkg/httplog.(*respLogger).WriteHeader(0xc2087c7140, 0x1f4)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/httplog/log.go:188 +0x9a
k8s.io/kubernetes/pkg/apiserver.writeJSON(0x1f4, 0x7fc1dd355fe0, 0xc2080a1320, 0x7fc1dd3567f8, 0xc209508d80, 0x7fc1dd1a3050, 0xc2087c7140, 0x1)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/apiserver/apiserver.go:389 +0x89
k8s.io/kubernetes/pkg/apiserver.errorJSON(0x7fc1dd350fc0, 0xc208395ba0, 0x7fc1dd355fe0, 0xc2080a1320, 0x7fc1dd1a3050, 0xc2087c7140, 0xc208356c80)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/apiserver/apiserver.go:416 +0xab
k8s.io/kubernetes/pkg/apiserver.func·026(0xc208dbb830, 0xc2087c7260)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/apiserver/resthandler.go:256 +0xc2a
github.com/emicklei/go-restful.func·009(0xc208dbb830, 0xc2087c7260)
    /go/src/k8s.io/kubernetes/Godeps/_workspace/src/github.com/emicklei/go-restful/container.go:244 +0x41
github.com/emicklei/go-restful.(*FilterChain).ProcessFilter(0xc208dbb920, 0xc208dbb830, 0xc2087c7260)
    /go/src/k8s.io/kubernetes/Godeps/_workspace/src/github.com/emicklei/go-restful/filter.go:21 +0xa2
k8s.io/kubernetes/pkg/apiserver.func·003(0xc208dbb830, 0xc2087c7260, 0xc208dbb920)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/apiserver/apiserver.go:62 +0x8b
github.com/emicklei/go-restful.(*FilterChain).ProcessFilter(0xc208dbb920, 0xc208dbb830, 0xc2087c7260)
    /go/src/k8s.io/kubernetes/Godeps/_workspace/src/github.com/emicklei/go-restful/filter.go:19 +0x84
github.com/emicklei/go-restful.(*Container).dispatch(0xc2087370e0, 0x7fc1dd1a3050, 0xc2087c7140, 0xc2095f2f70)
    /go/src/k8s.io/kubernetes/Godeps/_workspace/src/github.com/emicklei/go-restful/container.go:246 +0xee3
github.com/emicklei/go-restful.*Container.(github.com/emicklei/go-restful.dispatch)·fm(0x7fc1dd1a3050, 0xc2087c7140, 0xc2095f2f70)
    /go/src/k8s.io/kubernetes/Godeps/_workspace/sr [[hyperkube/v1.1.8 (linux/amd64) kubernetes/a8af33d] 10.242.114.51:60742]
E0309 05:56:57.565001       1 errors.go:62] apiserver received an error that is not an unversioned.Status: too old resource version: 14145457 (14145478)

cc @wojtek-t @lavalamp

@bgrant0607 bgrant0607 commented Mar 9, 2016

Should be documented in known issues #21000 if we're not going to fix it. It generates a lot of noise, though.

@wojtek-t wojtek-t commented Mar 9, 2016

Looking into the logs from the original post, it seems there are two different issues:

The first is a problem in apiserver <-> etcd communication:

2016-02-25T18:42:48.7850+00:00 tuk6r2kmaster2 [err] [docker] E0225 18:42:48.785758       1 errors.go:62] apiserver received an error that is not an unversioned.Status: too old resource version: 1785219 (1787027)

The second is a problem in apiserver <-> client communication:

2016-02-25T18:42:48.7870+00:00 tuk6r2kmaster2 [err] [docker] E0225 18:42:48.787036       1 reflector.go:227] /usr/src/go/src/runtime/asm_amd64.s:2232: Failed to watch *api.Pod: too old resource version: 1785219 (1787027)

The latter is a clear consequence of the former (the timestamps also match).

The former, "apiserver <-> etcd", error seems strange -- we don't observe those even in very large clusters.
@DreadPirateShawn - do you have enough network bandwidth between apiserver and etcd? It might be related to something like that.

For the latter, I don't think we can do much other than documenting it.

@DreadPirateShawn DreadPirateShawn commented Mar 10, 2016

Network bandwidth appears to be fine -- looks like we have about 20.68 mb/s between apiserver and etcd host, and the only other activity between those hosts is from heapster, on the order of a couple hundred kb/s.

(I obviously can't speak to @TermiT's repro, though it is reassuring to see that we're not the only ones seeing this.)

Granted, if others are repro'ing this but you're not seeing it, this next question might be hard to answer :-) but if there really isn't much you can do other than document it, then could / should this error be lowered to warning level?

(For supporting prod, anything at "error" level merits far more attention / concern than "warning" level.)

@lavalamp lavalamp commented Mar 10, 2016

Filed #22830 about the error message, still reading rest of issue.

@lavalamp lavalamp commented Mar 10, 2016

Basically, you're getting an out-of-window message -- the watch's resourceVersion has fallen outside the window of history the server retains -- which is totally normal and expected, and has nothing whatsoever to do with your upgrade flow AFAICT. We will continue to print a message here. What's wrong is that the error is not being interpreted correctly: somehow we've started returning a 500, and the message is confusing. This should not be a 500, nor should it print a stack trace. I will follow up on this in #22830, since I already filed it.
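
For readers hitting this from their own client code: below is a minimal sketch, assuming a recent client-go release, of how a raw watch consumer can recover when its resourceVersion falls out of the window. The clientset and namespace arguments are placeholders; client-go's reflectors and informers already perform this re-list internally, which is why the components only log and move on.

package example

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func watchPods(ctx context.Context, clientset *kubernetes.Clientset, namespace string) error {
	// List first to obtain a current resourceVersion to start the watch from.
	list, err := clientset.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	rv := list.ResourceVersion

	for {
		w, err := clientset.CoreV1().Pods(namespace).Watch(ctx, metav1.ListOptions{ResourceVersion: rv})
		if err != nil {
			if apierrors.IsResourceExpired(err) || apierrors.IsGone(err) {
				// "too old resource version": the requested version has fallen out of
				// the window the server retains, so re-list and resume from a fresh one.
				list, err = clientset.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{})
				if err != nil {
					return err
				}
				rv = list.ResourceVersion
				continue
			}
			return err
		}
		for ev := range w.ResultChan() {
			// A real consumer would also track the latest resourceVersion from each
			// object's metadata so a dropped watch can resume where it left off.
			fmt.Printf("event: %s\n", ev.Type)
		}
		w.Stop()
	}
}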

@lavalamp lavalamp commented Mar 11, 2016

As pointed out in #22830, the error message was fixed by #16628.

IMO you should not worry about "errors" in Kubernetes log files unless there's a corresponding symptom. Kubernetes is designed to be self-healing; errors should all be recoverable & not require admin intervention. If they do require admin intervention, the error message should make that very clear, and/or I'd expect some obvious symptom, like the program exiting.

@lavalamp lavalamp closed this Mar 11, 2016
@DreadPirateShawn DreadPirateShawn commented Mar 11, 2016

@lavalamp First and foremost, many thanks for your help and attention!

That being said, will this still be an "error"-level log, or will it be lowered to "warning/info/etc"-level, since it's expected behavior?

The problem with ignoring errors (quotes deliberately omitted) in Kubernetes log files is that it basically amounts to putting tape over the engine warning lights.

For instance, one big problem is that the symptom may not always be immediately apparent -- they don't all result in pods dying. (For instance, a routing misconfiguration can cause "kubectl logs" to fail even though scheduling works fine, which can be a nasty surprise once someone tries to check logs in prod... a node advertise-address misconfiguration can result in etcd thrashing and steady error logs without technically "breaking" anything...) Kubernetes does self-heal errors, but when you see the same error occurring steadily and always, that's generally a sign that you're looking at something that isn't working properly, not simply a random error that Kubernetes recovered from.

Logging something as "error"-level means that you're concerned about it, which means that I'M concerned about it too.

Waiting until something breaks heavily enough to get a phone call in the middle of the night, or until some new service pod runs into the previously-subtle symptom and fails a deployment -- that's not a sustainable approach.

(Also, as you get more adoption by companies using Kubernetes clusters in prod environments, you're going to have more QA & prod requirements that any and all error logs are investigated fully to completion -- saying "we don't see anything obviously broken" isn't sufficient for us to close our tickets, whether we personally suspect that we don't need to worry about it or not.)

@DreadPirateShawn DreadPirateShawn commented Mar 11, 2016

PS: I'm in the #kubernetes-users Slack channel as well, user sfalknerhorine, if you want to chat more without an audience :-) -- this is admittedly a bit more philosophical and tangential to the main topic of fixing the etcd response handler, but it may very well come up again as we pursue various other error-level logs.

@lavalamp lavalamp commented Mar 11, 2016

The apiserver will now return this as a 410 HTTP response instead of printing the stack trace.

I think components doing watches may still log something when a watch is dropped. I can definitely see your point, and there have been other complaints about this message.

@lavalamp lavalamp reopened this Mar 11, 2016
@lavalamp lavalamp commented Mar 11, 2016

I have a lot to say about this message, but lack the time today to say it. So I'll leave this issue open a bit longer.

@lavalamp lavalamp self-assigned this Mar 11, 2016
@DreadPirateShawn DreadPirateShawn commented Nov 2, 2016

@lavalamp Any update on this?

@wojtek-t wojtek-t commented Nov 14, 2016

@DreadPirateShawn - a looooot of things have changed since this issue was opened. Do you observe the same issue in the newer releases?
[BTW - once we launch etcd v3 as our backend storage, which should happen soon, that should also help here.]

@DreadPirateShawn DreadPirateShawn commented Nov 14, 2016

@wojtek-t Great question.

We definitely see it on Kubernetes 1.2.2.

However, when we recently attempted to upgrade to Kube 1.4 and Docker 1.12, we ultimately ran into moby/moby#27607 as a blocker and needed to perform a full rollback. We considered trying to upgrade each separately, but at this point it makes more sense to wait for Docker 1.13 (due soon) and revisit.

@obeattie obeattie commented Nov 14, 2016

We definitely see it in our production cluster running 1.3.x. We'll be upgrading to 1.4.6 in the next few days, so I will post an update soon about whether the issue is still present.

@yvespp yvespp commented Nov 14, 2016

Seeing this with client-go 1.5 and Kubernetes v1.4.3 on minikube. See containous/traefik#836

@jianzhangbjz jianzhangbjz commented Nov 17, 2016

@wojtek-t I'm encountering the same problem -- how do I solve it? The errors are as follows:

W1117 07:31:01.865813       1 reflector.go:330] vendor/k8s.io/client-go/1.5/tools/cache/reflector.go:109: watch of *v1.Endpoints ended with: too old resource version: 1205751 (1205872)

K8s version:

root@heatonli-test1:~/sl-vpn# kubectl version
Client Version: version.Info{Major:"1", Minor:"4+", GitVersion:"v1.4.5-beta.0+$Format:%h$", GitCommit:"5f79a5d3909d62b20a87c9cd4e3699cb4dc94d25", GitTreeState:"dirty", BuildDate:"2016-10-25T07:14:10Z", GoVersion:"go1.6.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"4+", GitVersion:"v1.4.5-beta.0+$Format:%h$", GitCommit:"f4f5c21d73fc8eaff3dbec55e3d902fb3f32cedb", GitTreeState:"dirty", BuildDate:"2016-11-01T05:49:02Z", GoVersion:"go1.6.3", Compiler:"gc", Platform:"linux/amd64"}

Docker version:

root@mfeng-m60-2:/tmp# docker version
Client:
 Version:      1.10.0
 API version:  1.22
 Go version:   go1.5.3
 Git commit:   590d5108
 Built:        Thu Feb  4 18:36:33 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.10.0
 API version:  1.22
 Go version:   go1.5.3
 Git commit:   590d5108
 Built:        Thu Feb  4 18:36:33 2016
 OS/Arch:      linux/amd64
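
The reflector.go warning above comes from client-go's reflector noticing that its watch expired; it re-lists and resumes on its own, so nothing needs to be handled beyond the log line. Below is a minimal shared-informer sketch to illustrate, assuming a recent client-go release -- it is not the commenter's actual setup.

package example

import (
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

func runEndpointsInformer(clientset *kubernetes.Clientset, stopCh <-chan struct{}) {
	factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
	informer := factory.Core().V1().Endpoints().Informer()
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			ep := obj.(*v1.Endpoints)
			fmt.Printf("endpoints added: %s/%s\n", ep.Namespace, ep.Name)
		},
		UpdateFunc: func(oldObj, newObj interface{}) {
			ep := newObj.(*v1.Endpoints)
			fmt.Printf("endpoints updated: %s/%s\n", ep.Namespace, ep.Name)
		},
	})
	// Run blocks until stopCh closes; when a watch ends with "too old resource
	// version", the reflector logs the warning seen above, re-lists, and resumes.
	informer.Run(stopCh)
}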
@lavalamp lavalamp commented Nov 18, 2016

Looks like this message is now a warning only. It really is not a problem unless you see it continuously (like, multiple times per second). If we make it informational only will that make you all feel better? :) :)

@DreadPirateShawn DreadPirateShawn commented Nov 18, 2016

It'll most certainly make our QA dept and error-monitoring tools feel better. ;-) :-D

@jianzhangbjz jianzhangbjz commented Nov 21, 2016

Sure, certainly.

@timoreimann timoreimann commented Jun 6, 2017

We also see it happening sporadically (every few minutes) with client-go 2.x and Kubernetes 1.6 (through containous/traefik).

@lavalamp I'd be interested in taking a stab at lowering the log level to informational. Is this still something you'd be okay with?

@fejta-bot fejta-bot commented Dec 26, 2017

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@fejta-bot fejta-bot commented Jan 25, 2018

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle rotten
/remove-lifecycle stale

@fejta-bot fejta-bot commented Feb 24, 2018

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@grzesuav grzesuav commented Mar 28, 2019

Spotted in v1.10.12

@Disturbing Disturbing commented Apr 3, 2019

Spotted in v1.11.7 on GKE.

@grzesuav grzesuav commented Apr 3, 2019

/reopen

@k8s-ci-robot k8s-ci-robot commented Apr 3, 2019

@grzesuav: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@lavalamp lavalamp commented Apr 4, 2019

My prior comment remains accurate. This is expected at least occasionally during the normal course of operations. I'm fine with lowering the log level (I'm surprised if it hasn't already happened).

@OlegPS OlegPS commented Apr 15, 2019

These messages create noise in the logs at warn severity. Making them informational would be right, since this is just the "normal course of operations". Thanks!

@sharkymcdongles sharkymcdongles commented May 22, 2019

+1. If this isn't useful or relevant info, we should move it to the info log level. It creates a bunch of noise.

@Starefossen Starefossen commented Jul 30, 2019

I'm seeing this in logs from applications connecting to the Kubernetes API using the official API clients. This should definitely be handled better than just ignoring it.

@gmauleon gmauleon commented Aug 5, 2019

If it's something we don't have to act upon and we don't want to be polluted with, I would go even further and push it to debug level.
Honestly, I would love not to see it at all. All our Prometheus instances are spitting this out every 3-5 seconds, and that means $$$ in Stackdriver bills on GKE :(
