Significantly larger frequency of "Unexpected EOF during watch stream event decoding: unexpected EOF" #13331
Comments
wojtek-t
Aug 31, 2015
Member
I was trying to reproduce this problem. So far I found an issue where the new cache always returns a 500 error when resourceVersion=0 is passed to watch (previously it just started watching at "now", which wasn't very well defined). I will prepare a PR to fix it.
However, your logs seem to show a different issue - so far I wasn't able to reproduce it, but I will continue investigating.
wojtek-t self-assigned this Aug 31, 2015
wojtek-t added the sig/api-machinery label Aug 31, 2015
wojtek-t
Aug 31, 2015
Member
Actually - I looked deeper into your logs.
The watches that fail on your cluster (i.e. the logs you pasted) are for:
- serviceaccounts
- secrets
- namespaces
I haven't touched the watch for these resources (the new watch is switched on only for pods, nodes and endpoints), which means it's unrelated to the "new watch in apiserver".
FYI: @timothysc
lavalamp
Aug 31, 2015
Member
The frequency of these (which are pretty harmless, btw) scales with load on the cluster-- is it possible that load went up?
lavalamp added the priority/backlog label Aug 31, 2015
timothysc
Aug 31, 2015
Member
@wojtek-t I only looked at the controller manager in that example. I need to do a more in-depth grok of what's going on.
@lavalamp do you know the root cause? The error is so non-descript. I do notice that it increases with load, but this also seems to increase unnecessary traffic in the system. When I start shifting the QPS+Burst caps it starts to become troublesome.
wojtek-t
Aug 31, 2015
Member
@lavalamp - the load went down (looking at scalability tests there is a significant improvement)
wojtek-t
Aug 31, 2015
Member
Anyway - let me fix the problem I mentioned above tomorrow, and I will continue investigating.
lavalamp
Aug 31, 2015
Member
> So far I found an issue where the new cache always returns a 500 error when resourceVersion=0 is passed to watch (previously it just started watching at "now", which wasn't very well defined). I will prepare a PR to fix it.
@wojtek-t I'm not sure this is an error; I'd like to remove support for starting a watch without a RV.
Also, I can easily believe that the load goes down when your stuff is turned on. But without your cache, increased load should increase dropped watches, or at least it does so when I last looked into it.
@timothysc Yes, this happens when etcd drops a watch connection. It's possible that apiserver got so backed up it caused it, but not likely-- you can look in apiserver log to see if that's the case. If this is actually causing you scalability problems, then you could look into turning on @wojtek-t's cache for more resource types.
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
wojtek-t
Sep 1, 2015
Member
> @wojtek-t I'm not sure this is an error; I'd like to remove support for starting a watch without a RV.
Yes - that sounds like a better idea.
Yes - that sounds like a better idea.
wojtek-t
Sep 1, 2015
Member
> @wojtek-t I'm not sure this is an error; I'd like to remove support for starting a watch without a RV.
BTW - I was trying to understand where those requests are coming from, but I don't know what's happening there. I can see the following logs:
I0901 11:53:07.556051 5 handlers.go:142] GET /api/v1/watch/nodes?fieldSelector=metadata.name%3De2e-test-wojtekt-minion-b2yd&resourceVersion=: (715.51µs) 500 [[kubelet/v1.1.0 (linux/amd64) kubernetes/a0fba1a] 10.240.144.197:41597]
However, the Kubelet uses a Reflector for watching nodes:
https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet.go#L207
and it should never issue watch with empty resourceVersion.
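For reference, the Reflector pattern guarantees exactly this invariant: the client LISTs first to obtain a resourceVersion and then always passes that version to WATCH. A minimal sketch of the invariant (the helper name and URL shape here are illustrative, not the actual client code):

```go
package main

import "fmt"

// watchURL models the Reflector's contract: a watch is only ever
// issued with the resourceVersion obtained from a prior LIST. An
// empty version (as seen in the mystery log line above) means
// "start from an undefined point" and should never be produced.
func watchURL(resource, resourceVersion string) (string, error) {
	if resourceVersion == "" {
		return "", fmt.Errorf("refusing to watch %s without a resourceVersion", resource)
	}
	return fmt.Sprintf("/api/v1/watch/%s?resourceVersion=%s", resource, resourceVersion), nil
}

func main() {
	u, err := watchURL("nodes", "12345")
	fmt.Println(u, err)
	_, err = watchURL("nodes", "")
	fmt.Println(err)
}
```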
lavalamp
Sep 1, 2015
Member
@wojtek-t that's mysterious. The user agent claims to be kubelet, but I agree that reflector should never issue a request like that.
timothysc
Sep 3, 2015
Member
> Yes, this happens when etcd drops a watch connection. It's possible that apiserver got so backed up it caused it.
I'm going to dig a bit further today, but something doesn't make sense here...
/cc @yichengq
yichengq
Sep 4, 2015
Contributor
In current etcd 2.x, once a watch begins (i.e. the client starts reading the response body), the etcd server will wait until an event happens or the server itself is killed.
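Concretely, an etcd v2 watch is an HTTP long-poll: the client GETs the key with wait=true and the server holds the connection open until an event occurs, as described above. A sketch of building such a request (the endpoint and key are example values, not taken from this cluster):

```go
package main

import (
	"fmt"
	"net/url"
)

// etcdWatchURL builds an etcd v2 long-poll watch request.
// wait=true makes the server hold the connection; waitIndex resumes
// the watch from a specific modifiedIndex.
func etcdWatchURL(endpoint, key string, waitIndex uint64) string {
	v := url.Values{}
	v.Set("wait", "true")
	v.Set("recursive", "true")
	if waitIndex > 0 {
		v.Set("waitIndex", fmt.Sprintf("%d", waitIndex))
	}
	return fmt.Sprintf("%s/v2/keys%s?%s", endpoint, key, v.Encode())
}

func main() {
	fmt.Println(etcdWatchURL("http://127.0.0.1:2379", "/registry/pods", 100))
}
```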
lavalamp
Sep 4, 2015
Member
@timothysc https://github.com/kubernetes/kubernetes/blob/master/pkg/storage/etcd/etcd_watcher.go#L200 Look through your apiserver logs and see if this channel is getting filled-- if so, apiserver is the culprit.
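The failure mode being described can be modeled as a non-blocking send on a bounded channel: decoded etcd events are pushed onto a buffered channel, and if the consumer falls behind and the buffer fills, the watcher gives up instead of blocking, which clients then see as a dropped watch. A toy model of that pattern (not the actual etcd_watcher.go code):

```go
package main

import "fmt"

// sendEvent attempts a non-blocking send of a watch event. If the
// buffer is full (slow consumer), it reports failure - the point at
// which apiserver would terminate the watch rather than block.
func sendEvent(ch chan string, ev string) bool {
	select {
	case ch <- ev:
		return true
	default:
		return false // channel full: watch would be dropped here
	}
}

func main() {
	ch := make(chan string, 2)
	fmt.Println(sendEvent(ch, "ADDED"))    // true
	fmt.Println(sendEvent(ch, "MODIFIED")) // true
	fmt.Println(sendEvent(ch, "DELETED"))  // false: buffer full
}
```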
@timothysc - is this still a problem?
I'll close and reopen if we still see it post eval.
timothysc closed this Dec 17, 2015
xiaoping378
Nov 18, 2016
Contributor
I met the same issue. I found the cause: I was running HA in active-active mode. Changing the HA mode to active-backup fixed it.
mignev
Feb 6, 2017
I see it as well.
2800 streamwatcher.go:103] Unexpected EOF during watch stream event decoding: unexpected EOF
....
wojtek-t
Feb 6, 2017
Member
If you are seeing this in newer releases, please open a new issue rather than commenting on an issue that is more than a year old.
mignev
Feb 6, 2017
Got it! Thanks for the fast response :)
timothysc commented Aug 28, 2015
Conditions:
3-node etcd cluster 2.2.0-alpha.1+git
HA active-passive (sched+controller-manager) using pacemaker w/ load-balanced api-servers running v1.1.0-alpha.0.2419+109c3f2b056a68
Prior behavior:
Periodically we would see watch reset with the non-descript "Unexpected EOF during watch stream event decoding: unexpected EOF", which doesn't really tell us anything useful at all.
Recent behavior:
Recent modifications seem to have exacerbated the problem: it now happens with greater frequency, and in large "chunks" at a time - a block of watches will suddenly fail.
A block of the re-issued watches can be found here:
https://docs.google.com/spreadsheets/d/1jbGCQnqDl79tGTb77NPJl_rrb2gXv8aLmemOzWA0z2Y/edit?usp=sharing
/cc @wojtek-t @dchen1107