New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Significantly larger frequency of "Unexpected EOF during watch stream event decoding: unexpected EOF" #13331

Closed
timothysc opened this Issue Aug 28, 2015 · 19 comments

Comments

Projects
None yet
6 participants
@timothysc
Member

timothysc commented Aug 28, 2015

Conditions:
3-node etcd cluster 2.2.0-alpha.1+git
HA active-passive (sched+controller-manager) using pacemaker w/ load-balanced api-servers running v1.1.0-alpha.0.2419+109c3f2b056a68

Prior behavior:
Periodically we would see watch reset with the non-descript "Unexpected EOF during watch stream event decoding: unexpected EOF", which doesn't really tell us anything useful at all.

Recent behavior:
Recent modifications seem to have exacerbated the system because it is now happening with greater frequency, and it seems to happen in large "chunks" at a time. A block of watches will suddenly fail.

A block of the re-issued watches can be found here:
https://docs.google.com/spreadsheets/d/1jbGCQnqDl79tGTb77NPJl_rrb2gXv8aLmemOzWA0z2Y/edit?usp=sharing

/cc @wojtek-t @dchen1107

@wojtek-t

This comment has been minimized.

Show comment
Hide comment
@wojtek-t

wojtek-t Aug 31, 2015

Member

I was trying to reproduce this problem. So far I found an issue that the new cache is always returns error 500 when resourceVersion = 0 is passed to watch (previously it just started watching at "now", which wasn't very well defined). I will prepare a PR to fix it.

However, in your logs it seems to be a different issue - so far i wasn't able to reproduce it, but I will continue investigating.

Member

wojtek-t commented Aug 31, 2015

I was trying to reproduce this problem. So far I found an issue that the new cache is always returns error 500 when resourceVersion = 0 is passed to watch (previously it just started watching at "now", which wasn't very well defined). I will prepare a PR to fix it.

However, in your logs it seems to be a different issue - so far i wasn't able to reproduce it, but I will continue investigating.

@wojtek-t wojtek-t self-assigned this Aug 31, 2015

@wojtek-t

This comment has been minimized.

Show comment
Hide comment
@wojtek-t

wojtek-t Aug 31, 2015

Member

Actually - I looked deeper into your logs.

The watches that fail on your cluster (i.e. the logs you pasted) are for:

  • serviceaccounts
  • secrets
  • namespaces

I haven't touched the watch for these resources (the new watch is switched on only for pods, nodes and endpoints), which mean it's unrelated to the "new watch in apiserver".
FYI: @timothysc

Member

wojtek-t commented Aug 31, 2015

Actually - I looked deeper into your logs.

The watches that fail on your cluster (i.e. the logs you pasted) are for:

  • serviceaccounts
  • secrets
  • namespaces

I haven't touched the watch for these resources (the new watch is switched on only for pods, nodes and endpoints), which mean it's unrelated to the "new watch in apiserver".
FYI: @timothysc

@lavalamp

This comment has been minimized.

Show comment
Hide comment
@lavalamp

lavalamp Aug 31, 2015

Member

The frequency of these (which are pretty harmless, btw) scales with load on the cluster-- is it possible that load went up?

Member

lavalamp commented Aug 31, 2015

The frequency of these (which are pretty harmless, btw) scales with load on the cluster-- is it possible that load went up?

@timothysc

This comment has been minimized.

Show comment
Hide comment
@timothysc

timothysc Aug 31, 2015

Member

@wojtek-t I only looked at the controller manager in that example. I need to do a more in-depth grok of what's going on.

@lavalamp do you know the root? The error is so non-descript. I do notice that it increases with load, but this also seems to increase unnecessary traffic in the system. When I start shifting the QPS+Burst caps it starts to become troublesome.

Member

timothysc commented Aug 31, 2015

@wojtek-t I only looked at the controller manager in that example. I need to do a more in-depth grok of what's going on.

@lavalamp do you know the root? The error is so non-descript. I do notice that it increases with load, but this also seems to increase unnecessary traffic in the system. When I start shifting the QPS+Burst caps it starts to become troublesome.

@wojtek-t

This comment has been minimized.

Show comment
Hide comment
@wojtek-t

wojtek-t Aug 31, 2015

Member

@lavalamp - the load went down (looking at scalability tests there is a significant improvement)

Member

wojtek-t commented Aug 31, 2015

@lavalamp - the load went down (looking at scalability tests there is a significant improvement)

@wojtek-t

This comment has been minimized.

Show comment
Hide comment
@wojtek-t

wojtek-t Aug 31, 2015

Member

Anyway - let me fix the problem I mentioned above tomorrow and I will continue some investigation.

Member

wojtek-t commented Aug 31, 2015

Anyway - let me fix the problem I mentioned above tomorrow and I will continue some investigation.

@lavalamp

This comment has been minimized.

Show comment
Hide comment
@lavalamp

lavalamp Aug 31, 2015

Member

So far I found an issue that the new cache is always returns error 500 when resourceVersion = 0 is passed to watch (previously it just started watching at "now", which wasn't very well defined). I will prepare a PR to fix it.

@wojtek-t I'm not sure this is an error; I'd like to remove support for starting a watch without a RV.

Also, I can easily believe that the load goes down when your stuff is turned on. But without your cache, increased load should increase dropped watches, or at least it does so when I last looked into it.

@timothysc Yes, this happens when etcd drops a watch connection. It's possible that apiserver got so backed up it caused it, but not likely-- you can look in apiserver log to see if that's the case. If this is actually causing you scalability problems, then you could look into turning on @wojtek-t's cache for more resource types.

Member

lavalamp commented Aug 31, 2015

So far I found an issue that the new cache is always returns error 500 when resourceVersion = 0 is passed to watch (previously it just started watching at "now", which wasn't very well defined). I will prepare a PR to fix it.

@wojtek-t I'm not sure this is an error; I'd like to remove support for starting a watch without a RV.

Also, I can easily believe that the load goes down when your stuff is turned on. But without your cache, increased load should increase dropped watches, or at least it does so when I last looked into it.

@timothysc Yes, this happens when etcd drops a watch connection. It's possible that apiserver got so backed up it caused it, but not likely-- you can look in apiserver log to see if that's the case. If this is actually causing you scalability problems, then you could look into turning on @wojtek-t's cache for more resource types.

@wojtek-t

This comment has been minimized.

Show comment
Hide comment
@wojtek-t

wojtek-t Sep 1, 2015

Member

@wojtek-t I'm not sure this is an error; I'd like to remove support for starting a watch without a RV.

Yes - that sounds like a better idea.

Member

wojtek-t commented Sep 1, 2015

@wojtek-t I'm not sure this is an error; I'd like to remove support for starting a watch without a RV.

Yes - that sounds like a better idea.

@wojtek-t

This comment has been minimized.

Show comment
Hide comment
@wojtek-t

wojtek-t Sep 1, 2015

Member

@wojtek-t I'm not sure this is an error; I'd like to remove support for starting a watch without a RV.

BTW - I was trying to understand where those requests are coming from, but I don't know what's happening there. I can see the following logs:

I0901 11:53:07.556051       5 handlers.go:142] GET /api/v1/watch/nodes?fieldSelector=metadata.name%3De2e-test-wojtekt-minion-b2yd&resourceVersion=: (715.51µs) 500 [[kubelet/v1.1.0 (linux/amd64) kubernetes/a0fba1a] 10.240.144.197:41597]

However, Kubelet is using Reflector for watching node:
https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet.go#L207
and it should never issue watch with empty resourceVersion.

Member

wojtek-t commented Sep 1, 2015

@wojtek-t I'm not sure this is an error; I'd like to remove support for starting a watch without a RV.

BTW - I was trying to understand where those requests are coming from, but I don't know what's happening there. I can see the following logs:

I0901 11:53:07.556051       5 handlers.go:142] GET /api/v1/watch/nodes?fieldSelector=metadata.name%3De2e-test-wojtekt-minion-b2yd&resourceVersion=: (715.51µs) 500 [[kubelet/v1.1.0 (linux/amd64) kubernetes/a0fba1a] 10.240.144.197:41597]

However, Kubelet is using Reflector for watching node:
https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet.go#L207
and it should never issue watch with empty resourceVersion.

@lavalamp

This comment has been minimized.

Show comment
Hide comment
@lavalamp

lavalamp Sep 1, 2015

Member

@wojtek-t that's mysterious. The user agent claims to be kubelet, but I agree that reflector should never issue a request like that.

Member

lavalamp commented Sep 1, 2015

@wojtek-t that's mysterious. The user agent claims to be kubelet, but I agree that reflector should never issue a request like that.

@timothysc

This comment has been minimized.

Show comment
Hide comment
@timothysc

timothysc Sep 3, 2015

Member

Yes, this happens when etcd drops a watch connection. It's possible that apiserver got so backed up it caused it.

I'm going to dig a bit further today, but something doesn't make sense here...

/cc @yichengq

Member

timothysc commented Sep 3, 2015

Yes, this happens when etcd drops a watch connection. It's possible that apiserver got so backed up it caused it.

I'm going to dig a bit further today, but something doesn't make sense here...

/cc @yichengq

@yichengq

This comment has been minimized.

Show comment
Hide comment
@yichengq

yichengq Sep 4, 2015

Contributor

For current etcd 2.x, if watch begins(client starts to read request body), etcd server will wait until event happens or server self is killed.

Contributor

yichengq commented Sep 4, 2015

For current etcd 2.x, if watch begins(client starts to read request body), etcd server will wait until event happens or server self is killed.

@lavalamp

This comment has been minimized.

Show comment
Hide comment
@lavalamp

lavalamp Sep 4, 2015

Member

@timothysc https://github.com/kubernetes/kubernetes/blob/master/pkg/storage/etcd/etcd_watcher.go#L200 Look through your apiserver logs and see if this channel is getting filled-- if so, apiserver is the culprit.

Member

lavalamp commented Sep 4, 2015

@timothysc https://github.com/kubernetes/kubernetes/blob/master/pkg/storage/etcd/etcd_watcher.go#L200 Look through your apiserver logs and see if this channel is getting filled-- if so, apiserver is the culprit.

@wojtek-t

This comment has been minimized.

Show comment
Hide comment
@wojtek-t

wojtek-t Dec 17, 2015

Member

@timothysc - is this still a problem?

Member

wojtek-t commented Dec 17, 2015

@timothysc - is this still a problem?

@timothysc

This comment has been minimized.

Show comment
Hide comment
@timothysc

timothysc Dec 17, 2015

Member

I'll close and reopen if we still see it post eval.

Member

timothysc commented Dec 17, 2015

I'll close and reopen if we still see it post eval.

@timothysc timothysc closed this Dec 17, 2015

@xiaoping378

This comment has been minimized.

Show comment
Hide comment
@xiaoping378

xiaoping378 Nov 18, 2016

Contributor

i met the same issue, i find the reason is that i using HA active-active
change the mode of ha to be active-backup, will be OK.

Contributor

xiaoping378 commented Nov 18, 2016

i met the same issue, i find the reason is that i using HA active-active
change the mode of ha to be active-backup, will be OK.

@mignev

This comment has been minimized.

Show comment
Hide comment
@mignev

mignev Feb 6, 2017

I see it as well.

2800 streamwatcher.go:103] Unexpected EOF during watch stream event decoding: unexpected EOF
....

mignev commented Feb 6, 2017

I see it as well.

2800 streamwatcher.go:103] Unexpected EOF during watch stream event decoding: unexpected EOF
....
@wojtek-t

This comment has been minimized.

Show comment
Hide comment
@wojtek-t

wojtek-t Feb 6, 2017

Member

If this is the case in new-ish releases, please open a new issue, rather than commenting on the more-than-year-old issue.

Member

wojtek-t commented Feb 6, 2017

If this is the case in new-ish releases, please open a new issue, rather than commenting on the more-than-year-old issue.

@mignev

This comment has been minimized.

Show comment
Hide comment
@mignev

mignev Feb 6, 2017

Got it! Thanks for the fast response :)

mignev commented Feb 6, 2017

Got it! Thanks for the fast response :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment