Etcd is becoming our performance bottleneck for scalability #32361

Closed
wojtek-t opened this Issue Sep 9, 2016 · 29 comments

wojtek-t (Member) commented Sep 9, 2016

My recent experiments show that etcd is becoming our performance bottleneck (note that the experiments described below were run with etcd 3.0.4, but using the v2 API).

Test:

  • we are running a 2000-node Kubemark (however with increased QPS limits in the controller manager and scheduler set to 100, and inflight-requests set to 800)
  • we are running the density and load tests on this Kubemark

Outcome:

  • the watch from apiserver (cacher) to etcd was dropped twice during the test with the following logs:
W0909 03:41:23.740330    3559 reflector.go:330] pkg/storage/cacher.go:194: watch of *api.Pod ended with: 401: The event in requested index is outdated and cleared (the requested history has been cleared [1043046/1043043]) [1044045]
W0909 03:41:24.746867    3559 cacher.go:463] Terminating all watchers from cacher *api.Pod
W0909 03:41:27.555734    3559 reflector.go:330] pkg/storage/cacher.go:194: watch of *api.Pod ended with: 401: The event in requested index is outdated and cleared (the requested history has been cleared [1046887/1046856]) [1047886]
W0909 03:41:28.555898    3559 cacher.go:463] Terminating all watchers from cacher *api.Pod

Previously we all had a hypothesis that we were simply not processing events fast enough in the apiserver. But after a few recent changes it seems that's no longer the case.

In particular, I expanded the high-water-mark logs for our buffers and:

  • the highest value for the incoming channel before the first etcd watch drop was:
I0909 03:23:04.883400    3559 etcd_watcher.go:318] watch: 20 objects queued in incoming channel.

(it went up to 65 after the watch drop, but never higher)

  • the highest value for the outgoing channel before the first etcd watch drop was:
I0909 03:28:57.772714    3559 etcd_watcher.go:160] watch (*api.Pod): 93 objects queued in outgoing channel.

[It went to 100 after the watch drop, so we are also close to the limit there, but that's not the biggest problem now.]

All the above shows that, since none of the buffers were full before the first watch drop, etcd simply wasn't able to deliver watch events fast enough.
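
For reference, the high-water-mark logs quoted above come from counters around the watcher's buffered channels; below is a minimal sketch of that kind of instrumentation (the type and channel names are illustrative, not the actual etcd_watcher.go code):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// highWaterMark tracks the maximum observed value and reports when it grows.
// This is a simplified stand-in for the counters used in the apiserver's etcd
// watcher; names and thresholds here are illustrative only.
type highWaterMark struct {
	mark int64
}

// Update returns true when v sets a new high-water mark.
func (h *highWaterMark) Update(v int64) bool {
	for {
		old := atomic.LoadInt64(&h.mark)
		if v <= old {
			return false
		}
		if atomic.CompareAndSwapInt64(&h.mark, old, v) {
			return true
		}
	}
}

func main() {
	// Buffered channel between etcd event decoding and event processing.
	incoming := make(chan string, 100)
	var hwm highWaterMark

	// Producer side: after enqueueing, record how full the buffer is and log
	// whenever a new maximum is reached.
	enqueue := func(ev string) {
		incoming <- ev
		if hwm.Update(int64(len(incoming))) {
			fmt.Printf("watch: %d objects queued in incoming channel.\n", len(incoming))
		}
	}

	for i := 0; i < 25; i++ {
		enqueue(fmt.Sprintf("event-%d", i))
	}
}
```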

Potential problems are:

  • we are sending too much data - if so, protobufs would be a solution for it
  • etcd itself is too slow - then maybe etcd3 will make it faster

@kubernetes/sig-scalability @gmarek @xiang90 @hongchaodeng @timothysc @lavalamp @fgrzadkowski

@wojtek-t wojtek-t added this to the v1.5 milestone Sep 9, 2016

@wojtek-t wojtek-t self-assigned this Sep 9, 2016

gmarek (Member) commented Sep 9, 2016

Note that we did an experiment using a 10x faster disk for etcd and it didn't help either.

timothysc (Member) commented Sep 9, 2016

> The event in requested index is outdated and cleared

Is this compaction?

timothysc (Member) commented Sep 9, 2016

@lavalamp aren't we going to push etcd3 fixes into 1.4.x? Just wondering how we should be setting milestones.

xiang90 (Contributor) commented Sep 9, 2016

A more interesting metric is how long it takes between two watch calls, and how many keys are modified during that interval. If n(keys) > 1000, then we have a problem.

This problem normally means etcd processes modifications too fast.

@wojtek-t Can you try to get those metrics out? Or @hongchaodeng and I will try to help with this one.

lavalamp (Member) commented Sep 9, 2016

etcd3, in etcd3 format, for 1.5.0 is our goal. (This implies we probably need etcd3 in etcd2 mode before then, in a patch release.)

wojtek-t (Member) commented Sep 10, 2016

@xiang90 - does this metric already exist in the etcd client?

wojtek-t (Member) commented Sep 12, 2016

> A more interesting metric is how long it takes between two watch calls, and how many keys are modified during that interval. If n(keys) > 1000, then we have a problem.

n(keys) > 1000 is not the only possible issue. If etcd can process incoming updates faster than it's delivering watches, then that's also a potential problem.

HardySimpson (Contributor) commented Sep 12, 2016

mark

xiang90 (Contributor) commented Sep 12, 2016

> If etcd can process incoming updates faster than it's delivering watches, then that's also a potential problem.

It won't. It will not start processing the next request before delivering all events to watchers. It will drop requests, though, if it cannot deliver them when the sending buffer is full.

xiang90 (Contributor) commented Sep 12, 2016

@wojtek-t We might start to do the optimization you described (async watch delivery) in etcd3. But I am not convinced we really need to.

wojtek-t (Member) commented Sep 12, 2016

> It won't. It will not start processing the next request before delivering all events to watchers. It will drop requests, though, if it cannot deliver them when the sending buffer is full.

@xiang90 - sorry, I don't understand this comment. In etcd2 (and we are still using etcd2 in all tests), watch is implemented as a series of GETs with a specified RV. So even if this particular request is prioritized, etcd doesn't have any control over what will happen next, so it cannot block the world on it.
So unless I'm missing something - you were talking about etcd3, right?

[And this bug is purely about etcd2]
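
To make the "series of GETs" point concrete, here is a minimal sketch of an etcd2-style watch loop built on the v2 Go client (github.com/coreos/etcd/client); the key path, starting index, and error handling are illustrative, not the apiserver's actual etcd_watcher code:

```go
package main

import (
	"context"
	"log"

	"github.com/coreos/etcd/client"
)

func main() {
	c, err := client.New(client.Config{Endpoints: []string{"http://127.0.0.1:2379"}})
	if err != nil {
		log.Fatal(err)
	}
	kapi := client.NewKeysAPI(c)

	// Each Next() call is effectively a long-polling GET with
	// ?wait=true&waitIndex=<afterIndex>&recursive=true. Between two such GETs,
	// etcd keeps accepting writes; the watcher only sees them on the next call.
	afterIndex := uint64(0) // illustrative starting index
	w := kapi.Watcher("/registry/pods", &client.WatcherOptions{AfterIndex: afterIndex, Recursive: true})

	for {
		resp, err := w.Next(context.Background())
		if err != nil {
			// If too many events happened since afterIndex, etcd answers with
			// error 401 ("The event in requested index is outdated and cleared").
			if cerr, ok := err.(client.Error); ok && cerr.Code == client.ErrorCodeEventIndexCleared {
				log.Printf("watch window exceeded, must relist: %v", cerr)
				return
			}
			log.Fatal(err)
		}
		log.Printf("%s %s (modifiedIndex=%d)", resp.Action, resp.Node.Key, resp.Node.ModifiedIndex)
	}
}
```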

xiang90 (Contributor) commented Sep 12, 2016

@wojtek-t

Say the current rv is 2000. There are three cases:

  1. we watch on rv > 2000. etcd will not process the next request until the watch event is delivered to the watch buffer. Once the event is in the buffer, it won't be cleared.
  2. we watch on 1000 < rv <= 2000. etcd scans its history and delivers the watch event to the watch buffer.
  3. we watch on rv < 1000. etcd will return the "cleared" error.

You mentioned we might see 3 happen. It must happen in a sequence like the following:

  1. watch returns rv = 900
  2. we start to process the watch event
  3. we put 1100 more items into etcd
  4. etcd rv grows to 2000
  5. we finish processing the watch event
  6. we start the watch with rv = 901 again
  7. the "cleared" error is returned

There are other possibilities that make the 2nd watch request arrive late. Slow event processing is just one reason; the network might be slow, for example.

Hope this explains it.

> A more interesting metric is how long it takes between two watch calls, and how many keys are modified during that interval. If n(keys) > 1000, then we have a problem.
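
A small sketch of the window check behind case 3, assuming etcd2's default in-memory history of roughly 1000 events (the function illustrates the rule described above, not etcd's actual code):

```go
package main

import "fmt"

const eventHistorySize = 1000 // etcd2's default in-memory event history (illustrative constant)

// watchOutcome mimics how etcd2 decides what to do with a watch at afterIndex
// when its current index is currentIndex. Purely illustrative.
func watchOutcome(afterIndex, currentIndex uint64) string {
	switch {
	case afterIndex >= currentIndex:
		return "block until a newer event arrives, then deliver it"
	case currentIndex-afterIndex <= eventHistorySize:
		return "scan the in-memory history and deliver the missed events"
	default:
		return "401: the requested history has been cleared"
	}
}

func main() {
	// The sequence from the comment above: the watch returned rv=900, then 1100
	// puts pushed etcd to rv=2000 before the next watch call with rv=901 arrived.
	fmt.Println(watchOutcome(1500, 2000)) // 500 events behind: still within the window -> history scan
	fmt.Println(watchOutcome(901, 2000))  // 1099 events behind: outside the window -> cleared error
}
```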

wojtek-t (Member) commented Sep 12, 2016

Sorry - I probably wasn't clear enough, because that's not what I meant.

What I meant is:

  • for simplicity, let's assume that the watcher is interested in all objects
  • etcd is at RV=2000
  • we send a watch on 1900
  • etcd processes it and returns the object with RV 1901
  • then a bunch of "update" events come in (say 2001, 2002, 2003, 2004 and 2005) and they are successfully committed - etcd is now at RV=2005
  • only then does the next "get" from the watch come in (it has RV = 1901)
  • etcd processes it before processing anything else, and returns the object with RV 1902
  • and then again a bunch of updates come

This is clearly possible, and after a few rounds we will end up exceeding the "etcd window" (see the sketch below).

So that is something I didn't write anywhere above, but implicitly meant - the issue from this bug is not an etcd server issue, but it's either:

  • an etcd client issue (before we reach Kubernetes-specific code)
  • or network slowness between the etcd server and the etcd client (which would be surprising, because it's localhost)

@xiang90 Does it make sense now?
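
A tiny simulation of the drift described above, assuming (purely for illustration) that each watch round delivers one event while writers commit five:

```go
package main

import "fmt"

func main() {
	const window = 1000 // etcd2 keeps roughly the last 1000 events in memory
	watcherRV := uint64(1900)
	etcdRV := uint64(2000)

	// Each round: the watcher's GET advances it by one event, while a burst of
	// writes advances etcd's RV by five. The lag therefore grows by four per round.
	for round := 1; ; round++ {
		watcherRV++ // watch GET returns the next event
		etcdRV += 5 // updates committed while the watcher was processing

		if etcdRV-watcherRV > window {
			fmt.Printf("round %d: watcher at %d, etcd at %d -> window exceeded, watch dropped\n",
				round, watcherRV, etcdRV)
			return
		}
	}
}
```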

xiang90 (Contributor) commented Sep 12, 2016

@wojtek-t

Yeah. What you just described is also possible if we consider multiple rounds of watch. The watcher can slowly fall out of the window. My best guess is that:

  1. JSON decoding/encoding is slow
  2. watch only consumes one routine, but put requests can consume multiple routines, so puts might be processed faster than watches. Simply put, watch is not concurrent, but put can be.

This issue has been solved in etcd3 by supporting batched events. We will deliver a bunch of events together to the watcher when we detect the watcher is struggling. However, etcd2 does not support this :(
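
For comparison, the etcd3 watch API streams batches of events over a single gRPC watch; a minimal sketch with the v3 Go client (clientv3), using an illustrative key prefix:

```go
package main

import (
	"context"
	"log"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"127.0.0.1:2379"}})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// A single long-lived watch stream; each WatchResponse may carry many events,
	// so a slow consumer receives them batched instead of falling out of a window.
	wch := cli.Watch(context.Background(), "/registry/pods/", clientv3.WithPrefix())
	for resp := range wch {
		if resp.Canceled {
			// e.g. the requested revision was already compacted
			log.Printf("watch canceled: %v (compact revision: %d)", resp.Err(), resp.CompactRevision)
			return
		}
		log.Printf("received a batch of %d events at revision %d", len(resp.Events), resp.Header.Revision)
		for _, ev := range resp.Events {
			log.Printf("  %s %s", ev.Type, ev.Kv.Key)
		}
	}
}
```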

wojtek-t (Member) commented Sep 12, 2016

@xiang90 - yes, what you wrote makes perfect sense (and matches my guess too).

I also agree that we should run tests with etcd3 ASAP to see if this particular problem goes away. Will try to run them later this week.

HardySimpson (Contributor) commented Sep 13, 2016

@wojtek-t another question: is the 2000-node Kubemark served by one apiserver?

gmarek (Member) commented Sep 13, 2016

@HardySimpson - yes, it is.

luxas (Member) commented Sep 23, 2016

Which etcd3 version are you going to target in v1.5?
etcd 3.1.x with all the latest stability stuff and the new features (when it's released)?

timothysc (Member) commented Sep 23, 2016

@luxas too early, I'm going to rev to 3.1.0 today.

luxas (Member) commented Sep 23, 2016

@timothysc I'm not sure I got that message :)
You mean you're going to switch k8s Godeps to the coming v3.1.0-alpha.1 etcd release and bump the gcr.io etcd image as well?

If so, that would be really great!

timothysc (Member) commented Sep 23, 2016

Yes, but 3.1.0 should be released today and I'll submit the PR this afternoon.

hongchaodeng (Member) commented Sep 23, 2016

@timothysc
There will be a release of 3.0.10 and another alpha version of 3.1.0.

timothysc (Member) commented Sep 23, 2016

k, then 3.0.10 it is.

luxas (Member) commented Sep 23, 2016

@hongchaodeng When will 3.1.0 be available?
I'm looking forward to that release.
Also, how many more 3.0.x releases will there be?

timothysc (Member) commented Sep 23, 2016

xref: #33393

wojtek-t (Member) commented Oct 10, 2016

@luxas - we are currently not targeting 3.1. The current plan is 3.0.12.

luxas (Member) commented Oct 10, 2016

Sad, since ARM needs 3.1.x unless the etcd team cherry-picks some 32-bit fixes into 3.0.13...
Namely etcd-io/etcd#5890, which I proposed cherry-picking (very, very low risk) into 3.0.x, but it seems it was turned down.

cc @xiang90 @gyuho Is a cherry-pick okay with you?
There are quite a lot of k8s ARM users, and I'd like not to "hack" things on the Kubernetes side to make it work if not necessary. A cherry-pick would solve that issue.

wojtek-t (Member) commented Oct 10, 2016

@luxas - the reason for 3.0.x is that 3.1 is still not released, and we don't want to upgrade at the last moment. Though if there is a cherry-pick, we can definitely consider it.

wojtek-t (Member) commented Nov 2, 2016

We confirmed that we no longer face the issues described above with etcd v3. Thus I'm closing this one.
