Offer possibility of faster response for EC #366

Closed
xhernandez opened this issue Nov 30, 2017 · 7 comments
Labels: CB: disperse, wontfix (managed by stale[bot])

Comments

@xhernandez (Contributor)

Currently EC sends requests to all necessary bricks and waits for them to complete before returning the answer to the user. This means that every answer will take at least as long as the slowest of the bricks.

This can easily degrade performance if one of the servers or bricks is misbehaving. In big configurations like 8+4 it's even possible that no server has a permanent issue, but under heavy load one brick or another responds slower most of the time. In this case, EC as a whole will always work at the speed of the slowest brick.

To mitigate this problem, I propose to provide a user-configurable option that defines the minimum number of bricks needed to consider a request completed. The idea is to work exactly as EC works now (i.e. sending the request to all necessary bricks) but report the result to the user as soon as the specified number of bricks have returned a consistent answer.

For example, in a 4+2 configuration, we can configure this new option to 5, meaning that as soon as 5 of the bricks agree on an answer, it's reported to the user without waiting for the sixth brick. This sixth brick will still complete the request in the background. This way we avoid problems generated by a single server or filesystem not responding fast enough.
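Below is a minimal sketch of what this early-reply counting could look like; the structure, fields and function name are hypothetical illustrations of the idea, not actual EC translator code:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t total_bricks;   /* bricks the request was sent to, e.g. 6 in a 4+2 volume */
    uint32_t early_quorum;   /* hypothetical new user option, e.g. 5 */
    uint32_t good_answers;   /* bricks that returned the agreed-upon answer so far */
    bool     replied;        /* whether the user has already received the answer */
} ec_request_t;

/* Called from each brick's callback. Returns true exactly once, as soon as
 * 'early_quorum' consistent answers have arrived, instead of waiting for
 * all 'total_bricks'. The remaining bricks still complete in the background. */
static bool
ec_should_reply_early(ec_request_t *req, bool answer_matches)
{
    if (answer_matches)
        req->good_answers++;

    if (!req->replied && req->good_answers >= req->early_quorum) {
        req->replied = true;
        return true;
    }
    return false;
}
```

With the 4+2 example above and the option set to 5, the answer would be unwound to the user as soon as five bricks agree, while the sixth brick finishes in the background.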

Of course this means that for a short period of time the file won't have the theoretical level of redundancy, and if something happens in this small window of time (a disk or server crash, for example), the resiliency of the volume won't be what is expected. In the example above, there will be a small window of time where the file being modified can have serious problems if two bricks fail simultaneously, even though the volume should have supported up to 2 brick failures without problem. The user will be allowed to determine the trade-off between resiliency and performance.

To make this work we need to decide how we will deal with misbehaving bricks. If a brick is consistently slower than the others, this approach won't have a direct benefit because, even if the first request is answered earlier, the next one may need to wait for the previous one to complete on the slower brick, cancelling the initial benefit.

We can do three things here:

  • We can use per-subvolume queues to keep track of pending requests on each brick. This way we could continue processing requests on other bricks without needing to wait for the slow brick. If the queue grows too much, we can mark the pending files as damaged and clear the queue. The main problem here is that this change would require significant modifications to the current EC implementation, making it expensive (this approach is already designed as part of the txn framework).

  • Mark the brick as bad. This way we immediately stop sending requests to slow bricks, but we also lose a factor of redundancy that may be needed for other files.

  • Mark the brick as bad only for the affected file. This way we'll only "damage" files that are currently being accessed. Other files can still access the brick, even if it's slower, to solve any issue; it's better to respond a bit later than to return an error. Another thing we can do in this case is to keep a list of bricks sorted by response time, so that we always try the fastest ones first, even if we don't consider a specific brick as bad for a particular file (see the sketch after this list).
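A rough sketch of the last option (per-file bad-brick marking plus a brick order sorted by observed response time); all structures and names here are hypothetical illustrations rather than real EC translator types:

```c
#include <stdint.h>

#define EC_MAX_BRICKS 16

typedef struct {
    uint32_t bad_bricks;    /* per-file bitmask of bricks considered bad for this file only */
} ec_file_ctx_t;

typedef struct {
    int    count;                         /* number of bricks in the subvolume */
    double avg_latency[EC_MAX_BRICKS];    /* running average response time per brick */
    int    order[EC_MAX_BRICKS];          /* brick indices sorted by avg_latency */
} ec_brick_stats_t;

/* Mark a brick as bad only for this file; other files keep using it. */
static void
ec_file_mark_bad(ec_file_ctx_t *ctx, int brick)
{
    ctx->bad_bricks |= (1u << brick);
}

/* Re-sort the brick preference order by average latency (insertion sort is
 * enough for a handful of bricks), so requests try the historically fastest
 * bricks first. */
static void
ec_sort_bricks_by_latency(ec_brick_stats_t *stats)
{
    for (int i = 1; i < stats->count; i++) {
        int brick = stats->order[i];
        int j = i - 1;
        while (j >= 0 &&
               stats->avg_latency[stats->order[j]] > stats->avg_latency[brick]) {
            stats->order[j + 1] = stats->order[j];
            j--;
        }
        stats->order[j + 1] = brick;
    }
}
```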

In all cases we'll have an added problem because self-heal will try to step in to fix the just-damaged files. Allowing that would make the problem worse, so we should also implement some way to make self-heal aware of bricks that are misbehaving, and delay heal attempts for some time.
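A small sketch of how such a heal delay could look, assuming a hypothetical per-brick health record and an arbitrary back-off value:

```c
#include <stdbool.h>
#include <time.h>

#define EC_HEAL_BACKOFF_SECS 300   /* hypothetical delay before heal retries the brick */

typedef struct {
    time_t last_marked_slow;       /* when the brick was last marked as misbehaving */
} ec_brick_health_t;

/* Self-heal consults this before healing files whose copies live on the
 * brick; while the back-off window is still open, the heal attempt is
 * postponed instead of adding load to a brick that is already struggling. */
static bool
ec_heal_allowed(const ec_brick_health_t *brick, time_t now)
{
    return difftime(now, brick->last_marked_slow) >= EC_HEAL_BACKOFF_SECS;
}
```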

@raghavendrahg (Member)

I think similar functionality can be extended for AFR and directory operations in DHT.

@jdarcy (Contributor) commented Nov 30, 2017

There are two schools of thought among those who work on similar distributed systems.

(1) Allow the slowest server to affect overall I/O rates, and rely on monitoring to catch/fix persistent laggards.

(2) Treat a slow server as a failed server, already counting against redundancy, so one slow server plus two dead ones is exactly the same as three dead ones and shouldn't be expected to work (though it actually might) if redundancy=2.

In a way, these end up being the same. What really happens with (1) in my experience is that the laggard eventually gets taken down for repair (e.g. replacing a disk that's doing too many retries or bad-block redirections) so it ends up being (2) anyway. The only difference is that letting the slow server affect the entire replica/EC set's performance is more likely to generate user complaints in the interim.

This reminds me of a story from much earlier in my career. I was working on one of the first SCSI multi-pathing drivers (precursor to PowerPath and friends). Under certain conditions involving multiple errors, the driver would repeatedly fail over from one array controller to another trying to find a good path. Sometimes it would do this for as much as five minutes before stabilizing. I was rather proud of the fact that it eventually stabilized at all. A more senior engineer on the project said that users would prefer for it to fail fast and hard in these cases, rather than behaving unpredictably for five minutes. I ended up having to disable my overly-clever code that I'd just spent weeks (away from home BTW) working on. I resented it at the time, but he was right. If you can't guarantee a positive result, sometimes it's better to guarantee a negative one.

I think the same principle applies here. We need to make the "how bad can it be before we mark it dead" threshold tunable anyway. We can use low values for testing. High values put us approximately where we are already, but with bounded rather than unbounded tolerance for laggards. Anyone who really wants to keep running with an obviously flaky server that could fail completely at any moment can set the value to infinity and accept the risk. Implementing anything else is too much complexity for too little value.
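A rough sketch of what such a tunable threshold could look like, assuming a hypothetical option expressing how many times slower than the set's median a brick may get before being marked dead (the option name and the comparison policy are illustrative assumptions, not an agreed design):

```c
#include <math.h>
#include <stdbool.h>

typedef struct {
    double laggard_factor;   /* user tunable; set to INFINITY to keep today's behaviour */
} ec_options_t;

/* Returns true when a brick should be treated as dead rather than merely
 * slow. Low values are useful for testing; high values give bounded rather
 * than unbounded tolerance for laggards. */
static bool
ec_brick_is_laggard(const ec_options_t *opts, double brick_latency,
                    double median_latency)
{
    if (isinf(opts->laggard_factor))
        return false;        /* infinity: never mark, accept the risk */
    return brick_latency > opts->laggard_factor * median_latency;
}
```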

@itisravi (Member) commented Dec 1, 2017

I think halo replication #199 in AFR does what Jeff described above w.r.t. point (2).

@raghavendrahg (Member)

> I think halo replication #199 in AFR does what Jeff described above w.r.t. point (2).

I think the gist of this idea is more like a three-way replica sending a reply once it receives responses from any two bricks, without waiting for the third brick. Though Halo also operates on the same principle, I feel the idea suggested here applies to 3-way replicas too.

stale bot commented Apr 30, 2020

Thank you for your contributions.
We noticed that this issue has not had any activity in the last ~6 months. We are marking this issue as stale because it has not had recent activity.
It will be closed in 2 weeks if no one responds with a comment here.

stale bot added the wontfix label on Apr 30, 2020
stale bot commented May 15, 2020

Closing this issue as there has been no update since my last update on the issue. If this issue is still valid, feel free to reopen it.

stale bot closed this as completed on May 15, 2020
stale bot commented May 30, 2020

Closing this issue as there has been no update since my last update on the issue. If this issue is still valid, feel free to reopen it.
