
Reduce resharding impact by redirecting data to new shards #8075

Conversation

@Harkishen-Singh (Contributor) commented Oct 16, 2020

Signed-off-by: Harkishen-Singh harkishensingh@hotmail.com

Fixes: #7230

This PR reduces the impact of resharding by spinning up new shards at the time of resharding when there are existing shards. The new shards are fed with data from the existing shards while incoming samples are blocked; once this is done, the incoming samples are consumed. This keeps the samples in order. After this, the old shards are replaced with the new shards. This reduces the wait time of samples caused by slow shards, as mentioned in the related issue.
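
At a high level, the flow described above would look roughly like this (a sketch only; newShards, drainInto, and mtx are illustrative names, not taken from the actual diff):

func (t *QueueManager) reshard(n int) {
	newShards := t.newShards()
	newShards.start(n)

	t.mtx.Lock() // block incoming samples while the old shards are drained
	// Move any pending samples from the old shards into the new queues,
	// preserving their order, before new samples are accepted again.
	t.shards.drainInto(newShards)
	t.shards = newShards
	t.mtx.Unlock() // resume consuming incoming samples
}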

@Harkishen-Singh (Contributor, Author):

cc @cstyan @csmarchbanks

@csmarchbanks (Member) commented Oct 18, 2020 via email

@Harkishen-Singh (Contributor, Author):

Ahh, sorry. I didn't mean to re-ping. I felt that mentioning it in the PR description was not useful, so I removed it from there and mentioned it in a comment.

@csmarchbanks (Member) left a comment:

No worries, we will get notified if it is in the description or a comment.

I left a few comments; I am not sure this is quite correct yet. It looks like runShard() would currently continue to pick up samples from its queue and try to send them rather than put them into a queue buffer, or a new queue. Then, only the very last request would be moved to the new queues in the current implementation?

@Harkishen-Singh (Contributor, Author):

It looks like runShard() would currently continue to pick up samples from its queue and try to send them rather than put them into a queue buffer, or a new queue. Then, only the very last request would be moved to the new queues in the current implementation?

As soon as the queue channel is closed, the pending samples in the shard will be sent to the buffer. It doesn't matter whether it's the last request or not; the pending samples are sent to the buffers rather than waiting for them to be flushed (which is what makes a shard slow).

Do we also want to stop an existing sendSamples() call (one that is in the process of sending) that may be slow? (I am assuming the answer is no.)

@csmarchbanks (Member) commented Oct 20, 2020

The issue is that using the default queue config, the queue channel might have 2,500 samples in it. All of those would be processed, and 4-5 requests would be sent, before the close is processed, at which point up to 500 samples would be moved over to the new shards. Using those numbers, resharding would speed up by about 20%, but I think we can do better :).

I have seen queue configs where 100,000 samples might be buffered in the queue channel, which you can imagine may take quite a long time to send successfully to the remote before seeing that the channel was closed.

Does that make sense?

@Harkishen-Singh (Contributor, Author) commented Oct 23, 2020

@csmarchbanks As far as I understand, as soon as a queue channel is closed, the next iteration in runShard will set ok to false and hence shift the pending samples to the buffer and then to the new shards. The only way it can be slow is if, at the moment the queue channel is closed, the sendSamples() call inside if nPending >= max is blocking, meaning we have to wait for that send to finish. That is why I asked whether we want to stop the currently running sendSamples().

It would be very helpful if you could explain a bit more how "All of those would be processed, and 4-5 requests would be sent" is the case.

@csmarchbanks (Member):

Ahh, I think I might see where the misunderstanding is. When a channel is closed, ok will only be false once all of the samples in that channel have also been drained. Since that channel can have thousands to hundreds of thousands of samples in it, all of those would be processed by the normal batch-and-send logic before the new logic is reached when ok comes back as false.

Here is a go playground example that shows the behavior in a simpler manner: https://play.golang.org/p/lmZ2po7P3OH. As can be seen, if we have 2300 samples in the queue when it is closed, we will send 4 requests of 500 samples before moving the final 300 samples to the new queue.
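
For reference, here is a self-contained sketch in the same spirit as that playground link (the numbers mirror the discussion, not any real queue config):

package main

import "fmt"

func main() {
	// 2300 samples are buffered in the queue when it is closed.
	queue := make(chan int, 2500)
	for i := 0; i < 2300; i++ {
		queue <- i
	}
	close(queue) // closing does NOT discard the buffered values

	pending := 0
	for {
		_, ok := <-queue
		if !ok {
			// Only reached after all 2300 buffered samples were drained:
			// just the final partial batch would move to the new shards.
			fmt.Println("moved to new queue:", pending) // prints 300
			return
		}
		pending++
		if pending == 500 {
			fmt.Println("sent full request of", pending) // happens 4 times
			pending = 0
		}
	}
}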

Did this explanation help at all?

@Harkishen-Singh (Contributor, Author):

Did this explanation help at all?

Yes, very much. Thank you for the awesome explanation. I will update the code soon.

@cstyan (Member) left a comment:

A few comments. I'm having a bit of a hard time following the changes, having come into this late, but it seems like you and Chris are on top of things :)

}

// stop the old shards; subsequent calls to enqueue will return false.
func (s *shards) stopOldShards() {
Member:

Having both stop and stopOldShards is again confusing naming IMO. Can we name this one something like stopForResharding, or something to that effect?

Contributor Author:

After the recent push, I feel that stopOldShards() is not required anymore.

@csmarchbanks (Member) left a comment:

Thanks for the work! Certainly getting better :)

I left some comments, but generally I agree with Callum: I am having a hard time understanding everything that is happening. I think the overall workflow is mostly correct, but debugging issues will be painful and I will need more time to review fully. The following is a suggestion that might make it easier to understand :)

Rather than having the logic spread around among several methods in a shards struct, would it be clearer to represent this by creating a brand new shards struct that is then populated with values from the old shards? Resharding would look something like:

newShards := t.newShards()
newShards.start(numShards)
// transferTo would stop the shard, and enqueue pending samples to `newShards`.
t.shards.transferTo(newShards)
t.shards = newShards

But with some additional locking to make sure Append is still safe. Thoughts?
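
For example, Append could take a read lock around the shards pointer while the reshard path holds the write lock before swapping t.shards (again only a sketch; shardsMtx is an illustrative field name, and enqueue stands in for the existing per-sample enqueue loop):

func (t *QueueManager) Append(samples []record.RefSample) bool {
	t.shardsMtx.RLock()
	defer t.shardsMtx.RUnlock()
	// Append always sees a consistent shards value: either the old set
	// (which will transfer its pending data) or the already-started new set.
	return t.shards.enqueue(samples)
}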

If the above sounds good, I would recommend two pull requests, one to refactor shards to be idempotent and replaced during reshard, and a second to implement the transferTo logic.

}(i)
}
wg.Wait()
sort.SliceStable(buffer, func(i, j int) bool {
Member:

Is it necessary to sort all of the samples, or could we just stream them to the new shards in whatever order the go routines happen to pick up?

Contributor Author:

I used sorting since we want to ensure that the samples are always sorted in the queues. But if you feel that they will eventually be sorted during streaming, then sorting can be avoided.

Contributor Author:

We need to sort anyway; otherwise I am not sure the ordering is guaranteed.

case <-s.reshard:
select {
case sample := <-queue:
// Capture any samples still remaining in the queue.
Member:

When would this happen, and is it guaranteed to only happen once?

Contributor Author:

I have updated the code. This is to accept the samples that may have entered the queue between when the resharding signal is sent to t.shards.reshard and when incoming samples are blocked at the time the newQueues are created. And this should not happen only once, but until the old shards' queues are blocked.
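
Roughly, the intent is a drain loop like the following (an illustrative sketch; drainOldQueue, the sample type, and enqueueNew stand in for the actual code):

// drainOldQueue keeps forwarding anything that still arrives on the old
// shard's queue to the new shards until that queue is closed, so late
// samples are not lost; it can run many iterations, not just one.
func drainOldQueue(oldQueue <-chan sample, enqueueNew func(sample) bool) {
	for s := range oldQueue { // exits only once oldQueue is closed and empty
		enqueueNew(s)
	}
}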

@Harkishen-Singh (Contributor, Author) commented Nov 5, 2020

With the recent code pushed, I feel that stopOldShards() is not required any more. This is because stopOldShards() is called only during resharding, but when a reshard signal is sent, execution goes into the t.reshard case, which then accepts incoming samples and redirects them to the new queues, ending the old shards/goroutines.

What do others think about this?

I will try to make the code cleaner (along with the suggestions for renaming the sharding functions). Sorry for the inconvenience 😅

@Harkishen-Singh (Contributor, Author) commented Nov 6, 2020

If the above sounds good, I would recommend two pull requests, one to refactor shards to be idempotent and replaced during reshard, and a second to implement the transferTo logic.

Hmm, they are both part of a single feature, so why two PRs? Also, as far as I understand, both are required for the implementation.

@Harkishen-Singh (Contributor, Author):

@csmarchbanks I have updated with the suggestions. The renaming suggestions were helpful and I think the code looks more understandable now. There are some changes, like enqueue() no longer needing to return false for blocking (that is now handled by the QueueManager), and related changes that I will make after hearing your thoughts on the updated implementation.
Thank you.

@csmarchbanks (Member) left a comment:

Thanks for all your work, this is certainly becoming more understandable!

What would happen in the current implementation if a reshard is attempted but the remote endpoint is down? You would have shards stuck in sendSamples forever without a context cancellation I think?

Also, there are some lint failures. I would love to see the results of the tests; I think there are some bugs, and if they all pass it might mean we need better test coverage.

@Harkishen-Singh (Contributor, Author):

What would happen in the current implementation if a reshard is attempted but the remote endpoint is down? You would have shards stuck in sendSamples forever without a context cancellation I think?

The current implementation (with the recent push) is such that, after each shard is formed, if sendSamples() fails to send any samples beyond the flush deadline, then we do a hard shutdown as in the earlier cases.

@roidelapluie (Member):

I am wondering if it is worth complicating this code now, since there is a consensus to make remote_write transactional, so all this code would change anyway.

https://docs.google.com/document/d/1vhXKpCNY0k2cbm0g10uM2msXoMoH8CTwrg_dyqsFUKo/edit

@csmarchbanks (Member) left a comment:

I left a few more comments.

I am wondering if it is worth complicating this code now, since there is a consensus to make remote_write transactional, so all this code would change anyway.

Is anyone actually working on making remote write transactional? I couldn't attend the last dev summit so may have missed it. If not I think this bit of extra complexity is worthwhile until transactional rw is out.

@@ -846,6 +917,9 @@ func (s *shards) sendSamples(ctx context.Context, samples []prompb.TimeSeries, b
if err != nil {
level.Error(s.qm.logger).Log("msg", "non-recoverable error", "count", len(samples), "err", err)
s.qm.metrics.failedSamplesTotal.Add(float64(len(samples)))
if time.Since(begin) > s.qm.flushDeadline {
Member:

This means that any retries that take longer than the flushDeadline will cause a hard shutdown? Can we just use the context cancel to stop sending?

Contributor Author:

This means that any retries that take longer than the flushDeadline will cause a hard shutdown?

Yes

Can we just use the context cancel to stop sending?

Sorry, do you mean to just skip the current send or to stop the shard? The current implementation here stops the reshard by using the context's cancel (hardShutdown is the cancelFunc). If not, then I think I did not get your point. Could you please explain a bit more?

Member:

I will rephrase: right now (*shards).stop() contains all of the logic around soft and hard shutdowns. It would be ideal if that logic were still contained only in stop(). It is not obvious to call hardShutdown from inside a method called sendSamples, and I am sure that will lead to unexpected behavior in the future.

I think that ends up meaning that

	if resharding {
		return
	}

inside of stop() should be removed. Since the shard would be sending to the new queue instead of to the remote, shutdown should still happen pretty quickly. You may have to set up some background processing in reshardLoop to avoid blocking on stop, but that is preferable IMO.
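
For reference, a rough sketch of that shape, with both shutdown paths staying inside stop() (field and method names are illustrative):

func (s *shards) stop() {
	close(s.softShutdown) // let each runShard flush what it already has

	select {
	case <-s.done: // all shard goroutines exited cleanly
	case <-time.After(s.qm.flushDeadline):
		s.hardShutdown() // the context cancel aborts any in-flight sendSamples
	}
}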

Contributor Author:

This has been addressed. I think this can be resolved now.

@csmarchbanks (Member):

Hi @Harkishen-Singh, I am still on vacation this week, I plan to take another look at this sometime next week!

@csmarchbanks (Member) left a comment:

Thank you for your continued work! I tried to answer your question and left a few more comments. Looks like the remote tests are timing out, which is likely related to a block somewhere in the reshard code.

@Harkishen-Singh (Contributor, Author):

Hey @csmarchbanks. I have updated the code with a cleaner approach, in line with the suggestions. Thank you.

@Harkishen-Singh (Contributor, Author):

PS: This isn't ready for review yet. I think the new GitHub feature automatically requests reviews from the maintainers.

@roidelapluie (Member):

Yes, unless you mark the PR as a draft.

@Harkishen-Singh (Contributor, Author):

@csmarchbanks the tests are passing now. I went back through the comments/suggestions and I believe they are implemented. It has been a while since I started working on this and I might have lost track of a comment or two, sorry for that. I think we can give this another look.

Since transactional remote write may take a while to be ready, I think we can continue with this.

@stale stale bot added the stale label Jun 6, 2021
@codesome (Member):

@csmarchbanks @cstyan can you take another look at this, please?

@codesome codesome closed this Jul 12, 2021
@codesome codesome reopened this Jul 12, 2021
@codesome (Member):

(closed by mistake)

@stale stale bot removed the stale label Jul 12, 2021
@cstyan (Member) left a comment:

You'll need to rebase as well.

@Harkishen-Singh (Contributor, Author):

Rebased and updated.

@cstyan (Member) commented Jul 28, 2021

Looks like there's now a data race, and possibly a goroutine that never exits somewhere in the resharding procedure; take a look at the failing tests.

@Harkishen-Singh (Contributor, Author) commented Jul 30, 2021

Ah sorry, I missed a lock. It should be fine now.

PS: It seems the failing Windows tests are not related to this PR.

Signed-off-by: Harkishen-Singh <harkishensingh@hotmail.com>
@stale stale bot added the stale label Nov 15, 2021
@bboreham (Member):

Is this likely to be resurrected? I see the original issue #7230 is still open.

@cstyan (Member) commented May 29, 2023

I personally don't have time to take this over. @Harkishen-Singh do you plan/have any interest in picking this up again or should we mark it as open for contributors?

@roidelapluie (Member):

We have looked at this pull request during our bug scrub.

Given the lack of response, I have marked the issue as help wanted, and we have decided to close this PR.

Thank you for your contribution.
