
fix: Fix batching logic with write records, introduce concurrent requests #8947

Merged
1 commit merged into influxdata:master on Jan 6, 2022

Conversation

@nirmeshk (Contributor) commented Mar 6, 2021:

Required for all PRs:

  • Associated README.md updated.
  • Has appropriate unit tests.
  • signed CLA

@telegraf-tiger (bot) left a comment:

Thanks so much for the pull request!
🤝 ✒️ Just a reminder that the CLA has not yet been signed, and we'll need it before merging. Please sign the CLA when you get a chance, then post a comment here saying !signed-cla

@telegraf-tiger bot added the "fix" (pr to fix corresponding bug) label on Mar 6, 2021
@ivorybilled (Contributor) left a comment:

Looks like the test file needs to be formatted

@nirmeshk (Contributor, Author) commented Mar 8, 2021:

Fixed the gofmt issue with timestream_test.go.

@ssoroka (Contributor) left a comment:

The change looks thorough, but it's a significant one. I'm not sure about the parallelism.

Review thread on the following diff hunk:

```go
if err := t.writeToTimestream(writeRecordsInput, true); err != nil {
	return err
}
go func(inp *timestreamwrite.WriteRecordsInput) {
```
Reviewer (Contributor):
What's the upper bound on the number of goroutines we're going to launch here? If I set my batch size to something ridiculously large, could you get hundreds or thousands of concurrent requests? Generally I'm also not really a fan of parallel writes in the output here; you're typically not going to see much in the way of improved throughput.

@nirmeshk (Contributor, Author):

You are right, it is currently unbounded. I can add a semaphore that puts an upper bound on the concurrent goroutines:
https://github.com/golang/sync/blob/master/semaphore/semaphore.go

We are making this change because we observed that metrics were being dropped when the requests took too long serially. After making this change, things improved and the metric drops stopped.

Thoughts?
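
A minimal sketch of the bounded fan-out being proposed here, using golang.org/x/sync/semaphore. The function and parameter names below are assumptions for illustration, not the PR's actual code:

```go
package main

import (
	"context"
	"sync"

	"golang.org/x/sync/semaphore"
)

// runBounded runs every write concurrently, but a weighted semaphore caps
// the number of in-flight goroutines at maxConcurrency.
func runBounded(ctx context.Context, writes []func() error, maxConcurrency int64) []error {
	sem := semaphore.NewWeighted(maxConcurrency)
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		errs []error
	)
	for _, write := range writes {
		// Acquire blocks until a slot is free, so at most maxConcurrency
		// goroutines are ever running at once.
		if err := sem.Acquire(ctx, 1); err != nil {
			mu.Lock()
			errs = append(errs, err) // context cancelled
			mu.Unlock()
			break
		}
		wg.Add(1)
		go func(write func() error) {
			defer wg.Done()
			defer sem.Release(1)
			if err := write(); err != nil {
				mu.Lock()
				errs = append(errs, err)
				mu.Unlock()
			}
		}(write)
	}
	wg.Wait()
	return errs
}
```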

Reviewer (Contributor):

Apologies for the delay in replying to this, we're catching up a little with outstanding PRs.
Yup, that sounds like a good idea to put an upper bound on the requests, please proceed with that.

@nirmeshk (Contributor, Author):

Added the upper bound on the concurrency and introduced a parameter for it, so customers can decide on the concurrency themselves.
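
As a sketch of how such a setting might be surfaced on the plugin, the field and TOML tag below are assumptions for illustration, not necessarily the exact names the PR adds:

```go
package timestream

// Timestream is a trimmed-down sketch of the output plugin's config struct.
// Only the new concurrency knob is shown; the real struct has many more fields.
type Timestream struct {
	// MaxWriteGoRoutines caps how many WriteRecords requests run in parallel.
	MaxWriteGoRoutines int `toml:"max_write_go_routines"`
}
```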


Review thread on the following diff hunk:

```go
// On partial failures, Telegraf will reject the entire batch of metrics and
// retry. writeToTimestream will return retryable exceptions only.
err, _ := <-errs
```
Reviewer (Contributor):

You need to read from this channel len(writeRecordsInputs) times, and then you can drop the waitgroup, because the reads will act as a natural block.

@nirmeshk (Contributor, Author):

On line 355 we only add to the errs channel if err != nil, so reading it n times currently blocks forever when no error happens.

I can try removing the != nil check and see if it works uninterrupted.

Reviewer (Contributor):

Yes, that sounds reasonable, to remove the nil check and always return the result from writeToTimestream.

@nirmeshk (Contributor, Author):

I have added a range over the channel. It exits on the first encountered error, but since the channel is already closed before that, it should still be able to be garbage collected?
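
A minimal sketch of the error-collection pattern settled on above: every goroutine sends its result (nil or not) into a buffered channel, the channel is closed once all writers are done, and the reader ranges over it and returns on the first error. The names are assumptions for illustration, not the PR's exact code, and the concurrency bound from the earlier sketch is omitted to keep the focus on the channel:

```go
package main

import "sync"

// collectWriteErrors fans the writes out to goroutines and returns the first
// error it finds, or nil if every write succeeded.
func collectWriteErrors(writes []func() error) error {
	// Buffered to len(writes) and always written to, so no sender ever blocks.
	errs := make(chan error, len(writes))

	var wg sync.WaitGroup
	for _, write := range writes {
		wg.Add(1)
		go func(write func() error) {
			defer wg.Done()
			errs <- write() // always send the result, even when it is nil
		}(write)
	}
	wg.Wait()
	close(errs)

	// Exit on the first non-nil error. The channel is already closed, and its
	// remaining buffered values are simply dropped and garbage collected.
	for err := range errs {
		if err != nil {
			return err // Telegraf will reject the whole batch and retry it
		}
	}
	return nil
}
```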

@nirmeshk (Contributor, Author) commented:
Ping on the pull request @jagularr @ssoroka
Let me know if the proposal in the comments makes sense; I can make the changes and update this PR.

@nirmeshk (Contributor, Author) commented:
!signed-cla

@telegraf-tiger (bot) commented:
Thanks so much for the pull request!
🤝 ✒️ Just a reminder that the CLA has not yet been signed, and we'll need it before merging. Please sign the CLA when you get a chance, then post a comment here saying !signed-cla

@hyandell commented:
Nirmesh - CCLA wise, I believe you're pending the Influx folk updating a database on their side.
Influx folk - let me know if you didn't receive the update.

@sjwang90 (Contributor) commented:

!signed-cla

@sjwang90 added the "area/aws" (AWS plugins including cloudwatch, ecs, kinesis) label on Sep 13, 2021
@sjwang90 requested a review from popey on September 13, 2021
@sjwang90 changed the title from "Fix batching logic with write records, introduce concurrent requests" to "fix: Fix batching logic with write records, introduce concurrent requests" on Sep 13, 2021
@powersj (Contributor) commented Sep 21, 2021:

@nirmeshk per #8848 it sounds like this PR improves the situation with the Timestream output. Are you going to be able to resolve some of the outstanding review questions, specifically around limiting the number of threads that get called? Thanks!

@MyaLongmire added the "waiting for response" (waiting for response from contributor) label on Oct 1, 2021
@sjwang90 (Contributor) commented:
Resolves #8848

@nirmeshk If you can update the PR with the three open comments above, we can give it another review and merge it in soon.

@telegraf-tiger bot removed the "waiting for response" label on Oct 18, 2021
@renovate-ombrea commented:
Any update on this PR?

@nirmeshk (Contributor, Author) commented:
Hi, sorry for the delay in addressing this. Picking it up again now; I will update the request with everything addressed.

@nirmeshk force-pushed the mainline-2 branch 2 times, most recently from 3f01a9d to 46bb44b, on December 9, 2021
@nirmeshk (Contributor, Author) commented Dec 9, 2021:

Hi @sjwang90 @powersj @popey, I have addressed all the comments and done another round of load testing internally to make sure there are no concurrency bugs. If you folks can take a look, I would appreciate it.

@sysadmin1139 commented:
I've spent some time today testing the artifacts, and they clearly perform better than the existing Telegraf binaries. I haven't thrown full prod loading at it, but it's now surviving our pre-prod throughput tests which version 1.21.1 manifestly doesn't. In a side-by-side test with another metrics platform we are getting highly similar results, so we don't seem to be leaking metrics.

Looking forward to this getting merged.

@powersj (Contributor) commented Jan 4, 2022:

> Hi @sjwang90 @powersj @popey, I have addressed all the comments and done another round of load testing internally to make sure there are no concurrency bugs. If you folks can take a look, I would appreciate it.

Fix the linter issues and I'm happy to approve.

@nirmeshk force-pushed the mainline-2 branch 2 times, most recently from b66c1e1 to 834a9fe, on January 6, 2022
@nirmeshk (Contributor, Author) commented Jan 6, 2022:

@powersj Fixed the lint error. All checks passing now? I have rebased it over the latest changes as well.

@powersj added the "ready for final review" label (this pull request has been reviewed and/or tested by multiple users and is ready for a final review) on Jan 6, 2022
@reimda merged commit ad1694b into influxdata:master on Jan 6, 2022
powersj pushed a commit to powersj/telegraf that referenced this pull request Jan 21, 2022
reimda pushed a commit that referenced this pull request Jan 27, 2022