pkg/manifests: Configure remote write more conservatively #630
Conversation
Ideally, testing and tweaking this before we enable remote write would be good, no? On something that is not just a cluster-bot.
// buffer before waiting for samples to be sent successfully
// and then continuing to read from the WAL.
Capacity: 30000,
// Should we accumulate 10000 samples before the batch send
You mean less than 1000 samples?
The comment refers to the MaxSamplesPerSend, which is set to 10000.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: brancz, s-urbaniak. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/retest Please review the full test history for this PR and help us cut down flakes.
8 similar comments
The generation step is failing. Putting this on hold.
/hold
MinBackoff: "1s",
// 128s is the 8th backoff in a row, once we end up here, we
// don't increase backoff time anymore. As we would at most
// produce (concurrency/256) number of requests per second.
These comments are very helpful for uninitiated readers 👍
A couple of questions that may be worth clarifying:
- 256s is more than 1m. May multiple backing-off batches overlap, or do they block later batches until success / give-up?
- Does "we don't increase backoff time anymore" mean retries continue indefinitely at 256s intervals, or is the batch given up after 8 failures?
- They have to be consecutive, so only the concurrency factor plays a role here.
- It retries indefinitely at 256s intervals and stops after the WAL is cut, which happens every 2 hours. Then the mechanism starts tailing the "new" WAL.
/hold cancel
/retest Please review the full test history for this PR and help us cut down flakes.
2 similar comments
@brancz you need a bugzilla for this now, just create one with "Alerts for remote write firing" 🙄
It’s ok, this is fine to wait until 4.5 master opens again.
/retest Please review the full test history for this PR and help us cut down flakes.
/retest Please review the full test history for this PR and help us cut down flakes.
2 similar comments
Note this PR does not re-enable telemetry by default; it merely fine-tunes the remote write config when remote write is enabled. With this config we neither require resharding, nor does remote write fall behind trying to replicate the data to the remote, which did happen with the previous configuration. Additionally, this drives the request rate per Prometheus down to ~4 req/s.
@lilic @s-urbaniak @paulfantom @simonpasquier @pgier