Limit backfill validation concurrency #4598

Merged
56quarters merged 19 commits into main from aldernero/backfill-limit-concurrency
Apr 4, 2023

Conversation

Contributor

@aldernero aldernero commented Mar 27, 2023

What this PR does

This PR adds a configurable limit to the compactor on the maximum number of concurrent backfill validations. The limit can keep the compactor from being overloaded by the validation phase of block upload, which consumes both local storage and CPU.

  • adds a new compactor configuration parameter, -compactor.max-block-upload-validation-concurrency, which defaults to 1; a value of 0 means unlimited
  • adds a new gauge, cortex_block_upload_current_validations, to monitor concurrency
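
For readers who want a concrete picture of the limit semantics described above (a default of 1, with 0 meaning unlimited), here is a minimal Go sketch. It is not the code from this PR: the `validationLimiter` type and its methods are hypothetical names, and the in-flight counter only stands in for the gauge.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// validationLimiter is a hypothetical helper, not the PR's implementation.
// A limit of 0 means "unlimited", mirroring the flag description above.
type validationLimiter struct {
	limit    int64
	inFlight atomic.Int64 // stands in for the concurrency gauge
}

func newValidationLimiter(limit int64) *validationLimiter {
	return &validationLimiter{limit: limit}
}

// tryAcquire reserves a validation slot and reports whether it succeeded.
// It never blocks: when the limit is reached it returns false immediately.
func (l *validationLimiter) tryAcquire() bool {
	n := l.inFlight.Add(1)
	if l.limit > 0 && n > l.limit {
		l.inFlight.Add(-1) // undo the reservation, the limit is reached
		return false
	}
	return true
}

// release frees a slot previously reserved with tryAcquire.
func (l *validationLimiter) release() { l.inFlight.Add(-1) }

// current returns the number of validations currently in flight.
func (l *validationLimiter) current() int64 { return l.inFlight.Load() }

func main() {
	l := newValidationLimiter(1)
	fmt.Println(l.tryAcquire()) // true: the only slot is free
	fmt.Println(l.tryAcquire()) // false: the limit of 1 is reached
	l.release()
	fmt.Println(l.tryAcquire()) // true: the slot was released
}
```

Rejecting immediately rather than queueing is an assumption made here so that the caller can decide how to report the condition to the client.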

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@aldernero
Contributor Author

I tested that mimirtool functions as expected with development/mimir-microservices-mode/docker-compose.jsonnet and some test blocks:

first client starts, completes normally

❯ mimirtool backfill --log.level="debug" 01GKN2SS8BP9G2P7K2NWBKVC5W
INFO[0000] log level set to debug                       
INFO[0000] Backfilling                                   blocks=01GKN2SS8BP9G2P7K2NWBKVC5W user=anonymous
DEBU[0000] New Mimir client created                      address="http://localhost:8006" id=anonymous
INFO[0000] making request to start block upload          block=01GKN2SS8BP9G2P7K2NWBKVC5W file=meta.json path=01GKN2SS8BP9G2P7K2NWBKVC5W
DEBU[0000] sending request to Grafana Mimir API          method=POST url="http://localhost:8006/api/v1/upload/block/01GKN2SS8BP9G2P7K2NWBKVC5W/start"
DEBU[0000] checking response                             status="200 OK"
INFO[0000] uploading block file                          block=01GKN2SS8BP9G2P7K2NWBKVC5W file=index path=01GKN2SS8BP9G2P7K2NWBKVC5W size=34211534
DEBU[0000] sending request to Grafana Mimir API          method=POST url="http://localhost:8006/api/v1/upload/block/01GKN2SS8BP9G2P7K2NWBKVC5W/files?path=index"
DEBU[0000] checking response                             status="200 OK"
INFO[0000] uploading block file                          block=01GKN2SS8BP9G2P7K2NWBKVC5W file=chunks/000001 path=01GKN2SS8BP9G2P7K2NWBKVC5W size=207314021
DEBU[0000] sending request to Grafana Mimir API          method=POST url="http://localhost:8006/api/v1/upload/block/01GKN2SS8BP9G2P7K2NWBKVC5W/files?path=chunks%2F000001"
DEBU[0001] checking response                             status="200 OK"
DEBU[0001] sending request to Grafana Mimir API          method=POST url="http://localhost:8006/api/v1/upload/block/01GKN2SS8BP9G2P7K2NWBKVC5W/finish"
DEBU[0001] checking response                             status="200 OK"
DEBU[0001] sending request to Grafana Mimir API          method=GET url="http://localhost:8006/api/v1/upload/block/01GKN2SS8BP9G2P7K2NWBKVC5W/check"
DEBU[0001] checking response                             status="200 OK"
DEBU[0001] checked block upload state                    block=01GKN2SS8BP9G2P7K2NWBKVC5W path=01GKN2SS8BP9G2P7K2NWBKVC5W state=validating
DEBU[0021] sending request to Grafana Mimir API          method=GET url="http://localhost:8006/api/v1/upload/block/01GKN2SS8BP9G2P7K2NWBKVC5W/check"
DEBU[0021] checking response                             status="200 OK"
DEBU[0021] checked block upload state                    block=01GKN2SS8BP9G2P7K2NWBKVC5W path=01GKN2SS8BP9G2P7K2NWBKVC5W state=complete
INFO[0021] block uploaded successfully                   block=01GKN2SS8BP9G2P7K2NWBKVC5W path=01GKN2SS8BP9G2P7K2NWBKVC5W
INFO[0021] finished uploading blocks                     already_exists=0 failed=0 succeeded=1

second client starts immediately after first, waits and retries

❯ mimirtool backfill --log.level="debug" 01GKN2TQV5A3D13852P6CKCS90
INFO[0000] log level set to debug                       
INFO[0000] Backfilling                                   blocks=01GKN2TQV5A3D13852P6CKCS90 user=anonymous
DEBU[0000] New Mimir client created                      address="http://localhost:8006" id=anonymous
INFO[0000] making request to start block upload          block=01GKN2TQV5A3D13852P6CKCS90 file=meta.json path=01GKN2TQV5A3D13852P6CKCS90
DEBU[0000] sending request to Grafana Mimir API          method=POST url="http://localhost:8006/api/v1/upload/block/01GKN2TQV5A3D13852P6CKCS90/start"
DEBU[0000] checking response                             status="200 OK"
INFO[0000] uploading block file                          block=01GKN2TQV5A3D13852P6CKCS90 file=index path=01GKN2TQV5A3D13852P6CKCS90 size=49001032
DEBU[0000] sending request to Grafana Mimir API          method=POST url="http://localhost:8006/api/v1/upload/block/01GKN2TQV5A3D13852P6CKCS90/files?path=index"
DEBU[0000] checking response                             status="200 OK"
INFO[0000] uploading block file                          block=01GKN2TQV5A3D13852P6CKCS90 file=chunks/000001 path=01GKN2TQV5A3D13852P6CKCS90 size=413775399
DEBU[0000] sending request to Grafana Mimir API          method=POST url="http://localhost:8006/api/v1/upload/block/01GKN2TQV5A3D13852P6CKCS90/files?path=chunks%2F000001"
DEBU[0002] checking response                             status="200 OK"
DEBU[0002] sending request to Grafana Mimir API          method=POST url="http://localhost:8006/api/v1/upload/block/01GKN2TQV5A3D13852P6CKCS90/finish"
DEBU[0002] checking response                             status="429 Too Many Requests"
ERRO[0002] response                                      body="too many block upload validations in progress\n" status="429 Too Many Requests"
WARN[0002] will sleep and try again                      block=01GKN2TQV5A3D13852P6CKCS90 error="POST request to http://localhost:8006/api/v1/upload/block/01GKN2TQV5A3D13852P6CKCS90/finish failed: server returned HTTP status: 429 Too Many Requests, body: \"too many block upload validations in progress\\n\"" path=01GKN2TQV5A3D13852P6CKCS90
DEBU[0022] sending request to Grafana Mimir API          method=POST url="http://localhost:8006/api/v1/upload/block/01GKN2TQV5A3D13852P6CKCS90/finish"
DEBU[0022] checking response                             status="200 OK"
DEBU[0022] sending request to Grafana Mimir API          method=GET url="http://localhost:8006/api/v1/upload/block/01GKN2TQV5A3D13852P6CKCS90/check"
DEBU[0022] checking response                             status="200 OK"
DEBU[0022] checked block upload state                    block=01GKN2TQV5A3D13852P6CKCS90 path=01GKN2TQV5A3D13852P6CKCS90 state=validating
DEBU[0042] sending request to Grafana Mimir API          method=GET url="http://localhost:8006/api/v1/upload/block/01GKN2TQV5A3D13852P6CKCS90/check"
DEBU[0042] checking response                             status="200 OK"
DEBU[0042] checked block upload state                    block=01GKN2TQV5A3D13852P6CKCS90 path=01GKN2TQV5A3D13852P6CKCS90 state=complete
INFO[0042] block uploaded successfully                   block=01GKN2TQV5A3D13852P6CKCS90 path=01GKN2TQV5A3D13852P6CKCS90
INFO[0042] finished uploading blocks                     already_exists=0 failed=0 succeeded=1
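
The client behaviour in the second log (a 429 on the finish request, a sleep, then a retry) can be sketched in Go roughly as follows. This is an illustration only and not mimirtool's actual code: `retryOnTooManyRequests` and `demoRetry` are hypothetical names, the finish call is a stub rather than a real HTTP request, and the wait is shortened from the roughly 20 seconds seen in the log.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// retryOnTooManyRequests calls finish until it returns 200 OK, sleeping for
// wait after every 429 response. Any other status or error stops the loop.
// It returns the number of attempts made. Hypothetical helper.
func retryOnTooManyRequests(ctx context.Context, finish func() (int, error), wait time.Duration) (int, error) {
	attempts := 0
	for {
		attempts++
		status, err := finish()
		if err != nil {
			return attempts, err
		}
		if status == http.StatusOK {
			return attempts, nil
		}
		if status != http.StatusTooManyRequests {
			return attempts, fmt.Errorf("unexpected HTTP status %d", status)
		}
		// 429: the server is busy validating other blocks, so sleep and retry.
		select {
		case <-ctx.Done():
			return attempts, ctx.Err()
		case <-time.After(wait):
		}
	}
}

// demoRetry simulates a server that rejects the first two finish requests.
func demoRetry() (int, error) {
	calls := 0
	finish := func() (int, error) {
		calls++
		if calls < 3 {
			return http.StatusTooManyRequests, nil
		}
		return http.StatusOK, nil
	}
	return retryOnTooManyRequests(context.Background(), finish, 10*time.Millisecond)
}

func main() {
	attempts, err := demoRetry()
	fmt.Println(attempts, err) // prints "3 <nil>"
}
```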

@aldernero
Contributor Author

aldernero commented Mar 27, 2023

Screenshot showing the concurrency gauge during the uploads. The second peak shows when one client had to wait on the other's validation to complete.
image

@aldernero aldernero marked this pull request as ready for review March 27, 2023 22:34
@aldernero aldernero requested review from a team as code owners March 27, 2023 22:34
Member

@pstibrany pstibrany left a comment

Thank you! I like that you also added a metric for validations in progress.

Can we also update the documentation for the API endpoint to mention that it can now return 429? This should also be mentioned in the changelog as a [CHANGE] for the endpoint.
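
To illustrate what returning 429 from the endpoint looks like to a caller, here is a small Go sketch using net/http/httptest. It is an assumption-laden illustration, not the compactor's handler: `finishHandler` and `statusOfFinish` are hypothetical names, a buffered channel stands in for the concurrency limit, and the slot is released manually instead of by a finished validation. The response body matches the message seen in the mimirtool log above.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// finishHandler is a hypothetical handler. The capacity of slots plays the
// role of the maximum validation concurrency.
func finishHandler(slots chan struct{}) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		select {
		case slots <- struct{}{}:
			// A slot was free. A real server would start validation here
			// and free the slot once validation completes.
			w.WriteHeader(http.StatusOK)
		default:
			// All slots are busy: reject without blocking.
			http.Error(w, "too many block upload validations in progress", http.StatusTooManyRequests)
		}
	}
}

// statusOfFinish sends one in-memory POST to the handler and returns the
// HTTP status code it responded with.
func statusOfFinish(h http.Handler) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/api/v1/upload/block/01GKN2TQV5A3D13852P6CKCS90/finish", nil)
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	slots := make(chan struct{}, 1) // limit of 1
	h := finishHandler(slots)
	fmt.Println(statusOfFinish(h)) // 200: slot acquired
	fmt.Println(statusOfFinish(h)) // 429: slot still held
	<-slots                        // simulate the first validation finishing
	fmt.Println(statusOfFinish(h)) // 200: slot free again
}
```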

@pstibrany
Member

Very nice work on the PR. Thanks also for including example output from mimirtool and a screenshot from testing!

Contributor

@andyasp andyasp left a comment

Nice work!

@aldernero
Contributor Author

I re-ran some local tests, this time with -compactor.max-block-upload-validation-concurrency=2, and started 6 simultaneous uploads with test blocks:

❯ for block in $(ls -d 01G*);do echo $block;done
01GKGCXCHJPKV9J53E39RX8AS6
01GKJ1EG1T87PB8JFFVJ7RS07Y
01GKN1JNHT7PJCSVF8B1P772JS
01GKN2S8N18VZXQ7CCJEDHWDWD
01GKN2SS8BP9G2P7K2NWBKVC5W
01GKN2TQV5A3D13852P6CKCS90
❯ for block in $(ls -d 01G*);do (mimirtool backfill $block > upload-${block}.log 2>&1) &;done
[2] 117596
[3] 117597
[4] 117598
[5] 117599
[6] 117606
[7] 117610

I looked at cortex_block_upload_validations_in_progress to verify it was doing the right thing:
image

I verified from the MinIO container that all the blocks were there:

sh-4.4# ls /data/mimir-tsdb/anonymous/
01GKGCXCHJPKV9J53E39RX8AS6  01GKN2S8N18VZXQ7CCJEDHWDWD  01GWPYY6QMXPJFWX52BGX38EZJ  01GWPYYS30K33FHG1ASCB8H8EE
01GKJ1EG1T87PB8JFFVJ7RS07Y  01GKN2SS8BP9G2P7K2NWBKVC5W  01GWPYYC1PDY3H6GPWP6HDRBKB  bucket-index.json.gz
01GKN1JNHT7PJCSVF8B1P772JS  01GKN2TQV5A3D13852P6CKCS90  01GWPYYHHCWK3PJJVAE5EVMVWK  markers

Lastly, I verified from the mimirtool output that all the clients finished successfully and retried when necessary:

❯ grep -e "succeeded" -e "too many" upload-*.log
upload-01GKGCXCHJPKV9J53E39RX8AS6.log:time="2023-03-29T09:13:46-06:00" level=warning msg="will sleep and try again" block=01GKGCXCHJPKV9J53E39RX8AS6 error="POST request to http://localhost:8006/api/v1/upload/block/01GKGCXCHJPKV9J53E39RX8AS6/finish failed: too many requests" path=01GKGCXCHJPKV9J53E39RX8AS6
upload-01GKGCXCHJPKV9J53E39RX8AS6.log:time="2023-03-29T09:14:12-06:00" level=info msg="finished uploading blocks" already_exists=0 failed=0 succeeded=1
upload-01GKJ1EG1T87PB8JFFVJ7RS07Y.log:time="2023-03-29T09:14:06-06:00" level=info msg="finished uploading blocks" already_exists=0 failed=0 succeeded=1
upload-01GKN1JNHT7PJCSVF8B1P772JS.log:time="2023-03-29T09:14:06-06:00" level=info msg="finished uploading blocks" already_exists=0 failed=0 succeeded=1
upload-01GKN2S8N18VZXQ7CCJEDHWDWD.log:time="2023-03-29T09:13:46-06:00" level=warning msg="will sleep and try again" block=01GKN2S8N18VZXQ7CCJEDHWDWD error="POST request to http://localhost:8006/api/v1/upload/block/01GKN2S8N18VZXQ7CCJEDHWDWD/finish failed: too many requests" path=01GKN2S8N18VZXQ7CCJEDHWDWD
upload-01GKN2S8N18VZXQ7CCJEDHWDWD.log:time="2023-03-29T09:14:11-06:00" level=info msg="finished uploading blocks" already_exists=0 failed=0 succeeded=1
upload-01GKN2SS8BP9G2P7K2NWBKVC5W.log:time="2023-03-29T09:13:46-06:00" level=warning msg="will sleep and try again" block=01GKN2SS8BP9G2P7K2NWBKVC5W error="POST request to http://localhost:8006/api/v1/upload/block/01GKN2SS8BP9G2P7K2NWBKVC5W/finish failed: too many requests" path=01GKN2SS8BP9G2P7K2NWBKVC5W
upload-01GKN2SS8BP9G2P7K2NWBKVC5W.log:time="2023-03-29T09:14:15-06:00" level=info msg="finished uploading blocks" already_exists=0 failed=0 succeeded=1
upload-01GKN2TQV5A3D13852P6CKCS90.log:time="2023-03-29T09:13:47-06:00" level=warning msg="will sleep and try again" block=01GKN2TQV5A3D13852P6CKCS90 error="POST request to http://localhost:8006/api/v1/upload/block/01GKN2TQV5A3D13852P6CKCS90/finish failed: too many requests" path=01GKN2TQV5A3D13852P6CKCS90
upload-01GKN2TQV5A3D13852P6CKCS90.log:time="2023-03-29T09:14:17-06:00" level=info msg="finished uploading blocks" already_exists=0 failed=0 succeeded=1

Contributor

@bboreham bboreham left a comment

Code seems fine; I had a couple of nits, especially one line in CHANGELOG that looks unrelated.

@56quarters 56quarters enabled auto-merge (squash) April 4, 2023 15:27
@56quarters 56quarters merged commit bf1be65 into main Apr 4, 2023
@56quarters 56quarters deleted the aldernero/backfill-limit-concurrency branch April 4, 2023 16:10