
mimirtool backfill: failed uploading block #8164

Closed · Sh4d1 opened this issue May 21, 2024 · 25 comments

Sh4d1 commented May 21, 2024

Describe the bug

I'm trying to backfill Prometheus blocks into Mimir. Maybe it's related to my blocks?
Mimir/mimirtool version: 2.12.0

Here's the first time I upload a block (quite big, ~50 GB); 10.52.1.224 is the IP of the compactor:

/var/lib/prometheus $ /tmp/mimirtool backfill --id=anonymous --address=http://10.52.1.224:8080 01HGXF4Z4HTZ22TSNTH9AZXKNN
INFO[0000] Backfilling                                   blocks=01HGXF4Z4HTZ22TSNTH9AZXKNN user=anonymous
INFO[0000] making request to start block upload          block=01HGXF4Z4HTZ22TSNTH9AZXKNN file=meta.json path=01HGXF4Z4HTZ22TSNTH9AZXKNN
INFO[0000] uploading block file                          block=01HGXF4Z4HTZ22TSNTH9AZXKNN file=index path=01HGXF4Z4HTZ22TSNTH9AZXKNN size=2899347416
INFO[0031] uploading block file                          block=01HGXF4Z4HTZ22TSNTH9AZXKNN file=chunks/000001 path=01HGXF4Z4HTZ22TSNTH9AZXKNN size=536869938
[...]
INFO[0434] uploading block file                          block=01HGXF4Z4HTZ22TSNTH9AZXKNN file=chunks/000066 path=01HGXF4Z4HTZ22TSNTH9AZXKNN size=420500175
ERRO[0559] failed uploading block                        error="block validation failed: failed to download block: context canceled" path=01HGXF4Z4HTZ22TSNTH9AZXKNN
INFO[0559] finished uploading blocks                     already_exists=0 failed=1 succeeded=0
mimirtool: error: blocks failed to upload 1 block(s), try --help

And when I try again:

/var/lib/prometheus $ /tmp/mimirtool backfill --id=anonymous --address=http://10.52.1.224:8080 01HGXF4Z4HTZ22TSNTH9AZXKNN 
INFO[0000] Backfilling                                   blocks=01HGXF4Z4HTZ22TSNTH9AZXKNN user=anonymous
INFO[0000] making request to start block upload          block=01HGXF4Z4HTZ22TSNTH9AZXKNN file=meta.json path=01HGXF4Z4HTZ22TSNTH9AZXKNN
ERRO[0000] response                                      body="block validation failed\n" status="400 Bad Request"
ERRO[0000] failed uploading block                        error="request to start block upload failed: POST request to http://10.52.1.224:8080/api/v1/upload/block/01HGXF4Z4HTZ22TSNTH9AZXKNN/start failed: server returned HTTP status: 400 Bad Request, body: \"block validation failed\\n\"" path=01HGXF4Z4HTZ22TSNTH9AZXKNN
INFO[0000] finished uploading blocks                     already_exists=0 failed=1 succeeded=0
mimirtool: error: blocks failed to upload 1 block(s), try --help

Any ideas?

Environment

  • Infrastructure: Mimir distributed on GKE
  • Deployment tool: Helm
pstibrany (Member) commented:

Hello, it seems that the block validation step (done by Mimir after the block is fully uploaded, but before it is "accepted") has failed, and now the block cannot be uploaded or validated anymore. Such a block is treated as an incomplete upload and will eventually be deleted by Mimir. After that, you can try uploading it again.

Alternatively, you can try to delete the validation.json file from inside the uploaded block in Mimir's storage. After that, you can run mimirtool to upload the block again and run a new validation. (You can also disable the validation step by setting -compactor.block-upload-validation-enabled=false on the compactor; then the upload will finish successfully, without doing block validation. However, this allows you to upload any "garbage" as a TSDB block, and may cause errors in Mimir later.)
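
On GCS, for example, deleting that file could look like this (a sketch; the bucket name is a placeholder, and the object path assumes Mimir's usual <tenant>/<block ID>/ layout in the blocks bucket):

# Placeholder bucket name; the path assumes the <tenant>/<block ID>/ layout.
gsutil rm gs://my-mimir-blocks/anonymous/01HGXF4Z4HTZ22TSNTH9AZXKNN/validation.json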

Sh4d1 commented May 21, 2024

Ah, good to know, thanks! Though it appears I get this issue on every block I try to import 🤔

Sh4d1 commented May 21, 2024

Is there any way to see why the validation failed? It's a block taken from a classic Prometheus instance, so I figure it should work without issues? (Though maybe it's too far in the past?)

pstibrany (Member) commented:

You should be able to see the failure reasons in the compactor logs. Based on the mimirtool output, these blocks look quite big (40 GB?), so validation can take a while, and the compactor must keep running during the process without interruption.

Sh4d1 commented May 21, 2024

Yeah, I only get this:

ts=2024-05-21T12:36:57.248928356Z caller=block_upload.go:360 level=error component=compactor user=anonymous feature="block upload" block=01HGC2YXSZ86TQK4Z9G8A2DKZN operation="complete block upload" request_id=17d18125344c3e74 msg="error while validating block" err="failed to download block: context canceled"

I guess there is some timeout to tweak somewhere?

Sh4d1 commented May 21, 2024

ts=2024-05-21T12:39:07.165217295Z caller=block_upload.go:105 level=info component=compactor user=anonymous feature="block upload" block=01HGXF4Z4HTZ22TSNTH9AZXKNN operation="start block upload" request_id=17d18157c78e4bb4 msg="started block upload"
ts=2024-05-21T12:46:25.950370338Z caller=block_upload.go:172 level=info component=compactor user=anonymous feature="block upload" block=01HGXF4Z4HTZ22TSNTH9AZXKNN operation="complete block upload" request_id=17d181bdef52b395 msg="validation process started"
ts=2024-05-21T12:47:44.330682597Z caller=block_upload.go:360 level=error component=compactor user=anonymous feature="block upload" block=01HGXF4Z4HTZ22TSNTH9AZXKNN operation="complete block upload" request_id=17d181bdef52b395 msg="error while validating block" err="failed to download block: context canceled"

To be precise, it seems to fail after about 80 seconds?

pstibrany (Member) commented:

I don't know about any timeout on the compactor side for validation / downloading blocks, but individual storage clients can have operation timeouts (e.g. Swift storage has a request timeout config option).

Sh4d1 commented May 21, 2024

@pstibrany so you think the error comes from the storage provider? I'm using GCS, and there don't seem to be any configurable timeouts :/

pstibrany (Member) commented:

My thinking was that the client side could have some timeout, but in fact I wouldn't expect that to be logged as context canceled. We use the GCS client internally, and don't see timeouts when reading data from it.

I'm wondering... is that the full compactor log, or are there possibly more messages that could be related to validation (the block upload feature)? Any other error logged by the compactor that would explain what's happening?

Sh4d1 commented May 21, 2024

Those 3 lines were next to each other in the compactor logs. So I guess there are no other messages related to this upload?

pstibrany (Member) commented:

Those 3 lines were next to each other in the compactor logs. So I guess there are no other messages related to this upload?

Could you try grepping on the block ID? Maybe something still shows up?
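
For example, a minimal sketch (the "mimir" namespace and the "mimir-compactor" StatefulSet name are assumptions based on a typical mimir-distributed Helm install; adjust them to your release):

# Assumed namespace and workload name; grep on the failing block ID.
kubectl logs -n mimir statefulset/mimir-compactor --all-containers | grep 01HGXF4Z4HTZ22TSNTH9AZXKNN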

Sh4d1 commented May 22, 2024

I just tested with another block; live-tailing the logs only gets me this:

ts=2024-05-22T09:04:52.177159745Z caller=block_upload.go:105 level=info component=compactor user=anonymous feature="block upload" block=01HHMMGVFCC3J5EY8VTDK1JAH3 operation="start block upload" request_id=17d1c43b4e8d4ad3 msg="started block upload"
ts=2024-05-22T09:06:56.208641698Z caller=block_upload.go:172 level=info component=compactor user=anonymous feature="block upload" block=01HHMMGVFCC3J5EY8VTDK1JAH3 operation="complete block upload" request_id=17d1c4582f908ec3 msg="validation process started"
ts=2024-05-22T09:08:14.223566493Z caller=block_upload.go:360 level=error component=compactor user=anonymous feature="block upload" block=01HHMMGVFCC3J5EY8VTDK1JAH3 operation="complete block upload" request_id=17d1c4582f908ec3 msg="error while validating block" err="failed to download block: context canceled"

pstibrany (Member) commented:

To unblock you, you can disable validation (-compactor.block-upload-validation-enabled=false); then any blocks will be accepted without further checks. If you are uploading blocks from Prometheus, that should be fine.
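
With the mimir-distributed Helm chart, that could look roughly like this (a sketch; it assumes the chart's per-component extraArgs map, where keys are flag names without the leading dash, so double-check against your chart version):

# values.yaml (sketch): pass the flag to the compactor pods.
compactor:
  extraArgs:
    compactor.block-upload-validation-enabled: "false"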

Sh4d1 commented May 22, 2024

Yes, it looks like it's uploaded correctly. It's not shown in Mimir yet though; I guess I need to wait some time? Can I refresh/reset some cache somewhere?

pstibrany (Member) commented:

Yes, it looks like it's uploaded correctly. It's not shown in Mimir yet though; I guess I need to wait some time? Can I refresh/reset some cache somewhere?

It can take about 30 minutes for the block to become queryable: the compactor needs to include the block in the bucket index, and then queriers and store-gateways need to reload the bucket index and fetch parts of the block (the store-gateways do that). If you try to query it before that, empty results may be cached in the results cache. You can flush the results cache by restarting memcached.
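
If the results cache comes from the mimir-distributed chart, the restart could look like this (a sketch; the namespace and the "mimir-results-cache" StatefulSet name are assumptions based on that chart's defaults):

# Assumed names; adjust to your release and namespace.
kubectl rollout restart -n mimir statefulset/mimir-results-cache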

Sh4d1 commented May 22, 2024

Okay, I might be seeing some empty cached results, though I deployed Mimir without memcached enabled. Is it still used somewhere?

Sh4d1 commented May 22, 2024

Well, I restarted my 3 store-gateways, but still nothing. I'll wait a bit more, but it's been over 1 hour already 🤔

pstibrany (Member) commented:

Well, I restarted my 3 store-gateways, but still nothing. I'll wait a bit more, but it's been over 1 hour already 🤔

What time range does the uploaded block cover?

Can you check whether the store-gateways loaded your new block?

Sh4d1 commented May 22, 2024

The uploaded block covers roughly January 2024 to a few weeks ago.

I can see this in the store-gateway:

ts=2024-05-22T14:07:01.823562954Z caller=bucket.go:407 level=info user=anonymous msg="loaded new block" elapsed=403.665477ms id=01HMQ525MXFV8SR05WT2BMXBT2
ts=2024-05-22T14:22:39.628135866Z caller=spanlogger.go:109 user=anonymous method=indexheader.NewStreamBinaryReader level=info msg="loading sparse index-header from full index-header" id=01HMQ525MXFV8SR05WT2BMXBT2
ts=2024-05-22T14:22:39.669792351Z caller=spanlogger.go:109 user=anonymous method=indexheader.NewStreamBinaryReader level=info msg="loaded sparse index-header from full index-header" id=01HMQ525MXFV8SR05WT2BMXBT2 elapsed=41.658729ms
ts=2024-05-22T14:22:39.669820198Z caller=spanlogger.go:109 user=anonymous method=indexheader.NewStreamBinaryReader level=info msg="writing sparse index-header to disk" id=01HMQ525MXFV8SR05WT2BMXBT2 path=/data/tsdb-sync/anonymous/01HMQ525MXFV8SR05WT2BMXBT2/sparse-index-header
ts=2024-05-22T14:22:39.68422412Z caller=spanlogger.go:109 user=anonymous method=indexheader.NewStreamBinaryReader level=info msg="wrote sparse index-header to disk" id=01HMQ525MXFV8SR05WT2BMXBT2 path=/data/tsdb-sync/anonymous/01HMQ525MXFV8SR05WT2BMXBT2/sparse-index-header elapsed=14.394569ms

That's what I get if I grep for a block I uploaded. Here's another one, for instance:

ts=2024-05-22T14:06:13.466412037Z caller=spanlogger.go:109 user=anonymous method=indexheader.NewStreamBinaryReader level=info msg="loading from sparse index-header from disk" id=01HK2ZTGZXM7543W0ZJE5DXDQ2 path=/data/tsdb-sync/anonymous/01HK2ZTGZXM7543W0ZJE5DXDQ2/sparse-index-header
ts=2024-05-22T14:06:13.468317073Z caller=spanlogger.go:109 user=anonymous method=indexheader.NewStreamBinaryReader level=info msg="loaded sparse index-header from disk" id=01HK2ZTGZXM7543W0ZJE5DXDQ2 path=/data/tsdb-sync/anonymous/01HK2ZTGZXM7543W0ZJE5DXDQ2/sparse-index-header elapsed=1.907567ms
ts=2024-05-22T14:06:13.49157358Z caller=bucket.go:407 level=info user=anonymous msg="loaded new block" elapsed=28.30006ms id=01HK2ZTGZXM7543W0ZJE5DXDQ2

pstibrany (Member) commented:

👍 That shows that the blocks are now in the bucket index, and the store-gateways know about them. Queriers should too, since they load the same bucket index.

Since you don't have any cache ("I have deployed mimir without memcached enabled"), I don't see any other reason why your queries would not work. You can try something like count({__name__!=""}) to see how many series there are between January and now in your Mimir.
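
Against Mimir's Prometheus-compatible HTTP API, an instant query could look like this (a sketch; <query-frontend>:8080 is a placeholder address, /prometheus is Mimir's default HTTP prefix, and the X-Scope-OrgID header matches the anonymous tenant used above):

# Placeholder address; the time parameter picks the instant to evaluate at.
curl -G 'http://<query-frontend>:8080/prometheus/api/v1/query' \
  -H 'X-Scope-OrgID: anonymous' \
  --data-urlencode 'query=count({__name__!=""})' \
  --data-urlencode 'time=2024-03-01T00:00:00Z'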

Sh4d1 commented May 22, 2024

Well, I'm getting timeouts with this query 😅 even on a 5-minute interval, if it's from a few weeks back. In the querier logs:

ts=2024-05-22T15:22:08.666957598Z caller=scheduler_processor.go:240 level=error user=anonymous msg="error notifying scheduler about finished query after the scheduler stream previously failed and returned error" err=EOF addr=10.52.1.244:9095 previousErr="rpc error: code = Canceled desc = context canceled: frontend cancelled query"

Do you have any ideas?

Sh4d1 commented May 22, 2024

Hmm, it seems I get a lot of consistency check errors on these blocks:

[screenshot of consistency check errors]

Sh4d1 commented May 23, 2024

Seems it's working now, after the night! I'll take it! Thanks for the help!

Sh4d1 closed this as completed May 23, 2024
pstibrany (Member) commented:

Hmm, it seems I get a lot of consistency check errors on these blocks [screenshot]

A consistency check error means that the querier tried to query some block, but no store-gateway that was supposed to have it replied with data from that block. Perhaps the store-gateways were still loading the new blocks?

pstibrany (Member) commented:

Well, I'm getting timeouts with this query 😅 even on a 5-minute interval, if it's from a few weeks back. In the querier logs:

ts=2024-05-22T15:22:08.666957598Z caller=scheduler_processor.go:240 level=error user=anonymous msg="error notifying scheduler about finished query after the scheduler stream previously failed and returned error" err=EOF addr=10.52.1.244:9095 previousErr="rpc error: code = Canceled desc = context canceled: frontend cancelled query"

Do you have any ideas?

Unfortunately I'm not familiar with this one.
