
tempo compactor reading storage container error #2215

Closed
suxiaoxiaomm opened this issue Mar 16, 2023 · 16 comments
Labels
stale Used for stale issues / PRs


@suxiaoxiaomm

suxiaoxiaomm commented Mar 16, 2023

Hi Experts,

We are using Azure blob storage as our Tempo backend.

Currently the Tempo compactor keeps reporting this error:

level=error ts=2023-03-16T04:20:51.990648343Z caller=tempodb.go:441 msg="failed to poll blocklist. using previously polled lists" err="reading storage container: Head \"https://xxxxxx.blob.core.windows.net/xxxxxxxxxxxxxxxxxxxx/meta.compacted.json?timeout=61\": read tcp xxxxx->xxxxxx: read: connection reset by peer"
level=error ts=2023-03-16T04:25:53.505014898Z caller=tempodb.go:441 msg="failed to poll blocklist. using previously polled lists" err="reading storage container: Head \"https://xxxxxx.blob.core.windows.net/xxxxxxxxxxxxxxxxxxxx//meta.json?timeout=61\": read tcp xxxxxx->xxxxxx: read: connection reset by peer"
...

This error message shows up every 5 minutes, so it does not look like a random issue, since 5 minutes is exactly the default poll interval.
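
Roughly, the behavior that log line suggests is a poll loop like the sketch below (not Tempo's actual code, just my understanding written out in Go; the poll callback and interval are assumptions for illustration): the poller runs on a fixed interval, and when a poll fails it keeps serving the previously polled blocklist, which is why the error repeats on the 5-minute cadence.

package main

import (
	"errors"
	"log"
	"time"
)

// pollLoop is a rough sketch (not Tempo's actual code) of what the log line
// implies: poll the backend every interval and, on failure, keep using the
// previously polled blocklist. The poll callback is hypothetical.
func pollLoop(interval time.Duration, poll func() ([]string, error)) {
	var previous []string
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for range ticker.C {
		list, err := poll()
		if err != nil {
			// the 5-minute cadence of the error comes from this interval
			log.Printf("failed to poll blocklist. using previously polled lists: %v", err)
			continue // `previous` stays in use
		}
		previous = list
		log.Printf("polled %d blocks", len(previous))
	}
}

func main() {
	pollLoop(5*time.Minute, func() ([]string, error) {
		return nil, errors.New("reading storage container: connection reset by peer")
	})
}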

I am sure the connection string to Azure blob storage is correct, as the Tempo ingester and querier use exactly the same connection string and they seem to work fine.

Meanwhile, the issue above also seems to prevent us from querying traces from the Grafana frontend.

Your help and suggestions are highly appreciated!

Thanks a lot!

@suxiaoxiaomm
Author

I just noticed there are also error logs from the Tempo querier, as shown below:

level=error ts=2023-02-18T15:56:24.154102011Z caller=poller.go:179 msg="failed to pull bucket index for tenant. falling back to polling" tenant=single-tenant err="reading storage container: Head \"https://xxxx/index.json.gz?timeout=61\": net/http: TLS handshake timeout"

Maybe this is why we cannot query traces from the Grafana frontend?

@joe-elliott
Member

We have seen strange Azure blob storage behavior as well. Can you try the settings recommended here:

https://grafana.com/docs/tempo/latest/configuration/azure/#azure-blocklist-polling

There is also a very long discussion here about how to handle these issues:

#1462

@electron0zero and @zalegrala, I actually don't know the current recommended best practices for Azure DNS. Do we recommend an ndots change or something else? If either of you knows, can you PR an update to the linked Azure docs?

@suxiaoxiaomm
Author

@joe-elliott Thank you for the quick response.
Yes, I did try the settings below, but it did not help.

  storage:
    trace:
      blocklist_poll_tenant_index_builders: 1
      blocklist_poll_jitter_ms: 500

@electron0zero
Member

electron0zero commented Mar 16, 2023

@joe-elliott currently we are setting ndots to 3 and NOT setting blocklist_poll_jitter_ms in our Azure clusters. @suxiaoxiaomm, see #1462 (comment) for details on how to configure it at the pod level.

@suxiaoxiaomm can you try this in your compactor pods and report back?

here is what our compactor's storage section looks like:

storage:
        trace:
            azure:
                container_name: <bucket-name>
            backend: azure
            block:
                version: vParquet
            blocklist_poll: 5m
            blocklist_poll_tenant_index_builders: 1
            cache: memcached
            memcached:
                consistent_hash: true
                host: memcached
                service: memcached-client
                timeout: 200ms
            pool:
                queue_depth: 2000
            wal:
                path: /var/tempo/wal

and here is what our compactor dnsConfig looks like:

      dnsConfig:
        options:
        - name: ndots
          value: "3"

With these settings I see 23 instances of this error in the last 24 hours, across 3 Azure clusters.

@suxiaoxiaomm
Author

suxiaoxiaomm commented Mar 17, 2023

@joe-elliott @electron0zero Thanks for the suggestion.
I added the above config and restarted the pod, but it is still not helping.

One thing to mention is that our AKS cluster and the Azure blob storage are in different locations: one is in West Europe and the other is in East US.

So I am wondering whether the 61s timeout is too short. Is that a configurable parameter?
err="reading storage container: Head \"https://xxxxxx.blob.core.windows.net/xxxxxxxxxxxxxxxxxxxx//meta.json?**timeout=61**\": read tcp xxxxxx->xxxxxx: read: connection reset by peer"

@suxiaoxiaomm
Author

@joe-elliott @electron0zero Hi, I just tried creating another container in the same Azure blob storage account, and that one works fine.

I guess the old container for Tempo data somehow has some issues.

Is there a safe way for me to move the old data to this new container? And shall I also move the index.json.gz file?

Appreciate your suggestion.

@joe-elliott
Member

Tempo does not have any specific knowledge of the container. If you copy all of the blocks from the old container to the new container and then start Tempo up it will work fine.

The index.json.gz file will be recreated from the existing blocks after Tempo starts up. It may be safer to not move that one file.

@suxiaoxiaomm
Author

Hi @joe-elliott, I tried copying the old blocks to a new container, but the compactor hits the same issue when it tries to access this new container.
level=error ts=2023-03-20T10:25:45.550804828Z caller=tempodb.go:441 msg="failed to poll blocklist. using previously polled lists" err="reading storage container: Head \"https://XXXXX/meta.json?timeout=61\": read tcp XXX->XXX: read: connection reset by peer"

So I tried copying only the most recent 3 days' data, and that looks fine.

I am not sure why this happens.

Does the number of blocks affect the compactor?

@joe-elliott
Member

joe-elliott commented Mar 21, 2023

Do you have a partial block somehow? If so, just cleaning up that block might work. Check the block whose meta.json keeps failing to load.

Does the number of blocks affect the compactor?

Yes, definitely, but we have seen Tempo survive with blocklists of 80k+ blocks.

@suxiaoxiaomm
Author

Hi @joe-elliott, it doesn't always fail at the same block, or even at the same day's blocks.
The blocks at which the compactor fails are quite random.

@zalegrala
Contributor

"connection reset by peer" sounds suspicious to me. In this case, I'd expect that you already have an IP for the address you are trying to contact (post-DNS resolution) and that an open connection has been established.

@suxiaoxiaomm
Author

@zalegrala It might be related to the fact that I enabled istio-proxy as a sidecar for this compactor pod.
When I removed the sidecar, the error changed to "connect: connection refused".

@zalegrala
Contributor

Connection refused sounds like either the target address or port is incorrect. When is it that you see the "connection refused" message?

@suxiaoxiaomm
Author

@zalegrala It shouldn't be. If I switch to a new container, it works fine. But if I migrate the old data to this new container, I get the error again.

It looks like, in the code, the Tempo compactor fires one goroutine for each block and sends requests to Azure blob storage. I am wondering whether, when there are too many blocks (e.g. > 10000), this causes Azure to reject the connections.

When checking the code of the Thanos compactor, I saw that it limits the concurrent goroutines to 32; a rough sketch of that pattern is below.
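
For illustration only, a bounded-concurrency pattern like that could look roughly as follows (just a sketch in Go, not Tempo's or Thanos' actual code; fetchMeta is a hypothetical stand-in for the per-block request to object storage):

package main

import (
	"fmt"
	"sync"
)

// fetchAllMetas is only a sketch of bounding concurrent per-block requests
// with a buffered-channel semaphore; it is not Tempo's or Thanos' actual code.
func fetchAllMetas(blockIDs []string, limit int) {
	sem := make(chan struct{}, limit) // at most `limit` requests in flight
	var wg sync.WaitGroup

	for _, id := range blockIDs {
		wg.Add(1)
		sem <- struct{}{} // block here until a slot is free
		go func(id string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot when done
			fetchMeta(id)            // hypothetical helper: one request per block
		}(id)
	}
	wg.Wait()
}

// fetchMeta stands in for the HEAD/GET against meta.json in object storage.
func fetchMeta(id string) {
	fmt.Println("fetching meta.json for block", id)
}

func main() {
	fetchAllMetas([]string{"block-1", "block-2", "block-3"}, 32)
}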

@electron0zero
Member

@suxiaoxiaomm can you please check the Azure Blob Storage rate limits and quotas and see whether you are hitting any of them? Ideally we should get a 429 for rate limiting, but please check just in case.

@github-actions
Contributor

github-actions bot commented Jun 4, 2023

This issue has been automatically marked as stale because it has not had any activity in the past 60 days.
The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity.
Please apply keepalive label to exempt this Issue.

@github-actions github-actions bot added the stale label on Jun 4, 2023
@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Jun 20, 2023