
tempo compactor reading storage container error #2215

Closed
suxiaoxiaomm opened this issue Mar 16, 2023 · 16 comments
Labels
stale Used for stale issues / PRs


@suxiaoxiaomm

suxiaoxiaomm commented Mar 16, 2023

Hi Experts,

We are using Azure blob storage as our Tempo backend.

Currently the Tempo compactor keeps reporting this error:

level=error ts=2023-03-16T04:20:51.990648343Z caller=tempodb.go:441 msg="failed to poll blocklist. using previously polled lists" err="reading storage container: Head \"https://xxxxxx.blob.core.windows.net/xxxxxxxxxxxxxxxxxxxx/meta.compacted.json?timeout=61\": read tcp xxxxx->xxxxxx: read: connection reset by peer"
level=error ts=2023-03-16T04:25:53.505014898Z caller=tempodb.go:441 msg="failed to poll blocklist. using previously polled lists" err="reading storage container: Head \"https://xxxxxx.blob.core.windows.net/xxxxxxxxxxxxxxxxxxxx//meta.json?timeout=61\": read tcp xxxxxx->xxxxxx: read: connection reset by peer"
...

This error message shows up every 5 minutes, so it does not look like a random issue, since 5 minutes is exactly the default poll interval.
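
Roughly, the behavior that log line suggests is a poll loop like the sketch below (not Tempo's actual code, just my understanding written out in Go; the poll callback and interval are assumptions for illustration): the poller runs on a fixed interval, and when a poll fails it keeps serving the previously polled blocklist, which is why the error repeats on the 5-minute cadence.

package main

import (
	"errors"
	"log"
	"time"
)

// pollLoop is a rough sketch (not Tempo's actual code) of what the log line
// implies: poll the backend every interval and, on failure, keep using the
// previously polled blocklist. The poll callback is hypothetical.
func pollLoop(interval time.Duration, poll func() ([]string, error)) {
	var previous []string
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for range ticker.C {
		list, err := poll()
		if err != nil {
			// the 5-minute cadence of the error comes from this interval
			log.Printf("failed to poll blocklist. using previously polled lists: %v", err)
			continue // `previous` stays in use
		}
		previous = list
		log.Printf("polled %d blocks", len(previous))
	}
}

func main() {
	pollLoop(5*time.Minute, func() ([]string, error) {
		return nil, errors.New("reading storage container: connection reset by peer")
	})
}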

I am sure the connection string to Azure blob storage is correct, as the Tempo ingester and querier use exactly the same connection string and they seem to work fine.

Meanwhile, the issue above also seems to prevent us from querying traces from the Grafana frontend.

Your help and suggestions are highly appreciated!

Thanks a lot!

@suxiaoxiaomm
Author

I just noticed there are also error logs from the Tempo querier, as shown below:

level=error ts=2023-02-18T15:56:24.154102011Z caller=poller.go:179 msg="failed to pull bucket index for tenant. falling back to polling" tenant=single-tenant err="reading storage container: Head \"https://xxxx/index.json.gz?timeout=61\": net/http: TLS handshake timeout"

Maybe this is why we cannot query traces from the Grafana frontend?

@joe-elliott
Member

We have seen strange Azure blob storage behavior as well. Can you try the settings recommended here:

https://grafana.com/docs/tempo/latest/configuration/azure/#azure-blocklist-polling

There is also a very long discussion here about how to handle these issues:

#1462

@electron0zero and @zalegrala, I actually don't know the current recommended best practices for Azure DNS. Do we recommend an ndots change or something else? If either of you knows, can you PR an update to the linked Azure docs?

@suxiaoxiaomm
Author

@joe-elliott Thank you for the quick response.
Yes, I did try the settings below, but it did not help.

  storage:
    trace:
      blocklist_poll_tenant_index_builders: 1
      blocklist_poll_jitter_ms: 500

@electron0zero
Member

electron0zero commented Mar 16, 2023

@joe-elliott currently we are setting ndots to 3 and NOT setting blocklist_poll_jitter_ms in our Azure clusters. @suxiaoxiaomm, see #1462 (comment) for details on how to configure it at the pod level.

@suxiaoxiaomm can you try this in your compactor pods and report back?

here is what our compactor's storage section looks like:

storage:
        trace:
            azure:
                container_name: <bucket-name>
            backend: azure
            block:
                version: vParquet
            blocklist_poll: 5m
            blocklist_poll_tenant_index_builders: 1
            cache: memcached
            memcached:
                consistent_hash: true
                host: memcached
                service: memcached-client
                timeout: 200ms
            pool:
                queue_depth: 2000
            wal:
                path: /var/tempo/wal

and here is what our compactor dnsConfig looks like:

      dnsConfig:
        options:
        - name: ndots
          value: "3"

With these settings I see 23 instances of this error in the last 24 hours, across 3 Azure clusters.

@suxiaoxiaomm
Author

suxiaoxiaomm commented Mar 17, 2023

@joe-elliott @electron0zero Thanks for the suggestion.
I added the above config and restarted the pod, but it is still not helping.

One thing to mention is that our AKS cluster and the Azure blob storage are in different locations: one is in West Europe and the other is in East US.

So I am wondering whether the 61s timeout is too short. Is that a configurable parameter?
err="reading storage container: Head \"https://xxxxxx.blob.core.windows.net/xxxxxxxxxxxxxxxxxxxx//meta.json?**timeout=61**\": read tcp xxxxxx->xxxxxx: read: connection reset by peer"

@suxiaoxiaomm
Author

@joe-elliott @electron0zero Hi, I just tried creating another container in the same Azure blob storage account, and that one works fine.

I guess the old container for Tempo data somehow has some issues.

Is there a safe way for me to move the old data to this new container? And shall I also move the index.json.gz file?

Appreciate your suggestion.

@joe-elliott
Member

Tempo does not have any specific knowledge of the container. If you copy all of the blocks from the old container to the new container and then start Tempo up it will work fine.

The index.json.gz file will be recreated from the existing blocks after Tempo starts up. It may be safer to not move that one file.

@suxiaoxiaomm
Author

Hi @joe-elliott, I tried copying the old blocks to a new container, but the compactor hits the same issue when it tries to access this new container.
level=error ts=2023-03-20T10:25:45.550804828Z caller=tempodb.go:441 msg="failed to poll blocklist. using previously polled lists" err="reading storage container: Head \"https://XXXXX/meta.json?timeout=61\": read tcp XXX->XXX: read: connection reset by peer"

So I tried copying only the most recent 3 days' data, and that looks fine.

I am not sure why this happens.

Does the number of blocks affect the compactor?

@joe-elliott
Member

joe-elliott commented Mar 21, 2023

Do you have a partial block somehow? If so, just cleaning up that block might work. Check the block whose meta.json keeps failing to load.

Does the number of blocks affect the compactor?

Yes, definitely, but we have seen Tempo survive with blocklists of 80k+ blocks.

@suxiaoxiaomm
Author

Hi @joe-elliott, it doesn't always fail at the same block, or even at the same day's blocks.
The blocks at which the compactor fails are quite random.

@zalegrala
Contributor

"connection reset by peer" sounds suspicious to me. In this case, I'd expect that you already have an IP for the address you are trying to contact (post-DNS resolution) and that an open connection has been established.

@suxiaoxiaomm
Author

@zalegrala It might be related to the fact that I enabled istio-proxy as a sidecar for this compactor pod.
When I removed the sidecar, the error changed to "connect: connection refused".

@zalegrala
Contributor

Connection refused sounds like either the target address or port is incorrect. When is it that you see the "connection refused" message?

@suxiaoxiaomm
Author

@zalegrala It shouldn't be. If I switch to a new container, it works fine. But if I migrate the old data to this new container, I get the error again.

It looks like, in the code, the Tempo compactor fires one goroutine for each block and sends requests to Azure blob storage. I am wondering whether, when there are too many blocks (e.g. > 10000), this causes Azure to reject the connections.

When checking the code of the Thanos compactor, I saw that it limits the concurrent goroutines to 32; a rough sketch of that pattern is below.
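
For illustration only, a bounded-concurrency pattern like that could look roughly as follows (just a sketch in Go, not Tempo's or Thanos' actual code; fetchMeta is a hypothetical stand-in for the per-block request to object storage):

package main

import (
	"fmt"
	"sync"
)

// fetchAllMetas is only a sketch of bounding concurrent per-block requests
// with a buffered-channel semaphore; it is not Tempo's or Thanos' actual code.
func fetchAllMetas(blockIDs []string, limit int) {
	sem := make(chan struct{}, limit) // at most `limit` requests in flight
	var wg sync.WaitGroup

	for _, id := range blockIDs {
		wg.Add(1)
		sem <- struct{}{} // block here until a slot is free
		go func(id string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot when done
			fetchMeta(id)            // hypothetical helper: one request per block
		}(id)
	}
	wg.Wait()
}

// fetchMeta stands in for the HEAD/GET against meta.json in object storage.
func fetchMeta(id string) {
	fmt.Println("fetching meta.json for block", id)
}

func main() {
	fetchAllMetas([]string{"block-1", "block-2", "block-3"}, 32)
}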

@electron0zero
Member

@suxiaoxiaomm can you please check the Azure Blob Storage rate limits and quotas and see whether you are hitting any of them? Ideally we should get a 429 for rate limiting, but please check just in case.

@github-actions
Contributor

github-actions bot commented Jun 4, 2023

This issue has been automatically marked as stale because it has not had any activity in the past 60 days.
The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity.
Please apply keepalive label to exempt this Issue.

@github-actions github-actions bot added the stale label on Jun 4, 2023
@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Jun 20, 2023