tempo compactor reading storage container error #2215
I just noticed error logs like the below from the Tempo querier:
Maybe this is why we cannot use the Grafana frontend to query traces?
We have also seen strange Azure blob storage behavior. Can you try the settings recommended here: https://grafana.com/docs/tempo/latest/configuration/azure/#azure-blocklist-polling There is also a very long discussion here about how to handle these issues: @electron0zero and @zalegrala, I actually don't know the current recommended best practices for Azure DNS. Do we recommend an
@joe-elliott Thank you for the quick response.
@joe-elliott currently we are setting

@suxiaoxiaomm can you try this in your compactor pods and report back? Here is what our compactor's storage section looks like:

```yaml
storage:
  trace:
    azure:
      container_name: <bucket-name>
    backend: azure
    block:
      version: vParquet
    blocklist_poll: 5m
    blocklist_poll_tenant_index_builders: 1
    cache: memcached
    memcached:
      consistent_hash: true
      host: memcached
      service: memcached-client
      timeout: 200ms
    pool:
      queue_depth: 2000
    wal:
      path: /var/tempo/wal
```

and here is what our compactor dnsConfig looks like:

```yaml
dnsConfig:
  options:
    - name: ndots
      value: "3"
```

With these settings I see 23 instances of this error in the last 24 hours, across 3 Azure clusters.
@joe-elliott @electron0zero Thanks for the suggestion. One thing to mention: our AKS cluster and the Azure blob storage account are in different regions, one in West Europe and the other in East US. So I am wondering whether the 61s timeout is too short. Is that a configurable parameter?
@joe-elliott @electron0zero Hi, I just tried creating another container in the same Azure storage account, and that one works fine. I guess the old container holding the Tempo data somehow has issues. Is there a safe way for me to move the old data to this new container? And should I also move the index.json.gz file? I'd appreciate your suggestions.
Tempo does not have any specific knowledge of the container. If you copy all of the blocks from the old container to the new container and then start Tempo up, it will work fine. The index.json.gz file will be recreated from the existing blocks after Tempo starts up, so it may be safer not to move that one file.
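For anyone hitting the same question: a copy like that could be done with a tool such as azcopy. This is only a sketch, assuming SAS-token URLs; the account name, container names, and SAS tokens are placeholders, not values from this issue:

```shell
# Copy every block from the old container to the new one, but skip the
# tenant index (Tempo rebuilds index.json.gz from the blocks on startup).
azcopy copy \
  'https://<account>.blob.core.windows.net/<old-container>?<SAS>' \
  'https://<account>.blob.core.windows.net/<new-container>?<SAS>' \
  --recursive \
  --exclude-pattern 'index.json.gz'
```
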
Hi @joe-elliott, I tryed to copy the old blocks to a new container, but same issue with compactor when it trying to access this new container. So I tried to copy the recent 3days' data, it looks like fine. I am not sure why does this happen? Does the amount of blocks affect the compactor? |
Do you have a partial block somehow? If so, just cleaning up the block might work. Check the block whose meta.json keeps failing to load.
Yes, definitely, but we have seen Tempo survive with blocklists of 80k+ blocks.
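One hedged way to inspect a suspect block is with the Azure CLI. This sketch assumes Tempo's default single-tenant layout (`single-tenant/<block-id>/...`); the account, container, and block ID are placeholders:

```shell
# List the objects under the suspect block to see if any are missing:
az storage blob list \
  --account-name <storage-account> \
  --container-name <tempo-container> \
  --prefix 'single-tenant/<block-id>/' \
  --output table

# Try to download its meta.json directly:
az storage blob download \
  --account-name <storage-account> \
  --container-name <tempo-container> \
  --name 'single-tenant/<block-id>/meta.json' \
  --file ./meta.json
```

If the listing shows data files but no meta.json, the block is likely partial and cleaning it up may help.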
Hi @joe-elliott, it doesn't always fail at the same block, or even at blocks from the same day.
@zalegrala It might be related to the fact that I have istio-proxy enabled as a sidecar for this compactor pod.
Connection refused sounds like either the target address or port is incorrect. When exactly do you see the "connection refused" message?
@zalegrala It shouldn't be. If I switch to a new container, it works fine, but if I migrate the old data into that new container, I get the error again. Looking at the code, the Tempo compactor fires one goroutine per block and sends requests to Azure blob storage. I am wondering whether, when there are too many blocks (e.g. > 10000), Azure starts refusing connections. Checking the code of thanos-compactor, it limits the concurrent goroutines to 32.
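For illustration, bounding per-block fetches with a buffered-channel semaphore (the pattern the comment above attributes to thanos-compactor, with a cap of 32) could look like this sketch. `fetchMeta` is a placeholder for the real Azure blob request, not Tempo's actual code:

```go
package main

import (
	"fmt"
	"sync"
)

// maxConcurrent caps how many block fetches run at once, so a large
// blocklist doesn't open thousands of simultaneous connections.
const maxConcurrent = 32

// fetchMeta stands in for the real per-block request to Azure blob storage.
func fetchMeta(blockID int) string {
	return fmt.Sprintf("meta for block %d", blockID)
}

// fetchAll launches one goroutine per block, but a buffered channel
// acts as a semaphore limiting concurrency to maxConcurrent.
func fetchAll(blockIDs []int) []string {
	sem := make(chan struct{}, maxConcurrent)
	results := make([]string, len(blockIDs))
	var wg sync.WaitGroup
	for i, id := range blockIDs {
		wg.Add(1)
		go func(i, id int) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it when done
			results[i] = fetchMeta(id)
		}(i, id)
	}
	wg.Wait()
	return results
}

func main() {
	out := fetchAll([]int{1, 2, 3})
	fmt.Println(len(out)) // 3
}
```
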
@suxiaoxiaomm can you please check the Azure Blob Storage rate limits and quotas and see if you are hitting any of them? Ideally we would get a 429 on rate limiting, but it's worth checking just in case.
This issue has been automatically marked as stale because it has not had any activity in the past 60 days.
Hi Experts,
We are using Azure blob as our tempo storage.
Currently the Tempo compactor keeps reporting this error:
This error message appears every 5 minutes, which doesn't look random: 5 minutes is exactly the default poll interval.
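For reference, the poll interval in question is set in the storage config; a minimal sketch, assuming the standard Tempo storage section (values here are the defaults, not our production settings):

```yaml
storage:
  trace:
    backend: azure
    blocklist_poll: 5m  # default cadence at which the blocklist is re-read
```
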
I am sure the connection string to Azure blob storage is correct, as the Tempo ingester and querier use exactly the same connection string and they seem to work fine.
Meanwhile, the above issue seems to be why we cannot query traces from the Grafana frontend.
Your help and suggestions are highly appreciated!
Thanks a lot!