Azure DNS Lookup Failures #1462
Comments
Not sure if it is related, but we also see:
Yes, both of those can be triggered by this issue. In Azure we would also recommend setting the following value:
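```yaml
storage:
  trace:
    # a single tenant index builder avoids concurrent writes to the
    # same index object in ABS
    blocklist_poll_tenant_index_builders: 1
```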
This will reduce errors related to tenant index building. Azure does not like it when multiple processes attempt to write to the same object.
@joe-elliott thanks, I am trying it
@joe-elliott Hi, I just noticed that we are using only one Tempo compactor, so this setting may not apply to us. Is a single compactor enough?
Compactors simply reduce the length of the blocklist and help keep query latencies down. If your latencies are fine then your compactors are fine. Two metrics that can help you keep an eye on your blocklist:
Inside the compactor pod I tested nslookup against the Azure blob host, and it looks good.
How can I enable debug mode for Tempo so I can get more logs?
You can use the following configuration to enable debug logging on Tempo:
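```yaml
server:
  # accepted levels include debug, info, warn, error
  log_level: debug
```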
Hello :)
But it should look like:
Are you sure this is related to Azure DNS? :(
@joe-elliott

```yaml
storage:
  trace:
    blocklist_poll_tenant_index_builders: 1
```

We are using a private endpoint in Azure to connect to the storage account; could that be related?
If you're referring to the memberlist issues in your logs, failures during rollout are normal and should be ignored. Constant memberlist failures suggest networking issues in your cluster. (#927)

We run multiple clusters on AWS, GCP and Azure. Azure is the only cloud provider on which we see this issue, and we see it on all Azure clusters. It almost always occurs when requesting meta.compacted.json. We have only seen this issue occurring in compactors and only when polling the blocklist. Perhaps something about the rapid-fire requests to Azure blob storage during polling is causing an issue?

@martin-jaeger-maersk That suggested configuration prevents a different error unique to ABS. ABS sometimes throws a strange error when two processes attempt to write the same object at the same time. We were receiving that error here until we reduced blocklist_poll_tenant_index_builders to 1.

Following up with the thought that perhaps our default settings are hitting ABS too hard, I'm going to experiment with lowering the blocklist poll concurrency.
In a cluster with ~250 blocks I experimented with those settings. I'm also going to spend some time trying to upgrade our Azure client. There appear to be two official clients and it's unclear what the relationship is between them.
I agree, that does sound very odd. Where in the Tempo code is the Azure client called (in this particular case)?
Here's where we begin polling the blocklist concurrently: tempo/tempodb/blocklist/poller.go Line 186 in d8d4dc8
The actual poll function requests the block meta and, if that can't be found, the compacted block meta: tempo/tempodb/blocklist/poller.go Line 234 in d8d4dc8
Both of those reads end up in the Azure backend here: tempo/tempodb/backend/azure/azure.go Line 144 in d8d4dc8
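Putting those pieces together, the per-block flow is roughly the following (illustrative Go; `Backend`, `ErrNotFound`, and `pollBlock` are stand-in names, not Tempo's actual API):

```go
package main

import (
	"context"
	"errors"
)

// Backend and ErrNotFound are stand-ins for Tempo's backend interfaces.
var ErrNotFound = errors.New("not found")

type Backend interface {
	Read(ctx context.Context, tenant, blockID, name string) ([]byte, error)
}

// pollBlock mirrors the flow described above: request the block meta
// and, only if it cannot be found, request the compacted block meta.
// During a poll this runs concurrently for every block in the tenant,
// which is what produces the burst of requests (and DNS lookups).
func pollBlock(ctx context.Context, b Backend, tenant, blockID string) error {
	if _, err := b.Read(ctx, tenant, blockID, "meta.json"); err == nil {
		return nil
	} else if !errors.Is(err, ErrNotFound) {
		return err
	}
	// meta.compacted.json: the request on which the DNS failures
	// in this issue are almost always observed
	_, err := b.Read(ctx, tenant, blockID, "meta.compacted.json")
	return err
}

func main() {}
```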
@joe-elliott A few more points which could make sense, since it's only seen in Azure:

- Azure DNS request throttling
- A DNS race condition when using UDP. Not Azure-specific per se, but it can be a result of the OS and patch level of the VMs used in AKS.

https://medium.com/asos-techblog/an-aks-performance-journey-part-2-networking-it-out-e253f5bb4f69

Found this article that talks about all the things that can go wrong when hammering DNS with UDP. The thesis is that a high volume of DNS requests over UDP can cause a race condition in the kernel, which can result in UDP requests failing.
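Not something Tempo ships, but if the UDP race is the culprit, one way to test the theory is to force DNS over TCP with a custom resolver; an illustrative Go sketch (all names here are mine):

```go
package main

import (
	"context"
	"net"
	"net/http"
	"time"
)

// newTCPResolverClient returns an HTTP client whose dialer resolves
// names over TCP instead of UDP, sidestepping the kernel-level UDP
// race described in the article above.
func newTCPResolverClient() *http.Client {
	resolver := &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
			// Ignore the suggested network (usually "udp") and dial
			// the DNS server over TCP instead.
			d := net.Dialer{Timeout: 2 * time.Second}
			return d.DialContext(ctx, "tcp", address)
		},
	}
	dialer := &net.Dialer{Resolver: resolver}
	transport := http.DefaultTransport.(*http.Transport).Clone()
	transport.DialContext = dialer.DialContext
	return &http.Client{Transport: transport}
}

func main() {
	_ = newTCPResolverClient()
}
```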
Using the newly added config in the linked PR and setting the jitter option now.
@joe-elliott I have been testing this change for 24 hours and have not seen the issue. We normally see it 10 to 20 times in a 24-hour period.
Right, seems like a valid fix then. I suppose we would have to wait for this to land in a release, unless of course we want to build and publish images on our own.
Every push to main builds a new image: https://hub.docker.com/r/grafana/tempo/tags

You could just update your compactor image if you'd like to leave the rest of the components alone.
This seemed to be fixed for a while, but has recently returned even with the jitter setting referenced above. We are still looking into this.
We managed to fix these errors with #1632 and the Tempo pods' DNS config.

In #1632 we increased MaxIdleConns and MaxIdleConnsPerHost for the Azure backend to 100 (the default is 2).

Testing was done in a Tempo cluster running in AKS, configured with the Azure backend.

- With only #1632: errors reduced from 4-5 per hour to 1 every 2-3 hours.
- With #1632 and the Pod DNS config set to ndots: 3: no errors.

Check out the following links to learn more about the ndots config and how to set it in your pods:
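For context, a sketch of the idea behind the #1632 change (illustrative Go, not Tempo's actual client construction):

```go
package main

import (
	"net/http"
	"time"
)

// newAzureHTTPClient raises MaxIdleConns/MaxIdleConnsPerHost so the
// client reuses connections to the storage host: fewer new dials means
// fewer DNS lookups while polling the blocklist.
func newAzureHTTPClient() *http.Client {
	transport := http.DefaultTransport.(*http.Transport).Clone()
	transport.MaxIdleConns = 100        // total idle conns kept across all hosts
	transport.MaxIdleConnsPerHost = 100 // Go's default is only 2 per host
	return &http.Client{
		Transport: transport,
		Timeout:   30 * time.Second, // illustrative timeout
	}
}

func main() {
	_ = newAzureHTTPClient()
}
```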
See also grafana/jsonnet-libs#762, which lowers ndots in the jsonnet library. [EDIT: was previously linking to a non-public issue]
Another way to solve this without modifying ndots would be to use the fully qualified domain name with a trailing dot for the storage endpoint, so the resolver skips the search-domain expansion entirely. But we can't do that, because Azure responds with an error when the request's Host header ends with the trailing dot. Here is what Azure responds with when we add the trailing dot:

This looks like a bug; I will file a ticket with Azure.

Note: this works fine with curl, because curl trims the trailing dot from the Host header.
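A quick way to see the difference (hypothetical account name; this shows Go standard library behavior, assuming the storage client derives its Host header from the URL the way net/http does):

```go
package main

import (
	"fmt"
	"net/http"
)

// Go's net/http derives the Host header from the URL and keeps the
// trailing dot, while curl strips it, which is why curl succeeds
// against Azure and the Go client does not.
func main() {
	req, err := http.NewRequest(http.MethodGet,
		"https://myaccount.blob.core.windows.net./container", nil)
	if err != nil {
		panic(err)
	}
	// The Host header sent on the wire defaults to req.URL.Host.
	fmt.Println(req.URL.Host) // myaccount.blob.core.windows.net.
}
```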
@electron0zero Hi, thanks for the fix. How did you set the Pod DNS config on the Tempo pods, and is there a way to do this with Helm?
In our case we were seeing these errors only in compactor pods, so we added the DNS config only there. We did this by setting the compactor Pod's DNS config:

```yaml
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "3"
```

(The Azure blob hostname has at least three dots, so with ndots: 3 the resolver tries it as an absolute name first instead of expanding it through every search domain, as it would under the Kubernetes default of ndots: 5.)

I am not sure about Helm. If you are using the chart from https://github.com/grafana/helm-charts, can you open an issue in that repo please? :)
Update from Azure support: Host headers are not allowed to have a trailing dot in Azure services; you can use an FQDN, but you need to set the Host header without the trailing dot.

We need to modify the Azure client to trim the trailing dot from the Host header for FQDNs to work in Azure; for example, curl trims the trailing dot from the Host header.

Created #1726 to track this for Azure and other storage backends.
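The fix is conceptually tiny; a hypothetical helper (not the actual #1726 change):

```go
package main

import (
	"fmt"
	"strings"
)

// hostHeader trims the FQDN's trailing dot before the hostname is
// used as the Host header, matching what curl does.
func hostHeader(hostname string) string {
	return strings.TrimSuffix(hostname, ".")
}

func main() {
	fmt.Println(hostHeader("myaccount.blob.core.windows.net.")) // myaccount.blob.core.windows.net
}
```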
Describe the bug
In Azure, all components can register DNS errors while trying to work with meta.compacted.json. This issue does not occur in any other environment. Unsure if this is an Azure issue or a Tempo issue. The error itself is a failed TCP connection to a DNS server, which suggests some issue with Azure infrastructure. However, the fact that the error (almost?) always occurs on meta.compacted.json suggests that something about the way we handle that file is different and is causing this issue.

The failures look like:
or
We have seen this issue internally. Also reported here: #1372.