Azure DNS Lookup Failures #1462

Closed
joe-elliott opened this issue Jun 1, 2022 · 26 comments · Fixed by #1518, #1632 or grafana/helm-charts#2308

Comments

@joe-elliott
Member

Describe the bug
In Azure all components can register DNS errors while trying to work with meta.compacted.json. This issue does not occur in any other environments. Unsure if this is an Azure issue or a Tempo issue. The error itself is a failed tcp connection to a DNS server which suggests this is some issue with Azure infrastructure. However, the fact that the error (almost?) always occurs on meta.compacted.json suggests something about the way we handle that file is different and is causing this issue.

The failures look like:

reading storage container: Head "https://tempoe**************.blob.core.windows.net/tempo/single-tenant/d8aafc48-5796-4221-ac0b-58e001d18515/meta.compacted.json?timeout=61": dial tcp: lookup tempoe**************.blob.core.windows.net on 10.0.0.10:53: dial udp 10.0.0.10:53: operation was canceled

or

error deleting blob, name: single-tenant/*******************/data: Delete "https://tempoe******.blob.core.windows.net/tempo/single-tenant/5b1ab746-fee7-409c-944d-1c1d5ba7a70e/data?timeout=61": dial tcp: lookup tempoe******.blob.core.windows.net on 10.0.0.10:53: dial udp 10.0.0.10:53: operation was canceled

We have seen this issue internally. Also reported here: #1372.

@hwu-talendbj

Not sure if it is related, but we also see TempoTenantIndexFailures and TempoTenantIndexTooOld errors; they may be caused by this.

@joe-elliott
Member Author

Yes, both of those can be triggered by this issue. In Azure we would also recommend setting the following value:

storage:
  trace:
    blocklist_poll_tenant_index_builders: 1

This will reduce errors related to tenant index building. Azure does not like it when multiple processes attempt to write to the same object.

@hwu-talendbj

@joe-elliott thanks, I am trying it

@hwu-talendbj

@joe-elliott Hi, I just noticed that we are running only 1 Tempo compactor, so blocklist_poll_tenant_index_builders: 1 may not help in this case. Or do you think it is better to go with multiple compactors?

@joe-elliott
Member Author

Compactors simply reduce the length of the blocklist and help keep query latencies down. If your latencies are fine then your compactors are fine. Two metrics that can help you keep an eye on your blocklist:
avg(tempodb_blocklist_length{}) by (tenant)
and
sum(tempodb_compaction_outstanding_blocks{}) by (tenant)

@hwu-talendbj

hwu-talendbj commented Jun 15, 2022

Inside the compactor pod, I tested nslookup against the Azure blob host, and it looks good:

nslookup tempoe4c***********2b.blob.core.windows.net
Server:		10.0.0.10
Address:	10.0.0.10:53

Non-authoritative answer:
tempoe4c1**********b.blob.core.windows.net	canonical name = blob.ams07prdstr02a.store.core.windows.net
Name:	blob.ams07prdstr02a.store.core.windows.net
Address: 20.150.37.228

Non-authoritative answer:
tempoe4c**********b.blob.core.windows.net	canonical name = blob.ams07prdstr02a.store.core.windows.net

How can I enable debug mode for Tempo so I can get more logs?

@annanay25
Contributor

You can use the following configuration to enable debug logging on Tempo

server:
    log_level: debug

@galvesLoy

Hello :)
After using debug mode with @hwu-talendbj, we found more data. This is what happens during push/pull:

{"caller":"memberlist_logger.go:74","level":"error","msg":"Push/Pull with grafana-tempo-distributor-66fbfb5d99-ms794-ad6112e9 failed: dial tcp 10.128.18.112:7946: i/o timeout","ts":"2022-06-21T12:46:32.111904425Z"}
{"addr":"10.128.18.112:7946","caller":"tcp_transport.go:428","component":"memberlist TCPTransport","err":"dial tcp 10.128.18.112:7946: i/o timeout","level":"warn","msg":"WriteTo failed","ts":"2022-06-21T12:46:32.452547619Z"}
{"caller":"memberlist_logger.go:74","level":"debug","msg":"Stream connection from=10.128.20.104:42880","ts":"2022-06-21T12:46:33.866113092Z"}
{"caller":"memberlist_logger.go:74","level":"debug","msg":"Stream connection from=10.128.16.18:51284","ts":"2022-06-21T12:46:36.210169812Z"}
{"caller":"logging.go:67","level":"debug","msg":"GET /metrics (200) 3.38042ms","traceID":"5ce5bffbf7a76751","ts":"2022-06-21T12:46:38.496729396Z"}
{"caller":"logging.go:67","level":"debug","msg":"GET /metrics (200) 3.50652ms","traceID":"4f0e40a70e0652d9","ts":"2022-06-21T12:46:38.510874078Z"}
{"addr":"10.128.18.112:7946","caller":"tcp_transport.go:428","component":"memberlist TCPTransport","err":"dial tcp 10.128.18.112:7946: i/o timeout","level":"warn","msg":"WriteTo failed","ts":"2022-06-21T12:46:39.451709386Z"}
{"caller":"memberlist_logger.go:74","level":"info","msg":"Suspect grafana-tempo-distributor-66fbfb5d99-ms794-ad6112e9 has failed, no acks received","ts":"2022-06-21T12:46:39.454869705Z"}
{"caller":"memberlist_logger.go:74","level":"debug","msg":"Stream connection from=10.128.17.62:55936","ts":"2022-06-21T12:46:39.986359017Z"}
{"addr":"10.128.18.112:7946","caller":"tcp_transport.go:428","component":"memberlist TCPTransport","err":"dial tcp 10.128.18.112:7946: i/o timeout","level":"warn","msg":"WriteTo failed","ts":"2022-06-21T12:46:40.182091363Z"}
{"caller":"memberlist_logger.go:74","level":"debug","msg":"Failed ping: grafana-tempo-distributor-66fbfb5d99-j67d2-a0aa99f0 (timeout reached)","ts":"2022-06-21T12:46:41.455734719Z"}
{"caller":"compactor.go:64","level":"info","msg":"starting compaction cycle","offset":0,"tenantID":"single-tenant","ts":"2022-06-21T12:46:51.476815384Z"}
{"caller":"compactor.go:70","level":"info","msg":"compaction cycle complete. No more blocks to compact","tenantID":"single-tenant","ts":"2022-06-21T12:46:51.476889284Z"}
{"addr":"10.128.18.112:7946","caller":"tcp_transport.go:428","component":"memberlist TCPTransport","err":"dial tcp 10.128.18.112:7946: i/o timeout","level":"warn","msg":"WriteTo failed","ts":"2022-06-21T12:46:52.451737191Z"}
{"caller":"memberlist_logger.go:74","level":"debug","msg":"Failed ping: grafana-tempo-distributor-66fbfb5d99-j67d2-a0aa99f0 (timeout reached)","ts":"2022-06-21T12:46:56.453139815Z"}
{"caller":"memberlist_logger.go:74","level":"debug","msg":"Stream connection from=10.128.18.158:46552","ts":"2022-06-21T12:46:57.18343619Z"}
{"addr":"10.128.18.112:7946","caller":"tcp_transport.go:428","component":"memberlist TCPTransport","err":"dial tcp 10.128.18.112:7946: i/o timeout","level":"warn","msg":"WriteTo failed","ts":"2022-06-21T12:46:57.396034335Z"}
{"caller":"broadcast.go:48","content":"[grafana-tempo-ingester-0]","key":"collectors/ring","level":"debug","msg":"Invalidating forwarded broadcast","oldContent":"[grafana-tempo-ingester-0]","oldVersion":6660,"ts":"2022-06-21T12:46:57.397407443Z","version":6661}

But it should look like:

{"caller":"memberlist_logger.go:74","level":"debug","msg":"Initiating push/pull sync with: grafana-tempo-ingester-0-435a9d15 10.128.18.158:7946","ts":"2022-06-21T12:48:02.11782556Z"}
{"caller":"logging.go:67","level":"debug","msg":"GET /metrics (200) 3.511421ms","traceID":"4aa6e02d1b78e775","ts":"2022-06-21T12:48:08.497212902Z"}
{"caller":"logging.go:67","level":"debug","msg":"GET /metrics (200) 3.194218ms","traceID":"6ab9c753aed0029e","ts":"2022-06-21T12:48:08.510489979Z"}
{"caller":"retention.go:46","level":"debug","msg":"Performing block retention","retention":"48h0m0s","tenantID":"single-tenant","ts":"2022-06-21T12:48:21.476575596Z"}
{"caller":"compactor.go:192","hash":"1b8c524d-a934-451b-a9ca-4ea7db8e3b7e","level":"debug","msg":"checking hash","ts":"2022-06-21T12:48:21.476661197Z"}
{"caller":"compactor.go:211","level":"debug","msg":"checking addresses","owning_addr":"10.128.17.60:0","this_addr":"10.128.17.60:0","ts":"2022-06-21T12:48:21.476687197Z"}
{"blockID":"1b8c524d-a934-451b-a9ca-4ea7db8e3b7e","caller":"retention.go:76","level":"info","msg":"deleting block","tenantID":"single-tenant","ts":"2022-06-21T12:48:21.476701097Z"}
{"caller":"compactor.go:64","level":"info","msg":"starting compaction cycle","offset":0,"tenantID":"single-tenant","ts":"2022-06-21T12:48:21.476772798Z"}
{"caller":"compactor.go:70","level":"info","msg":"compaction cycle complete. No more blocks to compact","tenantID":"single-tenant","ts":"2022-06-21T12:48:21.476833798Z"}

Are you sure this is related to Azure DNS? :(

@martinjaegerdk

@joe-elliott
We are also experiencing the same error when running in Azure (AKS and Azure Blob Storage). I tried running with the config below for two days, but I'm still getting the same error.

storage:
  trace:
    blocklist_poll_tenant_index_builders: 1

We are using a private endpoint in Azure to connect to the storage account.
Is anyone else using a private endpoint to connect to the storage account?

@joe-elliott
Member Author

joe-elliott commented Jun 22, 2022

@galvesLoy

Are you sure this is related to Azure DNS? :(

If you're referring to the memberlist issues in your logs, failures during rollout are normal and should be ignored. Constant memberlist failures suggest networking issues in your cluster. (#927)

We run multiple clusters on AWS, GCP and Azure. Azure is the only cloud provider on which we see this issue, and we see it on all Azure clusters. It almost always occurs when performing a HEAD request on meta.compacted.json, but rarely it also occurs on a GET request to meta.json. It is likely related to access patterns in Tempo b/c AFAIK other teams internally at Grafana are not seeing this issue.

We have only seen this issue occurring in compactors and only when polling the blocklist. Perhaps something about the rapid fire requests to Azure blob storage during polling is causing an issue?

@martin-jaeger-maersk

That suggested configuration prevents a different error unique to ABS. ABS sometimes throws a strange error when two processes attempt to write the same object at the same time. We were receiving the error here until we reduced blocklist_poll_tenant_index_builders to 1. It is 2 by default.

Following up with the thought that perhaps our default settings are hitting ABS too hard, I'm going to experiment with lowering storage.trace.blocklist_poll_concurrency to see if the issue goes away.

@joe-elliott
Member Author

In a cluster with ~250 blocks I changed storage.trace.blocklist_poll_concurrency from the default of 50 to 10, but we continued to see the issue. Then I lowered storage.trace.blocklist_poll_concurrency to 1 and the issue went away. This technically "fixes" the issue but is likely not useful for larger clusters where the time to complete a polling cycle would go up significantly. I do find it quite odd that the client appears to be making a DNS request every time we access an object, but maybe this is an intentional choice on Azure's part.
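
For reference, a minimal sketch of that override, assuming the option lives under storage.trace as the full name suggests:

storage:
  trace:
    blocklist_poll_concurrency: 1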

I'm going to spend some time trying to upgrade our client. There appear to be 2 official clients, and it's unclear what the relationship is between them.

@martinjaegerdk

I do find it quite odd that the client appears to be making a DNS request every time we access an object, but maybe this is an intentional choice on Azure's part.

I agree, that does sound very odd.
It could be the Azure Go client doing it intentionally, or simply a very low TTL on the DNS entry.

Where in the Tempo code is the Azure client called (in this particular case)?

@joe-elliott
Member Author

joe-elliott commented Jun 24, 2022

Here's where we begin polling the blocklist concurrently:

bg := boundedwaitgroup.New(p.cfg.PollConcurrency)

The actual poll function requests the block meta and, if that can't be found, the compacted block meta:

func (p *Poller) pollBlock(ctx context.Context, tenantID string, blockID uuid.UUID) (*backend.BlockMeta, *backend.CompactedBlockMeta, error) {

Both BlockMeta() and CompactedBlockMeta() eventually hit this method on the Azure backend:

func (rw *readerWriter) Read(ctx context.Context, name string, keypath backend.KeyPath, _ bool) (io.ReadCloser, int64, error) {

@martinjaegerdk

martinjaegerdk commented Jun 24, 2022

@joe-elliott A few more points that could make sense, since this is only seen in Azure:

DNS request throttle

https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/request-limits-and-throttling

Azure DNS and Azure Private DNS have a throttle limit of 500 read (GET) operations per 5 minutes.

https://docs.microsoft.com/en-us/azure/virtual-network/virtual-networks-name-resolution-for-vms-and-role-instances#considerations

DNS query traffic is throttled for each VM. Throttling shouldn't impact most applications. If request throttling is observed, ensure that client-side caching is enabled. For more information, see DNS client configuration.

DNS race condition when using UDP

Not Azure-specific per se, but it can be a result of the OS and patch level of the VMs used in AKS.

https://medium.com/asos-techblog/an-aks-performance-journey-part-2-networking-it-out-e253f5bb4f69

I found this article, which talks about all the things that can go wrong when hammering DNS over UDP.
One suggestion the author brings up is to force the container to use TCP instead of UDP for DNS. This adds more overhead, but could result in connections actually being established.

The thesis is that a high volume of DNS requests over UDP can cause a race condition in the kernel, which can result in UDP requests failing.
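
As a rough sketch of what forcing TCP could look like on Kubernetes (hypothetical pod spec fragment; use-vc is the glibc resolver option that makes DNS queries go over TCP):

spec:
  dnsConfig:
    options:
      - name: use-vc   # glibc resolv.conf option: send DNS queries over TCP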

@joe-elliott
Member Author

Using the newly added config in the linked PR and setting the blocklist_poll_jitter_ms value to 500, I was able to go 24 hours without seeing this issue.

#1518
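
For anyone wanting to try the same thing, a minimal sketch of the new setting, assuming it sits under storage.trace alongside the other blocklist_poll_* options:

storage:
  trace:
    blocklist_poll_jitter_ms: 500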

@martinjaegerdk

@joe-elliott
That's very interesting 👍
Does that mean that after 24 hours you saw the error, or simply that you have tested for 24 hours without seeing the error?

@joe-elliott
Member Author

I have been testing this change for 24 hours and have not seen the issue. We normally see it 10 - 20 times in a 24 hour period.

@martinjaegerdk

I have been testing this change for 24 hours and have not seen the issue. We normally see it 10 - 20 times in a 24 hour period.

Right, seems like a valid fix then. I suppose we would have to wait for this to land in a release, unless of course we want to build and publish images on our own.

@joe-elliott
Member Author

Every push to main builds a new image: https://hub.docker.com/r/grafana/tempo/tags

You could just update your compactor image if you'd like to leave the rest of the components alone.

@joe-elliott
Member Author

This seemed to be fixed for a while, but it has recently returned even with the jitter setting referenced above. We are still looking into this.

@electron0zero
Member

We managed to fix these errors with #1632 combined with setting ndots=3 in the Tempo pods' DNS config.

In #1632, we increased MaxIdleConns and MaxIdleConnsPerHost for the Azure backend to 100 (the default is 2).
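
Roughly speaking, the idea looks like this on a plain net/http transport (a sketch only, not the actual change in #1632; the variable name is illustrative):

package azure

import "net/http"

// Sketch only: raise the idle-connection limits so the client reuses
// connections to the blob endpoint instead of re-dialing (and therefore
// re-resolving DNS) for almost every request.
var blobHTTPClient = &http.Client{
	Transport: &http.Transport{
		MaxIdleConns:        100, // http.DefaultTransport also uses 100 here
		MaxIdleConnsPerHost: 100, // Go's per-host default is only 2
	},
}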

Testing was done in a Tempo cluster running in AKS and configured with the Azure backend.

Testing with only #1632 (max connections set to 100):

Errors reduced from 4-5 errors per hour to 1 error every 2-3 hours.

Testing with #1632 and Pod DNS Config set to ndots=3 (max connections 100 and ndots=3):

No errors.

Check out the following links to learn more about the ndots config and how to set it in your pods:

@bboreham
Contributor

bboreham commented Aug 23, 2022

See also grafana/jsonnet-libs#762; strictly speaking, lowering ndots breaks Kubernetes service discovery, but if you remember not to rely on that specific behavior you'll be OK.

[EDIT: was previously linking to non-public issue]

@electron0zero
Member

Another way to solve this without modifying ndots is to add a . at the end of the hostname to make it fully qualified. If we went that route, we wouldn't have to lower ndots and risk breaking Kubernetes service discovery.

But we can't do that, because Azure responds with 400 Bad Request, so we have to go with setting ndots.

Here is what Azure responds with when we add the . at the end:

GET https://tempo**************.blob.core.windows.net./**************?restype=container&timeout=61

Status: 400 Bad Request
Content-Length: [334]
Content-Type: [text/html; charset=us-ascii]
Date: [Thu, 11 Aug 2022 19:38:19 GMT]
Server: [Microsoft-HTTPAPI/2.0]

xml: (*azblob.storageError).UnmarshalXML did not consume entire <HTML> element

This looks like a bug; I will file a ticket with Azure.

Note: this works fine with curl because curl trims the . in the URL. Try it with Chrome or other tools that don't trim the dot.

@hwu-talendbj

@electron0zero Hi, thanks for the fix.
So after the new image is built, we also need to set the Tempo pods' DNS config to ndots=3 to resolve the issue, right?
We deploy Tempo through the Helm chart, and I'm not sure if there is a good way to apply this setting.

@electron0zero
Member

electron0zero commented Sep 1, 2022

In our case, we were seeing these errors only in compactor pods, so we added ndots=3 in compactor pods for Azure.

We did this by setting the compactor Pod's DNS config:

spec:
  dnsConfig:
    options:
      - name: ndots
        value: "3"

I am not sure about Helm. If you are using the chart from https://github.com/grafana/helm-charts, can you please open an issue in that repo? :)

@electron0zero
Member

electron0zero commented Sep 8, 2022

Update from Azure support: Host headers are not allowed to have a trailing dot in Azure services. You can use an FQDN, but you need to set the Host header without the trailing dot.

  • curl -v --header 'Host: *****03.blob.core.windows.net' https://******03.blob.core.windows.net./*******-data?restype=container&timeout=61 -> this is allowed

  • curl -v --header 'Host: *****03.blob.core.windows.net.' https://******03.blob.core.windows.net./*******-data?restype=container&timeout=61 -> this is NOT allowed, and will result in 400 (notice the dot in Host header)

We need to modify the Azure client to trim the trailing dot from the Host header for FQDNs to work with Azure.

For example, curl trims the trailing dot from the Host header; see curl/curl@3a61428 for details.
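
A minimal Go sketch of the idea (the hostname is illustrative, and this is not the actual client change): keep the trailing dot in the URL so the resolver treats the name as fully qualified, but send the Host header without it.

package main

import (
	"fmt"
	"net/http"
	"strings"
)

func main() {
	fqdn := "example.blob.core.windows.net." // illustrative hostname only

	// Trailing dot in the URL makes the resolver skip the ndots search list...
	req, err := http.NewRequest(http.MethodGet, "https://"+fqdn+"/container?restype=container", nil)
	if err != nil {
		panic(err)
	}

	// ...but Azure rejects a Host header with a trailing dot (400), so trim it.
	req.Host = strings.TrimSuffix(fqdn, ".")

	fmt.Println("URL host:", req.URL.Host, "Host header:", req.Host)
}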

Created #1726 to track this for Azure and other storage backends.
