Azure DNS Lookup Failures #1462

Closed
joe-elliott opened this issue Jun 1, 2022 · 26 comments · Fixed by #1518, #1632 or grafana/helm-charts#2308

Comments

@joe-elliott
Member

Describe the bug
In Azure all components can register DNS errors while trying to work with meta.compacted.json. This issue does not occur in any other environments. Unsure if this is an Azure issue or a Tempo issue. The error itself is a failed tcp connection to a DNS server which suggests this is some issue with Azure infrastructure. However, the fact that the error (almost?) always occurs on meta.compacted.json suggests something about the way we handle that file is different and is causing this issue.

The failures look like:

reading storage container: Head "https://tempoe**************.blob.core.windows.net/tempo/single-tenant/d8aafc48-5796-4221-ac0b-58e001d18515/meta.compacted.json?timeout=61": dial tcp: lookup tempoe**************.blob.core.windows.net on 10.0.0.10:53: dial udp 10.0.0.10:53: operation was canceled

or

error deleting blob, name: single-tenant/*******************/data: Delete "https://tempoe******.blob.core.windows.net/tempo/single-tenant/5b1ab746-fee7-409c-944d-1c1d5ba7a70e/data?timeout=61": dial tcp: lookup tempoe******.blob.core.windows.net on 10.0.0.10:53: dial udp 10.0.0.10:53: operation was canceled

We have seen this issue internally. Also reported here: #1372.

@hwu-talendbj

Not sure if it is related, but we also see TempoTenantIndexFailures and TempoTenantIndexTooOld errors; they may be caused by this.

@joe-elliott
Member Author

Yes, both of those can be triggered by this issue. In Azure we would also recommend setting the following value:

storage:
  trace:
    blocklist_poll_tenant_index_builders: 1

This will reduce errors related to tenant index building. Azure does not like it when multiple processes attempt to write to the same object.

@hwu-talendbj

@joe-elliott thanks, I am trying it

@hwu-talendbj

@joe-elliott Hi, I just noticed that we are running only 1 Tempo compactor, so blocklist_poll_tenant_index_builders: 1 may not help in this case. Or do you think it is better to go with multiple compactors?

@joe-elliott
Member Author

Compactors simply reduce the length of the blocklist and help keep query latencies down. If your latencies are fine then your compactors are fine. Two metrics that can help you keep an eye on your blocklist:
avg(tempodb_blocklist_length{}) by (tenant)
and
sum(tempodb_compaction_outstanding_blocks{}) by (tenant)

@hwu-talendbj

hwu-talendbj commented Jun 15, 2022

Inside the compactor pod, I tested nslookup against the Azure blob host, and it looks good:

nslookup tempoe4c***********2b.blob.core.windows.net
Server:		10.0.0.10
Address:	10.0.0.10:53

Non-authoritative answer:
tempoe4c1**********b.blob.core.windows.net	canonical name = blob.ams07prdstr02a.store.core.windows.net
Name:	blob.ams07prdstr02a.store.core.windows.net
Address: 20.150.37.228

Non-authoritative answer:
tempoe4c**********b.blob.core.windows.net	canonical name = blob.ams07prdstr02a.store.core.windows.net

How can I enable debug mode for Tempo so I can get more logs?

@annanay25
Contributor

You can use the following configuration to enable debug logging on Tempo

server:
    log_level: debug

@galvesLoy

Hello :)
After using debug mode with @hwu-talendbj, we found more data. This is what happens during push/pull:

{"caller":"memberlist_logger.go:74","level":"error","msg":"Push/Pull with grafana-tempo-distributor-66fbfb5d99-ms794-ad6112e9 failed: dial tcp 10.128.18.112:7946: i/o timeout","ts":"2022-06-21T12:46:32.111904425Z"}
{"addr":"10.128.18.112:7946","caller":"tcp_transport.go:428","component":"memberlist TCPTransport","err":"dial tcp 10.128.18.112:7946: i/o timeout","level":"warn","msg":"WriteTo failed","ts":"2022-06-21T12:46:32.452547619Z"}
{"caller":"memberlist_logger.go:74","level":"debug","msg":"Stream connection from=10.128.20.104:42880","ts":"2022-06-21T12:46:33.866113092Z"}
{"caller":"memberlist_logger.go:74","level":"debug","msg":"Stream connection from=10.128.16.18:51284","ts":"2022-06-21T12:46:36.210169812Z"}
{"caller":"logging.go:67","level":"debug","msg":"GET /metrics (200) 3.38042ms","traceID":"5ce5bffbf7a76751","ts":"2022-06-21T12:46:38.496729396Z"}
{"caller":"logging.go:67","level":"debug","msg":"GET /metrics (200) 3.50652ms","traceID":"4f0e40a70e0652d9","ts":"2022-06-21T12:46:38.510874078Z"}
{"addr":"10.128.18.112:7946","caller":"tcp_transport.go:428","component":"memberlist TCPTransport","err":"dial tcp 10.128.18.112:7946: i/o timeout","level":"warn","msg":"WriteTo failed","ts":"2022-06-21T12:46:39.451709386Z"}
{"caller":"memberlist_logger.go:74","level":"info","msg":"Suspect grafana-tempo-distributor-66fbfb5d99-ms794-ad6112e9 has failed, no acks received","ts":"2022-06-21T12:46:39.454869705Z"}
{"caller":"memberlist_logger.go:74","level":"debug","msg":"Stream connection from=10.128.17.62:55936","ts":"2022-06-21T12:46:39.986359017Z"}
{"addr":"10.128.18.112:7946","caller":"tcp_transport.go:428","component":"memberlist TCPTransport","err":"dial tcp 10.128.18.112:7946: i/o timeout","level":"warn","msg":"WriteTo failed","ts":"2022-06-21T12:46:40.182091363Z"}
{"caller":"memberlist_logger.go:74","level":"debug","msg":"Failed ping: grafana-tempo-distributor-66fbfb5d99-j67d2-a0aa99f0 (timeout reached)","ts":"2022-06-21T12:46:41.455734719Z"}
{"caller":"compactor.go:64","level":"info","msg":"starting compaction cycle","offset":0,"tenantID":"single-tenant","ts":"2022-06-21T12:46:51.476815384Z"}
{"caller":"compactor.go:70","level":"info","msg":"compaction cycle complete. No more blocks to compact","tenantID":"single-tenant","ts":"2022-06-21T12:46:51.476889284Z"}
{"addr":"10.128.18.112:7946","caller":"tcp_transport.go:428","component":"memberlist TCPTransport","err":"dial tcp 10.128.18.112:7946: i/o timeout","level":"warn","msg":"WriteTo failed","ts":"2022-06-21T12:46:52.451737191Z"}
{"caller":"memberlist_logger.go:74","level":"debug","msg":"Failed ping: grafana-tempo-distributor-66fbfb5d99-j67d2-a0aa99f0 (timeout reached)","ts":"2022-06-21T12:46:56.453139815Z"}
{"caller":"memberlist_logger.go:74","level":"debug","msg":"Stream connection from=10.128.18.158:46552","ts":"2022-06-21T12:46:57.18343619Z"}
{"addr":"10.128.18.112:7946","caller":"tcp_transport.go:428","component":"memberlist TCPTransport","err":"dial tcp 10.128.18.112:7946: i/o timeout","level":"warn","msg":"WriteTo failed","ts":"2022-06-21T12:46:57.396034335Z"}
{"caller":"broadcast.go:48","content":"[grafana-tempo-ingester-0]","key":"collectors/ring","level":"debug","msg":"Invalidating forwarded broadcast","oldContent":"[grafana-tempo-ingester-0]","oldVersion":6660,"ts":"2022-06-21T12:46:57.397407443Z","version":6661}

But it should look like:

{"caller":"memberlist_logger.go:74","level":"debug","msg":"Initiating push/pull sync with: grafana-tempo-ingester-0-435a9d15 10.128.18.158:7946","ts":"2022-06-21T12:48:02.11782556Z"}
{"caller":"logging.go:67","level":"debug","msg":"GET /metrics (200) 3.511421ms","traceID":"4aa6e02d1b78e775","ts":"2022-06-21T12:48:08.497212902Z"}
{"caller":"logging.go:67","level":"debug","msg":"GET /metrics (200) 3.194218ms","traceID":"6ab9c753aed0029e","ts":"2022-06-21T12:48:08.510489979Z"}
{"caller":"retention.go:46","level":"debug","msg":"Performing block retention","retention":"48h0m0s","tenantID":"single-tenant","ts":"2022-06-21T12:48:21.476575596Z"}
{"caller":"compactor.go:192","hash":"1b8c524d-a934-451b-a9ca-4ea7db8e3b7e","level":"debug","msg":"checking hash","ts":"2022-06-21T12:48:21.476661197Z"}
{"caller":"compactor.go:211","level":"debug","msg":"checking addresses","owning_addr":"10.128.17.60:0","this_addr":"10.128.17.60:0","ts":"2022-06-21T12:48:21.476687197Z"}
{"blockID":"1b8c524d-a934-451b-a9ca-4ea7db8e3b7e","caller":"retention.go:76","level":"info","msg":"deleting block","tenantID":"single-tenant","ts":"2022-06-21T12:48:21.476701097Z"}
{"caller":"compactor.go:64","level":"info","msg":"starting compaction cycle","offset":0,"tenantID":"single-tenant","ts":"2022-06-21T12:48:21.476772798Z"}
{"caller":"compactor.go:70","level":"info","msg":"compaction cycle complete. No more blocks to compact","tenantID":"single-tenant","ts":"2022-06-21T12:48:21.476833798Z"}

Are you sure this is related to Azure DNS? :(

@martinjaegerdk

@joe-elliott
We are also experiencing the same error when running in Azure (AKS and Azure Blob Storage). I tried running with the config below for two days, but I'm still getting the same error.

storage:
  trace:
    blocklist_poll_tenant_index_builders: 1

We are using a private endpoint in Azure to connect to the storage account.
Is anyone else using a private endpoint to connect to the storage account?

@joe-elliott
Member Author

joe-elliott commented Jun 22, 2022

@galvesLoy

Are you sure this is related to Azure DNS? :(

If you're referring to the memberlist issues in your logs, failures during rollout are normal and should be ignored. Constant memberlist failures suggest networking issues in your cluster. (#927)

We run multiple clusters on AWS, GCP and Azure. Azure is the only cloud provider on which we see this issue, and we see it on all Azure clusters. It almost always occurs when performing a HEAD request on meta.compacted.json, but rarely it also occurs on a GET request to meta.json. It is likely related to access patterns in Tempo b/c AFAIK other teams internally at Grafana are not seeing this issue.

We have only seen this issue occurring in compactors and only when polling the blocklist. Perhaps something about the rapid fire requests to Azure blob storage during polling is causing an issue?

@martin-jaeger-maersk

That suggested configuration prevents a different error unique to ABS. ABS sometimes throws a strange error when two processes attempt to write the same object at the same time. We were receiving the error here until we reduced blocklist_poll_tenant_index_builders to 1. It is 2 by default.

Following up with the thought that perhaps our default settings are hitting ABS too hard, I'm going to experiment with lowering storage.trace.blocklist_poll_concurrency to see if the issue goes away.

@joe-elliott
Member Author

In a cluster with ~250 blocks I changed storage.trace.blocklist_poll_concurrency from the default of 50 to 10, but we continued to see the issue. Then I lowered storage.trace.blocklist_poll_concurrency to 1 and the issue went away. This technically "fixes" the issue but is likely not useful for larger clusters where the time to complete a polling cycle would go up significantly. I do find it quite odd that the client appears to be making a DNS request every time we access an object, but maybe this is an intentional choice on Azure's part.
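
For reference, a minimal sketch of that override, assuming the option lives under storage.trace as the full name suggests:

storage:
  trace:
    blocklist_poll_concurrency: 1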

I'm going to spend some time trying to upgrade our client. There appear to be 2 official clients, and it's unclear what the relationship is between them.

@martinjaegerdk

I do find it quite odd that the client appears to be making a DNS request every time we access an object, but maybe this is an intentional choice on Azure's part.

I agree, that does sound very odd.
It could be the Azure Go client doing it intentionally, or simply a very low TTL on the DNS entry.

Where in the Tempo code is the Azure client called (in this particular case)?

@joe-elliott
Member Author

joe-elliott commented Jun 24, 2022

Here's where we begin polling the blocklist concurrently:

bg := boundedwaitgroup.New(p.cfg.PollConcurrency)

The actual poll function requests the block meta and, if that can't be found, the compacted block meta:

func (p *Poller) pollBlock(ctx context.Context, tenantID string, blockID uuid.UUID) (*backend.BlockMeta, *backend.CompactedBlockMeta, error) {

Both BlockMeta() and CompactedBlockMeta() eventually hit this method on the Azure backend:

func (rw *readerWriter) Read(ctx context.Context, name string, keypath backend.KeyPath, _ bool) (io.ReadCloser, int64, error) {

@martinjaegerdk

martinjaegerdk commented Jun 24, 2022

@joe-elliott A few more points that could make sense, since this is only seen in Azure:

DNS request throttle

https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/request-limits-and-throttling

Azure DNS and Azure Private DNS have a throttle limit of 500 read (GET) operations per 5 minutes.

https://docs.microsoft.com/en-us/azure/virtual-network/virtual-networks-name-resolution-for-vms-and-role-instances#considerations

DNS query traffic is throttled for each VM. Throttling shouldn't impact most applications. If request throttling is observed, ensure that client-side caching is enabled. For more information, see DNS client configuration.

DNS race condition when using UDP

Not Azure-specific per se, but it can be a result of the OS and patch level of the VMs used in AKS.

https://medium.com/asos-techblog/an-aks-performance-journey-part-2-networking-it-out-e253f5bb4f69

I found this article, which talks about all the things that can go wrong when hammering DNS over UDP.
One suggestion the author brings up is to force the container to use TCP instead of UDP for DNS. This adds more overhead, but could result in connections actually being established.

The thesis is that a high volume of DNS requests over UDP can cause a race condition in the kernel, which can result in UDP requests failing.
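
As a rough sketch of what forcing TCP could look like on Kubernetes (hypothetical pod spec fragment; use-vc is the glibc resolver option that makes DNS queries go over TCP):

spec:
  dnsConfig:
    options:
      - name: use-vc   # glibc resolv.conf option: send DNS queries over TCP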

@joe-elliott
Member Author

Using the newly added config in the linked PR and setting the blocklist_poll_jitter_ms value to 500, I was able to go 24 hours without seeing this issue.

#1518
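
For anyone wanting to try the same thing, a minimal sketch of the new setting, assuming it sits under storage.trace alongside the other blocklist_poll_* options:

storage:
  trace:
    blocklist_poll_jitter_ms: 500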

@martinjaegerdk

@joe-elliott
That's very interesting 👍
Does that mean that after 24 hours you saw the error, or simply that you have tested for 24 hours without seeing the error?

@joe-elliott
Member Author

I have been testing this change for 24 hours and have not seen the issue. We normally see it 10 - 20 times in a 24 hour period.

@martinjaegerdk

I have been testing this change for 24 hours and have not seen the issue. We normally see it 10 - 20 times in a 24 hour period.

Right, seems like a valid fix then. I suppose we would have to wait for this to land in a release, unless of course we want to build and publish images on our own.

@joe-elliott
Member Author

Every push to main builds a new image: https://hub.docker.com/r/grafana/tempo/tags

You could just update your compactor image if you'd like to leave the rest of the components alone.

@joe-elliott
Member Author

This seemed to be fixed for a while, but it has recently returned even with the jitter setting referenced above. We are still looking into this.

@electron0zero
Member

We managed to fix these errors with #1632 combined with setting ndots=3 in the Tempo pods' DNS config.

In #1632, we increased MaxIdleConns and MaxIdleConnsPerHost for the Azure backend to 100 (the default is 2).
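
Roughly speaking, the idea looks like this on a plain net/http transport (a sketch only, not the actual change in #1632; the variable name is illustrative):

package azure

import "net/http"

// Sketch only: raise the idle-connection limits so the client reuses
// connections to the blob endpoint instead of re-dialing (and therefore
// re-resolving DNS) for almost every request.
var blobHTTPClient = &http.Client{
	Transport: &http.Transport{
		MaxIdleConns:        100, // http.DefaultTransport also uses 100 here
		MaxIdleConnsPerHost: 100, // Go's per-host default is only 2
	},
}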

Testing was done in a Tempo cluster running in AKS and configured with the Azure backend.

Testing with only #1632 (max connections set to 100):

Errors reduced from 4-5 errors per hour to 1 error every 2-3 hours.

Testing with #1632 and Pod DNS Config set to ndots=3 (max connections 100 and ndots=3):

No errors.

Check out the following links to learn more about the ndots config and how to set it in your pods:

@bboreham
Contributor

bboreham commented Aug 23, 2022

See also grafana/jsonnet-libs#762; strictly speaking, lowering ndots breaks Kubernetes service discovery, but if you remember not to rely on that specific behavior you'll be OK.

[EDIT: was previously linking to non-public issue]

@electron0zero
Member

Another way to solve this without modifying ndots is to add a . at the end of the hostname to make it fully qualified. If we went that route, we wouldn't have to lower ndots and risk breaking Kubernetes service discovery.

But we can't do that, because Azure responds with 400 Bad Request, so we have to go with setting ndots.

Here is what Azure responds with when we add the . at the end:

GET https://tempo**************.blob.core.windows.net./**************?restype=container&timeout=61

Status: 400 Bad Request
Content-Length: [334]
Content-Type: [text/html; charset=us-ascii]
Date: [Thu, 11 Aug 2022 19:38:19 GMT]
Server: [Microsoft-HTTPAPI/2.0]

xml: (*azblob.storageError).UnmarshalXML did not consume entire <HTML> element

This looks like a bug; I will file a ticket with Azure.

Note: this works fine with curl because curl trims the . in the URL. Try it with Chrome or other tools that don't trim the dot.

@hwu-talendbj

@electron0zero Hi, thanks for the fix.
So after the new image is built, we also need to set the Tempo pods' DNS config to ndots=3 to resolve the issue, right?
We deploy Tempo through the Helm chart, and I'm not sure if there is a good way to apply this setting.

@electron0zero
Member

electron0zero commented Sep 1, 2022

In our case, we were seeing these errors only in compactor pods, so we added ndots=3 in compactor pods for Azure.

We did this by setting the compactor Pod's DNS config:

spec:
  dnsConfig:
    options:
      - name: ndots
        value: "3"

I am not sure about Helm. If you are using the chart from https://github.com/grafana/helm-charts, can you please open an issue in that repo? :)

@electron0zero
Member

electron0zero commented Sep 8, 2022

Update from Azure support: Host headers are not allowed to have a trailing dot in Azure services. You can use an FQDN, but you need to set the Host header without the trailing dot.

  • curl -v --header 'Host: *****03.blob.core.windows.net' https://******03.blob.core.windows.net./*******-data?restype=container&timeout=61 -> this is allowed

  • curl -v --header 'Host: *****03.blob.core.windows.net.' https://******03.blob.core.windows.net./*******-data?restype=container&timeout=61 -> this is NOT allowed, and will result in 400 (notice the dot in Host header)

We need to modify the Azure client to trim the trailing dot from the Host header for FQDNs to work with Azure.

For example, curl trims the trailing dot from the Host header; see curl/curl@3a61428 for details.
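
A minimal Go sketch of the idea (the hostname is illustrative, and this is not the actual client change): keep the trailing dot in the URL so the resolver treats the name as fully qualified, but send the Host header without it.

package main

import (
	"fmt"
	"net/http"
	"strings"
)

func main() {
	fqdn := "example.blob.core.windows.net." // illustrative hostname only

	// Trailing dot in the URL makes the resolver skip the ndots search list...
	req, err := http.NewRequest(http.MethodGet, "https://"+fqdn+"/container?restype=container", nil)
	if err != nil {
		panic(err)
	}

	// ...but Azure rejects a Host header with a trailing dot (400), so trim it.
	req.Host = strings.TrimSuffix(fqdn, ".")

	fmt.Println("URL host:", req.URL.Host, "Host header:", req.Host)
}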

Created #1726 to track this for Azure and other storage backends.
