
Micrometer client not reconnecting to statsd after server is restarted #3563

Closed
dunnk2022 opened this issue Dec 9, 2022 · 18 comments
Labels
registry: statsd (A StatsD Registry related issue)

Comments

@dunnk2022

Describe the bug
The micrometer client does not reconnect to the statsd server (Datadog agent) when the agent is restarted.

Environment

  • Micrometer version - 1.9.5
  • Micrometer registry - statsd, Datadog flavor
  • OS: Linux
  • Java version: 11.0.14

To Reproduce
How to reproduce the bug:
Restart the statsd server (datadog-agent)
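For example, on a host where the agent runs as a systemd service (an assumption; the exact restart mechanism depends on how the agent is deployed):

```sh
# restart the Datadog agent, which hosts the DogStatsD listener on UDP 8125
sudo systemctl restart datadog-agent
```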

Expected behavior
The micrometer client re-establishes a connection to the statsd server on port 8125 when the daemon is back up.
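For context, this is the kind of minimal setup involved (a sketch only; the host and metric name here are illustrative, not our exact configuration):

```java
import io.micrometer.core.instrument.Clock;
import io.micrometer.statsd.StatsdConfig;
import io.micrometer.statsd.StatsdFlavor;
import io.micrometer.statsd.StatsdMeterRegistry;

public class StatsdRegistrySketch {
    public static void main(String[] args) {
        StatsdConfig config = new StatsdConfig() {
            @Override
            public String get(String key) {
                return null; // fall back to the defaults / overrides below
            }

            @Override
            public StatsdFlavor flavor() {
                return StatsdFlavor.DATADOG; // Datadog (DogStatsD) line protocol
            }

            @Override
            public String host() {
                return "localhost"; // assumption: agent reachable on the local host
            }

            @Override
            public int port() {
                return 8125; // DogStatsD's default UDP port
            }
        };

        StatsdMeterRegistry registry = new StatsdMeterRegistry(config, Clock.SYSTEM);
        registry.counter("example.restart.check").increment(); // illustrative metric
    }
}
```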


@shakuzen shakuzen added the registry: statsd label Jan 25, 2023
@shakuzen
Member

I tried to reproduce this locally and was unable to. I started the Datadog Agent on my MacBook, started an application using Micrometer's statsd registry with the Datadog flavor, waited for some metrics to be sent, restarted the Datadog Agent, waited for more metrics to be sent, and I was able to see the metrics from after the restart in Datadog. Could you help us figure out how to reproduce the issue you're seeing?

@shakuzen shakuzen added the waiting for feedback label and removed the waiting-for-triage label Jan 25, 2023
@dunnk2022
Author

dunnk2022 commented Jan 25, 2023

Thanks for the response @shakuzen

For context, the Datadog agent version is 7.38.0-jmx.

@dunnk2022
Author

We really just do a rolling restart of the agent and the issue occurs. Perhaps turning on DEBUG logging for the client would help gather more info?

@shakuzen
Member

If you turn on debug logging for classes under the io.micrometer.shaded.reactor.netty package, we can probably get some more information about what's happening. Or turn on debug logging for everything under io.micrometer.
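For example, if the application happens to be a Spring Boot app (an assumption; with plain Logback, equivalent <logger> entries in logback.xml do the same), the levels can be raised in application.properties:

```properties
# Debug logging for the shaded reactor-netty transport used by the statsd registry
logging.level.io.micrometer.shaded.reactor.netty=DEBUG
# Or, more broadly, everything under io.micrometer
logging.level.io.micrometer=DEBUG
```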

@raymondchen625

We ran into a similar issue lately. Our Datadog agent (v7.43.0) restarted, and the Java application with Micrometer 1.9.13 stopped sending custom metrics to Datadog. I logged into the container and verified that connectivity to the agent host's UDP port 8125 was fine: I could send some metrics directly to that port and find them in the Datadog console. Once I restarted those containers, the metrics could be sent successfully again.
The Datadog agent's endpoint is a K8s service hostname, so I don't think there was an IP change.
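A manual check like that can be as simple as pushing a hand-written StatsD line at the agent's UDP port (the metric name is arbitrary, and nc flags vary a little between netcat variants):

```sh
# AGENT_HOST is the K8s service hostname of the Datadog agent
echo -n "connectivity.check:1|c" | nc -u -w1 "$AGENT_HOST" 8125
```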

@shakuzen
Member

@raymondchen625 thanks for the additional datapoint. Unfortunately, since I haven't been able to reproduce it locally or get logs from anyone with more info about what is happening, we haven't made any progress on this. If you would be able to provide steps to reproduce this (ideally locally or with as minimal other things involved as possible) or some debug logs, it may help get to the bottom of this.

@raymondchen625

> @raymondchen625 thanks for the additional datapoint. Unfortunately, since I haven't been able to reproduce it locally or get logs from anyone with more info about what is happening, we haven't made any progress on this. If you would be able to provide steps to reproduce this (ideally locally or with as minimal other things involved as possible) or some debug logs, it may help get to the bottom of this.

I could not reproduce this locally. Unfortunately, I didn't try to capture IP packets before restarting the pods to see whether Micrometer was still sending traffic somewhere. I did check the application logs and didn't see any errors. For custom metrics sent to UDP 8125, I believe there are no error logs when the agent is down. But we also had APM metrics sent to TCP 8126, which should produce error logs if the service is down. That makes me think Micrometer might have stopped sending metrics altogether, so I might have captured nothing even if I had run tcpdump.

@shakuzen
Member

shakuzen commented Sep 4, 2023

I suspect (but could be wrong) that there will be some logs at debug level when the issue occurs, but most people don't have debug logs enabled on their production services.

@raymondchen625

> I suspect (but could be wrong) that there will be some logs at debug level when the issue occurs, but most people don't have debug logs enabled on their production services.

We found the same issue in our staging environment. I killed one of our two pods, and it's now clear from the Datadog dashboard that only the new pod is able to send metrics. I used tcpdump to capture IP packets on both containers; they are both sending TCP packets to port 8126 or UDP datagrams to port 8125.
It seems there is something wrong on the Datadog agent side. When I investigate further, I will set the log level of io.micrometer.shaded.reactor.netty to DEBUG to get better insight.
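A capture along these lines is enough to see that traffic (the interface and filter shown are illustrative):

```sh
# capture DogStatsD (UDP 8125) and APM (TCP 8126) traffic from inside each container
tcpdump -nn -i any 'udp port 8125 or tcp port 8126'
```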

@raymondchen625

Hi @shakuzen, I finally found a scenario that reproduces the issue.
There are a few facts about the micrometer-registry-statsd component (1.9.17 in my test):

  1. It doesn't print any log message when there are problems sending data to the UDP port of the target host, whether the hostname is unresolvable or the statsd service on the target host is down. It's just totally quiet.
  2. IP resolution happens only once, when the UdpClient instance is initialized. The instance is reused and no further DNS resolution happens.
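A minimal sketch of that resolve-once behavior, written against plain (unshaded) reactor-netty for illustration rather than Micrometer's actual internals:

```java
import reactor.netty.Connection;
import reactor.netty.udp.UdpClient;

public class ResolveOnceSketch {
    public static void main(String[] args) {
        // The hostname is resolved when connect() runs; the resulting Connection
        // (and its remote address) is then reused for every datagram, so a later
        // DNS change for the statsd hostname is never picked up unless the
        // client is recreated.
        Connection connection = UdpClient.create()
                .host("mystatsd.mynamespace.svc.cluster.local")
                .port(8125)
                .connect()
                .block();

        connection.dispose();
    }
}
```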

To reproduce the issue:

  1. Put the statsd service, listening on a UDP port, behind a Kubernetes service, and configure the client application's statsdHost to the service's Kubernetes-internal DNS name, such as mystatsd.mynamespace.svc.cluster.local.
  2. Deploy both the statsd service and the test application. We should see the UDP datagrams being routed to one of the pods behind the statsd service.
  3. Undeploy the statsd service. No error log is printed by the client application. If we log into the client application's container and use netstat or lsof to query the UDP connections (see the commands sketched after these steps), we can see that the socket's destination IP is still the service hostname's former IP address, which no longer works.
  4. Redeploy the test statsd Kubernetes service; the hostname mystatsd.mynamespace.svc.cluster.local now resolves to a new IP, but the client application is still sending datagrams to the non-existent old IP.
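The check referred to in step 3 looks roughly like this from inside the client application's container (the flags are illustrative):

```sh
# the connected UDP socket still shows the old destination IP
netstat -anu | grep 8125
# or, per process: lsof -nP -iUDP -a -p <app-pid>

# compare with what the service hostname currently resolves to
nslookup mystatsd.mynamespace.svc.cluster.local
```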

I'm not sure whether this is exactly the same scenario @dunnk2022 had, but it does indicate an issue with the current implementation: it never resolves the hostname again after UdpClient initialization.
Ideally, it could be configured to detect DNS changes and reinitialize the UdpClient so that metrics go to the new destination. If that's not possible, it should at least print some error logs.

@dunnk2022
Author

@raymondchen625 We added an initContainer to our Datadog host agent pods to delete conntrack entries before starting the Datadog agent.

This change was based on comments from this issue - containernetworking/plugins#123.

We have not had a report of missing metrics sent over StatsD since.
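For anyone else hitting this, a rough sketch of what such an initContainer can look like (the image, capability, and conntrack filter are assumptions, not our exact manifest; see the linked plugins issue for the background):

```yaml
# Runs before the Datadog agent container and flushes stale UDP conntrack
# entries for port 8125; it must run in the host network namespace
# (hostNetwork: true on the pod) to affect the node's conntrack table.
initContainers:
  - name: flush-statsd-conntrack
    image: example.registry/conntrack-tools:latest   # any image shipping the conntrack CLI
    securityContext:
      capabilities:
        add: ["NET_ADMIN"]
    command:
      - sh
      - -c
      - conntrack -D -p udp --orig-port-dst 8125 || true  # non-zero exit when nothing matches
```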

@marcingrzejszczak
Contributor

After looking at the latest comments, can we assume that the problem is fixed?

@shakuzen
Member

@raymondchen625 What you've described sounds like #1252. If that is still an issue in the latest versions, please comment over there so we know it still needs to be fixed. I'm not sure if that was the original problem here or not. If the server was restarted with the same IP, I guess it's a different problem.

@raymondchen625

> If the server was restarted with the same IP, I guess it's a different problem.

Yes, that's probably related to a series of kube-proxy issues.

IMO, this issue can be closed. The DNS change detection issue can be addressed by #1252 .

@marcingrzejszczak
Contributor

Closing in favour of #1252

@marcingrzejszczak marcingrzejszczak closed this as not planned Dec 21, 2023
@marcingrzejszczak marcingrzejszczak removed the waiting for feedback label Dec 21, 2023
@dunnk2022
Author

I would say that my reported issue is fixed.

Thanks All!

@marcingrzejszczak
Contributor

@dunnk2022 how is it fixed? I'm asking because #1252 still seems to be open.

@dunnk2022
Author

@marcingrzejszczak My issue was not on the Micrometer end.
It is related to stale conntrack entries and the portmap version used within our CNI (Flannel v0.11), as described in containernetworking/plugins#123.

> @raymondchen625 We added an initContainer to our Datadog host agent pods to delete conntrack entries before starting the Datadog agent.
>
> This change was based on comments from this issue - containernetworking/plugins#123.
>
> We have not had a report of missing metrics sent over StatsD since.
