
Micrometer client not reconnecting to statsd after server is restarted #3563

Closed
dunnk2022 opened this issue Dec 9, 2022 · 18 comments
Labels
registry: statsd (A StatsD Registry related issue)

Comments

@dunnk2022

Describe the bug
The micrometer client does not reconnect to the statsd server (Datadog agent) when the agent is restarted.

Environment

  • Micrometer version - 1.9.5
  • Micrometer registry - statsd, Datadog flavor
  • OS: Linux
  • Java version: 11.0.14

To Reproduce
How to reproduce the bug:
Restart the statsd server (datadog-agent)
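For example, on a host where the agent runs as a systemd service (an assumption; the exact restart mechanism depends on how the agent is deployed):

```sh
# restart the Datadog agent, which hosts the DogStatsD listener on UDP 8125
sudo systemctl restart datadog-agent
```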

Expected behavior
The micrometer client re-establishes a connection to the statsd server on port 8125 when the daemon is back up.
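For context, this is the kind of minimal setup involved (a sketch only; the host and metric name here are illustrative, not our exact configuration):

```java
import io.micrometer.core.instrument.Clock;
import io.micrometer.statsd.StatsdConfig;
import io.micrometer.statsd.StatsdFlavor;
import io.micrometer.statsd.StatsdMeterRegistry;

public class StatsdRegistrySketch {
    public static void main(String[] args) {
        StatsdConfig config = new StatsdConfig() {
            @Override
            public String get(String key) {
                return null; // fall back to the defaults / overrides below
            }

            @Override
            public StatsdFlavor flavor() {
                return StatsdFlavor.DATADOG; // Datadog (DogStatsD) line protocol
            }

            @Override
            public String host() {
                return "localhost"; // assumption: agent reachable on the local host
            }

            @Override
            public int port() {
                return 8125; // DogStatsD's default UDP port
            }
        };

        StatsdMeterRegistry registry = new StatsdMeterRegistry(config, Clock.SYSTEM);
        registry.counter("example.restart.check").increment(); // illustrative metric
    }
}
```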


@shakuzen shakuzen added the registry: statsd label Jan 25, 2023
@shakuzen
Member

I tried to reproduce this locally and was unable to. I started the Datadog Agent on my MacBook, started an application using Micrometer's statsd registry with the Datadog flavor, waited for some metrics to be sent, restarted the Datadog Agent, waited for more metrics to be sent, and I was able to see the metrics from after the restart in Datadog. Could you help us figure out how to reproduce the issue you're seeing?

@shakuzen shakuzen added the waiting for feedback label and removed the waiting-for-triage label Jan 25, 2023
@dunnk2022
Author

dunnk2022 commented Jan 25, 2023

Thanks for the response @shakuzen

For context, the Datadog agent version is 7.38.0-jmx.

@dunnk2022
Author

We really just do a rolling restart of the agent and the issue occurs. Perhaps turning on DEBUG logging for the client would help gather more info?

@shakuzen
Member

If you turn on debug logging for classes under the io.micrometer.shaded.reactor.netty package, we can probably get some more information about what's happening. Or turn on debug logging for everything under io.micrometer.
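For example, if the application happens to be a Spring Boot app (an assumption; with plain Logback, equivalent <logger> entries in logback.xml do the same), the levels can be raised in application.properties:

```properties
# Debug logging for the shaded reactor-netty transport used by the statsd registry
logging.level.io.micrometer.shaded.reactor.netty=DEBUG
# Or, more broadly, everything under io.micrometer
logging.level.io.micrometer=DEBUG
```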

@raymondchen625

We ran into a similar issue lately. Our Datadog agent (v7.43.0) restarted, and the Java application with Micrometer 1.9.13 stopped sending custom metrics to Datadog. I logged into the container and verified that connectivity to the agent host's UDP port 8125 was fine: I could send some metrics directly to that port and find them in the Datadog console. Once I restarted those containers, the metrics could be sent successfully again.
The Datadog agent's endpoint is a K8s service hostname, so I don't think there was an IP change.
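A manual check like that can be as simple as pushing a hand-written StatsD line at the agent's UDP port (the metric name is arbitrary, and nc flags vary a little between netcat variants):

```sh
# AGENT_HOST is the K8s service hostname of the Datadog agent
echo -n "connectivity.check:1|c" | nc -u -w1 "$AGENT_HOST" 8125
```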

@shakuzen
Member

@raymondchen625 thanks for the additional datapoint. Unfortunately, since I haven't been able to reproduce it locally or get logs from anyone with more info about what is happening, we haven't made any progress on this. If you would be able to provide steps to reproduce this (ideally locally or with as minimal other things involved as possible) or some debug logs, it may help get to the bottom of this.

@raymondchen625

> @raymondchen625 thanks for the additional datapoint. Unfortunately, since I haven't been able to reproduce it locally or get logs from anyone with more info about what is happening, we haven't made any progress on this. If you would be able to provide steps to reproduce this (ideally locally or with as minimal other things involved as possible) or some debug logs, it may help get to the bottom of this.

I could not reproduce this locally. Unfortunately, I didn't try to capture IP packets before restarting the pods to see whether Micrometer was still sending traffic somewhere. I did check the application logs and didn't see any errors. For custom metrics sent to UDP 8125, I believe there are no error logs when the agent is down. But we also had APM metrics sent to TCP 8126, which should produce error logs if the service is down. That makes me think Micrometer might have stopped sending metrics altogether, so I might have captured nothing even if I had run tcpdump.

@shakuzen
Member

shakuzen commented Sep 4, 2023

I suspect (but could be wrong) that there will be some logs at debug level when the issue occurs, but most people don't have debug logs enabled on their production services.

@raymondchen625

> I suspect (but could be wrong) that there will be some logs at debug level when the issue occurs, but most people don't have debug logs enabled on their production services.

We found the same issue in our staging environment. I killed one of our two pods, and it's now clear from the Datadog dashboard that only the new pod is able to send metrics. I used tcpdump to capture IP packets on both containers; they are both sending TCP packets to port 8126 or UDP datagrams to port 8125.
It seems there is something wrong on the Datadog agent side. When I investigate further, I will set the log level of io.micrometer.shaded.reactor.netty to DEBUG to get better insight.
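A capture along these lines is enough to see that traffic (the interface and filter shown are illustrative):

```sh
# capture DogStatsD (UDP 8125) and APM (TCP 8126) traffic from inside each container
tcpdump -nn -i any 'udp port 8125 or tcp port 8126'
```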

@raymondchen625

Hi @shakuzen, I finally found a scenario that reproduces the issue.
There are a few facts about the micrometer-registry-statsd component (1.9.17 in my test):

  1. It doesn't print any log message when there are problems sending data to the UDP port of the target host, whether the hostname is unresolvable or the statsd service on the target host is down. It's just totally quiet.
  2. IP resolution happens only once, when the UdpClient instance is initialized. The instance is reused and no further DNS resolution happens.
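A minimal sketch of that resolve-once behavior, written against plain (unshaded) reactor-netty for illustration rather than Micrometer's actual internals:

```java
import reactor.netty.Connection;
import reactor.netty.udp.UdpClient;

public class ResolveOnceSketch {
    public static void main(String[] args) {
        // The hostname is resolved when connect() runs; the resulting Connection
        // (and its remote address) is then reused for every datagram, so a later
        // DNS change for the statsd hostname is never picked up unless the
        // client is recreated.
        Connection connection = UdpClient.create()
                .host("mystatsd.mynamespace.svc.cluster.local")
                .port(8125)
                .connect()
                .block();

        connection.dispose();
    }
}
```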

To reproduce the issue:

  1. Put the statsd service, listening on a UDP port, behind a Kubernetes service, and configure the client application's statsdHost to the service's Kubernetes-internal DNS name, such as mystatsd.mynamespace.svc.cluster.local.
  2. Deploy both the statsd service and the test application. We should see the UDP datagrams being routed to one of the pods behind the statsd service.
  3. Undeploy the statsd service. No error log is printed by the client application. If we log into the client application's container and use netstat or lsof to query the UDP connections (see the commands sketched after these steps), we can see that the socket's destination IP is still the service hostname's former IP address, which no longer works.
  4. Redeploy the test statsd Kubernetes service; the hostname mystatsd.mynamespace.svc.cluster.local now resolves to a new IP, but the client application is still sending datagrams to the non-existent old IP.
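The check referred to in step 3 looks roughly like this from inside the client application's container (the flags are illustrative):

```sh
# the connected UDP socket still shows the old destination IP
netstat -anu | grep 8125
# or, per process: lsof -nP -iUDP -a -p <app-pid>

# compare with what the service hostname currently resolves to
nslookup mystatsd.mynamespace.svc.cluster.local
```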

I'm not sure whether this is exactly the same scenario @dunnk2022 had, but it does indicate an issue with the current implementation: it never resolves the hostname again after UdpClient initialization.
Ideally, it could be configured to detect DNS changes and reinitialize the UdpClient so that metrics go to the new destination. If that's not possible, it should at least print some error logs.

@dunnk2022
Author

@raymondchen625 We added an initContainer to our Datadog host agent pods to delete conntrack entries before starting the Datadog agent.

This change was based on comments from this issue - containernetworking/plugins#123.

We have not had a report of missing metrics sent over StatsD since.
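For anyone else hitting this, a rough sketch of what such an initContainer can look like (the image, capability, and conntrack filter are assumptions, not our exact manifest; see the linked plugins issue for the background):

```yaml
# Runs before the Datadog agent container and flushes stale UDP conntrack
# entries for port 8125; it must run in the host network namespace
# (hostNetwork: true on the pod) to affect the node's conntrack table.
initContainers:
  - name: flush-statsd-conntrack
    image: example.registry/conntrack-tools:latest   # any image shipping the conntrack CLI
    securityContext:
      capabilities:
        add: ["NET_ADMIN"]
    command:
      - sh
      - -c
      - conntrack -D -p udp --orig-port-dst 8125 || true  # non-zero exit when nothing matches
```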

@marcingrzejszczak
Contributor

After looking at the latest comments, can we assume that the problem is fixed?

@shakuzen
Member

@raymondchen625 What you've described sounds like #1252. If that is still an issue in the latest versions, please comment over there so we know it still needs to be fixed. I'm not sure if that was the original problem here or not. If the server was restarted with the same IP, I guess it's a different problem.

@raymondchen625

> If the server was restarted with the same IP, I guess it's a different problem.

Yes, that's probably related to a series of kube-proxy issues.

IMO, this issue can be closed. The DNS change detection issue can be addressed by #1252 .

@marcingrzejszczak
Contributor

Closing in favour of #1252

@marcingrzejszczak marcingrzejszczak closed this as not planned Dec 21, 2023
@marcingrzejszczak marcingrzejszczak removed the waiting for feedback label Dec 21, 2023
@dunnk2022
Author

I would say that my reported issue is fixed.

Thanks All!

@marcingrzejszczak
Contributor

@dunnk2022 how is it fixed? I'm asking because #1252 still seems to be open.

@dunnk2022
Author

@marcingrzejszczak My issue was not on the Micrometer end.
It is related to stale conntrack entries and the portmap version used within our CNI (Flannel v0.11), as described in containernetworking/plugins#123.

> @raymondchen625 We added an initContainer to our Datadog host agent pods to delete conntrack entries before starting the Datadog agent.
>
> This change was based on comments from this issue - containernetworking/plugins#123.
>
> We have not had a report of missing metrics sent over StatsD since.
