
dropping packets to one endpoint can impact other endpoints (routes?) #57

Closed
Dieterbe opened this issue Mar 10, 2015 · 1 comment

Comments

@Dieterbe
Contributor

With a config like

addRoute sendAllMatch carbon-default  host1:2005 spool=true  host2:2004 spool=true

when host1 becomes slow and starts dropping packets (seen via the dest ... nonBlockingSend -> dropping due to slow conn messages and in grafana), host2 sometimes exhibits the same behavior: a drop in traffic at the same time, following the same pattern.
We must always make sure the pipeline can't hang, so I traced through the code (starting at table.Dispatch) and verified that none of the route/destination/conn code can block the processing. Of particular importance is the Destination.relay loop: it does contain explicit flush and shutdown handling, which might take a while, but the former is seemingly never called, and the latter is definitely never called during runtime.
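For orientation, the shape of that loop is roughly as follows. This is a minimal sketch of my reading of it, not the actual carbon-relay-ng code; the channel names and the out channel are assumptions:

```go
package main

import (
	"log"
	"time"
)

// relay is a minimal sketch of a Destination.relay-style loop: the normal
// data path must never block; only the flush and shutdown branches are
// allowed to take time (channel names here are hypothetical).
func relay(in <-chan []byte, flush <-chan chan bool, shutdown <-chan bool, out chan<- []byte) {
	for {
		select {
		case buf := <-in:
			// forward without blocking: if the conn buffer is full, drop
			select {
			case out <- buf:
			default:
				log.Printf("nonBlockingSend -> dropping due to slow conn")
			}
		case done := <-flush:
			// explicit flush: may take a while, but is seemingly never triggered
			done <- true
		case <-shutdown:
			// only relevant at shutdown, never during normal runtime
			return
		}
	}
}

func main() {
	in := make(chan []byte, 1)
	out := make(chan []byte) // no reader: simulates a connection that can't keep up
	go relay(in, make(chan chan bool), make(chan bool), out)
	in <- []byte("some.metric 1 1425970800")
	time.Sleep(10 * time.Millisecond) // give the relay goroutine time to drop + log
}
```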
So I installed a profiling hook by importing _ "net/http/pprof", loaded up a relay, blasted it with input traffic to make it start dropping, and then collected a profile:
[profile screenshot]
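For anyone who wants to reproduce this, enabling the hook and exposing the profiling endpoints looks roughly like the sketch below. The listen address is an assumption; any free port works.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/ handlers on the default mux
)

func main() {
	// serve the profiling endpoints in the background
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the relay itself would run here; block forever for this sketch ...
	select {}
}
```

With the relay under load, something like go tool pprof http://localhost:6060/debug/pprof/profile then collects a 30-second CPU profile.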
Clearly we spend too much time logging. For comparison, the other functions invoked by relay are around 0.01s ~ 0.04s.

My current thinking is:

  • we can get too busy executing logging logic (specifically the “warning.. dropped” messages), causing the pipe to hang, which blocks delivery to the other endpoints (see the benchmark sketch after this list)
  • in fact, it might even cause more dropped packets to the bad route itself
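To get a feel for how expensive that per-drop logging is relative to the send it accompanies, a micro-benchmark along these lines could be used (a sketch, not a measurement against the real code; file and package names are placeholders):

```go
// relay_bench_test.go (hypothetical file/package names): run with go test -bench=.
package relay

import (
	"io/ioutil"
	"log"
	"testing"
)

// cost of the non-blocking hand-off itself
func BenchmarkNonBlockingSend(b *testing.B) {
	ch := make(chan []byte, 1)
	buf := []byte("some.metric 1 1425970800")
	for i := 0; i < b.N; i++ {
		select {
		case ch <- buf:
		default: // channel fills after the first send; later iterations are the pure drop path
		}
	}
}

// cost of formatting a warning line for every dropped metric
func BenchmarkDropWarning(b *testing.B) {
	log.SetOutput(ioutil.Discard) // even with output discarded, formatting and locking have a cost
	for i := 0; i < b.N; i++ {
		log.Printf("dest %s nonBlockingSend -> dropping due to slow conn", "host1:2005")
	}
}
```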

I've tried demoting these high-volume warning messages to info (which is appropriate, since we use that level to trace what happens to individual metric messages); as long as we document that setting a loglevel below warning may cause a lot of log traffic, we're good. So I ran again with the loglevel at warning and tested again (this time using the /debug/vars2, because no more warn messages show on screen). There's definitely a lot less time spent in logger.Info, but still some, and also significant time spent in logger.Debug. Not sure why the latter, as I changed the messages to Info; maybe it's because we now process so much traffic that the Debug calls also start adding up.
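If the logger only checks the level after building the message, that would explain why logger.Info and logger.Debug still show up even when the loglevel is warning. A guard that exits before formatting keeps those calls cheap; the Level and Logger below are generic stand-ins, not the project's actual logging package:

```go
package main

import "fmt"

type Level int

const (
	DEBUG Level = iota
	INFO
	WARNING
)

type Logger struct{ min Level }

// Info only pays the formatting and I/O cost when the message will actually be emitted.
func (l *Logger) Info(format string, args ...interface{}) {
	if l.min > INFO {
		return // cheap early exit for loglevel=warning
	}
	fmt.Printf("INFO "+format+"\n", args...)
}

func main() {
	l := &Logger{min: WARNING}
	// with loglevel=warning this returns almost immediately instead of
	// formatting a line for every dropped metric
	l.Info("dest %s nonBlockingSend -> dropping due to slow conn", "host1:2005")
}
```

Even with the early exit, the variadic args still get boxed into an interface slice on every call, which may be why logger.Info and logger.Debug never drop to zero in the profile.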

@Dieterbe
Contributor Author

fixed by the above commit!
