
dropping packets to one endpoint can impact other endpoints (routes?) #57

Closed
Dieterbe opened this issue Mar 10, 2015 · 1 comment

Comments

@Dieterbe
Contributor

With a config like

addRoute sendAllMatch carbon-default  host1:2005 spool=true  host2:2004 spool=true

when host1 becomes slow and starts dropping packets (seen via the dest ... nonBlockingSend -> dropping due to slow conn messages and in grafana), host2 sometimes exhibits the same behavior: a drop in traffic at the same time, following the same pattern.
We must always make sure the pipeline can't hang, so I traced through the code (starting at table.Dispatch) and verified that none of the route/destination/conn code can block the processing. Of particular importance is the Destination.relay loop: it does contain explicit flush and shutdown handling, which might take a while, but the former is seemingly never called, and the latter is definitely never called during runtime.
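For orientation, the shape of that loop is roughly as follows. This is a minimal sketch of my reading of it, not the actual carbon-relay-ng code; the channel names and the out channel are assumptions:

```go
package main

import (
	"log"
	"time"
)

// relay is a minimal sketch of a Destination.relay-style loop: the normal
// data path must never block; only the flush and shutdown branches are
// allowed to take time (channel names here are hypothetical).
func relay(in <-chan []byte, flush <-chan chan bool, shutdown <-chan bool, out chan<- []byte) {
	for {
		select {
		case buf := <-in:
			// forward without blocking: if the conn buffer is full, drop
			select {
			case out <- buf:
			default:
				log.Printf("nonBlockingSend -> dropping due to slow conn")
			}
		case done := <-flush:
			// explicit flush: may take a while, but is seemingly never triggered
			done <- true
		case <-shutdown:
			// only relevant at shutdown, never during normal runtime
			return
		}
	}
}

func main() {
	in := make(chan []byte, 1)
	out := make(chan []byte) // no reader: simulates a connection that can't keep up
	go relay(in, make(chan chan bool), make(chan bool), out)
	in <- []byte("some.metric 1 1425970800")
	time.Sleep(10 * time.Millisecond) // give the relay goroutine time to drop + log
}
```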
So I installed a profiling hook by importing _ "net/http/pprof", loaded up a relay, blasted it with input traffic to make it start dropping, and then collected a profile:
[profile screenshot]
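For anyone who wants to reproduce this, enabling the hook and exposing the profiling endpoints looks roughly like the sketch below. The listen address is an assumption; any free port works.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/ handlers on the default mux
)

func main() {
	// serve the profiling endpoints in the background
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the relay itself would run here; block forever for this sketch ...
	select {}
}
```

With the relay under load, something like go tool pprof http://localhost:6060/debug/pprof/profile then collects a 30-second CPU profile.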
Clearly we spend too much time logging. For comparison, the other functions invoked by relay are around 0.01s ~ 0.04s.

My current thinking is:

  • we can get too busy executing logging logic (specifically the “warning.. dropped” messages), causing the pipe to hang, which blocks delivery to the other endpoints (see the benchmark sketch after this list)
  • in fact, it might even cause more dropped packets to the bad route itself
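To get a feel for how expensive that per-drop logging is relative to the send it accompanies, a micro-benchmark along these lines could be used (a sketch, not a measurement against the real code; file and package names are placeholders):

```go
// relay_bench_test.go (hypothetical file/package names): run with go test -bench=.
package relay

import (
	"io/ioutil"
	"log"
	"testing"
)

// cost of the non-blocking hand-off itself
func BenchmarkNonBlockingSend(b *testing.B) {
	ch := make(chan []byte, 1)
	buf := []byte("some.metric 1 1425970800")
	for i := 0; i < b.N; i++ {
		select {
		case ch <- buf:
		default: // channel fills after the first send; later iterations are the pure drop path
		}
	}
}

// cost of formatting a warning line for every dropped metric
func BenchmarkDropWarning(b *testing.B) {
	log.SetOutput(ioutil.Discard) // even with output discarded, formatting and locking have a cost
	for i := 0; i < b.N; i++ {
		log.Printf("dest %s nonBlockingSend -> dropping due to slow conn", "host1:2005")
	}
}
```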

I've tried demoting these high-volume warning messages to info (which is appropriate, since we use that level to trace what happens to individual metric messages); as long as we document that setting a loglevel below warning may cause a lot of log traffic, we're good. So I ran again with the loglevel at warning and tested again (this time using the /debug/vars2, because no more warn messages show on screen). There's definitely a lot less time spent in logger.Info, but still some, and also significant time spent in logger.Debug. Not sure why the latter, as I changed the messages to Info; maybe it's because we now process so much traffic that the Debug calls also start adding up.
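If the logger only checks the level after building the message, that would explain why logger.Info and logger.Debug still show up even when the loglevel is warning. A guard that exits before formatting keeps those calls cheap; the Level and Logger below are generic stand-ins, not the project's actual logging package:

```go
package main

import "fmt"

type Level int

const (
	DEBUG Level = iota
	INFO
	WARNING
)

type Logger struct{ min Level }

// Info only pays the formatting and I/O cost when the message will actually be emitted.
func (l *Logger) Info(format string, args ...interface{}) {
	if l.min > INFO {
		return // cheap early exit for loglevel=warning
	}
	fmt.Printf("INFO "+format+"\n", args...)
}

func main() {
	l := &Logger{min: WARNING}
	// with loglevel=warning this returns almost immediately instead of
	// formatting a line for every dropped metric
	l.Info("dest %s nonBlockingSend -> dropping due to slow conn", "host1:2005")
}
```

Even with the early exit, the variadic args still get boxed into an interface slice on every call, which may be why logger.Info and logger.Debug never drop to zero in the profile.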

@Dieterbe
Contributor Author

fixed by the above commit!
