When host1 becomes slow and starts dropping packets (seen by `dest ... nonBlockingSend -> dropping due to slow conn` messages and in grafana), sometimes host2 exhibits the same behavior: a drop in traffic at the same time, following the same pattern.
We must always make sure the pipeline can't hang, so I traced through the code (starting at `table.Dispatch`) and verified that none of the route/destination/conn code can block processing. Especially important is the `Destination.relay` loop. It does contain explicit flush and shutdown handling, which might take a while, but the former is seemingly never called, and the latter is definitely never called during runtime.
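For context, the non-blocking hand-off in the relay loop looks roughly like the sketch below. This is not the project's actual code; the types and channel names are hypothetical, but it shows the select/default pattern that produces the drop messages, and why the log call sits directly in the hot path:

```go
package main

import "log"

// Hypothetical types mirroring the structure described above.
type Destination struct {
	in   chan []byte // metrics dispatched to this destination
	conn chan []byte // bounded buffer feeding the connection writer
}

// relay must never block: if the connection buffer is full because the
// endpoint is slow, drop the packet instead of stalling the pipeline.
func (d *Destination) relay() {
	for buf := range d.in {
		select {
		case d.conn <- buf:
			// handed off to the connection writer
		default:
			// runs once per dropped packet: under a heavy drop rate,
			// this log call itself becomes hot
			log.Printf("dest ... nonBlockingSend -> dropping due to slow conn")
		}
	}
}

func main() {
	d := &Destination{in: make(chan []byte), conn: make(chan []byte, 1)}
	done := make(chan struct{})
	go func() { d.relay(); close(done) }()
	d.in <- []byte("some.metric 1 1234567890")
	d.in <- []byte("some.metric 2 1234567891") // conn full -> dropped and logged
	close(d.in)
	<-done // wait for relay to finish so the drop message is printed
}
```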
So I installed a profiling hook by importing `_ "net/http/pprof"`, loaded up a relay, blasted it with input traffic to make it start dropping packets, and then collected a profile.
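For reference, wiring in the hook amounts to the following. This is the standard `net/http/pprof` setup rather than the relay's actual startup code; the listen address is arbitrary:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/ handlers on the default mux
)

func main() {
	// serve the profiling endpoints on a side port, separate from relay traffic
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... start the relay and blast it with input traffic, then collect
	// a 30-second CPU profile from a shell:
	//   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
	select {} // stand-in for the relay's main loop
}
```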
The profile made it clear that we spend too much time logging. For comparison, the other functions invoked by `relay` each take around 0.01s to 0.04s.
My current thinking is:

- we can get too busy executing logging logic (specifically, the "warning ... dropped" messages), causing the pipe to hang, which blocks delivery to the other endpoints
- in fact, it might even cause more dropped packets to the bad route
I've tried changing these high-volume warning messages to info level (which is appropriate, since we use them to trace what happens to individual metric messages); as long as we document that setting a log level below warning can generate a lot of output, we're good. So I ran again at warning level and tested again (this time watching `/debug/vars2`, since the warn messages no longer show up on screen). There's definitely a lot less time spent in `logger.Info`, but still some, and also significant time spent in `logger.Debug`. I'm not sure why the latter, since I changed the messages to Info; maybe we now process so much more traffic that the Debug calls also start adding up.
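One plausible explanation for the remaining `logger.Debug` time is that a call's arguments are evaluated and boxed at the call site before the level check ever runs, so disabled Debug calls still cost something on every packet. A minimal sketch of the problem and the usual guard, using an illustrative logger rather than the relay's actual API:

```go
package main

import "fmt"

type Level int

const (
	DebugLevel Level = iota
	InfoLevel
	WarningLevel
)

type Logger struct{ level Level }

func (l *Logger) Debug(format string, args ...interface{}) {
	if l.level > DebugLevel {
		// too late to be free: the caller already evaluated the arguments
		// and boxed them into the variadic slice
		return
	}
	fmt.Printf("DEBUG "+format+"\n", args...)
}

func main() {
	logger := &Logger{level: WarningLevel}
	buf := []byte("some.metric 1 1234567890")

	// unguarded: string(buf) and the variadic boxing happen on every
	// packet even though nothing is printed
	logger.Debug("handling packet %s", string(buf))

	// guarded: skip all of that work when debug output is disabled
	if logger.level <= DebugLevel {
		logger.Debug("handling packet %s", string(buf))
	}
}
```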