Make StatsRelay detect if StatsD Daemons are Alive #2

Open
jjneely opened this issue Apr 17, 2015 · 2 comments

@jjneely
Owner

jjneely commented Apr 17, 2015

The current code base does nothing to detect or react to StatsD daemons that are not alive. The UDP StatsD protocol is designed to be fire-and-forget and offers no way to detect if the other side has received the packet.

StatsD daemons have a TCP administrative interface that's probably very useful for checking if the process is alive. That may be of help with this issue.

Things to think about:

  • How do we configure this? Command line representation? I'd rather not require a config file if possible -- although I'm not opposed to it.
  • What do we do with metrics destined for a down StatsD daemon? Buffer them? Probably redirect them to the next available daemon, since the timestamp is assigned when the packet is received and isn't in the packet itself. This may cause inconsistent data in upstream Graphite when/if multiple statsd daemons submit the same metric during hash-ring changes, but it's probably the least bad option.
  • What do we do when all statsd daemons are dead? Log loudly and drop packets?
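The redirect and all-dead cases above could be sketched roughly as follows. This is only an illustration: it uses a simple FNV hash with a modulo rather than StatsRelay's actual hash ring, and the `backend` type, `pick` function, and `errAllDown` value are hypothetical names.

```go
package main

import (
	"errors"
	"hash/fnv"
)

// backend is a hypothetical record of one StatsD daemon and its last
// known health state.
type backend struct {
	addr  string
	alive bool
}

// errAllDown signals that no daemon is alive; the caller would log
// loudly and drop the packet.
var errAllDown = errors.New("all statsd backends are down")

// pick hashes the metric name to a preferred backend and, if that one
// is down, walks forward to the next live backend.
func pick(metric string, backends []backend) (string, error) {
	if len(backends) == 0 {
		return "", errAllDown
	}
	h := fnv.New32a()
	h.Write([]byte(metric))
	start := int(h.Sum32() % uint32(len(backends)))

	for i := 0; i < len(backends); i++ {
		b := backends[(start+i)%len(backends)]
		if b.alive {
			return b.addr, nil
		}
	}
	return "", errAllDown
}
```

Because every relay walks the list in the same order, a given metric still lands on a single daemon while its preferred one is down. The inconsistency mentioned above only arises around the moments a daemon leaves or rejoins.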

@justdaver

Initially I was thinking about using something like mon to periodically check if the statsd backends are up (port check against the statsd admin port?). If mon detects that a statsd host is down, it would restart the statsrelay daemon(s) and leave out the host which is down; when the host comes back online, it would restart the statsrelay daemon(s) again and include the host. That said, I really like your idea of including this kind of functionality in statsrelay.

Some ideas/thoughts/2c from my side:

  • Check every 30 seconds (-t 30 or -time=30) against the statsd admin port (-a 8126 or -adminport=8126) to test if a statsd host is up/down
  • If a host is down for long periods, perhaps buffering metrics won't work so well, so rather redirect as you mention. If the downed host is removed from the hash table completely, then I'm not sure we'd run into the inconsistent data issues you mention. Wouldn't statsrelay still only redirect your metrics to a single statsd daemon? Would have to test this.
  • On second thought, perhaps buffering metrics could work with a limit option, e.g. buffer the last 50k lines. Logging to an error log would be great too.
  • Would I be crazy to suggest a similar admin-type port with functionality like the statsd daemon's, or would that be too much? Adding/removing statsd hosts on the fly could be useful for automation and scripting purposes...

Unfortunately I am not much of a programmer, and my coding kung fu is very weak, but I will help out as much as possible on the testing side of things!

@denen99
Contributor

denen99 commented Jun 23, 2015

I would suggest creating a fixed-size memory buffer that just gets overwritten. I would also couple that with some sort of timeout. So buffer X MB of metrics, for Y seconds. Y would be the TTL before you remove the node from the ring and just start sending metrics to another node (as noted above, the least bad situation). When the node comes back up, flush the buffer to the previously used node and add the now-up node back to the ring.
