
nsqd: heartbeat for producers #131

Merged
jehiah merged 1 commit into nsqio:master on Feb 21, 2013
Conversation

mreiferson
Member

It would be nice to have a heartbeat for producers; it would make connection liveness easier to keep track of for slow producers.

@mreiferson
Member

Interesting, this is definitely something that should be added, thanks.

It completely slipped our minds since we don't actually produce messages over the wire protocol ourselves (we publish over the HTTP interface).

@mreiferson mentioned this pull request Jan 22, 2013
@mreiferson
Member

ready @jehiah

cc @dustismo if you want to test this out

@mreiferson
Member

@jehiah ready for review

@jehiah
Member

jehiah commented Feb 21, 2013

LGTM

jehiah added a commit that referenced this pull request Feb 21, 2013
@jehiah merged commit 52c7599 into nsqio:master Feb 21, 2013
@dmarkham
Contributor

dmarkham commented Mar 9, 2013

Without a way to turn off the heartbeat for producers, any client not using threads or an event loop will miss the heartbeat and now gets disconnected whenever the producer blocks to collect its data. This makes it impossible to write simple producers, like a log watcher that pushes into NSQ over TCP, or to embed one in a FastCGI script to log its request stats. Now I'm not able to block on a socket waiting for a web request without nsqd disconnecting my producer's connection. Not every producer dedicates a thread or runs in an event loop.

There is a reason Redis, memcached, RabbitMQ, and Postgres do not force a heartbeat that must be responded to! We should be able to write a very simple "while STDIN, publish into NSQ" script, and with this patch that's no longer possible without breaking out the threads or event loops. :(

@mreiferson @dustismo @jehiah Thoughts? Ideas?

Thanks guys for your time,

-Dan

@mreiferson
Member

Hi @dmarkham - I understand your perspective. We consider the TCP protocol to be the more sophisticated approach to publishing and made this change to address an inconsistency with consumers.

Would it sufficiently meet your needs to have "simple" publishers like the ones you mention publish via HTTP? nsqd has both /put and /mput endpoints that require no state.
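
For illustration, a minimal Go sketch of that kind of stateless HTTP publish (the address, topic name, and payload here are placeholders, not anything from this thread):

package main

import (
    "bytes"
    "log"
    "net/http"
)

func main() {
    // POST the raw message body to nsqd's HTTP /put endpoint;
    // no connection state or heartbeats to manage
    body := bytes.NewBufferString("hello nsq")
    resp, err := http.Post("http://127.0.0.1:4151/put?topic=test", "application/octet-stream", body)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()
    log.Println(resp.Status) // expect "200 OK"
}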

To put this in perspective, at bitly, we exclusively use the HTTP interface for publishing and our NSQ cluster hits peaks of over 80k messages per second.

Thanks for the feedback!

@dmarkham
Contributor

dmarkham commented Mar 9, 2013

First off thank you for responding,

For consumers a heartbeat is a great idea. It is their job to listen to things streaming from NSQ.

I agree I could use the HTTP interface. I'm rather shocked/impressed that its cost isn't more than you're willing to pay at that scale; you have a very clean TCP spec with low overhead. That said, I will start benchmarking my setup with the HTTP interface before drawing any real conclusions, and I will dig in and see what the HTTP keep-alive default/max is configured to be on the nsqd servers. I'm highly motivated to get something I enjoy using to work well.

So even though my examples were about a simple way to publish into NSQ, I should have given more detail about my concerns. I have been working to see if I can replace most logfile writing with dropping the messages directly into NSQ, so I was pretty sure I needed a persistent, low-latency connection to NSQ.

What are your thoughts on this? Or would you still recommend spooling things to logs before pushing into NSQ anyway?

So once I felt it met my latency benchmark to write directly into NSQ vs. logfiles, I started working on the persistent part. This is where I hit my issues with the TCP interface. Something felt wrong when I can hold open long-lasting TCP connections to all of my other backend services, yet I cannot publish data without maintaining a heartbeat with NSQ. I'm all for consistency, but I think you're just being consistent with the wrong thing (the consumer, rather than other publishing models).

I would love for you guys to give this one more think, taking into consideration the other publishing models people are currently programming against: publishing data into things like Redis/memcached/RabbitMQ**/RDBMSs. The things on this list, and the others I can think of, really only use heartbeats for cluster health, not client health, and definitely not publisher health; I find that interesting too.

OK, I'm off to start testing the HTTP interface for my usage. But it mentally burns me that the TCP interface is going unused in my stack because of a heartbeat.

Thanks for your time,

-Dan

**RabbitMQ/AMQP: you can ask for a heartbeat, but it's not forced.

@michaelhood
Contributor

Hey Dan,

I'll wait for @mreiferson to respond to the meat of your post; I just wanted to add some anecdata re:

I agree I could use the HTTP interface. I'm rather shocked/impressed this is not more expensive than you are willing to pay at that scale.

We also use the HTTP interface exclusively, if only because we had battle-tested client libs ready. We use a rather unoptimized libcurl implementation (PHP binding) on the producer side, and nsq_to_http on the consumer side.

The costs of bandwidth are obvious, but the "resource" costs (CPU time, sockets' memory usage, etc.) will present themselves on the clients' ends — and be highly dependent on the RTT.

We average around 0.7ms RTT on the consumer leg with virtually no retransmissions (<0.01%), and our producers are local to nsqd. That makes the impact negligible, but in a different architecture/scenario it could certainly be another story entirely.

For what it's worth here in the shadow of @mreiferson's ❗ 80k msgs/second.. ;)

We process around 600 million messages a day this way, synchronously, with an average payload (that is, Content-Length) of 357 bytes.

All that said, I would love it if you shared your benchmarks!

[and thanks for taking the time to write all of this, already]

@dmarkham
Contributor

My testing scripts, notes, numbers, versions, etc.
https://github.com/dmarkham/nsq_testing

At the very least I now know what rate with HTTP I'm working towards. I'm sure I can improve it some on the client side (LWP replacement).

TCP:  0.0002597 sec/message
HTTP: 0.0010598 sec/message
About a 4x improvement for TCP

@michaelhood You both have me beat on messages per day; I'm getting pretty close to 100M/day for my planned use of NSQ. I'm mostly concerned about the latency/cost per message. My goal is to spend about the same amount of time handing off a message to NSQ as writing it to memcached/Redis/a logfile. Are you writing your messages to disk first? Or directly into NSQ?

Thanks for any thoughts. I'm just a heartbeat away (pun intended) from this working out perfectly.

@dustismo
Contributor Author

Seems to me you can still do a simple synchronous TCP connector with heartbeats; just check whether any result frame is a heartbeat. Something like:

public Result sendMessage(Message message) {
    sendMessageOnSocket(message);
    Result frame = socket.read();
    // if the server interleaved a heartbeat, answer it and read the real response
    if (isHeartbeat(frame)) {
        sendMessageOnSocket(NOP_MESSAGE);
        frame = socket.read();
    }
    return frame;
}

Would that work for your purpose?

@michaelhood
Contributor

@dmarkham I'll take a look at your code as soon as I have a few minutes, but I expect it's not the bytes but rather the relatively slow LWP. If you must subject yourself to Perl for this, I'd try WWW::Curl (and beware this gotcha.)

I tried to make a quick histogram from a production box just now, but nsq is too fast for the precision output by histogram.py (from bitly/data_hacks) @jehiah ;)

I'll put together some stuff and also take a look at your code as soon as I have a few.

# tcpdump -s 384 -tttt -nn -q -K -l -p -i lo tcp port 4151 -c 10000 2>/dev/null | pt-tcp-model --watch-server=127.0.0.1:4151 | awk '{print $4*1000000}' | histogram.py -m 9 -x 25
# NumSamples = 3022; Min = 9.00; Max = 25.00
# 1091 values outside of min/max
# Mean = 72.471211; Variance = 23775.104896; SD = 154.191780
# each * represents a count of 6
    9.0000 -    10.6000 [   490]: *********************************************************************************
   10.6000 -    12.2000 [   418]: *********************************************************************
   12.2000 -    13.8000 [    33]: *****
   13.8000 -    15.4000 [   140]: ***********************
   15.4000 -    17.0000 [   372]: **************************************************************
   17.0000 -    18.6000 [   164]: ***************************
   18.6000 -    20.2000 [   190]: *******************************
   20.2000 -    21.8000 [    49]: ********
   21.8000 -    23.4000 [    50]: ********
   23.4000 -    25.0000 [    25]: ****

@michaelhood
Contributor

@dustismo I've not had a chance to look these changes over in detail, but unless I'm missing something, could he not also receive a heartbeat at a time not-immediately-after sending a message (and therefore not polling for a reply)?

edited for clarity

@dmarkham
Contributor

@dustismo I think your idea could get me out of sync with the server. If I'm sent a heartbeat frame before I start a PUB, then when I go to publish my data the server would be expecting a NOP, not a PUB.

So I'm guessing I'm not able to do:
Server sends heartbeat -> client PUB -> read heartbeat frame -> respond NOP -> read OK.

The server should be expecting a NOP after a heartbeat, not a PUB. Not that I have tested whether that would work, but it seems wrong.

And yes, LWP can't be the best tool for this. I have already started poking around with libcurl and friends; WWW::Curl is not playing nice ATM.

By the time I strip down these HTTP libs and remove all the unneeded headers, it's going to look so very close to the TCP interface with no heartbeat, and be more difficult to parse. Something to be said for the fast TCP protocol you guys have built: it's super simple to parse and easy to write a fast lib against.

I will have to build a Perl lib regardless. Not everything can be in golang overnight!

Thanks again for your time looking at my messy prototype code.

@mreiferson
Member

@dmarkham thanks for taking the time to put together those benchmarks. As @michaelhood pointed out, it's probably not so much the size difference (both fit into a single packet aka single syscall) but rather the simple fact that it's HTTP (client lib + golang's stdlib server).

As a simple test to evaluate the client side cost, you could try just writing a pre-formatted HTTP request to a raw (persistent) TCP socket connected to nsqd's HTTP port.
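
A rough Go sketch of that test, assuming a local nsqd and a made-up topic; a real benchmark would loop, reusing the connection and reader:

package main

import (
    "bufio"
    "fmt"
    "log"
    "net"
    "net/http"
)

func main() {
    conn, err := net.Dial("tcp", "127.0.0.1:4151") // nsqd's HTTP port (placeholder address)
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()
    br := bufio.NewReader(conn)

    msg := "hello nsq"
    // hand-rolled, pre-formatted HTTP/1.1 request; keep-alive lets the
    // socket be reused for subsequent publishes
    fmt.Fprintf(conn, "POST /put?topic=test HTTP/1.1\r\nHost: 127.0.0.1:4151\r\nContent-Length: %d\r\n\r\n%s", len(msg), msg)

    resp, err := http.ReadResponse(br, nil)
    if err != nil {
        log.Fatal(err)
    }
    resp.Body.Close()
    log.Println(resp.Status) // expect "200 OK"
}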

To answer your other question: for messages going through NSQ, we don't first write them to disk. Like the solution you're evaluating now, our applications that produce messages write to a local nsqd via HTTP. We use nsq_to_file, typically on separate hosts in some sort of redundant pairing, to persist the NSQ topics to disk (available in the examples dir: https://github.com/bitly/nsq/tree/master/examples/nsq_to_file).

Ultimately, I think you're headed down the correct path in terms of topology. Relatedly, this blog post might be useful: http://word.bitly.com/post/38385370762/spray-some-nsq-on-it.

re: @dustismo's suggestion - the heartbeat doesn't expect a NOP command in return (any command is fine); NOP was intended as an easy option when a client is otherwise idle. That doesn't make the proposed solution any easier to implement, though, because you could theoretically get more than one heartbeat, and you can't rely on the racy timing of whether you sent your initial PUB before or after the server sent the heartbeat, so you still need to NOP. I think it would look something like this 😄:

def publish(pub_msg):
    responded = False
    send(pub_msg)
    while True:
        res = read()
        if not is_heartbeat(res):
            return res
        # one NOP is enough: it covers the race, and any subsequent
        # command (including the next PUB) counts as activity
        if not responded:
            send(nop)
            responded = True

Regardless, the real problem with this for you (as I understand it) is that you're blocked sitting in an Accept() or Read() in your HTTP server (and may not receive any request for more than the nsqd timeout).

Taking a step back, we try really hard for NSQ to "just work". It means that some of our opinions about how things should work are baked into the system. Flexibility is great but there is a fine line we try to be conscious of (you can't please everyone).

I'm going to consider your perspective and requirements (which I understand to be stateless, low-latency, low-overhead publishing) and think through what this means for NSQ. I'm certainly open to suggestions as to the specific changes that might resolve this for all interested parties.

CC @jehiah for his thoughts

@dmarkham
Contributor

This is great; I think everyone understands my concerns, and it sounds like everyone is open to at least reviewing a patch. I have some catching up to do in the code. Unless someone beats me to it, or has a better idea, I'm going to try to put a patch together for review.

My first thought was to add:

  • HEARTBEAT - change the rate at which nsqd sends the _heartbeat_ response to your client. Defaults to 30 seconds.

        HEARTBEAT <seconds>\n

        <seconds> - a string representation of an integer N, where 0 <= N < 86400

    Success response:

        OK

    NOTE: the V2 protocol requires a heartbeat every N seconds; nsqd sends a
    _heartbeat_ response and expects a command in return. If the client is idle, send NOP. After N*2
    seconds, nsqd will time out and forcefully close a client connection that it has not heard from. Zero is a special case that disables heartbeats on this client connection.
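
A hypothetical client exchange under this proposal might look like the following Go sketch; note that HEARTBEAT is the command proposed above, not part of the shipped protocol:

package main

import (
    "log"
    "net"
)

func main() {
    conn, err := net.Dial("tcp", "127.0.0.1:4150") // nsqd TCP port (placeholder address)
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()
    conn.Write([]byte("  V2"))          // nsq protocol magic
    conn.Write([]byte("HEARTBEAT 0\n")) // 0 would disable heartbeats, per the spec above
    // a real client would now read the response frame and expect "OK",
    // then PUB at its leisure without ever being disconnected for idleness
}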

@mreiferson
Member

@dmarkham - if you're gonna take a swing... we built some flexibility into the IDENTIFY command for situations like this (it takes a JSON payload). It makes sense to add a new key to that JSON blob for this setting (rather than a new command).

@dmarkham
Contributor

Noted. I have a version almost working with the HEARTBEAT command; I'll move it to the IDENTIFY JSON blob before submitting it for further review. The client conn's readDeadline being managed in a different place than the heartbeat Ticker makes it interesting to manage them together after the messagePump has started up. I'll have something creative to look at soon.
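
To sketch the coupling being described (illustrative Go only; the names messagePump and heartbeat echo the comment above, but this is not nsqd's actual code):

package heartbeat

import (
    "net"
    "time"
)

// messagePump drives the outgoing heartbeat and the connection's read
// deadline from the same ticker so the two stay in sync.
func messagePump(conn net.Conn, interval time.Duration) {
    heartbeat := time.NewTicker(interval)
    defer heartbeat.Stop()
    frame := []byte("_heartbeat_") // placeholder for a real response frame
    for range heartbeat.C {
        if _, err := conn.Write(frame); err != nil {
            return
        }
        // the client now has interval*2 to send back any command (NOP, PUB, ...)
        conn.SetReadDeadline(time.Now().Add(interval * 2))
    }
}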

@dmarkham
Contributor

OK, here is my swing at it. Adding a heartbeat: <seconds> key to your JSON on IDENTIFY does the trick. The entire patch is really just a little sugar around a time.Ticker named heartbeat.

https://github.com/dmarkham/nsq/tree/heartbeat_configure

I'm highly interested in any observations on what I could improve.
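
For reference, a minimal Go sketch of sending such an IDENTIFY payload; the heartbeat key (in seconds, 0 to disable) follows the proposal in this thread, not necessarily what was eventually shipped:

package main

import (
    "encoding/binary"
    "encoding/json"
    "log"
    "net"
)

func main() {
    conn, err := net.Dial("tcp", "127.0.0.1:4150") // nsqd TCP port (placeholder address)
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()
    conn.Write([]byte("  V2")) // nsq protocol magic

    // "heartbeat" is the key from the proposed branch; the other
    // fields and values here are illustrative placeholders
    body, _ := json.Marshal(map[string]interface{}{
        "short_id":  "host",
        "long_id":   "host.example.com",
        "heartbeat": 60,
    })
    conn.Write([]byte("IDENTIFY\n"))
    size := make([]byte, 4)
    binary.BigEndian.PutUint32(size, uint32(len(body)))
    conn.Write(size) // 4-byte big-endian body length, then the JSON body
    conn.Write(body)
    // a real client would read the response frame here and expect OK
}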

@mreiferson
Member

@dmarkham nice, thanks :)

would you mind opening a pull request so we can review the change?

@jehiah
Member

jehiah commented Mar 11, 2013

@michaelhood I like the use of the awk multiplier for histogram.py. Slick trick (and it's always cool to see data_hacks in use).

I've been silent on this thread, but I concur with @mreiferson's comment that heartbeat configuration would be best served in the IDENTIFY command payload.

absolute8511 added a commit to absolute8511/nsq that referenced this pull request Jan 12, 2023