working towards 1.0 #126

Closed
mreiferson opened this issue Jan 11, 2013 · 6 comments


@mreiferson
Member

We're looking to solidify a 1.0 release and I wanted to document some possible changes as early as possible.

As mentioned in the README, we intend to follow the semantic versioning scheme. A 1.0 release would mean, in short:

  1. the public API in terms of protocol and parameters will be frozen for this release
  2. all future point releases would only be for backwards compatible bug fixes
  3. all future minor releases would be for backwards compatible functionality enhancements
  4. all future major releases would be for significant changes that could be backwards incompatible

We have a slew of open issues tagged with 1.0, and if other bugs are identified in the interim they will most likely be included as well.

Structurally, the biggest change will be the Go library moving into its own repo, most likely http://github.com/bitly/go-nsq. It will also be considered 1.0 at that point. What this means for applications using this library is that import paths will change from github.com/bitly/nsq/nsq to github.com/bitly/go-nsq. Any other changes will be documented in the respective ChangeLog.

If you have any questions/comments/suggestions for things you'd like to see in 1.0, please use this as a forum for that discussion.

Thanks!

@dustismo
Contributor

There is an issue with the current protocol design. It can be impossible to match a specific request with the appropriate response if many requests are made in short order.

Example:
client: INDENTIFY
client: SUB topic1 mychannel
server: E_INVALID

There is no way to tell if E_INVALID is for the INDENTIFY or the SUB request.

@mreiferson
Member Author

@dustismo first, just want to make sure we're on the same page that INDENTIFY (in your example) is spelled incorrectly and is the one throwing the E_INVALID error.

Your point is fair though, and it's something we've debated internally for some time.

We've taken the approach that errors are generally fatal. I started going through the process of making changes on the nsqd side that actually enforced this (i.e. nsqd would close the connection if it encountered some of these scenarios for a client). In your example I realize there are a few more that should be promoted to fatal as well (like IDENTIFY and SUB). I also think IDENTIFY and SUB should both always return a success response so that clients could be written like:

connect()
send_magic()
identify(short_id, long_id)
err = read_response()
if err {
   // bail
}

subscribe(topic, channel)
err = read_response()
if err {
   // bail
}

That would clear up this specific case you're talking about where you'd like to provide useful feedback to your applications about why they couldn't connect, authenticate, or subscribe successfully.

For the rest of the cases (after subscribing, when messages are flowing) our reader libraries simply log these errors and continue processing... you're really only dealing with 3 commands at that point (RDY, FIN, REQ). These also happen to be the commands that are generally sent at high volume, and the overhead of another packet for a success response (perhaps with some sort of client-generated identifier?) isn't worth the performance cost. For some of these cases, perhaps just adding more context to the error response (E_FIN_FAILED ... why ...) would be enough to effectively provide that same utility without overburdening the client libraries with additional state tracking?

@dustismo
Contributor

Ok, makes sense. I will keep the 'errors are fatal' mentality in mind (though, of course, more context would be good). I would still argue for an OK response on the IDENTIFY action, though.

This brings me to another issue (I'm being noisy I know :) )

The client I am writing is more or less completely async, so this pattern isn't so easy:

  1. make request
  2. wait for response

It works more like:

while(true) {
   send any waiting requests;
}
// separately
while(true) {
   read response;
   send to response handler;
}

so matching requests with responses can be more difficult.

What would make life much cleaner for this would be for OK responses to include the action. Something like:

OK SUB

This would be especially helpful for the OK PUB response. I am currently using a semaphore to count OK responses with respect to PUB commands to make sure I don't send faster than they are received (yet still maintain high throughput). This feels very brittle considering I don't know with certainty what the OK is for. (It does work, and I don't get any errant OK messages, it just feels wrong.)
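
Roughly the pattern I mean, sketched in Go for brevity (my actual client is netty-based; the channels here just model the connection and a fake nsqd, so this is purely illustrative):

package main

import "fmt"

// A buffered channel acts as the semaphore; every OK frees a slot.
func producer(pending [][]byte, pubs chan<- []byte, oks <-chan struct{}) {
	const maxInFlight = 100
	inflight := make(chan struct{}, maxInFlight)

	go func() {
		for range oks {
			// assume each OK matches the oldest outstanding PUB --
			// exactly the part that feels brittle
			<-inflight
		}
	}()

	for _, msg := range pending {
		inflight <- struct{}{} // blocks once maxInFlight PUBs are unacked
		pubs <- msg            // i.e. write PUB <topic> + msg to the wire
	}
}

func main() {
	pubs := make(chan []byte)
	oks := make(chan struct{})
	go func() { // stand-in for nsqd: answer OK to every PUB
		for range pubs {
			oks <- struct{}{}
		}
	}()
	producer([][]byte{[]byte("one"), []byte("two")}, pubs, oks)
	fmt.Println("all published")
}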

@mreiferson
Member Author

Let's separate consumers from producers...

Consumers

I agree that IDENTIFY and SUB should always return a response and be fatal in their error cases (and have already started working on this in #133). At runtime, I think FIN, REQ, and RDY should probably be improved to return contextual error messages (like E_FIN_FAILED FIN <message_id>). This should resolve the issues you mention without the need for clients to keep track and match commands to responses.

Producers

Remember, commands on a single nsqd connection are serial. I suspect you'll be able to accomplish the throughput and robustness requirements you need by restructuring the way you're handling connections. I suggest:

  1. use the MPUB command to achieve pipelining and higher throughput for a single, serial, connection.
  2. use more connections performing serialized commands rather than decoupling the IO of a single connection.

For example, take a look at https://github.com/bitly/nsq/blob/master/examples/bench_writer/bench_writer.go - it can publish 321,000 (200 byte) messages/sec over a single connection using serialized MPUB commands.
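
For illustration, here's a rough Go sketch of the MPUB framing (per the protocol doc; the topic and batch are examples and error handling is elided). A client writes one frame like this, reads the single response, then writes the next batch:

package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// Builds an MPUB command frame:
// MPUB <topic_name>\n
// [ 4-byte body size ][ 4-byte num messages ]
// [ 4-byte msg #1 size ][ msg #1 data ] ... repeated per message
func mpubFrame(topic string, messages [][]byte) []byte {
	var body bytes.Buffer
	binary.Write(&body, binary.BigEndian, int32(len(messages)))
	for _, m := range messages {
		binary.Write(&body, binary.BigEndian, int32(len(m)))
		body.Write(m)
	}

	var frame bytes.Buffer
	fmt.Fprintf(&frame, "MPUB %s\n", topic)
	binary.Write(&frame, binary.BigEndian, int32(body.Len()))
	frame.Write(body.Bytes())
	return frame.Bytes()
}

func main() {
	batch := [][]byte{[]byte("one"), []byte("two"), []byte("three")}
	fmt.Printf("% x\n", mpubFrame("test_topic", batch))
}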

Serializing async IO would look something like this (in python, unfortunately, since I don't have experience with netty):

import functools

# assumes a pynsq-style conn, nsq.FRAME_TYPE_ERROR, and a tornado ioloop
all_messages = [...]

def pub_loop(conn):
    message = all_messages.pop()
    pub(conn, message)

def pub(conn, message):
    # done_pub will be invoked with this command's response frame
    callback = functools.partial(done_pub, conn=conn)
    conn.send("pub", message, callback=callback)

def done_pub(frame_type, data, conn):
    if frame_type == nsq.FRAME_TYPE_ERROR:
        # log, bail, retry
        return
    # schedule the next publish on the io loop
    callback = functools.partial(pub_loop, conn=conn)
    ioloop.add_callback(callback)

Essentially just passing around context via callbacks on the io loop.

@mreiferson
Member Author

NSQ Protocol / Client Library Notes

I thought it would be helpful to document some of the philosophy and intentions here and solicit feedback as we move closer to 1.0. Much of this is already in place but there are some subtle changes that will improve the ability to write robust and performant client libraries. This will probably form the basis for some new documentation alongside the existing protocol document.

The protocol design goals are as follows:

  • human readable commands over the wire
  • fast machine parseable response framing
  • errors are fatal (server closes client connection)
  • versioning
  • bounded
  • detection of slow/unresponsive clients (i.e. application level heartbeats)

Heartbeats

Application level heartbeats are useful for detecting slow, unresponsive, or disconnected clients when TCP would prefer to continue retrying for a considerable amount of time. Think of a case like pulling a network cable: you want to control and bound these scenarios.

Client libraries should be designed in such a way that they account for and respond to heartbeats automatically.

Even though data is pushed to clients from nsqd asynchronously, client libraries can rely on the strictly ordered nature of commands and responses for matching.

The general approach should be to keep a queue of commands (could be a simple linked list), push onto the back when sending a command and pop off the front whenever a FrameTypeResponse or FrameTypeError is received (thus matching command to response).

Whenever a heartbeat is encountered it should be handled automatically (possibly calling back to the client as a notification). NOTE: Heartbeats will be promoted to their own frame type, to make this easier.
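
A minimal sketch of that queue in Go (the response/error/message frame type values follow the protocol; the dedicated heartbeat frame type is an assumption here, pending that change):

package main

import (
	"container/list"
	"fmt"
)

const (
	FrameTypeResponse  = 0
	FrameTypeError     = 1
	FrameTypeMessage   = 2
	FrameTypeHeartbeat = 3 // hypothetical value, pending the promotion above
)

type conn struct {
	pending *list.List // commands awaiting a response, oldest at the front
}

// write sends raw bytes on the socket (elided here).
func (c *conn) write(raw string) {}

// send transmits a command that expects a response and remembers it.
func (c *conn) send(cmd string) {
	c.pending.PushBack(cmd)
	c.write(cmd + "\n")
}

func (c *conn) handleFrame(frameType int, data []byte) {
	switch frameType {
	case FrameTypeHeartbeat:
		// answered automatically; NOP itself has no response, so it is
		// written directly and never enters the pending queue
		c.write("NOP\n")
	case FrameTypeResponse, FrameTypeError:
		// strictly ordered: this frame answers the oldest pending command
		cmd := c.pending.Remove(c.pending.Front()).(string)
		fmt.Printf("%q -> %q\n", cmd, data)
	case FrameTypeMessage:
		// messages are pushed asynchronously; nothing to match
	}
}

func main() {
	c := &conn{pending: list.New()}
	c.send("SUB topic channel")
	c.handleFrame(FrameTypeHeartbeat, nil)         // auto-NOP, queue untouched
	c.handleFrame(FrameTypeResponse, []byte("OK")) // matches the SUB
}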

Consumers

Consumers can easily determine success/failure of connection, authentication, and subscription phase by:

  1. send 4-byte magic
  2. send IDENTIFY
  3. send SUB
  4. start read loop (with deadline) which will timeout or receive an error response if it fails
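
For example, a sketch of that sequence in Go (the IDENTIFY body and frame parsing are elided, and the address and interval are just examples):

package main

import (
	"fmt"
	"net"
	"time"
)

func connectAndSubscribe(addr, topic, channel string, heartbeatInterval time.Duration) error {
	conn, err := net.DialTimeout("tcp", addr, time.Second)
	if err != nil {
		return err
	}
	conn.Write([]byte("  V2"))                       // 1. 4-byte magic
	fmt.Fprintf(conn, "IDENTIFY\n")                  // 2. body elided here
	fmt.Fprintf(conn, "SUB %s %s\n", topic, channel) // 3.

	// 4. read with a deadline of 2x the heartbeat interval: a fatal
	// error response or a timeout surfaces right here
	conn.SetReadDeadline(time.Now().Add(2 * heartbeatInterval))
	buf := make([]byte, 4096)
	if _, err := conn.Read(buf); err != nil {
		return err // timed out, or nsqd forcibly closed the connection
	}
	// parse the frame(s) in buf; an error frame at this point is fatal
	return nil
}

func main() {
	err := connectAndSubscribe("127.0.0.1:4150", "clicks", "archive", 30*time.Second)
	if err != nil {
		fmt.Println("startup failed:", err)
	}
}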

At runtime it is possible for FIN and REQ to return errors due to nsqd-side message timeouts (and subsequent re-delivery to another consumer). These should be considered non-fatal. For performance reasons, these commands (along with RDY) do not have success responses; client libraries should make no attempt to match them to responses and should instead simply log the error message.

All other errors are considered fatal and connections are forcibly closed on the nsqd side.

All reads should be deadlined to 2x the heartbeat interval.

Producers

Producers can follow the same procedures as consumers for robust connection and authentication (without subscribing).

Producers only have two commands available, PUB and MPUB. Producers should be written to perform commands in serial. MPUB should be used for high-throughput pipelining.

Producers should expect to receive heartbeats when those are added in #131.

All reads should be deadlined to 2x the heartbeat interval.

TODO

  • nsqd: atomic MPUB #135 (merged) - make MPUB an atomic operation (i.e. all error conditions are validated up front and the appropriate locks are acquired before actually queueing messages). This addresses the implicit problem of determining (from an error response) which messages in an MPUB command succeeded or failed.
  • nsqd: v2 protocol responses/fatal errors #133 (merged):
    • simplified protocol version handling
    • fatal error changes to SUB/IDENTIFY
    • close conns on invalid magic
    • better context for FIN/REQ errors (consumers won't need to match commands to responses)
  • nsqd: heartbeat for producers #131 (merged): producers do not currently have heartbeats; for consistency with the consumer side, nsqd will send heartbeats periodically.
  • promote heartbeats to their own frame type

@mreiferson
Member Author

Closing this; I've updated the issues labeled with 1.0 to reflect what I think is necessary.
