Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nsqd: --worker-id issues #429

Merged
merged 1 commit into from
Aug 4, 2014
Merged

nsqd: --worker-id issues #429

merged 1 commit into from
Aug 4, 2014

Conversation

mreiferson
Copy link
Member

It appears there are some cases where nsqd will fail to FIN a message because it thinks it's not in flight (but the message is actually in flight).

In playing around with nsqd locally, I kept seeing messages like this in the log:

E_FIN_FAILED FIN 06dbb0680b436005 failed ID not in flight

This was happening with a single consumer on a topic that was doing nothing but pulling messages off the queue and immediately FIN'ing them.

At first I thought it was a bug with the client library I was writing, but I was able to reproduce it with other client libraries as well (krakow and pynsq).

After some trial and error, I've found a way to reliably reproduce it:

  1. Start nsqd
  2. Write messages to it quickly
  3. Simultaneously, read and fin messages as fast as possible while using a reasonably high max_in_flight (50+)

Here's a small Ruby program that will reproduce it every time:
https://gist.github.com/bschwartz/d9b82670c48bcf6a5b7d

The output for it looks like this:

➜  nsq-fin-failer  ruby fin-failer.rb
.
.
.
.
.
FIN Failed for: 4165
FIN Failed for: 4166
FIN Failed for: 4167
FIN Failed for: 4168
FIN Failed for: 4169
FIN Failed for: 4170
FIN Failed for: 4171
FIN Failed for: 4172
FIN Failed for: 4173
FIN Failed for: 4174
FIN Failed for: 4175
FIN Failed for: 4176
FIN Failed for: 4177
FIN Failed for: 4178
FIN Failed for: 4179
FIN Failed for: 4180
FIN Failed for: 4249
FIN Failed for: 4250
FIN Failed for: 4251
FIN Failed for: 4252
FIN Failed for: 4253
FIN Failed for: 4254
FIN Failed for: 4256
FIN Failed for: 4257
FIN Failed for: 4258
FIN Failed for: 4259
FIN Failed for: 4260
FIN Failed for: 4261
FIN Failed for: 4321
FIN Failed for: 4322
.

Hopefully it's clear what's going on in there. I am hoping to read through the source of nsqd and help track this down, but thought I'd throw it up here in case you guys know right away what might be causing it.

Thanks for building a great piece of software!

@mreiferson
Copy link
Member

Can you confirm what version of nsqd you're running (nsqd -v) and if possible, what commit it was built from? (we should probably put the latter in the binary)

@mreiferson
Copy link
Member

So this one's a doozy 😁 🔥

--worker-id has a maximum bit-width of 12 (this is undocumented 😈), meaning the valid range is [0,4096). It looks like nsq-cluster sets --worker-id to the default port which is 4150.

This effectively breaks message ID generation which leads to duplicate message IDs being delivered to clients which breaks state tracking in both nsqd and the client.

We need to do a few things:

  1. fail when this value is set out of range (duh)
  2. document what the range is
  3. document what the hell --worker-id actually does and how it should be used (and its impact on a topology)

@mreiferson mreiferson changed the title FIN failing when it should succeed nsqd: --worker-id issues Jul 30, 2014
@mreiferson
Copy link
Member

I've updated the title to reflect my previous comment

@jehiah
Copy link
Member

jehiah commented Jul 30, 2014

💣

@bschwartz
Copy link
Contributor Author

Dang, you guys are good! Thanks for spotting that so quickly. I'll change NsqCluster to fix this.
I'm using the latest version of nsqd:

nsqd v0.2.29 (built w/go1.3)

@mreiferson
Copy link
Member

Thank you for the report 👍!

@mreiferson
Copy link
Member

RFR @jehiah, going to separately improve docs on gh-pages

@mreiferson
Copy link
Member

Also, I've decided I really hate the name worker-id (I honestly don't know what I was originally thinking).

Alternatives include node-id or just id or message-id-seed?

@jehiah
Copy link
Member

jehiah commented Aug 2, 2014

Since message ids are only ever used between a single nsqd to it's clients (and never shared amongst hosts), why is it particularly important to use this as part of the message guid ? I guess i'm also missing exactly what implication here broke message generation?

@mreiferson
Copy link
Member

A value greater than the max broke message ID generation.

The original idea of making this configurable was to be able to build your cluster in a way in which message IDs are guaranteed globally unique. In practice, partly due to the complete lack of documentation, I doubt it's ever used.

@mreiferson
Copy link
Member

Ping @jehiah

Independent of its current name and usefulness we should at least get these small changes in for the current state of things.

jehiah added a commit that referenced this pull request Aug 4, 2014
@jehiah jehiah merged commit 375a9c4 into nsqio:master Aug 4, 2014
@mreiferson mreiferson deleted the worker_id_429 branch August 4, 2014 15:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants