nsqd: stops sending messages #851

Closed
thelinuxlich opened this issue Jan 25, 2017 · 19 comments

@thelinuxlich

Sometimes one of my nsqd nodes simply stops sending messages to the consumer/reader (using nsqjs) and starts saving these messages to disk. When I connect with nsqadmin to the nsqlookupd address I can see that this node is not in the Topic Message Queue list. So I restart its nsqd daemon, and all of a sudden it starts sending all those messages on disk to the consumer and it reappears in the Topic Message Queue list. Its log is not very informative before that, just heartbeats to nsqlookupd.

Is there a reason why this should be happening or is it a bug?

@jehiah
Member

jehiah commented Jan 25, 2017

@thelinuxlich can you provide some context on your configuration, and relevant logs from nsqd prior to and after restarting it?

@thelinuxlich
Author

I have two nsqd nodes connecting to two nsqlookupd nodes. Each nsqd node also hosts a microservice that writes messages directly to its local nsqd instance; those messages are consumed by two other microservices that discover nsqd through the nsqlookupd instances.

Logs usually only print this:

Jan 25 13:31:25 aidax-collector-2 nsqd[19813]: [nsqd] 2017/01/25 13:31:25.501527 LOOKUPD(10.240.200.16:4160): sending heartbeat
Jan 25 13:31:25 aidax-collector-2 nsqd[19813]: [nsqd] 2017/01/25 13:31:25.504267 LOOKUPD(10.240.51.68:4160): sending heartbeat

But there was an error last week:

Jan 18 17:09:41 aidax-collector-2 nsqd[322]: [nsqd] 2017/01/18 17:09:41.682483 LOOKUPD(10.240.200.16:4160): ERROR PING - read tcp 10.240.226.74:54986->10.240.200.16:4160: i/o timeout
Jan 18 17:09:42 aidax-collector-2 nsqd[322]: [nsqd] 2017/01/18 17:09:42.682712 LOOKUPD(10.240.51.68:4160): ERROR PING - read tcp 10.240.226.74:56002->10.240.51.68:4160: i/o timeout

But since it's heartbeating every 15s, shouldn't it reconnect normally instead of saving to disk forever?

@ploxiln
Member

ploxiln commented Jan 25, 2017

The consumer should reconnect to nsqd and receive messages. But the report is still a bit too vague. Please include:

  • version of nsqd
  • command-line flags and/or config file used for nsqd
  • output of "curl localhost:4151/stats" on nsqd server
  • a bit more info about the consumer, how it is configured and started

Those heartbeats, and the "error ping", are between nsqd and nsqlookupd. They do not indicate any connections from consumers. The problem may be in the consumer, or in network access between consumer and nsqd.

@thelinuxlich
Author

I'm using nsq 0.3.8 in all nodes.

The nsqd service configuration:

[Service]
WorkingDirectory=/home/thelinuxlich
ExecStart=/usr/local/bin/nsqd -lookupd-tcp-address=10.240.200.16:4160 -lookupd-tcp-address=10.240.51.68:4160
Restart=always
StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=nsqd
User=thelinuxlich
Group=thelinuxlich

[Install]
WantedBy=multi-user.target

The "curl localhost:4151/stats" output of the nsqd node that I had to restart because it stopped sending messages:

nsqd v0.3.8 (built w/go1.6.2)
start_time 2017-01-25T07:10:55Z
uptime 38h2m8.709939651s

Health: OK

   [batch          ] depth: 0     be-depth: 0     msgs: 56181    e2e%: 
      [default                  ] depth: 0     be-depth: 0     inflt: 0    def: 0    re-q: 0     timeout: 0     msgs: 56181    e2e%: 
        [V2 aidax-batcher-2:53769] state: 3 inflt: 0    rdy: 14   fin: 28931    re-q: 0        msgs: 28931    connected: 37h48m11s
        [V2 aidax-batcher-1:49706] state: 3 inflt: 0    rdy: 12   fin: 27250    re-q: 0        msgs: 27250    connected: 37h48m3s

The consumer uses https://github.com/dudleycarr/nsqjs. Here is a sample of the JS code showing how I use it:

const nsq = require("nsqjs");

// Consume the "batch" topic on the "default" channel, discovering nsqd via nsqlookupd
const reader = new nsq.Reader("batch", "default", {
    maxInFlight: 100,
    lookupdHTTPAddresses: ["127.0.0.1:4161"]
});

reader.connect();

reader.on("message", rec => {
    const msg = rec.body.toString();
    do_something_with_it(msg);
    rec.finish(); // acknowledge the message
});

@dudleycarr
Contributor

I could be misreading the stats output, but it does look to me like the channel has processed all of the messages. Are you sure that there are unprocessed messages?

@thelinuxlich
Author

There were millions of messages unprocessed, but I restarted the nsqd daemon before posting the issue here.

@dudleycarr
Contributor

When the message processing stalls, could you please try restarting the nsqjs worker instead of nsqd? If that allows processing to resume, it's probably an issue with nsqjs. In that case, it would be helpful to open an issue on nsqjs and provide the output from running a worker with DEBUG=* node ....

@thelinuxlich
Author

I tried restarting the nsqjs worker first and it didn't work. Only after restarting the nsqd node did it reappear in the nsqadmin UI and start sending all those messages saved to disk.

@ploxiln
Member

ploxiln commented Jan 27, 2017

If it happens again, please capture stats and recent log lines before restarting anything, and post them here. Thanks!

@thomasmodeneis

I have had this issue as well for quite some time, but I believe it has more to do with nsq.js than with NSQ itself.
In my experience it is related to large queues and misbehaving network connections.
I usually hit this problem when the processors are really busy for a while and then there is an interval when nothing is processed; as soon as more messages are dropped into the queue, the driver doesn't seem to recognize that there are new messages and restart processing jobs.

Restarting the job processors worked around the issue for me. If you run Docker containers, you can have a job that checks the message count on the queue and, if nothing has been processed for, say, 30 minutes, restarts the containers responsible for consuming those messages.
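
A rough Node sketch of what such a watchdog might look like (illustrative only: the container name "worker", the "batch"/"default" topic and channel, and the nsqd address are assumptions, and the stats JSON layout differs slightly between nsqd versions):

// watchdog.js - restart the consumer container if a channel stops draining.
const http = require("http");
const { execSync } = require("child_process");

const STALL_LIMIT_MS = 30 * 60 * 1000; // 30 minutes without progress
let lastProgress = Date.now();
let lastDepth = null;

function checkStats() {
    http.get("http://127.0.0.1:4151/stats?format=json", res => {
        let raw = "";
        res.on("data", chunk => (raw += chunk));
        res.on("end", () => {
            const body = JSON.parse(raw);
            const stats = body.data || body; // older nsqd wraps the payload in "data"
            const topic = (stats.topics || []).find(t => t.topic_name === "batch");
            const channel = topic && topic.channels.find(c => c.channel_name === "default");
            if (!channel) return;

            if (lastDepth === null || channel.depth < lastDepth) {
                lastProgress = Date.now(); // messages are draining
            } else if (channel.depth > 0 && Date.now() - lastProgress > STALL_LIMIT_MS) {
                execSync("docker restart worker"); // kick the stuck consumer
                lastProgress = Date.now();
            }
            lastDepth = channel.depth;
        });
    }).on("error", err => console.error("stats check failed:", err.message));
}

setInterval(checkStats, 60 * 1000); // poll once a minute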

I would recommend having a look at the Go driver https://github.com/nsqio/go-nsq and checking with a small PoC whether you can reproduce the same problem with Go as well.

Cheers.

@mcorb

mcorb commented Feb 27, 2017

Whether or not this bug is a JS client issue, it'd be useful to get some clarity on the three available libraries:

nsqjs (without dot) seems to introduce a lot of complexity and special cases compared to node-nsqueue, while nsq.js (with dot) sits somewhere in-between.

Is there a recommended client? node-nsqueue looks promising in terms of simplicity and closeness to the wire protocol. In contrast the other two contain custom message queues, object coercion and a bunch of stuff that introduces surface area for bugs.

@mreiferson
Member

@mcorb I'm 99.99999% sure this issue is related to client libraries, not nsqd. I've left it open in the hope that we would uncover information to confirm that.

My recommendation would be to use dudleycarr/nsqjs or segmentio/nsq.js, which have both seen significant production usage.

@ceejbot

ceejbot commented Apr 13, 2017

I've just seen this happening with nsqd 1.0.0-compat and the latest version of nsqjs. The setup is about 30 nsqd instances on individual hosts (only 4 active during this test) and 1 nsqlookupd. There are 4 worker processes, all successfully pulling data from 3 of the 4 nsqd nodes. One node was just piling up unprocessed messages, even though the workers reported connecting to it. Restarting the workers did not fix it.

Here's a screenshot of the state of nsqadmin after nsqd restart. The workers did NOT need to be restarted to start getting data from the previously-dead node.

[screenshot: nsqadmin after the nsqd restart, 2017-04-12 17:26]

The affected nsqd log was full of heartbeats & nothing else:

[nsqd] 2017/04/13 00:19:16.502328 LOOKUPD(nsqadmin-west-1.npm.red:4160): sending heartbeat
[nsqd] 2017/04/13 00:19:31.502323 LOOKUPD(nsqadmin-west-1.npm.red:4160): sending heartbeat
[nsqd] 2017/04/13 00:19:46.502330 LOOKUPD(nsqadmin-west-1.npm.red:4160): sending heartbeat
[nsqd] 2017/04/13 00:20:01.502338 LOOKUPD(nsqadmin-west-1.npm.red:4160): sending heartbeat
[nsqd] 2017/04/13 00:20:16.502321 LOOKUPD(nsqadmin-west-1.npm.red:4160): sending heartbeat
[nsqd] 2017/04/13 00:20:31.502325 LOOKUPD(nsqadmin-west-1.npm.red:4160): sending heartbeat
[nsqd] 2017/04/13 00:20:46.502328 LOOKUPD(nsqadmin-west-1.npm.red:4160): sending heartbeat
[nsqd] 2017/04/13 00:21:01.506680 LOOKUPD(nsqadmin-west-1.npm.red:4160): sending heartbeat
[nsqd] 2017/04/13 00:21:16.502330 LOOKUPD(nsqadmin-west-1.npm.red:4160): sending heartbeat
[nsqd] 2017/04/13 00:21:31.502325 LOOKUPD(nsqadmin-west-1.npm.red:4160): sending heartbeat

@ploxiln
Member

ploxiln commented Apr 13, 2017

If In-Flight and Ready are both zero, then no messages will be received on that client connection, even though it is connected. @ceejbot, in your case this is the problem: all zeros.

If the max_in_flight option passed to the client library is less than the number of nsqd that have the topic, then in order to guarantee that maximum is not exceeded, the client library has to send a zero Ready count to some nsqd. It should "redistribute" the Ready count from an nsqd with a non-zero Ready to one with a zero Ready when the nsqd with the non-zero Ready has had no messages to send for a second or two. But if a client is constantly receiving messages from an nsqd, it may never redistribute the Ready count from that nsqd to another.

It may be necessary to set max_in_flight for the client to at least the number of nsqd to ensure messages keep moving everywhere. It would be nice to have some different Ready distribution behaviors, but that depends on the implementation in each client library.
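
For illustration, the nsqjs reader from earlier in this thread could be configured along these lines (the node count of 4 is just an assumed example; the point is that maxInFlight should be at least the number of nsqd hosting the topic):

const nsq = require("nsqjs");

// Assume 4 nsqd nodes host the "batch" topic: give the reader at least 4
// in-flight slots (2x here) so no connection gets stuck with a zero Ready count.
const NSQD_COUNT = 4;
const reader = new nsq.Reader("batch", "default", {
    maxInFlight: NSQD_COUNT * 2,
    lookupdHTTPAddresses: ["127.0.0.1:4161"]
});

reader.connect();
reader.on("message", msg => {
    // process msg.body here ...
    msg.finish();
});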

@ceejbot

ceejbot commented Apr 13, 2017

If this is a clue that is at all helpful: attempting to empty the metrics queue before the restart didn't touch the backlog of messages on the blocked nsqd. After the restart, boom, emptying the queue worked fine.

@ceejbot

ceejbot commented Apr 13, 2017

I'm happy to tweak the in-flight & ready params, but I am baffled about the other behavior described if that's indeed the problem.

@ploxiln
Member

ploxiln commented Apr 13, 2017

The max_in_flight option is what you want to set (on the consumer). It's an important knob for performance anyway. For a single client, the sum of "In Flight" and "Ready Count" across all of its connections to different nsqd will generally add up to max_in_flight.

@ceejbot

ceejbot commented Apr 13, 2017

Setting to 2x my nsqd instance count seems to have worked a treat, thanks!

@mreiferson
Member

But if a client is constantly receiving more messages from an nsqd, it may never redistribute the Ready count from that nsqd to another.

Finally addressing this in nsqio/go-nsq#208, obviously other client libraries will need to adopt this approach if it lands (I'll likely handle pynsq).

Also, there hasn't been evidence here that this is an nsqd issue, so I'm going to close this.
