nsqd: stops sending messages #851

Closed
thelinuxlich opened this issue Jan 25, 2017 · 19 comments

@thelinuxlich

Sometimes one of my nsqd nodes simply stops sending messages to the consumer/reader (using nsqjs) and starts saving these messages to disk. When I connect with nsqadmin to the nsqlookupd address I can see that this node is not in the Topic Message Queue list. So I restart its nsqd daemon, and all of a sudden it starts sending all those messages on disk to the consumer and it reappears in the Topic Message Queue list. Its log is not very informative before that, just heartbeats to nsqlookupd.

Is there a reason why this should be happening or is it a bug?

@jehiah
Member

jehiah commented Jan 25, 2017

@thelinuxlich can you provide some context on your configuration, and relevant logs from nsqd prior to and after restarting it?

@thelinuxlich
Author

I have two nsqd nodes connecting to two nsqlookupd nodes. Each nsqd node also hosts a microservice that writes messages directly to its local nsqd instance; those messages are consumed by two other microservices that discover nsqd through the nsqlookupd instances.

Logs usually only print this:

Jan 25 13:31:25 aidax-collector-2 nsqd[19813]: [nsqd] 2017/01/25 13:31:25.501527 LOOKUPD(10.240.200.16:4160): sending heartbeat
Jan 25 13:31:25 aidax-collector-2 nsqd[19813]: [nsqd] 2017/01/25 13:31:25.504267 LOOKUPD(10.240.51.68:4160): sending heartbeat

But there was an error last week:

Jan 18 17:09:41 aidax-collector-2 nsqd[322]: [nsqd] 2017/01/18 17:09:41.682483 LOOKUPD(10.240.200.16:4160): ERROR PING - read tcp 10.240.226.74:54986->10.240.200.16:4160: i/o timeout
Jan 18 17:09:42 aidax-collector-2 nsqd[322]: [nsqd] 2017/01/18 17:09:42.682712 LOOKUPD(10.240.51.68:4160): ERROR PING - read tcp 10.240.226.74:56002->10.240.51.68:4160: i/o timeout

But since it's heartbeating every 15s, shouldn't it reconnect normally instead of saving to disk forever?

@ploxiln
Member

ploxiln commented Jan 25, 2017

The consumer should reconnect to nsqd and receive messages. But the report is still a bit too vague. Please include:

  • version of nsqd
  • command-line flags and/or config file used for nsqd
  • output of "curl localhost:4151/stats" on nsqd server
  • a bit more info about the consumer, how it is configured and started

Those heartbeats, and the "error ping", are between nsqd and nsqlookupd. They do not indicate any connections from consumers. The problem may be in the consumer, or in network access between consumer and nsqd.

@thelinuxlich
Author

I'm using nsq 0.3.8 in all nodes.

The nsqd service configuration:

[Service]
WorkingDirectory=/home/thelinuxlich
ExecStart=/usr/local/bin/nsqd -lookupd-tcp-address=10.240.200.16:4160 -lookupd-tcp-address=10.240.51.68:4160
Restart=always
StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=nsqd
User=thelinuxlich
Group=thelinuxlich

[Install]
WantedBy=multi-user.target

The "curl localhost:4151/stats" output of the nsqd node that I had to restart because it stopped sending messages:

nsqd v0.3.8 (built w/go1.6.2)
start_time 2017-01-25T07:10:55Z
uptime 38h2m8.709939651s

Health: OK

   [batch          ] depth: 0     be-depth: 0     msgs: 56181    e2e%: 
      [default                  ] depth: 0     be-depth: 0     inflt: 0    def: 0    re-q: 0     timeout: 0     msgs: 56181    e2e%: 
        [V2 aidax-batcher-2:53769] state: 3 inflt: 0    rdy: 14   fin: 28931    re-q: 0        msgs: 28931    connected: 37h48m11s
        [V2 aidax-batcher-1:49706] state: 3 inflt: 0    rdy: 12   fin: 27250    re-q: 0        msgs: 27250    connected: 37h48m3s

The consumer uses https://github.com/dudleycarr/nsqjs. Here is a sample of the JS code showing how I use it:

const nsq = require("nsqjs");

// Consume the "batch" topic on the "default" channel, discovering nsqd via nsqlookupd
const reader = new nsq.Reader("batch", "default", {
    maxInFlight: 100,
    lookupdHTTPAddresses: ["127.0.0.1:4161"]
});

reader.connect();

reader.on("message", rec => {
    const msg = rec.body.toString();
    do_something_with_it(msg);
    rec.finish(); // acknowledge the message
});

@dudleycarr
Contributor

I could be misreading the stats output, but it does look to me like the channel has processed all of the messages. Are you sure that there are unprocessed messages?

@thelinuxlich
Author

There were millions of messages unprocessed, but I restarted the nsqd daemon before posting the issue here.

@dudleycarr
Contributor

When the message processing stalls, could you please try restarting the nsqjs worker instead of nsqd? If that allows processing to resume, it's probably an issue with nsqjs. In that case, it would be helpful to open an issue on nsqjs and provide the output from running a worker with DEBUG=* node ....

@thelinuxlich
Author

I tried restarting the nsqjs worker first and it didn't work. Only after restarting the nsqd node did it reappear in the nsqadmin UI and start sending all those messages saved to disk.

@ploxiln
Member

ploxiln commented Jan 27, 2017

If it happens again, please capture stats and recent log lines before restarting anything, and post them here. Thanks!

@thomasmodeneis

I have had this issue as well for quite some time, but I believe it has more to do with nsq.js than with NSQ itself.
In my experience it is related to large queues and misbehaving network connections.
I usually hit this problem when the processors are really busy for a while and then there is an interval when nothing is processed; as soon as more messages are dropped into the queue, the driver doesn't seem to recognize that there are new messages and restart processing jobs.

Restarting the job processors worked around the issue for me. If you run Docker containers, you can have a job that checks the message count on the queue and, if nothing has been processed for, say, 30 minutes, restarts the containers responsible for consuming those messages.
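
A rough Node sketch of what such a watchdog might look like (illustrative only: the container name "worker", the "batch"/"default" topic and channel, and the nsqd address are assumptions, and the stats JSON layout differs slightly between nsqd versions):

// watchdog.js - restart the consumer container if a channel stops draining.
const http = require("http");
const { execSync } = require("child_process");

const STALL_LIMIT_MS = 30 * 60 * 1000; // 30 minutes without progress
let lastProgress = Date.now();
let lastDepth = null;

function checkStats() {
    http.get("http://127.0.0.1:4151/stats?format=json", res => {
        let raw = "";
        res.on("data", chunk => (raw += chunk));
        res.on("end", () => {
            const body = JSON.parse(raw);
            const stats = body.data || body; // older nsqd wraps the payload in "data"
            const topic = (stats.topics || []).find(t => t.topic_name === "batch");
            const channel = topic && topic.channels.find(c => c.channel_name === "default");
            if (!channel) return;

            if (lastDepth === null || channel.depth < lastDepth) {
                lastProgress = Date.now(); // messages are draining
            } else if (channel.depth > 0 && Date.now() - lastProgress > STALL_LIMIT_MS) {
                execSync("docker restart worker"); // kick the stuck consumer
                lastProgress = Date.now();
            }
            lastDepth = channel.depth;
        });
    }).on("error", err => console.error("stats check failed:", err.message));
}

setInterval(checkStats, 60 * 1000); // poll once a minute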

I would recommend having a look at the Go driver https://github.com/nsqio/go-nsq and checking with a small PoC whether you can reproduce the same problem with Go as well.

Cheers.

@mcorb

mcorb commented Feb 27, 2017

Whether or not this bug is a JS client issue, it'd be useful to get some clarity on the three available libraries:

nsqjs (without dot) seems to introduce a lot of complexity and special cases compared to node-nsqueue, while nsq.js (with dot) sits somewhere in-between.

Is there a recommended client? node-nsqueue looks promising in terms of simplicity and closeness to the wire protocol. In contrast the other two contain custom message queues, object coercion and a bunch of stuff that introduces surface area for bugs.

@mreiferson
Member

@mcorb I'm 99.99999% sure this issue is related to client libraries, not nsqd. I've left it open in the hope that we would uncover information to confirm that.

My recommendation would be to use dudleycarr/nsqjs or segmentio/nsq.js, which have both seen significant production usage.

@ceejbot

ceejbot commented Apr 13, 2017

I've just seen this happening with nsqd 1.0.0-compat and the latest version of nsqjs. The setup is about 30 nsqd instances on individual hosts (only 4 active during this test) and 1 nsqlookupd. There are 4 worker processes, all successfully pulling data from 3 of the 4 nsqd nodes. One node was just piling up unprocessed messages, even though the workers reported connecting to it. Restarting the workers did not fix it.

Here's a screenshot of the state of nsqadmin after nsqd restart. The workers did NOT need to be restarted to start getting data from the previously-dead node.

[screenshot: nsqadmin after the nsqd restart, 2017-04-12 17:26]

The affected nsqd log was full of heartbeats & nothing else:

[nsqd] 2017/04/13 00:19:16.502328 LOOKUPD(nsqadmin-west-1.npm.red:4160): sending heartbeat
[nsqd] 2017/04/13 00:19:31.502323 LOOKUPD(nsqadmin-west-1.npm.red:4160): sending heartbeat
[nsqd] 2017/04/13 00:19:46.502330 LOOKUPD(nsqadmin-west-1.npm.red:4160): sending heartbeat
[nsqd] 2017/04/13 00:20:01.502338 LOOKUPD(nsqadmin-west-1.npm.red:4160): sending heartbeat
[nsqd] 2017/04/13 00:20:16.502321 LOOKUPD(nsqadmin-west-1.npm.red:4160): sending heartbeat
[nsqd] 2017/04/13 00:20:31.502325 LOOKUPD(nsqadmin-west-1.npm.red:4160): sending heartbeat
[nsqd] 2017/04/13 00:20:46.502328 LOOKUPD(nsqadmin-west-1.npm.red:4160): sending heartbeat
[nsqd] 2017/04/13 00:21:01.506680 LOOKUPD(nsqadmin-west-1.npm.red:4160): sending heartbeat
[nsqd] 2017/04/13 00:21:16.502330 LOOKUPD(nsqadmin-west-1.npm.red:4160): sending heartbeat
[nsqd] 2017/04/13 00:21:31.502325 LOOKUPD(nsqadmin-west-1.npm.red:4160): sending heartbeat

@ploxiln
Member

ploxiln commented Apr 13, 2017

If In-Flight and Ready are both zero, then no messages will be received on that client connection, even though it is connected. @ceejbot, in your case this is the problem: all zeros.

If the max_in_flight option passed to the client library is less than the number of nsqd that have the topic, then in order to guarantee that maximum is not exceeded, the client library has to send a zero Ready count to some nsqd. It should "redistribute" the Ready count from an nsqd with a non-zero Ready to one with a zero Ready when the nsqd with the non-zero Ready has had no messages to send for a second or two. But if a client is constantly receiving messages from an nsqd, it may never redistribute the Ready count from that nsqd to another.

It may be necessary to set max_in_flight for the client to at least the number of nsqd to ensure messages keep moving everywhere. It would be nice to have some different Ready distribution behaviors, but that depends on the implementation in each client library.
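
For illustration, the nsqjs reader from earlier in this thread could be configured along these lines (the node count of 4 is just an assumed example; the point is that maxInFlight should be at least the number of nsqd hosting the topic):

const nsq = require("nsqjs");

// Assume 4 nsqd nodes host the "batch" topic: give the reader at least 4
// in-flight slots (2x here) so no connection gets stuck with a zero Ready count.
const NSQD_COUNT = 4;
const reader = new nsq.Reader("batch", "default", {
    maxInFlight: NSQD_COUNT * 2,
    lookupdHTTPAddresses: ["127.0.0.1:4161"]
});

reader.connect();
reader.on("message", msg => {
    // process msg.body here ...
    msg.finish();
});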

@ceejbot

ceejbot commented Apr 13, 2017

If this is a clue that is at all helpful: attempting to empty the metrics queue before the restart didn't touch the backlog of messages on the blocked nsqd. After the restart, boom, emptying the queue worked fine.

@ceejbot

ceejbot commented Apr 13, 2017

I'm happy to tweak the in-flight & ready params, but I am baffled about the other behavior described if that's indeed the problem.

@ploxiln
Member

ploxiln commented Apr 13, 2017

The max_in_flight option is what you want to set (on the consumer). It's an important knob for performance anyway. For a single client, the sum of "In Flight" and "Ready Count" across all of its connections to different nsqd will generally add up to max_in_flight.

@ceejbot

ceejbot commented Apr 13, 2017

Setting to 2x my nsqd instance count seems to have worked a treat, thanks!

@mreiferson
Member

But if a client is constantly receiving more messages from an nsqd, it may never redistribute the Ready count from that nsqd to another.

Finally addressing this in nsqio/go-nsq#208, obviously other client libraries will need to adopt this approach if it lands (I'll likely handle pynsq).

Also, there hasn't been evidence here that this is an nsqd issue, so I'm going to close this.
