nsqd: disk full does not cause error on PUB to existing topic #1231

dodysw2 · 2020-01-10T14:53:32Z

Hello,

I noticed that nsqd correctly marks its health state as "not ok" (e.g. via the /ping endpoint) when the disk is full and a message is published to a new topic. However, it does not do this when a message is published to an existing topic. Even worse, publishing to an existing topic resets the state to healthy/ok.

Steps to replicate:

./nsqd -mem-queue-size=0
curl 'http://localhost:4151/ping'
## see OK
echo "Haha" | ./to_nsq --topic t1 --nsqd-tcp-address 127.0.0.1:4150 --rate 1000
curl 'http://localhost:4151/ping'
## see OK
sudo fallocate -l 80G penuh80g
df
## see 0 available space left
curl 'http://localhost:4151/ping'
## see OK
echo "Haha" | ./to_nsq --topic t1 --nsqd-tcp-address 127.0.0.1:4150 --rate 1000
curl 'http://localhost:4151/ping'
## still OK
echo "Hoho" | ./to_nsq --topic t2 --nsqd-tcp-address 127.0.0.1:4150 --rate 1000
curl 'http://localhost:4151/ping'
## now NOK
echo "Haha" | ./to_nsq --topic t1 --nsqd-tcp-address 127.0.0.1:4150 --rate 1000
## sending to existing topic reset the error status -- now OK
df
## see 0 available space left


mreiferson · 2020-06-11T05:25:44Z

Hey @dodysw2, apologies for the delay in responding here.

This behavior can likely be explained by the sync behavior of the underlying diskqueue. For an existing topic, where the underlying diskqueue files have already been created, a single write of a ~4 byte message isn't going to force a sync to the filesystem, which nsqd then interprets as a successful write.

I suspect if you configure nsqd with --sync-every=1 then you'll see the behavior you expect.

There's probably some improvement to be made here, but the "healthiness" indicator in nsqd isn't intended to be incredibly sophisticated. There are so many different failure modes that I'm not convinced it's worth the effort.

NOTE: in your example debugging steps, the --rate parameter to to_nsq doesn't actually send 1000 messages; it just rate-limits the messages read from stdin to 1000 per second 😁

mreiferson closed this as completed Jun 11, 2020

mreiferson added the question label Jun 11, 2020
