nsqd: disk full does not cause error on PUB to existing topic #1231

Closed
dodysw2 opened this issue Jan 10, 2020 · 1 comment
dodysw2 commented Jan 10, 2020

Hello,

I noticed that nsqd correctly sets its health state to "not ok" (visible e.g. via the /ping endpoint) when the disk is full and a message is published to a new topic. However, it does not do this when a message is published to an existing topic. Even worse, such a publish resets the state to healthy/ok.

Steps to replicate:

./nsqd -mem-queue-size=0
curl 'http://localhost:4151/ping'
## see OK
echo "Haha" | ./to_nsq --topic t1 --nsqd-tcp-address 127.0.0.1:4150 --rate 1000
curl 'http://localhost:4151/ping'
## see OK
sudo fallocate -l 80G penuh80g
df
## see 0 available space left
curl 'http://localhost:4151/ping'
## see OK
echo "Haha" | ./to_nsq --topic t1 --nsqd-tcp-address 127.0.0.1:4150 --rate 1000
curl 'http://localhost:4151/ping'
## still OK
echo "Hoho" | ./to_nsq --topic t2 --nsqd-tcp-address 127.0.0.1:4150 --rate 1000
curl 'http://localhost:4151/ping'
## now NOK
echo "Haha" | ./to_nsq --topic t1 --nsqd-tcp-address 127.0.0.1:4150 --rate 1000
## sending to existing topic reset the error status -- now OK
df
## see 0 available space left
@mreiferson
Member

Hey @dodysw2, apologies for the delay in responding here.

This behavior can likely be explained by the sync behavior of the underlying diskqueue. For an existing topic, where the underlying diskqueue files have already been created, a single write of a ~4 byte message isn't going to force a sync to the filesystem, which nsqd then interprets as a successful write.

I suspect if you configure nsqd with --sync-every=1 then you'll see the behavior you expect.

There's probably some improvement to be made here, but the "healthiness" indicator in nsqd isn't intended to be incredibly sophisticated. There are so many different failure modes that I'm not convinced it's worth the effort.

NOTE: in your example debugging steps, the --rate parameter to to_nsq doesn't actually send 1000 messages. It just rate-limits the messages read from stdin to 1000 per second 😁
