datanode won't restart after hitting flood-stage watermark #57
@j3k0 Hi, sorry to hear about your problem. Can you share more of the logs? The snippet is not enough for me to deduce where it fails and how to help you. Also, which setup are you running?
docker-compose.yml (section about datanode only):
datanode:
  image: "graylog/graylog-datanode:5.2"
  hostname: "2a9e851d2339"
  environment:
    GRAYLOG_DATANODE_NODE_ID_FILE: "/var/lib/graylog-datanode/node-id"
    GRAYLOG_DATANODE_PASSWORD_SECRET: "xxx"
    GRAYLOG_DATANODE_ROOT_PASSWORD_SHA2: "xxx"
    GRAYLOG_DATANODE_MONGODB_URI: "mongodb://mongodb:27017/graylog"
  ulimits:
    memlock:
      hard: -1
      soft: -1
    nofile:
      soft: 65536
      hard: 65536
  ports:
    - "192.168.96.2:8999:8999/tcp" # DataNode API
    - "192.168.96.2:9200:9200/tcp"
    - "192.168.96.2:9300:9300/tcp"
  volumes:
    - "/opt/graylog/data/graylog-datanode:/var/lib/graylog-datanode"
  restart: "on-failure"
Logs:
@j3k0 Thanks for the additional info. I have to discuss it with my colleagues. I'll get back to you asap.
@j3k0 We're missing the surrounding DataNode logs (the excerpt only covers OpenSearch). Is it possible to attach them as well?
I apologize, I recreated the Docker containers, so I'm afraid the logs are lost. Reproducing should be easy: lower OpenSearch's flood-stage watermark so that the indices are turned read-only, then restart the Docker containers. OpenSearch doesn't start, whereas I believe it should start with the indices read-only.
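For anyone trying this reproduction: the disk watermarks are dynamic cluster settings (`cluster.routing.allocation.disk.watermark.low`, `.high` and `.flood_stage`) changed through the `_cluster/settings` endpoint. Below is a minimal Python sketch, assuming the OpenSearch REST API is reachable over plain HTTP without authentication (a DataNode with TLS or authentication enabled needs a different URL and credentials). The helper names are hypothetical. Because OpenSearch rejects watermark combinations where low/high/flood stage are inconsistent, all three are set together here.

```python
import json
import urllib.request

# The three percentage/byte watermarks OpenSearch checks against disk usage.
WATERMARK_KEYS = (
    "cluster.routing.allocation.disk.watermark.low",
    "cluster.routing.allocation.disk.watermark.high",
    "cluster.routing.allocation.disk.watermark.flood_stage",
)


def build_watermark_request(base_url, value):
    """Build (but do not send) a PUT _cluster/settings request for all disk watermarks.

    value="1%" makes almost any disk exceed flood stage; value=None is sent as
    JSON null, which resets the settings to their defaults.
    """
    body = {"persistent": {key: value for key in WATERMARK_KEYS}}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def send(request, timeout=10):
    # Performs the HTTP call; urlopen raises urllib.error.HTTPError on error statuses.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (assumption: plain HTTP, no auth, port 9200 as published in the compose file):
# send(build_watermark_request("http://192.168.96.2:9200", "1%"))   # trigger the block
# send(build_watermark_request("http://192.168.96.2:9200", None))   # restore defaults
```

Resetting the watermarks to their defaults before restarting the containers mimics "freeing up disk space", which is the situation reported in this issue.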
Narrowing it down: I ran several scenarios on a non-containerized DataNode and could not reproduce the problem. The DataNode recovered gracefully, and the server resumed ingesting data once more disk space became available. So it appears to be specific to Docker or to the reporter's environment.
@j3k0 Is there any chance the freed-up disk space wasn't enough, or that it isn't available to Docker? Also, for debugging, can you disable the observability and opendistro plugins and any other third-party plugins?
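One way to check whether enough space was actually freed is to measure the filesystem holding the DataNode data: the host directory mounted in the compose file (`/opt/graylog/data/graylog-datanode`) and, separately, `/var/lib/graylog-datanode` from inside the container. Below is a minimal Python sketch. It assumes OpenSearch's default percentage watermarks (low 85%, high 90%, flood stage 95%) are in effect and only approximates how OpenSearch computes disk usage.

```python
import shutil

# Assumption: OpenSearch's default percentage-based disk watermarks are in use.
DEFAULT_WATERMARKS = {"low": 85.0, "high": 90.0, "flood_stage": 95.0}


def used_percent(total_bytes, free_bytes):
    """Percentage of the disk in use (an approximation of OpenSearch's calculation)."""
    return 100.0 * (total_bytes - free_bytes) / total_bytes


def exceeded_watermarks(percent, watermarks=DEFAULT_WATERMARKS):
    """Names of the watermarks that the given usage percentage is at or above."""
    return [name for name, limit in watermarks.items() if percent >= limit]


def check_path(path):
    # shutil.disk_usage reports on the filesystem that actually contains `path`,
    # so running it inside the container shows what the container can see.
    usage = shutil.disk_usage(path)
    percent = used_percent(usage.total, usage.free)
    return percent, exceeded_watermarks(percent)


# Usage: print(check_path("/var/lib/graylog-datanode"))
```

If the host path looks fine but the same check inside the container still reports `flood_stage`, the space freed on the host is not reaching the volume the DataNode writes to.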
Is there any news on this?
@dascgit Hi, what news are you looking for? Your log snippet gives no indication of the error (it might be a totally different problem from the first one). Do you have more of the logs leading up to these lines? Also, we could not reproduce the original error. Thinking about the original error again, my current guess is that, because it runs inside Docker, the original poster's internal Docker volume ran full.
@janheise I think it's the same problem.
Resetting it is done with a request to the REST API, which requires OpenSearch to be up, even if it's in read-only mode. See https://stackoverflow.com/questions/50609417/elasticsearch-error-cluster-block-exception-forbidden-12-index-read-only-all
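The request discussed in the linked answer sets `index.blocks.read_only_allow_delete` to `null` through the index settings API. Below is a minimal Python sketch that builds that request. It assumes the REST API is reachable over plain HTTP without authentication, and the helper names are hypothetical. As noted in this thread, it can only help while OpenSearch is actually running.

```python
import json
import urllib.request


def build_unblock_request(base_url, index_pattern="_all"):
    """Build (but do not send) a PUT <index_pattern>/_settings request.

    Setting index.blocks.read_only_allow_delete to JSON null removes the
    flood-stage block from every index matching index_pattern.
    """
    body = {"index.blocks.read_only_allow_delete": None}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index_pattern}/_settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def send(request, timeout=10):
    # Fails with a connection error if the OpenSearch REST API is down,
    # which is exactly the situation described in this issue.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (assumption: plain HTTP, no auth, port 9200 as published in the compose file):
# send(build_unblock_request("http://192.168.96.2:9200"))
# send(build_unblock_request("http://192.168.96.2:9200", ".opensearch-observability"))
```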
What I don't understand: in my experience (and in what my colleague tested), OpenSearch keeps running when it hits the watermark, and it also restarts. Once the volume has been expanded, it should automatically return to regular mode after some time. If that doesn't happen, then yes, one should send a request to do so, which is currently not easy to accomplish. But if OpenSearch doesn't start at all inside the DataNode, you cannot send any requests that change its configuration. In that case, I think it's not a simple watermark issue. I need more logs if we want to get to the bottom of this.
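To check whether OpenSearch has already lifted the block on its own after the volume was expanded, you can list the indices whose settings still contain `index.blocks.read_only_allow_delete`. Below is a minimal Python sketch. It assumes the JSON body of `GET _all/_settings` (default nested form, not `flat_settings`) has already been fetched, and it makes no network call itself. OpenSearch returns setting values as strings, so an active block shows up as `"true"`.

```python
def blocked_indices(settings_response):
    """Return the sorted names of indices that still carry read_only_allow_delete.

    settings_response: parsed JSON of GET _all/_settings, shaped like
    {"<index>": {"settings": {"index": {"blocks": {...}}}}}.
    """
    blocked = []
    for index_name, entry in settings_response.items():
        blocks = entry.get("settings", {}).get("index", {}).get("blocks", {})
        # Values arrive as strings ("true"/"false"); str() also tolerates booleans.
        if str(blocks.get("read_only_allow_delete", "false")).lower() == "true":
            blocked.append(index_name)
    return sorted(blocked)
```

OpenSearch is expected to release the flood-stage block automatically once disk usage falls below the high watermark. If indices stay in this list long after space was freed, that fits the view above that something beyond a simple watermark issue is going on.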
@janheise If I get the errors again, I will try to get you more logs.
Closed for now, since it did not seem to reappear.
My computer ran low on disk space and hit OpenSearch's flood-stage watermark, so OpenSearch set all indices to read-only.
Now (after freeing up disk space) graylog-datanode will still not restart:
WARN ClusterBlockException[index [.opensearch-observability] blocked by: [TOO_MANY_REQUESTS/12/disk usage exceeded flood-stage watermark, index has read-only-allow-delete block];]
...
WARN [OpensearchNodeHeartbeat] Opensearch REST api of process 679 unavailable. Cause: Unable to parse response body
WARN [OpensearchProcessImpl] Opensearch process failed
The problem is that I need OpenSearch to be up in order to reset the read-only status of the indices (this is done with the REST API).
Since the startup script kills OpenSearch almost immediately, I don't have time to do so.
Any ideas?