Host get stuck on "DOWN" after an memory allocation error #1598

Closed
arthurzenika opened this issue Apr 29, 2015 · 5 comments

Comments

@arthurzenika
Contributor

At some point we had a memory shortage on the shinken server. A lot of hosts then went to a DOWN status with the following error:

[Errno 12] Cannot allocate memory

I believe the status should be unknown in this case.

Second problem: once the memory shortage was fixed, we couldn't find a way to get the hosts checked again. One workaround I've found is to give a host a downtime of 1 minute, after which it goes back to green.

@naparuba
Contributor

The first one is normal: there is no unknown state for hosts, only up or down
(a failed check means down).

But the second is more problematic. Was the error on the check output or in
the schedulerd.log?
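
One quick way to answer that is to count how often the error string appears in the scheduler log. This is only a minimal sketch: the function name is hypothetical, and the log path in the usage line is an assumption about a typical install, not something confirmed in this thread.

```shell
#!/bin/sh
# Hypothetical helper: count the lines of a log file that contain the
# memory allocation error quoted in this issue.
count_alloc_errors() {
    logfile=$1
    # -c prints the number of matching lines; -F treats the pattern as a
    # fixed string, so "[Errno 12]" is not read as a regex bracket class.
    # grep exits with status 1 when nothing matches (it still prints 0),
    # hence the "|| true".
    grep -cF '[Errno 12] Cannot allocate memory' "$logfile" || true
}
```

Possible usage, assuming the log lives there: `count_alloc_errors /var/log/shinken/schedulerd.log`. A count of 0 would suggest the message only appeared in the check output.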

@arthurzenika
Contributor Author

This is a critical failure of shinken on debian jessie. We cannot find any way to get shinken to restart checks.

We tried to force checks by scheduling downtimes, and by forcing checks via livestatus:

echo -e "COMMAND [$(date +%s)] SCHEDULE_FORCED_HOST_CHECK;machine.logilab.priv;$(date +%s)\n\n" | nc localhost 50000
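
The same command line can also be built with printf, which tends to behave more consistently across shells than echo -e. A minimal sketch, not a definitive recipe: the function name is hypothetical, while the host name, port 50000 and nc transport are simply the ones from the command above.

```shell
#!/bin/sh
# Hypothetical helper: print a livestatus COMMAND request that forces a
# host check.
# $1 = host name, $2 = Unix timestamp (defaults to the current time).
forced_host_check_cmd() {
    host=$1
    now=${2:-$(date +%s)}
    # The command line is followed by an empty line, which ends the request.
    printf 'COMMAND [%s] SCHEDULE_FORCED_HOST_CHECK;%s;%s\n\n' "$now" "$host" "$now"
}
```

It would be used as `forced_host_check_cmd machine.logilab.priv | nc localhost 50000`. Keeping the string construction in a function makes it easy to inspect the exact request before sending it.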

We also tried removing the retention data in /var/lib/shinken/*.dat (after stopping shinken), then restarting.
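
For anyone repeating the retention step, a more cautious variant is to move the files aside rather than delete them, so they can be restored afterwards. This is a sketch under that assumption; the function name and the backup directory are hypothetical, and shinken should be stopped first, as above.

```shell
#!/bin/sh
# Hypothetical helper: move retention files aside instead of deleting them.
# $1 = retention directory, $2 = backup directory (created if missing).
stash_retention() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    for f in "$src"/*.dat; do
        # If no .dat file exists the glob stays literal, so skip it.
        [ -e "$f" ] || continue
        mv -- "$f" "$dest"/ || return 1
    done
}
```

For example: `stash_retention /var/lib/shinken /var/lib/shinken/retention.bak`, where the second path is only an illustration.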

Everything stays as pending checks. There is no usable information in the logs, and putting the logs in debug mode drowns the information in performance printouts.

@arthurzenika
Contributor Author

Wow... we ended up getting it to work (after much debugging) by restarting the services in a given order. Restarting the poller while all the other services were running got it working again. Could this be an init script bug? Is this bug specific to Debian?
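
The workaround can be sketched roughly as follows: check that the other daemons report as running, then restart only the poller. The service names and the use of the `service` command are assumptions for illustration, not taken from the Debian package, and the function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch: restart the poller only once the other daemons
# report as running. The service names below are assumed.
OTHER_DAEMONS="shinken-arbiter shinken-scheduler shinken-broker shinken-reactionner shinken-receiver"

# $1 = command used to talk to the init system (defaults to "service").
# Pass "echo" for a dry run that only prints what would be executed.
restart_poller_last() {
    svc=${1:-service}
    for daemon in $OTHER_DAEMONS; do
        # Stop here if any of the other daemons is not running.
        "$svc" "$daemon" status || return 1
    done
    "$svc" shinken-poller restart
}
```

Running `restart_poller_last echo` first shows the commands without touching any service; `restart_poller_last` on its own would then perform them.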

@arthurzenika
Contributor Author

bug report for debian maintainers : https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=784624

@naparuba
Contributor

cool :)
