Hosts get stuck on "DOWN" after a memory allocation error #1598

arthurzenika · 2015-04-29T14:29:45Z

At some point we had a memory shortage on the Shinken server. A lot of hosts then went to DOWN status with the following error:

[Errno 12] Cannot allocate memory

I believe the status should be UNKNOWN in this case.

Second problem: once the memory shortage was fixed, we could not find a way to get the hosts checked again. The one workaround I have found is to give a host a downtime of 1 minute, after which it goes back to green.


naparuba · 2015-04-29T18:53:57Z

The first one is normal: there is no UNKNOWN state for hosts, only UP or DOWN
(a failed check means DOWN).

But the second is more problematic. Was the error on the check output or in
the schedulerd.log?
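
To illustrate the point about host states, here is a minimal sketch of a two-state mapping in which any failed check, including one that could not be launched at all, ends up as DOWN. This is a simplified illustration and not Shinken's actual code: the helper name `host_state_from_check` and its parameters are hypothetical, and real handling of individual plugin exit codes may differ.

```python
import errno

HOST_UP = "UP"
HOST_DOWN = "DOWN"


def host_state_from_check(exit_code=None, launch_error=None):
    """Hypothetical helper: map a host check result to a host state.

    Hosts only have two states here, so there is nowhere to put
    "we could not even run the check" other than DOWN.
    """
    if launch_error is not None:
        # The check could not be started (for example, out of memory).
        return HOST_DOWN
    # Simplification: exit code 0 is UP, everything else is DOWN.
    return HOST_UP if exit_code == 0 else HOST_DOWN


# The error from the report, rebuilt as a Python exception.
enomem = OSError(errno.ENOMEM, "Cannot allocate memory")

print(str(enomem))
print(host_state_from_check(exit_code=0))
print(host_state_from_check(launch_error=enomem))
```

Under this assumption, a memory allocation failure is indistinguishable from an unreachable host, which matches the DOWN status seen in the report.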


arthurzenika · 2015-04-30T07:55:39Z

This is a critical failure of Shinken on Debian jessie. We cannot find any way to get Shinken to restart checks.

After trying to force checks with scheduled downtimes, and with forced checks via Livestatus

echo -e "COMMAND [$(date +%s)] SCHEDULE_FORCED_HOST_CHECK;machine.logilab.priv;$(date +%s)\n\n" | nc localhost 50000

we tried removing the retention data in /var/lib/shinken/*.dat (after stopping Shinken), then restarting.

All checks stay pending. There is no usable information in the logs, and putting the logs in debug mode drowns the information in performance printouts.
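
For anyone who wants to script the forced check above, here is a rough Python equivalent of the `echo ... | nc` one-liner. It is only a sketch: the function names are made up for illustration, and it assumes the Livestatus module is listening on localhost port 50000 as in the command above.

```python
import socket
import time


def build_forced_host_check(host_name, now=None):
    """Build the same Livestatus external command as the shell one-liner."""
    now = int(time.time()) if now is None else int(now)
    return "COMMAND [%d] SCHEDULE_FORCED_HOST_CHECK;%s;%d\n\n" % (
        now, host_name, now)


def send_livestatus(payload, address=("localhost", 50000), timeout=5.0):
    """Send a payload to the Livestatus socket (address is an assumption)."""
    with socket.create_connection(address, timeout=timeout) as sock:
        sock.sendall(payload.encode("utf-8"))


if __name__ == "__main__":
    # Only print the command here; call send_livestatus() to actually send it.
    print(build_forced_host_check("machine.logilab.priv"), end="")
```

Separating the string building from the sending makes it easy to log exactly what was submitted when checks still stay pending.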

arthurzenika · 2015-05-07T08:48:32Z

Wow... we ended up getting it to work (after much debugging) by restarting the services in a given order. Restarting the poller while all the other services were already running got it to work again. Could this be an init script bug? Is this bug specific to Debian?
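
As a sketch of that workaround, the snippet below builds a restart plan that puts the poller last, so that it restarts while every other daemon is already up. The init script names in `DAEMONS` and the use of `service NAME restart` are assumptions about a Debian-style install, so check what is actually present in /etc/init.d before running anything.

```python
import subprocess

# Assumed init script names, not taken from the report.
DAEMONS = [
    "shinken-arbiter",
    "shinken-scheduler",
    "shinken-poller",
    "shinken-reactionner",
    "shinken-broker",
    "shinken-receiver",
]


def restart_plan(daemons, last="shinken-poller"):
    """Return restart commands with the `last` daemon moved to the end."""
    others = [d for d in daemons if d != last]
    plan = [["service", d, "restart"] for d in others]
    if last in daemons:
        plan.append(["service", last, "restart"])
    return plan


def run_plan(plan, dry_run=True):
    """Print the commands, or execute them when dry_run is False."""
    for cmd in plan:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.check_call(cmd)


if __name__ == "__main__":
    run_plan(restart_plan(DAEMONS))
```

The dry run is the default on purpose, since the exact order that matters beyond "poller last" is not known from what we observed.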

arthurzenika · 2015-05-07T09:05:02Z

Bug report for the Debian maintainers: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=784624

naparuba · 2015-05-11T07:29:30Z

cool :)


naparuba closed this as completed Apr 30, 2016
