Merge current Naemon master into fork #1

nook24 · 2019-04-10T14:23:45Z

No description provided.

The tv_sec field in a timespec struct is defined as a time_t, which is often a long int (but not always). On 64-bit systems this is usually fine (except when it's not), but on 32-bit systems the arithmetic in the timespec_msdiff function can overflow the resulting diff, for example when given a large negative value such as (0 - time(NULL)). This in turn leads to a diff a number of years in the future. For example, (time_t)-1464166432 * 1000 = 417415936, which is about 13 years in the future. Because of how we use this function, this has a nasty consequence: it would a) cause events scheduled way back to be scheduled over a decade too late, and b) thus block the execution of *all other events* until that event was run. This patch fixes that issue by defining more strictly what happens in the case of an overflow, while still maintaining the desired properties. Namely, we simply return LONG_MIN/LONG_MAX if event A is too far in the past or future relative to event B. In addition, I also added a bunch of tests to catch this type of bug in the future, as well as portable support for overflow-checked arithmetic. Signed-off-by: Anton Lofgren <alofgren@op5.com>
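To illustrate the saturating behaviour described above, here is a minimal sketch. It is not the actual timespec_msdiff implementation: the name saturating_msdiff is hypothetical, and it assumes normalized tv_nsec values (0 to 999999999). It does the arithmetic in long long and clamps to LONG_MIN/LONG_MAX instead of letting the multiplication wrap.

```c
#include <limits.h>
#include <time.h>

/* Hypothetical helper (not Naemon's code): a - b in milliseconds,
 * saturating to LONG_MIN/LONG_MAX instead of overflowing. */
long saturating_msdiff(const struct timespec *a, const struct timespec *b)
{
    long long as = (long long)a->tv_sec;
    long long bs = (long long)b->tv_sec;
    long long secs, ms;

    /* Detect overflow of (as - bs) before performing the subtraction. */
    if (bs < 0 && as > LLONG_MAX + bs)
        return LONG_MAX;
    if (bs > 0 && as < LLONG_MIN + bs)
        return LONG_MIN;
    secs = as - bs;

    /* Leave headroom for the * 1000 and the sub-second part, which is
     * below 1000 ms in magnitude for normalized tv_nsec values. */
    if (secs > LLONG_MAX / 1000 - 1)
        return LONG_MAX;
    if (secs < LLONG_MIN / 1000 + 1)
        return LONG_MIN;
    ms = secs * 1000 + (a->tv_nsec - b->tv_nsec) / 1000000;

    /* Final clamp matters where long is narrower than long long. */
    if (ms > LONG_MAX)
        return LONG_MAX;
    if (ms < LONG_MIN)
        return LONG_MIN;
    return (long)ms;
}
```

With a 32-bit long, the (0 - time(NULL)) case from the message would come back as LONG_MIN rather than wrapping into a positive value.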

Because of precedence rules, if (!connect(...) == 0) is not evaluated as (not(connect(...) == 0)) in C, but as (not(connect(...)) == 0). This triggers a warning in clang. Signed-off-by: Anton Lofgren <alofgren@op5.com>

new installations should not come with warnings like: Warning: enable_environment_macros is deprecated and will be removed. Signed-off-by: Sven Nierlein <Sven.Nierlein@consol.de>

Signed-off-by: Sven Nierlein <Sven.Nierlein@consol.de>

select/poll/epoll_wait returns EINTR on signal interruption (duh), and the comment below even says that that's what we're looking for. This was introduced by some guy in 21c9601. This fixes GitHub issue #138. Signed-off-by: Anton Lofgren <alofgren@op5.com>
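The exact change is not shown here, so the following is only a sketch of the general pattern, assuming a hypothetical wrapper named poll_ignoring_eintr. poll() fails with -1 and errno set to EINTR when a signal interrupts it, and the wrapper treats that case as "nothing ready yet" rather than as a real error.

```c
#include <errno.h>
#include <poll.h>

/* Hypothetical wrapper (not Naemon's code): returns the number of ready
 * descriptors, 0 on timeout or signal interruption, -1 on a real error. */
int poll_ignoring_eintr(struct pollfd *fds, nfds_t nfds, int timeout_ms)
{
    int ready = poll(fds, nfds, timeout_ms);

    /* A signal arrived while waiting: not an error, just try again later. */
    if (ready < 0 && errno == EINTR)
        return 0;

    return ready;
}
```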

This fixes GitHub issue #140. Signed-off-by: Anton Lofgren <alofgren@op5.com>

Replace temp_file with status_file for temporary status.dat pattern to mkstemp

Add ferror(fp) to result code to check for error with calls to fprintf()
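As a rough sketch of the idea (write_status and the fields it writes are hypothetical, not the real status.dat writer): instead of checking every fprintf() return value, the stream's error indicator is checked once before returning.

```c
#include <stdio.h>

/* Hypothetical writer (not Naemon's code): returns 0 on success,
 * -1 if any of the writes to fp failed. */
int write_status(FILE *fp, const char *name, int state)
{
    fprintf(fp, "name=%s\n", name);
    fprintf(fp, "state=%d\n", state);

    /* Push buffered data out so that write errors surface now. */
    fflush(fp);

    /* The error indicator stays set once any earlier call has failed. */
    return ferror(fp) ? -1 : 0;
}
```

A caller can then refuse to rename the temporary file into place when the result is non-zero.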

Replace temp_file with retention_file for temporary retention.dat pattern to mkstemp

Add ferror(fp) to result code to check for error with calls to fprintf()

Signed-off-by: Anton Lofgren <alofgren@op5.com>

There's no need to reinvent a shell (again!) when we're already guaranteed one by POSIX. Let's use that one instead, since it allows us to a) write more powerful tests, and b) write more realistic tests. Signed-off-by: Anton Lofgren <alofgren@op5.com>

In order to provide backwards compatibility with older configurations, this patch lends some leniency to when grandchild processes are killed. It used to be the case that a process group was killed as soon as we reaped its leader. With this patch, we instead allow the descendants of the leader to continue running until the timeout of the job in question has been reached, at which point all still remaining processes will be reaped. This allows users to continue using, for example, asynchronous mail delivery via mailx, without losing their notifications when mailx's children are unexpectedly killed. In addition, this patch also cleans up the worker test suite a bit, to make it more reliable and extendable. This fixes GitHub issue #137. This is related to MON-8090. Signed-off-by: Anton Lofgren <alofgren@op5.com>

Signed-off-by: Anton Lofgren <alofgren@op5.com>

In some build systems, git is not present unless defined as a build dependency. However the .git directory may still be copied in along with the source. This change will still allow the required behavior in case .git is present but the Git binary is not.

The callback was probably meant to make it possible to create a custom worker with a separate job spawning method. But since all those methods were local to that file, no external builds could have used that method. In fact, the struct for tracking jobs didn't provide any space for custom job related data, so this wasn't even possible. Instead of fixing it without a use case, it's better to just remove it, which simplifies debugging of the code. Part of op5 issue MON-9528. Signed-off-by: Max Sikstrom <max.sikstrom@op5.com>

ESTALE seems to be set when the job has any processes left that need to be reaped. That's what happens when the main process finishes and we call finish_job. Thus, mark the job as ESTALE, and don't log when we kill dormant children, since that's the default behaviour for jobs, where we still need to be able to have dangling processes for a while after execution. One example of such dangling processes is sendmail in notification scripts, which forks a process to send the mail and exits early. Resolves: MON-9528 Signed-off-by: Max Sikstrom <max.sikstrom@op5.com>

…mon-core into ipstatic-version-without-git

A couple of thousand lines of output for each test doesn't make any sense. Repetitive tests can simply have a failure counter and validate that there are no failures. For tracking exact values, gdb is really helpful: just watch the failure counter. Also, since a lot of CI environments truncate the output after a couple of thousand log lines, this will help avoid truncated logs. Signed-off-by: Max Sikstrom <max.sikstrom@op5.com>

Just a tiny change to enforce constness of timeperiods where we don't want to modify them. Signed-off-by: Anton Lofgren <alofgren@op5.com>

This shouldn't happen during normal operation, but it happens occasionally during unit tests, and it clutters up the output and skews the test results. Signed-off-by: Anton Lofgren <alofgren@op5.com>

I had an issue with events being destroyed after the queue had been freed, which was non-trivial to track down. These assertions, along with actually NULL'ing the event_queue, would've told me right away what went wrong. Signed-off-by: Anton Lofgren <alofgren@op5.com>

…nitored-object to master

* commit 'fcf4ad56ea70f9828a66dfc851fff5ccf75429e5':
  events: Sprinkle with assertions
  workers: Don't error on a NULL specialized_workers table
  timeperiods: more const correctness
  Make timeperiod tests not that noisy

This behaviour was changed when fixing another configuration validation bug (GH issue #68). This patch restores the behaviour and fixes MON-9353. Signed-off-by: Anton Lofgren <alofgren@op5.com>

* commit '46b44e0ed7ff8350b7bb381e96a2ea07af639179': xodtemplate: Treat missing service description as an error

Even though the one handling commands by id is somewhat faster, it didn't call all the broker modules, and it also couldn't correctly be stopped from being executed. Thus, make sure all external command processing is handled the same way, and that the event can stop a command from being executed. Signed-off-by: Max Sikstrom <max.sikstrom@op5.com>

…r-nebevent-on-all-external to master * commit '0bb33b5d2086875d48cff7f160ea89b479c8a4f4': Reuse check command processing

This commit ensures that the next_check schedule for hosts and services is retained on Naemon restart, given that use_retained_scheduling_info is enabled. The logic is as follows:

- If use_retained_scheduling_info is disabled, set a random time (as before)
- If use_retained_scheduling_info is enabled:
  - If we didn't miss the check during the restart, retain the old next_check time
  - If we missed one check, schedule the service/host within the next interval_length (usually 60 seconds)
  - If we missed more than one check, schedule the next check randomly.

We schedule missed checks within 60 seconds, rather than immediately, in order to do some load balancing. This is also the rationale for scheduling the check randomly in case we missed more than one check (this indicates Naemon has been down for a longer period of time).

This fixes:
- #224
- #156
- MON-10720 (https://jira.op5.com/browse/MON-10720)

Signed-off-by: Jacob Hansen <jhansen@op5.com>
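The decision logic described in this commit message can be sketched roughly as follows. This is an illustration only: pick_next_check and random_offset are hypothetical names, all values are assumed to be in seconds, "missed one check" is read as "the retained time lies less than one check interval in the past", and "random" is assumed to mean somewhere within one check interval.

```c
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper: a pseudo-random offset in [0, window). */
static time_t random_offset(time_t window)
{
    return window > 0 ? (time_t)(rand() % window) : 0;
}

/* Hypothetical sketch (not Naemon's code) of choosing next_check
 * for a host or service after a restart. */
time_t pick_next_check(int use_retained, time_t retained_next_check, time_t now,
                       time_t check_interval, time_t interval_length)
{
    /* Retention disabled: spread checks randomly, as before. */
    if (!use_retained)
        return now + random_offset(check_interval);

    /* The check was not missed during the restart: keep the old time. */
    if (retained_next_check >= now)
        return retained_next_check;

    /* Exactly one check missed: run it soon, but not immediately. */
    if (now - retained_next_check < check_interval)
        return now + random_offset(interval_length);

    /* More than one check missed: Naemon was down for a while. */
    return now + random_offset(check_interval);
}
```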

This commit adds tests to ensure that the next_check is set correctly after Naemon restarts. This ensures the logic from the previous commit is correctly followed.

This fixes:
- #224
- #156
- MON-10720 (https://jira.op5.com/browse/MON-10720)

Signed-off-by: Jacob Hansen <jhansen@op5.com>

…xt-schedule Retain next_check schedule on restart (#224, #156)

Since we install the el7 logrotate in our Makefile.am without further OS detection we need to replace the logrotate file for el6 later. Otherwise we would end up with the el7 file and no logrotation.

Until now we only checked the state during dependency checks. But for pending hosts and services the state is usually OK/UP, so the check passed. For pending flag checks we have to look at the has_been_checked flag as well. Otherwise service checks are run while the master service is in the pending state, even if the dependency has the pending execution failure flag set.
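A toy version of that dependency test might look like this. Everything prefixed toy_ or TOY_ is made up for illustration and does not match Naemon's real structs or flag names; only has_been_checked comes from the message above.

```c
/* Hypothetical state values and failure flags, for illustration only. */
enum { TOY_STATE_OK = 0, TOY_STATE_WARNING = 1, TOY_STATE_CRITICAL = 2 };
#define TOY_FAIL_ON_CRITICAL (1u << 0)
#define TOY_FAIL_ON_PENDING  (1u << 1)

struct toy_service {
    int current_state;    /* defaults to OK before the first check */
    int has_been_checked; /* 0 while the object is still pending */
};

/* Returns 1 if the dependency on `master` counts as failed. */
int toy_dependency_failed(const struct toy_service *master, unsigned failure_flags)
{
    /* A never-checked master reports OK by default, so the pending
     * case has to be tested before looking at the state. */
    if (!master->has_been_checked)
        return (failure_flags & TOY_FAIL_ON_PENDING) != 0;

    if (master->current_state == TOY_STATE_CRITICAL)
        return (failure_flags & TOY_FAIL_ON_CRITICAL) != 0;

    return 0;
}
```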

…hutdown Init: Increase delay between SIGTERM and SIGKILL

After #259 we now keep the next_check schedule over restarts if use_retained_scheduling_info is enabled. However, after that patch, if one lowered the check_interval, the next check of an object could be more than one check_interval away after the restart. This commit ensures that if the next_check is more than one check_interval away, we randomly schedule the next check instead of using the retention data. This fixes MON-11295 (https://jira.op5.com/browse/MON-11295) Signed-off-by: Jacob Hansen <jhansen@op5.com>

This commit adds a COPYING file with the GPLv2 license. This ensures that we do not get a wrong license when running automake, and also that GitHub can automatically detect the license for the project. Signed-off-by: Jacob Hansen <jhansen@op5.com>

…ecute-checks-within-interval Always schedule next_check within check_interval

Re-add COPYING file with license

The orphaned check event handler checks the next_check against the expected next check. But normal service/host check events simply run schedule_next_..., so the orphan check will never match. Right now it's like this:

handle_host_check_event() -> run_async_host_check() -> sets is_executing true -> check never comes back -> next check scheduled
handle_host_check_event() -> run_async_host_check() -> returns an error because is_executing is still set -> next check scheduled

So since next_check is always pushed forward, the orphan check will never match, even if the host/service has had the is_executing flag set for days. To fix this, we only reschedule the next check if the is_executing flag is false. When the check takes longer than the check interval, this can lead to situations where there is no event scheduled. So make sure we schedule an event when receiving a check result and there is no event yet. Signed-off-by: Sven Nierlein <sven@nierlein.de>

#154) right now, we have to reset the flag in mod-gearman but hosts should just behave like services here when processing check results and reset the flag on processing an active check result. Signed-off-by: Sven Nierlein <sven@nierlein.de>

this flag was used by mod-gearman to detect orphaned checks for example from misconfiguration to submit a critical check result with a useful message.

Overriding checks during the host/service_initiate stage leads to a memory leak. Freeing the check_result pointer helps.

Add option to *not* check services if their host is down. references: - NagiosEnterprises/nagioscore@05e1dda

This commit adds a simple test to ensure that no service checks are being run if the setting host_down_disable_service_checks is enabled. Signed-off-by: Jacob Hansen <jhansen@op5.com>

there was a missing newline which prevented the query handler from returning the errors for commands. add a test case to ensure this does not fail again.

A neb callback deregistering itself in a callback currently causes a heap corruption. This is due to getting the naming information for the module after the callback has been run. This patch gets the naming information of the module before the callback, ensuring that we will not try to access the callback pointer after it is potentially freed. This fixes: MON-11365 & #268 Signed-off-by: Jacob Hansen <jhansen@op5.com>
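The ordering problem can be shown with a small stand-alone sketch. The toy_ names are hypothetical and this is not the NEB API; it only demonstrates copying what is needed from the registration before invoking a callback that may free it.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical registration record, for illustration only. */
struct toy_callback {
    char *module_name;
    int (*fn)(struct toy_callback *self);
};

struct toy_callback *toy_register(const char *name, int (*fn)(struct toy_callback *))
{
    struct toy_callback *cb = malloc(sizeof(*cb));
    if (!cb)
        return NULL;
    cb->module_name = malloc(strlen(name) + 1);
    if (!cb->module_name) {
        free(cb);
        return NULL;
    }
    strcpy(cb->module_name, name);
    cb->fn = fn;
    return cb;
}

/* A callback that removes its own registration while running. */
int toy_self_deregister(struct toy_callback *self)
{
    free(self->module_name);
    free(self);
    return 0;
}

/* Copy the name BEFORE calling fn(): once fn() returns, cb may already
 * have been freed, so only name_out is safe to use afterwards. */
int toy_dispatch(struct toy_callback *cb, char *name_out, size_t name_len)
{
    snprintf(name_out, name_len, "%s", cb->module_name);
    return cb->fn(cb);
}
```

After toy_dispatch() returns, a caller would log using name_out and never touch cb again.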

…corruption Fix heap corruption when callback deregisters itself

With use_retained_scheduling_info enabled, we would schedule checks that were missed by less than one check_interval within one interval_length. This commit introduces a new setting, retained_scheduling_randomize_window, which allows users to configure the window in which checks that were missed over a restart are rescheduled. This can be useful in order to increase the load balancing done after a restart, and might help fix CPU load spikes due to checks being unevenly scheduled. This is part of MON-11418. Signed-off-by: Jacob Hansen <jhansen@op5.com>

If the retained_scheduling_randomize_window is larger than the object's check_interval, then we use the check_interval for scheduling instead. This ensures that the object is always scheduled within the first check_interval after a restart. Signed-off-by: Jacob Hansen <jhansen@op5.com>
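In other words, the effective window is the smaller of the two values. A one-function sketch, with a hypothetical function name and both arguments assumed to be in seconds:

```c
#include <time.h>

/* Hypothetical helper (not Naemon's code): cap the configured randomize
 * window at the object's own check interval. */
time_t effective_randomize_window(time_t configured_window, time_t check_interval)
{
    return configured_window > check_interval ? check_interval : configured_window;
}
```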

Signed-off-by: Jacob Hansen <jhansen@op5.com>

…eck-schedule Introduce retained_scheduling_randomize_window

Signed-off-by: Jacob Hansen <jhansen@op5.com>

- newlines from spoolfiles need to be unescaped, otherwise they remain as \\n in the plugin output and the multiline output parser does not parse the output correctly.
- instead of adding more and more exceptions to g_strescape, we really only want to escape newlines, so do just that. Otherwise we end up with double encoded escape sequences in the long plugin output.

Signed-off-by: Sven Nierlein <sven@nierlein.de>
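A newline-only escape and its inverse could look roughly like the sketch below. The function names are hypothetical and this is not the code from the commit. Note the assumption that only newlines are touched, so a literal backslash followed by 'n' in the input cannot be told apart from an escaped newline after a round trip.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: replace each newline with the two characters
 * backslash and 'n'. Returns a malloc()ed string, or NULL on failure. */
char *escape_newlines(const char *in)
{
    size_t extra = 0, i, j = 0;
    char *out;

    for (i = 0; in[i]; i++)
        if (in[i] == '\n')
            extra++;

    out = malloc(i + extra + 1);
    if (!out)
        return NULL;

    for (i = 0; in[i]; i++) {
        if (in[i] == '\n') {
            out[j++] = '\\';
            out[j++] = 'n';
        } else {
            out[j++] = in[i];
        }
    }
    out[j] = '\0';
    return out;
}

/* Hypothetical inverse: turn backslash + 'n' back into a newline and
 * leave every other character (tabs, quotes, ...) untouched. */
char *unescape_newlines(const char *in)
{
    size_t i, j = 0;
    char *out = malloc(strlen(in) + 1);

    if (!out)
        return NULL;

    for (i = 0; in[i]; i++) {
        if (in[i] == '\\' && in[i + 1] == 'n') {
            out[j++] = '\n';
            i++; /* skip the 'n' */
        } else {
            out[j++] = in[i];
        }
    }
    out[j] = '\0';
    return out;
}
```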

returning NEBERROR_CALLBACKCANCEL from a NEBTYPE_HOSTCHECK_INITIATE or NEBTYPE_SERVICECHECK_INITIATE neb callback resulted in naemon running the check itself. Instead naemon should just skip the check and reschedule it. Signed-off-by: Sven Nierlein <sven@nierlein.de>

Signed-off-by: Sven Nierlein <sven@nierlein.de>

cmd_name may be null. Signed-off-by: Sven Nierlein <sven@nierlein.de>
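The usual guard for this kind of warning is to substitute a placeholder before the pointer reaches a %s conversion. A minimal sketch, assuming a hypothetical function name and message text (neither is taken from the actual patch):

```c
#include <stdio.h>

/* Hypothetical formatter (not Naemon's code): never hands a NULL
 * pointer to the %s conversion. */
int format_command(char *buf, size_t len, const char *cmd_name)
{
    return snprintf(buf, len, "Unknown command: %s",
                    cmd_name ? cmd_name : "(null)");
}
```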

Anton Lofgren and others added 30 commits May 31, 2016 15:11

shadownaemon: Fix a logic error when connecting

9ad77dd

Because of precedence rules, if (!connect(...) == 0) is not evaluated as (not(connect(...) == 0)) in C, but as (not(connect(...)) == 0). This triggers a warning in clang. Signed-off-by: Anton Lofgren <alofgren@op5.com>

remove enable_environment_macros from sample config

d4fe38c

new installations should not come with warnings like: Warning: enable_environment_macros is deprecated and will be removed. Signed-off-by: Sven Nierlein <Sven.Nierlein@consol.de>

release 1.0.4

aad65eb

Signed-off-by: Sven Nierlein <Sven.Nierlein@consol.de>

update news file

5b036e7

contact: Don't try to log NULL pointers

73888e4

This fixes GitHub issue #140. Signed-off-by: Anton Lofgren <alofgren@op5.com>

remove temp_file

b068cf4

Replace temp_file with status_file for temporary status.dat pattern to mkstemp

check ferror() before return

4e9bebb

Add ferror(fp) to result code to check for error with calls to fprintf()

remove temp_file

e775e10

Replace temp_file with retention_file for temporary retention.dat pattern to mkstemp

check ferror() before return

4293110

Add ferror(fp) to result code to check for error with calls to fprintf()

update my_rename

195e718

utils: Update doc for my_rename()

5186ccf

Signed-off-by: Anton Lofgren <alofgren@op5.com>

release 1.0.5

5c39f58

workers: Bail out if we can't register wproc query handler

e4ac825

Signed-off-by: Anton Lofgren <alofgren@op5.com>

Checking for Git binary

6f780b0

In some build systems, git is not present unless defined as a build dependency. However the .git directory may still be copied in along with the source. This change will still allow the required behavior in case .git is present but the Git binary is not.

Merge branch 'version-without-git' of https://github.com/ipstatic/nae…

7631114

…mon-core into ipstatic-version-without-git

timeperiods: more const correctness

51de06f

Just a tiny change to enforce constness of timeperiods where we don't want to modify them. Signed-off-by: Anton Lofgren <alofgren@op5.com>

workers: Don't error on a NULL specialized_workers table

eb6004f

This shouldn't happen during normal operation, but it happens occasionally during unit tests, and it clutters up the output and skews the test results. Signed-off-by: Anton Lofgren <alofgren@op5.com>

xodtemplate: Treat missing service description as an error

46b44e0

This behaviour was changed when fixing another configuration validation bug (GH issue #68). This patch restores the behaviour and fixes MON-9353. Signed-off-by: Anton Lofgren <alofgren@op5.com>

Merge pull request #29 in MONITOR/naemon from bugfix/MON-9353 to master

d77b41b

* commit '46b44e0ed7ff8350b7bb381e96a2ea07af639179': xodtemplate: Treat missing service description as an error

Merge pull request #30 in MONITOR/naemon from feature/MON-9733-trigge…

3241e6e

…r-nebevent-on-all-external to master * commit '0bb33b5d2086875d48cff7f160ea89b479c8a4f4': Reuse check command processing

jacobbaungard and others added 29 commits September 11, 2018 09:58

Merge pull request #259 from jacobbaungard/bugfix/MON-10720_retain-ne…

183178c

…xt-schedule Retain next_check schedule on restart (#224, #156)

el6: use correct logrotate script

a2596a8

Since we install the el7 logrotate in our Makefile.am without further OS detection we need to replace the logrotate file for el6 later. Otherwise we would end up with the el7 file and no logrotation.

Merge pull request #257 from jacobbaungard/bugfix/MON-10565-cleaner-s…

859b20d

…hutdown Init: Increase delay between SIGTERM and SIGKILL

Re-add COPYING file with license

5e65ce3

This commit adds a COPYING file with the GPLv2 license. This ensures that we do not get a wrong license when running automake, and also that GitHub can automatically detect the license for the project. Signed-off-by: Jacob Hansen <jhansen@op5.com>

Merge pull request #265 from jacobbaungard/bugfix/MON-11295-always-ex…

ea156c2

…ecute-checks-within-interval Always schedule next_check within check_interval

Merge pull request #266 from naemon/readd_license_file

a248674

Re-add COPYING file with license

set flag if check is scheduled from the orphan event handler

cec6e10

this flag was used by mod-gearman to detect orphaned checks for example from misconfiguration to submit a critical check result with a useful message.

fix memory leak when overriding checks

27e64c9

Overriding checks during the host/service_initiate stage leads to a memory leak. Freeing the check_result pointer helps.

Add host_down_disable_service_checks config option

f021ca9

Add option to *not* check services if their host is down. references: - NagiosEnterprises/nagioscore@05e1dda

Test host_down_disable_service_checks

1d42073

This commit adds a simple test to ensure that no service checks are being run if the setting host_down_disable_service_checks is enabled. Signed-off-by: Jacob Hansen <jhansen@op5.com>

fix query handler not returning command response

25ff76b

there was a missing newline which prevented the query handler from returning the errors for commands. add a test case to ensure this does not fail again.

Merge pull request #276 from jacobbaungard/bugfix/MON-11365-fix-heap-…

3ba44f1

…corruption Fix heap corruption when callback deregisters itself

Corrected retained window typo in sample-config

4ac82d7

Signed-off-by: Jacob Hansen <jhansen@op5.com>

Merge pull request #277 from jacobbaungard/bugfix/MON-11418-uneven-ch…

ed6e3c1

…eck-schedule Introduce retained_scheduling_randomize_window

Release 1.0.9

b0987b1

Signed-off-by: Jacob Hansen <jhansen@op5.com>

release 1.0.10

026725e

Signed-off-by: Sven Nierlein <sven@nierlein.de>

add missing documentation

030c134

Signed-off-by: Sven Nierlein <sven@nierlein.de>

fix format overflow

72aa5cb

cmd_name may be null. Signed-off-by: Sven Nierlein <sven@nierlein.de>

nook24 merged commit 3440ac1 into nook24:master Apr 10, 2019
