Merge current Naemon master into fork #1

Merged
merged 3,370 commits into from
Apr 10, 2019

Conversation

nook24
Owner

@nook24 nook24 commented Apr 10, 2019

No description provided.

Anton Lofgren and others added 30 commits May 31, 2016 15:11
The tv_sec field in a timespec struct is defined as a time_t, which is
often a long int (but not always). On 64-bit systems this is usually
fine (except when it's not), but on 32-bit systems, this can have the
consequence of the arithmetic in the timespec_msdiff function
overflowing the resulting diff, for example when given a large negative
value such as (0 - time(NULL)). This in turn would lead to a diff a
number of years in the future.

For example, (time_t)-1464166432 * 1000 = 417415936 which is about 13
years in the future.

This has a nasty consequence, because of how we use this function, in that it would

a) cause events scheduled way back to be scheduled over a decade too
late, and
b) thus block the execution of *all other events* until that event was
run.

This patch fixes that issue, by defining what happens in the case of an
overflow more strictly, while still maintaining the desired properties.
Namely, we simply return LONG_MIN/LONG_MAX if event A is too far in the
past or future relative to event B.

In addition, I also added a bunch of tests to catch this type of bug in
the future, as well as portable support for overflow-checked arithmetic.

Signed-off-by: Anton Lofgren <alofgren@op5.com>
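
The saturating behaviour described in this commit can be sketched roughly as
follows. This is an illustration only, not the actual patch: the name
msdiff_saturating is hypothetical, and the sketch widens to long long rather
than using the overflow-checked arithmetic helpers the commit mentions.

```c
#include <limits.h>
#include <time.h>

/* Hypothetical helper (not Naemon's timespec_msdiff): returns a - b in
 * milliseconds, clamped to LONG_MIN/LONG_MAX instead of wrapping. */
long msdiff_saturating(const struct timespec *a, const struct timespec *b)
{
    long long a_sec = (long long)a->tv_sec;
    long long b_sec = (long long)b->tv_sec;
    long long sec, ms;

    /* Guard the subtraction itself against wrapping. */
    if (b_sec > 0 && a_sec < LLONG_MIN + b_sec)
        return LONG_MIN;
    if (b_sec < 0 && a_sec > LLONG_MAX + b_sec)
        return LONG_MAX;
    sec = a_sec - b_sec;

    /* Guard the multiplication by 1000 (plus up to 999 ms of remainder). */
    if (sec > LLONG_MAX / 1000 - 1)
        return LONG_MAX;
    if (sec < LLONG_MIN / 1000 + 1)
        return LONG_MIN;
    ms = sec * 1000 + (a->tv_nsec - b->tv_nsec) / 1000000;

    /* Clamp to long, which is only 32 bits wide on many 32-bit systems. */
    if (ms > LONG_MAX)
        return LONG_MAX;
    if (ms < LONG_MIN)
        return LONG_MIN;
    return (long)ms;
}
```

With this shape, an event far in the past always yields a negative diff,
so it can never be mistaken for an event years in the future.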
Because of precedence rules,
if (!connect(...) == 0) is not evaluated as

(not(connect(...) == 0)) in C, but as
(not(connect(...)) == 0)

This triggers a warning in clang.

Signed-off-by: Anton Lofgren <alofgren@op5.com>
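
To make the parse explicit, here is a small sketch in which rc stands in for
the return value of connect(...) (an assumption for illustration). The
function names are hypothetical. The two readings happen to agree for every
value of rc, so the fix is about the warning and readability rather than
behaviour.

```c
#include <stdbool.h>

/* How C parses `!rc == 0`: unary `!` binds tighter than `==`. */
bool parsed_form(int rc)
{
    return (!rc) == 0;
}

/* How the condition reads at first glance. */
bool intended_form(int rc)
{
    return !(rc == 0);
}

/* A spelling that needs no precedence knowledge at all. */
bool explicit_form(int rc)
{
    return rc != 0;
}
```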
new installations should not come with warnings like:

    Warning: enable_environment_macros is deprecated and will be removed.

Signed-off-by: Sven Nierlein <Sven.Nierlein@consol.de>
Signed-off-by: Sven Nierlein <Sven.Nierlein@consol.de>
select/poll/epoll_wait returns EINTR on signal interruption (duh), and
the comment below even says that that's what we're looking for.

This was introduced by some guy in 21c9601.

This fixes GitHub issue #138.

Signed-off-by: Anton Lofgren <alofgren@op5.com>
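
For reference, a common way to treat EINTR from poll() looks like the sketch
below. The wrapper name poll_retry is hypothetical, and simply retrying is an
assumption; the code touched by this commit may handle the interruption
differently.

```c
#include <errno.h>
#include <poll.h>

/* Hypothetical wrapper: call poll() again if a signal interrupted it.
 * Note that each retry restarts the full timeout. */
int poll_retry(struct pollfd *fds, nfds_t nfds, int timeout_ms)
{
    int ret;

    do {
        ret = poll(fds, nfds, timeout_ms);
    } while (ret < 0 && errno == EINTR);

    return ret;
}
```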
This fixes GitHub issue #140.

Signed-off-by: Anton Lofgren <alofgren@op5.com>
Replace temp_file with status_file for temporary status.dat pattern to mkstemp
Add ferror(fp) to result code to check for error with calls to fprintf()
Replace temp_file with retention_file for temporary retention.dat pattern to mkstemp
Add ferror(fp) to result code to check for error with calls to fprintf()
Signed-off-by: Anton Lofgren <alofgren@op5.com>
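
The pattern described above (derive the mkstemp template from the target
path, and fold ferror(fp) into the result) can be sketched like this. The
function name and the rename step are assumptions for illustration, not the
code from the commit.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: write `contents` to a temp file next to `path`,
 * then rename it over `path`. Returns 0 on success, -1 on any error. */
int write_file_atomically(const char *path, const char *contents)
{
    char tmp[4096];
    FILE *fp;
    int fd, n, failed;

    /* Template based on the target path, so both are on one filesystem. */
    n = snprintf(tmp, sizeof(tmp), "%sXXXXXX", path);
    if (n < 0 || n >= (int)sizeof(tmp))
        return -1;

    fd = mkstemp(tmp);
    if (fd < 0)
        return -1;

    fp = fdopen(fd, "w");
    if (fp == NULL) {
        close(fd);
        unlink(tmp);
        return -1;
    }

    fprintf(fp, "%s", contents);
    failed = ferror(fp);      /* catches errors from any earlier fprintf() */
    if (fclose(fp) != 0)      /* flush errors only show up here */
        failed = 1;

    if (failed || rename(tmp, path) != 0) {
        unlink(tmp);
        return -1;
    }
    return 0;
}
```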
There's no need to reinvent a shell (again!) when we're already
guaranteed one by POSIX. Let's use that one instead, since it allows us
to

a) write more powerful tests, and
b) write real-er tests

Signed-off-by: Anton Lofgren <alofgren@op5.com>
In order to provide backwards compatibility with older configurations,
this patch lends some leniency to when grandchild processes are killed.

It used to be the case that a process group was killed as soon as we
reaped its leader. With this patch, we instead allow the descendants of
the leader to continue running until the timeout of the job in question
has been reached, at which point all still remaining processes will be
reaped.

This allows users to continue using, for example, asynchronous mail
delivery via mailx, without losing their notifications when mailx's
children are unexpectedly killed.

In addition, this patch also cleans up the worker test suite a bit, to
make it more reliable and extendable.

This fixes GitHub issue #137.

This is related to MON-8090.

Signed-off-by: Anton Lofgren <alofgren@op5.com>
Signed-off-by: Anton Lofgren <alofgren@op5.com>
In some build systems, git is not present unless defined as a build
dependency. However the .git directory may still be copied in along with
the source. This change will still allow the required behavior in case
.git is present but the Git binary is not.
The callback was probably meant to make it possible to create a custom
worker with a separate job spawning method. But since all those methods
were local to that file, no external builds could have used that method.
In fact, the struct for tracking jobs didn't provide any space for
custom job related data, thus this wasn't even possible.

Instead of fixing it without a use case, it's just better to remove, to
simplify the debugging of the code.

Part of op5 issue MON-9528

Signed-off-by: Max Sikstrom <max.sikstrom@op5.com>
ESTALE seems to be used when the job still has processes left that need to
be reaped. That's what happens when the main process finishes, and we
call finish_job.

Thus, mark the job as ESTALE, and don't log when we kill dormant
children, since that's the default behaviour for jobs, where we still
need to allow dangling processes for a while after execution.

One example of such dangling processes is sendmail in notification
scripts, which forks a process to send the mail and exits
early.

Resolves: MON-9528

Signed-off-by: Max Sikstrom <max.sikstrom@op5.com>
A couple of thousand lines of output for each test doesn't make any sense.
Repetitive tests can simply keep a failure counter and validate
that there are no failures. For tracking exact values, gdb is really
helpful. Just watch the failure counter.

Also, since a lot of CI environments truncate the output after a couple
of thousand log lines, this helps keep the log from being truncated.

Signed-off-by: Max Sikstrom <max.sikstrom@op5.com>
Just a tiny change to enforce constness of timeperiods where we don't
want to modify them.

Signed-off-by: Anton Lofgren <alofgren@op5.com>
This shouldn't happen during normal operation, but it happens
occasionally during unit tests, and it clutters up the output and skews
the test results.

Signed-off-by: Anton Lofgren <alofgren@op5.com>
I had an issue with events being destroyed after the queue having been
freed which was non-trivial to track down. These assertions, along with
actually NULL'ing the event_queue would've told me right away what went
wrong.

Signed-off-by: Anton Lofgren <alofgren@op5.com>
…nitored-object to master

* commit 'fcf4ad56ea70f9828a66dfc851fff5ccf75429e5':
  events: Sprinkle with assertions
  workers: Don't error on a NULL specialized_workers table
  timeperiods: more const correctness
  Make timeperiod tests not that noisy
This behaviour was changed when fixing another configuration validation
bug (GH issue #68). This patch restores the behaviour and fixes
MON-9353.

Signed-off-by: Anton Lofgren <alofgren@op5.com>
* commit '46b44e0ed7ff8350b7bb381e96a2ea07af639179':
  xodtemplate: Treat missing service description as an error
Even though the one handling commands by id is somewhat faster, it didn't
call all the broker modules, and it also couldn't be correctly stopped
from being executed.

Thus, make sure all external command processing is handled the same
way, and that the event can stop a command from being executed.

Signed-off-by: Max Sikstrom <max.sikstrom@op5.com>
…r-nebevent-on-all-external to master

* commit '0bb33b5d2086875d48cff7f160ea89b479c8a4f4':
  Reuse check command processing
jacobbaungard and others added 29 commits September 11, 2018 09:58
This commit ensures that the next_check schedule for hosts and services
are retained on Naemon restart, given that use_retained_scheduling_info
is enabled.

The logic is as follows:

- If use_retained_scheduling_info is disabled, set a random time (as
  before)
- If use_retained_scheduling_info is enabled:
  - If we didn't miss the check during the restart, retain the old
    next_check time
  - If we missed one check, schedule the service/host within the next
    interval_length (usually 60 seconds)
  - If we missed more than one check, schedule the next check randomly.

We schedule missed checks within 60 seconds, rather than immediately in
order to do some load balancing. This is also the rationale for
scheduling the check randomly, in case we missed more than one check
(this indicates Naemon has been down for a longer period of time).

This fixes:
- #224
- #156
- MON-10720 (https://jira.op5.com/browse/MON-10720)

Signed-off-by: Jacob Hansen <jhansen@op5.com>
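
The decision logic listed in this commit can be sketched as below. All names
are hypothetical, times are in seconds, and two details are assumptions: the
"random time" is drawn from within one check interval, and "missed one check"
is taken to mean the retained next_check lies less than one check interval in
the past.

```c
#include <stdlib.h>
#include <time.h>

/* Random offset in [0, window), or 0 for an empty window. */
static time_t random_offset(time_t window)
{
    return window > 0 ? (time_t)(rand() % window) : 0;
}

/* Hypothetical sketch of choosing next_check after a restart. */
time_t pick_next_check(int use_retained_scheduling_info,
                       time_t retained_next_check, time_t now,
                       time_t check_interval_s, time_t interval_length)
{
    if (!use_retained_scheduling_info)
        return now + random_offset(check_interval_s);

    /* The check was not missed during the restart: keep the old time. */
    if (retained_next_check >= now)
        return retained_next_check;

    /* Missed exactly one check: run it soon, spread over interval_length. */
    if (now - retained_next_check < check_interval_s)
        return now + random_offset(interval_length);

    /* Missed more than one check: fall back to random scheduling. */
    return now + random_offset(check_interval_s);
}
```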
This commit adds tests to ensure that the next_check is set correctly
after Naemon restarts. This ensures that the logic from the previous
commit is correctly followed.

This fixes:
- #224
- #156
- MON-10720 (https://jira.op5.com/browse/MON-10720)

Signed-off-by: Jacob Hansen <jhansen@op5.com>
…xt-schedule

Retain next_check schedule on restart (#224, #156)
Since we install the el7 logrotate in our Makefile.am without further OS
detection we need to replace the logrotate file for el6 later. Otherwise we
would end up with the el7 file and no logrotation.
Until now we only checked the state during dependency checks. But for pending hosts
and services the state is usually OK/UP, so the check passed. For pending flag checks
we have to look at the has_been_checked flag as well. The old behaviour led to the
situation where service checks were run while the master service was in pending state,
even if the service had the pending execution failure flag set.
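
A rough sketch of the dependency check described above, with hypothetical
names and flag values (they are not taken from the Naemon headers): the
pending flag is evaluated against has_been_checked, and only then is the
state itself compared.

```c
/* Hypothetical layout: bit N means "fail while the master is in state N",
 * and one extra bit means "fail while the master is still pending". */
#define DEP_FAIL_ON_PENDING (1u << 4)

/* Returns 1 if the dependency should block the check, 0 otherwise. */
int dependency_blocks_check(unsigned int failure_options,
                            int master_state, int master_has_been_checked)
{
    /* A never-checked master usually reports OK/UP, so looking at the
     * state alone would let the check through. */
    if (!master_has_been_checked && (failure_options & DEP_FAIL_ON_PENDING))
        return 1;

    return (failure_options & (1u << master_state)) != 0;
}
```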
…hutdown

Init: Increase delay between SIGTERM and SIGKILL
After #259 we now keep the next_check schedule over restarts if
use_retained_scheduling_info is enabled. However, after that patch, if one
lowered the check_interval, it was possible that after the restart the
next check of an object would be more than one check_interval away.

This commit ensures that if the next_check is more than one
check_interval away, then we randomly schedule the next check, instead
of using the retention data.

This fixed MON-11295 (https://jira.op5.com/browse/MON-11295)

Signed-off-by: Jacob Hansen <jhansen@op5.com>
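
As a sketch of the rule in this commit (hypothetical names, times in
seconds, and the random draw within one check interval is an assumption):

```c
#include <stdlib.h>
#include <time.h>

/* Hypothetical sketch: only trust the retained next_check if it is at
 * most one check interval away; otherwise schedule randomly. */
time_t limit_retained_next_check(time_t retained_next_check, time_t now,
                                 time_t check_interval_s)
{
    if (retained_next_check - now > check_interval_s) {
        if (check_interval_s <= 0)
            return now;
        return now + (time_t)(rand() % check_interval_s);
    }
    return retained_next_check;
}
```

This covers the case where check_interval was lowered from, say, 3600 to 300
seconds while the retained next_check was still scheduled under the old
interval.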
This commit adds a COPYING file with the GPLv2 license. This ensures
that we do not get a wrong license when running automake, and also
that GitHub automatically can detect the license for the project.

Signed-off-by: Jacob Hansen <jhansen@op5.com>
…ecute-checks-within-interval

Always schedule next_check within check_interval
Re-add COPYING file with license
The orphaned check event handler compares next_check against the expected next
check. But normal service/host check events simply run schedule_next_..., so
the orphan check will never match.

Right now it works like this:

  handle_host_check_event()
  ->  run_async_host_check()
  ->  sets is_executing true
  ->  check never comes back
  ->  next check scheduled
  handle_host_check_event()
  ->  run_async_host_check()
  ->  returns an error because is_executing is still set
  ->  next check scheduled

So since next_check is always pushed forward, the orphan check will never
match, even if the host/service has the is_executing flag for days. To fix
this, we only reschedule the next check if the is_executing flag is false.

Now, when the check takes longer than the check interval, this can lead to
situations where there is no event scheduled. So make sure we schedule an event
when receiving a check result and there is no event yet.

Signed-off-by: Sven Nierlein <sven@nierlein.de>
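
The fixed flow can be modelled with a tiny state machine. The struct and
function names below are hypothetical and only mirror the description in
this commit message, not the real Naemon data structures.

```c
#include <time.h>

/* Hypothetical, stripped-down view of a host or service. */
struct checked_object {
    int is_executing;   /* a check was started and has not returned yet */
    int has_event;      /* a check event is currently scheduled */
    time_t next_check;
};

/* Called when the scheduled check event fires. */
void on_check_event(struct checked_object *obj, time_t now, time_t interval)
{
    obj->has_event = 0;

    /* Still waiting for a result: do not push next_check forward, so the
     * orphan detection can still match the expected next check. */
    if (obj->is_executing)
        return;

    obj->is_executing = 1;
    obj->next_check = now + interval;
    obj->has_event = 1;
}

/* Called when a check result arrives. */
void on_check_result(struct checked_object *obj, time_t now, time_t interval)
{
    obj->is_executing = 0;

    /* A slow check may have left the object without any event. */
    if (!obj->has_event) {
        obj->next_check = now + interval;
        obj->has_event = 1;
    }
}
```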
#154)

right now, we have to reset the flag in mod-gearman but hosts should just
behave like services here when processing check results and reset the flag on
processing an active check result.

Signed-off-by: Sven Nierlein <sven@nierlein.de>
this flag was used by mod-gearman to detect orphaned checks for example
from misconfiguration to submit a critical check result with a useful
message.
Overriding checks during the host/service_initiate stage leads to a memory leak.
Freeing the check_result pointer helps.
Add option to *not* check services if their host is down.

references:
      - NagiosEnterprises/nagioscore@05e1dda
This commit adds a simple test to ensure that no service checks are
being run if the setting host_down_disable_service_checks is enabled.

Signed-off-by: Jacob Hansen <jhansen@op5.com>
there was a missing newline which prevented the query handler
from returning the errors for commands.
add a test case to ensure this does not fail again.
A neb callback deregistering itself in a callback currently causes a heap
corruption. This is due to getting the naming information for the module
after the callback has been run.

This patch gets the naming information of the module before the
callback, ensuring that we will not try to access the callback pointer
after it is potentially freed.

This fixes: MON-11365 & #268

Signed-off-by: Jacob Hansen <jhansen@op5.com>
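
The ordering issue can be illustrated with a minimal callback list. Everything
here is hypothetical (it is not the NEB API): the point is only that the
dispatcher copies the module name and the next pointer before it calls the
callback, because the callback may free its own node.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef int (*cb_fn)(void *data);

struct cb_node {
    cb_fn fn;
    char *module_name;
    struct cb_node *next;
};

static struct cb_node *cb_list = NULL;

int cb_register(cb_fn fn, const char *module_name)
{
    struct cb_node *n = malloc(sizeof(*n));
    if (n == NULL)
        return -1;
    n->module_name = malloc(strlen(module_name) + 1);
    if (n->module_name == NULL) {
        free(n);
        return -1;
    }
    strcpy(n->module_name, module_name);
    n->fn = fn;
    n->next = cb_list;
    cb_list = n;
    return 0;
}

/* Unlinks and frees the first node registered for fn. */
void cb_deregister(cb_fn fn)
{
    struct cb_node **pp = &cb_list;
    while (*pp != NULL) {
        struct cb_node *n = *pp;
        if (n->fn == fn) {
            *pp = n->next;
            free(n->module_name);
            free(n);
            return;
        }
        pp = &n->next;
    }
}

/* Runs all callbacks and returns how many ran. The name of the last
 * module that ran is copied into last_name for logging. */
int cb_dispatch(void *data, char *last_name, size_t len)
{
    struct cb_node *n = cb_list, *next;
    int calls = 0;

    while (n != NULL) {
        next = n->next;                                  /* read before the call */
        snprintf(last_name, len, "%s", n->module_name);  /* read before the call */
        n->fn(data);
        /* n may have been freed by the callback; do not touch it again. */
        calls++;
        n = next;
    }
    return calls;
}

/* Example callback that deregisters itself while running. */
int self_removing_cb(void *data)
{
    (void)data;
    cb_deregister(self_removing_cb);
    return 0;
}
```

This sketch still assumes a callback only removes itself; a callback that
removes its successor would need a more careful iteration scheme.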
…corruption

Fix heap corruption when callback deregisters itself
With use_retained_scheduling_info enabled, we would schedule checks
that were missed by less than one check_interval within one
interval_length.

This commit introduces a new setting
retained_scheduling_randomize_window which allows users to configure
the window in which checks that were missed over a restart are
rescheduled.

This can be useful in order to increase the load balancing done after a
restart, and might help fix CPU load spikes due to checks
being unevenly scheduled.

This is part of MON-11418

Signed-off-by: Jacob Hansen <jhansen@op5.com>
If retained_scheduling_randomize_window is larger than the object's
check_interval, then we use the check_interval for scheduling instead.
This ensures that the object is always scheduled within the first
check_interval after a restart.

Signed-off-by: Jacob Hansen <jhansen@op5.com>
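
A minimal sketch of the clamping rule, assuming both the window and the check
interval are expressed in seconds (the function and parameter names are
hypothetical):

```c
#include <stdlib.h>
#include <time.h>

/* Hypothetical sketch: reschedule a missed check inside the configured
 * window, but never later than one check interval from now. */
time_t reschedule_missed_check(time_t now, time_t randomize_window,
                               time_t check_interval_s)
{
    time_t window = randomize_window;

    if (window > check_interval_s)
        window = check_interval_s;
    if (window <= 0)
        return now;

    return now + (time_t)(rand() % window);
}
```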
Signed-off-by: Jacob Hansen <jhansen@op5.com>
…eck-schedule

Introduce retained_scheduling_randomize_window
Signed-off-by: Jacob Hansen <jhansen@op5.com>
- newlines from spoolfiles need to be unescaped, otherwise they remain
  as \\n in the plugin output and the multiline output parser does not
  parse the output correctly.

- instead of adding more and more exceptions to g_strescape, we really only
  want to escape newlines, so do just that. Otherwise we end up with double
  encoded escape sequences in the long plugin output.

Signed-off-by: Sven Nierlein <sven@nierlein.de>
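
Escaping only newlines, and reversing that on read, can be sketched as below.
The function names are hypothetical and this is not the code from the commit;
in particular, the sketch leaves backslashes alone, so input that already
contains a literal backslash followed by n does not round-trip.

```c
#include <stdlib.h>
#include <string.h>

/* Replace each newline byte with the two characters '\' and 'n'.
 * Everything else is copied unchanged. Caller frees the result. */
char *escape_newlines_only(const char *in)
{
    char *out = malloc(2 * strlen(in) + 1);
    char *p = out;

    if (out == NULL)
        return NULL;
    for (; *in != '\0'; in++) {
        if (*in == '\n') {
            *p++ = '\\';
            *p++ = 'n';
        } else {
            *p++ = *in;
        }
    }
    *p = '\0';
    return out;
}

/* Reverse of the above: turn '\' followed by 'n' back into a newline. */
char *unescape_newlines(const char *in)
{
    char *out = malloc(strlen(in) + 1);
    char *p = out;

    if (out == NULL)
        return NULL;
    while (*in != '\0') {
        if (in[0] == '\\' && in[1] == 'n') {
            *p++ = '\n';
            in += 2;
        } else {
            *p++ = *in++;
        }
    }
    *p = '\0';
    return out;
}
```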
Returning NEBERROR_CALLBACKCANCEL from a NEBTYPE_HOSTCHECK_INITIATE or
NEBTYPE_SERVICECHECK_INITIATE neb callback resulted in naemon running the check
itself. Instead, naemon should just skip the check and reschedule it.

Signed-off-by: Sven Nierlein <sven@nierlein.de>
Signed-off-by: Sven Nierlein <sven@nierlein.de>
Signed-off-by: Sven Nierlein <sven@nierlein.de>
cmd_name may be null.

Signed-off-by: Sven Nierlein <sven@nierlein.de>
@nook24 nook24 merged commit 3440ac1 into nook24:master Apr 10, 2019