forked from naemon/naemon-core
Merge current Naemon master into fork #1
Merged
The tv_sec field in a timespec struct is defined as a time_t, which is often a long int (but not always). On 64-bit systems this is usually fine (except when it's not), but on 32-bit systems, this can have the consequence of the arithmetic in the timespec_msdiff function overflowing the resulting diff, for example when given a large negative value such as (0 - time(NULL)). This in turn would lead to a diff a number of years in the future. For example, (time_t)-1464166432 * 1000 = 417415936 which is about 13 years in the future. This has a nasty consequence, because of how we use this function, in that it would a) cause events scheduled way back to be scheduled over a decade too late, and b) thus block the execution of *all other events* until that event was run. This patch fixes that issue, by defining what happens in the case of an overflow more strictly, while still maintaining the desired properties. Namely, we simply return LONG_MIN/LONG_MAX if event A is too far in the past or future relative to event B. In addition, I also added a bunch of tests to catch this type of bug in the future, as well as portable support for overflow-checked arithmetic. Signed-off-by: Anton Lofgren <alofgren@op5.com>
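As a rough illustration of the clamping behaviour described in this commit message, the following is a minimal sketch, not the actual Naemon patch. The names sat_sub, sat_sec_to_ms and msdiff_saturating are hypothetical, and the sketch assumes that tv_sec fits in a long and that tv_nsec is normalized to [0, 1e9).

```c
#include <assert.h>
#include <limits.h>
#include <time.h>

/* Hypothetical helper: a - b, clamped to [LONG_MIN, LONG_MAX]. */
static long sat_sub(long a, long b)
{
	if (b > 0 && a < LONG_MIN + b)
		return LONG_MIN;
	if (b < 0 && a > LONG_MAX + b)
		return LONG_MAX;
	return a - b;
}

/* Hypothetical helper: seconds to milliseconds, clamped instead of wrapping. */
static long sat_sec_to_ms(long sec)
{
	if (sec > LONG_MAX / 1000)
		return LONG_MAX;
	if (sec < LONG_MIN / 1000)
		return LONG_MIN;
	return sec * 1000;
}

/* Milliseconds from b to a, returning LONG_MIN/LONG_MAX when out of range. */
long msdiff_saturating(const struct timespec *a, const struct timespec *b)
{
	long ms, nsec_ms;

	assert(a != NULL && b != NULL);
	ms = sat_sec_to_ms(sat_sub((long)a->tv_sec, (long)b->tv_sec));
	if (ms == LONG_MIN || ms == LONG_MAX)
		return ms; /* already out of range: stay clamped */

	/* Sub-second part, strictly between -1000 and 1000 milliseconds. */
	nsec_ms = (a->tv_nsec - b->tv_nsec) / 1000000;
	if (nsec_ms > 0 && ms > LONG_MAX - nsec_ms)
		return LONG_MAX;
	if (nsec_ms < 0 && ms < LONG_MIN - nsec_ms)
		return LONG_MIN;
	return ms + nsec_ms;
}
```

Every range check happens before the operation it guards, so the sketch never relies on signed overflow, which is undefined behaviour in C.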
Because of precedence rules, if (!connect(...) == 0) is not evaluated as (not(connect(...) == 0)) in C, but as ((not connect(...)) == 0). This triggers a warning in clang. Signed-off-by: Anton Lofgren <alofgren@op5.com>
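To make the grouping concrete, here is a small sketch with hypothetical function names, where an int return code stands in for the result of connect(). The first function spells out how C groups the expression, the second shows the grouping the spelling suggests, and the third is an unambiguous way to write the test.

```c
#include <assert.h>

/* How C groups "!rc == 0": the unary negation binds first. */
int as_parsed(int rc)
{
	return (!rc) == 0;
}

/* The grouping the original spelling suggests at a glance. */
int as_intended(int rc)
{
	return !(rc == 0);
}

/* An unambiguous spelling of the same test. */
int as_clear(int rc)
{
	return rc != 0;
}

/* For a comparison against 0 the three forms agree on every input. */
void check_forms_agree(int rc)
{
	assert(as_parsed(rc) == as_intended(rc));
	assert(as_intended(rc) == as_clear(rc));
}
```

When comparing against 0 both groupings yield the same truth value, so in this sketch the difference is one of readability and compiler warnings rather than behaviour. With any other constant on the right-hand side the two groupings would differ.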
New installations should not come with warnings like: Warning: enable_environment_macros is deprecated and will be removed. Signed-off-by: Sven Nierlein <Sven.Nierlein@consol.de>
Signed-off-by: Sven Nierlein <Sven.Nierlein@consol.de>
This fixes GitHub issue #140. Signed-off-by: Anton Lofgren <alofgren@op5.com>
Replace temp_file with status_file for temporary status.dat pattern to mkstemp
Add ferror(fp) to result code to check for error with calls to fprintf()
Replace temp_file with retention_file for temporary retention.dat pattern to mkstemp
Add ferror(fp) to result code to check for error with calls to fprintf()
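The two changes above (a mkstemp pattern derived from the destination file, plus an ferror check after writing) can be sketched roughly as follows. This is an illustrative sketch only: write_file_via_temp is a hypothetical helper, not a Naemon function, and the final rename step is an assumption about how such a temporary file is typically put in place.

```c
#define _POSIX_C_SOURCE 200809L /* for mkstemp() and fdopen() */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 on success, -1 on any error. */
int write_file_via_temp(const char *dest_path, const char *contents)
{
	char tmp_path[4096];
	FILE *fp;
	int fd, n, failed;

	assert(dest_path != NULL && contents != NULL);

	/* Build the mkstemp() pattern from the destination path, so the
	 * temporary file is created next to the final file. */
	n = snprintf(tmp_path, sizeof(tmp_path), "%sXXXXXX", dest_path);
	if (n < 0 || (size_t)n >= sizeof(tmp_path))
		return -1;

	fd = mkstemp(tmp_path);
	if (fd < 0)
		return -1;
	fp = fdopen(fd, "w");
	if (fp == NULL) {
		close(fd);
		unlink(tmp_path);
		return -1;
	}

	fprintf(fp, "%s", contents);
	failed = ferror(fp); /* set if any earlier fprintf() failed */
	if (fclose(fp) != 0)
		failed = 1;
	if (!failed && rename(tmp_path, dest_path) != 0)
		failed = 1;
	if (failed) {
		unlink(tmp_path);
		return -1;
	}
	return 0;
}
```

Checking ferror once before fclose covers all the preceding fprintf calls, since the stream's error indicator stays set until it is cleared.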
Signed-off-by: Anton Lofgren <alofgren@op5.com>
There's no need to reinvent a shell (again!) when we're already guaranteed one by POSIX. Let's use that one instead, since it allows us to a) write more powerful tests, and b) write more realistic tests. Signed-off-by: Anton Lofgren <alofgren@op5.com>
In order to provide backwards compatibility with older configurations, this patch lends some leniency to when grandchild processes are killed. It used to be the case that a process group was killed as soon as we reaped its leader. With this patch, we instead allow the descendants of the leader to continue running until the timeout of the job in question has been reached, at which point all still remaining processes will be reaped. This allows users to continue using, for example, asynchronous mail delivery via mailx, without losing their notifications when mailx's children are unexpectedly killed. In addition, this patch also cleans up the worker test suite a bit, to make it more reliable and extendable. This fixes GitHub issue #137. This is related to MON-8090. Signed-off-by: Anton Lofgren <alofgren@op5.com>
Signed-off-by: Anton Lofgren <alofgren@op5.com>
In some build systems, git is not present unless defined as a build dependency. However the .git directory may still be copied in along with the source. This change will still allow the required behavior in case .git is present but the Git binary is not.
The callback was probably meant to make it possible to create a custom worker with a separate job spawning method. But since all those methods were local to that file, no external builds could have used that method. In fact, the struct for tracking jobs didn't provide any space for custom job related data, thus this wasn't even possible. Instead of fixing it without a use case, it's just better to remove, to simplify the debugging of the code. Part of op5 issue MON-9528 Signed-off-by: Max Sikstrom <max.sikstrom@op5.com>
The ESTALE seems to be when the job has any processes left that need to be reaped. That's what happens when the main process finishes, and we call finish_job. Thus, mark the job as ESTALE, and don't log when we kill dormant children, since that's the default behaviour for jobs, where we still need to be able to have dangling processes a while after execution. One example of such dangling processes is sendmail in notification scripts, which forks a process to send the mail and exits early. Resolves: MON-9528 Signed-off-by: Max Sikstrom <max.sikstrom@op5.com>
…mon-core into ipstatic-version-without-git
A couple of thousand lines of output for each test doesn't make any sense. Repetitive tests can simply have a failure counter and validate that there are no failures. For tracking exact values, gdb is really helpful. Just watch the failure counter. Also, since a lot of CI environments truncate the output after a couple of thousand log lines, this will help with not truncating the log. Signed-off-by: Max Sikstrom <max.sikstrom@op5.com>
Just a tiny change to enforce constness of timeperiods where we don't want to modify them. Signed-off-by: Anton Lofgren <alofgren@op5.com>
This shouldn't happen during normal operation, but it happens occasionally during unit tests, and it clutters up the output and skews the test results. Signed-off-by: Anton Lofgren <alofgren@op5.com>
I had an issue with events being destroyed after the queue having been freed which was non-trivial to track down. These assertions, along with actually NULL'ing the event_queue would've told me right away what went wrong. Signed-off-by: Anton Lofgren <alofgren@op5.com>
…nitored-object to master
* commit 'fcf4ad56ea70f9828a66dfc851fff5ccf75429e5':
  events: Sprinkle with assertions
  workers: Don't error on a NULL specialized_workers table
  timeperiods: more const correctness
  Make timeperiod tests not that noisy
This behaviour was changed when fixing another configuration validation bug (GH issue #68). This patch restores the behaviour and fixes MON-9353. Signed-off-by: Anton Lofgren <alofgren@op5.com>
* commit '46b44e0ed7ff8350b7bb381e96a2ea07af639179': xodtemplate: Treat missing service description as an error
Even though the one handling commands by id is somewhat faster, it didn't call all the broker modules, and execution of the command couldn't be stopped correctly. Thus, make sure all external command processing is handled the same way, and that the event can stop a command from being executed. Signed-off-by: Max Sikstrom <max.sikstrom@op5.com>
…r-nebevent-on-all-external to master * commit '0bb33b5d2086875d48cff7f160ea89b479c8a4f4': Reuse check command processing
This commit ensures that the next_check schedule for hosts and services is retained on Naemon restart, given that use_retained_scheduling_info is enabled. The logic is as follows:
- If use_retained_scheduling_info is disabled, set a random time (as before)
- If use_retained_scheduling_info is enabled:
  - If we didn't miss the check during the restart, retain the old next_check time
  - If we missed one check, schedule the service/host within the next interval_length (usually 60 seconds)
  - If we missed more than one check, schedule the next check randomly.
We schedule missed checks within 60 seconds, rather than immediately, in order to do some load balancing. This is also the rationale for scheduling the check randomly in case we missed more than one check (this indicates Naemon has been down for a longer period of time).
This fixes:
- #224
- #156
- MON-10720 (https://jira.op5.com/browse/MON-10720)
Signed-off-by: Jacob Hansen <jhansen@op5.com>
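The decision logic listed in this commit message can be sketched as a small classification function. This is an illustration under stated assumptions, not Naemon's code: the enum, the function name and the parameters are hypothetical, and "missed one check" is assumed to mean that the retained next_check lies less than one check interval (in seconds) in the past.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical outcome labels for the cases in the commit message. */
enum retained_action {
	SCHEDULE_RANDOM,                 /* setting disabled, or several checks missed */
	KEEP_RETAINED,                   /* no check was missed during the restart */
	SCHEDULE_WITHIN_INTERVAL_LENGTH  /* exactly one check was missed */
};

enum retained_action classify_retained_check(int use_retained_scheduling_info,
                                             time_t now,
                                             time_t retained_next_check,
                                             time_t check_interval_secs)
{
	assert(check_interval_secs > 0);

	if (!use_retained_scheduling_info)
		return SCHEDULE_RANDOM;
	if (retained_next_check >= now)
		return KEEP_RETAINED;
	/* Assumption: being late by less than one interval means one missed check. */
	if (now - retained_next_check < check_interval_secs)
		return SCHEDULE_WITHIN_INTERVAL_LENGTH;
	return SCHEDULE_RANDOM;
}
```

Keeping the classification separate from the actual scheduling makes each case easy to cover in a unit test.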
This commit adds tests to ensure that the next_check is set correctly after Naemon restarts. This ensures the logic from the previous commit is correctly followed.
This fixes:
- #224
- #156
- MON-10720 (https://jira.op5.com/browse/MON-10720)
Signed-off-by: Jacob Hansen <jhansen@op5.com>
Since we install the el7 logrotate in our Makefile.am without further OS detection we need to replace the logrotate file for el6 later. Otherwise we would end up with the el7 file and no logrotation.
Until now, dependency checks only looked at the state. But for pending hosts and services the state is usually OK/UP, so the check passed. As a result, service checks were run while the master service was in pending state, even if the dependency had the pending service execution failure flag set. For pending flag checks we have to look at the has_been_checked flag as well.
…hutdown Init: Increase delay between SIGTERM and SIGKILL
After #259 we now keep the next_check schedule over restarts if use_retained_scheduling_info is enabled. However, after that patch, if one lowered the check_interval, it was possible that after the restart the next check of an object would be more than one check_interval away. This commit ensures that if the next_check is more than one check_interval away, then we randomly schedule the next check, instead of using the retention data. This fixes MON-11295 (https://jira.op5.com/browse/MON-11295) Signed-off-by: Jacob Hansen <jhansen@op5.com>
This commit adds a COPYING file with the GPLv2 license. This ensures that we do not get a wrong license when running automake, and also that GitHub can automatically detect the license for the project. Signed-off-by: Jacob Hansen <jhansen@op5.com>
…ecute-checks-within-interval Always schedule next_check within check_interval
Re-add COPYING file with license
The orphaned check event handler compares next_check against the expected next check. But normal service/host check events simply run schedule_next_..., so the orphan check will never match. Right now it works like this:
handle_host_check_event() -> run_async_host_check() -> sets is_executing true -> check never comes back -> next check scheduled
handle_host_check_event() -> run_async_host_check() -> returns an error because is_executing is still set -> next check scheduled
So since next_check is always pushed forward, the orphan check will never match, even if the host/service has had the is_executing flag set for days. To fix this, we only reschedule the next check if the is_executing flag is false. When the check takes longer than the check interval, this can lead to situations where no event is scheduled. So make sure we schedule an event when receiving a check result and there is no event yet. Signed-off-by: Sven Nierlein <sven@nierlein.de>
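The fix described in this commit message can be modelled with a very small state machine. This is a simplified sketch with hypothetical names (checked_object, on_check_event, on_check_result), not Naemon's event code. It only shows the two rules: do not push next_check forward while is_executing is set, and schedule an event on a check result if none is pending.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical, reduced view of a host or service. */
struct checked_object {
	int is_executing;        /* a check was started and no result has arrived */
	int has_scheduled_event; /* a future check event exists */
	time_t next_check;
};

/* Called when a scheduled check event fires. */
void on_check_event(struct checked_object *o, time_t now, time_t interval)
{
	assert(o != NULL);
	o->has_scheduled_event = 0; /* the event that just fired is consumed */

	if (o->is_executing)
		return; /* keep next_check untouched so orphan detection can match */

	o->is_executing = 1;
	o->next_check = now + interval;
	o->has_scheduled_event = 1;
}

/* Called when a check result arrives. */
void on_check_result(struct checked_object *o, time_t now, time_t interval)
{
	assert(o != NULL);
	o->is_executing = 0;

	/* A slow check may have left the object without any event. */
	if (!o->has_scheduled_event) {
		o->next_check = now + interval;
		o->has_scheduled_event = 1;
	}
}
```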
#154) right now, we have to reset the flag in mod-gearman but hosts should just behave like services here when processing check results and reset the flag on processing an active check result. Signed-off-by: Sven Nierlein <sven@nierlein.de>
this flag was used by mod-gearman to detect orphaned checks for example from misconfiguration to submit a critical check result with a useful message.
Overriding checks during the host/service_initiate stage leads to a memory leak. Freeing the check_result pointer helps.
Add option to *not* check services if their host is down. references: - NagiosEnterprises/nagioscore@05e1dda
This commit adds a simple test to ensure that no service checks are being run if the setting host_down_disable_service_checks is enabled. Signed-off-by: Jacob Hansen <jhansen@op5.com>
there was a missing newline which prevented the query handler from returning the errors for commands. add a test case to ensure this does not fail again.
A neb callback deregistering itself in a callback currently causes a heap corruption. This is due to getting the naming information for the module after the callback has been run. This patch gets the naming information of the module before the callback, ensuring that we will not try to access the callback pointer after it is potentially freed. This fixes: MON-11365 & #268 Signed-off-by: Jacob Hansen <jhansen@op5.com>
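The ordering problem can be shown with a reduced example. This sketch uses a hypothetical callback structure rather than Naemon's NEB types: the module name needed for logging is copied before the callback runs, so nothing is read from the callback object after it may have freed itself.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct callback;
typedef int (*callback_fn)(struct callback **slot);

/* Hypothetical registered callback with the naming info of its module. */
struct callback {
	char *module_name;
	callback_fn fn;
};

struct callback *callback_new(const char *module_name, callback_fn fn)
{
	struct callback *cb = malloc(sizeof(*cb));

	if (cb == NULL)
		return NULL;
	cb->module_name = malloc(strlen(module_name) + 1);
	if (cb->module_name == NULL) {
		free(cb);
		return NULL;
	}
	strcpy(cb->module_name, module_name);
	cb->fn = fn;
	return cb;
}

/* A callback that deregisters (frees) itself while it is running. */
int deregister_self(struct callback **slot)
{
	free((*slot)->module_name);
	free(*slot);
	*slot = NULL;
	return 0;
}

/* Runs the callback and writes a log line without touching freed memory. */
int run_callback(struct callback **slot, char *log, size_t log_len)
{
	char name[64];
	int rc;

	assert(slot != NULL && *slot != NULL && log != NULL);
	/* Copy the naming information BEFORE the call. */
	snprintf(name, sizeof(name), "%s", (*slot)->module_name);
	rc = (*slot)->fn(slot);
	snprintf(log, log_len, "callback from '%s' returned %d", name, rc);
	return rc;
}
```

Reading (*slot)->module_name after the call instead would be a use-after-free whenever the callback deregisters itself.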
…corruption Fix heap corruption when callback deregisters itself
With use_retained_scheduling_info enabled, we would schedule checks which were missed by less than one check_interval within one interval_length. This commit introduces a new setting, retained_scheduling_randomize_window, which allows users to configure the window in which checks that were missed over a restart are rescheduled. This can be useful in order to increase the load balancing done after a restart, and might help to fix CPU load spikes caused by checks being unevenly scheduled. This is part of MON-11418 Signed-off-by: Jacob Hansen <jhansen@op5.com>
If the retained_scheduling_randomize_window is larger than the object's check_interval, then we use the check_interval for scheduling instead. This ensures that the object is always scheduled within the first check_interval after a restart. Signed-off-by: Jacob Hansen <jhansen@op5.com>
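The capping rule can be sketched as follows. The function names are hypothetical, both values are assumed to be expressed in seconds, and the random number is passed in by the caller so that the example stays deterministic.

```c
#include <assert.h>
#include <time.h>

/* The window actually used: never larger than the object's check interval. */
time_t effective_randomize_window(time_t window_secs, time_t check_interval_secs)
{
	assert(window_secs >= 0 && check_interval_secs > 0);
	return window_secs < check_interval_secs ? window_secs : check_interval_secs;
}

/* Picks a time in [now, now + window) for a check missed over a restart. */
time_t reschedule_missed_check(time_t now, time_t window_secs,
                               time_t check_interval_secs,
                               unsigned long random_value)
{
	time_t window = effective_randomize_window(window_secs, check_interval_secs);

	if (window == 0)
		return now; /* no window configured: run as soon as possible */
	return now + (time_t)(random_value % (unsigned long)window);
}
```

For example, with a window of 600 seconds and a check interval of 300 seconds, the result always falls within the first 300 seconds after now.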
Signed-off-by: Jacob Hansen <jhansen@op5.com>
…eck-schedule Introduce retained_scheduling_randomize_window
Signed-off-by: Jacob Hansen <jhansen@op5.com>
- newlines from spoolfiles need to be unescaped, otherwise they remain as \\n in the plugin output and the multiline output parser does not parse the output correctly.
- instead of adding more and more exceptions to g_strescape, we really only want to escape newlines, so do just that. Otherwise we end up with double encoded escape sequences in the long plugin output.
Signed-off-by: Sven Nierlein <sven@nierlein.de>
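A newline-only escape and its inverse could look roughly like this. It is a plain C sketch with hypothetical function names, not the glib-based code the commit touches. Because backslashes themselves are left alone, a literal backslash followed by 'n' in the input cannot be told apart from an escaped newline afterwards, which is a limitation of this sketch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns a newly allocated copy of in with each newline written as "\n"
 * (backslash, 'n'). All other characters are copied unchanged. */
char *escape_newlines(const char *in)
{
	size_t extra = 0, i, j = 0;
	char *out;

	assert(in != NULL);
	for (i = 0; in[i] != '\0'; i++)
		if (in[i] == '\n')
			extra++;
	out = malloc(i + extra + 1);
	if (out == NULL)
		return NULL;
	for (i = 0; in[i] != '\0'; i++) {
		if (in[i] == '\n') {
			out[j++] = '\\';
			out[j++] = 'n';
		} else {
			out[j++] = in[i];
		}
	}
	out[j] = '\0';
	return out;
}

/* Returns a newly allocated copy of in with each "\n" pair turned back
 * into a real newline character. */
char *unescape_newlines(const char *in)
{
	size_t i, j = 0;
	char *out;

	assert(in != NULL);
	out = malloc(strlen(in) + 1);
	if (out == NULL)
		return NULL;
	for (i = 0; in[i] != '\0'; i++) {
		if (in[i] == '\\' && in[i + 1] == 'n') {
			out[j++] = '\n';
			i++; /* skip the 'n' */
		} else {
			out[j++] = in[i];
		}
	}
	out[j] = '\0';
	return out;
}
```

The caller owns both returned strings and releases them with free().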
returning NEBERROR_CALLBACKCANCEL from a NEBTYPE_HOSTCHECK_INITIATE or NEBTYPE_SERVICECHECK_INITIATE neb callback resulted in naemon running the check itself. Instead naemon should just skip the check and reschedule it. Signed-off-by: Sven Nierlein <sven@nierlein.de>
Signed-off-by: Sven Nierlein <sven@nierlein.de>
Signed-off-by: Sven Nierlein <sven@nierlein.de>
cmd_name may be null. Signed-off-by: Sven Nierlein <sven@nierlein.de>
No description provided.