Add metrics json #761

Open: 82 commits to be merged into base: master

Commits
570f27a
Add first implementation of monitor-metrics-json"
Jan 30, 2022
3765e31
Encode the JSON in monitor_metrics_json, add
Jan 30, 2022
5baaa51
Add --monitor-metrics-json to the --help output
Feb 4, 2022
4935208
Capture the snapshot_health_issues metric
Feb 4, 2022
6c4b1e9
Add comment on $snapshot_health_issues
Feb 17, 2022
b9897d4
Add newest_snapshot_ctime_seconds to JSON output
Feb 18, 2022
e963dd2
Add tests README with instructions on testing
Mar 17, 2022
cdbc191
Add new monitoring test scaffolding.
Mar 17, 2022
e26671c
Add monitoring test framework
Mar 17, 2022
372eadc
Add python test
Mar 17, 2022
b3ee98a
Test failure.
Mar 17, 2022
3655c02
Test output with no zpool.
Mar 17, 2022
1bdb607
add subprocess import
Mar 17, 2022
123e35e
Fix --monitor-snapshots command in test
Mar 17, 2022
c21d8d0
Fix path to sanoid snapshot cache file
Mar 17, 2022
8bb8d38
Remove check on test
Mar 17, 2022
c488df5
Fix test output.
Mar 17, 2022
f939851
add second section
Mar 17, 2022
26777c2
Add 2nd section to sanoid.conf for monitoring test
Mar 17, 2022
501c992
Test return code for monitoring no pool
Mar 17, 2022
c8f2b3f
Fix return code of first test
Mar 17, 2022
e556495
Create zpool, test.
Mar 17, 2022
37bce97
add test
Mar 17, 2022
890c785
Fixes to zpool creation test.
Mar 17, 2022
3e9730d
debugging
Mar 17, 2022
8571510
Fix pool_target issues in test.
Mar 17, 2022
3364805
Clean up monitoring tests zpool creation/teardown
Mar 17, 2022
49589e2
Comment out zpool export, as now in python
Mar 17, 2022
867b755
Delete temp disk images as part of teardown
Mar 17, 2022
02aedfa
Test monitoring after sanoid cron
Mar 17, 2022
d0f89c9
Fix monitoring test after sanoid run
Mar 17, 2022
b992beb
Add test for monitoring immediately after running
Mar 17, 2022
6710b2d
Clear snapshot cache before each test
Mar 17, 2022
0b7c05f
First time-based test (one warning)
Mar 23, 2022
2561a5e
Add more info for debugging
Mar 23, 2022
a2fc4b3
Add print for debugging
Mar 23, 2022
f870d27
Test warning string
Mar 23, 2022
d9da30d
Fix test for one monitoring warning.
Mar 23, 2022
4c9a9e2
Use sanoid --force-update instead of
Mar 23, 2022
adddd20
Add initial test for two critical monitoring
Mar 23, 2022
7462a1a
Adjust time slightly
Mar 23, 2022
a7fd743
Add test details for two criticals
Mar 23, 2022
f0fd435
Fix test_two_criticals output test
Mar 23, 2022
38ed37c
Fix test_two_criticals returncode
Mar 23, 2022
ae4c0c6
Start test_two_warnings_daily
Mar 23, 2022
c28f1c3
Fix test_two_warnings_daily text test
Mar 23, 2022
8152952
Fix return code
Mar 23, 2022
c7e57b7
Fix test_two_warnings
Mar 23, 2022
5015aec
change name of test_two_warnings_daily
Mar 27, 2022
cdbe07e
Create a new set of tests for --monitor-snapshots
Mar 27, 2022
3a4e043
Merge branch 'add_vm_tests' of https://github.com/Hooloovoo/sanoid in…
Mar 27, 2022
9cc6b7f
Delete redundant set of tests from rename/merge
Mar 27, 2022
9759fa4
Ready to merge
Mar 27, 2022
b17a0fe
Add newline at end of sanoid.conf
Mar 27, 2022
fec7cf9
Add newline
Mar 27, 2022
7e9d4e4
Merge branch 'add_vm_tests' into add_metrics_json
Mar 31, 2022
6a9fadd
Temporarily make the test suite only run
Mar 31, 2022
c0f61f2
Add libjson-perl as a dependency
Mar 31, 2022
1f20a6b
Start with broken test to get JSON output
Mar 31, 2022
6db15f7
Add tests for JSON sanoid-test-1
Mar 31, 2022
3a5c62c
Fix crit_age_seconds
Mar 31, 2022
934030d
Fix has_snapshots test
Mar 31, 2022
65859b5
Fix monitor_dont_crit and monitor_dont_warn tests
Mar 31, 2022
1b635c3
Make monitor_dont_crit and warn convert to numbers
Apr 1, 2022
7c854b4
Fix snapshot_health_issues
Apr 1, 2022
ddd9c21
Fix warn_age_seconds
Apr 1, 2022
b39bc7c
Fix daily test
Apr 1, 2022
be52e5e
Fix monthly tests
Apr 1, 2022
e2c39cf
Fix spacing
Apr 2, 2022
628d902
Add test_no_zpool json tests and add
Apr 4, 2022
861e86b
Use int to ensure JSON values dumped as numbers
Apr 4, 2022
c82d908
Add sanoid-test-2 info to
Jun 14, 2022
dac3bff
Add sanoid-test-2 to
Jun 14, 2022
bf6b733
Start JSON tests for test_one_warning_hourly
Jun 29, 2022
e6d6a8b
Added monitoring tests to test_one_warning_hourly
Jul 4, 2022
e7ac2ad
Added test_two_criticals_hourly JSON tests
Jul 29, 2022
bd17421
All test_monitoring tests added and passing
Aug 4, 2022
e8fdb8a
Re-enable all tests. All pass.
Aug 5, 2022
a7438a5
Merge remote-tracking branch 'sanoid/master' into add_metrics_json
Aug 16, 2022
8001870
Add mkdir /etc/sanoid to the test runner
Aug 16, 2022
c103b0f
Added instructions for creating a test VM with LXD
Aug 16, 2022
3b81748
Updated comment in run.sh
Aug 16, 2022
1 change: 1 addition & 0 deletions packages/debian/control
@@ -12,6 +12,7 @@ Package: sanoid
Architecture: all
Depends: libcapture-tiny-perl,
libconfig-inifiles-perl,
libjson-perl,
zfsutils-linux | zfs,
${misc:Depends},
${perl:Depends}
139 changes: 107 additions & 32 deletions sanoid
@@ -13,6 +13,7 @@ use Config::IniFiles; # read samba-style conf file
use Data::Dumper; # debugging - print contents of hash
use File::Path 'make_path';
use Getopt::Long qw(:config auto_version auto_help);
use JSON; # for monitor-metrics-json
use Pod::Usage; # pod2usage
use Time::Local; # to parse dates in reverse
use Capture::Tiny ':all';
@@ -25,8 +26,8 @@ my %args = (
GetOptions(\%args, "verbose", "debug", "cron", "readonly", "quiet",
"configdir=s", "cache-dir=s", "run-dir=s",
"monitor-health", "force-update",
- "monitor-snapshots", "take-snapshots", "prune-snapshots", "force-prune",
- "monitor-capacity"
+ "monitor-snapshots", "monitor-metrics-json", "take-snapshots", "prune-snapshots",
+ "force-prune", "monitor-capacity"
) or pod2usage(2);

# If only config directory (or nothing) has been specified, default to --cron --verbose
@@ -74,6 +75,7 @@ if ($args{'debug'}) { $args{'verbose'}=1; blabber (@params); }
if ($args{'monitor-snapshots'}) { monitor_snapshots(@params); }
if ($args{'monitor-health'}) { monitor_health(@params); }
if ($args{'monitor-capacity'}) { monitor_capacity(@params); }
if ($args{'monitor-metrics-json'}) { monitor_metrics_json(@params); }
if ($args{'force-update'}) { my $snaps = getsnaps( \%config, $cacheTTL, 1 ); }

if ($args{'cron'}) {
@@ -128,12 +130,20 @@ sub monitor_snapshots {
# check_snapshot_date - test ZFS fs creation timestamp for recentness
# accepts arguments: $filesystem, $warn (in seconds elapsed), $crit (in seconds elapsed)

- my ($config, $snaps, $snapsbytype, $snapsbypath) = @_;
+ my ($config, $snaps, $snapsbytype, $snapsbypath, $return_info) = @_;
my %datestamp = get_date();
my $errorlevel = 0;
my $msg;
my @msgs;
my @paths;
my %snapshot_info;

# use Data::Dumper;
# print Dumper $config;
# print Dumper $snaps;
# print Dumper $snapsbytype;
# print Dumper $snapsbypath;
# print encode_json $snapsbytype;

foreach my $section (keys %config) {
if ($section =~ /^template/) { next; }
@@ -161,41 +171,81 @@ sub monitor_snapshots {
my $warn = convertTimePeriod($config{$section}{$typewarn}, $smallerperiod);
my $crit = convertTimePeriod($config{$section}{$typecrit}, $smallerperiod);
my $elapsed = -1;

# $errorlevel tracks the overall error level for all snapshots,
# snapshot_health_issues is specific to this path/type combination
# 0 = no issues, 1 = warn, 2 = critical
my $snapshot_health_issues = 0;

if (defined $snapsbytype{$path}{$type}{'newest'}) {
$elapsed = $snapsbytype{$path}{$type}{'newest'};
}

my $dispelapsed = displaytime($elapsed);
my $dispwarn = displaytime($warn);
my $dispcrit = displaytime($crit);
if ( $elapsed > $crit || $elapsed == -1) {
if ($crit > 0) {
- if (! $config{$section}{'monitor_dont_crit'}) { $errorlevel = 2; }
+ if (! $config{$section}{'monitor_dont_crit'}) { $snapshot_health_issues = 2; }
if ($elapsed == -1) {
push @msgs, "CRIT: $path has no $type snapshots at all!";
} else {
push @msgs, "CRIT: $path newest $type snapshot is $dispelapsed old (should be < $dispcrit)";
}
}
- } elsif ($elapsed > $warn) {
+ } elsif ($elapsed > $warn) {
if ($warn > 0) {
- if (! $config{$section}{'monitor_dont_warn'} && ($errorlevel < 2) ) { $errorlevel = 1; }
+ if (! $config{$section}{'monitor_dont_warn'} && ($snapshot_health_issues < 2) ) { $snapshot_health_issues = 1; }
push @msgs, "WARN: $path newest $type snapshot is $dispelapsed old (should be < $dispwarn)";
}
} else {
# push @msgs .= "OK: $path newest $type snapshot is $dispelapsed old \n";
}

if (defined $return_info) {
# $return_info has been defined, so collect the metrics for the JSON output instead of printing
$snapshot_info{$path}{$type}{"crit_age_seconds"} = $crit;
$snapshot_info{$path}{$type}{"warn_age_seconds"} = $warn;
$snapshot_info{$path}{$type}{"monitor_dont_crit"} = int($config{$section}{'monitor_dont_crit'});
$snapshot_info{$path}{$type}{"monitor_dont_warn"} = int($config{$section}{'monitor_dont_warn'});
$snapshot_info{$path}{$type}{"snapshot_health_issues"} = $snapshot_health_issues;

if ($elapsed == -1) {
# The $path has no $type snapshots
$snapshot_info{$path}{$type}{"has_snapshots"} = 0;
} else {
$snapshot_info{$path}{$type}{"has_snapshots"} = 1;
$snapshot_info{$path}{$type}{"newest_age_seconds"} = $elapsed;
my $most_recent_snap_of_type_name = $snapsbytype{$path}{$type}{"newestname"};
$snapshot_info{$path}{$type}{"newest_snapshot_ctime_seconds"} = int($snaps{$path}{$most_recent_snap_of_type_name}{"ctime"});
}
}

if ($snapshot_health_issues > $errorlevel){
# This path/type combination has a warning or crit level higher than any we have seen so far,
# so adjust the overall error level to match
$errorlevel = $snapshot_health_issues;
}

}
}

- my @sorted_msgs = sort { lc($a) cmp lc($b) } @msgs;
- my @sorted_paths = sort { lc($a) cmp lc($b) } @paths;
- $msg = join (", ", @sorted_msgs);
- my $paths = join (", ", @sorted_paths);
+ if (defined ($return_info)) {
+ # Called by monitor_metrics_json, so return the JSON

- if ($msg eq '') { $msg = "OK: all monitored datasets \($paths\) have fresh snapshots"; }
+ return \%snapshot_info, $errorlevel;

- print "$msg\n";
+ } else {
+ # A normal run of the subroutine, so we print the information
+ my @sorted_msgs = sort { lc($a) cmp lc($b) } @msgs;
+ my @sorted_paths = sort { lc($a) cmp lc($b) } @paths;
+ $msg = join (", ", @sorted_msgs);
+ my $paths = join (", ", @sorted_paths);
+
+ if ($msg eq '') { $msg = "OK: all monitored datasets \($paths\) have fresh snapshots"; }
+
+ print "$msg\n";
+ }
exit $errorlevel;
}
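
The per-path/type health logic in monitor_snapshots can be sketched in Python as below. This is only an illustration of the branch order visible in the diff, not code from the PR; the function names `classify` and `overall` are made up for the example.

```python
def classify(elapsed, warn, crit, dont_warn=False, dont_crit=False):
    """Return 0 (no issues), 1 (warn) or 2 (critical) for one path/type pair.

    elapsed is the age in seconds of the newest snapshot, or -1 if there is none.
    A threshold of 0 disables that check, mirroring the Perl conditions.
    """
    if elapsed > crit or elapsed == -1:
        # Same branch order as the Perl: once this branch is taken, the warn
        # branch is never reached, even when crit is 0 (disabled).
        if crit > 0 and not dont_crit:
            return 2
        return 0
    if elapsed > warn:
        if warn > 0 and not dont_warn:
            return 1
    return 0


def overall(levels):
    """The overall error level is the worst level seen across all pairs."""
    return max(levels, default=0)


if __name__ == "__main__":
    # hourly_warn = 90m (5400 s) and hourly_crit = 360m (21600 s)
    print(classify(6000, 5400, 21600))
    print(overall([0, 1, 2]))
```

Keeping a separate level per path/type pair and taking the maximum afterwards is what lets the JSON report each pair's `snapshot_health_issues` independently of the overall exit code.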

@@ -258,6 +308,30 @@ sub monitor_capacity {
####################################################################################
####################################################################################

sub monitor_metrics_json {
my ($config, $snaps, $snapsbytype, $snapsbypath) = @_;
my %metrics;

# Set a schema version for the JSON each time it changes in case any changes are
# not backwards compatible. Today's date backwards followed by an incrementing
# digit (YYYYMMDDX)
$metrics{"schema_version"} = 202204041;

# Add all the information about snapshots we need that would be returned from
# --monitor-snapshots (or that let people derive the same information)
my ($snapshot_info, $overall_snapshot_health) = monitor_snapshots($config, $snaps, $snapsbytype, $snapsbypath, 1);
$metrics{"snapshot_info"} = $snapshot_info;
$metrics{"overall_snapshot_health_issues"} = $overall_snapshot_health;

# my $snapshot_info_json = encode_json $snapshot_info;
my $metrics_json = encode_json \%metrics;

print "$metrics_json\n";
}
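
A consumer of the `--monitor-metrics-json` output might look roughly like the Python sketch below. The key names are taken from the diff above; the sample values (ages, ctime) are invented for illustration, and `summarize` is a hypothetical helper, not part of sanoid.

```python
import json

# Invented sample payload using the keys that monitor_metrics_json emits.
SAMPLE = """
{
  "schema_version": 202204041,
  "overall_snapshot_health_issues": 1,
  "snapshot_info": {
    "sanoid-test-1": {
      "hourly": {
        "crit_age_seconds": 21600,
        "warn_age_seconds": 5400,
        "monitor_dont_crit": 0,
        "monitor_dont_warn": 0,
        "snapshot_health_issues": 1,
        "has_snapshots": 1,
        "newest_age_seconds": 6000,
        "newest_snapshot_ctime_seconds": 1648000000
      },
      "daily": {
        "crit_age_seconds": 115200,
        "warn_age_seconds": 100800,
        "monitor_dont_crit": 0,
        "monitor_dont_warn": 0,
        "snapshot_health_issues": 0,
        "has_snapshots": 1,
        "newest_age_seconds": 3600,
        "newest_snapshot_ctime_seconds": 1648002400
      }
    }
  }
}
"""

LEVELS = {0: "OK", 1: "WARN", 2: "CRIT"}


def summarize(payload):
    """Return (overall level, one summary line per path/type pair)."""
    metrics = json.loads(payload)
    lines = []
    # Sort explicitly rather than relying on the key order of the JSON text.
    for path, types in sorted(metrics["snapshot_info"].items()):
        for snap_type, info in sorted(types.items()):
            level = LEVELS[info["snapshot_health_issues"]]
            # newest_age_seconds is only present when has_snapshots is 1
            age = info.get("newest_age_seconds")
            lines.append(f"{level}: {path} {snap_type} newest_age_seconds={age}")
    return metrics["overall_snapshot_health_issues"], lines


if __name__ == "__main__":
    level, lines = summarize(SAMPLE)
    print(level)
    print("\n".join(lines))
```

A consumer would probably also want to check `schema_version` before trusting the layout, since the comment in the sub says it changes whenever the schema does.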

####################################################################################
####################################################################################
####################################################################################

sub prune_snapshots {

@@ -1701,23 +1775,24 @@ Assumes --cron --verbose if no other arguments (other than configdir) are specif

Options:

- --configdir=DIR Specify a directory to find config file sanoid.conf
- --cache-dir=DIR Specify a directory to store the zfs snapshot cache
- --run-dir=DIR Specify a directory for temporary files such as lock files
-
- --cron Creates snapshots and purges expired snapshots
- --verbose Prints out additional information during a sanoid run
- --readonly Simulates creation/deletion of snapshots
- --quiet Suppresses non-error output
- --force-update Clears out sanoid's zfs snapshot cache
-
- --monitor-health Reports on zpool "health", in a Nagios compatible format
- --monitor-capacity Reports on zpool capacity, in a Nagios compatible format
- --monitor-snapshots Reports on snapshot "health", in a Nagios compatible format
- --take-snapshots Creates snapshots as specified in sanoid.conf
- --prune-snapshots Purges expired snapshots as specified in sanoid.conf
- --force-prune Purges expired snapshots even if a send/recv is in progress
-
- --help Prints this helptext
- --version Prints the version number
- --debug Prints out a lot of additional information during a sanoid run
+ --configdir=DIR Specify a directory to find config file sanoid.conf
+ --cache-dir=DIR Specify a directory to store the zfs snapshot cache
+ --run-dir=DIR Specify a directory for temporary files such as lock files
+
+ --cron Creates snapshots and purges expired snapshots
+ --verbose Prints out additional information during a sanoid run
+ --readonly Simulates creation/deletion of snapshots
+ --quiet Suppresses non-error output
+ --force-update Clears out sanoid's zfs snapshot cache
+
+ --monitor-health Reports on zpool "health", in a Nagios compatible format
+ --monitor-capacity Reports on zpool capacity, in a Nagios compatible format
+ --monitor-metrics-json Reports zpool and snapshot metrics in JSON format
+ --monitor-snapshots Reports on snapshot "health", in a Nagios compatible format
+ --take-snapshots Creates snapshots as specified in sanoid.conf
+ --prune-snapshots Purges expired snapshots as specified in sanoid.conf
+ --force-prune Purges expired snapshots even if a send/recv is in progress
+
+ --help Prints this helptext
+ --version Prints the version number
+ --debug Prints out a lot of additional information during a sanoid run
20 changes: 20 additions & 0 deletions tests/3_monitor_snapshots/run.sh
@@ -0,0 +1,20 @@
#!/bin/bash
set -x

# this test will create pools in a number of states
# and check the output text and return code of
# sanoid --monitor-snapshots
# and the JSON data created by
# sanoid --monitor-metrics-json

. ../common/lib.sh

# prepare
setup
checkEnvironment
disableTimeSync

# set timezone
ln -sf /usr/share/zoneinfo/Europe/Vienna /etc/localtime

python3 test_monitoring.py
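
test_monitoring.py itself is not shown in this diff. As a rough sketch of the kind of check the comments in run.sh describe (output text plus return code), a test helper could wrap `subprocess.run` as below. `run_monitor` is a hypothetical name, and a stand-in command replaces sanoid so the example runs anywhere.

```python
import subprocess
import sys


def run_monitor(cmd):
    """Run a monitoring command and return (exit code, stripped stdout)."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.returncode, result.stdout.strip()


# Stand-in for a real invocation such as ["sanoid", "--monitor-snapshots"]:
# a child Python process that prints one line and exits with status 1.
fake_cmd = [
    sys.executable,
    "-c",
    "import sys; print('WARN: example'); sys.exit(1)",
]

code, out = run_monitor(fake_cmd)
print(code, out)
```

`check=False` matters here because monitor_snapshots exits with its error level, so a non-zero exit code is an expected result for the test to assert on rather than a failure of the helper.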
37 changes: 37 additions & 0 deletions tests/3_monitor_snapshots/sanoid.conf
@@ -0,0 +1,37 @@
[sanoid-test-1]
use_template = production

[sanoid-test-2]
use_template = demo

[template_production]
hourly = 36
daily = 30
monthly = 3
yearly = 0
autosnap = yes
autoprune = no
hourly_warn = 90m
hourly_crit = 360m
daily_warn = 28h
daily_crit = 32h
weekly_warn = 0
weekly_crit = 0
monthly_warn = 32d
monthly_crit = 40d
yearly_warn = 0
yearly_crit = 0


[template_demo]
daily = 60
hourly_warn = 290m
hourly_crit = 360m
daily_warn = 28h
daily_crit = 48h
weekly_warn = 0
weekly_crit = 0
monthly_warn = 32d
monthly_crit = 40d
yearly_warn = 0
yearly_crit = 0
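
The `*_warn` and `*_crit` values above surface in the JSON as `warn_age_seconds` and `crit_age_seconds`. The Python sketch below shows the arithmetic under the assumption that the suffixes s, m, h and d mean seconds, minutes, hours and days, and that a bare number is multiplied by a caller-supplied default unit. sanoid's own convertTimePeriod is not part of this diff and may handle more cases; `to_seconds` is a hypothetical helper.

```python
# Assumed suffix meanings; not taken from sanoid's source.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def to_seconds(value, default_unit_seconds):
    """Convert a threshold such as '90m' or '28h' to seconds."""
    text = str(value).strip().lower()
    if text and text[-1] in UNIT_SECONDS:
        return int(text[:-1]) * UNIT_SECONDS[text[-1]]
    # No suffix: fall back to the default unit passed by the caller.
    return int(text) * default_unit_seconds


if __name__ == "__main__":
    for setting in ("90m", "360m", "28h", "32h", "32d", "40d"):
        print(setting, to_seconds(setting, 60))
```

Under these assumptions, template_production's hourly thresholds work out to 5400 and 21600 seconds, and a value of 0 stays 0, which the monitoring code treats as "check disabled".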