
sslcheck module: (remote) SSL certificate expiry time check #5365

Merged
merged 24 commits into from Mar 13, 2019


ghost

@ghost ghost commented Feb 11, 2019

Summary

Hi Netdata community!

So I think we need to graph the days until SSL certificates expire. I have forked the portcheck plugin to do that.
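For reference, the value to graph can be derived from a server's certificate with Python 3's standard `ssl` module. This is a minimal sketch, not the code in this PR; the helper names `fetch_not_after` and `days_until_expiry` are hypothetical, and default certificate verification is assumed:

```python
import socket
import ssl
import time


def fetch_not_after(host, port=443, timeout=10):
    """Return the peer certificate's 'notAfter' field, e.g. 'Jun  1 12:00:00 2019 GMT'."""
    context = ssl.create_default_context()  # verifies the chain and the hostname
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()['notAfter']


def days_until_expiry(not_after, now=None):
    """Convert a 'notAfter' string into (possibly negative) days from `now`."""
    if now is None:
        now = time.time()
    expires_at = ssl.cert_time_to_seconds(not_after)  # seconds since the epoch, UTC
    return (expires_at - now) / 86400.0
```

With verification enabled, an already expired certificate makes the handshake itself fail, so a real collector would need to handle that error separately from "N days left".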

Component Name

collector/sslcheck

Additional Information

After a lot of trial and error, and after reading the docs and some similar issues, I just cannot get this plugin to graph anything.
Here is what I did on a fresh Debian 9 server, with netdata installed via curl | bash:

  • created /etc/netdata/python.d/sslcheck.conf
  • git clone https://github.com/blunix/netdata-plugin-sslcheck /usr/src/netdata/collectors/python.d.plugin/sslcheck/ (that's the same code as in this PR)
  • systemctl restart netdata.service

But there are no errors and no mention of the sslcheck plugin in the logs, and nothing shows up in the web UI. I must be missing something. Do you guys have any idea?

Thanks a lot in advance!

@ghost ghost requested review from cakrit, ilyam8 and ktsaou as code owners February 11, 2019 19:41
@CLAassistant

CLAassistant commented Feb 11, 2019

CLA assistant check
All committers have signed the CLA.

@ghost ghost changed the title added WIP ssl certificate expiry time check plugin [WIP] added ssl certificate expiry time check plugin Feb 11, 2019
@ghost ghost changed the title [WIP] added ssl certificate expiry time check plugin [WIP] added (remote) SSL certificate expiry time check plugin Feb 11, 2019
@netdatabot
Member

This pull request introduces 2 alerts when merging 6f3a58b into 527b53c - view on LGTM.com

new alerts:

  • 2 for Unused import

Comment posted by LGTM.com

@ilyam8
Member

ilyam8 commented Feb 11, 2019

@p-thurner i think we need something like this

https://github.com/influxdata/telegraf/tree/master/plugins/inputs/x509_cert

What do you think?

And since we have go.d.plugin, we are willing to add new modules to it if there are no limitations. So the question is: are you willing to rewrite it in Go?

@ilyam8
Member

ilyam8 commented Feb 11, 2019

@p-thurner gj btw, we definitely need this one 👍

@ghost
Author

ghost commented Feb 11, 2019

@ilyam8 hm, I'd rather keep it in python as I haven't learned golang yet. Is it ok to keep this one in python?

I'd be super happy to get this working (invested some hours into just getting the graph into the webui with no luck) :(
If this works I also want to add:

  • monitoring cronjobs (like mk-job from check-mk)
  • apt security upgrades
  • ssl setup rating (by asking the ssllabs API)

Then I have pretty much what my previous monitoring solution had.

So I would be very thankful for any help getting this working. I will commit more plugins then!

@ilyam8
Member

ilyam8 commented Feb 11, 2019

It is ok to have them in python, but I'd prefer to have new modules in go. It is very easy/fast to write a certificate check in go, because one is already written (telegraf) 😄 I can do it tomorrow.

Please describe how the other 3 modules work (how they fetch data: http request, file read, etc.). We'll see, maybe some of them would be ok to have in python.

@ilyam8
Member

ilyam8 commented Feb 11, 2019

I'd be super happy to get this working

in general - use debug mode, it helps a lot.

  1. sudo su -s /bin/bash netdata
  2. ./python.d.plugin debug trace <module_name>

@ghost
Author

ghost commented Feb 11, 2019

Monitoring cronjobs

@ilyam8, check-mk is another monitoring system. It has a shell script called mk-job, which is used to execute cronjobs like so:

5 0 * * * root mk-job nightly-backup /usr/local/bin/backup >/dev/null

This will create a simple file in /var/lib/check_mk/agent/nightly-backup with the following exemplary content:

user@host:~$ cat /var/lib/check_mk/agent/nightly-backup
start_time 1549921929
exit_code 0
real_time 0:01.00
user_time 0.00
system_time 0.00
reads 0
writes 0
max_res_kbytes 2176
avg_mem_kbytes 0
invol_context_switches 1
vol_context_switches 2

Here is the shell script: https://github.com/opinkerfi/check_mk/blob/master/agents/mk-job
It allows you to monitor the exit status of the cronjob, how long the job ran, and whether it was executed at the expected time. Quite handy :)
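A mk-job status file like the one above is just `key value` lines, so parsing it is short. A minimal sketch with hypothetical helper names, assuming `real_time` uses GNU time's elapsed format (`[hours:]minutes:seconds`):

```python
def parse_mk_job(text):
    """Parse the 'key value' lines written by check_mk's mk-job into a dict of strings."""
    result = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(' ')
        if not sep:
            continue  # skip blank or malformed lines
        result[key] = value.strip()
    return result


def elapsed_to_seconds(value):
    """Convert an elapsed value such as '0:01.00' or '1:02:03' into seconds."""
    seconds = 0.0
    for part in value.split(':'):
        seconds = seconds * 60 + float(part)
    return seconds
```

All values stay strings in the parsed dict; converting only the fields a chart needs keeps the parser tolerant of extra keys.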

Monitoring apt security upgrades

Monitoring APT security upgrades is simple: just count the number of pending security upgrades. With unattended-upgrades configured on Debian and Ubuntu systems, it makes sense to count those packages and alert if the number is still above 0 after 24 hours (the package "randomly" installs security upgrades throughout the day, so that not all machines in the world pull the upgrades at the same time).
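The "still above 0 after 24 hours" rule needs a little state, since a non-zero count alone is normal right after new updates are published. A minimal sketch; the class and its `grace` parameter are hypothetical:

```python
GRACE_SECONDS = 24 * 3600  # unattended-upgrades may take up to a day to apply updates


class PendingSecurityUpgrades:
    """Track since when security upgrades have been pending (hypothetical helper)."""

    def __init__(self, grace=GRACE_SECONDS):
        self.grace = grace
        self.pending_since = None  # timestamp of the first non-zero count

    def update(self, count, now):
        """Record the latest count; return True if an alert should fire."""
        if count == 0:
            self.pending_since = None  # everything installed, reset the timer
            return False
        if self.pending_since is None:
            self.pending_since = now
        return now - self.pending_since >= self.grace
```

This simple version does not reset the timer when the set of pending packages changes, only when it drops to zero.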

Monitoring how well you setup SSL on your webserver

You may already know this: https://www.ssllabs.com/ssltest/analyze.html?d=my%2dnetdata.io&s=104.28.3.248&latest (one of many such services, I suppose)
It returns a grade like A+, A, B, C and so on, depending on how well you set up SSL on your webserver. A+ could be 0, A could be 1, B could be 2 and so on, so that one could graph it.
ssllabs of course gets angry when you hit their API once a minute. They do, however, cache the result for a while. I'm not sure how to get that into netdata, but it would be pretty awesome to have, as unmaintained webservers get a worse rating after some time (weeks/months).
One more "problem" is that ssllabs itself takes a while to present the final result, which means the check plugin would run for several minutes the first time, then take half a second (cached result) until the ssllabs cache expires again. So that one is a bit more complicated.
I suppose a simple shellscript/cronjob could query ssllabs and write the results to a text file, where a netdata plugin could pick them up. Not sure if that's too hacky; I think it would be ok to do, just some extra setup for netdata users.
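The grade-to-number idea above could look like this. A minimal sketch; the grade list follows the examples in the comment and is an assumption (SSL Labs also reports grades such as A-, which would need their own slot):

```python
# Assumed grade list, best first; lower values are better, so a
# worsening rating shows up as a rising line on the chart.
GRADES = ['A+', 'A', 'B', 'C', 'D', 'E', 'F']
GRADE_VALUE = {grade: index for index, grade in enumerate(GRADES)}


def grade_to_value(grade):
    """Map an SSL grade to an integer (0 = best); None for unknown grades."""
    return GRADE_VALUE.get(grade.strip().upper())
```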

@ghost
Author

ghost commented Feb 11, 2019

btw if anyone can still help me debug my sslcheck plugin it would still be awesome :) I would still want to get it working to learn.

@ilyam8
Member

ilyam8 commented Feb 12, 2019

@p-thurner

btw if anyone can still help me debug my sslcheck plugin it would still be awesome :) I would still want to get it working to learn.

debug mode/error.log (in this order) are your friends

@cakrit
Contributor

cakrit commented Feb 12, 2019

One note from my side: update_every should default to something very large, perhaps once per hour. There is no point checking every second for something that makes sense on a daily timeframe.

@netdatabot netdatabot added area/collectors Everything related to data collection area/docs area/external/python labels Feb 12, 2019
@Ferroin
Member

Ferroin commented Feb 12, 2019

On the note of the SSL configuration thing, it might be worth looking at https://observatory.mozilla.org, they provide a clear quantifiable score (currently in the range of 0 to 135), and check some useful security aspects beyond just SSL configuration, including various security related HTTP headers. They cache results for 5 minutes though, so that's probably the absolute minimum polling period (I think SSLLabs caches less aggressively).

For the APT security updates, on systems that have a recent version of APT, you can list all upgradeable packages with apt list --upgradeable (though it may require root). The output format is one line per package, with each line looking something like this: apt/unstable 1.8.0~rc3 amd64 [upgradable from: 1.8.0~beta1]. I'm not sure how to parse security updates out of that, but being able to just list all available updates is still useful.
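The example line above has a fixed shape (`name/suite version arch [upgradable from: old]`), so a regular expression can pick it apart. A minimal sketch, assuming that shape holds for every package line; the `Listing...` header and any other lines are skipped, and the `parse_upgradable` name is hypothetical:

```python
import re

UPGRADABLE_RE = re.compile(
    r'^(?P<name>[^/\s]+)/(?P<suites>\S+)\s+(?P<version>\S+)\s+(?P<arch>\S+)'
    r'\s+\[upgradable from: (?P<old_version>[^\]]+)\]'
)


def parse_upgradable(output):
    """Parse `apt list --upgradeable` output into a list of dicts."""
    packages = []
    for line in output.splitlines():
        match = UPGRADABLE_RE.match(line.strip())
        if match:
            info = match.groupdict()
            info['suites'] = info['suites'].split(',')  # may list several suites
            packages.append(info)
    return packages
```

`len(parse_upgradable(output))` gives the pending-update count. Whether the suite names reveal security updates depends on the distribution's repository naming, so that part is left to the caller.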

@ilyam8
Member

ilyam8 commented Feb 12, 2019

My thoughts:

SSL certificate expiry time check

Ok, lets do it in python.

Monitoring cronjobs

According to your description, this is a super specific thing. It is not actually monitoring a cron job; it is parsing files produced by some scripts. A job could be executed once a day, once a week, etc. There is nothing to chart.

I think we don't need this module in core. We can make a link on the third-party modules page tho.

Monitoring how well you setup SSL on your webserver

One more "problem" with that is that ssllabs itself takes a while to get the present the final result, which means the check plugin would run for several minutes the first time, then would take half a second (cached result) until the cache of ssllabs expires again.. So that one is a bit more complicated.
I suppose a simple shellscript / cronjob could query ssllabs and write the results to a textfile, where they could be picked up by a netdata plugin.

Same.

tl;dr

  • SSL certificate expiry time check in python, module is kind of ok, can be merged in core
  • others are too specific, i suggest to add them on third-party page and that is it.

@paulfantom @Ferroin @cakrit

@Ferroin
Member

Ferroin commented Feb 12, 2019

I would tend to agree with @ilyam8 on the SSL server configuration (I think it would be neat to have, you could get an alert when your server configuration falls out of BCP) and the cron jobs (also, would be kind of neat to have, but I don't consider it critical (if you have a proper email server set up, any sensible cron system will email you if there are errors)).

I do, however, think having a module that reports how many updates are pending on the system (not necessarily just security updates, and not necessarily just APT either) would be awesome, but I think that's probably something that should only be exposed as a counter, not a graph, and I also think the check frequency should be significantly longer for it by default than Netdata normally tracks (at least 30 minutes, possibly even longer). It's not something that is likely to change with high frequency, so it just doesn't make sense to track historical data in most cases except for triggering alarms on changes or non-zero values. If we do decide to do such a module, I can provide some help to get it working for Gentoo, as well as being able to help with testing on Debian and a couple of other distros.
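Checking updates every 30 minutes or so, rather than at Netdata's usual per-second rate, can be handled by caching the last result inside the collector. A minimal sketch for Python 3; `ThrottledCheck` and its parameters are hypothetical, not a python.d API:

```python
import time


class ThrottledCheck:
    """Run an expensive check at most once per `min_interval` seconds."""

    def __init__(self, check, min_interval=30 * 60, clock=time.monotonic):
        self.check = check            # zero-argument callable doing the real work
        self.min_interval = min_interval
        self.clock = clock
        self.last_run = None
        self.last_value = None

    def get(self):
        """Return a cached value, refreshing it only when the interval has passed."""
        now = self.clock()
        if self.last_run is None or now - self.last_run >= self.min_interval:
            self.last_value = self.check()
            self.last_run = now
        return self.last_value
```

Simply giving the job a large `update_every` is the simpler route; a cache like this only matters if the chart must keep updating at a faster rate than the underlying check.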

@ghost
Author

ghost commented Feb 12, 2019

Cronjobs

[...]if you have a proper email server set up, any sensible cron system will email you if there are errors[...]
[...]Job couldbe executed once a day, once a week etc. There is nothing to chart.[...]

I would like to politely disagree here. I run a bunch of servers with PHP websites, and my customers use a lot of PHP scripts as cron jobs. Sometimes these jobs behave weirdly: they use way more resources and take much longer than usual. Having monitoring for this is vital in my opinion.
Graphing the status file that mk-job outputs would not be too hard in my opinion (with the caveat that the graph wouldn't change very often, of course). One could write a plugin that accepts a config file with multiple sections like so:

my_favorite_job:
  status_file: /var/foo/bar/my_favorite_job.txt
  every: 12h

None of this monitors the actual cronjob, of course, but in my opinion it is close enough to detect:

  • cronjob runs too long
  • exit code != 0
  • cronjob needs too many resources
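Those three checks, plus "did it run at all within `every`", reduce to a few comparisons once the status file is parsed into numbers. A minimal sketch; the field names and thresholds are hypothetical, with `every: 12h` expressed as 43200 seconds:

```python
def job_problems(status, now, every_seconds, max_runtime_seconds, max_rss_kbytes):
    """Check one mk-job status against thresholds; return a list of problems.

    `status` is assumed to hold already-converted numbers: start_time (epoch),
    exit_code, runtime_seconds and max_res_kbytes. All names are hypothetical.
    """
    problems = []
    if now - status['start_time'] > every_seconds:
        problems.append('job has not started within the expected interval')
    if status['exit_code'] != 0:
        problems.append('job exited with code %d' % status['exit_code'])
    if status['runtime_seconds'] > max_runtime_seconds:
        problems.append('job ran too long')
    if status['max_res_kbytes'] > max_rss_kbytes:
        problems.append('job used too much memory')
    return problems
```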

SSL "score"

The Mozilla thing sounds very cool, especially the score. Caching for only 5 minutes is a bit short, though; SSLLabs caches the result a bit longer, I think. I also just found https://github.com/ssllabs/ssllabs-scan, which we could maybe use for more simplicity.

Apt

I agree with you guys, regular and security upgrades can be a counter (I would not mind a graph either).
I previously used bloonix for monitoring, which does it like this (danger, perl :P)
https://github.com/bloonix/bloonix-plugins-linux/blob/master/plugins/check-linux-updates#L234-L244
Here it is in a nutshell. Please note that I ran this as a regular user (on my Ubuntu 18.04 laptop), not as root.

# Regular
user@host:~$ apt-get --force-yes --simulate --quiet --yes --allow-unauthenticated upgrade 2>/dev/null | grep "^Inst" | wc -l
119

# Security
user@host:~$ apt-get --force-yes --simulate --quiet --yes --allow-unauthenticated upgrade 2>/dev/null | grep "^Inst" | grep -i security | wc -l

Sadly, the second one doesn't find anything for me, as I don't have any server with pending security upgrades x) But according to this thread it should work: https://askubuntu.com/questions/774805/how-to-get-a-list-of-all-pending-security-updates

I can only contribute for Debian/Ubuntu; I'm not very familiar with other distros. I assume having separate plugins for zypper/apt/rpm/whatnot wouldn't hurt and would keep each plugin readable.

I would also like to note that checking every 30 minutes is totally ok here. I "sometimes" have cases where customers shoot apt in the head by doing weird things. Then apt itself has broken dependencies or whatnot, and the check can no longer execute (you run apt-get --simulate upgrade and it exits != 0 with some error message). Detecting that would also be very nifty, because in that state the system stops installing security upgrades, and often nothing tells you. The unattended-upgrades cronjob is somewhat hidden, I think so that people don't adjust its time (to avoid too many servers pulling upgrades from the Debian repos at the same time), and for some reason it never emails me when it fails (yes, mail is set up correctly on my servers, so cronjob stderr is sent by mail).
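The pipelines above, and the "apt itself is broken" case, could be combined in one Python helper that treats a non-zero `apt-get` exit status as its own failure signal. A minimal sketch for Python 3.5+, assuming Debian/Ubuntu-style `Inst` lines where security updates mention a `*-security` suite (a convention, not a guarantee); the function names are hypothetical:

```python
import subprocess


def simulate_upgrade():
    """Run a simulated upgrade; raises CalledProcessError if apt exits != 0."""
    result = subprocess.run(
        ['apt-get', '--simulate', '--quiet', '--yes', 'upgrade'],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        check=True,  # a broken apt (e.g. unmet dependencies) surfaces as an exception
    )
    return result.stdout


def count_upgrades(simulate_output):
    """Return (total, security) counts of 'Inst' lines in apt-get --simulate output."""
    total = security = 0
    for line in simulate_output.splitlines():
        if not line.startswith('Inst '):
            continue  # skip 'Conf' lines, headers, etc.
        total += 1
        if 'security' in line.lower():
            security += 1
    return total, security
```

A collector would call `count_upgrades(simulate_upgrade())` and report a separate "apt broken" state when `subprocess.CalledProcessError` is raised.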

@ghost
Author

ghost commented Feb 12, 2019

One note from my side, update every should default to something very large, perhaps once per hour. No point checking every second for something that makes sense on a daily timeframe.

I will get back to working on the sslcheck this evening.

@ilyam8
Member

ilyam8 commented Feb 12, 2019

Graphing the status file that mk-job outputs would not be to hard in my opinion (with the note that the graph wouldn't change very often of course..)

Having charts that update once a day is confusing imo and kind of alien to netdata. I agree that we need to support collecting data without actually charting it, but right now we can't.

And no, it is not as easy as you think. Your cronjob monitoring script should read the mk-job file right after the cron job is done, and only if the file was updated (not on some predefined interval).
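Reading the status file only when mk-job has rewritten it can be approximated by remembering its modification time. A minimal sketch for Python 3; `StatusFileWatcher` is hypothetical, and polling the mtime is still interval-based, so it only avoids re-processing stale results rather than reacting the instant a job finishes:

```python
import os


class StatusFileWatcher:
    """Return a status file's contents only when its mtime changed since the last read."""

    def __init__(self, path):
        self.path = path
        self.last_mtime = None

    def read_if_updated(self):
        try:
            mtime = os.stat(self.path).st_mtime
        except FileNotFoundError:
            return None  # the job has never run yet
        if mtime == self.last_mtime:
            return None  # nothing new since the previous poll
        self.last_mtime = mtime
        with open(self.path) as handle:
            return handle.read()
```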

@Ferroin
Member

Ferroin commented Feb 12, 2019

On the note of cronjobs, what you're talking about checking other than exit status can actually be done a couple of different ways without needing a new plugin for Netdata. One option would be to use the existing apps.plugin together with some creative abuse of the process name (IOW, use something to change the process name of the script to a unique, recognizable name, and then configure apps.plugin to look for it and treat it as its own group). The other is to put each cron job in its own cgroup, and then just use Netdata's cgroup support to monitor them (this has the nice bonus that it can be configured to automatically kill the cron job if it uses too many resources).

For the upgrade related stuff, it looks like the command I had posted generically works correctly for regular users too, though it still doesn't appear to provide an easy way to identify security updates. I'd kind of like to avoid depending on classic apt-get given that current stable versions of Ubuntu, Mint, and Debian have both already moved to the new apt command as the default, but that's probably just me. On the Gentoo side, I could probably have something working for just upgrades pretty quickly, but it would be dependent on an extra piece of software that isn't installed by default on Gentoo (though most sane people who use Gentoo have said software installed already).

@ghost
Author

ghost commented Feb 12, 2019

both already moved to the new apt command as the default

From my Ubuntu 18.04 Laptop:

user@host:~$ apt update | grep fo
[...]
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
[...]

I think apt-get is still a valid choice for the future.

@Ferroin
Member

Ferroin commented Feb 12, 2019

From my Ubuntu 18.04 Laptop:

user@host:~$ apt update | grep fo
[...]
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
[...]

I think apt-get is still a valid choice for the future.

Yeah, I'd forgotten that they still complain if you use it in a pipe. I seriously doubt that the interface of the particular sub-command we would be using will change, but it probably is better to follow the official advice and use apt-get for scripting.

@ghost
Author

ghost commented Feb 18, 2019

Howdy fellas,

sorry didn't find time the last few days. So I did this on a fresh Debian 9 installation with no firewall configured:

# Setup the hostname in /etc/hosts, /etc/hostname and via `hostname` command to "monitoring.example.com"
# Update all packages to latest
apt update; apt -y upgrade; apt -y dist-upgrade; apt -y install unattended-upgrades curl git; dpkg-reconfigure unattended-upgrades
# Install netdata
bash <(curl -Ss https://my-netdata.io/kickstart.sh)

# Git clone the sslcheck plugin (same code as in this PR) and move files in place (as root user)
git clone https://github.com/blunix/netdata-plugin-sslcheck.git
cd netdata-plugin-sslcheck/
cp sslcheck.chart.py /usr/libexec/netdata/python.d/
editor /etc/netdata/python.d/sslcheck.conf  # contents below
my_sslcheck:
  host: my-netdata.io
  port: 443
  timeout: 10
  daysuntilexpiration: 5
daysuntilexpiration: 5


# Restart netdata
systemctl restart netdata.service

I then tried the debug command:

sudo su -s /bin/bash netdata
cd ~
netdata@monitoring:~$ /usr/libexec/netdata/plugins.d/python.d.plugin debug trace sslcheck
2019-02-18 14:04:58: python.d INFO: plugin: main: Using python 2
2019-02-18 14:04:58: python.d DEBUG: plugin: main: loading '/etc/netdata/python.d.conf'
2019-02-18 14:04:58: python.d ERROR: plugin: main: cannot load '/etc/netdata/python.d.conf': [Errno 2] No such file or directory: '/etc/netdata/python.d.conf'. Will try stock version.
2019-02-18 14:04:58: python.d DEBUG: plugin: main: loading '/usr/lib/netdata/conf.d/python.d.conf'
2019-02-18 14:04:58: python.d DEBUG: plugin: main: module load source: 'sslcheck' => [OK]
2019-02-18 14:04:58: python.d DEBUG: plugin: main: loading '/etc/netdata/python.d/sslcheck.conf'
2019-02-18 14:04:58: python.d DEBUG: plugin: main: job initialization: 'sslcheck my_sslcheck' => ['OK']
2019-02-18 14:04:58: python.d DEBUG: plugin: main: module status: 'sslcheck' => [OK] (jobs: 1)
2019-02-18 14:04:58: python.d DEBUG: sslcheck: my_sslcheck: Enabled sslcheck: my-netdata.io:443, update every 1s, timeout: 10s
2019-02-18 14:04:58: python.d INFO: sslcheck: my_sslcheck: check() => [OK]
2019-02-18 14:04:58: python.d DEBUG: sslcheck: my_sslcheck: create() => [NOT ADDED] chart 'daysuntilexpiration' not in definitions. Skipping it.
2019-02-18 14:04:58: python.d ERROR: sslcheck: my_sslcheck: create() => [FAILED] (charts: 0)
2019-02-18 14:04:58: python.d INFO: plugin: main: FINISHED

Looks good so far - not sure why the debug command wasn't outputting anything for me the last time.

The "[NOT ADDED] chart 'daysuntilexpiration' not in definitions" line, however, I don't fully understand. Any ideas?

@ilyam8
Member

ilyam8 commented Feb 18, 2019

@p-thurner (not in definitions == not in CHARTS): the chart listed in ORDER has no matching key in CHARTS.

ORDER = ['daysuntilexpiration']

CHARTS = {
    'daysvalid': {
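For comparison, a definition where the names agree would look roughly like this. A sketch assuming the usual python.d `options` layout (`[name, title, units, family, context, chart_type]`); the title, units, context and dimension id are illustrative, and `undefined_charts` is a hypothetical helper for catching this mismatch early:

```python
ORDER = ['daysuntilexpiration']

CHARTS = {
    # The key must match the entry in ORDER, otherwise create() skips the chart.
    'daysuntilexpiration': {
        'options': [None, 'Days Until Certificate Expiration', 'days', 'certificate',
                    'sslcheck.daysuntilexpiration', 'line'],
        'lines': [
            ['days'],  # dimension id; the collected data would need a 'days' key
        ],
    },
}


def undefined_charts(order, charts):
    """Return chart names listed in ORDER that have no entry in CHARTS."""
    return [name for name in order if name not in charts]
```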

@ilyam8
Member

ilyam8 commented Mar 12, 2019

[Screenshot: Screenshot_20190312_195729]

Alarm:

  • warning if the days until certificate expiry < days_until_expiration_warning (days_until_expiration_warning is 5 by default)
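The alarm rule above boils down to a threshold comparison. A minimal Python sketch of the logic only, not Netdata's health configuration syntax; `expiry_status` is hypothetical, `warning_days` mirrors days_until_expiration_warning (5 by default per the comment above), and `critical_days` is an optional extra threshold whose default is an assumption:

```python
def expiry_status(days_left, warning_days=5, critical_days=None):
    """Classify days until expiry as 'critical', 'warning' or 'clear'."""
    if critical_days is not None and days_left < critical_days:
        return 'critical'
    if days_left < warning_days:
        return 'warning'
    return 'clear'
```

The strict `<` means a certificate with exactly `warning_days` left is still reported as clear.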

@ilyam8
Member

ilyam8 commented Mar 12, 2019

There is a bug in how the next update time is calculated:

def calc_next(self):
    self.start_mono = monotonic()
    return self.start_mono - (self.start_mono % self.update_every) + self.update_every + self.penalty

In [1]: import time                                                                                               

In [2]: def c(i): 
      :     t = time.monotonic() 
      :     d = t - (t%i) + i 
      :     return t, d, d - t 
      :                                                                                                           

In [3]: c(60)                                                                                                     
Out[3]: (84688.099451557, 84720.0, 31.900548442994477)

In [4]: c(60)                                                                                                     
Out[4]: (84689.748439694, 84720.0, 30.25156030600192)

In [5]: c(60)                                                                                                     
Out[5]: (84693.18335048, 84720.0, 26.816649519998464)

In [6]: c(60)                                                                                                     
Out[6]: (84702.413057355, 84720.0, 17.58694264499354)

In [7]: c(60)                                                                                                     
Out[7]: (84720.166530715, 84780.0, 59.83346928500396)

That means the module's first poll can be delayed by up to update_every seconds, which is bad if update_every is large.

Will be fixed in a separate PR.
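One possible shape for that fix, written as a standalone function so it can be tested: poll right away on the first run, then keep aligning later polls to multiples of `update_every`. This is a hypothetical sketch, not the change that was actually made in the follow-up PR:

```python
def calc_next(now, update_every, penalty=0, first_run=False):
    """Return the monotonic time of the next poll.

    Hypothetical variant: the first poll happens immediately; afterwards the
    schedule aligns to the next multiple of update_every, as in the original.
    """
    if first_run:
        return now
    return now - (now % update_every) + update_every + penalty
```

Keeping the alignment for later polls preserves evenly spaced collection points, so only the initial wait changes.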

@ghost
Author

ghost commented Mar 12, 2019

👍 and thanks a lot for the help @ilyam8 !! :)

@ilyam8 ilyam8 added this to the v1.13 milestone Mar 13, 2019
@ilyam8 ilyam8 merged commit 97699c5 into netdata:master Mar 13, 2019
@ilyam8 ilyam8 added this to Done in New collectors via automation Mar 13, 2019
This was referenced Mar 14, 2019
ilyam8 added a commit that referenced this pull request Mar 15, 2019
##### Summary

rename

> health/health.d/sslcheck.conf → health/health.d/x509check.conf

**Why**

sslcheck module (#5365) was removed (#5626) because of a memory leak bug (#5624).

The module was rewritten in go (#5631, netdata/go.d.plugin#166). New module name - `x509check`. 

This PR changes the name of the alarm.
jackyhuang85 pushed a commit to jackyhuang85/netdata that referenced this pull request Jan 1, 2020
…5365)

* added WIP ssl certificate expiry time check plugin

* fixing bugs

* more bugfixes

* cleaned up

* fixed graphing

* More pretty readme

* cleaned up style

* change author

* simplify

* add days_until_expiration_warn and correctly calc seconds

* update config

* config update

* readme update

* return false from check if module failed to collect data

* set default update_every to 60

* add alarm

* add sslcheck to makefile

* fix indentation

* add crit to alarm

* update conf

* update readme

* add days_until_expiration_critical

* change default days_until_expiration_warning to 14

* minor
jackyhuang85 pushed a commit to jackyhuang85/netdata that referenced this pull request Jan 1, 2020
Labels
area/collectors Everything related to data collection area/docs

7 participants