Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(inputs.systemd_units): Allow to query unloaded/disabled units #14814

Merged
merged 24 commits into from Mar 5, 2024

Conversation

srebhan
Copy link
Contributor

@srebhan srebhan commented Feb 13, 2024

Summary

Unloaded or disabled units are neither listed with systemctl list-units not systemctl show as systemctl only will work on loaded units (see systemd/systemd#5063) especially with wildcards!

The only way to make this work is to determine the unit-files matching the patterns and then use the exact unit-name to infer further information as this will enforce loading the unit by systemd. Unfortunately, this is not particularly easy using the systemctl command, therefore, this PR basically rewrites the inputs.systemd_units plugin to be based on DBUS communication.

Important note: This PR changes the tag-names uf_state and uf_preset to state and preset match the output of systemctl. It also replaces the subcommand option by a boolean details config option to simplify code. Furthermore, the PR corrects the type of mem_* and swap_* fields to uint64 instead of int as this is the native underlying type of those fields.

Checklist

  • No AI generated code was used in this PR

Related issues

resolves #14763

@telegraf-tiger telegraf-tiger bot added feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin plugin/input 1. Request for new input plugins 2. Issues/PRs that are related to input plugins labels Feb 13, 2024
@1tft
Copy link

1tft commented Feb 14, 2024

Output with this build looks so good!

[[inputs.systemd_units]]
  pattern = 'telegra* crond.service' 
  subcommand = "show" 

Output for inactive (dead) and disabled service "telegraf" (using wildcards for search pattern) and a running active enabled service "crond":

> systemd_units,active=inactive,host=localhost.localdomain,load=loaded,name=telegraf.service,sub=dead,uf_preset=disabled,uf_state=disabled active_code=2i,load_code=0i,mem_current=9223372036854775807i,pid=0i,restarts=0i,status_errno=0i,sub_code=1i 1707924229000000000
> systemd_units,active=active,host=localhost.localdomain,load=loaded,name=crond.service,sub=running,uf_preset=enabled,uf_state=enabled active_code=0i,load_code=0i,mem_current=1167360i,pid=906i,restarts=0i,status_errno=0i,sub_code=0i 1707924229000000000

@srebhan
Copy link
Contributor Author

srebhan commented Feb 14, 2024

@1tft and @knollet I would ask you to also check multi-instance units (the ones with an @)! Same for dynamically starting/stopping, enabling/disabling or creating new units while Telegraf is running...

@1tft
Copy link

1tft commented Feb 14, 2024

My tests so far: Monitored at least 2 services same time, using wildcards in pattern. Started, removed, disabled and created new services (e.g. smartd). Killed service too.
One time I used also pattern = "*"
systemd_units output created always valid metrics. Checked most of fields regarding plausability.

Telegraf log was clean (normal logging mode - no debug on).

systemctl --version
systemd 239 (239-68.el8)

Only issue/anomaly:
A loaded but inactive service produces as mem_current value a max 64-bit signed integer. Example:

]# systemctl status smartd.service 
● smartd.service - Self Monitoring and Reporting Technology (SMART) Daemon
   Loaded: loaded (/usr/lib/systemd/system/smartd.service; disabled; vendor preset: enabled)
   Active: inactive (dead)
     Docs: man:smartd(8)
           man:smartd.conf(5)

Metrics:
systemd_units,active=inactive,host=localhost.localdomain,load=loaded,name=smartd.service,sub=dead,uf_preset=enabled,uf_state=disabled status_errno=0i,mem_current=9223372036854775807i,load_code=0i,sub_code=1i,pid=0i,active_code=2i,restarts=0i 1707942750000000000

I hope to find time for testing @ services tomorrow.

@srebhan srebhan added the ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review. label Feb 15, 2024
@srebhan srebhan assigned powersj and DStrand1 and unassigned srebhan Feb 15, 2024
@1tft
Copy link

1tft commented Feb 15, 2024

Tested again. Now normal "static" services (services without [install] part) and multi instances services.

Telegraf log was clean (normal logging mode - no debug on).

systemctl --version
systemd 239 (239-68.el8)

telegraf --version
Telegraf 1.30.0-125c0bec (git: pull/14814@125c0bec)

Like yesterday and uncritical: A loaded but inactive service produces as mem_current value with max 64-bit signed integer value.

Maybe changelog and systemd_units README.md should mention that pattern search now relies on unit file name.
That means following change regarding seach pattern wildcard use:

When we have a service "lvm2-pvscan@252:2.service":
systemctl status lvm2-pvscan@2*

● lvm2-pvscan@252:2.service - LVM event activation on device 252:2
   Loaded: loaded (/usr/lib/systemd/system/lvm2-pvscan@.service; static; vendor preset: disabled)
   Active: active (exited) since Thu 2024-02-15 21:00:57 CET; 15min ago
     Docs: man:pvscan(8)
  Process: 1617 ExecStop=/usr/sbin/lvm pvscan --cache 252:2 (code=exited, status=0/SUCCESS)
  Process: 1627 ExecStart=/usr/sbin/lvm pvscan --cache --activate ay 252:2 (code=exited, status=0/SUCCESS)
 Main PID: 1627 (code=exited, status=0/SUCCESS)

The old Telegraf v1.29.4 produces results with following config

[[inputs.systemd_units]]
  pattern = 'lvm2-pvscan@2*
  # subcommand = "show"
  interval = "30s"

systemd_units,active=active,host=localhost.localdomain,load=loaded,name=lvm2-pvscan@252:2.service,sub=exited load_code=0i,active_code=0i,sub_code=4i 1708028550000000000
systemd_units,active=active,host=localhost.localdomain,load=loaded,name=user@0.service,sub=running active_code=0i,sub_code=0i,load_code=0i 1708028550000000000

Now Telegraf v1.30.0 does not produce any output with following config

[[inputs.systemd_units]]
  pattern = 'lvm2-pvscan@2*
  subcommand = "show"
  interval = "30s"

OR


[[inputs.systemd_units]]
  pattern = 'lvm2-pvscan@2*
  # subcommand = "show"
  interval = "30s"

_

User now have to use:

[[inputs.systemd_units]]
  pattern = 'lvm2-pvscan@*
  subcommand = "show"
  interval = "30s"

OR


[[inputs.systemd_units]]
  pattern = 'lvm2-pvscan@*
  # subcommand = "show"
  interval = "30s"

@srebhan srebhan removed the ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review. label Feb 16, 2024
@srebhan
Copy link
Contributor Author

srebhan commented Feb 16, 2024

@1tft thanks for the finding. I will try to fix the changed behavior...

@srebhan srebhan assigned srebhan and unassigned powersj and DStrand1 Feb 16, 2024
@1tft
Copy link

1tft commented Feb 18, 2024

Sadly found a serious issue regarding new release
telegraf --version
Telegraf 1.30.0-7f8e03bf (git: pull/14814@7f8e03bf)

For static service (service without [Install] section) like
● testservice.service - TestService
Loaded: loaded (/usr/lib/systemd/system/testservice.service; static; vendor preset: disabled)
Active: inactive (dead)

telegraf produces 2 lines of metrics

systemd_units,active=inactive,host=localhost.localdomain,load=stub,name=testservice.service,sub=dead sub_code=1i,load_code=1i,active_code=2i 1708267290000000000
systemd_units,active=inactive,host=localhost.localdomain,load=stub,name=testservice.service,sub=dead load_code=1i,sub_code=1i,active_code=2i 1708267290000000000

After starting testservice.service telegraf produces 2 lines again, and does not recognized that service has been started. Also systemctl daemon-reload was no solution.
Installed telegraf-1.30.0~125c0bec-0.x86_64.rpm again, produced immediately correct result.
Installed telegraf-1.30.0~7f8e0bec-0.x86_64.rpm again, still produced false results.

Also reproduced issue when removing [Install] section from existing standard service like smartd. Than it was "static" and telegraf produced two false (indentical) metrics for every interval.

My testservice:
cat /usr/lib/systemd/system/testservice.service

[Unit]
Description=TestService

[Service]
Type=forking

ExecStart=/bin/bash -c "sleep 2; $(sleep 60) &"

TimeoutStartSec=15
TimeoutStopSec=2

RemainAfterExit=true

Restart=on-failure
RestartSec=60
StartLimitInterval=300
StartLimitBurst=2

@srebhan
Copy link
Contributor Author

srebhan commented Feb 19, 2024

Thank you very much @1tft for the thorough testing! That's much appreciated and invaluable! Hope I got all the corner cases right this time...

@1tft
Copy link

1tft commented Feb 20, 2024

Thank you for new build!
Looks very good, no double lines and service status is recognized correctly again by telegraf.

Regarding mem_current. For static and not running services no mem_current is printed out.

● smartd.service - Self Monitoring and Reporting Technology (SMART) Daemon
   Loaded: loaded (/usr/lib/systemd/system/smartd.service; static; vendor preset: enabled)
   Active: inactive (dead)
     Docs: man:smartd(8)
           man:smartd.conf(5)

systemd_units,active=inactive,host=localhost.localdomain,load=stub,name=smartd.service,sub=dead load_code=1i,active_code=2i,sub_code=1i 1708415340000000000

But for other inactive services mem_current is still printed out:

● smartd.service - Self Monitoring and Reporting Technology (SMART) Daemon
   Loaded: loaded (/usr/lib/systemd/system/smartd.service; disabled; vendor preset: enabled)
   Active: inactive (dead)
     Docs: man:smartd(8)
           man:smartd.conf(5)

systemd_units,active=inactive,host=localhost.localdomain,load=loaded,name=smartd.service,sub=dead,uf_preset=enabled,uf_state=disabled pid=0i,load_code=0i,status_errno=0i,restarts=0i,mem_current=9223372036854775807i,active_code=2i,sub_code=1i 1708414470000000000

 lvm2-monitor.service - Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling
   Loaded: loaded (/usr/lib/systemd/system/lvm2-monitor.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Tue 2024-02-20 08:45:59 CET; 19s ago
     Docs: man:dmeventd(8)
           man:lvcreate(8)
           man:lvchange(8)
           man:vgchange(8)
  Process: 1614 ExecStop=/usr/sbin/lvm vgchange --monitor n (code=exited, status=0/SUCCESS)
 Main PID: 701 (code=exited, status=0/SUCCESS)
 
systemd_units,active=inactive,host=localhost.localdomain,load=loaded,name=lvm2-monitor.service,sub=dead,uf_preset=enabled,uf_state=enabled load_code=0i,restarts=0i,mem_current=9223372036854775807i,pid=0i,active_code=2i,sub_code=1i,status_errno=0i 1708415220000000000

@srebhan
Copy link
Contributor Author

srebhan commented Feb 20, 2024

@1tft thanks for testing this again!

Hmmm I think all of those are a lie if I look at the number... ;-) The issue is that for the disabled units, systemd provides those numbers on request, while we can't get those for the "static" ones. Do you want me to set those fields to zero? I mean in InfluxDB you would get a nil value for that field and can fill it with whatever you need in your calculation...

@powersj any preferences?

@knollet
Copy link
Contributor

knollet commented Feb 20, 2024

Regarding mem_current:
According to dbus-specification there is a dbus-type t which is UNsigned 64bit.
https://dbus.freedesktop.org/doc/dbus-specification.html#basic-types

gdbus introspect --system --dest org.freedesktop.systemd1 --object-path /org/freedesktop/systemd1/unit/ypbind_2eservice
(a random service that doesn't run on my system)
return lots of stuff including this:
readonly t MemoryCurrent = 18446744073709551615;

So I think:
telegraf should return an unsigned mem_current if possible.

If can't be returned unsigned, it should be -1 in case of a non-running service or be redacted.
(I think I would check if there are marker values for when systemctl show returns [not set] and redact those. The alternative -- having a list of valid data fields for every service status combination -- seems less reasonable to me)

For clarity, I would say:

  • Use the type (unsigned) you get from upstream (dbus) (for me, as a C programmer, LLONG_MAX is really weird)
  • Redact mem_current if it is 2^64. (do so for every other field you gather).
  • Don't have this list:
{
    "running": ["mem_current", ...],
    "exited": [...],
    ...
}

@srebhan
Copy link
Contributor Author

srebhan commented Feb 20, 2024

@knollet not sure what you mean... We currently return what we get from DBUS without any modification. Regarding redacting/removing fields, I'm not sure if we should do this...

@knollet
Copy link
Contributor

knollet commented Feb 20, 2024

Actually you don't. I don't know if you programmed anything or if the go dbus library is buggy, but you changed numbers.

Do a gdbus introspect --system --dest org.freedesktop.systemd1 --object-path /org/freedesktop /systemd1/unit/ypbind_2eservice | grep "MemoryCurrent"
with a non-running service, and you will see: readonly t MemoryCurrent = 18446744073709551615;

You return for a non-running service mem_current=9223372036854775807i (see above) which is 2^63-1. (signed int max, 0x7FFFFFFFFFFFFFFF)
DBus returns 18446744073709551615 which is actually 0xFFFFFFFFFFFFFFFF (2^64-1) typed t
(defined in this document: https://dbus.freedesktop.org/doc/dbus-specification.html#basic-types )
which is UNsigned int, not signed.

You are converting it to a signed int for some reason and losing half the number doing it.

If you convert 2^64-1 to signed int, though, a special value which should indicate that the value is actually missing, I would either return unsigned int max, or -1, which would both be the same bitwise: 64 1 bits.
The value you currently return is not, it's 0 zero bit and 63 one bits, which is weird.

@srebhan
Copy link
Contributor Author

srebhan commented Feb 20, 2024

@knollet are you sure you enabled influx_uint_support = true?

@knollet
Copy link
Contributor

knollet commented Feb 20, 2024

@1tft it may well be that this is the culprit.

@1tft
Copy link

1tft commented Feb 20, 2024

@knollet Setting influx_uint_support = true at output section prints mem_current=18446744073709551615u instead of 9223372036854775807i

@srebhan So we are OK with everything regarding mem_current handling (current behaviour, 0 or supressing complete field). I think 0 would be better than this very high value 9223372036854775807 / 18446744073709551615, only because for diagram scaling reasons. For InfluxDB its not relevant, because 9223372036854775807 is max...
Supressing only mem_current is also not a good solution.

I think Readme.md should mention this special situation, than everybody can decide to build a custom processor for this special value and we do not have to invest any more time on this edge case.

@srebhan
Copy link
Contributor Author

srebhan commented Feb 23, 2024

One more tweak... I now set the "invalid"/"unset" memory fields to zero. I also noticed that before (with the systemd command) the memory values were provided as signed-integers which is plain wrong as the dbus type is unsigned integer. So I fixed this as the feature was not yet released...

@srebhan
Copy link
Contributor Author

srebhan commented Feb 26, 2024

@1tft could you please give this another (hopefully last) round of testing!?

@telegraf-tiger
Copy link
Contributor

telegraf-tiger bot commented Mar 5, 2024

@srebhan srebhan merged commit ebea0b2 into influxdata:master Mar 5, 2024
26 checks passed
@srebhan srebhan deleted the systemd_units_issue_14763 branch March 5, 2024 15:34
@github-actions github-actions bot added this to the v1.30.0 milestone Mar 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/system feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin plugin/input 1. Request for new input plugins 2. Issues/PRs that are related to input plugins ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[[inputs.systemd_units]] no metric for inactive (dead) and disabled service
5 participants