Export unit substates #12

Open
hamiltont opened this issue Feb 26, 2020 · 4 comments
Comments

@hamiltont
Contributor

We currently export unit states, but not unit substates. Substates often carry much more actionable information than states, such as why a unit is inactive (did it stop with an error, was it killed, did it stop cleanly, etc.). Note that a unit's possible substates depend on its type: different types (service, mount, etc.) have different possible substates. See the list below for all possible combinations on systemd v237, and be aware that different systemd versions have added or removed substates.

Exporting substates would support querying, graphing, and possibly alerting by substate, e.g. sum(systemd_unit_state{state="inactive"}) by (type, substate).

As I see it, there are two reasonable ways to expose this substate information:

  • Add a new label substate to the systemd_unit_state metric
  • Export a new metric for each unit type with a substate label. For example, systemd_mount_state{name="foo.mount", substate="mounted"}

IMO adding a new label to systemd_unit_state makes the most sense, but other opinions are welcome

Regardless of approach, I do not think we should follow the standard Prometheus guideline of exporting every possible substate value as a 0-value timeseries, because the cardinality would explode. For example, each service unit would export approx. 6 states * 16 substates = 96 timeseries.
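
To put numbers on the cardinality concern, here is a rough sketch of the arithmetic for every unit type. The substate counts are tallied by hand from the systemd v237 listing at the end of this comment; the function names are made up for illustration and are not part of the exporter.

```python
# Number of unit active states in the listing (active, reloading, inactive,
# failed, activating, deactivating).
ACTIVE_STATES = 6

# Substate counts per unit type, tallied from the `systemctl --state=help`
# listing in this issue (systemd v237). Other systemd versions will differ.
SUBSTATE_COUNTS = {
    "automount": 4, "device": 3, "mount": 11, "path": 4, "scope": 6,
    "service": 16, "slice": 2, "socket": 13, "swap": 8, "target": 2,
    "timer": 5,
}


def series_if_fully_enumerated(unit_type: str) -> int:
    """Series per unit if every (state, substate) pair is exported."""
    return ACTIVE_STATES * SUBSTATE_COUNTS[unit_type]


def series_with_current_substate_only() -> int:
    """Series per unit per scrape if substate is only a label on the
    existing one-series-per-state metric."""
    return ACTIVE_STATES


if __name__ == "__main__":
    for unit_type in sorted(SUBSTATE_COUNTS):
        print(unit_type,
              series_if_fully_enumerated(unit_type),
              series_with_current_substate_only())
```

Note that the second figure is per scrape; over time a unit that moves between substates still produces additional series.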

Instead, we would attach the current substate as a label on each metric. When the substate changes, a new timeseries appears. For example systemd_unit_state{name="ssh.service", type="service", state="inactive", substate="failed"} would be distinct from systemd_unit_state{name="ssh.service", type="service", state="inactive", substate="dead"}. This might require aggregation in PromQL queries. However, as we already export one timeseries per state, this may be an easy transition (e.g. convert by (state) into by (state, substate)). Feedback welcome on this.
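
A minimal sketch of what that means for series identity and aggregation, written in Python rather than the exporter's own code. It only models the value-1 sample for each unit's current state, which is a simplification, and `labels_for` and `sum_by` are hypothetical helpers that imitate a label set and a PromQL `sum by (...)`.

```python
from collections import Counter


def unit_type(name: str) -> str:
    """Derive the unit type from the unit name suffix, e.g. 'service'."""
    return name.rsplit(".", 1)[-1]


def labels_for(name: str, state: str, substate: str) -> tuple:
    """Build a hashable label set; two label sets that differ in any
    label identify two distinct timeseries."""
    return (
        ("name", name),
        ("type", unit_type(name)),
        ("state", state),
        ("substate", substate),
    )


def sum_by(samples, keys):
    """Imitate `sum by (keys)`: add up sample values per key tuple."""
    totals = Counter()
    for labels, value in samples:
        label_map = dict(labels)
        totals[tuple(label_map[k] for k in keys)] += value
    return dict(totals)


if __name__ == "__main__":
    samples = [
        (labels_for("ssh.service", "inactive", "failed"), 1),
        (labels_for("cron.service", "inactive", "dead"), 1),
        (labels_for("foo.mount", "active", "mounted"), 1),
    ]
    print(sum_by(samples, ("state",)))
    print(sum_by(samples, ("state", "substate")))
```

Grouping by state alone gives the same totals as before the label was added, so existing queries that aggregate by state should keep working.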

Regarding exporter performance, the good news is we are already receiving substate information from dbus. It's included in every dbus.UnitStatus already, so there is effectively zero performance penalty for adding it as a new label.

List of states and substates on one of my systems. Note: different systemd versions will have different lists of substates.

m ~$ systemctl --state=help
Available unit load states:
stub
loaded
not-found
error
merged
masked

Available unit active states:
active
reloading
inactive
failed
activating
deactivating

Available automount unit substates:
dead
waiting
running
failed

Available device unit substates:
dead
tentative
plugged

Available mount unit substates:
dead
mounting
mounting-done
mounted
remounting
unmounting
remounting-sigterm
remounting-sigkill
unmounting-sigterm
unmounting-sigkill
failed

Available path unit substates:
dead
waiting
running
failed

Available scope unit substates:
dead
running
abandoned
stop-sigterm
stop-sigkill
failed

Available service unit substates:
dead
start-pre
start
start-post
running
exited
reload
stop
stop-sigabrt
stop-sigterm
stop-sigkill
stop-post
final-sigterm
final-sigkill
failed
auto-restart

Available slice unit substates:
dead
active

Available socket unit substates:
dead
start-pre
start-chown
start-post
listening
running
stop-pre
stop-pre-sigterm
stop-pre-sigkill
stop-post
final-sigterm
final-sigkill
failed

Available swap unit substates:
dead
activating
activating-done
active
deactivating
deactivating-sigterm
deactivating-sigkill
failed

Available target unit substates:
dead
active

Available timer unit substates:
dead
waiting
running
elapsed
failed
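
Since the list differs between systemd versions, it may be easier to derive it than to maintain it by hand. The following is a rough sketch that parses output in the format shown above into a mapping of unit type to substates. It assumes the "Available <type> unit substates:" heading format stays as printed here, and `parse_substates` is a hypothetical helper, not something the exporter provides.

```python
import re

# Matches headings such as "Available mount unit substates:".
HEADER = re.compile(r"^Available (?P<type>[\w-]+) unit substates:$")


def parse_substates(text: str) -> dict:
    """Map each unit type to its list of substates, skipping the
    load-state and active-state sections."""
    result = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            current = None  # a blank line ends the current section
            continue
        match = HEADER.match(line)
        if match:
            current = result.setdefault(match.group("type"), [])
        elif line.startswith("Available "):
            current = None  # some other section, e.g. unit active states
        elif current is not None:
            current.append(line)
    return result


if __name__ == "__main__":
    sample = (
        "Available unit active states:\nactive\ninactive\n\n"
        "Available slice unit substates:\ndead\nactive\n"
    )
    print(parse_substates(sample))
```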
@hamiltont
Contributor Author

@povilasv FYI - I'm not trying to include this into #10 - it has just been on my mind for a while so I wanted to write up the issue

/cc @SuperQ - Any insights you have on how to implement this in a sane manner would be appreciated

@povilasv
Contributor

👍 thanks for this. IMO it makes sense and I like the systemd_unit_state{name="ssh.service", type="service", state="inactive", substate="failed"} approach.

@SuperQ
Contributor

SuperQ commented Feb 27, 2020

While it's not optimal, I agree that the cardinality for substates is a bit much. There will still be issues with disappearing metrics.

My only concern is that there will be different behavior between the two labels.

@hamiltont
Contributor Author

One additional thought. It would be straightforward to have a feature flag collector.enable-complete-substate-series that lets users request zero-value series for all possible substates. This would be potentially useful for advanced users who want to alert on substate changes for a small set of business-critical units (specified with collector.unit-whitelist).

IMO, maintaining the boilerplate code (list of all possible substates) is not worth it for a feature that might be used by a hypothetical advanced user. It would be better to wait for someone to request something like this before adding the feature prematurely.
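
For reference, a rough sketch of what the complete-series behaviour would involve. The flag itself is only a proposal, the substate list is the timer list from systemd v237 in the first comment, and `complete_substate_series` is a hypothetical helper. It also shows the maintenance problem: a substate missing from the hard-coded list still has to be handled somehow.

```python
from typing import Dict, List

# Timer substates copied from the systemd v237 listing in this issue.
# A hard-coded list like this is the boilerplate that would need upkeep.
TIMER_SUBSTATES = ["dead", "waiting", "running", "elapsed", "failed"]


def complete_substate_series(current: str, known: List[str]) -> Dict[str, float]:
    """Return one value per known substate: 1.0 for the current
    substate and 0.0 for every other one."""
    series = {substate: 0.0 for substate in known}
    # Assumption: a substate missing from the list (e.g. added by a newer
    # systemd) is still exported rather than dropped.
    series[current] = 1.0
    return series


if __name__ == "__main__":
    print(complete_substate_series("waiting", TIMER_SUBSTATES))
```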
