Add monitoring instrumentation around `dhcp-disable` state. #133

davidhankins · 2024-01-19T18:55:42Z

PROBLEM

We're tracking reported issues in our environment where Kea servers are acting like they are being left disabled by unkonwn causes. Discovering the root cause of that problem has been hampered by poor visibility;

It is normal for servers to go without receiving packets. When the pkt?-received counter stops incrementing, we cannot determine reliably if this is due to e.g. DHCP relay agent misconfiguration or a Kea service issue. It frustrates debugging.
Kea does not increment any counter to indicate it is disabled, providing us with no clear monitoring predicate to detect the situation early.
It is consequently hours or days before this is noticed, by which time debug logs have rotated.

This results in the situation of there being no outward indication of disabled state.

SOLUTION

In this change, we resolve our visibility problems;

A new pkt?-raw-received counter is incremented at the earliest point of execution we know a packet has been received (in runOne()).
A new pkt?-dhcp-disabled counter counts packets dropped due to the service being in a disabled state at the time of reception.
A WARN level log is emitted for packets dropped while disabled. To alleviate spam concerns, a 1-in-100 sample is used.
- To facilitate this change, the StatsMgr now provides a getInteger() method.
NetworkState internal boolean constructs are exposed as monitoring metrics. This allows us to observe the normal process of disabling service during e.g. HA reconnects and recovery, and separately detect extended unexpected disabled states prior to sufficient packet level impacts (consider e.g. DHCP anycast servers which may not receive packets to count as disabled for extended periods - we want to know it had been disabled for several days before the anycast route changes, not after).

This "belt and suspenders" approach should close our monitoring gaps and provide attention to faulty services as they present.

DISCUSSION

The pkt?-raw-received counters are a glaring admission of guilt, and we should use pkt?-received in these places. The exsisting pkt?-received counters should be renamed to pkt?-processed, as they indicate execution of the respective processPacket() function - still a useful monitoring indicator that e.g. the thread handoff succeeded. Probably there should also be a counter for the case where the callback addition failed instead of the present debug log.

HOWEVER, existing Kea users may rely on the present behavior and names of metrics. A monitoring predicate using the current pkt?-received metrics may trigger automatic remediation processes for Kea servers unintentionally left disabled for too long; these predicate configurations would have to change or else the remediation would be defeated. This may not be an ideal user experience.

Consequently, I've offered the unfortunate solution you see here; pkt?-raw-received, but I'm extremely unhappy with it; I'd be happy to modify the situation to whatever ISC prefers to offer their users in this case.

PROBLEM ======= We're tracking reported issues in our environment where Kea servers are acting like they are being left disabled by unkonwn causes. Discovering the root cause of that problem has been hampered by poor visibility; * It is normal for servers to go without receiving packets. When the `pkt?-received` counter stops incrementing, we cannot determine reliably if this is due to e.g. DHCP relay agent misconfiguration or a Kea service issue. It frustrates debugging. * Kea does not increment any counter to indicate it is disabled, providing us with no clear monitoring predicate to detect the situation early. * It is consequently hours or days before this is noticed, by which time debug logs have rotated. This results in the situation of there being no outward indication of disabled state. SOLUTION ======== In this change, we resolve our visibility problems; * A new `pkt?-raw-received` counter is incremented at the earliest point of execution we know a packet has been received (in `runOne()`). * A new `pkt?-dhcp-disabled` counter counts packets dropped due to the service being in a disabled state at the time of reception. * A WARN level log is emitted for packets dropped while disabled. To alleviate spam concerns, a 1-in-100 sample is used. * To facilitate this change, the `StatsMgr` now provides a `getInteger()` method. * `NetworkState` internal boolean constructs are exposed as monitoring metrics. This allows us to observe the normal process of disabling service during e.g. HA reconnects and recovery, and separately detect extended unexpected disabled states prior to sufficient packet level impacts (consider e.g. DHCP anycast servers which may not receive packets to count as disabled for extended periods - we want to know it had been disabled for several days before the anycast route changes, not after). This "belt and suspenders" approach should close our monitoring gaps and provide attention to faulty services as they present. DISCUSSION ========== The `pkt?-raw-received` counters are a glaring admission of guilt, and we should use `pkt?-received` in these places. The exsisting `pkt?-received` counters should be renamed to `pkt?-processed`, as they indicate execution of the respective `processPacket()` function - still a useful monitoring indicator that e.g. the thread handoff succeeded. Probably there should also be a counter for the case where the callback addition failed instead of the present debug log. **HOWEVER**, existing Kea users may rely on the present behavior and names of metrics. A monitoring predicate using the current `pkt?-received` metrics may trigger automatic remediation processes for Kea servers unintentionally left disabled for too long; these predicate configurations would have to change or else the remediation would be defeated. This may not be an ideal user experience. Consequently, I've offered the unfortunate solution you see here; `pkt?-raw-received`, but I'm extremely unhappy with it; I'd be happy to modify the situation to whatever ISC prefers to offer their users in this case.

tomaszmrugalski · 2024-01-22T16:19:05Z

Added gitlab#3228 to track this.

davidhankins · 2024-01-22T17:30:38Z

Ah, I adapted this from a patch on Kea 2.2, and thought I had an easy way to adapt for the new origin. But now I see the number of origins for HA groups is expected to be more numerous (HA_REMOTE_COMMAND+N); I'll have to think about how to resolve that properly.

tomaszmrugalski · 2024-01-22T21:41:12Z

Hub and spoke model for HA is about to go out in 2.5.5 in a week.

The contains() method was added in C++20, while count() will always return 0 or 1 and goes back to C++11.

…toring

davidhankins · 2024-01-30T22:25:46Z

I am unhappy with my brute force solution to network_state monitoring; it turns the average O(constant) disable / enable by origin into a O(D) operation for D disabled origins. This may be temporarily satisfactory considering our expectation for small populations of disabled origins during routine operation (at e.g. startup we may experience limit-populations (2,001 currently?)).

After thinking about it, to resolve the complexity issue I would need feedback on which approach to take;

Add Support for Metrics with Labels and Lambdas.

Using the popular Prometheus data model for time series data as our example, we refer to a single metric by name which can have multiple values differentiated by collections of labels; https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels

For this solution, which would be a large undertaking bear in mind, we would first extend the isc::stats module:

Create a Observation::Type that is a map of {(label_name, label_value)} sets to values of the current types (INTEGER, BIG_INTEGER, FLOAT, DURATION, STRING). When exposing this in get-stats methods, each value is exposed by label set.
Create an Observation where the sampled value is determined by a callback function rather than set on the Observation by the caller. ex https://pkg.go.dev/github.com/bg451/gokit/metrics/prometheus#RegisterCallbackGauge This Observation could be limited to 1 value, and would always represent the current time.

Then, we repurpose the O(D) updateStats() method is in this PR (in a thread safe way) into a callback to inform the gauge; but now with the added benefit that each HA group can be instrumented separately by label. With a little added effort (a registry of HA groups and LOCAL/REMOTE offset indexes), we could instrument all configured HA groups reliably with 0 or 1 values (we could provide a label of e.g. ha_group="ha-group-a" rather than integer indices).

This removes all the non-constant complexity from network_state's normal operation and shifts it to once-a-metric-sample, and compared with some other solutions would require the NetworkState mutex less frequently (but for longer).

Implement Lambdas and Move the monitoring of HA group disable states into the HA hook.

We add a method to network_state to test if a specific Origin is disabled or not (NetworkState.isDisabled(origin)). This may be desirable in implementing for large hub and spoke populations (we may not want to continue disabling DHCP globally if one spoke has disabled DHCP on a 1,000 spoke hub) - we may need this function for other purposes I mean.

We borrow again the idea to instrument metrics by Lambda callback functions; in the HA hook, for each HA group, we create a monitoring metric by name ("disabled-(local|remote)-by-ha-group-my-ha-group-name") with a callback that checks for its own disable states by integer index.

This check is on average O(constant) and would be executed when the metric is sampled (with H metrics), and each metric sampled in this way would wind up briefly taking the NetworkState mutex independently.

Reorganize the `disabled_by_origin_` `std::unordered_set`

This is by far the simplest approach. We could;

Use a a single ordered std::set, changing complexity of all operations to logarithmic. Updating stats for HA remote/local could use the std::set.lower_bound() interface to inform the metric value to set.
Use multiple std::unordered_set objects - one for local, one for remote ha. The .empty() boolean is all that is required, which retains average constant complexity for all operations. We would probably move the User origin to a standalone boolean.

Conclusion

If you have some thoughts on how to go about implementing our way out of linear complexity, I can consider making those changes in this PR; if it is anything but the most simplest of implementations however, I doubt I will be able to accommodate at this time.

davidhankins added 2 commits January 19, 2024 09:58

Include namespace of origin constants.

e0392c4

davidhankins added 5 commits January 25, 2024 15:54

Use std::unordered_set<>.count() instead of .contains().

323435d

The contains() method was added in C++20, while count() will always return 0 or 1 and goes back to C++11.

Fix compilation errors.

34e3453

Keep the StatsMgr instance on the stack.

e726921

Reference gitlab in Changelog.

6e0ce9d

Merge remote-tracking branch 'upstream/master' into disable-dhcp-moni…

e534bc9

…toring

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add monitoring instrumentation around `dhcp-disable` state. #133

Add monitoring instrumentation around `dhcp-disable` state. #133

davidhankins commented Jan 19, 2024

tomaszmrugalski commented Jan 22, 2024

davidhankins commented Jan 22, 2024

tomaszmrugalski commented Jan 22, 2024

davidhankins commented Jan 30, 2024

Add monitoring instrumentation around dhcp-disable state. #133

Are you sure you want to change the base?

Add monitoring instrumentation around dhcp-disable state. #133

Conversation

davidhankins commented Jan 19, 2024

PROBLEM

SOLUTION

DISCUSSION

tomaszmrugalski commented Jan 22, 2024

davidhankins commented Jan 22, 2024

tomaszmrugalski commented Jan 22, 2024

davidhankins commented Jan 30, 2024

Add Support for Metrics with Labels and Lambdas.

Implement Lambdas and Move the monitoring of HA group disable states into the HA hook.

Reorganize the disabled_by_origin_ std::unordered_set

Conclusion

Add monitoring instrumentation around `dhcp-disable` state. #133

Add monitoring instrumentation around `dhcp-disable` state. #133

Reorganize the `disabled_by_origin_` `std::unordered_set`