Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add monitoring instrumentation around dhcp-disable state. #133

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

davidhankins
Copy link

PROBLEM

We're tracking reported issues in our environment where Kea servers are acting like they are being left disabled by unkonwn causes. Discovering the root cause of that problem has been hampered by poor visibility;

  • It is normal for servers to go without receiving packets. When the pkt?-received counter stops incrementing, we cannot determine reliably if this is due to e.g. DHCP relay agent misconfiguration or a Kea service issue. It frustrates debugging.

  • Kea does not increment any counter to indicate it is disabled, providing us with no clear monitoring predicate to detect the situation early.

  • It is consequently hours or days before this is noticed, by which time debug logs have rotated.

This results in the situation of there being no outward indication of disabled state.

SOLUTION

In this change, we resolve our visibility problems;

  • A new pkt?-raw-received counter is incremented at the earliest point of execution we know a packet has been received (in runOne()).

  • A new pkt?-dhcp-disabled counter counts packets dropped due to the service being in a disabled state at the time of reception.

  • A WARN level log is emitted for packets dropped while disabled. To alleviate spam concerns, a 1-in-100 sample is used.

    • To facilitate this change, the StatsMgr now provides a getInteger() method.
  • NetworkState internal boolean constructs are exposed as monitoring metrics. This allows us to observe the normal process of disabling service during e.g. HA reconnects and recovery, and separately detect extended unexpected disabled states prior to sufficient packet level impacts (consider e.g. DHCP anycast servers which may not receive packets to count as disabled for extended periods - we want to know it had been disabled for several days before the anycast route changes, not after).

This "belt and suspenders" approach should close our monitoring gaps and provide attention to faulty services as they present.

DISCUSSION

The pkt?-raw-received counters are a glaring admission of guilt, and we should use pkt?-received in these places. The exsisting pkt?-received counters should be renamed to pkt?-processed, as they indicate execution of the respective processPacket() function - still a useful monitoring indicator that e.g. the thread handoff succeeded. Probably there should also be a counter for the case where the callback addition failed instead of the present debug log.

HOWEVER, existing Kea users may rely on the present behavior and names of metrics. A monitoring predicate using the current pkt?-received metrics may trigger automatic remediation processes for Kea servers unintentionally left disabled for too long; these predicate configurations would have to change or else the remediation would be defeated. This may not be an ideal user experience.

Consequently, I've offered the unfortunate solution you see here; pkt?-raw-received, but I'm extremely unhappy with it; I'd be happy to modify the situation to whatever ISC prefers to offer their users in this case.

PROBLEM
=======

We're tracking reported issues in our environment where Kea servers are
acting like they are being left disabled by unkonwn causes. Discovering
the root cause of that problem has been hampered by poor visibility;

* It is normal for servers to go without receiving packets. When the
  `pkt?-received` counter stops incrementing, we cannot determine
  reliably if this is due to e.g. DHCP relay agent misconfiguration or a
  Kea service issue. It frustrates debugging.

* Kea does not increment any counter to indicate it is disabled,
  providing us with no clear monitoring predicate to detect the
  situation early.

* It is consequently hours or days before this is noticed, by which time
  debug logs have rotated.

This results in the situation of there being no outward indication of
disabled state.

SOLUTION
========

In this change, we resolve our visibility problems;

* A new `pkt?-raw-received` counter is incremented at the earliest point
  of execution we know a packet has been received (in `runOne()`).

* A new `pkt?-dhcp-disabled` counter counts packets dropped due to the
  service being in a disabled state at the time of reception.

* A WARN level log is emitted for packets dropped while disabled. To
  alleviate spam concerns, a 1-in-100 sample is used.

  * To facilitate this change, the `StatsMgr` now provides a
    `getInteger()` method.

* `NetworkState` internal boolean constructs are exposed as monitoring
  metrics. This allows us to observe the normal process of disabling
  service during e.g. HA reconnects and recovery, and separately detect
  extended unexpected disabled states prior to sufficient packet level
  impacts (consider e.g. DHCP anycast servers which may not receive
  packets to count as disabled for extended periods - we want to know it
  had been disabled for several days before the anycast route changes,
  not after).

This "belt and suspenders" approach should close our monitoring gaps and
provide attention to faulty services as they present.

DISCUSSION
==========

The `pkt?-raw-received` counters are a glaring admission of guilt,
and we should use `pkt?-received` in these places. The exsisting
`pkt?-received` counters should be renamed to `pkt?-processed`, as
they indicate execution of the respective `processPacket()` function -
still a useful monitoring indicator that e.g. the thread handoff
succeeded. Probably there should also be a counter for the case where
the callback addition failed instead of the present debug log.

**HOWEVER**, existing Kea users may rely on the present behavior and
names of metrics. A monitoring predicate using the current
`pkt?-received` metrics may trigger automatic remediation processes for
Kea servers unintentionally left disabled for too long; these predicate
configurations would have to change or else the remediation would be
defeated. This may not be an ideal user experience.

Consequently, I've offered the unfortunate solution you see here;
`pkt?-raw-received`, but I'm extremely unhappy with it; I'd be happy to
modify the situation to whatever ISC prefers to offer their users in
this case.
@tomaszmrugalski
Copy link
Member

Added gitlab#3228 to track this.

@davidhankins
Copy link
Author

Ah, I adapted this from a patch on Kea 2.2, and thought I had an easy way to adapt for the new origin. But now I see the number of origins for HA groups is expected to be more numerous (HA_REMOTE_COMMAND+N); I'll have to think about how to resolve that properly.

@tomaszmrugalski
Copy link
Member

Hub and spoke model for HA is about to go out in 2.5.5 in a week.

@davidhankins
Copy link
Author

I am unhappy with my brute force solution to network_state monitoring; it turns the average O(constant) disable / enable by origin into a O(D) operation for D disabled origins. This may be temporarily satisfactory considering our expectation for small populations of disabled origins during routine operation (at e.g. startup we may experience limit-populations (2,001 currently?)).

After thinking about it, to resolve the complexity issue I would need feedback on which approach to take;

Add Support for Metrics with Labels and Lambdas.

Using the popular Prometheus data model for time series data as our example, we refer to a single metric by name which can have multiple values differentiated by collections of labels; https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels

For this solution, which would be a large undertaking bear in mind, we would first extend the isc::stats module:

  1. Create a Observation::Type that is a map of {(label_name, label_value)} sets to values of the current types (INTEGER, BIG_INTEGER, FLOAT, DURATION, STRING). When exposing this in get-stats methods, each value is exposed by label set.
  2. Create an Observation where the sampled value is determined by a callback function rather than set on the Observation by the caller. ex https://pkg.go.dev/github.com/bg451/gokit/metrics/prometheus#RegisterCallbackGauge This Observation could be limited to 1 value, and would always represent the current time.

Then, we repurpose the O(D) updateStats() method is in this PR (in a thread safe way) into a callback to inform the gauge; but now with the added benefit that each HA group can be instrumented separately by label. With a little added effort (a registry of HA groups and LOCAL/REMOTE offset indexes), we could instrument all configured HA groups reliably with 0 or 1 values (we could provide a label of e.g. ha_group="ha-group-a" rather than integer indices).

This removes all the non-constant complexity from network_state's normal operation and shifts it to once-a-metric-sample, and compared with some other solutions would require the NetworkState mutex less frequently (but for longer).

Implement Lambdas and Move the monitoring of HA group disable states into the HA hook.

We add a method to network_state to test if a specific Origin is disabled or not (NetworkState.isDisabled(origin)). This may be desirable in implementing for large hub and spoke populations (we may not want to continue disabling DHCP globally if one spoke has disabled DHCP on a 1,000 spoke hub) - we may need this function for other purposes I mean.

We borrow again the idea to instrument metrics by Lambda callback functions; in the HA hook, for each HA group, we create a monitoring metric by name ("disabled-(local|remote)-by-ha-group-my-ha-group-name") with a callback that checks for its own disable states by integer index.

This check is on average O(constant) and would be executed when the metric is sampled (with H metrics), and each metric sampled in this way would wind up briefly taking the NetworkState mutex independently.

Reorganize the disabled_by_origin_ std::unordered_set

This is by far the simplest approach. We could;

  1. Use a a single ordered std::set, changing complexity of all operations to logarithmic. Updating stats for HA remote/local could use the std::set.lower_bound() interface to inform the metric value to set.
  2. Use multiple std::unordered_set objects - one for local, one for remote ha. The .empty() boolean is all that is required, which retains average constant complexity for all operations. We would probably move the User origin to a standalone boolean.

Conclusion

If you have some thoughts on how to go about implementing our way out of linear complexity, I can consider making those changes in this PR; if it is anything but the most simplest of implementations however, I doubt I will be able to accommodate at this time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants