Design for BMC eventing API #167

Merged
merged 1 commit into from Jul 5, 2022
143 changes: 143 additions & 0 deletions design/baremetal-operator/bmc-events.md
@@ -0,0 +1,143 @@
<!--
This work is licensed under a Creative Commons Attribution 3.0
Unported License.

http://creativecommons.org/licenses/by/3.0/legalcode
-->

# Event Subscription API

Users of bare metal hardware may want to receive events from the
baseboard management controller (BMC) in order to act on them in the
event of a hardware fault, increase in temperature, removal of a
device, etc.

## Status

implemented

## Summary

The Redfish standard includes the ability to subscribe to events, which
will cause hardware events to be sent in a particular format to a destination
URI. This design document describes a Metal3 API for configuring a
subscription. While Redfish is the primary target for this design, the
Ironic API is vendor-neutral and seeks to provide a unified interface
for configuring events.

> Review comment (Member): FYI the unified interface was delayed, we've concentrated on Redfish only for now.
>
> Reply (Member): ++


## Motivation

Some environments run workloads that need to react to potential
faults or environmental changes more quickly than an alert would
reach them through other channels. For example, some workloads may
have a sidecar container that knows how to handle an alert that a
particular network interface went down, or that the CPU temperature
reached a certain threshold.

### Goals

- Provide an API to manage subscriptions to events

### Non-Goals

- [Configurable events and thresholds](#configurable-events-and-thresholds)
- Any kind of event polling
- Software for processing the events, i.e. any webhook
- BMCs beyond Redfish, for now

## Proposal

### User Stories

- I'd like to configure my BMC to send events to a destination URI.
- I'd like to provide context to a particular event subscription.
- I'd like to provide arbitrary HTTP headers.
- I'd like the baremetal-operator to reconcile on the
BMCEventSubscription resource, and ensure its state is accurate in
Ironic.
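
As a rough illustration of the first three stories, the sketch below assembles a subscription request from a destination, a context string, and headers stored in a Kubernetes Secret. The property names `Destination`, `Context`, `Protocol`, and `HttpHeaders` follow the Redfish `EventDestination` resource. The function names, the one-key-per-header Secret layout, and the URL check are assumptions made for this example, not part of the design.

```python
import base64
from typing import Dict, List, Optional


def decode_secret_headers(secret_data: Dict[str, str]) -> List[Dict[str, str]]:
    """Decode a Secret's base64-encoded `data` map into header objects.

    Assumption: each key in the Secret is a header name and each value is
    the base64-encoded header value.
    """
    headers = []
    for name in sorted(secret_data):
        value = base64.b64decode(secret_data[name]).decode("utf-8")
        headers.append({name: value})
    return headers


def build_subscription_request(
    destination: str,
    context: str = "",
    secret_data: Optional[Dict[str, str]] = None,
) -> Dict[str, object]:
    """Build a request body for creating an event subscription (hypothetical helper)."""
    if not destination.startswith(("http://", "https://")):
        raise ValueError("destination must be an http(s) URI")
    body: Dict[str, object] = {"Destination": destination, "Protocol": "Redfish"}
    if context:
        body["Context"] = context
    if secret_data:
        body["HttpHeaders"] = decode_secret_headers(secret_data)
    return body
```

Keeping header values in a Secret rather than in the spec means credentials such as basic auth never appear in the BMCEventSubscription resource itself.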

## Design Details

### Implementation Details

```yaml
apiVersion: metal3.io/v1alpha1
kind: BMCEventSubscription
metadata:
  name: worker-1-events
spec:
  hostName: ostest-worker-1
  destination: https://events.apps.corp.example.com/webhook
  context: "SomeUserContext"
  httpHeadersRef:
    name: some-secret-name
    namespace: some-namespace
status:
  errorMessage: ""
  subscriptionID: aa618a32-9335-42bc-a04b-20ddeed13ade
```

> Review thread at `context: "SomeUserContext"`:
>
> Member: What about adding an enabled field to turn on/off the subscription?
>
> Member (Author): The API is fairly simple, is there a reason to not just delete the CR if one isn't interested in an event anymore?
>
> Member: I was thinking of something more like a "pause", but I'd guess that CR deletion could be good enough to start.

> Review thread at `subscriptionID`:
>
> Member: This would be the ID that the BMC will give when we create a subscription? If yes, this is not required to be a UUID (Dell uses UUID, HP uses numbers).
>
> Member: I think this is an arbitrary string. Is that sufficient?
>
> Member: @honza oh ok! I was wondering if we could have some sort of validation to make sure it was a UUID; if it is just a string, it's ok. Thanks!

- A BMCEventSubscription resource represents a subscription to the events generated
by a specific BMC.
- Ironic will manage configuring the subscription using a vendor passthru API.
- The BMCEventSubscription will maintain a reference to a BareMetalHost.
- The BMCEventSubscription will allow injection of headers using a
reference to a secret, for example to provide basic auth credentials.
- The BMCEventSubscription will reside in the same namespace as the referenced
BareMetalHost.
- The BMCEventSubscription will maintain a reference to the subscription ID
obtained from the BMC.
- The baremetal-operator binary will be expanded to include an additional
reconciler with a dedicated controller/reconcile loop for
BMCEventSubscriptions.
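
The reconcile behavior described in the list above could be sketched roughly as follows. This is an in-memory model only: `Subscription`, `FakeIronic`, and `reconcile` are hypothetical names, the real controller is part of the baremetal-operator, and recording an error when the referenced host is unknown is an assumption, since the design does not specify that case.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Subscription:
    """In-memory stand-in for a BMCEventSubscription resource."""
    host_name: str
    destination: str
    context: str = ""
    subscription_id: str = ""  # mirrors status.subscriptionID
    error_message: str = ""    # mirrors status.errorMessage
    deleting: bool = False     # stands in for a pending deletion


class FakeIronic:
    """Hypothetical stand-in for Ironic's subscription calls."""

    def __init__(self, known_hosts: Iterable[str]) -> None:
        self.known_hosts = set(known_hosts)
        self.subscriptions: Dict[str, dict] = {}
        self._counter = 0

    def create_subscription(self, host: str, destination: str, context: str) -> str:
        if host not in self.known_hosts:
            raise LookupError(f"host {host!r} not found")
        self._counter += 1
        sub_id = f"sub-{self._counter}"  # IDs are arbitrary strings, not always UUIDs
        self.subscriptions[sub_id] = {
            "host": host,
            "destination": destination,
            "context": context,
        }
        return sub_id

    def delete_subscription(self, sub_id: str) -> None:
        self.subscriptions.pop(sub_id, None)


def reconcile(sub: Subscription, ironic: FakeIronic) -> str:
    """Bring the BMC-side subscription in line with the resource."""
    if sub.deleting:
        if sub.subscription_id:
            ironic.delete_subscription(sub.subscription_id)
            sub.subscription_id = ""
        return "deleted"
    if sub.subscription_id:
        return "unchanged"  # already created; nothing to do
    try:
        sub.subscription_id = ironic.create_subscription(
            sub.host_name, sub.destination, sub.context
        )
    except LookupError as exc:
        sub.error_message = str(exc)
        return "error"
    sub.error_message = ""
    return "created"
```

Storing the subscription ID in the status is what makes the loop idempotent: a second pass over an already-created subscription does nothing, and deletion knows which BMC-side subscription to remove.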

> Review comment (Member): Is it possible to expand this section to define more precisely the expected behavior in case of host deletion? If the subscription resource remains alive, what will happen when/if the host comes back? What status will be shown to the user in the meantime?

### Open Questions

### Risks and Mitigations

#### Thundering herd

Large numbers of events across large numbers of BareMetalHosts could generate a
lot of traffic. Users can control how many events their webhook receives by
configuring the alert thresholds out of band.

### Dependencies

### Test Plan

There is some existing POC code for working with Redfish Events; we
could build on this to implement a test framework for BMC events. We could
also consider modifying sushy-tools to support emulated eventing.
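
One possible building block for such a test framework is a small receiver that records whatever is POSTed to the destination URI, so a test can assert on the body and on the injected headers. The sketch below uses only the Python standard library. `EventRecorder`, `start_recorder`, and `post_event` are hypothetical names, and the posted body is a made-up payload rather than a complete Redfish event.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EventRecorder(BaseHTTPRequestHandler):
    """Records every POSTed JSON body and its headers for later assertions."""

    received = []  # class-level list shared across requests

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        headers = {k.lower(): v for k, v in self.headers.items()}
        self.received.append({"headers": headers, "body": payload})
        self.send_response(204)
        self.end_headers()

    def log_message(self, fmt, *args):
        pass  # keep test output quiet


def start_recorder() -> HTTPServer:
    """Start the recorder on a free local port in a background thread."""
    server = HTTPServer(("127.0.0.1", 0), EventRecorder)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def post_event(url: str, event: dict, headers=None) -> int:
    """Send an event the way an emulated BMC might; returns the HTTP status."""
    all_headers = {"Content-Type": "application/json"}
    all_headers.update(headers or {})
    request = urllib.request.Request(
        url, data=json.dumps(event).encode("utf-8"), method="POST", headers=all_headers
    )
    # Bypass any proxy configured in the environment for the local connection.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=5) as response:
        return response.status
```

In a fuller setup, `post_event` would be replaced by an emulated BMC sending events to the recorder's address.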

### Upgrade / Downgrade Strategy

Not required, as this is a new API being introduced.

### Alternatives

#### Configurable events and thresholds

This API is for subscribing to events of a pre-defined type. In cases
where no particular type is available, users would need to configure it
out-of-band. For example, one may want to have a TemperatureOver40C
alert that monitors the enclosure's temperature.

The Redfish standard itself does not seem to have a way to define
custom alerts and thresholds. For example, to receive an alert when
the temperature exceeds 40C, one would need to configure this manually
according to the vendor's recommendations.

Vendors do provide vendor-specific ways to configure these
thresholds, but they are hard to abstract into a neutral interface. For
example, here is a [Dell example for temperature](https://www.dell.com/support/manuals/en-jm/idrac9-lifecycle-controller-v4.x-series/idrac9_4.00.00.00_redfishapiguide_pub/temperature?guid=guid-5a798111-407b-485d-b6fb-7d6e367d4ad4&lang=en-us).

In the short term, Ironic has no plans to abstract the various vendor
implementations (if they exist at all).

## References

- [Ironic Vendor Passthru for Subscriptions](https://storyboard.openstack.org/#!/story/2009061)
- [Supermicro Redfish Guide](https://www.supermicro.com/manuals/other/RedfishRefGuide.pdf)
- [DMTF: Redfish Eventing](https://www.dmtf.org/sites/default/files/Redfish%20School%20-%20Events.pdf)
- [Redfish Event Controller (POC)](https://github.com/dhellmann/redfish-event-controller)
- [Redfish Event Experiment (POC)](https://github.com/dhellmann/redfish-event-experiment)