DM-35878: escalate alarms to OpsGenie #52

Merged
merged 4 commits into from Aug 16, 2022
3 changes: 2 additions & 1 deletion bin/run_watcher
@@ -1,4 +1,5 @@
#!/usr/bin/env python

# This file is part of ts_watcher.
#
# Developed for Vera C. Rubin Observatory Telescope and Site Systems.
@@ -16,7 +17,7 @@
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#

from lsst.ts.watcher import run_watcher


71 changes: 71 additions & 0 deletions doc/developer_guide.rst
@@ -0,0 +1,71 @@
.. py:currentmodule:: lsst.ts.watcher

.. _lsst.ts.watcher.developer_guide:

###############
Developer Guide
###############

The Watcher CSC is implemented using `ts_salobj <https://github.com/lsst-ts/ts_salobj>`_.

The fundamental objects that make up the Watcher are rules and alarms.
There is a one-to-one relationship between these: every rule contains exactly one associated alarm.
It is the logic in a rule that determines the state of its alarm.

Each rule monitors messages from remote SAL components (or, potentially, other sources).
Based on those messages, the rule's logic sets the severity of the associated alarm.
Rules are instances of *subclasses* of `BaseRule`.
There are many such subclasses.

Each alarm contains state, including the current severity, whether the alarm has been acknowledged, and the maximum severity seen since last acknowledgement.
Alarms are instances of `Alarm`.

Other Classes
=============

`Model` manages all the rules that are in use.
It is the model that uses the watcher configuration to construct rules, construct salobj remotes and topics and wire everything together.
The model also disables rules when the Watcher CSC is not in the ENABLED state.

In order to reduce resource usage, remotes (instances of `lsst.ts.salobj.Remote`) and topics (instances of `lsst.ts.salobj.topics.ReadTopic`) are only constructed if a rule that is in use needs them.
Also remotes and topics are shared, so if more than one rule needs a given Remote, only one is constructed.
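
The sharing described above can be pictured as a cache keyed by SAL component name and index.
The following is a minimal sketch, not the actual `Model` code: ``FakeRemote`` and ``RemoteCache`` are hypothetical names, and ``FakeRemote`` merely stands in for a real salobj remote.

```python
from dataclasses import dataclass, field


@dataclass
class FakeRemote:
    """Hypothetical stand-in for a salobj remote; it only records its identity."""

    name: str
    index: int


@dataclass
class RemoteCache:
    """Hypothetical cache that constructs each remote at most once."""

    remotes: dict = field(default_factory=dict)

    def get(self, name: str, index: int = 0) -> FakeRemote:
        key = (name, index)
        if key not in self.remotes:
            # Construct a remote only the first time some rule asks for it.
            self.remotes[key] = FakeRemote(name=name, index=index)
        return self.remotes[key]


cache = RemoteCache()
remote_for_rule_a = cache.get("ATDome")
remote_for_rule_b = cache.get("ATDome")
```

Both rules receive the same object, and no remote is built for a component that no rule requests.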

Since rules share remotes and topics, the rule's constructor does not construct remotes or topics (which also means that a rule's constructor does not make the rule fully functional).
Instead a rule specifies the remotes and topics it needs by constructing `RemoteInfo` objects, which the `Model` uses to construct the remotes and topics and connect them to the rule.

`TopicCallback` supports calling more than one rule from a topic.
This is needed because a salobj topic can only call back to a single function and we may have more than one rule that wants to be called.
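
The fan-out idea can be sketched as follows.
This is an illustration only, assuming plain synchronous callables; ``FanOutCallback`` is a hypothetical name and is not the real `TopicCallback` API.

```python
class FanOutCallback:
    """Hypothetical single callback that forwards data to several rules."""

    def __init__(self):
        self.rule_callbacks = []

    def add(self, callback):
        """Register one more rule callback."""
        self.rule_callbacks.append(callback)

    def __call__(self, data):
        # The topic calls this one function; it calls every registered rule.
        for callback in self.rule_callbacks:
            callback(data)


seen = []
fan_out = FanOutCallback()
fan_out.add(lambda data: seen.append(("rule_a", data)))
fan_out.add(lambda data: seen.append(("rule_b", data)))
fan_out("sample")
```

The topic only ever sees ``fan_out``, yet both rules are called with each sample, in registration order.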

Rules are isolated from each other in two ways, both of which are implemented by wrapping each remote with multiple instances of `RemoteWrapper`, one instance per rule that uses the remote:

* A rule can only see the topics that it specifies it wants.
  This eliminates a source of surprising errors: if rule A uses a topic specified only by rule B, then the topic is only available to rule A when rule B is also in use.
* A rule can only see the current value of a topic; it cannot wait on the next value of a topic.
  That prevents one rule from stealing data from another rule.
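
The first kind of isolation can be sketched with a small wrapper that hides every topic a rule did not request.
This is a simplified illustration, not the real `RemoteWrapper`: ``RestrictedRemote`` is a hypothetical name, the fake remote is a plain namespace, and the second kind of isolation (current value only) is not modeled.

```python
from types import SimpleNamespace


class RestrictedRemote:
    """Hypothetical wrapper that exposes only the topics a rule requested."""

    def __init__(self, remote, topic_names):
        self._remote = remote
        self._topic_names = frozenset(topic_names)

    def __getattr__(self, name):
        # Called only for attributes that are not found on the wrapper itself.
        if name not in self._topic_names:
            raise AttributeError(f"topic {name!r} was not requested by this rule")
        return getattr(self._remote, name)


# A fake shared remote with two topics (illustrative values).
shared_remote = SimpleNamespace(evt_summaryState="state", tel_position="position")

# One wrapper per rule, each listing only the topics that rule asked for.
wrapper_a = RestrictedRemote(shared_remote, ["evt_summaryState"])
wrapper_b = RestrictedRemote(shared_remote, ["tel_position"])
```

Rule A fails immediately if it touches ``tel_position``, regardless of whether rule B happens to be loaded.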

Writing Rules
=============

.. toctree::
    :maxdepth: 2

    writing_rules.rst

Contributing
============

``lsst.ts.watcher`` is developed at https://github.com/lsst-ts/ts_watcher.
You can find Jira issues for this module using `labels=ts_watcher <https://jira.lsstcorp.org/issues/?jql=project%3DDM%20AND%20labels%3Dts_watcher>`_.

.. _lsst.ts.watcher-pyapi:

Python API reference
====================

.. automodapi:: lsst.ts.watcher
    :no-main-docstr:
.. automodapi:: lsst.ts.watcher.rules
    :no-main-docstr:
.. automodapi:: lsst.ts.watcher.rules.test
    :no-main-docstr:
    :no-inheritance-diagram:
158 changes: 92 additions & 66 deletions doc/index.rst
@@ -2,38 +2,30 @@

.. _lsst.ts.watcher:

########################
ts_watcher documentation
########################

A CSC which monitors other SAL components and uses the data to generate alarms for display by LOVE.
###############
lsst.ts.Watcher
###############

.. image:: https://img.shields.io/badge/Project Metadata-gray.svg
:target: https://ts-xml.lsst.io/index.html#index-csc-table-watcher
.. image:: https://img.shields.io/badge/SAL\ Interface-gray.svg
:target: https://ts-xml.lsst.io/sal_interfaces/Watcher.html
.. image:: https://img.shields.io/badge/GitHub-gray.svg
:target: https://github.com/lsst-ts/ts_watcher
.. image:: https://img.shields.io/badge/Jira-gray.svg
:target: https://jira.lsstcorp.org/issues/?jql=project%3DDM%20AND%20labels%3Dts_watcher

Overview
========

The Watcher monitors other SAL components and uses the data to generate alarms for display by LOVE.
The point is to provide a simple, uniform interface to handle alarms.

The alarms are generated by rules which are defined in this package.
The alarms are generated by rules, which are defined in this package.
The CSC configuration specifies which of the available rules are used, and the configuration for each rule.

Using lsst.ts.watcher
=====================

The fundamental objects that make up the Watcher are rules and alarms.
Rules monitor topics from remote SAL components and, based on that information, set the severity of alarms.
Each alarm contains state, including the current severity, whether the alarm has been acknowledged, and the maximum severity seen since last acknowledgement.
Rules are instances of *subclasses* of `BaseRule`.
Alarms are instances of `Alarm`.

There is a one to one relationship between rules and alarms: every rule contains one associated alarm.

The set of rules used by the Watcher and the configuration of each rule is specified by the CSC configuration.
The configuration options for each rule are specified by a schema provided by the rule.
A typical Watcher configuration file will specify most available rules, and will likely be large.
The Watcher configuration also has a list of disabled SAL components, for the situation that a subsystem is down for maintenance or repair.
Rules that use a disabled SAL component are not loaded.

.. toctree::
:maxdepth: 2

writing_rules.rst
displaying_alarms.rst
There is a one-to-one relationship between rules and alarms.
Each rule has one alarm, and it is the logic in the rule, operating on the state of the control system, that determines the severity of the alarm.

.. _lsst.ts.watcher.severity_levels:

@@ -54,7 +46,68 @@ Each alarm has two severity fields:
* severity: the current severity, as reported by the rule.
* max_severity: the maximum severity seen since the alarm was last acknowledged.

Keeping track of max_severity makes sure that transient problems are seen and acknowledged.
Keeping track of max_severity makes sure that transient problems are seen, acknowledged, and (if so configured) escalated to OpsGenie.
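
A minimal sketch of why ``max_severity`` matters for transient problems is shown below.
It is not the real `Alarm` class: ``Severity`` and ``SketchAlarm`` are hypothetical, the numeric values and the WARNING and SERIOUS level names are assumptions made for illustration, and acknowledgement handling is omitted.

```python
import enum


class Severity(enum.IntEnum):
    """Hypothetical severity levels, ordered from least to most severe."""

    NONE = 1
    WARNING = 2
    SERIOUS = 3
    CRITICAL = 4


class SketchAlarm:
    """Hypothetical alarm that tracks current and maximum severity."""

    def __init__(self):
        self.severity = Severity.NONE
        self.max_severity = Severity.NONE

    def set_severity(self, severity):
        self.severity = severity
        # max_severity never decreases here, so a brief spike is remembered.
        self.max_severity = max(self.max_severity, severity)


alarm = SketchAlarm()
alarm.set_severity(Severity.CRITICAL)  # transient problem
alarm.set_severity(Severity.NONE)  # problem clears on its own
```

After the problem clears, ``severity`` is back to NONE but ``max_severity`` still records CRITICAL, so the event is not lost.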

User Guide
==========

Start the Watcher CSC as follows:

.. prompt:: bash

    run_watcher

Stop the Watcher by commanding it to the OFFLINE state, using the standard CSC state transition commands.

See Watcher `SAL communication interface <https://ts-xml.lsst.io/sal_interfaces/Watcher.html>`_ for commands, events and telemetry.

The configuration of the Watcher specifies:

* Which of the available rules it will run.
* The configuration for each rule.
* Escalation: which rules (if any) will be escalated as OpsGenie alerts if something goes seriously wrong.
* Automatic acknowledgement and unacknowledgement of alarms.

Configuration
-------------

The set of rules used by the Watcher and the configuration of each rule is specified by the CSC configuration.
The configuration options for each rule are specified by a schema provided by the rule.
A typical Watcher configuration file will specify most available rules, and will likely be large.

The Watcher configuration also has a list of disabled SAL components, for the situation that a subsystem is down for maintenance or repair.
Rules that use a disabled SAL component are not loaded.
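
The effect of the disabled list can be sketched as a simple filter.
This is an illustration under stated assumptions, not the Watcher's loading code: ``select_rules`` and the rule names are hypothetical, and each rule is represented only by the names of the SAL components it uses.

```python
def select_rules(rule_remotes, disabled_sal_components):
    """Return the names of rules that use no disabled SAL component.

    rule_remotes maps a (hypothetical) rule name to the SAL components it uses.
    """
    disabled = set(disabled_sal_components)
    return [
        rule_name
        for rule_name, remotes in rule_remotes.items()
        if disabled.isdisjoint(remotes)
    ]


rule_remotes = {
    "rule_dome": ["ATDome"],
    "rule_dome_and_mount": ["ATDome", "MTMount"],
    "rule_weather": ["ESS"],
}
loaded = select_rules(rule_remotes, disabled_sal_components=["MTMount"])
```

With ``MTMount`` disabled, only the rule that depends on it is skipped; the others load as usual.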

Escalation
----------

It is possible to configure rules to escalate to OpsGenie, which can text or phone people who are on call.
See the escalation section of the configuration schema for the format.

A Watcher alarm is escalated by creating an OpsGenie alert if all of the following are true:

* The alarm is configured for escalation: its escalation delay is > 0, at least one escalation responder is specified, and the top-level configuration field ``opsgenie_url`` is not blank.
* The alarm reaches CRITICAL severity (even if only briefly).
* The alarm is not acknowledged before the escalation delay elapses.
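
The three conditions above can be combined into a single predicate.
The sketch below is an assumption-laden illustration, not the Watcher's implementation: ``AlarmSnapshot``, its field names, and ``should_escalate`` are hypothetical, and the URL is a placeholder.

```python
from dataclasses import dataclass, replace


@dataclass
class AlarmSnapshot:
    """Hypothetical snapshot of the data needed to decide on escalation."""

    escalation_delay: float  # seconds; 0 means escalation is not configured
    responders: list  # escalation responders from the configuration
    opsgenie_url: str  # top-level configuration field
    reached_critical: bool  # True if severity was CRITICAL, even briefly
    acknowledged: bool
    seconds_since_critical: float


def should_escalate(snapshot):
    """Return True if all three escalation conditions hold."""
    configured = (
        snapshot.escalation_delay > 0
        and len(snapshot.responders) > 0
        and bool(snapshot.opsgenie_url.strip())
    )
    delay_elapsed = snapshot.seconds_since_critical >= snapshot.escalation_delay
    return bool(
        configured
        and snapshot.reached_critical
        and not snapshot.acknowledged
        and delay_elapsed
    )


snapshot = AlarmSnapshot(
    escalation_delay=300,
    responders=["on_call_team"],
    opsgenie_url="https://example.invalid/alerts",  # placeholder URL
    reached_critical=True,
    acknowledged=False,
    seconds_since_critical=301,
)
```

Changing any one input (for example acknowledging the alarm, or blanking the URL) makes the predicate false.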

If OpsGenie accepts the request to create an alert then the Watcher sets the alarm's ``escalated_id`` field to the ID of the OpsGenie alert.
This ID allows you to track the status of the alert in OpsGenie.
If the attempt to create an OpsGenie alert fails, ``escalated_id`` is set to an explanatory message that starts with "Failed: ".
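
The mapping from OpsGenie's initial response to ``escalated_id`` can be sketched as a small function.
This is only an illustration: ``escalated_id_after_request`` and its parameters are hypothetical, and the exact wording of the failure message is an assumption apart from the "Failed: " prefix.

```python
import http


def escalated_id_after_request(status, alert_id="", reason=""):
    """Return the value to store in escalated_id after a create-alert request.

    status is the HTTP status of the initial response, alert_id is the ID
    reported when the request is accepted, and reason describes a failure.
    """
    if status == http.HTTPStatus.ACCEPTED:
        return alert_id
    if not reason:
        reason = f"HTTP status {status}"
    # Failures are reported in the same field, with a recognizable prefix.
    return "Failed: " + reason


accepted = escalated_id_after_request(202, alert_id="abc-123")
rejected = escalated_id_after_request(401, reason="unauthorized")
```

A reader of the alarm can then distinguish the two cases by checking whether the field starts with "Failed: ".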

If an escalated Watcher alarm is acknowledged, the Watcher will try to close the OpsGenie alert, and will always set the alarm's ``escalated_id`` field back to an empty string.
This occurs regardless of the current severity of the alarm.

Subtleties:

* Each escalation configuration may apply to more than one rule.
  However, each rule will have, at most, one escalation configuration: the first match wins.
* If a given rule has no escalation configuration (a very common case) then it will never be escalated.
* Escalation and de-escalation are done on a "best effort" basis.
  The watcher will log a warning if anything goes obviously wrong.
* OpsGenie's API operates in two phases.
  First it responds to a request with 202=ACCEPTED, or an error code if the request is rejected.
  The ACCEPTED message includes an ID you can use to poll OpsGenie to find out if the request eventually succeeds or fails.
  However, the CSC only ever listens for the initial response, because there is nothing much it can do if the request eventually fails.
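
The "first match wins" subtlety above can be sketched as a linear search over the escalation configurations.
This is a hedged illustration, not the Watcher's code: ``find_escalation`` is hypothetical, the ``alarms`` and ``label`` keys are assumed for the example, and glob-style matching with ``fnmatch`` is an assumption about how names are matched.

```python
import fnmatch


def find_escalation(rule_name, escalation_configs):
    """Return the first escalation configuration matching rule_name, or None."""
    for config in escalation_configs:
        patterns = config["alarms"]  # assumed key: list of name patterns
        if any(fnmatch.fnmatchcase(rule_name, pattern) for pattern in patterns):
            # Stop at the first match; later configurations are ignored.
            return config
    return None


escalation_configs = [
    {"label": "dome", "alarms": ["Enabled.ATDome*"]},
    {"label": "everything_enabled", "alarms": ["Enabled.*"]},
]

dome_match = find_escalation("Enabled.ATDome:0", escalation_configs)
mount_match = find_escalation("Enabled.MTMount:0", escalation_configs)
no_match = find_escalation("Heartbeat.ESS:1", escalation_configs)
```

Because the search stops at the first match, the order of the escalation configurations matters when their patterns overlap.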

Auto Acknowledge and Unacknowledge
----------------------------------
@@ -68,47 +121,20 @@ An alarm will be automatically acknowledged only if its current severity stays N
An alarm will be automatically unacknowledged only if the condition does not get worse than the level at which it was acknowledged,
and does not get resolved (go to NONE), during the full ``auto_unacknowledge_delay`` period after being acknowledged.

Other Classes
=============

`Model` manages all the rules that are in use.
It is the model that uses the watcher configuration to construct rules, construct salobj remotes and topics and wire everything together.
The model also disables rules when the Watcher CSC is not in the ENABLED state.
Displaying Alarms
=================

In order to reduce resource usage, remotes (instances of `lsst.ts.salobj.Remote`) and topics (instances of `lsst.ts.salobj.topics.ReadTopic`) are only constructed if a rule that is in use needs them.
Also remotes and topics are shared, so if more than one rule needs them, only one is constructed.

Since rules share remotes and topics, the rule's constructor does not construct remotes or topics (which also means that a rule's constructor does not make the rule fully functional).
Instead a rule specifies the remotes and topics it needs by constructing `RemoteInfo` objects, which the `Model` uses to construct the remotes and topics and connect them to the rule.

`TopicCallback` supports calling more than one rule from a topic.
This is needed because a salobj topic can only call back to a single function and we may have more than one rule that wants to be called.

Rules are isolated from each other in two ways, both of which are implemented by wrapping each remote with multiple instances of `RemoteWrapper`, one instance per rule that uses the remote:

* A rule can only see the topics that it specifies it wants.
This eliminates a source of surprising errors: if rule A uses a topic specified only by rule B, then the topic is only available to rule A when rule B is also in use.
* A rule can only see the current value of a topic; it cannot wait on the next value of a topic.
That prevents one rule from stealing data from another rule.

Contributing
============

``lsst.ts.watcher`` is developed at https://github.com/lsst-ts/ts_watcher.
You can find Jira issues for this module using `labels=ts_watcher <https://jira.lsstcorp.org/issues/?jql=project%20%3D%20DM%20AND%20labels%20%20%3D%20ts_watcher>`_.
.. toctree::
:maxdepth: 2

.. _lsst.ts.watcher-pyapi:
displaying_alarms.rst

Python API reference
====================
Developer Guide
===============

.. automodapi:: lsst.ts.watcher
:no-main-docstr:
.. automodapi:: lsst.ts.watcher.rules
:no-main-docstr:
.. automodapi:: lsst.ts.watcher.rules.test
:no-main-docstr:
:no-inheritance-diagram:
.. toctree::
    :maxdepth: 1

    developer_guide

Version History
===============
32 changes: 32 additions & 0 deletions doc/version_history.rst
@@ -6,6 +6,38 @@
Version History
###############

v1.10.0
-------

Changes:

* Escalate alarms to OpsGenie by using the REST API to create alerts.

* Update the CSC configuration schema to version 3:

* Update ``escalation`` items by replacing the ``to`` field (a string) with ``responders`` (a list of objects).
* Add escalation_url.

* Overhaul escalation-related `Alarm` fields.
It is important to keep track of the ID of escalation alerts.
* Update `Model` to handle the new `Alarm` fields.
* Update `WatcherCsc` to handle the new `Alarm` fields and `Model` changes.
* Add `MockOpsGenie`, a mock OpsGenie service for unit tests.
* Add support for ts_xml 12.1, which has more detailed escalation information in the ``alarm`` event, while retaining backwards compatibility with ts_xml 11.

* Modernize the documentation.
Split the main page into a User Guide (still part of the main page) and a Developer Guide (a separate page).
Add a section on alarm escalation to the User Guide.


Requires:

* ts_utils 1.1
* ts_salobj 7.1
* ts_idl 2
* IDL files for ``Watcher``, ``ATDome``, ``ESS``, ``MTMount``, ``ScriptQueue``, and ``Test``, plus any additional SAL components you wish to watch.
These may be generated using ``make_idl_files.py`` built with ts_xml 11 (preferably 12.1) and ts_sal 7.

v1.9.0
------

1 change: 1 addition & 0 deletions python/lsst/ts/watcher/__init__.py
@@ -38,3 +38,4 @@
from .model import *
from .watcher_csc import *
from .testutils import *
from .mock_opsgenie import *