Update the documentation
* Split into User Guide and Developer Guide
* Add a section on escalation
r-owen committed Aug 16, 2022
1 parent ad2794b commit e1fb230
Showing 2 changed files with 163 additions and 66 deletions.
71 changes: 71 additions & 0 deletions doc/developer_guide.rst
@@ -0,0 +1,71 @@
.. py:currentmodule:: lsst.ts.watcher
.. _lsst.ts.watcher.developer_guide:

###############
Developer Guide
###############

The Watcher CSC is implemented using `ts_salobj <https://github.com/lsst-ts/ts_salobj>`_.

The fundamental objects that make up the Watcher are rules and alarms.
There is a one-to-one relationship between these: every rule contains one associated alarm.
It is the logic in a rule that determines the state of its alarm.

Each rule monitors messages from remote SAL components (or, potentially, other sources).
The rule's logic uses those messages to set the severity of the associated alarm.
Rules are instances of *subclasses* of `BaseRule`.
There are many such subclasses.

Each alarm contains state, including the current severity, whether the alarm has been acknowledged, and the maximum severity seen since last acknowledgement.
Alarms are instances of `Alarm`.
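
The alarm state described above can be illustrated with a minimal sketch. This is not the package's `Alarm` class: ``ToySeverity``, ``ToyAlarm`` and their methods are hypothetical names, and the choice to restart ``max_severity`` from the current severity on acknowledgement is only one reading of "maximum severity seen since last acknowledgement".

```python
import enum
from dataclasses import dataclass


class ToySeverity(enum.IntEnum):
    """Hypothetical severity levels for this sketch only."""

    NONE = 1
    WARNING = 2
    SERIOUS = 3
    CRITICAL = 4


@dataclass
class ToyAlarm:
    """Toy alarm state: current severity, max severity, acknowledged flag."""

    severity: ToySeverity = ToySeverity.NONE
    max_severity: ToySeverity = ToySeverity.NONE
    acknowledged: bool = False

    def set_severity(self, severity: ToySeverity) -> None:
        # The rule reports a new current severity; remember the worst seen.
        self.severity = severity
        if severity > self.max_severity:
            self.max_severity = severity

    def acknowledge(self) -> None:
        # Assumption: tracking of the maximum restarts at the current severity.
        self.acknowledged = True
        self.max_severity = self.severity
```

A transient CRITICAL condition that drops back to WARNING leaves ``max_severity`` at CRITICAL until the alarm is acknowledged, which is why the second field exists.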

Other Classes
=============

`Model` manages all the rules that are in use.
It is the model that uses the watcher configuration to construct rules, construct salobj remotes and topics and wire everything together.
The model also disables rules when the Watcher CSC is not in the ENABLED state.

In order to reduce resource usage, remotes (instances of `lsst.ts.salobj.Remote`) and topics (instances of `lsst.ts.salobj.topics.ReadTopic`) are only constructed if a rule that is in use needs them.
Also remotes and topics are shared, so if more than one rule needs a given Remote, only one is constructed.

Since rules share remotes and topics, the rule's constructor does not construct remotes or topics (which also means that a rule's constructor does not make the rule fully functional).
Instead a rule specifies the remotes and topics it needs by constructing `RemoteInfo` objects, which the `Model` uses to construct the remotes and topics and connect them to the rule.
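
As a rough illustration of sharing, the sketch below merges the requests of several rules so that each remote is represented once, with the union of the topics requested. ``ToyRemoteInfo`` and ``build_shared_remotes`` are hypothetical stand-ins, not the real `RemoteInfo` or `Model` API, and the component and topic names in the usage note are examples only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyRemoteInfo:
    """Hypothetical description of what one rule needs from one remote."""

    name: str  # SAL component name
    index: int  # SAL index (0 if the component is not indexed)
    topic_names: tuple  # names of the topics the rule wants


def build_shared_remotes(remote_infos):
    """Return a dict of (name, index) -> set of topic names.

    Each (name, index) key appears once, however many rules request it,
    which mirrors the idea that only one remote is constructed.
    """
    remotes = {}
    for info in remote_infos:
        topics = remotes.setdefault((info.name, info.index), set())
        topics.update(info.topic_names)
    return remotes
```

For example, if two rules both request component ``"ATDome"`` with different topics, the result has a single ``("ATDome", 0)`` entry that holds both topic names.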

`TopicCallback` supports calling more than one rule from a topic.
This is needed because a salobj topic can only call back to a single function and we may have more than one rule that wants to be called.
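
The fan-out idea can be sketched as a callable object that holds a list of per-rule callbacks. ``ToyTopicCallback`` and ``add_rule_callback`` are hypothetical names chosen for this sketch; the real `TopicCallback` may differ in both interface and behavior.

```python
class ToyTopicCallback:
    """Single callable that forwards each topic sample to several rules."""

    def __init__(self):
        self._rule_callbacks = []

    def add_rule_callback(self, callback):
        # Register one more rule that wants to be called for this topic.
        self._rule_callbacks.append(callback)

    def __call__(self, data):
        # The topic calls this one function; it calls each rule in turn.
        for callback in self._rule_callbacks:
            callback(data)
```

The topic is given the ``ToyTopicCallback`` instance as its only callback, so adding a rule never requires replacing the function registered with the topic.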

Rules are isolated from each other in two ways, both of which are implemented by wrapping each remote with multiple instances of `RemoteWrapper`, one instance per rule that uses the remote:

* A rule can only see the topics that it specifies it wants.
  This eliminates a source of surprising errors: if rule A uses a topic specified only by rule B, then the topic is available to rule A only when rule B is also in use.
* A rule can only see the current value of a topic; it cannot wait on the next value of a topic.
  That prevents one rule from stealing data from another rule.
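
Both kinds of isolation can be sketched with a thin per-rule wrapper around a shared object. ``ToyRemote`` and ``ToyRemoteWrapper`` are hypothetical classes written for this illustration and do not reproduce the `RemoteWrapper` interface.

```python
class ToyRemote:
    """Stand-in for a shared remote: topic name -> most recent value."""

    def __init__(self):
        self.latest = {}


class ToyRemoteWrapper:
    """Per-rule view of a shared remote."""

    def __init__(self, remote, topic_names):
        self._remote = remote
        self._topic_names = frozenset(topic_names)

    def get(self, topic_name):
        # Isolation 1: only topics this rule asked for are visible.
        if topic_name not in self._topic_names:
            raise KeyError(f"topic {topic_name!r} was not requested by this rule")
        # Isolation 2: only the current value is returned; there is no
        # method that waits for, and so consumes, the next sample.
        return self._remote.latest.get(topic_name)
```

Because each rule gets its own wrapper, a rule that forgets to request a topic fails immediately instead of working only when some other rule happens to be configured.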

Writing Rules
=============

.. toctree::
    :maxdepth: 2

    writing_rules.rst

Contributing
============

``lsst.ts.watcher`` is developed at https://github.com/lsst-ts/ts_watcher.
You can find Jira issues for this module using `labels=ts_watcher <https://jira.lsstcorp.org/issues/?jql=project%3DDM%20AND%20labels%3Dts_watcher>`_.

.. _lsst.ts.watcher-pyapi:

Python API reference
====================

.. automodapi:: lsst.ts.watcher
    :no-main-docstr:
.. automodapi:: lsst.ts.watcher.rules
    :no-main-docstr:
.. automodapi:: lsst.ts.watcher.rules.test
    :no-main-docstr:
    :no-inheritance-diagram:
158 changes: 92 additions & 66 deletions doc/index.rst
@@ -2,38 +2,30 @@
.. _lsst.ts.watcher:

########################
ts_watcher documentation
########################

A CSC which monitors other SAL components and uses the data to generate alarms for display by LOVE.
###############
lsst.ts.Watcher
###############

.. image:: https://img.shields.io/badge/Project Metadata-gray.svg
    :target: https://ts-xml.lsst.io/index.html#index-csc-table-watcher
.. image:: https://img.shields.io/badge/SAL\ Interface-gray.svg
    :target: https://ts-xml.lsst.io/sal_interfaces/Watcher.html
.. image:: https://img.shields.io/badge/GitHub-gray.svg
    :target: https://github.com/lsst-ts/ts_watcher
.. image:: https://img.shields.io/badge/Jira-gray.svg
    :target: https://jira.lsstcorp.org/issues/?jql=project%3DDM%20AND%20labels%3Dts_watcher

Overview
========

The Watcher monitors other SAL components and uses the data to generate alarms for display by LOVE.
The point is to provide a simple, uniform interface to handle alarms.

The alarms are generated by rules which are defined in this package.
The alarms are generated by rules, which are defined in this package.
The CSC configuration specifies which of the available rules are used, and the configuration for each rule.

Using lsst.ts.watcher
=====================

The fundamental objects that make up the Watcher are rules and alarms.
Rules monitor topics from remote SAL components and, based on that information, set the severity of alarms.
Alarms contain state, including the current severity, whether the alarm has been acknowledged, and the maximum severity seen since last acknowledgement.
Rules are instances of *subclasses* of `BaseRule`.
Alarms are instances of `Alarm`.

There is a one to one relationship between rules and alarms: every rule contains one associated alarm.

The set of rules used by the Watcher and the configuration of each rule is specified by the CSC configuration.
The configuration options for each rule are specified by a schema provided by the rule.
A typical Watcher configuration file will specify most available rules, and will likely be large.
The Watcher configuration also has a list of disabled SAL components, for the situation that a subsystem is down for maintenance or repair.
Rules that use a disabled SAL component are not loaded.

.. toctree::
:maxdepth: 2

writing_rules.rst
displaying_alarms.rst
There is a one to one relationship between rules and alarms.
Each rule has one alarm, and it is the logic in the rule, operating on the state of the control system, that determines the severity of the alarm.

.. _lsst.ts.watcher.severity_levels:

@@ -54,7 +46,68 @@ Each alarm has two severity fields:
* severity: the current severity, as reported by the rule.
* max_severity: the maximum severity seen since the alarm was last acknowledged.

Keeping track of max_severity makes sure that transient problems are seen and acknowledged.
Keeping track of max_severity makes sure that transient problems are seen, acknowledged, and (if so configured) escalated to OpsGenie.

User Guide
==========

Start the Watcher CSC as follows:

.. prompt:: bash

    run_watcher

Stop the watcher by commanding it to the OFFLINE state, using the standard CSC state transition commands.

See Watcher `SAL communication interface <https://ts-xml.lsst.io/sal_interfaces/Watcher.html>`_ for commands, events and telemetry.

The configuration of the Watcher specifies:

* Which of the available rules it will run.
* The configuration for each rule.
* Escalation: which rules (if any) will be escalated as OpsGenie alerts if something goes seriously wrong.
* Automatic acknowledgement and unacknowledgement of alarms.

Configuration
-------------

The set of rules used by the Watcher and the configuration of each rule is specified by the CSC configuration.
The configuration options for each rule are specified by a schema provided by the rule.
A typical Watcher configuration file will specify most available rules, and will likely be large.

The Watcher configuration also has a list of disabled SAL components, for the situation that a subsystem is down for maintenance or repair.
Rules that use a disabled SAL component are not loaded.

Escalation
----------

It is possible to configure rules to escalate to OpsGenie, which can text or phone people who are on call.
See the escalation section of the configuration schema for the format.

A Watcher alarm is escalated by creating an OpsGenie alert if all of the following are true:

* The alarm is configured for escalation, meaning escalation delay > 0 and at least one escalation responder is specified, and the top-level configuration field ``opsgenie_url`` is not blank.
* The alarm reaches CRITICAL severity (even if only briefly).
* The alarm is not acknowledged before the escalation delay elapses.
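
The three conditions above can be expressed as a single predicate. The sketch below is illustrative only: ``ToyEscalationState``, its field names (apart from ``opsgenie_url``) and the numeric value of ``CRITICAL`` are assumptions, and the real Watcher schedules escalation with a timer instead of evaluating a function like this.

```python
from dataclasses import dataclass

CRITICAL = 4  # hypothetical numeric severity used only in this sketch


@dataclass
class ToyEscalationState:
    """Hypothetical snapshot of the inputs to the escalation decision."""

    escalation_delay: float  # seconds; 0 means escalation is not configured
    escalation_responders: list  # who to notify; empty means not configured
    opsgenie_url: str  # top-level configuration field; blank disables
    max_severity: int  # worst severity seen since last acknowledgement
    acknowledged: bool
    seconds_since_critical: float  # time since the alarm first went CRITICAL


def should_escalate(state: ToyEscalationState) -> bool:
    configured = (
        state.escalation_delay > 0
        and len(state.escalation_responders) > 0
        and state.opsgenie_url.strip() != ""
    )
    # max_severity is used so that a brief CRITICAL excursion still counts.
    reached_critical = state.max_severity >= CRITICAL
    delay_elapsed = state.seconds_since_critical >= state.escalation_delay
    return (
        configured
        and reached_critical
        and not state.acknowledged
        and delay_elapsed
    )
```

Acknowledging the alarm before the delay elapses makes the predicate false, which matches the third condition in the list.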

If OpsGenie accepts the request to create an alert then the Watcher sets the alarm's ``escalated_id`` field to the ID of the OpsGenie alert.
This ID allows you to track the status of the alert in OpsGenie.
If the attempt to create an OpsGenie alert fails, ``escalated_id`` is set to an explanatory message that starts with "Failed: ".

If an escalated Watcher alarm is acknowledged, the Watcher will try to close the OpsGenie alert, and will always set the alarm's ``escalated_id`` field back to an empty string.
This occurs regardless of the current severity of the alarm.

Subtleties:

* Each escalation configuration may apply to more than one rule.
  However, each rule will have, at most, one escalation configuration: the first match wins.
* If a given rule has no escalation configuration (a very common case) then it will never be escalated.
* Escalation and de-escalation are done on a "best effort" basis.
  The watcher will log a warning if anything goes obviously wrong.
* OpsGenie's API operates in two phases.
  First it responds to a request with 202=ACCEPTED, or an error code if the request is rejected.
  The ACCEPTED message includes an ID you can use to poll OpsGenie to find out if the request eventually succeeds or fails.
  However, the CSC only ever listens for the initial response, because there is nothing much it can do if the request eventually fails.

Auto Acknowledge and Unacknowledge
----------------------------------
@@ -68,47 +121,20 @@ An alarm will be automatically acknowledged only if its current severity stays N
An alarm will be automatically unacknowledged only if the condition does not get worse than the level at which it was acknowledged,
and does not get resolved (go to NONE), during the full ``auto_unacknowledge_delay`` period after being acknowledged.

Other Classes
=============

`Model` manages all the rules that are in use.
It is the model that uses the watcher configuration to construct rules, construct salobj remotes and topics and wire everything together.
The model also disables rules when the Watcher CSC is not in the ENABLED state.
Displaying Alarms
=================

In order to reduce resource usage, remotes (instances of `lsst.ts.salobj.Remote`) and topics (instances of `lsst.ts.salobj.topics.ReadTopic`) are only constructed if a rule that is in use needs them.
Also remotes and topics are shared, so if more than one rule needs them, only one is constructed.

Since rules share remotes and topics, the rule's constructor does not construct remotes or topics (which also means that a rule's constructor does not make the rule fully functional).
Instead a rule specifies the remotes and topics it needs by constructing `RemoteInfo` objects, which the `Model` uses to construct the remotes and topics and connect them to the rule.

`TopicCallback` supports calling more than one rule from a topic.
This is needed because a salobj topic can only call back to a single function and we may have more than one rule that wants to be called.

Rules are isolated from each other in two ways, both of which are implemented by wrapping each remote with multiple instances of `RemoteWrapper`, one instance per rule that uses the remote:

* A rule can only see the topics that it specifies it wants.
This eliminates a source of surprising errors: if rule A uses a topic specified only by rule B, then the topic is available to rule A only when rule B is also in use.
* A rule can only see the current value of a topic; it cannot wait on the next value of a topic.
That prevents one rule from stealing data from another rule.

Contributing
============

``lsst.ts.watcher`` is developed at https://github.com/lsst-ts/ts_watcher.
You can find Jira issues for this module using `labels=ts_watcher <https://jira.lsstcorp.org/issues/?jql=project%20%3D%20DM%20AND%20labels%20%20%3D%20ts_watcher>`_.
.. toctree::
:maxdepth: 2

.. _lsst.ts.watcher-pyapi:
displaying_alarms.rst

Python API reference
====================
Developer Guide
===============

.. automodapi:: lsst.ts.watcher
:no-main-docstr:
.. automodapi:: lsst.ts.watcher.rules
:no-main-docstr:
.. automodapi:: lsst.ts.watcher.rules.test
:no-main-docstr:
:no-inheritance-diagram:
.. toctree::
    :maxdepth: 1

    developer_guide

Version History
===============