Clarify metrics monotonicity (#1995)

* Add supplementary doc about monotonicity
* rewrap
* reword the doc based on feedback
* explain how monotonicity could help reset detection
* improve the flow
* improve example
* minor fix

Co-authored-by: Joshua MacDonald <jmacd@users.noreply.github.com>
open-telemetry · Oct 13, 2021 · commit 90b6df7
1 parent a1a8676

Showing 2 changed files with 80 additions and 5 deletions.
diff --git a/specification/metrics/api.md b/specification/metrics/api.md
@@ -700,8 +700,8 @@ operation is provided by the `callback`, which is registered during the
 `UpDownCounter` is a [synchronous Instrument](#synchronous-instrument) which
 supports increments and decrements.
 
-Note: if the value grows
-[monotonically](https://wikipedia.org/wiki/Monotonic_function), use
+Note: if the value is
+[monotonically](https://wikipedia.org/wiki/Monotonic_function) increasing, use
 [Counter](#counter) instead.
 
 Example uses for `UpDownCounter`:
@@ -844,8 +844,8 @@ process heap size - it makes sense to report the heap size from multiple
 processes and sum them up, so we get the total heap usage_) when the instrument
 is being observed.
 
-Note: if the value grows
-[monotonically](https://wikipedia.org/wiki/Monotonic_function), use
+Note: if the value is
+[monotonically](https://wikipedia.org/wiki/Monotonic_function) increasing, use
 [Asynchronous Counter](#asynchronous-counter) instead; if the value is
 non-additive, use [Asynchronous Gauge](#asynchronous-gauge) instead.
 
@@ -886,7 +886,7 @@ The `callback` function is responsible for reporting the
 observed. [OpenTelemetry API](../overview.md#api) authors SHOULD define whether
 this callback function needs to be reentrant safe / thread safe or not.
 
-Note: Unlike [UpDownCounter.Add()](#add) which takes the increment/delta value,
+Note: Unlike [UpDownCounter.Add()](#add-1) which takes the increment/delta value,
 the callback function reports the absolute value of the Asynchronous
 UpDownCounter. To determine the reported rate the Asynchronous UpDownCounter is
 changing, the difference between successive measurements is used.
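
The note in the hunk above says that the change of an Asynchronous UpDownCounter is derived from the difference between successive measurements. A minimal Python sketch of that arithmetic follows. The function name `successive_changes` and the sample heap sizes are hypothetical and are not part of the OpenTelemetry API or SDK.

```python
from typing import List, Tuple

# Hypothetical helper, not an OpenTelemetry API: each observation is
# (timestamp_in_seconds, absolute_value) as reported by the callback.
Observation = Tuple[float, int]


def successive_changes(observations: List[Observation]) -> List[Tuple[int, float]]:
    """Return (delta, rate_per_second) for each pair of adjacent observations."""
    changes = []
    for (t_prev, v_prev), (t_curr, v_curr) in zip(observations, observations[1:]):
        delta = v_curr - v_prev            # may be negative for an UpDownCounter
        rate = delta / (t_curr - t_prev)   # assumes strictly increasing timestamps
        changes.append((delta, rate))
    return changes


# Assumed sample data: heap size observed every 10 seconds.
heap_sizes = [(0.0, 100), (10.0, 150), (20.0, 120)]
print(successive_changes(heap_sizes))
```

Because the instrument is not monotonic, a negative delta here is a valid measurement and should not be read as a reset.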

diff --git a/specification/metrics/supplementary-guidelines.md b/specification/metrics/supplementary-guidelines.md
@@ -9,6 +9,8 @@ Table of Contents:
 * [Guidelines for instrumentation library
   authors](#guidelines-for-instrumentation-library-authors)
   * [Instrument selection](#instrument-selection)
+  * [Additive property](#additive-property)
+  * [Monotonicity property](#monotonicity-property)
   * [Semantic convention](#semantic-convention)
 * [Guidelines for SDK authors](#guidelines-for-sdk-authors)
   * [Aggregation temporality](#aggregation-temporality)
@@ -62,6 +64,79 @@ Here is one way of choosing the correct instrument:
     * If the value is NOT monotonically increasing - use an [Asynchronous
       UpDownCounter](./api.md#asynchronous-updowncounter).
 
+### Additive property
+
+### Monotonicity property
+
+In the OpenTelemetry Metrics [Data Model](./datamodel.md) and [API](./api.md)
+specifications, the word `monotonic` has been used frequently.
+
+It is important to understand that different
+[Instruments](#instrument-selection) handle monotonicity differently.
+
+Let's take an example with a network driver using a [Counter](./api.md#counter)
+to record the total number of bytes received:
+
+* During the time range (T<sub>0</sub>, T<sub>1</sub>]:
+  * no network packet has been received
+* During the time range (T<sub>1</sub>, T<sub>2</sub>]:
+  * received a packet with `30` bytes - `Counter.Add(30)`
+  * received a packet with `200` bytes - `Counter.Add(200)`
+  * received a packet with `50` bytes - `Counter.Add(50)`
+* During the time range (T<sub>2</sub>, T<sub>3</sub>]
+  * received a packet with `100` bytes - `Counter.Add(100)`
+
+You can see that the total increment during (T<sub>0</sub>, T<sub>1</sub>] is
+`0`, the total increment during (T<sub>1</sub>, T<sub>2</sub>] is `280` (`30 +
+200 + 50`), the total increment during (T<sub>2</sub>, T<sub>3</sub>] is `100`,
+and the total increment during (T<sub>0</sub>, T<sub>3</sub>] is `380` (`0 +
+280 + 100`). All the increments are non-negative, in other words, the **sum is
+monotonically increasing**.
+
+Note that it is inaccurate to say "the total bytes received by T<sub>3</sub> is
+`380`", because there might be network packets received by the driver before we
+started to observe it (e.g. before the last operating system reboot). The
+accurate way is to say "the total bytes received during (T<sub>0</sub>,
+T<sub>3</sub>] is `380`". In a nutshell, the count represents a **rate** which
+is associated with a time range.
+
+This monotonicity property is important because it gives the downstream systems
+additional hints so they can handle the data in a better way. Imagine we report
+the total number of bytes received in a cumulative sum data stream:
+
+* At T<sub>n</sub>, we reported `3,896,473,820`.
+* At T<sub>n+1</sub>, we reported `4,294,967,293`.
+* At T<sub>n+2</sub>, we reported `1,800,372`.
+
+The backend system could tell that there was an integer overflow or a system
+restart during (T<sub>n+1</sub>, T<sub>n+2</sub>], so it has a chance to "fix"
+the data.
+
+Let's take another example with a process using an [Asynchronous
+Counter](./api.md#asynchronous-counter) to report the total page faults of the
+process:
+
+The page faults are managed by the operating system, and the process could
+retrieve the number of page faults via some system APIs.
+
+* At T<sub>0</sub>:
+  * the process started
+  * the process didn't ask the operating system to report the page faults
+* At T<sub>1</sub>:
+  * the operating system reported with `1000` page faults for the process
+* At T<sub>2</sub>:
+  * the process didn't ask the operating system to report the page faults
+* At T<sub>3</sub>:
+  * the operating system reported with `1050` page faults for the process
+* At T<sub>4</sub>:
+  * the operating system reported with `1200` page faults for the process
+
+You can see that the number being reported is the absolute value rather than
+increments, and the value is monotonically increasing.
+
+If we need to calculate "how many page faults have been introduced during
+(T<sub>3</sub>, T<sub>4</sub>]", we need to apply subtraction `1200 - 1050 =
+150`.
+
 ### Semantic convention
 
 Once you decided [which instrument(s) to be used](#instrument-selection), you
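
The two examples in the added Monotonicity section (spotting a reset in a cumulative sum, and subtracting successive Asynchronous Counter values) can be sketched in Python as follows. The helper `monotonic_deltas` is hypothetical. Treating any drop as a restart from zero is an assumption made for illustration; a real backend may handle integer overflow or restarts differently.

```python
from typing import List, Tuple


def monotonic_deltas(cumulative: List[int]) -> List[Tuple[int, bool]]:
    """Convert cumulative values of a monotonic sum into per-interval deltas.

    Returns (delta, reset_detected) for each pair of adjacent points.
    Assumption: when a value is lower than its predecessor, the stream
    restarted from zero, so the new value itself is used as the delta.
    """
    deltas = []
    for prev, curr in zip(cumulative, cumulative[1:]):
        if curr < prev:
            # A monotonic sum never decreases, so a drop signals a reset
            # (overflow or restart) rather than a real negative change.
            deltas.append((curr, True))
        else:
            deltas.append((curr - prev, False))
    return deltas


# Bytes received, values taken from the cumulative sum example.
print(monotonic_deltas([3_896_473_820, 4_294_967_293, 1_800_372]))

# Page faults reported at T1, T3 and T4 in the Asynchronous Counter example.
print(monotonic_deltas([1000, 1050, 1200]))
```

For the page fault series, the last delta is `1200 - 1050 = 150`, which matches the subtraction shown in the section.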