Metrics prototype scenario #146 - Merged (20 commits, Feb 25, 2021)

New file: text/metrics/0146-metrics-prototype-scenarios.md (+220 lines)

# Scenarios for Metrics API/SDK Prototyping

With the stable release of the tracing specification, the OpenTelemetry
community is ready to spend more energy on the metrics API/SDK. The goal is to
get the metrics API/SDK specification to the
[`Experimental`](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/versioning-and-stability.md#experimental)
state by the end of 5/2021, and make it
[`Stable`](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/versioning-and-stability.md#stable)
before the end of 2021:

* By the end of 5/2021, we should have good confidence that we can recommend
language client owners to work on a metrics preview release. This means that
starting from 7/1/2021 the specification should not have major surprises or big
changes which would disrupt the work of the language client maintainers.

* By the end of 9/2021, we should mark the metrics API/SDK specification as
[`Feature-freeze`](https://github.com/open-telemetry/opentelemetry-specification/blob/1afab39e5658f807315abf2f3256809293bfd421/specification/document-status.md#feature-freeze),
and focus on bug fixes and editorial changes.

* By the end of 2021, we want to have a stable release of the metrics API/SDK
specification, with multiple language SIGs providing RC (release candidate) or
[stable](https://github.com/open-telemetry/opentelemetry-specification/blob/9047c91412d3d4b7f28b0f7346d8c5034b509849/specification/versioning-and-stability.md#stable)
clients.

In this document, we will focus on two scenarios that we use for prototyping
the metrics API/SDK. The goal is to have two scenarios which clearly capture
the major requirements, so we can work with language client SIGs to prototype,
gather the learnings, and determine the scope and stages. Later, the scenarios
can be used as examples and test cases for all the language clients.

Here are the languages we've agreed to use during the prototyping:

* C#
* Java
* Python

Instead of boiling the ocean, we will need to divide the work into multiple
stages:

1. Do an end-to-end prototype to get an overall understanding of the problem
   domain. We should also clarify the scope and be able to articulate it
   precisely during this stage. Here are some examples:

   * Why do we want to introduce a brand new metrics API versus taking a well
     established API (e.g. Prometheus or Micrometer)? What makes the
     OpenTelemetry metrics API different (e.g. Baggage)?
   * Do we need to consider an OpenCensus Stats API shim, or is this out of
     scope?

   > **Review comment (Contributor):** This point (and the next point) begs
   > the question: should we make ONE of the use cases be "hide OTel behind
   > another library to help users take advantage of OTel
   > telemetry-unification concepts, like Baggage + Resource"? This could be
   > Micrometer, OpenCensus, whatever.

2. Focus on a core subset of the API, covering the end-to-end library
   instrumentation scenario. At this stage we don't expect to cover all the
   APIs, as some of them might be very similar (e.g. if we know how to record
   an integer, we don't have to work on float/double, as we can add them later
   by replicating what we've done for integer).

3. Focus on a core subset of the SDK. This would help us get to an end-to-end
   application.

4. Replicate stage 2 to cover the complete set of APIs.

5. Replicate stage 3 to cover the complete set of SDK features.

## Scenario 1: Grocery

The **Grocery** scenario covers how a developer could use metrics API and SDK in
a final application. It is a self-contained application which covers:

* How to instrument the code in a vendor agnostic way
* How to configure the SDK and exporter

Considering there might be multiple grocery stores, the metrics we collect will
have the store name as a dimension - which is fairly static (not changing while
the store is running).

The store has a plentiful supply of potatoes and tomatoes, with the following prices:

* Potato: $1.00 / ea
* Tomato: $3.00 / ea

Each customer has a unique name (e.g. customerA, customerB), and a customer
could come to the store multiple times. Here is the Python snippet:

```python
store = GroceryStore("Portland")
store.process_order("customerA", {"potato": 2, "tomato": 3})
store.process_order("customerB", {"tomato": 10})
store.process_order("customerC", {"potato": 2})
store.process_order("customerA", {"tomato": 1})
```

When the store is closed, we will report the following metrics:

> **Review comment:** Do we think this type of offline historical reporting is
> a good primary use case for the metrics API? Although I can envision a
> metrics API doing it, I'd guess it is a better fit for a standard
> transactional database, where there are stronger guarantees about data
> consistency and richer data, but potentially worse performance/availability.
> I think of metrics as being focused on high availability and low latency,
> which is more oriented towards diagnostics/live monitoring/alerting, where
> the grocery would be looking for signs like:
>
> 1. Is there an unexpected change in the rate of sales, suggesting an unknown
>    incident may be occurring at the store?
> 2. Is inventory getting unexpectedly low, so we need to dispatch an urgent
>    delivery from the warehouse?
> 3. Is there a sudden spike in demand for a product, so we need to consider
>    rationing or price changes?
>
> Of course, if I am looking at it with too narrow a lens, then this example
> might be accomplishing exactly what it intends: expanding my understanding
> of what scenarios a metrics API is intended to support.

> **Review comment (Contributor):** I see (1) and (2) as good use cases for
> monitoring using a metrics API, but maybe not (3). Although the example
> feels like it fell out of a textbook, you could re-imagine the store as a
> message-queue consumer processing orders in a horizontally scalable store.
> Can we ask another form of query: "how many stores were in operation at a
> given time?"

### Order info

| Store | Customer | Number of Orders | Amount (USD) |
| -------- | --------- | ---------------- | ------------ |
| Portland | customerA | 2 | 14.00 |
| Portland | customerB | 1 | 30.00 |
| Portland | customerC | 1 | 2.00 |

### Items sold

| Store | Customer | Item | Count |
| -------- | --------- | ------ | ----- |
| Portland | customerA | potato | 2 |
| Portland | customerA | tomato | 4 |
| Portland | customerB | tomato | 10 |
| Portland | customerC | potato | 2 |
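
Since the shape of the metrics API/SDK is exactly what this prototype is meant
to discover, here is a minimal, dependency-free sketch showing how the two
tables above can be derived from the order stream. The `GroceryStore` name and
the order data come from the snippet above; the aggregation logic is
illustrative only, not a proposed API.

```python
from collections import defaultdict

# Prices from the scenario: potato $1.00/ea, tomato $3.00/ea.
PRICES = {"potato": 1.00, "tomato": 3.00}

class GroceryStore:
    def __init__(self, name):
        self.name = name
        # (store, customer) -> [number_of_orders, amount_usd]
        self.order_info = defaultdict(lambda: [0, 0.0])
        # (store, customer, item) -> count
        self.items_sold = defaultdict(int)

    def process_order(self, customer, items):
        entry = self.order_info[(self.name, customer)]
        entry[0] += 1
        entry[1] += sum(PRICES[item] * count for item, count in items.items())
        for item, count in items.items():
            self.items_sold[(self.name, customer, item)] += count

store = GroceryStore("Portland")
store.process_order("customerA", {"potato": 2, "tomato": 3})
store.process_order("customerB", {"tomato": 10})
store.process_order("customerC", {"potato": 2})
store.process_order("customerA", {"tomato": 1})

# Reproduces the "Order info" table: customerA -> 2 orders, $14.00, etc.
for (s, c), (n, amount) in sorted(store.order_info.items()):
    print(f"{s} | {c} | {n} | {amount:.2f}")
# Reproduces the "Items sold" table: customerA -> potato 2, tomato 4, etc.
for (s, c, item), count in sorted(store.items_sold.items()):
    print(f"{s} | {c} | {item} | {count}")
```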

## Scenario 2: HTTP Server

The _HTTP Server_ scenario covers how a library developer X could use the
metrics API to instrument a library, and how the application developer Y can
configure the library to use the OpenTelemetry SDK in a final application. X
and Y work for different companies and they don't communicate. The demo has
two parts - the library (HTTP lib owned by X) and the server app (owned by Y):

* How developer X could instrument the library code in a vendor agnostic way
  * Performance is critical for X
  * X doesn't know which metrics and which dimensions Y will pick
  * X doesn't know the aggregation time window, nor the final destination of
    the metrics
* How developer Y could configure the SDK and exporter (a sketch follows this
  list)
  * How should Y hook up the metrics SDK with the library
  * How should Y configure the time window(s) and destination(s)
  * How should Y pick the metrics and the dimensions
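
To make the X/Y separation concrete, here is a hedged sketch of the
composition: the library records against whatever provider the application
hands it, while the application owns the exporter and the aggregation window.
All class names here (`MeterProvider`, `ConsoleExporter`, `HttpLib`) are
hypothetical placeholders, not the actual OpenTelemetry API surface.

```python
# Hypothetical placeholder types - NOT the real OpenTelemetry surface.
class ConsoleExporter:
    def export(self, batch):
        for record in batch:
            print(record)

class MeterProvider:
    """Owned by application developer Y: picks the exporter and the
    aggregation time window. The library never sees these choices."""
    def __init__(self, exporter, interval_seconds):
        self.exporter = exporter
        self.interval_seconds = interval_seconds

class HttpLib:
    """Owned by library developer X: records measurements against whatever
    provider it is handed, with no dependency on the SDK itself."""
    def __init__(self, meter_provider=None):
        self.meter_provider = meter_provider  # None -> no-op default

# Y's application code: hook the SDK up to X's library.
provider = MeterProvider(ConsoleExporter(), interval_seconds=5)
http_lib = HttpLib(meter_provider=provider)
```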

### Library Requirements

The library developer (developer X) will expose the following metrics out of
the box:

### Pull Metrics

These are pull metrics - the value is always available, and it is only
reported and collected when consumer(s) ask for it. If there is no ask from
the consumer, the value will not be reported at all (e.g. there is no API call
to fetch the room temperature unless someone asks for the room temperature).
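
A hedged sketch of this pull pattern, using a hypothetical `ObservableGauge`
rather than the actual draft API: the instrument holds a callback, never a
value, so nothing is computed unless a consumer drives a collection.

```python
class ObservableGauge:
    """Hypothetical pull-style instrument: holds a callback, never a value."""
    def __init__(self, name, callback):
        self.name = name
        self._callback = callback

    def collect(self):
        # Only a consumer-driven collection cycle invokes the (possibly
        # expensive) callback; no ask means no work and no data.
        return self._callback()

def read_room_temperature_f():
    return 65.3  # placeholder for a real sensor read

gauge = ObservableGauge("server.room.temperature", read_room_temperature_f)
print(gauge.collect())  # the SDK would drive this, e.g. once per minute
```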

#### Process CPU Usage

> **Review comment (Contributor):** It may be useful in your tables to show
> "resource", with sub-tables for the components coming from resource (host
> name, process_id I assume).

> **Review comment (Contributor):** Is "Process CPU Usage" really an HTTP
> server concern? I feel like this may be a general instrumentation concern,
> and the HTTP metrics should really be focused only on things HTTP libraries
> do. CPU consumption is more of a process-wide concern.
>
> I'd suggest using Active HTTP Connections as the pull metric from the HTTP
> library.
>
> For Server Room Temperature, I love what it's trying to do, but I'm having
> trouble buying that it's part of an HTTP library. Maybe move it into its own
> instrumentation component?

> **Review comment (Member, author):** I've removed CPU since it is itself a
> complex topic and not the focus of this OTEP/prototype. I've extracted
> temperature/humidity to a separate lib to make the scenario more
> realistic/reasonable.
>
> For Active HTTP Connections, I don't know if we want to do it by reporting
> "total received - total processed" or doing it differently (see the
> discussion here). Given this discussion might take extra time, probably
> leave it out for now?

> **Review comment (Contributor):** The thing about CPU is that it's a perfect
> example of an asynchronous instrument in the draft API. We want these to be
> recorded through callbacks, because it's expensive. This means the value is
> returned in cumulative form, not in delta form.

Note: the **Host Name** should leverage [`OpenTelemetry
Resource`](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/resource/sdk.md),
so it should be covered by the metrics SDK rather than the API; strictly
speaking it is not considered a "dimension" from the SDK perspective.

> **Review comment (@victlu, Feb 18, 2021):** I will ask about how vendors can
> "enrich" a data point. I assume we have some "Processors" or extension
> points for vendors to enrich or alter data points (what is
> allowed/disallowed is also a discussion).
>
> Separately, it would be good to clearly state who is responsible for
> providing the "Resource" labels. Is it an Exporter, a Processor,
> auto-instrumentation, or something else? Is this data available from the
> beginning, or added at the end of the pipeline?

> **Review comment (Contributor):** IIUC - this is done in Exporters for
> Trace. If a vendor isn't directly supporting Resource (or where their notion
> of Resource doesn't fully align with OTLP), the vendor can lift Resource
> labels onto the trace in the export method. We actually plan to do this for
> instrumentation library (although I don't think we do it today).

> **Review comment (Member, author):** These are more like SDK design and
> implementation questions. Probably not worth putting too much info in this
> doc, since we're trying to cover the scenario/scope here?

| Host Name | Process ID | CPU% [0.0, 100.0] |
| --------- | ---------- | ----------------- |
| MachineA  | 1234       | 15.3              |

#### System CPU Usage

| Host Name | CPU% [0, 100] |
| --------- | ------------- |
| MachineA | 30 |

#### Server Room Temperature

| Host Name | Temperature (F) |
| --------- | --------------- |
| MachineA | 65.3 |

### Push Metrics

These are the push metrics - the value is reported (via the API) when it is
available, and collected (via the SDK) based on the ask from consumer(s). If
there is no ask from the consumer, the API call will be a no-op and the data
will be dropped on the floor.
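
A hedged sketch of this push pattern, again with hypothetical classes rather
than the real API: the library records unconditionally through the API, and
without a configured SDK the call is a no-op, matching the "dropped on the
floor" behavior above.

```python
class NoopCounter:
    """Hypothetical API default: without an SDK, recording drops the data."""
    def add(self, value, **dimensions):
        pass

class SdkCounter:
    """Hypothetical SDK-backed counter: aggregates per dimension set."""
    def __init__(self):
        self._sums = {}

    def add(self, value, **dimensions):
        key = tuple(sorted(dimensions.items()))
        self._sums[key] = self._sums.get(key, 0) + value

# Library code looks identical either way; only the wiring differs.
received_requests = NoopCounter()  # no consumer asked: recording is a no-op
received_requests.add(1, http_method="GET", http_host="otel.org")
```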

#### Received HTTP Requests

Note: the **Client Type** is passed in via the [`OpenTelemetry
Baggage`](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/baggage/api.md);
strictly speaking it is not part of the metrics API, but it is considered a
"dimension" from the metrics SDK perspective.

| Host Name | Process ID | Client Type | HTTP Method | HTTP Host | HTTP Flavor | Peer IP | Peer Port | Host IP | Host Port |
| --------- | ---------- | ----------- | ----------- | --------- | ----------- | --------- | --------- | --------- | --------- |
| MachineA | 1234 | Android | GET | otel.org | 1.1 | 127.0.0.1 | 51327 | 127.0.0.1 | 80 |
| MachineA | 1234 | Android | POST | otel.org | 1.1 | 127.0.0.1 | 51328 | 127.0.0.1 | 80 |
| MachineA | 1234 | iOS | PUT | otel.org | 1.1 | 127.0.0.1 | 51329 | 127.0.0.1 | 80 |
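
A hedged sketch of how the Client Type dimension above might flow from Baggage
into the recorded dimensions. A plain dict stands in for the Baggage API here,
and the counter is reduced to a print; both are illustrative stand-ins, not
the real API.

```python
def record_received_request(baggage, http_method, http_host):
    dimensions = {
        # Not an HTTP property: it rides in on Baggage from the calling
        # context, and only the SDK treats it as a "dimension".
        "client.type": baggage.get("client.type", "unknown"),
        "http.method": http_method,
        "http.host": http_host,
    }
    print("received_http_requests +1", dimensions)  # stand-in for counter.add

record_received_request({"client.type": "Android"}, "GET", "otel.org")
```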

#### HTTP Server Duration

Note: the server duration is only available for **finished HTTP requests**.

| Host Name | Process ID | Client Type | HTTP Method | HTTP Host | HTTP Status Code | HTTP Flavor | Peer IP | Peer Port | Host IP | Host Port | Duration (ms) |
| --------- | ---------- | ----------- | ----------- | --------- | ---------------- | ----------- | --------- | --------- | --------- | --------- | ------------- |
| MachineA | 1234 | Android | GET | otel.org | 200 | 1.1 | 127.0.0.1 | 51327 | 127.0.0.1 | 80 | 8.5 |
| MachineA | 1234 | Android | POST | otel.org | 304 | 1.1 | 127.0.0.1 | 51328 | 127.0.0.1 | 80 | 100.0 |

### Application Requirements

The application owner (developer Y) would only want the following metrics:

* [System CPU Usage](#system-cpu-usage) reported every 5 seconds

  > **Review comment (Member):** Not sure if this is something we want in the
  > initial stage, but I'd like to add a requirement of the same metric being
  > reported with different intervals, potentially with different dimensions.
  > For example, the app owner wants to see the HTTP Server Duration metric
  > exported every 1 second with only the HttpStatusCode dimension, and the
  > HTTP Server Duration metric exported every 30 seconds with dimensions
  > {hostname, HTTP Method, Host, Status Code, Client Type}. The former is
  > typically used for near-real-time dashboards, and the latter for more
  > permanent storage.

  > **Review comment (Member, author):** In your example - I guess normally
  > people would only report the 1 second one from the SDK pre-aggregation,
  > and rely on the metrics backend to aggregate the 30 seconds one (and
  > daily/weekly/monthly summaries if there is a need).

  > **Review comment:** I assume we need to create multiple "pipelines" so
  > that each pipeline can be configured individually. This implies IMHO that
  > measurements reported from the API need to be routed to each individual
  > pipeline.
  >
  > Currently, I think this is an issue with our use of a "Default" provider
  > in Library X, if Library X does not take a dependency on the SDK.

* [Server Room Temperature](#server-room-temperature) reported every minute
* [HTTP Server Duration](#http-server-duration), reported every 5 seconds,
  with a subset of the dimensions (see the dimension-subsetting sketch at the
  end of this document):
  * Host Name
  * HTTP Method
  * HTTP Host
  * HTTP Status Code
  * Client Type
  * 90th, 95th, 99th and 99.9th percentile server duration
* HTTP request counters, reported every 5 seconds:
  * Total number of received HTTP requests
  * Total number of finished HTTP requests
  * Number of currently-in-flight HTTP requests (concurrent HTTP requests)
  > **Review comment (Contributor):** I like how this example asks for three
  > counters, because it seems possible to achieve with two instruments: a
  > count of received requests and a histogram of response durations (i.e.,
  > it seems to call for either a view or a 3rd instrument).

  > **Review comment (Member, author):** And it might affect the semantic
  > convention open-telemetry/opentelemetry-specification#1378 (comment).

| Host Name | Process ID | HTTP Host | Received Requests | Finished Requests | Concurrent Requests |
| --------- | ---------- | --------- | ----------------- | ----------------- | ------------------- |
| MachineA | 1234 | otel.org | 630 | 601 | 29 |
| MachineA | 5678 | otel.org | 1005 | 1001 | 4 |
* Exception samples (exemplars) - in case an HTTP 5xx happens, developer Y
would want to see a sample request with trace id, span id and all the
dimensions (IP, Port, etc.)

| Trace ID | Span ID | Host Name | Process ID | Client Type | HTTP Method | HTTP Host | HTTP Status Code | HTTP Flavor | Peer IP | Peer Port | Host IP | Host Port | Exception |
| -------------------------------- | ---------------- | --------- | ---------- | ----------- | ----------- | --------- | ---------------- | ----------- | --------- | --------- | --------- | --------- | -------------------- |
| 8389584945550f40820b96ce1ceb9299 | 745239d26e408342 | MachineA | 1234 | iOS | PUT | otel.org | 500 | 1.1 | 127.0.0.1 | 51329 | 127.0.0.1 | 80 | SocketException(...) |
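
To close, here is a hedged sketch of the dimension-subsetting requirement
referenced earlier: raw HTTP Server Duration measurements carry the full
dimension set, and a view-like configuration (illustrative only, not an SDK
design proposal) keeps just the dimensions Y picked and computes the requested
percentiles per group. The dimension names are abbreviated for readability.

```python
from collections import defaultdict

# Raw measurements as the library would report them (rows from the
# "HTTP Server Duration" table above, dimensions abbreviated).
RAW = [
    ({"host": "MachineA", "method": "GET", "http_host": "otel.org",
      "status": 200, "client": "Android", "peer_port": 51327}, 8.5),
    ({"host": "MachineA", "method": "POST", "http_host": "otel.org",
      "status": 304, "client": "Android", "peer_port": 51328}, 100.0),
]

# The subset developer Y configured; every other dimension is dropped.
KEEP = ("host", "method", "http_host", "status", "client")

def percentile(sorted_values, q):
    # Nearest-rank percentile - good enough for a sketch.
    rank = max(1, round(q / 100.0 * len(sorted_values)))
    return sorted_values[rank - 1]

groups = defaultdict(list)
for dims, duration_ms in RAW:
    groups[tuple(dims[k] for k in KEEP)].append(duration_ms)

for key, durations in sorted(groups.items()):
    durations.sort()
    print(key, {q: percentile(durations, q) for q in (90, 95, 99, 99.9)})
```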