
Metrics prototype scenario #146

Merged (20 commits, Feb 25, 2021)
Conversation

@reyang (Member) commented Feb 18, 2021:

Following up on the 02/11/2021 #108 Metrics API/SDK SIG meeting, I've created this OTEP, which describes two scenarios for the metrics prototyping work.

The actual prototype will be submitted as PR(s) to the language client repo, for example:

@reyang reyang marked this pull request as ready for review February 18, 2021 04:58
@reyang reyang requested review from a team as code owners February 18, 2021 04:58

The application owner (developer Y) would only want the following metrics:

* [System CPU Usage](#system-cpu-usage) reported every 5 seconds
Member commented:

Not sure if this is something we want in the initial stage, but I'd like to add a requirement that the same metric can be reported at different intervals, potentially with different dimensions.
For example, the app owner wants the HTTP Server Duration metric exported every 1 second with only the HttpStatusCode dimension, and the same metric exported every 30 seconds with the dimensions {Hostname, HTTP Method, Host, Status Code, Client Type}. The former is typically used for near-real-time dashboards, the latter for more permanent storage.

Member Author (reyang) replied:

In your example, I'd guess people would normally report only the 1-second series from SDK pre-aggregation, and rely on the metrics backend to aggregate the 30-second series (and daily/weekly/monthly summaries if needed).
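
To make the thread concrete, here is a minimal sketch of the "two cadences" setup under discussion, written against today's opentelemetry-python SDK (which postdates this thread); the instrument and attribute names are assumptions, not the OTEP's:

```python
# Hedged sketch: two export cadences from one provider, assuming today's
# opentelemetry-python SDK. Exporters and names are placeholders.
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)
from opentelemetry.sdk.metrics.view import View

# One reader per reporting cadence: 1 s for dashboards, 30 s for storage.
fast_reader = PeriodicExportingMetricReader(
    ConsoleMetricExporter(), export_interval_millis=1_000
)
slow_reader = PeriodicExportingMetricReader(
    ConsoleMetricExporter(), export_interval_millis=30_000
)

# Keep only the status-code dimension on the duration metric. Caveat: views
# apply provider-wide rather than per reader, so truly different dimension
# sets per cadence would need separate providers -- or, as the reply above
# suggests, backend-side re-aggregation of the fine-grained stream.
provider = MeterProvider(
    metric_readers=[fast_reader, slow_reader],
    views=[
        View(
            instrument_name="http.server.duration",
            attribute_keys={"http.status_code"},
        )
    ],
)
```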

reyang and others added 6 commits February 17, 2021 21:59
Co-authored-by: Cijo Thomas <cithomas@microsoft.com>
Co-authored-by: Cijo Thomas <cithomas@microsoft.com>
Co-authored-by: Cijo Thomas <cithomas@microsoft.com>
Co-authored-by: Cijo Thomas <cithomas@microsoft.com>
text/metrics/0146-metrics-prototype-scenarios.md (outdated; resolved)
domain. We should also clarify the scope and be able to articulate it
precisely during this stage; here are some examples:

* Why do we want to introduce brand new metrics APIs versus taking a well
Contributor commented:

This point (and the next point) beg the question:

Should we make ONE of the use cases be "hide OTel behind another library to help users take advantage of OTel telemetry-unification concepts, like Baggage + Resource"? This could be Micrometer, OpenCensus, whatever.

text/metrics/0146-metrics-prototype-scenarios.md (outdated; resolved)
consumer, the value will not be reported at all (e.g. there is no API call to
fetch the room temperature unless someone is asking for the room temperature).

#### Process CPU Usage
Contributor commented:

It may be useful in your tables to show "Resource" with sub-tables for the components coming from the resource (host name and process_id, I assume).

Note: the **Host Name** should leverage [`OpenTelemetry
Resource`](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/resource/sdk.md),
so it should be covered by the metrics SDK rather than API, and strictly
speaking it is not considered as a "dimension" from the SDK perspective.
Contributor commented:

IIUC, this is done in exporters for traces. If a vendor isn't directly supporting Resource (or where their notion of Resource doesn't fully align with OTLP), the vendor can lift Resource labels onto the trace in the export method. We actually plan to do this for instrumentation library (although I don't think we do it today).
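
As a concrete illustration of that point, a sketch against today's opentelemetry-python SDK (attribute values are placeholders, not from the OTEP):

```python
# Hedged sketch: Resource is configured once on the SDK rather than passed
# per measurement, so "Host Name" rides along with every exported metric.
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.resources import Resource

resource = Resource.create({
    "host.name": "webserver-01",  # placeholder value
    "process.pid": 12345,         # placeholder value
})

# Every metric produced under this provider carries the resource. An
# exporter whose backend lacks a first-class Resource concept can copy
# these entries onto each data point as ordinary labels, as described above.
provider = MeterProvider(resource=resource)
meter = provider.get_meter("example.instrumentation")
```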

consumer, the value will not be reported at all (e.g. there is no API call to
fetch the room temperature unless someone is asking for the room temperature).

#### Process CPU Usage
Contributor commented:

Is "Process CPU Usage" really an HTTP server concern? I feel like this may be a general instrumentation concern, and the HTTP metrics should really be focused only on things HTTP libraries do. CPU consumption is more of a process-wide concern.

I'd suggest using Active HTTP Connections as the pull metric from the HTTP library.

For Server Room Temperature, I love what it's trying to do, but I'm having trouble buying it's part of an http library. Maybe move it into its own instrumentation component?

Member Author (reyang) replied:

I've removed CPU, since it is a complex topic in itself and not the focus of this OTEP/prototype.
I've extracted temperature/humidity into a separate lib to make the scenario more realistic/reasonable.

For Active HTTP Connections, I don't know if we want to do it by reporting "total received - total processed" or do it differently (see the discussion here). Given that this discussion might take extra time, probably leave it out for now?

Contributor replied:

The thing about CPU is that it's a perfect example of an asynchronous instrument in the draft API. We want these to be recorded through callbacks, because it's expensive. This means the value is returned in cumulative form, not in delta form.
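
For illustration, here is a sketch of that asynchronous-instrument shape, using today's opentelemetry-python API as a stand-in for the then-draft API (the instrument name and the os.times() reading are illustrative assumptions):

```python
# Hedged sketch of an asynchronous (observable) instrument: the callback
# runs only when a consumer collects, and it reports the CUMULATIVE CPU
# time since process start -- not a delta.
import os

from opentelemetry import metrics
from opentelemetry.metrics import CallbackOptions, Observation

def cpu_time_callback(options: CallbackOptions):
    usage = os.times()  # cheap-to-read cumulative user/system CPU seconds
    yield Observation(usage.user, {"state": "user"})
    yield Observation(usage.system, {"state": "system"})

meter = metrics.get_meter("example")
meter.create_observable_counter(
    "process.cpu.time",
    callbacks=[cpu_time_callback],
    unit="s",
)
```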

```python
store.process_order("customerA", {"tomato": 1})
```

When the store is closed, we will report the following metrics:

A reviewer commented:

Do we think this type of offline historical reporting is a good primary use case for the metrics API? Although I can envision a metrics API doing it, I'd guess it is a better fit for a standard transactional database, where there are stronger guarantees about data consistency and richer data, but potentially worse performance/availability. I think of metrics as being focused on high availability and low latency, which is more oriented towards diagnostics/live monitoring/alerting, where the grocery store would be looking for signs like:

  1. Is there an unexpected change in the rate of sales, suggesting an unknown incident may be occurring at the store?
  2. Is inventory getting unexpectedly low, so we need to dispatch an urgent delivery from the warehouse?
  3. Is there a sudden spike in demand for a product, so we need to consider rationing or price changes?

Of course, if I am looking at it with too narrow a lens, then this example might be accomplishing exactly what it intends: expanding my understanding of what scenarios a metrics API is intended to support.

Contributor replied:

I see (1) and (2) as good use-cases for monitoring using a metrics API, but maybe not (3). Although the example feels like it fell out of a textbook, you could re-imagine the store as a Message-Queue consumer processing orders in a horizontally scalable store. Can we ask another form of query: "how many stores were in operation at a given time?"

* HTTP request counters, reported every 5 seconds:
  * Total number of received HTTP requests
  * Total number of finished HTTP requests
  * Number of currently-in-flight HTTP requests (concurrent HTTP requests)
Contributor commented:

I like how this example asks for three counters, because it seems possible to achieve with two instruments: a count of received requests and a histogram of response durations (i.e., seems to call for either a view or a 3rd instrument).
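
A sketch of that observation (the instrument names are assumptions following today's semantic conventions, not the OTEP's):

```python
# Hedged sketch: two primary instruments plus the debated third. "Finished"
# falls out of the histogram's bucket count, so only in-flight needs either
# a view (received - finished) or the UpDownCounter shown here.
from opentelemetry import metrics

meter = metrics.get_meter("http.server.example")

received = meter.create_counter("http.server.requests_received")
duration = meter.create_histogram("http.server.duration", unit="ms")
in_flight = meter.create_up_down_counter("http.server.active_requests")

def process(request):
    return 42.0  # placeholder: handle the request, return elapsed ms

def handle(request):
    received.add(1)
    in_flight.add(1)
    try:
        elapsed_ms = process(request)
        duration.record(elapsed_ms)
    finally:
        in_flight.add(-1)
```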

Member Author (reyang) replied:

And it might affect the semantic convention open-telemetry/opentelemetry-specification#1378 (comment).

The application owner (developer Y) would only want the following metrics:

* Server temperature - reported every 5 seconds
* Server humidity - reported every minute
A reviewer commented:

AFAIK, we need to have multiple "pipelines" for different configurations. This may include:

* Reporting period, which is related to collection rate, export rate, etc.
* Selection/grouping of the desired metrics for this pipeline

Contributor replied:

These are all SDK configuration topics, to me. The use cases in this document are about how code is instrumented, I think, and what the API looks like.

OpenCensus had a programmatic API for configuring the kinds of variables you mentioned. I'm not sure whether a programmatic API or a configuration file is what users want, but I'd argue to keep this setup out of the instrumentation API.
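
A sketch of that separation, assuming today's opentelemetry-python SDK (the instrument names and the drop-all-then-select view pattern are illustrative):

```python
# Hedged sketch: instrumentation touches only the API; reporting period and
# metric selection live in operator-owned SDK setup, not in library code.
from opentelemetry import metrics
from opentelemetry.metrics import CallbackOptions, Observation
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)
from opentelemetry.sdk.metrics.view import DropAggregation, View

# --- operator-owned "pipeline" configuration ---
provider = MeterProvider(
    metric_readers=[
        PeriodicExportingMetricReader(
            ConsoleMetricExporter(), export_interval_millis=5_000
        )
    ],
    views=[
        View(instrument_name="*", aggregation=DropAggregation()),  # drop all...
        View(instrument_name="server.temperature"),  # ...except these two
        View(instrument_name="server.humidity"),
    ],
)
metrics.set_meter_provider(provider)

# --- library-owned instrumentation (API only, no pipeline knowledge) ---
def read_temperature(options: CallbackOptions):
    yield Observation(21.5)  # placeholder sensor read

meter = metrics.get_meter("server.room")
meter.create_observable_gauge(
    "server.temperature", callbacks=[read_temperature], unit="Cel"
)
```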

reyang and others added 4 commits February 19, 2021 11:17
Co-authored-by: Leighton Chen <lechen@microsoft.com>
Co-authored-by: Leighton Chen <lechen@microsoft.com>
@jsuereth (Contributor) left a comment:

My main comment, thinking deeper on this, is whether or not "external annotation" needs to be called out specifically as a non-goal.

Specifically, should we call out a use case where a DevOps engineer wants to add metric labels POST DEVELOPMENT to the API? I think this is an SDK concern and should be called out there (e.g. how you can add annotations to Resource via ENV, which would then influence metrics), but I think the scenarios listed here are sufficient for API discussions.
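
For the ENV route mentioned here, a sketch (the attribute name is a placeholder; OTEL_RESOURCE_ATTRIBUTES itself is the spec-defined variable):

```python
# Hedged sketch: the SDK's default resource detector reads
# OTEL_RESOURCE_ATTRIBUTES, so an operator can add labels after development
# without touching instrumentation code, e.g.
#   OTEL_RESOURCE_ATTRIBUTES="deployment.environment=prod" python app.py
import os

os.environ["OTEL_RESOURCE_ATTRIBUTES"] = "deployment.environment=prod"

from opentelemetry.sdk.resources import Resource

# Resource.create() merges the env-var attributes; the resulting resource
# then flows onto every metric the SDK exports.
resource = Resource.create()
print(resource.attributes.get("deployment.environment"))  # -> "prod"
```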

@noahfalk left a comment:

Looks great, thanks @reyang!

Co-authored-by: Cijo Thomas <cithomas@microsoft.com>
@jmacd (Contributor) commented Feb 25, 2021:

I reviewed the open comments and believe that @reyang has addressed them all. Thank you!
