
Metrics export pipeline + metrics stdout exporter #341

Merged: 48 commits into open-telemetry:master on Feb 11, 2020

Conversation

@lzchen (Contributor) commented Dec 19, 2019

This completes an end-to-end metrics export pipeline.

It borrows a lot of elements from the Go implementation as well as the WIP spec.

The new files all relate to the export pipeline: batcher.py, aggregate.py, and controller.py, which contain the implementations of the Batcher, Aggregator, and Controller classes (see this diagram for an overview of how the pieces work together).

Currently, only the UngroupedBatcher and CounterAggregator are implemented. This also means that there are no AggregationSelectors, since aggregations for Measure have not been implemented. A PushController is also included, which is a worker thread that continuously initiates collection of metrics and calls the exporter. metrics_example.py in the examples folder shows how to use these components; a minimal wiring sketch also follows the SDK changes listed below.

Some SDK changes:

  • Metric objects now hold a reference to the Meter that created them. This is used to construct an aggregator for the specific metric type when a handle is created.
  • LabelSet now has a field for its encoded value.
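
For orientation, here is a rough sketch of how these pieces fit together, loosely based on metrics_example.py. The module paths and call signatures reflect this PR's revision and are illustrative rather than authoritative:

```python
import time

from opentelemetry import metrics
from opentelemetry.sdk.metrics import Counter, Meter
from opentelemetry.sdk.metrics.export import ConsoleMetricsExporter
from opentelemetry.sdk.metrics.export.controller import PushController

# Install the SDK Meter as the global meter implementation via the loader.
metrics.set_preferred_meter_implementation(lambda _: Meter())
meter = metrics.meter()

# The stdout exporter from this PR, driven by a PushController worker thread
# that collects and exports every 5 seconds.
exporter = ConsoleMetricsExporter()
controller = PushController(meter, exporter, 5)

requests_counter = meter.create_metric(
    "requests", "number of requests", "requests", int, Counter, ("environment",)
)
label_set = meter.get_label_set({"environment": "staging"})
requests_counter.add(25, label_set)

# Keep the process alive long enough for the controller to tick at least once.
time.sleep(10)
```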

@lzchen lzchen requested a review from a team as a code owner December 19, 2019 19:14
@codecov-io commented Dec 31, 2019

Codecov Report

Merging #341 into master will not change coverage.
The diff coverage is 100%.


@@           Coverage Diff           @@
##           master     #341   +/-   ##
=======================================
  Coverage   85.85%   85.85%           
=======================================
  Files          41       41           
  Lines        2078     2078           
  Branches      242      242           
=======================================
  Hits         1784     1784           
  Misses        223      223           
  Partials       71       71
Impacted Files Coverage Δ
...elemetry-api/src/opentelemetry/metrics/__init__.py 95.58% <100%> (ø) ⬆️


Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@toumorokoshi (Member) left a comment

As this is built on a WIP spec, I don't see much reason not to approve it.

A couple thoughts:

  1. It would be really nice if there were some clear, concise examples. The unit tests seem to cover different pieces without giving me a full example of how to configure a meter in an app to actually record and export a value to the console. That would be very helpful if the goal is to evaluate the API.
  2. To that end, some sort of document that outlines how to work up to more and more complex cases would be helpful, even before implementing any more code. For example:

a. start with configuring a single metric with no dimensions and exporting it to the console
b. a single metric with dimensions (introduce label sets)
c. multiple metrics that need to be aggregated (introduce the aggregator).

I've left some thoughts, but it's very hard to evaluate the interface and usage without a common use case. Thoughts on making the next PR a full example of collecting application metrics (CPU / memory, thread count) so we can see how it'll actually work in practice?

@@ -220,7 +220,7 @@ def create_metric(
     metric_type: Type[MetricT],
     label_keys: Sequence[str] = (),
     enabled: bool = True,
-    monotonic: bool = False,
+    alternate: bool = False,
Member:

IMO "alternate" doesn't really correspond to the behavior. Is this still in flux? What's the best way to give feedback if this is a spec-level decision?

Contributor Author (lzchen):

You're correct. There are new terms for expressing default behaviour here.


@@ -298,9 +373,6 @@ def get_label_set(self, labels: Dict[str, str]):
# Use simple encoding for now until encoding API is implemented
encoded = tuple(sorted(labels.items()))
Member:

What's the purpose behind the label encodings? Is this similar to the previous PR?

There isn't much of a performance improvement here, since the creation of the encoded cache key to even check against is quite expensive (iterate dict -> sort -> tuple).

Member:

Thinking through it, there's some merit in not recreating the same label set over and over again, but constructing the labels dictionary is itself quite expensive.

Contributor Author (lzchen):

Correct, I believe the optimization is so that the same label set does not need to be recalculated over and over again. Currently, the encoding used is a default placeholder encoding type. There is talk of implementing a configurable encoder in the SDK, so users can configure the type of encoding appropriate to the exporter they want to send metrics to. There is no spec for it yet, but the Go implementation has created one. I think once we have this, vendors can pass in their own encoders with whatever encoding they want, so the iterate dict -> sort -> tuple algorithm does not always apply.
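
To make the intended optimization concrete, here is a hypothetical sketch (not the PR's code) of a label-set cache keyed on a pluggable encoder; all names here are invented for illustration:

```python
from typing import Callable, Dict


def default_encoder(labels: Dict[str, str]) -> str:
    # Placeholder encoding, equivalent in spirit to tuple(sorted(labels.items())).
    return ",".join("{}={}".format(k, v) for k, v in sorted(labels.items()))


class LabelSet:
    def __init__(self, labels: Dict[str, str], encoded: str):
        self.labels = labels
        self.encoded = encoded  # pre-computed key for downstream use


class MeterSketch:
    def __init__(
        self, encoder: Callable[[Dict[str, str]], str] = default_encoder
    ):
        self._encoder = encoder
        self._label_sets: Dict[str, LabelSet] = {}

    def get_label_set(self, labels: Dict[str, str]) -> LabelSet:
        encoded = self._encoder(labels)
        # Reuse an existing LabelSet so handles and batchers can key on the
        # pre-computed `encoded` value instead of re-encoding the labels.
        if encoded not in self._label_sets:
            self._label_sets[encoded] = LabelSet(labels, encoded)
        return self._label_sets[encoded]
```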

def __init__(self):
self.current = None
self.check_point = None
self._lock = threading.Lock()
Member:

Is a lock the right thing to put in the base class? I understand most implementations will want one, but it's not particularly related to the interface, and we may be needlessly locking if we have an async aggregator or use atomic integers and doubles.

Contributor Author (lzchen):

I'm not sure an async aggregator makes sense, since we would want to export metric values at a certain point in time. That said, all implementations would need a way to handle concurrent updates, so I would think putting the lock in the base class makes sense.

Member:

More generally, I think there's a good reason for keeping implementation out of ABCs even though it's technically allowed. We addressed (but didn't resolve) this in #311 (comment).

In this particular case I still don't think it makes sense to instantiate the lock here. This (half-abstract) implementation doesn't use it, and it's not part of the class' API.
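
To illustrate the alternative being argued for, here is a sketch in which the ABC stays purely abstract and the concrete CounterAggregator owns its own lock (simplified, not the PR's exact code):

```python
import abc
import threading


class Aggregator(abc.ABC):
    """Interface only: no lock and no storage decisions in the base class."""

    @abc.abstractmethod
    def update(self, value):
        """Record a new measurement."""

    @abc.abstractmethod
    def checkpoint(self):
        """Snapshot the accumulated value for export."""


class CounterAggregator(Aggregator):
    def __init__(self):
        self.current = 0
        self.check_point = 0
        self._lock = threading.Lock()  # the concurrency choice lives here

    def update(self, value):
        with self._lock:
            self.current += value

    def checkpoint(self):
        with self._lock:
            self.check_point = self.current
            self.current = 0
```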

self.current += value

def checkpoint(self):
# TODO: Implement lock-free algorithm for concurrency
Member:

To completely contradict myself earlier: is it worth implementing a lock-free algorithm here? checkpoint shouldn't be called particularly often, and lock-free algorithms are easy to get wrong and can often lead to worse performance.

Contributor Author (lzchen):

There is no spec yet for how checkpoint should handle concurrency. There was talk that using a lock would be too slow, but I agree that lock-free algorithms are error-prone and don't necessarily perform better. I might change this to a question rather than a TODO.

self.current = 0

def merge(self, other):
self.check_point += other.check_point
Member:

Should current be updated as well? It seems like this addition would be cleared immediately by the next call to checkpoint, which will overwrite it with whatever exists in current.

Contributor Author (lzchen):
merge is usually called right after checkpoint() in a Batcher's process step. This handles the race condition where an update to the metric happens at the same time as process: merge folds the other checkpoint value into this one to prepare for exporting, while current is left untouched (it is changed separately by the update and exported later on).
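
A condensed sketch of that checkpoint/merge interplay (field names follow the PR; everything else is simplified and illustrative, with locking omitted):

```python
class CounterAggregatorSketch:
    def __init__(self):
        self.current = 0      # still receiving live updates
        self.check_point = 0  # frozen value handed to the exporter

    def update(self, value):
        self.current += value

    def checkpoint(self):
        self.check_point = self.current
        self.current = 0

    def merge(self, other):
        # Fold another aggregator's checkpointed value into ours; `current` is
        # deliberately untouched so concurrent updates land in the next cycle.
        self.check_point += other.check_point


# Typical sequence inside a batcher's process step:
handle_agg = CounterAggregatorSketch()
batcher_agg = CounterAggregatorSketch()
handle_agg.update(3)
handle_agg.checkpoint()        # freezes 3 into handle_agg.check_point
batcher_agg.merge(handle_agg)  # batcher_agg.check_point is now 3
```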

# Collect all of the meter's metrics to be exported
self.meter.collect()
# Export the given metrics in the batcher
self.exporter.export(self.meter.batcher.check_point_set())
Member:

Is there a design concern with calling batcher methods directly for things like setting checkpoints and declaring the collection finished, but calling the more general meter for collect?

I guess to be clearer: are there operations that the general meter should perform, on top of just the batcher?

Contributor Author (lzchen):

The architecture is designed so that the meter handles the "metric-related events" such as construction, updates, and label sets, while the batcher is responsible for collection and aggregation. Yes, the meter could technically perform all the instructions handled by the batcher, but this separation of responsibilities seems like a good logical pattern. What design concerns are you thinking of specifically?
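
For context, the quoted lines come from the controller's periodic tick; below is a simplified sketch of that worker loop. The constructor shape and the finished_collection call are assumptions based on this discussion rather than verbatim PR code:

```python
import threading


class PushControllerSketch(threading.Thread):
    """A worker thread that periodically collects and exports metrics."""

    def __init__(self, meter, exporter, interval_seconds):
        super().__init__(daemon=True)
        self.meter = meter
        self.exporter = exporter
        self.interval = interval_seconds
        self.finished = threading.Event()
        self.start()

    def run(self):
        # Wake up every `interval` seconds until shutdown() sets the event.
        while not self.finished.wait(self.interval):
            self.tick()

    def tick(self):
        # Metric-related events stay with the meter; aggregation state lives in
        # the batcher, which the controller only reads in order to export.
        self.meter.collect()
        self.exporter.export(self.meter.batcher.check_point_set())
        self.meter.batcher.finished_collection()  # assumed "collection done" hook

    def shutdown(self):
        self.finished.set()
```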

aggregator.update(1.0)
label_set = {}
batch_map = {}
batch_map[(metric, "")] = (aggregator, label_set)
Member:

It looks like this block sets a very specific expectation around the format of the batch_map. In general, this feels a bit error-prone, since assigning a value requires knowing the right pair of tuples to set as the key and the value.

Is there some refactoring that could avoid direct manipulation of the batch_map, providing helper methods instead? That way there's clear documentation of the structure of the key-value pairs, or ideally one doesn't have to set the key-value pairs at all.

Contributor Author (lzchen):

Good point. I think adding a helper method to assign values to the batch map would be good.
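
Something like the following hypothetical helper would capture that suggestion, so the (metric, encoded labels) → (aggregator, label set) shape of the batch map is documented and enforced in one place (all names invented for illustration):

```python
class BatchMap:
    """Wraps the dict the batcher uses so callers never build keys by hand."""

    def __init__(self):
        self._entries = {}

    def put(self, metric, label_set, aggregator):
        # Key: (metric, encoded labels); value: (aggregator, label_set).
        self._entries[(metric, label_set.encoded)] = (aggregator, label_set)

    def get(self, metric, label_set):
        return self._entries.get((metric, label_set.encoded))

    def items(self):
        return self._entries.items()
```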

Contributor Author (lzchen):

EDIT: There seem to be some problems with circular imports (__init__.py depends on batcher.py, and batcher.py would have to depend on __init__.py for typing). Sphinx also complains, since the forward refs can't be resolved. Any ideas on how to fix this?

Member:

Where do you see the circular imports? In general this can be solved by factoring the circular component out into its own module and having the circular importees import that instead.
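
If the cycle exists only for type hints, one common workaround (a sketch assuming batcher.py needs the SDK Metric type solely for annotations; names and paths are illustrative) is to guard the import with typing.TYPE_CHECKING and use a string annotation:

```python
# e.g. in the SDK's batcher.py
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated only by type checkers, never at runtime, so importing the
    # package here cannot create an import cycle when the SDK loads.
    from opentelemetry.sdk.metrics import Metric


class UngroupedBatcher:
    def aggregator_for(self, metric: "Metric"):
        """Return the aggregator to use for this metric (body omitted)."""
```

The other route, as suggested above, is to move the shared types into their own small module that both __init__.py and batcher.py import.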

@toumorokoshi added the discussion and needs reviewers labels on Jan 1, 2020
@lzchen (Contributor Author) commented Jan 7, 2020

@toumorokoshi
metrics_example.py shows how users can create, update and export metrics. Is the example sufficient?

@lzchen (Contributor Author) commented Jan 7, 2020

I've added some more examples under examples/metrics/ with comments. It should outline the basic usages of metrics.

@mauriciovasquezbernal (Member) left a comment

This is a partial review; I haven't finished, but I don't want to risk losing my comments.

So far I am concerned about two points:

  • I don't see usage of locks / atomic types to handle concurrency.
  • Handles are not being released.

@mauriciovasquezbernal (Member) left a comment

A few nits, nothing to worry about.

I think it is a good base to move ahead with. It would be nice to have a list of tasks to tackle after this is merged, such as implementing the other aggregators.

self.tick()

def shutdown(self):
self.finished.set()

Member:

Should the controller perform a final tick() before shutting down?

Contributor Author (lzchen):

Good question. There isn't clearly defined behaviour yet for flushing metrics on application close. I think this is a discussion for the spec.
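
If the spec does end up requiring a flush on close, the change would look roughly like this sketch (mirroring the controller sketch earlier in the thread; whether to do this at all is the open question):

```python
import threading


class FlushingPushController(threading.Thread):
    """Sketch only: same loop as before, but shutdown() ticks one last time."""

    def __init__(self, meter, exporter, interval):
        super().__init__(daemon=True)
        self.meter = meter
        self.exporter = exporter
        self.interval = interval
        self.finished = threading.Event()
        self.start()

    def run(self):
        while not self.finished.wait(self.interval):
            self.tick()

    def tick(self):
        self.meter.collect()
        self.exporter.export(self.meter.batcher.check_point_set())

    def shutdown(self):
        self.finished.set()
        self.tick()  # final flush so the last partial interval is still exported
```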

@lzchen (Contributor Author) commented Feb 8, 2020

@mauriciovasquezbernal Thanks so much for reviewing! Can you approve if everything else looks okay?

@mauriciovasquezbernal (Member) left a comment

Last point: if tick() is not added per #341 (comment), then a sleep has to be added to the record example, otherwise it won't print anything. Besides that, LGTM!

@c24t (Member) left a comment

Some drive-by comments on units in the examples.

@c24t (Member) commented Feb 10, 2020

I don't want to include too many loader changes in this PR. As of now, the examples would not work because the loader doesn't seem to address this use case.

I opened lzchen#7 to fix the default meter type. A better long-term fix might be to pass both the abstract and default types to the loader.

@c24t (Member) left a comment

No more blocking comments from me, I'll follow up with a comment about my lingering design concerns.

@c24t c24t merged commit fe99dbc into open-telemetry:master Feb 11, 2020
toumorokoshi pushed a commit to toumorokoshi/opentelemetry-python that referenced this pull request Feb 17, 2020
Initial implementation of the end-to-end metrics pipeline.
srikanthccv pushed a commit to srikanthccv/opentelemetry-python that referenced this pull request Nov 1, 2020
* feat(jaeger-exporter): adds flushing on an interval

closes open-telemetry/opentelemetry-js#340

* respond to comments

* yarn fix
Labels: discussion, metrics, needs reviewers
7 participants