Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add RFC for OpenTelemetry integration #42

Merged
merged 5 commits into from
Sep 3, 2021

Conversation

atoulme
Copy link
Contributor

@atoulme atoulme commented Jan 26, 2021

Signed-off-by: Antoine Toulme antoine@lunar-ocean.com

This RFC offers a path to integrate OpenTelemetry, and specifically adding tracing and metrics reporting capabilities, to client SDKs and chaincode shim frameworks. This RFC relies and references on the work done in a pull request made to the Java shim chaincode repository.

Feedback and comments are welcome and encouraged. Thanks!

The OpenTelemetry project is a CNCF project that aims to standardize observability, especially in a cloud-native environment.
OpenTelemetry is working to create a standard around representing metrics, traces and logs.
We aim to bring full observability of Hyperledger Fabric to allow tracing of transactions all the way from the client
to each of the chaincode invocations. We also want to capture chaincode interactions.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also want to capture chaincode interactions.

Why is this information helpful?

We want to be able to trace and correlate actions from chaincode events.

A chaincode event is simply an output of its execution. It is recorded on the blockchain. You can't trace it as it doesn't go to any node.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This information is helpful as it helps understand the path of execution of code - the equivalent of logging entering methods and exiting methods, but with a simple, lightweight object that can track metadata and tags for the purpose of powering operations.

Chaincode events are typically triggers for oracles that then spawn code execution. Using the trace information associated with the event allows to correlate the execution of arbitrary code back to the event that triggered it and the transaction that originated the event.

@yacovm
Copy link
Contributor

yacovm commented Jan 26, 2021

It seems to me the proposal is focused mainly on shims and SDKs. I'm not sure why tracing these two components is useful. An SDK is simply a software component that has a private key and can orchestrate transaction creation and notify the application about its success. A chaincode shim is nothing but a static process that receives as input a proposal, executes arbitrary logic and communicates with its peer (excluding the constellation where you deploy it as server which no one uses).

If anything, the main thing that seems worthwhile to me to trace, is transaction identifiers. A transaction identifier flows from a client to peers and then back to the client to the ordering service, there it flows among nodes until it lands in a block and then it flows inside the peer in its stages of validation and commit, and tracing this end to end can be useful for analysis of bottlenecks and troubleshooting.

I think that the proposal lacks examples of what you would want to trace, and also how this works with the fact that there is already a metric framework in Fabric that integrates with statsd and prometheus.

@atoulme
Copy link
Contributor Author

atoulme commented Jan 26, 2021

The trace information is typically rich, with some trace providers arguing to replace logging and add as much metadata in the form of tags for traces. A transaction identifier would definitely be something to add to the trace. It would then be possible to search by transaction identifier for traces.

The existing metrics framework of Hyperledger Fabric is not discussed here indeed - OpenTelemetry can collect Prometheus metrics, and we have been successful in collecting those metrics with the OpenTelemetry Collector or Fabric Logger. I have been looking at offering OpenTelemetry as a framework for metrics for Fabric as well, but I believe that would be a separate enhancement. OpenTelemetry supports exporting metrics in Prometheus format or using the OTLP protocol, so it would be a shoe-in for the existing metrics framework.

In this enhancement, we propose that developers interested in shoring up metrics as part of their chaincode may do so, and allow for those metrics to be sent to a remote OTLP endpoint. This is a useful capability for chaincode developers.

I will work on providing more samples. Thank you!

@jt-nti
Copy link
Member

jt-nti commented Mar 1, 2021

It seems like this RFC is a good fit for some telemetry that was just optionally getting logged out in the Java chaincode implementation. Good to see that gRPC metadata works instead of making changes to the protobuf messages.

@denyeart
Copy link
Contributor

denyeart commented Mar 3, 2021

I agree with Yacov that tracing a transaction end-to-end seems like the most valuable use case for Fabric with OpenTelemetry. I can see the need to instrument each of the SDKs for the end-to-end transaction view, but it seems it would be more efficient to instrument the peer and orderer for the transaction, rather than instrumenting each of the language-specific chaincode shims as is described in this RFC. Why implement in each chaincode language shim when it could be implemented once in the peer?

Signed-off-by: Antoine Toulme <antoine@lunar-ocean.com>
…racing info

Signed-off-by: Antoine Toulme <antoine@lunar-ocean.com>
@atoulme
Copy link
Contributor Author

atoulme commented Mar 3, 2021

I have amended the RFC based on the conversation on the maintainers call this morning. In particular, I have removed the notion of chaincode tracing. I will create a separate RFC specifically for this particular aspect, as well as adding trace information to chaincode events.

…ng the maintainers call

Signed-off-by: Antoine Toulme <antoine@lunar-ocean.com>
@atoulme
Copy link
Contributor Author

atoulme commented Mar 5, 2021

As we continue the discussion and reviews of this RFC, may I request that a number be associated with this RFC and we may move it along the process?

Copy link

@sykesm sykesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm late to the party but I agree with most of the comments that have been made by the other maintainers.

I also want to point out that the current implementation uses gRPC streams to communicate between the peer and chaincode. This pattern complicates things as messages from many different transactions will share the same stream context. I also want to note that Fabric does not consistently propagate contexts from the transport layer so there is some work to do beyond wiring in the interceptors.

text/0000-opentelemetry.md Outdated Show resolved Hide resolved
text/0000-opentelemetry.md Outdated Show resolved Hide resolved
Signed-off-by: Antoine Toulme <antoine@lunar-ocean.com>
@atoulme
Copy link
Contributor Author

atoulme commented Mar 9, 2021

No, the streaming and context propagation might just be fine as the interceptor works at the level of a gRPC service and message exchange over the stream. Don't take my word for it - let's try it out!

@sykesm
Copy link

sykesm commented Mar 9, 2021

No, the streaming and context propagation might just be fine as the interceptor works at the level of a gRPC service and message exchange over the stream. Don't take my word for it - let's try it out!

I'm trying to point to an area that you may want to spend a bit of time looking at. Obviously the PRs to Fabric will demonstrate if and how things work.

@atoulme
Copy link
Contributor Author

atoulme commented Mar 9, 2021

No worries. I built an integration test that shows we can collect traces with the chaincode stream here:
https://github.com/hyperledger/fabric-chaincode-java/pull/153/files#diff-3d1a2998188d8694d38faff7ee0a9a713fe5373927db9fa5eef9962eff319b2eR127

Maybe it can be enhanced to push a bit further, but chaincode is out of scope anyway. I guess I'll get started on a RFC just for chaincode soon.

Base automatically changed from master to main March 19, 2021 23:22
@atoulme
Copy link
Contributor Author

atoulme commented Apr 14, 2021

This request was approved a while back, should it be merged? What is the next step here?

@mastersingh24
Copy link

@denyeart - Can you have a look at this as well? I think we are almost ready for final review on this one

@atoulme
Copy link
Contributor Author

atoulme commented Apr 30, 2021

Would it be possible to receive an update on this RFC please?

@denyeart
Copy link
Contributor

denyeart commented May 11, 2021

@atoulme Sorry this got held up, I think we are close. The last thing that I think could be clarified in the RFC is some discussion of tracing versus metrics. This RFC mentions tracing the most, but also mentions metrics, so it is unclear if the focus is the former or both. Would it make sense as a first pass to focus on tracing only in this RFC? Or do tracing and metrics typically go hand-in-hand? Do you envision the metrics support complementing or replacing the existing Fabric metrics? Would you expect users to want to use both existing Fabric metrics and OpenTelemetry metrics or would that be overkill? Can the existing Prometheus service in the peer and orderer be leveraged for the OpenTelemetry metrics?

I see you replied to Yacov with some metrics thoughts, but I think those thoughts either need to be expanded within the RFC text itself (as well as addressing my questions), or removed from the RFC to make it clear that the initial focus is on tracing.

@atoulme
Copy link
Contributor Author

atoulme commented May 26, 2021

I am happy to slice and dice it however makes most sense, but OpenTelemetry provides support for metrics and traces. Therefore, if it is applied across the board as a standard, you should expect to see metric integration too.
The RFC sells heavily on the traces side because that's a net new feature for Fabric.

Since I introduced this RFC, OpenTelemetry also issued a standard for logs.

@atoulme
Copy link
Contributor Author

atoulme commented Jul 16, 2021

@denyeart please let me know what you think?

@denyeart
Copy link
Contributor

@atoulme I think we are very close... could you respond to each of my questions from May 11th and update the RFC accordingly? Thanks!

@atoulme
Copy link
Contributor Author

atoulme commented Aug 4, 2021

Understood, working on it. Thank you for your patience.

Signed-off-by: Antoine Toulme <antoine@lunar-ocean.com>
@atoulme
Copy link
Contributor Author

atoulme commented Aug 17, 2021

I have amended the scope of this RFC. @denyeart, please see my answers inline:

@atoulme Sorry this got held up, I think we are close. The last thing that I think could be clarified in the RFC is some discussion of tracing versus metrics. This RFC mentions tracing the most, but also mentions metrics, so it is unclear if the focus is the former or both.

Sorry about that. I dropped all reference to metrics.

Would it make sense as a first pass to focus on tracing only in this RFC? Or do tracing and metrics typically go hand-in-hand?

You can just do tracing for now. That's plenty to get started.

Do you envision the metrics support complementing or replacing the existing Fabric metrics? Would you expect users to want to use both existing Fabric metrics and OpenTelemetry metrics or would that be overkill? Can the existing Prometheus service in the peer and orderer be leveraged for the OpenTelemetry metrics?

The OpenTelemetry SDK is really great to use for metric collection. It also allows to push metrics through the current Prometheus set up with no changes - you just integrate OpenTelemetry with a Prometheus collector. You can combine sending metrics with OpenTelemetry gRPC push and Prometheus. The current Prometheus exports are fine and can be used for now. OpenTelemetry Collector can scrape Prometheus metrics. We don't need to do this at this time or in this RFC.

I see you replied to Yacov with some metrics thoughts, but I think those thoughts either need to be expanded within the RFC text itself (as well as addressing my questions), or removed from the RFC to make it clear that the initial focus is on tracing.

I have removed metrics from the RFC so the scope is tighter.

Copy link
Contributor

@denyeart denyeart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@atoulme thanks for addressing the comments. Given three approvals we can enter final comment period. I'll announce on the Fabric mailing list.

@ryjones
Copy link
Member

ryjones commented Sep 2, 2021

this has three approvals and it's been a week - should this be merged? @denyeart

@denyeart
Copy link
Contributor

denyeart commented Sep 3, 2021

@ryjones Yes, this one is ready to merge!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
7 participants