
Add Tracing Feature for Curve #2229

Open
wu-hanqing opened this issue Feb 6, 2023 · 13 comments

@wu-hanqing
Contributor

wu-hanqing commented Feb 6, 2023

  • Description: At present, Curve has logging and metrics, both of which can be used to analyze performance and locate problems. While they improve the observability of the system, their granularity is coarse and does not allow precise analysis of how long a request takes at each stage. Tracing is a powerful tool that links the invocation relationships between services and records invocation time at the level of individual requests, preserving essential information and connecting dispersed log events. It helps us better understand system behavior, debug, and troubleshoot performance issues.
  • Expected Outcome: Design the solution and implement it, introduce it into CurveBS, and use it to analyze the latency of IO requests. The implementation needs to scale well and be applicable to other modules.
  • Recommended Skills: C++, OpenTracing
wu-hanqing added the enhancement improve feature GSOC2023 labels on Feb 6, 2023
@kriti-sc

Hi @wu-hanqing, I would like to pick this up as part of GSOC 2023. Since the idea is there, should I focus on implementation in the proposal?

@wu-hanqing
Contributor Author

> Hi @wu-hanqing, I would like to pick this up as part of GSOC 2023. Since the idea is there, should I focus on implementation in the proposal?

Hi @kriti-sc, I am glad you are interested in this project. But right now there is only an idea and no clear design plan. I think we should first discuss a plan, such as which framework/library to use, and write demos to verify it.

@kriti-sc

kriti-sc commented Mar 2, 2023

Hi @wu-hanqing. I agree with you. Since the idea is already there, I am working on an approach to resolve this issue.

@caoxianfei1
Contributor

@kriti-sc Ok, feel free to try it.

@Ziy1-Tan
Contributor

I want to try it.

@wu-hanqing
Contributor Author

> I want to try it.

Of course, please note the timeline, and feel free to raise any ideas or questions you may have.

@zzzz-vincent

Hi all,
I just found this project and started to look into it. However, I realized that OpenTracing has been archived. Is there a reason for using this library, and do you have any other libraries in mind? I am thinking about OpenCensus, but would like to hear your thoughts.
Thanks.

@wu-hanqing
Contributor Author

> Hi all- just found this project and started to look into it. However, I realized that OpenTracing has been archived? Any reason for using this library and do you have any other libraries in mind? I am thinking about OpenCensus but would like to hear your thoughts. Thanks.

Hi, OpenTracing and OpenCensus have been merged into OpenTelemetry, so you can try that.

@kriti-sc

kriti-sc commented Apr 4, 2023

Hi all, I will be withdrawing from this feature. I outline my initial thoughts below.

The goal is to enable tracing in CurveBS and, using the trace data, analyze the latency of IO requests. There are three components to building the solution:

Instrumentation

This step introduces, into the codebase, the methods needed to trace an IO request as it flows through the system. I intend to use OpenTelemetry for it, as that is the standard today; OpenTracing (which you mention in the issue) has been subsumed by OpenTelemetry. OpenTelemetry has a C++ API and SDK, so I intend to use those. The following are the different pieces involved in gathering trace data (a minimal code sketch follows after the list):

  1. Trace: A trace represents the entire execution path of the request. In the case of an IO request in Curve, a trace would start when the Curve IO call is first made by the user/client. The trace would end when the IO request has been completed and responded to by Curve. Thus, a trace will be started when the IO request makes the first Curve API call and a corresponding unique trace ID will be generated.

  2. Span: A span represents a single unit of work through the entire execution path of the request. A trace may contain multiple spans. For example, to service an IO request, multiple components of Curve are involved and multiple function calls are made within Curve. Each function call will be one span. Each of these spans will have a reference to the trace they are part of. Thus, each function call will be a span and associated with the original trace. A span will be started when a function starts and will end just before the function returns. Each span will contain the start time and end time of the function call.

  3. Context Propagation: A single function usually makes several function calls of its own, so calls are nested. To understand the execution path of a request, it is important to capture this nesting: where the current function was called from and the state of the call stack at that point. This is achieved using context propagation. The relevant telemetry data is stored as context in the calling function and propagated to the callee; in the callee, telemetry data is gathered and added to the context, which is handed back to the caller when the call returns. Thus, context propagation will be done around every traced function call, and the propagated contexts will be the spans corresponding to each function.

These three pieces of information, put together, make up a trace and give us a complete picture of the execution path of a request, along with how long each step took. Custom metrics can be added as well.
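To make this concrete, here is a minimal sketch of what such instrumentation could look like with the OpenTelemetry C++ API. The function names (`HandleWriteRequest`, `SubmitToChunkServer`), the tracer name `curvebs`, and the attribute are hypothetical placeholders rather than actual Curve code, and exact SDK details may vary by version.

```cpp
// Minimal sketch only; function names and attributes are hypothetical,
// not actual Curve code.
#include "opentelemetry/trace/provider.h"

namespace trace_api = opentelemetry::trace;

// A nested unit of work: becomes a child span of whatever span is active.
void SubmitToChunkServer() {
  auto tracer = trace_api::Provider::GetTracerProvider()->GetTracer("curvebs");
  auto span   = tracer->StartSpan("SubmitToChunkServer");
  auto scope  = tracer->WithActiveSpan(span);  // propagate context to nested calls
  // ... send the request to the chunkserver ...
  span->End();
}

// Entry point of an IO request: starts the trace with a root span.
void HandleWriteRequest() {
  auto tracer = trace_api::Provider::GetTracerProvider()->GetTracer("curvebs");
  auto span   = tracer->StartSpan("HandleWriteRequest");
  auto scope  = tracer->WithActiveSpan(span);
  span->SetAttribute("io.size", 4096);  // custom attribute, e.g. request size

  SubmitToChunkServer();  // recorded as a child span of HandleWriteRequest

  span->End();
}
```

For calls that cross process boundaries (e.g. client to MDS or chunkserver over RPC), the span context would additionally need to be injected into and extracted from the RPC metadata with an OpenTelemetry propagator so that the remote side's spans join the same trace.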

Collecting instrumentation data

The trace data is collected on the servers where the application is running and is then transported to a central system that brings together the trace data from all of the servers.
For this purpose, we will use the Jaeger Agent and the Jaeger Collector. The Jaeger Agent will be deployed on the application servers; it will collect the trace data and forward it to the Jaeger Collector. The Jaeger Collector will be the central system that receives the trace data from all of the application servers and processes it. Both the Agent and the Collector support OpenTelemetry formats.
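As a rough sketch of the export side, the SDK could be initialized with an OTLP/gRPC exporter pointing at a collector endpoint. The endpoint below and the choice of a simple (rather than batch) span processor are assumptions for brevity, and they presume a collector that accepts OTLP (recent Jaeger versions do, or an OpenTelemetry Collector can sit in between).

```cpp
// Sketch of SDK/exporter initialization, assuming OTLP over gRPC to a
// collector endpoint. The endpoint is illustrative only.
#include "opentelemetry/exporters/otlp/otlp_grpc_exporter_factory.h"
#include "opentelemetry/sdk/trace/simple_processor_factory.h"
#include "opentelemetry/sdk/trace/tracer_provider_factory.h"
#include "opentelemetry/trace/provider.h"

namespace otlp      = opentelemetry::exporter::otlp;
namespace trace_sdk = opentelemetry::sdk::trace;
namespace trace_api = opentelemetry::trace;

void InitTracing() {
  otlp::OtlpGrpcExporterOptions opts;
  opts.endpoint = "localhost:4317";  // OTLP/gRPC port of the local agent/collector

  auto exporter  = otlp::OtlpGrpcExporterFactory::Create(opts);
  // A batch span processor would be preferable in production; the simple
  // processor keeps the sketch short.
  auto processor = trace_sdk::SimpleSpanProcessorFactory::Create(std::move(exporter));
  std::shared_ptr<trace_api::TracerProvider> provider =
      trace_sdk::TracerProviderFactory::Create(std::move(processor));

  // Make the provider globally visible to the instrumentation code above.
  trace_api::Provider::SetTracerProvider(provider);
}
```

Because the instrumented code only talks to the OpenTelemetry API, the exporter and collector choice can later be changed without touching the instrumentation itself.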

Analyzing instrumentation data

For analysis and visualization, we will again use Jaeger, in particular the Jaeger Query service.

Some implementation considerations by @wu-hanqing:

  1. It needs to have good scalability and be easy to apply to other modules.
  2. The impact on performance needs to be evaluated. If the impact is significant, tracing needs to be able to be turned on or off dynamically (a hypothetical toggle sketch follows below).
  3. Deployment of the related components (OpenTelemetry/Jaeger). If there is sufficient time, it would be best to integrate the deployment process into curveadm.
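For consideration 2, one possible (purely hypothetical) approach to a dynamic switch is to swap the global tracer provider between the real SDK provider and the API's no-op provider, so that disabled tracing costs almost nothing; a sampler-based switch (e.g. AlwaysOn/AlwaysOff or a ratio-based sampler) would be an alternative.

```cpp
// Hypothetical sketch for toggling tracing at runtime: when disabled,
// the no-op provider makes StartSpan() essentially free.
#include "opentelemetry/nostd/shared_ptr.h"
#include "opentelemetry/trace/noop.h"
#include "opentelemetry/trace/provider.h"

namespace trace_api = opentelemetry::trace;
namespace nostd     = opentelemetry::nostd;

void SetTracingEnabled(bool enabled,
                       nostd::shared_ptr<trace_api::TracerProvider> real_provider) {
  if (enabled) {
    // Restore the real (exporting) provider configured at startup.
    trace_api::Provider::SetTracerProvider(real_provider);
  } else {
    // Replace it with the API's built-in no-op provider.
    trace_api::Provider::SetTracerProvider(nostd::shared_ptr<trace_api::TracerProvider>(
        new trace_api::NoopTracerProvider()));
  }
}
```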

kriti-sc removed their assignment on Apr 4, 2023
@Ziy1-Tan
Contributor

Ziy1-Tan commented May 6, 2023

Design docs and PR are available. Welcome to continue :)

@UniverseParticle
Contributor

I want to try it. Please assign it to me.

@Cyber-SiKu
Contributor

@UniverseParticle Have you encountered any difficulties?

@wuhongsong
Contributor

It's difficult, and it will be a hard issue in the Curve summer coding camp.
