
Add Tracing Feature for Curve #2229

Open
wu-hanqing opened this issue Feb 6, 2023 · 13 comments

@wu-hanqing
Contributor

wu-hanqing commented Feb 6, 2023

  • Description: At present, Curve has logging and metrics, both of which can be used to analyze performance and locate problems. While they improve the observability of the system, their granularity is coarse and does not allow precise analysis of how long a request takes at each stage. Tracing is a powerful tool that links the invocation relationships between services and records invocation time at the level of individual requests, preserving essential information and connecting dispersed log events. It helps us better understand system behavior, debug, and troubleshoot performance issues.
  • Expected Outcome: Design the solution and implement it, introduce it into CurveBS, and use it to analyze the latency of IO requests. The implementation needs to scale well and be applicable to other modules.
  • Recommended Skills: C++, OpenTracing
wu-hanqing added the enhancement improve feature GSOC2023 labels on Feb 6, 2023
@kriti-sc

Hi @wu-hanqing, I would like to pick this up as part of GSOC 2023. Since the idea is there, should I focus on implementation in the proposal?

@wu-hanqing
Contributor Author

> Hi @wu-hanqing, I would like to pick this up as part of GSOC 2023. Since the idea is there, should I focus on implementation in the proposal?

Hi @kriti-sc, I am glad you are interested in this project. But right now there is only an idea and no clear design plan. I think we should first discuss a plan, such as which framework/library to use, and write demos to verify it.

@kriti-sc

kriti-sc commented Mar 2, 2023

Hi @wu-hanqing. I agree with you. Since the idea is already there, I am working on an approach to resolve this issue.

@caoxianfei1
Contributor

@kriti-sc Ok, feel free to try it.

@Ziy1-Tan
Contributor

I want to try it.

@wu-hanqing
Contributor Author

> I want to try it.

Of course, please note the timeline, and feel free to raise any ideas or questions you may have.

@zzzz-vincent

Hi all,
I just found this project and started to look into it. However, I realized that OpenTracing has been archived. Is there a reason for using this library, and do you have any other libraries in mind? I am thinking about OpenCensus, but would like to hear your thoughts.
Thanks.

@wu-hanqing
Contributor Author

> Hi all- just found this project and started to look into it. However, I realized that OpenTracing has been archived? Any reason for using this library and do you have any other libraries in mind? I am thinking about OpenCensus but would like to hear your thoughts. Thanks.

Hi, OpenTracing and OpenCensus have been merged into OpenTelemetry, so you can try that.

@kriti-sc

kriti-sc commented Apr 4, 2023

Hi all, I will be withdrawing from this feature. I outline my initial thoughts below.

The goal is to enable tracing in CurveBS and, using the trace data, analyze the latency of IO requests. There are three components to building the solution:

Instrumentation

This step introduces, into the codebase, the methods needed to trace an IO request as it flows through the system. I intend to use OpenTelemetry for it, as that is the standard today; OpenTracing (which you mention in the issue) has been subsumed by OpenTelemetry. OpenTelemetry has a C++ API and SDK, so I intend to use those. The following are the different pieces involved in gathering trace data (a minimal code sketch follows after the list):

  1. Trace: A trace represents the entire execution path of the request. In the case of an IO request in Curve, a trace would start when the Curve IO call is first made by the user/client. The trace would end when the IO request has been completed and responded to by Curve. Thus, a trace will be started when the IO request makes the first Curve API call and a corresponding unique trace ID will be generated.

  2. Span: A span represents a single unit of work through the entire execution path of the request. A trace may contain multiple spans. For example, to service an IO request, multiple components of Curve are involved and multiple function calls are made within Curve. Each function call will be one span. Each of these spans will have a reference to the trace they are part of. Thus, each function call will be a span and associated with the original trace. A span will be started when a function starts and will end just before the function returns. Each span will contain the start time and end time of the function call.

  3. Context Propagation: A single function usually makes several function calls of its own, so calls are nested. To understand the execution path of a request, it is important to capture this nesting: where the current function was called from and the state of the call stack at that point. This is achieved using context propagation. The relevant telemetry data is stored as context in the calling function and propagated to the callee; in the callee, telemetry data is gathered and added to the context, which is handed back to the caller when the call returns. Thus, context propagation will be done around every traced function call, and the propagated contexts will be the spans corresponding to each function.

These three pieces of information, put together, make up a trace and give us a complete picture of the execution path of a request, along with how long each step took. Custom metrics can be added as well.
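To make this concrete, here is a minimal sketch of what such instrumentation could look like with the OpenTelemetry C++ API. The function names (`HandleWriteRequest`, `SubmitToChunkServer`), the tracer name `curvebs`, and the attribute are hypothetical placeholders rather than actual Curve code, and exact SDK details may vary by version.

```cpp
// Minimal sketch only; function names and attributes are hypothetical,
// not actual Curve code.
#include "opentelemetry/trace/provider.h"

namespace trace_api = opentelemetry::trace;

// A nested unit of work: becomes a child span of whatever span is active.
void SubmitToChunkServer() {
  auto tracer = trace_api::Provider::GetTracerProvider()->GetTracer("curvebs");
  auto span   = tracer->StartSpan("SubmitToChunkServer");
  auto scope  = tracer->WithActiveSpan(span);  // propagate context to nested calls
  // ... send the request to the chunkserver ...
  span->End();
}

// Entry point of an IO request: starts the trace with a root span.
void HandleWriteRequest() {
  auto tracer = trace_api::Provider::GetTracerProvider()->GetTracer("curvebs");
  auto span   = tracer->StartSpan("HandleWriteRequest");
  auto scope  = tracer->WithActiveSpan(span);
  span->SetAttribute("io.size", 4096);  // custom attribute, e.g. request size

  SubmitToChunkServer();  // recorded as a child span of HandleWriteRequest

  span->End();
}
```

For calls that cross process boundaries (e.g. client to MDS or chunkserver over RPC), the span context would additionally need to be injected into and extracted from the RPC metadata with an OpenTelemetry propagator so that the remote side's spans join the same trace.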

Collecting instrumentation data

The trace data is collected on the servers where the application is running and is then transported to a central system that brings together the trace data from all of the servers.
For this purpose, we will use the Jaeger Agent and the Jaeger Collector. The Jaeger Agent will be deployed on the application servers; it will collect the trace data and forward it to the Jaeger Collector. The Jaeger Collector will be the central system that receives the trace data from all of the application servers and processes it. Both the Agent and the Collector support OpenTelemetry formats.
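As a rough sketch of the export side, the SDK could be initialized with an OTLP/gRPC exporter pointing at a collector endpoint. The endpoint below and the choice of a simple (rather than batch) span processor are assumptions for brevity, and they presume a collector that accepts OTLP (recent Jaeger versions do, or an OpenTelemetry Collector can sit in between).

```cpp
// Sketch of SDK/exporter initialization, assuming OTLP over gRPC to a
// collector endpoint. The endpoint is illustrative only.
#include "opentelemetry/exporters/otlp/otlp_grpc_exporter_factory.h"
#include "opentelemetry/sdk/trace/simple_processor_factory.h"
#include "opentelemetry/sdk/trace/tracer_provider_factory.h"
#include "opentelemetry/trace/provider.h"

namespace otlp      = opentelemetry::exporter::otlp;
namespace trace_sdk = opentelemetry::sdk::trace;
namespace trace_api = opentelemetry::trace;

void InitTracing() {
  otlp::OtlpGrpcExporterOptions opts;
  opts.endpoint = "localhost:4317";  // OTLP/gRPC port of the local agent/collector

  auto exporter  = otlp::OtlpGrpcExporterFactory::Create(opts);
  // A batch span processor would be preferable in production; the simple
  // processor keeps the sketch short.
  auto processor = trace_sdk::SimpleSpanProcessorFactory::Create(std::move(exporter));
  std::shared_ptr<trace_api::TracerProvider> provider =
      trace_sdk::TracerProviderFactory::Create(std::move(processor));

  // Make the provider globally visible to the instrumentation code above.
  trace_api::Provider::SetTracerProvider(provider);
}
```

Because the instrumented code only talks to the OpenTelemetry API, the exporter and collector choice can later be changed without touching the instrumentation itself.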

Analyzing instrumentation data

For analysis and visualization, we will again use Jaeger, in particular the Jaeger Query service.

Some implementation considerations by @wu-hanqing:

  1. It needs to have good scalability and be easy to apply to other modules.
  2. The impact on performance needs to be evaluated. If the impact is significant, tracing needs to be able to be turned on or off dynamically (a hypothetical toggle sketch follows below).
  3. Deployment of the related components (OpenTelemetry/Jaeger). If there is sufficient time, it would be best to integrate the deployment process into curveadm.
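For consideration 2, one possible (purely hypothetical) approach to a dynamic switch is to swap the global tracer provider between the real SDK provider and the API's no-op provider, so that disabled tracing costs almost nothing; a sampler-based switch (e.g. AlwaysOn/AlwaysOff or a ratio-based sampler) would be an alternative.

```cpp
// Hypothetical sketch for toggling tracing at runtime: when disabled,
// the no-op provider makes StartSpan() essentially free.
#include "opentelemetry/nostd/shared_ptr.h"
#include "opentelemetry/trace/noop.h"
#include "opentelemetry/trace/provider.h"

namespace trace_api = opentelemetry::trace;
namespace nostd     = opentelemetry::nostd;

void SetTracingEnabled(bool enabled,
                       nostd::shared_ptr<trace_api::TracerProvider> real_provider) {
  if (enabled) {
    // Restore the real (exporting) provider configured at startup.
    trace_api::Provider::SetTracerProvider(real_provider);
  } else {
    // Replace it with the API's built-in no-op provider.
    trace_api::Provider::SetTracerProvider(nostd::shared_ptr<trace_api::TracerProvider>(
        new trace_api::NoopTracerProvider()));
  }
}
```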

kriti-sc removed their assignment on Apr 4, 2023
@Ziy1-Tan
Contributor

Ziy1-Tan commented May 6, 2023

Design docs and PR are available. Welcome to continue :)

@UniverseParticle
Contributor

I want to try it. Please assign it to me.

@Cyber-SiKu
Contributor

@UniverseParticle Have you encountered any difficulties?

@wuhongsong
Contributor

It's difficult, and it will be a hard issue in the Curve summer coding camp.
