[Python Otel] Manage call tracer life cycle use call arena. #37460

Closed
wants to merge 5 commits into from

XuanWang-Amos
Contributor

@XuanWang-Amos XuanWang-Amos commented Aug 12, 2024

We're seeing a segfault in the Python CSM tests:

2024-08-03T09:49:45.720555997Z *** SIGSEGV received at time=1722678585 on cpu 0 ***
2024-08-03T09:49:45.721761998Z PC: @     0x7847ffd5c1c9  (unknown)  (unknown)
2024-08-03T09:49:45.722070502Z     @     0x7847fa309d8c         64  absl::lts_20240116::WriteFailureInfo()
2024-08-03T09:49:45.722175904Z     @     0x7847fa309a15        272  absl::lts_20240116::AbslFailureSignalHandler()
2024-08-03T09:49:45.722187675Z     @     0x7847ffc3d050       1592  (unknown)
2024-08-03T09:49:45.723432238Z     @     0x7847e97f9390  (unknown)  (unknown)
2024-08-03T09:49:45.723487349Z     @ ... and at least 1 more frames
2024-08-03T09:49:45.829702781Z [INFO  tini (1)] Spawned child process '/xds_interop_client' with pid '7'
2024-08-03T09:49:45.829766869Z [DEBUG tini (1)] Received SIGCHLD
2024-08-03T09:49:45.829778749Z [DEBUG tini (1)] Reaped child with pid: '7'
2024-08-03T09:49:45.829787070Z [INFO  tini (1)] Main child exited with signal (with signal 'Segmentation fault')

The issue

After investigation, we found that the call tracer was deleted before RecordEnd was called.

Why this fix

  • To fix this, we decided to use the call arena to manage the life cycle of CallTracer.
  • Since CallTracer is created in another shared object library (grpcio_observability), which doesn't have a dependency on grpc core, we can't use grpc_core::Arena directly when creating the call tracer.
  • As a workaround, we created a wrapper class ClientCallTracerWrapper to wrap the CallTracer, and added another core API, grpc_call_tracer_set_and_manage, so that the life cycle of CallTracer can be managed through the wrapper class.
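
For readers unfamiliar with the pattern, the ownership idea described above can be sketched roughly as follows. This is an illustrative toy, not the code in this PR: `ToyArena`, `ToyCall`, `FakeCallTracer`, `TracerWrapper`, `SetAndManageTracer` and `RunDemo` are all hypothetical names, and the real `grpc_core::Arena`, `ClientCallTracerWrapper` and `grpc_call_tracer_set_and_manage` may differ in signature and behavior. It only assumes that the tracer is allocated outside core, handed over as a raw pointer, and then owned by a wrapper whose lifetime is tied to the call's arena.

```cpp
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a tracer created by a library that cannot
// link against core (and so cannot allocate on the arena itself).
class FakeCallTracer {
 public:
  explicit FakeCallTracer(std::vector<std::string>* log) : log_(log) {}
  ~FakeCallTracer() { log_->push_back("tracer destroyed"); }
  void RecordEnd() { log_->push_back("RecordEnd"); }

 private:
  std::vector<std::string>* log_;
};

// Hypothetical wrapper, defined on the "core" side, that takes ownership of
// the externally created tracer and deletes it in its own destructor.
class TracerWrapper {
 public:
  explicit TracerWrapper(FakeCallTracer* tracer) : tracer_(tracer) {}
  FakeCallTracer* tracer() const { return tracer_.get(); }

 private:
  std::unique_ptr<FakeCallTracer> tracer_;
};

// Toy arena: objects it manages are destroyed only when the arena is.
class ToyArena {
 public:
  ~ToyArena() {
    // Destroy managed objects in reverse order of creation.
    for (auto it = deleters_.rbegin(); it != deleters_.rend(); ++it) (*it)();
  }

  template <typename T, typename... Args>
  T* ManagedNew(Args&&... args) {
    T* p = new T(std::forward<Args>(args)...);
    deleters_.push_back([p] { delete p; });
    return p;
  }

 private:
  std::vector<std::function<void()>> deleters_;
};

// Toy call that owns an arena and reports the end of the call to the tracer.
class ToyCall {
 public:
  // Similar in spirit to a "set and manage" entry point (an assumption):
  // the call's arena takes over the tracer's lifetime via the wrapper.
  void SetAndManageTracer(FakeCallTracer* tracer) {
    wrapper_ = arena_.ManagedNew<TracerWrapper>(tracer);
  }

  void Finish() {
    if (wrapper_ != nullptr) wrapper_->tracer()->RecordEnd();
  }

 private:
  ToyArena arena_;
  TracerWrapper* wrapper_ = nullptr;  // Owned by arena_, not by this pointer.
};

// Runs one call and returns the order in which events happened.
std::vector<std::string> RunDemo() {
  std::vector<std::string> log;
  {
    ToyCall call;
    call.SetAndManageTracer(new FakeCallTracer(&log));
    call.Finish();
  }  // The arena (and with it the wrapper and tracer) is destroyed here.
  return log;
}
```

In this sketch, `RunDemo()` records `RecordEnd` before the tracer's destructor runs, because the tracer cannot be freed until the call's arena goes away. That ordering is the property the fix is after.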

@XuanWang-Amos added the "release notes: yes" label Aug 12, 2024
@XuanWang-Amos marked this pull request as ready for review August 12, 2024 23:17
@gnossen
Contributor

gnossen commented Aug 12, 2024

@XuanWang-Amos For reference, can you please link the PR where the new Core API was introduced into the description?

@XuanWang-Amos
Contributor Author

> @XuanWang-Amos For reference, can you please link the PR where the new Core API was introduced into the description?

It's added in this PR.

src/core/lib/surface/call.h (review thread: outdated, resolved)
XuanWang-Amos added a commit to XuanWang-Amos/grpc that referenced this pull request Aug 14, 2024
Closes grpc#37460

COPYBARA_INTEGRATE_REVIEW=grpc#37460 from XuanWang-Amos:fix_otel_segfault 33c0b98
PiperOrigin-RevId: 662966853
drfloob pushed a commit that referenced this pull request Aug 14, 2024
…backport) (#37479)

Backport of #37460 to v1.66.x.
XuanWang-Amos added a commit that referenced this pull request Aug 14, 2024
…backport) (#37478)

Backport of #37460 to v1.65.x.
sourabhsinghs pushed a commit to sourabhsinghs/grpc that referenced this pull request Sep 26, 2024
Closes grpc#37460

COPYBARA_INTEGRATE_REVIEW=grpc#37460 from XuanWang-Amos:fix_otel_segfault 33c0b98
PiperOrigin-RevId: 662966853
3 participants