Skip to content

Implement new tls tracing method behind stirling cli flag (feature toggle)#1123

Merged
JamesMBartlett merged 8 commits intopixie-io:mainfrom
ddelnano:ddelnano/add-feature-flag-to-opt-into-new-tls-tracing-method
May 1, 2023
Merged

Implement new tls tracing method behind stirling cli flag (feature toggle)#1123
JamesMBartlett merged 8 commits intopixie-io:mainfrom
ddelnano:ddelnano/add-feature-flag-to-opt-into-new-tls-tracing-method

Conversation

@ddelnano
Copy link
Copy Markdown
Member

@ddelnano ddelnano commented Mar 29, 2023

Summary: Implement new tls tracing method behind stirling cli flag (feature toggle)

The tls tracing method added in this PR determines a connection's identity (socket fd) through a different mechanism from our existing tracing. Instead of relying on struct offsets of user space data structures, it accesses the socket fd via the underlying socket syscalls while SSL_write / SSL_read calls occur. This is a prerequisite to support BoringSSL because its rolling release style makes the previous method of user space offsets untenable. This has the added benefit of reducing our maintenance cost for our existing OpenSSL tracing. Assuming future versions of OpenSSL maintain the same contract, we will not require any code changes to support them -- Pixie will gain OpenSSL v3 support once this new tracing is the default (as mentioned in the later testing).

Our assumption is that the applications that explicitly set the socket fd on the SSL struct (the applications supported by our previous tracing technique) -- nginx, python all use Openssl with its native BIO interface. In order to verify that assumption, this change will be feature flagged and monitored carefully as its enabled on internal clusters. The existing conn_stats_bytes metric will be monitored for volume of tls traffic traced to verify there is no loss of instrumentation coverage between the new and old method.

Relevant Issues: #692

Type of change: /kind feature

Test Plan: Verified the following

@ddelnano ddelnano marked this pull request as ready for review March 29, 2023 17:55
@ddelnano ddelnano requested a review from a team March 29, 2023 17:55
Copy link
Copy Markdown
Member

@JamesMBartlett JamesMBartlett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that there's two flags to control the new behaviour: a build time flag and a runtime flag. It seems to me like both flags need to be set for the new behaviour to be enabled. As far as I can tell, if the build-time flag is enabled but the runtime flag is not, it will attempt to trace openssl v3 traffic, but fail since the new fd tracing is not enabled (maybe?). And if only the runtime flag is enabled it won't attempt to trace openssl v3 traffic. If this is the case I think it might make sense to just have a runtime flag and not have a build time flag. Additionally, you won't be able to change the build-time flag with launchdarkly, so that side of things will be easier with just a runtime flag.

Comment thread src/stirling/source_connectors/socket_tracer/BUILD.bazel
@ddelnano
Copy link
Copy Markdown
Member Author

My understanding is that there's two flags to control the new behaviour: a build time flag and a runtime flag

@JamesMBartlett regarding your comment above, is the build time flag you are referencing the enable_openssl_v3_testing config setting? If not, can you please identify what you are referencing?

The enable_openssl_v3_testing setting is temporary and only intended for running tests. If kLibSSLMatchers was configurable at runtime, I would have opted for the test to override its entries rather than using a build setting.

This new method of tls tracing should provide the same coverage as our existing method, meaning that a PEM should trace the same amount of bytes before and after the change. However, I want to verify that assumption. In order to test that, I intend to roll out the new implementation while keeping our supported tls tracing targets constant (OpenSSL v3 must remain disabled as the feature toggle is enabled). Once the new implementation is fully rolled out and the coverage is validated against our existing method, we can remove the config setting and release OpenSSL v3 support.

@JamesMBartlett
Copy link
Copy Markdown
Member

My understanding is that there's two flags to control the new behaviour: a build time flag and a runtime flag

@JamesMBartlett regarding your comment above, is the build time flag you are referencing the enable_openssl_v3_testing config setting? If not, can you please identify what you are referencing?

Yes that's the flag I was referring to.

The enable_openssl_v3_testing setting is temporary and only intended for running tests. If kLibSSLMatchers was configurable at runtime, I would have opted for the test to override its entries rather than using a build setting.

This new method of tls tracing should provide the same coverage as our existing method, meaning that a PEM should trace the same amount of bytes before and after the change. However, I want to verify that assumption. In order to test that, I intend to roll out the new implementation while keeping our supported tls tracing targets constant (OpenSSL v3 must remain disabled as the feature toggle is enabled). Once the new implementation is fully rolled out and the coverage is validated against our existing method, we can remove the config setting and release OpenSSL v3 support.

I see, I misunderstood the purpose since kLibSSLMatchers is not in a testing file, but this makes sense now.

Comment thread src/stirling/source_connectors/socket_tracer/openssl_trace_bpf_test.cc Outdated
Comment thread src/stirling/source_connectors/socket_tracer/bcc_bpf/openssl_trace.c Outdated
@ddelnano ddelnano marked this pull request as draft March 30, 2023 14:32
@ddelnano
Copy link
Copy Markdown
Member Author

Moving this back to a draft since there is some additional testing the Stirling team wants to do prior to pushing forward with this.

JamesMBartlett pushed a commit that referenced this pull request Apr 6, 2023
…tls tracing method (#1161)

Summary: Instrument OpenSSL tracing to detect validity of assumptions
for new tls tracing method

After discussing #1123 with @oazizi000 and @etep, we decided that invest
in stronger validation that the assumptions for that PR and our future
tls tracing are valid. Instead of relying on struct offsets of user
space data structures, the new tracing method will access a connection's
socket fd from the underlying socket syscalls during `SSL_write` /
`SSL_read` calls.

This technique should only be compatible with "BIO native" OpenSSL use
cases, which are the OpenSSL use cases Pixie supports today. BIO native
means that a compatible application uses a
[BIO](https://wiki.openssl.org/index.php/BIO) provided by OpenSSL (via
[SSL_set_fd](https://www.openssl.org/docs/manmaster/man3/SSL_set_fd.html))
and results in OpenSSL issuing read/write syscalls on your behalf for
the underlying socket.

There are two situations we wanted to understand prior to proceeding
with #1123: how custom BIO implementations behave (netty, nodejs, etc)
and detecting unrelated (non socket) syscalls while `SSL_write` and
`SSL_read` are on the stack. The former was verified with an experiment
based on this change and is described in the Test Plan section below. We
believe the latter should not occur, but the changes in this PR
instrument this situation in order to detect if doest occur. Assuming
this condition isn't encountered, we will proceed with #1132 (with the
minor changes learned from this experiment).

Relevant Issues: #692

Type of change: /kind feature

Test Plan: Verified the following:
- [x] `DCHECK` added to `socket_trace_connector` does not detect
mismatched file descriptors for `openssl_trace_bpf_test` and
`netty_tls_trace_bpf_test` ([P345](https://phab.corp.pixielabs.ai/P345)
-- openssl_trace_bpf_test is still disabled - #699)
- [x] Verified with a BPF histogram that the number of active syscalls
within SSL_write/SSL_read are non zero for BIO native use cases and zero
for non BIO native cases (nodejs, netty). See more details in #1160

---------

Signed-off-by: Dom Del Nano <ddelnano@pixielabs.ai>
@ddelnano ddelnano marked this pull request as ready for review April 27, 2023 14:34
@ddelnano
Copy link
Copy Markdown
Member Author

@etep this is ready for review again.

Comment thread src/stirling/source_connectors/socket_tracer/openssl_trace_bpf_test.cc Outdated
@ddelnano ddelnano requested review from a team and JamesMBartlett April 28, 2023 18:18
@ddelnano ddelnano force-pushed the ddelnano/add-feature-flag-to-opt-into-new-tls-tracing-method branch from b11d05d to 38b5ea2 Compare May 1, 2023 17:23
@ddelnano ddelnano requested a review from a team as a code owner May 1, 2023 17:23
ddelnano added a commit to ddelnano/pixie that referenced this pull request May 1, 2023
…tls tracing method (pixie-io#1161)

Summary: Instrument OpenSSL tracing to detect validity of assumptions
for new tls tracing method

After discussing pixie-io#1123 with @oazizi000 and @etep, we decided that invest
in stronger validation that the assumptions for that PR and our future
tls tracing are valid. Instead of relying on struct offsets of user
space data structures, the new tracing method will access a connection's
socket fd from the underlying socket syscalls during `SSL_write` /
`SSL_read` calls.

This technique should only be compatible with "BIO native" OpenSSL use
cases, which are the OpenSSL use cases Pixie supports today. BIO native
means that a compatible application uses a
[BIO](https://wiki.openssl.org/index.php/BIO) provided by OpenSSL (via
[SSL_set_fd](https://www.openssl.org/docs/manmaster/man3/SSL_set_fd.html))
and results in OpenSSL issuing read/write syscalls on your behalf for
the underlying socket.

There are two situations we wanted to understand prior to proceeding
with pixie-io#1123: how custom BIO implementations behave (netty, nodejs, etc)
and detecting unrelated (non socket) syscalls while `SSL_write` and
`SSL_read` are on the stack. The former was verified with an experiment
based on this change and is described in the Test Plan section below. We
believe the latter should not occur, but the changes in this PR
instrument this situation in order to detect if doest occur. Assuming
this condition isn't encountered, we will proceed with pixie-io#1132 (with the
minor changes learned from this experiment).

Relevant Issues: pixie-io#692

Type of change: /kind feature

Test Plan: Verified the following:
- [x] `DCHECK` added to `socket_trace_connector` does not detect
mismatched file descriptors for `openssl_trace_bpf_test` and
`netty_tls_trace_bpf_test` ([P345](https://phab.corp.pixielabs.ai/P345)
-- openssl_trace_bpf_test is still disabled - pixie-io#699)
- [x] Verified with a BPF histogram that the number of active syscalls
within SSL_write/SSL_read are non zero for BIO native use cases and zero
for non BIO native cases (nodejs, netty). See more details in pixie-io#1160

---------

Signed-off-by: Dom Del Nano <ddelnano@pixielabs.ai>
ddelnano and others added 8 commits May 1, 2023 17:30
…toggle)

Signed-off-by: Dom Del Nano <ddelnano@pixielabs.ai>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@pixielabs.ai>
Signed-off-by: Dom Del Nano <ddelnano@pixielabs.ai>
Signed-off-by: Dom Del Nano <ddelnano@pixielabs.ai>
@ddelnano ddelnano force-pushed the ddelnano/add-feature-flag-to-opt-into-new-tls-tracing-method branch from 38b5ea2 to 927d232 Compare May 1, 2023 17:32
@JamesMBartlett JamesMBartlett merged commit 0c8d702 into pixie-io:main May 1, 2023
vihangm pushed a commit that referenced this pull request May 8, 2023
Summary: Remove update to BPF map not used in status quo tls tracing

In order to vet the new style of tls tracing, we introduced a mechanism
for detecting mismatched fds (#1161). This instrumented all of our tls
tracing when it was first developed. When the new method of tls tracing
was introduced, we removed the mismatched fd detection from the status
quo tls tracing (#1123), however, this BPF map update was missed in that
refactor (#1123).

Relevant Issues: #692

Type of change: /kind bug

Test Plan: Existing tests pass

Signed-off-by: Dom Del Nano <ddelnano@pixielabs.ai>
ddelnano added a commit to k8sstormcenter/pixie that referenced this pull request Feb 25, 2026
…tls tracing method (pixie-io#1161)

Summary: Instrument OpenSSL tracing to detect validity of assumptions
for new tls tracing method

After discussing pixie-io#1123 with @oazizi000 and @etep, we decided that invest
in stronger validation that the assumptions for that PR and our future
tls tracing are valid. Instead of relying on struct offsets of user
space data structures, the new tracing method will access a connection's
socket fd from the underlying socket syscalls during `SSL_write` /
`SSL_read` calls.

This technique should only be compatible with "BIO native" OpenSSL use
cases, which are the OpenSSL use cases Pixie supports today. BIO native
means that a compatible application uses a
[BIO](https://wiki.openssl.org/index.php/BIO) provided by OpenSSL (via
[SSL_set_fd](https://www.openssl.org/docs/manmaster/man3/SSL_set_fd.html))
and results in OpenSSL issuing read/write syscalls on your behalf for
the underlying socket.

There are two situations we wanted to understand prior to proceeding
with pixie-io#1123: how custom BIO implementations behave (netty, nodejs, etc)
and detecting unrelated (non socket) syscalls while `SSL_write` and
`SSL_read` are on the stack. The former was verified with an experiment
based on this change and is described in the Test Plan section below. We
believe the latter should not occur, but the changes in this PR
instrument this situation in order to detect if doest occur. Assuming
this condition isn't encountered, we will proceed with pixie-io#1132 (with the
minor changes learned from this experiment).

Relevant Issues: pixie-io#692

Type of change: /kind feature

Test Plan: Verified the following:
- [x] `DCHECK` added to `socket_trace_connector` does not detect
mismatched file descriptors for `openssl_trace_bpf_test` and
`netty_tls_trace_bpf_test` ([P345](https://phab.corp.pixielabs.ai/P345)
-- openssl_trace_bpf_test is still disabled - pixie-io#699)
- [x] Verified with a BPF histogram that the number of active syscalls
within SSL_write/SSL_read are non zero for BIO native use cases and zero
for non BIO native cases (nodejs, netty). See more details in pixie-io#1160

---------

Signed-off-by: Dom Del Nano <ddelnano@pixielabs.ai>
ddelnano added a commit to k8sstormcenter/pixie that referenced this pull request Feb 25, 2026
…ggle) (pixie-io#1123)

Summary: Implement new tls tracing method behind stirling cli flag
(feature toggle)

The tls tracing method added in this PR determines a connection's
identity (socket fd) through a different mechanism from our existing
tracing. Instead of relying on struct offsets of user space data
structures, it accesses the socket fd via the underlying socket syscalls
while `SSL_write` / `SSL_read` calls occur. This is a prerequisite to
support BoringSSL because its rolling release style makes the previous
method of user space offsets untenable. This has the added benefit of
reducing our maintenance cost for our existing OpenSSL tracing. Assuming
future versions of OpenSSL maintain the same contract, we will not
require any code changes to support them -- Pixie will gain OpenSSL v3
support once this new tracing is the default (as mentioned in the later
testing).

Our assumption is that the applications that explicitly set the socket
fd on the SSL struct (the applications supported by our previous tracing
technique) --
[nginx](https://github.com/nginx/nginx/blob/dfe70f74a3558f05142fb552cea239add123d414/src/event/ngx_event_openssl.c#L1696),
[python](https://github.com/python/cpython/blob/e375bff03736f809fbc234010c087ef9d7e0d384/Modules/_ssl.c#L836)
all use Openssl with its native BIO interface. In order to verify that
assumption, this change will be feature flagged and monitored carefully
as its enabled on internal clusters. The existing [conn_stats_bytes
metric](https://github.com/pixie-io/pixie/blob/f45ced1803e6e44406f20f1171c15a24f4d5a17a/src/stirling/source_connectors/socket_tracer/metrics.cc#L62)
will be monitored for volume of tls traffic traced to verify there is no
loss of instrumentation coverage between the new and old method.

Relevant Issues: pixie-io#692

Type of change: /kind feature

Test Plan: Verified the following
- [x] New test passes with `enable_openssl_v3_testing` bool flag enabled
([P336](https://phab.corp.pixielabs.ai/P336)). This verifies that the
new tracing technique and the feature toggle are functional since our
previous tracing does not support OpenSSL v3
- [x] Existing `openssl_trace_bpf_test` and `netty_tls_trace_bpf_test`
passes ([P335](https://phab.corp.pixielabs.ai/P335)). This is explicitly
mentioned since `openssl_trace_bpf_test` is still disabled pixie-io#699
- [x] Verified the `bool_flag` added from PR feedback conditionally
enables the OpenSSL v3 testing
([P337](https://phab.corp.pixielabs.ai/P337))
- [x] Validate metrics from pixie-io#1161 to verify our assumptions are correct

---------

Signed-off-by: Dom Del Nano <ddelnano@pixielabs.ai>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
ddelnano added a commit to k8sstormcenter/pixie that referenced this pull request Feb 25, 2026
…#1298)

Summary: Remove update to BPF map not used in status quo tls tracing

In order to vet the new style of tls tracing, we introduced a mechanism
for detecting mismatched fds (pixie-io#1161). This instrumented all of our tls
tracing when it was first developed. When the new method of tls tracing
was introduced, we removed the mismatched fd detection from the status
quo tls tracing (pixie-io#1123), however, this BPF map update was missed in that
refactor (pixie-io#1123).

Relevant Issues: pixie-io#692

Type of change: /kind bug

Test Plan: Existing tests pass

Signed-off-by: Dom Del Nano <ddelnano@pixielabs.ai>
ddelnano added a commit to k8sstormcenter/pixie that referenced this pull request Feb 25, 2026
…tls tracing method (pixie-io#1161)

Summary: Instrument OpenSSL tracing to detect validity of assumptions
for new tls tracing method

After discussing pixie-io#1123 with @oazizi000 and @etep, we decided that invest
in stronger validation that the assumptions for that PR and our future
tls tracing are valid. Instead of relying on struct offsets of user
space data structures, the new tracing method will access a connection's
socket fd from the underlying socket syscalls during `SSL_write` /
`SSL_read` calls.

This technique should only be compatible with "BIO native" OpenSSL use
cases, which are the OpenSSL use cases Pixie supports today. BIO native
means that a compatible application uses a
[BIO](https://wiki.openssl.org/index.php/BIO) provided by OpenSSL (via
[SSL_set_fd](https://www.openssl.org/docs/manmaster/man3/SSL_set_fd.html))
and results in OpenSSL issuing read/write syscalls on your behalf for
the underlying socket.

There are two situations we wanted to understand prior to proceeding
with pixie-io#1123: how custom BIO implementations behave (netty, nodejs, etc)
and detecting unrelated (non socket) syscalls while `SSL_write` and
`SSL_read` are on the stack. The former was verified with an experiment
based on this change and is described in the Test Plan section below. We
believe the latter should not occur, but the changes in this PR
instrument this situation in order to detect if doest occur. Assuming
this condition isn't encountered, we will proceed with pixie-io#1132 (with the
minor changes learned from this experiment).

Relevant Issues: pixie-io#692

Type of change: /kind feature

Test Plan: Verified the following:
- [x] `DCHECK` added to `socket_trace_connector` does not detect
mismatched file descriptors for `openssl_trace_bpf_test` and
`netty_tls_trace_bpf_test` ([P345](https://phab.corp.pixielabs.ai/P345)
-- openssl_trace_bpf_test is still disabled - pixie-io#699)
- [x] Verified with a BPF histogram that the number of active syscalls
within SSL_write/SSL_read are non zero for BIO native use cases and zero
for non BIO native cases (nodejs, netty). See more details in pixie-io#1160

---------

Signed-off-by: Dom Del Nano <ddelnano@pixielabs.ai>
ddelnano added a commit to k8sstormcenter/pixie that referenced this pull request Feb 25, 2026
…ggle) (pixie-io#1123)

Summary: Implement new tls tracing method behind stirling cli flag
(feature toggle)

The tls tracing method added in this PR determines a connection's
identity (socket fd) through a different mechanism from our existing
tracing. Instead of relying on struct offsets of user space data
structures, it accesses the socket fd via the underlying socket syscalls
while `SSL_write` / `SSL_read` calls occur. This is a prerequisite to
support BoringSSL because its rolling release style makes the previous
method of user space offsets untenable. This has the added benefit of
reducing our maintenance cost for our existing OpenSSL tracing. Assuming
future versions of OpenSSL maintain the same contract, we will not
require any code changes to support them -- Pixie will gain OpenSSL v3
support once this new tracing is the default (as mentioned in the later
testing).

Our assumption is that the applications that explicitly set the socket
fd on the SSL struct (the applications supported by our previous tracing
technique) --
[nginx](https://github.com/nginx/nginx/blob/dfe70f74a3558f05142fb552cea239add123d414/src/event/ngx_event_openssl.c#L1696),
[python](https://github.com/python/cpython/blob/e375bff03736f809fbc234010c087ef9d7e0d384/Modules/_ssl.c#L836)
all use Openssl with its native BIO interface. In order to verify that
assumption, this change will be feature flagged and monitored carefully
as its enabled on internal clusters. The existing [conn_stats_bytes
metric](https://github.com/pixie-io/pixie/blob/f45ced1803e6e44406f20f1171c15a24f4d5a17a/src/stirling/source_connectors/socket_tracer/metrics.cc#L62)
will be monitored for volume of tls traffic traced to verify there is no
loss of instrumentation coverage between the new and old method.

Relevant Issues: pixie-io#692

Type of change: /kind feature

Test Plan: Verified the following
- [x] New test passes with `enable_openssl_v3_testing` bool flag enabled
([P336](https://phab.corp.pixielabs.ai/P336)). This verifies that the
new tracing technique and the feature toggle are functional since our
previous tracing does not support OpenSSL v3
- [x] Existing `openssl_trace_bpf_test` and `netty_tls_trace_bpf_test`
passes ([P335](https://phab.corp.pixielabs.ai/P335)). This is explicitly
mentioned since `openssl_trace_bpf_test` is still disabled pixie-io#699
- [x] Verified the `bool_flag` added from PR feedback conditionally
enables the OpenSSL v3 testing
([P337](https://phab.corp.pixielabs.ai/P337))
- [x] Validate metrics from pixie-io#1161 to verify our assumptions are correct

---------

Signed-off-by: Dom Del Nano <ddelnano@pixielabs.ai>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
ddelnano added a commit to k8sstormcenter/pixie that referenced this pull request Feb 25, 2026
…#1298)

Summary: Remove update to BPF map not used in status quo tls tracing

In order to vet the new style of tls tracing, we introduced a mechanism
for detecting mismatched fds (pixie-io#1161). This instrumented all of our tls
tracing when it was first developed. When the new method of tls tracing
was introduced, we removed the mismatched fd detection from the status
quo tls tracing (pixie-io#1123), however, this BPF map update was missed in that
refactor (pixie-io#1123).

Relevant Issues: pixie-io#692

Type of change: /kind bug

Test Plan: Existing tests pass

Signed-off-by: Dom Del Nano <ddelnano@pixielabs.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants