Implement new tls tracing method behind stirling cli flag (feature toggle)#1123
Conversation
JamesMBartlett
left a comment
There was a problem hiding this comment.
My understanding is that there's two flags to control the new behaviour: a build time flag and a runtime flag. It seems to me like both flags need to be set for the new behaviour to be enabled. As far as I can tell, if the build-time flag is enabled but the runtime flag is not, it will attempt to trace openssl v3 traffic, but fail since the new fd tracing is not enabled (maybe?). And if only the runtime flag is enabled it won't attempt to trace openssl v3 traffic. If this is the case I think it might make sense to just have a runtime flag and not have a build time flag. Additionally, you won't be able to change the build-time flag with launchdarkly, so that side of things will be easier with just a runtime flag.
@JamesMBartlett regarding your comment above, is the build time flag you are referencing the The This new method of tls tracing should provide the same coverage as our existing method, meaning that a PEM should trace the same amount of bytes before and after the change. However, I want to verify that assumption. In order to test that, I intend to roll out the new implementation while keeping our supported tls tracing targets constant (OpenSSL v3 must remain disabled as the feature toggle is enabled). Once the new implementation is fully rolled out and the coverage is validated against our existing method, we can remove the config setting and release OpenSSL v3 support. |
Yes that's the flag I was referring to.
I see, I misunderstood the purpose since |
|
Moving this back to a draft since there is some additional testing the Stirling team wants to do prior to pushing forward with this. |
…tls tracing method (#1161) Summary: Instrument OpenSSL tracing to detect validity of assumptions for new tls tracing method After discussing #1123 with @oazizi000 and @etep, we decided that invest in stronger validation that the assumptions for that PR and our future tls tracing are valid. Instead of relying on struct offsets of user space data structures, the new tracing method will access a connection's socket fd from the underlying socket syscalls during `SSL_write` / `SSL_read` calls. This technique should only be compatible with "BIO native" OpenSSL use cases, which are the OpenSSL use cases Pixie supports today. BIO native means that a compatible application uses a [BIO](https://wiki.openssl.org/index.php/BIO) provided by OpenSSL (via [SSL_set_fd](https://www.openssl.org/docs/manmaster/man3/SSL_set_fd.html)) and results in OpenSSL issuing read/write syscalls on your behalf for the underlying socket. There are two situations we wanted to understand prior to proceeding with #1123: how custom BIO implementations behave (netty, nodejs, etc) and detecting unrelated (non socket) syscalls while `SSL_write` and `SSL_read` are on the stack. The former was verified with an experiment based on this change and is described in the Test Plan section below. We believe the latter should not occur, but the changes in this PR instrument this situation in order to detect if doest occur. Assuming this condition isn't encountered, we will proceed with #1132 (with the minor changes learned from this experiment). Relevant Issues: #692 Type of change: /kind feature Test Plan: Verified the following: - [x] `DCHECK` added to `socket_trace_connector` does not detect mismatched file descriptors for `openssl_trace_bpf_test` and `netty_tls_trace_bpf_test` ([P345](https://phab.corp.pixielabs.ai/P345) -- openssl_trace_bpf_test is still disabled - #699) - [x] Verified with a BPF histogram that the number of active syscalls within SSL_write/SSL_read are non zero for BIO native use cases and zero for non BIO native cases (nodejs, netty). See more details in #1160 --------- Signed-off-by: Dom Del Nano <ddelnano@pixielabs.ai>
|
@etep this is ready for review again. |
b11d05d to
38b5ea2
Compare
…tls tracing method (pixie-io#1161) Summary: Instrument OpenSSL tracing to detect validity of assumptions for new tls tracing method After discussing pixie-io#1123 with @oazizi000 and @etep, we decided that invest in stronger validation that the assumptions for that PR and our future tls tracing are valid. Instead of relying on struct offsets of user space data structures, the new tracing method will access a connection's socket fd from the underlying socket syscalls during `SSL_write` / `SSL_read` calls. This technique should only be compatible with "BIO native" OpenSSL use cases, which are the OpenSSL use cases Pixie supports today. BIO native means that a compatible application uses a [BIO](https://wiki.openssl.org/index.php/BIO) provided by OpenSSL (via [SSL_set_fd](https://www.openssl.org/docs/manmaster/man3/SSL_set_fd.html)) and results in OpenSSL issuing read/write syscalls on your behalf for the underlying socket. There are two situations we wanted to understand prior to proceeding with pixie-io#1123: how custom BIO implementations behave (netty, nodejs, etc) and detecting unrelated (non socket) syscalls while `SSL_write` and `SSL_read` are on the stack. The former was verified with an experiment based on this change and is described in the Test Plan section below. We believe the latter should not occur, but the changes in this PR instrument this situation in order to detect if doest occur. Assuming this condition isn't encountered, we will proceed with pixie-io#1132 (with the minor changes learned from this experiment). Relevant Issues: pixie-io#692 Type of change: /kind feature Test Plan: Verified the following: - [x] `DCHECK` added to `socket_trace_connector` does not detect mismatched file descriptors for `openssl_trace_bpf_test` and `netty_tls_trace_bpf_test` ([P345](https://phab.corp.pixielabs.ai/P345) -- openssl_trace_bpf_test is still disabled - pixie-io#699) - [x] Verified with a BPF histogram that the number of active syscalls within SSL_write/SSL_read are non zero for BIO native use cases and zero for non BIO native cases (nodejs, netty). See more details in pixie-io#1160 --------- Signed-off-by: Dom Del Nano <ddelnano@pixielabs.ai>
…toggle) Signed-off-by: Dom Del Nano <ddelnano@pixielabs.ai>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@pixielabs.ai>
Signed-off-by: Dom Del Nano <ddelnano@pixielabs.ai>
Signed-off-by: Dom Del Nano <ddelnano@pixielabs.ai>
38b5ea2 to
927d232
Compare
Summary: Remove update to BPF map not used in status quo tls tracing In order to vet the new style of tls tracing, we introduced a mechanism for detecting mismatched fds (#1161). This instrumented all of our tls tracing when it was first developed. When the new method of tls tracing was introduced, we removed the mismatched fd detection from the status quo tls tracing (#1123), however, this BPF map update was missed in that refactor (#1123). Relevant Issues: #692 Type of change: /kind bug Test Plan: Existing tests pass Signed-off-by: Dom Del Nano <ddelnano@pixielabs.ai>
…tls tracing method (pixie-io#1161) Summary: Instrument OpenSSL tracing to detect validity of assumptions for new tls tracing method After discussing pixie-io#1123 with @oazizi000 and @etep, we decided that invest in stronger validation that the assumptions for that PR and our future tls tracing are valid. Instead of relying on struct offsets of user space data structures, the new tracing method will access a connection's socket fd from the underlying socket syscalls during `SSL_write` / `SSL_read` calls. This technique should only be compatible with "BIO native" OpenSSL use cases, which are the OpenSSL use cases Pixie supports today. BIO native means that a compatible application uses a [BIO](https://wiki.openssl.org/index.php/BIO) provided by OpenSSL (via [SSL_set_fd](https://www.openssl.org/docs/manmaster/man3/SSL_set_fd.html)) and results in OpenSSL issuing read/write syscalls on your behalf for the underlying socket. There are two situations we wanted to understand prior to proceeding with pixie-io#1123: how custom BIO implementations behave (netty, nodejs, etc) and detecting unrelated (non socket) syscalls while `SSL_write` and `SSL_read` are on the stack. The former was verified with an experiment based on this change and is described in the Test Plan section below. We believe the latter should not occur, but the changes in this PR instrument this situation in order to detect if doest occur. Assuming this condition isn't encountered, we will proceed with pixie-io#1132 (with the minor changes learned from this experiment). Relevant Issues: pixie-io#692 Type of change: /kind feature Test Plan: Verified the following: - [x] `DCHECK` added to `socket_trace_connector` does not detect mismatched file descriptors for `openssl_trace_bpf_test` and `netty_tls_trace_bpf_test` ([P345](https://phab.corp.pixielabs.ai/P345) -- openssl_trace_bpf_test is still disabled - pixie-io#699) - [x] Verified with a BPF histogram that the number of active syscalls within SSL_write/SSL_read are non zero for BIO native use cases and zero for non BIO native cases (nodejs, netty). See more details in pixie-io#1160 --------- Signed-off-by: Dom Del Nano <ddelnano@pixielabs.ai>
…ggle) (pixie-io#1123) Summary: Implement new tls tracing method behind stirling cli flag (feature toggle) The tls tracing method added in this PR determines a connection's identity (socket fd) through a different mechanism from our existing tracing. Instead of relying on struct offsets of user space data structures, it accesses the socket fd via the underlying socket syscalls while `SSL_write` / `SSL_read` calls occur. This is a prerequisite to support BoringSSL because its rolling release style makes the previous method of user space offsets untenable. This has the added benefit of reducing our maintenance cost for our existing OpenSSL tracing. Assuming future versions of OpenSSL maintain the same contract, we will not require any code changes to support them -- Pixie will gain OpenSSL v3 support once this new tracing is the default (as mentioned in the later testing). Our assumption is that the applications that explicitly set the socket fd on the SSL struct (the applications supported by our previous tracing technique) -- [nginx](https://github.com/nginx/nginx/blob/dfe70f74a3558f05142fb552cea239add123d414/src/event/ngx_event_openssl.c#L1696), [python](https://github.com/python/cpython/blob/e375bff03736f809fbc234010c087ef9d7e0d384/Modules/_ssl.c#L836) all use Openssl with its native BIO interface. In order to verify that assumption, this change will be feature flagged and monitored carefully as its enabled on internal clusters. The existing [conn_stats_bytes metric](https://github.com/pixie-io/pixie/blob/f45ced1803e6e44406f20f1171c15a24f4d5a17a/src/stirling/source_connectors/socket_tracer/metrics.cc#L62) will be monitored for volume of tls traffic traced to verify there is no loss of instrumentation coverage between the new and old method. Relevant Issues: pixie-io#692 Type of change: /kind feature Test Plan: Verified the following - [x] New test passes with `enable_openssl_v3_testing` bool flag enabled ([P336](https://phab.corp.pixielabs.ai/P336)). This verifies that the new tracing technique and the feature toggle are functional since our previous tracing does not support OpenSSL v3 - [x] Existing `openssl_trace_bpf_test` and `netty_tls_trace_bpf_test` passes ([P335](https://phab.corp.pixielabs.ai/P335)). This is explicitly mentioned since `openssl_trace_bpf_test` is still disabled pixie-io#699 - [x] Verified the `bool_flag` added from PR feedback conditionally enables the OpenSSL v3 testing ([P337](https://phab.corp.pixielabs.ai/P337)) - [x] Validate metrics from pixie-io#1161 to verify our assumptions are correct --------- Signed-off-by: Dom Del Nano <ddelnano@pixielabs.ai> Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
…#1298) Summary: Remove update to BPF map not used in status quo tls tracing In order to vet the new style of tls tracing, we introduced a mechanism for detecting mismatched fds (pixie-io#1161). This instrumented all of our tls tracing when it was first developed. When the new method of tls tracing was introduced, we removed the mismatched fd detection from the status quo tls tracing (pixie-io#1123), however, this BPF map update was missed in that refactor (pixie-io#1123). Relevant Issues: pixie-io#692 Type of change: /kind bug Test Plan: Existing tests pass Signed-off-by: Dom Del Nano <ddelnano@pixielabs.ai>
…tls tracing method (pixie-io#1161) Summary: Instrument OpenSSL tracing to detect validity of assumptions for new tls tracing method After discussing pixie-io#1123 with @oazizi000 and @etep, we decided that invest in stronger validation that the assumptions for that PR and our future tls tracing are valid. Instead of relying on struct offsets of user space data structures, the new tracing method will access a connection's socket fd from the underlying socket syscalls during `SSL_write` / `SSL_read` calls. This technique should only be compatible with "BIO native" OpenSSL use cases, which are the OpenSSL use cases Pixie supports today. BIO native means that a compatible application uses a [BIO](https://wiki.openssl.org/index.php/BIO) provided by OpenSSL (via [SSL_set_fd](https://www.openssl.org/docs/manmaster/man3/SSL_set_fd.html)) and results in OpenSSL issuing read/write syscalls on your behalf for the underlying socket. There are two situations we wanted to understand prior to proceeding with pixie-io#1123: how custom BIO implementations behave (netty, nodejs, etc) and detecting unrelated (non socket) syscalls while `SSL_write` and `SSL_read` are on the stack. The former was verified with an experiment based on this change and is described in the Test Plan section below. We believe the latter should not occur, but the changes in this PR instrument this situation in order to detect if doest occur. Assuming this condition isn't encountered, we will proceed with pixie-io#1132 (with the minor changes learned from this experiment). Relevant Issues: pixie-io#692 Type of change: /kind feature Test Plan: Verified the following: - [x] `DCHECK` added to `socket_trace_connector` does not detect mismatched file descriptors for `openssl_trace_bpf_test` and `netty_tls_trace_bpf_test` ([P345](https://phab.corp.pixielabs.ai/P345) -- openssl_trace_bpf_test is still disabled - pixie-io#699) - [x] Verified with a BPF histogram that the number of active syscalls within SSL_write/SSL_read are non zero for BIO native use cases and zero for non BIO native cases (nodejs, netty). See more details in pixie-io#1160 --------- Signed-off-by: Dom Del Nano <ddelnano@pixielabs.ai>
…ggle) (pixie-io#1123) Summary: Implement new tls tracing method behind stirling cli flag (feature toggle) The tls tracing method added in this PR determines a connection's identity (socket fd) through a different mechanism from our existing tracing. Instead of relying on struct offsets of user space data structures, it accesses the socket fd via the underlying socket syscalls while `SSL_write` / `SSL_read` calls occur. This is a prerequisite to support BoringSSL because its rolling release style makes the previous method of user space offsets untenable. This has the added benefit of reducing our maintenance cost for our existing OpenSSL tracing. Assuming future versions of OpenSSL maintain the same contract, we will not require any code changes to support them -- Pixie will gain OpenSSL v3 support once this new tracing is the default (as mentioned in the later testing). Our assumption is that the applications that explicitly set the socket fd on the SSL struct (the applications supported by our previous tracing technique) -- [nginx](https://github.com/nginx/nginx/blob/dfe70f74a3558f05142fb552cea239add123d414/src/event/ngx_event_openssl.c#L1696), [python](https://github.com/python/cpython/blob/e375bff03736f809fbc234010c087ef9d7e0d384/Modules/_ssl.c#L836) all use Openssl with its native BIO interface. In order to verify that assumption, this change will be feature flagged and monitored carefully as its enabled on internal clusters. The existing [conn_stats_bytes metric](https://github.com/pixie-io/pixie/blob/f45ced1803e6e44406f20f1171c15a24f4d5a17a/src/stirling/source_connectors/socket_tracer/metrics.cc#L62) will be monitored for volume of tls traffic traced to verify there is no loss of instrumentation coverage between the new and old method. Relevant Issues: pixie-io#692 Type of change: /kind feature Test Plan: Verified the following - [x] New test passes with `enable_openssl_v3_testing` bool flag enabled ([P336](https://phab.corp.pixielabs.ai/P336)). This verifies that the new tracing technique and the feature toggle are functional since our previous tracing does not support OpenSSL v3 - [x] Existing `openssl_trace_bpf_test` and `netty_tls_trace_bpf_test` passes ([P335](https://phab.corp.pixielabs.ai/P335)). This is explicitly mentioned since `openssl_trace_bpf_test` is still disabled pixie-io#699 - [x] Verified the `bool_flag` added from PR feedback conditionally enables the OpenSSL v3 testing ([P337](https://phab.corp.pixielabs.ai/P337)) - [x] Validate metrics from pixie-io#1161 to verify our assumptions are correct --------- Signed-off-by: Dom Del Nano <ddelnano@pixielabs.ai> Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
…#1298) Summary: Remove update to BPF map not used in status quo tls tracing In order to vet the new style of tls tracing, we introduced a mechanism for detecting mismatched fds (pixie-io#1161). This instrumented all of our tls tracing when it was first developed. When the new method of tls tracing was introduced, we removed the mismatched fd detection from the status quo tls tracing (pixie-io#1123), however, this BPF map update was missed in that refactor (pixie-io#1123). Relevant Issues: pixie-io#692 Type of change: /kind bug Test Plan: Existing tests pass Signed-off-by: Dom Del Nano <ddelnano@pixielabs.ai>
Summary: Implement new tls tracing method behind stirling cli flag (feature toggle)
The tls tracing method added in this PR determines a connection's identity (socket fd) through a different mechanism from our existing tracing. Instead of relying on struct offsets of user space data structures, it accesses the socket fd via the underlying socket syscalls while
SSL_write/SSL_readcalls occur. This is a prerequisite to support BoringSSL because its rolling release style makes the previous method of user space offsets untenable. This has the added benefit of reducing our maintenance cost for our existing OpenSSL tracing. Assuming future versions of OpenSSL maintain the same contract, we will not require any code changes to support them -- Pixie will gain OpenSSL v3 support once this new tracing is the default (as mentioned in the later testing).Our assumption is that the applications that explicitly set the socket fd on the SSL struct (the applications supported by our previous tracing technique) -- nginx, python all use Openssl with its native BIO interface. In order to verify that assumption, this change will be feature flagged and monitored carefully as its enabled on internal clusters. The existing conn_stats_bytes metric will be monitored for volume of tls traffic traced to verify there is no loss of instrumentation coverage between the new and old method.
Relevant Issues: #692
Type of change: /kind feature
Test Plan: Verified the following
enable_openssl_v3_testingbool flag enabled (P336). This verifies that the new tracing technique and the feature toggle are functional since our previous tracing does not support OpenSSL v3openssl_trace_bpf_testandnetty_tls_trace_bpf_testpasses (P335). This is explicitly mentioned sinceopenssl_trace_bpf_testis still disabled Deflakeopenssl_trace_bpf_test#699bool_flagadded from PR feedback conditionally enables the OpenSSL v3 testing (P337)