
Integrate mux parser into stirling #367

Conversation

@ddelnano (Member) commented Nov 29, 2021

This is my second attempt at integrating the mux parser into Stirling. I'm opening this preemptively to show that the testing for #366 was successful, but that PR needs to be merged first, and this one rebased, to pull out the Docker-image-related changes.

Todo

  • Fix the issue that is preventing a subset of mux data from being collected
  • Debug why a bug in Tping/Rping parsing causes the mux stitcher to lose data
    • Incorrectly stripping the buffer caused the Tdispatch/Rdispatch frames to go missing. Fixing that bug makes all of the frames appear except Rping/Tping (since Rping isn't parsed correctly). After looking at this again with more debugging experience, I was able to verify that the stitcher behaves as expected even with the Rping parser bug.
  • Decrease the false-positive rate of mux protocol inference. I attempted to do this by parsing more of the mux message when the entire frame is available.
  • Avoid mutating mux stitcher deques by using iterators
  • Understand the T/Rinit message structure better (6 unknown bytes) and update the mux parser accordingly (see the frame-layout sketch after this list)
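
For reference, here's a minimal sketch of the frame layout these items assume, as I understand the Finagle mux framing (4-byte big-endian length, 1-byte signed type, 24-bit tag). The struct and helper below are only illustrative and are not the actual parser code in this PR.

#include <cstddef>
#include <cstdint>
#include <optional>

// Illustrative only: every mux frame is a 4-byte big-endian length (covering the
// rest of the frame), a 1-byte signed type, a 3-byte (24-bit) tag, then a payload.
struct MuxHeader {
  uint32_t length;  // Bytes remaining after the length field (type + tag + payload).
  int8_t type;      // e.g. Treq = 1 / Rreq = -1, Tdispatch = 2 / Rdispatch = -2,
                    // Tping = 65 / Rping = -65, Tinit = 68 / Rinit = -68,
                    // Rerr = -128 (old-style Rerr = 127).
  uint32_t tag;     // 24-bit tag matching a response to its request.
};

// Reads the 8-byte header from a raw buffer; returns std::nullopt if the buffer is
// too short to contain a full header.
inline std::optional<MuxHeader> ParseMuxHeader(const uint8_t* buf, size_t count) {
  if (count < 8) return std::nullopt;
  MuxHeader hdr;
  hdr.length = (uint32_t{buf[0]} << 24) | (uint32_t{buf[1]} << 16) |
               (uint32_t{buf[2]} << 8) | uint32_t{buf[3]};
  hdr.type = static_cast<int8_t>(buf[4]);
  hdr.tag = (uint32_t{buf[5]} << 16) | (uint32_t{buf[6]} << 8) | uint32_t{buf[7]};
  return hdr;
}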

@pixie-io-buildbot (Member):

Can one of the admins verify this patch?

@chengruizhe (Contributor) left a comment:

Mostly looks good. Let's talk on Slack about how to test the mux protocol inference rules.

@@ -487,6 +487,66 @@ static __inline bool is_redis_message(const char* buf, size_t count) {
return true;
}

static __inline enum message_type_t infer_mux_message(const char* buf, size_t count) {
static const int8_t kTreq = 1;

Contributor:

Checking for a mux_type of 1 or -1 is probably not robust enough, because it's likely that another protocol can have a 1 at that location. We do have a network traffic dataset that we use internally to test how protocol inference rules affect the accuracy of classifying different protocols. I suggest we get some Mux data from a real application, add it to the dataset, and test its effect on the other protocols, before merging the protocol inference code. See experimental/stirling/protocol_inference.

@ddelnano (Member Author):

I attempted to enhance the mux protocol inference by reducing the number of types (R/Tinit, R/Tdispatch, Rerr(Old)) and by parsing 'special' strings that are present in those frame types. Ideally we would support all mux types, but there are some that aren't as high value.

If I remember correctly, the protocol inference runs on a connection until it's classified and is only given the data from the syscall, correct? If that's the case, I was hoping that stronger verification on Rerr, Tinit, and Rinit would help reduce false positives, since those frames appear earlier in the connection lifecycle. Let me know what you think of my latest update.

I am definitely interested in trying the protocol inference against the network traffic dataset. I shared the mux pcap file that I generated myself with you on Slack. Unfortunately I don't know if I'll be able to provide real application traffic, but let me know how I can help with this testing.
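
To make this concrete, here's a rough userspace-style sketch of the kind of stricter check I'm describing for the Tdispatch/Rdispatch, Tinit/Rinit, and Rerr types. It's only an illustration (the printable-ASCII heuristic for Rerr is an assumption of this sketch), not the actual eBPF code, which also has to stay within the instruction limit.

#include <cctype>
#include <cstddef>
#include <cstdint>

constexpr int8_t kTdispatch = 2;
constexpr int8_t kRdispatch = -2;
constexpr int8_t kTinit = 68;
constexpr int8_t kRinit = -68;
constexpr int8_t kRerr = -128;
constexpr int8_t kRerrOld = 127;

inline bool LooksLikeMux(const uint8_t* buf, size_t count) {
  if (count < 8) return false;

  const uint32_t length = (uint32_t{buf[0]} << 24) | (uint32_t{buf[1]} << 16) |
                          (uint32_t{buf[2]} << 8) | uint32_t{buf[3]};
  const int8_t type = static_cast<int8_t>(buf[4]);

  // Restrict inference to the handful of frame types that can be verified further.
  if (type != kTdispatch && type != kRdispatch && type != kTinit &&
      type != kRinit && type != kRerr && type != kRerrOld) {
    return false;
  }

  // The length prefix covers type + tag + payload, so it can never be less than 4.
  if (length < 4) return false;

  // Rerr frames carry a human-readable message; requiring printable ASCII in the
  // first few captured payload bytes is one way to cut false positives from binary
  // protocols (illustrative heuristic only).
  if (type == kRerr || type == kRerrOld) {
    for (size_t i = 8; i < count && i < 32; ++i) {
      if (std::isprint(buf[i]) == 0) return false;
    }
  }
  return true;
}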

namespace stirling {

// clang-format off
static constexpr DataElement kMuxElements[] = {

Contributor:

Do you plan to add a body field in the future? If so, let's add a TODO.

@ddelnano (Member Author):

I think it depends on how we handle the nested thrift protocol data in thriftmux. I was picturing that the "body" field would be stored in a different table populated by a thrift protocol parser. That would allow non-thriftmux users to get thrift protocol parsing.

However, it's not clear to me how we would join the mux and thrift protocol data in that situation. In an ideal world, the mux and thrift data would be easily joined. Has there been any discussion on how to handle protocols that are nested like this?

Contributor:

We haven't had nested protocols like this before, so we'll be treading new ground. The problem is that we have only one ConnTracker per connection, and there's an assumption that the ConnTracker will output to one table.

I suppose we could detect the inner protocol and choose the more specific one when possible. This needs more discussion.

@ddelnano (Member Author):

I expected this would require some discussion :) Good point regarding the ConnTracker 1:1 mapping. For now I've added a TODO comment explaining the future work.
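
For illustration, the gist of that TODO is roughly the following (a sketch of the idea next to the kMuxElements definition, not the exact comment text):

// Sketch only, not the PR's exact wording.
// TODO: Decide how the thrift payload nested inside thriftmux frames should be
// exposed: either a body column in this table, or a separate table populated by
// the thrift parser, which would then need a way to join the mux and thrift
// records (today each ConnTracker outputs to a single table).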

@ddelnano force-pushed the integrate-mux-parser-into-stirling-take-2 branch from e3bceb9 to 79b228a on December 15, 2021 at 02:09

Commits added to the branch (all signed off by Dom Del Nano <ddelnano@gmail.com>):

  • …in a bug fix), remove unused client image and add class to use image
  • …er image
  • …es removes the buffer data correctly
  • …ng for mux's 24 bit tag field
  • …ire frame is provided
  • …_test
  • …d inside the mux protocol)
  • …g postgres traffic

@ddelnano force-pushed the integrate-mux-parser-into-stirling-take-2 branch from 5e2c268 to 84767f8 on December 19, 2021 at 15:04

@ddelnano (Member Author):

@chengruizhe I removed the part of the mux protocol inference that was misclassifying postgres traffic.

@oazizi000 (Contributor) commented Jan 11, 2022

@ddelnano 95199df pulled in most of these changes.

The protocol inference code was omitted because it caused us to breach the eBPF instruction limit. This also means that, for now, mux protocol support is not yet active.

@codecov-commenter:

Codecov Report

Merging #367 (95199df) into main (2fb2d7c) will decrease coverage by 0.01%.
The diff coverage is 55.43%.

❗ Current head 95199df differs from the pull request's most recent head 84767f8. Consider uploading reports for commit 84767f8 to get more accurate results.

@@            Coverage Diff             @@
##             main     #367      +/-   ##
==========================================
- Coverage   63.44%   63.42%   -0.02%     
==========================================
  Files        1044     1048       +4     
  Lines       65051    65254     +203     
==========================================
+ Hits        41271    41390     +119     
- Misses      22553    22631      +78     
- Partials     1227     1233       +6     
Impacted Files Coverage Δ
src/cloud/api/controllers/vizier_cluster_grpc.go 64.60% <ø> (-0.20%) ⬇️
src/stirling/bpf_tools/bcc_symbolizer.cc 0.00% <0.00%> (ø)
...ling/source_connectors/perf_profiler/symbolizer.cc 0.00% <ø> (ø)
...rce_connectors/socket_tracer/bcc_bpf_intf/common.h 100.00% <ø> (ø)
...nnectors/socket_tracer/bcc_bpf_intf/socket_trace.h 0.00% <ø> (ø)
...ing/source_connectors/socket_tracer/data_stream.cc 98.21% <ø> (ø)
...et_tracer/protocols/kafka/decoder/packet_decoder.h 93.87% <ø> (ø)
...socket_tracer/protocols/kafka/opcodes/sync_group.h 0.00% <0.00%> (ø)
...nnectors/socket_tracer/protocols/kafka/stitcher.cc 53.00% <0.00%> (-7.23%) ⬇️
..._connectors/socket_tracer/socket_trace_connector.h 70.00% <ø> (ø)
... and 53 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@oazizi000 (Contributor) commented:

Marking this as closed since the majority of it has been pulled in. The last piece (the eBPF mux protocol inference) will be submitted as a separate PR.

@oazizi000 closed this on Jan 11, 2022