new tool: capable #690
I'm pretty happy with how capable turned out. Will be nice to add a -s for kernel stacks in the future, for more context on these capability checks...
love capable. I think it's pretty cool and unique.
What!? When was this added to tcp_info? ... ok, thanks, I see now, it's been growing in the last year. tcpi_bytes_received, tcpi_data_segs_in, tcpi_data_segs_out. I was already thinking of adding these bean counters somewhere, so glad it's done. I'll ditch tcptop for now and redo the PR. I might have to bring tcptop back later on :-) but using the following scheme:
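Those tcp_info counters are cumulative, so a tcptop-style tool has to diff successive snapshots to get per-interval throughput. A minimal sketch of that bookkeeping (hypothetical helper, not the elided scheme the comment refers to; the field names mirror the kernel's tcpi_* members):

```python
# Sketch: derive per-interval deltas from cumulative tcp_info counters.
# The interval_delta() helper is hypothetical; field names follow struct tcp_info.

def interval_delta(prev, curr):
    """Return counters moved since the previous snapshot.

    prev/curr are dicts of cumulative values, e.g. tcpi_bytes_received,
    tcpi_bytes_acked, tcpi_data_segs_in, tcpi_data_segs_out.
    """
    return {k: curr[k] - prev[k] for k in curr}

prev = {"tcpi_bytes_received": 1000, "tcpi_bytes_acked": 500,
        "tcpi_data_segs_in": 10, "tcpi_data_segs_out": 5}
curr = {"tcpi_bytes_received": 6000, "tcpi_bytes_acked": 2500,
        "tcpi_data_segs_in": 14, "tcpi_data_segs_out": 7}

delta = interval_delta(prev, curr)
print(delta["tcpi_bytes_received"])  # 5000 bytes received this interval
```

Divide each delta by the interval length to get a rate for the tool's columns.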
I fixed this PR.
awesome. merged.
tcp_info is missing a lot of other things which we could use BPF for, but it looks sufficient for throughput of long-running sessions.
tcptop: I've dug into the tcp_info and inet_diag stuff. Given that short-lived sessions go into TIME_WAIT, you'd think we have plenty of time to catch them on the next interval refresh, so that tcptop could be written without any BPF. But it doesn't work. There's a special handler for TIME_WAIT sessions, inet_twsk_diag_fill(), which doesn't copy tcp_info. I'm guessing it can't, because at that point we have a stripped-down struct inet_timewait_sock, and not a full sock. (I also tried hacking it back in, and, well, 2 panics later...) Oh well. So back to BPF... :)
Do you really need both tcp_v4_destroy_sock() and tcp_set_state() to catch the end of TCP sessions? I'm thinking that tcp_set_state() might be sufficient (it misses a few things, like in initialization...).
hopefully @iamkafai can shed some light
Right, doing diag on a tw sock won't have much info. I recall I investigated why ss does not show tcp_info for some sockets, and the reason is the same. In our current bpf setup, the tcp_info is caught (and output through perf_event) at tcp_set_state(). The kprobe on tcp_v4_destroy_sock() is to ensure a proper bpf map cleanup just in case the sk did not go through the TCP_CLOSE state. It's mostly my paranoia; the map cleanup could be done in tcp_set_state(TCP_CLOSE) as well. We used to have a periodic event for long-lived tcp connections, but it was done in a tcp_estats.ko way. It turned out no one was making good use of it, so I dropped it during the recent transition from tcp_estats.ko to the bpf tcp tracer.
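That cleanup strategy can be mimicked in plain Python: delete the per-socket entry at tcp_set_state(TCP_CLOSE), and keep the destroy-sock hook as a belt-and-braces fallback. This is only a hypothetical sketch of the map logic, using a dict as a stand-in for a BPF hash map keyed by sock pointer:

```python
# Sketch of the map-cleanup strategy described above (hypothetical stand-in;
# a real implementation would be a BPF hash map updated from kprobes).
TCP_CLOSE = 7  # kernel TCP state number

sessions = {}  # sock_ptr -> per-session data

def on_tcp_set_state(sk, state):
    if state == TCP_CLOSE:
        # primary cleanup path: most sockets pass through TCP_CLOSE
        sessions.pop(sk, None)

def on_tcp_v4_destroy_sock(sk):
    # fallback cleanup, in case the sk never went through TCP_CLOSE
    sessions.pop(sk, None)

sessions[0xdead] = {"bytes": 123}
on_tcp_set_state(0xdead, TCP_CLOSE)   # normal path
sessions[0xbeef] = {"bytes": 456}
on_tcp_v4_destroy_sock(0xbeef)        # the paranoia path
print(len(sessions))  # 0: both entries cleaned up
```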
@4ast Based on our current tcp bpf tracer, we can at least create a tool like 'ip monitor' and monitor the TCP_CLOSE + tcp_info event. We also collect a few types of tcp rxmit events, but we emit a perf_event whenever the rxmit happens and rely on another backend to do data aggregation. A more complete tcp bpf tool (e.g. tcptop) would be a useful exercise, and could eventually be a driver for the tracepoints discussion.
@iamkafai all very good points. Completely agree. Could help discover issues in the probes, since more eyes will look at it. And hopefully others will start extending it for other stats too.
@iamkafai Ok, thanks, I can get tcp_info from tcp_set_state(). Another tool I want, and maybe the same as what you're thinking with ip monitor, is one that prints per-event details for each TCP session, including lifespan (duration), hosts, ports, and (from tcp_info) throughput stats. I'd call it "tcplife" (see filelife in bcc/tools). I think @jvns had asked for this once, such that you could say: show me all connections to this port, and their durations. Hopefully there's already a birth timestamp in sock_t somewhere (although if I'm already tracing tcp_set_state(), then I can catch and timestamp ESTABLISHED). I'm also excited about the idea of a tcptop with near-zero overhead, by polling "ss -ti" and tracing tcp_set_state(). All of the TCP programs we create (tcptop, tcplife, tcpconnect, tcpaccept, tcpconnlat, tcpretrans) will be use cases for the design of TCP tracepoints.
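If no birth timestamp turns out to be handy, timestamping the ESTABLISHED transition in tcp_set_state() is enough for the duration column. A hedged sketch of that bookkeeping (hypothetical helper names, plain Python rather than BPF):

```python
# Sketch: compute a tcplife-style duration by timestamping the first
# ESTABLISHED transition and diffing at close. Hypothetical helper.
TCP_ESTABLISHED, TCP_CLOSE = 1, 7  # kernel state numbers

birth = {}  # sock_ptr -> timestamp (ns)

def on_state(sk, state, now_ns):
    """Record birth at ESTABLISHED; return duration (ms) at close, else None."""
    if state == TCP_ESTABLISHED:
        birth.setdefault(sk, now_ns)
    elif state == TCP_CLOSE and sk in birth:
        return (now_ns - birth.pop(sk)) / 1e6
    return None

on_state(1, TCP_ESTABLISHED, 1_000_000_000)
dur = on_state(1, TCP_CLOSE, 1_250_000_000)
print(dur)  # 250.0
```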
@iamkafai here's what I'm thinking with tcplife (maybe you want to code this, as you already have some of it done by the sounds of it? :) )
DUR(ms) is the time from socket creation (or the first tcp_set_state() event seen) to when it's closed.
I don't like how the default output would be >80 chars to accommodate IPv6. But I'm not sure how else to do it. Having a single line of k=v pairs (no whitespace) might fit, but I'd rather treat that as a raw or "parseable" mode of output (-r). Or we could just make the columns fit IPv4 and let IPv6 overflow like a train wreck, and have a "-w" option for "wide columns" for anyone who finds that too much of a nuisance. :) We could also put ADDRs last, and then it would sometimes fit in 80 chars (and let IPv6 overflow, but at that point it won't look too bad). eg:
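The column question is easy to prototype: size the address columns for IPv4 (15 chars) and let IPv6 overflow, with a wide mode (39 chars) behind -w. A sketch with entirely hypothetical widths and field choices, not the final format:

```python
# Sketch: fixed-width tcplife-style columns sized for IPv4, with an
# optional wide mode for IPv6. Layout is hypothetical.
def fmt_row(pid, comm, laddr, lport, raddr, rport, dur_ms, wide=False):
    aw = 39 if wide else 15  # 15 chars fits dotted-quad IPv4; 39 fits full IPv6
    return (f"{pid:<6} {comm:<12} {laddr:<{aw}} {lport:<5} "
            f"{raddr:<{aw}} {rport:<5} {dur_ms:>8.2f}")

row = fmt_row(1234, "wget", "10.0.0.1", 43210, "93.184.216.34", 80, 250.0)
print(row)
```

In narrow mode this comes to 72 characters, so an IPv4 row fits in 80 columns, while IPv6 addresses simply overflow their 15-char field.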
@brendangregg do you imagine tcplife showing every state transition (e.g. SYN_SENT -> ESTABLISHED -> etc)? It sounds like tcplife is intended to show one line per completed session. I'm asking because today I wanted to debug a situation where a process was stuck talking to something on the network but I wasn't sure what. I had in mind a tool that would list which remote hosts a machine/process was talking to, and the state of those connections (are they established, when was data last sent/received, etc), which sounds somewhat like what you describe here.
One line per completed session. I've done TCP state tracing tools before (where did I put them...), including per-state transition dumps (one line for each state transition), and time histograms for each state (system wide). But they weren't terribly useful. You then had to spend time interpreting them. Maybe I'd do the per-state transition dump again, but not the histograms, since system-wide histograms weren't very meaningful, and doing per-host histograms was a ton of output... I'd be more interested in targeted use cases: like, if you stay in this state for more than this many milliseconds, then it means X; so let's trace and print those events.
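That targeted use case boils down to: record when each state is entered, and emit an event only when time-in-state crosses a threshold. A hypothetical sketch of that filter (plain Python stand-in for the tracing logic):

```python
# Sketch: per-socket time-in-state tracking that only reports transitions
# where the previous state lasted longer than a threshold. Hypothetical.
entered = {}  # sock_ptr -> (state, entry_timestamp_ms)

def on_state(sk, state, now_ms, threshold_ms=100):
    """Return (old_state, elapsed_ms) if the old state exceeded the threshold."""
    event = None
    if sk in entered:
        old_state, t0 = entered[sk]
        elapsed = now_ms - t0
        if elapsed > threshold_ms:
            event = (old_state, elapsed)
    entered[sk] = (state, now_ms)
    return event

on_state(1, "SYN_SENT", 0)
on_state(1, "ESTABLISHED", 5)          # fast handshake: below threshold, no event
slow = on_state(1, "FIN_WAIT1", 900)   # 895 ms in ESTABLISHED: report it
print(slow)  # ('ESTABLISHED', 895)
```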
@brendangregg Is there a reliable way to learn the CMD (e.g. wget/curl/...) from a PID? We currently don't capture PID and CMD. I think we may have most of the kprobes... I can give it a try later this month.
We'll probably need to cache PID & COMM by sock ptr. They are valid in the existing tcpconnect/tcpaccept tools, which trace tcp_v4_connect()/tcp_v6_connect() and inet_csk_accept(); those all have sock *'s. Although, given we're probably tracing tcp_set_state(), I'd investigate whether PID/COMM are valid on some of the tcp_set_state() calls seen for a sock *. Check this out: Active connection:
Ok, so it's valid on TCP_SYN_SENT (2), the first seen. Passive connection:
Ok, so when the SYN arrives we're just in kernel context. But TCP_CLOSE_WAIT (8) shows the right PID/COMM, sshd. It gets a bit trickier; here's the same, but with a CPU-loaded server:
Now the SYN arrives while an unrelated process is on-CPU. :) Note I'm testing these over the network, so that network delays cause context switches. Localhost connections seem to stay on-CPU for more states. Anyway, would it be insane to use the following in tcp_set_state()?:
It's a lot of mucking around, but it'd be one kprobe, which would be great!
TCP_CLOSE_WAIT wasn't reliable. Here's another passive, from a remote IP:
7 is wrong, no 8. The task context is valid on 9, TCP_LAST_ACK. Now passive from loopback:
8 is wrong, 7 & 9 are right. So 9 it is, for passive. Active still works on 2:
2, TCP_SYN_SENT, is still ok. I'll follow up on #693. Just wanted to note the previous strategy I wrote didn't work out. |
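Putting those observations together: the task context looks trustworthy on TCP_SYN_SENT (2) for active connections and TCP_LAST_ACK (9) for passive ones. A hedged sketch of a single-kprobe policy built on that (my own reconstruction in plain Python, not the elided snippet above):

```python
# Sketch: decide in tcp_set_state() whether the current task's PID/COMM
# can be cached for this sock. Per the findings above, state 2
# (TCP_SYN_SENT) is reliable for active sessions and state 9
# (TCP_LAST_ACK) for passive ones. The policy itself is hypothetical.
TCP_SYN_SENT, TCP_LAST_ACK = 2, 9  # kernel state numbers

whoami = {}  # sock_ptr -> (pid, comm)

def on_tcp_set_state(sk, state, pid, comm):
    if state in (TCP_SYN_SENT, TCP_LAST_ACK) and sk not in whoami:
        whoami[sk] = (pid, comm)

# active connection: the first event is SYN_SENT, in the caller's context
on_tcp_set_state(0xa, TCP_SYN_SENT, 4242, "curl")
# passive connection: early states fire in (possibly unrelated) kernel
# context, so wait for TCP_LAST_ACK
on_tcp_set_state(0xb, TCP_LAST_ACK, 999, "sshd")
print(whoami[0xa])  # (4242, 'curl')
```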
Server-closed:
Adding the nc test from #788 to those above. Server-closed from server:
Server-closed from client:
Server-closed, over loopback, so sees server (PID 21771) and client (PID 21772):
client-closed from client:
server-closed from client:
client-closed from server:
loopback client-closed:
It's obvious which is client vs server: the client does the first SYN_SENT. |
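That heuristic is easy to state as code: in a per-socket event stream, the side whose first transition is SYN_SENT is the client. A hypothetical sketch:

```python
# Sketch: classify a socket as client or server from its first observed
# state transition (SYN_SENT => active/client; anything else, e.g.
# SYN_RECV on an accepted socket, => passive/server). Hypothetical helper.
TCP_SYN_SENT, TCP_SYN_RECV = 2, 3  # kernel state numbers

def classify(first_state):
    return "client" if first_state == TCP_SYN_SENT else "server"

print(classify(TCP_SYN_SENT))  # client
print(classify(TCP_SYN_RECV))  # server
```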