
new tool: capable #690

Merged
merged 2 commits into iovisor:master on Sep 14, 2016

Conversation

brendangregg
Member

No description provided.

@brendangregg
Member Author

I'm pretty happy with how capable turned out. Will be nice to add a -s for kernel stacks in the future, for more context on these capability checks...

@4ast
Member

4ast commented Sep 14, 2016

love capable. I think it's pretty cool and unique.
the overhead of tcptop is definitely concerning. kernel tcp_info already collects these stats per socket and 'ss' can retrieve them. Can we maybe beautify the 'ss -4i' output? Or is something missing?

@brendangregg
Member Author

What!? When was this added to tcp_info? ... ok, thanks, I see now, it's been growing in the last year. tcpi_bytes_received, tcpi_data_segs_in, tcpi_data_segs_out. I was already thinking of adding these bean counters somewhere, so glad it's done.

I'll ditch tcptop for now and redo the PR.

I might have to bring tcptop back later on :-) but using the following scheme (rough sketch below):

  • polling /proc & INET_DIAG_INFO (or just running ss) each interval for long-lived sessions.
  • tracing tcp_close() and digging out the tcp_info for sessions & partials the ss polling missed.
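
Rough sketch of the polling half, assuming an iproute2 new enough that 'ss -ti' prints bytes_acked:/bytes_received: per connection (the parsing here is an assumption about the output layout, and short-lived sessions between polls are still missed):

#!/usr/bin/env python
# Hypothetical sketch only: poll "ss -tin" each interval and diff per-connection
# byte counters. Assumes a recent iproute2 that prints bytes_acked:/bytes_received:
# on the tcp_info continuation line.
import re
import subprocess
import time

counters = re.compile(r"bytes_acked:(\d+).*bytes_received:(\d+)")

def snapshot():
    """Return {(local, peer): (tx_bytes, rx_bytes)} from one ss run."""
    lines = subprocess.check_output(["ss", "-tin"]).decode().splitlines()
    stats = {}
    for i in range(1, len(lines) - 1):
        fields = lines[i].split()
        m = counters.search(lines[i + 1])
        if len(fields) >= 5 and m:
            # fields: State Recv-Q Send-Q Local:Port Peer:Port
            stats[(fields[3], fields[4])] = (int(m.group(1)), int(m.group(2)))
    return stats

prev = snapshot()
while True:
    time.sleep(1)
    cur = snapshot()
    for key, (tx, rx) in sorted(cur.items()):
        ptx, prx = prev.get(key, (0, 0))
        if tx - ptx or rx - prx:
            print("%-26s %-26s TX_KB %6d RX_KB %6d" %
                  (key[0], key[1], (tx - ptx) // 1024, (rx - prx) // 1024))
    prev = cur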

@brendangregg changed the title from "two new tools: capable and tcptop" to "new tool: capable" on Sep 14, 2016
@brendangregg
Member Author

I fixed this PR.

@4ast merged commit 1298998 into iovisor:master on Sep 14, 2016
@4ast
Member

4ast commented Sep 14, 2016

awesome. merged.
re: tcptop
the tcp_info is missing a few things that we need as well. Thankfully they're not per packet, so we just kprobe+bpf them.
Note tcp_close() is not enough. Instead we track the socket close from tcp_v4_destroy_sock and from tcp_set_state.
@iamkafai should we make a bcc tool out of it?

@brendangregg
Member Author

brendangregg commented Sep 14, 2016

tcp_info is missing a lot of other things which we can use BPF for, but it looks like enough for the throughput of long-running sessions, i.e. tcptop.

I've dug into the tcp_info and inet_diag stuff. Given that short-lived sessions go into TIME_WAIT, you'd think that we have plenty of time to catch them later on the next interval refresh, so that we can write tcptop without any BPF. But it doesn't work. There's a special handler for TIME_WAIT sessions, inet_twsk_diag_fill(), which doesn't copy tcp_info. I'm guessing it can't, because at this point we have a stripped-down struct inet_timewait_sock, and not a full sock. (I also tried hacking it back in, and, well, 2 panics later...) Oh well.

So back to BPF... :) Do you really need both tcp_v4_destroy_sock() and tcp_set_state() to catch the end of TCP sessions? I'm thinking that tcp_set_state() might be sufficient (it misses a few things, like during initialization...).

@4ast
Member

4ast commented Sep 14, 2016

hopefully @iamkafai can shed some light

@iamkafai
Contributor

Right, doing diag on a tw sock won't have much info. I recall I investigated why ss does not show tcp_info for some sockets, and the reason is the same.

In our current bpf setup, the tcp_info is caught (and output through perf_event) at tcp_set_state(). The kprobe on tcp_v4_destroy_sock() is there to ensure proper bpf map cleanup, just in case the sk did not go through the TCP_CLOSE state. It is mostly my paranoia; the map cleanup could be done in tcp_set_state(TCP_CLOSE) as well.
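
For illustration, a minimal bcc-style sketch of that split (not the actual tracer; the event fields are made up, and it assumes struct tcp_sock has bytes_received/bytes_acked on the running kernel, i.e. Linux 4.1+):

#!/usr/bin/env python
# Hypothetical sketch of the two-probe split described above; not the production
# tracer. Relies on bcc rewriting the struct member reads into probe reads.
from bcc import BPF

prog = """
#include <uapi/linux/ptrace.h>
#define KBUILD_MODNAME "foo"
#include <linux/tcp.h>
#include <net/sock.h>
#include <net/tcp_states.h>

struct event_t {
    u64 rx_b;
    u64 tx_b;
    u32 pid;
};

BPF_HASH(birth, struct sock *, u64);    // example per-sock state needing cleanup
BPF_PERF_OUTPUT(events);

int kprobe__tcp_set_state(struct pt_regs *ctx, struct sock *sk, int state)
{
    if (state != TCP_CLOSE) {
        // track this sock * (a timestamp here, just as example per-sock state)
        u64 ts = bpf_ktime_get_ns();
        birth.update(&sk, &ts);
        return 0;
    }
    // TCP_CLOSE: read the counters and emit one event for the session
    struct tcp_sock *tp = (struct tcp_sock *)sk;
    struct event_t ev = {};
    ev.rx_b = tp->bytes_received;
    ev.tx_b = tp->bytes_acked;
    ev.pid = bpf_get_current_pid_tgid() >> 32;
    events.perf_submit(ctx, &ev, sizeof(ev));
    birth.delete(&sk);
    return 0;
}

// paranoia: clean up in case the sk never went through TCP_CLOSE
int kprobe__tcp_v4_destroy_sock(struct pt_regs *ctx, struct sock *sk)
{
    birth.delete(&sk);
    return 0;
}
"""

b = BPF(text=prog)

def print_event(cpu, data, size):
    ev = b["events"].event(data)
    print("pid %d rx_bytes %d tx_bytes %d" % (ev.pid, ev.rx_b, ev.tx_b))

b["events"].open_perf_buffer(print_event)
while True:
    b.perf_buffer_poll()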

We used to have a periodic event for long-lived tcp connections, but it was done in a tcp_estats.ko way. It turns out no one was making good use of it, so I dropped it during the recent transition from tcp_estats.ko to the bpf tcp tracer.

@iamkafai
Contributor

@4ast Based on our current tcp bpf tracer, we can at least create a tool like 'ip monitor' and monitor the TCP_CLOSE + tcp_info event.

We also collect a few types of tcp rxmit events, but we emit a perf_event whenever the rxmit happens and rely on another backend to do the data aggregation.

A more complete tcp bpf tool (e.g. tcptop) could be a useful exercise and eventually be a driver to the tracepoints discussion.

@4ast
Member

4ast commented Sep 14, 2016

@iamkafai all very good points. Completely agree. Could help discover issues in the probes, since more eyes will look at it. And hopefully others will start extending it for other stats too.
It would be the favorite tool for any web10g fan.

@brendangregg
Member Author

@iamkafai Ok, thanks, I can get tcp_info from tcp_set_state(). Another tool I want, and maybe the same as what you're thinking with ip monitor, is one that prints per-event details for each TCP session, including lifespan (duration), hosts, ports, and (from tcp_info) throughput stats. I'd call it "tcplife" (see filelife in bcc/tools). I think @jvns had asked for this once, such that you could say: show me all connections to this port, and their duration. Hopefully there's already a birth timestamp in sock_t somewhere (although if I'm already tracing tcp_set_state(), then I can catch and timestamp ESTABLISHED).

I'm also excited at the idea of a tcptop that has near-zero overhead, by polling "ss -ti" and tracing tcp_set_state(). All of the TCP programs we create (tcptop, tcplife, tcpconnect, tcpaccept, tcpconnlat, tcpretrans) will be use-cases for the design of TCP tracepoints.

@brendangregg
Member Author

brendangregg commented Sep 14, 2016

@iamkafai here's what I'm thinking with tcplife (maybe you want to code this, as you already have some of it done by the sounds of it? :) )

# ./tcplife
PID    COMM         SADDR,PORT                      DADDR,PORT                      RX_KB  TX_KB  DUR(ms)
12345  wget         100.153.223.157,12345           123.23.100.231,80                1182      2    10.65
12345  wget         fe80::8a31:9dff:fed5:6b19,12345 fe80::8a31:9dff:fed5:6b19,80000  1182      2    10.65

DUR(ms) is the time from socket creation (or the first tcp_set_state() event seen) to when it's closed.

  • -t to add a 1st "TIME" column of HH:MM:SS
  • -T to add a 1st "TIME(s)" column of seconds.microseconds
  • -P PORT: to filter on a destination port
  • -p PID: PID filter
  • -v to add trailing verbose columns: in/out segs from tcp_info

I don't like how the default output would be >80 chars to accommodate IPv6. But I'm not sure how else to do it. Having a single line of k=v pairs (no whitespace) might fit it in, but I'd rather treat that as a raw or "parseable" mode of output (-r). Or we could just make the columns fit IPv4 and let IPv6 overflow like a train wreck, and have a "-w" option for "wide columns" for people for whom that's too much of a nuisance. :)

We could also put ADDRs last, and sometimes it would fit in 80 chars (and let IPv6 overflow, but at this point it won't look too bad). eg:

# ./tcplife
PID    COMM       DUR(ms)  RX_KB  TX_KB  SADDR,PORT            DADDR,PORT
12345  wget        813.31 103871      3  100.153.223.157,12345 123.23.100.231,80
12345  wget         11.01   4713    183  fe80::8a31:9dff:fed5:6b19,12345 fe80::8a31:9dff:fed5:6b19,80

@markdrayton
Contributor

@brendangregg do you imagine tcplife showing every state transition (e.g. SYN_SENT -> ESTABLISHED -> etc)? It sounds like tcplife is intended to show one line per completed session.

I'm asking because today I wanted to debug a situation where a process was stuck talking to something on the network but I wasn't sure what. I had in mind a tool that would list which remote hosts a machine/process was talking to, the state of those connections (are they established, when was data last sent/received, etc), which sounds somewhat like what you describe here.

@brendangregg
Member Author

brendangregg commented Sep 15, 2016

One line per completed session.

I've done TCP state tracing tools before (where did I put them...), including per-state transition dumps (one line for each state transition), and time histograms for each state (system wide). But they weren't terribly useful. You had to then spend time interpreting them. Maybe I'd do the per-state transition dump again, but not the histograms, since system-wide histograms weren't very meaningful, and doing per-host histograms was a ton of output... I'd be more interested in targeted use cases: like, if you stay in this state for more than this many milliseconds, then it means X; so let's trace and print those events.

@iamkafai
Contributor

@brendangregg Is there a reliable way to learn the CMD (e.g. wget/curl/...) from a PID? We currently don't capture PID and CMD.

I think we may have most of the kprobes...I can give it a try later this month.

@brendangregg
Member Author

We'll probably need to cache PID & COMM by sock ptr. They are valid with the existing tcpconnect/tcpaccept tools, which trace tcp_v4_connect()/tcp_v6_connect() and inet_csk_accept(). They all have sock *'s.

Although, given we're probably tracing tcp_set_state(), I'd investigate if PID/COMM are valid on some of the tcp_set_state() calls seen for a sock *. Check this out:

Active connection:

# /mnt/src/bcc/tools/trace.py 'tcp_set_state "%x %d", arg1, arg2'
TIME     PID    COMM         FUNC             -
17:58:56 7590   curl         tcp_set_state    a2d7f080 2
17:58:56 0      swapper/7    tcp_set_state    a2d7f080 1
17:58:56 7590   curl         tcp_set_state    a2d7f080 4
17:58:56 0      swapper/7    tcp_set_state    a2d7f080 5
17:58:56 0      swapper/7    tcp_set_state    a2d7f080 7

Ok, so it's valid on TCP_SYN_SENT (2), the first seen.

Passive connection:

# /mnt/src/bcc/tools/trace.py 'tcp_set_state "%x %d", arg1, arg2'
TIME     PID    COMM         FUNC             -
18:02:02 0      swapper/7    tcp_set_state    a2d78000 1
18:02:04 12013  sshd         tcp_set_state    a2d78000 8
18:02:04 12013  sshd         tcp_set_state    a2d78000 9
18:02:04 0      swapper/7    tcp_set_state    a2d78000 7

Ok, so when the SYN arrives we're just in kernel context. But TCP_CLOSE_WAIT (8) shows the right PID/COMM, sshd.

It gets a bit trickier; here's the same, but with a CPU-loaded server:

# /mnt/src/bcc/tools/trace.py 'tcp_set_state "%x %d", arg1, arg2'
TIME     PID    COMM         FUNC             -
18:04:10 14645  bash         tcp_set_state    a2d7cb00 1
18:04:11 14952  sshd         tcp_set_state    a2d7cb00 8
18:04:11 14952  sshd         tcp_set_state    a2d7cb00 9
18:04:12 14673  bash         tcp_set_state    a2d7cb00 7

Now the SYN arrives when an unrelated process is on-CPU. :)

Note I'm testing these over the network, so that delays context-switch us. Localhost connections seem to stay on-CPU for more states.

Anyway, would it be insane to use the following in tcp_set_state()?:

fetch timestamp from map
if no timestamp, and state < TCP_FIN_WAIT1: set timestamp by sock *

if state == TCP_SYN_SENT:
    record PID & COMM by sock *; return

if state == TCP_FIN_WAIT1:
    fetch previously cached PID & COMM and the saved timestamp, emit output; return

if state == TCP_CLOSE_WAIT:
    use current PID & COMM and the saved timestamp, emit output; return

It's a lot of mucking around, but it'd be one kprobe, which would be great!
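
For what it's worth, a hedged bcc sketch of that single-kprobe scheme as written (map names and the trace_printk output are made up for illustration; COMM could be cached alongside the PID with bpf_get_current_comm(), left out to keep it short):

#!/usr/bin/env python
# Hypothetical sketch of the single-kprobe scheme above; identifiers are made up.
# Uses bpf_trace_printk for brevity instead of a perf buffer.
from bcc import BPF

prog = """
#include <uapi/linux/ptrace.h>
#define KBUILD_MODNAME "foo"
#include <net/sock.h>
#include <net/tcp_states.h>

BPF_HASH(birth, struct sock *, u64);    // first-seen timestamp, by sock *
BPF_HASH(whoami, struct sock *, u64);   // PID cached at TCP_SYN_SENT, by sock *

int kprobe__tcp_set_state(struct pt_regs *ctx, struct sock *sk, int state)
{
    u64 ts = bpf_ktime_get_ns();
    u64 *tsp = birth.lookup(&sk);

    // fetch timestamp from map; if none, and state < TCP_FIN_WAIT1, set it
    if (tsp == 0 && state < TCP_FIN_WAIT1)
        birth.update(&sk, &ts);

    if (state == TCP_SYN_SENT) {
        // active open: the current task is the one that called connect()
        u64 pid = bpf_get_current_pid_tgid() >> 32;
        whoami.update(&sk, &pid);
        return 0;
    }

    if (state == TCP_FIN_WAIT1 || state == TCP_CLOSE_WAIT) {
        u64 pid = 0;
        if (state == TCP_FIN_WAIT1) {
            // active close: use the PID cached at SYN_SENT
            u64 *pidp = whoami.lookup(&sk);
            if (pidp == 0)
                return 0;
            pid = *pidp;
        } else {
            // passive: assume the current task owns the sock
            pid = bpf_get_current_pid_tgid() >> 32;
        }
        u64 ms = 0;
        if (tsp != 0)
            ms = (ts - *tsp) / 1000000;
        bpf_trace_printk("pid %llu lifespan %llu ms\\n", pid, ms);
        birth.delete(&sk);
        whoami.delete(&sk);
    }
    return 0;
}
"""

BPF(text=prog).trace_print()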

@brendangregg
Member Author

I've created #692 and #693 so we can discuss tcptop & tcplife in separate threads.

@brendangregg
Member Author

TCP_CLOSE_WAIT wasn't reliable. Here's another passive connection, from a remote IP:

22:46:12 0      swapper/7    tcp_set_state    a2e5e4c0 1
22:46:14 30548  sshd         tcp_set_state    a2e5e4c0 9
22:46:14 0      swapper/7    tcp_set_state    a2e5e4c0 7

The context on 7 is wrong, and there's no 8 this time. The task context is valid on 9, TCP_LAST_ACK.

Now passive from loopback:

22:53:32 8583   ssh          tcp_set_state    a2a8c5c0 1
22:53:33 8583   ssh          tcp_set_state    a2a8c5c0 8
22:53:33 8584   sshd         tcp_set_state    a2a8c5c0 9
22:53:33 8584   sshd         tcp_set_state    a2a8c5c0 7

8 is wrong, 7 & 9 are right. So 9 it is, for passive.

Active still works on 2:

22:50:10 3821   curl         tcp_set_state    a2e5b640 2
22:50:10 0      swapper/7    tcp_set_state    a2e5b640 1
22:50:10 3821   curl         tcp_set_state    a2e5b640 4
22:50:10 0      swapper/7    tcp_set_state    a2e5b640 5
22:50:10 0      swapper/7    tcp_set_state    a2e5b640 7

2, TCP_SYN_SENT, is still ok.

I'll follow up on #693. Just wanted to note the previous strategy I wrote didn't work out.

@brendangregg
Member Author

Server-closed:

23:40:05 0      swapper/7    tcp_set_state    a2e5d540 1
23:40:16 30894  sshd         tcp_set_state    a2e5d540 4
23:40:16 0      swapper/7    tcp_set_state    a2e5d540 5
23:40:16 0      swapper/7    tcp_set_state    a2e5d540 7

@brendangregg
Member Author

Adding the nc test from #788 to those above.

Server-closed from server:

22:27:04 8841   bash         tcp_set_state    a2e58f80 1
22:27:05 19165  nc           tcp_set_state    a2e58f80 7

Server-closed from client:

22:49:37 17636  nc           tcp_set_state    a2a887c0 2
22:49:37 8872   bash         tcp_set_state    a2a887c0 1
22:49:38 17636  nc           tcp_set_state    a2a887c0 7

Server-closed, over loopback, so it sees both the server (PID 21771) and the client (PID 21772):

# ./trace.py 'tcp_set_state "%x %d", arg1, arg2'
TIME     PID    COMM         FUNC             -
22:53:22 21772  nc           tcp_set_state    a9641740 2
22:53:22 21772  nc           tcp_set_state    a9641740 1
22:53:22 21772  nc           tcp_set_state    a9640f80 1
22:53:22 21771  nc           tcp_set_state    a2a89f00 7
22:53:25 21771  nc           tcp_set_state    a9640f80 7
22:53:25 21771  nc           tcp_set_state    a9641740 7

@brendangregg
Member Author

client-closed from client:

00:01:41 16505  16505  nc           tcp_set_state    a3ff7440 2
00:01:41 0      0      swapper/7    tcp_set_state    a3ff7440 1
00:01:42 16505  16505  nc           tcp_set_state    a3ff7440 4
00:01:42 0      0      swapper/7    tcp_set_state    a3ff7440 5
00:01:42 0      0      swapper/7    tcp_set_state    a3ff7440 7

server-closed from client:

00:03:03 18627  18627  nc           tcp_set_state    a4272e80 2
00:03:03 0      0      swapper/7    tcp_set_state    a4272e80 1
00:03:04 18627  18627  nc           tcp_set_state    a4272e80 7

@brendangregg
Member Author

client-closed from server:

00:12:56 0      0      swapper/7    tcp_set_state    a3f4dd00 1
00:12:57 0      0      swapper/7    tcp_set_state    a3f4dd00 8
00:12:57 695    695    nc           tcp_set_state    a3f4dd00 9
00:12:57 0      0      swapper/7    tcp_set_state    a3f4dd00 7

@brendangregg
Member Author

loopback client-closed:

00:24:12 878    878    nc           tcp_set_state    a2c507c0 2
00:24:12 878    878    nc           tcp_set_state    a2c507c0 1
00:24:12 878    878    nc           tcp_set_state    a2c50000 1
00:24:12 877    877    nc           tcp_set_state    a876e4c0 7
00:24:12 878    878    nc           tcp_set_state    a2c507c0 4
00:24:12 878    878    nc           tcp_set_state    a2c50000 8
00:24:12 877    877    nc           tcp_set_state    a2c50000 9
00:24:12 877    877    nc           tcp_set_state    a2c507c0 5
00:24:12 877    877    nc           tcp_set_state    a2c507c0 7
00:24:12 877    877    nc           tcp_set_state    a2c50000 7

It's obvious which is client vs server: the client does the first SYN_SENT.
