Skip to content

Commit

Permalink
tcpretrans
Browse files Browse the repository at this point in the history
  • Loading branch information
brendangregg committed Feb 15, 2016
1 parent fb7ae10 commit 553f2aa
Show file tree
Hide file tree
Showing 4 changed files with 368 additions and 0 deletions.
1 change: 1 addition & 0 deletions README.md
Expand Up @@ -100,6 +100,7 @@ Tools:
- tools/[syncsnoop](tools/syncsnoop.py): Trace sync() syscall. [Examples](tools/syncsnoop_example.txt).
- tools/[tcpaccept](tools/tcpaccept.py): Trace TCP passive connections (accept()). [Examples](tools/tcpaccept_example.txt).
- tools/[tcpconnect](tools/tcpconnect.py): Trace TCP active connections (connect()). [Examples](tools/tcpconnect_example.txt).
- tools/[tcpretrans](tools/tcpretrans.py): Trace TCP retransmits and TLPs. [Examples](tools/tcpretrans_example.txt).
- tools/[vfscount](tools/vfscount.py) tools/[vfscount.c](tools/vfscount.c): Count VFS calls. [Examples](tools/vfscount_example.txt).
- tools/[vfsstat](tools/vfsstat.py) tools/[vfsstat.c](tools/vfsstat.c): Count some VFS calls, with column output. [Examples](tools/vfsstat_example.txt).
- tools/[wakeuptime](tools/wakeuptime.py): Summarize sleep to wakeup time by waker kernel stack. [Examples](tools/wakeuptime_example.txt).
Expand Down
85 changes: 85 additions & 0 deletions man/man8/tcpretrans.8
@@ -0,0 +1,85 @@
.TH tcpaccept 8 "2016-02-14" "USER COMMANDS"
.SH NAME
tcpaccept \- Trace TCP retransmits and TLPs. Uses Linux eBPF/bcc.
.SH SYNOPSIS
.B tcpaccept [\-h] [\-l]
.SH DESCRIPTION
This traces TCP retransmits, showing address, port, and TCP state information,
and sometimes the PID (although usually not, since retransmits are usually
sent by the kernel on timeouts). To keep overhead very low, only
the TCP retransmit functions are traced. This does not trace every packet
(like tcpdump(8) or a packet sniffer).

This uses dynamic tracing of the kernel tcp_retransmit_skb() and
tcp_send_loss_probe() functions, and will need to be updated to
match kernel changes to these functions.

Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH OPTIONS
.TP
\-h
Print usage message.
.TP
\-l
Include tail loss probe attempts (in some cases the kernel may not
complete the TLP send).
.SH EXAMPLES
.TP
Trace TCP retransmits:
#
.B tcpretrans
.TP
Trace TCP retransmits and TLP attempts:
#
.B tcpconnect \-l
.SH FIELDS
.TP
TIME
Time of the retransmit.
.TP
PID
Process ID that was on-CPU. This is less useful than it might sound, as it
may usually be 0, for the kernel, for timer-based retransmits.
.TP
IP
IP address family (4 or 6).
.TP
LADDR
Local IP address. IPv4 as a dotted quad, IPv6 shows "..." then the last 4
bytes (check for newer versions of this tool for the full address).
.TP
LPORT
Local port.
.TP
T>
Type of event: R> == retransmit, L> == tail loss probe.
.TP
RADDR
Remote IP address. IPv4 as a dotted quad, IPv6 shows "..." then the last 4
bytes (check for newer versions of this tool for the full address).
.TP
RPORT
Remote port.
.TP
STATE
TCP session state.
.SH OVERHEAD
Should be negligible: TCP retransmit events should be low (<1000/s), and the
low overhead this tool adds to each event should make the cost negligible.
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg
.SH SEE ALSO
tcpconnect(8), tcpaccept(8)
221 changes: 221 additions & 0 deletions tools/tcpretrans.py
@@ -0,0 +1,221 @@
#!/usr/bin/python
# @lint-avoid-python-3-compatibility-imports
#
# tcpretrans Trace TCP retransmits and TLPs.
# For Linux, uses BCC, eBPF. Embedded C.
#
# USAGE: tcpretrans [-h] [-l]
#
# This uses dynamic tracing of kernel functions, and will need to be updated
# to match kernel changes.
#
# IPv4 addresses are printed as dotted quads. For IPv6 addresses, the last four
# bytes are printed after "..."; check for future versions with better IPv6
# support.
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 14-Feb-2016 Brendan Gregg Created this.

from __future__ import print_function
from bcc import BPF
import argparse
from time import strftime
import ctypes as ct

# arguments
examples = """examples:
./tcpretrans # trace TCP retransmits
./tcpretrans -l # include TLP attempts
"""
parser = argparse.ArgumentParser(
description="Trace TCP retransmits",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("-l", "--lossprobe", action="store_true",
help="include tail loss probe attempts")
args = parser.parse_args()

# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <net/sock.h>
#include <bcc/proto.h>
#define RETRANSMIT 1
#define TLP 2
// separate data structs for ipv4 and ipv6
struct ipv4_data_t {
// XXX: switch some to u32's when supported
u64 pid;
u64 ip;
u64 saddr;
u64 daddr;
u64 lport;
u64 dport;
u64 state;
u64 type;
};
BPF_PERF_OUTPUT(ipv4_events);
struct ipv6_data_t {
// XXX: update to transfer full ipv6 addrs
u64 pid;
u64 ip;
u64 saddr;
u64 daddr;
u64 lport;
u64 dport;
u64 state;
u64 type;
};
BPF_PERF_OUTPUT(ipv6_events);
static int trace_event(struct pt_regs *ctx, struct sock *sk, int type)
{
if (sk == NULL)
return 0;
u32 pid = bpf_get_current_pid_tgid();
struct sock *skp = NULL;
bpf_probe_read(&skp, sizeof(skp), &sk);
// pull in details
u16 family = 0, lport = 0, dport = 0;
char state = 0;
bpf_probe_read(&family, sizeof(family), &skp->__sk_common.skc_family);
bpf_probe_read(&lport, sizeof(lport), &skp->__sk_common.skc_num);
bpf_probe_read(&dport, sizeof(dport), &skp->__sk_common.skc_dport);
bpf_probe_read(&state, sizeof(state), (void *)&skp->__sk_common.skc_state);
if (family == AF_INET) {
struct ipv4_data_t data4 = {.pid = pid, .ip = 4, .type = type};
bpf_probe_read(&data4.saddr, sizeof(u32),
&skp->__sk_common.skc_rcv_saddr);
bpf_probe_read(&data4.daddr, sizeof(u32),
&skp->__sk_common.skc_daddr);
data4.lport = lport;
data4.dport = dport;
data4.state = state;
ipv4_events.perf_submit(ctx, &data4, sizeof(data4));
} else if (family == AF_INET6) {
struct ipv6_data_t data6 = {.pid = pid, .ip = 6, .type = type};
// just grab the last 4 bytes for now
u32 saddr = 0, daddr = 0;
bpf_probe_read(&saddr, sizeof(saddr),
&skp->__sk_common.skc_v6_rcv_saddr.in6_u.u6_addr32[3]);
bpf_probe_read(&daddr, sizeof(daddr),
&skp->__sk_common.skc_v6_daddr.in6_u.u6_addr32[3]);
data6.saddr = bpf_ntohl(saddr);
data6.daddr = bpf_ntohl(daddr);
data6.lport = lport;
data6.dport = dport;
data6.state = state;
ipv6_events.perf_submit(ctx, &data6, sizeof(data6));
}
// else drop
return 0;
}
int trace_retransmit(struct pt_regs *ctx, struct sock *sk)
{
trace_event(ctx, sk, RETRANSMIT);
return 0;
}
int trace_tlp(struct pt_regs *ctx, struct sock *sk)
{
trace_event(ctx, sk, TLP);
return 0;
}
"""

# event data
TASK_COMM_LEN = 16 # linux/sched.h
class Data_ipv4(ct.Structure):
_fields_ = [
("pid", ct.c_ulonglong),
("ip", ct.c_ulonglong),
("saddr", ct.c_ulonglong),
("daddr", ct.c_ulonglong),
("lport", ct.c_ulonglong),
("dport", ct.c_ulonglong),
("state", ct.c_ulonglong),
("type", ct.c_ulonglong)
]
class Data_ipv6(ct.Structure):
_fields_ = [
("pid", ct.c_ulonglong),
("ip", ct.c_ulonglong),
("saddr", ct.c_ulonglong),
("daddr", ct.c_ulonglong),
("lport", ct.c_ulonglong),
("dport", ct.c_ulonglong),
("state", ct.c_ulonglong),
("type", ct.c_ulonglong)
]

# from bpf_text:
type = {}
type[1] = 'R'
type[2] = 'L'

def inet_ntoa(addr):
dq = ''
for i in range(0, 4):
dq = dq + str(addr & 0xff)
if (i != 3):
dq = dq + '.'
addr = addr >> 8
return dq

# from include/net/tcp_states.h:
tcpstate = {}
tcpstate[1] = 'ESTABLISHED'
tcpstate[2] = 'SYN_SENT'
tcpstate[3] = 'SYN_RECV'
tcpstate[4] = 'FIN_WAIT1'
tcpstate[5] = 'FIN_WAIT2'
tcpstate[6] = 'TIME_WAIT'
tcpstate[7] = 'CLOSE'
tcpstate[8] = 'CLOSE_WAIT'
tcpstate[9] = 'LAST_ACK'
tcpstate[10] = 'LISTEN'
tcpstate[11] = 'CLOSING'
tcpstate[12] = 'NEW_SYN_RECV'

# process event
def print_ipv4_event(cpu, data, size):
event = ct.cast(data, ct.POINTER(Data_ipv4)).contents
print("%-8s %-6d %-2d %-20s %1s> %-20s %s" % (
strftime("%H:%M:%S"), event.pid, event.ip,
"%s:%s" % (inet_ntoa(event.saddr), event.lport),
type[event.type],
"%s:%s" % (inet_ntoa(event.daddr), event.dport),
tcpstate[event.state]))
def print_ipv6_event(cpu, data, size):
event = ct.cast(data, ct.POINTER(Data_ipv6)).contents
print("%%-8s -6d %-2d %-20s %1s> %-20s %s" % (
strftime("%H:%M:%S"), event.pid, event.ip,
"...%x:%d" % (event.saddr, event.lport),
type[event.type],
"...%x:%d" % (event.daddr, event.dport),
tcpstate[event.state]))

# initialize BPF
b = BPF(text=bpf_text)
b.attach_kprobe(event="tcp_retransmit_skb", fn_name="trace_retransmit")
b.attach_kprobe(event="tcp_send_loss_probe", fn_name="trace_tlp")

# header
print("%-8s %-6s %-2s %-20s %1s> %-20s %-4s" % ("TIME", "PID", "IP",
"LADDR:LPORT", "T", "RADDR:RPORT", "STATE"))

# read events
b["ipv4_events"].open_perf_buffer(print_ipv4_event)
b["ipv6_events"].open_perf_buffer(print_ipv6_event)
while 1:
b.kprobe_poll()
61 changes: 61 additions & 0 deletions tools/tcpretrans_example.txt
@@ -0,0 +1,61 @@
Demonstrations of tcpretrans, the Linux eBPF/bcc version.


This tool traces the kernel TCP retransmit function to show details of these
retransmits. For example:

# ./tcpretrans
TIME PID IP LADDR:LPORT T> RADDR:RPORT STATE
01:55:05 0 4 10.153.223.157:22 R> 69.53.245.40:34619 ESTABLISHED
01:55:05 0 4 10.153.223.157:22 R> 69.53.245.40:34619 ESTABLISHED
01:55:17 0 4 10.153.223.157:22 R> 69.53.245.40:22957 ESTABLISHED
[...]

This output shows three TCP retransmits, the first two were for an IPv4
connection from 10.153.223.157 port 22 to 69.53.245.40 port 34619. The TCP
state was "ESTABLISHED" at the time of the retransmit. The on-CPU PID at the
time of the retransmit is printed, in this case 0 (the kernel, which will
be the case most of the time).

Retransmits are usually a sign of poor network health, and this tool is
useful for their investigation. Unlike using tcpdump, this tool has very
low overhead, as it only traces the retransmit function. It also prints
additional kernel details: the state of the TCP session at the time of the
retransmit.


A -l option will include TCP tail loss probe attempts:

# ./tcpretrans -l
TIME PID IP LADDR:LPORT T> RADDR:RPORT STATE
01:55:45 0 4 10.153.223.157:22 R> 69.53.245.40:51601 ESTABLISHED
01:55:46 0 4 10.153.223.157:22 R> 69.53.245.40:51601 ESTABLISHED
01:55:46 0 4 10.153.223.157:22 R> 69.53.245.40:51601 ESTABLISHED
01:55:53 0 4 10.153.223.157:22 L> 69.53.245.40:46444 ESTABLISHED
01:56:06 0 4 10.153.223.157:22 R> 69.53.245.40:46444 ESTABLISHED
01:56:06 0 4 10.153.223.157:22 R> 69.53.245.40:46444 ESTABLISHED
01:56:08 0 4 10.153.223.157:22 R> 69.53.245.40:46444 ESTABLISHED
01:56:08 0 4 10.153.223.157:22 R> 69.53.245.40:46444 ESTABLISHED
01:56:08 1938 4 10.153.223.157:22 R> 69.53.245.40:46444 ESTABLISHED
01:56:08 0 4 10.153.223.157:22 R> 69.53.245.40:46444 ESTABLISHED
01:56:08 0 4 10.153.223.157:22 R> 69.53.245.40:46444 ESTABLISHED
[...]

See the "L>" in the "T>" column. These are attempts: the kernel probably
sent a TLP, but in some cases it might not have been ultimately sent.


USAGE message:

# ./tcpretrans -h
usage: tcpretrans [-h] [-l]

Trace TCP retransmits

optional arguments:
-h, --help show this help message and exit
-l, --lossprobe include tail loss probe attempts

examples:
./tcpretrans # trace TCP retransmits
./tcpretrans -l # include TLP attempts

0 comments on commit 553f2aa

Please sign in to comment.