blkalgn: add block command alignment observability tool
The tool observes block commands and checks for LBA and block size
alignment.

The tool is used as part of the Large block size (LBS) effort [1] in the
kernel to validate min order mapping [2].

[1] https://kernelnewbies.org/KernelProjects/large-block-size

[2] min order:
use of min order: linux-kdevops/linux@563cea7
add min order support: linux-kdevops/linux@27f85d8
upstream RFC: https://lore.kernel.org/all/20230915183848.1018717-2-kernel@pankajraghav.com/

Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
dkruces committed Dec 15, 2023
1 parent 8bc151f commit e0f45a5
Showing 5 changed files with 589 additions and 0 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -86,6 +86,7 @@ pair of .c and .py files, and some are directories of files.

- tools/[argdist](tools/argdist.py): Display function parameter values as a histogram or frequency count. [Examples](tools/argdist_example.txt).
- tools/[bashreadline](tools/bashreadline.py): Print entered bash commands system wide. [Examples](tools/bashreadline_example.txt).
- tools/[blkalgn](tools/blkalgn.py): Observe block command alignment. [Examples](tools/blkalgn_example.txt).
- tools/[bpflist](tools/bpflist.py): Display processes with active BPF programs and maps. [Examples](tools/bpflist_example.txt).
- tools/[capable](tools/capable.py): Trace security capability checks. [Examples](tools/capable_example.txt).
- tools/[compactsnoop](tools/compactsnoop.py): Trace compact zone events with PID and latency. [Examples](tools/compactsnoop_example.txt).
106 changes: 106 additions & 0 deletions man/man8/blkalgn.8
@@ -0,0 +1,106 @@
.TH blkalgn 8 "2023-11-06" "USER COMMANDS"
.SH NAME
blkalgn \- Observes alignment of block commands.
.SH SYNOPSIS
.B blkalgn.py [\-h] [\-d DISK] [\-o OPS] [\--debug] [\--trace]
.B [\--interval INTERVAL]
.SH DESCRIPTION
blkalgn observes and traces block device commands. The program attaches a kprobe
to `blk_mq_start_request` to capture block commands (for example, NVMe commands)
issued to any device. If a disk and/or operation filter is set, commands that do
not match the filter are skipped. If the tracing option is passed, every
captured event is printed as a row of a table with the following columns, listed
from left to right:

- DISK: Prints the NVMe node (e.g. 'nvme0n9').

- OPS: Prints the NVMe operation (read/write).

- LEN: Prints the length in bytes.

- LBA: Prints the Logical Block Address (LBA).

- PID: Prints the process ID.

- COMM: Prints the process name (command).

- ALGN: Prints the largest power-of-2 size, in bytes, to which the command is
aligned. Example: An alignment value of 16384 (16k) indicates the command is
aligned in both size and LBA to 4k, 8k and 16k.

Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bcc
.SH OPTIONS
.TP
\-h, --help
show this help message and exit
.TP
\-d DISK, --disk DISK
If set, the BPF will add a disk name filter to skip block commands that don't
match the given block device node.
Example: nvme0n9
.TP
\-o OPS, --ops OPS
If set, the BPF will add an operation filter to skip NVMe commands that don't
match the given operation. A full list of the operation values can be found at
the 'enum req_op' in the kernel header 'include/linux/blk_types.h'.
.TP
\--debug
Prints BPF code before capturing.
.TP
\--trace
Prints NVMe captured commands in a table form.

Header: DISK OPS LEN LBA PID COMM ALGN.
.TP
\--interval INTERVAL
Specifies the maximum event polling interval, in seconds.
.SH EXAMPLES
.TP
Observe all block commands and print a power-of-2 histogram with the block and \
alignment sizes at the end.
#
.B blkalgn
.TP
Observe all block commands issued to the 9th NVMe node and print a power-of-2 \
histogram with the block and alignment sizes at the end.
#
.B blkalgn --disk nvme9n1
.TP
Observe and trace all write commands issued to the 9th NVMe node, then print a \
power-of-2 histogram with the block and alignment sizes at the end.
#
.B blkalgn --disk nvme9n1 --ops Write --trace
.TP
Print the eBPF program before observation starts. Observe all write \
commands issued to the 9th NVMe node, then print a power-of-2 histogram with \
the block and alignment sizes at the end.
#
.B blkalgn --disk nvme9n1 --ops Write --debug
.TP
Observe all write commands issued to the 9th NVMe node, polling \
events from the data ring buffer every 100 ms, then print a power-of-2 \
histogram with the block and alignment sizes at the end.
#
.B blkalgn --disk nvme9n1 --ops Write --interval 0.1
.SH OVERHEAD
This traces all block commands issued to any device. The overhead can be high
if the command volume is high. To reduce overhead, add filters such as disk
('--disk') and/or operation ('--ops'). When tracing ('--trace'), you can also
increase the polling interval ('--interval') or, if possible, disable tracing
completely. You should only run this on a system where the slowdown is
acceptable.
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Daniel Gomez
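
The ALGN column described in the man page can be illustrated outside of BPF. The following is a minimal Python sketch, not part of the commit: `max_alignment` is a hypothetical helper name, and it assumes a 4096-byte logical block, as the loop in `start_request` (tools/blkalgn.py) does when it derives the LBA step from `algn_size / 4096`.

```python
def max_alignment(length_bytes, lba, base=4096, steps=8):
    """Return the largest power-of-2 size (in bytes) to which both the
    command length and its LBA are aligned.

    Hypothetical helper for illustration only. It assumes one logical
    block is `base` bytes and tries `steps` sizes: base, 2*base, ...
    """
    best = base  # like the probe, report `base` even if nothing matches
    size = base
    for _ in range(steps):
        blocks = size // base  # the same alignment expressed in blocks
        if length_bytes % size == 0 and lba % blocks == 0:
            best = size
        size <<= 1
    return best


if __name__ == "__main__":
    print(max_alignment(16384, 4))
    print(max_alignment(16384, 2))
```

Under these assumptions, a 16 KiB command at LBA 4 is reported as 16384, while the same command at LBA 2 is reported as 8192, because LBA 2 is aligned to two blocks but not to four.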
4 changes: 4 additions & 0 deletions tests/python/test_tools_smoke.py
@@ -261,6 +261,10 @@ def test_nfsdist(self):
        else:
            pass

    @skipUnless(kernel_version_ge(4,19), "requires kernel >= 4.19")
    def test_blkalgn(self):
        self.run_with_duration("blkalgn.py")

    @skipUnless(kernel_version_ge(4,6), "requires kernel >= 4.6")
    @mayFail("This fails on github actions environment, and needs to be fixed")
    def test_offcputime(self):
262 changes: 262 additions & 0 deletions tools/blkalgn.py
@@ -0,0 +1,262 @@
#!/usr/bin/env python
# SPDX-License-Identifier: Apache-2.0
#
# Block alignment observability tool.
#
# Copyright (c) 2023 Samsung Electronics Co., Ltd. All Rights Reserved.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 06-Nov-2023 Daniel Gomez Created this.
from __future__ import (
    absolute_import, division, unicode_literals, print_function
)
from bcc import BPF
import argparse
import time

examples = """examples:
  blkalgn                             # Observe all blk commands
  blkalgn --disk nvme9n1              # Observe all commands on 9th NVMe node
  blkalgn --ops Read                  # Observe read commands on all NVMe
  blkalgn --ops Write                 # Observe write commands on all NVMe
  blkalgn --ops Write --disk nvme9n1  # Observe write commands on 9th NVMe node
  blkalgn --debug                     # Print eBPF program before observing
  blkalgn --trace                     # Print NVMe captured events
  blkalgn --interval 0.1              # Poll data ring buffer every 100 ms
"""

parser = argparse.ArgumentParser(
    description="Block commands observer tool",
    formatter_class=argparse.RawDescriptionHelpFormatter,
    epilog=examples,
)
parser.add_argument(
    "-d",
    "--disk",
    type=str,
    help="capture commands for this block device node only"
)
parser.add_argument(
    "-o",
    "--ops",
    type=str,
    help="capture this command operation only"
)
parser.add_argument("--debug", action="store_true", help="debug")
parser.add_argument(
    "--trace",
    action="store_true",
    help="trace block captured commands"
)
parser.add_argument(
    "--interval",
    type=float,
    help="polling interval"
)

args = parser.parse_args()

# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/blk-mq.h>
struct data_t {
    u32 pid;
    char comm[TASK_COMM_LEN];
    char disk[DISK_NAME_LEN];
    u32 op;
    u32 len;
    u32 lba;
    u32 algn;
};
BPF_HISTOGRAM(block_len, u32, 64);
BPF_HISTOGRAM(algn, u32, 64);
BPF_ARRAY(counts, u64, 1);
BPF_RINGBUF_OUTPUT(events, 8);
/* local strcmp function, max length 16 to protect instruction loops */
#define CMPMAX 16
static int local_strcmp(const char *cs, const char *ct)
{
    int len = 0;
    unsigned char c1, c2;
    while (len++ < CMPMAX) {
        c1 = *cs++;
        c2 = *ct++;
        if (c1 != c2)
            return c1 < c2 ? -1 : 1;
        if (!c1)
            break;
    }
    return 0;
}
"""

bpf_text_disk_filter = ""
if args.disk:
    bpf_text_disk_filter = """
        if (local_strcmp(req->q->disk->disk_name, "{disk}"))
            return;
    """.format(
        disk=args.disk
    )

bpf_text_ops_filter = ""
# Operation dictionary. Full list of operations at Linux kernel
# 'include/linux/blk_types.h' header file.
blk_ops = {
    0: "Read",
    1: "Write",
    2: "Flush",
    3: "Discard",
    5: "SecureErase",
    9: "WriteZeroes",
    10: "ZoneOpen",
    11: "ZoneClose",
    12: "ZoneFinish",
    13: "ZoneAppend",
    15: "ZoneReset",
    17: "ZoneResetAll",
    34: "DrvIn",
    35: "DrvOut",
    36: "Last",
    "Read": 0,
    "Write": 1,
    "Flush": 2,
    "Discard": 3,
    "SecureErase": 5,
    "WriteZeroes": 9,
    "ZoneOpen": 10,
    "ZoneClose": 11,
    "ZoneFinish": 12,
    "ZoneAppend": 13,
    "ZoneReset": 15,
    "ZoneResetAll": 17,
    "DrvIn": 34,
    "DrvOut": 35,
    "Last": 36,
}
if args.ops:
    try:
        operation = blk_ops[args.ops]
    except KeyError:
        print("Operation does not exist. Please enter a valid operation:")
        # Only the string keys are operation names; integer keys are codes.
        for k in blk_ops.keys():
            if type(k) is str:
                print(k)
        exit()

    bpf_text_ops_filter = """
        if ((req->cmd_flags & 0xff) != {ops})
            return;
    """.format(
        ops=operation
    )

bpf_text += """
void start_request(struct pt_regs *ctx, struct request *req)
{{
    struct data_t data = {{}};
    u32 max_algn_size = 4096, algn_size = 4096;
    u32 lba_len = algn_size / 4096;
    bool is_algn = false;
    u8 i;
    u32 lba_shift;
    {disk_filter}
    {ops_filter}
    data.pid = bpf_get_current_pid_tgid() >> 32;
    bpf_get_current_comm(&data.comm, sizeof(data.comm));
    bpf_probe_read_kernel(&data.disk, sizeof(data.disk),
                          req->q->disk->disk_name);
    data.op = req->cmd_flags & 0xff;
    data.len = req->__data_len;
    lba_shift = bpf_log2(req->q->limits.logical_block_size);
    data.lba = req->__sector >> (lba_shift - SECTOR_SHIFT);
    for (i=0; i<8; i++) {{
        is_algn = !(data.len % algn_size) && !(data.lba % lba_len);
        if (is_algn) {{
            max_algn_size = algn_size;
        }}
        algn_size = algn_size << 1;
        lba_len = algn_size / 4096;
    }}
    data.algn = max_algn_size;
    events.ringbuf_output(&data, sizeof(data), 0);
    block_len.increment(bpf_log2l(req->__data_len));
    algn.increment(bpf_log2l(max_algn_size));
}}
""".format(
    disk_filter=bpf_text_disk_filter, ops_filter=bpf_text_ops_filter
)


if args.debug:
    print(args)
    print(bpf_text)

bpf = BPF(text=bpf_text)
if args.trace:
    print("Tracing block commands... Hit Ctrl-C to end.")
    print(
        "%-10s %-8s %-8s %-10s %-10s %-16s %-8s"
        % ("DISK", "OPS", "LEN", "LBA", "PID", "COMM", "ALGN")
    )

if BPF.get_kprobe_functions(b"blk_mq_start_request"):
    bpf.attach_kprobe(event="blk_mq_start_request", fn_name="start_request")


def capture_event(ctx, data, size):
    event = bpf["events"].event(data)
    if args.trace:
        print_event(event)


def print_event(event):
    try:
        op = blk_ops[event.op]
    except KeyError:
        op = event.op
    print(
        "%-10s %-8s %-8s %-10s %-10s %-16s %-8s"
        % (
            event.disk.decode("utf-8", "replace"),
            op,
            event.len,
            event.lba,
            event.pid,
            event.comm.decode("utf-8", "replace"),
            event.algn,
        ),
    )


bpf["events"].open_ring_buffer(capture_event)
block_len = bpf["block_len"]
algn = bpf["algn"]
while 1:
    try:
        bpf.ring_buffer_poll(30)
        if args.interval:
            time.sleep(abs(args.interval))
    except KeyboardInterrupt:
        bpf.ring_buffer_consume()
        print()
        block_len.print_log2_hist(
            "Block size", "operation", section_print_fn=bytes.decode
        )
        block_len.clear()
        print()
        algn.print_log2_hist("Algn size", "operation",
                             section_print_fn=bytes.decode)
        algn.clear()
        break
exit()
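
In `start_request`, the probe converts the request's starting sector to an LBA by shifting by the difference between the logical block shift and `SECTOR_SHIFT`. The Python sketch below illustrates that arithmetic only and is not part of the commit: `sector_to_lba` is a hypothetical helper name, and it assumes the logical block size is a power of two of at least 512 bytes, so that its base-2 logarithm is exact.

```python
SECTOR_SHIFT = 9  # the kernel counts sectors in 512-byte units


def sector_to_lba(sector, logical_block_size):
    """Convert a 512-byte sector number to an LBA in logical blocks.

    Hypothetical helper for illustration only. Assumes that
    logical_block_size is a power of two and at least 512.
    """
    lba_shift = logical_block_size.bit_length() - 1  # log2 of a power of two
    return sector >> (lba_shift - SECTOR_SHIFT)


if __name__ == "__main__":
    print(sector_to_lba(8, 4096))
    print(sector_to_lba(8, 512))
```

With a 4096-byte logical block, eight 512-byte sectors make up one block, so sector 8 maps to LBA 1. With a 512-byte logical block the shift is zero and the sector number is the LBA.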
