tools/virtiostat: add virtiostat tool #3270

pizhenwei · 2021-02-15T14:25:47Z

Add a new tool virtiostat to trace VIRTIO devices IO statistics.

Although we have already had iostat(to show block device statistics),
iftop(to show network device statistics), other devices of VIRTIO
family(Ex, console, balloon, GPU and so on) also need tools for each
type. virtiostat works in virtio lower layer, and it could trace all
the devices.

Signed-off-by: zhenwei pi pizhenwei@bytedance.com

yonghong-song · 2021-02-17T19:40:14Z

[buildbot, test this please]

pizhenwei · 2021-02-18T03:51:28Z

[buildbot, test this please]

@yonghong-song
Hi, testbot reports an error on ubuntu1710 (and only on ubuntu1710)

8: FAIL: test_jumps (__main__.TestBPFSocket)
8: ----------------------------------------------------------------------
8: Traceback (most recent call last):
8:   File "/home/iovisor/jenkins/workspace/bcc-pr/label/ubuntu1710/tests/python/test_call1.py", line 43, in test_jumps
8:     self.assertGreater(self.stats[c_int(S_IP)].value, 0)
8: AssertionError: 0L not greater than 0

I tried to build ubuntu-1710 environment to reproduce this issue, but apt source has been broken. Could you give me a hint why this test fails?

yonghong-song · 2021-02-18T06:29:39Z

I could be a timing issue. networking packet did not go through yet and then socket is closed. Let me try again. It should not be related to your patch.

yonghong-song · 2021-02-18T06:29:49Z

[buildbot, test this please]

yonghong-song · 2021-02-18T18:49:30Z

tools/virtiostat.py

+#include <bcc/proto.h>
+
+#define VIRTIO_MAX_SGS  128
+#define SG_MAX          18


Some comments where these constants come from?

yonghong-song · 2021-02-18T18:50:28Z

tools/virtiostat.py

+static void record(struct virtqueue *vq,
+                   struct scatterlist **sgs,
+                   unsigned int out_sgs,
+                   unsigned int in_sgs)


No need in four different lines for parameters, you have two lines with each line two parameters.

yonghong-song · 2021-02-18T18:50:51Z

tools/virtiostat.py

+                            unsigned int in_sgs,
+                            void *data,
+                            gfp_t gfp)
+


extra line is not needed.

yonghong-song · 2021-02-18T18:51:24Z

tools/virtiostat.py

+                            unsigned int out_sgs,
+                            unsigned int in_sgs,
+                            void *data,
+                            gfp_t gfp)


no need one parameter per line.

yonghong-song · 2021-02-18T18:52:16Z

tools/virtiostat.py

+
+{
+    unsigned int _out_sgs = PT_REGS_PARM3(ctx);
+    unsigned int _in_sgs = PT_REGS_PARM4(ctx);


why you need the above two lines to get _out_sgs and _in_sgs? They are in argument list and you should be able to use out_sgs and in_sgs directly?

yonghong-song · 2021-02-18T18:54:29Z

tools/virtiostat.py

+                                  gfp_t gfp)
+{
+    record(vq, &sg, 0, 1);
+


no need for extra line

yonghong-song · 2021-02-18T18:55:18Z

tools/virtiostat.py

+    else:
+        print("--------", end="\n")
+
+    print("%16s %12s %12s %12s %12s %16s %16s" % ("Driver", "Device", "VQ Name", "In SGs", "Out SGs", "In BW", "Out BW"), end="\n")


long line, break into two lines?

yonghong-song · 2021-02-18T18:55:29Z

tools/virtiostat.py

+    print("%16s %12s %12s %12s %12s %16s %16s" % ("Driver", "Device", "VQ Name", "In SGs", "Out SGs", "In BW", "Out BW"), end="\n")
+    stats = b.get_table("stats")
+    for k, v in sorted(stats.items(), key=lambda vs: vs[1].dev):
+        print("%16s %12s %12s %12d %12d %16d %16d" % (v.driver, v.dev, v.vqname, v.in_sgs, v.out_sgs, v.in_bw, v.out_bw), end="\n")


break into two lines?

yonghong-song · 2021-02-18T18:56:45Z

tools/virtiostat_example.txt

+--------
+          Driver       Device      VQ Name       In SGs      Out SGs            In BW           Out BW
+      virtio_net      virtio0     output.2            0            9                0              832
+      virtio_net      virtio0     output.3            0           92                0            13053


Let us ensure default output has at most 80 char's per line. Here we exceeded it.

yonghong-song · 2021-02-18T18:58:41Z

tools/virtiostat_example.txt

+
+
+This program traces virtio devices to analyze the IO operations and
+throughput.


Maybe you can give more explanation why these statistics are useful to find/debug a real issue. Real use case justification is needed to be included in bcc tools.

pizhenwei · 2021-02-19T07:18:20Z

@yonghong-song
Refine codes/docs as you suggested except

    unsigned int _out_sgs = PT_REGS_PARM3(ctx);
    unsigned int _in_sgs = PT_REGS_PARM4(ctx);

Add some comments in code, and I'm not sure if I hit a bug....
Two cases I have tested:
1, get out_sgs from args and PT_REGS_PARMx, and generate perf event. there is a probability that getting different value.
2, building a kmod and also get different value.

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/kprobes.h>
#include <linux/uaccess.h>
#include <linux/scatterlist.h>

static int kp_virtqueue_add_sgs_handler(struct kprobe *p, struct pt_regs *regs)
{
	struct scatterlist *sg = (struct scatterlist *)regs->si;
	int out = (int)regs->dx;
	int in = (int)regs->cx;

	printk("sg %p, out %d, in %d\n", sg, out, in);

	return 0;
}

static struct kprobe kp = {
	.symbol_name	= "virtqueue_add_sgs",
	.pre_handler = kp_virtqueue_add_sgs_handler,
};

static int __init kprobe_init(void)
{
	int ret;

	ret = register_kprobe(&kp);
	if (ret < 0) {
		pr_err("register_kprobe failed, returned %d\n", ret);
		return ret;
	}

	pr_info("[probe-unlink]Planted kprobe at %p\n", kp.addr);
	return 0;
}

static void __exit kprobe_exit(void)
{
	unregister_kprobe(&kp);
	pr_info("[probe-unregistered]kprobe at %p unregistered\n", kp.addr);
}

module_init(kprobe_init)
module_exit(kprobe_exit)
MODULE_LICENSE("GPL");
MODULE_AUTHOR("pizhenwei pizhenwei@bytedance.com");

yonghong-song · 2021-02-19T17:29:31Z

@pizhenwei thanks for reporting. Indeed, the incorrect param value is caused by a bug in bcc rewriter. I have just put a fix #3275.

When I try to run, I hit the following verifier bug:

-bash-4.4$ sudo ./virtiostat.py 
bpf: Argument list too long. Program  too large (402 insns), at most 4096 insns

Traceback (most recent call last):
  File "./virtiostat.py", line 184, in <module>
    b.attach_kprobe(event="virtqueue_add_sgs", fn_name="trace_virtqueue_add_sgs")
  File "/usr/lib/python2.7/site-packages/bcc/__init__.py", line 679, in attach_kprobe
    fn = self.load_func(fn_name, BPF.KPROBE)
  File "/usr/lib/python2.7/site-packages/bcc/__init__.py", line 412, in load_func
    (func_name, errstr))
Exception: Failed to load BPF program trace_virtqueue_add_sgs: Argument list too long
-bash-4.4$

I am using llvm12 and llvm13(trunk) and both have issues. Let me dig more on this. I suspect this is related to the following code

    for (i = 0; (i < VIRTIO_MAX_SGS) && (i < num); i++) {
        for (n = 0, sgp = sgs[i]; sgp && (n < SG_MAX); sgp = __sg_next(sgp)) {
            length += sgp->length;
            n++;
        }
    }

where verifier might not handle effectively.

yonghong-song · 2021-02-20T00:02:03Z

I did some analysis on verifier failure. The reason is due to the following code:

; for (i = 0; (i < VIRTIO_MAX_SGS) && (i < num); i++) { // Line  60
  90:   15 08 15 00 00 00 00 00 if r8 == 0 goto +21
  91:   bf 81 00 00 00 00 00 00 r1 = r8
  92:   07 01 00 00 ff ff ff ff r1 += -1
  93:   67 01 00 00 20 00 00 00 r1 <<= 32
  94:   77 01 00 00 20 00 00 00 r1 >>= 32
  95:   b7 02 00 00 05 00 00 00 r2 = 5
  96:   2d 12 01 00 00 00 00 00 if r2 > r1 goto +1
  97:   b7 08 00 00 06 00 00 00 r8 = 6
  98:   b7 02 00 00 00 00 00 00 r2 = 0
  99:   b7 09 00 00 00 00 00 00 r9 = 0
 100:   7b 8a 68 ff 00 00 00 00 *(u64 *)(r10 - 152) = r8
 101:   05 00 35 00 00 00 00 00 goto +53

Insn 100 saves r8, but r8 has conservative value at insn 91 (suppose insn 96 evaluates true) compared to r1. Such a conservative value will make verifier think the loop can execute up to u64_max which will cause verification failure.

We do have a pass in llvm12/13 intended to prevent such optimization. I will study more with llvm on how to prevent this transformation. Looks like you are using an early version of llvm which didn't do the above transformation..

brendangregg · 2021-02-20T00:14:30Z

Cool tool, thanks!

The man page is missing a description of the columns.

A later addition could be a -x mode for extended stats, to include average latency (similar to iostat).

pizhenwei · 2021-02-20T09:00:59Z

I did some analysis on verifier failure. The reason is due to the following code:
; for (i = 0; (i < VIRTIO_MAX_SGS) && (i < num); i++) { // Line  60
  90:   15 08 15 00 00 00 00 00 if r8 == 0 goto +21
  91:   bf 81 00 00 00 00 00 00 r1 = r8
  92:   07 01 00 00 ff ff ff ff r1 += -1
  93:   67 01 00 00 20 00 00 00 r1 <<= 32
  94:   77 01 00 00 20 00 00 00 r1 >>= 32
  95:   b7 02 00 00 05 00 00 00 r2 = 5
  96:   2d 12 01 00 00 00 00 00 if r2 > r1 goto +1
  97:   b7 08 00 00 06 00 00 00 r8 = 6
  98:   b7 02 00 00 00 00 00 00 r2 = 0
  99:   b7 09 00 00 00 00 00 00 r9 = 0
 100:   7b 8a 68 ff 00 00 00 00 *(u64 *)(r10 - 152) = r8
 101:   05 00 35 00 00 00 00 00 goto +53
Insn 100 saves r8, but r8 has conservative value at insn 91 (suppose insn 96 evaluates true) compared to r1. Such a conservative value will make verifier think the loop can execute up to u64_max which will cause verification failure.

We do have a pass in llvm12/13 intended to prevent such optimization. I will study more with llvm on how to prevent this transformation. Looks like you are using an early version of llvm which didn't do the above transformation..

I tested this on llvm-3.8, it worked fine. And tested on llvm-12, also reproduced this verification failure.

yonghong-song · 2021-02-22T00:46:34Z

tools/virtiostat.py

+    }
+}
+
+int trace_virtqueue_add_sgs(struct pt_regs *ctx, struct virtqueue *_vq,


please change _vq to vq, so argument, so it can be consistent with rest of arguments.

yonghong-song · 2021-02-22T00:46:49Z

tools/virtiostat.py

+{
+    /* NOTICE: to make sure your bcc is built with commit
+       b2317868af8c6b81a5b5065237589743f7a1168d */
+    record(_vq, sgs, out_sgs, in_sgs);


yonghong-song · 2021-02-22T00:47:25Z

tools/virtiostat.py

+
+{
+    /* NOTICE: to make sure your bcc is built with commit
+       b2317868af8c6b81a5b5065237589743f7a1168d */


Please remove the above comment. This commit will be merged on top of the above commit, so no need to mention that.

yonghong-song · 2021-02-22T00:52:20Z

@pizhenwei please also incorporate the following changes in your next revision. llvm and kernel work to address these issues are ongoing. Thanks!

diff --git a/tools/virtiostat.py b/tools/virtiostat.py
index a6a38bd1..1b5fbccf 100755
--- a/tools/virtiostat.py
+++ b/tools/virtiostat.py
@@ -95,6 +95,28 @@ static u64 count_len(struct scatterlist **sgs, unsigned int num)
             length += sgp->length;
             n++;
         }
+
+        /* IndVarSimplifyPass with clang 12 may cause verifier failure:
+         *   ; for (i = 0; (i < VIRTIO_MAX_SGS) && (i < num); i++) { // Line  60
+         *   90:   15 08 15 00 00 00 00 00 if r8 == 0 goto +21
+         *   91:   bf 81 00 00 00 00 00 00 r1 = r8
+         *   92:   07 01 00 00 ff ff ff ff r1 += -1
+         *   93:   67 01 00 00 20 00 00 00 r1 <<= 32
+         *   94:   77 01 00 00 20 00 00 00 r1 >>= 32
+         *   95:   b7 02 00 00 05 00 00 00 r2 = 5
+         *   96:   2d 12 01 00 00 00 00 00 if r2 > r1 goto +1
+         *   97:   b7 08 00 00 06 00 00 00 r8 = 6
+         *   98:   b7 02 00 00 00 00 00 00 r2 = 0
+         *   99:   b7 09 00 00 00 00 00 00 r9 = 0
+         *  100:   7b 8a 68 ff 00 00 00 00 *(u64 *)(r10 - 152) = r8
+         *  101:   05 00 35 00 00 00 00 00 goto +53
+         * Note that r1 is refined by r8 is saved to stack for later use.
+         * This will give verifier u64_max loop bound and eventually cause
+         * verification failure. Workaround with the below asm code.
+         */
+#if __clang_major__ >= 7
+        asm volatile("" : "=r"(i) : "0"(i));
+#endif
     }
 
     return length;
@@ -107,6 +129,13 @@ static void record(struct virtqueue *vq, struct scatterlist **sgs,
     virtio_stat_t *vs;
     u64 key = (u64)vq;
+    /* Workaround: separate two count_len() calls, one here and the
+     * other below. Otherwise, compiler may generate some spills which
+     * harms verifier pruning. This happens in llvm12, but not llvm4.
+     * Below code works on both cases.
+     */
+    u64 in_bw = count_len(sgs, in_sgs);
+
     vs = stats.lookup(&key);
     if (!vs) {
         bpf_probe_read_kernel(newvs.driver, sizeof(newvs.driver), vq->vdev->dev.driver->name);
@@ -114,13 +143,13 @@ static void record(struct virtqueue *vq, struct scatterlist **sgs,
         bpf_probe_read_kernel(newvs.vqname, sizeof(newvs.vqname), vq->name);
         newvs.in_sgs = in_sgs;
         newvs.out_sgs = out_sgs;
-        newvs.in_bw = count_len(sgs, in_sgs);
+        newvs.in_bw = in_bw;
         newvs.out_bw = count_len(sgs, out_sgs);
         stats.update(&key, &newvs);
     } else {
         vs->in_sgs += in_sgs;
         vs->out_sgs += out_sgs;
-        vs->in_bw += count_len(sgs, in_sgs);
+        vs->in_bw += in_bw;
         vs->out_bw += count_len(sgs, out_sgs);
     }
 }

yonghong-song · 2021-02-22T00:52:38Z

[buildbot, ok to test]

yonghong-song · 2021-02-22T00:54:50Z

Please consider to address the following comments from @brendangregg

The man page is missing a description of the columns.
A later addition could be a -x mode for extended stats, to include average latency (similar to iostat).

brendangregg · 2021-02-22T02:07:23Z

I tried using it, and as an iostat-like tool I found the options weren't what I expected. Instead of:

virtiostat
virtiostat -i 2
virtiostat -i 2 -d 10

I'd consider:

virtiostat
virtiostat 2
virtiostat 2 5

ie

virtiostat [interval [count]]

for consistency with the other stat tools.

Generally I'd reserve the tool arguments for the most common use case, and then options (like -T) for everything else. That saves typing. I think the most common use case here would be setting the interval and count, so we can drop the -i and -d. (note that this switches a duration to a count.)

pizhenwei · 2021-02-22T02:21:09Z

I tried using it, and as an iostat-like tool I found the options weren't what I expected. Instead of:
virtiostat
virtiostat -i 2
virtiostat -i 2 -d 10
I'd consider:
virtiostat
virtiostat 2
virtiostat 2 5
ie
virtiostat [interval [count]]
for consistency with the other stat tools.

Generally I'd reserve the tool arguments for the most common use case, and then options (like -T) for everything else. That saves typing. I think the most common use case here would be setting the interval and count, so we can drop the -i and -d. (note that this switches a duration to a count.)

I ran grep "parser.add_argument" *.py | grep grep -E "interval|duration" in tools directory, and noticed that a lot of tools use -i/-d.
Making virtiostat look like iostat may break the hobby of the bcc tools.

pizhenwei · 2021-02-22T02:58:28Z

Please consider to address the following comments from @brendangregg

The man page is missing a description of the columns.

Add some comment in description about SGs and BW.

A later addition could be a -x mode for extended stats, to include average latency (similar to iostat).

Let's make sure the basic version work fine firstly. To implement -x mode needs record the request starting time and done time (to track another function). So in my plan, I'd like to implement this in another PR.

yonghong-song · 2021-02-22T05:04:44Z

tools/virtiostat.py

+    if (out_sgs)
+        out_bw = count_len(sgs, out_sgs);
+    if (in_sgs)
+        in_bw = count_len(sgs + out_sgs, in_sgs);


The above code failed with llvm12 in my environment. Could you try my original suggestion in your environment?
if it works, let us use that. We can revisit once we got some fix in llvm or verifier.

Also, it looks count_len for in_bw is different from previous version. I checked the kernel source code. This version seems correct.

The above code failed with llvm12 in my environment. Could you try my original suggestion in your environment?
if it works, let us use that. We can revisit once we got some fix in llvm or verifier.

I tested this with LLVM version 12.0.0, it worked fine .... But I still modify it as your suggestion.

LLVM (http://llvm.org/): LLVM version 12.0.0 Optimized build. Default target: x86_64-pc-linux-gnu Host CPU: skylake

Also, it looks count_len for in_bw is different from previous version. I checked the kernel source code. This version seems correct.

Yes, this version is correct.

brendangregg · 2021-02-22T06:09:53Z

I ran grep "parser.add_argument" *.py | grep grep -E "interval|duration" in tools directory, and noticed that a lot of tools use -i/-d.
Making virtiostat look like iostat may break the hobby of the bcc tools.

From https://github.com/iovisor/bcc/blob/master/CONTRIBUTING-SCRIPTS.md:

Consider command line options. Should it have -p for filtering on a PID? -T for timestamps? -i for interval? See other tools for examples, and copy the style: the usage message should list example usage at the end. Remember to keep the tool doing one thing and doing it well. Also, if there's one option that seems to be the common case, perhaps it should just be the first argument and not need a switch (no -X). A special case of this is *stat tools, like iostat/vmstat/etc, where the convention is [interval [count]].

Other tools use -i -d when there is a different common case. E.g., funccount(8), where the common case is the function you want to trace, leaving interval/duration for the -i -d switches.

brendangregg · 2021-02-22T06:13:10Z

BTW, command line tools, on both Unix and Windows, frequently use this convention. For example, this is not their usage:

ls -l -file myfile.txt
cd -d c:\Windows

They have dropped the switch for the common case.

Add a new tool virtiostat to trace VIRTIO devices IO statistics. Although we have already had iostat(to show block device statistics), iftop(to show network device statistics), other devices of VIRTIO family(Ex, console, balloon, GPU and so on) also need tools for each type. virtiostat works in virtio lower layer, and it could trace all the devices. Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>

pizhenwei · 2021-02-22T06:43:43Z

I ran grep "parser.add_argument" *.py | grep grep -E "interval|duration" in tools directory, and noticed that a lot of tools use -i/-d.
Making virtiostat look like iostat may break the hobby of the bcc tools.

From https://github.com/iovisor/bcc/blob/master/CONTRIBUTING-SCRIPTS.md:

Consider command line options. Should it have -p for filtering on a PID? -T for timestamps? -i for interval? See other tools for examples, and copy the style: the usage message should list example usage at the end. Remember to keep the tool doing one thing and doing it well. Also, if there's one option that seems to be the common case, perhaps it should just be the first argument and not need a switch (no -X). A special case of this is *stat tools, like iostat/vmstat/etc, where the convention is [interval [count]].

Other tools use -i -d when there is a different common case. E.g., funccount(8), where the common case is the function you want to trace, leaving interval/duration for the -i -d switches.

Force pushed as you suggested. Thanks a lot.

usage: virtiostat.py [-h] [-T] [-D] [interval] [count]

yonghong-song · 2021-02-22T15:41:37Z

LGTM. Thanks!

brendangregg · 2021-02-23T01:36:21Z

Thanks!

The Select insn in BPF is expensive as BPF backend needs to resolve with conditionals. This patch set the getCmpSelInstrCost() to SCEVCheapExpansionBudget for Select insn to prevent some Select insn related optimizations. This change is motivated during bcc code review for iovisor/bcc#3270 where IndVarSimplifyPass eventually caused generating the following asm code: ; for (i = 0; (i < VIRTIO_MAX_SGS) && (i < num); i++) { 14: 16 05 40 00 00 00 00 00 if w5 == 0 goto +64 <LBB0_6> 15: bc 51 00 00 00 00 00 00 w1 = w5 16: 04 01 00 00 ff ff ff ff w1 += -1 17: 67 05 00 00 20 00 00 00 r5 <<= 32 18: 77 05 00 00 20 00 00 00 r5 >>= 32 19: a6 01 01 00 05 00 00 00 if w1 < 5 goto +1 <LBB0_4> 20: b7 05 00 00 06 00 00 00 r5 = 6 00000000000000a8 <LBB0_4>: 21: b7 02 00 00 00 00 00 00 r2 = 0 22: b7 01 00 00 00 00 00 00 r1 = 0 ; for (i = 0; (i < VIRTIO_MAX_SGS) && (i < num); i++) { 23: 7b 1a e0 ff 00 00 00 00 *(u64 *)(r10 - 32) = r1 24: 7b 5a c0 ff 00 00 00 00 *(u64 *)(r10 - 64) = r5 Note that insn #15 has w1 = w5 and w1 is refined later but r5(w5) is eventually saved on stack at insn #24 for later use. This cause later verifier failures. With this change, IndVarSimplifyPass won't do the above transformation any more. Differential Revision: https://reviews.llvm.org/D97479

The orignal bcc pull request iovisor/bcc#3270 exposed a verifier failure with Clang 12/13 while Clang 4 works fine. Further investigation exposed two issues. Issue 1: LLVM may generate code which uses less refined value. The issue is fixed in llvm patch https://reviews.llvm.org/D97479 Issue 2: Spills with initial value 0 are marked as precise which makes later state pruning less effective. This is my rough initial analysis and further investigation is needed to find how to improve verifier pruning in such cases. With the above llvm patch, for the new loop6.c test, which has smaller loop bound compared to original test, I got $ test_progs -s -n 10/16 ... stack depth 64 processed 405099 insns (limit 1000000) max_states_per_insn 92 total_states 8866 peak_states 889 mark_read 6 torvalds#10/16 loop6.o:OK Use the original loop bound, i.e., commenting out "#define WORKAROUND", I got $ test_progs -s -n 10/16 ... BPF program is too large. Processed 1000001 insn stack depth 64 processed 1000001 insns (limit 1000000) max_states_per_insn 91 total_states 23176 peak_states 5069 mark_read 6 ... torvalds#10/16 loop6.o:FAIL The purpose of this patch is to provide a regression test for the above llvm fix and also provide a test case for further analyzing the verifier pruning issue. Cc: zhenwei pi <pizhenwei@bytedance.com> Signed-off-by: Yonghong Song <yhs@fb.com>

The orignal bcc pull request iovisor/bcc#3270 exposed a verifier failure with Clang 12/13 while Clang 4 works fine. Further investigation exposed two issues. Issue 1: LLVM may generate code which uses less refined value. The issue is fixed in llvm patch https://reviews.llvm.org/D97479 Issue 2: Spills with initial value 0 are marked as precise which makes later state pruning less effective. This is my rough initial analysis and further investigation is needed to find how to improve verifier pruning in such cases. With the above llvm patch, for the new loop6.c test, which has smaller loop bound compared to original test, I got $ test_progs -s -n 10/16 ... stack depth 64 processed 405099 insns (limit 1000000) max_states_per_insn 92 total_states 8866 peak_states 889 mark_read 6 #10/16 loop6.o:OK Use the original loop bound, i.e., commenting out "#define WORKAROUND", I got $ test_progs -s -n 10/16 ... BPF program is too large. Processed 1000001 insn stack depth 64 processed 1000001 insns (limit 1000000) max_states_per_insn 91 total_states 23176 peak_states 5069 mark_read 6 ... #10/16 loop6.o:FAIL The purpose of this patch is to provide a regression test for the above llvm fix and also provide a test case for further analyzing the verifier pruning issue. Cc: zhenwei pi <pizhenwei@bytedance.com> Signed-off-by: Yonghong Song <yhs@fb.com>

The original bcc pull request iovisor/bcc#3270 exposed a verifier failure with Clang 12/13 while Clang 4 works fine. Further investigation exposed two issues. Issue 1: LLVM may generate code which uses less refined value. The issue is fixed in llvm patch https://reviews.llvm.org/D97479 Issue 2: Spills with initial value 0 are marked as precise which makes later state pruning less effective. This is my rough initial analysis and further investigation is needed to find how to improve verifier pruning in such cases. With the above llvm patch, for the new loop6.c test, which has smaller loop bound compared to original test, I got $ test_progs -s -n 10/16 ... stack depth 64 processed 390735 insns (limit 1000000) max_states_per_insn 87 total_states 8658 peak_states 964 mark_read 6 #10/16 loop6.o:OK Use the original loop bound, i.e., commenting out "#define WORKAROUND", I got $ test_progs -s -n 10/16 ... BPF program is too large. Processed 1000001 insn stack depth 64 processed 1000001 insns (limit 1000000) max_states_per_insn 91 total_states 23176 peak_states 5069 mark_read 6 ... #10/16 loop6.o:FAIL The purpose of this patch is to provide a regression test for the above llvm fix and also provide a test case for further analyzing the verifier pruning issue. Cc: zhenwei pi <pizhenwei@bytedance.com> Signed-off-by: Yonghong Song <yhs@fb.com>

The original bcc pull request iovisor/bcc#3270 exposed a verifier failure with Clang 12/13 while Clang 4 works fine. Further investigation exposed two issues. Issue 1: LLVM may generate code which uses less refined value. The issue is fixed in llvm patch https://reviews.llvm.org/D97479 Issue 2: Spills with initial value 0 are marked as precise which makes later state pruning less effective. This is my rough initial analysis and further investigation is needed to find how to improve verifier pruning in such cases. With the above llvm patch, for the new loop6.c test, which has smaller loop bound compared to original test, I got $ test_progs -s -n 10/16 ... stack depth 64 processed 390735 insns (limit 1000000) max_states_per_insn 87 total_states 8658 peak_states 964 mark_read 6 torvalds#10/16 loop6.o:OK Use the original loop bound, i.e., commenting out "#define WORKAROUND", I got $ test_progs -s -n 10/16 ... BPF program is too large. Processed 1000001 insn stack depth 64 processed 1000001 insns (limit 1000000) max_states_per_insn 91 total_states 23176 peak_states 5069 mark_read 6 ... torvalds#10/16 loop6.o:FAIL The purpose of this patch is to provide a regression test for the above llvm fix and also provide a test case for further analyzing the verifier pruning issue. Cc: zhenwei pi <pizhenwei@bytedance.com> Signed-off-by: Yonghong Song <yhs@fb.com>

The original bcc pull request iovisor/bcc#3270 exposed a verifier failure with Clang 12/13 while Clang 4 works fine. Further investigation exposed two issues. Issue 1: LLVM may generate code which uses less refined value. The issue is fixed in llvm patch https://reviews.llvm.org/D97479 Issue 2: Spills with initial value 0 are marked as precise which makes later state pruning less effective. This is my rough initial analysis and further investigation is needed to find how to improve verifier pruning in such cases. With the above llvm patch, for the new loop6.c test, which has smaller loop bound compared to original test, I got $ test_progs -s -n 10/16 ... stack depth 64 processed 390735 insns (limit 1000000) max_states_per_insn 87 total_states 8658 peak_states 964 mark_read 6 #10/16 loop6.o:OK Use the original loop bound, i.e., commenting out "#define WORKAROUND", I got $ test_progs -s -n 10/16 ... BPF program is too large. Processed 1000001 insn stack depth 64 processed 1000001 insns (limit 1000000) max_states_per_insn 91 total_states 23176 peak_states 5069 mark_read 6 ... #10/16 loop6.o:FAIL The purpose of this patch is to provide a regression test for the above llvm fix and also provide a test case for further analyzing the verifier pruning issue. Cc: zhenwei pi <pizhenwei@bytedance.com> Signed-off-by: Yonghong Song <yhs@fb.com>

The original bcc pull request iovisor/bcc#3270 exposed a verifier failure with Clang 12/13 while Clang 4 works fine. Further investigation exposed two issues: Issue 1: LLVM may generate code which uses less refined value. The issue is fixed in LLVM patch: https://reviews.llvm.org/D97479 Issue 2: Spills with initial value 0 are marked as precise which makes later state pruning less effective. This is my rough initial analysis and further investigation is needed to find how to improve verifier pruning in such cases. With the above LLVM patch, for the new loop6.c test, which has smaller loop bound compared to original test, I got: $ test_progs -s -n 10/16 ... stack depth 64 processed 390735 insns (limit 1000000) max_states_per_insn 87 total_states 8658 peak_states 964 mark_read 6 #10/16 loop6.o:OK Use the original loop bound, i.e., commenting out "#define WORKAROUND", I got: $ test_progs -s -n 10/16 ... BPF program is too large. Processed 1000001 insn stack depth 64 processed 1000001 insns (limit 1000000) max_states_per_insn 91 total_states 23176 peak_states 5069 mark_read 6 ... #10/16 loop6.o:FAIL The purpose of this patch is to provide a regression test for the above LLVM fix and also provide a test case for further analyzing the verifier pruning issue. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Zhenwei Pi <pizhenwei@bytedance.com> Link: https://lore.kernel.org/bpf/20210226223810.236472-1-yhs@fb.com

The Select insn in BPF is expensive as BPF backend needs to resolve with conditionals. This patch set the getCmpSelInstrCost() to SCEVCheapExpansionBudget for Select insn to prevent some Select insn related optimizations. This change is motivated during bcc code review for iovisor/bcc#3270 where IndVarSimplifyPass eventually caused generating the following asm code: ; for (i = 0; (i < VIRTIO_MAX_SGS) && (i < num); i++) { 14: 16 05 40 00 00 00 00 00 if w5 == 0 goto +64 <LBB0_6> 15: bc 51 00 00 00 00 00 00 w1 = w5 16: 04 01 00 00 ff ff ff ff w1 += -1 17: 67 05 00 00 20 00 00 00 r5 <<= 32 18: 77 05 00 00 20 00 00 00 r5 >>= 32 19: a6 01 01 00 05 00 00 00 if w1 < 5 goto +1 <LBB0_4> 20: b7 05 00 00 06 00 00 00 r5 = 6 00000000000000a8 <LBB0_4>: 21: b7 02 00 00 00 00 00 00 r2 = 0 22: b7 01 00 00 00 00 00 00 r1 = 0 ; for (i = 0; (i < VIRTIO_MAX_SGS) && (i < num); i++) { 23: 7b 1a e0 ff 00 00 00 00 *(u64 *)(r10 - 32) = r1 24: 7b 5a c0 ff 00 00 00 00 *(u64 *)(r10 - 64) = r5 Note that insn llvm#15 has w1 = w5 and w1 is refined later but r5(w5) is eventually saved on stack at insn llvm#24 for later use. This cause later verifier failures. With this change, IndVarSimplifyPass won't do the above transformation any more. Differential Revision: https://reviews.llvm.org/D97479

The original bcc pull request iovisor/bcc#3270 exposed a verifier failure with Clang 12/13 while Clang 4 works fine. Further investigation exposed two issues: Issue 1: LLVM may generate code which uses less refined value. The issue is fixed in LLVM patch: https://reviews.llvm.org/D97479 Issue 2: Spills with initial value 0 are marked as precise which makes later state pruning less effective. This is my rough initial analysis and further investigation is needed to find how to improve verifier pruning in such cases. With the above LLVM patch, for the new loop6.c test, which has smaller loop bound compared to original test, I got: $ test_progs -s -n 10/16 ... stack depth 64 processed 390735 insns (limit 1000000) max_states_per_insn 87 total_states 8658 peak_states 964 mark_read 6 #10/16 loop6.o:OK Use the original loop bound, i.e., commenting out "#define WORKAROUND", I got: $ test_progs -s -n 10/16 ... BPF program is too large. Processed 1000001 insn stack depth 64 processed 1000001 insns (limit 1000000) max_states_per_insn 91 total_states 23176 peak_states 5069 mark_read 6 ... #10/16 loop6.o:FAIL The purpose of this patch is to provide a regression test for the above LLVM fix and also provide a test case for further analyzing the verifier pruning issue. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Zhenwei Pi <pizhenwei@bytedance.com> Link: https://lore.kernel.org/bpf/20210226223810.236472-1-yhs@fb.com

The Select insn in BPF is expensive as BPF backend needs to resolve with conditionals. This patch set the getCmpSelInstrCost() to SCEVCheapExpansionBudget for Select insn to prevent some Select insn related optimizations. This change is motivated during bcc code review for iovisor/bcc#3270 where IndVarSimplifyPass eventually caused generating the following asm code: ; for (i = 0; (i < VIRTIO_MAX_SGS) && (i < num); i++) { 14: 16 05 40 00 00 00 00 00 if w5 == 0 goto +64 <LBB0_6> 15: bc 51 00 00 00 00 00 00 w1 = w5 16: 04 01 00 00 ff ff ff ff w1 += -1 17: 67 05 00 00 20 00 00 00 r5 <<= 32 18: 77 05 00 00 20 00 00 00 r5 >>= 32 19: a6 01 01 00 05 00 00 00 if w1 < 5 goto +1 <LBB0_4> 20: b7 05 00 00 06 00 00 00 r5 = 6 00000000000000a8 <LBB0_4>: 21: b7 02 00 00 00 00 00 00 r2 = 0 22: b7 01 00 00 00 00 00 00 r1 = 0 ; for (i = 0; (i < VIRTIO_MAX_SGS) && (i < num); i++) { 23: 7b 1a e0 ff 00 00 00 00 *(u64 *)(r10 - 32) = r1 24: 7b 5a c0 ff 00 00 00 00 *(u64 *)(r10 - 64) = r5 Note that insn #15 has w1 = w5 and w1 is refined later but r5(w5) is eventually saved on stack at insn #24 for later use. This cause later verifier failures. With this change, IndVarSimplifyPass won't do the above transformation any more. Differential Revision: https://reviews.llvm.org/D97479 (cherry picked from commit 1959ead)

The Select insn in BPF is expensive as BPF backend needs to resolve with conditionals. This patch set the getCmpSelInstrCost() to SCEVCheapExpansionBudget for Select insn to prevent some Select insn related optimizations. This change is motivated during bcc code review for iovisor/bcc#3270 where IndVarSimplifyPass eventually caused generating the following asm code: ; for (i = 0; (i < VIRTIO_MAX_SGS) && (i < num); i++) { 14: 16 05 40 00 00 00 00 00 if w5 == 0 goto +64 <LBB0_6> 15: bc 51 00 00 00 00 00 00 w1 = w5 16: 04 01 00 00 ff ff ff ff w1 += -1 17: 67 05 00 00 20 00 00 00 r5 <<= 32 18: 77 05 00 00 20 00 00 00 r5 >>= 32 19: a6 01 01 00 05 00 00 00 if w1 < 5 goto +1 <LBB0_4> 20: b7 05 00 00 06 00 00 00 r5 = 6 00000000000000a8 <LBB0_4>: 21: b7 02 00 00 00 00 00 00 r2 = 0 22: b7 01 00 00 00 00 00 00 r1 = 0 ; for (i = 0; (i < VIRTIO_MAX_SGS) && (i < num); i++) { 23: 7b 1a e0 ff 00 00 00 00 *(u64 *)(r10 - 32) = r1 24: 7b 5a c0 ff 00 00 00 00 *(u64 *)(r10 - 64) = r5 Note that insn flang-compiler#15 has w1 = w5 and w1 is refined later but r5(w5) is eventually saved on stack at insn flang-compiler#24 for later use. This cause later verifier failures. With this change, IndVarSimplifyPass won't do the above transformation any more. Differential Revision: https://reviews.llvm.org/D97479 (cherry picked from commit 1959ead)

The Select insn in BPF is expensive as BPF backend needs to resolve with conditionals. This patch set the getCmpSelInstrCost() to SCEVCheapExpansionBudget for Select insn to prevent some Select insn related optimizations. This change is motivated during bcc code review for iovisor/bcc#3270 where IndVarSimplifyPass eventually caused generating the following asm code: ; for (i = 0; (i < VIRTIO_MAX_SGS) && (i < num); i++) { 14: 16 05 40 00 00 00 00 00 if w5 == 0 goto +64 <LBB0_6> 15: bc 51 00 00 00 00 00 00 w1 = w5 16: 04 01 00 00 ff ff ff ff w1 += -1 17: 67 05 00 00 20 00 00 00 r5 <<= 32 18: 77 05 00 00 20 00 00 00 r5 >>= 32 19: a6 01 01 00 05 00 00 00 if w1 < 5 goto +1 <LBB0_4> 20: b7 05 00 00 06 00 00 00 r5 = 6 00000000000000a8 <LBB0_4>: 21: b7 02 00 00 00 00 00 00 r2 = 0 22: b7 01 00 00 00 00 00 00 r1 = 0 ; for (i = 0; (i < VIRTIO_MAX_SGS) && (i < num); i++) { 23: 7b 1a e0 ff 00 00 00 00 *(u64 *)(r10 - 32) = r1 24: 7b 5a c0 ff 00 00 00 00 *(u64 *)(r10 - 64) = r5 Note that insn #15 has w1 = w5 and w1 is refined later but r5(w5) is eventually saved on stack at insn #24 for later use. This cause later verifier failures. With this change, IndVarSimplifyPass won't do the above transformation any more. Differential Revision: https://reviews.llvm.org/D97479 (cherry picked from commit 1959ead)

The Select insn in BPF is expensive as BPF backend needs to resolve with conditionals. This patch set the getCmpSelInstrCost() to SCEVCheapExpansionBudget for Select insn to prevent some Select insn related optimizations. This change is motivated during bcc code review for iovisor/bcc#3270 where IndVarSimplifyPass eventually caused generating the following asm code: ; for (i = 0; (i < VIRTIO_MAX_SGS) && (i < num); i++) { 14: 16 05 40 00 00 00 00 00 if w5 == 0 goto +64 <LBB0_6> 15: bc 51 00 00 00 00 00 00 w1 = w5 16: 04 01 00 00 ff ff ff ff w1 += -1 17: 67 05 00 00 20 00 00 00 r5 <<= 32 18: 77 05 00 00 20 00 00 00 r5 >>= 32 19: a6 01 01 00 05 00 00 00 if w1 < 5 goto +1 <LBB0_4> 20: b7 05 00 00 06 00 00 00 r5 = 6 00000000000000a8 <LBB0_4>: 21: b7 02 00 00 00 00 00 00 r2 = 0 22: b7 01 00 00 00 00 00 00 r1 = 0 ; for (i = 0; (i < VIRTIO_MAX_SGS) && (i < num); i++) { 23: 7b 1a e0 ff 00 00 00 00 *(u64 *)(r10 - 32) = r1 24: 7b 5a c0 ff 00 00 00 00 *(u64 *)(r10 - 64) = r5 Note that insn #15 has w1 = w5 and w1 is refined later but r5(w5) is eventually saved on stack at insn #24 for later use. This cause later verifier failures. With this change, IndVarSimplifyPass won't do the above transformation any more. Differential Revision: https://reviews.llvm.org/D97479 (cherry picked from commit 0bc0084)

The Select insn in BPF is expensive as BPF backend needs to resolve with conditionals. This patch set the getCmpSelInstrCost() to SCEVCheapExpansionBudget for Select insn to prevent some Select insn related optimizations. This change is motivated during bcc code review for iovisor/bcc#3270 where IndVarSimplifyPass eventually caused generating the following asm code: ; for (i = 0; (i < VIRTIO_MAX_SGS) && (i < num); i++) { 14: 16 05 40 00 00 00 00 00 if w5 == 0 goto +64 <LBB0_6> 15: bc 51 00 00 00 00 00 00 w1 = w5 16: 04 01 00 00 ff ff ff ff w1 += -1 17: 67 05 00 00 20 00 00 00 r5 <<= 32 18: 77 05 00 00 20 00 00 00 r5 >>= 32 19: a6 01 01 00 05 00 00 00 if w1 < 5 goto +1 <LBB0_4> 20: b7 05 00 00 06 00 00 00 r5 = 6 00000000000000a8 <LBB0_4>: 21: b7 02 00 00 00 00 00 00 r2 = 0 22: b7 01 00 00 00 00 00 00 r1 = 0 ; for (i = 0; (i < VIRTIO_MAX_SGS) && (i < num); i++) { 23: 7b 1a e0 ff 00 00 00 00 *(u64 *)(r10 - 32) = r1 24: 7b 5a c0 ff 00 00 00 00 *(u64 *)(r10 - 64) = r5 Note that insn #15 has w1 = w5 and w1 is refined later but r5(w5) is eventually saved on stack at insn #24 for later use. This cause later verifier failures. With this change, IndVarSimplifyPass won't do the above transformation any more. Differential Revision: https://reviews.llvm.org/D97479

pizhenwei requested review from 4ast, brendangregg, drzaeus77, goldshtn and yonghong-song as code owners February 15, 2021 14:25

yonghong-song reviewed Feb 18, 2021

View reviewed changes

yonghong-song reviewed Feb 22, 2021

View reviewed changes

yonghong-song merged commit 8614895 into iovisor:master Feb 22, 2021

pizhenwei deleted the virtiostat branch September 29, 2022 03:11



		This program traces virtio devices to analyze the IO operations and
		throughput.

tools/virtiostat: add virtiostat tool #3270

tools/virtiostat: add virtiostat tool #3270

Conversation

pizhenwei commented Feb 15, 2021

yonghong-song commented Feb 17, 2021

pizhenwei commented Feb 18, 2021

yonghong-song commented Feb 18, 2021

yonghong-song commented Feb 18, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pizhenwei commented Feb 19, 2021

yonghong-song commented Feb 19, 2021

yonghong-song commented Feb 20, 2021

brendangregg commented Feb 20, 2021

pizhenwei commented Feb 20, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yonghong-song commented Feb 22, 2021

yonghong-song commented Feb 22, 2021

yonghong-song commented Feb 22, 2021

brendangregg commented Feb 22, 2021

pizhenwei commented Feb 22, 2021

pizhenwei commented Feb 22, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

brendangregg commented Feb 22, 2021

brendangregg commented Feb 22, 2021

pizhenwei commented Feb 22, 2021

yonghong-song commented Feb 22, 2021

brendangregg commented Feb 23, 2021