Contributing bpftrace/eBPF tools
If you want to contribute tools to bpftrace, please read this checklist first.
(Written by Brendan Gregg. Adapted from the bcc version).
bpftrace tool development checklist:
- Research the topic landscape. Learn the existing tools and metrics (incl. from /proc). Determine what real world problems exist and need solving. We have too many tools and metrics as it is, we don't need more "I guess that's useful" tools, we need more "ah-hah! I couldn't do this before!" tools. Consider asking other developers about your idea. Many of us can be found in IRC, in the #iovisor channel on irc.oftc.net. There's also the iovisor mailing list (see the README.md), and github for issues.
- Create a known workload for testing. This might involving writing a 10 line C program, using a micro-benchmark, or just improvising at the shell. If you don't know how to create a workload, learn! Figuring this out will provide invaluable context and details that you may have otherwise overlooked. Sometimes it's easy, and I'm able to just use dd(1) from /dev/urandom or a disk device to /dev/null. It lets me set the I/O size, count, and provides throughput statistics for cross-checking my tool output. But other times I need a micro-benchmark, or some C.
- Write the tool to solve the problem and no more. Unix philosophy: do one thing and do it well. netstat doesn't have an option to dump packets, tcpdump-style. They are two different tools.
- Consider bcc for custom output and options. Need to really customize your output? Want to support a variety of command line options? It sounds like your tool may be better as a bcc tool, which currently supports these using Python (and other) interfaces bcc.
- Check your tool correctly measures your known workload. If possible, run a prime number of events (eg, 23) and check that the numbers match. Try other workload variations.
- Use other observability tools to perform a cross-check or sanity check. Eg, imagine you write a PCI bus tool that shows current throughput is 28 Gbytes/sec. How could you sanity test that? Well, what PCI devices are there? Disks and network cards? Measure their throughput (iostat, nicstat, sar), and check if is in the ballpark of 28 Gbytes/sec (which would include PCI frame overheads). Ideally, your numbers match.
- Measure the overhead of the tool. If you are running a micro-benchmark, how much slower is it with the tool running. Is more CPU consumed? Try to determine the worst case: run the micro-benchmark so that CPU headroom is exhausted, and then run the bpftrace tool. Can overhead be lowered?
- Test again, and stress test. You want to discover and fix all the bad things before others hit them.
- Consider your own repository. Your tool does not need to be here! bpftrace makes it very easy to create new tools, perhaps too easy. As the previous items described, it's possible to create tools that print metrics that are incorrect, or cause too high overhead. Tools here will likely be run on production servers as root, at many companies, and we want to err on the side of caution. You can always create your own repository of bpftrace tools, and once they have had some exposure, testing, and bug fixes, consider contributing them here.
- Concise, intuitive, self-explanatory output. The default output should meet the common need concisely. Consider including a startup message that's self-explanatory, eg "Tracing block I/O. Output every 1 seconds. Ctrl-C to end.". Also, try hard to keep the output less than 80 characters wide, especially the default output of the tool. That way, the output not only fits on the smallest reasonable terminal, it also fits well in slide decks, blog posts, articles, and printed material, all of which help education and adoption. Publishers of technical books often have templates they require books to conform to: it may not be an option to shrink or narrow the font to fit your output.
- Check style: Do you have a consistent convention for indentation, variable names, and comment style? You can follow the lead from the other tools.
- Write an _example.txt file. Copy the style in tools/biolatency_example.txt: start with an intro sentence, then have examples, and finish with the USAGE message. Explain everything: the first example should explain what we are seeing, even if this seems obvious. For some people it won't be obvious. Also explain why we are running the tool: what problems it's solving. It can take a long time (hours) to come up with good examples, but it's worth it. These will get copied around (eg, presentations, articles).
- Read your example.txt file. Does this sound too niche or convoluted? Are you spending too much time explaining caveats? These can be hints that perhaps you should fix your tool, or abandon it! I've abandoned many tools at this stage.
- Write a man page. Either ROFF (.8), markdown (.md), or plain text (.txt): so long as it documents the important sections, particularly columns (fields) and caveats. These go under man/man8. See the other examples. Include a section on overhead, and pull no punches. It's better for end users to know about high overhead beforehand, than to discover it the hard way. Also explain caveats. Don't assume those will be obvious to tool users.
- Read your man page. For ROFF: nroff -man filename. Like before, this exercise is like saying something out loud. Does it sound too niche or convoluted? Again, hints that you might need to go back and fix things, or abandon it.
- Spell check your documentation. Use a spell checker like aspell to check your document quality before committing.
- Add an entry to README.md.
- If you made it this far, pull request!