MMTests: Benchmarking framework primarily aimed at Linux kernel testing
Shell Perl Other
Clone or download
gormanm Compress dmesg and kconfig
Signed-off-by: Mel Gorman <mgorman@suse.com>
Latest commit d5de30f Jun 29, 2018
Permalink
Failed to load latest commit information.
bin mmtestsvmstat: Cope with THP split field rename Jun 21, 2018
configs nas: Consistent number of iterations for NAS OMP Jun 25, 2018
drivers microtput: Add a shell benchmark that can measure any command that ou… May 31, 2018
monitors watch-file-frag: Use SHELLPACK_DATA Aug 30, 2017
shellpack_src nas: Use power-of-two of square numbers depending on the compute kernel Jun 12, 2018
shellpacks Remove redundant libhugetlbfsbuild helper Jun 12, 2018
stap-patches stap: Update patches for 4.14-rc4 Oct 23, 2017
stap-scripts pagealloc: Do not use probe context for executing test Nov 24, 2016
subreport sockperf: Correct server support Oct 13, 2016
tunings Tunings: simple tuning script for sysctl values Jan 5, 2015
vmr vmr-stream: remove Mar 4, 2016
CHANGELOG mmtests 0.16 Jun 26, 2014
COPYING Update FSF address Jul 1, 2013
README Merge pull request #2 from sjp38/master Mar 5, 2018
compare-kernels.sh bonnie: Switch to using detailed IO reporting Jun 7, 2018
config configs: Remove references to fraganalysis Jun 15, 2017
format-for-display.sh compare-mmtests: compress JSON exports Jul 31, 2017
generate-monitor-only.sh configs: Remove references to fraganalysis Jun 15, 2017
prebuild-all.sh Convert a number of shellpacks to a template Sep 17, 2012
profile-disabled-hooks-operf.sh profile-operf: Correct typo Jul 3, 2015
profile-disabled-hooks-oprofile.sh profiling: Explicitly name oprofile, operf and perf helpers Feb 18, 2016
profile-disabled-hooks-perf.sh profile-disabled-hooks-perf: Always exit with success Nov 28, 2016
run-kvm.sh run-kvm: Add option to skip installing host kernel in guest Jul 28, 2017
run-mmtests.sh Compress dmesg and kconfig Jun 29, 2018

README

Overview
========

MMTests is a configurable test suite that runs a number of common
workloads of interest to MM developers. Ideally this would have been
to integrated with LTP, xfstests or Phoronix Test or implemented
with autotest.  Unfortunately, large portions of these tests are
cobbled together over a number of years with varying degrees of
quality before decent test frameworks were common.  The refactoring
effort to integrate with another framework is significant.

Organisation
============

The top-level directory has a single driver script called
run-mmtests.sh which reads a config file that describes how the
benchmarks should be run, configures the system and runs the requested
tests. config also has some per-test configuration items that can be
set depending on the test. The driver script takes the name of the
test as a parameter. Generally, this would be a symbolic name naming
the kernel being tested.

Each test is driven by a run-single-test.sh script which reads the relevant
driver-TESTNAME.sh script. It should not be used directly. High level
items such as profiling are configured from the top-level script while the
driver scripts typically convert the config parameters into switches for a
"shellpack". A shellpack is a pair of benchmark and install scripts that
are all stored in shellpacks/ .

Monitors can be optionally configured. A full list is in monitors/
. Care should be taken with monitors as there is a possibility that
they introduce overhead of their own.  Hence, for some performance
sensitive tests it is preferable to have no monitoring.

Many of the tests download external benchmarks. An attempt will be
made to download from a mirror . To get an idea where the mirror
should be located, grep for MIRROR_LOCATION= in shellpacks/.

A basic invocation of the suite is

<pre>
$ cp configs/config-global-dhp__pagealloc-performance config
$ ./run-mmtests.sh --no-monitor 3.0-nomonitor
$ ./run-mmtests.sh --run-monitor 3.0-runmonitor
</pre>

Configuration
=============

The config file used is always called "config". A number of other
sample configuration files are provided that have a given theme. Some
important points of variability are;

MMTESTS is a list of what tests will be run

LINUX_GIT is the location of a git repo of the kernel. At the moment it's only
	used during report generation

SKIP_*PROFILE
	These parameters determine what profiling runs are done. Even with
	profiling enabled, a non-profile run can be used to ensure that
	the profile and non-profile runs are comparable.

SWAP_CONFIGURATION
SWAP_PARTITIONS
SWAP_SWAPFILE_SIZEMB
	It's possible to use a different swap configuration than what is
	provided by default.
	
TESTDISK_RAID_DEVICES
TESTDISK_RAID_MD_DEVICE
TESTDISK_RAID_OFFSET
TESTDISK_RAID_SIZE
TESTDISK_RAID_TYPE
	If the target machine has partitions suitable for configuring RAID,
	they can be specified here. This RAID partition is then used for
	all the tests

TESTDISK_PARTITION
	Use this partition for all tests

TESTDISK_FILESYSTEM
TESTDISK_MKFS_PARAM
TESTDISK_MOUNT_ARGS
	The filesystem, mkfs parameters and mount arguments for the test
	partitions

TESTDISK_DIR
	A directory passed to the test. If not set, defaults to
	SHELLPACK_TEMP. The directory is supposed to contain a precreated
	environment (eg. a specifically created filesystem mounted with desired
	mount options).

STORAGE_CACHE_TYPE
STORAGE_CACHING_DEVICE
STORAGE_BACKING_DEVICE
	It's also possible to use storage caching.
	STORAGE_CACHE_TYPE is either "dm-cache" or "bcache". The devices
	specified with STORAGE_CACHING_DEVICE and STORAGE_BACKING_DEVICE are
	used to create the cache device which then is used for all the
	tests.

As MMTests downloads a number of benchmarks it is possible to create
a local mirror. The location of the mirror is configured in WEBROOT in
shellpacks/common-config.sh.  For example, kernbench tries to download
$WEBROOT/kernbench/linux-3.0.tar.gz . If this is not available, it is
downloaded from the internet. This can add delays in testing and consumes
bandwidth so is worth configuring.

Available tests
===============

Note the ones that are marked untested. These have been ported from other
test suites but no guarantee they actually work correctly here. If you want
to run these tests and run into a problem, report a bug.

kernbench
	Builds a kernel 5 times recording the time taken to completion.
	An average time is stored. This is sensitive to the overall
	performance of the system as it hits a number of subsystems.

multibuild
	Similar to kernbench except it runs a number of kernel compiles
	in parallel. Can be useful for stressing the system and seeing
	how well it deals with simple fork-based parallelism.

aim9
	Runs a short version of aim9 by default. Each test runs for 60
	seconds. This is a micro-benchmark of a number of VM operations. It's
	sensitive to changes in the allocator paths for example.

dbench3
	Runs the dbench3 benchmark. This is an old benchmark but it is also
	a poor benchmark. The average results are sensitive to how long it
	was running as different portions of the command file have different
	throughput. This means that the exact time the benchmark stops makes
	a difference to average throughput.

	dbench3 when run asynchronously is sensitive to when background
	writeback or journalling takes place. High throughput figures may
	depend on the operations completing in memory before processes
	like kjournald (where applicable) or flushers kick in. If dbench
	does most of its work in memory then the throughput is artificially
	high. Running the benchmark in synchronous mode can offset this.

	Alterations to hold times of i_mutex can adversely affect dbench
	figures because in some cases long hold times can give an artificial
	boost to metadata modifications in memory without the results ever
	hitting the disk.

	Interpreting dbench3 figures can be a pain due to these limitations.
	However, it can still be useful as an early warning system. If
	dbench3 throws up something anomalous it is worth finding out the
	root cause and correlating it with other results.

dbench4
	Runs the dbench4 benchmark. This is in better shape than dbench3
	in that it is more of an IO benchmark. It also reports on latencies
	of various different operations which is useful.

ffsb
	The ffsb benchmark takes a workfile to simulate various different
	workloads. Support is there to generate profiles similar to what
	is used for evaluating btrfs.

vmr-stream
	Runs the STREAM benchmark a number of times for varying sizes. An
	average is recorded. This can be used to measure approximate memory
	throughput or the average cost of a number of basic operations. It is
	sensitive to cache layout used for page faults.

vmr-cacheeffects (untested)
	Performs linear and random walks on nodes of different sizes stored in
	a large amount of memory. Sensitive to cache footprint and layout.

vmr-createdelete (untested)
	A micro-benchmark that measures the time taken to create and delete
	file or anonymous mappings of increasing sizes. Sensitive to changes
	in the page fault path performance.

iozone
	A basic filesystem benchmark.

fsmark
	This tests write workloads varying the number of files and directory
	depth.

hackbench-*
	Hackbench is generally a scheduler benchmark but is also sensitive to
	overhead in the allocators and to a lesser extent the fault paths.
	Can be run for either sockets or pipes.

largecopy
	This is a simple single-threaded benchmark that downloads a large
	tar file, expands it a number of times, creates a new tar and
	expands it again. Each operation is timed and is aimed at shaking
	out stall-related bugs when copying large amounts of data

largedd
	Similar to largecopy except it uses dd instead of cp.

libreofficebuild
	This downloads and builds libreoffice. It is a more aggressive
	compile-orientated test. This is a very download-intensive
	benchmark and was only created as a reproduction case for
	a bug.

nas-*
	The NAS Parallel Benchmarks for the serial and openmp versions of
	the test.

netperf-*
	Runs the netperf benchmark for *_STREAM on the local machine.
	Sensitive to cache usage and allocator costs. To test for cache line
	bouncing, the test can be configured to bind to certain processors.

pipetest
	This is a scheduler ping-pong test where two processes send a byte
	to each other over a pipe to measure context switch latency. More
	details on the motivation behind the benchmark are available at
	http://thread.gmane.org/gmane.linux.kernel/1129232/focus=1129389

postmark

	This workload is a single-threaded benchmark intended to measure
	filesystem performance for many short-lived and small files. It
	is meant to simulate workloads like mail servers or web servers
	serving static files using a mix of data and metadata intensive
	operations. The main weakness is that it does not application
	processing and all the transactions are filesystem-based. The
	ratio of each operation type is not necessarily representative
	of anything.

	Optionally a program can be run in the background that consumes
	anonymous memory. The background program is vary rarely needed
	except when trying to identify desktop stalls during heavy IO.

simple-writeback
	This is a simple writeback test based on dd. It's meant to be
	easy to understand and quick to run. Useful for measuring page
	writeback changes.

speccpu (untested)
	SPECcpu, what else can be said. A restriction is that you must have
	a mirrored copy of the tarball as it is not publicly available.

specjvm (untested)
	SPECjvm. Same story as speccpu

specomp (untested)
	SPEComp. Same story as speccpu

starve
	This is an old scheduler benchmark that was used to illustrate
	a particular bug in the O(1) scheduler. It's no longer really
	relevant but as it is a bug that occurred before, it may be
	worth monitoring. More details at
	http://www.hpl.hp.com/research/linux/kernel/o1-starve.php

sysbench
	Runs the complex workload for sysbench backed by postgres. Running
	this test requires a significant build environment on the test
	machine. It can run either read-only or read/write tests.

tiobench
	Threaded IO benchmark.

ltp (untested)
	The LTP benchmark. What it is testing depends on exactly which of the
	suite is configured to run.

ltp-pounder (untested)
	ltp pounder is a non-default test that exists in LTP. It's used by
	IBM for hardware certification to hammer a machine for a configured
	number of hours. Typically, they expect it to run for 72 hours
	without major errors.  Useful for testing general VM stability in
	high-pressure low-memory situations.

stress-highalloc
	This test requires that the system not have too much memory and
	that systemtap is available. Typically, it's tested with 3GB of
	RAM. It builds a number of kernels in parallel such that total
	memory usage is 1.5 times physical memory. When this is running
	for 5 minutes, it tries to allocate a large percentage of memory
	(e.g. 95%) as huge pages recording the latency of each operation as it
	goes. It does this twice. It then cancels the kernel compiles, cleans
	the system and tries to allocate huge pages at rest again. It's a
	basic test for fragmentation avoidance and the performance of huge
	page allocation.

xfstests (untested)
	This is still at prototype level and aimed at running testcase 180
	initially to reproduce some figures provided by the filesystems people.

futexbench-*
	Collection of programs to measure a number of core futex operations
	related to the overall architecture. Specifically: (i) hash, (ii) wake,
	and (iii) requeue operations.  Built on-top of the perf-bench
	infrastructure.

Reporting
=========

For reporting, there is a basic compare-kernels.sh script. It must be updated
with a list of kernels you want to compare and in what order. It generates a
table for each test, operation and kernel showing the relative performance
of each. The test reporting scripts are in subreports/. compare-kernels.sh
should be run from the path storing the test logs. By default this is
work/log. If you are automating tests from an external source, work/log is
what you should be capturing after a set of tests complete.

If monitors are configured such as ftrace, there are additional
processing scripts. They can be activated by setting FTRACE_ANALYSERS in
compare-kernels.sh. A basic post-process script is mmtests-duration which
simply reports how long an individual test took and what its CPU usage was.

There are a limited number of graphing scripts included in report/

TODO
====

o Add option to test on filesystem loopback device stored on tmpfs
o Add volanomark