DEDISbench and DEDISgen
(c) 2010-2017 INESC TEC and U. Minho. Written by J. Paulo, M. Freitas
DEDISbench is implemented in C and performs read and write block disk I/O tests on top of a file system or block storage solution. Its main contribution is the ability to take, as input, a file that specifies a distribution of duplicate content and to use this information to generate a synthetic workload that follows that distribution. This input file can be written by the user or generated automatically with DEDISgen, an analysis tool that processes a real dataset and extracts its duplicate distribution.
I/O operations can be executed concurrently by several processes working on top of different files or disk regions. Moreover, the benchmark can be configured to stop the evaluation when a certain amount of data has been written or when a pre-defined period of time has elapsed. Another novel feature is the possibility of performing I/O operations with different load intensities. In addition to a stress load, where the benchmark issues I/O operations as fast as possible to stress the system, DEDISbench supports performing operations at a nominal load specified by the user.
For more information about the DEDISbench algorithm, see these two published papers:
- "DEDISbench: A Benchmark for Deduplicated Storage Systems"
- "Towards an Accurate Evaluation of Deduplicated Storage Systems"
The benchmark is written in C and can simulate several processes concurrently writing/reading fixed-size blocks, with realistic content, to multiple files or to a block device. The location of the read/write operations can follow a sequential, random uniform or random hotspot distribution. The latter relies on the NURand function from the TPC-C benchmark, which generates hotspots for I/O operations. A realistic distribution extracted from real storage systems is used to generate the blocks' content. It is also possible to use other realistic distributions by loading them from a custom file generated with DEDISgen. DEDISbench provides two workload modes. The first reproduces a fully (peak) loaded system, as Bonnie++ does, and issues as many I/O operations per second as possible. The second reproduces a system under a reasonable nominal load, which is useful for understanding the behavior of storage operations in a stable system with limited I/O throughput.
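As an illustration of the hotspot access pattern, here is a minimal Python sketch of the NURand function as defined in the TPC-C specification, used to pick block offsets. The helper name hotspot_offsets, the value a=1023 and the way the constant c is drawn are assumptions made for this example; they are not taken from the DEDISbench source.

```python
import random


def nurand(a, x, y, c, rng=random):
    """TPC-C non-uniform random integer in the closed range [x, y]."""
    return (((rng.randint(0, a) | rng.randint(x, y)) + c) % (y - x + 1)) + x


def hotspot_offsets(n_ops, n_blocks, block_size, a=1023, seed=42):
    """Return n_ops byte offsets, skewed towards a subset of hot blocks.

    The parameter a=1023 is a hypothetical choice for this sketch.
    """
    rng = random.Random(seed)
    c = rng.randint(0, a)  # constant chosen once and reused for the whole run
    return [nurand(a, 0, n_blocks - 1, c, rng) * block_size
            for _ in range(n_ops)]
```

The bitwise OR of two uniform draws is what skews the result: some block numbers are produced far more often than others, which yields the hotspots.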
DEDISbench real content distributions:
DEDISbench comes with three distinct distributions extracted from real storage systems with different requirements and assumptions:
- An Archival storage where most files have a write-once policy, with non-significant updates, but with sporadic data deletion. This distribution is called dist_archival.
- A Personal Files storage where some files are updated and deleted frequently and I/O request latency is expected to be lower than in archival storage. This distribution is called dist_personalfiles.
- A High Performance storage where most files are updated and deleted frequently and I/O latency is expected to be as low as possible. This distribution is called dist_highperf.
All these distributions are available and can be simulated by DEDISbench. NOTE: By default DEDISbench uses the Personal Files Storage distribution.
Duplicate distribution generator: DEDISgen
The DEDISgen binary analyses a specific folder (and its subfolders) or a disk device and describes the distribution of duplicates found there. When requested with the -o option, it writes an output distribution file that DEDISbench can consume to simulate that duplicate distribution.
DEDISgen is packaged with DEDISgenutils, which may be needed for more complex analyses of duplicate data. For instance, if several datasets must be analysed separately and then merged together to generate the output distribution file for DEDISbench, both tools can be used as explained below.
Finally, to view the results in a more readable form, both DEDISbench and DEDISgen output the files needed to plot some of their results with gnuplot.
The only libraries required to run the benchmark should be libc6-dev, libdb-dev and libssl-dev.
To build DEDISbench, run the following commands from the top-level directory:
$ ./autogen.sh
$ ./configure
$ make
Running the Benchmark
-p or -n<value> Peak or Nominal Load with a throughput rate of N operations per second. If a mixed nominal benchmark of reads and writes is defined, use -nr<value> and -nw<value> for the nominal rate of reads and writes respectively.
-w or -r or -m Write or Read Benchmark or a Mix of write and read operations.
-t<value> or -s<value> Benchmark duration (-t) in Minutes or amount of data to write (-s) in MB
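To make the difference between peak and nominal load concrete, the following Python sketch paces operations at a fixed rate. It only illustrates the idea; the function name run_nominal and its parameters are hypothetical and do not describe how DEDISbench implements -n internally.

```python
import time


def run_nominal(op, ops_per_second, duration_s,
                clock=time.monotonic, sleep=time.sleep):
    """Call op() at roughly ops_per_second for duration_s seconds."""
    interval = 1.0 / ops_per_second
    total_ops = int(ops_per_second * duration_s)
    start = clock()
    for i in range(total_ops):
        # Each operation has a scheduled start time; wait if we are early.
        delay = start + i * interval - clock()
        if delay > 0:
            sleep(delay)
        op()
    return total_ops
```

Under this sketch, a peak load is the same loop without the sleep, so the next operation is issued as soon as the previous one returns.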
Configuration file options
The default configuration file is conf/defconf.ini. To use a custom configuration file, use -f followed by its path (for example, -fconfig.ini).
value Input File with duplicate distribution (default: internal file conf/dist_personalfiles)
DEDISbench can simulate three real distributions, extracted respectively from an Archival, a Personal Files and a High Performance storage. To choose one of these distributions, the value must be dist_archival, dist_personalfiles or dist_highperf respectively.
The input file details the number of blocks with a certain number of duplicates, and the format is: <number_duplicates> <number_blocks>
See below for more information on customizing distribution files and above for information on the default distributions.
value Number of concurrent processes (default: value=4). Each process has an independent file associated (or a common device if the rawdevice option is used)
value Size of the file of each process in MB. If the rawdevice option is used, this parameter defines the size of the raw device. (default:
value Processes write/read from a raw device instead of having an independent file. If more than one process is defined, each process is assigned an independent chunk of the raw device, dependent on the raw device size. By default, if this flag is not set, each process writes to an individual file.
value Enable data integrity checks (default: value=0):
0 - no integrity check is done
1 - static integrity check is done when the benchmark ends
2 - online integrity check is done for benchmark read requests
3 - both online and static verifications are done
Results are written to ./results/intgr_*
Files must be pre-populated with realistic content for read and mixed workloads to ensure that integrity checks are always correct.
value Size of blocks for I/O operations in Bytes (default:
value I/O operations synchronization (default: value=0):
0 - without fsync and O_DIRECT
1 - O_DIRECT
2 - fsync
3 - both
value Enable or disable the population of process files/device before running DEDISbench: 0 - disabled, 1 - enabled (with realistic content), 2 - enabled (with DD). Only enabled by default (with value 1) for read and mixed tests.
value Seed for random generator (default: current time). Useful for repeating
value Choose the directory where DEDISbench writes/reads data. (default:
value I/O latency results are written to a log file so that additional statistics can be extracted. Each process writes these values to a file called result, where each line corresponds to a single I/O operation and presents:
(latency of I/O operation in microseconds) (current time in seconds).
value Generate an output log with the distribution actually generated by the benchmark.
This also generates the files needed to plot the distribution with gnuplot.
value Generate an output log with the access pattern generated by the benchmark.
This also generates the files needed to plot the accesses to each block, through time, with gnuplot.
file:RU:CD Write the output of DEDISbench to the path given in file. This feature also writes two additional files, named after file with a snaplat and a snapthr suffix, that show the average latency and throughput values for 30-second intervals.
RU is the ramp-up time in half minutes.
CD is the cool-down time, also in half minutes.
It also writes the files needed to plot a graph of both throughput and latency with gnuplot.
value Disable the destruction of process temporary files generated by the benchmark. The default behaviour is to destroy temporary files (value=0, or don't specify this option).
value Option to keep/delete databases from the previous execution: 0 - deletes databases,
1 - keeps databases.
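The raw device option above states that each process is assigned an independent chunk of the device. One way this could work is sketched below in Python, assuming equal-sized, block-aligned chunks. The function name partition_device and the 4096-byte block size are hypothetical; the actual layout used by DEDISbench may differ.

```python
def partition_device(device_size_mb, nprocs, block_size=4096):
    """Split a device into nprocs non-overlapping (offset, length) chunks.

    Offsets and lengths are in bytes and aligned to block_size.
    Any blocks left over after the equal split are simply not used.
    """
    total_blocks = device_size_mb * 1024 * 1024 // block_size
    blocks_per_proc = total_blocks // nprocs
    chunk_bytes = blocks_per_proc * block_size
    return [(p * chunk_bytes, chunk_bytes) for p in range(nprocs)]
```

For a 4096 MB device shared by 8 processes, this gives each process a 512 MB region starting at a multiple of 512 MB.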
run write benchmark in peak mode for 5 minutes
./DEDISbench -p -w -t5
run read benchmark in nominal mode (300 reads/second) for 10 minutes
./DEDISbench -n300 -r -t10
run write benchmark in peak mode for 5 minutes. Enable the population of process files before actually running the benchmark load and enable log output for results. Use files with 4GB for each process and 8 processes.
./DEDISbench -p -w -t5 -fconfig.ini
[execution]
filesize=4096
nprocs=8
populate=1
[results]
logging=1
run read benchmark in nominal mode (300 reads/second) for 8 minutes. Disable the pre-population of files. Be careful: the files test(processid) must already exist and have content; otherwise no content will be available for reading, resulting in I/O errors.
./DEDISbench -n300 -r -t8 -fconfig.ini
run an 8-minute mixed benchmark with a nominal load (100 reads/s and 50 writes/s) on a raw device with 4GB.
./DEDISbench -m -nr100 -nw50 -t8 -fconfig.ini
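The per-operation latency log described in the configuration options can be post-processed to extract additional statistics. The Python sketch below assumes each line holds two whitespace-separated numbers, the latency in microseconds followed by the time in seconds; the exact on-disk layout is an assumption, and summarize_latency is a hypothetical helper.

```python
from collections import Counter


def summarize_latency(lines):
    """Summarize log lines of the form '<latency_us> <time_s>'."""
    latencies = []
    per_second = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        latencies.append(float(fields[0]))
        per_second[int(float(fields[1]))] += 1
    if not latencies:
        return None
    return {
        "ops": len(latencies),
        "mean_us": sum(latencies) / len(latencies),
        "max_us": max(latencies),
        "ops_per_second": dict(per_second),
    }
```

It can be applied to a file with summarize_latency(open(path)), where path points to one process's log.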
Running the generator DEDISgen:
-f or -d Find duplicates in a folder (-f) or in a disk device (-d)
-p<value> Path for the folder or disk device
-o<value> Path for the output distribution file (please specify the full path, including the name of the distribution file to be generated). This is only necessary if a distribution file of duplicates is going to be generated
-z<value> Path for the folder where duplicates databases are created (default: ./gendbs/). duplicate_db is the database with duplicates information (hash->number_dups)
-b<value> Size of blocks to analyse in bytes, e.g. -b1024,4096,8192 (default: -b4096)
-r Option to delete databases from the previous execution: 0 - deletes databases, 1 - keeps databases.
DEDISgen also outputs the files needed to plot the generated distribution with gnuplot.
generate duplicate distribution file /dir/dist_file for files in folder /dir/duplicates/ and its subfolders
./DEDISgen -f -p/dir/duplicates/ -o/dir/dist_file
generate duplicate distribution file /dir/dist_file for device /path/device. Analyse block sizes of 4096 and 8192 bytes (8KB). The files dist_file4096 and dist_file8192 will be created.
./DEDISgen -d -p/path/device -o/dir/dist_file -b4096,8192
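The kind of analysis DEDISgen performs can be approximated in a few lines of Python: split each file into fixed-size blocks, hash every block, and count how many times each hash occurs. This is only a sketch of the general technique. The use of SHA-1, the treatment of a short final block as a block of its own, and all function names are assumptions, and the sketch keeps its counts in memory rather than in an on-disk database.

```python
import hashlib
import os
from collections import Counter


def walk_files(root):
    """Yield the path of every file under root, including subfolders."""
    for dirpath, _, names in os.walk(root):
        for name in names:
            yield os.path.join(dirpath, name)


def count_block_hashes(paths, block_size=4096):
    """Map block hash -> number of occurrences across all given files."""
    counts = Counter()
    for path in paths:
        with open(path, "rb") as f:
            while True:
                block = f.read(block_size)
                if not block:
                    break
                counts[hashlib.sha1(block).digest()] += 1
    return counts


def duplicate_distribution(counts):
    """Return sorted (number_duplicates, number_blocks) pairs.

    A block seen n times is counted as having n - 1 duplicates.
    """
    dist = Counter(n - 1 for n in counts.values())
    return sorted(dist.items())


def format_distribution(pairs):
    """Render pairs as '<number_duplicates> <number_blocks>' lines."""
    return "".join(f"{dups} {blocks}\n" for dups, blocks in pairs)
```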
Deduplication distribution FILE:
This file describes the number of blocks with a specific number of duplicates and has the following format:
0 1000
1 500
5 10
In this example there are 1000 blocks without any duplicate, 500 blocks with one duplicate and 10 blocks with 5 duplicates.
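A distribution file in this format is easy to inspect with a short script. The Python sketch below parses the file and derives a few totals, under the assumption that a block with N duplicates is one distinct content stored N + 1 times. That interpretation, and the function names, are assumptions made for this example.

```python
def parse_distribution(text):
    """Parse '<number_duplicates> <number_blocks>' lines into a dict."""
    dist = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        dups, blocks = int(fields[0]), int(fields[1])
        dist[dups] = dist.get(dups, 0) + blocks
    return dist


def summarize(dist):
    """Return (distinct_blocks, stored_blocks, duplicate_fraction)."""
    distinct = sum(dist.values())
    stored = sum((dups + 1) * blocks for dups, blocks in dist.items())
    return distinct, stored, 1 - distinct / stored
```

For the example file, this gives 1510 distinct blocks and 1000 + 2 * 500 + 6 * 10 = 2060 stored blocks, so about 27% of the stored blocks are duplicates.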
UTILS for DEDISgen:
-m and -o can be used together or individually if only one operation is intended
-m Merges hashes and duplicates of db1 into db2.
-o Path for generating the output distribution file of duplicates db.
WARNING: these options are meant to be used on duplicates databases generated by DEDISgen, which map hash->number-duplicates. They are not intended for the distribution databases used to generate the distribution outputs. The duplicates databases are in the database folder, which the user can specify with the -z option; by default they are in ./gendbs/duplicatedb
Merging the duplicates found separately with DEDISgen for dataset 1 (in db1) and dataset 2 (in db2):
Merge the two duplicates databases (db1 into db2). WARNING: db2 info will be updated!
./DEDISgenutils -mdb1 db2
Generate the output with the distribution for db1+db2
./DEDISgenutils -odb2 output_dist
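Conceptually, merging two hash->number-duplicates databases can be pictured with plain Python dictionaries, as in the sketch below. It assumes that a value of d means the block was seen d + 1 times, so a hash present in both databases ends up with d1 + d2 + 1 duplicates. This rule and the function name are assumptions made for illustration and may not match what DEDISgenutils does internally.

```python
def merge_duplicate_dbs(db1, db2):
    """Merge db1 into db2 in place and return db2.

    Both arguments map block hash -> number of duplicates.
    """
    for block_hash, dups1 in db1.items():
        if block_hash in db2:
            # (dups1 + 1) + (dups2 + 1) occurrences in total, minus the original.
            db2[block_hash] += dups1 + 1
        else:
            db2[block_hash] = dups1
    return db2
```

As with the -m option, the second database is the one that gets updated, so keep a copy if the original db2 is still needed.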
For more information please contact: Joao Paulo jtpaulo at di.uminho.pt