DEDISbench: A disk I/O block-based benchmark for deduplication systems. Unlike other existing benchmarks, written content is generated in a realistic fashion that mimics the content found on real storage distributions.
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.

README.mkd

DEDISbench and DEDISgen

(c) 2010 2017 INESC TEC and U. Minho. Written by J. Paulo, M. Freitas

DEDISbench and DEDISgen

DEDISbench is implemented in C and allows performing read and write block disk I/O tests on top of a file-system or block storage solution. DEDISbench main contribution is the ability to process, as input, a file that specifies a distribution of duplicate content, while using this information for generating a synthetic workload that follows such distribution. This input file can be populated by the users or can be generated automatically with DEDISgen, an analysis tool used for processing a specific real dataset and extracting from it the duplicates distribution.

I/O operations can be executed concurrently by several processes working on top of different files or disk regions. Moreover, the benchmark can be configured to stop the evaluation when a certain amount of data has been written or when a pre-defined period of time has elapsed. Another novel feature is the possibility of performing I/O operations with different load intensities. In addition to a stress load, where the benchmark issues I/O operations as fast as possible to stress the system, DEDISbench supports performing operations at a nominal load specified by the user.

For more information regarding DEDISbench algorithm you may read these two published papers:

I/O benchmark:

The benchmark is written in C and can simulate several processes writing/reading fixed size blocks, with realistic content, concurrently into multiple files or into a block device. The location for the read/write operations can follow a sequential, random uniform and random hotspot distribution. The latter is provided by resorting to the NURand function from TPC-C benchmark that generates hotspots for I/O operations. A realistic distribution extracted from real storage systems is used to generate the blocks' content. Moreover, it is possible to use other realistic distributions and load them from a custom file using DEDISgen. DEDISbench provides two workload modes, one reproduces a fully or peak loaded system, as Bonnie++, that performs as maximum write I/O operations per second as possible. The second reproduces a system under a reasonable nominal load, and can be useful for understanding the behavior of storage operations in a stable system with a limited I/O throughput.

DEDISbench real content distributions:

DEDISbench comes with three distinct distributions extracted from real storage systems with different requirements and assumptions:

  • An Archival storage where most files have a write-once policy, with non-significant updates, but with sporadic data deletion. This distribution is called dist_archival.
  • A Personal Files storage where some files are updated and deleted frequently and the I/O requests latency is expected to be lower than the one found in archival storages. This distributions is called dist_personalfiles.
  • A High Performance storage where most files are updated and deleted frequently and I/O latency is expected to be as minimal as possible. This distribution is called dist_highperf.

All these distributions are available and can be simulated by DEDISbench. NOTE: By default DEDISbench uses the Personal Files Storage distribution.

Duplicate distribution generator: DEDISgen

The binary DEDISgen generates an output file that describes the distribution of duplicates found at a specific folder and subfolders or disk device. This program can generate (if specified in the arguments) an output distribution file (with option -o) that can be consumed by DEDISbench and used for simulating that duplicate distribution.

With DEDISgen we also pack DEDISgenutils that may be needed for performing more complex analysis of duplicate data. For instance, if several datasets must be analysed separately and then merged toguether, for generating the output distribution file for DEDISbench, it is possible to use both tools as explained below.

Finally, if one wishes to view the benchmark's results in a little more nicely manner, both DEDISbench and DEDISgen output the files needed to plot some of their results with gnuplot.

Instalation.

The only libs required to run the benchmark should be libc6-dev, libdb-dev and libssl-dev.

To build DEDISBench run the following commands from the top-level directory:

$ ./autogen.sh 
$ ./configure
$ make

Running the Benchmark

Required flags

-p or -nvalue Peak or Nominal Load with throughput rate of N operations per second. If mixed nominal benchmark of read and writes is defined then use -nrvalue and -nwvalue for nominal rate of reads and writes respectively.

-w or -r or -m Write or Read Benchmark or a Mix of write and read operations.

-tvalue or -svalue Benchmark duration (-t) in Minutes or amount of data to write (-s) in MB

Configuration file options

Default configuration file is conf/defconf.ini. To use a custom configuration file use -ffilename

[execution] section

distfile=value Input File with duplicate distribution default:internal file conf/dist_personalfiles DEDISbench can simulate three real distributions extracted respectively from an Archival, Personal Files and High Performance Storage; For choosing these distributions the value must be dist_archival, dist_personalfiles or dist_highperf respectively. The input file details the amount of blocks with a certain number of duplicates and the format is: <number_duplicates> <number_blocks> See below for more info for customizing distribution files and above for info on the default distributions.

nprocs=value Number of concurrent processes (default:value=4). Each process has an independent file associated (or a common device if the rawdevice option is used)

filesize=value Size of the file of each process in MB. If rawdevice option is used, this parameter defines the size of the raw device. (default:value=2048 MB)

rawdevice=value Processes write/read from a raw device instead of having an independent file. (value=path/to/dev) If more than one process is defined, each process is assingned with an independent chunk of the raw device, dependent on the raw device size. By default, if this flag is not set each process writes to an individual file.

integritycheck=value Enable data integrity checks (default:value=0): 0 - No integrity check is done 1 - static integrity check is done when the benchmark ends 2 - online integrity check is done for benchmark read requests 3 - both online and static verifications are done Results are written to ./results/intgr_* Files must be pre-populated with realistic content for read and mixed workloads to ensure that integrity checks are always correct.

blocksize=value Size of blocks for I/O operations in Bytes (default:value=4096)

sync=value I/O Operations synchronization (default:value=0): 0 - without fsync and O_DIRECT, 1 - O_DIRECT, 2 - fsync, 3 - both.

populate=value Enable or disable the population of process files/device before running DEDISbench: 0-disabled, 1-enabled (with realistic content), 2- enabled (with DD). (Only enabled by default (with value 1) for read and mixed tests).

seed=value Seed for random generator (default:current time). Usefull for repeating

[results] section

tempfilespath=value Choose the directory where DEDISbench writes/reads data. (default:value=.)

logging=value I/O latency results are written to a log file to extract additional statistics. Each process writes these values in a file called result and each line, corresponds to a single I/O operation and presents: (latency of I/O operation in microseconds) (current time in seconds).

dist_results=value Generate an output log with the distribution actually generated by the benchmark. This also generates the files needed to plot the distribution with gnuplot. (value=filename)

access_results=value Generate an output log with the access pattern generated by the benchmark This also generates the files needed to plot the accesses to each block, throught time, with gnuplot. (value=filename)

general_results=file:RU:CD Write to file path the output of DEDISbench. This feature also writes two additional files with the same name as given in file and a snaplat and snapthr suffix that shows the throughput and latency average values for 30 seconds intervals. RU is the ramp up time in half minutes. CD is the cool down time also half minutes. It also writes the necessary files to plot a graph of both throughput and latency, with gnuplot.

[structure] section

cleantemp=value Disable the destruction of process temporary files generated by the benchmark (value=1). Default behaviour is to destroy temporary files (value=0 - or dont specify this option )

keep_dbs=value Option to keep/delete databases from the previous execution: 0 - deletes databases, 1 - keeps databases.

Examples:

run write benchmark in peak mode for 5 minutes

./DEDISbench -p -w -t5

run read benchmark in nominal mode (300 reads/second) for 10 minutes

./DEDISbench -n300 -r -t10

run write benchmark in peak mode for 5 minutes. Enable the population of process files before actually running the benchmark load and enable log output for results. Use files with 4GB for each process and 8 processes.

./DEDISbench -p -w -t5 -fconfig.ini

config.ini

[execution]
filesize=4096
nprocs=8
populate=1
[results]
logging=1

run read benchmark in nominal mode (300 reads/second) for 8 minutes. Disable the pre-population of files. Carefull because files test(processid) must be present and have content or no content will be available for reading resulting in I/O errors.

./DEDISbench -n300 -r -t8 -fconfig.ini

config.ini

[results]
tempfilespath=./tempfiles

run 8 minutes mixed benchmark with a nominal load (100reads/s and 50 writes/s) in a raw device with 4GB.

./DEDISbench -m -nr100 -nw50 -t8 -fconfig.ini

config.ini

[execution]
rawdevice=path/to/rawdevice

Running the generator DEDISgen:

-f or -d Find duplicates in folder -f or in a Disk Device -d

-pvalue Path for the folder or disk device

-ovalue Path for the output distribution file (Please specifiy the full path including the distribution file name to be generated). This is only necessary if distribution file of duplicates is going to be generated

Optional Parameters

-zvalue Path for the folder where duplicates databases are created default: ./gendbs/ . duplicate_db is the database with duplicates information hash->number_dups

-bvalue Size of blocks to analyse in bytes eg: -b1024,4096,8192 default: -b4096

-ovalue Path for the output distribution file. This is only necessary if distribution file of duplicates is going to be generated

-r Option to delete databases from the previous execution: 0 - deletes databases, 1 - keeps databases.

DEDISgen also outputs the files needed to plot the generated distribution with gnuplot.

Examples

generate duplicate distribution file /dir/dist_file for files in folder /dir/duplicates/ and its subfolders

./DEDISgen -f -p/dir/duplicates/ -o/dir/dist_file

generate duplicate distribution file /dir/dist_file for device /path/device. Analyse for a size of 4096 and 8192 Bytes (8KB). A file dist_file4096 and dist_file8192 will be created.

./DEDISgen -d -p/path/device -o/dir/dist_file -b4096,8192

Deduplication distribution FILE:

This file describes the amount of blocks with a specific number of duplicates and has the following format:

Example file:

0 1000
1 500
5 10

There are 1000 blocks without any duplicate, 500 blocks with one duplicate and 10 blocks with 5 duplicates

NEW! DEDISgenutils

UTILS for DEDISgen:

-m and -o can be used together or individually if only one operation is intended

-m Merges hashes and duplicates of db1 into db2.

-o Path for generating the output distribution file of duplicates db.

WARNING: these are supposed to be used in duplicates databases generated by dedisgen that map (hash->number-duplicates). Not intended for distribution dbs that generate the distribution outputs. These databases are on the database folder, which can be specified by the user as parameter with the option -z and by default are on ./gendbs/duplicatedb

Examples:

Merging the duplicates found separately, with DEDISgen, for dataset 1 (in db1) and 2 (in db2) db1 with db2

Merge the two duplicates databases. Merge db1 into db2 - WARNING db2 info will be updated!!!! ./DEDISgenutils -mdb1 db2

generate output with distribution for db1+db2 ./DEDISgenutils -odb2 output_dist

For more information please contact: Joao Paulo jtpaulo at di.uminho.pt