This repository contains a script, zarr-digest-timings.py, for repeatedly
running various implementations of a Zarr checksum calculation routine with
different types of caching and displaying the average runtime. The script is
run via nox, which installs the dependencies appropriate to each environment.
Python 3.7 or higher is required.
nox -e <env> -- [<options>] <dirpath> <implementation>
Run a given checksumming function on the given directory a number of times and
print out the average runtime. If caching is in effect and --no-clear-cache
is not given, an initial function call (populating the cache) will be timed &
reported separately.
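The timing scheme described above can be sketched roughly as follows. This is a minimal illustration and not the script's actual code: time_calls and slow_checksum are hypothetical names, and functools.lru_cache stands in for an fscacher-backed cache.

```python
import time
from functools import lru_cache
from statistics import mean


def time_calls(func, arg, number, cached):
    """Return (first_call_time, average_time) for calling func(arg).

    When `cached` is true, one extra initial call is timed on its own so that
    the cost of populating the cache does not skew the average.
    """
    first = None
    if cached:
        start = time.perf_counter()
        func(arg)
        first = time.perf_counter() - start
    timings = []
    for _ in range(number):
        start = time.perf_counter()
        func(arg)
        timings.append(time.perf_counter() - start)
    return first, mean(timings)


@lru_cache(maxsize=None)  # in-memory stand-in (assumption) for fscacher
def slow_checksum(dirpath):
    time.sleep(0.01)  # pretend to walk and digest a directory tree
    return "placeholder-digest"


first, average = time_calls(slow_checksum, "some/dir", number=5, cached=True)
print(f"first call: {first:.4f}s, average of later calls: {average:.6f}s")
```

Reporting the first call separately keeps the one-off cost of filling the cache out of the average, which then reflects only cache hits.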
<env>
- The nox environment in which to run the script; can be nothreads, which uses the non-threaded fscacher 0.1.6; threads, which uses the threaded implementation on the gh-66 branch; or xor_bytes, which uses the more efficient directory fingerprinting introduced in v0.2.0. (Note that no version of fscacher will have any effect by default unless the -c or -C option is passed to the script.)
<dirpath>
- The path to a directory tree to calculate the Zarr checksum of
<implementation>
- The checksumming function to use:
sync
- Walks the directory tree synchronously and breadth-first, digesting files, and constructs an in-memory tree for calculating the Zarr digest
fastio
- Like sync, but walks the directory tree using a multithreaded walk
oothreads
- Like fastio, but rewritten to be more object-oriented
trio
- Like sync, but walks the directory asynchronously using trio. The number of workers is controlled by the --threads option. This implementation is not affected by --cache-files.
trio3
- A variant of trio that runs the MD5 digestion function for each file in a thread. This implementation is affected by --cache-files.
recursive
- Walks & digests the directory tree depth-first using recursion
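To illustrate the difference between the breadth-first and recursive walking strategies, here is a simplified sketch. It only collects a per-file MD5 digest for each relative path; it does not compute the Zarr checksum itself, and the function names are hypothetical rather than taken from the script.

```python
import hashlib
import os
import tempfile
from collections import deque
from pathlib import Path


def md5_file(path):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fp:
        for chunk in iter(lambda: fp.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def walk_breadth_first(root):
    """Visit directories level by level using a queue."""
    digests = {}
    queue = deque([Path(root)])
    while queue:
        dirpath = queue.popleft()
        for entry in sorted(os.scandir(dirpath), key=lambda e: e.name):
            if entry.is_dir():
                queue.append(Path(entry.path))
            else:
                relpath = Path(entry.path).relative_to(root).as_posix()
                digests[relpath] = md5_file(entry.path)
    return digests


def walk_recursive(root, dirpath=None):
    """Visit directories depth-first using recursion."""
    dirpath = Path(root) if dirpath is None else dirpath
    digests = {}
    for entry in sorted(os.scandir(dirpath), key=lambda e: e.name):
        if entry.is_dir():
            digests.update(walk_recursive(root, Path(entry.path)))
        else:
            relpath = Path(entry.path).relative_to(root).as_posix()
            digests[relpath] = md5_file(entry.path)
    return digests


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a" / "b").mkdir(parents=True)
    (Path(tmp) / "a" / "b" / "chunk").write_bytes(b"hello")
    (Path(tmp) / "top").write_bytes(b"")
    bfs = walk_breadth_first(tmp)
    dfs = walk_recursive(tmp)

print(bfs == dfs)
```

Both walks visit the same files and differ only in traversal order, so any timing difference comes from the walking strategy rather than from the digests produced.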
-c, --cache | Use fscacher to cache the Zarr directory checksumming routine |
-C, --cache-files | Use fscacher to cache digests for individual files |
--clear-cache, --no-clear-cache | Whether to clear the cache on program startup [default: --clear-cache] |
-n INT, --number INT | Set the number of times to call the function (not counting the initial cache-populating call, if any). As a special case, passing 0 will cause the script to simply call the function once and print out the checksum without any timing. [default: 100] |
-R FILE, --report FILE | Append a report of the run, containing the average time and the various input parameters, as a line of JSON to the given file |
-T INT, --threads INT | Set the number of threads to use when walking a directory tree. This affects both the fastio implementation and the threaded fscacher implementation. The default value is the number of CPU cores plus 4, to a maximum of 32. |
-v, --verbose | Log the result of each function call with a timestamp as it finishes. Specify this option up to two additional times for more debug logging. |
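The default for --threads described above can be expressed as a one-line formula. The sketch below is an illustration under assumptions: default_threads is a hypothetical name, and the fallback to 1 when the core count is unknown is not something the script is known to do.

```python
import os


def default_threads():
    """Number of CPU cores plus 4, capped at 32."""
    # os.cpu_count() may return None when the count cannot be determined;
    # falling back to 1 in that case is an assumption of this sketch.
    return min(32, (os.cpu_count() or 1) + 4)


print(default_threads())
```

This is the same formula that concurrent.futures.ThreadPoolExecutor adopted for its default max_workers in Python 3.8.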
python3 mktree.py <dirpath> <specfile>
The mktree.py script can be used to generate a sample directory tree for
running zarr-digest-timings.py on. The directory is generated according to
a layout specification, which is a JSON file whose contents take one of the
following forms:
- A list lst of n+1 integers, possibly with a file object (see below) appended: the tree will consist of lst[0] directories, each of which contains lst[1] sub-directories, each of which contains lst[2] sub-subdirectories, and so on, with the directories at level n-1 consisting of lst[n] files. If a file object is supplied, the files will be generated according to its specification; otherwise, they will be empty.
- An object mapping path names to layout sub-specifications, file objects, or null: for each key that maps to a layout sub-specification, a subdirectory will be created in the directory with that name and layout. For each key that maps to a file object or null, a file will be created in the directory with that name and according to that specification (an empty file for nulls).
A file object is an object specifying the size of a file to create; it can take the following forms:
- If the object contains a "size": INT field, the file will be that size.
- Otherwise, the object must contain a "maxsize": INT field and an optional "minsize": INT field (default value: 0). The file will be created with a random size within the given range, inclusive.
All files are created with random bytes as data.
Some sample layout specifications can be found in the layouts/ directory.
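As an illustration of how the two layout forms and file objects might be interpreted, here is a minimal sketch. It is not the code of mktree.py: the function names, the generated names d0, d1, ... and f0, f1, ..., and the rule that a JSON object counts as a file object when it has an integer "size" or "maxsize" field are all assumptions made for the example.

```python
import os
import random
import tempfile
from pathlib import Path


def is_file_object(value):
    """Assumed rule: an object with an integer size or maxsize field."""
    return isinstance(value, dict) and (
        isinstance(value.get("size"), int) or isinstance(value.get("maxsize"), int)
    )


def make_file(path, fileobj):
    """Create a file of random bytes sized according to a file object."""
    if fileobj is None:
        size = 0  # null (or no file object) means an empty file
    elif "size" in fileobj:
        size = fileobj["size"]
    else:
        # randint is inclusive at both ends, matching the described range
        size = random.randint(fileobj.get("minsize", 0), fileobj["maxsize"])
    path.write_bytes(os.urandom(size))


def make_tree(dirpath, spec):
    """Populate dirpath according to a list-form or object-form layout."""
    dirpath.mkdir(parents=True, exist_ok=True)
    if isinstance(spec, list):
        # Split an optional trailing file object off the integers.
        fileobj = spec[-1] if isinstance(spec[-1], dict) else None
        counts = spec[:-1] if fileobj is not None else spec
        if len(counts) == 1:
            for i in range(counts[0]):
                make_file(dirpath / f"f{i}", fileobj)
        else:
            rest = counts[1:] + ([fileobj] if fileobj is not None else [])
            for i in range(counts[0]):
                make_tree(dirpath / f"d{i}", rest)
    else:
        for name, value in spec.items():
            if value is None or is_file_object(value):
                make_file(dirpath / name, value)
            else:
                make_tree(dirpath / name, value)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # List form: 2 directories, each holding 3 files of 16 bytes.
    make_tree(root / "listform", [2, 3, {"size": 16}])
    # Object form: a subdirectory, a sized file, and an empty file.
    make_tree(
        root / "objform",
        {"data": [2, {"maxsize": 8}], "meta": {"size": 4}, "empty": None},
    )
    list_sizes = sorted(
        p.stat().st_size for p in (root / "listform").rglob("*") if p.is_file()
    )
    obj_files = sorted(
        p.relative_to(root / "objform").as_posix()
        for p in (root / "objform").rglob("*")
        if p.is_file()
    )

print(list_sizes)
print(obj_files)
```

In a real layout file, the Python None above would be written as JSON null.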
bash time-all.sh [<options>] <dirpath>
The bash script time-all.sh runs zarr-digest-timings.py with all
non-redundant configurations against a given directory tree for a given number
of threads, and it generates a JSON Lines report.
-n INT | Set the number of times to run the checksumming function for each configuration [default: 100] |
-R FILE | Save the report to the given file [default: time-all.json] |
-T INT | Set the number of threads to use when walking a directory tree. See above for the default. |
-v | Increase the verbosity of zarr-digest-timings.py; can be specified multiple times |
nox -e report2table -- [<options>] <reportfile>
The report2table.py script takes a JSON Lines report generated via the
--report option of zarr-digest-timings.py and renders it as a
reStructuredText or GitHub-Flavored Markdown document containing a series of
tables. It should be run via nox in order to manage its dependencies.
All of the entries in the report should have been generated on the same machine. Entries generated on different paths or using different implementations will be grouped into distinct tables. If two or more entries were produced by the same configuration, their times will be combined.
For configurations that make use of caching, the corresponding cell in the resulting tables will consist of two times separated by a slash; the first time is the runtime of the initial cache-populating call, while the second time is the average of the other calls.
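The grouping and cell formatting described above might look roughly like the sketch below. The report field names (dirpath, implementation, caching, first_call, avg_time) are hypothetical, since the actual report schema is not documented here, and combining repeated entries by an unweighted mean is likewise an assumption.

```python
import json
from collections import defaultdict
from statistics import mean

# Hypothetical report entries; the real field names may differ.
entries_in = [
    {"dirpath": "/data/tree", "implementation": "sync", "caching": False,
     "first_call": None, "avg_time": 2.0},
    {"dirpath": "/data/tree", "implementation": "sync", "caching": False,
     "first_call": None, "avg_time": 4.0},
    {"dirpath": "/data/tree", "implementation": "sync", "caching": True,
     "first_call": 5.0, "avg_time": 0.5},
]
# A JSON Lines report holds one JSON document per line.
report_lines = [json.dumps(entry) for entry in entries_in]


def format_cells(lines):
    """Group report lines by configuration and format one cell per group."""
    groups = defaultdict(list)
    for line in lines:
        entry = json.loads(line)
        key = (entry["dirpath"], entry["implementation"], entry["caching"])
        groups[key].append(entry)
    cells = {}
    for key, entries in groups.items():
        # Assumption: repeated configurations are combined by a plain mean.
        avg = mean(e["avg_time"] for e in entries)
        firsts = [e["first_call"] for e in entries if e["first_call"] is not None]
        if firsts:
            # Cached runs: initial cache-populating time / average of the rest.
            cells[key] = f"{mean(firsts):.2f} / {avg:.2f}"
        else:
            cells[key] = f"{avg:.2f}"
    return cells


cells = format_cells(report_lines)
for key, cell in cells.items():
    print(key, cell)
```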
-f <rst|md>, --format <rst|md> | Specify whether to produce a reStructuredText (rst) or Markdown (md) document [default: rst] |
-o FILE, --outfile FILE | Output to the specified file |
-t TEXT, --title TEXT | Set a title for the document |