Merge pull request #160 from dsikich/docs-update
update readthedocs
dsikich committed Aug 24, 2018
2 parents f81bf02 + 07adf7b commit 34f0efc
Showing 7 changed files with 232 additions and 0 deletions.
39 changes: 39 additions & 0 deletions doc/rst/build.rst
@@ -0,0 +1,39 @@
========================
Build
========================

mpiFileUtils depends on several libraries. mpiFileUtils is available in Spack,
which simplifies the install to just:

.. code-block:: Bash

    $ spack install mpifileutils

Or to enable all features:

.. code-block:: Bash

    $ spack install mpifileutils +lustre +experimental

To build from a release tarball, there are two scripts: buildme_dependencies and
buildme. The buildme_dependencies script downloads and installs all the
necessary libraries. The buildme script then builds mpiFileUtils assuming the
libraries have been installed. Both scripts require that mpicc is in your path,
and that it is for an MPI library that supports at least v2.2 of the MPI
standard. Please review each buildme script, and edit if necessary. Then run
them in sequence:

.. code-block:: Bash

    $ ./buildme_dependencies
    $ ./buildme

To build from a clone, it may also be necessary to first run the
buildme_autotools script to obtain the required set of autotools, then use
buildme_dependencies_dev and buildme_dev:

.. code-block:: Bash

    $ ./buildme_autotools
    $ ./buildme_dependencies_dev
    $ ./buildme_dev
12 changes: 12 additions & 0 deletions doc/rst/experimental/experimental-utilities.rst
@@ -0,0 +1,12 @@
======================
Experimental Utilities
======================

Experimental utilities are under active development. They are not considered to
be production worthy, but they are available in the distribution for those
interested in developing them further or to provide additional examples. To
enable experimental utilities, run configure with the enable experimental
option.

.. code-block:: Bash

    $ ./configure --enable-experimental
1 change: 1 addition & 0 deletions doc/rst/experimental/index.rst
@@ -6,6 +6,7 @@ Experimental Tools
.. toctree::
   :maxdepth: 1

   experimental-utilities.rst
   dfind.1
   dgrep.1
   dparallel.1
4 changes: 4 additions & 0 deletions doc/rst/index.rst
@@ -9,6 +9,10 @@ Documentation for mpiFileUtils
.. toctree::
   :maxdepth: 1

   overview.rst
   libmfu.rst
   build.rst
   project-design.rst
   dbcast.1
   dchmod.1
   dcmp.1
92 changes: 92 additions & 0 deletions doc/rst/libmfu.rst
@@ -0,0 +1,92 @@
========================
libmfu
========================

Functionality that is common to multiple tools is moved to the common library,
libmfu. The goal of this library is to make it easy to develop new tools and
to provide consistent behavior across tools in the suite. The library can also
be useful to end applications, e.g., to efficiently create or remove a large
directory tree in a portable way across different parallel file systems.

----------------------------------------
libmfu: the mpiFileUtils common library
----------------------------------------

The mpiFileUtils common library defines data structures, and methods that
operate on them, to make it easier to develop new tools and to give HPC
applications portable, performant implementations across the file systems
common in HPC centers.

To use this library, include mfu.h.

.. code-block:: C

    #include "mfu.h"

This file includes all other necessary headers.

----------------------------------------
mfu_flist
----------------------------------------

The key data structure in libmfu is a distributed file list called mfu_flist.
This structure represents a list of files, each with stat-like metadata, that
is distributed among a set of MPI ranks.

The library contains functions for creating and operating on these lists. For
example, one may create a list by recursively walking an existing directory or
by inserting new entries one at a time. Given a list as input, functions exist
to create corresponding entries (inodes) on the file system or to delete the
list of files. One may filter, sort, and remap entries. One can copy a list of
entries from one location to another or compare corresponding entries across
two different lists. A file list can be serialized and written to or read from
a file.

Each MPI rank "owns" a portion of the list, and there are routines to step
through the entries owned by that process. This portion is referred to as the
"local" list. Functions exist to get and set properties of the items in the
local list, for example to get the path name, type, and size of a file.
Functions dealing with the local list can be called by the MPI process
independently of other MPI processes.
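
As a rough illustration of how a global list can be divided so that each rank
owns one contiguous "local" portion, the sketch below computes a balanced block
partition. It is an assumption made for illustration only: local_range is a
hypothetical helper, not a libmfu function, and libmfu does not necessarily
distribute entries this way.

.. code-block:: C

    #include <assert.h>
    #include <stdint.h>

    /* Hypothetical helper (not part of libmfu): compute the half-open range
     * [*start, *start + *count) of global indices owned by `rank` when
     * `total` items are split as evenly as possible across `ranks` processes. */
    void local_range(uint64_t total, int ranks, int rank,
                     uint64_t *start, uint64_t *count)
    {
        assert(ranks > 0 && rank >= 0 && rank < ranks);
        uint64_t base  = total / (uint64_t)ranks;  /* items every rank receives */
        uint64_t extra = total % (uint64_t)ranks;  /* first `extra` ranks get one more */
        uint64_t r = (uint64_t)rank;
        *count = base + (r < extra ? 1 : 0);
        *start = r * base + (r < extra ? r : extra);
    }

Each process can evaluate this with its own rank and no communication, which
mirrors the way local-list functions can be called independently of other
processes.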

Other functions operate on the global list in a collective fashion, such as
deleting all items in a file list. All processes in the MPI job must invoke
these functions simultaneously.

For full details, see `mfu_flist.h <https://github.com/hpc/mpifileutils/blob/master/src/common/mfu_flist.h>`_
and refer to its usage in existing tools.

----------------------------------------
mfu_path
----------------------------------------

mpiFileUtils represents file paths with the
`mfu_path <https://github.com/hpc/mpifileutils/blob/master/src/common/mfu_path.h>`_
structure. Functions are available to prepend and append path components, to
slice paths into pieces, and to compute relative paths.
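
To give a feel for the bookkeeping such a structure saves, the sketch below
appends one component to a path held in a plain C string while avoiding doubled
slashes. It is an assumption made for illustration: path_join is a hypothetical
function and does not reflect how mfu_path is implemented.

.. code-block:: C

    #include <assert.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical helper (not part of libmfu): write "base/comp" into `out`,
     * which has room for `cap` bytes. Returns 0 on success, or -1 if the
     * result, including its terminating NUL, does not fit. */
    int path_join(char *out, size_t cap, const char *base, const char *comp)
    {
        assert(out != NULL && base != NULL && comp != NULL);
        size_t blen = strlen(base);
        /* Ignore trailing slashes on base, but keep a lone "/" root. */
        while (blen > 1 && base[blen - 1] == '/') {
            blen--;
        }
        /* Ignore leading slashes on the component. */
        while (*comp == '/') {
            comp++;
        }
        /* A separator is needed unless base is empty or is the root. */
        const char *sep = (blen > 0 && base[blen - 1] != '/') ? "/" : "";
        int n = snprintf(out, cap, "%.*s%s%s", (int)blen, base, sep, comp);
        return (n >= 0 && (size_t)n < cap) ? 0 : -1;
    }

Under these rules, joining "/a/b/" and "c" yields "/a/b/c", and joining "/" and
"x" yields "/x".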

----------------------------------------
mfu_param_path
----------------------------------------

Path names provided by the user on the command line (parameters) are handled
through the
`mfu_param_path <https://github.com/hpc/mpifileutils/blob/master/src/common/mfu_param_path.h>`_
structure. Such paths may have to be checked for existence and to determine
their type (file or directory). Additionally, the user may specify many such
paths through invocations involving shell wildcards, so functions are available
to check long lists of paths in parallel.
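
The per-path check described above can be sketched with a single stat call.
This is a serial illustration under stated assumptions: path_kind and
classify_path are hypothetical names, not the mfu_param_path interface, and the
parallel checking of many paths is not shown.

.. code-block:: C

    #include <assert.h>
    #include <stddef.h>
    #include <sys/stat.h>

    /* Hypothetical result type for this sketch; libmfu defines its own types. */
    typedef enum { PATH_MISSING, PATH_FILE, PATH_DIR, PATH_OTHER } path_kind;

    /* Hypothetical helper: report whether `path` can be stat'ed and, if so,
     * whether it is a regular file, a directory, or something else.
     * stat() follows symbolic links, so a link is reported as its target. */
    path_kind classify_path(const char *path)
    {
        assert(path != NULL);
        struct stat st;
        if (stat(path, &st) != 0) {
            return PATH_MISSING;  /* any stat failure is treated as missing */
        }
        if (S_ISREG(st.st_mode)) {
            return PATH_FILE;
        }
        if (S_ISDIR(st.st_mode)) {
            return PATH_DIR;
        }
        return PATH_OTHER;
    }

With a long argument list, one plausible approach is for each MPI rank to
classify a different subset of the paths and then share the results, so that no
single process issues every stat call.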

----------------------------------------
mfu_io and mfu_util
----------------------------------------

The `mfu_io.h <https://github.com/hpc/mpifileutils/blob/master/src/common/mfu_io.h>`_
functions provide wrappers for many POSIX-IO functions. This is helpful for
checking error codes in a consistent manner and automating retries on failed
I/O calls. One should use the wrappers in mfu_io if available, and if not, one
should consider adding the missing wrapper.
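
A minimal sketch of what such a wrapper might look like for read is shown
below. It is not the mfu_io implementation: read_with_retry and its max_retries
parameter are hypothetical, and this version retries only on EINTR, which is an
assumption about which failures are worth retrying.

.. code-block:: C

    #include <assert.h>
    #include <errno.h>
    #include <stddef.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Hypothetical wrapper (not part of libmfu): read up to `size` bytes from
     * `fd`, continuing after short reads and retrying when the call is
     * interrupted. Returns the number of bytes read, which is less than
     * `size` only at end of file. Returns -1 on any other error, or once more
     * than `max_retries` consecutive interruptions have occurred. */
    ssize_t read_with_retry(int fd, void *buf, size_t size, int max_retries)
    {
        assert(buf != NULL && max_retries >= 0);
        size_t done = 0;
        int failures = 0;
        while (done < size) {
            ssize_t n = read(fd, (char *)buf + done, size - done);
            if (n > 0) {
                done += (size_t)n;
                failures = 0;      /* progress resets the retry budget */
            } else if (n == 0) {
                break;             /* end of file */
            } else if (errno == EINTR && failures < max_retries) {
                failures++;        /* interrupted: try the same read again */
            } else {
                return -1;         /* errno still describes the failure */
            }
        }
        return (ssize_t)done;
    }

Centralizing this logic in one place is what lets every tool treat short reads
and transient failures the same way.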

The `mfu_util.h <https://github.com/hpc/mpifileutils/blob/master/src/common/mfu_util.h>`_
functions provide wrappers for error reporting and memory allocation.
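
The general pattern of combining allocation with error reporting can be
sketched as follows. The names checked_malloc and CHECKED_MALLOC are
hypothetical and chosen for this example; consult mfu_util.h for the functions
and macros the library actually provides.

.. code-block:: C

    #include <assert.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical helper (not part of libmfu): allocate `size` bytes, or
     * report the calling location and exit if the allocation fails.
     * A request for zero bytes returns NULL without reporting an error. */
    void *checked_malloc(size_t size, const char *file, int line)
    {
        assert(file != NULL);
        if (size == 0) {
            return NULL;
        }
        void *ptr = malloc(size);
        if (ptr == NULL) {
            fprintf(stderr, "%s:%d: failed to allocate %zu bytes\n",
                    file, line, size);
            exit(EXIT_FAILURE);
        }
        return ptr;
    }

    /* Record the call site automatically. */
    #define CHECKED_MALLOC(size) checked_malloc((size), __FILE__, __LINE__)

Callers then write CHECKED_MALLOC(n) and free the result as usual, without
repeating a NULL check at every call site.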

34 changes: 34 additions & 0 deletions doc/rst/overview.rst
@@ -0,0 +1,34 @@
================
Overview
================

mpiFileUtils provides both a library called libmfu and a suite of MPI-based
tools to manage large datasets, which may vary from large directory trees to
large files. High-performance computing users often generate large datasets with
parallel applications that run with many processes (millions in some cases).
However, those users are then stuck with single-process tools like cp and rm to
manage their datasets. This suite provides MPI-based tools to handle typical
jobs like copy, remove, and compare for such datasets, providing speedups of up
to 50x. It also provides a library that simplifies the creation of new tools
or can be used in applications.

---------------------------
Utilities
---------------------------

The tools in mpiFileUtils are actually MPI applications. They must be launched
as MPI applications, e.g., within a compute allocation on a cluster using
mpirun. The tools do not currently checkpoint, so one must be careful that an
invocation of the tool has sufficient time to complete before it is killed.
Example usage of each tool is provided below.

- dbcast - Broadcast files to compute nodes.
- dchmod - Change owner, group, and permissions on files.
- dcmp - Compare files.
- dcp - Copy files.
- ddup - Find duplicate files.
- dfilemaker - Generate random files.
- drm - Remove files.
- dstripe - Restripe files.
- dsync - Synchronize files.
- dwalk - List files.
50 changes: 50 additions & 0 deletions doc/rst/project-design.rst
@@ -0,0 +1,50 @@
=========================
Project Design Principles
=========================

The following principles drive design decisions in the project.

---------------------------
Scale
---------------------------

The library and tools should be designed such that running with more processes
increases performance, provided there are sufficient data and parallelism
available in the underlying file systems. The design of the tool should not
impose performance scalability bottlenecks.

---------------------------
Performance
---------------------------

While it is tempting to mimic the interface, behavior, and file formats of
familiar tools like cp, rm, and tar, when faced with a choice between
compatibility and performance, mpiFileUtils chooses performance. For example,
if an archive file format requires serialization that inhibits parallel
performance, mpiFileUtils will opt to define a new file format that enables
parallelism rather than being constrained to existing formats. Similarly,
options in the tool command line interface may have different semantics from
familiar tools in cases where performance is improved. Thus, one should be
careful to learn the options of each tool.

---------------------------
Portability
---------------------------

The tools are intended to support common file systems used in HPC centers, like
Lustre, GPFS, and NFS. Additionally, methods in the library should be portable
and efficient across multiple file systems. Tool and library users can rely on
mpiFileUtils to provide portable and performant implementations.

---------------------------
Composability
---------------------------

While the tools do not support chaining with Unix pipes, they do support
interoperability through input and output files. One tool may process a dataset
and generate an output file that another tool can read as input, e.g., to walk
a directory tree with one tool, filter the list of file names with another, and
perhaps delete a subset of matching files with a third. Additionally, when
logic is deemed to be useful across multiple tools or is anticipated to be
useful in future tools or applications, it should be provided in the common
library.
