Merge pull request #23 from rabernat/doc-update

Doc update

rabernat committed Jul 16, 2020
2 parents 269c75c + a6c0595 commit a3a0a24
Showing 9 changed files with 1,709 additions and 92 deletions.
Binary file added docs/_static/algorithm.png
55 changes: 0 additions & 55 deletions docs/_templates/srclinks.html

This file was deleted.

43 changes: 43 additions & 0 deletions docs/algorithm.rst
@@ -0,0 +1,43 @@
Algorithm
=========

The algorithm used by rechunker tries to satisfy several constraints simultaneously:

- *Respect memory limits.* Rechunker's algorithm guarantees that worker processes
will not exceed a user-specified memory threshold.
- *Minimize the number of required tasks.* Specifically, for N source chunks
and M target chunks, the number of tasks is always less than N + M.
- *Be embarrassingly parallel.* The task graph should be as simple as possible,
to make it easy to execute using different task scheduling frameworks. This also
means avoiding write locks, which are complex to manage, and inter-worker
communication.

The algorithm we chose emerged via a lively discussion on the
`Pangeo Discourse Forum <https://discourse.pangeo.io/t/best-practices-to-go-from-1000s-of-netcdf-files-to-analyses-on-a-hpc-cluster/588>`_.
We call it *Push / Pull Consolidated*.

.. figure:: _static/algorithm.png
:align: center
:alt: Algorithm Schematic
:figclass: align-center

Visualization of the Push / Pull Consolidated algorithm for a hypothetical
2D array. Each rectangle represents a single chunk. The dashed boxes
indicate consolidated reads / writes.

A rough sketch of the algorithm is as follows:

1. The user provides a source array with a specific shape, chunk structure, and
data type, and also specifies ``target_chunks``, the desired chunk structure
of the output array, and ``max_mem``, the maximum amount of memory
each worker is allowed to use.
2. Determine the largest batch of data a single worker can *write* given
``max_mem``. These are the ``write_chunks``.
3. Determine the largest batch of data a single worker can *read* given
``max_mem``, plus the additional constraint of trying to fit within the
write chunks if possible. These are the ``read_chunks``.
4. If ``write_chunks == read_chunks``, we can avoid creating an intermediate
dataset and copy the data directly from source to target.
5. Otherwise, intermediate chunks are defined as the elementwise minimum of
``write_chunks`` and ``read_chunks`` along each axis (see the code example
below). The source is copied first to the intermediate array and then from
intermediate to target.
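
For illustration only, here is a minimal sketch of steps 2-5 under assumed
inputs; the chunk shapes, dtype, and ``max_mem`` value below are hypothetical,
and this is not rechunker's internal code::

import numpy as np

max_mem = 256_000  # hypothetical per-worker memory limit, in bytes
itemsize = np.dtype("float64").itemsize  # 8 bytes per element

# Steps 2-3: assumed largest batches one worker can write and read
# without exceeding max_mem (320 * 100 * 8 = 256,000 bytes).
write_chunks = (320, 100)
read_chunks = (100, 320)
for chunks in (write_chunks, read_chunks):
    assert np.prod(chunks) * itemsize <= max_mem

# Step 4 would copy directly if write_chunks == read_chunks; here they
# differ, so step 5 takes the elementwise minimum along each axis.
int_chunks = tuple(min(w, r) for w, r in zip(write_chunks, read_chunks))
assert int_chunks == (100, 100)

Each intermediate chunk then fits inside both a read batch and a write batch,
which is what keeps the copy embarrassingly parallel.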
26 changes: 26 additions & 0 deletions docs/api.rst
@@ -0,0 +1,26 @@
API
===

Rechunk Function
----------------

The main function exposed by rechunker is :func:`rechunker.rechunk`.

.. currentmodule:: rechunker

.. autofunction:: rechunk


The Rechunked Object
--------------------

``rechunk`` returns a :class:`Rechunked` object.

.. autoclass:: Rechunked

.. note::
You must call ``execute()`` on the ``Rechunked`` object to actually
perform the rechunking operation.

.. warning::
You must manually delete the intermediate store after ``execute`` has
finished; see the sketch below.
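
A minimal sketch, assuming local directory stores with hypothetical paths
(``shutil.rmtree`` applies only to a store backed by a local directory)::

import shutil

import zarr
from rechunker import rechunk

source = zarr.ones((4, 4), chunks=(2, 2), store="source.zarr")
rechunked = rechunk(
    source,
    target_chunks=(4, 1),
    target_store="target.zarr",
    max_mem=256000,
    temp_store="intermediate.zarr",
)
rechunked.execute()  # performs the copy and returns the target array

# The intermediate store is not cleaned up automatically.
shutil.rmtree("intermediate.zarr")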
19 changes: 15 additions & 4 deletions docs/conf.py
@@ -41,23 +41,29 @@
extensions = [
"sphinx.ext.autodoc",
"sphinx.ext.autosummary",
"sphinx.ext.intersphinx",
"numpydoc",
"nbsphinx",
"IPython.sphinxext.ipython_directive",
"IPython.sphinxext.ipython_console_highlighting",
"sphinxcontrib.srclinks",
]
srclink_project = "https://github.com/pangeo-data/rechunker"
numpydoc_show_class_members = False
class_members_toctree = False

# https://nbsphinx.readthedocs.io/en/0.2.14/never-execute.html
nbsphinx_execute = "never"

# Add any paths that contain templates here, relative to this directory.
templates_path = ["_templates"]

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]
exclude_patterns = ["_build", "Thumbs.db", ".DS_Store", ".ipynb_checkpoints"]

intersphinx_mapping = {
"xarray": ("http://xarray.pydata.org/en/stable/", None),
"zarr": ("https://zarr.readthedocs.io/en/stable", None),
}

# -- Options for HTML output -------------------------------------------------

@@ -75,3 +81,8 @@
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ["_static"]

html_show_sourcelink = True
srclink_project = "https://github.com/pangeo-data/rechunker"
srclink_branch = "master"
srclink_src_path = "docs/"
57 changes: 29 additions & 28 deletions docs/index.rst
@@ -11,46 +11,47 @@ of the chunk structure of chunked array formats such as Zarr_ and TileDB_.
Rechunker takes an input array (or group of arrays) stored in a persistent
storage device (such as a filesystem or a cloud storage bucket) and writes
out an array (or group of arrays) with the same data, but different chunking
scheme, to a new location.
scheme, to a new location. Rechunker is designed to be used within a parallel
execution framework such as Dask_.

Rechunker is designed to be used within a parallel execution framework such as
Dask_.

Usage
-----
Quickstart
----------

The main function exposed by rechunker is :func:`rechunker.rechunk`.
To install::

.. currentmodule:: rechunker
$ pip install rechunker

.. autofunction:: rechunk

``rechunk`` returns a :class:`Rechunked` object.
To use::

.. autoclass:: Rechunked
>>> import zarr
>>> from rechunker import rechunk
>>> source = zarr.ones((4, 4), chunks=(2, 2), store="source.zarr")
>>> intermediate = "intermediate.zarr"
>>> target = "target.zarr"
>>> rechunked = rechunk(source, target_chunks=(4, 1), target_store=target,
... max_mem=256000,
... temp_store=intermediate)
>>> rechunked
<Rechunked>
* Source : <zarr.core.Array (4, 4) float64>
* Intermediate: dask.array<from-zarr, ... >
* Target : <zarr.core.Array (4, 4) float64>
>>> rechunked.execute()
<zarr.core.Array (4, 4) float64>


Examples
Contents
--------

.. toctree::
:maxdepth: 2

.. warning::
You must manually delete the intermediate store when rechunker is finished
executing.


The Rechunker Algorithm
-----------------------

The algorithm used by rechunker tries to satisfy several constraints simultaneously:

- *Respect memory limits.* Rechunker's algorithm guarantees that worker processes
will not exceed a user-specified memory threshold.
- *Minimize the number of required tasks.* Specifically, for N source chunks
and M target chunks, the number of tasks is always less than N + M.
- *Be embarrassingly parallel.* The task graph should be as simple as possible,
to make it easy to execute using different task scheduling frameworks. This also
means avoiding write locks, which are complex to manage.
tutorial
api
algorithm
release_notes


.. _Zarr: https://zarr.readthedocs.io/en/stable/
11 changes: 11 additions & 0 deletions docs/release_notes.rst
@@ -0,0 +1,11 @@
Release Notes
=============

v0.1 - Unreleased
-----------------


v0.0.1 - 2020-07-15
-------------------

First Release
