Merge pull request #23 from rabernat/doc-update

Doc update

rabernat committed Jul 16, 2020
2 parents 269c75c + a6c0595 commit a3a0a24
Showing 9 changed files with 1,709 additions and 92 deletions.
Binary file added docs/_static/algorithm.png
55 changes: 0 additions & 55 deletions docs/_templates/srclinks.html

This file was deleted.

43 changes: 43 additions & 0 deletions docs/algorithm.rst
@@ -0,0 +1,43 @@
Algorithm
=========

The algorithm used by rechunker tries to satisfy several constraints simultaneously:

- *Respect memory limits.* Rechunker's algorithm guarantees that worker processes
will not exceed a user-specified memory threshold.
- *Minimize the number of required tasks.* Specifically, for N source chunks
and M target chunks, the number of tasks is always less than N + M.
- *Be embarrassingly parallel.* The task graph should be as simple as possible,
to make it easy to execute using different task scheduling frameworks. This also
means avoiding write locks, which are complex to manage, and inter-worker
communication.

The algorithm we chose emerged via a lively discussion on the
`Pangeo Discourse Forum <https://discourse.pangeo.io/t/best-practices-to-go-from-1000s-of-netcdf-files-to-analyses-on-a-hpc-cluster/588>`_.
We call it *Push / Pull Consolidated*.

.. figure:: _static/algorithm.png
:align: center
:alt: Algorithm Schematic
:figclass: align-center

Visualization of the Push / Pull Consolidated algorithm for a hypothetical
2D array. Each rectangle represents a single chunk. The dashed boxes
indicate consolidated reads / writes.

A rough sketch of the algorithm is as follows:

1. The user provides a source array with a specific shape, chunk structure, and
data type, and also specifies ``target_chunks``, the desired chunk structure
of the output array, and ``max_mem``, the maximum amount of memory
each worker is allowed to use.
2. Determine the largest batch of data a single worker can *write* given
``max_mem``. These are the ``write_chunks``.
3. Determine the largest batch of data a single worker can *read* given
``max_mem``, plus the additional constraint of trying to fit within the
write chunks if possible. These are the ``read_chunks``.
4. If ``write_chunks == read_chunks``, we can avoid creating an intermediate
dataset and copy the data directly from source to target.
5. Otherwise, intermediate chunks are defined as the elementwise minimum of
``write_chunks`` and ``read_chunks`` along each axis (see the code example
below). The source is copied first to the intermediate array and then from
intermediate to target.
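
For illustration only, here is a minimal sketch of steps 2-5 under assumed
inputs; the chunk shapes, dtype, and ``max_mem`` value below are hypothetical,
and this is not rechunker's internal code::

import numpy as np

max_mem = 256_000  # hypothetical per-worker memory limit, in bytes
itemsize = np.dtype("float64").itemsize  # 8 bytes per element

# Steps 2-3: assumed largest batches one worker can write and read
# without exceeding max_mem (320 * 100 * 8 = 256,000 bytes).
write_chunks = (320, 100)
read_chunks = (100, 320)
for chunks in (write_chunks, read_chunks):
    assert np.prod(chunks) * itemsize <= max_mem

# Step 4 would copy directly if write_chunks == read_chunks; here they
# differ, so step 5 takes the elementwise minimum along each axis.
int_chunks = tuple(min(w, r) for w, r in zip(write_chunks, read_chunks))
assert int_chunks == (100, 100)

Each intermediate chunk then fits inside both a read batch and a write batch,
which is what keeps the copy embarrassingly parallel.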
26 changes: 26 additions & 0 deletions docs/api.rst
@@ -0,0 +1,26 @@
API
===

Rechunk Function
----------------

The main function exposed by rechunker is :func:`rechunker.rechunk`.

.. currentmodule:: rechunker

.. autofunction:: rechunk


The Rechunked Object
--------------------

``rechunk`` returns a :class:`Rechunked` object.

.. autoclass:: Rechunked

.. note::
You must call ``execute()`` on the ``Rechunked`` object to actually
perform the rechunking operation.

.. warning::
You must manually delete the intermediate store after ``execute`` has
finished; see the sketch below.
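
A minimal sketch, assuming local directory stores with hypothetical paths
(``shutil.rmtree`` applies only to a store backed by a local directory)::

import shutil

import zarr
from rechunker import rechunk

source = zarr.ones((4, 4), chunks=(2, 2), store="source.zarr")
rechunked = rechunk(
    source,
    target_chunks=(4, 1),
    target_store="target.zarr",
    max_mem=256000,
    temp_store="intermediate.zarr",
)
rechunked.execute()  # performs the copy and returns the target array

# The intermediate store is not cleaned up automatically.
shutil.rmtree("intermediate.zarr")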
19 changes: 15 additions & 4 deletions docs/conf.py
@@ -41,23 +41,29 @@
extensions = [
"sphinx.ext.autodoc",
"sphinx.ext.autosummary",
"sphinx.ext.intersphinx",
"numpydoc",
"nbsphinx",
"IPython.sphinxext.ipython_directive",
"IPython.sphinxext.ipython_console_highlighting",
"sphinxcontrib.srclinks",
]
srclink_project = "https://github.com/pangeo-data/rechunker"
numpydoc_show_class_members = False
class_members_toctree = False

# https://nbsphinx.readthedocs.io/en/0.2.14/never-execute.html
nbsphinx_execute = "never"

# Add any paths that contain templates here, relative to this directory.
templates_path = ["_templates"]

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]
exclude_patterns = ["_build", "Thumbs.db", ".DS_Store", ".ipynb_checkpoints"]

intersphinx_mapping = {
"xarray": ("http://xarray.pydata.org/en/stable/", None),
"zarr": ("https://zarr.readthedocs.io/en/stable", None),
}

# -- Options for HTML output -------------------------------------------------

@@ -75,3 +81,8 @@
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ["_static"]

html_show_sourcelink = True
srclink_project = "https://github.com/pangeo-data/rechunker"
srclink_branch = "master"
srclink_src_path = "docs/"
57 changes: 29 additions & 28 deletions docs/index.rst
@@ -11,46 +11,47 @@ of the chunk structure of chunked array formats such as Zarr_ and TileDB_.
Rechunker takes an input array (or group of arrays) stored in a persistent
storage device (such as a filesystem or a cloud storage bucket) and writes
out an array (or group of arrays) with the same data, but different chunking
scheme, to a new location.
scheme, to a new location. Rechunker is designed to be used within a parallel
execution framework such as Dask_.

Rechunker is designed to be used within a parallel execution framework such as
Dask_.

Usage
-----
Quickstart
----------

The main function exposed by rechunker is :func:`rechunker.rechunk`.
To install::

.. currentmodule:: rechunker
$ pip install rechunker

.. autofunction:: rechunk

``rechunk`` returns a :class:`Rechunked` object.
To use::

.. autoclass:: Rechunked
>>> import zarr
>>> from rechunker import rechunk
>>> source = zarr.ones((4, 4), chunks=(2, 2), store="source.zarr")
>>> intermediate = "intermediate.zarr"
>>> target = "target.zarr"
>>> rechunked = rechunk(source, target_chunks=(4, 1), target_store=target,
... max_mem=256000,
... temp_store=intermediate)
>>> rechunked
<Rechunked>
* Source : <zarr.core.Array (4, 4) float64>
* Intermediate: dask.array<from-zarr, ... >
* Target : <zarr.core.Array (4, 4) float64>
>>> rechunked.execute()
<zarr.core.Array (4, 4) float64>


Examples
Contents
--------

.. toctree::
:maxdepth: 2

.. warning::
You must manually delete the intermediate store when rechunker is finished
executing.


The Rechunker Algorithm
-----------------------

The algorithm used by rechunker tries to satisfy several constraints simultaneously:

- *Respect memory limits.* Rechunker's algorithm guarantees that worker processes
will not exceed a user-specified memory threshold.
- *Minimize the number of required tasks.* Specifically, for N source chunks
and M target chunks, the number of tasks is always less than N + M.
- *Be embarrassingly parallel.* The task graph should be as simple as possible,
to make it easy to execute using different task scheduling frameworks. This also
means avoiding write locks, which are complex to manage.
tutorial
api
algorithm
release_notes


.. _Zarr: https://zarr.readthedocs.io/en/stable/
11 changes: 11 additions & 0 deletions docs/release_notes.rst
@@ -0,0 +1,11 @@
Release Notes
=============

v0.1 - Unreleased
-----------------


v0.0.1 - 2020-07-15
-------------------

First Release
