Merge pull request #9 from jcmgray/lazy-compile
[WIP] implement LazyArray + autocompile
jcmgray committed Jun 4, 2021
2 parents b62baa9 + aa73c36 commit 24bf989
Showing 17 changed files with 2,778 additions and 256 deletions.
387 changes: 387 additions & 0 deletions README.md
@@ -0,0 +1,387 @@
<p align="left"><img src="https://github.com/jcmgray/autoray/blob/master/docs/images/autoray-header.png?raw=true" alt="autoray" width="500px"></p>

A lightweight python AUTOmatic-arRAY library. Write numeric code that works for:

* [numpy](https://github.com/numpy/numpy)
* [cupy](https://github.com/cupy/cupy)
* [dask](https://github.com/dask/dask)
* [autograd](https://github.com/HIPS/autograd)
* [jax](https://github.com/google/jax)
* [mars](https://github.com/mars-project/mars)
* [tensorflow](https://github.com/tensorflow/tensorflow)
* [pytorch](https://pytorch.org/)
* ... and indeed **any** library that provides a numpy-*ish* api.

[![Azure Pipelines](https://dev.azure.com/autoray-org/autoray/_apis/build/status/jcmgray.autoray?branchName=master)](https://dev.azure.com/autoray-org/autoray/_build/latest?definitionId=1&branchName=master) [![codecov](https://codecov.io/gh/jcmgray/autoray/branch/master/graph/badge.svg?token=Q5evNiuT9S)](https://codecov.io/gh/jcmgray/autoray) [![Language grade: Python](https://img.shields.io/lgtm/grade/python/g/jcmgray/autoray.svg?logo=lgtm&logoWidth=18)](https://lgtm.com/projects/g/jcmgray/autoray/context:python) [![Anaconda-Server Badge](https://anaconda.org/conda-forge/autoray/badges/installer/conda.svg)](https://conda.anaconda.org/conda-forge)

As an example consider this function that orthogonalizes a matrix using the modified [Gram-Schmidt](https://en.wikipedia.org/wiki/Gram%E2%80%93Schmidt_process) algorithm:

```python
from autoray import do

def modified_gram_schmidt(X):
    # n.b. performance-wise this particular function is *not*
    # a good candidate for a pure python implementation

    Q = []
    for j in range(0, X.shape[0]):

        q = X[j, :]
        for i in range(0, j):
            rij = do('tensordot', do('conj', Q[i]), q, 1)
            q = q - rij * Q[i]

        rjj = do('linalg.norm', q, 2)
        Q.append(q / rjj)

    return do('stack', Q, axis=0)
```

This function is now compatible with **all** of the above-mentioned libraries! Abstracting out the array interface also allows the following functionality:

* *swap custom versions of functions for specific backends*
* *trace through computations lazily without actually running them*
* *automatically share intermediates and fold constants in computations*
* *compile functions with a unified interface for different backends*

... all implemented in a lightweight manner with an emphasis on minimizing overhead. Of course complete compatibility is not possible for all functions, operations and libraries, but ``autoray`` hopefully makes the job much easier. Of the above, ``tensorflow`` has *quite* a different interface and ``pytorch`` probably the *most* different. Whilst not every function will work out-of-the-box for these two, ``autoray`` is designed with the easy addition of new functions in mind (adding a new translation is often a one-liner).

**Contents**

* [Basic Usage](#Basic-usage)
* [How does it work?](#how-does-it-work)
* [Customizing functions](#Customizing-functions)
* [Lazy Computation](#Lazy-Computation)
* [Compilation](#Compilation)
* [Details](#Details)
* [Special Functions](#Special-Functions)
* [Deviations from `numpy`](#Deviations-from-numpy)
* [Installation](#Installation)
* [Contributing](#Contributing)


# Basic Usage


## How does it work?

``autoray`` works using essentially a single dispatch mechanism on the first argument of ``do``, or the ``like`` keyword argument if specified, fetching functions from whichever module defines the supplied array. Additionally, it caches a few custom translations and lookups so as to handle libraries like ``tensorflow`` that don't exactly replicate the ``numpy`` api (for example ``sum`` gets translated to ``tensorflow.reduce_sum``). Due to this caching, each ``do`` call only adds 1 or 2 dict look-ups of overhead - much less than using ``functools.singledispatch`` for example.
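
To make the mechanism concrete, here is a rough sketch of the idea only - this is *not* ``autoray``'s actual implementation, and the translation table below is purely illustrative:

```python
import importlib

# illustrative translation table and cache - the real tables in autoray
# cover many more functions and backends
_translations = {('tensorflow', 'sum'): 'reduce_sum'}
_fn_cache = {}


def sketch_do(fn_name, *args, like=None, **kwargs):
    # infer the backend: an explicit string, a 'like' array, or the first argument
    src = like if like is not None else args[0]
    backend = src if isinstance(src, str) else type(src).__module__.split('.')[0]

    key = (backend, fn_name)
    if key not in _fn_cache:
        # translate the name if necessary, then cache the looked-up function
        name = _translations.get(key, fn_name)
        _fn_cache[key] = getattr(importlib.import_module(backend), name)

    return _fn_cache[key](*args, **kwargs)
```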

Essentially you call your numpy-style array functions in one of four ways:

***1. Automatic backend:***

```python
do('sqrt', x)
```

Here the backend is inferred from ``x``. Usually dispatch happens on the first argument, but several functions (such as ``stack`` and ``einsum``) know to override this and look elsewhere.

***2. Backend 'like' another array:***

```python
do('random.normal', size=(2, 3, 4), like=x)
```

Here the backend is inferred from another array and can thus be implicitly propagated, even when functions take no array arguments.

***3. Explicit backend:***

```python
do('einsum', eq, x, y, like='customlib')
```

Here one simply supplies the desired function backend explicitly.

***4. Context manager:***

```python
from autoray import backend_like

with backend_like('autoray.lazy'):
    xy = do('tensordot', x, y, 1)
    z = do('trace', xy)
```

Here you set a default backend for a whole block of code. This default overrides method 1. above, but methods 2. and 3. still take precedence.



If you don't like the explicit ``do`` syntax, then you can import the fake ``numpy`` object as a **drop-in replacement** instead:

```python
from autoray import numpy as np

x = np.random.uniform(size=(2, 3, 4), like='tensorflow')
np.tensordot(x, x, [(2, 1), (2, 1)])
# <tf.Tensor 'Tensordot:0' shape=(2, 2) dtype=float32>

np.eye(3, like=x) # many functions obviously can't dispatch without the `like` keyword
# <tf.Tensor 'eye/MatrixDiag:0' shape=(3, 3) dtype=float32>
```


## Customizing functions

If you want to directly provide a missing or alternative implementation of some function for a particular backend you can swap one in with ``autoray.register_function``:

```python
import autoray as ar

def my_custom_torch_svd(x):
    import torch

    print('Hello SVD!')
    u, s, v = torch.svd(x)

    return u, s, v.T

ar.register_function('torch', 'linalg.svd', my_custom_torch_svd)

x = ar.do('random.uniform', size=(3, 4), like='torch')

ar.do('linalg.svd', x)
# Hello SVD!
# (tensor([[-0.5832, 0.6188, -0.5262],
# [-0.5787, -0.7711, -0.2655],
# [-0.5701, 0.1497, 0.8078]]),
# tensor([2.0336, 0.8518, 0.4572]),
# tensor([[-0.4568, -0.3166, -0.6835, -0.4732],
# [-0.5477, 0.2825, -0.2756, 0.7377],
# [ 0.2468, -0.8423, -0.0993, 0.4687]]))
```

If you want to make use of the existing function, you can supply ``wrap=True``, in which case the custom function supplied should act like a decorator:

```python
def my_custom_sum_wrapper(old_fn):

    def new_fn(*args, **kwargs):
        print('Hello sum!')
        return old_fn(*args, **kwargs)

    return new_fn

ar.register_function('torch', 'sum', my_custom_sum_wrapper, wrap=True)

ar.do('sum', x)
# Hello sum!
# tensor(5.4099)
```

Be careful though - if you call ``register_function`` with ``wrap=True`` again, it will now wrap the *new*, already-customized function rather than the original!


## Lazy Computation

Abstracting out the array interface also affords an opportunity to run any computations utilizing ``autoray.do`` completely lazily. ``autoray`` provides the ``lazy`` submodule and ``LazyArray`` class for this purpose:

```python
from autoray import lazy

# input array - can be anything autoray.do supports
x = do('random.normal', size=(5, 5), like='torch')

# convert it to a lazy 'computational node'
lx = lazy.array(x)

# supply this to our function
ly = modified_gram_schmidt(lx)
ly
# <LazyArray(fn=stack, shape=(5, 5), dtype=float32, backend='torch')>
```

None of the functions have actually been called yet - only the shapes and dtypes have been propagated through. ``ly`` represents the final ``stack`` call, and tracks which other ``LazyArray`` instances it needs to materialize before it can compute itself. At this point one can perform various bits of introspection:

```python
# --> the largest array encountered
ly.history_max_size()
# 25

# number of unique computational nodes
len(tuple(ly))
# 57

# --> traverse the computational graph and collect statistics
from collections import Counter
Counter(node.fn_name for node in ly)
# Counter({'stack': 1,
# 'truediv': 5,
# 'norm': 5,
# 'sub': 10,
# 'mul': 10,
# 'getitem': 5,
# 'None': 1,
# 'tensordot': 10,
# 'conjugate': 10})

# --> plot the full computation graph
ly.plot()
```
<p align="left"><img src="https://i.imgur.com/1WOngc9.png" width="600px"></p>

Finally, if we want to compute the actual value we call:
```python
ly.compute()
# tensor([[-0.4225, 0.1371, -0.2307, 0.5892, 0.6343],
# [ 0.4079, -0.5103, 0.5924, 0.4261, 0.2016],
# [ 0.2569, -0.5173, -0.4875, -0.4238, 0.4992],
# [-0.2778, -0.5870, -0.3928, 0.3645, -0.5396],
# [ 0.7155, 0.3297, -0.4515, 0.3986, -0.1291]])
```

Note that once a node is computed, it only stores the actual result and clears all references to other ``LazyArray`` instances.

**Sharing intermediates**

If the computation might involve repeated sub-computations, you can run it in a ``shared_intermediates`` context:

```python
with lazy.shared_intermediates():
    ly = modified_gram_schmidt(lx)

# --> a few nodes can be reused here (c.f. 57 previously)
len(tuple(ly))
# 51
```
This caches the computational nodes as they are created, based on a hash of their input arguments (note this uses ``id`` for array-like things, i.e. assumes they are immutable). Unlike eagerly caching function calls in real time, which might consume large amounts of memory, when the computation actually runs (i.e. ``ly.compute()`` is called) data is only kept as long as it's needed.
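
As a rough illustration of the sharing (this assumes that identical calls inside the context resolve to the very same cached node, reusing ``lx`` from above):

```python
with lazy.shared_intermediates():
    ta = do('tensordot', do('conj', lx), lx, 1)
    tb = do('tensordot', do('conj', lx), lx, 1)

# the repeated expression maps onto a single underlying node
ta is tb
# True
```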

**Why not use e.g. ``dask``?**

There are many reasons to use [dask](https://dask.org/), but it incurs a pretty large overhead for big computational graphs with comparatively small operations. Calling and computing the ``modified_gram_schmidt`` function for a 100x100 matrix (20,102 computational nodes) with ``dask.array`` takes ~25sec whereas with ``lazy.array`` it takes ~0.25sec:

```python
import dask.array as da

%%time
dx = da.array(x)
dy = modified_gram_schmidt(dx)
y = dy.compute()
# CPU times: user 25.6 s, sys: 137 ms, total: 25.8 s
# Wall time: 25.5 s

%%time
lx = lazy.array(x)
ly = modified_gram_schmidt(lx)
y = ly.compute()
# CPU times: user 256 ms, sys: 0 ns, total: 256 ms
# Wall time: 255 ms
```

This is enabled by `autoray`'s very minimal implementation.

## Compilation

Various libraries provide tools for tracing numeric functions and turning the resulting computation into a more efficient, compiled function. Notably:

* [``jax.jit``](https://github.com/google/jax)
* [``tensorflow.function``](https://www.tensorflow.org/api_docs/python/tf/function)
* [``torch.jit.trace``](https://pytorch.org/docs/stable/jit.html)

``autoray`` is very well suited to these since it simply dispatches functions to whichever library is doing the tracing - functions written using ``autoray`` should be immediately compatible with all of them.
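
For example, a function written with ``do`` can be handed straight to one of these tracers - a quick sketch with ``jax.jit`` (assuming ``jax`` is installed and reusing ``modified_gram_schmidt`` from above):

```python
import jax
import jax.numpy as jnp

# jax traces the function with abstract arrays; inside, ``do`` dispatches on
# those traced arrays so each call resolves to the jax.numpy equivalent
jit_mgs = jax.jit(modified_gram_schmidt)

x = jnp.asarray(do('random.normal', size=(5, 5), like='numpy'))
jit_mgs(x)  # the first call traces and compiles, later calls are fast
```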

**The `autocompile` wrapper**

Moreover, ``autoray`` also provides a *unified interface* for compiling functions so that the compilation backend can be easily switched or automatically identified:

```python
from autoray import autocompile

mgs = autocompile(modified_gram_schmidt)
```

Currently ``autocompile`` supports functions with the signature ``fn(*args, **kwargs) -> array`` where both ``args`` and ``kwargs`` can be any nested combination of ``tuple``, ``list`` and ``dict`` objects containing arrays.
We can compare different compiled versions of this simply by changing the ``backend`` option:

```python
x = do("random.normal", size=(50, 50), like='numpy')

# first the uncompiled version
%%timeit
modified_gram_schmidt(x)
# 23.5 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# 'python' mode unravels computation into source then uses compile+exec
%%timeit
mgs(x) # backend='python'
# 17.8 ms ± 191 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
mgs(x, backend='torch')
# 11.9 ms ± 80.5 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
mgs(x, backend='tensorflow')
# 1.87 ms ± 441 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

# need to config jax to run on same footing
from jax.config import config
config.update("jax_enable_x64", True)
config.update('jax_platform_name', 'cpu')

%%timeit
mgs(x, backend='jax')
# 226 µs ± 14.8 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
do('linalg.qr', x, like='numpy')[0]  # approximately the 'C' version
# 156 µs ± 32.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

Here you see *(with this very for-loop heavy function)* that there are significant gains to be made for all the compilation options. Whilst ``jax``, for example, achieves fantastic performance, it should be noted that its compilation step takes a lot of time and scales badly (super-linearly) with the number of computational nodes.

# Details

## Special Functions

The main function is ``do``, but the following special functions (i.e. not found in ``numpy``) are also implemented and may be useful:

* ``autoray.infer_backend`` - check what library is being inferred for a given array
* ``autoray.to_backend_dtype`` - convert a string specified dtype like ``'float32'`` to ``torch.float32`` for example
* ``autoray.get_dtype_name`` - convert a backend dtype back into the equivalent string specifier like ``'complex64'``
* ``autoray.astype`` - backend agnostic dtype conversion of arrays
* ``autoray.to_numpy`` - convert any array to a ``numpy.ndarray``

Here are all of those in action:


```python
import autoray as ar

backend = 'torch'
dtype = ar.to_backend_dtype('float64', like=backend)
dtype
# torch.float64

x = ar.do('random.normal', size=(4,), dtype=dtype, like=backend)
x
# tensor([ 0.0461, 0.3028, 0.1790, -0.1494], dtype=torch.float64)

ar.infer_backend(x)
# 'torch'

ar.get_dtype_name(x)
# 'float64'

x32 = ar.astype(x, 'float32')
ar.to_numpy(x32)
# array([ 0.04605161, 0.30280888, 0.17903718, -0.14936243], dtype=float32)
```

## Deviations from `numpy`

`autoray` doesn't have an API as such, since it is essentially just a fancy single dispatch mechanism. On the other hand, where translations *are* in place, they generally use the numpy API. So ``autoray.do('stack', arrays=pytorch_tensors, axis=0)`` gets automatically translated into ``torch.stack(tensors=pytorch_tensors, dim=0)`` and so forth.

Currently the one place this isn't true is ``autoray.do('linalg.svd', x)``, where instead ``full_matrices=False`` is used as the default, since this generally makes more sense and many libraries don't even implement the other case. ``autoray`` also dispatches ``'linalg.expm'`` for ``numpy`` arrays to ``scipy``, and may well do so with other scipy-only functions at some point.
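
As a quick illustration of the ``full_matrices=False`` default (shapes follow ``numpy``'s reduced SVD convention):

```python
import numpy as np
import autoray as ar

x = np.random.uniform(size=(3, 5))
u, s, vh = ar.do('linalg.svd', x)

# with the reduced SVD, vh has shape (3, 5) rather than the full (5, 5)
u.shape, s.shape, vh.shape
# ((3, 3), (3,), (3, 5))
```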


# Installation

You can install ``autoray`` via [conda-forge](https://conda-forge.org/) as well as with ``pip``. Alternatively, simply copy the monolithic ``autoray.py`` into your project internally (if dependencies aren't your thing) to provide ``do``.
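
For example, ``pip install autoray`` or ``conda install -c conda-forge autoray``.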

**Alternatives**

* The ``__array_function__`` protocol has been [suggested](https://www.numpy.org/neps/nep-0018-array-function-protocol.html) and is now implemented in ``numpy``. Hopefully this will eventually obviate the need for ``autoray``. On the other hand, third-party libraries themselves need to implement the interface, which ``tensorflow``, for example, has not yet done.
* The [uarray](https://github.com/Quansight-Labs/uarray) project aims to develop a generic array interface but comes with the warning *"This is experimental and very early research code. Don't use this."*.


# Contributing

Pull requests such as extra translations are very welcome!
