
scan with gradient checkpointing #2139

Open
shoyer opened this issue Feb 2, 2020 · 6 comments

Comments

@shoyer (Collaborator) commented Feb 2, 2020

It would be great to have a version of lax.scan that uses recursive gradient checkpointing (e.g., "binomial checkpointing"), which would allow differentiating through long time series with logarithmic time/space costs.

In principle this could be built on top of the experimental remat decorator: #1749
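
As a strawman sketch of the recursive idea (not a final API): assuming xs is a single array whose length is a power of two and that f's per-step output is an array, one could split the scan in half and remat each half recursively. Note this unrolls the recursion at trace time, so compile times grow with the number of steps.

import jax
import jax.numpy as jnp

def recursive_checkpoint_scan(f, init, xs):
  # Divide and conquer: each half is wrapped in jax.checkpoint, so the
  # backward pass keeps only O(log n) carries live and recomputes the rest.
  n = xs.shape[0]
  if n == 1:
    carry, y = f(init, xs[0])
    return carry, y[None]
  scan_half = jax.checkpoint(lambda c, x: recursive_checkpoint_scan(f, c, x))
  carry, ys_lo = scan_half(init, xs[: n // 2])
  carry, ys_hi = scan_half(carry, xs[n // 2 :])
  return carry, jnp.concatenate([ys_lo, ys_hi])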

@shoyer (Collaborator, Author) commented Aug 6, 2020

Support for a single level of gradient checkpointing, which requires only one extra forward pass, might also be useful. Apparently the optimal way to do it reduces memory usage to the square root of the number of steps, e.g., per this implementation in TensorFlow:
https://github.com/cybertronai/gradient-checkpointing
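
As a rough sketch of that single-level scheme (assuming xs is a single array whose length is a perfect square and that f's per-step output is an array), one can scan over ~sqrt(n) chunks and wrap the inner per-chunk scan in jax.checkpoint:

import math
import jax
import jax.numpy as jnp

def sqrt_checkpoint_scan(f, init, xs):
  n = xs.shape[0]
  k = math.isqrt(n)  # assumes n == k * k
  chunked = xs.reshape((k, k) + xs.shape[1:])

  @jax.checkpoint
  def scan_chunk(carry, chunk):
    # Only the carry entering each chunk is stored; the chunk itself is
    # re-run during the backward pass (the one extra forward pass).
    return jax.lax.scan(f, carry, chunk)

  carry, ys = jax.lax.scan(scan_chunk, init, chunked)  # outer scan over chunks
  return carry, ys.reshape((n,) + ys.shape[2:])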

@hawkinsp (Collaborator) commented Aug 6, 2020

There is also a class of algorithms that optimize to a fixed memory budget: https://dl.acm.org/doi/10.1145/347837.347846

(I'm not sure they are worth it over the simpler strategies though.)

@shoyer (Collaborator, Author) commented Aug 6, 2020

Yep, I realize now that that's what "binomial checkpointing" in particular means. I was originally thinking of something simpler, just using recursion.

@patrick-kidger (Collaborator) commented Feb 14, 2022

So Diffrax actually implements a bounded_while_loop that does exactly this -- early exit by nesting scan-conds, and managing memory using recursive checkpointing. In Diffrax's case it's used to handle the stepping of a differential equation solver.

The implementation is here: https://github.com/patrick-kidger/diffrax/blob/2b4e4d863c15abc7143919bac7825090bbfe50be/diffrax/misc/bounded_while_loop.py

It's worth noting that there are a lot of caveats that need to be worked around in order to make something like this feasible.


In practice most of these details are hidden from an end-user. (You just end up with a funny-looking extra argument to body_fun, and in many cases have to suffer subpar performance.) But I thought I'd record them here for anyone who ends up treading down the same path I did. Implementing a bounded_while_loop that exhibits reasonable performance was easily the single hardest part of implementing Diffrax, by a very large margin.
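
For anyone who just wants the core trick, here is a minimal sketch of the scan/cond construction (not Diffrax's actual implementation, which layers recursive checkpointing and other machinery on top of this):

import jax

def bounded_while_loop(cond_fun, body_fun, init_val, max_steps):
  # Run a fixed number of scan steps; once cond_fun is False, lax.cond takes
  # the identity branch, so the remaining iterations are cheap no-ops. Because
  # this is a scan rather than lax.while_loop, it is reverse-mode
  # differentiable (at the cost of storing max_steps carries, unless the scan
  # levels are nested and checkpointed).
  def step(val, _):
    val = jax.lax.cond(cond_fun(val), body_fun, lambda v: v, val)
    return val, None
  val, _ = jax.lax.scan(step, init_val, None, length=max_steps)
  return val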

@shoyer (Collaborator, Author) commented Jul 19, 2022

A few other reference points for anyone who finds this issue:

  1. Flax has flax.linen.remat_scan for scanning over Flax modules.
  2. I wrote a simpler version of scanning with nested gradient checkpointing, based on some of the same design principles as Diffrax's bounded_while_loop:
# Copyright 2022 Google LLC.
# SPDX-License-Identifier: Apache-2.0
import math
from typing import Any, Callable, Optional, Sequence, Tuple, TypeVar, Union

import jax
import jax.numpy as jnp


Carry = TypeVar('Carry')
Input = TypeVar('Input')
Output = TypeVar('Output')
Func = TypeVar('Func', bound=Callable)


def nested_checkpoint_scan(
    f: Callable[[Carry, Input], Tuple[Carry, Output]],
    init: Carry,
    xs: Input,
    length: Optional[int] = None,
    *,
    nested_lengths: Sequence[int],
    scan_fn: Callable = jax.lax.scan,
    checkpoint_fn: Callable[[Func], Func] = jax.checkpoint,
) -> Tuple[Carry, Output]:
  """A version of lax.scan that supports recursive gradient checkpointing.

  The interface of `nested_checkpoint_scan` exactly matches lax.scan, except for
  the required `nested_lengths` argument.

  The key feature of `nested_checkpoint_scan` is that gradient calculations
  require O(max(nested_lengths)) memory, vs O(prod(nested_lengths)) for unnested
  scans, which it achieves by re-evaluating the forward pass
  `len(nested_lengths) - 1` times.

  `nested_checkpoint_scan` reduces to `lax.scan` when `nested_lengths` has a
  single element.

  Args:
    f: function to scan over.
    init: initial value.
    xs: scanned over values.
    length: optional length of the leading axis of all arrays in ``xs``, as in
      ``lax.scan``.
    nested_lengths: required list of lengths to scan over for each level of
      checkpointing. The product of nested_lengths must match length (if
      provided) and the size of the leading axis for all arrays in ``xs``.
    scan_fn: function matching the API of lax.scan
    checkpoint_fn: function matching the API of jax.checkpoint.

  Returns:
    Carry and output values.
  """
  if length is not None and length != math.prod(nested_lengths):
    raise ValueError(f'inconsistent {length=} and {nested_lengths=}')

  def nested_reshape(x):
    x = jnp.asarray(x)
    new_shape = tuple(nested_lengths) + x.shape[1:]
    return x.reshape(new_shape)

  sub_xs = jax.tree_map(nested_reshape, xs)
  return _inner_nested_scan(f, init, sub_xs, nested_lengths, scan_fn,
                            checkpoint_fn)


def _inner_nested_scan(f, init, xs, lengths, scan_fn, checkpoint_fn):
  """Recursively applied scan function."""
  if len(lengths) == 1:
    return scan_fn(f, init, xs, lengths[0])

  @checkpoint_fn
  def sub_scans(carry, xs):
    return _inner_nested_scan(f, carry, xs, lengths[1:], scan_fn, checkpoint_fn)

  carry, out = scan_fn(sub_scans, init, xs, lengths[0])
  stacked_out = jax.tree_map(jnp.concatenate, out)
  return carry, stacked_out
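
A hypothetical usage example (the step function and sizes are purely illustrative):

def step(carry, x):
  carry = carry + x            # any per-step computation
  return carry, carry          # (new carry, per-step output)

xs = jnp.arange(64.0)
carry, ys = nested_checkpoint_scan(step, 0.0, xs, nested_lengths=(8, 8))
# Same results as jax.lax.scan(step, 0.0, xs), but a gradient such as
#   jax.grad(lambda c: nested_checkpoint_scan(
#       step, c, xs, nested_lengths=(8, 8))[0])(0.0)
# keeps only ~8 carries live and re-runs the forward pass once.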

@patrick-kidger (Collaborator) commented Feb 22, 2023

Reporting back to this old thread: Equinox now supports a while loop with gradient checkpointing. This is available as equinox.internal.while_loop(..., kind="checkpointed").

Source code here: https://github.com/patrick-kidger/equinox/blob/main/equinox/internal/while_loop/checkpointed.py

This means that (a) we now have a proper checkpointing scheme to help manage memory, and (b) we also get reverse-mode autodifferentiable while loops in JAX!

To my knowledge this is the first implementation of this in JAX. We've had similar things floating around before (e.g. stuff like lax.scan(jax.checkpoint(f), ...), or multi-level versions of that), but they've always suffered from either asymptotically slow runtimes (since the checkpointing scheme wasn't really the right thing) or from slow compile times (e.g. due to unrolling loops).
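
A rough usage sketch (cond_fun/body_fun are made up, and the max_steps keyword is an assumption about the API rather than something spelled out in this thread):

import equinox.internal as eqxi

def cond_fun(carry):
  step, x = carry
  return x < 100.0

def body_fun(carry):
  step, x = carry
  return step + 1, x * 1.1

# kind="checkpointed" selects the scheme described above; max_steps (assumed
# keyword) provides the static bound the checkpoint schedule is built against.
final_carry = eqxi.while_loop(cond_fun, body_fun, (0, 1.0),
                              max_steps=10_000, kind="checkpointed")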

For the pedantically curious.

This is technically slightly different to scan-with-gradient-checkpointing. The difference is that in a scan, the number of steps is known in advance. In a while loop, the number of steps is not known in advance. This implies using slightly different checkpointing algorithms: "online treeverse" vs "classical treeverse", and the former may be slightly less efficient due to having less information to work with.

Given a fixed num_checkpoints, and then running to see how many num_steps you get: if num_steps <= (num_checkpoints + 1) * (num_checkpoints + 2) / 2 then it turns out that both approaches match each other exactly. If num_steps is larger than this bound then online treeverse will make some extra computations (as compared to classical treeverse with an oracle on the number of steps) -- but it will still at least have the same asymptotic complexity!
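
As a concrete instance of that bound (numbers purely illustrative):

num_checkpoints = 10
threshold = (num_checkpoints + 1) * (num_checkpoints + 2) // 2  # = 66
# For num_steps <= 66 the online and classical treeverse schedules coincide;
# beyond that, online treeverse does some extra recomputation but keeps the
# same asymptotic complexity.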
