# How to get the derivative wrt. the hidden activations of a model in Flax/JAX? #1152
Original question by @untom.
---
Answer by @jheek: It is annoying if you don't have code that factors nicely into functions, so getting the gradient with respect to all hidden activations takes a trick:

```python
class GradWrapper(nn.Module):
    mdl: nn.Module

    @nn.compact
    def __call__(self, *args, **kwargs):
        y = self.mdl(*args, **kwargs)
        # A zero-valued variable added to the output; its gradient equals
        # the gradient wrt the activation y.
        eps = self.variable('inter_grads', 'activation', lambda: jnp.zeros_like(y))
        return y + eps.value

variables = model.init(...)
grads = jax.grad(model.apply)(variables, batch)
param_grads = grads['params']
inter_grads = grads['inter_grads']
```

So the trick is to add a "delta epsilon" everywhere you want intermediate gradients (it's like calculus 101 all over again 😛). The wrapper itself might not be the right place to put it; all you need is a zero-valued variable added to whatever you want to track. The only problem with this pattern is that, for optimal performance, you should make sure you don't "materialize" the zeros: if XLA knows the variable is just zeros, it can optimize the redundant `+ zeros` away. XLA should realize that it's just zeros in this case if you simply do `inter_grads = jax.tree_map(jnp.zeros_like, variables['inter_grads'])`.
---
We now have the `Module.perturb` API for this. Check out the "Extracting gradients of intermediate values" guide for a complete walkthrough. In the meantime, here is a short example from the documentation:

```python
import jax
import jax.numpy as jnp
import flax.linen as nn

class Foo(nn.Module):
    @nn.compact
    def __call__(self, x):
        x = nn.Dense(3)(x)
        x = self.perturb('dense3', x)  # registers a zero-valued perturbation
        return nn.Dense(2)(x)

def loss(params, perturbations, inputs, targets):
    variables = {'params': params, 'perturbations': perturbations}
    preds = model.apply(variables, inputs)
    return jnp.square(preds - targets).mean()

x = jnp.ones((2, 9))
y = jnp.ones((2, 2))
model = Foo()
variables = model.init(jax.random.PRNGKey(0), x)
intm_grads = jax.grad(loss, argnums=1)(variables['params'], variables['perturbations'], x, y)
print(intm_grads['dense3'])  # ==> [[-1.456924   -0.44332537  0.02422847]
                             #      [-1.456924   -0.44332537  0.02422847]]
```