Use exact GELU #4428

Closed
hendrycks opened this issue Oct 1, 2020 · 7 comments · Fixed by #4438
Labels: enhancement, performance

Comments

@hendrycks commented Oct 1, 2020

jax.nn.gelu uses the approximate form of the GELU, but TensorFlow and PyTorch use the exact version.

I believe the exact form is more numerically stable and similarly fast. Figure 16 of the Performer paper (@xingyousong) showed the GELU running into NaN issues, and I suspect this is because JAX uses the approximate version.
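
For reference, a minimal sketch of the two formulations under discussion, using the standard definitions (the function names below are illustrative, not JAX internals):

```python
import jax.numpy as jnp
from jax.scipy.special import erf

def gelu_exact(x):
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
    return x * 0.5 * (1.0 + erf(x / jnp.sqrt(2.0)))

def gelu_tanh_approx(x):
    # Tanh-based approximation (the form jax.nn.gelu used at the time of this issue).
    return 0.5 * x * (1.0 + jnp.tanh(jnp.sqrt(2.0 / jnp.pi) * (x + 0.044715 * x**3)))
```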

@hawkinsp (Collaborator) commented Oct 1, 2020

Interesting! #1556 switched to the approximate version. @trevorcai @jekbradbury

@jekbradbury (Contributor)

Exact GeLU is significantly slower on TPUs (easily noticeable even in end-to-end step time). We’d be happy to take a PR adding the exact implementation as an option, but keeping the approximate one as the default?

@hendrycks (Author) commented Oct 1, 2020

> Exact GeLU is significantly slower on TPUs

Interesting. For PyTorch this was not the case: pytorch/pytorch#39853 (comment). The exact version was slightly faster.

> We’d be happy to take a PR

Sadly, I don't know which optimizations would make them similarly fast, as in PyTorch.

@jekbradbury (Contributor)

In JAX under a JIT, both versions are quite well optimized (fused, etc.). But TPUs are not very fast at certain kinds of vector math, and the exact GeLU happens to hit some of those cases (I think).
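
One way to see what each version lowers to (a sketch, reusing the illustrative gelu_exact / gelu_tanh_approx definitions from the earlier comment; this prints the traced jaxpr, while the actual fusion happens later in XLA):

```python
import jax
import jax.numpy as jnp

x = jnp.ones((1024, 1024), dtype=jnp.float32)

# Both versions trace to a short chain of elementwise ops,
# which XLA then fuses into a single kernel under jit.
print(jax.make_jaxpr(gelu_exact)(x))
print(jax.make_jaxpr(gelu_tanh_approx)(x))
```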

@hendrycks (Author) commented Oct 1, 2020

#1556 (comment) says:

> Confirming that my benchmarks are showing jax.grad(jax.jarrett(gelu)) as slower than jax.grad(gelu) on GPU as well.

So do you think the issue affects both TPUs and GPUs, even though PyTorch on GPUs doesn't have this problem? Or do you think it's more of a TPU-specific issue?

@hawkinsp (Collaborator) commented Oct 1, 2020

I just tried some new JAX timings on TPUv2 (two generations old) and V100 (one generation old), mostly because I had easy access to them via Colab.

I found that on V100 when compiled with jax.jit, the approximate formulation is 1.12x faster on the forward pass (which seems relatively insignificant), but on TPUv2 the approximate formulation is 1.75x faster. The difference on the backward pass was much smaller. I would guess this is related to erf being much more expensive to compute than tanh on TPU. It's possible we could optimize our erf implementation, which would presumably improve the relative performance of the exact formulation on both platforms.

Since the performance differences are so large, at the moment it does seem like we might do best to let users choose which they want.
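
For anyone who wants to reproduce this kind of comparison, a rough benchmarking sketch (assuming the illustrative gelu_exact / gelu_tanh_approx definitions from the earlier comment; exact numbers will depend on backend, shapes, and JAX/XLA version):

```python
import time
import jax
import jax.numpy as jnp

x = jnp.ones((4096, 4096), dtype=jnp.float32)

fwd_exact = jax.jit(gelu_exact)
fwd_approx = jax.jit(gelu_tanh_approx)

def bench(fn, x, iters=100):
    fn(x).block_until_ready()  # compile and warm up
    start = time.perf_counter()
    for _ in range(iters):
        out = fn(x)
    out.block_until_ready()
    return (time.perf_counter() - start) / iters

print("exact  :", bench(fwd_exact, x))
print("approx :", bench(fwd_approx, x))
```

The backward pass can be timed the same way by jitting the gradient of a scalar reduction, e.g. jax.jit(jax.grad(lambda x: gelu_exact(x).sum())).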

@hawkinsp (Collaborator) commented Oct 2, 2020

PR #4438 adds a jax.nn.gelu(..., approximate=True) keyword argument to select between the exact and approximate versions. It does not switch the default to the exact version yet.
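
Based on that description, usage would look roughly like this (a sketch; presumably approximate=False selects the exact, erf-based form):

```python
import jax.numpy as jnp
from jax import nn

x = jnp.linspace(-3.0, 3.0, 7)
y_approx = nn.gelu(x)                     # tanh approximation, still the default
y_exact = nn.gelu(x, approximate=False)   # exact, erf-based formulation
```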

Separately, I have a PR coming that optimizes the implementation of erf in XLA such that the exact formulation matches the performance of the approximation on V100, and comes very close on TPU.

However, I need to try some end-to-end benchmarking on TPU before switching the default; we are guessing that in the context of a larger model like BERT the approximation would still be faster.
