
FP8 GEMM for the fprop should use fast accumulation #6168

Closed
kaixih opened this issue Oct 9, 2023 · 15 comments
Labels
GPU (XLA on GPU), NVIDIA-GPU (XLA on Nvidia GPU)

Comments

@kaixih
Contributor

kaixih commented Oct 9, 2023

In the current implementation of the FP8 GEMM, the CUBLASLT_MATMUL_DESC_FAST_ACCUM attribute (link) is not configured, so fast accumulation is disabled by default. However, the Transformer Engine recipe recommends enabling fast accumulation during the fprop pass, which can further speed up the FP8 GEMM operation.

Determining whether a GEMM/dot op in the HLO graph belongs to fprop or bprop is a non-trivial challenge. In practice, e4m3 is typically used for the fprop GEMM, while a combination of e4m3 and e5m2 is used for the bprop GEMM. To address this, we propose a solution: if both input data types are e4m3, we will set the aforementioned flag to ensure fast accumulation is utilized specifically during the fprop pass (a sketch of the heuristic follows below). If this proposal is accepted, we can proceed with preparing a pull request (PR) to implement this change.
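
A minimal sketch of the proposed heuristic, assuming JAX's float8 dtypes; the helper name should_use_fast_accum is hypothetical:

import jax.numpy as jnp

def should_use_fast_accum(lhs_dtype, rhs_dtype):
  # Proposed heuristic: enable CUBLASLT_MATMUL_DESC_FAST_ACCUM only when both
  # GEMM operands are e4m3, which in typical FP8 training recipes identifies
  # the fprop GEMM.
  return lhs_dtype == jnp.float8_e4m3fn and rhs_dtype == jnp.float8_e4m3fn

# fprop GEMM: activations and weights are both e4m3 -> fast accumulation.
assert should_use_fast_accum(jnp.float8_e4m3fn, jnp.float8_e4m3fn)
# bprop GEMM: the incoming gradient is e5m2 -> full-precision accumulation.
assert not should_use_fast_accum(jnp.float8_e5m2, jnp.float8_e4m3fn)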

cc. @reedwm @philipphack @nluehr @instinct79

@reedwm
Member

reedwm commented Oct 10, 2023

To address this, we propose a solution: if both input data types are e4m3, we will set the aforementioned flag to ensure fast accumulation is utilized specifically during the fprop pass.

I'm not a fan of making the accumulation precision dependent on which of the FP8 types are used for the inputs. If we want the forward pass to use faster, less precise accumulation, this should be directly encoded in the HLO instruction. Also I don't think convolutions necessarily use different FP8 types on the forward vs backward pass, so this would only work for dots.

How about we use the PrecisionConfig field (see the precision_config field of the StableHLO spec for dot_general)? There is one PrecisionConfig per input and currently it only affects the input precisions, and only when the inputs are FP32. In XLA:GPU, TF32 is used for the inputs if the PrecisionConfig is DEFAULT or HIGH, and FP32 is used if it's HIGHEST.

For FP8 inputs, I propose interpreting the PrecisionConfig slightly differently: if all of the inputs' PrecisionConfigs are HIGHEST, accumulate with full precision; otherwise use cuBLAS's fast accumulation mode (whose exact precision is unfortunately undocumented). This is different from how PrecisionConfig is currently used: I'm proposing that it affect accumulation precision for FP8 GEMMs, whereas currently it only affects input precisions. But I think it's fine for the PrecisionConfig concept to refer to more than just input precision. Currently the StableHLO spec does not define exactly what part of the dot the PrecisionConfig affects (see openxla/stablehlo#755).
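
A minimal sketch of this mapping, using jax.lax.Precision to stand in for the HLO PrecisionConfig; the helper name uses_fast_accum is hypothetical:

import jax

def uses_fast_accum(operand_precisions):
  # Proposed rule for FP8 dots: accumulate in full precision only when every
  # operand's precision is HIGHEST; otherwise use cuBLASLt's fast accumulation.
  return not all(p == jax.lax.Precision.HIGHEST for p in operand_precisions)

# Forward-pass dots would typically use DEFAULT -> fast accumulation.
assert uses_fast_accum((jax.lax.Precision.DEFAULT, jax.lax.Precision.DEFAULT))
# Marking both operands HIGHEST requests full-precision accumulation.
assert not uses_fast_accum((jax.lax.Precision.HIGHEST, jax.lax.Precision.HIGHEST))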

@kaixih @cheshire @burmako, WDYT about using PrecisionConfig to specify the accumulation precision for FP8 dots?

@philipphack
Contributor

What's the mechanism for setting the PrecisionConfig for a given FP8 dot?

@reedwm
Member

reedwm commented Oct 10, 2023

What's the mechanism for setting the PrecisionConfig for a given FP8 dot?

In JAX, it can be passed to various functions like jnp.dot as the precision argument. In TF, I think this is impossible right now but can be added.
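
For example (a minimal sketch; the shapes and dtypes are illustrative, and the FP8-specific accumulation behavior assumes the proposal above):

import jax
import jax.numpy as jnp

x = jnp.ones((128, 64), dtype=jnp.bfloat16)
w = jnp.ones((64, 32), dtype=jnp.bfloat16)

# precision is accepted per call. Under the proposal, HIGHEST on an FP8 dot
# would request full-precision accumulation, while DEFAULT would allow
# cuBLASLt's fast accumulation.
y_default = jnp.dot(x, w)  # with the default config, same as Precision.DEFAULT
y_highest = jnp.dot(x, w, precision=jax.lax.Precision.HIGHEST)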

@wenscarl
Contributor

In addition to plumbing it through jnp.dot, we may also need a wrapper around fp8_dot inspired by Transformer Engine's design (ref here). PrecisionConfig should be OK to use since it's not utilized in the current design.

@burmako
Contributor

burmako commented Oct 16, 2023

@reedwm No objections from the StableHLO side!

@philipphack
Contributor

If we use JAX's precision enum, we may have to augment its documentation.

@reedwm
Member

reedwm commented Oct 16, 2023

I talked to @cheshire and he is also OK with using the PrecisionConfig to specify the accumulation type for FP8 matmuls. So it sounds like this is the way to go. @kaixih do you want to implement this or should I?

Once implemented, we can update the JAX documentation.

Getting JAX to specify a separate PrecisionConfig on the backward pass is a bit tricky, but it can be done by calling jax.jvp or jax.vjp from within a JAX custom JVP/VJP. For example, the dot_precise_grad function below runs the backward pass with higher precision:

import jax
import jax.numpy as jnp

@jax.custom_jvp
def dot_precise_grad(x, y):
  # Forward pass: DEFAULT precision, which allows fast accumulation.
  return jnp.dot(x, y, precision=jax.lax.Precision.DEFAULT)

@dot_precise_grad.defjvp
def dot_precise_grad_jvp(primals, tangents):
  def dot_precise(x, y):
    # Tangents (and hence gradients) use HIGHEST precision.
    return jnp.dot(x, y, precision=jax.lax.Precision.HIGHEST)

  # Differentiate the high-precision dot to obtain the tangent output.
  _, jvp = jax.jvp(dot_precise, primals, tangents)
  # Recompute the primal output with DEFAULT precision for the forward pass.
  out = jnp.dot(*primals, precision=jax.lax.Precision.DEFAULT)
  return out, jvp

TF currently doesn't support setting the PrecisionConfig on a per-op basis. We should probably add a way to do this once we start training FP8 models in TF, at which point a similar approach can be used in TF.

@kaixih
Contributor Author

kaixih commented Oct 16, 2023

Yes, this looks similar to what @wenscarl has just drafted here.

@kaixih
Contributor Author

kaixih commented Oct 16, 2023

@reedwm It appears that our definitions of DEFAULT/HIGHEST differ from yours, as indicated in this link. In our context, DEFAULT signifies fast accumulation, whereas HIGHEST denotes non-fast (full-precision) accumulation. According to our definition, fprop should use DEFAULT and bprop should use HIGHEST. I believe this classification aligns with the current understanding of DEFAULT/HIGHEST:

DEFAULT: Fastest calculation, but least accurate approximation to the original number.
HIGHEST: Slowest calculation, but most accurate approximation to the original number.

@reedwm
Member

reedwm commented Oct 16, 2023

In my example, DEFAULT signifies fast accumulation as well. DEFAULT is used in the forward pass, while HIGHEST is used in the gradients.

@wenscarl
Contributor

In my example, DEFAULT signifies fast accumulation as well. DEFAULT is used in the forward pass, while HIGHEST is used in the gradients.

Tried this commit; the cuBLASLt logs show it works.

@reedwm
Member

reedwm commented Oct 17, 2023

That commit looks good. I see you created and closed PR #6388. Do you plan on reopening it?

@wenscarl
Contributor

This commit touched the code base drastically; figuring out how to rebase.

@wenscarl
Contributor

Opened PR #6599.

copybara-service bot pushed a commit that referenced this issue Nov 2, 2023
Imported from GitHub PR #6599

FP8 cuBLASLt matmul uses fast accumulation when both operands' precisions are DEFAULT; otherwise it falls back to high-precision accumulation. Issue #6168

This PR is closely related to Flax PR google/flax#3416.
Copybara import of the project:

--
a4140da by shuw <shuw@nvidia.com>:

Add FP8 fast accumulation support for cublasLt.

--
9684568 by shuw <shuw@nvidia.com>:

Improve based on review #1

--
e906d76 by shuw <shuw@nvidia.com>:

Improve based on review #2

Merging this change closes #6599

COPYBARA_INTEGRATE_REVIEW=#6599 from wenscarl:fp8_fast_accumulation e906d76
PiperOrigin-RevId: 578948593
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this issue Nov 2, 2023
Imported from GitHub PR openxla/xla#6599

FP8 cuBLASLt matmul uses fast accumulation when both operands' precisions are DEFAULT; otherwise it falls back to high-precision accumulation. Issue openxla/xla#6168

This PR is closely related to Flax PR google/flax#3416.
Copybara import of the project:

--
a4140da8ca08cd2d4796a7b8f032827867a361bc by shuw <shuw@nvidia.com>:

Add FP8 fast accumulation support for cublasLt.

--
96845683cc4b1e7b947bc919fbf97d8865abeac9 by shuw <shuw@nvidia.com>:

Improve based on review #1

--
e906d7620780d2cf1fe8433c933648dcb98dc61d by shuw <shuw@nvidia.com>:

Improve based on review #2

Merging this change closes #6599

PiperOrigin-RevId: 578948593
@penpornk added the NVIDIA-GPU (XLA on Nvidia GPU) and GPU (XLA on GPU) labels on Feb 29, 2024
@reedwm
Member

reedwm commented Feb 29, 2024

Closing as @wenscarl fixed this in #6599.

@reedwm closed this as completed on Feb 29, 2024