
Torch compile settings #2

Closed
hal-314 opened this issue Mar 21, 2024 · 12 comments

Comments

hal-314 commented Mar 21, 2024

Hi there,

First of all, thanks for the benchmark! It's very useful to see this nice comparison between Keras backends and native PyTorch.

I have two questions:

  • Why did you choose 'reduce-overhead' rather than 'max-autotune' to compile the native torch code? The PyTorch docs recommend max-autotune for the best performance. (See the sketch just below this list.)
  • The Keras torch backend is generally slower than native torch, sometimes more than 2 times slower. Is Keras using torch.compile? If so, which mode is it using?
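
For reference, a minimal sketch (not this repo's benchmark code; the model and input are placeholders) of how the two compile modes would be selected:

```python
import torch
import torch.nn as nn

# Placeholder model and input, only to illustrate the compile modes.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).cuda()
x = torch.randn(64, 512, device="cuda")

# "reduce-overhead" uses CUDA graphs to cut per-call launch overhead;
# "max-autotune" additionally autotunes matmul/conv kernel choices
# (longer compile time, potentially faster steady-state throughput).
compiled_ro = torch.compile(model, mode="reduce-overhead")
compiled_ma = torch.compile(model, mode="max-autotune")

# The first calls trigger compilation, so warm up before timing anything.
with torch.no_grad():
    for compiled in (compiled_ro, compiled_ma):
        for _ in range(3):
            _ = compiled(x)
```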

Finally, it could be useful to report how long compilation / the first model run takes for every model with each backend, so readers can tell which backend suits quick prototyping and which suits long training jobs.

Thanks

lezcano commented Apr 2, 2024

Also, I didn't follow up in the first issue, but if you are using CUDA graphs, you need to run the model twice during warm-up for the tracing to happen. This is not currently done in these benchmarks.
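
For concreteness, a minimal sketch of that warm-up pattern, with a placeholder model and input rather than the benchmark's actual ones:

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 512).cuda()
x = torch.randn(64, 512, device="cuda")

# "reduce-overhead" records and replays CUDA graphs, so the compiled model
# should run at least twice during warm-up so the capture happens before timing.
compiled = torch.compile(model, mode="reduce-overhead")
with torch.no_grad():
    for _ in range(2):
        _ = compiled(x)
torch.cuda.synchronize()  # ensure warm-up work has finished before measurement starts
```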

haifeng-jin (Owner) commented Apr 3, 2024

Hi Everyone,

Thanks for all the comments!
I am trying to make the comparison as fair as possible.
So, if you find any unfair settings, please do let us know.

Here are my replies:

For the initial questions of this issue:

Why did you choose 'reduce-overhead' rather than 'max-autotune' to compile the native torch code? The PyTorch docs recommend max-autotune for the best performance.

I mainly followed the suggestions in #1. reduce-overhead seems more transparent, since it is unclear what max-autotune does under the hood. You are welcome to try other modes and report the results.

The Keras torch backend is generally slower than native torch, sometimes more than 2 times slower. Is Keras using torch.compile? If so, which mode is it using?

Keras is not using torch.compile() by default. We have added the eager mode numbers for native PyTorch.
We still have a gap, but a much smaller one. I think it is mainly because Keras uses ops at a finer granularity instead of calling the fused ops.
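
To illustrate the difference (a hypothetical example, not Keras's actual code): composing fine-grained ops launches several kernels and materializes intermediates, while a fused op such as PyTorch's scaled_dot_product_attention does the same math in a single kernel:

```python
import math
import torch
import torch.nn.functional as F

q = torch.randn(8, 16, 128, 64, device="cuda")  # (batch, heads, seq, dim)
k = torch.randn(8, 16, 128, 64, device="cuda")
v = torch.randn(8, 16, 128, 64, device="cuda")

# Fine-grained composition: each op launches its own kernels and materializes
# intermediates (the score matrix, the softmax output) in memory.
scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
out_fine = torch.softmax(scores, dim=-1) @ v

# Fused op: one kernel (e.g. FlashAttention) computes the same result
# without materializing the intermediate attention matrix.
out_fused = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(out_fine, out_fused, atol=1e-3))
```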

Some other concerns from people:

  1. "You have ignored an explicit warning about torch.set_float32_matmul_precision('high') for performance."

We use float32 for SAM on all frameworks.
Setting it to "high" instead of "highest" would lower the precision for one framework and lead to an unfair comparison.

  1. "you are not using cudagraph [for SAM]"

It is because of the model implementation from Meta Research.
We don't change the source code from Meta Research.
Changing the code would contradict our purpose of measuring the "out-of-the-box" performance.

The implementation is not optimal, but it is representative of a common way to use PyTorch.

So we benchmark it and are explicit about it in the post, where it is referred to as the "less manually-optimized model".

Chillee commented Apr 3, 2024

Setting it to "high" instead of "highest" would lower the precision for one framework and lead to an unfair comparison.

In this case, JAX and TensorFlow turn on TensorFloat-32 by default. So, in order to have a fair comparison, you need to enable TensorFloat-32 in PyTorch as well.
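
For reference, a sketch of the PyTorch-side switches that enable TF32 matmuls (not taken from the benchmark code):

```python
import torch

# Either of these enables TensorFloat-32 for float32 matmuls on Ampere+ GPUs.
torch.backends.cuda.matmul.allow_tf32 = True
# Equivalently, via the higher-level knob ("high" allows TF32, "highest" does not):
torch.set_float32_matmul_precision("high")

# Convolutions have a separate flag in cuDNN (already True by default):
torch.backends.cudnn.allow_tf32 = True
```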

@haifeng-jin (Owner)

Also, I didn't follow up in the first issue, but if you are using CUDA graphs, you need to run the model twice during warm-up for the tracing to happen. This is not currently done in these benchmarks.

Ah, sorry for the oversight. We only used one batch for the warm-up.
The code is updated; a blog post update is on the way.

haifeng-jin (Owner) commented Apr 3, 2024

Setting it to "high" instead of "highest" would lower the precision for one framework and lead to an unfair comparison.

In this case, JAX and TensorFlow turn on TensorFloat-32 by default. So, in order to have a fair comparison, you need to enable TensorFloat-32 in PyTorch as well.

@Chillee
Thanks for pointing this out!
I didn't know this.

Is it just torch.backends.cuda.matmul.allow_tf32 = True?
If it is that simple, I can measure it again and update the results.

Just found another caveat: I think Keras explicitly specifies the dtypes for the backend it is using.
So it passes float32 to TF/JAX, and I wonder whether TensorFloat-32 is still used in that case.
Is there a way to check if TF/JAX is actually using TensorFloat-32?
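
One way to at least inspect the TensorFlow-side toggle, assuming a recent TF version (whether Keras's explicit float32 dtype changes this is not verified here):

```python
import tensorflow as tf

# TensorFlow exposes TF32 execution as a global toggle, separate from the
# float32 dtype of the tensors themselves; it is on by default on Ampere+ GPUs.
print(tf.config.experimental.tensor_float_32_execution_enabled())

# It can be turned off explicitly for a strict float32 comparison:
tf.config.experimental.enable_tensor_float_32_execution(False)
```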

lezcano commented May 19, 2024

Was the TF32 point fixed in the end?

@haifeng-jin (Owner)

Hi @lezcano,

Thanks for following up!

Keras uses "float32" explicitly when creating tensors. (code link) (code link) (code link)

So, to my understanding, Keras forces TF and JAX to use "float32" instead of leaving it to the default of the corresponding backend.

Therefore, the benchmarking code uses the same "float32" dtype for different backends. There is no fix required.

Everyone is welcome to check the actual dtype during runtime (if there is a way to do so) and post the method and results here!

@lezcano
Copy link

lezcano commented May 20, 2024

I don't think that is correct. TF32 is not a dtype per se in Keras or PyTorch; both use it as a mode to perform fast float32 x float32 matrix multiplication. As Horace mentioned, I'm pretty sure that JAX uses it by default. OTOH, in PyTorch it is off by default, so you'd need to run the PyTorch code with torch.backends.cuda.matmul.allow_tf32 = True to have a fair comparison.

haifeng-jin (Owner) commented May 20, 2024

I see. And I trust your expertise in PyTorch.

However, I am unable to verify whether TF32 is actually enabled for TF/JAX.
I would prefer to just turn it off completely for all frameworks with export NVIDIA_TF32_OVERRIDE=0 as suggested here.
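
A sketch of doing that from Python instead of the shell; the key assumption is that the variable is set before any framework initializes CUDA:

```python
import os

# NVIDIA_TF32_OVERRIDE=0 tells the CUDA math libraries (cuBLAS, cuDNN) to refuse
# TF32 even if a framework requests it. It must be set before those libraries are
# loaded, i.e. before importing torch / tensorflow / jax.
os.environ["NVIDIA_TF32_OVERRIDE"] = "0"

import torch  # imported only after the override is in place  # noqa: E402
```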

What do you think?

lezcano commented May 20, 2024

Either that, or enable it in both. Given that you are benchmarking neural networks, I think the more reasonable comparison would be to enable it in both, as that is closer to how these models are run in practice (even better, networks would be executed in bfloat16 or with AMP, but yeah).

You can probably check that it exercises the TF32 cores by profiling the program with ncu and then looking at the cuBLAS/CUTLASS kernels that it calls.

lezcano commented May 20, 2024

To make sure they both use the same precision, you can manually set the equivalent flag in PyTorch (see above) and in JAX:

jax.config.update("jax_default_matmul_precision", "highest")
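
Putting the two together, a sketch of forcing strict float32 matmuls in both frameworks, using the flags mentioned above:

```python
import torch
import jax

# PyTorch: keep TF32 off for matmuls (this is also the default behavior).
torch.backends.cuda.matmul.allow_tf32 = False
torch.set_float32_matmul_precision("highest")

# JAX: force full-precision float32 matmuls instead of the faster default.
jax.config.update("jax_default_matmul_precision", "highest")
```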

@haifeng-jin (Owner)

@lezcano Thank you so much for teaching me all of this! I have just turned it off.

To everyone:

Unfortunately, I do not have the bandwidth to keep working on these benchmarks.
It seems more responsible to remove them than to leave them as they are.

Therefore, I removed all the code for the native PyTorch/HuggingFace part (keeping only Keras with its different backends) and made a new snapshot and release of the repo.

I saw a lot of concerns from people about the code in this repo.
I thought these benchmarks would help people gain insights into different frameworks,
but they only caused more trouble instead.

It was mainly because of my limited understanding of the frameworks involved.
My sincere apologies if these benchmarks have caused any trouble for you.
