Memory Transfer on A770 16 GB Fails Unit Tests and Has Incomplete Level Zero/OpenCL API #618
Comments
Thank you for your feedback. Performance numbers are collected with the high-level PyTorch software stack.
clpeak has weird results, where enqueueWriteBuffer is 2x faster than enqueueReadBuffer. Why is there a difference between reads and writes? Every other GPU has approximately the same read and write speed.
Also, the kernel latency of 34.76 us on the A770 16 GB is 10x larger than the RTX 2080 SUPER's 3.46 us.
I compiled and ran memory_benchmark_l0 and memory_benchmark_ocl. I attached the results as .txt files; the memory benchmarks produce nonsensical outputs such as inf, negative numbers, ERROR, and many others.
api_overhead_benchmark_l0.txt multitile_memory_benchmark_ocl.txt
Thanks for the follow-up and additional experiments. We are also working to confirm the Level Zero failures, which are not visible in the Intel CI environment.
For FP32, clPeak maxes out at ~13 TFLOP/s. But 512 (EUs) * 2400 (MHz) * 8 (SIMD width) * 2 (FMA) = ~20 TFLOP/s. For comparison, the same clPeak FP32 kernels on an iGPU reach the expected peak. Similarly, FP16 is exactly a factor of two off the theoretical peak.
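The theoretical-peak arithmetic above can be reproduced directly (the EU count and clock are the commenter's figures for the A770, not official specifications):

```python
# Theoretical FP32 peak for Arc A770, using the figures quoted above.
eus = 512           # execution units
clock_hz = 2400e6   # 2400 MHz boost clock
simd_width = 8      # FP32 lanes per EU per cycle
fma_ops = 2         # an FMA counts as a multiply plus an add

peak_flops = eus * clock_hz * simd_width * fma_ops
print(f"{peak_flops / 1e12:.2f} TFLOP/s")  # ~19.66 TFLOP/s
```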
memory_benchmark_ocl has a few errors and crashed my system. I think the Level Zero/OpenCL API returns invalid values too, which causes many applications like PyTorch/TensorFlow to behave strangely.
The clPeak SP32 FMA (a.k.a. MAD, multiply+add) synthetic test observations are correct. In the case of the Arc family, the MAD instruction is split into two ticks, while in the mentioned integrated GFX the two ops are fused into a single instruction executing in one tick in this test.
So as a point of reference, on a 96 EU iGPU my SGEMM kernels (derived from intel/intel-graphics-compiler#254) can reach 1.5 TFLOP/s, or ~75% of peak. On my A770M (~16 TFLOP/s peak), with an appropriately scaled-up problem, the same kernels top out at ~6.3 TFLOP/s, or ~37.5% of peak. Precisely a factor of two off what an iGPU can benchmark, and in line with the synthetics.
Hello, please update the driver to the latest release: 22.49.25018.24 https://github.com/intel/compute-runtime/releases
I upgraded to 22.49.25018.24 and kernel 6.2.1, but the errors still persist. I have an i5-13600K Raptor Lake CPU and 64 GB of RAM. Resizable BAR is enabled, and the Arc A770 is the only OpenCL device available on the system.
clinfo
memory_benchmark_ocl
I'm also observing some failures on the 6.2.1 kernel and A770; logs attached. The performance drop for non-D2D copies is even worse than reported by others. Here's an excerpt from
@BartusW
With the recent release it works fine on our side. @BA8F0D39, could you confirm it is working for you?
@BA8F0D39 Please re-open this issue or create a new one if the problem still exists.
I have an A770 16 GB, and I installed intel-compute-runtime 22.43.24595.30 and Intel Extension for PyTorch v1.13.10+xpu on Linux kernel 6.2-rc8.
I made a script to test memory transfers from GPU to GPU on the A770.
Running the script gives a maximum GPU to GPU transfer rate of around 100 GB/s
Why is the GPU to GPU transfer rate limited to 100 GB/s, when the A770 16 GB's bandwidth is 512 GB/s?
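The measurement idea behind such a script can be sketched in plain Python as a host-memory analogue (the buffer size, repetition count, and the use of a bytearray copy are illustrative assumptions; in the real GPU version the copy must be synchronized, e.g. with the device's synchronize call, before stopping the timer, otherwise only the enqueue time is measured):

```python
import time

# Host-side analogue of a device-to-device bandwidth test: copy a large
# buffer repeatedly and report bytes moved per second.
N = 64 * 1024 * 1024  # 64 MiB per copy (illustrative size)
src = bytearray(N)
dst = bytearray(N)

reps = 4
start = time.perf_counter()
for _ in range(reps):
    dst[:] = src  # bulk copy of the whole buffer
elapsed = time.perf_counter() - start

gbps = (N * reps) / elapsed / 1e9
print(f"{gbps:.1f} GB/s")
```

The same pattern (total bytes moved divided by wall-clock time) is what clpeak and the compute-benchmarks memory tests report, so a discrepancy between a script like this and the advertised 512 GB/s points at either the copy path used by the runtime or a missing synchronization in the timing.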