Performance issues with Intel i9-13900K CPU #3802
3 comments · 8 replies
-
How interesting, I've never seen that before. This is just speculation, but could it be related to float exception handling? If you have some zeros in your FFC image, you could be triggering a divide-by-zero exception which libvips then has to catch, and I can imagine AMD and Intel having very different float exception hardware. I always compute my FFC images with something like:

```
smooth = white.gaussblur(10)
ffc = smooth.max() / smooth
```

(that would be for vignette correction -- per-pixel sensitivity correction should be done separately and much earlier in your pipeline). Then in your assembly code you can do:

```
tile = (tile * ffc).cast("uchar")
```

which should be a little quicker, and avoids any /0 issues. It might be worth trying.
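In NetVips (C#) the same idea might look roughly like this -- an untested sketch, with the file names and the sigma of 10 just illustrative:

```csharp
using NetVips;

// Build a smooth, strictly positive correction image once, up front.
using var white = Image.NewFromFile("white.tif");
using var smooth = white.Gaussblur(10);

// Max() returns a double, so ffc = max / smooth is a float image
// you can multiply by, instead of dividing each tile.
using var ffc = smooth.Max() / smooth;

// Then, per tile in the assembly loop:
using var tile = Image.NewFromFile("tile_0_0.tif");
using var corrected = (tile * ffc).Cast(Enums.BandFormat.Uchar);
```

Multiplying by a precomputed reciprocal does the division once for the whole run rather than once per tile, and the Gaussian blur guarantees there are no zeros left to trip the FPU.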
-
Could this be a CPU throttling issue? You could try limiting the concurrency to the number of physical cores. This can be controlled in NetVips with the NetVips.Concurrency property.
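Something like this, set once before building the pipeline (8, the i9's P-core count, is just a first value to try):

```csharp
using NetVips;

// Cap libvips' worker thread pool at the physical core count
// instead of all 32 hardware threads on the i9-13900K.
NetVips.Concurrency = 8;
```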
-
Oooh, this looks interesting. Your i9 has 8 P-cores but should use the 16 E-cores quite well too; I'd expect concurrency 12 to be faster than that. I tested on my 16-core AMD 7950X with your Win10 .bat and saw some limitation: only a few cores reached 40% utilization (time was 60 s). Something is not threading properly (perhaps in the "partial" evaluation).
-
Hello,
I'm working on an application that needs to stitch a large image from a rectangular grid of tiles.
libvips works great for me and does the job, but now I'm facing a performance issue when running on an Intel i9-13900K CPU.
My app is in C#/.NET for Windows, so I'm using the NetVips wrapper. Here's a simplified test that shows my image processing pipeline:
One important step is performing Flat Field Correction on the images: dividing them by an FFC profile, which is a 3-band float TIFF.
In short, for each of 50 rows we load the tiles, divide each one by the FFC profile, join them into a row image, and save it to disk; a minimal sketch follows below.
(In the next step, row files are loaded from disk and merged together; I'm omitting this from my test as there's no performance difference.)
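A rough sketch of the per-row loop in NetVips (simplified from the real test; the file names and the grid width are illustrative, not my actual values):

```csharp
using NetVips;

const int Rows = 50;
const int TilesPerRow = 10; // illustrative grid width

// The 3-band float FFC profile used for flat field correction.
using var ffc = Image.NewFromFile("ffc.tif");

for (var row = 0; row < Rows; row++)
{
    var tiles = new Image[TilesPerRow];
    for (var col = 0; col < TilesPerRow; col++)
    {
        var tile = Image.NewFromFile($"tile_{row}_{col}.tif",
            access: Enums.Access.Sequential);
        // Flat field correction: divide by the FFC profile,
        // then cast back to 8-bit.
        tiles[col] = (tile / ffc).Cast(Enums.BandFormat.Uchar);
    }

    // Join the corrected tiles into one row image and save it.
    using var rowImage = Image.Arrayjoin(tiles, across: TilesPerRow);
    rowImage.WriteToFile($"row_{row}.tif");
}
```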
I ran this test on two PCs: AMD (Ryzen 5 5600X @ 3.7 GHz, Windows 10, 32 GB RAM) and Intel (13th-gen Core i9-13900K @ 3.0 GHz, Windows 11, 64 GB RAM). Both have similarly fast Samsung SSDs.
Elapsed time on AMD: 150 seconds.
Elapsed time on Intel: 450 seconds.
Despite having more cores and more power, the Intel machine is 3x slower!
When I remove the 'divide' operation from the pipeline, the results are much more consistent:
Elapsed time on AMD without 'Divide': 73 seconds.
Elapsed time on Intel without 'Divide': 70 seconds.
Test results are averaged from multiple runs.
Question: what is wrong? Why does adding the 'Divide' operation slow down the whole pipeline so much on Intel?
Here's a link to compiled binaries and sample images if anyone wants to benchmark their computers (note that you would need 1 GB free space on the SSD): VIPS Benchmark
Thanks a lot,
Vyacheslav.