Performance issues with Intel i9-13900K CPU #3802

vyacheslav-ogirenko · 2023-12-28T07:29:50Z

vyacheslav-ogirenko
Dec 28, 2023

Hello,

I'm working on an application that needs to stitch a large image from a rectangular grid of tiles.
libvips works great for me and does the job but now I'm facing performance issue when running on Intel i9-13900K CPU.

My app is in C#/.Net for Windows so I'm using NetVips wrapper. Here's a simplified test that shows my image processing pipeline:

using NetVips;
using System;
using System.Diagnostics;

namespace VIPSBenchmark
{
	class Program
	{
		private const int NumRows = 50;

		private const int NumColumns = 50;

		private const int TileWidth = 2880;

		private const double Overlap = 0.1;

		private const double BlendAreaWidth = 0.1;

		private const string FFCProfilePath = @"ffc_profile.tiff";

		private const string TileImageFile = @"tile.jpg";

		private const string RowImageFile = @"row.v";

		private static int HorizontalOverlapPixels => (int)Math.Round(TileWidth * Overlap);

		private static int StitchedTileWidth => TileWidth - HorizontalOverlapPixels;

		private static Image FFC
		{
			get
			{
				if(_FFC == null)
				{
					_FFC = Image.Tiffload(FFCProfilePath, 0);

					_FFC = _FFC.Bandjoin(new Image[] { Image.Tiffload(FFCProfilePath, 1), Image.Tiffload(FFCProfilePath, 2) });
				}

				return _FFC;
			}
		}

		private static Image _FFC = null;

		private static void MergeRow()
		{
			Image rowImage = Image.Black(1, 1);

			Image firstImage = Image.NewFromFile(TileImageFile, true, NetVips.Enums.Access.Sequential);

			firstImage = firstImage.Divide(FFC).Cast(NetVips.Enums.BandFormat.Uchar);

			rowImage = rowImage.Insert(firstImage, 0, 0, true);

			firstImage.Close();

			for (int c = 1; c < NumColumns; c++)
			{
				Image imageColumn = Image.NewFromFile(TileImageFile, true, NetVips.Enums.Access.Sequential);

				imageColumn = imageColumn.Divide(FFC).Cast(NetVips.Enums.BandFormat.Uchar);

				rowImage = rowImage.Merge(imageColumn, NetVips.Enums.Direction.Horizontal, -StitchedTileWidth * c, 0, (int)(HorizontalOverlapPixels * BlendAreaWidth));

				imageColumn.Close();
			}

			rowImage.WriteToFile(RowImageFile);

			rowImage.Close();
		}

		static void Main()
		{
			Console.WriteLine("Working...");

			Stopwatch stopWatch = Stopwatch.StartNew();

			for (int r = 0; r < NumRows; r++)
			{
				MergeRow();
			}

			Console.WriteLine("Elapsed time: {0:0.0} sec.", stopWatch.ElapsedMilliseconds / 1000.0);
		}
	}
}

One important step is performing Flat Field Correction of the images: dividing them by FFC profile which is 3-band float TIFF.

In short: for each of 50 rows we:

Load tile images one by one (only one image file is used in this test for simplicity reasons)
'divide' tile image by FFC image, cast result back to 8-bit
'merge' tile into larger row image
Save resulting row image to a '.v' file to disk.

(In the next step, row files are loaded from disk and merged together, I'm omitting this from my test as there's no performance difference).

I ran this test on two PCs: AMD (Ryzen 5 5600X @ 3.7 GHz, Windows 10, 32GB RAM) and Intel (13th i9-13900K, 3.00 GHz, Windows 11, 64GB RAM). Both have similar fast Samsung SSDs.

Elapsed time on AMD: 150 seconds.
Elapsed time Intel: 450 seconds.

Despite having more cores and more power, Intel is x3 times slower!

When I remove 'divide' operation from the pipeline, results are much more consistent:

Elapsed time on AMD without 'Divide': 73 seconds.
Elapsed time on Intel without 'Divide': 70 seconds.

Test results are averaged from multiple runs.

Question: what is wrong? Why adding 'Divide' operation slows down whole pipeline so much on Intel?

Here's a link to compiled binaries and sample images if anyone wants to benchmark their computers (note that you would need 1 GB free space on the SSD): VIPS Benchmark

Thanks a lot,
Vyacheslav.

jcupitt · 2023-12-28T08:53:09Z

jcupitt
Dec 28, 2023
Maintainer

Hi @vyacheslav-ogirenko,

How interesting, I've never seen that before.

This is just speculation, but could it be related to float exception handling? If you have some zeros in your FFC image, you could be triggering a divide by zero exception which libvips then has to catch, and I can imagine AMD and Intel having very different float exception hardware.

I always compute my FFC images with something like:

smooth = white.gaussblur(10)
ffc = smooth.max() / smooth

(that would be for vignette correction -- per-pixel sensitivity correction should be done separately and much earlier in your pipeline)

Then in your assembly code you can do:

tile = (tile * ffc).cast("uchar")

Which should be a little quicker, and avoids any /0 issues.

It might be worth trying.

2 replies

jcupitt Dec 28, 2023
Maintainer

You could write your line assembly as eg.:

row = None
for i in range(n_columns):
    tile = pyvips.Image.new_from_file(filename[i], access="sequential")
    tile = (tile * ffc).cast("uchar")
    row = tile if row == None else row = row.merge(tile, "horizontal",  overlap_in_pixels - row.width, 0)

it'd save a little copy-paste code.

vyacheslav-ogirenko Dec 28, 2023
Author

John, thank you for your quick reply!

Normally there should be no zeroes in the FFC image, it is computed a bit differently and addresses uneven lighting correction.
Anyway, I replaced Divide with Multiply and there's no change in performance ;(

Your sample code is much cleaner and easy to understand, thanks. I'm just not a Python guy ;)
I will re-write my test in Python and check how it performs.

Vyacheslav.

kleisauke · 2023-12-28T11:21:05Z

kleisauke
Dec 28, 2023
Collaborator

Could this be a CPU throttling issue? You could try to limit the concurrency to the max physical cores. This can be controlled in NetVips with the VIPS_CONCURRENCY environment variable or the NetVips.Concurrency setter.

4 replies

vyacheslav-ogirenko Dec 28, 2023
Author

Bingo! ;) It's definitely related to concurrency but in a somewhat strange way.

On AMD with concurrency set to 1 I'm getting 175 seconds, a little slower than with default concurrency setting.
On Intel with concurrency set to 1 I'm getting 138 seconds, which indeed shows how faster my Intel's CPU single-core performance should be related to my AMD.

Raising concurrency gradually from 2 to 6 on AMD (number of physical cores) gives me steady performance increase up to 98 seconds.
Raising it further lowers performance back to 150 seconds that I have with default concurrency setting.

Strange things happen with Intel CPU (which has 24 cores) though. Raising concurrency gradually from 2 to 4 gives performance increase up to 78 seconds but then it starts to degrade. With concurrency = 24 it's only 295 seconds, 2 times slower than single-core.
Could that be new Intel E-Cores getting in the way?

Interestingly, I never see 100% CPU utilization in any of the cases above, both AMD and Intel.

Anyway, I guess my solution would be limiting libvips to single-core operation and parallelize my code manually.

Thanks a lot for suggestions!

kleisauke Jan 4, 2024
Collaborator

Looking at the above example program, I noticed this:

for (int r = 0; r < NumRows; r++)
{
    MergeRow();
}

So, it looks like your calling MergeRow() 50 times without doing anything with the r variable. Perhaps that loop can be removed?

FWIW, here's a simplified program that completes in ~4 seconds for me.

using NetVips;
using System;
using System.Diagnostics;

namespace VIPSBenchmark;

class Program
{
    private const int NumColumns = 50;
    private const double Overlap = 0.1;
    private const double BlendAreaWidth = 0.1;

    private const string FFCProfilePath = "ffc_profile.tiff";
    private const string TileImageFile = "tile.jpg";
    private const string RowImageFile = "row.v";

    static void Main()
    {
        var sw = Stopwatch.StartNew();

        using var ffc0 = Image.Tiffload(FFCProfilePath, 0);
        using var ffc1 = Image.Tiffload(FFCProfilePath, 1);
        using var ffc2 = Image.Tiffload(FFCProfilePath, 2);
        using var ffc = ffc0.Bandjoin(ffc1, ffc2);

        var image = Image.NewFromFile(TileImageFile, access: Enums.Access.Sequential)
            .Divide(ffc)
            .Cast(Enums.BandFormat.Uchar);

        for (var c = 1; c < NumColumns; c++)
        {
            using var tile = Image.NewFromFile(TileImageFile, access: Enums.Access.Sequential)
                .Divide(ffc)
                .Cast(Enums.BandFormat.Uchar);

            var horizontalOverlapPixels = (int)Math.Round(tile.Width * Overlap);
            image = image.Merge(tile, Enums.Direction.Horizontal,
                -(tile.Width - horizontalOverlapPixels) * c, 0,
                (int)(horizontalOverlapPixels * BlendAreaWidth));
        }

        image.WriteToFile(RowImageFile);
        image.Dispose();

        sw.Stop();
        Console.WriteLine("Elapsed time: {0:0.0}s", sw.ElapsedMilliseconds / 1000.0);
    }
}

Could that be new Intel E-Cores getting in the way?

I'm not sure about this, but I suspect it could be related to this. libvips uses g_get_num_processors() to set the default concurrency, perhaps commit https://gitlab.gnome.org/GNOME/glib/-/commit/09de26185e8d8ed307ba82e47ec7126da85fa564 might help with this.

vyacheslav-ogirenko Jan 4, 2024
Author

So, it looks like your calling MergeRow() 50 times without doing anything with the r variable. Perhaps that loop can be removed?

That's just to make sure test runs long enough to measure time differences more accurately. Repeating same MergeRow() 50 times.

Thanks for your streamlined test app. I ran it with different Concurrency settings:
Default Concurrency: 5.4 sec
Concurrency = 1 (single-core): 2.5 sec
Concurrency = 4: 1.1 sec
Concurrency = 8: 2.0 sec
Concurrency = 12: 3.1 sec
Concurrency = 24: 6.1 sec

This is consistent with long version of the test.

Unfortunately I don't have means to build libvips from source on Windows, will wait till this patch makes it to release and then test again.

kleisauke Jan 8, 2024
Collaborator

Ah, that commit was only relevant when running on Linux. On Windows, g_get_num_processors() should already take CPU affinity in account.

Windows 11 Task Scheduler ought to ship with Intel's Thread Director, but I'm unsure if this functionally is enabled by default. It can be queried with the CPUID instruction like this:

using System.Runtime.Intrinsics.X86;

// Thermal and Power Management Leaf, CPUID level 0x00000006 (EAX), word 14
var eax = X86Base.CpuId(0x00000006, 0x00000000).Eax;

Console.WriteLine(eax & (1 << 23)); // Intel Thread Director (ITD)

(untested)

kmartinez · 2024-01-05T16:25:38Z

kmartinez
Jan 5, 2024
Maintainer

Oooh this looks interesting. Your i9 has 8 P cores but should use the 16 E cores quite well too.. I'd expect concurrency 12 to be faster than that.

I tested on my 16 core amd7950X with your win10 .bat and saw some limitation as only a few cores reached 40% active (time was 60s). Something is not threading/"partial"

2 replies

vyacheslav-ogirenko Jan 5, 2024
Author

Setting concurrency to number of physical cores gives [expected] best performance with AMD but not with Intel.

I tried to limit process affinity to P-cores only with Process Lasso tool but that doesn't seem to do any good, at least with default concurrency setting.

kleisauke Jan 8, 2024
Collaborator

I tried to limit process affinity to P-cores only with Process Lasso tool but that doesn't seem to do any good, at least with default concurrency setting.

This can also be done in Windows Task Manager. IIUC, 0-7 are the P-cores and 8-24 are the E-cores on your 13900K CPU. You may want to restart the application to ensure that NetVips.Concurrency / g_get_num_processors() takes this into account as well.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance issues with Intel i9-13900K CPU #3802

{{title}}

Replies: 3 comments 8 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Performance issues with Intel i9-13900K CPU #3802

vyacheslav-ogirenko Dec 28, 2023

Replies: 3 comments · 8 replies

jcupitt Dec 28, 2023 Maintainer

jcupitt Dec 28, 2023 Maintainer

vyacheslav-ogirenko Dec 28, 2023 Author

kleisauke Dec 28, 2023 Collaborator

vyacheslav-ogirenko Dec 28, 2023 Author

kleisauke Jan 4, 2024 Collaborator

vyacheslav-ogirenko Jan 4, 2024 Author

kleisauke Jan 8, 2024 Collaborator

kmartinez Jan 5, 2024 Maintainer

vyacheslav-ogirenko Jan 5, 2024 Author

kleisauke Jan 8, 2024 Collaborator

vyacheslav-ogirenko
Dec 28, 2023

Replies: 3 comments 8 replies

jcupitt
Dec 28, 2023
Maintainer

jcupitt Dec 28, 2023
Maintainer

vyacheslav-ogirenko Dec 28, 2023
Author

kleisauke
Dec 28, 2023
Collaborator

vyacheslav-ogirenko Dec 28, 2023
Author

kleisauke Jan 4, 2024
Collaborator

vyacheslav-ogirenko Jan 4, 2024
Author

kleisauke Jan 8, 2024
Collaborator

kmartinez
Jan 5, 2024
Maintainer

vyacheslav-ogirenko Jan 5, 2024
Author

kleisauke Jan 8, 2024
Collaborator