
[Windows 10] [C LIBRARY] : PNG decoding is slower than OpenCV #72

Open
qinxianyuzi opened this issue Feb 8, 2022 · 16 comments

Comments


qinxianyuzi commented Feb 8, 2022

Hello, thanks for helping.
I am trying to use Wuffs to open PNG files within a C++ project. I compile this code with VS2017, but PNG decoding is slower than OpenCV:
OpenCV: 65 ms
Wuffs: 93 ms

#include <chrono>
#include <cstdio>
#include <iostream>

#define WUFFS_IMPLEMENTATION
#define WUFFS_CONFIG__MODULE__PNG
#include "wuffs-v0.3.c"

uint32_t g_width = 0;
uint32_t g_height = 0;
wuffs_aux::MemOwner g_pixbuf_mem_owner(nullptr, &free);
wuffs_base__pixel_buffer g_pixbuf = { 0 };

bool load_image(const char* filename)
{
	FILE* file = stdin;
	const char* adj_filename = "<stdin>";
	if (filename) {
		FILE* f = fopen(filename, "rb");
		if (f == NULL) {
			printf("%s: could not open file\n", filename);
			return false;
		}
		file = f;
		adj_filename = filename;
	}
	g_width = 0;
	g_height = 0;
	g_pixbuf_mem_owner.reset();
	g_pixbuf = wuffs_base__null_pixel_buffer();
	wuffs_aux::DecodeImageCallbacks callbacks;
	wuffs_aux::sync_io::FileInput input(file);
	wuffs_aux::DecodeImageResult res = wuffs_aux::DecodeImage(callbacks, input);
	if (filename) {
		fclose(file);
	}
	if (!res.error_message.empty()) {
		printf("%s: %s\n", adj_filename, res.error_message.c_str());
		return false;
	}
	// Keep the decoded pixels (and their backing memory) alive after return.
	g_pixbuf_mem_owner = std::move(res.pixbuf_mem_owner);
	g_pixbuf = res.pixbuf;
	g_width = g_pixbuf.pixcfg.width();
	g_height = g_pixbuf.pixcfg.height();
	return true;
}

inline auto get_time()
{
	return std::chrono::high_resolution_clock::now();
}

int main(int argc, char** argv)
{
	auto start = get_time();
	bool loaded = load_image("C:/Users/huangry/Desktop/8/IMG_1071.PNG");
	if (loaded) 
		std::cout << loaded << "\n";
	auto end = get_time();
	std::chrono::duration<double> elapsed = (end - start);
	printf("Wuffs : %fs\n", elapsed.count());

	return 0;
}

nigeltao commented Feb 8, 2022

Are you configuring Visual Studio with /arch:AVX? GCC and clang can use __attribute__((target("avx2"))) on individual functions, but I don't think Microsoft's VS supports that, so you have to manually opt in to SIMD acceleration. If you don't opt in, you'll get the slower (non-SIMD) fallback code.
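For illustration, here is a hedged sketch (with hypothetical function names, not code from Wuffs itself) of the per-function SIMD targeting that GCC and clang support, paired with a runtime CPU check:

```cpp
// Hedged sketch: per-function SIMD targeting on GCC/clang. MSVC has no
// equivalent attribute, so there you opt in globally with /arch:AVX(2).
#include <cstddef>
#include <cstdint>

#if defined(__GNUC__) && defined(__x86_64__)
// This one function is compiled with AVX2 enabled; the compiler may
// auto-vectorize the loop with AVX2 instructions.
__attribute__((target("avx2")))
static uint64_t sum_bytes_avx2(const uint8_t* p, size_t n) {
  uint64_t s = 0;
  for (size_t i = 0; i < n; i++) {
    s += p[i];
  }
  return s;
}
#endif

// Portable fallback, compiled with the baseline instruction set.
static uint64_t sum_bytes_fallback(const uint8_t* p, size_t n) {
  uint64_t s = 0;
  for (size_t i = 0; i < n; i++) {
    s += p[i];
  }
  return s;
}

uint64_t sum_bytes(const uint8_t* p, size_t n) {
#if defined(__GNUC__) && defined(__x86_64__)
  // Only take the AVX2 path if the CPU actually supports it.
  if (__builtin_cpu_supports("avx2")) {
    return sum_bytes_avx2(p, n);
  }
#endif
  return sum_bytes_fallback(p, n);
}
```

Either path returns the same result; the attribute only changes which instructions the compiler may emit for that one function.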


nigeltao commented Feb 8, 2022

If that doesn't help, can you attach the C:/Users/huangry/Desktop/8/IMG_1071.PNG file so I can try to reproduce the slowness?


nigeltao commented Feb 8, 2022

Are you configuring Visual Studio with /arch:AVX?

Oh, also, for MSVC, make sure that you're compiling an optimized build, not a debug build. I think this is the /O2 option (that's: slash, letter-O, number-2), or its GUI equivalent, but I might be wrong (I don't use Microsoft's toolchain, day-to-day).

qinxianyuzi (Author) commented:

Thanks, I'm trying to configure Visual Studio with AVX2, and maybe clang is indispensable.

qinxianyuzi (Author) commented:

IMG_1071 (image attached)

This is the PNG file.

qinxianyuzi (Author) commented:

I configured Visual Studio with /arch:AVX2, but it doesn't work.


nigeltao commented Feb 9, 2022

I configured Visual Studio with /arch:AVX2, but it doesn't work.

Does "it doesn't work" mean that it didn't get faster, or does it mean that you got a compiler error message, or does it mean something else? If it's an error message, can you copy/paste it here?

qinxianyuzi (Author) commented:

It didn't get faster.

nigeltao (Collaborator) commented:

I configured Visual Studio with /arch:AVX2, but it doesn't work.

OK. Does /arch:AVX without the 2 do anything? Do you also pass /O2? It might be easier if you say what compiler flags you are passing.

Is clang faster or is it also as slow?

qinxianyuzi (Author) commented:

It is 1.2× faster than OpenCV with clang.


pavel-perina commented Jun 9, 2022

Hi. I tried it on large data. The program has some internal overhead, but anyway ...

First dataset: 1984×1984×1540, 16-bit grayscale (all times include overhead; a series of 1540 images):
OpenCV/libpng: 75 s
WIC (Windows Imaging Components)/file: 66 s
WIC/memory: 58 s (the file reader had some overhead reading 26 MB PNG files from HDD, so it turned out to be faster to read the whole file and use the memory decoder)
Wuffs: fails (no 16-bit support; converted to 8-bit, leaving half of the output buffer empty)

Second dataset: 2048×2048×2048, 8-bit grayscale synthetic data, each PNG roughly 14 kB, basically repeating b&w patterns:
Wuffs: 18 s everything w/ overhead, 9.4 s in the decoder
WIC/memory: 10 s everything, 2.7 s in the decoder (3.5× faster!!!)
OpenCV/libpng: 21 s everything, 5.4 s in the decoder (worse overhead due to another app layer)

About /arch:AVX ... it may do something, but MSVC is very good at finding reasons why it won't vectorize loops, and those reasons can be printed using the /Qvec-report:2 option under C++ / All Options / Additional Options.

For very compressible data, the bottleneck is obviously wuffs_base__io_writer__limited_copy_u32_from_history_fast, which gives us

1>c:\dev-c\****\imageoperations\include\imageoperations\wuffs-v0.3.h(10427) : info C5002: loop not vectorized due to reason '1301'
1>c:\dev-c\****\imageoperations\include\imageoperations\wuffs-v0.3.h(10432) : info C5002: loop not vectorized due to reason '1301'

And from https://docs.microsoft.com/en-us/cpp/error-messages/tool-errors/vectorizer-and-parallelizer-messages?view=msvc-170#BKMK_ReasonCode130x , 1301 = Loop stride isn't +1.

Here is an example of code which it can vectorize (if OutputType is the same size or narrower; otherwise it fails with code 1203, but the code logic chooses an OutputType that won't overflow):

template<typename OutputType, typename InputType>
void updateBufferFromBlock(void *output, void *input, size_t n)
{
    const InputType*  pIn  = static_cast<const InputType*>(input);
          OutputType* pOut = static_cast<OutputType*>(output);

    for (size_t i = 0; i < n; i++) {
        pOut[i] += static_cast<OutputType>(pIn[i]);
    }
}
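To make the overflow point concrete, here is a hedged usage sketch of a template with the same shape as the one above, accumulating 8-bit input into a 16-bit output buffer so the additions cannot wrap:

```cpp
#include <cstddef>
#include <cstdint>

// Same shape as the template above: a unit-stride loop (stride +1) that
// MSVC's auto-vectorizer can handle.
template<typename OutputType, typename InputType>
void updateBufferFromBlock(void* output, const void* input, size_t n)
{
    const InputType*  pIn  = static_cast<const InputType*>(input);
          OutputType* pOut = static_cast<OutputType*>(output);

    for (size_t i = 0; i < n; i++) {
        // Widen before adding: a uint8_t accumulator would wrap past 255.
        pOut[i] += static_cast<OutputType>(pIn[i]);
    }
}
```

For example, accumulating the input bytes {200, 201, 202, 203} into a uint16_t buffer initialized to 100 yields {300, 301, 302, 303}; with a uint8_t output type the same sums would have wrapped.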

Top-down function times for a realistic dataset: https://i.imgur.com/UD5a7MF.jpg (compiled with /O2 /arch:AVX), and a comparison with other decoders: https://imgur.com/a/ZEtojo9

TL;DR: either write/generate code using AVX intrinsic instructions or don't pre-optimize it for MSVC. Windows Imaging Components seems fastest, but it works only on Windows (since Vista or Seven ... idk).

nigeltao (Collaborator) commented:

FWIW, this patch:

diff --git a/release/c/wuffs-unsupported-snapshot.c b/release/c/wuffs-unsupported-snapshot.c
index 717414f8..ef2105cb 100644
--- a/release/c/wuffs-unsupported-snapshot.c
+++ b/release/c/wuffs-unsupported-snapshot.c
@@ -11743,13 +11743,8 @@ wuffs_base__io_writer__limited_copy_u32_from_history_fast(uint8_t** ptr_iop_w,
                                                           uint32_t distance) {
   uint8_t* p = *ptr_iop_w;
   uint8_t* q = p - distance;
-  uint32_t n = length;
-  for (; n >= 3; n -= 3) {
-    *p++ = *q++;
-    *p++ = *q++;
-    *p++ = *q++;
-  }
-  for (; n; n--) {
+  size_t n = length;
+  for (size_t i = 0; i < n; i++) {
     *p++ = *q++;
   }
   *ptr_iop_w = p;

looks like your updateBufferFromBlock suggestion, but the benchmark results are mixed: clang11 gets worse, gcc10 gets better.

name                                              old speed     new speed     delta

wuffs_deflate_decode_1k_full_init/clang11         181MB/s ± 1%  179MB/s ± 1%  -1.36%  (p=0.008 n=5+5)
wuffs_deflate_decode_1k_part_init/clang11         215MB/s ± 0%  206MB/s ± 0%  -4.53%  (p=0.008 n=5+5)
wuffs_deflate_decode_10k_full_init/clang11        388MB/s ± 0%  362MB/s ± 1%  -6.64%  (p=0.008 n=5+5)
wuffs_deflate_decode_10k_part_init/clang11        398MB/s ± 0%  370MB/s ± 0%  -7.14%  (p=0.008 n=5+5)
wuffs_deflate_decode_100k_just_one_read/clang11   496MB/s ± 0%  489MB/s ± 0%  -1.47%  (p=0.008 n=5+5)
wuffs_deflate_decode_100k_many_big_reads/clang11  313MB/s ± 0%  302MB/s ± 0%  -3.40%  (p=0.008 n=5+5)

wuffs_deflate_decode_1k_full_init/gcc10           177MB/s ± 0%  179MB/s ± 1%    ~     (p=0.056 n=5+5)
wuffs_deflate_decode_1k_part_init/gcc10           206MB/s ± 0%  209MB/s ± 0%  +1.51%  (p=0.008 n=5+5)
wuffs_deflate_decode_10k_full_init/gcc10          384MB/s ± 0%  386MB/s ± 0%  +0.73%  (p=0.008 n=5+5)
wuffs_deflate_decode_10k_part_init/gcc10          393MB/s ± 0%  397MB/s ± 0%  +1.08%  (p=0.008 n=5+5)
wuffs_deflate_decode_100k_just_one_read/gcc10     496MB/s ± 0%  523MB/s ± 0%  +5.30%  (p=0.008 n=5+5)
wuffs_deflate_decode_100k_many_big_reads/gcc10    314MB/s ± 0%  336MB/s ± 1%  +6.96%  (p=0.008 n=5+5)

mimic_deflate_decode_1k_full_init/gcc10           229MB/s ± 1%  228MB/s ± 0%    ~     (p=0.310 n=5+5)
mimic_deflate_decode_10k_full_init/gcc10          275MB/s ± 0%  275MB/s ± 0%    ~     (p=0.310 n=5+5)
mimic_deflate_decode_100k_just_one_read/gcc10     336MB/s ± 0%  335MB/s ± 0%  -0.37%  (p=0.008 n=5+5)
mimic_deflate_decode_100k_many_big_reads/gcc10    263MB/s ± 0%  264MB/s ± 0%    ~     (p=0.310 n=5+5)

In any case, I'm not sure if AVX-ness (or not) would really help here. The destination and source byte slices can overlap, often by only a few bytes, in which case you can't just do a simple memcpy of 32 bytes at a time.

nigeltao (Collaborator) commented:

Wuffs: fails (no 16-bit support; converted to 8-bit, leaving half of the output buffer empty)

Wuffs should be able to decode to WUFFS_BASE__PIXEL_FORMAT__Y_16LE or WUFFS_BASE__PIXEL_FORMAT__Y_16BE, but you have to opt into that (instead of defaulting to WUFFS_BASE__PIXEL_FORMAT__BGRA_PREMUL). If you're using Wuffs' C++ API, then that involves overriding the SelectPixfmt method (like example/sdl-imageviewer/sdl-imageviewer.cc).
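A hedged sketch of that override, assuming the wuffs_aux::DecodeImageCallbacks virtual-method API that example/sdl-imageviewer uses (this compiles only against wuffs-v0.3.c; verify the exact signature against your copy of the library):

```cpp
// Assumption: DecodeImageCallbacks exposes a virtual SelectPixfmt method,
// as in example/sdl-imageviewer/sdl-imageviewer.cc. Hypothetical struct
// name; not verified against every Wuffs release.
struct Gray16Callbacks : public wuffs_aux::DecodeImageCallbacks {
  wuffs_base__pixel_format
  SelectPixfmt(const wuffs_base__image_config& image_config) override {
    // Opt in to 16-bit grayscale instead of the default BGRA_PREMUL.
    return wuffs_base__make_pixel_format(WUFFS_BASE__PIXEL_FORMAT__Y_16LE);
  }
};
```

You would then pass a Gray16Callbacks instance to wuffs_aux::DecodeImage in place of the default callbacks.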

nigeltao added a commit that referenced this issue Jun 12, 2022
I don't have a Windows machine readily available, but according to
https://godbolt.org/z/q4MfjzTPh and the https://imgur.com/UD5a7MF
profile mentioned in #72, this could improve inner loop performance.

Updates #72
nigeltao (Collaborator) commented:

I don't have MSVC myself, but for those who do, I'm curious whether commit c226ed6 noticeably improves PNG decode speed.

pavel-perina commented:

I don't have MSVC myself, but for those who do, I'm curious whether commit c226ed6 noticeably improves PNG decode speed.

I'm sorry, I'm a little busy this week; hopefully I will get to this issue next week.


nigeltao commented Jul 6, 2022

@pavel-perina any news?
