
PNG with zero sized IDAT is loading successfully as black image #1075

Closed
misos1 opened this issue Nov 14, 2019 · 24 comments

@misos1

misos1 commented Nov 14, 2019

Expected

The program should panic at image::load_from_memory(bad_png).unwrap(); loading an invalid image should fail. If there is no IDAT chunk, or the decompressed data is too small to fill the whole image, loading does fail, but not when the IDAT chunk has zero size.

Actual behaviour

The program runs successfully: image::load_from_memory(bad_png) returns Ok, and the resulting image is black. The raw bytes are all zero, and possibly actually undefined.

Reproduction steps

extern crate image;

fn main()
{
	let bad_png = b"\
		\x89PNG\x0D\x0A\x1A\x0A\
		\x00\x00\x00\x0D\
			IHDR\
			\x00\x00\x04\x00\
			\x00\x00\x05\xA9\
			\x08\x02\x00\x00\x00\
			\x68\x1B\xF7\x46\
		\x00\x00\x00\x00\
			IDAT\
			\x35\xAF\x06\x1E\
		\x00\x00\x00\x00\
			IEND\
			\xAE\x42\x60\x82\
	";
	let img = image::load_from_memory(bad_png).unwrap();
	let data = img.as_rgb8().unwrap().iter().copied();
	let zeros = std::iter::repeat(0u8).take(1024 * 1449 * 3);
	assert!(data.eq(zeros), "Are they zero or undefined?");
}
@HeroicKatora
Member

HeroicKatora commented Feb 7, 2020

The bytes are definitely not undefined; they are zeroed, as the image buffer is created as a zeroed Vec of bytes that is later filled. The decoder should still not accept such an image, though. This is probably an error in image-png where a check was forgotten.
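For illustration, the kind of check that appears to be missing would look roughly like this (a sketch only, not the actual image-png code; the function name and error type are made up, and per-row filter bytes are ignored):

fn check_decoded_len(width: u32, height: u32, bytes_per_pixel: usize, decoded: &[u8]) -> Result<(), String>
{
	// After inflating all IDAT data, the decoded byte count must cover the
	// whole buffer implied by the IHDR dimensions and color type.
	let expected = width as usize * height as usize * bytes_per_pixel;
	if decoded.len() < expected
	{
		return Err(format!("IDAT stream too short: got {} bytes, expected {}", decoded.len(), expected));
	}
	Ok(())
}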

@misos1
Author

misos1 commented Feb 7, 2020

Would it not be better to start with a vector of uninitialized values (once things like this are fixed)?

@HeroicKatora
Member

Not really; it's very risky and the benefits are minimal. Allocating a large zeroed vector is only marginally slower, sometimes not slower at all, than allocating an uninitialized one.

@fintelia
Contributor

fintelia commented Feb 7, 2020

To add some details: current computers can zero memory at something like ~50 GiB per second, so the total time to zero a vector is not going to be very large. Plus, at the end of the process the contents of the vector will be in cache, so subsequent accesses while decoding will be faster, which further mitigates the overhead.
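A rough way to measure that bandwidth on a particular machine is something like the following (a sketch only; it assumes the compiler turns the zeroing loop into a memset, which LLVM usually does):

use std::time::Instant;

fn main()
{
	// Write the buffer once first so page faults do not distort the timed pass.
	let len = 1usize << 30; // 1 GiB
	let mut buf = vec![1u8; len];
	let start = Instant::now();
	for b in buf.iter_mut() { *b = 0; } // usually compiled down to a memset
	let secs = start.elapsed().as_secs_f64();
	assert!(buf[len - 1] == 0); // keep the buffer alive so the writes are not elided
	println!("{:.1} GiB/s", 1.0 / secs);
}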

@HeroicKatora HeroicKatora added this to Planned in Version 0.23 Feb 9, 2020
@HeroicKatora HeroicKatora moved this from Planned to Bugs in Version 0.23 Feb 9, 2020
@misos1
Author

misos1 commented Feb 10, 2020

@HeroicKatora So let's see:

// cargo-deps: rand = "0.7.3"
extern crate rand;
use std::alloc::{alloc, alloc_zeroed, dealloc, Layout};
use std::time::Instant;
use rand::Rng;

fn main()
{
	for alloc_func in &[alloc, alloc_zeroed]
	{
		let mut rng = rand::thread_rng();
		let mut useit: usize = 0;
		let start = Instant::now();
		for _ in 0..1_000_000
		{
			let layout = Layout::from_size_align(rng.gen_range(1_000_000, 10_000_000), 1).unwrap();
			let ptr = unsafe {alloc_func(layout)};
			useit = useit.wrapping_add(ptr as usize);
			unsafe {dealloc(ptr, layout)};
		}
		println!("{} seconds   (sum {:X})", start.elapsed().as_secs_f32(), useit);
	}
}

Example output on my machine:

0.08206253 seconds   (sum 2079FCB4B2853320)
176.02649 seconds   (sum 20792E067CEC8C00)

Allocation itself is not that slow, but allocating large zeroed memory is much slower than the allocation alone: in this test a few thousand times slower. I used allocation sizes from 1 to 10 MB, which is roughly the range for typical color images with resolutions from about 640x480 to 1920x1080.

I understand that adding extra unsafe blocks, or handing the user a vector of MaybeUninit, may not be the best thing. But this can be accomplished without unsafe blocks: for example, construct the vector with with_capacity and push pixels into it (or maybe use a size hint with collect).

@fintelia Zeroing memory is nothing special; it is as fast as filling it with any other value. In the worst case, writing the decompressed values is as fast as writing zeros, so zeroing first could mean a 50% performance penalty. If the buffer fits into L1 or L2 cache this loss may be smaller, but not because we first wrote zeros into the cache; writing just the decompressed data without zeroing first will still be faster. And if the processor has 512 kB of L2 cache per core, even a 640x480 color image does not fit, while L3 cache is not much faster than main memory. The best thing would be to run a benchmark and see how big this overhead actually is.

@HeroicKatora
Member

HeroicKatora commented Feb 10, 2020

@misos1 Marginally slower in the total time of reading the image, of course. We are talking about trading some 0.1 ms for an image of, on average, 5 million bytes, which itself can take 20 ms or more to decode with some formats. And even that cost might be compensated by the zeroing having filled the TLB and other caches, reducing cache misses on subsequent writes of the actual data. The trade-off of risking a repetition of this bug needs much larger benefits. If you have numbers for an end-to-end benchmark, then we can begin to discuss which changes to some decoders bring enough benefit, but we should definitely not start by making ImageDecoder basically an unsafe trait.

The much more effective change would be to build on the current system to avoid the allocation by allowing reading into a pre-allocated buffer.
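Roughly (a sketch against the in-progress 0.23 ImageDecoder trait; exact paths and signatures may differ):

use image::ImageDecoder;
use image::png::PngDecoder;

// The zeroed allocation then happens once, not once per decoded image.
fn decode_into<'a>(png_bytes: &[u8], buf: &'a mut Vec<u8>) -> image::ImageResult<&'a [u8]>
{
	let decoder = PngDecoder::new(png_bytes)?;
	let len = decoder.total_bytes() as usize;
	buf.resize(len, 0); // zero-fills only the part the previous buffer did not cover
	decoder.read_image(&mut buf[..len])?;
	Ok(&buf[..len])
}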

@misos1
Author

misos1 commented Feb 10, 2020

@HeroicKatora Yes, this makes sense. But 50 GB/s is probably multi-core performance; I would expect more like half of that for a single core on high-end systems. I do not think this can be compensated by filling zeros acting as prefetching, for which there are dedicated prefetch instructions anyway. You are right that it is better to fill the buffer with zeros than to use uninitialized values. But why not use the Vec::with_capacity approach? That would pre-allocate the buffer and avoid subsequent allocations.

@misos1
Author

misos1 commented Feb 10, 2020

@HeroicKatora It would be like safe "uninitialized" values. If it is currently not possible to push individual pixels into the vector, its length can instead be grown row by row, with each row filled first with zeros and then with decoded image data. A single image row is much smaller and fits into L1 cache, so such zeroing would be truly negligible.
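For example (a sketch only; next_row is a hypothetical per-row decoding callback, not an existing API):

fn decode_rows<F>(width: usize, height: usize, channels: usize, mut next_row: F) -> Vec<u8>
	where F: FnMut(&mut [u8])
{
	// The vector grows one row at a time, so there is no large up-front
	// zeroing and no unsafe code is involved.
	let row_len = width * channels;
	let mut buf = Vec::with_capacity(row_len * height);
	for _ in 0..height
	{
		let start = buf.len();
		buf.resize(start + row_len, 0); // zero one row; it fits in L1 cache
		next_row(&mut buf[start..]);    // the decoder then overwrites the row
	}
	buf
}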

@HeroicKatora
Member

HeroicKatora commented Feb 10, 2020

@misos1 Because it would commit to Vec<u8> as the backing buffer, which is both inflexible and incorrect for higher-aligned types. Maybe we'll have a specialized interface to avoid the filling at some point, but I doubt that with our limited time we can improve on what not even std has stabilized (see std::io::Read::initializer and related experiments). If you want to contribute to this effort, it is much more effective for everyone involved if we focus on eliminating (or reducing) the number of deallocation-allocation cycles involved.

@misos1
Author

misos1 commented Feb 10, 2020

@HeroicKatora Did you mean Vec<T>? I do not see why it must be u8.

@HeroicKatora
Member

I'm interested in how you think ImageDecoder would work with a generic parameter T when the actual sample type involved is chosen dynamically by the decoder.

@misos1
Author

misos1 commented Feb 10, 2020

@HeroicKatora Seems I just misunderstood something. You probably meant the container type, not the element type.

@HeroicKatora
Member

HeroicKatora commented Feb 10, 2020

No, I did mean the element type. The decoder dynamically determines the channel format, which might turn out to be several channels of type u8, u16, f32, or even heterogeneous ones (see tiff, ..). Being able to provide a &mut Vec<u16> requires that the image buffer choose a type based on the dynamic ColorType information that was decoded. In all instantiations we could find, this would then require either

  • Calling a generic function which somehow needs to specialize on that type parameter, but it can't, because this is Rust and not C++ and there is no constexpr if std::is_same.
  • Choosing some stabilized and static set of supported buffer and channel types, which we basically just migrated away from because the old enum with a Vec<u8> and Vec<u16> was far too inflexible.

@misos1
Author

misos1 commented Feb 10, 2020

Ok, I see now. Currently all images are loaded into the right Vec<T>, but through a typecast to a [u8] slice. So you would need to forward the Vec<T> to the decoders' read_image and do things based on the type of T. It will probably be better to do this after Rust specialization or const if is stabilized (maybe macros could be used in the meantime).
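The typecast idea, sketched here with the bytemuck crate (just one way to do the cast safely, not necessarily what the image crate does internally): the caller owns a Vec<u16>, but the decoder only ever sees a &mut [u8].

// cargo-deps: bytemuck = "1"
fn fill_u16_buffer<F>(buf: &mut Vec<u16>, samples: usize, read_into: F)
	where F: FnOnce(&mut [u8])
{
	buf.resize(samples, 0);
	// View the u16 storage as plain bytes; the alignment of u8 is 1, so the cast is safe.
	let bytes: &mut [u8] = bytemuck::cast_slice_mut(buf.as_mut_slice());
	read_into(bytes); // the decoder writes the sample bytes directly
}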

@fintelia
Contributor

The benchmark posted above exhibits really weird effects because it doesn't actually touch the memory that was allocated. On my machine, for instance, increasing the allocation size 100x or 1000x actually makes the zeroed case several orders of magnitude faster, and the two functions end up performing the same. Not to mention that the allocation pattern is so simple that the allocator only ever returns a half dozen or so distinct pointers.

@misos1
Author

misos1 commented Feb 10, 2020

Try changing the second parameter of Layout::from_size_align from 1 to 256 (or maybe the page size). I am not sure what is happening now, but the output from my machine would also indicate about 31 GB/s of bandwidth, while it should be more like 16.

@HeroicKatora
Member

HeroicKatora commented Feb 10, 2020

And I want to remark that using jemalloc instead of the system allocator (on a gnu-linux target with kernel 5.14) also brings both into the same range of performance, and makes alloc_zeroed perform much better than manual zeroing. I've also added a loop that touches a page of memory with a volatile write of a single 1u8. So your measurements might just be tainted by the default allocator not being very smart, but having a small binary footprint instead (if I remember correctly, that was the reason jemalloc was removed somewhere around 1.32). Full code here
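The touch loop has roughly this shape (a sketch, not the exact code behind the link; it assumes 4 KiB pages):

use std::ptr;

// Touch one byte per page with a volatile write so the kernel has to actually
// map (and zero) the pages instead of only reserving the address range.
unsafe fn touch_pages(start: *mut u8, len: usize)
{
	let page = 4096;
	let mut offset = 0;
	while offset < len
	{
		ptr::write_volatile(start.add(offset), 1u8);
		offset += page;
	}
}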

@misos1
Author

misos1 commented Feb 10, 2020

I am on kernel 4.15:

0.31884187 seconds   (sum C30DA766F1A26000)
0.63516575 seconds   (sum C30DA72CF996D800)
73.62745 seconds   (sum C30DA2A8E7616000)

So it has somehow prepared already zeroed pages?

@HeroicKatora
Member

So it has somehow prepared already zeroed pages?

Or it will lazily zero those pages. Maybe that means it is actually slower if one tries to fill them very rapidly. It also means that there is no concretely better option, and profiling is much more useful than speculating on the exact effects of eliding some initialization.

@HeroicKatora
Member

HeroicKatora commented Feb 10, 2020

This is the magic of MAP_ANONYMOUS and other mmap arguments; one can spend an incredible amount of time obsessing over micro-profiling here, I think. Hence why I said that time is better spent on other, easier and clearer optimizations first.

@fintelia
Contributor

fintelia commented Feb 10, 2020

Yeah, Linux does lazy allocation for calls to mmap. It just records that the allocation happened, but doesn't do any work or actually map the page until the application tries to access it. When/if the application does access the memory, it gets a page fault, at which point the kernel zeroes a page of memory and maps it at the given address. (This is why performance is always the same: Linux guarantees the page will be zero, so there is no extra work for alloc_zeroed vs plain alloc.)

@misos1
Author

misos1 commented Feb 10, 2020

Ok, but now I have to add that vec![0; ...] is (probably) not using alloc_zeroed, but doing that manual zeroing instead.

@HeroicKatora
Member

HeroicKatora commented Feb 10, 2020

Specifically, vec![0; n] does use alloc_zeroed as an optimization. (Follow the code flow; it might be a good exercise in reading Rust std code and getting familiar with its internals. It's much more readable than the C++ STL, for example.)
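One way to see it from the outside, instead of reading the std source, is a counting allocator (a sketch; the wrapper just forwards to the system allocator and counts alloc_zeroed calls):

use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

static ZEROED: AtomicUsize = AtomicUsize::new(0);

struct Counting;

unsafe impl GlobalAlloc for Counting
{
	unsafe fn alloc(&self, layout: Layout) -> *mut u8
	{
		System.alloc(layout)
	}
	unsafe fn alloc_zeroed(&self, layout: Layout) -> *mut u8
	{
		ZEROED.fetch_add(1, Ordering::Relaxed);
		System.alloc_zeroed(layout)
	}
	unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout)
	{
		System.dealloc(ptr, layout)
	}
}

#[global_allocator]
static ALLOC: Counting = Counting;

fn main()
{
	let v = vec![0u8; 8 * 1024 * 1024];
	println!("alloc_zeroed calls: {} (len {})", ZEROED.load(Ordering::Relaxed), v.len());
}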

@misos1
Author

misos1 commented Feb 10, 2020

Yes, I just found it now; I did not check before I wrote that comment. I was wondering why code with vec![0; n] was using calloc and thought it was just some LLVM optimization. But it seems that to take advantage of these "fast" zeroed allocations one still has to use jemalloc, which is not the default anymore.
