# Fix inefficient label mapping in direct color mode (10-20x speedup) #5723
## Conversation
Our current target for solving this problem is to avoid global remapping of labels on the CPU entirely, in #3308, but I'm not sure when the linked PR will be finished. We will try to investigate the status of #3308, and if the prognosis for finishing it is not optimistic, I will be happy to at least partially solve this performance problem by merging this PR. If you have time, you could also take a look at the linked PR and maybe help finish it.
One idea (the changes would be bigger): we do not need to know the real set of labels. We only need a guarantee that all labels present in the layers are contained in this set (so even if we erase some label, its presence in the set is not a problem). The call to `unique` could therefore be delayed and done, for example, 1s after finishing the drawing (while updating the set with every value used during the drawing). It would be more complex and require a deeper dive into the napari code, but it would also be useful for solving the problems mentioned in #3308 and speed up the code even more (by avoiding the min and max computation on every refresh).
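A rough sketch of how such a deferred `unique` could look (the `LabelSetTracker` class, its names, and the debounce interval are hypothetical illustrations, not napari's actual API):

```python
import time
import numpy as np

class LabelSetTracker:
    """Maintain a superset of the labels present in a layer.

    During drawing we only *add* painted values to the set, which is cheap;
    a full np.unique() pass runs lazily, once the drawing has settled for
    at least `debounce` seconds, to shrink the set back to the true labels.
    """

    def __init__(self, data, debounce=1.0):
        self._data = data
        self._debounce = debounce
        self._labels = set(np.unique(data).tolist())
        self._last_edit = None

    def on_paint(self, value):
        # Cheap: keep the superset valid without scanning the whole array.
        self._labels.add(int(value))
        self._last_edit = time.monotonic()

    def labels(self):
        # Expensive recompute only after the drawing has settled.
        if (self._last_edit is not None
                and time.monotonic() - self._last_edit >= self._debounce):
            self._labels = set(np.unique(self._data).tolist())
            self._last_edit = None
        return self._labels
```

The key property is the one described above: the set may temporarily contain erased labels, which is harmless, so the expensive global scan can be postponed out of the interactive painting path.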
Codecov Report

```diff
@@            Coverage Diff             @@
##             main    #5723      +/-   ##
==========================================
- Coverage   89.85%   89.85%   -0.01%
==========================================
  Files         608      608
  Lines       51756    51870    +114
==========================================
+ Hits        46504    46606    +102
- Misses       5252     5264     +12
==========================================
```

... and 20 files with indirect coverage changes
@jni I'll test that PR later. Anyway, I propose to merge this PR if you don't have any objections regarding the code. It is a non-breaking change that doesn't alter any behaviour - a pure optimization. I did it because without it napari is unusable on large images with custom colors; it is not optimization for the sake of optimization. Just try it yourself: open a large image (I use 4096x3000) and try to use the brush. You should consider it a bugfix or a necessary optimization. I can't see any conflicts with the mentioned PR, because if it resolves the issue, then when it's ready this code will be unnecessary and will be replaced.
To be clear, this is to better handle cases where labels are sparsely distributed, like
I agree with @ksofiyuk: that PR has been sitting for a long time, who knows for how much longer, and this one brings very little disruption. Tests are passing already, which is a good sign. I'm not super familiar with the logic here though, so I'm not confident that it won't break some corner cases. Hopefully @jni can help more here.
I'm also fine with this PR going in whilst waiting for #3308 - I know that's getting closer, but this will be easy to remove once #3308 is ready, and the performance gains here are high value.
I defined the upper bound on the number of unique labels explicitly, which I think makes the code a little easier to parse at first glance.
Thanks again @ksofiyuk - you're on fire! 🔥
@alisterburt thanks for improving the readability.
On this benchmark with sparse labels from -10000 to 10000, the proposed method shows an even more significant speedup (up to 40x). The timing of the old algorithm on the new benchmark (providing only the two biggest sizes for clarity):
The new algorithm:
That is a 42x speedup! I tested it on slightly different hardware, so the numbers don't exactly match the results in the first message.
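For intuition about why the speedup is so large, here is a small self-contained comparison of `np.unique` against a plain min/max scan (this is not the napari asv suite; the array size and timings are illustrative and will vary by hardware):

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
# Sparse labels in [-10000, 10000] on a large 2D label map.
labels = rng.integers(-10000, 10001, size=(1024, 1024), dtype=np.int64)

# np.unique sorts (or hashes) the whole array: O(n log n) plus allocations.
t_unique = timeit.timeit(lambda: np.unique(labels), number=3) / 3

# min + max are two single linear passes with no allocations.
t_minmax = timeit.timeit(lambda: (labels.min(), labels.max()), number=3) / 3

print(f"np.unique: {t_unique * 1e3:.2f} ms")
print(f"min + max: {t_minmax * 1e3:.2f} ms")
```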
@brisvag, no, on the contrary, it fixes the cases where labels are not super sparsely distributed (I'd say it covers 99.999% of real-world cases). With the last update, you will gain a significant speedup from this PR if your labels can be localized in any contiguous interval whose length is less than 65536 elements. Here are some examples, if your labels are:
Super cool! Yeah, sorry, I didn't mean to imply we shouldn't merge this! I actually think the other one is closer than you're implying @brisvag, but I totally get the point that this is already passing and a huge speedup, so I'm all for merging! The other factor is that I think, using direct mode, sooner rather than later @ksofiyuk will run into some of the color mismatch issues fixed by #3308. But we can cross that bridge when we get there!
Sorry, with "this" I meant "keeping the old approach for some cases" :P So I'm on board 👍
Nice, looking forward to it :)
Just pushed a commit trying to make sure all the new lines are covered. This should be good to merge if that passes, deepest apologies if it doesn't 🤣 (it passed locally) |
@ksofiyuk that's the point — the old tests didn't have a branch, but they were "shuttled" to the new code — if you look at the previous codecov comment, it was the old code that was no longer covered by tests. Hence the new test. Anyway, the new npe2 release broke our typing action. CC @tlambert03
@jni I realized it after I posted it, so I deleted my previous comment, but you managed to read it and reply before that 😄 But thanks for the explanation.
One of the core problems here is that we transfer the whole layer array to the GPU after each paint event. I do not have time to investigate how hard it would be to transfer only the updated part of the image, but this, plus maybe caching some properties, should lead to better performance.
In theory, it shouldn't be very difficult to send only patches of an image to the GPU for updating. But I'm also not sure how difficult that would be to implement in napari (I haven't looked into the code of PR #3308). It does not look very difficult to implement local updates for the CPU version (only around the recent brush movement, plus limiting the update to the visible part of the image). That should make it possible to handle images of any resolution without performance degradation for 99.9% of typical use cases. And it would beat the GPU version on really high-resolution images if the same thing weren't implemented there.
This would be cool; here is a previous implementation of partial updates in the volume visual which may provide some clues.
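As a rough numpy-only illustration of the partial-update idea discussed above (the `dirty_bounding_box` helper is hypothetical; the actual GPU upload call, e.g. a vispy texture update, is out of scope here):

```python
import numpy as np

def dirty_bounding_box(old, new):
    """Return the (row, col) slices covering all changed pixels, or None.

    Only this patch would need to be re-colored and re-uploaded to the
    GPU texture, instead of the whole label map.
    """
    changed = old != new
    if not changed.any():
        return None
    rows = np.flatnonzero(changed.any(axis=1))
    cols = np.flatnonzero(changed.any(axis=0))
    return (slice(rows[0], rows[-1] + 1),
            slice(cols[0], cols[-1] + 1))

# Example: a small brush stroke on a big label map.
old = np.zeros((4096, 4096), dtype=np.int32)
new = old.copy()
new[100:120, 200:260] = 5          # simulated brush stroke
box = dirty_bounding_box(old, new)
patch = new[box]                   # a 20x60 patch instead of 4096x4096
```

In practice one would track the stroke coordinates directly rather than diffing two full arrays, but the upload-only-the-patch principle is the same.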
Currently we also send data to the GPU for rendering. I tried to draw a graph. For #3308 it will look like this (I grayed out the current steps): Casting to float32 is much faster than the current step, but still linear in data size. And in both scenarios there is a transfer-to-GPU step. (On both graphs I highlight only the most important steps; it is not full documentation.)
@Czaki Thank you for this graph. Based on it, I'd say the only correct way to make this very efficient is to implement sending local updates to the GPU instead of the whole image and label map. If that is not done, there is no point in further CPU optimization, as transferring data from CPU to GPU will always be the bottleneck.
This is marked as ready for merge and tests are passing, but something is up with the typing test - I'm not familiar with this, and the error appears to be in files unaffected by this PR. I will merge in a few hours unless anyone objects.
Just seen that #5732 includes this.
@alisterburt Typing fix is here #5727 |
@jni Labels and milestone |
@Czaki should we make a 0.4.18 milestone for cherry-picking? |
+1 on the 0.4.18 milestone |
@jni yes |
…5723)

# Description

If you specify a custom color set in the Label layer, the brush becomes extremely laggy in high-resolution images. After some examination, I discovered that the issue is caused by the inefficient label mapping in the direct color mode:

https://github.com/napari/napari/blob/a79537cac7c2a595c4fca41b85f2c7b4a44d0a61/napari/layers/labels/labels.py#L927-L935

It calls `np.unique` on the whole label map on every brush update! This is a very slow operation for large arrays. As you can see in the benchmark results below, it takes 1.5s for a single `_raw_to_displayed` call on 4096x4096 images.

Luckily, in most use cases `np.unique` is total overkill here, and it is possible to achieve the same functionality just by finding the minimum and maximum values of the array. I implemented a new algorithm, which is 10-20x faster on large images. To be on the safe side, I kept the old approach for the cases when `max_label_id - min_label_id >= 1024`; it handles the situations when someone decides to load labels with very large ids (e.g. label_id=123456). I also added a new benchmark `Labels2DColorDirectSuite` that tests the Label layer in the direct color mode.

## Benchmark results

The results for the Label layer in the `AUTO` mode (for comparison):

```
[100.00%] ··· benchmark_labels_layer.Labels2DSuite.time_raw_to_displayed  ok
[100.00%] ··· ======== =============
               param1
              -------- -------------
                  16    3.84±0.02μs
                  32    5.89±0.03μs
                  64    14.4±0.04μs
                 128    47.6±0.09μs
                 256      183±1μs
                 512    1.02±0.01ms
                1024    4.05±0.02ms
                2048     15.8±0.2ms
                4096      63.7±2ms
              ======== =============
```

The results for the Label layer in the `DIRECT` mode (old algorithm):

```
[100.00%] ··· benchmark_labels_layer.Labels2DColorDirectSuite.time_raw_to_displayed  ok
[100.00%] ··· ======== ============
               param1
              -------- ------------
                  16    36.5±0.2μs
                  32     47.2±1μs
                  64      150±2μs
                 128      534±2μs
                 256    2.31±0.2ms
                 512    11.8±0.1ms
                1024    61.8±0.7ms
                2048      312±2ms
                4096   1.50±0.02s
              ======== ============
```

The results for the Label layer in the `DIRECT` mode (new algorithm):

```
[100.00%] ··· benchmark_labels_layer.Labels2DColorDirectSuite.time_raw_to_displayed  ok
[100.00%] ··· ======== =============
               param1
              -------- -------------
                  16     17.1±1μs
                  32    20.0±0.04μs
                  64     35.1±0.3μs
                 128     93.8±0.2μs
                 256      330±2μs
                 512    1.63±0.01ms
                1024    6.40±0.02ms
                2048     25.1±0.1ms
                4096     101±0.5ms
              ======== =============
```

As you can see, it became 15x faster on 4096x4096 images, achieving almost the same performance as in the `AUTO` mode.

## Further optimization

This is just a quick fix to mitigate the issue. However, I think it is very inefficient to compute the label mapping from scratch on every brush update. At least the proposed approach can be further optimized in the following way:

1. Recompute `min_label_id` and `max_label_id` from scratch only when someone sets the data in a layer.
2. When someone changes the color id of the brush, update `min_label_id` and `max_label_id` with the new color id accordingly.

This should be enough to always maintain valid `min/max` label ids without recomputing them on every update.

If we want to be ultimately efficient, the color map should be computed only for the part of the image that is visible in the current camera view. Or we could recompute the color map only in the local region around a recent brush trajectory, as during drawing in large images 99% of the image area is usually not affected and should not be updated. This would allow handling extremely large images (e.g. 10000x10000).

Co-authored-by: alisterburt <alisterburt@gmail.com>
Co-authored-by: Juan Nunez-Iglesias <jni@fastmail.com>
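The core trick described in the PR can be sketched like this (a simplified illustration, not napari's actual implementation; the function name, the color-dict interface, and the fallback threshold are assumptions):

```python
import numpy as np

def map_labels_direct(labels, color_dict, max_range=1024):
    """Map integer labels to RGBA colors without np.unique.

    If the label values span a small contiguous range, build a dense
    (max - min + 1)-entry lookup table and index into it in one
    vectorized step; otherwise fall back to a slow per-unique-value
    path for pathological cases (e.g. a stray label_id=123456).
    """
    lo, hi = int(labels.min()), int(labels.max())
    if hi - lo >= max_range:
        # Fallback: unique-based path (slow but handles any spread).
        mapped = np.zeros(labels.shape + (4,), dtype=np.float32)
        for v in np.unique(labels):
            mapped[labels == v] = color_dict.get(int(v), (0, 0, 0, 0))
        return mapped
    # Fast path: dense lookup table over [lo, hi], then fancy indexing.
    table = np.zeros((hi - lo + 1, 4), dtype=np.float32)
    for v, rgba in color_dict.items():
        if lo <= v <= hi:
            table[v - lo] = rgba
    return table[labels - lo]
```

Finding `min`/`max` is two linear passes, and the table indexing is a single vectorized gather, which is why this beats a full `np.unique` sort on large arrays whenever the labels occupy a compact range.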
This pull request has been mentioned on the Image.sc Forum. There might be relevant details there: https://forum.image.sc/t/announcement-napari-0-4-18-released/83322/1