Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix inefficient label mapping in direct color mode (10-20x speedup) #5723

Merged
merged 7 commits into from
Apr 16, 2023

Conversation

ksofiyuk
Copy link
Contributor

@ksofiyuk ksofiyuk commented Apr 13, 2023

Description

If you specify a custom color set in the Label layer, the brush becomes extremely laggy in high-resolution images. After some examination, I discovered that issue is caused by the inefficient label mapping in the direct color mode:

u, inv = np.unique(raw_modified, return_inverse=True)
image = np.array(
[
self._label_color_index[x]
if x in self._label_color_index
else self._label_color_index[None]
for x in u
]
)[inv].reshape(raw_modified.shape)

It calls np.unique of the whole label map every brush update! This is a very slow operation for large arrays. As you can see in the benchmark results below, it takes 1.5s for a single _raw_to_displayed call on 4096x4096 images.

Luckily, in most use cases the use of np.unique is a total overkill here and it is possible to achieve the same functionality just by finding the minimum and maximum values of an array. I implemented a new algorithm, which is 10-20x faster on large images. To be on the safe side, I kept the old approach for the cases when max_label_id - min_label_id >= 1024, it will handle the situations when someone decides to load labels with very large ids (e.g. label_id=123456).

I also added a new benchmark Labels2DColorDirectSuite that tests the Label layer in the direct color mode.

Benchmark results:

The results for the Label layer in the AUTO mode (for comparison):

[100.00%] ··· benchmark_labels_layer.Labels2DSuite.time_raw_to_displayed                                                                                                                          ok
[100.00%] ··· ======== =============
               param1
              -------- -------------
                 16     3.84±0.02μs
                 32     5.89±0.03μs
                 64     14.4±0.04μs
                128     47.6±0.09μs
                256       183±1μs
                512     1.02±0.01ms
                1024    4.05±0.02ms
                2048     15.8±0.2ms
                4096      63.7±2ms
              ======== =============

The results for the Label layer in the DIRECT mode (old algorithm):

[100.00%] ··· benchmark_labels_layer.Labels2DColorDirectSuite.time_raw_to_displayed                                                                                                               ok
[100.00%] ··· ======== ============
               param1
              -------- ------------
                 16     36.5±0.2μs
                 32      47.2±1μs
                 64      150±2μs
                128      534±2μs
                256     2.31±0.2ms
                512     11.8±0.1ms
                1024    61.8±0.7ms
                2048     312±2ms
                4096    1.50±0.02s
              ======== ============

The results for the Label layer in the DIRECT mode (new algorithm):

[100.00%] ··· benchmark_labels_layer.Labels2DColorDirectSuite.time_raw_to_displayed                                                                                                               ok
[100.00%] ··· ======== =============
               param1
              -------- -------------
                 16       17.1±1μs
                 32     20.0±0.04μs
                 64      35.1±0.3μs
                128      93.8±0.2μs
                256       330±2μs
                512     1.63±0.01ms
                1024    6.40±0.02ms
                2048     25.1±0.1ms
                4096     101±0.5ms
              ======== =============

As you can see, it became 15x faster on 4096x4096 images, achieving almost the same performance as in the AUTO mode.

Further optimization

It is just a quick fix to mitigate the issue. However, I think it is very inefficient to compute the label mapping from scratch on every brush update. At least the proposed approach can be further optimized in the following way:

  1. Recompute min_label_id and max_label_id from scratch only when someone tries to set the data in a layer.
  2. When someone tries to change the color id of brush, update the min_label_id and max_label_id with a new color id accordingly.

This should be enough to always maintain the valid min/max of label ids without the need of recomputing them on every update.

If we want to be ultimately efficient, the color map should be computed only for the part of an image that is visible in the current camera view. Or only recompute the color map near the local region around a recent brush trajectory as during drawing in large images 99% of image area usually is not affected and should not be updated. It should allow to handle extremely large images (e.g. 10000x10000).

Type of change

  • Bug-fix (non-breaking change which fixes an issue)

How has this been tested?

  • example: the test suite for my feature covers cases x, y, and z
  • example: all tests pass with my change
  • example: I check if my changes works with both PySide and PyQt backends
    as there are small differences between the two Qt bindings.

Final checklist:

  • My PR is the minimum possible work for the desired functionality
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • If I included new strings, I have used trans. to make them localizable.
    For more information see our translations guide.

@Czaki
Copy link
Collaborator

Czaki commented Apr 13, 2023

Our current target for solving this problem is to avoid global remapping of labels on CPU in #3308 but I'm not sure when linked PR will be finished.

We will try to investigate of the status of #3308 and if the prognosis of the finish will be not optimistic I will be happy to at least partially solve this performance problem by merging this PR.

If you have time You could also take a look at the linked PR and maybe help to finish it.

@Czaki
Copy link
Collaborator

Czaki commented Apr 13, 2023

one idea (the changes will be bigger)

We do not need to know the real set of labels. We only need to have a guarantee that all labels from layers are present in this set (so even if we erase some label, then its present is not a problem).

So the call of unique could be delayed and done for example 1s after finishing the drawing (with updating the set of every variable used in the drawing).

It will be more complex and take a long time with deeper dive in napari code, but it also will be useful for solve problems mentioned in #3308 and speedup code even more (as avoid calculate min and max on every refresh).

@codecov
Copy link

codecov bot commented Apr 13, 2023

Codecov Report

Merging #5723 (3992ce4) into main (a79537c) will decrease coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #5723      +/-   ##
==========================================
- Coverage   89.85%   89.85%   -0.01%     
==========================================
  Files         608      608              
  Lines       51756    51870     +114     
==========================================
+ Hits        46504    46606     +102     
- Misses       5252     5264      +12     
Impacted Files Coverage Δ
napari/layers/labels/_tests/test_labels.py 99.49% <100.00%> (+<0.01%) ⬆️
napari/layers/labels/labels.py 95.43% <100.00%> (+0.06%) ⬆️

... and 20 files with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@jni
Copy link
Member

jni commented Apr 14, 2023

@ksofiyuk How does your benchmark suite fare on #3308? I think that PR makes this one unnecessary, and it's very important (as you'll see in the discussion there, it fixes a heap of issues with direct color mode), so I think we should focus on that one...

@ksofiyuk
Copy link
Contributor Author

ksofiyuk commented Apr 14, 2023

@jni I'll test that PR later. Anyway, I propose to merge this PR if you don't have any objections regarding the code. It is a non-breaking change that doesn't change any behaviour - a pure optimization.

I did it because without it napari is unusable on large images with custom colors, it is not a matter about optimization for the sake of optimization. Just try it yourself, open a large image (I use 4096x3000) and try to use the brush. You should consider it as a bugfix or a necessary optimization.

I can't see any conflicts with the mentioned PR because if it resolves the issue, then when it's ready this code will be unnecessary and will be replaced.

@brisvag
Copy link
Contributor

brisvag commented Apr 14, 2023

To be on the safe side, I kept the old approach for the cases when max_label_id - min_label_id >= 1024, it will handle the situations when someone decides to load labels with very large ids (e.g. label_id=123456).

To be clear, this is to handle better cases where labels are sparsely distributed, like [0, 1, 12341234, 893475892374]?

I can't see any conflicts with the mentioned PR because if it resolves the issue, then when it's ready this code will be unnecessary and will be replaced.

I agree with @ksofiyuk, that PR has been sitting for a long time, who knows how much longer, and this one brings very little disruption. Tests are passing already, which is a good sign. I'm not super familiar with the logic here though, so I'm not confident that it won't break some corner cases. Hopefully @jni can help more here.

Copy link
Contributor

@alisterburt alisterburt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also fine with this PR going in whilst waiting for #3308 - I know that's getting closer but this will be easy to remove once #3308 is ready and the performance gains here are high value

I defined the upper bound on the number of unique labels explicitly which I think makes the code a little easier to parse at first glance

napari/layers/labels/labels.py Show resolved Hide resolved
napari/layers/labels/labels.py Outdated Show resolved Hide resolved
@alisterburt
Copy link
Contributor

thanks again @ksofiyuk - you're on fire! 🔥

napari/layers/labels/labels.py Outdated Show resolved Hide resolved
napari/layers/labels/labels.py Outdated Show resolved Hide resolved
@ksofiyuk ksofiyuk marked this pull request as draft April 14, 2023 10:54
@ksofiyuk ksofiyuk marked this pull request as ready for review April 14, 2023 11:00
@ksofiyuk
Copy link
Contributor Author

ksofiyuk commented Apr 14, 2023

@alisterburt thanks for the improving of readability.

  1. I updated the threshold for upper_bound_n_unique_labels to 65536, as it can handle even that range of labels without almost any performance drop. Actually, we can safely increase it even more, but I think 65536 should cover 99.999% of possible cases.
  2. I also updated the benchmark to incorporate a larger range of generated labels (from -10000 to 10000).

In this benchmark with sparse labels from -10000 to 10000, the proposed method shows even much more significant speedup (up to 40x).

The timing of the old algorithm on a new benchmark (provide only the two biggest sizes for clarity):

                2048      521±10ms  
                4096     2.63±0.02s 

The new algorithm:

                2048     17.3±0.2ms 
                4096     62.3±0.8ms 

It is 42x speedup!

I tested it on a bit different hardware, so the numbers don't exactly match the results in the first message.

@ksofiyuk
Copy link
Contributor Author

ksofiyuk commented Apr 14, 2023

To be clear, this is to handle better cases where labels are sparsely distributed, like [0, 1, 12341234, 893475892374]?

@brisvag, no, on the contrary, it fixes the cases when labels are not super sparsely distributed (I can say it covers 99.999% of real world cases). With the last update, you will gain significant speedup from this PR, if your labels can be localized in any continuous interval which length is less than 65536 elements. Here are some examples, if your labels are:

[0, 1, 2, 3]   # you will get significant boost here
[-1000, 0, 1, 1000]  # you will also get significant boost here as it falls inside the interval of length 65536
[100000, 100100, 100200, 100300]  # again it falls inside the interval, the new algorithm will be used
[-1000000, 1000000]  # they cannot be localized in a continuous of length 65536, the old method with `np.unique` will be used.

@jni
Copy link
Member

jni commented Apr 14, 2023

Super cool! Yeah, sorry, I didn't mean to imply we shouldn't merge this! I actually think the other one is closer than you're implying @brisvag but I totally get the point that this is already passing and a huge speedup so I'm all for merging! The other factor is that I think using direct mode, sooner rather than later @ksofiyuk will run into some of the color mismatch issues fixed by #3308. But we can cross that bridge when we get there!

@brisvag
Copy link
Contributor

brisvag commented Apr 14, 2023

Sorry, with "this" I meant "keeping the old approach for some cases" :P So I'm on board 👍

@brisvag
Copy link
Contributor

brisvag commented Apr 14, 2023

I actually think the other one is closer than you're implying @brisvag

Nice, looking forward to it :)

@github-actions github-actions bot added the tests Something related to our tests label Apr 14, 2023
@jni
Copy link
Member

jni commented Apr 14, 2023

Just pushed a commit trying to make sure all the new lines are covered. This should be good to merge if that passes, deepest apologies if it doesn't 🤣 (it passed locally)

@jni
Copy link
Member

jni commented Apr 14, 2023

@ksofiyuk that's the point — the old tests didn't have a branch, but they were "shuttled" to the new code — if you look at the previous codecov comment, it was the old code that was no longer covered by tests. Hence the new test.

Anyway, new npe2 release broke our typing action. CC @tlambert03

@jni jni added the ready to merge Last chance for comments! Will be merged in ~24h label Apr 14, 2023
@ksofiyuk
Copy link
Contributor Author

ksofiyuk commented Apr 14, 2023

that's the point — the old tests didn't have a branch, but they were "shuttled" to the new code — if you look at the previous codecov comment, it was the old code that was no longer covered by tests. Hence the new test.

@jni I realized it after I posted it, so I deleted my previous comment, but you managed to read it and reply before it 😄 But thanks for the explanation.

@Czaki
Copy link
Collaborator

Czaki commented Apr 14, 2023

I believe that the label rendering can be significantly improved by some clever algorithmic optimizations which would allow to have a near constant time complexity of rendering on a brush update without the dependency on the resolution of an image.

One of the core problems here is that we transfer the whole layer array to GPU after paint.
So bigger array means more linear operations.

I do not have time to investigate how it will be hard to transfer only the updated part of the image.

But this plus maybe caching some properties should lead to better performance

@ksofiyuk
Copy link
Contributor Author

ksofiyuk commented Apr 14, 2023

One of the core problems here is that we transfer the whole layer array to GPU after paint.
So bigger array means more linear operations.

In theory, it shouldn't be very difficult to send to GPU for updating only patches of an image. But I'm also not sure how difficult it would be to implement in napari (I haven't looked into the code of PR #3308).

It does not look very difficult to implement local updates (only around the recent brush movement + limit the updating only to the part of the image that can be seen) for the CPU version. And it should allow to handle images of any resolution without performance degradation for 99,9% of typical use cases. And it would beat the GPU version on really high resolution images if the same thing wasn't implemented for it.

@alisterburt
Copy link
Contributor

It does not look very difficult to implement local updates (only around the recent brush movement + limit the updating only to the part of the image that can be seen) for the CPU version. And it should allow to handle images of any resolution without performance degradation for 99,9% of typical use cases. And it would beat the GPU version on really high resolution images if the same thing wasn't implemented for it.

This would be cool, here is a previous implementation of partial updates in the volume visual which may provide some clues ☺️
#1979

@Czaki
Copy link
Collaborator

Czaki commented Apr 14, 2023

Currently we also send data to GPU for render. I try to create graph:

obraz

For #3308 it will looks like (I grayed out current steeps):

obraz

Casting to float32 is much faster than the current steep but still linear in data size. And in both scenarios, there is a transfer to GPU steep.

(On both graphs I highlight only most important steeps, it is not a full documentation)

@ksofiyuk
Copy link
Contributor Author

@Czaki Thank you for this graph. Based on this graph, I'd say the only correct way to make it very efficient is to implement sending local updates to GPU instead of a whole image and label map.

If the above is not done, there is no reason in further CPU optimization as the transferring data from CPU to GPU will always be the bottleneck.

@alisterburt
Copy link
Contributor

This is marked as ready for merge and tests are passing but something is up with the typing test - I'm not familiar with this and the error appears to be in files unaffected by this PR. I will merge in a few hours unless anyone objects

@alisterburt
Copy link
Contributor

just seen that #5732 includes this

@Czaki
Copy link
Collaborator

Czaki commented Apr 15, 2023

@alisterburt Typing fix is here #5727

@Czaki
Copy link
Collaborator

Czaki commented Apr 15, 2023

@jni Labels and milestone

@jni jni added performance Relates to performance highlight PR that should be mentioned in next release notes labels Apr 16, 2023
@jni jni added this to the 0.4 milestone Apr 16, 2023
@jni
Copy link
Member

jni commented Apr 16, 2023

@Czaki should we make a 0.4.18 milestone for cherry-picking?

@jni jni changed the title Fixing the very inefficient label mapping in the direct color mode (10-20x speedup) Fix inefficient label mapping in direct color mode (10-20x speedup) Apr 16, 2023
@jni jni merged commit 11c7d48 into napari:main Apr 16, 2023
@jni jni modified the milestones: 0.4, 0.4.18 Apr 16, 2023
@alisterburt
Copy link
Contributor

+1 on the 0.4.18 milestone

@Czaki
Copy link
Collaborator

Czaki commented Apr 16, 2023

@jni yes

@Czaki Czaki removed the ready to merge Last chance for comments! Will be merged in ~24h label Apr 18, 2023
@Czaki Czaki mentioned this pull request Jun 7, 2023
@jni jni mentioned this pull request Jun 19, 2023
7 tasks
Czaki pushed a commit that referenced this pull request Jun 19, 2023
…5723)

# Description

If you specify a custom color set in the Label layer, the brush becomes
extremely laggy in high-resolution images. After some examination, I
discovered that issue is caused by the inefficient label mapping in the
direct color mode:

https://github.com/napari/napari/blob/a79537cac7c2a595c4fca41b85f2c7b4a44d0a61/napari/layers/labels/labels.py#L927-L935

It calls `np.unique` of the whole label map every brush update! This is
a very slow operation for large arrays. As you can see in the benchmark
results below, it takes 1.5s for a single `_raw_to_displayed` call on
4096x4096 images.

Luckily, in most use cases the use of `np.unique` is a total overkill
here and it is possible to achieve the same functionality just by
finding the minimum and maximum values of an array. I implemented a new
algorithm, which is 10-20x faster on large images. To be on the safe
side, I kept the old approach for the cases when `max_label_id -
min_label_id >= 1024`, it will handle the situations when someone
decides to load labels with very large ids (e.g. label_id=123456).

I also added a new benchmark `Labels2DColorDirectSuite` that tests the
Label layer in the direct color mode.

## Benchmark results:
The results for the Label layer in the `AUTO` mode (for comparison):
```
[100.00%] ··· benchmark_labels_layer.Labels2DSuite.time_raw_to_displayed                                                                                                                          ok
[100.00%] ··· ======== =============
               param1
              -------- -------------
                 16     3.84±0.02μs
                 32     5.89±0.03μs
                 64     14.4±0.04μs
                128     47.6±0.09μs
                256       183±1μs
                512     1.02±0.01ms
                1024    4.05±0.02ms
                2048     15.8±0.2ms
                4096      63.7±2ms
              ======== =============
```

The results for the Label layer in the `DIRECT` mode (old algorithm):
```
[100.00%] ··· benchmark_labels_layer.Labels2DColorDirectSuite.time_raw_to_displayed                                                                                                               ok
[100.00%] ··· ======== ============
               param1
              -------- ------------
                 16     36.5±0.2μs
                 32      47.2±1μs
                 64      150±2μs
                128      534±2μs
                256     2.31±0.2ms
                512     11.8±0.1ms
                1024    61.8±0.7ms
                2048     312±2ms
                4096    1.50±0.02s
              ======== ============
```

The results for the Label layer in the `DIRECT` mode (new algorithm):
```
[100.00%] ··· benchmark_labels_layer.Labels2DColorDirectSuite.time_raw_to_displayed                                                                                                               ok
[100.00%] ··· ======== =============
               param1
              -------- -------------
                 16       17.1±1μs
                 32     20.0±0.04μs
                 64      35.1±0.3μs
                128      93.8±0.2μs
                256       330±2μs
                512     1.63±0.01ms
                1024    6.40±0.02ms
                2048     25.1±0.1ms
                4096     101±0.5ms
              ======== =============
```

As you can see, it became 15x faster on 4096x4096 images, achieving
almost the same performance as in the `AUTO` mode.

## Further optimization

It is just a quick fix to mitigate the issue. However, I think it is
very inefficient to compute the label mapping from scratch on every
brush update. At least the proposed approach can be further optimized in
the following way:

1. Recompute `min_label_id` and `max_label_id` from scratch only when
someone tries to set the data in a layer.
2. When someone tries to change the color id of brush, update the
`min_label_id` and `max_label_id` with a new color id accordingly.

This should be enough to always maintain the valid `min/max` of label
ids without the need of recomputing them on every update.

If we want to be ultimately efficient, the color map should be computed
only for the part of an image that is visible in the current camera
view. Or only recompute the color map near the local region around a
recent brush trajectory as during drawing in large images 99% of image
area usually is not affected and should not be updated. It should allow
to handle extremely large images (e.g. 10000x10000).

Co-authored-by: alisterburt <alisterburt@gmail.com>
Co-authored-by: Juan Nunez-Iglesias <jni@fastmail.com>
Czaki pushed a commit that referenced this pull request Jun 21, 2023
…5723)

# Description

If you specify a custom color set in the Label layer, the brush becomes
extremely laggy in high-resolution images. After some examination, I
discovered that issue is caused by the inefficient label mapping in the
direct color mode:

https://github.com/napari/napari/blob/a79537cac7c2a595c4fca41b85f2c7b4a44d0a61/napari/layers/labels/labels.py#L927-L935

It calls `np.unique` of the whole label map every brush update! This is
a very slow operation for large arrays. As you can see in the benchmark
results below, it takes 1.5s for a single `_raw_to_displayed` call on
4096x4096 images.

Luckily, in most use cases the use of `np.unique` is a total overkill
here and it is possible to achieve the same functionality just by
finding the minimum and maximum values of an array. I implemented a new
algorithm, which is 10-20x faster on large images. To be on the safe
side, I kept the old approach for the cases when `max_label_id -
min_label_id >= 1024`, it will handle the situations when someone
decides to load labels with very large ids (e.g. label_id=123456).

I also added a new benchmark `Labels2DColorDirectSuite` that tests the
Label layer in the direct color mode.

## Benchmark results:
The results for the Label layer in the `AUTO` mode (for comparison):
```
[100.00%] ··· benchmark_labels_layer.Labels2DSuite.time_raw_to_displayed                                                                                                                          ok
[100.00%] ··· ======== =============
               param1
              -------- -------------
                 16     3.84±0.02μs
                 32     5.89±0.03μs
                 64     14.4±0.04μs
                128     47.6±0.09μs
                256       183±1μs
                512     1.02±0.01ms
                1024    4.05±0.02ms
                2048     15.8±0.2ms
                4096      63.7±2ms
              ======== =============
```

The results for the Label layer in the `DIRECT` mode (old algorithm):
```
[100.00%] ··· benchmark_labels_layer.Labels2DColorDirectSuite.time_raw_to_displayed                                                                                                               ok
[100.00%] ··· ======== ============
               param1
              -------- ------------
                 16     36.5±0.2μs
                 32      47.2±1μs
                 64      150±2μs
                128      534±2μs
                256     2.31±0.2ms
                512     11.8±0.1ms
                1024    61.8±0.7ms
                2048     312±2ms
                4096    1.50±0.02s
              ======== ============
```

The results for the Label layer in the `DIRECT` mode (new algorithm):
```
[100.00%] ··· benchmark_labels_layer.Labels2DColorDirectSuite.time_raw_to_displayed                                                                                                               ok
[100.00%] ··· ======== =============
               param1
              -------- -------------
                 16       17.1±1μs
                 32     20.0±0.04μs
                 64      35.1±0.3μs
                128      93.8±0.2μs
                256       330±2μs
                512     1.63±0.01ms
                1024    6.40±0.02ms
                2048     25.1±0.1ms
                4096     101±0.5ms
              ======== =============
```

As you can see, it became 15x faster on 4096x4096 images, achieving
almost the same performance as in the `AUTO` mode.

## Further optimization

It is just a quick fix to mitigate the issue. However, I think it is
very inefficient to compute the label mapping from scratch on every
brush update. At least the proposed approach can be further optimized in
the following way:

1. Recompute `min_label_id` and `max_label_id` from scratch only when
someone tries to set the data in a layer.
2. When someone tries to change the color id of brush, update the
`min_label_id` and `max_label_id` with a new color id accordingly.

This should be enough to always maintain the valid `min/max` of label
ids without the need of recomputing them on every update.

If we want to be ultimately efficient, the color map should be computed
only for the part of an image that is visible in the current camera
view. Or only recompute the color map near the local region around a
recent brush trajectory as during drawing in large images 99% of image
area usually is not affected and should not be updated. It should allow
to handle extremely large images (e.g. 10000x10000).

Co-authored-by: alisterburt <alisterburt@gmail.com>
Co-authored-by: Juan Nunez-Iglesias <jni@fastmail.com>
Czaki pushed a commit that referenced this pull request Jun 21, 2023
…5723)

# Description

If you specify a custom color set in the Label layer, the brush becomes
extremely laggy in high-resolution images. After some examination, I
discovered that issue is caused by the inefficient label mapping in the
direct color mode:

https://github.com/napari/napari/blob/a79537cac7c2a595c4fca41b85f2c7b4a44d0a61/napari/layers/labels/labels.py#L927-L935

It calls `np.unique` of the whole label map every brush update! This is
a very slow operation for large arrays. As you can see in the benchmark
results below, it takes 1.5s for a single `_raw_to_displayed` call on
4096x4096 images.

Luckily, in most use cases the use of `np.unique` is a total overkill
here and it is possible to achieve the same functionality just by
finding the minimum and maximum values of an array. I implemented a new
algorithm, which is 10-20x faster on large images. To be on the safe
side, I kept the old approach for the cases when `max_label_id -
min_label_id >= 1024`, it will handle the situations when someone
decides to load labels with very large ids (e.g. label_id=123456).

I also added a new benchmark `Labels2DColorDirectSuite` that tests the
Label layer in the direct color mode.

## Benchmark results:
The results for the Label layer in the `AUTO` mode (for comparison):
```
[100.00%] ··· benchmark_labels_layer.Labels2DSuite.time_raw_to_displayed                                                                                                                          ok
[100.00%] ··· ======== =============
               param1
              -------- -------------
                 16     3.84±0.02μs
                 32     5.89±0.03μs
                 64     14.4±0.04μs
                128     47.6±0.09μs
                256       183±1μs
                512     1.02±0.01ms
                1024    4.05±0.02ms
                2048     15.8±0.2ms
                4096      63.7±2ms
              ======== =============
```

The results for the Label layer in the `DIRECT` mode (old algorithm):
```
[100.00%] ··· benchmark_labels_layer.Labels2DColorDirectSuite.time_raw_to_displayed                                                                                                               ok
[100.00%] ··· ======== ============
               param1
              -------- ------------
                 16     36.5±0.2μs
                 32      47.2±1μs
                 64      150±2μs
                128      534±2μs
                256     2.31±0.2ms
                512     11.8±0.1ms
                1024    61.8±0.7ms
                2048     312±2ms
                4096    1.50±0.02s
              ======== ============
```

The results for the Label layer in the `DIRECT` mode (new algorithm):
```
[100.00%] ··· benchmark_labels_layer.Labels2DColorDirectSuite.time_raw_to_displayed                                                                                                               ok
[100.00%] ··· ======== =============
               param1
              -------- -------------
                 16       17.1±1μs
                 32     20.0±0.04μs
                 64      35.1±0.3μs
                128      93.8±0.2μs
                256       330±2μs
                512     1.63±0.01ms
                1024    6.40±0.02ms
                2048     25.1±0.1ms
                4096     101±0.5ms
              ======== =============
```

As you can see, it became 15x faster on 4096x4096 images, achieving
almost the same performance as in the `AUTO` mode.

## Further optimization

It is just a quick fix to mitigate the issue. However, I think it is
very inefficient to compute the label mapping from scratch on every
brush update. At least the proposed approach can be further optimized in
the following way:

1. Recompute `min_label_id` and `max_label_id` from scratch only when
someone tries to set the data in a layer.
2. When someone tries to change the color id of brush, update the
`min_label_id` and `max_label_id` with a new color id accordingly.

This should be enough to always maintain the valid `min/max` of label
ids without the need of recomputing them on every update.

If we want to be ultimately efficient, the color map should be computed
only for the part of an image that is visible in the current camera
view. Or only recompute the color map near the local region around a
recent brush trajectory as during drawing in large images 99% of image
area usually is not affected and should not be updated. It should allow
to handle extremely large images (e.g. 10000x10000).

Co-authored-by: alisterburt <alisterburt@gmail.com>
Co-authored-by: Juan Nunez-Iglesias <jni@fastmail.com>
Czaki pushed a commit that referenced this pull request Jun 21, 2023
…5723)

# Description

If you specify a custom color set in the Label layer, the brush becomes
extremely laggy in high-resolution images. After some examination, I
discovered that issue is caused by the inefficient label mapping in the
direct color mode:

https://github.com/napari/napari/blob/a79537cac7c2a595c4fca41b85f2c7b4a44d0a61/napari/layers/labels/labels.py#L927-L935

It calls `np.unique` of the whole label map every brush update! This is
a very slow operation for large arrays. As you can see in the benchmark
results below, it takes 1.5s for a single `_raw_to_displayed` call on
4096x4096 images.

Luckily, in most use cases the use of `np.unique` is a total overkill
here and it is possible to achieve the same functionality just by
finding the minimum and maximum values of an array. I implemented a new
algorithm, which is 10-20x faster on large images. To be on the safe
side, I kept the old approach for the cases when `max_label_id -
min_label_id >= 1024`, it will handle the situations when someone
decides to load labels with very large ids (e.g. label_id=123456).

I also added a new benchmark `Labels2DColorDirectSuite` that tests the
Label layer in the direct color mode.

## Benchmark results:
The results for the Label layer in the `AUTO` mode (for comparison):
```
[100.00%] ··· benchmark_labels_layer.Labels2DSuite.time_raw_to_displayed                                                                                                                          ok
[100.00%] ··· ======== =============
               param1
              -------- -------------
                 16     3.84±0.02μs
                 32     5.89±0.03μs
                 64     14.4±0.04μs
                128     47.6±0.09μs
                256       183±1μs
                512     1.02±0.01ms
                1024    4.05±0.02ms
                2048     15.8±0.2ms
                4096      63.7±2ms
              ======== =============
```

The results for the Label layer in the `DIRECT` mode (old algorithm):
```
[100.00%] ··· benchmark_labels_layer.Labels2DColorDirectSuite.time_raw_to_displayed                                                                                                               ok
[100.00%] ··· ======== ============
               param1
              -------- ------------
                 16     36.5±0.2μs
                 32      47.2±1μs
                 64      150±2μs
                128      534±2μs
                256     2.31±0.2ms
                512     11.8±0.1ms
                1024    61.8±0.7ms
                2048     312±2ms
                4096    1.50±0.02s
              ======== ============
```

The results for the Label layer in the `DIRECT` mode (new algorithm):
```
[100.00%] ··· benchmark_labels_layer.Labels2DColorDirectSuite.time_raw_to_displayed                                                                                                               ok
[100.00%] ··· ======== =============
               param1
              -------- -------------
                 16       17.1±1μs
                 32     20.0±0.04μs
                 64      35.1±0.3μs
                128      93.8±0.2μs
                256       330±2μs
                512     1.63±0.01ms
                1024    6.40±0.02ms
                2048     25.1±0.1ms
                4096     101±0.5ms
              ======== =============
```

As you can see, it became 15x faster on 4096x4096 images, achieving
almost the same performance as in the `AUTO` mode.

## Further optimization

It is just a quick fix to mitigate the issue. However, I think it is
very inefficient to compute the label mapping from scratch on every
brush update. At least the proposed approach can be further optimized in
the following way:

1. Recompute `min_label_id` and `max_label_id` from scratch only when
someone tries to set the data in a layer.
2. When someone tries to change the color id of brush, update the
`min_label_id` and `max_label_id` with a new color id accordingly.

This should be enough to always maintain the valid `min/max` of label
ids without the need of recomputing them on every update.

If we want to be ultimately efficient, the color map should be computed
only for the part of an image that is visible in the current camera
view. Or only recompute the color map near the local region around a
recent brush trajectory as during drawing in large images 99% of image
area usually is not affected and should not be updated. It should allow
to handle extremely large images (e.g. 10000x10000).

Co-authored-by: alisterburt <alisterburt@gmail.com>
Co-authored-by: Juan Nunez-Iglesias <jni@fastmail.com>
@imagesc-bot
Copy link

This pull request has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/announcement-napari-0-4-18-released/83322/1

@ksofiyuk ksofiyuk deleted the efficient_label_mapping branch January 8, 2024 10:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
highlight PR that should be mentioned in next release notes performance Relates to performance tests Something related to our tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants