# Fix inefficient label mapping in direct color mode (10-20x speedup) #5723
## Conversation
Our current target for solving this problem is to avoid global remapping of labels on the CPU entirely, in #3308, but I'm not sure when the linked PR will be finished. We will try to investigate the status of #3308, and if the prognosis for finishing it is not optimistic, I will be happy to at least partially solve this performance problem by merging this PR. If you have time, you could also take a look at the linked PR and maybe help finish it.
One idea (the changes would be bigger): we do not need to know the real set of labels. We only need a guarantee that all labels present in the layers are contained in this set (so even if we erase some label, its presence in the set is not a problem). The call to `unique` could therefore be delayed and done, for example, 1s after finishing the drawing (while updating the set with every value used during the drawing). It would be more complex and require a deeper dive into the napari code, but it would also be useful for solving the problems mentioned in #3308 and speed up the code even more (by avoiding the min and max computation on every refresh).
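A rough sketch of how such a deferred `unique` could look (the `LabelSetTracker` class, its names, and the debounce interval are hypothetical illustrations, not napari's actual API):

```python
import time
import numpy as np

class LabelSetTracker:
    """Maintain a superset of the labels present in a layer.

    During drawing we only *add* painted values to the set, which is cheap;
    a full np.unique() pass runs lazily, once the drawing has settled for
    at least `debounce` seconds, to shrink the set back to the true labels.
    """

    def __init__(self, data, debounce=1.0):
        self._data = data
        self._debounce = debounce
        self._labels = set(np.unique(data).tolist())
        self._last_edit = None

    def on_paint(self, value):
        # Cheap: keep the superset valid without scanning the whole array.
        self._labels.add(int(value))
        self._last_edit = time.monotonic()

    def labels(self):
        # Expensive recompute only after the drawing has settled.
        if (self._last_edit is not None
                and time.monotonic() - self._last_edit >= self._debounce):
            self._labels = set(np.unique(self._data).tolist())
            self._last_edit = None
        return self._labels
```

The key property is the one described above: the set may temporarily contain erased labels, which is harmless, so the expensive global scan can be postponed out of the interactive painting path.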
Codecov Report

```diff
@@            Coverage Diff             @@
##             main    #5723      +/-   ##
==========================================
- Coverage   89.85%   89.85%   -0.01%
==========================================
  Files         608      608
  Lines       51756    51870    +114
==========================================
+ Hits        46504    46606    +102
- Misses       5252     5264     +12
==========================================
```

... and 20 files with indirect coverage changes
@jni I'll test that PR later. Anyway, I propose to merge this PR if you don't have any objections regarding the code. It is a non-breaking change that doesn't alter any behaviour - a pure optimization. I did it because without it napari is unusable on large images with custom colors; it is not optimization for the sake of optimization. Just try it yourself: open a large image (I use 4096x3000) and try to use the brush. You should consider it a bugfix or a necessary optimization. I can't see any conflicts with the mentioned PR, because if it resolves the issue, then when it's ready this code will be unnecessary and will be replaced.
To be clear, this is to better handle cases where labels are sparsely distributed, like
I agree with @ksofiyuk: that PR has been sitting for a long time, who knows for how much longer, and this one brings very little disruption. Tests are passing already, which is a good sign. I'm not super familiar with the logic here though, so I'm not confident that it won't break some corner cases. Hopefully @jni can help more here.
I'm also fine with this PR going in whilst waiting for #3308 - I know that's getting closer, but this will be easy to remove once #3308 is ready, and the performance gains here are high value.
I defined the upper bound on the number of unique labels explicitly, which I think makes the code a little easier to parse at first glance.
Thanks again @ksofiyuk - you're on fire! 🔥
@alisterburt thanks for improving the readability.
On this benchmark with sparse labels from -10000 to 10000, the proposed method shows an even more significant speedup (up to 40x). The timing of the old algorithm on the new benchmark (providing only the two biggest sizes for clarity):
The new algorithm:
That is a 42x speedup! I tested it on slightly different hardware, so the numbers don't exactly match the results in the first message.
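For intuition about why the speedup is so large, here is a small self-contained comparison of `np.unique` against a plain min/max scan (this is not the napari asv suite; the array size and timings are illustrative and will vary by hardware):

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
# Sparse labels in [-10000, 10000] on a large 2D label map.
labels = rng.integers(-10000, 10001, size=(1024, 1024), dtype=np.int64)

# np.unique sorts (or hashes) the whole array: O(n log n) plus allocations.
t_unique = timeit.timeit(lambda: np.unique(labels), number=3) / 3

# min + max are two single linear passes with no allocations.
t_minmax = timeit.timeit(lambda: (labels.min(), labels.max()), number=3) / 3

print(f"np.unique: {t_unique * 1e3:.2f} ms")
print(f"min + max: {t_minmax * 1e3:.2f} ms")
```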
@brisvag, no, on the contrary, it fixes the cases where labels are not super sparsely distributed (I'd say it covers 99.999% of real-world cases). With the last update, you will gain a significant speedup from this PR if your labels can be localized in any contiguous interval whose length is less than 65536 elements. Here are some examples, if your labels are:
Super cool! Yeah, sorry, I didn't mean to imply we shouldn't merge this! I actually think the other one is closer than you're implying @brisvag, but I totally get the point that this is already passing and a huge speedup, so I'm all for merging! The other factor is that I think, using direct mode, sooner rather than later @ksofiyuk will run into some of the color mismatch issues fixed by #3308. But we can cross that bridge when we get there!
Sorry, with "this" I meant "keeping the old approach for some cases" :P So I'm on board 👍
Nice, looking forward to it :)
Just pushed a commit trying to make sure all the new lines are covered. This should be good to merge if that passes, deepest apologies if it doesn't 🤣 (it passed locally) |
@ksofiyuk that's the point — the old tests didn't have a branch, but they were "shuttled" to the new code — if you look at the previous codecov comment, it was the old code that was no longer covered by tests. Hence the new test. Anyway, the new npe2 release broke our typing action. CC @tlambert03
@jni I realized it after I posted it, so I deleted my previous comment, but you managed to read it and reply before that 😄 But thanks for the explanation.
One of the core problems here is that we transfer the whole layer array to the GPU after each paint event. I do not have time to investigate how hard it would be to transfer only the updated part of the image, but this, plus maybe caching some properties, should lead to better performance.
In theory, it shouldn't be very difficult to send only patches of an image to the GPU for updating. But I'm also not sure how difficult that would be to implement in napari (I haven't looked into the code of PR #3308). It does not look very difficult to implement local updates for the CPU version (only around the recent brush movement, plus limiting the update to the visible part of the image). That should make it possible to handle images of any resolution without performance degradation for 99.9% of typical use cases. And it would beat the GPU version on really high-resolution images if the same thing weren't implemented there.
This would be cool; here is a previous implementation of partial updates in the volume visual which may provide some clues.
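As a rough numpy-only illustration of the partial-update idea discussed above (the `dirty_bounding_box` helper is hypothetical; the actual GPU upload call, e.g. a vispy texture update, is out of scope here):

```python
import numpy as np

def dirty_bounding_box(old, new):
    """Return the (row, col) slices covering all changed pixels, or None.

    Only this patch would need to be re-colored and re-uploaded to the
    GPU texture, instead of the whole label map.
    """
    changed = old != new
    if not changed.any():
        return None
    rows = np.flatnonzero(changed.any(axis=1))
    cols = np.flatnonzero(changed.any(axis=0))
    return (slice(rows[0], rows[-1] + 1),
            slice(cols[0], cols[-1] + 1))

# Example: a small brush stroke on a big label map.
old = np.zeros((4096, 4096), dtype=np.int32)
new = old.copy()
new[100:120, 200:260] = 5          # simulated brush stroke
box = dirty_bounding_box(old, new)
patch = new[box]                   # a 20x60 patch instead of 4096x4096
```

In practice one would track the stroke coordinates directly rather than diffing two full arrays, but the upload-only-the-patch principle is the same.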
Currently we also send data to the GPU for rendering. I tried to draw a graph. For #3308 it will look like this (I grayed out the current steps): Casting to float32 is much faster than the current step, but still linear in data size. And in both scenarios there is a transfer-to-GPU step. (On both graphs I highlight only the most important steps; it is not full documentation.)
@Czaki Thank you for this graph. Based on it, I'd say the only correct way to make this very efficient is to implement sending local updates to the GPU instead of the whole image and label map. If that is not done, there is no point in further CPU optimization, as transferring data from CPU to GPU will always be the bottleneck.
This is marked as ready for merge and tests are passing, but something is up with the typing test - I'm not familiar with this, and the error appears to be in files unaffected by this PR. I will merge in a few hours unless anyone objects.
Just seen that #5732 includes this.
@alisterburt Typing fix is here #5727 |
@jni Labels and milestone |
@Czaki should we make a 0.4.18 milestone for cherry-picking? |
+1 on the 0.4.18 milestone |
@jni yes |
…5723)

# Description

If you specify a custom color set in the Label layer, the brush becomes extremely laggy in high-resolution images. After some examination, I discovered that the issue is caused by the inefficient label mapping in the direct color mode:

https://github.com/napari/napari/blob/a79537cac7c2a595c4fca41b85f2c7b4a44d0a61/napari/layers/labels/labels.py#L927-L935

It calls `np.unique` on the whole label map on every brush update! This is a very slow operation for large arrays. As you can see in the benchmark results below, it takes 1.5s for a single `_raw_to_displayed` call on 4096x4096 images.

Luckily, in most use cases `np.unique` is total overkill here, and it is possible to achieve the same functionality just by finding the minimum and maximum values of the array. I implemented a new algorithm, which is 10-20x faster on large images. To be on the safe side, I kept the old approach for the cases when `max_label_id - min_label_id >= 1024`; it handles the situations when someone decides to load labels with very large ids (e.g. label_id=123456). I also added a new benchmark `Labels2DColorDirectSuite` that tests the Label layer in the direct color mode.

## Benchmark results

The results for the Label layer in the `AUTO` mode (for comparison):

```
[100.00%] ··· benchmark_labels_layer.Labels2DSuite.time_raw_to_displayed  ok
[100.00%] ··· ======== =============
               param1
              -------- -------------
                  16    3.84±0.02μs
                  32    5.89±0.03μs
                  64    14.4±0.04μs
                 128    47.6±0.09μs
                 256      183±1μs
                 512    1.02±0.01ms
                1024    4.05±0.02ms
                2048     15.8±0.2ms
                4096      63.7±2ms
              ======== =============
```

The results for the Label layer in the `DIRECT` mode (old algorithm):

```
[100.00%] ··· benchmark_labels_layer.Labels2DColorDirectSuite.time_raw_to_displayed  ok
[100.00%] ··· ======== ============
               param1
              -------- ------------
                  16    36.5±0.2μs
                  32     47.2±1μs
                  64      150±2μs
                 128      534±2μs
                 256    2.31±0.2ms
                 512    11.8±0.1ms
                1024    61.8±0.7ms
                2048      312±2ms
                4096   1.50±0.02s
              ======== ============
```

The results for the Label layer in the `DIRECT` mode (new algorithm):

```
[100.00%] ··· benchmark_labels_layer.Labels2DColorDirectSuite.time_raw_to_displayed  ok
[100.00%] ··· ======== =============
               param1
              -------- -------------
                  16     17.1±1μs
                  32    20.0±0.04μs
                  64     35.1±0.3μs
                 128     93.8±0.2μs
                 256      330±2μs
                 512    1.63±0.01ms
                1024    6.40±0.02ms
                2048     25.1±0.1ms
                4096     101±0.5ms
              ======== =============
```

As you can see, it became 15x faster on 4096x4096 images, achieving almost the same performance as in the `AUTO` mode.

## Further optimization

This is just a quick fix to mitigate the issue. However, I think it is very inefficient to compute the label mapping from scratch on every brush update. At least the proposed approach can be further optimized in the following way:

1. Recompute `min_label_id` and `max_label_id` from scratch only when someone sets the data in a layer.
2. When someone changes the color id of the brush, update `min_label_id` and `max_label_id` with the new color id accordingly.

This should be enough to always maintain valid `min/max` label ids without recomputing them on every update.

If we want to be ultimately efficient, the color map should be computed only for the part of the image that is visible in the current camera view. Or we could recompute the color map only in the local region around a recent brush trajectory, as during drawing in large images 99% of the image area is usually not affected and should not be updated. This would allow handling extremely large images (e.g. 10000x10000).

Co-authored-by: alisterburt <alisterburt@gmail.com>
Co-authored-by: Juan Nunez-Iglesias <jni@fastmail.com>
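The core trick described in the PR can be sketched like this (a simplified illustration, not napari's actual implementation; the function name, the color-dict interface, and the fallback threshold are assumptions):

```python
import numpy as np

def map_labels_direct(labels, color_dict, max_range=1024):
    """Map integer labels to RGBA colors without np.unique.

    If the label values span a small contiguous range, build a dense
    (max - min + 1)-entry lookup table and index into it in one
    vectorized step; otherwise fall back to a slow per-unique-value
    path for pathological cases (e.g. a stray label_id=123456).
    """
    lo, hi = int(labels.min()), int(labels.max())
    if hi - lo >= max_range:
        # Fallback: unique-based path (slow but handles any spread).
        mapped = np.zeros(labels.shape + (4,), dtype=np.float32)
        for v in np.unique(labels):
            mapped[labels == v] = color_dict.get(int(v), (0, 0, 0, 0))
        return mapped
    # Fast path: dense lookup table over [lo, hi], then fancy indexing.
    table = np.zeros((hi - lo + 1, 4), dtype=np.float32)
    for v, rgba in color_dict.items():
        if lo <= v <= hi:
            table[v - lo] = rgba
    return table[labels - lo]
```

Finding `min`/`max` is two linear passes, and the table indexing is a single vectorized gather, which is why this beats a full `np.unique` sort on large arrays whenever the labels occupy a compact range.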
This pull request has been mentioned on the Image.sc Forum. There might be relevant details there: https://forum.image.sc/t/announcement-napari-0-4-18-released/83322/1