Improve lazy operations and Chunking #2617
Codecov Report
@@ Coverage Diff @@
## RELEASE_next_patch #2617 +/- ##
======================================================
+ Coverage 76.94% 76.97% +0.03%
======================================================
Files 201 201
Lines 29711 29782 +71
Branches 6515 6536 +21
======================================================
+ Hits 22860 22926 +66
- Misses 5105 5107 +2
- Partials 1746 1749 +3
Some of these tests are still failing because this function is now slightly less flexible. The output signal size has to be explicit (or ragged), and the output data type has to be explicit as well. Both could be guessed, but since the map function is more of a back-end function, it might be better to let users set them themselves. I also allowed ragged signals to be returned as lazy signals.
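To illustrate why an explicit output size and dtype help, here is a minimal NumPy-only sketch. It is not HyperSpy's implementation: `map_explicit`, `output_signal_size`, and `output_dtype` are hypothetical names chosen for this example. The point is only that knowing the output shape and dtype up front lets the result be allocated without first calling the function to guess them.

```python
import numpy as np


def map_explicit(data, nav_ndim, func, output_signal_size, output_dtype):
    """Apply ``func`` to every signal in ``data`` (hypothetical helper).

    The first ``nav_ndim`` axes of ``data`` are treated as navigation axes;
    the remaining axes form the signal passed to ``func``.
    """
    nav_shape = data.shape[:nav_ndim]
    # The output can be allocated up front because its shape and dtype
    # were stated explicitly rather than guessed from a trial call.
    out = np.empty(nav_shape + tuple(output_signal_size), dtype=output_dtype)
    for index in np.ndindex(*nav_shape):
        out[index] = func(data[index])
    return out


data = np.ones((2, 3, 4, 5))
# Each (4, 5) signal is reduced to a (5,) signal by summing over its first axis.
result = map_explicit(
    data,
    nav_ndim=2,
    func=lambda sig: sig.sum(axis=0),
    output_signal_size=(5,),
    output_dtype=np.float64,
)
```

A ragged output would not fit this scheme, since no single `output_signal_size` describes it, which is consistent with ragged results being treated as a separate case above.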
Can you please rebase on On a slightly different topic, some files seem to have been deleted by mistake!
5b3da65 to 7e72c05
7e72c05 to 67880a3
…still some errors which are related to weird edge cases
Okay, I'm still failing a couple of tests. I can slog through them, but I think we need to be a little stricter about how we use the map function. A few things came up again and again while writing this and made what should have been a fairly easy fix much more complicated than it needed to be. Some of this is older legacy code, but some of it might be newer as well. I was going to change the documentation to make things more explicit; if someone wants to comment to confirm that I understand how the map function (should) work in hyperspy, that would be nice.

1 - The _map_iterate function should be protected and only called by the map function. This helps catch cases where people pass improper arguments to _map_iterate. The most common error was passing the iterating_kwarg to the map_function as an array rather than as a BaseSignal, which breaks the new code and is a poor workaround for not really understanding how the map function operates (see 3).

2 - Something else I found frustrating was passing in a BaseSignal that doesn't have the same size as the navigation axes. I handled that case by converting it to a numpy array if the navigation size is 0, and throwing an error if that doesn't work. Maybe this should be handled by the map function, but it works well now so I wouldn't really touch it.

3 - This is related to the first two issues: in general, the map function has two different operating principles. If a BaseSignal is passed to the signal and has the same navigation shape as the signal, it is iterated alongside the signal. If a NumPy array is passed, it is assumed to be a constant and applied to every signal.
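The two operating principles in point 3 can be sketched without HyperSpy. This is an illustration under stated assumptions, not the library's code: `NavArg` is a hypothetical stand-in for a BaseSignal whose navigation shape matches the data, and `toy_map` is a hypothetical function that iterates `NavArg` arguments alongside the data while passing plain arrays unchanged to every call.

```python
import numpy as np


class NavArg:
    """Hypothetical stand-in for a BaseSignal with matching navigation shape."""

    def __init__(self, data):
        self.data = np.asarray(data)


def toy_map(data, nav_ndim, func, **kwargs):
    """Apply ``func`` to each signal; iterate NavArg kwargs, broadcast the rest."""
    nav_shape = data.shape[:nav_ndim]
    # Separate keyword arguments into iterated ones and constants.
    iterating = {k: v for k, v in kwargs.items() if isinstance(v, NavArg)}
    constant = {k: v for k, v in kwargs.items() if not isinstance(v, NavArg)}
    for key, arg in iterating.items():
        if arg.data.shape[:nav_ndim] != nav_shape:
            raise ValueError(f"'{key}' does not match navigation shape {nav_shape}")
    results = []
    for index in np.ndindex(*nav_shape):
        step_kwargs = {k: v.data[index] for k, v in iterating.items()}
        results.append(func(data[index], **constant, **step_kwargs))
    return np.array(results).reshape(nav_shape + results[0].shape)


def multiply(data1, value):
    return np.multiply(data1, value)


data = np.ones((2, 3, 2, 3))
values = np.arange(6).reshape(2, 3)

# Iterated: each navigation position is multiplied by its own scalar.
iterated = toy_map(data, 2, multiply, value=NavArg(values))
# Constant: every signal is multiplied by the whole (2, 3) array.
constant = toy_map(data, 2, multiply, value=values)
```

Making the distinction depend on the argument's type, rather than on its shape, is what removes the ambiguity when an array happens to have the same shape as the navigation axes.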
Indeed, this may need a bit of work, but if we clearly understand what the issue is, then it shouldn't be that much work. It will easily pay off in the long term compared to using workarounds here and there. Thanks for taking a stab at it! A few comments, without looking too much into the details.
Just to make sure that I don't misunderstand the issue: this applies to a subset of use of the
Indeed, it makes sense to allow the user to set the data dtype (optional); otherwise it should be inferred.
If I understand correctly, the only thing that matters is that the length of the arguments to iterate over is the same as that of the navigation axis. That means it could be a BaseSignal, a numpy array, a list, etc., as long as the length is correct.
This sounds like the flow control needs to be improved!
To avoid confusion, what do you mean by "BaseSignal is passed to the signal"? You started to add documentation to the dev guide, which is good! If there are comments or explanations, it would be very good to add them to the user guide too. For example, how to use these efficiently, highlighting the fact that some operations will be more efficient than others - as in the situation where the
Honestly I'm not really sure what the
So for the For the The solution is just to tighten the restrictions we place on the
Good point, I'm not great at including examples sometimes.

import hyperspy.api as hs
import numpy as np

def multiply(data1, value):
    # This function multiplies an array by a value.
    return np.multiply(data1, value)

s = hs.signals.Signal2D(np.ones((2, 3, 2, 3)))  # 4D dataset
s2 = np.reshape(np.arange(6), (2, 3))
s2_signal = hs.signals.BaseSignal(s2)  # 2D dataset, all in signal axes
s2_navigation = s2_signal.T  # same data, all in navigation axes
s.map(multiply, value=s2_navigation)  # This iterates s2 alongside s
s.map(multiply, value=s2_signal)  # This would try to multiply s by the array s2 at every nav position
s.map(multiply, value=s2)  # This would try to multiply s by the array s2 at every nav position
# Now, if we wanted to break things a little and use ``_map_iterate`` (which I wouldn't recommend), we could do the following.
# This currently works but won't in the new code...
s._map_iterate(multiply, iterating_kwargs=(('value'), (s2)))  # This iterates s2 alongside s
s._map_iterate(multiply, value=s2)  # This would try to multiply s by the array s2 at every nav position

It might not be clear from this example why the first one is better, but from a consistency standpoint it is just better to always deal with Signals rather than arrays and lists. There is less ambiguity about what is being applied where. I can allow the second set of
…uments for the map function
if self.axes_manager.navigation_shape == () and self._lazy:
    print("Converting signal to a non-lazy signal because there are no nav dimensions")
    self.compute()
# Separate ndkwargs depending on if they are BaseSignals.
It should be noted that this actually works with all signals which inherit from BaseSignal. Thus, the current implementation works fine with Signal2D and Signal1D as well.
I have fixed this, but it involves making a copy of the signal when rechunking so that each chunk spans the entire signal dimension. Hopefully that works well enough and the copying doesn't require too much memory or time.
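As a small NumPy-only illustration of why each chunk needs to span the entire signal dimension (a sketch, not code from this PR): any function that needs the whole signal, such as the hypothetical `normalize` below, gives a different answer when the signal is split across chunks and each piece is processed on its own.

```python
import numpy as np


def normalize(signal):
    """Scale a signal so that it sums to one; this needs the whole signal."""
    return signal / signal.sum()


signal = np.array([1.0, 1.0, 2.0, 4.0])

# Chunk spans the entire signal dimension: the result is what we expect.
whole = normalize(signal)

# Signal split across two chunks: each piece is normalized independently,
# so the concatenated result no longer sums to one.
split = np.concatenate([normalize(signal[:2]), normalize(signal[2:])])
```

Rechunking so that the signal axes are a single chunk avoids this, at the cost of the extra copy mentioned above.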
@CSSFrancis, any chance you could address the comments above? You can see the missing coverage at https://github.com/hyperspy/hyperspy/pull/2617/checks?check_run_id=2249986920 or in the PR diff https://github.com/hyperspy/hyperspy/pull/2617/files
The failure on azure pipeline is not related to this PR and should be sorted soon - see #2694 (comment).
@ericpre I think that this should be good now. Let me know if there are any more changes that need to be made.
Thanks @CSSFrancis, this looks good to me.
The failure on azure pipeline is due to the tifffile package being broken on the Anaconda defaults channel - see AnacondaRecipes/tifffile-feedstock#2.
LGTM!
Description of the change
I have been looking into the map function on lazy datasets and have realized that it isn't very well optimized. Currently it creates a bunch of Dask delayed objects from the Dask Array. It then either computes each chunk many times or operates on only one signal at a time, which is pretty slow.
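One alternative to per-signal delayed objects is to process a whole chunk per task with `dask.array.map_blocks`. The sketch below is an assumption about the general approach, not the code in this PR: `process_block`, `double`, and the keyword `signal_func` are hypothetical names, and the data layout (two navigation axes followed by two signal axes) is chosen only for the example.

```python
import dask.array as da
import numpy as np


def double(signal):
    """Example per-signal function that preserves the signal shape."""
    return signal * 2


def process_block(block, signal_func):
    """Apply ``signal_func`` to each signal in one in-memory block.

    Assumes the first two axes are navigation axes and the last two are
    signal axes, and that ``signal_func`` preserves the signal shape.
    """
    out = np.empty_like(block)
    for index in np.ndindex(*block.shape[:2]):
        out[index] = signal_func(block[index])
    return out


data = da.ones((4, 6, 8, 8), chunks=(2, 3, 4, 4))
# Make every chunk span the full signal axes (axes 2 and 3).
data = data.rechunk({2: -1, 3: -1})
# One task per chunk: each chunk is loaded once and all its signals processed.
lazy_result = da.map_blocks(
    process_block, data, signal_func=double, dtype=data.dtype
)
result = lazy_result.compute()
```

Passing `dtype` explicitly means Dask does not have to infer it from a trial call, which mirrors the explicit output dtype discussed in the conversation.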
This is the line which slows things down considerably...
hyperspy/hyperspy/_signals/lazy.py
Line 539 in 512d129
Progress of the PR
CHANGES.rst (if appropriate)
I still need to work on the ragged examples and changing signals. A lot of this code is modified from @magnunor's, so any input he has might be useful. I just figured that it might be worth putting in the effort to fix this now as opposed to later.
Minimal example of the bug fix or the new feature