Map function larger iter arrays #2878
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##   RELEASE_next_minor   #2878   +/-   ##
==========================================
  Coverage       79.10%   79.10%
==========================================
  Files             206      206
  Lines           31907    31910        +3
  Branches         7181     7187        +6
==========================================
+ Hits            25239    25242        +3
  Misses           4909     4909
  Partials         1759     1759
```

Continue to review full report at Codecov.
@magnunor If you have some time, can you look over this too? You wrote the original
I started going through this, and saw some minor things.
Will have more thorough feedback later.
Some more minor comments.
Sorry about the slowness on this; I've been using my dev-time to fix up some of the failing unit tests. Some points:
@magnunor No problem! I've been busy the last couple of days anyways so I haven't been able to get to this.
Ahh, thanks! I am pretty bad at forgetting to use
Will do!
I did a fairly thorough check of the functionality, inputting a variety of cases:
I also checked the runtime. I did find one thing which didn't work, which might be an existing issue; ergo, it might be a "bug" in the current version. This code does not work (on my computer), giving the error quoted in a later comment. Changing to `chunk_shape = (2, 2, 2, 2, 2)` makes it work:
```python
import numpy as np
import dask.array as da
import hyperspy.api as hs

def add_sum(image, add1, add2):
    print(add1.shape, add2.shape)
    temp_add = add1.sum(-1) + add2
    out = image + np.sum(temp_add)
    return out

x = np.ones((4, 5, 10, 11))
s = hs.signals.Signal2D(x)
s_add1 = hs.signals.BaseSignal(2 * np.ones((4, 5, 2, 3, 2))).transpose(3)
s_add2 = hs.signals.BaseSignal(3 * np.ones((4, 5, 2, 3))).transpose(2)

s = hs.signals.Signal2D(da.from_array(s.data, chunks=(2, 2, 2, 2))).as_lazy()
# The failing chunk_shape value was elided above; (2, 2, 2, 2, 2) works.
s_add1 = hs.signals.Signal2D(da.from_array(s_add1.data, chunks=chunk_shape)).as_lazy().transpose(navigation_axes=(1, 2))

s_out = s.map(add_sum, inplace=False, add1=s_add1, add2=s_add2, lazy_output=False)
```

Also, one thing with regard to the commit history: there has been an increasing focus on having a clear and readable commit history. This makes it easier in the future to understand the nature of the code changes. This is a bit of an "advanced" feature, so I'm not sure how much we expect everyone to do this. @ericpre might have some thoughts on this.
@magnunor Thanks for catching that bug! I think that was a holdover from before; it seems I wasn't checking whether the signal was spanned in the case where the navigation chunks were equal to the chunks of the mapped signal but the signal chunks were different. I added a test for that case, so it should now work as expected!
I must admit to being rather poor at this; let me know if you want me to go through and edit my commit messages.
Other than the commit history, I'm happy with this pull request!
I think I'll merge this, even with the commit history being a bit messy, to speed things along for the 1.7 release.
The relevant part of the error from the code in #2878 (comment):

```
  File "C:\Users\M0041User\mambaforge\lib\site-packages\dask\array\core.py", line 2481, in map_blocks
    return map_blocks(func, self, *args, **kwargs)
  File "C:\Users\M0041User\mambaforge\lib\site-packages\dask\array\core.py", line 813, in map_blocks
    raise ValueError(
ValueError: Provided chunks have 2 dims; expected 3 dims
```
Even if the meaning of this error message is not obvious, it seems to be related to changes in the chunks dimension. If I understand correctly, we get this error because the `drop_axis` argument passed to `map_blocks` is incorrect: in this case it is `(3, 2)`, while the correct value is `(4, 3, 2)`. This is indeed what the fix of the PR is doing, by considering the signal arguments passed to `map` to figure out the correct changes in axes. This is good; however, I don't see why we need to change to `blockwise` and why this can't be done by keeping `map_blocks`. `blockwise` is a lower-level API and is used by `map_blocks`; if we don't have a reason to change to `blockwise`, I would suggest keeping `map_blocks`, because if any improvement (optimisation, bug fix, etc.) is made to `map_blocks` in a future release of dask (which I believe is fairly likely to happen), we would not benefit from it and would need to maintain our version ourselves. The idea is to defer to dask as much as we can by using their high-level API.
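To make the `drop_axis` mechanics concrete, here is a minimal standalone dask sketch (not the HyperSpy code from this PR): when the mapped function reduces each block's dimensionality, every removed axis must be listed in `drop_axis`.

```python
import dask.array as da

# Each 3D block is reduced to 1D by the mapped function,
# so both trailing axes must be listed in drop_axis.
arr = da.ones((4, 5, 6), chunks=(2, 5, 6))

def reduce_block(block):
    # a (2, 5, 6) block becomes a (2,) block
    return block.sum(axis=(1, 2))

out = arr.map_blocks(reduce_block, drop_axis=(1, 2), dtype=float)
print(out.compute())  # [30. 30. 30. 30.]
```

Listing only one of the two reduced axes would leave dask expecting 2D output blocks while the function returns 1D ones, which seems analogous to the `(3, 2)` vs `(4, 3, 2)` mismatch discussed above.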
@magnunor @ericpre Sorry I kind of disappeared on this. I've been at a conference and haven't had as much time as I thought.

As for why to use `blockwise` vs `map_blocks`: `map_blocks` expects the same number of dimensions for each signal. If I remember correctly, there is also a part of the code in `map_blocks` that doesn't properly account for how to pass the arguments to `blockwise`. I can try to look back at what the offending code was once I get back to my office. Additionally, in the past we were just adding dimensions of size 1 to signals to get this to work, which I think was confusing and not very clean; this is where lots of the previous errors were occurring. It adds in more tasks for dask, which are trivial but complicate your task graph. We could probably get it to work using `map_blocks` with a hacky solution, and I worked on doing that first, but I kept coming to the conclusion that the lower-level `blockwise` actually fit our purposes better. Sorry this isn't very specific; I can get you a better response tomorrow.
The example above works fine with
If you can check this, that would be useful too! Then the question may be whether this is a dask bug or whether this is not an expected behaviour from

Are you saying that you were doing subsequent
I fixed this up a little bit. I tried to change some of the commit messages, but I have a bit of a problem in that I rebased this branch onto RELEASE_next_minor a couple of times. I'm not exactly a git expert, so I'm not sure how to change some of the older commit messages without messing up the git history there.
This should be good to go now. I changed as many of the commit messages as I can without creating a new PR with a new commit history or rewriting commit history that isn't mine. Maybe just use this as a learning experience and move from there. In response to @ericpre
I'm saying that in the old code, if you had a signal with shape [4,3|2,2] and an iterating signal [4,3|1], it would change the second signal to size [4,3|1,1]. This is part of the reason why things were broken in the old version. In the reverse case, where the signal shape was [4,3|1] and the iterating signal shape was [4,3|2,2], it wasn't adding the dimensions to the signal, and that broke the code. That change was minimal but still broke in some cases.

In reference to checking drop and new axes, we already handle all of that, as well as the rechunking etc.
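The distinction argued here can be sketched with plain dask (an illustrative example, not the PR's implementation): `blockwise` pairs each argument with its own index string, so arguments of different dimensionality combine without first being padded with size-1 axes.

```python
import numpy as np
import dask.array as da

# A 2D "signal" and a 1D per-row argument: blockwise matches the shared
# axis via the index strings ("ij" and "i") instead of requiring the
# 1D argument to be reshaped up to 2D first.
x = da.ones((4, 6), chunks=(2, 3))
w = da.from_array(np.arange(4), chunks=2)

def scale(block, wblock):
    # wblock holds the row weights belonging to this block
    return block * wblock[:, None]

out = da.blockwise(scale, "ij", x, "ij", w, "i", dtype=x.dtype)
print(out.compute()[:, 0])  # [0. 1. 2. 3.]
```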
Thanks @CSSFrancis for elaborating a bit more on the motivation to use `blockwise`; I agree that this justifies the change!
Regarding the commit history, I would not worry too much, as long as it is sensible and there are not a number of "spurious commits". One issue you may have with squashing commits or moving commits around is that a merge commit may prevent you from doing it. Usually, I do a rebase instead of a merge commit to bring in upstream changes, to avoid this issue.
@ericpre Yea, that is probably a better way to do it from now on. I can always just be better about my commit messages as well, though :) I guess we can start making sure this works with changes in pyxem now!
Description of the change
This change:

- Simplifies some of the `map_iterate` function by replacing the `map_blocks` function with the lower-level `blockwise` function.
- Changes the `_get_iterating_kwargs` function so that it no longer adds on trivial (size-1) dimensions to make all of the args have the same number of dimensions.
- Adds the `get_block_pattern` function, which works by getting a block pattern for the args and identifying which blocks change/don't change.

Progress of the PR
- Changelog entry in the `upcoming_changes` folder (see `upcoming_changes/README.rst`).

Minimal example of the bug fix or the new feature
This fixes the problem when a larger signal dimension for an iterating argument causes the map function to fail.
It also hopefully simplifies the map function.
This replaces part of #2877
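The `get_block_pattern` idea mentioned in the description could be sketched roughly as follows. This is a hypothetical illustration of building blockwise-style index tuples from navigation and signal dimensions, not HyperSpy's actual implementation; the function name and signature here are invented.

```python
import string

def block_indices(nav_ndim, arg_sig_ndims, out_sig_ndim):
    # Hypothetical sketch: navigation axes share one index letter per
    # axis across all arguments (those blocks iterate together), while
    # each argument's signal axes get fresh letters (those blocks may
    # change shape or be dropped in the output).
    letters = iter(string.ascii_lowercase)
    nav = tuple(next(letters) for _ in range(nav_ndim))
    arg_inds = [nav + tuple(next(letters) for _ in range(n)) for n in arg_sig_ndims]
    out_ind = nav + tuple(next(letters) for _ in range(out_sig_ndim))
    return out_ind, arg_inds

out_ind, arg_inds = block_indices(2, [2, 1], 2)
print(out_ind, arg_inds)
# ('a', 'b', 'f', 'g') [('a', 'b', 'c', 'd'), ('a', 'b', 'e')]
```

Index tuples like these are exactly what `dask.array.blockwise` consumes, which is one way the navigation/signal split maps naturally onto the lower-level API.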