
Map function larger iter arrays #2878

Merged

Conversation

@CSSFrancis (Member) commented Feb 8, 2022

Description of the change

This change simplifies parts of the map_iterate function by replacing the map_blocks function with the lower-level blockwise function (a minimal sketch of blockwise follows the list below).

  • The _get_iterating_kwargs function no longer pads arguments with singleton dimensions to force all of the args to the same number of dimensions.
  • The drop- and added-axes logic is replaced with the get_block_pattern function, which computes a block pattern for the args and identifies which blocks change and which don't.
  • Larger iterating arguments with arg.ndim > signal.ndim are now possible and no longer break the function.
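For context, here is a minimal standalone sketch of how dask's blockwise pairs blocks by symbolic index (the arrays and shapes are made up for illustration; this is not the HyperSpy code itself):

import dask.array as da

x = da.ones((4, 6), chunks=(2, 3))  # blocks indexed by ("i", "j")
y = da.ones((4,), chunks=(2,))      # blocks indexed by ("i",) only

# blockwise pairs blocks whose symbolic indices match, so each block of x
# is combined with the block of y that shares the same "i" index.
z = da.blockwise(
    lambda a, b: a + b[:, None],  # broadcast b along the "j" axis
    "ij",                         # index pattern of the output
    x, "ij",
    y, "i",
    dtype=x.dtype,
)
assert z.compute().shape == (4, 6)

Because every array carries its own index pattern, arguments with different numbers of dimensions can be combined without padding them to a common ndim first.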

Progress of the PR

  • Change implemented (can be split into several points),
  • add a changelog entry in the upcoming_changes folder (see upcoming_changes/README.rst),
  • add tests,
  • ready for review.

Minimal example of the bug fix or the new feature

import hyperspy.api as hs
import numpy as np


def add_sum(image, add):
    return np.sum(image) + np.sum(add)


x = np.ones((10, 20, 2, 3))
s = hs.signals.Signal2D(x)
s_add = hs.signals.BaseSignal(2 * np.ones((10, 20, 2, 2, 2))).transpose(3)

s_out = s.map(add_sum, inplace=False, add=s_add)  # This didn't work before
s_out = s_add.map(add_sum, inplace=False, add=s)  # This worked before

This fixes the problem where a larger signal dimension for an iterating argument caused the map function to fail.

It also, hopefully, simplifies the map function.

This replaces part of #2877

@codecov bot commented Feb 8, 2022

Codecov Report

Merging #2878 (3a0c756) into RELEASE_next_minor (f3aaab5) will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@                 Coverage Diff                 @@
##           RELEASE_next_minor    #2878   +/-   ##
===================================================
  Coverage               79.10%   79.10%           
===================================================
  Files                     206      206           
  Lines                   31907    31910    +3     
  Branches                 7181     7187    +6     
===================================================
+ Hits                    25239    25242    +3     
  Misses                   4909     4909           
  Partials                 1759     1759           
Impacted Files            Coverage Δ
hyperspy/misc/utils.py    92.19% <100.00%> (+0.24%) ⬆️
hyperspy/signal.py        76.02% <100.00%> (-0.18%) ⬇️


@CSSFrancis (Member, Author) commented:

@magnunor If you have some time, can you look over this too? You wrote the original map_blocks function, so it would be good to have another pair of eyes on this. I can test it against pyxem as well. Hopefully this doesn't break anything in pyxem/pyxem#789; it shouldn't, as most of the changes are rather superficial, but there might be some edge cases I haven't considered.

@magnunor (Contributor) left a comment:

I started going through this and saw some minor things.

Will have more thorough feedback later.

@magnunor (Contributor) left a comment:

Some more minor comments.

@magnunor (Contributor) commented:

Sorry about the slowness on this; I've been using my dev time to fix up some of the failing unit tests.

Some points:

  • Make sure the new code follows the black style "guide". This might be a bit of a pain, since you can't run black straight on the code files, as that would change all of the code. What I typically do is run black with the --check and --diff flags on the specific file: python3 -m black --check --diff hyperspy/tests/signals/test_map_method.py. Alternatively, there are a number of editor plugins: https://black.readthedocs.io/en/stable/integrations/editors.html. However, make sure it doesn't change the style of all the code, since that makes it really difficult to review the changes.
  • Sync RELEASE_next_minor with the branch in this pull request, after the bug fix in "Fix test_blockfile unit test, due to change in dask keyword naming" (#2889) has first been merged into RNp and then "upstreamed" into RELEASE_next_minor.

@CSSFrancis (Member, Author) commented:

Sorry about the slowness on this; I've been using my dev time to fix up some of the failing unit tests.

@magnunor No problem! I've been busy the last couple of days anyway, so I haven't been able to get to this.

  • Make sure the new code follows the black style "guide". This might be a bit of a pain, since you can't run black straight on the code files, as that would change all of the code. What I typically do is run black with the --check and --diff flags on the specific file: python3 -m black --check --diff hyperspy/tests/signals/test_map_method.py. Alternatively, there are a number of editor plugins: https://black.readthedocs.io/en/stable/integrations/editors.html. However, make sure it doesn't change the style of all the code, since that makes it really difficult to review the changes.

Ahh, thanks! I am pretty bad at remembering to use black. Thanks for the tip on using the --check and --diff flags; that should help.

Will do!

@magnunor (Contributor) commented:

I did a fairly thorough check of the functionality, inputting a variety of cases:

  • Several iterating kwargs
  • Several args combined with iterating kwargs
  • lazy signal
  • Output with more dimensions than the input (and iterating kwargs)
  • Signal1D

I also checked the runtime of the map with a real dataset: it looked good! At least for the check I did, the computation time was the same.


I did find one thing which didn't work, which might be an existing issue; ergo, it might be a "bug" in the current version of RELEASE_next_minor, or possibly some strange combination of library versions on my computer. So check whether you can replicate this on your own computer.


This code does not work (on my computer), giving the error: ValueError: operands could not be broadcast together with shapes (2,2) (2,3).

Changing to chunk_shape = (2, 2, 2, 3, 2) fixes the issue. I'm not sure exactly what is causing it; possibly not all of the signal dimensions are rechunked into one single chunk.

import numpy as np
import dask.array as da
import hyperspy.api as hs


def add_sum(image, add1, add2):
    print(add1.shape, add2.shape)
    temp_add = add1.sum(-1) + add2
    out = image + np.sum(temp_add)
    return out


chunk_shape = (2, 2, 2, 2, 2)

x = np.ones((4, 5, 10, 11))
s = hs.signals.Signal2D(x)
s_add1 = hs.signals.BaseSignal(2 * np.ones((4, 5, 2, 3, 2))).transpose(3)
s_add2 = hs.signals.BaseSignal(3 * np.ones((4, 5, 2, 3))).transpose(2)

s = hs.signals.Signal2D(da.from_array(s.data, chunks=(2, 2, 2, 2))).as_lazy()
s_add1 = (
    hs.signals.Signal2D(da.from_array(s_add1.data, chunks=chunk_shape))
    .as_lazy()
    .transpose(navigation_axes=(1, 2))
)

s_out = s.map(add_sum, inplace=False, add1=s_add1, add2=s_add2, lazy_output=False)

Also, one thing with regards to the commit history: there has been an increasing focus on having a clear and readable commit history, as this makes it easier in the future to understand the nature of the code changes. This is a bit of an "advanced" feature, so I'm not sure how much we expect everyone to do it. @ericpre might have some thoughts on this.

@CSSFrancis (Member, Author) commented:

@magnunor Thanks for catching that bug! I think that was a holdover from before: it seems I wasn't checking whether the signal was spanned in the case where the navigation chunks were equal to the chunks of the mapped signal but the signal chunks were different.

I added in a test for that case so it should now work as expected!
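As a minimal sketch of the kind of check involved (illustrative only, not the actual HyperSpy implementation): each signal axis of an iterating argument has to sit in a single chunk before mapping, which a targeted rechunk guarantees:

import dask.array as da

arr = da.ones((4, 5, 2, 3), chunks=(2, 2, 1, 2))  # signal axes split across chunks
nav_dims = 2

# Keep the navigation chunking, but collapse each signal axis into one chunk
# so that every block spans the full signal dimensions.
new_chunks = arr.chunks[:nav_dims] + tuple((n,) for n in arr.shape[nav_dims:])
if arr.chunks != new_chunks:
    arr = arr.rechunk(new_chunks)
assert all(len(c) == 1 for c in arr.chunks[nav_dims:])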

Also, one thing with regards to the commit history: there has been an increasing focus on having a clear and readable commit history, as this makes it easier in the future to understand the nature of the code changes. This is a bit of an "advanced" feature, so I'm not sure how much we expect everyone to do it. @ericpre might have some thoughts on this.

I must admit to being rather poor at this; let me know if you want me to go through and edit my commit messages.

@magnunor (Contributor) commented:

Other than the commit history, I'm happy with this pull request!

@magnunor (Contributor) commented Mar 4, 2022

I think I'll merge this, even with the commit history being a bit messy, to speed things along for the 1.7 release.

@ericpre (Member) left a comment:

The relevant part of the error from the code in #2878 (comment):

  File "C:\Users\M0041User\mambaforge\lib\site-packages\dask\array\core.py", line 2481, in map_blocks
    return map_blocks(func, self, *args, **kwargs)

  File "C:\Users\M0041User\mambaforge\lib\site-packages\dask\array\core.py", line 813, in map_blocks
    raise ValueError(

ValueError: Provided chunks have 2 dims; expected 3 dims

Even if the meaning of this error message is not obvious, it seems to be related to changes in the chunk dimensions. If I understand correctly, we get this error because the drop_axis argument passed to map_blocks is incorrect: in this case it is (3, 2), while the correct value is (4, 3, 2). This is indeed what the fix in this PR does, by considering the signal arguments passed to map to figure out the correct changes in axes. This is good; however, I don't see why we need to change to blockwise and why this can't be done by keeping map_blocks. blockwise is a lower-level API and is used by map_blocks; if we don't have a reason to change to blockwise, I would suggest keeping map_blocks, because if any improvement (optimisation, bug fix, etc.) is made to map_blocks in a future release of dask (which I believe is fairly likely to happen), we would not benefit from it and would have to maintain the equivalent ourselves. The idea is to defer to dask as much as we can by using their high-level API.
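For readers unfamiliar with drop_axis, here is a standalone sketch of the behaviour being described, using dask's public map_blocks API with made-up shapes:

import dask.array as da

x = da.ones((4, 5, 2, 3), chunks=(2, 5, 2, 3))

# The function collapses the two trailing (signal) axes of each block, so both
# of those dimensions must be listed in drop_axis; an incomplete drop_axis
# leaves dask expecting output blocks with more dimensions than the function
# actually returns.
out = da.map_blocks(
    lambda b: b.sum(axis=(-2, -1)),
    x,
    drop_axis=(2, 3),
    dtype=x.dtype,
)
assert out.compute().shape == (4, 5)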

@CSSFrancis (Member, Author) commented:

@magnunor @ericpre Sorry I kind of disappeared on this. I've been at a conference and haven't had as much time as I thought.

As for why to use blockwise vs map_blocks: map_blocks expects the same number of dimensions for each signal. If I remember correctly, there is also a part of the code in map_blocks that doesn't properly account for how to pass the arguments to blockwise. I can try to look back at what the offending code was once I get back to my office.

Additionally, in the past we were just adding dimensions of size 1 to signals to get this to work, which I think was confusing and not very clean; this is where a lot of the previous errors were occurring. It also adds more tasks for dask which are trivial but complicate the task graph. We could probably get it to work using map_blocks with a hacky solution, and I worked on doing that first, but I kept coming to the conclusion that using the lower-level blockwise actually fits our purposes better.

Sorry this isn't very specific, I can get you a better response tomorrow.

@ericpre (Member) commented Mar 6, 2022

As for why to use blockwise vs map_blocks: map_blocks expects the same number of dimensions for each signal.

The example above works fine with map_blocks provided that drop_axis is correct. However, this may not be generally true, particularly when the blocks don't align; if you can confirm this with an example, then that is a good justification! :)
The documentation of map_blocks seems to imply this, when it says: "Map_blocks aligns blocks by block positions without regard to shape. In the following example we have two arrays with the same number of blocks but with different shape and chunk sizes."
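A standalone sketch of that documented behaviour (the arrays here are made up for illustration): map_blocks pairs blocks purely by position, even when the chunk shapes differ, so the function itself has to cope with mismatched block sizes:

import dask.array as da
import numpy as np

x = da.arange(6, chunks=3)   # two blocks of length 3
y = da.arange(10, chunks=5)  # two blocks of length 5

# Block i of x is paired with block i of y regardless of shape.
out = da.map_blocks(
    lambda a, b: np.array([a.sum() + b.sum()]),
    x,
    y,
    chunks=(1,),
    dtype=float,
)
assert out.compute().shape == (2,)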

If I remember correctly, there is also a part of the code in map_blocks that doesn't properly account for how to pass the arguments to blockwise. I can try to look back at what the offending code was once I get back to my office.

If you can check this, that would be useful too! Then the question may be whether this is a dask bug or simply not the expected behaviour of map_blocks.

Additionally, in the past we were just adding dimensions of size 1 to signals to get this to work, which I think was confusing and not very clean; this is where a lot of the previous errors were occurring. It also adds more tasks for dask which are trivial but complicate the task graph. We could probably get it to work using map_blocks with a hacky solution, and I worked on doing that first, but I kept coming to the conclusion that using the lower-level blockwise actually fits our purposes better.

Are you saying that you were doing subsequent map calls as a workaround? If so, indeed, this is not a great workaround!

Commit: …terating signal chunks != signal chunks. Changes to get_iterating_kwargs to check for chunk spanning and rechunk if necessary.
@CSSFrancis (Member, Author) commented:

I fixed this up a little bit. I tried to change some of the commit messages, but I have a bit of a problem: I rebased this branch onto RELEASE_next_minor a couple of times, and I'm not exactly a git expert, so I'm not sure how to change some of the older commit messages without messing up the git history.

@CSSFrancis (Member, Author) commented:

This should be good to go now. I changed as many of the commit messages as I could without creating a new PR with a new commit history or rewriting commit history that isn't mine. Maybe I'll just use this as a learning experience and move on from there.

In response to @ericpre

Are you saying that you were doing subsequent map calls as a workaround? If so, indeed, this is not a great workaround!

I'm saying that in the old code, if you had a signal with shape [4,3|2,2] and an iterating signal [4,3|1], it would change the second signal to size [4,3|1,1] using reshape so that the numbers of dimensions were equal; that is a requirement for using map_blocks (see the sketch below).

This is part of the reason why things were broken in the old version: in the reverse case, where the signal shape was [4,3|1] and the iterating signal shape was [4,3|2,2], the dimensions weren't being added to the signal, and that broke the code. The change needed was minimal, but it still broke in some cases.
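Here is a sketch of that old padding workaround (simplified and with made-up arrays; the [nav|sig] notation above corresponds to the leading/trailing axes here):

import dask.array as da

sig = da.ones((4, 3, 2, 2), chunks=(2, 3, 2, 2))  # nav (4, 3), signal (2, 2)
arg = da.ones((4, 3, 1), chunks=(2, 3, 1))        # nav (4, 3), signal (1,)

# Pad the iterating argument with singleton dimensions so both arrays have the
# same ndim, which map_blocks requires; blockwise makes this step unnecessary.
arg_padded = arg.reshape(arg.shape + (1,) * (sig.ndim - arg.ndim))
assert arg_padded.ndim == sig.ndim  # shape is now (4, 3, 1, 1)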

On the argument for map_blocks vs blockwise: I would say that blockwise is better for our situation. It's possible that there is a bug in dask's map_blocks in regard to what I described above, but we are already overriding many of the reasons to use map_blocks. For example, the automatic dtype checking and the output size are handled in the _map_iterate function. These are necessarily custom in our case, because the automatic type checking from dask fails in many cases, as it tries to run the function on a 0-D array.

In reference to checking dropped and new axes: we already handle all of that, as well as the rechunking, etc.

I think using blockwise cleans things up quite nicely. The code is shorter and more readable, with less potential for behind-the-scenes changes to occur. The reshape task is removed, leading to shorter task graphs. The only downside is that the blockwise function is perhaps less intuitive, but I would argue that there are a lot of pitfalls in using map_blocks that are very difficult to troubleshoot, despite it looking intuitive at first.

@ericpre (Member) left a comment:

Thanks @CSSFrancis for elaborating a bit more on the motivation to use blockwise; I agree that this justifies the change!

Regarding the commit history, I would not worry too much, as long as it is sensible and there isn't a large number of "spurious" commits. One issue you may have with squashing commits or moving commits around is that a merge commit may prevent you from doing it; usually, I do a rebase instead of a merge commit to merge upstream changes, to avoid this issue.

@CSSFrancis (Member, Author) commented:

Thanks @CSSFrancis for elaborating a bit more on the motivation to use blockwise; I agree that this justifies the change!

Regarding the commit history, I would not worry too much, as long as it is sensible and there isn't a large number of "spurious" commits. One issue you may have with squashing commits or moving commits around is that a merge commit may prevent you from doing it; usually, I do a rebase instead of a merge commit to merge upstream changes, to avoid this issue.

@ericpre Yeah, that is probably a better way to do it from now on. I can always just be better about my commit messages as well, though :)

I guess we can start making sure this works with changes in pyxem now!
