
Map function larger iter arrays #2878

Merged

Conversation

@CSSFrancis (Member) commented Feb 8, 2022

Description of the change

This change simplifies parts of the map_iterate function by replacing the map_blocks function with the lower-level blockwise function (a minimal sketch of blockwise follows the list below).

  • The _get_iterating_kwargs function no longer pads arguments with singleton dimensions to force all of the args to the same number of dimensions.
  • The drop- and added-axes logic is replaced with the get_block_pattern function, which computes a block pattern for the args and identifies which blocks change and which don't.
  • Larger iterating arguments with arg.ndim > signal.ndim are now possible and no longer break the function.
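For context, here is a minimal standalone sketch of how dask's blockwise pairs blocks by symbolic index (the arrays and shapes are made up for illustration; this is not the HyperSpy code itself):

import dask.array as da

x = da.ones((4, 6), chunks=(2, 3))  # blocks indexed by ("i", "j")
y = da.ones((4,), chunks=(2,))      # blocks indexed by ("i",) only

# blockwise pairs blocks whose symbolic indices match, so each block of x
# is combined with the block of y that shares the same "i" index.
z = da.blockwise(
    lambda a, b: a + b[:, None],  # broadcast b along the "j" axis
    "ij",                         # index pattern of the output
    x, "ij",
    y, "i",
    dtype=x.dtype,
)
assert z.compute().shape == (4, 6)

Because every array carries its own index pattern, arguments with different numbers of dimensions can be combined without padding them to a common ndim first.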

Progress of the PR

  • Change implemented (can be split into several points),
  • add a changelog entry in the upcoming_changes folder (see upcoming_changes/README.rst),
  • add tests,
  • ready for review.

Minimal example of the bug fix or the new feature

import hyperspy.api as hs
import numpy as np


def add_sum(image, add):
    return np.sum(image) + np.sum(add)


x = np.ones((10, 20, 2, 3))
s = hs.signals.Signal2D(x)
s_add = hs.signals.BaseSignal(2 * np.ones((10, 20, 2, 2, 2))).transpose(3)

s_out = s.map(add_sum, inplace=False, add=s_add)  # This didn't work before
s_out = s_add.map(add_sum, inplace=False, add=s)  # This worked before

This fixes the problem where a larger signal dimension for an iterating argument caused the map function to fail.

It also, hopefully, simplifies the map function.

This replaces part of #2877

@codecov bot commented Feb 8, 2022

Codecov Report

Merging #2878 (3a0c756) into RELEASE_next_minor (f3aaab5) will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@                 Coverage Diff                 @@
##           RELEASE_next_minor    #2878   +/-   ##
===================================================
  Coverage               79.10%   79.10%           
===================================================
  Files                     206      206           
  Lines                   31907    31910    +3     
  Branches                 7181     7187    +6     
===================================================
+ Hits                    25239    25242    +3     
  Misses                   4909     4909           
  Partials                 1759     1759           
Impacted Files            Coverage Δ
hyperspy/misc/utils.py    92.19% <100.00%> (+0.24%) ⬆️
hyperspy/signal.py        76.02% <100.00%> (-0.18%) ⬇️


@CSSFrancis (Member, Author) commented:

@magnunor If you have some time, can you look over this too? You wrote the original map_blocks function, so it would be good to have another pair of eyes on this. I can test it against pyxem as well. Hopefully this doesn't break anything in pyxem/pyxem#789; it shouldn't, as most of the changes are rather superficial, but there might be some edge cases I haven't considered.

@magnunor (Contributor) left a comment:

I started going through this and saw some minor things.

Will have more thorough feedback later.

@magnunor (Contributor) left a comment:

Some more minor comments.

@magnunor (Contributor) commented:

Sorry about the slowness on this; I've been using my dev time to fix up some of the failing unit tests.

Some points:

  • Make sure the new code follows the black style "guide". This might be a bit of a pain, since you can't run black straight on the code files, as that would change all of the code. What I typically do is run black with the --check and --diff flags on the specific file: python3 -m black --check --diff hyperspy/tests/signals/test_map_method.py. Alternatively, there are a number of editor plugins: https://black.readthedocs.io/en/stable/integrations/editors.html. However, make sure it doesn't change the style of all the code, since that makes it really difficult to review the changes.
  • Sync RELEASE_next_minor with the branch in this pull request, after the bug fix in "Fix test_blockfile unit test, due to change in dask keyword naming" (#2889) has first been merged into RNp and then "upstreamed" into RELEASE_next_minor.

@CSSFrancis (Member, Author) commented:

Sorry about the slowness on this; I've been using my dev time to fix up some of the failing unit tests.

@magnunor No problem! I've been busy the last couple of days anyway, so I haven't been able to get to this.

  • Make sure the new code follows the black style "guide". This might be a bit of a pain, since you can't run black straight on the code files, as that would change all of the code. What I typically do is run black with the --check and --diff flags on the specific file: python3 -m black --check --diff hyperspy/tests/signals/test_map_method.py. Alternatively, there are a number of editor plugins: https://black.readthedocs.io/en/stable/integrations/editors.html. However, make sure it doesn't change the style of all the code, since that makes it really difficult to review the changes.

Ahh, thanks! I am pretty bad at remembering to use black. Thanks for the tip on using the --check and --diff flags; that should help.

Will do!

@magnunor (Contributor) commented:

I did a fairly thorough check of the functionality, inputting a variety of cases:

  • Several iterating kwargs
  • Several args combined with iterating kwargs
  • lazy signal
  • Output with more dimensions than the input (and iterating kwargs)
  • Signal1D

I also checked the runtime of the map with a real dataset: it looked good! At least for the check I did, the computation time was the same.


I did find one thing which didn't work, which might be an existing issue; ergo, it might be a "bug" in the current version of RELEASE_next_minor, or possibly some strange combination of library versions on my computer. So check whether you can replicate this on your own computer.


This code does not work (on my computer), giving the error: ValueError: operands could not be broadcast together with shapes (2,2) (2,3).

Changing to chunk_shape = (2, 2, 2, 3, 2) fixes the issue. I'm not sure exactly what is causing it; possibly not all of the signal dimensions are rechunked into one single chunk.

import numpy as np
import dask.array as da
import hyperspy.api as hs


def add_sum(image, add1, add2):
    print(add1.shape, add2.shape)
    temp_add = add1.sum(-1) + add2
    out = image + np.sum(temp_add)
    return out


chunk_shape = (2, 2, 2, 2, 2)

x = np.ones((4, 5, 10, 11))
s = hs.signals.Signal2D(x)
s_add1 = hs.signals.BaseSignal(2 * np.ones((4, 5, 2, 3, 2))).transpose(3)
s_add2 = hs.signals.BaseSignal(3 * np.ones((4, 5, 2, 3))).transpose(2)

s = hs.signals.Signal2D(da.from_array(s.data, chunks=(2, 2, 2, 2))).as_lazy()
s_add1 = (
    hs.signals.Signal2D(da.from_array(s_add1.data, chunks=chunk_shape))
    .as_lazy()
    .transpose(navigation_axes=(1, 2))
)

s_out = s.map(add_sum, inplace=False, add1=s_add1, add2=s_add2, lazy_output=False)

Also, one thing with regards to the commit history: there has been an increasing focus on having a clear and readable commit history, as this makes it easier in the future to understand the nature of the code changes. This is a bit of an "advanced" feature, so I'm not sure how much we expect everyone to do it. @ericpre might have some thoughts on this.

@CSSFrancis (Member, Author) commented:

@magnunor Thanks for catching that bug! I think that was a holdover from before: it seems I wasn't checking whether the signal was spanned in the case where the navigation chunks were equal to the chunks of the mapped signal but the signal chunks were different.

I added in a test for that case so it should now work as expected!
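As a minimal sketch of the kind of check involved (illustrative only, not the actual HyperSpy implementation): each signal axis of an iterating argument has to sit in a single chunk before mapping, which a targeted rechunk guarantees:

import dask.array as da

arr = da.ones((4, 5, 2, 3), chunks=(2, 2, 1, 2))  # signal axes split across chunks
nav_dims = 2

# Keep the navigation chunking, but collapse each signal axis into one chunk
# so that every block spans the full signal dimensions.
new_chunks = arr.chunks[:nav_dims] + tuple((n,) for n in arr.shape[nav_dims:])
if arr.chunks != new_chunks:
    arr = arr.rechunk(new_chunks)
assert all(len(c) == 1 for c in arr.chunks[nav_dims:])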

Also, one thing with regards to the commit history: there has been an increasing focus on having a clear and readable commit history, as this makes it easier in the future to understand the nature of the code changes. This is a bit of an "advanced" feature, so I'm not sure how much we expect everyone to do it. @ericpre might have some thoughts on this.

I must admit to being rather poor at this; let me know if you want me to go through and edit my commit messages.

@magnunor (Contributor) commented:

Other than the commit history, I'm happy with this pull request!

@magnunor (Contributor) commented Mar 4, 2022

I think I'll merge this, even with the commit history being a bit messy, to speed things along for the 1.7 release.

@ericpre (Member) left a comment:

The relevant part of the error from the code in #2878 (comment):

  File "C:\Users\M0041User\mambaforge\lib\site-packages\dask\array\core.py", line 2481, in map_blocks
    return map_blocks(func, self, *args, **kwargs)

  File "C:\Users\M0041User\mambaforge\lib\site-packages\dask\array\core.py", line 813, in map_blocks
    raise ValueError(

ValueError: Provided chunks have 2 dims; expected 3 dims

Even if the meaning of this error message is not obvious, it seems to be related to changes in the chunk dimensions. If I understand correctly, we get this error because the drop_axis argument passed to map_blocks is incorrect: in this case it is (3, 2), while the correct value is (4, 3, 2). This is indeed what the fix in this PR does, by considering the signal arguments passed to map to figure out the correct changes in axes. This is good; however, I don't see why we need to change to blockwise and why this can't be done by keeping map_blocks. blockwise is a lower-level API and is used by map_blocks; if we don't have a reason to change to blockwise, I would suggest keeping map_blocks, because if any improvement (optimisation, bug fix, etc.) is made to map_blocks in a future release of dask (which I believe is fairly likely to happen), we would not benefit from it and would have to maintain the equivalent ourselves. The idea is to defer to dask as much as we can by using their high-level API.
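For readers unfamiliar with drop_axis, here is a standalone sketch of the behaviour being described, using dask's public map_blocks API with made-up shapes:

import dask.array as da

x = da.ones((4, 5, 2, 3), chunks=(2, 5, 2, 3))

# The function collapses the two trailing (signal) axes of each block, so both
# of those dimensions must be listed in drop_axis; an incomplete drop_axis
# leaves dask expecting output blocks with more dimensions than the function
# actually returns.
out = da.map_blocks(
    lambda b: b.sum(axis=(-2, -1)),
    x,
    drop_axis=(2, 3),
    dtype=x.dtype,
)
assert out.compute().shape == (4, 5)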

@CSSFrancis (Member, Author) commented:

@magnunor @ericpre Sorry I kind of disappeared on this. I've been at a conference and haven't had as much time as I thought.

As for why to use blockwise vs map_blocks: map_blocks expects the same number of dimensions for each signal. If I remember correctly, there is also a part of the code in map_blocks that doesn't properly account for how to pass the arguments to blockwise. I can try to look back at what the offending code was once I get back to my office.

Additionally, in the past we were just adding dimensions of size 1 to signals to get this to work, which I think was confusing and not very clean; this is where a lot of the previous errors were occurring. It also adds more tasks for dask which are trivial but complicate the task graph. We could probably get it to work using map_blocks with a hacky solution, and I worked on doing that first, but I kept coming to the conclusion that using the lower-level blockwise actually fits our purposes better.

Sorry this isn't very specific, I can get you a better response tomorrow.

@ericpre (Member) commented Mar 6, 2022

As for why to use blockwise vs map_blocks: map_blocks expects the same number of dimensions for each signal.

The example above works fine with map_blocks provided that drop_axis is correct. However, this may not be generally true, particularly when the blocks don't align; if you can confirm this with an example, then that is a good justification! :)
The documentation of map_blocks seems to imply this, when it says: "Map_blocks aligns blocks by block positions without regard to shape. In the following example we have two arrays with the same number of blocks but with different shape and chunk sizes."
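A standalone sketch of that documented behaviour (the arrays here are made up for illustration): map_blocks pairs blocks purely by position, even when the chunk shapes differ, so the function itself has to cope with mismatched block sizes:

import dask.array as da
import numpy as np

x = da.arange(6, chunks=3)   # two blocks of length 3
y = da.arange(10, chunks=5)  # two blocks of length 5

# Block i of x is paired with block i of y regardless of shape.
out = da.map_blocks(
    lambda a, b: np.array([a.sum() + b.sum()]),
    x,
    y,
    chunks=(1,),
    dtype=float,
)
assert out.compute().shape == (2,)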

If I remember correctly, there is also a part of the code in map_blocks that doesn't properly account for how to pass the arguments to blockwise. I can try to look back at what the offending code was once I get back to my office.

If you can check this, that would be useful too! Then the question may be whether this is a dask bug or simply not the expected behaviour of map_blocks.

Additionally, in the past we were just adding dimensions of size 1 to signals to get this to work, which I think was confusing and not very clean; this is where a lot of the previous errors were occurring. It also adds more tasks for dask which are trivial but complicate the task graph. We could probably get it to work using map_blocks with a hacky solution, and I worked on doing that first, but I kept coming to the conclusion that using the lower-level blockwise actually fits our purposes better.

Are you saying that you were doing subsequent map calls as a workaround? If so, indeed, this is not a great workaround!

Commit: …terating signal chunks != signal chunks. Changes to get_iterating_kwargs to check for chunk spanning and rechunk if necessary.
@CSSFrancis (Member, Author) commented:

I fixed this up a little bit. I tried to change some of the commit messages, but I have a bit of a problem: I rebased this branch onto RELEASE_next_minor a couple of times, and I'm not exactly a git expert, so I'm not sure how to change some of the older commit messages without messing up the git history.

@CSSFrancis (Member, Author) commented:

This should be good to go now. I changed as many of the commit messages as I could without creating a new PR with a new commit history or rewriting commit history that isn't mine. Maybe I'll just use this as a learning experience and move on from there.

In response to @ericpre

Are you saying that you were doing subsequent map calls as a workaround? If so, indeed, this is not a great workaround!

I'm saying that in the old code, if you had a signal with shape [4,3|2,2] and an iterating signal [4,3|1], it would change the second signal to size [4,3|1,1] using reshape so that the numbers of dimensions were equal; that is a requirement for using map_blocks (see the sketch below).

This is part of the reason why things were broken in the old version: in the reverse case, where the signal shape was [4,3|1] and the iterating signal shape was [4,3|2,2], the dimensions weren't being added to the signal, and that broke the code. The change needed was minimal, but it still broke in some cases.
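Here is a sketch of that old padding workaround (simplified and with made-up arrays; the [nav|sig] notation above corresponds to the leading/trailing axes here):

import dask.array as da

sig = da.ones((4, 3, 2, 2), chunks=(2, 3, 2, 2))  # nav (4, 3), signal (2, 2)
arg = da.ones((4, 3, 1), chunks=(2, 3, 1))        # nav (4, 3), signal (1,)

# Pad the iterating argument with singleton dimensions so both arrays have the
# same ndim, which map_blocks requires; blockwise makes this step unnecessary.
arg_padded = arg.reshape(arg.shape + (1,) * (sig.ndim - arg.ndim))
assert arg_padded.ndim == sig.ndim  # shape is now (4, 3, 1, 1)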

On the argument for map_blocks vs blockwise: I would say that blockwise is better for our situation. It's possible that there is a bug in dask's map_blocks in regard to what I described above, but we are already overriding many of the reasons to use map_blocks. For example, the automatic dtype checking and the output size are handled in the _map_iterate function. These are necessarily custom in our case, because the automatic type checking from dask fails in many cases, as it tries to run the function on a 0-D array.

In reference to checking dropped and new axes: we already handle all of that, as well as the rechunking, etc.

I think using blockwise cleans things up quite nicely. The code is shorter and more readable, with less potential for behind-the-scenes changes to occur. The reshape task is removed, leading to shorter task graphs. The only downside is that the blockwise function is perhaps less intuitive, but I would argue that there are a lot of pitfalls in using map_blocks that are very difficult to troubleshoot, despite it looking intuitive at first.

@ericpre (Member) left a comment:

Thanks @CSSFrancis for elaborating a bit more on the motivation to use blockwise; I agree that this justifies the change!

Regarding the commit history, I would not worry too much, as long as it is sensible and there isn't a large number of "spurious" commits. One issue you may have with squashing commits or moving commits around is that a merge commit may prevent you from doing it; usually, I do a rebase instead of a merge commit to merge upstream changes, to avoid this issue.

@CSSFrancis (Member, Author) commented:

Thanks @CSSFrancis for elaborating a bit more on the motivation to use blockwise; I agree that this justifies the change!

Regarding the commit history, I would not worry too much, as long as it is sensible and there isn't a large number of "spurious" commits. One issue you may have with squashing commits or moving commits around is that a merge commit may prevent you from doing it; usually, I do a rebase instead of a merge commit to merge upstream changes, to avoid this issue.

@ericpre Yeah, that is probably a better way to do it from now on. I can always just be better about my commit messages as well, though :)

I guess we can start making sure this works with changes in pyxem now!
