Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Features/510 split #677

Merged
merged 34 commits into from Nov 17, 2020
Merged

Features/510 split #677

merged 34 commits into from Nov 17, 2020

Conversation

lenablind
Copy link
Collaborator

@lenablind lenablind commented Sep 25, 2020

Description

Implementation of functions split(), vsplit(), hsplit(), dsplit() based on np analogues.
For process-local operations, I use torch.split

Docs numpy:

Docs pytorch: https://pytorch.org/docs/stable/generated/torch.split.html

differences numpy and torch

This affects the input parameter indices_or_sections (np) / split_size_or_sections (torch).
As the variations in the given names might already reveal, the parameters contain different types of information depending on the used library.

Numpy:

  • integer:
    In this case, indices_or_sections indicates in how many parts the input ary shall be split.
    Therefore, indices_or_sections = 2 will result into a list containing 2 DNDarrays.
  • array_like
    On the other hand, indices_or_sections can be handled as a list containing the boundaries of a series of slices
    For instance, indices_or_sections = [1, 3, 4] will indicate the following data chunks:
    - [: 1]
    - [1: 3]
    - [3: 4]
    - [4: ]

Torch:
In contrary to numpy, the torch analogue always contains the size(s) of the resulting tensor(s).
If integer, all (function-)split tensors will have the same size, otherwise the possibly varying split sizes might be defined in an array_like.

According to these differences, the parameter semantics of numpy are mapped to those of torch as explained in the following.

Strategy

As vsplit, dsplit, hsplit are no more than calls of split with a specific axis, I'll reduce the explanation to the algorithm of the latter.

Depending on whether ary is distributed in the same dimension as it is split within the function (therefore ary.split == axis), the strategy varies, as the data chunks on each node have to be reorganized correctly in the resulting list of (function-) split DNDarrays. (Challenge: Data of (function-) split DNDarrays is split via MPI)

axis != ary.split

Mapping np -> torch:
If indices_or_sections is integer, the resulting size for all DNDarrays will be calculated using the size of the affected axis.
If indices_or_sections is array_like, the resulting split sizes are calculated usinght.diff (1st discrete difference).

axis == ary.split

  • indices_or_sections is integer:
    • CASE 1 : number of processes == indices_or_selections
      In this case, the split matches the distribution and the data on the node can be used directly and inserted into the resulting list.
      The remaining DNDarrays are filled with empty DNDarrays of the needed shape.
    • CASE 2: number of processes != indices_or_selections
      The goal is, to use torch.split with the process-adapted/correctly calculated indices_or_sections parameter (new_indices)
      1. Calculate the index of the first element in the resulting list which needs data of the current process
      2. Calculate the amount of data which is needed to fill this chunk up
      3. Use the remaining data of the process for the following chunks
  • indices_or_sections is array_like / DNDarray after sanitation
    The goal is, to use torch.split with the process-adapted/correctly calculated indices_or_sections parameter.
    Therefore, reduce the input information of indices_or_sections to the process relevant/ the given indices (of the global DNDarray) which correspond to the (local) data on the node. The needed information is provided via the slice out of ht.comm.chunk
    Afterwards, map the numpy to the torch semantics, as described above in axis != ary.split

In all cases, every DNDarray within the list is balanced before being returned.

Overview split functions and related axis

Function Axis
vsplit 0
hsplit 1
dsplit 2

Hint: In contrary to the corresponding general split function call with axis 1, the input within hsplit is allowed to be 1-dimensional.
This results into some dimension changing operations in the HeAT version, more specifically a reshape adding one dimension and a flattening afterwards, to meet full consistency with numpy.

Issue/s resolved: #510

Changes proposed:

  • Implemented new functions: manipulations.split(), manipulations.hsplit(), manipulations.vsplit(), manipulations.dsplit()
  • Implemented corresponding tests

Type of change

  • New feature (non-breaking change which adds functionality)

Due Diligence

  • All split configurations tested
  • Multiple dtypes tested in relevant functions
  • Documentation updated (if needed)
  • Updated changelog.md under the title "Pending Additions"

Does this change modify the behaviour of other functions? If so, which?

no

@mtar
Copy link
Collaborator

mtar commented Sep 25, 2020

GPU cluster tests are currently disabled on this Pull Request.

@codecov
Copy link

codecov bot commented Sep 25, 2020

Codecov Report

Merging #677 (454cdee) into master (170592a) will increase coverage by 0.04%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #677      +/-   ##
==========================================
+ Coverage   97.54%   97.59%   +0.04%     
==========================================
  Files          87       87              
  Lines       17698    18030     +332     
==========================================
+ Hits        17264    17596     +332     
  Misses        434      434              
Impacted Files Coverage Δ
heat/core/manipulations.py 99.22% <100.00%> (+0.06%) ⬆️
heat/core/tests/test_manipulations.py 99.94% <100.00%> (+<0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 170592a...454cdee. Read the comment docs.

@mtar
Copy link
Collaborator

mtar commented Sep 29, 2020

run tests

Copy link
Member

@coquelin77 coquelin77 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this all looks good to me. However, there is a larger comment to be discussed.

Do we want to have this return actual views of the DNDarray? If so, it would mean that we cannot balance the DNDarrays, and it might change some of the logic involved in the function itself. @ClaudiaComito @Markus-Goetz , thoughts?

heat/core/manipulations.py Outdated Show resolved Hide resolved
Comment on lines +985 to +990
if len(ary.lshape) < 2:
ary = reshape(ary, (1, ary.lshape[0]))
result = split(ary, indices_or_sections, 1)
result = [flatten(sub_array) for sub_array in result]
else:
result = split(ary, indices_or_sections, 1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this needed in dsplit as well?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@coquelin77 No, it isn't. hsplit is some kind of a special function among the three split variations. In contrary to the corresponding general split function call with axis 1, the ary is allowed to be 1-dimensional (in numpy) even if (functionally) split among the second axis. As it isn't in torch, ary has to be reshaped (adding one dimension). The subsequent flattening is needed to meet full consistency with numpy.
As this exception doesn't occur within the other split functions, no shape manipulating is needed there.

heat/core/manipulations.py Outdated Show resolved Hide resolved
Comment on lines +2140 to +2141
for sub_DNDarray in sub_arrays_ht:
sub_DNDarray.balance_()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think that this should be communicated in the docs. Since all of the sub-arrays are balanced, they are not linked to the original data points...this spawns a larger question.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@coquelin77 I agree, but I guess this strongly depends on the general discussion above.

@coquelin77
Copy link
Member

rerun tests

Copy link
Contributor

@ClaudiaComito ClaudiaComito left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Lena,

before I look into the details, something important jumps to my eye: np.split & co. return lists of subarrays as views into the original array. The current heat implementation also says so, but really it returns copies. To exemplify this:

>>> import numpy as np
>>> import heat as ht
>>> a = np.arange(16).reshape(4,4)
>>> a
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])
>>> b, c = np.split(a, 2)
>>> b[0,0] = 10
>>> b
array([[10,  1,  2,  3],
       [ 4,  5,  6,  7]])
>>> a
array([[10,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

>>> ht_a = ht.arange(16).reshape((4,4))
>>> ht_a
DNDarray([[ 0,  1,  2,  3],
          [ 4,  5,  6,  7],
          [ 8,  9, 10, 11],
          [12, 13, 14, 15]], dtype=ht.int32, device=cpu:0, split=None)
>>> ht_b, ht_c = ht.split(ht_a, 2)
>>> ht_b[0,0] = 10
>>> ht_b
DNDarray([[10,  1,  2,  3],
          [ 4,  5,  6,  7]], dtype=ht.int32, device=cpu:0, split=None)
>>> ht_a
DNDarray([[ 0,  1,  2,  3],
          [ 4,  5,  6,  7],
          [ 8,  9, 10, 11],
          [12, 13, 14, 15]], dtype=ht.int32, device=cpu:0, split=None)

DNDarray([])
]

"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here we should raise an exception if ary.ndim < 3

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that is the case, an exception is raised within split, so I think raising an additional one in dsplit would be superfluous

@ClaudiaComito
Copy link
Contributor

this all looks good to me. However, there is a larger comment to be discussed.

Do we want to have this return actual views of the DNDarray? If so, it would mean that we cannot balance the DNDarrays, and it might change some of the logic involved in the function itself. @ClaudiaComito @Markus-Goetz , thoughts?

I'm with you @coquelin77, no balancing. We need ht.view

@coquelin77
Copy link
Member

this all looks good to me. However, there is a larger comment to be discussed.
Do we want to have this return actual views of the DNDarray? If so, it would mean that we cannot balance the DNDarrays, and it might change some of the logic involved in the function itself. @ClaudiaComito @Markus-Goetz , thoughts?

I'm with you @coquelin77, no balancing. We need ht.view

this isnt a major change from the code, it could be done with relative ease if the balance is dropped. however, if the hard splits result in a partial view on two processes, this would be hard to communicate to users

@Markus-Goetz
Copy link
Member

@coquelin77 @ClaudiaComito For now it should return a copy, generally however I would obviously like to have something like a ht.view of the data, especially silent return of it. I am not 100% sure whether we can actually achieve this easily, due to the additional properties that we carry around and would have to watch out for. Yet, if somebody would successfully implement this, that would be a solid success for squeezing out further performance

Copy link
Member

@coquelin77 coquelin77 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like the comment about views vs copies didnt make it into the docs of dsplit, hsplit. or vsplit

Returns
-------
sub_arrays : list of DNDarrays
A list of sub-DNDarrays as views into ary.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copies not views, also ary an be a code block (ary)

also can you add the copy statement to the description of the function as well. i think that people will often only read that part

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh dear, sorry I missed that. I'll fix it and wrap ary into a code block. Are all markdown features available in the dosctrings/correctly rendered in sphinx? And you're right, the general description is probably the most relevant part to most, I'll add the hint

@lenablind
Copy link
Collaborator Author

rerun tests

1 similar comment
@coquelin77
Copy link
Member

rerun tests

Copy link
Member

@coquelin77 coquelin77 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like the comment about views vs copies didnt make it into the docs of dsplit, hsplit. or vsplit

@coquelin77
Copy link
Member

looks like the comment about views vs copies didnt make it into the docs of dsplit, hsplit. or vsplit

sorry this was a remnant of before, ignore this

@coquelin77 coquelin77 merged commit debeaf7 into master Nov 17, 2020
@coquelin77 coquelin77 deleted the features/510-split branch November 17, 2020 14:43
@ClaudiaComito ClaudiaComito mentioned this pull request Nov 27, 2020
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

implement split functions
5 participants