Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support nogc concatenation in the outermost dimension #3

Closed
timotheecour opened this issue Jan 10, 2016 · 17 comments
Closed

support nogc concatenation in the outermost dimension #3

timotheecour opened this issue Jan 10, 2016 · 17 comments

Comments

@timotheecour
Copy link

timotheecour commented Jan 10, 2016

algo:

auto  cat!0(S0 a0, S1 a1) if ( isSlice!S0 && isSlice!S1) {
  auto shape0=a0.shape;
  auto shape1=a1.shape;
  shape0[0]+=shape1[0];
  assert(shape0[1..$]==shape1[1..$]);
  return chain(a0.byElement, a1.byElement).sliced(shape);
}

generalizes trivially to variadic arguments

@9il
Copy link
Member

9il commented Jan 10, 2016

This is slow. ByElement is random access range, but it is slow for random access.

@timotheecour
Copy link
Author

why would it have to be used in a random-access way?
i would assume it to be faster than having to allocate the result of the concatenation. eg:

auto a=range0.sliced(m, n0)
auto b=range1.sliced(m, n1)
cat!0(a,b)[2..$-1, 2..$-1].transposed.imwrite("result.jpg");
// => imwrite requires sequential access to elements to obtain a non-strided D array, followed by calling libjpeg. No random access here.

alternative:

auto c=catAllocating(a,b)[2..$-1, 2..$-1].transposed.imwrite("result.jpg");

this is not good as not only it allocates, but it has to call byElement twice: once at the concatenation, and once at the end (inside materializing to an array in imwrite)

@9il
Copy link
Member

9il commented Jan 10, 2016

why would it have to be used in a random-access way?

See section Binary Representation at struct Slice description.

cat!0(a,b)[2..$-1, 2..$-1].transposed.imwrite("result.jpg");

It is possible to implement fast version with input range primitives. However transposed and other functions would not work.

@9il
Copy link
Member

9il commented Jan 10, 2016

But you will be able to use fast version to save images.

@timotheecour
Copy link
Author

i think I see what you're trying to say:
you're saying: if the user uses
cat(a,b) in a random-access way, he should probably allocate first to get a pointer-backed slice so that random-accesses are fast.

I agree with that but i think it's good to give user the option to allocate or not in this case.
I would assume the example I gave above would indeed be faster with the nogc cat(a,b).

Furthermore, instead of catAllocating why not define a general function:

auto reallocate(S)(S a) if (isSlice!S){
  return a.byElement.array.sliced(a.shape);
}
// with another overload that takes a preallocated buffer

//apologies if it already exists in std.experimental.ndslice, but couldnt' find it!

Then: we don't need catAllocating, we can just call:

auto c=cat(a,b).reallocate;
// fast to use c in random access

@9il
Copy link
Member

9il commented Jan 10, 2016

What is the use case for concatenation without allocation?
If you want so save a file, it is possible to do with allocation algorithm (but without memory allocation!).
An OutputRange can be used to save concatenation to memory or to a file. For GC allocation OutputRange is std,array.Appender and for file-streams it is std.stdio.lockingTextWriter (if I am not wrong).

@timotheecour
Copy link
Author

What is use case for concatenation without allocation?

  • delayed allocation (after applying further transformations as in my above example) [1],
  • no allocation nor random access (eg: passing through some reduce operaton)

Concrete example inspired from http://jackstouffer.com/blog/nd_slice.html

"This function does not calculate border cases in which a window overlaps the image partially. However, the function can still be used to carry out such calculations. That can be done by creating an amplified image, with the edges reflected from the original image"

This is a perfect use case of cat (at least in the case of upper and lower borders):

auto image = ...
auto border_up=image(_,R(0,border).reverse);
auto border_down=image(_,R($-1,$-1-border,-1));
auto padded_image=cat(border_up, image, border_down);

// now call some algorithm that uses padded_image and is mostly non-random-access.
// movingWindowByChannel could maybe qualify since most of the access [windows] are sequential access, not random access.

[1] same rationale as "only to be assembled at the very end to avoid unnecessary allocations" in http://jackstouffer.com/blog/nd_slice.html

@9il
Copy link
Member

9il commented Jan 10, 2016

Concrete example inspired from http://jackstouffer.com/blog/nd_slice.html

This is my example. And I am 99.99% sure that median filter will be faster with allocation.

@timotheecour
Copy link
Author

This is my example.

I know :-), and I like it :-)

And I am 99.99% sure that median filter will be faster with allocation.

haven't tried w your example but see this simple test:

https://github.com/timotheecour/dtools/blob/master/dtools/scratch/bench_ndslice/test1.d
=> 2X speedup with lazy cat

I'm obviously not saying it'll always be faster without reallocate after cat, but that it can definitely be faster, so that makes it worth it to provide as a library for user. User can choose to reallocate or not for his use case.

@9il
Copy link
Member

9il commented Jan 10, 2016

=> 2X speedup with lazy cat

Lazy cat of Slice over Iota (It not allocated at memory at all) VS Lazy cat + allocation into memory.
It is not fair. It is not from real world.

Fair:
Lazy cat of Slice over array VS Reallocation + (without cat) Slice over array.

BTW: indexSlice could help to create fast lazy cat.

@timotheecour
Copy link
Author

the speedup was also 2X when I replaced:

auto image = iota(n0 * n1).sliced(n0, n1)[1 .. $, 1 .. $];

with

auto image = iota(n0 * n1).array.sliced(n0, n1)[1 .. $, 1 .. $];

Also, I made a more real-world example, see updated code in
https://github.com/timotheecour/dtools/blob/master/dtools/scratch/bench_ndslice/test1.d
which computes an integral image over a mirror-padded image.

The speedup with lazy cat is now 1.4X in this example.

I will take a look at indexSlice.

@9il
Copy link
Member

9il commented Jan 11, 2016

I will take a look at indexSlice.

The idea is to propagate Slice primitives over a simple generic Cat (with multidimensonal auto opIndex(auto ref size_t[N] lengths) ) using IndexSlice.

@9il
Copy link
Member

9il commented Mar 6, 2017

Partially solved by stack. We can add opIndex primitive to it, so it will be more flexible.

@9il
Copy link
Member

9il commented Mar 6, 2017

libmir/mir-algorithm#29 fixes this issue.

auto s = stack!dim(...);
auto v = s.indexed(s.shape.ndiota);
static assert(isSlice!(typeof(v)));

This is more accurate and faster then flattened workaround.

@9il 9il closed this as completed Mar 6, 2017
@9il
Copy link
Member

9il commented Mar 9, 2017

constant padding was added libmir/mir-algorithm#31

@9il
Copy link
Member

9il commented Mar 9, 2017

edge, symmetric and wrap paddings libmir/mir-algorithm#32

@9il
Copy link
Member

9il commented Mar 9, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants