New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimization pass #58

Merged
merged 48 commits into from Nov 12, 2016

Conversation

Projects
None yet
2 participants
@ljubobratovicrelja
Member

ljubobratovicrelja commented Sep 29, 2016

This should be first optimization pass advised in #34 and #27, mostly by refactoring loops into mir.ndslice.algorithm iterations.

@9il I'll try making a commit per function (or related group of functions) when optimized. I'm making an early wip PR, so I can apply corrections from potential reviews to further optimizations. Hope you're OK with this plan.

Progress:

  • dcv.features.corner.harris
  • dcv.features.utils.extractCorners
  • dcv.imgproc.color
  • dcv.imgproc.convolution
  • dcv.imgproc.filter.bilateralFilter
  • dcv.imgproc.filter.calcGradients
  • dcv.imgproc.filter.calcPartialDerivatives
  • dcv.imgproc.filter.canny
  • dcv.imgproc.filter.filterNonMaximum
  • dcv.imgproc.filter.nonMaximaSupression
  • dcv.imgproc.imgmanip.remap/warp
  • dcv.imgproc.threshold.threshold
@ljubobratovicrelja

This comment has been minimized.

Show comment
Hide comment
@ljubobratovicrelja

ljubobratovicrelja Sep 29, 2016

Member

After d870aa6, running benchmark on my laptop yields:

$ ./benchmark.sh master -i 1000 -t harris

-----------------------------------------------------------------------------------
Profiling...
-----------------------------------------------------------------------------------

master:
dcv.features.corner.harris.harrisCorners.3:775 ms and 185 μs
dcv.features.corner.harris.harrisCorners.5:1 sec, 527 ms, and 822 μs
dcv.features.corner.harris.shiTomasiCorners.3:724 ms and 623 μs
dcv.features.corner.harris.shiTomasiCorners.5:1 sec, 529 ms, and 490 μs
master profiling done, results written in /home/relja/git/dcv/tests/performance-tests/.cache/master/tests/performance-tests/profile.csv

this:
dcv.features.corner.harris.harrisCorners.3:499 ms and 515 μs
dcv.features.corner.harris.harrisCorners.5:586 ms and 269 μs
dcv.features.corner.harris.shiTomasiCorners.3:431 ms and 233 μs
dcv.features.corner.harris.shiTomasiCorners.5:578 ms and 913 μs
 profiling done, results written in /home/relja/git/dcv/tests/performance-tests/profile.csv

-----------------------------------------------------------------------------------
Comparing and writing comparison results...
-----------------------------------------------------------------------------------

Comparison done, results written in /home/relja/git/dcv/tests/performance-tests/benchmark.csv

-----------------------------------------------------------------------------------
Results...
-----------------------------------------------------------------------------------

Function Name,Previous Runtime[μs],Current Runtime[μs],Speedup[%]
dcv.features.corner.harris.harrisCorners.3,775185,499515,55
dcv.features.corner.harris.harrisCorners.5,1527822,586269,160
dcv.features.corner.harris.shiTomasiCorners.3,724623,431233,68
dcv.features.corner.harris.shiTomasiCorners.5,1529490,578913,164
Member

ljubobratovicrelja commented Sep 29, 2016

After d870aa6, running benchmark on my laptop yields:

$ ./benchmark.sh master -i 1000 -t harris

-----------------------------------------------------------------------------------
Profiling...
-----------------------------------------------------------------------------------

master:
dcv.features.corner.harris.harrisCorners.3:775 ms and 185 μs
dcv.features.corner.harris.harrisCorners.5:1 sec, 527 ms, and 822 μs
dcv.features.corner.harris.shiTomasiCorners.3:724 ms and 623 μs
dcv.features.corner.harris.shiTomasiCorners.5:1 sec, 529 ms, and 490 μs
master profiling done, results written in /home/relja/git/dcv/tests/performance-tests/.cache/master/tests/performance-tests/profile.csv

this:
dcv.features.corner.harris.harrisCorners.3:499 ms and 515 μs
dcv.features.corner.harris.harrisCorners.5:586 ms and 269 μs
dcv.features.corner.harris.shiTomasiCorners.3:431 ms and 233 μs
dcv.features.corner.harris.shiTomasiCorners.5:578 ms and 913 μs
 profiling done, results written in /home/relja/git/dcv/tests/performance-tests/profile.csv

-----------------------------------------------------------------------------------
Comparing and writing comparison results...
-----------------------------------------------------------------------------------

Comparison done, results written in /home/relja/git/dcv/tests/performance-tests/benchmark.csv

-----------------------------------------------------------------------------------
Results...
-----------------------------------------------------------------------------------

Function Name,Previous Runtime[μs],Current Runtime[μs],Speedup[%]
dcv.features.corner.harris.harrisCorners.3,775185,499515,55
dcv.features.corner.harris.harrisCorners.5,1527822,586269,160
dcv.features.corner.harris.shiTomasiCorners.3,724623,431233,68
dcv.features.corner.harris.shiTomasiCorners.5,1529490,578913,164
@9il

This comment has been minimized.

Show comment
Hide comment
@9il

9il Sep 29, 2016

Member

Are tests is single thread?

Member

9il commented Sep 29, 2016

Are tests is single thread?

@9il

This comment has been minimized.

Show comment
Hide comment
@9il

9il Sep 29, 2016

Member

Do you use mcpu=native flag for tests?
You may want to add fastmath

Member

9il commented Sep 29, 2016

Do you use mcpu=native flag for tests?
You may want to add fastmath

@ljubobratovicrelja

This comment has been minimized.

Show comment
Hide comment
@ljubobratovicrelja

ljubobratovicrelja Sep 29, 2016

Member

Are tests is single thread?

No - do you think they should be? I wanted to test raw performance, seemed like there's no need to run them in single thread. Anyhow this is easily done because of optional parallelization, so it might as well be defined by optional flag when running tests.

Do you use mcpu=native flag for tests?

Oops, I was sure I added that... Not at the moment, will add it.

Member

ljubobratovicrelja commented Sep 29, 2016

Are tests is single thread?

No - do you think they should be? I wanted to test raw performance, seemed like there's no need to run them in single thread. Anyhow this is easily done because of optional parallelization, so it might as well be defined by optional flag when running tests.

Do you use mcpu=native flag for tests?

Oops, I was sure I added that... Not at the moment, will add it.

@9il

This comment has been minimized.

Show comment
Hide comment
@9il
Member

9il commented Sep 29, 2016

@ljubobratovicrelja

This comment has been minimized.

Show comment
Hide comment
@ljubobratovicrelja

ljubobratovicrelja Sep 29, 2016

Member

Good paper http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.93.1608&rep=rep1&type=pdf

Will take a look, thanks.

I originally wanted to implement constant time median filtering described here:
https://nomis80.org/ctmf.pdf

AFAIK it is commonly best used with bigger kernels. It's also vectorizable like the one described in the paper you've posted, which is great. We'll definitely talk about this in future..

Member

ljubobratovicrelja commented Sep 29, 2016

Good paper http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.93.1608&rep=rep1&type=pdf

Will take a look, thanks.

I originally wanted to implement constant time median filtering described here:
https://nomis80.org/ctmf.pdf

AFAIK it is commonly best used with bigger kernels. It's also vectorizable like the one described in the paper you've posted, which is great. We'll definitely talk about this in future..

@9il

This comment has been minimized.

Show comment
Hide comment
@9il

9il Sep 30, 2016

Member

I originally wanted to implement constant time median filtering described here:
https://nomis80.org/ctmf.pdf

Looks very good too, probably better for median filter

Member

9il commented Sep 30, 2016

I originally wanted to implement constant time median filtering described here:
https://nomis80.org/ctmf.pdf

Looks very good too, probably better for median filter

@ljubobratovicrelja

This comment has been minimized.

Show comment
Hide comment
@ljubobratovicrelja

ljubobratovicrelja Oct 1, 2016

Member

Finished dcv.imgproc.color module. I was mainly extracting conversion algorithm code from the main function, to be easily replaced afterwards with std.experimental.color conversion implementation. And by that replacing loops and lockstep with ndEach, ndMap and assumeSameStructure. With rgb2gray and gray2rgb it'd be really nice to have support for packed slices in assumeSafeStructure. Overall improvement in speed is nice:

-----------------------------------------------------------------------------------
Results...
-----------------------------------------------------------------------------------

Function Name,Previous Runtime[μs],Current Runtime[μs],Speedup[%]
dcv.imgproc.color.gray2rgb,284986,259867,9
dcv.imgproc.color.hsv2rgb,8043678,1001873,702
dcv.imgproc.color.rgb2gray,3845346,489591,685
dcv.imgproc.color.rgb2hsv,6374634,1177531,441
dcv.imgproc.color.rgb2yuv,4144262,1067521,288
dcv.imgproc.color.yuv2rgb,3945694,950825,314
Member

ljubobratovicrelja commented Oct 1, 2016

Finished dcv.imgproc.color module. I was mainly extracting conversion algorithm code from the main function, to be easily replaced afterwards with std.experimental.color conversion implementation. And by that replacing loops and lockstep with ndEach, ndMap and assumeSameStructure. With rgb2gray and gray2rgb it'd be really nice to have support for packed slices in assumeSafeStructure. Overall improvement in speed is nice:

-----------------------------------------------------------------------------------
Results...
-----------------------------------------------------------------------------------

Function Name,Previous Runtime[μs],Current Runtime[μs],Speedup[%]
dcv.imgproc.color.gray2rgb,284986,259867,9
dcv.imgproc.color.hsv2rgb,8043678,1001873,702
dcv.imgproc.color.rgb2gray,3845346,489591,685
dcv.imgproc.color.rgb2hsv,6374634,1177531,441
dcv.imgproc.color.rgb2yuv,4144262,1067521,288
dcv.imgproc.color.yuv2rgb,3945694,950825,314

@ljubobratovicrelja ljubobratovicrelja added this to the Patch-up v0.1 milestone Oct 1, 2016

@9il

This comment has been minimized.

Show comment
Hide comment
@9il

9il Oct 1, 2016

Member

First, the speed up numbers are awesome! I would be very happy to see your blog post about this optimizaiton! For example, at http://blog.mir.dlang.io/

With rgb2gray and gray2rgb it'd be really nice to have support for packed slices in assumeSafeStructure.

The packed slices are not required here:
For example, for rgb2gray kernel algorithm will looks something like:

...
{
    auto prealloc = uninitilizedSlice!(V[3])(slice.shape);
    assumeSameStructure!("a", "b")(prealloc, slice).ndEach!rgb2grayImpl();
    ...
    return cast(typeof(return)) prealloc;
}
void rgb2grayImpl(P)(P pointerTuple) // no need ref
{
    auto v = pointerTuple.b;
    pointerTuple.a[0] = v;
    pointerTuple.a[1] = v;
    pointerTuple.a[2] = v;
}
Member

9il commented Oct 1, 2016

First, the speed up numbers are awesome! I would be very happy to see your blog post about this optimizaiton! For example, at http://blog.mir.dlang.io/

With rgb2gray and gray2rgb it'd be really nice to have support for packed slices in assumeSafeStructure.

The packed slices are not required here:
For example, for rgb2gray kernel algorithm will looks something like:

...
{
    auto prealloc = uninitilizedSlice!(V[3])(slice.shape);
    assumeSameStructure!("a", "b")(prealloc, slice).ndEach!rgb2grayImpl();
    ...
    return cast(typeof(return)) prealloc;
}
void rgb2grayImpl(P)(P pointerTuple) // no need ref
{
    auto v = pointerTuple.b;
    pointerTuple.a[0] = v;
    pointerTuple.a[1] = v;
    pointerTuple.a[2] = v;
}
@9il

This comment has been minimized.

Show comment
Hide comment
@9il

9il Oct 1, 2016

Member

Also, probably std.experimental.color can be used for some kernels

Member

9il commented Oct 1, 2016

Also, probably std.experimental.color can be used for some kernels

Show outdated Hide outdated source/dcv/imgproc/color.d
prealloc = uninitializedArray!(V[])(range.shape[0] * range.shape[1]).sliced(range.shape[0], range.shape[1]);
auto r = cast(float)(pack[0].rgb) / 255.0f;
auto g = cast(float)(pack[1].rgb) / 255.0f;
auto b = cast(float)(pack[2].rgb) / 255.0f;

This comment has been minimized.

@9il

9il Oct 1, 2016

Member

Fastmath may replace it with: cast(float)(pack[0].rgb) * (1.0f / 255.0f);
In the same time, this should be OK if (1.0f / 255.0f) * 255.0f == 1.0f.
The same true for 65535.0f.

@9il

9il Oct 1, 2016

Member

Fastmath may replace it with: cast(float)(pack[0].rgb) * (1.0f / 255.0f);
In the same time, this should be OK if (1.0f / 255.0f) * 255.0f == 1.0f.
The same true for 65535.0f.

This comment has been minimized.

@ljubobratovicrelja

ljubobratovicrelja Oct 2, 2016

Member

Yes I think so. I changed it just to be sure. There are also num / 100.0f in few places that should be replaced with num * 0.01f.

@ljubobratovicrelja

ljubobratovicrelja Oct 2, 2016

Member

Yes I think so. I changed it just to be sure. There are also num / 100.0f in few places that should be replaced with num * 0.01f.

Show outdated Hide outdated source/dcv/imgproc/color.d
m[] /= m.sum;
m[0].swap(m[2]);
auto m = rgb2GrayMltp[conv];
auto gray = range.pack!1.ndMap!(rgb => cast(V)(rgb[2] * m[0] + rgb[1] * m[1] + rgb[0] * m[2]));

This comment has been minimized.

@9il

9il Oct 1, 2016

Member

It is better to define all inner lambdas as separate functions.
This will allow to apply @fastmath for them.
In addition, if the stide!2 == 1 (normal situation), then is much better/faster to use Slice!(2, XXX*), where XXX is a color or static array.

@9il

9il Oct 1, 2016

Member

It is better to define all inner lambdas as separate functions.
This will allow to apply @fastmath for them.
In addition, if the stide!2 == 1 (normal situation), then is much better/faster to use Slice!(2, XXX*), where XXX is a color or static array.

This comment has been minimized.

@ljubobratovicrelja

ljubobratovicrelja Oct 2, 2016

Member

It is better to define all inner lambdas as separate functions.

OK, thanks. There's just one thing - this way function has to be public. If private I get this (example from gray2rbg):

../../.dub/packages/mir-0.18.2/mir/source/mir/ndslice/algorithm.d-mixin-414-mixin-504(504): Error: function dcv.imgproc.color.gray2rgbImpl!(Index).gray2rgbImpl is not accessible from module algorithm

This is not a huge problem, but those functions may clutter the public API. Is there a workaround for this? Seems like a language limitation, so I suppose there's nothing else to be done from Mir side?

@ljubobratovicrelja

ljubobratovicrelja Oct 2, 2016

Member

It is better to define all inner lambdas as separate functions.

OK, thanks. There's just one thing - this way function has to be public. If private I get this (example from gray2rbg):

../../.dub/packages/mir-0.18.2/mir/source/mir/ndslice/algorithm.d-mixin-414-mixin-504(504): Error: function dcv.imgproc.color.gray2rgbImpl!(Index).gray2rgbImpl is not accessible from module algorithm

This is not a huge problem, but those functions may clutter the public API. Is there a workaround for this? Seems like a language limitation, so I suppose there's nothing else to be done from Mir side?

Show outdated Hide outdated source/dcv/features/corner/harris.d
auto winSqr = winSize ^^ 2;
auto winHalf = winSize / 2;
window.ndEach!((pack) { auto gx = pack.fx; auto gy = pack.fy; r1 += gx * gx; r2 += gx * gy; r3 += gy * gy; });

This comment has been minimized.

@9il

9il Oct 1, 2016

Member

outer fastmath kernel

@9il

9il Oct 1, 2016

Member

outer fastmath kernel

Show outdated Hide outdated source/dcv/imgproc/color.d
if (hsv[0] < 0)
hsv[0] += 360;
assumeSameStructure!("rgb", "hsv")(range, prealloc).pack!1.ndEach!((p) { rgb2hsvImpl!(V, R)(p); });

This comment has been minimized.

@9il

9il Oct 1, 2016

Member

Just, ndEach!(rgb2hsvImpl!(...), otherwise @fastmath may not work properly

@9il

9il Oct 1, 2016

Member

Just, ndEach!(rgb2hsvImpl!(...), otherwise @fastmath may not work properly

This comment has been minimized.

@9il

9il Oct 1, 2016

Member

Again, Slice!(2, XXX*) should be used here instead of Slice!(3, Type*).
The reason is very simple. The color length can be known at compile time and the last stride equals to 1. This is very good optimization. So, std.experimental.color can help significantly.
Note, that you can convert, for example Slice!(2, RGB*) <--> Slice!(3, ubyte*).

@9il

9il Oct 1, 2016

Member

Again, Slice!(2, XXX*) should be used here instead of Slice!(3, Type*).
The reason is very simple. The color length can be known at compile time and the last stride equals to 1. This is very good optimization. So, std.experimental.color can help significantly.
Note, that you can convert, for example Slice!(2, RGB*) <--> Slice!(3, ubyte*).

@ljubobratovicrelja

This comment has been minimized.

Show comment
Hide comment
@ljubobratovicrelja

ljubobratovicrelja Oct 2, 2016

Member

First, the speed up numbers are awesome! I would be very happy to see your blog post about this optimizaiton! For example, at http://blog.mir.dlang.io/

Yes - definitely. I think its very important for people to see how nd iterations are faster with ndslice.algorithm then with loops. This is important for Mir, but also very important for D - this example could mean a lot to newcomers searching for fast numerical code. I'll write it when we're finished with this PR, OK?

For example, for rgb2gray kernel algorithm will looks something like:

Awesome, thanks!

Also, probably std.experimental.color can be used for some kernels

Yes, for sure - I'll change this in following PRs.

Member

ljubobratovicrelja commented Oct 2, 2016

First, the speed up numbers are awesome! I would be very happy to see your blog post about this optimizaiton! For example, at http://blog.mir.dlang.io/

Yes - definitely. I think its very important for people to see how nd iterations are faster with ndslice.algorithm then with loops. This is important for Mir, but also very important for D - this example could mean a lot to newcomers searching for fast numerical code. I'll write it when we're finished with this PR, OK?

For example, for rgb2gray kernel algorithm will looks something like:

Awesome, thanks!

Also, probably std.experimental.color can be used for some kernels

Yes, for sure - I'll change this in following PRs.

@9il

This comment has been minimized.

Show comment
Hide comment
@9il

9il Oct 2, 2016

Member

OK?

Looking forward!

Member

9il commented Oct 2, 2016

OK?

Looking forward!

@9il

This comment has been minimized.

Show comment
Hide comment
@9il

9il Oct 2, 2016

Member

see also #59

Member

9il commented Oct 2, 2016

see also #59

Show outdated Hide outdated source/dcv/features/rht.d
import std.array : appender;
auto points = appender!(Point[])();
assumeSameStructure!("index", "value")(indexSlice(image.length!0, image.length!1), image).ndEach!((p) {

This comment has been minimized.

@9il

9il Oct 2, 2016

Member

This would not work if image is part of the larger image. Alternative:

foreach(row; image)
{
   y++;
   x = 0;
   foreach(ref p; row)
   {
       x++;
       ...
   }
}
@9il

9il Oct 2, 2016

Member

This would not work if image is part of the larger image. Alternative:

foreach(row; image)
{
   y++;
   x = 0;
   foreach(ref p; row)
   {
       x++;
       ...
   }
}

This comment has been minimized.

@ljubobratovicrelja

ljubobratovicrelja Oct 2, 2016

Member

Is ref in ref p needed for speed? Otherwise, here it's not.

@ljubobratovicrelja

ljubobratovicrelja Oct 2, 2016

Member

Is ref in ref p needed for speed? Otherwise, here it's not.

@9il

This comment has been minimized.

Show comment
Hide comment
@9il

9il Oct 2, 2016

Member

assumeSameStructure requires identical strides, which is not true for sub-images.

Member

9il commented Oct 2, 2016

assumeSameStructure requires identical strides, which is not true for sub-images.

@ljubobratovicrelja

This comment has been minimized.

Show comment
Hide comment
@ljubobratovicrelja

ljubobratovicrelja Oct 2, 2016

Member

assumeSameStructure requires identical strides, which is not true for sub-images.

Ah, tnx. Sorry didn't see that, ignore the comment above.

Member

ljubobratovicrelja commented Oct 2, 2016

assumeSameStructure requires identical strides, which is not true for sub-images.

Ah, tnx. Sorry didn't see that, ignore the comment above.

ljubobratovicrelja added some commits Oct 9, 2016

@ljubobratovicrelja

This comment has been minimized.

Show comment
Hide comment
@ljubobratovicrelja

ljubobratovicrelja Nov 11, 2016

Member

@henrygouk could you take a look at the merge with master diff (after bilateral sigma separation)? Selected line and those few below are the only edits, else is more-less taken from master. I believe I got the equation right, but I'd appreciate a double check. :)

Member

ljubobratovicrelja commented Nov 11, 2016

@henrygouk could you take a look at the merge with master diff (after bilateral sigma separation)? Selected line and those few below are the only edits, else is more-less taken from master. I believe I got the equation right, but I'd appreciate a double check. :)

@ljubobratovicrelja

This comment has been minimized.

Show comment
Hide comment
@ljubobratovicrelja

ljubobratovicrelja Nov 11, 2016

Member

@9il I'd like to merge this one. PR has grown far beyond my expectations, and I believe that detailed revision at this point would be absurd. Except from few template parameter changes to allow lazy slices as inputs, API hasn't change, so I believe there would be no braking changes on merge. Please let me know if you think something else should be done before merge. And I'll tend to brake changes like this into smaller PRs in future, really sorry about this one...

Here is the benchmark result on the current state of the branch (take a special look at threshold results - thanks for the tip on lockstep):

Function Name,Previous Runtime[μs],Current Runtime[μs],Speedup[%]
dcv.features.corner.fast.FASTDetector,34834,35779,-2
dcv.features.corner.harris.harrisCorners.3,1624687,278579,483
dcv.features.corner.harris.harrisCorners.5,4406410,328159,1242
dcv.features.corner.harris.shiTomasiCorners.3,1583896,223794,607
dcv.features.corner.harris.shiTomasiCorners.5,4422529,297106,1388
dcv.features.rht.RhtCircles,1064020,987154,7
dcv.features.rht.RhtLines,1096606,1013977,8
dcv.features.utils.extractCorners,3164128,355564,789
dcv.imgproc.color.gray2rgb,441354,8918,4849
dcv.imgproc.color.hsv2rgb,433122,51392,742
dcv.imgproc.color.rgb2gray,262186,31813,724
dcv.imgproc.color.rgb2hsv,365969,65572,458
dcv.imgproc.color.rgb2yuv,272942,221719,23
dcv.imgproc.color.yuv2rgb,277880,219906,26
dcv.imgproc.convolution.conv.1D.3,124888,67486,85
dcv.imgproc.convolution.conv.1D.5,159795,68881,131
dcv.imgproc.convolution.conv.1D.7,206059,75361,173
dcv.imgproc.convolution.conv.2D.3x3,767058,120216,538
dcv.imgproc.convolution.conv.2D.5x5,1941055,360809,437
dcv.imgproc.convolution.conv.2D.7x7,3719552,865524,329
dcv.imgproc.convolution.conv.3D.3x3,2091025,374006,459
dcv.imgproc.convolution.conv.3D.5x5,5547364,1074208,416
dcv.imgproc.filter.bilateralFilter.3,6118754,1778482,244
dcv.imgproc.filter.bilateralFilter.5,16718651,4597027,263
dcv.imgproc.filter.calcGradients,2244798,506101,343
dcv.imgproc.filter.calcPartialDerivatives,428318,141520,202
dcv.imgproc.filter.canny,4108987,758240,441
dcv.imgproc.filter.close,705101,657362,7
dcv.imgproc.filter.dilate,638126,583649,9
dcv.imgproc.filter.erode,78763,79369,0
dcv.imgproc.filter.filterNonMaximum,477543,38968,1125
dcv.imgproc.filter.histEqual,103582,105822,-2
dcv.imgproc.filter.medianFilter.3,1029246,1019109,0
dcv.imgproc.filter.medianFilter.5,2954655,2881709,2
dcv.imgproc.filter.nonMaximaSupression,588455,84436,596
dcv.imgproc.filter.open,710023,651529,8
dcv.imgproc.imgmanip.remap,225780,62089,263
dcv.imgproc.imgmanip.resize.downsize,60381,65549,-7
dcv.imgproc.imgmanip.resize.upsize,466977,471281,0
dcv.imgproc.imgmanip.scale.downsize,62576,61688,1
dcv.imgproc.imgmanip.scale.upsize,470195,470125,0
dcv.imgproc.imgmanip.transformAffine,958879,939532,2
dcv.imgproc.imgmanip.transformPerspective,1094929,1090702,0
dcv.imgproc.imgmanip.warp,235169,63821,268
dcv.imgproc.threshold.threshold,124755,656,18917
dcv.multiview.stereo.matching.semiGlobalMatchingPipeline,56359182,56938945,-1
dcv.tracking.opticalflow.hornschunck.HornSchunckFlow,20398746,17306457,17
dcv.tracking.opticalflow.lucaskanade.LucasKanadeFlow,267640,288798,-7
dcv.tracking.opticalflow.pyramidflow.DensePyramidFlow.HornSchunckFlow,24545229,20135950,21
dcv.tracking.opticalflow.pyramidflow.SparsePyramidFlow.LucasKanadeFlow,368629,383785,-3
Member

ljubobratovicrelja commented Nov 11, 2016

@9il I'd like to merge this one. PR has grown far beyond my expectations, and I believe that detailed revision at this point would be absurd. Except from few template parameter changes to allow lazy slices as inputs, API hasn't change, so I believe there would be no braking changes on merge. Please let me know if you think something else should be done before merge. And I'll tend to brake changes like this into smaller PRs in future, really sorry about this one...

Here is the benchmark result on the current state of the branch (take a special look at threshold results - thanks for the tip on lockstep):

Function Name,Previous Runtime[μs],Current Runtime[μs],Speedup[%]
dcv.features.corner.fast.FASTDetector,34834,35779,-2
dcv.features.corner.harris.harrisCorners.3,1624687,278579,483
dcv.features.corner.harris.harrisCorners.5,4406410,328159,1242
dcv.features.corner.harris.shiTomasiCorners.3,1583896,223794,607
dcv.features.corner.harris.shiTomasiCorners.5,4422529,297106,1388
dcv.features.rht.RhtCircles,1064020,987154,7
dcv.features.rht.RhtLines,1096606,1013977,8
dcv.features.utils.extractCorners,3164128,355564,789
dcv.imgproc.color.gray2rgb,441354,8918,4849
dcv.imgproc.color.hsv2rgb,433122,51392,742
dcv.imgproc.color.rgb2gray,262186,31813,724
dcv.imgproc.color.rgb2hsv,365969,65572,458
dcv.imgproc.color.rgb2yuv,272942,221719,23
dcv.imgproc.color.yuv2rgb,277880,219906,26
dcv.imgproc.convolution.conv.1D.3,124888,67486,85
dcv.imgproc.convolution.conv.1D.5,159795,68881,131
dcv.imgproc.convolution.conv.1D.7,206059,75361,173
dcv.imgproc.convolution.conv.2D.3x3,767058,120216,538
dcv.imgproc.convolution.conv.2D.5x5,1941055,360809,437
dcv.imgproc.convolution.conv.2D.7x7,3719552,865524,329
dcv.imgproc.convolution.conv.3D.3x3,2091025,374006,459
dcv.imgproc.convolution.conv.3D.5x5,5547364,1074208,416
dcv.imgproc.filter.bilateralFilter.3,6118754,1778482,244
dcv.imgproc.filter.bilateralFilter.5,16718651,4597027,263
dcv.imgproc.filter.calcGradients,2244798,506101,343
dcv.imgproc.filter.calcPartialDerivatives,428318,141520,202
dcv.imgproc.filter.canny,4108987,758240,441
dcv.imgproc.filter.close,705101,657362,7
dcv.imgproc.filter.dilate,638126,583649,9
dcv.imgproc.filter.erode,78763,79369,0
dcv.imgproc.filter.filterNonMaximum,477543,38968,1125
dcv.imgproc.filter.histEqual,103582,105822,-2
dcv.imgproc.filter.medianFilter.3,1029246,1019109,0
dcv.imgproc.filter.medianFilter.5,2954655,2881709,2
dcv.imgproc.filter.nonMaximaSupression,588455,84436,596
dcv.imgproc.filter.open,710023,651529,8
dcv.imgproc.imgmanip.remap,225780,62089,263
dcv.imgproc.imgmanip.resize.downsize,60381,65549,-7
dcv.imgproc.imgmanip.resize.upsize,466977,471281,0
dcv.imgproc.imgmanip.scale.downsize,62576,61688,1
dcv.imgproc.imgmanip.scale.upsize,470195,470125,0
dcv.imgproc.imgmanip.transformAffine,958879,939532,2
dcv.imgproc.imgmanip.transformPerspective,1094929,1090702,0
dcv.imgproc.imgmanip.warp,235169,63821,268
dcv.imgproc.threshold.threshold,124755,656,18917
dcv.multiview.stereo.matching.semiGlobalMatchingPipeline,56359182,56938945,-1
dcv.tracking.opticalflow.hornschunck.HornSchunckFlow,20398746,17306457,17
dcv.tracking.opticalflow.lucaskanade.LucasKanadeFlow,267640,288798,-7
dcv.tracking.opticalflow.pyramidflow.DensePyramidFlow.HornSchunckFlow,24545229,20135950,21
dcv.tracking.opticalflow.pyramidflow.SparsePyramidFlow.LucasKanadeFlow,368629,383785,-3
@9il

This comment has been minimized.

Show comment
Hide comment
@9il

9il Nov 12, 2016

Member

Looks awesome! Great work!
Does new threshold works properly?

assumeSameStructure requires the same strides and shapes. The easist way is add a note in the docs, that all input images must be densi in the memory. And add assert check. Also this will allow to reduced code size and co.pilation time two times for all be vectorized functions.

Could you please write a.blog post in the Mir blog a out this awesome work?

Member

9il commented Nov 12, 2016

Looks awesome! Great work!
Does new threshold works properly?

assumeSameStructure requires the same strides and shapes. The easist way is add a note in the docs, that all input images must be densi in the memory. And add assert check. Also this will allow to reduced code size and co.pilation time two times for all be vectorized functions.

Could you please write a.blog post in the Mir blog a out this awesome work?

@9il

This comment has been minimized.

Show comment
Hide comment
@9il

9il Nov 12, 2016

Member

Sorry for the English . I am from phone

Member

9il commented Nov 12, 2016

Sorry for the English . I am from phone

@ljubobratovicrelja

This comment has been minimized.

Show comment
Hide comment
@ljubobratovicrelja

ljubobratovicrelja Nov 12, 2016

Member

Does new threshold works properly?

Of course, I've checked functionality after each change.

Ah tnx for reminding me about assumeSameStructure - yes I'll add note and assert check about this. I thought about performing a deep copy if slice memory is non contiguous, but I believe it's better to let user handle this. Do you agree?

Could you please write a.blog post in the Mir blog a out this awesome work?

Yes of course - we have already agreed on this! :-) I'll email you so we discuss details about the format of the post, as soon as I'm ready to start working on it! (which will be after this merge)

Member

ljubobratovicrelja commented Nov 12, 2016

Does new threshold works properly?

Of course, I've checked functionality after each change.

Ah tnx for reminding me about assumeSameStructure - yes I'll add note and assert check about this. I thought about performing a deep copy if slice memory is non contiguous, but I believe it's better to let user handle this. Do you agree?

Could you please write a.blog post in the Mir blog a out this awesome work?

Yes of course - we have already agreed on this! :-) I'll email you so we discuss details about the format of the post, as soon as I'm ready to start working on it! (which will be after this merge)

@9il

This comment has been minimized.

Show comment
Hide comment
@9il

9il Nov 12, 2016

Member

. I thought about performing a deep copy if slice memory is non contiguous, but I believe it's better to let user handle this. Do you agree?

Yes

Member

9il commented Nov 12, 2016

. I thought about performing a deep copy if slice memory is non contiguous, but I believe it's better to let user handle this. Do you agree?

Yes

@ljubobratovicrelja ljubobratovicrelja merged commit 753a1d3 into master Nov 12, 2016

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
continuous-integration/travis-ci/push The Travis CI build passed
Details

@ljubobratovicrelja ljubobratovicrelja deleted the optimization-patch-2 branch Nov 12, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment