Optimization on the Repeat operation #43

owulveryck · 2019-05-30T20:41:53Z

The denseRepeat method of the standard engine is calling copyDenseSlice method in two embedded for loops.
This method is slicing the source and destination at each pass, even if the start and end indices of the source did not change.

This PR is optimizing this by avoiding the call to copyDenseSlice if possible and avoiding to create a slice array if it's not needed.
The Memcpy method is called directly from the denseRepeat method to avoid any time loss by passing values or references to the function.
On top of being less greedy with the memory, this change also reduces the object creation and by consequence the pressure on the GC.

This commit includes a couple of benchmarks made from the tests.

The comparison gives the following results:

benchmark                     old ns/op     new ns/op     delta
BenchmarkDense_Repeat0-4      1771          1718          -2.99%
BenchmarkDense_Repeat1-4      1744          1667          -4.42%
BenchmarkDense_Repeat2-4      3045          2381          -21.81%
BenchmarkDense_Repeat3-4      4722          2241          -52.54%
BenchmarkDense_Repeat4-4      1993          1712          -14.10%
BenchmarkDense_Repeat5-4      2685          2159          -19.59%
BenchmarkDense_Repeat6-4      3333          2367          -28.98%
BenchmarkDense_Repeat7-4      2538          1926          -24.11%
BenchmarkDense_Repeat8-4      2906          1969          -32.24%
BenchmarkDense_Repeat9-4      2538          1940          -23.56%
BenchmarkDense_Repeat10-4     2296          1942          -15.42%
BenchmarkDense_Repeat11-4     3224          2153          -33.22%
BenchmarkDense_Repeat12-4     2651          2186          -17.54%
BenchmarkDense_Repeat13-4     1904          1723          -9.51%
BenchmarkDense_Repeat14-4     1984          1707          -13.96%
BenchmarkDense_Repeat15-4     3890          2679          -31.13%
BenchmarkDense_Repeat16-4     8598          4897          -43.04%
BenchmarkDense_Repeat17-4     3373          3486          +3.35%
BenchmarkDense_Repeat18-4     3493          3075          -11.97%

benchmark                     old allocs     new allocs     delta
BenchmarkDense_Repeat0-4      17             15             -11.76%
BenchmarkDense_Repeat1-4      17             15             -11.76%
BenchmarkDense_Repeat2-4      24             20             -16.67%
BenchmarkDense_Repeat3-4      24             20             -16.67%
BenchmarkDense_Repeat4-4      18             16             -11.11%
BenchmarkDense_Repeat5-4      24             20             -16.67%
BenchmarkDense_Repeat6-4      24             20             -16.67%
BenchmarkDense_Repeat7-4      23             19             -17.39%
BenchmarkDense_Repeat8-4      23             19             -17.39%
BenchmarkDense_Repeat9-4      23             19             -17.39%
BenchmarkDense_Repeat10-4     20             19             -5.00%
BenchmarkDense_Repeat11-4     23             21             -8.70%
BenchmarkDense_Repeat12-4     23             21             -8.70%
BenchmarkDense_Repeat13-4     17             16             -5.88%
BenchmarkDense_Repeat14-4     17             16             -5.88%
BenchmarkDense_Repeat15-4     27             25             -7.41%
BenchmarkDense_Repeat16-4     59             47             -20.34%
BenchmarkDense_Repeat17-4     12             12             +0.00%
BenchmarkDense_Repeat18-4     10             10             +0.00%

benchmark                     old bytes     new bytes     delta
BenchmarkDense_Repeat0-4      872           744           -14.68%
BenchmarkDense_Repeat1-4      896           768           -14.29%
BenchmarkDense_Repeat2-4      1320          1064          -19.39%
BenchmarkDense_Repeat3-4      1360          1104          -18.82%
BenchmarkDense_Repeat4-4      968           840           -13.22%
BenchmarkDense_Repeat5-4      1336          1080          -19.16%
BenchmarkDense_Repeat6-4      1360          1104          -18.82%
BenchmarkDense_Repeat7-4      1264          1008          -20.25%
BenchmarkDense_Repeat8-4      1288          1032          -19.88%
BenchmarkDense_Repeat9-4      1304          1048          -19.63%
BenchmarkDense_Repeat10-4     1160          1096          -5.52%
BenchmarkDense_Repeat11-4     1328          1200          -9.64%
BenchmarkDense_Repeat12-4     1344          1216          -9.52%
BenchmarkDense_Repeat13-4     960           896           -6.67%
BenchmarkDense_Repeat14-4     960           896           -6.67%
BenchmarkDense_Repeat15-4     1728          1600          -7.41%
BenchmarkDense_Repeat16-4     3832          3064          -20.04%
BenchmarkDense_Repeat17-4     848           848           +0.00%
BenchmarkDense_Repeat18-4     736           736           +0.00%

If possible, avoid the use of the copyDenseSlice function to be able to reuse the source slice.

coveralls · 2019-05-30T20:48:50Z

Coverage increased (+0.07%) to 72.838% when pulling e106ac3 on test-optim-repeat into 06d5c67 on master.

chewxy · 2019-05-31T00:20:11Z

This looks amazing. That's a crazy diff

owulveryck · 2019-05-31T07:30:19Z

Tests in the packages are ok
Tests in Gorgonia are ok
Tests with Gorgonnx ok

chewxy · 2019-06-01T21:02:48Z

I'll fix up the fastCopy bit and we're good to merge @owulveryck ... this is phenomenal;

…est-optim-repeat

Added tests for slow path (by creating masked copies)

chewxy · 2019-06-09T04:43:32Z

@owulveryck I have pushed new code to separate out fastCopy into a separate method. Wondering if you could run your benchmarks? My benchmark results are as follows (bear in mind my computer is slow)

benchmark                                                                     old ns/op     new ns/op     delta
BenchmarkDenseRepeat/Scalar_Repeat_on_axis_0-8                                2873          2705          -5.85%
BenchmarkDenseRepeat/Scalar_Repeat_on_axis_1-8                                2928          2933          +0.17%
BenchmarkDenseRepeat/Vector_Repeat_on_axis_0-8                                4420          3830          -13.35%
BenchmarkDenseRepeat/ColVec_Repeat_on_axis_0-8                                4058          3739          -7.86%
BenchmarkDenseRepeat/RowVec_Repeat_on_axis_0-8                                3440          3165          -7.99%
BenchmarkDenseRepeat/ColVec_Repeat_on_axis_1-8                                4482          3895          -13.10%
BenchmarkDenseRepeat/RowVec_Repeat_on_axis_1-8                                4310          3933          -8.75%
BenchmarkDenseRepeat/Vector_Repeat_on_all_axes-8                              3650          3572          -2.14%
BenchmarkDenseRepeat/ColVec_Repeat_on_all_axes-8                              3687          3371          -8.57%
BenchmarkDenseRepeat/RowVec_Repeat_on_all_axes-8                              3573          3313          -7.28%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_all_axes_with_repeats_=_(1,2,1,1)-8     3543          3615          +2.03%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_1_with_repeats_=_(2,_1)-8          4190          4000          -4.53%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_1_with_repeats_=_(1,_2)-8          4502          4217          -6.33%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_0_with_repeats_=_(1,_2)-8          3325          3395          +2.11%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_0_with_repeats_=_(2,_1)-8          3586          3446          -3.90%
BenchmarkDenseRepeat/3T[2,3,2]_Repeat_on_axis_1_with_repeats_=_(1,2,1)-8      5245          4451          -15.14%
BenchmarkDenseRepeat/3T[2,3,2]_Generic_Repeat_by_2-8                          10253         8329          -18.77%
BenchmarkDenseRepeat/3T[2,3,2]_repeat_with_broadcast_errors-8                 4161          3783          -9.08%
BenchmarkDenseRepeat/Nonexistent_axis-8                                       3648          3455          -5.29%

benchmark                                                                     old allocs     new allocs     delta
BenchmarkDenseRepeat/Scalar_Repeat_on_axis_0-8                                17             15             -11.76%
BenchmarkDenseRepeat/Scalar_Repeat_on_axis_1-8                                17             15             -11.76%
BenchmarkDenseRepeat/Vector_Repeat_on_axis_0-8                                24             20             -16.67%
BenchmarkDenseRepeat/ColVec_Repeat_on_axis_0-8                                24             20             -16.67%
BenchmarkDenseRepeat/RowVec_Repeat_on_axis_0-8                                18             16             -11.11%
BenchmarkDenseRepeat/ColVec_Repeat_on_axis_1-8                                24             20             -16.67%
BenchmarkDenseRepeat/RowVec_Repeat_on_axis_1-8                                24             20             -16.67%
BenchmarkDenseRepeat/Vector_Repeat_on_all_axes-8                              23             19             -17.39%
BenchmarkDenseRepeat/ColVec_Repeat_on_all_axes-8                              23             19             -17.39%
BenchmarkDenseRepeat/RowVec_Repeat_on_all_axes-8                              23             19             -17.39%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_all_axes_with_repeats_=_(1,2,1,1)-8     20             19             -5.00%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_1_with_repeats_=_(2,_1)-8          23             21             -8.70%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_1_with_repeats_=_(1,_2)-8          23             21             -8.70%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_0_with_repeats_=_(1,_2)-8          17             16             -5.88%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_0_with_repeats_=_(2,_1)-8          17             16             -5.88%
BenchmarkDenseRepeat/3T[2,3,2]_Repeat_on_axis_1_with_repeats_=_(1,2,1)-8      27             25             -7.41%
BenchmarkDenseRepeat/3T[2,3,2]_Generic_Repeat_by_2-8                          59             47             -20.34%
BenchmarkDenseRepeat/3T[2,3,2]_repeat_with_broadcast_errors-8                 12             12             +0.00%
BenchmarkDenseRepeat/Nonexistent_axis-8                                       10             10             +0.00%

benchmark                                                                     old bytes     new bytes     delta
BenchmarkDenseRepeat/Scalar_Repeat_on_axis_0-8                                872           744           -14.68%
BenchmarkDenseRepeat/Scalar_Repeat_on_axis_1-8                                896           768           -14.29%
BenchmarkDenseRepeat/Vector_Repeat_on_axis_0-8                                1320          1064          -19.39%
BenchmarkDenseRepeat/ColVec_Repeat_on_axis_0-8                                1360          1104          -18.82%
BenchmarkDenseRepeat/RowVec_Repeat_on_axis_0-8                                968           840           -13.22%
BenchmarkDenseRepeat/ColVec_Repeat_on_axis_1-8                                1336          1080          -19.16%
BenchmarkDenseRepeat/RowVec_Repeat_on_axis_1-8                                1360          1104          -18.82%
BenchmarkDenseRepeat/Vector_Repeat_on_all_axes-8                              1264          1008          -20.25%
BenchmarkDenseRepeat/ColVec_Repeat_on_all_axes-8                              1288          1032          -19.88%
BenchmarkDenseRepeat/RowVec_Repeat_on_all_axes-8                              1304          1048          -19.63%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_all_axes_with_repeats_=_(1,2,1,1)-8     1160          1096          -5.52%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_1_with_repeats_=_(2,_1)-8          1328          1200          -9.64%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_1_with_repeats_=_(1,_2)-8          1344          1216          -9.52%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_0_with_repeats_=_(1,_2)-8          960           896           -6.67%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_0_with_repeats_=_(2,_1)-8          960           896           -6.67%
BenchmarkDenseRepeat/3T[2,3,2]_Repeat_on_axis_1_with_repeats_=_(1,2,1)-8      1728          1600          -7.41%
BenchmarkDenseRepeat/3T[2,3,2]_Generic_Repeat_by_2-8                          3833          3064          -20.06%
BenchmarkDenseRepeat/3T[2,3,2]_repeat_with_broadcast_errors-8                 848           848           +0.00%
BenchmarkDenseRepeat/Nonexistent_axis-8                                       736           736           +0.00%

chewxy · 2019-06-09T04:44:52Z

@owulveryck if you like the results you see from the benchmarks please squash and merge. Thank you so much

chewxy · 2019-05-31T00:21:53Z

defaultengine_matop_misc.go


 			for k := 0; k < tmp; k++ {
 				if srcStart >= t.len() || destStart+stride > d.len() {
 					break
 				}
-				copyDenseSliced(d, destStart, d.len(), t, srcStart, t.len())
+				if fastCopy {


Thinking on this. Wondering if we pull fastCopy into a separate method would it be cleaner and not have ifs in loops which makes it hard for reasoning?

thinknig

owulveryck · 2019-06-09T13:05:33Z

I like this new interface !

Here is the result of my bench (with benchcmp):

benchmark                                                                     old ns/op     new ns/op     delta
BenchmarkDense_Mul_Unsafe-4                                                   14441         13450         -6.86%
BenchmarkDense_ContiguousSliced_Mul_Unsafe-4                                  13373         13373         +0.00%
BenchmarkDense_NonContiguousSliced_Mul_Unsafe-4                               445395        512722        +15.12%
BenchmarkDense_Transpose-4                                                    222218        206483        -7.08%
BenchmarkDenseRepeat/Scalar_Repeat_on_axis_0-4                                1936          1681          -13.17%
BenchmarkDenseRepeat/Scalar_Repeat_on_axis_1-4                                1877          1436          -23.49%
BenchmarkDenseRepeat/Vector_Repeat_on_axis_0-4                                2726          2136          -21.64%
BenchmarkDenseRepeat/ColVec_Repeat_on_axis_0-4                                3059          2304          -24.68%
BenchmarkDenseRepeat/RowVec_Repeat_on_axis_0-4                                2499          1680          -32.77%
BenchmarkDenseRepeat/ColVec_Repeat_on_axis_1-4                                3188          2124          -33.38%
BenchmarkDenseRepeat/RowVec_Repeat_on_axis_1-4                                3201          2073          -35.24%
BenchmarkDenseRepeat/Vector_Repeat_on_all_axes-4                              2652          2866          +8.07%
BenchmarkDenseRepeat/ColVec_Repeat_on_all_axes-4                              2460          3715          +51.02%
BenchmarkDenseRepeat/RowVec_Repeat_on_all_axes-4                              2998          2265          -24.45%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_all_axes_with_repeats_=_(1,2,1,1)-4     2405          2027          -15.72%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_1_with_repeats_=_(2,_1)-4          3647          2559          -29.83%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_1_with_repeats_=_(1,_2)-4          4537          2240          -50.63%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_0_with_repeats_=_(1,_2)-4          6135          1789          -70.84%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_0_with_repeats_=_(2,_1)-4          3798          1882          -50.45%
BenchmarkDenseRepeat/3T[2,3,2]_Repeat_on_axis_1_with_repeats_=_(1,2,1)-4      6804          3065          -54.95%
BenchmarkDenseRepeat/3T[2,3,2]_Generic_Repeat_by_2-4                          15772         4831          -69.37%
BenchmarkDenseRepeat/3T[2,3,2]_repeat_with_broadcast_errors-4                 4608          4003          -13.13%
BenchmarkDenseRepeat/Nonexistent_axis-4                                       4310          3632          -15.73%

benchmark                                                                     old allocs     new allocs     delta
BenchmarkDense_Mul_Unsafe-4                                                   1              1              +0.00%
BenchmarkDense_ContiguousSliced_Mul_Unsafe-4                                  1              1              +0.00%
BenchmarkDense_NonContiguousSliced_Mul_Unsafe-4                               5              5              +0.00%
BenchmarkDense_Transpose-4                                                    11             11             +0.00%
BenchmarkDenseRepeat/Scalar_Repeat_on_axis_0-4                                17             15             -11.76%
BenchmarkDenseRepeat/Scalar_Repeat_on_axis_1-4                                17             15             -11.76%
BenchmarkDenseRepeat/Vector_Repeat_on_axis_0-4                                24             20             -16.67%
BenchmarkDenseRepeat/ColVec_Repeat_on_axis_0-4                                24             20             -16.67%
BenchmarkDenseRepeat/RowVec_Repeat_on_axis_0-4                                18             16             -11.11%
BenchmarkDenseRepeat/ColVec_Repeat_on_axis_1-4                                24             20             -16.67%
BenchmarkDenseRepeat/RowVec_Repeat_on_axis_1-4                                24             20             -16.67%
BenchmarkDenseRepeat/Vector_Repeat_on_all_axes-4                              23             19             -17.39%
BenchmarkDenseRepeat/ColVec_Repeat_on_all_axes-4                              23             19             -17.39%
BenchmarkDenseRepeat/RowVec_Repeat_on_all_axes-4                              23             19             -17.39%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_all_axes_with_repeats_=_(1,2,1,1)-4     20             19             -5.00%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_1_with_repeats_=_(2,_1)-4          23             21             -8.70%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_1_with_repeats_=_(1,_2)-4          23             21             -8.70%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_0_with_repeats_=_(1,_2)-4          17             16             -5.88%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_0_with_repeats_=_(2,_1)-4          17             16             -5.88%
BenchmarkDenseRepeat/3T[2,3,2]_Repeat_on_axis_1_with_repeats_=_(1,2,1)-4      27             25             -7.41%
BenchmarkDenseRepeat/3T[2,3,2]_Generic_Repeat_by_2-4                          59             47             -20.34%
BenchmarkDenseRepeat/3T[2,3,2]_repeat_with_broadcast_errors-4                 12             12             +0.00%
BenchmarkDenseRepeat/Nonexistent_axis-4                                       10             10             +0.00%

benchmark                                                                     old bytes     new bytes     delta
BenchmarkDense_Mul_Unsafe-4                                                   11            9             -18.18%
BenchmarkDense_ContiguousSliced_Mul_Unsafe-4                                  10            12            +20.00%
BenchmarkDense_NonContiguousSliced_Mul_Unsafe-4                               396           396           +0.00%
BenchmarkDense_Transpose-4                                                    20982         20979         -0.01%
BenchmarkDenseRepeat/Scalar_Repeat_on_axis_0-4                                872           744           -14.68%
BenchmarkDenseRepeat/Scalar_Repeat_on_axis_1-4                                896           768           -14.29%
BenchmarkDenseRepeat/Vector_Repeat_on_axis_0-4                                1320          1064          -19.39%
BenchmarkDenseRepeat/ColVec_Repeat_on_axis_0-4                                1360          1104          -18.82%
BenchmarkDenseRepeat/RowVec_Repeat_on_axis_0-4                                968           840           -13.22%
BenchmarkDenseRepeat/ColVec_Repeat_on_axis_1-4                                1336          1080          -19.16%
BenchmarkDenseRepeat/RowVec_Repeat_on_axis_1-4                                1360          1104          -18.82%
BenchmarkDenseRepeat/Vector_Repeat_on_all_axes-4                              1264          1008          -20.25%
BenchmarkDenseRepeat/ColVec_Repeat_on_all_axes-4                              1288          1032          -19.88%
BenchmarkDenseRepeat/RowVec_Repeat_on_all_axes-4                              1304          1048          -19.63%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_all_axes_with_repeats_=_(1,2,1,1)-4     1160          1096          -5.52%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_1_with_repeats_=_(2,_1)-4          1328          1200          -9.64%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_1_with_repeats_=_(1,_2)-4          1344          1216          -9.52%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_0_with_repeats_=_(1,_2)-4          960           896           -6.67%
BenchmarkDenseRepeat/M[2,2]_Repeat_on_axis_0_with_repeats_=_(2,_1)-4          960           896           -6.67%
BenchmarkDenseRepeat/3T[2,3,2]_Repeat_on_axis_1_with_repeats_=_(1,2,1)-4      1728          1600          -7.41%
BenchmarkDenseRepeat/3T[2,3,2]_Generic_Repeat_by_2-4                          3832          3064          -20.04%
BenchmarkDenseRepeat/3T[2,3,2]_repeat_with_broadcast_errors-4                 848           848           +0.00%
BenchmarkDenseRepeat/Nonexistent_axis-4                                       736           736           +0.00%

owulveryck added 2 commits May 30, 2019 22:05

feat: enhance the performances of the Repeat operator

cadd8c2

If possible, avoid the use of the copyDenseSlice function to be able to reuse the source slice.

feat: add some bench for the Repeat operation

37fd07e

owulveryck requested a review from chewxy May 30, 2019 20:42

owulveryck mentioned this pull request May 30, 2019

Broadcasting is consuming a lot of memory in Gorgonnx/Gorgonia askorama/onnx-go#68

Closed

Merge branch 'master' into test-optim-repeat

9185194

owulveryck and others added 4 commits June 2, 2019 08:35

fix: use named subbenchmark for clarity on the output

d2032ab

Merge branch 'test-optim-repeat' of github.com:gorgonia/tensor into t…

8e3c012

…est-optim-repeat

Moved "fastCopy" of repeat to a new method.

570accc

Added tests for slow path (by creating masked copies)

Moved benchmarks out to a separate file

e106ac3

chewxy approved these changes Jun 9, 2019

View reviewed changes

owulveryck merged commit 663333f into master Jun 9, 2019

owulveryck deleted the test-optim-repeat branch June 9, 2019 13:06

owulveryck mentioned this pull request Jun 9, 2019

Analyse (and enhance) performances on multiple predictions askorama/onnx-go#67

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimization on the Repeat operation #43

Optimization on the Repeat operation #43

owulveryck commented May 30, 2019

coveralls commented May 30, 2019 •

edited

Loading

chewxy commented May 31, 2019

owulveryck commented May 31, 2019

chewxy commented Jun 1, 2019

chewxy commented Jun 9, 2019

chewxy commented Jun 9, 2019

chewxy May 31, 2019

owulveryck commented Jun 9, 2019

Optimization on the Repeat operation #43

Optimization on the Repeat operation #43

Conversation

owulveryck commented May 30, 2019

coveralls commented May 30, 2019 • edited Loading

chewxy commented May 31, 2019

owulveryck commented May 31, 2019

chewxy commented Jun 1, 2019

chewxy commented Jun 9, 2019

chewxy commented Jun 9, 2019

chewxy May 31, 2019

Choose a reason for hiding this comment

owulveryck commented Jun 9, 2019

coveralls commented May 30, 2019 •

edited

Loading