
Optimize ONNXLayoutTransform #2852

Merged — 9 commits merged into onnx:main on Jun 20, 2024

Conversation

@AlexandreEichenberger (Collaborator) commented on Jun 19, 2024

Right now, under the --enable-zhigh-decompose-stick-unstick flag, we decompose the stick/unstick into a data conversion and a layout transformation.

The lowering of the layout-transformation op to KRNL was implemented with a simple load/store scheme that moved one value at a time.
That implementation significantly lagged in performance behind zDNN stick/unstick and behind the compiler-generated pattern for stick/unstick.

The new implementation looks at one stick at a time (guaranteed to be contiguous in memory) and generates a memory copy for all (typically 64) values at once.

It also works in a non-NNPA context. Given a map over (d0, d1, d2), it simply checks that the last result of the map is either d2 (identity) or d2 mod lit (the last dimension tiled by a literal constant value lit).
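
For illustration, here is a minimal C++ sketch of that check using the MLIR affine-expression API (the function name is hypothetical; the actual code in this PR may structure it differently):

    #include "mlir/IR/AffineExpr.h"
    #include "mlir/IR/AffineMap.h"
    using namespace mlir;

    // Sketch: accept a map whose last result is either the innermost dim
    // itself (identity) or that dim modulo a literal constant (tiled).
    bool isSupportedLayoutMap(AffineMap map) {
      AffineExpr last = map.getResult(map.getNumResults() - 1);
      unsigned innerDim = map.getNumDims() - 1; // d2 for a (d0, d1, d2) map.
      // Case 1: identity, the last result is d2.
      if (auto dim = dyn_cast<AffineDimExpr>(last))
        return dim.getPosition() == innerDim;
      // Case 2: d2 mod lit, for a literal constant lit.
      if (auto bin = dyn_cast<AffineBinaryOpExpr>(last))
        if (bin.getKind() == AffineExprKind::Mod)
          if (auto dim = dyn_cast<AffineDimExpr>(bin.getLHS()))
            return dim.getPosition() == innerDim &&
                   isa<AffineConstantExpr>(bin.getRHS());
      return false;
    }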

When storing into a tiled array, the code does not perform a bound check on the last tile; it may overwrite data there, but only padding that is unused to begin with.
When storing into a non-tiled array (i.e., the plain d2 mapping), the last tile is handled precisely.
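
To make the copy strategy concrete, here is a minimal C++ sketch of the per-stick copy (hypothetical names; it assumes 64 two-byte values per stick, with the linear source/destination offsets computed up front, as the krnl.get_linear_offset_index ops do in the IR below):

    #include <cstdint>
    #include <cstring>

    constexpr int64_t kStick = 64; // Values per stick (typical).

    // Sketch: copy one stick given linear offsets into both buffers.
    // 'valid' is the number of elements left in the last dim at this stick.
    void copyOneStick(uint16_t *dst, int64_t dstOff, const uint16_t *src,
                      int64_t srcOff, int64_t valid, bool dstIsTiled) {
      // A tiled destination has unused padding past 'valid', so a
      // full-stick copy may safely overwrite it; an untiled destination
      // must copy exactly the remaining elements.
      int64_t n = (dstIsTiled || valid >= kStick) ? kStick : valid;
      std::memcpy(dst + dstOff, src + srcOff,
                  static_cast<size_t>(n) * sizeof(uint16_t));
    }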

In Roberta (6x384), the sequential time of the layout transformation went from 405ms to 81ms (a 5x speedup).

Lit test:

    %0 = "onnx.LayoutTransform"(%arg0) {target_layout = "3DS"} : (tensor<?x?x?xf16>) -> tensor<?x?x?xf16, #zhigh.layout<{dataLayout = "3DS"}>>
    %1 = "onnx.LayoutTransform"(%0) : (tensor<?x?x?xf16, #zhigh.layout<{dataLayout = "3DS"}>>) -> tensor<?x?x?xf16>

becomes the following (no bound check when storing into tiled data):

    %alloc = memref.alloc(%dim, %dim_0, %dim_1) {alignment = 4096 : i64} : memref<?x?x?xf16, #map>
    %0:3 = krnl.define_loops 3
    krnl.iterate(%0#0, %0#1, %0#2) with (%0#0 -> %arg1 = 0 to #map1(%dim_1, %dim), %0#1 -> %arg2 = 0 to #map2(%dim_1, %dim, %dim_0), %0#2 -> %arg3 = 0 to #map3(%dim_1, %dim, %dim_0)){
      %2:3 = krnl.get_induction_var_value(%0#0, %0#1, %0#2) : (!krnl.loop, !krnl.loop, !krnl.loop) -> (index, index, index)
      %3 = affine.apply #map4()[%2#2]
      %4 = krnl.get_linear_offset_index %alloc at [%2#0, %2#1, %3] : memref<?x?x?xf16, #map>
      %5 = krnl.get_linear_offset_index %arg0 at [%2#0, %2#1, %3] : memref<?x?x?xf16>
      "krnl.memcpy"(%alloc, %arg0, %c64_i64, %4, %5) : (memref<?x?x?xf16, #map>, memref<?x?x?xf16>, i64, index, index) -> ()
    }

and the following (with a bound check when storing into untiled data):

    %alloc_2 = memref.alloc(%dim, %dim_0, %dim_1) {alignment = 16 : i64} : memref<?x?x?xf16>
    %1:3 = krnl.define_loops 3
    krnl.iterate(%1#0, %1#1, %1#2) with (%1#0 -> %arg1 = 0 to #map1(%dim_1, %dim), %1#1 -> %arg2 = 0 to #map2(%dim_1, %dim, %dim_0), %1#2 -> %arg3 = 0 to #map3(%dim_1, %dim, %dim_0)){
      %2:3 = krnl.get_induction_var_value(%1#0, %1#1, %1#2) : (!krnl.loop, !krnl.loop, !krnl.loop) -> (index, index, index)
      %3 = affine.apply #map4()[%2#2]
      %4 = krnl.get_linear_offset_index %alloc_2 at [%2#0, %2#1, %3] : memref<?x?x?xf16>
      %5 = krnl.get_linear_offset_index %alloc at [%2#0, %2#1, %3] : memref<?x?x?xf16, #map>
      %6 = affine.apply #map5()[%2#2, %dim_1]
      %7 = arith.cmpi sge, %6, %c0 : index
      scf.if %7 {
        "krnl.memcpy"(%alloc_2, %alloc, %c64_i64, %4, %5) : (memref<?x?x?xf16>, memref<?x?x?xf16, #map>, i64, index, index) -> ()
      } else {
        %8 = affine.apply #map6()[%dim_1, %2#2]
        %9 = arith.index_cast %8 : index to i64
        "krnl.memcpy"(%alloc_2, %alloc, %9, %4, %5) : (memref<?x?x?xf16>, memref<?x?x?xf16, #map>, i64, index, index) -> ()
      }
    }

I will also investigate whether transforming some krnl.memcpy calls into an unrolled SIMD loop may yield better results (it appears that it should, as the compiler-generated stick/unstick is still faster), at least for some small sizes.
This will be done in a subsequent PR.
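
For reference, a minimal sketch of what such an unrolled loop could look like in C++ (hypothetical; it relies on the compiler vectorizing the fixed-trip-count loop, which avoids the call overhead of memcpy on small sizes):

    #include <cstdint>

    constexpr int64_t kStick = 64; // Values per stick (typical).

    // Sketch: a fixed-trip-count copy that the compiler can unroll into
    // a few wide vector loads/stores instead of a library call.
    void copyStickUnrolled(uint16_t *dst, const uint16_t *src) {
      for (int64_t i = 0; i < kStick; ++i)
        dst[i] = src[i];
    }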

Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com>
@tungld (Collaborator) left a comment:

LGTM. Thanks so much for quickly optimizing it!

@AlexandreEichenberger merged commit c9be563 into onnx:main on Jun 20, 2024
6 of 7 checks passed
@jenkins-droid: Jenkins Linux amd64 Build #15025 [push] Optimize ONNXLayoutTrans... started at 16:33

@jenkins-droid: Jenkins Linux ppc64le Build #14055 [push] Optimize ONNXLayoutTrans... started at 17:42

@jenkins-droid: Jenkins Linux s390x Build #15030 [push] Optimize ONNXLayoutTrans... started at 17:33

@jenkins-droid: Jenkins Linux s390x Build #15030 [push] Optimize ONNXLayoutTrans... passed after 1 hr 33 min

@jenkins-droid: Jenkins Linux ppc64le Build #14055 [push] Optimize ONNXLayoutTrans... failed after 1 hr 34 min

@jenkins-droid: Jenkins Linux amd64 Build #15025 [push] Optimize ONNXLayoutTrans... passed after 1 hr 35 min
