
Optimize ONNXLayoutTransform #2852

Merged — 9 commits merged into onnx:main on Jun 20, 2024

Conversation

@AlexandreEichenberger (Collaborator) commented on Jun 19, 2024

Right now, under the --enable-zhigh-decompose-stick-unstick flag, we decompose the stick/unstick into a data conversion and a layout transformation.

The lowering of the layout-transformation op to KRNL was implemented with a simple load/store scheme that moved one value at a time.
That implementation significantly lagged in performance behind zDNN stick/unstick and behind the compiler-generated pattern for stick/unstick.

The new implementation looks at one stick at a time (guaranteed to be contiguous in memory) and generates a memory copy for all (typically 64) values at once.

It also works in a non-NNPA context. Given a map over (d0, d1, d2), it simply checks that the last result of the map is either d2 (identity) or d2 mod lit (the last dimension tiled by a literal constant value lit).
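
For illustration, here is a minimal C++ sketch of that check using the MLIR affine-expression API (the function name is hypothetical; the actual code in this PR may structure it differently):

    #include "mlir/IR/AffineExpr.h"
    #include "mlir/IR/AffineMap.h"
    using namespace mlir;

    // Sketch: accept a map whose last result is either the innermost dim
    // itself (identity) or that dim modulo a literal constant (tiled).
    bool isSupportedLayoutMap(AffineMap map) {
      AffineExpr last = map.getResult(map.getNumResults() - 1);
      unsigned innerDim = map.getNumDims() - 1; // d2 for a (d0, d1, d2) map.
      // Case 1: identity, the last result is d2.
      if (auto dim = dyn_cast<AffineDimExpr>(last))
        return dim.getPosition() == innerDim;
      // Case 2: d2 mod lit, for a literal constant lit.
      if (auto bin = dyn_cast<AffineBinaryOpExpr>(last))
        if (bin.getKind() == AffineExprKind::Mod)
          if (auto dim = dyn_cast<AffineDimExpr>(bin.getLHS()))
            return dim.getPosition() == innerDim &&
                   isa<AffineConstantExpr>(bin.getRHS());
      return false;
    }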

When storing into a tiled array, the code does not perform a bound check on the last tile; it may overwrite data there, but only padding that is unused to begin with.
When storing into a non-tiled array (i.e., the plain d2 mapping), the last tile is handled precisely.
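
To make the copy strategy concrete, here is a minimal C++ sketch of the per-stick copy (hypothetical names; it assumes 64 two-byte values per stick, with the linear source/destination offsets computed up front, as the krnl.get_linear_offset_index ops do in the IR below):

    #include <cstdint>
    #include <cstring>

    constexpr int64_t kStick = 64; // Values per stick (typical).

    // Sketch: copy one stick given linear offsets into both buffers.
    // 'valid' is the number of elements left in the last dim at this stick.
    void copyOneStick(uint16_t *dst, int64_t dstOff, const uint16_t *src,
                      int64_t srcOff, int64_t valid, bool dstIsTiled) {
      // A tiled destination has unused padding past 'valid', so a
      // full-stick copy may safely overwrite it; an untiled destination
      // must copy exactly the remaining elements.
      int64_t n = (dstIsTiled || valid >= kStick) ? kStick : valid;
      std::memcpy(dst + dstOff, src + srcOff,
                  static_cast<size_t>(n) * sizeof(uint16_t));
    }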

In Roberta (6x384), the sequential time of the layout transformation went from 405ms to 81ms (a 5x speedup).

Lit test:

    %0 = "onnx.LayoutTransform"(%arg0) {target_layout = "3DS"} : (tensor<?x?x?xf16>) -> tensor<?x?x?xf16, #zhigh.layout<{dataLayout = "3DS"}>>
    %1 = "onnx.LayoutTransform"(%0) : (tensor<?x?x?xf16, #zhigh.layout<{dataLayout = "3DS"}>>) -> tensor<?x?x?xf16>

becomes the following (no bound check when storing into tiled data):

    %alloc = memref.alloc(%dim, %dim_0, %dim_1) {alignment = 4096 : i64} : memref<?x?x?xf16, #map>
    %0:3 = krnl.define_loops 3
    krnl.iterate(%0#0, %0#1, %0#2) with (%0#0 -> %arg1 = 0 to #map1(%dim_1, %dim), %0#1 -> %arg2 = 0 to #map2(%dim_1, %dim, %dim_0), %0#2 -> %arg3 = 0 to #map3(%dim_1, %dim, %dim_0)){
      %2:3 = krnl.get_induction_var_value(%0#0, %0#1, %0#2) : (!krnl.loop, !krnl.loop, !krnl.loop) -> (index, index, index)
      %3 = affine.apply #map4()[%2#2]
      %4 = krnl.get_linear_offset_index %alloc at [%2#0, %2#1, %3] : memref<?x?x?xf16, #map>
      %5 = krnl.get_linear_offset_index %arg0 at [%2#0, %2#1, %3] : memref<?x?x?xf16>
      "krnl.memcpy"(%alloc, %arg0, %c64_i64, %4, %5) : (memref<?x?x?xf16, #map>, memref<?x?x?xf16>, i64, index, index) -> ()
    }

and the following (with a bound check when storing into untiled data):

    %alloc_2 = memref.alloc(%dim, %dim_0, %dim_1) {alignment = 16 : i64} : memref<?x?x?xf16>
    %1:3 = krnl.define_loops 3
    krnl.iterate(%1#0, %1#1, %1#2) with (%1#0 -> %arg1 = 0 to #map1(%dim_1, %dim), %1#1 -> %arg2 = 0 to #map2(%dim_1, %dim, %dim_0), %1#2 -> %arg3 = 0 to #map3(%dim_1, %dim, %dim_0)){
      %2:3 = krnl.get_induction_var_value(%1#0, %1#1, %1#2) : (!krnl.loop, !krnl.loop, !krnl.loop) -> (index, index, index)
      %3 = affine.apply #map4()[%2#2]
      %4 = krnl.get_linear_offset_index %alloc_2 at [%2#0, %2#1, %3] : memref<?x?x?xf16>
      %5 = krnl.get_linear_offset_index %alloc at [%2#0, %2#1, %3] : memref<?x?x?xf16, #map>
      %6 = affine.apply #map5()[%2#2, %dim_1]
      %7 = arith.cmpi sge, %6, %c0 : index
      scf.if %7 {
        "krnl.memcpy"(%alloc_2, %alloc, %c64_i64, %4, %5) : (memref<?x?x?xf16>, memref<?x?x?xf16, #map>, i64, index, index) -> ()
      } else {
        %8 = affine.apply #map6()[%dim_1, %2#2]
        %9 = arith.index_cast %8 : index to i64
        "krnl.memcpy"(%alloc_2, %alloc, %9, %4, %5) : (memref<?x?x?xf16>, memref<?x?x?xf16, #map>, i64, index, index) -> ()
      }
    }

I will also investigate whether transforming some krnl.memcpy calls into an unrolled SIMD loop may yield better results (it appears that it should, as the compiler-generated stick/unstick is still faster), at least for some small sizes.
This will be done in a subsequent PR.
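
For reference, a minimal sketch of what such an unrolled loop could look like in C++ (hypothetical; it relies on the compiler vectorizing the fixed-trip-count loop, which avoids the call overhead of memcpy on small sizes):

    #include <cstdint>

    constexpr int64_t kStick = 64; // Values per stick (typical).

    // Sketch: a fixed-trip-count copy that the compiler can unroll into
    // a few wide vector loads/stores instead of a library call.
    void copyStickUnrolled(uint16_t *dst, const uint16_t *src) {
      for (int64_t i = 0; i < kStick; ++i)
        dst[i] = src[i];
    }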

Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com>
@tungld (Collaborator) left a comment:

LGTM. Thanks so much for quickly optimizing it!

@AlexandreEichenberger merged commit c9be563 into onnx:main on Jun 20, 2024
6 of 7 checks passed
@jenkins-droid: Jenkins Linux amd64 Build #15025 [push] Optimize ONNXLayoutTrans... started at 16:33

@jenkins-droid: Jenkins Linux ppc64le Build #14055 [push] Optimize ONNXLayoutTrans... started at 17:42

@jenkins-droid: Jenkins Linux s390x Build #15030 [push] Optimize ONNXLayoutTrans... started at 17:33

@jenkins-droid: Jenkins Linux s390x Build #15030 [push] Optimize ONNXLayoutTrans... passed after 1 hr 33 min

@jenkins-droid: Jenkins Linux ppc64le Build #14055 [push] Optimize ONNXLayoutTrans... failed after 1 hr 34 min

@jenkins-droid: Jenkins Linux amd64 Build #15025 [push] Optimize ONNXLayoutTrans... passed after 1 hr 35 min
