Optimize ONNXLayoutTransform #2852
Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com>
LGTM. Thanks so much for quickly optimizing it!
Jenkins Linux amd64 Build #15025 [push] Optimize ONNXLayoutTrans... started at 16:33 |
Jenkins Linux ppc64le Build #14055 [push] Optimize ONNXLayoutTrans... started at 17:42 |
Jenkins Linux s390x Build #15030 [push] Optimize ONNXLayoutTrans... started at 17:33 |
Jenkins Linux s390x Build #15030 [push] Optimize ONNXLayoutTrans... passed after 1 hr 33 min |
Jenkins Linux ppc64le Build #14055 [push] Optimize ONNXLayoutTrans... failed after 1 hr 34 min |
Jenkins Linux amd64 Build #15025 [push] Optimize ONNXLayoutTrans... passed after 1 hr 35 min |
Right now, under the `--enable-zhigh-decompose-stick-unstick` flag, we decompose the stick/unstick into a data conversion and a layout transformation. The lowering of the layout transformation op to KRNL was implemented using a simple load/store, saving one value at a time.
That implementation significantly lagged in performance compared to zDNN stick/unstick and the compiler-generated pattern for stick/unstick.
This implementation looks at one stick at a time (guaranteed to be contiguous in memory) and generates a mem copy for all (typically 64) values at once.
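To illustrate the difference between the two strategies, here is a minimal Python sketch, not the actual KRNL lowering. It assumes a hypothetical layout in which a `(d0, d1, d2)` array is stored as `(d0, d2 floordiv T, d1, d2 mod T)`, with the last dimension an exact multiple of the stick length `T`; the function names and the layout itself are illustrative assumptions.

```python
T = 64  # stick length; "typically 64" in the description above


def to_tiled_elementwise(src, D0, D1, D2, tile=T):
    """Copy one value at a time (the slow, original strategy)."""
    assert D2 % tile == 0  # sketch only covers full sticks
    nt = D2 // tile
    dst = [None] * (D0 * D1 * D2)
    for d0 in range(D0):
        for d1 in range(D1):
            for d2 in range(D2):
                s = (d0 * D1 + d1) * D2 + d2
                d = ((d0 * nt + d2 // tile) * D1 + d1) * tile + d2 % tile
                dst[d] = src[s]
    return dst


def to_tiled_by_stick(src, D0, D1, D2, tile=T):
    """Copy one contiguous stick at a time (a memcpy-like bulk copy)."""
    assert D2 % tile == 0  # sketch only covers full sticks
    nt = D2 // tile
    dst = [None] * (D0 * D1 * D2)
    for d0 in range(D0):
        for d1 in range(D1):
            for t in range(nt):
                s = (d0 * D1 + d1) * D2 + t * tile   # start of stick in source
                d = ((d0 * nt + t) * D1 + d1) * tile  # start of stick in dest
                dst[d:d + tile] = src[s:s + tile]     # one bulk copy per stick
    return dst
```

Both functions produce the same buffer; the second performs one bulk copy per stick instead of `T` individual load/store pairs.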
It works in a non-NNPA context. Given a map `(d0, d1, d2)`, it simply checks that the mapped version is either `d2` (identity) or `d2 mod lit` (last dim tiled by the `lit` constant value).
When storing into a tiled array, the code does not check for the last tile; it may overwrite data that is unused to begin with.
When storing into a non-tiled array (aka `d2` mapping), the last tile is handled precisely.
In Roberta (6x384), the sequential time of the layout transformation went from 405ms to 81ms (5x speedup).
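The two store cases can be sketched in Python as follows. This is a simplified model under stated assumptions, not the code generated by this PR: it uses a 2D `(rows, cols)` array whose tiled form is `(rows, ceil(cols / tile), tile)`, and it assumes that the unchecked path simply copies a full stick from the flat source, so padding slots may receive values belonging to the next row. All names are hypothetical.

```python
def store_tiled_unchecked(src, rows, cols, tile):
    """Untiled -> tiled: always copy a full stick, no last-tile check."""
    ntiles = -(-cols // tile)  # ceil(cols / tile)
    dst = [None] * (rows * ntiles * tile)
    for r in range(rows):
        for t in range(ntiles):
            s = r * cols + t * tile
            d = (r * ntiles + t) * tile
            # May read past the end of row r; the extra values land in
            # padding slots of dst, which are unused to begin with.
            # (Python clamps the slice at the very end of src.)
            chunk = src[s:s + tile]
            dst[d:d + len(chunk)] = chunk
    return dst


def store_untiled_checked(tiled, rows, cols, tile):
    """Tiled -> untiled: the last tile copies only the valid values."""
    ntiles = -(-cols // tile)
    dst = [None] * (rows * cols)
    for r in range(rows):
        for t in range(ntiles):
            n = min(tile, cols - t * tile)  # precise bound for the last tile
            s = (r * ntiles + t) * tile
            d = r * cols + t * tile
            dst[d:d + n] = tiled[s:s + n]
    return dst
```

In this model the bound check matters only on the untiled side: every slot of the untiled array holds live data, whereas the tail of a partially filled stick in the tiled array is padding that no consumer reads.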
Lit test:
becomes (no bound check to tiled data)
and (bound check to untiled data)
I will also investigate whether transforming some `krnl.memcpy` into an unrolled SIMD loop may yield better results (it appears that it should, as the compiler-generated stick/unstick is still faster), at least for some small sizes. This will be in a subsequent PR.