Switching matmul codegen to CPUDoubleTilingExpert by default. #8175
Conversation
This PR enables fusion in the pipeline, which addresses perf issues for matmul + generic cases. However, it causes performance regressions in a few cases: some unnecessary one-trip loops get created. Some of them can be fixed by iree-org#7458. The trip counts are known at compilation time because the dim sizes are less than the tiling sizes, so in that context the one-trip loops won't be generated. This also enables vectorization for cases where the dim sizes are not divisible by the tiling sizes; see the example (260x300x280) below.

Overall, the new pipeline is better than the TileFuseAndVectorize pipeline, so we're going to switch matmul codegen to it. There are still some "gaps" between sandbox and IREE. It's not really an apples-to-apples comparison because:

- There is some overhead in IREE, which matters when the execution time is small.
- IREE is a multi-threaded approach while sandbox is single-threaded.

The next step is to search parameters and see if IREE can get closer to sandbox or the MKL lib, then derive heuristic configurations for those cases.
Performance

| Shape | Before w/o exp | After w/o exp | After w/ exp |
| ---------------- | --------------- | --------------- | ------------ |
| 1x384x384 | 0.025 ms | 0.095 ms | 0.024 ms |
| 128x384x384 | 0.311 ms | 0.337 ms | 0.355 ms |
| 128x1536x384 | 1.51 ms | 1.53 ms | 1.59 ms |
| 128x384x1536 | 1.51 ms | 1.43 ms | 1.52 ms |
| 384x384x512 | 1.45 ms | 1.30 ms | 1.38 ms |
| 384x128x128 | 0.108 ms | 0.107 ms | 0.122 ms |
| 384x128x512 | 0.453 ms | 0.458 ms | 0.511 ms |
| 384x512x128 | 0.414 ms | 0.408 ms | 0.415 ms |
| 384x512x2 | 0.366 ms | 0.860 ms | 0.582 ms |
| 384x384x32 | 0.102 ms | 0.137 ms | 0.121 ms |
| 64x384x64 | 0.028 ms | 0.030 ms | 0.028 ms |
| 64x1536x64 | 0.105 ms | 0.099 ms | 0.100 ms |
| 192x256x128 | 0.106 ms | 0.101 ms | 0.110 ms |
| 260x300x280 | 14.2 ms | 0.836 ms | 0.823 ms |
| 1000x1000x1000 | 684 ms | 36.0 ms | 35.9 ms |
| 1024x1024x1024 | 41.6 ms | 39.9 ms | 40.7 ms |
Mostly looks fine. Maybe a few clarifications. But this is looking promising!
@@ -329,16 +283,17 @@ static LogicalResult setX86SandboxRootConfig(
       /*workgroupSize=*/ArrayRef<int64_t>{});

   // Hardcoded tile sizes. The configuration is derived from iree-llvm-sandbox.
   // L1 tile sizes are {1, 1, ..., 288, 128, 512}.
   // Vector tile sizes are {1, ..., 9, 32, 16}
   // L1 tile sizes are {1, 1, ..., 288, 128, 512}. Note that L1 tiling sizes is
Could we just not set them? It's also not clear why the L1 tile sizes are unused; if set here, they will be used. These should be set to the same sizes as the second-level tile sizes of sandbox.
Ok, I see why they aren't used. I'd rather just not set them, but up to you. Using the L1 tile sizes and not using the vector tile sizes is more intuitive to me.
I think we might want it at some point. The first level of tiling is applied at the flow level, and it only considers parallel dims. In sandbox, the DoubleTilingExpert tiles both parallel dims and reduction dims in the first level of tiling; the TileFuseAndVectorize pass also does something similar. I think we'd like to keep the L1 tiling sizes and use them soon. I sent the PR out because the performance looks fine to me. I can give it a shot and update the status in the PR later.
There are some compilation issues when trying it. I'll go with using the L1 tile sizes (not the vector tile sizes) for now and drop a TODO here.
Looks fine overall. Does this regress the miniLM perf that is being tracked? If it does, it will just cause unnecessary pain. Would be good to check before we land this.
It's noisy, so I ran it several times. We have slight improvements in miniLM perf.
LGTM! We need to port the existing verifier over to be valid for this pipeline.
Yes, this is the next thing on my plate. :)
[Benchmark figures not captured: MobileBert Inference Benchmarks, MiniLM BERT Benchmarks, Matmul Microbenchmarks (Single-threaded)]