Switching matmul codegen to CPUDoubleTilingExpert by default. #8175
Conversation
This PR enables fusion in the pipeline, which addresses perf issues for matmul + generic cases. However, it causes performance regressions in a few cases: some unnecessary one-trip loops get created. Some of them can be fixed by iree-org#7458. The trip counts are known at compilation time because the dim sizes are less than the tiling sizes, so in that context the one-trip loops won't be generated. This also enables vectorization for cases where the dim sizes are not divisible by the tiling sizes; see the example (260x300x280) below.

Overall, the new pipeline is better than the TileFuseAndVectorize pipeline, so we're going to switch matmul codegen to it. There are still some "gaps" between sandbox and IREE. It's not really an apples-to-apples comparison because:

- There is some overhead in IREE, which matters when the execution time is small.
- IREE is a multi-threaded approach while sandbox is single-threaded.

The next step is to search parameters and see if IREE can get closer to sandbox or the MKL lib, then derive heuristic configurations for those cases.
Performance

| Shape | Before w/o exp | After w/o exp | After w/ exp |
| ---------------- | --------------- | --------------- | ------------ |
| 1x384x384 | 0.025 ms | 0.095 ms | 0.024 ms |
| 128x384x384 | 0.311 ms | 0.337 ms | 0.355 ms |
| 128x1536x384 | 1.51 ms | 1.53 ms | 1.59 ms |
| 128x384x1536 | 1.51 ms | 1.43 ms | 1.52 ms |
| 384x384x512 | 1.45 ms | 1.30 ms | 1.38 ms |
| 384x128x128 | 0.108 ms | 0.107 ms | 0.122 ms |
| 384x128x512 | 0.453 ms | 0.458 ms | 0.511 ms |
| 384x512x128 | 0.414 ms | 0.408 ms | 0.415 ms |
| 384x512x2 | 0.366 ms | 0.860 ms | 0.582 ms |
| 384x384x32 | 0.102 ms | 0.137 ms | 0.121 ms |
| 64x384x64 | 0.028 ms | 0.030 ms | 0.028 ms |
| 64x1536x64 | 0.105 ms | 0.099 ms | 0.100 ms |
| 192x256x128 | 0.106 ms | 0.101 ms | 0.110 ms |
| 260x300x280 | 14.2 ms | 0.836 ms | 0.823 ms |
| 1000x1000x1000 | 684 ms | 36.0 ms | 35.9 ms |
| 1024x1024x1024 | 41.6 ms | 39.9 ms | 40.7 ms |
Mostly looks fine. Maybe a few clarifications. But this is looking promising!
@@ -329,16 +283,17 @@ static LogicalResult setX86SandboxRootConfig(
       /*workgroupSize=*/ArrayRef<int64_t>{});

   // Hardcoded tile sizes. The configuration is derived from iree-llvm-sandbox.
   // L1 tile sizes are {1, 1, ..., 288, 128, 512}.
   // Vector tile sizes are {1, ..., 9, 32, 16}
   // L1 tile sizes are {1, 1, ..., 288, 128, 512}. Note that L1 tiling sizes is
Could we just not set them? It's also not clear why the L1 tile sizes are unused; if set here, they will be used. These should be set to the same sizes as the second-level tile sizes of sandbox.
Ok, I see why they aren't used. I'd rather just not set them, but up to you. Using the L1 tile sizes and not using the vector tile sizes is more intuitive to me.
I think we might want it at some point. The first level of tiling is applied at the flow level, and it only considers parallel dims. In sandbox, the DoubleTilingExpert tiles both parallel dims and reduction dims in the first level of tiling; the TileFuseAndVectorize pass also does something similar. I think we'd like to keep the L1 tiling sizes and use them soon. I sent the PR out because the performance looks fine to me. I can give it a shot and update the status in the PR later.
There are some compilation issues when trying it. I'll go with using the L1 tile sizes (not the vector tile sizes) for now and drop a TODO here.
Looks fine overall. Does this regress the miniLM perf that is being tracked? If it does, it will just cause unnecessary pain. Would be good to check before we land this.
It's noisy, so I ran it several times. We have slight improvements in miniLM perf.
LGTM! We need to port the existing verifier over to be valid for this pipeline.
Yes, this is the next thing on my plate. :)
[Benchmark figures not captured: MobileBert Inference Benchmarks, MiniLM BERT Benchmarks, Matmul Microbenchmarks (Single-threaded)]