
Switching matmul codegen to CPUDoubleTilingExpert by default. #8175

Merged
merged 7 commits into iree-org:main from gemm-gap on Jan 27, 2022

Conversation

hanhanW
Contributor

@hanhanW hanhanW commented Jan 25, 2022

This PR enables fusion in the pipeline, which addresses perf issues for
matmul + generic cases. However, it causes performance regressions in a
few cases, where unnecessary one-trip loops are created. Some of them
can be fixed by #7458: the trip counts are known at compilation time
because the dim sizes are less than the tiling sizes, so in that context
the one-trip loops won't be generated.

This also enables vectorization for cases where the dim sizes are not
divisible by the tiling sizes. See the 260x300x280 example below.
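As a rough illustration of where the tail and one-trip loops come from (a hypothetical Python sketch, not IREE code; `tiled_ranges` is a made-up helper):

```python
# Hypothetical sketch (not IREE code) of tiling a dim that the tile size
# does not divide evenly, e.g. M=260 with tile 32 in the 260x300x280 case.
# The last, partial tile is the "tail"; peeling it off lets the main loop
# assume full tiles and vectorize, instead of emitting scalar fallback code.

def tiled_ranges(dim, tile):
    """Split [0, dim) into full tiles plus at most one partial tail tile."""
    full = dim // tile * tile
    ranges = [(start, tile) for start in range(0, full, tile)]
    if full < dim:                      # leftover tail tile
        ranges.append((full, dim - full))
    return ranges

# M=260, tile 32: eight full tiles of 32 and one tail tile of 4.
print(tiled_ranges(260, 32)[-1])   # -> (256, 4)
```

When a dim is smaller than its tile size, the whole loop is a single (possibly partial) iteration, which is the one-trip-loop case mentioned above.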

Overall, the new pipeline is better than the TileFuseAndVectorize
pipeline, so we're switching matmul codegen to the new pipeline.

There are still some gaps between sandbox and IREE. It's not really an
apples-to-apples comparison because:

  • There is some overhead in IREE, which matters when execution times
    are small.
  • IREE uses a multi-threaded approach while sandbox is
    single-threaded.

The next step is to search for parameters and see whether IREE can get
closer to sandbox and the MKL library, then derive heuristic
configurations for those cases.

Performance

MobileBert Inference Benchmarks

| Before (single-threaded) | After (single-threaded) | Before (multi-threaded) | After (multi-threaded) |
| ------------------------ | ----------------------- | ----------------------- | ---------------------- |
| 228 ms                   | 200 ms                  | 61.1 ms                 | 59.3 ms                |

MiniLM BERT Benchmarks

| Before (single-threaded) | After (single-threaded) | Before (multi-threaded) | After (multi-threaded) |
| ------------------------ | ----------------------- | ----------------------- | ---------------------- |
| 82.7 ms                  | 79.1 ms                 | 21.6 ms                 | 21.1 ms                |

Matmul Microbenchmarks (Single-threaded)

| Shape            | Before w/o exp  | After w/o exp   | After w/ exp |
| ---------------- | --------------- | --------------- | ------------ |
| 1x384x384        |        0.025 ms |        0.095 ms |     0.024 ms |
| 128x384x384      |        0.311 ms |        0.337 ms |     0.355 ms |
| 128x1536x384     |         1.51 ms |         1.53 ms |      1.59 ms |
| 128x384x1536     |         1.51 ms |         1.43 ms |      1.52 ms |
| 384x384x512      |         1.45 ms |         1.30 ms |      1.38 ms |
| 384x128x128      |        0.108 ms |        0.107 ms |     0.122 ms |
| 384x128x512      |        0.453 ms |        0.458 ms |     0.511 ms |
| 384x512x128      |        0.414 ms |        0.408 ms |     0.415 ms |
| 384x512x2        |        0.366 ms |        0.860 ms |     0.582 ms |
| 384x384x32       |        0.102 ms |        0.137 ms |     0.121 ms |
| 64x384x64        |        0.028 ms |        0.030 ms |     0.028 ms |
| 64x1536x64       |        0.105 ms |        0.099 ms |     0.100 ms |
| 192x256x128      |        0.106 ms |        0.101 ms |     0.110 ms |
| 260x300x280      |         14.2 ms |        0.836 ms |     0.823 ms |
| 1000x1000x1000   |          684 ms |         36.0 ms |      35.9 ms |
| 1024x1024x1024   |         41.6 ms |         39.9 ms |      40.7 ms |

Contributor

@MaheshRavishankar MaheshRavishankar left a comment


Mostly looks fine. Maybe a few clarifications. But this is looking promising!

@@ -329,16 +283,17 @@ static LogicalResult setX86SandboxRootConfig(
/*workgroupSize=*/ArrayRef<int64_t>{});

// Hardcoded tile sizes. The configuration is derived from iree-llvm-sandbox.
// L1 tile sizes are {1, 1, ..., 288, 128, 512}.
// Vector tile sizes are {1, ..., 9, 32, 16}
// L1 tile sizes are {1, 1, ..., 288, 128, 512}. Note that L1 tiling sizes is
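For intuition, the two tiling levels mentioned in this comment can be sketched as follows (illustrative Python with made-up helper names, not IREE internals):

```python
# Hypothetical sketch of the two tiling levels from the hardcoded config:
# L1 tiles of (288, 128, 512) over (M, N, K), then vector tiles of
# (9, 32, 16) inside each L1 tile. Not IREE code; names are illustrative.

L1_TILES = (288, 128, 512)    # (M, N, K) cache-level tile sizes
VEC_TILES = (9, 32, 16)       # register/vector-level tile sizes

def tile_loop_trip_counts(dims, tiles):
    """Trip count of each tiled loop, rounding up for partial tiles."""
    return tuple(-(-d // t) for d, t in zip(dims, tiles))

# A 1000x1000x1000 matmul: the L1 loops over (M, N, K) run 4 x 8 x 2 times,
# and each full L1 tile is further tiled by the vector sizes.
print(tile_loop_trip_counts((1000, 1000, 1000), L1_TILES))  # -> (4, 8, 2)
```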
Contributor

Could we just not set them? Also, it's not clear why the L1 tile sizes are unused; if they're set here, they will be used. These should be set to the same sizes as the second-level tile sizes of sandbox.

Contributor

Ok, I see why they aren't used. I'd rather just not set them, but up to you. Using the L1 tile sizes and not using the vector tile sizes is more intuitive to me.

Contributor Author

I think we might want it at some point. The first level of tiling is applied at the flow level, which only considers parallel dims. In sandbox, the DoubleTilingExpert tiles both parallel and reduction dims in the first level of tiling, and the TileFuseAndVectorize pass does something similar. I think we'd like to keep the L1 tiling sizes and use them soon. I sent the PR out because the performance looks fine to me. I can give it a shot and update the status in the PR later.
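A rough sketch of the difference described here, using hypothetical helpers (not IREE code):

```python
# Hypothetical contrast between the two first-level tiling strategies:
# - flow-level tiling tiles only the parallel dims (M, N) of a matmul,
#   so each tile carries the whole K reduction;
# - sandbox's DoubleTilingExpert also tiles the reduction dim K.
# Illustrative only; tuples are (m, n, k, tile_m, tile_n, tile_k).

def flow_level_tiles(M, N, K, tm, tn):
    # Only parallel dims are tiled; K stays whole inside each tile.
    return [(m, n, 0, tm, tn, K)
            for m in range(0, M, tm) for n in range(0, N, tn)]

def double_tiling_tiles(M, N, K, tm, tn, tk):
    # Parallel and reduction dims are all tiled at the first level.
    return [(m, n, k, tm, tn, tk)
            for m in range(0, M, tm)
            for n in range(0, N, tn)
            for k in range(0, K, tk)]

print(len(flow_level_tiles(256, 256, 512, 64, 64)))          # -> 16
print(len(double_tiling_tiles(256, 256, 512, 64, 64, 128)))  # -> 64
```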

Contributor Author

There are some compilation issues when trying it. I'll go with using the L1 tile sizes (not the vector tile sizes) for now and drop a TODO here.

@MaheshRavishankar
Contributor

Looks fine overall. Does this regress the miniLM perf that is being tracked? If it does, it will just cause unnecessary pain. Would be good to check before we land this.

@hanhanW
Contributor Author

hanhanW commented Jan 26, 2022

> Looks fine overall. Does this regress the miniLM perf that is being tracked? If it does, it will just cause unnecessary pain. Would be good to check before we land this.

It's noisy, so I ran it several times. We have slight improvements in miniLM perf.

| Before (single-threaded) | After (single-threaded) | Before (multi-threaded) | After (multi-threaded) |
| ------------------------ | ----------------------- | ----------------------- | ---------------------- |
| 82.7 ms                  | 79.1 ms                 | 21.6 ms                 | 21.1 ms                |

@hanhanW hanhanW enabled auto-merge (squash) January 26, 2022 22:02
Contributor

@MaheshRavishankar MaheshRavishankar left a comment


LGTM! We need to port the existing verifier over to be valid for this pipeline.

@hanhanW hanhanW merged commit e3ded85 into iree-org:main Jan 27, 2022
@hanhanW hanhanW deleted the gemm-gap branch January 27, 2022 18:44
@hanhanW
Contributor Author

hanhanW commented Jan 27, 2022

> LGTM! We need to port the existing verifier over to be valid for this pipeline.

Yes, this is the next thing on my plate. :)
