[MicroBenchmark,LoopInterleaving] Check performance impact of Loop Interleaving Count with varying loop iterations #26

Merged 11 commits on Nov 17, 2023

nilanjana87 (Contributor) commented Sep 27, 2023

This microbenchmark attempts to find the right loop trip count threshold for deciding whether to interleave a loop, across different types of loops: loops with or without a reduction, and loops with or without vectorization. For context, the loop vectorizer currently uses the threshold TinyTripCountInterleaveThreshold, which is set to 128.

…nterleaving Count with varying loop iterations.

Note: Interleaving count of 1 means interleaving is disabled.

Differential Revision: https://reviews.llvm.org/D159475
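
As a rough illustration of the kind of kernel such a benchmark could measure (a sketch, not code taken from LoopInterleaving.cpp), the loops below use Clang's `#pragma clang loop` hints to request a vectorization width and an interleaving count on a reduction loop. The function names and the chosen values (width 4, interleave counts 2 and 1) are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical reduction kernel. The pragma asks Clang for a vector width
// of 4 and an interleave count of 2; it is guarded so that other compilers
// do not see an unknown pragma.
int sumInterleaved2(const int *a, std::size_t n) {
  assert(a != nullptr || n == 0);
  int sum = 0;
#if defined(__clang__)
#pragma clang loop vectorize_width(4) interleave_count(2)
#endif
  for (std::size_t i = 0; i < n; ++i)
    sum += a[i];
  return sum;
}

// Same loop with interleave_count(1), which means interleaving is disabled.
int sumNotInterleaved(const int *a, std::size_t n) {
  assert(a != nullptr || n == 0);
  int sum = 0;
#if defined(__clang__)
#pragma clang loop vectorize_width(4) interleave_count(1)
#endif
  for (std::size_t i = 0; i < n; ++i)
    sum += a[i];
  return sum;
}
```

Both functions compute the same result; only the code generated for the loop differs, which is what timing the two variants over a range of trip counts would expose.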

fhahn (Contributor) commented Oct 10, 2023

IIUC this adds benchmarks for the best case for interleaving. Would it be possible to also add variants where interleaving is less beneficial?

nilanjana87 and others added 2 commits October 22, 2023 20:22
…ed a redundant function, added test cases where the compiler selects the vectorization configuration
nilanjana87 (Contributor, Author) replied:

> IIUC this adds benchmarks for the best case for interleaving. Would it be possible to also add variants where interleaving is less beneficial?

This file adds benchmarks for testing cases that may or may not be beneficial for interleaving. For example, cases with low trip counts are better off without interleaving, whereas if the interleaved loop runs at least twice it starts showing performance benefit. If we need to add more cases where it is less beneficial, can you suggest some?

fhahn (Contributor) commented Oct 23, 2023

> > IIUC this adds benchmarks for the best case for interleaving. Would it be possible to also add variants where interleaving is less beneficial?
>
> This file adds benchmarks for testing cases that may or may not be beneficial for interleaving. For example, cases with low trip counts are better off without interleaving, whereas if the interleaved loop runs at least twice it starts showing performance benefit. If we need to add more cases where it is less beneficial, can you suggest some?

Some additional cases could be loops with a reduction, but also a number of additional independent memory and compute operation chains (i.e. larger loop body with multiple independent operation chains) and one with a larger loop body without reduction (in that case, LLVM likely will decide not to interleave).

They might provide additional insight into current issues of the cost modeling for interleaving.

nilanjana87 added a commit to nilanjana87/llvm-project that referenced this pull request Oct 24, 2023
    A set of microbenchmarks in llvm-test-suite (llvm/llvm-test-suite#26), when tested on an AArch64 platform, demonstrates that loop interleaving is beneficial in two cases:
    1) when TC > 2 * VW * IC, such that the interleaved vectorized portion of the loop runs at least twice
    2) when TC is an exact multiple of VW * IC, such that there is no epilogue loop to run
    where, TC = trip count, VW = vectorization width, IC = interleaving count

    We change the interleave count computation based on this information but we leave it the same when the flag InterleaveSmallLoopScalarReductionTrue is set to true, since it handles a special case (https://reviews.llvm.org/D81416).
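
The two conditions in the commit message above can be written down as simple predicates. This is only a sketch of the stated conditions, not the code from the referenced patch; the function names are hypothetical, and TC, VW and IC are passed in as plain integers.

```cpp
#include <cassert>
#include <cstdint>

// Condition 1 as stated: TC > 2 * VW * IC, so the interleaved vector body
// runs at least twice.
bool interleavedBodyRunsTwice(std::uint64_t tc, std::uint64_t vw,
                              std::uint64_t ic) {
  assert(vw > 0 && ic > 0);
  return tc > 2 * vw * ic;
}

// Condition 2 as stated: TC is an exact multiple of VW * IC, so no epilogue
// loop is needed.
bool hasNoEpilogue(std::uint64_t tc, std::uint64_t vw, std::uint64_t ic) {
  assert(vw > 0 && ic > 0);
  return tc % (vw * ic) == 0;
}

// Interleaving is reported as beneficial when either condition holds.
bool interleavingLooksBeneficial(std::uint64_t tc, std::uint64_t vw,
                                 std::uint64_t ic) {
  return interleavedBodyRunsTwice(tc, vw, ic) || hasNoEpilogue(tc, vw, ic);
}
```

For example, with TC = 24, VW = 4 and IC = 2, VW * IC is 8, so 24 > 16 and 24 is a multiple of 8: both conditions hold. With TC = 12 neither holds, since 12 is not greater than 16 and 12 leaves a remainder of 4 when divided by 8.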

nilanjana87 (Contributor, Author) replied:

> > > IIUC this adds benchmarks for the best case for interleaving. Would it be possible to also add variants where interleaving is less beneficial?
> >
> > This file adds benchmarks for testing cases that may or may not be beneficial for interleaving. For example, cases with low trip counts are better off without interleaving, whereas if the interleaved loop runs at least twice it starts showing performance benefit. If we need to add more cases where it is less beneficial, can you suggest some?
>
> Some additional cases could be loops with a reduction, but also a number of additional independent memory and compute operation chains (i.e. larger loop body with multiple independent operation chains) and one with a larger loop body without reduction (in that case, LLVM likely will decide not to interleave).
>
> They might provide additional insight into current issues of the cost modeling for interleaving.

Replicated the existing tests for loops with bigger bodies that have additional independent memory operations. These can be found in the latest patch by doing a case-insensitive search of the function names for "BigLoop".
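
For readers without the patch at hand, a loop of the kind described here (a reduction plus independent memory and compute chains in a larger body) might look roughly like the following. This is an assumed shape for illustration only; the function name and the particular operations are hypothetical and are not copied from the "BigLoop" tests.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical "big loop": one reduction plus two independent chains that
// each load from the inputs and store to their own output array.
int bigLoopWithReduction(const int *a, const int *b, int *c, int *d,
                         std::size_t n) {
  assert(n == 0 || (a && b && c && d));
  int sum = 0;
  for (std::size_t i = 0; i < n; ++i) {
    c[i] = a[i] + b[i];     // independent chain 1: load, add, store
    d[i] = a[i] * 3 - b[i]; // independent chain 2: load, mul, sub, store
    sum += a[i];            // reduction carried across iterations
  }
  return sum;
}
```

Because the two store chains do not depend on each other or on the reduction, such a body already offers some instruction-level parallelism without interleaving, which is why it is a useful contrast to a small pure-reduction loop.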

fhahn (Contributor) left a comment

Thanks for adding the extra loops, a few more comments inline

MicroBenchmarks/LoopVectorization/LoopInterleaving.cpp (two inline comments, outdated and resolved)
fhahn (Contributor) left a comment

LGTM, thanks!

nilanjana87 merged commit eda2d6c into llvm:main on Nov 17, 2023
antmox (Contributor) commented Nov 17, 2023

antmox added a commit that referenced this pull request Nov 17, 2023
… Loop Interleaving Count with varying loop iterations (#26)"

This reverts commit eda2d6c.
antmox added a commit that referenced this pull request Nov 17, 2023
… Loop Interleaving Count with varying loop iterations (#26)" (#54)

This reverts commit eda2d6c.
antmox (Contributor) commented Nov 17, 2023

Reverted it. Hope that it's OK and that it was indeed the culprit commit. (edit: yes it was)

nilanjana87 (Contributor, Author) replied:

> Reverted it. Hope that it's OK and that it was indeed the culprit commit. (edit: yes it was)

Thanks for figuring this out & reverting it. Seems like adding too many tests ended up hitting the timeout. I'll try to reduce the test points to keep it within the time limit & re-land it.

The tests passed for me locally though, even with the ninja check command that timed out in the build-bot. So, what is a good test to find this issue locally, other than reducing the test time below 1200 seconds as suggested by the failure log?

On a separate matter, I was checking the bot links you posted above, but I don't see myself or this patch in the Responsible Users or Changes tab, for example in https://lab.llvm.org/buildbot/#/builders/185/builds/5395. Is this because my patch has been removed from this list after it was reverted?

antmox (Contributor) commented Nov 20, 2023

Yes, the test does complete successfully here as well. I don't know the rules for adding new tests, but this one seems really long.
2500sec here on a huge aarch64 machine, while the testsuite can run on smaller machines, sometimes heavily loaded.
I don't think there is a 1200sec limit, but jobs are killed after 1200sec without any output.

For the other thing, the responsible users and changes tabs should be read with care.
Changes from the llvm-test-suite and lnt repos are not listed here. Some bots may not even list all changes from the llvm-project here.
I think this is defined in the llvm-zorg builders classes, depends-on-projects attribute.

antmox (Contributor) commented Nov 20, 2023

Also, I wonder why only the arm/aarch64 bots failed with this test. Maybe there's something to analyze here.
On what type of machine did you run your test, and what was the execution time?

nilanjana87 (Contributor, Author) replied:

> 2500sec here on a huge aarch64 machine, while the testsuite can run on smaller machines, sometimes heavily loaded.
> I don't think there is a 1200sec limit, but jobs are killed after 1200sec without any output.

It runs for a little more than 40 min on my machine, but locally it doesn't hit the timeout. I added a new PR #56, which runs a reduced test set by default and runs the whole set when compiled with a flag.

> Also, I wonder why only the arm/aarch64 bots failed with this test.

There's nothing Arm/AArch64 specific in the code, so I don't have a clue why this is so.
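
One way to get a "reduced set by default, full set behind a flag" arrangement is a compile-time switch around the list of trip counts to measure. The sketch below is an assumption about how this could be done, not the contents of PR #56: the macro name `BENCH_FULL_SWEEP`, the function name and all of the trip count values are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Returns the trip counts to benchmark. By default only a handful of points
// are measured; defining the hypothetical macro BENCH_FULL_SWEEP at compile
// time (e.g. -DBENCH_FULL_SWEEP) selects a dense sweep instead.
std::vector<int> tripCountsToBenchmark() {
  std::vector<int> counts;
#ifdef BENCH_FULL_SWEEP
  for (int tc = 1; tc <= 1024; ++tc)
    counts.push_back(tc);
#else
  counts = {8, 16, 24, 32, 64, 128, 256};
#endif
  assert(!counts.empty() && "a benchmark needs at least one trip count");
  return counts;
}
```

Keeping the default list short bounds the total run time on slow or heavily loaded bots, while the dense sweep stays available for local investigation.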

tarunprabhu pushed a commit to llvm-project-tlp/llvm-test-suite that referenced this pull request Dec 8, 2023
…terleaving Count with varying loop iterations (llvm#26)

* [MicroBenchmarks,LoopInterleaving] Check performance impact of Loop Interleaving Count with varying loop iterations.

This microbenchmark attempts to find the impact of loop interleaving count for different types of loops (big or small, with or without reductions inside them) over different vectorization factors for varying loop trip counts.
Note: Interleaving count of 1 means interleaving is disabled.

These microbenchmarks are to help guide changes in loop interleaving count computation and removal of trip count threshold for interleaving loops in llvm/llvm-project#67725 & related patches.
tarunprabhu pushed a commit to llvm-project-tlp/llvm-test-suite that referenced this pull request Dec 8, 2023
… Loop Interleaving Count with varying loop iterations (llvm#26)" (llvm#54)

This reverts commit eda2d6c.
nilanjana87 added a commit to nilanjana87/llvm-project that referenced this pull request Jan 4, 2024
…the loop

The current loop trip count threshold to allow loop interleaving is 128, which seems arbitrarily high & uncorrelated with factors like VW, IC, register pressure etc.

A set of microbenchmarks in llvm-test-suite (llvm/llvm-test-suite#26), when tested on an AArch64 platform, shows that loop interleaving is beneficial even for loops with low trip counts. We have also found similar evidence in an application benchmark that, when compiled with PGO, shows a 40% regression when its hot loop with a profile-guided trip count of 24 doesn't get interleaved because of this threshold.

Therefore, it seems reasonable to eliminate this threshold and use the trip count for computing interleaving count instead (llvm#73766).
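
To make the contrast concrete, the sketch below compares a fixed threshold gate with one possible way of deriving an interleave count from the trip count. It is not LLVM's cost model and not the code of the referenced patches: `passesThresholdGate`, `interleaveCountFromTripCount` and the cap `maxIC` are hypothetical, and the selection rule (the largest power-of-two IC for which the interleaved vector body still runs at least twice) is an assumption chosen for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Threshold value quoted in the commit message above.
constexpr std::uint64_t kTinyTripCountInterleaveThreshold = 128;

// Threshold-style gate: loops with a trip count below the threshold are
// never interleaved, regardless of VW or IC.
bool passesThresholdGate(std::uint64_t tc) {
  return tc >= kTinyTripCountInterleaveThreshold;
}

// Hypothetical alternative: pick the largest power-of-two IC (up to maxIC)
// such that TC >= 2 * VW * IC, i.e. the interleaved body runs twice or more.
std::uint64_t interleaveCountFromTripCount(std::uint64_t tc, std::uint64_t vw,
                                           std::uint64_t maxIC) {
  assert(vw > 0 && maxIC > 0);
  std::uint64_t ic = 1; // 1 means interleaving is disabled
  while (ic * 2 <= maxIC && tc >= 2 * vw * (ic * 2))
    ic *= 2;
  return ic;
}
```

With TC = 24 and VW = 4, the threshold gate rejects the loop outright (24 is less than 128), whereas the trip-count-based rule in this sketch settles on IC = 2, because 24 >= 2 * 4 * 2 but 24 < 2 * 4 * 4.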
nilanjana87 added a commit to llvm/llvm-project that referenced this pull request Feb 6, 2024
…ave a loop (#67725)

A set of microbenchmarks (llvm/llvm-test-suite#26) showed that loop interleaving can be beneficial for loops with low trip count as well. Loop interleaving count computation is updated accordingly in prior patches while this patch removes the loop trip count threshold for interleaving.
nilanjana87 added a commit to apple/llvm-project that referenced this pull request Mar 5, 2024
…ave a loop (llvm#67725)

A set of microbenchmarks (llvm/llvm-test-suite#26) showed that loop interleaving can be beneficial for loops with low trip count as well. Loop interleaving count computation is updated accordingly in prior patches while this patch removes the loop trip count threshold for interleaving.