Conversation

@bababuck
Contributor

Currently, we don't optimize function-specialize cases like the following:

int foo(int a) {
  return a + 1;
}

int bar(int a) {
  return foo(a);
}

int main() {
  return bar(3);
}

because when specializing for the 3 passed to bar(), the value isn't propagated into foo() to gauge benefit from specializing foo() as well. With this patch, the optimized code would be:

int foo.specialized.1() {
  return 4;
}

int bar.specialized.2() {
  return foo.specialized.1();
}

int main() {
  return bar.specialized.2();
}

This patch required a fair amount of refactoring before making my changes. That refactoring was done as a series of commits before the main changes for this patch. I'm assuming that (if accepted) this should get merged in as a series of MRs.

The series of commits (x/6) all belong together since they are the same functional change, but I left them as separate commits for easier reading.

At a high level, each Spec element can have SubSpecs, which are functions that it forwards its constant argument(s) to and that should be specialized along with it. In the above example, the Spec for bar(3) would have a SubSpec for foo(3).
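The Spec/SubSpec relationship described above can be modeled with a minimal sketch (this is illustrative only, not the actual patch code; the `Spec`/`SubSpec` names come from the description, and all fields and helpers here are hypothetical):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch: a Spec records one specialization candidate,
// and its SubSpecs are the specializations of callees that the
// constant argument(s) are forwarded to.
struct Spec {
  std::string FuncName;        // function being specialized, e.g. "bar"
  int ConstArg;                // the constant argument, e.g. 3
  std::vector<Spec> SubSpecs;  // chained callee specializations, e.g. foo(3)
};

// Build the Spec for bar(3) from the example above: bar forwards its
// constant argument 3 into foo, so foo(3) becomes a SubSpec.
Spec makeBarSpec() {
  Spec FooSpec{"foo", 3, {}};
  return Spec{"bar", 3, {FooSpec}};
}
```

A chain's profitability can then be scored by walking a Spec together with its SubSpecs, which is the recursion the commit messages below refer to.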

…han the minimum amount

If the knob for minimum code size savings is turned down low enough, then for small functions
`MinCodeSizeSavings * FuncSize / 100` will evaluate to `0` under integer division, and with a
strict less-than comparison we will accept specializations that provide no benefit at all.
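The truncation can be demonstrated with a small standalone sketch. This assumes the rejection check has the shape `CodeSizeSavings < MinCodeSizeSavings * FuncSize / 100` as the commit message implies; the real pass logic is more involved, so treat this as a model:

```cpp
#include <cassert>

// Simplified model of the rejection check described above.
// MinCodeSizeSavings is a percentage knob; FuncSize and
// CodeSizeSavings are in abstract cost units.
bool rejected(unsigned MinCodeSizeSavings, unsigned FuncSize,
              unsigned CodeSizeSavings) {
  // Integer division: for small FuncSize the threshold truncates to 0.
  unsigned Threshold = MinCodeSizeSavings * FuncSize / 100;
  // Strict '<': savings of 0 are not below a threshold of 0,
  // so a zero-benefit specialization slips through.
  return CodeSizeSavings < Threshold;
}
```

For example, with `MinCodeSizeSavings = 10` and `FuncSize = 5` the threshold is `50 / 100 == 0`, so a specialization with `CodeSizeSavings == 0` is not rejected.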
…ucture

The data structure will eventually contain extra data for chained and indirect
specialization.
…ogic to its own function

Will want to call recursively for chains.
…ization into macro

Will need to call recursively.

No functional change.
Spec contains a Function, and will need to pass extra information
with Chaining.
This used to be a single object within findSpecializations(), since
each Function entered findSpecializations() only once. But functions
will now be visited in arbitrary order with Chains.
…zation

Cannot rely on AllSpecs being in order after Chaining.
If a function is called with constants and passes those constants to another function,
try to specialize both of those functions.
…en only ever part of a chain

Will get specialized as part of the chain if the chain scores well enough.
… chains in NSpecs

Will get specialized as part of the chain, so they aren't viable standalone.
When calculating possible Chains, use the metrics saved as part
of the sub-specializations.
…ed functions

Otherwise confusing with Chaining.
In the future we won't know the Function at the time of insertion, so
need to store and index so we can look up the Argument later.
…of arguments, skip chaining

See test/Transforms/FunctionSpecialization/compiler-crash-60191.ll
…s part of a chain

This way we can still more accurately see the effect of the specialization.
@github-actions

github-actions bot commented Oct 16, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@labrinea
Collaborator

when specializing for the 3 passed to bar(), the value isn't propagated into foo() to gauge benefit from specializing foo() as well

The function InstCostVisitor::visitCallBase tries to compute benefit from constant folding if possible. In this example it cannot constant fold foo(3) while propagating the constant inside bar's body. However given a small enough codesize threshold this example gets specialized: https://godbolt.org/z/dvfMov9WM

@bababuck
Contributor Author

when specializing for the 3 passed to bar(), the value isn't propagated into foo() to gauge benefit from specializing foo() as well

The function InstCostVisitor::visitCallBase tries to compute benefit from constant folding if possible. In this example it cannot constant fold foo(3) while propagating the constant inside bar's body. However given a small enough codesize threshold this example gets specialized: https://godbolt.org/z/dvfMov9WM

Agreed, thank you for the correction! My understanding is that visitCallBase calls ConstantFolding.cpp::ConstantFoldCall, which is targeted at intrinsics and library calls (please correct me if I'm wrong).

The example on Godbolt only specializes on current upstream due to some odd behavior of the function specializer, see #164867 (that change is included in this MR as well; I have begun splitting off small chunks that can stand on their own).

@labrinea
Collaborator

I briefly looked at the patch series and I am not convinced it's the right approach. Perhaps it's best to handle non constant foldable calls in the instruction cost visitor separately, similarly to branches and switches which are not folded to a constant. I mean to compute the profitability of specializing that call instead of folding it. But then all this adds compile time complexity which I am not sure it is worth it.

@bababuck
Contributor Author

Sorry for the slow response, and thanks for taking the time to engage.

Perhaps it's best to handle non constant foldable calls in the instruction cost visitor separately, similarly to branches and switches which are not folded to a constant.

I think that is a competitive approach; here are my pros and cons of the two approaches.
Pros of the approach in this patch:

  • Able to handle indirect function calls that are constants. Since we currently visit each instruction one argument at a time, the second approach would require additional caching logic to handle this.
  • Can handle chained specialization in a single run() loop, so it doesn't run the risk of specializing the first layer and not the second (in the case where the maximum iteration count is hit or the maximum code size is reached). At least if my understanding is correct, under the other approach the cost metric would allow the outer function to specialize due to the savings of the inner function, and only in the next iteration would the inner function specialize.

Pros of the approach you suggested:

  • Much cleaner implementation, only need to extend a single area of the code

But then all this adds compile time complexity which I am not sure it is worth it.

We were looking into this for a particular case in x264 which we wanted to optimize, but I can collect data on how this code behaves on other workloads.
