Skip to content

Conversation

@lslusarczyk
Copy link
Contributor

@lslusarczyk lslusarczyk commented Nov 18, 2025

Reduce overhead of SYCL over L0 in almost all scenarios by up to 3% (comparing to L0 baseline), or by (0-13% comparing to current overhead).

Multiple optimization on common path. Performance gain mainly by inlining most frequently executed simple methods, like traces or getters.

Performance improvements may be seen here:
https://intel.github.io/llvm/benchmarks/?runs=lslusarczyk_BMG_L0v2%2CBaseline_BMG_L0v2

They are seen in most of benchmarks. The most significant in CPU Count. But also in time plots

Sample improvements (dot-s on the right are new results from this PR):
SubmitKernel in order using events, SYCL, from 15.04us to 14.62us (L0 12.19us), that is overhead over L0 reduced from 23.3% to 19.9%
SubmitKernel in order using events

SubmitKernel out of order with completion using events long kernel, CPU count, SYCL, from 159k instr, to 155k instr (L0 132.5k) that is overhead over L0 reduced from 20.0% to 17.6%
SubmitKernel out of order with completion using events long kernel, CPU count

SubmitKernel in order, CPU count, SYCL, from 133.4k instr, to 130.7k instr (L0 118.0k), that is overhead over L0 reduced from 13.0% yo 10.7%
SubmitKernel in order, CPU count

... and much more.

Flamegraph of optimized method before:
enqueueImpKernel_outsideUR

Flamegraph after:
enqueueImpKernel_outsideUR

See especially how getOrCreateKernel became smaller. But not only this function.

@lslusarczyk lslusarczyk marked this pull request as ready for review November 18, 2025 13:51
@lslusarczyk lslusarczyk requested a review from a team as a code owner November 18, 2025 13:51
@lslusarczyk lslusarczyk requested a review from vinser52 November 18, 2025 13:51
@lslusarczyk
Copy link
Contributor Author

There is one UT failing, which I will fix tomorrow, but I think this PR may be reviewed right now.

return MUsesAssert;
}

const std::optional<int> &getImplicitLocalArgPos() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this method market a const?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that you just moved the implementation from the cpp to the hpp file. But maybe initially it was a mistake not marking this method as const?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, fixed, I also removed not needed const reference to small return value

#endif
}

GlobalHandler *&GlobalHandler::getInstancePtr() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if we keep this method in cpp file and move all others to the hpp? What will be the impact on performance?
My personal opinion is that code with a static variable declared inside the method looks clearer and isolated.

Copy link
Contributor Author

@lslusarczyk lslusarczyk Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if we keep this method in cpp file and move all others to the hpp? What will be the impact on performance?

I see getInstancePtr is used only in .cpp. Should be also fast. But this function is not needed any more. I removed it in favor of just using variable.

My personal opinion is that code with a static variable declared inside the method looks clearer and isolated.

You are right. However on critical path we should avoid this pattern, as it is slower than global variable outside function. See this SO answer. I worked in another project where we indeed observed how much static variable inside a method pattern can deteriorate performance.


/// \return true if the instance has not been deallocated yet.
static bool isInstanceAlive();
static GlobalHandler *&getInstancePtr() { return RTGlobalObjHandler; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it public now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my mistake, anyway I'm removing this function

Copy link
Contributor

@vinser52 vinser52 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are two parts in this PR:

  1. There are NFC changes when you move method implementations from cpp to hpp file.
  2. Optimization of tracing path, which is a functional change.

I would like to see two PR: the first is NFC changes, and the second is refactoring tracing/config reading paths. It would be easier to review, but what is more important to me is how much each PR contributes to the performance.

@sergey-semenov recently tried to enable LTO and there was almost no perf impact. That is why I am curious how much moving methods impl from cpp to hpp files improves perf.

@lslusarczyk
Copy link
Contributor Author

lslusarczyk commented Nov 19, 2025

I think there are two parts in this PR:

1. There are NFC changes when you move method implementations from cpp to hpp file.

2. Optimization of tracing path, which is a functional change.

I would like to see two PR: the first is NFC changes, and the second is refactoring tracing/config reading paths. It would be easier to review, but what is more important to me is how much each PR contributes to the performance.

Twor PRs created. Please review:

Inpact in performance can be best split by looking at instruction plots as time plots have high variance to reliable split impact.

Second dot of short line shows impact when we remove traces optimization (just inlines optimizations).

Out of order example.

old SYCL: 144927, new SYCL with inlines optimization: 142057, new SYCL with both optimizations: 141817

In terms of overhead reduction over UR (117309), this is:

  • 23.54% -> 20.89% with both optimizations and
  • 23.54% -> 21.10% with just inlines.
SubmitKernel out of order long kernel, CPU count(1)

So inlinig has a major effect.

Tracing optimization positive effect is however clearly visible in flamegraphs, where call to trace was ProgramManager::getOrCreateKernel where it originally took 2.5% of this function and now it something between 0-1.32% (closer to 0 I think)

In order example

old SYCL: 133375, new SYCL with inlines optimization: 130915, new SYCL with both optimizations: 130675

In terms of overhead reduction over UR (118028), this is:

  • 13.00% -> 10.72% with both optimizations and
  • 23.54% -> 10.92% with just inlines.
SubmitKernel in order, CPU count(1)

@sergey-semenov recently tried to enable LTO and there was almost no perf impact. That is why I am curious how much moving methods impl from cpp to hpp files improves perf.

I think compiler can do more in compilation time it if sees inlined function definitions in one source file which it analyses, than later during LTO when it has intermediate representations.

@vinser52
Copy link
Contributor

@lslusarczyk, can we close this PR because #20681 and #20682 were created?

steffenlarsen pushed a commit that referenced this pull request Nov 20, 2025
_[functional change]:_ **GlobalHandler instance move** from a static
variable declared inside a method to global scope. This **changes when
the first object is created**. Previously it would only be created if
getInstancePtr was called, now it happens at the time of loading the
library. Users pay for the instance even if they never use the global
handle.
Motivation is performance - accessing global variable declared inside a
function can be slower than to variable in global scope. Explained e.g.
[here](https://stackoverflow.com/questions/52198322/is-access-to-a-static-function-variable-slower-than-access-to-a-global-variable/52202005#52202005)

_[non-functional-change]_: **various functions** exising on hot
submit-kernel path **get inlined**

This PR is part of #20668 which was
split into two. See plots in that PR desciption for performance
comparisons.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants