cmd/compile: minimize morestack calls text footprint #29067

Open · CAFxX opened this issue Dec 3, 2018 · 4 comments
CAFxX commented Dec 3, 2018

This is more of a small papercut idea than a real issue, but every time I go through the disassembled output of objdump it comes back to mind, so I'm going to drop it here.

Currently the morestack check is laid out as follows:

function entry point:
  check if enough stack
  if not enough stack, jump to morestack block # JMP1
  ...function body...
morestack block:
  call morestack
  jump to function entry point # JMP2

If it were instead laid out like this:

morestack block:
  call morestack
function entry point:
  check if enough stack
  if not enough stack, jump to morestack block
  ...function body...

we save one jump (JMP2) and, more importantly, JMP1 can always use the more compact rel8 variant. The morestack block would, in this case, be placed in the padding between this function and the previous one (and because the previous function's morestack block would have been moved as well, this should not significantly increase the overall amount of padding used).

For small functions (i.e. functions where rel8 is already used in both jumps) this would save 2 bytes per function; for functions where rel32 is used, it would save 10 bytes per function. From a quick check, there are ~1800 small functions and ~5000 big functions using morestack in the 1.11.2 windows compiler, so text size could decrease by ~54KB. This would help with #6853 and, hopefully, also improve icache hit rates.

Looking ahead to when per-function metadata makes it possible to mostly eliminate morestack calls (#25999 (comment)), this scheme could be further extended with multiple entry points, so that we don't need to store two variants of functions that are needed in both morestack and no-morestack forms:

morestack block:
  call morestack
function entry point:
  check if enough stack
  if not enough stack, jump to morestack block
function entry point no-morestack:
  ...function body...

I'm not knowledgeable about the linker, so I don't really know how workable these two ideas are.

@CAFxX CAFxX changed the title from Minimize morestack calls footprint to Minimize morestack calls text footprint Dec 3, 2018

@agnivade agnivade changed the title from Minimize morestack calls text footprint to cmd/compile: minimize morestack calls text footprint Dec 3, 2018

@agnivade agnivade added this to the Unplanned milestone Dec 3, 2018

ianlancetaylor commented Dec 3, 2018

Note that on Intel processors, in the absence of an entry in the branch target buffer, backward branches are predicted taken and forward branches are predicted not taken. So on Intel processors there is a clear performance advantage to making the branch that is taken when more stack space is required a forward branch rather than a backward branch.

CAFxX commented Dec 3, 2018

@ianlancetaylor I'm no expert in microarchitecture, but this is what Agner writes about static prediction in "recent" archs:

Static prediction in PM and Core2

These processors do not use static prediction. The predictor simply makes a random prediction the first time a branch is seen, depending on what happens to be in the BTB entry that is assigned to the new branch. There is simply a 50% chance of making the right prediction of jump or no jump, but the predicted target is correct. Branch hint prefixes have no useful effect on PM and Core2 processors.

Static prediction in AMD

A branch is predicted not taken the first time it is seen. A branch is predicted always taken after the first time it has been taken. Dynamic prediction is used only after a branch has been taken and then not taken. Branch hint prefixes have no effect.

I obviously can't guarantee the above is correct, still relevant to newer archs, or that there are no other factors that may cause regressions. It would have to be measured to see if it helps.

josharian commented Dec 3, 2018

@CAFxX I altered the function prologue a couple of years ago to be more static-branch-prediction friendly. At the time, I benchmarked it extensively, and found that on all architectures I had access to, including recent AMD64 processors, laying out the function prologue in the order we do affords a significant performance benefit. I don't have the commits handy, but there are hard numbers in the commit message.

CAFxX commented Dec 3, 2018

@josharian I suppose you are referring to 88f423e

I have a question though: why would static prediction influence the runtime of super-hot functions that are being executed millions of times back-to-back during the benchmark? Why would there be no record for the branches in the BTB in that case? Wouldn't it be possible that the improvements were mostly due to better icache utilization (since, from a cursory check of the commit message, it seems cold code was moved out of the hot path)? Were you able to measure the two effects independently?
