cmd/compile: -bench should correct for GC #17434

Open
mdempsky opened this Issue Oct 13, 2016 · 4 comments

@mdempsky
Member

mdempsky commented Oct 13, 2016

Currently -bench output is very sensitive to GC effects. For example:

  1. Changing allocations in phase A might cause a GC cycle to shift from phase B to phase C, which can look like an improvement to B and a regression for phase C.
  2. Reducing long-lived memory pressure from earlier phases gets credited to later phases, as the later phases benefit from reduced GC costs.

This makes it hard to isolate performance improvements from frontend vs backend changes.

I'm considering a few possible improvements to -bench:

  1. Record GC pause times, and subtract them from phase times.
  2. Record allocation stats for each phase.
  3. Explicit GC cycle between FE and BE so we can measure how much live memory the FE has left for the BE to work with.
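As a rough illustration of item 2, per-phase allocation stats can be derived from the cumulative counters in `runtime.MemStats`. This is a minimal sketch, not the compiler's actual `-bench` code: `phaseStats`, `measureAllocs`, and the stand-in phase are hypothetical names introduced here.

```go
package main

import (
	"fmt"
	"runtime"
)

// phaseStats holds the allocation counters attributed to one phase.
// (Hypothetical type, introduced only for this sketch.)
type phaseStats struct {
	Bytes   uint64 // bytes allocated on the heap during the phase
	Mallocs uint64 // heap objects allocated during the phase
}

// measureAllocs runs phase and returns the growth of the cumulative
// TotalAlloc and Mallocs counters across it. Both counters only ever
// increase, so the difference is unaffected by when GC cycles happen.
func measureAllocs(phase func()) phaseStats {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)
	phase()
	runtime.ReadMemStats(&after)
	return phaseStats{
		Bytes:   after.TotalAlloc - before.TotalAlloc,
		Mallocs: after.Mallocs - before.Mallocs,
	}
}

// sink keeps the allocations reachable so they cannot be optimized away.
var sink [][]byte

func main() {
	// Stand-in for a compiler phase: allocate 1000 buffers of 1 KiB.
	s := measureAllocs(func() {
		for i := 0; i < 1000; i++ {
			sink = append(sink, make([]byte, 1024))
		}
	})
	fmt.Printf("allocated %d bytes in %d objects\n", s.Bytes, s.Mallocs)
}
```

Note that `runtime.ReadMemStats` stops the world briefly, so a real implementation would probably want to call it only at phase boundaries.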

Any other suggestions and/or implementation advice?

/cc @griesemer @rsc @aclements

@aclements

Member

aclements commented Oct 13, 2016

> Record GC pause times, and subtract them from phase times.

I don't see why this would help. GC pause times are close to 0 and getting closer. It's not the pauses that are the problem, it's the CPU taken away from the compiler during the concurrent phase.

> Record allocation stats for each phase.

This seems like a good idea in general.

> Explicit GC cycle between FE and BE so we can measure how much live memory the FE has left for the BE to work with.

Adding explicit GCs between phases seems necessary if you're going to isolate the performance of the different phases. I'm not sure what you mean by "live memory the FE has left" since live memory isn't something that's left, but I don't think this is about measurement anyway. Doing an explicit GC between phases resets the pacing so the scheduling of GCs during each phase is much closer to independent from the other phases (not exactly, since a change in the live memory remaining after an earlier phase can still affect the GC scheduling in a later phase, but you'll be much closer to independence).

@mdempsky

Member

mdempsky commented Oct 13, 2016

> I don't see why this would help. GC pause times are close to 0 and getting closer. It's not the pauses that are the problem, it's the CPU taken away from the compiler during the concurrent phase.

I see. I said "GC pause times" just because that's the only time duration that I could see in package runtime.MemStats or runtime/debug.GCStats, and I naively assumed it somehow represented total GC overhead. I guess it actually means only STW time?

Is there a good way to measure CPU cost from the concurrent phase? Also, currently I think we only measure per-phase wallclock time. I wonder if we need to measure per-phase CPU-seconds instead, since the GC is concurrent (and possibly the compiler itself will be too, in the future).

> I'm not sure what you mean by "live memory the FE has left" since live memory isn't something that's left, but I don't think this is about measurement anyway.

I meant (for example) to make an explicit runtime.GC() call at the end of the frontend phases and record the runtime.MemStats.Heap{Alloc,Objects} values. The hypothesis being that 1) they represent how much data the FE has allocated that will continue to remain live throughout the BE phases, and 2) improving those numbers should reduce the amount of GC work necessary during the BE phases. Is that sound, or is my model of GC effects too naive?
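The measurement described above could look roughly like the following. This is a sketch under assumptions: `liveHeap` and `retained` are hypothetical names, and `HeapAlloc` right after a forced collection is only an approximation of the live set.

```go
package main

import (
	"fmt"
	"runtime"
)

// liveHeap forces a full collection and then reports HeapAlloc and
// HeapObjects. Immediately after runtime.GC() returns, these values
// approximate the data that is still reachable.
func liveHeap() (bytes, objects uint64) {
	runtime.GC()
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapAlloc, m.HeapObjects
}

// retained stands in for data the frontend leaves alive for the backend.
var retained []byte

func main() {
	retained = make([]byte, 16<<20) // pretend the frontend kept 16 MiB
	b, n := liveHeap()
	fmt.Printf("live after hypothetical frontend: %d bytes in %d objects\n", b, n)
}
```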

@aclements

Member

aclements commented Oct 13, 2016

> I see. I said "GC pause times" just because that's the only time duration that I could see in package runtime.MemStats or runtime/debug.GCStats, and I naively assumed it somehow represented total GC overhead. I guess it actually means only STW time?

Right. The only thing in MemStats that accounts for concurrent GC time is GCCPUFraction, but I don't think that would help here.

> Also, currently I think we only measure per-phase wallclock time. I wonder if we need to measure per-phase CPU-seconds instead, since the GC is concurrent (and possibly the compiler itself will be too, in the future).

I'm not so sure. What people generally care about when they're running the compiler is how long it took, not how many CPU-seconds it took.

> I meant (for example) to make an explicit runtime.GC() call at the end of the frontend phases and record the runtime.MemStats.Heap{Alloc,Objects} values. The hypothesis being that 1) they represent how much data the FE has allocated that will continue to remain live throughout the BE phases, and 2) improving those numbers should reduce the amount of GC work necessary during the BE phases. Is that sound, or is my model of GC effects too naive?

I think that's a good thing to measure; however, the effect is somewhat secondary to just how many allocations the FE does. To a first order, if the FE retained set doubles, each GC during the BE will cost twice as much but they will happen half as often, so the total cost doesn't change. (It does matter to a second order, since longer GCs are less efficient GCs because of write barrier overheads and floating garbage.)
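The first-order argument can be checked with a toy model. The sketch below assumes a cycle starts once the heap has grown by GOGC percent over the live set and that each cycle's cost is proportional to the live set; `modelGCCost` is a hypothetical helper, and the model deliberately ignores the second-order effects mentioned above.

```go
package main

import "fmt"

// modelGCCost estimates, under a simplified model, how many GC cycles occur
// while `allocated` bytes are allocated on top of a fixed live set of `live`
// bytes, and the total mark work those cycles perform.
//
// Assumptions (not the real pacer): a cycle triggers every live*gogc/100
// bytes of allocation, and each cycle costs work proportional to `live`.
func modelGCCost(live, allocated, gogc float64) (cycles, totalWork float64) {
	headroom := live * gogc / 100 // bytes allocated between consecutive cycles
	cycles = allocated / headroom
	totalWork = cycles * live // each cycle marks the whole live set
	return cycles, totalWork
}

func main() {
	for _, live := range []float64{100, 200} {
		cycles, work := modelGCCost(live, 1000, 100)
		fmt.Printf("live=%.0f cycles=%.0f totalWork=%.0f\n", live, cycles, work)
	}
}
```

In this model, doubling the live set from 100 to 200 halves the cycle count from 10 to 5 while the total work stays at 1000, which matches the "twice as much, half as often" reasoning.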

However, my point about running a GC between phases just to reset the GC pacing still stands. Imagine the GC runs exactly 1 second, 2 seconds, etc. after the process starts; if you change the time some phase takes, all of the later phases will line up with the GC ticks differently, causing fluctuations in measured performance. runtime.GC() lets you reset the clock, so if you do it at the beginning of each compiler phase, only that phase's "timing" will matter for its own measurement. The GC actually runs in logical "heap time", but the analogy is quite close.

@quentinmit

Contributor

quentinmit commented Oct 17, 2016

It seems like -bench should turn on an explicit GC at the end of each phase, counted against that phase's timing.

@quentinmit quentinmit added the NeedsFix label Oct 17, 2016

@quentinmit quentinmit added this to the Go1.8Maybe milestone Oct 17, 2016

@rsc rsc modified the milestones: Go1.9, Go1.8Maybe Oct 20, 2016

@josharian josharian modified the milestones: Go1.10, Go1.9 May 11, 2017

@bradfitz bradfitz modified the milestones: Go1.10, Go1.11 Nov 29, 2017

@gopherbot gopherbot modified the milestones: Go1.11, Unplanned May 23, 2018