x/build/cmd/coordinator: tracking bug for running benchmarks in the coordinator. #19871
I anticipate that pre- and post-submit benchmark runs will be somewhat different (e.g. expected latency, # of iterations, etc.), and post-submit benchmark runs require trend analysis, so I am focusing on pre-submit trybot benchmark runs at the moment.
We've already had several design discussions about this; I'm going to attempt to replay them into this issue for posterity.
If machines die, they can resume later just from snapshots, bypassing make.bash. The untarring of the snapshot to the tmpfs is fast.
I think benchmarks should work similarly. They should be another set of work units, like unit tests, sharded over machines.
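The resume-from-snapshot path described above can be sketched as a small decision helper. This is illustrative only: `snapshotExists` and the step names are hypothetical stand-ins, not the real coordinator API.

```go
// Sketch of the resume-from-snapshot decision, with hypothetical helpers.
package main

import "fmt"

// snapshotExists stands in for a check against the snapshot store (e.g. GCS).
// Hypothetical: pretend only "abc123" has been snapshotted.
func snapshotExists(rev string) bool {
	return rev == "abc123"
}

// prepareBuildlet returns the steps a fresh buildlet must run for rev.
func prepareBuildlet(rev string) []string {
	if snapshotExists(rev) {
		// Fast path: untar the post-make.bash snapshot into the tmpfs.
		return []string{"fetch-snapshot", "untar-to-tmpfs"}
	}
	// Slow path: build the toolchain from scratch, then snapshot it.
	return []string{"fetch-source", "make.bash", "upload-snapshot"}
}

func main() {
	fmt.Println(prepareBuildlet("abc123"))
	fmt.Println(prepareBuildlet("def456"))
}
```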
Perf Trybots Sketch
(reviewed Mar 3 by @bradfitz)
We want to collect benchmark data as part of the normal Gerrit CL process. To effect this, I propose the following modifications to the build coordinator:
Some builders are probably too slow to run benchmarks on. I suggest we initially run benchmarks on:
Q: there's no guarantee [the base commit's snapshot] exists. is that a problem?
Q: how many iterations are you going to run to get statistically interesting numbers?
Q: [freebsd-amd64-gce101] is just another amd64, so I'm not sure it helps much. Seems overkill for trybots. I even think Windows is a bit redundant, but justifiable due to its drastic difference. Linux and FreeBSD are relatively similar. Mostly, seeing three different architectures is the interesting bit. I think running builders on three different Intel CPU families (in different GCE zones) would be even more interesting.
Email thread with @bradfitz about how to fit benchmarks into the existing work units for make and test.
I know you have your goal of posting trybot results in <5m. Obviously, that's simply not enough time to also run benchmarks, and we can't (I think?) parallelize the benchmarks because we'd like the results to come from the same machine. Currently, with 5 iterations of the benchmarks, it's taking 10-20m to complete.
So I think that leads me to "we should post make + test results first, and then follow up with bench results in a separate Gerrit comment". Do you think that's a reasonable UI? Alternatively, we could include in the existing Gerrit post "benchmarks are being run, ", and have the benchmark results just stream in at that page without making another post when they're done.
The question becomes then how to fit that into trySet. There's an ominous comment in buildStatus.build:
Ideally, we would like to log the
Sorry for the long-winded e-mail. Does this all sound okay?
There's been some progress towards that in the past year (unique build IDs and try IDs used mostly for BigQuery), but if we finished that, then we'd have a great place to reference perf data too.
So we'd record the Trybot-Result + comment on Gerrit and then keep running benchmarks. People interested in benchmarks can get to them from the original already-spammed comment on Gerrit. Perhaps it should only do a follow-up comment if the benchmark found a notable change.
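The "only comment on a notable change" idea needs some definition of notable. A minimal sketch, assuming a flat percentage threshold; a real implementation would want a proper statistical test (benchstat-style) rather than this placeholder:

```go
package main

import (
	"fmt"
	"math"
)

// notableChange reports whether the old-vs-new delta is worth a follow-up
// Gerrit comment. The flat threshold is an illustrative placeholder, not a
// value from this thread.
func notableChange(oldNs, newNs, threshold float64) bool {
	return math.Abs(newNs-oldNs)/oldNs >= threshold
}

func main() {
	fmt.Println(notableChange(100, 101, 0.05)) // 1% delta: not notable
	fmt.Println(notableChange(100, 110, 0.05)) // 10% delta: notable
}
```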
I'd prefer it to be a logically separate work unit, but do the buildlet sharing as an optimization via the buildlet pool layer. There's an open bug about that somewhere: if a buildlet is no longer being used (especially one that is slow to create), has never seen a failure, and has no new processes running from when it started, then add it back to the free pool, at least for everything but building a release, where we really want things to be hermetic.
I could certainly build a mechanism to pass off the buildlet client to a logically separate work unit, but then I also run into the question of how to handle logging. Right now it seems like there's an assumption that there is only one buildStatus for each commit and buildlet type. I think I would have to break that assumption if we want to have the "bench" buildStatus separate from the "make and test" buildStatus.
(Code-wise, the benchmark stuff is already logically separate, it has a single entry point of bs.runBench().)
I'd rather have a simple design (benchmarks always start with a fresh buildlet, as a new work unit) and pay the cost of untarring the work that was just done a moment ago in the uncontended case.
But in the contended case (like this morning), we'd free up the buildlets for important work (trybots, building the release-branch.go1.8 most recent commit, etc), and catch up on benchmarking later when idle.
Okay, with that, I believe that brings this issue up to the current state of discussion.
Okay, so, if I understand your proposal correctly, you want us to treat benchmarks as another type of helper buildlet, so we untar the snapshot built by make.bash and then run the benchmarks in a fresh buildlet.
Understood, and that makes sense (bearing in mind the overhead of starting another VM, which can be mitigated with your reuse proposal). In the meantime until we do that, I'd like to just put that logic here into
Now, given that, I'm still unsure of exactly how to tie that into the existing
I want to treat each benchmark as a unit of work that needs to be done, just like a "go tool dist test --list" work unit. I don't care whether it runs on the initial buildlet or on one of the helper buildlets. Ideally they'd run on all N of them, including the original one, in parallel. (I assume we're running more than 1 benchmark, and running it more than 1 time.)
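The "benchmarks are just more work units" model above can be sketched as a shared queue that N buildlets drain in parallel. Names here (`workUnit`, `runOnBuildlet`) are hypothetical, not the coordinator's real types:

```go
// Sketch: fan benchmark units out over helper buildlets alongside the
// "go tool dist test --list" units, using a shared work queue.
package main

import (
	"fmt"
	"sync"
)

type workUnit struct {
	name    string
	isBench bool
}

// runOnBuildlet stands in for actually executing a unit on a buildlet.
func runOnBuildlet(buildlet int, u workUnit) string {
	return fmt.Sprintf("buildlet-%d ran %s", buildlet, u.name)
}

// dispatch shards units over n buildlets, each pulling from a shared queue.
func dispatch(units []workUnit, n int) []string {
	queue := make(chan workUnit, len(units))
	for _, u := range units {
		queue <- u
	}
	close(queue)

	var (
		mu      sync.Mutex
		results []string
		wg      sync.WaitGroup
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for u := range queue {
				r := runOnBuildlet(id, u)
				mu.Lock()
				results = append(results, r)
				mu.Unlock()
			}
		}(i)
	}
	wg.Wait()
	return results
}

func main() {
	units := []workUnit{
		{"go_test:archive/tar", false},
		{"bench:go1", true},
		{"bench:x/benchmarks", true},
	}
	for _, r := range dispatch(units, 2) {
		fmt.Println(r)
	}
}
```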
We already have the buildlets available. We just created ~N=5 to run unit tests on, and now they're idle. (Also, the Kubernetes buildlets create in seconds. There's not much overhead there.)
You'd need to modify
Let's not cut corners. I have to maintain this. If there's a dependency to do, do that first.
The logic for what?
Reusing the idle buildlets is fine.
Other question: are you only doing benchmarks for the main repo ("go") or also for subrepos?
What work does the benchmarking need to do, and what environment does it need?
It needs the output from make.bash. How many benchmarks does it run, and which, and how many iterations? Can the iterations run on different VMs if they're the same processors/sizes?
Does the parent commit need to be submitted? Can we use benchmark output computed previously for the parent, if it's from the same VM processors/sizes?
Okay, yes. I am starting with the assumption that we need to run all benchmarks on a single buildlet, so that the results are comparable. With the CL as currently implemented, we are running 4 separate benchmark binaries, and we are running each of them 5 times. At a minimum, I think we need to make sure we run the same set of benchmarks on each buildlet (so we could shard it 5 ways, but not 20 ways). I don't know yet how much variance that introduces.
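The sharding constraint above (shard 5 ways by iteration, never 20 ways by binary-times-iteration) can be made concrete. Binary names below are illustrative:

```go
package main

import "fmt"

// shardByIteration splits benchmark work so that every shard runs the full
// set of binaries once: 4 binaries x 5 iterations yields 5 comparable
// shards, never 20 single-run shards.
func shardByIteration(binaries []string, iterations int) [][]string {
	shards := make([][]string, iterations)
	for i := range shards {
		shards[i] = append([]string(nil), binaries...)
	}
	return shards
}

func main() {
	// Illustrative binary names, not the exact set in the CL.
	binaries := []string{"go1", "bench-http", "bench-json", "bench-garbage"}
	shards := shardByIteration(binaries, 5)
	fmt.Println(len(shards), len(shards[0])) // 5 shards, 4 binaries each
}
```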
I think this conflicts with the fact that we may need to wait for the parent commit to make + test, and we don't want to tie up buildlets while waiting.
The logic for deciding whether a buildlet can be reused for benchmarks.
At the moment, only the go repo is in scope. Running benchmarks in subrepos is also interesting, but I'm not yet tackling that because I haven't designed a way to decide which benchmarks to run.
We need to have built benchmark binaries (which could have been cross-compiled, I suppose), and then we just need to run them on a buildlet. We do require a
My assumption is that the iterations can't run on different VMs, but see above.
The parent commit does not need to be submitted. It does need to have been built and tested. We can't use previously computed benchmark output because there is far too much variation on GCE. (We could of course use a pre-compiled benchmark binary for the parent, however.)
How long does each of the ~5 benchmark binaries take to run?
For a given binary, do you want to run the old-vs-new binaries as:
Or interleaved old/new like:
? I can see some advantage of interleaving them if the run times are long and something unrelated on the VM (different customer) changes performance somehow.
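The two run orders being compared can be sketched as a small sequencing helper. Interleaving alternates old and new so that slow drift in VM performance (e.g. a noisy neighbor) affects both sides roughly equally:

```go
package main

import "fmt"

// runOrder returns the sequence of old/new benchmark runs, either
// sequential (all old, then all new) or interleaved (old, new, old, new...).
func runOrder(iterations int, interleave bool) []string {
	var order []string
	if interleave {
		for i := 0; i < iterations; i++ {
			order = append(order, "old", "new")
		}
		return order
	}
	for i := 0; i < iterations; i++ {
		order = append(order, "old")
	}
	for i := 0; i < iterations; i++ {
		order = append(order, "new")
	}
	return order
}

func main() {
	fmt.Println(runOrder(3, false)) // sequential
	fmt.Println(runOrder(3, true))  // interleaved
}
```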
Implementation-wise: if you want this as part of trybots, we really need to work to minimize overlap in what runs when.
The benchmarks can be sharded, and since we already have sharded testing buildlets, that won't add much.
I'm more concerned about the extra
Refactoring the helper buildlets code from being just about test sharding to being earlier and more general (and letting you ask for just 1 immediately vs the timer creating all N helpers at once) will be the trickiest part.
As for sharding the benchmarks out over buildlets, I would treat them as normal test units, modifying
Then you'd modify
I drew a picture of how the sharding works for tests earlier today and @adams-sarah took a picture of it but I don't think it was a great drawing on its own. I can probably make a Gantt chart of each buildlet and what it's doing if that'd help.
I've been logging that with spans and we have a fair amount of data now. Do you have any reporting tools, or should I just dig into the datastore myself?
From a quick glance, it looks like the go1 binary is the worst, at ~60s per run. The other benchmark binaries are around 10-20s each.
(go1 contains many benchmarks, of course, so we can split it in half to get 30s, or in three to get 20s.)
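The split arithmetic above is just a ceiling division over the binary's total run time:

```go
package main

import "fmt"

// splitSeconds estimates the per-shard run time when a benchmark binary
// taking total seconds is split into n roughly equal pieces (rounding up).
func splitSeconds(total, n int) int {
	return (total + n - 1) / n
}

func main() {
	fmt.Println(splitSeconds(60, 2)) // go1 in halves: ~30s each
	fmt.Println(splitSeconds(60, 3)) // go1 in thirds: ~20s each
}
```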
I agree, we definitely want to interleave them.
I don't know what you mean by "minimize overlap". Do you mean that we should try to schedule Kube pods on different hosts?
I don't think it makes sense to rebuild the parent commit if a snapshot is not available. We don't even necessarily know where to fetch it from (since it could be in a separate Gerrit CL, or on master). I think we should just wait until the parent commit has been built normally. This means we might take longer than 5m if a whole train of CLs is pushed at once, but I think that's a reasonable tradeoff (and we might wait a while in that case anyway before we can get a buildlet to run tests, too). When buildlet scheduling is implemented, we can teach the scheduling to run CLs in order from oldest to newest.
Yeah, that makes sense. Initially, I'd like to punt on this change and instead just use a completely separate buildlet for benchmarks in parallel with the sharded test buildlets. Then reusing the same buildlets just becomes an optimization.
x/build/cmd/buildstats -sync will sync from datastore to bigquery, and then you can write SQL.
Yeah, we'll want to do that.
Er, sorry, that sentence made little sense. I meant we want to maximize overlap, to not have idle buildlets while we do many steps in serial. We want to take advantage of potential parallelism, as it currently does.
You just fetch it from the gitmirror, which will return it regardless of where it is.
Idling while waiting for a buildlet isn't acceptable. We're not going to do that.
That's actually not the priority that was proposed in #19178. Recent things are more important than things 5 commits back.
Sure, how about you use one buildlet (maybe one total for now, until #19178 can prioritize buildlet users) and do your work totally asynchronously with the trybots. The trybots can write "This will eventually be available at $URL", and users can just poll that URL, or your process can leave a comment saying the results are available, if it detected anything interesting.