x/build: migrate off Kubernetes buildlets back to VMs? #25108
We've had a lot of flakiness with our Kubernetes-based buildlets. This seems to happen most during periods of high load, suggesting that we're still hitting isolation problems with Kubernetes, despite our various pod configuration knobs (requested CPU/mem + limit CPU/mem) being pretty paranoid and reviewed by various Kubernetes people.
Further, Kubernetes seems to get into a state where things are bad, and then a cluster upgrade or nuke+recreate makes things good again... for a bit.
I think it might be time to stop using Kubernetes for our buildlets (we'll still use Kubernetes for all our misc services) and switch back to using VMs.
In the past, our Linux VMs were tedious because we had to prepare VM images for each config. We did this using a "docker2boot" tool I wrote that converted a container image (built from a Dockerfile) into a bootable GCE VM. But the whole process & testing was still slow & painful to iterate on.
When we moved to Kubernetes, we moved to more vanilla Dockerfiles with pushes & pulls to gcr.io. This was much less painful.
I don't propose we move back to custom VM images. I don't want to use docker2boot again (as cool of a hack as it was).
Instead, I think we should use GCE's container OS image (https://cloud.google.com/container-optimized-os/docs/) and use our existing buildlet containers.
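As a rough sketch, creating such a VM could look something like the following `gcloud` invocation, which boots a COS instance and hands it a container to run (the instance name, zone, and image path here are placeholders, not the coordinator's actual config):

```shell
# Sketch only: instance name, zone, and container image are placeholders.
gcloud compute instances create-with-container buildlet-linux-amd64 \
    --zone us-central1-f \
    --image-family cos-stable \
    --image-project cos-cloud \
    --container-image gcr.io/PROJECT/linux-x86-std:latest \
    --container-privileged
```

The `create-with-container` subcommand generates the container declaration metadata that COS's konlet agent reads at boot, so we'd keep our existing gcr.io push/pull workflow and skip custom VM images entirely.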
The pros of moving to VMs:
The cons of moving to VMs:
I think this is worth trying. We can make it a flag so we can revert easily, and we won't delete the Kubernetes-based building code in case we want to switch back to it in the future.
Just started on this. So far so good. I created a VM running the cos-stable image (
(and the normal
... and then the buildlet came right up, and pretty quickly. (didn't measure)
The "privileged: true" part was a test. I think we'll be able to run a few more tests than we did before with it. And I think the 'tty: true' part is unnecessary. I thought we needed it for something, but I think I'm misremembering.
Once containers run on COS instead of Kubernetes, one name (Kube*) is wrong and the other (GCE) is ambiguous. So rename them now to be more specific. No behavior changes. Just renaming in this step, to reduce size of next CL.

Updates golang/go#25108

Change-Id: Ib09eb682ef74acbbf6ed50b46074f834ef5e0c0b
Reviewed-on: https://go-review.googlesource.com/111639
Reviewed-by: Brad Fitzpatrick <firstname.lastname@example.org>
…exec

Google's Container-Optimized Linux's konlet container start-up program creates any requested tmpfs mounts as noexec. That doesn't work for doing builds in, so remount it as executable. This is required to run builds on COS instead of GKE.

Updates golang/go#25108

Change-Id: I9b719caf9180a03bafefa5b3b4b47ee43b9e5c1c
Reviewed-on: https://go-review.googlesource.com/112715
Reviewed-by: Andrew Bonventre <email@example.com>
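As a sketch of the problem this CL works around: a noexec mount shows up as a `noexec` option in `/proc/mounts`, and the fix is effectively a `mount -o remount,exec` of the workdir. The parsing below is illustrative only, not the actual cmd/buildlet code (the function name and sample line are assumptions):

```go
package main

import (
	"fmt"
	"strings"
)

// hasNoexec reports whether the given /proc/mounts line describes
// mountPoint and carries the noexec mount option.
func hasNoexec(procMountsLine, mountPoint string) bool {
	// /proc/mounts fields: device, mount point, fstype, options, dump, pass.
	f := strings.Fields(procMountsLine)
	if len(f) < 4 || f[1] != mountPoint {
		return false
	}
	for _, opt := range strings.Split(f[3], ",") {
		if opt == "noexec" {
			return true
		}
	}
	return false
}

func main() {
	// A tmpfs as konlet would create it: note the noexec option.
	line := "tmpfs /workdir tmpfs rw,nosuid,nodev,noexec,relatime 0 0"
	fmt.Println(hasNoexec(line, "/workdir"))
}
```

A buildlet could run a check like this at start-up and remount the directory as executable before attempting any builds.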
The nacl image hadn't been updated in 2+ years and it needed to be updated as part of rolling out the new COS-based builders. But no released version works for us yet; we were getting the same errors as in golang/go#23836 ("Signal 11 from untrusted code"). We were getting lucky that it was working with an ancient (pepper_34?) version, but I was unable to get that working again either. Rolling forward is better anyway, as we haven't had a Dockerfile reflecting reality for this builder for 2+ years.

This is the same version used in playground in CL 101735, which said:

> playground: update NaCl to trunk.544461
>
> This pulls in https://crrev.com/c/962675, which fixes the
> underlying issue of NaCl mishandling signals during a SIGSEGV.

Updates golang/go#23836
Updates golang/go#25108

Change-Id: I187042af71a1249e84ce2070aa8039a88d2c02c2
Reviewed-on: https://go-review.googlesource.com/112735
Reviewed-by: Brad Fitzpatrick <firstname.lastname@example.org>
Reviewed-by: Andrew Bonventre <email@example.com>