x/build: migrate off Kubernetes buildlets back to VMs? #25108
We've had a lot of flakiness with our Kubernetes-based buildlets. This seems to happen most during periods of high load, suggesting that we're still hitting isolation problems with Kubernetes, despite our various pod configuration knobs (requested CPU/mem + limit CPU/mem) being pretty paranoid and reviewed by various Kubernetes people.
Further, Kubernetes seems to get into a state where things are bad, and then a cluster upgrade or nuke+recreate makes things good again... for a bit.
I think it might be time to stop using Kubernetes for our buildlets (we'll still use it for all our misc services) and switch back to using VMs.
In the past, our Linux VMs were tedious because we had to prepare VM images for each config. We did this using a "docker2boot" tool I wrote that converted a container image (built from a Dockerfile) into a bootable GCE VM. But the whole process & testing was still slow & painful to iterate on.
When we moved to Kubernetes, we moved to more vanilla Dockerfiles with pushes & pulls to gcr.io. This was much less painful.
I don't propose we move back to custom VM images. I don't want to use docker2boot again (as cool of a hack as it was).
Instead, I think we should use GCE's container OS image (https://cloud.google.com/container-optimized-os/docs/) and use our existing buildlet containers.
The pros of moving to VMs:
The cons of moving to VMs:
I think this is worth trying. We can make it a flag so we can revert easily. We wouldn't delete the Kubernetes-based buildlet code, in case we want to switch back to it in the future.
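As a rough illustration of the "make it a flag" idea, here is a minimal Go sketch. The flag name `container_buildlets_on_vms`, the `buildletBackend` type, and `chooseBackend` are all hypothetical names invented for this example; they are not taken from the x/build code.

```go
package main

import (
	"flag"
	"fmt"
)

// buildletBackend names where a container-based buildlet runs.
// Hypothetical type, for illustration only.
type buildletBackend string

const (
	backendKubernetes buildletBackend = "kubernetes"
	backendVM         buildletBackend = "vm"
)

// chooseBackend maps the flag value to a backend. Keeping both
// code paths alive is what makes the switch easy to revert.
func chooseBackend(useVMs bool) buildletBackend {
	if useVMs {
		return backendVM
	}
	return backendKubernetes
}

func main() {
	// Hypothetical flag name; defaults to the existing Kubernetes path.
	useVMs := flag.Bool("container_buildlets_on_vms", false,
		"run container-based buildlets on VMs instead of Kubernetes")
	flag.Parse()
	fmt.Println("container buildlets run on:", chooseBackend(*useVMs))
}
```

Defaulting the flag to the existing behavior means the VM path can be tried and then turned off again without a code change.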
Just started on this. So far so good. I created a VM running the cos-stable image (
(and the normal
... and then the buildlet came right up, and pretty quickly. (didn't measure)
The "privileged: true" part was a test. I think we'll be able to run a few more tests than we did before with it. And I think the 'tty: true' part is unnecessary. I thought we needed it for something, but I think I'm misremembering.