Buildkite: avoid unbounded disk space growth on runners #7139

Closed · 5 tasks
GMNGeoffrey opened this issue Sep 22, 2021 · 12 comments
Labels
infrastructure Relating to build systems, CI, or testing

Comments

@GMNGeoffrey
Contributor

We periodically run out of disk space in the buildkite build machines, which is annoying to deal with.

Right now each pipeline has its own checkout of the repository, which uses unnecessary disk space and adds checkout latency. https://buildkite.com/docs/agent/v3#experimental-features-git-mirrors might fix this, but something custom might be required. IDK. This will matter more if we scale up the number of pipelines (which we may or may not do).

  • Clean up old Docker images
  • Delete/avoid copying over other agents' directories when cloning instances/snapshots
  • Avoid having a separate repository checkout for each pipeline that runs on the agent. This matters more if we scale up the number of pipelines (as opposed to having a couple pipelines that do all the things). https://buildkite.com/docs/agent/v3#experimental-features-git-mirrors may help here (see the sketch after this list)
  • Periodic cleanup in general
  • Drop persistent runners entirely and use a Kubernetes cluster. This would need some optimization (like shared caches) to be performant, I think.
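
Here's a rough sketch of what the git-mirrors setup might look like in the agent config (key names per the Buildkite v3 agent docs; worth double-checking against the agent version we actually run):

# /etc/buildkite-agent/buildkite-agent.cfg (sketch, untested)
# Opt into the experimental git-mirrors feature so all pipelines on this
# agent share one bare mirror per repository instead of a full checkout each.
experiment="git-mirrors"
git-mirrors-path="/var/lib/buildkite-agent/git-mirrors"
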
GMNGeoffrey added the infrastructure label on Sep 22, 2021
@GMNGeoffrey
Contributor Author

See this thread on an unimplemented Docker feature request for ideas on how to implement Docker GC: moby/moby#4237
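
In the meantime, even a dumb daily prune would probably help; a minimal sketch (the 72h retention is made up):

# /etc/cron.daily/docker-gc (sketch, untested)
#!/bin/sh
# Remove stopped containers, unused networks, and dangling images older than 72h.
docker system prune --force --filter "until=72h"
# Optionally also drop all unused images, not just dangling ones:
# docker image prune --all --force --filter "until=72h"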

@GMNGeoffrey
Contributor Author

I think the best option there is https://github.com/stepchowfun/docuum
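
If we go that route, it should just be a matter of running it as a daemon with a disk budget, something like this (flag per the docuum README, budget made up):

# Evict least-recently-used Docker images whenever the cache exceeds ~50 GB.
docuum --threshold '50 GB'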

@ScottTodd
Member

Saw another failure from low disk space: https://buildkite.com/iree/iree-samples/builds/96#de9a887f-7021-4276-b969-8eab5a2d2ab8

project/llvm/lib/Transforms/IPO/CMakeFiles/LLVMipo.dir/WholeProgramDevirt.cpp.o && /usr/bin/ranlib -D third_party/llvm-project/llvm/lib/libLLVMipo.a && :
/usr/bin/ranlib: third_party/llvm-project/llvm/lib/libLLVMipo.a: No space left on device
[2687/3495] Linking CXX static library third_party/llvm-project/llvm/lib/libLLVMCodeGen.a
FAILED: third_party/llvm-project/llvm/lib/libLLVMCodeGen.a

@GMNGeoffrey
Contributor Author

I've resized the disks to all be 1TB, so that should buy us some time.
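
For reference, that can be done in place without recreating the VMs; roughly (disk name and zone are placeholders):

# Grow the persistent disk, then grow the partition and filesystem on the VM.
gcloud compute disks resize buildkite-runner-disk --size=1TB --zone=us-central1-a
# On the instance, assuming an ext4 root filesystem on partition 1 of /dev/sda:
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1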

@GMNGeoffrey
Contributor Author

bazelbuild/bazel#5139 (comment) would help with Bazel disk caches if we want to use those in addition to a remote cache (note the "1" there could be substituted for some number of GB).
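
Until Bazel grows built-in garbage collection for --disk_cache, a cron job that evicts least-recently-accessed entries down to a size budget would do the trick; a sketch (cache path and the 100GB budget are made up):

#!/bin/sh
# Trim the Bazel disk cache to roughly 100GB by deleting the files with the
# oldest access times first, in batches, until we're back under budget.
CACHE_DIR=/var/cache/bazel-disk-cache   # placeholder path
BUDGET_KB=$((100 * 1024 * 1024))
while [ "$(du -sk "$CACHE_DIR" | cut -f1)" -gt "$BUDGET_KB" ]; do
  find "$CACHE_DIR" -type f -printf '%A@ %p\n' | sort -n | head -n 1000 \
    | cut -d' ' -f2- | xargs -r rm -f
done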

@antiagainst
Contributor

Okay, this is also happening on the tiny VM we use for uploading pipeline configurations. Its 10GB disk is full, which prevents it from uploading pipeline configurations, so all tasks are sitting in the waiting queue. I've created a new VM with 256GB of storage for it for now.

@GMNGeoffrey
Contributor Author

Huh, I'm curious what's eating all its space. I see you created a whole new VM, but note that you should be able to resize disks on the fly. Anyway, that VM had a stupid name, so it's fine that it went away. I think we should just delete it entirely?

@GMNGeoffrey
Contributor Author

Whelp, the instance won't start up (likely due to being out of disk space), so I can't SSH into it to figure out what's going on. I've detached the disk and attached it to another instance so I can poke around and see what happened.
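
For posterity, the dance is roughly this (instance and disk names are placeholders):

# Detach the full boot disk from the wedged VM and attach it to a healthy one.
gcloud compute instances detach-disk wedged-uploader-vm --disk=wedged-uploader-disk
gcloud compute instances attach-disk gcmn-test-instance --disk=wedged-uploader-disk
# On the healthy instance, mount the data partition to poke around.
sudo mkdir -p /mnt/mako-uploader
sudo mount /dev/sdb1 /mnt/mako-uploader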

@GMNGeoffrey
Contributor Author

GMNGeoffrey commented Oct 26, 2021

So, some findings. When mounted elsewhere, the disk actually had 1.2G of free space. I don't really understand how that happens. I guess shutdown or detaching deleted some stuff (/tmp)? But then I don't really understand why it wasn't able to start up.

gcmn@gcmn-test-instance:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            2.0G     0  2.0G   0% /dev
tmpfs           395M  5.3M  390M   2% /run
/dev/sda1       9.7G  1.4G  7.9G  15% /
tmpfs           2.0G     0  2.0G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           2.0G     0  2.0G   0% /sys/fs/cgroup
/dev/sda15      124M  5.7M  119M   5% /boot/efi
tmpfs           395M     0  395M   0% /run/user/1001
/dev/sdb1       9.6G  8.4G  1.2G  89% /mnt/mako-uploader

There are a few largeish directories:

gcmn@gcmn-test-instance:~$ sudo du -xh --max-depth=1 /mnt/mako-uploader/ | sort -rh | head -10
8.4G    /mnt/mako-uploader/
3.1G    /mnt/mako-uploader/usr
2.9G    /mnt/mako-uploader/var
1.3G    /mnt/mako-uploader/lib
1.1G    /mnt/mako-uploader/home
222M    /mnt/mako-uploader/boot
16M     /mnt/mako-uploader/sbin
15M     /mnt/mako-uploader/bin
6.9M    /mnt/mako-uploader/etc
48K     /mnt/mako-uploader/snap

2.2G of assorted GCP kernel headers, for some reason:

gcmn@gcmn-test-instance:~$ sudo du -xh --max-depth=1 /mnt/mako-uploader/usr/src | sort -rh | head -10
2.2G    /mnt/mako-uploader/usr/src
108M    /mnt/mako-uploader/usr/src/linux-gcp-5.4-headers-5.4.0-1055
108M    /mnt/mako-uploader/usr/src/linux-gcp-5.4-headers-5.4.0-1053
108M    /mnt/mako-uploader/usr/src/linux-gcp-5.4-headers-5.4.0-1052
108M    /mnt/mako-uploader/usr/src/linux-gcp-5.4-headers-5.4.0-1051
108M    /mnt/mako-uploader/usr/src/linux-gcp-5.4-headers-5.4.0-1049
108M    /mnt/mako-uploader/usr/src/linux-gcp-5.4-headers-5.4.0-1046
108M    /mnt/mako-uploader/usr/src/linux-gcp-5.4-headers-5.4.0-1044
108M    /mnt/mako-uploader/usr/src/linux-gcp-5.4-headers-5.4.0-1043
108M    /mnt/mako-uploader/usr/src/linux-gcp-5.4-headers-5.4.0-1042

1.3G of "modules", whatever that is, that mirror the headers

gcmn@gcmn-test-instance:~$ sudo du -xh --max-depth=1 /mnt/mako-uploader/lib/modules | sort -rh | head -10
1.3G    /mnt/mako-uploader/lib/modules
251M    /mnt/mako-uploader/lib/modules/5.4.0-1052-gcp
251M    /mnt/mako-uploader/lib/modules/5.4.0-1051-gcp
250M    /mnt/mako-uploader/lib/modules/5.4.0-1053-gcp
249M    /mnt/mako-uploader/lib/modules/5.4.0-1055-gcp
238M    /mnt/mako-uploader/lib/modules/5.4.0-1029-gcp
1.1M    /mnt/mako-uploader/lib/modules/5.4.0-1049-gcp
1.1M    /mnt/mako-uploader/lib/modules/5.4.0-1046-gcp
1.1M    /mnt/mako-uploader/lib/modules/5.4.0-1044-gcp
1.1M    /mnt/mako-uploader/lib/modules/5.4.0-1043-gcp

A gigabyte of logs:

gcmn@gcmn-test-instance:~$ sudo du -xh --max-depth=1 /mnt/mako-uploader/var/log/journal/597226472db5e64da917fe51433dfb7c/ | sort -rh | head -10
977M    /mnt/mako-uploader/var/log/journal/597226472db5e64da917fe51433dfb7c/

Half a gig of builds:

gcmn@gcmn-test-instance:~$ sudo du -xh --max-depth=1 /mnt/mako-uploader/var/lib/buildkite-agent/builds/pipeline-uploader/iree/ | sort -rh | head -10
574M    /mnt/mako-uploader/var/lib/buildkite-agent/builds/pipeline-uploader/iree/
288M    /mnt/mako-uploader/var/lib/buildkite-agent/builds/pipeline-uploader/iree/iree-android-arm64-v8a
138M    /mnt/mako-uploader/var/lib/buildkite-agent/builds/pipeline-uploader/iree/iree-build-configurations
78M     /mnt/mako-uploader/var/lib/buildkite-agent/builds/pipeline-uploader/iree/iree-samples
71M     /mnt/mako-uploader/var/lib/buildkite-agent/builds/pipeline-uploader/iree/iree-benchmark

A gig of snaps:

gcmn@gcmn-test-instance:~$ sudo du -xh --max-depth=1 /mnt/mako-uploader/var/lib/snapd | sort -rh | head -10
892M    /mnt/mako-uploader/var/lib/snapd

A gig because one time Hanhan did a bazel build on this machine:

gcmn@gcmn-test-instance:~$ sudo du -xh --max-depth=1 /mnt/mako-uploader/home/hanchung/.cache | sort -rh | head -10
1.1G    /mnt/mako-uploader/home/hanchung/.cache/bazel

Let's contrast this with the boot disk from the new VM I just created:

gcmn@gcmn-test-instance:~$ sudo du -xh --max-depth=1 / | sort -rh | head -10
1.3G    /
1.1G    /usr
208M    /var
33M     /boot
3.5M    /etc
84K     /home
44K     /tmp
16K     /lost+found
12K     /root
4.0K    /srv

No GCP update headers or modules there:

gcmn@gcmn-test-instance:~$ sudo du -xh --max-depth=1 /usr | sort -rh | head -10
1.1G    /usr
799M    /usr/lib
169M    /usr/share
93M     /usr/bin
17M     /usr/sbin
52K     /usr/local
20K     /usr/include
4.0K    /usr/src
4.0K    /usr/libx32
4.0K    /usr/lib64

So basically I think our Buildkite processes (IREE builds and logs) are responsible for about 1.5G, there's some stuff that isn't growing, and the rest is GCP auto-updates, which hopefully get garbage collected at some point. Seems like a bigger disk is probably the answer in this case.
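
That said, a decent chunk of the above is reclaimable without a bigger disk; a hedged cleanup sketch (retention numbers are arbitrary):

# Drop old kernel headers/modules left behind by unattended upgrades.
sudo apt-get autoremove --purge -y
# Cap the systemd journal, which was sitting at ~1G above.
sudo journalctl --vacuum-size=200M
# Keep fewer old snap revisions around (snapd retains several by default).
sudo snap set system refresh.retain=2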

@antiagainst
Contributor

Thanks for looking into the details, @GMNGeoffrey! I didn't know that GCP auto-updates could consume so many resources.

When mounted elsewhere, the disk actually had 1.2G of free space. I don't really understand how that happens. I guess shutdown or detaching deleted some stuff (/tmp)?

Oh, I actually tried to fix the VM by SSHing in and removing /var/lib/buildkite-agent/<something>/mako-uploader/.*. That gave some space back, but it didn't help much, as I still saw errors afterwards.

@ScottTodd
Member

Is this issue obsolete with the migration to GitHub Actions? Do we have any similar problems there?

@GMNGeoffrey
Contributor Author

Not as long as we have ephemeral runners
