Speed up evaluation by caching task environments as docker images #317
Conversation
Pull updates from the upstream
Very cool stuff! I'll take a closer look at that on Friday!
Codecov Report

Attention: Patch coverage is

```
@@           Coverage Diff            @@
##             main     #317   +/-   ##
=======================================
  Coverage        ?   75.72%
=======================================
  Files           ?       18
  Lines           ?     2892
  Branches        ?        0
=======================================
  Hits            ?     2190
  Misses          ?      702
  Partials        ?        0
```

View full report in Codecov by Sentry.
```python
# Prepare image tag prefix for cached task environments
if self.args.cache_task_images:
    logger.info("Task environment caching enabled")
    tag = f"{self.args.data_path.replace('/', '_')}__{self.args.split}__{self.args.base_commit or 'head'}__"
```
Data path can now be all kinds of things, including the full text of the problem statement. I could push some code to work around these things. Though I also wonder if `data_path` is really what we should make this depend on. Perhaps we could instead intercept the actual setup stages and hash the setup config, or something like that?
Some dataset fingerprints would be even better, I agree. Also, there is a limit of 128 characters for the docker image tag, so this would help stay within the limit with any dataset. I will try to implement the idea with the config hash.
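A minimal sketch of the config-hash idea discussed above (the function name and config fields here are illustrative, not the actual SWE-agent API): hashing a canonical serialization of the setup config yields a short, deterministic tag that stays well under docker's 128-character tag limit no matter what `data_path` contains.

```python
import hashlib
import json


def cached_image_tag(setup_config: dict, prefix: str = "swe-env") -> str:
    """Derive a deterministic docker image tag from a setup config.

    Hashing keeps the tag short and valid: docker tags are limited to
    128 characters and a restricted character set, both of which an
    arbitrary data_path (e.g. a full problem statement) can violate.
    """
    # json.dumps with sort_keys gives a canonical serialization, so the
    # same config always hashes to the same tag regardless of key order.
    canonical = json.dumps(setup_config, sort_keys=True)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
    return f"{prefix}-{digest}"
```

Two runs with the same config (even with keys in a different order) produce the same tag, so a previously committed image can be found again reliably.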
`sweagent/environment/swe_env.py` (outdated)

```diff
@@ -420,25 +455,29 @@ def reset_container(self) -> None:
         self.container_obj = None
         self._reset_container()

-    def _init_container(self) -> None:
+    def _init_container(self, cached_image=None) -> None:
```
let's put type hints
fixed
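For reference, the hinted signature presumably ended up along these lines; the class below is a toy stand-in for illustration, not the real `SWEEnv`:

```python
from __future__ import annotations


class SWEEnvSketch:
    """Toy stand-in to illustrate the type-hinted signature."""

    def _init_container(self, cached_image: str | None = None) -> None:
        # When a cached image tag is given, the real method would start
        # the container from that image instead of the base image.
        self.image_source = cached_image or "base-image"
```

(`from __future__ import annotations` keeps the `str | None` syntax working on Python versions before 3.10.)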
I think this looks great! The only thing we'd have to fix is the naming issue depending on the nature of

Let me push this on top of your branch :)

Sure. How can I do that?

Realistically, I'll probably only get to it this Wednesday though, so no reason to wait for me until then haha

Aha, I see. I've enabled that option, thanks.

Hmm, somehow pushing on this PR doesn't work, not sure why. Let me merge your PR and then apply my changes on top :)

Thanks again for the very nice addition! ❤️

I've highlighted your contribution in our changelog :)
What does this implement?

This PR introduces the ability to cache the environment created for each SWE-bench task as a docker image. It saves the filesystem and the environment variables (via a file) with `docker commit`, which produces a new docker image with a tag unique to the given task. The tag contains the dataset name, split, and task number. The feature can be enabled with the flag `--cache_task_images`.

This change addresses the issue of spending a large chunk of evaluation time on setting up the task environments. Timing test on the dev split of `princeton-nlp/SWE-bench_Lite` (23 tasks), on a 2-core VM: given that the repo states an average task run time of 1.5 minutes, this PR speeds up consecutive evaluations by up to 40% (on some hardware setups).
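The cache-hit/cache-miss flow described above can be sketched as follows. The docker operations are abstracted into callables so the control flow is visible without a running daemon; all names here are illustrative, not the PR's actual implementation.

```python
from typing import Any, Callable


def start_task_env(
    tag: str,
    image_exists: Callable[[str], bool],    # e.g. queries local docker images
    build_env: Callable[[], Any],           # full (slow) task environment setup
    commit: Callable[[Any, str], None],     # e.g. wraps `docker commit <c> <tag>`
    start_from_image: Callable[[str], Any],
) -> Any:
    if image_exists(tag):
        # Cache hit: skip setup entirely and boot from the committed image.
        return start_from_image(tag)
    # Cache miss: do the expensive setup once, then snapshot the container
    # so every subsequent evaluation of this task starts from the snapshot.
    container = build_env()
    commit(container, tag)
    return container
```

The first call for a given tag pays the full setup cost and commits the result; every later call with the same tag only has to start a container from the committed image.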
Any other comments?

I expected the change to use only a small amount of disk space, since all task environments share the same base image and Docker uses OverlayFS to avoid storing duplicate image layers. However, each image ends up using ~1.5 GB of disk space per task. The dev split of SWE-bench_Lite requires ~40 GB of disk space, while the test split would consume ~500 GB. Although this issue should be addressed later, it can still be a reasonable trade-off when running a few consecutive evaluations to test some changes.
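A quick back-of-the-envelope check of these figures, assuming the observed ~1.5 GB per cached image and the SWE-bench_Lite split sizes (23 dev tasks, 300 test tasks):

```python
# Observed per-task image cost from the PR discussion; layer sharing via
# OverlayFS did not reduce this in practice.
PER_IMAGE_GB = 1.5


def cache_disk_gb(num_tasks: int) -> float:
    """Estimated disk usage of the task-image cache for a split."""
    return num_tasks * PER_IMAGE_GB


dev_gb = cache_disk_gb(23)    # 34.5 GB, matching the "~40 GB" observation
test_gb = cache_disk_gb(300)  # 450 GB, in line with the "~500 GB" estimate
```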