Speed up evaluation by caching task environments as docker images #317
Conversation
Pull updates from the upstream
Very cool stuff! I'll take a closer look at that on Friday!
Codecov Report

Attention: Patch coverage is

```
@@           Coverage Diff            @@
##             main     #317   +/-   ##
=======================================
  Coverage        ?   75.72%
=======================================
  Files           ?       18
  Lines           ?     2892
  Branches        ?        0
=======================================
  Hits            ?     2190
  Misses          ?      702
  Partials        ?        0
```

View full report in Codecov by Sentry.
```python
# Prepare image tag prefix for cached task environments
if self.args.cache_task_images:
    logger.info("Task environment caching enabled")
    tag = f"{self.args.data_path.replace('/', '_')}__{self.args.split}__{self.args.base_commit or 'head'}__"
```
Data path can now be all kinds of things, including the full text of the problem statement. I could push some code to work around these things. Though I also wonder if `data_path` is really what we should make this depend on. Perhaps we could instead intercept the actual setup stages and hash the setup config, or something like that?
Some dataset fingerprints would be even better, I agree. Also, there is a limit of 128 characters for the docker image tag, so this would help stay within the limit with any dataset. I will try to implement the idea with the config hash.
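A minimal sketch of the config-hash idea discussed above (the function name and config fields here are illustrative, not the actual SWE-agent API): hashing a canonical serialization of the setup config yields a short, deterministic tag that stays well under docker's 128-character tag limit no matter what `data_path` contains.

```python
import hashlib
import json


def cached_image_tag(setup_config: dict, prefix: str = "swe-env") -> str:
    """Derive a deterministic docker image tag from a setup config.

    Hashing keeps the tag short and valid: docker tags are limited to
    128 characters and a restricted character set, both of which an
    arbitrary data_path (e.g. a full problem statement) can violate.
    """
    # json.dumps with sort_keys gives a canonical serialization, so the
    # same config always hashes to the same tag regardless of key order.
    canonical = json.dumps(setup_config, sort_keys=True)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
    return f"{prefix}-{digest}"
```

Two runs with the same config (even with keys in a different order) produce the same tag, so a previously committed image can be found again reliably.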
`sweagent/environment/swe_env.py` (outdated)

```diff
@@ -420,25 +455,29 @@ def reset_container(self) -> None:
         self.container_obj = None
         self._reset_container()

-    def _init_container(self) -> None:
+    def _init_container(self, cached_image=None) -> None:
```
let's put type hints
fixed
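For reference, the hinted signature presumably ended up along these lines; the class below is a toy stand-in for illustration, not the real `SWEEnv`:

```python
from __future__ import annotations


class SWEEnvSketch:
    """Toy stand-in to illustrate the type-hinted signature."""

    def _init_container(self, cached_image: str | None = None) -> None:
        # When a cached image tag is given, the real method would start
        # the container from that image instead of the base image.
        self.image_source = cached_image or "base-image"
```

(`from __future__ import annotations` keeps the `str | None` syntax working on Python versions before 3.10.)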
I think this looks great! The only thing we'd have to fix is the naming issue depending on the nature of

Let me push this on top of your branch :)

Sure. How can I do that?

Realistically, I'll probably only get to it this Wednesday though, so no reason to wait for me until then haha

Aha, I see. I've enabled that option, thanks.

Hmm, somehow pushing on this PR doesn't work, not sure why. Let me merge your PR and then apply my changes on top :)

Thanks again for the very nice addition! ❤️

I've highlighted your contribution in our changelog :)
What does this implement?

This PR introduces the ability to cache the environment created for each SWE-bench task as a docker image. It saves the filesystem and the environment variables (via a file) with `docker commit`, which produces a new docker image with a tag unique to the given task. The tag contains the dataset name, split, and task number. The feature can be enabled with the flag `--cache_task_images`.

This change addresses the issue of spending a large chunk of evaluation time on setting up the task environments. Timing test on the dev split of `princeton-nlp/SWE-bench_Lite` (23 tasks), on a 2-core VM: given that the repo states an average task run time of 1.5 minutes, this PR speeds up consecutive evaluations by up to 40% (on some hardware setups).
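The cache-hit/cache-miss flow described above can be sketched as follows. The docker operations are abstracted into callables so the control flow is visible without a running daemon; all names here are illustrative, not the PR's actual implementation.

```python
from typing import Any, Callable


def start_task_env(
    tag: str,
    image_exists: Callable[[str], bool],    # e.g. queries local docker images
    build_env: Callable[[], Any],           # full (slow) task environment setup
    commit: Callable[[Any, str], None],     # e.g. wraps `docker commit <c> <tag>`
    start_from_image: Callable[[str], Any],
) -> Any:
    if image_exists(tag):
        # Cache hit: skip setup entirely and boot from the committed image.
        return start_from_image(tag)
    # Cache miss: do the expensive setup once, then snapshot the container
    # so every subsequent evaluation of this task starts from the snapshot.
    container = build_env()
    commit(container, tag)
    return container
```

The first call for a given tag pays the full setup cost and commits the result; every later call with the same tag only has to start a container from the committed image.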
Any other comments?

I expected the change to use only a small amount of disk space, since all task environments share the same base image and Docker uses OverlayFS to avoid storing duplicate image layers. However, each image ends up using ~1.5 GB of disk space per task. The dev split of SWE-bench_Lite requires ~40 GB of disk space, while the test split would consume ~500 GB. Although this issue should be addressed later, it can still be a reasonable trade-off when running a few consecutive evaluations to test some changes.
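A quick back-of-the-envelope check of these figures, assuming the observed ~1.5 GB per cached image and the SWE-bench_Lite split sizes (23 dev tasks, 300 test tasks):

```python
# Observed per-task image cost from the PR discussion; layer sharing via
# OverlayFS did not reduce this in practice.
PER_IMAGE_GB = 1.5


def cache_disk_gb(num_tasks: int) -> float:
    """Estimated disk usage of the task-image cache for a split."""
    return num_tasks * PER_IMAGE_GB


dev_gb = cache_disk_gb(23)    # 34.5 GB, matching the "~40 GB" observation
test_gb = cache_disk_gb(300)  # 450 GB, in line with the "~500 GB" estimate
```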