Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up evaluation by caching task environments as docker images #317

Merged
merged 13 commits into from
May 27, 2024

Conversation

ollmer
Copy link
Contributor

@ollmer ollmer commented May 6, 2024

What does this implement?

This PR introduces the ability to cache the environment created for each SWE-bench task as the docker image. It saves the filesystem and environment variables (using a file) with docker commit, which produces a new docker image with the tag unique to the given task. The tag contains the dataset name, split, and task number. The feature could be enabled by flag --cache_task_images.

This change addresses the issue of spending a big chunk of evaluation time on setting up the task environments. Timing test on a dev split of princeton-nlp/SWE-bench_Lite (23 tasks), on a 2-core VM:

  • Avg. time to prepare 1 task environment: 78.3 sec
  • Avg. time to load cached environment from the image: 10.1 sec

As the repo states the avg. task run time of 1.5 minutes, this PR improves the speed of the consecutive evaluations by up to 40% (for some HW setups).

Any other comments?

I expected the change to use a small amount of disk space since all task environments share the same base image and Docker uses OverlayFS to avoid storing duplicate image parts. However, each image ends up using ~1.5GB of disk space per task. The dev split of SWE-bench_lite requires ~40GB of disk space, while the test split would consume ~500 GB. Although this issue should be addressed later, it still could be a reasonable trade-off when running a few consecutive evaluations to test some changes.

@klieret klieret added the ✨ enhancement New feature or request label May 7, 2024
@klieret
Copy link
Member

klieret commented May 8, 2024

Very cool stuff! I'll take a closer look at that on Friday!

Copy link

codecov bot commented May 8, 2024

Codecov Report

Attention: Patch coverage is 37.50000% with 35 lines in your changes are missing coverage. Please review.

❗ No coverage uploaded for pull request base (main@088aabd). Click here to learn what that means.
Report is 1 commits behind head on main.

❗ Current head fef3d32 differs from pull request most recent head 3d28971. Consider uploading reports for the commit 3d28971 to get more accurate results

Files Patch % Lines
sweagent/environment/swe_env.py 29.54% 31 Missing ⚠️
sweagent/environment/utils.py 66.66% 4 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #317   +/-   ##
=======================================
  Coverage        ?   75.72%           
=======================================
  Files           ?       18           
  Lines           ?     2892           
  Branches        ?        0           
=======================================
  Hits            ?     2190           
  Misses          ?      702           
  Partials        ?        0           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

# Prepare image tag prefix for cached task environments
if self.args.cache_task_images:
logger.info("Task environment caching enabled")
tag = f"{self.args.data_path.replace('/', '_')}__{self.args.split}__{self.args.base_commit or 'head'}__"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Data path can now be all kinds of things, including the full text of the problem statement. I could push some code to work around these things.

Though I also wonder if data_path is really what we should make this depend on. Perhaps we could rather intercept the actual setup stages and hash the setup config or something like that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some dataset fingerprints would be even better, I agree. Also, there is a limit of 128 characters for the docker image tag, so this would help stay within the limit with any dataset. I will try to implement the idea with the config hash.

@@ -420,25 +455,29 @@ def reset_container(self) -> None:
self.container_obj = None
self._reset_container()

def _init_container(self) -> None:
def _init_container(self, cached_image=None) -> None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's put type hints

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@klieret
Copy link
Member

klieret commented May 13, 2024

I think this looks great! The only thing we'd have to fix is the naming issue depending on the nature of data_path/rejecting the flag if data_path is something unsitable

@klieret
Copy link
Member

klieret commented May 13, 2024

I think this looks great! The only thing we'd have to fix is the naming issue depending on the nature of data_path/rejecting the flag if data_path is something unsuitable

Let me push this on top of your branch :)

@ollmer
Copy link
Contributor Author

ollmer commented May 13, 2024

I think this looks great! The only thing we'd have to fix is the naming issue depending on the nature of data_path/rejecting the flag if data_path is something unsuitable

Let me push this on top of your branch :)

Sure. How can I do that?

@klieret
Copy link
Member

klieret commented May 13, 2024

Sure. How can I do that?
Probably I already can :) (else it should be this setting).

Realistically, I'll probably only get to it this Wednesday though, so no reason to wait for me until then haha

@ollmer
Copy link
Contributor Author

ollmer commented May 13, 2024

Probably I already can :) (else it should be this setting).

Aha, I see. I've enabled that option, thanks.

@klieret
Copy link
Member

klieret commented May 27, 2024

Hmm, somehow pushing on this PR doesn't work, not sure why. Let me merge your PR and then apply my changes on top :)

@klieret klieret merged commit 2a0c164 into princeton-nlp:main May 27, 2024
1 check passed
@klieret
Copy link
Member

klieret commented May 27, 2024

Thanks again for the very nice addition! ❤️

@klieret
Copy link
Member

klieret commented May 28, 2024

I've highlighted your contribution in our changelog :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
✨ enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants