
Local caching for ExecuteProcess #6898

Open
stuhood opened this issue Dec 11, 2018 · 8 comments

@stuhood (Member) commented Dec 11, 2018

Currently the only cache for ExecuteProcess exists in the remoting API, and we have so far leaned on the assumption that pantsd will be up in order to avoid re-invoking local processes in situations where remoting is not available.

@illicitonion (Contributor) commented Dec 11, 2018

I'd recommend we do this by:

  1. Adding a new "Database" (not a ShardedLmdb, just an lmdb::Database) to the fs::local::InnerStore; it should write to root.join("process_execution") or similar:

     ```rust
     struct InnerStore {
       pool: Arc<ResettablePool>,
       // Store directories separately from files because:
       // 1. They may have different lifetimes.
       // 2. It's nice to know whether we should be able to parse something as a proto.
       file_dbs: Result<Arc<ShardedLmdb>, String>,
       directory_dbs: Result<Arc<ShardedLmdb>, String>,
     }
     ```
  2. Exposing an interface on fs::Store, analogous to record_directory and load_directory, called record_process_result and load_process_result. Exactly what types this should take and return is an open question. One easy option: populate a bazel_protos::remote_execution::Action as we do in process_execution::remote::make_execute_request (extracting the common code somewhere useful), index entries by the sha256 of that proto serialized as bytes (as make_execute_request already does to populate the bazel_protos::remote_execution::ExecuteRequest), and serialize the FallibleExecuteProcessResult as a bazel_protos::remote_execution::ActionResult whose bytes we can store (the mapping is trivial). We could also use some kind of custom serialization if it could be justified.
  3. Modifying engine::ExecuteProcess::run to check the cache (accessible via context.core.store()) before running the process here:

     ```rust
     fn run(self, context: Context) -> NodeFuture<ProcessResult> {
       let request = self.0;
       context
         .core
         .command_runner()
         .run(request)
         .map(ProcessResult)
         .map_err(|e| throw(&format!("Failed to execute process: {}", e)))
         .to_boxed()
     }
     ```
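The keying scheme in step 2 can be sketched in memory. Everything below is a hypothetical simplification: a stand-in `Action` struct instead of the bazel_protos proto, a `HashMap` instead of the lmdb Database, and std's `DefaultHasher` instead of sha256 over the serialized proto bytes:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Hypothetical stand-in for bazel_protos::remote_execution::Action.
#[derive(Hash)]
struct Action {
    argv: Vec<String>,
    env: Vec<(String, String)>,
    input_files_digest: String,
}

// Hypothetical stand-in for the serialized ActionResult bytes.
#[derive(Clone, Debug, PartialEq)]
struct ProcessResult {
    stdout: Vec<u8>,
    exit_code: i32,
}

// In-memory stand-in for the proposed lmdb Database, keyed by a
// digest of the Action.
struct ProcessResultCache {
    db: HashMap<u64, ProcessResult>,
}

impl ProcessResultCache {
    fn new() -> Self {
        ProcessResultCache { db: HashMap::new() }
    }

    // The real design would use the sha256 of the serialized proto;
    // DefaultHasher keeps this sketch dependency-free.
    fn key(action: &Action) -> u64 {
        let mut hasher = DefaultHasher::new();
        action.hash(&mut hasher);
        hasher.finish()
    }

    fn record_process_result(&mut self, action: &Action, result: ProcessResult) {
        self.db.insert(Self::key(action), result);
    }

    fn load_process_result(&self, action: &Action) -> Option<&ProcessResult> {
        self.db.get(&Self::key(action))
    }
}
```

In the actual design, record_process_result and load_process_result would live on fs::Store and persist ActionResult bytes rather than a Rust struct.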
@stuhood (Member, Author) commented Dec 11, 2018

Two thoughts:

  1. The jdk_home facility will be a bit of a monkey wrench. When computing a local cache key, we'd want to actually include the destination of the symlink in the key. One way to do that would be to have an explicit method outside of process_execution::remote::make_execute_request that "amends" the resulting gRPC struct with local-only information in order to affect the cache key... and to then have that method add an env var with the JDK info.
  2. Another possible factoring of the cache would be to make it an impl CommandRunner similar to BoundedCommandRunner, that wraps a cache around some other CommandRunner. Unclear whether this level of abstraction is actually useful, or just "obvious".
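The second thought can be sketched as a wrapper that implements the same CommandRunner trait it delegates to. The types here are hypothetical simplifications of the engine's: the real CommandRunner returns futures, and the real cache would be the lmdb-backed store rather than a HashMap:

```rust
use std::cell::RefCell;
use std::collections::HashMap;

// Hypothetical, simplified versions of the engine's types.
#[derive(Clone, Hash, PartialEq, Eq)]
struct ExecuteProcessRequest {
    argv: Vec<String>,
}

#[derive(Clone, Debug, PartialEq)]
struct FallibleExecuteProcessResult {
    stdout: Vec<u8>,
    exit_code: i32,
}

trait CommandRunner {
    fn run(&self, req: ExecuteProcessRequest) -> FallibleExecuteProcessResult;
}

// Wraps any other CommandRunner with a cache, analogously to how
// BoundedCommandRunner wraps one with a concurrency limit.
struct CachingCommandRunner<R: CommandRunner> {
    inner: R,
    cache: RefCell<HashMap<ExecuteProcessRequest, FallibleExecuteProcessResult>>,
}

impl<R: CommandRunner> CachingCommandRunner<R> {
    fn new(inner: R) -> Self {
        CachingCommandRunner {
            inner,
            cache: RefCell::new(HashMap::new()),
        }
    }
}

impl<R: CommandRunner> CommandRunner for CachingCommandRunner<R> {
    fn run(&self, req: ExecuteProcessRequest) -> FallibleExecuteProcessResult {
        // Serve a cached result if one exists for this request.
        if let Some(hit) = self.cache.borrow().get(&req) {
            return hit.clone();
        }
        // Otherwise delegate to the wrapped runner and record the result.
        let result = self.inner.run(req.clone());
        self.cache.borrow_mut().insert(req, result.clone());
        result
    }
}
```

A nice property of this factoring is that callers cannot tell whether a result came from the cache or from a fresh execution.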
@cosmicexplorer (Contributor) commented Jan 25, 2019

I'm thinking of replacing any uses of PythonToolPrepBase with a synchronous call to the engine as a case study for this ticket. The idea is to use a specific file or directory on the local filesystem as the "local caching", and that this would also be the link to v1 tasks. v1 tasks have the context that is required for context._scheduler.capture_snapshots (and hence to be able to use the output of something in get_pants_cachedir() in the v2 engine); I don't know how to get an instance of that directly in a v2 task, but I can make that happen.

@illicitonion (Contributor) commented Jan 25, 2019

That sounds like roughly the opposite of what I described above?

What I was imagining was that we'd internally cache process executions inside the lmdb store (so that callers of processes don't know whether they hit the cache; it's opaque), and then v1 rules could call context._scheduler.materialize_directories. That way process execution caching is a black box, and the v1 rule is in control of the destination of the files. It also makes it much harder for things to corrupt the cache by editing files, since the files live in an opaque cache rather than on disk where they may be edited.

@cosmicexplorer (Contributor) commented Jan 26, 2019

That makes sense, and I appreciate taking the time to describe it. I wasn't thinking super clearly about why we would need things on disk in the first place (was still thinking in v1 mode). I will look into implementing the above. Thanks again.

@cosmicexplorer (Contributor) commented Jan 26, 2019

I'm warming up to the impl CommandRunner idea, since separating out the common code for serializing an Action seems difficult without putting a lot of the logic in process_execution anyway. (EDIT: this doesn't contradict the above in any way.)

@cosmicexplorer referenced a pull request that will close this issue Jan 28, 2019: [WIP] Cached local process execution #7171 (open)

@Eric-Arellano Eric-Arellano self-assigned this May 31, 2019

Eric-Arellano added a commit that referenced this issue Jun 3, 2019

Bump CI unit test timeout for less flaky runs (#7831)
When using V2, unit tests take too long to run so we have to use `travis_wait`. We set the wait time too short, and have a few times had timeouts.

Note that a better solution would be to figure out why V2 is slower in the first place via #7795, followed by implementing local caching via #6898.
@Eric-Arellano (Contributor) commented Jun 4, 2019

> and we have so far biased toward assuming that pantsd will be up in order to avoid re-invoking local processes in situations where remoting is not available.

I'm a bit unclear on the high-level picture of when we would use pantsd vs. when we would use this new local caching. For example, what is the use case for running without --enable-pantsd in the first place? Would we ever train pantsd to leverage the caching mechanism described here?

@Eric-Arellano Eric-Arellano removed their assignment Jun 4, 2019

@stuhood (Member, Author) commented Jun 4, 2019

> For example, what is the use case for running without --enable-pantsd in the first place?

In (most) CI environments, we don't expect that pantsd is able to stay up across multiple runs, whereas on a laptop it will be able to.

> Would we ever train pantsd to leverage this caching mechanism described here?

All runs would take advantage of this cache, with or without pantsd.
