Overhaul BinaryUtil to play nicely with v2 #7790

Closed
illicitonion opened this issue May 22, 2019 · 9 comments

Comments

@illicitonion
Contributor

Right now, BinaryUtil works by eagerly HTTP-fetching files to a known location in ~/.cache.

We should overhaul it to work in a v2-first way. Some of the things I think this entails (not in any order):

  1. Attach an expected digest to files, as UrlToFetch requires.
  2. Download files to the Store, rather than to a known location on disk (probably using a UrlToFetch request, so that these can happen in parallel in v2). Materialize them to disk only if needed (ideally only as part of v2 process executions, but for v1-compat we may need to materialize them to ~/.cache too if being called from v1). A rough sketch of this shape follows below.

Exciting but not-strictly-necessary extension:
3. Make their downloads lazy by default; a remote execution which uses file X shouldn't need the local system to have file X, if the remote execution already knows about the file. Hopefully we can do this without needing to teach the engine to rewind the graph, or needing to introduce a tri-state execution status ("treat-as-if-executed-but-may-need-to-execute"), but we may need to do something slightly tricky there.
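
For concreteness, here is a rough sketch of what items 1 and 2 could look like as a v2 rule, mirroring the `Get[Snapshot](...)` style used later in this thread. The rule name, URL, digest values, and import paths are illustrative assumptions, not the actual implementation.

```python
# Illustrative only: the URL, fingerprint, and size are placeholders, and the
# import paths assume the v2 engine layout referenced elsewhere in this thread.
from pants.engine.fs import Digest, Snapshot, UrlToFetch
from pants.engine.rules import rule
from pants.engine.selectors import Get


@rule
async def fetch_example_tool() -> Snapshot:
    # 1. Attach the expected digest (fingerprint + size in bytes) to the fetch.
    expected = Digest(
        "0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef",
        1234567,
    )
    # 2. Download into the Store rather than ~/.cache; the Snapshot is only
    #    materialized to disk when a process execution actually needs it.
    return await Get[Snapshot](UrlToFetch("https://example.com/tool-1.0.tar.gz", expected))
```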

@cosmicexplorer
Contributor

The exciting but not-strictly-necessary extension would allow us to avoid having to e.g. download a version of a compiler for Linux onto our OSX box if the remote execution platform is Linux (and not just compilers, but any native tool) -- BinaryUtil doesn't currently expose this capability anyway, so this extension could possibly let us avoid poking a hole in that abstraction.

@illicitonion
Contributor Author

Additional optimisation: It would be nice to be able to specify something along the lines of
`BinaryUtil(archive=UrlToFetch(some.tar.gz, digest=foo), file=("path/in/tar", digest=bar))`
so that we can potentially fetch multiple things out of single-downloaded archives :)
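
For illustration, a purely hypothetical shape for such a request; none of these names exist in pants, and the digests are placeholders. The point is that both the archive and the extracted file carry digests, so a file already in the Store never forces a download or an untar.

```python
# Hypothetical sketch of the archive-aware request floated above.
from dataclasses import dataclass


@dataclass(frozen=True)
class FileFromArchive:
    archive_url: str       # e.g. "https://example.com/some.tar.gz"
    archive_digest: str    # digest of the whole archive ("foo" above)
    path_in_archive: str   # e.g. "path/in/tar"
    file_digest: str       # digest of the extracted file itself ("bar" above)


request = FileFromArchive(
    archive_url="https://example.com/some.tar.gz",
    archive_digest="foo",
    path_in_archive="path/in/tar",
    file_digest="bar",
)
```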

@cosmicexplorer
Contributor

> fetch multiple things out of single-downloaded archives :)

This is a great idea that I believe native_toolchain.py would love to take advantage of!

> `BinaryUtil(archive=UrlToFetch(some.tar.gz, digest=foo), file=("path/in/tar", digest=bar))`

I'm into this idea, but a nitpick -- is there a reason we'd need to specify `digest=bar` for extracting a specific file? It seems like checking the digest of the entire archive might be enough? I'm not clear on whether we'd definitely have the `bar` digest available in user code when we construct a `BinaryUtil` request. Or is the idea that we would only want to expose this mechanism for cases where we do know the digest in advance (and would we then have those digest strings hardcoded, or provided in an option, or am I missing something)?

@stuhood
Sponsor Member

stuhood commented May 23, 2019

It doesn't feel like "extracting a tarball" strictly needs to be a native operation, but I guess there are some advantages there? Having said that, it doesn't feel like something that should be "part" of downloading them... rather, a composable component.

After typing this though, it occurred to me that because URL fetching is native, it runs in the client, rather than on the cluster. And that... may not always be desirable. Ditto extracting tarballs.

@cosmicexplorer
Contributor

cosmicexplorer commented May 23, 2019

> After typing this though, it occurred to me that because URL fetching is native, it runs in the client, rather than on the cluster. And that... may not always be desirable. Ditto extracting tarballs.

Sure, but it might not be obscenely difficult to extract a curl invocation into a python ruleset, which would be cacheable (although it might actually be a useful intentional convention to have cluster process invocations avoid touching the outside internet, since that injects a hidden cache-invisible dependency?).

I think one distinction between things that should run on the client vs the cluster could be whether the process (download/etc) is cacheable in a way other processes can use (url downloading actually seems like a good example of this, as I think you were alluding to). I can see tarball extraction potentially going the same way, although we'd want to benchmark either of these before making any optimization decisions.

@illicitonion
Contributor Author

> fetch multiple things out of single-downloaded archives :)

> This is a great idea that I believe native_toolchain.py would love to take advantage of!

> `BinaryUtil(archive=UrlToFetch(some.tar.gz, digest=foo), file=("path/in/tar", digest=bar))`

> I'm into this idea, but a nitpick -- is there a reason we'd need to specify `digest=bar` for extracting a specific file? It seems like checking the digest of the entire archive might be enough? I'm not clear on whether we'd definitely have the `bar` digest available in user code when we construct a `BinaryUtil` request. Or is the idea that we would only want to expose this mechanism for cases where we do know the digest in advance (and would we then have those digest strings hardcoded, or provided in an option, or am I missing something)?

The goal of having the digest is mostly so that we can skip the download entirely if we (or, if we're executing remotely, the remote side) already know about that file. So if I have already downloaded digest `bar`, maybe not from a tar, or maybe from an old version of the tar, or maybe it's already on the remote execution cluster because of another user, I don't have to do the download or untar at all.

> It doesn't feel like "extracting a tarball" strictly needs to be a native operation, but I guess there are some advantages there?

The advantage of doing this natively is that the native side has knowledge of the contents of the store (and if it likes, the remote store), so can skip things if needed. We could also expose that information another way if we wanted.

> After typing this though, it occurred to me that because URL fetching is native, it runs in the client, rather than on the cluster. And that... may not always be desirable. Ditto extracting tarballs.

The local side could turn either of those things into a remote action if it wanted to, but I'm pretty sure the local side needs to be the thing that decides whether the action needs doing. Alternatively, we could teach the remote side about different well-known sources of digests, but that would be a lot more design work (and probably remote execution system implementation-specific).

> I think one distinction between things that should run on the client vs the cluster could be whether the process (download/etc) is cacheable in a way other processes can use (url downloading actually seems like a good example of this, as I think you were alluding to). I can see tarball extraction potentially going the same way, although we'd want to benchmark either of these before making any optimization decisions.

My current line of thinking is that all process executions should be cacheable in a way that other processes can use (if the files are materialised). This would interact with platforms, though, in that downloading a platform-specific file to the wrong place is bad. cc #7735

illicitonion added a commit to twitter/pants that referenced this issue May 23, 2019
stuhood pushed a commit to twitter/pants that referenced this issue May 25, 2019
stuhood added a commit that referenced this issue Jun 1, 2019
### Problem

In runs with high degrees of concurrency, we end up seeing many simultaneous instances of tools being fetched and snapshotted.

### Solution

Enforce a single fetch+snapshot attempt per tool. As mentioned in the comment, this is a temporary solution until #7790 migrates this into the engine itself.

stuhood added a commit that referenced this issue Jun 1, 2019
stuhood added a commit that referenced this issue Jun 1, 2019
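
As an aside, a minimal sketch of the "single fetch+snapshot attempt per tool" idea from these commits; the class and callback names are hypothetical, and this is not the pants implementation (which, per the commit message, lives in BinaryUtil until this issue moves the logic into the engine).

```python
# Minimal illustration of "enforce a single fetch+snapshot attempt per tool";
# all names here are hypothetical.
import threading


class SingleFlightFetcher:
    """Ensures concurrent requests for the same tool share one fetch+snapshot."""

    def __init__(self, fetch_and_snapshot):
        self._fetch_and_snapshot = fetch_and_snapshot  # callable: tool_key -> snapshot
        self._locks = {}       # tool_key -> per-tool lock
        self._results = {}     # tool_key -> cached snapshot
        self._registry_lock = threading.Lock()

    def get(self, tool_key):
        with self._registry_lock:
            lock = self._locks.setdefault(tool_key, threading.Lock())
        with lock:
            # Only the first caller for a given tool performs the fetch; later
            # callers block on the per-tool lock and reuse the cached result.
            if tool_key not in self._results:
                self._results[tool_key] = self._fetch_and_snapshot(tool_key)
            return self._results[tool_key]
```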
@illicitonion
Contributor Author

Relevant upstream API PR: bazelbuild/remote-apis#88

@stuhood
Sponsor Member

stuhood commented Jul 26, 2019

And a pointer to the code that implements `UrlToFetch -> Snapshot`:

```rust
#[derive(Clone, Copy, Debug, Eq, Hash, PartialEq)]
pub struct DownloadedFile(Key);

impl DownloadedFile {
  fn load_or_download(
    &self,
    core: Arc<Core>,
    url: Url,
    digest: hashing::Digest,
  ) -> BoxFuture<store::Snapshot, String> {
    let file_name = try_future!(url
      .path_segments()
      .and_then(Iterator::last)
      .map(str::to_owned)
      .ok_or_else(|| format!("Error getting the file name from the parsed URL: {}", url)));
    core
      .store()
      .load_file_bytes_with(digest, |_| ())
      .and_then(move |maybe_bytes| {
        maybe_bytes
          .map(|((), _metadata)| future::ok(()).to_boxed())
          .unwrap_or_else(|| DownloadedFile::download(core.clone(), url, file_name.clone(), digest))
          .and_then(move |()| {
            DownloadedFile::snapshot_of_one_file(core.store(), PathBuf::from(file_name), digest)
          })
      })
      .to_boxed()
  }

  fn snapshot_of_one_file(
    store: store::Store,
    name: PathBuf,
    digest: hashing::Digest,
  ) -> BoxFuture<store::Snapshot, String> {
    #[derive(Clone)]
    struct Digester {
      digest: hashing::Digest,
    }

    impl StoreFileByDigest<String> for Digester {
      fn store_by_digest(&self, _: File) -> BoxFuture<hashing::Digest, String> {
        future::ok(self.digest).to_boxed()
      }
    }

    store::Snapshot::from_path_stats(
      store,
      &Digester { digest },
      vec![fs::PathStat::File {
        path: name.clone(),
        stat: fs::File {
          path: name,
          is_executable: true,
        },
      }],
    )
  }

  fn download(
    core: Arc<Core>,
    url: Url,
    file_name: String,
    expected_digest: hashing::Digest,
  ) -> BoxFuture<(), String> {
    // TODO: Retry failures
    core
      .http_client
      .get(url.clone())
      .send()
      .map_err(|err| format!("Error downloading file: {}", err))
      .and_then(move |response| {
        // Handle common HTTP errors.
        if response.status().is_server_error() {
          Err(format!(
            "Server error ({}) downloading file {} from {}",
            response.status().as_str(),
            file_name,
            url,
          ))
        } else if response.status().is_client_error() {
          Err(format!(
            "Client error ({}) downloading file {} from {}",
            response.status().as_str(),
            file_name,
            url,
          ))
        } else {
          Ok(response)
        }
      })
      .and_then(move |response| {
        struct SizeLimiter<W: std::io::Write> {
          writer: W,
          written: usize,
          size_limit: usize,
        }

        impl<W: std::io::Write> Write for SizeLimiter<W> {
          fn write(&mut self, buf: &[u8]) -> Result<usize, std::io::Error> {
            let new_size = self.written + buf.len();
            if new_size > self.size_limit {
              Err(std::io::Error::new(
                std::io::ErrorKind::InvalidData,
                "Downloaded file was larger than expected digest",
              ))
            } else {
              self.written = new_size;
              self.writer.write_all(buf)?;
              Ok(buf.len())
            }
          }

          fn flush(&mut self) -> Result<(), std::io::Error> {
            self.writer.flush()
          }
        }

        let hasher = hashing::WriterHasher::new(SizeLimiter {
          writer: bytes::BytesMut::with_capacity(expected_digest.1).writer(),
          written: 0,
          size_limit: expected_digest.1,
        });

        response
          .into_body()
          .map_err(|err| format!("Error reading URL fetch response: {}", err))
          .fold(hasher, |mut hasher, chunk| {
            hasher
              .write_all(&chunk)
              .map(|_| hasher)
              .map_err(|err| format!("Error hashing/writing URL fetch response: {}", err))
          })
          .map(|hasher| {
            let (digest, bytewriter) = hasher.finish();
            (digest, bytewriter.writer.into_inner().freeze())
          })
      })
      .and_then(move |(actual_digest, buf)| {
        if expected_digest != actual_digest {
          return future::err(format!(
            "Wrong digest for downloaded file: want {:?} got {:?}",
            expected_digest, actual_digest
          ))
          .to_boxed();
        }
        core
          .store()
          .store_file_bytes(buf, true)
          .map(|_| ())
          .to_boxed()
      })
      .to_boxed()
  }
}
```

cosmicexplorer added a commit that referenced this issue Feb 11, 2020
### Problem

*One step towards a conclusion for #7790.*

v2 rules currently have no way to make use of the extremely useful infrastructure we have for downloading tools hosted by ourselves or others, exposed via the `NativeTool` and `Script` subclasses of `BinaryToolBase`. This requires us to hard-code the digest and URL to fetch for `cloc` and `pex`, e.g.: https://github.com/pantsbuild/pants/blob/fd469cb62f330b73e7814f86e94558d16cfc9da3/src/python/pants/backend/python/rules/download_pex_bin.py#L67-L73

### Solution

- Add a v2 ruleset in `binary_tool.py`, which exposes a `BinaryToolFetchRequest` object which wraps a `BinaryToolBase` instance and converts into a `UrlToFetch`.
- Add a `default_digest` field to `BinaryToolBase`, along with the `--fingerprint` and `--size-bytes` options, so that tools can specify the expected digest for their `default_version`, while giving the user the ability to set `--fingerprint` and `--size-bytes` if they override the `--version`.
- Make cloc and pex use `BinaryToolFetchRequest` instead of hardcoding the url and digest.

### Result

It should be much, much easier to integrate outside tools into v2 `@rule`s! The above pex `@rule` can now look like:
```python
@rule
async def download_pex_bin(pex_binary_tool: DownloadedPexBin.Factory) -> DownloadedPexBin:
  snapshot = await Get[Snapshot](BinaryToolFetchRequest(pex_binary_tool))
  return DownloadedPexBin(SingleFileExecutable(snapshot))
```
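
For example, a tool definition under this scheme might look roughly like the following; `default_digest` and `default_version` come from the PR description above, while the import paths, `options_scope`, and digest values are assumptions for illustration.

```python
# Rough sketch only: attribute values are placeholders, and the import paths are
# assumed; `default_digest`/`default_version` are the fields described above.
from pants.binaries.binary_tool import NativeTool
from pants.engine.fs import Digest


class ExampleTool(NativeTool):
    options_scope = "example-tool"
    default_version = "1.0.0"
    # Expected digest for default_version; users who override --version can
    # supply --fingerprint and --size-bytes instead.
    default_digest = Digest(
        "0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef",
        1048576,
    )
```

A v2 rule would then request it with `await Get[Snapshot](BinaryToolFetchRequest(example_tool))`, as in the pex example above.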
@benjyw
Sponsor Contributor

benjyw commented Apr 28, 2020

Addressed in #9625 (other than the lazy-by-default downloads, but that feels like a much bigger issue).

@benjyw benjyw closed this as completed Apr 28, 2020