Overhaul BinaryUtil to play nicely with v2 #7790

Closed
illicitonion opened this issue May 22, 2019 · 9 comments

Comments

@illicitonion
Contributor

Right now, BinaryUtil works by eagerly HTTP-fetching files to a known location in ~/.cache.

We should overhaul it to work in a v2-first way. Some of the things I think this entails (not in any order):

  1. Attach an expected digest to files, as UrlToFetch requires.
  2. Download files to the Store, rather than to a known location on disk (probably using a UrlToFetch request, so that these can happen in parallel in v2). Materialize them to disk only if needed (ideally only as part of v2 process executions, but for v1-compat we may need to materialize them to ~/.cache too if being called from v1). A rough sketch of this shape follows below.

Exciting but not-strictly-necessary extension:
3. Make their downloads lazy by default; a remote execution which uses file X shouldn't need the local system to have file X, if the remote execution already knows about the file. Hopefully we can do this without needing to teach the engine to rewind the graph, or needing to introduce a tri-state execution status ("treat-as-if-executed-but-may-need-to-execute"), but we may need to do something slightly tricky there.
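
For concreteness, here is a rough sketch of what items 1 and 2 could look like as a v2 rule, mirroring the `Get[Snapshot](...)` style used later in this thread. The rule name, URL, digest values, and import paths are illustrative assumptions, not the actual implementation.

```python
# Illustrative only: the URL, fingerprint, and size are placeholders, and the
# import paths assume the v2 engine layout referenced elsewhere in this thread.
from pants.engine.fs import Digest, Snapshot, UrlToFetch
from pants.engine.rules import rule
from pants.engine.selectors import Get


@rule
async def fetch_example_tool() -> Snapshot:
    # 1. Attach the expected digest (fingerprint + size in bytes) to the fetch.
    expected = Digest(
        "0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef",
        1234567,
    )
    # 2. Download into the Store rather than ~/.cache; the Snapshot is only
    #    materialized to disk when a process execution actually needs it.
    return await Get[Snapshot](UrlToFetch("https://example.com/tool-1.0.tar.gz", expected))
```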

@cosmicexplorer
Contributor

The exciting but not-strictly-necessary extension would allow us to avoid having to e.g. download a version of a compiler for Linux onto our OSX box if the remote execution platform is Linux (and not just compilers, but any native tool) -- BinaryUtil doesn't currently expose this capability anyway, so this extension could possibly let us avoid poking a hole in that abstraction.

@illicitonion
Contributor Author

Additional optimisation: It would be nice to be able to specify something along the lines of
`BinaryUtil(archive=UrlToFetch(some.tar.gz, digest=foo), file=("path/in/tar", digest=bar))`
so that we can potentially fetch multiple things out of single-downloaded archives :)
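
For illustration, a purely hypothetical shape for such a request; none of these names exist in pants, and the digests are placeholders. The point is that both the archive and the extracted file carry digests, so a file already in the Store never forces a download or an untar.

```python
# Hypothetical sketch of the archive-aware request floated above.
from dataclasses import dataclass


@dataclass(frozen=True)
class FileFromArchive:
    archive_url: str       # e.g. "https://example.com/some.tar.gz"
    archive_digest: str    # digest of the whole archive ("foo" above)
    path_in_archive: str   # e.g. "path/in/tar"
    file_digest: str       # digest of the extracted file itself ("bar" above)


request = FileFromArchive(
    archive_url="https://example.com/some.tar.gz",
    archive_digest="foo",
    path_in_archive="path/in/tar",
    file_digest="bar",
)
```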

@cosmicexplorer
Contributor

> fetch multiple things out of single-downloaded archives :)

This is a great idea that I believe native_toolchain.py would love to take advantage of!

> `BinaryUtil(archive=UrlToFetch(some.tar.gz, digest=foo), file=("path/in/tar", digest=bar))`

I'm into this idea, but a nitpick -- is there a reason we'd need to specify `digest=bar` for extracting a specific file? It seems like checking the digest of the entire archive might be enough? I'm not clear on whether we'd definitely have the `bar` digest available in user code when we construct a `BinaryUtil` request. Or is the idea that we would only want to expose this mechanism for cases where we do know the digest in advance (and would we then have those digest strings hardcoded, or provided in an option, or am I missing something)?

@stuhood
Sponsor Member

stuhood commented May 23, 2019

It doesn't feel like "extracting a tarball" strictly needs to be a native operation, but I guess there are some advantages there? Having said that, it doesn't feel like something that should be "part" of downloading them... rather, a composable component.

After typing this though, it occurred to me that because URL fetching is native, it runs in the client, rather than on the cluster. And that... may not always be desirable. Ditto extracting tarballs.

@cosmicexplorer
Contributor

cosmicexplorer commented May 23, 2019

> After typing this though, it occurred to me that because URL fetching is native, it runs in the client, rather than on the cluster. And that... may not always be desirable. Ditto extracting tarballs.

Sure, but it might not be obscenely difficult to extract a curl invocation into a python ruleset, which would be cacheable (although it might actually be a useful intentional convention to have cluster process invocations avoid touching the outside internet, since that injects a hidden cache-invisible dependency?).

I think one distinction between things that should run on the client vs the cluster could be whether the process (download/etc) is cacheable in a way other processes can use (url downloading actually seems like a good example of this, as I think you were alluding to). I can see tarball extraction potentially going the same way, although we'd want to benchmark either of these before making any optimization decisions.

@illicitonion
Contributor Author

> fetch multiple things out of single-downloaded archives :)

> This is a great idea that I believe native_toolchain.py would love to take advantage of!

> `BinaryUtil(archive=UrlToFetch(some.tar.gz, digest=foo), file=("path/in/tar", digest=bar))`

> I'm into this idea, but a nitpick -- is there a reason we'd need to specify `digest=bar` for extracting a specific file? It seems like checking the digest of the entire archive might be enough? I'm not clear on whether we'd definitely have the `bar` digest available in user code when we construct a `BinaryUtil` request. Or is the idea that we would only want to expose this mechanism for cases where we do know the digest in advance (and would we then have those digest strings hardcoded, or provided in an option, or am I missing something)?

The goal of having the digest is mostly so that we can skip the download entirely if we (or, if we're executing remotely, the remote side) already know about that file. So if I have already downloaded digest `bar`, maybe not from a tar, or maybe from an old version of the tar, or maybe it's already on the remote execution cluster because of another user, I don't have to do the download or untar at all.

> It doesn't feel like "extracting a tarball" strictly needs to be a native operation, but I guess there are some advantages there?

The advantage of doing this natively is that the native side has knowledge of the contents of the store (and if it likes, the remote store), so can skip things if needed. We could also expose that information another way if we wanted.

> After typing this though, it occurred to me that because URL fetching is native, it runs in the client, rather than on the cluster. And that... may not always be desirable. Ditto extracting tarballs.

The local side could turn either of those things into a remote action if it wanted to, but I'm pretty sure the local side needs to be the thing that decides whether the action needs doing. Alternatively, we could teach the remote side about different well-known sources of digests, but that would be a lot more design work (and probably remote execution system implementation-specific).

> I think one distinction between things that should run on the client vs the cluster could be whether the process (download/etc) is cacheable in a way other processes can use (url downloading actually seems like a good example of this, as I think you were alluding to). I can see tarball extraction potentially going the same way, although we'd want to benchmark either of these before making any optimization decisions.

My current line of thinking is that all process executions should be cacheable in a way that other processes can use (if the files are materialised). This would interact with platforms, though, in that downloading a platform-specific file to the wrong place is bad. cc #7735

illicitonion added a commit to twitter/pants that referenced this issue May 23, 2019
stuhood pushed a commit to twitter/pants that referenced this issue May 25, 2019
stuhood added a commit that referenced this issue Jun 1, 2019
### Problem

In runs with high degrees of concurrency, we end up seeing many simultaneous instances of tools being fetched and snapshotted.

### Solution

Enforce a single fetch+snapshot attempt per tool. As mentioned in the comment, this is a temporary solution until #7790 migrates this into the engine itself.

stuhood added a commit that referenced this issue Jun 1, 2019
stuhood added a commit that referenced this issue Jun 1, 2019
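
As an aside, a minimal sketch of the "single fetch+snapshot attempt per tool" idea from these commits; the class and callback names are hypothetical, and this is not the pants implementation (which, per the commit message, lives in BinaryUtil until this issue moves the logic into the engine).

```python
# Minimal illustration of "enforce a single fetch+snapshot attempt per tool";
# all names here are hypothetical.
import threading


class SingleFlightFetcher:
    """Ensures concurrent requests for the same tool share one fetch+snapshot."""

    def __init__(self, fetch_and_snapshot):
        self._fetch_and_snapshot = fetch_and_snapshot  # callable: tool_key -> snapshot
        self._locks = {}       # tool_key -> per-tool lock
        self._results = {}     # tool_key -> cached snapshot
        self._registry_lock = threading.Lock()

    def get(self, tool_key):
        with self._registry_lock:
            lock = self._locks.setdefault(tool_key, threading.Lock())
        with lock:
            # Only the first caller for a given tool performs the fetch; later
            # callers block on the per-tool lock and reuse the cached result.
            if tool_key not in self._results:
                self._results[tool_key] = self._fetch_and_snapshot(tool_key)
            return self._results[tool_key]
```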
@illicitonion
Contributor Author

Relevant upstream API PR: bazelbuild/remote-apis#88

@stuhood
Sponsor Member

stuhood commented Jul 26, 2019

And a pointer to the code that implements `UrlToFetch -> Snapshot`:

```rust
#[derive(Clone, Copy, Debug, Eq, Hash, PartialEq)]
pub struct DownloadedFile(Key);

impl DownloadedFile {
  fn load_or_download(
    &self,
    core: Arc<Core>,
    url: Url,
    digest: hashing::Digest,
  ) -> BoxFuture<store::Snapshot, String> {
    let file_name = try_future!(url
      .path_segments()
      .and_then(Iterator::last)
      .map(str::to_owned)
      .ok_or_else(|| format!("Error getting the file name from the parsed URL: {}", url)));
    core
      .store()
      .load_file_bytes_with(digest, |_| ())
      .and_then(move |maybe_bytes| {
        maybe_bytes
          .map(|((), _metadata)| future::ok(()).to_boxed())
          .unwrap_or_else(|| DownloadedFile::download(core.clone(), url, file_name.clone(), digest))
          .and_then(move |()| {
            DownloadedFile::snapshot_of_one_file(core.store(), PathBuf::from(file_name), digest)
          })
      })
      .to_boxed()
  }

  fn snapshot_of_one_file(
    store: store::Store,
    name: PathBuf,
    digest: hashing::Digest,
  ) -> BoxFuture<store::Snapshot, String> {
    #[derive(Clone)]
    struct Digester {
      digest: hashing::Digest,
    }

    impl StoreFileByDigest<String> for Digester {
      fn store_by_digest(&self, _: File) -> BoxFuture<hashing::Digest, String> {
        future::ok(self.digest).to_boxed()
      }
    }

    store::Snapshot::from_path_stats(
      store,
      &Digester { digest },
      vec![fs::PathStat::File {
        path: name.clone(),
        stat: fs::File {
          path: name,
          is_executable: true,
        },
      }],
    )
  }

  fn download(
    core: Arc<Core>,
    url: Url,
    file_name: String,
    expected_digest: hashing::Digest,
  ) -> BoxFuture<(), String> {
    // TODO: Retry failures
    core
      .http_client
      .get(url.clone())
      .send()
      .map_err(|err| format!("Error downloading file: {}", err))
      .and_then(move |response| {
        // Handle common HTTP errors.
        if response.status().is_server_error() {
          Err(format!(
            "Server error ({}) downloading file {} from {}",
            response.status().as_str(),
            file_name,
            url,
          ))
        } else if response.status().is_client_error() {
          Err(format!(
            "Client error ({}) downloading file {} from {}",
            response.status().as_str(),
            file_name,
            url,
          ))
        } else {
          Ok(response)
        }
      })
      .and_then(move |response| {
        struct SizeLimiter<W: std::io::Write> {
          writer: W,
          written: usize,
          size_limit: usize,
        }

        impl<W: std::io::Write> Write for SizeLimiter<W> {
          fn write(&mut self, buf: &[u8]) -> Result<usize, std::io::Error> {
            let new_size = self.written + buf.len();
            if new_size > self.size_limit {
              Err(std::io::Error::new(
                std::io::ErrorKind::InvalidData,
                "Downloaded file was larger than expected digest",
              ))
            } else {
              self.written = new_size;
              self.writer.write_all(buf)?;
              Ok(buf.len())
            }
          }

          fn flush(&mut self) -> Result<(), std::io::Error> {
            self.writer.flush()
          }
        }

        let hasher = hashing::WriterHasher::new(SizeLimiter {
          writer: bytes::BytesMut::with_capacity(expected_digest.1).writer(),
          written: 0,
          size_limit: expected_digest.1,
        });

        response
          .into_body()
          .map_err(|err| format!("Error reading URL fetch response: {}", err))
          .fold(hasher, |mut hasher, chunk| {
            hasher
              .write_all(&chunk)
              .map(|_| hasher)
              .map_err(|err| format!("Error hashing/writing URL fetch response: {}", err))
          })
          .map(|hasher| {
            let (digest, bytewriter) = hasher.finish();
            (digest, bytewriter.writer.into_inner().freeze())
          })
      })
      .and_then(move |(actual_digest, buf)| {
        if expected_digest != actual_digest {
          return future::err(format!(
            "Wrong digest for downloaded file: want {:?} got {:?}",
            expected_digest, actual_digest
          ))
          .to_boxed();
        }
        core
          .store()
          .store_file_bytes(buf, true)
          .map(|_| ())
          .to_boxed()
      })
      .to_boxed()
  }
}
```

cosmicexplorer added a commit that referenced this issue Feb 11, 2020
### Problem

*One step towards a conclusion for #7790.*

v2 rules currently have no way to make use of the extremely useful infrastructure we have for downloading tools hosted by ourselves or others, exposed via the `NativeTool` and `Script` subclasses of `BinaryToolBase`. This requires us to hard-code the digest and URL to fetch for `cloc` and `pex`, e.g.: https://github.com/pantsbuild/pants/blob/fd469cb62f330b73e7814f86e94558d16cfc9da3/src/python/pants/backend/python/rules/download_pex_bin.py#L67-L73

### Solution

- Add a v2 ruleset in `binary_tool.py`, which exposes a `BinaryToolFetchRequest` object which wraps a `BinaryToolBase` instance and converts into a `UrlToFetch`.
- Add a `default_digest` field to `BinaryToolBase`, along with the `--fingerprint` and `--size-bytes` options, so that tools can specify the expected digest for their `default_version`, while giving the user the ability to set `--fingerprint` and `--size-bytes` if they override the `--version`.
- Make cloc and pex use `BinaryToolFetchRequest` instead of hardcoding the url and digest.

### Result

It should be much, much easier to integrate outside tools into v2 `@rule`s! The above pex `@rule` can now look like:
```python
@rule
async def download_pex_bin(pex_binary_tool: DownloadedPexBin.Factory) -> DownloadedPexBin:
  snapshot = await Get[Snapshot](BinaryToolFetchRequest(pex_binary_tool))
  return DownloadedPexBin(SingleFileExecutable(snapshot))
```
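
For example, a tool definition under this scheme might look roughly like the following; `default_digest` and `default_version` come from the PR description above, while the import paths, `options_scope`, and digest values are assumptions for illustration.

```python
# Rough sketch only: attribute values are placeholders, and the import paths are
# assumed; `default_digest`/`default_version` are the fields described above.
from pants.binaries.binary_tool import NativeTool
from pants.engine.fs import Digest


class ExampleTool(NativeTool):
    options_scope = "example-tool"
    default_version = "1.0.0"
    # Expected digest for default_version; users who override --version can
    # supply --fingerprint and --size-bytes instead.
    default_digest = Digest(
        "0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef",
        1048576,
    )
```

A v2 rule would then request it with `await Get[Snapshot](BinaryToolFetchRequest(example_tool))`, as in the pex example above.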
@benjyw
Sponsor Contributor

benjyw commented Apr 28, 2020

Addressed in #9625 (other than the lazy-by-default downloads, but that feels like a much bigger issue).

@benjyw benjyw closed this as completed Apr 28, 2020