
Support file-level content-addressable storage for compile inputs in the distributed build server #558

Open
luser opened this issue Oct 23, 2019 · 1 comment

Comments

@luser
Contributor

luser commented Oct 23, 2019

I was looking at the source for the Rust InputsPackager, which is responsible for packaging up all the inputs necessary for a single compile to ship to the build server:

sccache/src/compiler/rust.rs

Lines 1316 to 1462 in fc256ff

impl pkg::InputsPackager for RustInputsPackager {
    fn write_inputs(self: Box<Self>, wtr: &mut dyn io::Write) -> Result<dist::PathTransformer> {
        debug!("Packaging compile inputs for compile");
        let RustInputsPackager { crate_link_paths, crate_types, inputs, mut path_transformer, rlib_dep_reader, env_vars } = *{self};

        // If this is a cargo build, we can assume all immediate `extern crate` dependencies
        // have been passed on the command line, allowing us to scan them all and find the
        // complete list of crates we might need.
        // If it's not a cargo build, we can't extract the `extern crate` statements and
        // so have no way to build a list of necessary crates - send all rlibs.
        let is_cargo = env_vars.iter().any(|(k, _)| k == "CARGO_PKG_NAME");
        let mut rlib_dep_reader_and_names = if is_cargo {
            rlib_dep_reader.map(|r| (r, HashSet::new()))
        } else {
            None
        };

        let mut tar_inputs = vec![];
        for input_path in inputs.into_iter() {
            let input_path = pkg::simplify_path(&input_path)?;
            if let Some(ext) = input_path.extension() {
                if !CAN_DIST_DYLIBS && ext == DLL_EXTENSION {
                    bail!("Cannot distribute dylib input {} on this platform", input_path.display())
                } else if ext == RLIB_EXTENSION {
                    if let Some((ref rlib_dep_reader, ref mut dep_crate_names)) = rlib_dep_reader_and_names {
                        dep_crate_names.extend(rlib_dep_reader.discover_rlib_deps(&env_vars, &input_path)
                            .chain_err(|| format!("Failed to read deps of {}", input_path.display()))?)
                    }
                }
            }

            let dist_input_path = path_transformer.to_dist(&input_path)
                .chain_err(|| format!("unable to transform input path {}", input_path.display()))?;
            tar_inputs.push((input_path, dist_input_path))
        }

        if log_enabled!(Trace) {
            if let Some((_, ref dep_crate_names)) = rlib_dep_reader_and_names {
                trace!("Identified dependency crate names: {:?}", dep_crate_names)
            }
        }

        // Given the link paths, find the things we need to send over the wire to the remote machine. If
        // we've been able to use a dependency searcher then we can filter down just candidates for that
        // crate, otherwise we need to send everything.
        let mut tar_crate_libs = vec![];
        for crate_link_path in crate_link_paths.into_iter() {
            let crate_link_path = pkg::simplify_path(&crate_link_path)?;
            let dir_entries = match fs::read_dir(crate_link_path) {
                Ok(iter) => iter,
                Err(ref e) if e.kind() == io::ErrorKind::NotFound => continue,
                Err(e) => return Err(Error::from(e).chain_err(|| "Failed to read dir entries in crate link path")),
            };

            for entry in dir_entries {
                let entry = match entry {
                    Ok(entry) => entry,
                    Err(e) => return Err(Error::from(e).chain_err(|| "Error during iteration over crate link path")),
                };
                let path = entry.path();

                {
                    // Take a look at the path and see if it's something we care about
                    let libname: &str = match path.file_name().and_then(|s| s.to_str()) {
                        Some(name) => {
                            let mut rev_name_split = name.rsplitn(2, '-');
                            let _extra_filename_and_ext = rev_name_split.next();
                            let libname = if let Some(libname) = rev_name_split.next() {
                                libname
                            } else {
                                continue
                            };
                            assert!(rev_name_split.next().is_none());
                            libname
                        },
                        None => continue,
                    };
                    let (crate_name, ext): (&str, _) = match path.extension() {
                        Some(ext) if libname.starts_with(DLL_PREFIX) && ext == DLL_EXTENSION =>
                            (&libname[DLL_PREFIX.len()..], ext),
                        Some(ext) if libname.starts_with(RLIB_PREFIX) && ext == RLIB_EXTENSION =>
                            (&libname[RLIB_PREFIX.len()..], ext),
                        _ => continue,
                    };
                    if let Some((_, ref dep_crate_names)) = rlib_dep_reader_and_names {
                        // We have a list of crate names we care about, see if this lib is a candidate
                        if !dep_crate_names.contains(crate_name) {
                            continue
                        }
                    }
                    if !path.is_file() {
                        continue
                    } else if !CAN_DIST_DYLIBS && ext == DLL_EXTENSION {
                        bail!("Cannot distribute dylib input {} on this platform", path.display())
                    }
                }

                // This is a lib that may be of interest during compilation
                let dist_path = path_transformer.to_dist(&path)
                    .chain_err(|| format!("unable to transform lib path {}", path.display()))?;
                tar_crate_libs.push((path, dist_path))
            }
        }

        let mut all_tar_inputs: Vec<_> = tar_inputs.into_iter().chain(tar_crate_libs).collect();
        all_tar_inputs.sort();
        // There are almost certainly duplicates from explicit externs also within the lib search paths
        all_tar_inputs.dedup();

        // If we're just creating an rlib then the only thing inspected inside dependency rlibs is the
        // metadata, in which case we can create a trimmed rlib (which is actually a .a) with the metadata
        let can_trim_rlibs = if let CrateTypes { rlib: true, staticlib: false } = crate_types { true } else { false };

        let mut builder = tar::Builder::new(wtr);

        for (input_path, dist_input_path) in all_tar_inputs.iter() {
            let mut file_header = pkg::make_tar_header(input_path, dist_input_path)?;
            let file = fs::File::open(input_path)?;
            if can_trim_rlibs && input_path.extension().map(|e| e == RLIB_EXTENSION).unwrap_or(false) {
                let mut archive = ar::Archive::new(file);

                while let Some(entry_result) = archive.next_entry() {
                    let mut entry = entry_result?;
                    if entry.header().identifier() != b"rust.metadata.bin" {
                        continue
                    }
                    let mut metadata = vec![];
                    io::copy(&mut entry, &mut metadata)?;

                    let mut metadata_ar = vec![];
                    {
                        let mut ar_builder = ar::Builder::new(&mut metadata_ar);
                        ar_builder.append(entry.header(), metadata.as_slice())?
                    }

                    file_header.set_size(metadata_ar.len() as u64);
                    file_header.set_cksum();
                    builder.append(&file_header, metadata_ar.as_slice())?;
                    break
                }
            } else {
                file_header.set_cksum();
                builder.append(&file_header, file)?
            }
        }

        // Finish archive
        let _ = builder.into_inner()?;
        Ok(path_transformer)
    }
}

And I realized that distributed Rust compiles are probably doing a lot of redundant work. Every time a Rust compilation is distributed, sccache has to package up all the files referenced on the command line, which includes the rlib files for every crate the current crate uses, shared libraries for proc macros, etc. For a single cargo invocation this likely means that the same files get packaged up over and over again.

A nice optimization here would be to give the build server an API for a content-addressable store: clients could query the server by file hash to find out whether it already has a given file, and upload only the files the server is missing. The server could simply store them on disk, similar to how the existing DiskCache works.

Then, instead of packaging up all compile inputs, the client would hash them (this happens as part of cache lookup anyway), ask the server which of them it already has, and upload only those that are not already present. To keep round trips down, the server should presumably provide an API that accepts a list of hashes and returns two lists: hashes that are already present and hashes that are missing.

When the build server goes to execute a compilation, the client would then provide a list of mappings from path -> hash instead of a tarball of all the inputs, and the server would take care of retrieving those files from its local cache and placing them at the desired paths in the build filesystem (this could likely be done with hardlinks to avoid copying). A further optimization would be to preemptively store any outputs of Rust compilations in the cache, since they are likely to be used as inputs to another Rust compile.
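A minimal sketch of what such an API might look like. Everything here is hypothetical (the names `CasStore`, `query`, `upload`, and `materialize` are not part of sccache), an in-memory map stands in for the server's on-disk blob store, and a plain `String` key stands in for a real content hash:

```rust
use std::collections::HashMap;

// Hypothetical in-memory stand-in for the build server's content-addressable
// store. A real implementation would persist blobs on disk (like DiskCache)
// and key them by a cryptographic content hash.
#[derive(Default)]
struct CasStore {
    blobs: HashMap<String, Vec<u8>>,
}

impl CasStore {
    // Batch query: split the requested hashes into those the server already
    // has and those the client still needs to upload.
    fn query(&self, hashes: &[String]) -> (Vec<String>, Vec<String>) {
        let mut present = vec![];
        let mut missing = vec![];
        for h in hashes {
            if self.blobs.contains_key(h) {
                present.push(h.clone());
            } else {
                missing.push(h.clone());
            }
        }
        (present, missing)
    }

    // Upload a blob the server reported as missing.
    fn upload(&mut self, hash: String, data: Vec<u8>) {
        self.blobs.insert(hash, data);
    }

    // Materialize a compile's inputs from a path -> hash manifest, returning
    // the contents keyed by destination path. A real server would hardlink
    // cached blobs into the build filesystem instead of copying.
    fn materialize(&self, manifest: &[(String, String)]) -> Option<HashMap<String, Vec<u8>>> {
        let mut out = HashMap::new();
        for (path, hash) in manifest {
            out.insert(path.clone(), self.blobs.get(hash)?.clone());
        }
        Some(out)
    }
}

fn main() {
    let mut cas = CasStore::default();
    cas.upload("h1".to_string(), b"rlib bytes".to_vec());
    let (present, missing) = cas.query(&["h1".to_string(), "h2".to_string()]);
    println!("present: {:?}, missing: {:?}", present, missing);
    // present: ["h1"], missing: ["h2"]
}
```

The point of the batch `query` is that one round trip covers all inputs of a compile; only the blobs in the `missing` list are uploaded, and subsequent compiles in the same cargo invocation hit the `present` path for shared rlibs.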

@aidanhs
Contributor

aidanhs commented Oct 26, 2019

Yes, I have a side interest in content-addressable stores (actually in rolling checksums, but they're mainly interesting in that context) and sometimes think about this, since there's a multitude of use cases, including:

  • this issue (build inputs)
  • allowing build machines to collaborate on receiving toolchains from slow clients
  • permitting clients to retrieve just changed parts of outputs

It might also be worth looking at the "Remote Execution API", as it uses a CAS, with the caveat that it may be optimised for different use cases than sccache's - #358 (comment)
