
Option to use file characteristics instead of timestamps #1459

Open
greened opened this issue Aug 14, 2018 · 31 comments

@greened

greened commented Aug 14, 2018

As noted in other issues, using timestamps to determine whether to rebuild something can be problematic. While timestamps are convenient and relatively fast, it is often desirable to key build decisions on some characteristic intrinsic to the file itself, such as a hash of its contents.

I use git a lot and it is annoying that simply changing branches triggers rebuilds. Ideally, I would be able to switch from my current work branch to some other branch, not touch any files and then switch back to the original branch and not have to rebuild anything at all. As far as I know that is not possible if the build system uses timestamps. Using file hashes would solve this particular issue.

I understand that using file hashes or other such intrinsic file characteristics could slow down Ninja, so using them should be an option.

@greened greened changed the title Feature requestion: Option to use file characteristics instead of timestamps Feature request: Option to use file characteristics instead of timestamps Aug 14, 2018
@jhasse jhasse added the feature label Oct 29, 2018
@metti

metti commented Nov 13, 2018

#929 has an implementation. Even though it is successfully used (in a fork) for thousands of builds daily, it was not considered for merging.

@jhasse
Collaborator

jhasse commented Nov 13, 2018

#929 is single-threaded and can therefore be slower than ccache or other solutions. Also, I think that this should be a command line flag instead, so that it doesn't require changes to the build definitions.

@jhasse jhasse changed the title Feature request: Option to use file characteristics instead of timestamps Option to use file characteristics instead of timestamps Nov 13, 2018
@metti

metti commented Nov 13, 2018

It can't (or at least should not) be a command line flag, as then hashing would apply to all rules. Hashing e.g. all inputs of link rules is expensive and therefore not desired, while source files and their known dependencies are good candidates. For that distinction, it has to be part of the build description. In addition, to make use of the feature you would have to use the flag consistently every time, not just sometimes.

Responding to the single-threaded argument: yes, it adds instructions to the single-threaded loop. In practice that only matters if the single thread gets more work than it can keep up with (i.e. more rules finish than the one thread can process: depslog + hashlog + ...). Only then does hashing hurt; otherwise the single-threaded loop is waiting for jobs to finish anyway. We never saw a busy ninja with hashing, even in -j1000 experiments. (And for fast-finishing rules, hashing is not interesting for saving time anyway.)

Also consider: hashing with murmur hash is quite fast, and even large source files take only a few milliseconds to hash. In addition, the hashing happens right after the source file (and its dependencies) have been seen by the compiler, so they are usually read from the filesystem cache.
As hashing happens during the build (in parallel with the executed rules), the overall build time is usually not measurably affected.

Lastly, the implementation in #929 is opt-in and comes at no cost for people not using the feature (besides the if-statement).
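
For a rough idea of what the per-file hashing amounts to, here is a minimal sketch (not the #929 code; FNV-1a stands in for the murmur hash, and the function name is made up):

```cpp
// Minimal sketch of hashing a file's contents with a fast 64-bit
// non-cryptographic hash. FNV-1a is used here as a stand-in for murmur.
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

uint64_t HashFileContents(const std::string& path) {
  std::ifstream in(path, std::ios::binary);
  uint64_t h = 14695981039346656037ull;  // FNV-1a offset basis
  std::vector<char> buf(1 << 16);
  while (in.read(buf.data(), buf.size()) || in.gcount() > 0) {
    for (std::streamsize i = 0; i < in.gcount(); ++i) {
      h ^= static_cast<unsigned char>(buf[i]);
      h *= 1099511628211ull;  // FNV-1a prime
    }
  }
  return h;
}
```

Reading the file dominates the cost, and since the compiler has just read the same file, it is normally still in the page cache.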

@jhasse
Collaborator

jhasse commented Nov 13, 2018

Hashing e.g. all inputs of link rules is expensive and therefore not desired, while source files and their known dependencies are good candidates.

I would say that hashing of inputs to the linker is especially desired, as it can often result in the linking being skipped completely (e.g. for formatting changes or when comments are changed). As the compilation of object files finishes piece by piece, the hash calculation can happen while the build is running (as you pointed out).

If it's really too slow (e.g. with big static libraries), we could think about implementing hashes only for pure inputs and not intermediate files. That would solve the "switching Git branches causes full rebuilds" case at least.

In addition, to make use of the feature you would have to use the flag consistently every time, not just sometimes.

I would say that is an advantage: if I'm working on a single branch and want to iterate fast, I would not use hashes. If I'm comparing different feature branches, I would use hashes.

@Boddlnagg

In addition, to make use of the feature you would have to use the flag consistently every time, not just sometimes.

I would say that is an advantage: if I'm working on a single branch and want to iterate fast, I would not use hashes. If I'm comparing different feature branches, I would use hashes.

But then there would need to be a way to switch from non-hashing to hashing, which means that the current state of the files would need to be hashed so that the next rebuild can use the hashes (which probably don't exist or are out of date if you didn't pass the flag before).

@jhasse
Collaborator

jhasse commented Nov 14, 2018

I would guess that hashing would still use the timestamp first, so that if the timestamps match, there's no need to compare the hashes. It would mean that the first few builds might unnecessarily recompile some files, but that shouldn't happen often (most of the time the timestamp heuristic is right, after all).
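
Something like this minimal sketch, assuming a per-input record of the last seen mtime and content hash (the names are hypothetical, not ninja's internals):

```cpp
// Sketch of "timestamp first, hash second": only rehash when the
// recorded mtime no longer matches. Names are hypothetical.
#include <cstdint>
#include <string>

struct InputRecord {
  int64_t mtime;   // mtime seen at the last successful build
  uint64_t hash;   // content hash recorded at the last successful build
};

int64_t GetMTime(const std::string& path);           // stat() wrapper
uint64_t HashFileContents(const std::string& path);  // content hash

bool InputLooksUnchanged(const std::string& path, const InputRecord& rec) {
  if (GetMTime(path) == rec.mtime)
    return true;  // fast path: trust the matching timestamp, skip hashing
  // Timestamp changed (e.g. by a branch switch): compare the contents.
  return HashFileContents(path) == rec.hash;
}
```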

@moroten
Contributor

moroten commented Nov 14, 2018

I've also used hashing at work, with great success. It is based on #929, but with a bunch of patches, as can be seen in https://github.com/moroten/ninja/commits/hashed. Using hash_input = 1 on selected rules is very convenient. My branch still contains a bug where files are stat()ed too often, O(n^2) instead of O(n). The bug is related to phony edges.

One problem is how to handle phony edges. I use phony edges to group e.g. header files. Therefore, my implementation iterates recursively through phony edges. It also relates to the bug in #1021.

Another thought is to make hashing a first-class member of ninja, i.e. move the hashes into the build log. Using SHA256 would be one step towards adding support for Bazel's remote execution API. A C++ implementation can be found at https://gitlab.com/bloomberg/recc/. Wouldn't that be pretty nice?

Unfortunately, sorting out the semantics for phony edges will probably break backward compatibility.
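
For reference, the remote execution API identifies file contents by a digest roughly along these lines (a hedged sketch, not ninja code):

```cpp
// Sketch of the digest used by Bazel's remote execution API: a
// hex-encoded SHA-256 of the contents plus the size in bytes.
// Storing such digests in the build log would let ninja both detect
// unchanged inputs and address remote cache entries by content.
#include <cstdint>
#include <string>

struct Digest {
  std::string hash;    // lowercase hex SHA-256 of the file contents
  int64_t size_bytes;  // length of the file in bytes
};
```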

@csagan5

csagan5 commented Nov 17, 2018

Meanwhile I have concocted a solution for my use case of switching branches in Chromium (which is particularly painful); the small Go program and script I use is here: https://github.com/bromite/mtool

Feel free to adapt it to your use cases; if it works for Chromium, I bet it will work for smaller build projects as well (and it takes negligible time for my runs). The only downside is that if you use the script I published there as-is, it will litter .mtool files in each git repository's parent directory, but that's nothing a global gitignore cannot cure.

It is interesting to note that I use the git ls-files --stage output for the hashing; it is possible (but less efficient) to ask git to also hash non-indexed files if your build depends on those.

Feature-wise, one would expect that to implement the feature being discussed here, ninja could do the same internally (without relying on git) and with similar performance.

@Trass3r

Trass3r commented Nov 18, 2018

After a quick look, those ninja patches don't seem to hash the compiler command line in addition to the input files. Am I missing anything?

@metti

metti commented Nov 18, 2018

The command line is already hashed by ninja and stored in the build log.

@rulatir

rulatir commented Dec 4, 2020

I would guess that hashing would still use the timestamp first, so that if the timestamps match, there's no need to compare the hashes. It would mean that the first few builds might unnecessarily recompile some files, but that shouldn't happen often (most of the time the timestamp heuristic is right, after all).

That would be very wrong. Comparing hashes can be elided if the timestamps don't match; the build system can assume that the dependency was modified, and if it wasn't really modified (only touched), then the build will be suboptimal but correct. However, if the timestamps match, it is still possible that the dependency was modified and its timestamp was forcibly reset (EDIT: or that the target was touched and thus made newer than its dependencies). Without double-checking by comparing hashes, this would result in an incorrect build.
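
In code, the elision that is safe goes the other way around (a sketch with the same hypothetical helpers as above):

```cpp
// Sketch of the safe direction of elision: a timestamp mismatch may
// skip hashing and just rebuild (worst case an unnecessary rebuild),
// but a timestamp match still needs the hash to prove the input clean.
#include <cstdint>
#include <string>

struct InputRecord { int64_t mtime; uint64_t hash; };
int64_t GetMTime(const std::string& path);
uint64_t HashFileContents(const std::string& path);

bool NeedsRebuild(const std::string& path, const InputRecord& rec) {
  if (GetMTime(path) != rec.mtime)
    return true;  // assume modified; correct, possibly suboptimal
  // Equal timestamps are not proof: the file could have been modified
  // and its mtime forcibly reset, so double-check the contents.
  return HashFileContents(path) != rec.hash;
}
```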

@burdiyan

burdiyan commented Dec 4, 2020

I guess Ninja already makes that assumption and skips all the work when the timestamps match, so in your example this would be an incorrect build anyway.

I'm not aware (probably due to my ignorance) of any tool that would produce a different output and would keep the previous timestamp. Why would one do that?

IMHO, skipping the hash check when the timestamps match is a very valid optimization.

A hash check would simply avoid some "false-dirty" rebuilds, while keeping the existing semantics already provided by Ninja.

@rulatir

rulatir commented Dec 14, 2020

The problem is not false-dirty rebuilds, it's false-clean non-rebuilds. A git checkout touches everything it overwrites. It can make a target newer than a dependency (yes, people commit generated code for various valid reasons). A hash check would prevent a false-clean non-rebuild in this case.

@burdiyan

@rulatir I think I mostly understand what you're saying :) I guess that's one of the reasons why Bazel and other build systems based on hash checks are really against in-tree target outputs.

Although, wouldn't this problem be solved if the build system checked whether targets are newer than the previously known time, and rebuilt if necessary?

@rulatir

rulatir commented Dec 18, 2020

@rulatir I think I mostly understand what you're saying :) I guess that's one of the reasons why Bazel and other build systems based on hash checks are really against in-tree target outputs.

How does the hash depend on where the file is?

Although, wouldn't this problem be solved if the build system checked whether targets are newer than the previously known time, and rebuilt if necessary?

I understand that the main benefit of using timestamps is in avoiding the need to maintain a separate database that keeps track of "previously known" version signatures. If you are willing to forego that benefit, why shouldn't those signatures be hashes?

@seanmiddleditch

A strategy we employ in large distributed content pipelines (not using ninja, but it's all similar concepts) uses hash caches keyed off a triplet like (timestamp, file size, inode).

The chance that, in a typical and non-malicious environment, a file changes without one of those three changing is incredibly low... and ninja has no business trying to protect itself against malicious manipulation IMO, since users can just manipulate build files and ninja caches at that point; if they want to break things, they will.

I will note that the timestamp used in a hash cache should be considered out of date if it's at all different, not just newer-than. Otherwise rolling back a change to a character in a source file would go undetected.

For those possible cases where the triplet isn't reliable enough to detect file mutation (I've personally never encountered such a case, but of course I haven't encountered every build environment that exists!), a --rehash flag could force a rehash of every file considered by the build. I've always added those kinds of things to content pipeline tools, though I've never needed them outside of testing/verification of the hash cache itself.

We've in the past considered extending those triplets with additional information, e.g. other file attributes or permissions, as well as adding options to use underlying hash metadata on some esoteric file systems or some SCM backends, but the basic triplet has proven reliable for us so far on all systems we've cared about.

For content pipelines, the hash caching is essential since the files are often very large and there are often very many of them (far more than there are source code files, generally), and content hashing scales with total file size, of course. I've never measured what the difference would be in source code builds since I've generally just relied on standard ninja/make/msbuild behavior for such things. I imagine the results would be less impactful than with content pipelines, but probably still worth it for large projects like Chromium or the typical AAA game's code base.
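
A minimal sketch of that kind of hash-cache lookup (hypothetical names, POSIX stat(); any metadata difference at all counts as stale, per the note above):

```cpp
// Sketch of a hash cache keyed by (mtime, size, inode): the content
// hash is recomputed only when any part of the triplet has changed.
#include <sys/stat.h>
#include <cstdint>
#include <map>
#include <string>

struct FileKey {
  int64_t mtime;
  int64_t size;
  uint64_t inode;
  bool operator==(const FileKey& o) const {
    return mtime == o.mtime && size == o.size && inode == o.inode;
  }
};

struct CacheEntry {
  FileKey key;
  uint64_t hash;
};

uint64_t HashFileContents(const std::string& path);  // full content hash

uint64_t CachedHash(const std::string& path,
                    std::map<std::string, CacheEntry>& cache) {
  struct stat st;
  if (stat(path.c_str(), &st) != 0)
    return 0;  // missing file; real code would surface an error
  FileKey key{static_cast<int64_t>(st.st_mtime),
              static_cast<int64_t>(st.st_size),
              static_cast<uint64_t>(st.st_ino)};
  auto it = cache.find(path);
  if (it != cache.end() && it->second.key == key)
    return it->second.hash;             // triplet unchanged: reuse the hash
  uint64_t h = HashFileContents(path);  // any difference at all: rehash
  cache[path] = CacheEntry{key, h};
  return h;
}
```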

@burdiyan

@seanmiddleditch the problem I find with anything timestamp-based is that a branch switch forces unnecessary rebuilds. Just touching a file marks the target dirty. Having proper hashing, at least as an opt-in strategy, could save many people lots of unnecessary build time, IMHO.

@seanmiddleditch

seanmiddleditch commented Dec 21, 2021

@burdiyan that's the whole point of the hashes. The timestamps are just an acceleration for the common case so you aren't needlessly rehashing all your files when nothing's changed on incremental builds.

(edit) I mean the timestamps in what I proposed are just an acceleration; status quo ninja today has the problems you describe, of course.

@rulatir

rulatir commented Feb 10, 2022

@burdiyan that's the whole point of the hashes. The timestamps are just an acceleration for the common case so you aren't needlessly rehashing all your files when nothing's changed on incremental builds.

It's 2022. Version control systems that destroy all assumptions about timestamps are a reality now. THEY are the common case.

@seanmiddleditch

Just to make it very clear: I am not arguing against hash-based builds. :) I'm just offering that adding a cache in front of the hashes helps maintain very fast local builds. It's a "best of both worlds" approach: hashes when the VCS stomps all over timestamps, and timestamps when files are completely untouched. Win-win.

@burdiyan

burdiyan commented Apr 21, 2022

I agree with that! In my wacky scripts I do something similar, i.e. check the timestamp first, and if it's different, check the hashes. So indeed, win-win. It's pretty much always faster to hash a file, even if it's unnecessary, than to run an unnecessary build, IMHO.

@jonesmz

This comment was marked as abuse.

@jcelerier

I have a case where the build machine periodically resets its time to 00:00, which wreaks havoc in ninja + cmake (it seems to make ninja call cmake's configure step in a loop). It would be great to have support for hash-based rather than time-based checks to overcome this.

@mathstuf
Contributor

mathstuf commented Jun 9, 2022

While this might fix this particular fallout of that root problem, I suspect that you really need to get that fixed because this is not the only thing on a modern system that cares about the system clock being…useful.

@jcelerier

It really can't - the system is synced to an external PTP clock which behaves like this.

@mathstuf
Contributor

mathstuf commented Jun 9, 2022

I would doubt that such behavior is useful for a "Precision Time Protocol" device. I highly recommend you solve the root cause because this issue isn't moving anywhere near fast enough for your problem (cf. basically nowhere beyond ideation so far).

@hadrielk

hadrielk commented Jun 9, 2022

this issue isn't moving anywhere near fast enough for your problem

Don't worry, according to this PTP clock, there'll be plenty of time remaining to fix it.


(sorry, couldn't help it)

@jcelerier

jcelerier commented Aug 13, 2022

I highly recommend you solve the root cause

Not much possibility of that - these are proprietary audio devices which act as a master clock, synchronizing themselves independently of real-world time; since the Linux driver is in-kernel, this sets the whole machine to the same clock.

@StevenShw

Sad, I want to know the final solution...

@maxwell-bland

If someone wants to build a system, potentially based on inotifywait, that ties in as an easy-to-use plugin and, for a given target, reduces the output of "-t commands" to just the set of commands necessary to recompile the steps that depend on files written to since the last build in the directory, I will personally mail you 5 USD as a bounty. 10 USD and a can of room-temp/warm Chicago beer if you make it so that it "just works", where the next run of a target with the plugin enabled only runs the commands related to the changes.

No idea if this solves all the hashing nonsense or whatever Bazel is calling it nowadays (hermetic ... synonym "esoteric" ... at least they are up front about it (-; ). Anyway, I think inotifywait is a good approach, at least in some simple cases. I'll try it once I get some time, but if it gets to that point I'll post the results here and the bounty will be gone.

@ericriff

I'm also interested in this.
