Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved build cache invalidation with content hashes #3705

Merged
merged 1 commit into from
Aug 27, 2019

Conversation

hdgarrood
Copy link
Contributor

Fixes #3145.

The current build cache invalidation algorithm compares the input
timestamps to the output timestamps, and only triggers a rebuild if the
input timestamps are newer than the output timestamps. However, this
does not appear to be sufficient: as discussed in #3145, we think that
the reason that doing rm -r output often fixes weird compile errors is
that we should really be considering the input file to have changed if
its timestamp is different to what it was at the last successful
build, regardless of whether it is before or after the output timestamp.
Essentially, timestamps on input files can't be trusted to the extent
that we do for cache invalidation, because of things like switching
between different versions of dependencies or switching branches;
sometimes you can have an input file's contents and timestamp both
change, but have the timestamp still be older than the output timestamp.

This commit implements a slightly different cache invalidation
algorithm, where we make a note of the timestamps of all input files at
the start of each build, and we consider files to have changed in
subsequent builds if their input timestamps have changed at all
(regardless of whether the new input timestamps are before or after the
output timestamps).

The timestamps are stored in a json file cache-db.json in the output
directory; I also considered putting the timestamps in the externs
files, but I think having them stored separately is preferable because
then we don't have to update the module's externs file if its input file
timestamp changes but its hash doesn't, which means that we don't force
a rebuild for downstream modules.

As an additional enhancement, we also make note of file content hashes
and store them in the cache-db.json file. On subsequent builds, if
timestamps have changed, we compare the previous hash to the new hash,
and if they are identical, we can skip rebuilding the module. This means
that e.g. touching a file no longer forces a rebuild. Note that we only
compute hashes in the case where timestamps differ to avoid doing extra
unnecessary work. This scheme of checking timestamps and then hashes was
inspired by Shake, which provides this mechanism as one of its options
for Change; see #3145 (comment)

I've also added some tests so that we can make changes to this part of
the compiler a little more confidently.

I'm using the latest version of these (which is not in our Stack
snapshot) because it doesn't incur a lens dependency, whereas earlier
versions do.

Fixes #3145.

The current build cache invalidation algorithm compares the input
timestamps to the output timestamps, and only triggers a rebuild if the
input timestamps are newer than the output timestamps. However, this
does not appear to be sufficient: as discussed in #3145, we think that
the reason that doing `rm -r output` often fixes weird compile errors is
that we should really be considering the input file to have changed if
its timestamp is _different_ to what it was at the last successful
build, regardless of whether it is before or after the output timestamp.
Essentially, timestamps on input files can't be trusted to the extent
that we do for cache invalidation, because of things like switching
between different versions of dependencies or switching branches;
sometimes you can have an input file's contents and timestamp both
change, but have the timestamp still be older than the output timestamp.

This commit implements a slightly different cache invalidation
algorithm, where we make a note of the timestamps of all input files at
the start of each build, and we consider files to have changed in
subsequent builds if their input timestamps have changed at all
(regardless of whether the new input timestamps are before or after the
output timestamps).

The timestamps are stored in a json file `cache-db.json` in the output
directory; I also considered putting the timestamps in the externs
files, but I think having them stored separately is preferable because
then we don't have to update the module's externs file if its input file
timestamp changes but its hash doesn't, which means that we don't force
a rebuild for downstream modules.

As an additional enhancement, we also make note of file content hashes
and store them in the `cache-db.json` file. On subsequent builds, if
timestamps have changed, we compare the previous hash to the new hash,
and if they are identical, we can skip rebuilding the module. This means
that e.g. touching a file no longer forces a rebuild. Note that we only
compute hashes in the case where timestamps differ to avoid doing extra
unnecessary work. This scheme of checking timestamps and then hashes was
inspired by Shake, which provides this mechanism as one of its options
for Change; see #3145 (comment)

I've also added some tests so that we can make changes to this part of
the compiler a little more confidently.

I'm using the latest version of `these` (which is not in our Stack
snapshot) because it doesn't incur a `lens` dependency, whereas earlier
versions do.
@hdgarrood
Copy link
Contributor Author

The diff stats are a little intimidating but most of the added lines are from the new tests and the additions to our license file.

Copy link
Member

@LiamGoodacre LiamGoodacre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use content hashes to determine whether a file needs to be recompiled
2 participants