Use content hashes to determine whether a file needs to be recompiled #3145
Comments
Then you'd have to store the latest hashes somewhere, right? The nice thing about the timestamp is that it's automatically stored in the file metadata.

Yes, we'd have to keep them with the compiled modules.

Yeah, the …

Suggest making this optional (e.g. via a flag), because this seems to be an issue only when extra tooling is involved (e.g. CI), and the timestamp-based approach (file-system metadata), as opposed to dedicated explicit computing, managing, and comparing of hashes, works pleasantly fast and sufficiently correctly for the "no-frills, no-integrations local bare dev-env" cases. =)

We don't really do flags, except in exceptional circumstances. It tends to increase the number of ways things can go wrong.

I guess I'd be fine with this change, but the hash would have to take into account all of the dependencies. It just doesn't seem worth it to me. Edit: to clarify, not worth it for a small speed improvement at the risk of breaking one of the most fragile bits of the compiler when we only recently got it working properly.
I don't think speed was the motivation; more that the dates still have hiccups sometimes.

Speed was actually my concern, but it'd be interesting to compare.
Why would a hash need to take dependencies into account? Timestamps can't take changed dependencies into account, surely, so it's not clear to me why hashes would need to. |
I'd quite like to go ahead and give this a go. As with @garyb, I think it's best to consider this change as a bug fix rather than a performance boost: having file modification times change, e.g. by switching branches in your local working directory, can make the compiler think that an output directory is up to date when it isn't, causing compile errors, and the fix is to …

I tend to agree; it becomes habitual, so you forget that it's a thing until a newcomer starts asking questions when they encounter something that requires it.
Re: the conversation in #3635, I'd like to make a proposal for using Shake.
I believe so.
I don't think so. The only thing would be dealing with codegen options (corefn, sourcemaps). I personally would really like to see content-hash support ASAP. At work we do a lot of unnecessary recompilation due to branch switching.
Loving the …

As an aside: bucklescript (~reasonml) took a similar route by adopting ninja (https://github.com/BuckleScript/bucklescript/tree/master/vendor/ninja) and that seems to be working out well for them. I suspect this would require a pretty major overhaul of the code though (most of which would be removing code 👍), as we'd be ripping the essential compilation stuff out of https://github.com/purescript/purescript/blob/master/src/Language/PureScript/Make.hs and pluggin' it into shake instead.

On the assumption that switching to digest checking as things are will be easier, and that using shake proper needs some investigation, should we break this out into a separate issue/branch and tackle the current issue sans shake?
Thanks very much @joneshf, this is really interesting. I have a few thoughts:

Currently we implicitly depend on the behaviour that if a module is rebuilt then all of its downstream modules must be rebuilt too, in a few places, in the sense that information which is more than one hop away in the module dependency graph can leak through into the output artifacts. If we want to try to make things more granular, we will need to address this first. The first example which comes to mind is re-exports, as re-exports are currently always generated as coming from the module they were defined in (as opposed to the module they were locally imported from). If module …

At the moment, when a module gets rebuilt, it has access to the externs files for every module which it depends on (directly or transitively). If/when we have addressed the above, and to ensure we've addressed this properly once and for all, I think we should try to have our build system only provide externs for the direct dependencies of any given module when that module is being built; that way, we make it easier not to end up accidentally leaking information through the module graph transitively, which then makes it easier to provide fast and safe incremental builds.

None of the above should be necessary to worry about for a first pass though, as I guess the plan should be to start off by replicating the current level of granularity in Shake first, and only later trying to make it smarter.

One thing which is good about the current system is that we keep the contents of the externs files in memory rather than having to read and parse them each time for each module we compile; I would very much like to retain this property if we switch to Shake, although presumably that shouldn't be too hard to achieve?

I am happy to co-opt this issue to make it about Shake. If we change course later we can always open new issues.
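To make the direct-versus-transitive distinction above concrete, here is a minimal, self-contained sketch (toy module names and types, not the compiler's real `ModuleGraph` API) that computes both sets from a map of direct imports:

```haskell
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

type ModuleName = String
type Graph = Map.Map ModuleName [ModuleName]

-- Modules a given module imports directly.
directDeps :: Graph -> ModuleName -> [ModuleName]
directDeps g m = Map.findWithDefault [] m g

-- Everything reachable through the import graph (what modules
-- currently get externs for when they are rebuilt).
transitiveDeps :: Graph -> ModuleName -> Set.Set ModuleName
transitiveDeps g = go Set.empty . directDeps g
  where
    go seen [] = seen
    go seen (m:ms)
      | m `Set.member` seen = go seen ms
      | otherwise = go (Set.insert m seen) (directDeps g m ++ ms)

exampleGraph :: Graph
exampleGraph = Map.fromList
  [ ("Main", ["Data.Maybe"])
  , ("Data.Maybe", ["Prelude"])
  , ("Prelude", [])
  ]

main :: IO ()
main = do
  print (directDeps exampleGraph "Main")                     -- ["Data.Maybe"]
  print (Set.toAscList (transitiveDeps exampleGraph "Main")) -- ["Data.Maybe","Prelude"]
```

Only handing a module the `directDeps` set at build time is what would stop information leaking through more than one hop of the graph.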
I would also like to see content-based rebuilds ASAP. I wouldn't consider switching to a content-based system a breaking change, as I don't think the details of how the compiler determines that something needs rebuilding should be considered part of the compiler's public API.

Oh also: having been looking into #3503 recently, all of this is fairly fresh in my mind, so when I get a moment I will have a go at coming up with a simple explanation of which files depend on which other ones during a build.
Sweet! I thought that was going to make things easier, but it looks like we end up doing pretty much the same work:

```haskell
"output/*/index.js" %> \out -> do
  let corefn = Development.Shake.FilePath.replaceFileName out "corefn.json"
      foreignJs = Development.Shake.FilePath.replaceFileName out "foreign.js" -- `foreign` is a reserved word
      name = extractModule out
  case Data.Map.lookup name sources of
    Just source -> do
      exists <- Development.Shake.doesFileExist (source -<.> "js")
      let dependencies = if exists then [corefn, foreignJs] else [corefn, source]
      Development.Shake.need dependencies
      _produceTheIndexFile corefn
    Nothing -> Control.Monad.Fail.fail (_missingSource name)
```

That would mean that we have a direct dependency on …
Those both seem fairly straightforward as well: add the rules for each, and change the …

```haskell
rules ::
  Language.PureScript.ModuleGraph ->
  Data.Map.Map Language.PureScript.ModuleName FilePath ->
  [Language.PureScript.CodegenTarget] ->
  Development.Shake.Rules ()
rules graph sources targets = do
  "output/*/corefn.json" %> \out -> do
    let externs = Development.Shake.FilePath.replaceFileName out "externs.json"
        name = extractModule out
    case Data.Map.lookup name sources of
      Just source -> do
        Development.Shake.need [externs, source]
        _produceTheCoreFnFile source
      Nothing -> Control.Monad.Fail.fail (_missingSource name)

  ...

  "output/*/index.js" %> \out -> do
    let externs = Development.Shake.FilePath.replaceFileName out "externs.json"
        foreignJs = Development.Shake.FilePath.replaceFileName out "foreign.js" -- `foreign` is a reserved word
        name = extractModule out
    case Data.Map.lookup name sources of
      Just source -> do
        exists <- Development.Shake.doesFileExist (source -<.> "js")
        let dependencies = if exists then [externs, foreignJs, source] else [externs, source]
        Development.Shake.need dependencies
        if Language.PureScript.JSSourceMap `elem` targets
          then _produceTheIndexFileForSourceMap source
          else _produceTheIndexFile source
      Nothing -> Control.Monad.Fail.fail (_missingSource name)

  "output/*/index.js.map" %> \out -> do
    let externs = Development.Shake.FilePath.replaceFileName out "externs.json"
        name = extractModule out
    case Data.Map.lookup name sources of
      Just source -> do
        Development.Shake.need [externs, source]
        _produceTheSourceMap source
      Nothing -> Control.Monad.Fail.fail (_missingSource name)

run :: [FilePath] -> [Language.PureScript.CodegenTarget] -> IO ()
run inputs targets = do
  sources <- _parseModuleMap inputs
  graph <- _parseImports sources
  Development.Shake.shake options $ do
    want $ do
      name <- Data.Map.keys sources
      for targets $ \case
        Language.PureScript.CoreFn -> toCoreFn name
        Language.PureScript.JS -> toIndex name
        Language.PureScript.JSSourceMap -> toSourceMap name
    rules graph sources targets
  where
    ...

    toCoreFn :: Language.PureScript.ModuleName -> FilePath
    toCoreFn name =
      "output"
        </> Data.Text.unpack (Language.PureScript.runModuleName name)
        </> "corefn.json"

    toSourceMap :: Language.PureScript.ModuleName -> FilePath
    toSourceMap name =
      "output"
        </> Data.Text.unpack (Language.PureScript.runModuleName name)
        </> "index.js.map"
```

I'm not entirely sure we would have to change anything with the way the rest of the compiler works. In my head, I'm seeing it as translating the dependency information inside the compiler into …

The scenario you described seems like it can be implemented with oracles. The basic idea behind oracles is that things other than plain text files can provide build information. Rather than require all information be stored in files (like …
I'm of a few thoughts here. The first is that doing a single pass is almost surely more efficient for a clean build. The second is that keeping information in memory intuitively seems more efficient than multiple passes where we re-parse. The third is that it's very unclear to me how bad multiple passes would be after that first clean build.

Keeping information in memory goes back to using oracles. We might change the calls above from …

As for multiple passes after the first clean build, there are cases where it's not clear to me (without implementing it) that having information in memory is a big enough improvement in practice over doing the naive thing and having a simpler implementation. That first clean build seems like it'll be atrocious. I'm definitely not against being smarter, but I think a …

Some cases off the top of my head:
All of this to say, I'm down for having information in memory, but there's a chance it wouldn't be terrible to be naive.
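As a rough illustration of the "keep information in memory" idea (this is not Shake's actual oracle API — just a self-contained sketch using only base and containers, with hypothetical names), a query function can cache parsed externs so each one is loaded at most once per build:

```haskell
import Data.IORef
import qualified Data.Map.Strict as Map

type ModuleName = String
type Externs = String  -- stand-in for a parsed externs file

-- Wrap a (possibly expensive) loader so repeated queries for the same
-- module are answered from memory instead of re-reading and re-parsing.
mkExternsOracle :: (ModuleName -> IO Externs) -> IO (ModuleName -> IO Externs)
mkExternsOracle load = do
  cache <- newIORef Map.empty
  pure $ \name -> do
    seen <- readIORef cache
    case Map.lookup name seen of
      Just externs -> pure externs
      Nothing -> do
        externs <- load name
        modifyIORef' cache (Map.insert name externs)
        pure externs

main :: IO ()
main = do
  loads <- newIORef (0 :: Int)
  ask <- mkExternsOracle $ \name -> do
    modifyIORef' loads (+ 1)  -- pretend this is an expensive parse
    pure ("externs for " ++ name)
  _ <- ask "Data.Maybe"
  _ <- ask "Data.Maybe"  -- second query is served from memory
  n <- readIORef loads
  print n                -- the loader ran only once
```

In Shake terms the cache would live behind an oracle rather than an `IORef`, but the shape of the trade-off (memory versus repeated parsing passes) is the same.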
Oh dang, I didn't look at how coupled the …

Fixing the issue (not assuming monotonically increasing timestamps are safe) does seem easier than separating out the build system parts of the …

FWIW, I don't think we actually need to go all the way to digests to fix this issue. Timestamps can still work fine so long as we don't assume older input stamps necessarily mean newer output stamps are safe. We could create a database from output file names to the timestamp of their inputs when that file was built. If the timestamps are different, we need to recreate the output. It doesn't matter if the input timestamp is newer or older; what matters is whether it's different from what we last built. The database could be a single JSON file like:

```json
{
  "output/Data.Unit/externs.json": 1557758992,
  "output/Data.Void/externs.json": 1557759007
}
```

We'd write to it in the codegen, change this part of the …

We would save the complexity of having to hash things, still have timestamps as our primitive (so it's simple), but no longer have to deal with …

I do still think having digests is the way forward. If someone feels that we're in for a pound if we're changing this, then going to digests would also solve the problem. But there's also a simpler fix to one of the major pain points of PS.
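The "different, not newer" check behind that database fits in a few lines. A minimal sketch (timestamps as plain integers, all names hypothetical):

```haskell
import qualified Data.Map.Strict as Map

type Timestamp = Integer
type CacheDb = Map.Map FilePath Timestamp

-- Rebuild whenever the current input timestamp differs from the one
-- recorded at the last successful build; a *different but older*
-- timestamp (e.g. after a branch switch) still triggers a rebuild.
needsRebuild :: CacheDb -> FilePath -> Timestamp -> Bool
needsRebuild db path now =
  case Map.lookup path db of
    Nothing -> True          -- never built before
    Just recorded -> recorded /= now

exampleDb :: CacheDb
exampleDb = Map.fromList [("output/Data.Unit/externs.json", 1557758992)]

main :: IO ()
main = do
  print (needsRebuild exampleDb "output/Data.Unit/externs.json" 1557758992)  -- False: unchanged
  print (needsRebuild exampleDb "output/Data.Unit/externs.json" 1557750000)  -- True: older, but different
```

The key design choice is the inequality: `/=` rather than `>`, which is exactly what the naive "input newer than output" comparison gets wrong.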
@joneshf you've proper nerd-sniped me with this one. I'm playing around with plugging the compiler internals into shake over at https://github.com/jmackie/purescript-shake. Initially I just want to be able to generate corefn, and if that works nicely then we can think about extending to other targets. Your input would be much appreciated as I'm new to shake!
That's awesome! I'll take a closer look later, but some things pop out for me:
purescript/purescript#3145 (comment) Only now things don't work and I dunno why...
The grouping of …

That rule you mention does look very dodgy. Shake basically runs the rule and then sees what its value is. If you use the source on the left of …

Also happy to answer questions, as @joneshf says.
Having read the full thread, you might be interested in what we did with DAML. We wrote the compiler as a Shake build system without persistence (we had a slightly odd use case) and turned it into both a compiler and an IDE with almost no changes. We store everything in memory forever (partly because we can't serialise all the steps...). You can see the actual IDE pieces in this dir and the Shake wrapper is here. I would suggest focusing on replicating the current state with Shake first before going in any other directions though. |
Instead of having two separate MVars for build job results and errors, just have one, which contains a sum type, to indicate if and how a build job has completed with a little more clarity and safety (in the sense that this makes some invalid states unrepresentable). Additionally, rather than having two separate functions for consuming the result of a build plan, namely `collectErrors` and `collectResults`, and requiring that the first is called before the second, unify them both into `collectResults`. This will help for #3145, as for that I need the BuildPlan to be able to expose which build jobs succeeded before their errors are rethrown, so that we can store their timestamps and hashes in preparation for the next build. I haven't done any performance tests on this just yet, but I don't anticipate any drastic changes.
Fixes #3145. The current build cache invalidation algorithm compares the input timestamps to the output timestamps, and only triggers a rebuild if the input timestamps are newer than the output timestamps. However, this does not appear to be sufficient: as discussed in #3145, we think that the reason that doing `rm -r output` often fixes weird compile errors is that we should really be considering the input file to have changed if its timestamp is _different_ to what it was at the last successful build, regardless of whether it is before or after the output timestamp. Essentially, timestamps on input files can't be trusted to the extent that we do for cache invalidation, because of things like switching between different versions of dependencies or switching branches; sometimes you can have an input file's contents and timestamp both change, but have the timestamp still be older than the output timestamp. This commit implements a slightly different cache invalidation algorithm, where we make a note of the timestamps of all input files at the start of each build, and we consider files to have changed in subsequent builds if their input timestamps have changed at all (regardless of whether the new input timestamps are before or after the output timestamps). The timestamps are stored in a json file `cache-db.json` in the output directory; I also considered putting the timestamps in the externs files, but I think having them stored separately is preferable because then we don't have to update the module's externs file if its input file timestamp changes but its hash doesn't, which means that we don't force a rebuild for downstream modules. As an additional enhancement, we also make note of file content hashes and store them in the `cache-db.json` file. On subsequent builds, if timestamps have changed, we compare the previous hash to the new hash, and if they are identical, we can skip rebuilding the module. This means that e.g. 
touching a file no longer forces a rebuild. Note that we only compute hashes in the case where timestamps differ to avoid doing extra unnecessary work. This scheme of checking timestamps and then hashes was inspired by Shake, which provides this mechanism as one of its options for Change; see #3145 (comment) I've also added some tests so that we can make changes to this part of the compiler a little more confidently. I'm using the latest version of `these` (which is not in our Stack snapshot) because it doesn't incur a `lens` dependency, whereas earlier versions do.
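The timestamp-then-hash scheme described above can be sketched compactly. This is a toy illustration, not the commit's actual code: `toyHash` (a simple polynomial hash) stands in for a real content hash, and the record fields are hypothetical names:

```haskell
import Data.Char (ord)

data CacheEntry = CacheEntry
  { cachedTimestamp :: Integer  -- input timestamp at last successful build
  , cachedHash :: Int           -- content hash at last successful build
  }

-- Toy stand-in for a real cryptographic/content hash.
toyHash :: String -> Int
toyHash = foldl (\acc c -> acc * 31 + ord c) 0

-- Only compute a hash when the timestamp has changed; if the hash is
-- unchanged, skip the rebuild (so `touch` no longer forces one).
needsRebuild :: CacheEntry -> Integer -> String -> Bool
needsRebuild entry now contents
  | cachedTimestamp entry == now = False                 -- timestamp unchanged: up to date
  | otherwise = toyHash contents /= cachedHash entry     -- hash only when needed

exampleEntry :: CacheEntry
exampleEntry = CacheEntry 1557758992 (toyHash "module A where")

main :: IO ()
main = do
  print (needsRebuild exampleEntry 1557759999 "module A where")         -- False: touched, same contents
  print (needsRebuild exampleEntry 1557759999 "module A where\nx = 1")  -- True: contents changed
```

This mirrors Shake's cheap-check-first ordering: the expensive hash is computed only on the minority of files whose timestamps moved.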
Hello!
I believe the compiler currently looks at file modification times in order to determine whether a file needs to be recompiled. If we altered the compiler to instead look at a hash of the file's contents, it would be a little better at determining whether the work needs to be done.
This would benefit CI systems where previously compiled modules have been cached and thus have an older timestamp while being up-to-date with the source file.
This feature came up from a chat on Slack RE build speed on CI :)
Thanks,
Louis