Use content hashes to determine whether a file needs to be recompiled #3145
Comments
Then you'd have to store the latest hashes somewhere, right? The nice thing about the timestamp is that it's automatically stored in the file metadata.

Yes, we'd have to keep them with the compiled modules.

Yeah, the …

Suggest making this optional (e.g. via a flag), because this seems to be an issue only when extra tooling is involved (e.g. CI), and the timestamp-based approach (file-system metadata), as opposed to dedicated explicit computing, managing, and comparing of hashes, works pleasantly fast and sufficiently correctly for the "no-frills, no-integrations local bare dev-env" cases. =)

We don't really do flags, except in exceptional circumstances. It tends to increase the number of ways things can go wrong.

I guess I'd be fine with this change, but the hash would have to take into account all of the dependencies. It just doesn't seem worth it to me. Edit: to clarify, not worth it for a small speed improvement at the risk of breaking one of the most fragile bits of the compiler when we only recently got it working properly.
I don't think speed was the motivation; more that the dates still have hiccups sometimes.

Speed was actually my concern, but it'd be interesting to compare.
Why would a hash need to take dependencies into account? Timestamps can't take changed dependencies into account, surely, so it's not clear to me why hashes would need to. |
I'd quite like to go ahead and give this a go. As with @garyb, I think it's best to consider this change as a bug fix rather than a performance boost: having file modification times change, e.g. by switching branches in your local working directory, can make the compiler think that an output directory is up to date when it isn't, causing compile errors, and the fix is to …

I tend to agree; it becomes habitual, so you forget that it's a thing until a newcomer starts asking questions when they encounter something that requires it.
Re: the conversation in #3635, I'd like to make a proposal for using Shake.
I believe so.
I don't think so. The only thing would be dealing with codegen options (corefn, sourcemaps). I personally would really like to see content-hash support ASAP. At work we do a lot of unnecessary recompilation due to branch switching.
Loving the …

As an aside: bucklescript (~reasonml) took a similar route by adopting ninja (https://github.com/BuckleScript/bucklescript/tree/master/vendor/ninja) and that seems to be working out well for them. I suspect this would require a pretty major overhaul of the code though (most of which would be removing code 👍), as we'd be ripping the essential compilation stuff out of https://github.com/purescript/purescript/blob/master/src/Language/PureScript/Make.hs and pluggin' it into shake instead.

On the assumption that switching to digest checking as things are will be easier, and that using shake proper needs some investigation, should we break this out into a separate issue/branch and tackle the current issue sans shake?
Thanks very much @joneshf, this is really interesting. I have a few thoughts:

Currently we implicitly depend on the behaviour that if a module is rebuilt then all of its downstream modules must be rebuilt too, in a few places, in the sense that information which is more than one hop away in the module dependency graph can leak through into the output artifacts. If we want to try to make things more granular, we will need to address this first. The first example which comes to mind is re-exports, as re-exports are currently always generated as coming from the module they were defined in (as opposed to the module they were locally imported from). If module …

At the moment, when a module gets rebuilt, it has access to the externs files for every module which it depends on (directly or transitively). If/when we have addressed the above, and to ensure we've addressed this properly once and for all, I think we should try to have our build system only provide externs for the direct dependencies of any given module when that module is being built; that way, we make it easier not to end up accidentally leaking information through the module graph transitively, which then makes it easier to provide fast and safe incremental builds.

None of the above should be necessary to worry about for a first pass though, as I guess the plan should be to start off by replicating the current level of granularity in Shake first, and only later trying to make it smarter.

One thing which is good about the current system is that we keep the contents of the externs files in memory rather than having to read and parse them each time for each module we compile; I would very much like to retain this property if we switch to Shake, although presumably that shouldn't be too hard to achieve?

I am happy to co-opt this issue to make it about Shake. If we change course later we can always open new issues.
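To make the direct-versus-transitive distinction above concrete, here is a minimal, self-contained sketch (toy module names and types, not the compiler's real `ModuleGraph` API) that computes both sets from a map of direct imports:

```haskell
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

type ModuleName = String
type Graph = Map.Map ModuleName [ModuleName]

-- Modules a given module imports directly.
directDeps :: Graph -> ModuleName -> [ModuleName]
directDeps g m = Map.findWithDefault [] m g

-- Everything reachable through the import graph (what modules
-- currently get externs for when they are rebuilt).
transitiveDeps :: Graph -> ModuleName -> Set.Set ModuleName
transitiveDeps g = go Set.empty . directDeps g
  where
    go seen [] = seen
    go seen (m:ms)
      | m `Set.member` seen = go seen ms
      | otherwise = go (Set.insert m seen) (directDeps g m ++ ms)

exampleGraph :: Graph
exampleGraph = Map.fromList
  [ ("Main", ["Data.Maybe"])
  , ("Data.Maybe", ["Prelude"])
  , ("Prelude", [])
  ]

main :: IO ()
main = do
  print (directDeps exampleGraph "Main")                     -- ["Data.Maybe"]
  print (Set.toAscList (transitiveDeps exampleGraph "Main")) -- ["Data.Maybe","Prelude"]
```

Only handing a module the `directDeps` set at build time is what would stop information leaking through more than one hop of the graph.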
I would also like to see content-based rebuilds ASAP. I wouldn't consider switching to a content-based system a breaking change, as I don't think the details of how the compiler determines that something needs rebuilding should be considered part of the compiler's public API.

Oh also: having been looking into #3503 recently, all of this is fairly fresh in my mind, so when I get a moment I will have a go at coming up with a simple explanation of which files depend on which other ones during a build.
Sweet! I thought that was going to make things easier, but it looks like we end up doing pretty much the same work:

```haskell
"output/*/index.js" %> \out -> do
  let corefn = Development.Shake.FilePath.replaceFileName out "corefn.json"
      foreignJs = Development.Shake.FilePath.replaceFileName out "foreign.js" -- `foreign` is a reserved word
      name = extractModule out
  case Data.Map.lookup name sources of
    Just source -> do
      exists <- Development.Shake.doesFileExist (source -<.> "js")
      let dependencies = if exists then [corefn, foreignJs] else [corefn, source]
      Development.Shake.need dependencies
      _produceTheIndexFile corefn
    Nothing -> Control.Monad.Fail.fail (_missingSource name)
```

That would mean that we have a direct dependency on …
Those both seem fairly straightforward as well: add the rules for each, and change the …

```haskell
rules ::
  Language.PureScript.ModuleGraph ->
  Data.Map.Map Language.PureScript.ModuleName FilePath ->
  [Language.PureScript.CodegenTarget] ->
  Development.Shake.Rules ()
rules graph sources targets = do
  "output/*/corefn.json" %> \out -> do
    let externs = Development.Shake.FilePath.replaceFileName out "externs.json"
        name = extractModule out
    case Data.Map.lookup name sources of
      Just source -> do
        Development.Shake.need [externs, source]
        _produceTheCoreFnFile source
      Nothing -> Control.Monad.Fail.fail (_missingSource name)

  ...

  "output/*/index.js" %> \out -> do
    let externs = Development.Shake.FilePath.replaceFileName out "externs.json"
        foreignJs = Development.Shake.FilePath.replaceFileName out "foreign.js" -- `foreign` is a reserved word
        name = extractModule out
    case Data.Map.lookup name sources of
      Just source -> do
        exists <- Development.Shake.doesFileExist (source -<.> "js")
        let dependencies = if exists then [externs, foreignJs, source] else [externs, source]
        Development.Shake.need dependencies
        if Language.PureScript.JSSourceMap `elem` targets
          then _produceTheIndexFileForSourceMap source
          else _produceTheIndexFile source
      Nothing -> Control.Monad.Fail.fail (_missingSource name)

  "output/*/index.js.map" %> \out -> do
    let externs = Development.Shake.FilePath.replaceFileName out "externs.json"
        name = extractModule out
    case Data.Map.lookup name sources of
      Just source -> do
        Development.Shake.need [externs, source]
        _produceTheSourceMap source
      Nothing -> Control.Monad.Fail.fail (_missingSource name)

run :: [FilePath] -> [Language.PureScript.CodegenTarget] -> IO ()
run inputs targets = do
  sources <- _parseModuleMap inputs
  graph <- _parseImports sources
  Development.Shake.shake options $ do
    want $ do
      name <- Data.Map.keys sources
      for targets $ \case
        Language.PureScript.CoreFn -> toCoreFn name
        Language.PureScript.JS -> toIndex name
        Language.PureScript.JSSourceMap -> toSourceMap name
    rules graph sources targets
  where
    ...

    toCoreFn :: Language.PureScript.ModuleName -> FilePath
    toCoreFn name =
      "output"
        </> Data.Text.unpack (Language.PureScript.runModuleName name)
        </> "corefn.json"

    toSourceMap :: Language.PureScript.ModuleName -> FilePath
    toSourceMap name =
      "output"
        </> Data.Text.unpack (Language.PureScript.runModuleName name)
        </> "index.js.map"
```

I'm not entirely sure we would have to change anything with the way the rest of the compiler works. In my head, I'm seeing it as translating the dependency information inside the compiler into …

The scenario you described seems like it can be implemented with oracles. The basic idea behind oracles is that things other than plain text files can provide build information. Rather than require all information be stored in files (like …
I'm of a few thoughts here. The first is that doing a single pass is almost surely more efficient for a clean build. The second is that keeping information in memory intuitively seems more efficient than multiple passes where we re-parse. The third is that it's very unclear to me how bad multiple passes would be after that first clean build.

Keeping information in memory goes back to using oracles. We might change the calls above from …

As for multiple passes after the first clean build, there are cases where it's not clear to me (without implementing it) that having information in memory is a big enough improvement in practice over doing the naive thing and having a simpler implementation. That first clean build seems like it'll be atrocious. I'm definitely not against being smarter, but I think a …

Some cases off the top of my head:
All of this to say, I'm down for having information in memory, but there's a chance it wouldn't be terrible to be naive.
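As a rough illustration of the "keep information in memory" idea (this is not Shake's actual oracle API — just a self-contained sketch using only base and containers, with hypothetical names), a query function can cache parsed externs so each one is loaded at most once per build:

```haskell
import Data.IORef
import qualified Data.Map.Strict as Map

type ModuleName = String
type Externs = String  -- stand-in for a parsed externs file

-- Wrap a (possibly expensive) loader so repeated queries for the same
-- module are answered from memory instead of re-reading and re-parsing.
mkExternsOracle :: (ModuleName -> IO Externs) -> IO (ModuleName -> IO Externs)
mkExternsOracle load = do
  cache <- newIORef Map.empty
  pure $ \name -> do
    seen <- readIORef cache
    case Map.lookup name seen of
      Just externs -> pure externs
      Nothing -> do
        externs <- load name
        modifyIORef' cache (Map.insert name externs)
        pure externs

main :: IO ()
main = do
  loads <- newIORef (0 :: Int)
  ask <- mkExternsOracle $ \name -> do
    modifyIORef' loads (+ 1)  -- pretend this is an expensive parse
    pure ("externs for " ++ name)
  _ <- ask "Data.Maybe"
  _ <- ask "Data.Maybe"  -- second query is served from memory
  n <- readIORef loads
  print n                -- the loader ran only once
```

In Shake terms the cache would live behind an oracle rather than an `IORef`, but the shape of the trade-off (memory versus repeated parsing passes) is the same.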
Oh dang, I didn't look at how coupled the …

Fixing the issue (not assuming monotonically increasing timestamps are safe) does seem easier than separating out the build system parts of the …

FWIW, I don't think we actually need to go all the way to digests to fix this issue. Timestamps can still work fine so long as we don't assume older input stamps necessarily mean newer output stamps are safe. We could create a database from output file names to the timestamp of their inputs when that file was built. If the timestamps are different, we need to recreate the output. It doesn't matter if the input timestamp is newer or older; what matters is whether it's different from what we last built. The database could be a single JSON file like:

```json
{
  "output/Data.Unit/externs.json": 1557758992,
  "output/Data.Void/externs.json": 1557759007
}
```

We'd write to it in the codegen, change this part of the …

We would save the complexity of having to hash things, still have timestamps as our primitive (so it's simple), but no longer have to deal with …

I do still think having digests is the way forward. If someone feels that we're in for a pound if we're changing this, then going to digests would also solve the problem. But there's also a simpler fix to one of the major pain points of PS.
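The "different, not newer" check behind that database fits in a few lines. A minimal sketch (timestamps as plain integers, all names hypothetical):

```haskell
import qualified Data.Map.Strict as Map

type Timestamp = Integer
type CacheDb = Map.Map FilePath Timestamp

-- Rebuild whenever the current input timestamp differs from the one
-- recorded at the last successful build; a *different but older*
-- timestamp (e.g. after a branch switch) still triggers a rebuild.
needsRebuild :: CacheDb -> FilePath -> Timestamp -> Bool
needsRebuild db path now =
  case Map.lookup path db of
    Nothing -> True          -- never built before
    Just recorded -> recorded /= now

exampleDb :: CacheDb
exampleDb = Map.fromList [("output/Data.Unit/externs.json", 1557758992)]

main :: IO ()
main = do
  print (needsRebuild exampleDb "output/Data.Unit/externs.json" 1557758992)  -- False: unchanged
  print (needsRebuild exampleDb "output/Data.Unit/externs.json" 1557750000)  -- True: older, but different
```

The key design choice is the inequality: `/=` rather than `>`, which is exactly what the naive "input newer than output" comparison gets wrong.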
@joneshf you've proper nerd-sniped me with this one. I'm playing around with plugging the compiler internals into shake over at https://github.com/jmackie/purescript-shake. Initially I just want to be able to generate corefn, and if that works nicely then we can think about extending to other targets. Your input would be much appreciated as I'm new to shake!
That's awesome! I'll take a closer look later, but some things pop out for me:
purescript/purescript#3145 (comment) Only now things don't work and I dunno why...
The grouping of …

That rule you mention does look very dodgy. Shake basically runs the rule and then sees what its value is. If you use the source on the left of …

Also happy to answer questions, as @joneshf says.
Having read the full thread, you might be interested in what we did with DAML. We wrote the compiler as a Shake build system without persistence (we had a slightly odd use case) and turned it into both a compiler and an IDE with almost no changes. We store everything in memory forever (partly because we can't serialise all the steps...). You can see the actual IDE pieces in this dir and the Shake wrapper is here. I would suggest focusing on replicating the current state with Shake first before going in any other directions though. |
Instead of having two separate MVars for build job results and errors, just have one, which contains a sum type, to indicate if and how a build job has completed with a little more clarity and safety (in the sense that this makes some invalid states unrepresentable). Additionally, rather than having two separate functions for consuming the result of a build plan, namely `collectErrors` and `collectResults`, and requiring that the first is called before the second, unify them both into `collectResults`. This will help for #3145, as for that I need the BuildPlan to be able to expose which build jobs succeeded before their errors are rethrown, so that we can store their timestamps and hashes in preparation for the next build. I haven't done any performance tests on this just yet, but I don't anticipate any drastic changes.
Fixes #3145. The current build cache invalidation algorithm compares the input timestamps to the output timestamps, and only triggers a rebuild if the input timestamps are newer than the output timestamps. However, this does not appear to be sufficient: as discussed in #3145, we think that the reason that doing `rm -r output` often fixes weird compile errors is that we should really be considering the input file to have changed if its timestamp is _different_ to what it was at the last successful build, regardless of whether it is before or after the output timestamp. Essentially, timestamps on input files can't be trusted to the extent that we do for cache invalidation, because of things like switching between different versions of dependencies or switching branches; sometimes you can have an input file's contents and timestamp both change, but have the timestamp still be older than the output timestamp. This commit implements a slightly different cache invalidation algorithm, where we make a note of the timestamps of all input files at the start of each build, and we consider files to have changed in subsequent builds if their input timestamps have changed at all (regardless of whether the new input timestamps are before or after the output timestamps). The timestamps are stored in a json file `cache-db.json` in the output directory; I also considered putting the timestamps in the externs files, but I think having them stored separately is preferable because then we don't have to update the module's externs file if its input file timestamp changes but its hash doesn't, which means that we don't force a rebuild for downstream modules. As an additional enhancement, we also make note of file content hashes and store them in the `cache-db.json` file. On subsequent builds, if timestamps have changed, we compare the previous hash to the new hash, and if they are identical, we can skip rebuilding the module. This means that e.g. 
touching a file no longer forces a rebuild. Note that we only compute hashes in the case where timestamps differ to avoid doing extra unnecessary work. This scheme of checking timestamps and then hashes was inspired by Shake, which provides this mechanism as one of its options for Change; see #3145 (comment) I've also added some tests so that we can make changes to this part of the compiler a little more confidently. I'm using the latest version of `these` (which is not in our Stack snapshot) because it doesn't incur a `lens` dependency, whereas earlier versions do.
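The timestamp-then-hash scheme described above can be sketched compactly. This is a toy illustration, not the commit's actual code: `toyHash` (a simple polynomial hash) stands in for a real content hash, and the record fields are hypothetical names:

```haskell
import Data.Char (ord)

data CacheEntry = CacheEntry
  { cachedTimestamp :: Integer  -- input timestamp at last successful build
  , cachedHash :: Int           -- content hash at last successful build
  }

-- Toy stand-in for a real cryptographic/content hash.
toyHash :: String -> Int
toyHash = foldl (\acc c -> acc * 31 + ord c) 0

-- Only compute a hash when the timestamp has changed; if the hash is
-- unchanged, skip the rebuild (so `touch` no longer forces one).
needsRebuild :: CacheEntry -> Integer -> String -> Bool
needsRebuild entry now contents
  | cachedTimestamp entry == now = False                 -- timestamp unchanged: up to date
  | otherwise = toyHash contents /= cachedHash entry     -- hash only when needed

exampleEntry :: CacheEntry
exampleEntry = CacheEntry 1557758992 (toyHash "module A where")

main :: IO ()
main = do
  print (needsRebuild exampleEntry 1557759999 "module A where")         -- False: touched, same contents
  print (needsRebuild exampleEntry 1557759999 "module A where\nx = 1")  -- True: contents changed
```

This mirrors Shake's cheap-check-first ordering: the expensive hash is computed only on the minority of files whose timestamps moved.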
Hello!
I believe the compiler currently looks at file modification times in order to determine whether a file needs to be recompiled. If we altered the compiler to instead look at a hash of the file's contents, it would be a little better at determining whether the work needs to be done.
This would benefit CI systems where previously compiled modules have been cached and thus have an older timestamp while being up-to-date with the source file.
This feature came up from a chat on Slack RE build speed on CI :)
Thanks,
Louis