Dependency Analysis

benmachine edited this page Mar 20, 2013 · 1 revision

Imported from Trac wiki; be wary of outdated information or markup mishaps.

Please put your ideas & discussion about Cabal dependency analysis here.

Neil's (ndm) make idea

data Rule {
  produces :: [FilePath],
  requires :: [FilePath],
  dynRequires :: Maybe (IO [FilePath]),
  action :: IO ()

This is a slight extension of Neils existing make code. In particular the dynRequires.

The semantics are that to generate a set of particular target files we ensure that all the static dependencies are up to date, then if there are any dynamic dependencies we check those are up to date too and finally run the action.

The need for dynamic dependencies is for automatically generating deps by reading a file to look for imports.

So for example:

Two source files: Foo.hs, Bar.hs. Module Foo imports Bar.

So we have some static and dynamic deps, we generate the dynamic deps and put them in a .dep file (or we bundle them alltogether in an appropriate cache file. It doesn't matter for the abstract setup.)

Foo.hs.dep -- contains deps of Foo as a [FilePath] Bar.hs.dep -- similarly

Foo.hs.dep is generated from Foo.hs

So we'd actually have 4 Rules:

foo_hs_dep = Rule { 
    requires = ["Foo.hs"],
    produces = ["Foo.hs.dep"],
    action = generateDepsFor "Foo.hs",
    dynRequires = Nothing

foo_hs = Rule {
  requires = ["Foo.hs", "Foo.hs.dep"],
  produces = ["Foo.o", "Foo.hi"]
  action = compile "Foo.hs"
  dynRequires = Just (readCachedDepsFrom "Foo.hs.dep")

Interleaving dep chasing and building

It is impossible to generate a complete dependency graph in the presence of pre-processors without running some pre-processors.

For example, supposing our ultimate target is Foo.o & Foo.hi. We would generate those from Foo.hs, which might be generated from Foo.hs.cpp. However we cannot find the full set of dependencies for Foo.o & Foo.hi until Foo.hs exists since we need to read that file to see what all the import statements are. This is not possible until we have run cpp to generate Foo.hs from Foo.hs.cpp. At that point we may discover that Foo.o depends on Bar.hi if Foo imports Bar.

Further more, there is no guarantee that we actually know about Bar.hs (it may not be an exposed module) so we're not even in the situation where we only add additional edges between existing nodes in the build graph, we may have to add additional nodes too.

So the question is, what is a good interface between the dependency chaser and a make layer.

I agree that having the dependency generation and make layers separate would be a good thing, but I don't know if that's 100% feasible. Consider that the only time we will have a complete dependency graph is when the build is nearly done. So, I don't really think we should try to build the graph at all (at least, in the direct sense).

Here's the pseudocode for the algorithm I have in my head:

for every target file T:
  find the rule chain R that will build the file. (there should only be one)
  starting from the final entry in the rule chain (e.g. the .chs.pp->.chs rule) and working backwards:
    call the current source file A, and the target file B. (e.g. A=.chs.pp, B=.chs)
    if B is older than A:
        for each file D that is needed to build B from A:
            if D is already on the stack, bail out
            otherwise, recurse on D
        compile B

Of course, other things are needed, but I think this is a good starting point.

So, after trying to code this, I finding the rule chain has a runtime worst case O(numRules^2^) unless you cache the results as a dependency graph. I guess you were right then, Duncan :)

So, anyhow, I think I'm going to thread the depgraph in a monad. I figure

type DepGraph = Map FilePath (FilePath, Rule)

should suffice, and

type DepGraphT a = StateT DepGraphT IO a

will do the trick.

Generalize Dependencies

If we want to cache the result of things like finding the rule chain (as above) or deal correctly with searchpath shadowing, we need to track how the algorithm depended on the environment so that we can see if the result is still valid in the current one. This means that they are part of the dependency graph as well.

This introduces new kinds of nodes, e.g. both the mentioned algorithms need to test if a file exists, which is different from depending on a file's content, so FilePath is no more a proper representation for nodes.

data Target = File FilePath | FileExist FilePath | ...

We can also consider that we don't need .dep files to be stored on the filesystem, but they can be more efficiently stored in a dictionary serialized between runs.

... | DictEntry Name Content | ...

A discriminated union like this is not very extensible but it's probably the simplest approach, especially in h98.

Determine if a target is up to date

Considering only files we can solve this by checking if target T is older than all its dependencies, using timestamps.

If T depends on d = FileExist "foo.hs" instead, we have a problem since we can't assing a sensibile timestamp to d, what really matters is if the value computed for that dependency has changed or not wrt when we built T, so we can test this condition by storing the old value (that we'll call the representation, in this case a Bool) and compare it to the current one.

If all the cached representations are equal to the ones in the current environment for both the dependencies and the target, then we don't have to build T again, since the old one is still valid.

This assumes that we consider as valid the representations of the dependecies. So they must have already passed this test or have been recomputed, which leads to traversing the graph top-down, from sources to sinks.

If some representation is missing than we just consider T not valid and build it. Missing the current representation for dependencies is a bit anomalous since the algorithm should have produced them earlier, but concurrent user interaction could cause this.

This works nicely also for plain files: storing the whole content and comparing it would be exactly the correct condition to check, but it's too expensive, so we use timestamps as representations for files as before, to keep the comparison cheap.

We get something that is a little more accurate than the simple "T older than D" check, since just touching T won't work as the timestamp won't coincide.

It also means that we can't make use of targets for which we don't have a cached representation without rerunning the action to produce them, but this shouldn't be a problem.

It's important that the core make algorithm doesn't care about the internal structure of Targets or Representations, as long as it can retrieve the old and current representations for a target and compare them for equality.

Some Problems

Two conflicting needs:

  • If we implement the check as stated above, we end up performing the same comparisons multiple times for the same target, once per dependent Rule or Target, it would be preferable to share it for every dependency instead.

  • To give some consistency guarantees wrt concurrent extenral modification of the environment instead, i.e. that in a single build for every T we always use the same value for it, we must check that the dependencies are still in the same state as the last time we used them, everytime we run an action.