Incremental compiler notes

My (@gkossakowski) personal notes on Scala's incremental compiler implemented in sbt

Here are two links to sbt documentation that discuss the incremental compiler:

  • Understanding Incremental Compilation, which discusses the design goals of the incremental compiler and the existing algorithm, and mentions general challenges with incremental compilation in Scala
  • the api.specification file, which discusses details of the API extraction aspect of incremental compilation

This document should be cleaned up and merged with sbt's documentation eventually.

Overview

We assume throughout this document that compilation happens at the granularity of source files. In particular, even if only one method in a source file needs to be type checked as a result of other changes, the whole source file containing that method has to be type checked. That reflects the current design of the Scala compiler.

The goal of incremental compilation is to detect changes to source files or to the classpath and determine a small set of files to recompile, in such a way that the final result is identical to the result of a full, batch compilation. When reacting to changes the incremental compiler has two goals that are at odds with each other:

  • recompile as few source files as possible
  • cover all changes to type checking results and produced byte code triggered by the changed source files and/or classpath

The first goal is straightforward and is the sole reason the incremental compiler exists. The second goal sets a lower limit on the size of the set of recompiled files. Determining that set is the core problem the incremental compiler tries to solve. We'll dive a little bit into this problem in the overview to understand what makes implementing an incremental compiler a challenging task.

Let's consider this very simple example:

// A.scala
package a
class A {
  def foo(): Int = 12
}

// B.scala
package b
class B {
  def bar(x: a.A): Int = x.foo()
}

Let's assume both of those files are already compiled and the user changes A.scala so it looks like this:

// A.scala
package a
class A {
  def foo(): Int = 23 // changed constant
}

The first step of incremental compilation is to compile the modified source files. That's the minimal set of files the incremental compiler has to compile. The modified version of A.scala will compile successfully, as changing the constant doesn't introduce type checking errors. The next step of incremental compilation is determining whether the changes applied to A.scala may affect other files. In the example above only the constant returned by the method foo has changed, and that does not affect the compilation results of other files.

Let's consider another change to A.scala:

// A.scala
package a
class A {
  def foo(): String = "abc" // changed constant and return type
}

As before, the first step of incremental compilation is to compile the modified files. In this case we compile A.scala and compilation finishes successfully. The second step is again determining whether the changes to A.scala affect other files. We see that the return type of the public method foo has changed, so this might affect the compilation results of other files. Indeed, B.scala contains a call to the foo method so it has to be compiled in the second step. Compilation of B.scala will fail because of a type mismatch in the B.bar method and that error will be reported back to the user. That's where incremental compilation terminates in this case.

Let's identify the two main pieces of information that were needed to make decisions in the examples presented above. The incremental compiler algorithm needs to:

  • index source files so it knows whether there were API changes that might affect other source files; e.g. it needs to detect changes to method signatures as in the example above
  • track dependencies between source files; once a change to an API is detected the algorithm needs to determine the set of files that might be potentially affected by this change

Both of those pieces of information are extracted from the Scala compiler.
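
As a rough mental model, this information can be pictured with a couple of maps. The sketch below uses hypothetical, simplified structures (sbt's real Analysis data is much richer):

final case class Analysis(
  apiHash: Map[String, Long],             // source file -> hash of its extracted public API
  dependencies: Map[String, Set[String]]  // source file -> source files it depends on
)

// After recompiling a file, its dependents need recompiling only if its API hash changed.
def dependentsToRecompile(analysis: Analysis, recompiled: String, newApiHash: Long): Set[String] =
  if (analysis.apiHash.get(recompiled) == Some(newApiHash)) Set.empty
  else analysis.dependencies.collect { case (file, deps) if deps(recompiled) => file }.toSet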

Interaction with the Scala compiler

This chapter discusses the implementation as it is done in sbt 0.12.x.

Incremental compiler interacts with Scala compiler in many ways:

  • provides two additional phases that extract the needed information:
    • the api phase extracts the public interface of compiled sources by walking trees and indexing types
    • the analyzer phase extracts dependencies between source files (compilation units) and the list of emitted class files
  • defines a custom reporter which allows sbt to gather errors and warnings
  • subclasses Global to:
    • add the api and analyzer phases
    • set the custom reporter
  • manages instances of the custom Global and uses them to compile the files it has determined need to be recompiled (a minimal sketch of this setup follows the list)
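
Here's a minimal sketch of that interaction, assuming a 2.10-era scalac API. This is not sbt's actual code: the phase name, its body, and the use of StoreReporter are illustrative stand-ins for the real api/analyzer phases and reporter.

import scala.tools.nsc.{Global, Phase, Settings, SubComponent}
import scala.tools.nsc.reporters.StoreReporter

class AnalyzingGlobal(settings: Settings, reporter: StoreReporter)
    extends Global(settings, reporter) {

  // hypothetical stand-in for sbt's api/analyzer phases
  object dependencyExtractor extends {
    val global: AnalyzingGlobal.this.type = AnalyzingGlobal.this
  } with SubComponent {
    val phaseName = "dependency-extractor"
    val runsAfter = List("typer")
    val runsRightAfter: Option[String] = None

    def newPhase(prev: Phase): Phase = new StdPhase(prev) {
      def apply(unit: CompilationUnit): Unit = {
        // here sbt records unit.depends and the extracted API of the unit
        inform(s"${unit.source.file.name} depends on ${unit.depends.size} symbols")
      }
    }
  }

  // hook provided by Global for adding phases to the compilation pipeline
  override protected def computeInternalPhases(): Unit = {
    super.computeInternalPhases()
    phasesSet += dependencyExtractor
  }
}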

API extraction phase

The API extraction phase extracts information from Trees, Types and Symbols and maps it to the incremental compiler's internal data structures described in the api.specification file. Those data structures allow expressing an API in a way that is independent of the Scala compiler version. The representation is also persistent: it is serialized to disk and reused between compiler runs or even sbt runs.

The api extraction phase consists of two major components:

  1. mapping Types and Symbols to incremental compiler representation of an extracted API
  2. hashing that representation

Mapping Types and Symbols

The logic responsible for mapping Types and Symbols is implemented in API.scala. With the introduction of Scala reflection we have multiple variants of Types and Symbols. The incremental compiler uses the variant defined in the scala.reflect.internal package.

Also, there's one design choice that might not be obvious. When the type corresponding to a class or a trait is mapped, all inherited members are copied instead of just the declarations in that class/trait. The reason for doing so is that it greatly simplifies analysis of the API representation: all information relevant to a class is stored in one place so there's no need to look up the parent type's representation (TODO: verify with @harrah that this is indeed the motivation). This simplicity comes at a high price: the same information is copied over and over again, resulting in a big performance hit. For example, every class will have the members of java.lang.Object duplicated along with full information about their signatures.

Hashing an API representation

The incremental compiler (as it's implemented right now) doesn't need very fine-grained information about the API. It just needs to know whether an API has changed since the last time it was indexed. For that purpose a hash sum is enough and it saves a lot of memory. Therefore, the API representation is hashed immediately after a single compilation unit is processed and only the hash sum is stored persistently.
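
To illustrate the idea, here's a toy sketch (not sbt's actual HashAPI logic); the ApiDef types are hypothetical stand-ins for the real API representation:

sealed trait ApiDef
case class Method(name: String, paramTypes: List[String], returnType: String) extends ApiDef
case class Clazz(name: String, members: List[ApiDef]) extends ApiDef

// fold the whole extracted representation into a single number; only this number
// is kept, the ApiDef tree can be discarded as soon as the unit is processed
def hashApi(defs: List[ApiDef]): Int =
  defs.foldLeft(17) { (acc, d) =>
    d match {
      case Method(n, ps, r) => 31 * acc + (n, ps, r).hashCode
      case Clazz(n, ms)     => 31 * acc + (31 * n.hashCode + hashApi(ms))
    }
  }

// a changed return type changes the hash; a changed method body would not,
// because bodies are not part of the extracted API
val before = hashApi(List(Clazz("A", List(Method("foo", List("Int"), "Int")))))
val after  = hashApi(List(Clazz("A", List(Method("foo", List("Int"), "String")))))
assert(before != after)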

In earlier versions the incremental compiler didn't hash the API representation. That resulted in very high memory consumption and poor serialization/deserialization performance.

The hashing logic is implemented in the HashAPI.scala file.

The analyzer phase

The analyzer phase extracts two pieces of information:

  1. dependencies of a given compilation unit
  2. class files produced by a given compilation unit

Dependency extraction

The incremental compiler extracts all Symbols a given compilation unit depends on (refers to) and then tries to map them back to corresponding source or class files. Mapping a Symbol back to a source file is performed using the sourceFile attribute that Symbols derived from source files have set. Mapping a Symbol back to a (binary) class file is more tricky because the Scala compiler does not track the origin of Symbols derived from binary files. Therefore a simple heuristic is used which maps a qualified class name to a corresponding classpath entry. This logic is implemented in the analyzer phase, which has access to the full classpath.
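
A rough sketch of that mapping, assumed to live inside a Global subclass (2.10-era API) so that Symbol refers to the compiler's symbol type; the classpath heuristic below is a simplification that only handles directory entries, not jars:

import java.io.File

def dependencyOf(sym: Symbol, classpath: Seq[File]): Option[File] = {
  val src = sym.sourceFile // set for symbols derived from source files
  if (src != null && src.file != null) Some(src.file)
  else {
    // binary dependency: guess the classpath entry from the qualified class name
    val relativeClassFile = sym.fullName.replace('.', '/') + ".class"
    classpath.find(entry => new File(entry, relativeClassFile).exists())
  }
}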

The set of Symbols a given compilation unit depends on is obtained through the CompilationUnit.depends property. This set is populated by the Scala compiler during type checking. It contains all Symbols the type checker ever touches when type checking the given compilation unit. This includes Symbols corresponding to companion objects (and other classes) that implicit search touches when resolving implicits.

The advantage of using the CompilationUnit.depends property is that it is maintained by the type checker, which gives fairly high confidence that all Symbols a given compilation unit depends on will be contained in that set.

However, the incremental compiler seems to be the only client of that property and it comes with no real specification or tests. Nothing in the Scala compiler itself relies on the contents of the depends property, so the aforementioned high confidence is diminished to some extent. The other disadvantage of using CompilationUnit.depends is that it's a flat collection without any contextual information about a given dependency. For example, the CompilationUnit.depends property cannot be used to determine per-class dependencies in a given compilation unit.

Produced class files

Extracting the collection of produced class files is achieved by inspecting the contents of the CompilationUnit.icode property, which contains the collection of all ICode classes that the backend will emit as JVM class files.
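
A sketch, again assumed to live inside a Global subclass (2.10-era API); the real logic also has to account for nested classes and module ($) suffixes when deriving class file names:

def producedClasses(unit: CompilationUnit): List[String] =
  unit.icode.toList.map(_.symbol.fullName)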

Incremental compilation algorithm in sbt 0.13

To be done: describe in this chapter the algorithm which takes into account information about API changes and extracted dependencies and decides which files to recompile.

Name hashing algorithm

This chapter outlines a proposal for the name hashing algorithm I'm currently working on. It is described under the assumption that you are familiar with the details of the current implementation.

Motivation

Let's consider the following example:

// A.scala
class A {
  def inc(x: Int): Int = x+1
}

// B.scala
class B {
  def foo(a: A, x: Int): Int = a.inc(x)
}

Let's assume both of those files are compiled and the user changes A.scala so it looks like this:

// A.scala
class A {
  def inc(x: Int): Int = x+1
  def dec(x: Int): Int = x-1
}

Once the user hits save and asks the incremental compiler to recompile the project, it will do the following:

  1. Recompile A.scala as the source code has changed (first iteration)
  2. While recompiling, it will reindex the API structure of A.scala and detect that it has changed
  3. It will determine that B.scala depends on A.scala, and since the API structure of A.scala has changed, B.scala has to be recompiled as well (B.scala has been invalidated)
  4. Recompile B.scala because it was invalidated in 3. due to the dependency change
  5. Reindex the API structure of B.scala and find that it hasn't changed, so we are done

To summarize, we'll invoke the Scala compiler twice: once to recompile A.scala, and then again to recompile B.scala because A has a new method dec.

However, one can easily see that in this simple scenario recompilation of B.scala is not needed: the addition of the dec method to class A is irrelevant to class B, which doesn't use it and is not affected by it in any way.

With just two files, the fact that we recompile too much doesn't sound too bad. However, in practice the dependency graph is rather dense, so one might end up recompiling the whole project upon a change that is irrelevant to almost all files in it. That's exactly what happens in Play projects when routes are modified. The nature of routes and reversed routes is that every template and every controller depends on some methods defined in those two classes (Routes and ReversedRoutes), but a change to a specific route definition usually affects only a small subset of all templates and controllers.

The idea behind name hashing is to exploit that observation and make the invalidation algorithm smarter about changes that can possibly affect a small number of files.

Detection of irrelevant dependencies (direct approach)

I call a change to the API of a given source file X.scala irrelevant if it doesn't affect the compilation result of another file Y.scala, even if Y.scala does depend on X.scala.

From that definition one can easily see that a change can be declared irrelevant only with respect to a given dependency. Conversely, one can declare a dependency between two source files irrelevant with respect to a given change of an API in one of the files if the change doesn't affect the compilation result of the other file. From now on we'll focus on detection of irrelevant dependencies.

A very naive way of solving the problem of detecting irrelevant dependencies would be to keep track of all methods used in Y.scala, so if a method in X.scala is added/removed/modified we just check whether it's being used in Y.scala; if it's not, we consider the dependency of Y.scala on X.scala irrelevant in this particular case.

To give you a sneak preview of the problems that quickly arise with that strategy, let's consider two scenarios.

Inheritance

We'll see how a method not used in another source file might affect its compilation result. Let's consider this structure:

// A.scala
abstract class A

// B.scala
class B extends A

Let's add an abstract method to class A:

// A.scala
abstract class A {
  def foo(x: Int): Int
}

Now, once we recompile A.scala, we could just say that since A.foo is not used in class B we don't need to recompile B.scala. However, that's not true because B doesn't implement the newly introduced abstract method and an error should be reported.

Therefore, the simple strategy of looking at used methods to determine whether a given dependency is relevant is not enough.

Enrichment pattern

Here we'll see another case of a newly introduced method (not used anywhere yet) that affects compilation results of other files. This time no inheritance is involved; we'll use the enrichment pattern (implicit conversions) instead.

Let's assume we have the following structure:

// A.scala
class A

// B.scala
class B {
  class AOps(a: A) {
    def foo(x: Int): Int = x+1
  }
  implicit def richA(a: A): AOps = new AOps(a)
  def bar(a: A): Int = a.foo(12) // this is expanded to richA(a).foo so we are calling the AOps.foo method
}

Now, let's add a foo method directly to A:

// A.scala
class A {
  def foo(x: Int): Int = x-1
}

Now, once we recompile A.scala and detect that there's a new method defined in class A, we need to consider whether this is relevant to the dependency of B.scala on A.scala. Notice that in B.scala we do not use A.foo (it didn't exist at the time B.scala was compiled) but we use AOps.foo, and it's not immediately clear that AOps.foo has anything to do with A.foo. One would need to detect the fact that the call to AOps.foo happens as a result of the implicit conversion richA, which was inserted because we previously failed to find foo on A.

This kind of analysis very quickly approaches the implementation complexity of Scala's type checker and is not feasible to implement in the general case.

Too much information to track

All of the above assumed we actually preserve full information about the structure of the API and the methods used, so we can make use of it. However, as described in Hashing an API representation, we do not store the whole representation of the API but only its hash sum. Also, dependencies are tracked at the source file level and not at the class/method level.

One could imagine reworking the current design to track more information, but it would be a very big undertaking. Also, the incremental compiler used to preserve the whole API structure but switched to hashing due to very high memory consumption.

Detection of irrelevant dependencies (name hashing)

As we saw in the previous chapter, the direct approach of tracking more information about what's used in each file becomes tricky very quickly. One would wish for a simpler, less precise approach that still yields big improvements over the existing implementation.

The idea is not to track all used members and reason very precisely about when a given change to some member affects the compilation results of other files. Instead, we track just the used simple names, and we also track hash sums for all members with a given simple name. A simple name is just the unqualified name of a term or a type.

Let's first see how this simplified strategy addresses the problem with the enrichment pattern. We'll do that by simulating the name hashing algorithm. Let's start with the original code:

// A.scala
class A

// B.scala
class B {
  class AOps(a: A) {
    def foo(x: Int): Int = x+1
  }
  implicit def richA(a: A): AOps = new AOps(a)
  def bar(a: A): Int = a.foo(12) // this is expanded to richA(a).foo so we are calling the AOps.foo method
}

During compilation of those two files we'll extract the following information:

usedNames("A.scala"): A
usedNames("B.scala"): B, AOps, a, A, foo, x, Int, richA, AOps, bar

nameHashes("A.scala"): A -> ...
nameHashes("B.scala"): B -> ..., AOps -> ..., foo -> ..., richA -> ..., bar -> ...

The usedNames relation tracks all names mentioned in a given source file. The nameHashes relation gives us a hash sum for each group of members that are put together in one bucket because they share the same simple name. In addition to the information presented above, we still track the dependency of B.scala on A.scala.

Now, if we add a foo method to class A:

// A.scala
class A {
  def foo(x: Int): Int = x-1
}

and recompile, we'll get the following (updated) information:

usedNames("A.scala"): A, foo
nameHashes("A.scala"): A -> ..., foo -> ...

The incremental compiler compares name hashes before and after the change and detects that the hash sum of the name foo has changed (it's been added). Therefore, it looks at all source files that depend on A.scala, in our case just B.scala, and checks whether foo appears as a used name. It does, therefore it recompiles B.scala as intended.

You can see now that if we added another method to A, like xyz, then B.scala wouldn't be recompiled, because the name xyz is mentioned nowhere in B.scala. Therefore, if you have reasonably non-clashing names, a lot of dependencies between source files should get marked as irrelevant.
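
To make the bookkeeping concrete, here is a toy model of the member-reference invalidation step just described. It is not sbt's implementation; all structures (reverseDeps, usedNames, the hash maps) are hypothetical simplifications.

type SourceFile = String
type SimpleName = String

def invalidatedByNameHashes(
  changed: SourceFile,
  oldHashes: Map[SimpleName, Int],               // name hashes of `changed` before the edit
  newHashes: Map[SimpleName, Int],               // name hashes of `changed` after the edit
  reverseDeps: Map[SourceFile, Set[SourceFile]], // file -> files that depend on it
  usedNames: Map[SourceFile, Set[SimpleName]]
): Set[SourceFile] = {
  // names whose hash changed, including names that were added or removed
  val modifiedNames =
    (oldHashes.keySet ++ newHashes.keySet).filter(n => oldHashes.get(n) != newHashes.get(n))
  reverseDeps.getOrElse(changed, Set.empty)
    .filter(file => usedNames.getOrElse(file, Set.empty).exists(modifiedNames))
}

With the data above, the only modified name of A.scala is foo and B.scala uses foo, so B.scala is invalidated; adding xyz instead would leave B.scala untouched.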

It's very nice that this simple, name-based heuristic manages to withstand the "enrichment pattern" test. However, name hashing fails the other test, the one involving inheritance. In order to address that problem, we'll need a closer look at dependencies introduced by inheritance versus dependencies introduced by member references.

Dependencies introduced by member reference and inheritance

The core assumption behind the name hashing algorithm is that if a user adds/modifies/removes a member of a class (e.g. a method), the compilation results of other classes won't be affected unless they use that particular member. Inheritance, with its various override checks, makes the whole situation much more complicated; if you combine it with mix-in composition, which introduces new fields into classes inheriting from traits, you quickly realize that inheritance requires special handling.

The idea is that, for now, we switch back to the old scheme whenever inheritance is involved. Therefore, we track dependencies introduced by member reference separately from dependencies introduced by inheritance. Dependencies introduced by inheritance are not subject to name hashing analysis, so they are never marked as irrelevant.

The intuition behind a dependency introduced by inheritance is very simple: it's a dependency a class/trait introduces by inheriting from another class/trait. All other dependencies are called dependencies by member reference because they are introduced by referring to (selecting) a member (method, type alias, inner class, val, etc.) of another class. Notice that in order to inherit from a class you need to refer to it, so dependencies introduced by inheritance are a strict subset of member reference dependencies.

Here's an example which illustrates the distinction:

// A.scala
class A {
  def foo(x: Int): Int = x+1
}

// B.scala
class B(val a: A)

// C.scala
trait C

// D.scala
trait D[T]

// X.scala
class X extends A with C with D[B] {
  // dependencies by inheritance: A, C, D
  // dependencies by member reference: A, C, D, B
}

// Y.scala
class Y {
  def test(b: B): Int = b.a.foo(12)
  // dependencies by member reference: B, Int, A
}

There are two things to notice:

  1. X does not depend on B by inheritance because B is passed as a type parameter to D; we consider only the types that appear as parents of X
  2. Y does depend on A even though there's no explicit mention of A in the source file; we select the method foo, which is defined in A, and that's enough to introduce a dependency

To sum up, the way we want to handle inheritance and the problems it introduces is to track all dependencies introduced by inheritance separately and to invalidate them in a much stricter way. Essentially, whenever there's a dependency by inheritance it will react to any (even minor) change in the parent types.
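
Continuing the toy model from the previous section (again hypothetical names, not sbt's code), the stricter treatment of inheritance dependencies could be sketched like this: they are invalidated on any API change of the parent's file, regardless of name hashes.

def invalidatedByInheritance(
  changed: SourceFile,
  apiHashChanged: Boolean,                           // did the overall API hash of `changed` change at all?
  inheritanceDeps: Map[SourceFile, Set[SourceFile]]  // file -> files that inherit from something in it
): Set[SourceFile] =
  if (apiHashChanged) inheritanceDeps.getOrElse(changed, Set.empty)
  else Set.empty

// the final set of files to recompile is the union of both kinds of invalidation:
// invalidatedByInheritance(...) ++ invalidatedByNameHashes(...)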

Computing name hashes

One thing we skimmed over so far is how name hashes are actually computed.

As mentioned before, all definitions are grouped together by simple name and then hashed as one bucket. If a definition (for example a class) contains other definitions, those nested definitions do not contribute to its hash sum. The nested definitions contribute instead to the hashes of the buckets selected by their own names.
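
A toy sketch of that bucketing (the Definition type is a hypothetical simplification, not sbt's representation):

sealed trait Definition { def name: String }
case class Val(name: String, tpe: String) extends Definition
case class Cls(name: String, parents: List[String], members: List[Definition]) extends Definition

def nameHashes(defs: List[Definition]): Map[String, Int] = {
  // a class contributes itself (with members stripped) to its own bucket;
  // its nested members are flattened out into buckets of their own
  def flatten(d: Definition): List[Definition] = d match {
    case c @ Cls(_, _, members) => c.copy(members = Nil) :: members.flatMap(flatten)
    case other                  => List(other)
  }
  defs.flatMap(flatten).groupBy(_.name).mapValues(_.hashCode).toMap
}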

Non-core incremental compilation concerns

Here's a list (contributed by @harrah) of concerns, important for correctness, that a complete incremental compiler implementation must take into account:

  1. The incremental compiler also notices changes to compiler options, whether class files are deleted, and the location of the output directory.
  2. The incremental compiler manages the order of Java/Scala compilation.
  3. The discussion doesn't cover change detection (what sources and binaries were modified). It might seem unimportant, but surprisingly it hasn't been done correctly anywhere else: see the chart on how Ant+Maven detect what sources to recompile.
  4. The discussion doesn't cover extraction of information from Java sources. Java sources passed to scalac are ignored by api/analyzer. The information comes from class file inspection after javac runs.
  5. Classfile management (might fall under the unwritten invalidation section): deleting class files and optionally restoring them.
  6. Source files are sorted before being passed to scalac to work around scalac ordering issues (these definitely existed in older versions; possibly it's just done for consistency now).
Those points should be considered as TODOs.

Examples of problematic scenarios

This chapter contains a collection of assorted problematic scenarios that make designing a more fine-grained incremental compilation algorithm a very challenging task.

Private var of a trait

If you have

// A.scala
trait A

// B.scala
class B extends A

and you add a private var to your trait A:

// A.scala
trait A {
  private var foo = 12
}

then B.scala has to be recompiled even though the change to trait A does not affect its public interface; the field introduced by the private var has to be materialized in class B through mix-in composition. The example was contributed by @paulp.

Nested packages

Let's say we start with

// A.scala
package a
package b
package c

class A {
  def foo(x: X) = x
}

// X.scala
package a

class X

and we compile both A.scala and X.scala. Now, we add X2.scala:

// X2.scala
package a.b
class X

What we get now is a situation where a completely new file (X2.scala) causes invalidation of another, already existing file (A.scala), because the reference to X in A.scala now resolves to a different class (a.b.X instead of a.X).
