Join GitHub today
GitHub is home to over 31 million developers working together to host and review code, manage projects, and build software together.Sign up
Changes to ambivalence scope tracking #1609
This GPR fixes various bugs related to ambiguous types silently escaping their scope.
This GPR seems to bring a negligible improvement in compiler speed along with a small increase in
Consider the following piece of code:
let f (type a) (b : bool) (wit : (a, int) eq) (x : a) = match wit with | Refl -> if b then x else 0 | _ -> x
The typechecker will correctly reject it:
However if we instead write the following, clearly similar, functions:
let f1 (type a b) (b : bool) (wit : (a, b) eq) (x : a) (y : b) = match wit with | Refl -> if b then x else y | _ -> x let f2 (type a b) (b : bool) (wit : (a, b) eq) (x : a) (y : b) = match wit with | Refl -> if b then x else y | _ -> y
then the first one is accepted and given the type
In both cases the typechecker will have chosen an arbitrary element of the ambiguous type of the first branch, and then depending whether the type of the second branch matches or not the chosen representative, the function will or won't typecheck.
Note that exchanging the content of the if-branches, or changing the order in which the newtypes are introduced will result in the ambiguity being noticed (and changing both at the same time means that the function will be accepted but return a
let f (type a b) (x : (a, b) eq) = match x,  with | Refl, [(_ : a) | (_ : b)] ->  | Refl, [(_ : a)] ->  let g (type a b) (x : (a, b) eq) = match x,  with | Refl, [(_ : a) | (_ : b)] ->  | Refl, [(_ : b)] -> 
On trunk, these typecheck even though they look very similar to the examples discussed above. Jacques gave this result the following explanation:
This however doesn't seem very convincing, if indeed that was the reason then this would also typecheck:
A correct explanation is, I believe, that just as in the previous examples, an ambiguous type escapes from the first pattern, but since we are now unifying each branch with the scrutiny in the environment of the branch, and not in the outer environment as was done before, then the second branch posses a local equation that allows it to typecheck.
let f (type a b) (x : (a, b) eq) = match x,  with | Refl, [(_ : a) | (_ : b)] ->  | _, [(_ : a)] ->  let g (type a b) (x : (a, b) eq) = match x,  with | Refl, [(_ : a) | (_ : b)] ->  | _, [(_ : b)] -> 
Various other similar examples have been added in the following commits:
Fixing the situation
The main change proposed by this GPR is to track ambivalence in the types themselves (which is reminiscent of the 2013 paper about ambivalent types) instead of registering them in the environment.
To do so, we add a
This avoids the issue of looking at an ambivalent type with an environment which doesn't know about its ambivalence (and so allows ambiguity to leak), as was for example the case in GPR#1315 .
The commit where we start using this new scope field ( "use scope instead of gadt_instances", a17e01c ) also happens to fix things which I believe to be symmetry issues (i.e. when we have "a=b" we would remember the gadt instance level of a, but not of b).
There is another, smaller, commit of interest which was introduced before this big change (but might just as well be done after as it is in fact independent), which is: "ctype: split
The next big commit is the one changing
Finally, the commit titled "slight representation change" changes
Testing and timing
This PR contains various commits making additions to the testsuite to illustrate some issue with the previous implementation, which are then updated in later commits when the issue is resolved.
Apart from that, I have also compiled janestreet's codebase with this compiler and all of it compiled, except for one place where the new compiler was reporting that some ambiguous type was escaping the scope of its equation.
Having compiled janestreet codebase with
As for the size of the build artefacts, the total
My first reaction is of surprise: I would have expected adding a field to
I agree that handling ambivalence in the types is the right thing to do, since this is what the theory does. I was really afraid of the cost, but it seems that I was wrong, or rather underestimated the extant of the usage of GADTs, and the performance cost involved.
Indeed, this is now fixed.
My guess would have been that the gain comes from not looking up in environment the gadt instance level of a type.
I would think that the gain in simplicity / clarity of the implementation as well as the various bugfixes would be enough to justify the size increase.
Now that the backtracking problem is solved, I believe this code is fine: the reasoning behind seems sound, and there are enough tests to ensure that it works as intended.
While I read the changes, I cannot really vouch for them, due to the size, and the number of independent changes mixed together. But again, I believe the tests, and @trefis ' understanding.
I still have a question: why is there a
An by the way I do not agree with your statement:
Independently of the specific merits of this PR, I have a problem with our workflow.
@garrigue for the changes of @trefis touching pattern matching, what we have done on several occasions (for larger PRs) is to meet in person with Luc to discuss the PR and review the changes in detail, which in general caused a partial rewrite followed by a re-review using the interface (mostly checking that things were as discussed).
Is there a variant of this workflow that would work for you? (Some communication medium, to discuss the broad ideas of a PR etc., after a first version/draft is put on Github.) Email, chat, video call could work.
It depends very much on the contents of the patch.
Can you elaborate on this point? I assume the actual problem is not the size, but the time to load external .cmi when compiling a unit; is that right? Or perhaps the RAM usage of the compiler? Did you observe that this is actually a bottleneck?
This was one of the reasons to introduce module aliases. Jane Street was ending up with cmi's so big that they were no longer manageable. Of course the problem is not so much disk space as in-memory space, but adding a field to the omnipresent
Jane Street people frequently report the in-memory size of .cmi (once loaded by the compiler) as a bottleneck when doing parallel compilation (in Jenga for example), or when using naive caching strategies (for example, when Merlin caches loaded .cmi to process future buffers faster). I am not sure whether on-disk size and in-memory size are always strongly correlated, but it seems reasonable to assume that adding a field to the in-memory representation of types affects both.
This was added after the fact (in 68a1ffa), it is indeed not strictly necessary: an ambiguous type escape will be detected as such after typing the whole match, or you will have a type mismatch between the patterns earlier on.
I'd rather keep that function: I'm not so worried about the small overlap, and I do believe the gain in consistency of the error messages is a nice thing to have.
Regarding the size of
Regarding coordination/communication I do not have much to propose that wasn't already suggested by Gabriel. In the future I could get in touch a bit in advance if you'd prefer that.
Although the various discussions might not all be over, does anyone mind if I rebase (I plan to squash the bootstrap commits in their parent) and merge this? (N.B. the appveyor failure is just a timeout)
Do you know if this is still a strong concern with module aliases?
If so, I'm wondering if it would be useful to investigate alternative strategies to store interfaces. In particular, I'm thinking about creating per-library interfaces, in the form of a single file (say .cmis), containing all interfaces for public modules of the library, with some hash-consing scheme to recreate maximal (or just more) sharing. The compiler would only need to open/unmarshal one big file per library. There is an overhead due to the fact that some modules from the library might not be required, but I suspect the overall effect would often be positive. It would also reduce the incrementality of builds, but per-library granularity might be ok (at least in the OPAM world, perhaps not for big mono-repos).
This seems likely to be a big negative, essentially undoing the good work of module aliases. Most modules of a large library are typically not needed for each module that uses the library.
By far the biggest cause of cmi size for us is the broken semantics of
I should also point out that we aren't that concerned with small percentage increases in cmi size. Even 50% probably wouldn't be that noticeable. The issue we had before module aliases was that we were loading cmis that were more than 10x larger than were needed.