Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changes to ambivalence scope tracking #1609

Merged
merged 17 commits into from Feb 27, 2018

Conversation

Projects
None yet
5 participants
@trefis
Copy link
Contributor

trefis commented Feb 13, 2018

tl;dr

This GPR fixes various bugs related to ambiguous types silently escaping their scope.
To that end, the following changes have been made:

  • the scope of (ambivalent) types is now stored in type_expr, not in the environment: as shown in the past, it is too easy to let things escape by mistake.
  • some changes were done to type_cases to reduce sharing between the scrutiny and the type of each pattern.
    (this implements what was hinted at in MPR#7618)
  • removed Types.type_declaration.type_newtype_level: (int * int) option, instead we now have type_is_newtype: bool and type_expansion_scope: int option. The "definition level" is always the binding time.

This GPR seems to bring a negligible improvement in compiler speed along with a small increase in .cmi sizes (rough measurements are present at the end of this post).

Long version

Consider the following piece of code:

let f (type a) (b : bool) (wit : (a, int) eq) (x : a) =
  match wit with
  | Refl -> if b then x else 0
  | _ -> x

The typechecker will correctly reject it:

Line _, characters _:
    | Refl -> if b then x else 0
                               ^
Error: This expression has type int but an expression was expected of type
         a = int
       This instance of int is ambiguous:
       it would escape the scope of its equation

However if we instead write the following, clearly similar, functions:

let f1 (type a b) (b : bool) (wit : (a, b) eq) (x : a) (y : b) =
  match wit with
  | Refl -> if b then x else y
  | _ -> x

let f2 (type a b) (b : bool) (wit : (a, b) eq) (x : a) (y : b) =
  match wit with
  | Refl -> if b then x else y
  | _ -> y

then the first one is accepted and given the type bool -> ('a, 'b) eq -> 'a -> 'b -> 'a, while the second one gets rejected:

  | _ -> y
         ^
Error: This expression has type b but an expression was expected of type a

In both cases the typechecker will have chosen an arbitrary element of the ambiguous type of the first branch, and then depending whether the type of the second branch matches or not the chosen representative, the function will or won't typecheck.

Note that exchanging the content of the if-branches, or changing the order in which the newtypes are introduced will result in the ambiguity being noticed (and changing both at the same time means that the function will be accepted but return a 'b instead). This check is brittle, there seems to be a symmetry issue.

Ambiguity issues can also happen in patterns alone, as was already reported in MPRs #7617 and #7618.
Consider (a slight variation of) the examples given in #7617:

let f (type a b) (x : (a, b) eq) =
  match x, [] with
  | Refl, [(_ : a) | (_ : b)] -> []
  | Refl, [(_ : a)] -> []

let g (type a b) (x : (a, b) eq) =
  match x, [] with
  | Refl, [(_ : a) | (_ : b)] -> []
  | Refl, [(_ : b)] -> []

On trunk, these typecheck even though they look very similar to the examples discussed above. Jacques gave this result the following explanation:

Actually, the [...] example is fine, since [] has type forall 'a. 'a list, the type is not escaping.

This however doesn't seem very convincing, if indeed that was the reason then this would also typecheck:

let x = match [] with ["1"] -> 1 | [1.0] -> 2 | [1] -> 3 | _ -> 4

A correct explanation is, I believe, that just as in the previous examples, an ambiguous type escapes from the first pattern, but since we are now unifying each branch with the scrutiny in the environment of the branch, and not in the outer environment as was done before, then the second branch posses a local equation that allows it to typecheck.
This can be confirmed by noticing that in the following:

let f (type a b) (x : (a, b) eq) =
  match x, [] with
  | Refl, [(_ : a) | (_ : b)] -> []
  | _, [(_ : a)] -> []

let g (type a b) (x : (a, b) eq) =
  match x, [] with
  | Refl, [(_ : a) | (_ : b)] -> []
  | _, [(_ : b)] -> []

f will be accepted but g won't be.

Various other similar examples have been added in the following commits:

  • "testsuite: add an example of expression with an ambiguous type escaping its scope" ( 3338a64 )
  • "expand some existing tests to make sure we don't break anything" ( b9d7a84 )
  • "testsuite: add an example of pattern with an ambiguous type escaping its scope" ( fab9ae6 )

Fixing the situation

The main change proposed by this GPR is to track ambivalence in the types themselves (which is reminiscent of the 2013 paper about ambivalent types) instead of registering them in the environment.

To do so, we add a scope: int option field to type_expr, which is set whenever an ambivalent type is created. The value of that field will be the level at which the equation allowing its creation was introduced.
Checking that an ambivalent type ty doesn't escape its scope is then just a matter of enforcing that ty.scope <= ty.level.

This avoids the issue of looking at an ambivalent type with an environment which doesn't know about its ambivalence (and so allows ambiguity to leak), as was for example the case in GPR#1315 .

The commit where we start using this new scope field ( "use scope instead of gadt_instances", a17e01c ) also happens to fix things which I believe to be symmetry issues (i.e. when we have "a=b" we would remember the gadt instance level of a, but not of b).
I don't think this is magically resolved by scope itself, it's probably due to the fact that we were forgetting to add some gadt instance in some cases, whereas we're now careful to always update the scope (there is a simple rule for this: whenever you call update_level, you want to call update_scope right after).

There is another, smaller, commit of interest which was introduced before this big change (but might just as well be done after as it is in fact independent), which is: "ctype: split find_newtype_level in two".
Before to this commit, when trying to unify two newtypes, and if there are local constraints in the environment, we would try to expand one or the other based on which had the smallest "newtype level".
We now decide which to expand based on what will in the later commits be called "the expansion scope", i.e. the second element of the type_newtype_level pair, which corresponds to the level where an equation was added to that type.
Previously, the definition level was also taken into account (and had, in fact, priority).

The next big commit is the one changing type_cases ( 3006646 ). This commit does what Leo suggested in MPR#7618, that is: decreasing the amount of sharing between the type of the scrutiny and that of each pattern. The commit also adds a check that no ambiguous type escapes its pattern, by unifying all of them outside the match (i.e. in the outer environment, at the outer level).

Finally, the commit titled "slight representation change" changes Types.type_declaration to remove the field type_newtype_level: (int * int) option and adds two fields type_is_newtype: bool and type_expansion_scope: int option in its place.
This commit isn't needed to correct any (known) bug, but it makes things slightly cleaner and conceptually simpler IMHO.

Testing and timing

This PR contains various commits making additions to the testsuite to illustrate some issue with the previous implementation, which are then updated in later commits when the issue is resolved.

Apart from that, I have also compiled janestreet's codebase with this compiler and all of it compiled, except for one place where the new compiler was reporting that some ambiguous type was escaping the scope of its equation.
The new compiler was correct.

Having compiled janestreet codebase with -dtimings, it appears that we were spending 252.47 minutes in the typechecker without the proposed changes, and that we spend 227.34 minutes in the typechecker with the proposed changes.
This slight improvement seems consistent with the time required to compile the compiler itself: make world was taking 170.82s on trunk, while it takes 163.56s after the proposed changes (though these last numbers are small enough that I don't know how much value they have).

As for the size of the build artefacts, the total .cmi size for janestreet's codebase was 1023M before and grows to 1066M after the changes. The size of .cmt[i]s is stable at 14G.
Again, this seems consistent with the growth observed on the compiler itself, the total .cmi size was previous 4.0M (4084K) and is now 4.2M (4216K).
(numbers obtained with du and du -h)

@trefis

This comment has been minimized.

Copy link
Contributor Author

trefis commented Feb 13, 2018

N.B. the commits seem properly ordered on the "commits" page of the github interface, but the list on the "conversation" page seems all over the place.

@garrigue

This comment has been minimized.

Copy link
Contributor

garrigue commented Feb 20, 2018

My first reaction is of surprise: I would have expected adding a field to type_expr to have a huge impact on cmi size (maximum 25%), and it appears to be only about 5%.
It would still be too much, were it not for the gain in performance: if we gain 10% in performance for 5% in size, this is a clear win.
Have you a better analysis of the performance gain: does it come mainly from type_cases, or from removing the instance chain tracking?

I agree that handling ambivalence in the types is the right thing to do, since this is what the theory does. I was really afraid of the cost, but it seems that I was wrong, or rather underestimated the extant of the usage of GADTs, and the performance cost involved.
Another advantage is that there is no need to backtrack in the environment any more (which was not properly done anyway).
This said, you do need to backtrack scopes in type_expr, i.e. extend the logic at the end of Btype, and use it every type you update the scope.

@trefis trefis force-pushed the trefis:fix-gadts branch from f38553a to 8e103d6 Feb 20, 2018

@trefis

This comment has been minimized.

Copy link
Contributor Author

trefis commented Feb 20, 2018

you do need to backtrack scopes in type_expr, i.e. extend the logic at the end of Btype, and use it every type you update the scope.

Indeed, this is now fixed.

Have you a better analysis of the performance gain: does it come mainly from type_cases, or from removing the instance chain tracking?

My guess would have been that the gain comes from not looking up in environment the gadt instance level of a type.
This seems to be confirmed by perf when looking at

../ocamlc.opt -strict-sequence -nostdlib -safe-string -strict-formats -w Ae -c camlinternalFormat.ml

On trunk camlEnv__is_Tlink and camlStdlib__set__exists are close to the top, and I believe that is due to Env.gadt_instance_level`.
Obviously that doesn't happen in this branch.
(I also timed the invocation above a few times, and trunk seems to be consistently slower than this branch).

It would still be too much, were it not for the gain in performance

I would think that the gain in simplicity / clarity of the implementation as well as the various bugfixes would be enough to justify the size increase.

@gasche

This comment has been minimized.

Copy link
Member

gasche commented Feb 26, 2018

@garrigue: gentle ping, are you planning to do a review of the branch?

@garrigue

This comment has been minimized.

Copy link
Contributor

garrigue commented Feb 27, 2018

Now that the backtracking problem is solved, I believe this code is fine: the reasoning behind seems sound, and there are enough tests to ensure that it works as intended.

While I read the changes, I cannot really vouch for them, due to the size, and the number of independent changes mixed together. But again, I believe the tests, and @trefis ' understanding.

I still have a question: why is there a check_scope_escape separated from update_level?
update_level seems to subsume it. Wasn't there a way to avoid the (small) duplication?

An by the way I do not agree with your statement:
I would think that the gain in simplicity / clarity of the implementation as well as the various bugfixes would be enough to justify the size increase.
Size of cmi's is a major problem. A change that deliberately increases it, particularly for a (potentially) seldom used feature, needs to be carefully argued. Here the gain in performance is I believe the best argument. Of course simplicity and clarity are important, but their cost matters.

@garrigue

This comment has been minimized.

Copy link
Contributor

garrigue commented Feb 27, 2018

Independently of the specific merits of this PR, I have a problem with our workflow.
I'm not sure this is the right place to discuss it, but why not start here.
Namely, I end up reviewing large pieces of code, without previous discussion on why this is needed, and how this should be done. This might a problem with GitHub: a PR is supposed to contain some code, and incremental changes inside a PR are not so well supported.
My belief is that in general it would be better to have a discussion before starting to write code, and then reading the code in the light of this discussion would be much easier. I do not say that the choices in this PR are wrong, just that they there is very little room for feedback.

@gasche

This comment has been minimized.

Copy link
Member

gasche commented Feb 27, 2018

@garrigue for the changes of @trefis touching pattern matching, what we have done on several occasions (for larger PRs) is to meet in person with Luc to discuss the PR and review the changes in detail, which in general caused a partial rewrite followed by a re-review using the interface (mostly checking that things were as discussed).

Is there a variant of this workflow that would work for you? (Some communication medium, to discuss the broad ideas of a PR etc., after a first version/draft is put on Github.) Email, chat, video call could work.

@garrigue

This comment has been minimized.

Copy link
Contributor

garrigue commented Feb 27, 2018

It depends very much on the contents of the patch.
In this case my problem is not so much to convince myself that it is correct, but rather whether it is really needed. And I believe it would be better handled before writing the patch. But I'm not sure what tool can help here. A physical discussion would be good, but unfortunately this is not possible just now...

@alainfrisch

This comment has been minimized.

Copy link
Contributor

alainfrisch commented Feb 27, 2018

Size of cmi's is a major problem.

Can you elaborate on this point? I assume the actual problem is not the size, but the time to load external .cmi when compiling a unit; is that right? Or perhaps the RAM usage of the compiler? Did you observe that this is actually a bottleneck?

@garrigue

This comment has been minimized.

Copy link
Contributor

garrigue commented Feb 27, 2018

This was one of the reasons to introduce module aliases. Jane Street was ending up with cmi's so big that they were no longer manageable. Of course the problem is not so much disk space as in-memory space, but adding a field to the omnipresent type_expr certain augments memory usage too. Fortunately, this ends up being much smaller than 25% overall.

@gasche

This comment has been minimized.

Copy link
Member

gasche commented Feb 27, 2018

Jane Street people frequently report the in-memory size of .cmi (once loaded by the compiler) as a bottleneck when doing parallel compilation (in Jenga for example), or when using naive caching strategies (for example, when Merlin caches loaded .cmi to process future buffers faster). I am not sure whether on-disk size and in-memory size are always strongly correlated, but it seems reasonable to assume that adding a field to the in-memory representation of types affects both.

@trefis

This comment has been minimized.

Copy link
Contributor Author

trefis commented Feb 27, 2018

I still have a question: why is there a check_scope_escape separated from update_level?
update_level seems to subsume it. Wasn't there a way to avoid the (small) duplication?

This was added after the fact (in 68a1ffa), it is indeed not strictly necessary: an ambiguous type escape will be detected as such after typing the whole match, or you will have a type mismatch between the patterns earlier on.
However it makes error messages more stable: ambiguous type escapes will be detected as soon as we've typechecked a pattern.

I'd rather keep that function: I'm not so worried about the small overlap, and I do believe the gain in consistency of the error messages is a nice thing to have.

@trefis

This comment has been minimized.

Copy link
Contributor Author

trefis commented Feb 27, 2018

Regarding the size of .cmi files, I do understand your worries (working both at Jane Street and on merlin) and indeed if the size increase had been as big as you feared I would probably have looked for another way to fix the issues.
Fortunately, the increase wasn't the big.

Regarding coordination/communication I do not have much to propose that wasn't already suggested by Gabriel. In the future I could get in touch a bit in advance if you'd prefer that.

Although the various discussions might not all be over, does anyone mind if I rebase (I plan to squash the bootstrap commits in their parent) and merge this? (N.B. the appveyor failure is just a timeout)

@alainfrisch

This comment has been minimized.

Copy link
Contributor

alainfrisch commented Feb 27, 2018

Jane Street people frequently report the in-memory size of .cmi (once loaded by the compiler) as a bottleneck when doing parallel compilation

Do you know if this is still a strong concern with module aliases?

If so, I'm wondering if it would be useful to investigate alternative strategies to store interfaces. In particular, I'm thinking about creating per-library interfaces, in the form of a single file (say .cmis), containing all interfaces for public modules of the library, with some hash-consing scheme to recreate maximal (or just more) sharing. The compiler would only need to open/unmarshal one big file per library. There is an overhead due to the fact that some modules from the library might not be required, but I suspect the overall effect would often be positive. It would also reduce the incrementality of builds, but per-library granularity might be ok (at least in the OPAM world, perhaps not for big mono-repos).

@gasche

This comment has been minimized.

Copy link
Member

gasche commented Feb 27, 2018

We could also implement hash-consing when reading interface files into memory, even without changing their on-disk layout.

@alainfrisch

This comment has been minimized.

Copy link
Contributor

alainfrisch commented Feb 27, 2018

We could also implement hash-consing when reading interface files into memory, even without changing their on-disk layout.

Worth trying, but I suspect this might be quite slow. (And hash-consing circular data structure is not straightforward!)

@lpw25

This comment has been minimized.

Copy link
Contributor

lpw25 commented Feb 27, 2018

The compiler would only need to open/unmarshal one big file per library. There is an overhead due to the fact that some modules from the library might not be required, but I suspect the overall effect would often be positive.

This seems likely to be a big negative, essentially undoing the good work of module aliases. Most modules of a large library are typically not needed for each module that uses the library.

By far the biggest cause of cmi size for us is the broken semantics of module type of and with module, which recursively remove module aliases. @trefis is currently testing a patch for changing this behaviour on opam, and we'll make a pull request of it shortly.

I should also point out that we aren't that concerned with small percentage increases in cmi size. Even 50% probably wouldn't be that noticeable. The issue we had before module aliases was that we were loading cmis that were more than 10x larger than were needed.

@garrigue

This comment has been minimized.

Copy link
Contributor

garrigue commented Feb 27, 2018

@trefis I already approved the merge. My questions were just to clarify some code that was not completely clear, but I agree that having better error messages fully justifies adding a bit of code.

@trefis

This comment has been minimized.

Copy link
Contributor Author

trefis commented Feb 27, 2018

OK! I'll proceed with the merge then.
Thank you for the review!

@trefis trefis force-pushed the trefis:fix-gadts branch from 8e103d6 to 134b1b1 Feb 27, 2018

@trefis trefis merged commit 980b9ef into ocaml:trunk Feb 27, 2018

0 of 2 checks passed

continuous-integration/appveyor/pr Waiting for AppVeyor build to complete
Details
continuous-integration/travis-ci/pr The Travis CI build is in progress
Details

@trefis trefis deleted the trefis:fix-gadts branch Feb 27, 2018

@gasche gasche referenced this pull request Aug 20, 2018

Merged

Track newtype level again #1997

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.
You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session.