
-short-paths slow :( #6600

vicuna opened this issue Oct 7, 2014 · 36 comments

@vicuna

commented Oct 7, 2014

Original bug ID: 6600
Reporter: @diml
Assigned to: @lpw25
Status: assigned (set by @mshinwell on 2017-03-07T13:02:11Z)
Resolution: open
Priority: normal
Severity: feature
Version: 4.02.0
Target version: later
Category: typing
Tags: patch
Related to: #2871
Monitored by: @trefis @gasche jmeber @hcarty yminsky @yakobowski

Bug description

Adding -short-paths can slow down compilation a lot, up to several seconds. This is becoming a problem for us.

All the time is spent in building the path map. Taking into account only module aliases and opens should be enough to get good results with core libraries. So there might be a way to speed up the search.

Steps to reproduce

$ cat > foo.ml <<EOF
open Core.Std
open Async.Std

let x = 1 + 'e'
EOF

$ time ocamlopt.opt $(ocamlfind query -i-format -r async) -c foo.ml
File "foo.ml", line 4, characters 12-15:
Error: This expression has type char but an expression was expected of type
int

real 0m0.177s
user 0m0.041s
sys 0m0.013s

$ time ocamlopt.opt $(ocamlfind query -i-format -r async) -short-paths -c foo.ml
File "foo.ml", line 4, characters 12-15:
Error: This expression has type char but an expression was expected of type
int

real 0m2.040s
user 0m1.783s
sys 0m0.155s

File attachments

@vicuna


commented Oct 8, 2014

Comment author: @garrigue

Just to clarify: does this slowdown only occur when some types are printed, or does it appear for normal compilation too?
The map is supposed to be only computed when printing types.
This is an expensive operation, but maybe it could be improved by processing module aliases independently.

@vicuna


commented Oct 8, 2014

Comment author: @diml

The slowdown occurs only when some types are printed.

@vicuna


commented Oct 8, 2014

Comment author: @mshinwell

I have heard that MLton has a solution to this problem that doesn't take so long; it might be worth having a look to see how it does it.

@vicuna


commented Oct 8, 2014

Comment author: yminsky

The fact that it only shows up when types are printed helps a lot, but it's still problematic, because it means that when you have a loop where you're fixing type errors interactively, the round-trip through the compiler gets painfully long.

It's also quite frustrating in utop, where it hits you immediately at your first expression.

@vicuna


commented Oct 9, 2014

Comment author: @garrigue

I'm not sure I believe the "not so long" part.
short-paths has a specification: it looks up the whole environment to find the shortest long identifier (i.e. syntactic path) that expands to some internal path.
As long as you do not store extra information in the environment, you have to look up the whole environment every time you need this information (you don't know where some useful long identifier might be).
And with huge libraries such as Core and Async, maybe it is not that surprising that it is costly.
An important source of cost is that this search triggers the evaluation of all the lazy information in the environment. I'm afraid this is actually most of the cost, but this is a problem with the way OCaml handles modules, not short-paths itself.

Now that we have module aliases, I'm looking at how to optimize the way they are handled in Env.
Maybe it will help short-paths eventually (currently we recompute the contents of aliases independently of the module itself).

Another solution would be to introduce some pruning, such as only looking for long identifiers of length 1 or 2, but I'm not sure it would help much, and it would be arbitrary.
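
A minimal OCaml sketch of what this specification implies (module names invented for illustration): every identifier in scope that expands to the same internal path is a candidate, and the printer has to find the shortest one, which is why it must scan the whole environment.

```ocaml
(* compile with: ocamlc -short-paths -c example.ml *)
module Very_long_internal_name = struct
  type t = Leaf
end

(* a module alias introduces a second, shorter syntactic route to [t] *)
module V = Very_long_internal_name

(* the type error below is deliberate: it makes the compiler print the
   type.  Without -short-paths the error should mention
   Very_long_internal_name.t; with -short-paths the printer considers all
   identifiers expanding to that internal path and should pick V.t. *)
let _ = (Very_long_internal_name.Leaf = 1)
```

Nothing tells the printer in advance where such an alias might live, so every module in scope has to be inspected.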

@vicuna


commented Oct 9, 2014

Comment author: @garrigue

I ran your test, with just Core, not Async, and I get the following numbers (user time, for 4.01, but this shouldn't matter)

without -short-paths: 2s
with -short-paths: 8s
only forcing the environment: 5s

So this means that there is no miracle: if we want to be faster we need to improve both the module system performance, and the map construction performance.
Also you may be feeling it more in 4.02, because just loading the cmi's should be much faster.

@vicuna


commented Oct 9, 2014

Comment author: @garrigue

Actually, the numbers with 4.02.1 are a bit different:

without -short-paths: 0.14s
with -short-paths: 7.4s
only forcing the environment: 6.8s

So it seems that with aliases the main cost is in forcing the environment.
As a result, my current work on avoiding recomputing the environment for aliases could help.
And it might be worth trying something more radical, such as avoiding applying substitutions in the environment itself.

@vicuna


commented Oct 9, 2014

Comment author: @garrigue

Tried again on another machine, with the environment sharing patch:

4.02 branch without sharing

  • default 0.7s
  • short-paths 3.7s
  • forcing 3.3s

4.02 with sharing patch

  • default 0.14s
  • short-paths 1.8s
  • forcing 1.1s

So the main improvement is in the forcing part, but short-paths is twice as fast as a result...
I uploaded the patch, but it still seems slightly buggy, as short-paths gives some wrong results (strangely, all other tests but one go through).
Also, there is no way this is going into 4.02, as there is a small incompatibility concerning (module type of A) when A is an alias. Basically, the alias becomes really transparent. I think this is reasonable, as it is easier to explain than the current behavior, but it needs to be documented.

@vicuna


commented Oct 9, 2014

Comment author: @sliquister

In case it's helpful, judging from strace, for a compilation taking 1.9s, I see a 1.1s gap in the trace right before loading async_kernel (so I think something in the typer happens when core is fully loaded, applying some substitution maybe?), a 0.4s gap right before writing the error message (I suppose after async is fully loaded) and the remaining 0.4s seem to be loading/searching the various cmis.

I am not sure this would be applicable except in toy examples, but it sounds like the whole environment doesn't have to be forced. In the example, the types are int and char, so why look in any module? Modules can only contain type paths with at least one module component, so those paths are too long.

Also, still on the topic of forcing less of the environment: when trying to shorten Deferred.t (from async), looking in Core is not going to help, simply because Async depends on Core and Deferred.t is abstract, so Deferred.t can't have an alias in Core. I don't have a concrete improvement to suggest that uses this property though.

@vicuna


commented Oct 9, 2014

Comment author: @garrigue

> In case it's helpful, judging from strace, for a compilation taking 1.9s, I see a 1.1s gap in the trace right before loading async_kernel (so I think something in the typer happens when core is fully loaded, applying some substitution maybe?), a 0.4s gap right before writing the error message (I suppose after async is fully loaded) and the remaining 0.4s seem to be loading/searching the various cmis.

This is a bit strange, as these numbers do not seem to coincide with what I am observing using 4.02.
Basically, just loading and opening the cmis should cost almost nothing.
For Core and Async on 4.02.1 I get 0.07s.
(The first measure in answer 12337 was wrong by 10x. Also, the sharing patch was a bit slower because it was actually doing a bit of forcing, but this can easily be made lazy, as in the new version of the patch.)

The probable explanation is that actually this 0.07s is only for loading Core and Async, but not their dependencies, which may be loaded lazily. This means that the loading of async_kernel is triggered by the search, so what you see is really a mix of search, loading and substitution (intermingled because of laziness), and it is hard to conclude anything from it.

> I am not sure this would be applicable except in toy example, but it sounds like the whole environment doesn't have to be forced. In the example, the types are int and char so why look in any module? Module can only contain type paths with at least one module component, so they are too long.

> Also, still on trying to force less the environment, when trying to shorten Deferred.t (from async), looking in Core is not going to help simply because Async depends on Core and Deferred.t is abstract, so Deferred.t can't have an alias in Core. I don't have a concrete improvement to suggest that uses this property though.

Yes, I've thought about all these heuristics. But this is not the way it is implemented: we build the complete map before looking at the type. Such a heuristic would require doing a first pass over the type (which we might not actually know fully when we build the map), and would only work when none of the types are qualified.
Also, the specification of -short-paths is not only to choose the shortest path, but a canonical one.
I.e. all paths that expand to the same one should be printed the same way.
So it still means that at least you must create a map for unqualified types, which may trigger a lot of forcing.
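
A toy OCaml illustration of the canonicity requirement (names invented): two distinct aliases can expand to the same internal path, and both occurrences must be printed the same way.

```ocaml
module A = struct type t = T end
module B = A
module C = A

(* B.t and C.t both expand to the internal path A.t; the specification
   requires one canonical printed form for both, so the printer cannot
   decide per occurrence: it needs the map over the whole environment,
   even when the types involved are already short or unqualified *)
let b : B.t = B.T
let c : C.t = C.T
```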

@vicuna


commented Oct 10, 2014

Comment author: @garrigue

I have fixed the bug in the patch (version 4), but unfortunately it reduces the sharing, and we don't gain that much anymore :(

Core with short-paths : 3.7s down to 2.3s (40% speedup)
Core+Async with short-paths: 4.2s down to 2.6s (40% speedup)

Maybe we should try to do something about #2871, outside of laziness...

@vicuna


commented Oct 10, 2014

Comment author: @damiendoligez

Note: this doesn't look like something that can be fixed in the (very short) time frame of 4.02.1, especially if the fix introduces some incompatibilities, so I'm postponing to the next version.

@vicuna


commented Oct 13, 2014

Comment author: @garrigue

Added a new patch env-share-components5.diff, which fixes the bug without disabling local sharing.
So we get the same performance as in 12337 (actually a bit better, since we end up doing even less forcing, so that I get 1.5s for Core and 1.7s for Core+Async, i.e. about 60% speedup).

I plan to eventually commit this in trunk after discussion on the developers' list.
This does not preclude trying something else to get faster, but I have no idea that would work in all cases.

@vicuna


commented Oct 15, 2014

Comment author: @garrigue

I've tried another approach, with the printtyp-incremental-map.diff patch.
The idea is to make the building of the map breadth-first, and only deepen it when we want to print a path which is longer than the current depth.
This is very fast in your example, but of course it depends a lot on the length of the paths involved.
For instance, try replacing 'e' by (Queue.create ()) or (Core.Bitstring.create 3) or (Bitstring.create 3), and you will get very different behaviors.

Could you give it a try in practice?
It should be safe to combine it with env-share-components5.diff.

@vicuna


commented Oct 15, 2014

Comment author: @garrigue

printtyp-incremental-map2.diff respects the original specification: it incrementally deepens the traversal of the environment until it finds an optimal path, or there is nothing to deepen.
And it is actually faster than printtyp-incremental-map.diff :-)

@vicuna


commented Oct 15, 2014

Comment author: @let-def

Just to let you know, I started working on specific tweaks for merlin.
The work is on two fronts. Right now, we have an alternative implementation with a weaker specification, only considering shortest paths using module aliases.
The path of a type is rewritten to the shortest one considering all module aliases reachable from the current scope, after removing prefixes from opened modules. Type aliases themselves are not taken into account.
With some discipline, this gives reasonable results (though less optimal) in a much shorter time.
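
A sketch of that weaker rewriting (all names hypothetical, loosely modelled on the Core/Async layout): only opened prefixes and module aliases are considered, so no type-alias expansion or full environment scan is needed.

```ocaml
module Async = struct
  module Std = struct
    module Deferred = struct type 'a t = Pending end
  end
end
open Async.Std

(* the internal path of [x]'s type is Async.Std.Deferred.t; stripping the
   prefix of the opened module yields Deferred.t, and an alias such as
     module D = Deferred
   would shorten it further to D.t; neither step expands type aliases *)
let x : int Deferred.t = Deferred.Pending
```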

The second part is bringing this into the original -short-paths:

  1. not computing all expanded aliases but instead using module aliases to share parts of the path map.
  2. as a simplification, only considering opened modules as roots of the search.
  3. more aggressive caching

For 2., the reason is that most of the paths that are expected to be rewritten when used from Core are reachable from "open Core.Std" / "open Async.Std".
For 3., the reason is that merlin tends to be invoked a lot of times from mostly similar but slightly different environments.
When asking for the type of an identifier, merlin will choose the closest environment for printing. This makes short_paths recompute the whole path map at each invocation, even though 99.9% of the work already done could be reused.

@vicuna


commented Oct 16, 2014

Comment author: @garrigue

I see.

Unfortunately, considering only module aliases is only half of the specification: IIRC, one of the goals of short-paths was to simplify paths that went through functors, and for that you need to fully simplify.
You are of course free to do anything you please in Merlin, but for the compiler we want a clear specification of what is done.

Also, with environment sharing, (2) might not give you much: once you've opened Async.Std and Core.Std you already have all of Core and Async there.
Note that we could also have chosen that as specification: just remove the prefixes for opened modules.
This can be done by looking at the summary, without building a map for the environment.
This is just a wholly different specification.

Caching is something I thought about before; ideas are welcome.
However, one has to be careful not to do anything that would break the environment, as it is used for type inference, not just to search for short paths :-)

@vicuna


commented Nov 9, 2014

Comment author: @garrigue

Just a reminder that this PR is in feedback mode.
As I have shown in my patches, it is possible to improve the performance of -short-paths to some extent (about 60%) without changing its semantics, but this is of course limited, because the semantics requires looking up the whole environment.

  • how useful would this improvement be?
  • how important is optimizing for incremental use (for command line use only one map is computed anyway)?
  • would it be better to change the semantics? When -short-paths was introduced, there were no module aliases; exploiting them indeed offers a simpler way.

@vicuna


commented Nov 9, 2014

Comment author: yminsky

Here are my thoughts:

  • def experimented with other semantics, and I think it was pretty clear that they were materially worse. The current semantics of short-paths seem quite good (though there are a few odd corner cases that I come across from time to time that I still don't understand; I'll try to isolate a good example).
  • I do think incremental use is pretty important for Merlin and utop.
  • I think the 60% improvement would help a lot. Right now, on a nice modern laptop, it takes about 3 seconds to evaluate "let x = 3" the first time in utop. Dropping that down to 1 second would reduce the pain by quite a bit, though it's still on the high side, especially since it's going to be worse on older hardware.

def did some experiments that indicated that most of the cost was in the computation, not in loading the modules, which makes one wonder if some kind of eager pre-computation of the map along with the ordinary build would help.

Another simple heuristic that seems worth trying would be to not walk the entire environment: to use the ordinary short-paths heuristic, but restricted to the modules that OCaml requires to be loaded given what is being referred to. Thus, any types in modules that by the current rules would not be linked in would not be considered for short paths either.

@vicuna


commented Nov 11, 2014

Comment author: @garrigue

I'm not sure the problem is the same for Merlin and utop.
Namely, in utop the map is only computed once after each input, whereas in Merlin you may need to update it at every keystroke, don't you?

If performance matters, it is possible to do more caching, i.e. only recompute the diff w.r.t. the previous environment.
However, since this only works for additions, it may be a bit tricky.

Concerning the cost, I think my numbers already show that there are two costs: forcing the environment and computing the map. For compilation, the cost is dominated by the environment, as both are done only once. But for interactive use, the environment cost becomes negligible after the first computation, so we really need to cache the map more efficiently.

@vicuna


commented Nov 11, 2014

Comment author: yminsky

The utop and merlin cases sound like they should be different, but my experience is that they're the same. In particular, the first lookup by merlin is slow, but after that it gets faster, just as utop does.

I showed this to def, who saw the same effect, but didn't understand how it could be so. Def, do you now understand what's going on?

@vicuna


commented Nov 11, 2014

Comment author: @garrigue

The similarity is not surprising: the first lookup requires forcing the whole environment, which is the most costly part. The result of this forcing is fully cached. Subsequent calls only require forcing new parts of the environment (usually small), and recomputing the map (if the environment has changed).
My point was only that what is acceptable for utop is not necessarily acceptable for Merlin (in terms of responsiveness).

@vicuna


commented Nov 12, 2014

Comment author: yminsky

Fair enough. My experience though is that for both of them it's pretty close: a little too long to be pleasant, but an improvement of a factor of 2 or 3 would make a big difference.

@vicuna


commented Jan 14, 2015

Comment author: @damiendoligez

garrigue, do you think this patch is mature enough for inclusion in 4.02.2?

@vicuna


commented Jan 15, 2015

Comment author: @mshinwell

I thought there were still concerns about speed (Damien, Ron knows about this)

@vicuna


commented Jan 15, 2015

Comment author: @garrigue

Since env-share-components.diff changes the semantics (in a very minor way), it should rather go in trunk.

This said, if there is demand, it is possible to include only printtyp-incremental-map2.diff in 4.02.2. It should make the toplevel much faster, by avoiding always forcing everything, and it only concerns -short-paths, so it won't break anything.

Whether we can do better is another question.

@vicuna


commented Jan 15, 2015

Comment author: yminsky

If there's a material improvement we can get for the toplevel, I'd be quite eager to get it for 4.02.2. Right now, our custom built utop (with lots of libraries linked in) takes 10 or so seconds to evaluate "let x = 3;;". This is a pretty serious usability problem.

Weirdly, it gets faster the second time you do it, but it still takes real time, say 4 seconds. Subsequent expressions are essentially instantaneous.

@vicuna


commented Jan 16, 2015

Comment author: @garrigue

Merged printtyp-incremental-map2.diff in 4.02 and trunk (after some cleanup), at revisions 15774 and 15775.
Please check that it works as expected before we release 4.02.2 (the testsuite is not very complete for -short-paths).

@vicuna


commented May 1, 2015

Comment author: @lpw25

I can confirm that with the 4.02 branch the Jane St. custom utop now evaluates "let x = 3;;" instantaneously. So this is a definite improvement.

In terms of testing that the behaviour of -short-paths has not changed, there is not much that I can say. There is no large collection of programs containing errors with which to perform such tests.

However, whilst trying to test the performance I did get an unhandled "Not_found" exception from the compiler, which I am currently investigating.

@vicuna


commented May 1, 2015

Comment author: @lpw25

The "Not_found" comes from the call to find_module in scrape_alias_safe, which gets called during best_type_path. I suspect that it is forgetting to handle the case of a reference to an unavailable ".cmi" file.

@vicuna


commented May 2, 2015

Comment author: @garrigue

Do you have a repro case?
I uploaded a fix for scrape_alias_safe, which should protect against this.
Could you check that it does what is expected?
And merge it if it is ok? (I'm on holiday until May 6)

@vicuna


commented May 4, 2015

Comment author: @lpw25

> Do you have a repro case?

Not one that is easily extracted.

> Could you check that it does what is expected?

I confirm that it fixes the issue.

> And merge it if it is ok?

I don't have commit access, so can someone else please merge this?

@vicuna


commented May 5, 2015

Comment author: @damiendoligez

Applied patch to branch 4.02 (rev 16078) and trunk (rev 16079).

Some cleanup of the Changes file sneaked into the commit for 4.02, sorry.

I'm now postponing this issue to 4.03, as we have done all we can for 4.02.2, any further changes look like they'll be slightly incompatible.

@vicuna


commented Nov 23, 2015

Comment author: @xavierleroy

Do we still have this performance issue in 4.02.2 or in trunk? Otherwise I move to mark this PR as resolved.

@vicuna


commented Nov 23, 2015

Comment author: @gasche

My understanding is that short-paths is still a source of performance issues for some workflows, and that there will be further changes down the chain. Thomas Réfis and Frédéric Bour have both worked on short-paths recently, and I think that they plan to work on it again.

I kept the issue open, but moved it to "later" target as it's clear that it is not on the agenda for 4.03.

@vicuna


commented Mar 7, 2017

Comment author: @mshinwell

Leo is working on a completely new implementation of short-paths with the aim of solving this performance problem.
