New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use discrimination trees for type search #250

Open
taktoa opened this Issue Apr 29, 2018 · 4 comments

Comments

Projects
None yet
2 participants
@taktoa
Copy link
Contributor

taktoa commented Apr 29, 2018

There is a data structure called a discrimination tree that allows one to compactly store a large number of terms (rose trees) with metavariables and then, given another term with metavariables, efficiently filter the set of contained terms down to a subset that may unify with the given term.

Discrimination trees are, at their core, just tries whose nodes are of type Maybe Node where Node is a type with one nullary constructor for each AST (term) constructor. Adding a term to a discrimination tree involves doing a preorder traversal of the term to get [Either Var Node] and then mapping either (const Nothing) Just and inserting the result into the underlying trie. Querying the discrimination tree involves a backtracking search. The leaves of the trie should have the set of terms whose preorder traversal is equivalent to that path through the trie.

If you hash-cons the discrimination tree, it can take up less space, which would be desirable for an on-disk representation, but may not be worth the effort. If you want to store a discrimination tree on disk, it would probably be wise to use Word32 offsets instead of pointers (or Word16 if the tree is small enough, which it will often be if someone is running the Hoogle indexer on lots of small packages and then combining the indexes). Discrimination trees are really just about filtering a large set of terms to yield a subset that contains the set that unify with some term, so if you have an implementation of unification that handles Haskell constraints, you can easily handle searching with constraints by running the unification algorithm on each term the discrimination tree outputs and throwing out the ones that fail to unify due to unsatisfied constraints.

This problem in general is called term indexing. There are many data structures that can be used for term indexing, and a discrimination tree is one of the simplest. There are other options, like substitution trees, that may be more efficient. You may want to consult the resources below for more details:

If you implement discrimination trees for Hoogle, please consider putting the result of your efforts as a separate package on Hackage, as there is no package existing for them. I am working on an implementation (permalink) but it may take some time for it to be ready.

@ndmitchell

This comment has been minimized.

Copy link
Owner

ndmitchell commented Apr 29, 2018

That sounds like a good thing to use - I'd welcome help integrating it with Hoogle, and a plug-in library doing the heavy lifting would be awesome.

@taktoa

This comment has been minimized.

Copy link
Contributor

taktoa commented Apr 29, 2018

Another thing to note: to handle rank-n types, which may have forall binders in them, we must erase all the forall binders (when indexing and when querying) by lifting them to the top level, since most/all term indexing data structures only work for terms without binders. This will give us some false positives, since querying the discrimination tree for forall s. (Int -> ST s) -> Int will yield (forall s. Int -> ST s) -> Int (assuming the latter has been inserted). The false positives can then be eliminated via unification. Since a rank-a type will never unify with a search query that is rank-b where a ≠ b, it may make sense to have two discrimination trees in the index: one for rank-1 terms, and one for rank-2 and up. It may also make sense to have more discrimination trees for different levels of higher-rank terms (one for rank-1, one for rank-2, one for rank-3, etc.); the tradeoff we are making is that the result of a querying will be more accurate but the size of the index may be larger (since splitting one discrimination tree into two will prevent sharing of trie nodes between similar terms).

@ndmitchell

This comment has been minimized.

Copy link
Owner

ndmitchell commented Apr 29, 2018

You really don't need to worry about rank-2 types - Hoogle already eliminates them already. They just don't occur enough in Haskell to care about.

@taktoa

This comment has been minimized.

Copy link
Contributor

taktoa commented Apr 29, 2018

Some implementations of term-indexing data structures I found:

More potentially relevant papers:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment