Skip to content
This repository has been archived by the owner on Jan 10, 2023. It is now read-only.

Provide "rightmost key at most" and "leftmost key at least" queries (feature request) #37

Open
jtweigle opened this issue Nov 17, 2021 · 6 comments

Comments

@jtweigle
Copy link

Given a Trie, I'd like to be able to search for a key k, and get either k or the closest thing on either side of k.

More precisely, the query would yield one of these:

  • k and its value, if k is in the trie,
  • the immediate lexicographical predecessor (or successor) of k if one exists, or
  • the fact that the trie has no values before (or after) k.
    (When I say "lexicographical" I'm talking about the ordering given by the structure of the trie.)

Currently I can see how to implement it by iteratively calling has_node, or iteratively calling iterkeys (or iteritems), building up a larger and larger prefix and maybe handling KeyError exceptions to decide when to go forward. Or it could be implemented naively by traversing over all keys linearly (but that wouldn’t take advantage of the convenient structure of a trie). Finally, I’m half-sure it could be implemented by using traverse, but it’s not obvious how. The traverse function seems to work like foldr on the path to a key—which should be enough, as long as you also implement a "rightmost child" or "leftmost child" query.

All of these options (except the naive one) involve non-trivial work that deals with the structure of the tree, so I think these queries should be handled internally by the class.

@mina86
Copy link
Contributor

mina86 commented Nov 17, 2021

There is longest_prefix which sounds like ‘rightmost key at most’. And ‘leftmost key at least’ sounds like next(iter(trie.keys(prefix=key)), None).

import pygtrie

trie = pygtrie.CharTrie({'ra': 1, 'rabar': 2, 'rabarbar': 3})

assert 'ra'    == trie.longest_prefix('ra').key
assert 'ra'    == trie.longest_prefix('rab').key
assert 'rabar' == trie.longest_prefix('rabar').key

assert 'ra'    == next(iter(trie.keys(prefix='ra')), None)
assert 'rabar' == next(iter(trie.keys(prefix='rab')), None)
assert 'rabar' == next(iter(trie.keys(prefix='rabar')), None)

Though to guarantee ordering for the latter you’d need to enable sorting in the trie.

@jtweigle
Copy link
Author

jtweigle commented Nov 18, 2021

Oh, you’re talking about finding the predecessor on the path to a key, or successor in the subtrie after the key (or the key itself, if it’s found):
pred-succ-on-path

I was trying to ask about the lexicographical predecessor and successor. Let me try to draw what I mean. Say we have a trie like this, where sorting is enabled:
Screenshot from 2021-11-18 10-44-31
Nodes circled in blue are keys, and the node circled in orange is the one we want to find (dashed lines mean it’s not actually in the trie, but that’s where it would be in the trie).

Since "ba" isn’t in the trie, the rightmost key at most "ba" is its lexicographical predecessor, "aac".
Screenshot from 2021-11-18 10-44-36
The leftmost key at least "ba" is "ca".
Screenshot from 2021-11-18 10-44-44

Note that if we add "c" as a key, the lexicographical successor of "ba" is "c".
Screenshot from 2021-11-18 10-44-50
I didn’t realize this before, but the lexicographical successor isn’t always "leftmost": it's leftmost-and-topmost-with-preference-for-topmost. This is my bad: "leftmost key at least" is kind of a misnomer in this case. It should be something like key, or failing that, its successor.

And of course, it may happen that the searched-for, not-in-the-trie key has no lexicographical successor…
Screenshot from 2021-11-18 10-45-05
…or predecessor…
Screenshot from 2021-11-18 10-45-15
…or if the key is in the trie, we just get that key back.
Screenshot from 2021-11-18 10-44-57

The two queries are:

  • If you listed all keys in the trie in order, what would be the last key in that list that is no greater than k?
  • If you listed all keys in the trie in order, what would be the first key in that list that is no less than k?

I think these are still well-defined even if sorting is not enabled, since listing keys still does an in-order traversal of the trie. The only difference is that "lexicographical" would now be a confusing term: it would refer to the implicit ordering of keys by the structure of the trie, rather than one given by the order of the alphabet.

Now that I’ve written this out, I kinda think it might be better just to implement lexicographical predecessor and lexicographical successor functions, since the queries I’m asking for would be easy to construct out of those.

What do you think?

@mina86
Copy link
Contributor

mina86 commented Nov 18, 2021

So that’s upper_bound and lower_bound operations which are available on binary search trees.

The only difference is that "lexicographical" would now be a confusing term: it would refer to the implicit ordering of keys by the structure of the trie, rather than one given by the order of the alphabet.

That wouldn’t quite work the way you’ve showed then. Without sorting there’s no reason that b in your second image should go between a and c. The way dictionaries are implemented in Python it would in fact go after c.

I’m not convinced this make sense for a trie though. Typical solution when moving to previous or next key is required is to use a binary search tree. Trie addresses different needs.

Is there a practical case you need those operations for?

@jtweigle
Copy link
Author

Ah, good to know that’s what these are called!

Right, the picture wouldn’t look the same without sorting: if we inserted "ba" now, it would go all the way to the right, because the prefix "b" hasn’t been added yet. In this case the problem is trivial.
Screenshot from 2021-11-18 19-35-52

But there are cases where it’s not trivial. If the key has a nonempty prefix in the trie, then even without ordering it would still belong in the subtree below that prefix.
Screenshot from 2021-11-18 19-49-21

Is there a practical case you need those operations for?

There is! I have a use case where I’m mapping bit strings, all of some length n, to (alternating) Boolean values, so that the value for any given n-length bit string is defined to be the value of the lower bound of that bit string. Here I prefer a trie to a binary tree, because I don’t have to worry about balancing and can guarantee that all root-to-leaf paths are of length n. Plus, of course, keys are locations and locations are keys.
Screenshot from 2021-11-18 20-52-09

By the way, for my specific problem, there’s a different approach that involves storing in each internal node of the trie the parity of the number of children.
Screenshot from 2021-11-18 20-49-24
Is there a way to store this parity information in internal nodes with this library?

@mina86
Copy link
Contributor

mina86 commented Nov 18, 2021

Sorry, I misspoke a bit. Lower bound is the smallest key that is greater or equal given key. Upper bound is smallest key that is greater than the given key. So operations you’re describing are ‘predecessor of lower bound’ and ‘lower bound’.

Can’t you just rely on next(iter(trie.keys(key)), None)?

@jtweigle
Copy link
Author

Oh, okay, thanks for the clarification about the names!

What do you mean about relying on next(iter(trie.keys(key)), None)? It sounds like, for example:

>>> from pygtrie import Trie
>>> t = Trie()
>>> t.enable_sorting()
>>> t["0100"]=None
>>> t["0101"]=None
>>> t["0111"]=None
>>> # I want to ask about "0110" and get back "0101"
>>> next(iter(t.keys("0110")), None)

Then of course I get a KeyError.

(Just an aside: the more I think about my own problem, the more I think I do want to use the parity approach. So perhaps I’ll end up making my own “parity-encoding trie” structure. However, implementing this bst-like search behavior could still be useful for someone else.)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants