Skip to content
This repository has been archived by the owner on Jan 10, 2023. It is now read-only.

method to convert CharTrie into a compact trie for space-optimized representation (feature request) #27

Open
manthanchauhan opened this issue May 25, 2019 · 30 comments

Comments

@manthanchauhan
Copy link

manthanchauhan commented May 25, 2019

char_trie['he'] = 3
char_trie['him'] = 4

---h

    |__e: 3

    |__i

        |__m: 4

can be compressed as,
---h

    |__e: 3

    |__i m:4

since removing each redundant node saves atleast 56 bytes, this can provide high memmory optimisation to a data structure that is used for memmory optimized storage.

@manthanchauhan
Copy link
Author

I'd like to contribute code for this, if required.

@mina86
Copy link
Contributor

mina86 commented May 25, 2019

I’m always happy to review pull requests though adding a ‘compressed’ representation to the library is rather involved change so this might benefit from discussion of the design before code is written. There are non-obvious cases to consider such as how would the data structure look with trie['foo'] = trie['foobar'] = trie['foobarbaz'] = 42.

Lastly, you should be looking at https://github.com/mina86/pygtrie repository since that’s where the latest code resides.

@manthanchauhan
Copy link
Author

manthanchauhan commented May 26, 2019

One solution for the example you provided is, node decomposition, as shown,
trie['foobarbaz'] = 42

---foobarbaz: 42

trie['foobar'] = 42
no we will perform sub-string matching of 'foobar' starting from the keys at root node. From the first character in 'foobar' we'll find a key in the node such that it has the same corresponding characters.

in this example, 'foobar' matched with 'foobarbaz' upto 5th position, therefore, the root node breaks at the 5th position creating,
---foobar: 42

   |-- baz: 42

after trie['foo'] = 42,
---foo: 42
   |-- bar: 42
       |-- baz: 42

It is to be noted that, at a node, no two keys start form same character and hence time complexity of addition is unaffected after compression.

@manthanchauhan
Copy link
Author

In the above trie, if I perform,
trie['foobie'] = 9

the resulting structure would be,
---foo: 42
   |-- b
       |--ie: 9
       |--ar: 42
           |-- baz: 42

ie, node with key 'bar' decomposed at 0th position ('b').

@mina86
Copy link
Contributor

mina86 commented May 26, 2019

What if I then did trie['bar'] = 42? If I understand you correctly, you suggest that the root node would keep ‘foo’ as the key in children dictionary but that won’t quite work and goes opposed to how trie is meant to behave.

@manthanchauhan
Copy link
Author

If you do trie['bar'] = 42,
the root node will now have two children, one with key 'foo' and another with key 'bar'. I don't understand your point, how does it violates any behaviour of a trie.

@mina86
Copy link
Contributor

mina86 commented May 27, 2019

When a lookup for ‘foo’ is done, trie’s behaviour is to look for a child ‘f’ of the root node. If root’s node children wolud hold ‘foo’ and ‘bar’, you’d never find ‘foo’ in there.

@manthanchauhan
Copy link
Author

Look-ups would be implemented similar to additions, at each node we find the key which has the same first character as that of the query,
If I were to look-up for 'foobar' in the following trie,
---foo: 42
   |-- b
       |--ie: 9
       |--ar: 42
           |-- baz: 42

at root node 'foo' starts with an 'f' just like 'foobar', this indicates that 'foobar' must belong to the subtrie at 'foo'.

in the 'foo' subtrie now we lookup for 'bar', this time 'b' matches with 'bar',

now in 'b' subtrie we look-up for 'ar' where it matches with 'ar', since the query has been found, we stop here.

Yes, at every node, we need to traverse to all the children until we find the matching first character, therefore time complexity increases. But, lots of storage is saved which is often helpful for storage concerned applications.

@mina86
Copy link
Contributor

mina86 commented May 27, 2019

The _Node object has two fields: children and value. In your diagram, what does:

---foo: 42
   |-- b

correspond to? What’s the value of children and value fields in the root node?

Because if the root node has children field equal to a {'foo': <some-node-object>} than there is no efficient way to find the foo key when looking for foobar since the way trie works is that you will look for a child under f.

When I say that design discussion may be in order, I mean discussion of how each kind of node is represented. What are children and value fields values and how are key lookup, addition and deletion performed.

@manthanchauhan
Copy link
Author

Your interpretation of children field is correct.

after compression, now trie keys are strings and hence now at each node, rather than looking of a character like 'f' we look for strings beginning with 'f' like 'foo',

at each node, log(N) complexity is required to find a key, where N is the total number of children of that node. Yes the operations are not time efficient, but this is the cost of the saved memory space.

If I'm right, combining each pair of nodes saves 56B.

@mina86
Copy link
Contributor

mina86 commented May 27, 2019

at each node, log(N) complexity is required to find a key, where N is the total number of children of that node.

I don’t see how that’s correct. The actual complexity is linear to the length of the prefix. Recall that Python uses hash maps for dictionaries, not binary search trees. As a result, when looking for child ‘foobar’ you have to first check ‘f’, than ‘fo’, then ‘foo’ etc.

At that point it’s simpler to just have a flat dictionary and do key lookups by doing longer and longer prefix matches.

@manthanchauhan
Copy link
Author

manthanchauhan commented May 27, 2019

Actually, at one node there will be only one key starting from 'f' or any other character. In the following trie,
---foo: 42
   |-- bar: 42

if I do trie['foob'] = 9
this happens,
---foo: 42
   |-- b: 9
        |--ar: 42

the node with key 'bar' decomposes at 'b' to avoid having both 'bar' and 'b' as keys.

Once we match the first character of query (in O(logN) ), we can simply traverse the matched key to find out upto which position it matches with the query ie, 'foo' with 'foob' upto 2nd position, then in the subtrie we search for remaining part of the query ie. the remaining 'b' of 'foob'.

@mina86
Copy link
Contributor

mina86 commented May 27, 2019

So then also if I add trie['bar'] = 42 the root node gets decomposed?

@manthanchauhan
Copy link
Author

no, in this case the root node will have one more key as 'bar', due to the fact that no existing key of root node matches with 'b'.

@mina86
Copy link
Contributor

mina86 commented May 27, 2019

So again, how do you look up a key in such a node?

@manthanchauhan
Copy link
Author

If sorting is enabled, as done by the function enable_sorting(), we can apply binary search on all the keys of a node to search that key which starts with a particular character c, where c is the first character of query string.

otherwise, if sorting is not enabled we have to perform a linear search.

Do note that, query string will change for each node, depending upto which character the match occurs in previous node.

@mina86
Copy link
Contributor

mina86 commented May 28, 2019

If sorting is enabled, as done by the function enable_sorting(), we can apply binary search on all the keys of a node to search that key which starts with a particular character c, where c is the first character of query string.

otherwise, if sorting is not enabled we have to perform a linear search.

I don’t think either of those is acceptable. The first option makes lookup O(n log n) operation; the second makes it O(n) where n is number of children. Meanwhile, the idea behind a trie is to have O(k) lookup where k is the length of the key.

I don’t see getting away from children being keyed by a single character (in case of CharTrie). The path could be compressed but that would need to be for the remaining of the prefix, something like:

trie['foo'] = 1
trie['foobar'] = 2
trie['foobarbaz'] = 3
trie['bar'] = 4

\+- f -> (oo) -> 1
 |               \-- b -> (ar) -> 2
 |                                \-- b -> (az) -> 3
 \- b -> (ar) -> 4

The question then is how to encode and handle the compressed path.

@manthanchauhan
Copy link
Author

Once sorting is enabled, look-ups should have a worst-case time complexity of O( k log n ), and that will be the case when the trie remains unaffected after compression. For an average case, the time complexity should be O( log n ) + O( k ).

You suggest that we use only the first character of a path as a key, right? which should save the search time of a key.

@mina86
Copy link
Contributor

mina86 commented May 28, 2019 via email

@manthanchauhan
Copy link
Author

manthanchauhan commented May 29, 2019

Thinking about your idea for character keys,

If sorting is enabled, as done by the function enable_sorting(), we can apply binary search on all the keys of a node to search that key which starts with a particular character c, where c is the first character of query string.
otherwise, if sorting is not enabled we have to perform a linear search.

I don’t think either of those is acceptable. The first option makes lookup O(n log n) operation; the second makes it O(n) where n is number of children. Meanwhile, the idea behind a trie is to have O(k) lookup where k is the length of the key.

I don’t see getting away from children being keyed by a single character (in case of CharTrie). The path could be compressed but that would need to be for the remaining of the prefix, something like:

trie['foo'] = 1
trie['foobar'] = 2
trie['foobarbaz'] = 3
trie['bar'] = 4

\+- f -> (oo) -> 1
 |               \-- b -> (ar) -> 2
 |                                \-- b -> (az) -> 3
 \- b -> (ar) -> 4

The question then is how to encode and handle the compressed path.

at each node, we can use the first character of the query string to locate a key-value pair, here, each value is a collection of a string and a node pointer.

once we locate a key-value pair, we traverse the associated string upto the position it matches with the query string, which would be equivalent to traversing that much nodes of an uncompressed CharTrie. The remaining part of the query string will be passed to the pointed _Node.

@mina86
Copy link
Contributor

mina86 commented May 30, 2019 via email

@manthanchauhan
Copy link
Author

sorry but can you please provide some references for this walk_towards interface?

@mina86
Copy link
Contributor

mina86 commented Jun 1, 2019

@manthanchauhan
Copy link
Author

manthanchauhan commented Jun 4, 2019

This sounds workable. I only wonder how would it interact with interfaces such as walk_towards.

For this to work, we'll need to modify the _path_from_key() functions, because in this proposed model of compact tries, the key of a node is not essentially it's path from the root.

trie['foo'] = 1
trie['foobar'] = 2
trie['foobarbaz'] = 3
trie['bar'] = 4

+- f -> (oo) -> 1
| -- b -> (ar) -> 2
| -- b -> (az) -> 3
- b -> (ar) -> 4

in this trie structure, if we consider the key foobarbaz, the path for this node should be fbb and not foobarbaz.

with the above described path definition walk_towards function will work unaffected.

@mina86
Copy link
Contributor

mina86 commented Jun 4, 2019

I don't think that'll work. Whether compressed representation is used or not, it should be transparent to the user so walk_towards should go through the whole foobarbaz regardless of how the internal node structure looks like.

@manthanchauhan
Copy link
Author

I don't think that'll work. Whether compressed representation is used or not, it should be transparent to the user so walk_towards should go through the whole foobarbaz regardless of how the internal node structure looks like.

Then we can have to do a little modification in the walk towards function.

If we consider that the value at each children key is a tuple(string, Node),
then in the walk_towards function we have to increment pos in each iteration by,

pos += 1 + len( node.children.get( path [pos ] ) [0] )

@mina86
Copy link
Contributor

mina86 commented Jun 12, 2019

Sorry about delay.

If we consider that the value at each children key is a tuple(string, Node),
then in the walk_towards function we have to increment pos in each iteration by,

pos += 1 + len( node.children.get( path [pos ] ) [0] )

That doesn't sound viable for two reasons.

First of all, the result of walk_towards should not depend on the internal structure of the trie. Whether compressed paths are used or not, iterating over the steps should yield all keys on the path to given key. In other words, the following test needs to pass:

import pygtrie

t = pygtrie.CharTrie(foobar=42)
steps = t.walk_towards('foobar')
res = ' '.join('<%s>' % step.key for step in steps)
assert res == '<> <f> <fo> <foo> <foob> <fooba> <foobar>'

Second of all, if the motivation is to reduce amount of memory used, I think only children with compressed paths should be a (sub-path, node) tuple. This doesn't fundamentally change anything other than code needing to handle both cases.

There's also another aspect that ideally, the following test should pass as well:

from itertools import islice
import pygtrie

t = pygtrie.CharTrie(foobar=42)
res = ''
for step in t.walk_towards('foobar'):
    res += '<%s>' % step.key
    if 'fred' in t:
        del t['fred']
        del t['foobaz']
    else:
        t['fred'] = t['foobaz'] = 24
assert res == '<><f><fo><foo><foob><fooba><foobar>'

In other words, converting nodes between compressed and not-compressed should not break walk_towards method.

@manthanchauhan
Copy link
Author

Okay, but for compressed tries to pass the walk_through() test, some of the returned path will not point to any Node.

That is, if the trie contains,
foobar=5 foob=6

then while walk_through() the path foo will be returned without any Node, if that is accepted.

@mina86
Copy link
Contributor

mina86 commented Jun 13, 2019

Correct. This is the desired behaviour because it's the current behaviour (currently the method goes through all nodes regardless of whether they have values assigned to them or how many children they have) and methods should not expose details of the implementation.

@manthanchauhan
Copy link
Author

I don't think that'll work. Whether compressed representation is used or not, it should be transparent to the user so walk_towards should go through the whole foobarbaz regardless of how the internal node structure looks like.

Then we can have to do a little modification in the walk towards function.

If we consider that the value at each children key is a tuple(string, Node),
then in the walk_towards function we have to increment pos in each iteration by,

pos += 1 + len( node.children.get( path [pos ] ) [0] )

For that, we can use this trick and inbetween every two valid Node we can pass,
len(node.children.get( path [pos] ) [0] ) None nodes to make up for the invalid keys inbetween.

I mean without actually traversing the path character by character.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants