method to convert CharTrie into a compact trie for space-optimized representation (feature request) #27

manthanchauhan · 2019-05-25T04:14:10Z

char_trie['he'] = 3
char_trie['him'] = 4

---h

    |__e: 3

    |__i

        |__m: 4

can be compressed as,
---h

|__e: 3

|__i m:4

since removing each redundant node saves atleast 56 bytes, this can provide high memmory optimisation to a data structure that is used for memmory optimized storage.

manthanchauhan · 2019-05-25T14:21:33Z

I'd like to contribute code for this, if required.

mina86 · 2019-05-25T22:23:59Z

I’m always happy to review pull requests though adding a ‘compressed’ representation to the library is rather involved change so this might benefit from discussion of the design before code is written. There are non-obvious cases to consider such as how would the data structure look with trie['foo'] = trie['foobar'] = trie['foobarbaz'] = 42.

Lastly, you should be looking at https://github.com/mina86/pygtrie repository since that’s where the latest code resides.

manthanchauhan · 2019-05-26T03:32:45Z

One solution for the example you provided is, node decomposition, as shown,
trie['foobarbaz'] = 42

---foobarbaz: 42

trie['foobar'] = 42
no we will perform sub-string matching of 'foobar' starting from the keys at root node. From the first character in 'foobar' we'll find a key in the node such that it has the same corresponding characters.

in this example, 'foobar' matched with 'foobarbaz' upto 5th position, therefore, the root node breaks at the 5th position creating,
---foobar: 42

|-- baz: 42

after trie['foo'] = 42,
---foo: 42
|-- bar: 42
|-- baz: 42

It is to be noted that, at a node, no two keys start form same character and hence time complexity of addition is unaffected after compression.

manthanchauhan · 2019-05-26T03:59:59Z

In the above trie, if I perform,
trie['foobie'] = 9

the resulting structure would be,
---foo: 42
   |-- b
       |--ie: 9
       |--ar: 42
           |-- baz: 42

ie, node with key 'bar' decomposed at 0th position ('b').

mina86 · 2019-05-26T10:34:12Z

What if I then did trie['bar'] = 42? If I understand you correctly, you suggest that the root node would keep ‘foo’ as the key in children dictionary but that won’t quite work and goes opposed to how trie is meant to behave.

manthanchauhan · 2019-05-26T18:33:01Z

If you do trie['bar'] = 42,
the root node will now have two children, one with key 'foo' and another with key 'bar'. I don't understand your point, how does it violates any behaviour of a trie.

mina86 · 2019-05-27T00:43:53Z

When a lookup for ‘foo’ is done, trie’s behaviour is to look for a child ‘f’ of the root node. If root’s node children wolud hold ‘foo’ and ‘bar’, you’d never find ‘foo’ in there.

manthanchauhan · 2019-05-27T12:29:57Z

Look-ups would be implemented similar to additions, at each node we find the key which has the same first character as that of the query,
If I were to look-up for 'foobar' in the following trie,
---foo: 42
   |-- b
       |--ie: 9
       |--ar: 42
           |-- baz: 42

at root node 'foo' starts with an 'f' just like 'foobar', this indicates that 'foobar' must belong to the subtrie at 'foo'.

in the 'foo' subtrie now we lookup for 'bar', this time 'b' matches with 'bar',

now in 'b' subtrie we look-up for 'ar' where it matches with 'ar', since the query has been found, we stop here.

Yes, at every node, we need to traverse to all the children until we find the matching first character, therefore time complexity increases. But, lots of storage is saved which is often helpful for storage concerned applications.

mina86 · 2019-05-27T12:59:24Z

The _Node object has two fields: children and value. In your diagram, what does:

---foo: 42
   |-- b

correspond to? What’s the value of children and value fields in the root node?

Because if the root node has children field equal to a {'foo': <some-node-object>} than there is no efficient way to find the foo key when looking for foobar since the way trie works is that you will look for a child under f.

When I say that design discussion may be in order, I mean discussion of how each kind of node is represented. What are children and value fields values and how are key lookup, addition and deletion performed.

manthanchauhan · 2019-05-27T13:17:08Z

Your interpretation of children field is correct.

after compression, now trie keys are strings and hence now at each node, rather than looking of a character like 'f' we look for strings beginning with 'f' like 'foo',

at each node, log(N) complexity is required to find a key, where N is the total number of children of that node. Yes the operations are not time efficient, but this is the cost of the saved memory space.

If I'm right, combining each pair of nodes saves 56B.

mina86 · 2019-05-27T16:22:26Z

at each node, log(N) complexity is required to find a key, where N is the total number of children of that node.

I don’t see how that’s correct. The actual complexity is linear to the length of the prefix. Recall that Python uses hash maps for dictionaries, not binary search trees. As a result, when looking for child ‘foobar’ you have to first check ‘f’, than ‘fo’, then ‘foo’ etc.

At that point it’s simpler to just have a flat dictionary and do key lookups by doing longer and longer prefix matches.

manthanchauhan · 2019-05-27T16:32:01Z

Actually, at one node there will be only one key starting from 'f' or any other character. In the following trie,
---foo: 42
|-- bar: 42

if I do trie['foob'] = 9
this happens,
---foo: 42
|-- b: 9
|--ar: 42

the node with key 'bar' decomposes at 'b' to avoid having both 'bar' and 'b' as keys.

Once we match the first character of query (in O(logN) ), we can simply traverse the matched key to find out upto which position it matches with the query ie, 'foo' with 'foob' upto 2nd position, then in the subtrie we search for remaining part of the query ie. the remaining 'b' of 'foob'.

mina86 · 2019-05-27T17:15:48Z

So then also if I add trie['bar'] = 42 the root node gets decomposed?

manthanchauhan · 2019-05-27T19:09:39Z

no, in this case the root node will have one more key as 'bar', due to the fact that no existing key of root node matches with 'b'.

mina86 · 2019-05-27T19:41:19Z

So again, how do you look up a key in such a node?

manthanchauhan · 2019-05-28T06:06:16Z

If sorting is enabled, as done by the function enable_sorting(), we can apply binary search on all the keys of a node to search that key which starts with a particular character c, where c is the first character of query string.

otherwise, if sorting is not enabled we have to perform a linear search.

Do note that, query string will change for each node, depending upto which character the match occurs in previous node.

mina86 · 2019-05-28T06:22:47Z

If sorting is enabled, as done by the function enable_sorting(), we can apply binary search on all the keys of a node to search that key which starts with a particular character c, where c is the first character of query string.

otherwise, if sorting is not enabled we have to perform a linear search.

I don’t think either of those is acceptable. The first option makes lookup O(n log n) operation; the second makes it O(n) where n is number of children. Meanwhile, the idea behind a trie is to have O(k) lookup where k is the length of the key.

I don’t see getting away from children being keyed by a single character (in case of CharTrie). The path could be compressed but that would need to be for the remaining of the prefix, something like:

trie['foo'] = 1
trie['foobar'] = 2
trie['foobarbaz'] = 3
trie['bar'] = 4

\+- f -> (oo) -> 1
 |               \-- b -> (ar) -> 2
 |                                \-- b -> (az) -> 3
 \- b -> (ar) -> 4

The question then is how to encode and handle the compressed path.

manthanchauhan · 2019-05-28T07:03:05Z

Once sorting is enabled, look-ups should have a worst-case time complexity of O( k log n ), and that will be the case when the trie remains unaffected after compression. For an average case, the time complexity should be O( log n ) + O( k ).

You suggest that we use only the first character of a path as a key, right? which should save the search time of a key.

mina86 · 2019-05-28T07:52:43Z

On Tue, May 28, 2019, 09:03 Manthan Chauhan ***@***.***> wrote: Once sorting is enabled, look-ups should have a worst-case time complexity of O( k log n ), and that will be the case when the trie remains unaffected after compression. For an average case, the time complexity should be O( log n ) + O( k ).

When sorting is enabled children are not stored sorted. A hash table is still used. With sorting enabled children are sorted each time items are iterated over. As such, there is no way to take advantage of sorting being enabled during key lookup. The chdren would need to be sorted each time which means lookup takes O(n log n). You suggest that we use only the first character of a path as a key, right?

which should save the search time of a key.

Yes. That's how tries work.

…

manthanchauhan · 2019-05-29T03:36:26Z

Thinking about your idea for character keys,

If sorting is enabled, as done by the function enable_sorting(), we can apply binary search on all the keys of a node to search that key which starts with a particular character c, where c is the first character of query string.
otherwise, if sorting is not enabled we have to perform a linear search.

I don’t think either of those is acceptable. The first option makes lookup O(n log n) operation; the second makes it O(n) where n is number of children. Meanwhile, the idea behind a trie is to have O(k) lookup where k is the length of the key.

I don’t see getting away from children being keyed by a single character (in case of CharTrie). The path could be compressed but that would need to be for the remaining of the prefix, something like:
trie['foo'] = 1
trie['foobar'] = 2
trie['foobarbaz'] = 3
trie['bar'] = 4

\+- f -> (oo) -> 1
 |               \-- b -> (ar) -> 2
 |                                \-- b -> (az) -> 3
 \- b -> (ar) -> 4
The question then is how to encode and handle the compressed path.

at each node, we can use the first character of the query string to locate a key-value pair, here, each value is a collection of a string and a node pointer.

once we locate a key-value pair, we traverse the associated string upto the position it matches with the query string, which would be equivalent to traversing that much nodes of an uncompressed CharTrie. The remaining part of the query string will be passed to the pointed _Node.

mina86 · 2019-05-30T09:09:24Z

This sounds workable. I only wonder how would it interact with interfaces such as `walk_towards`.

manthanchauhan · 2019-05-31T09:09:09Z

sorry but can you please provide some references for this walk_towards interface?

mina86 · 2019-06-01T17:49:15Z

https://github.com/mina86/pygtrie/blob/master/pygtrie.py#L1073

manthanchauhan · 2019-06-04T16:11:33Z

This sounds workable. I only wonder how would it interact with interfaces such as walk_towards.

For this to work, we'll need to modify the _path_from_key() functions, because in this proposed model of compact tries, the key of a node is not essentially it's path from the root.

trie['foo'] = 1
trie['foobar'] = 2
trie['foobarbaz'] = 3
trie['bar'] = 4

+- f -> (oo) -> 1
| -- b -> (ar) -> 2
| -- b -> (az) -> 3
- b -> (ar) -> 4

in this trie structure, if we consider the key foobarbaz, the path for this node should be fbb and not foobarbaz.

with the above described path definition walk_towards function will work unaffected.

mina86 · 2019-06-04T18:02:47Z

I don't think that'll work. Whether compressed representation is used or not, it should be transparent to the user so walk_towards should go through the whole foobarbaz regardless of how the internal node structure looks like.

manthanchauhan · 2019-06-06T16:17:18Z

I don't think that'll work. Whether compressed representation is used or not, it should be transparent to the user so walk_towards should go through the whole foobarbaz regardless of how the internal node structure looks like.

Then we can have to do a little modification in the walk towards function.

If we consider that the value at each children key is a tuple(string, Node),
then in the walk_towards function we have to increment pos in each iteration by,

pos += 1 + len( node.children.get( path [pos ] ) [0] )

mina86 · 2019-06-12T10:54:27Z

Sorry about delay.

If we consider that the value at each children key is a tuple(string, Node),
then in the walk_towards function we have to increment pos in each iteration by,

pos += 1 + len( node.children.get( path [pos ] ) [0] )

That doesn't sound viable for two reasons.

First of all, the result of walk_towards should not depend on the internal structure of the trie. Whether compressed paths are used or not, iterating over the steps should yield all keys on the path to given key. In other words, the following test needs to pass:

import pygtrie

t = pygtrie.CharTrie(foobar=42)
steps = t.walk_towards('foobar')
res = ' '.join('<%s>' % step.key for step in steps)
assert res == '<> <f> <fo> <foo> <foob> <fooba> <foobar>'

Second of all, if the motivation is to reduce amount of memory used, I think only children with compressed paths should be a (sub-path, node) tuple. This doesn't fundamentally change anything other than code needing to handle both cases.

There's also another aspect that ideally, the following test should pass as well:

from itertools import islice
import pygtrie

t = pygtrie.CharTrie(foobar=42)
res = ''
for step in t.walk_towards('foobar'):
    res += '<%s>' % step.key
    if 'fred' in t:
        del t['fred']
        del t['foobaz']
    else:
        t['fred'] = t['foobaz'] = 24
assert res == '<><f><fo><foo><foob><fooba><foobar>'

In other words, converting nodes between compressed and not-compressed should not break walk_towards method.

manthanchauhan · 2019-06-13T16:04:22Z

Okay, but for compressed tries to pass the walk_through() test, some of the returned path will not point to any Node.

That is, if the trie contains,
foobar=5 foob=6

then while walk_through() the path foo will be returned without any Node, if that is accepted.

mina86 · 2019-06-13T16:16:10Z

Correct. This is the desired behaviour because it's the current behaviour (currently the method goes through all nodes regardless of whether they have values assigned to them or how many children they have) and methods should not expose details of the implementation.

manthanchauhan · 2019-06-17T16:01:46Z

I don't think that'll work. Whether compressed representation is used or not, it should be transparent to the user so walk_towards should go through the whole foobarbaz regardless of how the internal node structure looks like.

Then we can have to do a little modification in the walk towards function.

If we consider that the value at each children key is a tuple(string, Node),
then in the walk_towards function we have to increment pos in each iteration by,

pos += 1 + len( node.children.get( path [pos ] ) [0] )

For that, we can use this trick and inbetween every two valid Node we can pass,
len(node.children.get( path [pos] ) [0] ) None nodes to make up for the invalid keys inbetween.

I mean without actually traversing the path character by character.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

method to convert CharTrie into a compact trie for space-optimized representation (feature request) #27

method to convert CharTrie into a compact trie for space-optimized representation (feature request) #27

manthanchauhan commented May 25, 2019 •

edited

Loading

manthanchauhan commented May 25, 2019

mina86 commented May 25, 2019

manthanchauhan commented May 26, 2019 •

edited

Loading

manthanchauhan commented May 26, 2019

mina86 commented May 26, 2019

manthanchauhan commented May 26, 2019

mina86 commented May 27, 2019

manthanchauhan commented May 27, 2019

mina86 commented May 27, 2019 •

edited

Loading

manthanchauhan commented May 27, 2019

mina86 commented May 27, 2019

manthanchauhan commented May 27, 2019 •

edited

Loading

mina86 commented May 27, 2019

manthanchauhan commented May 27, 2019

mina86 commented May 27, 2019

manthanchauhan commented May 28, 2019

mina86 commented May 28, 2019

manthanchauhan commented May 28, 2019

mina86 commented May 28, 2019 via email

manthanchauhan commented May 29, 2019 •

edited

Loading

mina86 commented May 30, 2019 via email

manthanchauhan commented May 31, 2019

mina86 commented Jun 1, 2019

manthanchauhan commented Jun 4, 2019 •

edited

Loading

mina86 commented Jun 4, 2019

manthanchauhan commented Jun 6, 2019

mina86 commented Jun 12, 2019

manthanchauhan commented Jun 13, 2019

mina86 commented Jun 13, 2019

manthanchauhan commented Jun 17, 2019

method to convert CharTrie into a compact trie for space-optimized representation (feature request) #27

method to convert CharTrie into a compact trie for space-optimized representation (feature request) #27

Comments

manthanchauhan commented May 25, 2019 • edited Loading

manthanchauhan commented May 25, 2019

mina86 commented May 25, 2019

manthanchauhan commented May 26, 2019 • edited Loading

manthanchauhan commented May 26, 2019

mina86 commented May 26, 2019

manthanchauhan commented May 26, 2019

mina86 commented May 27, 2019

manthanchauhan commented May 27, 2019

mina86 commented May 27, 2019 • edited Loading

manthanchauhan commented May 27, 2019

mina86 commented May 27, 2019

manthanchauhan commented May 27, 2019 • edited Loading

mina86 commented May 27, 2019

manthanchauhan commented May 27, 2019

mina86 commented May 27, 2019

manthanchauhan commented May 28, 2019

mina86 commented May 28, 2019

manthanchauhan commented May 28, 2019

mina86 commented May 28, 2019 via email

manthanchauhan commented May 29, 2019 • edited Loading

mina86 commented May 30, 2019 via email

manthanchauhan commented May 31, 2019

mina86 commented Jun 1, 2019

manthanchauhan commented Jun 4, 2019 • edited Loading

mina86 commented Jun 4, 2019

manthanchauhan commented Jun 6, 2019

mina86 commented Jun 12, 2019

manthanchauhan commented Jun 13, 2019

mina86 commented Jun 13, 2019

manthanchauhan commented Jun 17, 2019

manthanchauhan commented May 25, 2019 •

edited

Loading

manthanchauhan commented May 26, 2019 •

edited

Loading

mina86 commented May 27, 2019 •

edited

Loading

manthanchauhan commented May 27, 2019 •

edited

Loading

manthanchauhan commented May 29, 2019 •

edited

Loading

manthanchauhan commented Jun 4, 2019 •

edited

Loading