Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode strings with zero-byte values inside the key cannot be inserted #6

Open
1 task
kellydunn opened this issue Sep 18, 2015 · 2 comments
Open
1 task
Labels

Comments

@kellydunn
Copy link
Owner

As discovered by @nick-codes, it looks like inserting a unicode key into a new ArtTree panics with an index out of range error.

Here's some code to reproduce the error:

// Tests a Unicode insertion.
func TestArtTreeInsertWithInternalZeroByte(t *testing.T) {
    tree := NewArtTree()

    // ‘a' followed by unicode accent character.
    accent := []byte{0x61, 0x00, 0x60}
    tree.Insert([]byte("a"), "a")
    tree.Insert(accent, string(accent))

    res := tree.Search(accent)
    if res == nil {
        t.Errorf("Could not find Leaf Node with expected key: '%s'", accent)
    } else {
        if !bytes.Equal(res.([]byte), accent) {
            t.Error("Unexpected search result.")
        }
    }
}

Acceptance critieria:

  • A user can insert a unicode string with an empty byte value inside of it.
@kellydunn kellydunn added the bug label Sep 18, 2015
@cvermilion
Copy link

Just ran into this one in my project. Unfortunately I do not see an easy solution.

The issue isn't zero bytes precisely, but the fact that the library uses a null terminator to ensure that no key is a full prefix of another key. That's the case that needs to be avoided, and what this test produces. If the null character is allowed in the input, I don't see a simple way to ensure that no key is a prefix of another one, though perhaps an adaptive approach could be used.

Another approach (I think I've seen this in other radix tree implementations) would be for internal nodes to have a special leaf node pointer, and use child node pointers exclusively for pointing to other inner nodes. This would require tweaking the algorithm somewhat, and I haven't thought through the performance implications.

For UTF-8 encoded strings, this shouldn't come up unless you allow null characters in the strings. For UTF-16 or UTF-32, though, you can have zero bytes easily. For a particular encoding, the problem can be solved by using a null terminator appropriate to the encoding (eg, "\x00\x00" for UTF-16), but there isn't a general solution.

I'll probably just add null-char checking on my end for now, but I wanted to leave my thoughts here in case anyone else wanted to tackle this (perhaps me in the future).

@cvermilion
Copy link

(Apparently it's also an undocumented limitation of the C version that you can't have full-prefix keys, and null-terminators are similarly advised there.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants