Another Unicode issue #15

Open
holopoj opened this issue Aug 4, 2019 · 3 comments

Comments

holopoj commented Aug 4, 2019

Ran into an issue with Unicode 0x300. It can be reproduced with the code below:

var a= "rosalía castro";
var b= "rosalía";
var t = new UkkonenTrie<int>(3);
t.Add(a, 1);
t.Add(b, 2);
Console.WriteLine(t.Retrieve(a).Count());

This prints 0. Note that the second item added is not a byte-equal prefix of a; their Unicode sequences differ, even though a.StartsWith(b) returns true, presumably because of culture settings. The second string uses two characters, a plain 'i' followed by Unicode 0x300 to add the accent, while the first uses a single accented i character.
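The mismatch described above can be seen without the trie at all. This is a minimal sketch, not part of the original report: the `PrefixDemo` class and its method names are made up for illustration, and the string literals are an assumption about the two inputs. The decomposed form here uses U+0301 (combining acute accent), since that is the mark that composes with 'i' into 'í'.

```csharp
using System;
using System.Text;

static class PrefixDemo
{
    // Assumption: 'í' as one precomposed code unit (U+00ED) in the first string...
    public const string Composed = "rosal\u00EDa castro";
    // ...and as 'i' followed by a combining acute accent (U+0301) in the second.
    public const string Decomposed = "rosali\u0301a";

    // Compares raw UTF-16 code units, which is what a char-based trie effectively does.
    public static bool OrdinalPrefix(string text, string prefix) =>
        text.StartsWith(prefix, StringComparison.Ordinal);

    // Normalizes both sides to form C before comparing code units.
    public static bool NormalizedPrefix(string text, string prefix) =>
        text.Normalize(NormalizationForm.FormC)
            .StartsWith(prefix.Normalize(NormalizationForm.FormC), StringComparison.Ordinal);

    static void Main()
    {
        Console.WriteLine(OrdinalPrefix(Composed, Decomposed));    // False
        Console.WriteLine(NormalizedPrefix(Composed, Decomposed)); // True

        // The parameterless StartsWith is culture-sensitive; its result depends
        // on the runtime's globalization settings, so no output is claimed here.
        Console.WriteLine(Composed.StartsWith(Decomposed));
    }
}
```

The ordinal comparison is the one that matches how a trie keyed on individual `char` values sees the two strings.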

rjgotten commented Sep 10, 2019

The proper, fully compatible solution that would resolve most if not all issues with Unicode is to rewrite all substring handling to use the StringInfo class to work with 'real' characters, i.e. graphemes, rather than individual char code units.

However, the public StringInfo API is very uncomfortable. E.g. you have to manually pump a non-generic IEnumerator with MoveNext() to iterate over a string's graphemes. There's no IEnumerable<> support and thus also no foreach support.
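One way to paper over the enumerator awkwardness is a small iterator wrapper. This is only a sketch: `GraphemeExtensions` and `Graphemes` are hypothetical names, not part of StringInfo or of this project. It relies on `StringInfo.GetTextElementEnumerator` and `TextElementEnumerator.GetTextElement`, which are the framework calls being wrapped.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

static class GraphemeExtensions
{
    // Hypothetical helper: exposes a string's text elements as IEnumerable<string>
    // so that foreach and LINQ work on them.
    public static IEnumerable<string> Graphemes(this string text)
    {
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(text);
        while (enumerator.MoveNext())
        {
            yield return enumerator.GetTextElement();
        }
    }
}

static class Program
{
    static void Main()
    {
        // 'i' + U+0301 (combining acute accent) is one text element but two chars.
        string decomposed = "rosali\u0301a";

        Console.WriteLine(decomposed.Length);             // 7
        Console.WriteLine(decomposed.Graphemes().Count()); // 6

        foreach (string grapheme in decomposed.Graphemes())
        {
            Console.WriteLine($"{grapheme} ({grapheme.Length} char(s))");
        }
    }
}
```

What counts as a text element for more complex sequences (emoji, for instance) differs between runtime versions, so the wrapper inherits whatever segmentation the runtime provides.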


[EDIT]

It looks like this wouldn't be too difficult a change for the Ukkonen trie, if you go about it naively and just replace the regular Substring() calls and Length accesses with StringInfo-driven equivalents.

The downsides are that it would probably murder at least construction performance, and that the Node class would need to hold an IDictionary<string,Edge>, as a grapheme may not fit in a single char. That last bit means an increase in space taken as well, but luckily it's still bounded. Unicode graphemes aren't endlessly long, iirc.

It might be better to convert all strings once into a dedicated data structure operating at the grapheme level, though. That would certainly keep the code more maintainable.
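As a rough illustration of such a one-time conversion, the sketch below splits a string into text elements once and then offers grapheme-indexed `Length`, indexer, and `Substring` members. `GraphemeString` is a hypothetical type invented for this example; nothing like it exists in the project, and how it would plug into the trie is left open.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical wrapper: pays the segmentation cost once, at construction.
sealed class GraphemeString
{
    private readonly string[] _elements;

    public GraphemeString(string text)
    {
        var elements = new List<string>();
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(text);
        while (enumerator.MoveNext())
        {
            elements.Add(enumerator.GetTextElement());
        }
        _elements = elements.ToArray();
    }

    // Number of text elements, not chars.
    public int Length => _elements.Length;

    // A single text element; it may span more than one char.
    public string this[int index] => _elements[index];

    // Substring measured in text elements.
    public string Substring(int start, int length) =>
        string.Join("", _elements, start, length);
}

static class Program
{
    static void Main()
    {
        var text = new GraphemeString("rosali\u0301a castro");

        Console.WriteLine(text.Length);                 // 14
        Console.WriteLine(text[5] == "i\u0301");        // True
        Console.WriteLine(text.Substring(0, 7).Length); // 8 chars for 7 text elements
    }
}
```

Because each element is a `string`, this is also where the `IDictionary<string,Edge>` cost mentioned above would come from.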

prj commented Apr 30, 2020

For me the thing throws OutOfBoundsExceptions when I even try to construct something that has any special characters in it. And all my sources are in ISO-8859-1.
So it seems this project is useless in any real world application, unless you're dealing with plain ASCII.

jesuslpm commented Jan 31, 2021

@holopoj ,

Preparing the text for the trie before adding and before searching is a good workaround. I will work with the Basic Multilingual Plane, which contains characters for almost all modern languages and a large number of symbols:

using System.Globalization;
using System.Text;

/// <summary>
/// Removes diacritics from the text, lowercases it, drops surrogate
/// characters, and normalizes it, preparing the text for accent- and
/// case-insensitive search.
/// </summary>
/// <param name="text">The raw text to prepare.</param>
/// <returns>The prepared text, in normalization form C.</returns>
static string PrepareForTrie(string text)
{
    // Decompose first so that accents become separate combining marks.
    var normalizedString = text.Normalize(NormalizationForm.FormD);
    var stringBuilder = new StringBuilder(normalizedString.Length);

    foreach (char c in normalizedString)
    {
        // Drop surrogates: only the Basic Multilingual Plane is kept.
        if (char.IsSurrogate(c)) continue;

        var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c);
        if (unicodeCategory != UnicodeCategory.NonSpacingMark && unicodeCategory != UnicodeCategory.Control)
        {
            // Invariant lowercasing keeps the result independent of the current culture.
            stringBuilder.Append(char.ToLowerInvariant(c));
        }
    }
    return stringBuilder.ToString().Normalize(NormalizationForm.FormC);
}

Now this code works; it prints 2 and 1. The second item still has the double-code-point grapheme:

var a = PrepareForTrie("Rosalia de Castro");
var b = PrepareForTrie("rosalía");
var t = new UkkonenTrie<int>(3);
t.Add(a, 1);
t.Add(b, 2);
foreach (var value in t.Retrieve(b))
{
    Console.WriteLine(value);
}
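To check the workaround independently of the trie, the sketch below confirms that the precomposed and decomposed spellings end up as the same key. It uses a condensed LINQ restatement of the preparation step so that it runs on its own. The `TrieKeys` class name is hypothetical, the use of `char.ToLowerInvariant` is an assumption made to keep the result culture-independent, and the literals assume U+0301 as the combining accent.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

static class TrieKeys
{
    // Condensed restatement of the preparation step: decompose, drop surrogates,
    // combining marks and control characters, lowercase, then recompose.
    public static string PrepareForTrie(string text)
    {
        char[] kept = text.Normalize(NormalizationForm.FormD)
            .Where(c => !char.IsSurrogate(c))
            .Where(c =>
            {
                UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(c);
                return category != UnicodeCategory.NonSpacingMark
                    && category != UnicodeCategory.Control;
            })
            .Select(c => char.ToLowerInvariant(c))
            .ToArray();

        return new string(kept).Normalize(NormalizationForm.FormC);
    }

    static void Main()
    {
        string composed = PrepareForTrie("Rosal\u00EDa de Castro"); // precomposed í
        string decomposed = PrepareForTrie("rosali\u0301a");        // i + U+0301

        Console.WriteLine(composed);   // rosalia de castro
        Console.WriteLine(decomposed); // rosalia
        Console.WriteLine(composed.StartsWith(decomposed, StringComparison.Ordinal)); // True
    }
}
```

One consequence of dropping surrogates is that characters outside the Basic Multilingual Plane, such as most emoji, disappear from the key entirely.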
