Random UTF-8 string generation must generate valid string #69

elyezer · 2015-02-12T19:04:07Z

FauxFactory has generated the following string:

u'\u50dd\uf3b6\u1365\u4e5c\u2441\ue94f\u6b18\u4921\u6fc3\u052c'

But \ue94f [1] and \uf3b6 [2] are not valid unicode characters. FauxFactory needs a better way to generate random unicode strings.

Thanks @Ichimonji10 for helping in the investigation process.

[1] http://www.fileformat.info/info/unicode/char/e94f/index.htm
[2] http://www.fileformat.info/info/unicode/char/f3b6/index.htm

Ichimonji10 · 2015-02-12T19:08:18Z

👍

As it turns out, there are several versions of the unicode standard. This has practical implications. For example, Python's unicodedata module is pegged to unicode version 5.2.0 in Python 2.7 and unicode version 6.1.0 in Python 3.3.
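A quick way to check which version a given interpreter is pegged to is the module's `unidata_version` attribute. The snippet below is a minimal sketch written for Python 3; the variable names are only illustrative. It also looks up \u052c, the character reported as Cn in this thread, since its category may differ on an interpreter that bundles a newer database.

```python
import sys
import unicodedata

# Version of the Unicode Character Database bundled with this interpreter.
ucd_version = unicodedata.unidata_version
print("Python %d.%d uses Unicode %s" % (sys.version_info[0], sys.version_info[1], ucd_version))

# The category of a code point depends on that database version, so the
# same character may be unassigned on one interpreter and assigned on another.
category_052c = unicodedata.category("\u052c")
print(category_052c)
```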

JacobCallahan · 2015-02-12T19:14:18Z

I wonder if the differences between the versions are significant enough to warrant a change to the way we generate unicode. We could possibly accept custom unicode ranges and have a default range that is safe with both versions.
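A rough sketch of what that could look like, written for Python 3. The names `gen_from_ranges` and `DEFAULT_RANGES` are hypothetical and not part of FauxFactory, and the default range is only an assumption about what counts as "safe": U+4E00..U+9FA5 has been assigned to CJK ideographs since the earliest Unicode versions, so both databases mentioned above agree on it.

```python
import random

# Hypothetical default: CJK Unified Ideographs U+4E00..U+9FA5 (inclusive),
# which are assigned in both Unicode 5.2.0 and 6.1.0.
DEFAULT_RANGES = ((0x4E00, 0x9FA5),)


def gen_from_ranges(length, ranges=DEFAULT_RANGES, rng=random):
    """Return a random string built only from the given inclusive ranges."""
    chars = []
    for _ in range(length):
        # Pick a range first, then a code point inside it. Ranges are
        # weighted equally here, not by their size.
        start, end = rng.choice(ranges)
        chars.append(chr(rng.randint(start, end)))
    return "".join(chars)


print(gen_from_ranges(10))
# Callers can pass their own ranges, e.g. uppercase ASCII letters only.
print(gen_from_ranges(10, ranges=((0x41, 0x5A),)))
```

On Python 2 the same idea would need `unichr` instead of `chr`.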

elyezer · 2015-02-12T19:22:10Z

Section 5.5.1 of [1] shows a table of character categories. We can use that to guide the generation, since the category of a character is easy to get:

>>> import unicodedata
>>> for c in u'\u50dd\uf3b6\u1365\u4e5c\u2441\ue94f\u6b18\u4921\u6fc3\u052c':
...   print unicodedata.category(c)
...
Lo
Co
Po
Lo
So
Co
Lo
Lo
Lo
Cn

As you can see, some characters of the "invalid" unicode string belong to the Private_Use (Co), Unassigned (Cn) and Other_Punctuation (Po) categories.

I have excluded those categories, along with Other_Symbol (So), from the string:

>>> print ''.join([c for c in u'\u50dd\uf3b6\u1365\u4e5c\u2441\ue94f\u6b18\u4921\u6fc3\u052c' if unicodedata.category(c) not in ('Co', 'Cn', 'So', 'Po')])
僝乜欘䤡濃

And the system accepted the new string as valid data.

[1] http://www.unicode.org/reports/tr44/tr44-4.html

Ichimonji10 · 2015-02-12T20:03:12Z

Good thought, elyezer.

Searching PyPI for "unicode" turns up quite a number of results. There may be something there we can (fork and) use.

This generator is a helper for the gen_utf8 function; it provides the list of unicode letters supported by the system. This avoids generating unicode strings with control characters and other non-letter characters. Also adds tests for the generator to ensure it does not generate unwanted characters. Closes omaciel#69
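For readers who want a feel for the approach, here is a minimal sketch of a letters generator, written for Python 3. It is not the code merged in #70: `unicode_letters` and `random_letters` are hypothetical names, and limiting the scan to the Basic Multilingual Plane (up to U+FFFF) is an assumption made to keep the example small. It keeps only code points whose category starts with "L", as reported by the interpreter's own `unicodedata` database.

```python
import random
import unicodedata


def unicode_letters(max_code_point=0xFFFF):
    """Yield every character up to max_code_point whose category is L*."""
    for code_point in range(max_code_point + 1):
        char = chr(code_point)
        # Letter categories are Lu, Ll, Lt, Lm and Lo. Private use (Co),
        # unassigned (Cn), surrogates (Cs), symbols and punctuation are skipped.
        if unicodedata.category(char).startswith("L"):
            yield char


def random_letters(length, rng=random):
    """Return a random string made only of unicode letters."""
    letters = list(unicode_letters())
    return "".join(rng.choice(letters) for _ in range(length))


sample = random_letters(10)
print(sample)
```

Because the whitelist comes from the running interpreter, the set of letters follows whatever Unicode version that interpreter ships with.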

elyezer mentioned this issue Feb 13, 2015

Add unicode letters generator #70

Merged

omaciel closed this as completed in #70 Feb 16, 2015
