Random UTF-8 string generation must generate valid string #69
Comments
As it turns out, there are several versions of the Unicode standard. This has practical implications. For example, Python's
I wonder if the differences between the versions are significant enough to warrant a change to the way we generate Unicode. We could possibly accept custom Unicode ranges and have a default range that is safe with both versions.
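One way to picture a default that is "safe with both versions" is to keep only characters that two Unicode databases agree are letters. The sketch below is an illustration, not FauxFactory code: the thread does not name the versions in question, so it compares Python's current database with the older 3.2.0 database that the standard `unicodedata` module also ships, purely as a stand-in. The helper name `is_letter_in_both` is hypothetical.

```python
import unicodedata

# Assumption: the Unicode 3.2.0 database bundled with Python stands in for
# "the other version"; the thread does not say which versions are meant.
OLD_DB = unicodedata.ucd_3_2_0


def is_letter_in_both(char):
    """Return True if both databases put `char` in a letter category (L*)."""
    return (unicodedata.category(char).startswith('L')
            and OLD_DB.category(char).startswith('L'))


# The version of the database used by the running interpreter.
print(unicodedata.unidata_version)
print(is_letter_in_both('\u50dd'), is_letter_in_both('\ue94f'))
```

A character that is unassigned in either database is rejected, so strings built from the surviving characters should look the same to software using either version.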
Section 5.5.1 of [1] shows a table with character categories. We can use that to guide the generation; the category of a character is easy to get:
>>> import unicodedata
>>> for c in u'\u50dd\uf3b6\u1365\u4e5c\u2441\ue94f\u6b18\u4921\u6fc3\u052c':
...     print(unicodedata.category(c))
...
Lo
Co
Po
Lo
So
Co
Lo
Lo
Lo
Cn
As you can see, some characters of the "invalid" Unicode string have categories other than Lo (Co, Po, So and Cn). I have excluded those categories from the string:
>>> print(''.join([c for c in u'\u50dd\uf3b6\u1365\u4e5c\u2441\ue94f\u6b18\u4921\u6fc3\u052c' if unicodedata.category(c) not in ('Co', 'Cn', 'So', 'Po')]))
僝乜欘䤡濃
And the system accepted the new string as valid data.
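The filtering step from this comment can be wrapped in a small reusable function. This is a minimal Python 3 sketch; the names `EXCLUDED_CATEGORIES` and `strip_unwanted` are made up for illustration and are not part of FauxFactory. Note that the categories reported for a few code points depend on the Unicode database of the interpreter, so a character shown as Cn above may be classified differently by a newer Python.

```python
import unicodedata

# Categories excluded in the comment above: private use (Co), unassigned (Cn),
# other symbols (So) and other punctuation (Po).
EXCLUDED_CATEGORIES = ('Co', 'Cn', 'So', 'Po')


def strip_unwanted(text, excluded=EXCLUDED_CATEGORIES):
    """Return `text` without characters whose general category is excluded."""
    return ''.join(c for c in text if unicodedata.category(c) not in excluded)


sample = '\u50dd\uf3b6\u1365\u4e5c\u2441\ue94f\u6b18\u4921\u6fc3\u052c'
cleaned = strip_unwanted(sample)
print(cleaned)
```

The two private-use characters reported in the issue, \ue94f and \uf3b6, never survive this filter, because private-use code points always have category Co.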
Good thought, elyezer. Searching PyPI for "unicode" turns up quite a number of results. There may be something there we can (fork and) use.
This generator is a helper for the gen_utf8 function; it provides the list of Unicode letters supported by the system. This avoids generating Unicode strings with control characters and other non-letter characters. Also adds tests for the generator to ensure it does not generate unwanted characters. Closes omaciel#69
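The idea described in the commit message could look roughly like the sketch below. It is not the code from the referenced commit: `unicode_letters` and `random_letters` are hypothetical names, the range defaults to the Basic Multilingual Plane as an assumption, and the code targets Python 3.

```python
import random
import unicodedata


def unicode_letters(start=0x0, stop=0x10000):
    """Yield each character in [start, stop) whose category is a letter (L*)."""
    for code_point in range(start, stop):
        char = chr(code_point)
        if unicodedata.category(char).startswith('L'):
            yield char


def random_letters(length, rng=random):
    """Build a random string of `length` characters drawn only from letters."""
    letters = list(unicode_letters())
    return ''.join(rng.choice(letters) for _ in range(length))


print(random_letters(10))
```

Control characters, surrogates, private-use and unassigned code points all fall outside the letter categories, so they cannot appear in the result. Building the letter list on every call is wasteful; a real implementation would probably compute it once and reuse it.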
FauxFactory generated the following string:
u'\u50dd\uf3b6\u1365\u4e5c\u2441\ue94f\u6b18\u4921\u6fc3\u052c'
But \ue94f [1] and \uf3b6 [2] are not valid Unicode characters. FauxFactory needs a better way to generate random Unicode strings.
Thanks @Ichimonji10 for helping in the investigation process.
[1] http://www.fileformat.info/info/unicode/char/e94f/index.htm
[2] http://www.fileformat.info/info/unicode/char/f3b6/index.htm