Random UTF-8 string generation must generate valid string #69

Closed
elyezer opened this issue Feb 12, 2015 · 4 comments · Fixed by #70

Comments

@elyezer
Contributor

elyezer commented Feb 12, 2015

FauxFactory generated the following string:

u'\u50dd\uf3b6\u1365\u4e5c\u2441\ue94f\u6b18\u4921\u6fc3\u052c'

But \ue94f [1] and \uf3b6 [2] are not valid unicode characters. FauxFactory needs a better way to generate random unicode strings.

Thanks @Ichimonji10 for helping in the investigation process.

[1] http://www.fileformat.info/info/unicode/char/e94f/index.htm
[2] http://www.fileformat.info/info/unicode/char/f3b6/index.htm

@Ichimonji10
Contributor

👍

As it turns out, there are several versions of the unicode standard. This has practical implications. For example, Python's unicodedata module is pegged to unicode version 5.2.0 in Python 2.7 and unicode version 6.1.0 in Python 3.3.
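A quick way to see which version of the Unicode database a given interpreter ships with is to ask the `unicodedata` module directly. The snippet below is a minimal sketch in Python 3 syntax (the thread itself uses Python 2); the values it prints depend on the interpreter, so no expected output is given.

```python
import unicodedata

# Version of the Unicode Character Database bundled with this interpreter.
print(unicodedata.unidata_version)

# U+052C is the last character of the string from the report. Its category
# depends on which database version the interpreter bundles.
print(unicodedata.category('\u052c'))
```

Running this under two different interpreters makes the practical impact visible: the same code point can be reported as unassigned by one and as a letter by the other.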

@JacobCallahan
Contributor

I wonder if the differences between the versions are significant enough to warrant a change to the way we generate unicode. We could possibly accept custom unicode ranges and have a default range that is safe with both versions.
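The idea of accepting custom ranges with a conservative default could look roughly like the sketch below. Everything here is hypothetical: `gen_from_ranges`, `DEFAULT_RANGES` and the `rng` parameter are names invented for illustration and are not part of FauxFactory. The default ranges were picked on the assumption that they should contain only letters assigned well before Unicode 5.2.0.

```python
import random

# Hypothetical default: ranges made up entirely of letters that were
# assigned long before Unicode 5.2.0, so old and new databases agree on them.
DEFAULT_RANGES = (
    (0x0041, 0x005A),  # Basic Latin uppercase letters
    (0x0061, 0x007A),  # Basic Latin lowercase letters
    (0x4E00, 0x9FA5),  # CJK Unified Ideographs from the earliest Unicode versions
)


def gen_from_ranges(length=10, ranges=DEFAULT_RANGES, rng=None):
    """Return a random string whose code points all fall inside ``ranges``.

    Each range is an inclusive (low, high) pair of code points. A range is
    picked uniformly first, so small ranges are over-represented compared
    with a uniform draw over all code points.
    """
    rng = rng or random.Random()
    chars = []
    for _ in range(length):
        low, high = rng.choice(ranges)
        chars.append(chr(rng.randint(low, high)))
    return ''.join(chars)
```

A caller that needs a different alphabet would pass its own tuple of ranges, for example `gen_from_ranges(5, ranges=((0x0400, 0x044F),))`.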

@elyezer
Contributor Author

elyezer commented Feb 12, 2015

Section 5.5.1 of [1] shows a table with character categories. We can use that to guide the generation; the character category is easy to get:

>>> import unicodedata
>>> for c in u'\u50dd\uf3b6\u1365\u4e5c\u2441\ue94f\u6b18\u4921\u6fc3\u052c':
...   print unicodedata.category(c)
...
Lo
Co
Po
Lo
So
Co
Lo
Lo
Lo
Cn

As you can see, some characters of the "invalid" unicode string belong to the Private_Use (Co), Unassigned (Cn), Other_Punctuation (Po) and Other_Symbol (So) categories.

I have excluded the characters in those categories from the string:

>>> print ''.join([c for c in u'\u50dd\uf3b6\u1365\u4e5c\u2441\ue94f\u6b18\u4921\u6fc3\u052c' if unicodedata.category(c) not in ('Co', 'Cn', 'So', 'Po')])
僝乜欘䤡濃

And the system accepted the new string as valid data.

[1] http://www.unicode.org/reports/tr44/tr44-4.html
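The filtering step above can be wrapped in a small reusable function. This is a sketch in Python 3 syntax; `drop_categories` and `EXCLUDED_CATEGORIES` are hypothetical names, not FauxFactory API. Because category assignments depend on the interpreter's Unicode database, the result for the sample string may differ from the Python 2.7 output shown in the comment.

```python
import unicodedata

# Categories excluded in the comment above: Private_Use, Unassigned,
# Other_Symbol and Other_Punctuation.
EXCLUDED_CATEGORIES = frozenset(('Co', 'Cn', 'So', 'Po'))


def drop_categories(text, excluded=EXCLUDED_CATEGORIES):
    """Return ``text`` without the characters whose category is in ``excluded``."""
    return ''.join(c for c in text if unicodedata.category(c) not in excluded)


sample = '\u50dd\uf3b6\u1365\u4e5c\u2441\ue94f\u6b18\u4921\u6fc3\u052c'
# Print code points rather than the characters themselves so the output
# does not depend on the terminal encoding.
print([hex(ord(c)) for c in drop_categories(sample)])
```

Filtering after generation shortens the string, so a generator built on this idea would need to keep drawing characters until the requested length is reached, or draw only from a pre-filtered pool.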

@Ichimonji10
Contributor

Good thought, elyezer.

Searching pypi for "unicode" turns up quite a number of results. There may be something there we can (fork and) use.

elyezer added a commit to elyezer/fauxfactory that referenced this issue Feb 13, 2015
This generator is a helper for the gen_utf8 function; it provides the
list of unicode letters supported by the system. This avoids generating
unicode strings with control characters and other non-letter characters.

Also adds tests for the generator in order to ensure it is not
generating unwanted characters.

Closes omaciel#69
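The commit message describes a helper that yields only the letters known to the running system. The actual change is not shown in this thread, so the following is only a sketch of that idea in Python 3 syntax; `letters_only` and `gen_letters_string` are hypothetical names, and limiting the scan to the Basic Multilingual Plane is an assumption made to keep the example small.

```python
import random
import unicodedata


def letters_only(max_code_point=0xFFFF):
    """Yield every character up to ``max_code_point`` that the interpreter's
    Unicode database classifies as a letter (a category starting with 'L').

    Surrogates, private-use, unassigned and control code points all have
    categories starting with 'C', so they are skipped.
    """
    for code_point in range(max_code_point + 1):
        char = chr(code_point)
        if unicodedata.category(char).startswith('L'):
            yield char


def gen_letters_string(length=10, rng=None):
    """Return a random string of ``length`` letters drawn from ``letters_only``."""
    rng = rng or random.Random()
    pool = list(letters_only())
    return ''.join(rng.choice(pool) for _ in range(length))
```

Because the pool is built from the local `unicodedata` tables, the generated strings contain only characters the running interpreter recognizes as letters. A consumer pinned to an older Unicode version could still reject some of them.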