generic method for all different types of unicode string gens #65

Open
cswiii opened this issue Jan 12, 2015 · 3 comments

@cswiii

cswiii commented Jan 12, 2015

We already have gen_cjk() and, per pull request #63, may soon have gen_cyrillic().

If we later want to support other scripts (Tamil, Telugu, etc.), this approach will get cumbersome and duplicative very quickly.

It might be good to have a generic function that takes a range of code points, and then wrap it in a function specific to each Unicode block you want to test.

e.g., instead of

     codepoints = [random.randint(0x4E00, 0x9FCC) for _ in range(length)]
     try:
         # (undefined-variable) pylint:disable=E0602
         output = u''.join(unichr(codepoint) for codepoint in codepoints)
     except NameError:
         output = u''.join(chr(codepoint) for codepoint in codepoints)
     return _make_unicode(output)

...put this into a generate_unicode_range() function that can have codepoint values passed to it, and then use that inside a function for any desired unicode block...

gen_bengali()
gen_hebrew()
gen_hiragana()
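A minimal sketch of that split, assuming Python 3 only (so the unichr fallback is not needed). The name generate_unicode_range() comes from the proposal above; its signature, the default length, and the wrapper bodies are illustrative assumptions, not existing library code. The ranges are the Unicode block boundaries, which may include unassigned code points.

```python
import random


def generate_unicode_range(start, end, length=10):
    """Return `length` random characters with code points in [start, end].

    Hypothetical helper: both bounds are inclusive.
    """
    if start > end:
        raise ValueError("start must not exceed end")
    return "".join(chr(random.randint(start, end)) for _ in range(length))


def gen_bengali(length=10):
    # Bengali block: U+0980-U+09FF.
    return generate_unicode_range(0x0980, 0x09FF, length)


def gen_hebrew(length=10):
    # Hebrew block: U+0590-U+05FF.
    return generate_unicode_range(0x0590, 0x05FF, length)


def gen_hiragana(length=10):
    # Hiragana block: U+3040-U+309F.
    return generate_unicode_range(0x3040, 0x309F, length)
```

Each wrapper then reduces to a one-line call that only supplies the block boundaries.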

Now, there is a sticky wicket in all this. Some character sets span multiple, non-contiguous blocks. More details here:

http://en.wikipedia.org/wiki/Unicode_block

So really, we should be able to pass all the desired blocks in a Python list, and then either merge them into a single range to rule them all or simply choose each random character from one of the blocks in the list.

@Ichimonji10

Some character sets span multiple, non-contiguous blocks.

From gen_utf8:

Generate codepoints. The valid range of UTF-8 codepoints is
0x0-0x10FFFF, minus the following: 0xC0-0xC1, 0xF5-0xFF and
0xD800-0xDFFF. These 2061 invalid codepoints (2 + 11 + 2048) comprise
0.2% of 0x0-0x10FFFF. Thus, it should be OK to just check for invalid
codepoints and generate new ones if need be.
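The check-and-retry idea in that docstring can be sketched as rejection sampling. This is not the gen_utf8 implementation: the function names and the `excluded` parameter are hypothetical, Python 3 is assumed, and the default excludes only the surrogate range 0xD800-0xDFFF, which chr() accepts but which cannot be encoded as UTF-8.

```python
import random

# Assumption of this sketch: only the surrogates are excluded by default.
SURROGATES = ((0xD800, 0xDFFF),)


def random_codepoint(low=0x0, high=0x10FFFF, excluded=SURROGATES):
    """Draw code points in [low, high] until one falls outside `excluded`.

    `excluded` holds inclusive (start, end) pairs. The loop never ends if
    the excluded ranges cover all of [low, high], so keep them small.
    """
    while True:
        codepoint = random.randint(low, high)
        if not any(start <= codepoint <= end for start, end in excluded):
            return codepoint


def gen_utf8_sketch(length=10):
    """Return `length` random characters that can be encoded as UTF-8."""
    return "".join(chr(random_codepoint()) for _ in range(length))
```

Because the excluded ranges are a tiny fraction of the whole space, a retry is rarely needed and no list of valid code points has to be built.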

@JacobCallahan

I think adding an optional tuple parameter to gen_utf8 would be the best implementation. Then we could either remove the CJK and Cyrillic functions or shrink them down so that they just pass the correct tuple to gen_utf8.

@Ichimonji10

So really, we should be able to pass all the desired blocks in a Python list, and then either merge them into a single range to rule them all or simply choose each random character from one of the blocks in the list.

Creating a list that contains all the characters in a given character set and pulling values out is not very streamy. We can find a way to generate a bunch of random-ish characters without creating a list containing tens/hundreds/whatever of thousands of characters and plucking characters out from it.
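One possible way to do this is to store only the block boundaries, pick a random offset into the combined size of all blocks, and map that offset back to a code point. The sketch below assumes Python 3 and non-overlapping blocks; iter_random_chars(), gen_from_blocks() and KATAKANA_BLOCKS are hypothetical names chosen for illustration.

```python
import bisect
import itertools
import random


def iter_random_chars(blocks):
    """Lazily yield random characters drawn uniformly from `blocks`.

    `blocks` is a sequence of inclusive (start, end) code point pairs,
    assumed not to overlap. Only the boundaries are kept in memory.
    """
    sizes = [end - start + 1 for start, end in blocks]
    cumulative = list(itertools.accumulate(sizes))
    total = cumulative[-1]
    while True:
        offset = random.randrange(total)
        # Find the block that contains this offset.
        index = bisect.bisect_right(cumulative, offset)
        preceding = cumulative[index] - sizes[index]
        yield chr(blocks[index][0] + offset - preceding)


def gen_from_blocks(blocks, length=10):
    """Return `length` characters taken from the lazy stream."""
    return "".join(itertools.islice(iter_random_chars(blocks), length))


# Katakana (U+30A0-U+30FF) and Katakana Phonetic Extensions (U+31F0-U+31FF).
KATAKANA_BLOCKS = ((0x30A0, 0x30FF), (0x31F0, 0x31FF))
```

Calling gen_from_blocks(KATAKANA_BLOCKS, 5) returns five characters spread across both blocks in proportion to their sizes, and memory use depends on the number of blocks rather than the number of characters they contain.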
