generic method for all different types of unicode string gens #65

cswiii · 2015-01-12T21:21:01Z

We already have gen_cjk() and, per pull #63, might have gen_cyrillic().

If we want to support other scripts in the future (Tamil, Telugu, etc.), this will get cumbersome and duplicative very quickly.

It might be good to have some generic function that takes any specific range and plugs it in, and then wrap that with a function specific to the unicode block you want to test.

e.g., instead of

     codepoints = [random.randint(0x4E00, 0x9FCC) for _ in range(length)]
     try:
         # (undefined-variable) pylint:disable=E0602
         output = u''.join(unichr(codepoint) for codepoint in codepoints)
     except NameError:
         output = u''.join(chr(codepoint) for codepoint in codepoints)
     return _make_unicode(output)

...put this into a generate_unicode_range() function that can have codepoint values passed to it, and then use that inside a function for any desired unicode block...

gen_bengali()
gen_hebrew()
gen_hiragana()
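A minimal sketch of what that could look like, assuming Python 3 only (so the unichr fallback and the _make_unicode call from the snippet above are left out). The default length of 10 and the wrapper bodies are illustrative, not existing code. Note that a block can contain unassigned code points, so a random pick from the raw range is not guaranteed to be a defined character.

```python
import random


def generate_unicode_range(start, end, length=10):
    """Return `length` random characters with code points in [start, end]."""
    return ''.join(chr(random.randint(start, end)) for _ in range(length))


def gen_hebrew(length=10):
    # Hebrew block: U+0590 to U+05FF
    return generate_unicode_range(0x0590, 0x05FF, length)


def gen_hiragana(length=10):
    # Hiragana block: U+3040 to U+309F
    return generate_unicode_range(0x3040, 0x309F, length)
```

Each new script then costs one two-line wrapper instead of a copy of the whole generation loop.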

Now, there is a sticky wicket in all this. Some character sets span multiple, non-contiguous blocks. More details here:

http://en.wikipedia.org/wiki/Unicode_block

So really, we should be able to pass all desired blocks in as a Python list, and then either build a single range to rule them all, or simply choose a random character out of any block within the list.
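The second option (pick a block, then pick a character in it) might look roughly like this. The function name, the (start, end) tuple format, and the CYRILLIC_BLOCKS constant are hypothetical; the two ranges are the Cyrillic and Cyrillic Extended-B blocks, which are not adjacent.

```python
import random


def gen_from_blocks(blocks, length=10):
    """Pick a random block per character, then a random code point in it."""
    chars = []
    for _ in range(length):
        start, end = random.choice(blocks)
        chars.append(chr(random.randint(start, end)))
    return ''.join(chars)


# Cyrillic (U+0400 to U+04FF) and Cyrillic Extended-B (U+A640 to U+A69F).
CYRILLIC_BLOCKS = [(0x0400, 0x04FF), (0xA640, 0xA69F)]
```

One caveat with this approach: every block is equally likely regardless of size, so characters in small blocks show up more often than characters in large ones.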


Ichimonji10 · 2015-01-12T21:44:53Z

Some character sets span multiple, non-contiguous blocks.

From gen_utf8:

Generate codepoints. The valid range of UTF-8 codepoints is
0x0-0x10FFFF, minus the following: 0xC0-0xC1, 0xF5-0xFF and
0xD800-0xDFFF. These 2061 invalid codepoints (2 + 11 + 2048) comprise
0.2% of 0x0-0x10FFFF. Thus, it should be OK to just check for invalid
codepoints and generate new ones if need be.
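The check-and-retry idea in that docstring can be sketched as below. This is an illustration of the technique only, not the actual gen_utf8 source; the names EXCLUDED, is_excluded and random_codepoint are made up here, and the excluded ranges are copied as-is from the docstring.

```python
import random

# Exclusions exactly as listed in the quoted docstring.
EXCLUDED = ((0xC0, 0xC1), (0xF5, 0xFF), (0xD800, 0xDFFF))


def is_excluded(codepoint):
    """Return True if the code point falls in one of the excluded ranges."""
    return any(low <= codepoint <= high for low, high in EXCLUDED)


def random_codepoint():
    """Draw code points until one falls outside the excluded ranges."""
    while True:
        codepoint = random.randint(0x0, 0x10FFFF)
        if not is_excluded(codepoint):
            return codepoint
```

Because so few values are excluded, the loop almost always returns on the first draw. The same pattern would work for any set of ranges where the gaps are small relative to the whole.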

JacobCallahan · 2015-01-12T21:47:50Z

I think adding an optional tuple parameter to gen_utf8 would be the best implementation. Then we could either remove the CJK and Cyrillic functions or shrink them down to just pass the correct tuple to gen_utf8.

Ichimonji10 · 2015-01-12T21:48:34Z

So really, we should be able to pass all desired blocks in as a Python list, and then either build a single range to rule them all, or simply choose a random character out of any block within the list.

Creating a list that contains all the characters in a given character set and pulling values out is not very streamy. We can find a way to generate a bunch of random-ish characters without creating a list containing tens/hundreds/whatever of thousands of characters and plucking characters out from it.
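One possible way to do that, as a rough sketch: treat the blocks as one virtual contiguous range, draw a random offset into it, and walk the blocks to translate the offset back into a code point. Only the block boundaries are stored. The name iter_random_chars and the (start, end) tuple format are hypothetical, and the sketch assumes the blocks do not overlap.

```python
import random


def iter_random_chars(blocks, length):
    """Lazily yield `length` characters drawn uniformly from all blocks."""
    sizes = [end - start + 1 for start, end in blocks]
    total = sum(sizes)
    for _ in range(length):
        # Offset into the virtual concatenation of all blocks.
        offset = random.randrange(total)
        for (start, _end), size in zip(blocks, sizes):
            if offset < size:
                yield chr(start + offset)
                break
            offset -= size
```

Since it is a generator, callers can consume characters one at a time or build a string with ''.join(iter_random_chars(blocks, 10)). Every character across all blocks is equally likely, whatever the block sizes.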
