Various fixes for converting cp1252, cp1251, cp1250 characters to and from utf-8 #1288

Lord-Nightmare · 2021-03-14T17:12:15Z

Fixed cp1252 characters from 0x80-0x9f not being converted to unicode properly
Fixed cp1251 characters from 0x80-0xbf being inconsistently converted to unicode
Fixed utf-8 output of any unicode character whose unicode character code was above 0x7ff

… properly Fixed cp1251 characters from 0x80-0xbf being inconsistently converted to unicode Fixed utf-8 output of any unicode character whose unicode character code was above 0x7ff

mtijanic · 2021-03-15T22:22:32Z

At code read level this looks good, but I still need to test it out somehow to make sure the mappings are good.

By the way, I'm thinking if it is not easier to just do a [128] array for all codepages and simplify the code.

virusman · 2021-03-15T23:03:58Z

Checked the cp1251 logic – looks good. I haven't actually verified it, but the previously supported characters should keep working as they did.
I don't really understand all the shifting and other arithmetic with push_back, so can't verify that.

Lord-Nightmare · 2021-03-16T03:26:02Z

By the way, I'm thinking if it is not easier to just do a [128] array for all codepages and simplify the code.

Doing this is definitely an option, and would also simplify the FromUTF8 function as well.

Lord-Nightmare · 2021-03-16T03:34:20Z

This also makes me wonder a few things:
Should the users have control over these tables, so for instance, if you want a font to use a hair space instead of a space character, you could change an entry in a [256] long conversion table to use U+2009 THIN SPACE or U+200A HAIR SPACE in place of U+0020 SPACE, if it would look better in a specific font.

Also, should I add supoort in the ToUTF8 function for outputting unicode characters above 0xFFFF (i.e. 4 bytes utf-8, for unicode character codes 0x10000 thru 0x10FFFF)? This would allow (in theory) some characters to be rendered as emojis etc.

Lord-Nightmare · 2021-03-16T04:47:37Z

I don't really understand all the shifting and other arithmetic with push_back, so can't verify that.

I'll try to explain it:

A utf-8 character with unicode character code <= 0x7f is just sent as 0x00-0x7f (0b00000000 thru 0b01111111)

A utf-8 character with unicode character code >= 0x80 and <= 0x7ff is broken into two bytes, sent one after the other:

0b110xxxxx with the high 5 bits (bits 6 thru 10 lsb to msb) of the character code replacing the 'x's
0b10xxxxxx with the low 6 bits (bits 0 thru 5 lsb to msb) of the character code replacing the 'x's

A utf-8 character with unicode character code >= 0x800 and <= 0xffff is broken into three bytes, sent one after the other:

0b1110xxxx with the high 4 bits (bits 12 thru 15 lsb to msb) of the character code replacing the 'x's
0b10xxxxxx with the next 6 bits (bits 5 thru 11 lsb to msb) of the character code replacing the 'x's
0b10xxxxxx with the low 6 bits (bits 0 thru 5 lsb to msb) of the character code replacing the 'x's

A utf-8 character with uncode character code >= 0x10000 and <= 0x10ffff (in theory this could be as high as 0x1fffff but unicode defined the max range as 0x10ffff for compatibility with UTF-16) is broken into four bytes, sent one after the other:

0b11110xxx with the high 3 bits (bits 18 thru 20 lsb to msb) of the character code replacing the 'x's
0b10xxxxxx with the next 6 bits (bits 12 thru 17 lsb to msb) of the character code replacing the 'x's
0b10xxxxxx with the next 6 bits (bits 5 thru 11 lsb to msb) of the character code replacing the 'x's
0b10xxxxxx with the low 6 bits (bits 0 thru 5 lsb to msb) of the character code replacing the 'x's

mtijanic · 2021-03-16T07:54:34Z

No point in adding features that won't be used. As far as I know, cp1250, cp1251 and cp1252 none have 4-byte UTF8 symbols (do they even have 3 byte symbols?) and we don't support anything else.

Same with codepoint overrides, it just adds much complexity that I doubt anyone will need. And if they do, it's a simple isolated change to modify the mapping table, so they can just maintain it in their fork.

I think it makes sense to move the ToUTF8 encoding to be a [128] array everywhere. For FromUTF8, we can probably just dynamically generate another array from the previous one to avoid iterating for every character. I'm not too worried about wasting 64kb on a sparse table. (though it remains to be seen if that is faster that iteration, due to cache effects)

Lord-Nightmare · 2021-03-17T05:10:29Z

No point in adding features that won't be used. As far as I know, cp1250, cp1251 and cp1252 none have 4-byte UTF8 symbols (do they even have 3 byte symbols?) and we don't support anything else.

Yes, there are several 3 byte symbols already: 0x2026 (horizontal ellipsis) for instance, is translated to by cp1252 0x85 (which is used many times in dialog.tlk), and is a 3-byte utf-8 character: 0xE2 0x80 0xA6

…odepages, and fixed FromUTF8 to also search the entirety of these tables.

Lord-Nightmare added 2 commits March 14, 2021 13:09

Fixed cp1252 characters from 0x80-0x9f not being converted to unicode…

e4cfdff

… properly Fixed cp1251 characters from 0x80-0xbf being inconsistently converted to unicode Fixed utf-8 output of any unicode character whose unicode character code was above 0x7ff

fix compile

27a1503

Update as suggested to use 0x80-long tables for all three supported c…

4b6c735

…odepages, and fixed FromUTF8 to also search the entirety of these tables.

Lord-Nightmare changed the title ~~Various fixes for converting cp1252, cp1251, cp1250 characters to utf-8~~ Various fixes for converting cp1252, cp1251, cp1250 characters to and from utf-8 Apr 2, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Various fixes for converting cp1252, cp1251, cp1250 characters to and from utf-8 #1288

Various fixes for converting cp1252, cp1251, cp1250 characters to and from utf-8 #1288

Lord-Nightmare commented Mar 14, 2021

mtijanic commented Mar 15, 2021

virusman commented Mar 15, 2021 •

edited

Loading

Lord-Nightmare commented Mar 16, 2021

Lord-Nightmare commented Mar 16, 2021

Lord-Nightmare commented Mar 16, 2021

mtijanic commented Mar 16, 2021

Lord-Nightmare commented Mar 17, 2021 •

edited

Loading

Various fixes for converting cp1252, cp1251, cp1250 characters to and from utf-8 #1288

Are you sure you want to change the base?

Various fixes for converting cp1252, cp1251, cp1250 characters to and from utf-8 #1288

Conversation

Lord-Nightmare commented Mar 14, 2021

mtijanic commented Mar 15, 2021

virusman commented Mar 15, 2021 • edited Loading

Lord-Nightmare commented Mar 16, 2021

Lord-Nightmare commented Mar 16, 2021

Lord-Nightmare commented Mar 16, 2021

mtijanic commented Mar 16, 2021

Lord-Nightmare commented Mar 17, 2021 • edited Loading

virusman commented Mar 15, 2021 •

edited

Loading

Lord-Nightmare commented Mar 17, 2021 •

edited

Loading