
parsing \uXXX? characters #1

Open

FotisK opened this issue Apr 21, 2016 · 2 comments

FotisK commented Apr 21, 2016

intro

disclaimer: I'm not opening these issues to ask for a solution;
actually I'm very grateful that rtf2html exists because, of all the things I tried, it is the only one that really produces proper HTML output. These issues may or may not be known, but since I'm not able to fix them right now, I'm posting them for posterity. At the moment I'm just running the Windows binary. One day I may find the time to read the code and see if there is something I can add, even though I'm not a very skilled programmer.

so why am I opening these issues?

to say thank you, because it's really proven helpful!
to show some activity because at least for me issues/wiki entries indicate a piece of software that has been tested by other users
to point out a few quirks (and some patchy workarounds I've come up with)

Issue description

It doesn't properly handle \uXXX? characters. In my RTF files, at least, the XXX is the decimal representation of a Unicode code point.

workaround

The tool properly handles \'XX-encoded characters (it ignores the code page, but in my case that's okay since I expect it to always be Windows-1253/ISO 8859-7 (Greek)).
I'm preprocessing the file, replacing every \uXXX? sequence with its Windows-1253 equivalent. Of course, sometimes (particularly when the encoded character is outside the range of Windows-1253) it won't map properly, but for my scenario that's rare and acceptable.
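For anyone who wants to try the same preprocessing, here is a minimal Python sketch of the idea. It is not part of rtf2html. The function name `u_to_cp1253_escape` is made up for illustration, and it assumes every `\uN` is followed by exactly one `?` fallback character, as in my files. It rewrites each `\uN?` into the `\'XX` form the tool already handles, and falls back to a plain `?` when the character has no Windows-1253 equivalent.

```python
import re

# Assumption: each \uN is followed by exactly one "?" fallback character.
# The optional space covers the delimiter RTF allows after a control word.
_U_ESCAPE = re.compile(r"\\u(-?\d+) ?\?")


def u_to_cp1253_escape(rtf_text: str) -> str:
    """Rewrite \\uN? sequences as \\'XX escapes in Windows-1253 (hypothetical helper)."""

    def repl(match: re.Match) -> str:
        code = int(match.group(1))
        if code < 0:
            # RTF stores N as a signed 16-bit value, so large code points
            # show up as negative numbers.
            code += 65536
        try:
            raw = chr(code).encode("cp1253")
        except UnicodeEncodeError:
            # No Windows-1253 equivalent: this is the lossy case.
            return "?"
        return "".join("\\'%02x" % byte for byte in raw)

    return _U_ESCAPE.sub(repl, rtf_text)


if __name__ == "__main__":
    # Greek alpha and beta, then a Cyrillic letter that cp1253 cannot hold.
    print(u_to_cp1253_escape(r"\u945?\u946? \u1078?"))
```

The sketch does not understand RTF groups or escaped backslashes, so it is only suitable for files as simple as mine.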


Owner

lvu commented Apr 21, 2016

Thanks for a detailed report! In fact, I haven't touched this project for a while, and when I wrote it I didn't think about unicode at all.

Probably the right thing to do is to always output UTF-8 HTML and add an option to specify the encoding on the command line if it cannot be determined from the RTF itself. What do you think?

Author

FotisK commented Apr 21, 2016

Tricky topic, and I admit I'm not knowledgeable at all :/
I only know two ways in which HTML documents encode non-Latin characters: HTML entities, or a document-wide encoding.
I totally hate HTML entities; I've seen my share of Greek pages saved as piles of illegible HTML entities, so I'd definitely go for the document-wide encoding. UTF-8 is a good option IMO. Latin characters (including HTML tags etc.) remain intact, and if the text is mostly in Latin script (like this discussion here), it probably won't need any re-mapping/transcoding.

If the file were exported in UTF-16/32, every single character would have to be remapped, because the byte sequences of UTF-16/32 are not ASCII-compatible. If it were exported in some 8-bit code page (e.g. Windows-1253), this would bring headaches whenever more than two locales coexist in the same RTF file (like in my case). Imagine having to juggle between the Greek, Latin and Cyrillic alphabets: some characters would have to be transliterated into some lossy equivalent
(and who is to tell what an acceptable equivalent is? using PHP's iconv I had ü and ä transliterated into " and " respectively)!
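To make the comparison concrete, here is a small Python sketch (nothing to do with the rtf2html code base; `can_encode` and the sample string are made up for illustration). It shows that UTF-8 leaves the ASCII markup byte-for-byte intact, that UTF-16 changes the bytes of even plain ASCII, and that a single 8-bit code page cannot hold Greek, Cyrillic and accented Latin together.

```python
def can_encode(text: str, codec: str) -> bool:
    """Return True if every character of text is representable in codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Hypothetical mixed-script fragment: Greek, Cyrillic and accented Latin.
sample = "<p>Ελληνικά, Русский, für</p>"

if __name__ == "__main__":
    # UTF-8 keeps the ASCII tags exactly as they were.
    print(sample.encode("utf-8")[:3])
    # UTF-16 turns every ASCII character into two bytes.
    print("<p>".encode("utf-16-le"))
    # One 8-bit code page is not enough for the mixed text.
    for codec in ("utf-8", "cp1253", "cp1251"):
        print(codec, can_encode(sample, codec))
```

Windows-1253 can hold the Greek part and Windows-1251 the Cyrillic part, but neither can hold the whole fragment, which is where the lossy transliteration would come in.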

I'd go for an encoding that can represent all Unicode characters, in particular UTF-8, since it's the one that needs the least transcoding and produces the most human-readable text. If the user needs something other than UTF-8, I'd let them find their own means of converting my UTF-8 HTML file into anything else they wish. That way I can be certain that my tool didn't make any dubious decisions about transliteration (and let the user worry about that) :-)

btw I noticed that the RTF 1.5 specification mentions a control word called \ucX that would indicate the number of numeric digits in the Unicode characters that follow, e.g. \uc3 would indicate that the following Unicode characters are in the form \uXXX, such as \u319, \u485 and \u489, whereas \uc4 would indicate characters such as \u0319 and \u2991. In my file, however, there was no such control word (\ucX) and the Unicode characters were represented as \uXXX? (they were all 3 numeric digits and a question mark at the end)! A true mess I must say!
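A note on `\ucN`, hedged because it comes from reading the specification rather than the rtf2html source: as far as I can tell, `\ucN` gives the number of fallback characters that follow each `\uN` and that a Unicode-aware reader should skip, not a digit count, and the default is `\uc1`. That would explain the single `?` after every `\uXXX` in a file with no `\uc` at all. The Python sketch below illustrates that reading. `decode_unicode_escapes` is a made-up name, and the sketch ignores RTF group scoping and does not combine surrogate pairs.

```python
import re

# \ucN sets the fallback count; \uN carries one signed 16-bit code unit.
# A single trailing space is the control-word delimiter and is consumed.
_TOKEN = re.compile(r"\\uc(\d+) ?|\\u(-?\d+) ?")


def decode_unicode_escapes(rtf: str) -> str:
    """Replace \\uN escapes with real characters, skipping their fallbacks."""
    out = []
    skip = 1  # default when no \uc control word has been seen
    pos = 0
    while pos < len(rtf):
        match = _TOKEN.match(rtf, pos)
        if match is None:
            # Anything else is copied through unchanged.
            out.append(rtf[pos])
            pos += 1
            continue
        pos = match.end()
        if match.group(1) is not None:
            skip = int(match.group(1))
            continue
        code = int(match.group(2))
        if code < 0:
            code += 65536  # negative values encode code units above 32767
        out.append(chr(code))
        # Skip the fallback text; a \'XX escape counts as one character.
        for _ in range(skip):
            if rtf.startswith("\\'", pos):
                pos += 4
            elif pos < len(rtf):
                pos += 1
    return "".join(out)


if __name__ == "__main__":
    print(decode_unicode_escapes(r"\u945?\u946?"))
```

With this approach the output can be written as UTF-8 directly, without going through an 8-bit code page first.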

yangfar mentioned this issue

Some crashes occur when fuzzing rtf2html. #11

Open
