parsing \uXXX? characters #1
Comments
Thanks for the detailed report! In fact, I haven't touched this project for a while, and when I wrote it I didn't think about Unicode at all. Probably the right thing to do is to always output UTF-8 HTML, adding the ability to specify the input encoding on the command line if it cannot be determined from the RTF itself. What do you think?
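To make the proposal above concrete, here is a minimal Python sketch (not the project's actual code). It decodes both `\'XX` hex escapes and `\uN?` escapes into a Unicode string that can then be written out as UTF-8. The function name `decode_escapes` and its `codepage` parameter are hypothetical; the sketch assumes a single fallback character after each `\uN` (the RTF default, `\uc1`) and does not combine surrogate pairs.

```python
import re

# \uN plus an optional delimiter space and ONE fallback character
# (a literal character or a \'XX escape), or a standalone \'XX escape.
_TOKEN = re.compile(
    r"\\u(-?\d+) ?(?:\\'[0-9a-fA-F]{2}|[^\\{}])?"
    r"|\\'([0-9a-fA-F]{2})"
)


def decode_escapes(rtf_text: str, codepage: str = "cp1252") -> str:
    """Replace \\uN? and \\'XX escapes with the characters they stand for.

    `codepage` is the assumption to use when the RTF does not declare one;
    it could be supplied by a (hypothetical) command-line option.
    """

    def repl(match: re.Match) -> str:
        if match.group(1) is not None:
            n = int(match.group(1))
            if n < 0:  # RTF stores N as a signed 16-bit decimal
                n += 65536
            return chr(n)  # the fallback character is dropped
        byte = bytes([int(match.group(2), 16)])
        return byte.decode(codepage, errors="replace")

    return _TOKEN.sub(repl, rtf_text)
```

For example, `decode_escapes(text, codepage="cp1253").encode("utf-8")` would give the bytes to write into the HTML output, provided the text contains no unpaired surrogates.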
Tricky topic, and I admit I'm not knowledgeable at all :/

If the file were exported in UTF-16/32, every single character would have to be remapped anew, since UTF-16/32 are not byte-compatible with ASCII.

If it were exported in some 8-bit codepage (e.g. Windows 1253), this would bring headaches whenever more than two locales coexist in the same RTF file (as in my case). Imagine having to juggle between the Greek, Latin and Cyrillic alphabets: some characters would have to be transliterated into some lossy equivalent.

I'd go for an encoding that can represent all Unicode characters, in particular UTF-8, since it's the one that requires the least transcoding and produces the most human-readable text. If the user needs something other than UTF-8, I'd let them find their own means of converting my UTF-8 HTML file into whatever else they wish. That way I could be certain that my tool didn't make any dubious decisions about transliteration, and leave that worry to the user :-)

btw I noticed that in the RTF 1.5 specification, there was mention of a control word called
intro
Disclaimer: I'm not opening these issues to ask for a solution; actually I'm very grateful for the existence of rt2html because, of all the things I tried, it is the only one that really produces proper HTML output. These issues may or may not be known, but since I'm not able to fix them at the moment, I'm posting them for posterity. Right now I'm just running the Windows binary. One day I may find the time to read the code and see if there is something I can add, even though I'm not a very skilled programmer.
so why am I opening these issues?
Issue description
It doesn't properly treat \uXXX? characters. In my RTF files at least, the XXX are decimal representations of a Unicode code point.
workaround
The tool properly handles \'XX-encoded characters (it ignores the codepage, but in my case that's okay since I'm expecting it to always be Windows 1253/ISO 8859-7 (Greek)).
I'm preprocessing the file, replacing all the \uXXX? characters with their Windows 1253 equivalents. Of course, sometimes (particularly when the encoded characters are outside the range of Windows 1253) they won't map properly, but for my scenario that's acceptable and rare.
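The preprocessing step described here could look roughly like the following Python sketch. It is an illustration, not the script actually used: the name `unicode_to_hex_escapes` is hypothetical, and it assumes each `\uN` is followed by exactly one fallback character. Escapes whose character has no equivalent in the target codepage are left untouched.

```python
import re

# \uN plus an optional delimiter space and one fallback character.
_UNICODE_ESCAPE = re.compile(r"\\u(-?\d+) ?(?:\\'[0-9a-fA-F]{2}|[^\\{}])?")


def unicode_to_hex_escapes(rtf_text: str, codepage: str = "cp1253") -> str:
    """Rewrite \\uN? escapes as \\'XX escapes in the given 8-bit codepage."""

    def repl(match: re.Match) -> str:
        n = int(match.group(1))
        if n < 0:  # N is written as a signed 16-bit decimal
            n += 65536
        try:
            raw = chr(n).encode(codepage)
        except UnicodeEncodeError:
            # No equivalent in the codepage: keep the original escape.
            return match.group(0)
        return "".join("\\'%02x" % b for b in raw)

    return _UNICODE_ESCAPE.sub(repl, rtf_text)
```

With this sketch, `\u945?` (Greek alpha) becomes `\'e1`, while a Cyrillic escape such as `\u1078?` stays as it was because Windows 1253 cannot represent it.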