Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

--gnu-emacs yes mangles UTF-8 filenames #468

Closed
jidanni opened this issue Dec 2, 2016 · 12 comments
Closed

--gnu-emacs yes mangles UTF-8 filenames #468

jidanni opened this issue Dec 2, 2016 · 12 comments

Comments

@jidanni
Copy link
Contributor

jidanni commented Dec 2, 2016

$ tidy -i --wrap 122 --drop-proprietary-attributes yes --quote-nbsp no
-asxhtml --gnu-emacs yes -utf8
/tmp/台灣TG蝶園.html > /tmp/台灣TG蝶園.tidy
Warning: discarding invalid character code 143
/tmp/\345\260\347\243TG\350\266\345S^Y.html:22:56046: Warning: inserting implicit
/tmp/\345\260\347\243TG\350\266\345S^Y.html:22:56046: Warning: replacing unexpected button with
/tmp/\345\260\347\243TG\350\266\345S^Y.html:22:55870: Warning: missing ...

@geoffmcl
Copy link
Contributor

geoffmcl commented Dec 2, 2016

@jidanni thanks for the issue. This is for sure a bug!

If I embed that UTF-8 filename, which google tells me means "Taiwan TG butterfly garden" - nice name, if the translation is right - tidy correctly outputs the same UTF-8 into the body of the tidied document...

But makes a real cheese of it when outputting that same UTF-8 sequence as the emacs type warning string ;=(( And what is, where did that "Warning: discarding invalid character code 143" come from?

A test in windows showed more of these warnings, and also show the mangled utf-8 filename - ignore the fact that the codepage I had in use, 1252, does not support utf-8 display. The filename becomes just a set of 8-bit values... not petty, but not mangled...

F:\Projects\tidy-test\test\input5>tidy5 --gnu-emacs yes --show-info no "��TG�園.html"
Warning: discarding invalid character code 143
Warning: discarding invalid character code 129
Warning: discarding invalid character code 157
Warning: replacing invalid character code 156
Warning: replacing invalid character code 146
????TG???S�.html:4:1: Warning: discarding unexpected </html>
????TG???S�.html:7:32: Warning: missing </button>
etc

Must look at the output mechanism used when writing to the error file, or console if none... It is not handling valid UTF-8 sequences correctly...

Will certainly look at that as time permits, but if anyone else has some clues, help would be appreciated... thanks...

Regards, Geoff.

@geoffmcl geoffmcl added the Bug label Dec 2, 2016
@geoffmcl geoffmcl added this to the 5.3 milestone Dec 2, 2016
@jidanni
Copy link
Contributor Author

jidanni commented Dec 3, 2016 via email

@geoffmcl
Copy link
Contributor

geoffmcl commented Dec 3, 2016

@jidanni ok, understand a snip of sort of repeated warnings... no problem... and it took me a while to figure out exactly where they were coming from... and that provided the clue...

Have found what looks like a simple fix - changing the input encoding from ASCII to RAW, in ParseConfigValue, as in the following patch -

diff --git a/src/config.c b/src/config.c
index ddb677c..d001d3d 100644
--- a/src/config.c
+++ b/src/config.c
@@ -934,7 +934,7 @@ Bool TY_(ParseConfigValue)( TidyDocImpl* doc, TidyOptionId optId, ctmbstr optval
         if (optId == TidyOutFile)
             doc->config.cfgIn = TY_(BufferInput)( doc, &inbuf, RAW );
         else
-            doc->config.cfgIn = TY_(BufferInput)( doc, &inbuf, ASCII );
+            doc->config.cfgIn = TY_(BufferInput)( doc, &inbuf, RAW );
         doc->config.c = GetC( &doc->config );
 
         status = option->parser( doc, option );

Obviously, if it turns out correct, then we can actually get rid of the if (optId == TidyOutFile)... RAW ... else.

Would appreciate if you could apply the patch, and test the --gnu-emacs yes option again on any, and every filename... thanks...

You might wonder, like I did at first, why is this not UTF8!

Well, tidy internally translates UTF-8, its default in/out encoding, to unicode internally. Nothing wrong with this, but on output it does requires an internal unicode-to-output encoder, which tidy does for the html file text... allowing different output encoded text...

But this emacs format just copies the string with TY_(tmbsnprintf)(buf, count, "%s:%d:%d: ", cfgStr(doc, TidyEmacsFile), line, col); where cfgStr just returns a buffer pointer...

So if we used UTF8 then the filename would be stored as unicode, so the buffer pointer returned to this ReportPostion service would be 'unicode'... ie no encoder involved... one could be added... But we have been using RAW for years for the TidyOutFile, so I think RAW is the best choice here... for now...

Perhaps the warnings should mention on each line that they are coming from the filename...

Well it is not only for the filename, and they sort of do, in what they do not show. If this was a ReportPostion during document parsing, then we would have a lexer to give us row and column of the document, and the filename, if the gnu-emacs option set.

So they are warnings before any actual document parsing has started, so by the absense of this info, indicates they are from when the configuration options are read, one of which would be the TidyEmacsFile string... but also for many other config options that require a string...

A possible minor enhancement of the message might be, say -

config - Warning: discarding invalid character code 143

That is, if no lexer, then replace the usual line 7 column 7, or the gnu-emacs equivalent, with just say config... probably not too difficult... will try to look at that...

In this case they will not occur if the config reads are configured with UTF8 or RAW, but as stated, there are other considerations, and suggest RAW is the right choice here. Look forward to comments on this...

Hope you, and others using filenames outside ASCII range, get a chance to test and report... thanks...

@jidanni
Copy link
Contributor Author

jidanni commented Dec 3, 2016 via email

geoffmcl added a commit that referenced this issue Dec 3, 2016
@geoffmcl
Copy link
Contributor

geoffmcl commented Dec 3, 2016

@jidanni, thanks for the brief reply...

(I only test .debs... when they reach Debian sid.)

Well, that is sad, because it is only after testing can this fix reach release stage, and then reach Debian sid... minimum months later, unless considered hyper-critical...

And there are not that many of us, in dev, that have filenames that are outside the ASCII range... it took some effort for me to create one, first in linux, and then in windows...

I certainly hope others, using filenames outside ASCII range, get a chance to test and report if it fixes the --gnu-emacs yes output... thanks...

Anyways if you don't output Filename Problem: ... ... ...

Well, as mentioned it is not only a filename problem, but any configuration parameter that takes a string, which is why I suggested it be preceded by the word config, or some such indicator...

I have looked at this, and it would not be hard to do, but it involves first setting up the prefix in the language_en.h, new .po gen'ed, etc, so it can be internationalized...

As stated, at present that prefix generation is controlled by the code if ( line > 0 && col > 0 ) equals add document line/column, or gnu-emacs, prefix, else none, at present...

I have yet to prove to myself that an else here always means we are dealing with a config item, thus can add something like the config prefix... or an even longer filename or config or something...

Will certainly be considering this, but this is a little outside providing the correct gnu-emacs prefix option...

And for people who do not like to get into source patching, have pushed an issue-468 branch, with this fix... so you just need to do -

$ git clone git@github.com:htacg/tidy-html5.git
$ cd tidy-html5
$ git checkout issue-468
$ cd build/cmake
$ cmake ../..
$ make

Now you should have a local tidy - $ ./tidy -v - ... version 5.3.13-I468, for testing...

As stated, hope others, especially using filenames outside ASCII range, test and report it fixes the --gnu-emacs yes output, for all cases... thanks...

@geoffmcl
Copy link
Contributor

As requested, looking for to users testing branch issue-468, particularly with non-ascii file names... thanks...

@zmwangx
Copy link

zmwangx commented Jan 2, 2017

@geoffmcl I just ran into this issue with 5.2.0 on macOS 10.12.2 (with UTF8-encoded Chinese characters in filenames), and I can confirm that the issue-468 branch fixed it for me.

@geoffmcl
Copy link
Contributor

@zmwangx thanks for testing this...

Although you have been the only tester so far, I am convinced changing ASCII to RAW is correct in this case...

Accordingly, have pushed a fix to master, version 5.3.19, and deleted the issue-468 branch...

Although the if (optId == TidyOutFile) is no longer needed, since the code is now equivalent, have chosen to leave it in place, as a reminder, for the moment...

If you, or others, could pull master, and test this --gnu-emacs yes option, especially with utf-8 file names, maybe this can be closed... thanks...

@zmwangx
Copy link

zmwangx commented Feb 26, 2017

@geoffmcl Yes, I can confirm that master (d071341) works.

@geoffmcl
Copy link
Contributor

@zmwangx thanks for testing, so am closing this...

@jidanni
Copy link
Contributor Author

jidanni commented Feb 26, 2017

Sorry I cannot help, because I am busy.

1 similar comment
@jidanni
Copy link
Contributor Author

jidanni commented Feb 26, 2017

Sorry I cannot help, because I am busy.

Kristinita added a commit to Kristinita/SashaAppVeyorDebugging that referenced this issue Jan 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants