bug(encoding): non-ASCII characters in configuration file #737

Closed
Kristinita opened this issue May 15, 2018 · 4 comments

Kristinita commented May 15, 2018

1. Summary

If the HTML Tidy configuration file contains non-ASCII characters:

    I get extra warnings in the console: Warning: replacing invalid character code N

2. Environment

  • Operating system:

    • Windows 10 Enterprise LTSB 64-bit EN (local)
    • Ubuntu 14.04.5 LTS (Travis CI)
  • HTML Tidy:

3. Configuration

See example configuration in SashaTidyDebugging branch of my demo repository.

Sasha__Tidy--NonASCIIInConfiguration.html:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Document</title>
</head>
<body>
    <span></span>
</body>
</html>

tidy.conf:

# “”
#############
# HTML Tidy #
#############
# Validate and fix HTML files:
# http://www.html-tidy.org/
# Description:
# http://api.html-tidy.org/tidy/tidylib_api_next/index.html
# Options:
# http://api.html-tidy.org/tidy/quickref_next.html
# Configuration file format:
# http://api.html-tidy.org/tidy/tidylib_api_next/tidy_config.html
# No official configuration filename, I use common:
# https://github.com/search?utf8=%E2%9C%93&q=filename%3Atidy.conf&type=Code
#
# Doesn't print content of HTML files to console:
# http://api.html-tidy.org/tidy/quickref_next.html#markup
markup: no
# Preserve &amp;, that valid, but no default:
# http://api.html-tidy.org/tidy/quickref_next.html#preserve-entities
# https://github.com/htacg/tidy-html5/issues/732
preserve-entities: yes
# Disable information about HTML Tidy in console:
# http://api.html-tidy.org/tidy/quickref_next.html#quiet
quiet: yes
# Remove meta name="generator":
# http://api.html-tidy.org/tidy/quickref_next.html#tidy-mark
# Arguments:
# https://github.com/htacg/tidy-html5/issues/558#issuecomment-388899700
tidy-mark: no
# Disable warnings, if proprietary attributes:
# http://api.html-tidy.org/tidy/quickref_next.html#warn-proprietary-attributes
# I need delete this option in 5.8.0 HTML Tidy version:
# https://github.com/htacg/tidy-html5/issues/686
warn-proprietary-attributes: no
# Disable line breaks:
# http://api.html-tidy.org/tidy/quickref_next.html#wrap
# https://github.com/gavinballard/grunt-htmltidy/issues/6
wrap: 0

4. Steps to reproduce

tidy -config tidy.conf Sasha__Tidy--NonASCIIInConfiguration.html

5. Expected behavior

If the first line (# “”) in tidy.conf does not exist:

$ tidy -config tidy.conf Sasha__Tidy--NonASCIIInConfiguration.html
line 8 column 9 - Warning: trimming empty <span>

6. Actual behavior

If the first line (# “”) exists:

$ tidy -config tidy.conf Sasha__Tidy--NonASCIIInConfiguration.html
Warning: replacing invalid character code 128
Warning: replacing invalid character code 156
Warning: replacing invalid character code 128
Warning: discarding invalid character code 157
line 8 column 9 - Warning: trimming empty <span>

Thanks.

@geoffmcl
Contributor

@Kristinita thank you for the issue, but this is not an encoding bug, but a tidy feature ;=))

I know this because I read the code - see config.c:945... TY_(ParseConfigFileEnc)( doc, file, "ascii" );...

Then I remembered I had seen this in the tidy docs... took some time to find it... but see tidyLoadConfig... and I hope maybe it is mentioned elsewhere...

And this has been brought up at least once before... see #201 ... and maybe others... as mentioned there, it has always been this way in Tidy!...

Now since all the internal options are ASCII encoded, it does not make sense to support anything other than ASCII, even in what is just a comment line...

Tidy does support a good number of input and output character encodings, but not for its config file contents... sorry...
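
The four character codes in the reported warnings (128, 156, 128, 157) line up with the UTF-8 bytes of the two curly quotes on the first line of tidy.conf. The short Python sketch below only illustrates that arithmetic; the function names are made up for this example and are not part of Tidy. Why byte 226 produces no warning is not stated in this thread. One plausible reading, which is an assumption, is that with the "ascii" setting each byte is handled as a single-byte character and only values in the 128-159 range are flagged.

```python
# Illustration only: list the UTF-8 bytes of the two curly quotes from the
# first line of tidy.conf and pick out those in the 128-159 range.
QUOTES = "\u201c\u201d"  # the characters “ and ”


def utf8_byte_values(text: str) -> list[int]:
    """Return the decimal value of each byte in the UTF-8 encoding of text."""
    return list(text.encode("utf-8"))


def c1_range_bytes(values: list[int]) -> list[int]:
    """Keep only byte values in 128-159 (assumed to be the flagged range)."""
    return [v for v in values if 128 <= v <= 159]


all_bytes = utf8_byte_values(QUOTES)
print(all_bytes)                  # [226, 128, 156, 226, 128, 157]
print(c1_range_bytes(all_bytes))  # [128, 156, 128, 157]
```

The second list has the same values, in the same order, as the codes in the four warnings.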

@geoffmcl
Contributor

geoffmcl commented Oct 9, 2020

2020/10/09:
@Kristinita seems question asked and answered...

And note those warnings like Warning: replacing invalid character code xxx do not get counted in the document errors and warnings...

Simply put, Tidy accepts only ASCII in its config file, and there seems to be no support for changing this at this time...

So am closing this...

Please feel free to re-open, comment, or add a new issue... thanks...

@geoffmcl geoffmcl closed this as completed Oct 9, 2020
@Kristinita
Author

Status: Not fixed 😿

1. Opinion

but this is not an encoding bug, but a tidy feature ;=))

I don’t agree with this statement at all. Degraded usability isn’t a feature. Now, in 2020, people should be able to write any Unicode character in comments.

2. History

Yes, I understand that at the time HTML Tidy was being created, UTF-8 wasn’t popular. See, for example, web encoding statistics for 2001—2008 from the official Google blog:

[Image: UTF-8, 2001-2008]

But currently, in November 2020, for example, 95.8% of sites use UTF-8.

[Image: UTF-8, 2010-2020]

Unicode is now a standard. It’s not good to inconvenience users by not accepting it.

Thanks.

@geoffmcl
Contributor

@Kristinita, thank you for your further feedback...

Yes, we are well aware that utf-8 for Web Pages has gained almost universal acceptance and adoption, and when parsing HTML pages tidy's default is utf-8.

BTW, when tidy was first created, the default was latin-1 (ISO-8859-1)! Not sure when it changed to utf-8, but quite some time ago... wait, I found it - commit 6c9895d Thu Feb 16 12:07:03 2012 +0900 - 8 years ago... Yay for git log -p ... ;=))

But this is not a Web Page.

It is a configuration file, and it is currently documented that it should be in ASCII, as indicated, and as it happens, all config options are in ASCII, which, as I am sure you understand, is the same as utf-8 for the first 128 code points (0 to 127)...
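
Because ASCII text has the same bytes in UTF-8, a config file is safe for Tidy exactly when it contains no byte above 127. A small check along these lines can locate the offending bytes before running Tidy. This is only a sketch: `find_non_ascii` is a hypothetical helper written for this example and is not part of Tidy or libTidy.

```python
def find_non_ascii(raw: bytes) -> list[tuple[int, int, int]]:
    """Return (line, byte position, byte value) for each byte above 127.

    Line numbers and byte positions are 1-based.
    """
    hits = []
    for line_no, line in enumerate(raw.split(b"\n"), start=1):
        for pos, value in enumerate(line, start=1):
            if value > 127:
                hits.append((line_no, pos, value))
    return hits


# Sample input: the first line of the reported tidy.conf plus one option.
sample = "# \u201c\u201d\nmarkup: no\n".encode("utf-8")
for line_no, pos, value in find_non_ascii(sample):
    print(f"line {line_no}, byte {pos}: {value}")
```

To check a real file, read it in binary mode (`open("tidy.conf", "rb").read()`) so that no decoding happens before the check.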

So here we are only talking about comments in that config file. As the config file reading is currently written, libTidy reads the file using its internal character decoder. If it were rewritten another way, we could probably skip such comment lines completely, and not care what followed the # character, or //, until a newline, or eof, is reached... but that requires a rewrite...
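
Until such a rewrite exists, the same idea can be approximated outside Tidy by dropping comment lines before the file reaches it. The sketch below is a user-side workaround, not how libTidy works. It assumes, following the description above, that a comment is a whole line whose first non-blank characters are `#` or `//`. The name `strip_comment_lines` is hypothetical.

```python
def strip_comment_lines(raw: bytes) -> bytes:
    """Drop every line whose first non-blank characters are '#' or '//'.

    Works on raw bytes, so the encoding of the comment text never matters.
    """
    kept = []
    for line in raw.splitlines(keepends=True):
        head = line.lstrip()
        if head.startswith(b"#") or head.startswith(b"//"):
            continue  # comment line: skip it, whatever bytes follow
        kept.append(line)
    return b"".join(kept)


# Sample input modeled on the reported tidy.conf (UTF-8 quotes in a comment).
sample = "# \u201c\u201d\nmarkup: no\n// note\nwrap: 0\n".encode("utf-8")
cleaned = strip_comment_lines(sample)
print(cleaned)            # b'markup: no\nwrap: 0\n'
print(cleaned.isascii())  # True
```

The cleaned bytes could then be written to a temporary file and passed to `tidy -config`, keeping the commented original as the file that people edit.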

Also the libTidy API supports two(2) load config file services - tidyLoadConfig and tidyLoadConfigEnc - so apps using libTidy, could use the 2nd of these, and thus support config comments in a variety of encodings, including utf8...

It is just a fact of history, a feature if you will, that our example tidy.c console app uses the first, tidyLoadConfig, and internally, libTidy diverts that to tidyLoadConfigEnc adding a default "ascii" encoding... which dates back to before git tidy, something like before 2008... probably all the way back to the first release, circa 2000... no, it was after that... way back then, the config read just used getc... anyway...

I do not see this as a great inconvenience to users... nor greatly degraded usability... that they must constrain their comments, in the config file, to US ASCII... but just the comments, since everything else has to be 'ascii' anyway, or at least ascii range... but I agree it is a little sad, in 2020...

Now having said all that, in the above, I have enumerated various ways this could be done, achieved, and if you, or someone else, wants to propose a code change, test it, including modifying the tidy docs accordingly, and present code patches, or a complete PR, tested, verified, it would certainly be considered... thanks...
