Even with -utf8 tidy replaces UTF8 code U+00A0 into numeric entity &#160; #871

Closed
mcepl opened this issue Mar 14, 2020 · 7 comments

@mcepl

mcepl commented Mar 14, 2020

When running tidy -i -m -xml -utf8 to canonicalize an XML file, I get this diff (among many other changes):

@@ -15389,7 +15389,7 @@ xsi:schemaLocation="http://www.bibletechnologies.net/2003/OSIS/namespace z:/osis
             <verse sID="Lev.11.13" osisID="Lev.11.13" />Toto jsou
             ptáci, jichž se budete štítit. Nesmějí se jíst, jsou
             ohavní: 
-            <note>některé živočišné druhy v následujících
+            <note>některé živočišné druhy v&#160;následujících
             výčtech nelze určit s jistotou</note>orel, orlosup,
             mořský orel, 
             <verse eID="Lev.11.13" />

The character after the preposition “v” is a non-breaking space (U+00A0). In my opinion, -utf8 means that both the input and the output documents are UTF-8, so tidy should keep its dirty paws off the characters, and in particular it shouldn’t convert perfectly valid UTF-8 characters into numeric entities.

I am using tidy-5.6.0-1.7.x86_64 from the openSUSE package.
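
For reference, the two forms in the diff are the same character as far as an XML parser is concerned: U+00A0 is the two bytes C2 A0 in UTF-8, and `&#160;` is a numeric character reference to the same code point. The Python sketch below checks that and shows one possible way to turn the reference back into the literal character after tidy has run. `restore_nbsp` is a hypothetical helper, not part of tidy, and the plain regex substitution assumes the reference never appears inside a CDATA section or comment, where it should stay as written.

```python
import re
import xml.etree.ElementTree as ET

NBSP = "\u00a0"  # NO-BREAK SPACE

# The same <note> fragment, once with the literal character and once
# with the numeric character reference that shows up in the diff.
literal = "<note>druhy v\u00a0následujících</note>"
entity = "<note>druhy v&#160;následujících</note>"

text_literal = ET.fromstring(literal).text
text_entity = ET.fromstring(entity).text

# An XML parser resolves the reference, so both texts compare equal.
same_text = text_literal == text_entity

# UTF-8 encoding of U+00A0.
nbsp_utf8 = NBSP.encode("utf-8")


def restore_nbsp(text: str) -> str:
    """Replace decimal or hex references to U+00A0 with the literal character.

    Hypothetical post-processing step; it does not parse the XML, so it
    would also rewrite references inside CDATA sections or comments.
    """
    return re.sub(r"&#(?:0*160|[xX]0*[aA]0);", NBSP, text)


print(same_text, nbsp_utf8)
```

Running `restore_nbsp` over tidy's output would undo the substitution, but it is a workaround layered on top of tidy rather than a fix for the behaviour reported here.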

@ler762
Contributor

ler762 commented Mar 14, 2020 via email

@mcepl
Author

mcepl commented Mar 14, 2020

Hmm, so I added quote-nbsp to my ~/.tidyrc and the result is not persuasive either (and I still hold that quote-nbsp: no should be the default when -utf8 is in use):

~@stitny$ unset HTML_TIDY
~@stitny$ tidy -xml ~/projekty/CzeB21/CzeB21.xml 
Loading config file "~/.tidyrc" failed, err = 1
~@stitny$ strace tidy -xml CzeB21.xml |&grep -C 5 '/home/matej/.tidyrc'
openat(AT_FDCWD, "/usr/lib/locale/cs_CZ.utf8/LC_CTYPE", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=340640, ...}) = 0
mmap(NULL, 340640, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f63b18de000
close(3)                                = 0
access("/etc/tidy.conf", F_OK)          = -1 ENOENT (Adresář nebo soubor neexistuje)
access("/home/matej/.tidyrc", F_OK)     = 0
openat(AT_FDCWD, "/home/matej/.tidyrc", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=420, ...}) = 0
mmap(NULL, 420, PROT_READ, MAP_SHARED, 3, 0) = 0x7f63b1e91000
close(3)                                = 0
munmap(0x7f63b1e91000, 420)             = 0
write(2, "Loading config file \"~/.tidyrc\" "..., 47Loading config file "~/.tidyrc" failed, err = 1) = 47
~@stitny$ ls -l ~/.tidyrc
lrwxrwxrwx 1 matej users 23 Mar 14 20:02 /home/matej/.tidyrc -> .config/dotfiles/tidyrc
~@stitny$ ls -l ~/.config/dotfiles/tidyrc
-rw-r--r-- 1 matej users 420 Mar 14 20:02 /home/matej/.config/dotfiles/tidyrc
~@stitny$

I don’t see any reason why tidy shouldn’t read its config file.
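
While the config file question is open, the option can also be passed on the command line, since tidy accepts configuration options in the form `--name value`. A minimal Python sketch of that invocation follows. `build_tidy_command` and `run_tidy` are hypothetical helper names, the flags are the ones used earlier in this thread plus `--quote-nbsp no`, and whether this changes the output for tidy 5.6.0 in XML mode has not been verified here.

```python
import subprocess


def build_tidy_command(path, quote_nbsp=False):
    """Build the argument list for tidy with quote-nbsp set explicitly."""
    flag = "yes" if quote_nbsp else "no"
    # -i indent, -m modify the file in place, -xml XML input, -utf8 encoding
    return ["tidy", "-i", "-m", "-xml", "-utf8", "--quote-nbsp", flag, path]


def run_tidy(path):
    """Run tidy on `path`; requires a tidy binary on PATH."""
    return subprocess.run(
        build_tidy_command(path),
        capture_output=True,
        text=True,
    )


# Only prints the command; nothing is executed here.
print(build_tidy_command("CzeB21.xml"))
```

Passing the option explicitly also makes the run independent of whichever config file tidy does or does not manage to load.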

@ler762
Contributor

ler762 commented Mar 14, 2020 via email

@mcepl
Author

mcepl commented Mar 15, 2020

> ^shrug^ I still hold there should be a command line option to tell tidy to not read any init files. But like the old song... "you can't always get what you want"

Wouldn’t export HTML_TIDY=/dev/null work as well?
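
The idea in that question can be tried from a script by setting `HTML_TIDY` only for the child process. This is a sketch under the assumption that tidy reads the file named by `HTML_TIDY` and is content with an empty one; the thread does not confirm either point, and `tidy_env` is a hypothetical helper name.

```python
import os


def tidy_env(config_path=os.devnull):
    """Return a copy of the environment with HTML_TIDY pointing at config_path.

    os.devnull is "/dev/null" on POSIX systems. The caller's own
    environment is left untouched.
    """
    env = dict(os.environ)
    env["HTML_TIDY"] = config_path
    return env


env = tidy_env()
print(env["HTML_TIDY"])
```

The resulting dictionary would be passed as `env=tidy_env()` to `subprocess.run` when launching tidy, which keeps the override out of the interactive shell.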

> does 'cd /tmp; cat ~/.tidyrc' work for you?

~@stitny$ cd /tmp/
tmp@stitny$ cat ~/.tidyrc
char-encoding: utf8
quote-ampersand: yes
break-before-br: no
drop-empty-paras: yes
drop-proprietary-attributes:yes
show-warnings:no
// sort-attributes:alpha
// drop-font-tags: yes
join-classes:yes
replace-color:yes
write-back: yes
quiet: yes
markup: yes
indent: yes
logical-emphasis: yes
hide-endtags: no
clean: yes
xml: yes
tidy-mark: no
quote-nbsp: no
doctype: html5
word-2000: yes
output-xhtml: yes
tmp@stitny$

This works as well:

tmp@stitny$ tidy -show-config |grep quote
Loading config file "~/.tidyrc" failed, err = 1
quote-ampersand             Boolean    yes                                     
quote-marks                 Boolean    no                                      
quote-nbsp                  Boolean    no                                      
tmp@stitny$
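
Checking the effective value this way can also be scripted. The sketch below parses rows in the three-column shape shown above (name, type, current value) and looks up a single option. `parse_show_config` and `show_config` are hypothetical helper names; the parser assumes option names are lowercase with hyphens, and closing stdin plus setting a timeout is only a precaution so the call cannot sit waiting for input.

```python
import re
import subprocess

# Option names as they appear in the listing, e.g. "quote-nbsp".
OPTION_NAME = re.compile(r"^[a-z][a-z0-9-]*$")


def parse_show_config(text):
    """Map option name -> (type, value) for rows of `tidy -show-config` output."""
    options = {}
    for line in text.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 2 or not OPTION_NAME.match(parts[0]):
            continue  # header, separator, or blank line
        value = parts[2].strip() if len(parts) == 3 else ""
        options[parts[0]] = (parts[1], value)
    return options


def show_config(timeout=10):
    """Run `tidy -show-config` (single dash) and parse its stdout.

    Requires a tidy binary on PATH; the "Loading config file" message
    goes to stderr and is not part of the parsed text.
    """
    result = subprocess.run(
        ["tidy", "-show-config"],
        stdin=subprocess.DEVNULL,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return parse_show_config(result.stdout)


# Rows copied from the output above.
sample = """\
quote-ampersand             Boolean    yes
quote-marks                 Boolean    no
quote-nbsp                  Boolean    no
"""
print(parse_show_config(sample)["quote-nbsp"])  # ('Boolean', 'no')
```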

@ler762
Contributor

ler762 commented Mar 15, 2020 via email

@mcepl
Author

mcepl commented Mar 15, 2020

> The following command got frozen and I had to finish it with Ctrl-C.
>
> @.***$ tidy --show-config |grep wrap
> Loading config file "~/.tidyrc" failed, err = 1
> ^C

The problem was that double dash; without it, tidy works as shown in the updated comment.

> What's your OS and tidy version?

The very first comment on this ticket:

> I am using tidy-5.6.0-1.7.x86_64 from the openSUSE package.

@geoffmcl
Contributor

@mcepl, @ler762 have re-read all this... are there any outstanding issue(s) here?

And @ler762, adding an option to stop reading environmental, or configured, config files, has been considered, and discussed, at length, with you, IIRC, and it was rejected... sorry for that... tidy has an easy to understand config file policy, which conforms to many Unix apps... in no need of a change, that I can see...

But is everything else ok, here? Can this be CLOSED... thanks...
