On empty list, tidy transforms a valid XHTML file into an invalid one #768

Open
hosiet opened this issue Nov 3, 2018 · 2 comments

hosiet commented Nov 3, 2018

I'm forwarding some longstanding downstream issues here, one of which is about empty lists. Previous reports:

Tidy transforms a valid XHTML file into an invalid one.
For instance, the source has:

<ul class="ul"><li class="li"></li></ul>

which is valid. Tidy removes the empty li, but not the ul (this
doesn't happen if one removes the class attribute), so that one
gets:

<ul class="ul"></ul>

which is invalid (there must be at least one li).

Sample test case:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<!-- $Id: tidy-empty-list.html 43963 2011-05-26 12:08:28Z vinc17/ypig $ -->
<head>
<title>Test of tidy on an empty list</title>
</head>
<body>
<p>Debian's <cite>Tidy</cite> 20091223cvs-1 transforms this valid XHTML
file into an invalid one: it removes the empty <samp>li</samp> but keeps
the <samp>ul</samp> element due to its <samp>class</samp> attribute!</p>
<ul class="ul"><li class="li"></li></ul>
</body>
</html>
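
For anyone who wants to check the before/after output mechanically, here is a minimal sketch in Python (it is not part of tidy). It parses an XHTML string with the standard library and reports every ul or ol that has no li child. The name empty_lists is made up for illustration, and including ol is an assumption based on the XHTML 1.0 Strict DTD; the report above only talks about ul.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sample document in this report.
XHTML_NS = "http://www.w3.org/1999/xhtml"


def empty_lists(xhtml_text):
    """Return the tag names of list elements that have no li child.

    Hypothetical helper for illustration only. It checks the single
    rule discussed here (a list needs at least one li) and is not a
    full DTD validation.
    """
    root = ET.fromstring(xhtml_text)
    found = []
    for tag in ("ul", "ol"):
        for element in root.iter(f"{{{XHTML_NS}}}{tag}"):
            # find() only looks at direct children, which is what the rule needs.
            if element.find(f"{{{XHTML_NS}}}li") is None:
                found.append(tag)
    return found


if __name__ == "__main__":
    before = ('<html xmlns="http://www.w3.org/1999/xhtml"><body>'
              '<ul class="ul"><li class="li"></li></ul></body></html>')
    after = ('<html xmlns="http://www.w3.org/1999/xhtml"><body>'
             '<ul class="ul"></ul></body></html>')
    print(empty_lists(before))
    print(empty_lists(after))
```

Running it on the source fragment should report nothing, while the tidied fragment with the bare ul should be flagged.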

geoffmcl commented Nov 8, 2018

@hosiet thank you for cross posting this here... and the sample xhtml...

I can confirm that even the current tidy 5.7.16 will drop the empty <li>, as does that old 20091223cvs-1 version...

In the current version you can add the option --drop-empty-elements no to the config to avoid this...

But this ref - https://www.w3.org/2010/04/xhtml10-strict.html#elem_ul - says "At least one of li", thus, as you suggest, an empty list is invalid in XHTML - need more W3C references - and libtidy needs a fix... should not be difficult...

Appreciate further feedback, patches or PR... thanks...

geoffmcl added the Bug label Nov 8, 2018

geoffmcl commented Nov 8, 2018

@hosiet looking further into this... at first I thought it might be an HTML4 (one or more li) versus HTML5 (zero or more li) difference, something addressed in #396... but now I think this is maybe a configuration issue...

If you tell tidy to treat the input as well-formed XML, with either -xml or --input-xml yes, then TY_(ParseXMLDocument)(TidyDocImpl* doc) is used. That parser does not end with a call to TY_(DropEmptyElements)(doc, &doc->root); so I think you will get the desired output...

F:\Projects\tidy-test\test>tidy5 -v
HTML Tidy for Windows version 5.7.16
F:\Projects\tidy-test\test>tidy5 -xml input5\in_768.html
No warnings or errors were found.

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<!-- $Id: tidy-empty-list.html 43963 2011-05-26 12:08:28Z vinc17/ypig $ -->
<head>
<title>Test of tidy on an empty list</title>
</head>
<body>
<p>Debian's
<cite>Tidy</cite>20091223cvs-1 transforms this valid XHTML file
into an invalid one: it removes the empty
<samp>li</samp>but keeps the
<samp>ul</samp>element due to its
<samp>class</samp>attribute!</p>
<ul class="ul">
<li class="li"></li>
</ul>
</body>
</html>
F:\Projects\tidy-test\test>tidy-2009 -v
HTML Tidy for Windows released on 25 March 2009
**same output**

As can be seen, this also works for the tidy-2009 release...

To repeat, the deletion only happens if tidy is allowed to default to its HTML parser... where, at least in HTML5, such a deletion is not a problem... and it can be overridden with the option --drop-empty-elements no, as a user choice...
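
The two workarounds mentioned in this thread can be scripted. Below is a minimal sketch in Python that only builds the command lines, using the options quoted above (--drop-empty-elements no and --input-xml yes). The helper names, the mode strings and the executable name tidy are assumptions for illustration; the session above uses a binary called tidy5.

```python
import shutil
import subprocess


def tidy_command(path, mode="keep-empty", exe="tidy"):
    """Build a tidy command line for one of the two workarounds.

    Hypothetical helper: mode "keep-empty" keeps the HTML parser but
    turns off dropping of empty elements; mode "xml" asks tidy to
    treat the input as well-formed XML.
    """
    if mode == "keep-empty":
        return [exe, "--drop-empty-elements", "no", path]
    if mode == "xml":
        return [exe, "--input-xml", "yes", path]
    raise ValueError(f"unknown mode: {mode!r}")


def run_tidy(path, mode="keep-empty", exe="tidy"):
    """Run tidy and return its stdout, or None if the binary is not installed."""
    if shutil.which(exe) is None:
        return None
    result = subprocess.run(tidy_command(path, mode, exe),
                            capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    print(tidy_command("in_768.html"))
    print(tidy_command("in_768.html", mode="xml", exe="tidy5"))
```

Keeping the command construction separate from the subprocess call makes it easy to inspect what would be run before any file is touched.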

The static Bool CanPrune(...) service could be enhanced to check the tidy mode, if this problem needs to be addressed in HTML4 documents... but maybe that could be handled as a separate new issue... thanks...

Does this solve the problem of the empty <li> being deleted... in valid xhtml? thanks...
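
To make the idea of a mode check concrete, here is a toy model in Python. It is not libtidy code and does not reproduce the real CanPrune; the Node class, the function names and the strict_lists flag (a stand-in for "the document type requires at least one li") are all hypothetical. The one rule taken from this thread is that a ul carrying attributes is kept even when empty.

```python
from dataclasses import dataclass, field

LIST_CONTAINERS = {"ul", "ol"}


@dataclass(eq=False)  # identity comparison, so list.remove() drops this exact node
class Node:
    tag: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)
    text: str = ""


def can_prune(node, parent, strict_lists):
    """Toy pruning rule; an assumption-laden sketch, not tidy's logic."""
    if node.children or node.text.strip():
        return False  # not empty
    if node.tag in LIST_CONTAINERS and node.attrs:
        return False  # behaviour described in the report: ul with class is kept
    if strict_lists and node.tag == "li" and parent is not None \
            and parent.tag in LIST_CONTAINERS:
        li_count = sum(1 for child in parent.children if child.tag == "li")
        if li_count == 1:
            return False  # proposed check: never remove the last li
    return True


def drop_empty(node, strict_lists):
    """Depth-first removal of prunable children, modifying the tree in place."""
    for child in list(node.children):
        drop_empty(child, strict_lists)
        if can_prune(child, node, strict_lists):
            node.children.remove(child)


if __name__ == "__main__":
    body = Node("body", children=[
        Node("ul", {"class": "ul"}, [Node("li", {"class": "li"})])])
    drop_empty(body, strict_lists=True)
    print([child.tag for child in body.children[0].children])
```

With strict_lists off, the model reproduces the reported result (an empty ul left behind); with it on, the last li survives and the list stays valid.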

balthisar added this to the 5.9 milestone Jul 9, 2021