pandoc fails to convert table from invalid html to markdown #9090

Closed
milahu opened this issue Sep 21, 2023 · 6 comments

milahu commented Sep 21, 2023

pandoc 3.1.8 fails to convert this invalid html table:

<table>
  <tr>
    <th>invalid head cell</td>
  </tr>
  <tr>
    <td>body cell</td>
  </tr>
</table>

this is invalid html, because <th> is closed with </td>

-    <th>invalid head cell</td>
+    <th>valid head cell</th>

actual result

the table is removed, and only the table contents are rendered

$ pandoc -f html -t markdown_strict <(echo '<table><tr><th>invalid head cell</td></tr><tr><td>body cell</td></tr></table>')
invalid head cell

body cell

expected result

same result as with a valid head cell

html parsers in web browsers tolerate this invalid html, and still render the table

$ pandoc -f html -t markdown_strict <(echo '<table><tr><th>valid head cell</th></tr><tr><td>body cell</td></tr></table>')
<table>
<thead>
<tr class="header">
<th>valid head cell</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>body cell</td>
</tr>
</tbody>
</table>

keywords:

  • error-tolerant html parser
  • fault-tolerant html parser
  • loose html parser
  • nonstrict html parser

related:

workaround: fix the invalid html with a html linter:

  • html-tidy: tidy --tidy-mark no --quiet yes --indent no --wrap 0 --force-output yes --markup yes --doctype user --add-xml-decl no --break-before-br no --preserve-entities yes --keep-time yes --show-warnings no --show-errors 0 --newline LF --write-back yes input.html
  • HTMLHint
  • htmllint
  • linthtml
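
A minimal sketch of that workaround as a pipeline, assuming bash with both `tidy` and `pandoc` on PATH. The function name `tidy_then_pandoc` is made up for illustration, and the option set is a trimmed-down selection of the flags listed above. Whether tidy repairs every kind of broken markup is not guaranteed.

```shell
# repair the markup with html-tidy, then convert the cleaned stream with pandoc
# (tidy_then_pandoc is a hypothetical helper, not part of either tool)
tidy_then_pandoc() {
  # tidy exits with status 1 on warnings; without `pipefail`
  # the pipeline reports pandoc's exit status instead
  tidy --quiet yes --show-warnings no --tidy-mark no --wrap 0 --force-output yes \
    | pandoc -f html -t markdown_strict
}

# usage:
# echo '<table><tr><th>invalid head cell</td></tr><tr><td>body cell</td></tr></table>' | tidy_then_pandoc
```

Reading from stdin avoids `--write-back yes`, so the input file is left untouched.
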
@milahu milahu added the bug label Sep 21, 2023
@jgm jgm closed this as completed in 0fdac49 Oct 19, 2023

milahu commented Oct 20, 2023

thanks for the quickfix, but there are many similar cases of invalid html

i guess it would be better to have

  • a loose html parser by default which tolerates invalid html just like a web browser
  • a strict html parser (html_strict) which throws an error on invalid html


jgm commented Oct 20, 2023

Well, we already handle many cases of invalid HTML. If there are other particular ones that come up, feel free to report.


milahu commented Oct 22, 2023

the <td>...</th> or <th>...</td> case is covered by

the <ul><li>...<li>...</ul> case is covered by

the <a href="x" href="y">...</a> case is not covered by tidy-html5 or chromium

etc ...

this could be automated by comparing the html2text output of pandoc
with the output of other loose html parsers.
the html2text transformation should be lossless, give or take whitespace

or a html2html transformation, where pandoc should output valid html
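
A rough sketch of such a comparison, assuming bash. The helper names `squash_ws`, `same_text` and `compare_html2text` are hypothetical, and `w3m` is only an arbitrary stand-in for "another loose html parser"; the thread does not name one. Differences in line wrapping and table rendering will still produce noise, so this is a starting point rather than a finished test harness.

```shell
# collapse every run of whitespace to a single space and trim both ends
squash_ws() {
  tr -s '[:space:]' ' ' | sed -e 's/^ //' -e 's/ $//'
}

# succeed when two text dumps differ only in whitespace
same_text() {
  [ "$(printf '%s' "$1" | squash_ws)" = "$(printf '%s' "$2" | squash_ws)" ]
}

# compare pandoc's plain-text output with another tolerant converter
# (assumes pandoc and w3m are installed)
compare_html2text() {
  same_text "$(pandoc -f html -t plain "$1")" "$(w3m -dump -T text/html "$1")"
}

# usage:
# compare_html2text input.html || echo "outputs differ: input.html"
```
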


jgm commented Oct 22, 2023

Have you actually tested these cases with pandoc? They work already.


milahu commented Oct 22, 2023

i have only tested these cases, which work

echo '<a href="x" href="y">...</a>' | pandoc -f html -t html
echo '<ul><li>...<li>...</ul>' | pandoc -f html -t html

more:

git clone --depth=1 https://github.com/htacg/tidy-html5
cd tidy-html5
cd regression_testing/cases
n=1
# the file list arrives on fd 3, so stdin stays free for the "hit enter" prompt
while IFS= read -r -u3 html; do
  echo "--- input $n: $html ---"
  cat "$html"
  echo "--- output $n ---"
  pandoc -f html -t html "$html"
  echo "--- hit enter to continue ---"
  read -r
  n=$((n+1))
done 3< <(find . -name '*.html' -printf "%P\n")

lost tags: html, head, title, body

--- input 1: xml-expects/case-473490.html ---
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>[ #473490 ] DOCTYPE for Proprietary HTML to XHTML bad</title>
</head>
<body>
<nolayer>
<p>Test</p>
</nolayer>
</body>
</html>
--- output 1 ---
<p>Test</p>

lost tags: doctype, html, head, title, body, form, select, option, comment

--- input 2: xml-expects/case-432677.html ---
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>[ #432677 ] Null value changed to "value" for -asxml</title>
</head>
<body>
<form action="http://www.w3c.org/" method="get">
<select name="option">
<option value="">option 1</option>
<!-- BAD VALUE -->
<option value="opt2">option 2</option>
</select> 
<input name="input" type="text" value="" /> 
<!-- BAD VALUE -->
 
<input name="submit" type="submit" value="submit" /></form>
</body>
</html>
--- output 2 ---
option 1 option 2

lost tags: xml, html, comment, head, title, body

--- input 3: xml-expects/case-542029.html ---
<?xml version="1.0"?>
<html>
<!-- use -config cfg_542029.txt -->
<head>
<title>[ 542029 ] PPrintXmlDecl reads outside array range</title>
</head>
<body>Test</body>
</html>
--- output 3 ---
Test

... these were just 3 of 700 html files

$ find . -name '*.html' | wc -l 
732

but obviously, pandoc is not made for html2html parsing,
or for lossless source-to-source transformations, for example between html and docx


jgm commented Oct 22, 2023

pandoc -f html -t html is not intended to be an identity function. See the introductory sections of the manual. In any case, you'd get closer results with -f html+raw_html, which allows parsing unknown things as raw HTML.
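
One way to see what the raw_html extension changes for a given file is to diff the two reader configurations. This is a minimal sketch, assuming bash (for process substitution) and pandoc on PATH; `raw_html_diff` is a hypothetical helper name.

```shell
# show how pandoc's html-to-html output changes once raw_html is enabled
# (raw_html_diff is a made-up name; diff exits with 1 when the outputs differ)
raw_html_diff() {
  diff <(pandoc -f html -t html "$1") <(pandoc -f html+raw_html -t html "$1")
}

# usage, from tidy-html5/regression_testing/cases:
# raw_html_diff xml-expects/case-473490.html
```
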
