pandoc fails to convert table from invalid html to markdown #9090

milahu · 2023-09-21T11:00:05Z

pandoc 3.1.8 fails to convert this invalid html table:

<table>
  <tr>
    <th>invalid head cell</td>
  </tr>
  <tr>
    <td>body cell</td>
  </tr>
</table>

this is invalid html, because <th> is closed with </td>

-    <th>invalid head cell</td>
+    <th>valid head cell</th>
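
A mismatch like this can be detected before the input reaches pandoc. The following is a minimal sketch, not part of pandoc: it uses Python's standard `html.parser`, which tolerates the bad closing tag, and the names `CellTagChecker` and `find_mismatched_cells` are made up for illustration. It only tracks the most recently opened cell, so nested tables are not handled.

```python
from html.parser import HTMLParser


class CellTagChecker(HTMLParser):
    """Record table cells whose closing tag differs from the opening tag."""

    CELL_TAGS = {"th", "td"}

    def __init__(self):
        super().__init__()
        self.open_cell = None  # name of the currently open cell tag, if any
        self.mismatches = []   # (opened, closed, position) tuples

    def handle_starttag(self, tag, attrs):
        if tag in self.CELL_TAGS:
            self.open_cell = tag

    def handle_endtag(self, tag):
        if tag in self.CELL_TAGS:
            if self.open_cell is not None and tag != self.open_cell:
                self.mismatches.append((self.open_cell, tag, self.getpos()))
            self.open_cell = None


def find_mismatched_cells(html):
    """Return a list of (opened, closed, position) for mismatched cells."""
    checker = CellTagChecker()
    checker.feed(html)
    checker.close()
    return checker.mismatches


bad = "<table><tr><th>invalid head cell</td></tr><tr><td>body cell</td></tr></table>"
for opened, closed, pos in find_mismatched_cells(bad):
    print(f"<{opened}> closed with </{closed}> at {pos}")
```

For the table in this report, the loop prints one line for the `<th>` that is closed with `</td>`; the valid variant produces no output.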

actual result

the table is removed, and only the table contents are rendered

$ pandoc -f html -t markdown_strict <(echo '<table><tr><th>invalid head cell</td></tr><tr><td>body cell</td></tr></table>')
invalid head cell

body cell

expected result

same result as with a valid head cell

html parsers in web browsers tolerate this invalid html, and still render the table

$ pandoc -f html -t markdown_strict <(echo '<table><tr><th>valid head cell</th></tr><tr><td>body cell</td></tr></table>')
<table>
<thead>
<tr class="header">
<th>valid head cell</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>body cell</td>
</tr>
</tbody>
</table>

keywords:

error-tolerant html parser
fault-tolerant html parser
loose html parser
nonstrict html parser

html-tidy: tidy --tidy-mark no --quiet yes --indent no --wrap 0 --force-output yes --markup yes --doctype user --add-xml-decl no --break-before-br no --preserve-entities yes --keep-time yes --show-warnings no --show-errors 0 --newline LF --write-back yes input.html
HTMLHint
htmllint
linthtml

milahu · 2023-10-20T05:56:37Z

thanks for the quickfix, but there are many similar cases of invalid html

i guess it would be better to have

a loose html parser by default which tolerates invalid html just like a web browser
a strict html parser (html_strict) which throws an error on invalid html

jgm · 2023-10-20T06:35:43Z

Well, we already handle many cases of invalid HTML. If there are other particular ones that come up, feel free to report.

milahu · 2023-10-22T05:56:44Z

the <td>...</th> or <th>...</td> case is covered by

the <ul><li>...<li>...</ul> case is covered by

the <a href="x" href="y">...</a> case is not covered by tidy-html5 or chromium

etc ...

this could be automated by comparing a html2text transformation
between pandoc and other loose html parsers
the html2text transformation should be lossless, apart from whitespace differences

or a html2html transformation, where pandoc should output valid html
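
A rough sketch of how such a comparison could look, under several assumptions: Python's `html.parser` stands in for "another loose html parser", pandoc is on the `PATH`, and the helper names (`reference_text`, `missing_words`, `check_with_pandoc`) are hypothetical. Instead of demanding identical text, it reports words that the reference parser sees but pandoc's plain output lacks, since pandoc adds list markers and similar decoration. Inline markup may still cause false positives (for example, emphasis markers attached to a word), and the stand-in extractor also includes `<title>` and `<script>` text.

```python
import subprocess
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect all character data, ignoring the tag structure."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def reference_text(html):
    """Text as seen by the tolerant stdlib parser, whitespace-normalized."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(" ".join(extractor.chunks).split())


def missing_words(reference, candidate):
    """Words (with multiplicity) present in reference but not in candidate."""
    lost = Counter(reference.split()) - Counter(candidate.split())
    return sorted(lost.elements())


def check_with_pandoc(html):
    """Assumes a `pandoc` executable is installed; returns the lost words."""
    result = subprocess.run(
        ["pandoc", "-f", "html", "-t", "plain", "--wrap=none"],
        input=html, capture_output=True, text=True, check=True,
    )
    return missing_words(reference_text(html), result.stdout)
```

Running `check_with_pandoc` over a directory of test cases and printing only the non-empty results would flag inputs where pandoc drops text that a tolerant parser keeps.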

jgm · 2023-10-22T15:09:08Z

Have you actually tested these cases with pandoc? They work already.

milahu · 2023-10-22T15:56:09Z

i have only tested these cases, which work

echo '<a href="x" href="y">...</a>' | pandoc -f html -t html
echo '<ul><li>...<li>...</ul>' | pandoc -f html -t html

more:

git clone --depth=1 https://github.com/htacg/tidy-html5
cd tidy-html5
cd regression_testing/cases
n=1
# file names are read on fd 3, so the inner `read` can still wait on stdin
while IFS= read -r -u3 html; do
  echo "--- input $n: $html ---"
  cat "$html"
  echo "--- output $n ---"
  pandoc -f html -t html "$html"
  echo "--- hit enter to continue ---"
  read -r
  n=$((n+1))
done 3< <(find . -name '*.html' -printf "%P\n")

lost tags: html, head, title, body

--- input 1: xml-expects/case-473490.html ---
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>[ #473490 ] DOCTYPE for Proprietary HTML to XHTML bad</title>
</head>
<body>
<nolayer>
<p>Test</p>
</nolayer>
</body>
</html>
--- output 1 ---
<p>Test</p>
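
The "lost tags" lines in this comment can be computed mechanically rather than by eye. A minimal sketch, again using Python's standard `html.parser` with hypothetical helper names (`tags_in`, `lost_tags`); it only looks at element tags, so doctypes, comments and XML declarations are not counted.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect the names of all start tags seen in a document."""

    def __init__(self):
        super().__init__()
        self.tags = set()

    def handle_starttag(self, tag, attrs):
        self.tags.add(tag)


def tags_in(html):
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def lost_tags(source_html, output_html):
    """Element names that occur in the source but not in the output."""
    return sorted(tags_in(source_html) - tags_in(output_html))


source = """<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>[ #473490 ] DOCTYPE for Proprietary HTML to XHTML bad</title>
</head>
<body>
<nolayer>
<p>Test</p>
</nolayer>
</body>
</html>"""
print(lost_tags(source, "<p>Test</p>"))
# ['body', 'head', 'html', 'nolayer', 'title']
```

For case 1 this also reports `nolayer`, which is dropped along with the document-level tags.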

lost tags: doctype, html, head, title, body, form, select, option, comment

--- input 2: xml-expects/case-432677.html ---
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>[ #432677 ] Null value changed to "value" for -asxml</title>
</head>
<body>
<form action="http://www.w3c.org/" method="get">
<select name="option">
<option value="">option 1</option>
<!-- BAD VALUE -->
<option value="opt2">option 2</option>
</select> 
<input name="input" type="text" value="" /> 
<!-- BAD VALUE -->
 
<input name="submit" type="submit" value="submit" /></form>
</body>
</html>
--- output 2 ---
option 1 option 2

lost tags: xml, html, comment, head, title, body

--- input 3: xml-expects/case-542029.html ---
<?xml version="1.0"?>
<html>
<!-- use -config cfg_542029.txt -->
<head>
<title>[ 542029 ] PPrintXmlDecl reads outside array range</title>
</head>
<body>Test</body>
</html>
--- output 3 ---
Test

... these were just 3 of more than 700 html files

$ find . -name '*.html' | wc -l 
732

but obviously, pandoc is not made for html2html parsing,
or for lossless source-to-source transformations, for example between html and docx

jgm · 2023-10-22T17:30:46Z

pandoc -f html -t html is not intended to be an identity function. See the introductory sections of the manual. In any case, you'd get closer results with -f html+raw_html, which allows parsing unknown things as raw HTML.

milahu added the bug label Sep 21, 2023

jgm closed this as completed in 0fdac49 Oct 19, 2023
