pandoc fails to convert table from invalid html to markdown #9090
Comments
thanks for the quickfix, but there are many similar cases of invalid html i guess it would be better to have
Well, we already handle many cases of invalid HTML. If there are other particular ones that come up, feel free to report.
Have you actually tested these cases with pandoc? They work already.
i have only tested these cases, which work
more:
git clone --depth=1 https://github.com/htacg/tidy-html5
cd tidy-html5
cd regression_testing/cases
n=1
# file names arrive on fd 3, so the plain `read` below can still wait for the keyboard
while read -r -u3 html; do
  echo "--- input $n: $html ---"
  cat "$html"
  echo "--- output $n ---"
  pandoc -f html -t html "$html"
  echo "--- hit enter to continue ---"
  read -r
  n=$((n+1))
done 3< <(find . -name '*.html' -printf "%P\n")
lost tags: html, head, title, body
lost tags: doctype, html, head, title, body, form, select, option, comment
lost tags: xml, html, comment, head, title, body
... these were just 3 of 700 html files
but obviously, pandoc is not made for html2html parsing,
pandoc 3.1.8 fails to convert this invalid html table:
this is invalid html, because <th> is closed with </td>
actual result
the table is removed, and only the table contents are rendered
expected result
same result as with a valid head cell
html parsers in web browsers tolerate this invalid html, and still render the table
keywords:
related:
workaround: fix the invalid html with an html linter:
tidy --tidy-mark no --quiet yes --indent no --wrap 0 --force-output yes --markup yes --doctype user --add-xml-decl no --break-before-br no --preserve-entities yes --keep-time yes --show-warnings no --show-errors 0 --newline LF --write-back yes input.html
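if tidy is not available, a much narrower sketch can repair only the mismatch described in this issue (a <th> closed with </td>) before the html reaches pandoc. the function name fix_th_close is hypothetical, and the sed pattern assumes the header cell contains plain text on a single line with no nested tags. it is not a general html repair tool.

```shell
#!/bin/sh
# fix_th_close: hypothetical helper, not part of pandoc or tidy.
# Reads HTML on stdin and rewrites "<th ...>text</td>" to "<th ...>text</th>".
# Assumption: the cell body holds no nested tags and does not span lines.
fix_th_close() {
  sed -E 's#<th( [^>]*)?>([^<]*)</td>#<th\1>\2</th>#g'
}
```

usage would look like `fix_th_close < input.html | pandoc -f html -t markdown`. regular <td> cells and correctly closed <th> cells pass through unchanged.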