NewsDownloader: process HTML with cre.getBalancedHTML() to ensure self-closing tags like <hr> are closed like <hr/> #13188
Works around the issue in <koreader#13173 (comment)>.
|
Not sure I'd advise using it, since it means calling require() on crengine even when reading mostly PDFs (although I guess NewsDownloader users will end up requiring it anyway), but I added a getBalancedHTML() that is not yet used in: |
|
I forgot all about that, thanks. I'll test it out. The replacements in this PR have the advantage of being super fast, but that only goes a small way. |
|
The available flags are described at https://github.com/koreader/crengine/blob/master/crengine/src/lvtinydom.cpp#L4326-L4344, which may then be rendered as |
|
I'll check what 0x20 and 0x30 output tomorrow. Some newlines are nice to have as a human. |
I'm not seeing any difference between the output from 0x0 and 0x30? |
|
Does 0x00 output a single line? |
No, that's what I mean, they both output a single line.
Yes, that one has the excessive spacing and newlines you referred to. :-) |
|
Ok, indeed, 0x30 just does the same as 0x00. If there were an erm_unitialized = 0 instead of erm_invisible, I could look at the style display value, but I'm not keen on thinking about all that :) Hope you're just fine with 0x00. |
|
I'm having issues with the blog |
`<script src="etc"></script>` is turned into `<script/>`, which can cause far too much to be stripped. While that could be dealt with a bit better, for example by first stripping self-closing and then regular, it feels hacky to do so. See <koreader#13188 (comment)>.
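To make the failure mode concrete, here is a minimal Python sketch of the over-stripping and of the "self-closing first, then regular" ordering mentioned above. It is purely illustrative: the regexes, function names, and sample markup are assumptions made for this example, not the NewsDownloader code and not necessarily what the eventual fix does.

```python
import re

# Hypothetical input: the first <script src="etc"></script> has already
# been collapsed to <script/> by the HTML balancing step.
html = "<script/><p>Article body</p><script>track()</script><p>Footer</p>"

# Matches an opening <script ...> tag up to the nearest </script>.
PAIRED = re.compile(r"<script\b[^>]*>.*?</script>", re.DOTALL | re.IGNORECASE)
# Matches only self-closed <script .../> tags.
SELF_CLOSED = re.compile(r"<script\b[^>]*/>", re.IGNORECASE)


def strip_naive(s: str) -> str:
    """Strip paired script tags only; <script/> is mistaken for an opener."""
    return PAIRED.sub("", s)


def strip_ordered(s: str) -> str:
    """Strip self-closed script tags first, then the paired ones."""
    return PAIRED.sub("", SELF_CLOSED.sub("", s))


print(strip_naive(html))    # the article body is swallowed
print(strip_ordered(html))  # the article body survives
```

In the naive version, `<script/>` is treated as an opening tag, so everything up to the next `</script>` disappears along with it. The ordered version avoids that, but it still relies on regexes rather than a real parser, which is why it feels hacky.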
|
You should set `download_full_article` to false unless there's a really good reason not to, btw. In this case it works fine with the default settings and it's a lot faster too:
As to the problem, fixed in #13260. |
|
Thanks a lot for resolving the issue this quickly. The newest nightly seems to work great |

Works around the issue in #13173 (comment).
It's not a proper solution, since traditional HTML like the following would still cause issues, but it catches the low-hanging fruit of self-closing tags. Other than `<hr>` with its gray color, I'm not sure if any of the other elements matter much in practice, but you might sometimes see some slightly curious indentation due to nesting.
The simplest solution would be to revert to plain HTML files, but EPUB does offer some advantages. (And I don't know if I want to put in the work required for that. :-)
A better solution might be to leverage crengine's or MuPDF's HTML parsing to output normalized HTML. Or perhaps to set the mimetype to text/html to signal that it should be parsed differently.
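For illustration, the "low-hanging fruit" approach of closing void tags with a plain substitution can be sketched as follows. This is a hedged Python sketch, not the PR's actual code: the tag list is taken from the HTML standard's void elements, and the function name and regex are assumptions made for the example.

```python
import re

# Void elements per the HTML standard; they never take a closing tag.
VOID_TAGS = ("area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr")

# Tag name, then attributes, then an optional "/" before the closing ">".
_VOID_RE = re.compile(
    r"<(" + "|".join(VOID_TAGS) + r")\b([^<>]*?)\s*/?>",
    re.IGNORECASE,
)


def self_close_void_tags(html: str) -> str:
    """Rewrite void tags such as <hr> into the XHTML form <hr/>."""
    return _VOID_RE.sub(lambda m: "<" + m.group(1) + m.group(2) + "/>", html)


print(self_close_void_tags('<p>One<br>Two</p><hr><img src="a.png">'))
```

A substitution like this is fast, but it does nothing for unclosed `<p>` or `<li>` elements and would trip over a `>` inside an attribute value, which is the gap a real parser would close.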