Skip to content

NewsDownloader: process HTML with cre.getBalancedHTML() to ensure self-closing tags like <hr> are closed like <hr/>#13188

Merged
Frenzie merged 3 commits into
koreader:masterfrom
Frenzie:newsdownloader-self-closing-tags
Feb 2, 2025
Merged

NewsDownloader: process HTML with cre.getBalancedHTML() to ensure self-closing tags like <hr> are closed like <hr/>#13188
Frenzie merged 3 commits into
koreader:masterfrom
Frenzie:newsdownloader-self-closing-tags

Conversation

@Frenzie

@Frenzie Frenzie commented Feb 1, 2025

Copy link
Copy Markdown
Member

Works around the issue in #13173 (comment).


It's not a proper solution, since traditional HTML like the following would still cause issues, but it catches the low-hanging fruit of self-closing tags. Other than <hr> with its gray color I'm not sure if any of the other elements matter much in practice, but you might sometimes see some slightly curious indentation due to nesting.

<p>Paragraph one
<p>Paragraph two

The simplest solution would be to revert to plain HTML files, but EPUB does offer some advantages. (And I don't know if I want to put in the work required for that. :-)

A better solution might be to leverage crengine or MuPDF's HTML parsing, to output normalized HTML. Or perhaps to put in the mimetype text/html to signal that it's to be parsed differently.


This change is Reviewable

@poire-z

poire-z commented Feb 1, 2025

Copy link
Copy Markdown
Contributor

Not sure I'd advise to use it - and require() crengine even if reading mostly PDF, although I guess newsdownloader users will end up requireing it - but I added a getBalancedHTML() not yet used in:
koreader/koreader-base#1824
koreader/crengine#569

@Frenzie

Frenzie commented Feb 1, 2025

Copy link
Copy Markdown
Member Author

I forgot all about that, thanks. I'll test it out. The replacements in this PR have the advantage of being super fast, but that only goes a small way.

@poire-z

poire-z commented Feb 1, 2025

Copy link
Copy Markdown
Contributor

The available flags are described at https://github.com/koreader/crengine/blob/master/crengine/src/lvtinydom.cpp#L4326-L4344.
It's possible 0x50 adds unwanted newlines, which may cause spurious spaces with some inline elements, ie. foo<b>bar</b> may be serialized back as:

foo
  <b>bar</b>

which may then be rendered as foo bar instead of foobar.
May be 0x30 can avoid this ? Or even 0x0 if no human will ever look at the output :)

@Frenzie

Frenzie commented Feb 1, 2025

Copy link
Copy Markdown
Member Author

I'll what 0x20 and 0x30 output tomorrow. Some newlines are nice to have as a human.

@Frenzie

Frenzie commented Feb 2, 2025

Copy link
Copy Markdown
Member Author

May be 0x30 can avoid this ? Or even 0x0 if no human will ever look at the output :)

I'm not seeing any difference between the output from 0x0 and 0x30?

@poire-z

poire-z commented Feb 2, 2025

Copy link
Copy Markdown
Contributor

Does 0x00 output a single line?
And 0x30 some more pretty multilines html ?
What about 0x50 ?
(All that is old, use what works for you :))

@Frenzie

Frenzie commented Feb 2, 2025

Copy link
Copy Markdown
Member Author

Does 0x00 output a single line?
And 0x30 some more pretty multilines html ?

No, that's what I mean, they both output a single line.

What about 0x50 ?

Yes, that one has the excessive spacing and newlines you referred to. :-)

@poire-z

poire-z commented Feb 2, 2025

Copy link
Copy Markdown
Contributor

Ok, indeed, 0x30 just do as 0x00.
Might be because we don't render the document, so all "rend_method" have their default value of erm_invisible.
https://github.com/koreader/crengine/blob/e53a294e30be165cb2c91f7d7de2dcb9414a6f04/crengine/src/lvtinydom.cpp#L4581-L4597

If there were an erm_unitialized = 0 instead of erm_invisible, I could look the style display value, but not keen on thinking about all that :) Hope you're just fine with 0x00.

@Frenzie Frenzie added this to the 2025.02 milestone Feb 2, 2025
@Frenzie Frenzie changed the title NewsDownloader: ensure self-closing tags like <hr> are closed like <hr/> NewsDownloader: process HTML with cre.getBalancedHTML() to ensure self-closing tags like <hr> are closed like <hr/> Feb 2, 2025
@Frenzie Frenzie merged commit 56c0802 into koreader:master Feb 2, 2025
@Frenzie Frenzie deleted the newsdownloader-self-closing-tags branch February 2, 2025 12:34
@PJansky

PJansky commented Feb 14, 2025

Copy link
Copy Markdown

I'm having issues with the the blog
www.wheresyoured.at/rss as it uses <hr> which messes up the formatting, as described in #13173 (comment) and was hoping this PR would fix it.
Unfortunately it seems no longer correctly create EPUBs for this site specifically, other feeds still seem to work. The nightly of 1. Feb works (with the formatting issues), however the nighty from 2. Feb and onwards with 56c0802 fails to correctly create the EPUB.
The created EPUBs open as:
Failed handling EPUB spine item 'OEBPS/content.html' (application/xhtml+xml, content).

Frenzie added a commit to Frenzie/koreader that referenced this pull request Feb 15, 2025
`<script src="etc"></script>` is turned into `<script/>`, which can cause far too much to be stripped.

While that could be dealt with a bit better, for example by first stripping self-closing and then regular, it feels hacky to do so.

See <koreader#13188 (comment)>.
@Frenzie

Frenzie commented Feb 15, 2025

Copy link
Copy Markdown
Member Author

You should use download full article false unless there's a really good reason not to btw. In this case it works fine with the default settings and it's a lot faster too:

As to the problem, fixed in #13260.

Frenzie added a commit that referenced this pull request Feb 15, 2025
`<script src="etc"></script>` is turned into `<script/>`, which can cause far too much to be stripped.

While that could be dealt with a bit better, for example by first stripping self-closing and then regular, it feels hacky to do so.

See <#13188 (comment)>.
@PJansky

PJansky commented Feb 16, 2025

Copy link
Copy Markdown

Thanks a lot for resolving the issue this quickly. The newest nightly seems to work great

0xstillb pushed a commit to 0xstillb/koreader-thai that referenced this pull request May 9, 2026
…f-closing tags like <hr> are closed like <hr/> (koreader#13188)

Works around the issue in <koreader#13173 (comment)>.
0xstillb pushed a commit to 0xstillb/koreader-thai that referenced this pull request May 9, 2026
…3260)

`<script src="etc"></script>` is turned into `<script/>`, which can cause far too much to be stripped.

While that could be dealt with a bit better, for example by first stripping self-closing and then regular, it feels hacky to do so.

See <koreader#13188 (comment)>.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants