New HTML parser, libRu and FB2 tweaks #370

poire-z · 2020-08-24T08:20:21Z

Revert "FB2: don't draw cover in scroll mode" (revert lazy workaround from #366)
(Upstream) FB2: fix coverpage drawing in scroll mode
Proper solution by @virxkane to this issue, koreader/koreader#6490 (comment) , and fixing a bug of mine.

FB2 footnotes: only merge run-in when next is erm_final koreader/koreader#6344 (comment)
fb2.css: use OTF tabular-nums for footnote numbers koreader/koreader#6344 (comment)

Text: ignore ascii and unicode control chars
https://www.mobileread.com/forums/showthread.php?p=4025496#post4025496
I'm not sure it's really preferable, I'd like to see the crap I get :) but I guess that if we find these control chars, it's usually crap left by the publisher or a conversion that a real reader is not interested in.
https://www.aivosto.com/articles/control-characters.html
Please have a quick look that I didn't ignore other valuable code (I kept \t \r \n)
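
As a rough sketch of what gets ignored (the helper name `isIgnorableControlChar` is hypothetical; the actual change sets a per-character flag inside LVFormatter rather than calling a function like this):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not a crengine function: returns true when the
// code point is a C0/C1 control character that should not be drawn.
// TAB (0x09), LF (0x0A) and CR (0x0D) are kept.
bool isIgnorableControlChar(std::uint32_t c) {
    if (c <= 0x001F) {
        // Range 00>1F, except the three characters we keep.
        return c != 0x0009 && c != 0x000A && c != 0x000D;
    }
    // Range 7F>9F (DEL and the C1 controls).
    return c >= 0x007F && c <= 0x009F;
}
```

With this, `isIgnorableControlChar(0x0001)` is true, while `'\t'`, `'A'` and U+00A0 (no-break space) are left alone.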

Fix HR positioning when floats involved
Both the DIV in blue and the HR in grey are styled similarly with {width: 30%; height: 100px; margin: 0 auto 0 auto; } but shouldn't be positioned similarly.
Before:

After:

writeNodeEx(): minor tweaks
OnTagClose(): add self_closing_tag parameter to all XML writers, even if most don't use it
HTML format detection: accept HTML5 doctype Closes #141

HTML parser: rework Lib.ru specific handling koreader/koreader#6480 (comment)
I don't like much having some little non-standard book format specific hacks in there, but well...
Rewrote it to have it a bit more "normal", and solve the display hash mismatch after first load.
The DOM built should hopefully be similar to the old CoolReader one (but maybe not to the one obtained over the last few months with KOReader, so KOReader users reading Lib.Ru books may have recently made highlights messed up).

HTML parser: new more conforming implementation
HTML parser: ensure foster parenting inside tables
Try to conform to the HTML specs, even if not implementing the exact algorithm:
https://html.spec.whatwg.org/multipage/parsing.html
https://htmlparser.info/parser/
Limitations:

- FORM and form elements (SELECT, INPUT, OPTION...)
- TEMPLATE, APPLET, OBJECT, MARQUEE
- Mis-nested HTML/BODY/HEAD
- Reconstructing the active formatting elements (B, I...) when
  mis-nested or "on hold" when entering block or table elements.
- The "adoption agency algorithm" for mis-nested formatting
  elements (and nested <A>)
- We may not ignore some opening tag that we normally should
  (like HEAD or FRAME when in BODY) (but we ignore a standalone
  sub-table element when not inside a TABLE) as this would
  complicate the internal parser state.

Done after trying to fix the autoclose stuff, but what the existing code allowed wasn't right. Some discussion and discoveries at:
koreader/koreader#6482 (comment) #243 (comment)

https://hsivonen.fi/doctype/ (about doctypes; I didn't really investigate whether we'd need to parse them and act differently)

So, that new HTML parser will only be used with HTML and CHM files.
I wonder how much of this and the specs should apply when parsing XHTML from EPUBs.
Should it also correct autoclosing tags (<hr><div>toto</div></hr>; <br></br> should output 2 BRs), misnested tags, non-table stuff inside tables...? Even though a balanced XHTML file maps directly to a DOM... (I don't plan to do anything on that front, just curious :)

Not sure this won't crash on twisted HTML documents :/
Also, it will make my life harder when testing stuff, as now what I usually test with a test.html file (because building a .epub is tedious) won't be handled the same as if it were in a .epub...

This reverts commit 1f9b2cc.

Happened only in enhanced rendering mode, caused by forgetting to use the 'y' provided to RenderBlockElement(). As we're now using it, don't provide a float's container padding_top as this 'y' when rendering the float, as it has already been given to and accounted for by the main FlowState.

The content following the 'display:run-in' footnote number is usually a P rendered erm_final, and this merging of them into a container erm_final works quite well. But when the content is not a P but some more complex erm_block structure (like poem>stanza>v or cite>p), ensuring this merging would make that structure all erm_inline, losing the formatting. So don't do that, and let a blank line after the footnote number happen, which is less bad than losing formatting.

If the font provides the 'tabular-nums' (tnum) OpenType feature, we'll get more properly aligned starts of text after the numbers.

Some of these can still be found in old documents or badly converted ones, and most fonts have no glyph for them (and if they do, it's a small symbol with no meaning for the reader).

Strangely, when floats are involved, an HR behaves differently than a regular DIV and adjusts to the available space. Nothing is mentioned about that in the specs, so try to handle them as Firefox does.

Avoid crash when called while styles not yet set or reset. Show the rootnode as <?RootNode?>.

So DOM writers can distinguish <br/> from <br></br> and handle these differently.
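
A minimal sketch of how a writer could use that flag. `TagLogWriter` and its simplified signatures are assumptions for illustration; only the `self_closing_tag` parameter name comes from this change.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified writer: the real crengine writers have richer
// signatures; only the self_closing_tag flag mirrors the change above.
struct TagLogWriter {
    std::vector<std::string> log;

    void OnTagOpen(const std::string& tagname) {
        log.push_back("open:" + tagname);
    }

    // self_closing_tag is true for <br/>, false for an explicit </br>.
    void OnTagClose(const std::string& tagname, bool self_closing_tag = false) {
        log.push_back((self_closing_tag ? "selfclose:" : "close:") + tagname);
    }
};
```

Feeding it `<br/>` followed by `<br></br>` records one "selfclose:br" and one "close:br", so a writer is free to treat the explicit `</br>` differently.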

Accept <!DOCTYPE html> which may not have any <html>, <head> and <body> tags (they are then implicit), and any other kind of <!DOCTYPE ... HTML ...> which are obviously HTML, no matter how broken they are.
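
A sketch of that permissive check, assuming the detection code looks at the first bytes of the file as a string (`looksLikeHtmlDoctype` is a made-up name, not the crengine detection function):

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper: returns true when `head` contains a <!DOCTYPE ...>
// declaration that mentions "html" anywhere before its closing '>',
// compared case-insensitively.
bool looksLikeHtmlDoctype(std::string head) {
    std::transform(head.begin(), head.end(), head.begin(),
                   [](unsigned char ch) { return static_cast<char>(std::tolower(ch)); });
    const std::string::size_type start = head.find("<!doctype");
    if (start == std::string::npos) return false;
    const std::string::size_type end = head.find('>', start);
    const std::string::size_type html = head.find("html", start);
    // "html" must sit inside the declaration; `end` is npos if the
    // declaration was truncated, in which case any match is accepted.
    return html != std::string::npos && (end == std::string::npos || html < end);
}
```

This accepts both `<!DOCTYPE html>` and older `<!DOCTYPE HTML PUBLIC ...>` forms, and rejects a doctype for another root element even if an `<html>` tag follows it.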

Frenzie · 2020-08-24T08:25:42Z

Also, it will make my life harder when testing stuff, as now what I usually test with a test.html file (because building a .epub is tedious) won't be handled the same as if it were in a .epub...

@poire-z XHTML documents would technically need to be handled the same as EPUB (the other way around really; EPUB is XHTML with some extra metadata). In KOReader .xhtml documents currently open in MuPDF but that could be an easy way around the problem.

Frenzie · 2020-08-24T08:30:55Z

crengine/src/lvtextfm.cpp

@@ -1369,6 +1369,21 @@ class LVFormatter {
                            else if ( c >= 0x2066 ) m_flags[pos] = LCHAR_IS_TO_IGNORE; // 2066>2069
                        }
                    }
+                    else if ( c <= 0x009F ) {
+                        // Also ignore some ASCII and Unicode control chars
+                        // in the ranges 00>1F and 7F>9F, except a few.


I'm not sure it's really preferable, I'd like to see the crap I get :) but I guess that if we find these control chars, it's usually crap left by the publisher or a conversion that a real reader is not interested in.

I like to show formatting in my editors, both Word/Writer and more code-focused ones like Geany. But displaying ? on a weird control char by default seems a bit off. :-)

Frenzie · 2020-08-24T09:11:59Z

crengine/src/lvtinydom.cpp

+            // "Lib.ru html" format is actually minimal HTML with
+            // the text wrapped in <PRE>. We will parse this text


crengine/include/fb2def.h

crengine/src/lvdocview.cpp

NiLuJe · 2020-08-24T11:24:19Z

Nothing jumps out at a quick glance ;) :+1:

poire-z · 2020-08-24T12:07:50Z

XHTML documents would technically need to be handled the same as EPUB (the other way around really; EPUB is XHTML with some extra metadata). In KOReader .xhtml documents currently open in MuPDF but that could be an easy way around the problem.

Except that currently, crengine would still use this new HTMLParser with files named .xhtml.
We'd have to add something to distinguish html from xhtml in the detection code.
But anyway, I'm sure most of my test.html files are not balanced, so I guess the XML/XHTML parser would just bug on them :) And maybe even for standalone .xhtml, we can't just trust they are fully valid: best to use the loose HTML parser so we're sure we get something out.
I guess that for EPUBs, we get more chances they are fully valid XHTML. I wonder if we could use the new HTML writer for them instead of the XMLParser/Writer (but the new HTML stuff is probably slower than the plain XML one). But well, no issue, no real need to wonder about that :)

Frenzie · 2020-08-24T13:26:06Z

And maybe even for standalone .xhtml, we can't just trust they are fully valid: best to use the loose HTML parser so we're sure we get something out.

That's what browsers do though. :-)

.html, even if it declares itself to be XHTML in the DTD, is parsed like HTML.
.xhtml is parsed like XHTML.
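
A minimal sketch of that extension-based choice, assuming only the file name is available (`ParserKind` and `pickParserByExtension` are made-up names, and browsers go by the MIME type when there is one):

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

enum class ParserKind { Html, Xml };

// Case-insensitive "ends with" on plain ASCII file names.
bool endsWithNoCase(const std::string& s, const std::string& suffix) {
    if (s.size() < suffix.size()) return false;
    return std::equal(suffix.rbegin(), suffix.rend(), s.rbegin(),
                      [](unsigned char a, unsigned char b) {
                          return std::tolower(a) == std::tolower(b);
                      });
}

// Hypothetical rule: only .xhtml gets the strict XML parser; everything
// else (including .html that claims to be XHTML) gets the HTML parser.
ParserKind pickParserByExtension(const std::string& filename) {
    return endsWithNoCase(filename, ".xhtml") ? ParserKind::Xml : ParserKind::Html;
}
```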

As to whether that's wise — we can leave that in the middle.

Prevent the first (wrong) rendering from differing from the next (good) ones, which happened because the text was parsed as PRE and the node was replaced by a DIV when closed. Handle Lib.ru books in a more correct DOM-building way.

Follow most rules from the HTML specs. Add a new DOM_VERSION_CURRENT 20200824 so newly opened HTML and CHM files use the new parser (which might give a different DOM tree than the previous one, which will still be used for previously opened books to preserve XPointers). fb2def.h: add a few more tags mentioned in the specs.
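
The version gating can be pictured as below. This is only a sketch: the number 20200824 is from the message above, while the function name and the way the requested DOM version is passed in are assumptions.

```cpp
#include <cassert>

// DOM version introduced with the new HTML parser (from the commit message).
constexpr int kNewHtmlParserDomVersion = 20200824;

// Hypothetical check: books opened before the change keep their recorded
// (older) DOM version, so they stay on the old parser and their XPointers
// remain valid; newly opened books get the current version.
constexpr bool useNewHtmlParser(int requestedDomVersion) {
    return requestedDomVersion >= kNewHtmlParserDomVersion;
}
```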

According to the specs, non-table elements met while building a table (before we meet a TD/TH), should be moved outside the table and added as the previous sibling of the table. Previously, these elements were kept there, and ignored by the table rendering code.
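
To illustrate foster parenting on a toy tree (the `Node` type and both helpers are invented for this sketch and are unrelated to crengine's own DOM classes):

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy DOM node, purely illustrative.
struct Node {
    std::string name;
    Node* parent = nullptr;
    std::vector<std::unique_ptr<Node>> children;
};

// Regular insertion: add a new last child to `parent`.
Node* appendChild(Node& parent, const std::string& name) {
    parent.children.push_back(std::make_unique<Node>());
    Node* child = parent.children.back().get();
    child->name = name;
    child->parent = &parent;
    return child;
}

// Foster parenting: instead of adding the element inside the table,
// insert it into the table's parent, right before the table.
// Assumes `table` has a parent.
Node* fosterParent(Node& table, const std::string& name) {
    Node& parent = *table.parent;
    std::size_t pos = 0;
    while (pos < parent.children.size() && parent.children[pos].get() != &table) ++pos;
    auto node = std::make_unique<Node>();
    node->name = name;
    node->parent = &parent;
    Node* raw = node.get();
    parent.children.insert(
        parent.children.begin() + static_cast<std::ptrdiff_t>(pos), std::move(node));
    return raw;
}
```

With a `body` containing one `table`, calling `fosterParent(*table, "div")` leaves `body` with the children `div, table` in that order.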

poire-z · 2020-08-26T08:38:23Z

Pffff:

<?xml version="1.0" encoding="UTF-8"?> <!-- caused by this -->
<html>
<head>
<title>Available fonts test document</title>

generates:

DEBUG CreDocument: goto xpointer /html/html/body/section[4]/title/text().0

Frenzie · 2020-08-26T08:45:36Z

Ah, that's not good. :-/

poire-z · 2020-08-26T10:56:16Z

Firefox wraps these "processing instructions" in a comment:

The XML parser feeds them to the DOM writers - we don't seem to use them, they don't make it into the DOM with EPUBs because we extract only the BODY from each individual XHTML docfragment.

So, I guess I'll just ignore tags that start with '?' in the new HTML DOM writer, so as to not mess with other stuff.
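
Something as simple as this would do, assuming the writer receives `<?xml ...?>` as a tag whose name starts with '?' (the helper name is made up):

```cpp
#include <cassert>
#include <string>

// Hypothetical check used by the writer to skip processing instructions.
bool isProcessingInstructionTag(const std::string& tagname) {
    return !tagname.empty() && tagname[0] == '?';
}
```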

cramoisi · 2020-09-11T14:30:48Z

Hi @poire-z. Don't know if this is related but I just got a linebreak after an en-dash (dialogue) which is followed by a no-break space \u00a0. First time I see that in koreader. Any idea?

poire-z · 2020-09-11T17:06:11Z

@cramoisi : more probably because of #365 / #364 , which changes a few things for French (but for EM-dash, not EN-dash).
This got into 2020.08.1, but not 2020.08 . So, you might see if things change between these 2 versions.

Not much time currently to dig into that, but maybe @Jellby can have a look if this PR might have some impact here?

Jellby · 2020-09-11T17:43:08Z

I don't see how that could happen with either em-dash or en-dash (and I believe one should use em-dash for dialogues in French). I could have a look if you upload a sample file ("Le Rouge et le Noir" should be public domain and ok to upload).

cramoisi · 2020-09-11T18:02:45Z

@Jellby : for the em/en usage, it's up to the publisher :/ (but I agree with you). It has happened twice so far, each time in the same configuration: (nbsp):(space)–(nbsp)WORD, cut before WORD, or (nbsp): "(nbsp)–(nbsp)WORD. I put the two chapters where it happened in the sample file. Thanks!

r_et_n_sample - Stendhal.epub.zip

(edit : rolled back and also happened in 2020.08)

Jellby · 2020-09-11T18:13:53Z

Actually, I changed my mind, I see how this happens with an EN-dash. It has nothing to do with #365, it's just how en-dashes work, they are like hyphens and allow a break after, even if followed by a no-break space. Another reason to use em-dash. A possible workaround is replacing [en-dash][no-break-space] with [en-dash][word-joiner][no-break-space], but if you are going to change the code, you might as well use em-dashes.
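
A sketch of that replacement on a UTF-32 string, in case someone wants to script it (the function name is made up, and this is a text-level workaround, not something crengine does):

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Illustrative only: inserts U+2060 WORD JOINER between an EN DASH
// (U+2013) and a directly following NO-BREAK SPACE (U+00A0).
std::u32string glueEnDashToNbsp(const std::u32string& text) {
    constexpr char32_t kEnDash = 0x2013;
    constexpr char32_t kNbsp = 0x00A0;
    constexpr char32_t kWordJoiner = 0x2060;
    std::u32string out;
    out.reserve(text.size());
    for (std::size_t i = 0; i < text.size(); ++i) {
        out.push_back(text[i]);
        if (text[i] == kEnDash && i + 1 < text.size() && text[i + 1] == kNbsp) {
            out.push_back(kWordJoiner);
        }
    }
    return out;
}
```

Text without the [en-dash][no-break-space] pair comes back unchanged.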

As possible tweaks in the crengine code, I can say that in proper Spanish typesetting the en-dash does not exist, so it would be OK to treat the en-dash as an em-dash in Spanish texts, for cases where the publisher chooses to use the wrong dash (because it looks nicer in their font, or so they think). If the same happens in French, it could easily be implemented and solve your problem. But if en-dashes have their rightful uses, as in English, it gets trickier.

poire-z · 2020-09-11T18:19:13Z

OK, I can reproduce it.

https://www.unicode.org/reports/tr14/tr14-22.html#BA
Is there something in there that says BA wins over GL (glue) ?

cramoisi · 2020-09-11T18:23:28Z

@Jellby : Yup, I get it: I only see them now because of the publisher usage. Thanks for pointing that out! In French the em-dash is usually for dialogue and the en-dash for inserts. I will just convert them with the calibre editor for now, it's easy to find them all :-)

@poire-z : If a publisher wants to use an en-dash in place of an em-dash (which is wrong, I'm the first to agree) and puts a nbsp after it, what's the point of not respecting it? What's the point of respecting the nbsp everywhere but at some places?

Jellby · 2020-09-11T18:27:55Z

Is there something in there that says BA wins over GL (glue) ?

The pair table (table 2, note that it has apparently been removed in later versions), says that BA GL is a direct break opportunity (i.e. break is allowed even if there are no spaces between them).

What's the point of respecting the nbsp everywhere but at some places?

That's the Unicode linebreaking algorithm (see link above), not a koreader decision, although koreader can choose to ignore/override it.

cramoisi · 2020-09-11T18:31:02Z

That's the Unicode linebreaking algorithm (see link above), not a koreader decision, although koreader can choose to ignore/override it.

Thanks ! If it's unicode it's fine by me :-)

poire-z · 2020-09-11T18:35:17Z

The pair table (table 2, note that it has apparently been removed in later versions), says that BA GL is a direct break opportunity (i.e. break is allowed even if there are no spaces between them).

Thanks. Hovering over it points back to the involved rule, LB12a:

LB12a Do not break before NBSP and related characters, except after spaces and hyphens [...]
Allowing a break after BA or HY matches widespread implementation practice and supports a common way of handling special line breaking of explicit hyphens, such as in Polish and Portuguese

That still feels strange. I would expect these NBSPs to stick everything :)
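
For my own understanding, the LB12a rule quoted above boils down to something like this (a sketch with only a handful of line-breaking classes; the enum and function names are mine, and the other UAX #14 rules are not modelled):

```cpp
#include <cassert>

// A few UAX #14 line-breaking classes, enough for this rule.
enum class LineBreakClass { AL, SP, BA, HY, GL, B2 };

// LB12a: do not break before GL (e.g. NBSP), except when the previous
// character is a space (SP) or hyphen-like (BA, HY).
// Returns true when this rule forbids a break between `prev` and `next`.
constexpr bool lb12aForbidsBreak(LineBreakClass prev, LineBreakClass next) {
    if (next != LineBreakClass::GL) return false;
    return prev != LineBreakClass::SP && prev != LineBreakClass::BA &&
           prev != LineBreakClass::HY;
}
```

EN DASH is class BA, so LB12a does not protect the NBSP after it; EM DASH is class B2, so the break before a following NBSP is forbidden.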

cramoisi · 2020-09-11T18:38:54Z

That still feels strange. I would expect these NBSPs to stick everything :)

Yes, the problem is that people who don't dig into it won't understand why, and will see this behaviour as an error. But the fact is that there are some books where en-dashes are correctly used for inserts but are surrounded by NBSPs. In these books the layout is just horrible, so I think this rule is for the best.

poire-z · 2020-09-11T18:51:27Z

so I think this rule is for the best.

So, you're fine with how things are currently (in 2020.08.1) ? and you will just correct this specific EPUB ?

cramoisi · 2020-09-11T19:07:15Z

@poire-z : I'm never fine with a bad layout or a bad French hyphenation :-) But I was thinking about the mistake where NBSPs are put around correctly used en-dashes (for inserts), and I've seen that far more often than en-dashes used in place of em-dashes...

I still agree with you that it feels strange, mostly because it's not obvious for the common reader, who will just see it as broken.

Which one is the best error ? I don't know 😅

poire-z · 2021-09-01T10:37:13Z

I guess that for EPUBs, we get more chances they are fully valid XHTML. I wonder if we could use the new HTML writer for them instead of the XMLParser/Writer (but the new HTML stuff is probably slower than the plain XML one). But well, no issue, no real need to wonder about that :)

Mhhh, until now :(
Keeping a note about this for later.

Downloaded https://unglue.it/work/186982/ as EPUB.
It must have been made by https://github.com/brthanmathwoag/ebooks/blob/master/dive-into-python3.sh from the HTML5 files at https://github.com/diveintomark/diveintopython3, HTML content that doesn't use closing tags :/

So, it's an EPUB with non-XHTML content.
Parsing it with our XML parser that won't autoclose tags, we get some ugly DOM and rendering issues:

More serious issue is with tables:
The table in 13.9 of https://diveintopython3.net/serializing.html would take a lot of time to render. I have to shorten it to:

<table>
<tr><th>Notes
<th>JSON
<th>Python 3
<tr><th>
<td>object
<td><a href=native-datatypes.html#dictionaries>dictionary</a>
<tr><th>
<td>array
<td><a href=native-datatypes.html#lists>list</a>
<tr><th>
<td>string
<td><a href=strings.html#divingin>string</a>
</table>

to get it to render in a few seconds as...:

Add another layer of tr > th > td, and it takes minutes...
It gets stuck in nested initTableRendMethods() initNodeRendMethodRecursive() recurseElementsDeepFirst(), so we probably have a real issue here.
Probably because of our "complete incomplete tables" from #328, which maybe should not try to do too much with nested crap...

Frenzie · 2021-09-01T10:42:52Z

A reasonably impressive number of fatal errors there…

But yeah, being stuck in some kind of loop due to some bad input is no good. :-)

$ epubcheck b5d497d313204179bea0efae12ce2f7c.epub 
Validating using EPUB version 2.0.1 rules.
ERROR(OPF-031): b5d497d313204179bea0efae12ce2f7c.epub/content.opf(99,71): File listed in reference element in guide was not declared in OPF manifest: indextutorial.html.
ERROR(HTM-004): b5d497d313204179bea0efae12ce2f7c.epub/index.html(-1,-1): Irregular DOCTYPE: found "", expected "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/index.html(2,7): Error while parsing file: elements from namespace "" are not allowed
FATAL(RSC-016): b5d497d313204179bea0efae12ce2f7c.epub/index.html(2,27): Fatal Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/index.html(-1,-1): Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(HTM-004): b5d497d313204179bea0efae12ce2f7c.epub/whats-new.html(-1,-1): Irregular DOCTYPE: found "", expected "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/whats-new.html(2,7): Error while parsing file: elements from namespace "" are not allowed
FATAL(RSC-016): b5d497d313204179bea0efae12ce2f7c.epub/whats-new.html(2,27): Fatal Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/whats-new.html(-1,-1): Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(HTM-004): b5d497d313204179bea0efae12ce2f7c.epub/installing-python.html(-1,-1): Irregular DOCTYPE: found "", expected "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/installing-python.html(2,7): Error while parsing file: elements from namespace "" are not allowed
FATAL(RSC-016): b5d497d313204179bea0efae12ce2f7c.epub/installing-python.html(2,27): Fatal Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/installing-python.html(-1,-1): Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(HTM-004): b5d497d313204179bea0efae12ce2f7c.epub/your-first-python-program.html(-1,-1): Irregular DOCTYPE: found "", expected "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/your-first-python-program.html(2,7): Error while parsing file: elements from namespace "" are not allowed
FATAL(RSC-016): b5d497d313204179bea0efae12ce2f7c.epub/your-first-python-program.html(2,27): Fatal Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/your-first-python-program.html(-1,-1): Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(HTM-004): b5d497d313204179bea0efae12ce2f7c.epub/native-datatypes.html(-1,-1): Irregular DOCTYPE: found "", expected "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/native-datatypes.html(2,7): Error while parsing file: elements from namespace "" are not allowed
FATAL(RSC-016): b5d497d313204179bea0efae12ce2f7c.epub/native-datatypes.html(2,27): Fatal Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/native-datatypes.html(-1,-1): Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(HTM-004): b5d497d313204179bea0efae12ce2f7c.epub/comprehensions.html(-1,-1): Irregular DOCTYPE: found "", expected "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/comprehensions.html(2,7): Error while parsing file: elements from namespace "" are not allowed
FATAL(RSC-016): b5d497d313204179bea0efae12ce2f7c.epub/comprehensions.html(2,27): Fatal Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/comprehensions.html(-1,-1): Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(HTM-004): b5d497d313204179bea0efae12ce2f7c.epub/strings.html(-1,-1): Irregular DOCTYPE: found "", expected "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/strings.html(2,7): Error while parsing file: elements from namespace "" are not allowed
FATAL(RSC-016): b5d497d313204179bea0efae12ce2f7c.epub/strings.html(2,27): Fatal Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/strings.html(-1,-1): Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(HTM-004): b5d497d313204179bea0efae12ce2f7c.epub/regular-expressions.html(-1,-1): Irregular DOCTYPE: found "", expected "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/regular-expressions.html(2,7): Error while parsing file: elements from namespace "" are not allowed
FATAL(RSC-016): b5d497d313204179bea0efae12ce2f7c.epub/regular-expressions.html(2,27): Fatal Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/regular-expressions.html(-1,-1): Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(HTM-004): b5d497d313204179bea0efae12ce2f7c.epub/generators.html(-1,-1): Irregular DOCTYPE: found "", expected "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/generators.html(2,7): Error while parsing file: elements from namespace "" are not allowed
FATAL(RSC-016): b5d497d313204179bea0efae12ce2f7c.epub/generators.html(2,27): Fatal Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/generators.html(-1,-1): Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(HTM-004): b5d497d313204179bea0efae12ce2f7c.epub/iterators.html(-1,-1): Irregular DOCTYPE: found "", expected "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/iterators.html(2,7): Error while parsing file: elements from namespace "" are not allowed
FATAL(RSC-016): b5d497d313204179bea0efae12ce2f7c.epub/iterators.html(2,27): Fatal Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/iterators.html(-1,-1): Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(HTM-004): b5d497d313204179bea0efae12ce2f7c.epub/advanced-iterators.html(-1,-1): Irregular DOCTYPE: found "", expected "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/advanced-iterators.html(2,7): Error while parsing file: elements from namespace "" are not allowed
FATAL(RSC-016): b5d497d313204179bea0efae12ce2f7c.epub/advanced-iterators.html(2,27): Fatal Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/advanced-iterators.html(-1,-1): Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(HTM-004): b5d497d313204179bea0efae12ce2f7c.epub/unit-testing.html(-1,-1): Irregular DOCTYPE: found "", expected "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/unit-testing.html(2,7): Error while parsing file: elements from namespace "" are not allowed
FATAL(RSC-016): b5d497d313204179bea0efae12ce2f7c.epub/unit-testing.html(2,27): Fatal Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/unit-testing.html(-1,-1): Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(HTM-004): b5d497d313204179bea0efae12ce2f7c.epub/refactoring.html(-1,-1): Irregular DOCTYPE: found "", expected "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/refactoring.html(2,7): Error while parsing file: elements from namespace "" are not allowed
FATAL(RSC-016): b5d497d313204179bea0efae12ce2f7c.epub/refactoring.html(2,27): Fatal Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/refactoring.html(-1,-1): Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(HTM-004): b5d497d313204179bea0efae12ce2f7c.epub/files.html(-1,-1): Irregular DOCTYPE: found "", expected "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/files.html(2,7): Error while parsing file: elements from namespace "" are not allowed
FATAL(RSC-016): b5d497d313204179bea0efae12ce2f7c.epub/files.html(2,27): Fatal Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/files.html(-1,-1): Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(HTM-004): b5d497d313204179bea0efae12ce2f7c.epub/xml.html(-1,-1): Irregular DOCTYPE: found "", expected "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/xml.html(2,7): Error while parsing file: elements from namespace "" are not allowed
FATAL(RSC-016): b5d497d313204179bea0efae12ce2f7c.epub/xml.html(2,27): Fatal Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/xml.html(-1,-1): Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(HTM-004): b5d497d313204179bea0efae12ce2f7c.epub/serializing.html(-1,-1): Irregular DOCTYPE: found "", expected "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/serializing.html(2,7): Error while parsing file: elements from namespace "" are not allowed
FATAL(RSC-016): b5d497d313204179bea0efae12ce2f7c.epub/serializing.html(2,27): Fatal Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/serializing.html(-1,-1): Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(HTM-004): b5d497d313204179bea0efae12ce2f7c.epub/http-web-services.html(-1,-1): Irregular DOCTYPE: found "", expected "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/http-web-services.html(2,7): Error while parsing file: elements from namespace "" are not allowed
FATAL(RSC-016): b5d497d313204179bea0efae12ce2f7c.epub/http-web-services.html(2,27): Fatal Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/http-web-services.html(-1,-1): Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(HTM-004): b5d497d313204179bea0efae12ce2f7c.epub/case-study-porting-chardet-to-python-3.html(-1,-1): Irregular DOCTYPE: found "", expected "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/case-study-porting-chardet-to-python-3.html(2,7): Error while parsing file: elements from namespace "" are not allowed
FATAL(RSC-016): b5d497d313204179bea0efae12ce2f7c.epub/case-study-porting-chardet-to-python-3.html(2,27): Fatal Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/case-study-porting-chardet-to-python-3.html(-1,-1): Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(HTM-004): b5d497d313204179bea0efae12ce2f7c.epub/packaging.html(-1,-1): Irregular DOCTYPE: found "", expected "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/packaging.html(2,7): Error while parsing file: elements from namespace "" are not allowed
FATAL(RSC-016): b5d497d313204179bea0efae12ce2f7c.epub/packaging.html(2,27): Fatal Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/packaging.html(-1,-1): Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(HTM-004): b5d497d313204179bea0efae12ce2f7c.epub/porting-code-to-python-3-with-2to3.html(-1,-1): Irregular DOCTYPE: found "", expected "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/porting-code-to-python-3-with-2to3.html(2,7): Error while parsing file: elements from namespace "" are not allowed
FATAL(RSC-016): b5d497d313204179bea0efae12ce2f7c.epub/porting-code-to-python-3-with-2to3.html(2,27): Fatal Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/porting-code-to-python-3-with-2to3.html(-1,-1): Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(HTM-004): b5d497d313204179bea0efae12ce2f7c.epub/special-method-names.html(-1,-1): Irregular DOCTYPE: found "", expected "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/special-method-names.html(2,7): Error while parsing file: elements from namespace "" are not allowed
FATAL(RSC-016): b5d497d313204179bea0efae12ce2f7c.epub/special-method-names.html(2,27): Fatal Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/special-method-names.html(-1,-1): Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(HTM-004): b5d497d313204179bea0efae12ce2f7c.epub/where-to-go-from-here.html(-1,-1): Irregular DOCTYPE: found "", expected "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/where-to-go-from-here.html(2,7): Error while parsing file: elements from namespace "" are not allowed
FATAL(RSC-016): b5d497d313204179bea0efae12ce2f7c.epub/where-to-go-from-here.html(2,27): Fatal Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/where-to-go-from-here.html(-1,-1): Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(HTM-004): b5d497d313204179bea0efae12ce2f7c.epub/troubleshooting.html(-1,-1): Irregular DOCTYPE: found "", expected "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/troubleshooting.html(2,7): Error while parsing file: elements from namespace "" are not allowed
FATAL(RSC-016): b5d497d313204179bea0efae12ce2f7c.epub/troubleshooting.html(2,27): Fatal Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/troubleshooting.html(-1,-1): Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(RSC-007): b5d497d313204179bea0efae12ce2f7c.epub/content.opf(99,71): Referenced resource "indextutorial.html" could not be found in the EPUB.
WARNING(OPF-003): b5d497d313204179bea0efae12ce2f7c.epub(-1,-1): Item "about.html" exists in the EPUB, but is not declared in the OPF manifest.
WARNING(OPF-003): b5d497d313204179bea0efae12ce2f7c.epub(-1,-1): Item "blank.html" exists in the EPUB, but is not declared in the OPF manifest.
WARNING(OPF-003): b5d497d313204179bea0efae12ce2f7c.epub(-1,-1): Item "colophon.html" exists in the EPUB, but is not declared in the OPF manifest.
WARNING(OPF-003): b5d497d313204179bea0efae12ce2f7c.epub(-1,-1): Item "dip3.css" exists in the EPUB, but is not declared in the OPF manifest.
WARNING(OPF-003): b5d497d313204179bea0efae12ce2f7c.epub(-1,-1): Item "table-of-contents.html" exists in the EPUB, but is not declared in the OPF manifest.
ERROR(HTM-049): b5d497d313204179bea0efae12ce2f7c.epub/index.html(2,7): Html element does not have an xmlns set to "http://www.w3.org/1999/xhtml".
ERROR(HTM-049): b5d497d313204179bea0efae12ce2f7c.epub/whats-new.html(2,7): Html element does not have an xmlns set to "http://www.w3.org/1999/xhtml".
ERROR(HTM-049): b5d497d313204179bea0efae12ce2f7c.epub/installing-python.html(2,7): Html element does not have an xmlns set to "http://www.w3.org/1999/xhtml".
ERROR(HTM-049): b5d497d313204179bea0efae12ce2f7c.epub/your-first-python-program.html(2,7): Html element does not have an xmlns set to "http://www.w3.org/1999/xhtml".
ERROR(HTM-049): b5d497d313204179bea0efae12ce2f7c.epub/native-datatypes.html(2,7): Html element does not have an xmlns set to "http://www.w3.org/1999/xhtml".
ERROR(HTM-049): b5d497d313204179bea0efae12ce2f7c.epub/comprehensions.html(2,7): Html element does not have an xmlns set to "http://www.w3.org/1999/xhtml".
ERROR(HTM-049): b5d497d313204179bea0efae12ce2f7c.epub/strings.html(2,7): Html element does not have an xmlns set to "http://www.w3.org/1999/xhtml".
ERROR(HTM-049): b5d497d313204179bea0efae12ce2f7c.epub/regular-expressions.html(2,7): Html element does not have an xmlns set to "http://www.w3.org/1999/xhtml".
ERROR(HTM-049): b5d497d313204179bea0efae12ce2f7c.epub/generators.html(2,7): Html element does not have an xmlns set to "http://www.w3.org/1999/xhtml".
ERROR(HTM-049): b5d497d313204179bea0efae12ce2f7c.epub/iterators.html(2,7): Html element does not have an xmlns set to "http://www.w3.org/1999/xhtml".
ERROR(HTM-049): b5d497d313204179bea0efae12ce2f7c.epub/advanced-iterators.html(2,7): Html element does not have an xmlns set to "http://www.w3.org/1999/xhtml".
ERROR(HTM-049): b5d497d313204179bea0efae12ce2f7c.epub/unit-testing.html(2,7): Html element does not have an xmlns set to "http://www.w3.org/1999/xhtml".
ERROR(HTM-049): b5d497d313204179bea0efae12ce2f7c.epub/refactoring.html(2,7): Html element does not have an xmlns set to "http://www.w3.org/1999/xhtml".
ERROR(HTM-049): b5d497d313204179bea0efae12ce2f7c.epub/files.html(2,7): Html element does not have an xmlns set to "http://www.w3.org/1999/xhtml".
ERROR(HTM-049): b5d497d313204179bea0efae12ce2f7c.epub/xml.html(2,7): Html element does not have an xmlns set to "http://www.w3.org/1999/xhtml".
ERROR(HTM-049): b5d497d313204179bea0efae12ce2f7c.epub/serializing.html(2,7): Html element does not have an xmlns set to "http://www.w3.org/1999/xhtml".
ERROR(HTM-049): b5d497d313204179bea0efae12ce2f7c.epub/http-web-services.html(2,7): Html element does not have an xmlns set to "http://www.w3.org/1999/xhtml".
ERROR(HTM-049): b5d497d313204179bea0efae12ce2f7c.epub/case-study-porting-chardet-to-python-3.html(2,7): Html element does not have an xmlns set to "http://www.w3.org/1999/xhtml".
ERROR(HTM-049): b5d497d313204179bea0efae12ce2f7c.epub/packaging.html(2,7): Html element does not have an xmlns set to "http://www.w3.org/1999/xhtml".
ERROR(HTM-049): b5d497d313204179bea0efae12ce2f7c.epub/porting-code-to-python-3-with-2to3.html(2,7): Html element does not have an xmlns set to "http://www.w3.org/1999/xhtml".
ERROR(HTM-049): b5d497d313204179bea0efae12ce2f7c.epub/special-method-names.html(2,7): Html element does not have an xmlns set to "http://www.w3.org/1999/xhtml".
ERROR(HTM-049): b5d497d313204179bea0efae12ce2f7c.epub/where-to-go-from-here.html(2,7): Html element does not have an xmlns set to "http://www.w3.org/1999/xhtml".
ERROR(HTM-049): b5d497d313204179bea0efae12ce2f7c.epub/troubleshooting.html(2,7): Html element does not have an xmlns set to "http://www.w3.org/1999/xhtml".

Check finished with errors
Messages: 23 fatals / 94 errors / 5 warnings / 0 infos

EPUBCheck completed

poire-z · 2021-09-01T14:16:00Z

A quick try at using our new HTML parser (from this PR) instead of the XML parser with EPUBs solves the issue, and it is not noticeably smaller on some large pure-XHTML EPUB.

--- a/crengine/src/epubfmt.cpp
+++ b/crengine/src/epubfmt.cpp
@@ -1317,7 +1317,8 @@ bool ImportEpubDocument( LVStreamRef stream, ldomDocument * m_doc, LVDocViewCall
     m_doc->setDocFlags( saveFlags );
     m_doc->setContainer( m_arc );

-    ldomDocumentWriter writer(m_doc);
+    // ldomDocumentWriter writer(m_doc);
+    ldomDocumentWriterFilter writer(m_doc, false, HTML_AUTOCLOSE_TABLE);
 #if 0
     m_doc->setNodeTypes( fb2_elem_table );
     m_doc->setAttributeTypes( fb2_attr_table );
diff --git a/crengine/src/lvtinydom.cpp b/crengine/src/lvtinydom.cpp
index c8354e79..cb18783b 100644
--- a/crengine/src/lvtinydom.cpp
+++ b/crengine/src/lvtinydom.cpp
@@ -14619,6 +14619,23 @@ ldomNode * ldomDocumentWriterFilter::OnTagOpen( const lChar32 * nsname, const lC
     if ( id == el_style && _currNode && _currNode->getElement()->getNodeId() == el_head ) {
         _inHeadStyle = true;
     }
+    // For EPUB, when ldomDocumentWriter is driven by ldomDocumentFragmentWriter:
+    // if we see a BODY coming and we are a DocFragment, it's time to apply the
+    // styles set to the DocFragment before switching to BODY (so the styles can
+    // be applied to BODY)
+    if (id == el_body && _currNode && _currNode->_element->getNodeId() == el_DocFragment) {
+        _currNode->_stylesheetIsSet = _currNode->getElement()->applyNodeStylesheet();
+        // _stylesheetIsSet will be used to pop() the stylesheet when
+        // leaving/destroying this DocFragment ldomElementWriter
+        if (_currNode->_stylesheetIsSet) {
+            // If there's a stylesheet, re-init the styles of the parent DocFragment,
+            // which has taken  the place of the initial <HTML>: stylesheet for EPUBs
+            // have had CSS declarations targeting a <html> set to target <DocFragment>.
+            // The <BODY> we will be creating may inherit styles from the re-styled <HTML>.
+            _currNode->_element->initNodeStyle();
+        }
+    }
+

     _currNode = new ldomElementWriter( _document, nsid, id, _currNode, insert_before_last_child );
     _flags = _currNode->getFlags();

It needs more attention and probably some other tweaks, as the resulting DOM is a bit odd, with a few too many <body> elements :)

We'd need to use a single "domwriter" for the whole EPUB, as we can't switch between them across fragments. And we can't even rely on the media-type to be sure the whole EPUB has only XHTML fragments, as this one has:
<item href="serializing.html" id="ch15" media-type="application/xhtml+xml"/>

Frenzie · 2021-09-01T14:19:24Z

If the HTML parser always produces identical results on valid XML, then it may be more user-friendly. I'm not quite sure what to think though. I don't really like XML's draconian error handling... but at the same time, it is supposed to be XML.

poire-z and others added 9 commits August 24, 2020 09:04

Revert "FB2: don't draw cover in scroll mode"

350336d

This reverts commit 1f9b2cc.

fb2.css: use OTF tabular-nums for footnote numbers

28c4510

If the font provides the 'tabular-nums' (tnum) OpenType feature, we'll get more properly aligned starts of text after the numbers.

Text: ignore ascii and unicode control chars

11ba549

Some of these can still be found in old documents or badly converted ones, and most fonts have no glyph for them (and if they do, it's a small symbol with no meaning for the reader).
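
To make the filtering rule concrete, here is a minimal sketch of what "ignore control chars but keep \t \r \n" could look like. It is not the code from this commit: the helper names `isIgnorableControlChar` and `stripControlChars` are hypothetical, and the ranges used (C0 controls, DEL, C1 controls) are an assumption about which code points are meant.

```cpp
#include <cassert>  // only needed if you run the checks from the usage note
#include <string>

// Hypothetical helper (not crengine's actual code): returns true for code
// points treated here as ignorable control characters.
// Assumed ranges: C0 controls (U+0000..U+001F) except \t, \n and \r,
// DEL (U+007F), and C1 controls (U+0080..U+009F).
inline bool isIgnorableControlChar(char32_t c) {
    if (c == U'\t' || c == U'\n' || c == U'\r')
        return false;               // whitespace the text layout still needs
    if (c < 0x20)
        return true;                // remaining C0 controls
    return c >= 0x7F && c <= 0x9F;  // DEL and C1 controls
}

// Returns a copy of the text with ignorable control characters dropped.
inline std::u32string stripControlChars(const std::u32string& text) {
    std::u32string out;
    out.reserve(text.size());
    for (char32_t c : text) {
        if (!isIgnorableControlChar(c))
            out.push_back(c);
    }
    return out;
}
```

With these assumptions, `stripControlChars` would drop a stray U+0001 or U+007F while leaving tabs and line breaks in place. The real commit may well cover a different set of code points.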

Fix HR positioning when floats involved

1ee2180

Strangely, when floats are involved, an HR behaves differently from a regular DIV and adjusts to the available space. Nothing is mentioned about that in the specs, so try to handle them as Firefox does.

writeNodeEx(): minor tweaks

7597bc5

Avoid crash when called while styles not yet set or reset. Show the rootnode as <?RootNode?>.

OnTagClose(): add self_closing_tag parameter

53ff75b

So DOM writers can distinguish <br/> from <br></br> and handle these differently.
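
As an illustration of why the extra parameter is useful, the sketch below shows a writer that serializes tag events back to text and keeps `<br/>` distinct from `<br></br>`. The `TagSink` and `EchoSink` types are hypothetical and heavily simplified; they are not crengine's writer classes, and only the `self_closing_tag` parameter name is taken from the commit title.

```cpp
#include <cassert>  // only needed if you run the checks from the usage note
#include <string>

// Hypothetical, simplified writer interface: names and signatures are
// illustrative only and do not reproduce crengine's parser callbacks.
struct TagSink {
    virtual ~TagSink() = default;
    virtual void OnTagOpen(const std::string& name) = 0;
    // self_closing_tag is true for <br/>, false for an explicit </br>.
    virtual void OnTagClose(const std::string& name, bool self_closing_tag) = 0;
};

// A sink that writes the events back out as text, preserving the distinction.
class EchoSink : public TagSink {
public:
    void OnTagOpen(const std::string& name) override {
        flushPending();
        _pending = name;  // defer output until we know how the tag closes
    }
    void OnTagClose(const std::string& name, bool self_closing_tag) override {
        if (self_closing_tag && _pending == name) {
            _out += "<" + name + "/>";
            _pending.clear();
            return;
        }
        flushPending();
        _out += "</" + name + ">";
    }
    const std::string& str() const { return _out; }

private:
    // Emits the deferred opening tag, if any.
    void flushPending() {
        if (!_pending.empty()) {
            _out += "<" + _pending + ">";
            _pending.clear();
        }
    }
    std::string _out;
    std::string _pending;
};
```

Feeding this sink an open/close pair with `self_closing_tag` set to `true`, then the same pair with `false`, yields `<br/><br></br>`. A writer that does not care about the distinction can simply ignore the flag.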

HTML format detection: accept HTML5 doctype

82e6fb4

Accept <!DOCTYPE html> which may not have any <html>, <head> and <body> tags (they are then implicit), and any other kind of <!DOCTYPE ... HTML ...> which are obviously HTML, no matter how broken they are.
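
A rough sketch of this kind of lenient check is shown below. It is not the detection code from this commit: the function name `looksLikeHtmlDoctype` is hypothetical, and the rule it applies (any `<!DOCTYPE ...>` declaration that mentions "html", case-insensitively) is an assumption based on the description above.

```cpp
#include <cassert>  // only needed if you run the checks from the usage note
#include <cctype>
#include <string>

// Hypothetical sketch, not crengine's detection code: returns true when the
// buffer contains a <!DOCTYPE ...> declaration that mentions "html"
// (case-insensitively). This covers the HTML5 <!DOCTYPE html> as well as
// older or sloppier HTML doctypes.
inline bool looksLikeHtmlDoctype(const std::string& head) {
    std::string lower;
    lower.reserve(head.size());
    for (unsigned char c : head)
        lower.push_back(static_cast<char>(std::tolower(c)));

    const std::string::size_type start = lower.find("<!doctype");
    if (start == std::string::npos)
        return false;
    const std::string::size_type end = lower.find('>', start);
    if (end == std::string::npos)
        return false;
    // Look for "html" only inside the declaration itself.
    const std::string decl = lower.substr(start, end - start);
    return decl.find("html") != std::string::npos;
}
```

Note that this check does not require `<html>`, `<head>` or `<body>` to be present, which matches the case of HTML5 documents where those tags are implicit. It also accepts XHTML doctypes, since they contain "html" as well.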

Frenzie reviewed Aug 24, 2020


Frenzie approved these changes Aug 24, 2020
