New HTML parser, libRu and FB2 tweaks #370

Merged
merged 12 commits into from
Aug 24, 2020

Conversation

poire-z
Contributor

@poire-z poire-z commented Aug 24, 2020

Revert "FB2: don't draw cover in scroll mode" (revert lazy workaround from #366)
(Upstream) FB2: fix coverpage drawing in scroll mode
Proper solution by @virxkane to this issue, koreader/koreader#6490 (comment) , and fixing a bug of mine.

FB2 footnotes: only merge run-in when next is erm_final koreader/koreader#6344 (comment)
fb2.css: use OTF tabular-nums for footnote numbers koreader/koreader#6344 (comment)

Text: ignore ascii and unicode control chars
https://www.mobileread.com/forums/showthread.php?p=4025496#post4025496
I'm not sure it's really preferable, I'd like to set the crap I get :) but I guess if we find these control chars, it's usually crap left by the publisher or a conversion that a real reader is not interested in.
https://www.aivosto.com/articles/control-characters.html
Please have a quick look to check that I didn't ignore other valuable codes (I kept \t \r \n)
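
A minimal sketch of the filtering rule described above, written in Python rather than crengine's C++. Only the two ranges (00 to 1F and 7F to 9F) and the three kept characters come from this PR; the function names are hypothetical, and the real patch flags characters as ignorable during text formatting instead of stripping them from a string.

```python
# Characters the PR description says are kept: \t, \n, \r.
KEPT_CONTROL_CHARS = {0x09, 0x0A, 0x0D}


def is_ignored_control_char(code_point: int) -> bool:
    """Return True for ASCII/Unicode control chars that should not be drawn."""
    if code_point in KEPT_CONTROL_CHARS:
        return False
    # Ranges 00>1F and 7F>9F, as in the code comment of the patch.
    return code_point <= 0x1F or 0x7F <= code_point <= 0x9F


def strip_ignored(text: str) -> str:
    """Hypothetical helper: drop ignored control chars from a string."""
    return "".join(ch for ch in text if not is_ignored_control_char(ord(ch)))
```

With this rule, a stray BEL (0x07) or NEL (0x85) disappears while tabs and line ends survive.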

Fix HR positioning when floats involved
Both the DIV in blue and the HR in grey are styled similarly with {width: 30%; height: 100px; margin: 0 auto 0 auto; } but shouldn't be positioned similarly.
Before:
image
After:
image

writeNodeEx(): minor tweaks
OnTagClose(): add self_closing_tag parameter to all XML writers, even if most don't use it
HTML format detection: accept HTML5 doctype Closes #141

HTML parser: rework Lib.ru specific handling koreader/koreader#6480 (comment)
I don't much like having some little non-standard, book-format-specific hacks in there, but well...
Rewrote it to make it a bit more "normal", and to solve the display hash mismatch after first load.
The DOM built should hopefully be similar to the old CoolReader one (but maybe not to the one obtained a few months ago with KOReader, so KOReader users reading Lib.Ru books may find highlights they made recently messed up).

HTML parser: new more conforming implementation
HTML parser: ensure foster parenting inside tables
Try to conform to the HTML specs, even if not implementing the exact algorithm:
https://html.spec.whatwg.org/multipage/parsing.html
https://htmlparser.info/parser/
Limitations:

- FORM and form elements (SELECT, INPUT, OPTION...)
- TEMPLATE, APPLET, OBJECT, MARQUEE
- Mis-nested HTML/BODY/HEAD
- Reconstructing the active formatting elements (B, I...) when
  mis-nested or "on hold" when entering block or table elements.
- The "adoption agency algorithm" for mis-nested formatting
  elements (and nested <A>)
- We may not ignore some opening tag that we normally should
  (like HEAD or FRAME when in BODY) (but we ignore a standalone
  sub-table element when not inside a TABLE) as this would
  complicate the internal parser state.

Done after trying to fix the autoclose stuff, but what the existing code allowed wasn't right. Some discussion and discoveries at:
koreader/koreader#6482 (comment) #243 (comment)

https://hsivonen.fi/doctype/ (about doctypes; I didn't really investigate whether we'd need to parse them and act differently)

So, that new HTML parser will only be used with HTML and CHM files.
I wonder how much of this and the specs should apply when parsing XHTML from EPUBs.
Should it also correct autoclosing tags (<hr><div>toto</div></hr>, <br></br> should output 2 BRs), misnested tags, non-table stuff inside tables... ? Even though a balanced XHTML file would map directly to a DOM... (I don't plan to do anything on that front, just curious :)

Not sure this won't crash on twisted HTML documents :/
Also, it will make my life harder when testing stuff, as now what I usually test with a test.html file (because building a .epub is tedious) won't be handled the same as if it were in a .epub...



poire-z and others added 9 commits August 24, 2020 09:04
Happened only in enhanced rendering mode, caused by
forgetting to use the 'y' provided to RenderBlockElement().
As we're now using it, don't provide a float's container
padding_top as this 'y' when rendering the float, as it has
already been given to and accounted for by the main FlowState.
The content following the 'display:run-in' footnote number
is usually a P rendered erm_final, and this merging of them
into a container erm_final works quite well.
But when the content is not a P but some more complex
erm_block structure (like poem>stanza>v or cite>p),
ensuring this merging would make that structure all
erm_inline, losing the formatting. So don't do that,
and let a blank line after the footnote number happen,
which is less bad than losing formatting.
If the font provides the 'tabular-nums' (tnum) OpenType
feature, we'll get more properly aligned starts of text
after the numbers.
Some of these can still be found in old documents or badly
converted ones, and most fonts have no glyph for them
(and if they do, it's a small symbol with no meaning for
the reader).
Strangely, when floats are involved, an HR behaves
differently from a regular DIV and adjusts to the
available space. Nothing is mentioned about that in
the specs, so try to handle them as Firefox does.
Avoid crash when called while styles not yet set or reset.
Show the rootnode as <?RootNode?>.
So DOM writers can distinguish <br/> from <br></br>
and handle these differently.
Accept <!DOCTYPE html> which may not have any <html>,
<head> and <body> tags (they are then implicit), and
any other kind of <!DOCTYPE ... HTML ...> which are
obviously HTML, no matter how broken they are.
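
A rough sketch of the kind of check this commit message describes, assuming a regular expression is good enough for illustration. The function name, the pattern and the `<html>` fallback are assumptions made here; the actual detection code in crengine is C++ and may look at other hints.

```python
import re

# "<!DOCTYPE", whitespace, then the word "html" somewhere before the
# closing ">", case-insensitively. This accepts <!DOCTYPE html> as well
# as older <!DOCTYPE HTML PUBLIC ...> variants.
_DOCTYPE_HTML = re.compile(r"<!DOCTYPE\s[^>]*\bhtml\b", re.IGNORECASE)


def looks_like_html(head: str) -> bool:
    """Guess whether the first chunk of a file is HTML."""
    if _DOCTYPE_HTML.search(head):
        return True
    # Fallback (an assumption of this sketch): an explicit <html> tag.
    return re.search(r"<html[\s>]", head, re.IGNORECASE) is not None
```

Note that such a check fires on the doctype alone, so a document without any `<html>`, `<head>` or `<body>` tag is still recognised.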
@Frenzie
Member

Frenzie commented Aug 24, 2020

Also, it will make my life harder when testing stuff, as now what I usually test with a test.html file (because building a .epub is tedious) won't be handled the same as if it were in a .epub...

@poire-z XHTML documents would technically need to be handled the same as EPUB (the other way around really; EPUB is XHTML with some extra metadata). In KOReader .xhtml documents currently open in MuPDF but that could be an easy way around the problem.

@@ -1369,6 +1369,21 @@ class LVFormatter {
else if ( c >= 0x2066 ) m_flags[pos] = LCHAR_IS_TO_IGNORE; // 2066>2069
}
}
else if ( c <= 0x009F ) {
// Also ignore some ASCII and Unicode control chars
// in the ranges 00>1F and 7F>9F, except a few.
Member

I'm not sure it's really preferable, I'd like to set the crap I get :) but I guess if we find these control chars, it's usually crap left by the publisher or a conversion that a real reader is not interested in.

I like to show formatting in my editors, both Word/Writer and more code-focused ones like Geany. But displaying ? on a weird control char by default seems a bit off. :-)

Screenshot_2020-08-24_10-29-46

Comment on lines +13065 to +13636
// "Lib.ru html" format is actually minimal HTML with
// the text wrapped in <PRE>. We will parse this text
Member

O_o

@NiLuJe
Member

NiLuJe commented Aug 24, 2020

Nothing jumps out at a quick glance ;) :+1:

@poire-z
Contributor Author

poire-z commented Aug 24, 2020

XHTML documents would technically need to be handled the same as EPUB (the other way around really; EPUB is XHTML with some extra metadata). In KOReader .xhtml documents currently open in MuPDF but that could be an easy way around the problem.

Except that currently, crengine would still use this new HTMLParser with files named .xhtml.
We'd have to add something to distinguish HTML from XHTML in the detection code.
But anyway, I'm sure most of my test.html files are not balanced, so I guess the XML/XHTML parser would just choke on them :) And maybe even for standalone .xhtml, we can't just trust they are fully valid: best to use the lenient HTML parser so we're sure we get something out.
I guess that for EPUBs, we get more chances they are fully valid XHTML. I wonder if we could use the new HTML writer for them instead of the XMLParser/Writer (but the new HTML stuff is probably slower than the plain XML one). But well, no issue, no real need to wonder about that :)

@Frenzie
Member

Frenzie commented Aug 24, 2020

And maybe even for standalone .xhtml, we can't just trust they are fully valid: best to use the lenient HTML parser so we're sure we get something out.

That's what browsers do though. :-)

.html, even if it declares itself to be XHTML in the DTD, is parsed like HTML.
.xhtml is parsed like XHTML.

As to whether that's wise, we can leave that open.

Prevent the first (wrong) rendering from differing
from the next (good) ones, which happened because it was parsed
as PRE and the node was replaced by a DIV when closed.
Handle Lib.ru books in a more correct DOM building way.
Follow most rules from the HTML specs.

Add a new DOM_VERSION_CURRENT 20200824 so newly
opened HTML and CHM files use the new parser (which
might give a different DOM tree than the previous one,
which will still be used for previously opened books
to preserve XPointers).
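
The version gating described in this commit message can be pictured as follows. This is only a sketch: the value 20200824 comes from the message, while the function name, the "new"/"legacy" labels and the use of `None` for a never-opened book are assumptions.

```python
from typing import Optional

# Value of DOM_VERSION_CURRENT introduced by this commit.
DOM_VERSION_WITH_NEW_HTML_PARSER = 20200824


def pick_html_parser(stored_dom_version: Optional[int]) -> str:
    """Choose the parser for an HTML or CHM file.

    stored_dom_version is the DOM version saved with a previously opened
    book, or None for a book that has never been opened (assumption).
    """
    if stored_dom_version is None:
        # Newly opened books get the current DOM version.
        stored_dom_version = DOM_VERSION_WITH_NEW_HTML_PARSER
    if stored_dom_version >= DOM_VERSION_WITH_NEW_HTML_PARSER:
        return "new"
    # Older books keep the old parser so their XPointers stay valid.
    return "legacy"
```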
fb2def.h: add a few more tags mentioned in the specs.
According to the specs, non-table elements met while
building a table (before we meet a TD/TH), should be
moved outside the table and added as the previous
sibling of the table.
Previously, these elements were kept there, and ignored
by the table rendering code.
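
A toy illustration of the foster parenting described above, assuming a very small tree model. The `Node` class, the two tag sets and `open_element()` are hypothetical and much simpler than both the HTML specification and the crengine implementation; the sketch only shows a non-table element being inserted as the previous sibling of the table.

```python
# Elements in which stray content must not stay (simplified).
TABLE_CONTEXT = {"table", "tbody", "thead", "tfoot", "tr"}
# Elements that are allowed to be opened directly in that context.
TABLE_CHILDREN = {"caption", "colgroup", "col", "tbody", "thead",
                  "tfoot", "tr", "td", "th"}


class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []


def enclosing_table(node):
    """Walk up to the nearest TABLE ancestor (or None)."""
    while node is not None and node.name != "table":
        node = node.parent
    return node


def open_element(current, name):
    """Create element `name` under `current`, foster parenting if needed."""
    node = Node(name)
    table = enclosing_table(current) if current.name in TABLE_CONTEXT else None
    if table is not None and table.parent is not None and name not in TABLE_CHILDREN:
        # Non-table element met before a TD/TH: move it out of the table,
        # as the previous sibling of the table.
        target = table.parent
        node.parent = target
        target.children.insert(target.children.index(table), node)
    else:
        node.parent = current
        current.children.append(node)
    return node
```

For example, opening a P while the current node is a TR puts the P in BODY, just before the TABLE, whereas a SPAN opened inside a TD stays in the cell.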
@poire-z
Contributor Author

poire-z commented Aug 26, 2020

Pffff:

<?xml version="1.0" encoding="UTF-8"?> <!-- caused by this -->
<html>
<head>
<title>Available fonts test document</title>

generates:
image

DEBUG CreDocument: goto xpointer /html/html/body/section[4]/title/text().0

@Frenzie
Member

Frenzie commented Aug 26, 2020

Ah, that's not good. :-/

@poire-z
Contributor Author

poire-z commented Aug 26, 2020

Firefox wraps these "processing instruction" in a comment:
image

The XML parser feeds them to the DOM writers - we don't seem to use them, they don't make it into the DOM with EPUBs because we extract only the BODY from each individual XHTML docfragment.

So, I guess I'll just ignore tags that start with '?' in the new HTML DOM writer, so as to not mess with other stuff.
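
A minimal sketch of that idea, assuming processing instructions such as `<?xml ...?>` reach the writer as a tag whose name starts with '?'. The class and method names are hypothetical and do not mirror the crengine writer interface.

```python
class SketchDomWriter:
    """Toy DOM writer that only records the element names it accepts."""

    def __init__(self):
        self.opened = []

    def on_tag_open(self, name: str) -> bool:
        # Drop processing instructions so they cannot disturb the
        # implicit HTML/HEAD/BODY handling.
        if name.startswith("?"):
            return False
        self.opened.append(name)
        return True
```

Feeding it `?xml`, `html`, `head`, `title` leaves only the three real elements.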

@cramoisi
Contributor

cramoisi commented Sep 11, 2020

Hi @poire-z. I don't know if this is related, but I just got a line break after an en-dash (dialogue) which is followed by a no-break space \u00a0. It's the first time I've seen that in KOReader. Any idea?

Capture d’écran 2020-09-11 à 16 26 53

Capture d’écran 2020-09-11 à 16 35 56

@poire-z
Contributor Author

poire-z commented Sep 11, 2020

@cramoisi : more probably because of #365 / #364 , which changes a few things for French (but for EM-dash, not EN-dash).
This got into 2020.08.1, but not 2020.08 . So, you might see if things change between these 2 versions.

Not much time currently to dig into that, but maybe @Jellby can have a look if this PR might have some impact here?

@Jellby
Contributor

Jellby commented Sep 11, 2020

I don't see how that could happen with either em-dash or en-dash (and I believe one should use em-dash for dialogues in French). I could have a look if you upload a sample file ("Le Rouge et le Noir" should be public domain and ok to upload).

@cramoisi
Contributor

cramoisi commented Sep 11, 2020

@Jellby : as for em/en usage, it's up to the publisher :/ (but I agree with you). It has happened twice so far, each time in the same kind of configuration: (nbsp):(space)–(nbsp)WORD or (nbsp): "(nbsp)–(nbsp)WORD, with the line cut before WORD. I put the two chapters where it happened in the sample file. Thanks!

r_et_n_sample - Stendhal.epub.zip

(edit : rolled back and also happened in 2020.08)

@Jellby
Contributor

Jellby commented Sep 11, 2020

Actually, I changed my mind, I see how this happens with an EN-dash. It has nothing to do with #365, it's just how en-dashes work, they are like hyphens and allow a break after, even if followed by a no-break space. Another reason to use em-dash. A possible workaround is replacing [en-dash][no-break-space] with [en-dash][word-joiner][no-break-space], but if you are going to change the code, you might as well use em-dashes.

As possible tweaks in the crengine code, I can say that in proper Spanish typesetting the en-dash does not exist, so it would be OK to treat the en-dash as an em-dash in Spanish texts, for cases where the publisher chooses to use the wrong dash (because it looks nicer in their font, or so they think). If the same happens in French, it could easily be implemented and solve your problem. But if en-dashes have their rightful uses, as in English, it gets trickier.
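
The workaround suggested above (inserting a word joiner between the en-dash and the no-break space) could be applied to a text with something like the sketch below. The function name is hypothetical; only the three characters come from the suggestion.

```python
EN_DASH = "\u2013"
NBSP = "\u00a0"
WORD_JOINER = "\u2060"


def glue_en_dash(text: str) -> str:
    """Replace [en-dash][nbsp] with [en-dash][word-joiner][nbsp]."""
    return text.replace(EN_DASH + NBSP, EN_DASH + WORD_JOINER + NBSP)
```

The replacement is idempotent, since after one pass the en-dash is followed by the word joiner and no longer matches. Em-dashes are left untouched.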

@poire-z
Contributor Author

poire-z commented Sep 11, 2020

OK, I can reproduce it.
image

https://www.unicode.org/reports/tr14/tr14-22.html#BA
Is there something in there that says BA wins over GL (glue) ?

@cramoisi
Contributor

@Jellby : Yup, I get it: I only notice them now because of this publisher's usage. Thanks for pointing that out! In French the em-dash is usually for dialogue and the en-dash for inserts. I will just convert them with the calibre editor for now, it's easy to find them all :-)

@poire-z : If a publisher wants to use an en-dash in place of an em-dash (which is wrong, I'm the first to agree) and puts an nbsp after it, what's the point of not respecting it? What's the point of respecting the nbsp everywhere except in some places?

@Jellby
Contributor

Jellby commented Sep 11, 2020

Is there something in there that says BA wins over GL (glue) ?

The pair table (table 2, note that it has apparently been removed in later versions), says that BA GL is a direct break opportunity (i.e. break is allowed even if there are no spaces between them).

What's the point of respecting the nbsp everywhere except in some places?

That's the Unicode linebreaking algorithm (see link above), not a koreader decision, although koreader can choose to ignore/override it.

@cramoisi
Contributor

That's the Unicode linebreaking algorithm (see link above), not a koreader decision, although koreader can choose to ignore/override it.

Thanks ! If it's unicode it's fine by me :-)

@poire-z
Contributor Author

poire-z commented Sep 11, 2020

The pair table (table 2, note that it has apparently been removed in later versions), says that BA GL is a direct break opportunity (i.e. break is allowed even if there are no spaces between them).

Thanks. Hovering over it points back to the involved rule: LB12a :

LB12a Do not break before NBSP and related characters, except after spaces and hyphens [...]
Allowing a break after BA or HY matches widespread implementation practice and supports a common way of handling special line breaking of explicit hyphens, such as in Polish and Portuguese

That still feels strange. I would expect these NBSPs to stick everything :)
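
To make the quoted rule concrete, here is a small sketch of LB12a alone, assuming a tiny hand-written table of line break classes. The table and the function name are illustrative only; a real implementation would use the full Unicode line breaking data and all the other rules.

```python
# A few line break classes from UAX #14 (hand-picked for this sketch).
LINE_BREAK_CLASS = {
    "\u0020": "SP",  # space
    "\u002d": "HY",  # hyphen-minus
    "\u2013": "BA",  # en dash (break after)
    "\u2014": "B2",  # em dash
    "\u00a0": "GL",  # no-break space (glue)
}


def lb12a_forbids_break(before: str, after: str) -> bool:
    """True when LB12a forbids a break between the two characters.

    LB12a: do not break before GL, except after SP, BA or HY.
    Characters missing from the table are treated as ordinary ones.
    """
    if LINE_BREAK_CLASS.get(after) != "GL":
        return False  # the rule only concerns breaks before GL
    return LINE_BREAK_CLASS.get(before) not in {"SP", "BA", "HY"}
```

So an en-dash followed by a no-break space is not protected by this rule, while an em-dash or a letter followed by a no-break space is.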

@cramoisi
Contributor

cramoisi commented Sep 11, 2020

That still feels strange. I would expect these NBSPs to stick everything :)

Yes, the problem is that people who don't dig into it won't understand why, and will see this behaviour as an error. But the fact is that there are some books where en-dashes are correctly used for inserts but are surrounded with NBSPs. In these books the layout is just horrible, so I think this rule is for the best.

@poire-z
Contributor Author

poire-z commented Sep 11, 2020

so I think this rule is for the best.

So, you're fine with how things are currently (in 2020.08.1) ? and you will just correct this specific EPUB ?

@cramoisi
Contributor

@poire-z : I'm never fine with a bad layout or bad French hyphenation :-) But I was thinking about the mistake where NBSPs are put around correctly used en-dashes (for inserts), and I've seen that far more often than en-dashes used in place of em-dashes...

I still agree with you that it feels strange, mostly because it's not obvious to the common reader, who will just see it as broken.

Which one is the best error ? I don't know 😅

@poire-z
Contributor Author

poire-z commented Sep 1, 2021

I guess that for EPUBs, we get more chances they are fully valid XHTML. I wonder if we could use the new HTML writer for them instead of the XMLParser/Writer (but the new HTML stuff is probably slower than the plain XML one). But well, no issue, no real need to wonder about that :)

Mhhh, until now :(
Keeping a note about this for later.

Downloaded https://unglue.it/work/186982/ as EPUB.
It must have been made by https://github.com/brthanmathwoag/ebooks/blob/master/dive-into-python3.sh from the HTML5 files at https://github.com/diveintomark/diveintopython3, HTML content that doesn't use closing tags :/

So, it's an EPUB with non-XHTML content.
Parsing it with our XML parser that won't autoclose tags, we get some ugly DOM and rendering issues:
image

More serious issue is with tables:
The table in 13.9 of https://diveintopython3.net/serializing.html takes a very long time to render. I had to shorten it to:

<table>
<tr><th>Notes
<th>JSON
<th>Python 3
<tr><th>
<td>object
<td><a href=native-datatypes.html#dictionaries>dictionary</a>
<tr><th>
<td>array
<td><a href=native-datatypes.html#lists>list</a>
<tr><th>
<td>string
<td><a href=strings.html#divingin>string</a>
</table>

to get it to render in a few seconds as...:
image

Add another layer of tr > th > td, and it takes minutes...
It gets stuck in nested initTableRendMethods() initNodeRendMethodRecursive() recurseElementsDeepFirst() calls, so we probably have a real issue here.
Probably because of our "complete incomplete tables" from #328, which maybe should not try to do too much with nested crap...
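
To get a feel for how deep such markup nests when nothing is closed implicitly, here is a small probe built on Python's standard `html.parser`. It is only an illustration: `DepthProbe` and `max_depth()` are hypothetical names, the implied-closing rules are reduced to TR/TH/TD, and it says nothing about how crengine actually builds or renders the DOM.

```python
from html.parser import HTMLParser


class DepthProbe(HTMLParser):
    """Track nesting depth, optionally closing rows and cells implicitly."""

    def __init__(self, autoclose: bool):
        super().__init__()
        self.autoclose = autoclose
        self.stack = []
        self.max_depth = 0

    def _pop_until(self, names):
        # Close open elements up to and including the nearest one in
        # `names`, without ever closing past the enclosing table.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i] == "table":
                return
            if self.stack[i] in names:
                del self.stack[i:]
                return

    def handle_starttag(self, tag, attrs):
        if self.autoclose:
            if tag in ("td", "th"):
                self._pop_until({"td", "th"})
            elif tag == "tr":
                self._pop_until({"tr"})
        self.stack.append(tag)
        self.max_depth = max(self.max_depth, len(self.stack))

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Drop everything above (and including) the matching element.
            while self.stack.pop() != tag:
                pass


def max_depth(markup: str, autoclose: bool) -> int:
    probe = DepthProbe(autoclose)
    probe.feed(markup)
    probe.close()
    return probe.max_depth
```

On the shortened table above, this probe reports a depth of 18 without implied closing and 4 with it, and each additional unclosed row adds four more levels in the first case.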

@Frenzie
Member

Frenzie commented Sep 1, 2021

A reasonably impressive number of fatal errors there…

But yeah, being stuck in some kind of loop due to some bad input is no good. :-)

$ epubcheck b5d497d313204179bea0efae12ce2f7c.epub 
Validating using EPUB version 2.0.1 rules.
ERROR(OPF-031): b5d497d313204179bea0efae12ce2f7c.epub/content.opf(99,71): File listed in reference element in guide was not declared in OPF manifest: indextutorial.html.
ERROR(HTM-004): b5d497d313204179bea0efae12ce2f7c.epub/index.html(-1,-1): Irregular DOCTYPE: found "", expected "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/index.html(2,7): Error while parsing file: elements from namespace "" are not allowed
FATAL(RSC-016): b5d497d313204179bea0efae12ce2f7c.epub/index.html(2,27): Fatal Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/index.html(-1,-1): Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(HTM-004): b5d497d313204179bea0efae12ce2f7c.epub/whats-new.html(-1,-1): Irregular DOCTYPE: found "", expected "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/whats-new.html(2,7): Error while parsing file: elements from namespace "" are not allowed
FATAL(RSC-016): b5d497d313204179bea0efae12ce2f7c.epub/whats-new.html(2,27): Fatal Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/whats-new.html(-1,-1): Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(HTM-004): b5d497d313204179bea0efae12ce2f7c.epub/installing-python.html(-1,-1): Irregular DOCTYPE: found "", expected "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/installing-python.html(2,7): Error while parsing file: elements from namespace "" are not allowed
FATAL(RSC-016): b5d497d313204179bea0efae12ce2f7c.epub/installing-python.html(2,27): Fatal Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/installing-python.html(-1,-1): Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(HTM-004): b5d497d313204179bea0efae12ce2f7c.epub/your-first-python-program.html(-1,-1): Irregular DOCTYPE: found "", expected "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/your-first-python-program.html(2,7): Error while parsing file: elements from namespace "" are not allowed
FATAL(RSC-016): b5d497d313204179bea0efae12ce2f7c.epub/your-first-python-program.html(2,27): Fatal Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/your-first-python-program.html(-1,-1): Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(HTM-004): b5d497d313204179bea0efae12ce2f7c.epub/native-datatypes.html(-1,-1): Irregular DOCTYPE: found "", expected "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/native-datatypes.html(2,7): Error while parsing file: elements from namespace "" are not allowed
FATAL(RSC-016): b5d497d313204179bea0efae12ce2f7c.epub/native-datatypes.html(2,27): Fatal Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/native-datatypes.html(-1,-1): Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(HTM-004): b5d497d313204179bea0efae12ce2f7c.epub/comprehensions.html(-1,-1): Irregular DOCTYPE: found "", expected "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/comprehensions.html(2,7): Error while parsing file: elements from namespace "" are not allowed
FATAL(RSC-016): b5d497d313204179bea0efae12ce2f7c.epub/comprehensions.html(2,27): Fatal Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/comprehensions.html(-1,-1): Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(HTM-004): b5d497d313204179bea0efae12ce2f7c.epub/strings.html(-1,-1): Irregular DOCTYPE: found "", expected "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/strings.html(2,7): Error while parsing file: elements from namespace "" are not allowed
FATAL(RSC-016): b5d497d313204179bea0efae12ce2f7c.epub/strings.html(2,27): Fatal Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/strings.html(-1,-1): Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(HTM-004): b5d497d313204179bea0efae12ce2f7c.epub/regular-expressions.html(-1,-1): Irregular DOCTYPE: found "", expected "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/regular-expressions.html(2,7): Error while parsing file: elements from namespace "" are not allowed
FATAL(RSC-016): b5d497d313204179bea0efae12ce2f7c.epub/regular-expressions.html(2,27): Fatal Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/regular-expressions.html(-1,-1): Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(HTM-004): b5d497d313204179bea0efae12ce2f7c.epub/generators.html(-1,-1): Irregular DOCTYPE: found "", expected "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/generators.html(2,7): Error while parsing file: elements from namespace "" are not allowed
FATAL(RSC-016): b5d497d313204179bea0efae12ce2f7c.epub/generators.html(2,27): Fatal Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/generators.html(-1,-1): Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(HTM-004): b5d497d313204179bea0efae12ce2f7c.epub/iterators.html(-1,-1): Irregular DOCTYPE: found "", expected "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/iterators.html(2,7): Error while parsing file: elements from namespace "" are not allowed
FATAL(RSC-016): b5d497d313204179bea0efae12ce2f7c.epub/iterators.html(2,27): Fatal Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/iterators.html(-1,-1): Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(HTM-004): b5d497d313204179bea0efae12ce2f7c.epub/advanced-iterators.html(-1,-1): Irregular DOCTYPE: found "", expected "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/advanced-iterators.html(2,7): Error while parsing file: elements from namespace "" are not allowed
FATAL(RSC-016): b5d497d313204179bea0efae12ce2f7c.epub/advanced-iterators.html(2,27): Fatal Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/advanced-iterators.html(-1,-1): Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(HTM-004): b5d497d313204179bea0efae12ce2f7c.epub/unit-testing.html(-1,-1): Irregular DOCTYPE: found "", expected "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/unit-testing.html(2,7): Error while parsing file: elements from namespace "" are not allowed
FATAL(RSC-016): b5d497d313204179bea0efae12ce2f7c.epub/unit-testing.html(2,27): Fatal Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/unit-testing.html(-1,-1): Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(HTM-004): b5d497d313204179bea0efae12ce2f7c.epub/refactoring.html(-1,-1): Irregular DOCTYPE: found "", expected "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/refactoring.html(2,7): Error while parsing file: elements from namespace "" are not allowed
FATAL(RSC-016): b5d497d313204179bea0efae12ce2f7c.epub/refactoring.html(2,27): Fatal Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/refactoring.html(-1,-1): Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(HTM-004): b5d497d313204179bea0efae12ce2f7c.epub/files.html(-1,-1): Irregular DOCTYPE: found "", expected "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/files.html(2,7): Error while parsing file: elements from namespace "" are not allowed
FATAL(RSC-016): b5d497d313204179bea0efae12ce2f7c.epub/files.html(2,27): Fatal Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/files.html(-1,-1): Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(HTM-004): b5d497d313204179bea0efae12ce2f7c.epub/xml.html(-1,-1): Irregular DOCTYPE: found "", expected "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/xml.html(2,7): Error while parsing file: elements from namespace "" are not allowed
FATAL(RSC-016): b5d497d313204179bea0efae12ce2f7c.epub/xml.html(2,27): Fatal Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/xml.html(-1,-1): Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(HTM-004): b5d497d313204179bea0efae12ce2f7c.epub/serializing.html(-1,-1): Irregular DOCTYPE: found "", expected "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/serializing.html(2,7): Error while parsing file: elements from namespace "" are not allowed
FATAL(RSC-016): b5d497d313204179bea0efae12ce2f7c.epub/serializing.html(2,27): Fatal Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/serializing.html(-1,-1): Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(HTM-004): b5d497d313204179bea0efae12ce2f7c.epub/http-web-services.html(-1,-1): Irregular DOCTYPE: found "", expected "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/http-web-services.html(2,7): Error while parsing file: elements from namespace "" are not allowed
FATAL(RSC-016): b5d497d313204179bea0efae12ce2f7c.epub/http-web-services.html(2,27): Fatal Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/http-web-services.html(-1,-1): Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(HTM-004): b5d497d313204179bea0efae12ce2f7c.epub/case-study-porting-chardet-to-python-3.html(-1,-1): Irregular DOCTYPE: found "", expected "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/case-study-porting-chardet-to-python-3.html(2,7): Error while parsing file: elements from namespace "" are not allowed
FATAL(RSC-016): b5d497d313204179bea0efae12ce2f7c.epub/case-study-porting-chardet-to-python-3.html(2,27): Fatal Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/case-study-porting-chardet-to-python-3.html(-1,-1): Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(HTM-004): b5d497d313204179bea0efae12ce2f7c.epub/packaging.html(-1,-1): Irregular DOCTYPE: found "", expected "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/packaging.html(2,7): Error while parsing file: elements from namespace "" are not allowed
FATAL(RSC-016): b5d497d313204179bea0efae12ce2f7c.epub/packaging.html(2,27): Fatal Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/packaging.html(-1,-1): Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(HTM-004): b5d497d313204179bea0efae12ce2f7c.epub/porting-code-to-python-3-with-2to3.html(-1,-1): Irregular DOCTYPE: found "", expected "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/porting-code-to-python-3-with-2to3.html(2,7): Error while parsing file: elements from namespace "" are not allowed
FATAL(RSC-016): b5d497d313204179bea0efae12ce2f7c.epub/porting-code-to-python-3-with-2to3.html(2,27): Fatal Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/porting-code-to-python-3-with-2to3.html(-1,-1): Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(HTM-004): b5d497d313204179bea0efae12ce2f7c.epub/special-method-names.html(-1,-1): Irregular DOCTYPE: found "", expected "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/special-method-names.html(2,7): Error while parsing file: elements from namespace "" are not allowed
FATAL(RSC-016): b5d497d313204179bea0efae12ce2f7c.epub/special-method-names.html(2,27): Fatal Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/special-method-names.html(-1,-1): Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(HTM-004): b5d497d313204179bea0efae12ce2f7c.epub/where-to-go-from-here.html(-1,-1): Irregular DOCTYPE: found "", expected "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/where-to-go-from-here.html(2,7): Error while parsing file: elements from namespace "" are not allowed
FATAL(RSC-016): b5d497d313204179bea0efae12ce2f7c.epub/where-to-go-from-here.html(2,27): Fatal Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/where-to-go-from-here.html(-1,-1): Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(HTM-004): b5d497d313204179bea0efae12ce2f7c.epub/troubleshooting.html(-1,-1): Irregular DOCTYPE: found "", expected "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/troubleshooting.html(2,7): Error while parsing file: elements from namespace "" are not allowed
FATAL(RSC-016): b5d497d313204179bea0efae12ce2f7c.epub/troubleshooting.html(2,27): Fatal Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(RSC-005): b5d497d313204179bea0efae12ce2f7c.epub/troubleshooting.html(-1,-1): Error while parsing file: Open quote is expected for attribute "charset" associated with an element type "meta".
ERROR(RSC-007): b5d497d313204179bea0efae12ce2f7c.epub/content.opf(99,71): Referenced resource "indextutorial.html" could not be found in the EPUB.
WARNING(OPF-003): b5d497d313204179bea0efae12ce2f7c.epub(-1,-1): Item "about.html" exists in the EPUB, but is not declared in the OPF manifest.
WARNING(OPF-003): b5d497d313204179bea0efae12ce2f7c.epub(-1,-1): Item "blank.html" exists in the EPUB, but is not declared in the OPF manifest.
WARNING(OPF-003): b5d497d313204179bea0efae12ce2f7c.epub(-1,-1): Item "colophon.html" exists in the EPUB, but is not declared in the OPF manifest.
WARNING(OPF-003): b5d497d313204179bea0efae12ce2f7c.epub(-1,-1): Item "dip3.css" exists in the EPUB, but is not declared in the OPF manifest.
WARNING(OPF-003): b5d497d313204179bea0efae12ce2f7c.epub(-1,-1): Item "table-of-contents.html" exists in the EPUB, but is not declared in the OPF manifest.
ERROR(HTM-049): b5d497d313204179bea0efae12ce2f7c.epub/index.html(2,7): Html element does not have an xmlns set to "http://www.w3.org/1999/xhtml".
ERROR(HTM-049): b5d497d313204179bea0efae12ce2f7c.epub/whats-new.html(2,7): Html element does not have an xmlns set to "http://www.w3.org/1999/xhtml".
ERROR(HTM-049): b5d497d313204179bea0efae12ce2f7c.epub/installing-python.html(2,7): Html element does not have an xmlns set to "http://www.w3.org/1999/xhtml".
ERROR(HTM-049): b5d497d313204179bea0efae12ce2f7c.epub/your-first-python-program.html(2,7): Html element does not have an xmlns set to "http://www.w3.org/1999/xhtml".
ERROR(HTM-049): b5d497d313204179bea0efae12ce2f7c.epub/native-datatypes.html(2,7): Html element does not have an xmlns set to "http://www.w3.org/1999/xhtml".
ERROR(HTM-049): b5d497d313204179bea0efae12ce2f7c.epub/comprehensions.html(2,7): Html element does not have an xmlns set to "http://www.w3.org/1999/xhtml".
ERROR(HTM-049): b5d497d313204179bea0efae12ce2f7c.epub/strings.html(2,7): Html element does not have an xmlns set to "http://www.w3.org/1999/xhtml".
ERROR(HTM-049): b5d497d313204179bea0efae12ce2f7c.epub/regular-expressions.html(2,7): Html element does not have an xmlns set to "http://www.w3.org/1999/xhtml".
ERROR(HTM-049): b5d497d313204179bea0efae12ce2f7c.epub/generators.html(2,7): Html element does not have an xmlns set to "http://www.w3.org/1999/xhtml".
ERROR(HTM-049): b5d497d313204179bea0efae12ce2f7c.epub/iterators.html(2,7): Html element does not have an xmlns set to "http://www.w3.org/1999/xhtml".
ERROR(HTM-049): b5d497d313204179bea0efae12ce2f7c.epub/advanced-iterators.html(2,7): Html element does not have an xmlns set to "http://www.w3.org/1999/xhtml".
ERROR(HTM-049): b5d497d313204179bea0efae12ce2f7c.epub/unit-testing.html(2,7): Html element does not have an xmlns set to "http://www.w3.org/1999/xhtml".
ERROR(HTM-049): b5d497d313204179bea0efae12ce2f7c.epub/refactoring.html(2,7): Html element does not have an xmlns set to "http://www.w3.org/1999/xhtml".
ERROR(HTM-049): b5d497d313204179bea0efae12ce2f7c.epub/files.html(2,7): Html element does not have an xmlns set to "http://www.w3.org/1999/xhtml".
ERROR(HTM-049): b5d497d313204179bea0efae12ce2f7c.epub/xml.html(2,7): Html element does not have an xmlns set to "http://www.w3.org/1999/xhtml".
ERROR(HTM-049): b5d497d313204179bea0efae12ce2f7c.epub/serializing.html(2,7): Html element does not have an xmlns set to "http://www.w3.org/1999/xhtml".
ERROR(HTM-049): b5d497d313204179bea0efae12ce2f7c.epub/http-web-services.html(2,7): Html element does not have an xmlns set to "http://www.w3.org/1999/xhtml".
ERROR(HTM-049): b5d497d313204179bea0efae12ce2f7c.epub/case-study-porting-chardet-to-python-3.html(2,7): Html element does not have an xmlns set to "http://www.w3.org/1999/xhtml".
ERROR(HTM-049): b5d497d313204179bea0efae12ce2f7c.epub/packaging.html(2,7): Html element does not have an xmlns set to "http://www.w3.org/1999/xhtml".
ERROR(HTM-049): b5d497d313204179bea0efae12ce2f7c.epub/porting-code-to-python-3-with-2to3.html(2,7): Html element does not have an xmlns set to "http://www.w3.org/1999/xhtml".
ERROR(HTM-049): b5d497d313204179bea0efae12ce2f7c.epub/special-method-names.html(2,7): Html element does not have an xmlns set to "http://www.w3.org/1999/xhtml".
ERROR(HTM-049): b5d497d313204179bea0efae12ce2f7c.epub/where-to-go-from-here.html(2,7): Html element does not have an xmlns set to "http://www.w3.org/1999/xhtml".
ERROR(HTM-049): b5d497d313204179bea0efae12ce2f7c.epub/troubleshooting.html(2,7): Html element does not have an xmlns set to "http://www.w3.org/1999/xhtml".

Check finished with errors
Messages: 23 fatals / 94 errors / 5 warnings / 0 infos

EPUBCheck completed
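
The messages above all share one shape, `SEVERITY(CODE): path(line,col): text`, so tallying them per code shows that only a handful of distinct problems are involved. A minimal sketch, assuming that line shape; `tallyEpubcheckMessages` is a hypothetical helper written for illustration, not part of EPUBCheck or crengine:

```cpp
#include <map>
#include <regex>
#include <sstream>
#include <string>

// Hypothetical helper: counts log lines per "SEVERITY(CODE)" prefix.
// Assumption: every message starts its line with SEVERITY(CODE): and the
// severity is one of the four labels listed in the pattern below.
std::map<std::string, int> tallyEpubcheckMessages(const std::string& log) {
    static const std::regex prefix(R"(^(FATAL|ERROR|WARNING|INFO)\(([A-Z]+-[0-9]+)\):)");
    std::map<std::string, int> counts;
    std::istringstream in(log);
    std::string line;
    while (std::getline(in, line)) {
        std::smatch m;
        // Lines without the prefix (blank lines, the summary) are skipped.
        if (std::regex_search(line, m, prefix))
            ++counts[m[1].str() + "(" + m[2].str() + ")"];
    }
    return counts;
}
```

Fed the log above, the returned map would have one entry per distinct severity and code pair, which makes it easier to see that the per-file errors repeat the same few causes.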

@poire-z
Contributor Author

poire-z commented Sep 1, 2021

A quick try at using our new HTML parser (from this PR) instead of the XML parser for EPUBs solves the issue, and it is not noticeably slower on some large pure-XHTML EPUBs.

--- a/crengine/src/epubfmt.cpp
+++ b/crengine/src/epubfmt.cpp
@@ -1317,7 +1317,8 @@ bool ImportEpubDocument( LVStreamRef stream, ldomDocument * m_doc, LVDocViewCall
     m_doc->setDocFlags( saveFlags );
     m_doc->setContainer( m_arc );

-    ldomDocumentWriter writer(m_doc);
+    // ldomDocumentWriter writer(m_doc);
+    ldomDocumentWriterFilter writer(m_doc, false, HTML_AUTOCLOSE_TABLE);
 #if 0
     m_doc->setNodeTypes( fb2_elem_table );
     m_doc->setAttributeTypes( fb2_attr_table );
diff --git a/crengine/src/lvtinydom.cpp b/crengine/src/lvtinydom.cpp
index c8354e79..cb18783b 100644
--- a/crengine/src/lvtinydom.cpp
+++ b/crengine/src/lvtinydom.cpp
@@ -14619,6 +14619,23 @@ ldomNode * ldomDocumentWriterFilter::OnTagOpen( const lChar32 * nsname, const lC
     if ( id == el_style && _currNode && _currNode->getElement()->getNodeId() == el_head ) {
         _inHeadStyle = true;
     }
+    // For EPUB, when ldomDocumentWriter is driven by ldomDocumentFragmentWriter:
+    // if we see a BODY coming and we are a DocFragment, it's time to apply the
+    // styles set to the DocFragment before switching to BODY (so the styles can
+    // be applied to BODY)
+    if (id == el_body && _currNode && _currNode->_element->getNodeId() == el_DocFragment) {
+        _currNode->_stylesheetIsSet = _currNode->getElement()->applyNodeStylesheet();
+        // _stylesheetIsSet will be used to pop() the stylesheet when
+        // leaving/destroying this DocFragment ldomElementWriter
+        if (_currNode->_stylesheetIsSet) {
+            // If there's a stylesheet, re-init the styles of the parent DocFragment,
+            // which has taken  the place of the initial <HTML>: stylesheet for EPUBs
+            // have had CSS declarations targeting a <html> set to target <DocFragment>.
+            // The <BODY> we will be creating may inherit styles from the re-styled <HTML>.
+            _currNode->_element->initNodeStyle();
+        }
+    }
+

     _currNode = new ldomElementWriter( _document, nsid, id, _currNode, insert_before_last_child );
     _flags = _currNode->getFlags();

It needs more attention and probably some other tweaks, as the resulting DOM is a bit odd, with a bit too many <body> :)
image

We'd need to use a single "domwriter" for the whole EPUB, as we can't switch between them across fragments. And we can't even rely on the media-type to be sure the whole EPUB has only XHTML fragments, as this one has:
<item href="serializing.html" id="ch15" media-type="application/xhtml+xml"/>
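
To illustrate why the declared media-type is not enough: a fragment can be listed as `application/xhtml+xml` and still use HTML5-only syntax such as an unquoted attribute value, which is what the RSC-016 "Open quote is expected" message points at. A minimal sketch of sniffing a start tag for that; `hasUnquotedAttributeValue` is a hypothetical helper, not crengine code, and the sample tag `<meta charset=utf-8>` is an assumption inferred from the log, not copied from the book:

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if a start tag has an '=' that is not
// followed by a quoted value. HTML5 accepts this; a strict XML parser
// treats it as a fatal well-formedness error.
bool hasUnquotedAttributeValue(const std::string& tag) {
    char quote = 0; // currently open quote character, 0 when outside quotes
    for (std::size_t i = 0; i < tag.size(); ++i) {
        const char c = tag[i];
        if (quote) {
            if (c == quote) quote = 0; // closing quote found
            continue;                  // ignore '=' inside quoted values
        }
        if (c == '"' || c == '\'') {
            quote = c;
            continue;
        }
        if (c == '=') {
            // Skip optional whitespace between '=' and the value.
            std::size_t j = i + 1;
            while (j < tag.size() && std::isspace(static_cast<unsigned char>(tag[j])))
                ++j;
            if (j >= tag.size() || (tag[j] != '"' && tag[j] != '\''))
                return true;
        }
    }
    return false;
}
```

Calling it on `<meta charset=utf-8>` reports an unquoted value, while `<meta charset="utf-8"/>` does not. A check like this only covers one of the ways a fragment can fail to be XML, so it is an illustration of the problem and not a proposed detection method.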

@Frenzie
Member

Frenzie commented Sep 1, 2021

If the HTML parser always produces identical results on valid XML then it may be more user friendly. I'm not quite sure what to think though. I don't really like XML's draconian error handling... but at the same time, it is supposed to be XML.

Development

Successfully merging this pull request may close these issues.

Minimal HTML document not supported
6 participants