-
Notifications
You must be signed in to change notification settings - Fork 45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New HTML parser, libRu and FB2 tweaks #370
Conversation
This reverts commit 1f9b2cc.
Happened only in enhanced rendering mode, caused by forgetting to use the 'y' provided to RenderBlockElement(). As we're now using it, don't provide a float's container padding_top as this 'y' when rendering the float, as it has already been given to and accounted by the main FlowState.
The content following the 'display:run-in' footnote number is usually a P rendered erm_final, and this merging of them into a container erm_final works quite well. But when the content is not a P but some more complex erm_block structure (like poem>stanza>v or cite>p), ensuring this mergins would make that structure all erm_inline, losing the formatting. So don't do that, and let a blank line after the footnote number happen, which is less worse than losing formatting.
If the font provides the 'tabular-nums' (tnum) OpenType feature, we'll get more properly aligned starts of text after the numbers.
Some of these can still be found in old documents or badly converted ones, and most fonts have no glyph for them (and if they do, it's a small symbol with no meaning for the reader).
Strangely, when floats are involved, a HR behaves differently than a regular DIV and adjust to the available space. Nothing mentionned about that in the specs, so try to handle them as Firefox does.
Avoid crash when called while styles not yet set or reset. Show the rootnode as <?RootNode?>.
So DOM writers can distinguish <br/> from <br></br> and handle these differently.
Accept <!DOCTYPE html> which may not have any <html>, <head> and <body> tags (they are then implicit), and any other kind of <!DOCTYPE ... HTML ...> which are obviously HTML, no matter how broken they are.
@poire-z XHTML documents would technically need to be handled the same as EPUB (the other way around really; EPUB is XHTML with some extra metadata). In KOReader |
@@ -1369,6 +1369,21 @@ class LVFormatter { | |||
else if ( c >= 0x2066 ) m_flags[pos] = LCHAR_IS_TO_IGNORE; // 2066>2069 | |||
} | |||
} | |||
else if ( c <= 0x009F ) { | |||
// Also ignore some ASCII and Unicode control chars | |||
// in the ranges 00>1F and 7F>9F, except a few. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure it's really preferable, I'd like to set the crap I get :) but I guess if we find these control chars, it's usually crap left by puslisher or conversion that a real reader is not interested in.
I like to show formatting in my editors, both Word/Writer and more code-focused ones like Geany. But displaying ?
on a weird control char by default seems a bit off. :-)
// "Lib.ru html" format is actually minimal HTML with | ||
// the text wrapped in <PRE>. We will parse this text |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
O_o
Nothing jumps out at a quick glance ;) :+1; |
Except that currently, crengine would still use this new HTMLParser with files names .xhtml. |
71d4975
to
1b13073
Compare
That's what browsers do though. :-)
As to whether that's wise — we can leave that in the middle. |
1b13073
to
de3d559
Compare
Avoid the first (wrong) rendering from being different than next (good) ones, because it was parsed as PRE and the node was substituted to a DIV when closed. Handle Lib.ru books in a more correct DOM building way.
Follow most rules from the HTML specs. Add a new DOM_VERSION_CURRENT 20200824 so newly opened HTML and CHM files use the new parser (which might give a different DOM tree than the previous one, which will still be used for previously opened book to preserve XPointers). fb2def.h: add a few more tags mentionned in the specs.
According to the specs, non-table elements met while building a table (before we meet a TD/TH), should be moved outside the table and added as the previous sibling of the table. Previously, these elements were kept there, and ignored by the table rendering code.
de3d559
to
9581fef
Compare
Ah, that's not good. :-/ |
Hi @poire-z. Don't know if this is related but I just got a linebreak after an en-dash (dialogue) which is followed by an no-break space |
@cramoisi : more probably because of #365 / #364 , which changes a few things for French (but for EM-dash, not EN-dash). Not much time currently to dig into that, but may be @Jellby can have a look if this PR might have some impact here? |
I don't see how that could happen with either em-dash or en-dash (and I believe one should use em-dash for dialogues in French). I could have a look if you upload a sample file ("Le Rouge et le Noir" should be public domain and ok to upload). |
@Jellby : for the em /en usage it's up to the publisher :/ (but I agree with you). It happens two times until now and each time in the same configuration (nbsp):(space)–(nbsp)WORD - and it's cut before word or (nbsp): "(nbsp)–(nbsp)WORD. I put the two chapters it happened in the sample file. Thanks ! r_et_n_sample - Stendhal.epub.zip (edit : rolled back and also happened in 2020.08) |
Actually, I changed my mind, I see how this happens with an EN-dash. It has nothing to do with #365, it's just how en-dashes work, they are like hyphens and allow a break after, even if followed by a no-break space. Another reason to use em-dash. A possible workaround is replacing As possible tweaks in the crengine code, I can say that in proper Spanish typesetting the en-dash does not exist, so it would be OK to treat the en-dash as an em-dash in Spanish texts, for cases where the publisher chooses to use the wrong dash (because it looks nicer in their font, or so they think). If the same happens in French, it could easily be implemented and solve your problem. But if en-dashes have their rightful uses, as in English, it gets trickier. |
https://www.unicode.org/reports/tr14/tr14-22.html#BA |
@Jellby : Yup, I get it : I only see them now because of the publisher usage. Thanks to point that out ! In french the em-dash is usually for dialogs and the en-dash to do inserts. I will just convert them with calibre editor for now , it's easy to find them all :-) @poire-z : If a publisher want to use en-dash in place of em-dash (which is wrong I'm first to agree) and put a nbsp after it, what's the point to not respect it ? what's the point to respect the nbsp everywhere but at some places ? |
The pair table (table 2, note that it has apparently been removed in later versions), says that BA GL is a direct break opportunity (i.e. break is allowed even if there are no spaces between them).
That's the Unicode linebreaking algorithm (see link above), not a koreader decision, although koreader can choose to ignore/override it. |
Thanks ! If it's unicode it's fine by me :-) |
Thanks. Hovering over it points back to the invoved rule: LB12a :
That still feel strange. I would expect these NBSPs to stick everything :) |
Yes, the problem is people who don't dig it out won't understand why and see this behaviour as an error. But the fact is that there is some books where the en-dash are correctly used for inserts but where they are surrounded with NBSPs. In these books the layout is just horrible so I think this rule is for the best. |
So, you're fine with how things are currently (in 2020.08.1) ? and you will just correct this specific EPUB ? |
@poire-z : I'm never fine with a bad layout or a bad french hyphenation :-) But I was thinking about the mistake where NBSPs are put around correctly used en-dash (for inserts) and I've seen that far more often that en-dash used for em-dash... I still agree with you it feels strange, mostly because it's not obvious for the common reader which will just see it broken. Which one is the best error ? I don't know 😅 |
Mhhh, until now :( Downloaded https://unglue.it/work/186982/ as EPUB. So, it's an EPUB with non-XHTML content. More serious issue is with tables: <table>
<tr><th>Notes
<th>JSON
<th>Python 3
<tr><th>
<td>object
<td><a href=native-datatypes.html#dictionaries>dictionary</a>
<tr><th>
<td>array
<td><a href=native-datatypes.html#lists>list</a>
<tr><th>
<td>string
<td><a href=strings.html#divingin>string</a>
</table> to get it to render in a few seconds as...: Add another layer ot tr > th > td, and it takes minutes... |
A reasonably impressive number of fatal errors there… But yeah, being stuck in some kind of loop due to some bad input is no good. :-)
|
If the HTML parser always produces identical results on valid XML then it may be more user friendly. I'm not quite sure what to think though. I don't really like XML's draconian error handling... but at the same time, it is supposed to be XML. |
Revert "FB2: don't draw cover in scroll mode"
(revert lazy workaround from #366)(Upstream) FB2: fix coverpage drawing in scroll mode
Proper solution by @virxkane to this issue, koreader/koreader#6490 (comment) , and fixing a bug of mine.
FB2 footnotes: only merge run-in when next is erm_final
koreader/koreader#6344 (comment)fb2.css: use OTF tabular-nums for footnote numbers
koreader/koreader#6344 (comment)Text: ignore ascii and unicode control chars
https://www.mobileread.com/forums/showthread.php?p=4025496#post4025496
I'm not sure it's really preferable, I'd like to set the crap I get :) but I guess if we find these control chars, it's usually crap left by puslisher or conversion that a real reader is not interested in.
https://www.aivosto.com/articles/control-characters.html
Please have a quick look that I didn't ignore other valuable code (I kept
\t \r \n
)Fix HR positionning when floats involved
Both the DIV in blue and the HR in grey are styled similarly with {width: 30%; height: 100px; margin: 0 auto 0 auto; } but shouldn't be positionned similarly.
Before:
After:
writeNodeEx(): minor tweaks
OnTagClose(): add self_closing_tag parameter
to all XML writers, even if most don't use itHTML format detection: accept HTML5 doctype
Closes #141HTML parser: rework Lib.ru specific handling
koreader/koreader#6480 (comment)I don't like much having some little non-standard book format specific hacks in there, but well...
Rewrote it to have it a bit more "normal", and solve the display hash mismatch after first load.
The DOM built should hopefully be similar to the old CoolReader one (but may be not to the one got from a few months ago witk KOReader, so KOReader users reading Lib.Ru books may have highlights made recently messed up).
HTML parser: new more conforming implementation
HTML parser: ensure foster parenting inside tables
Try to conform to the HTML specs, even if not implementing the exact algorithm:
https://html.spec.whatwg.org/multipage/parsing.html
https://htmlparser.info/parser/
Limitations:
Done after trying to fix the autoclose stuff, but what the existing code allowed wasn't right. Some discussion and discoveries at:
koreader/koreader#6482 (comment) #243 (comment)
https://hsivonen.fi/doctype/ (about doctypes, that I didn't really investigate it we'd need to parse that and act differently)
So, that new HTML parser will only be used with HTML and CHM files.
I wonder how much of this and the specs should apply when parsing XHTML from EPUBs.
Should it also correct autoclosing tags (
<hr><div>toto</div></hr>
,<br></br>
should output 2 BRs), misnested tags, non table stuff inside tables... ? When having a balanced XHTML file will map directly to a DOM... (I don't plan to do anything on that front, just curious :)Not sure this won't crash on twisted HTML documents :/
Also, it will make my life harder when testing stuff, as now what I usually test with a test.html file (because building a .epub is tedious) won't be handled the same as if it were in a .epub...
This change is