Elements only containing nbsps are removed #21

pgaskin · 2018-03-04T05:12:17Z

Similar to #14

- Improved robustness
  - More is implemented directly in the HTML parser and renderer (see my fork of x/net/html)
  - Better support for XHTML and HTML5 (rather than using a bunch of workarounds)
  - No more regexps for modifying HTML
- Better smart punctuation
  - More punctuation supported
  - More robust (won't apply to everything unconditionally)
  - Now off by default
- Faster and more efficient (15-30% faster, 50-70% less memory)
  - Fewer memory allocations and copies due to use of readers and writers rather than storing the entire file in memory multiple times
  - Stack-based span adding algorithm (rather than recursive, which has more runtime and memory overhead)
  - Use byte arrays or runes rather than strings where possible
  - Better parallel processing of content files
  - Eliminated memory, goroutine, and file descriptor leaks
- Cleaner and better code
  - Easier to extend
  - More stable API
  - More complete unit tests
- More accurate sentence splitting and segment numbering (checked against 3 recent free books)
  - Better match Kobo's behavior by preserving, but not wrapping (in a koboSpan), TextNodes with only whitespace. Previous versions of kepubify collapsed them to a single space, which still works, but is less efficient and slightly different from what Kobo does (although it results in the same thing during rendering).
  - Fixed some edge cases where the segment counter could be incorrectly incremented.
  - Also increment paragraph counter for tables (this case was missing before).
  - Don't increment paragraph counter if no spans were added (i.e. an empty or whitespace-only paragraph element) (this case was missing before).
- Smaller binary size
- Also run tests on Windows

closes #47, fixes #45, fixes #35

better fix for #36, #29, #28, #26, #21, #14, #10, #5, and #2

pgaskin closed this as completed in 8219029 Mar 4, 2018

pgaskin self-assigned this Mar 4, 2018

pgaskin added the bug label Mar 4, 2018

pgaskin mentioned this issue Jan 12, 2020

Improve HTML parsing and manipulation #47

Closed
