Improve HTML parsing and manipulation #47

Closed
pgaskin opened this issue Jan 12, 2020 · 1 comment

pgaskin commented Jan 12, 2020

I'm rewriting the HTML manipulation code to fix the root cause of quite a few recent bugs, including #45, #36, #29, #28, #25, #21, and #2. These issues were caused by goquery (and thus kepubify) internally using golang.org/x/net/html, which is an HTML5 library. The parsing was fine for nearly all books, but it wasn't tolerant of self-closing non-void elements, which many XML generators emit when an element doesn't contain any children (ignoring the self-closing slash is what the HTML5 spec requires, but it's not really a useful thing to do for XHTML). The code generation was also usually fine, since it produced valid XML (it did things that are optional for HTML5 but mandatory for XHTML: putting the closing /> on void elements, writing an empty ="" on boolean attributes, using only a few named entities, etc.), but it caused a few issues by not escaping NBSPs and by messing up the XML declarations found in the XHTML of EPUB2 books.
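
To make the self-closing problem concrete, here is a minimal sketch (not kepubify's actual code) of how an XML-aware tokenizer treats an empty anchor like `<a id="note1"/>`. It uses Go's standard `encoding/xml` rather than golang.org/x/net/html, and `textParents` is a hypothetical helper written only for this illustration. An XML parser closes the `<a>` immediately, so the following text belongs to the `<p>`; an HTML5 parser ignores the slash on a non-void element, so the same text would end up nested inside the `<a>`.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// textParents tokenizes src as XML and returns, for each non-blank text
// node, the name of the element that directly contains it.
// Hypothetical helper for illustration only; not part of kepubify.
func textParents(src string) (map[string]string, error) {
	dec := xml.NewDecoder(strings.NewReader(src))
	var stack []string // names of the currently open elements
	parents := make(map[string]string)
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return parents, nil
		}
		if err != nil {
			return nil, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			// A self-closing element such as <a/> is reported as a
			// StartElement immediately followed by an EndElement.
			stack = append(stack, t.Name.Local)
		case xml.EndElement:
			stack = stack[:len(stack)-1]
		case xml.CharData:
			text := strings.TrimSpace(string(t))
			if text != "" && len(stack) > 0 {
				parents[text] = stack[len(stack)-1]
			}
		}
	}
}

func main() {
	parents, err := textParents(`<p><a id="note1"/>Some text</p>`)
	if err != nil {
		panic(err)
	}
	// With XML semantics the text is a child of <p>, not of <a>.
	fmt.Println(parents["Some text"]) // prints "p"
}
```

The stack only tracks element names, which is enough to show where the text lands; a real fix has to live in the parser itself, as described above.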

In addition, these changes will improve the performance and memory usage of kepubify (there will be far fewer string allocations and much less copying).

The majority of the XHTML/XML/HTML5 fixes have been done in my fork of golang.org/x/net/html: https://github.com/geek1011/net/commits?author=geek1011.

@pgaskin pgaskin self-assigned this Jan 12, 2020

pgaskin commented Jan 14, 2020

#34
