Cannot comment out XML start tag for files beginning with UTF-8 BOM (ef bb bf) #36

Closed
andychlin opened this issue May 3, 2019 · 3 comments
@andychlin

andychlin commented May 3, 2019

Version: v2.3.2 on Windows 10

I found that kepubify cannot comment out the XML start tag in files that begin with a UTF-8 BOM (ef bb bf). It worked correctly after I removed the UTF-8 BOM from the files.

Another strange thing is that it also works if I change the string "utf-8" to uppercase.
That is, kepubify cannot comment out the XML start tag when the encoding is written as "utf-8", but it does comment it out when the encoding is written as "UTF-8", even if the file begins with a UTF-8 BOM.
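
For anyone trying to reproduce this, a minimal Go sketch of how the condition could be checked is below. The helper names `hasUTF8BOM` and `startsWithXMLDecl` are made up for illustration and are not part of kepubify. The point is only that a file with a BOM no longer has `<?xml` at byte 0, even though it looks the same in most editors.

```go
package main

import "bytes"

// utf8BOM is the three-byte UTF-8 byte order mark (ef bb bf).
var utf8BOM = []byte{0xEF, 0xBB, 0xBF}

// hasUTF8BOM reports whether data begins with the UTF-8 BOM.
// Hypothetical helper, not a kepubify function.
func hasUTF8BOM(data []byte) bool {
	return bytes.HasPrefix(data, utf8BOM)
}

// startsWithXMLDecl reports whether "<?xml" is at byte 0 of data.
// A leading BOM makes this false. Hypothetical helper.
func startsWithXMLDecl(data []byte) bool {
	return bytes.HasPrefix(data, []byte("<?xml"))
}
```

Reading the first few bytes of the affected content file and passing them to both helpers should show whether the file matches the case described here.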

@pgaskin
Owner

pgaskin commented May 3, 2019

I would need to see the epub file in question. Also, what happens when you try to use the converted kepub?

@pgaskin pgaskin self-assigned this May 3, 2019
@andychlin
Author

andychlin commented May 3, 2019 via email

@pgaskin
Owner

pgaskin commented Jan 12, 2020

patrick@dpc01:~/kepubify-tmp$ diff <(xxd Bad_Book/GoogleDoc/TesteBook.xhtml) <(xxd Bad_Book-nobom/GoogleDoc/TesteBook.xhtml)
1,15c1,15
< 00000000: efbb bf3c 3f78 6d6c 2076 6572 7369 6f6e  ...<?xml version
< 00000010: 3d22 312e 3022 2065 6e63 6f64 696e 673d  ="1.0" encoding=
< 00000020: 2275 7466 2d38 223f 3e3c 6874 6d6c 2078  "utf-8"?><html x
< 00000030: 6d6c 6e73 3d22 6874 7470 3a2f 2f77 7777  mlns="http://www
< 00000040: 2e77 332e 6f72 672f 3139 3939 2f78 6874  .w3.org/1999/xht
< 00000050: 6d6c 223e 0a20 203c 6865 6164 3e0a 2020  ml">.  <head>.  
< 00000060: 2020 3c74 6974 6c65 3e42 6164 2065 426f    <title>Bad eBo
< 00000070: 6f6b 3c2f 7469 746c 653e 0a20 203c 2f68  ok</title>.  </h
< 00000080: 6561 643e 0a20 203c 626f 6479 2063 6c61  ead>.  <body cla
< 00000090: 7373 3d22 6330 223e 0a20 2020 203c 7020  ss="c0">.    <p 
< 000000a0: 636c 6173 733d 2263 3222 3e0a 2020 2020  class="c2">.    
< 000000b0: 2020 3c73 7061 6e20 636c 6173 733d 2263    <span class="c
< 000000c0: 3122 3e42 6164 2065 626f 6f6b 3c2f 7370  1">Bad ebook</sp
< 000000d0: 616e 3e0a 2020 2020 3c2f 703e 0a20 203c  an>.    </p>.  <
< 000000e0: 2f62 6f64 793e 0a3c 2f68 746d 6c3e 0a    /body>.</html>.
---
> 00000000: 3c3f 786d 6c20 7665 7273 696f 6e3d 2231  <?xml version="1
> 00000010: 2e30 2220 656e 636f 6469 6e67 3d22 7574  .0" encoding="ut
> 00000020: 662d 3822 3f3e 3c68 746d 6c20 786d 6c6e  f-8"?><html xmln
> 00000030: 733d 2268 7474 703a 2f2f 7777 772e 7733  s="http://www.w3
> 00000040: 2e6f 7267 2f31 3939 392f 7868 746d 6c22  .org/1999/xhtml"
> 00000050: 3e0a 2020 3c68 6561 643e 0a20 2020 203c  >.  <head>.    <
> 00000060: 7469 746c 653e 4261 6420 6542 6f6f 6b3c  title>Bad eBook<
> 00000070: 2f74 6974 6c65 3e0a 2020 3c2f 6865 6164  /title>.  </head
> 00000080: 3e0a 2020 3c62 6f64 7920 636c 6173 733d  >.  <body class=
> 00000090: 2263 3022 3e0a 2020 2020 3c70 2063 6c61  "c0">.    <p cla
> 000000a0: 7373 3d22 6332 223e 0a20 2020 2020 203c  ss="c2">.      <
> 000000b0: 7370 616e 2063 6c61 7373 3d22 6331 223e  span class="c1">
> 000000c0: 4261 6420 6562 6f6f 6b3c 2f73 7061 6e3e  Bad ebook</span>
> 000000d0: 0a20 2020 203c 2f70 3e0a 2020 3c2f 626f  .    </p>.  </bo
> 000000e0: 6479 3e0a 3c2f 6874 6d6c 3e0a            dy>.</html>.
patrick@dpc01:~/kepubify-tmp$ diff -r Bad_Book{,-nobom}
diff -r Bad_Book/GoogleDoc/TesteBook.xhtml Bad_Book-nobom/GoogleDoc/TesteBook.xhtml
1c1
< <?xml version="1.0" encoding="utf-8"?><html xmlns="http://www.w3.org/1999/xhtml">
---
> <?xml version="1.0" encoding="utf-8"?><html xmlns="http://www.w3.org/1999/xhtml">
patrick@dpc01:~/ToDelete/kepubify-tmp$ diff -r Bad_Book{,-nobom}.kepub
diff -r Bad_Book.kepub/GoogleDoc/TesteBook.xhtml Bad_Book-nobom.kepub/GoogleDoc/TesteBook.xhtml
1c1,4
< <html xmlns="http://www.w3.org/1999/xhtml"><head><style type="text/css">div#book-inner{margin-top: 0;margin-bottom: 0;}</style></head><body class="c0"><span class="koboSpan" id="kobo.0.1"></span><?xml version="1.0" encoding="utf-8"?><div id="book-columns"><div id="book-inner"><title><span class="koboSpan" id="kobo.0.4">Bad eBook</span></title><p class="c2"><span class="c1"><span class="koboSpan" id="kobo.1.2">Bad ebook</span></span></p></div></div></body></html>
\ No newline at end of file
---
> <?xml version="1.0" encoding="utf-8"?><html xmlns="http://www.w3.org/1999/xhtml"><head>
>     <title>Bad eBook</title>
>   <style type="text/css">div#book-inner{margin-top: 0;margin-bottom: 0;}</style></head>
>   <body class="c0"><div id="book-columns"><div id="book-inner"><p class="c2"><span class="c1"><span class="koboSpan" id="kobo.1.2">Bad ebook</span></span></p></div></div></body></html>
\ No newline at end of file

@pgaskin pgaskin added bug and removed undecided labels Jan 12, 2020
pgaskin added a commit to pgaskin/net that referenced this issue Jan 12, 2020
This option treats the UTF-8 BOM, if present, as whitespace to
prevent moving comments into the body element.

This is mainly intended for use with RenderOptionAllowXMLDeclarations
to prevent the XML declaration being moved into the body element
(which is invalid). See pgaskin/kepubify#36 for an example of this.
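
The commit above handles the BOM inside the forked parser. A different and simpler workaround, sketched here only as an illustration, is to drop the BOM from the input before it reaches any parser, so that the XML declaration is the first thing read. This is not the option added in the commit, and `skipUTF8BOM` and `stripBOMString` are hypothetical names.

```go
package main

import (
	"bufio"
	"io"
	"strings"
)

// skipUTF8BOM wraps r and discards a leading UTF-8 BOM (ef bb bf), if present.
// Hypothetical helper: a preprocessing workaround, not the parser option itself.
func skipUTF8BOM(r io.Reader) io.Reader {
	br := bufio.NewReader(r)
	// Peek does not consume input. On inputs shorter than three bytes it
	// returns an error, in which case nothing is discarded.
	head, err := br.Peek(3)
	if err == nil && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF {
		// The three bytes are already buffered, so this cannot fail.
		_, _ = br.Discard(3)
	}
	return br
}

// stripBOMString is a convenience wrapper for in-memory content.
func stripBOMString(s string) (string, error) {
	b, err := io.ReadAll(skipUTF8BOM(strings.NewReader(s)))
	return string(b), err
}
```

The reader returned by `skipUTF8BOM` can be handed to whatever consumes the content file. Unlike treating the BOM as whitespace, this approach removes the BOM from the output as well.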
pgaskin added a commit that referenced this issue Jan 14, 2020
- Improved robustness
  - More is implemented directly in the HTML parser and renderer (see my fork of x/net/html)
  - Better support for XHTML and HTML5 (rather than using a bunch of workarounds)
  - No more regexps for modifying HTML
- Better smart punctuation
  - More punctuation supported
  - More robust (won't apply to everything unconditionally)
  - Now off by default
- Faster and more efficient (15-30% faster, 50-70% less memory)
  - Fewer memory allocations and copies due to use of readers and writers rather than storing the entire file in memory multiple times
  - Stack-based span adding algorithm (rather than recursive, which has more runtime and memory overhead)
  - Use byte arrays or runes rather than strings where possible
  - Better parallel processing of content files
  - Eliminated memory, goroutine, and file descriptor leaks
- Cleaner and better code
  - Easier to extend
  - More stable API
  - More complete unit tests
- More accurate sentence splitting and segment numbering (checked against 3 recent free books)
  - Better match Kobo's behavior by preserving, but not wrapping (in a koboSpan) TextNodes with only whitespace. Previous versions of kepubify used to collapse it to a single space, which still works, but is less efficient to do and is slightly different than what Kobo does (although it results in the same thing during rendering).
  - Fixed some edge cases where the segment counter could be incorrectly incremented.
  - Also increment paragraph counter for tables (this case was missing before).
  - Don't increment paragraph counter if spans were added (i.e. an empty or only whitespace paragraph element) (this case was missing before).
- Smaller binary size
- Also run tests on Windows

closes #47, fixes #45, fixes #35
better fix for #36, #29, #28, #26, #21, #14, #10, #5, and #2
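
As a rough illustration of the "stack-based rather than recursive" point in the commit message above, the following Go sketch walks a tree in document order with an explicit stack. It is not kepubify's span-adding code. The `node` type and `collectText` function are hypothetical stand-ins for an HTML node tree.

```go
package main

// node is a hypothetical minimal tree node standing in for an HTML node.
type node struct {
	text     string
	children []*node
}

// collectText visits the tree in document (pre-order) order using an
// explicit stack instead of recursion, returning the text of each node.
func collectText(root *node) []string {
	var out []string
	if root == nil {
		return out
	}
	stack := []*node{root}
	for len(stack) > 0 {
		// Pop the most recently pushed node.
		n := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		out = append(out, n.text)
		// Push children in reverse so the first child is popped first.
		for i := len(n.children) - 1; i >= 0; i-- {
			stack = append(stack, n.children[i])
		}
	}
	return out
}
```

With an explicit stack the traversal depth is bounded by a slice rather than by the call stack, which matters for deeply nested markup.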
pgaskin added a commit that referenced this issue Jun 11, 2021
This option treats the UTF-8 BOM, if present, as whitespace to
prevent moving comments into the body element.

This is mainly intended for use with RenderOptionAllowXMLDeclarations
to prevent the XML declaration being moved into the body element
(which is invalid). See #36 for an example of this.