application/xhtml+xml files with .xml extension are not converted #55

Closed
fthdgn opened this issue Nov 12, 2020 · 1 comment

fthdgn commented Nov 12, 2020

Public-domain books from Feedbooks use the .xml extension for "application/xhtml+xml" files. Kepubify does not add "kobospan" classes to HTML files that have the .xml extension.

Is it possible to support this type of EPUB file?

Some of the books I tried:
http://www.feedbooks.com/book/92.epub
http://www.feedbooks.com/book/52.epub
http://www.feedbooks.com/book/81.epub

Owner

pgaskin commented Dec 1, 2020

I'll look into this. I'll probably need to rework the scanning logic to add xml files specified as application/xhtml+xml in the OPF. Alternatively, I could test all XML files and detect if they're XHTML.
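
For context, the first approach (trusting the media type declared in the OPF manifest instead of the file extension) could look roughly like the sketch below. This is not kepubify's actual code: the struct names, the `contentDocuments` helper, and the sample package document (including the file name `main0.xml`) are made up for illustration, and percent-encoded hrefs are not handled.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"path"
	"strings"
)

// manifestItem mirrors one <item> element of an OPF manifest.
type manifestItem struct {
	ID        string `xml:"id,attr"`
	Href      string `xml:"href,attr"`
	MediaType string `xml:"media-type,attr"`
}

// opfPackage captures only the part of the package document needed here.
type opfPackage struct {
	Items []manifestItem `xml:"manifest>item"`
}

// sampleOPF is a made-up package document in which one XHTML file uses the
// .xml extension, as described in this issue.
const sampleOPF = `<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="id">
  <manifest>
    <item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
    <item id="ch1" href="main0.xml" media-type="application/xhtml+xml"/>
    <item id="ch2" href="text/ch2.xhtml" media-type="application/xhtml+xml"/>
    <item id="css" href="main.css" media-type="text/css"/>
  </manifest>
</package>`

// contentDocuments returns the archive paths of all manifest items declared
// as application/xhtml+xml, regardless of their file extension. opfPath is
// the location of the package document inside the EPUB; hrefs are resolved
// relative to its directory.
func contentDocuments(opfPath, opf string) ([]string, error) {
	var pkg opfPackage
	if err := xml.Unmarshal([]byte(opf), &pkg); err != nil {
		return nil, fmt.Errorf("parse opf: %w", err)
	}
	var docs []string
	for _, it := range pkg.Items {
		if strings.TrimSpace(it.MediaType) == "application/xhtml+xml" {
			docs = append(docs, path.Join(path.Dir(opfPath), it.Href))
		}
	}
	return docs, nil
}

func main() {
	docs, err := contentDocuments("OPS/content.opf", sampleOPF)
	if err != nil {
		panic(err)
	}
	for _, d := range docs {
		fmt.Println(d)
	}
}
```

With this approach, `main0.xml` is picked up because of its declared media type, while `toc.ncx` (also XML) is left alone.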

@pgaskin pgaskin self-assigned this Dec 1, 2020
@pgaskin pgaskin added the bug label Dec 1, 2020
pgaskin added a commit that referenced this issue Jun 11, 2021
* Use the OPF package document to find HTML files (fixes #55).
* Refactor content/files/opf transformation code.
* Rewrite conversion code to allow converting directly from an EPUB zip.Reader
  or fs.FS into an output zipped KEPUB io.Writer (closes #62).
  * This simplifies the code for kepub conversion.
    * Input and transformation code is unified.
    * Useless function to convert a directory to a KEPUB in-place has been
      eliminated.
    * Arbitrary virtual file-systems can now be used as input.
  * This makes the code easier to test.
  * This resolves security concerns with extracting untrusted EPUBs directly
    into a temp folder with limited path sanitization, which is important when
    embedding kepubify into server-side software.
  * This allows kepubify to easily be compiled and used as a WebAssembly
    library.
  * This allows us to greatly reduce the amount of IO required when
    converting books with a large amount of media or fonts by directly piping it
    into the output file. And, on Go 1.17+, it will also significantly reduce
    the CPU time, while also increasing the amount of time spent doing content
    transformation in parallel rather than waiting for unchanged files to
    compress by directly copying the untransformed compressed files as-is.
  * The slightly increased memory cost (~8-12%) is negligible compared to the
    performance gains mentioned in the previous point and the reduced time
    waiting for disk IO (especially on HDDs).
* Make use of Go 1.16's io/fs for more flexible code and tests.
* Remove cascadia dependency
  * We don't need full selector parsing or specificity.
  * Depending on cascadia complicates replacing x/net/html.
  * Doing things manually is slightly more efficient, and almost as concise.
* Reduce exposed functions for kepub library.
  * They weren't really used.
  * Removing them increases flexibility for future improvements.
* Use a less obtrusive hack for giving kobotest access to the un-exported kepub
  functions.
* Converted files should be identical to before this change, except for:
  * Whitespace changes in content.opf.
  * Improved MathML/SVG tag filtering in content files (some instances which
    would have previously been incorrectly modified are now left as-is).
  * Content files not listed in the package document are now left as-is.
  * Content files with nonstandard extensions, but listed in the package
    document, should now be converted correctly.
* Performance should be equal or better than before this change on Go 1.16, and
  significantly faster for books with many non-content files on Go 1.17. On slow
  storage, kepubify should also be much faster.
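
The Go 1.17 point in the commit message above presumably relies on the raw-copy support that `archive/zip` gained in that release (`(*zip.Writer).Copy`, along with `CreateRaw` and `OpenRaw`); the message does not name the API, so treat that as an assumption. A minimal sketch of the idea follows. `rewriteZip`, `makeZip`, and `entryText` are hypothetical helpers, not kepubify functions.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
)

// rewriteZip copies src into a new archive. Entries selected by isContent are
// decompressed, passed through transform, and recompressed; all other entries
// are copied in their compressed form without being decoded (Go 1.17+).
func rewriteZip(src []byte, isContent func(string) bool, transform func([]byte) []byte) ([]byte, error) {
	zr, err := zip.NewReader(bytes.NewReader(src), int64(len(src)))
	if err != nil {
		return nil, fmt.Errorf("open input: %w", err)
	}
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	for _, f := range zr.File {
		if !isContent(f.Name) {
			if err := zw.Copy(f); err != nil { // raw copy, no recompression
				return nil, fmt.Errorf("copy %s: %w", f.Name, err)
			}
			continue
		}
		rc, err := f.Open()
		if err != nil {
			return nil, err
		}
		data, err := io.ReadAll(rc)
		rc.Close()
		if err != nil {
			return nil, err
		}
		w, err := zw.Create(f.Name)
		if err != nil {
			return nil, err
		}
		if _, err := w.Write(transform(data)); err != nil {
			return nil, err
		}
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// makeZip builds a small archive from name/content pairs (demo helper).
func makeZip(pairs ...string) ([]byte, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	for i := 0; i+1 < len(pairs); i += 2 {
		w, err := zw.Create(pairs[i])
		if err != nil {
			return nil, err
		}
		if _, err := io.WriteString(w, pairs[i+1]); err != nil {
			return nil, err
		}
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// entryText returns the decompressed contents of one entry (demo helper).
func entryText(archive []byte, name string) (string, error) {
	zr, err := zip.NewReader(bytes.NewReader(archive), int64(len(archive)))
	if err != nil {
		return "", err
	}
	rc, err := zr.Open(name)
	if err != nil {
		return "", err
	}
	defer rc.Close()
	b, err := io.ReadAll(rc)
	return string(b), err
}

func main() {
	src, err := makeZip("mimetype", "application/epub+zip", "main0.xml", "<p>hello</p>")
	if err != nil {
		panic(err)
	}
	out, err := rewriteZip(src, func(n string) bool { return n == "main0.xml" }, bytes.ToUpper)
	if err != nil {
		panic(err)
	}
	text, err := entryText(out, "main0.xml")
	fmt.Println(text, err)
}
```

Because entries are written in the order they are read, the sketch also keeps `mimetype` as the first entry, as EPUB containers require.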
pgaskin added a commit that referenced this issue Jun 11, 2021
* Use the OPF package document to find HTML files (fixes #55).
* Refactor content/files/opf transformation code.
* Rewrite conversion code to allow converting directly from an EPUB zip.Reader
  or fs.FS into an output zipped KEPUB io.Writer (closes #62).
  * This simplifies the code for kepub conversion.
    * Input and transformation code is unified.
    * Useless function to convert a directory to a KEPUB in-place has been
      eliminated.
    * Arbitrary virtual file-systems can now be used as input.
  * This makes the code easier to test.
  * This resolves security concerns with extracting untrusted EPUBs directly
    into a temp folder with limited path sanitization, which is important when
    embedding kepubify into server-side software.
  * This allows kepubify to easily be compiled and used as a WebAssembly
    library.
  * This allows us to greatly reduce the amount of IO required when
    converting books with a large amount of media or fonts by directly piping it
    into the output file. And, on Go 1.17+, it will also significantly reduce
    the CPU time, while also increasing the amount of time spent doing content
    transformation in parallel rather than waiting for unchanged files to
    compress by directly copying the untransformed compressed files as-is.
  * The slightly increased memory cost (~8-12%) is negligible compared to the
    performance gains mentioned in the previous point and the reduced time
    waiting for disk IO (especially on HDDs).
* Make use of Go 1.16's io/fs for more flexible code and tests.
* Remove cascadia dependency
  * We don't need full selector parsing or specificity.
  * Depending on cascadia complicates replacing x/net/html.
  * Doing things manually is slightly more efficient, and almost as concise.
* Reduce exposed functions for kepub library.
  * They weren't really used.
  * Removing them increases flexibility for future improvements.
* Use a less obtrusive hack for giving kobotest access to the un-exported kepub
  functions.
* Make documentation for transformations more detailed.
* Converted files should be identical to before this change, except for:
  * Whitespace changes in content.opf.
  * Improved MathML/SVG tag filtering in content files (some instances which
    would have previously been incorrectly modified are now left as-is).
  * Content files not listed in the package document are now left as-is.
  * Content files with nonstandard extensions, but listed in the package
    document, should now be converted correctly.
* Performance should be equal or better than before this change on Go 1.16, and
  significantly faster for books with many non-content files on Go 1.17. On slow
  storage, kepubify should also be much faster.
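
To illustrate why taking an `fs.FS` as input and an `io.Writer` as output makes the code easier to test, here is a rough sketch that packs an in-memory file tree into a zip archive without touching the disk. `pack` and `packedNames` are hypothetical names and the file tree is invented; kepubify's real entry points may look quite different. Unlike a real EPUB writer, this sketch does not force `mimetype` to be the first, uncompressed entry.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
	"io/fs"
	"testing/fstest"
)

// pack writes every regular file found in fsys to w as a zip archive.
// fs.WalkDir visits entries in lexical order, so the output is deterministic.
func pack(fsys fs.FS, w io.Writer) error {
	zw := zip.NewWriter(w)
	err := fs.WalkDir(fsys, ".", func(name string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			return nil
		}
		data, err := fs.ReadFile(fsys, name)
		if err != nil {
			return err
		}
		fw, err := zw.Create(name)
		if err != nil {
			return err
		}
		_, err = fw.Write(data)
		return err
	})
	if err != nil {
		return err
	}
	return zw.Close()
}

// packedNames packs an in-memory file tree (built with fstest.MapFS) and
// returns the entry names of the resulting archive.
func packedNames(files map[string]string) ([]string, error) {
	fsys := fstest.MapFS{}
	for name, body := range files {
		fsys[name] = &fstest.MapFile{Data: []byte(body)}
	}
	var buf bytes.Buffer
	if err := pack(fsys, &buf); err != nil {
		return nil, err
	}
	zr, err := zip.NewReader(bytes.NewReader(buf.Bytes()), int64(buf.Len()))
	if err != nil {
		return nil, err
	}
	var names []string
	for _, f := range zr.File {
		names = append(names, f.Name)
	}
	return names, nil
}

func main() {
	names, err := packedNames(map[string]string{
		"mimetype":        "application/epub+zip",
		"OPS/content.opf": "<package/>",
		"OPS/main0.xml":   "<html/>",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(names)
}
```

The same `pack` function would accept `os.DirFS("book")` or a `*zip.Reader`, since both implement `fs.FS`.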
pgaskin added a commit that referenced this issue Jun 15, 2021
* Use the OPF package document to find HTML files (fixes #55).
* Refactor content/files/opf transformation code.
* Rewrite conversion code to allow converting directly from an EPUB zip.Reader
  or fs.FS into an output zipped KEPUB io.Writer (closes #62).
  * This simplifies the code for kepub conversion.
    * Input and transformation code is unified.
    * Useless function to convert a directory to a KEPUB in-place has been
      eliminated.
    * Arbitrary virtual file-systems can now be used as input.
  * This makes the code easier to test.
  * This resolves security concerns with extracting untrusted EPUBs directly
    into a temp folder with limited path sanitization, which is important when
    embedding kepubify into server-side software.
  * This allows kepubify to easily be compiled and used as a WebAssembly
    library.
  * This allows us to greatly reduce the amount of IO required when
    converting books with a large amount of media or fonts by directly piping it
    into the output file. And, on Go 1.17+, it will also significantly reduce
    the CPU time, while also increasing the amount of time spent doing content
    transformation in parallel rather than waiting for unchanged files to
    compress by directly copying the untransformed compressed files as-is.
  * The slightly increased memory cost (~8-12%) is negligible compared to the
    performance gains mentioned in the previous point and the reduced time
    waiting for disk IO (especially on HDDs).
* Make use of Go 1.16's io/fs for more flexible code and tests.
* Remove cascadia dependency
  * We don't need full selector parsing or specificity.
  * Depending on cascadia complicates replacing x/net/html.
  * Doing things manually is slightly more efficient, and almost as concise.
* Reduce exposed functions for kepub library.
  * They weren't really used.
  * Removing them increases flexibility for future improvements.
* Use a less obtrusive hack for giving kobotest access to the un-exported kepub
  functions.
* Make documentation for transformations more detailed.
* Converted files should be identical to before this change, except for:
  * Improved MathML/SVG tag filtering in content files (some instances which
    would have previously been incorrectly modified are now left as-is).
  * Content files not listed in the package document are now left as-is.
  * Content files with nonstandard extensions, but listed in the package
    document, should now be converted correctly.
* Performance should be equal or better than before this change on Go 1.16, and
  significantly faster for books with many non-content files on Go 1.17. On slow
  storage, kepubify should also be much faster.
pgaskin added a commit that referenced this issue Jun 16, 2021
* Use the OPF package document to find HTML files (fixes #55).
* Refactor content/files/opf transformation code.
* Rewrite conversion code to allow converting directly from an EPUB zip.Reader
  or fs.FS into an output zipped KEPUB io.Writer (closes #62).
  * This simplifies the code for kepub conversion.
    * Input and transformation code is unified.
    * Useless function to convert a directory to a KEPUB in-place has been
      eliminated.
    * Arbitrary virtual file-systems can now be used as input.
  * This makes the code easier to test.
  * This resolves security concerns with extracting untrusted EPUBs directly
    into a temp folder with limited path sanitization, which is important when
    embedding kepubify into server-side software.
  * This allows kepubify to easily be compiled and used as a WebAssembly
    library.
  * This allows us to greatly reduce the amount of IO required when
    converting books with a large amount of media or fonts by directly piping it
    into the output file. And, on Go 1.17+, it will also significantly reduce
    the CPU time, while also increasing the amount of time spent doing content
    transformation in parallel rather than waiting for unchanged files to
    compress by directly copying the untransformed compressed files as-is.
  * The slightly increased memory cost (~8-12%) is negligible compared to the
    performance gains mentioned in the previous point and the reduced time
    waiting for disk IO (especially on HDDs).
* Make use of Go 1.16's io/fs for more flexible code and tests.
* Remove cascadia dependency
  * We don't need full selector parsing or specificity.
  * Depending on cascadia complicates replacing x/net/html.
  * Doing things manually is slightly more efficient, and almost as concise.
* Reduce exposed functions for kepub library.
  * They weren't really used.
  * Removing them increases flexibility for future improvements.
* Use a less obtrusive hack for giving kobotest access to the un-exported kepub
  functions.
* Make documentation for transformations more detailed.
* Converted files should be identical to before this change, except for:
  * Improved MathML/SVG tag filtering in content files (some instances which
    would have previously been incorrectly modified are now left as-is).
  * Content files not listed in the package document are now left as-is.
  * Content files with nonstandard extensions, but listed in the package
    document, should now be converted correctly.
* Performance should be equal or better than before this change on Go 1.16, and
  significantly faster for books with many non-content files on Go 1.17. On slow
  storage, kepubify should also be much faster.
@pgaskin pgaskin closed this as completed in 948788e Jul 2, 2021