Support for EPUB 3 Audio-eBooks #2061

Closed
wants to merge 12 commits into from
duydl
Contributor

@duydl duydl commented Oct 13, 2023

Feature: Support for EPUB 3 Audio-eBooks in Calibre

Overview

This pull request adds support for EPUB 3 Audio-eBooks to the Calibre ebook viewer. These are EPUBs with SMIL audio synchronization (EPUB 3 with Media Overlays); compared to a conventional EPUB, they additionally contain SMIL files and audio content.

Additional Resources

Public domain audio eBooks can be found on ReadBeyond. Thorium Reader can also play them, but it lacks many features of the Calibre ebook-viewer, not to mention the library management.

This PR enhances Calibre's capabilities and makes it compatible with the format.

Further plans

I created a new overlay based on the read-aloud overlay for tts. The program checks for SMIL Files and if detected, the Read Aloud toggle will open that overlay instead of TTS.

The audio control is implemented directly in the Rapydscript front end. There is no communication with the Python backend.

I will continue to maintain and improve this feature if needed. Particularly, the audio files have not been linked successfully to the Calibre content server viewer.

Thank you very much for the amazing program.

@kovidgoyal
Owner

Cool, nice to see. What's the issue with implementing it in the content
server viewer? Maybe I can offer some advice/help. Also which OSes have
you tested this on? IIRC the main issue with doing this was always
building Qt webengine with support for the various audio codecs on the
various platforms.

@duydl
Contributor Author

duydl commented Oct 13, 2023

In terms of platform compatibility, I have run it on Windows and Ubuntu though I don't think it would be an issue even as qtwebengine seems only able to play mp3.
The audio-ebooks are generated from standard EPUBs and audiobooks with tools that include ffmpeg. If the audiobook isn't originally in mp3, it can still be converted quite conveniently in the process, e.g. all the ebooks on ReadBeyond feature MP3 audio. A conversion script would not be hard to create either.
Initially though I actually attempted to create a backend with Python controlling an MPV instance. I imagine it could still be useful somehow, like making Calibre an interactive transcript player for videos with subtitles.

My issue with implementing it on the content server is mainly from figuring out how to create the link to the audio file for embedding in the overlay. For the local viewer, it's as simple as adding the relative link to the src attribute of the audio tag. But I figure for the content server I will have to render some blob link to the uploaded file instead. The relevant code seems somewhere around create_link_replacer function in render_book.py or db.pyj. Though first I will have to understand how the srv part communicates with the pyj part as well as how to debug the code in srv more efficiently.

@duydl
Contributor Author

duydl commented Oct 16, 2023

Hi. Could you merge it? Though I could use the feature fine with a shell script, I would like it also included in the official application for my mother and sister too.
The content server viewer on localhost could read the audio ebook well now. On my phone though, there are sometimes errors when I download the larger books, about 300MB.

@kovidgoyal
Owner

I will review when I have the time.

@kovidgoyal
Owner

Note that EPUB 3 defines mp3, mp4 and ogg as core media types for audio. https://www.w3.org/TR/epub/#sec-core-media-types

According to https://doc.qt.io/qt-6/qtwebengine-features.html#audio-and-video-codecs
one needs to build webengine with proprietary codecs for both mp3 and mp4. The calibre binaries do not include webengine built with proprietary codecs. https://github.com/kovidgoyal/bypy/blob/master/bypy/pkgs/qt_webengine.py#L12

So how is this supposed to work? I will try it with a sample file and see. Maybe the Qt docs are wrong and mp3 is not a proprietary codec; I know its patents expired in 2017.

@kovidgoyal
Owner

In my testing the audio worked on Windows and macOS but not Linux. I am guessing Chromium falls back to OS-provided facilities? No idea.

@kovidgoyal
Owner

  1. scroll_to_element_if_out_of_view() will break in paged mode. Instead use the scroll_to_elem function to do the scrolling. Also in paged mode you can't really check getBoundingClientRect() as the element can be spread over multiple pages. In practice it might be enough to check if the left and top of the bounding client rect or the right and bottom are visible.

  2. The svg images have width/height set in em. Change that to 1792 to match the rest of the images and change the viewBox as well.
    2.5) There is no need for an off icon just use the close X that is used by the read aloud panel as well, or if you like power off then change the read aloud panel to use it as well

  3. Please use snake_case rather than camelCase for function and variable names.

  4. if str(self.is_audio_ebook) == "undefined" should use jstype() not str()
    also when displaying a new

  5. In display_book() under the not is_redisplay branch there should be a self.is_audio_book = undefined

  6. There are various places where the == operator is used for string or number comparisons, use the is operator instead as this resolves to the much faster JS === operator

  7. I don't follow the logic in span_at_point. The correct way to do it is to use document.elementsFromPoint()

  8. The toggle overlay button doesn't seem to work. Clicking it makes it disappear but the panel stays.

  9. Loading time is very high for these books, so it might make sense to implement some kind of delayed loading for audio files. However, this can be done later; it is not needed to merge this PR.

Thanks for your contribution!

@duydl
Contributor Author

duydl commented Oct 17, 2023

Hi. Thank you for checking up. I will address the problems soon.
But regarding items 8 and 2.5: the toggle overlay is not meant to quit the player. Its purpose is just to pull down the height of the read-aloud overlay so users can scroll/highlight/take notes while the audio still plays and marks the text. It is necessary for language learning among other things. I should probably figure out a better name then or maybe add an animation.

> In my testing the audio worked on windows and macOS but not Linux. I am guessing chromium fallsback to OS provided facilities? No idea.

I also had no idea. It worked for me on Ubuntu. I expected the browser would support more codecs but that seemed to not be the case.

@kovidgoyal
Owner

kovidgoyal commented Oct 17, 2023 via email

@duydl
Contributor Author

duydl commented Oct 17, 2023

Did you try to highlight or click on the text? It will have a different effect. Instead of jumping to the clicked sentence, it will highlight the text as in normal mode.

@kovidgoyal
Owner

No, I just started it with ctrl+s and clicked the toggle button

@duydl
Contributor Author

duydl commented Oct 17, 2023

democalibre
I meant like this. I think the TTS part could benefit from this minor utility. Being able to highlight and take notes while listening is quite convenient.

@kovidgoyal
Owner

Ah ok, yes, that works. Then maybe make the tooltip something like "Allow
selection" or similar.

@duydl
Contributor Author

duydl commented Oct 18, 2023

Hi. I have finished what you asked.
Thank you very much.

@kovidgoyal
Owner

Thanks, looks much better. Some more comments:

  1. Why is show_loading() removed from show_name()?

  2. Why is there a timeout based call to change_audio_src? Shouldn't this
    happen inside the cb() function or the show_spine_item function?

  3. If I recall the EPUB 3 media overlays spec correctly, individual HTML
    files can have media overlays defined in the OPF. Not all HTML files in
    any given book might have SMIL based media overlays.
    Therefore, I think the code to parse books should be updated to add this
    information to the book metadata. Code is in srv/render.py. You can even
    have it preparse the SMIL files into JSON for more efficient
    performance. Just as is done for HTML files. Add a property like has_smil_overlays similar to has_maths
    both at the overall book level and for individual spine items.
    Use this information in view_book.pyj
    Note that after changing this code you will need to click the "Reload
    book" button in the viewer to have it re-run the code for any previously
    opened book.

Relatedly, why is change_audio_src() calling show_next_spine_item()?
spine navigation should be in response to either user pressing
keys/clicking/tapping or reaching the end of the audio for the current
spine item. Once the next spine item is loaded, the code should check
if audio overlay exists for it and show/hide the overlay accordingly.

  4. In span_at_point() the tagname check should be with toUpperCase() as
    there is no guarantee the tagname will always be upper case in all
    browsers. Also you should probably use elementsFromPoint() to check the
    ancestors since the current code will fail, for example, for
    some text in bold when the point is inside the bold
    tag. It should check for the first element in the ancestor chain with
    tagname span and non empty id.

  5. In first_visible_span() if r?: should be changed to if r: since an
    empty string as an id is anyway useless. Also functions should be
    renamed to id_of_first_visible_span or first_visible_span_id and same
    for span_at_point.
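
To illustrate item 3 above, here is a minimal Python sketch of how the OPF could be scanned to find which spine items have a media overlay. It is not the code in calibre; it relies only on the EPUB 3 convention that a manifest item points to its SMIL file through the media-overlay attribute. The function name smil_overlay_map and the sample OPF are hypothetical.

```python
import xml.etree.ElementTree as ET

OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}

# Hypothetical, stripped-down OPF used only for illustration.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <manifest>
    <item id="cover" href="cover.xhtml" media-type="application/xhtml+xml"/>
    <item id="ch1" href="ch1.xhtml" media-type="application/xhtml+xml"
          media-overlay="ch1_smil"/>
    <item id="ch1_smil" href="ch1.smil" media-type="application/smil+xml"/>
    <item id="ch1_audio" href="ch1.mp3" media-type="audio/mpeg"/>
  </manifest>
  <spine>
    <itemref idref="cover"/>
    <itemref idref="ch1"/>
  </spine>
</package>"""


def smil_overlay_map(opf_xml):
    """Map each spine item's href to the href of its SMIL file, or None."""
    root = ET.fromstring(opf_xml)
    items = {i.get("id"): i for i in root.findall("opf:manifest/opf:item", OPF_NS)}
    result = {}
    for ref in root.findall("opf:spine/opf:itemref", OPF_NS):
        item = items.get(ref.get("idref"))
        if item is None:
            continue  # dangling idref, ignore it in this sketch
        overlay = items.get(item.get("media-overlay"))
        result[item.get("href")] = overlay.get("href") if overlay is not None else None
    return result


overlays = smil_overlay_map(SAMPLE_OPF)
# Book-level flag, analogous to the has_smil_overlays property suggested above.
has_smil_overlays = any(v is not None for v in overlays.values())
```

The per-item values give the spine-level information, and the any() call gives the book-level flag.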

@duydl
Contributor Author

duydl commented Oct 19, 2023

  1. Perhaps just for testing. Apologies. I should have checked more carefully when cleaning up for the PR.
  2. This is for when users start the player in a spine without audio or reach a spine without audio. The book will automatically go to the next spine until it reaches a spine with an audio. I tried putting the change_audio_src in the cb and other places but the global current_spine_item still would not update and the book just got stuck in the next spine. Very confused because set_current_spine_item() is called much earlier, there shouldn't even be a need for a callback. I could have modified the global function to return a promise but the effect felt much better with some delay anyway so I settled with setTimeout.
  3. It is true the smil file and audio need to be specified in content.opf. I will attempt to parse the book manifest from the Python end, the logic for parsing smil file needs some optimization anyway. Though it would be more work than I could finish immediately.
    4,5: fixed.

@kovidgoyal
Owner

kovidgoyal commented Oct 19, 2023 via email

@duydl
Contributor Author

duydl commented Oct 19, 2023

> I don't think I am comfortable with this, opening read aloud should not cause the book to jump ahead/behind to an arbitrary point. Instead, if the current file has no smil audio, just popup a modal dialog saying so and maybe ask the user if they want to skip to the next location with audio.

Just so you know I didn't just invent the behavior, just mimicked that from Thorium. It would not skip arbitrarily to anywhere, just continue to go ahead to where there is an audio. The users would feel like there is empty audio at the current location, just like an audiobook, the user experience would be much smoother IMO without any additional interaction needed. Also, users would not know which spine specifically has audio, so they may not be able to skip to the right section.

demothorium
democalibre

@kovidgoyal
Owner

kovidgoyal commented Oct 19, 2023 via email

@duydl
Contributor Author

duydl commented Oct 19, 2023

But what if there is some spine without audio in the middle of the spines with audio? Many books have an h1 HTML spine or image spine among text sections.

Edited: They would also be unable to see the non-audio sections.

@kovidgoyal
Owner

kovidgoyal commented Oct 19, 2023 via email

@duydl
Contributor Author

duydl commented Oct 19, 2023

They could press pause and it will not skip to the next spine. Then they could toggle scrolling and read that section.
In any case, I added a question_dialog for the user to hide the overlay. Though I don't think your idea of skipping directly to the audio section is optimal. For now if the user selects yes, the viewer would just skip to the next spine item one by one until it arrives at a spine item with audio.
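
The skipping behaviour described here can be sketched in a few lines of Python. This is only an illustration of the idea, not the viewer code: the function name next_spine_with_audio and the has_audio mapping are hypothetical.

```python
def next_spine_with_audio(spine, has_audio, current_index):
    """Return the index of the next spine item with audio, or None.

    spine: list of spine item names in reading order
    has_audio: dict mapping spine item name -> bool (hypothetical structure)
    current_index: index of the spine item currently displayed
    """
    # Walk forward one spine item at a time until one with audio is found.
    for idx in range(current_index + 1, len(spine)):
        if has_audio.get(spine[idx], False):
            return idx
    return None  # no later spine item has audio


spine = ["cover.xhtml", "title.xhtml", "ch1.xhtml", "img.xhtml", "ch2.xhtml"]
has_audio = {"ch1.xhtml": True, "ch2.xhtml": True}
target = next_spine_with_audio(spine, has_audio, 0)
```

Returning None when nothing is found lets the caller decide whether to show a dialog or simply stop playback.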

@kovidgoyal
Owner

I have implemented parsing of SMIL data in srv/render_book.py. Please use
that data in your code. Note that seq is recursive; I don't recall if
your code handles that. If not, it should.
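
As a rough illustration of why recursion matters here, the following Python sketch converts a SMIL document with nested seq elements into a JSON-serializable tree. It is not the code in srv/render_book.py; the output layout and the names parse_container and smil_to_json are assumptions made for the example.

```python
import json
import xml.etree.ElementTree as ET

SMIL_NS = "{http://www.w3.org/ns/SMIL}"

# Hypothetical sample with a seq nested inside another seq.
SAMPLE_SMIL = """<smil xmlns="http://www.w3.org/ns/SMIL" version="3.0">
  <body>
    <seq id="s1">
      <par id="p1">
        <text src="ch1.xhtml#f001"/>
        <audio src="ch1.mp3" clipBegin="0:00:00.000" clipEnd="0:00:02.500"/>
      </par>
      <seq id="s2">
        <par id="p2">
          <text src="ch1.xhtml#f002"/>
          <audio src="ch1.mp3" clipBegin="0:00:02.500" clipEnd="0:00:05.000"/>
        </par>
      </seq>
    </seq>
  </body>
</smil>"""


def parse_container(elem):
    """Convert a body/seq element into a dict, recursing into nested seq."""
    node = {"type": "seq", "children": []}
    for child in elem:
        if child.tag == SMIL_NS + "seq":
            node["children"].append(parse_container(child))  # recursion
        elif child.tag == SMIL_NS + "par":
            text = child.find(SMIL_NS + "text")
            audio = child.find(SMIL_NS + "audio")
            node["children"].append({
                "type": "par",
                "text": text.get("src") if text is not None else None,
                "audio": dict(audio.attrib) if audio is not None else None,
            })
    return node


def smil_to_json(smil_xml):
    """Parse a SMIL string and return its body as a JSON string."""
    body = ET.fromstring(smil_xml).find(SMIL_NS + "body")
    return json.dumps(parse_container(body))


data = json.loads(smil_to_json(SAMPLE_SMIL))
```

A parser that only looks at the direct par children of the top-level seq would miss p2 in this sample.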

@kovidgoyal
Owner

I added a function is_anchor_on_screen() that you can use to find the first visible smil id. It caches the position computation, so should be fairly performant.

@kovidgoyal
Owner

This has been merged into the notes branch which will eventually become calibre 7. I have rewritten all the smil sync code and made it behave like TTS based read aloud. Feel free to test/comment.

@kovidgoyal kovidgoyal closed this Oct 24, 2023
@duydl
Contributor Author

duydl commented Oct 24, 2023

I haven't looked into what you have changed, but the viewer now cannot play both the Readbeyond samples and my custom-made audio ebooks, with different errors appearing.
I am also really confused about why you decided to remove the toggle that allows text selection while the audio is playing, especially after all of my explanation on why it is a necessary feature for me.

Edited: Clearing the cache solved it. There are some other problems though from a brief testing.

  1. Audio does not update when changing spine with TOC.
  2. Clicking sometimes brings the audio to the beginning. It was a problem in some of my implementations also. I had solved it with the latest try though.
  3. The book jumps straight to the part with audio, automatically skipping all the sections without audio. No dialog or whatever. I thought you were against this behavior.
  4. It seemed like books, especially longer ones with lots of smil files, took even more time to load. It made me wonder why it is necessary to parse all the smil files with the Python backend and add them to the clunky manifest. The smil files are already in a well-structured format and I think the DOM parser of webengine is quite optimized to process that. There was no delay at all when jumping around the TOC, loading and processing the smil file in my original code. Isn't it a Python vs C++ speed contest?

@kovidgoyal
Owner

kovidgoyal commented Oct 25, 2023 via email

@kovidgoyal
Owner

kovidgoyal commented Oct 25, 2023

  1. Yes, this behavior matches TTS read aloud.
  2. No, it was still a problem in your code; indeed even clicking on the progress bar would reset time to zero. The fix is fairly simple: when handling clicks, if no SMIL element is found, don't do anything.
  3. Yes, after some consideration I decided you were right about that. Since if the user decides to listen to audio, their primary mode of interaction is the audio.
  4. SMIL files are parsed in native code by lxml with a little bit of Python on top to convert the lxml to JSON. And in any case parsing them does not affect load time, only the initial "Preparing book for first read" time, which happens only once per book, on first open. Indeed, parsing SMIL in JS means that it has to be parsed on every spine transition, contributing exactly to loading times. Whereas now it is parsed just once and in JS all that is done is JSON deserialization and building up some lists.
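
A minimal sketch of the "deserialize and build up some lists" step in point 4, written in Python for illustration (the viewer does this in Rapydscript). The JSON layout, the function names, and the simple colon-separated clock format are assumptions; the lookup also assumes the clips of one audio file appear in increasing time order.

```python
import bisect
import json

# Hypothetical pre-parsed SMIL data, as a backend might have serialized it.
PREPARSED = json.dumps({"type": "seq", "children": [
    {"type": "par", "text": "ch1.xhtml#f001",
     "audio": {"src": "ch1.mp3", "clipBegin": "0:00:00.000", "clipEnd": "0:00:02.500"}},
    {"type": "seq", "children": [
        {"type": "par", "text": "ch1.xhtml#f002",
         "audio": {"src": "ch1.mp3", "clipBegin": "0:00:02.500", "clipEnd": "0:00:05.000"}},
    ]},
]})


def clock_to_seconds(value):
    """Convert 'h:mm:ss.fff', 'mm:ss' or 'ss' into seconds (simplified)."""
    seconds = 0.0
    for part in value.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def flatten(node, out):
    """Collect (start_seconds, text_target) pairs from a nested seq tree."""
    for child in node.get("children", ()):
        if child["type"] == "par":
            if child.get("audio"):
                out.append((clock_to_seconds(child["audio"]["clipBegin"]), child["text"]))
        else:
            flatten(child, out)
    return out


def text_at(timeline, t):
    """Return the text target whose clip starts at or before time t."""
    starts = [start for start, _ in timeline]
    i = bisect.bisect_right(starts, t) - 1
    return timeline[i][1] if i >= 0 else None


timeline = flatten(json.loads(PREPARSED), [])
current = text_at(timeline, 3.0)
```

Once the flat timeline exists, finding the element to highlight for a given playback time is a binary search rather than a walk over XML.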

@kovidgoyal
Owner

And regarding being able to highlight while listening to audio, that strikes me as a pretty broken design, since the text scrolls while the audio is playing, which means things move around. In flow mode the scrolling is potentially continuous. In paged mode it is of course better in that page transitions are less frequent, but they can still happen while you are in the middle of highlighting something.

@duydl
Contributor Author

duydl commented Oct 25, 2023

> 1. Yes, this behavior matches TTS read aloud

Changing the spine in the middle of playing the TTS will not update the audio, but the mark keeps on highlighting irrelevant text in the new spine. I don't understand why you don't see it as buggy as hell. I haven't seen any TTS that locks up the user while running either. Interacting with the text is what people want from these features, or else they would just listen to the audio.

> 2. No it was still a problem in your code

Yeah, it was in some of the other buggy commits. I didn't sync the newest ones. But seriously, missing that problem shows you would barely ever use the feature yourself.

> 3. Yes, after some consideration I decided you were right about that. Since if the user decides to listen to audio, their primary mode of interaction is the audio.

No, they want to read while listening. It is the better auto-scroll mode. And users could scroll and select text in that mode.

Another comparison is with interactive transcripts like that of YouTube; they do not lock the user in the current scene. The purpose of the interactive transcript is to let users search and navigate the media while listening/watching, as well as study the text content when the media is not fully comprehensible, e.g. for second-language users and language learners.

> that strikes me a a pretty broken design

Again, this shows you barely tried the feature. In both flow mode and paged mode the viewer only scrolls when the marked element is out of the viewport. So technically paged mode and flow mode functioned the same. The other audio-ebook player, Thorium, allows that, and their users do not complain. And it is not even a default behavior in Calibre.
Tbh, I am your only user feedback source, you would probably not use the feature yourself. In itself, it is a pretty isolated module. Why do you need to be so controlling of its behavior?

> SMIL files are parsed in native code by lxml

JavaScript runs asynchronously, so the smil file is parsed while the HTML is also being parsed. And the parsed content is cached anyway. In the end any performance issue in the viewer would stem significantly more from the audio loading. Parsing smil with Python or the web engine is just a matter of preference, though I think you just invented work for yourself.

@kovidgoyal
Owner

kovidgoyal commented Oct 25, 2023 via email

@duydl
Contributor Author

duydl commented Oct 25, 2023

Allow me to apologize sir if I was impolite. I understand you are the principal dev of Calibre and your value is most important for the project.
I was still unfamiliar with Rapydscript or Calibre codebase and I am still learning. I was aware of the problems you pointed out in my code and have been working on addressing them. Guess I was a little unhappy because my work for several days to fix those problems was for nothing. You are a busy man and I thought you would let me work on the feature with your direction, not finishing it yourself. Though it is perhaps more time-efficient for you.
Some of the features you left out are integral to my language learning flow. But if you are reluctant to discuss further I would settle with using my fork for the task.
Thank you very much for your time.

@kovidgoyal
Owner

No worries. And yes I did it myself as it would take me less time that
way. As for your language learning flow, here is a proposal that works
for me:

A "background" mode for read aloud (for both tts and smil). Basically, a
"background" button on the bar which when clicked will close the overlay
but keep the audio running in the background. The user can then do
whatever she likes including highlighting, searching, navigating around
etc. If the read aloud is once again clicked on then it should sync the
viewer position/state to the background state and resume highlighting of
text.

If this is acceptable to you feel free to work on a PR for it. Doing it
this way makes sense to me. It might need some discussion about whether
background mode should jump to next spine item automatically or not.

2 participants