Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NewsDownloader: epub with images #4982

Merged
merged 3 commits into from Apr 30, 2019

Conversation

Projects
None yet
4 participants
@Thra11
Copy link
Contributor

commented Apr 29, 2019

Implements #4894

@Thra11 Thra11 force-pushed the Thra11:news-epub-with-images-2 branch from 099ba33 to 26da75b Apr 29, 2019

@Frenzie
Copy link
Member

left a comment

Looks decent enough at a glance (and of course I already gave some minor feedback earlier). Is EpubDownloadBackend:createEpub() the part that differs most from the Wikipedia code?

@Frenzie Frenzie added the enhancement label Apr 29, 2019

@Frenzie Frenzie added this to the 2019.05 milestone Apr 29, 2019

@Thra11 Thra11 force-pushed the Thra11:news-epub-with-images-2 branch from 26da75b to 5034cec Apr 29, 2019

@Thra11

This comment has been minimized.

Copy link
Contributor Author

commented Apr 29, 2019

Is EpubDownloadBackend:createEpub() the part that differs most from the Wikipedia code?

From memory, the main differences are:

  • Wikipedia starts with an article name and constructs various urls from it, whereas EpubDownloadBackend is given a url to fetch
  • In createEpub, there's lots of similar epub boilerplate, but the Wikipedia code also performs a series of substitutions to make the result render more like a wikipedia page and fix known issues.
  • Most of the code which parses the html to find and process image urls is unchanged.

I think the core of a shared epub-writing module would probably be a function to take a lua structure representing an epub (metadata, html files, images, css, table of contents, etc.) and write it out as an epub, possibly substituting minimal defaults for elements which aren't provided.

Possible future improvements to this module:

  • Parsing <h*> to construct a proper table of contents.
  • Downloading linked css and including it.
-- <reference href="toc.html" type="toc" title="Table of Contents" href="toc.html" />
-- </guide>
local koreader_version = "KOReader"
if lfs.attributes("git-rev", "mode") == "file" then

This comment has been minimized.

Copy link
@Frenzie

Frenzie Apr 30, 2019

Member

Could you use Version:getCurrentRevision() instead? Same thing, but only one place to change things.

--- Returns current KOReader git-rev.
-- @treturn string full KOReader git-rev such as `v2015.11-982-g704d4238`
function Version:getCurrentRevision()
if not self.rev then
local rev_file = io.open("git-rev", "r")
if rev_file then
self.rev = rev_file:read()
rev_file:close()
end
-- sanity check in case `git describe` failed
if self.rev == "fatal: No names found, cannot describe anything." then
self.rev = nil
end
end
return self.rev
end

@Frenzie

This comment has been minimized.

Copy link
Member

commented Apr 30, 2019

@poire-z Do you have any further comments?

@poire-z

This comment has been minimized.

Copy link
Contributor

commented Apr 30, 2019

No comment after a quick look (but I'm not familiar with the newsdownloader code, and I've forgotten a bit about that Trapper thingy :| but it looks ok, so, if it works, go ahead.)

Just surprised about that comment (in a commit here, and in the other discussion):

Lift the Trapper:wrap call out of the individual article processing code,
so that articles are processed one by one, in order to:
Avoid concurrent progress updates fighting over the UI dialog
Avoid trying to download many images at the same time

I didn't know we could download things at the same time ! :) I think Trapper still gets stuff one after the other, just allowing to interrupt the whole flow, and indeed displaying progress. But there shouldn't be any fighting :)
So, unless NewsDownloader really download articles in parallele... (does it really do that?!)

@Thra11

This comment has been minimized.

Copy link
Contributor Author

commented Apr 30, 2019

My understanding is that Trapper:wrap is a "Simple wrapper function for a coroutine", so if you call it from the body of a for loop, you end up creating a bunch of coroutines which run concurrently.

@Thra11

This comment has been minimized.

Copy link
Contributor Author

commented Apr 30, 2019

I don't think the old newdownloader backend used Trapper inside the backend, so the concurrency was probably introduced accidentally when I took the Wikipedia code with Trapper and put it inside the feed-processing loop.

@poire-z

This comment has been minimized.

Copy link
Contributor

commented Apr 30, 2019

OK, I see what may have happened.
But (fyi, if I remember that business right):

  • coroutines don't run concurrently: they are not threads. Lua is not multi-threaded. Coroutines are just some functions that can pause themselves (when they call yield) and can be resumed. Only one piece of the code run at a time, and that piece of code can allow some other piece to be run with yield() or resume().
  • Trapper tries to make that a little bit easier for us: we run some download, and give control back to the UI, which waits for 100ms for a touch event (for aborting the next step) - if no touch event, it gives the control back to the coroutine code, which goes on with the next download. We can't interrupt in the middle of a download (we have to wait for it to complete, or for the timeout to be reached). But we can between downloads (so, between the article HTML download and the first image download, and between each image download).
  • That's the best we could do with Lue being non multithreaded.

So, may be you launched many coroutines, and depending on which gave control back to the UI, and which was resumed, or which got a touch event, you may have had that mess in "fighting for the UI".
So yes, there should be only one coroutine'd flow, with the loops inside it, so things can be kept simple :)

@Thra11 Thra11 force-pushed the Thra11:news-epub-with-images-2 branch from 5034cec to 1ea9d67 Apr 30, 2019

@@ -0,0 +1,521 @@
local Version = require("version")
local ffiutil = require("ffi/util")

This comment has been minimized.

Copy link
@Thra11

Thra11 Apr 30, 2019

Author Contributor

Having looked at the coding style guide and some of the other lua files, I think some of these should be capitalised (local FFIUtil = ...) because they're classes?

This comment has been minimized.

Copy link
@Frenzie

Frenzie Apr 30, 2019

Member

Either way is fine by me for ffiutil. I think the existing code (most of which predates that addition to the styleguide) may have ffiutil, FFIUtil as well as ffiUtil.

So I'm good to merge, but if you'd like to change it to FFIUtil I'll wait.

This comment has been minimized.

Copy link
@Thra11

Thra11 Apr 30, 2019

Author Contributor

I'll leave it as it is then :)

@Thra11

This comment has been minimized.

Copy link
Contributor Author

commented Apr 30, 2019

coroutines don't run concurrently: they are not threads.

I guess it depends on your definition of concurrency: some people make a distinction between concurrency (bits code which can be safely paused and resumed or executed in a different order, even on a single core, often aiming to keep the system responsive), versus parallelism (splitting a task across multiple cores with the aim of finishing it faster, without any part necessarily being interruptible).

Anyway, semantics aside, it looks like we understand what's going on. I didn't think too hard about where to put the Trapper:wrap, just moved it as far out as required to get it out of the loop and out of the individual processAtom/procesRSS functions to avoid duplicate copies. However, I'm wondering whether it makes sense to be showing InfoMessages via Trapper:info and UIManager:show in the same module, or whether it make more sense to move the Trapper:wrap further out and use Trapper:info everywhere.

@poire-z

This comment has been minimized.

Copy link
Contributor

commented Apr 30, 2019

whether it make more sense to move the Trapper:wrap further out and use Trapper:info everywhere.

Having not really looked at NewsDownloader, I would anyway say this is how it should be used.
Trapper.wrap() should be called quite soon after the menu item is hit and the callback/action called, after having checked there are indeed some configuration, feeds, and things to do, before doing the first thing of them.

Thra11 added some commits Apr 14, 2019

NewsDownloader: Download images and output EPUBs
Initial commit of new NewsDownloader which downloads images as well as text
and packs it all into an epub.

Based on Wikipedia "Download as EPUB" code.
NewsDownloader: Add flag 'include_images' to feed config
Allow the user to specify whether to download images for each individual
feed specified in feed_config.lua. Default to false to stay closest to
existing behaviour.
NewsDownloader: Process articles sequentially
Lift the Trapper:wrap call out of the individual article processing code,
so that articles are processed one by one, in order to:
* Avoid concurrent progress updates fighting over the UI dialog
* Avoid trying to download many images at the same time

@Thra11 Thra11 force-pushed the Thra11:news-epub-with-images-2 branch from 1ea9d67 to 5e86b6e Apr 30, 2019

@Frenzie Frenzie merged commit 8e23d2a into koreader:master Apr 30, 2019

1 check passed

ci/circleci Your tests passed on CircleCI!
Details
@Frenzie

This comment has been minimized.

Copy link
Member

commented Apr 30, 2019

Thanks!

@Thra11

This comment has been minimized.

Copy link
Contributor Author

commented Apr 30, 2019

Trapper.wrap() should be called quite soon after the menu item is hit and the callback/action called, after having checked there are indeed some configuration, feeds, and things to do, before doing the first thing of them.

Looks like I didn't test this well enough: it no longer interleaves articles in a feed, but it will interleave feeds if you have more than one.

I'll change it to use Trapper the way you suggest. (Can I add new commits to this PR or do I create a new one?)

@NiLuJe

This comment has been minimized.

Copy link
Member

commented Apr 30, 2019

I'd go with a new PR, from a new, rebased branch ;).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.