Add WikiReader plugin #9534

Open
wants to merge 3 commits into
base: master

Conversation

@Bartvelp Bartvelp commented Sep 20, 2022

Hi!

The Wikipedia plugin of KOReader is wonderful, but it does not support offline files. To fill this gap and fix a few other things, I built a new plugin called WikiReader. It reads a database of HTML files from local disk and shows them in the default Reader. It is built for reading Wikipedia articles offline, but in principle other sources are compatible. Some of the supported features are:

  • 1-click hyperlinks between arbitrary articles
  • Integration with the default Reader UI
  • Searching the database for any title, including fuzzy search
  • Setting your own database
  • Setting favorite articles

The database is an SQLite DB that uses zstd to compress articles. To get such a database, ZIM files can be converted. For this purpose, I created another small repo here: https://github.com/Bartvelp/zim-converter. There you can also find an example database.
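
To illustrate the storage idea (compressed HTML blobs in SQLite, looked up by title), here is a minimal Python sketch. It is not the plugin's code: the `articles(title, body)` schema and the function names are hypothetical, and `zlib` stands in for zstd only because it ships with Python's standard library. The real converter's schema may differ.

```python
import sqlite3
import zlib


def build_demo_db(conn, pages):
    """Create a hypothetical articles table and fill it with compressed HTML."""
    conn.execute("CREATE TABLE articles (title TEXT PRIMARY KEY, body BLOB)")
    conn.executemany(
        "INSERT INTO articles VALUES (?, ?)",
        # zlib is a stand-in for zstd here (assumption, see lead-in).
        [(title, zlib.compress(html.encode("utf-8"))) for title, html in pages.items()],
    )


def search_titles(conn, query, limit=10):
    """Substring title search; SQLite's LIKE is case-insensitive for ASCII."""
    rows = conn.execute(
        "SELECT title FROM articles WHERE title LIKE ? ORDER BY title LIMIT ?",
        ("%" + query + "%", limit),
    )
    return [row[0] for row in rows]


def load_article(conn, title):
    """Return the decompressed HTML for an exact title, or None if absent."""
    row = conn.execute(
        "SELECT body FROM articles WHERE title = ?", (title,)
    ).fetchone()
    if row is None:
        return None
    return zlib.decompress(row[0]).decode("utf-8")
```

Storing each article as its own compressed row keeps lookups cheap, because only the requested article has to be decompressed.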

I think that the code of this plugin does not follow the absolute best practices, but it shouldn't interfere with anything if it is simply not used. I needed to use some hacks to allow hyperlinks and used a global variable to maintain a single plugin instance.
Some notable things that are not yet included are:

  • Images
  • Non-inlined CSS
  • Full-text search
  • Downloading a database directly from inside KOReader's menu

Perhaps this plugin is not ready as-is, but I hope you agree that it adds nice features and shows promise.
I tested it on my laptop (Linux AppImage), Android, and a PocketBook Touch HD 3.



@Frenzie Frenzie added the Plugin label Sep 20, 2022
@poire-z
Contributor

poire-z commented Sep 20, 2022

Kudos for having built a standalone plugin and having it hack into KOReader's core to get it to work :)

But the ReaderUI/ReaderLink hack is probably not acceptable if this plugin becomes part of KOReader.
It is fine if this plugin stays outside, on your repo or in https://github.com/koreader/contrib .

If it would be part of KOReader, you could just add support into ReaderLink for links like:
href="kolocalwiki://wiki-en.db/John Wayne" which would just send an Event OpenLocalWikiPage(wikidbfile, titlepage) that your plugin would be the only one to catch.
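
As a rough illustration of the link format suggested above, the following Python sketch splits such an href into a database file and a page title. The function name and the percent-decoding of the title are assumptions for illustration, not part of KOReader.

```python
from urllib.parse import unquote

SCHEME = "kolocalwiki://"


def parse_local_wiki_href(href):
    """Split 'kolocalwiki://<dbfile>/<title>' into (dbfile, title).

    Returns None when the href does not use the scheme or is incomplete.
    """
    if not href.startswith(SCHEME):
        return None
    db_file, sep, title = href[len(SCHEME):].partition("/")
    if not sep or not db_file or not title:
        return None
    # Percent-decoding the title is an assumption; plain titles pass through.
    return db_file, unquote(title)
```

One limitation of this path-style form is that the database name cannot itself contain a slash, so only a bare file name can be carried, not a full path.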

So, the question is: do we want this to be the main/official way to support local wikipedia (and co) dumps - which needs to have the more common ZIM files (that I know nothing about) post-processed by this tool to get them to be usable.

Does your 2.4 GB archive contain images? Or would it be way bigger if it had images?
(My own usage for Wikipedia lookups is often happy to get images, to see where a country is located, how a person looked :) so I would be frustrated with offline articles without images - and also knowing that the article may not be up to date.)

@pazos
Member

pazos commented Sep 20, 2022

Supporting ZIM files is easy using libzim, but it requires ICU, which is a big library to bundle.

Also, FAT32 limits files to 4 GB. Most ZIM archives with images are way bigger than that.

@Bartvelp
Author

Thanks, yes I chose to do it this way with the Reader link patch so all changes are restricted to the plugin files, but it is too hacky.

Bundling libzim would be a nicer solution I guess, but I do not know how to do that with all the C dependencies; that's why I chose to build a small tool to convert ZIM files into formats KOReader already bundles (SQLite and zstd).
Of course it would also be possible to split the databases if they are >4 GB if we use this non-standard format.

The DB I linked is without images because I did not add support for those yet, as I wanted some feedback on how this plugin should be done "properly".

I was not aware of the unofficial plugins repo, I'd like it if those plugins were accessible/findable in the native KOReader UI to download and install. Otherwise it would be nice IMO if this plugin could evolve into something that could be included in the KOReader package.

@Frenzie
Member

Frenzie commented Sep 21, 2022

Adding a reference to #2333.

Do not use global instance anymore, instead just remember last widget instance in local variable.
Patch the readerlink to emit events if it caught a "kolocalwiki://" prefixed URL
And add listener for this in plugin
@Bartvelp
Author

I refactored the plugin to use the event as @poire-z recommended, which turned out to be quite easy. I think a native libzim integration would be nicer, but I think this plugin can be good enough in the meantime. Support for images is possible, but then the database would get a lot bigger: it currently takes 2.5 GiB for the 50,000 most popular articles; with images that would be ~6.5 GiB. Definitely doable, but too big for FAT32 (how common is this in e-readers?).
The article dumps are not very old; the latest dump was made this month, 2022-09.
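
The size figures above can be checked against the FAT32 limit with a little arithmetic. The sketch below assumes the usual FAT32 maximum file size of 4 GiB minus one byte; the helper name is made up for illustration, and it ignores per-file SQLite overhead and the fact that an article row cannot be split across files.

```python
GIB = 2**30
# FAT32 stores file sizes in a 32-bit field, so one file tops out at 4 GiB - 1 byte.
FAT32_MAX_FILE_BYTES = 2**32 - 1


def shards_needed(total_bytes, max_file_bytes=FAT32_MAX_FILE_BYTES):
    """Smallest number of files needed so that each stays within the size limit."""
    if total_bytes <= 0:
        return 0
    # Integer ceiling division avoids floating-point rounding on large sizes.
    return -(-total_bytes // max_file_bytes)


without_images = shards_needed(int(2.5 * GIB))  # text-only database
with_images = shards_needed(int(6.5 * GIB))     # estimated size with images
```

Under these assumptions the text-only database fits in a single file, while the estimated database with images would have to be split into at least two.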

@poire-z
Contributor

poire-z commented Sep 26, 2022

I refactored the plugin to use the event as @poire-z recommended, which turned out to be quite easy.

That's indeed better. Ideally, we would have ReaderLink:registerProtocolHandler(scheme, event) and your plugin would use self.ui.link:registerProtocolHandler("kolocalwiki://", "OpenLocalWikiPage"). (Which would make it standalone without the need to hack core - except by adding this generic plug facility.)
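
The generic facility described here amounts to a small registry that maps a URL scheme to an event name. The Python sketch below only mirrors the shape of that idea; `LinkDispatcher` and its methods are hypothetical and do not exist in KOReader, where the real thing would live in ReaderLink and use KOReader's own event system.

```python
class LinkDispatcher:
    """Hypothetical stand-in for a link handler with pluggable URL schemes."""

    def __init__(self):
        self._handlers = {}   # scheme prefix -> event name
        self._listeners = {}  # event name -> list of callbacks

    def register_protocol_handler(self, scheme, event):
        """Declare that links starting with `scheme` should emit `event`."""
        self._handlers[scheme] = event

    def on(self, event, callback):
        """Subscribe a callback (for example, a plugin method) to an event."""
        self._listeners.setdefault(event, []).append(callback)

    def handle_link(self, href):
        """Emit the registered event for href; return False if no scheme matches."""
        for scheme, event in self._handlers.items():
            if href.startswith(scheme):
                payload = href[len(scheme):]
                for callback in self._listeners.get(event, []):
                    callback(payload)
                return True
        return False
```

With such a registry the core only needs to know about schemes and event names, so a plugin can add its own link type without patching the link handler.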

too big for FAT32 (how common is this in e-readers?).

I think nearly all (Kobo, Kindle, PocketBook) except Android must use FAT32.

I think this plugin can be good enough in the meantime

I'd like to have more feedback from other users: would you use/try this?

I'm quite an avid reader of Wikipedia articles on KOReader, but having done and maintained our "on-line" Wikipedia lookup and "Save as EPUB" features, I think I would be very frustrated with this plugin:

  • I don't know about the formatting we'll get, but it feels that without images, footnotes, ToC..., the reading experience would be limited (can you share an example of a generated HTML document?)
  • 50,000 most popular articles - using Wikipedia lookup as a way to get more info on the "little" subjects I don't know a lot about, I fear that most of what I look up would not be found among these 50,000 articles. And if it's about getting some random popular articles for casual reading, this could be done at home with Save as EPUB, to get a quality book to read.

So, ok, it may be better than nothing for when you are on the road with no wifi.
But then, all this could be made into a StarDict dictionary (but ok, with limited formatting and link following) - which would show the content in a text window, instead of having to make an HTML document and hack ReadHistory so that these documents don't pollute KOReader's history.
And it needs the user to make the effort of converting a ZIM file to your format - even if you build an English one, that leaves a lot of users having to make the effort for their own language.
Also, we would ship a plugin that would not work by default - and would require users to download these >2 GB files (which was initially bothersome enough to make me not try this plugin :/ and ask questions in the dark :)

So, it feels a bit like you may end up being the single user of this plugin :) And I'm not fond of shipping "good enough in the meantime" stuff, I like us to ship perfect and stable stuff :)
Unless other users/devs speak up to say that they want it :)

end

-- Encode the db path in the URL, URL escaping doesn't work for some reason, so use base64
local prefix = "kolocalwiki://" .. escape.base64_encode(self.db_path) .. "#" -- Title will be after the hashtag
Member

@NiLuJe NiLuJe Sep 26, 2022


base642bin from the sha2 module (it's in base) is liable to be much faster than this, FWIW.

(I haven't actually checked turbo, but sha2 is awesome).
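
For readers following the reviewed line, the link layout it builds (scheme, base64-encoded database path, then `#` and the title) can be sketched in Python as below. The function names are illustrative, and the URL-safe base64 alphabet is an assumption of this sketch; the Lua line under review uses `escape.base64_encode`.

```python
import base64

SCHEME = "kolocalwiki://"


def make_link(db_path, title):
    """Build 'kolocalwiki://<base64 db path>#<title>'."""
    # URL-safe alphabet is an assumption; base64 output never contains '#'.
    encoded = base64.urlsafe_b64encode(db_path.encode("utf-8")).decode("ascii")
    return SCHEME + encoded + "#" + title


def parse_link(href):
    """Return (db_path, title), or None if href is not a local wiki link."""
    if not href.startswith(SCHEME):
        return None
    # Split on the first '#', so a title may itself contain '#'.
    encoded, sep, title = href[len(SCHEME):].partition("#")
    if not sep:
        return None
    db_path = base64.urlsafe_b64decode(encoded).decode("utf-8")
    return db_path, title
```

Because base64 output never contains `#`, splitting on the first `#` recovers the path unambiguously even when the path contains slashes or spaces.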

@Bartvelp
Author

Bartvelp commented Sep 26, 2022

Okay, that is fair: if I were the only one using it, then of course it shouldn't be bundled. Perhaps we can wait to see whether people are interested, and if so we could look into how to do this properly.
At least it was a lot of fun writing the plugin, kudos to the devs for writing KOReader so pleasantly!

Oh and btw, 50,000 articles covered a lot more than I thought. I encourage anyone interested to download the preconverted DB and play around.
You can check out an HTML file as it would be shown on the e-reader here. All blue links should be in the database.

@Oppen

Oppen commented Oct 17, 2022

I'd like to have more feedback from other users: would you use/try this?

User feedback here: I'd definitely use it. I actually wanted something like this a few months ago, before I knew about KOReader, to have an offline copy of Wikivoyage on my e-reader during a trip. I had one based on Kiwix on my phone, but the battery life wasn't good enough, nor was the screen big enough for it to be comfortable. This feature would have been optimal.

@jgoerzen

Same here. I also linked to this from a Mobileread forum at https://www.mobileread.com/forums/showthread.php?p=4267358#post4267358 . I bought a Kobo partly for the ZIM feature, only to find out that it mostly no longer works. This would be fantastic!

@pazos
Member

pazos commented Oct 22, 2022

Ideally, we would have ReaderLink:registerProtocolHandler(scheme, event) and your plugin would use self.ui.link:registerProtocolHandler("kolocalwiki://", "OpenLocalWikiPage"). (Which would make it standalone without the need to hack core - except by adding this generic plug facility.)

IMHO this is the way to go.

I was not aware of the unofficial plugins repo, I'd like it if those plugins were accessible/findable in the native KOReader UI to download and install.

I'm working on a package manager to make this happen. It isn't ready yet and won't be ready for a few more months.

Otherwise it would be nice IMO if this plugin could evolve into something that could be included in the KOReader package.

IMHO we have too many plugins already in this repo. Adding new ones should be considered carefully, as they will need a maintainer, some documentation and will generate new tickets that would need to be addressed.

Okay, that is fair: if I were the only one using it, then of course it shouldn't be bundled. Perhaps we can wait to see whether people are interested, and if so we could look into how to do this properly.

I'm not opposed to merging this particular plugin. But having some more users doesn't make the thing more appealing to me.

Doing a quick SWOT thing here

  • the plugin is strictly related to reading 👍,
  • the plugin could become useful for a large userbase 👍
  • the plugin doesn't rely on a third party service/API that could cease to work in the long run 👍
  • the plugin requires some work to setup 👎 , but that's already documented in https://github.com/Bartvelp/zim-converter 👍
  • the plugin addresses some limitations of the "official" way of implementing the same feature (i.e. the FAT32 hard limit of 4 GB on ZIM files) 👍
  • the plugin's code isn't large and, after a quick look, looks well written and easy to understand 👍

@Oppen

Oppen commented Oct 24, 2022

I wonder how hard it would be to write a minimal ZIM parser in Lua to avoid the conversion step. I gather the C++ lib is as big as it is because of the write support, which this does not need. The format is somewhat documented and doesn't look excessively complex.
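
To give a feel for how small the first step of such a read-only parser could be, here is a Python sketch that decodes the fixed-size ZIM header. The field layout is written down from my reading of the openZIM file format description (80 bytes, little-endian) and should be treated as an assumption to verify against the spec; it is not taken from this PR, and a real reader would still need the directory entries and cluster decompression.

```python
import struct
from collections import namedtuple

ZIM_MAGIC = 0x044D495A  # 72173914; the bytes "ZIM\x04" read as little-endian
# Assumed layout: magic, major, minor, uuid, entry count, cluster count,
# four 64-bit positions, main page, layout page, checksum position.
HEADER_FORMAT = "<IHH16sIIQQQQIIQ"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)

ZimHeader = namedtuple(
    "ZimHeader",
    "magic major_version minor_version uuid entry_count cluster_count "
    "url_ptr_pos title_ptr_pos cluster_ptr_pos mime_list_pos "
    "main_page layout_page checksum_pos",
)


def parse_zim_header(data):
    """Decode the fixed header from the first bytes of a ZIM file."""
    if len(data) < HEADER_SIZE:
        raise ValueError("truncated ZIM header")
    header = ZimHeader._make(struct.unpack_from(HEADER_FORMAT, data))
    if header.magic != ZIM_MAGIC:
        raise ValueError("not a ZIM file (bad magic number)")
    return header
```

The pointer positions in the header lead to the URL and cluster pointer lists, which is where most of the remaining parsing work would be.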

@tillmannheigel

What is missing for a merge?

@anarcat
Contributor

anarcat commented Apr 26, 2024

omg this exists?! i want this so much! for me the zim converter step is a minor issue, i need to jump through some hoops to get the files on the kobo anyways, converting them doesn't seem like a big deal, at least short term.

of course, it would be best if we could just read the ZIM files directly, but from what i gathered, that's typically done with kiwix tools and a web browser, not sure that's something we want to embark on here...

@Frenzie
Member

Frenzie commented Apr 26, 2024

A "web browser" we already have; the problem when I last checked is that the relevant libs were rather substantial in size.

@anarcat
Contributor

anarcat commented Apr 26, 2024 via email

@Frenzie
Member

Frenzie commented Apr 26, 2024

"libs"?

libzim (a few hundred kB) has ICU as a dependency (much more). I could be mistaken or misremembering but I don't think the necessary dependencies would go for any less than a couple of MB.

if we already have a web browser in koreader, is there already a way to just fire that up against a running kiwix server? because that would serve me perfectly fine in the short term, i can hack that out myself!

I was only referring to the rendering part, but you can trivially pull in anything over HTTP and stuff it in an HtmlBoxWidget. Pictures might be a bit harder to get in.

Though if you're thinking of taking that approach I'd be remiss if I didn't point out you can simply run Lynx/Elinks/Links2 in the terminal emulator plugin.

@anarcat
Contributor

anarcat commented Apr 26, 2024 via email

@Frenzie
Member

Frenzie commented Apr 26, 2024

oh right, by "web browser" you mean "we have a web browser rendering
widget for EPUB files anyway", right? :) like we don't have a full thing
in there...

Right, I meant Zim files can be read and whatever HTML is in them can be rendered.

omg, we ship lynx?

lol no, just the terminal plugin but you can run e.g. Alpine or Debian in chroot or just run it straight up. It works quite well.

@anarcat
Contributor

anarcat commented Apr 26, 2024

Does your 2.4 GB archive contain images? Or would it be way bigger if it had images?

Support for images is possible, but then the database would get a lot bigger: it currently takes 2.5 GiB for the 50,000 most popular articles; with images that would be ~6.5 GiB. Definitely doable, but too big for FAT32 (how common is this in e-readers?).

So, just for the record, the reason I'm looking into this again is that I just bought a 512GiB microSD card. For 80$CAD. That's 60$USD. It's crazy.

I can not only fit the "all maxi" wikipedia en zim file in there (which is all of english wikipedia, with images), it can fit MULTIPLE TIMES. It's bonkers.

So yeah, we've passed the threshold where you can store all of wikipedia in a tiny little thing the size of my fingernail. I think arguments about "oh, but this is too big" are kind of moot at this point.

The question is: how well would the converter work on a 100 GB ZIM file, and would the resulting SQLite database even be usable?

(It also puts into perspective concerns about ICU's size, IMHO. I don't know about how it would impact koreader, but here on Debian, the package is 37... megabytes...)

@anarcat anarcat mentioned this pull request Apr 26, 2024
@Oppen

Oppen commented May 4, 2024

So yeah, we've passed the threshold where you can store all of wikipedia in a tiny little thing the size of my fingernail. I think arguments about "oh, but this is too big" are kind of moot at this point.

That depends. You can spend 80$CAD on an SD card. Not everyone can. As long as it's optional, do whatever, but the moment you start needing to change hardware you start leaving a large number of people out. In fact, a lot of the people you would leave out are exactly the reason why Wikipedia packages ZIM files in the first place: people living in poor countries or under dictatorships, like North Korea.

(It also puts into perspective concerns about ICU's size, IMHO. I don't know about how it would impact koreader, but here on Debian, the package is 37... megabytes...)

E-readers are a piece of tech that doesn't get obsoleted just because the web got fatter, I'd very much like for them to stay that way. I have currently no reason to switch my old Boox ML67. I already lost the ability to update KOReader there and it's OK, but notice some devices are more limited than whatever you have. Also, some systems (no idea if that's also true for the currently supported ones) can't install applications outside of main storage, which can be quite small. That's a limitation a bigger SD card won't fix.
