Converting this zim file failed #352

Closed
sobaee opened this issue Jan 9, 2022 · 14 comments

sobaee commented Jan 9, 2022

Hello Saeed

Could you please help with this issue?

I have a ZIM file called "Essential drugs", which was built from a website using the youzim.it service.

This file works normally in kiwix.apk: anyone can search for any drug and find its definition.

When I try to convert this file to another format such as slob, the conversion reports many errors and completes, but it produces only a small file. When I open the converted slob dictionary in aard2.apk, it contains strange headwords, such as pieces of HTML code, with strange definitions.

The errors:
[WARNING] Unrecognized mimetype='application/warc-headers'
[ERROR] unknown content type for 'medicalguidelines.msf.org/viewport/EssDr/english/eflornithine-injectable-16682680.html'
[ERROR] unknown content type for 'update.googleapis.com/service/update2/json?cup2key=10:2622198263&cup2hreq=0338fb5b5cb30f0e5182132d1ff9620dec7fccf4020a155becf04a6b4c2d247a'
[ERROR] unknown content type for 'Xapian Fulltext Index'
[ERROR] unknown content type for 'Xapian Title Index'
[INFO] ZIM Entry Count: 4334
[INFO] Empty Content Count: 2
[INFO] Redirect Count: 1
Converting | |█████████████|%100.0 Time: 0:00:04

The file download link:
https://s3.us-west-1.wasabisys.com/org-kiwix-zimit/other/medicalguidelines.msf.org_07e28661.zim

I think this file is built the same way as the Wikipedia ZIM files, which convert successfully without any problem.

Is it possible to convert a file like this?

I also have this file in .epub and .mobi formats. Would it be possible to support reading from these formats?
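
Editor's note: the `unknown content type` errors above suggest that some items in this ZIM file (WARC headers, a JSON update endpoint, the Xapian indexes) are not dictionary articles at all. As a rough sketch only, and not the actual logic of PyGlossary's ZIM plugin, a reader could sort items by MIME type before deciding what becomes an entry. The names `classify` and `summarize` and the type lists are hypothetical.

```python
from collections import Counter

# Assumed type lists for illustration; a real reader may use different rules.
HTML_TYPES = {"text/html"}
RESOURCE_PREFIXES = ("image/", "font/", "text/css", "application/javascript")


def classify(mimetype: str) -> str:
    """Return 'entry', 'resource', or 'skip' for one item's MIME type."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    base = mimetype.split(";", 1)[0].strip().lower()
    if base in HTML_TYPES:
        return "entry"
    if base.startswith(RESOURCE_PREFIXES):
        return "resource"
    # Anything else (e.g. application/warc-headers) is not article content.
    return "skip"


def summarize(items):
    """Count how a list of (path, mimetype) pairs would be handled."""
    return Counter(classify(mimetype) for _path, mimetype in items)
```

Running `summarize` over the items of a file would show at a glance how many of the 4334 ZIM entries are real articles and how many are crawl by-products.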

ilius added a commit that referenced this issue Jan 9, 2022
Owner

ilius commented Jan 9, 2022

Please try again.

Also can you upload your .epub and .mobi files?

Author

sobaee commented Jan 9, 2022

I get this error immediately when I start converting:

$ python main.py essential-drugs.zim essential-drugs.slob
Traceback (most recent call last):
  File "/storage/emulated/0/pyglossary-master/main.py", line 8, in <module>
    from pyglossary.ui.main import main
  File "/storage/emulated/0/pyglossary-master/pyglossary/ui/main.py", line 30, in <module>
    from pyglossary.ui.base import UIBase
ImportError: cannot import name 'UIBase' from 'pyglossary.ui.base' (/storage/emulated/0/pyglossary-master/pyglossary/ui/base.py)
Traceback locals:
__name__ = 'pyglossary.ui.main'
__doc__ = None
__package__ = 'pyglossary.ui'
__loader__ = <_frozen_importlib_external.SourceFileLoader object at 0x780...
__spec__ = ModuleSpec(name='pyglossary.ui.main', loader=<_frozen_importli...
__file__ = '/storage/emulated/0/pyglossary-master/pyglossary/ui/main.py'
__cached__ = '/storage/emulated/0/pyglossary-master/pyglossary/ui/__pycac...
len(__cached__) = 84
__builtins__ = {'__name__': 'builtins', '__doc__': "Built-in functions, e...
len(__builtins__) = 155
os = <module 'os' from '/data/data/com.termux/files/usr/lib/python3.10/os...
sys = <module 'sys' (built-in)>
argparse = <module 'argparse' from '/data/data/com.termux/files/usr/lib/p...
json = <module 'json' from '/data/data/com.termux/files/usr/lib/python3.1...
logging = <module 'logging' from '/data/data/com.termux/files/usr/lib/pyt...
core = <module 'pyglossary.core' from '/storage/emulated/0/pyglossary-mas...
Entry = <class 'pyglossary.entry.Entry'>

Author

sobaee commented Jan 10, 2022

To get it working, I used the previous PyGlossary version (the one before this commit) and only replaced the zimfile.py plugin with the new one.

This time I get the original headwords alongside the HTML ones, but the definitions of all of them are incorrect; they show something that looks like code.
See this:
Screenshot_20220110033832

I got these errors during conversion:
[ERROR] unknown content type for 'update.googleapis.com/service/update2/json?cup2key=10:2622198263&cup2hreq=0338fb5b5cb30f0e5182132d1ff9620dec7fccf4020a155becf04a6b4c2d247a'
[ERROR] unknown content type for 'Xapian Fulltext Index'
[ERROR] unknown content type for 'Xapian Title Index'
[INFO] ZIM Entry Count: 4334
[ERROR] Files with name too long: 692
[INFO] Empty Content Count: 2
[INFO] Redirect Count: 1
Converting | |█████████████|%100.0 Time: 0:00:02

Owner

ilius commented Jan 10, 2022

I tested with Aard2 Web (in desktop browser).
Images are not shown, but text is shown correctly.

Can you open the epub on mobile and search for the drug name?
It's also got a nice list of drugs you can use to look up.
(But you have to keep going back to that page I guess, unless the reader app has a Back button!)

essential-drugs-epub-calibre-index

Owner

ilius commented Jan 10, 2022

Reading the epub is definitely possible.
Each entry seems to be a separate html file.
I will try to do it.
But I'm not sure I can include it in PyGlossary, since it's too specific (one entry per html file).
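
Editor's note: the "one entry per html file" idea can be sketched with the standard library alone, since an EPUB is a ZIP archive. This is a minimal illustration under that assumption, not the plugin that was later shared in this thread; `iter_epub_html` is a hypothetical name, and files are visited in name order rather than in the book's reading order.

```python
import zipfile
from pathlib import PurePosixPath

# File extensions treated as article pages; an assumption for this sketch.
HTML_SUFFIXES = {".html", ".xhtml", ".htm"}


def iter_epub_html(epub_path):
    """Yield (member_name, raw_bytes) for each HTML file in an EPUB archive."""
    with zipfile.ZipFile(epub_path) as archive:
        # Sorted by name for determinism; the OPF spine order is ignored here.
        for name in sorted(archive.namelist()):
            if PurePosixPath(name).suffix.lower() in HTML_SUFFIXES:
                yield name, archive.read(name)
```

Each yielded file would then become one dictionary entry, which is why this approach only suits books that keep a single topic per HTML file.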

Author

sobaee commented Jan 10, 2022

When converting ZIM to MDX, the entries still have no definitions in BlueDict.

Screenshot_20220110152319

I get multiple errors during conversion:
Traceback (most recent call last):
  File "/storage/emulated/0/pyglossary-master/pyglossary/entry.py", line 85, in save
    with open(fpath, "wb") as toFile:
PermissionError: [Errno 1] Operation not permitted: '/storage/emulated/0/pyglossary-master/essential-drugs.mtxt_res/medicalguidelines.msf.org/s/e8c3fbfc487e50239343e141213e915a-CDN/-qljuxx/8402/45c55aec607bd3c0b24eb377ecd790d998a06033/e05c0ca06e5a38e49b9110818c14a22e/_/download/contextbatch/css/viewcontent,-_super/batch.css?highlightactions=true'
[ERROR] error while saving /storage/emulated/0/pyglossary-master/essential-drugs.mtxt_res/update.googleapis.com/service/update2/json?cup2key=10:2622198263&cup2hreq=0338fb5b5cb30f0e5182132d1ff9620dec7fccf4020a155becf04a6b4c2d247a
Traceback (most recent call last):
  File "/storage/emulated/0/pyglossary-master/pyglossary/entry.py", line 85, in save
    with open(fpath, "wb") as toFile:
PermissionError: [Errno 1] Operation not permitted: '/storage/emulated/0/pyglossary-master/essential-drugs.mtxt_res/update.googleapis.com/service/update2/json?cup2key=10:2622198263&cup2hreq=0338fb5b5cb30f0e5182132d1ff9620dec7fccf4020a155becf04a6b4c2d247a'
Traceback (most recent call last):
  File "/storage/emulated/0/pyglossary-master/pyglossary/entry.py", line 85, in save
    with open(fpath, "wb") as toFile:
PermissionError: [Errno 1] Operation not permitted: '/storage/emulated/0/pyglossary-master/essential-drugs.mtxt_res/update.googleapis.com/service/update2/json?cup2key=10:2622198263&cup2hreq=0338fb5b5cb30f0e5182132d1ff9620dec7fccf4020a155becf04a6b4c2d247a'
[INFO] ZIM Entry Count: 4334
[ERROR] Files with name too long: 692
[INFO] Empty Content Count: 2
[INFO] Redirect Count: 1
Converting | |█████████████|%100.0 Time: 0:00:17

Is there any possibility to get the definitions to appear?
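
Editor's note: the `PermissionError` paths above all contain URL query strings (`?`, `&`, `:`), and the log also reports 692 file names that are too long. One possible cause, stated here as an assumption, is that the Android shared storage filesystem rejects those characters. A hedged sketch of a workaround is to sanitize each path component before writing the resource file. The helper names and the 255-byte limit below are illustrative and are not taken from PyGlossary.

```python
import hashlib
import re

# Common per-component limit on many filesystems; assumed, not universal.
MAX_NAME_BYTES = 255
# Characters that FAT-like filesystems typically refuse in file names.
UNSAFE = re.compile(r'[?&:*"<>|\\]')


def safe_component(name: str) -> str:
    """Make one path component safe to use as a file name."""
    cleaned = UNSAFE.sub("_", name)
    encoded = cleaned.encode("utf-8")
    if len(encoded) <= MAX_NAME_BYTES:
        return cleaned
    # Too long: truncate and append a short hash so names stay unique.
    digest = hashlib.sha1(encoded).hexdigest()[:16]
    keep = encoded[: MAX_NAME_BYTES - len(digest) - 1].decode("utf-8", "ignore")
    return f"{keep}-{digest}"


def safe_resource_path(url_path: str) -> str:
    """Sanitize every component of a URL-like relative path."""
    parts = [p for p in url_path.split("/") if p not in ("", ".", "..")]
    return "/".join(safe_component(p) for p in parts)
```

Note that renaming a resource this way also requires rewriting the links that point to it inside the converted HTML, which this sketch does not attempt.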

Author

sobaee commented Jan 10, 2022

Is there any possibility of getting the produced dictionary (mdx, slob, or ifo) to work on mobile and on desktop, just like the wikipedia.zim files that were converted with PyGlossary before?

Is there any progress on the .epub conversion?

Please consider this when you have time.

Thank you Saeed

ilius added a commit that referenced this issue Jan 10, 2022
Author

sobaee commented Jan 10, 2022

I appreciate that Saeed

Thanks a lot 🙏

This epub plugin worked well with essential-drugs.epub, but it didn't work with other epub files from the same source, like this one:

https://medicalguidelines.msf.org/msf-books-hosting/16686604-English.epub

Or this:
https://medicalguidelines.msf.org/msf-books-hosting/51415817-english.epub

The error:
[ERROR] Exception while calling plugin's write function
Traceback (most recent call last):
  File "/storage/emulated/0/pyglossary-master/pyglossary/glossary.py", line 1214, in write
    for entry in self:
  File "/storage/emulated/0/pyglossary-master/pyglossary/glossary.py", line 406, in _readersEntryGen
    for index, entry in enumerate(self._applyEntryFiltersGen(reader)):
  File "/storage/emulated/0/pyglossary-master/pyglossary/glossary.py", line 417, in _applyEntryFiltersGen
    for index, entry in enumerate(gen):
  File "/storage/emulated/0/pyglossary-master/pyglossary/plugins/epub_ungrouped.py", line 77, in __iter__
    title = doc.find(".//title").text
AttributeError: 'NoneType' object has no attribute 'text'
Traceback (most recent call last):
  File "/storage/emulated/0/pyglossary-master/pyglossary/glossary.py", line 1214, in write
    for entry in self:
  File "/storage/emulated/0/pyglossary-master/pyglossary/glossary.py", line 406, in _readersEntryGen
    for index, entry in enumerate(self._applyEntryFiltersGen(reader)):
  File "/storage/emulated/0/pyglossary-master/pyglossary/glossary.py", line 417, in _applyEntryFiltersGen
    for index, entry in enumerate(gen):
  File "/storage/emulated/0/pyglossary-master/pyglossary/plugins/epub_ungrouped.py", line 77, in __iter__
    title = doc.find(".//title").text
AttributeError: 'NoneType' object has no attribute 'text'
[ERROR] Writing file 'clinical-guidelines.txt' failed.

If this plugin has to be adapted file by file, please tell me what to change in the plugin code to make it suitable for other epub files 🙏

I know this could be a lot of work, but if you get this plugin to work with all epub files, it will open up the possibility of converting many more dictionaries.
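
Editor's note: the `AttributeError` in the traceback above comes from `doc.find(".//title")` returning `None` when a page has no `<title>` element, so `.text` fails. A minimal sketch of a more tolerant lookup follows. It uses the standard library's `xml.etree.ElementTree` rather than whatever parser the plugin actually uses, and the fallback order (title, then first heading, then file name) is an assumption, as is the function name `entry_title`.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath


def _local(tag):
    """Strip an XML namespace such as '{http://www.w3.org/1999/xhtml}'."""
    return tag.rsplit("}", 1)[-1] if isinstance(tag, str) else ""


def entry_title(xhtml: bytes, member_name: str) -> str:
    """Pick a headword: <title>, else first <h1>/<h2>, else the file stem."""
    try:
        root = ET.fromstring(xhtml)
    except ET.ParseError:
        root = None  # not well-formed XML; fall through to the file name
    if root is not None:
        for wanted in ("title", "h1", "h2"):
            for elem in root.iter():
                if _local(elem.tag) == wanted:
                    text = "".join(elem.itertext()).strip()
                    if text:
                        return text
    return PurePosixPath(member_name).stem
```

With a fallback like this a page without a title still yields an entry instead of aborting the whole conversion, although the file-stem headword may not be meaningful for every book.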

ilius added a commit that referenced this issue Jan 10, 2022
Owner

ilius commented Jan 10, 2022

Updated https://gist.github.com/ilius/b5a4cbec5a81ff77557f4a54e7221692

Author

sobaee commented Jan 11, 2022

> Updated https://gist.github.com/ilius/b5a4cbec5a81ff77557f4a54e7221692

Perfect
Thank you man 👍👍

Owner

ilius commented Jan 11, 2022

No worries.

I'd like to know, if you test it later, whether it works with epubs from other sources as well (not generated by PyGlossary).

Author

sobaee commented Jan 11, 2022

> No worries.
>
> I'd like to know if you later test it and works with epubs from other sources as well (not generated by PyGlossary).

Looks like we need an epub file that has a separate html file for each entry for this to work. I tried it with a more complicated epub file that was converted from a PDF, where the entries are the PDF's outline (TOC) items, but this one didn't show any entries after conversion.

I will try with more original epub books (not converted or manipulated)

Thanks

ilius added Q&A and removed Improvement labels Jan 21, 2022
ilius closed this as completed Feb 5, 2022