Indexing wikipedia zim - once and for all #914

Closed
ghost opened this issue Oct 23, 2017 · 10 comments

@ghost

ghost commented Oct 23, 2017

Previously discussed: #546 #680 #763

I'd like to see wikipedia zim files work out of the box with goldendict and I'm willing to do the work.
The solution (1) I have now is to:

  • disable FTS for zim files above ? GB
  • Modify the addWord() logic to index only the full (folded) title and not to generate additional entries on word boundaries - if the article count is large enough.
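To make the second bullet concrete, here is a minimal Python sketch of the two indexing modes. It is not GoldenDict's actual addWord() code: the `fold` helper is a simplified stand-in for the real folding, and `article_limit` is a hypothetical parameter name for the "large enough" threshold.

```python
def fold(title: str) -> str:
    # Simplified stand-in for folding (assumption): case-fold and collapse whitespace.
    return " ".join(title.casefold().split())


def index_keys(title: str, expand: bool) -> list[str]:
    """Return the index keys generated for one article title."""
    folded = fold(title)
    if not expand:
        # Reduced mode: only the full folded title is indexed.
        return [folded]
    # Expanded mode: one extra key starting at every word boundary.
    words = folded.split(" ")
    return [" ".join(words[i:]) for i in range(len(words))]


def count_entries(titles: list[str], article_limit: int) -> int:
    """Count index entries; expansion is skipped once the file exceeds article_limit titles."""
    expand = len(titles) <= article_limit
    return sum(len(index_keys(t, expand)) for t in titles)


titles = ["Alice in Wonderland", "The Terminator", "Zim"]
print(index_keys(titles[0], expand=True))
print(count_entries(titles, article_limit=10))  # small file: expanded keys
print(count_entries(titles, article_limit=2))   # "large" file: full titles only
```

With expansion, the number of entries grows with the number of words per title, which is where the index size and memory cost for a Wikipedia-scale file would come from.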

How about it, @Abs62? I can send a PR for (1) soon if you're interested.

@darnn

darnn commented Oct 23, 2017

If you do move forward on this, please consider adding the option to create a Wikipedia-based dictionary from dumps, i.e. something that would enable you to input, say, "The Terminator" and have GD output "Терминатор (фильм)" (if you've selected English->Russian).

@Abs62
Member

Abs62 commented Oct 24, 2017

wikipedia_en_all_nopic_2017-08.zim, 64-bit Qt5-based GD - 8 minutes for indexing, 720 MB index file. Up to 7 GB memory consumption while indexing.

What criterion do you suggest for reduced headword handling?

@data-man
Contributor

IMO, the best solution would be to use the libzim library, which has built-in support for the Xapian library with advanced full-text search capabilities.

Abs62 added a commit that referenced this issue Oct 25, 2017
@Abs62
Member

Abs62 commented Oct 25, 2017

I have added a parameter to the config file.

@ghost
Author

ghost commented Oct 25, 2017

Thanks. The problem is still there because:

  • FTS indexing is on by default for zim -> OOM for fresh users (FTS should have size limit too? some smaller zim files could benefit from FTS)
  • The default value of maxHeadwordsToExpand is 0, so headword expansion is now turned off for ALL dicts (IIUC). The default should probably be large enough to apply only to Wikipedia-scale files.

@Abs62
Member

Abs62 commented Oct 25, 2017

> FTS indexing is on by default for zim -> OOM for fresh users (FTS should have size limit too? some smaller zim files could benefit from FTS)

[image: fts_limit]

> default value set for maxHeadwordsToExpand is 0, so headword expansion is now turned off for ALL dicts (IIUC). default should probably be large enough to only apply to wikipedia-scale files.

No. The limit should be turned on only if GD can't index the dictionary without one. 64-bit GD with 8+ GB of RAM is not a rare configuration nowadays.

@ghost
Author

ghost commented Oct 25, 2017

Fair enough. Thanks for fixing! 👍

Future Visitors:
If your machine doesn't have enough memory to complete the indexing of wikipedia_nopic_201?.zim, use the following workaround:

  • Set the limit indicated in the image above. A value somewhere between 2000000 and 10000000 should work. This disables FTS indexing for very large files.
  • Open your ~/.goldendict/config file and add the key:
 <maxHeadwordsToExpand>2000000</maxHeadwordsToExpand>

Anything between 2M and 10M should work. This makes title indexing less extensive for files with more than 2M articles. For example, with reduced indexing, a search for "Alice" still turns up "Alice in wonderland", but a search for "wonderland" does not. With full indexing, "wonderland" would also match the "Alice in wonderland" article.
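The difference in search behavior can be reproduced with a small Python sketch. This is an illustration under stated assumptions, not GoldenDict's implementation: a sorted list with `bisect` stands in for the real index, `build_index` and `prefix_search` are hypothetical helpers, and treating a limit of 0 as "no limit" is an assumption.

```python
from bisect import bisect_left


def build_index(titles, max_headwords_to_expand=0):
    """Build a sorted list of (key, title) pairs.

    Assumption: 0 means "no limit"; otherwise expansion is skipped
    once the number of titles exceeds the limit.
    """
    expand = max_headwords_to_expand == 0 or len(titles) <= max_headwords_to_expand
    entries = []
    for title in titles:
        words = title.casefold().split()
        # Expanded: a key at every word boundary. Reduced: full title only.
        starts = range(len(words)) if expand else range(min(1, len(words)))
        for i in starts:
            entries.append((" ".join(words[i:]), title))
    entries.sort()
    return entries


def prefix_search(index, query):
    """Return titles having an index key that starts with the query."""
    q = " ".join(query.casefold().split())
    pos = bisect_left(index, (q,))
    hits = []
    while pos < len(index) and index[pos][0].startswith(q):
        if index[pos][1] not in hits:
            hits.append(index[pos][1])
        pos += 1
    return hits


titles = ["Alice in Wonderland", "Alice Cooper", "Wonderland (film)"]
full = build_index(titles)                                # full indexing
reduced = build_index(titles, max_headwords_to_expand=2)  # 3 titles > limit of 2

print(prefix_search(full, "wonderland"))     # includes "Alice in Wonderland"
print(prefix_search(reduced, "wonderland"))  # no longer includes it
print(prefix_search(reduced, "alice"))       # titles starting with "alice" still match
```

The limit of 2 is only there so the toy data set crosses the threshold; on a real Wikipedia zim the value would be in the millions, as suggested above.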

@ghost ghost closed this as completed Oct 25, 2017
@jjzz
Contributor

jjzz commented Dec 12, 2017

@darnn
I just put wikipair online https://github.com/jjzz/wikipair
I guess that's exactly what you want.

@darnn

darnn commented Dec 22, 2017

@jjzz That is, in fact, exactly what I want! I'm clueless when it comes to Linux, but I managed to run it as far as getting the txt and the dsl file, after which it gave an error on the dos2unix command. I copied over the files and changed the linebreak style to Windows, and turned on BOM, and Windows Goldendict now reads it! So, thank you so much!
