Indexing wikipedia zim - once and for all #914

ghost · 2017-10-23T20:59:08Z

Previously discussed: #546 #680 #763

I'd like to see wikipedia zim files work out of the box with goldendict and I'm willing to do the work.
The solution (1) I have now is to:

Disable FTS for zim files above ? GB.
Modify the addWord() logic to index only the full (folded) title, without generating additional entries on word boundaries, if the article count is large enough.

How about it, @Abs62? I can send a PR for (1) soon if you're interested.

darnn · 2017-10-23T21:29:04Z

If you do move forward on this, please consider adding the option to create a Wikipedia-based dictionary from dumps, i.e. something that would enable you to input, say, "The Terminator" and have GD output "Терминатор (фильм)" (if you've selected English->Russian).

Abs62 · 2017-10-24T19:14:03Z

wikipedia_en_all_nopic_2017-08.zim, 64-bit Qt5-based GD - 8 minutes for indexing, 720 MB index file. Up to 7 GB memory consumption while indexing.

What criterion do you suggest for reduced headword handling?

data-man · 2017-10-24T19:37:42Z

IMO, the best solution would be to use the libzim library, which has built-in support for the Xapian library with advanced full-text search capabilities.

Abs62 · 2017-10-25T15:53:00Z

I have added a parameter to the config file.

ghost · 2017-10-25T17:14:49Z

Thanks. The problem is still there because:

FTS indexing is on by default for zim -> OOM for fresh users. (Should FTS have a size limit too? Some smaller zim files could benefit from FTS.)
The default value set for maxHeadwordsToExpand is 0, so headword expansion is now turned off for ALL dicts (IIUC). The default should probably be large enough to apply only to wikipedia-scale files.

Abs62 · 2017-10-25T17:59:33Z

FTS indexing is on by default for zim -> OOM for fresh users. (Should FTS have a size limit too? Some smaller zim files could benefit from FTS.)

The default value set for maxHeadwordsToExpand is 0, so headword expansion is now turned off for ALL dicts (IIUC). The default should probably be large enough to apply only to wikipedia-scale files.

No. The limit should be turned on only if GD can't index a dictionary without one. 64-bit GD with 8+ GB RAM is not a rare configuration nowadays.

ghost · 2017-10-25T18:42:53Z

Fair enough. Thanks for fixing! 👍

Future Visitors:
If your machine doesn't have enough memory to complete the indexing of wikipedia_nopic_201?.zim, use the following workaround:

Set the limit indicated in the image above. Somewhere between 2000000 and 10000000 should work. This eliminates FTS indexing for very large files.
Open your ~/.goldendict/config file and add the key:

 <maxHeadwordsToExpand>2000000</maxHeadwordsToExpand>

Anything between 2M and 10M should work. This makes title indexing less extensive for files with more than 2M articles. For example, with reduced indexing, searching for "Alice" still turns up "Alice in wonderland", but searching for "wonderland" does not. With full indexing, "wonderland" would also match the "Alice in wonderland" article.
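
To make the difference between full and reduced title indexing concrete, here is a minimal Python sketch. It is not GoldenDict's actual addWord() code (which is C++). The function names, the simplified folding, and the reading that a value of 0 means "no limit" are all assumptions made for illustration.

```python
from __future__ import annotations


def fold(title: str) -> str:
    # Simplified stand-in for GoldenDict's folding (assumption):
    # case-fold and collapse runs of whitespace.
    return " ".join(title.casefold().split())


def index_keys(title: str, article_count: int,
               max_headwords_to_expand: int = 0) -> list[str]:
    """Return the keys under which one article title would be indexed.

    Assumption: 0 means "no limit", so expansion always happens.
    """
    folded = fold(title)
    words = folded.split(" ")
    limit_active = (max_headwords_to_expand > 0
                    and article_count > max_headwords_to_expand)
    if limit_active:
        # Reduced indexing: only the full folded title.
        return [folded]
    # Full indexing: one extra key starting at every word boundary.
    return [" ".join(words[i:]) for i in range(len(words))]


def matches(query: str, keys: list[str]) -> bool:
    # Prefix lookup, as a headword search box would do.
    q = fold(query)
    return any(key.startswith(q) for key in keys)


# Hypothetical article counts, chosen only to fall on either side of the limit.
reduced = index_keys("Alice in wonderland", 5_000_000, 2_000_000)
full = index_keys("Alice in wonderland", 1_000, 2_000_000)

print(reduced)  # ['alice in wonderland']
print(full)     # ['alice in wonderland', 'in wonderland', 'wonderland']
print(matches("wonderland", reduced), matches("wonderland", full))  # False True
```

With expansion, a three-word title produces three index entries instead of one, which is why skipping it for dictionaries with millions of articles cuts both index size and memory use.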

jjzz · 2017-12-12T08:37:15Z

@darnn
I just put wikipair online https://github.com/jjzz/wikipair
I guess that's exactly what you want.

darnn · 2017-12-22T13:20:05Z

@jjzz That is, in fact, exactly what I want! I'm clueless when it comes to Linux, but I managed to run it as far as getting the txt and the dsl file, after which it gave an error on the dos2unix command. I copied over the files and changed the linebreak style to Windows, and turned on BOM, and Windows Goldendict now reads it! So, thank you so much!

xiaoyifang · 2022-10-02T05:37:55Z

xiaoyifang/goldendict-ng@7a2ec4f

Abs62 added a commit that referenced this issue Oct 25, 2017

Add config file parameter to limit headwords number to expand multi-word headwords while indexing (issie #914)

0b6f364

ghost closed this as completed Oct 25, 2017

This was referenced Mar 20, 2022

Failure to display images on .zim files xiaoyifang/goldendict-ng#27

Closed

possibility to use libzim to parse the zim dictionary xiaoyifang/goldendict-ng#30

Closed

This issue was closed.
