Read Wikimedia Dumps #48

Closed
wanglongqi opened this issue Apr 24, 2016 · 24 comments

@wanglongqi

Hi @ilius,

Thanks for developing this wonderful package! I would like to ask whether this package can support Wikimedia dumps, or whether there is any plan to do so. I think a wiki dump is a kind of glossary, so it could fit within the scope of this project.

Regards,

Longqi

@ilius
Owner

ilius commented Apr 24, 2016

Hi
You mean read support, right?
I will consider doing it after we have fully migrated to Python 3.
Thanks

@ilius ilius self-assigned this Apr 24, 2016
@wanglongqi
Author

wanglongqi commented Apr 25, 2016

@ilius Yes, absolutely. Read support is good enough. 👍

@ilius ilius changed the title from "Is it possible to support wikidumps?" to "Read Wikimedia Dumps" Apr 25, 2016
@ARezaK

ARezaK commented Jul 12, 2016

any update on this?

@ilius
Owner

ilius commented Oct 28, 2016

This is my incomplete (but kind of working) code
https://github.com/ilius/pyglossary/blob/next/pyglossary/plugins/wikipedia_dump.py

I'm not sure I will have enough time to finish it.

@ilius ilius removed their assignment Oct 28, 2016
@wanglongqi
Author

Well done. Thanks!

@chopinesque

Does this read from HTML dumps only? It seems that static HTML dumps have been abandoned.

@ilius
Owner

ilius commented Dec 17, 2018

Yes, unfortunately, looks that way

@ilius
Owner

ilius commented Jun 24, 2019

For now you may want to use slob dictionaries for Wikipedia:
(PyGlossary supports slob now)
https://github.com/itkach/slob/wiki/Dictionaries
For some languages they are quite outdated, though.

@ilius
Owner

ilius commented Jul 20, 2020

I tried to implement this.
Unfortunately, I couldn't get any rendering library to work and convert a full wiki page to HTML (which is what we expect, and what most dictionary applications understand).

Also, Wikipedia is too big to be one glossary, even for less popular languages (the XML for Persian is 5.6 gigabytes!).
For converting to some formats (like StarDict, EPUB, Mobi, Kobo, DictionaryForMIDs) we have to load the entire glossary into memory, because they need sorting!

I'm gonna give up on this sadly!

Wiktionary on the other hand, should be doable.
You can use wikdict-gen tool to convert it to FreeDict, then use PyGlossary to convert FreeDict to your destination format.
I may try to implement reading directly from Wiktionary dbnary.

There are already a lot of glossaries converted from Wiktionary here:
https://freedict.org/downloads/
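
The second step of that pipeline (FreeDict to a destination format) could be scripted roughly as below. This is only a sketch: it assumes the `pyglossary` command is installed and on `PATH`, that it picks the formats from the file extensions (`.tei` for FreeDict, `.ifo` for StarDict), and the file names are hypothetical placeholders. The wikdict-gen step is not shown.

```python
import shutil
import subprocess


def build_convert_command(input_path, output_path):
    """Build the argv for `pyglossary <input> <output>`.

    Assumption: formats are detected from the file extensions.
    """
    return ["pyglossary", input_path, output_path]


def convert(input_path, output_path):
    """Run the conversion, failing early if pyglossary is not installed."""
    cmd = build_convert_command(input_path, output_path)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("pyglossary executable not found on PATH")
    # check=True raises CalledProcessError if the conversion fails.
    subprocess.run(cmd, check=True)
```

For example, `convert("eng-deu.tei", "eng-deu.ifo")` would ask PyGlossary to turn a (hypothetical) FreeDict file into StarDict.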

@ilius ilius closed this as completed Jul 20, 2020
@ilius ilius reopened this Jul 21, 2020
@ilius
Owner

ilius commented Jul 29, 2020

I added basic support for Wiktionary dumps.
I will keep improving it.
You can also test it.

@sobaee

sobaee commented Aug 2, 2020

Hello ilius and all developers and users.

I have a clarification related to wikidumps:

The Wikimedia dumps site offers multiple files for each Wiktionary.
For PyGlossary to be able to convert the XML file to other dictionary formats (e.g. ifo), the name of the file you download should contain both "pages" and "xml".

The site for arwiktionary as an example:
https://dumps.wikimedia.org/arwiktionary/latest

I converted these files to txt after extraction:
arwiktionary-latest-pages-articles.xml and arwiktionary-latest-pages-meta-current.xml, and everything worked perfectly.

I only found this out after trying many of the other files on the dumps site without success.

The converted file is good for me because it contains the basic translation for any word, unlike the converted slob files, which contain many unusable details.

One small issue is that the produced file contains some unusable headwords like "category: something..", but for now it's an amazing solution.

Thanks to ilius for updating and adding the wikidump plugin yesterday.

Amazing work 👍.

@ilius
Owner

ilius commented Aug 2, 2020

I think you should download pages-articles.xml
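
For anyone who wants to look inside a pages-articles.xml file, here is a minimal sketch of streaming it page by page with the standard library. It is not the plugin's code. It assumes the usual MediaWiki export layout (`<page>` elements containing `<title>`, `<ns>` and `<revision><text>`), and `iter_entries` is a hypothetical helper name. Keeping only namespace 0 is one way to drop the "Category:"-style headwords mentioned above.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that the export XML puts on tags."""
    return tag.rsplit("}", 1)[-1]


def iter_entries(source):
    """Yield (title, wikitext) for main-namespace pages, one page at a time.

    `source` may be a file name or a binary file object.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, ns, text = "", "0", ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text or ""
            elif name == "ns":
                ns = (child.text or "0").strip()
            elif name == "text":
                text = child.text or ""
        # Release the page's subtree so a multi-gigabyte dump is not kept in memory.
        elem.clear()
        # Namespace 0 holds the articles; category and template pages use other numbers.
        if ns == "0":
            yield title, text
```

Usage would look like `for title, text in iter_entries("arwiktionary-latest-pages-articles.xml"): ...`. The yielded text is still raw wiki markup.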

@sobaee

sobaee commented Aug 2, 2020

Does this work with small Wikipedias too, or only with Wiktionaries?

@ilius
Owner

ilius commented Aug 2, 2020

Let's focus on Wiktionary for now.

@sobaee

sobaee commented Aug 4, 2020

I have to add something:

Today I converted the Wikimedia dump XML for "enwiktionary" to txt, ifo and slob, and used the converted file to look up some words (e.g. "google"). Unfortunately, I found that the definitions contain a lot of unintelligible markup.
This is part of the entry for the word "google":

Verb
Template: wikipedia|google (verb)
Template: en-verb|googl
⚫︎ lb|en|transitive
l|en|To search for (something) on the Internet using the Google search engine.

⚫︎ : ''Tom '''googles''' all of his prospective girlfriends.''
⚫︎ * quote-web|en|author=Larry Page|authorlink=Larry Page|title=Google Search Engine: New Features|work=eGroups|url=http://www.egroups.com/group/google-friends/3.html|archiveurl=https://web.archive.org/web/19991009052012/www.egroups.com/group/google-friends/3.html|archivedate=9 October 1999|publisher=Google Friends Mailing List|date=8 July 1998|accessdate=6 August 2007|passage="Have fun and keep '''googling'''!"

⚫︎ * {{quote-av|en|title=w|Buffy the Vampire Slayer
|episode=w:Help (Buffy episode)|Help|season=7|number=4|people=w|Alyson Hannigan
and w|Nicholas Brendon
|role=w|Willow Rosenberg
and w|Xander Harris
|date=15 October 2002|passage=''Willow'': Have you '''googled''' her yet?
''Xander'': Willow! She's 17!
''Willow'': It's a search engine.}}
⚫︎ * {{quote-av|en|title=w|Maid In Manhattan
|people=w|Jennifer Lopez
|role=Marisa|date=13 December 2002|passage="'''Google''' it."}}
⚫︎ * {{quote-journal|en|author=Bill Keller|authorlink=Bill Keller|title=Who's Sorry Now?|url=http://www.nytimes.com/2002/12/28/opinion/28KELL.html|newspaper=w|The New York Times
|issn=0362-4331|page=A-19|date=28 December 2002|accessdate=24 June 2007|passage='''Googling''' in search of an apology from the former Enron C.E.O. ..
End

This is really frustrating. I just need to see a clear, correct translation of each word instead of this!

On the other hand, the translations in the slob files I converted before were well organized (but unfortunately some important dumps there are outdated and some are not included).

@sobaee

sobaee commented Aug 4, 2020

Hi @ilius

I think something is wrong either with the dumps themselves, with the file I downloaded from there (pages-articles; maybe one file is not enough to build a complete, clear dictionary), or with the conversion tools. I really don't know.
I hope you can help.

@ilius
Owner

ilius commented Aug 4, 2020

Yes, templates are not rendered.
For now, it's just regexp matching that converts basic wiki markup to HTML.
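
To illustrate what regexp-only conversion means, here is a minimal sketch. It is not the plugin's actual code, and the rule set and the `wikitext_to_html` name are assumptions for illustration. It handles bold, italics and plain links, and leaves `{{...}}` templates untouched, which is why they show up raw in the output quoted above.

```python
import html
import re

# Order matters: bold (''') must be handled before italics ('').
_RULES = [
    (re.compile(r"'''(.+?)'''"), r"<b>\1</b>"),
    (re.compile(r"''(.+?)''"), r"<i>\1</i>"),
    # [[target|label]] -> label, [[target]] -> target
    (re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]+)\]\]"), r"\1"),
]


def wikitext_to_html(text):
    """Convert a few basic wiki markup constructs to HTML.

    Templates ({{...}}) are not expanded; they pass through unchanged.
    """
    out = html.escape(text, quote=False)  # escape <, > and & from the source
    for pattern, repl in _RULES:
        out = pattern.sub(repl, out)
    return out
```

Rendering templates properly would need the template definitions themselves and an engine that can evaluate them, which is beyond what pattern matching can do.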

@sobaee

sobaee commented Aug 4, 2020

Is there any near-term solution for this?

@sobaee

sobaee commented Aug 18, 2020

Hi @ilius
I think converting ZIM files could replace converting from Wikipedia dumps.

@BoboTiG
Contributor

BoboTiG commented Dec 17, 2020

There is a project that uses Wiktionary as input data: eBook Reader Dictionaries. We are actively working on generating more output files using pyglossary, see BoboTiG/ebook-reader-dict#409. For now, only Kobo is supported.
I hope it will help :)

@BoboTiG
Contributor

BoboTiG commented Dec 17, 2020

We are now generating a .df file, so hypothetically, pyglossary could do the conversion to almost any format.

@ilius
Owner

ilius commented Mar 2, 2022

So we have two alternatives:

  • Zim

    • ➕ Very good output for desktop.
    • ➕ Works for all languages / locales.
    • ➕ Works for all Wikimedia websites.
    • ➖ libzim requires compilation on most platforms, which is hard for the average user.
    • ➖ Cannot use the official .xml snapshot.
  • ebook-reader-dict

    • ➕ Very good output for both desktop and e-book readers.
    • ➖ Locale-specific, supports only 10 languages.
    • ➖ Only works for Wiktionary.
    • ➕ Offers daily builds (StarDict and .df)
    • ➕ Can use official .xml snapshot.
      • ➖ Involves several command-line steps, which may be hard for the average user.
      • ➖ Resource-intensive (consumes 4 GB of RAM, lots of CPU usage)

Both solutions are much better than anything that I can include in PyGlossary anytime soon.

So I'm going to remove the plugin and close this issue.

@ilius ilius closed this as completed Mar 2, 2022
@raffaem

raffaem commented Apr 23, 2022

How can ZIM be used to read Wiktionary dumps?

@ilius
Owner

ilius commented Apr 28, 2022

You need to download ZIM files instead of Wiktionary dumps.
