Read Wikimedia Dumps #48

wanglongqi · 2016-04-24T07:54:07Z

Thanks for developing this wonderful package! I would like to ask whether this package supports, or has any plan to support, Wikimedia dumps. I think a wiki dump is a kind of glossary and could fit into the scope of this project.

Regards,

Longqi

ilius · 2016-04-24T20:18:07Z

Hi
You mean read support, right?
I will consider doing it after we have fully migrated to Python 3.
Thanks

wanglongqi · 2016-04-25T02:01:14Z

@ilius Yes, absolutely. Read support is good enough. 👍

ARezaK · 2016-07-12T01:10:03Z

any update on this?

ilius · 2016-10-28T14:53:17Z

This is my incomplete (but kind of working) code
https://github.com/ilius/pyglossary/blob/next/pyglossary/plugins/wikipedia_dump.py

I'm not sure I would have enough time to finish it

wanglongqi · 2016-11-16T04:23:56Z

Well done. Thanks!

chopinesque · 2018-12-16T10:43:28Z

Does this read from html dump only? Seems like static html dumps are abandoned.

ilius · 2018-12-17T08:29:47Z

Yes, unfortunately, looks that way

ilius · 2019-06-24T02:24:16Z

For now you may want to use slob dictionaries for Wikipedia:
(PyGlossary supports slob now)
https://github.com/itkach/slob/wiki/Dictionaries
For some languages it's too outdated though

ilius · 2020-07-20T22:57:48Z

I tried to implement this.
Unfortunately, I couldn't get any rendering library to work and convert a full wiki page to HTML (which is what we expect and what most dictionary applications understand).

Also Wikipedia is too big to be one glossary, even for not so popular languages (the size of xml for Persian is 5.6 Gigabytes!)
For converting to some formats (like StarDict, EPUB, Mobi, Kobo, DictionaryForMIDs) we have to load the entire glossary into memory, because they need sorting!

I'm gonna give up on this sadly!

Wiktionary on the other hand, should be doable.
You can use wikdict-gen tool to convert it to FreeDict, then use PyGlossary to convert FreeDict to your destination format.
I may try to implement reading directly from Wiktionary dbnary.

There are already a lot of glossaries converted from Wiktionary here:
https://freedict.org/downloads/
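
On the size concern above: a minimal sketch, assuming the dump is a standard MediaWiki XML export, of how the reading side could be streamed page by page with the standard library instead of loading the whole file. This is not the plugin's code, and it does not help with output formats that need the whole glossary sorted in memory. The names `iter_pages`, `local_name` and `SAMPLE` are made up for this illustration.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    # "{http://www.mediawiki.org/xml/export-0.10/}page" -> "page"
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) pairs one page at a time."""
    title = None
    text = None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title = text = None
            # Drop the page's children so they can be garbage-collected.
            # The emptied <page> element itself stays attached to the root.
            elem.clear()


# Tiny hand-written stand-in for a real dump (assumption, not real data).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>google</title><ns>0</ns>
    <revision><text>'''google''' (verb)</text></revision></page>
  <page><title>Category:English verbs</title><ns>14</ns>
    <revision><text>...</text></revision></page>
</mediawiki>"""

for page_title, wikitext in iter_pages(io.BytesIO(SAMPLE)):
    print(page_title, len(wikitext))
```

For a real dump, `source` would be an opened file (for example via `bz2.open` for a `.bz2` download) instead of `io.BytesIO`.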

ilius · 2020-07-29T12:48:14Z

I added basic support for Wiktionary dumps.
I will improve it over time.
You can also test it.

sobaee · 2020-08-02T14:17:19Z

Hello ilius and all developers and users.

I have a clarification related to wikidumps:

There are multiple files for each Wiktionary on the Wikimedia dumps site.
The XML file you download from this site must be one that PyGlossary can convert to other dictionary formats (e.g. ifo); its name should contain the words "pages" and "xml".

The site for arwiktionary as an example:
https://dumps.wikimedia.org/arwiktionary/latest

I converted these two files to txt after extraction:
arwiktionary-latest-pages-articles.xml and arwiktionary-latest-pages-meta-current.xml, and everything worked perfectly.

I only found this out after trying many of the other files on this dump site without success.

The converted file is good for me because it contains the basic translation for any word, unlike the converted slob files, which contain many unusable details.

One small issue: the produced file contains some unusable headwords like "category: something..", but for now it's an amazing solution.

Thanks to ilius for updating and adding the wikidump plugin yesterday.

Amazing work 👍.
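
Regarding the unusable "category: ..." headwords: a minimal sketch, assuming the standard MediaWiki export layout in which every `<page>` has an `<ns>` element (0 for regular entries, 14 for categories), of how such pages could be skipped. This is an illustration only, not what the plugin does; `main_namespace_titles` and `SAMPLE` are hypothetical names, and the `{*}` wildcard needs Python 3.8 or newer.

```python
import xml.etree.ElementTree as ET


def main_namespace_titles(xml_bytes):
    """Return titles of pages in namespace 0 (regular entries) only."""
    root = ET.fromstring(xml_bytes)
    titles = []
    # "{*}" matches the element in any XML namespace, so the exact
    # export schema version in the dump does not matter.
    for page in root.findall("{*}page"):
        if page.findtext("{*}ns") == "0":
            titles.append(page.findtext("{*}title"))
    return titles


# Tiny hand-written stand-in for a real dump (assumption, not real data).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>google</title><ns>0</ns></page>
  <page><title>Category:English verbs</title><ns>14</ns></page>
</mediawiki>"""

print(main_namespace_titles(SAMPLE))
```

Filtering on `<ns>` rather than on the title prefix avoids having to know the localized namespace names of each Wiktionary.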

ilius · 2020-08-02T15:30:56Z

I think you should download pages-articles.xml
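
A small sketch of picking that file out of a directory listing by name. The listing below is illustrative (the compressed `.bz2` and `abstract` names are assumptions about what the dump directory contains), and `pick_articles_dump` is a hypothetical helper, not part of PyGlossary.

```python
import re

# Matches e.g. "arwiktionary-latest-pages-articles.xml" with an optional
# ".bz2" suffix, but not "pages-meta-current" or other variants.
_ARTICLES_RE = re.compile(r"-pages-articles\.xml(\.bz2)?$")


def pick_articles_dump(filenames):
    """Return the first name that looks like a pages-articles dump, else None."""
    for name in filenames:
        if _ARTICLES_RE.search(name):
            return name
    return None


listing = [
    "arwiktionary-latest-abstract.xml.gz",
    "arwiktionary-latest-pages-meta-current.xml.bz2",
    "arwiktionary-latest-pages-articles.xml.bz2",
]
print(pick_articles_dump(listing))
```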

sobaee · 2020-08-02T15:46:17Z

Is this working with small sized Wikipedias or just with Wiktionaries?

ilius · 2020-08-02T17:04:34Z

Let's focus on Wiktionary for now.

sobaee · 2020-08-04T22:02:16Z

I have to add something:

I converted the Wikimedia dump XML for "enwiktionary" today to txt, ifo and slob, and used the converted file to look up some words (e.g. "google"). Unfortunately, I then discovered that the definitions contain a lot of unintelligible text.
This is part of the translation of the word "google":

⚫︎ * {{quote-av|en|title=w|Buffy the Vampire Slayer
|episode=w:Help (Buffy episode)|Help|season=7|number=4|people=w|Alyson Hannigan
and w|Nicholas Brendon
|role=w|Willow Rosenberg
and w|Xander Harris
|date=15 October 2002|passage=''Willow'': Have you '''googled''' her yet?
''Xander'': Willow! She's 17!
''Willow'': It's a search engine.}}
⚫︎ * {{quote-av|en|title=w|Maid In Manhattan
|people=w|Jennifer Lopez
|role=Marisa|date=13 December 2002|passage="'''Google''' it."}}
⚫︎ * {{quote-journal|en|author=Bill Keller|authorlink=Bill Keller|title=Who's Sorry Now?|url=http://www.nytimes.com/2002/12/28/opinion/28KELL.html|newspaper=w|The New York Times
|issn=0362-4331|page=A-19|date=28 December 2002|accessdate=24 June 2007|passage='''Googling''' in search of an apology from the former Enron C.E.O. ..
End

This is completely frustrating. I just need to see a clear, correct translation of any word instead of this!

On the other hand, the translations found in the Slob files which I converted before, were completely organized (but unfortunately some important wikidumps there are outdated and some are not included)

sobaee · 2020-08-04T22:07:26Z

Hi @ilius

I think there is something wrong either with the dumps themselves, with the file I downloaded from there (pages-articles; maybe one file is not enough to build a complete, clear dictionary!), or with the conversion tools. I really don't know.
I hope you can help.

ilius · 2020-08-04T22:14:09Z

Yes, templates are not rendered.
For now, it's just regexp matching that converts basic wiki markup to HTML.
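
To illustrate why templates survive this kind of conversion, here is a minimal sketch of regexp-based markup conversion. It is not the plugin's actual code; `basic_wiki_to_html` and the sample line are made up for this example. Bold, italic and links are simple local patterns, while `{{...}}` templates need real expansion and so pass through untouched.

```python
import re


def basic_wiki_to_html(text):
    # Bold first, because ''' contains the italic marker ''.
    text = re.sub(r"'''(.+?)'''", r"<b>\1</b>", text)
    text = re.sub(r"''(.+?)''", r"<i>\1</i>", text)
    # [[target|label]] and [[target]] are reduced to their visible label.
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]+)\]\]", r"\1", text)
    return text


# Hand-written sample line (assumption, not copied from a dump).
sample = "{{lb|en|transitive}} To search for [[information]] using '''Google'''."
print(basic_wiki_to_html(sample))
```

The template at the start of the sample comes out exactly as it went in, which matches the raw `{{quote-av|...}}` text reported above.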

sobaee · 2020-08-04T22:15:29Z

Is there any solution for this coming soon?

sobaee · 2020-08-18T13:15:17Z

Hi @ilius
I think ZIM file conversion could replace converting from Wikipedia dumps.

BoboTiG · 2020-12-17T18:17:43Z

There is a project that uses Wiktionary as input data: eBook Reader Dictionaries. We are actively working on generating more output files using pyglossary; see BoboTiG/ebook-reader-dict#409. For now, only Kobo is supported.
I hope it will help :)

BoboTiG · 2020-12-17T18:18:19Z

We are now generating a .df file, so hypothetically, pyglossary could do the conversion to almost any format.

ilius · 2022-03-02T15:03:39Z

So we have two alternatives:

Zim
- ➕ Very good output for desktop.
- ➕ Works for all languages / locales.
- ➕ Works for all Wikimedia websites.
- ➖ libzim requires compilation on most platforms, which is hard for the average user.
- ➖ Cannot use the official .xml snapshot.
ebook-reader-dict
- ➕ Very good output for both desktop and e-book readers.
- ➖ Locale-specific, supports only 10 languages.
- ➖ Only works for Wiktionary.
- ➕ Offers daily builds (StarDict and .df)
- ➕ Can use the official .xml snapshot.
  - ➖ Involves several command-line steps, which may be hard for the average user.
  - ➖ Resource-intensive (consumes 4 GB of RAM, lots of CPU usage)

Both solutions are much better than anything that I can include in PyGlossary anytime soon.

So I'm going to remove the plugin and close this issue.

raffaem · 2022-04-23T11:29:10Z

How can ZIM be used to read Wiktionary dumps?

ilius · 2022-04-28T15:15:07Z

You need to download ZIM files instead of Wiktionary dumps.

ilius added the New Extension label Apr 24, 2016

ilius self-assigned this Apr 24, 2016

ilius changed the title ~~Is it possible to support wikidumps?~~ Read Wikimedia Dumps Apr 25, 2016

ilius added Feature and removed New Extension labels May 1, 2016

ilius removed their assignment Oct 28, 2016

ilius closed this as completed Jul 20, 2020

ilius reopened this Jul 21, 2020

ilius mentioned this issue Jan 1, 2021

Word pages look unformatted rdoeffinger/Dictionary#129

Open

ilius added a commit that referenced this issue Mar 2, 2022

remove read support for Wiktiomary Dump, #48

f90f450

ilius closed this as completed Mar 2, 2022
