Wikipedia Corpus Builder is a toolkit for creating clean corpora from database snapshots of MediaWiki-powered wikis; content that is of little use for most NLP and IR tasks is removed. The Corpus Builder was created by Lars J. Solberg for his master's thesis in 2012.
It is currently being updated and reworked in order to make it more usable for the public.
The project is built and tested with Python 2.7. If your system defaults to another version, or you lack permission to install dependencies globally, try virtualenv.
You should have about 90GB of free space to download and parse a recent English Wikipedia dump:
- ~60GB for extracting the downloaded snapshot (which is ~13GB)
- ~20GB for the constant database built with mwlib
- ~5GB for the parsed text generated by WCB
- mwlib
- mwlib.cdb 0.1.1
- tokenizer (no longer available at its original link; a copy is included in this project)
- srilm
Installation:
pip install mwlib
pip install mwlib.cdb
- Download and install srilm using the instructions here
- Installing tokenizer:
- cd /path-to-wcb/libs/tokenizer
- ./configure --prefix=/path-to-wcb/libs/tokenizer/build
- make && make install
- The executable tokenizer should now be in /path-to-wcb/libs/tokenizer/build/bin
- Finally, copy tokenizer and ngram (from srilm) to /usr/local/bin or another path that is accessible from your shell.
If the command python -c 'from mwlib.cdb import cdbwiki' does not give any error message, and your shell is able to find tokenizer and ngram (from srilm), you should be in good shape.
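The checks above can be sketched as a small helper script (illustrative only, not part of WCB; the module and executable names are taken from the steps above):

```python
import importlib
import os


def find_on_path(name):
    # Search each PATH entry for an executable file with this name.
    for d in os.environ.get("PATH", "").split(os.pathsep):
        candidate = os.path.join(d, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def check_prereqs(modules=("mwlib.cdb",), binaries=("tokenizer", "ngram")):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    for mod in modules:
        try:
            importlib.import_module(mod)
        except ImportError:
            problems.append("missing Python module: " + mod)
    for exe in binaries:
        if find_on_path(exe) is None:
            problems.append("executable not on PATH: " + exe)
    return problems


if __name__ == "__main__":
    for problem in check_prereqs():
        print(problem)
```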
(On OS X) fatal error: 'timelib_config.h' file not found (see this issue), solution:
pip download timelib
which saves a zipped timelib source archive to your current folder. Extract the archive and edit its setup.py:
# change the following
ext_modules=[Extension("timelib", sources=sources,
libraries=libraries,
define_macros=[("HAVE_STRING_H", 1)])],
# to this
ext_modules=[Extension("timelib", sources=sources,
include_dirs=[".", "ext-date-lib"],
libraries=libraries,
define_macros=[("HAVE_STRING_H", 1)])],
The project comes with pre-configuration for the following snapshots.
NB: these snapshots are no longer hosted by Wikimedia, so you will have to configure a new snapshot (see below) until we are able to host them somewhere.
- Download the snapshot
- Decompress:
bunzip2 enwiki-SNAPSHOT_DATE-pages-articles.xml.bz2
- Create a constant database:
mw-buildcdb --input enwiki-SNAPSHOT_DATE-pages-articles.xml --output OUTDIR
- Change the wikiconf entry in /wcb/enwiki-SNAPSHOT_DATE/paths.txt to point to the wikiconf.txt file generated in the previous step.
- The WCB modules in this project need access to the paths.txt configuration file. They determine its location by examining the PATHSFILE environment variable; set it like so: export PATHSFILE=/wcb/enwiki-SNAPSHOT_DATE/paths.txt (add it to your ~/.bash_profile for persistence).
- Choose and download a recent snapshot from Wikimedia; look for the enwiki-SNAPSHOT_DATE-pages-articles.xml.bz2 file.
- Decompress:
bunzip2 enwiki-SNAPSHOT_DATE-pages-articles.xml.bz2
- Create a constant database:
mw-buildcdb --input enwiki-SNAPSHOT_DATE-pages-articles.xml --output OUTDIR
- Add configuration for the new snapshot: copy the enwiki-20170201 directory in the repo to a new directory reflecting your snapshot's date.
- Change the wikiconf entry in /wcb/enwiki-SNAPSHOT_DATE/paths.txt to point to the wikiconf.txt file generated in step 3.
- The WCB modules in this project need access to the paths.txt configuration file. They determine its location by examining the PATHSFILE environment variable; set it like so: export PATHSFILE=/wcb/enwiki-SNAPSHOT_DATE/paths.txt (add it to your ~/.bash_profile for persistence).
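The PATHSFILE lookup can be illustrated with a short sketch (hypothetical code for illustration, not WCB's actual implementation):

```python
import os


def locate_pathsfile():
    # WCB scripts find paths.txt via the PATHSFILE environment variable
    # (per the steps above); this helper mimics that lookup.
    path = os.environ.get("PATHSFILE")
    if not path:
        raise RuntimeError(
            "PATHSFILE is not set; run e.g. "
            "export PATHSFILE=/wcb/enwiki-SNAPSHOT_DATE/paths.txt")
    if not os.path.isfile(path):
        raise RuntimeError("PATHSFILE points to a missing file: " + path)
    return path
```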
To test the configuration, try running the corpus builder on the list of test articles, like so:
mkdir test-dir
python /wcb/scripts/build_corpus.py --article-list /wcb/test-articles.txt test-dir
> OUTPUT:
[2017-06-06 21:40:49,360 build_corpus.log_progress()] Progress: 100.000% (saved article 3 of 3)
[2017-06-06 21:40:49,360 build_corpus.log_progress()] Empty articles (probably redirects): 2 of 3 (66.67%)
[2017-06-06 21:40:49,360 build_corpus.log_progress()] Time per article: 0.534s
[2017-06-06 21:40:49,360 build_corpus.log_progress()] Time elapsed: 0d:0h:00m:01s
[2017-06-06 21:40:49,360 build_corpus.log_progress()] Estimated time left: 0d:0h:00m:00s
The first invocation of this command will take some time, as it examines all the templates in the snapshot. On completion, you should see the compressed parsed test run in test-dir
(it contains only Alberto Masi, as the other test articles are redirects).
mkdir out-dir
python /wcb/scripts/build_corpus.py -p NUMBER_OF_PROCESSES out-dir
In progress...
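A reasonable value for NUMBER_OF_PROCESSES is one worker per CPU core; for example (a generic heuristic, not a WCB-specific recommendation):

```python
import multiprocessing


def default_process_count(reserve=1):
    # Use one worker per core, leaving `reserve` cores free for the OS
    # and other work; never go below 1.
    return max(1, multiprocessing.cpu_count() - reserve)
```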
- python build_corpus.py (builds a corpus for a complete dump or specified list of articles)
usage: build_corpus.py [-h] [--clean-port CLEAN_PORT]
[--dirty-port DIRTY_PORT] [--processes PROCESSES]
[--blacklist BLACKLIST]
[--article-list ARTICLE_LIST | --file-list FILE_LIST]
out_dir
- python getMarkup.py (gets the raw markup of an article)
usage: getMarkup.py [-h] article
- python list_articles.py (lists article names)
- python printNodes.py (prints the syntax tree of an article; currently not working due to an exception in nuwiki)
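The usage string for build_corpus.py corresponds to an argparse setup roughly like this (a reconstruction for illustration; the actual script's wiring and defaults may differ):

```python
import argparse


def build_parser():
    # Mirrors the usage string shown above; the -p alias comes from the
    # invocation examples, and the default of 1 process is a guess.
    parser = argparse.ArgumentParser(prog="build_corpus.py")
    parser.add_argument("--clean-port", type=int)
    parser.add_argument("--dirty-port", type=int)
    parser.add_argument("-p", "--processes", type=int, default=1)
    parser.add_argument("--blacklist")
    # --article-list and --file-list are mutually exclusive in the usage string.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--article-list")
    group.add_argument("--file-list")
    parser.add_argument("out_dir")
    return parser
```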