Wikipedia Corpus Builder

Wikipedia Corpus Builder is a toolkit for creating clean corpora from database snapshots of MediaWiki-powered wikis; content that is of little use for most NLP and IR tasks is stripped away. The Corpus Builder was created by Lars J. Solberg for his master's thesis in 2012.

It is currently being updated and reworked in order to make it more usable for the public.

Old documentation


Setup

The project is built and tested with Python 2.7. If you normally work with another version, or lack permission to install dependencies system-wide, try virtualenv.

You should have about 90GB of free space to download and parse a recent English Wikipedia dump:

  • ~60GB for extracting the downloaded snapshot (which is ~13GB)
  • ~20GB for the constant database built with mwlib
  • ~5GB for the parsed text generated by WCB

Dependencies

Installation:

  • pip install mwlib
  • pip install mwlib.cdb
  • Download and install SRILM using the instructions here
  • Installing the tokenizer:
    1. cd /path-to-wcb/libs/tokenizer
    2. ./configure --prefix=/path-to-wcb/libs/tokenizer/build
    3. make && make install
    4. The tokenizer executable should now be in /path-to-wcb/libs/tokenizer/build/bin

Finally, copy tokenizer and ngram (from SRILM) to /usr/local/bin or another directory on your shell's PATH.
If the command python -c 'from mwlib.cdb import cdbwiki' runs without errors and your shell can find tokenizer and ngram (from SRILM), you should be in good shape.
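
A quick way to check the prerequisites in one go (a minimal sketch; it assumes the binaries were copied onto your PATH as described above):

# check that mwlib's constant-database module imports cleanly
python -c 'from mwlib.cdb import cdbwiki' && echo 'mwlib.cdb OK'

# check that the tokenizer and SRILM's ngram binary are reachable from the shell
which tokenizer && which ngram && echo 'tokenizer and ngram OK'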

Known Issues

(On OS X) fatal error: 'timelib_config.h' file not found (see this issue). Solution:
  1. pip download timelib, which saves a zipped copy of timelib to your current folder
  2. Extract the archive and edit its setup.py:
    # change the following
    ext_modules=[Extension("timelib", sources=sources,
                            libraries=libraries,
                            define_macros=[("HAVE_STRING_H", 1)])],
    # to this
    ext_modules=[Extension("timelib", sources=sources,
                            include_dirs=[".", "ext-date-lib"],
                            libraries=libraries,
                            define_macros=[("HAVE_STRING_H", 1)])],

Running on the English Wikipedia

The project comes with pre-configured settings for a number of snapshots.

NB: these snapshots are no longer hosted by Wikimedia, so until we are able to host them somewhere you will have to configure a new snapshot yourself.

Using a pre-configured snapshot

  1. Download the snapshot
  2. Decompress it: bunzip2 enwiki-SNAPSHOT_DATE-pages-articles.xml.bz2
  3. Create a constant database: mw-buildcdb --input enwiki-SNAPSHOT_DATE-pages-articles.xml --output OUTDIR
  4. Change the wikiconf entry in /wcb/enwiki-SNAPSHOT_DATE/paths.txt to point to the wikiconf.txt file generated in the previous step.
  5. The WCB modules in this project need access to the paths.txt configuration file. They locate it through the PATHSFILE environment variable; set it like so: export PATHSFILE=/wcb/enwiki-SNAPSHOT_DATE/paths.txt (add the line to your ~/.bash_profile for persistence). The commands are sketched below.
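
A condensed sketch of the shell commands for the steps above, assuming the repository lives at /wcb and SNAPSHOT_DATE is replaced with your snapshot's date:

# decompress the snapshot and build the constant database with mwlib
bunzip2 enwiki-SNAPSHOT_DATE-pages-articles.xml.bz2
mw-buildcdb --input enwiki-SNAPSHOT_DATE-pages-articles.xml --output OUTDIR

# after editing paths.txt so its wikiconf entry points at the wikiconf.txt produced by mw-buildcdb,
# tell WCB where the configuration lives
export PATHSFILE=/wcb/enwiki-SNAPSHOT_DATE/paths.txt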

Configuring a new dump

  1. Choose and download a recent snapshot from Wikimedia; look for the enwiki-SNAPSHOT_DATE-pages-articles.xml.bz2 file.
  2. Decompress it: bunzip2 enwiki-SNAPSHOT_DATE-pages-articles.xml.bz2
  3. Create a constant database: mw-buildcdb --input enwiki-SNAPSHOT_DATE-pages-articles.xml --output OUTDIR
  4. Add configuration for the new snapshot: copy the enwiki-20170201 directory in the repo to a new directory named after your snapshot's date (see the sketch after this list).
  5. Change the wikiconf entry in /wcb/enwiki-SNAPSHOT_DATE/paths.txt to point to the wikiconf.txt file generated in step 3.
  6. The WCB modules in this project need access to the paths.txt configuration file. They locate it through the PATHSFILE environment variable; set it like so: export PATHSFILE=/wcb/enwiki-SNAPSHOT_DATE/paths.txt (add the line to your ~/.bash_profile for persistence).
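
A sketch of the configuration steps (4-6), using 20240101 purely as an example snapshot date:

# copy the existing snapshot configuration as a template for the new dump
cp -r /wcb/enwiki-20170201 /wcb/enwiki-20240101
# edit the copy so its wikiconf entry points at the wikiconf.txt produced by mw-buildcdb
$EDITOR /wcb/enwiki-20240101/paths.txt
export PATHSFILE=/wcb/enwiki-20240101/paths.txt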

Test run

To test the configuration, try running the corpus builder on the list of test articles, like so:

mkdir test-dir
python /wcb/scripts/build_corpus.py --article-list /wcb/test-articles.txt test-dir

> OUTPUT:
[2017-06-06 21:40:49,360 build_corpus.log_progress()] Progress: 100.000% (saved article 3 of 3)
[2017-06-06 21:40:49,360 build_corpus.log_progress()] Empty articles (probably redirects): 2 of 3 (66.67%)
[2017-06-06 21:40:49,360 build_corpus.log_progress()] Time per article: 0.534s
[2017-06-06 21:40:49,360 build_corpus.log_progress()] Time elapsed: 0d:0h:00m:01s
[2017-06-06 21:40:49,360 build_corpus.log_progress()] Estimated time left: 0d:0h:00m:00s

The first invocation of this command will take some time, as it examines all the templates in the snapshot. On completion, you should see the compressed parsed output of the test run in test-dir (it only includes Alberto Masi, since the other test articles are redirects).

Full run

mkdir out-dir
python /wcb/scripts/build_corpus.py -p NUMBER_OF_PROCESSES out-dir

Adding support for additional languages

In progress...

Script invocation

- python build_corpus.py (builds a corpus for a complete dump or specified list of articles)

usage: build_corpus.py [-h] [--clean-port CLEAN_PORT]
                       [--dirty-port DIRTY_PORT] [--processes PROCESSES]
                       [--blacklist BLACKLIST]
                       [--article-list ARTICLE_LIST | --file-list FILE_LIST]
                       out_dir
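
For example, a run over a custom article list with four worker processes could look like this (the article list path is illustrative):

python /wcb/scripts/build_corpus.py --processes 4 --article-list my-articles.txt out-dir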

- python getMarkup.py (gets the raw markup of an article)

usage: getMarkup.py [-h] article

- python list_articles.py (lists article names)

- python printNodes.py (prints the syntax tree of an article)
  Currently not working due to an exception in nuwiki.