Splitting a Wikipedia XML dump using Mahout

  • Download and unpack the latest binary release of Mahout from your closest mirror

  • Add MAHOUT_HOME/bin to your PATH (a minimal example follows the output below) and check that mahout is correctly installed with:

    $ mahout
    no HADOOP_HOME set, running locally
    An example program must be given as the first argument.
    Valid program names are:
      arff.vector: : Generate Vectors from an ARFF file or directory
      canopy: : Canopy clustering
      [...]
      wikipediaXMLSplitter: : Reads wikipedia data and creates ch
    
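For example, assuming Mahout 0.4 was unpacked into your home directory (adjust the path to wherever you extracted the release):

    $ export MAHOUT_HOME=$HOME/mahout-distribution-0.4   # assumed install location
    $ export PATH=$MAHOUT_HOME/bin:$PATH
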
  • Download the Wikipedia dump; there is no need to uncompress it. Replace "enwiki" with "frwiki" or another language code to select the language you are interested in:

    $ wget -c http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
    
  • Split the dump into 100 MB XML files on the local file system:

    $ mahout wikipediaXMLSplitter -d enwiki-latest-pages-articles.xml.bz2 \
      -o wikipedia-xml-chunks -c 100
    $ ls -l wikipedia-xml-chunks
    -rw-r--r-- 1 ogrisel ogrisel 108387581 2010-12-31 17:17 chunk-0001.xml
    -rw-r--r-- 1 ogrisel ogrisel 108414882 2010-12-31 17:18 chunk-0002.xml
    -rw-r--r-- 1 ogrisel ogrisel 108221208 2010-12-31 17:18 chunk-0003.xml
    -rw-r--r-- 1 ogrisel ogrisel 108059995 2010-12-31 17:18 chunk-0004.xml
    [...]
    
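As a quick sanity check, you can peek at the beginning of the first chunk; it should be XML containing a batch of <page> elements:

    $ head -n 20 wikipedia-xml-chunks/chunk-0001.xml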

For instance, you can append -n 10 to the previous command to extract only the first 10 chunks.
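
The full command would then look like this:

    $ mahout wikipediaXMLSplitter -d enwiki-latest-pages-articles.xml.bz2 \
      -o wikipedia-xml-chunks -c 100 -n 10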

  • Alternatively you can push the split dump directly to your Amazon S3 bucket (this is faster if you execute it from an EC2 instance):

    $ mahout wikipediaXMLSplitter -d enwiki-latest-pages-articles.xml.bz2 \
      -o s3n://mybucket/wikipedia-xml-chunks -c 100 \
      -i <your Amazon S3 ID key> \
      -s <your Amazon S3 secret key>
    

Note: the S3 native protocol (s3n) requires cryptography support that is not available in the version of OpenJDK 6 provided by default on Amazon EC2 Linux images. You will need to install the Sun / Oracle JDK and update your PATH and JAVA_HOME to point to it, as sketched below.
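
For example, assuming the Sun / Oracle JDK was installed under /usr/lib/jvm/java-6-sun (the exact path depends on how and where you installed it):

    $ export JAVA_HOME=/usr/lib/jvm/java-6-sun   # assumed install location, adjust to your system
    $ export PATH=$JAVA_HOME/bin:$PATH
    $ java -version                              # should now report the Sun/Oracle JVM instead of OpenJDK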

Splitting a Wikipedia dump and uploading it to S3 from an m1.large EC2 instance takes roughly half an hour. Note: the Wikipedia splitter implementation does not leverage any parallelism (at least in version 0.4), so it is not necessary to set up a Hadoop cluster for this step.

  • Note: if you are working on a Hadoop cluster, e.g. on EC2 using Apache Whirr, it is probably faster to import the dump directly into your cluster's HDFS by running the following command on one of the nodes; a quick check of the result is shown below the command:

    $ HADOOP_HOME=/usr/local/hadoop-0.20.2 ./mahout-distribution-0.4/bin/mahout wikipediaXMLSplitter \
      -d enwiki-latest-pages-articles.xml.bz2 -o workspace/en/wikipedia-xml-chunks -c 100
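
Once the splitter finishes, you can verify that the chunks are in HDFS with a plain listing, using the same Hadoop installation and output directory as in the command above:

    $ /usr/local/hadoop-0.20.2/bin/hadoop fs -ls workspace/en/wikipedia-xml-chunks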