
With an HTTP client such as wget

You can get the latest Wikipedia dump of the English articles here (around 5.4 GB compressed, 23 GB uncompressed):

enwiki-latest-pages-articles.xml.bz2
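
A minimal sketch of the download and decompression, assuming the standard Wikimedia dumps mirror at dumps.wikimedia.org (check the link above for the exact URL of the current dump):

    # Download the latest English Wikipedia articles dump (~5.4 GB compressed).
    wget http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

    # Optionally decompress it (~23 GB uncompressed); -k keeps the original
    # .bz2 file, which many tools can also read directly.
    bunzip2 -k enwiki-latest-pages-articles.xml.bz2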

The DBpedia links and entity types datasets are available here (16.4 GB compressed):

Index of individual DBpedia 3.5.1 dumps

Complete multilingual archive
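
A hedged example of fetching individual DBpedia 3.5.1 dump files with wget; the base URL and file names below are assumptions and should be checked against the dump index linked above:

    # Fetch a couple of the English DBpedia 3.5.1 dumps (file names are
    # assumptions; see the dump index for the exact paths).
    wget http://downloads.dbpedia.org/3.5.1/en/instance_types_en.nt.bz2
    wget http://downloads.dbpedia.org/3.5.1/en/page_links_en.nt.bz2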

By mounting an EBS volume on your EC2 instance

All of those datasets are also available from the Amazon cloud as public EBS volumes:

Wikipedia XML dataset EBS Volume: snap-8041f2e9 (all languages - 500 GB)

DBPedia Triples dataset EBS Volume: snap-63cf3a0a (all languages - 67 GB)
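
A rough sketch of creating a volume from one of these public snapshots and mounting it on a running instance, using the AWS command line tools; the availability zone, volume id, instance id, device name, and mount point below are placeholders, not values from this page:

    # Create a volume from the public Wikipedia snapshot. It must be in the
    # same availability zone as your instance (us-east-1a is a placeholder).
    aws ec2 create-volume --snapshot-id snap-8041f2e9 --availability-zone us-east-1a

    # Attach it to your instance (volume id, instance id and device are placeholders).
    aws ec2 attach-volume --volume-id vol-xxxxxxxx --instance-id i-xxxxxxxx --device /dev/sdf

    # On the instance, mount the attached volume read-only (on many Linux AMIs
    # /dev/sdf shows up as /dev/xvdf; check your instance's device naming).
    sudo mkdir -p /mnt/wikipedia
    sudo mount -o ro /dev/xvdf /mnt/wikipedia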

See the wiki page Running pignlproc scripts on a EC2 Hadoop cluster for instructions on how to set up your Hadoop cluster on EC2.