A collection of utilities for parsing WikiMedia XML dumps with the intent of indexing the content in Solr.
1. Download a Wikipedia dump file (http://en.wikipedia.org/wiki/Wikipedia:Database_download)
2. Download Solr 4.9 and extract it (http://lucene.apache.org/solr/)
3. Configure environment variables
   Set SOLR_HOME to the "example" directory inside the location Solr was extracted to in Step 2, for example: export SOLR_HOME=/var/local/solr/example
   Set JAVA_HOME to the location of your JDK.
4. Clone and build the code
   git clone https://github.com/bbende/solr-wikipedia.git
   cd solr-wikipedia
   mvn clean package -Pshade
5. Configure and start Solr
   ./deploy-wikipedia-collection.sh (copies src/main/resources/solr/wikipediaCollection to $SOLR_HOME/solr/)
   src/main/resources/solr.sh start
   Check http://localhost:8983/solr in your browser
6. Ingest data (from the solr-wikipedia directory)
   java -jar target/solr-wikipeida-1.0-SNAPSHOT.jar http://localhost:8983/solr/wikipediaCollection /var/local/test-wiki-data.xml.bz2
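After ingesting, it can help to confirm that documents actually reached the index. The following is a minimal sketch, not part of this project: the class name SolrCountCheck and its helper methods are hypothetical, it assumes Java 11 or later for java.net.http, and it assumes the collection is reachable through Solr's standard select handler with wt=json.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: asks Solr how many documents the collection holds.
public class SolrCountCheck {
    // Matches the "numFound" field of a Solr JSON select response.
    private static final Pattern NUM_FOUND = Pattern.compile("\"numFound\"\\s*:\\s*(\\d+)");

    // Builds a match-all query that returns no rows, only the total count.
    static String countUrl(String collectionUrl) {
        return collectionUrl + "/select?q=*%3A*&rows=0&wt=json";
    }

    // Extracts the document count from the raw JSON response body.
    static long parseNumFound(String json) {
        Matcher m = NUM_FOUND.matcher(json);
        if (!m.find()) {
            throw new IllegalArgumentException("no numFound in response");
        }
        return Long.parseLong(m.group(1));
    }

    public static void main(String[] args) {
        // Assumed default: the collection URL used in the ingest command above.
        String collectionUrl = args.length > 0
                ? args[0]
                : "http://localhost:8983/solr/wikipediaCollection";
        HttpRequest request = HttpRequest.newBuilder(URI.create(countUrl(collectionUrl))).GET().build();
        try {
            HttpResponse<String> response =
                    HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println("Indexed documents: " + parseNumFound(response.body()));
        } catch (IOException | InterruptedException e) {
            System.out.println("Could not reach Solr: " + e);
        }
    }
}
```

A count of zero right after ingest may only mean that the documents have not been committed yet, so treat the number as a rough check.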
There are three main concepts:
- Handlers - Receive events related to the WikiMedia XML and produce objects based on those events. The DefaultPageHandler produces Page objects, but clients can implement a custom handler to produce another type of object.
- Parser - A SAX parser for the WikiMedia XML. Clients pass in a Reader for the XML and a handler that acts on the events.
- Iterator - An Iterator that uses StAX processing to produce objects based on the given handler.
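To illustrate the handler idea in isolation, here is a minimal sketch that uses only the JDK's SAX API rather than this project's classes. TitleHandlerSketch and TitleHandler are hypothetical names, and the inline XML is a simplified stand-in for a real dump. The handler receives SAX events and turns them into objects, in this case plain title strings.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class TitleHandlerSketch {

    // Hypothetical handler: collects the text of every <title> element.
    static class TitleHandler extends DefaultHandler {
        private final List<String> titles = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();
        private boolean inTitle;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            if ("title".equals(qName)) {
                inTitle = true;
                text.setLength(0); // start a fresh buffer for this title
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // SAX may deliver text in several chunks, so append rather than assign.
            if (inTitle) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("title".equals(qName)) {
                titles.add(text.toString());
                inTitle = false;
            }
        }

        List<String> getTitles() {
            return titles;
        }
    }

    // Parses the XML with the JDK's SAX parser and returns the collected titles.
    static List<String> parseTitles(String xml) {
        TitleHandler handler = new TitleHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
        return handler.getTitles();
    }

    public static void main(String[] args) {
        String xml = "<mediawiki><page><title>Solr</title></page>"
                + "<page><title>Lucene</title></page></mediawiki>";
        System.out.println(parseTitles(xml)); // prints [Solr, Lucene]
    }
}
```

A custom handler for this library would follow the same pattern, but would implement the project's own handler interface and build whatever object type the client needs.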
An example of parsing a bzip2 dump file:
String testWikiXmlFile = "src/test/resources/test-wiki-data.xml.bz2";
WikiMediaXMLParser wikiMediaXMLParser = new SAXWikiMediaParser<>();
PageHandler handler = new DefaultPageHandler();
// Decompress the dump and decode it as UTF-8 (java.nio.charset.StandardCharsets)
try (FileInputStream fileIn = new FileInputStream(testWikiXmlFile);
     BZip2CompressorInputStream bzipIn = new BZip2CompressorInputStream(fileIn);
     InputStreamReader reader = new InputStreamReader(bzipIn, StandardCharsets.UTF_8)) {
    // The parser pushes events to the handler as it reads the XML
    wikiMediaXMLParser.parse(reader, handler);
    ...
}
An example of iterating over a bzip2 dump file:
String testWikiXmlFile = "src/test/resources/test-wiki-data.xml.bz2";
// Decompress the dump and decode it as UTF-8 (java.nio.charset.StandardCharsets)
try (FileInputStream fileIn = new FileInputStream(testWikiXmlFile);
     BZip2CompressorInputStream bzipIn = new BZip2CompressorInputStream(fileIn);
     InputStreamReader reader = new InputStreamReader(bzipIn, StandardCharsets.UTF_8)) {
    PageHandler handler = new DefaultPageHandler();
    // Typed iterator so that next() returns a Page without a cast
    Iterator<Page> iterator = new WikiMediaIterator<>(reader, handler);
    while (iterator.hasNext()) {
        Page page = iterator.next();
    }
}
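The pull-based approach behind the iterator can be sketched with the JDK's StAX API alone. This is an illustration under stated assumptions, not the project's WikiMediaIterator: TitleIteratorSketch is a hypothetical class that yields page titles as strings, and the inline XML is a simplified stand-in for a real dump. Each call to next() reads only as much of the stream as it needs, so the whole dump never has to be held in memory.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical iterator: lazily yields the text of each <title> element.
public class TitleIteratorSketch implements Iterator<String> {
    private final XMLStreamReader xml;
    private String next; // look-ahead value, null once the stream is exhausted

    TitleIteratorSketch(Reader reader) throws XMLStreamException {
        this.xml = XMLInputFactory.newInstance().createXMLStreamReader(reader);
        advance();
    }

    // Pulls events until the next <title> is found or the document ends.
    private void advance() throws XMLStreamException {
        next = null;
        while (xml.hasNext()) {
            if (xml.next() == XMLStreamConstants.START_ELEMENT
                    && "title".equals(xml.getLocalName())) {
                next = xml.getElementText();
                return;
            }
        }
    }

    @Override
    public boolean hasNext() {
        return next != null;
    }

    @Override
    public String next() {
        if (next == null) {
            throw new NoSuchElementException();
        }
        String current = next;
        try {
            advance();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
        return current;
    }

    // Convenience helper for the demo: drains the iterator into a list.
    static List<String> collectTitles(String xmlText) {
        List<String> titles = new ArrayList<>();
        try {
            Iterator<String> it = new TitleIteratorSketch(new StringReader(xmlText));
            while (it.hasNext()) {
                titles.add(it.next());
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
        return titles;
    }

    public static void main(String[] args) {
        String xmlText = "<mediawiki><page><title>Solr</title></page>"
                + "<page><title>Lucene</title></page></mediawiki>";
        System.out.println(collectTitles(xmlText)); // prints [Solr, Lucene]
    }
}
```

The one-element look-ahead keeps hasNext() free of side effects, which is a common way to adapt a streaming reader to the Iterator contract.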