indexer and searcher for retrieving Wikipedia diffs
META-INF
src
web
.gitignore
COPYING
README.rst
REQUIREMENTS.TXT
diffdb.sample.properties
lcp.py
pom.xml
query.py
template_query.py

README.rst

Wikihadoop Lucene Indexer & Searcher

Purpose

Search is one of the most useful ways to explore a huge amount of text data. When dealing with hundreds of millions of revisions in Wikipedia, we want to answer questions like "when did this template start to become popular in this wiki?" and many more. Such questions are almost impossible to answer without a search capability over revision diffs. WikiHadoop is a tool that creates a database of the differences between two revisions of Wikipedia articles. While knowing who added or removed certain content is very useful, the resulting data is still cumbersome to search through.

Hence, we developed a Lucene indexer that takes as input the diffdb created by WikiHadoop and creates an index that is searchable using Lucene. This indexer assumes that the input files are formatted as explained in [1].

Note that this software is under development. Most parts are not well documented, and the architecture changes frequently. Any feedback is welcome at Issues.

[1] http://meta.wikimedia.org/wiki/WSoR_datasets/revision_diff

How to use

We use Apache Maven to compile this software into a jar file. The jar file can be created by running the command mvn dependency:unpack-dependencies && mvn package in the top-level directory.

You invoke the indexer on the command line using the following command [2]:

CLASSPATH=$CLASSPATH:target/diffdb-0.1.jar java org.wikimedia.diffdb.Indexer ~/diffdbtest/index ~/diffdbtest/data/diffs

then the searcher daemon with the following command:

CLASSPATH=$CLASSPATH:target/diffdb-0.1.jar java org.wikimedia.diffdb.SearcherDaemon -index ~/diffdbtest/index

and then you can issue a query with an accompanying script to see which revisions are matched and when they are dated:

./query.py "Welcome to Wikipedia" -R -o monthly_hits.csv

With the parameters above, the script will find revisions containing "Welcome to Wikipedia" as added text. You can also search over other fields. See the Query format section for details.
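The output file name in the example suggests a per-month tally of matching revisions. The actual columns written by query.py are not documented here, so the following is only a minimal sketch of such a tally, assuming each hit carries an ISO 8601 timestamp such as 2006-01-03T10:00:00Z. The function name monthly_hits and the sample data are hypothetical.

```python
from collections import Counter


def monthly_hits(timestamps):
    """Count hits per calendar month.

    Assumes ISO 8601 timestamp strings, whose first seven
    characters are "YYYY-MM" (an assumption, not taken from query.py).
    """
    counts = Counter(ts[:7] for ts in timestamps)
    # Sort by month so the rows come out in chronological order.
    return sorted(counts.items())


# Hypothetical timestamps of three matching revisions.
sample = [
    "2006-01-03T10:00:00Z",
    "2006-01-20T08:30:00Z",
    "2006-02-01T00:00:00Z",
]

for month, hits in monthly_hits(sample):
    print(f"{month},{hits}")
```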

Requirements

  • Apache Maven
  • Apache Lucene 3.5.0
  • Apache Commons Lang 3.0.1
  • opencsv
  • junit
  • netty

Query format

The following fields can be searched. When multiple fields are specified, the searcher retrieves only the revisions that match every specified field.

  • rev_id
  • page_id
  • namespace
  • title
  • timestamp
  • comment
  • minor
  • user_id
  • user_text
  • added_size
  • removed_size
  • added
  • removed
  • action

For example, to find the revisions that contain the string 'Welcome to Wikipedia' as added text and were made between January 2002 and January 2003, you would use

added:"Welcome to Wikipedia" timestamp:[2002-01 TO 2003-01]

This query format is used when running query.py with the --advanced flag, or when connecting directly to the SearcherDaemon via telnet. By default, query.py uses its command-line argument as a phrase query on the added field.

See Lucene's Query Parser Syntax for more details.
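Query strings in this format can also be assembled programmatically. The sketch below is not part of query.py; the helper names phrase, date_range and build_query are hypothetical. It only relies on standard Lucene query parser syntax: field:"phrase" for phrase clauses and field:[start TO end] for inclusive ranges.

```python
def phrase(field, text):
    """Return a phrase clause such as added:"some text"."""
    # Inside a quoted phrase, backslashes and double quotes must be escaped.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"'


def date_range(field, start, end):
    """Return an inclusive range clause such as timestamp:[2002-01 TO 2003-01]."""
    return f"{field}:[{start} TO {end}]"


def build_query(*clauses):
    """Join clauses with spaces, as in the example query in this section."""
    return " ".join(clauses)


query = build_query(
    phrase("added", "Welcome to Wikipedia"),
    date_range("timestamp", "2002-01", "2003-01"),
)
print(query)
```

The resulting string could then be passed to query.py together with the --advanced flag.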

(to be expanded)

Configurations

(to be expanded)

  • Type of analysis to convert a document to the index representation including the value of N in N-gram indexing
  • Number of threads used in indexing

Architecture

(to be written)

  • N-gram based indexing and search
  • Search result refinement with grep
  • Searcher daemon
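Until this section is written, the first two items can be illustrated with a toy example. This is a minimal sketch of the general technique, not the project's Lucene-based implementation: it builds a character N-gram index in memory, collects candidate documents that contain every N-gram of the query, and then refines the candidates with an exact substring check (the role grep plays in the list above). All names and the sample documents are hypothetical.

```python
from collections import defaultdict


def ngrams(text, n):
    """Return the set of character n-grams in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_index(docs, n):
    """Map each n-gram to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in ngrams(text, n):
            index[gram].add(doc_id)
    return index


def candidates(index, docs, query, n):
    """Documents that contain every n-gram of the query (may over-match)."""
    grams = ngrams(query, n)
    if not grams:
        # The query is shorter than n, so the index cannot narrow anything.
        return set(docs)
    return set.intersection(*(index.get(g, set()) for g in grams))


def search(index, docs, query, n):
    """Refine the candidates with an exact substring check, like grep."""
    return sorted(d for d in candidates(index, docs, query, n) if query in docs[d])


docs = {1: "hanna", 2: "banner nan", 3: "goodbye"}
index = build_index(docs, 2)

# Document 2 contains the bigrams "an", "nn" and "na" but not the string "anna",
# so it is a candidate that the refinement step discards.
print(sorted(candidates(index, docs, "anna", 2)))
print(search(index, docs, "anna", 2))
```

The refinement step matters because sharing all N-grams with the query does not guarantee that the query occurs as a contiguous string.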
[2] ~/diffdbtest/data/diffs must contain the revision diff files described at http://meta.wikimedia.org/wiki/WSoR_datasets/revision_diff