Luceneutil: Lucene benchmarking utilities

Benchmarking Lucene Duke -- thank you @mocobeta!

Setting up luceneutil

First, pick a root directory under which luceneutil will be checked out, datasets will be stored, indices will be built, Lucene source code will be checked out, etc. We'll refer to this directory as $LUCENE_BENCH_HOME here.

# 1. checkout luceneutil:
# Choose a suitable directory, e.g. ~/Projects/lucene/benchmarks.
git clone util

# 2. Run the setup script
cd util
python src/python/ -download

In the second step, the setup procedure creates all necessary directories in the clone's parent directory and downloads a 6 GB compressed Wikipedia line doc file from an Apache mirror. If you don't want to download the large data file, just remove the -download flag from the command line.

After the download has completed, extract the lzma file in $LUCENE_BENCH_HOME/data.
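The extraction step can also be scripted. The following is a minimal sketch using Python's standard `lzma` module, which reads both the legacy `.lzma` and the `.xz` container formats. The function name `decompress_lzma` and any paths passed to it are illustrative assumptions, not part of luceneutil.

```python
import lzma
import shutil
from pathlib import Path


def decompress_lzma(src, dst, chunk_size=1024 * 1024):
    """Stream-decompress src (.lzma or .xz) into dst and return the bytes written.

    Hypothetical helper: the data is copied in chunks so that a multi-gigabyte
    line doc file never has to fit in memory.
    """
    src, dst = Path(src), Path(dst)
    with lzma.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout, chunk_size)
    return dst.stat().st_size
```

Call it with the path of the downloaded archive under $LUCENE_BENCH_HOME/data and the desired output path. The uncompressed file is much larger than the 6 GB download, so check the available disk space first.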

Preparing the benchmark candidates

The benchmark compares a baseline version of Lucene to a patched one. Therefore we need two checkouts of Lucene, for example:

  • $LUCENE_BENCH_HOME/lucene_baseline: contains a complete checkout of Lucene. This is the baseline for comparison.
  • $LUCENE_BENCH_HOME/lucene_candidate: contains a complete checkout of Lucene with the change applied that should be benchmarked against the baseline.

A trunk version of Lucene can be checked out with:

git clone lucene_baseline

Adjust the command accordingly for lucene_candidate.
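Before going further, it can help to confirm that both checkouts are where the benchmark expects them. This is a small sketch under the directory layout described above; the helper name `missing_checkouts` is hypothetical and not part of luceneutil.

```python
from pathlib import Path

# Directory names taken from the layout described in this README.
REQUIRED_CHECKOUTS = ("lucene_baseline", "lucene_candidate")


def missing_checkouts(bench_home):
    """Return the names of required Lucene checkouts that are absent under bench_home."""
    root = Path(bench_home)
    return [name for name in REQUIRED_CHECKOUTS if not (root / name).is_dir()]
```

An empty list means both directories exist. The check does not verify that they contain a buildable Lucene tree.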

Running a first benchmark

The setup script has created two files in $LUCENE_BENCH_HOME/util/src/python/.

The file should be used to override any existing constants, for example if you want to change the Java command line used to run benchmarks. To run an initial benchmark you don't need to modify this file.

Now you can start editing to define your comparison, at the bottom near its __main__:

This file is a copy of and should be used to define your comparisons. You don't have to build two separate indexes: if you are only benchmarking a code difference and not a file format change, you can build one index and pass it to both competitors.

Before running the full benchmark, do a quick test run like this:

python src/python/ -source wikimedium10k

If you get ClassNotFound exceptions, your Lucene checkouts may need to be rebuilt. Run ./gradlew jar in both the lucene_candidate/ and lucene_baseline/ directories.
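Rebuilding both checkouts can be automated. The sketch below runs `./gradlew jar` in each directory through `subprocess.run`. The function names `gradle_jar_plan` and `rebuild_jars` are hypothetical, and the `runner` parameter exists only so the plan can be inspected or dry-run without invoking Gradle.

```python
import subprocess
from pathlib import Path


def gradle_jar_plan(bench_home):
    """Return (working_directory, argv) pairs for rebuilding both Lucene checkouts."""
    root = Path(bench_home)
    return [
        (root / name, ["./gradlew", "jar"])
        for name in ("lucene_baseline", "lucene_candidate")
    ]


def rebuild_jars(bench_home, runner=subprocess.run):
    """Run each planned command; check=True raises CalledProcessError on a failed build."""
    for cwd, argv in gradle_jar_plan(bench_home):
        runner(argv, cwd=cwd, check=True)
```

Calling `rebuild_jars` with the path of $LUCENE_BENCH_HOME builds the baseline first and stops at the first failing build, so a broken baseline is reported before any time is spent on the candidate.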

If your benchmark fails with "facetDim Date was not indexed" or similar, try adding

facets = (('taxonomy:Date', 'Date'),('sortedset:Month', 'Month'),('sortedset:DayOfYear', 'DayOfYear'))
index = comp.newIndex('lucene_baseline', sourceData, facets=facets, indexSort='dayOfYearNumericDV:long')

in, and use that index in your benchmarks.

Running the geo benchmark

This one is different and self-contained. Read the command-line examples at the top of src/main/perf/

Creating a line doc file from an arbitrary Wikimedia dump

You can create your own line doc file from an arbitrary Wikimedia dump by following these steps. Note that the src/python/ helper tool does these steps:

  1. Download a Wikimedia dump (XML) from and decompress it in $YOUR_DATA_DIR.


    bunzip2 -d /data/jawiki/jawiki-20200620-pages-articles-multistream.xml.bz2
  2. Run src/python/ to extract attributes such as title and timestamp from the XML dump.


    python src/python/ /data/jawiki/jawiki-20200620-pages-articles-multistream.xml /data/jawiki/jawiki-20200620-text.txt
  3. Run src/python/ to extract cleaned body text from the XML dump. This may take a long time!


    cat /data/jawiki/jawiki-20200620-pages-articles-multistream.xml | python -u src/python/ -b102400m -o /data/jawiki

4a. Combine the outputs of 2. and 3. by running src/python/

python src/python/ /data/jawiki/jawiki-20200620-text.txt /data/jawiki/AA/wiki_00 /data/jawiki/jawiki-20200620-lines.txt

4b. (Optional) If you want to keep only the first three columns (title, timestamp, and body text) of the combined file, pass the -only-three-columns flag to

python src/python/ /data/jawiki/jawiki-20200620-text.txt /data/jawiki/AA/wiki_00 /data/jawiki/jawiki-20200620-lines.txt -only-three-columns

Alternatively, use the Unix `cut` tool:

# extract title, timestamp and body text
cat /data/jawiki/jawiki-20200620-lines.txt | cut -f1,2,3
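For environments without `cut`, the same trimming can be done in Python. This is a minimal sketch, assuming the line doc file is tab-separated UTF-8 with title, timestamp, and body text as the first three columns, as the `cut` command above implies. The names `first_three_columns` and `trim_line_doc` are hypothetical.

```python
def first_three_columns(line):
    """Keep the first three tab-separated fields of one line, like `cut -f1,2,3`."""
    return "\t".join(line.rstrip("\n").split("\t")[:3])


def trim_line_doc(src, dst):
    """Write a copy of src to dst with every line cut down to its first three columns.

    Lines are streamed one at a time, so large line doc files do not need to
    fit in memory. Returns the number of lines written.
    """
    count = 0
    with open(src, encoding="utf-8", newline="") as fin, \
            open(dst, "w", encoding="utf-8", newline="") as fout:
        for line in fin:
            fout.write(first_three_columns(line) + "\n")
            count += 1
    return count
```

Lines with fewer than three fields are passed through unchanged, which matches the behavior of `cut` without the `-s` option.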


Various utility scripts for running Lucene performance tests






