External Merge Sort in Java
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
exmeso-core
exmeso-jackson
exmeso-kryo
exmeso-msgpack
.gitignore
.travis.yml
LICENSE
README.md
pom.xml

README.md

External Merge Sort

This is a small library that implements External Merge Sort in Java.

"External sorting is a term for a class of sorting algorithms that can handle massive amounts of data. External sorting is required when the data being sorted do not fit into the main memory of a computing device (usually RAM) and instead they must reside in the slower external memory (usually a hard drive). External sorting typically uses a hybrid sort-merge strategy. In the sorting phase, chunks of data small enough to fit in main memory are read, sorted, and written out to a temporary file. In the merge phase, the sorted subfiles are combined into a single larger file. External sorting is a form of distribution sort, with the added characteristic that the individual subsets are separately sorted, rather than just being used internally as intermediate stages." – Wikipedia

Introduction

The library is pretty flexible as it takes an Iterable<T> as input and returns the sorted result as CloseableIterator<T>.

Sort order is given by an instance of Comparator<T>.

Persistence is handled by an implementation of the Serializer<T> interface, of which there are currently two implementations:

The sorting algorithm is implemented by ExternalMergeSort<T>.

Example

Using the Jackson serializer, this code example first reads unsorted input from input.json and then writes the sorted output to output.json. The input is a JSON array of objects that gets sorted by their "id" field in ascending order.

Example input:

[ {"id" : "c"}, {"id" : "b"}, {"id" : "a"} ]

Example output:

[ {"id" : "a"}, {"id" : "b"}, {"id" : "c"} ]

Note that it really doesn't make much sense to use this library to sort data this small, or anything even close. It is meant to be used to sort large amounts of data that cannot all be contained in memory at the same time.

Example code:

// Prepare input and output files
File inputFile = new File("/tmp/input.json");
File outputFile = new File("/tmp/output.json");

// Create comparator
Comparator<ObjectNode> comparator = new Comparator<ObjectNode>() {
    @Override
    public int compare(ObjectNode o1, ObjectNode o2) {
        String id1 = o1.path("id").getTextValue();
        String id2 = o2.path("id").getTextValue();
        return id1.compareTo(id2);
    }
};

// Create serializer
JacksonSerializer<ObjectNode> serializer = new JacksonSerializer<ObjectNode>(ObjectNode.class);

// Create the external merge sort instance
ExternalMergeSort<ObjectNode> sort = ExternalMergeSort.newSorter(serializer, comparator)
        .withChunkSize(1000)
        .withMaxOpenFiles(10)
        .withDistinct(true)
        .withCleanup(true)
        .withTempDirectory(new File("/tmp"))
        .build();

// Read input file as an input stream and write sorted chunks.
List<File> sortedChunks;
InputStream input = new FileInputStream(inputFile);
try {
    sortedChunks = sort.writeSortedChunks(serializer.readValues(input));
} finally {
    input.close();
}

// Get a merge iterator over the sorted chunks. This will return the
// objects in sorted order. Note that the sorted chunks will be deleted 
// when the CloseableIterator is closed if 'cleanup' is set to true.
CloseableIterator<ObjectNode> sorted = sort.mergeSortedChunks(sortedChunks);
try {
    OutputStream output = new FileOutputStream(outputFile);
    try {
        serializer.writeValues(sorted, output);
    } finally {
        output.close();
    }
} finally {
    sorted.close();
}

The above example is a little involved as it needs to, in addition to doing the actual sorting, read the input file and write to the output file. For this it uses the provided serializer directly, which is not always want you want. The input and output data may both be serialized differently, and the format of the internal sorted chunks may be in a third format. The example splits the merge sort in two phases; first write sorted chunks, then create merge iterator over the sorted chunks.

The ExternalMergeSort<T> class has the following public instance methods:

List<File> writeSortedChunks(Iterator<T>)
File writeSortedChunk(Iterator<T>)
CloseableIterator<T> mergeSortedChunks(List<File>)

CloseableIterator<T> mergeSort(Iterator<T>)

The example used the first and the third method, but the fourth one can also be used instead.

Helpers

Sometimes, when you don't know the size of your input data, it may not always be neccessary to do an external sort as everything can fit into available memory and be sorted there. This is the case when the size is less than or equal to the chunkSize. For this scenario the ChunkSizeIterator<T> helper class comes in handy. Wrap your own Iterator<T> so that it can figure out if an external merge sort is neccessary:

Iterator<String> yourIterator = …;
ChunkSizeIterator csi = new ChunkSizeIterator(yourIterator, chunkSize);
if (csi.isMultipleChunks()) {
    // do external sort        
} else {
    // do in-memory sort
}

Note that the ExternalMergeSort.mergeSort(Iterator) method already has this optimization, so no need to do this if you use that method.

Maven dependencies

exmeso-jackson

The exmeso-jackson module can be added to your own project like this:

<dependency>
  <groupId>org.geirove.exmeso</groupId>
  <artifactId>exmeso-jackson</artifactId>
  <version>0.2</version>
</dependency>

exmeso-kryo

The exmeso-kryo module can be added to your own project like this:

<dependency>
  <groupId>org.geirove.exmeso</groupId>
  <artifactId>exmeso-kryo</artifactId>
  <version>0.2</version>
</dependency>

exmeso-msgpack

The exmeso-msgpack module can be added to your own project like this:

<dependency>
  <groupId>org.geirove.exmeso</groupId>
  <artifactId>exmeso-msgpack</artifactId>
  <version>0.2</version>
</dependency>