Unit Testing Hadoop Mappers and Reducers

Why test?

The purpose of testing is to speed up development.

It might cost about $10 a week to do a basic cleanup of Freebase in AWS EMR, and seen that way, AWS EMR is very cheap. If accidents and mistakes force you to run that job 10 times while debugging, however, that is a huge waste of time and money.

With unit testing of the mappers and reducers, this cycle can be greatly sped up. You can make small changes to the code, have your IDE compile incremental changes, and run tests in the debugger to create an interactive environment where you can test the effects of changes in seconds rather than minutes.

How?

There is an MRUnit library for this, but so far I've been using Mockito with great success. In particular, it works in the case where MultipleOutputs is used. This documentation describes the state of the code at tag t20130731b.

Mocking Hadoop

The interfaces Hadoop presents for writing to the primary output stream and to named output streams are somewhat irregular: I have to write

someContext.write(key,value)

to write to the main stream and write

thatMultipleOutputs.write(name,key,value)

to write to a named output stream. To abstract the general idea of "something that accepts key-value pairs", I created an adapter interface.

public interface KeyValueAcceptor<K,V> {
	public void write(K k,V v) throws IOException,InterruptedException;
	public void close() throws IOException, InterruptedException;
}
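
To give the idea, here is a sketch of the two implementations that appear in setup() below; the real classes in Infovore may differ in detail, and each would live in its own source file.

import java.io.IOException;
import org.apache.hadoop.mapreduce.TaskInputOutputContext;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Adapts the task context, i.e. the primary output stream
class PrimaryKeyValueAcceptor<K,V> implements KeyValueAcceptor<K,V> {
	private final TaskInputOutputContext<?,?,K,V> context;

	PrimaryKeyValueAcceptor(TaskInputOutputContext<?,?,K,V> context) {
		this.context=context;
	}

	public void write(K k,V v) throws IOException,InterruptedException {
		context.write(k,v);
	}

	public void close() throws IOException, InterruptedException {
		// nothing to do; Hadoop closes the primary output itself
	}
}

// Adapts one named output stream of a MultipleOutputs
class NamedKeyValueAcceptor<K,V> implements KeyValueAcceptor<K,V> {
	private final MultipleOutputs mos;
	private final String name;

	NamedKeyValueAcceptor(MultipleOutputs mos,String name) {
		this.mos=mos;
		this.name=name;
	}

	public void write(K k,V v) throws IOException,InterruptedException {
		mos.write(name,k,v);
	}

	public void close() throws IOException, InterruptedException {
		mos.close();
	}
}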

At the moment we do a very primitive kind of dependency injection. When Hadoop runs a Mapper like PSE3Mapper.java, it calls the setup() method to initialize the class:

	@Override
	protected void setup(Context context) throws IOException,
			InterruptedException {
		super.setup(context);
		mos=new MultipleOutputs(context);
		accepted=new PrimaryKeyValueAcceptor(context);
		rejected=new NamedKeyValueAcceptor(mos,"rejected");
	}
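
Hadoop also requires that a MultipleOutputs be closed when the task finishes, or output buffered for the named streams can be lost. A matching cleanup() method would look something like this (a sketch; check it against the actual class):

	@Override
	protected void cleanup(Context context) throws IOException,
			InterruptedException {
		// flushes and closes the named output streams
		mos.close();
		super.cleanup(context);
	}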

Note that the fields set in setup() are all declared with default access (no private, public, or protected modifier):

   MultipleOutputs mos;	
   KeyValueAcceptor<Node,NodePair> accepted;
   KeyValueAcceptor<Text,Text> rejected;

This allows other classes in the same package as PSE3Mapper to access these fields. Some people would prefer that these be private or protected, and to accommodate this we could use Spring.
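
For instance, if the fields were private, Spring's test utilities could inject the mocks reflectively. A hypothetical one-liner (spring-test's ReflectionTestUtils is a real class, but this is not what the project currently does):

import org.springframework.test.util.ReflectionTestUtils;

// set a private field on the mapper without widening its access
ReflectionTestUtils.setField(pse3mapper, "accepted", mock(KeyValueAcceptor.class));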

Writing tests

As it is, it is dead simple to replace these with mock objects by writing

	@Before
	public void setup() {
		pse3mapper=new PSE3Mapper();
		pse3mapper.mos=mock(MultipleOutputs.class);
		pse3mapper.accepted=mock(KeyValueAcceptor.class);
		pse3mapper.rejected=mock(KeyValueAcceptor.class);
	}

in the setup() method of the test class for PSE3Mapper. The mock() method works because we did the static import

import static org.mockito.Mockito.*;

at the top of the file. At this point it is also easy to write a test

	@Test
	public void acceptsAGoodTriple() throws IOException, InterruptedException {
		pse3mapper.map(
				new LongWritable(944L),
				new Text("<http://example.com/A>\t<http://example.com/B>\t<http://example.com/C>."),
				mock(Context.class));
		verify(pse3mapper.accepted).write(
				Node_URI.createURI("http://example.com/A"),
				new NodePair(
						Node_URI.createURI("http://example.com/B"),
						Node_URI.createURI("http://example.com/C")));
		verifyNoMoreInteractions(pse3mapper.accepted);
		verifyNoMoreInteractions(pse3mapper.rejected);
	}

This test sends a key-value pair to the mapper object. Normally the key is a sequence number within a file (that's how it is by default in Hadoop), so for the test it is OK to pass in an arbitrary number. We then pass in the string form of a PrimitiveTriple, converted to a Hadoop Text object, which occupies less space than a String. The test works if we mock up a Context interface that does nothing at all.

When the map operation is run, the mock objects record what was done to them, and we can later check that record to see what happened.

Anyway, with a delightful application of generics, Mockito allows you to express your requirements as follows:

verify(pse3mapper.accepted).write(
  Node_URI.createURI("http://example.com/A"),
  new NodePair(
     Node_URI.createURI("http://example.com/B"),
     Node_URI.createURI("http://example.com/C")
  ));

This verifies that the write method was called on accepted for this valid triple, which is correct. Then we need to check that nothing else went on between the mapper and the mock streams:

verifyNoMoreInteractions(pse3mapper.accepted);
verifyNoMoreInteractions(pse3mapper.rejected);
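
The same machinery makes the rejection path just as easy to test. Here is a sketch of a companion test; the exact key and value the mapper writes to the rejected stream are an implementation detail, so Mockito's any() matchers stand in for exact values:

	@Test
	public void rejectsABadTriple() throws IOException, InterruptedException {
		pse3mapper.map(
				new LongWritable(945L),
				new Text("this is not a triple"),
				mock(Context.class));
		// something lands in the rejected stream...
		verify(pse3mapper.rejected).write(any(Text.class), any(Text.class));
		verifyNoMoreInteractions(pse3mapper.rejected);
		// ...and nothing reaches the primary output
		verifyNoMoreInteractions(pse3mapper.accepted);
	}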

Go to the Mockito site to learn the details.

Performance

If you have a large number of @Tests inside a test file, Mockito is efficient. Although there's a substantial cost in creating a mock class, you pay it just once, the first time the mock is used, not each time one is created. (Mockito does lazy initialization.) It costs perhaps 0.5 seconds to initialize the mock class, but after that it is cheap to use -- once Mockito is initialized, tests run in a millisecond or so.

Controlling the risk from the default scope

Mappers, Reducers, and Tools are defined in their own packages underneath

com.ontology2.bakemono.mappers
com.ontology2.bakemono.reducers
com.ontology2.bakemono.tools

An example of this is the class com.ontology2.bakemono.mappers.pse3.PSE3Mapper.

All of these extra directories probably slow down the build, but they let us write test classes in the same package as the classes under test, so the tests have access to fields in the default scope.
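
Concretely, a test class just declares the same package, even though it lives under the test source tree. A sketch, assuming the usual Maven layout (the test class name here is illustrative):

// src/test/java/com/ontology2/bakemono/mappers/pse3/PSE3MapperTest.java
package com.ontology2.bakemono.mappers.pse3;

public class PSE3MapperTest {
	PSE3Mapper pse3mapper;
	// same package as PSE3Mapper, so the default-access fields
	// mos, accepted, and rejected are visible from these tests
}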

Nothing stops somebody developing a downstream project from adding new classes to the pse3 package; however, it's generally bad practice to do that, and developers who do take responsibility for whatever problems they make for themselves.

At this point, most of the bakemono project is in one giant package. As I understand the relationship with Hadoop better and see what kinds of classes emerge, I'm sure new packages will be created and the adapter classes above will find a home in one of them. If the project is subdivided into packages, the possible scope leakage through the default scope shrinks greatly. One logical idea is to bunch the tools together in separate packages, possibly with a 'tools' package organizing them in the hierarchy. The mappers and reducers could be similarly packaged. This would encourage the independent testing of all these components, which would be a great thing.

Summary

Unit tests can greatly speed the development of Hadoop applications. By cutting the feedback cycle from hours or minutes down to seconds, unit tests make developers productive. At the same time, risk is controlled, because problems can be found and solved outside the production system -- in some cases developers can work with unit tests alone and never deal with the bother of maintaining a development cluster.

This document describes practices that are good in general, but that, specifically, are standard for development of unit tests in the Infovore system. This is a living document which we will update as the project goes forward.