Instead of a single emit buffer shared by the entire process, we really need one per core. This may require changing how we partition the input dataset, so that the per-core buffers together don't blow up the allocator.
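
A rough sketch of what this could look like, not the actual implementation. Everything here is hypothetical: `EmitBuffer`, `partition`, `run`, and the `total_budget` parameter are names made up for illustration, worker threads stand in for cores, and "flushing" just appends to a list. The idea it shows is one buffer per worker, one contiguous input chunk per worker, and a fixed total budget split across the buffers so memory does not grow with the core count.

```python
import os
from concurrent.futures import ThreadPoolExecutor


class EmitBuffer:
    """Hypothetical fixed-capacity emit buffer owned by exactly one worker."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []
        self.flushed = []  # stand-in for wherever flushed records really go

    def emit(self, key, value):
        self.items.append((key, value))
        if len(self.items) >= self.capacity:
            self.flush()

    def flush(self):
        self.flushed.extend(self.items)
        self.items.clear()


def partition(data, n_parts):
    """Split data into n_parts contiguous chunks of near-equal size."""
    base, extra = divmod(len(data), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + base + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def run(data, n_workers=None, total_budget=1024):
    """Emit (item, 1) for every input item, using one buffer per worker."""
    n_workers = n_workers or os.cpu_count() or 1
    # Assumption: a fixed total budget is divided across the buffers, so
    # adding workers shrinks each buffer instead of growing total memory.
    per_buffer = max(1, total_budget // n_workers)
    buffers = [EmitBuffer(per_buffer) for _ in range(n_workers)]
    chunks = partition(data, n_workers)

    def work(i):
        buf = buffers[i]  # no other worker touches this buffer, so no lock
        for item in chunks[i]:
            buf.emit(item, 1)
        buf.flush()  # drain whatever is left at the end of the chunk
        return buf.flushed

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(work, range(n_workers)))
```

Since each buffer has a single writer, the emit path needs no locking. The trade-off is that the partitioning now determines how evenly the buffers fill, which is why the input split may have to change along with the buffers.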