# Merge Sorted Files

This problem is motivated by the following scenario.  You are given 500 files,
each containing trade information for an S&P 500 company.  Each trade is encoded by
a line in the following format: `1232111,AAPL,30,456.12`.

The first number is the time of the trade expressed as the number of milliseconds 
since the start of the day's trading.  Lines within each file are sorted in 
increasing order of time.  The remaining values are the stock symbol, number of
shares, and price.  You are to create a single file containing all the trades from
the 500 files, sorted in order of increasing trade times.  The individual files
are of the order of 5-100 megabytes; the combined file will be of the order of
five gigabytes.  In the abstract, we are trying to solve the following problem:

**Write a program that takes as input a set of sorted sequences and computes the
union of these sequences as a sorted sequence.  For example, if the input is `[3,5,7]`,
`[0,6]`, and `[0,6,28]`, then the output is `[0,0,3,5,6,6,7,28]`.**

## Solution

A brute-force approach is to concatenate these sequences into a single array and then
sort it.  The time complexity is $O(n \log n)$, assuming there are $n$ elements in
total.
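
As a baseline for comparison, the brute-force idea can be sketched in a few lines.  The function name `merge_by_sorting` is a hypothetical one chosen for illustration.

```python
from itertools import chain


def merge_by_sorting(sorted_arrays):
    # Concatenate every input into one sequence, then sort the combined
    # elements.  The sortedness of the individual inputs is not exploited.
    return sorted(chain.from_iterable(sorted_arrays))
```

For the example input, `merge_by_sorting([[3, 5, 7], [0, 6], [0, 6, 28]])` yields the eight elements in sorted order, at the cost of holding all of them in memory at once.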

The brute-force approach does not use the fact that the individual sequences are
sorted.  We can take advantage of this fact by restricting our attention to the
first remaining element in each sequence.  Specifically, we repeatedly pick the
smallest element among the first remaining elements of the sequences.

A min-heap is ideal for maintaining a collection of elements when we need to add 
arbitrary values and extract the smallest element.
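
As a small illustration of those two operations, here is a sketch using Python's standard `heapq` module, which treats an ordinary list as a min-heap.  The values are arbitrary sample data.

```python
import heapq

min_heap = []
for value in [3, 0, 6]:
    # Insert: O(log k) for a heap holding k elements.
    heapq.heappush(min_heap, value)

# Extract-min: removes and returns the smallest element, also O(log k).
smallest = heapq.heappop(min_heap)

# The smallest remaining element always sits at index 0.
next_smallest = min_heap[0]
```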

For ease of exposition, we show how to merge sorted arrays, rather than files.  As
a concrete example, suppose there are three sorted arrays to be merged: `[3,5,7]`,
`[0,6]`, and `[0,6,28]`.  For simplicity, we show the min-heap as containing entries
from these three arrays.  In practice, we need additional information for each entry,
namely the array it is from, and its index in that array.  (In the file case we do
not need to explicitly maintain an index for the next unprocessed element in each
sequence: the file I/O library tracks the first unread entry in the file.)

The min-heap is initialized to the first entry of each array, i.e., it is `[3,0,0]`.
We extract the smallest entry, 0, and add it to the output, which is `[0]`.  Then we
add 6 to the min-heap, which is now `[3,0,6]`.  (We chose the 0 entry corresponding
to the third array arbitrarily; it would be perfectly acceptable to choose the one
from the second array.)  Next, extract 0, and add it to the output, which is `[0,0]`;
then add 6 to the min-heap, which is `[3,6,6]`.  Next, extract 3, and add it to the
output, which is `[0,0,3]`; then add 5 to the min-heap, which is `[5,6,6]`.  Next,
extract 5, and add it to the output, which is `[0,0,3,5]`; then add 7 to the min-heap,
which is `[7,6,6]`.  Next, extract 6, and add it to the output, which is
`[0,0,3,5,6]`; assuming 6 is selected from the second array, which has no remaining
elements, the min-heap is `[7,6]`.  Next, extract 6, and add it to the output, which
is `[0,0,3,5,6,6]`; then add 28 to the min-heap, which is `[7,28]`.  Next, extract
7, and add it to the output, which is `[0,0,3,5,6,6,7]`; the first array has no
remaining elements, so the min-heap is `[28]`.  Finally, extract 28, and add it to
the output, which is `[0,0,3,5,6,6,7,28]`; now all elements are processed and the
output stores the sorted elements.
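
The walkthrough can be replayed with a short sketch that records the heap contents after every extraction.  The function name `merge_with_trace` and the `(value, array index, position)` entry layout are assumptions made for illustration; they correspond to the additional per-entry information mentioned earlier.

```python
import heapq


def merge_with_trace(sorted_arrays):
    # Each heap entry is (value, array index, position within that array).
    min_heap = [(arr[0], i, 0) for i, arr in enumerate(sorted_arrays) if arr]
    heapq.heapify(min_heap)

    result, trace = [], []
    while min_heap:
        value, i, pos = heapq.heappop(min_heap)
        result.append(value)
        # Refill the heap from the array the extracted entry came from.
        if pos + 1 < len(sorted_arrays[i]):
            heapq.heappush(min_heap, (sorted_arrays[i][pos + 1], i, pos + 1))
        # Record the values currently in the heap, sorted for readability.
        trace.append(sorted(entry[0] for entry in min_heap))
    return result, trace
```

Because the entries are tuples, ties between equal values are broken by array index, so this sketch takes the 0 from the second array before the 0 from the third.  The set of values in the heap after each step is the same either way.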
 

```python
import heapq

def merge_sorted_arrays(sorted_arrays):
    min_heap = []
    # Builds a list of iterators for each array in sorted_arrays.
    sorted_arrays_iters = [iter(x) for x in sorted_arrays]

    # Puts first element from each iterator in min_heap.
    # None marks an exhausted iterator, so the inputs must not contain None.
    for i, it in enumerate(sorted_arrays_iters):
        first_element = next(it, None)
        if first_element is not None:
            heapq.heappush(min_heap, (first_element, i))

    result = []

    while min_heap:
        # Extracts the smallest entry and the index of the array it came from.
        smallest_entry, smallest_array_i = heapq.heappop(min_heap)
        smallest_array_iter = sorted_arrays_iters[smallest_array_i]
        result.append(smallest_entry)
        # Replaces it with the next element from the same array, if any.
        next_element = next(smallest_array_iter, None)
        if next_element is not None:
            heapq.heappush(min_heap, (next_element, smallest_array_i))

    return result

# Pythonic solution, uses the heapq.merge() method which takes multiple inputs.
def merge_sorted_arrays_pythonic(sorted_arrays):
    return list(heapq.merge(*sorted_arrays))
```


Let $k$ be the number of input sequences.  Then there are no more than $k$ elements
in the min-heap.  Both extract-min and insert take $O(\log k)$ time.  Hence, we can
do the merge in $O(n \log k)$ time. The space complexity is $O(k)$ beyond the space
needed to write the final result.  In particular, if the data comes from files and
is written to a file, instead of arrays, we would need only $O(k)$ additional 
storage.
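
For the file scenario that motivated the problem, a minimal sketch along the same lines is given here.  The names `trade_time` and `merge_trade_files` are hypothetical, and the sketch assumes every line follows the `time,symbol,shares,price` format and ends with a newline.

```python
import heapq
from contextlib import ExitStack


def trade_time(line):
    # Assumes the line format "time_ms,symbol,shares,price".
    return int(line.split(",", 1)[0])


def merge_trade_files(input_paths, output_path):
    with ExitStack() as stack:
        # Open every input; ExitStack closes them all on exit.
        files = [stack.enter_context(open(path)) for path in input_paths]
        out = stack.enter_context(open(output_path, "w"))
        # heapq.merge consumes its inputs lazily, so only one pending
        # line per input file is held at a time.
        for line in heapq.merge(*files, key=trade_time):
            out.write(line)
```

Comparing on the integer timestamp rather than on the raw line avoids ordering `10` before `5` lexicographically.  Some systems cap the number of simultaneously open files, so merging 500 inputs this way may require raising that limit.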

Alternatively, we could recursively merge the $k$ files, two at a time, using the
merge step from merge sort.  We would go from $k$ to $k / 2$, then $k / 4$, etc.
files.  There would be $\log k$ stages, and each has time complexity $O(n)$, so the
time complexity is the same as that of the heap-based approach, i.e., $O(n \log k)$.
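
A minimal sketch of this alternative on arrays follows.  It is written iteratively, stage by stage, rather than recursively, and the names `merge_two` and `merge_pairwise` are hypothetical.

```python
def merge_two(a, b):
    # Merge step from merge sort: O(len(a) + len(b)) time.
    merged, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            merged.append(a[i])
            i += 1
        else:
            merged.append(b[j])
            j += 1
    # At most one of these two slices is non-empty.
    merged.extend(a[i:])
    merged.extend(b[j:])
    return merged


def merge_pairwise(sorted_arrays):
    arrays = [list(a) for a in sorted_arrays]
    while len(arrays) > 1:
        # One stage: merge neighbouring pairs, roughly halving the count.
        next_stage = [
            merge_two(arrays[i], arrays[i + 1])
            for i in range(0, len(arrays) - 1, 2)
        ]
        if len(arrays) % 2:
            # An odd array out is carried over unchanged to the next stage.
            next_stage.append(arrays[-1])
        arrays = next_stage
    return arrays[0] if arrays else []
```

Unlike the heap-based merge, each stage here materializes intermediate results, so with files this approach would write and reread the data once per stage.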
