# Merge sorted entries

## Problems

**A**. You are given 500 files, each containing stock trade information for an S&P 500 company. Each trade is encoded as a line in the following format:  
1232111, APPL, 30, 456.12. 

The first number is the time of the trade, expressed as the number of milliseconds since the start of the day's trading. Lines within each file are sorted in increasing order of time. The remaining values are the stock symbol, number of shares, and price. You are to create a single file containing all the trades from the 500 files, sorted in order of increasing trade time. The individual files are on the order of 5 to 100 megabytes each; the combined file will be on the order of five gigabytes.

**B**. Given multiple log files, each containing many **ordered** log entries from an application, e.g.:
2020-11-01T01:12:29.023 VPN "some log information" 
2020-11-01T01:12:29.023 VPN "another log information"

Create a single log file from all the log sources, with entries ordered by timestamp.

## Abstract

Write a program that takes multiple sorted sequences as input and computes their union as a single sorted sequence.

Example 1:
```
Input:  [[3, 5, 7], [0, 6], [0, 6, 28]]
Output: [0, 0, 3, 5, 6, 6, 7, 28]
```

Example 2:
```
Input:  [[1,4,5],[1,3,4],[2,6]]
Output: [1,1,2,3,4,4,5,6]
```

**Constraints**:

* lists[i] is sorted in ascending order.
* 0 <= lists[i].length <= 500
* $-10^4 <= lists[i][j] <= 10^4$ 
* The sum of lists[i].length won't exceed $10^4$.

## Approach 1: Brute force

Concatenate all the sub-lists into one large list, then sort it.

**Time complexity**: $O(N \log N)$, where N is the total number of elements.

In [32]:
from typing import List
import itertools

def merge_sorted_arrays(sorted_arrays: List[List[int]]) -> List[int]:
    # sorted() already returns a new list, so no extra list() call is needed.
    return sorted(itertools.chain.from_iterable(sorted_arrays))

In [33]:
"""
Generate sample test data to test the runtime performance.
"""

import random

test_data = []
N = 1000

for i in range(N):
    l = random.choices(range(10000), k = random.randint(100, 500))
    l.sort()
    test_data.append(l)


In [34]:
%timeit merge_sorted_arrays(test_data)

50.1 ms ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


## Approach 2: Recursive merge sort

Recursively merge the k files, two at a time, using the merge step from merge sort. There are log k stages, and each stage touches all n elements once, so it costs O(n). The total time complexity is therefore O(n log k). The space complexity of any reasonable implementation of merge sort ends up being O(n).
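The pairwise strategy described above can be sketched for in-memory lists as follows. This is a minimal illustration rather than a tuned implementation; the names `merge_two` and `merge_pairwise` are chosen here for the example and do not come from the original notes.

```python
from typing import List


def merge_two(a: List[int], b: List[int]) -> List[int]:
    """Merge step of merge sort: combine two sorted lists into one."""
    merged = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            merged.append(a[i])
            i += 1
        else:
            merged.append(b[j])
            j += 1
    # At most one of these two tails is non-empty.
    merged.extend(a[i:])
    merged.extend(b[j:])
    return merged


def merge_pairwise(sorted_arrays: List[List[int]]) -> List[int]:
    """Split the list of arrays in half, merge each half, then merge the results."""
    if not sorted_arrays:
        return []
    if len(sorted_arrays) == 1:
        return list(sorted_arrays[0])
    mid = len(sorted_arrays) // 2
    return merge_two(merge_pairwise(sorted_arrays[:mid]),
                     merge_pairwise(sorted_arrays[mid:]))


print(merge_pairwise([[3, 5, 7], [0, 6], [0, 6, 28]]))
# [0, 0, 3, 5, 6, 6, 7, 28]
```

Halving the list of arrays at each level gives a recursion depth of about log k, which matches the number of stages in the complexity argument.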

## Approach 3: Heap

The brute-force approach does not use the fact that the individual sequences are sorted. We can instead repeatedly pick the smallest element among the first elements of the remaining part of each sequence.

A min-heap is ideal for maintaining a collection of elements when we need to add arbitrary values and extract the smallest element. In practice, we need **additional information for each entry, namely the array it is from, and its index in the array**.  

A concrete example is merging the lists [3, 5, 7], [0, 6] and [0, 6, 28]. Each heap entry is written as (value, index of the source list).

| remaining input                       | min-heap                | popped element | output                    | note                                                             |
| ------------------------------------- | ----------------------- | -------------- | ------------------------- | ---------------------------------------------------------------- |
| [3, 5, 7]<br />[0, 6]<br />[0, 6, 28] | [ ]                     |                | [ ]                       |                                                                  |
| [5, 7]<br />[6]<br />[6, 28]          | [(3,0), (0,1), (0,2)]   | (0,1)          | [0]                       | popping (0,2) first would also be fine                           |
| [5, 7]<br />[ ]<br />[6, 28]          | [(0,2), (3,0), (6,1)]   | (0,2)          | [0, 0]                    | min element (0,2) moved to the heap head                         |
| [5, 7]<br />[ ]<br />[28]             | [(3,0), (6,1), (6,2)]   | (3,0)          | [0, 0, 3]                 |                                                                  |
| [7]<br />[ ]<br />[28]                | [(5,0), (6,1), (6,2)]   | (5,0)          | [0, 0, 3, 5]              |                                                                  |
| [ ]<br />[ ]<br />[28]                | [(6,1), (6,2), (7,0)]   | (6,1)          | [0, 0, 3, 5, 6]           |                                                                  |
| [ ]<br />[ ]<br />[28]                | [(6,2), (7,0)]          | (6,2)          | [0, 0, 3, 5, 6, 6]        | list 1 is exhausted, so nothing was pushed after popping (6,1)   |
| [ ]<br />[ ]<br />[ ]                 | [(7,0), (28,2)]         | (7,0)          | [0, 0, 3, 5, 6, 6, 7]     |                                                                  |
| [ ]<br />[ ]<br />[ ]                 | [(28,2)]                | (28,2)         | [0, 0, 3, 5, 6, 6, 7, 28] |                                                                  |



In [35]:
import heapq

def merge_sorted_arrays(sorted_arrays: List[List[int]]) -> List[int]:
    min_heap = []
    sorted_arrays_iters = [iter(x) for x in sorted_arrays]

    # Initialize: put the first element from each iterator in min_heap.
    # None marks an exhausted iterator (safe here because elements are ints).
    for i, it in enumerate(sorted_arrays_iters):
        first_element = next(it, None)
        if first_element is not None:
            heapq.heappush(min_heap, (first_element, i))

    result = []
    while min_heap:
        # Pop the global minimum, then refill from the array it came from.
        smallest_entry, smallest_array_i = heapq.heappop(min_heap)
        smallest_array_iter = sorted_arrays_iters[smallest_array_i]
        result.append(smallest_entry)
        next_element = next(smallest_array_iter, None)
        if next_element is not None:
            heapq.heappush(min_heap, (next_element, smallest_array_i))
    return result


In [36]:
%timeit merge_sorted_arrays(test_data)

310 ms ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [37]:
# Pythonic solution: heapq.merge() is designed for exactly this kind of problem

def merge_sorted_array(sorted_arrays):
    return list(heapq.merge(*sorted_arrays))

In [38]:
%timeit merge_sorted_arrays(test_data)

314 ms ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
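Because `heapq.merge()` is lazy, the same idea can be applied to Problem A without loading the files into memory. The following is a rough sketch under some assumptions: every line starts with an integer timestamp followed by a comma, the files contain no blank lines, and the helper names `trade_time` and `merge_trade_files` are invented for this example.

```python
import heapq
from contextlib import ExitStack
from typing import Iterable


def trade_time(line: str) -> int:
    """Return the leading millisecond timestamp of a trade line."""
    return int(line.split(",", 1)[0])


def merge_trade_files(input_paths: Iterable[str], output_path: str) -> None:
    """Stream-merge time-sorted trade files into one time-sorted file."""
    with ExitStack() as stack:
        # ExitStack closes every opened file when the block exits.
        files = [stack.enter_context(open(p)) for p in input_paths]
        out = stack.enter_context(open(output_path, "w"))
        # heapq.merge keeps only one pending line per input file in memory.
        for line in heapq.merge(*files, key=trade_time):
            # The last line of a file may lack a trailing newline.
            out.write(line if line.endswith("\n") else line + "\n")
```

The `key` function compares timestamps as integers, since comparing the raw lines as strings would place "100" before "20". For Problem B, assuming fixed-width ISO 8601 timestamps like those in the example, a key such as `lambda line: line.split(" ", 1)[0]` should give the right order, because such timestamps sort correctly as plain strings.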


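**Interesting finding**: Python's built-in sort is actually pretty fast, about 6x faster than the heap-based merge in this test. CPython uses [Timsort](https://en.wikipedia.org/wiki/Timsort), which adapts to the structure of the input: it detects runs that are already sorted and merges them, so a concatenation of sorted lists is a favorable case. The sort also runs in C, whereas the heap loop executes in the Python interpreter, which likely contributes to the gap.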
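One way to see how Timsort benefits from input that is already partly sorted is to count comparisons instead of measuring time. The sketch below is only an illustration: the `Counted` wrapper, the `count_comparisons` helper, the run sizes, and the seed are all arbitrary choices made for this example. It sorts the same values twice, once as a concatenation of sorted runs and once shuffled.

```python
import random


class Counted:
    """Wrap a value and count every `<` comparison made while sorting."""
    comparisons = 0

    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        Counted.comparisons += 1
        return self.value < other.value


def count_comparisons(values):
    """Return how many comparisons sorted() makes for this ordering."""
    Counted.comparisons = 0
    sorted(Counted(v) for v in values)
    return Counted.comparisons


rng = random.Random(0)  # fixed seed, chosen arbitrarily
runs = [sorted(rng.choices(range(10000), k=64)) for _ in range(8)]
concatenated = [v for run in runs for v in run]
shuffled = concatenated[:]
rng.shuffle(shuffled)

print("concatenated sorted runs:", count_comparisons(concatenated))
print("same values, shuffled:   ", count_comparisons(shuffled))
```

The concatenated input should need noticeably fewer comparisons, because the sort only has to find the eight existing runs and merge them.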