The goal of this problem is to implement the "Median Maintenance" algorithm (covered in the Week 3 lecture on heap applications). The text file contains a list of the integers from 1 to 10000 in unsorted order; you should treat this as a stream of numbers, arriving one by one. Letting $x_i$ denote the $i$th number of the file, the $k$th median $m_k$ is defined as the median of the numbers $x_1,\dots,x_k$. (So, if $k$ is odd, then $m_k$ is the $((k+1)/2)$th smallest number among $x_1,\dots,x_k$; if $k$ is even, then $m_k$ is the $(k/2)$th smallest number among $x_1,\dots,x_k$.)

In the box below you should type the sum of these 10000 medians, modulo 10000 (i.e., only the last 4 digits). That is, you should compute $(m_1 + m_2 + m_3 + \dots + m_{10000}) \bmod 10000$.
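The definition above can be checked directly with a brute-force sketch that sorts each prefix. This is only a reference for small inputs (it re-sorts at every step), and the helper names `kth_median` and `median_sum_mod` are made up for illustration.

```python
def kth_median(xs, k):
    """Median of the first k numbers of xs, following the definition above."""
    prefix = sorted(xs[:k])
    # Odd k: the ((k+1)/2)th smallest; even k: the (k/2)th smallest.
    # In 0-based indexing both cases are index (k - 1) // 2.
    return prefix[(k - 1) // 2]


def median_sum_mod(xs, mod=10000):
    """Sum of all prefix medians of xs, reduced modulo mod."""
    return sum(kth_median(xs, k) for k in range(1, len(xs) + 1)) % mod


stream = [1, 666, 10, 667, 100, 2, 3]
print([kth_median(stream, k) for k in range(1, len(stream) + 1)])
# -> [1, 1, 10, 10, 100, 10, 10]
print(median_sum_mod(stream))  # -> 142
```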

OPTIONAL EXERCISE: Compare the performance achieved by heap-based and search-tree-based implementations of the algorithm.
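As a possible starting point for the comparison, here is a minimal baseline that keeps the prefix in a sorted Python list via `bisect.insort`. This is a stand-in, not a search tree: each insertion shifts list elements and costs $O(k)$, whereas a balanced search tree augmented with subtree sizes would support insertion and selection in $O(\log k)$. The function name `running_medians_sorted` is an assumption for illustration.

```python
import bisect


def running_medians_sorted(stream):
    """Return the list of prefix medians, keeping the prefix in a sorted list."""
    sorted_prefix = []
    medians = []
    for x in stream:
        bisect.insort(sorted_prefix, x)  # keeps sorted_prefix in ascending order
        k = len(sorted_prefix)
        medians.append(sorted_prefix[(k - 1) // 2])  # same index rule for odd and even k
    return medians


values = [6331, 2793, 1640, 9290, 225, 625, 6195, 2303, 5685, 1354]
print(sum(running_medians_sorted(values)) % 10000)  # -> 9335
```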

In [1]:
# timer grabbed from 
# https://stackoverflow.com/questions/7370801/measure-time-elapsed-in-python
from timeit import default_timer as timer
class benchmark(object):
    def __init__(self, msg, fmt="%0.3g"):
        self.msg = msg
        self.fmt = fmt

    def __enter__(self):
        self.start = timer()
        return self

    def __exit__(self, *args):
        t = timer() - self.start
        print(("%s : " + self.fmt + " seconds") % (self.msg, t))
        self.time = t

# Median Maintenance

## Task
Given a sequence of numbers $x_1,\dots,x_n$, read the numbers one by one. <br>
We want to know, at each step $i$, the median of $\{x_1, \dots, x_i\}$, spending ${\cal O}(\log i)$ time per step. <br>

## Solution
Maintain two heaps:
1. Hlow: a max-heap (supports extract_max) that stores the smallest $\lceil i/2 \rceil$ elements.
2. Hhigh: a min-heap (supports extract_min) that stores the largest $\lfloor i/2 \rfloor$ elements.

If we keep the heaps balanced so that `len(Hlow)` either equals `len(Hhigh)` or exceeds it by exactly one, then according to the definition given above, the median is always the largest element in Hlow. <br>
Python's heapq module only provides a min-heap (extract_min), so Hlow stores the smaller half as negated numbers; the largest element of that half is then `-Hlow[0]`.
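The negation trick can be seen in isolation in the short sketch below; the variable names are illustrative only.

```python
import heapq

low = []  # min-heap of negated values, which behaves like a max-heap
for v in [5, 1, 8, 3]:
    heapq.heappush(low, -v)

largest = -low[0]               # peek at the maximum without removing it
popped = -heapq.heappop(low)    # extract_max
next_largest = -low[0]

print(largest, popped, next_largest)  # -> 8 8 5
```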

In [2]:
import heapq

Hlow, Hhigh = [], []

def median_maintenance_insert(x):
    """ Insert value x to the heaps and return the median. """
    if not Hlow:
        heapq.heappush(Hlow, -x)
    else:
        largest_low = -Hlow[0]  # the largest number in the smaller half
        if largest_low < x:
            # x belongs to the larger half
            if len(Hhigh) == len(Hlow):
                # Hhigh would outgrow Hlow: move the smallest of Hhigh and x into Hlow
                p = heapq.heappushpop(Hhigh, x)
                heapq.heappush(Hlow, -p)
            else:                                # no need to rebalance
                heapq.heappush(Hhigh, x)
        else:
            # x belongs to the smaller half
            if len(Hlow) - len(Hhigh) == 1:
                # Hlow would exceed Hhigh by two: move the largest of Hlow and x into Hhigh
                p = heapq.heappushpop(Hlow, -x)
                heapq.heappush(Hhigh, -p)
            else:                                # no need to rebalance
                heapq.heappush(Hlow, -x)
    
    return -Hlow[0]

def reset_heaps():
    global Hlow, Hhigh
    Hlow, Hhigh = [], []
    return

def median_maintenance_list(list_iter):
    """ Insert every value of list_iter; return the sum of medians modulo 10000. """
    reset_heaps()
    medians = []
    for i in list_iter:
        medians.append(median_maintenance_insert(i))
        print(Hlow, Hhigh)  # trace of both heaps after each insertion
    modulo = sum(medians) % 10000
    print("List of medians: ", medians)
    print("Sum of medians modulo 10000: {0}".format(modulo))
    return modulo

def median_maintenance_file(filename):
    """ Print the sum of medians for the numbers stored in filename. """
    # assume each line contains only one number in the file
    reset_heaps()
    sum_medians = 0
    with open(filename, 'r') as f:  # close the file when done
        for line in f:
            x = int(line.strip())
            sum_medians += median_maintenance_insert(x)
    print("Sum of medians: {0}, sum of medians modulo 10000: {1}".format(sum_medians, sum_medians % 10000))
    return

In [3]:
# test cases
test1 = [1,666,10,667,100,2,3]
answ1 = 142

test2 = [6331,2793,1640,9290,225,625,6195,2303,5685,1354]
answ2 = 9335

assert median_maintenance_list(test1) == answ1, "Did not pass test1."
assert median_maintenance_list(test2) == answ2, "Did not pass test2."

[-1] []
[-1] [666]
[-10, -1] [666]
[-10, -1] [666, 667]
[-100, -1, -10] [666, 667]
[-10, -1, -2] [100, 667, 666]
[-10, -3, -2, -1] [100, 667, 666]
List of medians:  [1, 1, 10, 10, 100, 10, 10]
Sum of medians modulo 10000: 142
[-6331] []
[-2793] [6331]
[-2793, -1640] [6331]
[-2793, -1640] [6331, 9290]
[-2793, -1640, -225] [6331, 9290]
[-1640, -625, -225] [2793, 9290, 6331]
[-2793, -1640, -225, -625] [6195, 9290, 6331]
[-2303, -1640, -225, -625] [2793, 6195, 6331, 9290]
[-2793, -2303, -225, -625, -1640] [5685, 6195, 6331, 9290]
[-2303, -1640, -225, -625, -1354] [2793, 5685, 6331, 9290, 6195]
List of medians:  [6331, 2793, 2793, 2793, 2793, 1640, 2793, 2303, 2793, 2303]
Sum of medians modulo 10000: 9335


In [None]:
median_maintenance_file("Median.txt")
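If `Median.txt` is not at hand, the heap approach can still be timed on a synthetic stream. The sketch below is an assumption-laden stand-in: it uses a shuffled permutation of 1 to 10000 with a fixed seed instead of the real file, and a compact class-based variant of the two-heap idea (the hypothetical `MedianMaintainer`) rather than the functions above, so it has no global state. Its result will not match the answer for the actual file.

```python
import heapq
import random
from timeit import default_timer as timer


class MedianMaintainer:
    """Two-heap median maintenance; a variant sketch, not the notebook's function."""

    def __init__(self):
        self.low = []   # negated values: max-heap of the smaller half
        self.high = []  # min-heap of the larger half

    def insert(self, x):
        # Route x through low: the largest of low and x moves to high.
        heapq.heappush(self.high, -heapq.heappushpop(self.low, -x))
        # Restore len(low) - len(high) in {0, 1}.
        if len(self.high) > len(self.low):
            heapq.heappush(self.low, -heapq.heappop(self.high))
        return -self.low[0]


def median_sum(stream):
    """Sum of the running medians of stream."""
    mm = MedianMaintainer()
    return sum(mm.insert(x) for x in stream)


numbers = list(range(1, 10001))
random.Random(0).shuffle(numbers)  # synthetic stream, NOT the contents of Median.txt

start = timer()
total = median_sum(numbers)
elapsed = timer() - start
print("heap-based: %0.3g seconds, sum mod 10000 = %d" % (elapsed, total % 10000))
```

Each insertion performs a constant number of heap operations, so the whole stream costs $O(n \log n)$.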

# Notes on red-black trees

A red–black tree is a kind of self-balancing binary search tree.

## Invariants of a red-black tree
1. Each node is either red or black.
2. The root node is black.
3. No two red nodes in a row, that is, red nodes have only black children.
4. Every root-NULL path has the same number of black nodes. (A NULL is the empty position where an unsuccessful search ends.)
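The four invariants can be verified mechanically. The sketch below assumes a hypothetical `Node` class with a `color` field holding `"red"` or `"black"`; it checks only the colouring invariants, not the search-tree ordering of the keys, and it counts each NULL as one black node, which shifts every path count by the same constant.

```python
class Node:
    """Hypothetical tree node used only for this illustration."""

    def __init__(self, key, color, left=None, right=None):
        self.key = key
        self.color = color  # "red" or "black"
        self.left = left
        self.right = right


def black_height(node):
    """Black-node count on every path from node down to a NULL.

    Raises ValueError if invariant 1, 3 or 4 is violated below node.
    """
    if node is None:
        return 1  # convention: a NULL counts as black
    if node.color not in ("red", "black"):
        raise ValueError("invariant 1: node is neither red nor black")
    if node.color == "red":
        for child in (node.left, node.right):
            if child is not None and child.color == "red":
                raise ValueError("invariant 3: red node with a red child")
    left_height = black_height(node.left)
    right_height = black_height(node.right)
    if left_height != right_height:
        raise ValueError("invariant 4: paths with different black counts")
    return left_height + (1 if node.color == "black" else 0)


def is_red_black(root):
    """True if the tree rooted at root satisfies invariants 1 to 4."""
    if root is not None and root.color != "black":
        return False  # invariant 2
    try:
        black_height(root)
    except ValueError:
        return False
    return True


good = Node(2, "black", Node(1, "red"), Node(3, "red"))
chain = Node(1, "black", None, Node(2, "black"))  # right path has more black nodes
print(is_red_black(good), is_red_black(chain))  # -> True False
```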

## Operations