# Big Data
## Algorithms: Sorting, Recursion and Data Structures
## Victor P. Debattista March 2017


We are going to start with a very simple exercise in recursion just to get used to it, then implement a couple of sorting algorithms, one $O(n^2)$ and one $O(n \log n)$.

In [32]:
import numpy as np
import math
import random
import time

Write a function that computes the $n^{th}$ Fibonacci number, given $(F_0, F_1) = (1, 1)$.  By definition, the Fibonacci numbers satisfy $F_n = F_{n-1} + F_{n-2}$ for $n \geq 2$. You should use recursion to solve this problem, not a loop.
Print the first 10 Fibonacci numbers.

In [33]:
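def fibonacci(n):
    '''Computes the n-th Fibonacci number, with F_0 = F_1 = 1. '''
    if n == 0 or n == 1:
        return 1
    else:
        return fibonacci(n-1) + fibonacci(n-2)

# Print the first 10 Fibonacci numbers, F_0 to F_9
print([fibonacci(n) for n in range(10)])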

Let us start exploring sorting. As a first step, we create a list of N numbers, which we will use as the input for every sort below.

In [200]:
random.seed(22)
N = 10000
data = random.sample(range(N), N)

Our first sorting algorithm is the insertion sort.  How would you sort the list data into another list, data2, using the insertion sort?  (We want to use a second list so we preserve our original list.  Note that swapping in Python is done easily via tuples:
(A,B) = (B,A)
with no need for temporary variables.)  Calculate how long this took to run.

In [204]:
%%time
insertion_sort = data.copy()
for i in range(N-1):
    # Find the smallest value in the unsorted tail and swap it into position i.
    # (Strictly speaking, this select-and-swap scheme is a selection sort.)
    j = min(insertion_sort[i:])
    k = insertion_sort.index(j, i)
    (insertion_sort[i], insertion_sort[k]) = (insertion_sort[k], insertion_sort[i])

CPU times: user 2.23 s, sys: 0 ns, total: 2.23 s
Wall time: 2.23 s


In [205]:
print(insertion_sort[:10])

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
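
The cell above picks the minimum of the unsorted tail at each step. For comparison, here is a minimal sketch of a textbook insertion sort, which instead grows a sorted prefix by shifting each new element left until it sits in place. The function name `insertion_sorted` is hypothetical and not part of the original notebook; it is still $O(n^2)$ in the worst case.

```python
import random

def insertion_sorted(values):
    '''Return a sorted copy of values using insertion sort (hypothetical helper).'''
    result = list(values)  # copy, so the input list is preserved
    for i in range(1, len(result)):
        current = result[i]
        j = i - 1
        # Shift larger elements of the sorted prefix one slot to the right
        while j >= 0 and result[j] > current:
            result[j + 1] = result[j]
            j -= 1
        result[j + 1] = current  # drop the element into the gap
    return result

random.seed(22)
small = random.sample(range(20), 20)
print(insertion_sorted(small) == sorted(small))
```

A small list is used here so that the check runs instantly; the same function could be applied to `data`.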


In the next part we will develop the functionality of a heap.  Write a recursive function that, given a list arr, sifts up element k, ensuring that a heap structure is obtained.  The function should return the list.  Be careful with the index of the parent: it must work for both daughter nodes.

In [199]:
def sift_up(arr, k):
    '''Given a list arr, sifts up element k ensuring that a heap structure is obtained.'''
    parent_index = (k - 1) // 2 # Works for both daughters, 2*p+1 and 2*p+2
    if (k > 0) and (arr[k] > arr[parent_index]): # The root (k = 0) has no parent
        arr[k], arr[parent_index] = arr[parent_index], arr[k]
        arr = sift_up(arr, parent_index)
    return arr
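
As a quick sanity check of the parent-index arithmetic, the sketch below confirms that integer division maps both daughters of a node back to that node. The helper name `parent` is hypothetical and used only for illustration.

```python
def parent(k):
    '''Index of the parent of node k in an array-based binary heap (hypothetical helper).'''
    return (k - 1) // 2

# The daughters of node p live at 2*p + 1 (left) and 2*p + 2 (right)
for p in range(4):
    left, right = 2*p + 1, 2*p + 2
    print(p, parent(left), parent(right))
```

Each printed row should show the same index three times.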

Next write a function that, given a list arr, filled to element k, inserts (pushes) a new element N to it, preserving the heap structure.  So this will need to use your sift function from above.

In [161]:
def heap_insert(arr, k, N):
    ''' Given a list arr, inserts a new element N to it while preserving the heap structure. '''
    arr.append(N)
    k = k + 1 # The size of the heap has to be increased
    arr = sift_up(arr, k)
    return arr, k

With these two functions, given a list of numbers, turn it into a heap by building a function heapify.  This should work by pushing every element of a list onto the heap.

In [158]:
def heapify(arr):
    '''Turns a given list arr into a heap. '''
    heap = []
    l = -1
    for value in arr:
        heap, l = heap_insert(heap, l, value)
    return heap, l
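
To check that a list really is a max-heap, one can test that no daughter exceeds its parent. The following is a small sketch under that definition; `is_max_heap` is a hypothetical helper, not part of the original exercise.

```python
def is_max_heap(arr):
    '''True if every node is >= each of its daughters (hypothetical helper).'''
    return all(arr[(k - 1) // 2] >= arr[k] for k in range(1, len(arr)))

print(is_max_heap([9, 7, 8, 1, 3]))  # True: 9 >= 7, 8 and 7 >= 1, 3
print(is_max_heap([1, 7, 8]))        # False: the root is smaller than its daughters
```

Applied to the first element of the tuple returned by `heapify`, this check should return `True`.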

Now write a function to sift down for when we pop the maximum value off.  This is a recursive function.  Be careful about nodes having 0 or 1 daughters, and note that sifting down always involves a swap with the larger of the two daughters if 2 exist.

In [154]:
def sift_down(arr, k, l):
    '''Given a list arr, sifts down element k ensuring that a heap structure is obtained. l is the last leaf'''
    if (2*k + 1 <= l): # Check if the node has any daughters
        if (2*k + 2 > l): # Check if the node has only one daughter
            if arr[k] < arr[2*k+1]:
                arr[k], arr[2*k+1] = arr[2*k+1], arr[k]
        else: # This branch is for nodes with two daughters
            left_daughter = (arr[2*k+1] > arr[2*k+2])
            m = 2 # Defaults to the right daughter
            if left_daughter:
                m = 1 # Choose the left daughter instead
            if arr[k] < arr[2*k+m]:
                arr[k], arr[2*k+m] = arr[2*k+m], arr[k]
                arr = sift_down(arr, 2*k+m, l)
    return arr # Always return the list, even when node k has no daughters

Now add a function to pop the maximum value.  Remember that when popping the root, it will be replaced by the furthest leaf, which is then sifted down.  Return both the heap and its size.

In [136]:
def heap_pop(heap, l):
    '''Pops the maximum value in a heap (root), replacing it with the furthest leaf l, which is then sifted down. The
    root is put at the end of the list. Returns the heap and its size. '''
    heap[l], heap[0] = heap[0], heap[l]
    l = l - 1
    heap = sift_down(heap, 0, l)
    return heap, l

We have all the building blocks in place for a heap sort, so let's write that next.  We start by creating the heap, then repeatedly pop it, placing each popped item at the tail of the list that holds the heap.  In other words, we are going to sort the list in place, needing no extra storage.

In [151]:
def heap_sort(arr):
    '''Heap sort algorithm. Sorts a given list arr. '''
    heap, l = heapify(arr)
    for i in range(l, 0, -1): # Use the heap's own size rather than the global N
        heap, l = heap_pop(heap, i)
    return heap

In [203]:
%%time
heap_sorted = heap_sort(data)

CPU times: user 127 ms, sys: 4 ms, total: 131 ms
Wall time: 131 ms


In [194]:
print(heap_sorted[:10])

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
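
As an independent cross-check, the same idea can be sketched with the standard library's `heapq` module. Note the assumptions: `heapq` maintains a min-heap (the smallest element at the root), whereas the functions above build a max-heap, and this version returns a new list rather than sorting in place. The name `heapq_sorted` is hypothetical.

```python
import heapq
import random

def heapq_sorted(values):
    '''Return a sorted copy of values by repeatedly popping a min-heap (hypothetical helper).'''
    heap = list(values)   # copy, so the input list is preserved
    heapq.heapify(heap)   # rearrange the copy into a min-heap
    # Each pop removes the current smallest element
    return [heapq.heappop(heap) for _ in range(len(heap))]

random.seed(22)
sample = random.sample(range(1000), 1000)
print(heapq_sorted(sample)[:10])
```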


In [223]:
%%time
# Let's see how quickly Python does it.
python_sort = sorted(data)

CPU times: user 3.41 ms, sys: 0 ns, total: 3.41 ms
Wall time: 3.34 ms


Let's now do a bin sort.  We're going to be a bit looser with this one and directly use some of Python's sorting methods.  Start by writing a simple function that, given a value within a given range [low, high], finds the bin to place the element into if there are N bins.  If the value is out of range, some flag value should be returned.

In [208]:
def bin_index(val, low, high, N):
    '''Finds the bin to place an element if there are N bins. If the value is outside the range [low, high] returns
    -1 as a flag value.'''
    if (val < low) or (val > high):
        return -1
    else:
        tmp = (val - low) * N/(high - low)
        return min(int(tmp), N - 1) # val == high goes in the last bin, not in a non-existent bin N
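
To make the bin arithmetic concrete, here is a small sketch with 10 bins over the range [0, 100], so that each bin is 10 units wide. The name `bin_index_demo` is hypothetical; it mirrors the function above, with the assumption that a value exactly equal to the upper edge is placed in the last bin.

```python
def bin_index_demo(val, low, high, nbins):
    '''Bin index of val in [low, high] split into nbins equal bins, or -1 if out of range.'''
    if val < low or val > high:
        return -1
    # Clamp so that val == high falls in the last bin rather than at index nbins
    return min(int((val - low) * nbins / (high - low)), nbins - 1)

for val in (0, 5, 10, 99, 100, 101):
    print(val, bin_index_demo(val, 0, 100, 10))
```

Here the values 0 and 5 share bin 0, 10 starts bin 1, 99 and 100 both land in bin 9, and 101 is flagged with -1.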

In this final step we do a bin sort.  This is a bit more complicated.  The way to do this is to have a list of lists.  You can directly use Python's .sort method for lists, since this is purely illustrative.  Or you can use your heapsort from before if your code was general enough.

In [221]:
bin_sorted = data.copy()
nbins = 100
def bin_sort(arr, low, high, nbins):
    '''Bin sort algorithm. '''
    mlist = [] # Create an empty list of buckets
    for i in range(nbins):
        mlist.append([])
    for val in arr: # Populate the buckets
        i = bin_index(val, low, high, nbins)
        if i>= 0: # Ignore out of range values
            mlist[i].append(val)
    for i in range(nbins): # Sort the buckets
        if len(mlist[i]) > 1: # Don't sort bins that only contain one item
            mlist[i] = sorted(mlist[i])
    sorted_list = [c for v in mlist for c in v] # Flatten the list of lists
    return sorted_list

In [222]:
%%time
bin_sorted = bin_sort(data, 0, N, nbins)

CPU times: user 9.42 ms, sys: 0 ns, total: 9.42 ms
Wall time: 9.4 ms


In [220]:
print(bin_sorted[:10])

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]


Consider the quicksort.  Suppose it is given a sorted list.  If the pivot is always the leftmost value, what happens?  Suggest a solution.  You do not need to code this up.
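
One way to explore this question numerically is to count comparisons. The sketch below is only illustrative: the function `quicksort`, its `random_pivot` flag and the `counter` argument are hypothetical names, and it builds new lists rather than partitioning in place. With a leftmost pivot on already sorted input, every partition is maximally lopsided, so the number of comparisons grows as $n(n-1)/2$; choosing the pivot at random is one common remedy.

```python
import random

def quicksort(values, random_pivot=False, counter=None):
    '''Return a sorted copy of values; counter[0] accumulates element-vs-pivot comparisons.'''
    if counter is None:
        counter = [0]
    if len(values) <= 1:
        return list(values)
    # Leftmost pivot by default, or a randomly chosen one
    idx = random.randrange(len(values)) if random_pivot else 0
    pivot = values[idx]
    rest = values[:idx] + values[idx + 1:]
    counter[0] += len(rest)  # each remaining element is compared with the pivot
    smaller = [v for v in rest if v < pivot]
    larger = [v for v in rest if v >= pivot]
    return (quicksort(smaller, random_pivot, counter) + [pivot]
            + quicksort(larger, random_pivot, counter))

random.seed(22)
already_sorted = list(range(200))
leftmost_count, random_count = [0], [0]
quicksort(already_sorted, False, leftmost_count)
quicksort(already_sorted, True, random_count)
print(leftmost_count[0], random_count[0])
```

The list is kept short because the leftmost-pivot case also recurses once per element, which would exceed Python's default recursion limit for much longer sorted inputs.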