In [1]:
import time
import random

# set seed
random.seed(a=100)

# create default lists
short_list = list(random.sample(range(100000), 10))
long_list = list(random.sample(range(1000000), 10000))

## Sorting
[Sorting Algorithm Wiki](https://en.wikipedia.org/wiki/Sorting_algorithm)

**Algorithm**: set of steps necessary for a computer to accomplish a specific task

**Two lists**: return list ordered from smallest to largest in least amount of time
- short_list: use to validate algorithm
- long_list: use to compare computation times across sorting strategies
- Duplicates maintain original order (i.e. the sort is stable)
- Efficiency: measured in runtime, also discussed in terms of steps
    
### Example: sorting a hand of cards
- Sorting makes it easier to know which cards you have and to access the cards when necessary
- Can sort by:
    - One card at a time, sequencing as you go
    - Move through hand, organizing card by card
    - Random shuffling until they are sorted (obviously inefficient method but technically it exists)
- Different methods work best for different games & different sized hands

## Insertion Sort
- Maintain two lists: original list and new list that will be ordered
- Take elements from original list and move through new list, stopping and inserting element where it goes
    - Place in position ahead of the first element in the new list larger than chosen element
    - If none are larger, place at the end
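
The two-list description above can be sketched directly, as a minimal illustration. The name `insert_sort_two_lists` is hypothetical and not part of the original notebook; the notebook's own implementation works by swapping elements within a single list instead.

```python
def insert_sort_two_lists(original):
    """Build a new ordered list; the original list is left unmodified."""
    ordered = []
    for element in original:
        # default position: none are larger, so place at the end
        position = len(ordered)
        # find the first entry in the new list larger than the chosen element
        for index, value in enumerate(ordered):
            if value > element:
                position = index
                break
        # place the element ahead of that entry
        ordered.insert(position, element)
    return ordered

print(insert_sort_two_lists([3, 7, 2, 4]))  # [2, 3, 4, 7]
```

Because the scan stops only at a strictly larger entry, equal elements stay in their original order.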

In [2]:
def insert_sort(input_list):
    # copy input to a new list, leave original unmodified
    new_list = list(input_list)
    
    # iterate through list
    for i in range(len(new_list)):
        # assign place to a variable
        j = i
        
        # move through the list as long as the previous position is larger
        # than the current element
        while j > 0 and new_list[j - 1] > new_list[j]:
            
            # swap places
            new_list[j - 1], new_list[j] = new_list[j], new_list[j - 1]
            
            # reduce j by one
            j = j - 1
    return new_list

In [3]:
start_time = time.time()

# run insertion sort function
insert_sort(short_list)

# print results and runtime
print('%s seconds' % (time.time() - start_time))
print(insert_sort(short_list))

0.00013899803161621094 seconds
[19093, 22904, 45840, 51515, 56821, 59628, 60231, 66435, 92473, 95939]


In [4]:
# test on long list
start_time = time.time()

insert_sort(long_list)
print('%s seconds' % (time.time() - start_time))

7.552940130233765 seconds


- Sorting works, but doesn't scale very well
- If list is already ordered:
    - This sort takes about *n* steps to complete
    - Iterates through the list once without swapping anything
- If list is in reverse order (worst case):
    - Takes on the order of n-squared steps, or $\mathcal{O}(n^2)$ in big O notation
    - Have *n* elements, and each one is potentially compared against every element already sorted before it lands in place
    - Computational intensity increases very quickly as *n* grows
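
One way to see the gap between the best and worst case is to count comparisons rather than measure time. The helper below is a sketch that mirrors the swapping logic of `insert_sort`; the name `count_comparisons` is an assumption for illustration only.

```python
def count_comparisons(values):
    """Insertion-sort a copy of values and return the number of comparisons made."""
    items = list(values)
    comparisons = 0
    for i in range(len(items)):
        j = i
        while j > 0:
            comparisons += 1
            # stop as soon as the previous element is not larger
            if items[j - 1] <= items[j]:
                break
            items[j - 1], items[j] = items[j], items[j - 1]
            j -= 1
    return comparisons

n = 1000
print(count_comparisons(range(n)))         # already ordered: n - 1 = 999
print(count_comparisons(range(n, 0, -1)))  # reversed: n * (n - 1) / 2 = 499500
```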

## Merge Sort
- Merging two small sorted lists into one large sorted list is faster
- Overview
    - Breaks large list into single element sublists (which are inherently ordered)
    - Merges single element lists into ordered pairs, reading from one end to preserve order
    - Repeats this process to arrive at a sorted list
    
### Example
List = [3, 7, 2, 4]

Step 1: [3] [7] [2] [4]

Step 2: [3, 7] [2, 4]

Step 3: [2, 3, 4, 7]
<br>
- Any merge only has to look at the leading entry from each prior list
- Final merge starts by comparing only 3 to 2, since it's already known that 7 and 4 are larger than the entries ahead of them in their lists
- Don't have to handle large amounts of unordered data
- Divide and conquer technique
    - Insertion sort attempts to solve the whole problem in one piece
    - Breaking task into smaller ones provides significant efficiency gains
- Tradeoff: insertion sort is easier to visualize and write; merge sort is more efficient but its logic is harder to follow
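
The three steps in the example can be reproduced with a short bottom-up sketch. It uses the standard library's `heapq.merge` as a stand-in for a hand-written merge, and `merge_steps` is a hypothetical helper name; the notebook's own version is recursive rather than bottom-up.

```python
from heapq import merge as merge_sorted  # standard-library merge of sorted inputs

def merge_steps(values):
    """Return the sublists that exist after each round of pairwise merging."""
    sublists = [[v] for v in values]  # Step 1: single-element sublists
    steps = [sublists]
    while len(sublists) > 1:
        merged = []
        # merge neighbouring sublists two at a time
        for k in range(0, len(sublists), 2):
            pair = sublists[k:k + 2]
            merged.append(list(merge_sorted(*pair)))
        sublists = merged
        steps.append(sublists)
    return steps

for number, step in enumerate(merge_steps([3, 7, 2, 4]), start=1):
    print('Step {}: {}'.format(number, step))
```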

In [5]:
def merge(a, b):
    # merge function, takes two ordered lists and merges them into one
        
    # check for empty list
    if len(a) == 0 or len(b) == 0:
        return a or b
    
    # start with an empty result
    result = []
    # track two indexes
    i, j = 0, 0
    # set while condition to iterate only for the length of the two lists
    while (len(result) < len(a) + len(b)):
        # if next element in a is lower or equal, append to result
        # (taking from a on ties keeps duplicates in their original order)
        if a[i] <= b[j]:
            result.append(a[i])
            i += 1
        # otherwise append next element in b
        else:
            result.append(b[j])
            j += 1
        # when one list is empty, append everything from the other & stop
        if i == len(a) or j == len(b):
            result.extend(a[i:] or b[j:])
            break

    return result

def merge_sort(lst):
    if len(lst) < 2:
        return lst
    
    mid = len(lst) // 2
    a = merge_sort(lst[:mid])
    b = merge_sort(lst[mid:])
    
    return merge(a, b)

In [6]:
# test on short list
start_time = time.time()

merge_sort(short_list)

print('{:.10f} seconds'.format(time.time() - start_time))
print(merge_sort(short_list))

0.0000650883 seconds
[19093, 22904, 45840, 51515, 56821, 59628, 60231, 66435, 92473, 95939]


In [7]:
# test on long list
start_time = time.time()
merge_sort(long_list)
print('{:.10f} seconds'.format(time.time() - start_time))

0.0537347794 seconds


This algorithm is recursive:
- Function calls itself & runs until stopping condition is met
- This creates multiple layers of lists to merge together
- Recursion is a common technique: a way to keep an algorithm running until the problem is solved without having to specify the number of times something should run
- Much faster, with lower time complexity
    - Cuts down on number of comparisons necessary since it's known that shorter lists are already sorted
    - $\mathcal{O}(n\log{}n)$ instead of $\mathcal{O}(n^2)$
    - Scaling is quasilinear instead of quadratic
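
To get a feel for quasilinear versus quadratic scaling, the rough step counts can be compared directly. This is only an arithmetic sketch: `growth` is a hypothetical helper, and big O notation ignores the constant factors that real runtimes include.

```python
import math

def growth(n):
    """Return (n * log2(n), n ** 2) as rough step counts; constants are ignored."""
    return n * math.log2(n), n ** 2

for n in (10, 10000, 1000000):
    quasilinear, quadratic = growth(n)
    print('n = {:>7}: n log n ~ {:.0f}, n^2 = {}, ratio ~ {:.0f}'.format(
        n, quasilinear, quadratic, quadratic / quasilinear))
```

The ratio between the two grows with *n*, which is why the difference is barely visible on `short_list` but large on `long_list`.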
    
## Default Sort
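Python has a built-in `sorted()` function and a `list.sort()` method
- Efficient: implemented in C inside CPython, the standard Python interpreter
- Algorithm is Timsort-based, a hybrid of merge sort and insertion sort
- Faster than sorting code written in pure Python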
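
A brief sketch of how the two built-ins differ; the sample values and the `cards` list are made up for illustration.

```python
values = [5, 2, 9, 1]

# sorted() is a built-in function: returns a new list, leaves the input alone
ordered = sorted(values)
print(ordered)  # [1, 2, 5, 9]
print(values)   # [5, 2, 9, 1]

# list.sort() is a method: sorts in place and returns None
values.sort()
print(values)   # [1, 2, 5, 9]

# both are stable: items with equal keys keep their original order
cards = [('hearts', 3), ('spades', 1), ('clubs', 3)]
by_rank = sorted(cards, key=lambda card: card[1])
print(by_rank)  # [('spades', 1), ('hearts', 3), ('clubs', 3)]
```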

In [8]:
start_time = time.time()

sorted(long_list)
print('{:.10f} seconds'.format(time.time() - start_time))

0.0003330708 seconds


## Drill

Pick an algorithm from the wiki page (sticking to simpler ones), implement it in Python, and see how sorting short & long lists compares.

### Quicksort
- Divide and conquer algorithm
- Relies on a partition operation
    - Pivot: element selected to partition an array
    - Elements smaller than the pivot are moved before it, greater elements moved after it
    - Lesser & greater sublists are recursively sorted
    - $\mathcal{O}(n\log{}n)$ on average
        - Caveat: $\mathcal{O}(n^2)$ for worst case
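
The partition idea can be sketched in a few lines before looking at an in-place version. This variant is an illustration only: `quicksort_simple` is a hypothetical name, it builds new lists instead of partitioning in place, and it picks the middle element as pivot, which is an arbitrary choice.

```python
def quicksort_simple(lst):
    """Return a new sorted list; the input is left unmodified."""
    if len(lst) < 2:
        return list(lst)
    pivot = lst[len(lst) // 2]  # middle element as pivot (assumption)
    lesser = [x for x in lst if x < pivot]
    equal = [x for x in lst if x == pivot]
    greater = [x for x in lst if x > pivot]
    # lesser and greater sublists are recursively sorted
    return quicksort_simple(lesser) + equal + quicksort_simple(greater)

print(quicksort_simple([3, 7, 2, 4]))  # [2, 3, 4, 7]
```

It uses more memory than an in-place partition, but the structure of the algorithm is easier to see.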

In [26]:
# create fresh, unsorted lists
# otherwise will get first hand experience with worst case complexity:
# with the first element as pivot, already-sorted input takes quadratic time
# (works on short list but long list hits recursion limit)
short_list = list(random.sample(range(1000000), 10))
long_list = list(random.sample(range(1000000), 10000))

def quicksort(lst):
    # sorts lst in place and returns None
    quick_sorter(lst, 0, len(lst) - 1)

def partition(lst, low, high):
    # set first element as pivot
    pivot = lst[low]
    # set left & right markers
    left = low + 1
    right = high
    
    while True:
        while left <= right and lst[left] <= pivot:
            left += 1
        while right >= left and lst[right] >= pivot:
            right -= 1
        # unless we're done, swap left & right    
        if right < left:
            break
        else:
            lst[left], lst[right] = lst[right], lst[left]
            
    lst[low], lst[right] = lst[right], lst[low]
    return right

def quick_sorter(lst, low, high):
    if low < high:
        partition_index = partition(lst, low, high)
        # sort elements before and after partition index
        quick_sorter(lst, low, partition_index - 1)
        quick_sorter(lst, partition_index + 1, high)

In [27]:
start_time = time.time()
quicksort(short_list)
print('{:.10f} seconds'.format(time.time() - start_time))
print(short_list)

0.0002169609 seconds
[28097, 79676, 309638, 449022, 474720, 532926, 601049, 633216, 693849, 969178]


In [28]:
start_time = time.time()
quicksort(long_list)
print('{:.10f} seconds'.format(time.time() - start_time))
print(long_list[:25])

0.0317590237 seconds
[14, 74, 245, 254, 283, 337, 349, 371, 462, 514, 672, 1172, 1207, 1276, 1314, 1454, 1601, 1639, 1643, 1644, 1783, 1896, 2061, 2182, 2183]
