# 6.0001 Lecture 12: Searching and Sorting

**Speaker:** Prof. Eric Grimson

## Search Algorithms
- **search algorithm** - method for finding an item or group of items with specific properties within a collection of items
- collection could be implicit
    - example - find square root as a search problem
        - exhaustive enumeration
        - bisection search
        - Newton-Raphson
- collection could be explicit
    - example - is a student record in a stored collection of data?
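
The square-root example can be sketched as a search over an implicit collection: the candidate answers are all numbers in an interval, and bisection halves that interval on every step. This is a minimal sketch; the function name `sqrt_bisection` and the tolerance `epsilon` are illustrative choices, not taken from the lecture.

```python
def sqrt_bisection(x, epsilon=1e-6):
    """Approximate the square root of a non-negative x by bisection search."""
    if x < 0:
        raise ValueError("x must be non-negative")
    # The root lies in [0, max(1, x)]; this interval is the implicit collection.
    low, high = 0.0, max(1.0, x)
    guess = (low + high) / 2
    while abs(guess * guess - x) >= epsilon:
        if guess * guess < x:
            low = guess    # root is in the upper half
        else:
            high = guess   # root is in the lower half
        guess = (low + high) / 2
    return guess


print(sqrt_bisection(25))
```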

## Searching Algorithms
- linear search
    - **brute force** search (a.k.a. British Museum algo)
    - list does not have to be sorted
- bisection search
    - list **MUST be sorted** to give correct answer
    - saw two different implementations of the algorithm

## Linear Search on **UNSORTED** list: recap

In [2]:
def linear_search(L, e):
    found = False
    for i in range(len(L)):
        if e == L[i]:
            found = True # could return True here for a small speedup, but the worst case is unchanged
    return found

- must look through all elements to decide if it's not there
- $O(\textrm{len(L)})$ for the loop * $O(1)$ to test if e == L[i]
    - assumes we can retrieve element of list in constant time
- overall complexity is $O(n)$ where $n$ is len(L)
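
The early-return variant mentioned in the code comment might look like the sketch below. The name `linear_search_early_exit` is hypothetical. The best case improves to a single comparison, but an absent element still forces a full pass, so the worst case stays $O(n)$.

```python
def linear_search_early_exit(L, e):
    """Return True as soon as e is found; scan the whole list otherwise."""
    for item in L:
        if item == e:
            return True   # stop at the first match
    return False          # reached only after checking every element


print(linear_search_early_exit([3, 1, 4, 1, 5], 4))
print(linear_search_early_exit([3, 1, 4, 1, 5], 9))
```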

## Linear Search on **SORTED** list: recap

In [3]:
def search(L, e):
    for i in range(len(L)):
        if L[i] == e:
            return True
        if L[i] > e:
            return False
    return False

- only need to look until reaching a number greater than e
- $O(\textrm{len(L)})$ for the loop * $O(1)$ to test if e == L[i]
- overall complexity is $O(n)$ where $n$ is len(L)

## Bisection search: recap
- steps:
    - pick an index, i, that divides list in half
    - ask if L[i] == e
    - if not, ask if L[i] is larger or smaller than e
    - depending on answer, search left or right half of L for e
- a new version of a divide-and-conquer algorithm
    - break into smaller version of problem (smaller list), plus some simple operations
    - answer to smaller version is answer to original problem

In [5]:
# bisection search implementation
def bisect_search2(L, e):
    def bisect_search_helper(L, e, low, high):
        if high == low: # list of single element; only check if that element is e
            return L[low] == e
        mid = (low + high)//2
        if L[mid] == e:
            return True
        elif L[mid] > e:
            if low == mid: # nothing left to search
                return False
            else:
                return bisect_search_helper(L, e, low, mid-1)
        else:
            return bisect_search_helper(L, e, mid+1, high)
    if len(L) == 0: # empty list; nothing to search
        return False
    else:
        return bisect_search_helper(L, e, 0, len(L)-1)

## Complexity of Bisection Search: recap
- **bisect_search2** and its helper
    - $O(\log{n})$ bisection search calls
        - reduces size of problem by factor of 2 on each step
    - pass list and indices as parameters
    - list never copied, just re-passed as pointer
    - constant work inside function
    - --> **$O(\log{n})$**
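
To see the logarithmic growth concretely, here is a sketch of an iterative bisection search that also counts how many elements it probes. The function name `bisect_search_count` and the counter are additions for illustration. The number of probes never exceeds $\lfloor\log_2 n\rfloor + 1$.

```python
def bisect_search_count(L, e):
    """Search sorted list L for e; return (found, number_of_probes)."""
    low, high = 0, len(L) - 1
    probes = 0
    while low <= high:
        probes += 1
        mid = (low + high) // 2
        if L[mid] == e:
            return True, probes
        elif L[mid] > e:
            high = mid - 1   # keep searching the left half
        else:
            low = mid + 1    # keep searching the right half
    return False, probes


# The probe count grows slowly even as the list grows by factors of 100 or more.
for n in (10, 1000, 1000000):
    found, probes = bisect_search_count(list(range(n)), -1)
    print(n, found, probes)
```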

## Searching a sorted list -- $n$ is len(L)
- using **linear search**, search for an element is **$O(n)$**
- using **binary search**, can search for an element in $O(\log{n})$
    - assumes the list is **sorted**!
- when does it make sense to **sort first then search**?
    - SORT + $O(\log{n}) < O(n)$ --> SORT $< O(n) - O(\log{n})$
    - when sorting is less than $O(n)$
- **NEVER TRUE**!
    - to sort a collection of n elements, must look at each one at least once!

## Amortized cost -- $n$ is len(L)
- why bother sorting first?
- in some cases, may **sort a list once** then do **many searches**
- **AMORTIZE cost** of the sort over many searches
- SORT + $K \cdot O(\log{n})<K\cdot O(n)$
    - for large $K$, **SORT time becomes irrelevant,** if cost of sorting is small enough
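
The inequality can be turned into a rough break-even estimate. The sketch below assumes a unit-cost model in which sorting takes $n\log_2 n$ steps, one linear search takes $n$ steps, and one bisection search takes $\log_2 n$ steps, with all constant factors ignored. The function name `break_even_searches` is hypothetical.

```python
import math


def break_even_searches(n):
    """Smallest K such that n*log2(n) + K*log2(n) < K*n under the unit-cost model."""
    sort_cost = n * math.log2(n)
    saving_per_search = n - math.log2(n)   # steps saved by each bisection search
    return math.floor(sort_cost / saving_per_search) + 1


print(break_even_searches(1024))
```

Under these assumptions, sorting first pays off after roughly $\log_2 n$ searches, which is a small number even for large lists.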

## Sort Algorithms
- want to efficiently sort a list of entries (typically numbers)
- will see a range of methods, including one that is quite efficient

## Monkey sort
- a.k.a. bogosort, stupid sort, slowsort, permutation sort, shotgun sort
- to sort a deck of cards
    - throw them in the air
    - pick them up
    - are they sorted?
    - repeat if not sorted

## Complexity of Bogo Sort

In [6]:
import random

def bogo_sort(L):
    # is_sorted(L) is assumed to return True when L is in non-decreasing order
    while not is_sorted(L):
        random.shuffle(L)

- best case: $O(n)$ where $n$ is len(L) to check if sorted
- worst case: $O(?)$ it is **unbounded** if really unlucky
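
The notes do not define `is_sorted`. One possible helper, together with a small seeded run, is sketched below. The helper body and the five-card example are assumptions for illustration. The list is kept short because the expected number of shuffles grows like $n!$.

```python
import random


def is_sorted(L):
    """True if L is in non-decreasing order; a single O(n) pass."""
    return all(L[i] <= L[i + 1] for i in range(len(L) - 1))


def bogo_sort(L):
    """Shuffle L in place until it happens to be sorted."""
    while not is_sorted(L):
        random.shuffle(L)


random.seed(0)            # fixed seed so the run is repeatable
cards = [3, 1, 2, 5, 4]
bogo_sort(cards)
print(cards)
```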

## Bubble sort
- **compare consecutive pairs** of elements
- **swap elements** in pairs such that smaller is first
- when reach end of list, **start over** again
- stop when **no more swaps** have been made
- largest unsorted element always at end after pass, so at most $n$ passes

## Complexity of Bubble Sort

In [7]:
def bubble_sort(L):
    swapped = True # assume a swap is possible so the first pass runs
    while swapped: # O(len(L))
        swapped = False
        for j in range(1, len(L)): # O(len(L))
            if L[j-1] > L[j]:
                swapped = True # a swap happened, so another pass is needed
                L[j-1], L[j] = L[j], L[j-1]

- inner *for* loop is for doing the **comparisons**
- outer *while* loop is for doing **multiple passes** until no more swaps
- **$O(n^2)$ where $n$ is len(L)** to do len(L)-1 comparisons in each of at most len(L) passes
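
The pass count can be observed directly. The sketch below is a variant of bubble sort that returns how many passes it made. The name `bubble_sort_passes` and the counter are illustrative additions. An already sorted list needs one pass, while a reversed list of length $n$ needs $n$ passes, the last of which only confirms that nothing moved.

```python
def bubble_sort_passes(L):
    """Bubble-sort L in place and return the number of passes made."""
    passes = 0
    swapped = True
    while swapped:
        swapped = False
        passes += 1
        for j in range(1, len(L)):
            if L[j - 1] > L[j]:
                L[j - 1], L[j] = L[j], L[j - 1]
                swapped = True
    return passes


print(bubble_sort_passes([1, 2, 3, 4, 5]))   # best case: already sorted
print(bubble_sort_passes([5, 4, 3, 2, 1]))   # worst case: reversed
```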

## Selection sort
- first step
    - extract **minimum element**
    - **swap it** with element at **index 0**
- subsequent step
    - in remaining sublist, extract **minimum element**
    - **swap it** with the element at **index 1**
- keep the left portion of the list sorted
    - at i'th step, **first i elements in list are sorted**
    - all other elements are bigger than first i elements

## Analyzing selection sort
- loop invariant
    - given prefix of list L[0:i] and suffix L[i:len(L)], the prefix is sorted and no element in the prefix is larger than the smallest element of the suffix
        - base case: prefix empty, suffix whole list -- invariant true
        - induction step: move minimum element from suffix to end of prefix. Since invariant true before move, prefix sorted after append
        - when exit, prefix is entire list, suffix empty, so sorted
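
The invariant can be checked mechanically. The sketch below is a selection sort that asserts the invariant at the start of every step, taking the suffix to be everything not in the prefix, that is `L[i:]`. The name `selection_sort_checked` is hypothetical, and the assertions are there for illustration only since they add extra cost to every step.

```python
def selection_sort_checked(L):
    """Selection sort in place, asserting the loop invariant at every step."""
    for i in range(len(L)):
        prefix, suffix = L[:i], L[i:]
        # Invariant part 1: the prefix is sorted.
        assert prefix == sorted(prefix)
        # Invariant part 2: nothing in the prefix exceeds anything in the suffix.
        assert not prefix or not suffix or max(prefix) <= min(suffix)
        # Move the minimum of the suffix to the end of the prefix.
        smallest = min(range(i, len(L)), key=L.__getitem__)
        L[i], L[smallest] = L[smallest], L[i]
    return L


print(selection_sort_checked([5, 2, 9, 1, 5]))
```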

## Complexity of selection sort

In [8]:
def selection_sort(L):
    suffixSt = 0
    while suffixSt != len(L): # len(L) times --> O(len(L))
        for i in range(suffixSt, len(L)): # len(L) - suffixSt times --> O(len(L))
            if L[i] < L[suffixSt]:
                L[suffixSt], L[i] = L[i], L[suffixSt]
        suffixSt += 1

- outer loop executes len(L) times
- inner loop executes len(L) - i times
- complexity of selection sort is $O(n^2)$ where $n$ is len(L)
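
Counting comparisons makes the quadratic growth visible. The sketch below uses the same loop structure as the code above with a counter added. The name `selection_sort_comparisons` is hypothetical. For a list of length $n$ the inner loop runs $n + (n-1) + \dots + 1 = n(n+1)/2$ times in total.

```python
def selection_sort_comparisons(L):
    """Sort L in place and return the number of element comparisons made."""
    comparisons = 0
    suffix_start = 0
    while suffix_start != len(L):
        for i in range(suffix_start, len(L)):
            comparisons += 1
            if L[i] < L[suffix_start]:
                L[suffix_start], L[i] = L[i], L[suffix_start]
        suffix_start += 1
    return comparisons


for n in (5, 10, 100):
    count = selection_sort_comparisons(list(range(n, 0, -1)))
    print(n, count, n * (n + 1) // 2)
```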

## Merge sort
- use a divide-and-conquer approach:
    - if list of length 0 or 1, already sorted
    - if list has more than one element, split into two lists, and sort each
    - merge sorted sublists
        - look at first element of each, move smaller to end of the result
        - when one list is empty, just copy rest of other list
- **split list in half** until have sublists of only 1 element
- merge such that **sublists will be sorted after merge**

## Merging Sublists step

In [9]:
def merge(left, right):
    result = []
    i, j = 0, 0
    # left and right sublists are ordered
    # advance the index of whichever sublist holds the next smallest element
    while i < len(left) and j < len(right):
        if left[i] < right[j]:
            result.append(left[i])
            i += 1
        else:
            result.append(right[j])
            j += 1
    while (i < len(left)): # when right sublist is empty
        result.append(left[i])
        i += 1
    while (j < len(right)): # when left sublist is empty
        result.append(right[j])
        j += 1
    return result

## Complexity of merging sublists step
- go through two lists, only one pass
- compare only **smallest elements in each sublist**
- $O(\textrm{len(left)} + \textrm{len(right)})$ copied elements
- $O(\textrm{len(longer list)})$ comparisons
- **linear in length of the lists**
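
For comparison, Python's standard library offers `heapq.merge`, which also combines already sorted inputs in a single pass and yields the result lazily. This is shown only as a point of reference, not as the lecture's method.

```python
import heapq

left = [1, 4, 7]
right = [2, 3, 9, 10]

# heapq.merge returns an iterator; list() collects the merged result.
merged = list(heapq.merge(left, right))
print(merged)
```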

## Merge sort algorithm -- recursive

In [10]:
def merge_sort(L):
    if len(L) < 2: # base case
        return L[:]
    else:
        middle = len(L)//2
        left = merge_sort(L[:middle]) # divide
        right = merge_sort(L[middle:])
        return merge(left, right) # conquer with merge step

- **divide list** successively into halves
- depth-first such that **conquer smallest pieces down one branch** first before moving to larger pieces

## Complexity of Merge Sort
- at **first recursion level**
    - n/2 elements in each list
    - $O(n) + O(n) = O(n)$ where $n$ is len(L)
- at **second recursion level**
    - n/4 elements in each list
    - two merges --> $O(n)$
- each recursion level is $O(n)$
- **dividing list in half** with each recursive call
    - $O(\log{n})$
- overall complexity is $O(n\log{n})$
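
The level-by-level argument above can also be written as a recurrence. This is a sketch, assuming $n$ is a power of 2 and that merging a total of $n$ elements costs $cn$ for some constant $c$. Both $T$ and $c$ are notation introduced here.

$$
\begin{aligned}
T(1) &= c \\
T(n) &= 2\,T(n/2) + cn \\
     &= 4\,T(n/4) + 2cn \\
     &= 2^k\,T(n/2^k) + k\,cn \\
     &= cn + cn\log_2 n \qquad (k = \log_2 n) \\
     &= O(n\log{n})
\end{aligned}
$$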

## Sorting summary
- bogo sort
    - randomness, unbounded $O()$
- bubble sort
    - $O(n^2)$
- selection sort
    - $O(n^2)$
    - guaranteed the first i elements were sorted
- merge sort
    - $O(n \log{n})$
- $O(n\log{n})$ is the fastest a comparison-based sort can be in the worst case

# What have we seen in 6.0001?

## Key Topics
- represent knowledge with **data structures**
- **iteration and recursion** as computational metaphors
- **abstraction** of procedures and data types
- **organize and modularize** systems using object classes
- different classes of **algorithms**, searching and sorting
- **complexity** of algorithms

## Overview of course
- learn computational modes of thinking
- begin to master the art of computational problem solving
- make computers do what you want them to do

## What do computer scientists do?
- they think computationally
    - abstractions, algorithms, automated execution
- just like the three r's: reading, 'riting, and 'rithmetic --
    - computational thinking is becoming a fundamental skill that every well-educated person will need

## The Three A's of Computational Thinking
- abstraction
    - choosing the right abstractions
    - operating in multiple layers of abstraction simultaneously
    - defining the relationships between the abstraction layers
- automation
    - think in terms of mechanizing our abstraction
    - mechanization is possible -- because we have precise and exacting notations and models, and because there is some "machine" that can interpret our notations
- algorithms
    - language for describing automated processes
    - also allows for abstraction of details
    - language for communicating ideas & processes
    

## Aspects of Computational Thinking
- how difficult is this problem and how best can I solve it?
    - theoretical computer science gives precise meaning to these and related questions and their answers
- thinking recursively
    - reformulating a seemingly difficult problem into one which we know how to solve
    - reduction, embedding, transformation, simulation