# Chapter 4: Sorting and Searching (Completed 8/46: 17%)

## Applications of Sorting

### 4.1 [3]

The Grinch is given the job of partitioning $2n$ players into two teams of $n$ players each. Each player has a numerical rating that measures how good he/she is at the game. He seeks to divide the players as *unfairly* as possible, so as to create the biggest possible talent imbalance between team A and team B. Show how the Grinch can do the job in $O(n \log n)$ time.

*Solution:*

Have the players stand in a line and sort them by rating. This presumably can be done in $O(n\log n)$ time, although I imagine doing quicksort, mergesort or heapsort on a line of 100 people would be quite difficult; distribution sort might work if we know the ratings are close to uniformly distributed. Then put the first $n$ players on one team and the other $n$ players on the other. Every player on the second team is rated at least as highly as every player on the first, so no other split gives a bigger imbalance.
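As a minimal sketch of this idea in Python (the function name `unfair_teams` and the sample ratings are made up for illustration), the built-in `sorted` does the $O(n \log n)$ work and slicing splits the line in half:

```python
def unfair_teams(ratings):
    """Split 2n ratings into the n weakest and the n strongest players."""
    ranked = sorted(ratings)          # O(n log n)
    n = len(ranked) // 2              # assumes an even number of players
    return ranked[:n], ranked[n:]

weak, strong = unfair_teams([5, 1, 9, 3, 7, 2])
print(weak, strong)
```

The sketch assumes the input really does contain $2n$ ratings; with an odd count the extra player would land on the stronger team.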

### 4.2 [3]

For each of the following problems, give an algorithm that finds the desired numbers within the given amount of time. To keep your answers brief, feel free to use algorithms from the book as subroutines. For the example, $S = \{6, 13, 19, 3, 8\}$, $19 − 3$ maximizes the difference, while $8 − 6$ minimizes the difference.

(a) Let $S$ be an unsorted array of $n$ integers. Give an algorithm that finds the pair $x, y \in S$ that maximizes $|x−y|$. Your algorithm must run in $O(n)$ worst-case time.

(b) Let $S$ be a sorted array of $n$ integers. Give an algorithm that finds the pair $x, y \in S$ that maximizes $|x − y|$. Your algorithm must run in $O(1)$ worst-case time.

(c) Let $S$ be an unsorted array of $n$ integers. Give an algorithm that finds the pair $x, y \in S$ that minimizes $|x − y|$, for $x \neq y$. Your algorithm must run in $O(n \log n)$ worst-case time.

(d) Let $S$ be a sorted array of $n$ integers. Give an algorithm that finds the pair $x, y \in S$ that minimizes $|x − y|$, for $x \neq y$. Your algorithm must run in $O(n)$ worst-case time.

*Solution:*

**(a):** The pair of values that will maximize $|x - y|$ will be the largest and smallest values in $S$. To find them, perform a linear search for each. Or, combine them both into one linear search by tracking both the minimum and maximum values seen so far.

In [1]:
unsorted = [6, 13, 19, 3, 8]
sorted_S = [3, 6, 8, 13, 19]

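def max_diff_unsorted(S):
    # Track the smallest and largest values seen so far in a single pass.
    # (Named smallest/largest to avoid shadowing the built-in min and max.)
    smallest = S[0]
    largest = S[0]
    for value in S[1:]:
        if value > largest:
            largest = value
        if value < smallest:
            smallest = value
    print("min = %s" % smallest)
    print("max = %s" % largest)
    print("difference = %s" % (largest - smallest))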

max_diff_unsorted(unsorted)

min = 3
max = 19
difference = 16


**(b):** Once again, the pair of values that will maximize $|x - y|$ will be the largest and smallest values in $S$. Since S is sorted, these will be the first and last values.

In [2]:
def max_diff_sorted(S):
    # In a sorted array the extremes sit at the two ends.
    smallest = S[0]
    largest = S[-1]
    
    print("min = %s" % smallest)
    print("max = %s" % largest)
    print("difference = %s" % (largest - smallest))
    
max_diff_sorted(sorted_S)

min = 3
max = 19
difference = 16


**(c):** The pair of values that minimizes $|x-y|$ can be any $x$ and $y$ in $S$, not simply the largest or smallest. However, they will be adjacent once $S$ is sorted: if there were an intermediate point between them, the distance to that point would be no larger, since $S[i+2] - S[i] = \left[ S[i+2] - S[i+1] \right] + \left[ S[i+1] - S[i] \right] \geq S[i+1] - S[i]$ because $S[i+2] - S[i+1] \geq 0$. We can sort $S$ in $O(n \log n)$ time, and then scan through the sorted array calculating $|x-y|$ for each adjacent pair, finding the minimum. The scan is $O(n)$, which is dominated by the sort, so the total is $O(n \log n)$.

In [10]:
def min_diff_unsorted(S):
    S = sorted(S)
    j = 0
    k = 1
    diff = S[1] - S[0]
    
    for i in range(len(S) - 1):
        if (S[i+1] - S[i]) < diff:
            j = i
            k = i + 1
            diff = S[k] - S[j]
    
    print("Minimum difference is between %s and %s, which is %s." % (S[j], S[k], diff))

min_diff_unsorted(unsorted)

Minimum difference is between 6 and 8, which is 2.


**(d):** Since $S$ is already sorted, by the same reasoning as in part **(c)** we know that the pair that minimizes $|x-y|$ will be adjacent. Therefore we can reuse the previous solution without the sorting step, leaving only the $O(n)$ scan.

In [11]:
def min_diff_sorted(S):
    # S is already sorted, so no sort is needed: one O(n) scan suffices.
    j = 0
    k = 1
    diff = S[1] - S[0]
    
    for i in range(len(S) - 1):
        if (S[i+1] - S[i]) < diff:
            j = i
            k = i + 1
            diff = S[k] - S[j]
    
    print("Minimum difference is between %s and %s, which is %s." % (S[j], S[k], diff))

min_diff_sorted(sorted_S)

Minimum difference is between 6 and 8, which is 2.


***
## Heaps

### 4.12 [3]

Devise an algorithm for finding the $k$ smallest elements of an unsorted set of $n$ integers in $O(n + k \log n)$.

*Solution:*

Well, an initial solution would be to do $k$ successive linear scans, pulling out the next smallest element, for an $O(kn)$ running time. Or do one scan, keeping track of the $k$ smallest elements seen so far, and comparing each new element against them. Again, this will be $O(kn)$, since up to $k$ comparisons are done for each of the $n$ elements in the set.

However, looking at this last idea, if the $k$ smallest elements seen so far are kept in sorted order, we only need to compare potential new elements with the largest of these $k$ smallest elements. In fact, full sorted order is not needed, only fast access to the current largest element.  This suggests storing these $k$ elements in a max-heap, which can be constructed in $O(k)$ time (p116). Then, each of the remaining $n-k$ elements needs to be compared with the maximum, which is a constant time operation. In the event that the candidate is less than this maximum, we need to insert it into the heap, and delete the current maximum to maintain the heap's size of $k$. Replacing the maximum with the candidate is constant time, but we then need to perform a "bubble-down" operation to ensure the dominance property of the heap is preserved. Bubble down is an $O(\log k)$ operation, and is done for at most $n-k$ elements.

$\hspace{2em} function \text{ k_smallest}(S,k):$  
$\hspace{4em} \text{Place the first } k \text{ elements of } S \text{ into a max-heap}$  
$\hspace{4em} \text{For the remaining } n-k \text{ elements of } S:$  
$\hspace{6em} \text{If } candidate < \text{ max(heap):}$  
$\hspace{8em} \text{Replace max(heap) with } candidate \text{ and bubble-down}$  


The algorithm is $O(k + n \log k)$.
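One way the pseudocode above might look in Python, as a sketch rather than a definitive implementation: the standard `heapq` module only provides a min-heap, so the values are negated to simulate a max-heap. The name `k_smallest_maxheap` is an illustrative assumption.

```python
import heapq

def k_smallest_maxheap(S, k):
    """Keep the k smallest values seen so far in a (simulated) max-heap."""
    if k <= 0:
        return []
    heap = [-x for x in S[:k]]        # negate so the largest value sits at heap[0]
    heapq.heapify(heap)               # O(k)
    for candidate in S[k:]:
        if candidate < -heap[0]:      # compare with the current maximum
            # pop the maximum and push the candidate in one O(log k) step
            heapq.heapreplace(heap, -candidate)
    return sorted(-x for x in heap)   # sorted only to make the output readable

print(k_smallest_maxheap([6, 13, 19, 3, 8], 2))
```

The final `sorted` call adds $O(k \log k)$ and is there only for presentation; returning the heap contents unsorted keeps the bound stated above.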

But we want an algorithm that is $O(n + k \log n)$.

Put all $n$ elements on a min-heap, which is $O(n)$. Then extract the $k$ smallest elements. Each extraction requires an $O(\log n)$ bubble down, and there are $k$ extractions, for $O(k \log n)$.

$\hspace{2em} function \text{ k_smallest}(S,k):$  
$\hspace{4em} \text{Place all } n \text{ elements of } S \text{ into a min-heap}$  
$\hspace{4em} \text{Perform } k \text{ "extract-min" operations}$  

This algorithm is $O(n + k \log n)$. The first algorithm would perform better in "online" situations, where we don't know how many elements there will be.
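A short sketch of the min-heap version using Python's standard `heapq` module; the function name `k_smallest_minheap` is made up for illustration.

```python
import heapq

def k_smallest_minheap(S, k):
    """Heapify all n elements, then extract the minimum k times."""
    heap = list(S)                    # copy so the caller's list is untouched
    heapq.heapify(heap)               # O(n)
    count = min(k, len(heap))         # guard against k > n
    return [heapq.heappop(heap) for _ in range(count)]   # k pops, O(log n) each

print(k_smallest_minheap([6, 13, 19, 3, 8], 3))
```

Because each `heappop` returns the current minimum, the result comes out in increasing order without any extra sorting.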

***
## Quicksort

### 4.16 [3] Unfinished

Use the partitioning idea of quicksort to give an algorithm that finds the median element of an array of $n$ integers in expected $O(n)$ time. (Hint: must you look at both sides of the partition?)

*Solution:*

The Quicksort Partition function randomly selects an element from the array and partitions the array into two groups, one of which contains only elements that are smaller than the selected element, and another which contains only elements that are greater. These groups will be on the left and right sides of the selected element, respectively, after the function is finished, thereby placing the element in its final position. The function takes $O(n)$ time.

Specifically, $\text{partition}(S, l, h)$ will modify the elements of $S$ that lie between $l$ and $h$ as described above and will return the new integer position of the element used for the partition.

For simplicity, we'll assume that $n = \text{length}(S)$ is odd, in which case the median will be located at position $\lfloor n\,/\,2 \rfloor$ when $S$ is sorted.

After we run $\text{partition}$, we can compare the new location of the partition element with the location of the median value; if it's less, then we know that the median must lie in the right partition, and vice versa. This leads naturally to the following algorithm:

$\hspace{2em} function \text{ median}(S):$  
$\hspace{4em} n = \text{length}(S)$  
$\hspace{4em} l = 0$  
$\hspace{4em} h = n-1$  
$\hspace{4em} mid = \lfloor n\,/\,2 \rfloor$  
$\hspace{4em} \text{while } h > l: $  
$\hspace{6em} p = \text{partition}(S, l, h)$  
$\hspace{6em} \text{if } p = mid:$  
$\hspace{8em} \text{return }p$  
$\hspace{6em} \text{elif } p < mid:$  
$\hspace{8em} l = p + 1$  
$\hspace{6em} \text{else}:$  
$\hspace{8em} h = p - 1$  
$\hspace{4em} \text{return}\ h$  


The algorithm could also be written recursively.

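On average, how many partitions will it take to find the median? To answer this question, we need to know how much the search range is narrowed each time. First note that after the first partition, the location of the median value becomes random within the new range; it could be the first value... it could be the last. However, since each location is equally likely, on average it will be in the middle. So each partition discards, on average, a constant fraction of the current range. If the range shrinks to roughly $3\,/\,4$ of its size in expectation, the total expected work is bounded by $n + \frac{3}{4}n + \left(\frac{3}{4}\right)^2 n + \dots = 4n$, which is $O(n)$.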
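The iterative pseudocode above could be fleshed out in Python roughly as follows. This is a sketch under stated assumptions: `partition` here is a Lomuto-style partition with a random pivot (not necessarily identical to the book's routine), the input is assumed non-empty, and the function returns the median value rather than its position.

```python
import random

def partition(S, l, h):
    """Partition S[l..h] around a random pivot; return the pivot's final index."""
    p = random.randint(l, h)
    S[p], S[h] = S[h], S[p]           # move the pivot out of the way
    first_high = l
    for i in range(l, h):
        if S[i] < S[h]:
            S[i], S[first_high] = S[first_high], S[i]
            first_high += 1
    S[h], S[first_high] = S[first_high], S[h]
    return first_high

def median(S):
    """Return the element at sorted position len(S) // 2, looking at one side only."""
    S = list(S)                       # work on a copy
    l, h = 0, len(S) - 1
    mid = len(S) // 2
    while h > l:
        p = partition(S, l, h)
        if p == mid:
            return S[p]
        elif p < mid:
            l = p + 1                 # the median lies in the right part
        else:
            h = p - 1                 # the median lies in the left part
    return S[mid]                     # the range has narrowed to position mid

print(median([6, 13, 19, 3, 8]))
```

Throughout the loop `l <= mid <= h` holds, so when the range narrows to a single position that position is `mid`.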

### 4.17 [3]

The median of a set of $n$ values is the $\lceil n/2 \rceil \text{th}$ smallest value.

(a): Suppose quicksort always pivoted on the median of the current sub-array. How many comparisons would Quicksort make then in the worst case?

(b): Suppose quicksort were always to pivot on the $\lceil n/3 \rceil \text{th}$ smallest value of the
current sub-array. How many comparisons would be made then in the worst case?

*Solution:*

NOTE: In the implementation of the partition function given in the book, `i` is incremented from `l` until the condition `i<h`; so `i` only reaches `h-1`. This loss of $1$ comparison for each partition would essentially count the number of nodes in the tree of recursion calls. The reason why `i` need not reach `h` is because `s[h]` IS the partition element in the book's implementation and need not be compared with itself. In the solutions below we will assume that the counter `i` does reach `h` for 3 reasons: (1) simplicity; (2) even in the book's implementation, the condition can be trivially changed to `i<h+1` with no change in functionality, since `p = h`, and so the inner loop condition `(s[i] < s[p])` will fail; and (3) in this question we assume that the partition function somehow selects the median value. This may or may not be located at `h`, and so `h` would need to be checked anyway.

**(a):**

On a set of size $n$, the partition function always performs exactly $n$ comparisons. Therefore at each level of the tree of recursive calls, quicksort performs linear work, regardless of how many subranges the $n$ elements have been partitioned into.

The randomness of quicksort comes from the random selection of a partition element at each stage. If the partition function always selected the median value, then the performance of quicksort is no longer random, but depends solely on the input sequence.

What matters is the number of levels in the tree of recursion calls. If the median value is selected each time, then the two resulting partitions will have the same number of values. So the maximum size of any subrange in a given level of the recursion tree is half that of the previous level. Therefore the number of levels of the tree will be exactly m = $\lceil \lg n \rceil$.

If for simplicity we assume that the last level of the tree is completely filled, the total number of comparisons is

$$ n \lceil \lg n \rceil$$

**(b):**

If quicksort always pivoted on the $\lceil n/3 \rceil \text{th}$ smallest value, each partition would shrink the largest partition to $2\, / \, 3$ of its previous size. How many times must this happen to reduce it to $1$?

$\left(\frac{2}{3}\right)^hn = 1 \,\,\Rightarrow\,\, n = \left(\frac{3}{2}\right)^h \,\,\Rightarrow\,\, h = \log_{3/2}n$


Therefore the height of the tree is $\log_{3/2}n$. However, the shortest branch of the tree will be of length $\log_{3}n$. For simplicity, and to find an upper bound on the number of comparisons, we'll assume that every branch is of length $\log_{3/2}n$. If $n$ comparisons are done at each level, and again we assume that the last level of the tree is completely filled, then an upper bound for the total number of comparisons is:

$$ n \lceil \log_{3/2} n \rceil$$
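The recurrences behind both parts can also be evaluated directly, as a rough sanity check on the two bounds. The sketch below assumes, as in the note above, that partitioning $n$ elements costs $n$ comparisons, and additionally assumes that sub-arrays of size 0 or 1 cost nothing; `make_counter`, `median_pivot` and `third_pivot` are illustrative names.

```python
import math
from functools import lru_cache

def make_counter(rank):
    """Build T(n) for a quicksort whose pivot is always the rank(n)-th smallest."""
    @lru_cache(maxsize=None)
    def count(n):
        if n <= 1:
            return 0                  # assumed: nothing to partition
        r = rank(n)
        # n comparisons for this partition, then recurse on both sides
        return n + count(r - 1) + count(n - r)
    return count

median_pivot = make_counter(lambda n: (n + 1) // 2)   # ceil(n / 2)
third_pivot = make_counter(lambda n: (n + 2) // 3)    # ceil(n / 3)

# Columns: n, count for (a), bound for (a), count for (b), bound for (b)
for n in (15, 100, 1000):
    print(n,
          median_pivot(n), n * math.ceil(math.log2(n)),
          third_pivot(n), n * math.ceil(math.log(n, 1.5)))
```

Under these assumptions the computed counts stay at or below $n \lceil \lg n \rceil$ and $n \lceil \log_{3/2} n \rceil$ respectively, since the full-last-level simplification only overestimates the work.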



***
## Other Sorting Algorithms

### 4.22 [3]

Show that $n$ positive integers in the range $1$ to $k$ can be sorted in $O(n \log k)$ time. The interesting case is when $k \ll n$.

*Solution:*

**Case:** $k \geq n$. Do quicksort, which will be $O(n \log n)$ running time. But since $\log n \leq \log k$, this will also be $O(n \log k)$.

**Case:** $k < n$. Like a hash table with chaining, make an array of size $k$ where element $i$ contains a pointer to a linked-list bucket that will contain only elements whose value is $i+1$. For each element in our set, use the table to look up the memory address of the correct bucket and add the element to that bucket. With an index, the look up is $O(1)$, and inserting the element into the bucket is also $O(1)$. After all the elements have been inserted, we need to merge them. Since they are linked-lists, we only need to have the pointers in the last elements of each bucket point to the head of the next. This would take $O(k)$ if the buckets maintained pointers to their last values. Therefore this sorting is $O(n + k)$. And since $k < n \leq n \log k$ whenever $k \geq 2$ (the case $k = 1$ needs no sorting at all), this is also $O(n \log k)$.

Just looked this up, and this is very similar to Pigeonhole Sort, Counting Sort, and Bucket Sort.
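The bucket idea can be sketched in Python as follows. Python lists stand in for the linked lists here, so joining the buckets costs $O(n + k)$ rather than $O(k)$, which leaves the overall bound unchanged. The name `bucket_sort` and its signature are assumptions made for illustration.

```python
def bucket_sort(values, k):
    """Sort integers in the range 1..k using one bucket per possible value."""
    buckets = [[] for _ in range(k)]  # buckets[i] holds the copies of i + 1
    for v in values:
        buckets[v - 1].append(v)      # O(1) lookup and O(1) insert
    result = []
    for bucket in buckets:            # join the buckets in increasing order
        result.extend(bucket)
    return result

print(bucket_sort([3, 1, 2, 3, 1, 1], 3))
```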

For a slower algorithm that is exactly $O(n \log k)$, construct a binary search tree of nodes $1$ through $k$ where each node contains a pointer to a bucket of equal-value elements. For each element in the set, perform the binary search in $O(\log k)$ time to find the appropriate bucket and place the value in the bucket. After all the elements have been placed in the buckets, connect/merge them in $O(k)$ time as in the earlier solution. This algorithm will be $O(n\log k + k) = O(n\log k)$ since $k<n$.
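A sketch of this slower variant, with one substitution stated plainly: Python has no built-in balanced search tree, so a sorted list of the keys $1$ through $k$ searched with the standard `bisect` module plays the role of the tree. Each lookup is still a binary search costing $O(\log k)$. The name `search_tree_sort` is hypothetical.

```python
from bisect import bisect_left

def search_tree_sort(values, k):
    """Sort integers in 1..k, locating each bucket by an O(log k) binary search."""
    keys = list(range(1, k + 1))      # sorted keys stand in for the tree nodes
    buckets = [[] for _ in keys]
    for v in values:
        i = bisect_left(keys, v)      # binary search for the matching key
        buckets[i].append(v)
    # connect the buckets in key order
    return [v for bucket in buckets for v in bucket]

print(search_tree_sort([3, 1, 2, 3, 1, 1], 3))
```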

***
## Lower Bounds

***
## Searching

### 4.30 [3]

A company database consists of 10,000 sorted names, 40% of whom are known as good customers and who together account for 60% of the accesses to the database. There are two data structure options to consider for representing the database:

- Put all the names in a single array and use binary search.
- Put the good customers in one array and the rest of them in a second array. Only if we do not find the query name on a binary search of the first array do we do a binary search of the second array.

Demonstrate which option gives better expected performance. Does this change if linear search on an unsorted array is used instead of binary search for both options?

*Solution:*

$\text{P}(\text{good}) = 0.60$ while $\text{P}(\text{bad}) = 0.40$.

In the first option, queries on both good and bad customers take $\lceil \lg 10,000 \rceil = 14$ comparisons.

In the second option, queries on good customers take $\lceil \lg 4,000 \rceil = 12$ comparisons, while queries on bad customers take $\lceil \lg 4,000 \rceil + \lceil \lg 6,000 \rceil = 12 + 13 = 25$ comparisons. The expected case is therefore $12 \cdot 0.6 + 25 \cdot 0.4 = 17.2$ comparisons.

Clearly the first option is better, and simpler.
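The binary search arithmetic above can be reproduced with a few lines of Python. This is only a check of the numbers, assuming (as the solution does) that a binary search over `size` names costs $\lceil \lg \text{size} \rceil$ comparisons; `probes` is a made-up helper name.

```python
import math

def probes(size):
    """Worst-case comparisons for a binary search over `size` sorted names."""
    return math.ceil(math.log2(size))

single_array = probes(10_000)
good_query = probes(4_000)                  # found in the first array
bad_query = good_query + probes(6_000)      # miss in the first, then the second
split_arrays = 0.6 * good_query + 0.4 * bad_query

print("single array: %d comparisons" % single_array)
print("split arrays: %.1f comparisons expected" % split_arrays)
```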

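What if linear search on an unsorted array is used instead? In this case, the expected number of comparisons for a successful search is half the size of the array, while an unsuccessful search scans the whole array. For option 1, the expected number of comparisons for each query would be 5,000. For option 2, a good query takes 2,000 comparisons on average, while a bad query first scans all 4,000 good customers without success and then takes 3,000 on average in the second array, so it would be $0.6\cdot\text{E[good query]} + 0.4\cdot\text{E[bad query]}$ $ = 0.6\cdot2,000 + 0.4\cdot(4,000 + 3,000) = 4,000$ comparisons. In this case, option 2 is better.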

### 4.31 [3]

Suppose you are given an array $A$ of $n$ sorted numbers that has been circularly shifted $k$ positions to the right. For example, $\{35, 42, 5, 15, 27, 29\}$ is a sorted array that has been circularly shifted $k = 2$ positions, while $\{27, 29, 35, 42, 5, 15\}$ has been shifted $k = 4$ positions.

- Suppose you know what $k$ is. Give an $O(1)$ algorithm to find the largest number in $A$.
- Suppose you do not know what $k$ is. Give an $O(\lg n)$ algorithm to find the largest number in $A$. For partial credit, you may give an $O(n)$ algorithm.

*Solution:*

**Case:** $k$ is known.

In the first example given in the problem statement where $k=2$, the largest element is $42$ and is located at position $1$. It should be at position $n-1$, but instead is located at position $(n-1 + k) \bmod n = (6 - 1 + 2) \bmod 6 = 7 \bmod 6 = 1$. Generally, when circularly shifting numbers, they move from position $i$ to position $(i + k) \bmod n$. To find the largest element, plug in $i = n-1$.

So in this case, the largest element is $A[(n - 1 + k) \bmod n]$.
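As a quick check of the formula against the two examples from the problem statement (the function name `largest_known_shift` is made up for illustration):

```python
def largest_known_shift(A, k):
    """Largest element of a sorted array circularly shifted k places to the right."""
    n = len(A)
    return A[(n - 1 + k) % n]         # where position n - 1 ends up after the shift

print(largest_known_shift([35, 42, 5, 15, 27, 29], 2))
print(largest_known_shift([27, 29, 35, 42, 5, 15], 4))
```

Both calls return $42$, found at positions $1$ and $3$ respectively.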

**Case:** $k$ unknown.

We need to determine $k$, after which it is an instant lookup to find the largest value. So how can we find $k$ in $O(\lg n)$ time? $\lg n$ suggests a binary search. Looking at the $k=2$ example $\{35, 42, 5, 15, 27, 29\}$, we note that the value $5$ should be in position $0$ but instead is in position $2$. Therefore the location of the smallest element is equal to $k$. So how can we find this smallest value? We use the fact that the elements are already sorted, modulo the circular permutation. We need to convert this into a test we can perform to determine if we need to look in the left or right subrange. Looking at the example again, we see that all the numbers in the array are greater than or equal to the first number, $35$, *until* the numbers reset, at which point they are all less than $35$. This observation provides our condition: we test if the middle element of the subrange is less than or greater than the first value.

The solution given below fails when the numbers are not distinct. Suppose we are given a set of $n$ numbers where $n-1$ of them are all the same. In this case, when we test the middle element, if it is the same as the first element, we have no idea if the differing element is to the left or to the right. Furthermore, we don't know if it will be less than or greater than all the other values, so we can't just ignore it. We have to find it. But there is no longer any property amongst the numbers that can expedite our search.

With this in mind, I don't see how an $O(\lg n)$ algorithm is possible in the case where the numbers are not distinct.

In [61]:
def mid(l, h):
    n = h - l + 1
    return l + n // 2

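def largest(A):
    n = len(A)
    l = 0
    h = n - 1
    m = mid(l,h)
    k = 0       # position of the smallest element, i.e. the shift
    count = 0   # safety counter so the loop cannot spin forever
    
    while l < h:     
        m = mid(l,h)
        prev = m - 1 if m != 0 else n-1
        # A drop from A[prev] to A[m] marks the smallest element
        if A[prev] > A[m]:
            k = m
            break
        # Middle is below the first value: the drop is at or left of m
        if A[m] < A[l]:
            h = m
        # Middle is above the first value: the drop is to the right of m
        if A[m] > A[l]:
            l = m
        # If A[m] == A[l] neither bound moves, so give up after n rounds
        count += 1
        if count > n:
            break

    return A[(n - 1 + k) % n]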

# TESTS
print("Distinct numbers, even number")
print(largest([0, 1, 2, 3, 4, 5]) == 5)
print(largest([5, 0, 1, 2, 3, 4]) == 5)
print(largest([4, 5, 0, 1, 2, 3]) == 5)
print(largest([3, 4, 5, 0, 1, 2]) == 5)
print(largest([2, 3, 4, 5, 0, 1]) == 5)
print(largest([1, 2, 3, 4, 5, 0]) == 5, "\n")

print("Distinct numbers, odd number")
print(largest([0, 1, 2, 3, 4, 5, 6]) == 6)
print(largest([5, 6, 0, 1, 2, 3, 4]) == 6)
print(largest([4, 5, 6, 0, 1, 2, 3]) == 6)
print(largest([3, 4, 5, 6, 0, 1, 2]) == 6)
print(largest([2, 3, 4, 5, 6, 0, 1]) == 6)
print(largest([1, 2, 3, 4, 5, 6, 0]) == 6, "\n")

print("All same number")
print(largest([0,0,0,0,0]) == 0)
print(largest([1,1,1,1,1]) == 1)
print(largest([5,5,5,5,5]) == 5, "\n")

print("Repeated non-largest elements")
print(largest([1,0,0,0,0]) == 1)
print(largest([0,1,0,0,0]) == 1)
print(largest([0,0,1,0,0]) == 1)
print(largest([0,0,0,1,0]) == 1)
print(largest([0,0,0,0,1]) == 1, "\n")

print("Repeated largest elements")
print(largest([1,2,2,2,2]) == 2)
print(largest([2,1,2,2,2]) == 2)
print(largest([2,2,1,2,2]) == 2)
print(largest([2,2,2,1,2]) == 2)
print(largest([2,2,2,2,1]) == 2)

Distinct numbers, even number
True
True
True
True
True
True 

Distinct numbers, odd number
True
True
True
True
True
True 

All same number
True
True
True 

Repeated non-largest elements
True
True
True
False
True 

Repeated largest elements
True
True
True
True
False


### 4.32 [3]

Consider the numerical 20 Questions game. In this game, Player 1 thinks of a number in the range $1$ to $n$. Player 2 has to figure out this number by asking the fewest number of true/false questions. Assume that nobody cheats.

(a): What is an optimal strategy if $n$ is known?

(b): What is a good strategy if $n$ is not known?

*Solution:*

**(a):** If $n$ is known, do a binary search between $1$ and $n$. If $n \leq 2^{20}$, you're sure to win. If $n > 2^{20}$, ask 19 binary search questions, and take a guess for question 20.

**(b):** If $n$ is not known, do a one-sided binary search over increasing powers of 2, meaning ask "Is the number less than 16?", "Is the number less than 32?", "Is the number less than 64?", and so on. Once you get a True answer, do a binary search between that number and the previous number. Meaning if "Is the number less than 1024?" returns True, do a binary search between 512 and 1023. Since 20 questions is sufficient to determine any number amongst $2^{20} \approx (2^{10})^2 \approx (1,000)^2 = 1,000,000$, maybe the first number in the one-sided binary search should be $2^{18}$. It's only when the number is greater than $2^{20}$ that you have to worry.
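The one-sided binary search might be sketched like this in Python. The oracle `is_less_than` stands in for Player 1's answers and, like the function name `guess_number`, is a hypothetical construct for illustration. The sketch starts doubling from 2 instead of 16 or $2^{18}$, which is a simplification rather than a tuned choice.

```python
def guess_number(is_less_than):
    """Find a hidden positive integer using only "is it less than x?" questions."""
    # Phase 1: double the bound until the hidden number falls below it.
    hi = 2
    while not is_less_than(hi):
        hi *= 2
    lo = hi // 2                      # the number now lies in [lo, hi - 1]
    hi -= 1
    # Phase 2: ordinary binary search inside that range.
    while lo < hi:
        middle = (lo + hi + 1) // 2
        if is_less_than(middle):
            hi = middle - 1
        else:
            lo = middle
    return lo

secret = 700
print(guess_number(lambda x: secret < x))   # prints 700
```

Both phases ask about $\lg m$ questions for a hidden number $m$, so the total is roughly $2 \lg m$ regardless of the unknown $n$.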

### 4.33 [5] Unfinished

Suppose that you are given a sorted sequence of distinct integers $\{a_1, a_2, . . . , a_n\}$. Give an $O(\lg n)$ algorithm to determine whether there exists an index $i$ such that $a_i = i$. For example, in $\{−10,−3, 3, 5, 7\}$, $a_3 = 3$. In $\{2, 3, 4, 5, 6, 7\}$, there is no such $i$.

*Solution:*

Let $f(x)$ be the function that maps $x$ to $A[x]$. We want to know if $f(x) = x$ for any $x$. Put another way, we want to know if $f(x)$ ever intersects the line $y = x$. Since $f$ is restricted to distinct integers, we know that $f(x+1) - f(x) \geq 1$. The slope of $f$ is always greater than or equal to that of the line $y=x$. So once $f$ crosses the line, it can't go back.

Therefore we can test the first and last values of $A$: if the first is less than $1$ and the last is greater than $n$, we know that somewhere in between $f$ crossed $y=x$. The only question is whether the crossing value is in $A$. We can perform a binary search: if the middle value in a given range is greater than its position, we look in the left subrange. If it's less, we look right.

In [None]:
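def has_fixed_point(A):
    # Sketch of the binary search described above. The problem indexes
    # from 1, so position i corresponds to A[i - 1] in Python.
    l = 0
    h = len(A) - 1

    while l <= h:
        m = (l + h) // 2
        if A[m] == m + 1:
            return True
        if A[m] > m + 1:
            # f is already above the line y = x and cannot come back down,
            # so any match must be to the left
            h = m - 1
        else:
            # f is still below the line, so any match must be to the right
            l = m + 1

    return False

print(has_fixed_point([-10, -3, 3, 5, 7]))
print(has_fixed_point([2, 3, 4, 5, 6, 7]))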

***
## Implementation Challenges

***
## Interview Problems