# Compute the Intersection of Two Sorted Arrays

A natural way for a search engine to retrieve the documents that match the set of
words in a query is to maintain an inverted index.  Each page is assigned an
integer identifier, its *document-ID*.  An inverted index is a mapping that takes
a word $w$ and returns a sorted array of the document-IDs of the pages that contain
$w$; the sort order could be, for example, page rank in descending order.  When a
query contains multiple words, the search engine finds the sorted array for each
word and then computes the intersection of these arrays, which gives the pages
containing all the words in the query.  The most computationally intensive step is
finding the intersection of the sorted arrays.
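
To make the data structure concrete, here is a minimal sketch of an inverted index. The function name `build_inverted_index`, the toy `documents` dictionary, and the whitespace tokenization are all assumptions made for illustration, and the arrays are sorted by document-ID rather than by page rank to keep the example small.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each word to a sorted list of the IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    # Assumption: sort by document-ID; a real engine might order by rank instead.
    return {word: sorted(ids) for word, ids in index.items()}

# Hypothetical toy corpus keyed by document-ID.
documents = {
    1: "sorted arrays are fast",
    2: "binary search needs sorted input",
    3: "hash tables are fast",
}
index = build_inverted_index(documents)
index["sorted"]  # [1, 2]
index["fast"]    # [1, 3]
```

A query for "sorted fast" would then intersect `index["sorted"]` and `index["fast"]`, which is the problem posed below.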

**Write a program which takes as input two sorted arrays, and returns a new array
containing elements that are present in both of the input arrays.  The input arrays
may have duplicate entries, but the returned array should be free of duplicates.  
For example, if the input is `[2,3,3,5,5,6,7,7,8,12]` and `[5,5,6,8,8,9,10,10]`, your
output should be `[5,6,8]`.**

## Brute-Force Solution

The brute-force algorithm is a "loop join", i.e., traversing all the elements
of one array and comparing them to the elements of the other array.  Let $m$ and
$n$ be the lengths of the two input arrays:

In [2]:
def intersect_two_sorted_arrays(A, B):
    return [a for i, a in enumerate(A) if (i == 0 or a != A[i-1]) and a in B]

list1 = [2,3,3,5,5,6,7,7,8,12]
list2 = [5,5,6,8,8,9,10,10]

intersect_two_sorted_arrays(list1, list2)

[5, 6, 8]

The brute-force algorithm has $O(mn)$ time complexity, since the test `a in B` scans
`B` linearly for each element of `A`.
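
The list comprehension above hides the inner loop inside the `in` operator. As a sketch, the same loop join can be written with both loops explicit, which makes the $O(mn)$ bound easier to see. The name `intersect_nested_loops` is ours, chosen for illustration.

```python
def intersect_nested_loops(A, B):
    result = []
    for i, a in enumerate(A):
        if i > 0 and a == A[i - 1]:
            continue  # Duplicate of the previous element of A; already handled.
        for b in B:  # Up to n comparisons for each of the m elements of A.
            if a == b:
                result.append(a)
                break
    return result

intersect_nested_loops([2,3,3,5,5,6,7,7,8,12], [5,5,6,8,8,9,10,10])  # [5, 6, 8]
```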

Since both arrays are sorted, we can make some optimizations.  First, we can
iterate through the first array and use binary search to test whether each
element is present in the second array:

In [3]:
import bisect

def intersect_two_sorted_arrays(A, B):
    def is_present(k):
        i = bisect.bisect_left(B, k)
        return i < len(B) and B[i] == k
    
    return [a for i, a in enumerate(A) if (i == 0 or a != A[i - 1]) and is_present(a)]

intersect_two_sorted_arrays(list1, list2)

[5, 6, 8]

The time complexity is $O(m \log n)$, where $m$ is the length of the array being
iterated over and $n$ is the length of the array being searched.  We can further
improve the run time by choosing the shorter array for the outer loop, since if $n$
is much smaller than $m$, then $n \log m$ is much smaller than $m \log n$.
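
One way to apply this is to swap the arguments before searching. The following is a minimal sketch under that assumption; the name `intersect_iterating_shorter` is hypothetical, and the helper is the same `bisect_left` test used above.

```python
import bisect

def intersect_iterating_shorter(A, B):
    # Make A the shorter array, so the outer loop runs min(m, n) times
    # and each binary search runs over the longer array.
    if len(A) > len(B):
        A, B = B, A

    def is_present(k):
        i = bisect.bisect_left(B, k)
        return i < len(B) and B[i] == k

    return [a for i, a in enumerate(A)
            if (i == 0 or a != A[i - 1]) and is_present(a)]

intersect_iterating_shorter([2,3,3,5,5,6,7,7,8,12], [5,5,6,8,8,9,10,10])  # [5, 6, 8]
```

Because intersection is symmetric and both inputs are sorted, the result is the same whichever array ends up in the outer loop.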

This is the best solution if one array is much smaller than the other.  However, it is
not the best when the array lengths are similar because we are not exploiting the
fact that both arrays are sorted.  We can achieve linear runtime by simultaneously
advancing through the two input arrays in increasing order.  At each iteration, 
if the array elements differ, the smaller one can be eliminated.  If they are equal,
we add that value to the intersection and advance both.  (We handle duplicates by
comparing the current element with the previous one.)  For example, if the arrays are
`A = [2,3,3,5,7,11]` and `B = [3,3,7,15,31]`, then we know by inspecting the first
element of each that 2 cannot belong to the intersection, so we advance to the second
element of `A`.  Now we have a common element, 3, which we add to the result, and
then we advance in both arrays.  Now we are at 3 in both arrays, but we know 3 has
already been added to the result since the previous element in `A` is also 3.  We
advance in both again without adding to the intersection.  Comparing 5 to 7, we can
eliminate 5 and advance to the fifth element in `A`, which is 7.  It equals the
current element of `B`, so it is added to the result.  We then eliminate
11, and since no elements remain in `A`, we return `[3,7]`:

In [6]:
def intersect_two_sorted_arrays(A, B):
    i, j, intersection_A_B = 0, 0, []
    while i < len(A) and j < len(B):
        if A[i] == B[j]:
            if i == 0 or A[i] != A[i - 1]:
                intersection_A_B.append(A[i])
            i, j = i + 1, j + 1
        elif A[i] < B[j]:
            i += 1
        else:   # A[i] > B[j].
            j += 1
    return intersection_A_B

list3 = [2,3,3,5,7,11]
list4 = [3,3,7,15,31]

intersect_two_sorted_arrays(list3, list4)

[3, 7]
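
The walkthrough can be checked against an instrumented variant of the same loop that records what happens at each iteration. This is only a sketch; the name `trace_intersection` and the action labels are made up for illustration.

```python
def trace_intersection(A, B):
    """Return (result, steps); each step is a tuple (A[i], B[j], action)."""
    i, j, result, steps = 0, 0, [], []
    while i < len(A) and j < len(B):
        a, b = A[i], B[j]
        if a == b:
            if i == 0 or a != A[i - 1]:
                result.append(a)
                steps.append((a, b, "add"))
            else:
                steps.append((a, b, "skip duplicate"))
            i, j = i + 1, j + 1
        elif a < b:
            steps.append((a, b, "advance A"))
            i += 1
        else:
            steps.append((a, b, "advance B"))
            j += 1
    return result, steps

result, steps = trace_intersection([2,3,3,5,7,11], [3,3,7,15,31])
# steps:
# (2, 3, 'advance A'), (3, 3, 'add'), (3, 3, 'skip duplicate'),
# (5, 7, 'advance A'), (7, 7, 'add'), (11, 15, 'advance A')
```

Every iteration advances at least one of the two indices, so the number of recorded steps can never exceed the combined length of the inputs.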

Since we spend $O(1)$ time per input array element, the time complexity of the
entire algorithm is $O(m + n)$.

[References](../reference/13.1.md)
