The goal of this problem is to implement a variant of the 2-SUM algorithm covered in this week's lectures.

The file contains 1 million integers, both positive and negative (there might be some repetitions!). This is your array of integers, with the ith row of the file specifying the ith entry of the array.

Your task is to compute the number of target values t in the interval [-10000,10000] (inclusive) such that there are distinct numbers x,y in the input file that satisfy x+y=t. (NOTE: ensuring distinctness requires a one-line addition to the algorithm from lecture.)

Write your numeric answer (an integer between 0 and 20001) in the space provided.
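
For tiny inputs, the task definition can be checked directly against a brute-force reference. The sketch below is not the intended solution, since it is quadratic in the number of distinct values. The function name `count_targets_bruteforce` is made up for illustration, and "distinct" is taken to mean x != y by value.

```python
from itertools import combinations

def count_targets_bruteforce(values, lower, upper):
    """Count targets t in [lower, upper] with t = x + y for some x != y.

    Quadratic reference implementation, only suitable for tiny inputs.
    """
    distinct = set(values)                  # repetitions add no new pairs
    targets = set()
    for x, y in combinations(distinct, 2):  # x != y by construction
        t = x + y
        if lower <= t <= upper:
            targets.add(t)
    return len(targets)

print(count_targets_bruteforce([-3, -1, 1, 2, 9, 11, 7, 6, 2], 3, 10))  # 8
print(count_targets_bruteforce([-2, 0, 0, 4], 0, 4))                    # 2
```

A reference like this is handy for validating a faster implementation on small hand-made arrays before running it on the full file.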

OPTIONAL CHALLENGE: If this problem is too easy for you, try implementing your own hash table for it. For example, you could compare performance under the chaining and open addressing approaches to resolving collisions.
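
As a possible starting point for the optional challenge, here is a minimal sketch of a hash set that resolves collisions by chaining. The class name `ChainedHashSet`, the fixed bucket count, and the use of Python's built-in `hash` are all illustrative assumptions, and no resizing is implemented.

```python
class ChainedHashSet(object):
    """Toy hash set using separate chaining (one Python list per bucket)."""

    def __init__(self, num_buckets=1024):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        # Python's % yields a non-negative index for a positive modulus,
        # so negative integers map to valid buckets as well.
        return self.buckets[hash(key) % len(self.buckets)]

    def add(self, key):
        bucket = self._bucket(key)
        if key not in bucket:      # linear scan of the chain
            bucket.append(key)

    def __contains__(self, key):
        return key in self._bucket(key)


s = ChainedHashSet(num_buckets=8)
for v in [-3, -1, 1, 2, 9, 11, 7, 6, 2]:
    s.add(v)
print(7 in s)   # membership test goes through __contains__
print(5 in s)
```

An open addressing variant would store keys directly in a flat array and probe for the next free slot on a collision, which is the comparison the challenge suggests.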

In [1]:
# timer grabbed from 
# https://stackoverflow.com/questions/7370801/measure-time-elapsed-in-python
from timeit import default_timer as timer
class benchmark(object):
    def __init__(self, msg, fmt="%0.3g"):
        self.msg = msg
        self.fmt = fmt

    def __enter__(self):
        self.start = timer()
        return self

    def __exit__(self, *args):
        t = timer() - self.start
        print(("%s : " + self.fmt + " seconds") % (self.msg, t))
        self.time = t

In [2]:
# Note: we are asked to count the target values t for which some pair satisfies x + y = t.
# Thus, as soon as we find one pair (x, y) with x != y, we can stop and increase the count by 1.
# We do NOT need to find all possible pairs!

# modified from https://stackoverflow.com/questions/30021060/two-sum-on-leetcode
def two_sum(array, target):
    # H maps each value still needed (target - v) to the index of the v that needs it
    H = {}
    for i, v in enumerate(array):
        if v in H:
            # v completes an earlier element; accept only if the two values differ
            if v != target - v:
                if DEBUG > 1:
                    print v, target - v
                return True
        else:
            H[target - v] = i
    return False


def find_2sum_in_range(array, lower_bound, upper_bound):
    count = 0
    for t in range(lower_bound, upper_bound + 1):
        if DEBUG > 1:
            print "Working on t = {0}".format(t)
        found = two_sum(array, t)
        if found:
            count += 1
            if DEBUG > 0:
                print "Possible for t = {0}".format(t)
    print "Total number of possible t: {0}\n".format(count)

In [3]:
DEBUG = 2

test1 = (-3,-1,1,2,9,11,7,6,2)
find_2sum_in_range(test1, 3, 10)

test2 = [-2,0,0,4]
find_2sum_in_range(test2, 0, 4)

Working on t = 3
2 1
Possible for t = 3
Working on t = 4
7 -3
Possible for t = 4
Working on t = 5
6 -1
Possible for t = 5
Working on t = 6
9 -3
Possible for t = 6
Working on t = 7
6 1
Possible for t = 7
Working on t = 8
9 -1
Possible for t = 8
Working on t = 9
7 2
Possible for t = 9
Working on t = 10
9 1
Possible for t = 10
Total number of possible t: 8

Working on t = 0
Working on t = 1
Working on t = 2
4 -2
Possible for t = 2
Working on t = 3
Working on t = 4
4 0
Possible for t = 4
Total number of possible t: 2



In [4]:
# Since we do not care about all possible (x, y) pairs for a given target t,
# we can load the data into a set.

def find_2sum(H, target):
    for i in H:
        lookup = target - i
        if lookup == i:
            continue
        if lookup in H:
            if DEBUG > 1:
                print i, lookup
            return True
    return False

def find_2sum_in_range_from_file(filename, lower_bound, upper_bound):
    H = set()
    # read one integer per line; the with-block closes the file afterwards
    with open(filename, 'r') as f:
        for line in f:
            H.add(int(line.strip()))
    
    count = 0
    for t in range(lower_bound, upper_bound + 1):
        if DEBUG > 1:
            print "Working on t = {0}".format(t)
        
        found = find_2sum(H, t)
        if found:
            count += 1
            if DEBUG > 0:
                print "Possible for t = {0}".format(t)
    print "Total number of possible t: {0}".format(count)
    return

In [5]:
find_2sum_in_range_from_file("test1.txt", 3, 10)

Working on t = 3
1 2
Possible for t = 3
Working on t = 4
7 -3
Possible for t = 4
Working on t = 5
6 -1
Possible for t = 5
Working on t = 6
7 -1
Possible for t = 6
Working on t = 7
1 6
Possible for t = 7
Working on t = 8
1 7
Possible for t = 8
Working on t = 9
2 7
Possible for t = 9
Working on t = 10
1 9
Possible for t = 10
Total number of possible t: 8


In [6]:
DEBUG = 1
with benchmark("Hash implementation O[n * m]") as r:
    find_2sum_in_range_from_file("2sum.txt", -10000, -9900)

Possible for t = -9967
Possible for t = -9966
Total number of possible t: 2
Hash implementation O[n * m] : 53.5 seconds


# Notes

For this particular example, it is more efficient to use a sorted array, as described in __[this excellent discussion](https://www.coursera.org/learn/algorithms-graphs-data-structures/discussions/weeks/4/threads/oDdtZLcSEeaYHgpNzSKpMA)__.

The procedure is as follows:<br>
Assume $n$ is the number of items in the array $\cal A$ and $m$ is the size of the target range $\cal T$ (in this case ${\cal T} = [-10^4, 10^4]$).

1. Sort the array $\cal A$ using ${\cal O}[n \log(n)]$ time.

2. For each item $x_i \in {\cal A}, \, i=0,1,\dots,n-1$, locate the subarray of $\cal A$ that contains the admissible $y$ values $y_j, \, j = 0,1,\dots,c_i$, such that $x_i + y_j \in {\cal T}$ for all $j$.
  - We need $t_{\rm low} \leq x_i + y_j \leq t_{\rm high}$, hence $t_{\rm low} - x_i \leq y_j \leq t_{\rm high} - x_i$.
  - Let's call this range ${\cal Y}_i = [t_{\rm low} - x_i, t_{\rm high} - x_i]$, containing $y_0, y_1, \dots, y_{c_i}$. Locating its two endpoints takes $2 \log(n)$ time using binary search.
  - We can add $x_i$ to every $y_j$ to obtain ${\cal C}_i = t_0, t_1, \dots, t_{c_i}$. Let $c$ denote the average length of ${\cal C}_i$ over $i = 0,1,\dots,n-1$.
  - Collectively, the binary searches in this step take ${\cal O} [n \log(n)]$.

3. From step 2, we obtain a list of ${\cal C}_0, {\cal C}_1, \dots, {\cal C}_{n-1}$, where each ${\cal C}_i$ is a subset of $\cal T$. The cost to count the distinct values $t$ across all ${\cal C}_i$ is ${\cal O}[nc \log(nc)] \sim {\cal O}[nc \log(n)]$: since $c \leq m = {\cal O}(n)$, we have ${\cal O}[\log(nc)] = {\cal O}[\log(n)]$. The cost can be reduced to ${\cal O}(nc)$ if we use a hash table to mark every $t$ that is the sum of some found $(x,y)$ pair.
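
Step 2 can be illustrated for a single $x_i$ with Python's `bisect` module. This is only a sketch: the helper name `window_sums` is hypothetical, and the sample data is the small test array used earlier in this notebook.

```python
from bisect import bisect_left, bisect_right

def window_sums(sorted_data, x, t_low, t_high):
    """Return all sums x + y with y in sorted_data and t_low <= x + y <= t_high."""
    lo = bisect_left(sorted_data, t_low - x)     # first index with y >= t_low - x
    hi = bisect_right(sorted_data, t_high - x)   # first index with y >  t_high - x
    return [x + y for y in sorted_data[lo:hi]]

data = sorted([-3, -1, 1, 2, 9, 11, 7, 6, 2])
print(window_sums(data, -3, 3, 10))   # [3, 4, 6, 8]
```

For `x = -3` the admissible window is `[6, 7, 9, 11]`, which gives the sums 3, 4, 6 and 8. Note that the window can contain `x` itself, so a distinctness check is still needed on top of this lookup.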

Thus the asymptotic running time of the sorted array implementation is ${\cal O}[nc \log(n)]$, while that of the hash table implementation is ${\cal O}[nm]$.
As tested in this particular case, $c \leq 10$ with $m = 2\times10^4 + 1$ and $n = 10^6$, so it is better to use the sorted array.
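
To get a feel for the gap, the two cost estimates can be evaluated numerically. This is a back-of-the-envelope sketch that ignores constant factors. It takes $m = 20001$ from the problem's interval $[-10000, 10000]$ and uses $c = 10$, the upper bound quoted above, as the assumed average window length.

```python
from math import log

n = 10**6          # number of input integers
m = 2 * 10**4 + 1  # number of targets in [-10000, 10000]
c = 10             # assumed average window length

hash_cost = n * m                # one pass over the data per target
sorted_cost = n * c * log(n, 2)  # the O[n c log(n)] estimate, base-2 logarithm

print("hash-table estimate  : {0:.2e}".format(hash_cost))
print("sorted-array estimate: {0:.2e}".format(sorted_cost))
print("ratio                : {0:.0f}".format(hash_cost / sorted_cost))
```

Under these assumptions the hash table approach does roughly two orders of magnitude more work, although measured timings also depend on constant factors and on how early each per-target scan terminates.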

In [7]:
def count_2sum_in_range_from_file(filename, lower, upper):
    # load data into an array and sort it
    with open(filename, 'r') as f:
        data = [int(line.strip()) for line in f]
    data.sort()
    
    if DEBUG > 2:
        print data[:10] # just the first ten
    
    # find possible t values
    T = set()
    for v in data:
        if DEBUG > 1:
            print "Checking data value {0:15d}".format(v)
        
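        # admissible y values satisfy lower - v <= y <= upper - v
        l, u = lower - v, upper - v
        il = binary_search(data, l)              # first index with data[i] >= l
        iu = binary_search(data, u, left=False)  # first index with data[i] > u
        # NOTE: this slice may contain v itself, so x == y is not excluded here
        C = [i + v for i in data[il : iu]]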
        for t in C:
            T.add(t)
            if DEBUG > 1:
                print "sum = {0:6d} is possible.".format(t)
    
    # return counts for possible t values
    lt = len(T)
    print "Total number of possible t: {0}".format(lt)
    
    return lt

from bisect import bisect_left, bisect_right
def binary_search(array, x, left=True):
    """ Perform binary search on a sorted array and return the insertion index i.
    left - True:  all(val < x for val in array[:i])   (bisect_left)
           False: all(val <= x for val in array[:i])  (bisect_right)
    """
    
    i = None
    if left:
        i = bisect_left(array, x)
    else:
        i = bisect_right(array, x)
    return i
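
The slice `data[il : iu]` in `count_2sum_in_range_from_file` can contain `v` itself, so sums of the form x + x are not filtered out there. Below is a hedged sketch of a variant that enforces x != y. It works on an in-memory list rather than a file, and the name `count_2sum_distinct` is illustrative rather than part of the original notebook.

```python
from bisect import bisect_left, bisect_right

def count_2sum_distinct(values, lower, upper):
    """Count t in [lower, upper] reachable as x + y with x != y."""
    data = sorted(set(values))        # duplicates cannot add new distinct pairs
    targets = set()
    for x in data:
        lo = bisect_left(data, lower - x)    # first y with x + y >= lower
        hi = bisect_right(data, upper - x)   # one past the last y with x + y <= upper
        for y in data[lo:hi]:
            if y != x:                # the distinctness check
                targets.add(x + y)
    return len(targets)

print(count_2sum_distinct([-3, -1, 1, 2, 9, 11, 7, 6, 2], 3, 10))  # 8
print(count_2sum_distinct([2, 2], 4, 4))                           # 0
```

On the small test array the count is unchanged, because every target that 2 + 2 would produce is also reachable through a pair of different values. The second call shows a case where the check matters.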

In [8]:
DEBUG = 3
count_2sum_in_range_from_file('test1.txt', 3, 10)

[-3, -1, 1, 2, 2, 6, 7, 9, 11]
Checking data value              -3
sum =      3 is possible.
sum =      4 is possible.
sum =      6 is possible.
sum =      8 is possible.
Checking data value              -1
sum =      5 is possible.
sum =      6 is possible.
sum =      8 is possible.
sum =     10 is possible.
Checking data value               1
sum =      3 is possible.
sum =      3 is possible.
sum =      7 is possible.
sum =      8 is possible.
sum =     10 is possible.
Checking data value               2
sum =      3 is possible.
sum =      4 is possible.
sum =      4 is possible.
sum =      8 is possible.
sum =      9 is possible.
Checking data value               2
sum =      3 is possible.
sum =      4 is possible.
sum =      4 is possible.
sum =      8 is possible.
sum =      9 is possible.
Checking data value               6
sum =      3 is possible.
sum =      5 is possible.
sum =      7 is possible.
sum =      8 is possible.
sum =      8 is possible.
Checking data value      

8

In [9]:
DEBUG = 0
with benchmark("Array implementation O[n * c * log(n)]") as r:
    count_2sum_in_range_from_file("2sum.txt", -10000, -9900)

Total number of possible t: 2
Array implementation O[n * c * log(n)] : 5.31 seconds
