# Vectorization Exercise

In this exercise, you will learn how to use vectorization to speed up numerical calculations. Well-chosen vectorized operations can give a significant speed boost and bring other advantages, such as making it easier to push calculations to GPUs.

Let us first solve a sample problem using explicit loops and compare it with a vectorized solution.

**Example.** Consider two very large vectors $V_1$ and $V_2$. We will compute their dot product twice, once with a for loop and once with numpy, and then compare the speeds.

In [1]:
# for loop solution
def for_loop_dot(V1,V2):
    dot_prod = 0
    for i in range(len(V1)):
        dot_prod += V1[i]*V2[i]
    return dot_prod

In [2]:
# check that our function works correctly
for_loop_dot([1,2,3],[1,2,3])

14

In [3]:
import numpy as np

# create two large random vectors
V1 = np.random.randn(1000000,1)
V2 = np.random.randn(1000000,1)

In [4]:
# study the speed of the function on large vectors
%timeit for_loop_dot(V1,V2)

1.79 s ± 177 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [5]:
# vectorized solution
def vec_dot(V1,V2):
    dot_prod = np.dot(V1.T,V2)
    return dot_prod

In [6]:
# check that our function works correctly
vec_dot(np.array([1,2,3]),np.array([1,2,3]))

14

In [7]:
# study the speed of the function on large vectors
%timeit vec_dot(V1,V2)

806 µs ± 183 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


How much faster is the vectorized solution than the for-loop solution? Also note that the vectorized solution is more readable.
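In the run shown above, the ratio is roughly 1.79 s / 806 µs ≈ 2,200. Outside a notebook, one way to measure the same ratio is the standard-library `timeit` module. The following is only a sketch: the helper name `speedup` and the vector length are assumptions made for illustration, and it uses 1-D vectors so that both functions return a plain scalar.

```python
import timeit
import numpy as np

def for_loop_dot(V1, V2):
    # Accumulate the dot product one element at a time
    dot_prod = 0.0
    for i in range(len(V1)):
        dot_prod += V1[i] * V2[i]
    return dot_prod

def vec_dot(V1, V2):
    # Let numpy do the whole reduction in compiled code
    return np.dot(V1, V2)

def speedup(n=100_000, repeats=3):
    """Return (loop_seconds, vec_seconds, ratio) for random vectors of length n."""
    rng = np.random.default_rng(0)
    V1 = rng.standard_normal(n)
    V2 = rng.standard_normal(n)
    # Take the best of several single runs to reduce timing noise
    t_loop = min(timeit.repeat(lambda: for_loop_dot(V1, V2), number=1, repeat=repeats))
    t_vec = min(timeit.repeat(lambda: vec_dot(V1, V2), number=1, repeat=repeats))
    return t_loop, t_vec, t_loop / t_vec

if __name__ == "__main__":
    t_loop, t_vec, ratio = speedup()
    print(f"loop: {t_loop:.4f} s, vectorized: {t_vec:.6f} s, ratio: {ratio:.0f}x")
```

The exact ratio depends on the machine, the vector length, and the numpy build, so treat any single number as indicative only.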

Here is a problem for you to exercise your vectorization skills.

**Problem.** Let A and B be matrices in which each row represents a sample (a vector) and each column represents a feature (an entry or coordinate), with all samples having the same dimensionality. You can use numpy's random functions to create such matrices.

In [8]:
np.random.seed(seed=1)

A = np.random.randn(1000,50)
B = np.random.randn(5000,50) 

**Part 1.** Use for loops to find the pairwise Euclidean distance between each vector in A and each vector in B.

In [14]:
def loop_dist(A,B):
    n = len(A)      # number of samples (rows) in A
    p = len(B)      # number of samples (rows) in B
    d = len(A[0])   # number of features, shared by A and B
    m = np.zeros((n,p))
    for i in range(n):
        for j in range(p):
            # accumulate the squared Euclidean distance between A[i] and B[j]
            s = 0
            for k in range(d):
                s += (A[i,k] - B[j,k])**2
            m[i, j] = s**0.5
    return m

In [15]:
dist_w_loop = loop_dist(A,B)
print(len(dist_w_loop[0]))
print(len(dist_w_loop))

5000
1000

In [16]:
dist_w_loop
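A useful intermediate step between a triple loop and a fully vectorized solution is to vectorize only the innermost loop over features. The sketch below is an illustration and not part of the exercise template: `two_loop_dist` is a hypothetical name, and the small matrix sizes are chosen only to keep the run short.

```python
import numpy as np

def two_loop_dist(A, B):
    """Pairwise Euclidean distances with the feature loop vectorized.

    Returns an array of shape (len(A), len(B)).
    """
    n, p = len(A), len(B)
    m = np.zeros((n, p))
    for i in range(n):
        for j in range(p):
            # np.linalg.norm replaces the explicit loop over features
            m[i, j] = np.linalg.norm(A[i] - B[j])
    return m

rng = np.random.default_rng(1)
A_small = rng.standard_normal((4, 3))
B_small = rng.standard_normal((5, 3))
D = two_loop_dist(A_small, B_small)
print(D.shape)  # (4, 5)
```

This still runs len(A) × len(B) Python-level iterations, so it is a stepping stone rather than a replacement for full vectorization.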

**Part 2.** Use only functions provided in numpy to calculate the pairwise distances in a vectorized fashion. You shouldn't use any loops or any library other than numpy.

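In [17]:
def vect_dist(A,B):
    # Broadcast A to shape (1, len(A), d) and B to shape (len(B), 1, d), so the
    # difference has shape (len(B), len(A), d). Summing over the last axis gives
    # entry [j, i] = Euclidean distance between B[j] and A[i].
    return np.sum((A[None,:] - B[:, None])**2, -1)**0.5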

In [18]:
dist_w_vec = vect_dist(A,B)
print(len(dist_w_vec[0]))
print(len(dist_w_vec))

1000
5000


In [19]:
dist_w_vec

array([[11.05274066,  9.45975945,  9.10412115, ...,  8.61994485,
        10.56744549, 10.26401403],
       [10.38876466,  9.67015416,  9.94540619, ..., 10.24009112,
         9.95653534,  9.53785915],
       [10.16663561,  8.02972644, 10.33590363, ..., 10.36815424,
         9.7496924 ,  9.54423071],
       ...,
       [ 9.40609189,  8.35375709,  9.54935886, ..., 10.22758393,
        10.2086809 ,  8.45891164],
       [ 9.98343736,  9.26134481,  8.98421158, ...,  8.89727869,
         8.89732519,  9.12512204],
       [ 9.24938432,  9.12936845,  8.8744184 , ..., 10.29262723,
         9.98095355,  8.72211447]])
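The broadcasting approach materializes an intermediate array with one entry per (row of B, row of A, feature) triple. For the shapes above that is 5000 × 1000 × 50 float64 values, about 2 GB. A common way to avoid this is to expand the squared distance as $\|a-b\|^2 = \|a\|^2 + \|b\|^2 - 2\,a\cdot b$, so that only 2-D arrays are needed. The sketch below assumes that a small loss of floating-point accuracy is acceptable; the name `gram_dist` is hypothetical.

```python
import numpy as np

def gram_dist(A, B):
    """Pairwise Euclidean distances, shape (len(A), len(B)), without a 3-D temporary."""
    a_sq = np.sum(A**2, axis=1)[:, None]   # shape (len(A), 1)
    b_sq = np.sum(B**2, axis=1)[None, :]   # shape (1, len(B))
    sq = a_sq + b_sq - 2.0 * (A @ B.T)     # squared distances via the expansion
    # Rounding can leave tiny negative values; clip them before the square root
    return np.sqrt(np.maximum(sq, 0.0))

rng = np.random.default_rng(1)
A_small = rng.standard_normal((6, 4))
B_small = rng.standard_normal((7, 4))

# Reference: direct broadcasting, arranged so rows index A and columns index B
ref = np.sqrt(np.sum((A_small[:, None, :] - B_small[None, :, :])**2, axis=-1))
print(np.allclose(gram_dist(A_small, B_small), ref))
```

Note that this layout, with rows indexing A, is the transpose of what `vect_dist` above returns.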

**Part 3.** Compare the solutions of the two methods to make sure you get the same answer. Then compare the efficiency of the two methods using *timeit*.

In [20]:
# use magic timeit on your functions
#%timeit ...

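In [21]:
# Check that the answers from both methods are the same.
# vect_dist returns shape (len(B), len(A)), so transpose it before comparing.
np.allclose(dist_w_loop, dist_w_vec.T)


In [22]:
# time the loop method (one run is enough, since it is slow)
%timeit -n 1 -r 1 loop_dist(A,B)


In [23]:
# time the vectorized method
%timeit vect_dist(A,B)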


**Part 4.** Use the scikit-learn package to do the same calculation and compare its efficiency with the previous methods.

In [24]:
from sklearn.metrics import pairwise_distances

def sk_dist(A,B):
    # Euclidean distance is the default metric; the result has shape (len(A), len(B))
    return pairwise_distances(A, B)

In [25]:
dist_w_sk = sk_dist(A,B)

In [26]:
# check if the answers are the same (vect_dist returns the transposed layout)
np.allclose(dist_w_sk, dist_w_vec.T)

In [27]:
# time the scikit-learn method
%timeit sk_dist(A,B)


### Conclusion

Vectorized methods can be orders of magnitude faster than explicit loops (witness the power of linear algebra and hardware acceleration). Vectorization also makes the code much more readable and easier to generalize to future applications.