source: https://medium.com/codex/say-goodbye-to-loops-in-python-and-welcome-vectorization-e4df66615a52

# Loops come naturally to us: we learn about them in almost every programming language, so we reach for a loop by default whenever an operation repeats. But with a large number of iterations (millions or billions of rows), loops in Python become painfully slow. You might wait for hours, only to realize the approach won’t finish in a reasonable time. This is where vectorization in Python becomes crucial.

# Vectorization is the technique of expressing an operation on a dataset as (NumPy) array operations. Behind the scenes, the operation is applied to all the elements of an array or Series in optimized compiled code, unlike a ‘for’ loop, which handles one row at a time in the Python interpreter.

# USE CASE 1: Finding the Sum of Numbers

In [1]:
import time
start = time.time()


# iterative sum
total = 0
# iterating through 1.5 Million numbers
for item in range(0, 1500000):
    total = total + item


print('sum is:' + str(total))
end = time.time()

print(end - start)

sum is:1124999250000
0.19063472747802734


In [2]:
import numpy as np

start = time.time()

# vectorized sum - using numpy for vectorization
# np.arange creates the sequence of numbers from 0 to 1499999
print(np.sum(np.arange(1500000)))

end = time.time()

print(end - start)

1124999250000
0.007496833801269531


# In this run, vectorization took about 25x less time (0.0075 s vs. 0.19 s) than iterating with the range function. The difference becomes even more significant when working with a Pandas DataFrame.
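
A single `time.time()` measurement is noisy, so the exact ratio will vary between runs and machines. Below is a minimal sketch of a more repeatable comparison using the standard-library `timeit` module. The helper names `loop_sum` and `numpy_sum` are hypothetical, and `dtype=np.int64` is set explicitly on the assumption that you want to avoid overflow on platforms where NumPy's default integer is 32-bit.

```python
import timeit

import numpy as np

N = 1_500_000


def loop_sum(n):
    """Sum 0..n-1 with an explicit Python loop."""
    total = 0
    for item in range(n):
        total += item
    return total


def numpy_sum(n):
    """Sum 0..n-1 with a vectorized NumPy reduction."""
    return int(np.arange(n, dtype=np.int64).sum())


# Take the best of several repeats to reduce timing noise.
loop_time = min(timeit.repeat(lambda: loop_sum(N), number=1, repeat=3))
numpy_time = min(timeit.repeat(lambda: numpy_sum(N), number=1, repeat=3))

print(f"loop:  {loop_time:.4f} s")
print(f"numpy: {numpy_time:.4f} s")
print(f"speed-up: {loop_time / numpy_time:.1f}x")
```

Taking the minimum over a few repeats is a common way to filter out interference from other processes. The speed-up you see will still depend on hardware and library versions.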

# USE CASE 2: Mathematical Operations (on DataFrame)

In [4]:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 50, size=(5000000, 4)), columns=('a','b','c','d'))
df.shape
# (5000000, 4)
df.head()

    a   b   c   d
0   6  28  23  28
1  19  31   1  34
2  42  11  26  37
3   5  21  14   9
4  30  36  16  11


We will create a new column ‘ratio’ that holds column ‘d’ as a percentage of column ‘c’.

Using Loops

In [5]:
import time
start = time.time()

# Iterating through DataFrame using iterrows
for idx, row in df.iterrows():
    # creating a new column
    df.at[idx,'ratio'] = 100 * (row["d"] / row["c"])
end = time.time()
print(end - start)


410.49225091934204


Using Vectorization

In [None]:
start = time.time()
df["ratio"] = 100 * (df["d"] / df["c"])

end = time.time()
print(end - start)
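
Column ‘c’ is drawn from the range 0 to 49, so some rows divide by zero. Pandas does not raise an error here: the vectorized division yields `inf` (or `NaN` for 0/0) in those rows. The sketch below shows one way to handle this, assuming you would rather have missing values than infinities. The four-row `small` DataFrame is a hypothetical stand-in for `df`.

```python
import numpy as np
import pandas as pd

# Small stand-in for the 5-million-row DataFrame, with zeros in 'c'.
small = pd.DataFrame({"c": [0, 2, 0, 4], "d": [5, 1, 0, 2]})

# Plain vectorized division: zero denominators give inf (or NaN for 0/0).
raw = 100 * (small["d"] / small["c"])

# Mask zero denominators (they become NaN), so the ratio is NaN there too.
safe_c = small["c"].where(small["c"] != 0)
small["ratio"] = 100 * (small["d"] / safe_c)

print(raw.tolist())
print(small["ratio"].tolist())
```

Whether `NaN`, `inf`, or some sentinel value is the right result for a zero denominator depends on what the column is used for downstream.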

# USE CASE 3: If-else Statements (on DataFrame)

Using Loops

In [None]:
import time
start = time.time()

# Iterating through DataFrame using iterrows
for idx, row in df.iterrows():
    if row.a == 0:
        df.at[idx,'e'] = row.d
    elif 0 < row.a <= 25:
        df.at[idx,'e'] = (row.b)-(row.c)
    else:
        df.at[idx,'e'] = row.b + row.c

end = time.time()

print(end - start)

Using Vectorization

In [None]:
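start = time.time()
# default case: a > 25
df['e'] = df['b'] + df['c']
# 0 < a <= 25 (rows with a == 0 are overwritten on the next line)
df.loc[df['a'] <= 25, 'e'] = df['b'] - df['c']
# a == 0
df.loc[df['a'] == 0, 'e'] = df['d']
end = time.time()
print(end - start)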
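
The same branching can also be written with `np.select`, which takes a list of conditions and a matching list of values, and picks the value for the first condition that holds in each row. Here is a minimal sketch on a small hypothetical DataFrame (the `small` frame and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

small = pd.DataFrame({
    "a": [0, 10, 25, 40],
    "b": [5, 6, 7, 8],
    "c": [1, 2, 3, 4],
    "d": [9, 9, 9, 9],
})

# Conditions are checked in order; the first match wins for each row.
conditions = [
    small["a"] == 0,
    (small["a"] > 0) & (small["a"] <= 25),
]
choices = [
    small["d"],
    small["b"] - small["c"],
]

# Rows matching no condition fall back to b + c (the 'else' branch).
small["e"] = np.select(conditions, choices, default=small["b"] + small["c"])

print(small["e"].tolist())
```

Compared with successive `df.loc` assignments, this keeps each branch next to its condition, which can be easier to read as the number of branches grows.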

The vectorized version is about 600x faster than the Python loop with if-else statements.

# USE CASE 4 (Advanced): Solving Machine Learning/Deep Learning Networks

Deep learning requires us to evaluate many complex equations, often over millions or billions of rows. Running Python loops to evaluate them is very slow, and vectorization is a much better fit.
For example, to calculate the value of y for millions of rows in the following multiple linear regression equation:

$$y = m_1 x_1 + m_2 x_2 + m_3 x_3 + m_4 x_4 + m_5 x_5 + c$$


we can replace loops with vectorization.
The values of m1, m2, m3… are determined by fitting the above equation to millions of values of x1, x2, x3… (for simplicity, we will look only at the multiplication step).

Creating the Data

In [6]:
import numpy as np
# setting initial values of m
m = np.random.rand(1,5)

# input values for 5 million rows
x = np.random.rand(5000000,5)

Using Loops

In [None]:
import time
import numpy as np
m = np.random.rand(1,5)
x = np.random.rand(5000000,5)

# array to hold one result per row
zer = np.zeros(5000000)
tic = time.process_time()

for i in range(0,5000000):
    total = 0
    for j in range(0,5):
        total = total + x[i][j]*m[0][j]

    zer[i] = total

toc = time.process_time()
print ("Computation time = " + str((toc - tic)) + " seconds")

Using Vectorization

[Image: vectorization (PNG)]

In [None]:
tic = time.process_time()

# dot product: (5000000, 5) x (5, 1) -> (5000000, 1)
y = np.dot(x, m.T)

toc = time.process_time()
print ("Computation time = " + str((toc - tic)) + " seconds")


np.dot implements vectorized matrix multiplication in the backend. It is about 165x faster than the loops in Python.
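
To tie this back to the regression equation, here is a small sketch that also adds the intercept and checks the vectorized result against the loop, on a sample small enough to loop over quickly. The intercept value `c`, the sample size, and the random seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m = rng.random((1, 5))       # coefficients m1..m5
x = rng.random((1000, 5))    # 1,000 rows instead of 5 million
c = 0.5                      # hypothetical intercept

# Loop version: one multiply-add per coefficient, per row.
y_loop = np.zeros(len(x))
for i in range(len(x)):
    total = 0.0
    for j in range(5):
        total += x[i, j] * m[0, j]
    y_loop[i] = total + c

# Vectorized version: one matrix product, then a broadcast add.
y_vec = (x @ m.T).ravel() + c

# Both versions should agree up to floating-point rounding.
print(np.allclose(y_loop, y_vec))
```

The `@` operator is equivalent to `np.dot` for 2-D arrays, and adding the scalar `c` relies on NumPy broadcasting rather than a loop.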

Conclusion:
Vectorization in Python is very fast and should be preferred over loops whenever we are working with very large datasets.
Start applying it, and over time you will become comfortable thinking about your code in vectorized terms.