# np.vectorize

https://numpy.org/doc/stable/reference/generated/numpy.vectorize.html

Vectorized operations in NumPy apply efficient, pre-compiled functions and mathematical operations to whole NumPy arrays and data sequences at once. Vectorization means performing array operations without writing explicit Python for loops; the looping happens inside compiled code instead.
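
As a minimal sketch of that contrast, the same computation can be written as an explicit Python loop or as a single array expression. The array `values` and both function names are illustrative and do not appear elsewhere in this notebook.

```python
import numpy as np

values = np.arange(5)  # illustrative input: [0, 1, 2, 3, 4]

def scale_with_loop(arr):
    """Compute 2*x + 1 one element at a time in a Python for loop."""
    out = np.empty(len(arr), dtype=arr.dtype)
    for i, x in enumerate(arr):
        out[i] = 2 * x + 1
    return out

def scale_vectorized(arr):
    """Compute 2*x + 1 for the whole array; the loop runs in compiled NumPy code."""
    return 2 * arr + 1

loop_result = scale_with_loop(values)
vectorized_result = scale_vectorized(values)
```

Both calls return the same values; only the second avoids per-element work in the interpreter.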


In [3]:
import pandas as pd
import numpy as np

In [32]:
N = 1000000
A_list = np.random.randint(1, 200, N)
B_list = np.random.randint(1, 200, N)
df = pd.DataFrame({'A': A_list, 'B': B_list})
df.head()

,A,B
0,192,139
1,175,88
2,71,81
3,73,32
4,11,38


In [28]:
def divide(a, b):
    if b == 0:
        return 0.0
    return float(a)/b

In [33]:
%timeit df['apply'] = df.apply(lambda row: divide(row['A'], row['B']), axis=1)

19.2 s ± 75.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


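In [34]:
%timeit df['vectorize'] = np.vectorize(divide)(df['A'], df['B'])

603 ms ± 4.53 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [26]:
# Speedup of np.vectorize over apply (apply time / np.vectorize time) at N = 100,
# using timings from an earlier run that is not shown here
(2.82 * 1000) / 421

6.69833729216152

In [31]:
# Same ratio at N = 10000, again from an earlier run
192 / 6.73

28.52897473997028

In [35]:
# Same ratio at N = 1000000: 19.2 s (converted to ms) / 603 ms
(19.2 * 1000) / 603

31.8407960199005
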


When a function is written entirely in terms of NumPy's C-based ufuncs, it is already vectorized over its inputs and runs very quickly when applied to an array.
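
For illustration, here is one way the `divide` logic from this notebook might be expressed with ufuncs only. The name `divide_arrays` is hypothetical, and this is a sketch under the assumption that returning 0.0 for a zero divisor is the desired behavior, as in the scalar version.

```python
import numpy as np

def divide_arrays(a, b):
    """Element-wise a / b, returning 0.0 wherever b == 0."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Pre-fill the result with 0.0 so skipped positions keep that value.
    out = np.zeros(np.broadcast(a, b).shape, dtype=float)
    # np.divide is a ufunc; `where` skips the positions with a zero divisor.
    np.divide(a, b, out=out, where=b != 0)
    return out

result = divide_arrays([1, 6, 5], [2, 3, 0])
```

Because there is no Python-level branch per element, the whole operation stays inside compiled code.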

However, when you have a complicated function that isn't built out of ufuncs and operates on scalars, you need np.vectorize to make it applicable to arrays. In this case the vectorization is simply wrapper code that iterates over the various axes for you. The NumPy documentation says np.vectorize is provided primarily for convenience, not for performance, and that its implementation is essentially a for loop.
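
To make the "wrapper code" point concrete, the sketch below compares np.vectorize with a hand-written loop over the same scalar function. The small arrays are made up for illustration, and `otypes=[float]` is passed so the output dtype is fixed up front rather than inferred.

```python
import numpy as np

def divide(a, b):
    """Scalar division that returns 0.0 instead of dividing by zero."""
    if b == 0:
        return 0.0
    return float(a) / b

a = np.array([1, 6, 5])
b = np.array([2, 3, 0])

# Conceptually what np.vectorize does: call the scalar function per element pair.
looped = np.array([divide(x, y) for x, y in zip(a, b)])

# The wrapper does the same iteration and also handles broadcasting.
divide_vec = np.vectorize(divide, otypes=[float])
wrapped = divide_vec(a, b)
broadcast = divide_vec(a, 2)  # the scalar 2 is broadcast against every element of a
```

The two results are identical; the wrapper saves writing the loop but still calls `divide` once per element.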

You cannot get NumPy's native speed unless you use tools that remove the Python overhead, because Python functions can involve all sorts of dynamic activity, such as changing global state or logging. A Python function is almost a black box that can affect anything in the interpreter, so NumPy has no choice but to call it once per element.
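
The sketch below illustrates the black-box point with a function that records every call in a module-level list. The names `call_log` and `divide_and_log` are hypothetical, and `otypes` is specified so that np.vectorize does not make an extra trial call to infer the output type.

```python
import numpy as np

call_log = []  # module-level state mutated as a side effect

def divide_and_log(a, b):
    """Scalar division that also records its arguments in call_log."""
    call_log.append((a, b))
    if b == 0:
        return 0.0
    return float(a) / b

a = np.array([1, 6, 5, 8])
b = np.array([2, 3, 0, 4])

logged_divide = np.vectorize(divide_and_log, otypes=[float])
result = logged_divide(a, b)

# One log entry per element: the Python function really ran four times.
n_calls = len(call_log)
```

Since the side effect must happen for every element, the per-call interpreter overhead cannot be skipped.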

In [36]:
name_series = pd.Series(np.random.choice(['adam', 'chang', 'eliza', 'odom'], replace=True, size=100000))

def parse_name(name):
    if name.lower().startswith('a'):
        return 'A'
    elif name.lower().startswith('e'):
        return 'E'
    elif name.lower().startswith('i'):
        return 'I'
    elif name.lower().startswith('o'):
        return 'O'
    elif name.lower().startswith('u'):
        return 'U'
    return name


In [37]:
parse_name_vec = np.vectorize(parse_name)

In [38]:
%timeit name_series.apply(parse_name)

159 ms ± 1.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [39]:
%timeit parse_name_vec(name_series)

186 ms ± 2.52 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
