# Numba with pandas

In [1]:
import pandas as pd
import numpy as np
from numba import njit

In [2]:
df = pd.read_csv(
    "https://gist.githubusercontent.com/ritchie46/cac6b337ea52281aa23c049250a4ff03/raw/89a957ff3919d90e6ef2d34235e6bf22304f3366/pokemon.csv"
)

In [3]:
df.head()

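| | # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
| 1 | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
| 2 | 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
| 3 | 3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
| 4 | 4 | Charmander | Fire | | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |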


# I like using `apply` but it's slow

Making pandas code run fast usually means vectorizing it, but vectorized code is often harder to read. I like using `pd.DataFrame.apply` because it lets me break out a single well-named function for what I want to do to my dataframe, hiding implementation details from the user. The problem is that `apply` accepts arbitrary Python code and, with `axis=1`, calls it once per row. So even if you pass in a well-optimized numpy function, it will still run more slowly than the same calculation written in a vectorized way.

But Ian has taught me some excellent tricks to allow me to have my readability cake, and also quickly eat it!

Let's say I have a hypothesis that the geometric mean of attack/defense/speed is the best way to find strong pokemon...

In [4]:
def geom_mean(row):    
    return np.power(np.prod(row), 1./row.shape[0])

In [5]:
%timeit df['geom_mean'] = df[['Attack', 'Defense', 'Speed']].apply(geom_mean, axis=1)

8.61 ms ± 96.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [6]:
%timeit df['geom_mean'] = df[['Attack', 'Defense', 'Speed']].apply(geom_mean, axis=1, raw=True)

1.91 ms ± 11.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
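
The difference between the two timings above comes from what pandas hands to the function. With the default `raw=False` each row arrives as a `Series`; with `raw=True` it arrives as a plain NumPy array, so pandas does not have to build a `Series` per row. A minimal sketch on a made-up three-row frame (the `demo` frame and the `is_ndarray` helper are illustrative names, not part of this notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Attack/Defense/Speed columns.
demo = pd.DataFrame(
    {"Attack": [49, 62, 82], "Defense": [49, 63, 83], "Speed": [45, 60, 80]}
)

def is_ndarray(row):
    # Report whether apply passed this row in as a bare NumPy array.
    return isinstance(row, np.ndarray)

with_series = demo.apply(is_ndarray, axis=1)             # default raw=False
with_arrays = demo.apply(is_ndarray, axis=1, raw=True)   # raw=True

print(with_series.tolist())
print(with_arrays.tolist())
```

This is also why `raw=True` is needed before numba can help: a function compiled with `@njit` works on NumPy arrays, not on pandas `Series` objects.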


In [7]:
@njit
def geom_mean_numba(row):    
    return np.power(np.prod(row), 1./row.shape[0])

# this both tests and compiles our numba function
# (numba compiles lazily, once per argument type, so this call compiles the float64 version)
np.testing.assert_allclose(geom_mean(np.array([1.,2.,3])), geom_mean_numba(np.array([1.,2.,3])))  

In [8]:
%timeit df['geom_mean'] = df[['Attack', 'Defense', 'Speed']].apply(geom_mean_numba, axis=1, raw=True)

915 µs ± 16 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


About a 10x speed-up (8.61 ms down to 915 µs) by going from the naive implementation to `raw=True` + numba!

How does it compare to doing this the usual, vectorized way?

In [9]:
%timeit df['geom_mean'] = (df.Attack * df.Defense * df.Speed)**(1./3.)

383 µs ± 6.94 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


Vectorization is still faster in this case (383 µs versus 915 µs, roughly 2.4x), but `apply` used in a smarter way still gives a big saving over the naive version.
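
One way to keep the "single well-named function" readability while staying vectorized is to wrap the array arithmetic in a function that takes the whole block of columns at once. The sketch below is an illustration rather than something timed in this notebook; `geom_mean_rows` and the toy `stats` frame are made-up names.

```python
import numpy as np
import pandas as pd

def geom_mean_rows(values):
    """Geometric mean of each row of a 2-D array, with no Python-level loop."""
    values = np.asarray(values, dtype=np.float64)
    # Multiply across the columns of each row, then take the n-th root.
    return np.prod(values, axis=1) ** (1.0 / values.shape[1])

# Hypothetical toy frame with the same three columns used above.
stats = pd.DataFrame(
    {"Attack": [49, 62, 82], "Defense": [49, 63, 83], "Speed": [45, 60, 80]}
)
stats["geom_mean"] = geom_mean_rows(stats[["Attack", "Defense", "Speed"]].to_numpy())
print(stats)
```

The call site reads much like the `apply` version, but the function runs once on the full array instead of once per row.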

In [10]:
@njit(parallel=True)
def geom_mean_numba_parallel(row):    
    return np.power(np.prod(row), 1./row.shape[0])

# this both tests and compiles our numba function
np.testing.assert_allclose(geom_mean(np.array([1.,2.,3])), geom_mean_numba_parallel(np.array([1.,2.,3])))  

In [11]:
%timeit df['geom_mean'] = df[['Attack', 'Defense', 'Speed']].apply(geom_mean_numba_parallel, axis=1, raw=True)

4.97 ms ± 418 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


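Parallelization introduces a lot of overhead in this case (4.97 ms versus 915 µs for the serial numba version): each call only sees a single three-element row, so there is almost no work to split across threads. It probably wouldn't pay off until you get to much larger matrices.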