# Optimizing Pandas for Speed

In [None]:
import pandas as pd
import numpy as np

np.random.seed(0)

# How does the data look? What do we want to do with it?

We will work with a simple dataframe of randomly generated values.
The task is deliberately simple: add 1 to every value in the first column (A) of the dataframe.

In [55]:
df = pd.DataFrame(np.random.randn(10000, 4), columns=list('ABCD'))
df.head()

          A         B         C         D
0  1.764052  0.400157  0.978738  2.240893
1  1.867558 -0.977278  0.950088 -0.151357
2 -0.103219  0.410599  0.144044  1.454274
3  0.761038  0.121675  0.443863  0.333674
4  1.494079 -0.205158  0.313068 -0.854096


# Basic Looping

Let's see what happens if we naively loop over every row of the dataframe and update column A one element at a time.

In [56]:

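df = pd.DataFrame(np.random.randn(10000, 4), columns=list('ABCD'))
def basic_loop(df):
    # Visit each row position and update column A one element at a time.
    # Note: df['A'][i] = ... is chained indexing; pandas may warn about it,
    # and with Copy-on-Write enabled the assignment does not update df.
    for i in range(len(df)):
        df['A'][i] = df['A'][i] + 1

%timeit basic_loop(df)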

1.31 s ± 40.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
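
The chained indexing in the loop above can be avoided with a single label-based accessor. The following is a minimal sketch, not a measured alternative: the names at_loop and demo are hypothetical, and it assumes the default integer index so that the position i is also the row label.

```python
import numpy as np
import pandas as pd

def at_loop(df):
    # Same row-by-row loop, but each read and write goes through df.at,
    # a single indexing call, instead of chained indexing (df['A'][i]).
    for i in range(len(df)):
        df.at[i, 'A'] = df.at[i, 'A'] + 1

# Hypothetical demo frame, smaller than the one timed above.
demo = pd.DataFrame(np.random.randn(1000, 4), columns=list('ABCD'))
before = demo['A'].copy()
at_loop(demo)
```

This is still a Python-level loop, so it should not be expected to approach the vectorized timings; it only makes the element-wise write unambiguous.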


# Using Iterrows

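Replacing the naive loop with iterrows gives a modest improvement in runtime: about 1.01 s versus 1.31 s per pass. Note that iterrows yields each row as a Series that may be a copy, so assigning to row['A'] is not guaranteed to change df; treat the loop below as a timing comparison only.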

In [61]:
df = pd.DataFrame(np.random.randn(10000, 4), columns=list('ABCD'))
def iterrows_loop(df):
    # iterrows yields (index, row) pairs, with each row built as a Series.
    for index, row in df.iterrows():
        row['A'] = row['A'] + 1

%timeit iterrows_loop(df)

1.01 s ± 14.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
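
Because writing to the row yielded by iterrows may not reach the dataframe, one way to make the update reliable is to collect the new values during iteration and assign them once afterwards. This is a sketch under that assumption; iterrows_collect and demo are hypothetical names, and it was not part of the timings above.

```python
import numpy as np
import pandas as pd

def iterrows_collect(df):
    # Read from each row while iterating, but do not write to the row.
    new_values = [row['A'] + 1 for _, row in df.iterrows()]
    # Assign the whole column once, after the iteration has finished.
    df['A'] = new_values

# Hypothetical demo frame, smaller than the one timed above.
demo = pd.DataFrame(np.random.randn(1000, 4), columns=list('ABCD'))
before = demo['A'].copy()
iterrows_collect(demo)
```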


# Using Apply

apply calls a function on each element of a Series (or, for a DataFrame, on each row or column).

Using apply instead of explicitly iterating with iterrows is far faster here: about 4.24 ms versus 1.01 s, a speedup of more than 200x.

In [58]:
df = pd.DataFrame(np.random.randn(10000, 4), columns=list('ABCD'))
def using_apply(col):
    col=col.apply(lambda x: x+1)
    return col

%timeit df['A']=using_apply(df['A'])

4.24 ms ± 54 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
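
The cell above uses Series.apply, which visits each element of one column. For comparison, here is a small sketch of the row-wise form, DataFrame.apply with axis=1, which passes each row to the function as a Series. The variable names are hypothetical and no timing is claimed for either form.

```python
import numpy as np
import pandas as pd

# Hypothetical demo frame, smaller than the one timed above.
demo = pd.DataFrame(np.random.randn(1000, 4), columns=list('ABCD'))

# Element-wise: the lambda receives one value of column A at a time.
elementwise = demo['A'].apply(lambda x: x + 1)

# Row-wise: the lambda receives each row as a Series and picks out A.
row_wise = demo.apply(lambda row: row['A'] + 1, axis=1)
```

Both produce the same values; the row-wise form constructs a Series per row, so it is generally the slower choice when only one column is needed.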


# Pandas Vectorization

Operating on an entire dataframe column at once is more efficient than iterating or calling apply: about 558 µs versus 4.24 ms here. Many common operations, including the scalar addition we are performing, can be applied directly to a dataframe column.

In [59]:
df = pd.DataFrame(np.random.randn(10000, 4), columns=list('ABCD'))
def using_pandas_vectorization(col):
    col=col+1
    return col
%timeit df['A']=using_pandas_vectorization(df['A'])

558 µs ± 110 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
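
As a sanity check, the vectorized expression should give the same values as the apply version. The sketch below compares the two on a hypothetical demo frame and then shows the in-place form of the same update.

```python
import numpy as np
import pandas as pd

# Hypothetical demo frame, smaller than the one timed above.
demo = pd.DataFrame(np.random.randn(1000, 4), columns=list('ABCD'))

via_apply = demo['A'].apply(lambda x: x + 1)
via_vectorized = demo['A'] + 1

# Raises an AssertionError if the two Series differ.
pd.testing.assert_series_equal(via_apply, via_vectorized)

# The same update written in place on the column.
demo['A'] += 1
```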


# NumPy Vectorization

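Finally, passing the column as a NumPy array (via values) is faster still than pandas vectorization, that is, performing the operation directly on the dataframe column: about 287 µs versus 558 µs here. The arithmetic then runs on the raw array without the Series machinery such as index handling. The pandas documentation recommends to_numpy() over values for obtaining the array.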

In [60]:
df = pd.DataFrame(np.random.randn(10000, 4), columns=list('ABCD'))
def using_numpy_vectorization(col):
    col=col+1
    return col
%timeit df['A']=using_numpy_vectorization(df['A'].values)


287 µs ± 80.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
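
%timeit is an IPython magic, so the cells above only run in a notebook or IPython session. The following is a rough sketch of the same comparison in plain Python using the standard timeit module, with to_numpy() in place of values. The function names and the repetition count are assumptions, and the numbers it prints will vary by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

demo = pd.DataFrame(np.random.randn(10000, 4), columns=list('ABCD'))

def add_one_pandas(frame):
    # Arithmetic on the Series itself.
    frame['A'] = frame['A'] + 1

def add_one_numpy(frame):
    # Arithmetic on the underlying NumPy array.
    frame['A'] = frame['A'].to_numpy() + 1

# Reference result of a single update, computed before any timing runs.
expected = demo['A'].to_numpy() + 1

runs = 200  # assumed repetition count; raise it for steadier numbers
for func in (add_one_pandas, add_one_numpy):
    seconds = timeit.timeit(lambda: func(demo), number=runs)
    print(f"{func.__name__}: {seconds / runs * 1e6:.1f} µs per call")
```

As in the notebook cells, each timed call updates the same frame, so the values in column A keep growing across runs; that does not affect the timing comparison.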
