# Optimizing functions on pandas dataframes

In this notebook, we compare the efficiency of several ways of applying a function to a Pandas DataFrame:
1. Crude looping over df using indices
2. Looping with iterrows
3. .apply()
4. Vectorization with Pandas series
5. Vectorization with NumPy arrays

We use the <mark>%%timeit</mark> cell magic to measure execution times. By default it repeats the measurement 7 times, choosing the number of loops per run automatically, and reports the mean and standard deviation of the per-loop run times.
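
Outside a notebook, a similar measurement can be sketched with the standard-library `timeit` module. This is only an approximation of what the magic reports: the `work` function is a hypothetical stand-in workload, and the loop count is fixed at 10 here rather than chosen automatically.

```python
import timeit


def work():
    # Hypothetical stand-in workload; replace with the code being measured.
    return sum(i * i for i in range(1000))


LOOPS = 10  # loops per run, fixed here for simplicity
RUNS = 7    # number of repeated runs, matching the "7 runs" in the outputs

# timeit.repeat returns one total time (in seconds) per run.
totals = timeit.repeat(work, repeat=RUNS, number=LOOPS)
per_loop = [t / LOOPS for t in totals]

# Mean and standard deviation of the per-loop times across the runs.
mean = sum(per_loop) / len(per_loop)
std = (sum((t - mean) ** 2 for t in per_loop) / len(per_loop)) ** 0.5

print(f"{mean * 1e6:.1f} us ± {std * 1e6:.1f} us per loop "
      f"(mean ± std. dev. of {RUNS} runs, {LOOPS} loops each)")
```

The slow cells in this notebook show `1 loop each` because a single execution already takes seconds, while the vectorized cells show `10 loops each`.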

In [1]:
import numpy as np
import pandas as pd

## Read File

In [2]:
df = pd.read_csv('./data/input_file.csv.gz')

The file contains pairs of (latitude, longitude) coordinates. We will apply the haversine distance function to this dataset.

In [3]:
df.head()

Unnamed: 0,id,lat_a,long_a,lat_b,long_b
0,id1_id1,22.541167,88.326124,22.541167,88.326124
1,id1_id2,22.541167,88.326124,22.487651,88.350805
2,id1_id3,22.541167,88.326124,22.486251,88.352905
3,id1_id4,22.541167,88.326124,22.569698,88.350871
4,id1_id5,22.541167,88.326124,22.498853,88.372714


The file has <b>250K</b> records

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250000 entries, 0 to 249999
Data columns (total 5 columns):
id        250000 non-null object
lat_a     250000 non-null float64
long_a    250000 non-null float64
lat_b     250000 non-null float64
long_b    250000 non-null float64
dtypes: float64(4), object(1)
memory usage: 9.5+ MB


## Haversine Function
[More info](https://www.movable-type.co.uk/scripts/latlong.html)

In [5]:
def calculate_distance_haversine(lat_a, lng_a, lat_b, lng_b):
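    '''
    Calculate the haversine (great-circle) distance between two points given as (latitude, longitude).
    Inputs are in degrees and may be scalars or equal-length arrays/Series.
    Parameters:
     lat_a: latitude of point a
     lng_a: longitude of point a
     lat_b: latitude of point b
     lng_b: longitude of point b
    Returns:
     haversine distance in km, rounded to 3 decimal places
    '''
    # Convert latitudes and longitudes from degrees to radians
    lng_a, lat_a, lng_b, lat_b = map(np.deg2rad, [lng_a, lat_a, lng_b, lat_b])
    # The sign of d_lng does not matter because sin(d_lng / 2) is squared below
    d_lat, d_lng = lat_b - lat_a, lng_a - lng_b
    # Haversine term: sin^2(d_lat / 2) + cos(lat_a) * cos(lat_b) * sin^2(d_lng / 2)
    temp = (
            np.sin(d_lat / 2) ** 2
            + np.cos(lat_a)
            * np.cos(lat_b)
            * np.sin(d_lng / 2) ** 2
            )
    # 6373.0 is the Earth radius in km used in this notebook
    distance = 6373.0 * 2 * np.arcsin(np.sqrt(temp))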
    return np.round(distance,3)
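
Because the function is built only from NumPy operations, the same code accepts either single numbers or whole arrays. The sketch below illustrates this with a compact restatement of the function so that it runs on its own; the names `haversine_km` and `radius_km` are hypothetical and not part of the notebook, and the coordinates are taken from the `df.head()` output shown earlier.

```python
import numpy as np


def haversine_km(lat_a, lng_a, lat_b, lng_b, radius_km=6373.0):
    # Compact restatement of calculate_distance_haversine (hypothetical name),
    # repeated here only so that this snippet is self-contained.
    lat_a, lng_a, lat_b, lng_b = map(np.deg2rad, [lat_a, lng_a, lat_b, lng_b])
    d_lat, d_lng = lat_b - lat_a, lng_b - lng_a
    temp = (np.sin(d_lat / 2) ** 2
            + np.cos(lat_a) * np.cos(lat_b) * np.sin(d_lng / 2) ** 2)
    return np.round(radius_km * 2 * np.arcsin(np.sqrt(temp)), 3)


# Scalar call: one pair of points (the id1_id2 row of the dataset).
single = haversine_km(22.541167, 88.326124, 22.487651, 88.350805)

# Array call: the same function applied element-wise to the id1_id1 and id1_id2 rows.
lat_a = np.array([22.541167, 22.541167])
lng_a = np.array([88.326124, 88.326124])
lat_b = np.array([22.541167, 22.487651])
lng_b = np.array([88.326124, 88.350805])
batch = haversine_km(lat_a, lng_a, lat_b, lng_b)

print(single)
print(batch)
```

The first array element is the distance from a point to itself, so it is zero; the second matches the scalar call.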

## Looping over row indices

In [6]:
%%timeit

distance = []
for i in range(df.shape[0]):
    d = calculate_distance_haversine(df.loc[i]['lat_a'], df.loc[i]['long_a'], df.loc[i]['lat_b'], df.loc[i]['long_b'])
    distance.append(d)
df['distance'] = distance

3min 30s ± 5.68 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [7]:
df.head()

Unnamed: 0,id,lat_a,long_a,lat_b,long_b,distance
0,id1_id1,22.541167,88.326124,22.541167,88.326124,0.0
1,id1_id2,22.541167,88.326124,22.487651,88.350805,6.47
2,id1_id3,22.541167,88.326124,22.486251,88.352905,6.7
3,id1_id4,22.541167,88.326124,22.569698,88.350871,4.066
4,id1_id5,22.541167,88.326124,22.498853,88.372714,6.713


## Looping with iterrows()

In [8]:
%%timeit

haversine_series = []
for index, row in df.iterrows():
    d = calculate_distance_haversine(row['lat_a'], row['long_a'], row['lat_b'], row['long_b'])
    haversine_series.append(d)

df['distance'] = haversine_series

42.5 s ± 507 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [9]:
df.head()

Unnamed: 0,id,lat_a,long_a,lat_b,long_b,distance
0,id1_id1,22.541167,88.326124,22.541167,88.326124,0.0
1,id1_id2,22.541167,88.326124,22.487651,88.350805,6.47
2,id1_id3,22.541167,88.326124,22.486251,88.352905,6.7
3,id1_id4,22.541167,88.326124,22.569698,88.350871,4.066
4,id1_id5,22.541167,88.326124,22.498853,88.372714,6.713


## apply

In [10]:
%%timeit

df['distance'] = df.apply(lambda row: calculate_distance_haversine(row['lat_a'],row['long_a'],row['lat_b'],row['long_b']),
                          axis=1)

18 s ± 332 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [11]:
df.head()

Unnamed: 0,id,lat_a,long_a,lat_b,long_b,distance
0,id1_id1,22.541167,88.326124,22.541167,88.326124,0.0
1,id1_id2,22.541167,88.326124,22.487651,88.350805,6.47
2,id1_id3,22.541167,88.326124,22.486251,88.352905,6.7
3,id1_id4,22.541167,88.326124,22.569698,88.350871,4.066
4,id1_id5,22.541167,88.326124,22.498853,88.372714,6.713


## Pandas series vectorization

In [12]:
%%timeit 

df['distance'] = calculate_distance_haversine(df['lat_a'],df['long_a'],df['lat_b'],df['long_b'])

57.1 ms ± 3.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [13]:
df.head()

Unnamed: 0,id,lat_a,long_a,lat_b,long_b,distance
0,id1_id1,22.541167,88.326124,22.541167,88.326124,0.0
1,id1_id2,22.541167,88.326124,22.487651,88.350805,6.47
2,id1_id3,22.541167,88.326124,22.486251,88.352905,6.7
3,id1_id4,22.541167,88.326124,22.569698,88.350871,4.066
4,id1_id5,22.541167,88.326124,22.498853,88.372714,6.713


## NumPy arrays vectorization
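Passing `.values` hands the function the underlying NumPy arrays instead of pandas Series. Note that vectorization requires the function to be built from NumPy functions, which operate element-wise on whole arrays. Functions from other packages such as the standard `math` module accept only scalars and fail when given an array or a Series.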
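
A minimal sketch of that difference, using a small made-up array of angles (the variable names are illustrative only):

```python
import math

import numpy as np

angles = np.array([0.0, np.pi / 2, np.pi])

# np.sin is a NumPy ufunc: it is applied element-wise to the whole array.
vectorized = np.sin(angles)

# math.sin expects a single number, so a multi-element array raises TypeError.
try:
    math.sin(angles)
    math_accepts_arrays = True
except TypeError:
    math_accepts_arrays = False

print(vectorized)
print("math.sin accepts arrays:", math_accepts_arrays)
```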

In [14]:
%%timeit

df['distance'] = calculate_distance_haversine(df['lat_a'].values,
                                              df['long_a'].values,
                                              df['lat_b'].values,
                                              df['long_b'].values)

42.3 ms ± 3.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [15]:
df.head()

Unnamed: 0,id,lat_a,long_a,lat_b,long_b,distance
0,id1_id1,22.541167,88.326124,22.541167,88.326124,0.0
1,id1_id2,22.541167,88.326124,22.487651,88.350805,6.47
2,id1_id3,22.541167,88.326124,22.486251,88.352905,6.7
3,id1_id4,22.541167,88.326124,22.569698,88.350871,4.066
4,id1_id5,22.541167,88.326124,22.498853,88.372714,6.713


## Summary

By avoiding row-by-row loops over a pandas DataFrame, you can speed up your code by far more than 100x. In the runs above, vectorization was roughly 300x faster than `apply` and close to 5,000x faster than looping over indices.
- Looping over indices: ~3.5 min
- Looping using iterrows: ~40 s
- apply: ~20 s
- pandas vectorization: ~60 ms
- NumPy vectorization: ~40 ms
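
The ratios between the approaches can be recomputed from the mean `%%timeit` results reported in the cells above. The snippet below is only arithmetic on those reported means; the dictionary keys are informal labels, and the numbers will differ on other machines.

```python
# Mean run times reported by %%timeit in this notebook, converted to seconds.
timings_s = {
    "loop over indices": 3 * 60 + 30,   # 3min 30s
    "iterrows": 42.5,
    "apply": 18.0,
    "pandas vectorization": 57.1e-3,    # 57.1 ms
    "numpy vectorization": 42.3e-3,     # 42.3 ms
}

# Use the fastest approach as the baseline for the comparison.
baseline = timings_s["numpy vectorization"]

for name, seconds in timings_s.items():
    ratio = seconds / baseline
    print(f"{name:22s} {seconds:10.4f} s   {ratio:8.1f}x the NumPy time")
```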