## Pandas: Apply vs Map

### Conclusions For Pandas Optimization
#### Optimizing Pandas usage: https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6
1. Avoid loops. They're slow
2. Use apply if you have to loop
3. Vectorization is better than scalar operations. **Most common operations in Pandas can be vectorized**
4. Vectorization in NumPy is even faster than in Pandas

### Personal takeaways
- Try to vectorize as much as possible in Pandas
- If not, use the apply function

________



### Basic differences
#### Basics: https://towardsdatascience.com/introduction-to-pandas-apply-applymap-and-map-5d3e044e93ff
- apply: works on a pd.DataFrame or a pd.Series. On a DataFrame, axis=0 passes each column to the function and axis=1 passes each row
- applymap: applies an element-wise operation across the whole DataFrame
- map: works on a pd.Series; substitutes each value with another value




In [49]:
import pandas as pd
import numpy as np

In [50]:
# Distance formula
# Our function takes the latitude and longitude of two points,
# adjusts for Earth's curvature, and calculates the
# great-circle distance between them in miles.
def haversine(lat1, lon1, lat2, lon2):
    MILES = 3959
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1 
    dlon = lon2 - lon1 
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a)) 
    total_miles = MILES * c
    return total_miles
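
A quick sanity check can confirm the function behaves as expected before timing it. The sketch below repeats the definition so it runs on its own; the test coordinates are arbitrary choices for illustration, not values taken from the hotel dataset.

```python
import numpy as np

def haversine(lat1, lon1, lat2, lon2):
    MILES = 3959  # approximate Earth radius in miles
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return MILES * 2 * np.arcsin(np.sqrt(a))

# A point is zero miles away from itself.
same_point = haversine(40.671, -73.985, 40.671, -73.985)

# One degree of latitude along a meridian spans radius * (pi / 180) miles.
one_degree = haversine(40.0, -73.985, 41.0, -73.985)
expected_one_degree = 3959 * np.pi / 180

# NumPy functions broadcast, so arrays of coordinates work without changes.
batch = haversine(40.0, -73.985,
                  np.array([40.0, 41.0]),
                  np.array([-73.985, -73.985]))
```

Because every operation inside the function is a NumPy ufunc or plain arithmetic, the same code accepts scalars, arrays, and Pandas Series. That property is what makes the vectorized versions later in these notes possible.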

In [62]:
df = pd.read_csv('new_york_hotels.csv')
df[:2]

Unnamed: 0,ean_hotel_id,name,address1,city,state_province,postal_code,latitude,longitude,star_rating,high_rate,low_rate
0,269955,Hilton Garden Inn Albany/SUNY Area,1389 Washington Ave,Albany,NY,12206,42.68751,-73.81643,3.0,154.0272,124.0216
1,113431,Courtyard by Marriott Albany Thruway,1455 Washington Avenue,Albany,NY,12206,42.68971,-73.82021,3.0,179.01,134.0


In [54]:
# Crude Looping
def haversine_looping(df):
    distance_list = []
    for i in range(0, len(df)):
        d = haversine(40.671, -73.985, 
                      df.iloc[i]['latitude'], df.iloc[i]['longitude'])
        distance_list.append(d)
    return distance_list

In [55]:
%%timeit
df['distance'] = haversine_looping(df)

3.62 s ± 545 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [None]:
# iterrows() returns a generator that yields (index, row) pairs

In [56]:
%%timeit

# Haversine applied on rows via iteration
haversine_series = []
for index, row in df.iterrows():
    haversine_series.append(haversine(40.671, -73.985, row['latitude'], row['longitude']))
df['distance'] = haversine_series

1.1 s ± 142 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [63]:
%%timeit
# Apply
# Also loops through rows like iterrows(), but takes advantage of
# internal optimizations, such as using iterators in Cython

# Timing apply on the Haversine function
df['distance'] = df.apply(lambda row: haversine(40.671, -73.985,
                                                row['latitude'],
                                                row['longitude']),
                          axis=1)


253 ms ± 51.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [73]:
# %load_ext line_profiler

In [74]:
# %lprun -f  haversine df.apply(lambda row: haversine(40.671, -73.985, row['latitude'], row['longitude']), axis=1)


#### Vectorization over Pandas Series

In [76]:
%%timeit 

# Vectorized implementation of Haversine applied on Pandas series
df['distance'] = haversine(40.671, -73.985, df['latitude'], df['longitude'])

13.7 ms ± 3.78 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


#### Vectorization over NumPy Arrays

In [85]:

%%timeit

# Vectorized implementation of Haversine applied on NumPy arrays
df['distance'] = haversine(40.671, -73.985, df['latitude'].values,
                           df['longitude'].values)


1.39 ms ± 365 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
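
Speed only matters if every approach returns the same distances. The sketch below checks that on synthetic data, since `new_york_hotels.csv` may not be at hand; the random coordinates, the seed, and the row count are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def haversine(lat1, lon1, lat2, lon2):
    MILES = 3959  # approximate Earth radius in miles
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return MILES * 2 * np.arcsin(np.sqrt(a))

# Hypothetical stand-in for the hotel data: random points in a box around New York State.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "latitude": rng.uniform(40.5, 45.0, size=1_000),
    "longitude": rng.uniform(-79.5, -72.0, size=1_000),
})

# Row-wise apply: one Python-level call per row.
via_apply = df.apply(
    lambda row: haversine(40.671, -73.985, row["latitude"], row["longitude"]),
    axis=1,
)

# Vectorized over Pandas Series: one call for the whole column.
via_series = haversine(40.671, -73.985, df["latitude"], df["longitude"])

# Vectorized over the underlying NumPy arrays: skips index handling.
via_numpy = haversine(40.671, -73.985,
                      df["latitude"].values, df["longitude"].values)

all_agree = np.allclose(via_apply, via_series) and np.allclose(via_series, via_numpy)
```

Wrapping each variant in `%%timeit` on this frame should show the same ordering as the timings above, though the absolute numbers depend on the machine and the row count.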






In [1]:
import pandas as pd
import numpy as np

In [57]:
df = pd.DataFrame({
    'days': [1,2,3,4],
    'visitors': [100, 20, 30, 60]
})
df

Unnamed: 0,days,visitors
0,1,100
1,2,20
2,3,30
3,4,60


In [58]:
def custom_sum(row):
    # Receives one row (axis=1) or one column (axis=0) as a Series
    return row.sum()

df['sum'] = df.apply(custom_sum, axis=1)
df

Unnamed: 0,days,visitors,sum
0,1,100,101
1,2,20,22
2,3,30,33
3,4,60,64


In [31]:
df.loc[4] = df.apply(custom_sum, axis=0)
df

Unnamed: 0,days,visitors,sum
0,1,100,101
1,2,20,22
2,3,30,33
3,4,60,64
4,10,210,220
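
For a function that only sums its input, `apply` is not needed at all: `DataFrame.sum` does the same work in vectorized form. The sketch below compares the two on a fresh copy of the small frame; the helper name `total` is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "days": [1, 2, 3, 4],
    "visitors": [100, 20, 30, 60],
})

def total(values):
    # Hypothetical helper: `values` is one row (axis=1) or one column (axis=0).
    return values.sum()

# apply calls the Python function once per row or once per column.
row_totals_apply = df.apply(total, axis=1)
col_totals_apply = df.apply(total, axis=0)

# The built-in reduction gives the same numbers without a Python-level loop.
row_totals_vec = df.sum(axis=1)
col_totals_vec = df.sum(axis=0)
```

The axis convention is the same in both cases: `axis=1` produces one value per row, `axis=0` one value per column.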


In [32]:
# When calling apply on a Series, no axis argument is needed;
# the function receives one value at a time
df['mult'] = df['sum'].apply(lambda value: value * 2)
df

Unnamed: 0,days,visitors,sum,mult
0,1,100,101,202
1,2,20,22,44
2,3,30,33,66
3,4,60,64,128
4,10,210,220,440


In [33]:
# result_type='broadcast' takes each row's result and fills it
# into the shape of the original df (DataFrame)
df.apply(custom_sum, axis=1, result_type='broadcast')

Unnamed: 0,days,visitors,sum,mult
0,404,404,404,404
1,88,88,88,88
2,132,132,132,132
3,256,256,256,256
4,880,880,880,880


In [36]:
df

Unnamed: 0,days,visitors,sum,mult
0,1,100,101,202
1,2,20,22,44
2,3,30,33,66
3,4,60,64,128
4,10,210,220,440


In [37]:
# result_type='expand' turns list-like results into a new DataFrame with columns 0, 1, ...
def cal_multi_col(row):
    return [row['days'] * 2, row['visitors']* 0.1]

res = df.apply(cal_multi_col, axis=1, result_type='expand')
res

Unnamed: 0,0,1
0,2.0,10.0
1,4.0,2.0
2,6.0,3.0
3,8.0,6.0
4,20.0,21.0


In [39]:
df[res.columns] = res
df

Unnamed: 0,days,visitors,sum,mult,0,1
0,1,100,101,202,2.0,10.0
1,2,20,22,44,4.0,2.0
2,3,30,33,66,6.0,3.0
3,4,60,64,128,8.0,6.0
4,10,210,220,440,20.0,21.0


In [41]:
# result_type='reduce' returns a Series where possible, so each list stays in a single cell
df['new'] = df.apply(cal_multi_col, axis=1, result_type='reduce')
df

Unnamed: 0,days,visitors,sum,mult,0,1,new
0,1,100,101,202,2.0,10.0,"[2, 10.0]"
1,2,20,22,44,4.0,2.0,"[4, 2.0]"
2,3,30,33,66,6.0,3.0,"[6, 3.0]"
3,4,60,64,128,8.0,6.0,"[8, 6.0]"
4,10,210,220,440,20.0,21.0,"[20, 21.0]"
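
The integer column names `0` and `1` produced by `expand` are easy to lose track of. One possible alternative, sketched below, is to return a labeled `pd.Series` from the function so the new columns arrive with names. The function name `cal_named_cols` and the column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "days": [1, 2, 3, 4],
    "visitors": [100, 20, 30, 60],
})

def cal_named_cols(row):
    # Hypothetical variant of cal_multi_col that labels its outputs.
    return pd.Series({
        "double_days": row["days"] * 2,
        "tenth_visitors": row["visitors"] * 0.1,
    })

# When the function returns a Series, apply builds a DataFrame
# whose columns are the Series labels.
res = df.apply(cal_named_cols, axis=1)

# Attach the new columns to the original frame by index.
df = df.join(res)
```

Building a Series per row adds overhead, so this trades some speed for readability. For simple arithmetic like this, assigning `df["days"] * 2` directly would be the vectorized choice.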


### Applymap
Can only be used on a DataFrame, for element-wise operations. Since pandas 2.1 it is deprecated in favor of DataFrame.map.

Can be faster than .apply, but time both on your own data to compare

In [42]:
df = pd.DataFrame({
    'a': [1,2,3,4],
    'b': [100, 20, 30, 60]
})
df

Unnamed: 0,a,b
0,1,100
1,2,20
2,3,30
3,4,60


In [43]:
df.applymap(np.square)

Unnamed: 0,a,b
0,1,10000
1,4,400
2,9,900
3,16,3600
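
Squaring is an operation that NumPy and Pandas already vectorize, so an element-wise call is optional here. The sketch below compares the element-wise route with two vectorized ones. The `hasattr` check is an assumption about the installed version: it picks `DataFrame.map` where available (pandas 2.1 and later) and falls back to `applymap` otherwise.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [100, 20, 30, 60],
})

# Element-wise: the function is called once per cell.
elementwise = df.map if hasattr(pd.DataFrame, "map") else df.applymap
squared_elementwise = elementwise(np.square)

# Vectorized: the whole frame is squared in one call.
squared_ufunc = np.square(df)
squared_operator = df ** 2
```

All three produce the same frame. The element-wise version is mainly useful for functions with no vectorized counterpart, such as custom string formatting.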


### Map
Series.map works on a Series and substitutes each value using a dict, another Series, or a function

In [44]:
series = pd.Series(['cat', 'dog', np.nan, 'rabbit'])
series

0       cat
1       dog
2       NaN
3    rabbit
dtype: object

In [45]:
series.map({'cat': 'kitten', 'dog': 'puppy'})

0    kitten
1     puppy
2       NaN
3       NaN
dtype: object
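
Values missing from the dict become NaN, as `'rabbit'` does above. If the intent is to rename only some values and keep the rest, one common approach is to fill the gaps from the original Series. This is a minimal sketch; the name `renamed` is made up for illustration.

```python
import numpy as np
import pandas as pd

series = pd.Series(["cat", "dog", np.nan, "rabbit"])
mapping = {"cat": "kitten", "dog": "puppy"}

# map() leaves NaN wherever the dict has no entry;
# fillna(series) restores the original value at those positions.
renamed = series.map(mapping).fillna(series)
```

The entry that was NaN in the original stays NaN, because there is nothing to restore at that position.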

In [47]:
series.map('I am {}'.format)

0       I am cat
1       I am dog
2       I am nan
3    I am rabbit
dtype: object

In [48]:
series.map('I am {}'.format, na_action='ignore')

0       I am cat
1       I am dog
2            NaN
3    I am rabbit
dtype: object