# pandas performance tests

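We test the following approaches:

1. Crude looping over DataFrame rows by index

2. Looping with iterrows()

3. Looping with apply()

4. Vectorization over pandas Series

5. Vectorization over NumPy arrays

The test function is the Haversine distance formula. It takes the latitude and longitude of two points, accounts for the curvature of the sphere, and computes the great-circle distance between them.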
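
For reference, the formula implemented in cell In [1] can be written as follows. The symbols are chosen here for illustration: $\varphi$ is latitude and $\lambda$ is longitude, both in radians, and $R$ is the radius of the sphere (the code uses 3959 miles).

```latex
a = \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right)
    + \cos\varphi_1 \,\cos\varphi_2 \,\sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right),
\qquad
d = 2R \arcsin\!\left(\sqrt{a}\right)
```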

In [1]:
import numpy as np

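def haversine(lat1, lon1, lat2, lon2):
    """Return the great-circle distance in miles between two points given in degrees."""
    MILES = 3959  # approximate mean radius of the Earth in miles
    # Convert all coordinates from degrees to radians
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine formula
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    total_miles = MILES * c

    return total_miles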
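
Before timing anything, it may help to sanity-check the function on cases with a known answer. The sketch below repeats the function so it runs on its own; the variable names and test points are illustrative and not part of the original notebook. It checks that identical points are zero miles apart, that one degree of latitude along a meridian equals R·π/180 miles, and that array inputs return one distance per element.

```python
import numpy as np

def haversine(lat1, lon1, lat2, lon2):
    # Same formula as the function above, repeated so this snippet is self-contained
    MILES = 3959
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    return MILES * 2 * np.arcsin(np.sqrt(a))

# Identical points are zero miles apart
same_point = haversine(40.671, -73.985, 40.671, -73.985)

# One degree of latitude along a meridian spans R * pi / 180 miles
one_degree = haversine(40.0, -73.985, 41.0, -73.985)
expected_one_degree = 3959 * np.pi / 180

# Array inputs broadcast against the scalar start point: one distance per element
lats = np.array([40.671, 41.671])
lons = np.array([-73.985, -73.985])
distances = haversine(40.671, -73.985, lats, lons)
```

The array case is what makes the vectorized sections later in this post possible: the same function body works unchanged on scalars, pandas Series, and NumPy arrays.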

The dataset contains the coordinates of all hotels in New York:

In [2]:
import pandas as pd
data = pd.read_csv('new_york_hotels.csv', encoding='cp1252')
data.head()

| Unnamed: 0 | ean_hotel_id | name | address1 | city | state_province | postal_code | latitude | longitude | star_rating | high_rate | low_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 269955 | Hilton Garden Inn Albany/SUNY Area | 1389 Washington Ave | Albany | NY | 12206 | 42.68751 | -73.81643 | 3.0 | 154.0272 | 124.0216 |
| 1 | 113431 | Courtyard by Marriott Albany Thruway | 1455 Washington Avenue | Albany | NY | 12206 | 42.68971 | -73.82021 | 3.0 | 179.01 | 134.0 |
| 2 | 108151 | Radisson Hotel Albany | 205 Wolf Rd | Albany | NY | 12205 | 42.7241 | -73.79822 | 3.0 | 134.17 | 84.16 |
| 3 | 254756 | Hilton Garden Inn Albany Medical Center | 62 New Scotland Ave | Albany | NY | 12208 | 42.65157 | -73.77638 | 3.0 | 308.2807 | 228.4597 |
| 4 | 198232 | CrestHill Suites SUNY University Albany | 1415 Washington Avenue | Albany | NY | 12206 | 42.68873 | -73.81854 | 3.0 | 169.39 | 89.39 |
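
The CSV file itself is not included with this text. To run the timing cells without it, a synthetic stand-in with the same `latitude` and `longitude` columns is enough. The following is a minimal sketch: `make_fake_hotels` is a hypothetical helper, and the row count, seed, and coordinate bounds are assumptions rather than properties of the real file.

```python
import numpy as np
import pandas as pd

def make_fake_hotels(n_rows=1000, seed=0):
    """Hypothetical helper: random points in a box roughly covering New York State."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        'latitude': rng.uniform(40.5, 45.0, n_rows),     # assumed bounds
        'longitude': rng.uniform(-79.8, -71.8, n_rows),  # assumed bounds
    })

# Drop-in replacement for the DataFrame loaded from the CSV
data = make_fake_hotels()
```

Timings measured on such a stand-in will not match the numbers reported in this post, but the relative ordering of the approaches can still be compared.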


## Crude looping

In [3]:
def haversine_looping(df):
    distance_list = []
    for i in range(0, len(df)):
        d = haversine(40.671, -73.985, df.iloc[i]['latitude'], df.iloc[i]['longitude'])
        distance_list.append(d)

    return distance_list

In [4]:
%%timeit
# Run the haversine looping function
data['distance'] = haversine_looping(data)

1 loop, best of 3: 544 ms per loop


## Looping with iterrows()

In [5]:
%%timeit

# Haversine applied on rows via iteration
haversine_series = []
for index, row in data.iterrows():
    haversine_series.append(haversine(40.671, -73.985, row['latitude'], row['longitude']))

data['distance'] = haversine_series

1 loop, best of 3: 200 ms per loop


## Better looping with the apply() method

In [6]:
%%timeit
# Timing apply on the Haversine function
data['distance'] = data.apply(lambda row: haversine(40.671, -73.985, row['latitude'], row['longitude']), axis=1)

10 loops, best of 3: 79.7 ms per loop


## Vectorization over pandas Series

In [7]:
%%timeit
# Vectorized implementation of Haversine applied on Pandas series
data['distance'] = haversine(40.671, -73.985, data['latitude'], data['longitude'])

1000 loops, best of 3: 1.38 ms per loop


## Vectorization with NumPy arrays

In [8]:
%%timeit
# Vectorized implementation of Haversine applied on NumPy arrays
data['distance'] = haversine(40.671, -73.985, data['latitude'].values, data['longitude'].values)

The slowest run took 4.13 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 344 µs per loop
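
The cell above uses `.values` to get at the underlying arrays. The pandas documentation recommends `Series.to_numpy()` for the same purpose (available since pandas 0.24). A small sketch, using coordinates copied from the first two sample rows shown earlier, suggests the two are interchangeable for plain numeric columns like these:

```python
import numpy as np
import pandas as pd

# Coordinates copied from the first two sample rows of the hotel data
df = pd.DataFrame({'latitude': [42.68751, 42.68971],
                   'longitude': [-73.81643, -73.82021]})

lat_values = df['latitude'].values      # style used in the timing cell above
lat_numpy = df['latitude'].to_numpy()   # explicit conversion method

# Both are plain NumPy arrays holding the same numbers
same_type = type(lat_values) is type(lat_numpy) is np.ndarray
same_data = np.array_equal(lat_values, lat_numpy)
```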


## Conclusions

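This leads to a few basic conclusions about optimizing pandas code:

  1. Avoid loops; they are slow and, in most cases, unnecessary.
  2. If you must loop, use apply() rather than iteration functions such as iterrows().
  3. Vectorization usually beats scalar operations. Most common operations in pandas can be vectorized.
  4. Vectorized operations on NumPy arrays are faster than on pandas Series (344 µs versus 1.38 ms in the runs above).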
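
These conclusions can be checked outside a notebook with the standard `timeit` module. The sketch below is not the benchmark used in this post. It runs on synthetic data with an assumed size and seed, and the wrapper names `with_apply`, `with_series`, and `with_numpy` are made up for illustration. It first confirms that the approaches return the same distances, then prints an average time per call.

```python
import timeit
import numpy as np
import pandas as pd

def haversine(lat1, lon1, lat2, lon2):
    MILES = 3959
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    return MILES * 2 * np.arcsin(np.sqrt(a))

# Synthetic stand-in for the hotel data (assumed size and bounds)
rng = np.random.default_rng(0)
df = pd.DataFrame({'latitude': rng.uniform(40.5, 45.0, 500),
                   'longitude': rng.uniform(-79.8, -71.8, 500)})

def with_apply(df):
    return df.apply(lambda row: haversine(40.671, -73.985, row['latitude'], row['longitude']), axis=1)

def with_series(df):
    return haversine(40.671, -73.985, df['latitude'], df['longitude'])

def with_numpy(df):
    return haversine(40.671, -73.985, df['latitude'].values, df['longitude'].values)

# All approaches should agree before their speed is compared
results_match = (np.allclose(with_apply(df), with_series(df))
                 and np.allclose(with_series(df), with_numpy(df)))

# Average seconds per call over a few repetitions
timings = {f.__name__: timeit.timeit(lambda: f(df), number=5) / 5
           for f in (with_apply, with_series, with_numpy)}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.3f} ms")
```

Absolute numbers depend on the machine, the pandas and NumPy versions, and the number of rows, so only the relative ordering is meaningful.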

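Of course, this is not a comprehensive list of every possible pandas optimization. More adventurous users might go further and rewrite the function in Cython, or try to optimize its individual components. Those topics, however, are beyond the scope of this article.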

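Most importantly, before embarking on a grand optimization adventure, make sure that the function you are optimizing is actually one you expect to keep using over the long term. To quote the line usually attributed to Donald Knuth: "Premature optimization is the root of all evil."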