## Pandas

<br>

### Development Environment

In [6]:
import time
import numpy as np
import pandas as pd

### NumPy Array

In [3]:
data = [[np.random.randint(1_000_000) for _ in range(2)] for _ in range(1_000_000)]
df = pd.DataFrame(data, columns=['X', 'Y'])

In [4]:
df 

| | X | Y |
|---|---|---|
| 0 | 763762 | 514248 |
| 1 | 299451 | 161467 |
| 2 | 47573 | 647949 |
| 3 | 886777 | 374505 |
| 4 | 388853 | 356780 |
| ... | ... | ... |
| 999995 | 842610 | 33966 |
| 999996 | 262999 | 12276 |
| 999997 | 513398 | 616229 |
| 999998 | 407523 | 648943 |
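
The nested list comprehension above calls `np.random.randint` two million times from Python. As a rough sketch of an alternative, the same shape of data can be drawn in a single NumPy call and handed to `pd.DataFrame` as an array. The seed, the `rng` and `arr` names, and the `df_fast` name are assumptions made for this illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Illustrative seed so the sketch is reproducible; the notebook itself is unseeded.
rng = np.random.default_rng(0)

# One call fills the whole (rows, 2) array with integers in [0, 1_000_000).
arr = rng.integers(0, 1_000_000, size=(1_000_000, 2))

# Wrap the ndarray directly instead of building a list of lists first.
df_fast = pd.DataFrame(arr, columns=["X", "Y"])
print(df_fast.shape)
```

The values differ from the table above because the random source differs; only the shape and value range are meant to match.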


### Query

In [5]:
%%timeit

df.query("X >= 500_000 & Y < 500_000 & X + Y <= 700_000")

23.1 ms ± 1.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
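
For comparison, the same filter can be written as a boolean mask. This is a minimal sketch on a smaller, seeded frame (the 10,000-row size, the seed, and the `via_query` / `via_mask` names are assumptions for illustration). Note that inside `query` the `&` operator binds like `and`, whereas in plain Python each comparison needs its own parentheses.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the notebook uses 1,000,000 unseeded rows.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 1_000_000, size=(10_000, 2)), columns=["X", "Y"])

# String expression, as in the cell above.
via_query = df.query("X >= 500_000 & Y < 500_000 & X + Y <= 700_000")

# Equivalent boolean mask; parentheses are required around each comparison.
mask = (df["X"] >= 500_000) & (df["Y"] < 500_000) & (df["X"] + df["Y"] <= 700_000)
via_mask = df[mask]

print(via_query.equals(via_mask))
```

Which form is faster depends on the data size and on whether `numexpr` is installed, so it is worth timing both with `%%timeit` on the real frame.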


### Vectorization

In [7]:
def strange_calc(v):
    return (1.8 * v + 32) / 7 ** 3 // 17 % 9

# np.vectorize still calls the function once per element; it is a convenience wrapper.
@np.vectorize
def strange_calc_vectorized(v):
    return (1.8 * v + 32) / 7 ** 3 // 17 % 9

In [8]:
start = time.perf_counter()
df['X'].apply(strange_calc)
df['Y'].apply(strange_calc)
print(time.perf_counter() - start)

start = time.perf_counter()
strange_calc_vectorized(df['X'])
strange_calc_vectorized(df['Y'])
print(time.perf_counter() - start)

0.769676800000525
0.22090380000008736
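
The NumPy documentation describes `np.vectorize` as a convenience that is essentially a for loop. Because `strange_calc` uses only arithmetic operators, it can also be applied to a whole column at once, which lets pandas and NumPy do the work in compiled code. The sketch below assumes a smaller seeded frame (100,000 rows) and hypothetical result names (`looped`, `whole_column`, `raw_array`); timings will vary by machine, so none are quoted here.

```python
import time
import numpy as np
import pandas as pd

def strange_calc(v):
    return (1.8 * v + 32) / 7 ** 3 // 17 % 9

# Small illustrative frame; the notebook uses 1,000,000 unseeded rows.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 1_000_000, size=(100_000, 2)), columns=["X", "Y"])

# Element by element: one Python call per row.
start = time.perf_counter()
looped = df["X"].apply(strange_calc)
print("apply:  ", time.perf_counter() - start)

# Whole Series at once: each operator runs over the full column.
start = time.perf_counter()
whole_column = strange_calc(df["X"])
print("Series: ", time.perf_counter() - start)

# Raw ndarray: same arithmetic without the pandas index overhead.
start = time.perf_counter()
raw_array = strange_calc(df["X"].to_numpy())
print("ndarray:", time.perf_counter() - start)
```

This only works because every operation in `strange_calc` is defined for arrays. A function with Python-level branching (`if v > 0: ...`) would need `np.where` or a similar rewrite first.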


### Reference

<b>Conference</b>
<br>[An Effective Diet Strategy for Fat and Sluggish Pandas - 오성우 (Oh Seong-woo) - PyCon.KR 2019](https://youtu.be/0Vm9Yi_ig58)

<b>Blog</b>
<br>[Speeding Up Pandas DataFrame Performance - Use Vectorization Instead of apply](https://chancoding.tistory.com/248)
<br>[From Pandas to NumPy! Optimization Series (1) - Using ndarray](https://yahwang.github.io/posts/85)