# High performance

```python
mask = (x > 0.5) & (y < 0.5)
```

Example: `df = df[mask]` keeps only the rows where the mask is `True`.

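Each sub-expression is evaluated separately, so an intermediate array is allocated in memory for every step. The expression above is roughly equivalent to: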
```python
tmp1 = (x > 0.5)
tmp2 = (y < 0.5)
mask = tmp1 & tmp2
```

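`pd.eval("...")` takes the expression as a string and evaluates it element-wise using numexpr (when it is installed), which avoids allocating the full intermediate arrays.

The benefit is largest for compound expressions on large arrays.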
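As a minimal sketch of the idea above, the compound mask can be written once with plain operators and once as a string for `pd.eval`. The names `x` and `y`, the array size, and the seed are assumptions for illustration; the notes do not say what `x` and `y` hold.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # arbitrary seed, only for reproducibility
n = 100_000                          # assumed size, chosen for illustration
x = pd.Series(rng.random(n))
y = pd.Series(rng.random(n))

# Plain operators: each comparison produces a full temporary boolean Series
mask_plain = (x > 0.5) & (y < 0.5)

# pd.eval: the whole compound expression is passed as one string
mask_eval = pd.eval("(x > 0.5) & (y < 0.5)")

# Both approaches should select exactly the same elements
same = mask_eval.equals(mask_plain)
print(same)
```

Both masks can be used for filtering in the same way, for example `x[mask_eval]`.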

In [1]:
import numpy as np 
import pandas as pd 

nrows, ncols = 1000000, 100
df1, df2, df3, df4 = [pd.DataFrame(np.random.randn(nrows, ncols)) for _ in range(4)]
df1.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.260082,-0.586668,0.183404,-1.344913,-1.447578,1.06857,0.427758,-0.686259,-0.34541,-0.309506,...,0.623939,0.982722,-0.227858,-1.379762,-0.617289,-0.779229,0.802751,0.104143,-0.843933,-0.159152
1,-2.047519,-0.104612,0.044892,-1.001026,-0.077735,1.258593,0.558816,-1.341239,1.249336,-1.417893,...,1.325937,-1.39222,0.4035,0.119259,1.284388,0.736133,2.107901,1.242057,-1.347362,-0.002361
2,2.357562,1.412255,1.325968,-0.41438,1.25808,-1.195423,-0.512794,1.399674,-0.880979,1.433202,...,0.074129,-0.63321,-0.479192,2.118864,-0.432426,-0.633328,-2.396356,0.71675,-1.409237,-1.191049
3,-0.712164,1.522635,-0.395182,-0.477154,0.003614,-0.10611,-1.206751,0.687709,1.060552,0.14997,...,-0.775723,-1.24778,-1.44408,0.041199,-0.418785,0.280531,-1.367063,1.411467,0.137394,-0.047214
4,0.271331,1.687778,1.334568,0.114491,0.465438,-1.619045,-1.436665,0.336317,2.226891,-0.220578,...,0.377337,-1.508901,-0.272022,-0.610488,-0.518168,0.526281,1.451927,-0.982366,-0.915068,-0.328765


In [5]:
%timeit df1 + df2 + df3 + df4
%timeit pd.eval("df1 + df2 + df3 + df4")

912 ms ± 19.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
405 ms ± 7.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
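`%timeit` is an IPython magic, so it only works in a notebook or IPython shell. A rough equivalent for a plain script, using the standard library `timeit` module, might look like the sketch below. The frame sizes and the helper names `plain_sum` and `eval_sum` are assumptions for illustration, and the measured times will differ from the ones above.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=2)  # arbitrary seed
# Smaller frames than in the notebook so the script finishes quickly
frames = {name: pd.DataFrame(rng.standard_normal((10_000, 20)))
          for name in ("a", "b", "c", "d")}

def plain_sum():
    # Ordinary pandas arithmetic, evaluated one addition at a time
    return frames["a"] + frames["b"] + frames["c"] + frames["d"]

def eval_sum():
    # local_dict tells pd.eval explicitly where to look up a, b, c and d
    return pd.eval("a + b + c + d", local_dict=frames)

runs = 5
# Take the best of 3 repeats and divide by the number of runs per repeat
t_plain = min(timeit.repeat(plain_sum, number=runs, repeat=3)) / runs
t_eval = min(timeit.repeat(eval_sum, number=runs, repeat=3)) / runs
print(f"plain: {t_plain * 1e3:.2f} ms, pd.eval: {t_eval * 1e3:.2f} ms")
```

On frames this small the parsing overhead of `pd.eval` can outweigh any gain, so which variant is faster depends on the data size and on whether numexpr is installed.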


In [6]:
plain = df1 + df2 + df3 + df4
sum_eval = pd.eval("df1 + df2 + df3 + df4")

sum_eval.equals(plain)

True
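The speed-up depends on which engine `pd.eval` uses. With the default `engine=None`, pandas tries numexpr and falls back to the plain Python engine if numexpr is not available. The sketch below forces the Python engine to show that the result is the same either way; the frame names `a` and `b` and their shapes are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)  # arbitrary seed
a = pd.DataFrame(rng.standard_normal((1_000, 5)))
b = pd.DataFrame(rng.standard_normal((1_000, 5)))

plain = a + b

# engine=None (default): numexpr if it is installed, otherwise python
default_engine = pd.eval("a + b")

# engine="python": evaluates like ordinary Python code, so no speed benefit
python_engine = pd.eval("a + b", engine="python")

print(np.allclose(plain, default_engine), np.allclose(plain, python_engine))
```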

In [8]:
# df.eval()
# randint's upper bound is exclusive, so use 7 to include 6 on a six-sided die
rolls = pd.DataFrame(np.random.randint(1, 7, (6, 3)), columns=["Die1", "Die2", "Die3"])
rolls.eval("Sum = Die1 + Die2 + Die3", inplace = True)
rolls

Unnamed: 0,Die1,Die2,Die3,Sum
0,4,5,2,11
1,2,1,3,6
2,2,1,4,7
3,3,2,5,10
4,5,3,2,10
5,2,1,1,4


In [9]:
# use variables
high = 10
rolls.eval("Winner = Sum > @high", inplace = True)
rolls

Unnamed: 0,Die1,Die2,Die3,Sum,Winner
0,4,5,2,11,True
1,2,1,3,6,False
2,2,1,4,7,False
3,3,2,5,10,False
4,5,3,2,10,False
5,2,1,1,4,False
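The cells above modify `rolls` in place. As a small sketch of the alternative, leaving out `inplace=True` makes `eval` return a new DataFrame and keeps the original unchanged. The frame `rolls_demo`, its hard-coded values, and the variable `threshold` are hypothetical, chosen so the result is reproducible.

```python
import pandas as pd

# Hypothetical rolls, hard-coded instead of random
rolls_demo = pd.DataFrame({"Die1": [4, 2, 6], "Die2": [5, 1, 6], "Die3": [2, 3, 1]})
threshold = 10  # ordinary Python variable, referenced as @threshold in the expression

# Without inplace=True, eval returns a new DataFrame with the extra column
with_sum = rolls_demo.eval("Sum = Die1 + Die2 + Die3")
with_winner = with_sum.eval("Winner = Sum > @threshold")

print(with_winner)
print(list(rolls_demo.columns))  # the original frame still has only the three dice columns
```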


In [11]:
# filter the "traditional" way, with a boolean mask
rolls[rolls["Sum"] <= high]

Unnamed: 0,Die1,Die2,Die3,Sum,Winner
1,2,1,3,6,False
2,2,1,4,7,False
3,3,2,5,10,False
4,5,3,2,10,False
5,2,1,1,4,False


## Query

In [12]:
rolls.query("Sum <= @high")

Unnamed: 0,Die1,Die2,Die3,Sum,Winner
1,2,1,3,6,False
2,2,1,4,7,False
3,3,2,5,10,False
4,5,3,2,10,False
5,2,1,1,4,False
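The two filtering styles above return the same rows. A minimal check, using a hypothetical frame `scores` and variable `limit` rather than the random `rolls` data, could look like this:

```python
import pandas as pd

# Hypothetical data, hard-coded so the outcome is deterministic
scores = pd.DataFrame({"Name": ["a", "b", "c", "d"], "Sum": [11, 6, 10, 4]})
limit = 10

# Boolean mask: build the mask, then index with it
by_mask = scores[scores["Sum"] <= limit]

# query: the same condition as a string, with @ for the local variable
by_query = scores.query("Sum <= @limit")

identical = by_query.equals(by_mask)
print(identical)
```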


In [13]:
os = pd.read_csv("athlete_events.csv")
os.head()

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
0,1,A Dijiang,M,24.0,180.0,80.0,China,CHN,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,
1,2,A Lamusi,M,23.0,170.0,60.0,China,CHN,2012 Summer,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,
2,3,Gunnar Nielsen Aaby,M,24.0,,,Denmark,DEN,1920 Summer,1920,Summer,Antwerpen,Football,Football Men's Football,
3,4,Edgar Lindenau Aabye,M,34.0,,,Denmark/Sweden,DEN,1900 Summer,1900,Summer,Paris,Tug-Of-War,Tug-Of-War Men's Tug-Of-War,Gold
4,5,Christine Jacoba Aaftink,F,21.0,185.0,82.0,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,Speed Skating Women's 500 metres,


In [15]:
%timeit os[os["NOC"] == "SWE"]
%timeit os.query("NOC == 'SWE'")

17.1 ms ± 146 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
11.2 ms ± 111 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [16]:
%timeit os[os["Height"] > 180]
%timeit os.query("Height > 180")

15.8 ms ± 1 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
19.7 ms ± 1.19 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [17]:
%timeit os[(os["Sex"] == "F") & (os["Height"] > 180) & (os["NOC"] == "SWE")]
%timeit os.query("Sex == 'F' & Height > 180 & NOC == 'SWE'")


31.5 ms ± 762 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
16.8 ms ± 149 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
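Inside a `query` string, `&` has the same precedence as `and`, so the comparisons do not need the parentheses that the boolean-mask version requires. The sketch below checks this on a small made-up frame; `athletes` and its values are hypothetical and not taken from `athlete_events.csv`.

```python
import pandas as pd

# Hypothetical athletes, only the three columns used in the filter
athletes = pd.DataFrame({
    "Sex": ["F", "M", "F", "F"],
    "Height": [184.0, 190.0, 175.0, 186.0],
    "NOC": ["SWE", "SWE", "SWE", "NOR"],
})

# Boolean mask: parentheses are required because & binds tighter than == and >
by_mask = athletes[
    (athletes["Sex"] == "F") & (athletes["Height"] > 180) & (athletes["NOC"] == "SWE")
]

# query: & and the keyword `and` behave the same way here
by_query_amp = athletes.query("Sex == 'F' & Height > 180 & NOC == 'SWE'")
by_query_and = athletes.query("Sex == 'F' and Height > 180 and NOC == 'SWE'")

print(by_query_and)
```

Only the first row satisfies all three conditions, and all three variants select it.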


In [18]:
os.query("Sex == 'F' & Height > 180 & NOC == 'SWE'")

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
729,417,Sara Helena berg,F,17.0,190.0,73.0,Sweden,SWE,1988 Summer,1988,Summer,Seoul,Swimming,Swimming Women's 50 metres Freestyle,
5175,2940,Jenny Alm,F,27.0,184.0,80.0,Sweden,SWE,2016 Summer,2016,Summer,Rio de Janeiro,Handball,Handball Women's Handball,
7555,4210,Marina Vladimirovna Andrievskaia,F,29.0,182.0,66.0,Sweden,SWE,2004 Summer,2004,Summer,Athina,Badminton,Badminton Women's Singles,
19070,10088,Anna Therese Bengtsson,F,29.0,187.0,83.0,Sweden,SWE,2008 Summer,2008,Summer,Beijing,Handball,Handball Women's Handball,
28221,14643,Maria Helene Brandin,F,25.0,186.0,85.0,Sweden,SWE,1988 Summer,1988,Summer,Seoul,Rowing,Rowing Women's Double Sculls,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
242230,121329,Linnea Maria Torstenson,F,33.0,186.0,82.0,Sweden,SWE,2016 Summer,2016,Summer,Rio de Janeiro,Handball,Handball Women's Handball,
259242,129789,Anna Karolina Westberg,F,22.0,184.0,78.0,Sweden,SWE,2000 Summer,2000,Summer,Sydney,Football,Football Women's Football,
259243,129789,Anna Karolina Westberg,F,26.0,184.0,78.0,Sweden,SWE,2004 Summer,2004,Summer,Athina,Football,Football Women's Football,
259934,130126,Johanna Maria Wiberg,F,24.0,184.0,78.0,Sweden,SWE,2008 Summer,2008,Summer,Beijing,Handball,Handball Women's Handball,
