---
title: Num Expressions and Eval and Query Function
tags: [jupyter]
keywords: pandas
summary: "High-Performance Pandas using numexpr, eval() and query()."
mlType: dataFrame
infoType: pandas
sidebar: pandas_sidebar
permalink: __AutoGenThis__
notebookfilename:  __AutoGenThis__
---

This is an overview of the [High-Performance Pandas](https://jakevdp.github.io/PythonDataScienceHandbook/03.12-performance-eval-and-query.html) techniques available in pandas, specifically ```eval()``` and ```query()```. The material is drawn from the [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html).

The **PyData** stack is built upon the ability of NumPy and Pandas to push basic operations into C via an intuitive syntax. Examples include:

- vectorized or broadcasted operations in NumPy
- grouping-type operations in pandas

Though these abstractions are efficient, they often require temporary intermediate arrays to be created. For instance:

> ```python
> mask = (x > 0.5) & (y < 0.5)
> ```

Though this looks like only one line of code, what actually happens is:

> ```python
> temp1 = (x > 0.5) # first allocation in memory
> temp2 = (y < 0.5) # second allocation in memory
> mask = temp1 & temp2
> ```
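
The decomposition above can be checked directly with NumPy. The following is a minimal sketch, assuming two illustrative random arrays `x` and `y` of one million values each (the sizes and seed are not from the original text). It confirms that the one-line expression and the explicit three-step version give the same mask, and reports how many bytes the two temporaries occupy.

```python
import numpy as np

# Illustrative inputs (assumed for this sketch)
rng = np.random.RandomState(0)
x = rng.rand(1_000_000)
y = rng.rand(1_000_000)

# The one-line compound expression...
mask = (x > 0.5) & (y < 0.5)

# ...is evaluated as three steps, each producing a full-sized array
temp1 = x > 0.5
temp2 = y < 0.5
mask_explicit = temp1 & temp2

print(np.array_equal(mask, mask_explicit))
print(temp1.nbytes + temp2.nbytes)  # bytes held by the two temporaries
```

Boolean arrays use one byte per element, so the cost of the temporaries grows linearly with the length of the inputs.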

The **[numexpr](https://github.com/pydata/numexpr)** library enables you to run these types of operations without allocating intermediate arrays.  Here is an example.

> ```python
> import numexpr
> mask_numexpr = numexpr.evaluate('(x > 0.5) & (y < 0.5)')
> np.allclose(mask, mask_numexpr)
> ```
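
The snippet above assumes that `x`, `y` and `mask` already exist. A self-contained version might look like the sketch below. The array sizes and seed are illustrative assumptions, and because numexpr is an optional dependency the comparison is skipped when it is not installed. The arrays are passed explicitly through `local_dict` so that the example does not depend on which scope it runs in.

```python
import numpy as np

try:
    import numexpr
except ImportError:
    numexpr = None  # optional dependency; the comparison is skipped without it

# Illustrative inputs (assumed for this sketch)
rng = np.random.RandomState(0)
x = rng.rand(1_000_000)
y = rng.rand(1_000_000)

# Plain NumPy: allocates two temporary boolean arrays
mask = (x > 0.5) & (y < 0.5)

if numexpr is not None:
    # numexpr compiles the string and evaluates it without full-sized temporaries
    mask_numexpr = numexpr.evaluate('(x > 0.5) & (y < 0.5)',
                                    local_dict={'x': x, 'y': y})
    matches = bool(np.array_equal(mask, mask_numexpr))
else:
    matches = None

print(matches)
```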




In [1]:
import sys

sys.path.append("../")

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pprint import pprint

In [3]:
from datetime import datetime
from dateutil import parser

# Efficient Operations

## eval()

Pandas exposes numexpr-style evaluation through its own ```eval()``` function, which this section demonstrates.

**NOTE:** ```eval()``` takes a string expression to evaluate, similar to MATLAB's eval function.

Let's create 4 DataFrames with:

- 10^5 rows
- 100 columns

And then add them together.

In [5]:
nrows, ncols = 100000, 100
rng = np.random.RandomState(42)
df1, df2, df3, df4 = (pd.DataFrame(rng.rand(nrows, ncols))
                      for i in range(4))

Adding them up using the standard pandas approach

In [6]:
%timeit df1 + df2 + df3 + df4

109 ms ± 1.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Adding them up using eval

In [8]:
%timeit pd.eval('df1 + df2 + df3 + df4')

52.3 ms ± 714 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


The ```eval()``` version takes less than half the time (about 52 ms versus 109 ms).
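
`%timeit` is only available inside IPython or Jupyter. Outside a notebook, a rough comparison can be made with `time.perf_counter`, as in the sketch below. It uses smaller frames than the cells above (an assumption made to keep it quick), a single run of each approach, and makes no claim about which one wins on a given machine: the speedup depends on numexpr being installed and on the size of the data.

```python
import time

import numpy as np
import pandas as pd

# Smaller frames than in the notebook cells, to keep the sketch fast
rng = np.random.RandomState(42)
nrows, ncols = 10_000, 100
df1, df2, df3, df4 = (pd.DataFrame(rng.rand(nrows, ncols)) for _ in range(4))

# Standard pandas arithmetic
start = time.perf_counter()
plain = df1 + df2 + df3 + df4
plain_seconds = time.perf_counter() - start

# The same sum expressed as a string for pd.eval()
start = time.perf_counter()
evaluated = pd.eval('df1 + df2 + df3 + df4')
eval_seconds = time.perf_counter() - start

print(f"plain: {plain_seconds:.4f} s, eval: {eval_seconds:.4f} s")
print(np.allclose(plain, evaluated))
```

A single run is noisy, so for real measurements repeat each step several times, for example with the standard-library `timeit` module.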

### Arithmetic Operations

In [9]:
result1 = -df1 * df2 / (df3 + df4)
result2 = pd.eval('-df1 * df2 / (df3 + df4)')
np.allclose(result1, result2)

True

### Comparison Operations

In [10]:
result1 = (df1 < df2) & (df2 <= df3) & (df3 != df4)
result2 = pd.eval('df1 < df2 <= df3 != df4')
np.allclose(result1, result2)

True

### Bitwise and Boolean Operations

These use the ```&``` and ```|``` bitwise operators:

In [11]:
result1 = (df1 < 0.5) & (df2 < 0.5) | (df3 < df4)
result2 = pd.eval('(df1 < 0.5) & (df2 < 0.5) | (df3 < df4)')
np.allclose(result1, result2)

True

These use the literal ```and``` and ```or``` boolean operators, which ```eval()``` also accepts:

In [12]:
result3 = pd.eval('(df1 < 0.5) and (df2 < 0.5) or (df3 < df4)')
np.allclose(result1, result3)

True

### Object Attributes and Indices

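```eval()``` supports accessing object attributes with the ```obj.attr``` syntax and indexing with the ```obj[index]``` syntax, so you can refer to specific parts of a DataFrame inside the expression.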

In [13]:
result1 = df2.T[0] + df3.iloc[1]
result2 = pd.eval('df2.T[0] + df3.iloc[1]')
np.allclose(result1, result2)

True

### Column-Wise Operations

In [14]:
df = pd.DataFrame(rng.rand(1000, 3), columns=['A', 'B', 'C'])
df.head()

|   | A | B | C |
|---|---|---|---|
| 0 | 0.615875 | 0.525167 | 0.047354 |
| 1 | 0.330858 | 0.412879 | 0.441564 |
| 2 | 0.689047 | 0.559068 | 0.23035 |
| 3 | 0.290486 | 0.695479 | 0.852587 |
| 4 | 0.42428 | 0.534344 | 0.245216 |


In [15]:
result1 = (df['A'] + df['B']) / (df['C'] - 1)
result2 = pd.eval("(df.A + df.B) / (df.C - 1)")
np.allclose(result1, result2)

True

Notice that we are accessing the columns with the ```.``` syntax not the ```[]``` syntax.

**ALTERNATIVELY**

We can call the ```eval()``` method on the DataFrame itself. The expression is then evaluated in the context of that DataFrame, so columns can be referred to directly by name.

In [16]:
result3 = df.eval('(A + B) / (C - 1)')
np.allclose(result1, result3)

True

#### Assignment using eval

Within the expression passed to ```DataFrame.eval()``` we can also assign to a column, and use the ```inplace``` parameter to store the resulting column in the DataFrame itself. This also works if a column with that name already exists, in which case it is overwritten.

In [17]:
df.eval('D = (A + B) / (C - 1)', inplace=True)
np.allclose(result1, df['D'])

True

In [18]:
df.head()

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 0.615875 | 0.525167 | 0.047354 | -1.197761 |
| 1 | 0.330858 | 0.412879 | 0.441564 | -1.331822 |
| 2 | 0.689047 | 0.559068 | 0.23035 | -1.621667 |
| 3 | 0.290486 | 0.695479 | 0.852587 | -6.688481 |
| 4 | 0.42428 | 0.534344 | 0.245216 | -1.270064 |
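
The effect of the ```inplace``` parameter can be seen on a small frame. This is a minimal sketch, assuming a recent pandas version where ```inplace``` defaults to ```False```; the 5-row frame and the second expression are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (assumed for this sketch)
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.rand(5, 3), columns=['A', 'B', 'C'])

# With inplace=False (the default), eval() returns a new DataFrame
# and leaves the original untouched
with_d = df.eval('D = (A + B) / (C - 1)')
print('D' in df.columns, 'D' in with_d.columns)

# With inplace=True, the column is added to df itself
df.eval('D = (A + B) / (C - 1)', inplace=True)

# Assigning to an existing column name overwrites that column
df.eval('D = (A - B) / C', inplace=True)
print(np.allclose(df['D'], (df['A'] - df['B']) / df['C']))
```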


### Local Variables

The ```@``` character marks a name as a local Python variable rather than a column of the DataFrame when using the ```DataFrame.eval()``` method.

In [19]:
column_mean = df.mean(1)
result1 = df['A'] + column_mean
result2 = df.eval('A + @column_mean')
np.allclose(result1, result2)

True

**Notice** that the ```@``` character is only supported by the ```DataFrame.eval()``` method, not by the ```pandas.eval()``` function, because ```pandas.eval()``` only has access to the one (Python) namespace.
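
The difference between the two namespaces can be sketched as follows. The small frame and the variable name `row_mean` are illustrative assumptions. In ```DataFrame.eval()``` bare names are looked up as columns and ```@``` names as Python variables, whereas in ```pandas.eval()``` every name is a Python variable, so the column is reached through `df.A` and no prefix is used.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (assumed for this sketch)
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.rand(5, 3), columns=['A', 'B', 'C'])
row_mean = df.mean(axis=1)

# DataFrame.eval(): bare names are columns, @names are Python variables
via_method = df.eval('A + @row_mean')

# pandas.eval(): every name is a Python variable, so no @ prefix is used
via_function = pd.eval('df.A + row_mean')

print(np.allclose(via_method, via_function))
```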

## query()

In [22]:
result2 = df.query('A < 0.5 and B < 0.5')
result2

|   | A | B | C | D |
|---|---|---|---|---|
| 1 | 0.330858 | 0.412879 | 0.441564 | -1.331822 |
| 8 | 0.448611 | 0.415924 | 0.481001 | -1.665774 |
| 10 | 0.112910 | 0.394884 | 0.950129 | -10.182075 |
| 11 | 0.191011 | 0.118751 | 0.130223 | -0.356140 |
| 14 | 0.075723 | 0.260648 | 0.956146 | -7.670347 |
| ... | ... | ... | ... | ... |
| 964 | 0.478935 | 0.196736 | 0.913372 | -7.799673 |
| 967 | 0.498382 | 0.465993 | 0.664128 | -2.871262 |
| 980 | 0.150918 | 0.382386 | 0.305427 | -0.767815 |
| 982 | 0.207822 | 0.356162 | 0.653230 | -1.626391 |


Compared to the equivalent masking expression, this is a more efficient computation, and it is also much easier to read and understand. As with ```DataFrame.eval()```, you can use the ```@``` prefix to refer to local variables.

In [23]:
Cmean = df['C'].mean()
result1 = df[(df.A < Cmean) & (df.B < Cmean)]
result2 = df.query('A < @Cmean and B < @Cmean')
np.allclose(result1, result2)

True
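
`np.allclose` only compares values. As a slightly stricter check, `DataFrame.equals` also compares the row and column labels, which confirms that ```query()``` keeps the original index of the selected rows. A minimal sketch, assuming a fresh random frame and a hypothetical threshold variable `c_mean`:

```python
import numpy as np
import pandas as pd

# Illustrative frame (assumed for this sketch)
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.rand(1000, 3), columns=['A', 'B', 'C'])
c_mean = df['C'].mean()

# Boolean masking and query() select the same rows
masked = df[(df.A < c_mean) & (df.B < c_mean)]
queried = df.query('A < @c_mean and B < @c_mean')

# equals() checks the labels as well as the values
print(masked.equals(queried))
```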

## When to Use

There are two considerations when deciding whether to use these functions:

1. Computation Time
2. Memory Use

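Memory use is the most predictable aspect, as it depends on the size of the arrays involved: each intermediate step of a compound expression allocates a temporary array. You can use the ```.nbytes``` attribute to check the number of bytes an array occupies and then estimate what the operation will need.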

In [25]:
df.values.nbytes

32000
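
The figure above follows from the shape and dtype of the frame: 1000 rows, 4 columns, 8 bytes per `float64` value. The sketch below reproduces that arithmetic on a stand-in frame of zeros (an assumption, since only the shape and dtype matter) and makes a rough estimate of the temporaries a plain pandas evaluation of `(A + B) / (C - 1)` would create.

```python
import numpy as np
import pandas as pd

# Stand-in frame with the same shape and dtype as the one above
df = pd.DataFrame(np.zeros((1000, 4)), columns=['A', 'B', 'C', 'D'])

# 1000 rows x 4 float64 columns x 8 bytes per value
total_bytes = df.values.nbytes
print(total_bytes)

# The same total computed per column, excluding the index
per_column_total = df.memory_usage(index=False).sum()
print(per_column_total)

# Rough estimate for (A + B) / (C - 1): the two intermediates
# (A + B) and (C - 1) are each one float64 column
temp_bytes = 2 * df['A'].values.nbytes
print(temp_bytes)
```

For a frame this small the temporaries are negligible; the estimate only starts to matter when each column approaches a meaningful fraction of available memory.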

On the performance side, ```eval()``` can be faster even when you are not maxing out your system memory. The issue is how the size of your temporary DataFrames compares to the size of the L1 and L2 CPU caches on your system. If they are much bigger, ```eval()``` can avoid some potentially slow movement of values between the different memory caches.

In practice, I find that the difference in computation time between the traditional methods and these methods is not significant.