### Motivating ``query()`` and ``eval()``: Compound Expressions

> We've seen previously that NumPy and Pandas support fast vectorized operations; for example, when adding the elements of two arrays:

In [1]:
import numpy as np
rng = np.random.RandomState(42)
x = rng.rand(1000000)
y = rng.rand(1000000)
%timeit x + y

396 µs ± 6.97 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [2]:
# The same addition done with a Python-level loop, for comparison with the vectorized version
%timeit np.fromiter((xi + yi for xi, yi in zip(x, y)), dtype=x.dtype, count=len(x))

161 ms ± 1.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


> But this abstraction can become less efficient when computing compound expressions, because NumPy evaluates each subexpression separately. For example, consider the following expression:

In [3]:
mask = (x > 0.5) & (y < 0.5)   # compound expression: two comparisons combined with &
mask

array([False,  True,  True, ..., False, False, False])
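Because NumPy evaluates each subexpression separately, the compound mask is roughly equivalent to the sketch below. The names `tmp1`, `tmp2`, and `mask_stepwise` are illustrative, and the arrays are smaller than the ones above so that the snippet runs quickly on its own.

```python
import numpy as np

rng = np.random.RandomState(42)
x = rng.rand(1000)
y = rng.rand(1000)

# The one-line compound expression
mask = (x > 0.5) & (y < 0.5)

# Roughly what NumPy does: each comparison produces a full-sized boolean array
tmp1 = (x > 0.5)
tmp2 = (y < 0.5)
mask_stepwise = tmp1 & tmp2

print(np.array_equal(mask, mask_stepwise))
```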

> In other words, *every intermediate step is explicitly allocated in memory*. If the ``x`` and ``y`` arrays are very large, this can lead to significant memory and computational overhead.
The Numexpr library gives you the ability to compute this type of compound expression element by element, without the need to allocate full intermediate arrays.
The [Numexpr documentation](https://github.com/pydata/numexpr) has more details, but for the time being it is sufficient to say that the library accepts a *string* giving the NumPy-style expression you'd like to compute:

In [4]:
import numexpr
mask_numexpr = numexpr.evaluate('(x > 0.5) & (y < 0.5)')
np.allclose(mask, mask_numexpr)

True

> The benefit here is that Numexpr evaluates the expression in a way that does not use full-sized temporary arrays, and thus can be much more efficient than NumPy, especially for large arrays.
The Pandas ``eval()`` and ``query()`` tools that we will discuss here are conceptually similar, and depend on the Numexpr package.

## ``pandas.eval()`` for Efficient Operations

> The ``eval()`` function in Pandas uses string expressions to efficiently compute operations using ``DataFrame``s.
For example, consider the following ``DataFrame``s:

In [5]:
import pandas as pd
nrows, ncols = 100000, 100
rng = np.random.RandomState(42)
df1, df2, df3, df4 = (pd.DataFrame(rng.rand(nrows, ncols))
                      for i in range(4))

In [6]:
%timeit df1 + df2 + df3 + df4

7.59 ms ± 226 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


> The same result can be computed via ``pd.eval`` by constructing the expression as a string:

In [7]:
%timeit pd.eval('df1 + df2 + df3 + df4')

4.71 ms ± 88 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


> In this run, the ``eval()`` version of the expression takes roughly 40% less time (4.71 ms versus 7.59 ms) and uses much less memory, while giving the same result, which we can verify with ``np.allclose()``:

In [8]:
np.allclose(df1 + df2 + df3 + df4, pd.eval('df1 + df2 + df3 + df4'))

True

### Operations supported by ``pd.eval()``

> As of Pandas v0.16, ``pd.eval()`` supports a wide range of operations.
To demonstrate these, we'll use the following integer ``DataFrame``s:

In [9]:
df1, df2, df3, df4, df5 = (pd.DataFrame(rng.randint(0, 1000, (100, 3)))
                           for i in range(5))

In [10]:
# Arithmetic operators
result1 = -df1 * df2 / (df3 + df4) - df5
result2 = pd.eval('-df1 * df2 / (df3 + df4) - df5')
np.allclose(result1, result2)

True

In [11]:
# Comparison operators, including chained comparisons
result1 = (df1 < df2) & (df2 <= df3) & (df3 != df4)
result2 = pd.eval('df1 < df2 <= df3 != df4')
np.allclose(result1, result2)

True

#### Bitwise operators

> ``pd.eval()`` supports the ``&`` (and) and ``|`` (or) bitwise operators, as well as the ``~`` (not) operator:

In [12]:
result1 = (df1 < 0.5) & (df2 < 0.5) | (df3 < df4)
result2 = pd.eval('(df1 < 0.5) & (df2 < 0.5) | (df3 < df4)')
np.allclose(result1, result2)

True

In [13]:
# The literal keywords `and` and `or` are also accepted in Boolean expressions
result3 = pd.eval('(df1 < 0.5) and (df2 < 0.5) or (df3 < df4)')
np.allclose(result1, result3)

True

#### Object attributes and indices

> ``pd.eval()`` supports access to object attributes via the ``obj.attr`` syntax, and indexes via the ``obj[index]`` syntax:

In [14]:
result1 = df2.T[0] + df3.iloc[1]
result2 = pd.eval('df2.T[0] + df3.iloc[1]')
np.allclose(result1, result2)

True

## ``DataFrame.eval()`` for Column-Wise Operations

> Just as Pandas has a top-level ``pd.eval()`` function, ``DataFrame``s have an ``eval()`` method that works in similar ways.
The benefit of the ``eval()`` method is that columns can be referred to *by name*.
We'll use this labeled array as an example:

In [15]:
df = pd.DataFrame(rng.rand(1000, 3), columns=['A', 'B', 'C'])
df.head()

Unnamed: 0,A,B,C
0,0.375506,0.406939,0.069938
1,0.069087,0.235615,0.154374
2,0.677945,0.433839,0.652324
3,0.264038,0.808055,0.347197
4,0.589161,0.252418,0.557789


> Using ``pd.eval()`` as above, we can compute expressions with the three columns like this:

In [16]:
result1 = (df['A'] + df['B']) / (df['C'] - 1)
result2 = pd.eval("(df.A + df.B) / (df.C - 1)")
np.allclose(result1, result2)

True

> The ``DataFrame.eval()`` method allows much more succinct evaluation of expressions with the columns:

In [17]:
result3 = df.eval('(A + B) / (C - 1)')
np.allclose(result1, result3)

True

### Assignment in DataFrame.eval()

> In addition to the options just discussed, ``DataFrame.eval()`` also allows assignment to any column.
Let's use the ``DataFrame`` from before, which has columns ``'A'``, ``'B'``, and ``'C'``:

In [18]:
df.head()

Unnamed: 0,A,B,C
0,0.375506,0.406939,0.069938
1,0.069087,0.235615,0.154374
2,0.677945,0.433839,0.652324
3,0.264038,0.808055,0.347197
4,0.589161,0.252418,0.557789


In [19]:
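# Create a new column 'D' computed from the other columns
df.eval('D = (A + B) / C', inplace=True)
# In the same way, an existing column can be overwritten
df.eval('D = (A - B) / C', inplace=True)
df.head()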

Unnamed: 0,A,B,C,D
0,0.375506,0.406939,0.069938,-0.449425
1,0.069087,0.235615,0.154374,-1.078728
2,0.677945,0.433839,0.652324,0.374209
3,0.264038,0.808055,0.347197,-1.566886
4,0.589161,0.252418,0.557789,0.603708


### Local variables in DataFrame.eval()

> The ``DataFrame.eval()`` method supports an additional syntax that lets it work with local Python variables.
Consider the following:

In [20]:
column_mean = df.mean(axis=1)  # mean across the columns, one value per row
result1 = df['A'] + column_mean
result2 = df.eval('A + @column_mean')
np.allclose(result1, result2)

True

> The ``@`` character here marks a *variable name* rather than a *column name*, and lets you efficiently evaluate expressions involving the two "namespaces": the namespace of columns, and the namespace of Python objects.
Notice that this ``@`` character is only supported by the ``DataFrame.eval()`` *method*, not by the ``pandas.eval()`` *function*, because the ``pandas.eval()`` function only has access to the one (Python) namespace.
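To make the two-namespace point concrete, here is a minimal sketch on a small, made-up `DataFrame` (the names `toy` and `row_mean` are illustrative). With the method, the local variable needs the ``@`` prefix; with the top-level function, both the `DataFrame` and the variable are ordinary Python names, so no prefix is used.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
toy = pd.DataFrame(rng.rand(5, 3), columns=['A', 'B', 'C'])
row_mean = toy.mean(axis=1)

# Method: column names are bare, local variables are marked with @
via_method = toy.eval('A + @row_mean')

# Function: everything is looked up in the Python namespace, so no @ is needed
via_function = pd.eval('toy.A + row_mean')

print(np.allclose(via_method, via_function))
```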

## DataFrame.query() Method

In [21]:
result1 = df[(df.A < 0.5) & (df.B < 0.5)]
result2 = pd.eval('df[(df.A < 0.5) & (df.B < 0.5)]')
np.allclose(result1, result2)

True

In [22]:
result1 = df[(df.A < 0.5) & (df.B < 0.5)]
# A filtering operation like this cannot be expressed with DataFrame.eval(); use query() instead
result2 = df.query('A < 0.5 and B < 0.5')
np.allclose(result1, result2)

True

> In addition to being a more efficient computation, compared to the masking expression this is much easier to read and understand.
Note that the ``query()`` method also accepts the ``@`` flag to mark local variables:

In [23]:
Cmean = df['C'].mean()
result1 = df[(df.A < Cmean) & (df.B < Cmean)]
result2 = df.query('A < @Cmean and B < @Cmean')
np.allclose(result1, result2)

True

## Performance: When to Use These Functions

> When considering whether to use these functions, there are two considerations: *computation time* and *memory use*.
Memory use is the most predictable aspect. As already mentioned, every compound expression involving NumPy arrays or Pandas ``DataFrame``s will result in implicit creation of temporary arrays.
For example, this:

In [24]:
x = df[(df.A < 0.5) & (df.B < 0.5)]

In [25]:
# ...is roughly equivalent to this, with the temporary arrays made explicit:
tmp1 = df.A < 0.5
tmp2 = df.B < 0.5
x = df[tmp1 & tmp2]

> If the size of the temporary ``DataFrame``s is significant compared to your available system memory (typically several gigabytes) then it's a good idea to use an ``eval()`` or ``query()`` expression.
You can check the approximate size of your array in bytes using the ``nbytes`` attribute:

In [26]:
df.values.nbytes

32000
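As a cross-check, the same figure can be obtained per column with `DataFrame.memory_usage()`. The sketch below uses a stand-in frame of the same shape as `df` at this point (1,000 rows, four float64 columns); the name `frame` is illustrative.

```python
import numpy as np
import pandas as pd

# Stand-in for df: 1,000 rows and four float64 columns (A, B, C, D)
frame = pd.DataFrame(np.zeros((1000, 4)), columns=['A', 'B', 'C', 'D'])

# Size of the underlying NumPy array: 1000 * 4 values * 8 bytes each
values_bytes = frame.values.nbytes

# Per-column sizes, excluding the index, summed over all columns
column_bytes = int(frame.memory_usage(index=False).sum())

print(values_bytes, column_bytes)
```

For columns holding Python strings, `memory_usage(deep=True)` gives a more realistic estimate than `nbytes`, since it also counts the string objects themselves.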

> On the performance side, ``eval()`` can be faster even when you are not maxing out your system memory.
The issue is how your temporary ``DataFrame``s compare to the size of the L1 or L2 CPU cache on your system (typically a few megabytes in 2016); if they are much bigger, then ``eval()`` can avoid some potentially slow movement of values between the different memory caches.
In practice, I find that the difference in computation time between the traditional methods and the ``eval``/``query`` method is usually not significant; if anything, the traditional method is faster for smaller arrays!
The benefit of ``eval``/``query`` is mainly in the saved memory, and the sometimes cleaner syntax they offer.

# Example: Bikes - Filtering with the `query` Method

The previous chapters on boolean selection showed us how to filter our DataFrames and Series based on their values. We created conditions, usually involving the comparison operators, that produced boolean Series, and passed them to *just the brackets* or to the `loc` indexer to filter the data.

In this chapter we cover the `query` method, which enables us to also make selections based on the values of the DataFrame or Series. The `query` method is easier and more intuitive to use than boolean selection, but doesn't provide as much functionality to filter the data. Still, it is a good method to know about to make your subset selections more readable.

In [27]:
import pandas as pd
bikes = pd.read_csv('input/pd-bikes.csv', parse_dates=['starttime', 'stoptime'])
bikes

Unnamed: 0,gender,starttime,stoptime,tripduration,from_station_name,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events
0,Male,2013-06-28 19:01:00,2013-06-28 19:17:00,993,Lake Shore Dr & Monroe St,11,Michigan Ave & Oak St,15,73.9,12.7,mostlycloudy
1,Male,2013-06-28 22:53:00,2013-06-28 23:03:00,623,Clinton St & Washington Blvd,31,Wells St & Walton St,19,69.1,6.9,partlycloudy
2,Male,2013-06-30 14:43:00,2013-06-30 15:01:00,1040,Sheffield Ave & Kingsbury St,15,Dearborn St & Monroe St,23,73.0,16.1,mostlycloudy
3,Male,2013-07-01 10:05:00,2013-07-01 10:16:00,667,Carpenter St & Huron St,19,Clark St & Randolph St,31,72.0,16.1,mostlycloudy
4,Male,2013-07-01 11:16:00,2013-07-01 11:18:00,130,Damen Ave & Pierce Ave,19,Damen Ave & Pierce Ave,19,73.0,17.3,partlycloudy
...,...,...,...,...,...,...,...,...,...,...,...
4054,Male,2014-06-08 21:09:00,2014-06-08 21:13:00,257,Pine Grove Ave & Irving Park Rd,15,Halsted St & Waveland Ave,15,62.1,8.1,mostlycloudy
4055,Male,2014-06-08 23:15:00,2014-06-08 23:20:00,304,Southport Ave & Irving Park Rd,15,Broadway & Sheridan Rd,15,59.0,6.9,mostlycloudy
4056,Male,2014-06-09 05:16:00,2014-06-09 05:25:00,530,Morgan Ave & 14th Pl,15,Wood St & Taylor St,15,55.0,9.2,partlycloudy
4057,Male,2014-06-09 07:31:00,2014-06-09 07:39:00,496,Clinton St & Washington Blvd,31,Stetson Ave & South Water St,19,60.1,8.1,partlycloudy


In [28]:
bikes[bikes['tripduration'] > 5000].head(3)

Unnamed: 0,gender,starttime,stoptime,tripduration,from_station_name,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events
18,Male,2013-07-09 13:12:00,2013-07-09 14:42:00,5396,Canal St & Jackson Blvd,35,Millennium Park,35,79.0,13.8,cloudy
40,Female,2013-07-14 14:08:00,2013-07-14 15:53:00,6274,Wabash Ave & Roosevelt Rd,19,Lake Shore Dr & Monroe St,11,87.1,8.1,partlycloudy
77,Female,2013-07-21 11:35:00,2013-07-21 13:54:00,8299,State St & 19th St,15,Sheffield Ave & Kingsbury St,15,82.9,5.8,mostlycloudy


In [29]:
# Single condition on one column
bikes.query('tripduration > 1000').head()
bikes.query('tripduration > 5000').head()
# Combine conditions with `and` / `or`; string literals are quoted inside the query
bikes.query('tripduration > 1000 and temperature > 85').head()
bikes.query('tripduration > 5000 and gender=="Female"').head()
bikes.query('tripduration > 5000 or gender=="Female"').head()

# Range check, column-to-column comparison, and the same range as a chained comparison
bikes.query('temperature >= 50 and temperature <= 60').head()
bikes.query('start_capacity > end_capacity').head()
bikes.query('50 <= temperature <= 60').head()

# Only the result of the last expression in the cell is displayed
bikes.query('gender == "Female" and tripduration > 2000').head()
bikes.query('gender == "Female" and tripduration > 2500').head()

Unnamed: 0,gender,starttime,stoptime,tripduration,from_station_name,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events
40,Female,2013-07-14 14:08:00,2013-07-14 15:53:00,6274,Wabash Ave & Roosevelt Rd,19,Lake Shore Dr & Monroe St,11,87.1,8.1,partlycloudy
77,Female,2013-07-21 11:35:00,2013-07-21 13:54:00,8299,State St & 19th St,15,Sheffield Ave & Kingsbury St,15,82.9,5.8,mostlycloudy
173,Female,2013-08-08 08:49:00,2013-08-08 09:31:00,2502,Sheffield Ave & Addison St,27,Dearborn St & Adams St,19,71.1,10.4,mostlycloudy
258,Female,2013-08-17 22:10:00,2013-08-17 22:53:00,2566,Millennium Park,35,Theater on the Lake,15,69.1,5.8,clear
320,Female,2013-08-24 14:50:00,2013-08-24 15:40:00,2980,Sheridan Rd & Irving Park Rd,11,Halsted St & Willow St,19,84.9,5.8,partlycloudy
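The chained comparison used above is shorthand for two conditions joined with `and`. A minimal sketch on a made-up temperature column (the `toy` frame is hypothetical, not the bikes data) shows that both forms, as well as the `between` method from boolean selection, pick the same rows:

```python
import pandas as pd

toy = pd.DataFrame({'temperature': [45.0, 50.0, 55.5, 60.0, 61.2]})

# Chained comparison inside the query string
chained = toy.query('50 <= temperature <= 60')

# The same condition spelled out with `and`
explicit = toy.query('temperature >= 50 and temperature <= 60')

# Boolean-selection equivalent; between() includes both endpoints by default
with_between = toy[toy['temperature'].between(50, 60)]

print(chained.index.tolist())
```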


### Use 'in' for multiple equalities

You can check whether each value in a column is equal to one or more other values by using the word 'in' within your query. Use the syntax for creating a list within the query string to contain all the values you'd like to check. The first query below tests whether the weather event was snow or rain; the second uses `not in` to exclude the three cloudy categories, and its result is the one displayed.

In [30]:
bikes.query('events in ["snow", "rain"]').head(3)
bikes.query('events not in ["cloudy", "partlycloudy", "mostlycloudy"]').head(3)

Unnamed: 0,gender,starttime,stoptime,tripduration,from_station_name,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events
25,Female,2013-07-11 08:17:00,2013-07-11 08:31:00,830,Wabash Ave & Roosevelt Rd,19,Daley Center Plaza,47,73.9,8.1,clear
26,Male,2013-07-12 01:07:00,2013-07-12 01:24:00,1043,State St & Harrison St,19,Racine Ave & 18th St,15,64.9,0.0,clear
33,Male,2013-07-12 17:22:00,2013-07-12 17:34:00,730,Clark St & Congress Pkwy,27,Racine Ave & Congress Pkwy,19,79.0,10.4,clear


Several alternative syntaxes give the same result as the `in` query, but I prefer the form above because it most closely resembles the `isin` method used during boolean selection.

* `bikes.query('["snow", "rain"] in events')`
* `bikes.query('["snow", "rain"] == events')`
* `bikes.query('events == ["snow", "rain"]')`
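The correspondence with `isin` can be checked on a small made-up frame (the `toy` data below is hypothetical, not the bikes dataset). The sketch compares the `in` and `not in` forms of the query with their boolean-selection counterparts.

```python
import pandas as pd

toy = pd.DataFrame({
    'events': ['snow', 'clear', 'rain', 'cloudy'],
    'tripduration': [100, 200, 300, 400],
})

# `in` inside query versus isin() in boolean selection
in_query = toy.query('events in ["snow", "rain"]')
in_isin = toy[toy['events'].isin(['snow', 'rain'])]

# `not in` inside query versus a negated isin()
not_in_query = toy.query('events not in ["snow", "rain"]')
not_in_isin = toy[~toy['events'].isin(['snow', 'rain'])]

print(in_query.index.tolist(), not_in_query.index.tolist())
```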

### Arithmetic operations within `query`

It is possible to write arithmetic operations within `query` just as you would outside of it. For instance, to find all the rides where the capacity of the start station exceeds the capacity of the end station by 20 or more, we do the following.

In [31]:
bikes.query('start_capacity - end_capacity >= 20').head(3)

Unnamed: 0,gender,starttime,stoptime,tripduration,from_station_name,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events
54,Male,2013-07-16 15:13:00,2013-07-16 15:18:00,347,Daley Center Plaza,47,State St & Van Buren St,27,91.0,8.1,mostlycloudy
66,Male,2013-07-17 20:56:00,2013-07-17 21:14:00,1073,Millennium Park,35,Morgan St & 18th St,15,86.0,9.2,partlycloudy
116,Male,2013-07-27 09:54:00,2013-07-27 09:56:00,121,Daley Center Plaza,47,LaSalle St & Washington St,15,60.8,11.5,cloudy


# Example - triangles

In [32]:
triangles = pd.read_csv('input/pd-triangles.csv')
triangles.head()

Unnamed: 0,a,b,c
0,2,3,4
1,3,2,4
2,3,4,5
3,3,5,6
4,3,6,7


We can use the `query` method to find all the right triangles, those that satisfy the Pythagorean Theorem. We write the condition using the arithmetic and comparison operators.

In [33]:
triangles.query('a ** 2 + b ** 2 == c ** 2').head()

Unnamed: 0,a,b,c
2,3,4,5
5,4,3,5
14,5,12,13
21,6,8,10
33,7,24,25


The syntax is quite a bit nicer than the boolean selection alternative.

In [34]:
filt = triangles['a'] ** 2 + triangles['b'] ** 2 == triangles['c'] ** 2
triangles[filt].head()

Unnamed: 0,a,b,c
2,3,4,5
5,4,3,5
14,5,12,13
21,6,8,10
33,7,24,25


## Reference variable names with the `@` symbol

By default, all words within the query string attempt to reference a column name. You can, however, reference a variable name by preceding it with the `@` symbol. Let's assign 5,000 to the variable `min_length` and reference it in a query to find all the rides whose trip duration was greater than that value.

In [35]:
min_length = 5000
bikes.query('tripduration > @min_length').head(3)

Unnamed: 0,gender,starttime,stoptime,tripduration,from_station_name,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events
18,Male,2013-07-09 13:12:00,2013-07-09 14:42:00,5396,Canal St & Jackson Blvd,35,Millennium Park,35,79.0,13.8,cloudy
40,Female,2013-07-14 14:08:00,2013-07-14 15:53:00,6274,Wabash Ave & Roosevelt Rd,19,Lake Shore Dr & Monroe St,11,87.1,8.1,partlycloudy
77,Female,2013-07-21 11:35:00,2013-07-21 13:54:00,8299,State St & 19th St,15,Sheffield Ave & Kingsbury St,15,82.9,5.8,mostlycloudy


## Using the index with `query`

You can even use the word `index` to make comparisons against the index as if it were a normal column. In the bikes DataFrame, the index is just the integers beginning at 0. Here, we select only the `events` that were 'cloudy' for an index value greater than 4,000.

In [36]:
bikes.query('index > 4000 and events == "cloudy" ').head(3)

Unnamed: 0,gender,starttime,stoptime,tripduration,from_station_name,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events
4007,Male,2014-06-07 14:07:00,2014-06-07 14:31:00,1434,Lake Shore Dr & North Blvd,15,Halsted St & Roscoe St,15,82.0,13.8,cloudy
4008,Male,2014-06-07 14:58:00,2014-06-07 15:19:00,1258,Theater on the Lake,15,Sheridan Rd & Buena Ave,15,82.0,13.8,cloudy
4009,Male,2014-06-07 15:23:00,2014-06-07 15:28:00,297,Sheffield Ave & Addison St,27,Pine Grove Ave & Waveland Ave,23,80.1,13.8,cloudy


### Referencing named index

If your DataFrame has an index that is named, which happens when a column is set as the index, then you can use that name within `query` just as if it were a regular column name. Here, we create a new DataFrame that has the `from_station_name` as the index.

In [37]:
bikes_idx = bikes.set_index('from_station_name')
bikes_idx.head(3)

Unnamed: 0_level_0,gender,starttime,stoptime,tripduration,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events
from_station_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
Lake Shore Dr & Monroe St,Male,2013-06-28 19:01:00,2013-06-28 19:17:00,993,11,Michigan Ave & Oak St,15,73.9,12.7,mostlycloudy
Clinton St & Washington Blvd,Male,2013-06-28 22:53:00,2013-06-28 23:03:00,623,31,Wells St & Walton St,19,69.1,6.9,partlycloudy
Sheffield Ave & Kingsbury St,Male,2013-06-30 14:43:00,2013-06-30 15:01:00,1040,15,Dearborn St & Monroe St,23,73.0,16.1,mostlycloudy


Notice the name 'from_station_name' directly above the index. This is the name of the index, and it can be referenced when using `query`. Let's filter for the rides that started at the 'Theater on the Lake' station.

In [38]:
bikes_idx.query('from_station_name == "Theater on the Lake"').head(3)

Unnamed: 0_level_0,gender,starttime,stoptime,tripduration,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events
from_station_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
Theater on the Lake,Male,2013-08-23 17:57:00,2013-08-23 18:16:00,1166,15,Lincoln Ave & Roscoe St,19,79.0,9.2,partlycloudy
Theater on the Lake,Female,2013-08-24 15:31:00,2013-08-24 15:59:00,1661,15,Fairbanks Ct & Grand Ave,15,84.9,6.9,partlycloudy
Theater on the Lake,Male,2013-09-07 14:28:00,2013-09-07 14:37:00,540,15,Sheffield Ave & Fullerton Ave,15,88.0,10.4,mostlycloudy


## Summary

The `query` method provides an alternative to boolean selection to filter the data based on the values. Here are the rules for the string you provide.

* The expression in the string must evaluate as True or False for every row
* Column names may be accessed directly with their name
* Often you will use one of the comparison operators to create a condition
* Use chained comparison operators to shorten syntax
* Use `and`, `or`, and `not` to create more complex conditions
* To use a literal string, surround it with quotes
* Use `in` to test multiple equalities. Provide the test values in a list
* All arithmetic operators work just as they do outside of the string
* Use the `@` character to reference a variable name
* Reference the index with the string 'index' or the index's name
* Use backticks to reference a column name with spaces in it
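The last rule is not demonstrated earlier in this chapter, so here is a minimal sketch. The `scores` frame and its values are made up for illustration; only the backtick syntax itself is the point.

```python
import pandas as pd

scores = pd.DataFrame({
    'math score': [72, 69, 90],
    'reading score': [72, 90, 95],
    'gender': ['female', 'female', 'male'],
})

# Column names that contain spaces are wrapped in backticks inside the query string
high = scores.query('`math score` > 70 and `reading score` >= 90')

print(high.index.tolist())
```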

## Example 

In [39]:
import pandas as pd
import numpy as np

data=pd.read_csv('input/pd-student.csv')
data.head()

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75


In [40]:
#Calculating the total score for the students.
data['Total_score'] = data['math score'] + data['reading score'] + data['writing score']
data.head()

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score,Total_score
0,female,group B,bachelor's degree,standard,none,72,72,74,218
1,female,group C,some college,standard,completed,69,90,88,247
2,female,group B,master's degree,standard,none,90,95,93,278
3,male,group A,associate's degree,free/reduced,none,47,57,44,148
4,male,group C,some college,standard,none,76,78,75,229


In [41]:
#Calculating the percentage of marks scored for each student.
data['Percentage']=(data['Total_score']/300)*100
data.head()

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score,Total_score,Percentage
0,female,group B,bachelor's degree,standard,none,72,72,74,218,72.666667
1,female,group C,some college,standard,completed,69,90,88,247,82.333333
2,female,group B,master's degree,standard,none,90,95,93,278,92.666667
3,male,group A,associate's degree,free/reduced,none,47,57,44,148,49.333333
4,male,group C,some college,standard,none,76,78,75,229,76.333333
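The two derived columns above could also be written with `DataFrame.eval()`, using backticks for the column names that contain spaces. This is only a sketch on a three-row stand-in (the `scores` frame copies the first three rows shown above); without `inplace=True`, `eval()` returns a new `DataFrame` and leaves the original untouched.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({
    'math score': [72, 69, 90],
    'reading score': [72, 90, 95],
    'writing score': [74, 88, 93],
})

# Assignment inside eval(): the new column name goes on the left of `=`
with_total = scores.eval('Total_score = `math score` + `reading score` + `writing score`')

# A second eval() can then refer to the newly created column by name
with_pct = with_total.eval('Percentage = Total_score / 300 * 100')

print(with_pct[['Total_score', 'Percentage']])
```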


We can give grades to all the students according to their percentages.

In [42]:
# Define a function that assigns a grade to each student according to the percentage of marks obtained.
# Each band is open at its lower bound, so a boundary value such as exactly 80 falls into the lower grade.

def grade(x):
    if x > 80:
        return 'A'
    elif x > 70:
        return 'B'
    elif x > 60:
        return 'C'
    elif x > 50:
        return 'D'
    elif x > 40:
        return 'E'
    else:
        return 'F'
    
data['Grade']=data['Percentage'].apply(grade)
data['Grade'].head()

0    B
1    A
2    A
3    E
4    B
Name: Grade, dtype: object
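Since `apply` calls the Python function once per row, a vectorized alternative may be preferable for large frames. The sketch below uses `np.select`, which picks the first matching condition for each element. It assumes that each grade band is open at its lower bound (so exactly 80 maps to 'B'); the `pct` values are made up for illustration.

```python
import numpy as np
import pandas as pd

pct = pd.Series([72.67, 82.33, 92.67, 49.33, 76.33, 80.0, 35.0])

# Conditions are checked in order; the first one that holds decides the grade
conditions = [pct > 80, pct > 70, pct > 60, pct > 50, pct > 40]
choices = ['A', 'B', 'C', 'D', 'E']

grades = pd.Series(np.select(conditions, choices, default='F'), index=pct.index)

print(grades.tolist())
```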

In [43]:
data.head()

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score,Total_score,Percentage,Grade
0,female,group B,bachelor's degree,standard,none,72,72,74,218,72.666667,B
1,female,group C,some college,standard,completed,69,90,88,247,82.333333,A
2,female,group B,master's degree,standard,none,90,95,93,278,92.666667,A
3,male,group A,associate's degree,free/reduced,none,47,57,44,148,49.333333,E
4,male,group C,some college,standard,none,76,78,75,229,76.333333,B


## Example 2

In [44]:
import pandas as pd

rain = pd.read_csv("https://milliams.com/courses/data_analysis_python/rain.csv")
rain.head()

Unnamed: 0,Cardiff,Stornoway,Oxford,Armagh
1853,,,57.7,53.0
1854,,,37.5,69.8
1855,,,53.4,50.2
1856,,,57.2,55.0
1857,,,61.3,64.6


In [45]:
# We can ask in how many years the average rainfall was above 100 mm for each city:
rain[rain > 100].count()

Cardiff      13
Stornoway    74
Oxford        0
Armagh        1
dtype: int64
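Masking and then counting the non-missing values gives the same numbers as summing the boolean frame directly, because a missing value compares as `False`. A minimal sketch with made-up rainfall figures (the `rain_toy` frame is hypothetical):

```python
import numpy as np
import pandas as pd

rain_toy = pd.DataFrame(
    {'Cardiff': [np.nan, 120.0, 99.3], 'Oxford': [57.7, 37.5, 101.0]},
    index=[1853, 1854, 1855],
)

# Keep only values above 100 (everything else becomes NaN), then count non-missing entries
masked_count = rain_toy[rain_toy > 100].count()

# Equivalent: True counts as 1 and False as 0 when summing the boolean frame
direct_count = (rain_toy > 100).sum()

print(masked_count.tolist(), direct_count.tolist())
```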

In [46]:
cardiff_rain = rain["Cardiff"]
cardiff_rain

1853      NaN
1854      NaN
1855      NaN
1856      NaN
1857      NaN
        ...  
2016     99.3
2017     85.0
2018     99.3
2019    119.0
2020    117.6
Name: Cardiff, Length: 168, dtype: float64

## Example - Filter Group Function: Categorizing with Conditional Logic

In [47]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv("https://milliams.com/courses/data_analysis_python/titanic.csv")
data.head()

Unnamed: 0,name,gender,age,class,embarked,country,ticketno,fare,sibsp,parch,survived
0,"Abbing, Mr. Anthony",male,42.0,3rd,S,United States,5547.0,7.11,0.0,0.0,no
1,"Abbott, Mr. Eugene Joseph",male,13.0,3rd,S,United States,2673.0,20.05,0.0,2.0,no
2,"Abbott, Mr. Rossmore Edward",male,16.0,3rd,S,United States,2673.0,20.05,1.0,1.0,no
3,"Abbott, Mrs. Rhoda Mary 'Rosa'",female,39.0,3rd,S,England,2673.0,20.05,1.0,1.0,yes
4,"Abelseth, Miss. Karen Marie",female,16.0,3rd,S,Norway,348125.0,7.13,0.0,0.0,yes


In [48]:
def fare_cat(x):
    if x < 100:
        return "Cheap"
    else:
        return "Expensive"
    
# Apply the function to every fare. A missing fare (NaN) fails the `x < 100` test, so it is labeled "Expensive".
data['fare'] = data['fare'].apply(fare_cat)
data['fare'].value_counts()

Cheap        1206
Expensive    1001
Name: fare, dtype: int64

In [49]:
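def age_cat(x):
    if x < 12:
        return 'Kids'
    elif 12 <= x < 18:
        return 'Grown Ups'
    elif 18 <= x < 35:
        return 'Adults'
    elif x > 35:
        return 'Old'
    # An age of exactly 35 or a missing age (NaN) matches no branch, so the
    # function returns None and value_counts() leaves those rows out.

# Apply the function to every age
data['age'] = data['age'].apply(age_cat)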
data['age'].value_counts()

Adults       1270
Old           684
Kids          106
Grown Ups      94
Name: age, dtype: int64
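Threshold-based categories like these can also be produced with `pd.cut`. The sketch below is an alternative rather than a drop-in replacement: with `right=False` each bin includes its lower edge, so an age of exactly 35 lands in 'Old' (whereas `age_cat` above returns `None` for it), and missing ages stay missing. The sample ages are made up.

```python
import numpy as np
import pandas as pd

ages = pd.Series([5, 12, 17.5, 18, 34, 35, 60, np.nan])

# Bin edges: [0, 12), [12, 18), [18, 35), [35, inf)
bins = [0, 12, 18, 35, np.inf]
labels = ['Kids', 'Grown Ups', 'Adults', 'Old']

age_groups = pd.cut(ages, bins=bins, labels=labels, right=False)

print(age_groups.tolist())
```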

In [50]:
def fam_cat(x):
    if x == 1:
        return 'Alone'
    elif x > 1 and x <= 4:
        return 'Small Family'
    else:
        return 'Large Family'

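# Family size = siblings/spouses + parents/children + the passenger. A missing value (NaN) falls through to 'Large Family'.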
data['FamilySize'] = data['sibsp'] + data['parch'] + 1
data['FamilySize'] = data['FamilySize'].apply(fam_cat)
data['FamilySize'].value_counts()

Large Family    982
Alone           788
Small Family    437
Name: FamilySize, dtype: int64

In [51]:
# Let's extract marital-status information (the title) from the names of the passengers
data['Title'] = data['name'].str.split(', ', expand=True)[1].str.split('.', expand=True)[0]
data['Title'] = data['Title'].replace(['Miss', 'Mrs','Ms', 'Mlle', 'Lady', 'Mme', 'the Countess', 'Dona'], 'Miss/Mrs/Ms')
data['Title'] = data['Title'].replace(['Dr', 'Col', 'Major', 'Jonkheer', 'Capt', 'Sir', 'Don', 'Rev'],'Dr/Military/Noble/Clergy')
data['Title'].value_counts()

Mr                          1586
Miss/Mrs/Ms                  482
Master                        61
Sig                           42
Dr/Military/Noble/Clergy      16
Fr                             6
Colonel                        3
Sra                            3
Captain                        2
Sr                             2
Revd                           1
Lucy Christiana                1
Doña                           1
Lucy Noël Martha               1
Name: Title, dtype: int64
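The double `split` used above can also be expressed as a single regular expression with `str.extract`. This is a sketch on three names taken from the table shown earlier; the pattern (text after the comma, up to the first period) is an assumption about the name format and would need checking against the irregular names visible in the counts above.

```python
import pandas as pd

names = pd.Series([
    "Abbing, Mr. Anthony",
    "Abbott, Mrs. Rhoda Mary 'Rosa'",
    "Abelseth, Miss. Karen Marie",
])

# Capture everything between ", " and the first "."
titles = names.str.extract(r',\s*([^.]+)\.', expand=False)

print(titles.tolist())
```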

