<!--NAVIGATION-->
< [Combining Datasets: Merge and Join](03.07-Merge-and-Join.ipynb) | [Contents](Index.ipynb) | [Pivot Tables](03.09-Pivot-Tables.ipynb) >

<a href="https://colab.research.google.com/github/wangyingsm/Python-Data-Science-Handbook/blob/master/notebooks/03.08-Aggregation-and-Grouping.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>


# Aggregation and Grouping

> An essential piece of analysis of large data is efficient summarization: computing aggregations like ``sum()``, ``mean()``, ``median()``, ``min()``, and ``max()``, in which a single number gives insight into the nature of a potentially large dataset.
In this section, we'll explore aggregations in Pandas, from simple operations akin to what we've seen on NumPy arrays, to more sophisticated operations based on the concept of a ``groupby``.

In [1]:
import numpy as np
import pandas as pd

class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""
    def __init__(self, *args):
        self.args = args
        
    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)
    
    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)

> Here we will use the Planets dataset, available via the [Seaborn package](http://seaborn.pydata.org/) (see [Visualization With Seaborn](04.14-Visualization-With-Seaborn.ipynb)).
It gives information on planets that astronomers have discovered around other stars (known as *extrasolar planets*, or *exoplanets* for short). It can be downloaded with a simple Seaborn command:

In [31]:
import seaborn as sns
planets = sns.load_dataset('planets')
planets.head()

,method,number,orbital_period,mass,distance,year
0,Radial Velocity,1,269.3,7.1,77.4,2006
1,Radial Velocity,1,874.774,2.21,56.95,2008
2,Radial Velocity,1,763.0,2.6,19.84,2011
3,Radial Velocity,1,326.03,19.4,110.62,2007
4,Radial Velocity,1,516.22,10.5,119.47,2009


> Pandas ``Series`` and ``DataFrame``s include all of the common aggregates mentioned in [Aggregations: Min, Max, and Everything In Between](02.04-Computation-on-arrays-aggregates.ipynb); in addition, there is a convenience method ``describe()`` that computes several common aggregates for each column and returns the result.
Let's use this on the Planets data, for now dropping rows with missing values:

In [39]:
planets.dropna().describe()

,number,orbital_period,mass,distance,year
count,498.0,498.0,498.0,498.0,498.0
mean,1.73494,835.778671,2.50932,52.068213,2007.37751
std,1.17572,1469.128259,3.636274,46.596041,4.167284
min,1.0,1.3283,0.0036,1.35,1989.0
25%,1.0,38.27225,0.2125,24.4975,2005.0
50%,1.0,357.0,1.245,39.94,2009.0
75%,2.0,999.6,2.8675,59.3325,2011.0
max,6.0,17337.5,25.0,354.0,2014.0


> This can be a useful way to begin understanding the overall properties of a dataset.
For example, we see in the ``year`` column that although exoplanets were discovered as far back as 1989, half of the planets in this table (the rows with no missing values) were not discovered until 2009 or later, the median year shown above.
This is largely thanks to the *Kepler* mission, a space-based telescope specifically designed for finding eclipsing planets around other stars.

## GroupBy: Split, Apply, Combine

> Simple aggregations can give you a flavor of your dataset, but often we would prefer to aggregate conditionally on some label or index: this is implemented in the so-called ``groupby`` operation.
The name "group by" comes from a command in the SQL database language, but it is perhaps more illuminating to think of it in the terms first coined by Hadley Wickham of Rstats fame: *split, apply, combine*.

### Split, apply, combine

> A canonical example of this split-apply-combine operation, where the "apply" is a summation aggregation, is illustrated in this figure:

![](https://github.com/wangyingsm/Python-Data-Science-Handbook/raw/61f1a8f5b27e374f3eb50ea41efb73ac531ae539/notebooks/figures/03.08-split-apply-combine.png)
[figure source in Appendix](06.00-Figure-Code.ipynb#Split-Apply-Combine)

> This makes clear what the ``groupby`` accomplishes:

> - The *split* step involves breaking up and grouping a ``DataFrame`` depending on the value of the specified key.
> - The *apply* step involves computing some function, usually an aggregate, transformation, or filtering, within the individual groups.
> - The *combine* step merges the results of these operations into a single output object.

> While this could certainly be done manually using some combination of the masking, aggregation, and merging commands covered earlier, an important realization is that *the intermediate splits do not need to be explicitly instantiated*. Rather, the ``GroupBy`` can (often) do this in a single pass over the data, updating the sum, mean, count, min, or other aggregate for each group along the way.
The power of the ``GroupBy`` is that it abstracts away these steps: the user need not think about *how* the computation is done under the hood, but can instead think about the *operation as a whole*.

> As a concrete example, let's take a look at using Pandas for the computation shown in this diagram.
We'll start by creating the input ``DataFrame``:

In [40]:
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)}, columns=['key', 'data'])
df

,key,data
0,A,0
1,B,1
2,C,2
3,A,3
4,B,4
5,C,5


> The most basic split-apply-combine operation can be computed with the ``groupby()`` method of ``DataFrame``s, passing the name of the desired key column:

In [41]:
df.groupby('key')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f96102c3b80>

> Notice that what is returned is not a set of ``DataFrame``s, but a ``DataFrameGroupBy`` object.
This object is where the magic is: you can think of it as a special view of the ``DataFrame``, which is poised to dig into the groups but does no actual computation until the aggregation is applied.
This "lazy evaluation" approach means that common aggregates can be implemented very efficiently in a way that is almost transparent to the user.
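As a small sketch of what this object looks like from the outside, the snippet below rebuilds the same toy frame and inspects the grouped object without aggregating it. The variable name `grouped` is only illustrative; `ngroups` and `get_group()` are ordinary attributes of the pandas GroupBy object.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)})

# Creating the GroupBy object does not aggregate anything yet.
grouped = df.groupby('key')

print(type(grouped).__name__)   # class name of the grouped object
print(grouped.ngroups)          # number of distinct keys
print(grouped.get_group('A'))   # the rows that belong to key 'A'
```

Only when an aggregate such as ``sum()`` is called on `grouped` is a result table actually produced.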

> To produce a result, we can apply an aggregate to this ``DataFrameGroupBy`` object, which will perform the appropriate apply/combine steps to produce the desired result:

In [42]:
df.groupby('key').sum()

key,data
A,3
B,5
C,7
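For comparison, here is a rough sketch of the same sum assembled by hand with boolean masks, next to the one-line ``groupby`` version. The names `manual` and `grouped` are illustrative only; the point is that both routes give the same per-key totals, while ``groupby`` hides the split and combine bookkeeping.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)})

# Split: one boolean mask per distinct key.
# Apply: sum the masked subset.
# Combine: collect the per-key results into a single Series.
manual = pd.Series({k: df.loc[df['key'] == k, 'data'].sum()
                    for k in df['key'].unique()})

# The same computation expressed as a single groupby call.
grouped = df.groupby('key')['data'].sum()

print(manual.to_dict())
print(grouped.to_dict())
```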


> The ``sum()`` method is just one possibility here; you can apply virtually any common Pandas or NumPy aggregation function, as well as virtually any valid ``DataFrame`` operation, as we will see in the following discussion.

### The GroupBy object

> The ``GroupBy`` object is a very flexible abstraction.
In many ways, you can simply treat it as if it's a collection of ``DataFrame``s, and it does the difficult things under the hood. Let's see some examples using the Planets data.

> Perhaps the most important operations made available by a ``GroupBy`` are *aggregate*, *filter*, *transform*, and *apply*.
We'll discuss each of these more fully in ["Aggregate, Filter, Transform, Apply"](#Aggregate,-Filter,-Transform,-Apply), but before that let's introduce some of the other functionality that can be used with the basic ``GroupBy`` operation.

#### Column indexing

> The ``GroupBy`` object supports column indexing in the same way as the ``DataFrame``, and returns a modified ``GroupBy`` object.
For example:

In [14]:
planets.groupby('method')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f9650a424c0>

In [15]:
planets.groupby('method')['orbital_period']

<pandas.core.groupby.generic.SeriesGroupBy object at 0x7f9650a426a0>

> Here we've selected a single column from the grouped ``DataFrame`` by its name, which gives a ``SeriesGroupBy`` object.
As with the ``DataFrameGroupBy`` object, no computation is done until we call some aggregate on the object:

In [16]:
planets.groupby('method')['orbital_period'].median()

method
Astrometry                         631.180000
Eclipse Timing Variations         4343.500000
Imaging                          27500.000000
Microlensing                      3300.000000
Orbital Brightness Modulation        0.342887
Pulsar Timing                       66.541900
Pulsation Timing Variations       1170.000000
Radial Velocity                    360.200000
Transit                              5.714932
Transit Timing Variations           57.011000
Name: orbital_period, dtype: float64
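A minimal sketch on a toy frame (the column values are the ones used later in this section, written out by hand) suggests that selecting the column before aggregating gives the same numbers as aggregating everything and selecting afterwards. The first form simply avoids computing aggregates for columns you do not need.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})

# Select the column on the GroupBy object, then aggregate.
before = df.groupby('key')['data2'].median()

# Aggregate every column, then select the one of interest.
after = df.groupby('key').median()['data2']

print(before.to_dict())
print(after.to_dict())
```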

#### Iteration over groups

> The ``GroupBy`` object supports direct iteration over the groups, returning each group as a ``Series`` or ``DataFrame``:

In [17]:
for (method, group) in planets.groupby('method'):
    print("{0:30s} shape={1}".format(method, group.shape))

Astrometry                     shape=(2, 6)
Eclipse Timing Variations      shape=(9, 6)
Imaging                        shape=(38, 6)
Microlensing                   shape=(23, 6)
Orbital Brightness Modulation  shape=(3, 6)
Pulsar Timing                  shape=(5, 6)
Pulsation Timing Variations    shape=(1, 6)
Radial Velocity                shape=(553, 6)
Transit                        shape=(397, 6)
Transit Timing Variations      shape=(4, 6)


> This can be useful for doing certain things manually, though it is often much faster to use the built-in ``apply`` functionality, which we will discuss momentarily.
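As an illustration of the manual route, the following sketch counts rows per group with an explicit loop and compares the result with the built-in ``size()`` aggregate. The toy frame and the `sizes` dictionary are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})

# Manual route: iterate over (key, sub-DataFrame) pairs.
sizes = {}
for key, group in df.groupby('key'):
    sizes[key] = len(group)

# Built-in route: a single aggregate call.
builtin = df.groupby('key').size().to_dict()

print(sizes)
print(builtin)
```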

#### Dispatch methods

> Through some Python class magic, any method not explicitly implemented by the ``GroupBy`` object will be passed through and called on the groups, whether they are ``DataFrame`` or ``Series`` objects.
For example, you can use the ``describe()`` method of ``DataFrame``s to perform a set of aggregations that describe each group in the data:

Translator's note: the author's original code appended ``unstack()`` to the call below. It is omitted here because ``describe()`` on the grouped column already returns a table with one row per group.

In [18]:
planets.groupby('method')['year'].describe()

method,count,mean,std,min,25%,50%,75%,max
Astrometry,2.0,2011.5,2.12132,2010.0,2010.75,2011.5,2012.25,2013.0
Eclipse Timing Variations,9.0,2010.0,1.414214,2008.0,2009.0,2010.0,2011.0,2012.0
Imaging,38.0,2009.131579,2.781901,2004.0,2008.0,2009.0,2011.0,2013.0
Microlensing,23.0,2009.782609,2.859697,2004.0,2008.0,2010.0,2012.0,2013.0
Orbital Brightness Modulation,3.0,2011.666667,1.154701,2011.0,2011.0,2011.0,2012.0,2013.0
Pulsar Timing,5.0,1998.4,8.38451,1992.0,1992.0,1994.0,2003.0,2011.0
Pulsation Timing Variations,1.0,2007.0,,2007.0,2007.0,2007.0,2007.0,2007.0
Radial Velocity,553.0,2007.518987,4.249052,1989.0,2005.0,2009.0,2011.0,2014.0
Transit,397.0,2011.236776,2.077867,2002.0,2010.0,2012.0,2013.0,2014.0
Transit Timing Variations,4.0,2012.5,1.290994,2011.0,2011.75,2012.5,2013.25,2014.0


> Looking at this table helps us to better understand the data: for example, the vast majority of planets have been discovered by the Radial Velocity and Transit methods, though the latter only became common (due to new, more accurate telescopes) in the last decade.
The newest methods seem to be Transit Timing Variations and Orbital Brightness Modulation, which were not used to discover a new planet until 2011.
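The same call can be tried on a toy frame to see the shape of the result. This is only a sketch with hand-written values: the output has one row per group and one column per summary statistic, so individual statistics can be picked out with ordinary column selection.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data2': [5, 0, 3, 3, 7, 9]})

# One row per group, one column per statistic.
summary = df.groupby('key')['data2'].describe()

print(list(summary.columns))
print(summary[['count', 'mean', 'max']])
```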

### Aggregate, filter, transform, apply

> The preceding discussion focused on aggregation for the combine operation, but there are more options available.
In particular, ``GroupBy`` objects have ``aggregate()``, ``filter()``, ``transform()``, and ``apply()`` methods that efficiently implement a variety of useful operations before combining the grouped data.

In [19]:
rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': rng.randint(0, 10, 6)},
                   columns = ['key', 'data1', 'data2'])
df

,key,data1,data2
0,A,0,5
1,B,1,0
2,C,2,3
3,A,3,3
4,B,4,7
5,C,5,9


#### Aggregation

> We're now familiar with ``GroupBy`` aggregations with ``sum()``, ``median()``, and the like, but the ``aggregate()`` method allows for even more flexibility.
It can take a string, a function, or a list thereof, and compute all the aggregates at once.
Here is a quick example combining all these:

In [20]:
df.groupby('key').aggregate(['min', np.median, max])

,data1,data1,data1,data2,data2,data2
key,min,median,max,min,median,max
A,0,1.5,3,3,4.0,5
B,1,2.5,4,0,3.5,7
C,2,3.5,5,3,6.0,9


> Another useful pattern is to pass a dictionary mapping column names to operations to be applied on that column:

In [21]:
df.groupby('key').aggregate({'data1': 'min',
                             'data2': 'max'})

key,data1,data2
A,0,5
B,1,7
C,2,9
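A related form, not used in the text above, is pandas' keyword-based "named aggregation", where each keyword names an output column and maps to a `(column, function)` pair. The sketch below assumes a reasonably recent pandas version; the output names `data1_min` and `data2_max` are arbitrary choices for the example.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})

# Each keyword becomes an output column: name=(input column, aggregate).
result = df.groupby('key').agg(data1_min=('data1', 'min'),
                               data2_max=('data2', 'max'))

print(result)
```

Compared with the dictionary form, this makes the output column names explicit, which can help when the same input column is aggregated in more than one way.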


#### Filtering

> A filtering operation allows you to drop data based on the group properties.
For example, we might want to keep all groups in which the standard deviation is larger than some critical value:

Translator's note: ``filter()`` plays a role similar to the ``HAVING`` clause in SQL.

In [22]:
def filter_func(x):
    return x['data2'].std() > 4

display('df', "df.groupby('key').std()", "df.groupby('key').filter(filter_func)")

,key,data1,data2
0,A,0,5
1,B,1,0
2,C,2,3
3,A,3,3
4,B,4,7
5,C,5,9

key,data1,data2
A,2.12132,1.414214
B,2.12132,4.949747
C,2.12132,4.242641

,key,data1,data2
1,B,1,0
2,C,2,3
4,B,4,7
5,C,5,9


> The filter function should return a Boolean value specifying whether the group passes the filtering. Here, because group A does not have a standard deviation greater than 4, it is dropped from the result.
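One way to see what ``filter()`` does is to reproduce it with a row-wise mask. The sketch below, on the same toy values, broadcasts each group's standard deviation back to its rows with ``transform('std')`` and then keeps the rows whose group passes the threshold; the names `by_filter` and `by_mask` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})

# Group-level test: keep whole groups whose data2 std exceeds 4.
by_filter = df.groupby('key').filter(lambda g: g['data2'].std() > 4)

# Equivalent row-level mask: give every row its group's std, then compare.
group_std = df.groupby('key')['data2'].transform('std')
by_mask = df[group_std > 4]

print(by_filter)
print(by_mask)
```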

#### Transformation

> While aggregation must return a reduced version of the data, transformation can return some transformed version of the full data to recombine.
For such a transformation, the output is the same shape as the input.
A common example is to center the data by subtracting the group-wise mean:

In [23]:
df.groupby('key').transform(lambda x: x - x.mean())

,data1,data2
0,-1.5,1.0
1,-1.5,-3.5
2,-1.5,-3.0
3,1.5,-1.0
4,1.5,3.5
5,1.5,3.0


#### The apply() method

> The ``apply()`` method lets you apply an arbitrary function to each group.
The function should take a ``DataFrame``, and return either a Pandas object (e.g., ``DataFrame``, ``Series``) or a scalar; the combine operation will be tailored to the type of output returned.

> For example, here is an ``apply()`` that normalizes the first column, ``data1``, by the group-wise sum of the second, ``data2``:

In [24]:
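def norm_by_data2(x):
    # x is a DataFrame of group values; copy it so the input is not mutated
    x = x.copy()
    # plain assignment lets the integer column become a float column
    x['data1'] = x['data1'] / x['data2'].sum()
    return x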

display('df', "df.groupby('key').apply(norm_by_data2)")

,key,data1,data2
0,A,0,5
1,B,1,0
2,C,2,3
3,A,3,3
4,B,4,7
5,C,5,9

,key,data1,data2
0,A,0.0,5
1,B,0.142857,0
2,C,0.166667,3
3,A,0.375,3
4,B,0.571429,7
5,C,0.416667,9


> ``apply()`` within a ``GroupBy`` is quite flexible: the only criterion is that the function takes a ``DataFrame`` and returns a Pandas object or scalar; what you do in the middle is up to you!
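To illustrate the scalar-returning case, here is a small sketch in which the applied function reduces each group to a single number, so the combined result is a ``Series`` indexed by the group key. The ratio computed here is an arbitrary example, and the data columns are selected explicitly so that the function only sees `data1` and `data2`.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})

def sum_ratio(group):
    # group is a DataFrame holding one key's rows; return one scalar
    return group['data1'].sum() / group['data2'].sum()

ratio = df.groupby('key')[['data1', 'data2']].apply(sum_ratio)

print(ratio)
```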

### Specifying the split key

> In the simple examples presented before, we split the ``DataFrame`` on a single column name.
This is just one of many options by which the groups can be defined, and we'll go through some other options for group specification here.

#### A list, array, series, or index providing the grouping keys

> The key can be any series or list with a length matching that of the ``DataFrame``. For example:

In [25]:
L = [0, 1, 0, 1, 2, 0]
display('df', 'df.groupby(L).sum()')

,key,data1,data2
0,A,0,5
1,B,1,0
2,C,2,3
3,A,3,3
4,B,4,7
5,C,5,9

,data1,data2
0,7,17
1,4,3
2,4,7


In [26]:
display('df', "df.groupby(df['key']).sum()")

,key,data1,data2
0,A,0,5
1,B,1,0
2,C,2,3
3,A,3,3
4,B,4,7
5,C,5,9

key,data1,data2
A,3,8
B,5,7
C,7,12


#### A dictionary or series mapping index to group

> Another method is to provide a dictionary that maps index values to the group keys:

In [27]:
df2 = df.set_index('key')
mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}
display('df2', 'df2.groupby(mapping).sum()')

key,data1,data2
A,0,5
B,1,0
C,2,3
A,3,3
B,4,7
C,5,9

key,data1,data2
consonant,12,19
vowel,3,8


#### Any Python function

> Similar to mapping, you can pass any Python function that takes an index value as input and returns the group:

In [28]:
display('df2', 'df2.groupby(str.lower).mean()')

key,data1,data2
A,0,5
B,1,0
C,2,3
A,3,3
B,4,7
C,5,9

key,data1,data2
a,1.5,4.0
b,2.5,3.5
c,3.5,6.0


#### A list of valid keys

> Further, any of the preceding key choices can be combined to group on a multi-index:

In [29]:
df2.groupby([str.lower, mapping]).mean()

key,key,data1,data2
a,vowel,1.5,4.0
b,consonant,2.5,3.5
c,consonant,3.5,6.0


### Grouping example

> As an example of this, in a couple of lines of Python code we can put all these together and count discovered planets by method and by decade:

In [30]:
decade = 10 * (planets['year'] // 10)
decade = decade.astype(str) + 's'
decade.name = 'decade'
planets.groupby(['method', decade])['number'].sum().unstack().fillna(0)

method,1980s,1990s,2000s,2010s
Astrometry,0.0,0.0,0.0,2.0
Eclipse Timing Variations,0.0,0.0,5.0,10.0
Imaging,0.0,0.0,29.0,21.0
Microlensing,0.0,0.0,12.0,15.0
Orbital Brightness Modulation,0.0,0.0,0.0,5.0
Pulsar Timing,0.0,9.0,1.0,1.0
Pulsation Timing Variations,0.0,0.0,1.0,0.0
Radial Velocity,1.0,52.0,475.0,424.0
Transit,0.0,0.0,64.0,712.0
Transit Timing Variations,0.0,0.0,0.0,9.0


> This shows the power of combining many of the operations we've discussed up to this point when looking at realistic datasets.
We immediately gain a coarse understanding of when and how planets have been discovered over the past several decades!
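The individual steps are easier to follow on a handful of rows. The sketch below uses a tiny, made-up table (these five rows are invented for illustration and are not taken from the Planets dataset) and passes ``fill_value=0`` to ``unstack()`` as an alternative to the trailing ``fillna(0)``.

```python
import pandas as pd

# Invented toy rows, NOT real Planets data.
toy = pd.DataFrame({
    'method': ['Transit', 'Transit', 'Radial Velocity',
               'Radial Velocity', 'Transit'],
    'number': [1, 2, 1, 3, 1],
    'year':   [2008, 2011, 1995, 2009, 2012],
})

# Integer-divide by 10 to drop the last digit, scale back, add an 's' label.
decade = (10 * (toy['year'] // 10)).astype(str) + 's'
decade.name = 'decade'

# Group on a column name plus the derived Series, then pivot decades to columns.
table = toy.groupby(['method', decade])['number'].sum().unstack(fill_value=0)

print(table)
```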

