<!--NAVIGATION-->
< [Handling Missing Data](03.04-Missing-Values.ipynb) | [Contents](Index.ipynb) | [Combining Datasets: Concat and Append](03.06-Concat-And-Append.ipynb) >

<a href="https://colab.research.google.com/github/wangyingsm/Python-Data-Science-Handbook/blob/master/notebooks/03.05-Hierarchical-Indexing.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>


# Hierarchical Indexing

> Up to this point we've been focused primarily on one-dimensional and two-dimensional data, stored in Pandas ``Series`` and ``DataFrame`` objects, respectively.
Often it is useful to go beyond this and store higher-dimensional data, that is, data indexed by more than one or two keys.
Older Pandas releases provided ``Panel`` and ``Panel4D`` objects that natively handled three-dimensional and four-dimensional data (see [Aside: Panel Data](#Aside:-Panel-Data)), but these have since been removed from the library; the far more common pattern in practice is to make use of *hierarchical indexing* (also known as *multi-indexing*) to incorporate multiple index *levels* within a single index.
In this way, higher-dimensional data can be compactly represented within the familiar one-dimensional ``Series`` and two-dimensional ``DataFrame`` objects.

## A Multiply Indexed Series

> Let's start by considering how we might represent two-dimensional data within a one-dimensional ``Series``.
For concreteness, we will consider a series of data where each point is identified by both a character key and a numerical key.

### The bad way

> Suppose you would like to track population data for states across two different years.
Using the Pandas tools we've already covered, you might be tempted to simply use Python tuples as keys:

In [1]:
import pandas as pd
import numpy as np

index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]
pop = pd.Series(populations, index=index)
pop

(California, 2000)    33871648
(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
(Texas, 2010)         25145561
dtype: int64

In [2]:
pop[('California', 2010):('Texas', 2000)]

(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
dtype: int64

> Slicing on tuple keys works, as the previous cell shows, but the convenience ends there. For example, if you need to select all values from 2010, you'll need to do some messy (and potentially slow) munging to make it happen:

In [3]:
pop[[i for i in pop.index if i[1] == 2010]]

(California, 2010)    37253956
(New York, 2010)      19378102
(Texas, 2010)         25145561
dtype: int64

### The Better Way: Pandas MultiIndex

> Our tuple-based indexing is essentially a rudimentary multi-index, and the Pandas ``MultiIndex`` type provides the operations we actually want. A ``MultiIndex`` built from the tuples contains multiple *levels* of indexing (in this case, the state names and the years) as well as integer *codes* for each data point that encode its position within each level. If we re-index our series with this ``MultiIndex``, we see the hierarchical representation of the data:

In [4]:
index = pd.MultiIndex.from_tuples(index)
index

MultiIndex([('California', 2000),
            ('California', 2010),
            (  'New York', 2000),
            (  'New York', 2010),
            (     'Texas', 2000),
            (     'Texas', 2010)],
           )

In [5]:
pop = pop.reindex(index)
pop

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

> Now to access all data for which the second index level is 2010, we can simply use the Pandas slicing notation:

In [6]:
pop[:, 2010]

California    37253956
New York      19378102
Texas         25145561
dtype: int64
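
> As a small self-contained sketch of the same selection (the name `multi_pop` is illustrative and not part of the original notebook), the cross-section for one year can also be written with ``.loc`` or with ``Series.xs``, which takes the level explicitly:

```python
import pandas as pd

keys = [('California', 2000), ('California', 2010),
        ('New York', 2000), ('New York', 2010),
        ('Texas', 2000), ('Texas', 2010)]
values = [33871648, 37253956, 18976457, 19378102, 20851820, 25145561]

# Build the hierarchical index explicitly from the tuples
multi_pop = pd.Series(values, index=pd.MultiIndex.from_tuples(keys))

# Two spellings of "every state, year 2010"; both drop the year level
by_loc = multi_pop.loc[:, 2010]
by_xs = multi_pop.xs(2010, level=1)
```

> Both results are indexed by state only, because the selected level is consumed by the lookup.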

### MultiIndex as extra dimension

> You might notice something else here: we could easily have stored the same data using a simple ``DataFrame``, with the states as the index and the years as the column labels.
In fact, Pandas is built with this equivalence in mind. The ``unstack()`` method will quickly convert a multiply indexed ``Series`` into a conventionally indexed ``DataFrame``:

In [7]:
pop_df = pop.unstack()
pop_df

Unnamed: 0,2000,2010
California,33871648,37253956
New York,18976457,19378102
Texas,20851820,25145561


In [8]:
pop_df.stack()  # stack() provides the opposite operation

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

> Seeing this, you might wonder why we would bother with hierarchical indexing at all.
The reason is simple: just as we were able to use multi-indexing to represent two-dimensional data within a one-dimensional ``Series``, we can also use it to represent data of three or more dimensions in a ``Series`` or ``DataFrame``.
Each extra level in a multi-index represents an extra dimension of data; taking advantage of this property gives us much more flexibility in the types of data we can represent. Concretely, we might want to add another column of demographic data for each state at each year (say, population under 18); with a ``MultiIndex`` this is as easy as adding another column to the ``DataFrame``:

In [9]:
pop_df = pd.DataFrame({'total': pop,
                       'under18': [9267089, 9284094,
                                   4687374, 4318033,
                                   5906301, 6879014]})
pop_df

Unnamed: 0,Unnamed: 1,total,under18
California,2000,33871648,9267089
California,2010,37253956,9284094
New York,2000,18976457,4687374
New York,2010,19378102,4318033
Texas,2000,20851820,5906301
Texas,2010,25145561,6879014


> In addition, all the ufuncs and other functionality discussed in [Operating on Data in Pandas](03.03-Operations-in-Pandas.ipynb) work with hierarchical indices as well.
Here we compute the fraction of people under 18 for each state and year, given the above data:

In [10]:
f_u18 = pop_df['under18'] / pop_df['total']
f_u18.unstack()

Unnamed: 0,2000,2010
California,0.273594,0.249211
New York,0.24701,0.222831
Texas,0.283251,0.273568


## Methods of MultiIndex Creation

> The most straightforward way to construct a multiply indexed ``Series`` or ``DataFrame`` is to pass a list of two or more index arrays as the ``index`` argument of the constructor. For example:

In [11]:
df = pd.DataFrame(np.random.rand(4, 2),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=['data1', 'data2'])
df

Unnamed: 0,Unnamed: 1,data1,data2
a,1,0.324963,0.743415
a,2,0.936632,0.634869
b,1,0.562142,0.086149
b,2,0.586943,0.168097


> The work of creating the ``MultiIndex`` is done in the background. Similarly, if you pass a dictionary with appropriate tuples as keys to ``pd.Series``, Pandas will automatically recognize this and use a ``MultiIndex`` by default:

In [12]:
data = {('California', 2000): 33871648,
        ('California', 2010): 37253956,
        ('Texas', 2000): 20851820,
        ('Texas', 2010): 25145561,
        ('New York', 2000): 18976457,
        ('New York', 2010): 19378102}
pd.Series(data)

California  2000    33871648
            2010    37253956
Texas       2000    20851820
            2010    25145561
New York    2000    18976457
            2010    19378102
dtype: int64

### Explicit MultiIndex constructors

> For more flexibility in how the index is constructed, you can instead use the class method constructors available on ``pd.MultiIndex``.
For example, as we did before, you can construct the ``MultiIndex`` from a simple list of arrays giving the index values within each level:

In [13]:
pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])  # Method A: list of arrays

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

In [14]:
pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])  # Method B: list of tuples

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

In [15]:
pd.MultiIndex.from_product([['a', 'b'], [1, 2]])  # Method C: Cartesian product of single indices

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

> Similarly, you can construct the ``MultiIndex`` directly using its internal encoding by passing ``levels`` (a list of lists containing the available index values for each level) and ``codes`` (a list of lists of integer positions that reference those values; this argument was called ``labels`` in older Pandas releases):

In [16]:
pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
              codes=[[0, 0, 1, 1], [0, 1, 0, 1]])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )
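
> To make the relationship between ``levels`` and ``codes`` concrete, here is a minimal sketch (the variable names are illustrative) that checks the four constructors above produce the same index and then decodes the codes by hand:

```python
import pandas as pd

from_arrays = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
from_tuples = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
from_product = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])
explicit = pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
                         codes=[[0, 0, 1, 1], [0, 1, 0, 1]])

# All four routes describe the same four (letter, number) pairs
all_equal = (from_arrays.equals(from_tuples)
             and from_tuples.equals(from_product)
             and from_product.equals(explicit))

# Decode manually: entry i of each codes array is a position in the matching level
decoded = [
    tuple(level[code] for level, code in zip(explicit.levels, codes_at_i))
    for codes_at_i in zip(*explicit.codes)
]
```

> Storing each distinct value once in ``levels`` and referring to it by integer position is what keeps a large ``MultiIndex`` compact.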

> Any of these objects can be passed as the ``index`` argument when creating a ``Series`` or ``DataFrame``, or be passed to the ``reindex`` method of an existing ``Series`` or ``DataFrame``.

### MultiIndex level names

> Sometimes it is convenient to name the levels of the ``MultiIndex``.
This can be accomplished by passing the ``names`` argument to any of the above ``MultiIndex`` constructors, or by setting the ``names`` attribute of the index after the fact:

In [17]:
pop.index.names = ['state', 'year']
pop

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
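
> The other route mentioned above, passing ``names`` at construction time, can be sketched as follows (a hedged example on a reduced copy of the data; `named` and `s` are illustrative names). Once levels are named, the names can be used wherever a level number is accepted:

```python
import pandas as pd

named = pd.MultiIndex.from_product([['California', 'Texas'], [2000, 2010]],
                                   names=['state', 'year'])
s = pd.Series([33871648, 37253956, 20851820, 25145561], index=named)

# Refer to levels by name instead of by position
by_year = s.xs(2010, level='year')
wide = s.unstack(level='year')
```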

### MultiIndex for columns

> In a ``DataFrame``, the rows and columns are completely symmetric, and just as the rows can have multiple levels of indices, the columns can have multiple levels as well.
Consider the following, which is a mock-up of some (somewhat realistic) medical data:

In [18]:
# hierarchical indices for both rows and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

# mock some data
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10
data += 37

# DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data

Unnamed: 0_level_0,subject,Bob,Bob,Guido,Guido,Sue,Sue
Unnamed: 0_level_1,type,HR,Temp,HR,Temp,HR,Temp
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
2013,1,28.0,36.0,39.0,38.1,35.0,35.5
2013,2,26.0,36.1,58.0,37.3,50.0,38.5
2014,1,17.0,38.7,32.0,39.1,24.0,35.8
2014,2,56.0,37.4,19.0,37.7,42.0,38.0


> Here we see where the multi-indexing for both rows and columns can come in *very* handy.
This is fundamentally four-dimensional data, where the dimensions are the subject, the measurement type, the year, and the visit number.
With this in place we can, for example, index the top-level column by the person's name and get a full ``DataFrame`` containing just that person's information:

In [19]:
health_data['Guido']

Unnamed: 0_level_0,type,HR,Temp
year,visit,Unnamed: 2_level_1,Unnamed: 3_level_1
2013,1,39.0,38.1
2013,2,58.0,37.3
2014,1,32.0,39.1
2014,2,19.0,37.7


> For complicated records containing multiple labeled measurements across multiple times for many subjects (people, countries, cities, etc.), hierarchical rows and columns can be extremely convenient.

## Indexing and Slicing a MultiIndex

> Indexing and slicing on a ``MultiIndex`` is designed to be intuitive, and it helps if you think about the indices as added dimensions.
We'll first look at indexing multiply indexed ``Series``, and then multiply indexed ``DataFrame``s.

### Multiply indexed Series

> Consider the multiply indexed ``Series`` of state populations we saw earlier. After displaying it, we access a single element by indexing with multiple terms:

In [20]:
pop

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [21]:
pop['California', 2000]

33871648

> The ``MultiIndex`` also supports *partial indexing*, or indexing just one of the levels in the index.
The result is another ``Series``, with the lower-level indices maintained:

In [22]:
pop['California']

year
2000    33871648
2010    37253956
dtype: int64

> Partial slicing is available as well, as long as the ``MultiIndex`` is sorted (see discussion in [Sorted and Unsorted Indices](#Sorted-and-unsorted-indices)):

In [23]:
pop.loc['California':'New York']

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
dtype: int64
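
> The sorting requirement can be seen in a short sketch (illustrative data, not from the original). In the Pandas versions I am aware of, slicing an unsorted ``MultiIndex`` raises ``UnsortedIndexError``, which is a subclass of ``KeyError``; the code below only assumes the ``KeyError`` part:

```python
import pandas as pd

# The outer level is deliberately out of order: 'a', 'c', 'b'
unsorted_index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
data = pd.Series(range(6), index=unsorted_index)

try:
    data.loc['a':'b']
    slice_failed = False
except KeyError:
    # Partial slicing needs a lexicographically sorted index
    slice_failed = True

# sort_index() returns a sorted copy on which the slice is well defined
sorted_data = data.sort_index()
partial = sorted_data.loc['a':'b']
```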

> With sorted indices, partial indexing can be performed on lower levels by passing an empty slice in the first index:

In [24]:
pop[:, 2000]

state
California    33871648
New York      18976457
Texas         20851820
dtype: int64

> Other types of indexing and selection (discussed in [Data Indexing and Selection](03.02-Data-Indexing-and-Selection.ipynb)) work as well; for example, selection based on Boolean masks:

In [25]:
pop[pop > 22000000]

state       year
California  2000    33871648
            2010    37253956
Texas       2010    25145561
dtype: int64

> Selection based on fancy indexing also works:

In [26]:
pop[['California', 'Texas']]

state       year
California  2000    33871648
            2010    37253956
Texas       2000    20851820
            2010    25145561
dtype: int64

### Multiply indexed DataFrames

> A multiply indexed ``DataFrame`` behaves in a similar manner.
Consider our toy medical ``DataFrame`` from before:

In [27]:
health_data

Unnamed: 0_level_0,subject,Bob,Bob,Guido,Guido,Sue,Sue
Unnamed: 0_level_1,type,HR,Temp,HR,Temp,HR,Temp
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
2013,1,28.0,36.0,39.0,38.1,35.0,35.5
2013,2,26.0,36.1,58.0,37.3,50.0,38.5
2014,1,17.0,38.7,32.0,39.1,24.0,35.8
2014,2,56.0,37.4,19.0,37.7,42.0,38.0


> Remember that columns are primary in a ``DataFrame``, and the syntax used for multiply indexed ``Series`` applies to the columns.
For example, we can recover Guido's heart rate data with a simple operation:

In [28]:
health_data['Guido', 'HR']

year  visit
2013  1        39.0
      2        58.0
2014  1        32.0
      2        19.0
Name: (Guido, HR), dtype: float64

> Also, as with the single-index case, we can use the ``loc`` and ``iloc`` indexers introduced in [Data Indexing and Selection](03.02-Data-Indexing-and-Selection.ipynb) (the older ``ix`` indexer has been removed from current Pandas releases). For example:

In [29]:
health_data.iloc[:2, :2]

Unnamed: 0_level_0,subject,Bob,Bob
Unnamed: 0_level_1,type,HR,Temp
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2
2013,1,28.0,36.0
2013,2,26.0,36.1


> These indexers provide an array-like view of the underlying two-dimensional data, but each individual index in ``loc`` or ``iloc`` can be passed a tuple of multiple indices. For example:

In [30]:
health_data.loc[:, ('Bob', 'HR')]

year  visit
2013  1        28.0
      2        26.0
2014  1        17.0
      2        56.0
Name: (Bob, HR), dtype: float64

> Working with slices inside these index tuples is less convenient: writing a slice directly inside a tuple, as in ``health_data.loc[(:, 1), (:, 'HR')]``, is a Python syntax error.
You could get around this by building the desired slice explicitly using Python's built-in ``slice()`` function, but a better way in this context is to use an ``IndexSlice`` object, which Pandas provides for precisely this situation.
For example:

In [31]:
idx = pd.IndexSlice
health_data.loc[idx[:, 1], idx[:, 'HR']]

Unnamed: 0_level_0,subject,Bob,Guido,Sue
Unnamed: 0_level_1,type,HR,HR,HR
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
2013,1,28.0,39.0,35.0
2014,1,17.0,32.0,24.0
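
> For comparison, here is a minimal sketch of the two approaches side by side on a deterministic stand-in table (`demo` is an illustrative name; its values are just ``np.arange``). ``slice(None)`` is the explicit spelling of a bare ``:``:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
demo = pd.DataFrame(np.arange(24).reshape(4, 6), index=index, columns=columns)

# IndexSlice lets ordinary slice syntax appear inside the tuple
idx = pd.IndexSlice
with_index_slice = demo.loc[idx[:, 1], idx[:, 'HR']]

# The same selection with the built-in slice() function
with_builtin_slice = demo.loc[(slice(None), 1), (slice(None), 'HR')]
```

> Both select the first visit of every year and the heart-rate column of every subject; ``IndexSlice`` is mainly a readability aid.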


### Stacking and unstacking indices

> As we saw briefly before, it is possible to convert a dataset from a stacked multi-index to a simple two-dimensional representation, optionally specifying the level to use:

In [33]:
pop.unstack(level=0)

state,California,New York,Texas
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2000,33871648,18976457,20851820
2010,37253956,19378102,25145561


In [34]:
pop.unstack(level=1)

year,2000,2010
state,Unnamed: 1_level_1,Unnamed: 2_level_1
California,33871648,37253956
New York,18976457,19378102
Texas,20851820,25145561


> The opposite of ``unstack()`` is ``stack()``, which here can be used to recover the original series:

In [35]:
pop.unstack().stack()

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

### Index setting and resetting

> Another way to rearrange hierarchical data is to turn the index labels into columns; this can be accomplished with the ``reset_index`` method.
Calling this on the population ``Series`` will result in a ``DataFrame`` with a *state* and *year* column holding the information that was formerly in the index.
For clarity, we can optionally specify the name of the data for the column representation:

In [36]:
pop_flat = pop.reset_index(name='population')
pop_flat

Unnamed: 0,state,year,population
0,California,2000,33871648
1,California,2010,37253956
2,New York,2000,18976457
3,New York,2010,19378102
4,Texas,2000,20851820
5,Texas,2010,25145561


> Often when working with data in the real world, the raw input data looks like this and it's useful to build a ``MultiIndex`` from the column values.
This can be done with the ``set_index`` method of the ``DataFrame``, which returns a multiply indexed ``DataFrame``:

In [37]:
pop_flat.set_index(['state', 'year'])

Unnamed: 0_level_0,Unnamed: 1_level_0,population
state,year,Unnamed: 2_level_1
California,2000,33871648
California,2010,37253956
New York,2000,18976457
New York,2010,19378102
Texas,2000,20851820
Texas,2010,25145561


## Data Aggregations on Multi-Indices

> We've previously seen that Pandas has built-in data aggregation methods, such as ``mean()``, ``sum()``, and ``max()``.
For hierarchically indexed data, these aggregates can be computed per index level, which controls which subset of the data each aggregate is computed on.
Older Pandas releases accepted a ``level`` parameter directly on the aggregation method; that parameter was deprecated and then removed in Pandas 2.0, and the equivalent in current releases is ``groupby(level=...)`` followed by the aggregate.

In [38]:
health_data

Unnamed: 0_level_0,subject,Bob,Bob,Guido,Guido,Sue,Sue
Unnamed: 0_level_1,type,HR,Temp,HR,Temp,HR,Temp
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
2013,1,28.0,36.0,39.0,38.1,35.0,35.5
2013,2,26.0,36.1,58.0,37.3,50.0,38.5
2014,1,17.0,38.7,32.0,39.1,24.0,35.8
2014,2,56.0,37.4,19.0,37.7,42.0,38.0


> Perhaps we'd like to average out the measurements in the two visits each year. We can do this by grouping on the index level we'd like to keep, in this case the year:

In [39]:
# mean(level=...) was deprecated and later removed from Pandas;
# groupby(level=...) is the equivalent that works across versions
data_mean = health_data.groupby(level='year').mean()
data_mean


subject,Bob,Bob,Guido,Guido,Sue,Sue
type,HR,Temp,HR,Temp,HR,Temp
year,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
2013,27.0,36.05,48.5,37.7,42.5,37.0
2014,36.5,38.05,25.5,38.4,33.0,36.9


> The same grouping can be applied to the levels on the columns. Because grouping along ``axis=1`` is itself deprecated in recent Pandas releases, one portable approach is to transpose, group on the column level, and transpose back:

In [40]:
data_mean.T.groupby(level='type').mean().T


type,HR,Temp
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2013,39.333333,36.916667
2014,31.666667,37.783333
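
> Since ``health_data`` is random, the following hedged sketch repeats the two aggregations on a small hand-written table (`demo` and its values are made up for illustration) so that the arithmetic can be checked by hand:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
values = np.array([[60.0, 36.0, 70.0, 37.0],
                   [62.0, 36.4, 74.0, 37.2],
                   [58.0, 36.8, 66.0, 36.6],
                   [64.0, 37.0, 68.0, 36.8]])
demo = pd.DataFrame(values, index=index, columns=columns)

# Average the two visits within each year (row level 'year')
yearly = demo.groupby(level='year').mean()

# Average across subjects for each measurement type (column level 'type')
by_type = yearly.T.groupby(level='type').mean().T
```

> For 2013, Bob's mean heart rate is 61 and Sue's is 72, so the per-type mean is 66.5.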


## Aside: Panel Data

> Older versions of Pandas had a few other fundamental data structures that we have not yet discussed, namely the ``pd.Panel`` and ``pd.Panel4D`` objects.
These can be thought of, respectively, as three-dimensional and four-dimensional generalizations of the (one-dimensional) ``Series`` and (two-dimensional) ``DataFrame`` structures.
Both have been removed from current Pandas releases, so code written against a recent version should use multi-indexing instead.

> We won't cover these panel structures further in this text, as I've found in the majority of cases that multi-indexing is a more useful and conceptually simpler representation for higher-dimensional data.
Additionally, panel data is fundamentally a dense data representation, while multi-indexing is fundamentally a sparse data representation.
As the number of dimensions increases, the dense representation can become very inefficient for the majority of real-world datasets.
If you'd like to read more about the ``Panel`` and ``Panel4D`` structures, see the references listed in [Further Resources](03.13-Further-Resources.ipynb).

## Importance of Hierarchical Indexing

In [1]:
# pd.Series.index?
# pd.Series.unstack?
# pd.MultiIndex.names?
# pd.MultiIndex?

In [2]:
import pandas as pd
import numpy as np
data_hi = pd.Series(np.random.randn(9),
          index=[['A', 'A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'],
                 [1, 2, 3, 1, 4, 1, 2, 2, 4]])
data_hi

A  1   -0.886328
   2    0.808507
   3   -0.708538
B  1    1.059874
   4    0.000406
C  1   -1.147799
   2    1.304196
D  2    0.114513
   4    0.474171
dtype: float64

In [3]:
data_hi.index

MultiIndex([('A', 1),
            ('A', 2),
            ('A', 3),
            ('B', 1),
            ('B', 4),
            ('C', 1),
            ('C', 2),
            ('D', 2),
            ('D', 4)],
           )

In [4]:
data_hi['A']
# data_hi['A':'C']
# data_hi[['A', 'C']]
# data_hi.loc[:, 1]

1   -0.886328
2    0.808507
3   -0.708538
dtype: float64

In [5]:
data_hi.unstack()
# data_hi.unstack(fill_value=0)
# data_hi.unstack().stack()

Unnamed: 0,1,2,3,4
A,-0.886328,0.808507,-0.708538,
B,1.059874,,,0.000406
C,-1.147799,1.304196,,
D,,0.114513,,0.474171
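
> The empty cells above appear because not every (outer, inner) pair exists in the index. A minimal sketch on a smaller illustrative series (`ragged` is a hypothetical name) shows the default ``NaN`` fill and the ``fill_value`` option that is commented out in the cell above:

```python
import pandas as pd

# Only four of the six possible (letter, number) pairs are present
ragged = pd.Series([1.0, 2.0, 3.0, 4.0],
                   index=[['A', 'A', 'B', 'C'], [1, 2, 1, 2]])

wide = ragged.unstack()                     # missing pairs become NaN
wide_filled = ragged.unstack(fill_value=0)  # missing pairs become 0 instead
```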


In [6]:
df_hi = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['one', 'one', 'three'],['Green', 'Red', 'Green']])
df_hi.index.names = ['val1', 'val2']
df_hi.columns.names = ['number', 'color']
df_hi['one']
df_hi

Unnamed: 0_level_0,number,one,one,three
Unnamed: 0_level_1,color,Green,Red,Green
val1,val2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


### Reordering and Sorting Index Levels

In [7]:
# pd.DataFrame.swaplevel?
# pd.DataFrame.sort_index?

In [8]:
import pandas as pd
import numpy as np

df_hi

Unnamed: 0_level_0,number,one,one,three
Unnamed: 0_level_1,color,Green,Red,Green
val1,val2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [9]:
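# Only the last expression in a cell is displayed; each call returns
# a new object and leaves df_hi unchanged
df_hi.swaplevel('val1', 'val2', axis=0)      # swap the two row levels
df_hi.swaplevel('number', 'color', axis=1)   # swap the two column levels
df_hi.swaplevel(0, 1).sort_index(level=0)    # swap row levels, then sort by the new outer level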

Unnamed: 0_level_0,number,one,one,three
Unnamed: 0_level_1,color,Green,Red,Green
val2,val1,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
1,a,0,1,2
1,b,6,7,8
2,a,3,4,5
2,b,9,10,11


In [10]:
df_hi.sort_index(level=0)   # sort rows by val1 (result not displayed)
df_hi.sort_index(level=1)   # sort rows by val2, then by val1

Unnamed: 0_level_0,number,one,one,three
Unnamed: 0_level_1,color,Green,Red,Green
val1,val2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
b,1,6,7,8
a,2,3,4,5
b,2,9,10,11
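
> As a hedged sketch of how these two operations interact (`frame` is an illustrative stand-in for ``df_hi`` with plain columns): ``swaplevel`` only relabels the index and keeps the row order, so the result is generally no longer sorted until ``sort_index`` is applied.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(8).reshape(4, 2),
    index=pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                                    names=['val1', 'val2']),
    columns=['x', 'y'])

# Swapping exchanges the levels but leaves the rows where they were
swapped = frame.swaplevel('val1', 'val2')

# Sorting on the new outer level (val2) restores lexicographic order
resorted = swapped.sort_index(level=0)
```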


In [15]:
import pandas as pd
import numpy as np
df_c = pd.DataFrame({'a': range(7), 'b': range(14, 7, -1),
                     'c': ['one', 'one', 'one', 'two', 'two','two', 'two'],
                     'd': [0, 1, 2, 0, 1, 2, 3]})
df_c

Unnamed: 0,a,b,c,d
0,0,14,one,0
1,1,13,one,1
2,2,12,one,2
3,3,11,two,0
4,4,10,two,1
5,5,9,two,2
6,6,8,two,3


In [16]:
df_si = df_c.set_index(['c', 'd'])
df_si

Unnamed: 0_level_0,Unnamed: 1_level_0,a,b
c,d,Unnamed: 2_level_1,Unnamed: 3_level_1
one,0,0,14
one,1,1,13
one,2,2,12
two,0,3,11
two,1,4,10
two,2,5,9
two,3,6,8


In [17]:
df_c.set_index(['c', 'd'], drop=False)

Unnamed: 0_level_0,Unnamed: 1_level_0,a,b,c,d
c,d,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
one,0,0,14,one,0
one,1,1,13,one,1
one,2,2,12,one,2
two,0,3,11,two,0
two,1,4,10,two,1
two,2,5,9,two,2
two,3,6,8,two,3


In [18]:
df_si.reset_index()

Unnamed: 0,c,d,a,b
0,one,0,0,14
1,one,1,1,13
2,one,2,2,12
3,two,0,3,11
4,two,1,4,10
5,two,2,5,9
6,two,3,6,8


