# Composite indexes

In addition to ordinary indexes, pandas also supports composite (multi-level) indexes.

In [1]:
import pandas as pd
import numpy as np

## Hierarchical indexing

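Hierarchical (multi-level) indexing is a powerful tool that makes sophisticated data analysis and manipulation easier, especially for higher-dimensional data. In essence, it lets you store and manipulate data with an arbitrary number of dimensions in lower-dimensional data structures such as `Series` and `DataFrame`.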

### Creating a hierarchical index

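A hierarchical index can be created in several ways:

+ `pd.MultiIndex.from_tuples` builds one from a list of tuples.

+ `pd.MultiIndex.from_product` builds one from every pairing of the elements of two or more iterables.

+ As a convenience, a list of arrays can be passed directly to `Series` or `DataFrame`, and the `MultiIndex` is constructed automatically.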

In [2]:
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]

In [3]:
tuples = list(zip(*arrays))

In [4]:
tuples

[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]

In [5]:
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

In [6]:
index

MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
           codes=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=['first', 'second'])

In [7]:
s = pd.Series(np.random.randn(8), index=index)

In [8]:
s

first  second
bar    one      -0.691989
       two      -1.073631
baz    one      -1.269384
       two       0.700795
foo    one       0.337454
       two      -1.847011
qux    one      -1.708133
       two       0.131374
dtype: float64

In [9]:
iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]

pd.MultiIndex.from_product(iterables, names=['first', 'second'])

MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
           codes=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=['first', 'second'])

In [10]:
arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
          np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]
s = pd.Series(np.random.randn(8), index=arrays)
s

bar  one    0.194750
     two    1.519968
baz  one    0.998856
     two   -0.513483
foo  one   -0.108684
     two   -0.113097
qux  one    0.355886
     two    1.293657
dtype: float64

In [11]:
df = pd.DataFrame(np.random.randn(8, 4), index=arrays)
df

                0         1         2         3
bar one -0.021652 -1.128505  0.696531 -0.132104
    two  0.157299 -1.695966 -0.355066  0.346010
baz one  1.377002  0.974853  1.082361  0.861955
    two -1.820746  0.746764  0.532095  0.256986
foo one  1.449398 -1.361580 -0.950292  0.102716
    two -1.724920 -0.414413 -1.112252 -0.917028
qux one  2.683856 -0.240259 -2.093356  1.078969
    two  0.419431 -0.136242 -0.587698  0.483396
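
The construction routes described in this section should all produce the same index. The sketch below checks that on a shortened label set; it also uses `pd.MultiIndex.from_arrays`, which is the explicit form of passing a list of arrays. The variable names are illustrative only.

```python
import pandas as pd

# Shortened label set, used only for this illustration
arrays = [['bar', 'bar', 'baz', 'baz'],
          ['one', 'two', 'one', 'two']]
names = ['first', 'second']

# Explicit form of "pass a list of arrays"
from_arrays = pd.MultiIndex.from_arrays(arrays, names=names)
# Zip the arrays into (first, second) tuples
from_tuples = pd.MultiIndex.from_tuples(list(zip(*arrays)), names=names)
# Cartesian product of the unique labels of each level
from_product = pd.MultiIndex.from_product([['bar', 'baz'], ['one', 'two']],
                                          names=names)

same_index = from_arrays.equals(from_tuples) and from_tuples.equals(from_product)
print(same_index)
```

`from_product` is only interchangeable with the other two when every combination of labels is present, as it is here.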


### Applying a MultiIndex to columns

In [12]:
df = pd.DataFrame(np.random.randn(3, 8), index=['A', 'B', 'C'], columns=index)
df

first        bar                 baz                 foo                 qux
second       one       two       one       two       one       two       one       two
A       0.075922 -1.508019 -1.023941  0.142567 -1.790206  0.360418  1.143857  0.411639
B      -2.117579 -1.026132 -1.541469 -0.575999 -0.976532  2.045527  0.157497  0.869661
C      -1.702369 -0.171502 -0.748726 -0.622308  1.184927  2.208714  0.473766 -0.679159


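A `MultiIndex` matters because it enables the grouping, selection, and reshaping operations described in the sections below. You may find yourself working with hierarchically indexed data without ever creating a `MultiIndex` yourself; when loading data from files, however, you may want to build your own `MultiIndex` while preparing the data set. How repeated index labels are displayed is controlled by the `display.multi_sparse` option, which is set through `pd.set_option`: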

In [13]:
pd.set_option('display.multi_sparse', False)

In [14]:
df

first        bar       bar       baz       baz       foo       foo       qux       qux
second       one       two       one       two       one       two       one       two
A       0.075922 -1.508019 -1.023941  0.142567 -1.790206  0.360418  1.143857  0.411639
B      -2.117579 -1.026132 -1.541469 -0.575999 -0.976532  2.045527  0.157497  0.869661
C      -1.702369 -0.171502 -0.748726 -0.622308  1.184927  2.208714  0.473766 -0.679159


In [15]:
pd.set_option('display.multi_sparse', True)
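
If the display should change only temporarily, `pd.option_context` can limit the option to a `with` block instead of setting and resetting it by hand. A minimal sketch, assuming a small made-up frame named `demo` that is not part of the example above:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a two-level column index
columns = pd.MultiIndex.from_product([['bar', 'baz'], ['one', 'two']],
                                     names=['first', 'second'])
demo = pd.DataFrame(np.arange(8).reshape(2, 4), index=['A', 'B'], columns=columns)

# Default (sparse) display: a repeated outer label is printed once per group
sparse_text = repr(demo)

# Inside the block every label is printed; the previous setting is restored on exit
with pd.option_context('display.multi_sparse', False):
    dense_text = repr(demo)

print(sparse_text)
print(dense_text)
```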

## Reconstructing level labels

The `get_level_values` method returns a vector of the labels at each position for a given level:

In [16]:
index.get_level_values(0)

Index(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], dtype='object', name='first')

In [17]:
index.get_level_values('second')

Index(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'], dtype='object', name='second')
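
Because `get_level_values` returns one label per row, comparing it with a value gives a boolean mask that can filter on a single level. A small sketch with made-up data (`s_demo` and its values are assumptions for illustration):

```python
import pandas as pd

index = pd.MultiIndex.from_product([['bar', 'baz'], ['one', 'two']],
                                   names=['first', 'second'])
s_demo = pd.Series([10, 20, 30, 40], index=index)

# Boolean mask: True wherever the 'second' level equals 'one'
mask = s_demo.index.get_level_values('second') == 'one'
only_one = s_demo[mask]
print(only_one)
```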

## Basic indexing on an axis with a MultiIndex

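An important feature of hierarchical indexing is that you can select data by a "partial" label that identifies a subgroup of the data. Partial selection "drops" levels of the hierarchical index from the result, in a way completely analogous to selecting a column in a regular DataFrame: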

In [18]:
df['bar']

second       one       two
A       0.075922 -1.508019
B      -2.117579 -1.026132
C      -1.702369 -0.171502


In [19]:
df['bar', 'one']

A    0.075922
B   -2.117579
C   -1.702369
Name: (bar, one), dtype: float64

In [20]:
df['bar']['one']

A    0.075922
B   -2.117579
C   -1.702369
Name: one, dtype: float64

In [21]:
s['qux']

one    0.355886
two    1.293657
dtype: float64

In [22]:
df.columns

MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
           codes=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=['first', 'second'])

In [23]:
df[['foo','qux']].columns

MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
           codes=[[2, 2, 3, 3], [0, 1, 0, 1]],
           names=['first', 'second'])

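Note that the levels of the sliced columns still list all four labels, even though only `foo` and `qux` are used. Levels are not recomputed on slicing, which keeps slicing fast. To see the labels that are actually in use: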

In [24]:
df[['foo','qux']].columns.values

array([('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')],
      dtype=object)

In [25]:
df[['foo','qux']].columns.get_level_values(0)

Index(['foo', 'foo', 'qux', 'qux'], dtype='object', name='first')

In [26]:
pd.MultiIndex.from_tuples(df[['foo','qux']].columns.values)

MultiIndex(levels=[['foo', 'qux'], ['one', 'two']],
           codes=[[0, 0, 1, 1], [0, 1, 0, 1]])
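
Rebuilding the index from its tuples is one way to drop the unused labels; `MultiIndex.remove_unused_levels()` is another way to get the same effect. A sketch on a stand-in frame (`demo` and its zero-filled data are assumptions for illustration):

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product(
    [['bar', 'baz', 'foo', 'qux'], ['one', 'two']], names=['first', 'second'])
demo = pd.DataFrame(np.zeros((2, 8)), columns=columns)

# The sliced columns keep every original label in their levels
sliced = demo[['foo', 'qux']].columns
print(list(sliced.levels[0]))

# Recompute the levels so that only labels still in use remain
trimmed = sliced.remove_unused_levels()
print(list(trimmed.levels[0]))
```

The trimmed index holds the same entries as the sliced one; only its internal level storage changes.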

## Data alignment and using reindex

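Operations between differently indexed objects that have a `MultiIndex` on an axis work as shown below; data alignment behaves the same way as for an index of tuples: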

In [27]:
s + s[:-2]

bar  one    0.389501
     two    3.039936
baz  one    1.997712
     two   -1.026967
foo  one   -0.217368
     two   -0.226194
qux  one         NaN
     two         NaN
dtype: float64

In [28]:
s + s[::2]

bar  one    0.389501
     two         NaN
baz  one    1.997712
     two         NaN
foo  one   -0.217368
     two         NaN
qux  one    0.711773
     two         NaN
dtype: float64

`reindex` can be called with another `MultiIndex`, or even with a list or array of tuples:

In [29]:
s.reindex(index[:3])

first  second
bar    one       0.194750
       two       1.519968
baz    one       0.998856
dtype: float64

In [30]:
s.reindex([('foo', 'two'), ('bar', 'one'), ('qux', 'one'), ('baz', 'one')])

foo  two   -0.113097
bar  one    0.194750
qux  one    0.355886
baz  one    0.998856
dtype: float64
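
The outputs above depend on random data, so a fixed-value version may be easier to follow. The sketch below uses a made-up series `s_demo` to show that labels missing from one operand give `NaN`, and that `reindex` accepts a list of tuples:

```python
import pandas as pd

idx = pd.MultiIndex.from_product([['bar', 'baz'], ['one', 'two']])
s_demo = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

# The right-hand operand lacks ('baz', 'two'), so that position becomes NaN
total = s_demo + s_demo[:-1]
print(total)

# Select and reorder rows by passing a list of tuples
reordered = s_demo.reindex([('baz', 'two'), ('bar', 'one')])
print(reordered)
```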

## Advanced indexing with a hierarchical index

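Syntactically integrating `MultiIndex` into advanced indexing with `.loc` is somewhat challenging, but every effort has been made to do so. For example: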

In [31]:
df = df.T

In [32]:
df

                     A         B         C
first second
bar   one     0.075922 -2.117579 -1.702369
      two    -1.508019 -1.026132 -0.171502
baz   one    -1.023941 -1.541469 -0.748726
      two     0.142567 -0.575999 -0.622308
foo   one    -1.790206 -0.976532  1.184927
      two     0.360418  2.045527  2.208714
qux   one     1.143857  0.157497  0.473766
      two     0.411639  0.869661 -0.679159


In [33]:
df.loc['bar']

               A         B         C
second
one     0.075922 -2.117579 -1.702369
two    -1.508019 -1.026132 -0.171502


In [34]:
df.loc['bar', 'two']

A   -1.508019
B   -1.026132
C   -0.171502
Name: (bar, two), dtype: float64

In [35]:
df.loc['baz':'foo']

                     A         B         C
first second
baz   one    -1.023941 -1.541469 -0.748726
      two     0.142567 -0.575999 -0.622308
foo   one    -1.790206 -0.976532  1.184927
      two     0.360418  2.045527  2.208714


You can select a "range" of values by providing a slice of tuples.

In [36]:
df.loc[('baz', 'two'):('qux', 'one')]

                     A         B         C
first second
baz   two     0.142567 -0.575999 -0.622308
foo   one    -1.790206 -0.976532  1.184927
      two     0.360418  2.045527  2.208714
qux   one     1.143857  0.157497  0.473766


In [37]:
df.loc[('baz', 'two'):'foo']

                     A         B         C
first second
baz   two     0.142567 -0.575999 -0.622308
foo   one    -1.790206 -0.976532  1.184927
      two     0.360418  2.045527  2.208714


Passing a list of labels or tuples works similarly to reindexing:

In [38]:
df.loc[[('bar', 'two'), ('qux', 'one')]]

                     A         B         C
first second
bar   two    -1.508019 -1.026132 -0.171502
qux   one     1.143857  0.157497  0.473766
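
The three `.loc` forms used in this section (partial label, slice of tuples, list of tuples) can be compared side by side on fixed data. In the sketch below, `demo` and its single column `A` are made up for illustration, and the index is already sorted, which label slicing relies on:

```python
import pandas as pd

idx = pd.MultiIndex.from_product([['bar', 'baz', 'foo'], ['one', 'two']],
                                 names=['first', 'second'])
demo = pd.DataFrame({'A': range(6)}, index=idx)

# Partial label: selects the 'bar' group and drops the 'first' level
bar_rows = demo.loc['bar']

# Slice of tuples: both endpoints are included
middle = demo.loc[('bar', 'two'):('foo', 'one')]

# List of tuples: exactly these rows, in the given order
picked = demo.loc[[('bar', 'two'), ('foo', 'one')]]

print(bar_rows, middle, picked, sep='\n\n')
```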


## Swapping index levels with `swaplevel()`

In [39]:
df[:5]

                     A         B         C
first second
bar   one     0.075922 -2.117579 -1.702369
      two    -1.508019 -1.026132 -0.171502
baz   one    -1.023941 -1.541469 -0.748726
      two     0.142567 -0.575999 -0.622308
foo   one    -1.790206 -0.976532  1.184927


In [40]:
df[:5].swaplevel(0, 1, axis=0)

                     A         B         C
second first
one    bar    0.075922 -2.117579 -1.702369
two    bar   -1.508019 -1.026132 -0.171502
one    baz   -1.023941 -1.541469 -0.748726
two    baz    0.142567 -0.575999 -0.622308
one    foo   -1.790206 -0.976532  1.184927


## Reordering levels with `reorder_levels()`

In [41]:
df[:5].reorder_levels([1,0], axis=0)

                     A         B         C
second first
one    bar    0.075922 -2.117579 -1.702369
two    bar   -1.508019 -1.026132 -0.171502
one    baz   -1.023941 -1.541469 -0.748726
two    baz    0.142567 -0.575999 -0.622308
one    foo   -1.790206 -0.976532  1.184927
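
With only two levels, `swaplevel(0, 1)` and `reorder_levels([1, 0])` give the same result, as the two outputs above suggest. Neither call reorders the rows, so a `sort_index()` afterwards is one way to group them by the new outer level. A sketch on made-up data (`demo` is an assumption for illustration):

```python
import pandas as pd

idx = pd.MultiIndex.from_product([['bar', 'baz'], ['one', 'two']],
                                 names=['first', 'second'])
demo = pd.DataFrame({'A': [1, 2, 3, 4]}, index=idx)

swapped = demo.swaplevel(0, 1, axis=0)
reordered = demo.reorder_levels([1, 0], axis=0)
print(swapped.equals(reordered))

# The rows keep their original order; sort to group by the new outer level
sorted_swapped = swapped.sort_index()
print(sorted_swapped)
```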


# CategoricalIndex

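`CategoricalIndex` is an index type that supports indexing with duplicates. It is a container around categorical data, and it allows efficient indexing and storage of an index with a large number of duplicated elements.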

In [42]:
df = pd.DataFrame({'A': np.arange(6),
                   'B': list('aabbca')})


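# astype('category', categories=...) was deprecated and later removed;
# pass a CategoricalDtype that carries the categories instead
df['B'] = df['B'].astype(pd.CategoricalDtype(categories=list('cab')))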




In [43]:
df

   A  B
0  0  a
1  1  a
2  2  b
3  3  b
4  4  c
5  5  a


In [44]:
df.dtypes

A       int64
B    category
dtype: object

In [45]:
df.B.cat.categories

Index(['c', 'a', 'b'], dtype='object')

Setting this column as the index creates a `CategoricalIndex`:

In [46]:
df2 = df.set_index('B')

In [47]:
df2.index

CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')

In [48]:
df2.loc['a']

   A
B
a  0
a  1
a  5


These selections preserve the `CategoricalIndex`:

In [49]:
df2.loc['a'].index

CategoricalIndex(['a', 'a', 'a'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')

In [50]:
df2.sort_index()

   A
B
c  4
a  0
a  1
a  5
b  2
b  3
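
The sorted output follows the category order (`c`, `a`, `b`) rather than alphabetical order. The sketch below rebuilds the same data under a different name (`df_demo`, chosen for illustration) and uses `pd.CategoricalDtype` to set the categories:

```python
import pandas as pd

df_demo = pd.DataFrame({'A': range(6), 'B': list('aabbca')})
# The category order c, a, b also defines the sort order
df_demo['B'] = df_demo['B'].astype(pd.CategoricalDtype(categories=list('cab')))

by_category = df_demo.set_index('B').sort_index()
print(by_category)
```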


Groupby operations on the index also preserve the nature of the index:

In [51]:
df2.groupby(level=0).sum()

   A
B
c  4
a  6
b  5


In [52]:
df2.groupby(level=0).sum().index

CategoricalIndex(['c', 'a', 'b'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')

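Reindexing returns a resulting index based on the type of the indexer passed in: passing a list returns a plain `Index`, while passing a `Categorical` returns a `CategoricalIndex` that uses the categories of the passed `Categorical`. This lets you reindex arbitrarily, even with values that are not among the categories, just as you can reindex any pandas index.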

In [53]:
df2.reindex(['a','e'])

     A
B
a  0.0
a  1.0
a  5.0
e  NaN


In [54]:
df2.reindex(['a','e']).index

Index(['a', 'a', 'a', 'e'], dtype='object', name='B')

In [55]:
df2.reindex(pd.Categorical(['a','e'],categories=list('abcde')))

     A
B
a  0.0
a  1.0
a  5.0
e  NaN
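
The type rule can be checked directly. Recent pandas releases may refuse to reindex an axis that contains duplicate labels, so this sketch assumes a small frame with a unique `CategoricalIndex` (`demo`, made up for illustration) rather than reusing `df2`:

```python
import pandas as pd

cat_index = pd.CategoricalIndex(['a', 'b', 'c'], categories=list('cab'), name='B')
demo = pd.DataFrame({'A': [0, 1, 2]}, index=cat_index)

# A plain list as the target gives a plain Index
with_list = demo.reindex(['a', 'e'])

# A Categorical as the target gives a CategoricalIndex with the target's categories
with_cat = demo.reindex(pd.Categorical(['a', 'e'], categories=list('abcde')))

print(type(with_list.index).__name__)
print(type(with_cat.index).__name__)
```

In both results the row for `e` is filled with `NaN`, because `e` does not occur in the original index.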
