# Composite indexes

Besides regular indexes, pandas also lets you define composite (multi-level) indexes.

In [1]:
import pandas as pd
import numpy as np

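## Hierarchical indexing

Hierarchical (multi-level) indexing is powerful because it opens the door to quite sophisticated data analysis and manipulation, especially for higher-dimensional data.
In essence, it lets you store and manipulate data with an arbitrary number of dimensions in lower-dimensional data structures such as Series (1d) and DataFrame (2d).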
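
To make the idea of storing higher-dimensional data in a lower-dimensional structure concrete, here is a minimal sketch. The data (`sales`, keyed by a hypothetical `year` and `quarter`) is invented for illustration and is not part of the original notebook.

```python
import pandas as pd

# Hypothetical data: one value per (year, quarter) pair, held in a 1-D Series.
index = pd.MultiIndex.from_product(
    [[2020, 2021], ['Q1', 'Q2']], names=['year', 'quarter']
)
sales = pd.Series([10, 20, 30, 40], index=index)

# Moving the inner level to the columns gives the 2-D view of the same data.
table = sales.unstack('quarter')
print(table)
```

The Series and the unstacked DataFrame hold the same four values; the second index level simply plays the role of a second dimension.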

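### Creating a hierarchical index

A hierarchical index can be created in several ways:

+ `pd.MultiIndex.from_tuples` builds one from a list of tuples.

+ `pd.MultiIndex.from_product` builds one from every pairing of the elements of two (or more) iterables.

+ As a convenience, you can pass a list of arrays directly to Series or DataFrame to construct a MultiIndex automatically: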

In [2]:
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]

In [3]:
tuples = list(zip(*arrays))

In [4]:
tuples

[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]

In [5]:
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

In [6]:
index

MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=['first', 'second'])

In [7]:
s = pd.Series(np.random.randn(8), index=index)

In [8]:
s

first  second
bar    one       1.924541
       two      -2.127108
baz    one       0.417056
       two      -0.378908
foo    one       0.780429
       two       1.540210
qux    one       0.689353
       two       1.101065
dtype: float64

In [9]:
iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]

pd.MultiIndex.from_product(iterables, names=['first', 'second'])

MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=['first', 'second'])

In [10]:
arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
          np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]
s = pd.Series(np.random.randn(8), index=arrays)
s

bar  one   -1.221834
     two   -0.816061
baz  one    0.134687
     two    0.143186
foo  one    0.990110
     two    0.208515
qux  one    1.150871
     two    0.546428
dtype: float64

In [11]:
df = pd.DataFrame(np.random.randn(8, 4), index=arrays)
df

                0         1         2         3
bar one -1.690183  0.271540 -0.519416 -0.109875
    two  1.144788  0.429944 -0.466847 -0.475691
baz one -1.384935  0.359243  0.443673  0.283476
    two -0.922243 -0.731361  0.955463 -0.315289
foo one -0.725888  0.455123 -0.978464 -0.340143
    two -0.147488 -0.077069  0.386868 -0.910659
qux one -0.360796  1.773004 -0.088235  0.107931
    two -0.107048  0.659507 -0.983554  1.220696


### Applying a composite index to columns

In [12]:
df = pd.DataFrame(np.random.randn(3, 8), index=['A', 'B', 'C'], columns=index)
df

first        bar                 baz                 foo                 qux
second       one       two       one       two       one       two       one       two
A      -0.623886 -1.210745 -1.247687 -1.775733  0.071501 -0.559662 -1.227424  0.048207
B       0.539847  1.144010 -0.602236  0.712611  1.933703  0.761082  0.873890 -0.211337
C      -0.360050  0.518990 -0.336434 -1.005913  1.455619 -0.156915 -0.131643 -1.106066


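The importance of MultiIndex is that it lets you perform grouping, selection, and reshaping operations, as described below and in later sections of the documentation. As you will see in later sections, you can find yourself working with hierarchically indexed data without ever creating a MultiIndex explicitly. When loading data from files, however, you may want to build your own MultiIndex while preparing the dataset. Note that how the index is displayed can be controlled with the `display.multi_sparse` option of `pd.set_option`: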

In [13]:
pd.set_option('display.multi_sparse', False)

In [14]:
df

first        bar       bar       baz       baz       foo       foo       qux       qux
second       one       two       one       two       one       two       one       two
A      -0.623886 -1.210745 -1.247687 -1.775733  0.071501 -0.559662 -1.227424  0.048207
B       0.539847  1.144010 -0.602236  0.712611  1.933703  0.761082  0.873890 -0.211337
C      -0.360050  0.518990 -0.336434 -1.005913  1.455619 -0.156915 -0.131643 -1.106066


In [15]:
pd.set_option('display.multi_sparse', True)
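
The earlier remark about building your own MultiIndex while preparing a loaded dataset can be sketched roughly as follows. The flat frame `flat` stands in for data read from a file; its column names and values are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical flat table, as it might look right after loading a file.
flat = pd.DataFrame({
    'first': ['bar', 'bar', 'baz', 'baz'],
    'second': ['one', 'two', 'one', 'two'],
    'value': [1.0, 2.0, 3.0, 4.0],
})

# Promote the two key columns to a two-level index.
indexed = flat.set_index(['first', 'second'])
print(indexed.index.names)
print(indexed.loc[('baz', 'one'), 'value'])
```

`set_index` with a list of column names is one common route; `pd.MultiIndex.from_tuples` or `from_product`, shown earlier, work when the keys are not already columns.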

## Reconstructing level labels

The `get_level_values` method returns a vector of the labels at each position for a particular level:

In [16]:
index.get_level_values(0)

Index(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], dtype='object', name='first')

In [17]:
index.get_level_values('second')

Index(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'], dtype='object', name='second')
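
One possible use of `get_level_values` is building a boolean mask from a single level. The sketch below uses a small Series with made-up integer values rather than the random data above.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['bar', 'baz'], ['one', 'two']], names=['first', 'second']
)
s_small = pd.Series([1, 2, 3, 4], index=index)

# Compare the labels of one level against a value to get a boolean mask.
mask = s_small.index.get_level_values('second') == 'two'
selected = s_small[mask]
print(selected)
```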

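## Basic indexing on an axis with a MultiIndex

One of the important features of hierarchical indexing is that you can select data by a "partial" label that identifies a subgroup in the data. Partial selection "drops" levels of the hierarchical index in the result, in a way completely analogous to selecting a column in a regular DataFrame: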

In [18]:
df['bar']

second       one       two
A      -0.623886 -1.210745
B       0.539847  1.144010
C      -0.360050  0.518990


In [19]:
df['bar', 'one']

A   -0.623886
B    0.539847
C   -0.360050
Name: (bar, one), dtype: float64

In [20]:
df['bar']['one']

A   -0.623886
B    0.539847
C   -0.360050
Name: one, dtype: float64

In [21]:
s['qux']

one    1.150871
two    0.546428
dtype: float64

In [22]:
df.columns

MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=['first', 'second'])

In [23]:
df[['foo','qux']].columns

MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
           labels=[[2, 2, 3, 3], [0, 1, 0, 1]],
           names=['first', 'second'])

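The MultiIndex keeps all of its defined levels even when some of them are no longer used, which is why the sliced result still lists every level. This avoids recomputing the levels so that slicing stays highly performant. To see only the values actually in use: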

In [24]:
df[['foo','qux']].columns.values

array([('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')], dtype=object)

In [25]:
df[['foo','qux']].columns.get_level_values(0)

Index(['foo', 'foo', 'qux', 'qux'], dtype='object', name='first')

In [26]:
pd.MultiIndex.from_tuples(df[['foo','qux']].columns.values)

MultiIndex(levels=[['foo', 'qux'], ['one', 'two']],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
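
Rebuilding the index from its tuples is one way to drop unused levels. A shorter route, assuming pandas 0.20 or later, is `MultiIndex.remove_unused_levels()`. The sketch below recreates a frame with the same column layout but uses zeros as placeholder data.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product(
    [['bar', 'baz', 'foo', 'qux'], ['one', 'two']], names=['first', 'second']
)
frame = pd.DataFrame(np.zeros((3, 8)), index=['A', 'B', 'C'], columns=columns)

sub_columns = frame[['foo', 'qux']].columns

# Drop the level values that no longer appear in the sliced columns.
trimmed = sub_columns.remove_unused_levels()
print(list(trimmed.levels[0]))
```

Unlike the `from_tuples` round trip, this keeps the level names (`first`, `second`).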

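## Data alignment and using reindex

Operations between differently indexed objects that have a MultiIndex on the axes work as you would expect; data alignment works the same way as for an index of tuples: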

In [27]:
s + s[:-2]

bar  one   -2.443668
     two   -1.632123
baz  one    0.269374
     two    0.286373
foo  one    1.980220
     two    0.417029
qux  one         NaN
     two         NaN
dtype: float64

In [28]:
s + s[::2]

bar  one   -2.443668
     two         NaN
baz  one    0.269374
     two         NaN
foo  one    1.980220
     two         NaN
qux  one    2.301741
     two         NaN
dtype: float64

`reindex` can be called with another MultiIndex, or even with a list or array of tuples:

In [29]:
s.reindex(index[:3])

first  second
bar    one      -1.221834
       two      -0.816061
baz    one       0.134687
dtype: float64

In [30]:
s.reindex([('foo', 'two'), ('bar', 'one'), ('qux', 'one'), ('baz', 'one')])

foo  two    0.208515
bar  one   -1.221834
qux  one    1.150871
baz  one    0.134687
dtype: float64
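
As the alignment examples show, labels missing from one operand produce NaN. If that is not wanted, one option is the `add` method with `fill_value`. This is a small sketch with made-up values, not a step from the original notebook.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['bar', 'baz'], ['one', 'two']])
full = pd.Series([1.0, 2.0, 3.0, 4.0], index=index)
partial = full.iloc[:-1]  # drops the ('baz', 'two') entry

# Plain addition aligns on the tuples; the missing label becomes NaN.
plain_sum = full + partial

# fill_value treats the label missing from one side as 0 instead.
filled_sum = full.add(partial, fill_value=0)
print(filled_sum)
```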

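## Advanced indexing with a hierarchical index

Syntactically integrating MultiIndex into advanced indexing with `.loc` is somewhat challenging, but every effort has been made to do so (the older `.ix` indexer was removed in pandas 1.0). For example, the following works as you would expect: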

In [31]:
df = df.T

In [32]:
df

                     A         B         C
first second
bar   one    -0.623886  0.539847 -0.360050
      two    -1.210745  1.144010  0.518990
baz   one    -1.247687 -0.602236 -0.336434
      two    -1.775733  0.712611 -1.005913
foo   one     0.071501  1.933703  1.455619
      two    -0.559662  0.761082 -0.156915
qux   one    -1.227424  0.873890 -0.131643
      two     0.048207 -0.211337 -1.106066


In [33]:
df.loc['bar']

               A         B         C
second
one    -0.623886  0.539847 -0.360050
two    -1.210745  1.144010  0.518990


In [34]:
df.loc['bar', 'two']

A   -1.210745
B    1.144010
C    0.518990
Name: (bar, two), dtype: float64

In [35]:
df.loc['baz':'foo']

                     A         B         C
first second
baz   one    -1.247687 -0.602236 -0.336434
      two    -1.775733  0.712611 -1.005913
foo   one     0.071501  1.933703  1.455619
      two    -0.559662  0.761082 -0.156915


You can select a "range" of values by providing a slice of tuples:

In [36]:
df.loc[('baz', 'two'):('qux', 'one')]

                     A         B         C
first second
baz   two    -1.775733  0.712611 -1.005913
foo   one     0.071501  1.933703  1.455619
      two    -0.559662  0.761082 -0.156915
qux   one    -1.227424  0.873890 -0.131643


In [37]:
df.loc[('baz', 'two'):'foo']

                     A         B         C
first second
baz   two    -1.775733  0.712611 -1.005913
foo   one     0.071501  1.933703  1.455619
      two    -0.559662  0.761082 -0.156915


Passing a list of labels or tuples works much like reindexing:

In [38]:
df.loc[[('bar', 'two'), ('qux', 'one')]]

                     A         B         C
first second
bar   two    -1.210745  1.144010  0.518990
qux   one    -1.227424  0.873890 -0.131643
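
The selections so far all start from the outer level. To select on an inner level, two options that are not covered in the notebook are `DataFrame.xs` and `pd.IndexSlice`. The sketch below uses a frame with the same index layout but deterministic placeholder values.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [['bar', 'baz', 'foo', 'qux'], ['one', 'two']], names=['first', 'second']
)
frame = pd.DataFrame(
    np.arange(24).reshape(8, 3), index=index, columns=['A', 'B', 'C']
)

# Cross-section: every row whose 'second' label is 'one' (that level is dropped).
ones = frame.xs('one', level='second')

# Same rows via .loc; IndexSlice allows ':' on the outer level.
idx = pd.IndexSlice
same_rows = frame.loc[idx[:, 'one'], :]
print(ones)
```

`xs` drops the selected level by default, while the `.loc` form keeps both levels. Slicing with `IndexSlice` generally expects a sorted index, which holds here.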


## Swapping levels with swaplevel()

In [39]:
df[:5]

                     A         B         C
first second
bar   one    -0.623886  0.539847 -0.360050
      two    -1.210745  1.144010  0.518990
baz   one    -1.247687 -0.602236 -0.336434
      two    -1.775733  0.712611 -1.005913
foo   one     0.071501  1.933703  1.455619


In [40]:
df[:5].swaplevel(0, 1, axis=0)

                     A         B         C
second first
one    bar   -0.623886  0.539847 -0.360050
two    bar   -1.210745  1.144010  0.518990
one    baz   -1.247687 -0.602236 -0.336434
two    baz   -1.775733  0.712611 -1.005913
one    foo    0.071501  1.933703  1.455619


## Reordering levels with reorder_levels()

In [41]:
df[:5].reorder_levels([1,0], axis=0)

                     A         B         C
second first
one    bar   -0.623886  0.539847 -0.360050
two    bar   -1.210745  1.144010  0.518990
one    baz   -1.247687 -0.602236 -0.336434
two    baz   -1.775733  0.712611 -1.005913
one    foo    0.071501  1.933703  1.455619
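
Both outputs above show that swapping or reordering levels leaves the rows in their original order, so the new outer level is no longer sorted. A possible follow-up step, sketched here with placeholder values, is to call `sort_index()` afterwards.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['bar', 'baz'], ['one', 'two']], names=['first', 'second']
)
frame = pd.DataFrame({'A': [1, 2, 3, 4]}, index=index)

swapped = frame.swaplevel(0, 1)
reordered = frame.reorder_levels(['second', 'first'])  # levels given by name

# The rows keep their original order, so sort to group by the new outer level.
sorted_frame = swapped.sort_index()
print(sorted_frame)
```

For two levels, `swaplevel` and `reorder_levels` give the same result; `reorder_levels` is the more general tool when there are three or more levels.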


# CategoricalIndex

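CategoricalIndex is a type of index that supports indexing with duplicates. It is a container around a Categorical (introduced in v0.15.0) and allows efficient indexing and storage of an index with a large number of duplicated elements. Prior to 0.16.1, setting the index of a DataFrame / Series with a category dtype would convert it to a regular object-based Index.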

In [42]:
df = pd.DataFrame({'A': np.arange(6),
                   'B': list('aabbca')})


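# Newer pandas releases no longer accept astype('category', categories=...);
# pass a CategoricalDtype that carries the categories instead.
df['B'] = df['B'].astype(pd.CategoricalDtype(categories=list('cab')))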


In [43]:
df

   A  B
0  0  a
1  1  a
2  2  b
3  3  b
4  4  c
5  5  a


In [44]:
df.dtypes

A       int32
B    category
dtype: object

In [45]:
df.B.cat.categories

Index(['c', 'a', 'b'], dtype='object')

Setting the index creates a CategoricalIndex:

In [46]:
df2 = df.set_index('B')

In [47]:
df2.index

CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')

In [48]:
df2.loc['a']

   A
B
a  0
a  1
a  5


Such selections preserve the CategoricalIndex:

In [49]:
df2.loc['a'].index

CategoricalIndex(['a', 'a', 'a'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')

In [50]:
df2.sort_index()

   A
B
c  4
a  0
a  1
a  5
b  2
b  3
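
The `sort_index` output above is ordered `c`, `a`, `b` rather than alphabetically, because sorting follows the order of the categories. A minimal sketch that rebuilds the same frame and checks this:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'A': np.arange(6), 'B': list('aabbca')})
frame['B'] = frame['B'].astype(pd.CategoricalDtype(categories=list('cab')))
by_category = frame.set_index('B')

# Sorting uses the category order ('c', 'a', 'b'), not alphabetical order.
sorted_by_category = by_category.sort_index()
print(list(sorted_by_category.index))
```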


Groupby operations on the index also preserve its categorical nature:

In [51]:
df2.groupby(level=0).sum()

   A
B
c  4
a  6
b  5


In [52]:
df2.groupby(level=0).sum().index

CategoricalIndex(['c', 'a', 'b'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')

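Reindexing operations return a resulting index based on the type of the passed indexer: passing a list returns a plain Index, while passing a Categorical returns a CategoricalIndex whose categories are those of the passed Categorical. This lets you reindex arbitrarily, even with values that are not in the existing categories, just as you can reindex any pandas index.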

In [53]:
df2.reindex(['a','e'])

     A
B
a  0.0
a  1.0
a  5.0
e  NaN


In [54]:
df2.reindex(['a','e']).index

Index(['a', 'a', 'a', 'e'], dtype='object', name='B')

In [55]:
df2.reindex(pd.Categorical(['a','e'],categories=list('abcde')))

     A
B
a  0.0
a  1.0
a  5.0
e  NaN
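
To compare the index types returned by the two kinds of reindexer, here is a small sketch. It deliberately uses a frame with a unique categorical index (`unique_frame`, an assumption for this example), since newer pandas releases may refuse to reindex an axis that contains duplicate labels.

```python
import numpy as np
import pandas as pd

unique_frame = pd.DataFrame({
    'A': np.arange(3),
    'B': pd.Series(list('abc')).astype('category'),
}).set_index('B')

# A plain list gives back a plain Index.
from_list = unique_frame.reindex(['a', 'e'])

# A Categorical gives back a CategoricalIndex with the passed categories.
from_categorical = unique_frame.reindex(
    pd.Categorical(['a', 'e'], categories=list('abe'))
)
print(type(from_list.index).__name__)
print(type(from_categorical.index).__name__)
```

In both cases the label `e`, which is absent from the original index, yields a row of missing values.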
