# Main pandas functionality

Building on the pandas data structures, this section covers the most commonly used functionality.

## 1. Reindexing

In [1]:
import pandas as pd

In [2]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

To change the index, call reindex. Any label in the new index that has no existing value introduces a missing value (NaN).

In [3]:
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64
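For ordered data, reindex can also fill the gaps it creates instead of leaving NaN. The sketch below uses made-up data (obj3 is a hypothetical name, not part of the notebook) and the method='ffill' option to carry the last valid value forward.

```python
import pandas as pd

# Hypothetical example data: values exist only at labels 0, 2 and 4.
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# Reindex onto 0..5; method='ffill' fills each new label
# with the most recent earlier value.
filled = obj3.reindex(range(6), method='ffill')
print(filled)
```

Forward filling needs a monotonically ordered index, which is why the example uses sorted integer labels.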

For a DataFrame, reindex can change the row index or the column index.
Reindex the rows:

In [4]:
import numpy as np

In [5]:
frame = pd.DataFrame(np.arange(9).reshape(3, 3),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

In [6]:
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [7]:
frame2 = frame.reindex(['a','b','c','d'])
frame2

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0


Reindex the columns:

In [8]:
states = ['Texas', 'Utah', 'California']

In [9]:
frame.reindex(columns=states)

Unnamed: 0,Texas,Utah,California
a,1,,2
c,4,,5
d,7,,8


reindex parameters:

![image](http://oydgk2hgw.bkt.clouddn.com/pydata-book/x0pq4.png)

In [10]:
frame.loc[['a','b','c','d'], states]

Unnamed: 0,Texas,Utah,California
a,1.0,,2.0
b,,,
c,4.0,,5.0
d,7.0,,8.0
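Note that selecting with loc and a list containing missing labels only works on older pandas versions; recent releases (1.0 and later, as far as we know) raise a KeyError instead. A version-independent sketch of the same operation passes both index and columns to reindex. The names rows and cols are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape(3, 3),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

rows = ['a', 'b', 'c', 'd']              # 'b' does not exist in frame
cols = ['Texas', 'Utah', 'California']   # 'Utah' does not exist in frame

# reindex accepts labels that are not present and fills them with NaN.
result = frame.reindex(index=rows, columns=cols)
print(result)
```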


## 2. Dropping Entries from an Axis

For a DataFrame, drop can delete entries by label from either the row axis or the column axis:

In [11]:
data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


Dropping rows (axis 0):

In [12]:
data.drop(['Ohio'])

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


Dropping columns (axis 1):

In [13]:
data.drop('two', axis=1)

Unnamed: 0,one,three,four
Ohio,0,2,3
Colorado,4,6,7
Utah,8,10,11
New York,12,14,15
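As an alternative to the axis argument, drop also accepts index= and columns= keywords, which name the axis explicitly. A minimal sketch, rebuilding the same data frame so that it runs on its own (trimmed is a name chosen for illustration):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Remove two rows and one column in a single call.
trimmed = data.drop(index=['Ohio', 'Utah'], columns=['two'])
print(trimmed)

# drop returns a new object; the original is left unchanged.
print(data.shape)  # (4, 4)
```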


## 2. Indexing, Selection, and Filtering

Series indexing

This works like NumPy array indexing, except that labels can be used in addition to integers. Note that slicing by label includes the endpoint.
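The following sketch illustrates these points on a small Series. The Series s and the variable names are made up for this example.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_label = s['b']          # single label -> scalar 1.0
by_position = s.iloc[1]    # integer position -> the same element
label_slice = s['b':'c']   # label slice: endpoint 'c' IS included
pos_slice = s.iloc[1:2]    # positional slice: endpoint is excluded
mask = s[s < 2]            # boolean mask, as with a NumPy array

print(label_slice)
print(pos_slice)
```

The label slice returns two elements while the positional slice over the "same" range returns one, which is the inclusive-endpoint behavior described above.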

DataFrame indexing

#### Indexing with a value or a sequence:

In [14]:
data['one']

Ohio         0
Colorado     4
Utah         8
New York    12
Name: one, dtype: int32

In [15]:
data[['one', 'two']]

Unnamed: 0,one,two
Ohio,0,1
Colorado,4,5
Utah,8,9
New York,12,13


#### Slicing and boolean array indexing:

In [16]:
data[:2]

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7


In [17]:
data[data['three']>5]

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [18]:
data[data>14] = 0
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,0


#### Label and position indexing:

For label-indexing on rows, use loc (label-based) and iloc (integer position-based):

In [19]:
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,0


In [20]:
data.loc['Ohio', ['one', 'two']]

one    0
two    1
Name: Ohio, dtype: int32

In [21]:
data.iloc[0, [0, 1]]

one    0
two    1
Name: Ohio, dtype: int32

In [22]:
data.loc[:'Utah', 'two']

Ohio        1
Colorado    5
Utah        9
Name: two, dtype: int32

In [23]:
data.iloc[:, :3][data.three>5]

Unnamed: 0,one,two,three
Colorado,4,5,6
Utah,8,9,10
New York,12,13,14
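The chained selection above (slice columns by position, then filter rows) can also be written as a single loc call that takes a row mask and a list of column labels. This is a sketch of an equivalent form, not the only way to write it; chained and single are illustrative names.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Chained form: select the first three columns, then filter the rows.
chained = data.iloc[:, :3][data.three > 5]

# Single-step form: boolean row mask and column labels in one loc call.
single = data.loc[data['three'] > 5, ['one', 'two', 'three']]

print(chained.equals(single))
```

The single-step form avoids creating an intermediate object, which also makes it the safer choice when the selection is used for assignment.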


Data selection methods:

![image](http://oydgk2hgw.bkt.clouddn.com/pydata-book/bwadf.png)

![image](http://oydgk2hgw.bkt.clouddn.com/pydata-book/lc2uc.png)

## 3. Arithmetic and Data Alignment

In [24]:
# Use a list for the index: a set has no defined order.
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                   index=['Colorado', 'Texas', 'Ohio'])
df1

Unnamed: 0,b,c,d
Colorado,0.0,1.0,2.0
Texas,3.0,4.0,5.0
Ohio,6.0,7.0,8.0


In [25]:
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df2

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [26]:
df1 + df2

Unnamed: 0,b,c,d,e
Colorado,,,,
Ohio,9.0,,12.0,
Oregon,,,,
Texas,9.0,,12.0,
Utah,,,,


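Because columns 'c' and 'e' do not appear in both DataFrames, they are entirely missing in the result. The same holds for rows: only row labels present in both objects produce values, and even then a cell is missing wherever its column is not shared.

Using arithmetic methods with a fill value: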

In [27]:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), 
                   columns=list('abcd'))

df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), 
                   columns=list('abcde'))
df2.loc[1, 'b'] = np.nan
df1.add(df2, fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,4.0
1,9.0,5.0,13.0,15.0,9.0
2,18.0,20.0,22.0,24.0,14.0
3,15.0,16.0,17.0,18.0,19.0
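In the result above, cell (1, 'b') is 5.0 because the NaN in df2 was replaced by the fill value before adding. The fill value is used only when one side is missing; if both sides are missing, the result stays missing. A small sketch with made-up Series a and b:

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, np.nan, np.nan], index=['x', 'y', 'z'])
b = pd.Series([10.0, 20.0, np.nan], index=['x', 'y', 'z'])

total = a.add(b, fill_value=0)
# x: both present             -> 11.0
# y: missing in a only        -> 0 + 20.0 = 20.0
# z: missing on both sides    -> NaN
print(total)
```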


The table below lists these flexible arithmetic methods:

![image](http://oydgk2hgw.bkt.clouddn.com/pydata-book/y0rr4.png)

Each method has a counterpart starting with the letter r, which reverses the order of the arguments.

In [28]:
1/df1

Unnamed: 0,a,b,c,d
0,inf,1.0,0.5,0.333333
1,0.25,0.2,0.166667,0.142857
2,0.125,0.111111,0.1,0.090909


In [29]:
df1.rdiv(1)

Unnamed: 0,a,b,c,d
0,inf,1.0,0.5,0.333333
1,0.25,0.2,0.166667,0.142857
2,0.125,0.111111,0.1,0.090909


A fill_value can also be specified when reindexing:

In [30]:
df1.reindex(columns=df2.columns, fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,0
1,4.0,5.0,6.0,7.0,0
2,8.0,9.0,10.0,11.0,0


#### Operations between DataFrame and Series:

Consider a NumPy example first:

In [31]:
arr = np.arange(12.).reshape((3, 4))

In [32]:
arr - arr[0]

array([[ 0.,  0.,  0.,  0.],
       [ 4.,  4.,  4.,  4.],
       [ 8.,  8.,  8.,  8.]])

The subtraction is performed once for each row. This is called broadcasting.

In [33]:
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                    index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[0]

In [34]:
frame

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [35]:
series

b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

By default, arithmetic between a DataFrame and a Series matches the index of the Series against the columns of the DataFrame.

Broadcasting down the rows:

In [36]:
frame - series

Unnamed: 0,b,d,e
Utah,0.0,0.0,0.0
Ohio,3.0,3.0,3.0
Texas,6.0,6.0,6.0
Oregon,9.0,9.0,9.0


If a label is found in only one of the DataFrame's columns or the Series's index, the result is reindexed to the union of the two:

In [37]:
series2 = pd.Series(range(3), index=['b', 'e', 'f'])
frame + series2

Unnamed: 0,b,d,e,f
Utah,0.0,,3.0,
Ohio,3.0,,6.0,
Texas,6.0,,9.0,
Oregon,9.0,,12.0,


To broadcast over the columns instead, matching on the rows, you have to use one of the arithmetic methods:

In [38]:
series = frame['d']

In [39]:
frame.sub(series, axis='index')

Unnamed: 0,b,d,e
Utah,-1.0,0.0,1.0
Ohio,-1.0,0.0,1.0
Texas,-1.0,0.0,1.0
Oregon,-1.0,0.0,1.0


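The axis argument names the axis to match on. In this example the Series is matched against the DataFrame's row index (axis='index' or axis=0) and then broadcast across the columns.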
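To make the role of axis concrete, the sketch below compares three things: the plain operator (which aligns the Series with the columns and finds no match), the sub method with axis='index', and the equivalent NumPy broadcast. The variable names naive, by_method and by_numpy are chosen here for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
col = frame['d']

# Plain operator: the Series index (state names) is aligned with the
# column labels ('b', 'd', 'e'); nothing matches, so every cell is NaN.
naive = frame - col

# Match on the row index instead and broadcast across the columns.
by_method = frame.sub(col, axis='index')

# Same computation in NumPy: turn the column into shape (4, 1).
by_numpy = frame.to_numpy() - col.to_numpy()[:, np.newaxis]

print(by_method)
```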

## 4. Function Application and Mapping

NumPy ufuncs (element-wise array methods) also work on pandas objects:

In [40]:
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'), 
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

Unnamed: 0,b,d,e
Utah,-1.326382,-0.69092,0.121802
Ohio,1.2551,0.496809,1.017018
Texas,0.752331,-0.148764,-1.549744
Oregon,1.063863,0.208184,-1.32806


In [41]:
np.abs(frame)

Unnamed: 0,b,d,e
Utah,1.326382,0.69092,0.121802
Ohio,1.2551,0.496809,1.017018
Texas,0.752331,0.148764,1.549744
Oregon,1.063863,0.208184,1.32806


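In addition, a function that operates on a one-dimensional array can be applied to each column or each row.

This is done with DataFrame's apply method: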

In [42]:
f = lambda x: x.max()-x.min()
frame.apply(f)

b    2.581482
d    1.187729
e    2.566762
dtype: float64

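Here the function f, which computes the difference between the maximum and the minimum of a Series, is invoked once for each column of frame. The result is a Series whose index is the columns of frame.

If you pass axis='columns' to apply, the function is invoked once per row instead.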
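A short sketch of both directions, using deterministic data rather than the random frame above so the results are easy to check. The name spread is introduced here for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

spread = lambda x: x.max() - x.min()

# Default (axis=0): one call per column, result indexed by column label.
per_column = frame.apply(spread)

# axis='columns': one call per row, result indexed by row label.
per_row = frame.apply(spread, axis='columns')

print(per_column)
print(per_row)
```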

The function passed to apply does not have to return a scalar; it can also return a Series with multiple values:

In [43]:
def f(x):
    return pd.Series([x.min(), x.max()], index=['min','max'])

In [44]:
frame.apply(f)

Unnamed: 0,b,d,e
min,-1.326382,-0.69092,-1.549744
max,1.2551,0.496809,1.017018


Element-wise Python functions can be used too. Suppose you want to format each floating-point value in frame as a string with two decimal places. You can do this with applymap:

In [45]:
format = lambda x: '%.2f' % x
frame.applymap(format)

Unnamed: 0,b,d,e
Utah,-1.33,-0.69,0.12
Ohio,1.26,0.50,1.02
Texas,0.75,-0.15,-1.55
Oregon,1.06,0.21,-1.33


The name applymap comes from the fact that Series has a map method for applying an element-wise function:

In [46]:
frame['e'].map(format)

Utah       0.12
Ohio       1.02
Texas     -1.55
Oregon    -1.33
Name: e, dtype: object
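One version caveat: in pandas 2.1 and later, DataFrame.applymap is deprecated in favor of DataFrame.map. The sketch below picks whichever method is available, so it should run on both older and newer installations. The helper two_decimals and the small frame are made up for this example.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(6.).reshape(2, 3) / 3, columns=list('bde'))

def two_decimals(x):
    """Format a number as a string with two decimal places."""
    return '%.2f' % x

# Prefer DataFrame.map where it exists; fall back to applymap otherwise.
elementwise = frame.map if hasattr(pd.DataFrame, 'map') else frame.applymap
formatted = elementwise(two_decimals)
print(formatted)
```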

## 5. Sorting and Ranking

To sort by row or column index, use the sort_index method, which sorts along the given axis and returns a new object:

In [47]:
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=['three', 'one'],
                     columns=['d', 'a', 'b', 'c'])
frame

Unnamed: 0,d,a,b,c
three,0,1,2,3
one,4,5,6,7


In [48]:
frame.sort_index()

Unnamed: 0,d,a,b,c
one,4,5,6,7
three,0,1,2,3


In [49]:
frame.sort_index(axis=1)

Unnamed: 0,a,b,c,d
three,1,2,3,0
one,5,6,7,4


In [50]:
frame.sort_index(axis=0, ascending=False)

Unnamed: 0,d,a,b,c
three,0,1,2,3
one,4,5,6,7


To sort by values, use the sort_values method (missing values are placed at the end by default):

In [51]:
obj = pd.Series([4, np.nan, -3, 2])
obj.sort_values()

2   -3.0
3    2.0
0    4.0
1    NaN
dtype: float64

In [52]:
frame.sort_values(by=['a', 'b'])

Unnamed: 0,d,a,b,c
three,0,1,2,3
one,4,5,6,7
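With only two rows, the effect of sorting by several columns is hard to see. The sketch below uses a slightly larger, made-up frame (scores is a hypothetical name) where the second key breaks ties in the first.

```python
import pandas as pd

scores = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# Sort by 'a' first; rows with equal 'a' are ordered by 'b'.
ordered = scores.sort_values(by=['a', 'b'])
print(ordered)

# ascending=False reverses the order for all sort keys.
reverse = scores.sort_values(by=['a', 'b'], ascending=False)
print(reverse)
```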


rank (omitted)
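Although ranking is not covered in these notes, a brief sketch of Series.rank may help as a starting point. By default, tied values share the mean of the ranks they would occupy; method='first' breaks ties by order of appearance instead. The data and variable names are chosen for illustration.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# Default method='average': the two 4s share rank 4.5, the two 7s share 6.5.
average_rank = obj.rank()

# method='first': ties are ranked in the order they appear in the data.
first_rank = obj.rank(method='first')

print(average_rank.tolist())  # [6.5, 1.0, 6.5, 4.5, 3.0, 2.0, 4.5]
print(first_rank.tolist())    # [6.0, 1.0, 7.0, 4.0, 3.0, 2.0, 5.0]
```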

## 6. Axis Indexes with Duplicate Labels

Index labels do not have to be unique. Here is a Series with duplicate labels:

In [53]:
obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj

a    0
a    1
b    2
b    3
c    4
dtype: int32

In [54]:
obj.index.is_unique

False

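When selecting data from a Series, a label with multiple entries returns a Series, while a label with a single entry returns a scalar.
For a DataFrame, a label that matches multiple rows or columns returns a DataFrame.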

In [55]:
obj['a']

a    0
a    1
dtype: int32
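Because the return type depends on whether a label is duplicated, code that indexes by label may need to handle both cases. A minimal sketch of the behavior for a Series and a DataFrame (df and the variable names are made up for this example):

```python
import numpy as np
import pandas as pd

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

dup = obj['a']       # duplicated label -> Series with two entries
single = obj['c']    # unique label     -> scalar

df = pd.DataFrame(np.arange(12).reshape(4, 3),
                  index=['a', 'a', 'b', 'b'])
rows = df.loc['b']   # duplicated row label -> DataFrame with two rows

print(type(dup).__name__, single, rows.shape)
```

Checking index.is_unique beforehand, as in the cell above, is one way to find out which case applies.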