**Function application and mapping**

In [1]:
import numpy as np
import pandas as pd
# NumPy ufuncs (element-wise array methods) also work on pandas objects


In [2]:
frame = pd.DataFrame(np.random.randn(4,3),columns=list('abc'),
                     index=["utah",'ohio','china','oregin'])
frame
np.abs(frame)

Unnamed: 0,a,b,c
utah,0.098852,0.026537,0.372806
ohio,2.442644,1.412191,1.043528
china,0.965674,1.147498,1.851184
oregin,1.721748,1.386722,0.838761


To apply a function to the one-dimensional arrays formed by each column or each row, use the DataFrame apply method.

In [3]:
f = lambda x:x.max()-x.min()
frame.apply(f)

a    2.541495
b    2.798912
c    2.689945
dtype: float64

In [4]:
frame.apply(f,axis='columns')

utah      0.399343
ohio      3.854834
china     2.816857
oregin    0.882987
dtype: float64
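
For a simple reduction such as the range, the same result can usually be obtained without apply, because DataFrame reductions like max and min already work per column or per row. The following is a minimal sketch on a small hand-made frame; `demo` and its values are illustrative and are not the random frame used in the cells above.

```python
import pandas as pd

# Illustrative data (an assumption for this sketch), not the random frame above.
demo = pd.DataFrame(
    {"a": [1.0, -2.0, 0.5], "b": [3.0, 0.0, -1.0]},
    index=["x", "y", "z"],
)

f = lambda x: x.max() - x.min()

# Column-wise range: apply(f) versus the vectorized equivalent.
by_apply_cols = demo.apply(f)
by_vector_cols = demo.max() - demo.min()

# Row-wise range: axis='columns' passes each row to f.
by_apply_rows = demo.apply(f, axis="columns")
by_vector_rows = demo.max(axis="columns") - demo.min(axis="columns")

print(by_apply_cols.equals(by_vector_cols))
print(by_apply_rows.equals(by_vector_rows))
```

The vectorized form tends to be faster on large frames, while apply remains useful when no built-in reduction fits.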

The function passed to apply does not have to return a scalar; it can also return a Series holding several values.


In [5]:
def f(x):
    return pd.Series([x.min(),x.max()],index=['min','max'])
frame.apply(f)

Unnamed: 0,a,b,c
min,-2.442644,-1.386722,-0.838761
max,0.098852,1.412191,1.851184


Element-wise Python functions can be used as well. Suppose you want a formatted string for each floating-point value in frame; applymap does this.

In [6]:
format =lambda x:'%.2f'%x
frame.applymap(format)

Unnamed: 0,a,b,c
utah,0.10,-0.03,0.37
ohio,-2.44,1.41,1.04
china,-0.97,1.15,1.85
oregin,-1.72,-1.39,-0.84
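
In pandas 2.1 and later, DataFrame.applymap is deprecated in favor of DataFrame.map. The sketch below picks whichever is available; `demo` and `two_decimals` are names introduced only for this illustration, and the helper name avoids shadowing the built-in `format`.

```python
import pandas as pd

# Illustrative two-row frame (an assumption for this sketch).
demo = pd.DataFrame(
    {"a": [0.098852, -2.442644], "b": [-0.026537, 1.412191]},
    index=["utah", "ohio"],
)

def two_decimals(x):
    """Format one float with two decimal places."""
    return f"{x:.2f}"

# DataFrame.map exists from pandas 2.1; older versions only have applymap.
if hasattr(pd.DataFrame, "map"):
    formatted = demo.map(two_decimals)
else:
    formatted = demo.applymap(two_decimals)

print(formatted)
```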


Series has a map method for applying an element-wise function.

In [7]:
frame['a'].map(format)

utah       0.10
ohio      -2.44
china     -0.97
oregin    -1.72
Name: a, dtype: object

In [8]:
frame.loc[:,['a']]


Unnamed: 0,a
utah,0.098852
ohio,-2.442644
china,-0.965674
oregin,-1.721748


In [9]:
frame.iloc[:2]

Unnamed: 0,a,b,c
utah,0.098852,-0.026537,0.372806
ohio,-2.442644,1.412191,1.043528


In [10]:
frame.loc['utah','a']

0.09885161725205341

In [11]:
frame.loc['utah',['a']]
# The two results differ: the first is a scalar, the second is a Series labeled with the column name


a    0.098852
Name: utah, dtype: float64
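
The shape of what loc returns depends on whether each selector is a single label or a list of labels. A small sketch of the three cases, on an illustrative frame named `demo` (not the frame above):

```python
import pandas as pd

# Illustrative data (an assumption for this sketch).
demo = pd.DataFrame({"a": [1.5, 2.5], "b": [3.5, 4.5]}, index=["utah", "ohio"])

scalar = demo.loc["utah", "a"]         # label, label      -> scalar
as_series = demo.loc["utah", ["a"]]    # label, list       -> Series named 'utah'
as_frame = demo.loc[["utah"], ["a"]]   # list, list        -> 1x1 DataFrame

print(type(scalar).__name__)
print(type(as_series).__name__, as_series.name)
print(type(as_frame).__name__, as_frame.shape)
```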

**Sorting and ranking**

In [12]:
obj = pd.Series(range(4),index=list('bcad'))
obj

b    0
c    1
a    2
d    3
dtype: int64

In [13]:
obj.sort_index()

a    2
b    0
c    1
d    3
dtype: int64

In [14]:
frame

Unnamed: 0,a,b,c
utah,0.098852,-0.026537,0.372806
ohio,-2.442644,1.412191,1.043528
china,-0.965674,1.147498,1.851184
oregin,-1.721748,-1.386722,-0.838761


In [15]:
frame.sort_index(axis=0)

Unnamed: 0,a,b,c
china,-0.965674,1.147498,1.851184
ohio,-2.442644,1.412191,1.043528
oregin,-1.721748,-1.386722,-0.838761
utah,0.098852,-0.026537,0.372806


In [16]:
frame.sort_index(axis=1)

Unnamed: 0,a,b,c
utah,0.098852,-0.026537,0.372806
ohio,-2.442644,1.412191,1.043528
china,-0.965674,1.147498,1.851184
oregin,-1.721748,-1.386722,-0.838761


axis=0 sorts by the row labels; axis=1 sorts by the column labels.

Passing ascending=False reverses the sort order.

In [17]:
frame.sort_index(ascending=False)

Unnamed: 0,a,b,c
utah,0.098852,-0.026537,0.372806
oregin,-1.721748,-1.386722,-0.838761
ohio,-2.442644,1.412191,1.043528
china,-0.965674,1.147498,1.851184
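
The axis and ascending options can be combined. Because the columns of the frame above are already in alphabetical order, the sketch below uses an illustrative frame `demo` whose labels start out unsorted, so the effect of each call is visible.

```python
import pandas as pd

# Illustrative frame with unsorted row and column labels (an assumption).
demo = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]],
    index=["y", "x"],
    columns=["b", "c", "a"],
)

rows_asc = demo.sort_index(axis=0)                     # rows ordered x, y
cols_desc = demo.sort_index(axis=1, ascending=False)   # columns ordered c, b, a

print(rows_asc)
print(cols_desc)
```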


In [18]:
# Sort a Series by its values; missing values are placed at the end of the Series by default
obj = pd.Series([4,5,1,2,np.nan])
obj
obj.sort_values()

2    1.0
3    2.0
0    4.0
1    5.0
4    NaN
dtype: float64
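
Where the missing values end up is controlled by the na_position option of sort_values. A short sketch with the same values as the cell above; the variable names are introduced here for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([4, 5, 1, 2, np.nan])

default_order = s.sort_values()                   # NaN last (default)
nan_first = s.sort_values(na_position="first")    # NaN moved to the front
descending = s.sort_values(ascending=False)       # NaN still last

print(nan_first)
print(descending)
```

Note that ascending=False reverses only the order of the non-missing values; NaN stays at the end unless na_position says otherwise.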

In [19]:
# Without parentheses this only displays the bound method; nothing is sorted
frame.sort_values

<bound method DataFrame.sort_values of                a         b         c
utah    0.098852 -0.026537  0.372806
ohio   -2.442644  1.412191  1.043528
china  -0.965674  1.147498  1.851184
oregin -1.721748 -1.386722 -0.838761>

When sorting a DataFrame, you may want to sort by the values in one or more columns. To do so, pass one or more column names to the by option of sort_values.

In [22]:
frame = pd.DataFrame({'a':[2,-2,5,1],'b':[1,9,2,-3]})
frame

Unnamed: 0,a,b
0,2,1
1,-2,9
2,5,2
3,1,-3


In [24]:
frame.sort_values(by=['a','b'])

Unnamed: 0,a,b
1,-2,9
3,1,-3
0,2,1
2,5,2
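
When sorting by several columns, ascending also accepts a list with one flag per column. The sketch below uses an illustrative frame `demo` with ties in column a, so that the second key actually matters.

```python
import pandas as pd

# Illustrative data with repeated values in 'a' (an assumption for this sketch).
demo = pd.DataFrame({"a": [2, 2, 1, 1], "b": [1, 9, 2, -3]})

# Sort by 'a' ascending, then break ties by 'b' descending.
mixed = demo.sort_values(by=["a", "b"], ascending=[True, False])

print(mixed)
```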


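By default, the rank() method breaks ties by assigning every value in a group of equal values the average rank of that group.

Take Series([7, -5, 7, 4, 2, 0, 4]) as an example.

First sort the values in ascending order:
[-5,0,2,4,4,7,7]
-5 is alone in position 1, so its rank is (1+1)/2 = 1; the two 4s occupy positions 4 and 5, so each gets (4+5)/2 = 4.5.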
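
The arithmetic above can be reproduced by hand and checked against rank(). This is a sketch of the idea only, not how pandas implements ranking internally; `ordered`, `positions`, and `manual` are names introduced for illustration.

```python
import pandas as pd

s = pd.Series([7, -5, 7, 4, 2, 0, 4])

# Step 1: sort the values and number the sorted positions 1..n.
ordered = s.sort_values(kind="stable")
positions = pd.Series(range(1, len(s) + 1), index=ordered.index)

# Step 2: within each group of equal values, average the positions.
manual = positions.groupby(ordered).transform("mean").sort_index()

print(manual.tolist())
print(s.rank().tolist())
```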

In [28]:
obj  = pd.Series([-5,0,2,4,4,7,7])
obj


0   -5
1    0
2    2
3    4
4    4
5    7
6    7
dtype: int64

In [29]:
obj.rank()

0    1.0
1    2.0
2    3.0
3    4.5
4    4.5
5    6.5
6    6.5
dtype: float64

In [30]:
# Ties can instead be ranked in the order the values appear in the data
obj.rank(method='first')

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
6    7.0
dtype: float64

In [32]:
# Rank in descending order:
obj.rank(ascending=False,method='first')

0    7.0
1    6.0
2    5.0
3    3.0
4    4.0
5    1.0
6    2.0
dtype: float64

In [37]:
# A DataFrame can compute ranks over the rows or over the columns
frame = pd.DataFrame({'b':[4.3,7,-3,2],'a': [0,1,0,1],'c':[-2,5,8,-2.5]})
frame

Unnamed: 0,b,a,c
0,4.3,0,-2.0
1,7.0,1,5.0
2,-3.0,0,8.0
3,2.0,1,-2.5


In [39]:
frame.rank(axis='columns')

Unnamed: 0,b,a,c
0,3.0,2.0,1.0
1,3.0,1.0,2.0
2,1.0,2.0,3.0
3,3.0,2.0,1.0


In [41]:
frame.rank(axis='index')

Unnamed: 0,b,a,c
0,3.0,1.5,2.0
1,4.0,3.5,3.0
2,1.0,1.5,4.0
3,2.0,3.5,1.0


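average: assign the average rank to every value in a group of equal values (the default)

min: use the smallest rank for the whole group

max: use the largest rank for the whole group

first: rank values in the order they appear in the original data

dense: like min, but the rank always increases by 1 between groups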

In [46]:
obj

0   -5
1    0
2    2
3    4
4    4
5    7
6    7
dtype: int64

In [47]:
obj
obj.rank(method='min')

0    1.0
1    2.0
2    3.0
3    4.0
4    4.0
5    6.0
6    6.0
dtype: float64

In [48]:
obj.rank(method='dense')

0    1.0
1    2.0
2    3.0
3    4.0
4    4.0
5    5.0
6    5.0
dtype: float64
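
To see how the tie-breaking methods differ, it can help to place their results side by side. A minimal sketch using the same values as obj; the `comparison` frame is introduced here for illustration.

```python
import pandas as pd

s = pd.Series([-5, 0, 2, 4, 4, 7, 7])

# One column of ranks per tie-breaking method.
methods = ["average", "min", "max", "first", "dense"]
comparison = pd.DataFrame({m: s.rank(method=m) for m in methods})

print(comparison)
```

Only the rows holding tied values (the two 4s and the two 7s) differ between columns.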

**Axis indexes with duplicate labels**

In [51]:
obj = pd.Series(range(5),index=list('aabbc'))
obj

a    0
a    1
b    2
b    3
c    4
dtype: int64

In [53]:
# The index's is_unique attribute tells you whether its labels are unique
obj.index.is_unique


False

In [55]:
# A label that matches several entries returns a Series; a label that matches a single entry returns a scalar
obj['c']

4

In [57]:
obj['a']

a    0
a    1
dtype: int64

In [60]:
df = pd.DataFrame(np.random.randn(4,3),index=['a','a','b','b'])
df

Unnamed: 0,0,1,2
a,-1.105177,-1.133498,-0.679767
a,-0.803274,-0.905859,-0.672138
b,-0.105054,1.1904,1.081195
b,0.732348,-0.330881,-0.341569


In [61]:
df.loc['a']

Unnamed: 0,0,1,2
a,-1.105177,-1.133498,-0.679767
a,-0.803274,-0.905859,-0.672138
