# **Pandas**
---
## **Introduction to pandas Data Structures**

### **Series** 
> **A one-dimensional array of values with an associated index.  
> Use the `values` and `index` attributes to view the values and the index.**

In [192]:
import pandas as pd
import numpy as np
obj = pd.Series([4,7,-5,3])
obj.values

array([ 4,  7, -5,  3], dtype=int64)

> **pandas generates a default index automatically; you can also supply your own**

In [193]:
obj2 = pd.Series([4,7,-5,3],index=['a','b','c','d'])
obj2

a    4
b    7
c   -5
d    3
dtype: int64

In [194]:
obj2[obj2>2]

a    4
b    7
d    3
dtype: int64

In [195]:
np.exp(obj2) # the result is float, and the index is preserved

a      54.598150
b    1096.633158
c       0.006738
d      20.085537
dtype: float64

> **A Series can be used like a dict in functions that expect one  
> so a Series can also be created from a Python dict**

In [196]:
'b' in obj2

True

In [197]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

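> **Pass an index to choose which dict keys are used and in what order; labels with no matching key get NaN**

In [198]:
# 'California' is not a key in sdata, so its value is NaN; 'Utah' is left out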
states = ['California','Ohio','Oregon','Texas']
obj4 = pd.Series(sdata,index=states)
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

> **The isnull and notnull functions detect missing values  
> The obj.isnull method does the same thing**

In [199]:
# detect missing values with isnull / notnull
pd.isnull(obj4) 

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

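> **One of the most important features of Series is that arithmetic automatically aligns data by index label: the result's index is the union of both indexes, with NaN wherever a label is missing from either operand, similar to a database outer join**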

In [200]:
obj3 + obj4 

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
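
A minimal sketch of the alignment rule, using two small Series whose names and values (`left`, `right`) are invented here for illustration: the result's index is the union of both indexes, and a label present on only one side comes out as NaN.

```python
import pandas as pd

# Hypothetical data: the two Series share only the label 'b'.
left = pd.Series([1, 2], index=['a', 'b'])
right = pd.Series([10, 20], index=['b', 'c'])

total = left + right

# The result index is the union of both indexes ...
print(list(total.index))        # ['a', 'b', 'c']
# ... but only labels present on both sides get a value.
print(total.isna().tolist())    # [True, False, True]
print(total['b'])               # 12.0
```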

> **Both the Series object and its index have a name attribute, and both can be set by assignment**

In [201]:
obj4.name = "population"
obj4.index.name = 'state'
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

### **DataFrame**
> **A DataFrame is a tabular data structure  
> It has both a row index and a column index, and can be thought of as a dict of Series that all share the same index**

In [202]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
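frame = pd.DataFrame(data) # build a DataFrame from a dict of equal-length lists
frame.index= [1,2,3,4,5,6] # the index can be reassigned, as with a Series
frame.columns = ['year','state','pop'] # caution: this relabels the columns, it does not reorder the data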
frame

Unnamed: 0,year,state,pop
1,Ohio,2000,1.5
2,Ohio,2001,1.7
3,Ohio,2002,3.6
4,Nevada,2001,2.4
5,Nevada,2002,2.9
6,Nevada,2003,3.2
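
Assigning to `frame.columns` relabels the columns in place rather than rearranging them, which is why the labels in the output above no longer match the data. The difference can be sketched as follows; the two-column frame `df` is a made-up example, not part of the original notes.

```python
import pandas as pd

# Hypothetical two-column frame for illustration.
df = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'year': [2000, 2001]})

# Selecting with a list of names reorders columns; data moves with its label.
reordered = df[['year', 'state']]

# Assigning to .columns only relabels; the data stays where it was.
relabeled = df.copy()
relabeled.columns = ['year', 'state']

print(reordered['year'].tolist())   # [2000, 2001]
print(relabeled['year'].tolist())   # ['Ohio', 'Nevada']
```

Passing `columns=[...]` to the constructor, as the next cell does, is another way to fix the column order up front.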


In [203]:
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'])
frame2.index = ['one', 'two', 'three', 'four', 'five', 'six']
frame2 # a column with no data in the dict is filled with NaN

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,
six,2003,Nevada,3.2,


> **Indexing a DataFrame by column returns a Series (one column name) or a new DataFrame (a list of column names)**

In [204]:
frame2.year
frame2[['year','pop','debt']] # much like NumPy fancy indexing

Unnamed: 0,year,pop,debt
one,2000,1.5,
two,2001,1.7,
three,2002,3.6,
four,2001,2.4,
five,2002,2.9,
six,2003,3.2,


In [205]:
frame2.loc['three'] # select a row by its label (use iloc to select by position)

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

> **Columns can be modified by assignment**

In [206]:
import numpy as np
frame2.debt = 16.5
frame2['debt'] = np.arange(6.0)
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,0.0
two,2001,Ohio,1.7,1.0
three,2002,Ohio,3.6,2.0
four,2001,Nevada,2.4,3.0
five,2002,Nevada,2.9,4.0
six,2003,Nevada,3.2,5.0


In [207]:
val = pd.Series([-1.2,-1.5,-1.7],index=['two','four','five'])
frame2.debt = val 
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,-1.7
six,2003,Nevada,3.2,


> **del removes a column  
> Assigning to a column that does not exist creates a new column**

In [208]:
frame2['eastern'] = frame2.state == 'Ohio' # a new column must be created with bracket syntax
frame2 # note: attribute syntax (frame2.eastern = ...) cannot create a new column

Unnamed: 0,year,state,pop,debt,eastern
one,2000,Ohio,1.5,,True
two,2001,Ohio,1.7,-1.2,True
three,2002,Ohio,3.6,,True
four,2001,Nevada,2.4,-1.5,False
five,2002,Nevada,2.9,-1.7,False
six,2003,Nevada,3.2,,False


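> **Note: the column returned by indexing is a view on the underlying data, not a copy, so in-place modifications to the returned Series are reflected in the source DataFrame (this can differ in pandas versions that enable copy-on-write). Use the Series's copy method to get an explicit, independent copy of a column.**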

In [209]:
frame3 = frame2.year.copy() # copy one column
frame3

one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64
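
A small sketch of the `copy` behavior, on a made-up one-column frame: changes to an explicit copy never reach the source DataFrame. Whether a column taken without `copy` shares data with its parent depends on the pandas version, so only the copy case is shown.

```python
import pandas as pd

df = pd.DataFrame({'year': [2000, 2001, 2002]})

year_copy = df['year'].copy()   # independent copy of the column
year_copy.iloc[0] = 1999        # modify the copy only

print(df['year'].tolist())      # [2000, 2001, 2002]
print(year_copy.tolist())       # [1999, 2001, 2002]
```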

### **Nested dicts**

> **With a nested dict, the outer keys become the columns and the inner keys become the row index**

In [210]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = pd.DataFrame(pop)
frame3

Unnamed: 0,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5


In [211]:
frame3.T # transpose

Unnamed: 0,2001,2002,2000
Nevada,2.4,2.9,
Ohio,1.7,3.6,1.5


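> **Specifying the index explicitly**

In [212]:
pd.DataFrame(pop, index=[2001, 2002, 2003]) # selects rows by label; 2003 would be all NaN
frame3.index = [2001, 2002, 2003] # not equivalent: relabels the existing rows by position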
frame3

Unnamed: 0,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2003,,1.5
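
The constructor's `index=` argument and assignment to `.index` do different things: the first selects rows by label, the second renames the rows that already exist. A sketch under the same `pop` data (the variable names `selected` and `relabeled` are introduced here):

```python
import pandas as pd

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

# index= in the constructor selects rows by label: 2003 is absent, so all NaN.
selected = pd.DataFrame(pop, index=[2001, 2002, 2003])

# Assigning .index relabels the existing rows by position, keeping their data.
relabeled = pd.DataFrame(pop)
relabeled.index = [2001, 2002, 2003]

print(selected.loc[2003].isna().all())     # True
print(relabeled['Ohio'].notna().all())     # True: no Ohio value was dropped
```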


In [213]:
pdata = {
    'Ohio':frame3['Ohio'][:-1],
    'Nevada':frame3['Nevada'][:2]
}
pd.DataFrame(pdata)

Unnamed: 0,Ohio,Nevada
2001,1.7,2.4
2002,3.6,2.9


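> **The DataFrame constructor accepts a 2-D array, a NumPy structured array, a dict of Series, a dict of dicts, a list of lists or tuples, or another DataFrame**

> **You can set a name for a DataFrame's index and for its columns**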

In [214]:
frame3.index.name = 'year'
frame3.columns.name = 'state' # names the columns axis (shown in the top-left corner of the output)
frame3

state,Nevada,Ohio
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2001,2.4,1.7
2002,2.9,3.6
2003,,1.5


> **As with Series, the index and values attributes return the index and the values**

In [215]:
frame2.values
frame3.values

array([[2.4, 1.7],
       [2.9, 3.6],
       [nan, 1.5]])

### **Index objects**
> **The index is itself an object. Index objects are immutable, so they can be shared safely among data structures**

In [216]:
import pandas as pd
obj = pd.Series(range(3),index=['a','b','c'])
index = obj.index
index[1:]
# index[1] = 'd' # raises TypeError: an Index cannot be modified in place

Index(['b', 'c'], dtype='object')

In [217]:
import numpy as np
labels = pd.Index(np.arange(3))
labels
# Int64Index([0, 1, 2], dtype='int64')
obj2 = pd.Series([1.5, -2.5, 0], index=labels)
obj2 
 #   0    1.5
 #   1   -2.5
 #   2    0.0
 #   dtype: float64
obj2.index is labels


True

> **The columns attribute is an Index too and behaves the same way**

In [218]:
frame3

state,Nevada,Ohio
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2001,2.4,1.7
2002,2.9,3.6
2003,,1.5


In [219]:
frame3.columns

Index(['Nevada', 'Ohio'], dtype='object', name='state')

In [220]:
'Ohio' in frame3.columns

True

> **When an index contains duplicate labels, selecting such a label returns every matching entry**

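>**Common Index methods and properties**

> - **append: concatenate with another Index, producing a new Index**
> - **difference: compute the set difference as an Index**
> - **intersection: compute the set intersection**  
> - **union: compute the set union**  
> - **isin: compute a boolean array indicating whether each value is contained in the passed collection**  
> - **delete: compute a new Index with the element at position i removed**  
> - **drop: compute a new Index with the passed values removed**   
> - **insert: compute a new Index with an element inserted at position i**  
> - **is_monotonic: True if each element is greater than or equal to the previous one (is_monotonic_increasing in recent pandas)**   
> - **is_unique: True if the Index has no duplicate values**  
> - **unique: compute the array of unique values in the Index**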
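
The set-like methods above can be tried on two small Index objects. This is only a sketch with invented labels; note that every call returns a new object, because an Index is immutable.

```python
import pandas as pd

a = pd.Index(['a', 'b', 'c'])
b = pd.Index(['b', 'c', 'd'])

print(a.union(b).tolist())          # ['a', 'b', 'c', 'd']
print(a.intersection(b).tolist())   # ['b', 'c']
print(a.difference(b).tolist())     # ['a']
print(a.isin(['a', 'z']).tolist())  # [True, False, False]
print(a.delete(0).tolist())         # ['b', 'c']   (removes by position)
print(a.drop('b').tolist())         # ['a', 'c']   (removes by value)
print(a.insert(1, 'x').tolist())    # ['a', 'x', 'b', 'c']
print(a.append(b).is_unique)        # False: 'b' and 'c' now repeat
```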


---
## **Essential Functionality**

### **.reindex(): conform data to a new index (it reorders, adds, or drops labels; it does not rename them)**

In [221]:
import pandas as pd
import numpy as np 
obj = pd.Series([4.5,7.2,-5.3,3.6],index=['d', 'b', 'a', 'c'])
obj2 = obj.reindex(['a','b','c','d','e'])
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

In [222]:
obj3 = pd.Series(['blue','purple','yellow'] , index = [0.,2,4] )
obj3

0.0      blue
2.0    purple
4.0    yellow
dtype: object

In [223]:
obj3.reindex(range(6), method='ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

> **reindex conforms the row index and/or the columns to a new set of labels  
> Main reindex arguments: index, columns, method, fill_value**

In [224]:
frame = pd.DataFrame(np.arange(9).reshape(3,3), # reshape turns the flat array into a 3x3 array
                    index=['a','c','d'],
                    columns=['Ohio', 'Texas', 'California'])
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0


In [225]:
states = ['Texas', 'Utah', 'California','Ohio']
frame.reindex(columns=states)

Unnamed: 0,Texas,Utah,California,Ohio
a,1,,2,0
c,4,,5,3
d,7,,8,6


### **.drop(): remove entries from an axis (in plain terms, delete rows or columns)**
**Using the drop method**

In [226]:
import pandas as pd 
import numpy as np 
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [227]:
obj.drop('c')

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

> **On a DataFrame, labels can be dropped from either axis**

In [228]:
data = pd.DataFrame(np.arange(36).reshape((6, 6)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York','Marsh','Mosh'],
                    columns=['one', 'two', 'three', 'four','five','six'])
data

Unnamed: 0,one,two,three,four,five,six
Ohio,0,1,2,3,4,5
Colorado,6,7,8,9,10,11
Utah,12,13,14,15,16,17
New York,18,19,20,21,22,23
Marsh,24,25,26,27,28,29
Mosh,30,31,32,33,34,35


> **To drop rows, pass a row label or a list of row labels**

In [229]:
data.drop(['Marsh','Mosh']) # drop rows (axis=0 is the default)

Unnamed: 0,one,two,three,four,five,six
Ohio,0,1,2,3,4,5
Colorado,6,7,8,9,10,11
Utah,12,13,14,15,16,17
New York,18,19,20,21,22,23


> **To drop columns, pass the column labels in the same way   
> and add axis=1  
> or axis='columns'**

In [230]:
data.drop('five', axis=1)

Unnamed: 0,one,two,three,four,six
Ohio,0,1,2,3,5
Colorado,6,7,8,9,11
Utah,12,13,14,15,17
New York,18,19,20,21,23
Marsh,24,25,26,27,29
Mosh,30,31,32,33,35


In [231]:
data.drop(['five','six'],axis='columns')

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,6,7,8,9
Utah,12,13,14,15
New York,18,19,20,21
Marsh,24,25,26,27
Mosh,30,31,32,33


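### **Indexing, selection, and filtering**

> **Series indexing (obj[...]) works much like NumPy array indexing, except that index labels can be used as well as integers**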

In [232]:
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj['b'] # look up by index label
obj[1] # look up by position; same result (obj.iloc[1] is the unambiguous form)

1.0

In [233]:
obj[2:4]

c    2.0
d    3.0
dtype: float64

In [234]:
obj[['b', 'a', 'd']] # a list of labels selects several entries, in the order given

b    1.0
a    0.0
d    3.0
dtype: float64

In [235]:
obj[obj < 2]

a    0.0
b    1.0
dtype: float64

> **Slicing with labels includes the endpoint, unlike ordinary Python slicing**

In [236]:
obj['b':'c']

b    1.0
c    2.0
dtype: float64
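
The endpoint rule is easiest to see side by side. A sketch using the same `obj` values, with `loc` and `iloc` written out so that label and positional slicing are explicit:

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

# Label slice: both endpoints are included.
by_label = obj.loc['b':'c']
# Positional slice: the stop position is excluded, as in plain Python.
by_position = obj.iloc[1:2]

print(by_label.tolist())      # [1.0, 2.0]
print(by_position.tolist())   # [1.0]
```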

> **Values can be assigned through an index selection**

In [237]:
obj[['b', 'a', 'd']] = 5
obj

a    5.0
b    5.0
c    2.0
d    5.0
dtype: float64

> **The same indexing works on a DataFrame  
> Special cases: a plain slice or a boolean array selects rows, not columns**

In [238]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                  columns=['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [239]:
data[['three', 'one']] # select columns

Unnamed: 0,three,one
Ohio,2,0
Colorado,6,4
Utah,10,8
New York,14,12


In [240]:
data[:2] # a slice selects rows

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7


In [241]:
data[data['three']>5] # a boolean array selects rows

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


### **Selecting subsets with loc and iloc**
> **loc selects by axis labels  
> iloc selects by integer position**

In [242]:
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [243]:
data.loc['Colorado', ['two', 'three']] # row label first, then column labels, like coordinates

two      5
three    6
Name: Colorado, dtype: int32

In [244]:
data.iloc[2,[3,0,1]]

four    11
one      8
two      9
Name: Utah, dtype: int32

In [245]:
data.iloc[[1, 2], [3, 0, 1]]

Unnamed: 0,four,one,two
Colorado,7,4,5
Utah,11,8,9


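>**[Key notes]  
df[val] selects a single column or a list of columns  
df[slice] or df[bool_array] selects rows  
df.loc[val] selects a single row or a subset of rows by label  
df.loc[:, val] selects a single column or a subset of columns by label  
df.loc[val1, val2] selects the value or subset at the given row and column labels  
df.iloc[where] selects by integer position; usage mirrors loc  
df.at[label_i, label_j] selects a single value by row and column label  
df.iat[i, j] selects a single value by integer row and column position**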
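
The accessors in the notes above can be checked against each other on the same `data` frame used in this section. This is a sketch; all four scalar lookups address the same cell (row 'Utah', column 'two').

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

print(data.loc['Utah', 'two'])               # 9  (labels)
print(data.iloc[2, 1])                       # 9  (positions)
print(data.at['Utah', 'two'])                # 9  (single value, labels)
print(data.iat[2, 1])                        # 9  (single value, positions)
print(data.loc[:, ['one', 'four']].shape)    # (4, 2)
print(data.iloc[:2].index.tolist())          # ['Ohio', 'Colorado']
```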

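### **Integer indexes**
>**Integer labels make plain [] indexing ambiguous (label or position?); non-integer labels do not  
so prefer loc (labels) and iloc (positions)**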

In [246]:
ser = pd.Series(np.arange(3.))
ser
# ser[-1] raises KeyError: with an integer index, -1 is treated as a label


0    0.0
1    1.0
2    2.0
dtype: float64
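
A sketch of the unambiguous alternatives, assuming the same `ser` as above plus a string-labelled `ser2` added for comparison: `iloc` always means position and `loc` always means label, whatever the index contains.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.))                            # integer index 0, 1, 2
ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])    # string index

print(ser.iloc[-1])    # 2.0  (position: last element)
print(ser.loc[1])      # 1.0  (label 1)
print(ser2.iloc[-1])   # 2.0  (position works the same with string labels)
print(ser2.loc['a'])   # 0.0
```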

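### **Arithmetic and data alignment**
> **Arithmetic between objects aligns on index labels; the result's index is the union of both, similar to an SQL outer join**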

In [247]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5],index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

In [248]:
s3 = s1 + s2 
s3 

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

In [249]:
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),index=['Ohio', 'Texas', 'Colorado'])

df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df1

Unnamed: 0,b,c,d
Ohio,0.0,1.0,2.0
Texas,3.0,4.0,5.0
Colorado,6.0,7.0,8.0


In [250]:
df2

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [251]:
df1 + df2

Unnamed: 0,b,c,d,e
Colorado,,,,
Ohio,3.0,,6.0,
Oregon,,,,
Texas,9.0,,12.0,
Utah,,,,


### **Arithmetic methods with fill values**

In [252]:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
                      columns=list('abcd'))

df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
                       columns=list('abcde'))

df2.loc[1, 'b'] = np.nan # set one value to missing

df1 

Unnamed: 0,a,b,c,d
0,0.0,1.0,2.0,3.0
1,4.0,5.0,6.0,7.0
2,8.0,9.0,10.0,11.0


In [253]:
df2

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,4.0
1,5.0,,7.0,8.0,9.0
2,10.0,11.0,12.0,13.0,14.0
3,15.0,16.0,17.0,18.0,19.0


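> **Adding directly gives NaN wherever a row or column label is missing from either frame; only the overlapping positions get a value**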

In [254]:
df1 + df2

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,
1,9.0,,13.0,15.0,
2,18.0,20.0,22.0,24.0,
3,,,,,


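> **With the add method, fill_value=... stands in for a value that is missing from one operand before the addition; a position missing from both operands stays NaN**  
> **The result then covers the union of labels, like an outer join with a default value**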

In [255]:
df1.add(df2,fill_value=1)

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,5.0
1,9.0,6.0,13.0,15.0,10.0
2,18.0,20.0,22.0,24.0,15.0
3,16.0,17.0,18.0,19.0,20.0
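
A compact sketch of what `fill_value` does and does not fill, on two invented single-column frames `x` and `y`: a value missing on one side is replaced before adding, while a position missing on both sides stays NaN.

```python
import numpy as np
import pandas as pd

x = pd.DataFrame({'a': [1.0, np.nan]}, index=[0, 1])
y = pd.DataFrame({'a': [10.0, np.nan, 30.0]}, index=[0, 1, 2])

plain = x + y
filled = x.add(y, fill_value=0)

print(plain['a'].tolist())    # [11.0, nan, nan]
print(filled['a'].tolist())   # [11.0, nan, 30.0]
```

Row 1 is NaN in both inputs, so it stays NaN even with `fill_value=0`; row 2 exists only in `y`, so it is computed as 0 + 30.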


> Arithmetic methods of Series and DataFrame
![](img/%E7%AE%97%E6%9C%AF%E8%AE%A1%E7%AE%97pandas.png)

In [256]:
df1.radd(df2,fill_value=1)

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,5.0
1,9.0,6.0,13.0,15.0,10.0
2,18.0,20.0,22.0,24.0,15.0
3,16.0,17.0,18.0,19.0,20.0


> **A fill value can also be given when reindexing**

In [257]:
df1.reindex(columns=df2.columns, fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,0
1,4.0,5.0,6.0,7.0,0
2,8.0,9.0,10.0,11.0,0


### **Operations between DataFrame and Series**

> **Broadcasting: an operation between a 2-D array and one of its rows is repeated for every row**

In [258]:
arr = np.arange(12.).reshape((3, 4))
arr[0]

array([0., 1., 2., 3.])

In [259]:
arr - arr[0]

array([[0., 0., 0., 0.],
       [4., 4., 4., 4.],
       [8., 8., 8., 8.]])

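> **By default the Series index is matched against the DataFrame's columns and the operation is broadcast down the rows  
> Operations between a Series and a DataFrame broadcast in the same way as the array example**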

In [260]:
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                    columns=list('bde'),
                    index=['Utah', 'Ohio', 'Texas', 'Oregon'])

frame 

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [261]:

series = frame.iloc[0]
series

b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

In [262]:
frame - series

Unnamed: 0,b,d,e
Utah,0.0,0.0,0.0
Ohio,3.0,3.0,3.0
Texas,6.0,6.0,6.0
Oregon,9.0,9.0,9.0


In [263]:
series2 = pd.Series(range(3), index=['b', 'e', 'f'])
series2

b    0
e    1
f    2
dtype: int64

In [264]:
calc = frame + series2
calc

Unnamed: 0,b,d,e,f
Utah,0.0,,3.0,
Ohio,3.0,,6.0,
Texas,6.0,,9.0,
Oregon,9.0,,12.0,


In [265]:
calc1 = calc.add(frame)
calc1

Unnamed: 0,b,d,e,f
Utah,0.0,,5.0,
Ohio,6.0,,11.0,
Texas,12.0,,17.0,
Oregon,18.0,,23.0,


> **To match on the row labels and broadcast across the columns, use an arithmetic method with axis='index'**

In [266]:
series3 = frame['d']
series3

Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64

In [267]:
frame

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [268]:
frame.sub(series3, axis='index') # the axis you pass is the axis to match on

Unnamed: 0,b,d,e
Utah,-1.0,0.0,1.0
Ohio,-1.0,0.0,1.0
Texas,-1.0,0.0,1.0
Oregon,-1.0,0.0,1.0


### **Function application and mapping**

> **NumPy ufuncs (element-wise array functions) also work on pandas objects**

In [269]:
import pandas as pd
import numpy as np 
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                         index=['Utah', 'Ohio', 'Texas', 'Oregon'])

frame

Unnamed: 0,b,d,e
Utah,-0.371497,1.258912,0.252577
Ohio,0.196748,-1.006962,1.467145
Texas,-0.120018,0.456162,-1.082559
Oregon,-2.34649,1.12801,-0.054834


In [270]:
np.abs(frame)

Unnamed: 0,b,d,e
Utah,0.371497,1.258912,0.252577
Ohio,0.196748,1.006962,1.467145
Texas,0.120018,0.456162,1.082559
Oregon,2.34649,1.12801,0.054834


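> **[Key point] df.apply(f) applies a custom function to each column or each row**

In [271]:
frame.sum(axis='columns') # common reductions such as sum are built in, so apply is not needed for them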

Utah      1.139992
Ohio      0.656932
Texas    -0.746415
Oregon   -1.273315
dtype: float64

In [272]:
# by default (axis=0) f is called once per column, so the result is indexed by column name
f = lambda x: x.max() - x.min()
frame.apply(f) 

b    2.543238
d    2.265874
e    2.549705
dtype: float64

In [273]:
# axis='columns' calls f once per row, so the result is indexed by row label
f = lambda x: x.max() - x.min()
frame.apply(f,axis='columns') 

Utah      1.630410
Ohio      2.474107
Texas     1.538721
Oregon    3.474499
dtype: float64
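
Since the `axis` argument of `apply` is easy to read backwards, here is a deterministic sketch on invented integers. The helper name `value_range` is hypothetical; it computes the same max-minus-min spread as the lambda above.

```python
import pandas as pd

df = pd.DataFrame({'b': [1, 5], 'd': [2, 9]}, index=['Utah', 'Ohio'])

def value_range(x):
    """Spread between the largest and smallest value of a Series."""
    return x.max() - x.min()

per_column = df.apply(value_range)                  # axis=0: one call per column
per_row = df.apply(value_range, axis='columns')     # one call per row

print(per_column.index.tolist(), per_column.tolist())   # ['b', 'd'] [4, 7]
print(per_row.index.tolist(), per_row.tolist())         # ['Utah', 'Ohio'] [1, 4]
```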

> **The function passed to apply may also return a Series holding several values**

In [274]:
def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)

Unnamed: 0,b,d,e
min,-2.34649,-1.006962,-1.082559
max,0.196748,1.258912,1.467145


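> **applymap applies a function to every element, for example to format each value as a string (renamed to DataFrame.map in pandas 2.1)**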

In [275]:
format = lambda x: '%.2f' % x
frame.applymap(format)

Unnamed: 0,b,d,e
Utah,-0.37,1.26,0.25
Ohio,0.20,-1.01,1.47
Texas,-0.12,0.46,-1.08
Oregon,-2.35,1.13,-0.05


### **Sorting and ranking**

> **sort_index() sorts by index labels**

In [276]:
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

In [310]:
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                        index=['three', 'one'],
                        columns=['d', 'a', 'b', 'c'])
frame

Unnamed: 0,d,a,b,c
three,0,1,2,3
one,4,5,6,7


In [281]:
frame.sort_index(axis=0)

Unnamed: 0,d,a,b,c
one,4,5,6,7
three,0,1,2,3


In [288]:
frame.sort_index(axis=1,ascending=False)
# axis=1 (or axis='columns') sorts by column labels; axis=0 sorts by row labels
# ascending=False sorts in descending order

Unnamed: 0,d,c,b,a
three,0,3,2,1
one,4,7,6,5


> **sort_values() sorts by value; for a DataFrame, pass by=column_name**


In [290]:
obj = pd.Series([4, 7, -3, 2])
obj.sort_values() # any missing values are placed at the end of the Series by default

2   -3
3    2
0    4
1    7
dtype: int64

> **sort_values(by=['column_name1', 'column_name2'])   
> sorts by several columns, in the order given**

In [292]:
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
frame

Unnamed: 0,b,a
0,4,0
1,7,1
2,-3,0
3,2,1


In [294]:
frame.sort_values(by=['a', 'b'])

Unnamed: 0,b,a
2,-3,0
0,4,0
3,2,1
1,7,1


> **rank() assigns ranks from 1 to n; by default, tied values each get the mean of their ranks**

In [312]:
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
obj

0    7
1   -5
2    7
3    4
4    2
5    0
6    4
dtype: int64

In [313]:
obj.rank()

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

In [314]:
obj.rank(method='first') # ties are ranked in the order they appear in the data

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

In [316]:
obj.rank(ascending=False, method='max') # descending order; ties get the highest rank in their group

0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64
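
The tie-breaking methods side by side, as a sketch on a shortened, invented Series in which the value 7 appears twice and would occupy ranks 3 and 4:

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4])

print(obj.rank().tolist())                 # [3.5, 1.0, 3.5, 2.0]  average (default)
print(obj.rank(method='min').tolist())     # [3.0, 1.0, 3.0, 2.0]
print(obj.rank(method='max').tolist())     # [4.0, 1.0, 4.0, 2.0]
print(obj.rank(method='first').tolist())   # [3.0, 1.0, 4.0, 2.0]
print(obj.rank(method='dense').tolist())   # [3.0, 1.0, 3.0, 2.0]
```

On this data `min` and `dense` agree; they differ once another value follows a tie, because `dense` leaves no gap after the tied group.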

> **rank() also works on a DataFrame, across rows or columns**

In [317]:
frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
                      'c': [-2, 5, 8, -2.5]})
frame

Unnamed: 0,b,a,c
0,4.3,0,-2.0
1,7.0,1,5.0
2,-3.0,0,8.0
3,2.0,1,-2.5


In [319]:
frame.rank(ascending=False,axis='columns')

Unnamed: 0,b,a,c
0,1.0,2.0,3.0
1,1.0,3.0,2.0
2,3.0,2.0,1.0
3,1.0,2.0,3.0


**[Tie-breaking methods for rank]**
![](img/%E6%8E%92%E5%BA%8F.png)

### **Axis indexes with duplicate labels**
> **If a label maps to several entries, selecting it returns a Series rather than a scalar**

In [2]:
import pandas as pd
obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj['a']

a    0
a    1
dtype: int64

---
## **Summarizing and Computing Descriptive Statistics**

In [5]:
import pandas as pd
import numpy as np
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                  [np.nan, np.nan], [0.75, -1.3]],
                     columns=['one', 'two'])
df

Unnamed: 0,one,two
0,1.4,
1,7.1,-4.5
2,,
3,0.75,-1.3


> **Sum each column; NA values are skipped**

In [8]:
df.sum(axis=0)

one    9.25
two   -5.80
dtype: float64

> **Sum each row**

In [7]:
df.sum(axis=1)

0    1.40
1    2.60
2    0.00
3   -0.55
dtype: float64

In [11]:
df.mean(axis=1, skipna=False) # mean of each row; skipna=False makes any NaN propagate

0      NaN
1    1.300
2      NaN
3   -0.275
dtype: float64
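
A sketch of how `skipna` changes row-wise reductions, on an invented frame with one partly missing row and one fully missing row:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1.0, 4.0, np.nan],
                   'two': [np.nan, 2.0, np.nan]})

skipped = df.mean(axis=1)                   # NaN ignored where possible
strict = df.mean(axis=1, skipna=False)      # any NaN makes the row NaN
row_sum = df.sum(axis=1)                    # an all-NaN row sums to 0.0

print(skipped.tolist())   # [1.0, 3.0, nan]
print(strict.tolist())    # [nan, 3.0, nan]
print(row_sum.tolist())   # [1.0, 6.0, 0.0]
```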

In [12]:
df.idxmax()

one    1
two    3
dtype: int64

In [13]:
df.cumsum()

Unnamed: 0,one,two
0,1.4,
1,8.5,-4.5
2,,
3,9.25,-5.8


> **df.describe()**

In [14]:
df.describe()

Unnamed: 0,one,two
count,3.0,2.0
mean,3.083333,-2.9
std,3.493685,2.262742
min,0.75,-4.5
25%,1.075,-3.7
50%,1.4,-2.9
75%,4.25,-2.1
max,7.1,-1.3


In [16]:
obj = pd.Series(['a', 'a', 'b', 'c'] * 4)
obj.describe()

count     16
unique     3
top        a
freq       8
dtype: object

![](img/pandas%E6%96%B9%E6%B3%95%E6%B1%87%E6%80%BB.png)

### **计算相关系数与协方差**

> **The corr method of Series computes the correlation of the overlapping, non-NA, index-aligned values in two Series. Similarly, cov computes the covariance**

```
In [242]: returns = price.pct_change()

In [243]: returns.tail()
Out[243]: 
                AAPL      GOOG       IBM      MSFT
Date                                              
2016-10-17 -0.000680  0.001837  0.002072 -0.003483
2016-10-18 -0.000681  0.019616 -0.026168  0.007690
2016-10-19 -0.002979  0.007846  0.003583 -0.002255
2016-10-20 -0.000512 -0.005652  0.001719 -0.004867
2016-10-21 -0.003930  0.003011 -0.012474  0.042096
```
> **Computing the correlation and the covariance**
```
In [244]: returns['MSFT'].corr(returns['IBM'])
Out[244]: 0.49976361144151144

In [245]: returns['MSFT'].cov(returns['IBM'])
Out[245]: 8.8706554797035462e-05
```

> **The corr and cov methods of DataFrame return the full correlation or covariance matrix as a DataFrame:**
```
In [247]: returns.corr()
Out[247]: 
          AAPL      GOOG       IBM      MSFT
AAPL  1.000000  0.407919  0.386817  0.389695
GOOG  0.407919  1.000000  0.405099  0.465919
IBM   0.386817  0.405099  1.000000  0.499764
MSFT  0.389695  0.465919  0.499764  1.000000

In [248]: returns.cov()
Out[248]: 
          AAPL      GOOG       IBM      MSFT
AAPL  0.000277  0.000107  0.000078  0.000095
GOOG  0.000107  0.000251  0.000078  0.000108
IBM   0.000078  0.000078  0.000146  0.000089
MSFT  0.000095  0.000108  0.000089  0.000215
```

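> **With DataFrame's corrwith method you can compute pairwise correlations between a DataFrame's columns or rows and another Series or DataFrame. Passing a Series returns a Series of correlation values (computed for each column):**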

```
In [249]: returns.corrwith(returns.IBM)
Out[249]: 
AAPL    0.386817
GOOG    0.405099
IBM     1.000000
MSFT    0.499764
dtype: float64
```
> **Passing a DataFrame computes the correlations of matching column names**
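
The `price` data used in the transcripts above is not included in these notes, so the following is only a runnable sketch with invented prices and hypothetical tickers `AAA` and `BBB`. It shows that `Series.corr`, `DataFrame.corr`, and `corrwith` agree with one another.

```python
import pandas as pd

# Hypothetical prices, invented for illustration only.
price = pd.DataFrame({'AAA': [10.0, 11.0, 12.1, 11.5],
                      'BBB': [20.0, 21.0, 23.1, 22.0]})

returns = price.pct_change()      # the first row is NaN by construction

pair = returns['AAA'].corr(returns['BBB'])       # NaN rows are ignored
matrix = returns.corr()                          # full correlation matrix
against_aaa = returns.corrwith(returns['AAA'])   # one value per column

print(abs(matrix.loc['AAA', 'BBB'] - pair) < 1e-9)   # True
print(abs(against_aaa['BBB'] - pair) < 1e-9)         # True
print(round(against_aaa['AAA'], 6))                  # 1.0
```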

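### **Unique values, value counts, and membership**

> **unique returns an array of the distinct values, in order of first appearance  
uniques.sort() sorts that array in place  
value_counts computes how often each value occurs**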

In [19]:
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
uniques = obj.unique()
uniques

array(['c', 'a', 'd', 'b'], dtype=object)

![](img/%E5%94%AF%E4%B8%80%E5%80%BC%E6%96%B9%E6%B3%95.png)

In [20]:
data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                    'Qu2': [2, 3, 1, 2, 3],
                    'Qu3': [1, 5, 2, 4, 4]})
data

Unnamed: 0,Qu1,Qu2,Qu3
0,1,2,1
1,3,3,5
2,4,1,2
3,3,2,4
4,4,3,4


In [21]:
result = data.apply(pd.value_counts).fillna(0)
result

Unnamed: 0,Qu1,Qu2,Qu3
1,1.0,1.0,1.0
2,0.0,2.0,1.0
3,2.0,2.0,0.0
4,2.0,0.0,2.0
5,0.0,0.0,1.0
