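# Getting Started with pandas
- Data structures with automatic or explicit alignment along axes. This prevents many common errors caused by misaligned data and by data from different sources that are indexed differently.
- Integrated time series functionality.
- Data structures that handle both time series and non-time-series data.
- Mathematical operations and reductions that can be performed according to different metadata.
- Flexible handling of missing data.
- Merges and other relational operations found in common databases.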

In [1]:
from pandas import Series,DataFrame
import pandas as pd

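## Introduction to pandas data structures
### Series
- A Series is a one-dimensional array-like object made up of an array of data and an associated array of data labels, called its index.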

In [2]:
obj = Series([4,7,-3,-5])

In [3]:
obj

0    4
1    7
2   -3
3   -5
dtype: int64

In [4]:
obj.values  # get the values

array([ 4,  7, -3, -5], dtype=int64)

In [5]:
obj.index  # get the index

RangeIndex(start=0, stop=4, step=1)

In [6]:
obj2 = Series([4,7,-2,3],index=['a','b','c','s'])

In [7]:
obj2

a    4
b    7
c   -2
s    3
dtype: int64

In [8]:
obj2['a']

4

In [9]:
obj2 *2

a     8
b    14
c    -4
s     6
dtype: int64

In [10]:
obj2[obj2>0]

a    4
b    7
s    3
dtype: int64

In [11]:
'b' in obj2 

True

### The pandas isnull and notnull functions detect missing data

In [12]:
pd.isnull(obj2)

a    False
b    False
c    False
s    False
dtype: bool

In [13]:
pd.notnull(obj2)

a    True
b    True
c    True
s    True
dtype: bool
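
The Series above has no missing entries, so both checks return a constant result. As a minimal sketch, assuming a hypothetical Series `obj_missing` that is not part of the original notebook, the two functions behave like this when a value really is missing:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the entry labelled 'b' is deliberately missing
obj_missing = pd.Series([4.0, np.nan, -2.0], index=['a', 'b', 'c'])

mask_null = pd.isnull(obj_missing)     # True where a value is missing
mask_notnull = obj_missing.notnull()   # method form; the element-wise opposite

print(mask_null)
print(mask_notnull)
```

Both the top-level functions and the instance methods return a boolean Series with the same index as the input.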

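### Series automatically aligns differently indexed data in arithmetic operations
### Both a Series and its index have a name attribute, which is closely tied to other key areas of pandas functionality.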

In [14]:
obj2.name = 'population'
obj2.index.name = 'state'

In [15]:
obj2

state
a    4
b    7
c   -2
s    3
Name: population, dtype: int64

## DataFrame
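**A DataFrame is a tabular data structure containing an ordered collection of columns, each of which can hold a different value type (numeric, string, boolean, etc.). A DataFrame has both a row index and a column index, and it can be thought of as a dict of Series.**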

In [16]:
data = {'state':['ohio','ohio','sdfg','asdf','asdfaw'],
       'year':[2000,2001,2002,2003,2004],
       'pop':[1,2,3,4,3]}
frame = DataFrame(data)
frame

Unnamed: 0,pop,state,year
0,1,ohio,2000
1,2,ohio,2001
2,3,sdfg,2002
3,4,asdf,2003
4,3,asdfaw,2004


In [17]:
DataFrame(data,columns=['year','state','pop'])

Unnamed: 0,year,state,pop
0,2000,ohio,1
1,2001,ohio,2
2,2002,sdfg,3
3,2003,asdf,4
4,2004,asdfaw,3


In [18]:
frame2 = DataFrame(data,columns=['year','state','pop','debt'])

In [19]:
frame2

Unnamed: 0,year,state,pop,debt
0,2000,ohio,1,
1,2001,ohio,2,
2,2002,sdfg,3,
3,2003,asdf,4,
4,2004,asdfaw,3,


In [20]:
frame2['debt'] = 16.5

In [21]:
frame2

Unnamed: 0,year,state,pop,debt
0,2000,ohio,1,16.5
1,2001,ohio,2,16.5
2,2002,sdfg,3,16.5
3,2003,asdf,4,16.5
4,2004,asdfaw,3,16.5


In [22]:
 del frame2['debt']

In [23]:
frame2.columns

Index(['year', 'state', 'pop'], dtype='object')

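## Nested dicts are another common form of data

### They are interpreted as follows: the outer dict keys become the columns and the inner keys become the row index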

In [24]:
pop = {'nevada':{2001:2.4,2002:2.9},
      'ohio':{2003:3.4,2004:4.5}}

In [25]:
frame3 =DataFrame(pop)

In [26]:
frame3

Unnamed: 0,nevada,ohio
2001,2.4,
2002,2.9,
2003,,3.4
2004,,4.5


## Data that can be passed to the DataFrame constructor
![mark](http://p6yio0wew.bkt.clouddn.com/blog/180517/0B1eHeeelJ.png)

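## Index objects
### pandas Index objects are responsible for holding the axis labels and other metadata (such as the axis name). Any array or other sequence of labels used when constructing a Series or DataFrame is converted to an Index.
#### Index objects are immutable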
![mark](http://p6yio0wew.bkt.clouddn.com/blog/180517/6aI76DilIb.png)
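
A minimal sketch of what immutability means in practice, assuming a small hypothetical Series `obj_demo`: assigning to a position of its Index raises a `TypeError`, and the labels stay unchanged.

```python
import pandas as pd

# Hypothetical Series used only for this illustration
obj_demo = pd.Series([4, 7, -3], index=['a', 'b', 'c'])
index = obj_demo.index

try:
    index[1] = 'z'      # in-place modification of an Index is rejected
    modified = True
except TypeError:
    modified = False

print(modified)
print(list(index))
```

Because an Index cannot change underneath you, it can be shared safely between several data structures.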

### Index methods and properties
![mark](http://p6yio0wew.bkt.clouddn.com/blog/180517/GGJD20kDA9.png)

## Basic functionality
### Reindexing
#### Interpolation options for reindexing
![mark](http://p6yio0wew.bkt.clouddn.com/blog/180517/kh1j0ah1g7.png)
Use the columns= keyword to reindex the columns.

In [27]:
obj2.reindex(['a','c','b','s','d'],fill_value = 0)

state
a    4
c   -2
b    7
s    3
d    0
Name: population, dtype: int64
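
The interpolation option and the `columns=` keyword are not exercised by the cell above. The following sketch uses hypothetical data (`colors`, `frame_demo`) to show forward filling during a reindex and reindexing along the columns:

```python
import pandas as pd

# Hypothetical ordered Series with gaps in its integer index
colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# method='ffill' carries the last valid value forward into the new labels
filled = colors.reindex(range(6), method='ffill')

# Reindexing columns: existing columns are reordered, unknown ones are added as NaN
frame_demo = pd.DataFrame({'year': [2000, 2001], 'pop': [1, 2]})
reordered = frame_demo.reindex(columns=['pop', 'year', 'debt'])

print(filled)
print(reordered)
```

Forward filling requires the original index to be monotonic, which is why the example uses sorted integer labels.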

### reindex parameters
![mark](http://p6yio0wew.bkt.clouddn.com/blog/180517/2e1fhi8LAF.png)

## Dropping entries from an axis
### The drop method returns a new object with the specified values removed from the given axis.

In [28]:
obj2.drop(['a'])

state
b    7
c   -2
s    3
Name: population, dtype: int64
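
For a DataFrame, `drop` can remove labels from either axis. A small sketch, assuming a hypothetical frame `frame_demo`:

```python
import numpy as np
import pandas as pd

# Hypothetical 3x3 frame used only for this illustration
frame_demo = pd.DataFrame(np.arange(9).reshape((3, 3)),
                          index=['x', 'y', 'z'],
                          columns=['a', 'b', 'c'])

without_row = frame_demo.drop('y')                   # rows are dropped by default (axis=0)
without_cols = frame_demo.drop(['a', 'c'], axis=1)   # axis=1 drops column labels

print(without_row)
print(without_cols)
```

In both cases the original `frame_demo` is left untouched, since `drop` returns a new object.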

In [29]:
frame[:2];frame<5

Unnamed: 0,pop,state,year
0,True,True,False
1,True,True,False
2,True,True,False
3,True,True,False
4,True,True,False


In [30]:
frame[frame['pop']>2]

Unnamed: 0,pop,state,year
2,3,sdfg,2002
3,4,asdf,2003
4,3,asdfaw,2004


### Indexing options with DataFrame
![mark](http://p6yio0wew.bkt.clouddn.com/blog/180517/8mg6E2DiAC.png)

## Arithmetic and data alignment

In [31]:
obj

0    4
1    7
2   -3
3   -5
dtype: int64

In [32]:
obj2+obj

a   NaN
b   NaN
c   NaN
s   NaN
0   NaN
1   NaN
2   NaN
3   NaN
dtype: float64
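
The result above is all NaN because `obj` and `obj2` share no labels. A sketch with two hypothetical Series (`s1`, `s2`) whose indexes partly overlap makes the alignment rule easier to see:

```python
import pandas as pd

# Hypothetical Series sharing only the labels 'b' and 'c'
s1 = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
s2 = pd.Series([10.0, 20.0, 30.0], index=['b', 'c', 'd'])

total = s1 + s2                           # union of labels; NaN where only one side has a value
total_filled = s1.add(s2, fill_value=0)   # treat the missing side as 0 instead

print(total)
print(total_filled)
```

Labels present in both operands are added; labels present in only one operand become NaN unless a `fill_value` is supplied.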

### Arithmetic methods with fill values

In [33]:
import numpy as np
df1 = DataFrame(np.arange(12).reshape((3,4)),columns = list('abcd'))
df2 = DataFrame(np.arange(20).reshape((4,5)),columns = list('abcde'))

In [34]:
df1+df2

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,
1,9.0,11.0,13.0,15.0,
2,18.0,20.0,22.0,24.0,
3,,,,,


![mark](http://p6yio0wew.bkt.clouddn.com/blog/180517/bLcDbD2ihG.png)

In [35]:
df1.add(df2,fill_value= 0 )

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,4.0
1,9.0,11.0,13.0,15.0,9.0
2,18.0,20.0,22.0,24.0,14.0
3,15.0,16.0,17.0,18.0,19.0


## Operations between DataFrame and Series
### This example illustrates broadcasting

In [36]:
arr = np.arange(12).reshape((3,4))

In [37]:
arr-arr[0]

array([[0, 0, 0, 0],
       [4, 4, 4, 4],
       [8, 8, 8, 8]])

In [38]:
df1

Unnamed: 0,a,b,c,d
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11


In [39]:
series = df1.iloc[0]


In [40]:
series

a    0
b    1
c    2
d    3
Name: 0, dtype: int32

In [41]:
df1-series

Unnamed: 0,a,b,c,d
0,0,0,0,0
1,4,4,4,4
2,8,8,8,8
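
By default, arithmetic between a DataFrame and a Series matches the Series index against the columns and broadcasts down the rows. To broadcast over the columns instead, matching on the row index, one option is an arithmetic method with `axis=0`. A sketch assuming a frame `df_demo` built the same way as `df1`:

```python
import numpy as np
import pandas as pd

# Same shape and values as df1 above, rebuilt here so the block is self-contained
df_demo = pd.DataFrame(np.arange(12).reshape((3, 4)), columns=list('abcd'))

col_a = df_demo['a']                     # Series indexed by the row labels 0, 1, 2
result = df_demo.sub(col_a, axis=0)      # subtract column 'a' from every column

print(result)
```

Each row of `result` equals `[0, 1, 2, 3]`, because every row has had its own value from column `a` subtracted.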


## Function application and mapping
### NumPy ufuncs (element-wise array methods) also work on pandas objects.

In [42]:
np.abs(df1)

Unnamed: 0,a,b,c,d
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11


In [43]:
f = lambda x: x.max()-x.min()
df1.apply(f)

a    8
b    8
c    8
d    8
dtype: int64

In [44]:
def f(x):
    return Series([x.min(),x.max()],index = ['min','max'])
df1.apply(f)

Unnamed: 0,a,b,c,d
min,0,1,2,3
max,8,9,10,11


## Element-wise Python functions also work, but they must go through .applymap() (renamed DataFrame.map in pandas 2.1)

In [45]:
f = lambda x: '%.2f' % x
df1.applymap(f)

Unnamed: 0,a,b,c,d
0,0.00,1.00,2.00,3.00
1,4.00,5.00,6.00,7.00
2,8.00,9.00,10.00,11.00


## Sorting and ranking
### To sort by row or column index, use the sort_index method, which returns a new, sorted object

In [60]:
obj.reindex([2,3,0,1])

2   -3
3   -5
0    4
1    7
dtype: int64

In [64]:
df1.reindex([2,0,1])

Unnamed: 0,a,b,c,d
2,8,9,10,11
0,0,1,2,3
1,4,5,6,7


In [65]:
df1.sort_values(by='c')


Unnamed: 0,a,b,c,d
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
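
The cells above do not call `sort_index` directly. As a sketch, assuming a hypothetical frame `frame_demo` with an unsorted index, sorting by index on either axis and sorting by column values look like this:

```python
import pandas as pd

# Hypothetical frame with unsorted row labels and unsorted column names
frame_demo = pd.DataFrame({'b': [4, 7, -3], 'a': [0, 1, 0]}, index=[2, 0, 1])

by_index = frame_demo.sort_index()                           # sort rows by index label
by_columns = frame_demo.sort_index(axis=1)                   # sort columns by name
by_value = frame_demo.sort_values(by='b', ascending=False)   # sort rows by the values in 'b'

print(by_index)
print(by_columns)
print(by_value)
```

`sort_index` orders by labels and `sort_values` orders by data; both return new objects rather than sorting in place.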


## Ranking is closely related to sorting: it assigns rank values from 1 up to the number of valid data points in the array

In [66]:
obj.rank(ascending=False,method='max')

0    2.0
1    1.0
2    3.0
3    4.0
dtype: float64
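
The `method` argument only matters when values tie, which the Series above does not contain. A sketch with a hypothetical Series `obj_ties` that has one tie:

```python
import pandas as pd

# Hypothetical data: the value 7 appears twice
obj_ties = pd.Series([7, -5, 7, 4])

rank_average = obj_ties.rank()                 # default: tied values share the mean rank
rank_first = obj_ties.rank(method='first')     # ties broken by order of appearance
rank_max = obj_ties.rank(method='max')         # tied values all get the highest rank in the group

print(rank_average)
print(rank_first)
print(rank_max)
```

The two 7s occupy ranks 3 and 4, so they receive 3.5 each under `average`, 3 and 4 under `first`, and 4 each under `max`.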

![mark](http://p6yio0wew.bkt.clouddn.com/blog/180517/69F26lfKl7.png)

### Indexes with duplicate values

In [68]:
obj = Series(range(5),index=['a','a','b','b','c'])

In [69]:
obj.index.is_unique

False

In [70]:
obj['a']

a    0
a    1
dtype: int32

In [71]:
df = DataFrame(np.random.randn(4,3),index= ['a','a','b','b'])

In [73]:
df

Unnamed: 0,0,1,2
a,-0.434697,-0.684654,0.919133
a,-0.203852,0.000963,-0.675268
b,0.293825,1.043958,0.350277
b,-2.344953,1.758543,-0.101875


In [74]:
df.loc['b']


Unnamed: 0,0,1,2
b,0.293825,1.043958,0.350277
b,-2.344953,1.758543,-0.101875


## Summarizing and computing descriptive statistics

In [75]:
df

Unnamed: 0,0,1,2
a,-0.434697,-0.684654,0.919133
a,-0.203852,0.000963,-0.675268
b,0.293825,1.043958,0.350277
b,-2.344953,1.758543,-0.101875


In [76]:
df.sum()

0   -2.689677
1    2.118810
2    0.492267
dtype: float64

In [77]:
df.sum(axis = 1)

a   -0.200218
a   -0.878156
b    1.688060
b   -0.688285
dtype: float64
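
Reductions skip missing values by default. The sketch below uses a hypothetical frame `df_demo` that contains NaN to show the default behaviour and the effect of `skipna=False`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values
df_demo = pd.DataFrame({'one': [1.4, 7.1, np.nan],
                        'two': [np.nan, -4.5, np.nan]},
                       index=['a', 'b', 'c'])

col_sums = df_demo.sum()                         # NaN is skipped: one=8.5, two=-4.5
row_means = df_demo.mean(axis=1, skipna=False)   # any NaN in a row makes that row's mean NaN

print(col_sums)
print(row_means)
```

Only row `b` has values in both columns, so it is the only row with a defined mean when `skipna=False`.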

![mark](http://p6yio0wew.bkt.clouddn.com/blog/180517/LcaaAKF5fJ.png)
![mark](http://p6yio0wew.bkt.clouddn.com/blog/180517/fL7I6K0cHD.png)
![mark](http://p6yio0wew.bkt.clouddn.com/blog/180517/0gkGDik4DH.png)

## Correlation and covariance

In [86]:
from pandas import io

![mark](http://p6yio0wew.bkt.clouddn.com/blog/180517/HhhiA4KDmd.png)
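
The notebook has no runnable cell for this section, so here is a minimal sketch on small hand-made data (the names `x`, `y`, and `frame_demo` are hypothetical). It uses the `corr` and `cov` methods of Series and the `corr` method of DataFrame:

```python
import pandas as pd

# Hypothetical data: y is exactly 2*x, z decreases as x increases
x = pd.Series([1.0, 2.0, 3.0, 4.0])
y = pd.Series([2.0, 4.0, 6.0, 8.0])

corr_xy = x.corr(y)   # Pearson correlation of the aligned, non-NA values
cov_xy = x.cov(y)     # sample covariance (normalized by N - 1)

frame_demo = pd.DataFrame({'x': x, 'y': y, 'z': [4.0, 3.0, 2.0, 1.0]})
corr_matrix = frame_demo.corr()   # pairwise correlations between all columns

print(corr_xy, cov_xy)
print(corr_matrix)
```

A perfectly linear increasing relationship gives a correlation of 1, and a perfectly linear decreasing one gives -1.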

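## Handling missing data
### All descriptive statistics on pandas objects exclude missing data.
### pandas uses the floating-point value NaN (Not a Number) to represent missing data in both floating-point and non-floating-point arrays. It is simply a sentinel that is easy to detect.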

![mark](http://p6yio0wew.bkt.clouddn.com/blog/180517/CaH6BGf1LA.png)

## Filtering out missing data
## For a Series, dropna returns a Series containing only the non-null data and their index values.

In [88]:
from numpy import nan as NA

In [90]:
data =  Series([1,NA,2.5,NA,7])

In [91]:
data.dropna()

0    1.0
2    2.5
4    7.0
dtype: float64

In [92]:
data[4] = NA

In [93]:
data

0    1.0
1    NaN
2    2.5
3    NaN
4    NaN
dtype: float64

In [94]:
df = DataFrame(np.random.randn(7,4))

In [96]:
df.loc[:4, 1] = NA; df.loc[:2, 2] = 2

In [97]:
df

Unnamed: 0,0,1,2,3
0,0.140793,,2.0,-1.594434
1,0.041085,,2.0,1.454171
2,0.511884,,2.0,-0.610111
3,0.263047,,0.744005,-0.072165
4,0.743003,,2.325666,0.811782
5,-1.477293,-0.438473,1.332082,-0.078039
6,-1.471648,0.717153,1.229606,0.852795


In [103]:
df.dropna()

Unnamed: 0,0,1,2,3
5,-1.477293,-0.438473,1.332082,-0.078039
6,-1.471648,0.717153,1.229606,0.852795
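
By default `dropna` on a DataFrame removes every row that contains at least one missing value. A sketch with a hypothetical frame `df_demo` shows two less aggressive options, `how='all'` and `thresh`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 2 is entirely missing, rows 1 and 3 are partly missing
df_demo = pd.DataFrame([[1.0, 6.5, 3.0],
                        [1.0, np.nan, np.nan],
                        [np.nan, np.nan, np.nan],
                        [np.nan, 6.5, 3.0]])

any_na_dropped = df_demo.dropna()            # default: drop rows with any NaN
all_na_dropped = df_demo.dropna(how='all')   # drop only rows that are entirely NaN
at_least_two = df_demo.dropna(thresh=2)      # keep rows with at least 2 non-NaN values

print(any_na_dropped)
print(all_na_dropped)
print(at_least_two)
```

Passing `axis=1` applies the same rules to columns instead of rows.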


## Filling in missing data

### Calling fillna with a constant replaces missing values with that constant

In [104]:
df.fillna(1333333)

Unnamed: 0,0,1,2,3
0,0.140793,1333333.0,2.0,-1.594434
1,0.041085,1333333.0,2.0,1.454171
2,0.511884,1333333.0,2.0,-0.610111
3,0.263047,1333333.0,0.744005,-0.072165
4,0.743003,1333333.0,2.325666,0.811782
5,-1.477293,-0.4384729,1.332082,-0.078039
6,-1.471648,0.7171526,1.229606,0.852795


In [108]:
df.fillna({1:0.5,2:-1})

Unnamed: 0,0,1,2,3
0,0.140793,0.5,2.0,-1.594434
1,0.041085,0.5,2.0,1.454171
2,0.511884,0.5,2.0,-0.610111
3,0.263047,0.5,0.744005,-0.072165
4,0.743003,0.5,2.325666,0.811782
5,-1.477293,-0.438473,1.332082,-0.078039
6,-1.471648,0.717153,1.229606,0.852795


### The interpolation methods available for reindex can also be used with fillna.

In [None]:
df = DataFrame(np.random.randn(6,3))

In [112]:
df.loc[2:, 1] = NA; df.loc[4:, 2] = NA


In [113]:
df

Unnamed: 0,0,1,2
0,0.967029,-1.402508,-0.277573
1,-1.706497,-0.501296,0.311105
2,1.16181,,0.521435
3,2.978255,,-0.175206
4,0.295963,,
5,0.799385,,
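
The frame above is set up for forward filling, but the fill itself is not shown. The sketch below uses a hypothetical frame `df_demo` with the same missing-value pattern in miniature, and the `ffill` method, which is the method form of forward filling in current pandas (older code wrote `fillna(method='ffill')`):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column 1 is missing from row 1 on, column 2 from row 2 on
df_demo = pd.DataFrame({0: [1.0, 2.0, 3.0, 4.0],
                        1: [5.0, np.nan, np.nan, np.nan],
                        2: [9.0, 8.0, np.nan, np.nan]})

filled_all = df_demo.ffill()             # carry the last valid value forward without limit
filled_limited = df_demo.ffill(limit=1)  # fill at most one consecutive NaN per gap

print(filled_all)
print(filled_limited)
```

With `limit=1`, only the first missing value after each valid observation is filled, and the rest stay NaN.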


![mark](http://p6yio0wew.bkt.clouddn.com/blog/180517/43jhkFlj9c.png)
![mark](http://p6yio0wew.bkt.clouddn.com/blog/180517/5Gf45gD40L.png)

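## Hierarchical indexing
### It lets you have multiple (two or more) index levels on a single axis. Put more abstractly, it lets you work with higher-dimensional data in a lower-dimensional form.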

In [115]:
data = Series(np.random.randn(10),index = [['a','a','a','b','b','b','c','c','d','d'],[1,2,3,1,2,3,1,2,2,3]])

In [116]:
data

a  1   -0.683400
   2   -0.734540
   3   -0.702336
b  1   -1.988585
   2    1.441567
   3    0.886949
c  1   -1.381591
   2   -1.003692
d  2    0.031825
   3    0.561878
dtype: float64

In [117]:
data.index

MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])

In [118]:
data['b']

1   -1.988585
2    1.441567
3    0.886949
dtype: float64

In [119]:
data['b':'c']

b  1   -1.988585
   2    1.441567
   3    0.886949
c  1   -1.381591
   2   -1.003692
dtype: float64
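
The selections above use only the outer level. Selecting on the inner level is also possible. A sketch with a small hypothetical Series `data_demo` (fixed values rather than random ones, so the result is predictable):

```python
import pandas as pd

# Hypothetical two-level Series: outer level is a letter, inner level is an integer
data_demo = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0],
                      index=[['a', 'a', 'b', 'b', 'c'], [1, 2, 1, 2, 2]])

inner_two = data_demo.loc[:, 2]         # every outer label, inner label 2
same_via_xs = data_demo.xs(2, level=1)  # equivalent cross-section on the second level

print(inner_two)
print(same_via_xs)
```

Both results are indexed by the remaining outer level only.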

## Hierarchical indexing plays an important role in reshaping data and in group-based operations

In [121]:
data.unstack()

Unnamed: 0,1,2,3
a,-0.6834,-0.73454,-0.702336
b,-1.988585,1.441567,0.886949
c,-1.381591,-1.003692,
d,,0.031825,0.561878


In [122]:
data.unstack().stack()

a  1   -0.683400
   2   -0.734540
   3   -0.702336
b  1   -1.988585
   2    1.441567
   3    0.886949
c  1   -1.381591
   2   -1.003692
d  2    0.031825
   3    0.561878
dtype: float64

## The DataFrame set_index function converts one or more columns into the row index and creates a new DataFrame.
## By default those columns are removed from the DataFrame, but they can be kept.
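
A minimal sketch of both behaviours, assuming a hypothetical frame `frame_demo`. It also shows `reset_index`, which moves the index levels back into columns:

```python
import pandas as pd

# Hypothetical frame used only for this illustration
frame_demo = pd.DataFrame({'a': range(4),
                           'c': ['one', 'one', 'two', 'two'],
                           'd': [0, 1, 0, 1]})

indexed = frame_demo.set_index(['c', 'd'])             # 'c' and 'd' leave the columns
kept = frame_demo.set_index(['c', 'd'], drop=False)    # 'c' and 'd' stay as columns too
restored = indexed.reset_index()                       # inverse operation

print(indexed)
print(kept)
print(restored)
```

`frame_demo` itself is not modified; each call returns a new DataFrame.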

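## Integer indexing
### For consistency, if your axis index contains integer labels, data selection by integer is always label-oriented. This includes slicing with ix (ix has since been removed from pandas; loc is the label-based replacement).

### If you need reliable position-based indexing regardless of the index type, use iloc. The iget_value method of Series and the irow and icol methods of DataFrame served this purpose in early pandas versions but have since been removed.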
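
The difference between label-based and position-based selection is easiest to see when the integer labels are not in positional order. A sketch with hypothetical objects `ser` and `frame_demo`, using `loc` and `iloc`:

```python
import pandas as pd

# Hypothetical Series whose integer labels differ from the positions
ser = pd.Series([10, 20, 30], index=[2, 0, 1])

by_label = ser.loc[0]       # the element labelled 0
by_position = ser.iloc[0]   # the first element, whatever its label

frame_demo = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]}, index=[2, 0, 1])
first_row = frame_demo.iloc[0]        # row at position 0 (its label is 2)
second_col = frame_demo.iloc[:, 1]    # column at position 1 ('y')

print(by_label, by_position)
print(first_row)
print(second_col)
```

`iloc` never consults the labels, so it behaves the same way for integer, string, or duplicate indexes.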