## DataFrame
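> A DataFrame represents a rectangular table of data. It contains an ordered collection of columns, and each column can hold a different value type.
> A DataFrame has both a row index and a column index; it can be thought of as a dict of Series that all share the same index.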
### Constructing a DataFrame
#### From a dict of equal-length lists or NumPy arrays

In [20]:
import pandas as pd
import numpy as np

In [2]:
data = {
    'state': ['Ohio','Ohio','Ohio','Nevada','Nevada','Nevada'],
    'year': [2000,2001,2002,2001,2002,2003],
    'pop': [1.5,1.7,3.6,2.4,2.9,3.2]
}
frame = pd.DataFrame(data)
print(frame)

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2
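
The description of a DataFrame as a dict of Series sharing one index can be made concrete with a small sketch. The names `state`, `year`, and `df` and the labels `'a'` and `'b'` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Two Series that share the same index labels (illustrative data).
state = pd.Series(['Ohio', 'Nevada'], index=['a', 'b'])
year = pd.Series([2000, 2001], index=['a', 'b'])

# A dict of Series becomes a DataFrame: the keys turn into column names
# and the shared labels become the row index.
df = pd.DataFrame({'state': state, 'year': year})

# Each column, taken on its own, is a Series carrying that same index.
print(type(df['state']).__name__)
print(list(df['year'].index))
```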


### Selecting only the first few rows

In [6]:
# The argument to head controls how many rows are shown; the default is 5
print(frame.head(3))

  state  year  pop
0  Ohio  2000  1.5
1  Ohio  2001  1.7
2  Ohio  2002  3.6
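
As a small sketch of the default behaviour, the snippet below calls `head` with no argument on the same six-row data. It also uses `tail`, which the original notes do not cover; it is included here only as the mirror image of `head`. The names `first_five` and `last_two` are illustrative.

```python
import pandas as pd

data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002, 2003],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2],
}
frame = pd.DataFrame(data)

first_five = frame.head()   # no argument: the first 5 rows
last_two = frame.tail(2)    # the last 2 rows

print(len(first_five))
print(last_two)
```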


### Arranging columns in a specified order

In [10]:
frame = pd.DataFrame(data,columns=['year','state','pop'])
print(frame)

   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
5  2003  Nevada  3.2


The `columns` argument can also be used to keep only the columns you need

In [9]:
frame = pd.DataFrame(data,columns=['state','pop'])
print(frame)

    state  pop
0    Ohio  1.5
1    Ohio  1.7
2    Ohio  3.6
3  Nevada  2.4
4  Nevada  2.9
5  Nevada  3.2


If a column you pass in is not a key of the dict, it appears in the result filled with missing values

In [12]:
frame = pd.DataFrame(data,columns=['state','pop','attr1'])
print(frame)

    state  pop attr1
0    Ohio  1.5   NaN
1    Ohio  1.7   NaN
2    Ohio  3.6   NaN
3  Nevada  2.4   NaN
4  Nevada  2.9   NaN
5  Nevada  3.2   NaN
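
One way to confirm which requested columns came back empty is to check them with `isna`. This is a minimal sketch on a shortened version of the data; the names `all_missing` and `missing_cols` are illustrative.

```python
import pandas as pd

data = {'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]}
frame = pd.DataFrame(data, columns=['state', 'pop', 'attr1'])

# True when every value in the column is missing.
all_missing = frame['attr1'].isna().all()
print(all_missing)

# Requested columns that were not keys of the source dict.
missing_cols = [col for col in frame.columns if col not in data]
print(missing_cols)
```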


### Retrieving a DataFrame column as a Series by dict-like notation or by attribute

In [13]:
frame2 = pd.DataFrame(data,columns=['year','state','pop','debt'],index=['one','two','three','four','five','six'])
print(frame2)

       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
six    2003  Nevada  3.2  NaN


In [14]:
print(frame2.columns)

Index(['year', 'state', 'pop', 'debt'], dtype='object')


In [15]:
# Method 1: dict-like notation
print(frame2['state'])

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object


In [17]:
# Method 2: attribute access
print(frame2.year)

one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64


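<font color="red">frame2[column] works for any column name, but frame2.column works only when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute or method</font>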
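
The difference between the two access styles shows up with awkward column names. In the sketch below, the frame `df` and the column `'my col'` are illustrative; `pop` is used because it is both a column in this notebook's data and the name of a real DataFrame method.

```python
import pandas as pd

df = pd.DataFrame({'pop': [1.5, 1.7], 'my col': [1, 2]})

# Bracket notation works for any column name, including one with a space.
print(df['my col'].tolist())
print(df['pop'].tolist())

# 'pop' is also a DataFrame method, so attribute access returns the
# method rather than the column.
print(callable(df.pop))
```

For this reason bracket notation is the safer default in scripts, with attribute access kept for interactive exploration.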

### Retrieving a DataFrame row as a Series by label with loc

In [18]:
print(frame2.loc['three'])

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object
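
`loc` selects by index label. As a hedged companion, the sketch below compares it with `iloc`, which selects by integer position and is not covered in the original notes. The shortened frame and the names `by_label` and `by_position` are illustrative.

```python
import pandas as pd

frame2 = pd.DataFrame(
    {'year': [2000, 2001, 2002], 'state': ['Ohio', 'Ohio', 'Nevada']},
    index=['one', 'two', 'three'],
)

by_label = frame2.loc['three']   # row whose index label is 'three'
by_position = frame2.iloc[2]     # row at integer position 2 (the third row)

# Both return the same row as a Series named after its index label.
print(by_label.name)
print(by_label.equals(by_position))
```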


### Assigning a scalar value or an array of values to a column

In [19]:
frame2['debt'] = 16.5
print(frame2)

       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5
six    2003  Nevada  3.2  16.5


In [21]:
frame2['debt'] = np.arange(6.)
print(frame2)

       year   state  pop  debt
one    2000    Ohio  1.5   0.0
two    2001    Ohio  1.7   1.0
three  2002    Ohio  3.6   2.0
four   2001  Nevada  2.4   3.0
five   2002  Nevada  2.9   4.0
six    2003  Nevada  3.2   5.0


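+ When you assign a list or an array to a column, its length must match the length of the DataFrame
+ When you assign a Series to a column, its labels are **realigned to the DataFrame's index**, and missing values are inserted wherever a label is absent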

In [22]:
val = pd.Series([-1.2,-1.5,-1.7],index=['two','four','five'])
frame2['debt'] = val
print(frame2)

       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
six    2003  Nevada  3.2   NaN
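
The two assignment rules behave differently when the sizes disagree. The sketch below, on an illustrative three-row frame, shows a list of the wrong length being rejected while a shorter Series is accepted and aligned by label. The name `length_error` is an assumption made for the example.

```python
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001, 2002]},
                     index=['one', 'two', 'three'])

# A list must have exactly one value per row; two values for three rows fails.
try:
    frame['debt'] = [1.0, 2.0]
    length_error = None
except ValueError as exc:
    length_error = exc
print(type(length_error).__name__)

# A Series may be shorter: it is matched by label, and rows without a
# matching label receive NaN.
frame['debt'] = pd.Series([-1.2], index=['two'])
print(frame['debt'].isna().tolist())
```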


+ Assigning to a column that does not exist creates a new column

In [23]:
frame2['eastern'] = frame2.state == 'Ohio'
print(frame2)

       year   state  pop  debt  eastern
one    2000    Ohio  1.5   NaN     True
two    2001    Ohio  1.7  -1.2     True
three  2002    Ohio  3.6   NaN     True
four   2001  Nevada  2.4  -1.5    False
five   2002  Nevada  2.9  -1.7    False
six    2003  Nevada  3.2   NaN    False
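
New columns are created with bracket notation only. As a hedged illustration, the sketch below shows that assigning through attribute syntax sets a plain Python attribute on the object instead of adding a column. The column names `eastern` and `western` and the two-row frame are illustrative; pandas emits a warning for this pattern, which is silenced here.

```python
import warnings

import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada']})

# Bracket assignment creates the column.
frame['eastern'] = frame['state'] == 'Ohio'
print('eastern' in frame.columns)

# Attribute assignment does not create a column.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    frame.western = frame['state'] == 'Nevada'
print('western' in frame.columns)
```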


+ The `del` keyword removes a column

In [24]:
del frame2['eastern']
print(frame2.columns)

Index(['year', 'state', 'pop', 'debt'], dtype='object')
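
`del` modifies the DataFrame in place. When the original should stay untouched, `DataFrame.drop` is a common alternative that returns a new object; it is not covered in the original notes and is shown here only for contrast. The frame and the name `trimmed` are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001], 'debt': [0.0, 1.0]})

# drop returns a new DataFrame and leaves the original unchanged.
trimmed = frame.drop(columns=['debt'])
print(list(trimmed.columns))
print(list(frame.columns))

# del removes the column from the original object itself.
del frame['debt']
print(list(frame.columns))
```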
