## Core Data Structures

In [1]:
import pandas as pd
import numpy as np

### Series

A Series is a **one-dimensional labeled array** that can hold any data type (integers, floats, strings, Python objects). The basic constructor call is:

```python
s = pd.Series(data, index=index)
```

Here `index` is a list of labels for the data, and `data` can be any of the following:

* a Python dict
* an ndarray
* a scalar value, such as 5


#### Creating from an ndarray

In [2]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s

a    0.747292
b   -1.120276
c   -0.132692
d   -0.267813
e   -0.590904
dtype: float64

In [3]:
s.index

Index([u'a', u'b', u'c', u'd', u'e'], dtype='object')

In [4]:
s = pd.Series(np.random.randn(5))
s

0    0.324214
1   -0.183776
2   -0.518808
3    0.866421
4   -0.601668
dtype: float64

In [5]:
s.index

Int64Index([0, 1, 2, 3, 4], dtype='int64')

#### Creating from a dict

In [6]:
# Labels missing from the dict are filled with NaN by default
d = {'a' : 0., 'b' : 1., 'd' : 3}
s = pd.Series(d, index=list('abcd'))
s

a     0
b     1
c   NaN
d     3
dtype: float64
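
The default handling of missing labels shown above can be inspected directly. The following is a minimal sketch that rebuilds the same Series and uses `isna()` and `dropna()` to find and drop the label that had no value in the dict.

```python
import pandas as pd

# Same construction as above: the dict has no entry for label 'c'.
d = {'a': 0.0, 'b': 1.0, 'd': 3}
s = pd.Series(d, index=list('abcd'))

# isna() marks the labels whose value is missing.
missing = s.isna()
print(missing)

# dropna() keeps only the labels that actually had a value.
present = s.dropna()
print(present.index.tolist())
```

Because NaN is a floating-point value, the Series is stored as `float64` even though the dict contains the integer 3.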

#### Creating from a scalar

In [7]:
pd.Series(3, index=list('abcde'))

a    3
b    3
c    3
d    3
e    3
dtype: int64

In [8]:
print("Missing required dependencies {values}".format(values=['aaa', 'bbb']))

Missing required dependencies ['aaa', 'bbb']


#### Series is ndarray-like

Readers familiar with numpy will recognize the operations below. The same indexing styles were covered in the numpy introduction.

In [9]:
s = pd.Series(np.random.randn(5))
s

0    0.882069
1   -0.134360
2   -0.925088
3    0.191072
4    2.546704
dtype: float64

In [10]:
s[0]

0.88206876023157332

In [11]:
s[:3]

0    0.882069
1   -0.134360
2   -0.925088
dtype: float64

In [12]:
s[[1, 3, 4]]

1   -0.134360
3    0.191072
4    2.546704
dtype: float64

In [13]:
np.exp(s)

0     2.415892
1     0.874275
2     0.396497
3     1.210546
4    12.764963
dtype: float64

In [14]:
np.sin(s)

0    0.772055
1   -0.133957
2   -0.798673
3    0.189911
4    0.560416
dtype: float64
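
Boolean masks are another numpy-style operation that carries over to a Series. The sketch below uses fixed values (chosen here only for illustration) so the result is reproducible.

```python
import pandas as pd

s = pd.Series([0.5, -1.0, 2.0, -0.25, 1.5])

# A comparison yields a boolean Series aligned with s.
mask = s > 0

# Indexing with the mask keeps only the rows where it is True.
positive = s[mask]
print(positive)

# Arithmetic is applied element-wise, as with an ndarray.
doubled = s * 2
print(doubled)
```

Unlike a plain ndarray, the filtered result keeps the original labels (0, 2, 4) rather than being renumbered.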

#### Series is dict-like

In [15]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s

a   -2.149840
b   -0.924115
c    0.481231
d    1.033813
e   -0.462794
dtype: float64

In [16]:
s['a']

-2.1498403551053218

In [17]:
s['e'] = 5

In [18]:
s

a   -2.149840
b   -0.924115
c    0.481231
d    1.033813
e    5.000000
dtype: float64

In [19]:
s['g'] = 100

In [20]:
s

a     -2.149840
b     -0.924115
c      0.481231
d      1.033813
e      5.000000
g    100.000000
dtype: float64

In [21]:
'e' in s

True

In [22]:
'f' in s

False

In [23]:
# s['f']  # raises KeyError: the label 'f' does not exist

In [24]:
print(s.get('f'))

None


In [25]:
print(s.get('f', np.nan))

nan
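
The difference between bracket access and `get()` for a missing label can be sketched as follows. The Series here is a small stand-in, not the one from the cells above.

```python
import pandas as pd

s = pd.Series([1.0, 2.0], index=['a', 'b'])

# Bracket access raises KeyError for a label that is not in the index.
try:
    s['f']
    raised = False
except KeyError:
    raised = True
print(raised)

# get() returns None, or the supplied default, instead of raising.
print(s.get('f'))
print(s.get('f', -1.0))
```

This mirrors the behavior of a Python dict, where `d[key]` raises and `d.get(key, default)` does not.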


#### Label alignment

In [26]:
s1 = pd.Series(np.random.randn(3), index=['a', 'c', 'e'])
s2 = pd.Series(np.random.randn(3), index=['a', 'd', 'e'])
print('{0}\n\n{1}'.format(s1, s2))

a   -0.917905
c   -0.744616
e    0.114522
dtype: float64

a    0.721087
d   -0.471575
e    0.796093
dtype: float64


In [27]:
s1 + s2

a   -0.196818
c         NaN
d         NaN
e    0.910615
dtype: float64
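
When the NaN results of label alignment are not wanted, one option is the `add()` method with `fill_value`. The sketch below uses fixed values for reproducibility; treating a missing label as 0 is an assumption that only makes sense for some data.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=['a', 'c', 'e'])
s2 = pd.Series([10.0, 20.0, 30.0], index=['a', 'd', 'e'])

# The + operator aligns on labels; labels present in only one
# operand ('c' and 'd') produce NaN.
plain = s1 + s2
print(plain)

# add() with fill_value substitutes 0 for a label that is missing
# from exactly one of the two operands.
filled = s1.add(s2, fill_value=0)
print(filled)
```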

#### The name attribute

In [28]:
s = pd.Series(np.random.randn(5), name='Some Thing')
s

0    0.623787
1    0.517239
2    1.551314
3    1.414463
4   -1.224611
Name: Some Thing, dtype: float64

In [29]:
s.name

'Some Thing'
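
One place where the name matters is conversion to a DataFrame, where it becomes the column label. A minimal sketch, using `rename()` and `to_frame()`:

```python
import pandas as pd

s = pd.Series([1, 2, 3], name='Some Thing')

# rename() with a scalar returns a new Series with a different name;
# the original Series keeps its name.
renamed = s.rename('other')
print(s.name, renamed.name)

# to_frame() uses the Series name as the column label.
frame = s.to_frame()
print(frame.columns.tolist())
```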

### DataFrame

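A DataFrame is a **two-dimensional array with row and column labels**. You can think of it as an Excel spreadsheet or a SQL table, or as a dict of Series objects. It is the most commonly used data structure in Pandas.

The basic form for creating a DataFrame is: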

```python
df = pd.DataFrame(data, index=index, columns=columns)
```

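Here `index` holds the row labels, `columns` holds the column labels, and `data` can be any of the following:

* a dict of one-dimensional numpy arrays, lists, or Series
* a two-dimensional numpy array
* a Series
* another DataFrame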


#### Creating from a dict

In [30]:
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

In [31]:
pd.DataFrame(d)

Unnamed: 0,one,two
a,1.0,1
b,2.0,2
c,3.0,3
d,,4


In [32]:
pd.DataFrame(d, index=['d', 'b', 'a'])

Unnamed: 0,one,two
d,,4
b,2.0,2
a,1.0,1


In [33]:
pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])

Unnamed: 0,two,three
d,4,
b,2,
a,1,


In [34]:
d = {'one' : [1, 2, 3, 4],
     'two' : [21, 22, 23, 24]}

In [35]:
pd.DataFrame(d)

Unnamed: 0,one,two
0,1,21
1,2,22
2,3,23
3,4,24


In [36]:
pd.DataFrame(d, index=['a', 'b', 'c', 'd'])

Unnamed: 0,one,two
a,1,21
b,2,22
c,3,23
d,4,24


#### Creating from structured data

In [37]:
data = [(1, 2.2, 'Hello'), (2, 3., "World")]

In [38]:
pd.DataFrame(data)

Unnamed: 0,0,1,2
0,1,2.2,Hello
1,2,3.0,World


In [39]:
pd.DataFrame(data, index=['first', 'second'], columns=['A', 'B', 'C'])

Unnamed: 0,A,B,C
first,1,2.2,Hello
second,2,3.0,World


#### Creating from a list of dicts

In [40]:
data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

In [41]:
pd.DataFrame(data)

Unnamed: 0,a,b,c
0,1,2,
1,5,10,20.0


In [42]:
pd.DataFrame(data, index=['first', 'second'])

Unnamed: 0,a,b,c
first,1,2,
second,5,10,20.0


In [43]:
pd.DataFrame(data, columns=['a', 'b'])

Unnamed: 0,a,b
0,1,2
1,5,10


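#### Creating from a dict of tuples

The goal here is to understand how this construction works. In practice, data is usually cleaned first into a format that Pandas can import easily and that is easy to read, and only then converted into a more complex structure with operations such as reindex or groupby.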

In [44]:
d = {('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
     ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
     ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
     ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
     ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}}

In [45]:
# Multi-level (hierarchical) labels
pd.DataFrame(d)

       a              b
       a    b    c    a     b
A B  4.0  1.0  5.0  8.0  10.0
  C  3.0  2.0  6.0  7.0   NaN
  D  NaN  NaN  NaN  NaN   9.0
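
The practical route mentioned above, starting from flat and readable records and reshaping afterwards, might look like the following sketch. It uses `set_index()` and `unstack()` rather than reindex or groupby, and the record layout and column names (`outer`, `inner`, `row`, `value`) are hypothetical.

```python
import pandas as pd

# Hypothetical flat records: (outer column, inner column, row label, value).
records = [
    ('a', 'a', 'B', 4), ('a', 'a', 'C', 3),
    ('a', 'b', 'B', 1), ('a', 'b', 'C', 2),
]
flat = pd.DataFrame(records, columns=['outer', 'inner', 'row', 'value'])

# Move the three key columns into the index, then pivot the two
# column-level keys out of the index into hierarchical column labels.
nested = flat.set_index(['row', 'outer', 'inner'])['value'].unstack(['outer', 'inner'])
print(nested)
```

The flat form is easier to load from CSV or a database, and the hierarchical form is built only when it is needed.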


#### Creating from a Series

In [46]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
pd.DataFrame(s)

Unnamed: 0,0
a,-0.789343
b,0.127384
c,1.084005
d,-0.755011
e,-0.963299


In [47]:
pd.DataFrame(s, index=['a', 'c', 'd'])

Unnamed: 0,0
a,-0.789343
c,1.084005
d,-0.755011


In [48]:
pd.DataFrame(s, index=['a', 'c', 'd'], columns=['A'])

Unnamed: 0,A
a,-0.789343
c,1.084005
d,-0.755011


#### Selecting, adding, and deleting columns

In [49]:
df = pd.DataFrame(np.random.randn(6, 4), columns=['one', 'two', 'three', 'four'])
df

Unnamed: 0,one,two,three,four
0,2.0453,-0.981722,-0.656081,-0.639517
1,-0.55078,0.248781,-0.146424,0.217392
2,1.702775,0.103998,-0.662138,-0.534071
3,-2.035681,0.015025,1.368209,0.178378
4,-1.092208,0.091108,-0.892496,-0.611198
5,0.093502,0.267428,1.189654,-0.258723


In [50]:
df['one']

0    2.045300
1   -0.550780
2    1.702775
3   -2.035681
4   -1.092208
5    0.093502
Name: one, dtype: float64

In [51]:
df['three'] = df['one'] + df['two']
df

Unnamed: 0,one,two,three,four
0,2.0453,-0.981722,1.063578,-0.639517
1,-0.55078,0.248781,-0.301999,0.217392
2,1.702775,0.103998,1.806773,-0.534071
3,-2.035681,0.015025,-2.020656,0.178378
4,-1.092208,0.091108,-1.0011,-0.611198
5,0.093502,0.267428,0.360931,-0.258723


In [52]:
df['flag'] = df['one'] > 0
df

Unnamed: 0,one,two,three,four,flag
0,2.0453,-0.981722,1.063578,-0.639517,True
1,-0.55078,0.248781,-0.301999,0.217392,False
2,1.702775,0.103998,1.806773,-0.534071,True
3,-2.035681,0.015025,-2.020656,0.178378,False
4,-1.092208,0.091108,-1.0011,-0.611198,False
5,0.093502,0.267428,0.360931,-0.258723,True


In [53]:
del df['three']
df

Unnamed: 0,one,two,four,flag
0,2.0453,-0.981722,-0.639517,True
1,-0.55078,0.248781,0.217392,False
2,1.702775,0.103998,-0.534071,True
3,-2.035681,0.015025,0.178378,False
4,-1.092208,0.091108,-0.611198,False
5,0.093502,0.267428,-0.258723,True


In [54]:
four = df.pop('four')
four

0   -0.639517
1    0.217392
2   -0.534071
3    0.178378
4   -0.611198
5   -0.258723
Name: four, dtype: float64

In [55]:
df

Unnamed: 0,one,two,flag
0,2.0453,-0.981722,True
1,-0.55078,0.248781,False
2,1.702775,0.103998,True
3,-2.035681,0.015025,False
4,-1.092208,0.091108,False
5,0.093502,0.267428,True


In [56]:
df['five'] = 5
df

Unnamed: 0,one,two,flag,five
0,2.0453,-0.981722,True,5
1,-0.55078,0.248781,False,5
2,1.702775,0.103998,True,5
3,-2.035681,0.015025,False,5
4,-1.092208,0.091108,False,5
5,0.093502,0.267428,True,5


In [57]:
df['one_trunc'] = df['one'][:2]
df

Unnamed: 0,one,two,flag,five,one_trunc
0,2.0453,-0.981722,True,5,2.0453
1,-0.55078,0.248781,False,5,-0.55078
2,1.702775,0.103998,True,5,
3,-2.035681,0.015025,False,5,
4,-1.092208,0.091108,False,5,
5,0.093502,0.267428,True,5,


In [58]:
# Insert at a specific position
df.insert(1, 'bar', df['one'])
df

Unnamed: 0,one,bar,two,flag,five,one_trunc
0,2.0453,2.0453,-0.981722,True,5,2.0453
1,-0.55078,-0.55078,0.248781,False,5,-0.55078
2,1.702775,1.702775,0.103998,True,5,
3,-2.035681,-2.035681,0.015025,False,5,
4,-1.092208,-1.092208,0.091108,False,5,
5,0.093502,0.093502,0.267428,True,5,


#### Inserting new columns with assign()

`assign()` makes it convenient to add columns as part of a method chain.

In [59]:
df = pd.DataFrame(np.random.randint(1, 5, (6, 4)), columns=list('ABCD'))
df

Unnamed: 0,A,B,C,D
0,4,3,3,4
1,1,4,4,2
2,1,4,4,3
3,2,4,4,3
4,1,2,4,2
5,3,4,1,4


In [60]:
df.assign(Ratio = df['A'] / df['B'])

Unnamed: 0,A,B,C,D,Ratio
0,4,3,3,4,1.333333
1,1,4,4,2,0.25
2,1,4,4,3,0.25
3,2,4,4,3,0.5
4,1,2,4,2,0.5
5,3,4,1,4,0.75


In [61]:
df.assign(AB_Ratio = lambda x: x.A / x.B, CD_Diff = lambda x: x.C - x.D)

Unnamed: 0,A,B,C,D,AB_Ratio,CD_Diff
0,4,3,3,4,1.333333,-1
1,1,4,4,2,0.25,2
2,1,4,4,3,0.25,1
3,2,4,4,3,0.5,1
4,1,2,4,2,0.5,2
5,3,4,1,4,0.75,-3


In [62]:
df.assign(AB_Ratio = lambda x: x.A / x.B).assign(ABD_Ratio = lambda x: x.AB_Ratio * x.D)

Unnamed: 0,A,B,C,D,AB_Ratio,ABD_Ratio
0,4,3,3,4,1.333333,5.333333
1,1,4,4,2,0.25,0.5
2,1,4,4,3,0.25,0.75
3,2,4,4,3,0.5,1.5
4,1,2,4,2,0.5,1.0
5,3,4,1,4,0.75,3.0
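
Note that none of the `assign()` calls above modify `df`: each returns a new DataFrame. A minimal sketch with fixed values (the column name `ratio` is chosen only for illustration):

```python
import pandas as pd

df = pd.DataFrame({'A': [4, 1, 2], 'B': [2, 4, 8]})

# assign() returns a new DataFrame with the extra column.
result = df.assign(ratio=lambda x: x.A / x.B)
print(result)

# The original DataFrame is left unchanged.
print(df.columns.tolist())
```

To keep the new column, bind the result to a name, for example `df = df.assign(...)`.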


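#### Indexing and selection

Common operations, with their syntax and return types:

* Select a column -> df[col] -> Series
* Select a row by label -> df.loc[label] -> Series
* Select a row by integer position -> df.iloc[pos] -> Series
* Select multiple rows by slice -> df[5:10] -> DataFrame
* Select multiple rows by boolean vector -> df[bool_vector] -> DataFrame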

In [63]:
df = pd.DataFrame(np.random.randint(1, 10, (6, 4)), index=list('abcdef'), columns=list('ABCD'))
df

Unnamed: 0,A,B,C,D
a,2,2,6,6
b,8,3,5,7
c,4,6,8,3
d,7,8,3,9
e,8,4,4,2
f,4,2,4,3


In [64]:
df['A']

a    2
b    8
c    4
d    7
e    8
f    4
Name: A, dtype: int32

In [65]:
df.loc['a']

A    2
B    2
C    6
D    6
Name: a, dtype: int32

In [66]:
df.iloc[0]

A    2
B    2
C    6
D    6
Name: a, dtype: int32

In [67]:
df[1:4]

Unnamed: 0,A,B,C,D
b,8,3,5,7
c,4,6,8,3
d,7,8,3,9


In [68]:
df[[False, True, True, False, True, False]]

Unnamed: 0,A,B,C,D
b,8,3,5,7
c,4,6,8,3
e,8,4,4,2
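
In practice the boolean vector is usually computed from a condition rather than written out by hand, and `loc`/`iloc` also accept a row and a column selector together. A minimal sketch with a small fixed DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 8, 4], 'B': [2, 3, 6]}, index=list('abc'))

# A comparison on a column gives a boolean Series usable as a row filter.
mask = df['A'] > 3
selected = df[mask]
print(selected)

# loc takes (row label, column label) and returns a single value here.
cell = df.loc['b', 'B']
print(cell)

# iloc takes integer positions; slices return a DataFrame.
block = df.iloc[0:2, 0:1]
print(block)
```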


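#### Data alignment

When computing with DataFrames, Pandas automatically aligns the data on both rows and columns. The result carries the union of the two DataFrames' row and column labels, with NaN wherever a label is missing from either operand.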

In [69]:
df1 = pd.DataFrame(np.random.randn(10, 4), index=list('abcdefghij'), columns=['A', 'B', 'C', 'D'])
df1

Unnamed: 0,A,B,C,D
a,0.576428,-0.037913,-0.329787,-1.752916
b,0.406743,-1.044561,-0.724447,0.374599
c,0.073578,0.423914,-1.49977,-0.488374
d,-0.377609,1.137422,-1.951169,-0.814306
e,-2.171648,-2.364502,-0.833594,0.168636
f,-1.1348,-0.927469,0.886889,0.542603
g,0.625104,0.115953,-1.282609,1.031292
h,0.403509,0.263207,0.403614,-0.177888
i,0.148494,-2.034253,0.134859,-0.96065
j,0.0942,-1.803288,0.057472,-0.338958


In [70]:
df2 = pd.DataFrame(np.random.randn(7, 3), index=list('cdefghi'), columns=['A', 'B', 'C'])
df2

Unnamed: 0,A,B,C
c,0.884518,0.337344,-1.072027
d,0.264036,-0.152542,-0.225544
e,1.048813,-1.496442,1.022348
f,0.895314,-0.890236,1.230465
g,-0.588162,-0.492354,-0.739563
h,-2.580322,1.10481,-0.167137
i,-0.842738,0.171735,0.847714


In [71]:
df1 + df2

Unnamed: 0,A,B,C,D
a,,,,
b,,,,
c,0.958096,0.761259,-2.571797,
d,-0.113573,0.98488,-2.176713,
e,-1.122834,-3.860944,0.188754,
f,-0.239486,-1.817705,2.117354,
g,0.036942,-0.376401,-2.022171,
h,-2.176813,1.368016,0.236476,
i,-0.694245,-1.862517,0.982573,
j,,,,


In [72]:
df1 - df1.iloc[0]

Unnamed: 0,A,B,C,D
a,0.0,0.0,0.0,0.0
b,-0.169685,-1.006648,-0.39466,2.127515
c,-0.50285,0.461827,-1.169983,1.264541
d,-0.954037,1.175335,-1.621382,0.93861
e,-2.748076,-2.326589,-0.503807,1.921551
f,-1.711228,-0.889556,1.216676,2.295518
g,0.048676,0.153866,-0.952822,2.784208
h,-0.172919,0.301119,0.7334,1.575028
i,-0.427934,-1.99634,0.464646,0.792265
j,-0.482228,-1.765375,0.387259,1.413957
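
Two variations on the operations above may be useful: filling gaps during alignment, and broadcasting a column down the rows instead of a row across the columns. The sketch below uses small fixed DataFrames, and substituting 0 for missing values is an assumption that depends on the data.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0]}, index=['a', 'b'])
df2 = pd.DataFrame({'A': [10.0, 20.0]}, index=['b', 'c'])

# fill_value replaces a value missing from exactly one operand;
# cells missing from both operands stay NaN.
summed = df1.add(df2, fill_value=0)
print(summed)

# Subtracting a row (as in df1 - df1.iloc[0]) matches on column labels.
# To subtract a column from every column, match on the row index instead.
by_column = df1.sub(df1['A'], axis=0)
print(by_column)
```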


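#### Using numpy functions

Pandas' core data structures work directly with numpy: numpy functions can be applied to a DataFrame, and a DataFrame converts to an ndarray.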

In [73]:
df = pd.DataFrame(np.random.randn(10, 4), columns=['one', 'two', 'three', 'four'])
df

Unnamed: 0,one,two,three,four
0,-1.121818,1.233686,0.681618,-0.502204
1,1.469664,-0.060555,-0.044857,0.725021
2,1.21967,0.108709,1.806063,0.332685
3,-0.190615,1.244102,-0.86385,1.795335
4,-0.133109,-0.101591,0.818724,1.24623
5,0.729804,0.716593,2.472841,-0.078224
6,0.010136,1.725441,-1.071194,1.602945
7,1.002507,-1.122593,-0.147411,-1.678843
8,-0.550077,0.230777,-0.65847,-1.680395
9,1.006271,0.455683,-2.279833,-0.823792


In [74]:
np.exp(df)

Unnamed: 0,one,two,three,four
0,0.325687,3.433864,1.977073,0.605196
1,4.347774,0.941242,0.956134,2.064774
2,3.386069,1.114838,6.08644,1.394708
3,0.82645,3.469817,0.421536,6.02149
4,0.875369,0.903399,2.267604,3.47721
5,2.074675,2.047446,11.856082,0.924757
6,1.010187,5.614995,0.342599,4.967641
7,2.725105,0.325435,0.862939,0.18659
8,0.576905,1.259578,0.517643,0.1863
9,2.735382,1.57725,0.102301,0.438765


In [75]:
np.asarray(df) == df.values

array([[ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True]], dtype=bool)

In [76]:
type(np.asarray(df))

numpy.ndarray

In [77]:
np.asarray(df) == df

Unnamed: 0,one,two,three,four
0,True,True,True,True
1,True,True,True,True
2,True,True,True,True
3,True,True,True,True
4,True,True,True,True
5,True,True,True,True
6,True,True,True,True
7,True,True,True,True
8,True,True,True,True
9,True,True,True,True


In [78]:
# Columns are also accessible as attributes, which enables TAB completion
df.one

0   -1.121818
1    1.469664
2    1.219670
3   -0.190615
4   -0.133109
5    0.729804
6    0.010136
7    1.002507
8   -0.550077
9    1.006271
Name: one, dtype: float64
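
Attribute access is convenient but has limits: it does not work as expected when a column name collides with an existing DataFrame attribute or method. A minimal sketch, with column names chosen to show the collision:

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2], 'count': [3, 4]})

# 'one' does not clash with anything, so attribute access returns the column.
print(df.one.tolist())

# 'count' is also a DataFrame method, so df.count is the method, not the column.
print(callable(df.count))

# Bracket access always returns the column.
print(df['count'].tolist())
```

Column names that are not valid Python identifiers, such as names containing spaces, cannot be written in dot syntax either, so bracket access is the safer default in scripts.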

### Panel 

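A Panel is a three-dimensional labeled array. The name Pandas itself derives from it: pan(el)-da(ta)-s. Panel is rarely used, although historically it was one of the basic data structures. It was deprecated in pandas 0.20 and removed in 0.25, so the cells below require an older version.

* items: axis 0; each item is a DataFrame
* major_axis: axis 1; the row labels of each DataFrame
* minor_axis: axis 2; the column labels of each DataFrame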

In [79]:
data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)),
        'Item2' : pd.DataFrame(np.random.randn(4, 2))}
pn = pd.Panel(data)
pn

<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 2

In [80]:
pn['Item1']

Unnamed: 0,0,1,2
0,0.638298,-1.600822,3.11221
1,0.394099,0.184129,0.43845
2,0.427692,-0.294556,0.03943
3,1.555046,0.933749,0.218616


In [81]:
pn.items

Index([u'Item1', u'Item2'], dtype='object')

In [82]:
pn.major_axis

Int64Index([0, 1, 2, 3], dtype='int64')

In [83]:
pn.minor_axis

Int64Index([0, 1, 2], dtype='int64')

In [84]:
# Method call: cross-section at a major_axis label
pn.major_xs(pn.major_axis[0])

Unnamed: 0,Item1,Item2
0,0.638298,-1.427579
1,-1.600822,-0.77809
2,3.11221,


In [85]:
# Method call: cross-section at a minor_axis label
pn.minor_xs(pn.minor_axis[1])

Unnamed: 0,Item1,Item2
0,-1.600822,-0.77809
1,0.184129,0.698347
2,-0.294556,-0.167423
3,0.933749,0.205092


In [86]:
pn.to_frame()

                Item1     Item2
major minor
0     0      0.638298 -1.427579
      1     -1.600822 -0.778090
1     0      0.394099 -0.999929
      1      0.184129  0.698347
2     0      0.427692  0.559905
      1     -0.294556 -0.167423
3     0      1.555046 -1.992102
      1      0.933749  0.205092
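
On pandas versions without Panel, similar three-dimensional data can be held in a DataFrame with a hierarchical index. The following is a sketch of one possible substitute, not an exact equivalent of the Panel API. The level names `item` and `major` are chosen here for illustration, and `np.arange` replaces the random data so the result is reproducible.

```python
import numpy as np
import pandas as pd

data = {'Item1': pd.DataFrame(np.arange(12).reshape(4, 3)),
        'Item2': pd.DataFrame(np.arange(8).reshape(4, 2))}

# concat on a dict stacks the frames and uses the dict keys as an
# outer index level, giving (item, major) row labels.
stacked = pd.concat(data, names=['item', 'major'])
print(stacked)

# Selecting one item returns an ordinary DataFrame, like pn['Item1'].
item1 = stacked.loc['Item1']
print(item1.shape)

# xs() on the 'major' level gives a cross-section across items.
row0 = stacked.xs(0, level='major')
print(row0)
```

Item2 has only two columns, so its third column is NaN after alignment, which matches the 2 x 4 x 3 dimensions reported for the Panel above.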
