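# 3.3 pandas Data Operations

## 3.3.1 Arithmetic Operations

### When pandas objects take part in arithmetic, values whose index labels appear in both operands are combined, and any label found in only one operand yields a missing value (NaN) in the result. This is called data alignment.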

In [2]:
from pandas import Series,DataFrame
import pandas as pd
import numpy as np

In [3]:
obj1 = Series([3.2,5.3,-4.4,-3.7],index=['a','c','g','f'])
obj1

a    3.2
c    5.3
g   -4.4
f   -3.7
dtype: float64

In [4]:
obj2 = Series([5.0,-2,4.4,3.4],index=['a','b','c','d'])
obj2

a    5.0
b   -2.0
c    4.4
d    3.4
dtype: float64

In [5]:
obj1+obj2

a    8.2
b    NaN
c    9.7
d    NaN
f    NaN
g    NaN
dtype: float64
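
When the NaN values produced by alignment are unwanted, one option is the `add` method with a `fill_value`. The sketch below reuses `obj1` and `obj2` from above; the names `aligned_sum` and `filled_sum` are illustrative only, and treating a missing label as 0 is an assumption that may not suit every dataset.

```python
import pandas as pd

obj1 = pd.Series([3.2, 5.3, -4.4, -3.7], index=['a', 'c', 'g', 'f'])
obj2 = pd.Series([5.0, -2, 4.4, 3.4], index=['a', 'b', 'c', 'd'])

# The + operator aligns on the union of both indexes; a label present in
# only one Series becomes NaN in the result.
aligned_sum = obj1 + obj2

# add() with fill_value=0 treats a label missing from one side as 0,
# so every label in the union gets a numeric result.
filled_sum = obj1.add(obj2, fill_value=0)
print(filled_sum)
```

Only labels `a` and `c` exist in both Series, so these are the only entries whose value differs from the inputs in `filled_sum`.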

### For DataFrame objects, alignment happens on both the rows and the columns at the same time

In [6]:
df1 = DataFrame(np.arange(9).reshape(3,3),columns=['a','b','c'],index=['apple','tea','banana'])
df1

        a  b  c
apple   0  1  2
tea     3  4  5
banana  6  7  8


In [7]:
df2 = DataFrame(np.arange(9).reshape(3,3),columns=['a','b','d'],index=['apple','tea','coco'])
df2

       a  b  d
apple  0  1  2
tea    3  4  5
coco   6  7  8


In [8]:
df1+df2

          a    b   c   d
apple   0.0  2.0 NaN NaN
banana  NaN  NaN NaN NaN
coco    NaN  NaN NaN NaN
tea     6.0  8.0 NaN NaN
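
The same `fill_value` idea applies to DataFrames. The following is a minimal sketch using the two frames above; `total` and `filled` are illustrative names. A cell missing from both frames (for example row `banana`, column `d`) still comes out as NaN, because there is nothing to fill it with on either side.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9).reshape(3, 3),
                   columns=['a', 'b', 'c'], index=['apple', 'tea', 'banana'])
df2 = pd.DataFrame(np.arange(9).reshape(3, 3),
                   columns=['a', 'b', 'd'], index=['apple', 'tea', 'coco'])

# Plain addition: only cells whose row AND column labels exist in both
# frames get a number; everything else is NaN.
total = df1 + df2

# add() with fill_value=0: a cell present in just one frame keeps its value.
filled = df1.add(df2, fill_value=0)
print(filled)
```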


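### In arithmetic between a DataFrame and a Series, the Series index is first matched against the DataFrame's column labels, and the operation is then broadcast down the rows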

In [9]:
df1

        a  b  c
apple   0  1  2
tea     3  4  5
banana  6  7  8


In [11]:
s = df1.loc['apple']
s

a    0
b    1
c    2
Name: apple, dtype: int32

In [12]:
df1-s

        a  b  c
apple   0  0  0
tea     3  3  3
banana  6  6  6
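
To broadcast in the other direction, that is, to match the Series index against the row labels and operate across the columns, the arithmetic methods accept an `axis` argument. A small sketch, assuming the same `df1` as above (`by_row` and `by_col` are illustrative names):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9).reshape(3, 3),
                   columns=['a', 'b', 'c'], index=['apple', 'tea', 'banana'])

# Default behaviour: the Series index (a, b, c) is matched to the columns
# and the subtraction is repeated for every row.
row = df1.loc['apple']
by_row = df1 - row

# axis=0: the Series index (apple, tea, banana) is matched to the row
# labels and the subtraction is repeated for every column.
col = df1['a']
by_col = df1.sub(col, axis=0)
print(by_col)
```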


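## 3.3.2 Function Application and Mapping

### Data analysis often calls for more complex computations, for which you define your own function. There are three ways to apply such a function to pandas data: map applies it to each element of a Series; apply applies it to the rows or columns of a DataFrame; applymap applies it to each element of a DataFrame.

### Suppose the “元” (yuan) suffix has to be removed from the price column; this is a job for map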

In [13]:
data = {
    'fruit':['apple','orange','grape','banana'],
    'price':['25元','42元','35元','14元']
}
df1 = DataFrame(data)
df1

    fruit price
0   apple   25元
1  orange   42元
2   grape   35元
3  banana   14元


In [15]:
def f(x):
    return x.split('元')[0]
df1['price'] = df1['price'].map(f)
df1

    fruit price
0   apple    25
1  orange    42
2   grape    35
3  banana    14
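
Note that after the `map` call above the prices are still strings. If numbers are needed for later arithmetic, the conversion can be done in the same function. The sketch below is one possible variant; the helper name `strip_unit` is hypothetical and not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({
    'fruit': ['apple', 'orange', 'grape', 'banana'],
    'price': ['25元', '42元', '35元', '14元'],
})

def strip_unit(text):
    """Drop the trailing '元' and return the price as an integer."""
    return int(text.split('元')[0])

# map calls strip_unit once per element of the price column.
df['price'] = df['price'].map(strip_unit)
print(df['price'].sum())
```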


In [16]:
df2 = DataFrame(np.random.randn(3,3),columns=['a','b','c'],index=['app','win','mac'])
df2

            a         b         c
app -0.204185 -0.501947 -0.111658
win  0.493893  0.947630  0.140371
mac -1.316223 -1.542992  0.384525


In [18]:
f = lambda x:x.max()-x.min()
df2.apply(f)

a    1.810116
b    2.490621
c    0.496182
dtype: float64

#### Note: a lambda is an anonymous function. It works like a function defined with def, but takes less code to write.
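
By default `apply` passes each column to the function. Passing `axis=1` passes each row instead. The sketch below uses a small hand-made frame (hypothetical data, chosen so the results are easy to check by eye) rather than the random `df2` above.

```python
import pandas as pd

df = pd.DataFrame([[1.0, 4.0, 2.0],
                   [3.0, 0.0, 5.0]],
                  columns=['a', 'b', 'c'], index=['x', 'y'])

value_range = lambda x: x.max() - x.min()

# Default (axis=0): the function receives one column at a time.
per_column = df.apply(value_range)

# axis=1: the function receives one row at a time.
per_row = df.apply(value_range, axis=1)
print(per_column)
print(per_row)
```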

### The applymap function acts on every element, which makes it convenient for batch-processing an entire DataFrame

In [19]:
df2

            a         b         c
app -0.204185 -0.501947 -0.111658
win  0.493893  0.947630  0.140371
mac -1.316223 -1.542992  0.384525


In [21]:
df2.applymap(lambda x:'%.2f' %x)

         a      b      c
app  -0.20  -0.50  -0.11
win   0.49   0.95   0.14
mac  -1.32  -1.54   0.38
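
In pandas 2.1 and later, `applymap` is deprecated in favor of `DataFrame.map`, which does the same element-wise job. The sketch below picks whichever method the installed version offers; the `hasattr` check is just one way to stay compatible, and the two-row frame copies a few values from `df2` above for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [-0.204185, 0.493893],
                   'b': [-0.501947, 0.947630]},
                  index=['app', 'win'])

fmt = lambda x: '%.2f' % x

# Use DataFrame.map where it exists (pandas >= 2.1), else fall back to
# the older applymap.
elementwise = df.map if hasattr(df, 'map') else df.applymap
formatted = elementwise(fmt)
print(formatted)
```

The result holds strings, not floats, so trailing zeros such as `-0.20` are preserved.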


## 3.3.3 Sorting

### For a Series, the sort_index method sorts by index; the order is ascending by default

In [22]:
obj1 = Series([-2,3,2,1],index=['b','a','d','c'])
obj1

b   -2
a    3
d    2
c    1
dtype: int64

In [23]:
obj1.sort_index()   # ascending

a    3
b   -2
c    1
d    2
dtype: int64

In [24]:
obj1.sort_index(ascending=False)   # descending

d    2
c    1
b   -2
a    3
dtype: int64

### The sort_values method sorts by value

In [25]:
obj1.sort_values()

b   -2
c    1
d    2
a    3
dtype: int64
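
Like `sort_index`, `sort_values` accepts `ascending=False` and returns a new object instead of changing the original. A brief sketch with the same `obj1` (the name `descending` is illustrative):

```python
import pandas as pd

obj1 = pd.Series([-2, 3, 2, 1], index=['b', 'a', 'd', 'c'])

# Sort by value, largest first. obj1 itself keeps its original order.
descending = obj1.sort_values(ascending=False)
print(descending)
```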

### For a DataFrame, sort_index sorts the row labels or the column labels, depending on the axis you specify. To sort by the values in a column, pass the column name to the by parameter of sort_values

In [26]:
df2

            a         b         c
app -0.204185 -0.501947 -0.111658
win  0.493893  0.947630  0.140371
mac -1.316223 -1.542992  0.384525


In [27]:
df2.sort_values(by='b')

            a         b         c
mac -1.316223 -1.542992  0.384525
app -0.204185 -0.501947 -0.111658
win  0.493893  0.947630  0.140371
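
The remaining DataFrame sorting options can be sketched as follows. The frame here is hypothetical integer data, chosen so the ordering is easy to verify; `by` also accepts a list of columns, with one `ascending` flag per column.

```python
import pandas as pd

df = pd.DataFrame({'c': [3, 1, 2], 'a': [1, 1, 0], 'b': [9, 7, 8]},
                  index=['win', 'app', 'mac'])

# Sort by row labels (axis=0 is the default).
by_row_label = df.sort_index()

# Sort by column labels.
by_col_label = df.sort_index(axis=1)

# Sort by column a ascending, breaking ties by column c descending.
by_values = df.sort_values(by=['a', 'c'], ascending=[True, False])
print(by_values)
```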


## 3.3.4 Aggregation and Statistics

### On a DataFrame, the sum method totals each column, much like the SUM function in Excel

In [28]:
df = DataFrame(np.random.randn(9).reshape(3,3),columns=['a','b','c'])
df

          a         b         c
0 -0.966997  1.248340  0.491556
1  0.209450  0.380859 -0.565245
2 -1.065144  0.353550  0.353428


In [29]:
df.sum()

a   -1.822691
b    1.982749
c    0.279739
dtype: float64

### Passing axis=1 makes sum total each row instead

In [30]:
df.sum(axis=1)

0    0.772899
1    0.025063
2   -0.358166
dtype: float64
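
One detail worth knowing when totals look surprising: `sum` skips missing values by default. The sketch below uses a small hypothetical frame with one NaN to show the default and the `skipna=False` alternative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, np.nan],
                   'b': [4.0, 5.0, 6.0]})

col_totals = df.sum()              # NaN in column a is skipped
row_totals = df.sum(axis=1)        # the last row sums only column b
strict = df.sum(skipna=False)      # any NaN makes that column's total NaN
print(col_totals)
print(strict)
```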

### The describe method computes summary statistics for each numeric column and is often used for a first look at the data

In [31]:
data = {
    'name':['张三','李四','王五','小明'],
    'sex':['female','female','male','male'],
    'year':[2001,2001,2003,2002],
    'city':['北京','上海','广州','北京']
}
df = DataFrame(data)
df

  name     sex  year city
0   张三  female  2001   北京
1   李四  female  2001   上海
2   王五    male  2003   广州
3   小明    male  2002   北京


In [32]:
df.describe()

              year
count     4.000000
mean   2001.750000
std       0.957427
min    2001.000000
25%    2001.000000
50%    2001.500000
75%    2002.250000
max    2003.000000
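
Only `year` appears in the result because the other columns hold text. To summarize those as well, `describe` takes an `include` argument. The sketch below passes `include='all'` on the same data; for text columns it reports `count`, `unique`, `top` (the most frequent value) and `freq`, and leaves the numeric statistics as NaN.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['张三', '李四', '王五', '小明'],
    'sex': ['female', 'female', 'male', 'male'],
    'year': [2001, 2001, 2003, 2002],
    'city': ['北京', '上海', '广州', '北京'],
})

# include='all' summarizes numeric and non-numeric columns together.
summary = df.describe(include='all')
print(summary)
```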


## 3.3.5 Unique Values and Value Counts

### For a Series, the unique method returns an array of the distinct values

In [33]:
obj = Series(['a','b','a','c','b'])
obj

0    a
1    b
2    a
3    c
4    b
dtype: object

In [34]:
obj.unique()

array(['a', 'b', 'c'], dtype=object)

### The value_counts method counts how many times each value occurs

In [35]:
obj.value_counts()

b    2
a    2
c    1
dtype: int64

#### Note: unique and value_counts work the same way on a single column of a DataFrame.
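
A short sketch of both methods on a DataFrame column, reusing the `city` values from section 3.3.4 (the variable names are illustrative). Selecting a column with `df['city']` yields a Series, which is why the same methods are available.

```python
import pandas as pd

df = pd.DataFrame({'city': ['北京', '上海', '广州', '北京']})

# unique() returns the distinct values in order of first appearance.
cities = df['city'].unique()

# value_counts() returns the count per value, most frequent first.
counts = df['city'].value_counts()
print(cities)
print(counts)
```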