<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/PDSH-cover-small.png">
*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*

*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*

<!--NAVIGATION-->
< [Data Indexing and Selection](03.02-Data-Indexing-and-Selection.ipynb) | [Contents](Index.ipynb) | [Handling Missing Data](03.04-Missing-Values.ipynb) >

# Operating on Data in Pandas

One of the essential pieces of NumPy is the ability to perform quick element-wise operations, both with basic arithmetic (addition, subtraction, multiplication, etc.) and with more sophisticated operations (trigonometric functions, exponential and logarithmic functions, etc.).

In short, NumPy's core strength is fast element-wise computation.


Pandas inherits much of this functionality from NumPy, and the ufuncs that we introduced in [Computation on NumPy Arrays: Universal Functions](02.03-Computation-on-arrays-ufuncs.ipynb) are key to this.

Pandas includes a couple of useful twists, however: for unary operations like negation and trigonometric functions, these ufuncs will *preserve index and column labels* in the output, and for binary operations such as addition and multiplication, Pandas will automatically *align indices* when passing the objects to the ufunc.


This means that keeping the context of data and combining data from different sources–both potentially error-prone tasks with raw NumPy arrays–become essentially foolproof ones with Pandas.
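As a minimal sketch of that difference (the city labels and readings below are made up for illustration), adding two raw arrays pairs values purely by position, whereas adding two ``Series`` pairs them by label:

```python
import numpy as np
import pandas as pd

# Hypothetical readings from two sources that list the same cities in different orders
first = pd.Series([10, 20, 30], index=['Oslo', 'Rome', 'Lima'])
second = pd.Series([1, 2, 3], index=['Lima', 'Oslo', 'Rome'])

# Raw NumPy arrays are combined by position, so the labels play no role
by_position = first.to_numpy() + second.to_numpy()

# Pandas matches entries by index label before adding
by_label = first + second

print(by_position)        # [11 22 33]
print(by_label['Oslo'])   # 10 + 2
```

The positional result silently mixes values from different cities, while the label-based result adds each city's own pair of values.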


We will additionally see that there are well-defined operations between one-dimensional ``Series`` structures and two-dimensional ``DataFrame`` structures.


## Ufuncs: Index Preservation

Because Pandas is designed to work with NumPy, any NumPy ufunc will work on Pandas ``Series`` and ``DataFrame`` objects.

Let's start by defining a simple ``Series`` and ``DataFrame`` on which to demonstrate this:

In [1]:
import pandas as pd
import numpy as np

In [3]:
rng = np.random.RandomState(42)  # seeded random number generator
ser = pd.Series(rng.randint(0, 10, 4))  # four random integers in [0, 10)
ser

0    6
1    3
2    7
3    4
dtype: int64

In [7]:
demo_rng = np.random.RandomState(12)  # separate generator, so the state of rng is left untouched
demo_rng.randint(100, size=(3, 4))

array([[75, 27,  6,  2],
       [ 3, 67, 76, 48],
       [22, 49, 52,  5]])

In [4]:
df = pd.DataFrame(rng.randint(0, 10, (3, 4)),
                  columns=['A', 'B', 'C', 'D'], index=['a', 'b', 'c'])
df

Unnamed: 0,A,B,C,D
a,6,9,2,6
b,7,4,3,7
c,7,2,5,4


If we apply a NumPy ufunc on either of these objects, the result will be another Pandas object *with the indices preserved:*

In [10]:
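np.exp(ser)  # evaluated, but a notebook cell displays only its last expression
ser + 100    # element-wise addition keeps the index as well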

0    106
1    103
2    107
3    104
dtype: int64

Or, for a slightly more complex calculation:

In [11]:
np.sin(df * np.pi / 4)  # a NumPy ufunc applied to the whole DataFrame


Unnamed: 0,A,B,C,D
a,-1.0,0.7071068,1.0,-1.0
b,-0.707107,1.224647e-16,0.707107,-0.7071068
c,-0.707107,1.0,-0.707107,1.224647e-16


Any of the ufuncs discussed in [Computation on NumPy Arrays: Universal Functions](02.03-Computation-on-arrays-ufuncs.ipynb) can be used in a similar manner.
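A small sketch of this label preservation, using a made-up ``DataFrame`` (the names ``scores``, ``roots``, and ``negated`` are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical labeled data, chosen only for illustration
scores = pd.DataFrame([[1.0, 4.0], [9.0, 16.0]],
                      index=['x', 'y'], columns=['left', 'right'])

roots = np.sqrt(scores)   # a unary NumPy ufunc
negated = -scores         # unary negation

# Both results are still DataFrames carrying the original labels
print(type(roots).__name__)
print(roots.index.equals(scores.index), roots.columns.equals(scores.columns))
print(roots.loc['y', 'right'])   # 4.0
```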

## Ufuncs: Index Alignment

For binary operations on two ``Series`` or ``DataFrame`` objects, Pandas will align indices in the process of performing the operation.


This is very convenient when working with incomplete data, as we'll see in some of the examples that follow.

### Index alignment in Series

As an example, suppose we are combining two different data sources, and find only the top three US states by *area* and the top three US states by *population*:


In [12]:
area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                  'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127}, name='population')

df = pd.DataFrame({'area': area, 'pop': population})
df  # entries missing from either source are filled with NaN automatically

Unnamed: 0,area,pop
Alaska,1723337.0,
California,423967.0,38332521.0
New York,,19651127.0
Texas,695662.0,26448193.0


Let's see what happens when we divide these to compute the population density:

In [7]:
population / area  # NaN wherever a state is missing from either Series

Alaska              NaN
California    90.413926
New York            NaN
Texas         38.018740
dtype: float64

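The resulting array contains the *union* of indices of the two input arrays, which can be determined with the ``union`` method of the index; the ``intersection`` method gives the labels present in both. (Older versions of Pandas also accepted the set-style operators ``|`` and ``&`` for this, but since Pandas 2.0 those operators no longer perform set operations on an ``Index``.)

In [14]:
area.index.union(population.index)         # every label found in either Series
area.index.intersection(population.index)  # only this last expression is displayed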

Index(['California', 'Texas'], dtype='object')

Any item for which one or the other does not have an entry is marked with ``NaN``, or "Not a Number," which is how Pandas marks missing data (see further discussion of missing data in [Handling Missing Data](03.04-Missing-Values.ipynb)).
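The following sketch reruns the density calculation above and checks both claims: the result is indexed by the union of the input labels, and the labels found in only one input come out as ``NaN``. The names ``density`` and ``all_states`` are introduced here for illustration.

```python
import pandas as pd

area = pd.Series({'Alaska': 1723337, 'Texas': 695662, 'California': 423967})
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127})

density = population / area

# The result is indexed by every label that appears in either input
all_states = area.index.union(population.index)
print(sorted(density.index) == sorted(all_states))

# Labels missing from one of the inputs end up as NaN
print(sorted(density[density.isna()].index))
```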


This index matching is implemented this way for any of Python's built-in arithmetic expressions; any missing values are filled in with NaN by default:

In [15]:
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A + B

0    NaN
1    5.0
2    9.0
3    NaN
dtype: float64

If using NaN values is not the desired behavior, the fill value can be modified using appropriate object methods in place of the operators.

For example, calling ``A.add(B)`` is equivalent to calling ``A + B``, but allows optional explicit specification of the fill value for any elements in ``A`` or ``B`` that might be missing:

In [16]:
A.add(B, fill_value=0)  # a fill value can be specified for missing entries

0    2.0
1    5.0
2    9.0
3    5.0
dtype: float64

In [18]:
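A.add(B, fill_value=100)  # evaluated, but only the last expression is displayed
A.sub(B, fill_value=0)    # the same keyword works for the other arithmetic methods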

0    2.0
1    3.0
2    3.0
3   -5.0
dtype: float64
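One detail worth checking is what ``fill_value`` does when an entry is missing on *both* sides. The sketch below, a variation of the ``A`` and ``B`` used above with ``NaN`` entries added as an assumption for illustration, suggests that the fill value substitutes for one missing operand, while a position that is missing in both inputs stays missing:

```python
import numpy as np
import pandas as pd

# Variation on the Series above, with explicit NaN entries added for illustration
A = pd.Series([2, 4, np.nan], index=[0, 1, 2])
B = pd.Series([1, np.nan, 5], index=[1, 2, 3])

result = A.add(B, fill_value=0)

print(result.loc[0])   # only A has a value here, so 2 + 0
print(result.loc[1])   # both have values, so 4 + 1
print(result.loc[2])   # NaN in both inputs, so the result stays NaN
print(result.loc[3])   # only B has a value here, so 0 + 5
```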

### Index alignment in DataFrame

A similar type of alignment takes place for *both* columns and indices when performing operations on ``DataFrame``s:



In [19]:
list('aaaaa')

['a', 'a', 'a', 'a', 'a']

In [37]:
A = pd.DataFrame(rng.randint(0, 20, (2, 2)),
                 columns=list('AB'))  # list('AB') gives ['A', 'B']
A

Unnamed: 0,A,B
0,15,1
1,4,0


In [21]:
B = pd.DataFrame(rng.randint(0, 10, (3, 3)),
                 columns=list('BAC'))
B

Unnamed: 0,B,A,C
0,0,5,8
1,2,9,3
2,4,3,1


In [22]:
A + B

Unnamed: 0,A,B,C
0,20.0,1.0,
1,13.0,2.0,
2,,,


Notice that indices are aligned correctly irrespective of their order in the two objects, and indices in the result are sorted.
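To make the point about ordering concrete, here is a small sketch with two made-up frames (``left`` and ``right`` are hypothetical names) that share the same labels but list their rows and columns in different orders:

```python
import pandas as pd

# Hypothetical frames: same labels, different row and column order
left = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['r1', 'r2'])
right = pd.DataFrame({'B': [30, 40], 'A': [10, 20]}, index=['r2', 'r1'])

total = left + right

print(total.loc['r1', 'A'])   # 1 + 20
print(total.loc['r2', 'B'])   # 4 + 30
```

Each cell of the result combines the two values that share a row label and a column label, regardless of where they sit positionally.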

As was the case with ``Series``, we can use the associated object's arithmetic method and pass any desired ``fill_value`` to be used in place of missing entries.


Here we'll fill with the mean of all values in ``A`` (computed by first stacking the rows of ``A``):

In [36]:
fill = A.stack().mean()  # mean of all four values; A.mean() alone would give per-column means
fill

5.0

In [38]:
A.add(B, fill_value=fill)

Unnamed: 0,A,B,C
0,20.0,1.0,13.0
1,13.0,2.0,8.0
2,8.0,9.0,6.0

The following table lists Python operators and their equivalent Pandas object methods:

| Python Operator | Pandas Method(s)                      |
|-----------------|---------------------------------------|
| ``+``           | ``add()``                             |
| ``-``           | ``sub()``, ``subtract()``             |
| ``*``           | ``mul()``, ``multiply()``             |
| ``/``           | ``truediv()``, ``div()``, ``divide()``|
| ``//``          | ``floordiv()``                        |
| ``%``           | ``mod()``                             |
| ``**``          | ``pow()``                             |
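As a quick sanity check of the table, the sketch below compares each operator with one of its listed methods on two small made-up ``Series`` (``x`` and ``y`` are illustrative names):

```python
import pandas as pd

x = pd.Series([7, 8, 9])
y = pd.Series([2, 3, 4])

# Each entry pairs the operator form with the equivalent method call
pairs = {
    '+': (x + y, x.add(y)),
    '-': (x - y, x.sub(y)),
    '*': (x * y, x.mul(y)),
    '/': (x / y, x.truediv(y)),
    '//': (x // y, x.floordiv(y)),
    '%': (x % y, x.mod(y)),
    '**': (x ** y, x.pow(y)),
}
for symbol, (via_operator, via_method) in pairs.items():
    print(symbol, via_operator.equals(via_method))
```

The practical difference is that the method forms accept extra keywords such as ``fill_value`` and ``axis``, which the operators cannot.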


## Ufuncs: Operations Between DataFrame and Series

When performing operations between a ``DataFrame`` and a ``Series``, the index and column alignment is similarly maintained.


Operations between a ``DataFrame`` and a ``Series`` follow the same pattern as operations between a two-dimensional and a one-dimensional NumPy array.

Consider one common operation, where we find the difference of a two-dimensional array and one of its rows:

In [39]:
A = rng.randint(10, size=(3, 4))
A

array([[4, 1, 5, 5],
       [3, 4, 5, 5],
       [0, 6, 6, 3]])

In [40]:
A - A[0]   # broadcasting: the first row is subtracted from every row

array([[ 0,  0,  0,  0],
       [-1,  3,  0,  0],
       [-4,  5,  1, -2]])

According to NumPy's broadcasting rules (see [Computation on Arrays: Broadcasting](02.05-Computation-on-arrays-broadcasting.ipynb)), subtraction between a two-dimensional array and one of its rows is applied row-wise.

In Pandas, the convention similarly operates row-wise by default:

In [41]:
df = pd.DataFrame(A, columns=list('QRST'))
df - df.iloc[0]

Unnamed: 0,Q,R,S,T
0,0,0,0,0
1,-1,3,0,0
2,-4,5,1,-2
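The parallel between the two conventions can be checked directly. This sketch rebuilds the array shown above under the hypothetical names ``values`` and ``frame`` and compares the NumPy and Pandas results:

```python
import numpy as np
import pandas as pd

values = np.array([[4, 1, 5, 5],
                   [3, 4, 5, 5],
                   [0, 6, 6, 3]])
frame = pd.DataFrame(values, columns=list('QRST'))

numpy_result = values - values[0]       # NumPy broadcasting over rows
pandas_result = frame - frame.iloc[0]   # Pandas default: match the Series to the columns

# Same numbers either way; Pandas additionally keeps the column labels
print(np.array_equal(numpy_result, pandas_result.to_numpy()))
print(list(pandas_result.columns))
```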


If you would instead like to operate column-wise, you can use the object methods mentioned earlier, while specifying the ``axis`` keyword:

In [62]:
df.subtract(df['R'], axis=0)  # subtract column R from every column

Unnamed: 0,Q,R,S,T
0,3,0,4,4
1,-1,0,1,1
2,-6,0,0,-3
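As a sketch of what ``axis=0`` does, the column-wise subtraction can be compared with the equivalent NumPy expression, in which the column is kept two-dimensional so that it broadcasts across columns. The names ``values``, ``frame``, and ``column_wise`` are illustrative.

```python
import numpy as np
import pandas as pd

values = np.array([[4, 1, 5, 5],
                   [3, 4, 5, 5],
                   [0, 6, 6, 3]])
frame = pd.DataFrame(values, columns=list('QRST'))

# axis=0 (equivalently axis='index') matches the Series against the row index
column_wise = frame.sub(frame['R'], axis=0)

# NumPy counterpart: values[:, [1]] has shape (3, 1) and broadcasts across columns
print(np.array_equal(column_wise.to_numpy(), values - values[:, [1]]))
print(column_wise['R'].tolist())   # a column minus itself
```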

In [70]:
df['T']

0    5
1    5
2    3
Name: T, dtype: int64

Note that these ``DataFrame``/``Series`` operations, like the operations discussed above, will automatically align indices between the two elements:

In [77]:
df.iloc[0, [1, 2]]  # row 0, columns R and S; evaluated but not displayed
df.iloc[1, ::2]     # row 1, every second column

Q    3
S    5
Name: 1, dtype: int64

In [79]:
halfrow = df.iloc[0, ::2]
halfrow # Series

Q    4
S    5
Name: 0, dtype: int64

In [20]:
df - halfrow

Unnamed: 0,Q,R,S,T
0,0.0,,0.0,
1,-1.0,,0.0,
2,-4.0,,1.0,
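If the all-``NaN`` columns are not what you want, one possible workaround (an assumption about intent, not something the text prescribes) is to expand the partial row to every column first with ``reindex`` and a ``fill_value`` of 0, so that the unmatched columns are left unchanged:

```python
import numpy as np
import pandas as pd

values = np.array([[4, 1, 5, 5],
                   [3, 4, 5, 5],
                   [0, 6, 6, 3]])
frame = pd.DataFrame(values, columns=list('QRST'))
halfrow = frame.iloc[0, ::2]            # only the Q and S entries of the first row

aligned = frame - halfrow               # R and T have no match and become NaN

# Pad the partial row with zeros for the columns it lacks
padded = halfrow.reindex(frame.columns, fill_value=0)
filled = frame - padded                 # R and T keep their original values

print(aligned['R'].isna().all())
print(filled['R'].tolist())
```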


This preservation and alignment of indices and columns means that operations on data in Pandas will always maintain the data context, which prevents the types of silly errors that might come up when working with heterogeneous and/or misaligned data in raw NumPy arrays.
