

# Data Manipulation with Pandas

# 使用Pandas處理數據

> As we saw, NumPy's ``ndarray`` data structure provides essential features for the type of clean, well-organized data typically seen in numerical computing tasks.
While it serves this purpose very well, its limitations become clear when we need more flexibility (e.g., attaching labels to data, working with missing data, etc.) and when attempting operations that do not map well to element-wise broadcasting (e.g., groupings, pivots, etc.), each of which is an important piece of analyzing the less structured data available in many forms in the world around us.
Pandas, and in particular its ``Series`` and ``DataFrame`` objects, builds on the NumPy array structure and provides efficient access to these sorts of "data munging" tasks that occupy much of a data scientist's time.

正如我們前面看到的，NumPy的`ndarray`數據結構能為數值計算任務所需要的數據提供必不可少的功能。雖然`ndarray`的功能已經很強大，但是當我們需要更多的靈活性的時候，它的缺陷就體現了出來（例如，為數據提供標籤，處理缺失的數據等）。而且如果當需要對數據進行超過廣播能處理範疇的操作時（例如分組，數據透視等），NumPy就無能為力了。而上述提到的這些能力對於我們處理真實世界中產生的非嚴格格式化數據來說是非常重要的。 Pandas，或者更具體的來說，它的`Series`和`DataFrame`對象，在NumPy的基礎上提供了上述操作，讓數據科學家能從花很多時間的這種乏味的數據處理工作中解脫出來。

> In this chapter, we will focus on the mechanics of using ``Series``, ``DataFrame``, and related structures effectively.
We will use examples drawn from real datasets where appropriate, but these examples are not necessarily the focus.

我們在本章中會聚焦於了解`Series`、`DataFrame`和相關結構的機制上。例子中使用了真實的數據集進行說明。

## Series as generalized NumPy array 
## Series 作為通用的NumPy數組

> From what we've seen so far, it may look like the ``Series`` object is basically interchangeable with a one-dimensional NumPy array. The essential difference is the presence of the index: while the NumPy array has an *implicitly defined* integer index used to access the values, the Pandas ``Series`` has an *explicitly defined* index associated with the values. This explicit index definition gives the ``Series`` object additional capabilities. For example, the index need not be an integer, but can consist of values of any desired type.
If we wish, we can use strings as an index:

目前為止，我們看到的`Series`對象和一維NumPy數組似乎是可以互換的概念。兩者最基本的區別是索引序號的存在機制：NumPy數組的整數索引*隱式提供*的，而Pandas的`Series`的索引是*顯式定義*的。顯式定義的索引提供了`Series`對象額外的能力。例如，索引值不需要一定是個整數，可以用任何需要的數據類型來定義索引。比方說，下面我們用字符串來作為索引：

In [1]:
import pandas as pd
import numpy as np
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [2]:
data['b']

0.5
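
The index labels do not have to be strings or even sequential. As a minimal sketch (the values and the variable name ``ser`` are illustrative assumptions), here is a ``Series`` with non-contiguous integer labels, looked up by label through ``loc``:

```python
import pandas as pd

# Illustrative values; the integer labels are deliberately non-sequential.
ser = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

# .loc looks the value up by its label, not by its position.
value_at_5 = ser.loc[5]
print(value_at_5)
```

Here the label ``5`` refers to the second element, even though its position is 1.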

> In this way, you can think of a Pandas ``Series`` a bit like a specialization of a Python dictionary.
A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a ``Series`` is a structure which maps typed keys to a set of typed values.
This typing is important: just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas ``Series`` makes it much more efficient than Python dictionaries for certain operations.

在這個層面上，你可以將Pandas的`Series`當成Python字典的一種特殊情形。 Python中的字典可以將任意的關鍵字key和任意的值value對應起來，`Series`是一種能將特定類型的關鍵字key和特定類型的值value對應起來的字典。這種靜態類型是很重要的：正如NumPy數組的靜態類型能提供編譯好的代碼提升對Python列表或集合的操作性能一樣，Pandas的`Series`能提供編譯好的代碼提升對Python字典的操作性能。

In [3]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [4]:
population['California']

38332521

In [5]:
population['California':'Illinois']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64
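
The dictionary analogy can be sketched in both directions. The snippet below reuses two of the population figures from above; the names ``in_millions`` and ``round_trip`` are illustrative only. It shows that a ``Series`` carries a single dtype, supports arithmetic on all values at once, and can be converted back to a plain dictionary:

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193})

# Unlike a dict, every value shares one dtype.
print(population.dtype)

# Arithmetic is applied to all values at once; a dict would need a loop.
in_millions = population / 1e6

# Converting back yields an ordinary Python dictionary.
round_trip = population.to_dict()
print(round_trip)
```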

> A ``Series`` can be constructed with ``pd.Series(data, index=index)``, where ``index`` is an optional argument and ``data`` can be one of many entities. For example, ``data`` can be a list or NumPy array, in which case ``index`` defaults to an integer sequence:

其中`index`是一個可選的參數，而`data`可以使很多種的數據集合。例如`data`可以是列表或NumPy數組，在這種情況下`index`默認是一個整數序列：

In [6]:
pd.Series([2, 4, 6])  # from a list: the index defaults to 0, 1, 2

0    2
1    4
2    6
dtype: int64

In [7]:
pd.Series(5, index=[100, 200, 300])  # from a scalar: the value is repeated to fill the index

100    5
200    5
300    5
dtype: int64

In [8]:
pd.Series({2:'a', 1:'b', 3:'c'})  # from a dict: the index defaults to the dict keys

2    a
1    b
3    c
dtype: object

In [9]:
pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2])  # an explicit index selects and orders the dict keys

3    c
2    a
dtype: object
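
One related case is worth a quick sketch: when the explicit index names a key that the dictionary does not contain, the ``Series`` holds a missing value for it. The label ``4`` and the name ``letters`` below are assumptions chosen for illustration:

```python
import pandas as pd

# Label 4 is not a key of the dict, so its entry is missing.
letters = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2, 4])

# isna() marks which entries are missing.
missing = letters.isna()
print(letters)
print(missing)
```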

## DataFrame as a generalized NumPy array
## DataFrame作為一種通用的NumPy數組

> If a ``Series`` is an analog of a one-dimensional array with flexible indices, a ``DataFrame`` is an analog of a two-dimensional array with both flexible row indices and flexible column names.
Just as you might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a ``DataFrame`` as a sequence of aligned ``Series`` objects.
Here, by "aligned" we mean that they share the same index.

如果說`Series`是帶有靈活索引的通用一維數組的話，那麼`DataFrame`就是帶有靈活的行索引和列索引的通用二維數組。你也可以將`DataFrame`想像成一系列的`Series`對象堆疊在一起，所謂的堆疊實際上指的是這些`Series`擁有相同的索引值序列。

> To demonstrate this, let's first construct a new ``Series`` listing the area of each of the five states discussed in the previous section:

下面我們構建一個新的`Series`存儲著美國5個州面積（和上面的州人口例子一致）來說明這一點：

In [10]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

> Now that we have this along with the ``population`` Series from before, we can use a dictionary to construct a single two-dimensional object containing this information:

現在我們就有了兩個`Series`，一個人口和一個面積，我們可以再使用一個字典來創建一個二維的對象來存儲兩個序列的數據：

In [11]:
states = pd.DataFrame({'population': population,
                       'area': area})
states

Unnamed: 0,population,area
California,38332521,423967
Texas,26448193,695662
New York,19651127,141297
Florida,19552860,170312
Illinois,12882135,149995


> Like the ``Series`` object, the ``DataFrame`` has an ``index`` attribute that gives access to the row labels:

In [12]:
states.index

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')
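
Alongside ``index``, a ``DataFrame`` also exposes its column labels through the ``columns`` attribute. A minimal sketch using two of the states from above (the names ``row_labels`` and ``column_labels`` are illustrative):

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193})
area = pd.Series({'California': 423967, 'Texas': 695662})
states = pd.DataFrame({'population': population, 'area': area})

# Row labels come from the shared index of the two Series.
row_labels = list(states.index)

# Column labels come from the keys of the dictionary.
column_labels = list(states.columns)
print(row_labels, column_labels)
```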

> Similarly, we can also think of a ``DataFrame`` as a specialization of a dictionary.
Where a dictionary maps a key to a value, a ``DataFrame`` maps a column name to a ``Series`` of column data.
For example, asking for the ``'area'`` attribute returns the ``Series`` object containing the areas we saw earlier:

類似`Series`，我們也可以將`DataFrame`看成是一種特殊的字典。普通的字典將一個關鍵字key映射成一個值value，而`DataFrame`將一個列標籤映射成一個`Series`對象，裡面含有整列的數據。例如，訪問`area`屬性會返回一個`Series`對象包含前面我們放入的面積數據：

In [13]:
states['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [14]:
# From a single Series object
pd.DataFrame(population, columns=['population'])

Unnamed: 0,population
California,38332521
Texas,26448193
New York,19651127
Florida,19552860
Illinois,12882135


In [15]:
# From a list of dicts
data = [{'a': i, 'b': 2 * i}
        for i in range(3)]
pd.DataFrame(data)

Unnamed: 0,a,b
0,0,0
1,1,2
2,2,4


In [16]:
# Even if some keys in the dictionary are missing, Pandas will fill them in with ``NaN`` (i.e., "not a number") values:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

Unnamed: 0,a,b,c
0,1.0,2,
1,,3,4.0


In [17]:
# From a dictionary of Series objects 
pd.DataFrame({'population': population,
              'area': area})

Unnamed: 0,population,area
California,38332521,423967
Texas,26448193,695662
New York,19651127,141297
Florida,19552860,170312
Illinois,12882135,149995


In [18]:
# From a two-dimensional NumPy array
pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])

Unnamed: 0,foo,bar
a,0.705469,0.415147
b,0.503092,0.227274
c,0.109805,0.737346


In [19]:
# From a NumPy structured array
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
pd.DataFrame(A)

Unnamed: 0,A,B
0,0,0.0
1,0,0.0
2,0,0.0


## The Pandas Index Object

## Pandas的Index對象

> We have seen here that both the ``Series`` and ``DataFrame`` objects contain an explicit *index* that lets you reference and modify data.
This ``Index`` object is an interesting structure in itself, and it can be thought of either as an *immutable array* or as an *ordered set* (technically a multi-set, as ``Index`` objects may contain repeated values).
Those views have some interesting consequences in the operations available on ``Index`` objects.
As a simple example, let's construct an ``Index`` from a list of integers:

前面內容介紹的`Series`和`DataFrame`對像都包含著一個顯式定義的*索引index*對象，它的作用就是讓你快速訪問和修改數據。 `Index`對像是一個很有趣的數據結構，它可以被當成*不可變的數組*或者*排序的集合*（嚴格來說是多數據集合，因為`Index`允許包含重複的值）。這兩種看法在對`Index`對象進行操作時會產生一些很有趣的結果。先以一個簡單的例子來說明，我們從整數列表構建一個`Index`對象：

> The ``Index`` in many ways operates like an array.
For example, we can use standard Python indexing notation to retrieve values or slices:

`Index`很多的操作都像一个数组。例如，我们可以使用标准的Python索引语法来获得值和切片：

In [20]:
ind = pd.Index([2, 3, 5, 7, 11])
ind

Int64Index([2, 3, 5, 7, 11], dtype='int64')

In [21]:
ind[1]

3

In [22]:
ind[::2]

Int64Index([2, 5, 11], dtype='int64')

In [23]:
print(ind.size, ind.shape, ind.ndim, ind.dtype)

5 (5,) 1 int64
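
The "immutable array" view can be checked directly. In this small sketch, assigning to an element of an ``Index`` raises a ``TypeError``; the flag ``mutated`` is only an illustrative helper:

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

try:
    ind[1] = 0          # Index objects reject item assignment
    mutated = True
except TypeError as err:
    mutated = False
    print("TypeError:", err)
```

This immutability makes it safer to share one index between several ``Series`` and ``DataFrame`` objects without side effects from accidental modification.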


> Pandas objects are designed to facilitate operations such as joins across datasets, which depend on many aspects of set arithmetic.
The ``Index`` object follows many of the conventions used by Python's built-in ``set`` data structure, so that unions, intersections, differences, and other combinations can be computed in a familiar way:

Pandas對像被設計成能夠滿足跨數據集進行操作，例如連接多個數據集查找或操作數據，這很大程度依賴於集合運算。 `Index`對象遵循Python內建的`set`數據結構的運算法則，因此並集、交集、差集和其他的集合操作也可以按照熟悉的方式進行：

In [24]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

In [25]:
indA.intersection(indB)  # intersection (the operator form indA & indB is deprecated for set operations)


Int64Index([3, 5, 7], dtype='int64')

In [26]:
indA.union(indB)  # union


Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

In [27]:
indA.symmetric_difference(indB)  # symmetric difference


Int64Index([1, 2, 9, 11], dtype='int64')
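
The text also mentions differences, which are available through the ``difference`` method. A short sketch with the same two indices (``only_in_A`` and ``only_in_B`` are illustrative names):

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

# Labels present in one index but not in the other.
only_in_A = indA.difference(indB)
only_in_B = indB.difference(indA)
print(list(only_in_A), list(only_in_B))
```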

## Data Selection in Series

## 在Series中選擇數據

> As we saw in the previous section, a ``Series`` object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary.
If we keep these two overlapping analogies in mind, it will help us to understand the patterns of data indexing and selection in these arrays.

我們上一節已經看到，`Series`對像在很多方面都表現的像一個一維NumPy數組，也同時在很多方面表現像是一個標準的Python字典。如果我們能將這兩個基本概念記住，它們能幫助我們理解Series的數據索引和選擇的方法。

### Series as dictionary 將Series看成字典

> Like a dictionary, the ``Series`` object provides a mapping from a collection of keys to a collection of values:

像字典一樣，`Series`對象提供了從關鍵字集合到值集合的映射：

In [28]:
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [29]:
data['b']

0.5

In [30]:
'a' in data

True

In [31]:
data.keys()

Index(['a', 'b', 'c', 'd'], dtype='object')

In [32]:
list(data.items())

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]
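
The dictionary analogy extends to modification: assigning to a new key adds an entry, just as it would for a ``dict``. A minimal sketch (the label ``'e'`` and its value are assumptions for illustration):

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])

# Assigning to a label that does not exist yet extends the Series.
data['e'] = 1.25
print(data)
```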

### Series as one-dimensional array 將Series看成一維數組

> A ``Series`` builds on this dictionary-like interface and provides array-style item selection via the same basic mechanisms as NumPy arrays – that is, *slices*, *masking*, and *fancy indexing*.
Examples of these are as follows:

`Series`對象構建在字典一樣的接口之上，並且提供了和NumPy數組一樣的數據選擇方式，即*切片*，*遮蓋*和*高級索引*。請看下面的例子：

In [33]:
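data['a':'c']                       # slicing by explicit index: the endpoint 'c' is included
data[0:2]                           # slicing by implicit integer position: the endpoint is excluded
data[(data > 0.3) & (data < 0.8)]   # masking; only this last expression is displayed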

b    0.50
c    0.75
dtype: float64
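
Since only the last expression of the cell above is displayed, here is a sketch that keeps each result in its own variable and adds the fancy-indexing case. It uses ``loc`` and ``iloc`` to make the label/position distinction explicit; the variable names are illustrative:

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])

by_label = data.loc['a':'c']     # label slice: endpoint 'c' is included
by_position = data.iloc[0:2]     # positional slice: endpoint is excluded
fancy = data[['a', 'd']]         # fancy indexing with a list of labels

print(list(by_label.index), list(by_position.index), list(fancy.index))
```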

### Indexers: loc, iloc, and ix 索引符：loc，iloc 和 ix

> These slicing and indexing conventions can be a source of confusion.
For example, if your ``Series`` has an explicit integer index, an indexing operation such as ``data[1]`` will use the explicit indices, while a slicing operation like ``data[1:3]`` will use the implicit Python-style index. Because of this potential confusion in the case of integer indexes, Pandas provides some special *indexer* attributes that explicitly expose certain indexing schemes.
These are not functional methods, but attributes that expose a particular slicing interface to the data in the ``Series``: ``loc`` always references the explicit index, and ``iloc`` always references the implicit Python-style index.

仔細想一下，你會發現這樣的切片和索引操作是會造成混亂的。例如，如果`Series`對像有顯式的整數索引，那麼`data[1]`的操作會使用顯式索引，但是`data[1:3]`的操作會使用隱式索引。因為存在上面看到的這種混亂，Pandas提供了一些特殊的*索引符*屬性來明確指定使用哪種索引規則。這些索引符不是函數，而是用來訪問`Series`數據的切片屬性。

> One guiding principle of Python code is that "explicit is better than implicit."
The explicit nature of ``loc`` and ``iloc`` makes them very useful in maintaining clean and readable code; especially in the case of integer indexes, I recommend using them both to make code easier to read and understand, and to prevent subtle bugs due to the mixed indexing/slicing convention.

Python編碼的一大原則就有“明確含義優於隱含意義”。 `loc`和`iloc`屬性的明確含義使得它們對於維護乾淨和可讀的代碼方面非常有效；尤其是當使用顯示整數索引的情況下，作者推薦堅持使用它們，既能保證代碼的易讀性，也能防止因為前面提到的混亂情況造成的難以發現的bug。

In [34]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data

1    a
3    b
5    c
dtype: object
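
To make the confusion described above concrete, the following sketch contrasts plain ``[]`` access with the two indexers on the same integer-indexed ``Series``. The variable names are illustrative, and the comments on plain ``[]`` restate the behavior described in the text rather than a guarantee for every Pandas version:

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

single = data[1]        # plain indexing: uses the explicit label 1
sliced = data[1:3]      # plain slicing: described above as using implicit positions

explicit_slice = data.loc[1:3]    # always labels: rows labelled 1 and 3
implicit_slice = data.iloc[1:3]   # always positions: the second and third rows

print(single)
print(list(explicit_slice.index), list(implicit_slice.index))
```

Writing ``loc`` or ``iloc`` removes the need to remember which rule plain ``[]`` applies.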

In [35]:
data.loc[1]  # loc always references the explicit index labels

'a'

In [36]:
data.loc[1:3]  # label-based slice: both endpoints are included

1    a
3    b
dtype: object

> The ``iloc`` attribute allows indexing and slicing that always references the implicit Python-style index:

`iloc`屬性允許用戶永遠使用隱式索引來定位和切片：

In [37]:
data.iloc[1]

'b'

In [38]:
data.iloc[1:3]

3    b
5    c
dtype: object

> A third indexing attribute, ``ix``, was a hybrid of the two, and for ``Series`` objects was equivalent to standard ``[]``-based indexing.
Its purpose was most apparent in the context of ``DataFrame`` objects, which we will discuss in a moment.

Translator's note: ``ix`` was deprecated in later Pandas releases and has since been removed (as of pandas 1.0), so new code should use ``loc`` or ``iloc`` instead.

## Data Selection in DataFrame

## DataFrame 選擇數據

> Recall that a ``DataFrame`` acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary of ``Series`` structures sharing the same index.
These analogies can be helpful to keep in mind as we explore data selection within this structure.

我們介紹過`DataFrame`表現得既像二維數組又像由共同的索引值組成的`Series`對象的字典。這能幫助你學習如何在`DataFrame`裡面進行數據選擇。

### DataFrame as a dictionary 將DataFrame當成字典

> The first analogy we will consider is the ``DataFrame`` as a dictionary of related ``Series`` objects.
Let's return to our example of areas and populations of states:

首先我們將`DataFrame`看成是相關`Series`對象組成的字典。讓我們回到之前那個美國州人口和麵積的例子：

In [39]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data

Unnamed: 0,area,pop
California,423967,38332521
Texas,695662,26448193
New York,141297,19651127
Florida,170312,19552860
Illinois,149995,12882135


In [40]:
data['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [41]:
data.area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [42]:
data.area is data['area']

True

> Though this is a useful shorthand, keep in mind that it does not work for all cases!
For example, if the column names are not strings, or if the column names conflict with methods of the ``DataFrame``, this attribute-style access is not possible.
For example, the ``DataFrame`` has a ``pop()`` method, so ``data.pop`` will point to this rather than the ``"pop"`` column:

雖然這是個有用的縮寫方式，但是請記住屬性表達式並不是通用的。例如，如果列名不是字符串，或者與`DataFrame`的方法名字發生衝突，屬性表達式都沒法使用。例如，`DataFrame`有`pop()`方法，因此，`data.pop`將會指向該方法而不是`"pop"`列：

In [43]:
data.pop is data['pop']

False

> In particular, you should avoid the temptation to try column assignment via attribute (i.e., use ``data['pop'] = z`` rather than ``data.pop = z``).

特別是應該避免使用屬性表達式給列賦值（例如，應該使用`data['pop']=z`而不是`data.pop=z`）。

In [44]:
data['density'] = data['pop'] / data['area']  # dictionary-style assignment adds a new column
data

Unnamed: 0,area,pop,density
California,423967,38332521,90.413926
Texas,695662,26448193,38.01874
New York,141297,19651127,139.076746
Florida,170312,19552860,114.806121
Illinois,149995,12882135,85.883763


> This shows a preview of the straightforward syntax of element-by-element arithmetic between ``Series`` objects; we'll dig into this further in [Operating on Data in Pandas](03.03-Operations-in-Pandas.ipynb).

這裡展示了使用直接的語法對多個`Series`對象按元素進行算術運算；我們會在[在Pandas中操作數據](03.03-Operations-in-Pandas.ipynb)一節中深入討論。

### DataFrame as two-dimensional array 將DataFrame看成二維數組

> As mentioned previously, we can also view the ``DataFrame`` as an enhanced two-dimensional array.
We can examine the raw underlying data array using the ``values`` attribute:

前面說到，我們也可以將`DataFrame`看成是一個擴展的二維數組。我們可以通過`values`屬性查看`DataFrame`對象的底層數組：

In [45]:
data.values

array([[4.23967000e+05, 3.83325210e+07, 9.04139261e+01],
       [6.95662000e+05, 2.64481930e+07, 3.80187404e+01],
       [1.41297000e+05, 1.96511270e+07, 1.39076746e+02],
       [1.70312000e+05, 1.95528600e+07, 1.14806121e+02],
       [1.49995000e+05, 1.28821350e+07, 8.58837628e+01]])

> With this picture in mind, many familiar array-like observations can be done on the ``DataFrame`` itself.
For example, we can transpose the full ``DataFrame`` to swap rows and columns:

有了這個基本概念之後，很多熟悉的數組操作都可以應用在`DataFrame`對像上。例如，我們可以將`DataFrame`的行和列交換，也就是矩陣的倒置：

In [46]:
data.T

Unnamed: 0,California,Texas,New York,Florida,Illinois
area,423967.0,695662.0,141297.0,170312.0,149995.0
pop,38332520.0,26448190.0,19651130.0,19552860.0,12882140.0
density,90.41393,38.01874,139.0767,114.8061,85.88376


> When it comes to indexing of ``DataFrame`` objects, however, it is clear that the dictionary-style indexing of columns precludes our ability to simply treat it as a NumPy array.
In particular, passing a single index to an array accesses a row, whereas passing a single "index" to a ``DataFrame`` accesses a column:

當我們需要對`DataFrame`對象進行索引時，因為列所具有的字典索引方式，我們無法簡單地按照NumPy數組的方式來處理。比方說傳遞一個索引值來獲取一行：

In [47]:
data.values[0]

array([4.23967000e+05, 3.83325210e+07, 9.04139261e+01])

In [48]:
data['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

### Indexers: loc, iloc, and ix 索引符：loc，iloc 和 ix


> Thus for array-style indexing, we need another convention.
Here Pandas again uses the ``loc``, ``iloc``, and ``ix`` indexers mentioned earlier.
Using the ``iloc`` indexer, we can index the underlying array as if it is a simple NumPy array (using the implicit Python-style index), but the ``DataFrame`` index and column labels are maintained in the result:

因此對於數組方式的索引方式，我們需要使用另一種方法。 Pandas仍然使用`loc`、`iloc`和`ix`索引符來進行操作。當你使用`iloc`時，這就是使用隱式索引，Pandas會把`DataFrame`當成底層的NumPy數組來處理，但行和列的索引值還是會保留在結果中：

In [49]:
data.iloc[:3, :2]

Unnamed: 0,area,pop
California,423967,38332521
Texas,695662,26448193
New York,141297,19651127


> Similarly, using the ``loc`` indexer we can index the underlying data in an array-like style but using the explicit index and column names:

使用`loc`索引符時，我們使用的是明確指定的顯示索引：

In [50]:
data.loc[:'Illinois', :'pop']

Unnamed: 0,area,pop
California,423967,38332521
Texas,695662,26448193
New York,141297,19651127
Florida,170312,19552860
Illinois,149995,12882135


In [51]:
data.loc[data.density > 100, ['pop', 'density']]

Unnamed: 0,pop,density
New York,19651127,139.076746
Florida,19552860,114.806121


In [52]:
data.iloc[0, 2] = 90
data

Unnamed: 0,area,pop,density
California,423967,38332521,90.0
Texas,695662,26448193,38.01874
New York,141297,19651127,139.076746
Florida,170312,19552860,114.806121
Illinois,149995,12882135,85.883763
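
One more convention is worth sketching because it can look inconsistent at first: with plain ``[]`` on a ``DataFrame``, a single label selects a column, while a slice or a Boolean mask selects rows. The example below rebuilds a three-state subset of the data above; the variable names are illustrative:

```python
import pandas as pd

data = pd.DataFrame({'area': [423967, 695662, 141297],
                     'pop': [38332521, 26448193, 19651127]},
                    index=['California', 'Texas', 'New York'])
data['density'] = data['pop'] / data['area']

column = data['area']                      # single label: a column
rows_by_label = data['Texas':'New York']   # slice: rows, endpoint included
dense_rows = data[data['density'] > 100]   # Boolean mask: rows

print(list(rows_by_label.index), list(dense_rows.index))
```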


> To build up your fluency in Pandas data manipulation, I suggest spending some time with a simple ``DataFrame`` and exploring the types of indexing, slicing, masking, and fancy indexing that are allowed by these various indexing approaches.

為了鍛煉你操作Pandas數據的熟練度，建議花些時間構建一個簡單的`DataFrame`對象，然後在上面運用索引、切片、遮蓋和高級索引等各種操作。

## Ufuncs: Index Preservation

## Ufuncs：保留索引

> Because Pandas is designed to work with NumPy, any NumPy ufunc will work on Pandas ``Series`` and ``DataFrame`` objects.
Let's start by defining a simple ``Series`` and ``DataFrame`` on which to demonstrate this:

因為Pandas是設計和NumPy一起使用的，因此所有的NumPy通用函數都可以在Pandas的`Series`和`DataFrame`對像上使用。首先我們定義簡單的`Series`和`DataFrame`對象來展示：

In [53]:
import pandas as pd
import numpy as np

rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 10, 4))
ser

0    6
1    3
2    7
3    4
dtype: int64

In [54]:
df = pd.DataFrame(rng.randint(0, 10, (3, 4)),
                  columns=['A', 'B', 'C', 'D'])
df

Unnamed: 0,A,B,C,D
0,6,9,2,6
1,7,4,3,7
2,7,2,5,4


In [55]:
np.exp(ser)  # applying a unary ufunc to either object yields another Pandas object with the index preserved

0     403.428793
1      20.085537
2    1096.633158
3      54.598150
dtype: float64

In [56]:
np.sin(df * np.pi / 4)

Unnamed: 0,A,B,C,D
0,-1.0,0.7071068,1.0,-1.0
1,-0.707107,1.224647e-16,0.707107,-0.7071068
2,-0.707107,1.0,-0.707107,1.224647e-16
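
Index preservation is easier to see with a non-default index. A minimal sketch, assuming illustrative string labels:

```python
import numpy as np
import pandas as pd

ser = pd.Series([0.0, 1.0, 2.0], index=['a', 'b', 'c'])

# The ufunc returns a Series that carries the same labels.
result = np.exp(ser)
print(result)
```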


## UFuncs: Index Alignment

## Ufuncs：索引對齊

> For binary operations on two ``Series`` or ``DataFrame`` objects, Pandas will align indices in the process of performing the operation.
This is very convenient when working with incomplete data, as we'll see in some of the examples that follow.

對於兩個`Series`或`DataFrame`進行二元運算操作，Pandas會在運算過程中會自動將兩個數據集的索引進行對齊操作。這對於我們處理不完整的數據集的情況下非常方便，下面我們來看一些例子。

### Index alignment in Series , Series對像中的索引對齊

> As an example, suppose we are combining two different data sources, and find only the top three US states by *area* and the top three US states by *population*:

假設我們從兩個不同的數據源分別獲得美國前三大面積和前三大人口的州，作為下面的例子：

In [57]:
area = pd.Series({'Alaska': 1723337, 'Texas': 695662,'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,'New York': 19651127}, name='population')
population / area

Alaska              NaN
California    90.413926
New York            NaN
Texas         38.018740
dtype: float64

In [58]:
area.index.union(population.index)  # the resulting index is the union of the two input indices


Index(['Alaska', 'California', 'New York', 'Texas'], dtype='object')

> Any item for which one or the other does not have an entry is marked with ``NaN``, or "Not a Number," which is how Pandas marks missing data (see further discussion of missing data in [Handling Missing Data](03.04-Missing-Values.ipynb)).

兩個任意輸入數據集中對應的另一個數據集不存在的元素都會被設置為`NaN`（非數字的縮寫），也就是Pandas標示缺失數據的方法：

In [59]:
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A + B

0    NaN
1    5.0
2    9.0
3    NaN
dtype: float64

> If using NaN values is not the desired behavior, the fill value can be modified using appropriate object methods in place of the operators.
For example, calling ``A.add(B)`` is equivalent to calling ``A + B``, but allows optional explicit specification of the fill value for any elements in ``A`` or ``B`` that might be missing:

如果填充成NaN值不是你需要的結果，你可以使用相應的ufunc函數來計算，然後在函數中設置相應的填充值參數。例如，調用`A.add(B)`等同於調用`A + B`，但是可以提供額外的參數來設置用來缺失的替換值：

In [60]:
A.add(B, fill_value=0)

0    2.0
1    5.0
2    9.0
3    5.0
dtype: float64
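
The same ``fill_value`` mechanism applies to the other arithmetic methods. As a sketch, here is subtraction on the same two ``Series``, where each entry missing on one side is treated as 0 before the operation (``diff`` is an illustrative name):

```python
import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

# Label 0 is missing from B and label 3 is missing from A;
# both gaps are filled with 0 before subtracting.
diff = A.sub(B, fill_value=0)
print(diff)
```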

### Index alignment in DataFrame , DataFrame中的索引對齊

> A similar type of alignment takes place for *both* columns and indices when performing operations on ``DataFrame``s:

類似的對齊方式在對`DataFrame`操作當中會同時發生在列和行上：

In [61]:
A = pd.DataFrame(rng.randint(0, 20, (2, 2)),
                 columns=list('AB'))
A

Unnamed: 0,A,B
0,1,11
1,5,1


In [62]:
B = pd.DataFrame(rng.randint(0, 10, (3, 3)),
                 columns=list('BAC'))
B

Unnamed: 0,B,A,C
0,4,0,9
1,5,8,0
2,9,2,6


In [63]:
A + B

Unnamed: 0,A,B,C
0,1.0,15.0,
1,13.0,6.0,
2,,,


> Notice that indices are aligned correctly irrespective of their order in the two objects, and indices in the result are sorted.
As was the case with ``Series``, we can use the associated object's arithmetic method and pass any desired ``fill_value`` to be used in place of missing entries.
Here we'll fill with the mean of all values in ``A`` (computed by first stacking the rows of ``A``):

注意不管索引在輸入數據集中的順序並不會影響結果當中索引的對齊情況。與`Series`的情況一樣，我們可以使用相應的ufunc函數來代替標準運算操作，然後代入你需要的`fill_value`參數來代替缺失值。這裡我們會使用`A`中所有值的平均值來替代空值，我們首先堆疊（stack）`A`的所有行來計算平均值：

In [64]:
fill = A.stack().mean()
A.add(B, fill_value=fill)

Unnamed: 0,A,B,C
0,1.0,15.0,13.5
1,13.0,6.0,4.5
2,6.5,13.5,10.5
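
A detail that the example above does not exercise: ``fill_value`` only fills a cell that is missing from one of the two operands. If a cell is absent from both, the result is still ``NaN``. The two small frames below are assumptions constructed to show this:

```python
import pandas as pd

left = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'], index=[0, 1])
right = pd.DataFrame([[10, 20], [30, 40]], columns=['B', 'C'], index=[1, 2])

# Cell (0, 'C') exists in neither frame, and neither does (2, 'A').
result = left.add(right, fill_value=0)
print(result)
```

Cells present in only one frame are combined with 0, while the two cells missing from both frames remain ``NaN``.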


> The following table lists Python operators and their equivalent Pandas object methods:

下面列出了Python的運算操作及其對應的Pandas方法：

| Python operator | Pandas method(s)                      | Python operator | Pandas method(s)                      |
|-----------------|---------------------------------------|-----------------|---------------------------------------|
| ``+``           | ``add()``                             | ``//``          | ``floordiv()``                        |
| ``-``           | ``sub()``, ``subtract()``             | ``%``           | ``mod()``                             |
| ``*``           | ``mul()``, ``multiply()``             | ``**``          | ``pow()``                             |
| ``/``           | ``truediv()``, ``div()``, ``divide()``|                 |                                       |

## Ufuncs: Operations Between DataFrame and Series

## Ufuncs：DataFrame和Series之間的操作

> When performing operations between a ``DataFrame`` and a ``Series``, the index and column alignment is similarly maintained.
Operations between a ``DataFrame`` and a ``Series`` are similar to operations between a two-dimensional and one-dimensional NumPy array.
Consider one common operation, where we find the difference of a two-dimensional array and one of its rows:

當在`DataFrame`和`Series`之間進行運算操作時，行和列的標籤對齊機制依然有效。 `DataFrame`和`Series`之間的操作類似於在一維數組和二維數組之間進行操作。例如一個很常見的操作，我們想要找出一個二維數組和它其中一行的差：

In [65]:
A = rng.randint(10, size=(3, 4))
A

array([[3, 8, 2, 4],
       [2, 6, 4, 8],
       [6, 1, 3, 8]])

In [66]:
A - A[0]

array([[ 0,  0,  0,  0],
       [-1, -2,  2,  4],
       [ 3, -7,  1,  4]])

> According to NumPy's broadcasting rules (see [Computation on Arrays: Broadcasting](02.05-Computation-on-arrays-broadcasting.ipynb)), subtraction between a two-dimensional array and one of its rows is applied row-wise. In Pandas, the convention similarly operates row-wise by default. If you would instead like to operate column-wise, you can use the object methods mentioned earlier, while specifying the ``axis`` keyword:

依據NumPy的廣播規則（參見[在數組上計算：廣播](02.05-Computation-on-arrays-broadcasting.ipynb)），二維數組的每一行都會減去它自身的第一行。如果你希望能夠按照列進行減法，你需要使用對應的ufunc函數，然後指定`axis`參數：

In [67]:
df = pd.DataFrame(A, columns=list('QRST'))
df - df.iloc[0]

Unnamed: 0,Q,R,S,T
0,0,0,0,0
1,-1,-2,2,4
2,3,-7,1,4


In [68]:
df.subtract(df['R'], axis=0)

Unnamed: 0,Q,R,S,T
0,-5,0,-6,-4
1,-4,0,-2,2
2,5,0,2,7
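
A common use of the ``axis=0`` form is centering each row on its own mean. The sketch below reuses the array values shown above; ``row_means`` and ``centered`` are illustrative names:

```python
import pandas as pd

df = pd.DataFrame([[3, 8, 2, 4],
                   [2, 6, 4, 8],
                   [6, 1, 3, 8]], columns=list('QRST'))

# One mean per row, indexed by the row labels 0, 1, 2.
row_means = df.mean(axis=1)

# axis=0 aligns the Series with the row index, so each row
# has its own mean subtracted.
centered = df.sub(row_means, axis=0)
print(centered)
```

Without ``axis=0``, Pandas would try to align ``row_means`` with the column labels ``Q``, ``R``, ``S``, ``T`` instead.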


> Note that these ``DataFrame``/``Series`` operations, like the operations discussed above, will automatically align indices between the two elements:

上面介紹的這些`DataFrame`或者`Series`操作，都會自動對運算的數據集進行索引對齊：

In [69]:
halfrow = df.iloc[0, ::2]  # columns Q and S of the first row
halfrow

Q    3
S    2
Name: 0, dtype: int64

In [70]:
df - halfrow

Unnamed: 0,Q,R,S,T
0,0.0,,0.0,
1,-1.0,,2.0,
2,3.0,,1.0,


## Example

In [71]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv('input/movehubcostofliving.csv')
data.head()

Unnamed: 0,City,Cappuccino,Cinema,Wine,Gasoline,Avg Rent,Avg Disposable Income
0,Lausanne,3.15,12.59,8.4,1.32,1714.0,4266.11
1,Zurich,3.28,12.59,8.4,1.31,2378.61,4197.55
2,Geneva,2.8,12.94,10.49,1.28,2607.95,3917.72
3,Basel,3.5,11.89,7.35,1.25,1649.29,3847.76
4,Perth,2.87,11.43,10.08,0.97,2083.14,3358.55


### Find out the Places where Cappuccino is Least and Most Expensive

In [72]:
print("The Places where Cappuccino is Very Cheap")
data[['City','Cappuccino']].sort_values(by = 'Cappuccino').head().reset_index(drop = True)

The Places where Cappuccino is Very Cheap


Unnamed: 0,City,Cappuccino
0,Addis Ababa,0.46
1,Kochi,0.6
2,Chennai,0.66
3,Porto,0.68
4,Ahmedabad,0.72


In [73]:
print("The Places where Cappuccino is very Expensive")
data[['City','Cappuccino']].sort_values(by = 'Cappuccino', ascending = False).head(5).reset_index(drop = True)

The Places where Cappuccino is very Expensive


Unnamed: 0,City,Cappuccino
0,Stavanger,4.48
1,Bergen,3.92
2,Nashville,3.84
3,Trondheim,3.81
4,Copenhagen,3.66


### Find out the Places where Cinema is Least and Most Expensive

In [74]:
print("The Places where Cinema is Very Cheap")
data[['City','Cinema']].sort_values(by = 'Cinema').head(5).reset_index(drop = True)

The Places where Cinema is Very Cheap


Unnamed: 0,City,Cinema
0,Hyderabad,1.81
1,Chennai,1.81
2,Kochi,1.81
3,Davao,1.9
4,Dhaka,2.09


In [75]:
print("\n The Places where Cinema is very Expensive")
data[['City','Cinema']].sort_values(by = 'Cinema', ascending = False).head(5).reset_index(drop = True)


 The Places where Cinema is very Expensive


Unnamed: 0,City,Cinema
0,Riyadh,79.49
1,Brighton,14.95
2,Geneva,12.94
3,Lausanne,12.59
4,Zurich,12.59


### Find out the Places where Wine is Least and Most Expensive

In [76]:
print("The Places where Wine is Very Cheap")
print(data[['City','Wine']].sort_values(by = 'Wine').head(5).reset_index(drop = True))

print("\n The Places where Wine is very Expensive")
print(data[['City','Wine']].sort_values(by = 'Wine', ascending = False).head(5).reset_index(drop = True))

The Places where Wine is Very Cheap
        City  Wine
0      Braga  2.13
1   Budapest  2.85
2  Bucharest  2.94
3     Lisbon  2.98
4     Malaga  2.98

 The Places where Wine is very Expensive
        City   Wine
0     Tehran  26.15
1     Manama  19.61
2     Riyadh  17.43
3    Jakarta  16.83
4  Singapore  15.82


### Find out the Places where Gasoline is Least and Most Expensive

In [77]:
print("The Places where Gasoline is Very Cheap")
print(data[['City','Gasoline']].sort_values(by = 'Gasoline').head(5).reset_index(drop = True))

print("\n The Places where Gasoline is very Expensive")
print(data[['City','Gasoline']].sort_values(by = 'Gasoline', ascending = False).head(5).reset_index(drop = True))

The Places where Gasoline is Very Cheap
      City  Gasoline
0  Caracas      0.07
1   Riyadh      0.08
2   Manama      0.17
3     Doha      0.18
4    Cairo      0.19

 The Places where Gasoline is very Expensive
        City  Gasoline
0      Izmir      1.69
1  Stavanger      1.68
2   Istanbul      1.66
3    Antalya      1.62
4     Bergen      1.57


### Find out the Places where Rent is Least and Most Expensive

In [78]:
print("The Places where Avg Rent is Very Low")
print(data[['City','Avg Rent']].sort_values(by = 'Avg Rent').head(5).reset_index(drop = True))

print("\n The Places where Avg Rent is very High")
print(data[['City','Avg Rent']].sort_values(by = 'Avg Rent', ascending = False).head(5).reset_index(drop = True))

The Places where Avg Rent is Very Low
        City  Avg Rent
0   Vadodara    120.68
1      Kochi    181.02
2  Ahmedabad    193.08
3    Karachi    197.78
4     Indore    205.15

 The Places where Avg Rent is very High
        City  Avg Rent
0  Hong Kong   5052.31
1   New York   3268.84
2  Singapore   3164.42
3     Sydney   2788.71
4     Geneva   2607.95


### Find out the Places where Avg Disposable Income is Lowest and Highest

In [79]:
print("The Places where Avg Disposable Income is Very Low")
print(data[['City','Avg Disposable Income']].sort_values(by = 'Avg Disposable Income').head(5).reset_index(drop = True))

print("\n The Places where Avg Disposable Income is Very High")
print(data[['City','Avg Disposable Income']].sort_values(by = 'Avg Disposable Income',
                                                    ascending = False).head(5).reset_index(drop = True))

The Places where Avg Disposable Income is Very Low
          City  Avg Disposable Income
0       Indore                 120.68
1  Addis Ababa                 124.22
2       Lahore                 132.95
3      Karachi                 139.60
4        Davao                 158.34

 The Places where Avg Disposable Income is Very High
       City  Avg Disposable Income
0  Lausanne                4266.11
1    Zurich                4197.55
2    Geneva                3917.72
3     Basel                3847.76
4     Perth                3358.55
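
The `sort_values(...).head(5)` pattern used in the cells above can also be written with `nsmallest` and `nlargest`. The following is a minimal sketch on a made-up frame named `toy`; the city labels and prices are placeholders, not values from the dataset. The two forms give the same rows when the sorted column has no ties.

```python
import pandas as pd

# Hypothetical toy data; not taken from the cost-of-living dataset above.
toy = pd.DataFrame({
    "City": ["A", "B", "C", "D", "E", "F"],
    "Wine": [5.0, 2.5, 9.0, 3.5, 7.0, 1.5],
})

# Three cheapest rows; same rows as sort_values(by="Wine").head(3) when there are no ties
cheapest = toy.nsmallest(3, "Wine").reset_index(drop=True)

# Three most expensive rows; same rows as
# sort_values(by="Wine", ascending=False).head(3) when there are no ties
priciest = toy.nlargest(3, "Wine").reset_index(drop=True)

print(cheapest)
print(priciest)
```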


## Example: Airport and Flight Data

In [80]:
import pandas as pd

flights = pd.read_csv('input/flights.csv')
planes = pd.read_csv('input/planes.csv')
airlines = pd.read_csv('input/airlines.csv')
airports = pd.read_csv('input/airports.csv')

airlines

|   | carrier | name |
|---|---|---|
| 0 | 9E | Endeavor Air Inc. |
| 1 | AA | American Airlines Inc. |
| 2 | AS | Alaska Airlines Inc. |
| 3 | B6 | JetBlue Airways |
| 4 | DL | Delta Air Lines Inc. |
| 5 | EV | ExpressJet Airlines Inc. |
| 6 | F9 | Frontier Airlines Inc. |
| 7 | FL | AirTran Airways Corporation |
| 8 | HA | Hawaiian Airlines Inc. |
| 9 | MQ | Envoy Air |


In [81]:
# Section 6 assignment: tackle the questions below.
# What is the most popular destination city from New York?
table1 = (
    flights.groupby('dest')
    .agg(count=('dest', 'count'))
    .sort_values(by='count', ascending=False)
    .reset_index()
)
pd.merge(table1, airports[['faa', 'name']], how='left', left_on='dest', right_on='faa')

|   | dest | count | faa | name |
|---|---|---|---|---|
| 0 | ORD | 17283 | ORD | Chicago Ohare Intl |
| 1 | ATL | 17215 | ATL | Hartsfield Jackson Atlanta Intl |
| 2 | LAX | 16174 | LAX | Los Angeles Intl |
| 3 | BOS | 15508 | BOS | General Edward Lawrence Logan Intl |
| 4 | MCO | 14082 | MCO | Orlando Intl |
| ... | ... | ... | ... | ... |
| 100 | MTJ | 15 | MTJ | Montrose Regional Airport |
| 101 | SBN | 10 | SBN | South Bend Rgnl |
| 102 | ANC | 8 | ANC | Ted Stevens Anchorage Intl |
| 103 | LGA | 1 | LGA | La Guardia |
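
The count-then-join step above can be tried on a few rows to see what each stage produces. This is a minimal sketch using hypothetical miniature tables (`toy_flights`, `toy_airports`); it swaps in `groupby(...).size()` for the named aggregation, which counts rows per group and is one of several ways to get the same tally.

```python
import pandas as pd

# Hypothetical miniature versions of the flights and airports tables.
toy_flights = pd.DataFrame({"dest": ["ORD", "ATL", "ORD", "LAX", "ORD", "ATL"]})
toy_airports = pd.DataFrame({
    "faa": ["ORD", "ATL", "LAX"],
    "name": ["Chicago Ohare Intl", "Hartsfield Jackson Atlanta Intl", "Los Angeles Intl"],
})

# Count flights per destination, then rank from most to least frequent
dest_counts = (
    toy_flights.groupby("dest")
    .size()
    .reset_index(name="count")
    .sort_values(by="count", ascending=False)
    .reset_index(drop=True)
)

# Attach the airport name; a left join keeps destinations that have no match
named = dest_counts.merge(toy_airports, how="left", left_on="dest", right_on="faa")
print(named)
```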


In [82]:
# Which airline is the most punctual?
flights['total_delay'] = flights['arr_delay'] + flights['dep_delay']
table1 = (
    flights.groupby('carrier')
    .agg(mean_delay=('total_delay', 'mean'))
    .sort_values(by='mean_delay')
    .reset_index()
)
pd.merge(table1, airlines, how='left', on='carrier')

|   | carrier | mean_delay | name |
|---|---|---|---|
| 0 | AS | -4.100141 | Alaska Airlines Inc. |
| 1 | HA | -2.01462 | Hawaiian Airlines Inc. |
| 2 | US | 5.874288 | US Airways Inc. |
| 3 | AA | 8.933421 | American Airlines Inc. |
| 4 | DL | 10.868291 | Delta Air Lines Inc. |
| 5 | VX | 14.52111 | Virgin America |
| 6 | UA | 15.57492 | United Air Lines Inc. |
| 7 | MQ | 21.220114 | Envoy Air |
| 8 | B6 | 22.425521 | JetBlue Airways |
| 9 | 9E | 23.819244 | Endeavor Air Inc. |


In [84]:
# Which destination has the longest flight duration (mean air time per origin-destination pair)?
table1 = (
    flights.groupby(['origin', 'dest'])
    .agg(average_air_time=('air_time', 'mean'))
    .reset_index()
    .sort_values(by='average_air_time', ascending=False)
)
pd.merge(table1, airports[['faa', 'name']], how='left', left_on='dest', right_on='faa')

|   | origin | dest | average_air_time | faa | name |
|---|---|---|---|---|---|
| 0 | JFK | HNL | 623.087719 | HNL | Honolulu Intl |
| 1 | EWR | HNL | 612.075209 | HNL | Honolulu Intl |
| 2 | EWR | ANC | 413.125000 | ANC | Ted Stevens Anchorage Intl |
| 3 | JFK | SFO | 347.403626 | SFO | San Francisco Intl |
| 4 | JFK | SJC | 346.606707 | SJC | Norman Y Mineta San Jose Intl |
| ... | ... | ... | ... | ... | ... |
| 219 | EWR | ALB | 31.787081 | ALB | Albany Intl |
| 220 | JFK | PHL | 30.836872 | PHL | Philadelphia Intl |
| 221 | EWR | PHL | 28.666667 | PHL | Philadelphia Intl |
| 222 | EWR | BDL | 25.466019 | BDL | Bradley Intl |


In [87]:
# Which airline is the worst in terms of delays? Frontier.
# Which airline has the highest seat capacity?

# One row per (carrier, tailnum) pair so each plane's seats are counted once per carrier
carrier_tailnum = flights[['carrier', 'tailnum']].drop_duplicates()
seats = (
    pd.merge(carrier_tailnum, planes[['tailnum', 'seats']], how='left')
    .groupby('carrier')
    .agg(total_seats=('seats', 'sum'))
    .sort_values(by='total_seats', ascending=False)
)
seats

| carrier | total_seats |
|---|---|
| UA | 116252.0 |
| DL | 115715.0 |
| WN | 82700.0 |
| US | 57821.0 |
| AA | 29309.0 |
| B6 | 27148.0 |
| EV | 19525.0 |
| 9E | 13685.0 |
| AS | 13465.0 |
| FL | 13451.0 |
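
The totals above are floats (for example `116252.0`). One possible cause, not confirmed by the source, is that the left merge leaves `NaN` for tail numbers with no match in `planes`, which turns an integer `seats` column into floats. The sketch below reproduces that effect on hypothetical data (`toy_tails`, `toy_planes`) and uses the `indicator=True` option of `merge` to count the unmatched rows.

```python
import pandas as pd

# Hypothetical data: tail number "N3" has no entry in the planes table.
toy_tails = pd.DataFrame({
    "carrier": ["UA", "UA", "DL"],
    "tailnum": ["N1", "N2", "N3"],
})
toy_planes = pd.DataFrame({"tailnum": ["N1", "N2"], "seats": [150, 200]})

# indicator=True adds a "_merge" column recording where each row came from
merged = toy_tails.merge(toy_planes, how="left", on="tailnum", indicator=True)

# Unmatched rows get NaN, so the integer "seats" column becomes float
print(merged["seats"].dtype)

# sum() skips NaN by default, so carriers with unmatched planes are undercounted
total_seats = merged.groupby("carrier")["seats"].sum()
unmatched = (merged["_merge"] == "left_only").sum()
print(total_seats)
print(unmatched)
```

Checking the number of `left_only` rows before summing gives a sense of how much capacity might be missing from each carrier's total.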


In [88]:
# Which airplane model is used the most, and from which manufacturer?
airplanes_use = flights.groupby('tailnum').agg(count=('tailnum', 'count')).reset_index()
(
    pd.merge(planes[['tailnum', 'model', 'manufacturer']], airplanes_use)
    .groupby(['model', 'manufacturer'])
    .agg(total_flights=('count', 'sum'))
    .sort_values(by='total_flights', ascending=False)
)


| model | manufacturer | total_flights |
|---|---|---|
| A320-232 | AIRBUS | 31278 |
| EMB-145LR | EMBRAER | 28027 |
| ERJ 190-100 IGW | EMBRAER | 23716 |
| A320-232 | AIRBUS INDUSTRIE | 14553 |
| EMB-145XR | EMBRAER | 14051 |
| ... | ... | ... |
| 737-3T5 | BOEING | 2 |
| 737-3A4 | BOEING | 1 |
| A330-323 | AIRBUS | 1 |
| A330-223 | AIRBUS INDUSTRIE | 1 |
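
The result above lists `A320-232` under both `AIRBUS` and `AIRBUS INDUSTRIE`, so its flights are split across two rows. If the two spellings are meant to be the same manufacturer (an assumption, not something the source states), the labels could be unified before grouping. A minimal sketch on hypothetical rows with made-up counts:

```python
import pandas as pd

# Hypothetical rows shaped like the result above; the counts are made up.
usage = pd.DataFrame({
    "model": ["A320-232", "A320-232", "EMB-145LR"],
    "manufacturer": ["AIRBUS", "AIRBUS INDUSTRIE", "EMBRAER"],
    "count": [10, 5, 8],
})

# Assumption: both spellings refer to the same manufacturer.
usage["manufacturer"] = usage["manufacturer"].replace({"AIRBUS INDUSTRIE": "AIRBUS"})

# Same grouping as in the cell above, now with one row per model
by_model = (
    usage.groupby(["model", "manufacturer"])
    .agg(total_flights=("count", "sum"))
    .sort_values(by="total_flights", ascending=False)
)
print(by_model)
```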



