<!--NAVIGATION-->
< [Structured Data: NumPy's Structured Arrays](02.09-Structured-Data-NumPy.ipynb) | [Contents](Index.ipynb) | [Introducing Pandas Objects](03.01-Introducing-Pandas-Objects.ipynb) >

<a href="https://colab.research.google.com/github/wangyingsm/Python-Data-Science-Handbook/blob/master/notebooks/03.00-Introduction-to-Pandas.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>


# Data Manipulation with Pandas

> As we saw, NumPy's ``ndarray`` data structure provides essential features for the type of clean, well-organized data typically seen in numerical computing tasks.
While it serves this purpose very well, its limitations become clear when we need more flexibility (e.g., attaching labels to data, working with missing data, etc.) and when attempting operations that do not map well to element-wise broadcasting (e.g., groupings, pivots, etc.), each of which is an important piece of analyzing the less structured data available in many forms in the world around us.
Pandas, and in particular its ``Series`` and ``DataFrame`` objects, builds on the NumPy array structure and provides efficient access to these sorts of "data munging" tasks that occupy much of a data scientist's time.

> In this chapter, we will focus on the mechanics of using ``Series``, ``DataFrame``, and related structures effectively.
We will use examples drawn from real datasets where appropriate, but these examples are not necessarily the focus.

Pandas has two characteristic data structures:

- A ``Series`` is an array-like object that holds a one-dimensional array of data.
- A ``DataFrame`` resembles an Excel spreadsheet: it holds two-dimensional data with an index (rows) and columns.

### Series as generalized NumPy array

> From what we've seen so far, it may look like the ``Series`` object is basically interchangeable with a one-dimensional NumPy array. The essential difference is the presence of the index: while the NumPy array has an *implicitly defined* integer index used to access the values, the Pandas ``Series`` has an *explicitly defined* index associated with the values. This explicit index definition gives the ``Series`` object additional capabilities. For example, the index need not be an integer, but can consist of values of any desired type.
If we wish, we can use strings as an index:

In [1]:
import pandas as pd
import numpy as np

data = pd.Series([0.25, 0.5, 0.75, 1.0])  # default integer index: 0, 1, 2, 3
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

In [2]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])  # explicit string index: a, b, c, d
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [3]:
data['b']  # look up a value by its index label

0.5
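
A minimal sketch of what an explicit index permits, using made-up labels that are not part of the original cells: the index below consists of non-contiguous integers, so ``data[5]`` looks up the label 5 rather than the sixth position.

```python
import pandas as pd

# Illustrative values; the index labels are arbitrary, non-sequential integers.
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

# With an explicit integer index, square brackets look up the label, not the position.
value_at_label_5 = data[5]
print(value_at_label_5)
```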

> You can think of a Pandas ``Series`` a bit like a specialization of a Python dictionary.
A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a ``Series`` is a structure which maps typed keys to a set of typed values.
This typing is important: just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas ``Series`` makes it much more efficient than Python dictionaries for certain operations.

In [4]:
# dict -> pd.Series
population_dict = {'California': 38332521,'Texas': 26448193,'New York': 19651127, 'Florida': 19552860, 'Illinois': 12882135}
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [5]:
population['California':'New York']

California    38332521
Texas         26448193
New York      19651127
dtype: int64
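
Unlike a plain dictionary, a ``Series`` supports array-style slicing on its keys. The sketch below rebuilds the same population data and contrasts a label slice, which includes the end label, with a positional slice via ``iloc``, which excludes the end position. The variable names ``by_label`` and ``by_position`` are chosen here for illustration.

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127, 'Florida': 19552860,
                        'Illinois': 12882135})

by_label = population['California':'New York']   # label slice: the end label is included
by_position = population.iloc[0:2]               # positional slice: the end position is excluded

print(by_label.index.tolist())
print(by_position.index.tolist())
```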

### DataFrame as a generalized NumPy array

> If a ``Series`` is an analog of a one-dimensional array with flexible indices, a ``DataFrame`` is an analog of a two-dimensional array with both flexible row indices and flexible column names.
Just as you might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a ``DataFrame`` as a sequence of aligned ``Series`` objects.
Here, by "aligned" we mean that they share the same index.

In [63]:
# dict -> pd.Series -> pd.DataFrame
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,'Florida': 170312, 'Illinois': 149995}
states = pd.DataFrame({'area': pd.Series(area_dict)})
states

|            | area   |
|------------|--------|
| California | 423967 |
| Texas      | 695662 |
| New York   | 141297 |
| Florida    | 170312 |
| Illinois   | 149995 |


In [8]:
states.index

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')
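
As a rough illustration of the "sequence of aligned ``Series``" view, the sketch below builds a two-column ``DataFrame`` from two ``Series`` that share the same labels, then checks that each column is itself a ``Series`` carrying the frame's index. The three-state subset is chosen only to keep the example short.

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297})
population = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127})

# Each dictionary key becomes a column name; the Series share one index.
states = pd.DataFrame({'population': population, 'area': area})

print(states.columns.tolist())
print(type(states['area']).__name__)            # each column is a Series
print(states['area'].index.equals(states.index))
```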

### The Pandas Index Object

> We have seen here that both the ``Series`` and ``DataFrame`` objects contain an explicit *index* that lets you reference and modify data.
This ``Index`` object is an interesting structure in itself, and it can be thought of either as an *immutable array* or as an *ordered set* (technically a multi-set, as ``Index`` objects may contain repeated values).
Those views have some interesting consequences in the operations available on ``Index`` objects.
As a simple example, let's construct two ``Index`` objects from lists of integers:

In [69]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

In [70]:
indA.size
indA.shape
indA.ndim
indA.dtype
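indA.intersection(indB)             # intersection (older pandas also accepted indA & indB)
indA.union(indB)                    # union (older pandas: indA | indB)
indA.symmetric_difference(indB)     # symmetric difference (older pandas: indA ^ indB); result shown below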

Int64Index([1, 2, 9, 11], dtype='int64')
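
The "immutable array" view can be sketched as follows: an ``Index`` supports NumPy-style indexing and slicing, but item assignment raises a ``TypeError``. The values and the ``mutated`` flag are illustrative, not taken from the original cells.

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

# Array-like access works as it does for a NumPy array.
second = ind[1]
every_other = ind[::2].tolist()

# Assignment is rejected because an Index is immutable.
try:
    ind[1] = 0
    mutated = True
except TypeError:
    mutated = False

print(second, every_other, mutated)
```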

## Data Selection in Series

> A ``Series`` object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary.


In [71]:
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd']) 
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [72]:
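# Series as dictionary: a Series maps a collection of keys to a collection of values
'a' in data          # membership test against the index labels
data[0:4]            # positional slice (end position excluded)
data['b']            # single label
data['A':'D']        # labels are case-sensitive: nothing lies between 'A' and 'D', so this is empty
data['a':'c']        # label slice (end label included)
data[0:3]            # positional slice: the first three elements
data.keys()          # the index
list(data.items())   # (label, value) pairs; result shown below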

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

In [18]:
# Conditions
data[(data > 0.2) & (data < 0.6)]

a    0.25
b    0.50
dtype: float64
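
Square-bracket slices can be positional (``data[0:3]``) or label-based (``data['a':'c']``), which is easy to confuse when the labels are themselves integers. The sketch below is an addition to the original cells: it uses the ``loc`` (label) and ``iloc`` (position) accessors with illustrative values to make the choice explicit.

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

by_label = data.loc[1]                  # the element whose label is 1
by_position = data.iloc[1]              # the element at position 1
label_slice = data.loc[1:3].tolist()    # label slice, end label included
pos_slice = data.iloc[1:3].tolist()     # positional slice, end position excluded

print(by_label, by_position, label_slice, pos_slice)
```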

## Data Selection in DataFrame

> Recall that a ``DataFrame`` acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary of ``Series`` structures sharing the same index.
These analogies can be helpful to keep in mind as we explore data selection within this structure.

In [78]:
# dictionary -> Series -> DataFrame
area = pd.Series({'California': 423967, 'Texas': 695662,'New York': 141297, 'Florida': 170312,'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,'New York': 19651127, 'Florida': 19552860,'Illinois': 12882135,'ZMore': 111})
data = pd.DataFrame({'area':area, 'pop':pop})
data

|            | area     | pop      |
|------------|----------|----------|
| California | 423967.0 | 38332521 |
| Florida    | 170312.0 | 19552860 |
| Illinois   | 149995.0 | 12882135 |
| New York   | 141297.0 | 19651127 |
| Texas      | 695662.0 | 26448193 |
| ZMore      | NaN      | 111      |
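
The output above has a ``NaN`` in the ``area`` column and shows ``area`` as floating point. A small sketch of why, using a two-state subset plus the same extra ``ZMore`` row: when the input ``Series`` do not share identical labels, the ``DataFrame`` index is the union of the labels, entries missing from one ``Series`` become ``NaN``, and an integer column that receives a ``NaN`` is stored as float.

```python
import pandas as pd

area = pd.Series({'Texas': 695662, 'California': 423967})
pop = pd.Series({'Texas': 26448193, 'California': 38332521, 'ZMore': 111})

data = pd.DataFrame({'area': area, 'pop': pop})

print(data)
print(data.dtypes)   # 'area' becomes float because of the NaN; 'pop' stays integer
```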


In [79]:
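# DataFrame as a dictionary of columns
data.area           # attribute-style access works here, but breaks for names that clash with DataFrame methods (data.pop is a method)
data['area']        # preferred: dictionary-style access; likewise assign with data['pop'] = z, not data.pop = z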

California    423967.0
Florida       170312.0
Illinois      149995.0
New York      141297.0
Texas         695662.0
ZMore              NaN
Name: area, dtype: float64

In [80]:
data['density'] = data['pop'] / data['area']
data

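|            | area     | pop      | density    |
|------------|----------|----------|------------|
| California | 423967.0 | 38332521 | 90.413926  |
| Florida    | 170312.0 | 19552860 | 114.806121 |
| Illinois   | 149995.0 | 12882135 | 85.883763  |
| New York   | 141297.0 | 19651127 | 139.076746 |
| Texas      | 695662.0 | 26448193 | 38.01874   |
| ZMore      | NaN      | 111      | NaN        |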


## DataFrame Computation

In [24]:
s_sub = pd.Series({'Pruthvi': 'Kannada', 'Pranam': 'Hindi', 'Pratham':'English', 'Pravera':'Maths', 'Prabu':'Science'})
total_m = pd.Series({'Pruthvi': 60, 'Pranam': 60, 'Pratham':60, 'Pravera':60, 'Prabu':60})
minf_m = pd.Series({'Pruthvi': 30, 'Pranam': 30, 'Pratham':30, 'Pravera':30, 'Prabu':30})
obt_m = pd.Series({'Pruthvi': 25, 'Pranam': 35, 'Pratham':40, 'Pravera':60, 'Prabu':55})

students_d = pd.DataFrame({'Sub' : s_sub, 'T_m' : total_m, 'Min_m' : minf_m, 'O_m' : obt_m})
students_d

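|         | Sub     | T_m | Min_m | O_m |
|---------|---------|-----|-------|-----|
| Pruthvi | Kannada | 60  | 30    | 25  |
| Pranam  | Hindi   | 60  | 30    | 35  |
| Pratham | English | 60  | 30    | 40  |
| Pravera | Maths   | 60  | 30    | 60  |
| Prabu   | Science | 60  | 30    | 55  |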


In [25]:
students_d['score'] = students_d['O_m'] / students_d['T_m']
students_d['mp_score'] = students_d['Min_m'] / students_d['T_m']
students_d
#students_d.T  # transpose (swap rows and columns)

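|         | Sub     | T_m | Min_m | O_m | score    | mp_score |
|---------|---------|-----|-------|-----|----------|----------|
| Pruthvi | Kannada | 60  | 30    | 25  | 0.416667 | 0.5      |
| Pranam  | Hindi   | 60  | 30    | 35  | 0.583333 | 0.5      |
| Pratham | English | 60  | 30    | 40  | 0.666667 | 0.5      |
| Pravera | Maths   | 60  | 30    | 60  | 1.0      | 0.5      |
| Prabu   | Science | 60  | 30    | 55  | 0.916667 | 0.5      |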


In [26]:
students_d[1:4]
students_d['Pranam':'Pravera']

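|         | Sub     | T_m | Min_m | O_m | score    | mp_score |
|---------|---------|-----|-------|-----|----------|----------|
| Pranam  | Hindi   | 60  | 30    | 35  | 0.583333 | 0.5      |
| Pratham | English | 60  | 30    | 40  | 0.666667 | 0.5      |
| Pravera | Maths   | 60  | 30    | 60  | 1.0      | 0.5      |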


In [27]:
students_d.values

array([['Kannada', 60, 30, 25, 0.4166666666666667, 0.5],
       ['Hindi', 60, 30, 35, 0.5833333333333334, 0.5],
       ['English', 60, 30, 40, 0.6666666666666666, 0.5],
       ['Maths', 60, 30, 60, 1.0, 0.5],
       ['Science', 60, 30, 55, 0.9166666666666666, 0.5]], dtype=object)

## Ufuncs: Index Preservation

> Because Pandas is designed to work with NumPy, any NumPy ufunc will work on Pandas ``Series`` and ``DataFrame`` objects.

In [28]:
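# `ps` is never defined in this notebook; an illustrative Series is assumed here
ps = ser = pd.Series([6, 3, 7, 4])
ps.index     # the index object
ps.shape     # returns the shape
ps.dtype     # returns the data type
ps.size      # returns the number of elements
ps.empty     # checks whether the series is empty
ps.hasnans   # checks whether the series has any NaN value
ps.nbytes    # returns the number of bytes in the data
ps.ndim      # returns the number of dimensions in the data
ser.rolling(window=2).sum()            # sum over a sliding window of two elements
ser.expanding(min_periods=2).sum()     # running sum, reported once two values are available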

In [83]:
import pandas as pd
import numpy as np

rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 10, 4))
ser

0    6
1    3
2    7
3    4
dtype: int32

In [84]:
df = pd.DataFrame(rng.randint(0, 10, (3, 4)),columns=['A', 'B', 'C', 'D'])
df

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 6 | 9 | 2 | 6 |
| 1 | 7 | 4 | 3 | 7 |
| 2 | 7 | 2 | 5 | 4 |


In [85]:
np.exp(ser)                   #np.exp(df)
np.sin(ser * np.pi / 4)       #np.sin(df * np.pi / 4)

0   -1.000000e+00
1    7.071068e-01
2   -7.071068e-01
3    1.224647e-16
dtype: float64
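
A short check of the index-preservation claim, using a hypothetical string index so that the preserved labels are easy to spot: applying ``np.exp`` returns another ``Series`` whose index is unchanged.

```python
import numpy as np
import pandas as pd

# Illustrative labels; the values mirror the `ser` example above.
ser = pd.Series([6, 3, 7, 4], index=['w', 'x', 'y', 'z'])

result = np.exp(ser)

print(type(result).__name__)            # still a pandas object
print(result.index.equals(ser.index))   # the index is carried over unchanged
```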

## Ufuncs: Index Alignment

> For binary operations on two ``Series`` or ``DataFrame`` objects, Pandas will align indices in the process of performing the operation.
This is very convenient when working with incomplete data, as we'll see in some of the examples that follow.

### Index alignment in Series

> As an example, suppose we are combining two different data sources, and find only the top three US states by *area* and the top three US states by *population*:

In [32]:
area = pd.Series({'Alaska': 1723337, 'Texas': 695662,'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,'New York': 19651127}, name='population')
population / area

Alaska              NaN
California    90.413926
New York            NaN
Texas         38.018740
dtype: float64

In [33]:
area.index.union(population.index)   # union of the two indices

Index(['Alaska', 'California', 'New York', 'Texas'], dtype='object')

> Any item for which one or the other does not have an entry is marked with ``NaN``, or "Not a Number," which is how Pandas marks missing data (see further discussion of missing data in [Handling Missing Data](03.04-Missing-Values.ipynb)).

In [34]:
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A + B

0    NaN
1    5.0
2    9.0
3    NaN
dtype: float64

> If using NaN values is not the desired behavior, the fill value can be modified using appropriate object methods in place of the operators.
For example, calling ``A.add(B)`` is equivalent to calling ``A + B``, but allows optional explicit specification of the fill value for any elements in ``A`` or ``B`` that might be missing:

In [35]:
A.add(B, fill_value=0)

0    2.0
1    5.0
2    9.0
3    5.0
dtype: float64

### Index alignment in DataFrame

> A similar type of alignment takes place for *both* columns and indices when performing operations on ``DataFrame``s:

In [87]:
A = pd.DataFrame(rng.randint(0, 20, (2, 2)),columns=list('AB'))
B = pd.DataFrame(rng.randint(0, 10, (3, 3)),columns=list('ABC'))
A + B

|   | A    | B    | C   |
|---|------|------|-----|
| 0 | 25.0 | 6.0  | NaN |
| 1 | 10.0 | 19.0 | NaN |
| 2 | NaN  | NaN  | NaN |


> Notice that indices are aligned correctly irrespective of their order in the two objects, and indices in the result are sorted.
As was the case with ``Series``, we can use the associated object's arithmetic method and pass any desired ``fill_value`` to be used in place of missing entries.
Here we'll fill with the mean of all values in ``A`` (computed by first stacking the rows of ``A``):

In [88]:
A.add(B, fill_value=A.stack().mean())

|   | A     | B     | C     |
|---|-------|-------|-------|
| 0 | 25.0  | 6.0   | 18.75 |
| 1 | 10.0  | 19.0  | 13.75 |
| 2 | 18.75 | 11.75 | 19.75 |


> The following table lists Python operators and their equivalent Pandas object methods:

| Python operator | Pandas method(s)                       | Python operator | Pandas method(s) |
|-----------------|----------------------------------------|-----------------|------------------|
| ``+``           | ``add()``                              | ``//``          | ``floordiv()``   |
| ``-``           | ``sub()``, ``subtract()``              | ``%``           | ``mod()``        |
| ``*``           | ``mul()``, ``multiply()``              | ``**``          | ``pow()``        |
| ``/``           | ``truediv()``, ``div()``, ``divide()`` |                 |                  |
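
To illustrate the table, the sketch below checks that an operator and its method form agree, and shows that the method form additionally accepts ``fill_value``. It reuses the small ``A`` and ``B`` Series from the cells above; the names ``same`` and ``filled`` are chosen here for illustration.

```python
import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

# The operator and the method produce the same aligned result (NaN where a label is missing).
same = (A // B).equals(A.floordiv(B))

# Only the method form can substitute a value for missing entries before computing.
filled = A.sub(B, fill_value=0)

print(same)
print(filled.tolist())
```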

## Ufuncs: Operations Between DataFrame and Series

> When performing operations between a ``DataFrame`` and a ``Series``, the index and column alignment is similarly maintained.
Operations between a ``DataFrame`` and a ``Series`` are similar to operations between a two-dimensional and one-dimensional NumPy array.
Consider one common operation, where we find the difference of a two-dimensional array and one of its rows.

> According to NumPy's broadcasting rules (see [Computation on Arrays: Broadcasting](02.05-Computation-on-arrays-broadcasting.ipynb)), subtraction between a two-dimensional array and one of its rows is applied row-wise: the chosen row is subtracted from every row of the array. Pandas operates row-wise by default in the same way. If you would instead like to operate column-wise, you can use the object methods mentioned earlier, while specifying the ``axis`` keyword:
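
Before the ``DataFrame`` cells, the plain NumPy behaviour referred to above can be sketched as follows. The array values are an assumption: they were chosen to be consistent with the ``DataFrame`` outputs shown in this section, whereas the notebook itself draws them at random.

```python
import numpy as np

# Illustrative 3x4 array (assumed values, consistent with the outputs in this section).
A = np.array([[3, 8, 2, 4],
              [2, 6, 4, 8],
              [6, 1, 3, 8]])

# Broadcasting stretches the first row across all rows, so the subtraction is row-wise.
diff = A - A[0]
print(diff)
```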

In [42]:
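A = rng.randint(10, size=(3, 4))             # 3x4 integer array; without this, A is still the 2x2 DataFrame from the previous section
df = pd.DataFrame(A, columns=list('QRST'))
df - df.iloc[0]                              # subtract the first row from every row (row-wise)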

|   | Q  | R  | S | T |
|---|----|----|---|---|
| 0 | 0  | 0  | 0 | 0 |
| 1 | -1 | -2 | 2 | 4 |
| 2 | 3  | -7 | 1 | 4 |


In [43]:
df.subtract(df['R'], axis=0)

|   | Q  | R | S  | T  |
|---|----|---|----|----|
| 0 | -5 | 0 | -6 | -4 |
| 1 | -4 | 0 | -2 | 2  |
| 2 | 5  | 0 | 2  | 7  |


> Note that these ``DataFrame``/``Series`` operations, like the operations discussed above, will automatically align indices between the two elements:

In [44]:
halfrow = df.iloc[0, ::2]   # columns Q and S of the first row
halfrow

Q    3
S    2
Name: 0, dtype: int32

In [45]:
df - halfrow

|   | Q    | R   | S   | T   |
|---|------|-----|-----|-----|
| 0 | 0.0  | NaN | 0.0 | NaN |
| 1 | -1.0 | NaN | 2.0 | NaN |
| 2 | 3.0  | NaN | 1.0 | NaN |


## Example - Places Price

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv('input/movehubcostofliving.csv')
data.head()

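|   | City     | Cappuccino | Cinema | Wine  | Gasoline | Avg Rent | Avg Disposable Income |
|---|----------|------------|--------|-------|----------|----------|-----------------------|
| 0 | Lausanne | 3.15       | 12.59  | 8.4   | 1.32     | 1714.0   | 4266.11               |
| 1 | Zurich   | 3.28       | 12.59  | 8.4   | 1.31     | 2378.61  | 4197.55               |
| 2 | Geneva   | 2.8        | 12.94  | 10.49 | 1.28     | 2607.95  | 3917.72               |
| 3 | Basel    | 3.5        | 11.89  | 7.35  | 1.25     | 1649.29  | 3847.76               |
| 4 | Perth    | 2.87       | 11.43  | 10.08 | 0.97     | 2083.14  | 3358.55               |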


### Find out the Places where Cappuccino is Least and Most Expensive

In [6]:
data[['City','Cappuccino']].sort_values(by = 'Cappuccino', ascending = True).head().reset_index(drop = True)  # Very Cheap
data[['City','Cappuccino']].sort_values(by = 'Cappuccino', ascending = False).head().reset_index(drop = True) # Most Expensive

The places where cappuccino is most expensive (only the result of the last expression in the cell is displayed):


|   | City       | Cappuccino |
|---|------------|------------|
| 0 | Stavanger  | 4.48       |
| 1 | Bergen     | 3.92       |
| 2 | Nashville  | 3.84       |
| 3 | Trondheim  | 3.81       |
| 4 | Copenhagen | 3.66       |


### Find out the Places where Cinema is Least and Most Expensive

In [9]:
data[['City','Cinema']].sort_values(by = 'Cinema', ascending = True).head().reset_index(drop = True)  # Very Cheap
data[['City','Cinema']].sort_values(by = 'Cinema', ascending = False).head().reset_index(drop = True) # Most Expensive

|   | City     | Cinema |
|---|----------|--------|
| 0 | Riyadh   | 79.49  |
| 1 | Brighton | 14.95  |
| 2 | Geneva   | 12.94  |
| 3 | Lausanne | 12.59  |
| 4 | Zurich   | 12.59  |


### Find out the Places where Rent is Least and Most Expensive

In [10]:
print(data[['City','Avg Rent']].sort_values(by = 'Avg Rent', ascending = True).head().reset_index(drop = True))
print(data[['City','Avg Rent']].sort_values(by = 'Avg Rent', ascending = False).head().reset_index(drop = True))

        City  Avg Rent
0   Vadodara    120.68
1      Kochi    181.02
2  Ahmedabad    193.08
3    Karachi    197.78
4     Indore    205.15
        City  Avg Rent
0  Hong Kong   5052.31
1   New York   3268.84
2  Singapore   3164.42
3     Sydney   2788.71
4     Geneva   2607.95
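
The ``sort_values(...).head()`` pattern used in these cells can also be written with the standard ``nsmallest`` and ``nlargest`` methods. The sketch below is an alternative, not the notebook's own approach; the miniature table is a hypothetical stand-in for the CSV file, with rent values copied from the output above.

```python
import pandas as pd

# Hypothetical four-row stand-in for movehubcostofliving.csv.
data = pd.DataFrame({
    'City': ['Vadodara', 'Geneva', 'Hong Kong', 'Kochi'],
    'Avg Rent': [120.68, 2607.95, 5052.31, 181.02],
})

# Two cheapest and two most expensive cities by average rent.
cheapest = data.nsmallest(2, 'Avg Rent')[['City', 'Avg Rent']].reset_index(drop=True)
priciest = data.nlargest(2, 'Avg Rent')[['City', 'Avg Rent']].reset_index(drop=True)

print(cheapest)
print(priciest)
```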


## Example for airport

In [14]:
import pandas as pd

flights= pd.read_csv('input/flights.csv')
planes= pd.read_csv('input/planes.csv')
airlines= pd.read_csv('input/airlines.csv')
airports=pd.read_csv('input/airports.csv')

airlines

|   | carrier | name                        |
|---|---------|-----------------------------|
| 0 | 9E      | Endeavor Air Inc.           |
| 1 | AA      | American Airlines Inc.      |
| 2 | AS      | Alaska Airlines Inc.        |
| 3 | B6      | JetBlue Airways             |
| 4 | DL      | Delta Air Lines Inc.        |
| 5 | EV      | ExpressJet Airlines Inc.    |
| 6 | F9      | Frontier Airlines Inc.      |
| 7 | FL      | AirTran Airways Corporation |
| 8 | HA      | Hawaiian Airlines Inc.      |
| 9 | MQ      | Envoy Air                   |


In [56]:
# Exercise: answer the questions below using the flights, planes, airlines and airports tables.
# What is the most popular destination from New York?
flights.columns
table1=flights.groupby('dest').agg(count=('dest','count')).sort_values(by='count',ascending=False).reset_index()
pd.merge(table1, airports[['faa','name']],how='left',left_on='dest',right_on= 'faa')

|     | dest | count | faa | name                               |
|-----|------|-------|-----|------------------------------------|
| 0   | ORD  | 17283 | ORD | Chicago Ohare Intl                 |
| 1   | ATL  | 17215 | ATL | Hartsfield Jackson Atlanta Intl    |
| 2   | LAX  | 16174 | LAX | Los Angeles Intl                   |
| 3   | BOS  | 15508 | BOS | General Edward Lawrence Logan Intl |
| 4   | MCO  | 14082 | MCO | Orlando Intl                       |
| ... | ...  | ...   | ... | ...                                |
| 100 | MTJ  | 15    | MTJ | Montrose Regional Airport          |
| 101 | SBN  | 10    | SBN | South Bend Rgnl                    |
| 102 | ANC  | 8     | ANC | Ted Stevens Anchorage Intl         |
| 103 | LGA  | 1     | LGA | La Guardia                         |
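
The group, count, sort, and merge pattern used in this cell can be tried without the CSV files. The sketch below runs it on two tiny hypothetical tables; the ``flight`` column and all rows are invented for illustration, and only the column names ``dest``, ``faa`` and ``name`` follow the original data.

```python
import pandas as pd

# Hypothetical stand-ins for flights.csv and airports.csv.
flights = pd.DataFrame({'flight': [1, 2, 3, 4, 5, 6],
                        'dest': ['ORD', 'ATL', 'ORD', 'LAX', 'ORD', 'ATL']})
airports = pd.DataFrame({'faa': ['ORD', 'ATL', 'LAX'],
                         'name': ['Chicago Ohare Intl',
                                  'Hartsfield Jackson Atlanta Intl',
                                  'Los Angeles Intl']})

# Count flights per destination (named aggregation), most frequent first.
counts = (flights.groupby('dest')
                 .agg(count=('flight', 'count'))
                 .sort_values(by='count', ascending=False)
                 .reset_index())

# Attach the airport name; a left merge keeps every destination even without a match.
result = pd.merge(counts, airports, how='left', left_on='dest', right_on='faa')
print(result)
```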


In [57]:
# which airline is the most punctual?
flights['total_delay']= flights['arr_delay']+flights['dep_delay']
table1=flights.groupby('carrier').agg(mean_delay= ('total_delay','mean')).sort_values(by='mean_delay').reset_index()
airlines
pd.merge(table1,airlines,how='left')

|   | carrier | mean_delay | name                   |
|---|---------|------------|------------------------|
| 0 | AS      | -4.100141  | Alaska Airlines Inc.   |
| 1 | HA      | -2.01462   | Hawaiian Airlines Inc. |
| 2 | US      | 5.874288   | US Airways Inc.        |
| 3 | AA      | 8.933421   | American Airlines Inc. |
| 4 | DL      | 10.868291  | Delta Air Lines Inc.   |
| 5 | VX      | 14.52111   | Virgin America         |
| 6 | UA      | 15.57492   | United Air Lines Inc.  |
| 7 | MQ      | 21.220114  | Envoy Air              |
| 8 | B6      | 22.425521  | JetBlue Airways        |
| 9 | 9E      | 23.819244  | Endeavor Air Inc.      |


In [58]:
# Which route (origin and destination) has the longest average air time?

table1=flights.groupby(['origin','dest']).agg(average_air_time= 
                            ('air_time','mean')).reset_index().sort_values(by='average_air_time',ascending=False)
pd.merge(table1,airports[['faa','name']],how='left',left_on= 'dest',right_on='faa')

|     | origin | dest | average_air_time | faa | name                          |
|-----|--------|------|------------------|-----|-------------------------------|
| 0   | JFK    | HNL  | 623.087719       | HNL | Honolulu Intl                 |
| 1   | EWR    | HNL  | 612.075209       | HNL | Honolulu Intl                 |
| 2   | EWR    | ANC  | 413.125000       | ANC | Ted Stevens Anchorage Intl    |
| 3   | JFK    | SFO  | 347.403626       | SFO | San Francisco Intl            |
| 4   | JFK    | SJC  | 346.606707       | SJC | Norman Y Mineta San Jose Intl |
| ... | ...    | ...  | ...              | ... | ...                           |
| 219 | EWR    | ALB  | 31.787081        | ALB | Albany Intl                   |
| 220 | JFK    | PHL  | 30.836872        | PHL | Philadelphia Intl             |
| 221 | EWR    | PHL  | 28.666667        | PHL | Philadelphia Intl             |
| 222 | EWR    | BDL  | 25.466019        | BDL | Bradley Intl                  |


In [59]:
# Which airline is the worst in terms of delays? (author's note: Frontier)
# Which airline has the highest seat capacity?

carrier_tailnum= flights[['carrier','tailnum']].drop_duplicates()
seats=pd.merge(carrier_tailnum,planes[['tailnum','seats']],how='left').groupby('carrier').agg(total_seats=('seats','sum')).sort_values(by='total_seats',ascending=False)
seats

| carrier | total_seats |
|---------|-------------|
| UA      | 116252.0    |
| DL      | 115715.0    |
| WN      | 82700.0     |
| US      | 57821.0     |
| AA      | 29309.0     |
| B6      | 27148.0     |
| EV      | 19525.0     |
| 9E      | 13685.0     |
| AS      | 13465.0     |
| FL      | 13451.0     |


In [60]:
# Which airplane model is used the most, and from which manufacturer?
airplanes_use=flights.groupby('tailnum').agg(count= ('tailnum','count')).reset_index()
planes.columns
pd.merge(planes[['tailnum','model','manufacturer']],airplanes_use).groupby(['model','manufacturer']).agg(total_flights= ('count','sum')).sort_values(by='total_flights',ascending=False)


| model           | manufacturer     | total_flights |
|-----------------|------------------|---------------|
| A320-232        | AIRBUS           | 31278         |
| EMB-145LR       | EMBRAER          | 28027         |
| ERJ 190-100 IGW | EMBRAER          | 23716         |
| A320-232        | AIRBUS INDUSTRIE | 14553         |
| EMB-145XR       | EMBRAER          | 14051         |
| ...             | ...              | ...           |
| 737-3T5         | BOEING           | 2             |
| 737-3A4         | BOEING           | 1             |
| A330-323        | AIRBUS           | 1             |
| A330-223        | AIRBUS INDUSTRIE | 1             |


### Methods similar to Python string methods

> Nearly all of Python's built-in string methods are mirrored by a Pandas vectorized string method. Here is a list of Pandas ``str`` methods that mirror Python string methods:

|             |                  |                  |                  |
|-------------|------------------|------------------|------------------|
|``len()``    | ``lower()``      | ``translate()``  | ``islower()``    | 
|``ljust()``  | ``upper()``      | ``startswith()`` | ``isupper()``    | 
|``rjust()``  | ``find()``       | ``endswith()``   | ``isnumeric()``  | 
|``center()`` | ``rfind()``      | ``isalnum()``    | ``isdecimal()``  | 
|``zfill()``  | ``index()``      | ``isalpha()``    | ``split()``      | 
|``strip()``  | ``rindex()``     | ``isdigit()``    | ``rsplit()``     | 
|``rstrip()`` | ``capitalize()`` | ``isspace()``    | ``partition()``  | 
|``lstrip()`` |  ``swapcase()``  |  ``istitle()``   | ``rpartition()`` |
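
A brief sketch of how these methods are reached through the ``str`` accessor. The names in the list are illustrative; the ``None`` entry is included to show that missing values are passed through rather than raising an error.

```python
import pandas as pd

names = pd.Series(['peter', 'Paul', None, 'MARY', 'gUIDO'])

capitalized = names.str.capitalize()       # same effect as str.capitalize() on each element
lengths = names.str.len()                  # length of each string
starts_with_p = names.str.startswith('P')  # case-sensitive prefix test

print(capitalized)
print(lengths)
print(starts_with_p)
```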


### Methods using regular expressions

> In addition, there are several methods that accept regular expressions to examine the content of each string element, and follow some of the API conventions of Python's built-in ``re`` module:

| Method | Description |
|--------|-------------|
| ``match()`` | Call ``re.match()`` on each element, returning a boolean ``Series`` |
| ``extract()`` | Find the first match of the pattern in each element (as ``re.search()`` does), returning the captured groups |
| ``findall()`` | Call ``re.findall()`` on each element |
| ``replace()`` | Replace occurrences of the pattern with some other string |
| ``contains()`` | Call ``re.search()`` on each element, returning a boolean ``Series`` |
| ``count()`` | Count occurrences of the pattern |
| ``split()`` | Equivalent to ``str.split()``, but accepts regular expressions |
| ``rsplit()`` | Equivalent to ``str.rsplit()``, but accepts regular expressions |
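
A small sketch of three of the regular-expression methods. The list of names is illustrative, and ``expand=False`` is passed to ``extract()`` so that a single capture group comes back as a ``Series`` rather than a one-column ``DataFrame``.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# First run of letters in each element, i.e. the first name.
first_names = monte.str.extract('([A-Za-z]+)', expand=False)

# Boolean mask: does the element start with "T"?
starts_with_t = monte.str.contains(r'^T')

# Number of lowercase vowels in each element.
vowel_counts = monte.str.count(r'[aeiou]')

print(first_names.tolist())
print(starts_with_t.tolist())
print(vowel_counts.tolist())
```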

### Miscellaneous methods

> Finally, there are some miscellaneous methods that enable other convenient operations:

| Method | Description |
|--------|-------------|
| ``get()`` | Index each element |
| ``slice()`` | Slice each element |
| ``slice_replace()`` | Replace the slice in each element with a passed value |
| ``cat()`` | Concatenate strings |
| ``repeat()`` | Repeat values |
| ``normalize()`` | Return the Unicode normal form of each string |
| ``pad()`` | Pad strings with whitespace on the left, right, or both sides |
| ``wrap()`` | Split long strings into lines shorter than a given width |
| ``join()`` | Join the strings in each element with a passed separator |
| ``get_dummies()`` | Split each string by a delimiter and return the dummy variables as a ``DataFrame`` |
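
As an illustration of two of the methods in this table, the sketch below splits hypothetical pipe-separated codes, picks the first code with ``get()``, and expands all codes into indicator columns with ``get_dummies()``. The data and the ``|`` separator are assumptions made for the example.

```python
import pandas as pd

# Hypothetical coded field: each entry lists one or more codes separated by "|".
codes = pd.Series(['B|C|D', 'B|D', 'A|C'])

# Split into lists, then take element 0 of each list.
first_code = codes.str.split('|').str.get(0)

# One indicator column per distinct code.
dummies = codes.str.get_dummies('|')

print(first_code.tolist())
print(dummies)
```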


