## Data Manipulation with Pandas

> NumPy's ``ndarray`` data structure provides essential features for the type of clean, well-organized data typically seen in numerical computing tasks. While it serves this purpose very well, its limitations become clear when we need more flexibility (e.g., attaching labels to data, working with missing data, etc.) and when attempting operations that do not map well to element-wise broadcasting (e.g., groupings, pivots, etc.), each of which is an important piece of analyzing the less structured data available in many forms in the world around us. Pandas, and in particular its ``Series`` and ``DataFrame`` objects, builds on the NumPy array structure and provides efficient access to these sorts of "data munging" tasks that occupy much of a data scientist's time. In this chapter, we will focus on the mechanics of using ``Series``, ``DataFrame``, and related structures effectively. We will use examples drawn from real datasets where appropriate, but these examples are not necessarily the focus.

In short, `ndarray` is powerful, but it falls short once data needs labels, contains missing values, or calls for operations beyond what broadcasting can express, such as grouping and pivoting. These are routine steps when cleaning real-world data. This chapter concentrates on the two data structures specific to Pandas:

- A `Series` is an array-like object that holds one-dimensional data.
- A `DataFrame` resembles an Excel spreadsheet: two-dimensional data with a row index and columns.

In [None]:
import numpy as np
import pandas as pd

## NumPy array  / DataFrame as array
> From what we've seen so far, it may look like the ``Series`` object is basically interchangeable with a one-dimensional NumPy array. The essential difference is the presence of the index: while the NumPy array has an *implicitly defined* integer index used to access the values, the Pandas ``Series`` has an *explicitly defined* index associated with the values. This explicit index definition gives the ``Series`` object additional capabilities. For example, the index need not be an integer, but can consist of values of any desired type.
For example, if we wish, we can use strings as an index:

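A `Series` may look interchangeable with a NumPy array, but the two differ in how the index is defined:
- A NumPy array has an implicit integer index.
- A Pandas `Series` or `DataFrame` has an explicit index.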

In [1]:
# Numpy
np.array([0.25, 0.5, 0.75, 1.0])

array([0.25, 0.5 , 0.75, 1.  ])

In [2]:
# Series
data = pd.Series([0.25, 0.5, 0.75, 1.0])  # default integer index: 0, 1, 2, 3
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])  # explicit index: a, b, c, d
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [3]:
# Dataframe
area_dict = {'a': 0.25, 'b': 0.5, 'c': 0.75,'d': 1.0}
states = pd.DataFrame({'area': pd.Series(area_dict)})
states

|   | area |
|---|------|
| a | 0.25 |
| b | 0.5  |
| c | 0.75 |
| d | 1.0  |


## Converting between Series and DataFrame

In [4]:
# Dictionary
dict_area = {'California': 423967, 'Texas': 695662,'New York': 141297, 'Florida': 170312,'Illinois': 149995}
dict_pop = {'California': 38332521, 'Texas': 26448193,'New York': 19651127, 'Florida': 19552860,'Illinois': 12882135,'ZMore': 111}

> You can think of a Pandas ``Series`` a bit like a specialization of a Python dictionary.
A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a ``Series`` is a structure which maps typed keys to a set of typed values.
This typing is important: just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas ``Series`` makes it much more efficient than Python dictionaries for certain operations.

Put differently, a `Series` is a dictionary whose keys share one type and whose values share one type. That fixed typing lets Pandas run compiled code on a `Series`, much as NumPy does for arrays compared with Python lists, which a plain dictionary cannot offer.

In [5]:
# Dictionary to Series
import pandas as pd
area = pd.Series(dict_area)
pop = pd.Series(dict_pop)
area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64
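
To make the dictionary analogy concrete, here is a minimal sketch. The names `area_dict` and `area_series` are chosen for illustration and reuse three of the values above. It shows the lookups a `Series` shares with a `dict`, plus label slicing and vectorized arithmetic, which a plain dictionary does not support.

```python
import pandas as pd

# Illustrative data: a subset of the areas used above
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297}
area_series = pd.Series(area_dict)

# Dictionary-style operations behave the same on both objects
same_lookup = area_dict['Texas'] == area_series['Texas']
has_texas = 'Texas' in area_series        # membership tests the index labels
labels = list(area_series.keys())         # keys() returns the index

# Array-style operations are only available on the Series
first_two = area_series['California':'Texas']   # label slice, both ends included
total_area = area_series.sum()                  # vectorized reduction

print(first_two)
print(total_area)
```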

> If a ``Series`` is an analog of a one-dimensional array with flexible indices, a ``DataFrame`` is an analog of a two-dimensional array with both flexible row indices and flexible column names.
Just as you might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a ``DataFrame`` as a sequence of aligned ``Series`` objects.
Here, by "aligned" we mean that they share the same index.

In other words, a `DataFrame` is a set of `Series` columns placed side by side, with flexible row labels and flexible column labels. "Aligned" means that every column shares the same index.

In [6]:
# Series to Dataframe
dataf = pd.DataFrame({'area':area, 'pop':pop})
dataf

|            | area     | pop      |
|------------|----------|----------|
| California | 423967.0 | 38332521 |
| Florida    | 170312.0 | 19552860 |
| Illinois   | 149995.0 | 12882135 |
| New York   | 141297.0 | 19651127 |
| Texas      | 695662.0 | 26448193 |
| ZMore      | NaN      | 111      |
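
The missing `area` value for `ZMore` comes from index alignment: `ZMore` exists only in `pop`, so Pandas fills the gap with `NaN` and stores `area` as floats. A small sketch with made-up labels (`x`, `y`, `z` are placeholders, not data from this chapter):

```python
import numpy as np
import pandas as pd

left = pd.Series({'x': 1, 'y': 2})              # integer values
right = pd.Series({'x': 10, 'y': 20, 'z': 30})  # has an extra label 'z'

aligned = pd.DataFrame({'left': left, 'right': right})

# The row index is the union of both indexes
print(list(aligned.index))
# 'left' has no value for 'z', so it holds NaN and is upcast to float
print(aligned['left'].dtype)
# 'right' is complete, so it keeps its integer dtype
print(aligned['right'].dtype)
```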


## DataFrame computations

In [9]:
s_sub = pd.Series({'Pruthvi': 'Kannada', 'Pranam': 'Hindi', 'Pratham':'English', 'Pravera':'Maths', 'Prabu':'Science'})
total_m = pd.Series({'Pruthvi': 60, 'Pranam': 60, 'Pratham':60, 'Pravera':60, 'Prabu':60})
minf_m = pd.Series({'Pruthvi': 30, 'Pranam': 30, 'Pratham':30, 'Pravera':30, 'Prabu':30})
obt_m = pd.Series({'Pruthvi': 25, 'Pranam': 35, 'Pratham':40, 'Pravera':60, 'Prabu':55})

students_d = pd.DataFrame({'Sub' : s_sub, 'T_m' : total_m, 'Min_m' : minf_m, 'O_m' : obt_m})
students_d

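|         | Sub     | T_m | Min_m | O_m |
|---------|---------|-----|-------|-----|
| Pruthvi | Kannada | 60  | 30    | 25  |
| Pranam  | Hindi   | 60  | 30    | 35  |
| Pratham | English | 60  | 30    | 40  |
| Pravera | Maths   | 60  | 30    | 60  |
| Prabu   | Science | 60  | 30    | 55  |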


In [10]:
students_d['score'] = students_d['O_m'] / students_d['T_m']
students_d['mp_score'] = students_d['Min_m'] / students_d['T_m']
students_d

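|         | Sub     | T_m | Min_m | O_m | score    | mp_score |
|---------|---------|-----|-------|-----|----------|----------|
| Pruthvi | Kannada | 60  | 30    | 25  | 0.416667 | 0.5      |
| Pranam  | Hindi   | 60  | 30    | 35  | 0.583333 | 0.5      |
| Pratham | English | 60  | 30    | 40  | 0.666667 | 0.5      |
| Pravera | Maths   | 60  | 30    | 60  | 1.0      | 0.5      |
| Prabu   | Science | 60  | 30    | 55  | 0.916667 | 0.5      |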
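
Column arithmetic like this is element-wise and aligned on the index, and comparisons work the same way. A sketch, assuming a hypothetical `passed` column that is not part of the original notebook:

```python
import pandas as pd

marks = pd.DataFrame(
    {'T_m': [60, 60, 60], 'Min_m': [30, 30, 30], 'O_m': [25, 35, 60]},
    index=['Pruthvi', 'Pranam', 'Pravera'],
)

# Element-wise division yields one ratio per row
marks['score'] = marks['O_m'] / marks['T_m']
# Element-wise comparison yields a boolean column (hypothetical name)
marks['passed'] = marks['O_m'] >= marks['Min_m']

print(marks[['score', 'passed']])
```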


In [11]:
students_d[1:4]                   # positional slice: rows 1 to 3, stop excluded
students_d['Pranam':'Pravera']    # label slice: stop included
#students_d.values

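|         | Sub     | T_m | Min_m | O_m | score    | mp_score |
|---------|---------|-----|-------|-----|----------|----------|
| Pranam  | Hindi   | 60  | 30    | 35  | 0.583333 | 0.5      |
| Pratham | English | 60  | 30    | 40  | 0.666667 | 0.5      |
| Pravera | Maths   | 60  | 30    | 60  | 1.0      | 0.5      |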
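
Both expressions in the cell above return the same three rows, but for different reasons: an integer slice counts positions and excludes the stop, while a label slice includes it. A sketch with a smaller, made-up frame:

```python
import pandas as pd

frame = pd.DataFrame({'value': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])

by_position = frame[1:3]     # positions 1 and 2; the stop (3) is excluded
by_label = frame['b':'c']    # labels 'b' through 'c'; the stop is included

print(by_position.equals(by_label))
```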


## Drop - data.drop()

In [47]:
data = pd.Series(np.arange(6), index=['a', 'b', 'c', 'd', 'e', 'f'])
print(data.drop(['e', 'f']))

a    0
b    1
c    2
d    3
dtype: int32


In [53]:
dataframe = pd.DataFrame(np.arange(16).reshape((4, 4)), index=['a', 'b', 'd', 'e'], 
                                                        columns=['Jerry', 'Jane', 'Shawn', 'Medy'])
print(dataframe.drop(['a', 'e']))      # drops rows by label; pass axis=1 to drop columns, inplace=True to modify dataframe itself

   Jerry  Jane  Shawn  Medy
b      4     5      6     7
d      8     9     10    11
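
`drop` returns a new object and leaves the original untouched unless `inplace=True` is passed. Columns can be dropped with `axis=1` or with the `columns=` keyword. A sketch with a made-up frame:

```python
import numpy as np
import pandas as pd

table = pd.DataFrame(np.arange(6).reshape((2, 3)),
                     index=['a', 'b'], columns=['x', 'y', 'z'])

no_row_a = table.drop('a')               # default axis=0 drops a row label
no_col_y = table.drop('y', axis=1)       # axis=1 drops a column label
no_col_z = table.drop(columns=['z'])     # keyword form of the same idea

# The original still has every row and column
print(table.shape)
```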


## Sort - sort_index()

In [24]:
series = pd.Series(range(6), index=['d', 'a', 'b', 'c', 'f', 'g'])

series.sort_index(axis=0, level=None, ascending=True)

a    1
b    2
c    3
d    0
f    4
g    5
dtype: int64

In [71]:
df8 = pd.DataFrame(np.random.randn(4, 3), index=['One', 'Two', 'Three', 'Four'], columns=list('abc'))
#df8 = pd.DataFrame(np.arange(12).reshape((4, 3)), index=['One', 'Two', 'Three', 'Four'], columns=list('abc'))

#df8.rank(axis='columns')
#df8.sort_values(by=['b'])
df8.sort_index(axis=1, level=None, ascending=True)  # axis=1 sorts the column labels; axis=0 sorts the row labels

|       | a         | b         | c         |
|-------|-----------|-----------|-----------|
| One   | 2.385161  | -0.683483 | 0.755742  |
| Two   | 0.530051  | -0.945674 | 0.476549  |
| Three | -0.719978 | -0.937375 | -1.720091 |
| Four  | 0.722771  | -0.115036 | 1.372991  |
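
`sort_index` orders by labels, while the commented-out `sort_values` call orders by the data instead. A sketch with fixed numbers (rather than `np.random.randn`) so that the result is reproducible:

```python
import pandas as pd

scores = pd.DataFrame({'b': [3, 1, 2], 'a': [9, 8, 7]},
                      index=['Two', 'One', 'Three'])

by_row_label = scores.sort_index()                       # rows in alphabetical order
by_col_label = scores.sort_index(axis=1)                 # columns in alphabetical order
by_value = scores.sort_values(by='b', ascending=False)   # largest 'b' first

print(by_value)
```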


## Selecting data in Series / DataFrame

> A ``Series`` object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary. Recall that a ``DataFrame`` acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary of ``Series`` structures sharing the same index.
These analogies can be helpful to keep in mind as we explore data selection within this structure.

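Both views apply when selecting data: a `Series` can be indexed like a one-dimensional NumPy array or like a Python dictionary, and a `DataFrame` can be indexed like a two-dimensional array or like a dictionary of `Series` that share one index.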

In [7]:
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd']) 

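# Series as a dictionary: it maps a set of keys (the index) to a set of values
'a' in data          # membership tests the index labels
data[0:4]            # positional slice, stop excluded
data['b']            # lookup by label
data['A':'D']        # label slice; labels are case-sensitive, so this selects nothing here
data['a':'c']        # label slice, stop included
data[0:3]            # positional slice
data.keys()          # the index
list(data.items())   # (label, value) pairs
data[(data > 0.2) & (data < 0.6)]  # boolean mask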

a    0.25
b    0.50
dtype: float64

In [8]:
# dictionary -> Series -> DataFrame
data = pd.DataFrame({'area':area, 'pop':pop})

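# Selecting one column of a DataFrame returns a Series
data.area           # attribute access: works here, but not recommended
data['area']        # preferred: bracket access. Avoid attribute syntax for column assignment (use data['pop'] = z, not data.pop = z).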

California    423967.0
Florida       170312.0
Illinois      149995.0
New York      141297.0
Texas         695662.0
ZMore              NaN
Name: area, dtype: float64
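
Attribute access breaks down when a column name collides with a `DataFrame` method. `pop` is such a name: `DataFrame.pop` is an existing method, so only the bracket form reaches the column. A sketch with made-up values:

```python
import pandas as pd

frame = pd.DataFrame({'area': [1.0, 2.0], 'pop': [10, 20]}, index=['p', 'q'])

area_attr = frame.area        # works: 'area' does not clash with any method
pop_attr = frame.pop          # the bound method DataFrame.pop, not the column
pop_col = frame['pop']        # always the column

print(callable(pop_attr))
print(area_attr.equals(frame['area']))

# Assign new columns with brackets; attribute assignment would not create a column
frame['density'] = frame['pop'] / frame['area']
```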

## Selecting subsets of data from DataFrames with `loc`

> In this chapter, we use the `loc` indexer to select subsets of data from DataFrames. The `loc` indexer selects data in a different manner than *just the brackets*. It has its own separate set of rules that we must learn.

In [87]:
ps = pd.Series(['Ganga', 'Yamuna', 'Gomti', 'Koshi','Godavari','Kaveri'], index = ['a','b','c','d','e','f'])
ps

a       Ganga
b      Yamuna
c       Gomti
d       Koshi
e    Godavari
f      Kaveri
dtype: object

In [88]:
for i in ps:
    print(i)
for i in ps.items():   # iteritems() was removed in pandas 2.0; items() yields (label, value) pairs
    print(i)

Ganga
Yamuna
Gomti
Koshi
Godavari
Kaveri
('a', 'Ganga')
('b', 'Yamuna')
('c', 'Gomti')
('d', 'Koshi')
('e', 'Godavari')
('f', 'Kaveri')


In [90]:
ps.loc['d']
ps.loc['c':'f']

c       Gomti
d       Koshi
e    Godavari
f      Kaveri
dtype: object
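
The difference between labels and positions matters most when the index itself holds integers. A sketch using a made-up `Series` whose labels are 1, 3, 5, compared against the position-based `iloc` indexer:

```python
import pandas as pd

rivers = pd.Series(['Ganga', 'Yamuna', 'Gomti'], index=[1, 3, 5])

by_label = rivers.loc[1]            # the element labelled 1
by_position = rivers.iloc[1]        # the element at position 1 (the second one)

label_slice = rivers.loc[1:3]       # labels 1 through 3, stop included
position_slice = rivers.iloc[1:3]   # positions 1 and 2, stop excluded

print(by_label, by_position)
```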

## `loc` with slice notation

Recall Python's slice notation, which selects subsets from core Python objects such as lists, tuples, and strings. Slice notation has three components, **start**, **stop**, and **step**, separated by colons: `start:stop:step`. Every component is optional and falls back to a default when omitted: start defaults to the beginning, stop to the end, and step to 1. With `loc`, the same notation applies to labels, and the stop label is included in the result.
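
As a quick refresher, the defaults can be seen on a plain Python list. The list `letters` is made up for illustration:

```python
letters = ['a', 'b', 'c', 'd', 'e', 'f']

middle = letters[1:4]        # start=1, stop=4, step defaults to 1
head = letters[:3]           # start defaults to the beginning
tail = letters[3:]           # stop defaults to the end
every_other = letters[::2]   # only the step is given
backwards = letters[::-1]    # a negative step walks from the end

# The same three components can be built explicitly with slice()
explicit = letters[slice(None, None, 2)]

print(middle, head, tail, every_other, backwards)
```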

In [4]:
import pandas as pd
df = pd.read_csv('input/pd-loc.csv', index_col=0)
df

| name      | state | color | food   | age | height | score |
|-----------|-------|-------|--------|-----|--------|-------|
| Jane      | NY    | blue  | Steak  | 30  | 165    | 4.6   |
| Niko      | TX    | green | Lamb   | 2   | 70     | 8.3   |
| Aaron     | FL    | red   | Mango  | 12  | 120    | 9.0   |
| Penelope  | AL    | white | Apple  | 4   | 80     | 3.3   |
| Dean      | AK    | gray  | Cheese | 32  | 180    | 1.8   |
| Christina | TX    | black | Melon  | 33  | 172    | 9.5   |
| Cornelia  | TX    | red   | Beans  | 69  | 150    | 2.2   |


In [5]:
df.loc['Niko']
df.loc['Niko', :]
df.loc['Niko', 'state']

'TX'

In [6]:
df.loc[['Niko']]
df.loc[['Niko'], :]
df.loc[['Niko'], ['state']]

| name | state |
|------|-------|
| Niko | TX    |


In [7]:
df.loc[['Dean', 'Aaron']]
df.loc[['Dean', 'Aaron'], 'food']
df.loc[['Dean', 'Aaron'], ['age', 'state', 'score']]

| name  | age | state | score |
|-------|-----|-------|-------|
| Dean  | 32  | AK    | 1.8   |
| Aaron | 12  | FL    | 9.0   |


In [8]:
df.loc['Niko':'Dean']
df.loc['Niko':'Dean':2]
df.loc['Niko':'Dean', ['state', 'color']]

| name     | state | color |
|----------|-------|-------|
| Niko     | TX    | green |
| Aaron    | FL    | red   |
| Penelope | AL    | white |
| Dean     | AK    | gray  |


In [9]:
df.loc[:, 'height']
df.loc[:'Dean', 'height':]
df.loc[['Penelope','Cornelia'], :]

| name     | state | color | food  | age | height | score |
|----------|-------|-------|-------|-----|--------|-------|
| Penelope | AL    | white | Apple | 4   | 80     | 3.3   |
| Cornelia | TX    | red   | Beans | 69  | 150    | 2.2   |
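
The cells above differ mainly in what they return: a bare label gives a `Series` or a scalar, while a list of labels always gives a `DataFrame`. The sketch below rebuilds part of the table by hand, because `input/pd-loc.csv` is not assumed to be available, and adds a boolean row selector, which `loc` also accepts. The name `people` is hypothetical.

```python
import pandas as pd

# Rebuilt by hand from the first four rows of the table displayed above
people = pd.DataFrame(
    {'state': ['NY', 'TX', 'FL', 'AL'],
     'age': [30, 2, 12, 4],
     'score': [4.6, 8.3, 9.0, 3.3]},
    index=pd.Index(['Jane', 'Niko', 'Aaron', 'Penelope'], name='name'),
)

as_series = people.loc['Niko']         # single label: the row as a Series
as_frame = people.loc[['Niko']]        # list of labels: a one-row DataFrame
scalar = people.loc['Niko', 'state']   # row label plus column label: one value

# A boolean Series is also a valid row selector
young_high = people.loc[(people['age'] < 10) & (people['score'] > 5),
                        ['state', 'score']]
print(young_high)
```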


## Selecting Subsets of Data from DataFrames with `iloc`

The `iloc` indexer is very similar to the `loc` indexer but only uses **integer location** to make its subset selections. The word `iloc` itself stands for integer location and can help remind you what it does.

## Simultaneous row and column subset selection

The `iloc` indexer is capable of making simultaneous row and column selections just like `loc`. Selection with `iloc` takes the following form, with a comma separating the row and column selections.

```python
df.iloc[rows, cols]
```

In [10]:
import pandas as pd
df = pd.read_csv('input/pd-loc.csv', index_col=0)
df

| name      | state | color | food   | age | height | score |
|-----------|-------|-------|--------|-----|--------|-------|
| Jane      | NY    | blue  | Steak  | 30  | 165    | 4.6   |
| Niko      | TX    | green | Lamb   | 2   | 70     | 8.3   |
| Aaron     | FL    | red   | Mango  | 12  | 120    | 9.0   |
| Penelope  | AL    | white | Apple  | 4   | 80     | 3.3   |
| Dean      | AK    | gray  | Cheese | 32  | 180    | 1.8   |
| Christina | TX    | black | Melon  | 33  | 172    | 9.5   |
| Cornelia  | TX    | red   | Beans  | 69  | 150    | 2.2   |


In [11]:
df.iloc[5]
df.iloc[2:]
df.iloc[::2]
df.iloc[3, 2]

'Apple'

In [12]:
df.iloc[[3], [2]]
df.iloc[[2, 3, 5], 4]
df.iloc[[2, 3, 5], [4]]
df.iloc[[2, 4], [0, -1]]

| name  | state | score |
|-------|-------|-------|
| Aaron | FL    | 9.0   |
| Dean  | AK    | 1.8   |


In [13]:
df.iloc[2, :]
df.iloc[:, [2, 4]]
df.iloc[2:4, [4, 2]]

| name     | height | food  |
|----------|--------|-------|
| Aaron    | 120    | Mango |
| Penelope | 80     | Apple |


In [14]:
df.iloc[[2], :]
df.iloc[[5, 2, 4], 3:]
df.iloc[[-3, -1, -2], :]

| name      | state | color | food   | age | height | score |
|-----------|-------|-------|--------|-----|--------|-------|
| Dean      | AK    | gray  | Cheese | 32  | 180    | 1.8   |
| Cornelia  | TX    | red   | Beans  | 69  | 150    | 2.2   |
| Christina | TX    | black | Melon  | 33  | 172    | 9.5   |
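
Every `iloc` selection has a label-based counterpart in `loc`. The sketch below pairs a few of them on a hand-built frame (again a stand-in for `input/pd-loc.csv`, with the hypothetical name `people`). Note that the stop is excluded for positions and included for labels.

```python
import pandas as pd

people = pd.DataFrame(
    {'state': ['NY', 'TX', 'FL', 'AL'],
     'food': ['Steak', 'Lamb', 'Mango', 'Apple'],
     'age': [30, 2, 12, 4]},
    index=pd.Index(['Jane', 'Niko', 'Aaron', 'Penelope'], name='name'),
)

cell_by_position = people.iloc[3, 1]             # row 3, column 1
cell_by_label = people.loc['Penelope', 'food']   # the same cell by label

rows_by_position = people.iloc[1:3]              # positions 1 and 2, stop excluded
rows_by_label = people.loc['Niko':'Aaron']       # labels, stop included

last_row = people.iloc[-1]                       # negative positions count from the end

print(cell_by_position, cell_by_label)
```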


### Methods similar to Python string methods


> Nearly all Python's built-in string methods are mirrored by a Pandas vectorized string method. Here is a list of Pandas ``str`` methods that mirror Python string methods:

The table below lists the methods of the Pandas `str` accessor that match Python's built-in string methods:

|             |                  |                  |                  |
|-------------|------------------|------------------|------------------|
|``len()``    | ``lower()``      | ``translate()``  | ``islower()``    | 
|``ljust()``  | ``upper()``      | ``startswith()`` | ``isupper()``    | 
|``rjust()``  | ``find()``       | ``endswith()``   | ``isnumeric()``  | 
|``center()`` | ``rfind()``      | ``isalnum()``    | ``isdecimal()``  | 
|``zfill()``  | ``index()``      | ``isalpha()``    | ``split()``      | 
|``strip()``  | ``rindex()``     | ``isdigit()``    | ``rsplit()``     | 
|``rstrip()`` | ``capitalize()`` | ``isspace()``    | ``partition()``  | 
|``lstrip()`` |  ``swapcase()``  |  ``istitle()``   | ``rpartition()`` |
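
A short sketch of how these methods are called through the `str` accessor. The sample `names` values are made up, and the `None` entry is there to show that missing values pass through instead of raising an error:

```python
import pandas as pd

names = pd.Series(['  peter', 'Paul ', None, 'MARY'])

cleaned = names.str.strip().str.capitalize()   # missing values stay missing
lengths = names.str.len()                      # length of each original string
starts_with_p = cleaned.str.startswith('P')    # element-wise test

print(cleaned.tolist())
```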


### Methods using regular expressions


> In addition, there are several methods that accept regular expressions to examine the content of each string element, and follow some of the API conventions of Python's built-in ``re`` module:

These methods check each string element against a pattern and follow the conventions of Python's built-in `re` module:

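| Method | Description |
|--------|-------------|
| ``match()`` | Calls ``re.match()`` on each element, returning a boolean Series |
| ``extract()`` | Calls ``re.search()`` on each element, returning the captured groups of the first match |
| ``findall()`` | Calls ``re.findall()`` on each element |
| ``replace()`` | Replaces occurrences of the pattern with another string |
| ``contains()`` | Calls ``re.search()`` on each element, returning a boolean Series |
| ``count()`` | Counts occurrences of the pattern |
| ``split()``   | Equivalent to ``str.split()``, but accepts regular expressions |
| ``rsplit()`` | Equivalent to ``str.rsplit()``, splitting from the right (in current pandas the separator is a literal string, not a regular expression) |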
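
A sketch of a few of these methods in use. The sample strings are made up for illustration and the patterns are ordinary `re` syntax:

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam'])

# First captured group per element; expand=False returns a Series
first_names = monte.str.extract(r'([A-Za-z]+)', expand=False)
has_double_l = monte.str.contains(r'll')          # boolean Series
vowel_counts = monte.str.count(r'[aeiou]')        # lowercase vowels per element
# Names that start with a consonant and end with a consonant
consonant_ends = monte.str.findall(r'^[^AEIOU].*[^aeiou]$')

# extract() finds the first match anywhere in the string, not only at the start
digits = pd.Series(['room 12', 'no digits']).str.extract(r'(\d+)', expand=False)

print(first_names.tolist())
```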

### Miscellaneous methods


> Finally, there are some miscellaneous methods that enable other convenient operations:

The remaining methods do not fit either group above but are just as handy:

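| Method | Description |
|--------|-------------|
| ``get()`` | Indexes each element, returning the item at the given position |
| ``slice()`` | Slices each element |
| ``slice_replace()`` | Replaces a slice of each element with another string |
| ``cat()``      | Concatenates the string elements into a single string |
| ``repeat()`` | Repeats each string element |
| ``normalize()`` | Returns the Unicode-normalized form of each string |
| ``pad()`` | Pads each string to a given width |
| ``wrap()`` | Wraps long strings into lines of a given width |
| ``join()`` | Joins the items within each element using a separator |
| ``get_dummies()`` | Splits each element on a separator and returns a two-dimensional DataFrame of dummy (indicator) variables |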
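
A sketch of four of these methods on a made-up `Series` of pipe-separated codes. The separator `'|'` and the name `codes` are assumptions for illustration:

```python
import pandas as pd

codes = pd.Series(['A|B', 'B|C', 'A'])

first_char = codes.str.get(0)           # character at position 0 of each element
head = codes.str.slice(0, 1)            # same result as codes.str[0:1]
joined = codes.str.cat(sep=', ')        # one string built from all elements
dummies = codes.str.get_dummies('|')    # one indicator column per distinct token

print(dummies)
```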


