<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/figures/PDSH-cover-small.png?raw=1">

*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*

*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*




# Vectorized String Operations

This chapter takes a thorough look at Pandas string operations, with a focus on using them to clean up very messy datasets collected from the Internet.

## Introducing Pandas String Operations



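*Vectorized* operations simplify the syntax of computing on arrays of data: you no longer have to worry about the size or shape of the array, only about the operation you want to perform.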
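
As a point of reference, here is a minimal sketch of a vectorized numeric operation in NumPy. The array `x` and the name `doubled` are illustrative and do not come from the original text.

```python
import numpy as np

# Illustrative input array (values chosen arbitrarily)
x = np.array([2, 3, 5, 7, 11, 13])

# The multiplication is applied to every element; no explicit loop is needed
doubled = x * 2
print(doubled)
```

The same expression works unchanged for an array of any length or shape.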

NumPy offers no equally simple interface for arrays of strings, so you fall back on a more verbose loop, such as a list comprehension:

In [1]:
data = ['peter', 'Paul', 'MARY', 'gUIDO']
[s.capitalize() for s in data]

['Peter', 'Paul', 'Mary', 'Guido']

In [2]:
# This fails as soon as the data contains a missing value
data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
[s.capitalize() for s in data]

AttributeError: 'NoneType' object has no attribute 'capitalize'

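Pandas addresses both needs, vectorized string operations and correct handling of missing data, through the ``str`` attribute of Pandas Series and Index objects that contain strings.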

In [3]:
import pandas as pd
names = pd.Series(data)
names

0    peter
1     Paul
2     None
3     MARY
4    gUIDO
dtype: object

In [4]:
# Call the method even though a value is missing
names.str.capitalize()

0    Peter
1     Paul
2     None
3     Mary
4    Guido
dtype: object
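
To make the contrast concrete, the sketch below places a guarded list comprehension next to the ``str`` accessor. The names `manual` and `vectorized` are illustrative. How the missing entry is displayed (``None`` or ``NaN``) can depend on the pandas version, so the example checks for it with ``isna()`` rather than relying on its printed form.

```python
import pandas as pd

data = ['peter', 'Paul', None, 'MARY', 'gUIDO']

# Plain Python needs an explicit guard for the missing value
manual = [s.capitalize() if s is not None else None for s in data]

# The str accessor skips missing values without any extra code
vectorized = pd.Series(data).str.capitalize()

print(manual)
print(vectorized.isna().tolist())    # marks where the missing entry is
print(vectorized.dropna().tolist())  # the capitalized, non-missing names
```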

## Tables of Pandas String Methods

In [5]:
# Example Series used throughout this section
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

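### Methods similar to Python string methods
Nearly all of Python's built-in string methods are mirrored by a Pandas vectorized string method.

The following Pandas ``str`` methods mirror Python string methods: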

|             |                  |                  |                  |
|-------------|------------------|------------------|------------------|
|``len()``    | ``lower()``      | ``translate()``  | ``islower()``    | 
|``ljust()``  | ``upper()``      | ``startswith()`` | ``isupper()``    | 
|``rjust()``  | ``find()``       | ``endswith()``   | ``isnumeric()``  | 
|``center()`` | ``rfind()``      | ``isalnum()``    | ``isdecimal()``  | 
|``zfill()``  | ``index()``      | ``isalpha()``    | ``split()``      | 
|``strip()``  | ``rindex()``     | ``isdigit()``    | ``rsplit()``     | 
|``rstrip()`` | ``capitalize()`` | ``isspace()``    | ``partition()``  | 
|``lstrip()`` |  ``swapcase()``  |  ``istitle()``   | ``rpartition()`` |



In [6]:
# Some return a Series of strings
monte.str.lower()

0    graham chapman
1       john cleese
2     terry gilliam
3         eric idle
4       terry jones
5     michael palin
dtype: object

In [7]:
# Some return numbers
monte.str.len()

0    14
1    11
2    13
3     9
4    11
5    13
dtype: int64

In [8]:
# Some return Boolean values
monte.str.startswith('T')

0    False
1    False
2     True
3    False
4     True
5    False
dtype: bool

In [9]:
# Some return lists or other compound values for each element
monte.str.split()

0    [Graham, Chapman]
1       [John, Cleese]
2     [Terry, Gilliam]
3         [Eric, Idle]
4       [Terry, Jones]
5     [Michael, Palin]
dtype: object
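
Because each of these methods returns a Series, the result can be used as a Boolean mask or chained through ``.str`` again. The following is a brief sketch using the same `monte` data; the names `terrys` and `shouted` are illustrative.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# A Boolean result can be used directly as a mask
terrys = monte[monte.str.startswith('T')]

# A string-valued result can be chained into another str method:
# upper-case every name, then right-justify it to 16 characters with dots
shouted = monte.str.upper().str.rjust(16, '.')

print(terrys.tolist())
print(shouted.tolist())
```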

### Miscellaneous methods

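| Method | Description |
|--------|-------------|
| ``get()`` | Index each element |
| ``slice()`` | Slice each element |
| ``slice_replace()`` | Replace the slice in each element with the passed value |
| ``cat()`` | Concatenate strings |
| ``repeat()`` | Repeat values |
| ``normalize()`` | Return the Unicode normal form of each string |
| ``pad()`` | Add whitespace to the left, right, or both sides of strings |
| ``wrap()`` | Split long strings into lines no longer than a given width |
| ``join()`` | Join the strings in each element of the Series with the passed separator |
| ``get_dummies()`` | Extract dummy variables as a DataFrame |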
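
A few of the methods in this table can be sketched on the `monte` data as follows. The variable names and the fill characters are illustrative choices, not part of the original text.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

initials = monte.str[0]  # first character of each name

# repeat(): duplicate each string a given number of times
doubled_initials = initials.str.repeat(2)

# slice_replace(): overwrite the slice [0:3] of each element
masked = monte.str.slice_replace(0, 3, '___')

# cat(): concatenate all elements into a single string
initials_joined = initials.str.cat(sep='-')

# join(): join the items of each list-valued element
underscored = monte.str.split().str.join('_')

# pad(): pad each string to a minimum width (here on the left, with dots)
padded = monte.str.pad(16, side='left', fillchar='.')

for result in (doubled_initials, masked, underscored, padded):
    print(result.tolist())
print(initials_joined)
```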

#### Vectorized item access and slicing

The ``get()`` and ``slice()`` operations enable vectorized element access within each entry of the array.

In [11]:
# monte.str.slice(0, 3) is equivalent to monte.str[0:3]
monte.str[0:3]

0    Gra
1    Joh
2    Ter
3    Eri
4    Ter
5    Mic
dtype: object

In [12]:
# Combine split() and get() to extract the last name of each entry
monte.str.split().str.get(-1)

0    Chapman
1     Cleese
2    Gilliam
3       Idle
4      Jones
5      Palin
dtype: object
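
The sketch below checks the equivalence between the method forms and the indexing shorthand, and shows what ``get()`` returns when the requested position does not exist. The names `same`, `first_names`, `last_names`, and `third_word` are illustrative.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# slice(0, 3) and the [0:3] shorthand give the same result
same = monte.str.slice(0, 3).equals(monte.str[0:3])

# get(i) and the [i] shorthand are likewise interchangeable
words = monte.str.split()
first_names = words.str[0]
last_names = words.str.get(-1)

# An out-of-range position yields a missing value instead of raising
third_word = words.str.get(2)

print(same)
print(first_names.tolist())
print(last_names.tolist())
print(third_word.isna().all())
```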

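#### Indicator variables

When the magnitude of a categorical variable's values carries no meaning, the variable should be encoded as dummy (indicator) variables. The ``get_dummies()`` method does this for a column of coded indicators, such as the ``'|'``-delimited ``info`` column below.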

In [13]:
full_monte = pd.DataFrame({'name': monte,
                           'info': ['B|C|D', 'B|D', 'A|C',
                                    'B|D', 'B|C', 'B|C|D']})
full_monte

             name   info
0  Graham Chapman  B|C|D
1     John Cleese    B|D
2   Terry Gilliam    A|C
3       Eric Idle    B|D
4     Terry Jones    B|C
5   Michael Palin  B|C|D


In [14]:
# Split the codes on '|' and expand them into one indicator column per code
full_monte['info'].str.get_dummies('|')

   A  B  C  D
0  0  1  1  1
1  0  1  0  1
2  1  0  1  0
3  0  1  0  1
4  0  1  1  0
5  0  1  1  1
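
One possible next step, sketched here as an assumption rather than as part of the original example, is to join the indicator columns back onto the DataFrame and filter on them. The names `dummies`, `combined`, and `b_and_d` are illustrative.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])
full_monte = pd.DataFrame({'name': monte,
                           'info': ['B|C|D', 'B|D', 'A|C',
                                    'B|D', 'B|C', 'B|C|D']})

# One indicator column per code found in the '|'-delimited strings
dummies = full_monte['info'].str.get_dummies('|')

# Attach the indicator columns to the original rows (aligned on the index)
combined = full_monte.join(dummies)

# Select the rows flagged with both code B and code D
b_and_d = combined[(combined['B'] == 1) & (combined['D'] == 1)]
print(b_and_d['name'].tolist())
```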
