# PSG week2
The re module (Regular Expression) is a text matching and parsing tool that can find the content you are looking for in a long string.

```
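Common syntax
Symbol      Meaning                                                   Notes
.           Any single character except a newline
^           Anchors the match at the start of the string
$           Anchors the match at the end of the string
*           0 or more of the preceding item
?           0 or 1 of the preceding item
+           1 or more of the preceding item
*? +? ??    Non-greedy forms: match as few characters as possible
{m}         Exactly m repetitions of the preceding item
[]          Any one character listed inside []                        [a-zA-Z0-9] matches any English letter or digit; [^6] matches any character other than 6
()          Groups the enclosed expression and captures what it matches
Reference: https://www.ibm.com/developerworks/cn/opensource/os-cn-pythonre/index.html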
```
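
The symbols in the table are easier to tell apart when run side by side. The following is a small sketch; the sample strings are made up for illustration and are not from the original notes.

```python
import re

# `?` makes the preceding item optional (0 or 1); `*` would allow any number.
print(re.findall(r'colou?r', 'color colour colouur'))   # ['color', 'colour']

# `+` needs at least one repetition; `{m}` needs exactly m.
print(re.findall(r'ab+', 'a ab abbb'))                   # ['ab', 'abbb']
print(re.findall(r'\d{4}', 'PyCon 2013, room 12'))       # ['2013']

# `[^6]` matches any single character other than '6', not only digits.
print(re.findall(r'[^6]', '16a6'))                       # ['1', 'a']

# Greedy `.*` takes as much as it can; non-greedy `.*?` takes as little as it can.
print(re.findall(r'<.*>', '<a><b>'))                     # ['<a><b>']
print(re.findall(r'<.*?>', '<a><b>'))                    # ['<a>', '<b>']
```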

```
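\ escapes a special character or introduces a special sequence
Common special sequences (the equivalents shown apply to ASCII text)
Symbol       Meaning                                   Equivalent to
\A           Matches only at the start of the string
\Z           Matches only at the end of the string
\d           Any digit 0-9                             [0-9]
\D           Any non-digit                             [^0-9]
\s           Any whitespace character                  [ \t\n\r\f\v]
\S           Any non-whitespace character              [^ \t\n\r\f\v]
\w           Any letter, digit, or underscore          [a-zA-Z0-9_]
\W           Any character not matched by \w           [^a-zA-Z0-9_]
Reference: https://www.ibm.com/developerworks/cn/opensource/os-cn-pythonre/index.html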
```
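
A quick way to check the special sequences is to run them against one short string. This is only a sketch; `sample` and `lines` are made-up strings. The last two lines show how `\A` differs from `^` once `re.MULTILINE` is switched on.

```python
import re

sample = 'id_7 costs\t42 USD\n'

print(re.findall(r'\d+', sample))   # ['7', '42']
print(re.findall(r'\w+', sample))   # ['id_7', 'costs', '42', 'USD']
print(re.findall(r'\s', sample))    # [' ', '\t', ' ', '\n']

# With re.MULTILINE, ^ matches at the start of every line,
# while \A still matches only at the start of the whole string.
lines = 'first\nsecond'
print(re.findall(r'^\w+', lines, flags=re.MULTILINE))    # ['first', 'second']
print(re.findall(r'\A\w+', lines, flags=re.MULTILINE))   # ['first']
```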

# 2.1 Splitting strings on delimiters
# re.split()
Comparing .split() with re.split()

In [None]:
import re
s1="aa bb          cc"
print(s1.split(' '))
print(re.split(r'[\s]\s*',s1))

When the pattern uses a capture group (), the matched delimiters are included in the output as well

In [None]:
line = 'asdf fjdk; afed, fjek,asdf, foo'
print(re.split(r'(;|,|\s)\s*', line))
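# inside [] the | is a literal character, so list the delimiters without it
print(re.split(r'[;,\s]\s*', line))
# use a non-capturing group (?:...) to drop the delimiters from the output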
print(re.split(r'(?:,|;|\s)\s*', line))

# 2.2 Matching the start or end of a string
```
.endswith('')  
.startswith('') 
Useful for checking URLs or file names
```

In [None]:
filename = 'spam.txt'
print(filename.endswith('.txt'))
print(filename.startswith('file:'))


```
If the argument is a URL, open it with urlopen()
Otherwise,            open it with open()
```

In [None]:
from urllib.request import urlopen
def read_data(name):
    if name.startswith(('http:', 'https:', 'ftp:')):
        return urlopen(name).read()
    else:
        with open(name) as f:
            return f.read()

In [None]:
read_data('https://github.com/hyades910739/psg')

```
To test against several candidates at once,
pass them as a tuple; a list or set raises an error
```

In [None]:
choices = ['http:', 'ftp:']
url = 'https://github.com/hyades910739/psg'
# url.startswith(choices) would raise an error
# convert the list to a tuple first, as below
url.startswith(tuple(choices))   

# 2.3 Matching with shell wildcards
```
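fnmatchcase()  always case-sensitive
fnmatch()      case sensitivity follows the operating system   (case-sensitive on Mac, case-insensitive on Windows)
Both can also be used on strings that are not file names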
```

In [None]:
from fnmatch import fnmatch, fnmatchcase
print("Are 'foo.txt' and '*.txt' matched ?",fnmatch('foo.txt', '*.txt'))
print("Are 'foo.txt' and '*.TXT' matched ?",fnmatchcase('foo.txt', '*.TXT'))
print("Are 'foo.txt' and '?oo.txt' matched ?",fnmatch('foo.txt', '?oo.txt'))
print("Are 'Dat45.csv' and 'Dat[0-9]*' matched ?",fnmatch('Dat45.csv', 'Dat[0-9]*'))

In [None]:
addresses = [
'5412 N CLARK ST',
'1060 W ADDISON ST',
'1039 W GRANVILLE AVE',
'2122 N CLARK ST',
'4802 N BROADWAY',
]
from fnmatch import fnmatchcase
print([addr for addr in addresses if fnmatchcase(addr, '* ST')])
print([addr for addr in addresses if fnmatchcase(addr, '54[0-9][0-9] *CLARK*')])

# 2.4 More complex string matching
```
re.match()
re.compile()
```

In [None]:
import re
text1 = '11/27/2012'
text2 = 'Nov 27, 2012'
if re.match(r'\d+/\d+/\d+', text1):
    print('yes')
else:
    print('no')

```
.compile() stores a pattern that you intend to use many times,
which you then apply through its .match() method
```

In [None]:
datepat = re.compile(r'\d+/\d+/\d+')
if datepat.match(text1):
    print('yes')
else:
    print('no')


if datepat.match(text2):
    print('yes')
else:
    print('no')

```
.match()     always tries to match at the start of the string
.findall()   finds every match anywhere in the string
```

In [None]:
text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
datepat.findall(text)
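
Because .match() only anchors at the start, it accepts a string that has extra text after the date. The sketch below contrasts it with `fullmatch()` and `search()`, two other standard methods on compiled patterns; the test strings are made up for illustration.

```python
import re

datepat = re.compile(r'\d+/\d+/\d+')

# match() checks only the beginning, so the trailing 'abcdef' is ignored.
prefix = datepat.match('11/27/2012abcdef')
print(prefix.group())                                   # 11/27/2012

# fullmatch() requires the whole string to fit the pattern.
print(datepat.fullmatch('11/27/2012abcdef'))            # None
print(datepat.fullmatch('11/27/2012').group())          # 11/27/2012

# search() returns the first match found anywhere in the string.
print(datepat.search('Today is 11/27/2012.').group())   # 11/27/2012
```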

Wrapping parts of the pattern passed to .compile() in () capture groups makes them easy to extract

In [None]:
datepat1 = re.compile(r'(\d+)/(\d+)/(\d+)')
m = datepat1.match('11/27/2012')
print(m.group(0))
print(m.group(1))
print(m.group(2))
print(m.group(3))
print(m.groups())

.finditer() yields the matches one at a time as an iterator

In [None]:
for m in datepat1.finditer(text):
    print(m.groups())

# 2.5 Searching and replacing text

.replace() and re.sub()

In [None]:
text = 'yeah, but no, but yeah, but no, but yeah'
print('origin        is:',text)
print('after replace is:',text.replace('yeah', 'yep'))
text2 = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
print('origin        is:',text2)
print('after replace is:',re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text2))

re.subn() additionally reports how many substitutions were made

In [None]:
newtext, n = datepat1.subn(r'\3-\1-\2', text2)
print(newtext)
print(n)
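
When the replacement is more involved than reordering groups, sub() can take a function in place of the replacement string; the function receives each match object and returns the new text. A minimal sketch, where `MONTHS` and `change_date` are hypothetical helper names introduced only for this example:

```python
import re

datepat1 = re.compile(r'(\d+)/(\d+)/(\d+)')

# Hypothetical lookup table for this example (index 0 is January).
MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

def change_date(m):
    # m is the match object for one date; the groups are month, day, year.
    mon_name = MONTHS[int(m.group(1)) - 1]
    return '{} {} {}'.format(m.group(2), mon_name, m.group(3))

text2 = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
result = datepat1.sub(change_date, text2)
print(result)   # Today is 27 Nov 2012. PyCon starts 13 Mar 2013.
```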

# 2.6 Ignoring case when searching and replacing
flags=re.IGNORECASE

In [None]:
text = 'UPPER PYTHON, lower python, Mixed Python'
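# find every "python", regardless of case
print(re.findall('python', text, flags=re.IGNORECASE))
# replace every "python" with "snake";
# note that the replacement text
# does not adapt to the case of the word it replaces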
print(re.sub('python', 'snake', text, flags=re.IGNORECASE))

You can define a function whose replacement follows the case of the original word

In [None]:
def matchcase(word):
    def replace(m):
        text = m.group()
        if text.isupper():   # is the match all uppercase?
            return word.upper()
        elif text.islower(): # is the match all lowercase?
            return word.lower()
        elif text[0].isupper():   # is the first letter uppercase?
            return word.capitalize()
        else:
            return word
    return replace

```
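matchcase('snake') returns a callback function (its argument must be a match object, e.g. each "python" match in the example below)
Besides a replacement string, sub() also accepts a callback function.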
```

In [None]:
re.sub('python', matchcase('snake'), text, flags=re.IGNORECASE)

# 2.7 Finding the shortest match
```
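When a regular expression matches a text pattern, it may find the longest possible match,
so the pattern has to be modified to find the shortest possible match instead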
```

In [None]:
# find the text inside double quotes
str_pat = re.compile(r'\"(.*)\"')
text1 = 'Computer says "no."'
print(str_pat.findall(text1))
text2 = 'Computer says "no." Phone says "yes."'
print(str_pat.findall(text2))

```
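* is greedy and finds the longest possible match
Adding a ? after the * changes that
*? is the non-greedy form and finds the shortest possible match (listed in the syntax table at the top)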
```

In [None]:
str_pat = re.compile(r'\"(.*?)\"')
str_pat.findall(text2)

# 2.8 Matching across multiple lines

In [None]:
# find the text inside /* */

comment = re.compile(r'/\*(.*?)\*/')
text1 = '/* this is a comment */'
text2 = '''/* this is a
    multiline comment */
    '''
print(comment.findall(text1))
print(comment.findall(text2))

(?:.|\n) is a non-capturing group that matches either . or \n

In [None]:
comment = re.compile(r'/\*((?:.|\n)*?)\*/')
comment.findall(text2)

```
re.compile() accepts a flag argument, re.DOTALL,
which makes the dot (.) in a regular expression match any character, including a newline
```

In [None]:
comment = re.compile(r'/\*(.*?)\*/', re.DOTALL)
comment.findall(text2)

# 2.9 and 2.10 cover Unicode; skipped for now

# 2.11 Stripping unwanted characters from strings

```
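By default these strip whitespace
strip()      removes characters from both the start and the end
lstrip()     removes characters from the left only
rstrip()     removes characters from the right only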
```

In [None]:
s = ' hello world \n'
print(s.strip())
print(s.rstrip())
s.lstrip()

In [None]:
t = '-----hello====='
print(t.lstrip('-'))
print(t.strip('-='))

```
strip() does not affect text in the middle of the string
To remove inner spaces, replace them with replace() or re.sub()
```

In [None]:
s = ' hello    world \n'
s.strip()
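
The two alternatives mentioned above can be sketched as follows: `replace()` removes every space outright, while `re.sub()` can collapse each run of whitespace into a single space. The variable names `compact` and `single_spaced` are illustrative only.

```python
import re

s = ' hello    world \n'

# replace() deletes every space, including the ones between words.
compact = s.strip().replace(' ', '')
print(compact)          # helloworld

# re.sub() collapses each run of whitespace into one space.
single_spaced = re.sub(r'\s+', ' ', s.strip())
print(single_spaced)    # hello world
```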

# 2.13 Aligning strings
```
ljust()   left-justifies
rjust()   right-justifies
center()  centers
```

In [None]:
# the argument is the total width
text = 'Hello World'
print(text.ljust(20))
print(text.rjust(20))
print(text.center(20))

In [None]:
# optional fill character
text.rjust(20,'=')

format() can do the same thing

In [None]:
# > right-aligns, < left-aligns, ^ centers
print(format(text, '>20'))
print(format(text, '<20'))
print(format(text, '^20'))
# a fill character may be placed before the alignment symbol
print(format(text, '*^20s'))

```
format() can format several values at once
and can also format numbers
```

In [None]:
print('{:>10s} {:>10s}'.format('Hello', 'World'))
x = 1.2345
print(format(x, '>10'))
print(format(x, '^10.2f'))

```
Overall,
format() is preferable to ljust(), rjust(), and center() because it also works on values that are not strings
```

# 2.14 Combining and concatenating strings

In [None]:
parts = ['Is', 'Chicago', 'Not', 'Chicago?']
' '.join(parts)
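
join() only accepts strings, so data of mixed types has to be converted first. A generator expression does the conversion without building a temporary list. This is a small sketch; the `data` list is made up for illustration.

```python
data = ['ACME', 50, 91.1]

# Convert every item to str on the fly, then join with commas.
joined = ','.join(str(d) for d in data)
print(joined)   # ACME,50,91.1
```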

# 2.15 Interpolating variables in strings

.format()

In [None]:
s = '{name} has {n} messages.'
s.format(name='Guido', n=37)

```
If the variables are already defined,
use format_map() together with vars()
```

In [None]:
name = 'Guido'
n = 37
s.format_map(vars())

```
To format several records that share the same fields,
define a class and pass vars() of an instance
```

In [None]:
class Info:
    def __init__(self, name, n):
        self.name = name
        self.n = n

a = Info('Guido',37)
s.format_map(vars(a))

# However
```
format() and format_map() raise an error
when a variable to be inserted is missing,
unless you define a custom dict subclass to handle it
```

In [None]:
class safesub(dict): # guards against missing keys
    def __missing__(self, key):
        return '{' + key +  '}' # replace a missing value with its key name in braces

In [None]:
del n # delete the n defined earlier
s.format_map(safesub(vars()))

If you substitute often, define a helper function for it

In [None]:
import sys
def sub(text):
    return text.format_map(safesub(sys._getframe(1).f_locals))  # f_locals of the caller's frame supplies the needed keys and values

In [None]:
name = 'Guido'
n = 37
print(sub('Hello {name}'))
print(sub('You have {n} messages.'))
print(sub('Your favorite color is {color}'))

# 2.16 Reformatting text to a fixed number of columns
textwrap.fill()

In [None]:
s = "Look into my eyes, look into my eyes, the eyes, the eyes, \
the eyes, not around the eyes, don't look around the eyes, \
look into my eyes, you're under."
import textwrap
print(textwrap.fill(s, 70)) # the argument is the maximum number of characters per line
print(textwrap.fill(s, 40, initial_indent='    '))   # indent the first line
print(' ')
print(textwrap.fill(s, 40, subsequent_indent=' '))   # indent every line except the first

# 2.17 is about encoding; skipped for now

# 2.18 Parsing a string into categorized items (tokenizing)

In [None]:
text = 'foo = 23 + 42 * 10'
tokens = [('NAME', 'foo'), ('EQ','='), ('NUM', '23'), ('PLUS','+'),('NUM', '42'), ('TIMES', '*'), ('NUM', '10')]
import re
NAME = r'(?P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)'
NUM = r'(?P<NUM>\d+)'
PLUS = r'(?P<PLUS>\+)'
TIMES = r'(?P<TIMES>\*)'
EQ = r'(?P<EQ>=)'
WS = r'(?P<WS>\s+)'
master_pat = re.compile('|'.join([NAME, NUM, PLUS, TIMES, EQ, WS]))

In [None]:
scanner = master_pat.scanner('foo = 42')
scanner.match()

In [None]:
print(_.lastgroup, _.group())

In [None]:
scanner.match()

In [None]:
print(_.lastgroup, _.group())

Wrapping this in a function is more convenient

In [None]:
from collections import namedtuple
scanner = master_pat.scanner('foo = 42')
def generate_tokens(pat, text):
    Token = namedtuple('Token', ['type', 'value'])
    scanner = pat.scanner(text)
    for m in iter(scanner.match, None):
        yield Token(m.lastgroup, m.group())
for tok in generate_tokens(master_pat, 'foo = 42'):
    print(tok)
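
The generator also emits a WS token for every run of whitespace. If those are not wanted, one option is to filter them out with a comprehension. The sketch below repeats the pattern and generator so it runs on its own; `non_ws` is an illustrative name. Note that `scanner()` is an undocumented method of compiled patterns.

```python
import re
from collections import namedtuple

Token = namedtuple('Token', ['type', 'value'])

# Same token patterns as above, joined into one alternation.
master_pat = re.compile('|'.join([
    r'(?P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)',
    r'(?P<NUM>\d+)',
    r'(?P<PLUS>\+)',
    r'(?P<TIMES>\*)',
    r'(?P<EQ>=)',
    r'(?P<WS>\s+)',
]))

def generate_tokens(pat, text):
    scanner = pat.scanner(text)
    # Call scanner.match() repeatedly until it returns None.
    for m in iter(scanner.match, None):
        yield Token(m.lastgroup, m.group())

# Keep every token except whitespace.
non_ws = [tok for tok in generate_tokens(master_pat, 'foo = 23 + 42 * 10')
          if tok.type != 'WS']
for tok in non_ws:
    print(tok)
```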

# 2.19 is too hard to follow; skipped

# 2.20 Performing text operations on byte strings
```
e.g. stripping, searching, replacing
The operations work much like those on ordinary strings
```

In [None]:
data = b'Hello World'
print(data[0:5])
print(data.startswith(b'Hello'))
data = bytearray(b'Hello World')
print(data[0:5])
print(data.startswith(b'Hello'))
print(data.split())
print(data.replace(b'Hello', b'Hello Cruel'))

Remember the b prefix on the pattern; a str pattern applied to bytes raises an error

In [None]:
data = b'FOO:BAR,SPAM'
import re
re.split('[:,]',data)

In [None]:
re.split(b'[:,]',data) 

Indexing a single element of a byte string returns an integer, not a character

In [None]:
a = 'Hello World'
b = b'Hello World'
print('a[0] is',a[0])
print('b[0] is',b[0])

To print it as plain text, decode it first

In [None]:
print(b)
print(b.decode('ascii'))


The same goes for formatting: format as a text string, then encode

In [None]:
'{:10s} {:10d} {:10.2f}'.format('ACME', 100, 490.1).encode('ascii')
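
Byte strings have no format() method, but since Python 3.5 they do support the older % operator. The sketch below checks that it produces the same bytes as formatting a text string and encoding it; `via_text` and `via_percent` are illustrative names.

```python
# Format as text, then encode (the approach shown above).
via_text = '{:10s} {:10d} {:10.2f}'.format('ACME', 100, 490.1).encode('ascii')

# %-formatting directly on bytes (Python 3.5+); %-10s left-aligns like {:10s}.
via_percent = b'%-10s %10d %10.2f' % (b'ACME', 100, 490.1)

print(via_text)
print(via_percent)
print(via_text == via_percent)   # True
```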