## Regular expressions

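- re.match matches only at the start of the string; if the start of the string does not fit the pattern, the match fails and the function returns None.
- re.search scans the whole string and stops at the first match.
- re.sub replaces matched text.
- re.findall finds every match in the string and returns them as a list.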
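
The four functions above can be compared side by side on one input. This is a minimal sketch; the sample string `s` is made up for illustration.

```python
import re

s = "id=42; id=7"

at_start = re.match(r"\d+", s)     # None: the string starts with "id", not a digit
first = re.search(r"\d+", s)       # first match anywhere in the string
every = re.findall(r"\d+", s)      # all matches, as a list of strings
masked = re.sub(r"\d+", "#", s)    # every match replaced

print(at_start)        # None
print(first.group())   # 42
print(every)           # ['42', '7']
print(masked)          # id=#; id=#
```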

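### 1. re.match() - match from the start of the text
re.match tries to match the pattern at the very beginning of the text. If the first characters do not match, it returns None immediately; if they do, it returns a match object that carries the position of the match. The key difference from re.search() is that re.match() only checks whether the pattern occurs at the start.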

In [1]:
import re

text = 'https://matters.news/@CHWang'
print(re.match('https', text))
print(re.match('https', text).span())
print(re.match('https', text).group(0))

<re.Match object; span=(0, 5), match='https'>
(0, 5)
https


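### 2. re.search() - search the whole string (most commonly used)
re.search scans the whole string and returns the first match. If nothing matches it returns None; on success it returns a match object, and group() retrieves the matched text.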

In [2]:
import re

text = 'https://medium.com/@juck30808'
text1 = 'Date：2023-02-28 14:02:48'
print(re.search('https://', text))
print(re.search('https://', text).span())

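### regex pattern
text1 = '賴aa23542'
if re.search('賴([a-z0-9])',text1):   # '賴' followed by a lowercase letter or a digit; a | inside [...] would be a literal character
    print('line帳號')                 # prints "LINE account"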
    
### group
text = 'I likes to eat cake and drink coke, but ...'
result = re.search(r'(.*) likes to eat (\w+) and drink ([a-z]*)', text, re.I|re.M)
print(result.group())
print(result.groups())
print(result.group(1))

### Date
text1 = 'Date：2023-02-28 14:02:48'
print(re.search(r'\d+-\d+-\d+ \d+:\d+:\d+', text1).group(0))

<re.Match object; span=(0, 8), match='https://'>
(0, 8)
line帳號
I likes to eat cake and drink coke
('I', 'cake', 'coke')
I
2023-02-28 14:02:48


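### 3. re.findall() - find every match in the text

re.findall collects every match into a list and returns it; if nothing matches, it returns an empty list.
Note: re.findall returns all non-overlapping matches of the pattern, while re.search and re.match return at most one.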

In [3]:
import re
text1 = 'good98Morning66 Jen666 Yeah'
print(re.compile(r'[a-z]+', re.I).findall(text1)) ## match every run of letters, ignoring case

text1 = '編輯 - 衛斯理 小編、編輯 - Christy 小編、編輯 - 阿龍 小編'
print(re.findall('編輯 - (.*?) ', text1))

['good', 'Morning', 'Jen', 'Yeah']
['衛斯理', 'Christy', '阿龍']


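### 4. re.sub() - replace matched text with a string of our choice
This is handy in data cleaning: stray spaces, symbols, and other unwanted characters can all be removed or replaced in one call.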

In [4]:
import re 

text = 'Jack/25/1993 and Jen/23/1995'
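sub_result1 = re.sub(r'\sand\s', '&', text)

## A: replace the "and" in the middle, together with the surrounding spaces, with &
print(sub_result1)

## B: then remove every /
print(re.sub('/', '', sub_result1))

## C: remove / again, but only the first two occurrences
print(re.sub('/', '', sub_result1, count=2))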

Jack/25/1993&Jen/23/1995
Jack251993&Jen231995
Jack251993&Jen/23/1995


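### 5. re.compile() - build a pattern object for match, search, and findall
Define the regular expression once, then reuse the compiled pattern object to call match, search, or findall. This avoids rewriting the same expression every time a matching function is called with an identical rule.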

In [5]:
import re

text = '68Jack66Jen58Ken28,Cathy38'
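pattern = re.compile(r'([a-z]+)', re.I) ## match letters, ignoring case

## match starts at position 0 by default
print(pattern.match(text)) ## None: the text starts with digits, so matching at position 0 fails

## start matching at index 2 (the third character) and stop before index 20
print(pattern.match(text, 2, 20))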

None
<re.Match object; span=(2, 6), match='Jack'>


### 6. re.split() - splitting
Splits the string wherever the pattern matches and returns a list.

In [6]:
import re

text = 'Jack66Jen58Ken28Cathy'

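## split on runs of digits
print(re.split(r'\d+', text))

## split, and keep the digits in the result (the capturing group keeps the separators)
print(re.split(r'(\d+)', text))

## if a match sits at the very start or end, the result contains empty strings
text1 = '66Jack66Jen58Ken28Cathy38'
print(re.split(r'\d+', text1))

## if nothing matches, the whole string is returned as a single item
print(re.split(r'\s+', text1))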

['Jack', 'Jen', 'Ken', 'Cathy']
['Jack', '66', 'Jen', '58', 'Ken', '28', 'Cathy']
['', 'Jack', 'Jen', 'Ken', 'Cathy', '']
['66Jack66Jen58Ken28Cathy38']


# Summary
A handy tool for testing regular expressions interactively: https://regex101.com/

### Basic topics

- Anchors — ^ and $

In [7]:
# ^The        matches any string that starts with The 
# end$        matches a string that ends with end
# ^The end$   exact string match (starts and ends with The end)
# roar        matches any string that has the text roar in it
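
The anchor rules above can be tried directly in Python. This is a small sketch; the `samples` list is invented for illustration.

```python
import re

samples = [
    "The end",
    "The beginning of the end",
    "In the end",
    "A lion will roar loudly",
]

# ^ pins the pattern to the start of the string, $ to the end
starts_with_the = [s for s in samples if re.search(r"^The", s)]
ends_with_end = [s for s in samples if re.search(r"end$", s)]
exact = [s for s in samples if re.search(r"^The end$", s)]
contains_roar = [s for s in samples if re.search(r"roar", s)]

print(starts_with_the)  # ['The end', 'The beginning of the end']
print(ends_with_end)    # ['The end', 'The beginning of the end', 'In the end']
print(exact)            # ['The end']
print(contains_roar)    # ['A lion will roar loudly']
```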

- Quantifiers — * + ? and {}

In [8]:
# abc*        matches a string that has ab followed by zero or more c
# abc+        matches a string that has ab followed by one or more c
# abc?        matches a string that has ab followed by zero or one c
# abc{2}      matches a string that has ab followed by 2 c
# abc{2,}     matches a string that has ab followed by 2 or more c
# abc{2,5}    matches a string that has ab followed by 2 up to 5 c
# a(bc)*      matches a string that has a followed by zero or more copies of the sequence bc
# a(bc){2,5}  matches a string that has a followed by 2 up to 5 copies of the sequence bc
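
To see how the quantifiers differ, the sketch below checks a few made-up candidate strings with `re.fullmatch`, so that a pattern counts only when it covers the whole string. The helper `fully_matching` is a hypothetical name introduced for this example.

```python
import re

candidates = ["ab", "abc", "abcc", "abccccc", "abcbc"]

def fully_matching(pattern):
    """Return the candidates that the pattern matches from start to end."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

star = fully_matching(r"abc*")          # ab followed by zero or more c
plus = fully_matching(r"abc+")          # ab followed by one or more c
optional = fully_matching(r"abc?")      # ab followed by zero or one c
exactly_two = fully_matching(r"abc{2}") # ab followed by exactly two c
group_rep = fully_matching(r"a(bc)*")   # a followed by zero or more copies of "bc"

print(star)         # ['ab', 'abc', 'abcc', 'abccccc']
print(plus)         # ['abc', 'abcc', 'abccccc']
print(optional)     # ['ab', 'abc']
print(exactly_two)  # ['abcc']
print(group_rep)    # ['abc', 'abcbc']
```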

- OR operator — | or []

In [9]:
# a(b|c)     matches a string that has a followed by b or c (and captures b or c)
# a[bc]      same as previous, but without capturing b or c
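
The difference between the two forms shows up in what the match object captures. A short sketch with an invented input string:

```python
import re

m_group = re.search(r"a(b|c)", "xxacxx")  # alternation inside a capturing group
m_class = re.search(r"a[bc]", "xxacxx")   # character set, no group

print(m_group.group(0), m_group.groups())  # ac ('c',)
print(m_class.group(0), m_class.groups())  # ac ()
```

Both patterns match the same text; only the grouped version records which alternative was taken.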

- Character classes — \d \w \s and .

In [10]:
# \d         matches a single character that is a digit
# \w         matches a word character (alphanumeric character plus underscore)
# \s         matches a whitespace character (includes tabs and line breaks)
# .          matches any character (except a newline, unless re.DOTALL is set)
# \D         matches a single non-digit character
# \$\d       matches a string that has a $ before one digit
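
A quick sketch of these classes in Python, using a made-up two-line string. It also shows that `.` skips the newline unless `re.DOTALL` is passed.

```python
import re

line = "Total: $5 for 3 items\n(tax included)"

digits = re.findall(r"\d", line)      # single digits
words = re.findall(r"\w+", line)      # runs of word characters
price = re.findall(r"\$\d", line)     # a literal $ followed by one digit

# "items" is followed by a newline, which "." does not match by default
dot_default = re.search(r"items.", line)
dot_all = re.search(r"items.", line, re.DOTALL)

print(digits)              # ['5', '3']
print(words)               # ['Total', '5', 'for', '3', 'items', 'tax', 'included']
print(price)               # ['$5']
print(dot_default)         # None
print(repr(dot_all.group()))  # 'items\n'
```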

### Intermediate topics

- Grouping and capturing — ()

In [11]:
# a(bc)           parentheses create a capturing group with value bc 
# a(?:bc)*        using ?: we disable the capturing group 
# a(?P<foo>bc)    using ?P<foo> we put a name to the group (Python syntax; other flavors write ?<foo>)

- Bracket expressions — [] 

In [12]:
# [abc]            matches a string that has either an a or a b or a c -> is the same as a|b|c 
# [a-c]            same as previous
# [a-fA-F0-9]      a string that represents a single hexadecimal digit, case insensitively 
# [0-9]%           a string that has a character from 0 to 9 before a % sign
# [^a-zA-Z]        a string that has not a letter from a to z or from A to Z. In this case the ^ is used as negation of the expression 

- Greedy and Lazy match

In [13]:
# <.+?>            matches any character one or more times included inside < and >, expanding as needed
# <[^<>]+>         matches any character except < or > one or more times included inside < and >
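
The sketch below contrasts a greedy pattern with the two alternatives listed above, on an invented HTML snippet.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

greedy = re.findall(r"<.+>", html)       # .+ grabs as much as possible
lazy = re.findall(r"<.+?>", html)        # .+? stops at the first >
negated = re.findall(r"<[^<>]+>", html)  # no < or > allowed inside the tag

print(greedy)   # ['<b>bold</b> and <i>italic</i>']
print(lazy)     # ['<b>', '</b>', '<i>', '</i>']
print(negated)  # ['<b>', '</b>', '<i>', '</i>']
```

On this input the lazy and negated-set versions give the same result; the negated set states the intent explicitly and involves less backtracking.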

### Advanced topics

- Boundaries — \b and \B

In [14]:
# \babc\b          performs a "whole words only" search
# \Babc\B          matches only if the pattern is fully surrounded by word characters

- Back-references — \1

In [15]:
# ([abc])\1              using \1 it matches the same text that was matched by the first capturing group
# ([abc])([de])\2\1      we can use \2 (\3, \4, etc.) to identify the same text that was matched by the second (third, fourth, etc.) capturing group
# (?P<foo>[abc])(?P=foo)   we put the name foo to the group and we reference it later with (?P=foo) (Python syntax; other flavors write \k<foo>). The result is the same as the first regex

- Look-ahead and Look-behind — (?=) and (?<=)

In [16]:
# d(?=r)       matches a d only if is followed by r, but r will not be part of the overall regex match
# (?<=r)d      matches a d only if is preceded by an r, but r will not be part of the overall regex match

# d(?!r)       matches a d only if is not followed by r, but r will not be part of the overall regex match
# (?<!r)d      matches a d only if is not preceded by an r, but r will not be part of the overall regex match 
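
The four look-around forms can be checked by listing where each one matches. A minimal sketch on a made-up string, plus one practical use (pulling out amounts that follow a `$`):

```python
import re

s = "dr rd dx xd"   # the letter d sits at indexes 0, 4, 6 and 10

def positions(pattern):
    """Start index of every match of pattern in s."""
    return [m.start() for m in re.finditer(pattern, s)]

followed_by_r = positions(r"d(?=r)")       # look-ahead
preceded_by_r = positions(r"(?<=r)d")      # look-behind
not_followed_by_r = positions(r"d(?!r)")   # negative look-ahead
not_preceded_by_r = positions(r"(?<!r)d")  # negative look-behind

print(followed_by_r)      # [0]
print(preceded_by_r)      # [4]
print(not_followed_by_r)  # [4, 6, 10]
print(not_preceded_by_r)  # [0, 6, 10]

# The $ is required but is not part of the result
amounts = re.findall(r"(?<=\$)\d+", "cost me $2000 and $10")
print(amounts)            # ['2000', '10']
```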

### Common patterns

- Alphanumeric, letters, digits, lowercase, uppercase characters only

In [17]:
# \w                //alphanumeric characters and underscore
# [a-zA-Z]          //letters only
# \d                //digits only
# [a-z]             //lowercase letters only
# [A-Z]             //uppercase letters only

- Simple numbers

In [18]:
# ^(\d+)$            matches 12, but not 15/12 or 8.5

- Decimal numbers

In [19]:
# ^(\d*)[.,](\d+)$   matches 8.5 and 8,7, but not 15/12 or 12

- Fractions

In [20]:
# ^(\d+)[\/](\d+)$   matches 15/12, but not 8.5 or 12
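
The three number patterns can be run against the same sample inputs to see which one accepts which value. A small sketch; the dictionary keys are labels chosen for this example.

```python
import re

samples = ["15/12", "8.5", "12", "8,7"]

patterns = {
    "integer": r"^(\d+)$",
    "decimal": r"^(\d*)[.,](\d+)$",
    "fraction": r"^(\d+)[\/](\d+)$",
}

# For each pattern, keep the samples it matches
accepted = {
    name: [s for s in samples if re.search(pattern, s)]
    for name, pattern in patterns.items()
}

for name, values in accepted.items():
    print(name, values)
# integer ['12']
# decimal ['8.5', '8,7']
# fraction ['15/12']
```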

- Alphanumeric without spaces

In [21]:
# ^(\w*)$            hello123

- Email

In [22]:
# ^([a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6})*$
#                    jonny.fox@factorymind.com
#                    hello@sdasdad.hello
#                    but not this!

# ^([a-z0-9_\.\+-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$

- Trim spaces: capture the text without leading or trailing whitespace

In [23]:
# ^[\s]*(.*?)[\s]*$

- HTML tag

In [24]:
# <([a-z]+)[^<]*(?:>(.*?)<\/\1>|\s+\/>)

- Hexadecimal value

In [25]:
# \B#(?:[a-fA-F0-9]{6}|[a-fA-F0-9]{3})\b

- Valid email (RFC5322)

In [26]:
# \b[\w.!#$%&'*+\/=?^`{|}~-]+@[\w-]+(?:\.[\w-]+)*\b

- Username (simple): 3 to 16 characters, made up of lowercase letters, digits, underscores, or hyphens

In [27]:
# ^[a-z0-9_-]{3,16}$

- Strong password (at least 6 characters, with at least 1 uppercase letter, 1 lowercase letter, 1 digit, and 1 special character)

In [28]:
# (?=^.{6,}$)((?=.*\w)(?=.*[A-Z])(?=.*[a-z])(?=.*[0-9])(?=.*[|!"$%&\/\(\)\?\^\'\\\+\-\*]))^.*

- 2 of a kind

In [29]:
# ^(?=([0-9]*[a-z]){2,})([a-zA-Z0-9]{8,32})$

- URL tokenization

In [30]:
# ^(((https?|ftp):\/\/)?([\w\-\.])+(\.)([\w]){2,4}([\w\/+=%&_\.~?\-]*))*$

- IPv4 address

In [31]:
# \b(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\b
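
As a sketch of how the IPv4 pattern might be used in Python, the example below extracts addresses from a made-up log line and validates single strings with `fullmatch`. The names `IPV4`, `log`, and `is_ipv4` are hypothetical.

```python
import re

IPV4 = re.compile(
    r"\b(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}"
    r"(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\b"
)

log = "ok 192.168.0.1, bad 256.1.1.1, edge 10.0.0.255"

# All groups are non-capturing, so findall returns the full matches
found = IPV4.findall(log)
print(found)   # ['192.168.0.1', '10.0.0.255']

def is_ipv4(candidate):
    """True when the whole string is a dotted quad with octets in 0-255."""
    return IPV4.fullmatch(candidate) is not None

print(is_ipv4("192.168.0.1"))  # True
print(is_ipv4("999.1.1.1"))    # False
print(is_ipv4("1.2.3"))        # False
```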

- URL or IPv4 address

In [32]:
# ^(((h..ps?|f.p):\/\/)?(?:([\w\-\.])+(\[?\.\]?)([\w]){2,4}|(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\[?\.\]?){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)))*([\w\/+=%&_\.~?\-]*)$

# Practice

In [33]:
import re

text = """中央流行疫情指揮中心今(29)日公布國內新增11例COVID-19確定病例，分別為1例本土及10例境外移入；另確診個案中無新增死亡。
指揮中心表示，今日新增1例本土個案(案16326)，為印尼籍30多歲女性，今(2021)年9月27日出現頭痛症狀，9月28日就醫採檢，於今日確診。衛生單位已匡列接觸者7人，均列居家隔離，其餘接觸者匡列中。
指揮中心指出，今日新增10例境外移入個案，為9例男性、1例女性，年齡介於20多歲至40多歲，入境日介於9月3日至9月28日，分別自美國(案16316)、哈薩克(2例，案16317、案16318)、巴基斯坦(案16319)、柬埔寨(16320)、俄羅斯(案16324)及菲律賓(案16325)入境，餘3例 (案16321、案16322、案16323)的旅遊國家調查中；詳如新聞稿附件。
指揮中心統計，截至目前國內累計3,358,228例新型冠狀病毒肺炎相關通報(含3,341,439例排除)，其中16,216例確診，分別為1,581例境外移入，14,581例本土病例，36例敦睦艦隊、3例航空器感染、1例不明及14例調查中；另累計110例移除為空號。2020年起累計842例COVID-19死亡病例，其中830例本土，個案居住縣市分布為新北市412例、臺北市318例、基隆市28例、桃園市26例、彰化縣15例、新竹縣13例、臺中市5例、苗栗縣3例、宜蘭縣及花蓮縣各2例，臺東縣、雲林縣、臺南市、南投縣、高雄市及屏東縣各1例；另12例為境外移入。
指揮中心再次呼籲，民眾應落實手部衛生、咳嗽禮節及佩戴口罩等個人防護措施，減少不必要移動、活動或集會，避免出入人多擁擠的場所，或高感染傳播風險場域，並主動積極配合各項防疫措施，共同嚴守社區防線。
"""

In [34]:
en = """12Drummers Drumming 11 Pipers Piping 10 Lords a Leaping 9 Ladies Dancing 8 Maids a Milking 7 Swans a Swimming
6 Geese a Laying 5 Golden Rings 4 Calling Birds 3 French Hens 2 Turtle Doves and a Partridge in a Pear Tree
I love watching movies in India country. Ape Bpe Cpe Dpe cost me $2000 and $10, F.B.I. I.R.S. CIA, Hahahahaha Adventures Batman
"""

In [35]:
phone = "Please call the number: 415-424-1212 on time"
mail  = "To email Jerry, try juck30808@gmail.com.tw  or other address juck30808@hotmail.com"
name  = 'First Name: Jerry Last Name: Chien'

### Characters

In [36]:
re.search('.',text)    # <re.Match object; span=(0, 1), match='中'>

<re.Match object; span=(0, 1), match='中'>

In [37]:
## . matches any character (except a newline by default)

re.search('^',text)    # <re.Match object; span=(0, 0), match=''>
re.compile(r'.pe').findall(en)  # ['ipe', 'Ape', 'Bpe', 'Cpe', 'Dpe']

['ipe', 'Ape', 'Bpe', 'Cpe', 'Dpe']

In [38]:
## \ escape character: changes the meaning of the character that follows

re.compile(r'\$..').findall(en)  # ['$20', '$10']
re.findall(r".\..\..",en)        # ['F.B.I', 'I.R.S']

['F.B.I', 'I.R.S']

In [39]:
## [...] character set: matches any single character listed inside the brackets

re.search('[abc]',en)              # <re.Match object; span=(46, 47), match='a'>
re.search('[COVID]',text)          # <re.Match object; span=(25, 26), match='C'>

re.findall("..[縣市]", text)       #  ['..縣','..市','臺北市','基隆市','桃園市','彰化縣','新竹縣','臺中市']
re.findall("[ABD]pe",en)           # ['Ape', 'Bpe', 'Dpe']
re.findall("[A-Z]pe",en)           # ['Ape', 'Bpe', 'Cpe', 'Dpe']
re.findall("[^C]pe",en)            # ['ipe', 'Ape', 'Bpe', 'Dpe']
re.findall('[a]',en)               # ['a', 'a','a','a',...
re.findall('[arn]',en)             # ['r', 'r','r','n',...
re.findall('[^arn]',en)            # ['1','2','D','u','m','m',...
re.findall('[a-n]',en)             # ['m', 'm','e','m',...
re.findall('[a-zA-Z]',en)          # ['D', 'r','u','m',...

re.compile(r'[aeiouAEIOU]').findall(en)   # vowels
re.compile(r'[^aeiouAEIOU]').findall(en)  # every character that is not a vowel (negated set)
re.compile('[$]').split(en)  # split at each literal $; inside [...] the $ is an ordinary character, not an anchor

['12Drummers Drumming 11 Pipers Piping 10 Lords a Leaping 9 Ladies Dancing 8 Maids a Milking 7 Swans a Swimming\n6 Geese a Laying 5 Golden Rings 4 Calling Birds 3 French Hens 2 Turtle Doves and a Partridge in a Pear Tree\nI love watching movies in India country. Ape Bpe Cpe Dpe cost me ',
 '2000 and ',
 '10, F.B.I. I.R.S. CIA, Hahahahaha Adventures Batman\n']

### Predefined character classes

In [40]:
## \d a digit: [0-9]
## \D a non-digit: [^\d]

re.search(r"新增\d+例本土",text)       # span=(76, 82), match='新增1例本土'>
re.findall(r"新增\d+例境外", text)     # ['新增10例境外']
re.findall(r"新增(\d+)例境外", text)   # ['10']
re.findall(r'\D',en)                  # ['D','r','u','m','m',.....]

re.compile(r'\d\d\d-\d\d\d-\d\d\d\d').search(phone)          # <re.Match object; span=(24, 36), match='415-424-1212'>
re.compile(r'\d\d\d-\d\d\d-\d\d\d\d').search(phone).group()  #'415-424-1212'

'415-424-1212'

In [41]:
## \s whitespace character: [ \t\r\n\f\v]
## \S non-whitespace character: [^\s]

re.search(r"\s", en).start()  #10  index of the first whitespace character
re.search(r"\s", en).end()    #11

re.sub(r'\s','_' ,en)         # 12Drummers_Drumming_11_Pipers_
re.sub(r"\s", "|", en)        # 12Drummers|Drumming|11|Pipers|

re.split(r'\s', en)           # ['12Drummers','Drumming','11','Pipers',
re.split(r"\s", en, maxsplit=1)  # ['12Drummers',
                             #  'Drumming 11 Pipers Piping 

['12Drummers',
 'Drumming 11 Pipers Piping 10 Lords a Leaping 9 Ladies Dancing 8 Maids a Milking 7 Swans a Swimming\n6 Geese a Laying 5 Golden Rings 4 Calling Birds 3 French Hens 2 Turtle Doves and a Partridge in a Pear Tree\nI love watching movies in India country. Ape Bpe Cpe Dpe cost me $2000 and $10, F.B.I. I.R.S. CIA, Hahahahaha Adventures Batman\n']

In [42]:
## \w word character: [A-Za-z0-9_]
## \W non-word character: [^\w]

re.compile(r'\w\s\w').findall(en)   #['s D',
                                    # 'g 1',
                                    # '1 P',

['s D',
 'g 1',
 '1 P',
 's P',
 'g 1',
 '0 L',
 's a',
 'g 9',
 's D',
 'g 8',
 's a',
 'g 7',
 's a',
 'g\n6',
 'e a',
 'g 5',
 'n R',
 's 4',
 'g B',
 's 3',
 'h H',
 's 2',
 'e D',
 's a',
 'd a',
 'e i',
 'n a',
 'r T',
 'e\nI',
 'e w',
 'g m',
 's i',
 'n I',
 'a c',
 'e B',
 'e C',
 'e D',
 'e c',
 't m',
 '0 a',
 'a A',
 's B']

In [43]:
## * matches the preceding character zero or more times.

re.search('ma*',en)    # <re.Match object; span=(5, 6), match='m'>

<re.Match object; span=(5, 6), match='m'>

### Quantifiers (placed after a character)

In [44]:
## + matches the preceding character one or more times.

re.search('Golden+',en)  #<re.Match object; span=(129, 135), match='Golden'>

re.compile(r'[\w.]+@\w+\.[a-z]{3}').findall(mail)         #['juck30808@gmail.com', 'juck30808@hotmail.com']
re.compile(r'([\w.]+)@(\w+)\.([a-z]{3})').findall(mail)   #[('juck30808', 'gmail', 'com'), ('juck30808', 'hotmail', 'com')]
re.compile(r'\w+').findall(en)   #['12Drummers',
                                 # 'Drumming',
                                 # '11',

['12Drummers',
 'Drumming',
 '11',
 'Pipers',
 'Piping',
 '10',
 'Lords',
 'a',
 'Leaping',
 '9',
 'Ladies',
 'Dancing',
 '8',
 'Maids',
 'a',
 'Milking',
 '7',
 'Swans',
 'a',
 'Swimming',
 '6',
 'Geese',
 'a',
 'Laying',
 '5',
 'Golden',
 'Rings',
 '4',
 'Calling',
 'Birds',
 '3',
 'French',
 'Hens',
 '2',
 'Turtle',
 'Doves',
 'and',
 'a',
 'Partridge',
 'in',
 'a',
 'Pear',
 'Tree',
 'I',
 'love',
 'watching',
 'movies',
 'in',
 'India',
 'country',
 'Ape',
 'Bpe',
 'Cpe',
 'Dpe',
 'cost',
 'me',
 '2000',
 'and',
 '10',
 'F',
 'B',
 'I',
 'I',
 'R',
 'S',
 'CIA',
 'Hahahahaha',
 'Adventures',
 'Batman']

In [45]:
## ? matches the preceding character zero or one time.
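
A brief sketch of the `?` quantifier, using invented sample strings: the optional character may be present once or absent, but not repeated.

```python
import re

# "u" is optional, so both spellings match; "colouur" has two u's and does not
spellings = re.findall(r"colou?r", "color colour colouur")
print(spellings)   # ['color', 'colour']

# "s" is optional, so the pattern accepts http and https
schemes = re.findall(r"https?://", "http://a.example https://b.example")
print(schemes)     # ['http://', 'https://']
```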

In [46]:
## {m} matches the preceding character exactly m times.

re.search('e{2}', en)    # <re.Match object; span=(113, 115), match='ee'>
re.search('m{2}', en)    # <re.Match object; span=(5, 7), match='mm'>

re.compile(r'\d{3}-\d{3}-\d{4}').search(phone)          # <re.Match object; span=(24, 36), match='415-424-1212'>
re.compile(r'\d{3}-\d{3}-\d{4}').search(phone).group()  #'415-424-1212'

re.compile(r'\w{3}').findall(en)                        # ['12D', 'rum', 'mer', 'Dru', 'mmi',
re.compile(r'(ha){3}').search(en)                       # <re.Match object; span=(319, 325), match='hahaha'>
re.compile(r'(ha){3}').search(en).group()               #'hahaha'

dd = "24012"
if re.search(r"\d{5}",dd): print("It is a zip code")

dd = "0909-000-123"
if re.search(r"\w{4}-\w{3}-\w{3}",dd): print("It is a phoneNum")

It is a zip code
It is a phoneNum


In [47]:
## {m,n} matches the preceding character between m and n times.

re.compile(r'(\d){3,5}').search(mail)           #<re.Match object; span=(24, 29), match='30808'>
re.compile(r'(\d){3,5}').search(mail).group()   #'30808'
re.compile(r'.{1,2}at').findall(en)             #[' wat', ' Bat']

dd = "Toshio Mauramatsu"

if re.search(r"\w{2,20}\s\w{2,20}",dd):
    print("It is a full name")

It is a full name


### Boundary matching (zero-width: consumes no characters)

In [48]:
## ^ matches the start of the string; in multiline mode it also matches the start of each line.

re.search('^12',en)  # <re.Match object; span=(0, 2), match='12'>

<re.Match object; span=(0, 2), match='12'>

In [49]:
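## $ matches the end of the string; in multiline mode it also matches the end of each line.

re.search('g$',en)         # None: en ends with 'Batman\n'; with re.M it would match the g that ends the first line
re.search(r'\$a', '$abc')  #<re.Match object; span=(0, 2), match='$a'>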

<re.Match object; span=(0, 2), match='$a'>

In [50]:
## \A matches only at the start of the string.


In [51]:
## \Z matches only at the end of the string.
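
The difference between `^`/`$` and `\A`/`\Z` only shows up in multiline mode or when the string ends with a newline. A minimal sketch with made-up strings:

```python
import re

two_lines = "first line\nsecond line"

# With re.M, ^ and $ also match at every line break; \A and \Z never do
caret = re.findall(r"^\w+", two_lines, re.M)
abs_start = re.findall(r"\A\w+", two_lines, re.M)
dollar = re.findall(r"\w+$", two_lines, re.M)
abs_end = re.findall(r"\w+\Z", two_lines, re.M)

print(caret)      # ['first', 'second']
print(abs_start)  # ['first']
print(dollar)     # ['line', 'line']
print(abs_end)    # ['line']

# $ also matches just before a trailing newline; \Z requires the true end
before_newline = re.search(r"n$", "Batman\n")
true_end = re.search(r"n\Z", "Batman\n")
print(before_newline is not None)  # True
print(true_end)                    # None
```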

In [52]:
## \b matches the boundary between a \w and a \W character (or the start/end of the string).

re.search(r"\bL\w+", en)         # <re.Match object; span=(40, 45), match='Lords'>
re.search(r"\bL\w+", en).span()  # (40, 45)
re.search(r"\bL\w+", en).string  #'12Drummers Drumming 11 Pipers Piping
re.search(r"\bL\w+", en).group() #'Lords'

'Lords'

In [53]:
## \B matches wherever \b does not: a position that is not a word boundary.
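
A sketch of `\B` next to `\b`, reporting match spans so that the location of each hit is visible. The sample string and the helper `spans` are invented for this example.

```python
import re

s = "cat concat category bobcats"

def spans(pattern):
    """(start, end) of every match of pattern in s."""
    return [m.span() for m in re.finditer(pattern, s)]

whole_word = spans(r"\bcat\b")   # "cat" standing alone
inside = spans(r"\Bcat\B")       # "cat" with word characters on both sides
word_start = spans(r"\bcat")     # "cat" at the beginning of a word
word_end = spans(r"cat\b")       # "cat" at the end of a word

print(whole_word)  # [(0, 3)]
print(inside)      # [(23, 26)]
print(word_start)  # [(0, 3), (11, 14)]
print(word_end)    # [(0, 3), (7, 10)]
```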

### Logic and grouping

In [54]:
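## | matches either the expression on its left or the one on its right. The left side is tried first; once it matches, the right side is skipped.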

re.search('a|b', 'mango')        #<re.Match object; span=(1, 2), match='a'>

<re.Match object; span=(1, 2), match='a'>

In [55]:
## (...) a parenthesized expression forms a group; groups are numbered from left to right, and each opening parenthesis ( adds 1 to the number.

re.search('(a)m', 'amm')                       #<re.Match object; span=(0, 2), match='am'>
re.compile(r'Bat(wo)?man').search(en)          #<re.Match object; span=(339, 345), match='Batman'>
re.compile(r'Bat(wo)?man').search(en).group()  #'Batman'
re.compile(r'First Name: (.*) Last Name: (.*)').findall(name)   #[('Jerry', 'Chien')]

[('Jerry', 'Chien')]

In [56]:
## (?P<name>...) a group that, in addition to its number, gets a name as an alias.

email4 = re.compile(r'(?P<user>[\w.]+)@(?P<domain>\w+)\.(?P<suffix>[a-z]{3})')
match = email4.match('guido@gamil.com')
match.groupdict()

{'user': 'guido', 'domain': 'gamil', 'suffix': 'com'}

In [57]:
## \<number> refers to the text matched by group number <number>. The calls below contrast lazy and greedy capture inside a group.

serve = '<To serve humans> for dinner.>'
re.compile(r'<(.*?)>').findall(serve)       #['To serve humans']   
re.compile(r'<(.*)>').findall(serve)        #['To serve humans> for dinner.']

['To serve humans> for dinner.']
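
For the numbered back-reference itself, one common use is spotting accidentally doubled words. This is a small sketch on a made-up sentence: `\1` must repeat exactly what group 1 captured.

```python
import re

sentence = "this is is a test of of the the system"

# (\w+) captures a word; \1 demands the same word again after whitespace
doubled = re.findall(r"\b(\w+)\s+\1\b", sentence)
print(doubled)   # ['is', 'of', 'the']

# In the replacement string, \1 inserts the captured word once
cleaned = re.sub(r"\b(\w+)(\s+\1\b)+", r"\1", sentence)
print(cleaned)   # this is a test of the system
```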

In [58]:
## (?P=name) refers to the text matched by the group named <name>. The check below instead uses look-aheads to validate a password.

dd = "psw123"

if re.search(r"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@$!%*#?&])[A-Za-z\d@$!#%*?&]{6,20}$", dd):
    print("Password is valid")
else:
    print("Password NNN")

Password NNN
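
A named back-reference can be sketched with a simple paired-tag matcher: `(?P=tag)` must repeat whatever the group named `tag` captured. The pattern and inputs below are illustrative only and are not meant as a general HTML parser.

```python
import re

paired = re.compile(r"<(?P<tag>[a-z]+)>(?P<body>.*?)</(?P=tag)>")

ok = paired.search("<b>bold</b>")
mismatch = paired.search("<b>bold</i>")   # closing tag differs from the opening one

print(ok.groupdict())   # {'tag': 'b', 'body': 'bold'}
print(mismatch)         # None
```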


### Miscellaneous

In [59]:
line = 'the quick brown fox jumped over a lazy dog'

regex = re.compile('fox')
match = regex.search(line)
match.start()

16

In [60]:
newtext = "The match in Germany"
x = re.search("^The.*Germany$", newtext)
print(x)

if x:
    print("Yes! We have a match")
else:
    print("No match")

<re.Match object; span=(0, 20), match='The match in Germany'>
Yes! We have a match


In [61]:
# strip sub
import re
def strip(text):
    stripStartRegex = re.compile(r'(^\s*)')
    stripEndRegex = re.compile(r'(\s*$)')

    textStartStripped = stripStartRegex.sub('', text)
    textStripped = stripEndRegex.sub('', textStartStripped)

    return textStripped

if __name__ == "__main__":
    text = ' test ffs   '
    print(strip(text))

test ffs


In [62]:
allApes = re.findall("ape","ape... together...strong... apes")
for i in allApes:
    print(i)

ape
ape


In [63]:
theStr = "ape... together...strong... apes"

for i in re.finditer("ape",theStr):
    locTuple = i.span()
    print(locTuple)
    print(theStr[locTuple[0]:locTuple[1]])

(0, 3)
ape
(28, 31)
ape


In [64]:
dd = "my email address is macbook@gmail.com"

re.findall(r"[\w._%+-]{1,20}@[\w.-]{2,20}\.[A-Za-z]{2,3}",dd)

['macbook@gmail.com']

In [65]:
dd = "To email Guido, try guido@python.org or the older address guido@google.com."

re.compile(r'\w+@\w+\.[a-z]{3}').findall(dd)

['guido@python.org', 'guido@google.com']

In [66]:
#!pip install pyperclip

#Finds phone numbers and email addresses on the clipboard.

import pyperclip, re
phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))? # area code
    (\s|-|\.)?         # separator
    (\d{3})              # first 3 digits
    (\s|-|\.)          # separator
    (\d{4})              # last 4 digits
    (\s*(ext|x|ext\.)\s*(\d{2,5}))?  # extension
    )''', re.VERBOSE)

# Create email regex.
emailRegex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+      # username
    @                      # @ symbol
    [a-zA-Z0-9.-]+         # domain name
    (\.[a-zA-Z]{2,4}){1,2} # dot-something
    )''', re.VERBOSE)

# Find matches in clipboard text.
text = str(pyperclip.paste())

matches = []
for groups in phoneRegex.findall(text):
    phoneNum = '-'.join([groups[1], groups[3], groups[5]])
    if groups[8] != '':
        phoneNum += ' x' + groups[8]
    matches.append(phoneNum)
for groups in emailRegex.findall(text):
    matches.append(groups[0])

# Copy results to the clipboard.
if len(matches) > 0:
    pyperclip.copy('\n'.join(matches))
    print('Copied to clipboard:')
    print('\n'.join(matches))
else:
    print('No phone numbers or email addresses found.')

lyrics = '12 Drummers Drumming 11 Pipers Piping 10 Lords a Leaping 9 Ladies Dancing 8 Maids a Milking \
7 Swans a Swimming 6 Geese a Laying 5 Golden Rings 4 Calling Birds 3 French Hens 2 Turtle Doves and a Partridge in a Pear Tree'

xmasRegex = re.compile(r'\d+\s\w+')
xmasRegex.findall(lyrics)

No phone numbers or email addresses found.


['12 Drummers',
 '11 Pipers',
 '10 Lords',
 '9 Ladies',
 '8 Maids',
 '7 Swans',
 '6 Geese',
 '5 Golden',
 '4 Calling',
 '3 French',
 '2 Turtle']

In [67]:
# 23: a closer look at regular expressions
# Without regular expressions (verbose)

def isPhoneNumber(text):
    if len(text) != 12:  # must be exactly 12 characters long
        return False
    for i in range(0,3):
        if not text[i].isdecimal():  # the first three characters must be digits
            return False
    if text[3] != '-':   # the fourth character must be a hyphen
        return False
    for i in range(4,7):
        if not text[i].isdecimal():
            return False
    if text[7] != '-':
        return False
    for i in range(8,12):
        if not text[i].isdecimal():
            return False
    # If none of the checks above failed, treat the string as a phone number
    return True

print('415-555-4242 is a phone number:')
print(isPhoneNumber('415-555-4242'))
print('Moshi moshi is a phone number:')
print(isPhoneNumber('Moshi moshi'))
print('----------')

message = 'Call 415-555-1011 office'
for i in range(len(message)):
    chunk = message[i:i+12]
    print(chunk)

415-555-4242 is a phone number:
True
Moshi moshi is a phone number:
False
----------
Call 415-555
all 415-555-
ll 415-555-1
l 415-555-10
 415-555-101
415-555-1011
15-555-1011 
5-555-1011 o
-555-1011 of
555-1011 off
55-1011 offi
5-1011 offic
-1011 office
1011 office
011 office
11 office
1 office
 office
office
ffice
fice
ice
ce
e


In [68]:
import re

def testPasswordStrength(password):
    eightCharsLongRegex = re.compile(r'[\w\d\s\W\D\S]{8,}')
    upperCaseRegex = re.compile(r'[A-Z]+')
    lowerCaseRegex = re.compile(r'[a-z]+')
    oneOrMoreDigitRegex = re.compile(r'\d+')
    
    if not eightCharsLongRegex.search(password):
        return False
    elif not upperCaseRegex.search(password):
        return False
    elif not lowerCaseRegex.search(password):
        return False
    elif not oneOrMoreDigitRegex.search(password):
        return False
    return True
    

if __name__ == "__main__":
    password = 'A&dsas9$_'
    print(testPasswordStrength(password))

True


In [69]:
#re-auto_madLibs

import os
import re

def madLibs(input_file, output_file):
    regex = re.compile(r'(NOUN|ADJECTIVE|ADVERB|VERB)')

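    # Read the whole input first: opening the output in 'w' mode truncates it,
    # which would empty the input if both paths point to the same file.
    with open(input_file, 'r') as in_file:
        content = in_file.read()

    matches = regex.findall(content)
    for found in matches:
        sub = input('Enter a ' + found + ': ')
        content = content.replace(found, sub, 1)

    with open(output_file, 'w') as out_file:
        out_file.write(content)
    print(content)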

if __name__ == "__main__":
    madLibs('input/re-madLibs.txt', 'input/re-madLibs.txt')


