## <font color='darkblue'>RE 101 in Python</font>
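A [**regular expression**](https://en.wikipedia.org/wiki/Regular_expression) (known in Chinese as **正規表示式, 正規表達式, 正規表示法, 規則運算式, or 常規表示法**) is a concept from computer science. A regular expression is a single string that describes, and matches, a set of strings following a given syntactic rule. In many text editors, regular expressions are used to search for and replace text that matches a pattern.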

Python provides the [**re module**](https://docs.python.org/3/library/re.html) for working with regular expressions. Here is a simple example:

In [2]:
import re

log_content = '''
  AdapterState:
    total records=4
    rec[0]: time=11-26 16:30:04.861 processed=OffState org=OffState
    dest=TurningBleOnState what=3(0x3) BLE_TURN_ON
    ...
  curState=OnState
'''

matcher = re.search('curState=([a-zA-Z]+)', log_content)
if matcher is not None:
    print(f"Current state is {matcher.group(1)}")
else:
    print('No match!')

Current state is OnState


<a id='sect0'></a>
In the code above, `re.search('curState=([a-zA-Z]+)', log_content)` calls the [search](https://docs.python.org/3/library/re.html#re.search) function to find the text `curState=OnState` in `log_content`. The [group](https://docs.python.org/3/library/re.html#re.Match.group) method of the returned match object then extracts the keyword `OnState`.
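
To make the difference between the whole match and a captured group concrete, here is a minimal sketch on a shortened copy of the log. The names `sample_log`, `state_match`, `whole_text` and `state_name` are illustrative and not part of the original notebook.

```python
import re

# Shortened copy of the log text used above (illustrative only).
sample_log = 'dest=TurningBleOnState what=3(0x3) BLE_TURN_ON\ncurState=OnState\n'

state_match = re.search(r'curState=([a-zA-Z]+)', sample_log)
if state_match is not None:
    whole_text = state_match.group(0)  # the entire matched text
    state_name = state_match.group(1)  # only the first parenthesized group
    print(whole_text)  # curState=OnState
    print(state_name)  # OnState
```

`group(0)` always returns the full match, while `group(1)` returns what the first pair of parentheses captured.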

This 101 tutorial covers the following topics to get you familiar with using RE in Python:
* <font size='3ptx'><b><a href='#sect1'>Basic RE syntax</a></b></font>
* <font size='3ptx'><b><a href='#sect2'>Basic usage examples of the Python re module</a></b></font>
* <font size='3ptx'><b><a href='#sect3'>Exercises</a></b></font>

<a id='sect1'></a>
## <font color='darkblue'>Basic RE syntax</font>
* <font size='3ptx'><b><a href='#sect1_1'>Characters</a></b></font>
* <font size='3ptx'><b><a href='#sect1_2'>Character Sets</a></b></font>
* <font size='3ptx'><b><a href='#sect1_3'>Repetition</a></b></font>
* <font size='3ptx'><b><a href='#sect1_4'>Grouping and Alternation</a></b></font>
* <font size='3ptx'><b><a href='#sect1_5'>Anchors</a></b></font>
<br/>

**To use RE effectively, we need to know its syntax and write a matching pattern that fits our needs.**

While learning to write patterns, you can use [**regex101**](https://regex101.com/) or [**regexr**](https://regexr.com/) to quickly check what a pattern matches:
![img1](images/img1_regex101.PNG)

<a id='sect1_1'></a>
### <font color='darkgreen'>Characters</font>

#### Literal characters

In [5]:
pattern = r'zz'
input_string = 'Pizzazz'

# Search the pattern in input_string
match = re.search(pattern, input_string)
print(match)

<re.Match object; span=(2, 4), match='zz'>


In [7]:
# Match the pattern at the start of the string (re.match only looks at position 0)
match = re.match(pattern, input_string)
print(match)

None


In [6]:
# Find all matches
match = re.findall(pattern, input_string)
match

['zz', 'zz']

#### Metacharacters
* Wildcard **<font size='4ptx' color='darkblue'>.</font>**
    * **/h.t/**: `hat`, `hot` and `hit`, but not `heat`

In [8]:
pattern = r'h.t'
input_string = 'hit hot hit heat'
match = re.findall(pattern, input_string)
match

['hit', 'hot', 'hit']

#### Escaping Metacharacters
* Allow use of metacharacters as literal characters
* Match a period with **/\\./**
* **/9\\.00/** matches `9.00`, but not `9500` or `9-00`
* Match a backslash by escaping a backslash (\\\\)
* Escaping is only needed for metacharacters; escaping a literal character can instead give it a special meaning (for example, `\d` means a digit).
* Quotation marks are not metacharacters; don't need to be escaped.
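
As a small sketch of the escaping rules above, the following compares an unescaped and an escaped dot on the `9.00` example. The variable names are illustrative; `re.escape` is the standard helper that escapes a literal string for you.

```python
import re

prices = '9.00 9500 9-00'

# An unescaped dot is a wildcard, so any character is accepted in that position.
unescaped = re.findall(r'9.00', prices)
# An escaped dot matches only a literal period.
escaped = re.findall(r'9\.00', prices)

print(unescaped)  # ['9.00', '9500', '9-00']
print(escaped)    # ['9.00']

# re.escape() escapes the metacharacters in a literal string.
literal = re.escape('9.00')
from_escape = re.findall(literal, prices)
print(from_escape)  # ['9.00']
```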

In [11]:
pattern = r'resume..txt'
input_string = 'resume1.txt resume2.txt resume3_txt.zip'
match = re.findall(pattern, input_string)
match

['resume1.txt', 'resume2.txt', 'resume3_txt']

The "resume3_txt" above is not an expected match. How should `pattern` be changed?

Be careful with ".", or you will often get accidental matches, e.g.:

In [17]:
pattern = r'h.t'
input_string = 'hit hot hit reach temple'
match = re.findall(pattern, input_string)
match  # 'h t' is an accident by matching "reac`h t`emple"!

['hit', 'hot', 'hit', 'h t']

#### <font color='orange'>Challenge</font>
Use 3 literal characters and 3 wildcard characters to match `please`, `palace` and `parade`.

In [14]:
pattern = r'TBD' # TBD
input_string = 'please palace parade'
match = re.findall(pattern, input_string)
match

[]

<a id='sect1_2'></a>
### <font color='darkgreen'>Character Sets</font> ([back](#sect1))

#### Define a character set
* Use "<font color='blue' size='5ptx'>**[**</font>" and "<font color='blue' size='5ptx'>**]**</font>" to define a character set.
* Any one of several characters
* But only one character
* Order of characters in the set does not matter
* <font color='blue'>**/[aeiou]/**</font> matches any one vowel
* <font color='blue'>**/gr[ea]y/**</font> matches "grey" and "gray" but not "great"

In [18]:
pattern = r'gr[ea]y'
input_string = 'grey gray great'
match = re.findall(pattern, input_string)
match

['grey', 'gray']

#### Character ranges
To match any digit 0~9, would you write <font color='blue'>**/[0123456789]/**</font>? There is a less tedious way: <font color='blue'>**/[0-9]/**</font>
* Includes all characters between two characters with "-"
* <font color='blue'>**/[A-Za-z]/**</font> matches any uppercase or lowercase English letter.
* <font color='blue'>**/[50-99]/**</font> is not all numbers from 50 to 99!!!
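
The following sketch shows why `[50-99]` is not a numeric range, and one possible way to match two-digit numbers from 50 to 99. The sample string and variable names are made up for illustration, and `\b` (word boundary) is covered in the Anchors section.

```python
import re

numbers = '7 42 55 99 100'

# [50-99] is the set {5, 0-9, 9}, i.e. it matches any single digit.
single_digits = re.findall(r'[50-99]', numbers)
print(single_digits)

# One way to match 50 to 99: a tens digit 5-9 followed by any digit,
# bounded by \b so that longer numbers are not matched.
fifty_to_ninety_nine = re.findall(r'\b[5-9][0-9]\b', numbers)
print(fifty_to_ninety_nine)  # ['55', '99']
```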

#### Negative character sets
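* Use the metacharacter "^" at the start of a character set to negate it: the set then lists the characters that must not match.
* <font color='blue'>**/[^aeiou]/**</font> matches any one character that is not a lowercase vowel (consonants, but also digits, spaces and punctuation)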
* <font color='blue'>**/see[^mn]/**</font> matches "sees" but not "seem", "seen" or "see"

In [25]:
pattern = r'see[^mn]'
input_string = 'sees seem seen see see.'
match = re.findall(pattern, input_string)
match # "see " and "see." may be a surprise ><". How to fix the pattern to match "sees" only?

['sees', 'see ', 'see.']

#### Metacharacters inside character set
* Most metacharacters lose their special meaning inside a character set, so they do not need to be escaped there.
* <font color='blue'>**/h[a.]t/**</font> matches "hat" and "h.t", but not "hot"
* Exceptions: **<font color='blue'>] - ^ \ </font>**
    * **<font color='blue'> /var[[(][0-9][</font><font color='red'>\\]</font><font color='blue'>)]/ </font>**
    * **<font color='blue'> /file[0</font><font color='red'>\\-\\\\</font><font color='blue'>_]1/ </font>**
    * **<font color='blue'> /2013[-/]10[-/]05/ </font>** <- when "-" is placed first in the character set, it does not need to be escaped.

In [3]:
pattern = r'var[([][0-9][)\]]'
input_string = 'var(3) var(4)'
match = re.findall(pattern, input_string)
match

['var(3)', 'var(4)']
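
Two more of the patterns from the list above can be checked the same way. This is a small sketch; the sample strings are invented for illustration.

```python
import re

dates = '2013-10-05 2013/10/05 2013.10.05'

# "-" placed first in the set is a literal hyphen; "/" is not special in a set.
date_matches = re.findall(r'2013[-/]10[-/]05', dates)
print(date_matches)  # ['2013-10-05', '2013/10/05']

# Inside a set the dot is a literal period, so "hot" is not matched.
dot_matches = re.findall(r'h[a.]t', 'hat h.t hot')
print(dot_matches)  # ['hat', 'h.t']
```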

#### Shorthand character sets
![img2](images/img2_shorthand_character_sets.PNG)
<br/>

* **<font color='blue'>/\d\d\d\d/ </font>** matches "2022", but not "text"
* **<font color='blue'>/\w\w\w/ </font>** matches "ABC", "123", and "1_A" but not "1-A"
* **<font color='blue'>/\w\s\w\w/ </font>** matches "A am", but not "Am I"
* **<font color='blue'>/[\w\\-]/ </font>** matches any word character or hyphen 
* **<font color='blue'>/[^\d]/ </font>** is the same as both **<font color='blue'>/\D/</font>** and **<font color='blue'>/[^0-9]/</font>**
* <font color='darkred'>**Caution:**</font> **<font color='blue'>/[^\d\s]/</font>** is not the same as **<font color='blue'>/[\D\S]/</font>**
    * **<font color='blue'>/[^\d\s]/</font>**: NOT (digit OR whitespace), i.e. neither a digit nor a whitespace character
    * **<font color='blue'>/[\D\S]/</font>**: EITHER not a digit OR not a whitespace character, which is true for every character

In [6]:
pattern = r'[^\d\s]' # Must be neither a digit (\d) nor whitespace (\s)
input_string = '1234 5678 abc'
match = re.findall(pattern, input_string)
match

['a', 'b', 'c']

In [8]:
pattern = r'[\D\S]'  # (\D -> not a digit) OR (\S -> not whitespace)
match = re.findall(pattern, input_string)
match

['1', '2', '3', '4', ' ', '5', '6', '7', '8', ' ', 'a', 'b', 'c']

#### <font color='orange'>Challenge</font>
* Match both "lives" and "lived"
* Match "virtue" but not "virtues"
* Match the numbers and periods on all numbered paragraphs
* Find the 5-character word that starts with "c"

In [10]:
pattern = r'TBD' 
input_string = 'lives lived'
match = re.findall(pattern, input_string)
match

[]

In [27]:
pattern = r'TBD' 
input_string = 'virtue virtues'
match = re.findall(pattern, input_string)
match  # Expect ['virtue']

[]

In [22]:
pattern = r'TBD' 
input_string = '12.0 27 1. 10.'
match = re.findall(pattern, input_string)
match  # Expect: ['2.', '1.', '0.']

[]

In [23]:
pattern = r'TBD' 
input_string = 'chain chess cake cook lived cheek'
match = re.findall(pattern, input_string)
match  # Expect: ['chain', 'chess', 'cheek']

[]

<a id='sect1_3'></a>
### <font color='darkgreen'>Repetition</font> ([back](#sect1))
Recall the earlier challenge `Find the 5-character word that starts with "c"`. Besides **<font color='blue'>/[c]\w\w\w\w/</font>**, is there a better way to write it?

#### [Repetition metacharacters](https://www.linkedin.com/learning/learning-regular-expressions-2/repetition-metacharacters?autoAdvance=true&autoSkip=false&autoplay=true&resume=false&u=56685617)
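A quantifier placed after a character limits how many times that character may occur. The most common quantifiers are <font size='5ptx'>**+**</font>, <font size='5ptx'>**?**</font> and <font size='5ptx'>**\***</font> (<font color='brown'>without a quantifier, the character must occur exactly once</font>):
* The plus sign <font size='5ptx'>**+**</font> means the preceding character must occur at least once (<font color='brown'>one or more times</font>). For example, **<font color='blue'>/goo+gle/</font>** matches "google", "gooogle", "goooogle", and so on.
* The question mark <font size='5ptx'>**?**</font> means the preceding character may occur at most once (<font color='brown'>zero or one time</font>). For example, **<font color='blue'>/colou?r/</font>** matches "color" or "colour".
* The asterisk <font size='5ptx'>**\***</font> means the preceding character may be absent or may occur any number of times (<font color='brown'>zero, one or more times</font>). For example, **<font color='blue'>/0*42/</font>** matches "42", "042", "0042", "00042", and so on.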
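
The three quantifiers above can be verified with a short sketch. It uses `re.fullmatch` so that each test string has to match in its entirety; the `cases` table and the non-matching strings are illustrative additions.

```python
import re

# Each entry: (pattern, strings that should match, strings that should not).
cases = [
    (r'goo+gle', ['google', 'gooogle'], ['gogle']),
    (r'colou?r', ['color', 'colour'], ['colouur']),
    (r'0*42', ['42', '042', '00042'], ['0x42']),
]

for pattern, should_match, should_fail in cases:
    for text in should_match:
        assert re.fullmatch(pattern, text) is not None
    for text in should_fail:
        assert re.fullmatch(pattern, text) is None
    print(f'{pattern}: ok')
```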

In [29]:
pattern = r'apples?' 
input_string = 'apple apples appeal'
match = re.findall(pattern, input_string)
match

['apple', 'apples']

In [32]:
pattern = r'Good .+?\.'  # Try r'Good .+\.' to see the result.
input_string = 'Good morning. Good evening. Good afternoon.'
match = re.findall(pattern, input_string)
match

['Good morning.', 'Good evening.', 'Good afternoon.']

#### [Quantified repetition](https://www.linkedin.com/learning/learning-regular-expressions-2/quantified-repetition?autoAdvance=true&autoSkip=true&autoplay=true&resume=false&u=56685617)
* Use the metacharacters "<font color='blue' size='5ptx'>**{**</font>" and "<font color='blue' size='5ptx'>**}**</font>"
* The syntax is <font color='blue' size='5ptx'>**{min,max}**</font> with no spaces inside the braces; `min` can be zero and `max` is optional (Python's `re` also allows omitting `min`, which then defaults to zero)
* <font color='blue'>**/\d{4,8}/**</font> matches numbers with four to eight digits.
* <font color='blue'>**/\d{4}/**</font> matches numbers with exactly four digits.
* <font color='blue'>**/\d{4,}/**</font> matches numbers with four or more digits (`max` is infinite).
* <font color='blue'>**/\d{0,}/**</font> is the same as <font color='blue'>**/\d*/**</font>
* <font color='blue'>**/\d{1,}/**</font> is the same as <font color='blue'>**/\d+/**</font>
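
A minimal sketch of the quantified forms listed above, using `re.fullmatch` on each token so the digit counts are easy to see. The sample text and variable names are illustrative. The last lines show a pitfall in Python's `re`: a space inside the braces turns them into literal text.

```python
import re

text = '123 1234 12345678 123456789'
tokens = text.split()

exactly_four = [t for t in tokens if re.fullmatch(r'\d{4}', t)]
four_to_eight = [t for t in tokens if re.fullmatch(r'\d{4,8}', t)]
four_or_more = [t for t in tokens if re.fullmatch(r'\d{4,}', t)]

print(exactly_four)   # ['1234']
print(four_to_eight)  # ['1234', '12345678']
print(four_or_more)   # ['1234', '12345678', '123456789']

# With a space, "{4, 8}" is no longer a quantifier and is matched literally.
with_space = re.findall(r'\d{4, 8}', text)
print(with_space)     # []
```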

In [33]:
pattern = r'09\d{8}|09\d{2}-\d{3}-\d{3}'  # Match phone number
input_string = '0983123456 0983-123-456 123 0'
match = re.findall(pattern, input_string)
match

['0983123456', '0983-123-456']

In [35]:
pattern = r'[c]\w{4}' 
input_string = 'chain chess cake cook lived cheek'
match = re.findall(pattern, input_string)
match  # Expect: ['chain', 'chess', 'cheek']

['chain', 'chess', 'cheek']

#### Greedy Expressions
![img3](images/img3_greedy_ex.PNG)
<br/>

What if we only want to match "01_FY_07_"?

* Standard repetition quantifiers are greedy.
* Expression tries to match the longest possible string
* Defers to achieving overall match
* **<font color='blue'> /.+\\.jpg/ </font>** matches "filename.jpg"
* The **<font color='blue' size='5ptx'>+</font>** is greedy, but "gives back" the ".jpg" to make the match 
* **<font color='blue'> /.*[0-9]+/ </font>** matches "Page266"
    * Gives back as little as possible
    * **<font color='blue'>.*</font>** portion matches "Page26"
    * **<font color='blue'>[0-9]+</font>** portion matches only "6"
* Match as much as possible before giving control to the next expression part.    

In [50]:
pattern = r'(.+)[0-9]+' 
input_string = 'Page266'
match = re.match(pattern, input_string)
match.group(1)  # (.+) matches "Page26" rather than "Page"

'Page26'

How can we change the pattern to extract "Page" instead of "Page26"?

In [47]:
pattern = r'([a-zA-Z]+)[0-9]+'  # Be more specific
input_string = 'Page266'
match = re.match(pattern, input_string)
match.group(1)

'Page'

#### [Lazy Expressions](https://www.linkedin.com/learning/learning-regular-expressions-2/lazy-expressions?autoAdvance=true&autoSkip=true&autoplay=true&resume=false&u=56685617)
* **<font size='5ptx' color='blue'>?</font>** makes preceding quantifier lazy
* Instructs quantifier to use a "lazy strategy" for making choices.
* Match as little as possible before giving control to the next expression part
* Still defers to overall match
* Not necessarily faster or slower.

In [49]:
pattern = r'(.+?)[0-9]+'  # Lazy quantifier: match as little as possible
input_string = 'Page266'
match = re.match(pattern, input_string)
match.group(1)

'Page'

In [55]:
pattern = r'\d+\w+?\d+'
input_string = '01_FY_07_report_99.xls'
match = re.match(pattern, input_string)
match.group(0)

'01_FY_07'
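
To see greedy and lazy matching side by side, here is a small sketch on a made-up piece of markup (the `html` string is an assumption for illustration, not data from the notebook).

```python
import re

html = '<b>bold</b> and <i>italic</i>'

greedy = re.findall(r'<.+>', html)  # + takes as much as possible
lazy = re.findall(r'<.+?>', html)   # +? takes as little as possible

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```

Both patterns still defer to the overall match: the greedy version gives back just enough to end on the last `>`, while the lazy version stops at the first `>` it can.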

#### <font color='orange'>Challenge</font>
* Match "self", "himself", "herself", "itself", "myself", "yourself" and "thyself"
* Match both "virtue" and "virtues"
* Use quantified repetition to find the words that start with "T" and have 7 to 11 letters.
* Match all text inside quotation marks, but nothing that is not inside them

In [68]:
# 1) Match "self", "himself", "herself", "itself", "myself", "yourself" and "thyself"
pattern = r'TBD'  
input_string = 'self himself herself itself myself yourself thyself aselftesting aselfb'
match = re.findall(pattern, input_string)
match  # Expect ['self', 'himself', 'herself', 'itself', 'myself', 'yourself', 'thyself']

[]

In [69]:
# 2) Match both "virtue" and "virtues"
pattern = r'TBD'  
input_string = 'virtue virtues forceworng'
match = re.findall(pattern, input_string)
match  # Expect ['virtue', 'virtues']

[]

In [70]:
# 3) Use quantified repetition to find the word that starts with "T" and has 7~11 letters.
pattern = r'TBD'  
input_string = 'Testing The Television To testbed'
match = re.findall(pattern, input_string)
match  # Expect ['Testing', 'Television']

[]

In [77]:
# 4) Match all text inside quotation marks, but nothing that is not inside them
pattern = r'TBD'  
input_string = '"Abc" "The space" "Has punctuation ," no_quote '
match = re.findall(pattern, input_string)
match  # Expect ['Abc', 'The space', 'Has punctuation ,']

[]

<a id='sect1_4'></a>
### <font color='darkgreen'>Grouping and Alternation</font> ([back](#sect1))

#### Grouping metacharacters
* Use the metacharacters "<font color='blue' size='5ptx'>**(**</font>" and "<font color='blue' size='5ptx'>**)**</font>"
* Group portions of the expression
* Apply repetition operators to a group
* Create a group of alternation expressions
* Captures group for use in matching and replacing.
* **<font color='blue'>/(abc)+/</font>** matches "abc" and "abcabc"
* **<font color='blue'>/(in)?dependent/ </font>** matches "independent" and "dependent"
* **<font color='blue'>/run(s)?/ </font>** is the same as **<font color='blue'>/runs?/ </font>**

In [85]:
pattern = r'Hello, ([A-Za-z]+)' 
input_string = 'Hello, John Hello, Selina.'
match = re.findall(pattern, input_string)
match

['John', 'Selina']

#### Alternation metacharacters
* Use the metacharacter "<font color='blue' size='5ptx'>**|**</font>" as the OR operator
* Either match expression on the left or match expression on the right
* Ordered, leftmost expression gets precedence
* Multiple choices can be daisy-chained.
* Group alternation expressions to keep them distinct
* **<font color='blue'>/apple|orange/ </font>** matches "apple" and "orange"
* **<font color='blue'>/apple(juice|sauce)/ </font>** is not the same as **<font color='blue'>/applejuice|sauce/ </font>**
* **<font color='blue'>/w(ei|ie)rd/ </font>** matches "weird" and "wierd"
* **<font color='blue'>/(AA|BB|CC){4}/ </font>** matches "AABBAACC" and "CCCCBBBB"

In [88]:
pattern = r'(w(ei|ie)rd)' 
input_string = 'weird wierd would'
match = re.findall(pattern, input_string)
match

[('weird', 'ei'), ('wierd', 'ie')]
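
The tuples above appear because `re.findall` returns the captured groups when a pattern contains them. As a sketch, a non-capturing group `(?:...)` keeps the alternation but lets `findall` return the full matches; the variable names are illustrative.

```python
import re

text = 'weird wierd would'

# Capturing groups: findall returns one tuple of groups per match.
with_groups = re.findall(r'(w(ei|ie)rd)', text)
print(with_groups)  # [('weird', 'ei'), ('wierd', 'ie')]

# Non-capturing group (?:...): nothing is captured, so findall returns
# the whole matched text.
without_groups = re.findall(r'w(?:ei|ie)rd', text)
print(without_groups)  # ['weird', 'wierd']
```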

#### Efficiency when using alternation

In [91]:
# Alternation is ordered: the leftmost alternative that matches wins
pattern = r'(peanut|peanutbutter)' 
input_string = 'peanutbutter'
match = re.findall(pattern, input_string)
match

['peanut']

In [95]:
# The ? quantifier is greedy: the optional group is matched when possible
pattern = r'peanut(butter)?' 
input_string = 'peanutbutter'
match = re.search(pattern, input_string)
match.group(0)

'peanutbutter'
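
Because alternatives are tried from left to right, their order changes the result. A minimal sketch (variable names are illustrative):

```python
import re

text = 'peanutbutter'

# "peanut" succeeds first, so the longer alternative is never tried.
short_first = re.search(r'peanut|peanutbutter', text).group(0)
# Listing the longer alternative first lets it win when it is present.
long_first = re.search(r'peanutbutter|peanut', text).group(0)

print(short_first)  # peanut
print(long_first)   # peanutbutter
```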

In [144]:
%%timeit
pattern = r'\d' 
input_string = '''
TBD
'''
match = re.search(pattern, input_string)
match.group(0)

12.9 µs ± 636 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [145]:
%%timeit
pattern = r'see|john|thee|tree|three' 
input_string = '''TBD'''
match = re.search(pattern, input_string)
match.group(0)

12.3 µs ± 1.33 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)


#### <font color='orange'>Challenge</font>
* Match "myself", "yourself", "thyself", but not "himself", "herself" and "itself"
* Match "good", "goodness", and "goods" without typing "good" more than once.
* Match "do" or "does" followed by "no", "not" or "nothing", even when it occurs at the start of a sentence.

In [152]:
# 1) Match "myself", "yourself", "thyself", but not "himself", "herself" and "itself"
pattern = r'((my|your|thy)self)' 
input_string = 'myself yourself thyself himself herself itself self'
match = re.findall(pattern, input_string)
list(map(lambda g: g[0], match))  # Expect ['myself', 'yourself', 'thyself']

['myself', 'yourself', 'thyself']

In [155]:
# 2) Match "good", "goodness", and "goods" without typing "good" more than once.
pattern = r'(good(ness|s)?)' 
input_string = 'goodness good goods others'
match = re.findall(pattern, input_string)
list(map(lambda g: g[0], match))  # Expect ['goodness', 'good', 'goods']

['goodness', 'good', 'goods']

In [169]:
# 3) Match "do" or "does" followed by "no", "not" or "nothing", even when it occurs at the start of a sentence.
pattern = r'TBD' 
'''
for input_string, expected_result in [
    ('do not', True),
    ('do nothing', True),
    ('does not', True),
    ('Does not', True),
    ('Does nothing', True),
    ('Do no', True),
    ('Did not', False),
    ('Does too', False),
]:
    match = re.match(pattern, input_string)
    try:
        if expected_result:
            assert match is not None
        else:
            assert match is None
    except AssertionError:
        print(f'Fail in "{input_string}"')
        raise
'''
print('TBD')

TBD


<a id='sect1_5'></a>
### <font color='darkgreen'>Anchors</font> ([back](#sect1))

#### Start and end anchors
* "**<font color='blue' size='5ptx'>^</font>**" as start of string/line
* "**<font color='blue' size='5ptx'>$</font>**" as end of string/line
* "**<font color='blue' size='5ptx'>\\A</font>**" as start of string and never start of line
* "**<font color='blue' size='5ptx'>\\Z</font>**" as end of string and never end of line
* Reference a position, not an actual character
* Zero-width
* **<font color='blue'>/^apple/ </font>** matches "apple" but not "an apple"
* **<font color='blue'>/apple\$/ </font>** matches "an apple" but not "apple and orange"

#### Line breaks and multiline mode
* **Single-line mode**: 
    * **<font color='blue' size='5ptx'>^</font>** and **<font color='blue' size='5ptx'>\$</font>** do not match at line breaks.
    * **<font color='blue' size='5ptx'>\\A</font>** and **<font color='blue' size='5ptx'>\\Z</font>** do not match at line breaks.
* **Multiline mode**:
    * **<font color='blue' size='5ptx'>^</font>** and **<font color='blue' size='5ptx'>$</font>** do match at the start and end of lines
    * **<font color='blue' size='5ptx'>\\A</font>** and **<font color='blue' size='5ptx'>\\Z</font>** do not match at line breaks.
    
In Python, pass the <b><a href='https://docs.python.org/3/library/re.html#re.MULTILINE'>re.MULTILINE</a></b> flag to switch to **multiline mode**:

In [238]:
pattern = r"^apple is ([a-zA-Z]+).$"
input_string = 'apple is health.\napple is good.'

In [239]:
# Single-line mode
match = re.findall(pattern, input_string)
match

[]

In [240]:
re.findall(r"apple is (.+).", input_string)

['health', 'good']

In [193]:
re.findall(r"^apple is ([\na-zA-Z. ]+).$", input_string)

['health.\napple is good']

In [180]:
# Multiline mode
match = re.findall(pattern, input_string, re.MULTILINE)
match

['health', 'good']
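
The difference between `^`/`$` and `\A`/`\Z` in multiline mode can be sketched as follows, reusing the two-line sample text. The variable names are illustrative.

```python
import re

text = 'apple is health.\napple is good.'

# With re.MULTILINE, ^ matches at the start of every line...
caret = re.findall(r'^apple', text, re.MULTILINE)
# ...while \A still matches only at the very start of the string.
start_only = re.findall(r'\Aapple', text, re.MULTILINE)

# Likewise $ matches at the end of every line, \Z only at the end of the string.
dollar = re.findall(r'(\w+)\.$', text, re.MULTILINE)
end_only = re.findall(r'(\w+)\.\Z', text, re.MULTILINE)

print(caret, start_only)  # ['apple', 'apple'] ['apple']
print(dollar, end_only)   # ['health', 'good'] ['good']
```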

#### [Word boundaries](https://www.linkedin.com/learning/learning-regular-expressions-2/lazy-expressions?autoAdvance=true&autoSkip=true&autoplay=true&resume=false&u=56685617)
* "**<font color='blue' size='5ptx'>\b</font>**" -> word boundary (start/end of word)
* "**<font color='blue' size='5ptx'>\B</font>**" -> not a word boundary (start/end of word)
* Reference a position, not an actual character
* Before the first word character in the string
* After the last word character in the string
* Between a word character and a non-word character
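* Word characters: **<font color='blue'>[A-Za-z0-9_] </font>** (for `str` patterns in Python 3, `\w` also covers Unicode letters and digits unless `re.ASCII` is set)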

In [194]:
pattern = r'\b\w+\b' 
input_string = 'This is a test.'
match = re.findall(pattern, input_string)
match

['This', 'is', 'a', 'test']

In [200]:
# Surprise?
pattern = r'\bNew\bYork\b' 
input_string = 'New York'
match = re.findall(pattern, input_string)
match

[]
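
The empty result above follows from `\b` being zero-width: it marks a position and consumes no characters, so the space between the two words is never matched. A small sketch of one possible fix (the second sample string is invented for illustration):

```python
import re

text = 'New York'

# \b consumes nothing, so the space between the words is left unmatched.
no_space = re.findall(r'\bNew\bYork\b', text)
print(no_space)  # []

# Matching the whitespace explicitly makes the pattern work.
with_space = re.findall(r'\bNew\s+York\b', text)
print(with_space)  # ['New York']

# \b keeps "New" from matching inside a longer word.
whole_word = re.findall(r'\bNew\b', 'New Newton renew')
print(whole_word)  # ['New']
```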

In [197]:
# Guess the result
pattern = r'\B\w+\B' 
input_string = 'This is a test.'
match = re.findall(pattern, input_string)
# match

#### <font color='orange'>Challenge</font>
* How many paragraphs start with "I" as in "I read"?
* How many paragraphs end with a question mark?
* Match all words with exactly 5 characters, including hyphenated words.

In [214]:
# 1) How many paragraphs start with "I"?
pattern = r'TBD'
input_string = '''
I read the other day sometimes.

You don't know. I read every day.

I study every day.
'''
match = re.findall(pattern, input_string, re.MULTILINE)
print(len(match))

0


In [218]:
# 2) How many paragraphs end with a question mark?
pattern = r'TBD'
input_string = '''
Don't you know RE is very interesting?

I like RE because it is very useful.

Are you able to solve this question?
'''
match = re.findall(pattern, input_string, re.MULTILINE)
print(len(match))

0


In [237]:
# 3) Match all words with exactly 5 letters, including hyphenated words.
pattern = r'TBD'
input_string = 'These topics are very interesting do-as it test$ supposed to be sure.'
match = re.findall(pattern, input_string)
match  # Expect ['These', 'do-as']

[]

<a id='sect2'></a>
## <font color='darkblue'>Basic usage examples of the Python re module</font> ([back](#sect0))

### <font color='darkgreen'>Finding a pattern</font>
[**re.match**](https://docs.python.org/3/library/re.html#re.match) and [**Pattern.match**](https://docs.python.org/3/library/re.html#re.Pattern.match) look for the pattern only at the very beginning of the text. On success they return a [**re.Match**](https://docs.python.org/3/library/re.html#match-objects) object.

In [249]:
pattern = r'John'
print(re.match(pattern, 'John is me'))  # Matched
print(re.match(pattern, 'I am John'))   # No match: returns None

<re.Match object; span=(0, 4), match='John'>
None


[**re.search**](https://docs.python.org/3/library/re.html#re.search) and [**Pattern.search**](https://docs.python.org/3/library/re.html#re.Pattern.search) look for the pattern at any position in the text. On success they return a [**re.Match**](https://docs.python.org/3/library/re.html#match-objects) object.

In [246]:
pattern = r'John'
print(re.search(pattern, 'John is me'))  # Matched
print(re.search(pattern, 'I am John'))   # Matched

<re.Match object; span=(0, 4), match='John'>
<re.Match object; span=(5, 9), match='John'>
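
The `Pattern.match` and `Pattern.search` methods mentioned above belong to a compiled pattern object. A minimal sketch, assuming the pattern is reused several times (the name `name_pattern` is illustrative):

```python
import re

# Compile once, then reuse the Pattern object's methods.
name_pattern = re.compile(r'John')

at_start = name_pattern.match('I am John')           # None: match() anchors at position 0
anywhere = name_pattern.search('I am John').span()   # (5, 9)

# Pattern.match and Pattern.search accept an optional start position.
from_pos = name_pattern.match('I am John', 5).span()

print(at_start)  # None
print(anywhere)  # (5, 9)
print(from_pos)  # (5, 9)
```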


### <font color='darkgreen'>Extracting text that matches a pattern</font>
[**re.findall**](https://docs.python.org/3/library/re.html#re.findall) returns a list of the text that matches the pattern:

In [247]:
pattern = r'\b\w{4,5}\b'
print(re.findall(pattern, "This is a very interesting topic."))

['This', 'very', 'topic']


[**re.finditer**](https://docs.python.org/3/library/re.html#re.finditer) returns an iterator of [**re.Match**](https://docs.python.org/3/library/re.html#match-objects) objects for the matches:

In [253]:
pattern = r'\b\w{4,5}\b'
for matched_result in re.finditer(pattern, "This is a very interesting topic."):
    print(matched_result.group(0))

This
very
topic


### <font color='darkgreen'>Modifying text</font>
[**re.split**](https://docs.python.org/3/library/re.html#re.split) and [**Pattern.split**](https://docs.python.org/3/library/re.html#re.Pattern.split) split the text by a pattern:

In [254]:
re.split(r'\W+', 'Words, words, words.')

['Words', 'words', 'words', '']

In [256]:
# Keep the separators in the result
re.split(r'(\W+)', 'Words, words, words.')

['Words', ', ', 'words', ', ', 'words', '.', '']

In [258]:
# re.IGNORECASE makes the match case-insensitive
re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)

['0', '3', '9']

[**re.sub**](https://docs.python.org/3/library/re.html#re.sub) and [**Pattern.sub**](https://docs.python.org/3/library/re.html#re.Pattern.sub) replace text that matches a pattern:

In [262]:
re.sub(r"likes \b(\w+)\b", "movies", 'John likes coding.')

'John movies.'

In [263]:
re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)

'Baked Beans & Spam'

In [265]:
def name_replace(mth):
    if mth.group(0) == 'Bob':
        return 'Ken'
    elif mth.group(0) == 'Mary':
        return 'Jane'
    else:
        return '?'
    
re.sub(r'Bob|Mary', name_replace, 'Bob and Mary are good friends!')

'Ken and Jane are good friends!'
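
Besides a fixed string or a function, the replacement passed to `re.sub` can refer back to captured groups. The sketch below uses invented sample strings; `re.subn` is the standard variant that also returns the number of replacements.

```python
import re

# \1 and \2 in the replacement refer to the captured groups.
swapped = re.sub(r'(\w+) likes (\w+)', r'\2 is liked by \1', 'John likes coding.')
print(swapped)  # coding is liked by John.

# Named groups can be referenced with \g<name>.
iso_date = re.sub(r'(?P<day>\d{2})/(?P<month>\d{2})/(?P<year>\d{4})',
                  r'\g<year>-\g<month>-\g<day>', '20/02/2022')
print(iso_date)  # 2022-02-20

# re.subn also reports how many replacements were made.
result, count = re.subn(r'Bob|Mary', 'X', 'Bob and Mary are good friends!')
print(result, count)  # X and X are good friends! 2
```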

### <font color='darkgreen'>group and match objects</font>
When match or search succeeds, it returns a [**match**](https://docs.python.org/3/library/re.html#match-objects) object that also stores the matched groups. Use the [match.group](https://docs.python.org/3/library/re.html#re.Match.group) method to retrieve the matched group(s).

In [275]:
# https://html.spec.whatwg.org/multipage/forms.html#valid-e-mail-address
email_pattern = r'\b([a-zA-Z0-9.!#$%&\'*+/=?^_`{|}~-]+)@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*\b'

input_string = 'My email address is abc@test.com and his address is petter.lee@cafe.com'
for email_matcher in re.finditer(email_pattern, input_string):
    print(f'account={email_matcher.group(1)}; email={email_matcher.group(0)}')

account=abc; email=abc@test.com
account=petter.lee; email=petter.lee@cafe.com


A group can be given a name with `(?P<group_name>pattern)` ([more on group usage](https://realpython.com/regex-python/#other-grouping-constructs)):

In [295]:
ls_pattern = r'(?P<attributes>[drwx-]{10})\W' \
    r'(?P<filecount>\d+)\W+' \
    r'(?P<user>[a-zA-Z]+)\W+' \
    r'(?P<group>[a-zA-Z]+)\W+' \
    r'(?P<filesize>\d+[KMG]?)\W+' \
    r'(?P<month>[A-Za-z]{3})\W+' \
    r'(?P<day>[0-9]{2})\W+' \
    r'(?P<time>[0-9]{2}:[0-9]{2})\W+' \
    r'(?P<filename>.+)'
    
input_string='''
root@ubuntu# ls -hl
total 304K
-rw-r--r-- 1 root root 112K Feb 19 17:28 img1_regex101.PNG
-rw-r--r-- 1 root root 169K Feb 20 10:00 img2_shorthand_character_sets.PNG
-rw-r--r-- 1 root root  17K Feb 20 13:45 img3_greedy_ex.PNG
'''
for matcher in re.finditer(ls_pattern, input_string):
    print(f"File={matcher.group('filename')} with size={matcher.group('filesize')}")
    #print(matcher)

File=img1_regex101.PNG with size=112K
File=img2_shorthand_character_sets.PNG with size=169K
File=img3_greedy_ex.PNG with size=17K
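
When a pattern has several named groups, `Match.groupdict()` returns all of them at once as a dictionary. The sketch below uses a trimmed-down variant of the listing pattern on a single line; `entry_pattern` and `fields` are illustrative names, not part of the original notebook.

```python
import re

line = '-rw-r--r-- 1 root root 112K Feb 19 17:28 img1_regex101.PNG'

# A trimmed-down variant of the listing pattern, for illustration only.
entry_pattern = re.compile(
    r'(?P<filesize>\d+[KMG]?)\s+'
    r'(?P<month>[A-Za-z]{3})\s+'
    r'(?P<day>\d{2})\s+'
    r'(?P<time>\d{2}:\d{2})\s+'
    r'(?P<filename>.+)'
)

entry = entry_pattern.search(line)
fields = entry.groupdict()  # all named groups as a dict
print(fields['filename'])   # img1_regex101.PNG
print(fields['filesize'])   # 112K
```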


<a id='sect3'></a>
## <font color='darkblue'>Exercises</font> ([back](#sect0))

## <font color='darkblue'>Supplement</font>
* [Google for Education - Python Regular Expressions](https://developers.google.com/edu/python/regular-expressions)
* [Python 正則表達式 – re模組與正則表示式語法介紹](https://pyecontech.com/2021/10/22/python-%E6%AD%A3%E5%89%87%E8%A1%A8%E9%81%94%E5%BC%8F-re%E6%A8%A1%E7%B5%84%E8%88%87%E6%AD%A3%E5%89%87%E8%A1%A8%E7%A4%BA%E5%BC%8F%E8%AA%9E%E6%B3%95%E4%BB%8B%E7%B4%B9/)
* [給自己的Python小筆記 — 強大的數據處理工具 — 正則表達式 — Regular Expression — regex詳細教學](https://chwang12341.medium.com/%E7%B5%A6%E8%87%AA%E5%B7%B1%E7%9A%84python%E5%B0%8F%E7%AD%86%E8%A8%98-%E5%BC%B7%E5%A4%A7%E7%9A%84%E6%95%B8%E6%93%9A%E8%99%95%E7%90%86%E5%B7%A5%E5%85%B7-%E6%AD%A3%E5%89%87%E8%A1%A8%E9%81%94%E5%BC%8F-regular-expression-regex%E8%A9%B3%E7%B4%B0%E6%95%99%E5%AD%B8-a5d20341a0b2)
* [LinkedIn Learning - Learning Regular Expressions](https://www.linkedin.com/learning/learning-regular-expressions-2/what-are-regular-expressions?autoAdvance=true&autoSkip=false&autoplay=true&resume=false&u=56685617)
* [RealPython - Regular Expressions: Regexes in Python (Part 1)](https://realpython.com/regex-python/)