<h1>Regular Expressions #2</h1>

<quote>https://stackoverflow.com/questions/52290219/how-to-increase-the-font-size-of-the-markdown-table-in-jupyter-notebook</quote>

In [1]:
%%HTML

<style>
td,th {
  font-size: 18px
}
table {
    float: left;
}
</style>

<h2>Character class symbols used with findall</h2>

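| Symbol | Meaning | Pattern | Example match |
| :--------: | :-------: | :-------: | :-------: |
| \d  | A digit, 0-9 | a\dc | a2c |
| \D | Any non-digit character | a\Dd | azd |
| \s | A whitespace character (space, tab, newline) | a\sd | a d |
| \S | Any non-whitespace character | a\Sd | axd |
| \w | A word character: a-z, A-Z, 0-9, or _ | a\wd | a0d |
| \W | Any non-word character (not a-z, A-Z, 0-9, or _) | a\Wd | a d |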
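
As a quick check of the table, the sketch below runs each pattern against one string it should accept and one it should reject. The `cases` list and the sample strings are illustrative choices, not part of the original notebook.

```python
import re

# (pattern, text the pattern should match, text it should reject)
cases = [
    (r'a\dc', 'a2c', 'abc'),   # \d needs a digit in the middle
    (r'a\Dd', 'azd', 'a1d'),   # \D rejects a digit
    (r'a\sd', 'a d', 'a_d'),   # \s needs whitespace
    (r'a\Sd', 'axd', 'a d'),   # \S rejects whitespace
    (r'a\wd', 'a_d', 'a-d'),   # \w accepts letters, digits and underscore
    (r'a\Wd', 'a d', 'a0d'),   # \W rejects word characters
]

results = []
for pattern, good, bad in cases:
    matched_good = re.fullmatch(pattern, good) is not None
    matched_bad = re.fullmatch(pattern, bad) is not None
    results.append((pattern, matched_good, matched_bad))
    print(pattern, matched_good, matched_bad)
```

Every row should print `True False`: the first string matches and the second does not.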

<h2>Character sets and counted repetition</h2>

| Symbol | Meaning | Pattern | Example matches |
| :--------: | :-------: | :-------: | :-------: |
| [a-z]  | Any one of the 26 lowercase letters a to z | a[b-c]d | <ul><li>abd</li><li>acd</li></ul> |
| [^b-e] | Any one character except b to e | a[^b-e]d | <ul><li>afd</li><li>axd</li></ul> |
| {3} | Exactly 3 repetitions of the preceding item | [A-C]{3} | <ul><li>ABC</li><li>CCA</li></ul> |
| {1,3} | 1 to 3 repetitions of the preceding item | [A-C]{1,3} | <ul><li>A</li><li>AB</li><li>ABC</li></ul> |
| {4,} | 4 or more repetitions of the preceding item | [A-C]{4,} | <ul><li>ABCC</li><li>ACCCB</li><li>AAAA</li></ul> |
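
The rows above can be checked with `re.fullmatch`, which succeeds only when the whole string fits the pattern. This is a minimal sketch; the helper name `fully_matches` and the rejected strings are made up for illustration.

```python
import re

def fully_matches(pattern, text):
    """Return True when the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

checks = {
    ('a[b-c]d', 'abd'): fully_matches(r'a[b-c]d', 'abd'),           # b is inside b-c
    ('a[^b-e]d', 'afd'): fully_matches(r'a[^b-e]d', 'afd'),         # f is outside b-e
    ('a[^b-e]d', 'acd'): fully_matches(r'a[^b-e]d', 'acd'),         # c is excluded
    ('[A-C]{3}', 'ABC'): fully_matches(r'[A-C]{3}', 'ABC'),         # exactly three
    ('[A-C]{3}', 'AB'): fully_matches(r'[A-C]{3}', 'AB'),           # too short
    ('[A-C]{1,3}', 'ABCA'): fully_matches(r'[A-C]{1,3}', 'ABCA'),   # too long
    ('[A-C]{4,}', 'ACCCB'): fully_matches(r'[A-C]{4,}', 'ACCCB'),   # four or more
}

for (pattern, text), outcome in checks.items():
    print(pattern, text, outcome)
```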

<h2>Repetition quantifiers</h2>

| Symbol | Meaning | Pattern | Example matches |
| :--------: | :-------: | :-------: | :-------: |
| ? | The preceding item appears 0 or 1 times | abcd? | <ul><li>abc</li><li>abcd</li></ul> |
| * | The preceding item appears 0 or more times | abcd* | <ul><li>abc</li><li>abcddd</li></ul> |
| + | The preceding item appears 1 or more times | abcd+ | <ul><li>abcd</li><li>abcddd</li></ul> |
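
To see the three quantifiers side by side, the sketch below tests the same candidate strings against each pattern. The variable names `candidates` and `accepted_by` are illustrative.

```python
import re

candidates = ['abc', 'abcd', 'abcddd']

# For each pattern, collect the candidates that match it in full.
accepted_by = {}
for pattern in (r'abcd?', r'abcd*', r'abcd+'):
    accepted_by[pattern] = [s for s in candidates if re.fullmatch(pattern, s)]
    print(pattern, accepted_by[pattern])
```

Only `*` accepts all three strings: `?` stops at a single `d`, and `+` requires at least one.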

<h2>\b - Whole word only</h2>

In [8]:
text_1 = "The quick brown fox jumps over the lazy dog."
text_2 = "The quick brown fox foxa jumps foxs over xfox the lazy dog."

In [10]:
import re
answer = re.findall(r'fox', text_2)
if answer:
    print(answer)

['fox', 'fox', 'fox', 'fox']


In [9]:
import re
answer = re.findall(r'\bfox\b', text_2)
if answer:
    print(answer)  # only the standalone word "fox"

['fox']


In [11]:
import re
answer = re.findall(r'fox\b', text_2)
if answer:
    print(answer)  # "fox" and the end of "xfox"

['fox', 'fox']


In [12]:
import re
answer = re.findall(r'\bfox', text_2)
if answer:
    print(answer)  # "fox" and the start of "foxa" and "foxs"

['fox', 'fox', 'fox']
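
Wrapping a search term in `\b` on both sides can be turned into a small helper. The function name `count_whole_word` is hypothetical; `re.escape` is used so that characters with special meaning in the term are treated literally.

```python
import re

def count_whole_word(word, text):
    """Count occurrences of word that are bounded by \\b on both sides."""
    pattern = rf'\b{re.escape(word)}\b'
    return len(re.findall(pattern, text))

text_2 = "The quick brown fox foxa jumps foxs over xfox the lazy dog."
fox_count = count_whole_word('fox', text_2)
the_count = count_whole_word('the', text_2)  # "The" differs in case, so it is not counted
print(fox_count, the_count)
```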


<h2>Find all numbers</h2>

In [13]:
import re
text = "Year 2020, 2021 and 2022"
answer = re.findall(r'\d+', text)
if answer:
    print(answer)

['2020', '2021', '2022']
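
`re.findall` always returns strings, so the matches need converting before any arithmetic. A minimal sketch, reusing the same sample text:

```python
import re

text = "Year 2020, 2021 and 2022"

# Convert each matched digit run to an integer.
years = [int(number) for number in re.findall(r'\d+', text)]
span = max(years) - min(years)
print(years, span)
```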


<h2>Extract email addresses</h2>

In [1]:
import re
text = "Please contact us at info@examoke.com or support@example.org"
# Escape the dot before the top-level domain; an unescaped . matches any character.
answer = re.findall(r'[A-Za-z0-9._%+-]+@[a-z]{2,}\.[a-z]{2,}', text)
if answer:
    print(answer)

['info@examoke.com', 'support@example.org']
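
In a regular expression an unescaped `.` matches any character, so a pattern meant to require a literal dot before the top-level domain needs `\.`. The sketch below compares the two forms on a string that is not a valid address; the names `loose` and `strict` and the sample string are illustrative.

```python
import re

loose = r'[a-z]+@[a-z]{2,}.[a-z]{2,}'    # . matches any character
strict = r'[a-z]+@[a-z]{2,}\.[a-z]{2,}'  # \. matches only a literal dot

not_an_email = "user@example_com"
loose_hits = re.findall(loose, not_an_email)
strict_hits = re.findall(strict, not_an_email)
print(loose_hits, strict_hits)
```

The loose pattern accepts the underscore in place of the dot, while the strict one returns an empty list.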


<h2>Find all uppercase words</h2>

In [3]:
import re
text = "This is an EXAMPLE of a Sentence with UPPER case words."
answer = re.findall(r'[A-Z]+', text)
if answer:
    print(answer)  # also returns the capitals that start "This" and "Sentence"

['T', 'EXAMPLE', 'S', 'UPPER']
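
The pattern `[A-Z]+` matches any run of capital letters, including the first letter of a capitalised word such as `This`. If only fully uppercase words are wanted, one option is to add `\b` on both sides, as sketched here:

```python
import re

text = "This is an EXAMPLE of a Sentence with UPPER case words."

# \b on both sides means the whole word must consist of capital letters.
uppercase_words = re.findall(r'\b[A-Z]+\b', text)
print(uppercase_words)
```

A one-letter word such as `I` would still match; `\b[A-Z]{2,}\b` is a possible variant if those should be skipped.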


<h2>Extract dates</h2>

In [7]:
import re
text = "Important dates are 2022/10/23 and 01/01/2023."
answer = re.findall(r'[\d]{2,4}/[\d]{2}/[\d]{2,4}', text)
if answer:
    print(answer)

['2022/10/23', '01/01/2023']
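
Because `\d{2,4}` accepts 2, 3, or 4 digits, the pattern above also matches strings such as `222/10/233`. A stricter sketch, assuming only the two layouts `YYYY/MM/DD` and `DD/MM/YYYY` are wanted, spells out each layout with alternation. The extra bogus date in the sample text is added for illustration.

```python
import re

text = "Important dates are 2022/10/23 and 01/01/2023, not 222/10/233."

loose = r'\d{2,4}/\d{2}/\d{2,4}'
# (?:...) groups the alternatives without capturing, so findall returns whole matches.
strict = r'\b(?:\d{4}/\d{2}/\d{2}|\d{2}/\d{2}/\d{4})\b'

loose_dates = re.findall(loose, text)
strict_dates = re.findall(strict, text)
print(loose_dates)
print(strict_dates)
```

Neither pattern checks that the month or day is in range; that would need a separate validation step.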


<h2>Extract all words from a string</h2>

In [10]:
import re
text = "Hello, how are you?"
answer = re.findall(r'\w+', text)
if answer:
    print(answer)

['Hello', 'how', 'are', 'you']
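
One caveat with `\w+` is that an apostrophe is not a word character, so contractions are split in two. The sketch below shows the effect and one possible workaround, a character set that also allows `'`. The sample sentence is illustrative.

```python
import re

text = "Don't stop, it's fine."

split_words = re.findall(r'\w+', text)       # apostrophes break the words apart
whole_words = re.findall(r"[\w']+", text)    # allow ' inside a word
print(split_words)
print(whole_words)
```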


<h2>Extract text inside parentheses</h2>

In [45]:
import re
text = "Results: (success), (failure), (pending)"
answer = re.findall(r'\(([\w]+)\)', text)
answer2 = re.findall(r'\(([a-zA-Z0-9]+)\)', text)
answer3 = re.findall(r'\([^)]+\)', text)
answer4 = re.findall(r'\(([^)]+)\)', text)
if answer:
    print(answer, answer2, answer3, answer4)

['success', 'failure', 'pending'] ['success', 'failure', 'pending'] ['(success)', '(failure)', '(pending)'] ['success', 'failure', 'pending']
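
When a letter range is written by hand, note that `A-z` is not the same as `A-Za-z`. In ASCII the characters between `Z` and `a` include square brackets, backslash, `^`, `_`, and the backtick, so `[A-z]` accepts them too. A small sketch with an illustrative sample string:

```python
import re

sample = "snake_case^2"

wide_range = re.findall(r'[A-z]+', sample)       # also accepts _ and ^
letters_only = re.findall(r'[A-Za-z]+', sample)  # letters only
print(wide_range)
print(letters_only)
```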


<h2>Extract strings that match a condition</h2>

<h3>Parentheses create a capturing group</h3>

In [53]:
import re
text = "Items, apple-10, banana-20, cherry-30"
answer = re.findall(r'(\w+)-(\d+)', text)
if answer:
    print(answer)

[('apple', '10'), ('banana', '20'), ('cherry', '30')]
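
Since `re.findall` returns one tuple per match when the pattern has several groups, the result unpacks naturally into other structures. A minimal sketch that builds a dictionary of quantities; the name `stock` is illustrative.

```python
import re

text = "Items, apple-10, banana-20, cherry-30"

# Each match is a (name, quantity) tuple; convert the quantity to an integer.
stock = {name: int(quantity) for name, quantity in re.findall(r'(\w+)-(\d+)', text)}
print(stock)
```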


<h2>Match and extract text inside nested quotes</h2>

In [4]:
import re
text = "He said, \"She said, \'Indeed, It\'s ready.\' very quietly.\""
print("Before:", text)
answer = re.findall(r'["\'](.*?)["\']', text)
if answer:
    print("After:", answer)
    #print("type", type(answer))

Before: He said, "She said, 'Indeed, It's ready.' very quietly."
After: ['She said, ', 's ready.']
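
The output above pairs a double quote with a single quote, because `["\']` accepts either character at both ends. One way to require the same quote character on both sides is a backreference: `\1` matches whatever the first group captured. This is a sketch that recovers only the outer quotation; the apostrophe in `It's` still makes the inner single-quoted part ambiguous for a regular expression.

```python
import re

text = "He said, \"She said, 'Indeed, It's ready.' very quietly.\""

# Group 1 captures the opening quote; \1 requires the same character to close it.
pairs = re.findall(r'(["\'])(.*?)\1', text)
quoted = [content for _, content in pairs]
print(quoted)
```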


<h2>Extract phone numbers in specific formats</h2>

In [3]:
import re
text = "Contact: 123-456-7890, 987.654.3210, (123) 456-7890, 123-12345-12345"
print("Before:", text)
answer = re.findall(r'\(?[\d]{3}\)?[-.\s]?\d{3}[-.]\d{4}', text)
if answer:
    print("After:", answer)
    #print("type", type(answer))

Before: Contact: 123-456-7890, 987.654.3210, (123) 456-7890, 123-12345-12345
After: ['123-456-7890', '987.654.3210', '(123) 456-7890']
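
Once the numbers are extracted, a common follow-up is to normalise them to digits only so that different layouts compare equal. A minimal sketch using `re.sub` with `\D` (any non-digit), applied to the three matches shown above:

```python
import re

matches = ['123-456-7890', '987.654.3210', '(123) 456-7890']

# Remove every non-digit character from each match.
normalized = [re.sub(r'\D', '', number) for number in matches]
print(normalized)
```

After normalisation the first and third entries are identical, which makes duplicates easy to spot.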


<h2>Extract a specific attribute from HTML</h2>

In [11]:
import re
text = '<img src="image1.png" alt="An image one" /><img alt="An image two" src="image2.png" />'
print("Before:", text)
answer1 = re.findall(r'img[\w=\"\s]*src=\"([\w\d.]+)', text)
# *? means non-greedy
answer2 = re.findall(r'img\s+[^>]*?src="([^"]+)', text)
if answer1 and answer2:
    print("After (1):", answer1)
    print("After (2):", answer2)

Before: <img src="image1.png" alt="An image one" /><img alt="An image two" src="image2.png" />
After (1): ['image1.png', 'image2.png']
After (2): ['image1.png', 'image2.png']
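
Regular expressions work for simple, predictable markup like this sample, but they can break on unusual attribute order, quoting, or spacing. As an alternative sketch, the standard library's `html.parser.HTMLParser` can collect the same values; the class name `ImgSrcCollector` is made up for this example.

```python
from html.parser import HTMLParser

class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; self-closing tags also arrive here.
        if tag == 'img':
            for name, value in attrs:
                if name == 'src':
                    self.sources.append(value)

text = '<img src="image1.png" alt="An image one" /><img alt="An image two" src="image2.png" />'
collector = ImgSrcCollector()
collector.feed(text)
print(collector.sources)
```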
