## Python RegEx

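A RegEx, or regular expression, is a sequence of characters that forms a search pattern.<br>
RegEx can be used to check whether a string contains the specified search pattern.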
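
As a minimal sketch of that check: `re.search` returns a Match object when the pattern occurs anywhere in the string and `None` otherwise, so its result can be tested directly. The helper name `contains_pattern` and the sample strings are assumptions made for illustration.

```python
import re

def contains_pattern(pattern, text):
    """Return True if `pattern` matches anywhere in `text` (hypothetical helper)."""
    return re.search(pattern, text) is not None

has_rain = contains_pattern(r"rain", "The rain in Spain")
has_snow = contains_pattern(r"snow", "The rain in Spain")

print(has_rain)  # True
print(has_snow)  # False
```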

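### RegEx Functions

|Function|Description|
|----|----|
|findall|Returns a list containing all matches|
|search|Returns a Match object if there is a match anywhere in the string|
|split|Returns a list where the string has been split at each match|
|sub|Replaces one or more matches with a string|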

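### Metacharacters

Metacharacters are characters with a special meaning.

|Character|Description|
|----|----|
|[]|A set of characters|
|\\|Signals a special sequence (can also be used to escape special characters)|
|.|Any character (except newline)|
|^|Starts with|
|$|Ends with|
|*|Zero or more occurrences|
|+|One or more occurrences|
|{}|Exactly the specified number of occurrences|
|\||Either or|
|()|Capture and group|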

#### Metacharacter []

In [12]:
import re

str = "The rain in Spain"
#Find all lower case characters alphabetically between "a" and "m":
x = re.findall("[a-m]", str)

print(x)

['h', 'e', 'a', 'i', 'i', 'a', 'i']


#### Metacharacter \\

In [14]:
import re

str = "That will be 59 dollars"
#Find all digit characters:
x = re.findall(r"\d", str)

print(x)

['5', '9']
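
The backslash can also escape a metacharacter so that it is matched literally. The following is a small sketch of that second use; the sample strings are assumptions, and `re.escape` is the standard-library helper that escapes every special character in a string.

```python
import re

price = "Total: 3.50 USD (approx.)"

# Unescaped "." matches any character, so "3.5" also matches "3x5".
any_char = re.findall(r"3.5", "3.5 3x5")

# Escaped "\." matches only a literal dot.
literal_dot = re.findall(r"3\.5", "3.5 3x5")

# re.escape() escapes all metacharacters in a string for literal matching.
literal_paren = re.findall(re.escape("(approx.)"), price)

print(any_char)       # ['3.5', '3x5']
print(literal_dot)    # ['3.5']
print(literal_paren)  # ['(approx.)']
```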


#### Metacharacter .

In [16]:
import re

str = "hello world"
#Search for a sequence that starts with "he", followed by two (any) characters, and an "o":
x = re.findall("he..o", str)

print(x)

['hello']


#### Metacharacter ^

In [23]:
import re

str = "hello world"
#Check if the string starts with 'hello':
x = re.findall("^hello", str)
print(x)

if (x):
    print("Yes, the string starts with 'hello'")
else:
    print("No match")
    
if x == ['hello']:
    print("Yes, the string starts with 'hello'")
else:
    print("No match")

['hello']
Yes, the string starts with 'hello'
Yes, the string starts with 'hello'


#### Metacharacter $

In [24]:
import re

str = "hello world"
#Check if the string ends with 'world':
x = re.findall("world$", str)
print(x)

if (x):
    print("Yes, the string ends with 'world'")
else:
    print("No match")

if x == ['world']:
    print("Yes, the string ends with 'world'")
else:
    print("No match")

['world']
Yes, the string ends with 'world'
Yes, the string ends with 'world'


#### Metacharacter *

In [25]:
import re

str = "The rain in Spain falls mainly in the plain!"
#Check if the string contains "ai" followed by 0 or more "x" characters:
x = re.findall("aix*", str)
print(x)

if (x):
    print("Yes, there is at least one match!")
else:
    print("No match")


['ai', 'ai', 'ai', 'ai']
Yes, there is at least one match!


#### Metacharacter +

In [26]:
import re

str = "The rain in Spain falls mainly in the plain!"
#Check if the string contains "ai" followed by 1 or more "x" characters:
x = re.findall("aix+", str)
print(x)

if (x):
    print("Yes, there is at least one match!")
else:
    print("No match")

[]
No match


#### Metacharacter {}

In [36]:
import re

str = "The rain in Spain falls mainly in the plain!"
#Check if the string contains "a" followed by exactly one "l" (x) or exactly two "l" characters (y):
x = re.findall("al{1}", str)
y = re.findall("al{2}", str)
print(x)
print(y)

if (x):
    print("Yes, there is at least one match!")
else:
    print("No match")
    
if (y):
    print("Yes, there is at least one match!")
else:
    print("No match")    

['al']
['all']
Yes, there is at least one match!
Yes, there is at least one match!
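
Besides an exact count, Python's `re` module also accepts a range inside the braces: `{m,n}` matches between m and n repetitions. The sketch below contrasts the two forms; the sample string is an assumption chosen for illustration, and `\b` keeps each match aligned to a whole number.

```python
import re

text = "1 22 333 4444"

# Exactly two digits, bounded by word boundaries on both sides.
exactly_two = re.findall(r"\b\d{2}\b", text)

# Between two and three digits.
two_to_three = re.findall(r"\b\d{2,3}\b", text)

print(exactly_two)   # ['22']
print(two_to_three)  # ['22', '333']
```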


#### Metacharacter |

In [40]:
import re

str = "The rain in Spain falls mainly in the plain!"
#Check if the string contains either "falls" or "mainly":
x = re.findall("falls|mainly", str)
print(x)

if (x):
    print("Yes, there is at least one match!")
else:
    print("No match")

['falls', 'mainly']
Yes, there is at least one match!
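
#### Metacharacter ()

The metacharacter table also lists `()` for capturing and grouping. As a hedged sketch with assumed sample strings: a group lets a quantifier apply to a whole sequence instead of a single character, and when the pattern contains a capturing group, `findall` returns the captured part rather than the full match.

```python
import re

text = "The rain in Spain falls mainly in the plain!"

# Without a group, "+" applies only to the preceding character "n".
no_group = re.findall(r"ain+", "ainain ainn")

# With a group, "+" repeats the whole sequence "ain".
grouped = re.search(r"(ain)+", "ainain ainn").group()

# With one capturing group, findall() returns only the captured text:
# here, the word character that precedes each "ain".
captured = re.findall(r"(\w)ain", text)

print(no_group)  # ['ain', 'ain', 'ainn']
print(grouped)   # ainain
print(captured)  # ['r', 'p', 'm', 'l']
```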


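### Special Sequences

A special sequence is a \ followed by one of the characters in the table below, and has a special meaning.

|Character|Description|
|----|----|
|\A|Returns a match if the specified characters are at the beginning of the string|
|\b|Returns a match where the specified characters are at the beginning or at the end of a word|
|\B|Returns a match where the specified characters are present, but NOT at the beginning (or at the end) of a word|
|\d|Returns a match where the string contains digits (numbers from 0-9)|
|\D|Returns a match where the string contains a character that is NOT a digit|
|\s|Returns a match where the string contains a white-space character|
|\S|Returns a match where the string contains a character that is NOT white space|
|\w|Returns a match where the string contains any word character (the letters a-z and A-Z, the digits 0-9, and the underscore _; for str patterns Python also counts Unicode letters and digits unless re.ASCII is set)|
|\W|Returns a match where the string contains a character that is NOT a word character|
|\Z|Returns a match if the specified characters are at the end of the string|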

#### Special sequence \A

In [42]:
import re

str = "The rain in Spain"
#Check if the string starts with "The":
x = re.findall(r"\AThe", str)
print(x)

if (x):
    print("Yes, there is a match!")
else:
    print("No match")

['The']
Yes, there is a match!


#### Special sequence \b

In [46]:
import re

str = "The rain in Spain"
#Check if "ain" is present at the beginning of a WORD:
x = re.findall(r"\bain", str)
print(x)

if (x):
    print("Yes, there is at least one match!")
else:
    print("No match")

[]
No match


In [48]:
import re

str = "The rain in Spain"
#Check if "ain" is present at the end of a WORD:
x = re.findall(r"ain\b", str)
print(x)

if (x):
    print("Yes, there is at least one match!")
else:
    print("No match")

['ain', 'ain']
Yes, there is at least one match!
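
The `r` prefix in the two examples above matters. A short sketch of the difference, using the same sample string: in an ordinary string literal Python itself interprets `"\b"` as the backspace character, so the word-boundary sequence never reaches the regex engine.

```python
import re

text = "The rain in Spain"

# In a normal string literal, "\b" is the backspace character (\x08),
# so this pattern looks for a backspace followed by "The".
plain = re.findall("\bThe", text)

# In a raw string the backslash is passed through unchanged,
# so r"\b" is the word-boundary sequence.
raw = re.findall(r"\bThe", text)

print(plain)  # []
print(raw)    # ['The']
```

Writing every pattern as a raw string avoids this class of mistake.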


#### Special sequence \B

In [49]:
import re

str = "The rain in Spain"
#Check if "ain" is present, but NOT at the end of a word:
x = re.findall(r"ain\B", str)
print(x)

if (x):
    print("Yes, there is at least one match!")
else:
    print("No match")

[]
No match


In [53]:
import re

str = "The rain in Spain"
#Check if "ain" is present, but NOT at the start of a word:
x = re.findall(r"\Bain", str)
print(x)

if (x):
    print("Yes, there is at least one match!")
else:
    print("No match")

['ain', 'ain']
Yes, there is at least one match!


#### Special sequence \d

In [55]:
import re

str = "The rain in Spain"
#Check if the string contains any digits (numbers from 0-9):
x = re.findall(r"\d", str)
print(x)

if (x):
    print("Yes, there is at least one match!")
else:
    print("No match")

[]
No match


#### Special sequence \D

In [56]:
import re

str = "The rain in Spain"
#Return a match at every non-digit character:
x = re.findall(r"\D", str)
print(x)

if (x):
    print("Yes, there is at least one match!")
else:
    print("No match")

['T', 'h', 'e', ' ', 'r', 'a', 'i', 'n', ' ', 'i', 'n', ' ', 'S', 'p', 'a', 'i', 'n']
Yes, there is at least one match!


#### Special sequence \s

In [57]:
import re

str = "The rain in Spain"
#Return a match at every white-space character:
x = re.findall(r"\s", str)
print(x)

if (x):
    print("Yes, there is at least one match!")
else:
    print("No match")

[' ', ' ', ' ']
Yes, there is at least one match!


#### Special sequence \S

In [58]:
import re

str = "The rain in Spain"
#Return a match at every NON white-space character:
x = re.findall(r"\S", str)
print(x)

if (x):
    print("Yes, there is at least one match!")
else:
    print("No match")

['T', 'h', 'e', 'r', 'a', 'i', 'n', 'i', 'n', 'S', 'p', 'a', 'i', 'n']
Yes, there is at least one match!


#### Special sequence \w

In [59]:
import re

str = "The rain in Spain"
#Return a match at every word character (characters from a to Z, digits from 0-9, and the underscore _ character):
x = re.findall(r"\w", str)
print(x)

if (x):
    print("Yes, there is at least one match!")
else:
    print("No match")

['T', 'h', 'e', 'r', 'a', 'i', 'n', 'i', 'n', 'S', 'p', 'a', 'i', 'n']
Yes, there is at least one match!
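
For `str` patterns, Python's `\w` is not limited to a-z, A-Z, 0-9 and `_`: it also matches Unicode letters and digits by default. The sketch below, with an assumed sample string, shows how the `re.ASCII` flag restricts it to the ASCII range.

```python
import re

# "\u00e9" is the letter e with an acute accent, "\u00f1" is n with a tilde.
text = "caf\u00e9_1 \u00f1!"

# Default behaviour: accented letters count as word characters.
unicode_words = re.findall(r"\w", text)

# With re.ASCII, only [a-zA-Z0-9_] count as word characters.
ascii_words = re.findall(r"\w", text, re.ASCII)

print(unicode_words)
print(ascii_words)  # ['c', 'a', 'f', '_', '1']
```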


#### Special sequence \W

In [61]:
import re

str = "The rain in Spain!"
#Return a match at every NON word character (anything other than letters, digits, and the underscore, such as "!", "?", or white space):
x = re.findall(r"\W", str)
print(x)

if (x):
    print("Yes, there is at least one match!")
else:
    print("No match") 

[' ', ' ', ' ', '!']
Yes, there is at least one match!


#### Special sequence \Z

In [63]:
import re

str = "The rain in Spain"
#Check if the string ends with "Spain":
x = re.findall(r"Spain\Z", str)
print(x)

if (x):
    print("Yes, there is a match!")
else:
    print("No match")

['Spain']
Yes, there is a match!
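
`\Z` and the `$` metacharacter usually agree, but they differ when the string ends with a newline. A brief sketch, assuming a sample string with a trailing `\n`:

```python
import re

text = "The rain in Spain\n"

# "$" matches at the end of the string and also just before a trailing newline.
dollar = re.findall(r"Spain$", text)

# "\Z" matches only at the very end of the string.
slash_z = re.findall(r"Spain\Z", text)

print(dollar)   # ['Spain']
print(slash_z)  # []
```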


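### Sets

A set is a group of characters inside a pair of square brackets [] with a special meaning.

|Set|Description|
|----|----|
|[arn]|Returns a match where one of the specified characters (a, r, or n) is present|
|[a-n]|Returns a match for any lowercase character alphabetically between a and n|
|[^arn]|Returns a match for any character EXCEPT a, r, and n|
|[0123]|Returns a match where any of the specified digits (0, 1, 2, or 3) is present|
|[0-9]|Returns a match for any digit between 0 and 9|
|[0-5][0-9]|Returns a match for any two-digit number from 00 to 59|
|[a-zA-Z]|Returns a match for any character alphabetically between a and z, lowercase or uppercase|
|[+]|In sets, +, *, ., \|, (), $, {} have no special meaning, so [+] means: return a match for any + character in the string|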

#### Set [arn]

In [64]:
import re

str = "The rain in Spain"
#Check if the string has any a, r, or n characters:
x = re.findall("[arn]", str)
print(x)

if (x):
    print("Yes, there is at least one match!")
else:
    print("No match")

['r', 'a', 'n', 'n', 'a', 'n']
Yes, there is at least one match!


#### Set [a-n]

In [65]:
import re

str = "The rain in Spain"
#Check if the string has any characters between a and n:
x = re.findall("[a-n]", str)
print(x)

if (x):
    print("Yes, there is at least one match!")
else:
    print("No match")

['h', 'e', 'a', 'i', 'n', 'i', 'n', 'a', 'i', 'n']
Yes, there is at least one match!


#### Set [^arn]

In [66]:
import re

str = "The rain in Spain"
#Check if the string has other characters than a, r, or n:
x = re.findall("[^arn]", str)
print(x)

if (x):
    print("Yes, there is at least one match!")
else:
    print("No match")

['T', 'h', 'e', ' ', 'i', ' ', 'i', ' ', 'S', 'p', 'i']
Yes, there is at least one match!


#### Set [0123]

In [67]:
import re

str = "The rain in Spain"
#Check if the string has any 0, 1, 2, or 3 digits:
x = re.findall("[0123]", str)
print(x)

if (x):
    print("Yes, there is at least one match!")
else:
    print("No match")

[]
No match


#### Set [0-9]

In [68]:
import re

str = "8 times before 11:45 AM"
#Check if the string has any digits:
x = re.findall("[0-9]", str)
print(x)

if (x):
    print("Yes, there is at least one match!")
else:
    print("No match")

['8', '1', '1', '4', '5']
Yes, there is at least one match!


#### Set [0-5][0-9]

In [69]:
import re

str = "8 times before 11:45 AM"
#Check if the string has any two-digit numbers, from 00 to 59:
x = re.findall("[0-5][0-9]", str)
print(x)

if (x):
    print("Yes, there is at least one match!")
else:
    print("No match")

['11', '45']
Yes, there is at least one match!
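
Sets can be chained to describe longer values. As an illustrative sketch (the pattern and the second sample string are assumptions, not part of the original example), a 24-hour time can be written as hours 00-23, a colon, and minutes 00-59:

```python
import re

text = "8 times before 11:45 AM"

# Hours 00-19 or 20-23, a colon, then minutes 00-59.
time_pattern = r"([01][0-9]|2[0-3]):[0-5][0-9]"

found = re.search(time_pattern, text).group()
invalid = re.search(time_pattern, "25:61")

print(found)    # 11:45
print(invalid)  # None
```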


#### Set [a-zA-Z]

In [70]:
import re

str = "8 times before 11:45 AM"
#Check if the string has any characters from a to z lower case, and A to Z upper case:
x = re.findall("[a-zA-Z]", str)
print(x)

if (x):
    print("Yes, there is at least one match!")
else:
    print("No match")

['t', 'i', 'm', 'e', 's', 'b', 'e', 'f', 'o', 'r', 'e', 'A', 'M']
Yes, there is at least one match!


#### Set [+]

In [72]:
import re

str = "8 times before 11:45 AM|||"
#Check if the string has any | characters (inside a set, | is literal, just like +):
x = re.findall("[|]", str)
print(x)

if (x):
    print("Yes, there is at least one match!")
else:
    print("No match")

['|', '|', '|']
Yes, there is at least one match!
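
The same rule holds for the other metacharacters listed in the table. A small sketch with an assumed sample string: inside a set they are matched literally, while outside a set each one has to be escaped.

```python
import re

text = "1+2*3.0"

# Inside [], "+", "*" and "." are ordinary characters.
symbols = re.findall(r"[+*.]", text)

# Outside a set, the same characters must be escaped to be matched literally.
escaped = re.findall(r"\+|\*|\.", text)

print(symbols)  # ['+', '*', '.']
print(escaped)  # ['+', '*', '.']
```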


### Functions

#### findall() function

The findall() function returns a list containing all matches. If there is no match, it returns an empty list.

In [79]:
import re

str = "China is a great country"
x = re.findall('a', str)
print(x)

['a', 'a', 'a']


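#### search() function

The search() function searches the string for a match and returns a Match object if there is one. If there are several matches, only the first is returned.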

In [86]:
import re

str = "China is a great country"
x = re.search(r"\s", str)

print(x)
print(x.start())

<re.Match object; span=(5, 6), match=' '>
5


#### split() function

The split() function returns a list where the string has been split at each match.

In [88]:
import re

str = "China is a great country"
x = re.split(r"\s", str)
print(x)

['China', 'is', 'a', 'great', 'country']


In [92]:
import re

str = "China is a great country"
x = re.split(r"\s", str, maxsplit = 3)      ### The maxsplit parameter limits the number of splits
print(x)

['China', 'is', 'a', 'great country']


#### sub() function

The sub() function replaces the matches with the text of your choice.

In [94]:
import re

str = "China is a great country"
x = re.sub(r"\s", "_", str)
print(x)

China_is_a_great_country


In [96]:
import re

str = "China is a great country"
x = re.sub(r"\s", "_", str, count = 2)     ### The count parameter limits the number of replacements
print(x)

China_is_a great country


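#### Match object

A Match object is an object containing information about the search and the result.<br>
Note: if there is no match, the value None is returned instead of a Match object.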
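
Because a failed search returns None, calling a method such as `start()` on the result without checking it first raises an `AttributeError`. The sketch below shows one way to guard against that; the helper name `first_digit_position` and the sample strings are assumptions made for illustration.

```python
import re

def first_digit_position(s):
    """Return the index of the first digit in `s`, or None if there is none (hypothetical helper)."""
    match = re.search(r"\d", s)
    if match is None:
        return None
    return match.start()

no_digit = first_digit_position("China is a great country")
with_digit = first_digit_position("Route 66")

print(no_digit)    # None
print(with_digit)  # 6
```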

In [97]:
import re

str = "China is a great country"
x = re.search("a", str)
print(x) # this will print an object

<re.Match object; span=(4, 5), match='a'>


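+ span() returns a tuple containing the start and end positions of the match
+ .string returns the string passed into the function
+ group() returns the part of the string where there was a match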

In [99]:
import re

str = "China is a great country"
x = re.search(r"\bC\w+", str)
print(x.span())

(0, 5)


In [100]:
import re

str = "China is a great country"
x = re.search(r"\bC\w+", str)
print(x.string)

China is a great country


In [101]:
import re

str = "China is a great country"
x = re.search(r"\bC\w+", str)
print(x.group())

China
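
When the pattern contains capturing groups, the same Match methods accept a group number. A hedged sketch, using an assumed pattern with two groups on the same sample string:

```python
import re

text = "China is a great country"
match = re.search(r"(\w+) is a (\w+)", text)

whole = match.group()        # the entire match
first = match.group(1)       # text captured by the first group
second = match.group(2)      # text captured by the second group
both = match.groups()        # all captured groups as a tuple
second_span = match.span(2)  # start and end positions of the second group

print(whole)        # China is a great
print(first)        # China
print(second)       # great
print(both)         # ('China', 'great')
print(second_span)  # (11, 16)
```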
