# Regex in Chapter 7 of Automate the Boring Stuff with Python

In [1]:
import re

## Creating Regex Objects

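- Import the regular expression module with import re
- Call re.compile() to create a Regex object
- Pass the string you want to search to the Regex object's search() method; it returns a Match object, or None if nothing matches
- Call the Match object's group() method to get the text that actually matched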

In [2]:
phoneNumRegex = re.compile(r'\d{3}-\d{3}-\d{4}')
mo = phoneNumRegex.search('My number is 415-555-4242.')
print('Phone number found: ' + mo.group())

Phone number found: 415-555-4242


## Grouping with Parentheses

In [3]:
phoneNumRegex = re.compile(r'(\d{3})-(\d{3}-\d{4})')
mo = phoneNumRegex.search('My number is 415-555-4242.')
print('Phone number found: ' + mo.group(1))
print('Phone number found: ' + mo.group(2))
print('Phone number found: ' + mo.group(0))
print('Phone number found: ' + mo.group())

Phone number found: 415
Phone number found: 555-4242
Phone number found: 415-555-4242
Phone number found: 415-555-4242


In [4]:
print(mo.groups())
areaCode, mainNumber = mo.groups()
print(areaCode)
print(mainNumber)

('415', '555-4242')
415
555-4242


In [5]:
phoneNumRegex = re.compile(r'(\(\d{3}\))(\d{3}-\d{4})')
mo = phoneNumRegex.search('My number is (415)555-4242.')
print('Phone number found: ' + mo.group(1))
print('Phone number found: ' + mo.group(2))

Phone number found: (415)
Phone number found: 555-4242


## Matching Multiple Groups with the Pipe

In [6]:
heroRegex = re.compile(r'Batman|Tina Fey')
mo1 = heroRegex.search('Batman and Tina Fey.')
print(mo1.group())
mo2 = heroRegex.search('Tina Fey and Batman.')
print(mo2.group())

Batman
Tina Fey


In [7]:
batRegex = re.compile(r'Bat(man|mobile|copter|bat)')
mo = batRegex.search('Batmobile lost a wheel')
print(mo.group())

Batmobile
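
As a small follow-up sketch: because the alternatives sit inside parentheses, the part that actually matched can be read back with group(1). The names full_match and suffix are illustrative only; the pattern is the one from the cell above.

```python
import re

# Same pattern as above: the prefix 'Bat' followed by one of four alternatives.
batRegex = re.compile(r'Bat(man|mobile|copter|bat)')
mo = batRegex.search('Batmobile lost a wheel')

full_match = mo.group()   # the whole matched text
suffix = mo.group(1)      # only the alternative inside the parentheses
print(full_match)  # Batmobile
print(suffix)      # mobile
```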


## Matching the Preceding Group Zero or One Time with ?

In [16]:
batRegex = re.compile(r'Bat(wo)?man')
mo1 = batRegex.search('The Adventures of Batman')
print(mo1.group())
mo2 = batRegex.search('The Adventures of Batwoman')
print(mo2.group())

Batman
Batwoman


In [17]:
phoneNumRegex = re.compile(r'(\d{3}-)?\d{3}-\d{4}')
mo1 = phoneNumRegex.search('My number is 415-555-4242.')
print('Phone number found: ' + mo1.group())
mo2 = phoneNumRegex.search('My number is 555-4242.')
print('Phone number found: ' + mo2.group())

Phone number found: 415-555-4242
Phone number found: 555-4242
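
One detail worth checking when a group is optional: if the group did not take part in the match, group(1) returns None rather than an empty string. The sketch below reuses the pattern above; the names with_area, without_area, and area_code are illustrative only.

```python
import re

phoneNumRegex = re.compile(r'(\d{3}-)?\d{3}-\d{4}')

with_area = phoneNumRegex.search('My number is 415-555-4242.')
without_area = phoneNumRegex.search('My number is 555-4242.')

print(with_area.group(1))     # 415-
print(without_area.group(1))  # None

# Guard against None before concatenating with a string.
area_code = without_area.group(1) or '(no area code)'
print('Area code: ' + area_code)
```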


## Matching Zero or More Times with *

In [19]:
batRegex = re.compile(r'Bat(wo)*man')
mo1 = batRegex.search('The Adventures of Batman')
print(mo1.group())
mo2 = batRegex.search('The Adventures of Batwoman')
print(mo2.group())
mo3 = batRegex.search('The Adventures of Batwowowowoman')
print(mo3.group())

Batman
Batwoman
Batwowowowoman


## Matching One or More Times with +

In [20]:
batRegex = re.compile(r'Bat(wo)+man')
mo1 = batRegex.search('The Adventures of Batwoman')
print(mo1.group())
mo2 = batRegex.search('The Adventures of Batwowowowowoman')
print(mo2.group())
mo3 = batRegex.search('The Adventures of Batman')
print(mo3 is None)

Batwoman
Batwowowowowoman
True


## Matching a Specific Number of Repetitions with {}

In [21]:
haRegex = re.compile(r'(Ha){3}')
mo1 = haRegex.search('HaHaHa')
print(mo1.group())
mo2 = haRegex.search('Ha')
print(mo2 is None)

HaHaHa
True


## Greedy and Non-greedy Matching

In [23]:
greedyHaRegex = re.compile(r'(Ha){3,5}')
mo1 = greedyHaRegex.search('HaHaHaHaHa')
print(mo1.group())
nongreedHaRegex = re.compile(r'(Ha){3,5}?')
mo2 = nongreedHaRegex.search('HaHaHaHaHa')
print(mo2.group())

HaHaHaHaHa
HaHaHa


## The findall() Method

In [25]:
phoneNumRegex = re.compile(r'\d{3}-\d{3}-\d{4}')
mo = phoneNumRegex.search('Cell: 415-555-9999 Work: 212-555-0000')
print(mo.group())
phoneNumRegex = re.compile(r'\d{3}-\d{3}-\d{4}')
print(phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000'))
phoneNumRegex = re.compile(r'(\d{3})-(\d{3})-(\d{4})')
print(phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000'))

415-555-9999
['415-555-9999', '212-555-0000']
[('415', '555', '9999'), ('212', '555', '0000')]
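
The cell above covers the zero-group and multi-group cases. As a hedged sketch of the case in between: with exactly one group, findall() returns a list of strings holding just that group, and finditer() can be used when the full Match objects are needed. The names text, areaCodeRegex, area_codes, and full_numbers are illustrative only.

```python
import re

text = 'Cell: 415-555-9999 Work: 212-555-0000'

# One group: findall() returns a list of strings containing only that group.
areaCodeRegex = re.compile(r'(\d{3})-\d{3}-\d{4}')
area_codes = areaCodeRegex.findall(text)
print(area_codes)  # ['415', '212']

# finditer() yields Match objects, so the full match is still available.
full_numbers = [mo.group() for mo in areaCodeRegex.finditer(text)]
print(full_numbers)  # ['415-555-9999', '212-555-0000']
```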


## Character Classes

In [26]:
xmasRegex = re.compile(r'\d+\s\w+')
print(xmasRegex.findall('12 drummers, 11 pipers, 10 lords, 9 ladies, 8 maids, 7 swans, 6 geese, '
                        '5 rings, 4 birds, 3 hens, 2 doves, 1 partridge'))

['12 drummers', '11 pipers', '10 lords', '9 ladies', '8 maids', '7 swans', '6 geese', '5 rings', '4 birds', '3 hens', '2 doves', '1 partridge']


## Making Your Own Character Classes

In [28]:
vowelRegex = re.compile(r'[aeiouAEIOU]')
print(vowelRegex.findall('RoboCop eats baby food. BABY FOOD.'))
vowelRegex = re.compile(r'[^aeiouAEIOU]')  # a negated character class matches every character not in the class
print(vowelRegex.findall('RoboCop eats baby food. BABY FOOD.'))

['o', 'o', 'o', 'e', 'a', 'a', 'o', 'o', 'A', 'O', 'O']
['R', 'b', 'C', 'p', ' ', 't', 's', ' ', 'b', 'b', 'y', ' ', 'f', 'd', '.', ' ', 'B', 'B', 'Y', ' ', 'F', 'D', '.']


## Matching the Start and End with ^ and $

In [30]:
beginsWithHello = re.compile(r'^Hello')
print(beginsWithHello.search('Hello world!'))
print(beginsWithHello.search('He said hello.') is None)


<_sre.SRE_Match object; span=(0, 5), match='Hello'>
True


In [31]:
endsWithNumber = re.compile(r'\d$')
print(endsWithNumber.search('Your number is 42'))
print(endsWithNumber.search('Your number is forty two.') is None)

<_sre.SRE_Match object; span=(16, 17), match='2'>
True


In [32]:
wholeStringIsNum = re.compile(r'^\d+$')
print(wholeStringIsNum.search('1234567890'))
print(wholeStringIsNum.search('12345xyz67890'))
print(wholeStringIsNum.search('12 34567890'))

<_sre.SRE_Match object; span=(0, 10), match='1234567890'>
None
None
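
A related edge case, offered as a sketch rather than something this notebook covers: in Python, $ also matches just before a single trailing newline, so ^\d+$ accepts '12345\n'. When the entire string must match, re.fullmatch() is one alternative. The names trailing_newline, strict, and clean are illustrative only.

```python
import re

wholeStringIsNum = re.compile(r'^\d+$')

# '$' also matches just before a single trailing newline.
trailing_newline = wholeStringIsNum.search('12345\n')
print(trailing_newline is not None)  # True

# fullmatch() requires the whole string to match, newline included.
strict = re.fullmatch(r'\d+', '12345\n')
print(strict is None)  # True

clean = re.fullmatch(r'\d+', '12345')
print(clean.group())  # 12345
```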


## The Wildcard Character

In [33]:
atRegex = re.compile(r'.at')
print(atRegex.findall('The cat in the hat sat on the flat mat.'))

['cat', 'hat', 'sat', 'lat', 'mat']


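## Matching Everything Except Newlines with Dot-Star: . matches any single character except a newline, and * repeats the preceding item zero or more times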

In [35]:
nameRegex = re.compile(r'First Name: (.*) Last Name: (.*)')
mo = nameRegex.search('First Name: Al Last Name: Sweigart')
print(mo.group())
print(mo.group(0))
print(mo.group(1))
print(mo.group(2))

First Name: Al Last Name: Sweigart
First Name: Al Last Name: Sweigart
Al
Sweigart


- Non-greedy mode

In [36]:
nongreedHaRegex = re.compile(r'<.*?>')
mo = nongreedHaRegex.search('<To serve man> for dinner.>')
print(mo.group())
greedyHaRegex = re.compile(r'<.*>')
mo = greedyHaRegex.search('<To serve man> for dinner.>')
print(mo.group())

<To serve man>
<To serve man> for dinner.>


- Matching newlines with the dot character: passing re.DOTALL as the second argument to re.compile() makes the dot match every character, including newlines

In [37]:
noNewlineRegex = re.compile('.*')
print(noNewlineRegex.search('Serve the public trust.\nProtect the innocent.\nUphold the law').group())
NewlineRegex = re.compile('.*', re.DOTALL)
print(NewlineRegex.search('Serve the public trust.\nProtect the innocent.\nUphold the law').group())

Serve the public trust.
Serve the public trust.
Protect the innocent.
Uphold the law


## Review of Regex Symbols

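- ? matches zero or one of the preceding group
- * matches zero or more of the preceding group
- + matches one or more of the preceding group
- {n} matches exactly n of the preceding group
- {n,} matches n or more of the preceding group
- {,m} matches zero to m of the preceding group
- {n,m} matches at least n and at most m of the preceding group
- {n,m}? or *? or +? performs a non-greedy match of the preceding group
- ^spam means the string must begin with spam
- spam$ means the string must end with spam
- . matches any character except a newline
- \d, \w, and \s match a digit, a word character, and a whitespace character, respectively
- \D, \W, and \S match any character that is not a digit, not a word character, and not a whitespace character, respectively
- [abc] matches any one of the characters inside the brackets
- [^abc] matches any one character that is not inside the brackets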
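
A few items in this list ({n,}, {,m}, and the uppercase classes \D, \W, \S) are not exercised in the cells above. The following is a minimal sketch of them; the sample string and all variable names are made up for illustration.

```python
import re

# {n,}: at least n repetitions (greedy, so it takes all five here).
at_least_two = re.search(r'(Ha){2,}', 'HaHaHaHaHa').group()
print(at_least_two)  # HaHaHaHaHa

# {,m}: from zero up to m repetitions.
at_most_two = re.search(r'(Ha){,2}', 'HaHaHaHaHa').group()
print(at_most_two)  # HaHa

# \D, \W, \S are the complements of \d, \w, \s.
sample = 'Room 42, floor 7!'
non_digits = re.findall(r'\D+', sample)
non_word = re.findall(r'\W+', sample)
non_space = re.findall(r'\S+', sample)
print(non_digits)  # ['Room ', ', floor ', '!']
print(non_word)    # [' ', ', ', ' ', '!']
print(non_space)   # ['Room', '42,', 'floor', '7!']
```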