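## 2.1 Splitting strings on multiple delimiters
str.split() handles only a single fixed delimiter; when the delimiters vary within the string, use re.split().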

In [1]:
import re
line = 'asdf fjdk; afed, fjek,asdf, foo'
# Raw string so that \s reaches the regex engine unchanged
print(re.split(r'[;,\s]\s*', line))

['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
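
It is sometimes useful to keep the delimiters themselves, for example to rebuild the string later. The following is a minimal sketch of how a capturing group changes what re.split() returns, using the same sample line as above; the variable names `fields` and `plain` are chosen here for illustration.

```python
import re

line = 'asdf fjdk; afed, fjek,asdf, foo'

# A capturing group (...) makes re.split() return the matched delimiters too.
fields = re.split(r'(;|,|\s)\s*', line)
print(fields)

# A non-capturing group (?:...) groups the alternatives without keeping them.
plain = re.split(r'(?:;|,|\s)\s*', line)
print(plain)

# The values sit at even indexes, the delimiters at odd indexes.
values = fields[::2]
delimiters = fields[1::2]
```

With a capturing group the result alternates between values and delimiters, so slicing with a step of 2 separates the two.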


## 2.3 Matching strings with shell wildcard patterns
For example, matching strings against a pattern such as `*.py`.

In [17]:
from fnmatch import fnmatch, fnmatchcase

print(fnmatch('foo.txt', '*.txt'))
print(fnmatch('foo.txt', '??o.txt'))
print(fnmatch('zhu123.txt', 'zhu[0-9]*.txt'))
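# fnmatch() follows the case-sensitivity rules of the underlying operating system;
# fnmatchcase() always compares case-sensitively.
# The first result below is True on Windows and False on Linux and macOS.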

print(fnmatch('asd.txt', '*.TXT'))
print(fnmatchcase('asd.txt', '*.TXT'))

True
True
True
True
False
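
These functions are not limited to filenames. As a small sketch, the wildcard patterns can also filter ordinary strings; the address list below is made-up sample data, not something from the notes above.

```python
from fnmatch import fnmatchcase

# Hypothetical sample data for illustration only.
addresses = [
    '5412 N CLARK ST',
    '1060 W ADDISON ST',
    '1039 W GRANVILLE AVE',
    '2122 N CLARK ST',
    '4802 N BROADWAY',
]

# Keep only the entries that end in ' ST'.
streets = [addr for addr in addresses if fnmatchcase(addr, '* ST')]
print(streets)

# Character classes work as in the shell: a 54xx number on CLARK.
clark_54xx = [addr for addr in addresses
              if fnmatchcase(addr, '54[0-9][0-9] *CLARK*')]
print(clark_54xx)
```

fnmatchcase() is used here so that the result does not depend on the operating system.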


## 2.5 Searching and replacing text

Simple literal replacements can use str.replace(); more complex patterns need re.sub().

In [21]:
# Rewrite dates such as 11/27/2012 as 2012-11-27
text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
# A backslash followed by a number refers to the capture group with that number in the pattern
print(re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text))

# Compile the pattern once if the same substitution is performed repeatedly
datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
print(datepat.sub(r'\3-\1-\2', text))

# For more complex substitutions, pass a callback function instead of a replacement string
from calendar import month_abbr
def change_date(m):
    mon_name = month_abbr[int(m.group(1))]
    return '{} {} {}'.format(m.group(2), mon_name, m.group(3))

print(datepat.sub(change_date, text))
print(datepat.subn(change_date, text))

Today is 2012-11-27. PyCon starts 2013-3-13.
Today is 2012-11-27. PyCon starts 2013-3-13.
Today is 27 Nov 2012. PyCon starts 13 Mar 2013.
('Today is 27 Nov 2012. PyCon starts 13 Mar 2013.', 2)
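
Numbered backreferences such as `\3-\1-\2` become hard to read as a pattern grows. A possible alternative, sketched here with group names (`month`, `day`, `year`) that are chosen for illustration, is to use named groups and the `\g<name>` replacement syntax. The first line also shows the literal case where str.replace() is sufficient.

```python
import re

text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'

# Literal text: str.replace() is enough, no regular expression needed.
literal = text.replace('Today', 'Yesterday')
print(literal)

# Named groups make the replacement template self-describing.
datepat = re.compile(r'(?P<month>\d+)/(?P<day>\d+)/(?P<year>\d+)')
by_name = datepat.sub(r'\g<year>-\g<month>-\g<day>', text)
print(by_name)

# The same names are available inside a callback through m.group('name').
def day_first(m):
    return '{}.{}.{}'.format(m.group('day'), m.group('month'), m.group('year'))

by_callback = datepat.sub(day_first, text)
print(by_callback)
```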


In [3]:
import re

code_name = {'820000': '台湾'}


In [32]:
# Raw string: '\8' is not a valid escape sequence in a normal string literal
text = r'./geojson_data/820000\820000\820000.json'
codepat = re.compile(r'(\d{6,})')

def code_change(m):
    code = m.group(1)
    name = code_name.get(code, code)
    return f'{name}_{code}'

print(codepat.findall(text))
codepat.sub(code_change, text)

['820000', '820000', '820000']


'./geojson_data/台湾_820000\\台湾_820000\\台湾_820000.json'

## 2.6 Case-insensitive search and replace

Search for and replace text while ignoring case.

In [4]:
text = 'UPPER PYTHON, lower python, Mixed Python'
print(re.findall('python', text, flags=re.IGNORECASE))
print(re.sub('python', 'snake', text, flags=re.IGNORECASE))

['PYTHON', 'python', 'Python']
UPPER snake, lower snake, Mixed snake
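
The flag does not have to be passed on every call. As a brief sketch, the same case-insensitive search can be written with the inline flag `(?i)` at the start of the pattern, or with a pattern compiled once with re.IGNORECASE; all three forms should return the same matches.

```python
import re

text = 'UPPER PYTHON, lower python, Mixed Python'

# 1. Flag passed as an argument on each call.
with_flag = re.findall('python', text, flags=re.IGNORECASE)

# 2. Inline flag at the very start of the pattern.
inline = re.findall('(?i)python', text)

# 3. Flag baked into a compiled pattern that can be reused.
pattern = re.compile('python', re.IGNORECASE)
compiled = pattern.findall(text)

print(with_flag, inline, compiled)
```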


The replacement above has a small flaw: the substituted text does not follow the case of the text it replaces. A callback function fixes this:

In [6]:
# matchcase() returns a callback function whose argument must be a match object.
# Besides a replacement string, re.sub() also accepts such a callback.
def matchcase(word):
    def replace(m):
        text = m.group()
        if text.isupper():
            return word.upper()
        elif text.islower():
            return word.lower()
        elif text[0].isupper():
            return word.capitalize()
        else:
            return word
    return replace

re.sub('python', matchcase('snake'), text, flags=re.IGNORECASE)


'UPPER SNAKE, lower snake, Mixed Snake'
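
To see which branch handles which kind of match, the sketch below repeats the matchcase() function from the cell above and runs it on a made-up string that also contains an irregular spelling, `pYthon`. That input is an assumption added for illustration: it is neither all upper case, all lower case, nor capitalized, so it falls through to the final `else` and the replacement word is returned unchanged.

```python
import re

def matchcase(word):
    """Return a callback that adapts `word` to the case of each match."""
    def replace(m):
        text = m.group()
        if text.isupper():
            return word.upper()
        elif text.islower():
            return word.lower()
        elif text[0].isupper():
            return word.capitalize()
        else:
            # Irregular casing: fall back to the word as given.
            return word
    return replace

sample = 'PYTHON python Python pYthon'
result = re.sub('python', matchcase('snake'), sample, flags=re.IGNORECASE)
print(result)
```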

## 2.8 Matching across multiple lines

Match a pattern that spans more than one line.

In [17]:
text1 = '/* this is a comment */'
text2 = '''/* this is a
 multiline comment */
'''

comment = re.compile(r'/\*(.*?)\*/')
print(comment.findall(text1))
print(comment.findall(text2))   # by default, . does not match a newline

comment = re.compile(r'/\*(.*?)\*/', re.S)
print(comment.findall(text2))

comment = re.compile(r'/\*(.*?)\*/', re.DOTALL)
print(comment.findall(text2))

[' this is a comment ']
[]
[' this is a\n multiline comment ']
[' this is a\n multiline comment ']
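
re.S is simply the short name for re.DOTALL, which is why the last two results are identical. Two related points are sketched below under the same sample text: a pattern can match newlines without any flag by using `(?:.|\n)`, and re.MULTILINE is a different flag that only changes where `^` and `$` match. The two-line string `'one\ntwo'` is made up for the second part.

```python
import re

text2 = '''/* this is a
 multiline comment */
'''

# No flag: the non-capturing group accepts either any character or a newline.
comment = re.compile(r'/\*((?:.|\n)*?)\*/')
found = comment.findall(text2)
print(found)

# re.MULTILINE does not affect '.', it lets ^ match at the start of every line.
first_only = re.findall(r'^\w+', 'one\ntwo')
every_line = re.findall(r'^\w+', 'one\ntwo', re.MULTILINE)
print(first_only, every_line)
```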


## 2.11 Stripping unwanted characters from strings

In [22]:
text = 'pythonp'
print(text.strip('p'))
print(text.lstrip('p'))
print(text.rstrip('p'))
print(text.replace('p', ''))

ython
ythonp
python
ython
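
The strip methods only touch the two ends of a string, and their argument is treated as a set of characters rather than as a prefix or suffix. A short sketch with made-up sample strings shows this, along with one possible way to handle whitespace in the middle of a string, which strip() never removes.

```python
import re

s = '-----hello====='

left = s.lstrip('-')    # only leading '-' characters are removed
both = s.strip('-=')    # any mix of '-' and '=' is removed from both ends
print(left, both)

# With no argument, whitespace (including the newline) is stripped.
padded = '  hello     world  \n'
trimmed = padded.strip()
print(repr(trimmed))

# Inner whitespace is untouched by strip(); collapse it with re.sub().
collapsed = re.sub(r'\s+', ' ', trimmed)
print(collapsed)
```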


## 2.12 Sanitizing and cleaning up text

In [27]:
s = 'pýtĥöñ\fis\tawesome\r\n'
print(s)
remap = {ord('\t'): ' ', ord('\f'): ' '}
print(s.translate(remap))

pýtĥöñis	awesome

pýtĥöñ is awesome
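
The translation table above leaves the accented letters and the `\r` in place. One way to go further, sketched here as an assumption about what "clean" should mean for this string, is to delete `\r` by mapping it to None and then strip the accents with the standard unicodedata module: decompose each character (NFD) and drop the combining marks.

```python
import unicodedata

s = 'pýtĥöñ\fis\tawesome\r\n'

# Map tab and form feed to a space; mapping to None deletes the character.
remap = {ord('\t'): ' ', ord('\f'): ' ', ord('\r'): None}
cleaned = s.translate(remap)

# NFD splits each accented letter into a base letter plus combining marks.
decomposed = unicodedata.normalize('NFD', cleaned)

# Keep only the characters that are not combining marks.
ascii_like = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
print(repr(ascii_like))
```

Dropping accents loses information, so this step only makes sense when the text is used for something like searching or matching.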



## 2.16 Reformatting text to a fixed column width

In [33]:
import textwrap

s = "Look into my eyes, look into my eyes, the eyes, the eyes, \
the eyes, not around the eyes, don't look around the eyes, \
look into my eyes, you're under."
print(textwrap.fill(s, 65, initial_indent='    '))

    Look into my eyes, look into my eyes, the eyes, the eyes, the
eyes, not around the eyes, don't look around the eyes, look into
my eyes, you're under.
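
textwrap.fill() returns a single string. When the individual lines are needed, textwrap.wrap() returns them as a list, and `subsequent_indent` indents every line except the first. The sketch below uses the same text with a width of 40, a value picked here only for illustration.

```python
import textwrap

s = ("Look into my eyes, look into my eyes, the eyes, the eyes, "
     "the eyes, not around the eyes, don't look around the eyes, "
     "look into my eyes, you're under.")

# wrap() gives the wrapped lines as a list instead of one joined string.
lines = textwrap.wrap(s, width=40)
for line in lines:
    print(line)

# subsequent_indent applies to every line after the first.
hanging = textwrap.fill(s, width=40, subsequent_indent='    ')
print(hanging)
```

The width includes any indentation, so indented lines hold fewer words than unindented ones.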
