**Splitting a string on any of multiple delimiters**

In [13]:
import re
line = 'asdf fjdk ; afed,fjek,asdf,   foo'
re.split(r'[;,\s]',line)

['asdf', 'fjdk', '', '', 'afed', 'fjek', 'asdf', '', '', '', 'foo']

In [14]:
re.split(r'[;,]\s*',line)

['asdf fjdk ', 'afed', 'fjek', 'asdf', 'foo']

In [9]:
re.split(r'(;|,)\s*',line)

['asdf fjdk ', ';', 'afed', ',', 'fjek', ',', 'asdf', ',', 'foo']

In [11]:
re.split(r'(?:;|,)\s*',line)

['asdf fjdk ', 'afed', 'fjek', 'asdf', 'foo']

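The split() method of string objects only handles simple cases: it does not support multiple delimiters and cannot cope with whitespace that may surround them. When you need more flexibility, use re.split(), which lets you specify multiple patterns for the delimiter.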

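Parentheses denote a capture group in a regular expression. When a capture group is used, the matched text (here, the delimiter) is also included in the final result. If you do not want the delimiters in the result but still need parentheses to group parts of the pattern, use a non-capturing group of the form (?:...).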
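Keeping the delimiters can be useful when the string has to be rebuilt later. The following is a minimal sketch that reuses the `line` value from the cells above; the names `fields`, `values`, `delimiters` and `rebuilt` are illustrative and not part of the original notes.

```python
import re

line = 'asdf fjdk ; afed,fjek,asdf,   foo'

# The capture group keeps each delimiter in the result list.
fields = re.split(r'(;|,)\s*', line)

# Even positions hold the values, odd positions hold the delimiters.
values = fields[::2]
delimiters = fields[1::2] + ['']   # pad so both lists have the same length

# Re-join the pieces; whitespace after each delimiter was consumed by \s*.
rebuilt = ''.join(v + d for v, d in zip(values, delimiters))
print(rebuilt)
```

Slicing with `[::2]` and `[1::2]` works here because re.split() with a single capture group always alternates value, delimiter, value.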

**Matching text at the start or end of a string**

In [15]:
filename = 'spam.txt'
filename.endswith('.txt')

True

In [16]:
filename.startswith('file:')

False

In [17]:
url = 'https://www.python.org'
url.startswith('http:')

False

In [18]:
url.startswith('https:')

True

In [19]:
import os
filename = os.listdir('.')
print(filename)

['.ipynb_checkpoints', 'address.json', 'Bag of Words Meets Bags of Popcorn.ipynb', 'Bag_of_Words_model.csv', 'ceshi.txt', 'labeledTrainData.tsv', 'nbaallelo.db', 'python logging.ipynb', 'python 分词.ipynb', 'python 复习记录.ipynb', 'python 字典的遍历与排序以及后续的学习记录.ipynb', 'python 学习.ipynb', 'python 学习记录+.ipynb', 'sentiment_lstm', 'sklearn.ipynb', 'sklearn学习记录.ipynb', 'testData.tsv', 'Untitled.ipynb', 'Untitled1.ipynb', '字符串和文本.ipynb']


In [22]:
[name for name in filename if name.endswith(('.py','.ipynb'))]

['Bag of Words Meets Bags of Popcorn.ipynb',
 'python logging.ipynb',
 'python 分词.ipynb',
 'python 复习记录.ipynb',
 'python 字典的遍历与排序以及后续的学习记录.ipynb',
 'python 学习.ipynb',
 'python 学习记录+.ipynb',
 'sklearn.ipynb',
 'sklearn学习记录.ipynb',
 'Untitled.ipynb',
 'Untitled1.ipynb',
 '字符串和文本.ipynb']

In [23]:
any(name.endswith('py') for name in filename)

False

In [24]:
any(name.endswith('ipynb') for name in filename)

True

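# any(x) returns True if at least one element of the iterable x is truthy; it returns False if every element is falsy (0, '', False, ...) or if x is empty.
# all(x) returns True if every element of the iterable x is truthy, or if x is empty; otherwise it returns False.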

In [25]:
any('123')

True

In [27]:
any([0,1])

True

In [28]:
any([0,''])

False

In [29]:
all('123')

True

In [30]:
all([0,1])

False

In [31]:
all([0,''])

False
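The edge case of an empty iterable and the use of generator expressions are not covered by the cells above. A small sketch, with illustrative variable names:

```python
# any() is False for an empty iterable; all() is True (nothing contradicts it).
empty_any = any([])
empty_all = all([])

# Both accept generator expressions and stop at the first decisive element.
numbers = [2, 4, 7, 8]
has_odd = any(n % 2 for n in numbers)        # 7 is odd
all_positive = all(n > 0 for n in numbers)   # every element is > 0

print(empty_any, empty_all, has_odd, all_positive)
```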

In [32]:
choices = ['https:','ftp:']
url = 'https://www.python.org'
url.startswith(choices)

TypeError: startswith first arg must be str or a tuple of str, not list

In [33]:
url.startswith(tuple(choices))

True
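Since startswith() and endswith() only accept a str or a tuple of str, one way to avoid the TypeError is to convert whatever collection the caller passes. The helper below is a hypothetical sketch; `has_allowed_scheme` and its default schemes are assumptions made for illustration.

```python
def has_allowed_scheme(url, schemes=('http:', 'https:', 'ftp:')):
    """Return True if url starts with one of the given schemes.

    Hypothetical helper: tuple() lets callers pass a list, set or tuple.
    """
    return url.startswith(tuple(schemes))


print(has_allowed_scheme('https://www.python.org'))
print(has_allowed_scheme('file:///tmp/data.txt'))
print(has_allowed_scheme('ftp://example.org', ['ftp:']))
```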

**Matching strings with shell wildcard patterns**

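The fnmatch module provides two functions, fnmatch() and fnmatchcase(). fnmatch() applies the case-sensitivity rules of the underlying operating system, so a call such as fnmatch('foo.txt','*.TXT') returns True on Windows but False on Linux and macOS; fnmatchcase() always compares case-sensitively.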

In [34]:
from fnmatch import fnmatch, fnmatchcase
fnmatch('foo.txt','*.txt')

True

In [35]:
fnmatch('foo.txt','?.txt')

False

In [36]:
fnmatch('foo.txt','?oo.txt')

True

In [37]:
fnmatch('Dat45.csv','Dat[0-9]*')

True

In [38]:
names = ['Dat1.csv','Dat2.csv','config.ini','foo.py']
[name for name in names if fnmatch(name,'Dat*.csv')]

['Dat1.csv', 'Dat2.csv']

In [39]:
fnmatch('foo.txt','*.TXT')

True

In [40]:
fnmatchcase('foo.txt','*.txt')

True

In [41]:
fnmatchcase('foo.txt','*.TXT')

False
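Wildcard matching is not limited to file names. The sketch below uses fnmatch.filter() as a shortcut for the list comprehension shown earlier and applies fnmatchcase() to ordinary strings; the `addresses` list is made-up sample data.

```python
import fnmatch

names = ['Dat1.csv', 'Dat2.csv', 'config.ini', 'foo.py']

# fnmatch.filter() returns the names that match the pattern.
csv_files = fnmatch.filter(names, 'Dat*.csv')

# Illustrative data: fnmatchcase() works on any string, not only file names.
addresses = [
    '5412 N CLARK ST',
    '1060 W ADDISON ST',
    '1039 W GRANVILLE AVE',
    '2122 N CLARK ST',
    '4802 N BROADWAY',
]
streets = [a for a in addresses if fnmatch.fnmatchcase(a, '* ST')]
clark_54xx = [a for a in addresses
              if fnmatch.fnmatchcase(a, '54[0-9][0-9] *CLARK*')]

print(csv_files)
print(streets)
print(clark_54xx)
```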

**Matching and searching for text patterns**

In [43]:
text = 'year , but no, but year,but no,but year'
text.startswith('year')
text.endswith('year')
text.find('no')

11

In [44]:
import re
text1= '11/17/2012'
text2 = 'Nov 27,2012'
if re.match(r'\d+/\d+/\d+',text1):
    print('yes')
else:
    print('no')
if re.match(r'\d+/\d+/\d+',text2):
    print('yes')
else:
    print('no')

yes
no


In [45]:
datapat = re.compile(r'\d+/\d+/\d+')
if datapat.match(text1):
    print('yes')
else:
    print('no')
if datapat.match(text2):
    print('yes')
else:
    print('no')

yes
no


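If you plan to match the same pattern many times, it is usually worth precompiling the regular expression into a pattern object with re.compile().

match() always tries to find a match at the start of the string. To search the whole text for every occurrence of the pattern, use findall() instead.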

In [46]:
text = 'Today is 28/9/2017,PyCon starts 3/13/2013'
datapat.findall(text)

['28/9/2017', '3/13/2013']

In [47]:
datapat = re.compile(r'(\d+)/(\d+)/(\d+)')
datapat.findall(text)

[('28', '9', '2017'), ('3', '13', '2013')]

In [61]:
datapat = re.compile(r'(\d+)/(\d+)/(\d+)')
m = datapat.match('28/9/2017 today')
print(m)
print(m.groups())
print(m.group(0))
print(m.group(1))
print(m.group(2))
print(m.group(3))

<_sre.SRE_Match object; span=(0, 9), match='28/9/2017'>
('28', '9', '2017')
28/9/2017
28
9
2017
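Because match() only anchors at the start, it accepts trailing text, as the '28/9/2017 today' input shows. A short sketch of the related calls search() and fullmatch() from the standard re module; the variable names are illustrative.

```python
import re

datepat = re.compile(r'(\d+)/(\d+)/(\d+)')

prefix = datepat.match('28/9/2017 today')     # matches: the date is at the start
inner = datepat.match('today 28/9/2017')      # None: the date is not at the start
found = datepat.search('today 28/9/2017')     # search() scans the whole string
exact = datepat.fullmatch('28/9/2017 today')  # None: trailing text is not allowed

# An explicit end anchor ($) is another way to require an exact match.
anchored = re.match(r'(\d+)/(\d+)/(\d+)$', '28/9/2017')

print(prefix.group(), inner, found.group(), exact, anchored.group())
```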


findall() searches the entire text and returns all matches as a list. To walk through the matches one at a time instead, use finditer(), which yields match objects.

In [64]:
text = 'Today is 28/9/2017,PyCon starts 3/13/2013'
datapat = re.compile(r'(\d+)/(\d+)/(\d+)')

for m in datapat.finditer(text):
    print(m.groups())

('28', '9', '2017')
('3', '13', '2013')
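Since finditer() yields full match objects, each match also carries its position and, when the pattern uses named groups, a dictionary of fields. A sketch under the assumption that naming the groups `first`, `second` and `year` is acceptable (the notes do not say which field is the day and which is the month):

```python
import re

text = 'Today is 28/9/2017,PyCon starts 3/13/2013'
datepat = re.compile(r'(?P<first>\d+)/(?P<second>\d+)/(?P<year>\d+)')

# Collect the (start, end) span and the named groups of every match.
found = [(m.span(), m.groupdict()) for m in datepat.finditer(text)]

for span, fields in found:
    print(span, fields)
```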


**Searching and replacing text**

For simple literal text, str.replace() is enough. For more complicated patterns, use the sub() function in the re module.

In [65]:
text = 'year , but no, but year,but no,but year'
text.replace('year','yeah')

'yeah , but no, but yeah,but no,but yeah'

In [66]:
import re
text = 'Today is 28/9/2017,PyCon starts 3/13/2013'
re.sub(r'(\d+)/(\d+)/(\d+)',r'\3-\1-\2',text)

'Today is 2017-28-9,PyCon starts 2013-3-13'

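The first argument to sub() is the pattern to match and the second is the replacement pattern. A backslash followed by a digit, such as \3, refers to the capture group with that number in the pattern.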
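Numbered references become hard to read as patterns grow. As a sketch, the same substitution can be written with named groups and the `\g<name>` syntax supported by re.sub(); the group names and the sample text are chosen here for illustration.

```python
import re

text = 'PyCon starts 3/13/2013'
datepat = re.compile(r'(?P<month>\d+)/(?P<day>\d+)/(?P<year>\d+)')

# \g<name> refers to a named group; \3, \1, \2 refer to groups by position.
by_name = datepat.sub(r'\g<year>-\g<month>-\g<day>', text)
by_number = datepat.sub(r'\3-\1-\2', text)

print(by_name)
print(by_number)
```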

In [67]:
import re
datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
text = 'Today is 28/9/2017,PyCon starts 3/13/2013'
datepat.sub(r'\3-\1-\2',text)

'Today is 2017-28-9,PyCon starts 2013-3-13'

Besides the substituted text, you can also find out how many substitutions were made by using re.subn().

In [68]:
import re
datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
text = 'Today is 28/9/2017,PyCon starts 3/13/2013'
newtext,n=datepat.subn(r'\3-\1-\2',text)
print(newtext)
print(n)

Today is 2017-28-9,PyCon starts 2013-3-13
2


**Case-insensitive search and replace**

In [3]:
import re
text = 'UPPER PYTHON, lower python,Mixed Python'
re.findall('python',text,flags = re.IGNORECASE)

['PYTHON', 'python', 'Python']

In [5]:
re.sub('python','snake',text,flags = re.IGNORECASE)

'UPPER snake, lower snake,Mixed snake'

In [13]:
def matchcase(word):
    def replace(m):
        print(m)
        text = m.group()
        print(text)
        if text.isupper():
            return word.upper()
        elif text.islower():
            return word.lower()
        elif text[0].isupper():
            return word.capitalize()
        else:
            return word
    return replace
re.sub('python',matchcase('snake'),text,flags = re.IGNORECASE)

<_sre.SRE_Match object; span=(6, 12), match='PYTHON'>
PYTHON
<_sre.SRE_Match object; span=(20, 26), match='python'>
python
<_sre.SRE_Match object; span=(33, 39), match='Python'>
Python


'UPPER SNAKE, lower snake,Mixed Snake'

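How the function above works: matchcase('snake') returns the inner function replace(), and re.sub() calls it once for each match, passing the match object and substituting the string it returns. replace() copies the case of the matched text onto the replacement word, which is why the three matches printed above become SNAKE, snake and Snake.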
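The general mechanism is that the replacement argument of re.sub() may be any callable that takes a match object and returns a string. A stripped-down sketch, with a hypothetical callback name:

```python
import re


def upper_first(m):
    # m is the match object for one occurrence; the return value replaces it.
    return m.group().upper()


# \b[a-z] matches a lowercase letter at the start of each word.
titled = re.sub(r'\b[a-z]', upper_first, 'hello big world')
print(titled)
```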

**Defining a regular expression for the shortest match**

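When matching a text pattern with a regular expression, the match found is by default the longest possible one (greedy matching). Here we want the shortest match instead.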

In [21]:
import re
str_pat = re.compile(r'"(.*)"')
text1 = 'Computer say "no."'
str_pat.findall(text1)

['no.']

In [22]:
text2 = 'Computer says "no." Phone says "yes."'
str_pat.findall(text2)

['no." Phone says "yes.']

In [20]:
import re
str_pat1 = re.compile(r'"(.*?)"')
text2 = 'Computer says "no." Phone says "yes."'
str_pat1.findall(text2)

['no.', 'yes.']

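Adding ? after the * makes the quantifier non-greedy, which forces the matching algorithm to find the shortest possible match.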
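An alternative that avoids the non-greedy quantifier is a negated character class, which simply forbids a quote inside the quoted text. A sketch comparing the two approaches on the same input:

```python
import re

text2 = 'Computer says "no." Phone says "yes."'

# Non-greedy quantifier: stop at the first closing quote.
lazy = re.findall(r'"(.*?)"', text2)

# Negated character class: the captured text may not contain a quote at all.
negated = re.findall(r'"([^"]*)"', text2)

print(lazy)
print(negated)
```

Note that `[^"]` also matches newlines, whereas `.` does not unless re.DOTALL is set, so the two patterns differ on multiline input.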

**Writing regular expressions for multiline text** (matching across lines)

In [27]:
comment = re.compile(r'/\*(.*?)\*/')
text1 = '/*this is a comment*/'
text2 = '''/*this is a 
multiline comment*/
'''
comment.findall(text1)

['this is a comment']

In [28]:
comment.findall(text2)

[]

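To fix this, add support for newlines to the pattern. (?:.|\n) specifies a non-capturing group: it is used only for matching, its text is not captured, and it is not assigned a group number.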

In [30]:
comment = re.compile(r'/\*((?:.|\n)*?)\*/')
text1 = '/*this is a comment*/'
text2 = '''/*this is a 
multiline comment*/
'''
print(comment.findall(text1))
print(comment.findall(text2))

['this is a comment']
['this is a \nmultiline comment']


re.compile() accepts a useful flag, re.DOTALL, which makes the period (.) in a regular expression match any character, including a newline.

In [31]:
comment = re.compile(r'/\*(.*?)\*/',re.DOTALL)
text1 = '/*this is a comment*/'
text2 = '''/*this is a 
multiline comment*/
'''
comment.findall(text2)

['this is a \nmultiline comment']
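Two other spellings give the same result and may be handy when the flags argument is not available, for example when the pattern is stored as plain text. This is a sketch; both forms are standard re syntax.

```python
import re

text2 = '/*this is a \nmultiline comment*/\n'

# (?s) at the start of the pattern turns on DOTALL inline.
inline_flag = re.findall(r'(?s)/\*(.*?)\*/', text2)

# [\s\S] matches any character, newline included, without any flag.
char_class = re.findall(r'/\*([\s\S]*?)\*/', text2)

print(inline_flag)
print(char_class)
```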

**Stripping unwanted characters from a string**

In [32]:
s = ' hello world \n'
s.strip()

'hello world'

In [33]:
s.lstrip()

'hello world \n'

In [34]:
s.rstrip()

' hello world'

In [35]:
t= '----hello===='
t.lstrip('-')

'hello===='

In [37]:
t.strip('-=')

'hello'

In [39]:
s = ' hello   world    \n'
s = s.strip()
print(s)

hello   world


Stripping never affects text in the middle of a string. To operate on the inner whitespace, use replace() or re.sub().

s.replace(' ','')

In [41]:
import re
re.sub(r'\s+',' ',s)

'hello world'
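Stripping is often combined with iteration, for example when cleaning lines read from a file. The sketch below uses io.StringIO as a stand-in for a real file so that it runs on its own; the sample content is made up.

```python
import io

# Stand-in for an open text file (illustrative content).
f = io.StringIO('  first line  \n\tsecond line\n\n third \n')

# The generator strips each line lazily, without building a temporary list.
lines = (line.strip() for line in f)

# Drop lines that are empty after stripping.
cleaned = [line for line in lines if line]
print(cleaned)
```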

**Aligning text strings**

In [1]:
text = 'Hello World'
text.center(20)

'    Hello World     '

In [2]:
text.ljust(20)

'Hello World         '

In [3]:
text.rjust(20)

'         Hello World'

In [4]:
text.ljust(20,'=')

'Hello World========='

In [6]:
text.center(20,'=')

'====Hello World====='

The format() function can also be used for alignment; all you need to do is use the '<', '>' or '^' character appropriately.

In [7]:
format(text,'<20')

'Hello World         '

In [8]:
format(text,'>20')

'         Hello World'

In [9]:
format(text,'^20')

'    Hello World     '

In [10]:
format(text,'*<20')

'Hello World*********'

In [11]:
format(text,'*>20')

'*********Hello World'

In [12]:
format(text,'*^20')

'****Hello World*****'

When formatting multiple values, these format codes can also be used in the format() method of strings.

In [13]:
'{:>10s} {:>10s}'.format('hello','world')

'     hello      world'

In [14]:
x=1.234
format(x,'>10')

'     1.234'

In [15]:
format(x,'^10.2f')

'   1.23   '

format() is more general than the older % operator.
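To make the comparison concrete, the sketch below lays out two rows of made-up data with str.format(), with an f-string (Python 3.6+) and with the % operator. The column widths of 8 are an arbitrary choice.

```python
rows = [('apple', 1.5), ('kiwi', 12.25)]

# Name left-aligned in 8 columns, price right-aligned in 8 with 2 decimals.
with_format = ['{:<8}{:>8.2f}'.format(name, price) for name, price in rows]

# f-strings accept the same format codes after the colon.
with_fstring = [f'{name:<8}{price:>8.2f}' for name, price in rows]

# The % operator spells the same layout differently (- means left-align).
with_percent = ['%-8s%8.2f' % (name, price) for name, price in rows]

for line in with_format:
    print(line)
```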

**Joining and combining strings**

In [16]:
parts = ['Is','Chicago','Not','Chicago?']
' '.join(parts)

'Is Chicago Not Chicago?'

In [17]:
','.join(parts)

'Is,Chicago,Not,Chicago?'

In [18]:
''.join(parts)

'IsChicagoNotChicago?'

In [20]:
a = 'Is Chicago'
b = 'Not  Chicago?'
a + b

'Is ChicagoNot  Chicago?'

In [21]:
a +' '+b

'Is Chicago Not  Chicago?'

In [22]:
print('{} {}'.format(a,b))

Is Chicago Not  Chicago?


In [23]:
a = 'hello' 'world'
a

'helloworld'
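join() requires every item to be a string, so mixed data has to be converted first. A sketch with made-up data, which also shows collecting pieces in a list and joining once instead of repeated `+` in a loop:

```python
# join() raises TypeError on non-string items, so convert them on the fly.
data = ['ACME', 50, 91.1]
csv_line = ','.join(str(d) for d in data)

# Collect the pieces in a list and join once at the end.
parts = []
for i in range(3):
    parts.append('item{}'.format(i))
combined = ' '.join(parts)

print(csv_line)
print(combined)
```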