## About Regular Expressions (REs)

In [1]:
import re

Using a regular expression, you **specify** the **rules** for the set of possible **strings** that you want to **match**. We can also use REs to modify a string or to **split** it apart in various ways. Under the hood, an RE is compiled into a series of bytecodes, which are then executed by a matching engine written in C.

For advanced use, we may consider:
- how the engine will execute a given RE
- how to write the RE in a certain way in order to produce bytecode that runs faster

### 1. Matching with REs

The list of metacharacters (which do not match themselves): . ^ $ * + ? {} [] \ | () These characters are reserved for special purposes when writing a pattern.
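A minimal sketch of what "do not match themselves" means in practice, using '.' and '+' as two representative metacharacters. The sample strings are made up for illustration.

```python
import re

# '.' is a metacharacter: it matches any character except a newline,
# so the pattern a.c matches far more than the literal text "a.c".
dot_matches = re.findall(r'a.c', 'abc a.c a-c')

# '+' is a metacharacter too: it repeats the preceding item one or more times.
plus_matches = re.findall(r'ab+', 'a ab abbb')

print(dot_matches)   # ['abc', 'a.c', 'a-c']
print(plus_matches)  # ['ab', 'abbb']
```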

**The 1st : [ ]**  specifies a **set** of characters that you wish to match. The characters can be listed individually or, when they form a range, given as the first and last characters joined by '-'.

For example, [abcd] and [a-d] are equivalent patterns.
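The equivalence of the two spellings can be checked directly. This is a small sketch; the test word is an arbitrary choice.

```python
import re

text = 'badge'

# Characters listed one by one.
listed = re.findall(r'[abcd]', text)
# The same set written as a range.
ranged = re.findall(r'[a-d]', text)

print(listed)            # ['b', 'a', 'd']
print(listed == ranged)  # True
```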

**The 2nd : ^** matches a complement set when it is the first character inside [ ]. For example, [^5] matches any character except 5.

**The 3rd: \\** (the **escape character**): \\^ matches a literal '^'.
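A short sketch of both rules, with illustrative input strings:

```python
import re

# [^5] matches any single character except '5'.
not_five = re.findall(r'[^5]', '15a5')

# \^ matches a literal caret instead of acting as a metacharacter.
carets = re.findall(r'\^', '2^10 = 1024, x^y')

print(not_five)  # ['1', 'a']
print(carets)    # ['^', '^']
```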

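**Some other RE syntax** (the bracketed equivalents hold for ASCII text; str patterns in Python 3 also match the corresponding Unicode characters)
- \d matches any decimal digit [0-9]
- \D matches any non-digit character [^0-9]
- \s matches any whitespace character [ \t\n\r\f\v]
- \S matches any non-whitespace character [^\s]
- \w matches any word character, i.e. alphanumeric or underscore [a-zA-Z0-9_]
- \W matches any non-word character [^\w]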
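The shorthand classes can be tried on a small sample. This is only a sketch with a made-up input string; note that in Python `\w` also matches the underscore.

```python
import re

sample = 'id_42 = 3.5\n'

digits = re.findall(r'\d', sample)    # every decimal digit
words = re.findall(r'\w+', sample)    # runs of word characters (underscore included)
non_word = re.findall(r'\W', sample)  # everything that is not a word character
spaces = re.findall(r'\s', sample)    # whitespace, including the trailing newline

print(digits)    # ['4', '2', '3', '5']
print(words)     # ['id_42', '3', '5']
print(non_word)  # [' ', '=', ' ', '.', '\n']
print(spaces)    # [' ', ' ', '\n']
```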

As an exercise: **[\s,]** will match **any whitespace character** or a comma " **,** ".
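To check the exercise, the class can be applied to a short string. The input below is an assumption chosen to contain a comma, a space and a tab.

```python
import re

line = 'x, y\tz'

# Each hit is a single character: either whitespace or a comma.
hits = re.findall(r'[\s,]', line)

# With '+', a run of such characters acts as one separator.
parts = re.split(r'[\s,]+', line)

print(hits)   # [',', ' ', '\t']
print(parts)  # ['x', 'y', 'z']
```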

### 2. Compiling REs

A regular expression is compiled into a pattern object, which has methods for various operations such as searching for pattern matches or performing string substitutions.

In [2]:
import re
ptn=re.compile('[a-z]+')
ptn

re.compile(r'[a-z]+', re.UNICODE)

In [3]:
p=ptn.match("test0fff12")
p

<_sre.SRE_Match object; span=(0, 4), match='test'>

group() returns the matched substring; start() and end() return its starting and ending indices.

In [4]:
p.group() 

'test'

In [5]:
p.start(), p.end()

(0, 4)
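Note that match() only tries the pattern at the beginning of the string. The sketch below contrasts it with search(), a pattern-object method not otherwise used in these notes, on an input that does not start with a lowercase letter.

```python
import re

ptn = re.compile(r'[a-z]+')

# match() succeeds only if the pattern matches at the very start of the string.
at_start = ptn.match('0test')

# search() scans forward and returns the first match anywhere in the string.
anywhere = ptn.search('0test')

print(at_start)                           # None
print(anywhere.group(), anywhere.span())  # test (1, 5)
```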

### 3. re.split()

**.split(string[, maxsplit=0])**  Split the string into a list, **splitting** it wherever the **RE** matches. If maxsplit is non-zero, at most maxsplit splits are performed, and the remainder of the string is returned as the final element of the list.

In [6]:
ptn = re.compile(r'\W+')
ptn.split('This is a test, short and sweet, of split().')

['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', '']

In [7]:
ptn.split('This is a test, short and sweet, of split().',3)

['This', 'is', 'a', 'test, short and sweet, of split().']

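By default, the returned list does not contain the text matched by the pattern. To have the matched delimiters returned as well, wrap the pattern in capturing parentheses.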

In [8]:
ptn2= re.compile(r'(\W+)')
print(ptn2.split('This is a test, short and sweet, of split().'))

['This', ' ', 'is', ' ', 'a', ' ', 'test', ', ', 'short', ' ', 'and', ' ', 'sweet', ', ', 'of', ' ', 'split', '().', '']


The approach above first compiles a pattern and then calls pattern.split(). Alternatively, call re.split() directly and pass the pattern as the first argument:

In [9]:
re.split(r'\W+',"Words+++words---words!!!!")

['Words', 'words', 'words', '']

In [10]:
re.split(r'(\W+)',"Words+++words---words!!!!")

['Words', '+++', 'words', '---', 'words', '!!!!', '']

A few more examples:

In [11]:
strs='aaa bbb ccc; ddd    eee,fff'
strs

'aaa bbb ccc; ddd    eee,fff'

To split on any one of several delimiter characters, put them inside **[]** as a set.

Split without keeping the delimiters:

In [12]:
re.split(r'[;,]',strs)

['aaa bbb ccc', ' ddd    eee', 'fff']

Split while keeping the delimiters:

In [13]:
re.split(r'([;,])',strs)

['aaa bbb ccc', ';', ' ddd    eee', ',', 'fff']
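The results above still carry the surrounding spaces. One possible variation, sketched here as an assumption about what a caller might want, adds `\s` to the set and a `+` quantifier so that any run of separators counts as a single split point.

```python
import re

strs = 'aaa bbb ccc; ddd    eee,fff'

# Any run of semicolons, commas, or whitespace is treated as one delimiter.
tokens = re.split(r'[;,\s]+', strs)

print(tokens)  # ['aaa', 'bbb', 'ccc', 'ddd', 'eee', 'fff']
```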

### 4. re.sub(pattern, repl, string, count=0, flags=0)

Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. In short, every match of pattern is replaced with repl.
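A minimal sketch of the signature in use, on a made-up string: a plain replacement, the same call limited by `count`, and a replacement that refers back to the captured text with `\1`.

```python
import re

text = 'a1b22c333'

# Replace every run of digits.
all_replaced = re.sub(r'\d+', '#', text)

# count=1 stops after the first (leftmost) replacement.
first_only = re.sub(r'\d+', '#', text, count=1)

# \1 in the replacement inserts whatever group 1 captured.
wrapped = re.sub(r'(\d+)', r'<\1>', text)

print(all_replaced)  # a#b#c#
print(first_only)    # a#b22c333
print(wrapped)       # a<1>b<22>c<333>
```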

A longer example:

In [14]:
# s2='''
# bool x = True;
# bool y = False;
# bool z = True;
# main()
# {
# 1:  x = !y;2:  if ( (x&y) | (!z) )
# 3:  y = !y;
# 4: pass;
# fi
# 5:  return x;
# }
# '''

In [2]:
s2='''bool x = True;bool y = False;bool z = True;main(){1:  x = !y;2:  if ( (x&y) | (!z) )3:  y = !y;4: pass;fi 5:  return x;}'''

In [3]:
s2=s2.replace("{","{\n")
s2=s2.replace("}","\n}")
print(s2)

bool x = True;bool y = False;bool z = True;main(){
1:  x = !y;2:  if ( (x&y) | (!z) )3:  y = !y;4: pass;fi 5:  return x;
}


In [4]:
s2=re.sub(r';\s*fi\s+',r';\nfi\n',s2)
print(s2)

bool x = True;bool y = False;bool z = True;main(){
1:  x = !y;2:  if ( (x&y) | (!z) )3:  y = !y;4: pass;
fi
5:  return x;
}


In [5]:
s2=re.sub(r'(\w+\s*:)',r'\n\1',s2) # start a new line before every "label:"
print(s2)

bool x = True;bool y = False;bool z = True;main(){

1:  x = !y;
2:  if ( (x&y) | (!z) )
3:  y = !y;
4: pass;
fi

5:  return x;
}


In [6]:
s2=re.sub(r'\s*\n\s*',r'\n',s2) # collapse the whitespace around every line break to get the final layout
print(s2)

bool x = True;bool y = False;bool z = True;main(){
1:  x = !y;
2:  if ( (x&y) | (!z) )
3:  y = !y;
4: pass;
fi
5:  return x;
}


In [20]:
s2=s2.split('\n')
s2

['bool x = True;bool y = False;bool z = True;main(){',
 '1:  x = !y;',
 '2:  if ( (x&y) | (!z) )',
 '3:  y = !y;',
 '4: pass;',
 'fi',
 '5:  return x;',
 '}']
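The substitution steps used in this example can be collected into one function. This is a sketch only: `reformat_program` is a hypothetical helper name, and the short input is a made-up program in the same style as s2.

```python
import re

def reformat_program(src):
    """Hypothetical helper: apply the same steps as the cells above, in order."""
    # Put braces on line boundaries.
    src = src.replace("{", "{\n").replace("}", "\n}")
    # Give 'fi' a line of its own.
    src = re.sub(r';\s*fi\s+', ';\nfi\n', src)
    # Start a new line before every "label:".
    src = re.sub(r'(\w+\s*:)', r'\n\1', src)
    # Collapse the whitespace around line breaks.
    src = re.sub(r'\s*\n\s*', '\n', src)
    return src.split('\n')

one_line = 'main(){1:  x = !y;2:  y = !y;fi 5:  return x;}'
lines = reformat_program(one_line)
print(lines)
# ['main(){', '1:  x = !y;', '2:  y = !y;', 'fi', '5:  return x;', '}']
```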

In [21]:
m = re.match(r'(\w+) (\w+)', "Isaac Newton, physicist") # two pairs of (), so two groups, 1 and 2: the i-th pair corresponds to group(i), and group(0) is the whole match

In [22]:
m.group(0)

'Isaac Newton'

In [23]:
m.group(1)

'Isaac'

In [24]:
m.group(2)

'Newton'

In [25]:
m.group(3)

IndexError: no such group
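All groups can also be fetched at once, which avoids asking for an index that does not exist. A small sketch using the same pattern and string:

```python
import re

m = re.match(r'(\w+) (\w+)', 'Isaac Newton, physicist')

# groups() returns every captured group as a tuple.
all_groups = m.groups()
# group() accepts several indices and returns a tuple of those groups.
pair = m.group(1, 2)

print(all_groups)  # ('Isaac', 'Newton')
print(pair)        # ('Isaac', 'Newton')

# The pattern object records how many groups it defines.
print(m.re.groups)  # 2

# Asking for a group beyond that count raises IndexError.
try:
    m.group(3)
except IndexError as exc:
    print('no third group:', exc)
```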