# Regular Expression
In this notebook we will learn about regular expressions. A regular expression is a special string that describes a pattern to match against text.

## Search

#### Regular Expression 
| Pattern | Explanation |
| --- | --- |
| \\d | digits, 0-9 |
| \\D | non-digits |
| \\s | whitespace characters (space, tab, newline) |
| \\S | non-whitespace characters |
| \\w | word characters (letters, digits, underscore) |
| \\W | non-word characters |
| * | match 0 or more occurrences |
| + | match 1 or more occurrences |
| ? | match 0 or 1 occurrence |
| { } | specify how many occurrences |
| ( ) | group pattern |
| . | any single character except a newline |
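
As a quick illustration of the table above, the following sketch applies a few of the patterns with `re.findall`. The sample strings are made up for this example and are not part of the original notebook.

```python
import re

# Hypothetical sample text, used only for illustration.
sample = "Order 66 shipped to Room_12B on 2024-01-15."

digits = re.findall(r"\d+", sample)                # runs of one or more digits
words = re.findall(r"\w+", sample)                 # runs of word characters
dates = re.findall(r"\d{4}-\d{2}-\d{2}", sample)   # { } fixes the repetition count
spelling = re.findall(r"colou?r", "color colour")  # ? makes the "u" optional

print("digits   =", digits)
print("words    =", words)
print("dates    =", dates)
print("spelling =", spelling)
```

Raw strings (the `r` prefix) keep Python from interpreting the backslashes before the `re` module sees them.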


#### Match in groups

In [1]:
# import regular expression module.
import re

# match telephone numbers
pattern = r"\d{3}-\d{3}-\d{4}"  # raw string avoids invalid-escape warnings
text = "My phone number is 425-241-1234 or 260-112-2345. The code is 1231234567 or 123-1111-1234"
result = re.findall(pattern, text)
print("text = ", text, "; pattern =", pattern)
print("result = ", result)

text =  My phone number is 425-241-1234 or 260-112-2345. The code is 1231234567 or 123-1111-1234 ; pattern = \d{3}-\d{3}-\d{4}
result =  ['425-241-1234', '260-112-2345']


In [2]:
# match telephone numbers in groups
pattern = r'(\d{3})-(\d{7})'
text = "My phone number is 425-2411234"
result = re.search(pattern, text)
print("text = ", text, "; pattern =", pattern)
print("area = ", result.group(1), "local number = ", result.group(2))

text =  My phone number is 425-2411234 ; pattern = (\d{3})-(\d{7})
area =  425 local number =  2411234
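
Groups can also be given names, which may read more clearly than positional indexes. The sketch below reuses the phone number from the cell above; the group names `area` and `local` are chosen here for illustration. It also shows that `re.findall` returns tuples when the pattern contains more than one group.

```python
import re

text = "My phone number is 425-2411234"

# (?P<name>...) defines a named group; the names are illustrative.
match = re.search(r'(?P<area>\d{3})-(?P<local>\d{7})', text)
area = match.group('area')
parts = match.groupdict()
print("area =", area, "; groupdict =", parts)

# With several groups, findall returns one tuple per match.
numbers = re.findall(r'(\d{3})-(\d{3})-(\d{4})', "425-241-1234 or 260-112-2345")
print("numbers =", numbers)
```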


#### Multiple choices

In [3]:
import re
# match multiple choices
pattern = 'Tom|Jerry'
text = "The president of club is Tom"
result = re.search(pattern, text)
print("text = ", text, "; pattern =", pattern)
print("result = ", result.group() if result is not None else None)

# match multiple choices as similar names
msg = 'Johnson, Johnnason and Johnnathan will attend my party tonight'
print("msg =", msg)
pattern = 'John(son|nason|nathan)'
print("pattern =", pattern)
result = re.search(pattern, msg)
print(result.group())

pattern = 'John(na)?son'
print("pattern =", pattern)
result = re.search(pattern, msg)
print(result.group())

pattern = 'John(na)*son'
print("pattern =", pattern)
result = re.search(pattern, msg)
print(result.group())

pattern = 'John(na)+son'
print("pattern =", pattern)
result = re.search(pattern, msg)
print(result.group())

pattern = 'john(na)+son'
print("pattern =", pattern)
result = re.search(pattern, msg)
print(result.group() if result is not None else None)
result = re.search(pattern, msg, re.IGNORECASE)
print(result.group() if result is not None else None)

text =  The president of club is Tom ; pattern = Tom|Jerry
result =  Tom
msg = Johnson, Johnnason and Johnnathan will attend my party tonight
pattern = John(son|nason|nathan)
Johnson
pattern = John(na)?son
Johnson
pattern = John(na)*son
Johnson
pattern = John(na)+son
Johnnason
pattern = john(na)+son
None
Johnnason


#### Greedy Search
By default, quantifiers are greedy: they match as many occurrences as the pattern allows.

In [4]:
import re
msg = 'sonsonsonsonson'
pattern = '(son){1,3}'
print('msg = ', msg)
print('pattern = ', pattern)
result = re.search(pattern, msg, re.IGNORECASE)
print(result.group() if result is not None else None)

msg =  sonsonsonsonson
pattern =  (son){1,3}
sonsonson


#### Shortest Match

In [5]:
import re
text = 'Computer says "no". Phone says "yes".'
print("text=", text)
print("re.findall(r'\"(.*)\"', text)", re.findall(r'\"(.*)\"', text))
print("re.findall(r'\"(.*?)\"', text)", re.findall(r'\"(.*?)\"', text))



text= Computer says "no". Phone says "yes".
re.findall(r'"(.*)"', text) ['no". Phone says "yes']
re.findall(r'"(.*?)"', text) ['no', 'yes']
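
Adding `?` after a quantifier makes it non-greedy, so it matches as few characters as possible. The sketch below contrasts the two behaviours on the strings used in the cells above, and shows a negated character class as an alternative way to stop at the first closing quote.

```python
import re

msg = 'sonsonsonsonson'
text = 'Computer says "no". Phone says "yes".'

greedy = re.search(r'(son){1,3}', msg).group()   # as many repetitions as allowed
lazy = re.search(r'(son){1,3}?', msg).group()    # as few repetitions as allowed
print("greedy =", greedy, "; lazy =", lazy)

# [^"]* cannot cross a quote, so each match stays inside one pair of quotes.
quoted = re.findall(r'"([^"]*)"', text)
print("quoted =", quoted)
```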


#### Specify range
1. In a regular expression pattern we can specify a range of characters with '[x-y]', where x and y are single characters.
2. We can also use '[^x-y]' to match any character that is not in the range.

In [6]:
import re
# match a range of characters
pattern = '[1-5]'
msg = '1. cat, 2. dog, 3. pig, 4. swan'
result = re.findall(pattern, msg)
print("text = ", msg, "; pattern =", pattern)
print("result = ", result)
# match runs of characters that are not digits, periods, commas, or spaces
pattern = '[^0-9., ]+'
result = re.findall(pattern, msg)
print("text = ", msg, "; pattern =", pattern)
print("result = ", result)

text =  1. cat, 2. dog, 3. pig, 4. swan ; pattern = [1-5]
result =  ['1', '2', '3', '4']
text =  1. cat, 2. dog, 3. pig, 4. swan ; pattern = [^0-9., ]+
result =  ['cat', 'dog', 'pig', 'swan']
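
A character class can combine several ranges or list individual characters. The following sketch applies both forms to the same message; the variable names are illustrative only.

```python
import re

msg = '1. cat, 2. dog, 3. pig, 4. swan'

letters = re.findall(r'[A-Za-z]+', msg)   # two ranges in one class
vowels = re.findall(r'[aeiou]', msg)      # an explicit list of characters
print("letters =", letters)
print("vowels  =", vowels)
```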


#### Start and end
1. We can use ^ to require that the pattern matches at the start of the string.
2. We can use $ to require that the pattern matches at the end of the string.

In [7]:
import re
# match start and end of sentence
msg = 'John will attend my party 28 tonight.'
pattern = r'\w+.?$'
result = re.findall(pattern, msg)
print("text = ", msg, "; pattern =", pattern)
print("result = ", result)
pattern = r'^\w+'
result = re.findall(pattern, msg)
print("pattern =", pattern)
print("result = ", result)


text =  John will attend my party 28 tonight. ; pattern = \w+.?$
result =  ['tonight.']
pattern = ^\w+
result =  ['John']
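
Using both anchors together is a common way to check that an entire string has a given shape. The sketch below assumes a few made-up candidate strings; it also shows `re.fullmatch`, which imposes the same whole-string requirement, and the `re.MULTILINE` flag, which lets `^` match at the start of every line.

```python
import re

# Hypothetical inputs, used only for illustration.
candidates = ['425-241-1234', 'call 425-241-1234', '425-241-12345']

anchored = [c for c in candidates if re.search(r'^\d{3}-\d{3}-\d{4}$', c)]
full = [c for c in candidates if re.fullmatch(r'\d{3}-\d{3}-\d{4}', c)]
print("anchored =", anchored)
print("full     =", full)

lines = 'first line\nsecond line'
first_only = re.findall(r'^\w+', lines)               # ^ matches only at the string start
each_line = re.findall(r'^\w+', lines, re.MULTILINE)  # ^ matches after every newline too
print("first_only =", first_only)
print("each_line  =", each_line)
```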


## Match
The re.match function only matches at the beginning of the string. It returns a match object (or None when there is no match), which provides the following methods:

| Method | Explanation |
| --- | --- |
| group() | the matched substring |
| start() | start position of the match |
| end() | end position of the match (exclusive) |
| span() | (start, end) of the match |


In [8]:
import re
msg = 'Johnson, Johnnason and Johnnathan will attend my party tonight'
print("msg =", msg)
pattern = 'John(son|nason|nathan)'
print("pattern =", pattern)
result = re.match(pattern, msg)
print("group = ", result.group(), "start = ", result.start(), "end = ", result.end(), "span = ", result.span())

msg = Johnson, Johnnason and Johnnathan will attend my party tonight
pattern = John(son|nason|nathan)
group =  Johnson start =  0 end =  7 span =  (0, 7)
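
Because `re.match` only looks at the beginning of the string, it can return `None` where `re.search` succeeds. The sketch below uses the same message with a pattern that first matches further along, and guards against the `None` case before calling methods on the result.

```python
import re

msg = 'Johnson, Johnnason and Johnnathan will attend my party tonight'
pattern = r'John(na)+son'

at_start = re.match(pattern, msg)    # the string starts with "Johnson", so no match
anywhere = re.search(pattern, msg)   # scans forward and finds "Johnnason"

print("match  =", at_start.group() if at_start is not None else None)
print("search =", anywhere.group(), "span =", anywhere.span())
```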


## Substitution
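We can call re.sub to replace every match of a regular expression with a replacement string. In the replacement, \1 refers to the text captured by the first group.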

In [9]:
# The following example masks each agent's name, keeping only the first letter
import re

msg = 'CIA Mark told CIA Linda that secret USB had been given to CIA Peter.'
print('Message:', msg)
pattern = r'CIA (\w)\w*'
print('Pattern:', pattern)
newstr = r'CIA \1***'
print('New String:', newstr)
txt = re.sub(pattern, newstr, msg)
print('After substitution:', txt)


Message: CIA Mark told CIA Linda that secret USB had been given to CIA Peter.
Pattern: CIA (\w)\w*
New String: CIA \1***
After substitution: CIA M*** told CIA L*** that secret USB had been given to CIA P***.
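
The replacement passed to `re.sub` can also be a function that receives each match object and returns the replacement text. As a minimal sketch, the hypothetical helper `mask` below keeps the first letter and writes one asterisk per remaining letter, and `re.subn` additionally reports how many replacements were made.

```python
import re

msg = 'CIA Mark told CIA Linda that secret USB had been given to CIA Peter.'

def mask(match):
    """Keep the first letter of the name and mask the rest (illustrative helper)."""
    name = match.group(1)
    return 'CIA ' + name[0] + '*' * (len(name) - 1)

masked, count = re.subn(r'CIA (\w+)', mask, msg)
print('After substitution:', masked)
print('Replacements made:', count)
```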


### Split String on any of multiple delimiters

In [10]:
import re

line = 'abc efg; afrd, fjek, asdf,     foo'
# split on a semicolon, comma, or whitespace character, followed by any extra whitespace
print(re.split(r'[;,\s]\s*', line))

['abc', 'efg', 'afrd', 'fjek', 'asdf', 'foo']


## Shell Wildcard Patterns

In [11]:
from fnmatch import fnmatch, fnmatchcase
print("fnmatch('foo.txt', '*.txt') ", fnmatch('foo.txt', '*.txt'))
print("fnmatch('foo.txt', '?oo.txt') : ", fnmatch('foo.txt', '?oo.txt'))
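# Note: fnmatch() follows the operating system's case rules, so this call
# returns True on Windows but False on Linux and macOS.
print("fnmatch('Dat45.csv', 'dat[0-9]*') : ", fnmatch('Dat45.csv', 'dat[0-9]*'))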


fnmatch('foo.txt', '*.txt')  True
fnmatch('foo.txt', '?oo.txt') :  True
fnmatch('Dat45.csv', 'dat[0-9]*') :  True
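
When the result should not depend on the operating system, `fnmatchcase` compares names case-sensitively everywhere. The sketch below uses a made-up list of file names; it also calls `fnmatch.translate`, which converts a wildcard pattern into a regular expression whose exact text varies between Python versions.

```python
import re
from fnmatch import fnmatchcase, translate

# Hypothetical file names, used only for illustration.
names = ['Dat45.csv', 'dat12.csv', 'foo.txt', 'config.ini']

case_sensitive = fnmatchcase('Dat45.csv', 'dat[0-9]*')   # "D" does not match "d"
csv_files = [n for n in names if fnmatchcase(n, '*.csv')]
print("case_sensitive =", case_sensitive)
print("csv_files =", csv_files)

# translate() returns a regex string that matches the whole name.
txt_regex = translate('*.txt')
is_txt = re.match(txt_regex, 'foo.txt') is not None
is_backup_txt = re.match(txt_regex, 'foo.txt.bak') is not None
print("is_txt =", is_txt, "; is_backup_txt =", is_backup_txt)
```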


### HTML and XML escape

In [12]:
s = 'Elements are written as "<tag>text</tag>".'
import html
print("s = ", s)
print("html.escape(s) = ", html.escape(s))

s = 'The prompt is &gt;&gt;&gt;'
from xml.sax.saxutils import unescape
print("unescape(s) = ", unescape(s))

s =  Elements are written as "<tag>text</tag>".
html.escape(s) =  Elements are written as &quot;&lt;tag&gt;text&lt;/tag&gt;&quot;.
unescape(s) =  The prompt is >>>
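
The `html` module also provides `html.unescape`, which reverses `html.escape`, and a `quote` parameter that controls whether quotation marks are escaped. A minimal sketch using the same strings as above:

```python
import html

s = 'Elements are written as "<tag>text</tag>".'

escaped = html.escape(s)                 # escapes &, <, > and quotes
round_trip = html.unescape(escaped)      # restores the original string
no_quotes = html.escape(s, quote=False)  # leaves the double quotes untouched
prompt = html.unescape('The prompt is &gt;&gt;&gt;')

print("round_trip == s :", round_trip == s)
print("no_quotes =", no_quotes)
print("prompt =", prompt)
```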


## Exercise
12.1 Please write a regular expression to match an IPv4 address. An IPv4 address has the format a.b.c.d, where a, b, c, and d are each between 0 and 255.

12.2 Please write a regular expression to match an email address, such as billgates@microsoft.com or bill123@gmail.com.