
# Advanced Regular Expressions

- https://github.com/rexdwyer/Splitsville/blob/master/Splitsville.ipynb
- http://www.python-course.eu/python3_re.php
- http://www.python-course.eu/python3_re_advanced.php
- https://github.com/tartley/python-regex-cheatsheet/blob/master/cheatsheet.rst

Flags accepted by `re.compile()` and the module-level functions. Combine several with `|`:

```
re.I == re.IGNORECASE   Ignore case
re.L == re.LOCALE       Make \w, \b, and \s locale dependent
                        (bytes patterns only in Python 3)
re.M == re.MULTILINE    Multiline ('^' and '$' also match at the
                        start and end of each line)
re.S == re.DOTALL       Dot matches all (including newline)
re.U == re.UNICODE      Make \w, \b, \d, and \s unicode dependent
                        (already the default for str patterns in Python 3)
re.X == re.VERBOSE      Verbose (unescaped whitespace in pattern
                        is ignored, and '#' starts a comment)
```
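As a small illustration of combining flags, the sketch below uses the standard-library `re` module; the sample text and the variable names (`flag_matches`, `date_pattern`, `date_parts`) are made up for this example.

```python
import re

text = "first line\nSecond LINE\nthird line"

# re.I ignores case; re.M lets ^ and $ match at every line boundary.
pattern = re.compile(r"^\w+ line$", re.I | re.M)
flag_matches = pattern.findall(text)
print(flag_matches)  # ['first line', 'Second LINE', 'third line']

# re.X lets the pattern span several lines and carry comments.
date_pattern = re.compile(r"""
    (\d{1,2})   # month
    /
    (\d{1,2})   # day
    /
    (\d{2,4})   # year
""", re.X)
date_parts = date_pattern.search("born on 1/23/2013").groups()
print(date_parts)  # ('1', '23', '2013')
```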

Module level functions:

```
compile(pattern[, flags]) -> RegexObject
match(pattern, string[, flags]) -> MatchObject
search(pattern, string[, flags]) -> MatchObject
findall(pattern, string[, flags]) -> list of strings
finditer(pattern, string[, flags]) -> iter of MatchObjects
split(pattern, string[, maxsplit, flags]) -> list of strings
sub(pattern, repl, string[, count, flags]) -> string
subn(pattern, repl, string[, count, flags]) -> (string, int)
escape(string) -> string
purge() # the re cache
```
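The functions that the cells below do not exercise (`split`, `subn`, `escape`) can be sketched as follows. The input strings are invented for illustration, and the standard-library `re` module is assumed.

```python
import re

# split: break a string on a pattern; maxsplit caps the number of splits.
fields = re.split(r"\s*,\s*", "a , b,c ,d", maxsplit=2)
print(fields)  # ['a', 'b', 'c ,d']

# subn: like sub, but also reports how many replacements were made.
result, count = re.subn(r"\d+", "#", "10 apples and 25 pears")
print(result, count)  # # apples and # pears 2

# escape: make a literal string safe to embed in a pattern
# (here the '+' would otherwise be read as a quantifier).
literal = re.escape("1+1=2")
found = re.search(literal, "so 1+1=2 holds") is not None
print(found)  # True
```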

RegexObjects (returned from compile()):

```
.match(string[, pos, endpos]) -> MatchObject
.search(string[, pos, endpos]) -> MatchObject
.findall(string[, pos, endpos]) -> list of strings
.finditer(string[, pos, endpos]) -> iter of MatchObjects
.split(string[, maxsplit]) -> list of strings
.sub(repl, string[, count]) -> string
.subn(repl, string[, count]) -> (string, int)
.flags      # int, Passed to compile()
.groups     # int, Number of capturing groups
.groupindex # {}, Maps group names to ints
.pattern    # string, Passed to compile()
```
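A minimal sketch of the compiled-pattern attributes and the `pos` argument, assuming the standard-library `re` module. The group names `month`, `day`, and `year` are chosen here for illustration and do not appear elsewhere in this notebook.

```python
import re

p = re.compile(r"(?P<month>\d{1,2})/(?P<day>\d{1,2})/(?P<year>\d{2,4})")
t = "The quick brown fox born on 1/23/2013 jumped over the lazy dog born on 10/6/10."

print(p.groups)            # 3
print(dict(p.groupindex))  # {'month': 1, 'day': 2, 'year': 3}

first = p.search(t)
# pos tells search() where to start looking: here, just past the first match.
second = p.search(t, first.end())
print(first.group(), second.group())  # 1/23/2013 10/6/10
```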

In [11]:
# https://pypi.python.org/pypi/regex
# http://www.rexegg.com/regex-python.html
# https://github.com/rexdwyer/Splitsville/blob/master/Splitsville.ipynb

import regex as re
t = 'The quick brown fox born on 1/23/2013 jumped over the lazy dog born on 10/6/10.'

## match and search

- `match`: succeeds only if the pattern matches at the **beginning** of the string.
- `search`: scans the whole string and returns the first place where the pattern matches a **substring**.

In [10]:
re.match(r'The', t)

<regex.Match object; span=(0, 3), match='The'>

Normally, if we were going to reuse a pattern, we'd compile it and use the `match` method of the resulting pattern object, like this:

In [13]:
p = re.compile(r'The')
p.match(t)

<regex.Match object; span=(0, 3), match='The'>

In [17]:
print(re.match(r'the',t))

None


**search** scans the whole string and returns the first match. Here it finds the lowercase **the** before **lazy dog**.

In [28]:
re.search(r'the',t)

<regex.Match object; span=(50, 53), match='the'>

**ignore case**

In [23]:
re.search(r'the', t, re.IGNORECASE)

<regex.Match object; span=(0, 3), match='The'>

In [25]:
re.search(r'(?i)the', t) # or use inline modifiers

<regex.Match object; span=(0, 3), match='The'>

## A Closer Look at the Match Objects

In [90]:
m = re.search(r'([0-9]+).*: (.*)', "Customer number: 232454, Date: February 12, 2011")

print(m.group()) # the whole match
print(m.groups()) # tuple of captured groups
print(m.span())

232454, Date: February 12, 2011
('232454', 'February 12, 2011')
(17, 48)
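Individual groups can also be read by index or by name, along with their positions. The sketch below reuses the same pattern with named groups; the names `number` and `date` are an addition for illustration, and the standard-library `re` module is assumed.

```python
import re

m = re.search(r"(?P<number>[0-9]+).*: (?P<date>.*)",
              "Customer number: 232454, Date: February 12, 2011")

print(m.group(1))            # 232454
print(m.group("date"))       # February 12, 2011
print(m.start(1), m.end(1))  # 17 23
print(m.groupdict())         # {'number': '232454', 'date': 'February 12, 2011'}
```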


## findall and finditer

---

`findall()` returns every non-overlapping match of the pattern as a list, but doesn't give the positions. (`I` is short for `IGNORECASE`.)

In [43]:
t="A fat cat doesn't eat oat but a rat eats bats."

In [44]:
m = re.findall("[force]at", t)

In [46]:
print(m)

['fat', 'cat', 'eat', 'oat', 'rat', 'eat']


In [63]:
m = re.findall(r'([0-9]+)', "Customer number: 232454, Date: February 12, 2011")
print(m)

['232454', '12', '2011']
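What `findall()` returns depends on the number of capturing groups in the pattern. A short sketch using the sentence `t` from the top of the notebook, with the standard-library `re` module; the names `whole` and `parts` are illustrative:

```python
import re

t = 'The quick brown fox born on 1/23/2013 jumped over the lazy dog born on 10/6/10.'

# No groups: each element is the whole match.
whole = re.findall(r"\d+/\d+/\d+", t)
print(whole)  # ['1/23/2013', '10/6/10']

# Two or more groups: each element is a tuple of the captured groups.
parts = re.findall(r"(\d+)/(\d+)/(\d+)", t)
print(parts)  # [('1', '23', '2013'), ('10', '6', '10')]
```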


In [92]:
import re

regex = r"([0-9]+)(aa)"
test_str = "2231aazxczxc21zxc4652aa"

matches = re.finditer(regex, test_str)

# A more detailed report of each match and its groups (left commented out
# so that the output below stays short):
#
# for match_num, match in enumerate(matches, start=1):
#     print(f"Match {match_num} was found at {match.start()}-{match.end()}: {match.group()}")
#     for group_num in range(1, len(match.groups()) + 1):
#         print(f"Group {group_num} found at {match.start(group_num)}-{match.end(group_num)}: {match.group(group_num)}")

for m in matches:
    print(m.group())
    print(m.groups())

2231aa
('2231', 'aa')
4652aa
('4652', 'aa')
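Because `finditer()` yields match objects, it can recover the positions that `findall()` leaves out. A minimal sketch on the earlier cat sentence, assuming the standard-library `re` module; `positions` is a name introduced here:

```python
import re

t = "A fat cat doesn't eat oat but a rat eats bats."

# Pair each matched string with its (start, end) span.
positions = [(m.group(), m.span()) for m in re.finditer(r"[force]at", t)]

for word, span in positions:
    print(word, span)
# first line printed: fat (2, 5)
```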


## Compiling Regular Expressions

---

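`re.compile(pattern[, flags])` returns a pattern object whose methods (`match`, `search`, `findall`, `sub`, and so on) can be called repeatedly without passing the pattern again.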
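A small sketch of reusing one compiled pattern across several strings, assuming the standard-library `re` module. The names `date_re`, `lines`, and `dated` are made up for this example.

```python
import re

# Compile once, then reuse the pattern object for every line.
date_re = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")

lines = [
    "fox born on 1/23/2013",
    "no date here",
    "dog born on 10/6/10",
]
dated = [line for line in lines if date_re.search(line)]
print(dated)  # ['fox born on 1/23/2013', 'dog born on 10/6/10']
```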

## Search and Replace with sub

---

In [95]:
import re
s = "yes I said yes I will Yes."
res = re.sub("[yY]es","no", s)
print(res)

no I said no I will no.
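`sub` also accepts backreferences in the replacement string, a function as `repl`, and a `count` limit. The sketch below assumes the standard-library `re` module; `preserve_case` is a hypothetical helper written for this example.

```python
import re

t = "born on 1/23/2013 and 10/6/10"

# Backreferences: reorder month/day/year into day.month.year.
swapped = re.sub(r"(\d+)/(\d+)/(\d+)", r"\2.\1.\3", t)
print(swapped)  # born on 23.1.2013 and 6.10.10

# A function as repl receives each match object and returns the replacement.
def preserve_case(m):
    word = m.group()
    return "No" if word[0].isupper() else "no"

s = "yes I said yes I will Yes."
cased = re.sub(r"[yY]es", preserve_case, s)
print(cased)  # no I said no I will No.

# count limits how many replacements are made.
once = re.sub(r"[yY]es", "no", s, count=1)
print(once)  # no I said yes I will Yes.
```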
