#### Diving In

Getting a small bit of text out of a large block of text is a challenge. In Python, strings have methods for searching and replacing: `index()`, `find()`, `split()`, `count()`, `replace()`, etc. But these methods are limited to the simplest of cases. For example, the `index()` method looks for a single, hard-coded substring, and the search is always case sensitive.
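
A quick illustration of those limits, as a minimal sketch (the mixed-case sample string is made up for this example):

```python
s = "100 North Main Road"

# find() looks for one literal substring, and the search is case sensitive.
print(s.find("ROAD"))           # -1, because "ROAD" != "Road"
print(s.find("Road"))           # 15

# index() behaves like find() but raises ValueError instead of returning -1.
try:
    s.index("ROAD")
except ValueError as err:
    print("index() failed:", err)

# A common workaround is to normalize the case before searching.
print(s.upper().find("ROAD"))   # 15
```

The workaround is fine for one substring, but it does not scale to patterns such as "ROAD only as a whole word".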

If your goal can be accomplished using string methods, use them. They're fast, simple, and easy to read. But if you find yourself using a lot of different string functions with `if` statements and special cases, or if you are chaining calls to `split()` and `join()` to slice and dice your string, you need to move to regular expressions.

Regular expressions are a powerful way to search, replace, and parse text with complex character patterns.

##### Case Study : Street Addresses

This case study is about scrubbing and standardizing street addresses. The goal is to standardize a street address so that `ROAD` is always abbreviated as `RD.`

The first attempt replaces `ROAD` with `RD.` using the string `replace()` method.

In [6]:
s = '100 NORTH MAIN ROAD'
s.replace('ROAD', 'RD.')

'100 NORTH MAIN RD.'

But this fails on the following example, because `ROAD` also appears inside `BROAD`:

In [9]:
s = '100 NORTH BROAD ROAD'
s.replace('ROAD', 'RD.')

'100 NORTH BRD. RD.'

By limiting the replacement to the last 4 characters of the string, we get the right result:

In [11]:
s[:-4] + s[-4:].replace('ROAD', 'RD.')

'100 NORTH BROAD RD.'

Using a regex, we can achieve the same result with simpler, more readable code:

In [12]:
import re
re.sub('ROAD$', 'RD.', s)

'100 NORTH BROAD RD.'

The first parameter is `ROAD$`. This is a simple regular expression that matches `ROAD` only when it occurs at the end of the string. The `$` means "end of string".

Using the `re.sub()` function, you search the string `s` for the regular expression `ROAD$` and replace it with `RD.`. This matches the `ROAD` at the end of the string `s`, but does *not* match the `ROAD` that's part of the word `BROAD`, because that is in the middle of `s`.
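
To see which occurrence the anchor selects, here is a small sketch using `re.search()`, which returns a match object with the position of the match. The variable names are only for illustration.

```python
import re

s = "100 NORTH BROAD ROAD"

# Without the anchor, the first ROAD found is the one inside BROAD.
unanchored = re.search("ROAD", s)
print(unanchored.start())   # 11

# With $, only the ROAD that ends the string can match.
anchored = re.search("ROAD$", s)
print(anchored.start())     # 16
```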

In [16]:
# cases where the logic doesn't work
s = '100 BROAD'
re.sub('ROAD$', 'RD.', s)

'100 BRD.'

In [17]:
re.sub('\\bROAD$', 'RD.', s)

'100 BROAD'

In [18]:
s = '100 BROAD ROAD APT 5'
re.sub('\\bROAD$', 'RD.', s)

'100 BROAD ROAD APT 5'

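In the last example, `ROAD` is no longer at the end of the string, so the `$` anchor prevents a match. What we need is to match `ROAD` only when it is a whole word on its own, wherever it appears in the string. The `\b` means a word *b*oundary: the position between a word character (a letter, digit, or underscore) and a non-word character, or the start or end of the string. It matches a position, not the whitespace itself.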

In Python, using `\` in a string is complicated by the fact that the `\` character must itself be escaped, which is why the examples above write `'\\b'`. The alternative is to use a raw string. Strings prefixed with the letter `r` are raw strings. In a raw string, `\t` is not the tab character; it is a string of length 2, containing the characters `\` and `t`.
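
The difference is easy to check in the interpreter. A minimal sketch (variable names are illustrative):

```python
escaped = "\\bROAD\\b"   # regular string: each backslash is written twice
raw = r"\bROAD\b"        # raw string: backslashes are kept as typed

print(escaped == raw)    # True, both spell the same regex
print(len("\t"))         # 1, a single tab character
print(len(r"\t"))        # 2, a backslash followed by the letter t
print(len("\b"))         # 1, a backspace character, not a word boundary
```

The last line shows why forgetting the `r` prefix is easy to miss: `"\bROAD\b"` is a valid string, but it asks the regex engine to look for backspace characters.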

In [20]:
re.sub(r'\bROAD\b', 'RD.', s)

'100 BROAD RD. APT 5'
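
A few more inputs help show what counts as a boundary. This is a sketch with made-up addresses, not an exhaustive test:

```python
import re

whole_word = re.compile(r"\bROAD\b")

print(bool(whole_word.search("100 BROAD")))             # False: the B before ROAD is a word character
print(bool(whole_word.search("100 BROAD ROAD APT 5")))  # True: spaces on both sides
print(bool(whole_word.search("100 MAIN ROAD")))         # True: the end of the string is a boundary
print(bool(whole_word.search("100 MAIN ROAD, APT 5")))  # True: a comma is not a word character
print(bool(whole_word.search("100 MAIN ROADS")))        # False: the S after ROAD is a word character
```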

##### Case Study : Roman Numerals

Roman numerals are a system of representing numbers that dates back to the ancient Roman empire. In Roman numerals, there are seven characters that are repeated and combined in various ways to represent numbers.

* I = 1
* V = 5
* X = 10
* L = 50
* C = 100
* D = 500
* M = 1000

The following are general rules for constructing Roman numerals.


* Sometimes characters are additive. I is 1, II is 2, and III is 3. VI is 6 (literally, “5 and 1”), VII is 7, and VIII is 8.
* The tens characters (I, X, C, and M) can be repeated up to three times. At 4, you need to subtract from the next highest fives character. You can't represent 4 as IIII; instead, it is represented as IV (“1 less than 5”). 40 is written as XL (“10 less than 50”), 41 as XLI, 42 as XLII, 43 as XLIII, and then 44 as XLIV (“10 less than 50, then 1 less than 5”).
* Sometimes characters are the opposite of additive. By putting certain characters before others, you subtract from the final value. For example, at 9, you need to subtract from the next highest tens character: 8 is VIII, but 9 is IX (“1 less than 10”), not VIIII (since the I character can not be repeated four times). 90 is XC, 900 is CM.
* The fives characters can not be repeated. 10 is always represented as X, never as VV. 100 is always C, never LL.
* Roman numerals are read left to right, so the order of characters matters very much. DC is 600; CD is a completely different number (400, “100 less than 500”). CI is 101; IC is not even a valid Roman numeral (because you can't subtract 1 directly from 100; you would need to write it as XCIX, “10 less than 100, then 1 less than 10”).
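
The additive and subtractive rules can be sketched as a small conversion function. `roman_to_int` and `VALUES` are hypothetical names introduced for this illustration; the function only computes a value and does not validate its input, which is the job of the regular expression developed in the rest of this section.

```python
# Hypothetical helper for illustration; it assumes the input uses only the seven characters.
VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def roman_to_int(numeral: str) -> int:
    """Add each character's value, subtracting it when a larger character follows."""
    total = 0
    for i, ch in enumerate(numeral):
        value = VALUES[ch]
        next_value = VALUES[numeral[i + 1]] if i + 1 < len(numeral) else 0
        if value < next_value:
            total -= value   # subtractive case, e.g. the I in IV
        else:
            total += value   # additive case, e.g. the I in VI
    return total

print(roman_to_int("VIII"))   # 8
print(roman_to_int("XLIV"))   # 44
print(roman_to_int("DC"))     # 600
print(roman_to_int("CD"))     # 400
print(roman_to_int("IC"))     # 99, even though IC is not a valid Roman numeral
```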


**Write a Regex for parsing Roman Numerals**

In [31]:
import re

p = re.compile(r"^(M?M?M?)$")
m = p.match("MMM")
print("Starting position of the match is {0}".format(m.start()))
print("Ending position of the match is {0}".format(m.end()))
print("Matched string is {0}".format(m.group(0)))

Starting position of the match is 0
Ending position of the match is 3
Matched string is MMM
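
Because each `M?` is optional, the thousands pattern accepts anything from the empty string up to three Ms, and nothing longer. A short sketch to confirm that:

```python
import re

thousands = re.compile(r"^(M?M?M?)$")

for s in ["", "M", "MM", "MMM", "MMMM"]:
    print(repr(s), thousands.match(s) is not None)
# Only the last input, MMMM, fails to match.
```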


In [121]:
import typing, re

def print_match(s: str, m: typing.Optional[re.Match]) -> None:
    """Print whether the match succeeded and, if so, its first two groups."""
    print("For the example {0}".format(s))
    print("Result of match is {0}".format(m is not None))
    if m:
        print("Number of groups is {0}".format(len(m.groups())))
        print("First group is is {0[0]}".format(m.groups()))
        print("Second group is is {0[1]}".format(m.groups()))

    print("-------\n")

def check_pattern(pattern: str, tests: typing.List[str], flags: int = 0) -> None:
    """Compile the pattern once and report the match result for each test string."""
    p = re.compile(pattern, flags)
    for s in tests:
        print_match(s, p.match(s))

**Checking for hundreds**

The hundreds place is more difficult than the thousands because there are several mutually exclusive ways it could be expressed:

* 100 = C
* 200 = CC
* 300 = CCC
* 400 = CD
* 500 = D
* 600 = DC
* 700 = DCC
* 800 = DCCC
* 900 = CM

In [122]:
check_pattern(r"^(M?M?M?)(D?C?C?C?|CD|CM)$", ["MMM", "MMMCM",
                                              "MMMC", "MMMCD",
                                              "MMMDC", "MMMDCD"])


For the example MMM
Result of match is True
Number of groups is 2
First group is is MMM
Second group is is 
-------

For the example MMMCM
Result of match is True
Number of groups is 2
First group is is MMM
Second group is is CM
-------

For the example MMMC
Result of match is True
Number of groups is 2
First group is is MMM
Second group is is C
-------

For the example MMMCD
Result of match is True
Number of groups is 2
First group is is MMM
Second group is is CD
-------

For the example MMMDC
Result of match is True
Number of groups is 2
First group is is MMM
Second group is is DC
-------

For the example MMMDCD
Result of match is False
-------



In [124]:
# same result as the previous example, but uses the {n,m} repetition syntax
check_pattern(r"^(M{0,3})(D?C{0,3}|CD|CM)$", ["MMM", "MMMCM",
                                              "MMMC", "MMMCD",
                                              "MMMDC", "MMMDCD"])

For the example MMM
Result of match is True
Number of groups is 2
First group is is MMM
Second group is is 
-------

For the example MMMCM
Result of match is True
Number of groups is 2
First group is is MMM
Second group is is CM
-------

For the example MMMC
Result of match is True
Number of groups is 2
First group is is MMM
Second group is is C
-------

For the example MMMCD
Result of match is True
Number of groups is 2
First group is is MMM
Second group is is CD
-------

For the example MMMDC
Result of match is True
Number of groups is 2
First group is is MMM
Second group is is DC
-------

For the example MMMDCD
Result of match is False
-------
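
The two spellings should accept exactly the same strings. As a rough check under that assumption, the sketch below compares them on every string of up to seven characters drawn from `M`, `D`, and `C`. The names `old_style` and `new_style` are only for illustration.

```python
import itertools
import re

old_style = re.compile(r"^(M?M?M?)(D?C?C?C?|CD|CM)$")
new_style = re.compile(r"^(M{0,3})(D?C{0,3}|CD|CM)$")

# Compare the two patterns on all 3280 candidate strings of length 0 to 7.
agree = all(
    bool(old_style.match("".join(chars))) == bool(new_style.match("".join(chars)))
    for n in range(8)
    for chars in itertools.product("MDC", repeat=n)
)
print(agree)   # True
```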



**Checking for tens and ones**

The tens and ones places are similar to the hundreds: there are several mutually exclusive ways each could be expressed. The tens are:

* 10 = X
* 20 = XX
* 30 = XXX
* 40 = XL
* 50 = L
* 60 = LX
* 70 = LXX
* 80 = LXXX
* 90 = XC
    
The ones group is

* 1 = I
* 2 = II
* 3 = III
* 4 = IV
* 5 = V
* 6 = VI
* 7 = VII
* 8 = VIII
* 9 = IX

In [125]:
pattern = r"^(M{0,3})(D?C{0,3}|CD|CM)(L?X{0,3}|XL|XC)(V?I{0,3}|IV|IX)$"

check_pattern(pattern, ["MMM", "MMMCM",
                        "MMMC", "MMMCD",
                        "MMMDC", "MMMDCD",
                        "CDLX", "MDCCCLXXX",
                        "MDCMLXXX", "MCMLXXX"
                       ])

For the example MMM
Result of match is True
Number of groups is 4
First group is is MMM
Second group is is 
-------

For the example MMMCM
Result of match is True
Number of groups is 4
First group is is MMM
Second group is is CM
-------

For the example MMMC
Result of match is True
Number of groups is 4
First group is is MMM
Second group is is C
-------

For the example MMMCD
Result of match is True
Number of groups is 4
First group is is MMM
Second group is is CD
-------

For the example MMMDC
Result of match is True
Number of groups is 4
First group is is MMM
Second group is is DC
-------

For the example MMMDCD
Result of match is False
-------

For the example CDLX
Result of match is True
Number of groups is 4
First group is is 
Second group is is CD
-------

For the example MDCCCLXXX
Result of match is True
Number of groups is 4
First group is is M
Second group is is DCCC
-------

For the example MDCMLXXX
Result of match is False
-------

For the example MCMLXXX
Result of match is True
Number of groups is 4
First group is is M
Second group is is CM
-------
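
As a broader check than a handful of hand-picked strings, the sketch below generates the standard numeral for every number from 1 to 3999 and confirms the pattern accepts each one. `int_to_roman` and `SYMBOLS` are hypothetical helpers written for this check only.

```python
import re

pattern = re.compile(r"^(M{0,3})(D?C{0,3}|CD|CM)(L?X{0,3}|XL|XC)(V?I{0,3}|IV|IX)$")

# Hypothetical helper: values in descending order, including the subtractive pairs.
SYMBOLS = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
           (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
           (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]

def int_to_roman(n: int) -> str:
    """Build the numeral greedily, largest symbol first."""
    parts = []
    for value, symbol in SYMBOLS:
        count, n = divmod(n, value)
        parts.append(symbol * count)
    return "".join(parts)

print(int_to_roman(1980))                                             # MCMLXXX
print(all(pattern.match(int_to_roman(n)) for n in range(1, 4000)))    # True
print(bool(pattern.match("")))                                        # True
```

The last line points at a quirk worth knowing: since every group is optional, the pattern also matches the empty string, so a caller that needs a real numeral has to reject empty input separately.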

*Quick side bar on strings in `print` function calls*

Python 3.6 and later support string interpolation with the `f` prefix (f-strings). An f-string can contain arbitrary expressions, including function calls, inside the braces, which is not possible with the positional argument syntax (`{0}`) of `format()`.

As an example, the `print` call below is invalid, because `format()` treats `len(0)` as the name of a keyword argument and raises a `KeyError`:

    print("Length of list is {len(0)}".format([1,2,3]))

whereas the following works with an f-string:

    print(f"Length of list is {len([1,2,3])}")
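
The sketch below runs both styles side by side; `items` is just a sample list.

```python
items = [1, 2, 3]

# str.format() treats the text inside the braces as a field name, not as code.
try:
    print("Length of list is {len(0)}".format(items))
except KeyError as err:
    print("format() failed with KeyError:", err)

# Two alternatives that work: call len() outside the template, or use an f-string.
print("Length of list is {0}".format(len(items)))
print(f"Length of list is {len(items)}")
```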

##### Verbose Regular Expressions

The compact syntax of regular expressions makes the code difficult to comprehend.

Python allows you to write verbose regular expressions by using multiline strings. In verbose regex

* Whitespace is ignored. Spaces, tabs, and newlines are not matched. If you want to match a whitespace character, you need to escape it by putting a backslash in front of it.
* Comments are ignored. As in Python code, a comment starts with `#` and goes to the end of the line.
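
Both rules can be seen in a small sketch. The patterns and addresses here are made up for illustration.

```python
import re

# Unescaped whitespace and comments are dropped, so this pattern is just MAINROAD.
ignored = re.compile(r"""
    MAIN    # first word
    ROAD    # second word
""", re.VERBOSE)
print(bool(ignored.search("100 MAIN ROAD")))   # False
print(bool(ignored.search("100 MAINROAD")))    # True

# A backslash followed by a space matches one literal space.
escaped = re.compile(r"""
    MAIN
    \       # escaped space
    ROAD
""", re.VERBOSE)
print(bool(escaped.search("100 MAIN ROAD")))   # True
```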

Let's rewrite the Roman numeral regex in verbose form.

    pattern = r"^(M{0,3})(D?C{0,3}|CD|CM)(L?X{0,3}|XL|XC)(V?I{0,3}|IV|IX)$"
    
Using a *verbose regular expression* **requires** passing the **re.VERBOSE** flag when the regex is compiled. If the flag is *not specified*, the pattern is treated as a *non-verbose* (compact) regex, so its whitespace and comments are matched literally.
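
A minimal sketch of what happens when the flag is forgotten, using a shortened thousands-only pattern:

```python
import re

pattern = r"""
    ^          # beginning of string
    (M{0,3})   # thousands
    $          # end of string
"""

print(bool(re.match(pattern, "MMM", re.VERBOSE)))   # True
print(bool(re.match(pattern, "MMM")))               # False
```

Without the flag, the pattern begins with a literal newline and spaces, which the input `MMM` does not contain, so the match fails silently rather than raising an error.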


In [128]:
pattern = r"""
    ^            # beginning of string
    (M{0,3})     # thousands: 0 to 3 Ms, captured as a group
    (D?C{0,3}|CD|CM) # hundreds: CM is 900, CD is 400,
                     # optional D followed by 0-3 Cs covers 500, 600, 700 and 800,
                     # 0-3 Cs (without the D prefix) cover 100, 200 and 300
    (L?X{0,3}|XL|XC) # tens are handled the same way: XC is 90, XL is 40,
                     # optional L followed by 0-3 Xs covers 50, 60, 70 and 80,
                     # 0-3 Xs (without the L prefix) cover 10, 20 and 30
    (V?I{0,3}|IV|IX) # ones are handled the same way as tens and hundreds
    $            # end of string
"""

check_pattern(pattern, ["MMM", "MMMCM",
                        "MMMC", "MMMCD",
                        "MMMDC", "MMMDCD",
                        "CDLX", "MDCCCLXXX",
                        "MDCMLXXX", "MCMLXXX"
                       ], re.VERBOSE)

For the example MMM
Result of match is True
Number of groups is 4
First group is is MMM
Second group is is 
-------

For the example MMMCM
Result of match is True
Number of groups is 4
First group is is MMM
Second group is is CM
-------

For the example MMMC
Result of match is True
Number of groups is 4
First group is is MMM
Second group is is C
-------

For the example MMMCD
Result of match is True
Number of groups is 4
First group is is MMM
Second group is is CD
-------

For the example MMMDC
Result of match is True
Number of groups is 4
First group is is MMM
Second group is is DC
-------

For the example MMMDCD
Result of match is False
-------

For the example CDLX
Result of match is True
Number of groups is 4
First group is is 
Second group is is CD
-------

For the example MDCCCLXXX
Result of match is True
Number of groups is 4
First group is is M
Second group is is DCCC
-------

For the example MDCMLXXX
Result of match is False
-------

For the example MCMLXXX
Result of match is True
Number of groups is 4
First group is is M
Second group is is CM
-------

##### Exercise - Parsing Phone numbers

Write a regular expression for parsing phone numbers with optional extensions. Some examples

* 800-555-1212
* 800 555 1212
* 800.555.1212
* (800) 555-1212
* 1-800-555-1212
* 800-555-1212-1234
* 800-555-1212x1234
* 800-555-1212 ext. 1234
* work 1-(800) 555.1212 #1234

Hint: `\d` matches any numeric digit (0-9); `\D` matches anything *but* digits.
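
To get a feel for the two classes before writing the full pattern, here is a short sketch using `re.findall()` and `re.sub()` on one of the examples above:

```python
import re

s = "work 1-(800) 555.1212 #1234"

print(re.findall(r"\d", "800-555"))   # ['8', '0', '0', '5', '5', '5']
print(re.findall(r"\d+", s))          # ['1', '800', '555', '1212', '1234']
print(re.sub(r"\D", "", s))           # 180055512121234
```

Stripping every non-digit is tempting, but it loses the boundaries between country code, number, and extension, which is why the exercise asks for a pattern with groups.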


In [165]:
import typing, re

pattern = re.compile(r"""
    \D*     # ignore leading non-numeric text, for example: work
    \d?     # optional one-digit country code (1 in the examples above)
    \D*     # ignore any separator between country code and area code
    (\d{3}) # 3-digit area code, captured as a group
    \D*     # ignore any separator between area code and phone number
    (\d{3}) # 3 leading digits of the phone number
    \D*     # ignore any separator between leading and trailing digits
    (\d{4}) # 4 trailing digits of the phone number
    \D*     # ignore any text between phone number and extension
    (\d*)   # optional extension of any length
""", re.VERBOSE)



def check_phone(s: str, p: re.Pattern = pattern) -> None:
    """Parse a phone number and print it in a normalized form."""
    print("For the example {0}".format(s))
    m: typing.Optional[re.Match] = p.match(s)
    if m:
        groups: typing.Tuple[str, ...] = m.groups()
        has_ext = len(groups[-1]) > 0  # the last group is the extension; it may be empty
        if has_ext:
            print(f"Please call ({groups[0]})-{groups[1]}-{groups[2]} at extension x{groups[3]}")
        else:
            print(f"Please call ({groups[0]})-{groups[1]}-{groups[2]}")
    else:
        print("Dope!! unable to parse phone number, please check input")

    print("-------\n")

In [171]:
check_phone("work 1-(800) 555.1212 #1234")
check_phone("work 1-(800) 555.1212")
check_phone("18005551212")
check_phone("800-555-1212-1234")
check_phone("800-555-1212x1234")
check_phone("800-55-1212x1234")


For the example work 1-(800) 555.1212 #1234
Please call (800)-555-1212 at extension x1234
-------

For the example work 1-(800) 555.1212
Please call (800)-555-1212
-------

For the example 18005551212
Please call (800)-555-1212
-------

For the example 800-555-1212-1234
Please call (800)-555-1212 at extension x1234
-------

For the example 800-555-1212x1234
Please call (800)-555-1212 at extension x1234
-------

For the example 800-55-1212x1234
Dope!! unable to parse phone number, please check input
-------
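
For completeness, the same pattern can be run against every example listed in the exercise. This sketch restates the pattern in a condensed form under the illustrative name `phone` and prints the captured groups as a tuple:

```python
import re

phone = re.compile(r"""
    \D*\d?\D*    # leading text and optional one-digit country code
    (\d{3})\D*   # area code
    (\d{3})\D*   # first three digits
    (\d{4})\D*   # last four digits
    (\d*)        # optional extension
""", re.VERBOSE)

examples = [
    "800-555-1212",
    "800 555 1212",
    "800.555.1212",
    "(800) 555-1212",
    "1-800-555-1212",
    "800-555-1212-1234",
    "800-555-1212x1234",
    "800-555-1212 ext. 1234",
    "work 1-(800) 555.1212 #1234",
]

for s in examples:
    print(s, "->", phone.match(s).groups())
```

The first five examples yield an empty string for the extension group; the last four yield `1234`.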



##### Summary

This is just a taste of regular expressions. The topic is vast, and reading the documentation will always be necessary.

We played around with the following subset

* ^ matches beginning of string
* $ matches end of string
* \b matches a word boundary
* \d matches any numeric digit (0-9)
* \D matches any non-numeric character
* x? matches an optional x (zero or one occurrence)
* x* matches zero or more occurrences of x
* x+ matches one or more occurrences of x
* x{n,m} matches at least n and at most m occurrences of x
* (a|b|c) matches exactly one of a, b or c
* (x) captures the match as a remembered group. You can get the value of what matched by using the `groups()` method of the object returned by `re.match` or `re.search`
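
A compact recap of several of these in action, as a sketch with made-up inputs:

```python
import re

print(bool(re.search(r"^100", "100 MAIN ROAD")))     # True: ^ anchors at the start
print(bool(re.search(r"ROAD$", "100 MAIN ROAD")))    # True: $ anchors at the end
print(bool(re.search(r"\bROAD\b", "100 BROAD")))     # False: ROAD is not a whole word here
print(re.findall(r"\d+", "APT 5, UNIT 12"))          # ['5', '12']: + needs at least one digit
print(bool(re.fullmatch(r"X{1,3}", "XXXX")))         # False: at most three Xs are allowed
print(re.search(r"(CD|CM|D)", "MCM").group(1))       # CM: the alternative that matched
```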

Regular expressions are extremely powerful, but they are not the correct solution for every problem. You should learn about them to know when they are appropriate, when they will solve your problem, and when they will cause more problems than they solve.