# Regular Expressions

# Matching
Being able to *find* strings within other strings is an essential part of string manipulation. From web search to Word's find-and-replace to command line tools like `grep`, testing whether text matches a search pattern is a vital task.

## Matching strings
We know how to find out whether a (sub)string is inside another string using `in`:

In [16]:
print("ican" in "pelicans")
print("" in "pelicans") # the empty string is in everything
print("pelicans" in "pelicans") # note: inclusive of the whole string!
print("icant" in "pelicans")

True
True
True
False


If we wish to find a word in a string we can use `in` as well:

In [None]:
sentence = "The cat sat on the mat"

if "cat" in sentence:
    print("Yes")
else:
    print("No")

But this may not be what we wanted: `in` also finds "cat" inside other words, such as "fabrication".

In [None]:
sentence ="The fabrication of steel is expensive"

if "cat" in sentence:
    print("Yes")
else:
    print("No")

We could pad the search string with spaces so that it only matches the whole word.

In [None]:
sentence = "The cat sat on the mat"
sentence2 ="The fabrication of steel is expensive"

if " cat " in sentence:
    print("Yes")
else:
    print("No")
    
if " cat " in sentence2:
    print("Yes2")
else:
    print("No2")

However, we may wish to find "cat" and "cats" and "cat." as well. We can use regular expressions to do this.
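As a preview of where this is heading, here is a minimal sketch of a pattern that finds "cat", "cats" and "cat." but not "fabrication". It assumes two pieces of syntax beyond plain text: `?` (explained later in this section) and `\b`, a word-boundary marker that this section does not otherwise cover. The sentences are made up for illustration.

```python
import re

# \b marks a word boundary and s? makes the trailing "s" optional,
# so this matches "cat" or "cats" only as whole words.
pattern = r"\bcats?\b"

sentences = [
    "The cat sat on the mat",
    "Two cats sat on the mat",
    "The mat was under the cat.",
    "The fabrication of steel is expensive",
]

word_found = [re.search(pattern, s) is not None for s in sentences]
for sentence, found in zip(sentences, word_found):
    print(sentence.ljust(40), found)
```

Only the last sentence should report `False`, because its "cat" sits in the middle of a longer word.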

-----------

## Pattern matching
But what if we wanted to match things less precisely? If we didn't want just exact matches, but wanted to find strings that followed a certain pattern or template?

**Regular expressions** are an extremely powerful tool for doing this. They are sophisticated tools, and we will only scratch the surface. Regular expressions are an entire sub-language, which is widely available as part of most common programming languages or their standard libraries. They are also part of many command line utilities (`grep`, `sed`, etc.) and part of any "serious" text editor (`vim`, `emacs`, `notepad++`, `sublime`, `atom`, etc.).

Regular expressions have a syntax that can be obscure, but is very compact. They are powerful, so it is also easy to abuse them. But, for many cases, they are the tool of choice to solve text processing problems. 

### Regular expression patterns
A regular expression pattern is just a string, which uses **special characters** to represent ways in which the pattern can vary. Regular expression matching functions take a regular expression pattern and can match that against a string.

In the most basic case, a regular expression can just look for a literal string, just as `.find()` does:

In [13]:
import re # import the Regular Expression module
print(re.findall("ican", "pelican"))

['ican']


Say you have a partial solution to a crossword: `c_tl___`. How could you find all the words that match that partial solution?

## A placeholder: .

We can write a `.` (period, dot) in a regular expression to mean "any character can go here". Let's use the list of words from `words.txt` and solve this problem.

There is a function `re.match(pattern, string)` that determines whether the given pattern matches a given string, *with the pattern required to be present at the very start of the string.* Its sibling `re.search(pattern, string)` accepts a match anywhere in the string.

Let's write a function that takes a pattern and prints every word in the "words.txt" file that contains a match for it.

In [15]:
def match_words(pattern):
    with open("words.txt") as f:
        for line in f:
            word = line.strip() 
            if re.search(pattern, word):
                print(word)

cutlass
cutlasses
cutlers
cutlery
cutlets


If we pass in the word "cat" we get all words containing the substring "cat"

In [None]:
match_words("cat")

Find all the words that contain the substring 'ingly'

If we replace the a with a `.` it will now match any substring that has a c followed by any character and then a t.

In [None]:
match_words("c.t")

Find all the words that contain a letter then a w followed by 2 letters and a k then 2 more letters and then an "es".

## Repeat characters
We can also tell the matcher to allow characters to **repeat**. The most common ways of doing that are postfixing an expression element (e.g. a character or a character class) with:
* a `*` (repeat zero or more times)
* a `+` (repeat one or more times)
* a `?` (appears zero or one times)
* a `{n}` (appears exactly n times)
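The four repeat operators listed above can be compared side by side. This is a small sketch on made-up strings; it uses `re.fullmatch`, which only succeeds when the whole string fits the pattern, so that each result depends purely on how many `a` characters are allowed.

```python
import re

candidates = ["ct", "cat", "caat", "caaat"]

quantified = {
    "ca*t": "zero or more a",
    "ca+t": "one or more a",
    "ca?t": "zero or one a",
    "ca{2}t": "exactly two a",
}

# For each pattern, keep the candidates that match it in full
matches = {
    pattern: [s for s in candidates if re.fullmatch(pattern, s)]
    for pattern in quantified
}

for pattern, description in quantified.items():
    print(pattern.ljust(8), description.ljust(16), matches[pattern])
```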

In [None]:
# all words that contain fir and end with t
match_words("fir.*t$")

In [None]:
# all words that have f<something>t<something>ha<something>
match_words("f.*t.*ha.*")

In [None]:
# match nec<zero or more vowels><one or more s><something>y
match_words("nec[aeiou]*s+.*y")

Now, instead of a `.` (one single letter), `.*` allows anywhere from zero to many letters between the c and the t (so abduct is found).

In [None]:
match_words("c.*t")

Find strings similar to the last exercise, except this time there can be any number of letters between the w and the k (0...many).

`.+` insists on at least one letter between the c and the t, but allows many (so abduct is no longer found).

In [None]:
match_words("c.+t")

Now find all words with the original pattern, but this time allow any number of letters between the k and the "es", with at least one.

.? allows for zero or one letters to be between c and t but no more.

In [None]:
match_words("c.?t")

Now find all words that have at most one letter between both the w,k and k,es.

## Anchors: $ and ^
This works, but `match()` will match any string that matches the pattern at the start, regardless of what comes next (and `search()` will match the pattern anywhere in the string). This includes `cutlasses`, even though it has extra characters at the end.

We can use the special characters `^` (caret) and `$` (dollar) to mean `start of string` and `end of string`. For example, $ forces the pattern to only match if the string ends where the `$` is. These characters are called `anchors` because they anchor the pattern to the start or end of a string. (`^` isn't useful with `re.match()` but it is useful with other regular expression functions).
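The effect of each anchor can be seen by running the same core pattern with and without them. This is a minimal sketch over a made-up list (`sample_words` is an illustrative name, not the contents of `words.txt`), using `re.search` as `match_words` does.

```python
import re

sample_words = ["cat", "cats", "scat", "cut", "tomcat"]
patterns = ["c.t", "^c.t", "c.t$", "^c.t$"]

# For each pattern, collect the words in which re.search finds a match
anchored = {
    p: [w for w in sample_words if re.search(p, w)]
    for p in patterns
}

for p in patterns:
    print(p.ljust(8), anchored[p])
```

Without anchors every word matches; `^` removes the words where `c.t` is not at the start, `$` removes those where it is not at the end, and using both leaves only the three-letter words.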

What if we want the word to start with a c? We can use `^c`:

In [None]:
match_words("^c.t")

Find all words that start with a letter followed by a z

And to get it to end with a t we use t$

In [None]:
match_words("c.t$")

Find all words that end in a yz followed by 2 letters

Putting it together we can find all 3 letter words starting with a c and ending with a t.

In [None]:
match_words("^c.t$")

Find all words that start with a w, end with a k and have 3 letters between them

### Character classes
We can restrict a placeholder to a set of possible characters, instead of just any character. 

To do this, we put all the possible characters we want to match inside square brackets `[]`. We can also specify a consecutive range of characters, like `a-z` or `0-9`.

### *The whole square bracketed set of characters applies to one character.*

In [None]:
match_words("p[nt][aeiou]")

In [None]:
match_words("h[yui]z")

In [None]:
# using a range of characters
print(re.match('[a-z].$', "as")) # match
print(re.match('[a-z].$', "0s")) # no match
# note that it is case sensitive
print(re.match('[a-z].$', "AS")) # no match

Now we can find only the words that have an a or an o in between.

In [None]:
match_words("^c[ao]t$")

Find words that have a y, u or i between an h and a z

In [None]:
match_words("h[yui]z")

And we can allow these vowels to be repeated any number of times, including zero.

In [None]:
match_words("^c[ao]*t$")

We can give a choice of values for each letter placement

In [None]:
match_words("c[rhl][aeiou][nh]t")

## Escaping
What if we wanted to actually match a `$` or a literal `.`? We can always **escape** a special character to make it behave as if it were not special. Backslash \ is the escape character. It makes the following character work as if it were not special.


In [23]:
# does not match -- because the $ is taken to mean the anchor
print(re.match('200$', '200$'))

# now we escape the $ and it behaves as the literal character $
# it matches correctly
print(re.match('200\$', '200$'))

None
<_sre.SRE_Match object; span=(0, 4), match='200$'>


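**However, inside an ordinary Python string \ still has its usual effect of turning sequences like \n into newlines.** This can be a pain, and recent Python versions also warn about unrecognised escapes such as `\$` in ordinary strings, so raw strings (with an `r` in front) are normally used for patterns.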

In [5]:
# note the r: no chance the \ does something unexpected
print(re.match(r'200\$', '200$'))
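When the text to be matched literally comes from somewhere else (user input, say), escaping by hand becomes error-prone. The standard library's `re.escape` adds the backslashes for us. The sketch below uses made-up strings to show the difference between an unescaped `.` and an escaped one.

```python
import re

price = "3.50$"
escaped = re.escape(price)  # backslash-escapes the special characters
print(escaped)

# Unescaped, the . is a placeholder, so it also matches the x in "3x50"
unescaped_hit = re.search("3.50", "3x50") is not None

# Escaped, the . only matches a literal dot
escaped_hit = re.search(re.escape("3.50"), "3x50") is not None

# The escaped pattern matches the original text exactly
literal_hit = re.fullmatch(escaped, price) is not None

print(unescaped_hit, escaped_hit, literal_hit)
```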

### Inverted character classes
We can also invert a character class to say match **any character except these ones**. To do this, we put a ^ (caret) as the very first character in the square brackets:

In [None]:
# words containing an f followed by three **non-vowels**
match_words("f[^aeiou][^aeiou][^aeiou]")

In [None]:
# same as the f followed by three non-vowels example
# note the way the repeat character binds to the previous expression, 
# which might not be a single character in the pattern!
match_words("f[^aeiou]{3}")

## Built-in character classes
Some classes are so commonly used there are special codes for them:

    \d    Match any digit: a character in the range 0 - 9 [0-9]
    \D    Match any non-digit: a character NOT in the range 0 - 9 [^0-9]
    \s    Match any whitespace character (space, tab, newline etc.)
    \S    Match any character that is NOT whitespace
    \w    Match any "word" character: in the range 0 - 9, A - Z, a - z, or an underscore
    \W    Match any character not in \w

In [None]:
# match a sequence of digits, possibly followed by one letter
# Note that we can just jam in multiple ranges in the character class 
# inside the square brackets
options = ["13131", "3133103b", "31hello", "o88", "7B"]

for option in options:
    print(option, re.match("[0-9]+[A-Za-z]?$", option) is not None)
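The shorthand classes listed above can be tried out with `re.findall`, which returns every match in a string. The sample sentence is made up for illustration.

```python
import re

text = "Room 42, floor 3: bring_your_own laptop!"

digits = re.findall(r"\d+", text)     # runs of digits
words = re.findall(r"\w+", text)      # runs of letters, digits and underscores
non_space = re.findall(r"\S+", text)  # runs of anything that is not whitespace

print(digits)
print(words)
print(non_space)
```

Note that `\w+` keeps `bring_your_own` together because the underscore counts as a word character, while punctuation such as `,` and `!` only survives in the `\S+` results.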

### Groups
We can group multiple elements in a regular expression together. To do this, we put the "subexpression" in parentheses (round brackets). So:

    f(lip)+$
    
means anything with an `f`, followed by one or more `lip`, then the end of string. It will match:

    flip
    fliplip
    flipliplip
    
but not:

    flipli
    fliplp
    
Any regular expression components can go in these brackets.    

**Any repeat operator following will apply to the whole group -- everything in the group works as if it were just one character**

In [28]:
tests = ["flip", "fliplip", "flipliplip", "flipli", "flipl", "fliplipliplop"]

# match a pattern against a list of tests
def match_against(tests, pattern):
    for st in tests:
        print(st.ljust(20), re.match(pattern, st) is not None)
        
match_against(tests, "f(lip)+$")

flip                 True
fliplip              True
flipliplip           True
flipli               False
flipl                False
fliplipliplop        False


### Alternatives
We can use this grouping functionality to make an intelligent kind of "or". Imagine we wanted to match any of `Mrs.` or `Ms.` or `Miss.`. How could we write that as a regular expression?

`|` is an operator which means "one of the options on either side of the `|`". Each option can be a *grouped expression*.

So this pattern would work:

    (Mrs\.)|(Ms\.)|(Miss\.)
    
or this one, which groups only the part that varies:

    M(rs|s|iss)\.
    


In [11]:
names = ["Mrs. Purple", "Miss. White", "Ms. Yellow", "Dr. Blue", "Mr. Red"]

match_against(names, r"(Mrs\.)|(Miss\.)|(Ms\.)")

Mrs. Purple          True
Miss. White          True
Ms. Yellow           True
Dr. Blue             False
Mr. Red              False
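Because `|` binds more loosely than everything else in a pattern, where the parentheses go matters. The sketch below, on made-up test strings, compares a pattern that groups the alternatives with one that does not; `re.fullmatch` is used so that only whole-string matches count.

```python
import re

titles = ["Mrs.", "Ms.", "Miss.", "Mr.", "s"]

# M, then one of rs / s / iss, then a literal dot
grouped = r"M(rs|s|iss)\."

# Without the group this reads as: "Mrs" OR "s" OR "iss."
ungrouped = r"Mrs|s|iss\."

grouped_hits = [t for t in titles if re.fullmatch(grouped, t)]
ungrouped_hits = [t for t in titles if re.fullmatch(ungrouped, t)]

print(grouped_hits)
print(ungrouped_hits)
```

The ungrouped version accepts the lone `s` and rejects all of the real titles, which is rarely what was intended.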


  
Any regular expression element can be alternated with |:

    [a-z]|[A-Z] same as [a-zA-Z]
    b|g  matches b or g, same as [bg]
    (b[aeiou]+k)|(d[aeiou]+t) matches both book and duet
    
But grouped subexpressions are the most useful thing to alternate -- usually character classes can capture most other patterns.

In [None]:
match_words("((b[aeiou]+k)|(d[aeiou]+t))$")

### Captures and extraction
As well as being able to do interesting alternation, every time a group is used, a regular expression matcher *captures* the contents of a group. This is how regular expressions can be used to extract specific text.

Every capture is numbered by counting the opening ( from the left, starting at 1. So if I wanted to capture someone's title and the name following their title:

    (Mr|Mrs|Dr|Ms|Miss|Sir|Lord|Dame)\. (\w*)

Then the title would be in capture 1 and their name would be in capture 2 (capture 0 is the whole match). In the tuple returned by `.groups()` these sit at indices 0 and 1.

`re.match()` lets us get at those groups, by calling `.groups()` on the return value.

In [15]:
names = ["Mrs. Purple", "Miss. White", "Dr. Blue", "Mr. Red", "Lord. Black"]
# simple matching
match_against(names, r"(Mr|Mrs|Dr|Ms|Miss|Sir|Lord|Dame)\. (\w*)")

Mrs. Purple          True
Miss. White          True
Dr. Blue             True
Mr. Red              True
Lord. Black          True
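To actually get at the captured text, we can call `.groups()` or `.group(n)` on the match object. This is a minimal sketch with one made-up name; `title_pattern` is just an illustrative variable name.

```python
import re

title_pattern = r"(Mr|Mrs|Dr|Ms|Miss|Sir|Lord|Dame)\. (\w*)"
m = re.match(title_pattern, "Dr. Blue")

all_captures = m.groups()  # tuple of every capture, in order
whole_match = m.group(0)   # group 0 is the entire matched text
title = m.group(1)         # first capture
name = m.group(2)          # second capture

print(all_captures)
print(whole_match, "|", title, "|", name)
```

Remember that `re.match` returns `None` when the pattern does not match, so real code should check the result before calling `.groups()` on it.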


### Substitution
We can also do regular expression find and replace. This lets us find text matching a pattern and replace it:

`re.sub` performs this operation:

In [None]:
def sub_list(tests, pattern, replacement):
    # print each test string next to the result of the substitution
    for test in tests:
        print(test.ljust(20), "=>", end=' ')
        subst = re.sub(pattern, replacement, test)
        print(subst)
        
sub_list(names, r"(Mr|Mrs|Dr|Ms|Miss|Sir|Lord|Dame)\.", "<title>")
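By default `re.sub` replaces every match in the string, not just the first. The following sketch, on a made-up sentence, shows that behaviour along with the optional `count` argument that limits the number of replacements.

```python
import re

title_pattern = r"(Mr|Mrs|Dr|Ms|Miss|Sir|Lord|Dame)\."
guests = "Mrs. Purple met Dr. Blue and Lord. Black"

# Replace every title in the string
all_replaced = re.sub(title_pattern, "<title>", guests)

# Replace only the first title
first_only = re.sub(title_pattern, "<title>", guests, count=1)

print(all_replaced)
print(first_only)
```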

## Back references
We can refer to the value of any previously captured group using the notation `\<n>` where `<n>` is an integer specifying the number of the group (numbering starts at 1, so \1 is the first capture; group 0 is the whole match, which `re.sub` replacements spell `\g<0>`). This works in substitutions:


In [12]:
def sub_list(tests, pattern, replacement):
    for test in tests:
        print(test.ljust(20), "=>", end=' ')
        subst = re.sub(pattern,  replacement, test)
        print(subst)
# the \1 refers to the title and the \2 refers to the name
sub_list(names, r"(Mr|Mrs|Dr|Ms|Miss|Sir|Lord|Dame)\. (\w*)", 
         r"{'title': '\1', 'name': '\2'}")

Mrs. Purple          => {'title': 'Mrs', 'name': 'Purple'}
Miss. White          => {'title': 'Miss', 'name': 'White'}
Dr. Blue             => {'title': 'Dr', 'name': 'Blue'}
Mr. Red              => {'title': 'Mr', 'name': 'Red'}
Lord. Black          => {'title': 'Lord', 'name': 'Black'}


Back references actually work in the matching part as well, and allow us to force a previously matched value to be used again.

In [None]:
# match everything that has a v, followed by two of the **same** vowel
match_words(r"v([aeiou])\1")
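To see what the back reference adds, the sketch below runs the same idea on a small made-up list (`sample_words` is illustrative, not the contents of `words.txt`) and compares it with a pattern that simply asks for any two vowels.

```python
import re

sample_words = ["vacuum", "veer", "voodoo", "video", "vault", "weevil"]

# v, a vowel, then the SAME vowel again (\1 repeats whatever group 1 captured)
same_vowel = r"v([aeiou])\1"

# v followed by any two vowels, not necessarily the same one
any_two_vowels = r"v[aeiou][aeiou]"

doubled = [w for w in sample_words if re.search(same_vowel, w)]
any_two = [w for w in sample_words if re.search(any_two_vowels, w)]

print(doubled)
print(any_two)
```

`vault` is accepted by the second pattern but not the first, because its two vowels after the v are different.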

### Matching versus finding
* `re.match()` finds a pattern at the start of a string.
* `re.search()` finds the first match for a pattern in a string (like match, but the pattern does not have to match at the start of the string).
* `re.findall()` can find **multiple** matches in a string, and returns them in a list

In [40]:
# one massive string
names = """Mrs. Purple, Miss. White, Dr. Blue, Mr. Red, Lord. Black"""
# match only finds the first one, at the start of the string
print(re.match(r"(Mr|Mrs|Dr|Ms|Miss|Sir|Lord|Dame)\. (\w*)", names).groups())
# search does the same thing here
print(re.search(r"(Mr|Mrs|Dr|Ms|Miss|Sir|Lord|Dame)\. (\w*)", names).groups())

('Mrs', 'Purple')
('Mrs', 'Purple')


In [30]:
# using findall
all_matches = re.findall(r"(Mr|Mrs|Dr|Ms|Miss|Sir|Lord|Dame)\. (\w*)", names)
# this will just be a list of the groups found
for match in all_matches:
    print(match)

('Mrs', 'Purple')
('Miss', 'White')
('Dr', 'Blue')
('Mr', 'Red')
('Lord', 'Black')
