*This notebook contains an excerpt from the [Whirlwind Tour of Python](http://www.oreilly.com/programming/free/a-whirlwind-tour-of-python.csp) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/WhirlwindTourOfPython).*

## Flexible Pattern Matching with Regular Expressions

Regular expressions generalize this "wildcard" idea to a wide range of flexible string-matching syntaxes.
The Python interface to regular expressions is contained in the built-in ``re`` module; as a simple example, let's use it to duplicate the functionality of the string ``split()`` method:

In [4]:
line = 'the quick brown fox jumped over a lazy dog'
line.split()

['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'a', 'lazy', 'dog']

In [9]:
import re
regex = re.compile(r'\s+')
regex.split(line)

['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'a', 'lazy', 'dog']

In this case, the input is ``"\s+"``: "``\s``" is a special sequence that matches any whitespace character (space, tab, newline, etc.), and "``+``" indicates *one or more* of the entity preceding it.
Thus, the regular expression matches any substring consisting of one or more whitespace characters.
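
As a quick check of that description, the following sketch splits a made-up string containing a mix of spaces, a tab, and a newline. The sample strings and variable names are illustrative only. It also shows one place where the regex-based split differs from ``str.split()``: leading whitespace produces an empty first element.

```python
import re

# r'\s+' treats any run of whitespace (spaces, tabs, newlines) as one separator.
whitespace = re.compile(r'\s+')

messy = 'the quick\tbrown \n fox'  # hypothetical sample with mixed whitespace
words = whitespace.split(messy)
print(words)

# Unlike str.split() with no arguments, the regex version keeps an empty
# string when the input starts with whitespace.
padded = '  abc'
print(padded.split())            # ['abc']
print(whitespace.split(padded))  # ['', 'abc']
```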

In [15]:
for s in ["     ", "abc  ", "  abc", "   abc"]:
    if s.startswith('  a'):
        print(repr(s), "matches")
    else:
        print(repr(s), "does not match")

'     ' does not match
'abc  ' does not match
'  abc' matches
'   abc' does not match


In [16]:
regex = re.compile(r'\s+')
for s in ["     ", "abc  ", "  abc"]:
    if regex.match(s):
        print(repr(s), "matches")
    else:
        print(repr(s), "does not match")

'     ' matches
'abc  ' does not match
'  abc' matches


In [17]:
regex = re.compile(r'\s+a')
for s in ["     ", "abc  ", "  abc", "   abc"]:
    if regex.match(s):
        print(repr(s), "matches")
    else:
        print(repr(s), "does not match")

'     ' does not match
'abc  ' does not match
'  abc' matches
'   abc' matches


In [18]:
line = 'the quick brown fox jumped over a lazy dog'
line.index('fox')

16

In [22]:
line.find('fox')

16

In [19]:
regex = re.compile('fox')
match = regex.search(line)
match.start()

16

In [20]:
regex = re.compile('[a-z]+o[a-z]+')
match = regex.search(line)
match.start()

10

In [21]:
regex.findall(line)

['brown', 'fox', 'dog']

In [7]:
line.replace('fox', 'BEAR')

'the quick brown BEAR jumped over a lazy dog'

In [8]:
regex.sub('BEAR', line)

'the quick brown BEAR jumped over a lazy dog'

### A more sophisticated example

But, you might ask, why would you want to use the more complicated and verbose syntax of regular expressions rather than the more intuitive and simple string methods?
The advantage is that regular expressions offer *far* more flexibility.

In [30]:
email = re.compile(r'\w+@\w+\.[a-z]{3}')

In [25]:
text = "To email Guido, try guido@python.org or the older address guido@google.com."
email.findall(text)

['guido@python.org', 'guido@google.com']

In [26]:
email.sub('--@--.--', text)

'To email Guido, try --@--.-- or the older address --@--.--.'

In [31]:
email.findall('barack.obama@whitehouse.gov')

['obama@whitehouse.gov']

### Basics of regular expression syntax

The syntax of regular expressions is much too large a topic for this short section.
Still, a bit of familiarity can go a long way: I will walk through some of the basic constructs here, and then list some more complete resources from which you can learn more.
My hope is that the following quick primer will enable you to use these resources effectively.

#### Simple strings are matched directly

If you build a regular expression on a simple string of characters or digits, it will match that exact string:

In [35]:
regex = re.compile('ion')
regex.findall('Great Expectations')

['ion']

#### Some characters have special meanings

While simple letters or numbers are direct matches, there are a handful of characters that have special meanings within regular expressions. They are:
```
. ^ $ * + ? { } [ ] \ | ( )
```
We will discuss the meaning of some of these momentarily.
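In the meantime, you should know that if you'd like to match any of these characters directly, you can *escape* them with a backslash. The examples below write patterns as raw strings (the ``r`` prefix), so that Python passes the backslashes through to the regular expression unchanged: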

In [42]:
print(r'Hi\nThere')

Hi\nThere


In [46]:
regex = re.compile(r'\\')
regex.findall(r"the \ cost \ is $20")

['\\', '\\']

In [48]:
regex = re.compile(r'\$')
regex.findall("the cost is $20")

['$']
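
When a literal string contains several special characters, escaping each one by hand gets tedious. One option is the standard library's ``re.escape`` function, which does the escaping for you. The sketch below uses a made-up string, and the variable names are illustrative only:

```python
import re

literal = '$20 (approx.)'  # hypothetical text containing several special characters

# re.escape returns a copy of the string with special characters backslash-escaped,
# so the compiled pattern matches the text literally.
pattern = re.compile(re.escape(literal))
found = pattern.findall('about $20 (approx.) each')
print(found)  # ['$20 (approx.)']
```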

In [15]:
regex = re.compile(r'\w\s\w')
regex.findall('the fox is 9 years old')

['e f', 'x i', 's 9', 's o']

The following table lists a few of these characters that are commonly useful:

| Character | Description                 |.| Character | Description                     |
|-----------|-----------------------------|-|-----------|---------------------------------|
| ``"\d"``  | Match any digit             |-| ``"\D"``  | Match any non-digit             |
| ``"\s"``  | Match any whitespace        |-| ``"\S"``  | Match any non-whitespace        |
| ``"\w"``  | Match any alphanumeric char |-| ``"\W"``  | Match any non-alphanumeric char |

This is *not* a comprehensive list or description; for more details, see Python's [regular expression syntax documentation](https://docs.python.org/3/library/re.html#re-syntax).
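
To make the table concrete, here is a small sketch that applies three of these markers to one made-up string. Each marker is combined with ``+`` so that it picks up runs of characters rather than single characters. Note that ``\w`` also matches the underscore, which this sample does not exercise.

```python
import re

sample = 'Call 555-1234 at 9am'  # hypothetical sample string

digits = re.findall(r'\d+', sample)      # runs of digits
non_digits = re.findall(r'\D+', sample)  # runs of anything that is not a digit
words = re.findall(r'\w+', sample)       # runs of alphanumeric characters

print(digits)      # ['555', '1234', '9']
print(non_digits)  # ['Call ', '-', ' at ', 'am']
print(words)       # ['Call', '555', '1234', 'at', '9am']
```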

#### Square brackets match custom character groups

If the built-in character groups aren't specific enough for you, you can use square brackets to specify any set of characters you're interested in.
For example, the following will match any lower-case vowel:

In [52]:
regex = re.compile('[aeiou]')
regex.split('consequential')

['c', 'ns', 'q', '', 'nt', '', 'l']

Similarly, you can use a dash to specify a range: for example, ``"[a-z]"`` will match any lower-case letter, and ``"[1-3]"`` will match any of ``"1"``, ``"2"``, or ``"3"``.
For instance, you may need to extract from a document specific numerical codes that consist of a capital letter followed by a digit. You could do this as follows:

In [50]:
regex = re.compile('[A-Z][0-9]')
regex.findall('1043879, G2, H6, Z44')

['G2', 'H6', 'Z4']
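
Notice that ``'Z44'`` contributed only ``'Z4'``, because the pattern asks for exactly one digit. The sketch below reuses the same input and adds ``+`` after the digit set to capture longer codes. It also shows that several ranges can share one pair of brackets. The hexadecimal example and all variable names are illustrative, not part of the original text.

```python
import re

codes = '1043879, G2, H6, Z44'

# Exactly one digit after the capital letter.
one_digit = re.compile('[A-Z][0-9]')
short_codes = one_digit.findall(codes)
print(short_codes)  # ['G2', 'H6', 'Z4']

# One or more digits after the capital letter.
many_digits = re.compile('[A-Z][0-9]+')
full_codes = many_digits.findall(codes)
print(full_codes)   # ['G2', 'H6', 'Z44']

# Several ranges inside one set: any hexadecimal digit.
hex_digit = re.compile('[0-9a-fA-F]')
hex_chars = hex_digit.findall('#1aF9 xyz')
print(hex_chars)    # ['1', 'a', 'F', '9']
```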

#### Wildcards match repeated characters

If you would like to match a string with, say, three alphanumeric characters in a row, it is possible to write, for example, ``"\w\w\w"``.
Because this is such a common need, there is a specific syntax to match repetitions – curly braces with a number:

In [18]:
regex = re.compile(r'\w{3}')
regex.findall('The quick brown fox')

['The', 'qui', 'bro', 'fox']

In [53]:
regex = re.compile(r'\w+')
regex.findall('The quick brown fox')

['The', 'quick', 'brown', 'fox']

The following is a table of the repetition markers available for use in regular expressions:

| Character | Description | Example |
|-----------|-------------|---------|
| ``?`` | Match zero or one repetitions of preceding  | ``"ab?"`` matches ``"a"`` or ``"ab"`` |
| ``*`` | Match zero or more repetitions of preceding | ``"ab*"`` matches ``"a"``, ``"ab"``, ``"abb"``, ``"abbb"``... |
| ``+`` | Match one or more repetitions of preceding  | ``"ab+"`` matches ``"ab"``, ``"abb"``, ``"abbb"``... but not ``"a"`` |
| ``{n}`` | Match ``n`` repetitions of preceding | ``"ab{2}"`` matches ``"abb"`` |
| ``{m,n}`` | Match between ``m`` and ``n`` repetitions of preceding | ``"ab{2,3}"`` matches ``"abb"`` or ``"abbb"`` |
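
One way to check the table is to test each pattern against a few candidate strings with ``fullmatch``, which succeeds only when the entire string matches. This is a sketch: the candidate lists are made up for illustration. In every case the marker applies to the ``b`` alone, not to ``ab`` as a unit.

```python
import re

# Patterns from the table, each paired with hypothetical candidate strings.
cases = {
    'ab?': ['a', 'ab', 'abb'],
    'ab*': ['a', 'ab', 'abbb'],
    'ab+': ['a', 'ab', 'abb'],
    'ab{2}': ['ab', 'abb', 'abbb'],
    'ab{2,3}': ['ab', 'abb', 'abbb', 'abbbb'],
}

results = {}
for pattern, candidates in cases.items():
    regex = re.compile(pattern)
    # Keep only the candidates that the pattern matches in full.
    results[pattern] = [s for s in candidates if regex.fullmatch(s)]
    print(repr(pattern), 'accepts', results[pattern])
```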

In [20]:
email = re.compile(r'\w+@\w+\.[a-z]{3}')

In [54]:
email2 = re.compile(r'[\w.]+@\w+\.[a-z]{3}')
email2.findall('barack.obama@whitehouse.gov')

['barack.obama@whitehouse.gov']

We have changed ``"\w+"`` to ``"[\w.]+"``, so we will match any alphanumeric character *or* a period.
With this more flexible expression, we can match a wider range of email addresses (though still not all – can you identify other shortcomings of this expression?).
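
As a partial answer to that question, the sketch below runs the same expression over a few made-up addresses. All of them are hypothetical and use the reserved ``example`` domains. Hyphens in the user name, subdomains, and suffixes that are not exactly three letters all lead to a truncated match or no match at all.

```python
import re

email2 = re.compile(r'[\w.]+@\w+\.[a-z]{3}')

# Hypothetical addresses chosen to probe the pattern's limits.
candidates = [
    'jane-doe@example.com',   # hyphen in the user name
    'user@mail.example.org',  # subdomain
    'user@example.io',        # two-letter suffix
    'user@example.info',      # four-letter suffix
]

found = {address: email2.findall(address) for address in candidates}
for address, matches in found.items():
    print(address, '->', matches)
```

Only part of each address is recovered in three of the four cases, and the two-letter suffix is missed entirely.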

#### Parentheses indicate *groups* to extract

For compound regular expressions like our email matcher, we often want to extract their components rather than the full match. This can be done using parentheses to *group* the results:

In [62]:
email3 = re.compile(r'([\w.]+)@(\w+)\.([a-z]{3})')

In [63]:
text = "To email Guido, try guido@python.org or the older address guido@google.com."
email3.findall(text)

[('guido', 'python', 'org'), ('guido', 'google', 'com')]
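
Groups can also be given names with the ``(?P<name>...)`` syntax, which lets you retrieve the components by name from a match object. The following is a minimal sketch of that idea; the name ``email4`` and the group names ``user``, ``domain``, and ``suffix`` are chosen here for illustration.

```python
import re

# Same structure as email3, but each group has a name.
email4 = re.compile(r'(?P<user>[\w.]+)@(?P<domain>\w+)\.(?P<suffix>[a-z]{3})')

match = email4.search('To email Guido, try guido@python.org')
print(match.group(0))       # the full match: 'guido@python.org'
print(match.group('user'))  # a single component: 'guido'
print(match.groupdict())    # all named components as a dictionary
```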