# Python Regular Expressions

Python's regular expression API is provided by the **re** module.
The official documentation for `re` is provided by the
[re section of the Python Standard Library](https://docs.python.org/3.5/library/re.html).
Open this link in a separate window as you work through
these exercises and become familiar with the documentation.

## 1. Preliminaries

The following sections need to be understood before getting started
with Python regular expressions.

### Import and Compile

Be sure to import `re` in any script or interactive
session in which you expect to use regular expressions.

In [32]:
import re

A regular expression must be compiled into a pattern object before it
can be applied to a string to look for a match.  The `compile` function
does this explicitly.

In [33]:
findData = re.compile('\\s[Dd]ata\\s')

This will match "`data`" or "`Data`" when it has a whitespace
character (`\s`) on each side.
Recall that Python treats the backslash as an escape
character in an ordinary string literal.  For example, Python converts
`\b` (which the regular expression compiler reads as a word boundary)
to a backspace before it passes the result to the `compile`
function.  To pass a literal `\b` through, we need to write
`\\b`.  Python converts the `\\` to a single backslash
and leaves the `b` alone.  The pattern above doubles the backslash
before `s` so that a literal backslash reaches the compiler.

In [34]:
# Apply regular expression from compilation above
found = findData.search("Where is my data now?")

# Compile and apply in a single step.
found = re.search('\\s[Dd]ata\\s', "Where is my data now?")

print("Found" if found else "Not found")

Found



### Raw Strings

More involved regular expressions can contain quite a few
backslashes.  This is confusing enough, and it is further compounded
when string escape processing forces us to
*double the number of backslashes*!  To help reduce this mess,
Python provides the notion of a **raw string**, in which backslashes
are left alone.  This doesn't mean the regular expression compiler
will leave them alone, but the Python string literal parser will.
A raw string literal is prefixed with a lower-case **r**.
So the regular expression above could be equivalently expressed as:

In [35]:
findData = re.compile(r'\s[Dd]ata\s')
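As a quick check of that equivalence, here is a minimal sketch. The variable names are illustrative only and are not used elsewhere in these exercises.

```python
import re

# An ordinary literal turns \b into a single backspace character.
backspace = '\b'
# Doubling the backslash, or using a raw string, keeps both characters.
doubled = '\\b'
raw = r'\b'

print(len(backspace), len(doubled), len(raw))  # 1 2 2
print(doubled == raw)                          # True

# Both spellings hand the same pattern text to the compiler.
same = re.compile('\\s[Dd]ata\\s').pattern == re.compile(r'\s[Dd]ata\s').pattern
print(same)                                    # True
```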

## 2. Regular Expression Methods

Python compiled regular expressions may be applied to strings
through several methods.  These methods may be invoked in one
of two ways.

1. As a method of the compiled regular expression object, or
2. As a module-level function of `re`.

The first case requires two steps.  The first step, invoking
the `compile` function, was shown in the previous section.
The second step applies the compiled regular expression to a string
in a separate statement.

Splitting compilation from application is
especially suitable for applying a regular expression in a loop,
since the compilation is done once outside the loop and the result
is applied within the loop.

The second way is more convenient for the programmer.  The
regular expression and the string on which to apply it are
specified in the same function call.  This performs the compilation
and evaluation in one step.  This is easier to code and read
most of the time.  We'll demonstrate both types below.
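The compile-once pattern might look like the sketch below. The sample lines are hypothetical, and the pattern uses the word boundary `\b` rather than `\s` so that a match at the start of a line also counts.

```python
import re

# Compile once, outside the loop.
find_data = re.compile(r'\b[Dd]ata\b')

# Hypothetical sample lines, used only for illustration.
lines = [
    "Where is my data now?",
    "Nothing to see here.",
    "Data first, questions later.",
]

# Apply the same compiled pattern to each line.
hits = [line for line in lines if find_data.search(line)]
print(hits)
# ['Where is my data now?', 'Data first, questions later.']
```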

Most of these methods return a
**Match** object.  There are several ways to interrogate a
`Match` object.  The most basic way is to treat it like a boolean.
In this case, **a `Match` object always evaluates to `True`**.  In other words,
if one of the methods returned a `Match` object at all, something
was matched.  If nothing was matched, the method will return
`None` (which evaluates to `False` as a boolean).
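Beyond the boolean test, a `Match` object can report what was matched and where. The following is a small sketch using the `group`, `start`, and `end` methods of `Match`; the variable names `m` and `miss` are illustrative.

```python
import re

m = re.search(r'[Dd]ata', "Where is my data now?")
miss = re.search(r'[Dd]ata', "Nothing to see here.")

# A successful search gives a truthy Match; a failed one gives None.
print(bool(m), miss is None)   # True True

# The matched text and its position in the string.
print(m.group())               # data
print(m.start(), m.end())      # 12 16
```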

### Search

The `search` method searches for the presence of a match anywhere
in the string.

In [36]:
found = re.search(r'\s[Dd]ata\s', "My data is here.")
print("Found" if found else "Not found")

Found


In [37]:
found = re.search(r'\s[Dd]ata\s', "Here is my data.")
print("Found" if found else "Not found")

Not found


Note that we didn't match `data` in the second sentence like we
wanted.  We wanted to match `data` alone (not as part of
another word), so we insisted on whitespace on each side.
But we didn't count on punctuation or other non-whitespace
neighbors, such as the beginning or end of the string.  Rather than
using whitespace (`\s`) as a delimiter, the boundary (`\b`)
is a better choice for this case.  It represents the boundary
between a word character and a non-word character, or between a
word character and the start or end of the string.

In [38]:
found = re.search(r'\b[Dd]ata\b', "My data is here.")
print("1 found" if found else "1 not found")
found = re.search(r'\b[Dd]ata\b', "Here is my data.")
print("2 found" if found else "2 not found")

1 found
2 found
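To see what `\b` accepts and rejects, the sketch below runs the same pattern over a few hypothetical strings. Note that the underscore counts as a word character, so `data_set` is not a standalone `data`.

```python
import re

word = re.compile(r'\b[Dd]ata\b')

# Hypothetical test strings, chosen to probe the boundary rule.
samples = ["data", "(data)", "Data, please.", "database", "metadata", "data_set"]

results = [bool(word.search(text)) for text in samples]
for text, ok in zip(samples, results):
    print(f"{text!r}: {'found' if ok else 'not found'}")
```

The first three strings are found; the last three are not, because `data` is embedded in a longer word.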


### Match

The `match` method is similar to `search`.
The main difference is that `match` only succeeds if the pattern
matches at the very beginning of the string.

In [39]:
found = re.match(r'\b[Dd]ata\b', "Data is here.")
print("1 found" if found else "1 not found")
found = re.match(r'\b[Dd]ata\b', "Here is data.")
print("2 found" if found else "2 not found")

1 found
2 not found
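The difference shows up most clearly with multi-line text. In this sketch (the sample text is hypothetical), `match` does not look at the start of the second line, even with the `re.MULTILINE` flag, whereas `search` with a `^` anchor and `re.MULTILINE` does.

```python
import re

text = "Here is line one.\nData is on line two."

# match only tries the very start of the string, with or without MULTILINE.
m_plain = re.match(r'[Dd]ata', text)
m_multi = re.match(r'[Dd]ata', text, re.MULTILINE)
print(m_plain, m_multi)            # None None

# search with ^ can anchor at the start of each line when MULTILINE is set.
s_plain = re.search(r'^[Dd]ata', text)
s_multi = re.search(r'^[Dd]ata', text, re.MULTILINE)
print(bool(s_plain), bool(s_multi))  # False True
```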


### Split

The `split` method uses the regular expression as the criterion
for delimiters in splitting strings.

In [40]:
re.split("data", "Here data, there data, too much data everywhere")

['Here ', ', there ', ', too much ', ' everywhere']

Note that spaces and punctuation remain in the resulting list elements.
Also notice that the text we matched is not part of the list.

In [41]:
re.split(r',', "Here data, there data, too much data everywhere")

['Here data', ' there data', ' too much data everywhere']

In [42]:
re.split(r'\s+|\W+|data', "Here data, there b, too much data everywhere")

['Here', '', '', 'there', 'b', 'too', 'much', '', '', 'everywhere']

In the last example, we matched multiple separators in a row.
When there is nothing between two separators, the token between them is an empty string.
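If the empty tokens are unwanted, one simple option is to filter them out after splitting. The sketch below also shows that wrapping the separator pattern in parentheses keeps the separators in the result, which is documented `re.split` behavior. The variable names are illustrative.

```python
import re

text = "Here data, there b, too much data everywhere"

# Drop the empty strings produced by adjacent separators.
tokens = [t for t in re.split(r'\s+|\W+|data', text) if t]
print(tokens)
# ['Here', 'there', 'b', 'too', 'much', 'everywhere']

# Parentheses around the pattern keep the separators in the list.
kept = re.split(r'(,)', "a,b,c")
print(kept)
# ['a', ',', 'b', ',', 'c']
```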

## 3. Grouping

Grouping is a very powerful technique with regular expressions.
It allows us to select a particular piece of an expression from
the broader context of the match.

A group in a regular expression is the portion within parentheses.

In [49]:
match = re.search(r'(.*) data, (.*) data,', 'Here data, there data, too much data everywhere')

The `group` method will allow you to grab groups from a match. Passing 0 into the `group` method will return the entire match.

In [44]:
match.group(0)

'Here data, there data,'

Let's find out how many groups we captured.

In [48]:
len(match.groups())

2

Passing a number other than 0 into the `group` method will return the text matched by the respective group.

In [50]:
match.group(1)

'Here'

In [51]:
match.group(2)

'there'
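The groups can also be retrieved all at once, along with their positions. This is a brief sketch that repeats the search above and uses the `groups` and `span` methods of `Match`.

```python
import re

match = re.search(r'(.*) data, (.*) data,',
                  'Here data, there data, too much data everywhere')

# All captured groups as a tuple.
print(match.groups())    # ('Here', 'there')

# Start and end offsets of each group within the original string.
print(match.span(1))     # (0, 4)
print(match.span(2))     # (11, 16)
```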