# 20. Regular Expressions in Python

The [`re`](https://docs.python.org/3.5/library/re.html) module provides regular expression (regex) support in Python. It can search and match both Unicode strings (`str`) and 8-bit strings (`bytes`). However, the two types cannot be mixed.

> _you cannot match an Unicode string with a byte pattern or vice-versa; similarly, when asking for a substitution, the replacement string must be of the same type as both the pattern and the search string._
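
As a quick illustration of the rule quoted above, here is a minimal sketch; the sample strings and variable names are made up for this example. It pairs a `str` pattern with a `str` target, a `bytes` pattern with a `bytes` target, and then mixes the two types on purpose.

```python
import re

text_str = "hello world"
text_bytes = b"hello world"

# A str pattern works on a str target; a bytes pattern works on a bytes target.
str_match = re.search(r"world", text_str)
bytes_match = re.search(rb"world", text_bytes)
print(str_match.group())    # world
print(bytes_match.group())  # b'world'

# Mixing the two types raises TypeError.
try:
    re.search(rb"world", text_str)
    mixed_error = None
except TypeError as exc:
    mixed_error = exc
print(type(mixed_error).__name__)  # TypeError
```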

When writing regex patterns in Python code, use the raw string notation __`r"string"`__ (a string prefixed with the letter __`r`__) so that backslashes reach the regex engine unchanged instead of being interpreted as Python string escapes.
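
To see why the prefix matters, the sketch below compares the same pattern written with and without it. The pattern and sample text are illustrative only. Without the `r`, Python converts `\b` into a backspace character before the regex engine sees it, so the word-boundary meaning is lost.

```python
import re

plain = "\bcat\b"   # "\b" becomes a backspace character here
raw = r"\bcat\b"    # backslashes are kept, so the regex engine sees \b

print(len(plain))  # 5: backspace, c, a, t, backspace
print(len(raw))    # 7: backslash, b, c, a, t, backslash, b

text = "the cat sat"
plain_result = re.search(plain, text)
raw_result = re.search(raw, text)
print(plain_result)        # None, there is no backspace in the text
print(raw_result.group())  # cat
```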

If you need a refresher on the topic, now is a good time to read the documentation linked above. It provides a good reference and a link to an introduction to regex.

Let's import the module:

In [None]:
import re

> _The module defines several functions, constants, and an exception. Some of the functions are simplified versions of the full featured methods for compiled regular expressions. Most non-trivial applications always use the compiled form._

## 20.1 Functions

> _Python offers two different primitive operations based on regular expressions: __re.match() checks for a match only at the beginning__ of the string, while __re.search() checks for a match anywhere__ in the string (this is what Perl does by default)._

### 20.1.1 `re.match()` and `re.search()`

In [None]:
re.match('hello', 'hello world')

On success, `re.match()` and `re.search()` return a match object, which evaluates to `True`. On failure, they return `None`.
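
A small sketch of both outcomes, using made-up inputs: because a failed call gives back `None`, it is safer to test the result before calling methods such as `group()` on it.

```python
import re

hit = re.match(r"hello", "hello world")
miss = re.match(r"world", "hello world")  # "world" is not at the start

print(bool(hit))  # True
print(miss)       # None

# Calling .group() on None would raise AttributeError, so check first.
outcomes = []
for result in (hit, miss):
    if result is not None:
        outcomes.append(result.group())
    else:
        outcomes.append("no match")
print(outcomes)  # ['hello', 'no match']
```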

In [None]:
text = "639876543210 is my mobile number"
regex = r'(63|0)[0-9]{10,11}'

match = re.match(regex, text)
if match:
    print("match:", match.group())

search = re.search(regex, text)
if search:
    print("search:", search.group())

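The text above and the text below give different results for `re.match()`. Below, the number no longer sits at the beginning of the string, so `re.match()` finds nothing while `re.search()` still finds the first number.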

In [None]:
text = "My mobile numbers are 639876543210 and 09876543211"
regex = r'(63|0)[0-9]{10,11}'

match = re.match(regex, text)
if match:
    print("match:", match.group())

search = re.search(regex, text)
if search:
    print("search:", search.group())

### 20.1.2 `re.compile()`

We used the same regex pattern several times in our examples. To make a pattern reusable, `re.compile()` turns it into a pattern object that has its own `match()`, `search()`, and related methods. The performance gain from compiling is slight and usually negligible.

In [None]:
text = "639876543210 is my mobile number"
regex = re.compile(r'(63|0)[0-9]{10,11}')

match = regex.match(text)
if match:
    print("match:", match.group())

search = regex.search(text)
if search:
    print("search:", search.group())
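
Reuse is where a compiled pattern pays off in readability. As a rough sketch (the list of lines and the names `phone` and `found` are invented for this example), the same pattern object can be applied to many strings without repeating the pattern text:

```python
import re

phone = re.compile(r'(63|0)[0-9]{10,11}')

lines = [
    "639876543210 is my mobile number",
    "call 09876543211 after lunch",
    "no number here",
]

# Apply the one compiled pattern to every line.
found = []
for line in lines:
    m = phone.search(line)
    found.append(m.group() if m else None)

print(found)  # ['639876543210', '09876543211', None]
```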

### 20.1.3 `re.split()`

There is also `re.split()`, which works similarly to `str.split()`, except that the separator is a regex pattern.

In [None]:
re.split(r'\W+', 'Words, words, words.')  # \W matches non-word characters
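
For comparison, here is a short sketch that splits the same sentence with `str.split()` and with `re.split()`. The variable names are only for illustration. Note that the regex version ends with an empty string, because the final `.` is itself a separator with nothing after it. The optional `maxsplit` argument limits the number of splits.

```python
import re

sentence = 'Words, words, words.'

by_str = sentence.split(', ')          # fixed separator
by_regex = re.split(r'\W+', sentence)  # any run of non-word characters
limited = re.split(r'\W+', sentence, maxsplit=1)

print(by_str)    # ['Words', 'words', 'words.']
print(by_regex)  # ['Words', 'words', 'words', '']
print(limited)   # ['Words', 'words, words.']
```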

### 20.1.4 `re.findall()`

To return all non-overlapping matches of a pattern, use `re.findall()`.

In [None]:
re.findall(r'\w+', 'Words, words, words.')
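
One detail worth knowing when a pattern contains a capturing group: `re.findall()` then returns the group contents rather than the whole match. The sketch below reuses the phone number pattern from earlier and shows a non-capturing group `(?:...)` as one possible way to get the full matches back.

```python
import re

text = "My mobile numbers are 639876543210 and 09876543211"

# With a capturing group, findall() returns only what the group captured.
prefixes = re.findall(r'(63|0)[0-9]{10,11}', text)

# With a non-capturing group, findall() returns the whole match.
numbers = re.findall(r'(?:63|0)[0-9]{10,11}', text)

print(prefixes)  # ['63', '0']
print(numbers)   # ['639876543210', '09876543211']
```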

### 20.1.5 `re.finditer()`

Instead of returning a list, `re.finditer()` returns an iterator that yields match objects.

In [None]:
for m in re.finditer(r'\w+', 'Words, words, words.'):
    print('{0}-{1}: {2}'.format(m.start(), m.end(), m.group(0)))

### 20.1.6 `re.fullmatch()`

If you want to check whether a whole string matches a pattern, use `re.fullmatch()` (new in Python 3.4).

In [None]:
text = "639876543210"
regex = r'(63|0)[0-9]{10,11}'

match = re.fullmatch(regex, text)
if match:
    print("match:", match.group())
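
The difference from `re.match()` shows up once there is extra text after the number. A minimal sketch, reusing the earlier sample sentence:

```python
import re

regex = r'(63|0)[0-9]{10,11}'
text = "639876543210 is my mobile number"

prefix_match = re.match(regex, text)      # only the start has to match
whole_match = re.fullmatch(regex, text)   # the entire string has to match

print(prefix_match.group())  # 639876543210
print(whole_match)           # None
```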

### 20.1.7 `re.sub()` and `re.subn()`

`sub()` replaces every match of the pattern with the given replacement and returns the new string.

`subn()` returns a tuple with the new string and the number of substitutions made.

In [None]:
re.sub(r'\w', '*', 'password')

In [None]:
re.subn(r'\w', '*', 'password')
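
The sketch below unpacks the tuple returned by `subn()` and also shows the optional `count` argument, which limits how many matches are replaced. The variable names are illustrative.

```python
import re

# subn() returns (new_string, number_of_substitutions).
masked_all, n_all = re.subn(r'\w', '*', 'password')

# count=4 replaces only the first four matches.
masked_some = re.sub(r'\w', '*', 'password', count=4)

print(masked_all, n_all)  # ******** 8
print(masked_some)        # ****word
```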

#### Exercises

Using the functions provided by the `re` module, write functions that use regex patterns to:

* Extract the title of an HTML page

* Validate phone numbers with the following parts:

    - country code or 0
    - carrier code
    - optional space or hyphen
    - first three digits
    - optional space or hyphen
    - last four digits

* Extract the day and month from various date formats:

    - YYYY-MM-DD
    - MM/DD/YYYY
    - MM-DD-YYYY
    - MM DD, 'YY