# Pattern Matching and Regular Expressions

- You may be familiar with searching for text by pressing `CTRL-F` and typing in the words you’re looking for
- **Regular expressions** go one step further: 
    - They allow you to specify a pattern of text to search for

## Regular Expressions
- Regular expressions are helpful, but few non-programmers know about them, even though most modern text editors and word processors have find and find-and-replace features that can search based on regular expressions
- Regular expressions are huge time-savers

## Regular Expressions
“Knowing [regular expressions] can mean the difference between solving a problem in 3 steps and solving it in 3,000 steps. When you’re a nerd, you forget that the problems you solve with a couple keystrokes can take other people days of tedious, error-prone work to slog through.” - Cory Doctorow

## Regular Expressions
- Regular expressions, called *regexes* for short, are descriptions for a pattern of text
- For example, a `\d` in a regex stands for a digit character—that is, any single numeral 0 to 9.
- The regex `\d\d\d-\d\d\d-\d\d\d\d` matches three digits, a hyphen, three digits, a hyphen, and four digits, for example:
    - 888-888-8888

## Regular Expressions
- Regular expressions can be more sophisticated
    - Shorter regex `\d{3}-\d{3}-\d{4}`
    - `{3}` $\implies$ 'Match this pattern three times'
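As a small check of the claim that the two patterns are equivalent, the sketch below compiles both and searches the same text. The sample sentence and the variable names are made up for illustration.

```python
import re

# The long form spells out every digit; the short form uses {n} repetition.
long_form = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
short_form = re.compile(r'\d{3}-\d{3}-\d{4}')

text = 'Call 888-888-8888 today.'  # illustrative sample text

long_match = long_form.search(text).group()
short_match = short_form.search(text).group()

print(long_match)   # 888-888-8888
print(short_match)  # 888-888-8888
```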

## Create Regex Objects
- Regex functions are in the `re` module

In [2]:
import re

- Passing a string value representing your regular expression to `re.compile()` returns a Regex object
- Create a Regex object that matches the phone number pattern
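- Pass a raw string (a string literal prefixed with `r`) to `re.compile()` so that backslashes such as the one in `\d` reach the regex engine unchanged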

In [3]:
phone_regex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

- `phone_regex` now contains a Regex object

## Match Regex Objects
- A Regex object's `search()` method searches the string it is passed for any matches to the regex.
- The `search()` method returns `None` if the regex pattern is not found in the string
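Because `search()` can return `None`, calling `group()` on the result without checking would raise an `AttributeError`. A minimal sketch of the usual guard, with an invented sample string:

```python
import re

phone_regex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

# This text contains no phone number, so search() finds nothing.
result = phone_regex.search('No phone number here.')
print(result)  # None

# Check for None before calling group().
if result is None:
    message = 'no match'
else:
    message = result.group()
print(message)  # no match
```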

## Match Regex Objects
- If the pattern is found, the `search()` method returns a `Match` object
- `Match` objects have a `group()` method

In [6]:
mo = phone_regex.search('My number is 123-456-7890.')

In [5]:
print(mo.group())

123-456-7890


## Review of Regex
While there are several steps to using regular expressions in Python, each step is fairly simple.

1. Import the regex module with `import re`.

2. Create a Regex object with the `re.compile()` function. (Remember to use a raw string.)

3. Pass the string you want to search into the Regex object’s `search()` method. This returns a Match object, or `None` if nothing matches.

4. Call the Match object’s `group()` method to return a string of the actual matched text.
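The four steps can be sketched end to end as follows. The sample sentence and phone number are invented for illustration.

```python
import re  # Step 1: import the regex module

# Step 2: compile the pattern from a raw string
phone_regex = re.compile(r'\d{3}-\d{3}-\d{4}')

# Step 3: search a string (illustrative sample text)
mo = phone_regex.search('Reach the office at 415-555-1011 tomorrow.')

# Step 4: read the matched text, guarding against a failed search
if mo is not None:
    matched = mo.group()
    print(matched)  # 415-555-1011
```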

## Group with Parentheses
- Say you want to separate the area code from the rest of the phone number
- Adding parentheses will create groups in the regex: 
    - `(\d\d\d)-(\d\d\d-\d\d\d\d)`
- Then you can use the `group()` match object method to grab the matching text from just one group

## Group with Parentheses
- The first set of parentheses in a regex string will be group 1. 
- The second set will be group 2. 
- By passing the integer 1 or 2 to the `group()` match object method, you can grab different parts of the matched text.
- Passing 0 or nothing to the `group()` method will return the entire matched text.

In [11]:
phone_regex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phone_regex.search('My number is 123-456-7890.')
print(mo.group())
print(mo.group(0))
print(mo.group(1))
print(mo.group(2))

123-456-7890
123-456-7890
123
456-7890


- If you would like to retrieve all the groups at once, use the `groups()` method
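A short sketch of `groups()`, reusing the phone pattern from the cell above. The names `area_code` and `main_number` are chosen here for illustration.

```python
import re

phone_regex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phone_regex.search('My number is 123-456-7890.')

# groups() returns a tuple with one entry per parenthesized group.
all_groups = mo.groups()
print(all_groups)  # ('123', '456-7890')

# The tuple can be unpacked into separate variables.
area_code, main_number = mo.groups()
print(area_code)    # 123
print(main_number)  # 456-7890
```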

- The `\(` and `\)` escape characters in the raw string passed to `re.compile()` will match actual parenthesis characters.
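To illustrate escaped parentheses, the sketch below matches an area code written in literal parentheses. The phone number format and sample text are assumptions made for this example.

```python
import re

# \( and \) match literal parentheses; the unescaped ( ) still create groups.
paren_regex = re.compile(r'(\(\d\d\d\)) (\d\d\d-\d\d\d\d)')
mo = paren_regex.search('My phone number is (415) 555-4242.')

area_code = mo.group(1)
main_number = mo.group(2)
print(area_code)    # (415)
print(main_number)  # 555-4242
```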

## Matching Multiple Groups with Pipe |

- `|` is called the pipe operator
- Use it where you want to match one of many expressions.
- `r'Batman|Robin'` will match either Batman or Robin; when both occur in the searched text, `search()` returns whichever occurs first

In [12]:
regex = re.compile(r'Batman|Robin')
mo2 = regex.search('Batman and Robin.')

In [13]:
mo2.group()

'Batman'

In [14]:
mo3 = regex.search('Robin and Batman.')

In [15]:
mo3.group()

'Robin'

## Matching Multiple Groups with Pipe |
- Use the pipe to match one of several patterns as part of your regex

In [20]:
regex = re.compile(r'Bat(man|mobile|copter|bat)')

In [21]:
mo = regex.search('Batmobile lost a wheel.')
mo.group()

'Batmobile'

- If you need to match an actual pipe character, escape it with a backslash, like `\|`.
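As a sketch of both points above: `group(1)` returns only the alternative that matched inside the parentheses, and `\|` matches a literal pipe. The second pattern and its sample text are invented for illustration.

```python
import re

regex = re.compile(r'Bat(man|mobile|copter|bat)')
mo = regex.search('Batmobile lost a wheel.')

whole = mo.group()     # the full match
suffix = mo.group(1)   # only the part matched inside the parentheses
print(whole)   # Batmobile
print(suffix)  # mobile

# An escaped pipe matches the | character itself.
pipe_regex = re.compile(r'a\|b')
literal = pipe_regex.search('x a|b y').group()
print(literal)  # a|b
```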

## Optional Matching with Question Mark
- Sometimes there is a pattern that you want to match only optionally
- The regex should find a match whether or not that bit of text is there
- `?` flags the group that precedes it as an optional part of the pattern

In [22]:
regex = re.compile(r'Bat(wo)?man')
mo = regex.search('The Adventures of Batman')
mo.group()

'Batman'

In [23]:
mo2 = regex.search('The Adventures of Batwoman')
mo2.group()

'Batwoman'

## Optional Matching with Question Mark
- The `(wo)?` part of the regular expression means that the pattern wo is an optional group
- The regex will match text that has zero instances or one instance of wo in it


In [25]:
phone_regex = re.compile(r'(\d{3}-)?\d{3}-\d{4}')
mo = phone_regex.search('My number is 123-456-7890')
mo.group()

'123-456-7890'

In [26]:
mo = phone_regex.search('My number is 456-7890')
mo.group()

'456-7890'

- You can think of the `?` as saying, “Match zero or one of the group preceding this question mark.”
- If you need to match an actual question mark character, escape it with `\?`
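One consequence worth seeing in code: when an optional group does not take part in the match, `group(1)` returns `None` rather than an empty string. A minimal sketch using the optional area code pattern from above (variable names are illustrative):

```python
import re

phone_regex = re.compile(r'(\d{3}-)?\d{3}-\d{4}')

with_area = phone_regex.search('My number is 123-456-7890')
without_area = phone_regex.search('My number is 456-7890')

# The optional group captured the area code and its hyphen.
print(with_area.group(1))     # 123-

# The optional group was skipped, so its value is None.
print(without_area.group(1))  # None
```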

## Matching Zero or More with Asterisk
- The `*` means 'match zero or more': the group that precedes the star can be absent or repeated any number of times

In [27]:
regex = re.compile(r'Bat(wo)*man')
mo = regex.search('The Adventures of Batman')
mo.group()

'Batman'

In [28]:
mo = regex.search('The Adventures of Batwoman')
mo.group()

'Batwoman'

In [29]:
mo = regex.search('The Adventures of Batwowowowowoman')
mo.group()

'Batwowowowowoman'

## Matching Zero or More with Asterisk
- For 'Batman', the `(wo)*` part of the regex matches zero instances of wo in the string; 
- for 'Batwoman', the `(wo)*` matches one instance of wo; 
- for 'Batwowowowowoman', `(wo)*` matches five instances of wo.

## Matching One or More with +
- `+` means “match one or more.” 
- Unlike the star, the group preceding a plus must appear at least once. It is not optional.

In [None]:
regex = re.compile(r'Bat(wo)+man')
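The compiled pattern above can be exercised as in the following sketch, which reuses the sample strings from the star examples. With `+`, 'Batman' no longer matches because at least one wo is required.

```python
import re

regex = re.compile(r'Bat(wo)+man')

# One or more repetitions of "wo" match.
mo1 = regex.search('The Adventures of Batwoman')
print(mo1.group())  # Batwoman

mo2 = regex.search('The Adventures of Batwowowowoman')
print(mo2.group())  # Batwowowowoman

# Zero repetitions do not match, so search() returns None.
mo3 = regex.search('The Adventures of Batman')
print(mo3)  # None
```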