Even more regex
-----

![](images/face_tat.png)

By The End Of This Session You Should Be Able To:
----

- Use quantifiers to add optional elements
- Describe a regex workflow and how to think about debugging
- Use capturing groups to post-process your matches

Matching phone numbers
----

![](https://s3-media4.fl.yelpcdn.com/bphoto/DBdoasm-ehGN5vm2r0qEEg/180s.jpg)

g cafe phone number: 415-805-1888

Match it:
`[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]`  
`\d\d\d-\d\d\d-\d\d\d\d`  
`\d{3}-\d{3}-\d{4}`  

[Test it](http://regexr.com/)

Great!

`\d{3}-\d{3}-\d{4}` uses Quantifiers.

Quantifiers let you specify how many times the preceding expression should match.

`{n}` is the exact quantifier: it matches the preceding expression exactly n times.
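As a quick check, the three phone-number patterns above can be compared in Python. This is a minimal sketch using the standard `re` module; the variable names and the too-short sample string are illustrative assumptions, not part of the original example.

```python
import re

# The three phone-number patterns from above.
patterns = [
    r"[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]",
    r"\d\d\d-\d\d\d-\d\d\d\d",
    r"\d{3}-\d{3}-\d{4}",
]

phone = "415-805-1888"
too_short = "415-805-188"  # illustrative non-match: last block has only 3 digits

# re.fullmatch succeeds only if the entire string matches the pattern.
results = [bool(re.fullmatch(p, phone)) for p in patterns]
rejected = [bool(re.fullmatch(p, too_short)) for p in patterns]

print(results)   # [True, True, True]
print(rejected)  # [False, False, False]
```

All three patterns behave the same on ASCII digits like these. In Python 3, `\d` also matches non-ASCII Unicode digits by default, while `[0-9]` does not.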

Inexact quantifiers
-----

| Symbol | Matches |
|:-------:|:------:|
| question mark (`?`) | zero or one |
| asterisk (`*`) | zero or more |
| plus sign (`+`) | one or more |

`z?a` # zero or one

chunk: za, a   
chink: z, zzza

`z*a` # zero or more

chunk: za, a, zzza   
chink: z

`z+a` # one or more
    
chunk: za, zzza   
chink: z, a

[Test it](http://regexr.com/)
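The chunk (match) and chink (no match) lists above can be reproduced in Python. The sketch below assumes each candidate string must match the pattern in its entirety, which is what `re.fullmatch` checks; the helper name `full_matches` is made up for illustration.

```python
import re

# Candidate strings taken from the examples above.
candidates = ["za", "a", "zzza", "z"]

def full_matches(pattern, strings):
    """Return the strings that the pattern matches in their entirety."""
    return [s for s in strings if re.fullmatch(pattern, s)]

zero_or_one = full_matches(r"z?a", candidates)
zero_or_more = full_matches(r"z*a", candidates)
one_or_more = full_matches(r"z+a", candidates)

print(zero_or_one)   # ['za', 'a']
print(zero_or_more)  # ['za', 'a', 'zzza']
print(one_or_more)   # ['za', 'zzza']
```

Note that `re.search(r"z?a", "zzza")` would still succeed, because it finds `za` at the end of the string. The lists above only hold when the whole string has to match.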

RegEx Workflow
-----

- Start with raw text
- Filter with pattern
- Return capturing groups
- Reformat based on capturing groups
- Test with unit tests
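The workflow steps above can be sketched end to end in Python. This is only one possible illustration: the sample sentence, the `reformat_phones` helper, and the `(AAA) BBB-CCCC` output format are assumptions chosen for the example, and plain `assert` statements stand in for a real unit-test framework.

```python
import re

# 1. Start with raw text (the sentence is made up; the number is from above).
raw_text = "Call g cafe at 415-805-1888 before noon."

# 2. Filter with a pattern. 3. The parentheses mark the capturing groups.
phone_pattern = re.compile(r"(\d{3})-(\d{3})-(\d{4})")

def reformat_phones(text):
    """Return every phone number in text rewritten as (AAA) BBB-CCCC."""
    # 4. Reformat based on the capturing groups.
    return [
        f"({m.group(1)}) {m.group(2)}-{m.group(3)}"
        for m in phone_pattern.finditer(text)
    ]

# 5. Test with simple assertions standing in for unit tests.
assert reformat_phones(raw_text) == ["(415) 805-1888"]
assert reformat_phones("no digits here") == []

formatted = reformat_phones(raw_text)
print(formatted)  # ['(415) 805-1888']
```

Keeping each step separate makes it easier to see whether a bug lives in the pattern, in the groups, or in the reformatting.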

-----
Capturing groups
-----

Problem: You have odd line breaks in your text.

    text = 'Long-\nterm problems with short-\nterm solutions.'
    print(text)

Output:

    Long-
    term problems with short-
    term solutions.


Solution: Write a regex to find the "dash with line break" and replace it with just a dash.

    import re

    # 1st attempt
    text = 'Long-\nterm problems with short-\nterm solutions.'
    re.sub(r'(\w+)-\n(\w+)', r'-', text)

Output:

    '- problems with - solutions.'

Not right! The replacement overwrote the whole match, including the words on either side of the dash.

We need capturing groups!

Capturing groups let you pull out and reuse the parts of a match that are enclosed in parentheses.

For example, suppose you wanted to list all the image files in a folder. You could use a pattern such as `^(IMG\d+\.png)$` to capture and extract the full filename. If you only wanted the filename without the extension, you could use the pattern `^(IMG\d+)\.png$`, which captures only the part before the period.
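The difference between the two filename patterns can be seen in a short Python sketch. The filenames and the `capture_first_group` helper are hypothetical; in practice the names might come from something like `os.listdir()`.

```python
import re

# Hypothetical filenames used only for illustration.
filenames = ["IMG0001.png", "IMG0002.png", "notes.txt", "IMG0003.jpg"]

with_extension = re.compile(r"^(IMG\d+\.png)$")
without_extension = re.compile(r"^(IMG\d+)\.png$")

def capture_first_group(pattern, names):
    """Return group 1 for every name that the pattern matches."""
    captured = []
    for name in names:
        match = pattern.match(name)
        if match:
            captured.append(match.group(1))
    return captured

full_names = capture_first_group(with_extension, filenames)
stems = capture_first_group(without_extension, filenames)

print(full_names)  # ['IMG0001.png', 'IMG0002.png']
print(stems)       # ['IMG0001', 'IMG0002']
```

Both patterns match the same files. Only the position of the closing parenthesis changes what ends up in the group.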

    # 2nd attempt: keep the captured words and drop only the line break
    re.sub(r'(\w+)-\n(\w+)', r'\1-\2', text)

Output:

    'Long-term problems with short-term solutions.'

The parentheses around the word characters (specified by `\w`) mean that any matching text is captured into a group.  

The `\1` and `\2` specifiers refer to the text in the first and second captured groups.  

"Long" and "term" are the first and second captured groups for the first match.  
"short" and "term" are the first and second captured groups for the next match.

__NOTE: Capturing groups use 1-based indexing; group 0 refers to the entire match.__
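To inspect the groups for each match directly, one option is `re.finditer`, which yields a match object per match. The sketch below is illustrative; the tuple layout and variable names are assumptions. In Python's `re` module, `group(0)` is the entire match and numbered groups start at 1.

```python
import re

text = 'Long-\nterm problems with short-\nterm solutions.'

# Collect (group 0, group 1, group 2) for every match.
groups = [
    (m.group(0), m.group(1), m.group(2))
    for m in re.finditer(r'(\w+)-\n(\w+)', text)
]

for whole, first, second in groups:
    print(repr(whole), first, second)
# 'Long-\nterm' Long term
# 'short-\nterm' short term
```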

Summary
-----

- Think through your problem with a workflow so you can isolate where bugs are
    - Start with raw text
    - Filter with pattern
    - Return capturing groups
    - Reformat based on capturing groups
    - Test with unit tests
- Manipulate the results of your regex matches with capturing groups


----