# Regular Expressions


### So, what's RegEx?

Regular expressions (also called regexes or regex patterns) are sequences of characters that specify a search pattern.  In Python, this is made available through the built-in `re` module. Using this little language, you specify the rules for the set of possible strings that you want to match; this set might contain English words, or e-mail addresses, or phone numbers, or anything you like.

Regexes are extremely useful in extracting information from any text by searching for one or more matches of a specific search pattern.  They can also be used for other things, such as string replacement or data validation.


In [1]:
import re

## Table of Contents

Heads up, this is a lengthy notebook, so here are some links to help you jump to certain sections you may be interested in:
   1. [RegEx Basics and Syntax Guide](#RegEx-Basics-and-Syntax-Guide)
   2. [Compiling and Running RegEx in Python](#Compiling-and-running-RegEx-in-Python)
   3. [Common RegEx Patterns](#Common-RegEx-Patterns)
   4. [Grouping](#Grouping)
   5. [Modifying Strings](#Modifying-Strings)
   6. [Caveats](#Caveats)


Obligatory XKCD about RegEx:
<div>
<img src="regex_xkcd.PNG" width="35%"/>
</div>


##  RegEx Basics and Syntax Guide

Before we get into any Python, let's go over the basics of the regex:  How does one write regex patterns?

Whether you're new to RegEx or not, I **highly recommend** using [RegExr](https://regexr.com/).  You need to have this page bookmarked.  Seriously, if you get nothing else from this notebook, just bookmark that page.  It is an excellent tool with great features to explain the elements of your regex and highlight matches in a test string.  It also provides example regexes and documents the regex language far better than I do below.  I don't write a regex without it.

If you aren't new to RegEx syntax, but want to learn more about using it in Python, go ahead and skip ahead to the [Compiling and Running in Python](#Compiling-and-running-RegEx-in-Python) section.  This section is a bit long.

### Metacharacters
  
Most letters and characters will simply match themselves in regex patterns. For example, the regular expression `test` will match the string `test` exactly. (You can enable a case-insensitive mode that would let this RE match `Test` or `TEST` as well; more about this later.)

There are exceptions to this rule; 14 characters are *metacharacters*, which means they have a special meaning and don’t match themselves.  Below is a complete list of the metacharacters.  Their meanings will be discussed in the rest of this notebook.

`^ $ * + ? { } [ ] \ | ( ) . `

#### Anchors `^ $`

  - `^` : match the start of a string (e.g. `^Start` matches the string `Start`, but not `false start`)
  - `$` : match the end of a string (e.g. `Fin$` matches the string `Fin`, but not `finish`)
  
  - `^The end$` : exact string match (string starts and ends with `The end`)
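
The anchor rules above can be checked with `re.search()`, which is covered later in this notebook. This is only a minimal sketch; the helper name `has_match` is made up for the example.

```python
import re

def has_match(pattern, text):
    """Return True if `pattern` matches anywhere in `text`."""
    return re.search(pattern, text) is not None

# ^ pins the match to the start of the string
print(has_match(r"^Start", "Start"))          # True
print(has_match(r"^Start", "false start"))    # False

# $ pins the match to the end of the string
print(has_match(r"Fin$", "Fin"))              # True
print(has_match(r"Fin$", "finish"))           # False

# Both anchors together: the whole string must be `The end`
print(has_match(r"^The end$", "The end"))     # True
print(has_match(r"^The end$", "The end..."))  # False
```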

#### Quantifiers ` * + ? { }`

  - `abc*`       : matches a string that has `ab` followed by **zero or more** `c`
  - `abc+`       : matches a string that has `ab` followed by **one or more** `c`
  - `abc?`       : matches a string that has `ab` followed by **zero or one** `c`
  - `abc{2}`     : matches a string that has `ab` followed by **exactly 2** `c`
  - `abc{2,}`    : matches a string that has `ab` followed by **2 or more** `c`
  - `abc{2,5}`   : matches a string that has `ab` followed by **2 up to 5** `c`
  - `a(bc)*`     : matches a string that has `a` followed by **zero or more** copies of the sequence `bc`
  - `a(bc){2,5}` : matches a string that has `a` followed by 2 up to 5 copies of the sequence `bc`
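
To see how the quantifiers differ, the sketch below tests a few strings against each pattern. It assumes `re.fullmatch()`, so the *whole* string has to fit the pattern, which makes the differences easier to spot. The helper name `fits` is hypothetical.

```python
import re

def fits(pattern, text):
    """Return True if the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None

patterns = [r"abc*", r"abc+", r"abc?", r"abc{2}", r"abc{2,}"]
for text in ["ab", "abc", "abcc", "abccc"]:
    accepted = [p for p in patterns if fits(p, text)]
    print(f"{text!r} fits: {accepted}")

# A quantifier after a group repeats the whole group
print(fits(r"a(bc)*", "a"))          # True  (zero copies of bc)
print(fits(r"a(bc){2,5}", "abc"))    # False (only one copy)
print(fits(r"a(bc){2,5}", "abcbc"))  # True  (two copies)
```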

#### Character Classes  `[ ]`

   - `[abc]`       : matches a single character that is either an a, a b, or a c
      - Note: This is equivalent to using `a|b|c` (`|` is a boolean "OR")
   - `[a-c]`       : same as previous
      - Note: `-` denotes a "range" here. You will commonly see character classes like `[A-Z]` to match all uppercase English letters.
   - `[a-fA-F0-9]` : a single hexadecimal digit, case insensitively
   - `[0-9]%`      : a single digit from 0 to 9 followed by a % sign
   - `[^a-zA-Z]`   : a single character that is not a letter from a to z or from A to Z
     - Note: In this case the `^` negates the class, so it matches the *complement* of the listed characters
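
A short sketch of the character classes above, using `re.findall()` (introduced later in this notebook) on made-up sample strings:

```python
import re

# One hexadecimal digit at a time; the x is not a hex digit
hex_digits = re.findall(r"[a-fA-F0-9]", "0xBEEF")
print(hex_digits)   # ['0', 'B', 'E', 'E', 'F']

# A single digit immediately followed by a % sign
percents = re.findall(r"[0-9]%", "100% sure, 7% luck")
print(percents)     # ['0%', '7%']

# Negated class: every character that is not an English letter
non_letters = re.findall(r"[^a-zA-Z]", "a1 b2!")
print(non_letters)  # ['1', ' ', '2', '!']
```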

#### The Backslash `\`

Perhaps the most important metacharacter.  The backslash can be followed by various characters to represent various special sequences. It’s also used to escape all the metacharacters so you can still match them in patterns; for example, if you need to match a `[` or `\`, you can precede them with a backslash to remove their special meaning: `\[` or `\\`.

An incomplete list of common special sequences is below.  For a complete list, see the last part of [Regular Expression Syntax](https://docs.python.org/3/library/re.html#re-syntax) in the Standard Library reference.

  - `\d` : Matches any decimal digit. For ASCII text this is equivalent to the class `[0-9]`; in Python 3, `str` patterns also match other Unicode digits unless the `re.ASCII` flag is set.
  - `\D` : Matches anything other than a decimal digit (negation/complement of above)
     - Note: Most backslash-defined character classes are negated by using the capital letter, as with the two above.
  - `\s` : Matches any whitespace character (space, tab, newline, and similar).
  - `\w` : Matches any alphanumeric character or underscore. For ASCII text this is equivalent to the class `[a-zA-Z0-9_]`.
  - `\W` : Matches any character that `\w` does not. For ASCII text this is equivalent to the class `[^a-zA-Z0-9_]`.
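
Here is a small sketch of these sequences on a made-up string. The last part illustrates one subtlety: in Python 3, `\d` on a `str` pattern also matches non-ASCII decimal digits unless `re.ASCII` is passed. The sample values are assumptions chosen for illustration.

```python
import re

text = "Room 42, floor 3\tEast_wing"

digits = re.findall(r"\d+", text)
words = re.findall(r"\w+", text)
spaces = re.findall(r"\s", text)
non_word = re.findall(r"\W+", text)

print(digits)    # ['42', '3']
print(words)     # ['Room', '42', 'floor', '3', 'East_wing']
print(spaces)    # [' ', ' ', ' ', '\t']
print(non_word)  # [' ', ', ', ' ', '\t']

# \d is Unicode-aware by default; re.ASCII restricts it to 0-9
arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE
unicode_hits = re.findall(r"\d", arabic_three)
ascii_hits = re.findall(r"\d", arabic_three, flags=re.ASCII)
print(unicode_hits == [arabic_three])  # True
print(ascii_hits)                      # []
```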

#### Group Constructs `( )`

This is more of an advanced topic, but we'll cover it more in [Grouping](#Grouping).

  - `a(bc)`       : parentheses create a **capturing group** with value `bc`
  - `a(?:bc)*`    : using `?:` we **disable the capturing group**
  - `a(?P<foo>bc)` : using `?P<foo>` we assign a name to the group

#### The dot `.`
- The final metacharacter in this section is `.`. By default it matches anything except a newline character (although there's an option to have it match newlines too).  `.` is often used where you want to match “any character”.
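
As a quick sketch of the dot's default behaviour, the example below runs the same pattern with and without the `re.DOTALL` flag (described in the compilation flags section further down). The sample string is made up.

```python
import re

text = "abc a\nc"

# By default the dot refuses to match the newline between a and c
default_hits = re.findall(r"a.c", text)
print(default_hits)  # ['abc']

# With re.DOTALL the dot matches the newline as well
dotall_hits = re.findall(r"a.c", text, flags=re.DOTALL)
print(dotall_hits)   # ['abc', 'a\nc']
```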


## Compiling and running RegEx in Python

Phew, ok.  I think that's enough syntax for this notebook.

Now how do we use RegEx in Python?  Let's compile our first regex using `re.compile()` below.

In [2]:
p = re.compile('needle')
p

re.compile(r'needle', re.UNICODE)

Ok, so now we have a pattern object that matches the string `needle` exactly... what can we do with it?

Here are the 4 most common methods/attributes of the pattern and why they're used:

<p align="left">

| Method     | Purpose                                                                                                                                                                                    |
|------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **findall()**  | Find all substrings where the regex matches, and returns the matching parts<br> of the string as a **list**.                                                                                                                   |
| **finditer()** | Find all substrings where the regex matches, and returns the matching parts<br>of the string as an **iterator** of `MatchObject`s.                                                                                                              |
| **search()**   | Scan through a string, looking ***for the first location*** where this regex matches.<br>Returns a `MatchObject` if a match is found, and `None` if not.                                                                                                                   |
| **match()**    | Determine if the regex matches ***at the beginning of the string***.<br>Returns a `MatchObject` if a match is found, and `None` if not.   |


</p>

#### `findall()`
Let's start with my personal favorite, `findall()`, to grab all the `needle`s from this string.

In [3]:
string = """hay hay hay hay hay hay hay hay hayyyyyy, needle... hay hay hay, needle hay hay hay
            hay hay hneedleay hay hNEEDLEay hay"""

p.findall(string)

['needle', 'needle', 'needle']

We found 3 `needle`s.  But wait, I see a big capitalized needle in there, how do we do a case insensitive version of this?


#### Enter, Compilation Flags
In this case, we can use the **compilation flag** `re.IGNORECASE` when we compile our pattern to get all possible casing of the string `needle` (Note: The other `re` methods, such as `re.findall()`, also accept flags as arguments).

In [4]:
p_ignore = re.compile('needle', re.IGNORECASE)
p_ignore.findall(string)

['needle', 'needle', 'needle', 'NEEDLE']
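
As the note above says, flags can also be passed to the module-level functions. Python additionally supports writing the flag inline at the start of the pattern as `(?i)`. The sketch below compares the two on a shortened, made-up haystack.

```python
import re

haystack = "hay needle hay NEEDLE hay Needle"

# Flag passed as an argument to the module-level function
by_argument = re.findall("needle", haystack, flags=re.IGNORECASE)

# Same flag written inline at the very start of the pattern
by_inline_flag = re.findall("(?i)needle", haystack)

print(by_argument)                    # ['needle', 'NEEDLE', 'Needle']
print(by_argument == by_inline_flag)  # True
```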

Note: There are loads of other potentially useful compilation flags including:
  - **re.MULTILINE**: This affects anchors, `^` and `$`.  When this flag is specified, `^` matches at the beginning of the string AND at the beginning of each line within the string, immediately following each newline. Similarly, the `$` metacharacter matches both at the end of the string and at the end of each line (immediately preceding each newline).
  - **re.DOTALL** : Makes the `.` special character match any character at all, including a newline; without this flag, `.` will match anything except a newline
  - **re.VERBOSE** : Allows you to write more readable regex, complete with comments.  This ignores whitespace characters in the pattern unless they are within a character class (e.g. [char \n \t] will still match spaces, tabs, and newline)
     - Example of a pattern you can write with re.VERBOSE:
       ```
       re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.VERBOSE)
       ```
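
To make the effect of `re.MULTILINE` concrete, here is a small sketch on a made-up three-line string; the variable names are only for illustration.

```python
import re

log = "first line\nsecond line\nthird line"

# Without MULTILINE, ^ and $ only see the ends of the whole string
first_only = re.findall(r"^\w+", log)
last_only = re.findall(r"\w+$", log)
print(first_only)  # ['first']
print(last_only)   # ['line']

# With MULTILINE, they also match at every line boundary
line_starts = re.findall(r"^\w+", log, flags=re.MULTILINE)
line_ends = re.findall(r"\w+$", log, flags=re.MULTILINE)
print(line_starts)  # ['first', 'second', 'third']
print(line_ends)    # ['line', 'line', 'line']
```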


#### `finditer()`
`finditer()` finds the same matches, but it returns an iterator of `MatchObject` instances, which is useful if you need to do some extra fancy processing using the attributes of those objects (such as start and end positions, as shown below).

In [5]:
for match in re.finditer('needle', string, flags = re.IGNORECASE):
    s = match.start()
    e = match.end()
    print(f"String match '{string[s:e]}' at positions {s}:{e}")

String match 'needle' at positions 42:48
String match 'needle' at positions 65:71
String match 'needle' at positions 105:111
String match 'NEEDLE' at positions 119:125


#### `search()`/`match()`
Let's go over another example using `re.search()` and `re.match()` to grab the *first phone numbers* in some strings.

In [6]:
phone_pattern = re.compile(r'\d{3}-\d{3}-\d{4}') ## With area code and no country code

In [7]:
## Define strings
phonebook_string = """Phonebook:
                    Front desk: 555-234-5678
                    Boss man: 555-345-6789
                    Other guy: 888-987-6543
                    """
phone_string = "555-234-5678"

## Run search() on each string
search_phonebook = phone_pattern.search(phonebook_string)
search_phone_str = phone_pattern.search(phone_string)

## Run match() on each string
match_phonebook  = phone_pattern.match(phonebook_string)
match_phone_str  = phone_pattern.match(phone_string)

### Print results
print("Results using search():")
print(f"phonebook string: {search_phonebook}")
print(f"phone string:     {search_phone_str}")

print("\nResults using match():")
print(f"phonebook string: {match_phonebook}")
print(f"phone string:     {match_phone_str}")

Results using search():
phonebook string: <re.Match object; span=(43, 55), match='555-234-5678'>
phone string:     <re.Match object; span=(0, 12), match='555-234-5678'>

Results using match():
phonebook string: None
phone string:     <re.Match object; span=(0, 12), match='555-234-5678'>


Notice that `match()` returns `None` when the phone number pattern is not at the start of the string (in the case of the full phonebook string).  Checking if `search()` has a start position of 0 is basically equivalent to `match()`.
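
A minimal sketch of that relationship, using a shortened made-up string: `search()` finds the number wherever it is, `match()` only at position 0, and anchoring the pattern with `^` makes `search()` behave like `match()` (assuming `re.MULTILINE` is not set).

```python
import re

phone = re.compile(r"\d{3}-\d{3}-\d{4}")
text = "Front desk: 555-234-5678"

found = phone.search(text)
print(found.start())      # 12
print(phone.match(text))  # None

# An anchored pattern only matches at the start, just like match()
anchored = re.compile(r"^\d{3}-\d{3}-\d{4}")
print(anchored.search(text))  # None
print(anchored.search("555-234-5678 is the front desk").span())  # (0, 12)
```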

Since `MatchObject`s evaluate to `True` and `None` evaluates to `False`, a common Pythonic way to check if there is a match anywhere in a string is to use if/else logic as below.

In [8]:
not_a_phonebook_string = """just a bunch of text without any phone numbers.... nothing to see here"""

## Loop over both strings, numbering them from 1
for counter, text in enumerate([phonebook_string, not_a_phonebook_string], start=1):
    m = phone_pattern.search(text)
    if m:
        print(f"We found a phone number at position {m.start()}:{m.end()} in string {counter}!!")
    else:
        print(f"No phone numbers found in string {counter} :(")


We found a phone number at position 43:55 in string 1!!
No phone numbers found in string 2 :(


### Beware the backslash

One final note on compilation... you may notice that above we have been using **raw strings**.  This is because standard Python strings treat the backslash (`\`) as an escape character for things like tabs and newlines.

If you're not careful about using raw strings, your regex may not be doing what you think.  In the example below, we try to pull any words that start with a backslash.

Let's try doing this without a raw string.

In [9]:
data = r"""this is a string with \words with \backslashes before \some."""

backslash_pattern = re.compile('\\\w+')
backslash_pattern.findall(data)

['\\w']

Hmm, that didn't work.  Let's check out our pattern again.

In [10]:
backslash_pattern

re.compile(r'\\w+', re.UNICODE)

Oh!  We forgot to put an "r" in front of our pattern, so this pattern just looks for a single backslash followed by 1 or more "w"s instead of alphanumeric characters.

To do this properly without raw strings, you would need *SIX* backslashes: each of the three backslashes the regex needs has to be escaped again for Python.

In [11]:
backslash_pattern_2 = re.compile('\\\\\\w+')
backslash_pattern_2.findall(data)

['\\words', '\\backslashes', '\\some']

But with raw strings, you only need the 3 that I tried at the start.

In [12]:
backslash_raw_pattern = re.compile(r'\\\w+')
backslash_raw_pattern.findall(data)

['\\words', '\\backslashes', '\\some']

Lesson: just put "r" in front of your regex strings.  It makes it easier to read and understand.
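
To convince yourself that the raw and non-raw spellings describe the same pattern, you can compare the strings directly. A small sketch, with variable names made up for the example:

```python
import re

raw_pattern = r'\\\w+'        # three backslashes, exactly as the regex engine sees them
escaped_pattern = '\\\\\\w+'  # six backslashes in a regular string

print(raw_pattern == escaped_pattern)  # True
print(len(raw_pattern))                # 5 characters: \ \ \ w +
print(escaped_pattern)                 # \\\w+

# Either spelling finds a word that starts with a backslash
hits = re.findall(raw_pattern, r"a \word here")
print(hits)                            # ['\\word']
```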

## Common RegEx Patterns

Below are some regexes that may be useful to you.  These are written in verbose mode to demonstrate readability and to give explanations of what each element of the regex is doing.

In [13]:
email_regex = re.compile(r"""
                         (?:[\w\.\-_]+)?         ## optional string of alphanumeric,underscore/dot, etc.
                         \w+                     ## alphanumeric before @
                         @[\w\-_]+(?:\.\w+){1,}  ## @ character and website name, 1 or more extensions
                         """, re.VERBOSE)

In [14]:
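phone_regex = re.compile(r"""
                         (?:\+?(?: |-|\.)?          ## Optionally start with + and a separator
                         \d{1,2}                    ## 1- or 2-digit country code
                         (?: |-|\.)?)?              ## optional separator
                         (?:\(?\d{3}\)?|\d{3})      ## area code with/without parens
                         (?: |-|\.)?                ## optional separator
                         (?:\d{3}(?: |-|\.)?\d{4})  ## 3 digits, optional separator, 4 digits
                         ## NOTE: re.VERBOSE drops the bare space in (?: |-|\.), so only - and .
                         ## act as separators here; use [ ] to match a literal space.
                         """, re.VERBOSE)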

In [15]:
date_regex = re.compile(r"""
                        ### Date in M/D/YYYY, MM/DD/YYYY, M-D-YYYY, or MM-DD-YYYY
                        (?:0?[1-9]|1[0-2]) ## Month part
                        [\/-](?:0?[1-9]|[12]\d|3[01])[\/-](?:19|20)\d{2}
                        """, re.VERBOSE)

Demoing these on a long string containing a bunch of stuff.

In [16]:
string_with_lots_of_stuff = """Here's a string with words that aren't emails, phones, and dates...but also has
emails: test123.part2@email.co.uk, name@email.com, firstname.lastname234@domain.ru.net
phones: +1-(800)-555-5555, +93-200-555-5555, 999.923.2935, 1-855-345-6789, 234-5678
dates:  12/31/2006, 09-11-2001, 1/2/2000, February 1st, 2019
and lots of other nonsense... blah blah blah.
"""

In [17]:
emails_found = email_regex.findall(string_with_lots_of_stuff)
phones_found = phone_regex.findall(string_with_lots_of_stuff)
dates_found  = date_regex.findall(string_with_lots_of_stuff)

print(f"Emails found in string: {emails_found}")
print(f"Phones found in string: {phones_found}")
print(f"Dates found in string: {dates_found}")

Emails found in string: ['test123.part2@email.co.uk', 'name@email.com', 'firstname.lastname234@domain.ru.net']
Phones found in string: ['+1-(800)-555-5555', '+93-200-555-5555', '999.923.2935', '1-855-345-6789']
Dates found in string: ['12/31/2006', '09-11-2001', '1/2/2000']


Notice, these patterns aren't silver bullets and don't get *all* of the phones/dates in the string.  With all of the ways these things may be formatted, it's recommended to make multiple regexes to capture everything you need and make it easier to understand than one massive regex.
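
One subtlety worth checking when you write patterns in verbose mode: because `re.VERBOSE` ignores whitespace outside character classes, a bare space inside an alternation such as `(?: |-|\.)` is dropped and becomes an empty alternative. The sketch below uses two small hypothetical patterns (not the ones above) to show the difference between an alternation and a character class.

```python
import re

# Hypothetical mini-patterns: 3 digits, a separator, 4 digits
alternation = re.compile(r"\d{3} (?: |-|\.) \d{4}", re.VERBOSE)
char_class = re.compile(r"\d{3} [ .-] \d{4}", re.VERBOSE)

sample = "555 1234, 555-1234, 5551234"

# The space alternative is ignored, leaving an empty alternative instead
alternation_hits = alternation.findall(sample)
print(alternation_hits)  # ['555-1234', '5551234']

# Inside a character class the space is kept
char_class_hits = char_class.findall(sample)
print(char_class_hits)   # ['555 1234', '555-1234']
```

If you want a literal space to count as a separator in a verbose pattern, putting it inside a character class is one way to do it.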

## Grouping

You may have noticed in the syntax for common regexes above, we use `(?:...)` in our group constructs (where `...` are characters in the pattern).  This is used for **non-capturing groups**, which means Python will not retrieve the contents of that group separately.

Let's take a look at **capturing groups** and why they can be useful for breaking up the matches.

Below I've copied the `phone_regex` from above but removed the `?:` syntax and some of the groups.

In [18]:
phone_capture_grp_regex = re.compile(r"""
                             (\+?[ \-.]?             ## Optionally start with +
                             \d{1,2}                  ## 1/2 digit country code
                             )?([ \-.])?                ## spacing
                             (\(?\d{3}\)?|\d{3})      ## area code with/without parens
                             ( |-|\.)?                ## spacing
                             (\d{3}(?: |-|\.)?\d{4})  ## 3 digits, a space, followed by 4 digits
                             """, re.VERBOSE)

In [19]:
phone_groups_found = phone_capture_grp_regex.findall(string_with_lots_of_stuff)
phone_groups_found

[('+1', '-', '(800)', '-', '555-5555'),
 ('+93', '-', '200', '-', '555-5555'),
 ('', ' ', '999', '.', '923.2935'),
 (' 1', '-', '855', '-', '345-6789')]

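So when groups are present, `findall()` returns a list of tuples, one per match, with each tuple holding the text captured by each group (a group that didn't take part in the match shows up as an empty string).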

We can also assign a name to each of the groups using the syntax `(?P<name>...)`.  But, we'll need to use a different function, such as `finditer()` that returns a `MatchObject` to access a dictionary containing the names of each group.  This is shown below.

In [20]:
phone_capture_grp_name_regex = re.compile(r"""
                             (?P<country_cd>\+?[ \-.]?           ## Optionally start with +
                             \d{1,2}                             ## 1/2 digit country code
                             )?(?P<space_1>[ \-.])?              ## spacing
                             (?P<area_cd>\(?\d{3}\)?|\d{3})      ## area code with/without parens
                             (?P<space_2> |-|\.)?                ## spacing
                             (?P<number>\d{3}(?: |-|\.)?\d{4})   ## 3 digits, a space, followed by 4 digits
                             """, re.VERBOSE)

In [21]:
phone_named_groups_found = phone_capture_grp_name_regex.finditer(string_with_lots_of_stuff)
for match in phone_named_groups_found:
    print(f"group dictionary for {''.join([g for g in match.groups() if g])}: {match.groupdict()}")

group dictionary for +1-(800)-555-5555: {'country_cd': '+1', 'space_1': '-', 'area_cd': '(800)', 'space_2': '-', 'number': '555-5555'}
group dictionary for +93-200-555-5555: {'country_cd': '+93', 'space_1': '-', 'area_cd': '200', 'space_2': '-', 'number': '555-5555'}
group dictionary for  999.923.2935: {'country_cd': None, 'space_1': ' ', 'area_cd': '999', 'space_2': '.', 'number': '923.2935'}
group dictionary for  1-855-345-6789: {'country_cd': ' 1', 'space_1': '-', 'area_cd': '855', 'space_2': '-', 'number': '345-6789'}


You can see how labeling the individual parts within each match could be useful metadata to gather (imagine that you need to get a list of all phone numbers for a given set of countries and/or a set of area codes).
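
As a rough sketch of that idea, the example below filters phone numbers by area code using a named group. The pattern is a simplified stand-in for the one above, and the sample text and `wanted_area_codes` set are made up.

```python
import re

# Simplified pattern: area code, then the 7-digit number
phone = re.compile(r"(?P<area_cd>\d{3})-(?P<number>\d{3}-\d{4})")

text = "555-234-5678, 888-987-6543, 555-345-6789"
wanted_area_codes = {"555"}

# Keep the full match only when the named group is in the wanted set
hits = [
    m.group(0)
    for m in phone.finditer(text)
    if m.group("area_cd") in wanted_area_codes
]
print(hits)  # ['555-234-5678', '555-345-6789']
```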

## Modifying Strings

Aside from performing searches against a static string, regexes are also commonly used to modify strings in various ways, using the following methods:

| Method      | Purpose                                                                                   |
|-------------|:------------------------------------------------------------------------------------------|
| **split()** | Split the string into a list, splitting it wherever<br> the RE matches                        |
| **sub()**   | Find all substrings where the RE matches, and replace<br> them with a different string        |
| **subn()**  | Does the same thing as `sub()`, but returns the new<br> string and the number of replacements |

### `split()`

`split()` can be seen as an extended version of the standard Python string method, `str.split()`.  Instead of splitting by a static string, you can use special regex patterns to split the string.

Here's a quick example tokenizing the words of a long string using whitespace patterns.

In [22]:
%pprint
story = """
        Once upon a time, there was a string that needed to be split.
           It was long and had   multiple lines
                and tabs, and   lots   of      spaces!!
        """

whitespace_regex = re.compile(r"\s+")

whitespace_regex.split(story)

Pretty printing has been turned OFF


['', 'Once', 'upon', 'a', 'time,', 'there', 'was', 'a', 'string', 'that', 'needed', 'to', 'be', 'split.', 'It', 'was', 'long', 'and', 'had', 'multiple', 'lines', 'and', 'tabs,', 'and', 'lots', 'of', 'spaces!!', '']
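
Note the empty strings at both ends of the result: they come from the leading and trailing whitespace. Here is a small sketch, on made-up strings, of filtering them out and of splitting on several delimiters at once.

```python
import re

text = "  one  two\tthree\n"

tokens = re.split(r"\s+", text)
print(tokens)   # ['', 'one', 'two', 'three', '']

# Drop the empty strings left by leading/trailing whitespace
cleaned = [t for t in tokens if t]
print(cleaned)  # ['one', 'two', 'three']

# Split on commas or semicolons, swallowing any spaces that follow
fields = re.split(r"[,;]\s*", "a, b;c,  d")
print(fields)   # ['a', 'b', 'c', 'd']
```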

### `sub()`

`sub()` can be seen as an extended version of the standard Python string method, `str.replace()`.  Instead of replacing a static string, you can use special regex patterns to replace within the string.

Here's an example replacing all 2+ digit numbers:

In [23]:
statement = "I have 2 dogs, 37 cats, and 182 chickens."

number_regex = re.compile(r"\d{2,}")

number_regex.sub('lots of', statement)

'I have 2 dogs, lots of cats, and lots of chickens.'
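
The replacement doesn't have to be a fixed string. `re.sub()` also accepts backreferences to captured groups (`\1`, `\2`, ...) and a function that receives each match. A short sketch, with made-up inputs:

```python
import re

# Backreferences: reorder MM/DD/YYYY into YYYY-MM-DD
iso = re.sub(r"(\d{2})/(\d{2})/(\d{4})", r"\3-\1-\2", "Due 12/31/2006")
print(iso)      # Due 2006-12-31

# Callable replacement: double every number in the sentence
def double(match):
    return str(int(match.group(0)) * 2)

doubled = re.sub(r"\d+", double, "I have 2 dogs, 37 cats, and 182 chickens.")
print(doubled)  # I have 4 dogs, 74 cats, and 364 chickens.
```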

### `subn()`

`subn()` is basically the same as `sub()`, but it returns the number of replacements too.

Here's what it does using the same example:

In [24]:
replaced_string, n_replacements = number_regex.subn('lots of', statement)

print(f"New string after {n_replacements} replacements: '{replaced_string}'")

New string after 2 replacements: 'I have 2 dogs, lots of cats, and lots of chickens.'


## Summary

As you’ve seen, regexes can be super useful.  They should be a go-to tool for any Python coder facing such problems as:
  - Data extraction
  - String/pattern replacement
  - Data validation
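
Data validation wasn't demonstrated above, so here is a minimal sketch. It assumes `re.fullmatch()`, which only succeeds when the entire string fits the pattern, and reuses the simple phone pattern from earlier. The function name `is_valid_phone` is hypothetical.

```python
import re

PHONE = re.compile(r"\d{3}-\d{3}-\d{4}")

def is_valid_phone(value):
    """Return True only if the entire string is a 3-3-4 digit phone number."""
    return PHONE.fullmatch(value) is not None

print(is_valid_phone("555-234-5678"))       # True
print(is_valid_phone("555-234-56789"))      # False (extra digit)
print(is_valid_phone("call 555-234-5678"))  # False (extra text)
```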

### Caveats

A funny quote often shared about RegEx goes like this:

    "Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems."

What I think is meant by this is that **RegEx can get out of hand and hard to maintain.**  If you write regular expressions that are difficult to read into your codebase, you are doing it wrong.

Here are some tips I stole from a blog post:
  1. **Do not try to do everything in one uber-regex.** I know you can do it that way, but you're not going to. It's not worth it. Break the operation down into several smaller, more understandable regular expressions, and apply each in turn. Nobody will be able to understand or debug that monster 20-line regex, but they might just have a fighting chance at understanding and debugging five mini regexes.
  2. **Use whitespace and comments.** It isn't 1997 any more. A tiny ultra-condensed regex is no longer a virtue. Flip on the IgnorePatternWhitespace option (`re.VERBOSE` in Python), then use that whitespace to make your regex easier for us human beings to parse and understand. Comment liberally.
  3. **Get a regular expression tool.** I don't stare at regular expressions and try to suss out their meaning through sheer force of will. Neither should you. It's a waste of time. I paste them into my regex tool of choice, [RegExr](https://regexr.com/), which not only tells me what the regular expression does, but also lets me run it through some test data. All in real time as I type.
  4. **Regular expressions are not Parsers.** Although you can do some amazing things with regular expressions, they are weak at balanced tag matching. Some regex variants have balanced matching, but it is clearly a hack – and a nasty one. You can often make it kinda-sorta work, as I have in the sanitize routine. But no matter how clever your regex, don't delude yourself: it is in no way, shape or form a substitute for a real live parser.


There are some tasks that can be done with regular expressions, but the expressions turn out to be very complicated. In these cases, you may be better off writing Python code to do the processing; while Python code will be slower than an elaborate regular expression, it will also probably be more understandable.