# Why Regular Expressions?

Regular expressions are a compact "mini-language" within Python used to identify, modify, and extract from strings that have certain characteristics.  With slight syntax differences across variants, "regexen" are available in nearly every programming language, and in a variety of other tools such as text editors and command-line tools.

At heart, a regular expression *pattern* is a way of describing which text within a larger string *matches*.  Special characters can describe complex features of the patterns, but in the simplest case, a pattern can simply be a literal string to match.  This course uses a convenience function, `show()`, at times to focus on the matches themselves. Notice that this function places a small middle dot on each side of every match and colors the match red to emphasize matched substrings (including spaces, where applicable).

In [1]:
from src.setup import *
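
The source of `show()` is not reproduced in this lesson. For readers following along without the course repository, a rough stand-in can be sketched with the standard `re` module. The name `mark_matches` is hypothetical, and this sketch only adds the middle dots, not the red coloring.

```python
import re

def mark_matches(pattern, text, flags=0):
    """Hypothetical stand-in for show(): wrap each match in middle dots."""
    # re.sub calls the lambda once per match and splices in its return value
    return re.sub(pattern, lambda m: f"·{m.group()}·", text, flags=flags)

marked = mark_matches(r'Mary', "Mary had a little lamb")
print(marked)
```

The dots make spaces at the edge of a match visible, which is the point the description above makes about `show()`.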

## Simple Examples

As a start, let us match a literal pattern.  We can do that with a shell tool like `grep` (**G**lobally search for a **R**egular **E**xpression and **P**rint matching lines).  This tool can also match the full patterns we show throughout the course, but after this one example, we use Python.

In [2]:
%%bash
opt='-n -C2 --color=always'
grep $opt 'Mary' data/rhyme.txt

[32m[K1[m[K[36m[K:[m[K[01;31m[KMary[m[K had a little lamb
[32m[K2[m[K[36m[K-[m[KIts fleece as white as snow
[32m[K3[m[K[36m[K:[m[KAnd everywhere that [01;31m[KMary[m[K
[32m[K4[m[K[36m[K-[m[Kwent, the lamb was sure
[32m[K5[m[K[36m[K-[m[Kto go


We can perform the same match using our Python utility function.

In [3]:
show(r'Mary', rhyme)
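
Outside the course environment, the same literal match can be approximated with the standard `re` module. In this sketch the `rhyme` string is an assumption: it is reconstructed from the `grep` output above and may differ from the course's actual variable (for example, in trailing whitespace).

```python
import re

# Assumed stand-in for the course's `rhyme` text (reconstructed from the grep output)
rhyme = (
    "Mary had a little lamb\n"
    "Its fleece as white as snow\n"
    "And everywhere that Mary\n"
    "went, the lamb was sure\n"
    "to go"
)

# A literal pattern matches itself wherever it occurs in the string
matches = list(re.finditer(r'Mary', rhyme))
for m in matches:
    # m.span() gives the (start, end) offsets of the match within rhyme
    print(m.span(), repr(m.group()))
```

Each match object records where the match was found, which is the information a highlighting helper needs.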

## Limiting the Match

Several special characters are available to constrain the application of patterns.  We will usually use Python *raw strings* to define patterns.  In a raw string, most escape sequences that would have meaning in an ordinary Python string are instead passed unaltered to the regular expression engine.
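
A small sketch of why raw strings matter, using only the standard `re` module. The sample `text` is an assumption chosen for illustration.

```python
import re

# In an ordinary string, '\b' is a single backspace character (ASCII 8)
plain = '\bas'
# In a raw string, the backslash and the 'b' both reach the regex engine
raw = r'\bas'

print(len(plain), len(raw))

text = "Its fleece as white as snow"
plain_hits = re.findall(plain, text)  # looks for backspace + 'as'
raw_hits = re.findall(raw, text)      # looks for 'as' at a word boundary
print(plain_hits, raw_hits)
```

The two patterns look nearly identical in source code but ask the regex engine for different things.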

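For example, the character `^` limits a match to the beginning of a line, while the character `$` limits a match to the end of a line.  In Python's `re` module this per-line behavior requires the `re.MULTILINE` flag; without it, `^` anchors only at the start of the whole string and `$` only at its end (or just before a final newline).  The sequence `\b` restricts a match to word boundaries. None of these symbols themselves match any characters.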

In [4]:
show(r'^Mary', rhyme)

In [5]:
show(r'Mary$', rhyme)

In [6]:
show(r'as', rhyme)

In [7]:
show(r'\bas', rhyme)

In [8]:
show(r'as\b', rhyme)
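
The anchors above can also be tried with the plain `re` module. This is a sketch under two assumptions: the `rhyme` string is reconstructed from the earlier `grep` output, and `re.MULTILINE` is passed explicitly because the flags that `show()` applies internally are not stated here.

```python
import re

# Assumed stand-in for the course's `rhyme` text (reconstructed from the grep output)
rhyme = (
    "Mary had a little lamb\n"
    "Its fleece as white as snow\n"
    "And everywhere that Mary\n"
    "went, the lamb was sure\n"
    "to go"
)

# With re.MULTILINE, ^ and $ anchor at every line start and line end
starts = re.findall(r'^Mary', rhyme, re.MULTILINE)
ends = re.findall(r'Mary$', rhyme, re.MULTILINE)

# Without the flag, $ anchors only at the end of the whole string
ends_no_flag = re.findall(r'Mary$', rhyme)

# \b is zero-width: it restricts where 'as' may match but adds no characters
anywhere = re.findall(r'as', rhyme)      # includes the 'as' inside 'was'
word_start = re.findall(r'\bas', rhyme)  # only 'as' that begins a word
word_end = re.findall(r'as\b', rhyme)    # 'as' that ends a word, so 'was' counts

print(starts, ends, ends_no_flag)
print(len(anywhere), len(word_start), len(word_end))
```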

## Wildcard and Quantifiers

The character `.` can be used to match any single character, and is sometimes called the "wildcard."  More limited wildcards are available as well:

 Wildcard | Behavior
:--------:|--------------------------
|   .     | Any single character (except newline, by default)
|  \d     | Any decimal digit
|  \D     | Any non-digit character
|  \s     | Any whitespace character
|  \S     | Any non-whitespace character
|  \w     | Any alphanumeric character or underscore
|  \W     | Any character other than alphanumerics and underscore

Three "quantifiers" are also available.  These indicate the following:

 Quantifier | Behavior
:----------:|-------------------------------
|    *      | Zero or more of the preceding item
|    +      | One or more of the preceding item
|    ?      | Zero or one of the preceding item
    

In [9]:
# Surrounded by word boundaries, not-whitespace, 'as'
show(r'\b\Sas\b', rhyme)

In [10]:
# Match an 'l' followed by one or two 'e'
show(r'lee?', rhyme)

In [11]:
# Any character (including space) followed by one or more 'e's
show(r'.e+', rhyme)

In [12]:
# Whole words containing one or more e's. 
# Zero or more letters, one or more 'e', zero or more letters
show(r'\w*e+\w*', rhyme)
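
To see the matched substrings as a plain list rather than as highlighting, `re.findall` can be used. As before, the `rhyme` string in this sketch is an assumed reconstruction from the `grep` output, and the room-number sentence is invented for illustration.

```python
import re

# Assumed stand-in for the course's `rhyme` text (reconstructed from the grep output)
rhyme = (
    "Mary had a little lamb\n"
    "Its fleece as white as snow\n"
    "And everywhere that Mary\n"
    "went, the lamb was sure\n"
    "to go"
)

# One non-whitespace character, then 'as', as a whole word
three_letter = re.findall(r'\b\Sas\b', rhyme)

# 'l', then 'e', then an optional second 'e'
l_then_e = re.findall(r'lee?', rhyme)

# Whole words that contain at least one 'e'
e_words = re.findall(r'\w*e+\w*', rhyme)

# Runs of one or more decimal digits (sample sentence is illustrative only)
numbers = re.findall(r'\d+', "Rooms 101 and 7")

print(three_letter, l_then_e, e_words, numbers, sep="\n")
```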

## Matching Too Much

Regular expression patterns, by default, are *greedy*. They match as much as they possibly can. Sometimes this is what you want, but often it is not.  For example, let us try to match just the immediate comparison describing the fleece (i.e., "fleece as white").

In [13]:
# Want to match just "some characters until word boundary"
show(r'fleece as .+\b', rhyme)

We match all the way to the end of the line here.  But it gets worse if we use `re.DOTALL`, the flag that makes the dot (`.`) match newlines as well.

In [14]:
# Want to match just "some characters until word boundary"
show(r'fleece as .+\b', rhyme, re.DOTALL)

The problem is that the subpattern `.+` wants to match as much as it can.  It must end at a word boundary, but the end of the entire ditty is also a word boundary, so the greedy match runs all the way there.  The solution is to transform quantifiers into their non-greedy form.


 Quantifier | Behavior
:----------:|----------------------------------
|    *?     | Zero or more, as few as possible
|    +?     | One or more, as few as possible
|    ??     | Zero or one, as few as possible

In [15]:
# Successfully match just "some characters until word boundary"
show(r'fleece as .+?\b', rhyme, re.DOTALL)
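
The three behaviors in this section can be compared side by side with `re.search`. This sketch assumes the `rhyme` text reconstructed from the `grep` output, ending without a trailing newline.

```python
import re

# Assumed stand-in for the course's `rhyme` text (reconstructed from the grep output)
rhyme = (
    "Mary had a little lamb\n"
    "Its fleece as white as snow\n"
    "And everywhere that Mary\n"
    "went, the lamb was sure\n"
    "to go"
)

# Greedy, dot excludes newline: runs to the end of the line
greedy_line = re.search(r'fleece as .+\b', rhyme).group()

# Greedy, dot includes newline: runs to the end of the whole text
greedy_all = re.search(r'fleece as .+\b', rhyme, re.DOTALL).group()

# Non-greedy: stops at the first word boundary that lets the match succeed
lazy = re.search(r'fleece as .+?\b', rhyme, re.DOTALL).group()

print(repr(greedy_line))
print(repr(greedy_all))
print(repr(lazy))
```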

## Numeric Range Quantifiers

The basic quantifiers in regexen are "some", "many", and "maybe", spelled as `*`, `+`, and `?`.  However, Python (like many other tools) extends these with specific numeric ranges for occurrence counts.  These are given by subpatterns with curly braces.

The general pattern is `{m,n}`, but either end of the range may be omitted: `{m,}` means at least `m` repetitions, and `{,n}` means at most `n`.  A single number in the curly braces indicates an exact count.  The non-greedy modifier `?` may also be used with numeric quantifiers.

In [16]:
# Match from 10-32 characters ending at word boundary
show(r'fleece as .{10,32}\b', rhyme, re.DOTALL)

In [17]:
# Non-greedy match from 10-32 characters ending at word boundary
show(r'fleece as .{10,32}?\b', rhyme, re.DOTALL)

In [18]:
# Match exactly 15 characters ending at word boundary (FAIL)
show(r'fleece as .{15}\b', rhyme, re.DOTALL)

In [19]:
# Match exactly 17 characters ending at word boundary (SUCCESS)
show(r'fleece as .{17}\b', rhyme, re.DOTALL)

In [20]:
# Match at least 15 characters ending at word boundary
show(r'fleece as .{15,}\b', rhyme, re.DOTALL)

In [21]:
# Non-greedy match at least 15 chars ending at word boundary
show(r'fleece as .{15,}?\b', rhyme, re.DOTALL)
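
The numeric quantifiers above can be checked with `re.search`, printing exactly what each pattern captures. This is a sketch that assumes the `rhyme` text reconstructed from the `grep` output; the helper name `matched` is hypothetical.

```python
import re

# Assumed stand-in for the course's `rhyme` text (reconstructed from the grep output)
rhyme = (
    "Mary had a little lamb\n"
    "Its fleece as white as snow\n"
    "And everywhere that Mary\n"
    "went, the lamb was sure\n"
    "to go"
)

def matched(pattern):
    """Hypothetical helper: return the matched text, or None if nothing matches."""
    m = re.search(pattern, rhyme, re.DOTALL)
    return m.group() if m else None

exact_15 = matched(r'fleece as .{15}\b')          # 15th character falls mid-word
exact_17 = matched(r'fleece as .{17}\b')          # 17th character ends a word
range_greedy = matched(r'fleece as .{10,32}\b')   # as many as allowed, then back off
range_lazy = matched(r'fleece as .{10,32}?\b')    # as few as allowed
at_least_lazy = matched(r'fleece as .{15,}?\b')   # fewest that is 15 or more

for result in (exact_15, exact_17, range_greedy, range_lazy, at_least_lazy):
    print(repr(result))
```

Note that the greedy range match ends just after a space, because the position between a space and the following letter also counts as a word boundary.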