# More Regular Expressions

Let's look at a few more important topics for regular expressions. 

## Anchors and Boundaries

Regular expressions allow you to match specific positions in text. This is very useful when you want to match single words or phrases, identify sentences, etc. Here are the most commonly used anchors and boundaries in regex.

- `^` matches the beginning of a string, or of each line when the `re.MULTILINE` flag is set (when it is **not** inside `[]`)
- `$` matches the end of a string, or of each line when the `re.MULTILINE` flag is set
- `\A` matches the beginning of a string
- `\Z` matches the end of a string
- `\b` matches a word boundary and allows "whole words only" searches in the form of `\bword\b`
- `\B` matches any position that is not a word boundary
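
As a rough sketch of how the line anchors differ from the string anchors, the snippet below runs the same patterns with and without `re.MULTILINE`. The two-line `text` string is a made-up sample for illustration only.

```python
import re

# Hypothetical two-line sample text, used only for this illustration.
text = "high tide\nhigh noon"

# Without flags, ^ matches only at the very start of the string.
start_default = re.findall(r"^high", text)
print(start_default)    # ['high']

# With re.MULTILINE, ^ also matches right after each newline.
start_multiline = re.findall(r"^high", text, re.MULTILINE)
print(start_multiline)  # ['high', 'high']

# \A always means the start of the whole string, even with re.MULTILINE.
start_string = re.findall(r"\Ahigh", text, re.MULTILINE)
print(start_string)     # ['high']

# With re.MULTILINE, $ matches at the end of each line; \Z only at the very end.
end_multiline = re.findall(r"\w+$", text, re.MULTILINE)
end_string = re.findall(r"\w+\Z", text, re.MULTILINE)
print(end_multiline)    # ['tide', 'noon']
print(end_string)       # ['noon']
```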

Let's look at a few examples.

In [None]:
import re
s = "high income from the higher end of the spectrum is quite high, but what is highest still?"

In [None]:
# Can we find the string "hi" at the beginning of the string s?
re.findall(r"^hi", s)

In [None]:
# Find only the word "high" using word boundaries
re.findall(r"\bhigh\b", s)

### Which Ones?

How do we know which of the "high" substrings was matched? We can use `re.finditer()` to get back an iterator of `Match` objects, each of which records the start and end positions of the matched substring. These positions appear in the `span` part of the printout.

In [None]:
# Use re.finditer() to get Match objects
for i in re.finditer(r"\bhigh\b", s):
    print(i)

In [None]:
# What happens if we use \b at the beginning and \B at the end?
for i in re.finditer(r"\bhigh\B", s):
    print(i)
    print(s[i.start():i.end()+5])
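
Instead of printing each `Match` object, the positions can also be pulled out directly. The following is a minimal sketch using the same string `s`; the variable name `hits` is introduced here only for illustration.

```python
import re

s = "high income from the higher end of the spectrum is quite high, but what is highest still?"

# Collect (matched text, start, end) for each whole-word "high".
hits = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\bhigh\b", s)]
print(hits)

# m.span() returns the same (start, end) pair as a tuple, and slicing
# the original string with it recovers the matched text.
for m in re.finditer(r"\bhigh\b", s):
    start, end = m.span()
    print(start, end, s[start:end])
```

Because the end position is exclusive, `s[start:end]` gives back exactly the matched substring.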

## Quantifiers in Regex

If you want to specify whether a regex element can be matched zero, one, or many times, then you should use *quantifiers*. Quantifiers are always placed ***after*** regex elements. The most commonly used quantifiers include:

- `+` matches an item one or more times
- `*` matches an item zero or more times
- `{n}` matches an item exactly $n$ times
- `{k,n}` matches an item at least $k$ times and at most $n$ times
- `?` matches an item zero or one time

Let's look at a few examples.

In [None]:
testStrings = ["bt", "bet", "beet", "beeet", "beeeet"]

In [None]:
# zero or more
for t in testStrings:
    print(re.findall(r"be*t", t))

In [None]:
# one or more
for t in testStrings:
    print(re.findall(r"be+t", t))

In [None]:
# exactly once
for t in testStrings:
    print(re.findall(r"be{1}t", t))

In [None]:
# at least 1, no more than 2
for t in testStrings:
    print(re.findall(r"be{1,2}t", t))

In [None]:
# zero or one
for t in testStrings:
    print(re.findall(r"be?t", t))
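
To compare the quantifiers side by side, the sketch below checks which test strings each pattern matches in full. It uses `re.fullmatch()` rather than `re.findall()`, which is a choice made here for a compact summary; the names `patterns` and `matched` are illustrative.

```python
import re

testStrings = ["bt", "bet", "beet", "beeet", "beeeet"]
patterns = [r"be*t", r"be+t", r"be{1}t", r"be{1,2}t", r"be?t"]

# For each pattern, list the test strings that it matches from start to end.
matched = {p: [t for t in testStrings if re.fullmatch(p, t)] for p in patterns}

for p, hits in matched.items():
    print(f"{p:10} -> {hits}")
```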

## Groups in Regex

You can group elements of a regex by enclosing them in parentheses `()`. By default, groups are captured in addition to the whole regular expression. In other words, a regex `Match` object stores the text of the complete match as well as the text matched by each group. Groups can be thought of as regex subpatterns. (Note: To turn off capturing for a group, start it with `(?:`, as in `(?:pattern)`.)

Let's look at a few examples.

In [None]:
assets = "Total Assets = $10,000,000"

In [None]:
for i in re.finditer(r"Total Assets = (\$[\d,\.]+)\b", assets):
    print(i)
    print(i.groups())
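
The difference between capturing and non-capturing groups can be sketched as follows, reusing the `assets` string. Wrapping the label in a group is an illustrative variation on the pattern above, not part of the original example.

```python
import re

assets = "Total Assets = $10,000,000"

# Two capturing groups: the label and the dollar amount.
m = re.search(r"(Total Assets) = (\$[\d,\.]+)", assets)
print(m.group(0))  # the whole match
print(m.group(1))  # the first group
print(m.group(2))  # the second group
print(m.groups())  # all captured groups as a tuple

# Same pattern, but the label group is non-capturing,
# so only the amount is stored.
m2 = re.search(r"(?:Total Assets) = (\$[\d,\.]+)", assets)
print(m2.groups())
```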

## Look Ahead and Look Behind in Regex

One of the powerful characteristics of regex is that it is possible to check whether a regex item is followed by a certain pattern without including that pattern in the resulting match. ***Positive look ahead*** checks whether a regex item is followed by a given pattern using the syntax `(?=pattern)`. ***Negative look ahead***, denoted by `(?!pattern)`, checks whether a regex item is **not** followed by a given pattern.

We can also check to see if a regex item is preceded (or not) by a certain pattern without including that pattern in the output match. ***Positive look behind*** checks whether a regex item is preceded by a given pattern using the syntax `(?<=pattern)`. ***Negative look behind***, denoted `(?<!pattern)`, checks whether a regex item is **not** preceded by a given pattern.

Let's look at a few examples.

In [None]:
names = ["filename.txt", "filename.csv"]

In [None]:
# Does .txt follow "filename"?
for i in names:
    print(re.findall(r"filename(?=\.txt)", i))  # \. matches a literal dot

In [None]:
# Does .txt NOT follow "filename"?
for i in names:
    print(re.findall(r"filename(?!\.txt)", i))  # \. matches a literal dot

In [None]:
years = ["year 2020", "series 2020"]

In [None]:
# Does "year " (with space) show up before 4 consecutive digits?
for y in years:
    print(re.findall(r"(?<=year\s)\d{4}", y))

In [None]:
# Does "year " (with space) NOT show up before 4 consecutive digits?
for y in years:
    print(re.findall(r"(?<!year\s)\d{4}", y))
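
One way to see that the look ahead and look behind patterns are checked but not included in the match is to compare match spans. This is a small sketch; the variable names are chosen here for illustration.

```python
import re

name = "filename.txt"

# The look ahead requires ".txt" to follow, but the match stops before it.
m_ahead = re.search(r"filename(?=\.txt)", name)
print(m_ahead.group(), m_ahead.span())

# A plain pattern consumes the extension as part of the match.
m_plain = re.search(r"filename\.txt", name)
print(m_plain.group(), m_plain.span())

# The look behind requires "year " before the digits,
# but the match starts at the first digit.
m_behind = re.search(r"(?<=year\s)\d{4}", "year 2020")
print(m_behind.group(), m_behind.span())
```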