# Introduction to Regular Expressions

We've seen some useful and powerful methods of the Python `str` type for formatting, splitting, and manipulation. Those methods have some drawbacks, however. For example, when we split a string, the resulting list contains **mostly** words: some items still have trailing punctuation marks, such as commas or periods, attached. The good news is that we have a more powerful tool at our disposal called *regular expressions*, or *regex* for short.

Regular expressions are a huge topic. In fact, there are entire books written on the topic. Therefore, we are only going to be able to scratch the surface. We'll see examples that should give you a glimpse into the power of regular expressions and how they can be used across various scenarios.

One way to think of regular expressions is that they are simply a way to conduct *flexible pattern matching* in strings. If you have any experience with Unix or Linux and use the command line, then you are probably familiar with the wildcard character `*`. For example, if we want a listing of all the `.txt` files in a directory, on a Linux machine we would issue the command `ls *.txt`. On a Windows machine we would similarly use `dir *.txt`. We can access command line commands from Jupyter Notebooks by preceding the command with the exclamation mark, `!`. Let's try it.

In [None]:
# can use dir on a Windows machine
!dir *.txt

# can use ls on Linux machine
#!ls *.txt

Regular expressions are more than just wildcards; they offer a wide range of flexible string-matching syntax. To use regular expressions in Python, you import the standard-library `re` module. With regular expressions we could duplicate some of the built-in string methods, such as `split()`, but the advantage of learning and using them is that they are **much more flexible**.

## Commonly Used Functions

Within the `re` module there are many useful methods/functions. Here are some of the most commonly used ones:

- `re.search(pattern, string)` scans through a string looking for the first location where the `pattern` matches; output is a `Match` object if there is a match or `None` otherwise.
- `re.findall(pattern, string)` finds all non-overlapping substrings where the `pattern` matches and returns them as a list.
- `re.split(pattern, string)` splits a string at every match of the `pattern` and returns a list of strings. 
- `re.sub(pattern, repl, string)` finds all matches of `pattern` in `string` and replaces them with `repl`.
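
As a quick illustration of the four functions above, here is a minimal sketch. The sample sentence and the variable names are made up for this example, and the pattern is just the literal string `fish`.

```python
import re

# Made-up sample text for illustration
text = "red fish, blue fish, old fish"

first = re.search("fish", text)          # Match object for the first "fish"
all_matches = re.findall("fish", text)   # every occurrence, as a list
pieces = re.split(", ", text)            # split wherever ", " matches
replaced = re.sub("fish", "bird", text)  # replace every occurrence

print(first.span())
print(all_matches)
print(pieces)
print(replaced)
```

Note that `re.search()` returns a `Match` object rather than the matched text; calling `.span()` or `.group()` on it gives the location or the matched substring.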

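When creating regular expressions (i.e., the patterns), you often use raw string literals by preceding the string with `r`. For example, we might call `re.search(r"pattern", string)`. Recall that raw string literals treat backslashes as literal backslashes instead of converting escape sequences to special characters. This matters because regular expressions use backslashes heavily: the pattern `r"\b"` reaches `re` as a backslash followed by `b`, which `re` reads as a word boundary, whereas `"\b"` is converted by Python to a single backspace character before `re` ever sees it. (For `\n` the two forms happen to behave the same, because `re` itself also understands `\n` as a newline.)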
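
To see why the `r` prefix matters, here is a small sketch using `\b`, which means "backspace" in an ordinary Python string but "word boundary" in a regular expression. The sample text and variable names are made up for this example.

```python
import re

# "\b" is one character (backspace); r"\b" is two (backslash, then b)
print(len("\b"), len(r"\b"))

text = "a tart starts"

# Without r, the pattern contains backspace characters, which text lacks
normal_hits = re.findall("\btart\b", text)

# With r, re receives \b and treats it as a word boundary,
# so only the standalone word "tart" is found, not the one inside "starts"
raw_hits = re.findall(r"\btart\b", text)

print(normal_hits)
print(raw_hits)
```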

## Syntax

You need to `import re` to get started. You then come up with an expression and, optionally, compile it with `re.compile()`. When you compile the regular expression into a pattern object, that object has methods for various operations, such as searching for pattern matches or performing string substitutions. The most commonly used methods are:

| Method/Attribute | Purpose                | 
| :----------------|:----------------------- | 
| `match()` | Determine if the regular expression matches at the beginning of the string. |
| `search()` | Scan through a string, looking for any location where this RE matches. |
| `findall()` | Find all substrings where the RE matches, and return them as a list. |
| `finditer()` | Find all substrings where the RE matches, and return them as an [iterator](https://docs.python.org/3/glossary.html#term-iterator). |
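
The difference between these methods is easiest to see side by side. The following is a minimal sketch with a made-up string and the literal pattern `cat`; the names `pat` and `starts` are just for illustration.

```python
import re

pat = re.compile("cat")
s = "concatenate the cat catalog"

# match() only looks at the beginning of the string, so this is None
print(pat.match(s))

# search() finds the first occurrence anywhere in the string
print(pat.search(s).span())

# finditer() yields one Match object per occurrence
starts = [m.start() for m in pat.finditer(s)]
print(starts)

# findall() returns just the matched text
print(pat.findall(s))
```

`finditer()` is handy when you need positions as well as the matched text, since each `Match` object carries both.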

Let's start with a simple example where we build a regular expression with a simple string of characters. It will match that exact string.

In [None]:
import re

In [None]:
# build and compile the regular expression to look for "tart"
regex = re.compile("tart")
s = "What starts as tart never turns sweet."

# Search for regex in s; if found get Match object
regex.search(s)

In [None]:
# Find all of the matches in s
regex.findall(s)

Notice that a `list` is returned with all of the matches. You can see how many were found by looking at the length of the returned list. 

In [None]:
listOfMatches = regex.findall(s)
print(f"We found {len(listOfMatches)} matches")
print(f"listOfMatches:\n{listOfMatches}")

### Characters and Character Sets

#### Special Characters

There are several special-purpose characters in regular expressions. These include: `.`, `^`, `$`, `*`, `+`, `?`, `{`, `}`, `[`, `]`, `\`, `|`, `(`, and `)`. For a special character to be treated literally, you need to add a backslash before that character. For example, `\$` indicates the dollar sign. Let's look at a few examples.

In [None]:
earnings = "Apple's earnings per share for the three months that ended in December 2020 was $5.67."

In [None]:
# Try to find the string "$5.67"
eRegex = re.compile("$5.67")
eRegex.findall(earnings)

Well, that returned an empty `list` meaning that it was not found. Let's **escape** the dollar sign with a backslash and see if it will find it.

In [None]:
# Try escaping the dollar sign by preceding it with a backslash
# (raw string, so Python passes the backslash through to re unchanged)
eRegex = re.compile(r"\$5.67")
eRegex.findall(earnings)
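
One subtlety worth noting: the `.` in the pattern above is also a special character, so it matches *any* character, not just a period. The sketch below uses a made-up string to show the difference, and also shows `re.escape()`, a function in the `re` module that adds the backslashes for you. The variable names are only for illustration.

```python
import re

# Made-up text containing a look-alike that should NOT match
text = "was $5.67, not $5967"

# Unescaped dot: matches any character between 5 and 67
loose = re.findall(r"\$5.67", text)

# Escaped dot: matches only a literal period
strict = re.findall(r"\$5\.67", text)

# re.escape() builds the fully escaped pattern from the literal text
escaped = re.escape("$5.67")

print(loose)
print(strict)
print(escaped)
```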

#### Character Sets

A *character set* specifies a set of characters, any single one of which will match. To specify a character set, you use square brackets, `[` and `]`. For example, `[hik]` would match any one of the lower case letters `h`, `i`, or `k`. Some examples of character sets include:

- `[a-z]` matches any lower case letter between `a` and `z`.
- `[A-Z]` matches any upper case letter between `A` and `Z`. 
- `[a-zA-Z]` matches any letter, both lower case and upper case, in the English alphabet.
- `[0-9]` matches a single digit between 0 and 9. 

You can also **negate** character sets by using the special character `^` inside the square brackets. For example, `[^a-z]` will match anything as long as it is **not** a lower case letter in the range of `a` to `z`. 
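
Here is a short sketch of character sets and negation in action. The string `code` is made up for this example.

```python
import re

# Made-up sample string
code = "Room B12, key hik-7"

hik = re.findall("[hik]", code)          # any single h, i, or k
upper = re.findall("[A-Z]", code)        # any upper case letter
digits = re.findall("[0-9]", code)       # any digit
not_lower = re.findall("[^a-z]", code)   # anything that is NOT a lower case letter

print(hik)
print(upper)
print(digits)
print(not_lower)
```

Notice that the negated set also picks up spaces and punctuation, since those are not lower case letters either.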

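There are some built-in common character sets that have a shorthand notation. The bracketed equivalents shown below are exact for ASCII text (or when the `re.ASCII` flag is used); with ordinary Python 3 `str` patterns, `\d`, `\w`, and `\s` also match the corresponding Unicode digits, letters, and whitespace. Examples include: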

- `.` matches any character except new line.
- `\d` matches any decimal digit (i.e., `[0-9]`).
- `\D` matches any **non-digit** character (i.e., `[^0-9]`).
- `\w` matches any alphanumeric character (i.e., `[a-zA-Z0-9_]`); also called a word character and includes digits and underscores.
- `\W` matches a **non-word** character (i.e., `[^a-zA-Z0-9_]`).
- `\s` matches any space, new line, tab, carriage return, etc. (i.e., whitespace; `[ \t\n\r\f\v]`).
- `\S` matches everything but whitespace (i.e., `[^ \t\n\r\f\v]`).

In [None]:
# Let's find all the digits in earnings string
tempRegex = re.compile(r"\d")
tempRegex.findall(earnings)

In [None]:
# Let's find all the non-digits in earnings string
tempRegex = re.compile(r"\D")
tempRegex.findall(earnings)

In [None]:
# What about whitespace
tempRegex = re.compile(r"\s")
tempRegex.findall(earnings)

In [None]:
# Non-word characters
tempRegex = re.compile(r"\W")
tempRegex.findall(earnings)

In [None]:
# Make a new pattern to search for
# Any lower case letter
p = re.compile("[a-z]")
p

In [None]:
# Match looks at the BEGINNING of the string
m = p.match(earnings)
print(m)
if m:
    print(m.group())

In [None]:
# Match looks at the BEGINNING of the string;
# after lowercasing, the string begins with a lower case letter
m = p.match(earnings.lower())
print(m)
if m:
    print(m.group())

The `*` character is a *quantifier*, not a standalone wildcard as it is on the command line: it matches 0 or more repetitions of the character (or character set) that precedes it. Alternatively, if you want at least 1 repetition (instead of 0), you can use the `+` character. Let's create two new regular expressions that examine how these special characters differ.

In [None]:
# p1: "b", then 0 or more "e", then "t"
# p2: "b", then 1 or more "e", then "t"
p1 = re.compile("be*t")
print(f"p1 is {p1}")

p2 = re.compile("be+t")
print(f"p2 is {p2}")

In [None]:
# create a list of strings to search through
bStrings = ["b", "be", "bee", "bt", "bet", "beet", "beeet"]

In [None]:
for i in bStrings:
    if p1.match(i):
        print(f"p1 matched {i}")
        
    if p2.match(i):
        print(f"p2 matched {i}")
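
Keep in mind that `match()` only anchors the pattern at the *start* of the string; it does not require the pattern to consume the whole string. The sketch below compares `match()`, `search()`, and `fullmatch()` (another pattern-object method, which requires the entire string to match) on a few made-up candidates. The `results` dictionary is just a convenient way to collect the outcomes.

```python
import re

p2 = re.compile("be+t")

# For each candidate, record whether match(), search(), and fullmatch() succeed
results = {}
for candidate in ["beet", "beets", "a beet"]:
    results[candidate] = (
        bool(p2.match(candidate)),      # must match at the start
        bool(p2.search(candidate)),     # may match anywhere
        bool(p2.fullmatch(candidate)),  # must match the entire string
    )
    print(candidate, results[candidate])
```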

We can try to split our earnings sentence into words. We saw earlier that we can just call `.split()` on the string itself. There are some drawbacks to that approach, but it at least gives us an **approximation** of the words found in the string. We can also try to find all of the words using regular expressions.

In [None]:
print(f"There are approximately {len(earnings.split())} words in earnings. Here they are:\n{earnings.split()}")

In this case, the sentence does not have any quirks (e.g., dashes), so we get the correct count by simply splitting on whitespace with `.split()`. If our sentence were something like the one below, however, `.split()` would overcount the number of words.

In [None]:
sent = "Yet many vaccinated people continue to obsess over the risks from Covid — because they are so new and salient."

In [None]:
print(f"Count using split = {len(sent.split())}")
print(sent.split())

Let's try with regular expressions. We know that `\w` matches word characters, so let's first try finding every run of one or more of them in our earnings sentence.

In [None]:
print("Count using \\w =", len(re.findall(r'[\w]+', earnings)))
print(re.findall(r'[\w]+', earnings))

Well, that did not end up like we had hoped: `Apple's` was split at the apostrophe and `5.67` at the decimal point. Let's try putting word boundaries around the pattern and allowing apostrophes and hyphens.

In [None]:
x = re.findall(r"\b[a-zA-Z\'\-]+\b", earnings)
print(len(x))
print(x)

What happened now? We lost the year and the dollar amount. In some sense this is correct because the remaining items are the only *true words*. Which way you count depends largely on what you are trying to accomplish. Let's try the other sentence and see what happens.

In [None]:
y = re.findall(r"\b[a-zA-Z\'\-]+\b", sent)
print(len(y))
print(y)

For this particular sentence it looks like our regular expression worked the way we hoped: it removed the dash. Notice also that in both cases the last word does not have the period attached to it, which is probably what we want.

Now, we create another "sentence" that contains a few characters that we haven't seen in our examples yet. Before running the `findall()` cell that follows the definition of `sent2`, take a minute to predict what will be included in and excluded from the result.

In [None]:
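# Adjacent string literals are joined, which avoids embedding the
# continuation line's indentation in the string
sent2 = ("Here's another sentence; it contains some other hard-coded items: 1. here 2. maybe "
         "this will be under_represented by some stuff! Maybe?")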

In [None]:
z = re.findall(r"\b[a-zA-Z\'\-]+\b", sent2)
print(len(z))
print(z)

### Try with File

Now, let's try with our file from before, `nyTimes.txt` (the code below reads a UTF-8 copy named `nyTimes_UTF8.txt`).

In [None]:
with open("nyTimes_UTF8.txt", "r", encoding="UTF8") as ny:
    nyt = ny.read()
    
# Find all "words" two ways
# 1. with split on the string
sWords = nyt.split()
print(f"Using .split() there were {len(sWords)} words")


# 2. with regex findall
sWords2 = re.findall(r"\b[a-zA-Z\'\-]+\b", nyt)
print(f"Using regex there were {len(sWords2)} words")

In [None]:
print(sWords2)