## Introduction to Regular Expressions

*This notebook is based on tutorials by [Doug Knox](https://programminghistorian.org/en/lessons/understanding-regular-expressions), [Laura Turner O'Hara](https://programminghistorian.org/en/lessons/cleaning-ocrd-text-with-regular-expressions#regular-expressions-regex), [William J. Turkel and Adam Crymble](https://programminghistorian.org/en/lessons/normalizing-data#python-regular-expressions), and [Sejal Jaiswal](https://www.datacamp.com/community/tutorials/python-regular-expression-tutorial).*

What are regular expressions?

Doug Knox puts it this way: "Regular expressions (or “regexes” for short) are a way of defining patterns that can apply to sequences of things. They have the funny name that they do because of their origins in computer science and formal language theory, and they are incorporated into most general programming languages."

In other words: regex offers a (relatively) standard syntax for finding and/or replacing patterns in text. 

Depending on your experience, you may have already encountered regex in R via functions like `grep` and packages like `stringr`. 

If not (or even if so), it may be helpful to have some of the basic regex syntax explained again.

### Basic regex syntax

Say you have this sample text:

`Let's eat a delicious FRUIT $alad. N0M n0m!`

You could isolate all the capital letters (L, F, R, U, I, T, N, M) with this regex:

`[A-Z]`

A construction like `[A-Z]` makes use of a few basic features of regex:

* Square brackets indicate a set of possible things to match. (The things go inside.)
* A-Z is how you indicate a range of characters to match. 
* Regex is case sensitive, so A-Z would match only uppercase letters. To match lowercase chars, you'd use `[a-z]`.

### Historical aside

Ever wonder why we call capital letters uppercase and lowercase letters lowercase? This is why:

![cases](https://i.imgur.com/l6A3Urh.png)

### Let's try out our first regex using Python

In [1]:
import re # this is the python regular expression library

txt = "Let's eat a delicious FRUIT $alad. N0M n0m!"

ans = re.findall("[A-Z]", txt)

ans

['L', 'F', 'R', 'U', 'I', 'T', 'N', 'M']

As you've just observed, `findall` is a method that--you guessed it--*finds all* matches of a pattern. 

Let's review a few more basic ways of formulating regex patterns: 

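A `^` outside the brackets but before the pattern anchors the pattern to the beginning of the string. It matches only if the string *starts* with the pattern, so you get at most one result, e.g.: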

In [2]:
ans = re.findall("^[A-Z]", txt) 
ans

['L']
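
To see that `^` really means "start of the string" and not "first match," here is a small sketch that runs the same pattern against a second string. The `lower_txt` variable is a made-up variant of our sample text that begins with a lowercase letter.

```python
import re

txt = "Let's eat a delicious FRUIT $alad. N0M n0m!"
# hypothetical variant of the sample text that starts with a lowercase letter
lower_txt = "let's eat a delicious FRUIT $alad. N0M n0m!"

# ^ anchors the pattern to the very start of the string
starts_upper = re.findall(r"^[A-Z]", txt)
starts_lower = re.findall(r"^[A-Z]", lower_txt)

print(starts_upper)  # ['L']
print(starts_lower)  # [] (there are capitals later on, but not at the start)
```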

A `^` *inside* the brackets, as the first character, negates the set: `[^A-Z]` matches any character that is *not* an uppercase letter.

In [4]:
ans = re.findall("[^A-Z]", txt) 
ans

['e',
 't',
 "'",
 's',
 ' ',
 'e',
 'a',
 't',
 ' ',
 'a',
 ' ',
 'd',
 'e',
 'l',
 'i',
 'c',
 'i',
 'o',
 'u',
 's',
 ' ',
 ' ',
 '$',
 'a',
 'l',
 'a',
 'd',
 '.',
 ' ',
 '0',
 ' ',
 'n',
 '0',
 'm',
 '!']

Those two different usages of the ^ are why some people delight in mastery of regex and others abhor it. It is TOTALLY FINE to have to look up a regex cheat sheet like [this one](http://www.pyregex.com/) if you can't remember the syntax off the top of your head. 

Also on that point:

<img src="http://lklein.lmc.gatech.edu/wp-content/uploads/2019/09/Screen-Shot-2019-09-18-at-11.21.11-AM.png" width="500" />

Re: step 4, above, note that the [cheat sheet I linked to above](http://www.pyregex.com/) also contains a web-based tester, which is another incredibly helpful tool. There are many such testers on the internet. I recommend that you use them.

In any case, back to basic regex.

A number inside curly braces says how many times in a row the preceding element must match. For example, `{5}` means exactly five matches in a row:

In [6]:
ans = re.findall("[A-Z]{5}", txt)

ans

['FRUIT']
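
Changing the number changes what counts as a match. The sketch below tries three different counts on the same sample text. Note that `findall` does not return overlapping matches, which is why `{3}` finds only one result inside FRUIT.

```python
import re

txt = "Let's eat a delicious FRUIT $alad. N0M n0m!"

three = re.findall(r"[A-Z]{3}", txt)  # exactly three capitals in a row
five = re.findall(r"[A-Z]{5}", txt)   # exactly five capitals in a row
six = re.findall(r"[A-Z]{6}", txt)    # no run of capitals is this long

print(three)  # ['FRU']
print(five)   # ['FRUIT']
print(six)    # []
```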

There are all sorts of special sequences that let you look for specific types of characters. For example, `\W` is how you indicate any non-word character, meaning anything other than a letter, a digit, or an underscore, as in the following:

In [7]:
ans = re.findall(r"[\W]", txt) # the r prefix makes this a raw string, so Python leaves the backslash alone

ans

["'", ' ', ' ', ' ', ' ', ' ', '$', '.', ' ', ' ', '!']

If you are looking for multiple types of things, you can just run them together inside the square brackets. For example, you could find both non-word characters and digits in this way:

In [10]:
ans = re.findall(r"[\W0-9]", txt) # raw string again, because of the backslash

ans

["'", ' ', ' ', ' ', ' ', ' ', '$', '.', ' ', '0', ' ', '0', '!']

Regex can seem intimidating. But with some practice, you will start to remember some of the most commonly used syntax and expressions. That said, you will need to consult a cheat sheet (or Google, as above) more often than not.

Here are some but not all of the most common regex parts:

| Syntax  | Description |
| :------ | :---------- |
| `A b 1` | literals - letters, digits, and spaces match themselves |
| `[Ab1]` | a character class, matching any of `A`, `b`, or `1` in this case |
| `[a-z]` | any one lowercase letter in the range `a` to `z` |
| `[0-9]` | any one digit |
| `^` | beginning of string (when used outside square brackets) |
| `$` | end of string |
| `.` | any character except a newline |
| `*` | zero or more of the preceding element |
| `+` | one or more of the preceding element |
| `()` | creates a capture group for future reference |
| `\s` | whitespace (there are many special sequences that begin with `\` like this) |
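
Here is a quick sketch that tries out several rows of the table on our sample text. The variable names and the short `"NM N0M N00M"` test string are made up for illustration.

```python
import re

txt = "Let's eat a delicious FRUIT $alad. N0M n0m!"

# . matches any character, $ anchors the pattern to the end of the string
ending = re.findall(r"n0m.$", txt)

# + means one or more of the preceding element
digits = re.findall(r"[0-9]+", txt)

# * means zero or more of the preceding element (tested on a made-up string)
n_m = re.findall(r"N0*M", "NM N0M N00M")

# with one capture group, findall returns only what the group captured
caps = re.findall(r"\s([A-Z]+)\s", txt)

# \s matches a single whitespace character
spaces = re.findall(r"\s", txt)

print(ending)       # ['n0m!']
print(digits)       # ['0', '0']
print(n_m)          # ['NM', 'N0M', 'N00M']
print(caps)         # ['FRUIT']
print(len(spaces))  # 7
```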

We will practice with these and a few more as we work through the rest of this notebook.

### Python's regex functions

There are a few helpful functions included in Python's "re" library. We've already used `findall`, but here are a few more:

`re.match` finds a pattern anchored at the beginning of a string. Because it only looks at the beginning of a string, it's not always exceptionally useful. But let's just see how it works.

In [14]:
print("Let's recall our example string:")
print(txt + "\n")

pattern = "[A-Z]{5}" # let's look for five uppercase letters, as above

print("Looking for five uppercase letters at the start...")
if re.match(pattern,txt):
    print("Match!")
else: 
    print("Sorry. No match.")

Let's recall our example string:
Let's eat a delicious FRUIT $alad. N0M n0m!

Looking for five uppercase letters at the start...
Sorry. No match.


So, we don't have a match here because "FRUIT" is not the first five characters in the string.

But what about if we look for "Let's"?

Cue **Exercise 1**!

Here are some regex building blocks to use in this exercise:

* `[A-Z]` will get you all uppercase letters
* `[a-z]` will get you all lowercase letters
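* A `'` is not a special character in regex, so it simply matches itself. Putting a `\` before it (an escape sequence) is only required when the Python string itself is delimited by single quotes. In a double-quoted string the backslash is optional.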
* Adding a `+` after a set of characters will match one or more instances of that set

In [57]:
# So, match doesn't work here. But what about if we search for "Let's"?

pattern = "[A-Z]+[a-z]+\'s" # fill in this part

if re.match(pattern,txt):
    print("Match!")
else: 
    print("Sorry. No match.")

Match!


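Another thing to know about `match` is that, unlike `findall`, it does not return a list. It returns a single match object, or `None` if there is no match. In other words, you can test for the fact of a match. Or you can examine the specific characters that were matched: indexing the match object with `[0]` gives you the full matched text. For example: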

In [15]:
pattern = "[A-Z]+[a-z]+\'s" # same as above

matches = re.match(pattern,txt)

matches[0]

"Let's"

We've spent a lot of time on a not so helpful function. Much more helpful is `search`, which searches the entire string. 

Let's go back to looking for FRUIT:

In [16]:
pattern = "[A-Z]{5}" # remember this?

matches = re.search(pattern,txt)

matches[0]

'FRUIT'
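
To compare the three functions side by side, here is a minimal sketch that looks for a digit in our sample text. It also shows why it is worth checking for `None` before indexing a result. The variable names are made up for illustration.

```python
import re

txt = "Let's eat a delicious FRUIT $alad. N0M n0m!"
pattern = r"[0-9]"  # a single digit

at_start = re.match(pattern, txt)    # only looks at the start of the string
first = re.search(pattern, txt)      # first match anywhere in the string
every = re.findall(pattern, txt)     # list of all matches

print(at_start)  # None, because the string starts with "L"
print(every)     # ['0', '0']

# match and search return None when nothing is found,
# so check before indexing to avoid a TypeError
if first is not None:
    print(first[0])       # same as first.group(0)
    print(first.start())  # index where the match begins
```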

The TL;DR version: use `search` and not `match` unless you have a real reason to look only at the start of a string.

There's just one more function to go over before we turn to our lyrics exercise. And it's a helpful one. Promise!

`re.sub` replaces every match of a pattern with a replacement string. This is exceptionally helpful for cleaning text.

The syntax is as follows:

`newstring = re.sub(pattern, replacement, original_string)`

For example:


In [18]:
print("Original string:")
print(txt + "\n")

new_txt = re.sub(r"\$[a-z]{4}","smoothie",txt) # raw string, so \$ reaches the regex engine as a literal dollar sign

print("New string:")
print(new_txt)


Original string:
Let's eat a delicious FRUIT $alad. N0M n0m!

New string:
Let's eat a delicious FRUIT smoothie. N0M n0m!
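
Since `re.sub` replaces every match by default, it works well for small cleanup jobs. Here is a sketch that tidies up the odd characters in our sample text. The optional `count` argument limits how many replacements are made. The variable names are made up for illustration.

```python
import re

txt = "Let's eat a delicious FRUIT $alad. N0M n0m!"

# replace every zero with a lowercase o
no_zeros = re.sub(r"0", "o", txt)

# count=1 stops after the first replacement
one_zero = re.sub(r"0", "o", txt, count=1)

# chain a second substitution to fix the dollar sign too
cleaned = re.sub(r"\$", "s", no_zeros)

print(no_zeros)  # Let's eat a delicious FRUIT $alad. NoM nom!
print(one_zero)  # Let's eat a delicious FRUIT $alad. NoM n0m!
print(cleaned)   # Let's eat a delicious FRUIT salad. NoM nom!
```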


Enough of this. Let's look at our lyrics...