
## Introduction to Regular Expressions

Thus far, we’ve been learning how to generate text using LLMs, and to classify text that we've gotten from elsewhere. But very often, the text that we've gotten from elsewhere isn’t quite ready to be analyzed--or the text that is generated needs a bit more processing in order to be analyzed.

Today we'll cover the first of these challenges: cleaning our data so that it can be analyzed. We’ll learn how to do this with regular expressions, or "regex."

What are regular expressions?

Doug Knox puts it this way: "Regular expressions (or “regexes” for short) are a way of defining patterns that can apply to sequences of things. They have the funny name that they do because of their origins in computer science and formal language theory, and they are incorporated into most general programming languages."

In short: regex offers a (relatively) standard method for finding and/or replacing patterns in text.

Depending on your experience, you may have already encountered regex in R via functions like `grep` and packages like `stringr`.

If not (or even if so), it may be helpful to have some of the basic regex syntax explained again.

### Basic regex syntax

To begin, we need to import the Python module for regular expressions:

In [None]:
import re # this is the python regular expression library

Now. Let's say we have this sample text:

`Let's eat a delicious FRUIT $alad. N0M n0m!`

If we want, we can isolate all the capital letters with regex.

`[A-Z]` tells regex we're looking for capital letters.

A construction like '`[A-Z]`' makes use of a few basic features of regex:

* Square brackets indicate a set of possible things to match. (The things go inside.)
* A-Z is how you indicate a range of characters to match.
* Regex is case sensitive, so A-Z would match only uppercase letters. To match lowercase chars, you'd use '`[a-z]`'.

`re.findall(pattern, text)` returns a list of every non-overlapping match of the pattern in the text, in the order the matches appear.

In [None]:
txt = "Let's eat a delicious FRUIT $alad. N0M n0m!"
re.findall('[A-Z]', txt)
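
As a small sketch of how the pieces above fit together, the cell below compares single-letter matches with runs of letters. It borrows the `+` quantifier ("one or more"), which is introduced later in this notebook; the variable names are made up for illustration.

```python
import re

txt = "Let's eat a delicious FRUIT $alad. N0M n0m!"

# One match per capital letter
single_caps = re.findall("[A-Z]", txt)

# "+" means "one or more", so adjacent capitals are returned together
cap_runs = re.findall("[A-Z]+", txt)

# The same idea for runs of lowercase letters
lowercase_runs = re.findall("[a-z]+", txt)

print(single_caps)
print(cap_runs)
print(lowercase_runs)
```

Note that the `0` in `N0M` is a digit, not a letter, so it splits `N` and `M` into separate matches.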

### Historical aside

Ever wonder why we call capital letters uppercase and lowercase letters lowercase? This is why:

![cases](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEioX-YWjQWgnw7Tsm-rThARBuU8uxLpPi1jU8Am7TwPE6-C0Mx1oo7KspC7yZRftpv4UnUr0yQKH0JuVSVkjWd19tSZcuSQ0h8qqMTQ4yjQ9_4q0VrsNaaXCy69l2tohvyKluPmp2xZENsV/s1600/upper-and-lower-case-2.png)

### More regex with Python

Let's review a few more basic ways of formulating regex patterns:

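`^` outside the brackets, at the start of the pattern, anchors the match to the beginning of the string. The pattern below matches an uppercase letter only if it is the very first character, so `findall` returns at most one result: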

In [None]:
ans = re.findall("^[A-Z]", txt)
ans
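
To see that `^` anchors to the start of the string, here is a quick sketch. The second test string is a made-up variation on our sample text.

```python
import re

txt = "Let's eat a delicious FRUIT $alad. N0M n0m!"

# The string starts with "L", so the anchored pattern matches once
starts_upper = re.findall("^[A-Z]", txt)

# The string does not start with a lowercase letter, so nothing matches
starts_lower = re.findall("^[a-z]", txt)

# Hypothetical variation: capitals exist, but none is at the very start
shifted = re.findall("^[A-Z]", "let's eat FRUIT")

print(starts_upper)
print(starts_lower)
print(shifted)
```

The last result is empty even though `FRUIT` is full of capital letters: the anchor only allows a match at position zero.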

`^` *inside* the brackets, as the first character of the set, negates the set: the pattern matches any character that is *not* listed.

In [None]:
ans = re.findall("[^A-Z]", txt)
ans
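
A negated set is handy for stripping out a class of characters. The sketch below uses a short made-up sample, and also shows that `^` only negates when it comes first inside the brackets.

```python
import re

sample = "N0M n0m!"  # hypothetical short sample

# Keep everything that is NOT a digit, then glue the pieces back together
no_digits = "".join(re.findall("[^0-9]", sample))

# When "^" is not the first character in the set, it is just a literal caret
caret_literal = re.findall("[A-Z^]", "A^b")

print(no_digits)
print(caret_literal)
```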

Those two different usages of the ^ are why some people delight in mastery of regex and others abhor it. It is TOTALLY FINE to have to look up a regex cheat sheet like [this one](http://www.pyregex.com/) if you can't remember the syntax off the top of your head. I almost always do.

The [cheat sheet](http://www.pyregex.com/) also contains a web-based tester, which is another incredibly helpful tool. There are many such testers on the internet. I recommend that you use them.

Okay, back to basic regex.

A number inside curly braces indicates that you want instances where there are that number of matches, uninterrupted, in a row. In our case (`Let's eat a delicious FRUIT $alad. N0M n0m!`) what should `re.findall("[A-Z]{5}", txt)` find?

In [None]:
ans = re.findall("[A-Z]{5}", txt)

ans
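
Curly braces can also take a range. This goes slightly beyond the text above: `{m,n}` means "between m and n repetitions" and `{m,}` means "m or more." Here is a sketch on a made-up sample string:

```python
import re

sample = "AB ABC ABCD"  # hypothetical sample with runs of 2, 3, and 4 capitals

# Exactly three capitals in a row
exactly_three = re.findall("[A-Z]{3}", sample)

# Between two and three capitals in a row
two_to_three = re.findall("[A-Z]{2,3}", sample)

# Two or more capitals in a row
two_or_more = re.findall("[A-Z]{2,}", sample)

print(exactly_three)
print(two_to_three)
print(two_or_more)
```

Notice that `{3}` finds `ABC` inside `ABCD`: the count describes how many characters a match consumes, not how long the surrounding run is.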

There are all sorts of special sequences that let you look for specific types of characters. For example, `\W` is how you indicate any non-word character--that is, any character that is not a letter, a digit, or an underscore--as in the following:

In [None]:
ans = re.findall(r"[\W]", txt) # the r"..." (raw string) passes the backslash to regex untouched

ans

If you are looking for multiple types of things, you can run them together inside the square brackets. For example, you could find both non-word characters (`\W`) and digits (`0-9`) in this way:

In [None]:
ans = re.findall(r"[\W0-9]", txt) # raw string again, because the pattern contains a backslash

ans
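
A few other special sequences work the same way. The sketch below tries `\d` (digits) and `\s` (whitespace), and uses a made-up string to show that `\W` does not count the underscore as a match.

```python
import re

txt = "Let's eat a delicious FRUIT $alad. N0M n0m!"

# \d matches any digit
digits = re.findall(r"\d", txt)

# \s matches any whitespace character (here, the spaces between words)
spaces = re.findall(r"\s", txt)

# \W skips letters, digits, AND underscores, so only "!" is returned
non_word = re.findall(r"\W", "snake_case!")

print(digits)
print(len(spaces))
print(non_word)
```

Writing patterns as raw strings (`r"..."`) is a good habit whenever a backslash is involved, since newer versions of Python warn about unrecognized escapes like `"\W"` in ordinary strings.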

Regex can seem intimidating. But with some practice, you will start to remember some of the most commonly used syntax and expressions. That said, you will need to consult a cheat sheet more often than not.

Here are some but not all of the most common regex parts:

| Syntax  | Description |
| :------ | :---------- |
| `A b 1` | literals - letters, digits, and spaces match themselves |
| `[Ab1]` | a character class, matching any of `A`, `b`, or `1` in this case |
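| `[a-z]` | any one lowercase letter in the range `a` to `z` |
| `[0-9]` | any one digit |
| `^` | beginning of string (or negation, when it is the first character inside `[]`) |
| `$` | end of string |
| `.` | any character except a newline |
| `*` | zero or more of the preceding element |
| `+` | one or more of the preceding element |
| `()` | creates a capture group for future reference |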
| `\s` | whitespace (there are many special sequences that begin with `\` like this) |

We will practice with these and a few more as we work through the rest of this notebook.
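
Here is a quick sketch that tries out several rows of the table at once. The short `"NM N0M N00M"` string is made up to show the difference between `*` and `+`.

```python
import re

txt = "Let's eat a delicious FRUIT $alad. N0M n0m!"

# "." stands in for any single character between "n" and "m"
dot = re.findall("n.m", txt)

# "*" allows zero or more zeros; "+" requires at least one
star = re.findall("N0*M", "NM N0M N00M")
plus = re.findall("N0+M", "NM N0M N00M")

# "$" anchors to the end: a non-word character right before the end of the string
at_end = re.findall(r"\W$", txt)

# With one capture group, findall returns only what is inside the parentheses
word_before_bang = re.findall(r"(\w+)!", txt)

print(dot, star, plus, at_end, word_before_bang)
```

Because regex is case sensitive, `n.m` finds `n0m` but not `N0M`.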

### Python's regex functions

There are a few helpful functions included in Python's "re" library. We've already used `findall`, but here are a few more:

`match` finds a pattern anchored at the beginning of a string. Because it only looks at the beginning of a string, it's not always exceptionally useful. But let's just see how it works.

In [None]:
print("Let's recall our example string:")
print(txt + "\n")

pattern = "[A-Z]{5}" # let's look for five uppercase letters, as above

print("Looking for five uppercase letters at the start...")
if re.match(pattern,txt):
    print("Match!")
else:
    print("Sorry. No match.")

So, we don't have a match here because "F" "R" "U" "I" and "T" are not the first five chars in the string. "L" "e" are the first two characters, and as soon as `match` gets to the "e" it knows its search has failed: Sorry. No match.

But what about if we look for "Let's"?

### Cue **Exercise 1**!

Here are some regex building blocks to use in this exercise:

* `[A-Z]` will get you all uppercase letters
* `[a-z]` will get you all lowercase letters
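* An apostrophe (`'`) is not a special character in regex, so it matches itself. You only need a `\` before it (this is called an escape sequence) if your Python string is itself wrapped in single quotes.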
* Adding a `+` after a set of characters will match one or more instances of that set

In [None]:
# So, match doesn't work above. But what about if we search for "Let's"?

pattern = " " # fill in this part

if re.match(pattern,txt):
    print("Match!")
else:
    print("Sorry. No match.")

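Another thing to know about `match` is that it returns information about its search: a match object if the pattern was found, or `None` if it was not. That is why you can test for the fact of a match with `if`. To get the matching characters, index the match object: `[0]` is the entire match, and `[1]`, `[2]`, and so on are the capture groups, if the pattern has any. For example: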

In [None]:
pattern = " " # fill in same as above

matches = re.match(pattern,txt)

print(matches)
print(matches[0])
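
To make the match object a little less mysterious, here is a sketch on a different, made-up string (so as not to give away the exercise). The pattern has two capture groups.

```python
import re

greeting = "Hello world"  # hypothetical example string

# Group 1: one capital letter. Group 2: the lowercase letters that follow.
m = re.match("([A-Z])([a-z]+)", greeting)

whole = m[0]            # the entire match
first_letter = m[1]     # contents of the first capture group
rest = m[2]             # contents of the second capture group
where = m.span()        # (start, end) positions of the match

# When nothing matches, you get None instead of a match object
no_match = re.match("[0-9]+", greeting)

print(whole, first_letter, rest, where, no_match)
```

Indexing `None` raises a `TypeError`, so it is worth checking that a match exists before asking for `[0]`.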

More helpful than `match` is `search`, which searches the entire string.

Let's go back to looking for FRUIT:

In [None]:
pattern = "[A-Z]{5}" # remember this?

matches = re.search(pattern,txt)

matches[0]

The TL;DR version: use `search` and not `match` unless you have a real reason to look only at the start of a string.
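
Side by side, the difference looks like this. The sketch also checks for `None` before indexing, which is a reasonable habit with either function.

```python
import re

txt = "Let's eat a delicious FRUIT $alad. N0M n0m!"
pattern = "[A-Z]{5}"

# match only tries position zero, so it comes back empty-handed
at_start = re.match(pattern, txt)

# search scans the whole string and returns the first match it finds
anywhere = re.search(pattern, txt)

if anywhere is not None:
    found = anywhere[0]          # the matched text
    position = anywhere.span()   # where it sits in the string
    print(found, position)

print(at_start)
```

Keep in mind that `search` returns only the first match; when you want all of them, `findall` is the tool to reach for.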

There's just one more function to go over before we turn to our song lyrics exercise. And it's a helpful one. Promise!

`re.sub` substitutes one pattern with another. This is exceptionally helpful for cleaning text.

The syntax is as follows:

`newstring = re.sub(pattern, replacement_string, original_string)`

For example:


In [None]:
print("Original string:")
print(txt + "\n")

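Let's swap out `$alad` for `smoothie`. To find `$alad` we need the regex pattern `r"\$[a-z]{4}"`. The backslash is required because an unescaped `$` means "end of string" rather than a literal dollar sign.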

In [None]:
new_txt = re.sub(r"\$[a-z]{4}", "smoothie", txt) # raw string; \$ matches a literal dollar sign

print("New string:")
print(new_txt)
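
Since cleaning is what we are after, here is a sketch of chaining a few substitutions. The clean-up rules and the whitespace example are made up for illustration; real data will need its own rules.

```python
import re

txt = "Let's eat a delicious FRUIT $alad. N0M n0m!"

# Step 1: turn the literal dollar sign back into an "s"
step1 = re.sub(r"\$", "s", txt)

# Step 2: turn every zero into the letter "o" (sub replaces ALL matches by default)
step2 = re.sub("0", "o", step1)

# A common cleaning move: collapse any run of whitespace into a single space
collapsed = re.sub(r"\s+", " ", "too   many    spaces")

print(step2)
print(collapsed)
```

Each call returns a new string and leaves the original untouched, which is why the result is assigned to a new variable at every step.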

Enough of this. Let's look at our song lyrics...

*Lauren F. Klein wrote version 1.0 of this notebook, based off tutorials by [Doug Knox](https://programminghistorian.org/en/lessons/understanding-regular-expressions), [Laura Turner O'Hara](https://programminghistorian.org/en/lessons/cleaning-ocrd-text-with-regular-expressions#regular-expressions-regex), [William J. Turkel and Adam Crymble](https://programminghistorian.org/en/lessons/normalizing-data#python-regular-expressions), and [Sejal Jaiswal](https://www.datacamp.com/community/tutorials/python-regular-expression-tutorial).*

