# "Regular Expressions"

*Also known as "RegEx"*

## `.strip(...)`, `.replace(...)`, etc. have their limits

They deal with sets of explicit, pre-defined characters (e.g., the `"$ "` in `my_string.replace("$ ", "")`). But what if we don't know in advance every character we want to replace, strip, etc.?

What if we want to clean based on ... *patterns*?
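
To make the limitation concrete, here is a minimal sketch contrasting the two approaches. The sample strings and the names `prices`, `with_replace`, and `with_regex` are made up for illustration; the `re.sub` call is only a preview of what the rest of these notes explain.

```python
import re

# Hypothetical sample values; the currency marker differs from row to row.
prices = ["$ 15", "USD 15", "15 dollars"]

# .replace() only removes the exact substring it is given.
with_replace = [p.replace("$ ", "") for p in prices]

# A pattern can instead say "remove every character that is not a digit".
with_regex = [re.sub(r"[^\d]", "", p) for p in prices]

print(with_replace)  # ['15', 'USD 15', '15 dollars']
print(with_regex)    # ['15', '15', '15']
```

Only the first value happens to contain `"$ "`, so `.replace(...)` leaves the other two untouched.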

## Regular expressions ("RegEx")

Regexes are:

- Like "Control-F" × 1,000
- Usable across different programming languages (Python, JavaScript, R, etc.) and environments (e.g., Excel and Google Sheets)

## Searching text for patterns

Is X *in* my string?

`re.search(pattern, text)`

In [1]:
import re

In [2]:
print(re.search(r"\d+", "There are 75 people in class"))

<re.Match object; span=(10, 12), match='75'>


In [3]:
print(re.search(r"\d+", "There are seventy-five people in class"))

None
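
Because `re.search` returns either a match object or `None`, a common pattern is to check the result before using it. The helper name `first_number` below is hypothetical, and this is only one reasonable way to structure the check.

```python
import re

def first_number(text):
    """Return the first run of digits in text, or None if there is none."""
    m = re.search(r"\d+", text)
    if m is None:        # no match: re.search returned None
        return None
    return m.group(0)    # group(0) is the whole matched text

print(first_number("There are 75 people in class"))            # 75
print(first_number("There are seventy-five people in class"))  # None
```

Calling `.group(...)` directly on a failed search raises an `AttributeError`, which is why the `None` check comes first.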


## Extracting text, using patterns

Regexes use parentheses to define "capture groups":

In [4]:
match = re.search(r"(\d+)", "There are 75 people and 15 dogs in class")
match.group(1)

'75'

In [5]:
match = re.search(r"(\d+) ([^ ]+)", "There are 75 people and 15 dogs in class")
match.groups()

('75', 'people')

In [6]:
re.findall(r"(\d+) ([^ ]+)", "There are 75 people and 15 dogs in class")

[('75', 'people'), ('15', 'dogs')]
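
Everything a capture group returns is a string, including the digits. As a small sketch of a typical next step (the name `counts` is an assumption for illustration), the tuples from `re.findall` can be reshaped into a dictionary of integers:

```python
import re

text = "There are 75 people and 15 dogs in class"

# With two capture groups, findall yields (number, word) tuples of strings.
pairs = re.findall(r"(\d+) ([^ ]+)", text)

# Flip each pair into word -> count, converting the digits to an int.
counts = {word: int(number) for number, word in pairs}

print(counts)  # {'people': 75, 'dogs': 15}
```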

## Changing text, using patterns

Like "Find/Replace," but much more powerful.

`re.sub(pattern, replacement, text)`

In [7]:
phrase = "I like giant pandas, my favorite animal is named Xin Bao"
re.sub(r"pandas", r"zoo animals", phrase)

'I like giant zoo animals, my favorite animal is named Xin Bao'

*<a href="https://zoo.sandiegozoo.org/giant-pandas">Here's the best panda</a>*

In [8]:
phrase = "Hello! Beep!"
re.sub(r"(.)\1+", r"\1\1\1\1\1\1", phrase)

'Hellllllo! Beeeeeep!'
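
In that pattern, `(.)` captures any single character and `\1+` requires the same character to repeat at least once more, so it matches runs such as `ll` or `ee`. A sketch of the opposite cleanup, collapsing each run to a single character (the sample string is made up):

```python
import re

shouty = "Heeeey!!! Waaait"

# (.) captures one character; \1+ matches one or more repeats of that same character.
# Replacing the whole run with \1 keeps a single copy.
calm = re.sub(r"(.)\1+", r"\1", shouty)

print(calm)  # Hey! Wait
```

Note that this also collapses legitimate double letters, so "Hello" would become "Helo".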

## How do I write them?

## The basics

- Anchors
- Character sets
- Repetition
- Groups

### Anchors

The most important of these:

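- `^`: The beginning of the string (or of each line, with the `re.MULTILINE` flag)
- `$`: The end of the string (or of each line, with the `re.MULTILINE` flag)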
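
In Python, whether the anchors apply to the whole string or to each line depends on the `re.MULTILINE` flag. A short sketch, using a made-up three-line string:

```python
import re

text = "apple\nalmond\norange"

# Default: ^ only matches at the very start of the whole string.
default_starts = re.findall(r"^a", text)

# With re.MULTILINE, ^ also matches right after each newline.
multiline_starts = re.findall(r"^a", text, flags=re.MULTILINE)

# Same idea for $: end of the string by default, end of every line with the flag.
default_ends = re.findall(r"e$", text)
multiline_ends = re.findall(r"e$", text, flags=re.MULTILINE)

print(default_starts, multiline_starts)  # ['a'] ['a', 'a']
print(default_ends, multiline_ends)      # ['e'] ['e', 'e']
```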

In [9]:
def test_search(pattern, string):
    m = re.search(pattern, string)
    print(f"{string} → {'Yes' if m else 'No'}")

In [10]:
pattern = r"^a"
test_search(pattern, "apple")
test_search(pattern, "almond")
test_search(pattern, "orange")

apple → Yes
almond → Yes
orange → No


In [11]:
pattern = r"e$"
test_search(pattern, "apple")
test_search(pattern, "almond")
test_search(pattern, "orange")

apple → Yes
almond → No
orange → Yes


### Character sets

- `[abc123]`: A character that is *any* of a, b, c, 1, 2, or 3
- `[^abc]`: A character that is *not* a, b, or c
- `[a-z]`: Any lowercase letter from a to z
- `[0-9]` ... or `\d`: Any digit
- `\s`: Any whitespace (space, tab, newline)
- `.`: Anything! (Except for a newline.)
- `\.`: The literal period character

In [12]:
pattern = r"[abcxyz123]"
test_search(pattern, "apple")
test_search(pattern, "Apple")
test_search(pattern, "IPHONE 16")

apple → Yes
Apple → No
IPHONE 16 → Yes


In [13]:
pattern = r"[a-z]"
test_search(pattern, "apple")
test_search(pattern, "Apple")
test_search(pattern, "IPHONE 16")

apple → Yes
Apple → Yes
IPHONE 16 → No


In [14]:
pattern = r"\d"
test_search(pattern, "apple")
test_search(pattern, "Apple")
test_search(pattern, "IPHONE 16")

apple → No
Apple → No
IPHONE 16 → Yes


In [15]:
pattern = r"\s"
test_search(pattern, "apple")
test_search(pattern, "Apple")
test_search(pattern, "IPHONE 16")

apple → No
Apple → No
IPHONE 16 → Yes
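
The cells above do not exercise `.`, `\.`, or a negated set, so here is a small sketch of those three. The helper `matches` is a hypothetical stand-in that returns a boolean instead of printing.

```python
import re

def matches(pattern, string):
    """True if the pattern is found anywhere in the string."""
    return re.search(pattern, string) is not None

# An unescaped . matches any character, so "3x14" slips through.
print(matches(r"3.14", "3x14"))     # True
# Escaping it as \. only accepts a literal period.
print(matches(r"3\.14", "3x14"))    # False
print(matches(r"3\.14", "3.14"))    # True
# [^0-9] needs at least one character that is NOT a digit.
print(matches(r"[^0-9]", "2024"))   # False
print(matches(r"[^0-9]", "20-24"))  # True
```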


### Repetition

- `?`: Zero or one
- `*`: Zero or more
- `+`: One or more
- `{5}`: Exactly five
- `{,5}`: Up to five
- `{5,}`: At least five
- `{5,8}`: Between five and eight (inclusive)

In [16]:
pattern = r"Buz?"
test_search(pattern, "Bu")
test_search(pattern, "Buz")
test_search(pattern, "Buzz")
test_search(pattern, "Buzzzzz")
test_search(pattern, "Buzzzzzz")

Bu → Yes
Buz → Yes
Buzz → Yes
Buzzzzz → Yes
Buzzzzzz → Yes


In [17]:
pattern = r"Buz?$"
test_search(pattern, "Bu")
test_search(pattern, "Buz")
test_search(pattern, "Buzz")
test_search(pattern, "Buzzz")
test_search(pattern, "Buzzzz")

Bu → Yes
Buz → Yes
Buzz → No
Buzzz → No
Buzzzz → No


In [18]:
pattern = r"Buz*$"
test_search(pattern, "Bu")
test_search(pattern, "Buz")
test_search(pattern, "Buzz")
test_search(pattern, "Buzzz")
test_search(pattern, "Buzzzz")

Bu → Yes
Buz → Yes
Buzz → Yes
Buzzz → Yes
Buzzzz → Yes


In [19]:
pattern = r"Buz+$"
test_search(pattern, "Bu")
test_search(pattern, "Buz")
test_search(pattern, "Buzz")
test_search(pattern, "Buzzz")
test_search(pattern, "Buzzzz")

Bu → No
Buz → Yes
Buzz → Yes
Buzzz → Yes
Buzzzz → Yes


In [20]:
pattern = r"Buz{3}$"
test_search(pattern, "Bu")
test_search(pattern, "Buz")
test_search(pattern, "Buzz")
test_search(pattern, "Buzzz")
test_search(pattern, "Buzzzz")

Bu → No
Buz → No
Buzz → No
Buzzz → Yes
Buzzzz → No


In [21]:
pattern = r"Buz{3,}$"
test_search(pattern, "Bu")
test_search(pattern, "Buz")
test_search(pattern, "Buzz")
test_search(pattern, "Buzzz")
test_search(pattern, "Buzzzz")

Bu → No
Buz → No
Buzz → No
Buzzz → Yes
Buzzzz → Yes


In [22]:
pattern = r"Buz{2,3}$"
test_search(pattern, "Bu")
test_search(pattern, "Buz")
test_search(pattern, "Buzz")
test_search(pattern, "Buzzz")
test_search(pattern, "Buzzzz")

Bu → No
Buz → No
Buzz → Yes
Buzzz → Yes
Buzzzz → No
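
Anchors, character sets, and repetition are usually combined. As an illustrative sketch (the pattern and the helper name `looks_like_zip` are assumptions, not an official format check), this tests whether a string looks like a US ZIP code, with an optional four-digit extension:

```python
import re

# ^ and $ pin the pattern to the whole string;
# \d{5} is exactly five digits; (-\d{4})? is an optional "-1234" suffix.
ZIP_PATTERN = r"^\d{5}(-\d{4})?$"

def looks_like_zip(value):
    return re.search(ZIP_PATTERN, value) is not None

for candidate in ["10027", "10027-1234", "1002", "100277", "10027-12"]:
    print(f"{candidate} → {'Yes' if looks_like_zip(candidate) else 'No'}")
```

Only the first two candidates pass; without the `^` and `$` anchors, "100277" would also pass, because five of its digits satisfy `\d{5}`.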


### Groups

- `(...)`: A group
- `(abc|xyz)`: *Either* "abc" OR "xyz"
- `\1`, `\2`: A reference to the first group, second group, etc.

In [23]:
pattern = r"Ba(na)+$"
test_search(pattern, "Ba")
test_search(pattern, "Banana")
test_search(pattern, "Bananana")
test_search(pattern, "Banananan")

Ba → No
Banana → Yes
Bananana → Yes
Banananan → No


In [24]:
pattern = r"Hello, (World|Lede)"
test_search(pattern, "Hello, World")
test_search(pattern, "Hello, Lede")
test_search(pattern, "Hello, Yellow")

Hello, World → Yes
Hello, Lede → Yes
Hello, Yellow → No


In [25]:
pattern = r"^([^\s]+) vs \1"
test_search(pattern, "Dog vs Dog")
test_search(pattern, "Cat vs Cat")
test_search(pattern, "Dog vs Cat")

Dog vs Dog → Yes
Cat vs Cat → Yes
Dog vs Cat → No


In [27]:
text = "State: NY, City: Brooklyn"
pattern = r"State: ([A-Z]{2}), City: ([^,]+)"
replacement = r"\2, \1"
re.sub(pattern, replacement, text)

'Brooklyn, NY'
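
Two more sketches of group references in a replacement, using made-up input strings. The second shows `\g<1>`, the longer spelling of `\1`, which helps when the reference is immediately followed by a digit.

```python
import re

# Reorder M/D/YYYY into YYYY-M-D by referring to groups 3, 1, and 2.
reordered = re.sub(r"(\d{1,2})/(\d{1,2})/(\d{4})", r"\3-\1-\2", "Due 3/14/2025")

# r"\10" would be read as a reference to group 10, so \g<1> is used
# to put a literal 0 right after group 1.
dimes = re.sub(r"(\d+) dollars", r"\g<1>0 dimes", "5 dollars")

print(reordered)  # Due 2025-3-14
print(dimes)      # 50 dimes
```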

## Regular expressions in Pandas

- `.str.extract(pattern_with_group, expand=False)`
- `.str.replace(pattern, replacement, regex=True)`
- `.str.contains(pattern)`

Let's try this with our new example list of amounts ...

In [28]:
import pandas as pd

In [29]:
amounts = [
    "   5,1245.31   doLLaRs   ",
    "456 Dollars  ",
    "156.20 USD",
    "$15"
]

In [30]:
amounts_df = pd.DataFrame({ "raw": amounts })
amounts_df

| | raw |
|---|---|
| 0 | 5,1245.31 doLLaRs |
| 1 | 456 Dollars |
| 2 | 156.20 USD |
| 3 | $15 |


In [31]:
amounts_df["quantity"] = (
    amounts_df["raw"]
    .str.extract(r"([\d,\.]+)", expand=False)
)

amounts_df

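| | raw | quantity |
|---|---|---|
| 0 | 5,1245.31 doLLaRs | 5,1245.31 |
| 1 | 456 Dollars | 456 |
| 2 | 156.20 USD | 156.20 |
| 3 | $15 | 15 |

At this point `quantity` still holds strings; the next cell removes the commas and converts the values to floats.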


In [32]:
amounts_df["quantity"] = (
    amounts_df["raw"]
    .str.extract(r"([\d,\.]+)", expand=False)
    .str.replace(",", "")
    .astype(float)
)

amounts_df

| | raw | quantity |
|---|---|---|
| 0 | 5,1245.31 doLLaRs | 51245.31 |
| 1 | 456 Dollars | 456.0 |
| 2 | 156.20 USD | 156.2 |
| 3 | $15 | 15.0 |


Or, saving us the `.str.extract(...)` line by deleting every character that is not a digit or a period:

In [33]:
amounts_df["quantity"] = (
    amounts_df["raw"]
    .str.replace(r"[^\.\d]", "", regex=True)
    .astype(float)
)

amounts_df

| | raw | quantity |
|---|---|---|
| 0 | 5,1245.31 doLLaRs | 51245.31 |
| 1 | 456 Dollars | 456.0 |
| 2 | 156.20 USD | 156.2 |
| 3 | $15 | 15.0 |
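
As a sanity check, the same cleanup can be sketched with plain `re`, without pandas. The list comprehension and the name `quantities` are illustrative; the pattern is the one from the last cell.

```python
import re

amounts = [
    "   5,1245.31   doLLaRs   ",
    "456 Dollars  ",
    "156.20 USD",
    "$15",
]

# Same pattern as the .str.replace(...) cell: drop everything except digits and periods.
quantities = [float(re.sub(r"[^\.\d]", "", raw)) for raw in amounts]
print(quantities)  # [51245.31, 456.0, 156.2, 15.0]

# Caveat: deleting characters glues separate numbers together.
print(re.sub(r"[^\.\d]", "", "2 items at $15"))  # 215
```

The delete-everything approach is compact, but it assumes each value contains exactly one number; the `.str.extract(...)` version is the safer choice when that is not guaranteed.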
