# 🧹 Cleaning Data, Part 3: "Regular Expressions"

*i.e., Find/replace on steroids.*

## `.strip(...)`, `.replace(...)`, etc. have their limits

They deal with sets of explicit, pre-defined characters (e.g., the `"$ "` in `my_string.replace("$ ", "")`). But what if we don't know all the characters we want to replace/strip/etc.?

What if we want to clean based on ... *patterns*?

## An example

How would you extract the amounts from these strings?

In [1]:
amounts = [
    "   1,000.31   doLLaRs   ",
    "54 cents  ",
    "33 CENTS",
    "Dollars: 10" # <- this is our new wrinkle
]

If we try to use the function we defined earlier, we'll get an error because the `Dollars: ` comes before the amount rather than afterward, and it's not as easy to strip out as just `$`.

In [2]:
def clean_amount(amt):
    if "$" in amt or "dollar" in amt.lower():
        conversion = 1
    elif "cent" in amt.lower():
        conversion = 0.01
    else:
        raise ValueError(f"Cannot determine unit for {amt}")
        
    amt = amt.strip("$ ")
    amt = amt.replace(",", "")
    amt = amt.split(" ")[0]
    return float(amt) * conversion

In [3]:
sum(clean_amount(a) for a in amounts)

ValueError: could not convert string to float: 'Dollars:'

We *could* redefine our `.strip(...)` step to include every non-digit character ... but there's *got* to be a better way.

Spoiler: There is!

In [4]:
import re

for amt in amounts:
    print(re.sub(r"[^\d\.]", "", amt))

1000.31
54
33
10


In [5]:
for amt in amounts:
    quantity = re.sub(r"[^\d\.]", "", amt)
    unit_match = re.search(r"([a-z]+)", amt.lower())
    unit = unit_match.group(1)
    print(float(quantity), unit)

1000.31 dollars
54.0 cents
33.0 cents
10.0 dollars
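
The two regex steps above can be folded back into a single function. This is a minimal sketch, assuming the same dollars/cents rules as the earlier `clean_amount`; the name `clean_amount_regex` is made up for illustration and is not part of the original notebook.

```python
import re

amounts = [
    "   1,000.31   doLLaRs   ",
    "54 cents  ",
    "33 CENTS",
    "Dollars: 10",
]

def clean_amount_regex(amt):
    """Return the amount in dollars (hypothetical helper for illustration)."""
    # Drop everything that is not a digit or a period: "1,000.31 doLLaRs" -> "1000.31"
    quantity = float(re.sub(r"[^\d.]", "", amt))

    # Treat the first run of letters as the unit
    unit_match = re.search(r"[a-z]+", amt.lower())
    if unit_match is None:
        raise ValueError(f"Cannot determine unit for {amt!r}")
    unit = unit_match.group(0)

    if unit.startswith("dollar"):
        conversion = 1
    elif unit.startswith("cent"):
        conversion = 0.01
    else:
        raise ValueError(f"Cannot determine unit for {amt!r}")

    return quantity * conversion

total = sum(clean_amount_regex(a) for a in amounts)
print(round(total, 2))
```

Unlike the earlier version, this one doesn't care whether the unit comes before or after the number.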


## __Regular expressions__ ("RegEx")

Regexes are:

- Like "Control-F" on steroids
- A (mostly) standard, purpose-tailored mini-language
- Usable across virtually every programming language (Python, JavaScript, R, etc.) and environment (even Excel and Google Sheets)

By analogy:

- __HTML__ - for webpage content
- __CSS__  - for styling webpages
- __SQL__  - for querying databases
- __RegEx__ - for searching and modifying patterns in text

## Searching text for patterns

Is X *in* my string?

`re.search(pattern, text)`

In [6]:
print(re.search(r"\d", "There are 24 people in class"))

<re.Match object; span=(10, 11), match='2'>


In [7]:
print(re.search(r"\d", "There are twenty-four people in class"))

None


Does my string *match* this pattern, starting from its very first character?

`re.match(pattern, text)`

In [8]:
print(re.match(r"\d", "There are 24 people in class"))

None


In [9]:
print(re.match(r"[A-Z].* \d+ .*", "There are 24 people in class"))

<re.Match object; span=(0, 28), match='There are 24 people in class'>
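
The difference between "in" and "match" is easy to trip over, so here is a side-by-side sketch. It also uses `re.fullmatch`, which is not covered above but is part of Python's standard `re` module.

```python
import re

text = "There are 24 people in class"

# re.search scans the whole string for the first match
print(re.search(r"\d+", text))      # finds '24'

# re.match only tries at position 0, so a digit pattern fails here
print(re.match(r"\d+", text))       # None

# ...but re.match does not require the pattern to reach the end of the string
print(re.match(r"There", text))     # matches just 'There'

# re.fullmatch requires the pattern to cover the entire string
print(re.fullmatch(r"There", text))            # None
print(re.fullmatch(r"[A-Z].* \d+ .*", text))   # matches all 28 characters
```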


## Extracting text, using patterns

Regexes use parentheses to define "capture groups":

In [10]:
match = re.search(r"(\d+)", "There are 24 people in class")
match.group(1)

'24'

To grab *every* match rather than just the first one, use `re.findall(pattern, text)`:

In [11]:
re.findall(r"\d+", "There are 24 people in class at 10 am")

['24', '10']
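
One thing to watch for: `re.search` returns `None` when nothing matches, and calling `.group(...)` on `None` raises an `AttributeError`. A small sketch of a defensive wrapper (the name `first_number` is hypothetical), plus what `re.findall` returns when the pattern has more than one capture group:

```python
import re

def first_number(text):
    """Return the first run of digits in text, or None (hypothetical helper)."""
    match = re.search(r"(\d+)", text)
    # Check for None before calling .group()
    if match is None:
        return None
    return match.group(1)

print(first_number("There are 24 people in class"))           # 24
print(first_number("There are twenty-four people in class"))  # None

# With more than one capture group, re.findall returns a list of tuples
pairs = re.findall(r"(\d+) (\w+)", "24 people, 10 am")
print(pairs)  # [('24', 'people'), ('10', 'am')]
```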

## Changing text, using patterns

Like "Find/Replace," but much more powerful.

`re.sub(pattern, replacement, text)`

In [12]:
ambiguous_date = "02/05/2023"
re.sub(r"(\d+)/(\d+)/(\d+)", r"\3-\1-\2", ambiguous_date)

'2023-02-05'

In [13]:
conversation = "😀: Hello! 🤖: Beep!"
re.sub(r"(.)\1+", r"\1\1\1\1\1\1", conversation)

'😀: Hellllllo! 🤖: Beeeeeep!'

Regular expressions are the heart and soul of gimmicks like this:

- https://chrome.google.com/webstore/detail/cloud-to-butt-plus/apmlngnhgbnjpajelfkmabhkfapgnoai

And mistakes like this:

- https://www.nytimes.com/2018/03/06/us/politics/07dc-tradefacts.html

## "But how do I actually write them?"

The __bad news__: There are some rules to learn.

The __good news__: There aren't *too* many, and they're well worth learning.

## The basics

- Anchors
- Character sets
- Repetition
- Groups

### Anchors

The most important of these:

- `^`: The beginning of the string (or of each line, if you pass the `re.MULTILINE` flag)
- `$`: The end of the string (or of each line, with `re.MULTILINE`)

In [14]:
def test_search(pattern, string):
    m = re.search(pattern, string)
    print(f"{string} → {'Yes' if m else 'No'}")

In [15]:
pattern = r"^a"
test_search(pattern, "apple")
test_search(pattern, "almond")
test_search(pattern, "orange")

apple → Yes
almond → Yes
orange → No


In [16]:
pattern = r"e$"
test_search(pattern, "apple")
test_search(pattern, "almond")
test_search(pattern, "orange")

apple → Yes
almond → No
orange → Yes
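
In Python, `^` and `$` anchor to the whole string by default. A short sketch of how the standard `re.MULTILINE` flag changes that, using a made-up three-line string:

```python
import re

text = "apple\nalmond\norange"

# Default: ^ only matches at the very start of the string
print(re.findall(r"^[ao]", text))                # ['a']

# With re.MULTILINE, ^ also matches right after each newline
print(re.findall(r"^[ao]", text, re.MULTILINE))  # ['a', 'a', 'o']

# Same idea for $: end of the string vs. end of each line
print(re.findall(r"e$", text))                   # ['e'] (from "orange")
print(re.findall(r"e$", text, re.MULTILINE))     # ['e', 'e'] (from "apple" and "orange")
```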


### Character sets

- `[abc123]`: A character that is *any* of a, b, c, 1, 2, or 3
- `[^abc]`: A character that is *not* a, b, or c
- `[a-z]`: Any lowercase character from a through z
- `[0-9]` ... or `\d`: Any digit
- `\s`: Any whitespace (space, tab, newline)
- `.`: Anything! (Except for a newline.)
- `\.`: The literal period character

In [17]:
pattern = r"[abcxyz123]"
test_search(pattern, "apple")
test_search(pattern, "Apple")
test_search(pattern, "IPHONE 13")

apple → Yes
Apple → No
IPHONE 13 → Yes


In [18]:
pattern = r"[a-z]"
test_search(pattern, "apple")
test_search(pattern, "Apple")
test_search(pattern, "IPHONE 13")

apple → Yes
Apple → Yes
IPHONE 13 → No


In [19]:
pattern = r"\d"
test_search(pattern, "apple")
test_search(pattern, "Apple")
test_search(pattern, "IPHONE 13")

apple → No
Apple → No
IPHONE 13 → Yes


In [20]:
pattern = r"\s"
test_search(pattern, "apple")
test_search(pattern, "Apple")
test_search(pattern, "IPHONE 13")

apple → No
Apple → No
IPHONE 13 → Yes
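
The cells above don't exercise negated sets or the difference between `.` and `\.`, so here is a quick sketch of both. The helper `found` is hypothetical; it just returns `True`/`False` instead of printing.

```python
import re

def found(pattern, string):
    """True if pattern occurs anywhere in string (hypothetical helper)."""
    return re.search(pattern, string) is not None

# [^abc] needs at least one character that is NOT a, b, or c
print(found(r"[^abc]", "cab"))    # False
print(found(r"[^abc]", "cabs"))   # True

# An unescaped . matches any character; \. matches only a literal period
print(found(r"3.5", "3x5"))       # True
print(found(r"3\.5", "3x5"))      # False
print(found(r"3\.5", "3.5"))      # True

# . does not match a newline by default
print(found(r"a.b", "a\nb"))      # False
```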


### Repetition

- `?`: Zero or one
- `*`: Zero or more
- `+`: One or more
- `{5}`: Exactly five
- `{,5}`: Up to five
- `{5,}`: At least five
- `{5,8}`: Between five and eight

In [21]:
pattern = r"Buz?"
test_search(pattern, "Bu")
test_search(pattern, "Buz")
test_search(pattern, "Buzz")
test_search(pattern, "Buzzzzz")
test_search(pattern, "Buzzzzzz")

Bu → Yes
Buz → Yes
Buzz → Yes
Buzzzzz → Yes
Buzzzzzz → Yes


In [22]:
pattern = r"Buz?$"
test_search(pattern, "Bu")
test_search(pattern, "Buz")
test_search(pattern, "Buzz")
test_search(pattern, "Buzzz")
test_search(pattern, "Buzzzz")

Bu → Yes
Buz → Yes
Buzz → No
Buzzz → No
Buzzzz → No


In [23]:
pattern = r"Buz*$"
test_search(pattern, "Bu")
test_search(pattern, "Buz")
test_search(pattern, "Buzz")
test_search(pattern, "Buzzz")
test_search(pattern, "Buzzzz")

Bu → Yes
Buz → Yes
Buzz → Yes
Buzzz → Yes
Buzzzz → Yes


In [24]:
pattern = r"Buz+$"
test_search(pattern, "Bu")
test_search(pattern, "Buz")
test_search(pattern, "Buzz")
test_search(pattern, "Buzzz")
test_search(pattern, "Buzzzz")

Bu → No
Buz → Yes
Buzz → Yes
Buzzz → Yes
Buzzzz → Yes


In [25]:
pattern = r"Buz{3}$"
test_search(pattern, "Bu")
test_search(pattern, "Buz")
test_search(pattern, "Buzz")
test_search(pattern, "Buzzz")
test_search(pattern, "Buzzzz")

Bu → No
Buz → No
Buzz → No
Buzzz → Yes
Buzzzz → No


In [26]:
pattern = r"Buz{3,}$"
test_search(pattern, "Bu")
test_search(pattern, "Buz")
test_search(pattern, "Buzz")
test_search(pattern, "Buzzz")
test_search(pattern, "Buzzzz")

Bu → No
Buz → No
Buzz → No
Buzzz → Yes
Buzzzz → Yes


In [27]:
pattern = r"Buz{2,3}$"
test_search(pattern, "Bu")
test_search(pattern, "Buz")
test_search(pattern, "Buzz")
test_search(pattern, "Buzzz")
test_search(pattern, "Buzzzz")

Bu → No
Buz → No
Buzz → Yes
Buzzz → Yes
Buzzzz → No
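
Repetition gets more useful once it's combined with character sets and anchors. A few sketches, with illustrative patterns that are not from the original notebook (including `{,2}`, the "up to" form that isn't demonstrated above):

```python
import re

def found(pattern, string):
    """True if pattern occurs anywhere in string (hypothetical helper)."""
    return re.search(pattern, string) is not None

# Exactly five digits and nothing else (e.g., a five-digit postal code)
five_digits = r"^\d{5}$"
print(found(five_digits, "11201"))    # True
print(found(five_digits, "1120"))     # False
print(found(five_digits, "112011"))   # False

# {,2} means "zero, one, or two"
up_to_two = r"^Buz{,2}$"
print(found(up_to_two, "Bu"))      # True
print(found(up_to_two, "Buzz"))    # True
print(found(up_to_two, "Buzzz"))   # False

# One or more digits, optionally followed by a period and more digits
number = r"^\d+\.?\d*$"
print(found(number, "54"))         # True
print(found(number, "1000.31"))    # True
print(found(number, "1,000.31"))   # False
```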


### Groups

- `(...)`: A group
- `(abc|xyz)`: *Either* "abc" OR "xyz"
- `\1`, `\2`: A reference to the first group, second group, etc.

In [28]:
pattern = r"Ba(na)+$"
test_search(pattern, "Ba")
test_search(pattern, "Banana")
test_search(pattern, "Bananana")
test_search(pattern, "Banananan")

Ba → No
Banana → Yes
Bananana → Yes
Banananan → No


In [29]:
pattern = r"Hello, (World|Lede)"
test_search(pattern, "Hello, World")
test_search(pattern, "Hello, Lede")
test_search(pattern, "Hello, Yellow")

Hello, World → Yes
Hello, Lede → Yes
Hello, Yellow → No


In [30]:
pattern = r"^([^\s]+) vs \1"
test_search(pattern, "Dog vs Dog")
test_search(pattern, "Cat vs Cat")
test_search(pattern, "Dog vs Cat")

Dog vs Dog → Yes
Cat vs Cat → Yes
Dog vs Cat → No


In [32]:
text = "State: NY, City: Brooklyn"
pattern = r"State: ([A-Z]{2}), City: ([^,]+)"
replacement = r"\2, \1"
re.sub(pattern, replacement, text)

'Brooklyn, NY'
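
Numbered references like `\3-\1-\2` get hard to read quickly. Python also supports *named* groups, written `(?P<name>...)` and referenced as `\g<name>` in the replacement. A minimal sketch, assuming (as the earlier date example did) that the input is in month/day/year order:

```python
import re

ambiguous_date = "02/05/2023"

# (?P<name>...) labels a group; \g<name> refers to it in the replacement
pattern = r"(?P<month>\d+)/(?P<day>\d+)/(?P<year>\d+)"
iso_date = re.sub(pattern, r"\g<year>-\g<month>-\g<day>", ambiguous_date)
print(iso_date)

# The labels are also available on the match object
match = re.search(pattern, ambiguous_date)
print(match.group("year"), match.groupdict())
```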

## Regular expressions in Pandas

- `.str.extract(pattern_with_group, expand=False)`
- `.str.replace(pattern, replacement, regex=True)`
- `.str.contains(pattern)`

In [33]:
import pandas as pd

In [34]:
amounts = [
    "   1,000.31   doLLaRs   ",
    "54 cents  ",
    "33 CENTS",
    "Dollars: 10"
]

In [35]:
amounts_df = pd.DataFrame({ "raw": amounts })
amounts_df

|   | raw |
|---|-----|
| 0 | 1,000.31 doLLaRs |
| 1 | 54 cents |
| 2 | 33 CENTS |
| 3 | Dollars: 10 |


In [36]:
amounts_df["quantity"] = (
    amounts_df["raw"]
    .str.extract(r"([\d,\.]+)", expand=False)
)

amounts_df

|   | raw | quantity |
|---|-----|----------|
| 0 | 1,000.31 doLLaRs | 1,000.31 |
| 1 | 54 cents | 54 |
| 2 | 33 CENTS | 33 |
| 3 | Dollars: 10 | 10 |


In [37]:
amounts_df["quantity"] = (
    amounts_df["raw"]
    .str.extract(r"([\d,\.]+)", expand=False)
    .str.replace(",", "")
    .astype(float)    
)

amounts_df

|   | raw | quantity |
|---|-----|----------|
| 0 | 1,000.31 doLLaRs | 1000.31 |
| 1 | 54 cents | 54.0 |
| 2 | 33 CENTS | 33.0 |
| 3 | Dollars: 10 | 10.0 |


In [38]:
amounts_df["raw"].str.extract(r"([\d,\.]+)(.*)$")

|   | 0 | 1 |
|---|---|---|
| 0 | 1,000.31 | doLLaRs |
| 1 | 54 | cents |
| 2 | 33 | CENTS |
| 3 | 10 |  |


In [39]:
amounts_df["raw"].str.extract(r"(?P<amount>[\d,\.]+)(?P<suffix>.*)$")

|   | amount | suffix |
|---|--------|--------|
| 0 | 1,000.31 | doLLaRs |
| 1 | 54 | cents |
| 2 | 33 | CENTS |
| 3 | 10 |  |


In [40]:
amounts_df["quantity"] = (
    amounts_df["raw"]
    .str.replace(r"[^\.\d]", "", regex=True)
)

amounts_df

|   | raw | quantity |
|---|-----|----------|
| 0 | 1,000.31 doLLaRs | 1000.31 |
| 1 | 54 cents | 54 |
| 2 | 33 CENTS | 33 |
| 3 | Dollars: 10 | 10 |


In [41]:
amounts_df["is_dollars"] = (
    amounts_df["raw"]
    .str.contains(r"dollars|USD|\$", case=False)
)

amounts_df

|   | raw | quantity | is_dollars |
|---|-----|----------|------------|
| 0 | 1,000.31 doLLaRs | 1000.31 | True |
| 1 | 54 cents | 54 | False |
| 2 | 33 CENTS | 33 | False |
| 3 | Dollars: 10 | 10 | True |


In [42]:
def get_conversion(is_dollars):
    if is_dollars:
        return 1
    else:
        return 0.01
    
amounts_df["conversion"] = amounts_df["is_dollars"].apply(get_conversion)
amounts_df

|   | raw | quantity | is_dollars | conversion |
|---|-----|----------|------------|------------|
| 0 | 1,000.31 doLLaRs | 1000.31 | True | 1.0 |
| 1 | 54 cents | 54 | False | 0.01 |
| 2 | 33 CENTS | 33 | False | 0.01 |
| 3 | Dollars: 10 | 10 | True | 1.0 |
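
To finish the job, the quantity and the conversion factor can be multiplied into a single dollar column. This is a minimal sketch, assuming the same four raw strings; the column name `amount_usd` is made up for illustration. Note the `.astype(float)`: `.str.replace(...)` returns strings, and strings can't be multiplied by 0.01.

```python
import pandas as pd

amounts_df = pd.DataFrame({"raw": [
    "   1,000.31   doLLaRs   ",
    "54 cents  ",
    "33 CENTS",
    "Dollars: 10",
]})

# Strip everything except digits and periods, then convert to a number
quantity = (
    amounts_df["raw"]
    .str.replace(r"[^\d.]", "", regex=True)
    .astype(float)
)

# True where the raw text mentions dollars in some form
is_dollars = amounts_df["raw"].str.contains(r"dollars|USD|\$", case=False, regex=True)

# Map True/False to a conversion factor and multiply
# (assumption: anything that isn't dollars is cents)
amounts_df["amount_usd"] = quantity * is_dollars.map({True: 1, False: 0.01})

total = amounts_df["amount_usd"].sum()
print(amounts_df)
print(round(total, 2))
```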


---
