# 🧹 Cleaning Data, Part 3: "Regular Expressions"

*i.e., Find/replace on steroids.*

## `.strip(...)`, `.replace(...)`, etc. have their limits

They deal with sets of explicit, pre-defined characters (e.g., the `"$ "` in `my_string.replace("$ ", "")`). But what if we don't know all the characters we want to replace/strip/etc.?

What if we want to clean based on ... *patterns*?

## An example

How would you extract the amounts from these strings?

In [1]:
amounts = [
    "   1,000.31   doLLaRs   ",
    "54 cents  ",
    "33 CENTS",
    "Dollars: 10" # <- this is our new wrinkle
]

If we try to use the function we defined earlier, we'll get an error because the `Dollars: ` comes before the amount rather than afterward, and it's not as easy to strip out as just `$`.

In [2]:
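def clean_amount(amt):
    # Trim dollar signs and spaces from both ends
    amt = amt.strip("$ ")
    # Remove thousands separators
    amt = amt.replace(",", "")
    # Keep only the first space-separated token (expected to be the number)
    amt = amt.split(" ")[0]
    return float(amt)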

In [3]:
sum(clean_amount(a) for a in amounts)

ValueError: could not convert string to float: 'Dollars:'

We *could* redefine our `.strip(...)` step to include every non-digit character ... but there's *got* to be a better way.

Spoiler: There is!

In [4]:
import re

for amt in amounts:
    print(re.sub(r"[^\d\.]", "", amt))

1000.31
54
33
10
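
As a rough sketch of how that substitution could be folded back into a cleaning function: the name `clean_amount_regex` is hypothetical (it is not the function defined earlier), and the approach assumes each string contains exactly one number and no stray periods.

```python
import re

def clean_amount_regex(amt):
    # Drop every character that is neither a digit nor a period,
    # then convert what remains to a float.
    return float(re.sub(r"[^\d.]", "", amt))

amounts = [
    "   1,000.31   doLLaRs   ",
    "54 cents  ",
    "33 CENTS",
    "Dollars: 10",
]

cleaned = [clean_amount_regex(a) for a in amounts]
print(cleaned)  # [1000.31, 54.0, 33.0, 10.0]
```

One caveat with this pattern: it keeps *every* period, so a string like `"approx. 5"` would become `".5"`.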


## __Regular expressions__ ("RegEx")

Regexes are:

- Like "Control-F" on steroids
- A (mostly) standard, purpose-tailored mini-language
- Usable across virtually every programming language (Python, JavaScript, R, etc.) and environment (even Excel and Google Sheets)

## Searching text for patterns

Is X *in* my string?

`re.search(pattern, text)`

In [5]:
print(re.search(r"\d+", "There are 24 people in class"))

<re.Match object; span=(10, 12), match='24'>


In [6]:
print(re.search(r"\d+", "There are twenty-four people in class"))

None
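
Because `re.search` returns either a match object or `None`, it is usually worth checking the result before using it. A minimal sketch, with `first_number` as a hypothetical helper name:

```python
import re

def first_number(text):
    """Return the first run of digits in `text` as an int, or None."""
    m = re.search(r"\d+", text)
    if m is None:
        # No digits anywhere in the string
        return None
    # group(0) is the full text of the match
    return int(m.group(0))

print(first_number("There are 24 people in class"))           # 24
print(first_number("There are twenty-four people in class"))  # None
```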


## Extracting text, using patterns

Regexes use parentheses to define "capture groups":

In [7]:
match = re.search(r"(\d+)", "There are 24 people and 3 dogs in class")
match.group(1)

'24'

In [8]:
match = re.search(r"(\d+) ([^ ]+)", "There are 24 people and 3 dogs in class")
match.groups()

('24', 'people')

In [9]:
re.findall(r"(\d+) ([^ ]+)", "There are 24 people and 3 dogs in class")

[('24', 'people'), ('3', 'dogs')]
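
When a pattern has more than one group, `re.findall` returns one tuple per match, which makes it easy to reshape the results. One possible way to turn the pairs above into a lookup table (the variable names here are just for illustration):

```python
import re

text = "There are 24 people and 3 dogs in class"

# Each match yields a (number, noun) tuple, one item per capture group
pairs = re.findall(r"(\d+) ([^ ]+)", text)

# Reshape into {noun: count}, converting the digits to integers
counts = {noun: int(number) for number, noun in pairs}
print(counts)  # {'people': 24, 'dogs': 3}
```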

## Changing text, using patterns

Like "Find/Replace," but much more powerful.

`re.sub(pattern, replacement, text)`

In [10]:
phrase = "I like dogs, my favorite dog is named Jim"
re.sub(r"dog", r"human", phrase)

'I like humans, my favorite human is named Jim'

In [11]:
phrase = "Hello! Beep!"
re.sub(r"(.)\1+", r"\1\1\1\1\1\1", phrase)

'Hellllllo! Beeeeeep!'
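
To see why that substitution only touches the doubled letters, it can help to list what `(.)\1+` actually matches. This sketch uses `re.finditer` to show each match, the character captured by the group, and where the match sits in the string:

```python
import re

phrase = "Hello! Beep!"

# (.) captures any single character; \1+ requires that same
# character to repeat at least once immediately afterward.
runs = [
    (m.group(0), m.group(1), m.span())
    for m in re.finditer(r"(.)\1+", phrase)
]

for full_match, captured, span in runs:
    print(full_match, captured, span)
# ll l (2, 4)
# ee e (8, 10)
```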

Regular expressions are the heart and soul of gimmicks like this:

- https://chrome.google.com/webstore/detail/cloud-to-butt-plus/apmlngnhgbnjpajelfkmabhkfapgnoai

And mistakes like this:

- https://www.nytimes.com/2018/03/06/us/politics/07dc-tradefacts.html

![Screenshot](../images/regex-whoops.png)

## "But how do I actually write them?"

The __bad news__: There are some rules to learn.

The __good news__: There aren't *too* many, and they're well worth learning.

## The basics

- Anchors
- Character sets
- Repetition
- Groups

### Anchors

The most important of these:

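- `^`: The beginning of the string (or of each line, with the `re.MULTILINE` flag)
- `$`: The end of the string (or of each line, with the `re.MULTILINE` flag)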

In [12]:
def test_search(pattern, string):
    m = re.search(pattern, string)
    print(f"{string} → {'Yes' if m else 'No'}")

In [13]:
pattern = r"^a"
test_search(pattern, "apple")
test_search(pattern, "almond")
test_search(pattern, "orange")

apple → Yes
almond → Yes
orange → No


In [14]:
pattern = r"e$"
test_search(pattern, "apple")
test_search(pattern, "almond")
test_search(pattern, "orange")

apple → Yes
almond → No
orange → Yes
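
Two follow-ups on anchors, sketched with made-up sample text. First, in Python `^` and `$` apply to the whole string unless you pass `re.MULTILINE`, in which case they apply to each line. Second, using both anchors together requires the *entire* string to fit the pattern.

```python
import re

text = "apple pie\nalmond milk\norange juice"

# Default: ^ matches only at the very start of the string
default_hits = re.findall(r"^a[a-z]+", text)
print(default_hits)  # ['apple']

# With re.MULTILINE: ^ matches at the start of every line
multiline_hits = re.findall(r"^a[a-z]+", text, flags=re.MULTILINE)
print(multiline_hits)  # ['apple', 'almond']

# Both anchors: the whole string must be digits, nothing else
only_digits = bool(re.search(r"^\d+$", "2024"))
digits_plus_text = bool(re.search(r"^\d+$", "2024 AD"))
print(only_digits, digits_plus_text)  # True False
```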


### Character sets

- `[abc123]`: A character that is *any* of a, b, c, 1, 2, or 3
- `[^abc]`: A character that is *not* a, b, or c
- `[a-z]`: Any lowercase letter from a to z
- `[0-9]` ... or `\d`: Any digit
- `\s`: Any whitespace (space, tab, newline)
- `.`: Anything! (Except for a newline.)
- `\.`: The literal period character

In [15]:
pattern = r"[abcxyz123]"
test_search(pattern, "apple")
test_search(pattern, "Apple")
test_search(pattern, "IPHONE 13")

apple → Yes
Apple → No
IPHONE 13 → Yes


In [16]:
pattern = r"[a-z]"
test_search(pattern, "apple")
test_search(pattern, "Apple")
test_search(pattern, "IPHONE 13")

apple → Yes
Apple → Yes
IPHONE 13 → No


In [17]:
pattern = r"\d"
test_search(pattern, "apple")
test_search(pattern, "Apple")
test_search(pattern, "IPHONE 13")

apple → No
Apple → No
IPHONE 13 → Yes


In [18]:
pattern = r"\s"
test_search(pattern, "apple")
test_search(pattern, "Apple")
test_search(pattern, "IPHONE 13")

apple → No
Apple → No
IPHONE 13 → Yes
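
The negated set, the `.` wildcard, and the escaped `\.` from the list above are not exercised in the cells so far. Here is a small sketch of how they behave; `found` is a hypothetical helper that just reports whether a pattern matches anywhere in a string.

```python
import re

def found(pattern, string):
    # True if the pattern matches anywhere in the string
    return re.search(pattern, string) is not None

# [^a-z]: any character that is NOT a lowercase letter
print(found(r"[^a-z]", "apple"))  # False (all lowercase)
print(found(r"[^a-z]", "Apple"))  # True  (the "A")

# An unescaped . matches any character, so "1.5" also matches "125"
print(found(r"1.5", "125"))       # True
# An escaped \. matches only a literal period
print(found(r"1\.5", "125"))      # False
print(found(r"1\.5", "1.5"))      # True

# . does not match a newline by default
print(found(r"a.b", "a\nb"))      # False
```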


### Repetition

- `?`: Zero or one
- `*`: Zero or more
- `+`: One or more
- `{5}`: Exactly five
- `{,5}`: Up to five
- `{5,}`: At least five
- `{5,8}`: Between five and eight

In [19]:
pattern = r"Buz?"
test_search(pattern, "Bu")
test_search(pattern, "Buz")
test_search(pattern, "Buzz")
test_search(pattern, "Buzzzzz")
test_search(pattern, "Buzzzzzz")

Bu → Yes
Buz → Yes
Buzz → Yes
Buzzzzz → Yes
Buzzzzzz → Yes


In [20]:
pattern = r"Buz?$"
test_search(pattern, "Bu")
test_search(pattern, "Buz")
test_search(pattern, "Buzz")
test_search(pattern, "Buzzz")
test_search(pattern, "Buzzzz")

Bu → Yes
Buz → Yes
Buzz → No
Buzzz → No
Buzzzz → No


In [21]:
pattern = r"Buz*$"
test_search(pattern, "Bu")
test_search(pattern, "Buz")
test_search(pattern, "Buzz")
test_search(pattern, "Buzzz")
test_search(pattern, "Buzzzz")

Bu → Yes
Buz → Yes
Buzz → Yes
Buzzz → Yes
Buzzzz → Yes


In [22]:
pattern = r"Buz+$"
test_search(pattern, "Bu")
test_search(pattern, "Buz")
test_search(pattern, "Buzz")
test_search(pattern, "Buzzz")
test_search(pattern, "Buzzzz")

Bu → No
Buz → Yes
Buzz → Yes
Buzzz → Yes
Buzzzz → Yes


In [23]:
pattern = r"Buz{3}$"
test_search(pattern, "Bu")
test_search(pattern, "Buz")
test_search(pattern, "Buzz")
test_search(pattern, "Buzzz")
test_search(pattern, "Buzzzz")

Bu → No
Buz → No
Buzz → No
Buzzz → Yes
Buzzzz → No


In [24]:
pattern = r"Buz{3,}$"
test_search(pattern, "Bu")
test_search(pattern, "Buz")
test_search(pattern, "Buzz")
test_search(pattern, "Buzzz")
test_search(pattern, "Buzzzz")

Bu → No
Buz → No
Buzz → No
Buzzz → Yes
Buzzzz → Yes


In [25]:
pattern = r"Buz{2,3}$"
test_search(pattern, "Bu")
test_search(pattern, "Buz")
test_search(pattern, "Buzz")
test_search(pattern, "Buzzz")
test_search(pattern, "Buzzzz")

Bu → No
Buz → No
Buzz → Yes
Buzzz → Yes
Buzzzz → No
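
In all of the cells above, the repetition marker applies only to the single item right before it (the `z`, not the whole `Buz`). That item can also be a character set. The sketch below covers the `{,n}` form, which the cells above skip, and then repeats `\d`; the five-digit strings are made-up examples, and `found` is a hypothetical helper.

```python
import re

def found(pattern, string):
    # True if the pattern matches anywhere in the string
    return re.search(pattern, string) is not None

# {,3}: anywhere from zero to three z's
print(found(r"^Buz{,3}$", "Bu"))      # True
print(found(r"^Buz{,3}$", "Buzzz"))   # True
print(found(r"^Buz{,3}$", "Buzzzz"))  # False

# Repetition applied to a set: exactly five digits, nothing else
print(found(r"^\d{5}$", "10027"))     # True
print(found(r"^\d{5}$", "1002"))      # False
print(found(r"^\d{5}$", "100277"))    # False
```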


### Groups

- `(...)`: A group
- `(abc|xyz)`: *Either* "abc" OR "xyz"
- `\1`, `\2`: A reference to the first group, second group, etc.

In [26]:
pattern = r"Ba(na)+$"
test_search(pattern, "Ba")
test_search(pattern, "Banana")
test_search(pattern, "Bananana")
test_search(pattern, "Banananan")

Ba → No
Banana → Yes
Bananana → Yes
Banananan → No


In [27]:
pattern = r"Hello, (World|Lede)"
test_search(pattern, "Hello, World")
test_search(pattern, "Hello, Lede")
test_search(pattern, "Hello, Yellow")

Hello, World → Yes
Hello, Lede → Yes
Hello, Yellow → No
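
Alternation inside a group is also handy for *extracting* which option matched. As a sketch, here is one way to pull the unit out of the amount strings from the opening example; `re.IGNORECASE` makes the match case-insensitive, and the variable names are illustrative only.

```python
import re

amounts = [
    "   1,000.31   doLLaRs   ",
    "54 cents  ",
    "33 CENTS",
    "Dollars: 10",
]

units = []
for raw in amounts:
    # The group captures whichever alternative matched
    m = re.search(r"(dollars|cents)", raw, flags=re.IGNORECASE)
    # Normalize the captured text to lowercase; None if neither word appears
    units.append(m.group(1).lower() if m else None)

print(units)  # ['dollars', 'cents', 'cents', 'dollars']
```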


In [28]:
pattern = r"^([^\s]+) vs \1"
test_search(pattern, "Dog vs Dog")
test_search(pattern, "Cat vs Cat")
test_search(pattern, "Dog vs Cat")

Dog vs Dog → Yes
Cat vs Cat → Yes
Dog vs Cat → No


In [30]:
text = "State: NY, City: Brooklyn"
pattern = r"State: ([A-Z]{2}), City: ([^,]+)"
replacement = r"\2, \1"
re.sub(pattern, replacement, text)

'Brooklyn, NY'
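
The same reorder-with-backreferences trick works with more than two groups. A sketch, assuming dates written as MM/DD/YYYY (the sample strings are invented for illustration):

```python
import re

# Three groups: month, day, year
pattern = r"(\d{2})/(\d{2})/(\d{4})"
# Rearrange into year-month-day
replacement = r"\3-\1-\2"

single = re.sub(pattern, replacement, "03/06/2018")
print(single)  # 2018-03-06

# re.sub replaces every match in the string, not just the first
several = re.sub(pattern, replacement, "Filed 03/06/2018, amended 11/20/2018")
print(several)  # Filed 2018-03-06, amended 2018-11-20
```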

## Regular expressions in Pandas

- `.str.extract(pattern_with_group, expand=False)`
- `.str.replace(pattern, replacement, regex=True)`
- `.str.contains(pattern)`

Let's try this with our new example list of amounts ...

In [31]:
import pandas as pd

In [32]:
amounts = [
    "   1,000.31   doLLaRs   ",
    "54 cents  ",
    "33 CENTS",
    "Dollars: 10"
]

In [33]:
amounts_df = pd.DataFrame({ "raw": amounts })
amounts_df

|   | raw |
|---|---|
| 0 | 1,000.31 doLLaRs |
| 1 | 54 cents |
| 2 | 33 CENTS |
| 3 | Dollars: 10 |


In [34]:
amounts_df["quantity"] = (
    amounts_df["raw"]
    .str.extract(r"([\d,\.]+)", expand=False)
)

amounts_df

|   | raw | quantity |
|---|---|---|
| 0 | 1,000.31 doLLaRs | 1,000.31 |
| 1 | 54 cents | 54 |
| 2 | 33 CENTS | 33 |
| 3 | Dollars: 10 | 10 |


In [35]:
amounts_df["quantity"] = (
    amounts_df["raw"]
    .str.extract(r"([\d,\.]+)", expand=False)
    .str.replace(",", "")
    .astype(float)    
)

amounts_df

|   | raw | quantity |
|---|---|---|
| 0 | 1,000.31 doLLaRs | 1000.31 |
| 1 | 54 cents | 54.0 |
| 2 | 33 CENTS | 33.0 |
| 3 | Dollars: 10 | 10.0 |


Or, collapsing the extract and comma-removal steps into a single regex-based `.str.replace(...)`:

In [36]:
amounts_df["quantity"] = (
    amounts_df["raw"]
    .str.replace(r"[^\.\d]", "", regex=True)
    .astype(float)
)

amounts_df

|   | raw | quantity |
|---|---|---|
| 0 | 1,000.31 doLLaRs | 1000.31 |
| 1 | 54 cents | 54.0 |
| 2 | 33 CENTS | 33.0 |
| 3 | Dollars: 10 | 10.0 |


How might we use regular expressions to handle the cents/dollars conversion step?

In [37]:
amounts_df

|   | raw | quantity |
|---|---|---|
| 0 | 1,000.31 doLLaRs | 1000.31 |
| 1 | 54 cents | 54.0 |
| 2 | 33 CENTS | 33.0 |
| 3 | Dollars: 10 | 10.0 |


In [38]:
amounts_df["conversion"] = (
    amounts_df["raw"]
    .str.contains(r"dollars|USD|\$", case=False)
    .astype(float)
    .replace({ 0: 0.01 })    
)

amounts_df

|   | raw | quantity | conversion |
|---|---|---|---|
| 0 | 1,000.31 doLLaRs | 1000.31 | 1.0 |
| 1 | 54 cents | 54.0 | 0.01 |
| 2 | 33 CENTS | 33.0 | 0.01 |
| 3 | Dollars: 10 | 10.0 | 1.0 |


In [39]:
(amounts_df["quantity"] * amounts_df["conversion"]).sum()

np.float64(1011.18)
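
As a cross-check on that total, the same two-step logic (strip to a number, then pick a conversion factor) can be sketched in plain Python with only the `re` module. The function name `to_dollars` is hypothetical, and like the pandas version it assumes anything not marked as dollars is in cents.

```python
import re

amounts = [
    "   1,000.31   doLLaRs   ",
    "54 cents  ",
    "33 CENTS",
    "Dollars: 10",
]

def to_dollars(raw):
    # Keep only digits and periods, then parse the number
    quantity = float(re.sub(r"[^\d.]", "", raw))
    # Case-insensitive check for a dollar marker
    is_dollars = re.search(r"dollars|USD|\$", raw, flags=re.IGNORECASE) is not None
    # Assumption: anything not marked as dollars is in cents
    return quantity * (1.0 if is_dollars else 0.01)

total = sum(to_dollars(a) for a in amounts)
print(round(total, 2))  # 1011.18
```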
