In [1]:
import sys
import re

In [2]:
print(sys.version)

3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0]


## re Module Function Basics

Regular expressions (regex) are available in Python through the built-in `re` module. Several functions in the module are worth noting:
- `re.match(pattern, string)`
- `re.search(pattern, string)`
- `re.findall(pattern, string)`
- `re.finditer(pattern, string)`
- `re.sub(pattern, repl, string)`
- `re.compile(pattern)`

In [3]:
text = "Questions? Please do not hesitate to contact us at support@ourcompany.com (please allow 24 hours for our customer care team to respond to your questions). For more immediate service, you can also reach out to our customer care team at 1-555-555-5555. We are dedicated to ensuring your satisfaction and saving you $$$."

`re.match()` checks whether the pattern matches **at the beginning of the string** and, if so, returns a `Match` object. If the pattern does not match there, `None` is returned.

In [4]:
match = re.match(pattern="Questions", string=text)
no_match = re.match(pattern="Please", string=text)

The match object provides information on the match.

In [5]:
print(match)

<re.Match object; span=(0, 9), match='Questions'>


In [6]:
# the pattern is correct, it's just not at the beginning of a string
print(no_match)

None


We can call various methods on the match object.

In [7]:
# used to extract the matched string
# note, if a regular expression had capture groups, then we can return each group separately (this is demonstrated below)
match.group()

'Questions'

In [8]:
match.start()

0

In [9]:
match.end()

9
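
Because `re.match()` returns `None` when the pattern is not found at the start of the string, calling a method such as `.group()` directly on the result would raise an `AttributeError`. The following is a minimal sketch of guarding against that. The helper name `leading_match` and the sample string are made up for illustration.

```python
import re

def leading_match(pattern, string):
    """Return the text matched at the start of `string`, or None if there is no match."""
    m = re.match(pattern, string)
    # m is either a Match object or None, so check before calling methods on it
    if m is None:
        return None
    return m.group()

sample = "Questions? Please do not hesitate to contact us."
found = leading_match(r"Questions", sample)   # pattern is at the start of the string
missing = leading_match(r"Please", sample)    # pattern exists, but not at the start
print(found)
print(missing)
```

The same check applies to `re.search()`, which also returns `None` when nothing matches.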

`re.search()` finds the **first** instance of a match **anywhere** in the string and returns a `Match` object. As before, if no pattern is found, a `None` is returned.

In [10]:
text = "Questions? Please do not hesitate to contact us at support@ourcompany.com (please allow 24 hours for our customer care team to respond to your questions). For more immediate service, you can also reach out to our customer care team at 1-555-555-5555. We are dedicated to ensuring your satisfaction and saving you $$$."

In [11]:
match = re.search(pattern="customer", string=text) # "customer" appears twice in the text, but search() returns only the first instance

print(match)

<re.Match object; span=(105, 113), match='customer'>


We can use `re.findall()` to get all instances of a match, not just the first. Note that the return type is different from the `re.search()` function. Specifically:

> If there are no groups, return a list of strings matching the whole pattern. If there is exactly one group, return a list of strings matching that group. If multiple groups are present, return a list of tuples of strings matching the groups. Non-capturing groups do not affect the form of the result.

In [12]:
# example with no groups
all_matches = re.findall(pattern="customer", string=text)

print(all_matches)

['customer', 'customer']
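
The quoted rule about groups is easier to see side by side. The sketch below runs `re.findall()` with zero, one, and two capturing groups, and with a non-capturing group. The sample string and variable names are invented for this illustration.

```python
import re

sample = "Call 555-1234 or 555-9876."

# no groups: a list of whole matches
no_groups = re.findall(r"\d{3}-\d{4}", sample)
# exactly one group: a list of that group's text only
one_group = re.findall(r"\d{3}-(\d{4})", sample)
# several groups: a list of tuples, one tuple per match
two_groups = re.findall(r"(\d{3})-(\d{4})", sample)
# a non-capturing group does not change the shape of the result
non_capturing = re.findall(r"(?:\d{3})-\d{4}", sample)

print(no_groups, one_group, two_groups, non_capturing, sep="\n")
```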


If we want to get the top level match and all the groups of the match, we can use `re.finditer()`.

In [13]:
# example with both capturing and non-capturing groups
all_matches = re.finditer(pattern=r"(?:1-)?(\d{3})-(\d{3}-\d{4})", string=text) # raw string so that \d is passed to the regex engine unchanged

for match in all_matches:
    full_match = match.group(0)
    print(f"{full_match=}") # the top-level match
    first_group = match.group(1)
    print(f"{first_group=}") # first group
    groups = match.groups()
    print(f"{groups=}") # all groups as a tuple

full_match='1-555-555-5555'
first_group='555'
groups=('555', '555-5555')


We can substitute text based on regex patterns using the `re.sub()` function, which returns a new string. The optional `count` argument limits how many matches are replaced.

In [14]:
# will print several instances of substitutions on the text
print(
    re.sub(pattern=r"\d", repl="_", string=text, count=3),
    re.sub(pattern=r"\d", repl="_", string=text, count=8),
    re.sub(pattern=r"\d", repl="_", string=text),
    sep="\n\n"
)

Questions? Please do not hesitate to contact us at support@ourcompany.com (please allow __ hours for our customer care team to respond to your questions). For more immediate service, you can also reach out to our customer care team at _-555-555-5555. We are dedicated to ensuring your satisfaction and saving you $$$.

Questions? Please do not hesitate to contact us at support@ourcompany.com (please allow __ hours for our customer care team to respond to your questions). For more immediate service, you can also reach out to our customer care team at _-___-__5-5555. We are dedicated to ensuring your satisfaction and saving you $$$.

Questions? Please do not hesitate to contact us at support@ourcompany.com (please allow __ hours for our customer care team to respond to your questions). For more immediate service, you can also reach out to our customer care team at _-___-___-____. We are dedicated to ensuring your satisfaction and saving you $$$.
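
`re.sub()` can do more than swap in a fixed string: the replacement may refer back to capture groups (for example `\1`), or it may be a function that receives each `Match` object. A small sketch of both follows. The sample strings and the helper `double_number` are hypothetical.

```python
import re

sample = "Reach us at 1-555-555-5555."

# \1 in the replacement refers to the first capture group (the last four digits here)
masked = re.sub(r"\d-\d{3}-\d{3}-(\d{4})", r"X-XXX-XXX-\1", sample)

# repl may also be a function that takes a Match and returns the replacement text
def double_number(m):
    return str(int(m.group()) * 2)

doubled = re.sub(r"\d+", double_number, "2 apples and 10 pears")

print(masked)
print(doubled)
```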


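If a pattern is used repeatedly, we can compile it once with the `re.compile()` function, which returns a regular expression object. That object can be referenced throughout the code and has its own `match()`, `search()`, `findall()`, `finditer()`, and `sub()` methods. The `re` module also caches recently used patterns, so the gain is mostly in readability and reuse.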

In [15]:
pattern = re.compile("[Qq]uestion")

print(
    pattern.match(string=text), # will match beginning of the string
    pattern.search(string=text), # will match first instance anywhere in the string
    pattern.findall(string=text), # will find all instances
    sep="\n"
)

<re.Match object; span=(0, 8), match='Question'>
<re.Match object; span=(0, 8), match='Question'>
['Question', 'question']
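
A compiled pattern is convenient when the same expression is applied to many strings, and `re.compile()` also accepts flags such as `re.IGNORECASE`. The sketch below assumes a made-up list of lines and a hypothetical variable name `question_re`.

```python
import re

# compile once, with a flag that makes matching case-insensitive
question_re = re.compile(r"questions?", flags=re.IGNORECASE)

lines = ["Questions? Ask away.", "No queries here.", "Any QUESTION is welcome."]

# reuse the same compiled object for filtering and for substitution
hits = [line for line in lines if question_re.search(line)]
replaced = question_re.sub("Query", lines[0])

print(hits)
print(replaced)
```

Using the flag here replaces the `[Qq]` character set from the cell above and also covers fully upper-case spellings.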


## Regex Patterns

### Character Sets

Character sets, written inside square brackets, denote all possible options for a single character. They can be thought of as a character-level "or".

In [16]:
text = "This example demonstrates using character sets, a form of character-level disjunction."

pattern = "[^aeiou]" # can negate by adding ^ at beginning of the set

matches = re.findall(pattern, text) # will return all instances anywhere in the text
print(f"{len(matches)} matches found: {matches}")

60 matches found: ['T', 'h', 's', ' ', 'x', 'm', 'p', 'l', ' ', 'd', 'm', 'n', 's', 't', 'r', 't', 's', ' ', 's', 'n', 'g', ' ', 'c', 'h', 'r', 'c', 't', 'r', ' ', 's', 't', 's', ',', ' ', ' ', 'f', 'r', 'm', ' ', 'f', ' ', 'c', 'h', 'r', 'c', 't', 'r', '-', 'l', 'v', 'l', ' ', 'd', 's', 'j', 'n', 'c', 't', 'n', '.']


In [17]:
text = "This example demonstrates using character sets, a form of character-level disjunction."

pattern = r"[^a-e]"

matches = re.findall(pattern, text)
print(f"{len(matches)} matches found: {matches}")

63 matches found: ['T', 'h', 'i', 's', ' ', 'x', 'm', 'p', 'l', ' ', 'm', 'o', 'n', 's', 't', 'r', 't', 's', ' ', 'u', 's', 'i', 'n', 'g', ' ', 'h', 'r', 't', 'r', ' ', 's', 't', 's', ',', ' ', ' ', 'f', 'o', 'r', 'm', ' ', 'o', 'f', ' ', 'h', 'r', 't', 'r', '-', 'l', 'v', 'l', ' ', 'i', 's', 'j', 'u', 'n', 't', 'i', 'o', 'n', '.']


In [18]:
text = r"These are metacharacters: . ^ $ * + ? { } [ ] \ | ( ) -" # raw string so the lone backslash is kept literally

# in a character set, most metacharacters are disabled
pattern = r"[.^$*+?{}\]|()\-\\]" # what if ^ was in the front?

matches = re.findall(pattern, text)
print(f"{len(matches)} matches found: {matches}")

14 matches found: ['.', '^', '$', '*', '+', '?', '{', '}', ']', '\\', '|', '(', ')', '-']
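
To answer the question in the comment above: a `^` placed first in a character set negates the set, while a `^` anywhere else (or escaped) is a literal caret. A short sketch with an invented sample string:

```python
import re

symbols = "a^b$c"

# ^ is not first, so it is a literal: matches a dollar sign or a caret
caret_inside = re.findall(r"[$^]", symbols)
# ^ is first, so it negates: matches any character except a dollar sign
caret_first = re.findall(r"[^$]", symbols)
# an escaped ^ in first position is a literal again
escaped_first = re.findall(r"[\^$]", symbols)

print(caret_inside, caret_first, escaped_first, sep="\n")
```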


### Grouping

Grouping allows us to extract substrings from the overall match.

In [19]:
text = "Questions? Please do not hesitate to contact us at support@ourcompany.com (please allow 24 hours for our customer care team to respond to your questions). For more immediate service, you can also reach out to our customer care team at 1-555-555-5555. We are dedicated to ensuring your satisfaction (and saving you $$$)."

In [20]:
pattern = r"([A-Za-z0-9.-]+)@([A-Za-z0-9.-]+)\.([A-Za-z]{2,})" # raw string so that \. reaches the regex engine unchanged
matches = re.findall(pattern, text) # if groups exist, findall() will only return the groups, not top-level match
print(matches)

matches = re.search(pattern, text)
print(f"{matches.group(0)=}") # entire email match
print(f"{matches.group(1)=}") # local part
print(f"{matches.group(2)=}") # domain
print(f"{matches.group(3)=}") # top-level domain

[('support', 'ourcompany', 'com')]
matches.group(0)='support@ourcompany.com'
matches.group(1)='support'
matches.group(2)='ourcompany'
matches.group(3)='com'
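
Numbered groups can get hard to track as a pattern grows. Python's `re` module also supports named groups with the `(?P<name>...)` syntax, and `Match.groupdict()` returns them as a dictionary. The sketch below applies this to the same email pattern. The group names `local`, `domain`, and `tld` and the sample sentence are chosen here for illustration.

```python
import re

email_re = re.compile(
    r"(?P<local>[A-Za-z0-9.-]+)@(?P<domain>[A-Za-z0-9.-]+)\.(?P<tld>[A-Za-z]{2,})"
)

m = email_re.search("contact us at support@ourcompany.com today")

parts = m.groupdict()        # all named groups as a dict
domain = m.group("domain")   # a single group, looked up by name

print(parts)
print(domain)
```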


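Non-capturing groups, written `(?:...)`, give us the benefits of grouping (for example, applying a quantifier or an alternation to part of a pattern) without the regex engine having to keep track of the matched text as a numbered group.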

In [21]:
pattern = r"(?:1-)?(?:\d{3})-\d{3}-\d{4}"
matches = re.findall(pattern, text)
print(matches)

matches = re.search(pattern, text)
try:
    print(f"{matches.group(1)=}") # <-- raises IndexError because no capturing group exists
except IndexError:
    print("`matches.group(1)` didn't work")

['1-555-555-5555']
`matches.group(1)` didn't work


### Lookaheads and Lookbehinds

Lookaheads and lookbehinds are useful when we want to match a pattern based on its surrounding context. They check what comes after or before the current position without consuming any characters.

In [22]:
passwords = ["hello123", "Hello123", "Hello123!goodbye?"]

pattern = (
    r"^(?=.*[a-z])"  # at least one lowercase letter after the start of the string
    r"(?=.*[A-Z])"  # at least one uppercase letter after the start of the string
    r"(?=.*\d.*\d)"  # at least two digits after the start of the string
    r"(?=.*[@$!%*?&])"  # at least one special character after the start of the string
    r".{8,}$" # at least 8 characters between the start and the end of the string
)

print("regex as single line:", pattern)

for text in passwords:
    matches = re.search(pattern, text)
    print(matches)

regex as single line: ^(?=.*[a-z])(?=.*[A-Z])(?=.*\d.*\d)(?=.*[@$!%*?&]).{8,}$
None
None
<re.Match object; span=(0, 17), match='Hello123!goodbye?'>
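
The password example uses only lookaheads. For completeness, here is a minimal sketch of lookbehinds, `(?<=...)` and `(?<!...)`, next to a simple lookahead. The sample string and variable names are invented for this illustration.

```python
import re

sample = "cost: $15, weight: 20kg, discount: $5"

# positive lookbehind: digits preceded by a dollar sign (the $ is not part of the match)
after_dollar = re.findall(r"(?<=\$)\d+", sample)

# negative lookbehind: digit runs NOT preceded by a dollar sign;
# \d is also excluded so that the match cannot start in the middle of "15"
not_after_dollar = re.findall(r"(?<![$\d])\d+", sample)

# positive lookahead: digits followed by "kg" (the unit is not part of the match)
before_kg = re.findall(r"\d+(?=kg)", sample)

print(after_dollar, not_after_dollar, before_kg, sep="\n")
```

Note that Python's `re` module requires the pattern inside a lookbehind to match strings of a fixed length.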


### Greedy vs Lazy Behavior

By default, quantifiers such as `*`, `+`, and `{m,n}` match as much text as they can. This is considered "greedy" behavior. For example, consider the pattern `[a-z]*`, given the string `hello`. Why would this return `hello` and not `h` or `he` or `hel`? These are all instances of "zero or more a through z characters". Because a greedy quantifier consumes **as many characters as it can** while still letting the rest of the pattern match. Adding `?` after a quantifier (`*?`, `+?`) makes it "lazy", so it consumes as few characters as possible instead.
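
The `hello` case can be checked directly. The sketch below compares greedy and lazy quantifiers on that string. The variable names are made up for illustration.

```python
import re

word = "hello"

greedy = re.match(r"[a-z]*", word).group()       # takes as many letters as possible
lazy_star = re.match(r"[a-z]*?", word).group()   # zero letters already satisfies *?
lazy_plus = re.match(r"[a-z]+?", word).group()   # +? needs at least one letter
# a lazy quantifier still expands as far as needed for the rest of the pattern to match
lazy_then_o = re.match(r"[a-z]*?o", word).group()

print(repr(greedy), repr(lazy_star), repr(lazy_plus), repr(lazy_then_o))
```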

Consider the following example, which tries to extract all parenthetical statements from the text.

In [23]:
text = "Questions? Please do not hesitate to contact us at support@ourcompany.com (please allow 24 hours for our customer care team to respond to your questions). For more immediate service, you can also reach out to our customer care team at 1-555-555-5555. We are dedicated to ensuring your satisfaction (and saving you $$$)."

In [24]:
pattern = r"\(.+\)" # default greedy behavior
# pattern = r"\(([a-zA-Z0-9 $]+)\)" # this would work too but we would have to keep adding possibilities for characters in the parentheses
matches = re.findall(pattern, text)
print(matches)

['(please allow 24 hours for our customer care team to respond to your questions). For more immediate service, you can also reach out to our customer care team at 1-555-555-5555. We are dedicated to ensuring your satisfaction (and saving you $$$)']


In [25]:
pattern = r"\(.+?\)" # lazy behavior
matches = re.findall(pattern, text)
print(matches)

['(please allow 24 hours for our customer care team to respond to your questions)', '(and saving you $$$)']
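
Another common way to stop at the first closing parenthesis is a negated character set, `[^)]`, instead of a lazy quantifier. The sketch below, using invented sample strings, shows that the two agree on a single line but differ when the parenthetical spans a newline, because `.` does not match `\n` unless `re.DOTALL` is set.

```python
import re

sample = "First (one) then (two, with $$$) and done."

lazy = re.findall(r"\(.+?\)", sample)       # lazy quantifier
negated = re.findall(r"\([^)]+\)", sample)  # anything except a closing parenthesis

multi_line = "(line one\nline two)"
lazy_multi = re.findall(r"\(.+?\)", multi_line)       # . does not cross the newline
negated_multi = re.findall(r"\([^)]+\)", multi_line)  # [^)] does match a newline

print(lazy, negated, lazy_multi, negated_multi, sep="\n")
```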


In [26]:
text = """
<html>
<body>
<p>This is the first paragraph.</p>
<p>This is the second paragraph, which includes <a href="link">a link</a>.</p>
</body>
</html>
"""

In [27]:
pattern = "<p>(.+)</p>" # greedy
matches = re.findall(pattern=pattern, string=text, flags=re.DOTALL) # re.DOTALL makes . match any character, including \n
print(matches)

['This is the first paragraph.</p>\n<p>This is the second paragraph, which includes <a href="link">a link</a>.']


In [28]:
pattern = "<p>(.*?)</p>" # lazy
matches = re.findall(pattern=pattern, string=text, flags=re.DOTALL)
print(matches)

['This is the first paragraph.', 'This is the second paragraph, which includes <a href="link">a link</a>.']


### Alternation

There is also another way to introduce an "or": the pipe, `|`. It is useful when we want to match any one of several patterns; think of this as a pattern-level "or".

In [29]:
text = "I have a cat, a dog, and a fish in my house."

pattern = "cat|dog|fish"

matches = re.findall(pattern, text)
print(matches)

['cat', 'dog', 'fish']


In [30]:
text = "Two date patterns are 2024-01-20 20/01/2024"

pattern = r"\d{4}-\d{2}-\d{2}|\d{2}/\d{2}/\d{4}"

matches = re.findall(pattern, text)
print(matches)

['2024-01-20', '20/01/2024']


Beware that the pipe separates everything to its left from everything to its right, so each alternative extends to the edge of the pattern (or of the enclosing group). We can limit its scope by using groups.

In [31]:
text = "Have you heard of batman? No? What about birdman." # The intent here is to capture either "birdman" or "batman"

pattern = "bat|birdman" # but is that what this pattern will do?

matches = re.findall(pattern, text)
print(matches)

['bat', 'birdman']


In [32]:
text = "Have you heard of batman? No? What about birdman." # The intent here is to capture either "birdman" or "batman"

pattern = "(?:bat|bird)man"

matches = re.findall(pattern, text)
print(matches)

['batman', 'birdman']
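
Two further details are worth checking here: alternatives are tried from left to right, so the first one that matches wins even if a later one would match more text, and a capturing group around the alternation changes what `re.findall()` returns. A short sketch with made-up variable names:

```python
import re

# alternatives are tried left to right; the first one that matches is used
short_first = re.findall(r"bat|batman", "batman")
long_first = re.findall(r"batman|bat", "batman")

sentence = "Have you heard of batman? No? What about birdman."
# with a capturing group, findall() returns only the group's text
capturing = re.findall(r"(bat|bird)man", sentence)
# with a non-capturing group, findall() returns the whole match
non_capturing = re.findall(r"(?:bat|bird)man", sentence)

print(short_first, long_first, capturing, non_capturing, sep="\n")
```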


## Regex in Pandas

There is some native support for regex in pandas through the `.str` accessor on string columns. For example, patterns can be used as follows:

In [None]:
df['column'].str.replace('pattern', 'replacement', regex=True) # can be used to replace substrings in a column
df['column'].str.contains('pattern', regex=True) # can be used to create boolean Series for filtering
df['column'].str.extractall('(pattern)') # can be used to extract regex groups into a DataFrame, one row per match (extractall() takes no `expand` argument)