# Regular Expressions

Regular expressions are sequences of characters that define search patterns.
They can be used for a variety of tasks, such as searching, editing, and manipulating text.

## Basic components of regex

### Literals
Literals are the simplest form of regex patterns because they match the exact characters in a search string.
For example, to find the exact word "cat" in the text "The cat sat on the mat", use the literal pattern `cat`.
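The literal match described above can be sketched with Python's `re` module, which the later sections of this document also use. The variable names `literal_match` and `literal_span` are illustrative only.

```python
import re

text = "The cat sat on the mat"

# A literal pattern matches exactly the characters it contains
match = re.search(r"cat", text)

literal_match = match.group()  # the matched text
literal_span = match.span()    # (start, end) offsets of the match in `text`
print(literal_match, literal_span)  # cat (4, 7)
```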

### Metacharacters
In regular expressions, metacharacters are symbols with special meanings and purposes:
- `.` matches any character except a newline.
- `^` matches the start of a string.
- `$` matches the end of a string.
- `*` matches 0 or more repetitions of the preceding element.
- `+` matches 1 or more repetitions of the preceding element.
- `?` matches 0 or 1 repetition of the preceding element.
- `{}` matches a specific number of repetitions of the preceding element.
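As a small sketch of the anchors and the dot from the list above, the following filters a handful of made-up sample strings. The `samples` list and the helper name `matching` are assumptions introduced for illustration.

```python
import re

samples = ["cat", "cut", "ct", "cats", "a cat"]

def matching(pattern):
    """Return the sample strings in which `pattern` is found."""
    return [s for s in samples if re.search(pattern, s)]

# `^` and `$` pin the pattern to the whole string; `.` needs exactly one character
anchored = matching(r"^c.t$")
print(anchored)    # ['cat', 'cut']

# Without anchors the same pattern may match anywhere inside the string
unanchored = matching(r"c.t")
print(unanchored)  # ['cat', 'cut', 'cats', 'a cat']

# By default `.` does not match a newline character
dot_skips_newline = re.search(r"a.b", "a\nb") is None
print(dot_skips_newline)  # True
```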

### Character Classes
Character classes match any character inside a given set. Common examples are:
- `[abc]` matches any of the characters a, b, or c.
- `\d` matches any digit (equivalent to `[0-9]` for ASCII input).
- `\w` matches any word character (alphanumeric plus underscore).
- `\s` matches any whitespace character.
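The classes listed above can be tried out with `re.findall()`, which is covered later in this document. The sample strings are invented for illustration.

```python
import re

# A bracketed set matches any single character it lists
set_hits = re.findall(r"[abc]", "a big cab")
print(set_hits)    # ['a', 'b', 'c', 'a', 'b']

# \d matches one digit at a time
digit_hits = re.findall(r"\d", "Room 42")
print(digit_hits)  # ['4', '2']

# \w covers letters, digits, and the underscore
word_hits = re.findall(r"\w+", "snake_case, v2!")
print(word_hits)   # ['snake_case', 'v2']

# \s covers spaces, tabs, and newlines
space_hits = re.findall(r"\s", "a b\tc\n")
print(space_hits)  # [' ', '\t', '\n']
```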

### Quantifiers
Quantifiers specify how many times a character or group must occur:
- `*`: 0 or more.
- `+`: 1 or more.
- `?`: 0 or 1.
- `{n}`: Exactly n.
- `{n,}`: n or more.
- `{n,m}`: Between n and m.
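To see how the quantifiers above differ, this sketch tests each one against strings made of zero to four `a` characters. It uses `re.fullmatch()`, a standard `re` function not otherwise covered in this document, so that the whole string has to satisfy the pattern. The helper name `accepted` is hypothetical.

```python
import re

samples = ["", "a", "aa", "aaa", "aaaa"]

def accepted(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

star = accepted(r"a*")             # 0 or more: every sample
plus = accepted(r"a+")             # 1 or more: everything except ""
optional = accepted(r"a?")         # 0 or 1: "" and "a"
exactly_two = accepted(r"a{2}")    # exactly 2: "aa"
two_or_more = accepted(r"a{2,}")   # 2 or more: "aa", "aaa", "aaaa"
two_to_three = accepted(r"a{2,3}") # between 2 and 3: "aa", "aaa"

for name, result in [("*", star), ("+", plus), ("?", optional),
                     ("{2}", exactly_two), ("{2,}", two_or_more),
                     ("{2,3}", two_to_three)]:
    print(name, result)
```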

## When to use each function

`re.match()`, `re.search()`, and `re.findall()` are similar.
So when should we use one over the other?
- `re.search()` searches for the first place where the pattern matches a string.
  Use it when you need to find the first occurrence of a pattern in a string,
  regardless of where it is.
- `re.match()` searches for a match only at the beginning of a string. Use it
  to check if a string starts with a certain pattern.
- `re.findall()` finds all matches of a pattern in a string and returns them as
  a list of strings. Use it when you need to find all pattern occurrences in a
  string.

In [4]:
# the `re.search()` function
# the `re.search()` function searches for the first match of the regex pattern in a string.
import re
# define the pattern
pattern = r"\d+"
# define the text to analyze
text = "There are 123 apples."
# apply the regex
match = re.search(pattern, text)
print(match.group())

123


In [5]:
# the `re.match()` function
# The `re.match()` function checks for a match only at the beginning of a string.
import re
# define the pattern
pattern = r"\d+"
# define the text to analyze
text = "There are 123 apples."
# apply the regex
match = re.match(pattern, text)
print(match)

None


In [6]:
# the `re.findall()` function
# the `re.findall()` function finds all matches of the regex pattern in a string, returning the output in a list.
import re
pattern = r"\d+"
text = "There are 123 apples in 2 trees."
matches = re.findall(pattern, text)
print(matches)

['123', '2']


In [9]:
# the `re.sub()` function
# the `re.sub()` function replaces a matched regex pattern with a replacement string.
import re
# define the pattern
pattern = r"\d+"
# define the text to analyze
text = "There are 123 apples."
# apply the regex
replaced_text = re.sub(pattern, "many", text)
print(replaced_text)

# Note that `re.sub()` replaces every match it finds,
# as the following example with two numbers shows:
import re
# define the pattern
pattern = r"\d+"
# define the text to analyze
text = "There are 123 apples in 3 trees."
# apply the regex
replaced_text = re.sub(pattern, "many", text)
print(replaced_text)

There are many apples.


There are many apples in many trees.


## Advanced Regex


### Lookaheads and Lookbehinds

Lookaheads and lookbehinds are types of zero-width assertions that match a position in a string based on what precedes or follows it (without including the preceding or following elements in the match itself).

A lookahead asserts that a certain pattern must follow the current position in the string but does not include this pattern in the matched result.

A lookbehind, on the other hand, asserts that a certain pattern must precede the current position in the string, but does not include this pattern in the matched result.


In [10]:
# LOOKAHEAD
# match sequences of one or more word characters immediately followed by a period
# `(?=\.)` is a positive lookahead asserting that the word characters must be followed by a period.
# `(?!...)` is a negative lookahead asserting that what follows the current position does not match the specified pattern.
import re
pattern = r"\w+(?=\.)"
text = "This is a test. Followed by another test."
matches = re.findall(pattern, text)
print(matches)

['test', 'test']
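The negative form `(?!...)` mentioned in the comments above can be sketched on the same text. The word boundaries `\b` are an addition made here: without them the engine could backtrack and return a partial word such as `tes`.

```python
import re

text = "This is a test. Followed by another test."

# Whole words that are NOT immediately followed by a period
pattern = r"\b\w+\b(?!\.)"
not_before_period = re.findall(pattern, text)
print(not_before_period)
# ['This', 'is', 'a', 'Followed', 'by', 'another']
```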


In [11]:
# LOOKBEHIND
# match one or more word characters immediately preceded by a whitespace character
# `(?<=\s)` is a positive lookbehind asserting that the current position in the string must be preceded by a whitespace character, such as a space, tab, or new line. 
import re
pattern = r"(?<=\s)\w+"
text = "This is a test. Followed by another test."
matches = re.findall(pattern, text)
print(matches)

['is', 'a', 'test', 'Followed', 'by', 'another', 'test']


### Non-capturing Groups

Non-capturing groups _(syntax: `(?:...)`)_ are used to group parts of a pattern without capturing the matched text, so they do not create a numbered group.


In [12]:
# NON-CAPTURING GROUPS
# `(?:\d{3})` is a non-capturing group that matches exactly three digits.
import re
pattern = r"(?:\d{3})-(\d{2})-(\d{4})"
text = "123-45-6789"
match = re.search(pattern, text)
print(match.group(1))

45
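To make the effect on group numbering visible, this sketch compares the pattern above with a variant in which the first group is a regular capturing group. The variable names are illustrative.

```python
import re

text = "123-45-6789"

# All three parts captured: groups are numbered 1, 2, 3
capturing = re.search(r"(\d{3})-(\d{2})-(\d{4})", text)
capturing_groups = capturing.groups()
print(capturing_groups)      # ('123', '45', '6789')

# First part grouped but not captured: numbering starts at the second part
non_capturing = re.search(r"(?:\d{3})-(\d{2})-(\d{4})", text)
non_capturing_groups = non_capturing.groups()
print(non_capturing_groups)  # ('45', '6789')

# The overall match is the same in both cases
full_match = non_capturing.group(0)
print(full_match)            # 123-45-6789
```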


## Performance Considerations


### Avoid Recompiling Patterns

In [6]:
import re
import time

# Sample texts
texts = ["123", "abc 456", "def 789 ghi"] * 10000

# Pattern without compiling
pattern1 = r"\d+"
start_time = time.time()
for text in texts:
    match = re.search(pattern1, text)
end_time = time.time()
time_without_compile = end_time - start_time

# Compiling the pattern
pattern2 = re.compile(r"\d+")
start_time = time.time()
for text in texts:
    match = pattern2.search(text)
end_time = time.time()
time_with_compile = end_time - start_time

print(f"without compile:{time_without_compile: .3}, with compile:{time_with_compile: .3}")

without compile: 0.0437, with compile: 0.0132


### Use Specific Patterns

In [7]:
import re
import time

# Sample texts
texts = ["abc123xyz", "123", "a123b", "x123y", "nonumber", "123", "test"] * 50000

# Less efficient pattern
pattern1 = r".*123.*"
start_time = time.time()
for text in texts:
    match = re.search(pattern1, text)
end_time = time.time()
time_with_less_efficient_pattern = end_time - start_time

# More efficient pattern
pattern2 = r"\b123\b"
start_time = time.time()
for text in texts:
    match = re.search(pattern2, text)
end_time = time.time()
time_with_more_efficient_pattern = end_time - start_time

print(f"time with less efficient pattern: {time_with_less_efficient_pattern: .3}, time with more efficient pattern: {time_with_more_efficient_pattern: .3}")


time with less efficient pattern:  0.186, time with more efficient pattern:  0.125
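One caveat when reading these timings: the two patterns do not accept the same strings, so switching between them also changes what counts as a match. A minimal sketch on the same kind of sample strings follows. The plain literal `123` is an assumption added here as one possible pattern that is more specific yet matches the same strings as `.*123.*` under `re.search()`.

```python
import re

samples = ["abc123xyz", "123", "a123b", "x123y", "nonumber", "test"]

def matched(pattern):
    """Return the samples in which re.search() finds the pattern."""
    return [s for s in samples if re.search(pattern, s)]

greedy = matched(r".*123.*")   # "123" anywhere in the string
bounded = matched(r"\b123\b")  # "123" only as a standalone word
literal = matched(r"123")      # "123" anywhere, without the greedy wildcards

print(greedy)   # ['abc123xyz', '123', 'a123b', 'x123y']
print(bounded)  # ['123']
print(literal)  # ['abc123xyz', '123', 'a123b', 'x123y']
```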


### Always Use Raw Strings

In [9]:
import re
import time

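# Define a regex pattern without using a raw string
# (recent Python versions warn that "\d" is an invalid escape sequence,
# although the resulting string is the same as the raw one)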
pattern_without_raw = "\d+"

# Define the same regex pattern using a raw string
pattern_with_raw = r"\d+"

# Create some sample text to search
text = "123 456 789 012" * 3000000

# Search using the pattern without raw string
start_time = time.time()
matches_without_raw = re.findall(pattern_without_raw, text)
end_time = time.time()
time_without_raw = end_time - start_time

# Search using the pattern with raw string
start_time = time.time()
matches_with_raw = re.findall(pattern_with_raw, text)
end_time = time.time()
time_with_raw = end_time - start_time

# Print the results and performance comparison
print(f"Time taken without raw string:{time_without_raw: .3}")
print(f"Time taken with raw string: {time_with_raw: .3}")


Time taken without raw string: 1.16


Time taken with raw string:  1.18
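The two timings are nearly identical because `"\d+"` and `r"\d+"` end up as the same string, which suggests the main benefit of raw strings is correctness rather than speed. A sketch of a case where the two forms do differ: in a normal string literal, `"\b"` is the backspace character, not the regex word boundary. The sample text is made up for illustration.

```python
import re

text = "the cat sat"

# "\b" in a normal string is a single backspace character (\x08)
non_raw_pattern = "\bcat\b"
# r"\b" keeps the backslash, so the regex engine sees a word boundary
raw_pattern = r"\bcat\b"

non_raw_matches = re.findall(non_raw_pattern, text)
raw_matches = re.findall(raw_pattern, text)

print(len("\b"), len(r"\b"))  # 1 2
print(non_raw_matches)        # []
print(raw_matches)            # ['cat']
```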
