# Regular Expressions

## 1. Basic Pattern Matching

Regular expressions are used to search for specific patterns in text. You define a pattern using a combination of characters, metacharacters, and special sequences.

In this section, we will discuss regular expression patterns, the main functions of Python's `re` module, and metacharacters.

In [1]:
import re

In [2]:
string =  "The curious cat, known for its agility and independence, prowled the garden in search of adventure. With a swift leap, the cat pounced on a fallen leaf, a momentary victory in its playful escapade. Nearby, a second cat observed the scene, its eyes fixed on the first cat's antics. Suddenly, a third cat appeared, joining the playful duo, and together they formed a trio of frolicking felines. As the sun began to set, the cats' energy waned, and they settled down for a well-deserved catnap under the shade of a tall oak tree."
pattern = "The"
match = re.match(pattern, string)

print(f"Type of pattern: {type(pattern)}")
print(f"Pattern: {pattern}")
print(f"Type of match: {type(match)}")
print(f"Match: {match}")
print(match.start(), match.end(), match.group())

Type of pattern: <class 'str'>
Pattern: The
Type of match: <class 're.Match'>
Match: <re.Match object; span=(0, 3), match='The'>
0 3 The


In [3]:
# re.match() only matches at the beginning of the string, while re.search() scans through the entire string

string = "The curious cat, known for its agility and independence, prowled the garden in search of adventure. With a swift leap, the cat pounced on a fallen leaf, a momentary victory in its playful escapade. Nearby, a second cat observed the scene, its eyes fixed on the first cat's antics. Suddenly, a third cat appeared, joining the playful duo, and together they formed a trio of frolicking felines. As the sun began to set, the cats' energy waned, and they settled down for a well-deserved catnap under the shade of a tall oak tree."
pattern = r"cat"
match = re.match(pattern, string)
print(f"Match: {match}") # None because the first word is not 'cat'
search = re.search(pattern, string)
print(f"Search: {search}") # Found first 'cat'

Match: None
Search: <re.Match object; span=(12, 15), match='cat'>
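Because both functions return `None` when nothing matches, calling `.start()` or `.group()` on the result without a check raises an `AttributeError`. The following is a minimal sketch of a guarded lookup; the shortened `text` value is only an illustration, not the paragraph used above.

```python
import re

# Shortened sample sentence (an assumption made for this illustration)
text = "The curious cat prowled the garden."

# re.match() only tries position 0, where the text starts with "The"
match = re.match(r"cat", text)
if match is None:
    print("re.match: no match at the start of the string")

# re.search() scans forward and returns the first occurrence
found = re.search(r"cat", text)
if found is not None:
    print(found.span(), found.group())  # (12, 15) cat
```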


In [4]:
# re.findall()

string = "cat The category curious cat, known for its agility and independence, prowled the garden in search of adventure. With a swift leap, the cat pounced on a fallen leaf, a momentary victory in its playful escapade. Nearby, a second cat observed the scene, its eyes fixed on the first cat's antics. Suddenly, a third cat appeared, joining the playful duo, and together they formed a trio of frolicking felines. As the sun began to set, the cats' energy waned, and they settled down for a well-deserved catnap under the shade of a tall oak tree cat"

pattern = r"cat[\W]|cat$|^cat"
#pattern = "cat"
matches = re.findall(pattern, string)

if matches:
    print(f"Found: {matches}", len(matches))
else:
    print("Pattern not found.")

Found: ['cat ', 'cat,', 'cat ', 'cat ', "cat'", 'cat ', 'cat'] 7
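The pattern above approximates "the whole word cat" by listing what may follow it. One possible alternative, sketched here on a shorter made-up sentence, is the word-boundary anchor `\b`, which matches the position between a word character and a non-word character without consuming anything, so each result is exactly `cat`.

```python
import re

# Short sample text (an assumption for illustration, not the paragraph above)
text = "cat The category curious cat, the cat's toy, two cats, a catnap, cat"

# Plain "cat" also hits "category", "cats" and "catnap"
everywhere = re.findall(r"cat", text)
print(len(everywhere))  # 7

# \bcat\b only accepts "cat" when it is not glued to other word characters
whole_words = re.findall(r"\bcat\b", text)
print(whole_words, len(whole_words))  # ['cat', 'cat', 'cat', 'cat'] 4
```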


In [5]:
# re.compile() converts a pattern string into a regular expression object
# Why? You compile once and reuse the object, which saves the per-call pattern lookup
# and is especially handy when you want to use the pattern multiple times

string = "The curious cat, known for its agility and independence, prowled the garden in search of adventure. With a swift leap, the cat pounced on a fallen leaf, a momentary victory in its playful escapade. Nearby, a second cat observed the scene, its eyes fixed on the first cat's antics. Suddenly, a third cat appeared, joining the playful duo, and together they formed a trio of frolicking felines. As the sun began to set, the cats' energy waned, and they settled down for a well-deserved catnap under the shade of a tall oak tree."
pattern = re.compile("The")
match = re.match(pattern, string)

print(f"Type of pattern: {type(pattern)}")
print(f"Pattern: {pattern}")
print(f"Type of match: {type(match)}")
print(f"Match: {match}")
print(match.start(), match.end(), match.group())

Type of pattern: <class 're.Pattern'>
Pattern: re.compile('The')
Type of match: <class 're.Match'>
Match: <re.Match object; span=(0, 3), match='The'>
0 3 The
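A compiled pattern also exposes the matching functions as methods, so the pattern does not have to be passed to `re.match()` or `re.findall()` each time. A small sketch of reusing one compiled object over several strings follows; the names `word_cat` and `sentences` are made up for this example.

```python
import re

# Compile once, reuse many times (names are hypothetical)
word_cat = re.compile(r"cat")
sentences = ["The curious cat", "A tall oak tree", "cat and cat"]

# Each call uses the method on the compiled object
counts = [len(word_cat.findall(s)) for s in sentences]
print(counts)  # [1, 0, 2]

first = word_cat.search(sentences[0])
print(first.span())  # (12, 15)
```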


In [6]:
pattern.search(string)

<re.Match object; span=(0, 3), match='The'>

In [7]:
# split text based on pattern

string = "coconut, durian, papaya; lychee"
pattern = re.compile("[,;]")
texts = re.split(pattern, string)
texts

['coconut', ' durian', ' papaya', ' lychee']
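Note that the pieces above keep the space that followed each separator. One way to drop it, shown here as a sketch, is to let the pattern absorb optional whitespace around the separator with `\s*`.

```python
import re

# Optional whitespace on either side of "," or ";" becomes part of the separator
fruits = re.split(r"\s*[,;]\s*", "coconut, durian, papaya; lychee")
print(fruits)  # ['coconut', 'durian', 'papaya', 'lychee']
```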

In [8]:
# remove leading 0s in an IP
ip = "216.08.094.196"
pattern = r'\.0*'
string = re.sub(pattern, '.', ip)
print(string)

216.8.94.196
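The pattern above works for this address, but it has two limits: it never touches the first octet, and an octet that is exactly `0` is removed entirely. A possible, more defensive variant is sketched below; the helper name `strip_leading_zeros` is hypothetical, and the approach (a word boundary plus a lookahead) is only one of several options.

```python
import re

def strip_leading_zeros(ip):
    # \b0+     : one or more zeros at the start of an octet
    # (?=\d)   : only if at least one digit remains after them
    return re.sub(r"\b0+(?=\d)", "", ip)

print(strip_leading_zeros("216.08.094.196"))  # 216.8.94.196
print(strip_leading_zeros("010.0.00.001"))    # 10.0.0.1

# For comparison, the simpler pattern empties a zero octet
print(re.sub(r"\.0*", ".", "216.0.094.196"))  # 216..94.196
```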


A raw string, denoted by an `r` prefix before the string literal, is a way to define strings that treat backslashes `\` as **literal characters** rather than escape characters.
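A quick way to see the difference is to compare string lengths. This is a small sketch that only uses built-in string behavior.

```python
# In a normal string, "\n" is a single newline character
normal = "\n"
# In a raw string, r"\n" is two characters: a backslash and the letter n
raw = r"\n"

print(len(normal), len(raw))  # 1 2

# A raw string is just a shorter way to write doubled backslashes
print(r"\d{3}" == "\\d{3}")   # True
```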

In [9]:
"\""

'"'

In [10]:
# with raw string
pattern = r"\d{3}-\d{2}-\d{4}"
print(re.findall(pattern, "123-12-1234"))

# without raw string
pattern = "\\d{3}-\\d{2}-\\d{4}"
print(re.findall(pattern, "123-12-1234"))

['123-12-1234']
['123-12-1234']


In [11]:
# special case: matching a literal backslash
pattern = r"\\"
file_path = r"C:\Users\John\Documents\file.txt"

re.split(pattern, file_path)

['C:', 'Users', 'John', 'Documents', 'file.txt']

## 2. Metacharacters & Character classes

Metacharacters are special characters with predefined meanings that are used to construct complex search patterns. These metacharacters provide a way to specify more than just literal text matches in regular expressions.

1. `.` (Dot): Matches any character except a newline character. For example, `a.b` matches "axb," "a2b," but not "a\nb."

2. `*` (Asterisk): Matches zero or more occurrences of the preceding character or group. For example, `ab*c` matches "ac," "abc," "abbc," and so on.

3. `+` (Plus): Matches one or more occurrences of the preceding character or group. For example, `ab+c` matches "abc," "abbc," but not "ac."

4. `?` (Question Mark): Matches zero or one occurrence of the preceding character or group. For example, `ab?c` matches "ac" and "abc" but not "abbc."

5. `|` (Vertical Bar): Acts as an OR operator and allows you to match either of two patterns. For example, `cat|dog` matches "cat" or "dog."

6. `[]` (Square Brackets): Defines a character set. For example, `[aeiou]` matches any vowel, and `[0-9]` matches any digit.

7. `[^]` (Caret within Square Brackets): Defines a negated character set, matching any character not listed. For example, `[^0-9]` matches any non-digit character.

8. `()` (Parentheses): Creates a capturing group to capture a subpattern. For example, `(abc)+` matches "abc," "abcabc," and captures "abc" as a group.

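9. `(?...)` (Extension notation): A `?` right after an opening parenthesis introduces a special construct instead of an ordinary capturing group. For example, `(?:abc)+` is a non-capturing group, and `(?i)` turns on case-insensitive matching.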

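10. `*?`, `+?`, `??`: These are non-greedy (lazy) quantifiers. They match as little text as possible while still letting the overall pattern match. For example, in "<a><b>" the pattern `<.+?>` matches "<a>", whereas the greedy `<.+>` matches the entire "<a><b>".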

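11. `^` (Caret, outside square brackets): Matches the start of a string (and, with `re.MULTILINE`, the start of each line). For example, `^abc` matches "abc" at the beginning of a string.

12. `$` (Dollar Sign): Matches the end of a string or just before a newline at the end of the string (and, with `re.MULTILINE`, the end of each line). For example, `abc$` matches "abc" at the end of a string.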
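Greedy versus non-greedy quantifiers and capturing versus non-capturing groups are easier to see on a concrete input. The snippet below is a minimal sketch; the sample strings are made up for illustration.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ runs to the last ">" in the string
greedy = re.findall(r"<.+>", html)
print(greedy)  # ['<b>bold</b> and <i>italic</i>']

# Non-greedy: .+? stops at the first ">" it can
lazy = re.findall(r"<.+?>", html)
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']

# A lazy quantifier still starts at the leftmost possible position
leftmost = re.search(r"a*?b", "aab").group()
print(leftmost)  # aab

# With a capturing group, findall() returns the group, not the whole match
captured = re.findall(r"(ab)+", "ababab x ab")
print(captured)      # ['ab', 'ab']

# With a non-capturing group, findall() returns the whole match
non_captured = re.findall(r"(?:ab)+", "ababab x ab")
print(non_captured)  # ['ababab', 'ab']
```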

- **Square brackets (`[]`)** define a character class, which matches any one character from a set. For example, `[aeiou]` matches any lowercase vowel.

In [12]:
pattern = "[aeiou]"
string = "The quick brown fox jumps over the lazy dog."
vowels = re.findall(pattern, string)
print(f"Vowels: {vowels}")

Vowels: ['e', 'u', 'i', 'o', 'o', 'u', 'o', 'e', 'e', 'a', 'o']


In [13]:
# Matching a range of characters
string = "Call me by this number +79131234567"
pattern = "[+0-9]"
matches = re.findall(pattern, string)

if matches:
    print("".join(matches))

+79131234567


In [14]:
# EXERCISE
# Use re library to count number of vowels and number of consonants in this string
string = "The quick brown fox jumps over the lazy dog."

- **Anchors**: Anchors like `^` (start of the string, or of each line with `re.MULTILINE`) and `$` (end of the string, or of each line with `re.MULTILINE`) specify where in the text a match should occur.

In [15]:
def is_valid_variable_name(variable_name):
    # Define the regular expression pattern
    pattern = re.compile("^[a-zA-Z_][a-zA-Z0-9_]*$")

    # Use re.match() to check if the entire string matches the pattern
    match = re.match(pattern, variable_name)

    return match is not None

# Test the function with some variable names
variable_names = [
    "my_variable",
    "123invalid",
    "_underscore",
    "name_with$pecial_chars",
    "This is my variable my_var. It is valid."
]

for name in variable_names:
    if is_valid_variable_name(name):
        print(f"'{name}' is VALID.")
    else:
        print(f"'{name}' is INVALID.")


'my_variable' is VALID.
'123invalid' is INVALID.
'_underscore' is VALID.
'name_with$pecial_chars' is INVALID.
'This is my variable my_var. It is valid.' is INVALID.
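One subtlety of the `^...$` approach: `$` also matches just before a trailing newline, so a name such as `"my_var\n"` passes the check. A possible alternative, sketched below, is `fullmatch()`, which requires the pattern to cover the entire string and needs no anchors. The names `IDENTIFIER` and `is_identifier` are hypothetical.

```python
import re

# No ^ or $ needed: fullmatch() must consume the whole string
IDENTIFIER = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")

def is_identifier(name):
    return IDENTIFIER.fullmatch(name) is not None

# The anchored pattern accepts a trailing newline...
anchored = re.match(r"^[a-zA-Z_][a-zA-Z0-9_]*$", "my_var\n") is not None
print(anchored)                    # True

# ...while fullmatch() rejects it
print(is_identifier("my_var\n"))   # False
print(is_identifier("my_var"))     # True
print(is_identifier("123invalid")) # False
```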


- Some useful predefined sets of characters (the bracket equivalents below hold for ASCII text or when the `re.ASCII` flag is set; for ordinary `str` patterns these classes also cover Unicode digits, whitespace, and letters)
    - `\d`: Matches any **decimal digit**; this is equivalent to the class `[0-9]`.
    - `\D`: Matches any **non-digit** character; this is equivalent to the class `[^0-9]`.
    - `\s`: Matches any **whitespace** character; this is equivalent to the class `[ \t\n\r\f\v]`.
    - `\S`: Matches any **non-whitespace** character; this is equivalent to the class `[^ \t\n\r\f\v]`.
    - `\w`: Matches any **word** character (alphanumeric or underscore); this is equivalent to the class `[a-zA-Z0-9_]`.
    - `\W`: Matches any **non-word** character; this is equivalent to the class `[^a-zA-Z0-9_]`.
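The predefined classes can be tried out on a short string. This is a small sketch; the sample strings are assumptions chosen to include digits, a tab, an underscore, and two Arabic-Indic digits.

```python
import re

sample = "Room 42\tis on floor 3_b!"

digits = re.findall(r"\d+", sample)
print(digits)       # ['42', '3']

words = re.findall(r"\w+", sample)
print(words)        # ['Room', '42', 'is', 'on', 'floor', '3_b']

spaces = re.findall(r"\s", sample)
print(len(spaces))  # 5 (four spaces and one tab)

non_words = re.findall(r"\W", sample)
print(len(non_words))  # 6 (the whitespace plus '!')

# For str patterns, \d also matches non-ASCII decimal digits
mixed = "\u0664\u0662 and 42"  # Arabic-Indic digits four and two
unicode_digits = re.findall(r"\d", mixed)
ascii_digits = re.findall(r"\d", mixed, flags=re.ASCII)
print(len(unicode_digits), ascii_digits)  # 4 ['4', '2']
```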

- **Flags**: Flags, such as `re.IGNORECASE`, can be used to modify the behavior of the regular expression matching. For example, `re.IGNORECASE` makes the matching case-insensitive.

In [16]:
# Sample text containing words in different case variations
text = "The quick brown Fox jumped over the LAZY Dog."

# Regular expression pattern to match "fox" in a case-insensitive manner
pattern = r'fox'

# Use re.findall() without the flag
matches_without_flag = re.findall(pattern, text)
print("Without IGNORECASE flag:")
print(matches_without_flag)

# Use re.findall() with the IGNORECASE flag
matches_with_flag = re.findall(pattern, text, flags=re.IGNORECASE)
print("\nWith IGNORECASE flag:")
print(matches_with_flag)


Without IGNORECASE flag:
[]

With IGNORECASE flag:
['Fox']
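Flags can be combined with `|`, and a flag can also be written inside the pattern itself. The sketch below uses a made-up three-line string to show `re.IGNORECASE` together with `re.MULTILINE`, plus the inline form `(?i)`.

```python
import re

# Three lines, each starting with a different casing of "fox" (sample data)
text = "fox one\nFox two\nFOX three"

# Without MULTILINE, ^ only matches at the very start of the string
start_only = re.findall(r"^fox", text, flags=re.IGNORECASE)
print(start_only)  # ['fox']

# Combining flags: ^ now matches at the start of every line
each_line = re.findall(r"^fox", text, flags=re.IGNORECASE | re.MULTILINE)
print(each_line)   # ['fox', 'Fox', 'FOX']

# Inline flag: (?i) at the start of the pattern is equivalent to re.IGNORECASE
inline = re.findall(r"(?i)fox", text)
print(inline)      # ['fox', 'Fox', 'FOX']
```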


In [17]:
# CODE DEMO
# Find date format DD/MM/YYYY

string = "Yesterday was 27/09/2023. Today's date is 28/09/2023. Tomorrow is 29-09-2023"
pattern = r"\d{2}/\d{2}/\d{4}"
matches = re.findall(pattern, string)

print(matches if matches else "No date found.")

['27/09/2023', '28/09/2023']
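The third date is skipped because it uses `-` as the separator. If both separators should be accepted, one possible sketch is a character class `[/-]` combined with capturing groups to pull out day, month, and year. Note that this simple version would also accept mixed separators such as `27/09-2023`.

```python
import re

text = "Yesterday was 27/09/2023. Today's date is 28/09/2023. Tomorrow is 29-09-2023"

# Three capturing groups: day, month, year; "/" or "-" between them
date_pattern = re.compile(r"(\d{2})[/-](\d{2})[/-](\d{4})")

# finditer() yields match objects, so both the full text and the groups are available
dates = [m.group(0) for m in date_pattern.finditer(text)]
print(dates)  # ['27/09/2023', '28/09/2023', '29-09-2023']

# With groups in the pattern, findall() returns tuples of the captured parts
parts = date_pattern.findall(text)
print(parts[0])  # ('27', '09', '2023')
```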


In [None]:
# EXERCISE
# Find all dates in this text YYYY-MM-DD

string = "In the year 2023, we made plans to visit the beautiful countryside on 2023-05-15. We enjoyed the picturesque landscapes, and on 2023-05-20, we had a delightful picnic by the river. Our trip concluded on 2023-05-25, leaving us with cherished memories of our time together."

pattern = None # YOUR CODE
matches = re.findall(pattern, string)

print(matches if matches else "No date found.")

In [None]:
# CODE DEMO
# Remove hashtags
string = "I had an amazing dining experience at the Sin Lun Chinese restaurant! The ambiance was delightful, the service was impeccable, and the food was absolutely delicious. I couldn't get enough of their mouthwatering dumplings and flavorful Sichuan chicken. It's safe to say that Sin Lun has become my go-to spot for authentic Chinese cuisine. #SinLun #DeliciousEats #ChineseCuisine #TopNotchService #foodieHeaven2023 #🍜"

# "#" followed by one or more word characters or emoji
hashtag_pattern = re.compile(r"#[\w"
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags (iOS)
    "\U00002702-\U000027B0"
    "\U000024C2-\U0001F251""]+")
result = hashtag_pattern.sub('', string)
result.strip()

"I had an amazing dining experience at the Sin Lun Chinese restaurant! The ambiance was delightful, the service was impeccable, and the food was absolutely delicious. I couldn't get enough of their mouthwatering dumplings and flavorful Sichuan chicken. It's safe to say that Sin Lun has become my go-to spot for authentic Chinese cuisine."

In [None]:
# EXERCISE
# Write a regex pattern to match Russian phone numbers
# Format "+7 (XXX) XXX-XXXX"
# Sample Input: "Contact us at +7 (913) 123-7890."
# Expected Output: "+7 (913) 123-7890"

string = "Planning a trip to Russia? Don't forget to book your accommodations! You can reach our customer support team at +7 (123) 456-7890 for assistance with reservations. If you have any questions about our services, feel free to give us a call at +7 (987) 654-3210. Our friendly staff is available around the clock to ensure your stay is comfortable. For special offers and promotions, dial +7 (111) 222-3333. We look forward to welcoming you to our beautiful country! Your adventure begins with us at +7 (555) 123-4567."

pattern = None #YOUR CODE
matches = re.findall(pattern, string)
print(matches if matches else "No telephone numbers found.")

In [None]:
string = "cat the category are a cat."
pattern = r"cat[\W]"
re.findall(pattern, string)

['cat ', 'cat.']

# REFERENCES

1. [Python Docs](https://docs.python.org/3/howto/regex.html#regex-howto)
2. [Kaggle](https://www.kaggle.com/code/albeffe/regex-exercises-solutions)