# Regular Expressions

## 1. Basic Pattern Matching

Regular expressions are used to search for specific patterns in text. You define a pattern using a combination of characters, metacharacters, and special sequences.

In [None]:
import re

In [None]:
string =  "The curious cat, known for its agility and independence, prowled the garden in search of adventure. With a swift leap, the cat pounced on a fallen leaf, a momentary victory in its playful escapade. Nearby, a second cat observed the scene, its eyes fixed on the first cat's antics. Suddenly, a third cat appeared, joining the playful duo, and together they formed a trio of frolicking felines. As the sun began to set, the cats' energy waned, and they settled down for a well-deserved catnap under the shade of a tall oak tree."
pattern = "The"
match = re.match(pattern, string)

print(f"Type of pattern: {type(pattern)}")
print(f"Pattern: {pattern}")
print(f"Type of match: {type(match)}")
print(f"Match: {match}")
print(match.start(), match.end(), match.group())

In [None]:
# re.match() only looks for a match at the beginning of the string, while re.search() scans the entire string for the first match

string = "The curious cat, known for its agility and independence, prowled the garden in search of adventure. With a swift leap, the cat pounced on a fallen leaf, a momentary victory in its playful escapade. Nearby, a second cat observed the scene, its eyes fixed on the first cat's antics. Suddenly, a third cat appeared, joining the playful duo, and together they formed a trio of frolicking felines. As the sun began to set, the cats' energy waned, and they settled down for a well-deserved catnap under the shade of a tall oak tree."
pattern = r"cat"
match = re.match(pattern, string)
print(f"Match: {match}")
search = re.search(pattern, string)
print(f"Search: {search}")

In [None]:
# re.compile() turns a pattern string into a reusable pattern object

string = "The curious cat, known for its agility and independence, prowled the garden in search of adventure. With a swift leap, the cat pounced on a fallen leaf, a momentary victory in its playful escapade. Nearby, a second cat observed the scene, its eyes fixed on the first cat's antics. Suddenly, a third cat appeared, joining the playful duo, and together they formed a trio of frolicking felines. As the sun began to set, the cats' energy waned, and they settled down for a well-deserved catnap under the shade of a tall oak tree."
pattern = re.compile("The")
match = re.match(pattern, string)

print(f"Type of pattern: {type(pattern)}")
print(f"Pattern: {pattern}")
print(f"Type of match: {type(match)}")
print(f"Match: {match}")
print(match.start(), match.end(), match.group())
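
A compiled pattern object also exposes the matching functions as methods, which avoids passing the pattern around. The following is a minimal sketch on a shortened sample sentence; the names `text` and `cat_pattern` are illustrative and not part of the original notebook.

```python
import re

text = "The curious cat watched the other cat."

# Compile once, then call methods on the pattern object itself.
cat_pattern = re.compile(r"cat")

first = cat_pattern.search(text)      # first occurrence anywhere in the text
all_cats = cat_pattern.findall(text)  # every non-overlapping occurrence
at_start = cat_pattern.match(text)    # None: the text does not start with "cat"

print(first.span(), all_cats, at_start)
```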

In [None]:
# re.findall() returns every non-overlapping match as a list of strings

string = "The curious cat, known for its agility and independence, prowled the garden in search of adventure. With a swift leap, the cat pounced on a fallen leaf, a momentary victory in its playful escapade. Nearby, a second cat observed the scene, its eyes fixed on the first cat's antics. Suddenly, a third cat appeared, joining the playful duo, and together they formed a trio of frolicking felines. As the sun began to set, the cats' energy waned, and they settled down for a well-deserved catnap under the shade of a tall oak tree."

pattern = re.compile("cat")
matches = re.findall(pattern, string)

if matches:
    print(f"Found: {matches}")
else:
    print("Pattern not found.")

In [None]:
# Split text on a pattern with re.split()

string = "coconut, durian, papaya; lychee"
pattern = re.compile("[,;]")
texts = re.split(pattern, string)
texts
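
Splitting on `[,;]` alone leaves the space after each delimiter attached to the next item. One possible refinement, sketched here as an assumption rather than as part of the original notebook, is to let the pattern absorb surrounding whitespace with `\s*` (zero or more whitespace characters, described in the next section). The names `raw_parts` and `clean_parts` are illustrative.

```python
import re

string = "coconut, durian, papaya; lychee"

# Splitting on the delimiter alone keeps the space that follows it.
raw_parts = re.split(r"[,;]", string)

# Letting the pattern absorb surrounding whitespace gives clean items.
clean_parts = re.split(r"\s*[,;]\s*", string)

print(raw_parts)
print(clean_parts)
```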

## 2. Metacharacters & Character classes

Metacharacters are special characters with predefined meanings that are used to construct complex search patterns. These metacharacters provide a way to specify more than just literal text matches in regular expressions.

1. `.` (Dot): Matches any character except a newline character. For example, `a.b` matches "axb," "a2b," but not "a\nb."

2. `*` (Asterisk): Matches zero or more occurrences of the preceding character or group. For example, `ab*c` matches "ac," "abc," "abbc," and so on.

3. `+` (Plus): Matches one or more occurrences of the preceding character or group. For example, `ab+c` matches "abc," "abbc," but not "ac."

4. `?` (Question Mark): Matches zero or one occurrence of the preceding character or group. For example, `ab?c` matches "ac" and "abc" but not "abbc."

5. `|` (Vertical Bar): Acts as an OR operator and allows you to match either of two patterns. For example, `cat|dog` matches "cat" or "dog."

6. `[]` (Square Brackets): Defines a character set. For example, `[aeiou]` matches any vowel, and `[0-9]` matches any digit.

7. `[^]` (Caret within Square Brackets): Defines a negated character set, matching any character not listed. For example, `[^0-9]` matches any non-digit character.

8. `()` (Parentheses): Creates a capturing group to capture a subpattern. For example, `(abc)+` matches "abc," "abcabc," and captures "abc" as a group.

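9. `(?...)` (Extension notation): A `?` immediately after the opening parenthesis changes the meaning of the group. For example, `(?:abc)+` is a non-capturing group, and `(?i)` at the start of a pattern turns on case-insensitive matching.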

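10. `*?`, `+?`, `??`: These are non-greedy (lazy) quantifiers. They match as little text as possible while still letting the overall pattern match. For example, in the text "&lt;a&gt;&lt;b&gt;" written with real angle brackets, the greedy `<.*>` matches the whole text, while the non-greedy `<.*?>` matches only the first tag, `<a>`.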

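11. `^` (Caret, outside square brackets): Matches the start of a string, and with the `re.MULTILINE` flag also the start of each line. For example, `^abc` matches "abc" at the beginning of a string.

12. `$` (Dollar Sign): Matches the end of a string (or just before a trailing newline), and with the `re.MULTILINE` flag also the end of each line. For example, `abc$` matches "abc" at the end of a string.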
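
The metacharacters listed above can be tried out directly. The following is a small sketch using only the standard `re` module; the sample strings and variable names are made up for illustration.

```python
import re

candidates = ["ac", "abc", "abbc"]

# Quantifiers: fullmatch() checks whether the whole candidate fits the pattern.
star = [s for s in candidates if re.fullmatch(r"ab*c", s)]
plus = [s for s in candidates if re.fullmatch(r"ab+c", s)]
optional = [s for s in candidates if re.fullmatch(r"ab?c", s)]

# Alternation and a negated character set.
pets = re.findall(r"cat|dog", "a cat, a dog and a bird")
non_digits = re.findall(r"[^0-9]", "a1b2")

# Greedy vs. non-greedy: .* grabs as much as possible, .*? as little.
greedy = re.findall(r"<.*>", "<a><b>")
lazy = re.findall(r"<.*?>", "<a><b>")

# Capturing vs. non-capturing group.
captured = re.search(r"(abc)+", "abcabc").groups()
uncaptured = re.search(r"(?:abc)+", "abcabc").groups()

print(star, plus, optional)
print(pets, non_digits)
print(greedy, lazy)
print(captured, uncaptured)
```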

- **Square brackets (`[]`)** define a character class, which matches any one character from the listed set. For example, `[aeiou]` matches any lowercase vowel.

In [None]:
pattern = "[aeiou]"
string = "The quick brown fox jumps over the lazy dog."
vowels = re.findall(pattern, string)
print(f"Vowels: {vowels}")

In [None]:
# Matching a range of characters
string = "Call me by this number +79131234567"
pattern = r"[+0-9]"  # inside [], + is literal and needs no escape
matches = re.findall(pattern, string)

if matches:
    print("".join(matches))

In [None]:
# EXERCISE
# Use the re library to count the number of vowels and the number of consonants in this string
string = "The quick brown fox jumps over the lazy dog."

- **Anchors**: Anchors like `^` (start of a line) and `$` (end of a line) are used to specify where in the text a match should occur.

In [None]:
def is_valid_variable_name(variable_name):
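    # Define the regular expression pattern: a letter or underscore,
    # followed by any number of letters, digits, or underscores
    pattern = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]*$")

    # re.match() starts at the beginning of the string; the trailing $
    # requires the match to reach the end (a final newline is tolerated)
    match = re.match(pattern, variable_name)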

    return match is not None

# Test the function with some variable names
variable_names = [
    "my_variable",
    "123invalid",
    "_underscore",
    "name_with$pecial_chars",
    "This is my variable my_var. It is valid."
]

for name in variable_names:
    if is_valid_variable_name(name):
        print(f"'{name}' is VALID.")
    else:
        print(f"'{name}' is invalid.")
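
As an alternative to anchoring with `^` and `$`, `re.fullmatch()` requires the pattern to consume the entire string, which also rejects a trailing newline. The sketch below additionally shows how `re.MULTILINE` makes the anchors work per line. The names `IDENTIFIER`, `is_identifier`, and `lines` are hypothetical and only serve the illustration.

```python
import re

IDENTIFIER = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")

def is_identifier(name):
    # fullmatch() succeeds only if the pattern consumes the whole string,
    # so no explicit ^ or $ is needed.
    return IDENTIFIER.fullmatch(name) is not None

print(is_identifier("my_variable"))
print(is_identifier("123invalid"))
# A trailing newline is rejected here, whereas "^...$" with re.match()
# would accept it because $ also matches just before a final newline.
print(is_identifier("my_variable\n"))

# With re.MULTILINE, ^ and $ match at the start and end of every line.
lines = "alpha\n2beta\n_gamma"
valid_lines = re.findall(r"^[a-zA-Z_]\w*$", lines, flags=re.MULTILINE)
print(valid_lines)
```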


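- Some useful predefined sets of characters (the bracketed equivalents hold exactly for ASCII-only matching, for example with the `re.ASCII` flag; on `str` patterns Python 3 also matches the corresponding Unicode characters)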
    - `\d`: Matches any **decimal digit**; this is equivalent to the class `[0-9]`.
    - `\D`: Matches any **non-digit** character; this is equivalent to the class `[^0-9]`.
    - `\s`: Matches any **whitespace** character; this is equivalent to the class `[ \t\n\r\f\v]`.
    - `\S`: Matches any **non-whitespace** character; this is equivalent to the class `[^ \t\n\r\f\v]`.
    - `\w`: Matches any **word** character (alphanumeric or underscore); this is equivalent to the class `[a-zA-Z0-9_]`.
    - `\W`: Matches any **non-word** character; this is equivalent to the class `[^a-zA-Z0-9_]`.
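
The predefined sets above can be compared side by side on one string. This is a minimal sketch; the sample `text` is invented for the illustration.

```python
import re

text = "Order 66 shipped\ton 2023-05-15"

digits = re.findall(r"\d+", text)   # runs of digits
words = re.findall(r"\w+", text)    # runs of word characters
spaces = re.findall(r"\s", text)    # single whitespace characters
non_word = re.findall(r"\W", text)  # everything that is not a word character

print(digits)
print(words)
print(spaces)
print(non_word)
```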

- **Flags**: Flags, such as `re.IGNORECASE`, can be used to modify the behavior of the regular expression matching. For example, `re.IGNORECASE` makes the matching case-insensitive.

In [None]:
# Sample text containing words in different case variations
text = "The quick brown Fox jumped over the LAZY Dog."

# Pattern for the lowercase word "fox"; case-insensitivity comes from the flag used below
pattern = r'fox'

# Use re.findall() without the flag
matches_without_flag = re.findall(pattern, text)
print("Without IGNORECASE flag:")
print(matches_without_flag)

# Use re.findall() with the IGNORECASE flag
matches_with_flag = re.findall(pattern, text, flags=re.IGNORECASE)
print("\nWith IGNORECASE flag:")
print(matches_with_flag)


In [None]:
# CODE DEMO
# Find date format DD/MM/YYYY

string = "Yesterday was 27/09/2023. Today's date is 28/09/2023."
pattern = r"\d{2}/\d{2}/\d{4}"  # two digits, slash, two digits, slash, four digits
matches = re.findall(pattern, string)

print(matches if matches else "No date found.")

In [None]:
# EXERCISE
# Find all dates in this text YYYY-MM-DD

string = "In the year 2023, we made plans to visit the beautiful countryside on 2023-05-15. We enjoyed the picturesque landscapes, and on 2023-05-20, we had a delightful picnic by the river. Our trip concluded on 2023-05-25, leaving us with cherished memories of our time together."

pattern = None # YOUR CODE
matches = re.findall(pattern, string)

print(matches if matches else "No date found.")

In [None]:
# CODE DEMO
# Find all hashtags
string = "I had an amazing dining experience at the Sin Lun Chinese restaurant! The ambiance was delightful, the service was impeccable, and the food was absolutely delicious. I couldn't get enough of their mouthwatering dumplings and flavorful Sichuan chicken. It's safe to say that Sin Lun has become my go-to spot for authentic Chinese cuisine. #SinLun #DeliciousEats #ChineseCuisine #TopNotchService #FoodieHeaven # 🍜🥢"

pattern = r"#\w+"  # a '#' followed by one or more word characters

matches = re.findall(pattern, string)
print(matches if matches else "No hashtags found.")

In [None]:
# EXERCISE
# Write a regex pattern to match Russian phone numbers
# Format "+7 (XXX) XXX-XXXX"
# Sample Input: "Contact us at +7 (913) 123-7890."
# Expected Output: "+7 (913) 123-7890"

string = "Planning a trip to Russia? Don't forget to book your accommodations! You can reach our customer support team at +7 (123) 456-7890 for assistance with reservations. If you have any questions about our services, feel free to give us a call at +7 (987) 654-3210. Our friendly staff is available around the clock to ensure your stay is comfortable. For special offers and promotions, dial +7 (111) 222-3333. We look forward to welcoming you to our beautiful country! Your adventure begins with us at +7 (555) 123-4567."

pattern = None #YOUR CODE
matches = re.findall(pattern, string)
print(matches if matches else "No telephone numbers found.")

# REFERENCES

1. [Python Docs](https://docs.python.org/3/howto/regex.html#regex-howto)