# Regex - Regular Expressions

Regular expressions (regex) are powerful pattern-matching tools used for text processing, validation, and extraction. They are fundamental to NLP preprocessing tasks.

## Learning Objectives

- Understand regex syntax and patterns
- Use regex for text matching, searching, and extraction
- Apply regex for common NLP tasks (cleaning, tokenization, etc.)
- Recognize when regex is appropriate vs. when more advanced tools are needed

## Topics Covered

1. **Basic Patterns**: Literals, character classes, quantifiers
2. **Character Classes**: Digits, word characters, whitespace, custom classes
3. **Quantifiers**: `*`, `+`, `?`, `{n}`, `{n,m}`
4. **Anchors**: `^` (start), `$` (end), word boundaries
5. **Groups and Capturing**: Parentheses, non-capturing groups
6. **Alternation**: `|` (OR operator)
7. **Lookahead/Lookbehind**: Advanced assertions
8. **Practical Applications**: Email validation, phone numbers, text cleaning

## Resources

- [Python `re` module documentation](https://docs.python.org/3/library/re.html)
- [Regex101](https://regex101.com/) - Online regex tester and debugger
- [Regular Expressions Cookbook](https://www.oreilly.com/library/view/regular-expressions-cookbook/9781449327453/)


In [None]:
import re


# 1. Regex Basics

This section covers fundamental regex concepts including:
- Literal matching
- Character classes
- Basic metacharacters


## 1.1 Literal Matching


In [None]:
# Demonstrate literal character matching
text = "The cat sat on the mat"
pattern = "cat"

# re.search() - finds first match
match = re.search(pattern, text)
if match:
    print(f"Pattern: '{pattern}'")
    print(f"Text: '{text}'")
    print(f"Found: '{match.group()}' at position {match.start()}-{match.end()}")


## 1.2 Dot Metacharacter (.)

The dot (`.`) matches any character except newline.


In [None]:
text = "cat bat rat mat"
patterns = [
    ("c.t", "Matches 'c' + any char + 't'"),
    ("b.t", "Matches 'b' + any char + 't'"),
    ("r.t", "Matches 'r' + any char + 't'"),
]

for pattern, description in patterns:
    matches = re.findall(pattern, text)
    print(f"Pattern: '{pattern}' - {description}")
    print(f"Matches: {matches}")
    print()
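
Because the dot skips newlines by default, a pattern such as `a.b` does not match across a line break. The sketch below uses an illustrative two-line string (an assumption, not one of the examples above) to show how the `re.DOTALL` flag changes that behavior.

```python
import re

# Illustrative string (assumed for this sketch): a newline sits between 'a' and 'b'
sample = "a\nb"

# Default behavior: '.' does not match '\n', so there is no match
default_match = re.search(r'a.b', sample)
print(f"Default:   {default_match}")

# With re.DOTALL, '.' also matches '\n'
dotall_match = re.search(r'a.b', sample, re.DOTALL)
print(f"re.DOTALL: {dotall_match.group()!r}")
```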


## 1.3 Character Classes


In [None]:
text = "The year is 2024 and the price is $99.99"

# \d - matches any digit
digits = re.findall(r'\d', text)
print(f"Digits (\\d): {digits}")

# \w - matches word characters (letters, digits, underscore)
words = re.findall(r'\w+', text)
print(f"Words (\\w+): {words}")

# \s - matches whitespace
spaces = re.findall(r'\s', text)
print(f"Whitespace count: {len(spaces)}")

# Custom character class [0-9]
numbers = re.findall(r'[0-9]+', text)
print(f"Numbers ([0-9]+): {numbers}")

# Custom character class [aeiou] - vowels
vowels = re.findall(r'[aeiou]', text, re.IGNORECASE)
print(f"Vowels ([aeiou]): {vowels}")
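
Each shorthand class has an uppercase counterpart that matches the opposite set, and a custom class can be negated with a leading `^`. A minimal sketch on the same text:

```python
import re

text = "The year is 2024 and the price is $99.99"

# \D - any character that is NOT a digit
non_digit_runs = re.findall(r'\D+', text)
print(f"Non-digit runs (\\D+): {non_digit_runs}")

# \S - any character that is NOT whitespace
tokens = re.findall(r'\S+', text)
print(f"Non-whitespace runs (\\S+): {tokens}")

# [^...] - negated custom class: anything except ASCII letters, digits, and whitespace
symbols = re.findall(r'[^a-zA-Z0-9\s]', text)
print(f"Symbols ([^a-zA-Z0-9\\s]): {symbols}")
```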


## 1.4 Escaping Special Characters

To match literal special characters (like `.`, `$`, `*`, etc.), we need to escape them with a backslash.


In [None]:
text = "The price is $99.99 and the date is 2024-01-15"

# '$' and '.' are metacharacters, so both are escaped to match them literally
price_pattern = r'\$\d+\.\d+'
price = re.search(price_pattern, text)
if price:
    print(f"Price pattern (\\$\\d+\\.\\d+): {price.group()}")

# Outside a character class, '-' is literal and needs no escaping (inside [...], escape it or place it first or last)
date_pattern = r'\d{4}-\d{2}-\d{2}'
date = re.search(date_pattern, text)
if date:
    print(f"Date pattern (\\d{{4}}-\\d{{2}}-\\d{{2}}): {date.group()}")
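
When the literal text comes from a variable, escaping by hand is error-prone. The sketch below uses `re.escape()` from the standard library to build the pattern. The variable name `literal` is only illustrative.

```python
import re

text = "The price is $99.99 and the date is 2024-01-15"
literal = "$99.99"  # contains two metacharacters: '$' and '.'

# Used as-is, '$' acts as an end-of-string anchor, so nothing is found
unescaped_match = re.search(literal, text)
print(f"Unescaped: {unescaped_match}")

# re.escape() backslash-escapes the characters that are special in a pattern
escaped = re.escape(literal)
escaped_match = re.search(escaped, text)
print(f"Escaped pattern: {escaped}")
print(f"Match: {escaped_match.group()}")
```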


## 1.5 findall() vs search() vs finditer()

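- `re.search()` - returns a match object for the first match, or `None` if nothing matches
- `re.findall()` - returns all non-overlapping matches as a list (of strings when the pattern has no groups)
- `re.finditer()` - returns an iterator of match objects, one per non-overlapping match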


In [None]:
text = "Contact me at email1@example.com or email2@test.com"

# re.search() - returns first match object
match = re.search(r'\w+@\w+\.\w+', text)
print(f"re.search() - First match: {match.group() if match else 'None'}")

# re.findall() - returns all matches as list
matches = re.findall(r'\w+@\w+\.\w+', text)
print(f"re.findall() - All matches: {matches}")

# re.finditer() - returns iterator of match objects
print("re.finditer() - All matches with positions:")
for match in re.finditer(r'\w+@\w+\.\w+', text):
    print(f"  {match.group()} at position {match.start()}-{match.end()}")


# 2. Quantifiers

Quantifiers specify how many times a pattern should match:
- `*` (zero or more)
- `+` (one or more)
- `?` (zero or one)
- `{n}` (exactly n)
- `{n,m}` (between n and m)


## 2.1 Star Quantifier (*) - Zero or More


In [None]:
text = "The colors are: red, blue, green, and yellow"

# Match zero or more digits (this text contains no digits, so every match is an empty string)
pattern = r'\d*'
matches = re.findall(pattern, text)
print(f"Pattern: '\\d*' (zero or more digits)")
print(f"Text: '{text}'")
print(f"Matches: {matches[:10]}...")  # Show first 10

# Match zero or more word characters followed by zero or more whitespace characters
pattern = r'\w*\s*'
words = re.findall(pattern, text)
print(f"\nPattern: '\\w*\\s*' (word chars + optional spaces)")
print(f"Matches: {words[:10]}")
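
Because `*` accepts zero repetitions, it can succeed without consuming anything, which is why `findall()` returns empty strings. A small sketch contrasting `*` with `+` on an illustrative string (an assumption, not from the example above):

```python
import re

sample = "a12b"  # illustrative string with one run of digits

# '*' also succeeds at positions where no digit is present, yielding empty strings
star_matches = re.findall(r'\d*', sample)
print(f"\\d* -> {star_matches}")

# '+' requires at least one digit, so only the real run is returned
plus_matches = re.findall(r'\d+', sample)
print(f"\\d+ -> {plus_matches}")

# Dropping the empty strings from the '*' result leaves the same run
non_empty = [m for m in star_matches if m]
print(f"Non-empty \\d* matches: {non_empty}")
```

In practice, `+` is usually the better choice when an empty match is not meaningful.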


## 2.2 Plus Quantifier (+) - One or More


In [None]:
text = "Prices: $5, $10, $100, $1000"

# Match one or more digits
pattern = r'\d+'
numbers = re.findall(pattern, text)
print(f"Pattern: '\\d+' (one or more digits)")
print(f"Text: '{text}'")
print(f"Matches: {numbers}")

# Match one or more word characters
pattern = r'\w+'
words = re.findall(pattern, text)
print(f"\nPattern: '\\w+' (one or more word chars)")
print(f"Matches: {words}")


## 2.3 Question Quantifier (?) - Zero or One (Optional)


In [None]:
texts = [
    "Color: red",
    "Colour: blue",  # British spelling
]

# Match "color" or "colour"
pattern = r'colou?r'
print(f"Pattern: 'colou?r' (matches 'color' or 'colour')")
for text in texts:
    match = re.search(pattern, text)
    if match:
        print(f"  '{text}' -> '{match.group()}'")

# Match optional 's' for plural
# (?:...) groups without capturing, so findall() returns the full match
pattern = r'\d+ (?:apple|banana)s?'
text = "I have 3 apples and 1 banana"
matches = re.findall(pattern, text)
print(f"\nPattern: '\\d+ (?:apple|banana)s?' (optional plural)")
print(f"Text: '{text}'")
print(f"Matches: {matches}")


## 2.4 Exact Quantifier ({n}) - Exactly N


In [None]:
text = "Phone numbers: 123-456-7890, 987-654-3210, 555-1234"

# Match 3 consecutive digits (this also matches inside longer runs, e.g. '789' in '7890')
pattern = r'\d{3}'
matches = re.findall(pattern, text)
print(f"Pattern: '\\d{{3}}' (3 consecutive digits)")
print(f"Text: '{text}'")
print(f"Matches: {matches}")

# Match phone number format: XXX-XXX-XXXX
pattern = r'\d{3}-\d{3}-\d{4}'
phones = re.findall(pattern, text)
print(f"\nPattern: '\\d{{3}}-\\d{{3}}-\\d{{4}}' (phone format)")
print(f"Matches: {phones}")
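
On its own, `\d{3}` also matches the first three digits of a longer run. One way to keep only runs that are exactly three digits long is to wrap the pattern in word boundaries (`\b`, covered in Section 3.4). The sketch below assumes the digits are delimited by non-word characters such as `-`, `,`, or spaces.

```python
import re

text = "Phone numbers: 123-456-7890, 987-654-3210, 555-1234"

# Unbounded: also picks up '789', '321' and '123' from the 4-digit runs
loose = re.findall(r'\d{3}', text)
print(f"\\d{{3}}     -> {loose}")

# Bounded: a run must be exactly 3 digits, with no digit or letter on either side
strict = re.findall(r'\b\d{3}\b', text)
print(f"\\b\\d{{3}}\\b -> {strict}")
```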


## 2.5 Range Quantifier ({n,m}) - Between N and M


In [None]:
text = "IDs: A1, AB12, ABC123, ABCD1234, ABCDE12345"

# Match 2 to 4 uppercase letters (also matches 'ID' in 'IDs', and only 'ABCD' of 'ABCDE')
pattern = r'[A-Z]{2,4}'
matches = re.findall(pattern, text)
print(f"Pattern: '[A-Z]{{2,4}}' (2 to 4 uppercase letters)")
print(f"Text: '{text}'")
print(f"Matches: {matches}")

# Match at least 3 digits
pattern = r'\d{3,}'
matches = re.findall(pattern, text)
print(f"\nPattern: '\\d{{3,}}' (3 or more digits)")
print(f"Matches: {matches}")


## 2.6 Greedy vs Lazy Matching

By default, quantifiers are **greedy** (match as much as possible). Add `?` after a quantifier to make it **lazy** (match as little as possible).


In [None]:
text = "<title>Python</title> <title>Regex</title>"

# Greedy matching (default) - matches as much as possible
greedy_pattern = r'<.*>'
greedy_match = re.search(greedy_pattern, text)
print(f"Greedy pattern: '<.*>'")
print(f"Match: {greedy_match.group() if greedy_match else 'None'}")

# Lazy (non-greedy) matching - matches as little as possible
lazy_pattern = r'<.*?>'
lazy_matches = re.findall(lazy_pattern, text)
print(f"\nLazy pattern: '<.*?>' (non-greedy)")
print(f"Matches: {lazy_matches}")
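
A common alternative to a lazy quantifier is a negated character class, which states directly what the match may not contain. The sketch below reuses the same text. The second pattern, which captures the text between the tags, is an illustrative addition.

```python
import re

text = "<title>Python</title> <title>Regex</title>"

# [^>]* matches anything except '>', so each match stops at the first closing bracket
tags = re.findall(r'<[^>]*>', text)
print(f"Negated class '<[^>]*>': {tags}")

# Lazy quantifier inside a capturing group: extract the text between each tag pair
contents = re.findall(r'<title>(.*?)</title>', text)
print(f"Tag contents: {contents}")
```

This is fine for short, well-formed snippets. For real HTML, a dedicated parser is the safer tool.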


## 2.7 Practical Examples with Quantifiers


In [None]:
# Email-like pattern (simplified)
text = "Contact: john.doe@example.com or jane_smith@test.co.uk"
pattern = r'\w+[._]?\w*@\w+\.\w+\.?\w*'
emails = re.findall(pattern, text)
print(f"Email pattern: '\\w+[._]?\\w*@\\w+\\.\\w+\\.?\\w*'")
print(f"Text: '{text}'")
print(f"Matches: {emails}")

# Extract words (handling punctuation)
text = "Hello, world! How are you?"
pattern = r'\b\w+\b'
words = re.findall(pattern, text)
print(f"\nWord extraction: '\\b\\w+\\b'")
print(f"Text: '{text}'")
print(f"Words: {words}")


# 3. Anchors and Boundaries

Anchors and boundaries help match patterns at specific positions:
- `^` (start of string)
- `$` (end of string)
- `\b` (word boundary)
- `\A`, `\Z` (start and end of the whole string, unaffected by `re.MULTILINE`)


## 3.1 Start Anchor (^)


In [None]:
texts = [
    "Python is great",
    "I love Python",
    "Python programming"
]

pattern = r'^Python'
print(f"Pattern: '^Python' (must start with 'Python')")
for text in texts:
    match = re.search(pattern, text)
    result = "✓ MATCH" if match else "✗ NO MATCH"
    print(f"  '{text}' -> {result}")


## 3.2 End Anchor ($)


In [None]:
texts = [
    "The file is .txt",
    "I need a .txt file",
    "Download file.txt"
]

pattern = r'\.txt$'
print(f"Pattern: '\\.txt$' (must end with '.txt')")
for text in texts:
    match = re.search(pattern, text)
    result = "✓ MATCH" if match else "✗ NO MATCH"
    print(f"  '{text}' -> {result}")


## 3.3 Start and End Anchors (^...$)

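Combining `^` and `$` requires the pattern to span the entire string. In Python, `$` also matches just before a trailing newline, so use `\Z` or `re.fullmatch()` when that matters.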


In [None]:
texts = [
    "12345",
    "Phone: 12345",
    "12345 is the code",
    "Code is 12345"
]

pattern = r'^\d{5}$'
print(f"Pattern: '^\\d{{5}}$' (exactly 5 digits, nothing else)")
for text in texts:
    match = re.search(pattern, text)
    result = "✓ MATCH" if match else "✗ NO MATCH"
    print(f"  '{text}' -> {result}")


## 3.4 Word Boundary (\b)

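A word boundary (`\b`) matches the position between a word character (`\w`) and a non-word character, or at the start or end of the string when it begins or ends with a word character. It consumes no characters.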


In [None]:
text = "The cat is in the category of animals"

# Without word boundary - matches "cat" in "category" too
pattern_no_boundary = r'cat'
matches_no_boundary = re.findall(pattern_no_boundary, text)
print(f"Pattern: 'cat' (without boundary)")
print(f"Matches: {matches_no_boundary}")

# With word boundary - matches only whole word "cat"
pattern_with_boundary = r'\bcat\b'
matches_with_boundary = re.findall(pattern_with_boundary, text)
print(f"\nPattern: '\\bcat\\b' (with word boundary)")
print(f"Matches: {matches_with_boundary}")


## 3.5 More Word Boundary Examples


In [None]:
text = "Python3 is better than Python 2. Pythonista loves Python!"

# Find all occurrences of "Python" as a word
pattern = r'\bPython\b'
matches = re.finditer(pattern, text)
print(f"Pattern: '\\bPython\\b'")
print(f"Text: '{text}'")
print("Matches:")
for match in matches:
    print(f"  '{match.group()}' at position {match.start()}-{match.end()}")


## 3.6 Multiline Mode

In multiline mode (`re.MULTILINE`), `^` and `$` match the start and end of each line, not just the string.


In [None]:
text = """Line 1: Python
Line 2: Java
Line 3: JavaScript"""

# Without multiline flag - ^ matches only start of string
pattern = r'^Line'
matches_normal = re.findall(pattern, text)
print(f"Pattern: '^Line' (normal mode)")
print(f"Matches: {matches_normal}")

# With multiline flag - ^ matches start of each line
pattern = r'^Line'
matches_multiline = re.findall(pattern, text, re.MULTILINE)
print(f"\nPattern: '^Line' (multiline mode)")
print(f"Matches: {matches_multiline}")
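
The absolute anchors `\A` and `\Z` listed at the start of this section are not affected by `re.MULTILINE`. A short sketch on the same three-line text:

```python
import re

text = """Line 1: Python
Line 2: Java
Line 3: JavaScript"""

# '^' matches at the start of every line in multiline mode
caret_matches = re.findall(r'^Line', text, re.MULTILINE)
# '\A' matches only at the very start of the string
start_matches = re.findall(r'\ALine', text, re.MULTILINE)
print(f"^Line  -> {caret_matches}")
print(f"\\ALine -> {start_matches}")

# '$' matches at the end of every line; '\Z' only at the very end of the string
dollar_matches = re.findall(r'\w+$', text, re.MULTILINE)
end_matches = re.findall(r'\w+\Z', text, re.MULTILINE)
print(f"\\w+$  -> {dollar_matches}")
print(f"\\w+\\Z -> {end_matches}")
```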


## 3.7 Validation Examples

Anchors are commonly used for input validation.


In [None]:
# Validate 4-digit PIN
pin_pattern = r'^\d{4}$'
pins = ["1234", "12345", "abcd", "12 34"]
print(f"PIN validation: '^\\d{{4}}$'")
for pin in pins:
    match = re.match(pin_pattern, pin)
    result = "✓ VALID" if match else "✗ INVALID"
    print(f"  '{pin}' -> {result}")

# Validate username (3-20 word characters: letters, digits, underscore; \w is Unicode-aware in Python 3)
username_pattern = r'^\w{3,20}$'
usernames = ["user123", "ab", "user_name", "user@name", "valid_user_123"]
print(f"\nUsername validation: '^\\w{{3,20}}$'")
for username in usernames:
    match = re.match(username_pattern, username)
    result = "✓ VALID" if match else "✗ INVALID"
    print(f"  '{username}' -> {result}")
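
One caveat for validation: in Python, `$` also matches just before a trailing newline, so `^\d{4}$` accepts `"1234\n"`. The sketch below shows `re.fullmatch()` as a stricter alternative. The input string is illustrative.

```python
import re

pin_with_newline = "1234\n"  # e.g. a line read from a file without stripping

# '$' matches before the trailing newline, so this input is accepted
loose = re.match(r'^\d{4}$', pin_with_newline)
print(f"re.match with ^...$: {'VALID' if loose else 'INVALID'}")

# re.fullmatch() requires the pattern to cover the entire string; no anchors needed
strict = re.fullmatch(r'\d{4}', pin_with_newline)
print(f"re.fullmatch:        {'VALID' if strict else 'INVALID'}")

# A clean PIN still passes the strict check
clean = re.fullmatch(r'\d{4}', "1234")
print(f"re.fullmatch('1234'): {'VALID' if clean else 'INVALID'}")
```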


# 4. Groups and Capturing

Groups allow you to:
- Capture parts of a match
- Apply quantifiers to multiple characters
- Use backreferences
- Make certain parts optional


## 4.1 Capturing Groups ()


In [None]:
text = "Date: 2024-01-15"

# Capture year, month, day separately
pattern = r'(\d{4})-(\d{2})-(\d{2})'
match = re.search(pattern, text)

if match:
    print(f"Pattern: '(\\d{{4}})-(\\d{{2}})-(\\d{{2}})'")
    print(f"Text: '{text}'")
    print(f"Full match: {match.group(0)}")
    print(f"Group 1 (year): {match.group(1)}")
    print(f"Group 2 (month): {match.group(2)}")
    print(f"Group 3 (day): {match.group(3)}")
    print(f"All groups: {match.groups()}")


## 4.2 findall() with Groups

When the pattern contains groups, `findall()` returns only the captured text: a list of strings for a single group, or a list of tuples when there are two or more groups.


In [None]:
text = "Emails: john@example.com, jane@test.com"

# Capture username and domain separately
pattern = r'(\w+)@(\w+\.\w+)'
matches = re.findall(pattern, text)
print(f"Pattern: '(\\w+)@(\\w+\\.\\w+)'")
print(f"Text: '{text}'")
print("Matches (as tuples):")
for username, domain in matches:
    print(f"  Username: {username}, Domain: {domain}")
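
The return shape of `findall()` depends on how many groups the pattern has. The sketch below compares the three cases on the same text and uses `finditer()` when both the full match and its parts are needed.

```python
import re

text = "Emails: john@example.com, jane@test.com"

# No groups: list of full matches
no_groups = re.findall(r'\w+@\w+\.\w+', text)
print(f"No groups:  {no_groups}")

# One group: list of strings (only the captured part)
one_group = re.findall(r'(\w+)@\w+\.\w+', text)
print(f"One group:  {one_group}")

# Two groups: list of tuples
two_groups = re.findall(r'(\w+)@(\w+\.\w+)', text)
print(f"Two groups: {two_groups}")

# finditer() gives match objects, so group(0) is still available alongside the groups
full_and_user = [(m.group(0), m.group(1)) for m in re.finditer(r'(\w+)@(\w+\.\w+)', text)]
print(f"Full match + username: {full_and_user}")
```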


## 4.3 Non-Capturing Groups (?:)

Non-capturing groups allow you to group patterns without capturing them. Use `(?:...)` instead of `(...)`.


In [None]:
text = "Colors: red, blue, green, yellow"

# With capturing group - both color and comma captured
pattern_capturing = r'(\w+)(,|$)'
matches_capturing = re.findall(pattern_capturing, text)
print(f"Pattern with capturing: '(\\w+)(,|$)'")
print(f"Matches: {matches_capturing}")

# With non-capturing group - only color captured
pattern_non_capturing = r'(\w+)(?:,|$)'
matches_non_capturing = re.findall(pattern_non_capturing, text)
print(f"\nPattern with non-capturing: '(\\w+)(?:,|$)'")
print(f"Matches: {matches_non_capturing}")


## 4.4 Named Groups (?P<name>...)

Named groups make your regex more readable and allow accessing groups by name instead of number.


In [None]:
text = "Contact: john.doe@example.com"

# Named groups for better readability
pattern = r'(?P<username>\w+\.?\w*)@(?P<domain>\w+\.\w+)'
match = re.search(pattern, text)

if match:
    print(f"Pattern: '(?P<username>\\w+\\.?\\w*)@(?P<domain>\\w+\\.\\w+)'")
    print(f"Text: '{text}'")
    print(f"Username: {match.group('username')}")
    print(f"Domain: {match.group('domain')}")
    print(f"Group dict: {match.groupdict()}")


## 4.5 Backreferences (\1, \2, etc.)

Backreferences allow you to match the same text that was matched by a capturing group.


In [None]:
texts = [
    "The cat sat on the mat",
    "The dog sat on the dog",  # 'dog' repeats, but not consecutively: no match
    "Hello hello world"        # consecutive repeat (case differs)
]

# Match consecutive repeated words
pattern = r'\b(\w+)\s+\1\b'
print(f"Pattern: '\\b(\\w+)\\s+\\1\\b' (consecutive repeated word)")

for text in texts:
    match = re.search(pattern, text, re.IGNORECASE)
    if match:
        print(f"  '{text}' -> Found repeated: '{match.group(1)}'")
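
The same backreference can drive a substitution. As a rough sketch, the pattern below extends the repeated-word pattern with `(...)+` so that runs of three or more repeats collapse as well. The sample sentence is illustrative.

```python
import re

sample = "this is is a test test test"  # illustrative input with stuttered words

# (\s+\1\b)+ matches one or more further copies of the word captured in group 1;
# replacing the whole run with \1 keeps a single copy
dedup_pattern = r'\b(\w+)(\s+\1\b)+'
deduplicated = re.sub(dedup_pattern, r'\1', sample, flags=re.IGNORECASE)

print(f"Original:     '{sample}'")
print(f"Deduplicated: '{deduplicated}'")
```

Note that this also collapses legitimate repetitions (such as "had had"), so it suits typo cleanup better than general text normalization.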


## 4.6 Substitution with Groups

You can use groups in `re.sub()` to rearrange or reformat matched text.


In [None]:
# Reformat date from YYYY-MM-DD to DD/MM/YYYY
text = "Date: 2024-01-15"
pattern = r'(\d{4})-(\d{2})-(\d{2})'
replacement = r'\3/\2/\1'  # Day/Month/Year
result = re.sub(pattern, replacement, text)
print(f"Pattern: '(\\d{{4}})-(\\d{{2}})-(\\d{{2}})'")
print(f"Replacement: '\\3/\\2/\\1'")
print(f"Original: '{text}'")
print(f"Result: '{result}'")

# Swap first and last name
text = "Name: John Doe"
pattern = r'(\w+)\s+(\w+)'
replacement = r'\2, \1'
result = re.sub(pattern, replacement, text)
print(f"\nPattern: '(\\w+)\\s+(\\w+)'")
print(f"Replacement: '\\2, \\1'")
print(f"Original: '{text}'")
print(f"Result: '{result}'")
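
Named groups (Section 4.4) can also be referenced in the replacement string with `\g<name>`, which tends to be easier to read than numbered references. A minimal sketch of the date example rewritten that way:

```python
import re

text = "Date: 2024-01-15"

# Same date pattern as above, but with named groups
named_pattern = r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'

# \g<name> refers to a named group inside the replacement string
named_result = re.sub(named_pattern, r'\g<day>/\g<month>/\g<year>', text)

print(f"Original: '{text}'")
print(f"Result:   '{named_result}'")
```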


## 4.7 Nested Groups

Groups can be nested inside other groups. Groups are numbered by the order of their opening parentheses, from left to right.


In [None]:
text = "Phone: (123) 456-7890"

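# The outer group captures the whole local number; the two groups nested inside it capture its parts
# (the escaped \( and \) match the literal parentheses around the area code)
pattern = r'\((\d{3})\)\s+((\d{3})-(\d{4}))'
match = re.search(pattern, text)

if match:
    print(f"Pattern: '\\((\\d{{3}})\\)\\s+((\\d{{3}})-(\\d{{4}}))'")
    print(f"Text: '{text}'")
    print(f"Area code: {match.group(1)}")
    print(f"Local number (outer group): {match.group(2)}")
    print(f"Exchange (inner group): {match.group(3)}")
    print(f"Number (inner group): {match.group(4)}")
    print(f"All groups: {match.groups()}")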


## 4.8 Alternation with Groups

You can use alternation (`|`) with groups to match different patterns.


In [None]:
text = "I have 5 cats and 3 dogs"

# Match number followed by animal
pattern = r'(\d+)\s+(cat|dog|bird)s?'
matches = re.findall(pattern, text)
print(f"Pattern: '(\\d+)\\s+(cat|dog|bird)s?'")
print(f"Text: '{text}'")
print("Matches:")
for count, animal in matches:
    print(f"  {count} {animal}s")


# 5. Practical Examples

Real-world applications of regex patterns for common tasks.


## 5.1 Email Validation

Note: This is a simplified pattern for learning. Real email validation is more complex.


In [None]:
# Simplified email pattern (real email validation is more complex)
email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'

emails = [
    "user@example.com",
    "john.doe@test.co.uk",
    "invalid.email",
    "@example.com",
    "user@.com",
    "user.name+tag@example-domain.com"
]

print(f"Pattern: '{email_pattern}'")
for email in emails:
    match = re.match(email_pattern, email)
    result = "✓ VALID" if match else "✗ INVALID"
    print(f"  {email:30} -> {result}")


## 5.2 Phone Number Extraction


In [None]:
text = """
    Contact us at:
    Phone 1: (555) 123-4567
    Phone 2: 555-123-4567
    Phone 3: 555.123.4567
    Phone 4: +1 555 123 4567
    Phone 5: 5551234567
    """

# Pattern to match various phone formats
phone_pattern = r'(\+?1?\s*)?\(?(\d{3})\)?[-.\s]?(\d{3})[-.\s]?(\d{4})'

print(f"Pattern: '{phone_pattern}'")
print("Extracted phone numbers:")
matches = re.finditer(phone_pattern, text)
for i, match in enumerate(matches, 1):
    # Normalize format
    area, exchange, number = match.group(2), match.group(3), match.group(4)
    formatted = f"({area}) {exchange}-{number}"
    print(f"  Phone {i}: {formatted}")


## 5.3 URL Extraction


In [None]:
text = """
    Visit https://www.example.com or http://test.org for more info.
    Also check out https://subdomain.example.com/path?query=value
    """

# Simplified URL pattern
url_pattern = r'https?://[^\s]+'

print(f"Pattern: '{url_pattern}'")
urls = re.findall(url_pattern, text)
print("Extracted URLs:")
for url in urls:
    print(f"  {url}")
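
A pattern that stops only at whitespace will also swallow punctuation that directly follows a URL in running text. The sketch below shows one possible post-processing step. The sample sentence and the set of characters stripped are assumptions, and a URL that legitimately ends in one of those characters would be truncated.

```python
import re

# Illustrative sentence where punctuation directly follows each URL
sample = "See https://www.example.com, or http://test.org."

# \S+ is equivalent to [^\s]+ : it runs until the next whitespace character
raw_urls = re.findall(r'https?://\S+', sample)
print(f"Raw:     {raw_urls}")

# Strip common trailing punctuation (assumed set; adjust for your data)
cleaned_urls = [url.rstrip('.,;:!?)') for url in raw_urls]
print(f"Cleaned: {cleaned_urls}")
```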


## 5.4 Text Cleaning


In [None]:
text = "Hello!!!   This is   a test...   with   extra   spaces!!!"

print(f"Original: '{text}'")

# Collapse repeated exclamation marks into one
cleaned = re.sub(r'!+', '!', text)
print(f"Collapse repeated !: '{cleaned}'")

# Collapse runs of whitespace (spaces, tabs, newlines) into a single space
cleaned = re.sub(r'\s+', ' ', cleaned)
print(f"Collapse whitespace: '{cleaned}'")

# Collapse runs of two or more dots into one
cleaned = re.sub(r'\.{2,}', '.', cleaned)
print(f"Collapse repeated dots: '{cleaned}'")

# Trim whitespace
cleaned = cleaned.strip()
print(f"Final cleaned: '{cleaned}'")


## 5.5 Hashtags and Mentions


In [None]:
text = "Check out @username's post about #Python #MachineLearning #AI"

# Extract hashtags
hashtags = re.findall(r'#(\w+)', text)
print(f"Hashtags: {hashtags}")

# Extract mentions
mentions = re.findall(r'@(\w+)', text)
print(f"Mentions: {mentions}")


## 5.6 Log File Parsing


In [None]:
log_entries = [
    "2024-01-15 10:30:45 INFO: User logged in",
    "2024-01-15 10:31:12 ERROR: Connection failed",
    "2024-01-15 10:32:00 WARNING: High memory usage",
]

log_pattern = r'(\d{4}-\d{2}-\d{2})\s+(\d{2}:\d{2}:\d{2})\s+(\w+):\s+(.+)'

print(f"Pattern: '{log_pattern}'")
print("Parsed log entries:")
for entry in log_entries:
    match = re.match(log_pattern, entry)
    if match:
        date, time, level, message = match.groups()
        print(f"  [{level}] {date} {time}: {message}")
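
For larger log files it can help to compile the pattern once, use named groups so each entry becomes a dictionary, and keep track of lines that do not match. The following is a sketch under those assumptions. The group names and the malformed sample line are illustrative additions.

```python
import re

# Same structure as the log pattern above, with named groups (names are illustrative)
named_log_pattern = re.compile(
    r'(?P<date>\d{4}-\d{2}-\d{2})\s+'
    r'(?P<time>\d{2}:\d{2}:\d{2})\s+'
    r'(?P<level>\w+):\s+'
    r'(?P<message>.+)'
)

sample_entries = [
    "2024-01-15 10:30:45 INFO: User logged in",
    "2024-01-15 10:31:12 ERROR: Connection failed",
    "malformed line without a timestamp",  # hypothetical bad entry
]

parsed, skipped = [], []
for entry in sample_entries:
    m = named_log_pattern.match(entry)
    if m:
        parsed.append(m.groupdict())  # one dict per entry, keyed by group name
    else:
        skipped.append(entry)         # keep unparsed lines for inspection

# Filter parsed entries by level
errors = [e for e in parsed if e["level"] == "ERROR"]
print(f"Parsed {len(parsed)} entries, skipped {len(skipped)}")
print(f"Errors: {errors}")
```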
