# Advanced Regular Expressions in Python

Welcome to this advanced regex tutorial! This notebook will take you beyond basic pattern matching into the sophisticated world of advanced regular expressions. We'll cover:

- **Lookahead and Lookbehind Assertions**
- **Capturing Groups and Backreferences**
- **Conditional Regex Patterns**
- **Performance Optimization**
- **Real-world Applications**

### Prerequisites
- Basic understanding of regular expressions
- Python programming knowledge
- Familiarity with the `re` module


In [63]:
# Import required modules
import re
import time
import timeit
from typing import List, Optional, Tuple
import warnings

# Helper function for testing regex patterns
def test_regex(pattern: str, test_strings: List[str], flags: int = 0) -> None:
    """Test a regex pattern against multiple strings and display results."""
    compiled_pattern = re.compile(pattern, flags)
    print(f"Pattern: {pattern}")
    print("-" * 50)
    
    for test_string in test_strings:
        match = compiled_pattern.search(test_string)
        if match:
            print(f"✓ '{test_string}' -> Match: '{match.group()}'")
            if match.groups():
                print(f"  Groups: {match.groups()}")
        else:
            print(f"✗ '{test_string}' -> No match")
    print()

## 1. Lookahead and Lookbehind Assertions

Lookaround assertions allow you to match patterns based on what comes before or after, without including those parts in the match.

### Types of Lookaround:
- **Positive Lookahead** `(?=...)` - Matches if followed by pattern
- **Negative Lookahead** `(?!...)` - Matches if NOT followed by pattern
- **Positive Lookbehind** `(?<=...)` - Matches if preceded by pattern
- **Negative Lookbehind** `(?<!...)` - Matches if NOT preceded by pattern

| Part          | Type               | Matches / Purpose                                                    |
| ------------- | ------------------ | -------------------------------------------------------------------- |
| `^`           | Anchor             | Start of string                                                      |
| `(?=.*\d)`    | Positive lookahead | Ensures **at least one digit** (`0–9`) exists anywhere in the string |
| `(?=.*[a-z])` | Positive lookahead | Ensures **at least one lowercase letter** exists anywhere            |
| `(?=.*[A-Z])` | Positive lookahead | Ensures **at least one uppercase letter** exists anywhere            |
| `.{8,}`       | Main match         | Matches **any 8 or more characters** (except newline)                |
| `$`           | Anchor             | End of string                                                        |


In [64]:
# Positive Lookahead Example: Password Validation
# Password must contain at least one digit, one lowercase, one uppercase, and be 8+ chars

password_pattern = r'^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,}$'

passwords = [
    "Password123",    # Valid
    "password123",    # No uppercase
    "PASSWORD123",    # No lowercase
    "Password",       # No digit
    "Pass123",        # Too short
    "MySecure1",      # Valid
]

print("Password Validation with Positive Lookahead:")
test_regex(password_pattern, passwords)

Password Validation with Positive Lookahead:
Pattern: ^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,}$
--------------------------------------------------
✓ 'Password123' -> Match: 'Password123'
✗ 'password123' -> No match
✗ 'PASSWORD123' -> No match
✗ 'Password' -> No match
✗ 'Pass123' -> No match
✓ 'MySecure1' -> Match: 'MySecure1'
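
Because each lookahead is an independent assertion checked from the same position, the rules can also be tested one at a time to report which requirement failed. The following is a minimal sketch; `PASSWORD_RULES` and `failed_rules` are hypothetical names introduced here for illustration.

```python
import re
from typing import List

# Each rule mirrors one component of the password pattern above.
PASSWORD_RULES = {
    "at least one digit": r'\d',
    "at least one lowercase letter": r'[a-z]',
    "at least one uppercase letter": r'[A-Z]',
    "at least 8 characters": r'^.{8,}$',
}

def failed_rules(password: str) -> List[str]:
    """Return the descriptions of every rule the password violates."""
    return [name for name, rule in PASSWORD_RULES.items()
            if not re.search(rule, password)]

print(failed_rules("Password123"))  # → []
print(failed_rules("pass123"))
# → ['at least one uppercase letter', 'at least 8 characters']
```

The single combined pattern is more compact, while separate checks make it easier to give the user a specific error message.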



| Part        | Type                   | Meaning / Purpose                                                |
| ----------- | ---------------------- | ---------------------------------------------------------------- |
| `file`      | Literal match          | Matches the exact characters `f`, `i`, `l`, `e`                  |
| `(?!\.txt)` | **Negative lookahead** | Asserts that what follows **is NOT `.txt`**                      |
| `\.`        | Escaped dot            | Matches a literal `.` (not "any character") inside the lookahead |
| `txt`       | Literal match          | Matches `txt` inside the lookahead                               |


In [None]:
# Negative Lookahead Example: Matching words not followed by specific patterns
# Match 'file' but not when followed by '.txt'

pattern = r'file(?!\.txt)'

test_strings = [
    "file.doc",       # Match
    "file.txt",       # No match
    "filename.txt",   # Match 'file' part
    "file system",    # Match
    "profile.txt",    # No match: 'file' is followed by '.txt'
]

print("Negative Lookahead Example:")
test_regex(pattern, test_strings)

Negative Lookahead Example:
Pattern: file(?!\.txt)
--------------------------------------------------
✓ 'file.doc' -> Match: 'file'
✗ 'file.txt' -> No match
✓ 'filename.txt' -> Match: 'file'
✓ 'file system' -> Match: 'file'
✗ 'profile.txt' -> No match
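
A negative lookahead can also reject a whole string based on how it ends, as long as the pattern is anchored at the start. The sketch below assumes the goal is to keep every name that does not end in `.txt`; `not_txt` and `names` are illustrative names only.

```python
import re

# The lookahead scans ahead from the start of the string and fails
# if the string ends in ".txt"; otherwise .+ matches the whole name.
not_txt = re.compile(r'^(?!.*\.txt$).+$')

names = ["file.doc", "file.txt", "filename.txt", "notes.txt.bak"]
kept = [name for name in names if not_txt.match(name)]
print(kept)  # → ['file.doc', 'notes.txt.bak']
```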



| Part        | Type                    | Meaning / Purpose                                                       |
| ----------- | ----------------------- | ----------------------------------------------------------------------- |
| `(?<=[$₹])` | **Positive lookbehind** | Asserts that the match is **immediately preceded** by either `$` or `₹` |
| `[$₹]`      | Character class         | Matches either a dollar sign (`$`) or rupee sign (`₹`)                  |
| `\d+`       | Quantifier              | Matches **one or more digits** — the whole number part of the price     |
| `\.`        | Escaped character       | Matches a **literal dot** (`.`) — separates rupees from paise/cents     |
| `\d{2}`     | Quantifier              | Matches **exactly two digits** — the decimal part (paise or cents)      |


In [None]:
# Positive Lookbehind Example: Currency amounts preceded by '$' or '₹'
pattern = r'(?<=[$₹])\d+\.\d{2}'
# Alternative that also accepts amounts without a decimal part:
#pattern = r'(?<=[$₹])\d+\.?(\d{2})?'
test_strings = [
    "$19.99",         # Match: 19.99
    "€19.99",         # No match
    "Price: ₹45.00",  # Match: 45.00
    "Amount: ₹25",    # No match (no decimal part)
    "19.99 dollars",  # No match
    "Save $10.50!",   # Match: 10.50
]

print("Positive Lookbehind Example:")
test_regex(pattern, test_strings)

Positive Lookbehind Example:
Pattern: (?<=[$₹])\d+\.\d{2}
--------------------------------------------------
✓ '$19.99' -> Match: '19.99'
✗ '€19.99' -> No match
✓ 'Price: ₹45.00' -> Match: '45.00'
✗ 'Amount: ₹25' -> No match
✗ '19.99 dollars' -> No match
✓ 'Save $10.50!' -> Match: '10.50'
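
One limitation worth knowing: Python's `re` module requires lookbehind patterns to have a fixed width, so a lookbehind such as `(?<=\$\s*)` is rejected at compile time. The sketch below shows the error and one possible workaround (matching the prefix normally and capturing only the amount). The helper `compiles` and the name `amount` are hypothetical.

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if re can compile the pattern (hypothetical helper)."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(compiles(r'(?<=\$)\d+'))     # → True  (fixed width: one character)
print(compiles(r'(?<=\$\s*)\d+'))  # → False (variable width)

# Workaround: consume the currency sign and optional spaces,
# then capture only the numeric part in a group.
amount = re.compile(r'[$₹]\s*(\d+\.\d{2})')
print(amount.findall("Save $ 10.50 or ₹45.00"))  # → ['10.50', '45.00']
```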



| Part      | Type                    | Meaning / Purpose                                               |
| --------- | ----------------------- | --------------------------------------------------------------- |
| `(?<!\$)` | **Negative lookbehind** | Asserts that the match is **not immediately preceded** by a `$` |
| `\d+`     | Quantifier              | Matches **one or more digits** — the whole number part          |
| `\.`      | Escaped character       | Matches a **literal dot** (`.`)                                 |
| `\d{2}`   | Quantifier              | Matches **exactly two digits** — the decimal part (e.g., `.99`) |


In [None]:
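# Negative Lookbehind Example: Numbers not preceded by '$'
pattern = r'(?<!\$)\d+\.\d{2}'
# Pitfall: the lookbehind checks only the single character before the match,
# so the engine can still start one digit later (e.g. '9.99' inside '$19.99').
# Placing \b right after the lookbehind prevents this.
test_strings = [
    "$19.99",         # Partial match: 9.99
    "€19.99",         # Match: 19.99
    "Temperature: 98.60", # Match: 98.60
    "Price $45.00",   # Partial match: 5.00
    "Version 2.10",   # Match: 2.10
]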

print("Negative Lookbehind Example:")
test_regex(pattern, test_strings)

Negative Lookbehind Example:
Pattern: (?<!\$)\d+\.\d{2}
--------------------------------------------------
✓ '$19.99' -> Match: '9.99'
✓ '€19.99' -> Match: '19.99'
✓ 'Temperature: 98.60' -> Match: '98.60'
✓ 'Price $45.00' -> Match: '5.00'
✓ 'Version 2.10' -> Match: '2.10'
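
The output above shows that `$19.99` still produces the partial match `9.99`: the lookbehind only inspects the one character before the match start, so the engine simply starts one digit later. A minimal sketch of one possible fix adds a word boundary after the lookbehind so that a match cannot begin in the middle of a number. `strict_pattern` and `results` are names introduced here.

```python
import re

# \b forces the match to start at the beginning of a number,
# so the lookbehind always inspects the character before the first digit.
strict_pattern = re.compile(r'(?<!\$)\b\d+\.\d{2}')

results = {}
for text in ["$19.99", "€19.99", "Price $45.00", "Version 2.10"]:
    match = strict_pattern.search(text)
    results[text] = match.group() if match else None
    print(f"{text!r} -> {results[text]}")
# '$19.99' -> None
# '€19.99' -> 19.99
# 'Price $45.00' -> None
# 'Version 2.10' -> 2.10
```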



### Exercise 1: Email Validation with Lookaround

Create a regex pattern that validates email addresses with these requirements:
- Must contain exactly one '@' symbol
- Domain must end with '.com', '.org', or '.edu'
- Username cannot start or end with dots
- No consecutive dots allowed

| Part                | Type                   | Meaning / Purpose                                                   |
| ------------------- | ---------------------- | ------------------------------------------------------------------- |
| `^`                 | Anchor                 | Start of the string                                                 |
| `(?!\.)`            | Negative lookahead     | ❌ Disallow a username starting with a dot                           |
| `(?!.*\.\.)`        | Negative lookahead     | ❌ Disallow consecutive dots anywhere in the email                   |
| `(?!.*\.@)`         | Negative lookahead     | ❌ Disallow a username ending with a dot (a dot directly before `@`) |
| `[a-zA-Z0-9._%+-]+` | Character class        | ✅ Local part (username): allowed characters before `@`              |
| `@`                 | Literal character      | Mandatory `@` symbol separating local and domain parts              |
| `[a-zA-Z0-9.-]+`    | Character class        | ✅ Domain name: allows letters, digits, dots, and hyphens            |
| `\.`                | Escaped dot            | Literal dot separating domain and TLD                               |
| `(com\|org\|edu)`   | Group with alternation | ✅ Acceptable domain endings (TLDs): only `.com`, `.org`, `.edu`     |
| `$`                 | Anchor                 | End of the string                                                   |


In [37]:
# Exercise 1 Solution
# (?!.*\.@) rejects a dot directly before '@', i.e. a username ending with a dot
email_pattern = r'^(?!\.)(?!.*\.\.)(?!.*\.@)[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.(com|org|edu)$'

emails = [
    "user@example.com",      # Valid
    ".user@example.com",     # Invalid: starts with dot
    "user.@example.com",     # Invalid: ends with dot
    "us..er@example.com",    # Invalid: consecutive dots
    "user@example.net",      # Invalid: wrong TLD
    "valid.email@test.org",  # Valid
]

print("Email Validation Exercise:")
test_regex(email_pattern, emails)

Email Validation Exercise:
Pattern: ^(?!\.)(?!.*\.\.)(?!.*\.@)[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.(com|org|edu)$
--------------------------------------------------
✓ 'user@example.com' -> Match: 'user@example.com'
  Groups: ('com',)
✗ '.user@example.com' -> No match
✗ 'user.@example.com' -> No match
✗ 'us..er@example.com' -> No match
✗ 'user@example.net' -> No match
✓ 'valid.email@test.org' -> Match: 'valid.email@test.org'
  Groups: ('org',)



## 2. Capturing Groups and Backreferences

Capturing groups allow you to extract parts of your match and reference them later in the same pattern or in replacement strings.

### Types of Groups:
- **Capturing Group** `(...)` - Captures and numbers the group
- **Named Group** `(?P<name>...)` - Captures with a name
- **Non-capturing Group** `(?:...)` - Groups without capturing
- **Backreference** `\1, \2` or `(?P=name)` - References captured groups

| Part               | Type                  | Meaning / Purpose                                  | Example Match |
| ------------------ | --------------------- | -------------------------------------------------- | ------------- |
| `(?P<year>\d{4})`  | Named capturing group | Matches **4 digits** and stores as group `"year"`  | `2023`        |
| `-`                | Literal character     | Matches a **hyphen** separator                     | `-`           |
| `(?P<month>\d{2})` | Named capturing group | Matches **2 digits** and stores as group `"month"` | `08`          |
| `-`                | Literal character     | Matches another hyphen                             | `-`           |
| `(?P<day>\d{2})`   | Named capturing group | Matches **2 digits** and stores as group `"day"`   | `04`          |


In [87]:
# Named Capturing Groups Example: Parsing dates
date_pattern = r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'

dates = [
    "2023-12-25",
    "2024-01-01",
    "1999-07-15",
]

print("Named Groups Example:")
compiled_pattern = re.compile(date_pattern)
for date in dates:
    match = compiled_pattern.search(date)
    if match:
        print(f"Date: {date}")
        print(f"  Year: {match.group('year')}")
        print(f"  Month: {match.group('month')}")
        print(f"  Day: {match.group('day')}")
        print(f"  Full match: {match.group()}")
        print()

Named Groups Example:
Date: 2023-12-25
  Year: 2023
  Month: 12
  Day: 25
  Full match: 2023-12-25

Date: 2024-01-01
  Year: 2024
  Month: 01
  Day: 01
  Full match: 2024-01-01

Date: 1999-07-15
  Year: 1999
  Month: 07
  Day: 15
  Full match: 1999-07-15
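
Named groups are also available in replacement strings through `\g<name>`, and `groupdict()` returns every named capture at once. A short sketch, reusing the same date pattern on a made-up sentence (`text` is illustrative):

```python
import re

date_pattern = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')

text = "Released 2023-12-25, patched 2024-01-01."

# \g<name> inserts the text captured by the named group.
reformatted = date_pattern.sub(r'\g<day>/\g<month>/\g<year>', text)
print(reformatted)
# → Released 25/12/2023, patched 01/01/2024.

# groupdict() returns all named groups of a match as a dictionary.
first = date_pattern.search(text).groupdict()
print(first)
# → {'year': '2023', 'month': '12', 'day': '25'}
```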



| Part    | Meaning                                                                                |
| ------- | -------------------------------------------------------------------------------------- |
| `\b`    | Word boundary — ensures the match starts at the **start of a word**                    |
| `(\w+)` | **Capturing group 1** — matches one or more word characters (`a-z`, `A-Z`, `0-9`, `_`) |
| `\s+`   | One or more whitespace characters (space, tab, newline)                                |
| `\1`    | **Backreference** — matches **the same exact word** captured in group 1                |
| `\b`    | Word boundary — ensures the match ends at the **end of the repeated word**             |


In [90]:
# Backreferences Example: Finding repeated words
repeated_word_pattern = r'\b(\w+)\s+\1\b'

sentences = [
    "This is is a test",           # Match: 'is is'
    "The the quick brown fox",     # Match: 'The the'
    "No repeated words here",      # No match
    "Buffalo buffalo buffalo",     # Match: 'Buffalo buffalo'
    "Hello world world!",          # Match: 'world world'
]

print("Repeated Words with Backreferences:")
test_regex(repeated_word_pattern, sentences, re.IGNORECASE)

Repeated Words with Backreferences:
Pattern: \b(\w+)\s+\1\b
--------------------------------------------------
✓ 'This is is a test' -> Match: 'is is'
  Groups: ('is',)
✓ 'The the quick brown fox' -> Match: 'The the'
  Groups: ('The',)
✗ 'No repeated words here' -> No match
✓ 'Buffalo buffalo buffalo' -> Match: 'Buffalo buffalo'
  Groups: ('Buffalo',)
✓ 'Hello world world!' -> Match: 'world world'
  Groups: ('world',)
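
The same backreference works in a replacement string, which makes it possible to remove the duplicates rather than just find them. A minimal sketch; `collapse_repeats` is a hypothetical helper, and the pattern is extended with `(?:...)+` so that runs of three or more repeats collapse in one pass.

```python
import re

# (?:\s+\1\b)+ matches one or more further copies of the captured word.
repeated = re.compile(r'\b(\w+)(?:\s+\1\b)+', re.IGNORECASE)

def collapse_repeats(text: str) -> str:
    """Replace each run of a repeated word with its first occurrence."""
    return repeated.sub(r'\1', text)

print(collapse_repeats("This is is a test"))        # → This is a test
print(collapse_repeats("Buffalo buffalo buffalo"))  # → Buffalo
print(collapse_repeats("Hello world world!"))       # → Hello world!
```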



In [40]:
# Named Backreferences Example: Matching opening and closing HTML tags
html_tag_pattern = r'<(?P<tag>\w+)>.*?</(?P=tag)>'

html_snippets = [
    "<div>Content</div>",          # Match
    "<p>Paragraph</p>",            # Match
    "<div>Content</span>",         # No match
    "<h1>Title</h1>",              # Match
    "<img src='test.jpg'>",        # No match (attributes, no closing tag)
]

print("HTML Tag Matching with Named Backreferences:")
test_regex(html_tag_pattern, html_snippets)

HTML Tag Matching with Named Backreferences:
Pattern: <(?P<tag>\w+)>.*?</(?P=tag)>
--------------------------------------------------
✓ '<div>Content</div>' -> Match: '<div>Content</div>'
  Groups: ('div',)
✓ '<p>Paragraph</p>' -> Match: '<p>Paragraph</p>'
  Groups: ('p',)
✗ '<div>Content</span>' -> No match
✓ '<h1>Title</h1>' -> Match: '<h1>Title</h1>'
  Groups: ('h1',)
✗ '<img src='test.jpg'>' -> No match



| Part            | Type                      | Meaning / Purpose                                                                      |
| --------------- | ------------------------- | -------------------------------------------------------------------------------------- |
| `(?:^\|,)`      | Non-capturing group       | Matches either the **start of line (`^`)** or a **comma (`,`)**, the field separator   |
| `(`             | Capturing group start     | Start capturing the actual field content                                               |
| `"(...)*"`      | Quoted field format       | Starts and ends with `"`, indicating a **quoted field**                                |
| `(?:[^"]\|"")*` | Non-capturing inner group | Matches zero or more of: `[^"]` (any character except `"`) or `""` (an escaped quote)  |
| `\|`            | Alternation               | OR: allows matching either a quoted or an unquoted field                               |
| `[^,]*`         | Unquoted field            | Matches any characters that are **not commas**                                         |
| `)`             | Capturing group end       | End of field-capturing group                                                           |


In [41]:
# Advanced Example: Parsing CSV with quoted fields
csv_pattern = r'(?:^|,)("(?:[^"]|"")*"|[^,]*)'

csv_lines = [
    'name,age,city',
    'John,25,"New York"',
    '"Smith, Jane",30,Boston',
    'Bob,35,"Los Angeles, CA"',
]

print("CSV Parsing Example:")
for line in csv_lines:
    matches = re.findall(csv_pattern, line)
    fields = [field.strip('"') for field in matches]
    print(f"Line: {line}")
    print(f"Fields: {fields}")
    print()

CSV Parsing Example:
Line: name,age,city
Fields: ['name', 'age', 'city']

Line: John,25,"New York"
Fields: ['John', '25', 'New York']

Line: "Smith, Jane",30,Boston
Fields: ['Smith, Jane', '30', 'Boston']

Line: Bob,35,"Los Angeles, CA"
Fields: ['Bob', '35', 'Los Angeles, CA']
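
Note that `strip('"')` removes the outer quotes but leaves doubled quotes (`""`) inside a field untouched. For production parsing, the standard library's `csv` module is usually the safer choice; the sketch below swaps the regex for `csv.reader` on a made-up line (`line` and `fields` are illustrative names).

```python
import csv

# A quoted field containing a comma and a quoted field containing escaped quotes.
line = 'Bob,35,"Los Angeles, CA","He said ""hi"""'

# csv.reader accepts any iterable of lines; take the first parsed row.
fields = next(csv.reader([line]))
print(fields)  # → ['Bob', '35', 'Los Angeles, CA', 'He said "hi"']
```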



### Exercise 2: Palindrome Detection

Create a regex pattern that detects palindromes (words that read the same forwards and backwards) using backreferences.

| Part    | Type                     | Meaning / Role                                            |
| ------- | ------------------------ | --------------------------------------------------------- |
| `^`     | Anchor                   | Start of the string                                       |
| `(\w)`  | Capturing group 1        | First character (always required)                         |
| `(\w)?` | Capturing group 2        | Second character (optional)                               |
| `(\w)?` | Capturing group 3        | Third character (optional)                                |
| `\3?`   | Backreference (optional) | Match the **same character as group 3**, if present       |
| `\2?`   | Backreference (optional) | Match the **same character as group 2**, if present       |
| `\1`    | Backreference            | Match the **same character as group 1** (always required) |
| `$`     | Anchor                   | End of the string                                         |


In [91]:
# Exercise 2 Solution: Palindrome Detection
# Note: This is a simplified version for short palindromes. Because the
# backreferences are optional, it also accepts non-palindromes such as 'abca'.
palindrome_pattern = r'^(\w)(\w)?(\w)?\3?\2?\1$'

# For a more general solution, we'll use Python logic
def is_palindrome_regex(word: str) -> bool:
    """Check if a word is a palindrome using regex concepts."""
    # Convert to lowercase and remove non-alphanumeric characters
    clean_word = re.sub(r'[^a-zA-Z0-9]', '', word.lower())
    
    # Create dynamic pattern for palindrome checking
    length = len(clean_word)
    if length <= 1:
        return True
    
    # Build pattern dynamically: one capturing group per character in the first half
    pattern_parts = [r'(\w)'] * (length // 2)
    
    if length % 2 == 1:
        pattern_parts.append(r'\w')  # Middle character for odd length
    
    # Add backreferences in reverse order
    for i in range(length // 2, 0, -1):
        pattern_parts.append(f'\\{i}')
    
    pattern = '^' + ''.join(pattern_parts) + '$'
    return bool(re.match(pattern, clean_word))

test_words = [
    "racecar",     # True
    "level",       # True
    "hello",       # False
    "madam",       # True
    "python",      # False
    "A man a plan a canal Panama",  # True (ignoring spaces)
]

print("Palindrome Detection:")
for word in test_words:
    result = is_palindrome_regex(word)
    print(f"'{word}' -> {'Palindrome' if result else 'Not a palindrome'}")

Palindrome Detection:
'racecar' -> Palindrome
'level' -> Palindrome
'hello' -> Not a palindrome
'madam' -> Palindrome
'python' -> Not a palindrome
'A man a plan a canal Panama' -> Palindrome


## 3. Atomic Groups and Possessive Quantifiers

Atomic groups and possessive quantifiers help prevent catastrophic backtracking and improve performance.

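### Key Concepts:
- **Atomic Groups** `(?>...)` - Once the group has matched, the engine does not backtrack into it
- **Possessive Quantifiers** `*+, ++, ?+, {n,m}+` - Don't give up characters once matched
- **Catastrophic Backtracking** - When the regex engine tries an exponentially growing number of combinations before failing

**Note:** Python's `re` module supports atomic groups and possessive quantifiers only from Python 3.11 onward. For earlier versions, we'll demonstrate the concepts and show alternatives.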

In [92]:
# Demonstrating Catastrophic Backtracking
import time

def time_regex(pattern, text, description):
    """Time how long a regex takes to execute."""
    start_time = time.time()
    try:
        result = re.search(pattern, text)
        end_time = time.time()
        print(f"{description}: {end_time - start_time:.4f}s - {'Match' if result else 'No match'}")
    except Exception as e:
        end_time = time.time()
        print(f"{description}: {end_time - start_time:.4f}s - Error: {e}")

# Problematic pattern: nested quantifiers can cause catastrophic backtracking
problematic_pattern = r'(a+)+b'
# Equivalent pattern without the nested quantifier
better_pattern = r'a+b'

# Test string that will cause backtracking (no 'b' at the end)
test_string = 'a' * 20  # String of 20 'a's with no 'b'

print("Demonstrating Backtracking Issues:")
time_regex(better_pattern, test_string, "Optimized pattern")
print("\nNote: The problematic pattern (a+)+b would take much longer on longer strings without 'b'")
print("This demonstrates why atomic groups and possessive quantifiers are important.")

Demonstrating Backtracking Issues:
Optimized pattern: 0.0000s - No match

Note: The problematic pattern (a+)+b would take much longer on longer strings without 'b'
This demonstrates why atomic groups and possessive quantifiers are important.
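
The cell above only times the optimized pattern. The sketch below times the nested pattern as well, on deliberately small inputs so that it finishes quickly, and tries a possessive quantifier where the interpreter supports it (Python 3.11+). `time_search` is a hypothetical helper, and the timings will vary by machine.

```python
import re
import sys
import time

def time_search(pattern: str, text: str) -> float:
    """Return the seconds taken by a single re.search call."""
    start = time.perf_counter()
    re.search(pattern, text)
    return time.perf_counter() - start

# Keep n small: with no 'b' in the text, the nested pattern retries
# many different ways of splitting the run of 'a's before giving up.
for n in (12, 15, 18):
    text = 'a' * n
    nested = time_search(r'(a+)+b', text)
    flat = time_search(r'a+b', text)
    print(f"n={n}: nested {nested:.4f}s, flat {flat:.4f}s")

# Python 3.11+ only: a possessive quantifier never gives characters back.
if sys.version_info >= (3, 11):
    print(f"possessive: {time_search(r'(a++)+b', 'a' * 18):.4f}s")
```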


In [45]:
# Performance Optimization Techniques

def compare_regex_performance():
    """Compare different regex approaches for performance."""
    
    # Test data
    text = "The quick brown fox jumps over the lazy dog. " * 1000
    
    # Different approaches to find words
    patterns = {
        "Inefficient": r'(\w+\s*)+',  # Can cause backtracking
        "Better": r'\w+',             # Simple and efficient
        "Compiled": re.compile(r'\w+'),  # Pre-compiled
    }
    
    print("Performance Comparison:")
    
    for name, pattern in patterns.items():
        if name == "Compiled":
            # Time the compiled version
            start_time = time.time()
            matches = pattern.findall(text)
            end_time = time.time()
        else:
            # Time the regular version
            start_time = time.time()
            matches = re.findall(pattern, text)
            end_time = time.time()
        
        print(f"{name}: {end_time - start_time:.4f}s ({len(matches)} matches)")

compare_regex_performance()

Performance Comparison:
Inefficient: 0.0006s (1000 matches)
Better: 0.0011s (9000 matches)
Compiled: 0.0012s (9000 matches)
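
Two caveats apply to the numbers above: a single `time.time()` measurement is noisy, and the "Inefficient" pattern returns 1000 matches rather than 9000, so it is not doing the same work. The `re` module also caches recently compiled patterns, which is why "Better" and "Compiled" come out so close. A sketch using `timeit` to average several runs of like-for-like calls (the variable names are illustrative):

```python
import re
import timeit

text = "The quick brown fox jumps over the lazy dog. " * 1000
word = re.compile(r'\w+')
runs = 20

# Both calls run the same pattern; only the lookup path differs.
module_level = timeit.timeit(lambda: re.findall(r'\w+', text), number=runs)
precompiled = timeit.timeit(lambda: word.findall(text), number=runs)

print(f"re.findall:       {module_level / runs:.5f}s per call")
print(f"compiled.findall: {precompiled / runs:.5f}s per call")
```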


## 4. Conditional Regex Patterns

Conditional patterns allow you to match different alternatives based on whether a previous group matched.

**Syntax:** `(?(condition)yes-pattern|no-pattern)`

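**Note:** Python's `re` module supports only one kind of conditional: a test of whether a numbered or named group took part in the match, written `(?(id/name)yes-pattern|no-pattern)`. The example below takes a different route and tries several alternative patterns in plain Python.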

In [93]:
# Simulating Conditional Regex in Python

def flexible_date_matcher(date_string):
    """Match dates in multiple formats using conditional logic."""
    
    # Different date patterns
    patterns = [
        (r'(\d{4})-(\d{2})-(\d{2})', 'ISO format (YYYY-MM-DD)'),
        (r'(\d{2})/(\d{2})/(\d{4})', 'US format (MM/DD/YYYY)'),
        (r'(\d{2})\.(\d{2})\.(\d{4})', 'European format (DD.MM.YYYY)'),
        (r'(\w+)\s+(\d{1,2}),\s+(\d{4})', 'Long format (Month DD, YYYY)'),
    ]
    
    for pattern, description in patterns:
        match = re.search(pattern, date_string)
        if match:
            return {
                'format': description,
                'groups': match.groups(),
                'full_match': match.group()
            }
    
    return None

# Test different date formats
test_dates = [
    "2023-12-25",
    "12/25/2023",
    "25.12.2023",
    "December 25, 2023",
    "Invalid date format",
]

print("Flexible Date Matching:")
for date in test_dates:
    result = flexible_date_matcher(date)
    if result:
        print(f"'{date}' -> {result['format']}")
        print(f"  Groups: {result['groups']}")
    else:
        print(f"'{date}' -> No match found")
    print()

Flexible Date Matching:
'2023-12-25' -> ISO format (YYYY-MM-DD)
  Groups: ('2023', '12', '25')

'12/25/2023' -> US format (MM/DD/YYYY)
  Groups: ('12', '25', '2023')

'25.12.2023' -> European format (DD.MM.YYYY)
  Groups: ('25', '12', '2023')

'December 25, 2023' -> Long format (Month DD, YYYY)
  Groups: ('December', '25', '2023')

'Invalid date format' -> No match found
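
For comparison, here is a small sketch of the group-existence conditional that `re` supports natively. It is adapted from the example in the Python `re` documentation: the closing `>` is required only if an opening `<` was matched. The names `address` and `outcomes` are illustrative.

```python
import re

# Group 1 optionally matches '<'. The conditional (?(1)>|$) then requires
# '>' if group 1 matched, and the end of the string otherwise.
address = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)')

candidates = ["<user@host.com>", "user@host.com", "<user@host.com", "user@host.com>"]
outcomes = {c: bool(address.match(c)) for c in candidates}
for candidate, ok in outcomes.items():
    print(f"{candidate!r} -> {ok}")
# '<user@host.com>' -> True
# 'user@host.com' -> True
# '<user@host.com' -> False
# 'user@host.com>' -> False
```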



## 5. Advanced Quantifiers and Unicode Support

Understanding the nuances of quantifiers and working with Unicode text.

| Pattern | Behavior          | Effect                                                                                              |
| ------- | ----------------- | --------------------------------------------------------------------------------------------------- |
| `.*`    | Greedy            | Consumes as much as possible, so the match extends to the **last** place where the rest can match   |
| `.*?`   | Non-greedy / lazy | Consumes as little as possible, so the match stops at the **first** place where the rest can match  |


In [94]:
# Lazy vs Greedy Quantifiers

def demonstrate_quantifiers():
    """Show the difference between greedy and lazy quantifiers."""
    
    text = "<div>First</div><div>Second</div><div>Third</div>"
    
    patterns = {
        "Greedy": r'<div>.*</div>',      # Matches from first <div> to last </div>
        "Lazy": r'<div>.*?</div>',       # Matches each <div>...</div> separately
        "Specific": r'<div>[^<]*</div>', # More specific, avoids the issue
    }
    
    print("Quantifier Comparison:")
    print(f"Text: {text}")
    print()
    
    for name, pattern in patterns.items():
        matches = re.findall(pattern, text)
        print(f"{name} ({pattern}):")
        for i, match in enumerate(matches, 1):
            print(f"  Match {i}: {match}")
        print()

demonstrate_quantifiers()

Quantifier Comparison:
Text: <div>First</div><div>Second</div><div>Third</div>

Greedy (<div>.*</div>):
  Match 1: <div>First</div><div>Second</div><div>Third</div>

Lazy (<div>.*?</div>):
  Match 1: <div>First</div>
  Match 2: <div>Second</div>
  Match 3: <div>Third</div>

Specific (<div>[^<]*</div>):
  Match 1: <div>First</div>
  Match 2: <div>Second</div>
  Match 3: <div>Third</div>



In [97]:
# Unicode and International Text Processing

def unicode_text_processing():
    """Demonstrate regex with Unicode text."""
    
    # Sample text in different languages
    texts = [
        "Hello World! 123",           # English
        "Hola Mundo! 123",            # Spanish
        "Bonjour le monde! 123",      # French
        "Привет नमस्ते ! 123",            # Russian and Hindi
        "こんにちは世界！123",            # Japanese
        "مرحبا بالعالم! 123",          # Arabic
    ]
    
    patterns = {
        "ASCII letters": r'[a-zA-Z]+',
        
        "Word characters": r'\w+',
        "Non-ASCII": r'[^\x00-\x7F]+',
    }
    
    print("Unicode Text Processing:")
    
    for text in texts:
        print(f"\nText: {text}")
        
        # ASCII letters
        ascii_matches = re.findall(r'[a-zA-Z]+', text)
        print(f"  ASCII letters: {ascii_matches}")
        
        # Word characters (includes Unicode in Python)
        word_matches = re.findall(r'\w+', text, re.UNICODE)
        print(f"  Word characters: {word_matches}")
        
        # Non-ASCII characters
        non_ascii = re.findall(r'[^\x00-\x7F]+', text)
        print(f"  Non-ASCII: {non_ascii}")

unicode_text_processing()

Unicode Text Processing:

Text: Hello World! 123
  ASCII letters: ['Hello', 'World']
  Word characters: ['Hello', 'World', '123']
  Non-ASCII: []

Text: Hola Mundo! 123
  ASCII letters: ['Hola', 'Mundo']
  Word characters: ['Hola', 'Mundo', '123']
  Non-ASCII: []

Text: Bonjour le monde! 123
  ASCII letters: ['Bonjour', 'le', 'monde']
  Word characters: ['Bonjour', 'le', 'monde', '123']
  Non-ASCII: []

Text: Привет नमस्ते ! 123
  ASCII letters: []
  Word characters: ['Привет', 'नमस', 'त', '123']
  Non-ASCII: ['Привет', 'नमस्ते']

Text: こんにちは世界！123
  ASCII letters: []
  Word characters: ['こんにちは世界', '123']
  Non-ASCII: ['こんにちは世界！']

Text: مرحبا بالعالم! 123
  ASCII letters: []
  Word characters: ['مرحبا', 'بالعالم', '123']
  Non-ASCII: ['مرحبا', 'بالعالم']
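
In Python 3, `str` patterns are Unicode-aware by default, so passing `re.UNICODE` is redundant. The flag that actually changes behavior is `re.ASCII`, which restricts `\w`, `\d`, `\s`, and `\b` to ASCII. A short sketch on a made-up string (escapes are used so the accented letters are single code points):

```python
import re

text = "caf\u00e9 r\u00e9sum\u00e9 123"  # "café résumé 123"

# Default for str patterns: \w matches Unicode letters and digits.
unicode_words = re.findall(r'\w+', text)
print(unicode_words)  # → ['café', 'résumé', '123']

# With re.ASCII, \w is restricted to [a-zA-Z0-9_], so 'é' splits the words.
ascii_words = re.findall(r'\w+', text, re.ASCII)
print(ascii_words)    # → ['caf', 'r', 'sum', '123']
```

Combining marks are not word characters either, which is why the Hindi word in the output above is split into two pieces by `\w+`.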


In [98]:
# Word Boundaries with Unicode

def word_boundary_examples():
    """Demonstrate word boundaries with different types of text."""
    
    texts = [
        "The cat in the hat",
        "email@example.com",
        "file_name.txt",
        "hello-world",
        "café résumé naïve",  # Accented characters
    ]
    
    patterns = {
        "Word boundary \\b": r'\bcat\b',
        "Non-word boundary \\B": r'\Bcat\B',
        "Start of word": r'\b\w+',
        "End of word": r'\w+\b',
    }
    
    print("Word Boundary Examples:")
    
    for text in texts:
        print(f"\nText: '{text}'")
        
        # Find all whole words (runs of word characters between boundaries)
        words = re.findall(r'\b\w+\b', text)
        print(f"  Words: {words}")
        
        # Check if 'cat' appears as a whole word
        if re.search(r'\bcat\b', text):
            print(f"  Contains 'cat' as whole word: Yes")
        else:
            print(f"  Contains 'cat' as whole word: No")

word_boundary_examples()

Word Boundary Examples:

Text: 'The cat in the hat'
  Words: ['The', 'cat', 'in', 'the', 'hat']
  Contains 'cat' as whole word: Yes

Text: 'email@example.com'
  Words: ['email', 'example', 'com']
  Contains 'cat' as whole word: No

Text: 'file_name.txt'
  Words: ['file_name', 'txt']
  Contains 'cat' as whole word: No

Text: 'hello-world'
  Words: ['hello', 'world']
  Contains 'cat' as whole word: No

Text: 'café résumé naïve'
  Words: ['café', 'résumé', 'naïve']
  Contains 'cat' as whole word: No


## 6. Real-World Applications

Let's apply our advanced regex knowledge to solve real-world problems.

| Regex Part              | Name (`?P<name>`) | Matches Example Value        | Description                                 |
| ----------------------- | ----------------- | ---------------------------- | ------------------------------------------- |
| `^`                     | —                 | —                            | Start of the line                           |
| `(?P<ip>\S+)`           | `ip`              | `127.0.0.1`                  | Matches the **IP address** (non-whitespace) |
| `\S+`                   | —                 | `-`                          | Typically unused field (e.g., identity)     |
| `(?P<user>\S+)`         | `user`            | `frank`                      | Authenticated username or `-`               |
| `\[`                    | —                 | `[`                          | Literal opening square bracket              |
| `(?P<timestamp>[^\]]+)` | `timestamp`       | `10/Oct/2000:13:55:36 -0700` | Anything inside square brackets (`[ ... ]`) |
| `\]`                    | —                 | `]`                          | Literal closing square bracket              |
| `"`                     | —                 | `"`                          | Opening double quote for HTTP request       |
| `(?P<method>\w+)`       | `method`          | `GET`                        | HTTP method (e.g., GET, POST)               |
| `(?P<url>\S+)`          | `url`             | `/apache_pb.gif`             | Request URL path                            |
| `(?P<protocol>[^"]+)`   | `protocol`        | `HTTP/1.0`                   | Protocol version, anything up to next `"`   |
| `"`                     | —                 | `"`                          | Closing double quote                        |
| `(?P<status>\d+)`       | `status`          | `200`                        | HTTP status code                            |
| `(?P<size>\d+)`         | `size`            | `2326`                       | Size of response in bytes                   |
| `$`                     | —                 | —                            | End of the line                             |


In [51]:
# Log File Parsing

def parse_log_files():
    """Parse common log file formats."""
    
    # Sample log entries
    log_entries = [
        '192.168.1.1 - - [25/Dec/2023:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 1234',
        '10.0.0.1 - user [25/Dec/2023:10:01:00 +0000] "POST /api/login HTTP/1.1" 401 567',
        '203.0.113.1 - - [25/Dec/2023:10:02:00 +0000] "GET /images/logo.png HTTP/1.1" 404 0',
    ]
    
    # Apache Common Log Format pattern
    log_pattern = r'^(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<timestamp>[^\]]+)\] "(?P<method>\w+) (?P<url>\S+) (?P<protocol>[^"]+)" (?P<status>\d+) (?P<size>\d+)$'
    
    print("Log File Parsing:")
    compiled_pattern = re.compile(log_pattern)
    
    for entry in log_entries:
        match = compiled_pattern.match(entry)
        if match:
            data = match.groupdict()
            print(f"IP: {data['ip']}")
            print(f"User: {data['user']}")
            print(f"Method: {data['method']}")
            print(f"URL: {data['url']}")
            print(f"Status: {data['status']}")
            print(f"Size: {data['size']} bytes")
            print("-" * 40)

parse_log_files()

Log File Parsing:
IP: 192.168.1.1
User: -
Method: GET
URL: /index.html
Status: 200
Size: 1234 bytes
----------------------------------------
IP: 10.0.0.1
User: user
Method: POST
URL: /api/login
Status: 401
Size: 567 bytes
----------------------------------------
IP: 203.0.113.1
User: -
Method: GET
URL: /images/logo.png
Status: 404
Size: 0 bytes
----------------------------------------
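
The parser above returns every field as a string and requires a numeric size. Real access logs commonly write `-` for the size when no body was sent, so a possible extension is sketched below. The helper name `parse_log_line` and the constant `LOG_LINE` are hypothetical, and the timestamp format assumes English month abbreviations.

```python
import re
from datetime import datetime
from typing import Optional

# Same structure as the pattern above, except that size may also be "-"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\w+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d+) (?P<size>\d+|-)$'
)

def parse_log_line(line: str) -> Optional[dict]:
    """Hypothetical helper: return typed fields, or None if the line does not match."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record['status'] = int(record['status'])
    # "-" means no body was sent; treat it as 0 bytes
    record['size'] = 0 if record['size'] == '-' else int(record['size'])
    # %b assumes English month abbreviations (the default C locale)
    record['timestamp'] = datetime.strptime(record['timestamp'], '%d/%b/%Y:%H:%M:%S %z')
    return record

entry = parse_log_line(
    '10.0.0.1 - user [25/Dec/2023:10:01:00 +0000] "HEAD /api/login HTTP/1.1" 304 -'
)
print(entry['status'], entry['size'], entry['timestamp'].isoformat())
# 304 0 2023-12-25T10:01:00+00:00
```

Converting once at parse time keeps later aggregation code free of repeated `int()` calls and string comparisons.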


In [52]:
# Data Extraction from Unstructured Text

def extract_contact_info():
    """Extract contact information from unstructured text."""
    
    text = """
    Contact John Smith at john.smith@company.com or call (555) 123-4567.
    You can also reach Jane Doe via jane.doe@example.org or +1-800-555-0199.
    For urgent matters, contact support@help.com or dial 1-888-HELP-NOW.
    Visit our office at 123 Main Street, Suite 456, New York, NY 10001.
    """
    
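    patterns = {
        'emails': r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
        # Separators may only be "-" or ".", so "(555) 123-4567" (space after
        # the area code) is not matched
        'phones': r'(?:\+?1[-.]?)?\(?([0-9]{3})\)?[-.]?([0-9]{3})[-.]?([0-9]{4})',
        # Any two adjacent capitalized words match, so expect false positives
        'names': r'\b[A-Z][a-z]+ [A-Z][a-z]+\b',
        # The street part excludes digits, so "Suite 456" prevents a match
        'addresses': r'\d+\s+[A-Za-z\s,]+\s+[A-Z]{2}\s+\d{5}',
    }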
    
    print("Contact Information Extraction:")
    print(f"Text: {text.strip()}")
    print("\nExtracted Information:")
    
    for info_type, pattern in patterns.items():
        matches = re.findall(pattern, text)
        print(f"\n{info_type.title()}:")
        if matches:
            for match in matches:
                if isinstance(match, tuple):
                    # For phone numbers with groups
                    formatted = f"({match[0]}) {match[1]}-{match[2]}"
                    print(f"  - {formatted}")
                else:
                    print(f"  - {match}")
        else:
            print("  None found")

extract_contact_info()

Contact Information Extraction:
Text: Contact John Smith at john.smith@company.com or call (555) 123-4567.
    You can also reach Jane Doe via jane.doe@example.org or +1-800-555-0199.
    For urgent matters, contact support@help.com or dial 1-888-HELP-NOW.
    Visit our office at 123 Main Street, Suite 456, New York, NY 10001.

Extracted Information:

Emails:
  - john.smith@company.com
  - jane.doe@example.org
  - support@help.com

Phones:
  - (800) 555-0199

Names:
  - Contact John
  - Jane Doe
  - Main Street
  - New York

Addresses:
  None found
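
The output shows two gaps: `(555) 123-4567` is missed because the phone pattern does not accept a space as a separator, and the address is missed because the street part cannot contain digits such as `Suite 456`. The following is one possible tightening, not a general-purpose solution. It assumes US-style phone numbers and addresses ending in a two-letter state code and a five-digit ZIP code, and the names `phone_pattern` and `address_pattern` are illustrative.

```python
import re

text = (
    "Contact John Smith at john.smith@company.com or call (555) 123-4567. "
    "You can also reach Jane Doe via jane.doe@example.org or +1-800-555-0199. "
    "For urgent matters, contact support@help.com or dial 1-888-HELP-NOW. "
    "Visit our office at 123 Main Street, Suite 456, New York, NY 10001."
)

# Separators may now be "-", "." or whitespace, so "(555) 123-4567" is accepted
phone_pattern = re.compile(
    r'(?:\+?1[-.\s]?)?\(?(\d{3})\)?[-.\s]?(\d{3})[-.\s]?(\d{4})'
)

# Digits are allowed inside the street part; the lazy quantifier stops at the
# first ", ST 12345" style ending (two capitals plus a five-digit ZIP code)
address_pattern = re.compile(r'\d+\s+[A-Za-z0-9\s,]+?,\s+[A-Z]{2}\s+\d{5}')

phones = [f"({a}) {b}-{c}" for a, b, c in phone_pattern.findall(text)]
addresses = address_pattern.findall(text)

print(phones)     # ['(555) 123-4567', '(800) 555-0199']
print(addresses)  # ['123 Main Street, Suite 456, New York, NY 10001']
```

The vanity number `1-888-HELP-NOW` is still skipped, since it contains letters. Names are left out on purpose: capitalization alone cannot separate `Jane Doe` from `Main Street`.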


The next table breaks down the URL pattern used in the following cell:

| Part                                  | Group Name | Type                 | Matches / Captures                                               |
| ------------------------------------- | ---------- | -------------------- | ---------------------------------------------------------------- |
| `^`                                   | —          | Anchor               | Start of the string                                              |
| `(?P<scheme>[a-zA-Z][a-zA-Z0-9+.-]*)` | `scheme`   | Named group          | URL scheme (e.g., `http`, `https`, `ftp`)                        |
| `://`                                 | —          | Literal              | Required separator; schemes without `//` (e.g., `mailto:`) fail  |
| `(?:...)`                             | —          | Non-capturing group  | For wrapping optional username/password logic                    |
| `(?P<username>[^:@]+)`                | `username` | Named group          | Username before `@`, can't include `:` or `@`                    |
| `(?::(?P<password>[^@]+))?`           | `password` | Optional named group | Password after `:` (if present), can't include `@`               |
| `@`                                   | —          | Literal              | Ends credentials section                                         |
| `)?`                                  | —          | Optional wrapper     | Makes the entire `username:password@` section optional           |
| `(?P<host>[^:/?#]+)`                  | `host`     | Named group          | Host (domain or IP) — can't include port `:`, path `/`, `?`, `#` |
| `(?::(?P<port>\d+))?`                 | `port`     | Optional named group | Port number, preceded by `:`                                     |
| `(?P<path>/[^?#]*)?`                  | `path`     | Optional named group | Path, must start with `/`, excludes query and fragment           |
| `(?:\?(?P<query>[^#]*))?`             | `query`    | Optional named group | Query string, starts with `?`, excludes fragment (`#`)           |
| `(?:\#(?P<fragment>.*))?`             | `fragment` | Optional named group | Fragment identifier, starts with `#`, captures rest              |
| `$`                                   | —          | Anchor               | End of the string                                                |


In [53]:
# URL Validation and Parsing

def advanced_url_parsing():
    """Parse and validate URLs with detailed component extraction."""
    
    urls = [
        "https://www.example.com:8080/path/to/page?param1=value1&param2=value2#section",
        "http://subdomain.example.org/api/v1/users/123",
        "ftp://user:pass@ftp.example.com:21/files/document.pdf",
        "mailto:user@example.com",  # No "://" after the scheme, so the pattern rejects it
        "invalid-url",
    ]
    
    # Comprehensive URL pattern
    url_pattern = r'''
        ^(?P<scheme>[a-zA-Z][a-zA-Z0-9+.-]*):// # Scheme
        (?:
            (?P<username>[^:@]+)                 # Username (optional)
            (?::(?P<password>[^@]+))?            # Password (optional)
            @
        )?
        (?P<host>[^:/?#]+)                       # Host
        (?::(?P<port>\d+))?                      # Port (optional)
        (?P<path>/[^?#]*)?                       # Path (optional)
        (?:\?(?P<query>[^#]*))?                  # Query string (optional)
        (?:\#(?P<fragment>.*))?                  # Fragment (optional)
        $
    '''
    
    compiled_pattern = re.compile(url_pattern, re.VERBOSE)
    
    print("Advanced URL Parsing:")
    
    for url in urls:
        print(f"\nURL: {url}")
        match = compiled_pattern.match(url)
        
        if match:
            components = match.groupdict()
            print("  Components:")
            for key, value in components.items():
                if value:
                    print(f"    {key}: {value}")
        else:
            print("  Invalid URL format")

advanced_url_parsing()

Advanced URL Parsing:

URL: https://www.example.com:8080/path/to/page?param1=value1&param2=value2#section
  Components:
    scheme: https
    host: www.example.com
    port: 8080
    path: /path/to/page
    query: param1=value1&param2=value2
    fragment: section

URL: http://subdomain.example.org/api/v1/users/123
  Components:
    scheme: http
    host: subdomain.example.org
    path: /api/v1/users/123

URL: ftp://user:pass@ftp.example.com:21/files/document.pdf
  Components:
    scheme: ftp
    username: user
    password: pass
    host: ftp.example.com
    port: 21
    path: /files/document.pdf

URL: mailto:user@example.com
  Invalid URL format

URL: invalid-url
  Invalid URL format
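
The pattern reports `mailto:user@example.com` as invalid because it requires `://` after the scheme. When the goal is splitting rather than teaching regex, the standard library's `urllib.parse.urlsplit` handles both forms. The sketch below uses a hypothetical helper, `split_url`, that mirrors the component names printed above.

```python
from urllib.parse import urlsplit

def split_url(url: str) -> dict:
    """Hypothetical helper: return the non-empty URL components as a dict."""
    parts = urlsplit(url)
    components = {
        'scheme': parts.scheme,
        'username': parts.username,
        'password': parts.password,
        'host': parts.hostname,
        'port': parts.port,        # int or None
        'path': parts.path,
        'query': parts.query,
        'fragment': parts.fragment,
    }
    return {key: value for key, value in components.items() if value}

ftp_parts = split_url("ftp://user:pass@ftp.example.com:21/files/document.pdf")
mailto_parts = split_url("mailto:user@example.com")

print(ftp_parts)
print(mailto_parts)  # {'scheme': 'mailto', 'path': 'user@example.com'}
```

Note that `urlsplit` does not validate: `split_url("invalid-url")` returns `{'path': 'invalid-url'}` rather than signalling an error, so a check for a non-empty scheme is still needed when validation matters.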


## 7. Performance Optimization and Best Practices

Learn how to write efficient regex patterns and avoid common pitfalls.

In [54]:
# Regex Compilation and Caching

import functools

class RegexCache:
    """A simple regex compilation cache."""
    
    def __init__(self, max_size=100):
        self.cache = {}
        self.max_size = max_size
    
    def get_compiled_regex(self, pattern, flags=0):
        """Get a compiled regex, using cache when possible."""
        key = (pattern, flags)
        
        if key not in self.cache:
            if len(self.cache) >= self.max_size:
                # Remove oldest entry (simple FIFO)
                oldest_key = next(iter(self.cache))
                del self.cache[oldest_key]
            
            self.cache[key] = re.compile(pattern, flags)
        
        return self.cache[key]

# Global regex cache instance
regex_cache = RegexCache()

def benchmark_compilation():
    """Benchmark module-level calls, a pre-compiled pattern, and RegexCache."""
    
    pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
    text = "Contact us at support@example.com or sales@company.org" * 1000
    iterations = 1000
    
    # Module-level call: re.findall() looks the pattern up in re's own internal
    # cache on every call, so the pattern is only recompiled on a cache miss
    start_time = time.perf_counter()
    for _ in range(iterations):
        re.findall(pattern, text)
    no_cache_time = time.perf_counter() - start_time
    
    # Pre-compiled pattern: no lookup at all
    compiled_pattern = re.compile(pattern)
    start_time = time.perf_counter()
    for _ in range(iterations):
        compiled_pattern.findall(text)
    compiled_time = time.perf_counter() - start_time
    
    # RegexCache: one dictionary lookup per call
    start_time = time.perf_counter()
    for _ in range(iterations):
        cached_pattern = regex_cache.get_compiled_regex(pattern)
        cached_pattern.findall(text)
    cached_time = time.perf_counter() - start_time
    
    # The matching work dominates all three loops, so the timings end up close
    print("Regex Compilation Benchmark:")
    print(f"No caching: {no_cache_time:.4f}s")
    print(f"Pre-compiled: {compiled_time:.4f}s")
    print(f"With caching: {cached_time:.4f}s")
    print(f"Speedup (compiled vs no cache): {no_cache_time/compiled_time:.2f}x")

benchmark_compilation()

Regex Compilation Benchmark:
No caching: 0.5775s
Pre-compiled: 0.5556s
With caching: 0.5574s
Speedup (compiled vs no cache): 1.04x
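
The three timings are nearly identical because the module-level functions such as `re.findall()` already cache recently compiled patterns internally, so the "no caching" loop mostly pays for a cache lookup. If an explicit cache is still wanted, `functools.lru_cache` gives a shorter alternative to a hand-written class. This is a minimal sketch, and `cached_compile` is a hypothetical name.

```python
import functools
import re

@functools.lru_cache(maxsize=100)
def cached_compile(pattern: str, flags: int = 0) -> "re.Pattern[str]":
    """Hypothetical helper: compile once per (pattern, flags) pair, with LRU eviction."""
    return re.compile(pattern, flags)

email_re = cached_compile(r'[\w.+-]+@[\w-]+\.[\w.]+')
same_re = cached_compile(r'[\w.+-]+@[\w-]+\.[\w.]+')

print(email_re is same_re)          # True: the second call is served from the cache
print(cached_compile.cache_info())  # reports hits=1, misses=1
```

Unlike the FIFO eviction in `RegexCache`, `lru_cache` evicts the least recently used entry, which tends to suit workloads where a few patterns are reused heavily.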


In [55]:
# Common Performance Pitfalls and Solutions

def demonstrate_performance_pitfalls():
    """Show common regex performance issues and their solutions."""
    
    test_text = "a" * 1000 + "b"  # 1000 'a's followed by 'b'
    
    patterns = {
        "Catastrophic (avoid)": r'(a+)+b',
        "Better alternative": r'a+b',
        "Inefficient alternation": r'(cat|category|catastrophe)',
        "Optimized alternation": r'cat(egory|astrophe)?',
    }
    
    print("Performance Pitfalls Demonstration:")
    
    # Safe patterns to test
    safe_patterns = {
        "Better alternative": r'a+b',
        "Optimized alternation": r'cat(egory|astrophe)?',
    }
    
    for name, pattern in safe_patterns.items():
        start_time = time.time()
        result = re.search(pattern, test_text)
        end_time = time.time()
        
        print(f"{name}: {end_time - start_time:.6f}s - {'Match' if result else 'No match'}")
    
    print("\nNote: Avoided testing catastrophic patterns to prevent long execution times.")
    
    # Demonstrate alternation optimization
    test_words = ["cat", "category", "catastrophe", "dog"]
    
    print("\nAlternation Optimization:")
    for word in test_words:
        match1 = re.search(r'(cat|category|catastrophe)', word)
        match2 = re.search(r'cat(egory|astrophe)?', word)
        
        result1 = "Match" if match1 else "No match"
        result2 = "Match" if match2 else "No match"
        
        print(f"'{word}': Inefficient={result1}, Optimized={result2}")

demonstrate_performance_pitfalls()

Performance Pitfalls Demonstration:
Better alternative: 0.000030s - Match
Optimized alternation: 0.000003s - No match

Note: Avoided testing catastrophic patterns to prevent long execution times.

Alternation Optimization:
'cat': Inefficient=Match, Optimized=Match
'category': Inefficient=Match, Optimized=Match
'catastrophe': Inefficient=Match, Optimized=Match
'dog': Inefficient=No match, Optimized=No match
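
Both alternations report a match for every word, but they do not match the same text. Python tries alternatives left to right and stops at the first that succeeds, so ordering affects what is captured. The sketch below shows that difference, then times the nested-quantifier pattern on deliberately small inputs. `time_failed_match` is a hypothetical helper, and the timings will vary by machine.

```python
import re
import time

# Alternatives are tried left to right and the first one that succeeds wins,
# so listing the short option first hides the longer ones
short_first = re.search(r'cat|category|catastrophe', 'category').group()
long_first = re.search(r'catastrophe|category|cat', 'category').group()
factored = re.search(r'cat(?:egory|astrophe)?', 'category').group()

print(short_first)  # cat
print(long_first)   # category
print(factored)     # category

def time_failed_match(pattern: str, n: int) -> float:
    """Time a search that must fail: n 'a' characters with no trailing 'b'."""
    text = 'a' * n
    start = time.perf_counter()
    re.search(pattern, text)
    return time.perf_counter() - start

# Keep n small: the nested quantifier is expected to grow roughly
# exponentially with n, while the flat pattern stays cheap
for n in (12, 14, 16, 18):
    nested = time_failed_match(r'(a+)+b', n)
    flat = time_failed_match(r'a+b', n)
    print(f"n={n}: (a+)+b {nested:.6f}s | a+b {flat:.6f}s")
```

The failing input is what exposes the problem: a successful match on `"a" * 1000 + "b"` returns quickly even for `(a+)+b`, which is why such patterns often pass casual testing.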


## 8. Debugging and Testing Regex

Tools and techniques for debugging complex regex patterns.

In [56]:
# Regex Debugging Tools

def debug_regex(pattern, text, description=""):
    """Debug a regex pattern with detailed information."""
    
    print(f"Debugging Regex: {description}")
    print(f"Pattern: {pattern}")
    print(f"Text: {text}")
    print("-" * 50)
    
    try:
        compiled_pattern = re.compile(pattern)
        
        # Find all matches
        matches = list(compiled_pattern.finditer(text))
        
        if matches:
            print(f"Found {len(matches)} match(es):")
            for i, match in enumerate(matches, 1):
                print(f"\nMatch {i}:")
                print(f"  Full match: '{match.group()}'")
                print(f"  Position: {match.start()}-{match.end()}")
                
                if match.groups():
                    print(f"  Groups: {match.groups()}")
                
                if hasattr(match, 'groupdict') and match.groupdict():
                    print(f"  Named groups: {match.groupdict()}")
        else:
            print("No matches found")
            
        # Show pattern analysis
        print(f"\nPattern Analysis:")
        print(f"  Groups in pattern: {compiled_pattern.groups}")
        if hasattr(compiled_pattern, 'groupindex'):
            print(f"  Named groups: {compiled_pattern.groupindex}")
            
    except re.error as e:
        print(f"Regex Error: {e}")
    
    print("=" * 60)

# Debug examples
debug_regex(
    r'(?P<protocol>https?)://(?P<domain>[^/]+)(?P<path>/.*)?',
    "Visit https://www.example.com/path/to/page",
    "URL parsing with named groups"
)

debug_regex(
    r'(\d{2})/(\d{2})/(\d{4})',
    "Today is 12/25/2023 and tomorrow is 12/26/2023",
    "Date extraction with numbered groups"
)

Debugging Regex: URL parsing with named groups
Pattern: (?P<protocol>https?)://(?P<domain>[^/]+)(?P<path>/.*)?
Text: Visit https://www.example.com/path/to/page
--------------------------------------------------
Found 1 match(es):

Match 1:
  Full match: 'https://www.example.com/path/to/page'
  Position: 6-42
  Groups: ('https', 'www.example.com', '/path/to/page')
  Named groups: {'protocol': 'https', 'domain': 'www.example.com', 'path': '/path/to/page'}

Pattern Analysis:
  Groups in pattern: 3
  Named groups: {'protocol': 1, 'domain': 2, 'path': 3}
============================================================
Debugging Regex: Date extraction with numbered groups
Pattern: (\d{2})/(\d{2})/(\d{4})
Text: Today is 12/25/2023 and tomorrow is 12/26/2023
--------------------------------------------------
Found 2 match(es):

Match 1:
  Full match: '12/25/2023'
  Position: 9-19
  Groups: ('12', '25', '2023')

Match 2:
  Full match: '12/26/2023'
  Position: 36-46
  Groups: ('12', '26', '2023')

Pattern Analysis:
  Groups in pattern: 3
  Named groups: {}
============================================================
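
Inspecting matches by eye covers debugging; for testing, a small table of inputs that should and should not match is easier to rerun after each pattern change. The helper below, `check_pattern`, is a hypothetical sketch of that idea and uses `fullmatch` so that partial matches do not count.

```python
import re
from typing import Iterable, List

def check_pattern(pattern: str, should_match: Iterable[str],
                  should_not_match: Iterable[str]) -> List[str]:
    """Hypothetical helper: return failure descriptions (empty list means all passed)."""
    compiled = re.compile(pattern)
    failures = []
    for text in should_match:
        if compiled.fullmatch(text) is None:
            failures.append(f"expected match: {text!r}")
    for text in should_not_match:
        if compiled.fullmatch(text) is not None:
            failures.append(f"unexpected match: {text!r}")
    return failures

date_failures = check_pattern(
    r'(\d{2})/(\d{2})/(\d{4})',
    should_match=['12/25/2023', '01/01/2000'],
    should_not_match=['1/1/2023', '12-25-2023', '99/99/9999'],
)
print(date_failures)  # ["unexpected match: '99/99/9999'"]
```

The reported failure is the useful part: the date pattern checks the shape of the text, not whether the month and day are valid, which is usually better handled by `datetime.strptime` after matching.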


## 9. Working with Large Texts

In [59]:
# Working with Large Texts using re.finditer()

def process_large_text():
    """Demonstrate efficient processing of large texts."""
    
    # Simulate a large text file
    large_text = """
    This is a sample document with multiple email addresses like john@example.com,
    jane.doe@company.org, and support@help.com. There are also phone numbers
    such as (555) 123-4567, 800-555-0199, and +1-888-555-1234.
    
    URLs in the text include https://www.example.com, http://subdomain.test.org,
    and ftp://files.company.com/documents/.
    
    Some dates mentioned: 2023-12-25, 01/15/2024, and March 10, 2024.
    """ * 100  # Repeat to simulate larger text
    
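    patterns = {
        'email': r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
        # No whitespace separator is allowed, so "(555) 123-4567" is skipped
        'phone': r'\+?1?[-.]?\(?([0-9]{3})\)?[-.]?([0-9]{3})[-.]?([0-9]{4})',
        # Matches up to the next whitespace, so trailing punctuation is included
        'url': r'https?://[^\s]+',
        'date': r'\d{4}-\d{2}-\d{2}|\d{2}/\d{2}/\d{4}|\w+ \d{1,2}, \d{4}',
    }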
    
    print(f"Processing large text ({len(large_text):,} characters):")
    print()
    
    # Use finditer for memory-efficient processing: matches are consumed one
    # at a time instead of being collected into a list first
    for pattern_name, pattern in patterns.items():
        print(f"Finding {pattern_name}s:")
        
        total_matches = 0
        unique_matches = set()
        for match in re.finditer(pattern, large_text, re.IGNORECASE):
            total_matches += 1
            unique_matches.add(match.group())
        
        print(f"  Total matches: {total_matches}")
        print(f"  Unique matches: {len(unique_matches)}")
        
        # Show first few unique matches
        for i, match in enumerate(sorted(unique_matches)[:3]):
            print(f"    {i+1}. {match}")
        
        if len(unique_matches) > 3:
            print(f"    ... and {len(unique_matches) - 3} more")
        
        print()

process_large_text()

Processing large text (43,300 characters):

Finding emails:
  Total matches: 300
  Unique matches: 3
    1. jane.doe@company.org
    2. john@example.com
    3. support@help.com

Finding phones:
  Total matches: 200
  Unique matches: 2
    1. +1-888-555-1234
    2. 800-555-0199

Finding urls:
  Total matches: 200
  Unique matches: 2
    1. http://subdomain.test.org,
    2. https://www.example.com,

Finding dates:
  Total matches: 300
  Unique matches: 3
    1. 01/15/2024
    2. 2023-12-25
    3. March 10, 2024
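
Two details in the output are worth a follow-up: the URLs keep their trailing commas, and the whole text is held in memory as one string. The sketch below shows one way to address both by reading line by line and trimming sentence punctuation. `iter_urls` is a hypothetical helper, and `io.StringIO` stands in for an open file.

```python
import io
import re
from typing import Iterable, Iterator

URL_RE = re.compile(r'https?://\S+')

def iter_urls(lines: Iterable[str]) -> Iterator[str]:
    """Yield URLs one at a time, trimming punctuation that belongs to the sentence."""
    for line in lines:
        for match in URL_RE.finditer(line):
            yield match.group().rstrip('.,;:!?)')

# io.StringIO stands in for an open file; a real file object iterates the same way
sample = io.StringIO(
    "URLs in the text include https://www.example.com, http://subdomain.test.org,\n"
    "and ftp://files.company.com/documents/.\n"
)
urls = list(iter_urls(sample))
print(urls)  # ['https://www.example.com', 'http://subdomain.test.org']
```

There are trade-offs in both choices. Line-by-line reading cannot find matches that span a line break, and stripping `)` would damage URLs that legitimately end in a parenthesis.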



## 10. Final Exercises and Projects

Put your advanced regex skills to the test with these challenging exercises.

In [61]:
# Exercise 1: Build a Log Analyzer

class LogAnalyzer:
    """Analyze log files using advanced regex patterns."""
    
    def __init__(self):
        # Common log patterns. re.MULTILINE makes ^ and $ anchor at every line,
        # which finditer() needs when the log content spans several lines
        self.patterns = {
            'apache_common': re.compile(
                r'^(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<timestamp>[^\]]+)\] '
                r'"(?P<method>\w+) (?P<url>\S+) (?P<protocol>[^"]+)" '
                r'(?P<status>\d+) (?P<size>\d+)$',
                re.MULTILINE
            ),
            'nginx_error': re.compile(
                r'^(?P<timestamp>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) '
                r'\[(?P<level>\w+)\] (?P<pid>\d+)#(?P<tid>\d+): '
                r'(?P<message>.*)$',
                re.MULTILINE
            ),
            'python_traceback': re.compile(
                r'Traceback \(most recent call last\):.*?'
                r'(?P<exception>\w+Error): (?P<message>.*)',
                re.DOTALL
            )
        }
    
    def analyze_log(self, log_content: str, log_type: str = 'apache_common'):
        """Analyze log content and extract statistics."""
        if log_type not in self.patterns:
            raise ValueError(f"Unknown log type: {log_type}")
        
        pattern = self.patterns[log_type]
        matches = list(pattern.finditer(log_content))
        
        if not matches:
            return {"error": "No matches found"}
        
        # Extract statistics based on log type
        if log_type == 'apache_common':
            return self._analyze_apache_logs(matches)
        elif log_type == 'nginx_error':
            return self._analyze_nginx_errors(matches)
        
        return {"matches": len(matches)}
    
    def _analyze_apache_logs(self, matches):
        """Analyze Apache access logs."""
        stats = {
            'total_requests': len(matches),
            'unique_ips': set(),
            'status_codes': {},
            'methods': {},
            'top_urls': {},
        }
        
        for match in matches:
            data = match.groupdict()
            
            # Collect statistics
            stats['unique_ips'].add(data['ip'])
            
            status = data['status']
            stats['status_codes'][status] = stats['status_codes'].get(status, 0) + 1
            
            method = data['method']
            stats['methods'][method] = stats['methods'].get(method, 0) + 1
            
            url = data['url']
            stats['top_urls'][url] = stats['top_urls'].get(url, 0) + 1
        
        # Convert set to count
        stats['unique_ips'] = len(stats['unique_ips'])
        
        # Keep the five most requested URLs
        stats['top_urls'] = dict(sorted(stats['top_urls'].items(), 
                                      key=lambda x: x[1], reverse=True)[:5])
        
        return stats
    
    def _analyze_nginx_errors(self, matches):
        """Analyze Nginx error logs."""
        stats = {
            'total_errors': len(matches),
            'error_levels': {},
            'common_messages': {},
        }
        
        for match in matches:
            data = match.groupdict()
            
            level = data['level']
            stats['error_levels'][level] = stats['error_levels'].get(level, 0) + 1
            
            message = data['message'][:50]  # First 50 chars
            stats['common_messages'][message] = stats['common_messages'].get(message, 0) + 1
        
        return stats

# Test the log analyzer
sample_apache_log = """
192.168.1.1 - - [25/Dec/2023:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 1234
10.0.0.1 - user [25/Dec/2023:10:01:00 +0000] "POST /api/login HTTP/1.1" 401 567
192.168.1.1 - - [25/Dec/2023:10:02:00 +0000] "GET /images/logo.png HTTP/1.1" 404 0
203.0.113.1 - - [25/Dec/2023:10:03:00 +0000] "GET /index.html HTTP/1.1" 200 1234
""".strip()

analyzer = LogAnalyzer()
results = analyzer.analyze_log(sample_apache_log, 'apache_common')

print("Log Analysis Results:")
for key, value in results.items():
    print(f"{key}: {value}")

Log Analysis Results:
total_requests: 4
unique_ips: 3
status_codes: {'200': 2, '401': 1, '404': 1}
methods: {'GET': 3, 'POST': 1}
top_urls: {'/index.html': 2, '/api/login': 1, '/images/logo.png': 1}
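
Multi-line input is where anchored log patterns tend to go wrong: without `re.MULTILINE`, `^` and `$` anchor only at the start and end of the whole string, so a pattern written for one line finds nothing in a multi-line block. The sketch below applies an nginx-style error pattern like the one in `LogAnalyzer` to three invented sample lines, with and without the flag, and tallies levels with `collections.Counter`.

```python
import re
from collections import Counter

# re.MULTILINE lets ^ and $ anchor at every line, not only at the ends of the string
NGINX_ERROR = re.compile(
    r'^(?P<timestamp>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) '
    r'\[(?P<level>\w+)\] (?P<pid>\d+)#(?P<tid>\d+): '
    r'(?P<message>.*)$',
    re.MULTILINE,
)

# Invented sample lines in the general shape of an nginx error log
sample_nginx_log = """\
2023/12/25 10:00:00 [error] 1234#0: *1 open() "/var/www/missing.html" failed
2023/12/25 10:00:05 [warn] 1234#0: *2 upstream response is buffered
2023/12/25 10:00:09 [error] 1235#0: *3 connect() failed
"""

levels = Counter(m['level'] for m in NGINX_ERROR.finditer(sample_nginx_log))
print(levels)  # Counter({'error': 2, 'warn': 1})

# The same pattern text without the flag finds nothing in multi-line input
matches_without_flag = len(re.findall(NGINX_ERROR.pattern, sample_nginx_log))
print(matches_without_flag)  # 0
```

An alternative that avoids the flag entirely is to call `match()` on each line from `log_content.splitlines()`, which also makes it easy to report the lines that failed to parse.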


In [100]:
# Exercise 2: Advanced Data Extraction and Cleaning

class DataExtractor:
    """Extract and clean data from unstructured text using advanced regex."""
    
    def __init__(self):
        self.patterns = {
            'financial': {
                'currency': re.compile(r'\$[\d,]+(?:\.\d{2})?|\d+(?:\.\d{2})?\s*(?:dollars?|USD|\$)'),
                'percentage': re.compile(r'\d+(?:\.\d+)?%'),
                'stock_symbol': re.compile(r'\b[A-Z]{2,5}\b(?=\s*(?:stock|shares?|ticker))'),
            },
            'personal': {
                'ssn': re.compile(r'\b\d{3}-\d{2}-\d{4}\b'),
                'credit_card': re.compile(r'\b(?:\d{4}[- ]?){3}\d{4}\b'),
                'driver_license': re.compile(r'\b[A-Z]\d{7,8}\b'),
            },
            'technical': {
                'ip_address': re.compile(r'\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b'),
                'mac_address': re.compile(r'\b[0-9A-Fa-f]{2}(?:[:-][0-9A-Fa-f]{2}){5}\b'),
                'version': re.compile(r'\bv?\d+(?:\.\d+){1,3}\b'),
            }
        }
    
    def extract_data(self, text: str, category: Optional[str] = None) -> dict:
        """Extract data from text, optionally filtering by category."""
        results = {}
        
        categories = [category] if category else self.patterns.keys()
        
        for cat in categories:
            if cat not in self.patterns:
                continue
                
            results[cat] = {}
            
            for pattern_name, pattern in self.patterns[cat].items():
                matches = pattern.findall(text)
                if matches:
                    results[cat][pattern_name] = list(set(matches))  # Remove duplicates (set order is arbitrary)
        
        return results
    
    def clean_sensitive_data(self, text: str, replacement: str = "[REDACTED]") -> str:
        """Clean sensitive data from text."""
        sensitive_patterns = {
            **self.patterns['personal'],
            'email': re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'),
            # (?<!\w) instead of a leading \b, so an opening "(" is redacted with the number
            'phone': re.compile(r'(?<!\w)\(?[0-9]{3}\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}\b'),
        }
        
        cleaned_text = text
        
        for pattern_name, pattern in sensitive_patterns.items():
            cleaned_text = pattern.sub(replacement, cleaned_text)
        
        return cleaned_text

# Test the data extractor
sample_text = """
Financial Report Q4 2023:
Revenue increased by 15.5% to $2,450,000.00 this quarter.
AAPL stock performed well, while MSFT shares declined 3.2%.

Contact Information:
John Smith: 555-123-4567, john.smith@company.com
SSN: 123-45-6789, Driver License: A1234567
Credit Card: 4532-1234-5678-9012

Technical Details:
Server IP: 192.168.1.100, MAC: 00:1B:44:11:3A:B7
Software version: v2.1.3, Database version: 5.7.2
"""

extractor = DataExtractor()

print("Data Extraction Results:")
extracted = extractor.extract_data(sample_text)
for category, data in extracted.items():
    if data:  # Only show categories with data
        print(f"\n{category.title()} Data:")
        for pattern_name, matches in data.items():
            print(f"  {pattern_name}: {matches}")

print("\n" + "="*50)
print("Cleaned Text (sensitive data redacted):")
cleaned = extractor.clean_sensitive_data(sample_text)
print(cleaned)

Data Extraction Results:

Financial Data:
  currency: ['$2,450,000.00']
  percentage: ['15.5%', '3.2%']
  stock_symbol: ['MSFT', 'AAPL']

Personal Data:
  ssn: ['123-45-6789']
  credit_card: ['4532-1234-5678-9012']
  driver_license: ['A1234567']

Technical Data:
  ip_address: ['192.168.1.100']
  mac_address: ['00:1B:44:11:3A:B7']
  version: ['5.7.2', '3.2', '15.5', 'v2.1.3', '192.168.1.100', '000.00']

==================================================
Cleaned Text (sensitive data redacted):

Financial Report Q4 2023:
Revenue increased by 15.5% to $2,450,000.00 this quarter.
AAPL stock performed well, while MSFT shares declined 3.2%.

Contact Information:
John Smith: [REDACTED], [REDACTED]
SSN: [REDACTED], Driver License: [REDACTED]
Credit Card: [REDACTED]

Technical Details:
Server IP: 192.168.1.100, MAC: 00:1B:44:11:3A:B7
Software version: v2.1.3, Database version: 5.7.2
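
The `version` results above include percentages, an IP address, and part of a currency amount, because `\b` only checks for a word boundary. Lookarounds can rule out that context. The stricter pattern below is an assumed tightening for this kind of text rather than a general version-number grammar; it accepts two or three numeric parts and both pattern names are illustrative.

```python
import re

sample = (
    "Revenue increased by 15.5% to $2,450,000.00 this quarter. "
    "Server IP: 192.168.1.100. "
    "Software version: v2.1.3, Database version: 5.7.2"
)

# The pattern from DataExtractor: \b alone accepts many unrelated numbers
loose_version = re.compile(r'\bv?\d+(?:\.\d+){1,3}\b')

# Assumed tightening: two or three numeric parts, not preceded by a word
# character, ".", "," or "$", and not followed by a digit, "." or "%"
strict_version = re.compile(r'(?<![\w.,$])v?\d+(?:\.\d+){1,2}(?![\d.%])')

loose = sorted(set(loose_version.findall(sample)))
strict = sorted(set(strict_version.findall(sample)))

print(loose)   # ['000.00', '15.5', '192.168.1.100', '5.7.2', 'v2.1.3']
print(strict)  # ['5.7.2', 'v2.1.3']
```

Using `sorted(set(...))` instead of `list(set(...))` also makes the result order reproducible between runs.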



## Conclusion

Congratulations! You've completed this comprehensive advanced regex tutorial. You've learned:

### Key Concepts Covered:
1. **Lookahead and Lookbehind Assertions** - For context-aware matching
2. **Capturing Groups and Backreferences** - For extracting and referencing parts of matches
3. **Performance Optimization** - Avoiding catastrophic backtracking and using efficient patterns
4. **Conditional Patterns** - Matching different alternatives based on conditions
5. **Unicode and International Text** - Working with non-ASCII characters
6. **Real-world Applications** - Log parsing, data extraction, and validation
7. **Debugging and Testing** - Tools and techniques for regex development
8. **Advanced Python Features** - Custom classes, function-based substitutions, and optimization

### Best Practices to Remember:
- **Compile patterns** when using them repeatedly
- **Use specific patterns** instead of overly general ones
- **Test thoroughly** with edge cases
- **Consider performance** implications of complex patterns
- **Document complex patterns** for maintainability
- **Use appropriate tools** - sometimes regex isn't the best solution

### Next Steps:
- Practice with real-world data from your domain
- Explore regex in other programming languages
- Learn about parsing libraries for complex structured data
- Study formal language theory for deeper understanding

### Resources for Further Learning:
- [Python re module documentation](https://docs.python.org/3/library/re.html)
- [RegexOne Interactive Tutorial](https://regexone.com/)
- [Regex101 Online Tester](https://regex101.com/)
- [Regular-Expressions.info](https://www.regular-expressions.info/)

Keep practicing and experimenting with these advanced concepts. Regular expressions are a powerful tool that becomes more valuable as you master their intricacies!