# Regular Expressions (Regex) Tutorial

## Overview
Regular expressions are powerful tools for pattern matching and text processing. This tutorial covers regex fundamentals through advanced techniques with practical exercises.

## Learning Objectives
- Understand regex syntax and metacharacters
- Master pattern matching techniques
- Apply regex to real-world text processing tasks
- Learn best practices and common pitfalls

## Import Required Libraries

In [2]:
import re
import json
from collections import defaultdict

---

# 📚 Regular Expressions Cheatsheet

## Basic Metacharacters

| Pattern | Description | Example | Matches |
|---------|-------------|---------|----------|
| `.` | Any character except newline | `a.c` | "abc", "axc", "a5c" |
| `^` | Start of string | `^Hello` | "Hello world" (at start) |
| `$` | End of string (or just before a trailing newline) | `world$` | "Hello world" (at end) |
| `*` | 0 or more repetitions | `ab*c` | "ac", "abc", "abbbbc" |
| `+` | 1 or more repetitions | `ab+c` | "abc", "abbbbc" (not "ac") |
| `?` | 0 or 1 repetition | `ab?c` | "ac", "abc" (not "abbc") |
| `\|` | OR operator | `cat\|dog` | "cat" or "dog" |
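
As a quick check of the table above, the following sketch runs each operator against a handful of sample strings. The sample strings are chosen here for illustration; `re.fullmatch` is used so that the whole string has to match.

```python
import re

samples = ["ac", "abc", "abbc", "abbbbc"]  # illustrative inputs

# Keep only the samples that match each pattern in full.
star = [s for s in samples if re.fullmatch(r"ab*c", s)]      # 0 or more "b"
plus = [s for s in samples if re.fullmatch(r"ab+c", s)]      # 1 or more "b"
optional = [s for s in samples if re.fullmatch(r"ab?c", s)]  # 0 or 1 "b"

# "." matches any character except a newline.
dot = re.findall(r"a.c", "abc a5c a\nc")

# "|" tries the left alternative first, then the right one.
either = re.findall(r"cat|dog", "cat, dog, cow")

print(star)      # ['ac', 'abc', 'abbc', 'abbbbc']
print(plus)      # ['abc', 'abbc', 'abbbbc']
print(optional)  # ['ac', 'abc']
print(dot)       # ['abc', 'a5c']
print(either)    # ['cat', 'dog']
```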

## Character Classes

| Pattern | Description | Example | Matches |
|---------|-------------|---------|----------|
| `[abc]` | Any character in brackets | `[aeiou]` | Any vowel |
| `[^abc]` | Any character NOT in brackets | `[^0-9]` | Any non-digit |
| `[a-z]` | Character range | `[a-zA-Z]` | Any letter |
| `\d` | Any digit | `\d+` | "123", "5" |
| `\D` | Any non-digit | `\D+` | "abc", "!@#" |
| `\w` | Word character (letter, digit, _) | `\w+` | "hello", "test_123" |
| `\W` | Non-word character | `\W+` | "!@#", "   " |
| `\s` | Whitespace character | `\s+` | " ", "\t", "\n" |
| `\S` | Non-whitespace character | `\S+` | "hello", "123!" |
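
The shorthand classes can be compared side by side on one string. This is a small sketch with made-up sample text; it also shows that, for `str` patterns in Python 3, `\w` is Unicode-aware unless `re.ASCII` is passed.

```python
import re

text = "test_123 costs $5!"  # illustrative input

digits = re.findall(r"\d+", text)     # runs of digits
words = re.findall(r"\w+", text)      # runs of letters, digits, underscores
non_words = re.findall(r"\W+", text)  # runs of everything else

# \w matches accented letters by default; re.ASCII restricts it to [a-zA-Z0-9_].
accented = "na\u00efve caf\u00e9"
unicode_words = re.findall(r"\w+", accented)
ascii_words = re.findall(r"\w+", accented, re.ASCII)

print(digits)       # ['123', '5']
print(words)        # ['test_123', 'costs', '5']
print(non_words)    # [' ', ' $', '!']
print(ascii_words)  # ['na', 've', 'caf']
```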

## Quantifiers

| Pattern | Description | Example | Matches |
|---------|-------------|---------|----------|
| `{n}` | Exactly n repetitions | `\d{3}` | "123" (exactly 3 digits) |
| `{n,}` | n or more repetitions | `\d{3,}` | "123", "12345" |
| `{n,m}` | Between n and m repetitions | `\d{3,5}` | "123", "1234", "12345" |
| `*?` | Non-greedy 0 or more | `<.*?>` | "<tag>" (not "<tag>text</tag>") |
| `+?` | Non-greedy 1 or more | `\d+?` | First digit in "123" |
| `??` | Non-greedy 0 or 1 | `colou??r` | "color" before "colour" |
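
The difference between greedy and non-greedy quantifiers is easiest to see on a string with several candidate end points. A minimal sketch, using an illustrative HTML fragment:

```python
import re

html = "<b>bold</b> and <i>italic</i>"  # illustrative input

# Greedy: ".*" runs to the LAST ">" in the string, giving one long match.
greedy = re.findall(r"<.*>", html)

# Non-greedy: ".*?" stops at the FIRST ">" it can, giving one match per tag.
lazy = re.findall(r"<.*?>", html)

# "+?" takes as little as possible, so only the first digit is matched.
first_digit = re.search(r"\d+?", "123").group()

# "{n,m}" is greedy by default and takes up to m repetitions.
bounded = re.search(r"\d{2,3}", "12345").group()

print(greedy)       # ['<b>bold</b> and <i>italic</i>']
print(lazy)         # ['<b>', '</b>', '<i>', '</i>']
print(first_digit)  # 1
print(bounded)      # 123
```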

## Groups and Capturing

| Pattern | Description | Example | Usage |
|---------|-------------|---------|--------|
| `(abc)` | Capturing group | `(\d+)-(\d+)` | Capture parts separately |
| `(?:abc)` | Non-capturing group | `(?:Mr\|Mrs) Smith` | Group without capture |
| `(?P<name>abc)` | Named group | `(?P<year>\d{4})` | Access by name |
| `\1, \2` | Backreference | `(\w+) \1` | "hello hello" |
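
The group types in the table behave differently with `re.findall` and with match objects. The sketch below uses illustrative inputs to show named groups, the effect of capturing versus non-capturing groups on `findall`, and a backreference that finds doubled words.

```python
import re

# Named groups can be read by name or collected with groupdict().
m = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})",
              "Released on 2023-10-15.")
year = m.group("year")
parts = m.groupdict()

# With a capturing group, findall returns the group, not the whole match.
titles = re.findall(r"(Mr|Mrs) Smith", "Mr Smith and Mrs Smith")
# With a non-capturing group, findall returns the whole match.
full_names = re.findall(r"(?:Mr|Mrs) Smith", "Mr Smith and Mrs Smith")

# \1 must repeat exactly what group 1 captured.
doubled = re.findall(r"\b(\w+) \1\b", "this is is a test test case")

print(year)        # 2023
print(parts)       # {'year': '2023', 'month': '10', 'day': '15'}
print(titles)      # ['Mr', 'Mrs']
print(full_names)  # ['Mr Smith', 'Mrs Smith']
print(doubled)     # ['is', 'test']
```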

## Anchors and Boundaries

| Pattern | Description | Example | Matches |
|---------|-------------|---------|----------|
| `\b` | Word boundary | `\bcat\b` | "cat" but not "category" |
| `\B` | Non-word boundary | `\Bcat\B` | "cat" in "concatenate" (not in "category") |
| `\A` | Start of string | `\AHello` | Only at very beginning |
| `\Z` | End of string | `world\Z` | Only at very end |
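
Boundaries match positions rather than characters, so it helps to look at where a match starts. A short sketch with an illustrative string; it also contrasts `^` under `re.MULTILINE` with `\A`, which is not affected by that flag.

```python
import re

text = "cat category concatenate"  # illustrative input

# \bcat\b: "cat" as a whole word only (offset 0).
whole = [m.start() for m in re.finditer(r"\bcat\b", text)]

# \Bcat\B: "cat" with word characters on both sides (inside "concatenate").
inner = [m.start() for m in re.finditer(r"\Bcat\B", text)]

lines = "one\ntwo"
line_starts = re.findall(r"^\w+", lines, re.M)    # ^ matches at every line start
string_start = re.findall(r"\A\w+", lines, re.M)  # \A matches only at offset 0

print(whole)         # [0]
print(inner)         # [16]
print(line_starts)   # ['one', 'two']
print(string_start)  # ['one']
```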

## Lookahead and Lookbehind

| Pattern | Description | Example | Matches |
|---------|-------------|---------|----------|
| `(?=abc)` | Positive lookahead | `\d+(?=px)` | "100" in "100px" |
| `(?!abc)` | Negative lookahead | `\d+(?!px)` | "200" in "200em"; in "100px" it still matches "10" by backtracking |
| `(?<=abc)` | Positive lookbehind | `(?<=\$)\d+` | "100" in "$100" |
| `(?<!abc)` | Negative lookbehind | `(?<!\$)\d+` | "100" in "€100"; in "$100" it still matches "00" |
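
Negative lookarounds combined with `\d+` have a common pitfall: the engine can shorten or shift the digit run until the assertion passes. The sketch below shows the effect on illustrative inputs, along with one possible way to pin the match to the full number.

```python
import re

css = "100px 200em"      # illustrative input
prices = "$100 and 200"  # illustrative input

positive_ahead = re.findall(r"\d+(?=px)", css)       # digits followed by "px"
positive_behind = re.findall(r"(?<=\$)\d+", prices)  # digits preceded by "$"

# Pitfall: \d+ backtracks from "100" to "10", which is not followed by "px".
naive_ahead = re.findall(r"\d+(?!px)", css)
# Also forbid a following digit so the run cannot be shortened.
strict_ahead = re.findall(r"\d+(?!\d)(?!px)", css)

# Pitfall: the match can start at the second digit, which follows "1", not "$".
naive_behind = re.findall(r"(?<!\$)\d+", prices)
# Also forbid a preceding digit so the match must start at the first digit.
strict_behind = re.findall(r"(?<![$\d])\d+", prices)

print(naive_ahead, strict_ahead)    # ['10', '200'] ['200']
print(naive_behind, strict_behind)  # ['00', '200'] ['200']
```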

## Common Patterns

| Use Case | Pattern | Example |
|----------|---------|----------|
| Email | `[\w.-]+@[\w.-]+\.\w+` | user@example.com |
| Phone (US) | `\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}` | (555) 123-4567 |
| URL | `https?://[\w.-]+` | https://example.com |
| IP Address | `\b(?:\d{1,3}\.){3}\d{1,3}\b` | 192.168.1.1 |
| Date (MM/DD/YYYY) | `\d{1,2}/\d{1,2}/\d{4}` | 12/31/2023 |
| Time (24h) | `\d{1,2}:\d{2}` | 14:30 |
| Hex Color | `#[0-9a-fA-F]{6}` | #FF5733 |
| Credit Card | `\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}` | 1234 5678 9012 3456 |
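
These patterns check the shape of a value, not its validity. The IP address pattern, for example, accepts `999.1.1.1`. One possible approach, sketched below, is to let the regex find candidates and hand the range check to the standard-library `ipaddress` module. The helper name `find_valid_ips` is made up for this example.

```python
import ipaddress
import re

# Same shape-only pattern as in the table above.
IP_SHAPE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def find_valid_ips(text):
    """Return IPv4-looking substrings that are also valid addresses."""
    valid = []
    for candidate in IP_SHAPE.findall(text):
        try:
            ipaddress.IPv4Address(candidate)  # raises ValueError if out of range
        except ValueError:
            continue
        valid.append(candidate)
    return valid

sample = "ok 192.168.1.1, bad 999.1.1.1"  # illustrative input
shape_only = IP_SHAPE.findall(sample)
checked = find_valid_ips(sample)

print(shape_only)  # ['192.168.1.1', '999.1.1.1']
print(checked)     # ['192.168.1.1']
```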

## Python re Module Functions

| Function | Description | Example |
|----------|-------------|----------|
| `re.search()` | Find first match | `re.search(r'\d+', text)` |
| `re.match()` | Match at beginning | `re.match(r'Hello', text)` |
| `re.findall()` | Find all matches | `re.findall(r'\d+', text)` |
| `re.finditer()` | Iterator of matches | `re.finditer(r'\d+', text)` |
| `re.sub()` | Replace matches | `re.sub(r'\d+', 'X', text)` |
| `re.split()` | Split by pattern | `re.split(r'\s+', text)` |
| `re.compile()` | Compile pattern | `pattern = re.compile(r'\d+')` |
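
The functions differ mainly in where they look and what they return. A minimal sketch on one illustrative string; `re.fullmatch`, which is not in the table, is included because it is the usual choice for validation.

```python
import re

text = "Order 66 shipped, order 67 pending"  # illustrative input

at_start = re.match(r"\d+", text)          # None: the text starts with "Order"
first = re.search(r"\d+", text).group()    # first match anywhere
whole = re.fullmatch(r"\d+", "66")         # the entire string must match
all_numbers = re.findall(r"\d+", text)     # list of matched strings
spans = [(m.group(), m.span()) for m in re.finditer(r"\d+", text)]
masked = re.sub(r"\d+", "X", text)         # replace every match
pieces = re.split(r",\s*", text)           # split on comma plus optional spaces

# A compiled pattern exposes the same operations as methods.
number = re.compile(r"\d+")
compiled_numbers = number.findall(text)

print(at_start)     # None
print(first)        # 66
print(all_numbers)  # ['66', '67']
print(spans)        # [('66', (6, 8)), ('67', (24, 26))]
print(masked)       # Order X shipped, order X pending
print(pieces)       # ['Order 66 shipped', 'order 67 pending']
```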

## Flags

| Flag | Description | Usage |
|------|-------------|--------|
| `re.IGNORECASE` or `re.I` | Case insensitive | `re.search(r'hello', text, re.I)` |
| `re.MULTILINE` or `re.M` | ^ and $ match line boundaries | `re.findall(r'^\w+', text, re.M)` |
| `re.DOTALL` or `re.S` | . matches newlines too | `re.search(r'.*', text, re.S)` |
| `re.VERBOSE` or `re.X` | Allow comments in regex | `re.compile(r'\d+ # digits', re.X)` |
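
Flags can be combined with `|`. The sketch below uses an illustrative three-line log string to show the effect of each flag, and a verbose pattern with inline comments.

```python
import re

log = "error: disk full\nINFO: all ok\nError: timeout"  # illustrative input

# No flags: ^ anchors only at the start of the string, matching is case-sensitive.
plain = re.findall(r"^error", log)
# re.I ignores case; re.M lets ^ match at the start of every line.
flagged = re.findall(r"^error", log, re.I | re.M)

# By default "." stops at a newline; re.S lets it cross lines.
no_dotall = re.search(r"disk.*ok", log)
with_dotall = re.search(r"disk.*ok", log, re.S)

# re.X ignores whitespace in the pattern and allows "#" comments.
date_re = re.compile(r"""
    (?P<year>\d{4})    # four-digit year
    -                  # literal separator
    (?P<month>\d{2})   # two-digit month
""", re.X)
parts = date_re.search("due 2023-10").groupdict()

print(plain)    # ['error']
print(flagged)  # ['error', 'Error']
print(parts)    # {'year': '2023', 'month': '10'}
```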

---

# Exercise 1: Basic Pattern Matching

## Problem Statement
Write functions to perform basic pattern matching tasks:

1. `find_digits(text)` - Find all sequences of digits in the text
2. `find_words(text)` - Find all words (sequences of letters)
3. `find_emails(text)` - Find all email addresses
4. `validate_phone(phone)` - Check if a phone number is valid (US format)

## Input Examples
```python
text1 = "I have 25 apples and 30 oranges. Call me at 555-1234."
text2 = "Contact john.doe@email.com or jane_smith@company.org for details."
phone1 = "(555) 123-4567"
phone2 = "555.123.4567"
phone3 = "123-45-6789"  # Invalid format
```

## Expected Output
```python
find_digits(text1) → ['25', '30', '555', '1234']
find_words(text1) → ['I', 'have', 'apples', 'and', 'oranges', 'Call', 'me', 'at']
find_emails(text2) → ['john.doe@email.com', 'jane_smith@company.org']
validate_phone(phone1) → True
validate_phone(phone2) → True
validate_phone(phone3) → False
```

In [None]:
# Your solution here
def find_digits(text):
    # Write your regex pattern here
    pass

def find_words(text):
    # Write your regex pattern here
    pass

def find_emails(text):
    # Write your regex pattern here
    pass

def validate_phone(phone):
    # Write your regex pattern here
    pass

# Test your functions
text1 = "I have 25 apples and 30 oranges. Call me at 555-1234."
text2 = "Contact john.doe@email.com or jane_smith@company.org for details."
phone1 = "(555) 123-4567"
phone2 = "555.123.4567"
phone3 = "123-45-6789"

print("Digits:", find_digits(text1))
print("Words:", find_words(text1))
print("Emails:", find_emails(text2))
print("Phone validation:", validate_phone(phone1), validate_phone(phone2), validate_phone(phone3))

### Solution for Exercise 1

In [3]:
def find_digits(text):
    r"""
    Find all sequences of digits in the text.
    Pattern: \d+ matches one or more digits
    """
    return re.findall(r'\d+', text)

def find_words(text):
    r"""
    Find all words (sequences of letters).
    Pattern: [a-zA-Z]+ matches one or more letters
    Alternative: \b[a-zA-Z]+\b for word boundaries
    """
    return re.findall(r'[a-zA-Z]+', text)

def find_emails(text):
    r"""
    Find all email addresses.
    Pattern breakdown:
    [\w.-]+ : username part (letters, digits, underscores, dots, hyphens)
    @ : literal @ symbol
    [\w.-]+ : domain name part
    \. : literal dot
    \w+ : top-level domain
    """
    return re.findall(r'[\w.-]+@[\w.-]+\.\w+', text)

def validate_phone(phone):
    r"""
    Validate US phone number format.
    Pattern breakdown:
    ^\(? : start, optional opening parenthesis
    \d{3} : exactly 3 digits (area code)
    \)? : optional closing parenthesis
    [-\s.]? : optional separator (dash, space, or dot)
    \d{3} : exactly 3 digits
    [-\s.]? : optional separator
    \d{4}$ : exactly 4 digits, end of string
    """
    pattern = r'^\(?\d{3}\)?[-\s.]?\d{3}[-\s.]?\d{4}$'
    return bool(re.match(pattern, phone))

# Test the solutions
text1 = "I have 25 apples and 30 oranges. Call me at 555-1234."
text2 = "Contact john.doe@email.com or jane_smith@company.org for details."
phone1 = "(555) 123-4567"
phone2 = "555.123.4567"
phone3 = "123-45-6789"

print("Digits:", find_digits(text1))
print("Words:", find_words(text1))
print("Emails:", find_emails(text2))
print("Phone validation:", validate_phone(phone1), validate_phone(phone2), validate_phone(phone3))

Digits: ['25', '30', '555', '1234']
Words: ['I', 'have', 'apples', 'and', 'oranges', 'Call', 'me', 'at']
Emails: ['john.doe@email.com', 'jane_smith@company.org']
Phone validation: True True False
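
Two edge cases are worth knowing about when validating with `re.match` and `^...$`: `$` also matches just before a trailing newline, and optional parentheses can be unbalanced. The sketch below demonstrates both with the solution's pattern and shows one possible stricter variant. `STRICT_PHONE` and `validate_phone_strict` are names made up for this example, not part of the exercise.

```python
import re

# Same pattern as validate_phone above, without the ^ and $ anchors.
PHONE = r'\(?\d{3}\)?[-\s.]?\d{3}[-\s.]?\d{4}'

# Hypothetical stricter variant: parentheses must come as a pair.
STRICT_PHONE = r'(?:\(\d{3}\)\s?|\d{3}[-\s.]?)\d{3}[-\s.]?\d{4}'

def validate_phone_strict(phone):
    # fullmatch requires the entire string to match, newline included.
    return re.fullmatch(STRICT_PHONE, phone) is not None

# "$" matches before a trailing newline, so this input passes.
trailing_newline = bool(re.match('^' + PHONE + '$', "555-123-4567\n"))
# fullmatch rejects the same input.
with_fullmatch = bool(re.fullmatch(PHONE, "555-123-4567\n"))

# An opening parenthesis without a closing one passes the loose pattern.
unbalanced_loose = bool(re.fullmatch(PHONE, "(555-123-4567"))
unbalanced_strict = validate_phone_strict("(555-123-4567")

print(trailing_newline, with_fullmatch)     # True False
print(unbalanced_loose, unbalanced_strict)  # True False
print([validate_phone_strict(p) for p in
       ["(555) 123-4567", "555.123.4567", "123-45-6789"]])  # [True, True, False]
```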


---

# Exercise 2: Text Cleaning and Extraction

## Problem Statement
Create functions to clean and extract information from messy text data:

1. `clean_whitespace(text)` - Replace multiple whitespace characters with single spaces
2. `extract_urls(text)` - Extract all URLs (http and https)
3. `remove_html_tags(text)` - Remove all HTML tags from text
4. `extract_hashtags(text)` - Extract all hashtags from social media text
5. `normalize_phone_numbers(text)` - Convert all phone numbers to (XXX) XXX-XXXX format

## Input Examples
```python
messy_text = "Hello    world!\n\n\tThis   has    weird\tspacing."
url_text = "Visit https://example.com or http://test.org for more info."
html_text = "<p>This is <strong>bold</strong> and <em>italic</em> text.</p>"
social_text = "Loving this #python tutorial! #coding #learning #webdev"
phone_text = "Call 555-123-4567 or (555) 987-6543 or 555.111.2222"
```

## Expected Output
```python
clean_whitespace(messy_text) → "Hello world! This has weird spacing."
extract_urls(url_text) → ['https://example.com', 'http://test.org']
remove_html_tags(html_text) → "This is bold and italic text."
extract_hashtags(social_text) → ['#python', '#coding', '#learning', '#webdev']
normalize_phone_numbers(phone_text) → "Call (555) 123-4567 or (555) 987-6543 or (555) 111-2222"
```

In [None]:
# Your solution here
def clean_whitespace(text):
    # Write your regex pattern here
    pass

def extract_urls(text):
    # Write your regex pattern here
    pass

def remove_html_tags(text):
    # Write your regex pattern here
    pass

def extract_hashtags(text):
    # Write your regex pattern here
    pass

def normalize_phone_numbers(text):
    # Write your regex pattern here
    pass

# Test your functions
messy_text = "Hello    world!\n\n\tThis   has    weird\tspacing."
url_text = "Visit https://example.com or http://test.org for more info."
html_text = "<p>This is <strong>bold</strong> and <em>italic</em> text.</p>"
social_text = "Loving this #python tutorial! #coding #learning #webdev"
phone_text = "Call 555-123-4567 or (555) 987-6543 or 555.111.2222"

print("Clean whitespace:", repr(clean_whitespace(messy_text)))
print("Extract URLs:", extract_urls(url_text))
print("Remove HTML:", remove_html_tags(html_text))
print("Extract hashtags:", extract_hashtags(social_text))
print("Normalize phones:", normalize_phone_numbers(phone_text))

### Solution for Exercise 2

In [3]:
def clean_whitespace(text):
    r"""
    Replace multiple whitespace characters with single spaces.
    Pattern: \s+ matches one or more whitespace characters
    """
    return re.sub(r'\s+', ' ', text).strip()

def extract_urls(text):
    r"""
    Extract all URLs (http and https).
    Pattern breakdown:
    https? : http or https (s is optional)
    :// : literal ://
    [\w.-]+ : domain name (letters, digits, underscores, dots, hyphens)
    (?:/[\w.-]*)* : zero or more path segments
    """
    return re.findall(r'https?://[\w.-]+(?:/[\w.-]*)*', text)

def remove_html_tags(text):
    """
    Remove all HTML tags from text.
    Pattern: <.*?> matches any HTML tag (non-greedy)
    The ? makes it non-greedy so it stops at the first >
    """
    return re.sub(r'<.*?>', '', text)

def extract_hashtags(text):
    r"""
    Extract all hashtags from social media text.
    Pattern: #\w+ matches # followed by word characters
    """
    return re.findall(r'#\w+', text)

def normalize_phone_numbers(text):
    """
    Convert all phone numbers to (XXX) XXX-XXXX format.
    Uses capturing groups to extract parts and reformat.
    """
    # Pattern to match various phone formats
    pattern = r'\(?([0-9]{3})\)?[-\s.]?([0-9]{3})[-\s.]?([0-9]{4})'
    
    # Replace with standardized format
    def replace_phone(match):
        return f'({match.group(1)}) {match.group(2)}-{match.group(3)}'
    
    return re.sub(pattern, replace_phone, text)

# Test the solutions
messy_text = "Hello    world!\n\n\tThis   has    weird\tspacing."
url_text = "Visit https://example.com or http://test.org for more info."
html_text = "<p>This is <strong>bold</strong> and <em>italic</em> text.</p>"
social_text = "Loving this #python tutorial! #coding #learning #webdev"
phone_text = "Call 555-123-4567 or (555) 987-6543 or 555.111.2222"

print("Clean whitespace:", repr(clean_whitespace(messy_text)))
print("Extract URLs:", extract_urls(url_text))
print("Remove HTML:", remove_html_tags(html_text))
print("Extract hashtags:", extract_hashtags(social_text))
print("Normalize phones:", normalize_phone_numbers(phone_text))

Clean whitespace: 'Hello world! This has weird spacing.'
Extract URLs: ['https://example.com', 'http://test.org']
Remove HTML: This is bold and italic text.
Extract hashtags: ['#python', '#coding', '#learning', '#webdev']
Normalize phones: Call (555) 123-4567 or (555) 987-6543 or (555) 111-2222
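
The solutions work on the sample inputs, but two edge cases come up quickly on real text: a URL path that is followed by sentence punctuation, and an HTML tag that spans more than one line. The sketch below reproduces both with the solution patterns and shows one possible adjustment for each. The inputs are illustrative.

```python
import re

def extract_urls(text):
    # Same pattern as the solution above.
    return re.findall(r'https?://[\w.-]+(?:/[\w.-]*)*', text)

# The path class includes ".", so a sentence-ending period is captured too.
raw_urls = extract_urls("See https://example.com/docs. Also http://test.org, thanks.")
# One possible fix: strip trailing punctuation after matching.
trimmed_urls = [url.rstrip('.,;:!?') for url in raw_urls]

# "." does not match a newline, so a tag split across lines is left in place.
html = '<a\nhref="x">link</a>'
dot_version = re.sub(r'<.*?>', '', html)
# A negated class matches any character except ">", newlines included.
class_version = re.sub(r'<[^>]+>', '', html)

print(raw_urls)             # ['https://example.com/docs.', 'http://test.org']
print(trimmed_urls)         # ['https://example.com/docs', 'http://test.org']
print(repr(dot_version))    # '<a\nhref="x">link'
print(repr(class_version))  # 'link'
```

For anything beyond simple tag stripping, an HTML parser is generally a safer choice than a regex.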


---

# Exercise 3: Advanced Pattern Matching with Groups

## Problem Statement
Use capturing groups and advanced patterns to extract structured data:

1. `parse_log_entry(log)` - Parse web server log entries and return a dictionary
2. `extract_dates(text)` - Find dates in various formats and standardize them
3. `parse_csv_line(line)` - Parse CSV line handling quoted fields with commas
4. `validate_password(password)` - Check password strength requirements
5. `extract_code_blocks(text)` - Extract code blocks from markdown text

## Input Examples
```python
log_entry = '192.168.1.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
date_text = "Meeting on 2023-10-15, deadline is 10/20/2023, and party on Oct 25, 2023"
csv_line = 'John,"Doe, Jr.",30,"New York, NY",Engineer'
password1 = "MyP@ssw0rd123"  # Strong
password2 = "password"       # Weak
markdown_text = "Here's some code:\n```python\nprint('hello')\n```\nAnd more text."
```

## Expected Output
```python
parse_log_entry(log_entry) → {
    'ip': '192.168.1.1',
    'date': '10/Oct/2023:13:55:36 +0000',
    'method': 'GET',
    'path': '/index.html',
    'status': '200',
    'size': '2326'
}

extract_dates(date_text) → ['2023-10-15', '2023-10-20', '2023-10-25']
parse_csv_line(csv_line) → ['John', 'Doe, Jr.', '30', 'New York, NY', 'Engineer']
validate_password(password1) → {'valid': True, 'score': 5, 'missing': []}
validate_password(password2) → {'valid': False, 'score': 2, 'missing': ['uppercase', 'digit', 'special']}
extract_code_blocks(markdown_text) → [('python', "print('hello')")]
```

In [None]:
# Your solution here
def parse_log_entry(log):
    # Write your regex pattern here
    pass

def extract_dates(text):
    # Write your regex patterns here
    pass

def parse_csv_line(line):
    # Write your regex pattern here
    pass

def validate_password(password):
    # Write your regex patterns here
    pass

def extract_code_blocks(text):
    # Write your regex pattern here
    pass

# Test your functions
log_entry = '192.168.1.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
date_text = "Meeting on 2023-10-15, deadline is 10/20/2023, and party on Oct 25, 2023"
csv_line = 'John,"Doe, Jr.",30,"New York, NY",Engineer'
password1 = "MyP@ssw0rd123"
password2 = "password"
markdown_text = "Here's some code:\n```python\nprint('hello')\n```\nAnd more text."

print("Log entry:", parse_log_entry(log_entry))
print("Dates:", extract_dates(date_text))
print("CSV:", parse_csv_line(csv_line))
print("Password1:", validate_password(password1))
print("Password2:", validate_password(password2))
print("Code blocks:", extract_code_blocks(markdown_text))

### Solution for Exercise 3

In [4]:
def parse_log_entry(log):
    """
    Parse web server log entries using capturing groups.
    Pattern breaks down the common log format into components.
    """
    pattern = r'([\d.]+) - - \[([^\]]+)\] "(\w+) ([^\s]+) [^"]+" (\d+) (\d+)'
    match = re.search(pattern, log)
    
    if match:
        return {
            'ip': match.group(1),
            'date': match.group(2),
            'method': match.group(3),
            'path': match.group(4),
            'status': match.group(5),
            'size': match.group(6)
        }
    return None

def extract_dates(text):
    """
    Find dates in various formats and standardize to YYYY-MM-DD.
    Handles multiple date formats.
    """
    dates = []
    
    # Month names mapping
    months = {
        'Jan': '01', 'Feb': '02', 'Mar': '03', 'Apr': '04',
        'May': '05', 'Jun': '06', 'Jul': '07', 'Aug': '08',
        'Sep': '09', 'Oct': '10', 'Nov': '11', 'Dec': '12'
    }
    
    # Pattern 1: YYYY-MM-DD
    pattern1 = r'(\d{4})-(\d{1,2})-(\d{1,2})'
    for match in re.finditer(pattern1, text):
        year, month, day = match.groups()
        dates.append(f"{year}-{month.zfill(2)}-{day.zfill(2)}")
    
    # Pattern 2: MM/DD/YYYY
    pattern2 = r'(\d{1,2})/(\d{1,2})/(\d{4})'
    for match in re.finditer(pattern2, text):
        month, day, year = match.groups()
        dates.append(f"{year}-{month.zfill(2)}-{day.zfill(2)}")
    
    # Pattern 3: Mon DD, YYYY
    pattern3 = r'([A-Za-z]{3})\s+(\d{1,2}),\s+(\d{4})'
    for match in re.finditer(pattern3, text):
        month_name, day, year = match.groups()
        month_num = months.get(month_name.capitalize())
        if month_num is None:
            continue  # not a month abbreviation, e.g. "the 5, 2023"
        dates.append(f"{year}-{month_num}-{day.zfill(2)}")
    
    return dates

def parse_csv_line(line):
    """
    Parse CSV line handling quoted fields with commas.
    Uses alternation to match quoted or unquoted fields.
    """
    # Pattern matches either quoted strings or unquoted fields
    pattern = r'"([^"]*)"|([^,]+)'
    fields = []
    
    for match in re.finditer(pattern, line):
        # Group 1 is quoted content, Group 2 is unquoted
        field = match.group(1) if match.group(1) is not None else match.group(2)
        fields.append(field.strip())
    
    return fields

def validate_password(password):
    """
    Check password strength using multiple regex patterns.
    Returns validation results with score and missing requirements.
    """
    requirements = {
        'length': len(password) >= 8,
        'lowercase': bool(re.search(r'[a-z]', password)),
        'uppercase': bool(re.search(r'[A-Z]', password)),
        'digit': bool(re.search(r'\d', password)),
        'special': bool(re.search(r'[!@#$%^&*(),.?":{}|<>]', password))
    }
    
    score = sum(requirements.values())
    missing = [req for req, met in requirements.items() if not met]
    
    return {
        'valid': score >= 4 and requirements['length'],
        'score': score,
        'missing': missing
    }

def extract_code_blocks(text):
    r"""
    Extract code blocks from markdown text.
    Pattern matches ```language\ncode\n``` format.
    """
    # Pattern for code blocks with optional language
    pattern = r'```(\w+)?\n(.*?)\n```'
    matches = re.findall(pattern, text, re.DOTALL)
    
    result = []
    for language, code in matches:
        result.append((language or 'text', code.strip()))
    
    return result

# Test the solutions
log_entry = '192.168.1.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
date_text = "Meeting on 2023-10-15, deadline is 10/20/2023, and party on Oct 25, 2023"
csv_line = 'John,"Doe, Jr.",30,"New York, NY",Engineer'
password1 = "MyP@ssw0rd123"
password2 = "password"
markdown_text = "Here's some code:\n```python\nprint('hello')\n```\nAnd more text."

print("Log entry:", parse_log_entry(log_entry))
print("Dates:", extract_dates(date_text))
print("CSV:", parse_csv_line(csv_line))
print("Password1:", validate_password(password1))
print("Password2:", validate_password(password2))
print("Code blocks:", extract_code_blocks(markdown_text))

Log entry: {'ip': '192.168.1.1', 'date': '10/Oct/2023:13:55:36 +0000', 'method': 'GET', 'path': '/index.html', 'status': '200', 'size': '2326'}
Dates: ['2023-10-15', '2023-10-20', '2023-10-25']
CSV: ['John', 'Doe, Jr.', '30', 'New York, NY', 'Engineer']
Password1: {'valid': True, 'score': 5, 'missing': []}
Password2: {'valid': False, 'score': 2, 'missing': ['uppercase', 'digit', 'special']}
Code blocks: [('python', "print('hello')")]
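
The regex in `parse_csv_line` handles the sample line, but because `[^,]+` needs at least one character it silently drops empty fields. The sketch below shows the difference on an illustrative line and compares it with the standard-library `csv` module, which is usually the safer tool for real CSV data. The name `parse_csv_line_regex` is used here only to keep the two approaches apart.

```python
import csv
import re

def parse_csv_line_regex(line):
    # Same pattern as the solution above: quoted field or unquoted field.
    pattern = r'"([^"]*)"|([^,]+)'
    fields = []
    for match in re.finditer(pattern, line):
        field = match.group(1) if match.group(1) is not None else match.group(2)
        fields.append(field.strip())
    return fields

line = 'John,,"Doe, Jr.",30'  # illustrative input with an empty second field

regex_fields = parse_csv_line_regex(line)
csv_fields = next(csv.reader([line]))

print(regex_fields)  # ['John', 'Doe, Jr.', '30']
print(csv_fields)    # ['John', '', 'Doe, Jr.', '30']
```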


---

# Exercise 4: Lookahead and Lookbehind Assertions

## Problem Statement
Use advanced regex features like lookahead and lookbehind assertions:

1. `extract_prices(text)` - Extract prices that are preceded by currency symbols
2. `find_words_before_punctuation(text)` - Find words that come before punctuation
3. `extract_domain_from_email(text)` - Extract domain parts from email addresses
4. `find_capitalized_after_period(text)` - Find capitalized words that follow a period
5. `validate_strong_password(password)` - Password must have digit followed by letter

## Input Examples
```python
price_text = "Items cost $19.99, €25.50, and ¥1000. Total is $45.49."
punct_text = "Hello world! How are you? Fine, thanks."
email_text = "Contact admin@company.com or support@help.org"
sentence_text = "Hello there. This is good. Amazing work."
password_test = "abc123def"  # Has digit followed by letter
```

## Expected Output
```python
extract_prices(price_text) → ['19.99', '25.50', '1000', '45.49']
find_words_before_punctuation(punct_text) → ['world', 'you', 'Fine', 'thanks']
extract_domain_from_email(email_text) → ['company.com', 'help.org']
find_capitalized_after_period(sentence_text) → ['This', 'Amazing']
validate_strong_password(password_test) → True
```

In [None]:
# Your solution here
def extract_prices(text):
    # Use positive lookbehind (?<=pattern)
    pass

def find_words_before_punctuation(text):
    # Use positive lookahead for punctuation
    pass

def extract_domain_from_email(text):
    # Use positive lookbehind (?<=pattern)
    pass

def find_capitalized_after_period(text):
    # Use positive lookbehind for period and space
    pass

def validate_strong_password(password):
    # Use positive lookahead to ensure digit followed by letter exists
    pass

# Test your functions
price_text = "Items cost $19.99, €25.50, and ¥1000. Total is $45.49."
punct_text = "Hello world! How are you? Fine, thanks."
email_text = "Contact admin@company.com or support@help.org"
sentence_text = "Hello there. This is good. Amazing work."
password_test = "abc123def"

print("Prices:", extract_prices(price_text))
print("Words before punct:", find_words_before_punctuation(punct_text))
print("Email domains:", extract_domain_from_email(email_text))
print("Capitalized after period:", find_capitalized_after_period(sentence_text))
print("Strong password:", validate_strong_password(password_test))

### Solution for Exercise 4

In [None]:
def extract_prices(text):
    """
    Extract prices that are preceded by currency symbols.
    Uses positive lookbehind to find numbers after currency symbols.
    """
    # Lookbehind for currency symbols: $, €, ¥, £
    pattern = r'(?<=[$€¥£])\d+(?:\.\d{2})?'
    return re.findall(pattern, text)

def find_words_before_punctuation(text):
    """
    Find words that come before punctuation marks.
    Uses positive lookahead to find words followed by punctuation.
    """
    # Word followed by punctuation (lookahead)
    pattern = r'\b\w+(?=[!?.,;:])'
    return re.findall(pattern, text)

def extract_domain_from_email(text):
    """
    Extract domain parts from email addresses.
    Uses positive lookbehind to find domains after @ symbol.
    """
    # Domain part after @ symbol
    pattern = r'(?<=@)[\w.-]+\.[a-zA-Z]{2,}'
    return re.findall(pattern, text)

def find_capitalized_after_period(text):
    """
    Find capitalized words that follow a period and space.
    Uses positive lookbehind for period and space pattern.
    """
    # Capitalized word after period and space
    pattern = r'(?<=\. )[A-Z][a-z]+'
    return re.findall(pattern, text)

def validate_strong_password(password):
    """
    Check if password has at least one digit followed by a letter.
    Uses positive lookahead to ensure the pattern exists somewhere.
    """
    # Pattern: digit followed by letter
    pattern = r'\d(?=[a-zA-Z])'
    return bool(re.search(pattern, password))

# Alternative implementations with more complex lookarounds

def extract_prices_advanced(text):
    """
    More sophisticated price extraction with negative lookahead
    to avoid matching things like phone numbers.
    """
    # Price with currency, not followed by more digits (to avoid phone numbers)
    pattern = r'(?<=[$€¥£])\d+(?:\.\d{2})?(?!\d)'
    return re.findall(pattern, text)

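def find_words_not_after_articles(text):
    """
    Find words that are NOT preceded by articles (a, an, the).
    Uses negative lookbehind. Python's re module requires fixed-width
    lookbehinds, so each article gets its own assertion instead of
    one (?:a|an|the) alternation.
    """
    # Words not preceded by articles
    pattern = r'(?<!\ba\s)(?<!\ban\s)(?<!\bthe\s)\b[a-zA-Z]+'
    return re.findall(pattern, text, re.IGNORECASE)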

# Test the solutions
price_text = "Items cost $19.99, €25.50, and ¥1000. Total is $45.49."
punct_text = "Hello world! How are you? Fine, thanks."
email_text = "Contact admin@company.com or support@help.org"
sentence_text = "Hello there. This is good. Amazing work."
password_test = "abc123def"

print("Prices:", extract_prices(price_text))
print("Words before punct:", find_words_before_punctuation(punct_text))
print("Email domains:", extract_domain_from_email(email_text))
print("Capitalized after period:", find_capitalized_after_period(sentence_text))
print("Strong password:", validate_strong_password(password_test))

# Test advanced examples
print("\nAdvanced examples:")
print("Prices (advanced):", extract_prices_advanced(price_text))
article_text = "The cat and a dog ran to an apple tree"
print("Words not after articles:", find_words_not_after_articles(article_text))
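
One restriction matters for lookbehind in Python: the standard `re` module only accepts lookbehind patterns of a fixed width. An alternation such as `(?:a|an|the)` has branches of different lengths, so it is rejected at compile time. The sketch below shows the error and a common workaround, which chains one fixed-width lookbehind per alternative.

```python
import re

article_text = "The cat and a dog ran to an apple tree"

# A lookbehind whose alternatives differ in length is rejected by re.
try:
    re.compile(r'(?<!\b(?:a|an|the)\s)\b[a-zA-Z]+')
    error_message = None
except re.error as exc:
    error_message = str(exc)

# Workaround: one fixed-width negative lookbehind per article.
fixed = r'(?<!\ba\s)(?<!\ban\s)(?<!\bthe\s)\b[a-zA-Z]+'
words = re.findall(fixed, article_text, re.IGNORECASE)

print(error_message)  # message explaining that look-behind must be fixed-width
print(words)          # ['The', 'and', 'a', 'ran', 'to', 'an', 'tree']
```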

---

# Exercise 5: Real-World Data Processing Challenge

## Problem Statement
Create a comprehensive text processing system that combines multiple regex techniques:

Build a function `process_customer_data(raw_data)` that takes a list of messy customer records and returns clean, structured data.

Each raw record contains:
- Customer name (various formats)
- Phone number (multiple formats)
- Email address
- Address (street, city, state, zip)
- Purchase amount and date

The function should:
1. Extract and validate all information using regex
2. Standardize phone numbers to (XXX) XXX-XXXX format
3. Validate email addresses
4. Parse addresses into components
5. Convert dates to YYYY-MM-DD format
6. Return structured data with validation flags
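
Step 5 can be approached in several ways. One possible sketch, assuming only the three date formats that appear in this tutorial, is a single verbose pattern with one named alternative per format, so that dates come back in the order they appear in the text. `DATE_RE`, `MONTHS`, and `normalize_dates` are names made up for this example.

```python
import re

MONTHS = {name: f"{number:02d}" for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}

DATE_RE = re.compile(r"""
    (?P<iso>(?P<y1>\d{4})-(?P<m1>\d{1,2})-(?P<d1>\d{1,2}))                  # 2023-10-15
  | (?P<us>(?P<m2>\d{1,2})/(?P<d2>\d{1,2})/(?P<y2>\d{4}))                   # 10/20/2023
  | (?P<named>(?P<mon>[A-Z][a-z]{2})\s+(?P<d3>\d{1,2}),\s+(?P<y3>\d{4}))    # Oct 25, 2023
""", re.VERBOSE)

def normalize_dates(text):
    """Return every recognized date as YYYY-MM-DD, in order of appearance."""
    dates = []
    for m in DATE_RE.finditer(text):
        if m.group("iso"):
            year, month, day = m.group("y1", "m1", "d1")
        elif m.group("us"):
            month, day, year = m.group("m2", "d2", "y2")
        else:
            if m.group("mon") not in MONTHS:
                continue  # three letters that are not a month abbreviation
            month = MONTHS[m.group("mon")]
            day, year = m.group("d3", "y3")
        dates.append(f"{year}-{month.zfill(2)}-{day.zfill(2)}")
    return dates

sample = "party on Oct 25, 2023, deadline 10/20/2023, meeting 2023-10-15"
normalized = normalize_dates(sample)
print(normalized)  # ['2023-10-25', '2023-10-20', '2023-10-15']
```

The pattern checks only the shape of a date; it does not reject impossible values such as month 13.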

## Input Example
```python
raw_data = [
    "Customer: John Smith, Phone: (555) 123-4567, Email: john@email.com, Address: 123 Main St, New York, NY 10001, Purchase: $299.99 on 2023-10-15",
    "Name: Jane Doe Jr., Tel: 555.987.6543, E-mail: jane.doe@company.org, Addr: 456 Oak Ave, Los Angeles, CA 90210, Bought $150.00 on Oct 20, 2023",
    "CUSTOMER: Bob Wilson, PHONE: 555-111-2222, EMAIL: bob@invalid, ADDRESS: 789 Pine Rd, Chicago, IL 60601, TOTAL: $75.50 DATE: 10/25/2023",
    "Mr. Alice Brown, 555 444 3333, alice.brown@test.com, 321 Elm St, Houston, TX 77001, $199.99 - 2023-11-01"
]
```

## Expected Output Structure
```python
[
    {
        'name': 'John Smith',
        'phone': '(555) 123-4567',
        'phone_valid': True,
        'email': 'john@email.com',
        'email_valid': True,
        'address': {
            'street': '123 Main St',
            'city': 'New York',
            'state': 'NY',
            'zip': '10001'
        },
        'purchase_amount': 299.99,
        'purchase_date': '2023-10-15',
        'valid_record': True
    },
    # ... more records
]
```

In [5]:
# Your solution here
def process_customer_data(raw_data):
    """
    Process messy customer data using comprehensive regex patterns.
    This is your chance to combine all the techniques you've learned!
    
    Hints:
    - Use multiple regex patterns for different data formats
    - Use capturing groups to extract structured data
    - Validate extracted data with additional patterns
    - Handle edge cases and missing data gracefully
    """
    processed_records = []
    
    for record in raw_data:
        # Extract name (various prefixes and formats)
        name = None
        # Your regex pattern here
        
        # Extract and standardize phone number
        phone = None
        phone_valid = False
        # Your regex patterns here
        
        # Extract and validate email
        email = None
        email_valid = False
        # Your regex patterns here
        
        # Extract address components
        address = {
            'street': None,
            'city': None,
            'state': None,
            'zip': None
        }
        # Your regex patterns here
        
        # Extract purchase amount and date
        purchase_amount = None
        purchase_date = None
        # Your regex patterns here
        
        # Create record
        processed_record = {
            'name': name,
            'phone': phone,
            'phone_valid': phone_valid,
            'email': email,
            'email_valid': email_valid,
            'address': address,
            'purchase_amount': purchase_amount,
            'purchase_date': purchase_date,
            'valid_record': all([name, phone, email, purchase_amount, purchase_date])
        }
        
        processed_records.append(processed_record)
    
    return processed_records

# Test your function
raw_data = [
    "Customer: John Smith, Phone: (555) 123-4567, Email: john@email.com, Address: 123 Main St, New York, NY 10001, Purchase: $299.99 on 2023-10-15",
    "Name: Jane Doe Jr., Tel: 555.987.6543, E-mail: jane.doe@company.org, Addr: 456 Oak Ave, Los Angeles, CA 90210, Bought $150.00 on Oct 20, 2023",
    "CUSTOMER: Bob Wilson, PHONE: 555-111-2222, EMAIL: bob@invalid, ADDRESS: 789 Pine Rd, Chicago, IL 60601, TOTAL: $75.50 DATE: 10/25/2023",
    "Mr. Alice Brown, 555 444 3333, alice.brown@test.com, 321 Elm St, Houston, TX 77001, $199.99 - 2023-11-01"
]

result = process_customer_data(raw_data)
print(json.dumps(result, indent=2))


### Solution for Exercise 5

In [6]:
def process_customer_data(raw_data):
    """
    Process messy customer data using comprehensive regex patterns.
    Combines multiple regex techniques for real-world data processing.
    """
    processed_records = []
    
    # Month name to number mapping for date conversion
    months = {
        'Jan': '01', 'Feb': '02', 'Mar': '03', 'Apr': '04',
        'May': '05', 'Jun': '06', 'Jul': '07', 'Aug': '08',
        'Sep': '09', 'Oct': '10', 'Nov': '11', 'Dec': '12'
    }
    
    for record in raw_data:
        # Extract name (handle various prefixes and formats)
        name_patterns = [
            r'(?:Customer|Name|CUSTOMER):\s*([^,]+)',  # "Customer: Name"
            r'(?:Mr\.|Mrs\.|Ms\.)\s+([^,]+)',          # "Mr. Name"
            r'^([A-Za-z]+\s+[A-Za-z]+(?:\s+[A-Za-z]+)?)'  # Name at start
        ]
        
        name = None
        for pattern in name_patterns:
            match = re.search(pattern, record)
            if match:
                name = match.group(1).strip().rstrip(',')
                break
        
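        # Extract and standardize phone number: prefer a labeled number,
        # otherwise fall back to any run of 10+ digit and separator characters
        # (the {10,} counts separators too; digits are counted after cleanup below)
        phone_pattern = r'(?:Phone|Tel|PHONE):\s*([\d\s\(\)\.-]+)|([\d\s\(\)\.-]{10,})'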
        phone_match = re.search(phone_pattern, record)
        
        phone = None
        phone_valid = False
        
        if phone_match:
            phone_raw = phone_match.group(1) or phone_match.group(2)
            # Extract just the digits
            digits = re.sub(r'\D', '', phone_raw)
            
            if len(digits) == 10:
                phone = f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
                phone_valid = True
        
        # Extract and validate email
        email_pattern = r'(?:Email|E-mail|EMAIL):\s*([\w.-]+@[\w.-]+\.[a-zA-Z]{2,})|([\w.-]+@[\w.-]+\.[a-zA-Z]{2,})'
        email_match = re.search(email_pattern, record)
        
        email = None
        email_valid = False
        
        if email_match:
            email = email_match.group(1) or email_match.group(2)
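            # Safeguard: re-check that the whole captured string is an address.
            # The extraction pattern above already requires a dotted domain,
            # so a value like "bob@invalid" is never captured in the first place.
            email_validation = r'^[\w.-]+@[\w.-]+\.[a-zA-Z]{2,}$'
            email_valid = bool(re.match(email_validation, email))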
        
        # Extract address components
        address_pattern = r'(?:Address|Addr|ADDRESS):\s*([^,]+),\s*([^,]+),\s*([A-Z]{2})\s*(\d{5})'
        address_match = re.search(address_pattern, record)
        
        address = {
            'street': None,
            'city': None,
            'state': None,
            'zip': None
        }
        
        if address_match:
            address = {
                'street': address_match.group(1).strip(),
                'city': address_match.group(2).strip(),
                'state': address_match.group(3).strip(),
                'zip': address_match.group(4).strip()
            }
        
        # Extract purchase amount
        amount_pattern = r'\$([\d,]+\.\d{2})'
        amount_match = re.search(amount_pattern, record)
        
        purchase_amount = None
        if amount_match:
            amount_str = amount_match.group(1).replace(',', '')
            purchase_amount = float(amount_str)
        
        # Extract and standardize date
        date_patterns = [
            r'(\d{4})-(\d{1,2})-(\d{1,2})',  # YYYY-MM-DD
            r'(\d{1,2})/(\d{1,2})/(\d{4})',  # MM/DD/YYYY
            r'([A-Za-z]{3})\s+(\d{1,2}),\s+(\d{4})'  # Mon DD, YYYY
        ]
        
        purchase_date = None
        
        for i, pattern in enumerate(date_patterns):
            date_match = re.search(pattern, record)
            if date_match:
                if i == 0:  # YYYY-MM-DD
                    year, month, day = date_match.groups()
                    purchase_date = f"{year}-{month.zfill(2)}-{day.zfill(2)}"
                elif i == 1:  # MM/DD/YYYY
                    month, day, year = date_match.groups()
                    purchase_date = f"{year}-{month.zfill(2)}-{day.zfill(2)}"
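                elif i == 2:  # Mon DD, YYYY
                    month_name, day, year = date_match.groups()
                    # Normalize case so "OCT" or "oct" also resolve. An unknown
                    # abbreviation leaves purchase_date as None instead of
                    # silently defaulting to January.
                    month_num = months.get(month_name.capitalize())
                    if month_num:
                        purchase_date = f"{year}-{month_num}-{day.zfill(2)}"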
                break
        
        # Create processed record
        processed_record = {
            'name': name,
            'phone': phone,
            'phone_valid': phone_valid,
            'email': email,
            'email_valid': email_valid,
            'address': address,
            'purchase_amount': purchase_amount,
            'purchase_date': purchase_date,
            'valid_record': all([
                name, phone, email, purchase_date,
                purchase_amount is not None,  # a $0.00 purchase is still a valid amount
                phone_valid, email_valid
            ])
        }
        
        processed_records.append(processed_record)
    
    return processed_records

# Test the comprehensive solution
raw_data = [
    "Customer: John Smith, Phone: (555) 123-4567, Email: john@email.com, Address: 123 Main St, New York, NY 10001, Purchase: $299.99 on 2023-10-15",
    "Name: Jane Doe Jr., Tel: 555.987.6543, E-mail: jane.doe@company.org, Addr: 456 Oak Ave, Los Angeles, CA 90210, Bought $150.00 on Oct 20, 2023",
    "CUSTOMER: Bob Wilson, PHONE: 555-111-2222, EMAIL: bob@invalid, ADDRESS: 789 Pine Rd, Chicago, IL 60601, TOTAL: $75.50 DATE: 10/25/2023",
    "Mr. Alice Brown, 555 444 3333, alice.brown@test.com, 321 Elm St, Houston, TX 77001, $199.99 - 2023-11-01"
]

result = process_customer_data(raw_data)
print(json.dumps(result, indent=2))

# Summary statistics
print("\n=== Processing Summary ===")
valid_records = sum(1 for r in result if r['valid_record'])
valid_phones = sum(1 for r in result if r['phone_valid'])
valid_emails = sum(1 for r in result if r['email_valid'])

print(f"Total records processed: {len(result)}")
print(f"Valid records: {valid_records}")
print(f"Valid phone numbers: {valid_phones}")
print(f"Valid email addresses: {valid_emails}")
print(f"Total purchase amount: ${sum(r['purchase_amount'] for r in result if r['purchase_amount']):.2f}")

[
  {
    "name": "John Smith",
    "phone": "(555) 123-4567",
    "phone_valid": true,
    "email": "john@email.com",
    "email_valid": true,
    "address": {
      "street": "123 Main St",
      "city": "New York",
      "state": "NY",
      "zip": "10001"
    },
    "purchase_amount": 299.99,
    "purchase_date": "2023-10-15",
    "valid_record": true
  },
  {
    "name": "Jane Doe Jr.",
    "phone": "(555) 987-6543",
    "phone_valid": true,
    "email": "jane.doe@company.org",
    "email_valid": true,
    "address": {
      "street": "456 Oak Ave",
      "city": "Los Angeles",
      "state": "CA",
      "zip": "90210"
    },
    "purchase_amount": 150.0,
    "purchase_date": "2023-10-20",
    "valid_record": true
  },
  {
    "name": "Bob Wilson",
    "phone": "(555) 111-2222",
    "phone_valid": true,
    "email": null,
    "email_valid": false,
    "address": {
      "street": "789 Pine Rd",
      "city": "Chicago",
      "state": "IL",
      "zip": "60601"
    },
    "purchase
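
The solution spells out each label variant explicitly (`Phone|Tel|PHONE`), so a label such as `phone:` or `TEL:` would not be recognized as a label. One possible variation, sketched below, compiles the pattern once with `re.IGNORECASE` and uses named groups so the calling code does not depend on group numbers. The names `PHONE_RE`, `extract_phone`, `labeled`, and `bare` are illustrative assumptions, not part of the original solution, and the added `\b` is a small hardening step so that `tel:` does not match inside a longer word.

```python
import re

# Hypothetical variant of the phone pattern from the solution:
# - re.IGNORECASE lets "phone:", "Phone:", and "PHONE:" share one alternative
# - named groups replace the numbered group(1) / group(2) lookups
PHONE_RE = re.compile(
    r'\b(?:phone|tel):\s*(?P<labeled>[\d\s().-]+)'  # labeled number
    r'|(?P<bare>[\d\s().-]{10,})',                  # unlabeled run of phone characters
    re.IGNORECASE,
)

def extract_phone(record):
    """Return the phone as (XXX) XXX-XXXX, or None unless exactly 10 digits are found."""
    match = PHONE_RE.search(record)
    if not match:
        return None
    raw = match.group('labeled') or match.group('bare')
    digits = re.sub(r'\D', '', raw)  # keep digits only
    if len(digits) != 10:
        return None
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"

print(extract_phone("tel: 555.987.6543, E-mail: jane.doe@company.org"))
print(extract_phone("Mr. Alice Brown, 555 444 3333, alice.brown@test.com"))
print(extract_phone("PHONE: 555-1212"))
```

The first two calls should print the normalized numbers `(555) 987-6543` and `(555) 444-3333`; the third prints `None` because only seven digits are present. The same approach could be applied to the email and address labels.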

## Resources
- **Python re module documentation**: https://docs.python.org/3/library/re.html
- **Regular Expression HOWTO**: https://docs.python.org/3/howto/regex.html
- **Online regex testers**: regex101.com, regexr.com, regexpal.com
- **Books**: "Mastering Regular Expressions" by Jeffrey Friedl

**Practice makes perfect!** Continue applying these patterns to your own text processing challenges.
