# Chapter 13: Groups and Advanced Regex

**Chapter 13 - Learning Python**

Beyond basic matching, regular expressions offer grouping and assertion
mechanisms. This notebook covers capturing, named, and non-capturing groups,
lookahead/lookbehind assertions, backreferences, alternation, and the
`re.sub` function for pattern-based text replacement.

## Key Concepts
- **Capturing groups**: `(...)` to extract parts of a match
- **Named groups**: `(?P<name>...)` for readable, self-documenting patterns
- **Non-capturing groups**: `(?:...)` for grouping without capturing
- **Lookahead / Lookbehind**: Zero-width assertions for context-sensitive matching
- **Backreferences**: `\1`, `\2` to refer back to captured groups
- **Alternation**: `|` for matching one of several alternatives
- **`re.sub`**: Substitution with replacement strings or functions

## Capturing Groups

Parentheses `(...)` create **capturing groups** that extract sub-parts of a
match. Use `.group(n)` to access them:
- `.group(0)` or `.group()` -- the entire match
- `.group(1)` -- first capturing group
- `.groups()` -- tuple of all captured groups

In [None]:
import re

# Extract area code and number separately
phone_pattern = r"(\d{3})-(\d{3})-(\d{4})"
m = re.search(phone_pattern, "Call me at 415-555-1234 today")

if m:
    print(f"Full match:  {m.group()}")
    print(f"Area code:   {m.group(1)}")
    print(f"Exchange:    {m.group(2)}")
    print(f"Subscriber:  {m.group(3)}")
    print(f"All groups:  {m.groups()}")

# With two or more groups, findall returns a list of tuples (one string per group)
text = "Home: 415-555-1234, Work: 650-555-5678"
all_phones = re.findall(phone_pattern, text)
print(f"\nAll phones (findall with groups): {all_phones}")

# Using groups to reformat
for area, exchange, number in all_phones:
    print(f"  ({area}) {exchange}-{number}")
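
The accessors listed above have a few behaviors the phone example does not exercise. The sketch below, using a made-up extension format (`x42`) as an illustrative assumption, shows how nested groups are numbered, what an optional group returns when it does not participate in the match, and how `.group()` and `.span()` accept group indices.

```python
import re

# Groups are numbered by the position of their opening parenthesis,
# so the outer group is 1 and the inner groups are 2 and 3.
nested = re.search(r"((\d{3})-(\d{4}))", "Dial 555-1234")
print(nested.groups())          # ('555-1234', '555', '1234')

# Hypothetical format: an optional " x<digits>" extension after the number.
ext_pattern = r"(\d{3})-(\d{4})(?: x(\d+))?"
with_ext = re.search(ext_pattern, "555-1234 x42")
without_ext = re.search(ext_pattern, "555-1234")

# An optional group that takes no part in the match is reported as None.
print(with_ext.groups())        # ('555', '1234', '42')
print(without_ext.groups())     # ('555', '1234', None)

# groups() accepts a default value to use for unmatched groups.
print(without_ext.groups(""))   # ('555', '1234', '')

# group() accepts several indices at once; span(n) gives a group's position.
print(with_ext.group(1, 3))     # ('555', '42')
print(with_ext.span(3))         # (10, 12)
```

Checking for `None` (or passing a default) matters whenever a group sits inside an optional part of the pattern.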

## Named Groups

Named groups use the syntax `(?P<name>...)` and can be accessed by name
instead of number, making patterns self-documenting and more maintainable.

In [None]:
import re

# Parse a date string with named groups
date_pattern = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
m = re.search(date_pattern, "Event date: 2024-03-15")

if m:
    # Access by name -- much more readable than group(1), group(2)
    print(f"Year:  {m.group('year')}")
    print(f"Month: {m.group('month')}")
    print(f"Day:   {m.group('day')}")

    # .groupdict() returns a dictionary of all named groups
    print(f"\ngroupdict: {m.groupdict()}")

    # Numeric access still works
    print(f"group(1):  {m.group(1)}")

# Named groups in finditer
log = "ERROR 2024-01-10 disk full; WARN 2024-01-11 low memory"
entry_pattern = r"(?P<level>ERROR|WARN)\s+(?P<date>\d{4}-\d{2}-\d{2})\s+(?P<msg>[^;]+)"

print("\nLog entries:")
for entry in re.finditer(entry_pattern, log):
    d = entry.groupdict()
    print(f"  [{d['level']}] {d['date']}: {d['msg'].strip()}")
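
Named groups can also be reached through subscripting, and the compiled pattern exposes the name-to-number mapping. The following is a small sketch rather than a recommended parser: it assumes the captured digits form a valid calendar date, and the names `date_re` and `event_day` are illustrative.

```python
import re
from datetime import date

date_re = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
m = date_re.search("Event date: 2024-03-15")

# Match objects support subscripting (Python 3.6+): m["year"] == m.group("year")
print(m["year"], m["month"], m["day"])

# groupindex maps each group name to its group number
print(dict(date_re.groupindex))   # {'year': 1, 'month': 2, 'day': 3}

# The group names line up with date()'s keyword arguments,
# so groupdict() can feed the constructor after converting to int.
event_day = date(**{name: int(value) for name, value in m.groupdict().items()})
print(event_day)                  # 2024-03-15
```

Choosing group names that match the target constructor's parameters keeps the conversion step to a single expression.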

## Non-Capturing Groups

Sometimes you need to group patterns (for alternation or quantifiers) but
don't want to capture them. Use `(?:...)` for non-capturing groups.

In [None]:
import re

# Problem: we want to match 'http' or 'https' but only capture the domain
url = "Visit https://example.com or http://test.org"

# With capturing group -- captures the protocol too
capturing = re.findall(r"(https?)://(\w+\.\w+)", url)
print(f"With capturing groups: {capturing}")

# With non-capturing group -- only captures the domain
non_capturing = re.findall(r"(?:https?)://(\w+\.\w+)", url)
print(f"With non-capturing:    {non_capturing}")

# Non-capturing groups are useful for applying quantifiers to alternatives
# Match repeated syllables like 'mama', 'papa', 'dada'
words = ["mama", "papa", "dada", "banana", "ma", "mamama"]
pattern = r"^(?:ma|pa|da){2}$"  # exactly 2 repetitions of ma, pa, or da

print("\nRepeated syllables:")
for w in words:
    m = re.match(pattern, w)
    print(f"  {w!r:10s} -> {'match' if m else 'no match'}")
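
One reason to prefer `(?:...)` under a quantifier is that a quantified *capturing* group does not collect every repetition. A short sketch of that behavior, reusing the syllable pattern from above:

```python
import re

# A quantified capturing group keeps only its last repetition.
last_only = re.match(r"^(ma|pa|da){2}$", "mapa")
print(last_only.group(0))   # mapa
print(last_only.group(1))   # pa

# To keep every syllable, capture the whole repetition and split it afterwards.
whole = re.match(r"^((?:ma|pa|da){2})$", "mapa")
syllables = re.findall(r"ma|pa|da", whole.group(1))
print(syllables)            # ['ma', 'pa']
```

If only the overall match matters, the non-capturing form avoids creating a group whose value could be misread as "all repetitions".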

## Lookahead and Lookbehind Assertions

These are **zero-width assertions** -- they check for a pattern without
consuming characters (i.e., without including them in the match).

| Syntax | Name | Meaning |
|--------|------|---------|
| `(?=...)` | Positive lookahead | What follows must match |
| `(?!...)` | Negative lookahead | What follows must NOT match |
| `(?<=...)` | Positive lookbehind | What precedes must match |
| `(?<!...)` | Negative lookbehind | What precedes must NOT match |
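
One restriction is worth knowing before using the lookbehind forms: Python's `re` module requires a lookbehind to match text of a fixed length. The sketch below uses made-up currency prefixes to show a pattern that compiles and one that does not; the exact error message may differ between Python versions.

```python
import re

# Fixed width: both alternatives are exactly 3 characters long, so this compiles.
fixed = re.compile(r"(?<=USD|EUR)\d+")
amounts = fixed.findall("USD100 EUR250 GBP75")
print(amounts)   # ['100', '250']

# Variable width: \w+ can match any number of characters, so compiling fails.
lookbehind_error = None
try:
    re.compile(r"(?<=\w+:)\d+")
except re.error as exc:
    lookbehind_error = exc
    print(f"re.error: {exc}")
```

Lookaheads have no such restriction, which is why patterns such as `(?=.*\d)` are legal.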

In [None]:
import re

# Positive lookahead: (?=...)
# Find words followed by a comma
text = "apple, banana, cherry and date"
before_comma = re.findall(r"\w+(?=,)", text)
print(f"Words before comma (lookahead): {before_comma}")

# Negative lookahead: (?!...)
# Find words NOT followed by a comma
not_before_comma = re.findall(r"\b\w+\b(?!,)", text)
print(f"Words not before comma:         {not_before_comma}")

# Positive lookbehind: (?<=...)
# Extract prices (digits after $ sign)
prices = "Items: $25, $100, $7.50 and 50 cents"
dollar_amounts = re.findall(r"(?<=\$)\d+(?:\.\d+)?", prices)
print(f"\nDollar amounts (lookbehind): {dollar_amounts}")

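# Negative lookbehind: (?<!...)
# Find numbers NOT preceded by $. The lookbehind also rejects a preceding
# digit or '.', otherwise the "50" inside "$7.50" would match.
non_dollar = re.findall(r"(?<![$\d.])\d+(?:\.\d+)?\b", prices)
print(f"Non-dollar numbers:         {non_dollar}")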

# Combined lookahead and lookbehind: extract content between tags
html = "<b>bold</b> text <i>italic</i>"
tag_content = re.findall(r"(?<=<b>).*?(?=</b>)", html)
print(f"\nContent inside <b> tags: {tag_content}")
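
Because lookaheads consume nothing, several of them can be stacked at the same position and act as independent "and" conditions. The password rules below (at least 8 characters with a lowercase letter, an uppercase letter, and a digit) are an assumption chosen for illustration, not a recommended policy.

```python
import re

# Each lookahead is evaluated from the start of the string, so all three
# conditions must hold before .{8,} consumes any characters.
strong = re.compile(r"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{8,}$")

verdicts = {}
for candidate in ["Passw0rdX", "password1", "SHORT1a", "NoDigitsHere"]:
    verdicts[candidate] = bool(strong.match(candidate))
    label = "ok" if verdicts[candidate] else "rejected"
    print(f"  {candidate!r:16} -> {label}")
```

Only `'Passw0rdX'` passes: the others lack an uppercase letter, are too short, or contain no digit.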

## Backreferences

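Backreferences (`\1`, `\2`, etc.) refer back to previously captured groups
within the **same pattern**. A backreference matches the text the group
actually captured (compared case-insensitively under `re.IGNORECASE`), not
the group's sub-pattern, which makes it useful for finding repeated text or
matching paired delimiters.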

In [None]:
import re

# Find repeated words ("the the", "is is")
text = "This is is a test. The the cat sat on on the mat."
repeated = re.findall(r"\b(\w+)\s+\1\b", text, re.IGNORECASE)
print(f"Repeated words: {repeated}")

# Match opening and closing HTML tags
html = "<b>bold</b> <i>italic</i> <b>mismatched</i>"
# \1 ensures closing tag matches the opening tag
matched_tags = re.findall(r"<(\w+)>([^<]*)</\1>", html)
print(f"\nProperly matched tags: {matched_tags}")

# Find words with repeated letters (aa, bb, oo, etc.)
words = ["book", "cat", "coffee", "tree", "apple", "dog", "balloon"]
for word in words:
    m = re.search(r"(\w)\1", word)
    if m:
        print(f"  {word!r} has repeated letter: {m.group()!r}")

# Named backreference: (?P=name)
quote_text = "'hello' and \"world\" and 'test\""
# Match text enclosed in matching quotes
matched_quotes = re.findall(r"(?P<quote>['\"])(.+?)(?P=quote)", quote_text)
print(f"\nMatched quotes: {matched_quotes}")
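
A backreference in the pattern pairs naturally with one in a replacement string. As a minimal sketch, the repeated-word pattern above can be extended to collapse the duplicates with `re.sub`; the variable name `deduped` is illustrative.

```python
import re

text = "This is is a test. The the cat sat on on the mat."

# (?:\s+\1\b)+ matches one or more repeats of the word captured by group 1;
# the replacement \1 keeps a single copy (the first one, with its casing).
deduped = re.sub(r"\b(\w+)(?:\s+\1\b)+", r"\1", text, flags=re.IGNORECASE)
print(deduped)   # This is a test. The cat sat on the mat.
```

Here `\1` appears twice with different roles: inside the pattern it must match the captured word again, and inside the replacement it inserts that word.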

## Alternation with `|`

The `|` operator matches **either** the pattern on its left **or** the
pattern on its right. It has the lowest precedence of any regex operator,
so use parentheses to limit its scope.

In [None]:
import re

# Simple alternation
pets = "I have a cat and a dog and a fish"
found = re.findall(r"cat|dog|fish", pets)
print(f"Pets found: {found}")

# Alternation scope: '|' splits the whole pattern unless parentheses limit it
# 'ab|cd' matches 'ab' OR 'cd'; 'a(?:b|c)d' matches 'abd' OR 'acd'
text = "ab cd ad cb abd acd"
print(f"\nab|cd:     {re.findall(r'ab|cd', text)}")
print(f"a(?:b|c)d: {re.findall(r'a(?:b|c)d', text)}")

# Practical: match different date formats
dates = "Created 2024-01-15, expires 01/15/2025, updated Jan 15, 2024"
# ISO format OR US format OR text format
pattern = r"\d{4}-\d{2}-\d{2}|\d{2}/\d{2}/\d{4}|[A-Z][a-z]+ \d{1,2}, \d{4}"
all_dates = re.findall(pattern, dates)
print(f"\nAll date formats: {all_dates}")

# Alternation within groups captures the alternative that matched
log = "ERROR: disk full; WARNING: low memory; INFO: started"
entries = re.findall(r"(ERROR|WARNING|INFO): (\w[\w ]*)", log)
print(f"\nLog entries: {entries}")
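
Alternatives are tried from left to right, and the first one that succeeds wins, even if a later alternative would match more text. A short sketch of the pitfall and two common ways around it:

```python
import re

text = "JavaScript and Java"

# 'Java' is tried first and succeeds, so 'JavaScript' is never attempted.
short_first = re.findall(r"Java|JavaScript", text)
print(short_first)   # ['Java', 'Java']

# Listing the longer alternative first fixes it...
long_first = re.findall(r"JavaScript|Java", text)
print(long_first)    # ['JavaScript', 'Java']

# ...as does factoring out the common prefix with an optional group.
factored = re.findall(r"Java(?:Script)?", text)
print(factored)      # ['JavaScript', 'Java']
```

When one alternative is a prefix of another, ordering them longest-first (or factoring the prefix) avoids silently truncated matches.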

## `re.sub` -- Pattern-Based Substitution

`re.sub(pattern, replacement, string)` replaces every non-overlapping match
of `pattern` in `string` with `replacement` and returns the new string. The
replacement can be:
- A string (with backreferences like `\1` or `\g<name>`)
- A callable (function) that receives the match object and returns the replacement string

In [None]:
import re

# Simple replacement
text = "I love cats. Cats are great. CATS rule!"
result = re.sub(r"cats", "dogs", text, flags=re.IGNORECASE)
print(f"Simple sub: {result}")

# Replacement with backreferences: reformat dates
dates = "Dates: 2024-01-15 and 2024-12-25"
reformatted = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\2/\3/\1", dates)
print(f"\nReformatted dates: {reformatted}")

# Using named groups in replacement with \g<name>
reformatted_named = re.sub(
    r"(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})",
    r"\g<m>/\g<d>/\g<y>",
    dates,
)
print(f"Named group sub:   {reformatted_named}")

# Replacement with a function
def censor_email(match: re.Match) -> str:
    """Replace email username with asterisks."""
    user = match.group("user")
    domain = match.group("domain")
    return f"{'*' * len(user)}@{domain}"


emails = "Contact alice@example.com or bob.smith@work.org"
censored = re.sub(
    r"(?P<user>[\w.]+)@(?P<domain>[\w.]+)",
    censor_email,
    emails,
)
print(f"\nCensored: {censored}")

# re.subn returns (new_string, count_of_replacements)
result, count = re.subn(r"\d+", "#", "abc 123 def 456 ghi 789")
print(f"\nsubn result: {result!r} ({count} replacements)")
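
Two details of replacement strings are easy to trip over: a group reference followed directly by a digit, and limiting the number of replacements. The sketch below illustrates both; the exact wording of the error message may vary by Python version.

```python
import re

# \g<1>0 means "group 1, then a literal 0"; a bare \10 would mean group 10.
padded = re.sub(r"(\d)", r"\g<1>0", "a1 b2")
print(padded)       # a10 b20

ambiguous_error = None
try:
    re.sub(r"(\d)", r"\10", "a1 b2")   # the pattern has no group 10
except re.error as exc:
    ambiguous_error = exc
    print(f"re.error: {exc}")

# count limits how many matches are replaced; passing it by keyword
# avoids confusing it with flags.
first_only = re.sub(r"\d+", "#", "abc 123 def 456 ghi 789", count=1)
print(first_only)   # abc # def 456 ghi 789
```

Using `\g<n>` consistently in replacement strings sidesteps the ambiguity altogether.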

## Putting It Together: Advanced Pattern Matching

Combining groups, assertions, and substitution for real-world tasks.

In [None]:
import re


def parse_url(url: str) -> dict[str, str | None]:
    """Parse a URL into its components using named groups."""
    pattern = (
        r"^(?P<scheme>https?)://"
        r"(?P<host>[\w.-]+)"
        r"(?::(?P<port>\d+))?"
        r"(?P<path>/[\w/.-]*)?"
        r"(?:\?(?P<query>[\w=&]+))?"
    )
    m = re.match(pattern, url)
    if m:
        return m.groupdict()
    return {}


def add_thousand_separators(text: str) -> str:
    """Add comma separators to large numbers using lookahead."""
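    # Insert a comma at every position that follows a digit and is followed by
    # one or more complete groups of three digits ending at a word boundary.
    # Caveat: digits after a decimal point are grouped as well.
    return re.sub(r"(?<=\d)(?=(?:\d{3})+\b)", ",", text)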


# URL parsing
urls = [
    "https://example.com/path/to/page?key=value",
    "http://localhost:8080/api/data",
    "https://docs.python.org",
]
print("URL parsing:")
for url in urls:
    parts = parse_url(url)
    print(f"  {url}")
    for key, val in parts.items():
        if val:
            print(f"    {key}: {val}")
    print()

# Thousand separators
print("Thousand separators:")
numbers = "Population: 7800000, GDP: 21000000000, ID: 42"
print(f"  Before: {numbers}")
print(f"  After:  {add_thousand_separators(numbers)}")
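
The lookaround approach in `add_thousand_separators` treats every run of digits the same way, so the fractional part of a decimal number is grouped too. One possible alternative, sketched here under the assumption that leading zeros need not be preserved, combines named groups with a replacement function; `add_separators_decimal_safe` is a hypothetical helper, not part of the original notebook.

```python
import re


def add_separators_decimal_safe(text: str) -> str:
    """Group only the integer part of each number; leave fractions alone."""

    def fmt(m: re.Match) -> str:
        # int() drops leading zeros, which this sketch accepts
        whole = f"{int(m.group('whole')):,}"
        return whole + (m.group("frac") or "")

    # (?<![\d.]) keeps a match from starting inside a number or a fraction
    return re.sub(r"(?<![\d.])(?P<whole>\d+)(?P<frac>\.\d+)?\b", fmt, text)


sample = "Total: 1234567.8912, count: 1000, ID: 42"

# The pure lookaround version also inserts a comma inside the fraction.
lookaround_only = re.sub(r"(?<=\d)(?=(?:\d{3})+\b)", ",", sample)
print(lookaround_only)   # Total: 1,234,567.8,912, count: 1,000, ID: 42

decimal_safe = add_separators_decimal_safe(sample)
print(decimal_safe)      # Total: 1,234,567.8912, count: 1,000, ID: 42
```

The function-based replacement trades the compactness of the one-line regex for explicit control over which part of each number is reformatted.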

## Summary

### Capturing Groups
- `(...)` captures text; accessed via `.group(n)` or `.groups()`
- `(?P<name>...)` creates named groups; accessed via `.group('name')` or `.groupdict()`
- `(?:...)` groups without capturing -- useful for alternation and quantifiers

### Lookahead and Lookbehind
- `(?=...)` positive lookahead, `(?!...)` negative lookahead
- `(?<=...)` positive lookbehind, `(?<!...)` negative lookbehind
- Zero-width: they assert context without consuming characters

### Backreferences
- `\1`, `\2` in the pattern match the same text as the corresponding group
- `(?P=name)` for named backreferences
- Useful for finding duplicates, matching paired delimiters

### Substitution
- `re.sub(pattern, repl, string)` replaces all matches
- `repl` can be a string with `\1` or `\g<name>` backreferences
- `repl` can be a function receiving the `Match` object for dynamic replacement
- `re.subn` returns the count of replacements alongside the result

### Alternation
- `a|b` matches `a` or `b`; lowest precedence, use `(?:a|b)` to limit scope