# Chapter 13: Regular Expression Fundamentals

**Chapter 13 - Learning Python**

Regular expressions (regex) are a powerful language for matching patterns in text.
Python's `re` module provides regular expression matching similar to that found in Perl.
This notebook covers the foundational concepts: matching functions, character
classes, quantifiers, anchors, and flags.

## Key Concepts
- **`re.match`** vs **`re.search`**: matching at the start vs scanning the whole string
- **`re.findall`**: extracting all non-overlapping matches
- **Character classes**: `\d`, `\w`, `\s`, and custom sets like `[aeiou]`
- **Quantifiers**: `?`, `+`, `*`, `{n}`, `{n,m}` and greedy vs non-greedy
- **Anchors**: `^`, `$`, `\b` for positional matching
- **Raw strings**: why `r"..."` is essential for regex patterns
- **Flags**: `re.IGNORECASE`, `re.MULTILINE`, `re.DOTALL`

## Raw Strings and Why They Matter

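Regular expressions use backslashes extensively (`\d`, `\w`, `\b`). In normal
Python strings, backslashes introduce escape sequences (`\n`, `\t`), so `"\b"`
becomes a backspace character before the regex engine ever sees it. Raw strings
(`r"..."`) keep backslashes as literal characters, preventing conflicts between
Python's string escaping and regex syntax. Unrecognized escapes such as `"\d"`
pass through a normal string unchanged, but Python warns about them (a
`SyntaxWarning` as of Python 3.12).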

In [None]:
import re

# Without raw string: \b is a backspace character (ASCII 8)
normal_string = "\bword\b"
print(f"Normal string: {normal_string!r}")
print(f"Length: {len(normal_string)}")

# With raw string: \b is preserved literally for the regex engine
raw_string = r"\bword\b"
print(f"\nRaw string: {raw_string!r}")
print(f"Length: {len(raw_string)}")

# Practical difference
text = "a word in a sentence"
# r"\bword\b" matches "word" as a whole word
match = re.search(r"\bword\b", text)
print(f"\nMatch with raw string: {match.group() if match else None}")

# Without raw string, you'd need double backslashes
match2 = re.search("\\bword\\b", text)
print(f"Match with escaped backslashes: {match2.group() if match2 else None}")
print("\nRule of thumb: ALWAYS use raw strings for regex patterns.")
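
The cell above uses the raw pattern and the double-backslash pattern, but never shows the plain non-raw pattern failing. A minimal sketch of that failure, reusing the same sample sentence (the variable names `broken`, `working`, and `literal_backspaces` are illustrative only):

```python
import re

text = "a word in a sentence"

# Without the r prefix, "\bword\b" contains two backspace characters (ASCII 8),
# so the regex engine searches for backspace + "word" + backspace.
broken = re.search("\bword\b", text)

# The raw-string version hands \b to the regex engine as a word-boundary anchor.
working = re.search(r"\bword\b", text)

print(f"Non-raw pattern: {broken}")
print(f"Raw pattern:     {working.group() if working else None}")

# The non-raw pattern only matches text that really contains backspaces.
literal_backspaces = re.search("\bword\b", "a \x08word\x08 here")
print(f"Against literal backspaces: {literal_backspaces is not None}")
```

The non-raw version fails silently: there is no error, just no match, which is why the mistake is easy to miss.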

## `re.match` vs `re.search`

- **`re.match(pattern, string)`**: checks for a match only at the **beginning** of the string
- **`re.search(pattern, string)`**: scans the **entire string** for the first match

Both return a `Match` object on success, or `None` on failure.

In [None]:
import re

text = "Python is powerful"

# re.match: only checks at the START of the string
match_start = re.match(r"Python", text)
match_middle = re.match(r"powerful", text)

print(f"match('Python', text):   {match_start.group() if match_start else None}")
print(f"match('powerful', text): {match_middle}")

# re.search: scans the entire string
search_start = re.search(r"Python", text)
search_middle = re.search(r"powerful", text)

print(f"\nsearch('Python', text):   {search_start.group() if search_start else None}")
print(f"search('powerful', text): {search_middle.group() if search_middle else None}")

# Match object attributes
m = re.search(r"powerful", text)
if m:
    print(f"\nMatch details:")
    print(f"  .group() = {m.group()!r}")
    print(f"  .start() = {m.start()}")
    print(f"  .end()   = {m.end()}")
    print(f"  .span()  = {m.span()}")
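
Two related behaviors are worth seeing next to the comparison above: `re.match` stays tied to the start of the string even under `re.MULTILINE`, and `re.fullmatch` (not covered elsewhere in this notebook) requires the pattern to consume the whole string. A small sketch, with sample strings chosen purely for illustration:

```python
import re

text = "first line\nsecond line"

# re.match is anchored at index 0 even when re.MULTILINE lets ^ match after newlines.
match_ml = re.match(r"^second", text, re.MULTILINE)
search_ml = re.search(r"^second", text, re.MULTILINE)
print(f"match  with MULTILINE: {match_ml}")
print(f"search with MULTILINE: {search_ml.group() if search_ml else None}")

# re.fullmatch succeeds only if the pattern covers the entire string.
full_ok = re.fullmatch(r"\d{5}", "12345")
full_bad = re.fullmatch(r"\d{5}", "12345 extra")
prefix_only = re.match(r"\d{5}", "12345 extra")
print(f"\nfullmatch on '12345':       {full_ok.group() if full_ok else None}")
print(f"fullmatch on '12345 extra': {full_bad}")
print(f"match     on '12345 extra': {prefix_only.group() if prefix_only else None}")
```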

## `re.findall` -- All Non-Overlapping Matches

`re.findall(pattern, string)` returns a list of all non-overlapping matches.
If the pattern contains one capturing group, it returns a list of that group's
text instead of the whole match; with several groups, it returns a list of tuples.

In [None]:
import re

text = "My phone: 555-1234, office: 555-5678, fax: 555-9999"

# Find all phone number patterns
numbers = re.findall(r"\d{3}-\d{4}", text)
print(f"All phone numbers: {numbers}")

# Find all words starting with a capital letter
sentence = "The Quick Brown Fox Jumped Over The Lazy Dog"
capitalized = re.findall(r"[A-Z][a-z]+", sentence)
print(f"Capitalized words: {capitalized}")

# When pattern has ONE group, findall returns list of group contents
tagged = "<b>bold</b> and <i>italic</i>"
contents = re.findall(r"<[bi]>(\w+)</[bi]>", tagged)
print(f"\nTag contents (one group): {contents}")

# When pattern has MULTIPLE groups, findall returns list of tuples
pairs = re.findall(r"<([bi])>(\w+)</\1>", tagged)
print(f"Tag + content (two groups): {pairs}")
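
Because a capturing group changes what `findall` returns, it can help to see the alternatives side by side. The sketch below reuses the phone-number text from the cell above; the non-capturing group `(?:...)` and `re.finditer` are standard `re` features that this notebook does not otherwise introduce.

```python
import re

text = "My phone: 555-1234, office: 555-5678, fax: 555-9999"

# One capturing group: findall returns only the group's text.
prefixes = re.findall(r"(\d{3})-\d{4}", text)
print(f"Capturing group:     {prefixes}")

# (?:...) groups without capturing, so findall returns whole matches again.
whole = re.findall(r"(?:\d{3})-\d{4}", text)
print(f"Non-capturing group: {whole}")

# re.finditer yields Match objects, which keep both the text and its position.
located = [(m.group(), m.span()) for m in re.finditer(r"\d{3}-\d{4}", text)]
for number, (start, end) in located:
    print(f"  {number} at [{start}:{end}]")
```

`finditer` tends to be the better fit when positions or several groups per match are needed, since each `Match` object exposes `.group()`, `.start()`, and `.span()`.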

## Character Classes

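Character classes match a single character from a set of characters. For `str` patterns, `\d`, `\w`, and `\s` are Unicode-aware by default, so the bracket forms in the last column are exact equivalents only under the `re.ASCII` flag.

| Pattern | Meaning | ASCII equivalent |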
|---------|---------|------------|
| `\d`    | Digit | `[0-9]` |
| `\D`    | Non-digit | `[^0-9]` |
| `\w`    | Word character | `[a-zA-Z0-9_]` |
| `\W`    | Non-word character | `[^a-zA-Z0-9_]` |
| `\s`    | Whitespace | `[ \t\n\r\f\v]` |
| `\S`    | Non-whitespace | `[^ \t\n\r\f\v]` |
| `.`     | Any character (except newline by default) | -- |
| `[aeiou]` | Custom set (any vowel) | -- |
| `[^aeiou]` | Negated set (any non-vowel) | -- |

In [None]:
import re

sample = "Order #42: 3 widgets @ $7.50 each (total: $22.50)"

# \d matches digits
digits = re.findall(r"\d+", sample)
print(f"Digits (\\d+): {digits}")

# \w matches word characters (letters, digits, underscore)
words = re.findall(r"\w+", sample)
print(f"Word chars (\\w+): {words}")

# \s matches whitespace
spaces = re.findall(r"\s+", sample)
print(f"Whitespace (\\s+): {spaces!r}")

# Custom character class: vowels
text = "Hello World"
vowels = re.findall(r"[aeiouAEIOU]", text)
print(f"\nVowels in {text!r}: {vowels}")

# Negated character class: non-vowels (excluding spaces)
consonants = re.findall(r"[^aeiouAEIOU\s]", text)
print(f"Non-vowels (no space): {consonants}")

# Ranges inside character classes (letters a-f inside ordinary words match too)
hex_chars = re.findall(r"[0-9a-fA-F]+", "color: #3Fb2c1; id: 9zz")
print(f"\nHex sequences: {hex_chars}")
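
The shorthand classes behave differently on non-ASCII text than their bracket forms suggest. A short sketch of that difference, assuming `str` patterns (the default in Python 3) and sample strings picked only for illustration:

```python
import re

sample = "caf\u00e9 na\u00efve"        # "café naïve"
arabic_digits = "room \u0664\u0662"    # Arabic-Indic digits four and two

# By default \w matches Unicode letters; re.ASCII restricts it to [a-zA-Z0-9_].
unicode_words = re.findall(r"\w+", sample)
ascii_words = re.findall(r"\w+", sample, re.ASCII)
print(f"\\w+ default:  {unicode_words}")
print(f"\\w+ re.ASCII: {ascii_words}")

# Likewise \d matches any Unicode decimal digit unless re.ASCII is set.
unicode_digits = re.findall(r"\d+", arabic_digits)
ascii_digits = re.findall(r"\d+", arabic_digits, re.ASCII)
print(f"\\d+ default:  {unicode_digits}")
print(f"\\d+ re.ASCII: {ascii_digits}")
```

When input must be limited to ASCII, either pass `re.ASCII` or spell out the set explicitly, for example `[0-9]` instead of `\d`.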

## Quantifiers

Quantifiers control how many times the preceding element may repeat.

| Quantifier | Meaning |
|------------|--------|
| `?`        | 0 or 1 (optional) |
| `+`        | 1 or more |
| `*`        | 0 or more |
| `{n}`      | Exactly n |
| `{n,}`     | n or more |
| `{n,m}`    | Between n and m (inclusive) |

In [None]:
import re

# ? -- optional (0 or 1)
# Match 'color' or 'colour'
texts = ["color", "colour", "colouur"]
for t in texts:
    m = re.match(r"colou?r", t)
    print(f"  colou?r on {t!r}: {m.group() if m else None}")

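# + -- one or more
print("\n+ quantifier:")
# Computed outside the f-strings: a backslash inside an f-string expression
# is a SyntaxError before Python 3.12
no_digits = re.findall(r"\d+", "abc")
some_digits = re.findall(r"\d+", "a1b23")
print(f"  \\d+ in 'abc':  {no_digits}")
print(f"  \\d+ in 'a1b23': {some_digits}")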

# * -- zero or more
print("\n* quantifier:")
print(f"  'go*gle' matches: {[w for w in ['ggle', 'gogle', 'google', 'gooogle'] if re.match(r'go*gle$', w)]}")

# {n} -- exactly n
print("\n{n} quantifier:")
zips = ["12345", "1234", "123456", "abcde"]
for z in zips:
    m = re.match(r"^\d{5}$", z)
    print(f"  \\d{{5}} on {z!r}: {'match' if m else 'no match'}")

# {n,m} -- between n and m
print("\n{n,m} quantifier:")
passwords = ["ab", "abc", "abcdefgh", "abcdefghij", "abcdefghijk"]
for pw in passwords:
    m = re.match(r"^\w{3,10}$", pw)
    print(f"  \\w{{3,10}} on {pw!r} (len={len(pw)}): {'valid' if m else 'invalid'}")
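
The table lists `{n,}`, which the cell above does not exercise, and the phrase "preceding element" is easiest to understand with a group. A brief sketch of both, using made-up sample strings and the non-capturing group `(?:...)` as the grouping construct:

```python
import re

# {n,} -- n or more: runs of at least three digits
long_runs = re.findall(r"\d{3,}", "7 42 123 98765")
print(f"\\d{{3,}}: {long_runs}")

# A quantifier binds to the single element before it.
# In ab+ only "b" repeats; in (?:ab)+ the whole group repeats.
b_repeats = bool(re.fullmatch(r"ab+", "abbb"))          # True
no_pair_repeat = bool(re.fullmatch(r"ab+", "abab"))     # False
pair_repeats = bool(re.fullmatch(r"(?:ab)+", "abab"))   # True

print(f"ab+     on 'abbb': {b_repeats}")
print(f"ab+     on 'abab': {no_pair_repeat}")
print(f"(?:ab)+ on 'abab': {pair_repeats}")
```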

## Greedy vs Non-Greedy Matching

By default, quantifiers are **greedy** -- they match as much text as possible.
Adding `?` after a quantifier makes it **non-greedy** (lazy), matching as
little as possible.

| Greedy | Non-Greedy | Behavior |
|--------|------------|----------|
| `*`    | `*?`       | 0 or more (prefer fewer) |
| `+`    | `+?`       | 1 or more (prefer fewer) |
| `?`    | `??`       | 0 or 1 (prefer 0) |
| `{n,m}` | `{n,m}?` | n to m (prefer fewer) |

In [None]:
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* matches as MUCH as possible
greedy = re.findall(r"<.*>", html)
print(f"Greedy <.*>:     {greedy}")

# Non-greedy: .*? matches as LITTLE as possible
non_greedy = re.findall(r"<.*?>", html)
print(f"Non-greedy <.*?>: {non_greedy}")

# Another example with +
text = "aabab"
print(f"\nGreedy a.+b:     {re.findall(r'a.+b', text)}")
print(f"Non-greedy a.+?b: {re.findall(r'a.+?b', text)}")

# Practical example: extracting quoted strings
config = 'name="Alice" age="30" city="New York"'
greedy_quotes = re.findall(r'".*"', config)
lazy_quotes = re.findall(r'".*?"', config)
print(f"\nGreedy quotes:     {greedy_quotes}")
print(f"Non-greedy quotes: {lazy_quotes}")
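
A lazy quantifier is not the only way to stop at the first closing delimiter. One common alternative, sketched here on the same `config` string, is a negated character class that simply cannot cross the delimiter. The second half of the sketch shows the `{n,m}?` form from the table, which the cell above does not exercise.

```python
import re

config = 'name="Alice" age="30" city="New York"'

# [^"]* can never consume a quote, so each match ends at the next closing quote.
negated = re.findall(r'"[^"]*"', config)
lazy = re.findall(r'".*?"', config)
print(f"Negated class: {negated}")
print(f"Lazy .*?:      {lazy}")

# {n,m}? takes the fewest repetitions that still allow a match.
greedy_digits = re.match(r"\d{2,4}", "12345").group()
lazy_digits = re.match(r"\d{2,4}?", "12345").group()
print(f"\nGreedy \\d{{2,4}}:  {greedy_digits}")
print(f"Lazy   \\d{{2,4}}?: {lazy_digits}")
```

The two quote patterns agree on this input. They can differ on multi-line text, because `[^"]` matches newlines while `.` does not unless `re.DOTALL` is set.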

## Anchors: Positional Matching

Anchors don't match characters -- they match **positions** in the string.

| Anchor | Meaning |
|--------|---------|
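| `^`    | Start of string (or of each line with `re.MULTILINE`) |
| `$`    | End of string, or just before a trailing newline (or end of each line with `re.MULTILINE`) |
| `\b`   | Word boundary (between a `\w` character and a `\W` character or the start/end of the string) |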
| `\B`   | Non-word boundary |

In [None]:
import re

# ^ and $ -- start and end of string
lines = ["hello world", "Hello World", "hello", "say hello"]
for line in lines:
    starts = bool(re.match(r"^hello", line))
    ends = bool(re.search(r"hello$", line))
    full = bool(re.match(r"^hello$", line))
    print(f"  {line!r:20s} starts={starts}  ends={ends}  exact={full}")

# \b -- word boundary
text = "cat concatenate catalog scattered"
# Without boundary: matches 'cat' inside other words
no_boundary = re.findall(r"cat", text)
print(f"\nWithout \\b: {no_boundary}")

# With boundary: matches 'cat' only as a whole word
with_boundary = re.findall(r"\bcat\b", text)
print(f"With \\b:    {with_boundary}")

# \b at the start only: words STARTING with 'cat'
starts_with_cat = re.findall(r"\bcat\w*", text)
print(f"Starts with cat: {starts_with_cat}")
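
The anchor table includes `\B`, and `$` has a subtlety around trailing newlines; neither appears in the cell above. A small sketch of both, reusing the same `text` and an illustrative `"hello\n"` input:

```python
import re

text = "cat concatenate catalog scattered"

# \B requires that the position is NOT a word boundary, so \Bcat\B only
# matches "cat" buried inside a longer word.
inner_words = re.findall(r"\w*\Bcat\B\w*", text)
print(f"Words with 'cat' strictly inside: {inner_words}")

# $ also matches just before a single trailing newline;
# re.fullmatch does not tolerate the leftover newline.
dollar_match = re.search(r"^hello$", "hello\n")
full_match = re.fullmatch(r"hello", "hello\n")
print(f"\n^hello$ on 'hello\\n':       {dollar_match is not None}")
print(f"fullmatch hello on 'hello\\n': {full_match is not None}")
```

For validation-style checks, where a stray trailing newline should be rejected, `re.fullmatch` is usually the safer choice than `^...$`.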

## Regex Flags

Flags modify how the regex engine interprets the pattern.

| Flag | Short | Meaning |
|------|-------|---------|
| `re.IGNORECASE` | `re.I` | Case-insensitive matching |
| `re.MULTILINE`  | `re.M` | `^` and `$` match at line boundaries |
| `re.DOTALL`     | `re.S` | `.` matches newline characters too |

In [None]:
import re

# re.IGNORECASE (re.I) -- case-insensitive matching
text = "Python is AWESOME and python is Fun"
case_sensitive = re.findall(r"python", text)
case_insensitive = re.findall(r"python", text, re.IGNORECASE)
print(f"Case-sensitive:   {case_sensitive}")
print(f"Case-insensitive: {case_insensitive}")

# re.MULTILINE (re.M) -- ^ and $ match at line boundaries
multiline_text = """first line
second line
third line"""

# Without MULTILINE: ^ only matches start of entire string
without_ml = re.findall(r"^\w+ line", multiline_text)
print(f"\nWithout MULTILINE: {without_ml}")

# With MULTILINE: ^ matches start of each line
with_ml = re.findall(r"^\w+ line", multiline_text, re.MULTILINE)
print(f"With MULTILINE:    {with_ml}")

# re.DOTALL (re.S) -- dot matches newlines
html = "<div>\nHello\nWorld\n</div>"
without_dotall = re.findall(r"<div>(.+)</div>", html)
with_dotall = re.findall(r"<div>(.+)</div>", html, re.DOTALL)
print(f"\nWithout DOTALL: {without_dotall}")
print(f"With DOTALL:    {with_dotall}")

# Combining flags with the | operator
combined = re.findall(r"^hello", "Hello\nhello\nHELLO", re.IGNORECASE | re.MULTILINE)
print(f"\nCombined I+M: {combined}")
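
Flags can also be written inside the pattern or attached once to a compiled pattern. The sketch below shows both on the same three-line input; inline flags such as `(?im)` and `re.compile` are standard `re` features, and the name `hello_line` is purely illustrative.

```python
import re

text = "Hello\nhello\nHELLO"

# Inline flags: (?im) at the very start of the pattern has the same effect
# as passing re.IGNORECASE | re.MULTILINE.
inline = re.findall(r"(?im)^hello", text)
explicit = re.findall(r"^hello", text, re.IGNORECASE | re.MULTILINE)
print(f"Inline (?im): {inline}")
print(f"Explicit:     {explicit}")

# re.compile stores the flags with the pattern for repeated use.
hello_line = re.compile(r"^hello", re.I | re.M)
compiled = hello_line.findall(text)
print(f"Compiled:     {compiled}")
```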

## Putting It Together: Practical Examples

Combining character classes, quantifiers, anchors, and flags to solve
real pattern-matching tasks.

In [None]:
import re


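def validate_username(username: str) -> bool:
    """Validate username: 3-16 ASCII letters, digits, or underscores, starting with a letter."""
    # fullmatch rejects a trailing newline that "$" would allow;
    # re.ASCII keeps \w from matching non-ASCII letters and digits
    return bool(re.fullmatch(r"[a-zA-Z]\w{2,15}", username, re.ASCII))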


def extract_dates(text: str) -> list[str]:
    """Extract date-shaped strings in MM/DD/YYYY format (digits are not range-checked)."""
    return re.findall(r"\b\d{2}/\d{2}/\d{4}\b", text)


def find_hashtags(text: str) -> list[str]:
    """Extract hashtags from social media text."""
    return re.findall(r"#\w+", text)


# Test username validation
usernames = ["alice", "Bob_42", "_hidden", "ab", "a" * 17, "123start"]
print("Username validation:")
for u in usernames:
    print(f"  {u!r:20s} -> {'valid' if validate_username(u) else 'invalid'}")

# Test date extraction
log = "Created 01/15/2024, modified 03/22/2024, expires 12/31/2025."
print(f"\nDates found: {extract_dates(log)}")

# Test hashtag extraction
tweet = "Learning #Python and #regex is great! #100DaysOfCode"
print(f"Hashtags: {find_hashtags(tweet)}")
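
The date pattern above checks shape only, so a string like `13/45/2024` would pass. One possible way to add a calendar check is to filter the regex matches through `datetime.strptime`. This is a sketch under that assumption: the helper name `extract_valid_dates` and the sample log line are hypothetical, not part of the original examples.

```python
import re
from datetime import datetime


def extract_valid_dates(text: str) -> list[str]:
    """Return MM/DD/YYYY matches that are also real calendar dates."""
    valid = []
    for candidate in re.findall(r"\b\d{2}/\d{2}/\d{4}\b", text):
        try:
            datetime.strptime(candidate, "%m/%d/%Y")
        except ValueError:
            continue  # right shape, impossible date (e.g. month 13)
        valid.append(candidate)
    return valid


log = "Created 01/15/2024, bogus 13/45/2024, leap day 02/29/2024."
shape_only = re.findall(r"\b\d{2}/\d{2}/\d{4}\b", log)
checked = extract_valid_dates(log)
print(f"Shape only: {shape_only}")
print(f"Validated:  {checked}")
```

Splitting the work this way keeps the regex simple: the pattern finds candidates, and ordinary code decides whether each one is meaningful.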

## Summary

### Core Functions
- **`re.match(pattern, string)`**: Match at the start of the string only
- **`re.search(pattern, string)`**: Find the first match anywhere in the string
- **`re.findall(pattern, string)`**: Return all non-overlapping matches as a list

### Character Classes
- `\d` (digit), `\w` (word character), `\s` (whitespace) and their negations `\D`, `\W`, `\S`
- Custom classes `[abc]`, negated `[^abc]`, ranges `[a-z]`

### Quantifiers and Greediness
- `?` (0-1), `+` (1+), `*` (0+), `{n}`, `{n,}`, `{n,m}`; the variable-count forms are greedy by default
- Append `?` for non-greedy: `*?`, `+?`, `??`, `{n,m}?`

### Anchors
- `^` (start), `$` (end), `\b` (word boundary), `\B` (non-word boundary)

### Flags
- `re.IGNORECASE` for case-insensitive, `re.MULTILINE` for line-level `^`/`$`, `re.DOTALL` for `.` matching newlines

### Best Practice
- **Always use raw strings** (`r"..."`) for regex patterns to avoid backslash conflicts