# VL02 - Regular Expression - Basics & Overview

A regular expression is a compact mini-language for describing text patterns. It works well for quick extraction, validation, and short rules, but it is brittle when the input varies or is deliberately obfuscated.
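
To make the brittleness concrete, here is a minimal sketch on two made-up messages (the strings and the variable names are illustrative assumptions). A strict word pattern catches the plain spelling but misses a trivially obfuscated one, until the pattern is loosened by hand.

```python
import re

# Illustrative sample strings, not taken from a real dataset
messages = ["Get free money", "Get fr33 money"]

strict = re.compile(r"\bfree\b", re.I)       # exact word only
loose = re.compile(r"\bfr[e3]{2}\b", re.I)   # also accept 3 in place of e

strict_hits = [bool(strict.search(m)) for m in messages]
loose_hits = [bool(loose.search(m)) for m in messages]

print(strict_hits)  # [True, False]
print(loose_hits)   # [True, True]
```

Every new obfuscation needs another manual extension of the pattern, which is the practical cost of rule-based matching.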



# `re` operations: quick reference table

| Operation                    |                                     Function | Description                                                                                                                             |                     Return value | One-line example                                                              |
| ---------------------------- | -------------------------------------------: | --------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------: | ----------------------------------------------------------------------------- |
| Search first match anywhere  |                   `re.search(pattern, text)` | Finds the **first** match in `text`; returns a `Match` object or `None`. Useful to test existence or get span/groups.                   |                `Match` or `None` | `m = re.search(r"\bemail\b", "no email here"); bool(m)`                       |
| Match at start only          |                    `re.match(pattern, text)` | Tries to match **only at the start** of the string. (Use `^` with `search` for similar behaviour.)                                      |                `Match` or `None` | `bool(re.match(r"Hello", "Hello world"))`                                     |
| Full-string match            |                `re.fullmatch(pattern, text)` | Returns a match **only if the entire string** fits the pattern. Good for strict validation.                                             |                `Match` or `None` | `bool(re.fullmatch(r"\d{4}-\d{2}-\d{2}", "2025-09-05"))`                      |
| Find all matches (strings)   |                  `re.findall(pattern, text)` | Returns a **list** of all matched substrings (the group's text if the pattern has one group, tuples if it has several). Quick extraction. |     `list[str]` or `list[tuple]` | `re.findall(r"\b\w+\b", "one, two 3")`                                        |
| Iterate matches (with spans) |                 `re.finditer(pattern, text)` | Returns an **iterator** of `Match` objects (gives `.group()`, `.span()`, named groups). Memory-friendly for large texts.                |              iterator of `Match` | `for m in re.finditer(r"\d+", "id42 id7"): print(m.group(), m.span())`        |
| Split by pattern             |                    `re.split(pattern, text)` | Splits `text` using the pattern as delimiter. Useful for flexible tokenization/field splitting.                                         |                      `list[str]` | `re.split(r"[,\s]+", "a, b,c  d")`                                            |
| Substitute / replace         |                `re.sub(pattern, repl, text)` | Replace matches with `repl` (string or callable). Useful for normalization or masking.                                                  |                            `str` | `re.sub(r"\d+", "<NUM>", "price 12 and 7")`                                   |
| Substitution + count         |               `re.subn(pattern, repl, text)` | Same as `sub` but returns `(new_text, n_subs)` — handy when you need how many replacements happened.                                    |                     `(str, int)` | `re.subn(r"\s+", " ", "a   b")`                                               |
| Compile reusable pattern     |             `re.compile(pattern, flags=...)` | Compile a pattern into a `Pattern` object for reuse (faster if used many times). Use `.findall()`, `.search()`, `.sub()` on the object. |                        `Pattern` | `P = re.compile(r"\bfree\b", re.I); P.findall("Free free")`                   |
| Escape literal strings       |                            `re.escape(text)` | Escape user input so it can be used safely in a regex (turns `.` → `\.` etc.).                                                          |                            `str` | `re.escape("a+b.c?")`                                                         |
| Named groups & captures      |              `(?P<name>...)` inside patterns | Capture parts of a match with a **name**; access via `m.group("name")`. Useful for structured extraction.                               |              via `Match.group()` | `m=re.search(r"(?P<user>[\w.-]+)@(?P<dom>[\w.-]+\.\w+)", s); m.group("user")` |
| Lookaround (non-consuming)   | `(?=...)`, `(?!...)`, `(?<=...)`, `(?<!...)` | Check context **without** consuming chars (lookahead/lookbehind). Powerful for context-sensitive matches.                               |              used inside pattern | `re.findall(r"(?<=\$)\d+", "Price $12")`                                      |
| Verbose / commented patterns |                 `re.X` / `re.VERBOSE` (flag) | Allow whitespace and `#` comments in pattern for readability. Good for complex patterns.                                                | used with `compile` or `findall` | `re.compile(r"(\d{2})\.(\d{2})\.(\d{4})", re.X)`                              |
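
The one-line `re.X` example in the table passes the flag but contains no whitespace or comments, so the effect is not visible. A small sketch of what the flag allows, combined with named groups (the group names and the sample sentence are assumptions chosen for illustration):

```python
import re

# With re.VERBOSE, whitespace in the pattern is ignored and
# everything after '#' on a line is treated as a comment.
date_pattern = re.compile(
    r"""
    (?P<day>\d{2})   \.   # two-digit day, then a literal dot
    (?P<month>\d{2}) \.   # two-digit month, then a literal dot
    (?P<year>\d{4})       # four-digit year
    """,
    re.VERBOSE,
)

m = date_pattern.search("Due on 05.09.2025 at noon")
print(m.group("day"), m.group("month"), m.group("year"))  # 05 09 2025
print(m.span())                                           # (7, 17)
```

To match a literal space or `#` inside a verbose pattern, escape it or put it in a character class.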



In [None]:
import re

# quick comparison
print(re.findall(r"\d+", "Order 42, id 7"))          # direct one-liner
p = re.compile(r"\d+")
print(p.findall("Order 42, id 7"))                   # compiled + reused


## Core building blocks

In [None]:
# Literals
print(re.findall(r"cat", "the cat sat"))                   # ['cat']

# Character classes (range)
print(re.findall(r"[A-Za-z]+", "Hello 123 !"))              # ['Hello']

# Negation
print(re.findall(r"[^a-zA-Z]+", "abc123!"))                 # ['123!']

# Quantifiers (example: ? optional)
print(re.findall(r"colou?r", "color colour"))               # ['color', 'colour']

# Anchors
print(re.findall(r"^[Ss]tart", "Start here\nNot start"))       # ['Start']
print(re.findall(r"[Ss]tart$", "Start here\nNot start"))       # ['start']

# Groups & capture (captures alternatives)
print(re.findall(r"(cat|dog)", "cat and dog in the yard"))  # ['cat', 'dog']

# Alternation (same effect without capture)
print(re.findall(r"cat|dog", "cat dog pig"))                # ['cat', 'dog']

# Greedy vs lazy (lazy example)
print(re.findall(r"<tag>.*?</tag>", "<tag>one</tag><tag>two</tag>"))  # ['<tag>one</tag>', '<tag>two</tag>']

# Word boundaries
print(re.findall(r"\bfree\b", "get free stuff, not freeload")) # ['free']

# Escaping a special char (literal dot)
print(re.findall(r"\.", "a.b c.."))                         # ['.', '.', '.']

# Flags (ignore case)
print(re.findall(r"free", "FREE free", flags=re.I))         # ['FREE', 'free']

# Lookarounds (lookbehind example: digits after $)
print(re.findall(r"(?<=\$)\d+", "Price: $12 and $7"))       # ['12', '7']
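
Two details from the cell above are easier to see side by side. The sketch below contrasts greedy and lazy quantifiers on the same input, and shows how the `re.M` flag changes what `^` and `$` anchor to. The variable names are illustrative.

```python
import re

html = "<tag>one</tag><tag>two</tag>"

greedy = re.findall(r"<tag>.*</tag>", html)   # .* takes as much as possible
lazy = re.findall(r"<tag>.*?</tag>", html)    # .*? stops at the first closing tag

print(greedy)  # ['<tag>one</tag><tag>two</tag>']
print(lazy)    # ['<tag>one</tag>', '<tag>two</tag>']

two_lines = "Start here\nNot start"

# Without re.M, ^ and $ anchor to the whole string;
# with re.M they also anchor to the start and end of each line.
line_starts = re.findall(r"^\w+", two_lines, flags=re.M)
line_ends = re.findall(r"\w+$", two_lines, flags=re.M)

print(line_starts)  # ['Start', 'Not']
print(line_ends)    # ['here', 'start']
```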


### Splitting text
We can do more than matching and finding with regular expressions. Some examples:

In [49]:
text = "May is nice. I may go later! Will you join?"

# Plain split: the punctuation is part of the delimiter, so it is consumed and dropped
parts = re.split(r'[.!?]\s+', text)

print(parts)

# With a positive lookbehind (?<=...) the punctuation is only checked, not consumed, so it stays attached to each sentence
re.split(r'(?<=[.!?])\s+', text)


['May is nice', 'I may go later', 'Will you join?']


['May is nice.', 'I may go later!', 'Will you join?']
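
A third option, sketched here as an aside: if the delimiter is wrapped in a capturing group, `re.split` returns the captured delimiters as separate list items. This can be useful when downstream steps should see the punctuation as its own token.

```python
import re

text = "May is nice. I may go later! Will you join?"

# The capturing group ([.!?]) is returned in the result;
# the trailing whitespace \s+ is outside the group and is dropped.
parts = re.split(r"([.!?])\s+", text)

print(parts)
# ['May is nice', '.', 'I may go later', '!', 'Will you join?']
```

The final `?` stays attached to the last item because no whitespace follows it, so the pattern does not match there.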

### Substitutions

In [46]:
text = "Win from $10 to $100 dollars!!!"
re.sub(r"\$\d+", "<MONEY>", text) 

'Win from <MONEY> to <MONEY> dollars!!!'
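
The replacement does not have to be a fixed string. As a hedged sketch, the example below uses a callable to choose the label per match, and backreferences to reorder captured groups. The `bucket` helper, its threshold of 50, and the label names are hypothetical and only there for illustration.

```python
import re

text = "Win from $10 to $100 dollars!!!"

def bucket(match):
    """Hypothetical helper: pick a label based on the captured amount."""
    amount = int(match.group(1))
    return "<SMALL_MONEY>" if amount < 50 else "<BIG_MONEY>"

# repl as a callable: it receives each Match object and returns the replacement
masked = re.sub(r"\$(\d+)", bucket, text)
print(masked)   # Win from <SMALL_MONEY> to <BIG_MONEY> dollars!!!

# repl as a template: \1, \2, \3 refer to the captured groups
swapped = re.sub(r"(\d{2})\.(\d{2})\.(\d{4})", r"\3-\2-\1", "05.09.2025")
print(swapped)  # 2025-09-05
```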

## Exercise
1. Write regular expressions to extract common text signals from short messages: email addresses, URLs, runs of exclamation marks, monetary prices, and occurrences of the word free (including simple obfuscations).

2. Then prepare a new list `text_clean` in which each matching email, URL, and price is replaced by the label `EMAIL`, `URL`, or `PRICE`.

In [None]:
doc_texts = [
    "Get FREE money!!! Visit http://scam.example.com",
    "Fr33 m0ney!!! Claim your prize now",
    "Contact us at info@example.com or sales@shop.eu",
    "Price: $12.40 (special offer)"
]

# Write the pattern that finds each of the following components in the texts
patterns = {
    "email": r"",
    "url":  r"",
    "exclamation": r"",
    "price": r"",
    "free":  r""
}

# 1) show matches
for t in doc_texts:
    print("TEXT:", t)
    for name, pat in patterns.items():
        matches = re.findall(pat, t)
        print(f"  {name}: {matches}")
    print()


# 2) Substitute email/url/price with labels (keep case/spacing otherwise)

# Implement this function
def mask_labels(text):
    return text

for t in doc_texts:
    print("ORIGINAL:", t)
    print("MASKED:  ", mask_labels(t))
    print()

## Advanced rules with `Matcher`
This snippet demonstrates a compact, production-style rule layer in spaCy: it finds token-level patterns (a normal word or a regex for obfuscation) and punctuation bursts, turns each match into a labeled Span, then filters overlaps so you get the single longest, human-readable matches (e.g., `FREE_WORD`, `EXCLAMATION`).

In [None]:
# Requires: pip install -U spacy && python -m spacy download en_core_web_sm
import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans
from spacy.tokens import Span
nlp = spacy.load("en_core_web_sm")

matcher = Matcher(nlp.vocab)
# token pattern matching the literal "free" (case-insensitive) OR whole tokens that look like an obfuscation of it (e.g. "Fr33")
matcher.add("FREE_WORD", [[{"LOWER": "free"}], [{"TEXT": {"REGEX": r"^[fF][rR][eE3][eE3]$"}}]])

# pattern for many punctuation marks (spammy emphasis)
matcher.add("EXCLAMATION", [[{"IS_PUNCT": True, "ORTH": "!", "OP" : "+"}]])

for text in doc_texts:
    doc = nlp(text)
    matches = matcher(doc)
    # create spans so that overlapping matches can be filtered and only the longest kept
    spans = [Span(doc, start, end, label=match_id) for match_id, start, end in matches]
    spans = filter_spans(spans)  # optional (keeps the longest match)
    print("TEXT:", text)
    for s in spans:
        print(" ", s.label_ if s.label_ else "MATCH", "\t" ,s.text, [(t.text, t.pos_) for t in s])
    print()


## References

https://www.dataquest.io/cheat-sheet/regular-expressions-cheat-sheet/