# <font color="#418FDE" size="6.5" uppercase>**Substitution and Cleanup**</font>

>Last update: 20251224.
    
By the end of this lecture, you will be able to:
- Use `re.sub` and `re.subn` to perform targeted text substitutions and track the number of replacements. 
- Employ backreferences in replacement strings or functions to rearrange or normalize matched text. 
- Design robust cleanup patterns that use advanced quantifiers and grouping to safely modify messy input data. 


## **1. Regex substitution basics**

### **1.1. Regex text replacement**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Python Regex A-Z/Module_03/Lecture_B/image_01_01.jpg?v=1766636844" width="250">



>* Use patterns to find and replace text
>* Standardize, clean, and transform strings efficiently

>* Substitution uses original text, pattern, and replacement
>* Flexible patterns and replacements clean and standardize data

>* Substitutions apply everywhere, so careless patterns risk damage
>* Test and refine patterns to safely clean data



In [None]:
#@title Python Code - Regex text replacement

# Demonstrate simple regex based text replacement in Python strings.
# Show how patterns describe what gets replaced across entire text.
# Print original and cleaned text to visualize the transformation.

import re  # Import regular expression module for pattern based substitutions.

text = "This   is   a   messy    sentence    with   extra   spaces."  # Example text.

pattern = r"\s+"  # Pattern matches one or more whitespace characters together.

replacement = " "  # Replacement is a single space for each whitespace sequence.

cleaned_text = re.sub(pattern, replacement, text)  # Replace matches with single spaces.

print("Original text:", text)  # Show original messy spacing for comparison.

print("Cleaned text:", cleaned_text)  # Show result after regex based replacement.
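
The warning that substitutions apply everywhere can be seen in a small sketch. The sample sentence and variable names below are assumptions made up for illustration. A bare literal pattern also matches inside longer words, while `\b` word boundaries restrict it to whole words.

```python
import re

sentence = "The cat sat near the catalog and tried to concatenate."  # Hypothetical sample text.

# Careless: the bare literal also matches inside "catalog" and "concatenate".
careless = re.sub(r"cat", "dog", sentence)

# Safer: \b word boundaries restrict the match to the standalone word.
careful = re.sub(r"\bcat\b", "dog", sentence)

print("Careless:", careless)
print("Careful: ", careful)
```

Comparing the two outputs on a small sample like this is a cheap way to test a pattern before running it over real data.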




### **1.2. Counting Substitutions**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Python Regex A-Z/Module_03/Lecture_B/image_01_02.jpg?v=1766636864" width="250">



>* Count how often your regex substitutions occur
>* Use counts to spot missing or unexpected matches

>* Track substitutions to monitor large-scale text redaction
>* Sudden count changes warn of pattern or data issues

>* Use substitution counts for validation and branching
>* Refine patterns and audit exactly what changed



In [None]:
#@title Python Code - Counting Substitutions

# Demonstrate counting substitutions using re.subn with simple phone number cleanup.
# Show how many phone numbers were standardized inside a small customer record list.
# Use the returned substitution count for basic sanity checks and diagnostics.

import re

records = ["Alice: 555-1234", "Bob: 5551234", "Carol: no phone"]

pattern = re.compile(r"\b(\d{3})[- ]?(\d{4})\b")  # Seven digits with an optional dash or space.

print("Original records and standardized versions with substitution counts.")

for record in records:
    cleaned, count = pattern.subn(r"\1-\2", record)  # subn returns (new_string, number_of_substitutions).
    print(f"Original: {record} | Cleaned: {cleaned} | Replacements: {count}")
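
One way to use the count for validation and branching is sketched below. The log lines are invented, and the e-mail pattern is a deliberately simple assumption rather than a complete address grammar.

```python
import re

# Hypothetical log lines; the e-mail pattern is intentionally simple, not RFC-complete.
lines = [
    "Contact alice@example.com or bob@example.org for access.",
    "No contact details on this line.",
]

email_pattern = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

total = 0
redacted_lines = []
for line in lines:
    redacted, count = email_pattern.subn("[REDACTED]", line)
    redacted_lines.append(redacted)
    total += count
    if count == 0:
        # Branch on the count: flag lines where nothing was replaced.
        print("Nothing to redact in:", repr(line))

print("Total redactions:", total)
```

A running total like this can be compared between runs. A sudden drop to zero usually means the pattern or the input format has changed.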





### **1.3. Global vs limited replacements**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Python Regex A-Z/Module_03/Lecture_B/image_01_03.jpg?v=1766636885" width="250">



>* Global replacement changes every matching occurrence found
>* Useful for normalizing text consistently across entire strings

>* Limit replacements by setting a maximum count
>* Use limits to test patterns and reduce risk

>* Choosing global or limited replacement affects both speed and safety
>* Use limited first to test and control changes



In [None]:
#@title Python Code - Global vs limited replacements

# Demonstrate global versus limited regex substitutions using simple repeated words.
# Show how re.sub replaces all matches when count parameter is zero.
# Show how re.sub limits replacements when count parameter is positive.

import re

text_example = "error error error fixed error error"
pattern_word = r"error"
replacement_word = "issue"

print("Original text string:", text_example)

result_global = re.sub(pattern_word, replacement_word, text_example, count=0)
print("Global replacement result:", result_global)

result_limited_one = re.sub(pattern_word, replacement_word, text_example, count=1)
print("Limited replacement one:", result_limited_one)

result_limited_two = re.sub(pattern_word, replacement_word, text_example, count=2)
print("Limited replacement two:", result_limited_two)

result_limited_three = re.sub(pattern_word, replacement_word, text_example, count=3)
print("Limited replacement three:", result_limited_three)
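
A limit is also useful when only the first occurrence carries structural meaning. The sketch below assumes a hypothetical "key: value" line in which later colons belong to the value.

```python
import re

entry = "note: meeting at 10:30: bring slides"  # Hypothetical "key: value" line.

# Global: every colon (plus trailing whitespace) becomes " = ", which mangles the time.
all_replaced = re.sub(r":\s*", " = ", entry)

# Limited: count=1 rewrites only the first separator and leaves the rest intact.
first_only = re.sub(r":\s*", " = ", entry, count=1)

print("Global: ", all_replaced)
print("Limited:", first_only)
```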



## **2. Backreference Replacements**

### **2.1. Backreference Basics**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Python Regex A-Z/Module_03/Lecture_B/image_02_01.jpg?v=1766636915" width="250">



>* Capture parts of matches for later reuse
>* Rebuild text by reordering captured name pieces

>* Treat each capturing group as a container
>* Reuse groups to standardize formats across data

>* Backreferences reuse original data without inventing content
>* They enable consistent cleanup and reformatting across datasets



In [None]:
#@title Python Code - Backreference Basics

# Demonstrate basic backreference usage with Python regular expressions.
# Show how captured groups are reused inside replacement strings.
# Normalize simple name and date formats using backreferences safely.

import re  # Import regular expression module for pattern matching.

text = "Smith, John was born on 03/14/1990."  # Example messy text string.

name_pattern = r"(\w+),\s+(\w+)"  # Capture last name then first name groups.

fixed_names = re.sub(name_pattern, r"\2 \1", text)  # Use backreferences to flip order.

print("Original text string:", text)  # Show original unmodified text string.

print("After name reordering:", fixed_names)  # Show text with reordered name.

date_pattern = r"(\d{2})/(\d{2})/(\d{4})"  # Capture month, day, year groups.

final_text = re.sub(date_pattern, r"\3-\1-\2", fixed_names)  # Rebuild date with dashes.

print("After date normalization:", final_text)  # Show final normalized text string.
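
Numbered backreferences become ambiguous when a digit must follow the group, and they get hard to read in longer replacements. Python's `\g<...>` syntax covers both cases. The sample strings below are assumptions for illustration.

```python
import re

codes = "item7 item42"  # Hypothetical product codes.

# Goal: append a literal "0" after the captured digits.
# r"\10" would be read as a reference to group 10, so \g<1> is used instead.
padded = re.sub(r"item(\d+)", r"item\g<1>0", codes)

# Named groups make longer replacement strings easier to read.
date_text = "Due 03/14/1990"
iso = re.sub(
    r"(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{4})",
    r"\g<year>-\g<month>-\g<day>",
    date_text,
)

print(padded)
print(iso)
```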



### **2.2. Reordering captured groups**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Python Regex A-Z/Module_03/Lecture_B/image_02_02.jpg?v=1766636935" width="250">



>* Use captured groups to rearrange matched text
>* Regex slices parts; replacement reorders them flexibly

>* Use groups to standardize inconsistent names
>* Reorder captured date parts for uniform timestamps

>* Reorder groups to standardize varied text patterns
>* Capture components, then output them in new order



In [None]:
#@title Python Code - Reordering captured groups

# Demonstrate reordering captured regex groups using simple name formats.
# Show how "Last, First" becomes "First Last".
# Use re.sub with backreferences in the replacement string.

import re  # Import regular expression module for pattern matching operations.

names_text = """Smith, John
Doe, Jane
Brown, Charlie"""  # Multiline string with sample names data.

pattern = r"(\w+),\s+(\w+)"  # Capture last name then first name using two groups.

replacement = r"\2 \1"  # Replacement uses second group then first group reordered.

reordered_text = re.sub(pattern, replacement, names_text)  # Apply substitution with reordered groups.

print("Original names list:\n" + names_text)  # Print original names for comparison clarity.

print("\nReordered names list:\n" + reordered_text)  # Print reordered names showing group reordering effect.
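
The same reordering idea applies to timestamps. This sketch assumes invented log lines that begin with a day-first date, and it uses `re.MULTILINE` so that `^` anchors the match to the start of every line.

```python
import re

log = """14/03/2024 09:15 backup started
15/03/2024 23:40 backup finished"""  # Hypothetical day-first log lines.

# ^ with re.MULTILINE matches at the start of each line,
# so only leading timestamps are rewritten.
stamp = re.compile(r"^(\d{2})/(\d{2})/(\d{4}) (\d{2}:\d{2})", re.MULTILINE)

# Groups: 1 = day, 2 = month, 3 = year, 4 = time. Output them year-first.
iso_log = stamp.sub(r"\3-\2-\1T\4", log)

print(iso_log)
```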




### **2.3. Function Based Replacements**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Python Regex A-Z/Module_03/Lecture_B/image_02_03.jpg?v=1766636959" width="250">



>* Custom functions decide replacements for each match
>* Enable complex, context-aware text transformations per occurrence

>* Use match groups in functions for replacements
>* Programmatically reorder, transform, and combine captured text

>* Use functions for complex, context-aware normalization
>* Standardize, anonymize, and migrate data using captured groups



In [None]:
#@title Python Code - Function Based Replacements

# Demonstrate function-based regex replacements that use captured groups.
# Show how a replacement function can normalize messy phone numbers consistently.
# Use captured groups programmatically to build a clean standardized phone format.

import re  # Import regular expression module for pattern matching operations.

text = "Call me at (555) 123-4567 or 555.987.6543 today."  # Example text.

pattern = re.compile(r"\(?(\d{3})\)?[\s.-]*(\d{3})[\s.-]*(\d{4})")  # Capture phone parts; consume optional parentheses.


def normalize_phone(match):  # Replacement function receives one match object.
    area = match.group(1)  # First captured group holds area code digits.
    first = match.group(2)  # Second captured group holds first three digits.
    last = match.group(3)  # Third captured group holds final four digits.
    return f"({area}) {first}-{last}"  # Build standardized phone format string.

cleaned_text, count = pattern.subn(normalize_phone, text)  # Apply function replacement.

print("Original text:", text)  # Show original unnormalized phone numbers for comparison.
print("Cleaned text:", cleaned_text)  # Show text after function based replacements.
print("Replacements made:", count)  # Show how many phone numbers were normalized.
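
A replacement function can also keep state between matches, which a replacement string cannot do. The sketch below is one possible way to anonymize text so that the same address always maps to the same placeholder. The message, the simple e-mail pattern, and the `aliases` dictionary are all assumptions for illustration.

```python
import re

message = (
    "alice@example.com wrote to bob@example.org; "
    "bob@example.org replied to alice@example.com."
)  # Hypothetical text.

email_pattern = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")  # Simple, not RFC-complete.
aliases = {}  # Maps each distinct address to a stable placeholder.

def pseudonymize(match):
    address = match.group(0).lower()
    if address not in aliases:
        # First time this address is seen: assign the next placeholder.
        aliases[address] = f"<user{len(aliases) + 1}>"
    return aliases[address]

anonymized = email_pattern.sub(pseudonymize, message)

print(anonymized)
```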



## **3. Regex cleanup patterns**

### **3.1. Condensing Extra Whitespace**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Python Regex A-Z/Module_03/Lecture_B/image_03_01.jpg?v=1766636997" width="250">



>* Normalize messy spaces, tabs, and line breaks
>* Preserve meaningful structure like paragraphs and indentation

>* Choose carefully which whitespace to collapse
>* Preserve important line breaks and record structure

>* Use grouped patterns to treat whitespace contexts differently
>* Layer steps to trim, condense, and normalize whitespace



In [None]:
#@title Python Code - Condensing Extra Whitespace

# Demonstrate condensing messy whitespace using Python regular expressions.
# Show how lookarounds restrict the match to internal whitespace only.
# Preserve important line breaks while cleaning spaces and tab characters.

import re  # Import regular expression module for whitespace cleanup.

messy_text = "Customer   Name:\tJohn    \t   Smith\n\n\n\nAddress:\t123   Main    Street\nCity:\tNew    York"  # Irregular spaces, tabs, and extra blank lines.

print("Original messy text with irregular spacing and tabs:\n")
print(messy_text)

pattern_internal = r"(?<=\S)[ \t]+(?=\S)"  # Runs of spaces or tabs between two non-whitespace characters.

cleaned_internal = re.sub(pattern_internal, " ", messy_text)  # Collapse each run to one space; newlines are untouched.

print("\nText after condensing internal spaces and tabs only:\n")
print(cleaned_internal)

pattern_blank_lines = r"\n{3,}"  # Three or more consecutive newlines, i.e. two or more blank lines.

fully_cleaned = re.sub(pattern_blank_lines, "\n\n", cleaned_internal)  # Keep at most one blank line.

print("\nText after normalizing excessive blank lines safely:\n")
print(fully_cleaned)
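
The layering idea (trim, then condense, then normalize) can be sketched as three separate passes. The input string is made up, and the step order shown is one reasonable choice rather than the only one.

```python
import re

raw = "  Title line   \t\n\n\n\n\tIndented   item  \nLast    line\t\n"  # Hypothetical input.

# Step 1: trim trailing spaces and tabs on every line ($ matches at each line end).
step1 = re.sub(r"[ \t]+$", "", raw, flags=re.MULTILINE)

# Step 2: condense internal runs only, so leading indentation is preserved.
step2 = re.sub(r"(?<=\S)[ \t]+(?=\S)", " ", step1)

# Step 3: cap consecutive newlines at two (at most one blank line).
step3 = re.sub(r"\n{3,}", "\n\n", step2)

print(repr(step3))
```

Keeping the steps separate makes each one easy to test on its own and to reorder or drop when a dataset needs different rules.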



### **3.2. Normalizing simple formats**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Python Regex A-Z/Module_03/Lecture_B/image_03_02.jpg?v=1766637017" width="250">



>* Use regex to unify messy format variations
>* Group core parts and standardize optional separators

>* Separate required parts from optional, variable details
>* Capture essentials, then rebuild data in standard order

>* Control quantifiers and boundaries to avoid overmatching
>* Balance flexibility and precision for trustworthy normalization



In [None]:
#@title Python Code - Normalizing simple formats

# Demonstrate normalizing simple text formats using Python regular expressions.
# Show how flexible patterns capture messy but valid user input formats.
# Convert varied phone number formats into one consistent normalized representation.

import re  # Import regular expression module for pattern based text cleanup.

raw_phones = ["(555)123-4567", "555 123 4567", "555-1234567", "+1 555.123.4567"]

pattern = re.compile(r"(?:\+1[\s.-]*)?\(?([2-9]\d{2})\)?[\s.-]*([2-9]\d{2})[\s.-]*(\d{4})")

normalized_phones = []  # Store normalized phone numbers after regex substitution cleanup.

for phone in raw_phones:
    cleaned = pattern.sub(r"(\1) \2-\3", phone)
    normalized_phones.append(cleaned)

print("Original phone list:")
print(raw_phones)

print("\nNormalized phone list:")
print(normalized_phones)
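
The same capture-then-rebuild approach works for dates. The sketch below assumes year-first inputs with `/`, `-`, or `.` as separators, and it uses a helper function to zero-pad the month and day. It does not check that the values are valid calendar dates.

```python
import re

raw_dates = ["2024/3/5", "2024-03-05", "2024.3.05", "Released 2024-12-1."]  # Hypothetical inputs.

# Required parts are captured; the separator is flexible but must be one of / - .
date_pattern = re.compile(r"\b(\d{4})[/.-](\d{1,2})[/.-](\d{1,2})\b")

def to_iso(match):
    year, month, day = match.groups()
    # Zero-pad month and day so every date has the same width.
    return f"{year}-{int(month):02d}-{int(day):02d}"

iso_dates = [date_pattern.sub(to_iso, item) for item in raw_dates]

print(iso_dates)
```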



### **3.3. Safe Match Boundaries**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Python Regex A-Z/Module_03/Lecture_B/image_03_03.jpg?v=1766637042" width="250">



>* Define strict boundaries for regex cleanup matches
>* Prevent unwanted changes to valid punctuation and codes

>* Use grouping, anchors, and context to control matches
>* Match phone parentheses only in valid positions

>* Anticipate and exclude sensitive tokens from cleanup
>* Use context and grouping to protect edge cases



In [None]:
#@title Python Code - Safe Match Boundaries

# Demonstrate safe regex boundaries during text cleanup operations.
# Compare aggressive and safe patterns for stripping stray punctuation characters.
# Show how context and grouping protect important tokens from unwanted changes.

import re  # Import regular expression module for pattern based text cleanup.

texts = [
    "Sale!!! Shoes... only $49.99...",
    "Visit docs.example.com... for details...",
    "Version 2.0.0... released today...",
    "Slice 1..10 from ../data...",
]

print("Original, aggressively cleaned, and safely cleaned versions compared.")

aggressive_pattern = re.compile(r"[.!?]{2,}")  # Any run of two or more marks, wherever it appears.

safe_pattern = re.compile(r"(?<=\w)[.!?]{2,}(?!\w)")  # Only runs that end a word and are not followed by one.

for text in texts:
    aggressive_cleaned = aggressive_pattern.sub(".", text)
    safe_cleaned = safe_pattern.sub(".", text)
    print("\nOriginal:", text)
    print("Aggressive:", aggressive_cleaned)
    print("Safe:", safe_cleaned)
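
Boundaries matter just as much when masking digits. In this sketch the memo text is invented, and the assumption is that only standalone four-digit tokens should be masked. Longer numbers and alphanumeric codes must stay intact.

```python
import re

memo = "PIN 4821 for order 20240314, ref A1234B."  # Hypothetical text.

# Naive: any four consecutive digits, including slices of longer numbers and codes.
naive = re.sub(r"\d{4}", "####", memo)

# Guarded: \b on both sides requires the four digits to form a whole token.
guarded = re.sub(r"\b\d{4}\b", "####", memo)

print("Naive:  ", naive)
print("Guarded:", guarded)
```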



# <font color="#418FDE" size="6.5" uppercase>**Substitution and Cleanup**</font>


In this lecture, you learned to:
- Use `re.sub` and `re.subn` to perform targeted text substitutions and track the number of replacements. 
- Employ backreferences in replacement strings or functions to rearrange or normalize matched text. 
- Design robust cleanup patterns that use advanced quantifiers and grouping to safely modify messy input data. 

<font color='yellow'>Congratulations on completing this course!</font>