# Regular Expressions Exercises

## Overview

This notebook contains hands-on exercises to practice and reinforce your understanding of Regular Expressions (Regex). Through these exercises, you'll apply regex patterns to solve real-world text processing problems, from basic character matching to complex pattern extraction. These exercises will help you build the practical skills needed for data cleaning, preprocessing, and feature extraction in NLP pipelines.

## Objectives

- Practice writing regex patterns for common text matching scenarios
- Apply character classes, quantifiers, and anchors to solve problems
- Extract and manipulate text using regex groups and capturing
- Translate natural language requirements into regex patterns
- Debug and refine regex patterns using practical examples

## Outline

1. **Exercise Set 1: The Basics** - Character classes, quantifiers, boundaries, match functions, and groups
2. **Exercise Set 2: Processing in Pandas** - Applying regex to DataFrame text columns
3. **Exercise Set 3: Organizing the Messy Exercise Log** - Extracting exercises, sets, and weights from free-form text

In [5]:
# File > Open Folder > W5_NLP
# (VS Code root should be at W5_NLP)
# Then run: `uv sync`
import regex as re
social_post = "Learning #NLP and #Regex is fun! #Python2026"

# TODO: Define a pattern that matches '#' followed by one or more alphanumeric characters
pattern = r"#[a-zA-Z0-9]+"

hashtags = re.findall(pattern, social_post)
print(f"Extracted Hashtags: {hashtags}")
# Expected Output: ['#NLP', '#Regex', '#Python2026']

Extracted Hashtags: ['#NLP', '#Regex', '#Python2026']


In [6]:
# Standard library imports
# (none needed for this notebook)

# Third-party imports
import regex as re
import pandas as pd

## Exercise Set 1: The Basics

### Exercise 1: Character Class Basics (solved)

Write a regex pattern to match:
1. Any vowel (a, e, i, o, u) in a string
2. Any character that is NOT a digit (0-9)
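
The two patterns described above can be sketched as follows. This is a minimal illustration rather than the notebook's official solution; it assumes the standard-library `re` module (the `regex` package imported in this notebook accepts the same patterns), and the `sample` string is made up for the example.

```python
import re  # assumption: stdlib re; `import regex as re` behaves the same for these patterns

sample = "User_123"  # hypothetical test string

# Pattern 1: a character class listing the vowels in both cases
vowel_pattern = r"[aeiouAEIOU]"
vowels = re.findall(vowel_pattern, sample)

# Pattern 2: a negated character class matches any character that is NOT 0-9
non_digit_pattern = r"[^0-9]"
non_digits = re.findall(non_digit_pattern, sample)

print(vowels)      # ['U', 'e']
print(non_digits)  # ['U', 's', 'e', 'r', '_']
```

Note that the negated class also matches the underscore, since it excludes only digits.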

In [9]:
# Exercise 1: Your solution here
test_strings = ["Alpha Beta Gamma", "User_123", "Error-Code-99"]

# Pattern 1: Match uppercase A through M
pattern_am = r"" # TODO
print("Uppercase A-M:")
for s in test_strings:
    print(f"  {s}: {re.findall(pattern_am, s)}")

# Pattern 2: Match alphanumeric (letters/numbers) only
# Hint: Don't use \w here because \w includes underscores!
pattern_alnum = r"" # TODO
print("\nAlphanumeric only:")
for s in test_strings:
    print(f"  {s}: {re.findall(pattern_alnum, s)}")

Uppercase A-M:
  Alpha Beta Gamma: ['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
  User_123: ['', '', '', '', '', '', '', '', '']
  Error-Code-99: ['', '', '', '', '', '', '', '', '', '', '', '', '', '']

Alphanumeric only:
  Alpha Beta Gamma: ['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
  User_123: ['', '', '', '', '', '', '', '', '']
  Error-Code-99: ['', '', '', '', '', '', '', '', '', '', '', '', '', '']


### Exercise 2: Character Ranges

Create patterns to match:
1. All lowercase letters from 'a' to 'm'
2. All digits from 5 to 9
3. All uppercase letters from 'N' to 'Z'
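
One possible way to write the three range patterns is sketched below. It assumes the standard-library `re` module and reuses the test sentence from the exercise cell; the variable names are illustrative only.

```python
import re  # assumption: stdlib re; the notebook's `regex` import works the same way here

text = "The Quick Brown Fox Jumps Over 123 Lazy Dogs"

# Pattern 1: lowercase letters from a to m
lower_am = re.findall(r"[a-m]", text)

# Pattern 2: digits from 5 to 9 (the sentence only contains 1, 2, and 3)
digits_59 = re.findall(r"[5-9]", text)

# Pattern 3: uppercase letters from N to Z
upper_nz = re.findall(r"[N-Z]", text)

print(lower_am)   # ['h', 'e', 'i', 'c', 'k', 'm', 'e', 'a', 'g']
print(digits_59)  # []
print(upper_nz)   # ['T', 'Q', 'O']
```

An empty result for the second pattern is expected here: the pattern is valid, the text simply has no digits in that range.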

In [10]:
# Test data
text = "The Quick Brown Fox Jumps Over 123 Lazy Dogs"

# Pattern 1: Lowercase a-m
### YOUR CODE HERE ###

# Pattern 2: Digits 5-9
### YOUR CODE HERE ###

# Pattern 3: Uppercase N-Z
### YOUR CODE HERE ###
# Exercise 1 Solution (fills in the TODO patterns from the Exercise 1 cell)
test_strings = ["Alpha Beta Gamma", "User_123", "Error-Code-99"]

# Pattern 1: Match uppercase A through M
pattern_am = r"[A-M]"
print("Uppercase A-M:")
for s in test_strings:
    print(f"  {s}: {re.findall(pattern_am, s)}")

# Pattern 2: Match alphanumeric (letters/numbers) only
# Note: Using [a-zA-Z0-9] specifically avoids the underscore _
pattern_alnum = r"[a-zA-Z0-9]"
print("\nAlphanumeric only:")
for s in test_strings:
    print(f"  {s}: {re.findall(pattern_alnum, s)}")

Uppercase A-M:
  Alpha Beta Gamma: ['A', 'B', 'G']
  User_123: []
  Error-Code-99: ['E', 'C']

Alphanumeric only:
  Alpha Beta Gamma: ['A', 'l', 'p', 'h', 'a', 'B', 'e', 't', 'a', 'G', 'a', 'm', 'm', 'a']
  User_123: ['U', 's', 'e', 'r', '1', '2', '3']
  Error-Code-99: ['E', 'r', 'r', 'o', 'r', 'C', 'o', 'd', 'e', '9', '9']


### Exercise 3: Using \d, \w, and \s

Write patterns to:
1. Extract all phone numbers (sequences of digits) from a text
2. Find all words that start with 't' or 'T'
3. Extract all sequences of whitespace

In [11]:
# Test data
text = "Call me at 123-456-7890 or 9876543210 today!"

# Pattern 1: Extract phone numbers (sequences of digits)
### YOUR CODE HERE ###

# Pattern 2: Words starting with 't' or 'T'
### YOUR CODE HERE ###

# Pattern 3: Sequences of whitespace
### YOUR CODE HERE ###
# Test data
text = "Call me at 123-456-7890 or 9876543210 today!"

# Pattern 1: Extract phone numbers (sequences of digits)
# \d+ matches one or more digits in a row
pattern_digits = r"\d+"
phone_numbers = re.findall(pattern_digits, text)
print(f"Phone numbers: {phone_numbers}")
# Output: ['123', '456', '7890', '9876543210']

# Pattern 2: Words starting with 't' or 'T'
# \b is a word boundary, [tT] matches either case, and \w* matches the rest of the word
pattern_t_words = r"\b[tT]\w*"
t_words = re.findall(pattern_t_words, text)
print(f"T-words: {t_words}")
# Output: ['today']

# Pattern 3: Sequences of whitespace
# \s+ matches one or more spaces, tabs, or newlines
pattern_whitespace = r"\s+"
spaces = re.findall(pattern_whitespace, text)
print(f"Whitespace sequences: {len(spaces)} found")

Phone numbers: ['123', '456', '7890', '9876543210']
T-words: ['today']
Whitespace sequences: 6 found


### Exercise 4: Word Boundaries

Create patterns to:
1. Match the word "the" only as a complete word (not part of "there" or "other")
2. Find all words that end with "ing"
3. Match "cat" but not "category" or "scatter"

In [12]:
# Test data
text = "The cat is scattering. There is nothing interesting about this category."

# Pattern 1: Match "the" as complete word
### YOUR CODE HERE ###

# Pattern 2: Words ending with "ing"
### YOUR CODE HERE ###

# Pattern 3: Match "cat" but not "category" or "scatter"

### YOUR CODE HERE ###
# Test data
text = "The cat is scattering. There is nothing interesting about this category."

# Pattern 1: Match "the" as complete word
# Using \b on both sides ensures it's not part of "There" or "Other"
pattern1 = r"\bthe\b"
matches1 = re.findall(pattern1, text, re.I) # re.I for case-insensitivity
print(f"Standalone 'the': {matches1}")

# Pattern 2: Words ending with "ing"
# \w+ matches the start of the word, followed by "ing" and a boundary \b
pattern2 = r"\b\w+ing\b"
matches2 = re.findall(pattern2, text)
print(f"Words ending in 'ing': {matches2}")

# Pattern 3: Match "cat" but not "category" or "scatter"
# This is identical logic to pattern 1: isolation requires \b on both sides
pattern3 = r"\bcat\b"
matches3 = re.findall(pattern3, text)
print(f"Standalone 'cat': {matches3}")

Standalone 'the': ['The']
Words ending in 'ing': ['scattering', 'nothing', 'interesting']
Standalone 'cat': ['cat']


### Exercise 5: Quantifiers

Write patterns to match:
1. Exactly 3 consecutive digits
2. One or more letters followed by zero or more digits
3. Between 2 and 4 consecutive vowels

In [13]:
# Test data
text = "abc123 def4567 ghi jklmnop 12345 aeiou"

# Pattern 1: Exactly 3 consecutive digits
### YOUR CODE HERE ###

# Pattern 2: One or more letters followed by zero or more digits
### YOUR CODE HERE ###

# Pattern 3: Between 2 and 4 consecutive vowels
### YOUR CODE HERE ###
# Test data
text = "abc123 def4567 ghi jklmnop 12345 aeiou"

# Pattern 1: Exactly 3 consecutive digits
# {3} specifies the exact count
pattern1 = r"\d{3}"
matches1 = re.findall(pattern1, text)
print(f"Exactly 3 digits: {matches1}")
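# Note: This also matches '123' inside '12345' and '456' inside '4567'.
# \b\d{3}\b is not a full fix: it would reject 'abc123' too, because there is
# no word boundary between a letter and a digit.
# To match runs of exactly 3 digits, use lookarounds: (?<!\d)\d{3}(?!\d).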

# Pattern 2: One or more letters followed by zero or more digits
# + means 1 or more, * means 0 or more
pattern2 = r"[a-zA-Z]+\d*"
matches2 = re.findall(pattern2, text)
print(f"Letters + Digits: {matches2}")

# Pattern 3: Between 2 and 4 consecutive vowels
# {2,4} specifies a range (inclusive)
pattern3 = r"[aeiouAEIOU]{2,4}"
matches3 = re.findall(pattern3, text)
print(f"2-4 vowels: {matches3}")

Exactly 3 digits: ['123', '456', '123']
Letters + Digits: ['abc123', 'def4567', 'ghi', 'jklmnop', 'aeiou']
2-4 vowels: ['aeio']
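
To make the "exactly 3 digits" distinction concrete, the sketch below compares three variants on the same test string: the plain quantifier, word boundaries, and digit lookarounds. It assumes the standard-library `re` module; the variable names are illustrative.

```python
import re  # assumption: stdlib re; the same patterns work with the `regex` package

text = "abc123 def4567 ghi jklmnop 12345 aeiou"

# Plain quantifier: finds any 3 digits, even inside longer digit runs
loose = re.findall(r"\d{3}", text)

# Word boundaries: letters and digits are both word characters, so there is
# no boundary between 'c' and '1' in 'abc123' and nothing matches here
bounded = re.findall(r"\b\d{3}\b", text)

# Lookarounds: require that the 3 digits are not preceded or followed by a digit
exact = re.findall(r"(?<!\d)\d{3}(?!\d)", text)

print(loose)    # ['123', '456', '123']
print(bounded)  # []
print(exact)    # ['123']
```

Which variant is appropriate depends on whether digits attached to letters (as in `abc123`) should count as a match.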


### Exercise 6: Optional and Repetition

Create patterns to:
1. Match decimal numbers (with optional decimal part)
2. Match email-like patterns (word@word.word format)
3. Match words that may or may not have an 's' at the end

In [14]:
# Test data
text = "Price is 99.99 or 100. Contact admin@site.com or users@example.org for help."

# Pattern 1: Decimal numbers (optional decimal part)
### YOUR CODE HERE ###

# Pattern 2: Email-like patterns
### YOUR CODE HERE ###

# Pattern 3: Words with optional 's' at the end
### YOUR CODE HERE ###
# Test data
text = "Price is 99.99 or 100. Contact admin@site.com or users@example.org for help."

# Pattern 1: Decimal numbers (optional decimal part)
# \d+ matches the whole number, (\.\d+)? makes the decimal point and digits optional
pattern1 = r"\d+(\.\d+)?"
matches1 = [m.group() for m in re.finditer(pattern1, text)]
print(f"Decimal/Whole numbers: {matches1}")

# Pattern 2: Email-like patterns (word@word.word)
# \w+ matches the text parts, separated by literal '@' and '.'
pattern2 = r"\w+@\w+\.\w+"
matches2 = re.findall(pattern2, text)
print(f"Email patterns: {matches2}")

# Pattern 3: Words with optional 's' at the end
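# s? makes the 's' character optional. \b ensures we get full words.
# Note: \w+ already consumes a trailing 's', so s? is redundant here and the pattern matches every word.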
pattern3 = r"\b\w+s?\b"
matches3 = re.findall(pattern3, text)
print(f"Words with optional 's': {matches3}")

Decimal/Whole numbers: ['99.99', '100']
Email patterns: ['admin@site.com', 'users@example.org']
Words with optional 's': ['Price', 'is', '99', '99', 'or', '100', 'Contact', 'admin', 'site', 'com', 'or', 'users', 'example', 'org', 'for', 'help']
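
The decimal-number solution above uses `finditer()` with `m.group()` rather than `findall()`. The sketch below illustrates why: when a pattern contains a capturing group, `findall()` returns the group contents instead of the full match. It assumes the standard-library `re` module, and the non-capturing variant is offered as one alternative, not as the notebook's intended answer.

```python
import re  # assumption: stdlib re; `regex` follows the same findall() convention

text = "Price is 99.99 or 100. Contact admin@site.com or users@example.org for help."

# With a capturing group, findall() returns only what the group captured
captured = re.findall(r"\d+(\.\d+)?", text)

# A non-capturing group (?:...) keeps the grouping but returns the full match
full_matches = re.findall(r"\d+(?:\.\d+)?", text)

print(captured)      # ['.99', '']
print(full_matches)  # ['99.99', '100']
```

The empty string in the first result comes from `100`, where the optional group did not participate in the match.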


### Exercise 7: Using match(), search(), findall(), and finditer()

1. Use `match()` to check if the string starts with "Python"
2. Use `search()` to find the first occurrence of a version number
3. Use `findall()` to extract all version numbers
4. Use `finditer()` to get Match objects with positions for all "Python" occurrences

In [15]:
# Test data
text = "Python 3.10 and Python 2.7 are versions."

# 1. Check if string starts with "Python"
### YOUR CODE HERE ###

# 2. Find first version number
### YOUR CODE HERE ###

# 3. Extract all version numbers
### YOUR CODE HERE ###

# 4. Finditer for "Python" with positions
### YOUR CODE HERE ###
# Test data
text = "Python 3.10 and Python 2.7 are versions."

# 1. Check if string starts with "Python"
# match() only looks at the very beginning of the string
match_obj = re.match(r"Python", text)
print(f"Starts with Python: {match_obj is not None}")

# 2. Find first version number
# search() scans the string and stops at the first match
first_version = re.search(r"\d+\.\d+", text)
if first_version:
    print(f"First version found: {first_version.group()}")

# 3. Extract all version numbers
# findall() returns a simple list of strings
all_versions = re.findall(r"\d+\.\d+", text)
print(f"All versions: {all_versions}")

# 4. Finditer for "Python" with positions
# finditer() yields Match objects one at a time, so each match's position is available via span()
print("Python occurrences:")
for m in re.finditer(r"Python", text):
    print(f"  Found '{m.group()}' at indices {m.span()}")

Starts with Python: True
First version found: 3.10
All versions: ['3.10', '2.7']
Python occurrences:
  Found 'Python' at indices (0, 6)
  Found 'Python' at indices (16, 22)


### Exercise 8: Match Object Attributes

Extract information from matches:
1. Find all numbers in the text
2. For each match, print: the matched text, start position, end position, and span
3. Count how many numbers are in the text

In [16]:
# Test data
text = "I have 5 apples and 12 oranges"

# Pattern to find numbers
### YOUR CODE HERE ###

# Extract match information
### YOUR CODE HERE ###
# Test data
text = "I have 5 apples and 12 oranges"

# Pattern to find numbers
pattern = r"\d+"

# Extract match information
matches = list(re.finditer(pattern, text))

print(f"Total numbers found: {len(matches)}\n")

for m in matches:
    # m.group() -> The actual matched text
    # m.start() -> Starting index
    # m.end()   -> Ending index
    # m.span()  -> Tuple of (start, end)
    print(f"Match: '{m.group()}'")
    print(f"  Start: {m.start()}")
    print(f"  End:   {m.end()}")
    print(f"  Span:  {m.span()}")
    print("-" * 20)

Total numbers found: 2

Match: '5'
  Start: 7
  End:   8
  Span:  (7, 8)
--------------------
Match: '12'
  Start: 20
  End:   22
  Span:  (20, 22)
--------------------


### Exercise 9: Extracting Groups

Extract components from:
1. Dates in format "YYYY-MM-DD" - extract year, month, and day separately
2. Time in format "HH:MM:SS" - extract hours, minutes, and seconds
3. Phone numbers in format "(XXX) XXX-XXXX" - extract area code, exchange, and number

In [17]:
# Test data
text = "Meeting on 2024-03-15 at 14:30:00. Call (555) 123-4567"

# 1. Extract date components (year, month, day)
### YOUR CODE HERE ###

# 2. Extract time components (hours, minutes, seconds)
### YOUR CODE HERE ###

# 3. Extract phone number components (area code, exchange, number)
### YOUR CODE HERE ###
# Test data
text = "Meeting on 2024-03-15 at 14:30:00. Call (555) 123-4567"

# 1. Extract date components (year, month, day)
# Pattern: (\d{4})-(\d{2})-(\d{2})
date_match = re.search(r"(\d{4})-(\d{2})-(\d{2})", text)
if date_match:
    year = date_match.group(1)
    month = date_match.group(2)
    day = date_match.group(3)
    print(f"Date -> Year: {year}, Month: {month}, Day: {day}")

# 2. Extract time components (hours, minutes, seconds)
# Pattern: (\d{2}):(\d{2}):(\d{2})
time_match = re.search(r"(\d{2}):(\d{2}):(\d{2})", text)
if time_match:
    h, m, s = time_match.groups() # groups() returns all captured parts as a tuple
    print(f"Time -> Hours: {h}, Mins: {m}, Secs: {s}")

# 3. Extract phone components (area code, exchange, number)
# Note: Parentheses in the text must be escaped with \
phone_pattern = r"\((\d{3})\)\s*(\d{3})-(\d{4})"
phone_match = re.search(phone_pattern, text)
if phone_match:
    area, exchange, line = phone_match.groups()
    print(f"Phone -> Area: {area}, Exchange: {exchange}, Number: {line}")

Date -> Year: 2024, Month: 03, Day: 15
Time -> Hours: 14, Mins: 30, Secs: 00
Phone -> Area: 555, Exchange: 123, Number: 4567


### Exercise 10: Nested Groups

Extract information from nested groups:
1. Parse product name and price separately
2. Parse name and age
3. Understand group numbering in nested patterns like "((word) number)"

In [18]:
# Test data
text = "Product: Laptop, Price: $999.99 and Name: John Doe, Age: 30"

# 1. Extract product and price
### YOUR CODE HERE ###

# 2. Extract name and age (with nested groups to capture first and last name separately)
### YOUR CODE HERE ###
# Test data
text = "Product: Laptop, Price: $999.99 and Name: John Doe, Age: 30"

# 1. Extract product and price
# We use groups to capture the name after 'Product: ' and digits after '$'
prod_pattern = r"Product:\s*(\w+),\s*Price:\s*\$([\d.]+)"
prod_match = re.search(prod_pattern, text)
if prod_match:
    print(f"Product: {prod_match.group(1)}, Price: {prod_match.group(2)}")

# 2. Extract name and age with nested groups
# Outer group (Group 1) is the full name.
# Inner Group 2 is First Name, Inner Group 3 is Last Name.
# Group 4 is Age.
name_age_pattern = r"Name:\s*((\w+)\s+(\w+)),\s*Age:\s*(\d+)"
m = re.search(name_age_pattern, text)

if m:
    print(f"Full Match: {m.group(0)}")
    print(f"Full Name (Group 1): {m.group(1)}")
    print(f"First Name (Group 2): {m.group(2)}")
    print(f"Last Name (Group 3): {m.group(3)}")
    print(f"Age (Group 4): {m.group(4)}")

Product: Laptop, Price: 999.99
Full Match: Name: John Doe, Age: 30
Full Name (Group 1): John Doe
First Name (Group 2): John
Last Name (Group 3): Doe
Age (Group 4): 30


### Exercise 11: Named Groups

Use named groups to extract:
1. Email addresses - extract username and domain separately
2. URLs - extract protocol, domain, and path
3. Full names - extract first name, middle name (if present), and last name

In [19]:
# Test data
text = "Contact admin@example.com or visit https://www.example.com/path for info. Name: John Michael Doe"

# 1. Extract email with named groups
### YOUR CODE HERE ###

# 2. Extract URL components
### YOUR CODE HERE ###

# 3. Extract full name (with optional middle name)
### YOUR CODE HERE ###
# Test data
text = "Contact admin@example.com or visit https://www.example.com/path for info. Name: John Michael Doe"

# 1. Extract email with named groups
email_pattern = r"(?P<username>\w+)@(?P<domain>\w+\.\w+)"
email_match = re.search(email_pattern, text)
if email_match:
    print(f"Email -> User: {email_match.group('username')}, Domain: {email_match.group('domain')}")

# 2. Extract URL components
# We use :// as a separator and / for the path
url_pattern = r"(?P<protocol>https?|ftp)://(?P<domain>[^/\s]+)(?P<path>/[^\s]*)?"
url_match = re.search(url_pattern, text)
if url_match:
    print(f"URL -> Protocol: {url_match.group('protocol')}, Domain: {url_match.group('domain')}, Path: {url_match.group('path')}")

# 3. Extract full name (with optional middle name)
# The middle name and its trailing whitespace sit inside an optional non-capturing group (?:...)?
name_pattern = r"Name:\s*(?P<first>\w+)\s+(?:(?P<middle>\w+)\s+)?(?P<last>\w+)"
name_match = re.search(name_pattern, text)
if name_match:
    print(f"Name -> First: {name_match.group('first')}, Middle: {name_match.group('middle')}, Last: {name_match.group('last')}")

Email -> User: admin, Domain: example.com
URL -> Protocol: https, Domain: www.example.com, Path: /path
Name -> First: John, Middle: Michael, Last: Doe


### Exercise 12: Named Groups Practice

Extract structured data using named groups:
1. Parse date and time components with named groups
2. Parse IP address (4 octets) and port
3. Use groupdict() to get all named groups as a dictionary

In [20]:
# Test data
text = "Date: 2024-03-15, Time: 14:30 and IP: 192.168.1.1, Port: 8080"

# 1. Parse date and time with named groups
### YOUR CODE HERE ###

# 2. Parse IP and port
### YOUR CODE HERE ###
# Test data
text = "Date: 2024-03-15, Time: 14:30 and IP: 192.168.1.1, Port: 8080"

# 1. Parse date and time with named groups
# We capture Year, Month, Day, Hour, and Minute
dt_pattern = r"Date:\s*(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2}),\s*Time:\s*(?P<hour>\d{2}):(?P<minute>\d{2})"
dt_match = re.search(dt_pattern, text)

if dt_match:
    print("Date/Time Dictionary:")
    print(dt_match.groupdict())

# 2. Parse IP and port
# An IP has 4 octets separated by dots. The port is the run of digits after the "Port:" label.
# Note: \. is used to match a literal dot
ip_pattern = r"IP:\s*(?P<ip>(?:\d{1,3}\.){3}\d{1,3}),\s*Port:\s*(?P<port>\d+)"
ip_match = re.search(ip_pattern, text)

if ip_match:
    print("\nIP/Port Dictionary:")
    # Using groupdict() to see all named captures as a dict
    data_dict = ip_match.groupdict()
    print(data_dict)
    print(f"Connecting to {data_dict['ip']} on port {data_dict['port']}...")

Date/Time Dictionary:
{'year': '2024', 'month': '03', 'day': '15', 'hour': '14', 'minute': '30'}

IP/Port Dictionary:
{'ip': '192.168.1.1', 'port': '8080'}
Connecting to 192.168.1.1 on port 8080...


## Exercise Set 2: Processing in Pandas

In [21]:
import pandas as pd

### Exercise 1: Extracting Numerical Values from Text

A common task in data cleaning is extracting numerical values from text columns. For example, you might have ratings written as "3/10" or "8 out of 10", or prices mixed with text.

**Task**: Extract numerical values from a DataFrame column containing mixed text and numbers.

In [22]:
# Sample data: Product reviews with ratings in various formats
df = pd.DataFrame({
    'product': ['Laptop', 'Phone', 'Tablet', 'Headphones', 'Keyboard'],
    'review_text': [
        'Great product! Rating: 4/5',
        'Love it! 9 out of 10',
        'Average quality. Score: 3/10',
        'Excellent! Rated 5 stars',
        'Not bad, 7.5/10'
    ]
})

print("Original DataFrame:")
print(df)
print("\n" + "="*50 + "\n")

# Task 1: Extract the first numerical rating (e.g., "4" from "4/5", "9" from "9 out of 10")
# Hint: Look for patterns like "number/number" or "number out of number"
### YOUR CODE HERE ###

# Task 2: Extract both numerator and denominator from "X/Y" format
# Create new columns: 'rating_numerator' and 'rating_denominator'
### YOUR CODE HERE ###

# Task 3: Extract decimal ratings (e.g., "7.5" from "7.5/10")
### YOUR CODE HERE ###

Original DataFrame:
      product                   review_text
0      Laptop    Great product! Rating: 4/5
1       Phone          Love it! 9 out of 10
2      Tablet  Average quality. Score: 3/10
3  Headphones      Excellent! Rated 5 stars
4    Keyboard               Not bad, 7.5/10




### Exercise 2: Cleaning and Standardizing Text Data

Real-world data often contains inconsistent formatting. You need to clean phone numbers, emails, or other structured data that appears in various formats.

**Task**: Clean and standardize phone numbers and email addresses in a DataFrame.
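
A minimal sketch of one way to approach this task is shown below. It is not the notebook's reference solution: the `contacts` DataFrame is a shortened, hypothetical copy of the sample data, and the phone and email patterns are deliberately simple assumptions that cover only the formats listed in the hint.

```python
import pandas as pd

# Hypothetical subset of the exercise data
contacts = pd.DataFrame({
    "contact_info": [
        "Phone: (555) 123-4567, Email: john@example.com",
        "Call me at 555.123.4567 or email: jane.doe@test.org",
        "Contact: 5551234567, jane@company.co.uk",
    ]
})

# Three digit groups with optional parentheses and an optional separator
phone_pattern = r"\(?(\d{3})\)?[\s.-]?(\d{3})[\s.-]?(\d{4})"
parts = contacts["contact_info"].str.extract(phone_pattern)

# Rebuild every number in the standard (XXX) XXX-XXXX format
contacts["phone"] = "(" + parts[0] + ") " + parts[1] + "-" + parts[2]

# Simplified email pattern: local part, '@', then one or more dot-separated labels
email_pattern = r"([\w.+-]+@[\w-]+(?:\.[\w-]+)+)"
contacts["email"] = contacts["contact_info"].str.extract(email_pattern, expand=False)

print(contacts[["phone", "email"]])
```

Capturing the three digit groups separately is what allows the number to be reassembled in a single target format regardless of the original separators.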

In [26]:
# Sample data: Customer contact information with inconsistent formatting
df = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'contact_info': [
        'Phone: (555) 123-4567, Email: john@example.com',
        'Call me at 555.123.4567 or email: jane.doe@test.org',
        'Contact: 5551234567, jane@company.co.uk',
        'Phone: 555-123-4567 | Email: contact@site.net',
        'Tel: (555)123-4567, mail: info@domain.com'
    ]
})

print("Original DataFrame:")
print(df)
print("\n" + "="*50 + "\n")

# Task 1: Extract phone numbers in standard format (XXX) XXX-XXXX
# Hint: Handle various formats like (555) 123-4567, 555.123.4567, 555-123-4567, 5551234567
### YOUR CODE HERE ###

# Task 2: Extract email addresses
### YOUR CODE HERE ###

# Task 3: Create a cleaned DataFrame with separate 'phone' and 'email' columns
### YOUR CODE HERE ###
# Solution to Exercise 1 (ratings): re-create the reviews DataFrame
df_reviews = pd.DataFrame({
    'product': ['Laptop', 'Phone', 'Tablet', 'Headphones', 'Keyboard'],
    'review_text': [
        'Great product! Rating: 4/5',
        'Love it! 9 out of 10',
        'Average quality. Score: 3/10',
        'Excellent! Rated 5 stars',
        'Not bad, 7.5/10'
    ]
})

# Task 1 & 3: Extract numerical/decimal ratings
df_reviews['first_rating'] = df_reviews['review_text'].str.extract(r'(\d+\.?\d*)')

# Task 2: Extract numerator and denominator
rating_parts = df_reviews['review_text'].str.extract(r'(\d+\.?\d*)\s*/\s*(\d+)')
df_reviews['rating_numerator'] = rating_parts[0]
df_reviews['rating_denominator'] = rating_parts[1]

print("Processed Reviews:")
display(df_reviews.head())

Original DataFrame:
   customer_id                                       contact_info
0            1     Phone: (555) 123-4567, Email: john@example.com
1            2  Call me at 555.123.4567 or email: jane.doe@tes...
2            3            Contact: 5551234567, jane@company.co.uk
3            4      Phone: 555-123-4567 | Email: contact@site.net
4            5          Tel: (555)123-4567, mail: info@domain.com


Processed Reviews:


      product                   review_text  first_rating  rating_numerator  rating_denominator
0      Laptop    Great product! Rating: 4/5           4.0               4.0                 5.0
1       Phone          Love it! 9 out of 10           9.0               NaN                 NaN
2      Tablet  Average quality. Score: 3/10           3.0               3.0                10.0
3  Headphones      Excellent! Rated 5 stars           5.0               NaN                 NaN
4    Keyboard               Not bad, 7.5/10           7.5               7.5                10.0


### Exercise 3: Extracting Structured Information from Unstructured Text

In many datasets, important information is embedded in free-form text. You need to extract dates, prices, or other structured data.

**Task**: Extract dates, prices, and product names from customer order descriptions.

In [28]:
# Sample data: Order descriptions with embedded information
df = pd.DataFrame({
    'order_id': [1001, 1002, 1003, 1004, 1005],
    'description': [
        'Order placed on 2024-03-15. Product: Laptop, Price: $999.99',
        'Purchase date: 2024-03-20. Item: Wireless Mouse, Cost: $29.99',
        'Ordered on 2024-04-01. Product: USB-C Cable, Price: $15.50',
        'Date: 2024-04-10. Product: Monitor Stand, Amount: $79.99',
        'Order 2024-04-15. Product: Keyboard, Total: $89.00'
    ]
})

print("Original DataFrame:")
print(df)
print("\n" + "="*50 + "\n")

# Task 1: Extract dates in YYYY-MM-DD format
# Create a new column 'order_date'
### YOUR CODE HERE ###

# Task 2: Extract prices (handle formats like $999.99, $29.99, etc.)
# Create a new column 'price' (as float, without the $ sign)
### YOUR CODE HERE ###

# Task 3: Extract product names (text after "Product:" or "Item:")
# Create a new column 'product_name'
### YOUR CODE HERE ###

# Display the cleaned DataFrame
### YOUR CODE HERE ###
# Task 1: Extract dates in YYYY-MM-DD format
# Pattern: Four digits, dash, two digits, dash, two digits
df['order_date'] = df['description'].str.extract(r'(\d{4}-\d{2}-\d{2})')

# Task 2: Extract prices (stripping the $ sign and converting to float)
# Pattern: Literal '$' followed by digits and a decimal
df['price'] = df['description'].str.extract(r'\$(\d+\.\d{2})').astype(float)

# Task 3: Extract product names
# Pattern: the non-capturing group (?:Product|Item) matches either label via the | operator,
# and ([^,]+) captures everything up to the next comma.
# A stricter alternative (defined for reference, not used below) stops at a comma,
# the end of the string, or a price label:
product_pattern = r'(?:Product:|Item:)\s*([\w\s-]+?)(?:,|$|Amount|Price|Cost|Total)'
# The simpler version is enough for this specific data:
df['product_name'] = df['description'].str.extract(r'(?:Product|Item):\s*([^,]+)')

# Clean up any trailing spaces in product names
df['product_name'] = df['product_name'].str.strip()

print("Cleaned Order DataFrame:")
print(df[['order_id', 'order_date', 'product_name', 'price']])

Original DataFrame:
   order_id                                        description
0      1001  Order placed on 2024-03-15. Product: Laptop, P...
1      1002  Purchase date: 2024-03-20. Item: Wireless Mous...
2      1003  Ordered on 2024-04-01. Product: USB-C Cable, P...
3      1004  Date: 2024-04-10. Product: Monitor Stand, Amou...
4      1005  Order 2024-04-15. Product: Keyboard, Total: $8...


Cleaned Order DataFrame:
   order_id  order_date    product_name   price
0      1001  2024-03-15          Laptop  999.99
1      1002  2024-03-20  Wireless Mouse   29.99
2      1003  2024-04-01     USB-C Cable   15.50
3      1004  2024-04-10   Monitor Stand   79.99
4      1005  2024-04-15        Keyboard   89.00


### Exercise 4: Using str.extract() and str.extractall() with Named Groups

Pandas provides convenient methods for applying regex to DataFrame columns. Use `str.extract()` for single matches and `str.extractall()` for multiple matches per row.

**Task**: Extract multiple pieces of information from text using pandas string methods.

In [None]:
# Sample data: Log entries with IP addresses, timestamps, and status codes
df = pd.DataFrame({
    'log_entry': [
        '192.168.1.1 - [2024-03-15 14:30:00] GET /page1 HTTP/1.1 200',
        '10.0.0.5 - [2024-03-15 15:45:22] POST /api/data HTTP/1.1 404',
        '172.16.0.10 - [2024-03-16 09:12:33] GET /page2 HTTP/1.1 200',
        '192.168.1.2 - [2024-03-16 10:20:15] GET /page3 HTTP/1.1 500',
        '10.0.0.8 - [2024-03-16 11:05:44] POST /api/user HTTP/1.1 201'
    ]
})

print("Original DataFrame:")
print(df)
print("\n" + "="*50 + "\n")

# Task 1: Extract IP address, timestamp, HTTP method, endpoint, and status code
# Use named groups and str.extract() to create separate columns
# Format: IP - [TIMESTAMP] METHOD /endpoint HTTP/version STATUS
### YOUR CODE HERE ###

# Task 2: Filter rows where status code is an error (4xx or 5xx)
### YOUR CODE HERE ###

# Task 3: Extract all IP addresses from a column that may contain multiple IPs per row
# Use str.extractall() if there can be multiple matches
df_multiple = pd.DataFrame({
    'text': [
        'IPs: 192.168.1.1, 10.0.0.5, 172.16.0.1',
        'Contact: 192.168.1.2 or 10.0.0.8',
        'No IPs here',
        'Single IP: 192.168.1.3'
    ]
})
### YOUR CODE HERE ###
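
The sketch below shows one way the three tasks could be approached with `str.extract()`, `str.match()`, and `str.extractall()`. It is an illustration under assumptions, not the notebook's reference solution: the `logs` and `multi` DataFrames are shortened copies of the sample data, and the named-group pattern assumes every log line follows the `IP - [TIMESTAMP] METHOD /endpoint HTTP/version STATUS` layout exactly.

```python
import pandas as pd

# Hypothetical subset of the log data
logs = pd.DataFrame({"log_entry": [
    "192.168.1.1 - [2024-03-15 14:30:00] GET /page1 HTTP/1.1 200",
    "10.0.0.5 - [2024-03-15 15:45:22] POST /api/data HTTP/1.1 404",
]})

# Task 1: named groups become column names in the extracted DataFrame
log_pattern = (
    r"(?P<ip>(?:\d{1,3}\.){3}\d{1,3}) - "
    r"\[(?P<timestamp>[^\]]+)\] "
    r"(?P<method>[A-Z]+) "
    r"(?P<endpoint>/\S*) "
    r"HTTP/[\d.]+ "
    r"(?P<status>\d{3})"
)
parsed = logs["log_entry"].str.extract(log_pattern)

# Task 2: keep rows whose status code starts with 4 or 5
errors = parsed[parsed["status"].str.match(r"[45]\d{2}")]

# Task 3: extractall() returns one row per match, indexed by (row, match number)
multi = pd.DataFrame({"text": [
    "IPs: 192.168.1.1, 10.0.0.5, 172.16.0.1",
    "No IPs here",
]})
all_ips = multi["text"].str.extractall(r"(?P<ip>(?:\d{1,3}\.){3}\d{1,3})")

print(parsed)
print(errors)
print(all_ips)
```

Rows without any match (such as `"No IPs here"`) do not appear in the `extractall()` result, whereas `extract()` keeps them as rows of missing values.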

### Exercise 5: Data Cleaning with str.replace() and str.contains()

Use regex with pandas string methods to clean data, filter rows, and transform text columns.

**Task**: Clean product descriptions, remove unwanted characters, and filter based on patterns.

In [31]:
# Sample data: Product catalog with messy descriptions
df = pd.DataFrame({
    'product_id': [1, 2, 3, 4, 5, 6],
    'description': [
        'Laptop - $999.99 (In Stock)',
        'Phone @ $599.99 [Available]',
        'Tablet: $399.99 {Limited Stock}',
        'Headphones - $149.99 (Out of Stock)',
        'Keyboard $79.99 [In Stock]',
        'Mouse: $29.99 {Available}'
    ],
    'category': [
        'Electronics - Computers',
        'Electronics/Mobile',
        'Electronics.Tablets',
        'Audio-Equipment',
        'Electronics\\Keyboards',
        'Electronics|Mice'
    ]
})

print("Original DataFrame:")
print(df)
print("\n" + "="*50 + "\n")

# Task 1: Remove special characters from descriptions (keep only alphanumeric, spaces, and basic punctuation)
# Create a 'clean_description' column
### YOUR CODE HERE ###

# Task 2: Extract prices and create a numeric 'price' column
### YOUR CODE HERE ###

# Task 3: Standardize category separators (replace -, /, ., \\, | with a single separator like " > ")
### YOUR CODE HERE ###

# Task 4: Filter products that are in stock (contain "In Stock" or "Available")
### YOUR CODE HERE ###

# Task 5: Remove HTML-like tags or brackets (e.g., remove [Available], (In Stock), {Limited Stock})
### YOUR CODE HERE ###
import pandas as pd
import regex as re

# Use a raw string (r'') for the category with a backslash to avoid SyntaxWarnings
df_products = pd.DataFrame({
    'product_id': [1, 2, 3, 4, 5, 6],
    'description': [
        'Laptop - $999.99 (In Stock)',
        'Phone @ $599.99 [Available]',
        'Tablet: $399.99 {Limited Stock}',
        'Headphones - $149.99 (Out of Stock)',
        'Keyboard $79.99 [In Stock]',
        'Mouse: $29.99 {Available}'
    ],
    'category': [
        'Electronics - Computers',
        'Electronics/Mobile',
        'Electronics.Tablets',
        'Audio-Equipment',
        r'Electronics\Keyboards',  # raw string keeps the backslash literal
        'Electronics|Mice'
    ]
})

# 1. Extract Name, Price, and Status
desc_pattern = r'(?P<name>^\w+).*?\$?(?P<price>\d+\.\d{2}).*?[(\[{\s]+(?P<status>.*?)[)\]}\s]+$'
df_cleaned = df_products['description'].str.extract(desc_pattern)
df_cleaned['price'] = df_cleaned['price'].astype(float)

# 2. Standardize Categories using \W+ (one or more non-word characters, i.e. anything except letters, digits, and underscore)
category_split = df_products['category'].str.split(r'\W+', expand=True)
df_cleaned['main_category'] = category_split[0]
df_cleaned['sub_category'] = category_split[1]

# 3. Combine and Display
df_final = pd.concat([df_products[['product_id']], df_cleaned], axis=1)

print("Final Cleaned Product Catalog:")
display(df_final)

Original DataFrame:
   product_id                          description                 category
0           1          Laptop - $999.99 (In Stock)  Electronics - Computers
1           2          Phone @ $599.99 [Available]       Electronics/Mobile
2           3      Tablet: $399.99 {Limited Stock}      Electronics.Tablets
3           4  Headphones - $149.99 (Out of Stock)          Audio-Equipment
4           5           Keyboard $79.99 [In Stock]    Electronics\Keyboards
5           6            Mouse: $29.99 {Available}         Electronics|Mice


Final Cleaned Product Catalog:


   product_id        name   price         status  main_category  sub_category
0           1      Laptop  999.99       In Stock    Electronics     Computers
1           2       Phone  599.99      Available    Electronics        Mobile
2           3      Tablet  399.99  Limited Stock    Electronics       Tablets
3           4  Headphones  149.99   Out of Stock          Audio     Equipment
4           5    Keyboard   79.99       In Stock    Electronics     Keyboards
5           6       Mouse   29.99      Available    Electronics          Mice


## Exercise Set 3: Organizing the Messy Exercise Log

**Task**: Extract the exercises and the number of sets from this text.
If the exercise has a weight, extract the weight and unify the unit of measurement (either in kg or lbs).

**Log Text**:

```{python}
text = """
Pushups 30 reps 3 sets
5 reps 2 sets Pullups
2 Sets 15 Reps One-leg Squats
4 sets 8 reps 22.5 lbs Dumbbell Rows
4 sets 8 reps 15.25kg Dumbbell Rows
"""

```
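
One possible approach to this task is sketched below. It is a starting point under stated assumptions rather than a reference solution: it uses the standard-library `re` module, chooses kilograms as the unified unit, and the helper `parse_line` and the constant `LB_TO_KG` are names introduced only for this example.

```python
import re  # assumption: stdlib re; the notebook's `regex` import would work as well

text = """
Pushups 30 reps 3 sets
5 reps 2 sets Pullups
2 Sets 15 Reps One-leg Squats
4 sets 8 reps 22.5 lbs Dumbbell Rows
4 sets 8 reps 15.25kg Dumbbell Rows
"""

LB_TO_KG = 0.45359237  # kilograms per pound

sets_re = re.compile(r"(\d+)\s*sets", re.IGNORECASE)
reps_re = re.compile(r"(\d+)\s*reps", re.IGNORECASE)
weight_re = re.compile(r"(\d+(?:\.\d+)?)\s*(kg|lbs)", re.IGNORECASE)


def parse_line(line):
    """Return the exercise name, set count, and weight in kg (None if absent)."""
    sets = int(sets_re.search(line).group(1))

    weight_kg = None
    weight_match = weight_re.search(line)
    if weight_match:
        value = float(weight_match.group(1))
        unit = weight_match.group(2).lower()
        weight_kg = round(value * LB_TO_KG, 2) if unit == "lbs" else value

    # Strip the numeric fields; whatever remains is the exercise name
    name = line
    for pattern in (sets_re, reps_re, weight_re):
        name = pattern.sub("", name)

    return {"exercise": name.strip(), "sets": sets, "weight_kg": weight_kg}


records = [parse_line(line) for line in text.strip().splitlines()]
for record in records:
    print(record)
```

Matching each field with its own small pattern avoids depending on the order in which sets, reps, weight, and name appear on a line, which varies across the log.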

## Key Takeaways

- **Practice is essential** for mastering regex patterns - start with simple character classes and gradually work up to complex patterns.

- **Character classes** `[ ]` are fundamental for matching sets of characters, with negation `[^...]` and ranges `[a-z]` being powerful tools.

- **Quantifiers** (`*`, `+`, `?`, `{}`) control repetition and are crucial for matching variable-length patterns.

- **Anchors** (`^`, `$`, `\b`) match positions rather than characters, enabling precise pattern matching.

- **Groups** `()` allow you to capture and extract specific parts of matches, which is essential for data extraction tasks.

- **Word boundaries** `\b` are essential for matching complete words without partial matches.

- **Testing and debugging** regex patterns is easier with online tools like regex101.com before implementing in code.

- Real-world regex applications require combining multiple concepts: character classes, quantifiers, anchors, and groups work together to solve complex text processing problems.