![Alt Text](https://raw.githubusercontent.com/msfasha/307307-BI-Methods/main/20242-NLP-LLM/images/header.png)

# **Introduction to Regular Expressions (RegEx) in Python**


Regular Expressions (RegEx) help us **search, extract, and manipulate** text efficiently. Python provides the `re` module to work with regex.

In this notebook, we'll cover **simple regex patterns with diverse examples**, perfect for beginners!

In [91]:
import re  # Import the regular expressions module

#### **1. Searching for a Simple Character**
**Pattern:** `a`

This pattern matches the letter 'a'. `re.search` scans the string from left to right and returns only the first occurrence.

In [92]:
text = 'apple banana cat'
match = re.search(r'a', text)
if match:
    print(f'Found "a" at position: {match.start()}')  # Output: Position of first 'a'
else:
    print('No match found.')

Found "a" at position: 0
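Since `re.search` reports only the first match, the remaining occurrences need a different call. A minimal sketch using `re.finditer` on the same sample string (the variable name `positions` is just illustrative):

```python
import re

text = 'apple banana cat'

# re.finditer yields one match object per occurrence, in left-to-right order.
positions = [m.start() for m in re.finditer(r'a', text)]
print(f'All positions of "a": {positions}')
```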


#### **2. Matching an Exact Word**
**Pattern:** `cat`

This pattern finds the character sequence 'cat' anywhere in a string, including inside longer words such as 'concatenate'.

In [93]:
text = 'The cat sat on the mat.'
match = re.search(r'cat', text)
if match:
    print(f'Found "cat" at position: {match.start()}')
else:
    print('No match found.')

Found "cat" at position: 4
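A plain literal pattern matches substrings, not whole words. The sketch below, on a made-up sentence, contrasts `cat` with `\bcat\b`, where `\b` marks a word boundary:

```python
import re

# Hypothetical sample sentence containing 'cat' inside a longer word.
text = 'Please concatenate the lists for the cat.'

# Plain 'cat' matches anywhere, including inside 'concatenate'.
substring_hits = re.findall(r'cat', text)

# \b requires a word boundary on each side, so only the standalone word matches.
whole_word_hits = re.findall(r'\bcat\b', text)

print(f'Substring matches: {len(substring_hits)}')
print(f'Whole-word matches: {len(whole_word_hits)}')
```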


#### **3. Matching Any Character**
**Pattern:** `c.t`

The `.` (dot) matches any single character except a newline. This finds words like 'cat', 'cot', or 'cut'.

In [94]:
text = 'cat cot cut'
matches = re.findall(r'c.t', text)
print(f'Matching words: {matches}')  # Output: ['cat', 'cot', 'cut']

Matching words: ['cat', 'cot', 'cut']
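By default the dot does not match a newline character. A short sketch, using an artificial string with a newline between 'c' and 't', shows how the `re.DOTALL` flag changes that:

```python
import re

# Artificial sample: the second candidate has a newline in the middle.
text = 'cat c\nt'

# Default behaviour: '.' skips newlines, so only 'cat' matches.
default_hits = re.findall(r'c.t', text)

# With re.DOTALL, '.' matches newlines as well.
dotall_hits = re.findall(r'c.t', text, re.DOTALL)

print(default_hits)
print(dotall_hits)
```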


#### **4. Matching a Digit**
**Pattern:** `\d`

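The `\d` pattern matches any single digit (0-9). For `str` patterns it also matches decimal digits from other scripts unless the `re.ASCII` flag is set.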

In [95]:
text = 'My number is 5 and yours is 7.'
matches = re.findall(r'\d', text)
print(f'Found digits: {matches}')  # Output: ['5', '7']

Found digits: ['5', '7']
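Because `\d` matches one digit at a time, multi-digit numbers are split into separate characters. A minimal sketch on a made-up sentence, combining `\d` with the `+` quantifier to keep whole numbers together:

```python
import re

# Hypothetical sample text with multi-digit numbers.
text = 'Order 66 shipped 1204 items in 3 boxes.'

single_digits = re.findall(r'\d', text)   # one list entry per digit
whole_numbers = re.findall(r'\d+', text)  # one list entry per run of digits

print(single_digits)
print(whole_numbers)
```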


#### **5. Matching a Word Character**
**Pattern:** `\w`

The `\w` pattern matches any letter, number, or underscore (_).

In [96]:
text = 'Hello, Python_3!'
matches = re.findall(r'\w', text)
print(f'Word characters: {matches}')  # Output: ['H', 'e', 'l', 'l', 'o', ..., '3']

Word characters: ['H', 'e', 'l', 'l', 'o', 'P', 'y', 't', 'h', 'o', 'n', '_', '3']
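Individual word characters are rarely the goal on their own. As a sketch, adding `+` to `\w` groups consecutive word characters into whole tokens for the same sample string:

```python
import re

text = 'Hello, Python_3!'

# \w+ matches runs of letters, digits, and underscores; punctuation and spaces split them.
words = re.findall(r'\w+', text)
print(f'Words: {words}')
```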


#### **6. Matching Whitespace**
**Pattern:** `\s`

The `\s` pattern matches any space, tab, or newline character.

In [97]:
text = 'Hello World\nNew Line'
matches = re.findall(r'\s', text)
print(f'Whitespace characters found: {len(matches)}')  # Output: 3 (space, newline, space)

Whitespace characters found: 3


#### **7. Matching Multiple Occurrences**
**Pattern:** `o+`

The `+` quantifier means 'one or more' repetitions of the preceding element, here the letter 'o'.

In [98]:
text = 'sooooon moon good'
matches = re.findall(r'o+', text)
print(f'Found sequences: {matches}')  # Output: ['ooooo', 'oo', 'oo']

Found sequences: ['ooooo', 'oo', 'oo']


#### **8. Matching the Start of a String**
**Pattern:** `^Hello`

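The `^` symbol ensures the match is **only at the beginning** of the string. `re.match` already starts matching at the beginning, so `^` is redundant in this example; it matters when the pattern is used with `re.search`.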

In [99]:
text = 'Hello, world!'
match = re.match(r'^Hello', text)
if match:
    print('The string starts with "Hello"!')
else:
    print('No match at the start.')

The string starts with "Hello"!
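The difference between `re.match`, `re.search`, and the `^` anchor is easiest to see on a string that does not start with the target. A small sketch with an invented sample string:

```python
import re

# Hypothetical sample where 'Hello' is not at the start.
text = 'Say Hello, world!'

match_result = re.match(r'Hello', text)       # only tries position 0
search_result = re.search(r'Hello', text)     # scans the whole string
anchored_result = re.search(r'^Hello', text)  # scans, but ^ pins the match to the start

print(match_result)           # None
print(search_result.start())  # index where 'Hello' begins
print(anchored_result)        # None
```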


#### **9. Matching the End of a String**
**Pattern:** `world!$`

The `$` symbol ensures the match is **only at the end** of the string (or just before a single trailing newline).

In [100]:
text = 'This is my world!'
match = re.search(r'world!$', text)
if match:
    print('The string ends with "world!"')
else:
    print('No match at the end.')

The string ends with "world!"


#### **10. Finding All Words with 'ing'**
**Pattern:** `\b\w+ing\b`

This pattern finds full words ending in 'ing'. `\b` marks a word boundary, and `\w+` requires at least one word character before 'ing'.

In [101]:
text = 'I am running, singing, and playing today.'
matches = re.findall(r'\b\w+ing\b', text)
print(f'Words ending with "ing": {matches}')  # Output: ['running', 'singing', 'playing']

Words ending with "ing": ['running', 'singing', 'playing']


---

### **Practical Example**

The script below uses regular expressions to extract essential business details from sample email text, including:
- email subjects
- recipient names
- dates
- monetary amounts
- company names

The extracted values can then be organized into a structured pandas DataFrame for easy analysis.

In [102]:
import re

email = """Subject: Meeting Request for Product Launch
    Dear John,

    We are excited to announce the launch of our new product next month. The project cost is estimated at $12,500,000.
    We would like to schedule a meeting on March 20th at 3 PM to discuss partnership opportunities.

    Best regards,
    Jane Doe
    MarketingPro Inc."""

**Extracting the Subject of the Email**

In [103]:
# Pattern: r"Subject:\s*(.*)"
# - Matches the literal text "Subject:" followed by optional whitespace (`\s*`).
# - Captures the rest of the line (the actual subject) using `(.*)`,
#   which means "any character (.) zero or more times (*)".
# - Because `.` does not match a newline, the capture stops at the end of the subject line.

match = re.search(r"Subject:\s*(.*)", email)
subject = match.group(1) if match else "Not Found"

print(subject)

Meeting Request for Product Launch


**Extracting the Recipient Name or Team**

In [104]:
# Pattern: r"Dear\s+([\w\s]+),"
# - Matches the word "Dear" followed by one or more whitespace characters (`\s+`).
# - Captures the recipient’s name or team using `([\w\s]+)`, 
#   which means "one or more word characters (`\w`) or spaces (`\s`)".
# - Ensures the match ends before the comma (`,`), marking the end of the salutation.

match = re.search(r"Dear\s+([\w\s]+),", email)
recipient = match.group(1) if match else "Not Found"

print(recipient)

John


**Extracting Any Mentioned Date**

In [105]:
# Pattern: r"(January|February|March|April|May|June|July|August|September|October|November|December) \d{1,2}"
# - Looks for any full month name (`January` to `December`).
# - Followed by a space and a **1 or 2 digit number** (`\d{1,2}`), representing the day.
# - `match.group(0)` returns the whole match (month and day); `match.group(1)` would return the month only.
# - This extracts dates like "March 20", "April 10", or "May 5" from the email content.

match = re.search(r"(January|February|March|April|May|June|July|August|September|October|November|December) \d{1,2}", email)
date_mentioned = match.group(0) if match else "Not Found"
print(date_mentioned)

March 20
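When the month and the day are needed as separate values, named groups are one option. The sketch below is an illustrative variation, not part of the original script: the group names `month` and `day`, the optional ordinal suffix, and the single-sentence sample are all assumptions.

```python
import re

# One sentence taken from the sample email, used here as a standalone string.
sentence = 'We would like to schedule a meeting on March 20th at 3 PM.'

date_pattern = re.compile(
    r"(?P<month>January|February|March|April|May|June|July|August|"
    r"September|October|November|December) "
    r"(?P<day>\d{1,2})(?:st|nd|rd|th)?"  # optional ordinal suffix, not captured
)

m = date_pattern.search(sentence)
month = m.group('month') if m else None
day = m.group('day') if m else None
print(month, day)
```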


**Extracting Monetary Values (Currency Amounts)**

In [106]:
# Pattern: r"\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?"
# - Matches a literal dollar sign (`\$`).
# - Matches **1 to 3 digits** (`\d{1,3}`) at the start of the amount.
# - Matches zero or more **comma-separated thousands groups** using `(?:,\d{3})*` (e.g., "$1,000" or "$15,000").
# - Optionally matches **decimal values** (`(?:\.\d{2})?`), allowing for cents like "$99.99".
# - This extracts monetary values such as "$15,000", "$95,000", or "$499.99"; in this email it finds "$12,500,000".

match = re.search(r"\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?", email)
amount_mentioned = match.group(0) if match else "Not Found"
print(amount_mentioned)

$12,500,000
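`re.search` returns only the first amount. If a message mentions several amounts, `re.findall` with the same pattern collects all of them; because the groups are non-capturing, each result is the full matched text. A sketch on an invented sentence:

```python
import re

# Hypothetical text with several amounts in different formats.
text = 'Budget: $12,500,000 total, $499.99 per seat, and a $75 setup fee.'

amount_pattern = r"\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?"

# Non-capturing groups (?:...) make findall return whole matches, not group contents.
amounts = re.findall(amount_pattern, text)
print(amounts)
```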


**Extracting Company Names**

In [107]:
# Pattern: r"^\s*([\w ]+(?:Inc\.|Ltd\.|Corp\.|Giant))" with the re.MULTILINE flag
# - Anchors the match at the start of a line (`^` combined with `re.MULTILINE`).
# - Allows for any leading whitespace (`\s*`).
# - Captures the company name using `([\w ]+ ...)`, which allows word characters and spaces.
# - Ensures the company name ends with a suffix from a fixed list ("Inc.", "Ltd.", "Corp.", or "Giant").
# - This extracts names like "MarketingPro Inc.", "TechInnovators Ltd.", and "GlobalCorp."
# - `match.group(1)` returns the captured name without the leading whitespace.

match = re.search(r"^\s*([\w ]+(?:Inc\.|Ltd\.|Corp\.|Giant))", email, re.MULTILINE)
company_name = match.group(1) if match else "Not Found"
print(company_name)

MarketingPro Inc.


Note:<br>
To match at the start of each line within multi-line text, set the `re.MULTILINE` flag; otherwise `^` matches only at the start of the whole string.

The `re.MULTILINE` flag redefines `^` and `$` as follows:

|Anchor|Normal (default)|With re.MULTILINE|
|--|--|--|
|^|Start of string|Start of any line|
|$|End of string|End of any line|
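The effect of the flag can be checked on a small multi-line string. This is a minimal sketch with made-up text, comparing both anchors with and without `re.MULTILINE`:

```python
import re

# Hypothetical three-line sample.
text = 'alpha one\nbeta two\ngamma three'

# Without the flag, ^ and $ refer to the whole string.
default_starts = re.findall(r'^\w+', text)
default_ends = re.findall(r'\w+$', text)

# With re.MULTILINE, ^ and $ also match at the start and end of every line.
multiline_starts = re.findall(r'^\w+', text, re.MULTILINE)
multiline_ends = re.findall(r'\w+$', text, re.MULTILINE)

print(default_starts, default_ends)
print(multiline_starts, multiline_ends)
```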