<div style="text-align: center;">
    <img src="https://media.licdn.com/dms/image/v2/D4E0BAQE4bCQaEBYDUQ/company-logo_200_200/company-logo_200_200/0/1728590235264/hobotacademy_logo?e=1738195200&v=beta&t=0DiMS1wnJYE6nkBjL237Lh3mTm8GKEy_s65hG_yiTSs" alt="Hobot Academy Logo" width="200"/>
</div>

<h1 style="text-align: center;">Python</h1>
<h2 style="text-align: center;">RegEx, Text Preprocessing</h2>
<h3 style="text-align: center;">Part 02</h3>
<h3 style="text-align: center;">Zahra Amini</h3>

<div style="width: 100%; text-align: center;">
    <table style="margin: 0 auto;">
        <tr>
            <td>
                <a href="https://t.me/hobotacademy">Telegram</a><br>
                <a href="https://www.linkedin.com/company/hobotacademy">LinkedIn</a><br>
                <a href="https://www.youtube.com/@AcademyHobot">YouTube</a><br>
            </td>
            <td>
                <a href="https://github.com/hobotacademy">GitHub</a><br>
                <a href="https://www.kaggle.com/aminizahra">Kaggle</a><br>
                <a href="https://www.instagram.com/hobot.academy/">Instagram</a><br>
            </td>
        </tr>
    </table>
</div>

## 4. Grouping and Subgroups

### 4.1. `()` and `(?: ...)`: Grouping with and without capturing

<div style="background-color: #dafbe1; border-left: 5px solid #34c759; padding: 10px;">
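<code>(...)</code>: Groups part of the pattern and captures the matched text so it can be retrieved later.<br>
<code>(?:...)</code>: Groups part of the pattern without capturing; useful when the group is only needed for alternation or repetition and its text does not need to be stored.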
</div>


#### 💠Grouping to capture area code in a phone number

In [61]:
import re

# Text containing a phone number
text = "My number is (123) 456-7890"

# Using parentheses to capture area code separately
result = re.search(r"\((\d{3})\) \d{3}-\d{4}", text)

# Displaying the area code if matched
if result:
    print("Area code:", result.group(1))  # Output: Area code: 123
else:
    print("No match found")


Area code: 123


<div style="background-color: #dafbe1; border-left: 5px solid #34c759; padding: 10px;">
<code>\((\d{3})\)</code>: Uses parentheses to capture the area code as a separate group.
</div>
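
The pattern above mixes two kinds of parentheses, which is easy to misread. As a small sketch reusing the same sample text: the escaped `\(` and `\)` match literal parenthesis characters, while the unescaped pair around `\d{3}` creates the capture group.

```python
import re

text = "My number is (123) 456-7890"

# \( and \) match literal parentheses; the inner ( ) form capture group 1
match = re.search(r"\((\d{3})\)", text)

whole = match.group(0)      # the full match, including the literal parentheses
area_code = match.group(1)  # only the captured digits

print(whole)      # (123)
print(area_code)  # 123
```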


#### 💠Non-capturing group for matching alternatives

In [64]:
# Text where we want to find either "cat" or "dog"
text = "I have a cat and a dog."

# Using a non-capturing group for alternatives
result = re.findall(r"(?:cat|dog)", text)

# Displaying the list of matched animals
print("Animals found:", result)  # Output: Animals found: ['cat', 'dog']

Animals found: ['cat', 'dog']
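
The difference between the two group types becomes visible with `re.findall()`: when the pattern contains a capturing group, `findall()` returns only the captured text, whereas with a non-capturing group it returns the whole match. A minimal sketch (the sample sentence is made up for illustration):

```python
import re

text = "I have two cats and three dogs."

# Capturing group: findall() returns only what the group captured
captured = re.findall(r"(cat|dog)s", text)

# Non-capturing group: findall() returns the entire match
whole = re.findall(r"(?:cat|dog)s", text)

print(captured)  # ['cat', 'dog']
print(whole)     # ['cats', 'dogs']
```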


### 4.2. Using `group()` and `groups()` to access grouped results

<div style="background-color: #dafbe1; border-left: 5px solid #34c759; padding: 10px;">
<code>group()</code>: Returns the entire match; <code>group(n)</code> returns the n-th captured group.<br>
<code>groups()</code>: Returns all captured groups as a tuple.
</div>
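
As a quick side-by-side sketch of these accessors on a single match object (the email address is only sample data):

```python
import re

match = re.search(r"(\w+)@(\w+\.\w+)", "user@example.com")

entire = match.group()     # same as match.group(0): the whole match
first = match.group(1)     # first captured group
all_groups = match.groups()  # every captured group, as a tuple

print(entire)      # user@example.com
print(first)       # user
print(all_groups)  # ('user', 'example.com')
```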


#### 💠Extracting name and domain from an email address

In [68]:
# Email to extract name and domain from
email = "user@example.com"

# Using groups to capture both parts of the email
result = re.search(r"(\w+)@(\w+\.\w+)", email)

print("Username:", result.group(1))  # Output: Username: user
print("Domain:", result.group(2))    # Output: Domain: example.com

Username: user
Domain: example.com


#### 💠Extracting date components using groups

In [70]:
# Date in YYYY-MM-DD format
date = "2023-10-28"

# Using groups to capture year, month, and day
result = re.search(r"(\d{4})-(\d{2})-(\d{2})", date)

print("Year:", result.group(1))   # Output: Year: 2023
print("Month:", result.group(2))  # Output: Month: 10
print("Day:", result.group(3))    # Output: Day: 28

Year: 2023
Month: 10
Day: 28
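
The two examples above call `group()` directly, which works because the sample strings always match. `re.search()` returns `None` when nothing matches, and calling `.group()` on `None` raises an `AttributeError`. One possible way to guard against that is sketched below; `parse_date` is a hypothetical helper name introduced only for this illustration.

```python
import re

def parse_date(value):
    """Return (year, month, day) as integers, or None if no YYYY-MM-DD date is found."""
    match = re.search(r"(\d{4})-(\d{2})-(\d{2})", value)
    if match is None:
        return None
    # groups() gives all three captures at once, so they can be unpacked
    year, month, day = match.groups()
    return int(year), int(month), int(day)

print(parse_date("2023-10-28"))   # (2023, 10, 28)
print(parse_date("28 Oct 2023"))  # None
```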


## 5. Working with Substitution and Pattern Replacement

### 5.1. `re.sub()`: Replace a pattern with another text

#### 💠Replacing all digits with an asterisk

In [74]:
# Text with numbers to be replaced
text = "Phone: 123-456-7890"

# Using re.sub() to replace digits with '*'
result = re.sub(r"\d", "*", text)

# Displaying modified text
print("Text after replacement:", result)  # Output: Text after replacement: Phone: ***-***-****


Text after replacement: Phone: ***-***-****


#### 💠Replacing email domains to anonymize email addresses

In [76]:
# Text with email addresses
text = "Emails: alice@example.com, bob@test.com"

# Using re.sub() to replace domains with 'hidden.com'
result = re.sub(r"@\w+\.\w+", "@hidden.com", text)

# Displaying modified text
print("Anonymized emails:", result)  # Output: Anonymized emails: Emails: alice@hidden.com, bob@hidden.com

Anonymized emails: Emails: alice@hidden.com, bob@hidden.com
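
The replacement text in `re.sub()` can also refer back to captured groups with `\1`, `\2`, and so on, which ties substitution to the grouping covered in Section 4. A short sketch, assuming dates written as YYYY-MM-DD that should be rewritten as DD/MM/YYYY (the sample sentence is invented):

```python
import re

text = "Released on 2023-10-28, updated on 2024-01-05"

# \1, \2, \3 in the replacement refer to the year, month, and day groups
result = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", text)

print(result)  # Released on 28/10/2023, updated on 05/01/2024
```

The replacement is written as a raw string so that `\1` is passed to `re.sub()` unchanged instead of being interpreted as an escape sequence by Python.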


### 5.2. `re.subn()`: Similar to `re.sub()`, but also returns the count of replacements

<div style="background-color: #dafbe1; border-left: 5px solid #34c759; padding: 10px;">
<code>re.subn()</code>: Works like <code>re.sub()</code>, but returns a tuple containing the modified text and the number of substitutions made.
</div>
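
A small sketch of the return value itself, including the case where nothing matches (the sample strings are made up):

```python
import re

# The return value is a (new_text, count) tuple
outcome = re.subn(r"\d", "#", "Room 42")
print(outcome)  # ('Room ##', 2)

# When the pattern does not match, the text is returned unchanged with a count of 0
unchanged, count = re.subn(r"\d", "#", "No digits here")
print(unchanged, count)  # No digits here 0
```

Checking whether the count is zero is a simple way to tell if a substitution changed anything.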


#### 💠Replacing spaces with underscores and counting replacements

In [80]:
# Text with spaces to replace
text = "Convert spaces to underscores"

# Using re.subn() to replace spaces with underscores
result, count = re.subn(r"\s", "_", text)

# Displaying modified text and count of replacements
print("Modified text:", result)          # Output: Modified text: Convert_spaces_to_underscores
print("Number of replacements:", count)  # Output: Number of replacements: 3


Modified text: Convert_spaces_to_underscores
Number of replacements: 3


#### 💠Removing vowels and counting occurrences

In [82]:
# Text with vowels to remove
text = "Remove vowels from this text"

# Using re.subn() to remove vowels
result, count = re.subn(r"[aeiouAEIOU]", "", text)

# Displaying modified text and count of vowels removed
print("Text without vowels:", result)  # Output: Text without vowels: Rmv vwls frm ths txt
print("Number of vowels removed:", count)  # Output: Number of vowels removed: 8


Text without vowels: Rmv vwls frm ths txt
Number of vowels removed: 8


## 6. Start and End of Line or Text

### 6.1. `^` and `$`: Match the start and end of a line

<div style="background-color: #dafbe1; border-left: 5px solid #34c759; padding: 10px;">
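<code>^</code>: Matches the start of the text; with the <code>re.MULTILINE</code> flag, it also matches at the start of each line.<br>
<code>$</code>: Matches the end of the text (or just before a trailing newline); with <code>re.MULTILINE</code>, it also matches at the end of each line.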
</div>
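
In Python, whether these anchors work per line depends on the `re.MULTILINE` flag. The following sketch, using a made-up two-line string, compares the default behavior with the multi-line behavior:

```python
import re

text = "first line\nsecond line"

# Default: ^ and $ anchor only to the start and end of the whole string
starts_default = re.findall(r"^\w+", text)
ends_default = re.findall(r"\w+$", text)

# With re.MULTILINE: ^ and $ also anchor at every line boundary
starts_multi = re.findall(r"^\w+", text, flags=re.MULTILINE)
ends_multi = re.findall(r"\w+$", text, flags=re.MULTILINE)

print(starts_default)  # ['first']
print(ends_default)    # ['line']
print(starts_multi)    # ['first', 'second']
print(ends_multi)      # ['line', 'line']
```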

#### 💠Checking if a line starts with a specific word

In [87]:
# Text to check if it starts with "Hello"
text = "Hello there"

# Using re.search() with ^ to check the beginning of the line
result = re.search(r"^Hello", text)

print(result)

if result:
    print("Starts with 'Hello'")  # Output: Starts with 'Hello'
else:
    print("Does not start with 'Hello'")

<re.Match object; span=(0, 5), match='Hello'>
Starts with 'Hello'


#### 💠Checking if a line ends with a specific pattern

In [89]:
# Text to check if it ends with "done."
text = "The process is now done."

# Using re.search() with $ to check the end of the line
result = re.search(r"done\.$", text)

# Displaying result if matched
if result:
    print("Ends with 'done.'")  # Output: Ends with 'done.'
else:
    print("Does not end with 'done.'")

Ends with 'done.'


### 6.2. `\A` and `\Z`: Match the start and end of the entire text

<div style="background-color: #dafbe1; border-left: 5px solid #34c759; padding: 10px;">
<code>\A</code>: Matches the beginning of the entire text.<br>
<code>\Z</code>: Matches the end of the entire text.
</div>


#### 💠Checking if text starts with a specific pattern (entire text)

In [93]:
# Full text to check
text = "Start of text here"

# Using re.search() with \A to check if text starts with "Start"
result = re.search(r"\AStart", text)

# Displaying result if matched
if result:
    print("Text starts with 'Start'")  # Output: Text starts with 'Start'
else:
    print("Text does not start with 'Start'")


Text starts with 'Start'


#### 💠Checking if text ends with a period

In [95]:
# Full text to check
text = "This is a complete sentence."

# Using re.search() with \Z to check if text ends with a period
result = re.search(r"\.\Z", text)

# Displaying result if matched
if result:
    print("Text ends with a period")  # Output: Text ends with a period
else:
    print("Text does not end with a period")


Text ends with a period
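
On single-line strings without a trailing newline, `$` and `\Z` behave the same, so the difference is easy to miss. The sketch below uses two made-up strings to show where they diverge in Python: a trailing newline, and multi-line text with `re.MULTILINE`.

```python
import re

# Case 1: trailing newline. $ may match just before it, \Z may not.
text = "Done.\n"
dollar_trailing = bool(re.search(r"\.$", text))
z_trailing = bool(re.search(r"\.\Z", text))
print(dollar_trailing, z_trailing)  # True False

# Case 2: multi-line text. re.MULTILINE changes $ but never \Z.
multi = "one.\ntwo"
dollar_multi = bool(re.search(r"\.$", multi, flags=re.MULTILINE))
z_multi = bool(re.search(r"\.\Z", multi, flags=re.MULTILINE))
print(dollar_multi, z_multi)  # True False

# \A likewise ignores re.MULTILINE and matches only at the very start
a_matches = re.findall(r"\A\w+", multi, flags=re.MULTILINE)
print(a_matches)  # ['one']
```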
