<h1 align="center">Machine Learning for NLP</h1>
    <h2 align="center">Text Preprocessing with RegEx</h2>
    <h3 align="center">Zahra Amini</h3>
<div style="width: 100%; text-align: center;">
    <table>
        <tr>
            <td>
                <a class="link" href="https://t.me/Zahraamini_ai">Telegram</a><br>
                <a class="link" href="https://www.linkedin.com/in/zahraamini-ai/">LinkedIn</a><br>
                <a class="link" href="https://www.youtube.com/@AcademyHobot">YouTube</a><br>
            </td>
            <td>
                <a class="link" href="https://github.com/aminizahra">GitHub</a><br>
                <a class="link" href="https://www.kaggle.com/aminizahra">Kaggle</a><br>
                <a class="link" href="https://www.instagram.com/zahraamini_ai/">Instagram</a><br>
            </td>
        </tr>
    </table>
</div>

## 1. Importing the library

In [3]:
import re

## 2. Multiple Flags and Pattern Configurations

### 2.1. `re.IGNORECASE` or `re.I`: Case-insensitive matching

<div style="background-color: #dafbe1; border-left: 5px solid #34c759; padding: 10px;">
This flag makes the regex case-insensitive.
</div>


#### Finding words regardless of case

In [8]:
text = "Python is great. python is fun."

# Using re.IGNORECASE to find "python" in any case
result = re.findall(r"python", text, re.IGNORECASE)

print("Matches (case-insensitive):", result)  # Output: ['Python', 'python']

Matches (case-insensitive): ['Python', 'python']
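
To make the effect of the flag visible, the sketch below runs the same pattern with and without it. It reuses the sentence from the cell above; the variable names `case_sensitive` and `case_insensitive` are illustrative only.

```python
import re

text = "Python is great. python is fun."

# Without the flag, only the exact lowercase spelling matches.
case_sensitive = re.findall(r"python", text)

# re.I is the short alias for re.IGNORECASE.
case_insensitive = re.findall(r"python", text, re.I)

print(case_sensitive)    # ['python']
print(case_insensitive)  # ['Python', 'python']
```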


### 2.2. `re.MULTILINE` or `re.M`: Multi-line mode for line-by-line matching

<div style="background-color: #dafbe1; border-left: 5px solid #34c759; padding: 10px;">
The <code>re.MULTILINE</code> flag makes the <code>^</code> and <code>$</code> characters match the start and end of each line individually.
</div>


#### Matching the start of each line

In [12]:
text = """Hello world
Python is awesome
hello world again"""

# Using re.MULTILINE to match start of each line
result = re.findall(r"^hello", text, re.IGNORECASE | re.MULTILINE)

print("Lines starting with 'hello':", result)  # Output: ['Hello', 'hello']


Lines starting with 'hello': ['Hello', 'hello']
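
A small comparison may help here: without `re.MULTILINE`, `^` anchors only at the very start of the string, and `$` only at the very end. The sketch below assumes the same three-line text and adds a `$` example that is not in the original cell.

```python
import re

text = "Hello world\nPython is awesome\nhello world again"

# Without re.MULTILINE, ^ matches only at the start of the whole string.
without_flag = re.findall(r"^hello", text, re.IGNORECASE)

# With re.MULTILINE, ^ also matches right after each newline.
with_flag = re.findall(r"^hello", text, re.IGNORECASE | re.MULTILINE)

# $ behaves the same way: with the flag it matches before each newline too.
last_words = re.findall(r"\w+$", text, re.MULTILINE)

print(without_flag)  # ['Hello']
print(with_flag)     # ['Hello', 'hello']
print(last_words)    # ['world', 'awesome', 'again']
```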


### 2.3. `re.DOTALL` or `re.S`: Allow `.` to match newlines

<div style="background-color: #dafbe1; border-left: 5px solid #34c759; padding: 10px;">
The <code>re.DOTALL</code> flag allows the <code>.</code> character to match newline (<code>\n</code>) characters in addition to regular characters.
</div>


#### Matching a multi-line sentence

In [16]:
text = "Hello world.\nThis is Python."

# Using re.DOTALL to match across newlines
result = re.search(r"Hello.*Python", text, re.DOTALL)

print("Match with DOTALL:", result.group())  # Output: Hello world.\nThis is Python

Match with DOTALL: Hello world.
This is Python
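
For contrast, the same pattern finds nothing when the flag is left out, because `.` stops at the newline by default. A minimal sketch, using the text from the cell above:

```python
import re

text = "Hello world.\nThis is Python."

# Default behaviour: "." does not match "\n", so the pattern cannot span lines.
no_flag = re.search(r"Hello.*Python", text)

# With re.DOTALL, "." matches any character, including "\n".
with_flag = re.search(r"Hello.*Python", text, re.DOTALL)

print(no_flag)                  # None
print(repr(with_flag.group()))  # 'Hello world.\nThis is Python'
```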


## 3. Using `re.finditer()`

<div style="background-color: #dafbe1; border-left: 5px solid #34c759; padding: 10px;">
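<code>re.finditer()</code> returns an iterator that yields match objects one at a time. Because matches are produced lazily instead of being collected into a list, it suits large texts where memory matters, and each match object also exposes its position through <code>start()</code>, <code>end()</code>, and <code>span()</code>.
</div>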


#### Iterating through all word matches

In [20]:
text = "The quick brown fox jumps over the lazy dog."

# Using re.finditer() to iterate over matches
for match in re.finditer(r'\b\w+\b', text):
    print("Word found:", match.group())

Word found: The
Word found: quick
Word found: brown
Word found: fox
Word found: jumps
Word found: over
Word found: the
Word found: lazy
Word found: dog
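
Unlike `re.findall()`, each item from `re.finditer()` is a match object, so positions are available as well as the matched text. The sketch below collects words longer than four letters together with their offsets; the length threshold and the name `long_words` are arbitrary choices for illustration.

```python
import re

text = "The quick brown fox jumps over the lazy dog."

# Each match object carries the matched text plus its start and end offsets.
long_words = [
    (m.group(), m.start(), m.end())
    for m in re.finditer(r"\b\w+\b", text)
    if len(m.group()) > 4
]

print(long_words)  # [('quick', 4, 9), ('brown', 10, 15), ('jumps', 20, 25)]
```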


## 4. Combining Flags with `re.compile()`

<div style="background-color: #dafbe1; border-left: 5px solid #34c759; padding: 10px;">
<code>re.compile()</code> builds a reusable pattern object. Flags can be combined with the bitwise OR operator (<code>|</code>), so a single compiled pattern can carry several matching options.
</div>


#### Compiling a pattern with case-insensitive and multi-line flags

In [24]:
# Compiling pattern with combined flags
pattern = re.compile(r'^[a-z]+', re.IGNORECASE | re.MULTILINE)

text = """hello
Python
WORLD"""

# Finding all matches at the start of each line
result = pattern.findall(text)

print("Matches with combined flags:", result)  # Output: ['hello', 'Python', 'WORLD']


Matches with combined flags: ['hello', 'Python', 'WORLD']
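
One practical reason to compile is reuse: the pattern and its flags are defined once and then applied to many strings. A minimal sketch, where the `documents` list is made-up sample data:

```python
import re

# Compile once; the flags travel with the pattern object.
pattern = re.compile(r"^[a-z]+", re.IGNORECASE | re.MULTILINE)

# Hypothetical sample inputs, not taken from the notebook.
documents = ["hello\nPython", "123\nWORLD", "regex rocks"]

# Reuse the same compiled pattern for every document.
results = [pattern.findall(doc) for doc in documents]

print(results)  # [['hello', 'Python'], ['WORLD'], ['regex']]
```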


## 5. Managing Unicode Characters

<div style="background-color: #dafbe1; border-left: 5px solid #34c759; padding: 10px;">
For text with non-Latin characters (such as Persian or Arabic), the flags <code>re.UNICODE</code> and <code>re.ASCII</code> control how classes such as <code>\w</code>, <code>\d</code>, <code>\s</code>, and <code>\b</code> behave.<br><br>
- <code>re.UNICODE</code>: Matches these classes according to Unicode rules. This is already the default for <code>str</code> patterns in Python 3, so the flag is redundant there and is kept only for backward compatibility.<br>
- <code>re.ASCII</code>: Restricts these classes to ASCII characters, so non-ASCII letters and digits no longer match them.
</div>

### 5.1. `re.UNICODE`: Enable Unicode matching

#### Matching Persian words

In [29]:
text = "سلام به شما"

# re.UNICODE makes \w match Persian letters (already the default for str patterns in Python 3)
result = re.findall(r"\w+", text, re.UNICODE)

print("Persian words found:", result)

Persian words found: ['سلام', 'به', 'شما']


### 5.2. `re.ASCII`: Limit to ASCII characters

#### Matching ASCII characters only

In [32]:
text = "Hello سلام"

# Using re.ASCII to match only English letters
result = re.findall(r"\w+", text, re.ASCII)

print("ASCII words found:", result)  # Output: ['Hello']

ASCII words found: ['Hello']
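
The difference between the two modes shows up for digits as well as letters. The sketch below uses a made-up string mixing English, Persian, and both digit systems, and compares the default behaviour with `re.ASCII`. It also checks that passing `re.UNICODE` explicitly changes nothing for a `str` pattern.

```python
import re

# Illustrative input: English word, Persian word, Persian digits, ASCII digits.
text = "Hello سلام ۱۲۳ 123"

default_words = re.findall(r"\w+", text)              # Unicode-aware by default
unicode_words = re.findall(r"\w+", text, re.UNICODE)  # same result as the default
ascii_words = re.findall(r"\w+", text, re.ASCII)      # ASCII letters and digits only

default_digits = re.findall(r"\d+", text)             # Persian and ASCII digits
ascii_digits = re.findall(r"\d+", text, re.ASCII)     # ASCII digits only

print(default_words)
print(ascii_words)   # ['Hello', '123']
print(default_digits)
print(ascii_digits)  # ['123']
```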


## 6. Advanced Replacement Functions with `re.sub()`

#### Converting each word to uppercase

In [35]:
def replace_function(match):
    return match.group(0).upper()

# Using re.sub() with a function: it receives each match object and returns the replacement (here, the word in uppercase)
text = "hello world"
result = re.sub(r'\b\w+\b', replace_function, text)

print("Text after capitalization:", result)  # Output: HELLO WORLD

Text after capitalization: HELLO WORLD
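
The replacement function can run any logic on the match object, not just change case. As a sketch, the two hypothetical helpers below capitalize only the first letter of each word and double every number found in a string; both names and both sample strings are illustrative.

```python
import re

def capitalize_word(match):
    # Uppercase the first letter only, lowercase the rest.
    return match.group(0).capitalize()

def double_number(match):
    # Convert the matched digits to int, double, and return a string.
    return str(int(match.group(0)) * 2)

title_cased = re.sub(r"\b\w+\b", capitalize_word, "hello world from regex")
doubled = re.sub(r"\d+", double_number, "3 apples and 12 pears")

print(title_cased)  # Hello World From Regex
print(doubled)      # 6 apples and 24 pears
```

The function must always return a string, which is why `double_number` wraps its result in `str()`.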


## 7. Negative Expressions and Non-matching Groups

<div style="background-color: #dafbe1; border-left: 5px solid #34c759; padding: 10px;">
Negated character classes such as <code>[^abc]</code> and negative lookarounds such as <code>(?!...)</code> and <code>(?&lt;!...)</code> let a pattern exclude specific characters or contexts.<br><br>
- <code>[^abc]</code>: Matches any single character except <code>a</code>, <code>b</code>, or <code>c</code>.<br>
- <code>(?!...)</code>: Negative lookahead, matching positions that are not followed by the specified pattern.<br>
- <code>(?&lt;!...)</code>: Negative lookbehind, matching positions that are not preceded by the specified pattern.
</div>


### 7.1. `[^abc]`: Match any character except those listed

In [39]:
text = "The quick brown fox."

# Using [^aeiou\s] to match characters that are neither lowercase vowels nor whitespace
result = re.findall(r"[^aeiou\s]", text)

print("Non-vowel characters:", result)

Non-vowel characters: ['T', 'h', 'q', 'c', 'k', 'b', 'r', 'w', 'n', 'f', 'x', '.']
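
A common preprocessing use of a negated class is stripping punctuation: `[^\w\s]` matches anything that is neither a word character nor whitespace. A minimal sketch on a made-up sentence:

```python
import re

# Illustrative input, not from the notebook.
text = "Hello, world! It's 2024."

# Remove every character that is not a word character or whitespace.
no_punctuation = re.sub(r"[^\w\s]", "", text)

print(no_punctuation)  # Hello world Its 2024
```

Note that the apostrophe is removed as well, so contractions collapse ("It's" becomes "Its"); whether that is acceptable depends on the task.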


### 7.2. `(?!...)` and `(?<!...)`: Negative Lookahead and Lookbehind

#### Matching words that do not end in "ing"

In [42]:
text = "I am running, jumping, and swim."

# The lookahead must come before the word is consumed: at each word start,
# (?!\w*ing\b) rejects any word whose letters end in "ing"
result = re.findall(r"\b(?!\w*ing\b)\w+\b", text)

print("Words not ending in 'ing':", result)  # Output: ['I', 'am', 'and', 'swim']

Words not ending in 'ing': ['I', 'am', 'and', 'swim']
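
The heading of this subsection also mentions negative lookbehind, so here is a short sketch of it: find numbers that are not preceded by a dollar sign. The sample sentence and the name `non_prices` are illustrative assumptions.

```python
import re

# Illustrative input, not from the notebook.
text = "Price $100, quantity 20, code 300"

# (?<!\$) succeeds only where the previous character is not "$".
# \b stops the engine from restarting in the middle of "100".
non_prices = re.findall(r"(?<!\$)\b\d+\b", text)

print(non_prices)  # ['20', '300']
```

Python's `re` module requires the lookbehind pattern to have a fixed width, which a single escaped character like `\$` satisfies.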


## 8. Validating Whole Strings with `re.fullmatch()`

In [44]:
text = "2023-10-05"

# Using re.fullmatch() to check if text matches date format
if re.fullmatch(r'\d{4}-\d{2}-\d{2}', text):
    print("This is a valid date format.")  # Output: This is a valid date format.
else:
    print("Invalid date format.")

This is a valid date format.
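
`re.fullmatch()` succeeds only if the entire string matches, while `re.search()` and `re.match()` accept extra trailing text. It also checks the shape of the string only, not whether the date exists. The sketch below shows both points; `DATE_SHAPE` and `is_real_date` are hypothetical names, and using `datetime.strptime` for the calendar check is one possible approach among several.

```python
import re
from datetime import datetime

DATE_SHAPE = r"\d{4}-\d{2}-\d{2}"
candidate = "2023-10-05 and more"

found_anywhere = re.search(DATE_SHAPE, candidate) is not None   # True
found_at_start = re.match(DATE_SHAPE, candidate) is not None    # True
whole_string = re.fullmatch(DATE_SHAPE, candidate) is not None  # False

def is_real_date(value):
    """Hypothetical helper: check the shape first, then the calendar."""
    if not re.fullmatch(DATE_SHAPE, value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True

print(is_real_date("2023-10-05"))  # True
print(is_real_date("2023-99-99"))  # False: right shape, impossible date
```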


## 9. Multi-line and Newline Matching with Inline Flags

#### Multi-line and dot-matching in a pattern

In [47]:
text = """This is line one.
This is line two."""

# Using inline flags to handle multi-line and dot-matching
result = re.findall(r"(?ms)^.*$", text)

print("Lines found with inline flags:", result)  # Output: ['This is line one.\nThis is line two.'] (one match, because (?s) lets the greedy .* cross the newline)

Lines found with inline flags: ['This is line one.\nThis is line two.']
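
With `(?s)` active, the greedy `.*` runs across the newline, which is why the whole text comes back as a single match. If the goal is one match per line, two possible adjustments are sketched below: drop the `s` flag, or keep it and make the quantifier lazy.

```python
import re

text = "This is line one.\nThis is line two."

# Greedy .* with DOTALL swallows the newline: one match for the whole text.
greedy = re.findall(r"(?ms)^.*$", text)

# Option 1: multi-line mode only, so "." stops at each newline.
per_line = re.findall(r"(?m)^.*$", text)

# Option 2: keep DOTALL but use a lazy quantifier, which stops at the first "$".
lazy = re.findall(r"(?ms)^.*?$", text)

print(greedy)    # ['This is line one.\nThis is line two.']
print(per_line)  # ['This is line one.', 'This is line two.']
print(lazy)      # ['This is line one.', 'This is line two.']
```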


## 10. Ignoring Parts of the Pattern with In-line Comments

#### Adding comments inside a regex

In [50]:
# Compiling pattern with an inline comment for better understanding
# (the (?#...) group is ignored by the regex engine; no literal space before it)
pattern = re.compile(r"(\d+)(?# This matches digits)")

text = "There are 15 apples."

# Using the pattern to find numbers
result = pattern.search(text)

print("Matched number with comment:", result.group())  # Output: 15

Matched number with comment: 15
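
For longer patterns, `re.VERBOSE` (or `re.X`) is often easier to read than `(?#...)` groups: whitespace in the pattern is ignored and `#` starts a comment that runs to the end of the line. A minimal sketch, extending the example to capture the item name as well (the second group is an addition for illustration):

```python
import re

pattern = re.compile(
    r"""
    (\d+)   # group 1: one or more digits (the quantity)
    \s+     # whitespace must be written explicitly in verbose mode
    (\w+)   # group 2: the word that follows (the item)
    """,
    re.VERBOSE,
)

match = pattern.search("There are 15 apples.")

print(match.group(1))  # 15
print(match.group(2))  # apples
```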


## 11. Named Groups and Backreferences

In [52]:
# Compiling pattern with named group and backreference
pattern = re.compile(r"(?P<word>\b\w+\b) (?P=word)")

text = "This is is a test."

# Using named group and backreference to find repeated words
result = pattern.search(text)

print("Repeated word found:", result.group())  # Output: is is

Repeated word found: is is
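
The same named group can be reused in a replacement string through `\g<word>`, which gives a simple way to collapse repeated words. The sketch below is one possible variant: it adds a trailing `\b` and uses `\s+` between the words, both of which are assumptions beyond the original pattern.

```python
import re

# \b at the end prevents matches like "is isolated".
pattern = re.compile(r"\b(?P<word>\w+)\s+(?P=word)\b")

# Illustrative input with two repeated words.
text = "This is is a test test."

first = pattern.search(text)
cleaned = pattern.sub(r"\g<word>", text)  # keep a single copy of each repeated word

print(first.group("word"))  # is
print(cleaned)              # This is a test.
```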
