# Text Processing with Regular Expressions

## Import Regular Expression Module

Regular expressions (regex) are powerful tools for pattern matching and text manipulation. The `re` module in Python provides comprehensive support for working with regular expressions.

In this notebook, we'll explore how to use regex for:
- Pattern matching in text
- Text cleaning and preprocessing
- Extracting specific information from strings
- Replacing and substituting text patterns

In [1]:
import re

## String Escape Sequences Problem

When working with file paths and regex patterns, it's important to understand escape sequences in strings. The backslash (`\`) is a special character in Python that creates escape sequences.

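In the example below, we'll see how Python interprets escape sequences in a file path string. Notice what happens when Python encounters `\f` in `\folder` and `\file`: each `\f` is read as a single form feed character, so the backslash and the letter `f` both vanish from the printed output: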

In [3]:
path = "c:\folder\file.txt"
print(path)

c:olderile.txt


## Raw Strings Solution

To avoid issues with escape sequences, Python provides **raw strings** using the `r` prefix. Raw strings treat backslashes as literal characters rather than escape sequences.

This is particularly useful when working with:
- File paths (especially on Windows)
- Regular expression patterns
- Any string containing backslashes that should be treated literally

Compare the output below with the previous example - notice how the raw string preserves the backslashes exactly as written:

In [4]:
path = r"c:\folder\file.txt"
print(path)

c:\folder\file.txt
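
The same issue affects regex patterns, because tokens such as `\b` are also Python escape sequences. The sketch below (variable names are illustrative) compares a raw string with its doubled-backslash equivalent, then shows how `\b` behaves with and without the `r` prefix:

```python
import re

# A raw string and a string with doubled backslashes hold the same characters.
raw_path = r"c:\folder\file.txt"
escaped_path = "c:\\folder\\file.txt"
print(raw_path == escaped_path)  # True

# In a regular string "\b" is a backspace character; in a raw string it
# reaches the regex engine as the word-boundary token.
print(re.search("\bfile\b", raw_path))   # None
print(re.search(r"\bfile\b", raw_path))  # a Match object for 'file'
```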


In [6]:
# re.search returns a Match object for the first place the pattern occurs
result_search = re.search("pattern", r"text containing the pattern")
print(result_search)

<re.Match object; span=(20, 27), match='pattern'>


In [7]:
# re.search returns None when the pattern does not occur in the text
result_search = re.search("pattern", r"text containing nothing")
print(result_search)

None
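
Because `re.search` returns either a Match object or `None`, it is usually safest to test the result before using it. A short sketch of reading the matched text and its position:

```python
import re

text = "text containing the pattern"
match = re.search("pattern", text)

# Test the result first: calling .group() on None raises AttributeError.
if match:
    print(match.group())                    # pattern
    print(match.span())                     # (20, 27)
    print(text[match.start():match.end()])  # pattern
else:
    print("no match")
```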


In [9]:
# re.sub replaces every occurrence of the pattern with the replacement text
the_string = r"sara was able to help me find the item quickly"
new_str = re.sub(r"sara", r"sarah", the_string)
print(new_str)

sarah was able to help me find the item quickly
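
One thing to watch for: the pattern `sara` also matches the first four letters of `sarah`, so running the same substitution on text that is already correct adds a second `h`. A hedged sketch of one way around this, using the word-boundary token `\b`:

```python
import re

# A plain "sara" pattern also matches inside "sarah".
print(re.sub(r"sara", "sarah", "sarah was helpful"))    # sarahh was helpful

# \b requires a word boundary after "sara", so "sarah" is left alone.
print(re.sub(r"sara\b", "sarah", "sarah was helpful"))  # sarah was helpful
print(re.sub(r"sara\b", "sarah", "sara was helpful"))   # sarah was helpful
```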


In [None]:
reviews = [
    "sarah helped me find the item quickly.",
    "joe was very helpful in finding the item.",
    "don was able to assisted me as well",
    "sara help, was greatly appreciated",
    "jeremy was very, helpful"
]

# use a list comprehension
sarah_reviews = [review for review in reviews if re.search(r"sarah?", review, re.IGNORECASE)]
print(sarah_reviews)

# use an explicit for loop
sarah_reviews = []
for review in reviews:
    if re.search(r"sarah?", review, re.IGNORECASE):
        sarah_reviews.append(review)
print(sarah_reviews)

# use the filter function
sarah_reviews = list(filter(lambda review: re.search(r"sarah?", review, re.IGNORECASE), reviews))
print(sarah_reviews)

['sarah helped me find the item quickly.', 'sara help, was greatly appreciated']
['sarah helped me find the item quickly.', 'sara help, was greatly appreciated']
['sarah helped me find the item quickly.', 'sara help, was greatly appreciated']
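
In the pattern `sarah?`, the `?` makes the preceding `h` optional, so both spellings match. The sketch below isolates that behaviour on single names; `sarah_pattern` and the sample names are illustrative, and compiling the pattern is optional but avoids repeating the flags:

```python
import re

# "h?" means zero or one "h", so the pattern matches both "sara" and "sarah".
sarah_pattern = re.compile(r"sarah?", re.IGNORECASE)

names = ["sara", "Sarah", "SARAH", "sam"]
results = {name: bool(sarah_pattern.search(name)) for name in names}
print(results)
# {'sara': True, 'Sarah': True, 'SARAH': True, 'sam': False}
```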


Find all the reviews that start with the letter 'j'. The `^` anchor matches only at the start of the string:

In [23]:
pattern_to_find = r"^j"
j_reviews = [review for review in reviews if re.search(pattern_to_find, review, re.IGNORECASE)]
print(j_reviews)


['joe was very helpful in finding the item.', 'jeremy was very, helpful']


In [24]:
# find all the reviews which end with the letter 'l'; $ anchors the match at the end
pattern_to_find = r"l$"
l_reviews = [review for review in reviews if re.search(pattern_to_find, review, re.IGNORECASE)]
print(l_reviews)


['don was able to assisted me as well', 'jeremy was very, helpful']
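
The `^` and `$` anchors match positions, not characters. By default they refer to the start and end of the whole string; with the `re.MULTILINE` flag they also match at each line break. A small sketch with made-up sample text:

```python
import re

review = "joe was very helpful"

print(bool(re.search(r"^j", review)))        # True: the string starts with "j"
print(bool(re.search(r"l$", review)))        # True: the string ends with "l"
print(bool(re.search(r"^helpful", review)))  # False: "helpful" is not at the start

# With re.MULTILINE, ^ and $ also match at the start and end of each line.
text = "sarah was great\njoe was helpful"
first_words_default = re.findall(r"^\w+", text)
first_words_multiline = re.findall(r"^\w+", text, re.MULTILINE)
print(first_words_default)    # ['sarah']
print(first_words_multiline)  # ['sarah', 'joe']
```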


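Find reviews that match any of several patterns. The group `(help|assist)` matches either alternative, and the `ed` that follows must come right after it: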

In [25]:
patterns_to_find = r"(help|assist)ed"
multiple_reviews = [review for review in reviews if re.search(patterns_to_find, review, re.IGNORECASE)]
print(multiple_reviews)


['sarah helped me find the item quickly.', 'don was able to assisted me as well']
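
The parentheses in `(help|assist)ed` do two jobs: they limit the scope of the `|` alternation, and they create a capturing group. The sketch below, using a made-up sentence, shows how that affects `re.findall` and what changes when the parentheses are dropped:

```python
import re

review = "don assisted me and sarah helped too"

# With a capturing group, findall returns only the group's text.
captured = re.findall(r"(help|assist)ed", review)
print(captured)     # ['assist', 'help']

# A non-capturing group (?:...) returns the whole match instead.
whole_words = re.findall(r"(?:help|assist)ed", review)
print(whole_words)  # ['assisted', 'helped']

# Without parentheses the alternatives are "help" and "assisted",
# so a bare "help" matches as well.
print(bool(re.search(r"help|assisted", "sara help")))    # True
print(bool(re.search(r"(help|assist)ed", "sara help")))  # False
```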


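Remove the punctuation. The character class `[^\w\s]` matches any single character that is neither a word character nor whitespace, and `re.sub` replaces each one with an empty string: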

In [28]:
punctuation_patterns = r"[^\w\s]" # not a word or whitespace character
cleanup_reviews = [re.sub(punctuation_patterns, "", review) for review in reviews]
print(cleanup_reviews)

# functional approach
def remove_punctuation(review):
    return re.sub(punctuation_patterns, "", review)

cleanup_reviews = list(map(remove_punctuation, reviews))
print(cleanup_reviews)

# functional using lambda
cleanup_reviews = list(map(lambda review: re.sub(punctuation_patterns, "", review), reviews))
print(cleanup_reviews)

['sarah helped me find the item quickly', 'joe was very helpful in finding the item', 'don was able to assisted me as well', 'sara help was greatly appreciated', 'jeremy was very helpful']
['sarah helped me find the item quickly', 'joe was very helpful in finding the item', 'don was able to assisted me as well', 'sara help was greatly appreciated', 'jeremy was very helpful']
['sarah helped me find the item quickly', 'joe was very helpful in finding the item', 'don was able to assisted me as well', 'sara help was greatly appreciated', 'jeremy was very helpful']
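
One detail of `[^\w\s]` worth knowing: `\w` covers letters, digits, and the underscore, so underscores are kept. If underscores should go too, one option (an assumption about the desired cleanup, not something the reviews above need) is to add `_` as an alternative:

```python
import re

sample = "well_done, sarah! 10/10"

# \w includes the underscore, so "_" survives this pattern.
kept_underscore = re.sub(r"[^\w\s]", "", sample)
print(kept_underscore)     # well_done sarah 1010

# Adding "|_" removes underscores along with the other punctuation.
removed_underscore = re.sub(r"[^\w\s]|_", "", sample)
print(removed_underscore)  # welldone sarah 1010
```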
