<a name='0'></a>
# Intro to Regular Expressions


In this lab, you will learn to:

 * [1. Using the re Module in Python](#1)
 * [2. Searching and Matching](#2)
 * [3. Capturing Groups](#3)
 * [4. Replacing Patterns](#4)
 * [5. Compilation Flags](#5)





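<a name='1'></a>

### Using the re Module in Python

The `re` module is part of Python's standard library, so there is nothing to install on Google Colab or in a local Jupyter notebook. We only have to import it like this:

`import re`

The last exercise also uses pandas. If you are using local Jupyter notebooks, make sure you have it installed already.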

In [None]:
import re

<a name='2'></a>

### Searching and Matching

In [None]:
pattern = r"receipt"
text = "this is the receipt no:9080"

match = re.search(pattern, text)

if match:
    print("Pattern found:", match.group())
else:
    print("Pattern not found.")

Pattern found: receipt
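
`re.search` scans the whole string, whereas `re.match` only tries the pattern at the beginning and `re.fullmatch` requires the entire string to match. A minimal sketch of the difference, reusing the `pattern` and `text` from the cell above (the variable names for the results are only illustrative):

```python
import re

pattern = r"receipt"
text = "this is the receipt no:9080"

# search: finds the pattern anywhere in the string
found_anywhere = re.search(pattern, text)
# match: only succeeds if the pattern starts at position 0
found_at_start = re.match(pattern, text)
# fullmatch: only succeeds if the whole string matches the pattern
found_whole = re.fullmatch(pattern, "receipt")

print(found_anywhere.span())   # (12, 19)
print(found_at_start)          # None
print(found_whole.group())     # receipt
```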


<a name='3'></a>

### Capturing Groups

In [None]:
text = "123-45"
pattern = r"(\d{3})-(\d{2})"
match = re.search(pattern, text)
if match:
    print("Full match:", match.group())
    print("Group 1:", match.group(1))
    print("Group 2:", match.group(2))

Full match: 123-45
Group 1: 123
Group 2: 45
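
Groups can also be given names, which may read more clearly than numeric indexes once a pattern has several groups. A small sketch on the same input; the names `prefix` and `suffix` are made up for illustration:

```python
import re

text = "123-45"
# (?P<name>...) gives each group a name in addition to its number
pattern = r"(?P<prefix>\d{3})-(?P<suffix>\d{2})"
match = re.search(pattern, text)

if match:
    print(match.groups())         # ('123', '45')
    print(match.group("prefix"))  # 123
    print(match.groupdict())      # {'prefix': '123', 'suffix': '45'}
```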


<a name='4'></a>

### Replacing Patterns

In [None]:
pattern = r"love"
text = "I love apples and apple pie."
new_text = re.sub(pattern, "hate", text)
print(new_text)

pattern = r"apple"
text = "I love apples and apple pie."
new_text = re.sub(pattern, "orange", text)
print(new_text)

I hate apples and apple pie.
I love oranges and orange pie.
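
Note that the second replacement above also changed "apples" to "oranges", because the pattern `apple` matches inside longer words. The sketch below shows a few ways to control this, assuming the same sentence: a word boundary, a backreference in the replacement, and the `count` argument.

```python
import re

text = "I love apples and apple pie."

# \b restricts the match to the standalone word "apple"
whole_word = re.sub(r"\bapple\b", "orange", text)
print(whole_word)  # I love apples and orange pie.

# \1 and \2 in the replacement refer to the captured groups
swapped = re.sub(r"(\w+) and (\w+)", r"\2 and \1", text)
print(swapped)  # I love apple and apples pie.

# count limits how many replacements are made
first_only = re.sub(r"apple", "orange", text, count=1)
print(first_only)  # I love oranges and apple pie.
```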


<a name='5'></a>

### Compilation Flags

In [None]:
pattern = r"hello"
text = "Hello World"
match = re.search(pattern, text, re.IGNORECASE)
if match:
    print("Pattern found:", match.group())


Pattern found: Hello
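
Flags can be combined with `|` and passed to `re.compile`, which returns a reusable pattern object. A minimal sketch, using a two-line string invented for this example, of how `re.MULTILINE` changes what `^` matches:

```python
import re

text = "Hello World\nhello again"

# Compile once with combined flags; | joins several flags together
pattern = re.compile(r"^hello", re.IGNORECASE | re.MULTILINE)

# With MULTILINE, ^ matches at the start of every line, not just the string
multi = pattern.findall(text)
print(multi)  # ['Hello', 'hello']

# Without MULTILINE, only the very start of the string counts
single = re.findall(r"^hello", text, re.IGNORECASE)
print(single)  # ['Hello']
```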


#### Exercise 1: Validate an Email Address

In [None]:
import re

def validate_email(email):
    # Regular expression pattern to validate email addresses
    pattern = r'^[\w\.-]+@[\w\.-]+\.\w+$'

    # re.match returns a match object (valid) or None (invalid)
    return bool(re.match(pattern, email))

# Test cases
print(validate_email("user@example.com"))  # True
print(validate_email("invalid-email"))     # False

True
False


`^`: Matches the start of the string.

`[\w\.-]+`: Matches one or more word characters (letters, digits, and underscore), dots, and hyphens. This represents the username part of the email.

`@`: Matches the "@" symbol.

`[\w\.-]+`: Matches one or more word characters, dots, and hyphens. This represents the domain name.

`\.`: Matches a literal dot (used to separate the domain name and top-level domain).

`\w+`: Matches one or more word characters. This represents the top-level domain.

`$`: Matches the end of the string.
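
One subtlety: in Python, `$` also matches just before a trailing newline, so a string ending in `"\n"` still passes the check above. If that matters, `re.fullmatch` is one possible alternative. A short sketch with the same pattern (the result variable names are illustrative):

```python
import re

pattern = r'^[\w\.-]+@[\w\.-]+\.\w+$'

# $ also matches just before a trailing newline, so this input is accepted
accepted_by_match = bool(re.match(pattern, "user@example.com\n"))
print(accepted_by_match)      # True

# fullmatch requires the entire string, newline included, to fit the pattern
accepted_by_fullmatch = bool(re.fullmatch(pattern, "user@example.com\n"))
print(accepted_by_fullmatch)  # False

clean_input = bool(re.fullmatch(pattern, "user@example.com"))
print(clean_input)            # True
```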

#### Exercise 2: Extract All URLs from a Webpage Source Code

In [None]:
import re

def extract_urls(source_code):
    # Regular expression pattern to match URLs
    pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

    # Find all URLs in the source code using the pattern
    urls = re.findall(pattern, source_code)
    return urls

# Example webpage source code
webpage_source = """
<html>
  <a href="https://www.example.com">Visit Example</a>
  <a href="http://www.test.com">Test Website</a>
</html>
"""

# Extract and print URLs
urls = extract_urls(webpage_source)
for url in urls:
    print(url)


https://www.example.com
http://www.test.com


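`http[s]?://`: Matches "http" or "https" followed by "://".

`(?: ... )`: A non-capturing group that contains several alternatives separated by `|`. Each alternative matches one character (or one percent-encoded sequence) that the pattern accepts inside a URL.

`[a-zA-Z]`: Matches any ASCII letter.

`[0-9]`: Matches any digit.

`[$-_@.&+]`: Matches `@`, `.`, `&`, `+`, and any character in the ASCII range from `$` to `_`. Because `$-_` is a range, it also covers characters such as `/`, `:`, `?`, `=`, `<`, and `>`, as well as digits and uppercase letters.

`[!*\\(\\),]`: Matches an exclamation mark, an asterisk, a backslash, a parenthesis, or a comma.

`(?:%[0-9a-fA-F][0-9a-fA-F])`: Matches URL-encoded characters (e.g., %20 for space).

`+`: Matches one or more occurrences of the preceding group.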
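
The range `$-_` inside the character class is easy to overlook, and it makes the pattern more permissive than it looks. The sketch below checks which of a few sample characters fall in that range, then runs the pattern on a made-up snippet where a URL is followed directly by an HTML tag:

```python
import re

# Inside a character class, $-_ is a range from "$" (0x24) to "_" (0x5F)
in_range = [ch for ch in "/:?=<>a" if re.fullmatch(r"[$-_]", ch)]
print(in_range)  # ['/', ':', '?', '=', '<', '>']

pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

# Hypothetical input: the URL is followed directly by an HTML tag
snippet = "See http://www.test.com<br> for details"
found = re.findall(pattern, snippet)
print(found)  # ['http://www.test.com<br>']
```

In the exercise above, each URL is followed by a double quote, which is outside the range, so the matches stop where expected.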

#### Exercise 3: Replace Sensitive Information (e.g., Email Addresses) in a Text


In [None]:
import re

def replace_sensitive_info(text):
    # Regular expression pattern to match email addresses
    email_pattern = r'\b[\w\.-]+@[\w\.-]+\.\w+\b'

    # Replace email addresses with placeholders
    masked_text = re.sub(email_pattern, "[email]", text)
    return masked_text

# Test case
original_text = "Contact us at john@example.com or jane@example.com for assistance."
masked_text = replace_sensitive_info(original_text)
print("Original text:", original_text)
print("Masked text:", masked_text)

Original text: Contact us at john@example.com or jane@example.com for assistance.
Masked text: Contact us at [email] or [email] for assistance.


`\b`: Matches a word boundary, ensuring that only whole email addresses are matched.

`[\w\.-]+@[\w\.-]+\.\w+`: Matches a typical email address pattern.

`re.sub(email_pattern, "[email]", text)`: Replaces matched email addresses with the placeholder "[email]".
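
`re.sub` also accepts a function as the replacement, which is called with each match object. As an illustrative variation (not part of the exercise), the sketch below adds capturing groups to the same pattern and uses a hypothetical helper, `mask_user`, to hide only the username while keeping the domain:

```python
import re

# Same pattern as above, with groups added around the user and domain parts
email_pattern = re.compile(r'\b([\w\.-]+)@([\w\.-]+\.\w+)\b')

def mask_user(match):
    # Hypothetical helper: hide the username, keep the domain visible
    return "[user]@" + match.group(2)

text = "Contact us at john@example.com or jane@example.com for assistance."
masked = email_pattern.sub(mask_user, text)
print(masked)
# Contact us at [user]@example.com or [user]@example.com for assistance.
```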

#### Exercise 4: Applying This to a DataFrame Column

In [None]:
import pandas as pd
import re

# Sample DataFrame
data = {
    'text_column': [
        "Contact us at john@example.com for assistance.",
        "Send an email to support@example.com.",
        "Please reach out to info@company.com."
    ]
}
df = pd.DataFrame(data)

def replace_sensitive_info(text):
    # Regular expression pattern to match email addresses
    email_pattern = r'\b[\w\.-]+@[\w\.-]+\.\w+\b'

    # Replace email addresses with placeholders
    masked_text = re.sub(email_pattern, "[email]", text)
    return masked_text

# Apply the replacement function to the 'text_column' column
df['text_column_masked'] = df['text_column'].apply(replace_sensitive_info)

# Display the DataFrame
print(df)

                                      text_column  \
0  Contact us at john@example.com for assistance.   
1           Send an email to support@example.com.   
2           Please reach out to info@company.com.   

                      text_column_masked  
0  Contact us at [email] for assistance.  
1              Send an email to [email].  
2           Please reach out to [email].  




In this example, we create a sample DataFrame df with a column named 'text_column' containing text data.

We define the replace_sensitive_info function, which uses regular expressions to replace email addresses with placeholders.

The `apply` method of the column (a pandas Series) then calls `replace_sensitive_info` on each element of 'text_column'. The results are stored in a new column named 'text_column_masked'.

After running the code, the DataFrame df will contain the original text in the 'text_column' column and the text with email addresses replaced by placeholders in the 'text_column_masked' column.

Remember to replace 'text_column' and 'text_column_masked' with the actual column names you are working with in your DataFrame.
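
Since the function runs once per row, one option is to compile the pattern a single time outside the function and reuse the compiled object. The sketch below shows that variation on a plain Python list so it runs without pandas; the name `EMAIL_PATTERN` is an assumption for this example, and the same function could be passed to `apply` as in the cell above.

```python
import re

# Compile once so the same pattern object is reused for every row
EMAIL_PATTERN = re.compile(r'\b[\w\.-]+@[\w\.-]+\.\w+\b')

def replace_sensitive_info(text):
    # Replace every email address in one string with a placeholder
    return EMAIL_PATTERN.sub("[email]", text)

rows = [
    "Contact us at john@example.com for assistance.",
    "Send an email to support@example.com.",
    "Please reach out to info@company.com.",
]

masked_rows = [replace_sensitive_info(row) for row in rows]
for row in masked_rows:
    print(row)
```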

### [BACK TO TOP](#0)