References:

You can use regex101 to build and test regular expressions and to generate the matching code, including Python:

https://regex101.com/


# Regular Expressions in Python - Regex

Regex (Regular Expressions) is a powerful tool in Python for matching and manipulating text patterns. It allows you to search, validate, and extract data from strings using a specific syntax.

In Python, you can use the `re` (regular expression) module to work with regex. Here's a brief overview:

**Why use regex in Python?**

1. **Text processing**: Regex is ideal for tasks like data cleaning, data extraction, and data validation.
2. **Pattern matching**: Regex can match complex patterns in text, such as phone numbers, email addresses, or credit card numbers.
3. **String manipulation**: Regex can be used to replace, split, or join strings based on specific patterns.

**Basic regex concepts in Python**

1. **Patterns**: Regex patterns are used to match text. They consist of characters, special characters, and metacharacters.
2. **Metacharacters**: Special characters that have special meanings in regex, such as `.`, `^`, `$`, `*`, `+`, `?`, `{`, `}`, `[`, `]`, `(`, `)`, `|`, `\`.
3. **Character classes**: A set of characters enclosed in square brackets `[]` that match any single character within the set.
4. **Quantifiers**: Special characters that specify the number of times a pattern should be matched, such as `*`, `+`, `?`, `{n,m}`.
5. **Groups**: Parentheses `()` that group patterns together, allowing you to capture and reference parts of the match.
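
The concepts above can be seen together in a short sketch. The sample strings and variable names are made up for illustration; only the `re` functions come from the standard library.

```python
import re

# Character class + quantifier: one or more digits in a row.
digits = re.findall(r"[0-9]+", "Order 66 shipped in 3 days")

# The metacharacters ^ and $ anchor the pattern to the start and end of the string.
is_word = bool(re.search(r"^[a-z]+$", "regex"))

# Groups capture parts of the match so they can be read back individually.
m = re.search(r"([a-z]+)-([0-9]{2,4})", "item-2024")
name, number = m.group(1), m.group(2)

# A backslash escapes a metacharacter so that it matches literally.
literal_dots = re.findall(r"\.", "a.b.c")

print(digits)        # ['66', '3']
print(is_word)       # True
print(name, number)  # item 2024
print(literal_dots)  # ['.', '.']
```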

**Common regex patterns in Python**

1. **Matching a string**: `re.match(pattern, string)` matches the pattern at the beginning of the string.
2. **Searching for a pattern**: `re.search(pattern, string)` searches for the pattern anywhere in the string.
3. **Replacing a pattern**: `re.sub(pattern, replacement, string)` replaces the pattern with a replacement string.
4. **Splitting a string**: `re.split(pattern, string)` splits the string into substrings based on the pattern.
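
As a rough illustration of how these four functions differ, the sketch below runs each of them on one sample string (the string itself is an invented example).

```python
import re

text = "2024-05-17: backup finished"

# re.match only succeeds if the pattern matches at position 0.
at_start = re.match(r"\d{4}", text)        # matches "2024"
not_at_start = re.match(r"backup", text)   # None: "backup" is not at the start

# re.search scans the whole string and returns the first match.
anywhere = re.search(r"backup", text)

# re.sub replaces every match; re.split cuts the string at every match.
masked = re.sub(r"\d", "#", text)
parts = re.split(r"[-: ]+", text)

print(at_start.group(), not_at_start, anywhere.group())  # 2024 None backup
print(masked)  # ####-##-##: backup finished
print(parts)   # ['2024', '05', '17', 'backup', 'finished']
```

Note that `re.match` and `re.search` return a match object or `None`, so the result is usually checked before calling `.group()`.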



In [28]:
import re

email = "john.doe@example.com"
pattern = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"

if re.match(pattern, email):
    print("Valid email address")
else:
    print("Invalid email address")


Valid email address



This code uses the `re.match` function to match the email address against a pattern that checks for the following:

* The email address starts with one or more alphanumeric characters, dots, underscores, percent signs, plus signs, or hyphens (`^[a-zA-Z0-9._%+-]+`).
* The email address contains an `@` symbol followed by one or more alphanumeric characters, dots, or hyphens (`@[a-zA-Z0-9.-]+`).
* The email address ends with a dot followed by two or more letters (`\.[a-zA-Z]{2,}$`).

If the email address matches the pattern, the code prints "Valid email address". Otherwise, it prints "Invalid email address".
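
One possible variation is to wrap the check in a helper and use `re.fullmatch`, which requires the whole string to match. In Python, `$` also matches just before a trailing newline, so `re.match` with `^...$` accepts `"john.doe@example.com\n"`, while `re.fullmatch` does not. The names `EMAIL_PATTERN` and `is_valid_email` are hypothetical and chosen for this sketch.

```python
import re

# Same pattern as above, without ^ and $ because fullmatch anchors both ends.
EMAIL_PATTERN = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def is_valid_email(candidate: str) -> bool:
    """Return True if the whole string has the shape of an email address."""
    return EMAIL_PATTERN.fullmatch(candidate) is not None

for candidate in ["john.doe@example.com", "john.doe@example", "john.doe@example.com\n"]:
    print(repr(candidate), is_valid_email(candidate))
```

This only checks the shape of the address. It does not confirm that the domain exists or that the mailbox can receive mail.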

## Validating a Phone Number

Regex pattern: `\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})`
Description: This pattern matches a phone number in the format `(123) 456-7890` or `123-456-7890` or `123.456.7890`.

In [29]:
import re

phone_number = "123-456-7890"
pattern = r"\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})$"  # the trailing $ anchors the match at the end of the string
if re.match(pattern, phone_number):
    print("Valid phone number")
else:
    print("Invalid phone number")

Valid phone number
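
Because the pattern has three capturing groups, the matched digits can also be read back and reformatted. The sketch below assumes a target format of `123-456-7890`; `PHONE_PATTERN` and `normalize_phone` are hypothetical names used only here.

```python
import re

PHONE_PATTERN = re.compile(r"\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})$")

def normalize_phone(raw):
    """Return the number as 123-456-7890, or None if it does not match."""
    m = PHONE_PATTERN.match(raw)
    if m is None:
        return None
    # groups() returns the three captured digit runs in order.
    area, exchange, line = m.groups()
    return f"{area}-{exchange}-{line}"

print(normalize_phone("(123) 456-7890"))  # 123-456-7890
print(normalize_phone("123.456.7890"))    # 123-456-7890
print(normalize_phone("12-3456"))         # None
```

The parentheses in the pattern are each optional on their own, so an unbalanced input such as `(123-456-7890` is also accepted.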


## Extracting URLs from text

Regex pattern: `https?://[^\s]+`
Description: This pattern matches `http://` or `https://` followed by every character up to the next whitespace, which covers URLs such as `http://example.com` or `https://example.com`.

In [30]:
import re

text = "Visit our website at http://example.com or https://example.com"
urls = re.findall(r"https?://[^\s]+", text)
print(urls)  # Output: ['http://example.com', 'https://example.com']

['http://example.com', 'https://example.com']
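
Since the pattern stops only at whitespace, punctuation that directly follows a URL is captured with it. A minimal sketch of one way to handle this, assuming it is acceptable to strip a fixed set of trailing punctuation characters (the sample text is invented):

```python
import re

text = "Docs are at https://example.com/docs, mirror at http://example.org."

# \S is shorthand for [^\s]: any non-whitespace character.
raw_urls = re.findall(r"https?://\S+", text)
print(raw_urls)    # ['https://example.com/docs,', 'http://example.org.']

# Drop punctuation that belongs to the sentence rather than to the URL.
clean_urls = [url.rstrip(".,;:!?)") for url in raw_urls]
print(clean_urls)  # ['https://example.com/docs', 'http://example.org']
```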


## Validating a Credit Card Number

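Regex pattern: `^(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|6(?:011|5[0-9][0-9])[0-9]{12}|3[47][0-9]{13}|3(?:0[0-5]|[68][0-9])[0-9]{11})$`
Description: This pattern matches digit-only card numbers in the formats commonly associated with Visa, Mastercard, Discover, American Express, and Diners Club, such as `4111111111111111` or `5105105105105105`. It does not allow separators, so the hyphenated `4111-1111-1111-1111` in the example below is reported as invalid.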

In [31]:
import re

credit_card = "4111-1111-1111-1111"
pattern = r"^(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|6(?:011|5[0-9][0-9])[0-9]{12}|3[47][0-9]{13}|3(?:0[0-5]|[68][0-9])[0-9]{11})$"
if re.match(pattern, credit_card):
    print("Valid credit card number")
else:
    print("Invalid credit card number")

Invalid credit card number
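
The hyphenated number is rejected because the pattern accepts digits only. One way to accept such input is to remove spaces and hyphens before matching, as sketched below. `CARD_PATTERN` and `is_card_number` are hypothetical names for this sketch.

```python
import re

CARD_PATTERN = re.compile(
    r"^(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|6(?:011|5[0-9][0-9])[0-9]{12}"
    r"|3[47][0-9]{13}|3(?:0[0-5]|[68][0-9])[0-9]{11})$"
)

def is_card_number(raw):
    """Strip spaces and hyphens, then check the remaining digits against the pattern."""
    digits = re.sub(r"[ -]", "", raw)
    return CARD_PATTERN.match(digits) is not None

print(is_card_number("4111-1111-1111-1111"))  # True
print(is_card_number("5105 1051 0510 5105"))  # True
print(is_card_number("1234-5678-9012-3456"))  # False
```

The pattern checks only the leading digits and the length. It does not verify a checksum or whether the card actually exists.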


## Extracting Dates from Text

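Regex pattern: `\b\d{2}[-/.]\d{2}[-/.]\d{4}\b`
Description: This pattern matches dates written as two digits, two digits, and four digits separated by `-`, `/`, or `.`, such as `01-01-2020`, `01/01/2020`, or `01.01.2020`. It does not match year-first dates such as `2020-01-01`.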

In [32]:
import re

text = "The meeting is scheduled for 01-01-2020 or 01/01/2020."
dates = re.findall(r"\b\d{2}[-/.]\d{2}[-/.]\d{4}\b", text)
print(dates)  # Output: ['01-01-2020', '01/01/2020']


['01-01-2020', '01/01/2020']
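
The pattern checks only the digit layout, so a string like `31/02/2021` also matches. A possible follow-up step is to capture the three parts and let `datetime.date` reject impossible dates. This sketch assumes the order is day, month, year, which the pattern itself cannot tell; `DATE_PATTERN` and `extract_dates` are hypothetical names.

```python
import re
from datetime import date

# Same layout as above, with capturing groups around each part.
DATE_PATTERN = re.compile(r"\b(\d{2})[-/.](\d{2})[-/.](\d{4})\b")

def extract_dates(text):
    """Return date objects for every day-month-year match that is a real calendar date."""
    found = []
    for m in DATE_PATTERN.finditer(text):
        day, month, year = (int(part) for part in m.groups())
        try:
            found.append(date(year, month, day))
        except ValueError:
            pass  # matched the layout but is not a real date, e.g. 31/02/2021
    return found

text = "Kickoff on 15-03-2021, review on 31/02/2021, release on 01.04.2021."
print(extract_dates(text))  # [datetime.date(2021, 3, 15), datetime.date(2021, 4, 1)]
```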


## Validating an IP address

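Regex pattern: `^(?:[0-9]{1,3}\.){3}[0-9]{1,3}$`
Description: This pattern matches strings shaped like an IPv4 address, such as `192.168.1.1`. It checks only that there are four dot-separated groups of one to three digits, so out-of-range values such as `999.999.999.999` also match.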

In [33]:
import re

ip_address = "192.168.1.1"
pattern = r"^(?:[0-9]{1,3}\.){3}[0-9]{1,3}$"
if re.match(pattern, ip_address):
    print("Valid IP address")
else:
    print("Invalid IP address")

Valid IP address
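
The pattern accepts any one to three digits per group, including values above 255. A minimal sketch of one way to add the range check in plain Python after the regex has confirmed the shape (`IP_PATTERN` and `is_valid_ipv4` are hypothetical names):

```python
import re

IP_PATTERN = re.compile(r"^(?:[0-9]{1,3}\.){3}[0-9]{1,3}$")

def is_valid_ipv4(candidate):
    """Check the dotted shape with the regex, then the 0-255 range of each part."""
    # fullmatch is used so that a trailing newline is not accepted.
    if IP_PATTERN.fullmatch(candidate) is None:
        return False
    return all(0 <= int(octet) <= 255 for octet in candidate.split("."))

print(is_valid_ipv4("192.168.1.1"))      # True
print(is_valid_ipv4("999.999.999.999"))  # False
print(is_valid_ipv4("192.168.1"))        # False
```

Doing the numeric check in Python keeps the regex short. The standard library's `ipaddress` module is another option when stricter parsing is needed.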


## Extracting Email Addresses from Text

Regex pattern: `\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b`
Description: This pattern matches email addresses in the format `john.doe@example.com`.

In [44]:
import re

text = "Contact us at john.doe@example.com or jane.smith@example.com"
# Inside a character class, | is a literal character, so the last class is written [A-Za-z] rather than [A-Z|a-z].
emails = re.findall(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b", text)
print(emails)  # Output: ['john.doe@example.com', 'jane.smith@example.com']

['john.doe@example.com', 'jane.smith@example.com']



**Conclusion**

Regex is a powerful tool in Python for working with text patterns. By understanding the basics of regex and using the `re` module, you can perform complex text processing tasks, validate data, and extract valuable information from strings.

## Exercise

Extract only the email addresses that end in `.com` or `.ai`.

In [45]:
import re

text = "Contact us at john.doe@example.com or jane.smith@example.com or test@email.AI"
# \b is a word boundary. (?:com|ai) is a non-capturing group that matches "com" or "ai".
# [com|ai]{2,} would instead be a character class: any run of the characters c, o, m, |, a, i.
emails = re.findall(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.(?:com|ai)\b", text)
print(emails)  # test@email.AI is skipped: matching is case-sensitive unless re.IGNORECASE is passed

['john.doe@example.com', 'jane.smith@example.com']
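
The exercise text ends with `test@email.AI`, which a case-sensitive pattern leaves out. The sketch below shows one possible solution that uses an alternation group together with `re.IGNORECASE`, and contrasts it with a character class such as `[com|ai]{2,}`, which matches any mix of those characters. The sample addresses are invented for illustration.

```python
import re

text = "Contact us at john.doe@example.com or jane.smith@example.org or test@email.AI"

# (?:com|ai) matches exactly "com" or "ai"; IGNORECASE also accepts "COM" and "AI".
pattern = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.(?:com|ai)\b", re.IGNORECASE)
emails = pattern.findall(text)
print(emails)  # ['john.doe@example.com', 'test@email.AI']

# A character class is not alternation: [com|ai]{2,} accepts endings like ".mo".
loose = re.findall(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[com|ai]{2,}\b", "bob@example.mo")
strict = pattern.findall("bob@example.mo")
print(loose, strict)  # ['bob@example.mo'] []
```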
