# Regular Expression

Imagine you have a big pile of documents, and you need to find all the phone numbers in them. You could read every single word, but that would take forever! Instead, you can describe what a phone number looks like with a "search pattern" and let the computer find every match quickly.

That "search pattern" is what a regular expression (regex) is.

Think of it like a super-powered search tool that understands patterns, not just exact words.

Here's a simple analogy:

* Normal search: If you search for "cat," you'll only find the exact text "cat."
* Regex search: You can create a pattern that says, "Find anything that looks like a phone number: three digits, then a hyphen, then three more digits, then another hyphen, and then four digits."
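
As a minimal sketch, the phone-number pattern described above can be written with Python's re module. The sample sentence and the name phone_like are illustrative assumptions, not part of the original notebook.

```python
import re

# Three digits, a hyphen, three digits, a hyphen, four digits.
phone_like = re.compile(r"\d{3}-\d{3}-\d{4}")

sample = "Call 555-123-4567 or 555-987-6543 before Friday."
matches = phone_like.findall(sample)
print(matches)  # ['555-123-4567', '555-987-6543']
```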





# 1. Data Validation and Cleaning

## A. Email Validation

How it works:

The "Recipe" (Regular Expression):

* The line email_pattern = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$" is the "recipe" the computer uses to check whether a string has the basic shape of an email address. It is written as a regular expression (regex for short).
* The check is about format only: it cannot tell whether the address actually exists or can receive mail.


The regex checks if the email address follows a basic pattern:

* It starts with a username (letters, numbers, some symbols).
* Then has an "@" symbol.
* Then has a domain name (letters, numbers, dots, hyphens).
* Then has a dot.
* Then has a top-level domain (at least two letters).
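
One way to keep this breakdown next to the pattern itself is Python's re.VERBOSE flag, which ignores whitespace in the pattern and allows comments. The sketch below restates the same pattern piece by piece; the name EMAIL_VERBOSE is an illustrative assumption.

```python
import re

# Same pattern as email_pattern, split into its five parts.
EMAIL_VERBOSE = re.compile(
    r"""
    ^[a-zA-Z0-9._%+-]+   # username: letters, digits, and . _ % + -
    @                    # literal at sign
    [a-zA-Z0-9.-]+       # domain name: letters, digits, dots, hyphens
    \.                   # literal dot before the top-level domain
    [a-zA-Z]{2,}$        # top-level domain: at least two letters
    """,
    re.VERBOSE,
)

print(EMAIL_VERBOSE.match("test@example.com") is not None)  # True
print(EMAIL_VERBOSE.match("invalid.email") is not None)     # False
```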

In [1]:
import re

def validate_email(email):
  """Validates an email address."""
  email_pattern = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
  return re.match(email_pattern, email) is not None

In [2]:
# Example usage

email1 = "test@example.com"
email2 = "invalid.email"

print(f"'{email1}' is valid email: {validate_email(email1)}")
print(f"'{email2}' is valid email: {validate_email(email2)}")

'test@example.com' is valid email: True
'invalid.email' is valid email: False
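
Two edge cases are worth knowing about. With re.match, the $ anchor also matches just before a trailing newline, so "test@example.com\n" passes. The pattern is also permissive about consecutive dots in the domain. The sketch below uses re.fullmatch to close the first gap; validate_email_strict is a hypothetical name, and the pattern is otherwise unchanged.

```python
import re

EMAIL_PATTERN = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"

def validate_email_strict(email):
    """Like validate_email, but the whole string must match (no trailing newline)."""
    return re.fullmatch(EMAIL_PATTERN, email) is not None

# re.match with ^...$ accepts a trailing newline:
print(re.match("^" + EMAIL_PATTERN + "$", "test@example.com\n") is not None)  # True
# re.fullmatch rejects it:
print(validate_email_strict("test@example.com\n"))  # False
# Consecutive dots in the domain still pass, so the check remains a rough filter:
print(validate_email_strict("user@example..com"))   # True
```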


## B. Validate Indian Phone Number

The regex checks that the Indian phone number:

* May optionally start with "+91", optionally followed by a hyphen or a whitespace character.
* May optionally have a leading "0".
* May optionally have a leading "91" after the optional 0.
* Must then contain a 10-digit number starting with 7, 8, or 9.
* Must match the entire input string, from beginning to end.

In [3]:
import re

def validate_indian_phone_number(phone_number):
  """Validates an Indian phone number."""

  # Accepted formats:
  # +91-XXXXXXXXXX, +91 XXXXXXXXXX, +91XXXXXXXXXX, 0XXXXXXXXXX, 91XXXXXXXXXX, XXXXXXXXXX.
  # Forms with parentheses, such as (XXX)XXXXXXX, are not accepted.
  phone_pattern = r"^(?:\+91[\-\s]?)?[0]?(?:91)?[789]\d{9}$"
  return re.match(phone_pattern, phone_number) is not None

In [4]:
# Example usage
phone1 = "+91-9876543210"
phone2 = "09876543210"
phone3 = "9876543210"
phone4 = "(987)6543210"  # invalid: parentheses are not accepted
phone5 = "1234567890"  # invalid: does not start with 7, 8, or 9
phone6 = "+919876543210"
phone7 = "919876543210"

print(f"'{phone1}' is valid Indian phone number: {validate_indian_phone_number(phone1)}")
print(f"'{phone2}' is valid Indian phone number: {validate_indian_phone_number(phone2)}")
print(f"'{phone3}' is valid Indian phone number: {validate_indian_phone_number(phone3)}")
print(f"'{phone4}' is valid Indian phone number: {validate_indian_phone_number(phone4)}")
print(f"'{phone5}' is valid Indian phone number: {validate_indian_phone_number(phone5)}")
print(f"'{phone6}' is valid Indian phone number: {validate_indian_phone_number(phone6)}")
print(f"'{phone7}' is valid Indian phone number: {validate_indian_phone_number(phone7)}")

'+91-9876543210' is valid Indian phone number: True
'09876543210' is valid Indian phone number: True
'9876543210' is valid Indian phone number: True
'(987)6543210' is valid Indian phone number: False
'1234567890' is valid Indian phone number: False
'+919876543210' is valid Indian phone number: True
'919876543210' is valid Indian phone number: True
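
After validation it is often useful to store every accepted form as the same bare 10-digit number. The sketch below does that with a named group. It also widens the first digit to 6-9, on the assumption that numbers beginning with 6 should be accepted; check that against the current numbering rules before relying on it. The names PHONE_RE and normalize_indian_phone_number are hypothetical.

```python
import re

# Same optional prefixes as phone_pattern; the 10-digit part is captured by name.
# Assumption: the first digit may be 6, 7, 8, or 9.
PHONE_RE = re.compile(r"(?:\+91[-\s]?)?0?(?:91)?(?P<number>[6-9]\d{9})")

def normalize_indian_phone_number(phone_number):
    """Return the bare 10-digit number, or None if the input does not match."""
    match = PHONE_RE.fullmatch(phone_number)
    return match.group("number") if match else None

print(normalize_indian_phone_number("+91-9876543210"))  # 9876543210
print(normalize_indian_phone_number("09876543210"))     # 9876543210
print(normalize_indian_phone_number("1234567890"))      # None
```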


## C. Validate PIN Code

Logic Summary:

The regex checks if the input string:

* Starts with a digit from 1 to 9.
* Is followed by exactly 5 more digits.
* Consists of exactly 6 digits in total.

In [5]:
import re

def validate_indian_pincode(pincode):
    """Validates an Indian PIN code (6 digits)."""
    pincode_pattern = r"^[1-9]\d{5}$"  # PIN codes cannot start with 0
    return re.match(pincode_pattern, pincode) is not None

In [6]:
# Example usage
pincode1 = "700001"   # Valid Indian PIN code
pincode2 = "012345"   # Invalid Indian PIN code (starts with 0)
pincode3 = "12345"    # Invalid Indian PIN code (fewer than 6 digits)
pincode4 = "1234567"  # Invalid Indian PIN code (more than 6 digits)

print(f"'{pincode1}' is valid Indian PIN code: {validate_indian_pincode(pincode1)}")
print(f"'{pincode2}' is valid Indian PIN code: {validate_indian_pincode(pincode2)}")
print(f"'{pincode3}' is valid Indian PIN code: {validate_indian_pincode(pincode3)}")
print(f"'{pincode4}' is valid Indian PIN code: {validate_indian_pincode(pincode4)}")

'700001' is valid Indian PIN code: True
'012345' is valid Indian PIN code: False
'12345' is valid Indian PIN code: False
'1234567' is valid Indian PIN code: False
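
The same pattern can be adapted to find PIN codes inside longer text. In the sketch below the ^ and $ anchors are replaced with \b word boundaries so that a 6-digit run inside a longer number, such as a phone number, is not picked up. The address string and the name PIN_IN_TEXT are illustrative assumptions.

```python
import re

# \b ensures the six digits are not part of a longer run of digits or letters.
PIN_IN_TEXT = re.compile(r"\b[1-9]\d{5}\b")

address = (
    "Ship to 12 Park Street, Kolkata 700001 or Andheri East, "
    "Mumbai 400069. Phone 9876543210."
)
found = PIN_IN_TEXT.findall(address)
print(found)  # ['700001', '400069']
```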


## D. Date Validation

Logic Summary:

The regex checks if the input string:

* Starts with exactly four digits (the year).
* Is followed by a hyphen.
* Is followed by exactly two digits (the month).
* Is followed by a hyphen.
* Is followed by exactly two digits (the day).
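* Consists only of this specific pattern from beginning to end.

The regex checks the format only. It does not check whether the month and day are in range, so a string such as "2023-10-32" still matches.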

In [7]:
import re

def validate_date_iso(date_str):
    """Validates an ISO-8601 date (YYYY-MM-DD)."""
    date_pattern = r"^\d{4}-\d{2}-\d{2}$"
    return re.match(date_pattern, date_str) is not None

In [8]:
# Example Usage:
date1 = "2023-10-27"  # Valid ISO date
date2 = "2023/10/27"  # Invalid (uses / instead of -)
date3 = "2023-10-32"  # Day is out of range, but the format matches, so this returns True
date4 = "23-10-27"    # Invalid (year is not 4 digits)
date5 = "2023-1-27"   # Invalid (month is only one digit)
date6 = "2023-10-7"   # Invalid (day is only one digit)
date7 = "2023-10-27T12:00:00Z"  # Invalid (includes a time component; this check is date-only)

print(f"'{date1}' is valid ISO date: {validate_date_iso(date1)}")
print(f"'{date2}' is valid ISO date: {validate_date_iso(date2)}")
print(f"'{date3}' is valid ISO date: {validate_date_iso(date3)}")
print(f"'{date4}' is valid ISO date: {validate_date_iso(date4)}")
print(f"'{date5}' is valid ISO date: {validate_date_iso(date5)}")
print(f"'{date6}' is valid ISO date: {validate_date_iso(date6)}")
print(f"'{date7}' is valid ISO date: {validate_date_iso(date7)}")

'2023-10-27' is valid ISO date: True
'2023/10/27' is valid ISO date: False
'2023-10-32' is valid ISO date: True
'23-10-27' is valid ISO date: False
'2023-1-27' is valid ISO date: False
'2023-10-7' is valid ISO date: False
'2023-10-27T12:00:00Z' is valid ISO date: False
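
Because the regex accepts any digits in the month and day positions, '2023-10-32' passes. A minimal sketch of a stricter check, assuming real calendar dates are what is wanted, keeps the regex for the exact format and adds datetime.strptime for the range check. The name validate_date_strict is hypothetical.

```python
import re
from datetime import datetime

ISO_DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")

def validate_date_strict(date_str):
    """Check the YYYY-MM-DD format and that the date exists on the calendar."""
    # The regex enforces two-digit month and day, which strptime alone does not.
    if ISO_DATE_RE.fullmatch(date_str) is None:
        return False
    try:
        datetime.strptime(date_str, "%Y-%m-%d")
    except ValueError:
        return False
    return True

print(validate_date_strict("2023-10-27"))  # True
print(validate_date_strict("2023-10-32"))  # False
print(validate_date_strict("2023-02-29"))  # False (2023 is not a leap year)
```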


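## E. Clean Text

Logic summary:

* Remove punctuation: every character that is not a letter, digit, underscore, or whitespace.
* Collapse runs of whitespace (spaces, tabs, newlines) into single spaces and trim both ends.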

In [9]:
import re

def clean_text(text):
    """Cleans text by removing special characters and extra spaces."""
    cleaned_text = re.sub(r"[^\w\s]", "", text)  # Remove punctuation
    cleaned_text = " ".join(cleaned_text.split())  # Remove extra spaces
    return cleaned_text

In [10]:
# Example Usage:
text1 = "Hello, world! How are you?"
text2 = "This is a test with  extra   spaces and some #special@characters."
text3 = "   Leading and trailing spaces,  as well as newlines\nand tabs\t should be removed."
text4 = "123 Numbers and symbols like $ % ^ & * ( ) should be removed."
text5 = "Mixed Case Text Example."

print(f"Original text: '{text1}'")
print(f"Cleaned text: '{clean_text(text1)}'")

print(f"Original text: '{text2}'")
print(f"Cleaned text: '{clean_text(text2)}'")

print(f"Original text: '{text3}'")
print(f"Cleaned text: '{clean_text(text3)}'")

print(f"Original text: '{text4}'")
print(f"Cleaned text: '{clean_text(text4)}'")

print(f"Original text: '{text5}'")
print(f"Cleaned text: '{clean_text(text5)}'")

Original text: 'Hello, world! How are you?'
Cleaned text: 'Hello world How are you'
Original text: 'This is a test with  extra   spaces and some #special@characters.'
Cleaned text: 'This is a test with extra spaces and some specialcharacters'
Original text: '   Leading and trailing spaces,  as well as newlines
and tabs	 should be removed.'
Cleaned text: 'Leading and trailing spaces as well as newlines and tabs should be removed'
Original text: '123 Numbers and symbols like $ % ^ & * ( ) should be removed.'
Cleaned text: '123 Numbers and symbols like should be removed'
Original text: 'Mixed Case Text Example.'
Cleaned text: 'Mixed Case Text Example'
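
One detail of \w is easy to miss: it matches underscores and, for Python 3 strings, accented letters as well. The sketch below shows the effect and a variant that also strips underscores; clean_text_no_underscore is a hypothetical name for illustration.

```python
import re

def clean_text_no_underscore(text):
    """Like clean_text, but underscores are removed too."""
    cleaned = re.sub(r"[^\w\s]|_", "", text)  # punctuation or underscore
    return " ".join(cleaned.split())

sample = "snake_case, café!"
# The original pattern keeps the underscore and the accented letter:
print(re.sub(r"[^\w\s]", "", sample))    # snake_case café
print(clean_text_no_underscore(sample))  # snakecase café
```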


## F. Advanced Clean Text

The clean_text_advanced function removes HTML markup, unwanted special characters, and extra whitespace from text, producing a cleaner, more standardized string.

In [11]:
import re

def clean_text_advanced(text):
    """
    Cleans text by removing HTML tags, unwanted characters, and extra spaces.
    """

    # 1. Remove HTML tags
    text = re.sub(r"<[^>]+>", "", text)

    # 2. Remove special characters (except alphanumeric, spaces, and certain symbols)
    text = re.sub(r"[^\w\s.,!?@#$%&*()\-+=]", "", text)

    # 3. Remove extra spaces
    text = " ".join(text.split())

    return text

In [12]:
# Example Usage:
text1 = "<p>This is a <b>test</b> with HTML tags!</p>"
text2 = "Here's some messy text:  $%^&*(, with extra   spaces and special characters!"
text3 = "A mix of text, numbers 123, and symbols like @#$ are here."
text4 = "   Leading and trailing spaces, as well as newlines\nand tabs\t should be removed."
text5 = "Some text with Emojis 😀🚀🎉 and unwanted characters ~`"

print(f"Original text1: '{text1}'")
print(f"Cleaned text1: '{clean_text_advanced(text1)}'")

print(f"Original text2: '{text2}'")
print(f"Cleaned text2: '{clean_text_advanced(text2)}'")

print(f"Original text3: '{text3}'")
print(f"Cleaned text3: '{clean_text_advanced(text3)}'")

print(f"Original text4: '{text4}'")
print(f"Cleaned text4: '{clean_text_advanced(text4)}'")

print(f"Original text5: '{text5}'")
print(f"Cleaned text5: '{clean_text_advanced(text5)}'")

Original text1: '<p>This is a <b>test</b> with HTML tags!</p>'
Cleaned text1: 'This is a test with HTML tags!'
Original text2: 'Here's some messy text:  $%^&*(, with extra   spaces and special characters!'
Cleaned text2: 'Heres some messy text $%&*(, with extra spaces and special characters!'
Original text3: 'A mix of text, numbers 123, and symbols like @#$ are here.'
Cleaned text3: 'A mix of text, numbers 123, and symbols like @#$ are here.'
Original text4: '   Leading and trailing spaces, as well as newlines
and tabs	 should be removed.'
Cleaned text4: 'Leading and trailing spaces, as well as newlines and tabs should be removed.'
Original text5: 'Some text with Emojis 😀🚀🎉 and unwanted characters ~`'
Cleaned text5: 'Some text with Emojis and unwanted characters'
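
Stripping tags with a regex leaves HTML entities such as &amp; in place, and the special-character step would then turn "&amp;" into "&amp". A minimal sketch, assuming entities should be decoded, uses html.unescape from the standard library after the tags are removed. The name strip_html is hypothetical, and for real-world HTML a proper parser is usually the safer choice.

```python
import html
import re

def strip_html(text):
    """Remove tags, decode entities, and collapse whitespace."""
    text = re.sub(r"<[^>]+>", "", text)  # same tag pattern as clean_text_advanced
    text = html.unescape(text)           # "&amp;" -> "&", "&lt;" -> "<"
    return " ".join(text.split())

print(strip_html("<p>Fish &amp; chips</p>"))  # Fish & chips
```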


## G. Extract Data

Logic Summary:

The extract_data function is designed to extract specific data elements from a given text string based on a provided regular expression pattern.

Input:

* text: The input text string from which data needs to be extracted.
* pattern: The regular expression pattern that defines the data elements to be extracted.

Regular Expression Matching:

* re.findall(pattern, text) is the core of the extraction process. It finds all non-overlapping occurrences of the pattern within the text.
* If the pattern has no capturing groups, it returns a list of strings, one per match.
* If the pattern has one capturing group, it returns the text of that group instead of the whole match; with several groups it returns a list of tuples.

Output:

The function returns the matches list, which contains all the extracted data elements that match the provided regex pattern.
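
The way re.findall treats parentheses is a common source of surprises: a capturing group changes what is returned. The sketch below compares a capturing group, a non-capturing group (?:...), and re.finditer on a sample price string; the variable names are illustrative.

```python
import re

prices = "Find prices: $10.99, $25, and $100.00"

# Capturing group: findall returns only the group's text.
captured = re.findall(r"\$\d+(\.\d{2})?", prices)
print(captured)  # ['.99', '', '.00']

# Non-capturing group: findall returns the whole match.
whole = re.findall(r"\$\d+(?:\.\d{2})?", prices)
print(whole)     # ['$10.99', '$25', '$100.00']

# finditer gives match objects, so group(0) is always the whole match.
via_iter = [m.group(0) for m in re.finditer(r"\$\d+(\.\d{2})?", prices)]
print(via_iter)  # ['$10.99', '$25', '$100.00']
```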

In [13]:
import re

def extract_data(text, pattern):
    """Extracts data elements from text using a provided regex pattern."""
    matches = re.findall(pattern, text)
    return matches

In [16]:
# Example Usage:

text1 = "Order ID: ORD-12345, Customer ID: CUST-67890, Product Code: PROD-ABC"
pattern1 = r"[A-Z]+-\d+"  # Pattern for IDs (e.g., ORD-12345)

text2 = "Contact us at info@example.com or support@company.net for assistance."
pattern2 = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"  # Pattern for email addresses

text3 = "Dates: 2023-10-26, 2023/11/15, and 2024.01.05"
pattern3 = r"\d{4}[-/.]\d{2}[-/.]\d{2}"  # Pattern for dates (various separators)

text4 = "Find prices: $10.99, $25, and $100.00"
pattern4 = r"\$\d+(?:\.\d{2})?"  # Pattern for currency; the group is non-capturing so findall returns whole matches

text5 = "There are 3 apples, 12 bananas, and 1 orange."
pattern5 = r"\d+ [a-z]+"  # Pattern for number followed by a word

text6 = "This text has some emojis 😀 🚀 🎉"
pattern6 = r"[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF\U0001F900-\U0001F9FF\u2600-\u26FF\u2700-\u27BF]+"

print(f"Text1: '{text1}'")
print(f"Extracted data1: {extract_data(text1, pattern1)}")

print(f"Text2: '{text2}'")
print(f"Extracted data2: {extract_data(text2, pattern2)}")

print(f"Text3: '{text3}'")
print(f"Extracted data3: {extract_data(text3, pattern3)}")

print(f"Text4: '{text4}'")
print(f"Extracted data4: {extract_data(text4, pattern4)}")

print(f"Text5: '{text5}'")
print(f"Extracted data5: {extract_data(text5, pattern5)}")

print(f"Text6: '{text6}'")
print(f"Extracted data6: {extract_data(text6, pattern6)}")

Text1: 'Order ID: ORD-12345, Customer ID: CUST-67890, Product Code: PROD-ABC'
Extracted data1: ['ORD-12345', 'CUST-67890']
Text2: 'Contact us at info@example.com or support@company.net for assistance.'
Extracted data2: ['info@example.com', 'support@company.net']
Text3: 'Dates: 2023-10-26, 2023/11/15, and 2024.01.05'
Extracted data3: ['2023-10-26', '2023/11/15', '2024.01.05']
Text4: 'Find prices: $10.99, $25, and $100.00'
Extracted data4: ['$10.99', '$25', '$100.00']
Text5: 'There are 3 apples, 12 bananas, and 1 orange.'
Extracted data5: ['3 apples', '12 bananas', '1 orange']
Text6: 'This text has some emojis 😀 🚀 🎉'
Extracted data6: ['😀', '🚀', '🎉']
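
When each match has internal structure, named groups with re.finditer can return the parts separately rather than one flat string. The sketch below splits IDs into a prefix and a number; ID_RE and extract_ids are hypothetical names, and the sample text mirrors text1.

```python
import re

# Named groups label the two parts of each ID.
ID_RE = re.compile(r"(?P<prefix>[A-Z]+)-(?P<number>\d+)")

def extract_ids(text):
    """Return one dict per ID, with 'prefix' and 'number' keys."""
    return [m.groupdict() for m in ID_RE.finditer(text)]

sample = "Order ID: ORD-12345, Customer ID: CUST-67890, Product Code: PROD-ABC"
ids = extract_ids(sample)
print(ids)
# [{'prefix': 'ORD', 'number': '12345'}, {'prefix': 'CUST', 'number': '67890'}]
```

PROD-ABC is skipped because the pattern requires digits after the hyphen.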
