# Regular Expressions

Regular expressions, often referred to as regex or regexp, are patterns used to match and manipulate text. They provide a concise way to search, extract, and modify specific sequences of characters within strings.
Python supports regular expressions through the re module of its standard library.

## Regular expressions in action

Validating email

In [None]:
import re

def is_valid_email(email):
    pattern = r'^[\w\.-]+@[\w\.-]+\.\w+$'
    return re.match(pattern, email) is not None

In [None]:
is_valid_email("john.doe@example.com")

In [None]:
is_valid_email("invalid_email@dasdad") 

In [None]:
is_valid_email("@dasdad.com") 

Validating phone numbers

In [None]:
def is_valid_phone_number(phone_number):
    pattern = r'^\d{3}-\d{3}-\d{4}(-\d{2})?$'
    return re.match(pattern, phone_number) is not None

In [None]:
is_valid_phone_number("123-456-7890-34")

In [None]:
is_valid_phone_number("9876543210")  

In [None]:
is_valid_phone_number("123-456-7890")  
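One caveat with the `^...$` patterns used above: in Python, `$` also matches just before a newline at the very end of the string, so "123-456-7890\n" passes the check. The following is a minimal sketch of a stricter variant built on `re.fullmatch`, assuming we want to reject such input. The names `PHONE_PATTERN` and `is_valid_phone_number_strict` are made up for this illustration.

```python
import re

# Same pattern as above, without the ^ and $ anchors (hypothetical name)
PHONE_PATTERN = r'\d{3}-\d{3}-\d{4}(-\d{2})?'

def is_valid_phone_number_strict(phone_number):
    # fullmatch succeeds only if the whole string matches the pattern
    return re.fullmatch(PHONE_PATTERN, phone_number) is not None

# $ tolerates one trailing newline; fullmatch does not
loose = re.match(r'^\d{3}-\d{3}-\d{4}(-\d{2})?$', "123-456-7890\n") is not None
strict = is_valid_phone_number_strict("123-456-7890\n")
print(loose, strict)
```

Whether the trailing newline should be accepted depends on where the input comes from, so treat this as one option rather than the required fix.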

Extract URLs from text

In [None]:
text = "Visit my website at https://www.example.com or https://blog.example.com"

pattern = r'https?://[\w\.-]+'
urls = re.findall(pattern, text)

urls 

Parsing log files

In [None]:
log_line = "Error: File not found (filename.txt)"

pattern = r'Error: (.+) \((.+)\)'
match = re.search(pattern, log_line)

if match:
    error_type = match.group(1)
    file_name = match.group(2)
    print(f"Error Type: {error_type}")
    print(f"File Name: {file_name}")
else:
    print("No match found.")

Data cleaning

In [None]:
text = "This text contains! some@ unwanted# characters$"

pattern = r'[!@#$]'
cleaned_text = re.sub(pattern, '', text)

cleaned_text

Tokenizing a sentence

In [None]:
sentence = "Hello, how   are you  doing today?"

pattern = r'\w+'
tokens = re.findall(pattern, sentence)

tokens

Find words that end with "ful" or "full"

In [None]:
sentence = "The colorful brown fox jumps over the colorfull lazy dog."

pattern = r'\b\w+full?\b'
result = re.findall(pattern, sentence)

print(result)  

## Basics of Regular Expressions

### Matching functions

The **match** function attempts to match a regular expression pattern at the beginning of a string, with optional flags. Here is the syntax for this function:

re.match(pattern, string, flags=0)

The match function returns a **match** object on success, or **None** on failure. We use *group(idx)* or *groups()* on the match object to get the matched text.


In [None]:
line = "Cats are smarter than dogs"

matchObj = re.match( r'(\w+) are (\w+)', line)

if matchObj:
    print(matchObj.group(0))
    print(matchObj.group(1))
    print(matchObj.group(2))
else:
    print("No match!!")
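The cell above reads the groups one by one with `group(idx)`. As a small complementary sketch, `groups()` returns all captured groups at once as a tuple, which can be unpacked directly. The variable names `subject` and `adjective` are only illustrative.

```python
import re

line = "Cats are smarter than dogs"
match_obj = re.match(r'(\w+) are (\w+)', line)

if match_obj:
    # groups() returns every captured group as a tuple (group 0 is not included)
    print(match_obj.groups())

    # The tuple can be unpacked into separate variables
    subject, adjective = match_obj.groups()
    print(subject, adjective)
```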

In [None]:
line = "Cats arex smarter than dogs"

matchObj = re.match( r'(\w+) are (\w+)', line)
print(matchObj)
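The `flags` argument mentioned in the syntax has not been used so far. As a short sketch, assuming the input may arrive in upper case, `re.IGNORECASE` makes the literal part of the pattern case-insensitive:

```python
import re

line = "CATS ARE SMARTER THAN DOGS"

# Without flags, the lowercase "are" does not match the uppercase text
no_flag = re.match(r'(\w+) are (\w+)', line)
print(no_flag)

# re.IGNORECASE lets "are" match "ARE" as well
match_obj = re.match(r'(\w+) are (\w+)', line, flags=re.IGNORECASE)
print(match_obj.group(0))
```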

Search vs Match: match checks for a match only at the beginning of the string, while search checks for a match anywhere in the string

In [None]:
line = "I think that cats are smarter than dogs"

matchObj = re.match( r'(\w+) are (\w+)', line)
print(matchObj)

In [None]:
matchObj = re.search( r'(\w+) are (\w+)', line)
print(matchObj.group(0))

**findall** returns all the occurrences of the matching pattern

In [None]:
line = "She sells seashells by the seashore. The shells she sells are surely seashells"
re.findall(r's\w+', line)

### Search and replace with **sub**

In [None]:
text = "She sells seashells by the seashore. The shells she sells are surely seashells."
re.sub(r"shells", "pebbles", text)

In [None]:
text = "The total cost is $99.99. The discount is 20%."
new_text = re.sub(r"\d+(\.\d+)?", "[NUMBER]", text)

new_text

### Splitting strings with re

In [None]:
text = "Hello World to everybody"
text.split(' ')

In [None]:
text = "Hello   World     Python    Regex"
text.split(' ')

In [None]:
words = re.split(r'\s+', text)
print(words)

In [None]:
text = "Hello, World! How are you? I am nice, thank you"
segments = re.split(r'\s?[\.,\?!]\s?', text)
print(segments)

### Regular expression patterns

Except for the metacharacters (+ ? . * ^ $ ( ) [ ] { } | \), all characters match themselves. You can match a metacharacter literally by preceding it with a backslash.
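A quick sketch of what escaping changes, using a made-up sample sentence. The last part uses `re.escape`, a function in the re module that escapes an arbitrary literal string for us:

```python
import re

text = "The total cost is $99.99 (tax included)."

# Unescaped, "." matches any character, so "99x99" is matched too
unescaped = re.findall(r'\d+.\d+', "99.99 99x99")
print(unescaped)

# Escaped, "\$" and "\." match only the literal characters
prices = re.findall(r'\$\d+\.\d+', text)
print(prices)

# re.escape escapes the parentheses of a literal string for us
literal = "(tax included)"
found = re.findall(re.escape(literal), text)
print(found)
```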

^, matches the beginning of the string (with re.MULTILINE, also the beginning of each line)

In [None]:
text = "Hello, World! Welcome to High Python."

print(re.findall(r'^H\w+', text))
print(re.findall(r'H\w+', text))

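Why findall and not simply match? With re.MULTILINE, ^ matches at the start of every line, and findall returns all of those matches: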

In [None]:
print(re.findall(r'^H\w+', "Hello\nHi Honey\nHeritage", flags=re.MULTILINE))

$, matches the end of the string (with re.MULTILINE, also the end of each line)

In [None]:
print(re.findall(r'\w+g$', "Hello\ndog fog\nrog", flags=re.MULTILINE))

., matches any single character except newline

In [None]:
line = "Cats are smarter than dogs, said the cat"

print(re.findall( r'are .* dogs', line))
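To see the "except newline" part in practice, here is a small sketch on a made-up two-line string. The `re.DOTALL` flag changes the behaviour so that the dot matches newlines too:

```python
import re

text = "first line\nsecond line"

# By default "." stops at the newline, so .* covers only the first line
default_dot = re.findall(r'f.*', text)
print(default_dot)

# With re.DOTALL "." also matches "\n", so .* runs to the end of the string
dotall_dot = re.findall(r'f.*', text, flags=re.DOTALL)
print(dotall_dot)
```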

[] Matches any single character in the brackets

In [None]:
text = "The cat sat on the mat."

pattern = r'[cms]\w+'

matches = re.findall(pattern, text)
print(matches)

We can also use ranges with -

In [None]:
text = "The cat sat on the mat."

pattern = r'\b[o-s]\w+\b'

matches = re.findall(pattern, text)
print(matches)

[^] Matches any single character NOT in the brackets

In [None]:
text = "12345678912345678"

print(re.findall(r'[34]', text))
print(re.findall(r'[^34]', text))

| matches one expression or the other

In [None]:
text = "The cat rest in a pest sat on the mat."

pattern = r'((re|pe)st)'

matches = re.findall(pattern, text)
print(matches)

### Character classes
- \w, word characters
- \W, nonword characters
- \s, whitespace, equivalent to [ \t\n\r\f\v] for ASCII text
- \S, non whitespaces
- \d, digits
- \D, nondigits
- \b, word boundaries
- \B, nonword boundaries
- \n, \t, newlines, tabs
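The classes listed above can be tried out on a single made-up sentence. This is only an illustrative sketch; the sample text and variable names are assumptions, not part of the original examples.

```python
import re

text = "Room 42\tis open 24/7, call 555-0199."

digits = re.findall(r'\d+', text)   # runs of digits
words = re.findall(r'\w+', text)    # runs of word characters
chunks = re.split(r'\s+', text)     # split on spaces and on the tab

# \b requires a word boundary before "cat"; \B requires that there is none
at_boundary = [m.start() for m in re.finditer(r'\bcat', "cat concat")]
inside_word = [m.start() for m in re.finditer(r'\Bcat', "cat concat")]

print(digits, words, chunks, at_boundary, inside_word, sep="\n")
```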

### Repetition classes

?, optional

In [None]:
text = "1234 124 1256 1244 12334"

print(re.findall(r'\b123?4\b', text))

*, zero or more

In [None]:
text = "1234 124 1256 1244 12334"

print(re.findall(r'\b123*4\b', text))

+, one or more

In [None]:
text = "1234 124 1256 1244 12334"

print(re.findall(r'\b123+4\b', text))

{n}, {n,}, {n1,n2}: exactly n, at least n, and between n1 and n2 repetitions

In [None]:
text = "1234 124 1256 1244 12334 123334 1233334 12333334"

print(re.findall(r'\b123{1}4\b', text))
print(re.findall(r'\b123{3}4\b', text))
print(re.findall(r'\b123{2,}4\b', text))
print(re.findall(r'\b123{1,3}4\b', text))

### Anchors

In [None]:
text = "Hello World\nHi to Python Huge World\nHub is High bold"

print("Matches with ^:")
print(re.findall(r'^H\w+', text, re.MULTILINE))

# Raw strings keep "\A" and "\b" literal in the printed labels
print(r"Matches with \A:")
print(re.findall(r'\AH\w+', text, re.MULTILINE))

print(r"Matches with \b:")
print(re.findall(r'\bH\w+', text, re.MULTILINE))

In [None]:
text = "Hello World\nHi to old Python Huge World\nHub is High bold"

print("Matches with $:")
print(re.findall(r'\w+d$', text, re.MULTILINE))

# Raw strings keep "\Z" and "\b" literal in the printed labels
print(r"Matches with \Z:")
print(re.findall(r'\w+d\Z', text, re.MULTILINE))

print(r"Matches with \b:")
print(re.findall(r'\w+d\b', text, re.MULTILINE))

## Regular expressions in action ... again ...

In [None]:
text = "Hello123 World456"
letters = re.findall(r'[A-Za-z]+', text)
letters

In [None]:
digits = re.findall(r'\d+', text)
digits

In [None]:
alphanumeric = re.findall(r'\w+', text)
alphanumeric

Extracting email and phone number

In [None]:
text = "Contact us at support@example.com or call +1-555-123-4567 for assistance."

In [None]:
phone_numbers = re.findall(r'\+1-\d{3}-\d{3}-\d{4}', text)
phone_numbers

In [None]:
email_addresses = re.findall(r'\b[\w.-]+@[\w.-]+\.\w+\b', text)
email_addresses

Can you find some text that matches the email pattern but is not a valid email address?

In [None]:
text = "Contact support@.example.com"
re.findall(r'\b[\w.-]+@[\w.-]+\.\w+\b', text)

Matching dates and times

In [None]:
text = "Today is 2023-05-23 and the time is 12:30 PM."

In [None]:
dates = re.findall(r'\d{4}-\d{2}-\d{2}', text)
dates

In [None]:
times = re.findall(r'\d{1,2}:\d{2}\s(?:AM|PM)', text)
times

Extracting data from text

In [None]:
text = "Hello, my name is John Doe. I am a worker in Alaska"
match = re.search(r'my name is (\w+)\s+(\w+)', text)

if match:
    name = match.group(1)
    last_name = match.group(2)
    print("Name:", name)
    print("Last name:", last_name)

In [None]:
text = "Contact: John Doe (john.doe@example.com)"
pattern = r'Contact: (\w+ \w+) \(([\w.]+@\w+\.\w+)\)'

match = re.search(pattern, text)
if match:
    name = match.group(1)
    email = match.group(2)
    print("Name:", name)
    print("Email:", email)


**Using backreferences**

In [None]:
text = "The quick brown fox fox jumps over the lazy dog dog."

pattern = r'(\b\w+\b) \1'
matches = re.findall(pattern, text)
print("Repeated Words:", matches)
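Backreferences also work in the replacement string of `re.sub`. As a minimal sketch, the repeated words found above can be collapsed into a single occurrence; the variable name `deduplicated` is only illustrative.

```python
import re

text = "The quick brown fox fox jumps over the lazy dog dog."

# \1 in the pattern requires the same word again;
# \1 in the replacement writes the captured word back once
deduplicated = re.sub(r'\b(\w+) \1\b', r'\1', text)
print(deduplicated)
```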

**Non-capturing groups**

I want to extract the words that are preceded by the words 'new' or 'old'

In [None]:
text = "I bought a new car, but I don't have a driver's license."

pattern = r'new|old\s(\w+)'
matches = re.findall(pattern, text)
print("Car Type:", matches)

We get an empty string instead of the word 'car' ... let's see what happens

In [None]:
re.findall(r"new|old\s\w+", "new car old car")

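The problem is a grouping issue: | has the lowest precedence, so the pattern means either 'new' alone or 'old' followed by a word. Let's use parentheses to group the alternatives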

In [None]:
pattern = r'(new|old)\s(\w+)'
matches = re.findall(pattern, text)
print("Car Type:", matches)

Now we have a problem, because the parentheses used for grouping the alternatives are also treated as groups to capture, so findall returns both parts

The solution: non-capturing groups

In [None]:
pattern = r'(?:new|old) (\w+)'
matches = re.findall(pattern, text)
print("Car Type:", matches)


## Some more complex regular expressions to play with...

### Phone numbers

In [None]:
def process_expressions(expression, to_match, to_ignore):
    # Report which of the strings that should match actually do
    for v in to_match:
        if re.match(expression, v):
            print("Matched: ", v)
        else:
            print("*******  ", v)

    # Collect the strings that should be ignored but matched anyway
    failed_list = [v for v in to_ignore if re.match(expression, v)]
    if len(failed_list) > 0:
        print("Failed ignore list")
        for f in failed_list:
            print(f)

In [None]:
to_match = [
    "(123) 456-7890",
    "(123)456-7890",
    "123-456-7890",
    "123.456.7890",
    "1234567890",
    "+31636363634",
    "+91 (123) 456-7890",
    "+91 (123)456-7890",
    "+91 123-456-7890",
    "+91 123.456.7890",
    "+91 1234567890",
    "+91 123 456 7890",
]

to_ignore = [
    "(123)456-780",    
    "12-456-7890",
    "12.456.7890",
    "123.45.7890",
    "123.456.790",
    "123/456/7908",
    "123/456-7908",
    "123456789",
    "+1234567890"
]

expression = r"^\d{3}-\d{3}-\d{4}$"
process_expressions(expression, to_match, to_ignore)

### Dates

In [None]:
to_match = [
    "01.01.2023",
    "01/01/23",
    "01-01-23",
    "01.01.23",
    "01/01/2019",
    "01-01-2019",
    "01.01.2019",
    "01/01/19",
    "01-01-19",
    "01.01.19",
    "01/01/2023",
    "01-01-2023",
    "01.01.2023",
    "01/01/23",
    "01-01-23",
    "01.01.23",
    "01/01/2019",
]
to_ignore = [
    "01\\01\\2023",
    "01\\01\\23",
    "01-01-123",
    "01.01.123",
    "1/01/2019",
    "1-01-2019",
    "01.1.2019",
    "01/1/19",
    "01-01-19ac",
]

expression = r"^\d{2}-\d{2}-\d{2}$"
process_expressions(expression, to_match, to_ignore)

### Times

In [None]:
to_match = [
    "1:00 am",
    "1:00 pm",
    "01:00 am",
    "01:00 pm",
    "01:00 PM",
    "1:00am",
    "1:00pm",
    "1:00PM",
    "01:00am",
    "01:00pm",
    "1400",
    "0800"
]
to_ignore = [
    "01:00 xm",
    "01:00 ym",
    "1:00xm",
    "1:00tm",
]

expression = r"^\d{1,2}:\d{2}\sam$"
process_expressions(expression, to_match, to_ignore)

### URLs

In [None]:
to_match = [
    "http://www.google.com",
    "https://www.google.com",
    "http://google.com",
    "https://google.com",
    "http://www.google.com/",
    "https://www.google.com/",
    "http://google.com/",
    "http://yahoo.com/",
    "https://google.com/",
    "https://google.cu/",
    "http://www.google.com/index.html",
    "https://www.google.com/index.html",
    "http://google.com/index.html",
    "https://google.com/start.html",
    "google.com/index.html"
]
to_ignore = [
    "htp://www.google.com",
    "httpx://www.google.com",
    "htps://www.google.com",
    "http://.google.com",
    "https://#google.com",
    "http://www.google./",
    "https://www..com/",
]

expression = r"^http://www\.\w+\.com$"
process_expressions(expression, to_match, to_ignore)