# Regular Expressions

Regular expressions, often referred to as regex or regexp, are powerful and flexible patterns used to match and manipulate text. They provide a concise and efficient way to search, extract, and modify specific patterns of characters within strings.
Python supports regular expressions through the `re` module of its standard library.

## Regular expressions in action

In [1]:
import re

# Validating email

def is_valid_email(email):
    pattern = r'^[\w\.-]+@[\w\.-]+\.\w+$'
    return re.match(pattern, email) is not None

email1 = "john.doe@example.com"
email2 = "invalid_email"

print(is_valid_email(email1)) 
print(is_valid_email(email2))  


True
False


In [2]:
# Validating phone numbers

def is_valid_phone_number(phone_number):
    pattern = r'^\d{3}-\d{3}-\d{4}(-\d{2})?$'
    return re.match(pattern, phone_number) is not None

phone1 = "123-456-7890-34"
phone2 = "9876543210"
phone3 = "123-456-7890"

print(is_valid_phone_number(phone1))  
print(is_valid_phone_number(phone2))  
print(is_valid_phone_number(phone3))  


True
False
True


In [3]:
# Extract URLs from text

text = "Visit my website at https://www.example.com or https://blog.example.com"

pattern = r'https?://[\w\.-]+'
urls = re.findall(pattern, text)

print(urls)  


['https://www.example.com', 'https://blog.example.com']


In [4]:
# Parsing log files

log = "Error: File not found (filename.txt)"

pattern = r'Error: (.+) \((.+)\)'
match = re.search(pattern, log)

if match:
    error_type = match.group(1)
    file_name = match.group(2)
    print(f"Error Type: {error_type}")
    print(f"File Name: {file_name}")
else:
    print("No match found.")


Error Type: File not found
File Name: filename.txt


In [5]:
# Data cleaning

text = "This text contains! some@ unwanted# characters$"

pattern = r'[!@#$]'
cleaned_text = re.sub(pattern, '', text)

print(cleaned_text)  


This text contains some unwanted characters


In [6]:
# Tokenizing a sentence
import re

sentence = "Hello, how are you doing today?"

pattern = r'\w+'
tokens = re.findall(pattern, sentence)

print(tokens)  


['Hello', 'how', 'are', 'you', 'doing', 'today']


In [7]:
# Find words that end with "ful" or "full"

sentence = "The colorful brown fox jumps over the colorfull lazy dog."

pattern = r'\b\w+full?\b'
result = re.findall(pattern, sentence)

print(result)  


['colorful', 'colorfull']


## Basics of Regular Expressions

### Matching functions

The **match** function attempts to match a regular expression pattern against the beginning of a string, with optional flags. Here is the syntax for this function:

re.match(pattern, string, flags=0)

The match function returns a **match** object on success, or **None** on failure. Call *group(idx)* or *groups()* on the match object to retrieve the matched text.


In [8]:
line = "Cats are smarter than dogs"

matchObj = re.match( r'(.*) are (.*?) .*', line)

if matchObj:
    print(matchObj.group(0))
    print(matchObj.group(1))
    print(matchObj.group(2))
else:
    print("No match!!")

Cats are smarter than dogs
Cats
smarter
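
The `group(idx)` calls above have a companion, `groups()`, which returns all captured groups at once. A minimal sketch reusing the same sentence (the variable names are illustrative, not part of the original notebook):

```python
import re

line = "Cats are smarter than dogs"
match_obj = re.match(r'(.*) are (.*?) .*', line)

# groups() returns every captured group as a tuple (group 0 is not included)
all_groups = match_obj.groups()
print(all_groups)

# group() also accepts several indices and returns them together as a tuple
first_and_second = match_obj.group(1, 2)
print(first_and_second)

# span(idx) gives the (start, end) positions of a group inside the string
second_span = match_obj.span(2)
print(second_span)
```

In real code, check that the match object is not `None` before calling these methods, as the cell above does.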


In [9]:
line = "Cats arex smarter than dogs"

matchObj = re.match( r'(.*) are (.*?) .*', line)
print(matchObj)

None


Search vs. match: **match** checks for a match only at the beginning of the string, while **search** checks for a match anywhere in the string.

In [10]:
line = "Cats are smarter than dogs"

matchObj = re.match( r'are (.*?) .*', line)
print(matchObj)

matchObj = re.search( r'are (.*?) .*', line)
print(matchObj.group(0))

None
are smarter than dogs
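
One detail worth spelling out: `match` anchors only the start of the string, so text after the matched part is allowed. The sketch below contrasts it with `search` and with `re.fullmatch`, which requires the pattern to cover the whole string. The sample string and variable names are assumptions made for illustration.

```python
import re

text = "123abc"

# match() only anchors the start: the trailing "abc" does not prevent a match
prefix_match = re.match(r'\d{3}', text)

# search() finds the pattern anywhere in the string
anywhere_match = re.search(r'[a-z]+', text)

# fullmatch() succeeds only if the pattern covers the entire string
whole_match = re.fullmatch(r'\d{3}', text)

print(prefix_match.group(0))
print(anywhere_match.group(0))
print(whole_match)
```

This is why the validation functions at the top of the notebook end their patterns with `$`: without it, `re.match` would accept strings with trailing garbage.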


**findall** returns all the non-overlapping occurrences of the matching pattern.

In [11]:
line = "She sells seashells by the seashore. The shells she sells are surely seashells"
re.findall(r's\w*', line)

['sells',
 'seashells',
 'seashore',
 'shells',
 'she',
 'sells',
 'surely',
 'seashells']

### Search and replace with **sub**

In [12]:
text = "She sells seashells by the seashore. The shells she sells are surely seashells."
re.sub(r"shells", "pebbles", text)

'She sells seapebbles by the seashore. The pebbles she sells are surely seapebbles.'

In [13]:
text = "The total cost is $99.99. The discount is 20%."
pattern = r"\d+(\.\d+)?"
replacement = "[NUMBER]"

new_text = re.sub(pattern, replacement, text)

print(new_text)


The total cost is $[NUMBER]. The discount is [NUMBER]%.


### Splitting strings with re

In [14]:
text = "Hello   World     Python    Regex"
words = re.split(r'\s+', text)
print(words)
print(text.split(' '))

['Hello', 'World', 'Python', 'Regex']
['Hello', '', '', 'World', '', '', '', '', 'Python', '', '', '', 'Regex']


In [15]:
text = "Hello, World! How are you?"
segments = re.split(r'[\.,\?!]', text)
print(segments)

['Hello', ' World', ' How are you', '']


### Regular expression patterns

Except for the metacharacters (+ ? . * ^ $ ( ) [ ] { } | \), all characters match themselves. You can match a metacharacter literally by preceding it with a backslash.
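
A short sketch of escaping in practice, assuming a made-up sample string. It escapes `$` and `.` by hand, then uses `re.escape` to build a literal pattern from an arbitrary string:

```python
import re

text = "Total: $5.00 (approx.)"

# Escaped with a backslash, '$' and '.' match the literal characters
price = re.findall(r'\$\d+\.\d+', text)
print(price)

# re.escape() adds the backslashes needed to match a string literally
literal = "(approx.)"
escaped_pattern = re.escape(literal)
found = re.findall(escaped_pattern, text)
print(found)
```

`re.escape` is handy when the text to search for comes from user input and may contain metacharacters.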

In [16]:
# ^, matches the beginning of the string (or of each line with re.MULTILINE)
text = "Hello, World! Welcome to High Python."

print(re.findall(r'^H\w+', text))
print(re.findall(r'H\w+', text))

['Hello']
['Hello', 'High']


In [17]:
# Why findall and not simply match
print(re.findall(r'^H\w+', "Hello\nHi Honey\nHeritage", flags=re.MULTILINE))

['Hello', 'Hi', 'Heritage']


In [18]:
# $, matches end of line
print(re.findall(r'\w+g$', "Hello\ndog fog\nrog", flags=re.MULTILINE))

['fog', 'rog']


In [19]:
# ., matches any single character except newline
line = "Cats are smarter than dogs, said the cat"

print(re.findall( r'are .* dogs', line))

['are smarter than dogs']


In [20]:
# [] Matches any single character in the brackets
text = "The cat sat on the mat."

pattern = r'[cms]\w+'

matches = re.findall(pattern, text)
print(matches)

['cat', 'sat', 'mat']


In [21]:
# We can also use ranges with -
text = "The cat sat on the mat."

pattern = r'\b[o-s]\w+\b'

matches = re.findall(pattern, text)
print(matches)

['sat', 'on']


In [22]:
# [^] Matches any single character NOT in the brackets
text = "12345678912345678"

print(re.findall(r'[34]', text))
print(re.findall(r'[^34]', text))


['3', '4', '3', '4']
['1', '2', '5', '6', '7', '8', '9', '1', '2', '5', '6', '7', '8']


In [23]:
# | Matches one expression or the other
text = "The cat rest in a pest sat on the mat."

pattern = r'((re|pe)st)'

matches = re.findall(pattern, text)
print(matches)

[('rest', 're'), ('pest', 'pe')]


### Character classes
- \w, word characters
- \W, nonword characters
- \s, whitespace characters, equivalent to [ \t\n\r\f\v] for ASCII text
- \S, non whitespaces
- \d, digits
- \D, nondigits
- \b, word boundaries
- \B, nonword boundaries
- \n, \t, newlines, tabs
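
The classes listed above can be tried side by side. The following is a minimal sketch on a made-up sample string; the variable names are illustrative only.

```python
import re

text = "Room 42, floor 3\tWing B"

digits = re.findall(r'\d+', text)            # runs of digits
non_digits = re.findall(r'\D+', text)        # runs of anything but digits
spaces = re.findall(r'\s', text)             # single whitespace characters
non_spaces = re.findall(r'\S+', text)        # runs of non-whitespace
words = re.findall(r'\w+', text)             # runs of word characters
non_words = re.findall(r'\W+', text)         # runs of non-word characters
starts_with_w = re.findall(r'\bW\w*', text)  # W located at a word boundary
inner_o = re.findall(r'\Bo\w*', text)        # o that is not at a word start

for result in (digits, non_digits, spaces, non_spaces,
               words, non_words, starts_with_w, inner_o):
    print(result)
```

Note that `\b` and `\B` match positions rather than characters, which is why they are combined with other tokens here.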

### Repetition cases

In [24]:
# ?, optional
text = "1234 124 1256 1244 12334"

print(re.findall(r'\b123?4\b', text))

['1234', '124']


In [25]:
# *, zero or more
text = "1234 124 1256 1244 12334"

print(re.findall(r'\b123*4\b', text))

['1234', '124', '12334']


In [26]:
# +, one or more
text = "1234 124 1256 1244 12334"

print(re.findall(r'\b123+4\b', text))

['1234', '12334']


In [27]:
# {n}, {n,}, {n1,n2}
text = "1234 124 1256 1244 12334 123334 1233334 12333334"

print(re.findall(r'\b123{1}4\b', text))
print(re.findall(r'\b123{3}4\b', text))
print(re.findall(r'\b123{2,}4\b', text))
print(re.findall(r'\b123{1,3}4\b', text))

['1234']
['123334']
['12334', '123334', '1233334', '12333334']
['1234', '12334', '123334']
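
The quantifiers above are greedy: they consume as much text as possible. Appending `?` to a quantifier (as in the `(.*?)` used earlier in this notebook) makes it lazy, so it stops at the first opportunity. A small sketch of the difference, using a made-up sample string:

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ runs to the last '>' in the string
greedy = re.findall(r'<.+>', html)

# Lazy: .+? stops at the first '>' it can reach
lazy = re.findall(r'<.+?>', html)

print(greedy)
print(lazy)
```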


### Anchors

In [28]:
text = "Hello World\nHi to Python Huge World\nHub is High bold"

print("Matches with ^:")
print(re.findall(r'^H\w+', text, re.MULTILINE))

print(r"Matches with \A:")
print(re.findall(r'\AH\w+', text, re.MULTILINE))

print(r"Matches with \b:")
print(re.findall(r'\bH\w+', text, re.MULTILINE))

Matches with ^:
['Hello', 'Hi', 'Hub']
Matches with \A:
['Hello']
Matches with \b:
['Hello', 'Hi', 'Huge', 'Hub', 'High']


In [29]:
text = "Hello World\nHi to old Python Huge World\nHub is High bold"

print("Matches with $:")
print(re.findall(r'\w+d$', text, re.MULTILINE))

print(r"Matches with \Z:")
print(re.findall(r'\w+d\Z', text, re.MULTILINE))

print(r"Matches with \b:")
print(re.findall(r'\w+d\b', text, re.MULTILINE))

Matches with $:
['World', 'World', 'bold']
Matches with \Z:
['bold']
Matches with \b:
['World', 'old', 'World', 'bold']


## Regular expressions in action ... again ...

In [30]:
text = "Hello123 World456"
letters = re.findall(r'[A-Za-z]+', text)
digits = re.findall(r'\d+', text)
alphanumeric = re.findall(r'\w+', text)

print("Letters:", letters)
print("Digits:", digits)
print("Alphanumeric:", alphanumeric)

Letters: ['Hello', 'World']
Digits: ['123', '456']
Alphanumeric: ['Hello123', 'World456']


In [31]:
# Extracting email and phone number

text = "Contact us at support@example.com or call +1-555-123-4567 for assistance."
phone_numbers = re.findall(r'\+1-\d{3}-\d{3}-\d{4}', text)
email_addresses = re.findall(r'\b[\w.-]+@[\w.-]+\.\w+\b', text)

print("Phone Numbers:", phone_numbers)
print("Email Addresses:", email_addresses)


Phone Numbers: ['+1-555-123-4567']
Email Addresses: ['support@example.com']


Can you find text that matches the email pattern but is not a valid email address?
- Can you fix the expression?

In [32]:
# Here is one ...
text = "Contact support@.example.com"
re.findall(r'\b[\w.-]+@[\w.-]+\.\w+\b', text)

['support@.example.com']
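
One possible fix, offered as a sketch rather than a complete validator: require every domain label to start and end with a letter or digit, so an empty label right after `@` is rejected. The pattern and the sample strings below are assumptions made for illustration; the local part is left as permissive as in the original expression, and full validation of email addresses is considerably more involved.

```python
import re

# Hypothetical stricter pattern: each domain label is alphanumeric, with
# optional inner hyphens, and the final label has at least two letters
email_pattern = (
    r'\b[\w.-]+@'
    r'(?:[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*\.)+'
    r'[A-Za-z]{2,}\b'
)

bad_text = "Contact support@.example.com"
good_text = "Contact us at support@example.com or sales@mail.example.co.uk."

bad_matches = re.findall(email_pattern, bad_text)
good_matches = re.findall(email_pattern, good_text)

print(bad_matches)
print(good_matches)
```

The groups are non-capturing, so `findall` returns the whole address instead of the last domain label.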

In [33]:
# Matching date and times

text = "Today is 2023-05-23 and the time is 12:30 PM."
dates = re.findall(r'\d{4}-\d{2}-\d{2}', text)
times = re.findall(r'\d{1,2}:\d{2}\s(?:AM|PM)', text)

print("Dates:", dates)
print("Times:", times)


Dates: ['2023-05-23']
Times: ['12:30 PM']


In [34]:
# Extracting data from text

text = "Hello, my name is John Doe."
match = re.search(r'my name is (\w+)\s+(\w+)', text)

if match:
    name = match.group(1)
    last_name = match.group(2)
    print("Name:", name)
    print("Last name:", last_name)


Name: John
Last name: Doe


In [35]:
text = "Contact: John Doe (john.doe@example.com)"
pattern = r'Contact: (\w+ \w+) \(([\w.]+@\w+\.\w+)\)'

match = re.search(pattern, text)
if match:
    name = match.group(1)
    email = match.group(2)
    print("Name:", name)
    print("Email:", email)


Name: John Doe
Email: john.doe@example.com


In [36]:
# Using backreferences

text = "The quick brown fox fox jumps over the lazy dog dog."

pattern = r'(\b\w+\b) \1'
matches = re.findall(pattern, text)
print("Repeated Words:", matches)


Repeated Words: ['fox', 'dog']
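
Backreferences also work in the replacement string of `re.sub`. As a sketch, the repeated words found above can be collapsed into a single occurrence. The trailing `\b` is an addition not present in the original pattern; it keeps the second word from matching only the prefix of a longer word.

```python
import re

text = "The quick brown fox fox jumps over the lazy dog dog."

# \1 in the pattern matches the same text as group 1;
# \1 in the replacement inserts that text once
deduplicated = re.sub(r'\b(\w+) \1\b', r'\1', text)

print(deduplicated)
```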


In [37]:
# Non-capturing groups

# I want to extract the words that are preceded by the words 'new' or 'old'
text = "I bought a new car, but I don't have a driver's license."

pattern = r'new|old\s(\w+)'
matches = re.findall(pattern, text)
print("Car Type:", matches)


Car Type: ['']


We only get an empty string. Let's see what happens:

In [38]:
re.findall(r"new|old\s\w+", "new car old car")

['new', 'old car']

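The problem is one of grouping: the alternation applies to everything on each side of `|`, so the pattern means either `new` or `old\s\w+`. Let's use parentheses to limit it: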

In [39]:
pattern = r'(new|old)\s(\w+)'
matches = re.findall(pattern, text)
print("Car Type:", matches)

Car Type: [('new', 'car')]


Now we have a different problem: the parentheses used to group the alternatives are also treated as a group to capture.

The solution: non-capturing groups.

In [40]:
pattern = r'(?:new|old) (\w+)'
matches = re.findall(pattern, text)
print("Car Type:", matches)


Car Type: ['car']


## Some more complex regular expressions to play with...

In [41]:
# Phone numbers

expression = r"^(\+\d{1,2}\s?)?\(?\d{3}\)?[\s\-\.]?\d{3}[\-.\s]?\d{4}$"
for v in [
    "(123) 456-7890",
    "(123)456-7890",
    "123-456-7890",
    "123.456.7890",
    "1234567890",
    "+31636363634",
    "+91 (123) 456-7890",
    "+91 (123)456-7890",
    "+91 123-456-7890",
    "+91 123.456.7890",
    "+91 1234567890",
    "+91 123 456 7890",
]:
    if re.match(expression, v):
        print("Matched: ", v)
    else:
        print("*******  ", v)

Matched:  (123) 456-7890
Matched:  (123)456-7890
Matched:  123-456-7890
Matched:  123.456.7890
Matched:  1234567890
Matched:  +31636363634
Matched:  +91 (123) 456-7890
Matched:  +91 (123)456-7890
Matched:  +91 123-456-7890
Matched:  +91 123.456.7890
Matched:  +91 1234567890
Matched:  +91 123 456 7890


In [42]:
# Dates

expression = r"^\d{2}[./\-]\d{2}[./\-]\d{2,4}$"
for v in [
    "01.01.2023",
    "01/01/23",
    "01-01-23",
    "01.01.23",
    "01/01/2019",
    "01-01-2019",
    "01.01.2019",
    "01/01/19",
    "01-01-19",
    "01.01.19",
    "01/01/2023",
    "01-01-2023",
    "01.01.2023",
    "01/01/23",
    "01-01-23",
    "01.01.23",
    "01/01/2019",
]:
    if re.match(expression, v):
        print("Matched: ", v)
    else:
        print("*******  ", v)

Matched:  01.01.2023
Matched:  01/01/23
Matched:  01-01-23
Matched:  01.01.23
Matched:  01/01/2019
Matched:  01-01-2019
Matched:  01.01.2019
Matched:  01/01/19
Matched:  01-01-19
Matched:  01.01.19
Matched:  01/01/2023
Matched:  01-01-2023
Matched:  01.01.2023
Matched:  01/01/23
Matched:  01-01-23
Matched:  01.01.23
Matched:  01/01/2019


In [43]:
# Times

expression = r"^\d{1,2}:\d{2}\s[ap]m$"
for v in [
    "1:00 am",
    "1:00 pm",
    "01:00 am",
    "01:00 pm",
    "1:00am",
    "1:00pm",
    "01:00am",
    "01:00pm",
    "1400",
    "0800"
]:
    if re.match(expression, v):
        print("Matched: ", v)
    else:
        print("*******  ", v)

Matched:  1:00 am
Matched:  1:00 pm
Matched:  01:00 am
Matched:  01:00 pm
*******   1:00am
*******   1:00pm
*******   01:00am
*******   01:00pm
*******   1400
*******   0800


In [44]:
# Urls

expression = r"^(https?://)?(www\.)?[a-zA-Z\-_]+\.[a-zA-Z]+(?:/index\.html)?/?$"
for v in [
    "http://www.google.com",
    "https://www.google.com",
    "http://google.com",
    "https://google.com",
    "http://www.google.com/",
    "https://www.google.com/",
    "http://google.com/",
    "http://yahoo.com/",
    "https://google.com/",
    "https://google.cu/",
    "http://www.google.com/index.html",
    "https://www.google.com/index.html",
    "http://google.com/index.html",
    "https://google.com/index.html",
    "google.com/index.html"
]:
    if re.match(expression, v):
        print("Matched: ", v)
    else:
        print("*******  ", v)

Matched:  http://www.google.com
Matched:  https://www.google.com
Matched:  http://google.com
Matched:  https://google.com
Matched:  http://www.google.com/
Matched:  https://www.google.com/
Matched:  http://google.com/
Matched:  http://yahoo.com/
Matched:  https://google.com/
Matched:  https://google.cu/
Matched:  http://www.google.com/index.html
Matched:  https://www.google.com/index.html
Matched:  http://google.com/index.html
Matched:  https://google.com/index.html
Matched:  google.com/index.html
