# RE MODULE

### The re (regular expressions) module in Python provides support for regular expressions, which are powerful tools for matching and manipulating patterns in strings. Regular expressions let you search text for specific patterns and then find, replace, or split strings based on those patterns. Here's an overview of the re module and some important methods:

In [1]:
# search(): Returns a match object if the regular expression matches the string, or None if it does not match.
# findall(): Returns a list of all the matches for the regular expression in the string.
# split(): Splits the string into a list of strings, based on the matches for the regular expression.
# sub(): Replaces all the matches for the regular expression in the string with a new string.
# compile(): Compiles a regular expression pattern into a regular expression object. This can be useful for improving performance, as the compiled object can be reused multiple times.
# match(): Returns a match object if the regular expression matches the beginning of the string, or None if it does not match. This is different from the search() method, which will match the regular expression anywhere in the string.
# group(): Returns a specific group from a match object. group(0), or group() with no argument, is the entire match; capture groups are numbered starting from 1.
# groups(): Returns a tuple of all the groups from a match object.
# flags: Specifies flags that control the behavior of the regular expression. For example, the re.IGNORECASE flag can be used to make the regular expression case-insensitive.
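
split(), group() and groups() from the list above are easiest to see on a small input. The following is a minimal sketch; the sample strings and variable names are made up for illustration.

```python
import re

# split(): break a string on every run of commas and/or whitespace
parts = re.split(r"[,\s]+", "red, green,blue  yellow")
print(parts)  # ['red', 'green', 'blue', 'yellow']

# group() and groups(): pull the captured pieces out of a match object
m = re.search(r"(\d{4})-(\d{2})", "Released 2023-07 in beta")
print(m.group(0))  # 2023-07  (the entire match)
print(m.group(1))  # 2023     (first capture group)
print(m.groups())  # ('2023', '07')
```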

### When using re.compile(), you compile the regular expression pattern into a pattern object. This pattern object has various methods (e.g., search(), findall()) that can be used for matching and manipulating strings.

In [3]:
import re
# Passing r'abc' directly to re.search() gives the same result; compiling once mainly helps when the same pattern is reused many times.
pattern = re.compile(r'abc')
string = "abcdef"

match = pattern.search(string)
if match:
    print("Pattern found!")
else:
    print("Pattern not found.")

Pattern found!


In [57]:
# if you use compile(), you will not need to use re.findall(pattern, string). Instead, you can use pattern.findall(string)

In [4]:
import re

pattern = re.compile(r"\d+")
string1 = "I have 3 cats."
string2 = "There are 10 dogs."

matches1 = pattern.findall(string1)
matches2 = pattern.findall(string2)

print(matches1)  # Output: ['3']
print(matches2)  # Output: ['10']

# Example without re.compile():
import re

string1 = "I have 3 cats."
string2 = "There are 10 dogs."

matches1 = re.findall(r"\d+", string1)
matches2 = re.findall(r"\d+", string2)

print(matches1)  # Output: ['3']
print(matches2)  # Output: ['10']

['3']
['10']
['3']
['10']




### re.match(pattern, string, flags=0): This method tries to match the pattern at the beginning of the string. It returns a match object if the pattern matches, or None if it doesn't.

In [8]:
import re

pattern = r"Hello"
string = "Hello, World!"

match = re.match(pattern, string)
print(match)
if match:
    print("Match found!")
else:
    print("No match.")

<re.Match object; span=(0, 5), match='Hello'>
Match found!


### re.search(pattern, string, flags=0): This method searches the pattern in the entire string. It returns a match object if the pattern is found, or None otherwise.

In [7]:
import re

pattern = r"World"
string = "Hello, World!"

match = re.search(pattern, string)
print(match)
if match:
    print("Pattern found!")
else:
    print("Pattern not found.")

<re.Match object; span=(7, 12), match='World'>
Pattern found!


### Flags

In [94]:
import re
sentence = "Start a sentence and bring it to the end."
pattern = re.compile(r"start")
matches = pattern.search(sentence)
print(matches)  # the pattern is case-sensitive by default, so this prints None

pattern = re.compile(r"start", re.IGNORECASE)  # re.IGNORECASE (or re.I) makes the match case-insensitive
matches = pattern.search(sentence)
print(matches)  # now "Start" matches even though the case differs

None
<re.Match object; span=(0, 5), match='Start'>
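
Flags can be combined with the `|` operator. As a rough sketch (the sample text is invented), the example below pairs re.IGNORECASE with re.MULTILINE, which makes `^` match at the start of every line rather than only at the start of the string.

```python
import re

text = "start here\nSTART again\nnot at the start"

# Both flags apply: case is ignored and ^ anchors at each line start
pattern = re.compile(r"^start", re.IGNORECASE | re.MULTILINE)
found = pattern.findall(text)
print(found)  # ['start', 'START']

# Without re.MULTILINE, ^ only anchors at the very beginning of the string
only_first = re.findall(r"^start", text, re.IGNORECASE)
print(only_first)  # ['start']
```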


### Difference between match and search methods:
The main difference between search() and match() is that search() looks for a match anywhere in the string, while match() only looks for a match at the beginning of the string. It's important to choose the appropriate method based on your specific matching requirements.
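
A side-by-side sketch of that difference, reusing the "Hello, World!" string from the earlier cells (the variable names are illustrative only):

```python
import re

string = "Hello, World!"

at_start = re.match(r"World", string)    # None: "World" does not begin at index 0
anywhere = re.search(r"World", string)   # found later in the string
anchored = re.search(r"^World", string)  # None: ^ ties search() to the start as well

print(at_start)         # None
print(anywhere.span())  # (7, 12)
print(anchored)         # None
```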

### re.findall(pattern, string, flags=0): This method finds all occurrences of the pattern in the string and returns them as a list of strings.

In [9]:
import re

pattern = r"\d+"
string = "I have 3 cats and 2 dogs."

numbers = re.findall(pattern, string)
print(numbers)  # Output: ['3', '2']

['3', '2']


### re.finditer() method in Python is used to find all occurrences of a pattern in a string and return an iterator yielding match objects for each match found. It's similar to the re.findall() method, but instead of returning a list of matching strings, it returns an iterator of match objects that provide more detailed information about each match.

In [56]:
import re

pattern = r"\d+"
string = "I have 3 cats and 2 dogs."

matches = re.finditer(pattern, string)
for match in matches:
    print(match)
    print(match.group())  # Output: '3', '2'
    print(match.start())  # Output: 7, 18
    print(match.end())    # Output: 8, 19
# Inside the loop, we can access the matched string using match.group(). 
# The start() method returns the starting index of the match in the string, and end() method returns the index immediately following the end of the match.

<re.Match object; span=(7, 8), match='3'>
3
7
8
<re.Match object; span=(18, 19), match='2'>
2
18
19


### MetaCharacters (Need to be escaped): To search for one of these characters literally, put a \ before it to escape the character.
. ^ $ * + ? { } [ ] \ | ( )
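
When the text to match contains several metacharacters, escaping can be done by hand or with re.escape(), which adds the backslashes automatically. A small sketch, assuming a made-up price string:

```python
import re

price_text = "Total: $5.00 (approx.)"

# Escaping by hand: \$ and \. are treated as literal characters
manual = re.search(r"\$5\.00", price_text)
print(manual.group())  # $5.00

# re.escape() builds the escaped pattern from a plain string
escaped = re.escape("$5.00")
auto = re.search(escaped, price_text)
print(auto.group())  # $5.00
```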

In [55]:
pattern = r"."
string = '''hi.
i have 3 dogs and  2 cats.
'''

matches = re.finditer(pattern, string)  # an unescaped . matches every character except a newline; use r"\." (or re.compile(r"\.") with pattern.finditer(string)) to match a literal dot
for match in matches:
    print(match)

<re.Match object; span=(0, 1), match='h'>
<re.Match object; span=(1, 2), match='i'>
<re.Match object; span=(2, 3), match='.'>
<re.Match object; span=(4, 5), match='i'>
<re.Match object; span=(5, 6), match=' '>
<re.Match object; span=(6, 7), match='h'>
<re.Match object; span=(7, 8), match='a'>
<re.Match object; span=(8, 9), match='v'>
<re.Match object; span=(9, 10), match='e'>
<re.Match object; span=(10, 11), match=' '>
<re.Match object; span=(11, 12), match='3'>
<re.Match object; span=(12, 13), match=' '>
<re.Match object; span=(13, 14), match='d'>
<re.Match object; span=(14, 15), match='o'>
<re.Match object; span=(15, 16), match='g'>
<re.Match object; span=(16, 17), match='s'>
<re.Match object; span=(17, 18), match=' '>
<re.Match object; span=(18, 19), match='a'>
<re.Match object; span=(19, 20), match='n'>
<re.Match object; span=(20, 21), match='d'>
<re.Match object; span=(21, 22), match=' '>
<re.Match object; span=(22, 23), match=' '>
<re.Match object; span=(23, 24), match='2'>
<re.M

In [54]:
pattern = re.compile(r"\.")
string = '''hi.
i have 3 dogs and  2 cats.
'''

matches = pattern.finditer(string)
for match in matches:
    print(match)

<re.Match object; span=(2, 3), match='.'>
<re.Match object; span=(29, 30), match='.'>


### Cases for Patterns:
pattern = r"character"

In [26]:
# If "character" is:
# .    - any character except a newline
# \d   - digit (0-9)
# \d+  - one or more adjacent digits (a whole number)
# \D   - not a digit (0-9)
# \w   - word character (a-z, A-Z, 0-9, _)
# \W   - not a word character
# \s   - whitespace (space, tab, newline)
# \S   - not whitespace
# \b   - word boundary
# \B   - not a word boundary

# ^    - beginning of a string
# $    - end of a string
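
A few of the classes and anchors above applied to one string. This is only an illustrative sketch; the sample text is invented.

```python
import re

sample = "Room 42, floor_3\tEast"

digits = re.findall(r"\d", sample)   # single digits
words = re.findall(r"\w+", sample)   # runs of word characters (note that _ counts)
spaces = re.findall(r"\s", sample)   # spaces and the tab
print(digits)  # ['4', '2', '3']
print(words)   # ['Room', '42', 'floor_3', 'East']
print(spaces)  # [' ', ' ', '\t']

# \b keeps "cat" from matching inside "scatter" or "cat9"
whole_word = re.findall(r"\bcat\b", "cat scatter cat9")
print(whole_word)  # ['cat']

# ^ and $ anchor the pattern to the start and end of the string
starts_with_room = re.search(r"^Room", sample) is not None
ends_with_east = re.search(r"East$", sample) is not None
print(starts_with_room, ends_with_east)  # True True
```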

In [37]:
# an important example
import re
text= '''
Matthew Anderson
555.222.4564
mathew+12@email.com

Yuri Johanson
666.333.6677
yuri23@emial.com

Steven Neil
345.658.2342
steven_2432@email.com
'''
pattern = re.compile(r"\d\d\d.\d\d\d.\d\d\d\d")  # phone numbers; the unescaped . accepts any separator character
matches = pattern.findall(text)
for match in matches:
    print(match)
pattern = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")  # email addresses
matches = pattern.findall(text)
for match in matches:
    print(match)
pattern = re.compile(r"([A-Z][a-z]+ [A-Z][a-z]+)")  # names: two capitalized words separated by a space
matches = pattern.findall(text)
for match in matches:
    print(match)

555.222.4564
666.333.6677
345.658.2342
mathew+12@email.com
yuri23@emial.com
steven_2432@email.com
Matthew Anderson
Yuri Johanson
Steven Neil


In [87]:
# another example
phone_nums ='''
123.345.6789
431-346-9786
875*235*7544
324--453-7568
'''
pattern = re.compile(r"\d\d\d[.-]\d\d\d[.-]\d\d\d\d")  # pattern for phone numbers
matches = pattern.findall(phone_nums)
for match in matches:
    print(match)
# [.-] limits each separator to a single . or -, so the numbers written with * or -- do not match.

123.345.6789
431-346-9786


### A - between two characters inside [] indicates a range. To narrow the search, use [first boundary-last boundary], for example [a-z].

In [88]:
text= '''
Matthew Anderson
555.222.4564
mathew+12@email.com

Yuri Johanson
666.333.6677
yuri23@emial.com

Steven Neil
345.658.2342
steven_2432@email.com
'''
pattern = re.compile(r"[a-z]+") # + indicating that the preceding element should occur one or more times. 
matches = re.finditer(pattern, text)
for match in matches:
    print(match)

<re.Match object; span=(2, 8), match='atthew'>
<re.Match object; span=(10, 17), match='nderson'>
<re.Match object; span=(31, 37), match='mathew'>
<re.Match object; span=(41, 46), match='email'>
<re.Match object; span=(47, 50), match='com'>
<re.Match object; span=(53, 56), match='uri'>
<re.Match object; span=(58, 65), match='ohanson'>
<re.Match object; span=(79, 83), match='yuri'>
<re.Match object; span=(86, 91), match='emial'>
<re.Match object; span=(92, 95), match='com'>
<re.Match object; span=(98, 103), match='teven'>
<re.Match object; span=(105, 108), match='eil'>
<re.Match object; span=(122, 128), match='steven'>
<re.Match object; span=(134, 139), match='email'>
<re.Match object; span=(140, 143), match='com'>


In [50]:
# ^ negates the characters between []
text= '''
bat
pat
cat
sat
'''
pattern = re.compile(r"[^b]at") 
matches = re.finditer(pattern, text)
for match in matches:
    print(match)

<re.Match object; span=(5, 8), match='pat'>
<re.Match object; span=(9, 12), match='cat'>
<re.Match object; span=(13, 16), match='sat'>


In [None]:
# BRACKETS
# |     - Either or
# ()    - Groups

# QUANTIFIERS
# *     - 0 or more
# +     - 1 or more
# ?     - 0 or one
# {3}   - exact number
# {3,4} - range of numbers {min,max}
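
To see each quantifier in isolation, the sketch below checks a handful of made-up strings against patterns that differ only in the quantifier applied to `a`. The `matching` helper is hypothetical and exists only for this illustration; it relies on re.fullmatch() so the whole string must fit the pattern.

```python
import re

candidates = ["ct", "cat", "caat", "caaat", "caaaat"]

def matching(pattern):
    """Return the candidates that the pattern matches in full (illustrative helper)."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

print(matching(r"ca*t"))      # 0 or more: ['ct', 'cat', 'caat', 'caaat', 'caaaat']
print(matching(r"ca+t"))      # 1 or more: ['cat', 'caat', 'caaat', 'caaaat']
print(matching(r"ca?t"))      # 0 or 1:    ['ct', 'cat']
print(matching(r"ca{3}t"))    # exactly 3: ['caaat']
print(matching(r"ca{2,3}t"))  # 2 to 3:    ['caat', 'caaat']
```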

In [68]:
phone_nums ='''
123.345.6789
431-346-9786
875*235*7544
324--453-7568
'''
pattern = re.compile(r'\d{3}.\d{3}.\d{4}')  # phone numbers; the unescaped . matches any separator, including *
matches= pattern.finditer(phone_nums)
for match in matches:
    print(match)

<re.Match object; span=(1, 13), match='123.345.6789'>
<re.Match object; span=(14, 26), match='431-346-9786'>
<re.Match object; span=(27, 39), match='875*235*7544'>


In [71]:
text= '''
Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
'''
pattern = re.compile(r"Mr\.?\s[A-Z]\w+")
matches = pattern.finditer(text)
for match in matches:
    print(match)
print()
# including Ms and Mrs
pattern = re.compile(r"(Mr|Mrs|Ms)\.?\s[A-Z]\w+")
matches = pattern.finditer(text)
for match in matches:
    print(match)
print()
# including Mr. T: \w* allows zero further characters after the capital letter
pattern = re.compile(r"(Mr|Mrs|Ms)\.?\s[A-Z]\w*")
matches = pattern.finditer(text)
for match in matches:
    print(match)


<re.Match object; span=(1, 12), match='Mr. Schafer'>
<re.Match object; span=(13, 21), match='Mr Smith'>

<re.Match object; span=(1, 12), match='Mr. Schafer'>
<re.Match object; span=(13, 21), match='Mr Smith'>
<re.Match object; span=(22, 30), match='Ms Davis'>
<re.Match object; span=(31, 44), match='Mrs. Robinson'>

<re.Match object; span=(1, 12), match='Mr. Schafer'>
<re.Match object; span=(13, 21), match='Mr Smith'>
<re.Match object; span=(22, 30), match='Ms Davis'>
<re.Match object; span=(31, 44), match='Mrs. Robinson'>
<re.Match object; span=(45, 50), match='Mr. T'>


In [89]:
emails= '''
CoreyMSchafer@gmail.com
corey.schafer@university.edu
corey-321-schafer@my-work.net
'''
pattern = re.compile(r"[a-zA-Z0-9.-]+@[a-zA-Z-]+\.(com|net|edu)")  # A-Z, not A-z: the range A-z would also admit [ \ ] ^ _ and `
matches = pattern.finditer(emails)
for match in matches:
    print(match)

<re.Match object; span=(1, 24), match='CoreyMSchafer@gmail.com'>
<re.Match object; span=(25, 53), match='corey.schafer@university.edu'>
<re.Match object; span=(54, 83), match='corey-321-schafer@my-work.net'>


In [84]:
urls = '''
https://www.google.com
http://coreyms.com
https://youtube.com
https://www.nasa.gov
'''
pattern = re.compile(r"https?://(www\.)?(\w+)(\.\w+)")  # group 1: (www\.)   group 2: (\w+)   group 3: (\.\w+)
matches = pattern.finditer(urls)
for match in matches:
    print(match.group(0))  # group 0 is the entire match
    print(match.group(1))  # group 1 is optional, so this is None when the URL has no "www."
    print(match.group(2))
    print(match.group(3))

https://www.google.com
www.
google
.com
http://coreyms.com
None
coreyms
.com
https://youtube.com
None
youtube
.com
https://www.nasa.gov
www.
nasa
.gov


In [86]:
subbed_urls = pattern.sub(r"\2\3", urls)  # \2 and \3 are backreferences: each match is replaced by the text of its groups 2 and 3
print(subbed_urls)
print(urls)


google.com
coreyms.com
youtube.com
nasa.gov


https://www.google.com
http://coreyms.com
https://youtube.com
https://www.nasa.gov
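
Backreferences in the replacement string can also reorder what was captured. A brief sketch with invented "Last, First" names, where `\2 \1` swaps the two groups:

```python
import re

names = "Anderson, Matthew\nJohanson, Yuri"

# (\w+) captures the last name as group 1 and the first name as group 2
swapped = re.sub(r"(\w+), (\w+)", r"\2 \1", names)
print(swapped)
# Matthew Anderson
# Yuri Johanson
```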



### re.sub(pattern, repl, string, count=0, flags=0): This method replaces all occurrences of the pattern in the string with the specified replacement string (repl). The count parameter limits the number of replacements made.

In [17]:
import re

pattern = r"apple"
string = "I have an apple and an apple pie. apple apple apple appleapple"

new_string = re.sub(pattern, "orange", string, count=6)
print(new_string)  # only the first 6 matches are replaced, so the final "apple" in "appleapple" is left unchanged

I have an orange and an orange pie. orange orange orange orangeapple
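
The effect of count is easier to compare on a shorter string. In this sketch (sample text assumed), the default count=0 replaces every match, while count=2 stops after the second one:

```python
import re

string = "apple apple apple"

all_replaced = re.sub(r"apple", "orange", string)         # default count=0: replace every match
first_two = re.sub(r"apple", "orange", string, count=2)   # replace only the first two matches

print(all_replaced)  # orange orange orange
print(first_two)     # orange orange apple
```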
