# Regular Expressions

- Regular expressions let us search for and match specific patterns of text

- **Raw String**
- A raw string is a string prefixed with an r, which tells Python not to handle backslashes in any special way
- Normally backslashes introduce escape sequences such as tabs (\t) or newlines (\n)
- We want the regular expression engine to interpret the strings we pass in, without Python changing them first

In [None]:
print('\tTab')
print(r'\tTab')
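
The same distinction matters once a pattern reaches the re module. As a small sketch (the sample text here is only illustrative), compare what happens to \b, which Python reads as a backspace character in a normal string but which the regex engine reads as a word boundary when it arrives untouched in a raw string:

```python
import re

text = 'Ha HaHa'

# In a normal string literal, '\b' is the backspace character (one character),
# so the regex engine never sees a word-boundary token.
plain = re.findall('\bHa', text)

# In a raw string, r'\b' is a backslash followed by 'b' (two characters),
# which the re module interprets as a word boundary.
raw = re.findall(r'\bHa', text)

print(len('\b'), len(r'\b'))  # 1 2
print(plain)                  # []
print(raw)                    # ['Ha', 'Ha']
```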

- . - Matches any character except a newline
- \d - Digit (0-9)
- \D - Not a digit (0-9)
- \w - Word character (a-z, A-Z, 0-9, _)
- \W - Not a word character
- \s - Whitespace (space, tab, newline)
- \S - Not whitespace (space, tab, newline)
- **Anchors**: They don't match any characters; instead they match invisible positions before or after characters. We can use them in conjunction with other patterns.
- \b - Word boundary (the position between a word character and a non-word character, or the start or end of the string)
- \B - Not a word boundary
- ^ - Beginning of a string
- $ - End of a string
- [] - Matches any one character listed in the brackets
- [^ ] - Matches any one character not listed in the brackets
- | - Either Or
- () - group
- **Quantifiers**
- '*' - Match its preceding element zero or more times.
- '+' - Match its preceding element one or more times.
- ? - Match its preceding element zero or one time.
- {3} - Exact number of repetitions
- {3,4} - Range of repetitions (minimum, maximum)
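
As a quick check of how the quantifiers differ, here is a minimal sketch. The sample strings and the helper name `matching` are made up for illustration; `re.fullmatch` is used so that a pattern has to cover the whole string. Note that `*` accepts zero repetitions, while `+` needs at least one.

```python
import re

samples = ['ct', 'cat', 'caat', 'caaat']

def matching(pattern):
    """Return the samples that the pattern matches in full (illustrative helper)."""
    return [s for s in samples if re.fullmatch(pattern, s)]

print(matching(r'ca*t'))      # ['ct', 'cat', 'caat', 'caaat']
print(matching(r'ca+t'))      # ['cat', 'caat', 'caaat']
print(matching(r'ca?t'))      # ['ct', 'cat']
print(matching(r'ca{2}t'))    # ['caat']
print(matching(r'ca{2,3}t'))  # ['caat', 'caaat']
```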

In [None]:
import re

In [None]:
text_to_search = '''
abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJJKLMNOPQRSTUVWXYZ
1234567890

Ha HaHa

abc

MetaCha_racters (Need to be escaped):
. ^ $ * + ? { } [ ] \ | ( )

coreyms.com

321-555-4321
123.555.1234
121*554*1534
331--555-4321
800-555-1234
900-555-1234

Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T

cat
mat
pat
bat

'''

sentence = 'Start a sentence and then  bring it to an end'

In [None]:
pattern = re.compile(r'abc')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)
# span holds the start and end index of the match;
# slicing the text with those indexes returns the matched characters
print(text_to_search[1:4])

In [None]:
pattern = re.compile(r'abc')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match.group())

In [None]:
pattern = re.compile(r'bac')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match.group())

In [None]:
pattern = re.compile(r'.')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

In [None]:
pattern = re.compile(r'\.')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match.group())

In [None]:
pattern = re.compile(r'coreyms\.com')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match.group())

In [None]:
pattern = re.compile(r'\d')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match.group())

In [None]:
pattern = re.compile(r'\D')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

In [None]:
pattern = re.compile(r'\w')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

In [None]:
pattern = re.compile(r'\W')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

In [None]:
pattern = re.compile(r'\s')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

In [None]:
pattern = re.compile(r'\S')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

In [None]:
pattern = re.compile(r'\bHa')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

In [None]:
pattern = re.compile(r'\BHa')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

In [None]:
pattern = re.compile(r'^Start')
matches = pattern.finditer(sentence)
for match in matches:
    print(match)

In [None]:
pattern = re.compile(r'^rt')
matches = pattern.finditer(sentence)
for match in matches:
    print(match)

In [None]:
pattern = re.compile(r'^S')
matches = pattern.finditer(sentence)
for match in matches:
    print(match)

In [None]:
pattern = re.compile(r'nd$')
matches = pattern.finditer(sentence)
for match in matches:
    print(match)

In [None]:
pattern = re.compile(r'end$')
matches = pattern.finditer(sentence)
for match in matches:
    print(match)

In [None]:
#Regular expression to match phone number
pattern = re.compile(r'\d\d\d.\d\d\d.\d\d\d\d')# . matches any character
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

In [None]:
#Regular expression to match phone number
pattern = re.compile(r'\d\d\d..\d\d\d.\d\d\d\d')# . matches any character
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

In [None]:
#Regular expression to match phone number
pattern = re.compile(r'\d\d\d[-.]\d\d\d[-.]\d\d\d\d')  # [-.] matches a literal dash or dot
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

In [None]:
#Regular expression to match phone number
pattern = re.compile(r'\d\d\d[*]\d\d\d[*]\d\d\d\d')  # [*] matches a literal asterisk
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

In [None]:
#Regular expression to match phone number
pattern = re.compile(r'[89]00[-.]\d\d\d[-.]\d\d\d\d')  # [89] matches an 8 or a 9
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

In [None]:
pattern = re.compile(r'\d\d\d.\d\d\d.\d\d\d\d')
# If you get a UnicodeDecodeError, use: with open('data.txt', 'r', encoding='utf-8') as f:
with open('data.txt', 'r') as f:
    contents = f.read()
    matches = pattern.finditer(contents)
    for match in matches:
        print(match)

In [None]:
pattern = re.compile(r'[89]00[-.]\d\d\d[-.]\d\d\d\d')
# If you get a UnicodeDecodeError, use: with open('data.txt', 'r', encoding='utf-8') as f:
with open('data.txt', 'r') as f:
    contents = f.read()
    matches = pattern.finditer(contents)
    for match in matches:
        print(match)

In [None]:
pattern = re.compile(r'[1-3]')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

In [None]:
pattern = re.compile(r'[a-f]')
matches = pattern.finditer(sentence)
for match in matches:
    print(match)

In [None]:
pattern = re.compile(r'[a-fM-T]')
matches = pattern.finditer(sentence)
for match in matches:
    print(match)

In [None]:
pattern = re.compile(r'[^a-fM-T]')
matches = pattern.finditer(sentence)
for match in matches:
    print(match)

In [None]:
pattern = re.compile(r'[^b]at')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

**Quantifiers**

In [None]:
pattern = re.compile(r'\d{3}.\d{3}.\d{4}')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

In [None]:
pattern = re.compile(r'Mr\.?\s[A-Z]\w*')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

In [None]:
#pattern = re.compile(r'(Mr|Mrs|Ms)\.?\s[A-Z]\w*')
pattern = re.compile(r'M(r|rs|s)\.?\s[A-Z]\w*')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

**Example**

In [None]:
import re
emails = '''
CoreyMSchafer@gmail.com
corey.schafer@university.edu
corey-321-schafer@my-work.net
'''
pattern = re.compile(r'[a-zA-Z]+@[a-zA-Z]+\.com')
matches = pattern.finditer(emails)
for match in matches:
    print(match)

In [None]:
import re
emails = '''
CoreyMSchafer@gmail.com
corey.schafer@university.edu
corey-321-schafer@my-work.net
'''
pattern = re.compile(r'[a-zA-Z.]+@[a-zA-Z]+\.(com|edu)')
matches = pattern.finditer(emails)
for match in matches:
    print(match)

In [None]:
import re
emails = '''
CoreyMSchafer@gmail.com
corey.schafer@university.edu
corey-321-schafer@my-work.net
'''
pattern = re.compile(r'[a-zA-Z0-9.-]+@[a-zA-Z-]+\.(com|edu|net)')
matches = pattern.finditer(emails)
for match in matches:
    print(match)

In [None]:
#Generic regex for emails
import re
emails = '''
CoreyMSchafer@gmail.com
corey.schafer@university.edu
corey-321-schafer@my-work.net
'''
pattern = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')
matches = pattern.finditer(emails)
for match in matches:
    print(match)

In [None]:
import re
urls = '''
https://www.google.com
http://coreyms.com
https://youtube.com
https://www.nasa.gov
'''
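# Caution: [^...] is a negated character set, so this excludes the single characters
# h, t, p, s, ?, :, /, (, w, . and ) rather than skipping the 'https://www.' prefix.
# It would drop leading letters of a domain that starts with one of those characters.
pattern = re.compile(r'[^https?://(www\.)?]\w+\.\w+')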
matches = pattern.finditer(urls)
for match in matches:
    print(match)

In [None]:
import re
urls = '''
https://www.google.com
http://coreyms.com
https://youtube.com
https://www.nasa.gov
'''
pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')
matches = pattern.finditer(urls)
for match in matches:
    print(match.group(0))

In [None]:
import re
urls = '''
https://www.google.com
http://coreyms.com
https://youtube.com
https://www.nasa.gov
'''
pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')
matches = pattern.finditer(urls)
for match in matches:
    print(match.group(1))

In [None]:
import re
urls = '''
https://www.google.com
http://coreyms.com
https://youtube.com
https://www.nasa.gov
'''
pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')
matches = pattern.finditer(urls)
for match in matches:
    print(match.group(2))

In [None]:
import re
urls = '''
https://www.google.com
http://coreyms.com
https://youtube.com
https://www.nasa.gov
'''
pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')
matches = pattern.finditer(urls)
for match in matches:
    print(match.group(3))

In [None]:
import re
urls = '''
https://www.google.com
http://coreyms.com
https://youtube.com
https://www.nasa.gov
'''
pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')
subbed_urls = pattern.sub(r'\2\3',urls)
print(subbed_urls)

**findall()**

In [None]:
pattern = re.compile(r'Mr\.?\s[A-Z]\w*')
matches = pattern.findall(text_to_search)
for match in matches:
    print(match)

In [None]:
import re
urls = '''
https://www.google.com
http://coreyms.com
https://youtube.com
https://www.nasa.gov
'''
pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')
matches = pattern.findall(urls)
for match in matches:
    print(match)
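
The shape of what findall returns depends on how many groups the pattern contains. The following sketch uses a shortened, made-up URL string to show the three cases: full matches with no groups, plain strings with one group, and tuples with several groups.

```python
import re

urls = 'https://youtube.com http://coreyms.com'

# No groups: findall returns the full matched strings.
no_groups = re.findall(r'https?://\w+\.\w+', urls)

# One group: findall returns only what that group captured.
one_group = re.findall(r'https?://(\w+)\.\w+', urls)

# Several groups: findall returns one tuple per match.
many_groups = re.findall(r'(https?)://(\w+)(\.\w+)', urls)

print(no_groups)    # ['https://youtube.com', 'http://coreyms.com']
print(one_group)    # ['youtube', 'coreyms']
print(many_groups)  # [('https', 'youtube', '.com'), ('http', 'coreyms', '.com')]
```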

**match**
- The match method determines whether the regular expression matches at the beginning of the string
- match doesn't return an iterable like finditer or findall
- It returns the first match if there is one; if there isn't a match, it returns None

In [None]:
pattern = re.compile(r'Start')
matches = pattern.match(sentence)
print(matches)

In [None]:
pattern = re.compile(r'sentence')
matches = pattern.match(sentence)
print(matches)

**Search**
- The search method scans the whole string and returns the first match found anywhere in it, or None if there is no match

In [None]:
pattern = re.compile(r'sentence')
matches = pattern.search(sentence)
print(matches)

In [None]:
pattern = re.compile(r'ten')
matches = pattern.search(sentence)
print(matches)

In [None]:
pattern = re.compile(r'den')
matches = pattern.search(sentence)
print(matches)
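
To see match and search side by side, here is a minimal sketch that reuses the sentence from above. Because both methods can return None, it checks the result before calling a method such as group() on it.

```python
import re

sentence = 'Start a sentence and then  bring it to an end'
pattern = re.compile(r'sentence')

# match only looks at index 0, so this is None
result = pattern.match(sentence)
if result is None:
    print('no match at the start')
else:
    print(result.group())

# search scans the whole string and finds the word further in
found = pattern.search(sentence)
if found is not None:
    print(found.span())  # (8, 16)
```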

**Flags**

In [None]:
pattern = re.compile(r'start',re.IGNORECASE)
matches = pattern.search(sentence)
print(matches)

In [None]:
pattern = re.compile(r'start',re.I)
matches = pattern.search(sentence)
print(matches)
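
Flags can also be combined with the | operator. As an illustrative sketch (the three-line sample text is made up), re.MULTILINE changes the ^ anchor so that it matches at the start of every line rather than only at the start of the string:

```python
import re

lines = 'Start here\nstart again\nthe end'

# Without MULTILINE, ^ only matches at the very beginning of the string.
single = re.findall(r'^start', lines, re.IGNORECASE)

# With MULTILINE added, ^ also matches right after each newline.
multi = re.findall(r'^start', lines, re.IGNORECASE | re.MULTILINE)

print(single)  # ['Start']
print(multi)   # ['Start', 'start']
```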