## Regex Matching

A regular expression is just a pattern of characters that we use to perform a
search in a text.  
For example, the regular expression `the` means: the letter
`t`, followed by the letter `h`, followed by the letter `e`.
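
As a minimal sketch of a literal match, the snippet below searches a made-up sentence (the `sample` string is an assumption for illustration, not part of the notes) for the pattern `the`. Matching is case-sensitive by default, so the capitalised `The` is not found.

```python
import re

# Hypothetical sample text, used only for this illustration
sample = "The fat cat sat on the mat near the door."

# A literal pattern matches exactly those characters, in that order
literal_matches = re.findall(r"the", sample)
print(literal_matches)  # ['the', 'the']
```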

### Meta characters

Meta characters are the building blocks of regular expressions. A meta
character does not stand for itself; the regex engine interprets it in a
special way. Some meta characters change meaning when written inside square
brackets: for example, `^` negates a character class there, while `.` and `*`
become ordinary characters.
The meta characters are as follows:

`Meta character`      |      `Description`

- `.`        - Any Character Except New Line
- `\d`      - Digit (0-9)
- `\D`      - Not a Digit (0-9)
- `\w`      - Word Character (a-z, A-Z, 0-9, _)
- `\W`      - Not a Word Character
- `\s`      - Whitespace (space, tab, newline)
- `\S`      - Not Whitespace (space, tab, newline)

- `\b`      - Word Boundary
- `\B`      - Not a Word Boundary
- `^`        - Beginning of a String
- `$`        - End of a String

- `[]`      - Matches Characters in brackets
- `[^ ]`  - Matches Characters NOT in brackets
- `|`         - Either Or
- `\`         - Escapes the next character. This allows you to match reserved characters <code>[ ] ( ) { } . * + ? ^ $ \ &#124;</code>
- `( )`     - Group

Quantifiers:
- `*`           - 0 or More
- `+`           - 1 or More
- `?`           - 0 or One
- `{3}`      - Exact Number
- `{3,4}`  - Range of Numbers (Minimum, Maximum)
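
The quantifiers above can be tried out with a short sketch like the following. The input strings are invented for illustration; each quantifier applies to the token immediately before it.

```python
import re

# `*` allows zero or more of the preceding `a`
star = re.findall(r"ba*", "b ba baa")
# `+` requires at least one `a`
plus = re.findall(r"ba+", "b ba baa")
# `?` makes the preceding `u` optional
optional = re.findall(r"colou?r", "color colour")
# `{3}` requires exactly three digits (word boundaries keep longer runs out)
exact = re.findall(r"\b\d{3}\b", "12 123 1234")
# `{3,4}` accepts three or four digits
ranged = re.findall(r"\b\d{3,4}\b", "12 123 1234 12345")

print(star)      # ['b', 'ba', 'baa']
print(plus)      # ['ba', 'baa']
print(optional)  # ['color', 'colour']
print(exact)     # ['123']
print(ranged)    # ['123', '1234']
```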


<p>
    <a href="">
        <img src="Regex.png">
    </a>
</p>

### Import Regex Module for Python

In [2]:
import re

In [165]:
# Define some text for search using regex
text_to_search = '''abcdefghijklmnopqurtuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
1234567890
Ha HaHa
MetaCharacters (Need to be escaped):
. ^ $ * + ? { } [ ] \ | ( )
coreyms.com
321-555-4321
123.555.1234
123*555*1234
800-555-1234
900-555-1234
Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
\n new 
\nnew 
\tnew 
\anew 
+8801756-770501

cat
mat
bat

'''
sentence = 'Start a sentence and then bring it to an end'

### Raw String 

In a raw string (prefixed with `r`), the backslash is not treated as an escape
character, so sequences such as `\n` (newline) or `\t` (tab) stay as a literal
backslash followed by a letter.

In [9]:
print("\t Hello World")
print(r"\t Hello World") # Raw String

	 Hello World
\t Hello World
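
A small sketch of why raw strings matter for regex patterns, using invented inputs: in a normal string `"\b"` is the backspace character, while in a raw string it reaches the regex engine as the word-boundary token.

```python
import re

normal = "\n"   # one character: a newline
raw = r"\n"     # two characters: a backslash and the letter n
lengths = (len(normal), len(raw))
print(lengths)  # (1, 2)

# Without the r prefix, Python turns "\b" into a backspace before re sees it
without_raw = re.findall("\bcat", "cat concat")
# With the r prefix, re receives \b and treats it as a word boundary
with_raw = re.findall(r"\bcat", "cat concat")

print(without_raw)  # []
print(with_raw)     # ['cat']
```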


In [23]:
pattern = re.compile(r'Mr')

# matches = pattern.search(text_to_search)
matches = pattern.finditer(text_to_search)
# matches = pattern.findall(text_to_search)

# print(matches)
for match in matches:
    print(match)

<re.Match object; span=(216, 218), match='Mr'>
<re.Match object; span=(228, 230), match='Mr'>
<re.Match object; span=(246, 248), match='Mr'>
<re.Match object; span=(260, 262), match='Mr'>
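
The cell above leaves `search` and `findall` commented out. As a rough comparison of the three methods, here is a sketch on a shortened, hypothetical `sample` string rather than the full `text_to_search`.

```python
import re

# Hypothetical shortened sample, so the positions are easy to check by hand
sample = "Mr. Schafer\nMr Smith\nMs Davis"
pattern = re.compile(r"Mr")

# search() returns only the first match object (or None if nothing matches)
first = pattern.search(sample)
# findall() returns the matched strings as a list
all_strings = pattern.findall(sample)
# finditer() yields match objects, which also carry positions
spans = [m.span() for m in pattern.finditer(sample)]

print(first.span(), first.group())  # (0, 2) Mr
print(all_strings)                  # ['Mr', 'Mr']
print(spans)                        # [(0, 2), (12, 14)]
```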


In [81]:
# Each line overwrites `pattern`, so only the last assignment is used below.
# Comment out the later lines to try an earlier pattern.
pattern = re.compile(r'.')     # Any character except a newline
pattern = re.compile(r'\d')    # Digit (0-9)
pattern = re.compile(r'\w')    # Word character (a-z, A-Z, 0-9, _)
pattern = re.compile(r'\W')    # Not a word character
pattern = re.compile(r'\s')    # Whitespace (space, tab, newline)
pattern = re.compile(r'\S')    # Not whitespace
pattern = re.compile(r'\bRo')  # Word boundary: `Ro` at the start of a word
pattern = re.compile(r'\Bbi')  # Not a word boundary: `bi` inside a word


matches = pattern.finditer(text_to_search)


for match in matches:
    print(match)
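
The difference between `\b` and `\B` is easier to see on a short string. This sketch reuses the `Ha HaHa` line from `text_to_search` as a standalone string.

```python
import re

line = "Ha HaHa"

# \bHa: `Ha` only where a word starts
at_boundary = [m.span() for m in re.finditer(r"\bHa", line)]
# \BHa: `Ha` only where it is NOT at the start of a word
not_at_boundary = [m.span() for m in re.finditer(r"\BHa", line)]

print(at_boundary)      # [(0, 2), (3, 5)]
print(not_at_boundary)  # [(5, 7)]
```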


In [75]:
pattern = re.compile(r'\.')  # `\` is the escape character, so this matches a literal `.`

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

<re.Match object; span=(111, 112), match='.'>
<re.Match object; span=(146, 147), match='.'>
<re.Match object; span=(167, 168), match='.'>
<re.Match object; span=(171, 172), match='.'>
<re.Match object; span=(218, 219), match='.'>
<re.Match object; span=(249, 250), match='.'>
<re.Match object; span=(262, 263), match='.'>


In [56]:
pattern = re.compile(r'(\w).{1,9}\.com')  # One word character, then 1 to 9 of any character, then a literal `.com`

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

<re.Match object; span=(139, 150), match='coreyms.com'>


In [90]:
pattern = re.compile(r'^Start')  # `^` anchors the match to the start of the string
pattern = re.compile(r'end$')  # `$` anchors the match to the end of the string (this assignment is the one used)

matches = pattern.finditer(sentence)

for match in matches:
    print(match)

<re.Match object; span=(41, 44), match='end'>
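
As a brief sketch of how the anchors restrict a match, the code below runs both patterns on the same `sentence` and adds one hypothetical pattern, `^and`, whose text occurs in the sentence but not at its start.

```python
import re

sentence = 'Start a sentence and then bring it to an end'

starts = re.search(r'^Start', sentence)   # matches: the string begins with Start
ends = re.search(r'end$', sentence)       # matches: the string finishes with end
not_at_start = re.search(r'^and', sentence)  # `and` occurs, but not at position 0

print(starts.span())   # (0, 5)
print(ends.span())     # (41, 44)
print(not_at_start)    # None
```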


In [112]:
pattern = re.compile(r'[^b]at')  # Matches cat and mat but not bat: `^` inside [ ] negates the character class

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

<re.Match object; span=(284, 287), match='cat'>
<re.Match object; span=(288, 291), match='mat'>


Find the names in the text that follow a title such as `Mr.` or `Ms`:

In [124]:
pattern = re.compile(r'M[r|rs]\.?\s[A-Z]\w*')  # M      - matches M
                                                # [r|rs] - character class: ONE of `r`, `|` or `s`, so Mr or Ms but not Mrs
                                                # \.?    - an optional `.` after the title; `?` means 0 or 1
                                                # \s     - matches a whitespace character
                                                # [A-Z]  - matches any uppercase letter A-Z
                                                # \w*    - matches word characters; `*` means 0 or more

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

<re.Match object; span=(215, 226), match='Mr. Schafer'>
<re.Match object; span=(227, 235), match='Mr Smith'>
<re.Match object; span=(236, 244), match='Ms Davis'>
<re.Match object; span=(259, 264), match='Mr. T'>
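
Note that `Mrs. Robinson` is missing from the output above. Inside square brackets `|` is an ordinary character, so `[r|rs]` matches a single `r`, `|` or `s`. One possible fix, sketched here on a shortened copy of the names, is to use a group with alternation instead. The non-capturing group `(?:...)` is a choice made for this example, not something taken from the original notes.

```python
import re

# Shortened copy of the name lines from text_to_search
names = "Mr. Schafer\nMr Smith\nMs Davis\nMrs. Robinson\nMr. T\n"

# Original pattern: the character class consumes exactly one character
char_class = re.compile(r'M[r|rs]\.?\s[A-Z]\w*')
# Sketch of a fix: alternation inside a (non-capturing) group
grouped = re.compile(r'M(?:r|s|rs)\.?\s[A-Z]\w*')

with_class = [m.group() for m in char_class.finditer(names)]
with_group = [m.group() for m in grouped.finditer(names)]

print(with_class)  # ['Mr. Schafer', 'Mr Smith', 'Ms Davis', 'Mr. T']
print(with_group)  # ['Mr. Schafer', 'Mr Smith', 'Ms Davis', 'Mrs. Robinson', 'Mr. T']
```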


In [268]:
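text = '''&\nabc\t\a&\\section'''  # `\\s` is a literal backslash + s; a bare `\s` triggers an invalid-escape warning

pattern = re.compile(r'\n+\w*')  # One or more newlines followed by zero or more word characters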

matches = pattern.finditer(text)

for match in matches:
    print(match)

<re.Match object; span=(1, 5), match='\nabc'>


In [154]:
pattern = re.compile(r'\+880(\d){4}\-(\d){6}')  # Number in the form +880dddd-dddddd

pattern = re.compile(r'\d\d\d\-\d\d\d\-\d\d\d\d')  # ddd-ddd-dddd with `-` separators only
pattern = re.compile(r'\d\d\d[-.*]\d\d\d[-.*]\d\d\d\d')  # Separator may be `-`, `.` or `*`
pattern = re.compile(r'\d{3}[-.*]\d{3}[-.*]\d{4}')  # Same as the previous pattern, written with quantifiers (this one is used)



matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

<re.Match object; span=(150, 162), match='321-555-4321'>
<re.Match object; span=(163, 175), match='123.555.1234'>
<re.Match object; span=(176, 188), match='123*555*1234'>
<re.Match object; span=(189, 201), match='800-555-1234'>
<re.Match object; span=(202, 214), match='900-555-1234'>
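
The first pattern in the cell above is overwritten before it runs, so its result never appears. A minimal sketch of it on its own follows, written without the capturing groups and the unneeded `\-` escape. The `sample` string is a two-line excerpt chosen for illustration.

```python
import re

# Two lines taken from text_to_search
sample = "+8801756-770501\n321-555-4321\n"

# Literal +880, then four digits, a hyphen, and six digits
bd_pattern = re.compile(r'\+880\d{4}-\d{6}')
bd_numbers = bd_pattern.findall(sample)

print(bd_numbers)  # ['+8801756-770501']
```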


In [128]:
emails = '''
CoreyMSchafer@gmail.com
Corey.MSchafer@gmail.com
corey.schafer@university.edu
corey-321-schafer@my-work.net
'''

In [133]:
# pattern = re.compile(r'\w*\@gmail\.com')
pattern = re.compile(r'[A-Za-z0-9_.]+@gmail\.com')  # Gmail addresses only
pattern = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')  # Any address (overwrites the line above)

matches = pattern.finditer(emails)

for match in matches:
    print(match)

<re.Match object; span=(1, 24), match='CoreyMSchafer@gmail.com'>
<re.Match object; span=(25, 49), match='Corey.MSchafer@gmail.com'>
<re.Match object; span=(50, 78), match='corey.schafer@university.edu'>
<re.Match object; span=(79, 108), match='corey-321-schafer@my-work.net'>
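
For comparison, here is a sketch of what the narrower Gmail-only pattern from the cell above returns when it is not overwritten. It uses the same `emails` text.

```python
import re

emails = '''
CoreyMSchafer@gmail.com
Corey.MSchafer@gmail.com
corey.schafer@university.edu
corey-321-schafer@my-work.net
'''

# Letters, digits, underscore or dot before a literal @gmail.com
gmail_only = re.compile(r'[A-Za-z0-9_.]+@gmail\.com')
gmail_matches = gmail_only.findall(emails)

print(gmail_matches)  # ['CoreyMSchafer@gmail.com', 'Corey.MSchafer@gmail.com']
```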


### Check from a separate file - data.txt

In [103]:
with open('data.txt', 'r') as f:
    contents = f.read()

In [105]:
pattern = re.compile(r'\d{3}[-.*]\d{3}[-.*]\d{4}') # Find all phone numbers in the separate file

matches = pattern.finditer(contents)

for match in matches:
    print(match)

<re.Match object; span=(12, 24), match='615-555-7164'>
<re.Match object; span=(102, 114), match='800-555-5669'>
<re.Match object; span=(191, 203), match='560-555-5153'>
<re.Match object; span=(281, 293), match='900-555-9340'>
<re.Match object; span=(378, 390), match='714-555-7405'>
<re.Match object; span=(467, 479), match='800-555-6771'>
<re.Match object; span=(557, 569), match='783-555-4799'>
<re.Match object; span=(647, 659), match='516-555-4615'>
<re.Match object; span=(740, 752), match='127-555-1867'>
<re.Match object; span=(829, 841), match='608-555-4938'>
<re.Match object; span=(915, 927), match='568-555-6051'>
<re.Match object; span=(1003, 1015), match='292-555-1875'>
<re.Match object; span=(1091, 1103), match='900-555-3205'>
<re.Match object; span=(1180, 1192), match='614-555-1166'>
<re.Match object; span=(1269, 1281), match='530-555-2676'>
<re.Match object; span=(1355, 1367), match='470-555-2750'>
<re.Match object; span=(1439, 1451), match='800-555-6089'>
<re.Match object; spa

In [270]:
pattern = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+') # Find all email addresses

matches = pattern.finditer(contents)

for match in matches:
    print(match)

<re.Match object; span=(60, 85), match='davemartin@bogusemail.com'>
<re.Match object; span=(147, 175), match='charlesharris@bogusemail.com'>
<re.Match object; span=(235, 263), match='laurawilliams@bogusemail.com'>
<re.Match object; span=(325, 354), match='coreyjefferson@bogusemail.com'>
<re.Match object; span=(425, 453), match='jenniferwhite@bogusemail.com'>
<re.Match object; span=(517, 540), match='tomdavis@bogusemail.com'>
<re.Match object; span=(601, 629), match='neilpatterson@bogusemail.com'>
<re.Match object; span=(695, 724), match='laurajefferson@bogusemail.com'>
<re.Match object; span=(785, 812), match='mariajohnson@bogusemail.com'>
<re.Match object; span=(871, 899), match='michaelarnold@bogusemail.com'>
<re.Match object; span=(962, 989), match='michaelsmith@bogusemail.com'>
<re.Match object; span=(1049, 1076), match='robertstuart@bogusemail.com'>
<re.Match object; span=(1137, 1163), match='lauramartin@bogusemail.com'>
<re.Match object; span=(1225, 1253), match='barbaramartin@bo

In [269]:
urls = '''
https://www.google.com
http://coreyms.com
https://youtube.com
https://www.nasa.gov
'''

pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')

subbed_urls = pattern.sub(r'\2\3', urls)

print(subbed_urls)


google.com
coreyms.com
youtube.com
nasa.gov
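
The replacement string `\2\3` refers to the second and third parenthesised groups of the pattern. The sketch below inspects those groups directly for two of the URLs. Group 1 (`www\.`) is optional, so it is `None` when the URL has no `www.` prefix.

```python
import re

pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')

with_www = pattern.search('https://www.google.com')
without_www = pattern.search('http://coreyms.com')

# group(0) is the whole match; groups 1-3 follow the parentheses left to right
print(with_www.group(0))     # https://www.google.com
print(with_www.groups())     # ('www.', 'google', '.com')
print(without_www.groups())  # (None, 'coreyms', '.com')

# sub() with \2\3 keeps only the domain name and the top-level domain
shortened = pattern.sub(r'\2\3', 'http://coreyms.com')
print(shortened)             # coreyms.com
```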

