
# **Regular Expression**

A regular expression (regex) is a sequence of characters that specifies a search pattern. Such patterns are typically used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation. Regular expressions originate in theoretical computer science and formal language theory. Many programming languages provide regex capabilities, either built in or via libraries.

To use regular expressions in Python, we import the `re` module.

In [1]:
import re

When constructing a regex pattern in Python, we will often need to use a raw string. A raw string is created by prefixing a string literal with `r` or `R`, and it treats the backslash (`\`) as a literal character rather than as the start of an escape sequence. This matters because regex syntax uses backslashes heavily (for example `\d` or `\b`), and without a raw string Python would interpret some of them before the regex engine ever sees them.
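
A small sketch of the difference between a normal and a raw string literal. The variable names and the sample string are illustrative assumptions, not part of the original notebook:

```python
import re

# In a normal string literal, '\b' is the backspace character (one character).
normal = '\bHa'
# In a raw string the backslash is kept, so the regex engine sees \b (word boundary).
raw = r'\bHa'

print(len(normal))  # 3
print(len(raw))     # 4

# Only the raw version behaves as a word-boundary pattern.
print(re.findall(raw, 'Ha HaHa'))     # ['Ha', 'Ha']
print(re.findall(normal, 'Ha HaHa'))  # []
```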

In Python, we can precompile a pattern with the `re.compile()` function.

re.compile(pattern, flags=0)

Compile a regular expression pattern into a regular expression object, which can be used for matching using its match(), search(), finditer() and other methods.

The expression's behaviour can be modified by specifying a flags value. Flags such as `re.IGNORECASE`, `re.MULTILINE`, and `re.DOTALL` can be combined using bitwise OR (the `|` operator).
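
As a rough illustration of combining flags, the sketch below uses a made-up `sample` string (an assumption for this example only) and the standard `re.IGNORECASE` and `re.MULTILINE` flags:

```python
import re

sample = 'Start\nstart again\nSTART once more'

# re.IGNORECASE makes matching case-insensitive; re.MULTILINE lets ^ match
# at the start of every line instead of only at the start of the string.
pattern = re.compile(r'^start', flags=re.IGNORECASE | re.MULTILINE)

with_flags = [m.group(0) for m in pattern.finditer(sample)]
without_flags = re.findall(r'^start', sample)

print(with_flags)     # ['Start', 'start', 'START']
print(without_flags)  # []
```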

In [2]:
text_to_search = '''
abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
1234567890

Ha HaHa

MetaCharacters (Need to be escaped):
. ^ $ * + ? { } [ ] \ | ( )

coreyms.com

321-555-4321
123.555.1234
123*555*1234
800-555-1234
900-555-1234

Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
'''

Finding "abc" in our text using re.finditer()

re.finditer(pattern, string, flags=0)

Return an iterator yielding match objects over all non-overlapping matches for the RE pattern in string. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result.

Note that pattern matching is case sensitive.


In [3]:
pattern = re.compile(r'abc')

matches = pattern.finditer(text_to_search)

for match in matches:
  print(match)

<re.Match object; span=(1, 4), match='abc'>
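
Each match object carries the matched text and its position. The following sketch, which uses a short illustrative string rather than `text_to_search`, shows one way to read them with `span()` and `group(0)`:

```python
import re

sample = 'xxabcxxabc'  # illustrative string, not the notebook's text_to_search
pattern = re.compile(r'abc')

for match in pattern.finditer(sample):
    start, end = match.span()
    # group(0) is the matched text; slicing with the span gives the same thing.
    print(match.group(0), start, end, sample[start:end])

spans = [m.span() for m in pattern.finditer(sample)]
print(spans)                    # [(2, 5), (7, 10)]
print(pattern.findall(sample))  # ['abc', 'abc']
```

When only the matched strings are needed, `findall()` is a shorter alternative to looping over `finditer()`.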


What if we wanted to find occurrences of a literal period (`.`) in our text?
Notice how we prefix it with a backslash. This is because a period is a meta-character in regex, meaning it has a special purpose:

`.` (dot): in the default mode, this matches any character except a newline. If the DOTALL flag has been specified, it matches any character including a newline.

In [4]:
pattern = re.compile(r'\.')

matches = pattern.finditer(text_to_search)

for match in matches:
  print(match)

<re.Match object; span=(113, 114), match='.'>
<re.Match object; span=(149, 150), match='.'>
<re.Match object; span=(171, 172), match='.'>
<re.Match object; span=(175, 176), match='.'>
<re.Match object; span=(223, 224), match='.'>
<re.Match object; span=(254, 255), match='.'>
<re.Match object; span=(267, 268), match='.'>
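
For contrast with the escaped dot, here is a minimal sketch of what an unescaped dot does, with and without `re.DOTALL`. The `sample` string is an assumption made up for this example:

```python
import re

sample = 'a.b\nc'

# Unescaped dot: matches every character except the newline.
any_char = re.findall(r'.', sample)
# Escaped dot: matches only the literal period.
literal_dot = re.findall(r'\.', sample)
# With re.DOTALL, the unescaped dot also matches the newline.
with_dotall = re.findall(r'.', sample, flags=re.DOTALL)

print(any_char)     # ['a', '.', 'b', 'c']
print(literal_dot)  # ['.']
print(with_dotall)  # ['a', '.', 'b', '\n', 'c']
```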


A use case for escaping a dot might be finding a URL within a text.

In [5]:
pattern = re.compile(r'.......\.com')

matches = pattern.finditer(text_to_search)

for match in matches:
  print(match)

<re.Match object; span=(142, 153), match='coreyms.com'>


Finding a literal substring isn't too exciting, because that's something we can already do in Python with `str.find()`. Finding more complex patterns requires the meta-characters.

In [6]:
print(text_to_search.find('abc'))
print(text_to_search.find('.com'))

1
149


# **Meta Characters**

.       - Any Character Except New Line

\d      - Digit (0-9)

\D      - Not a Digit (0-9)

\w      - Word Character (a-z, A-Z, 0-9, _)

\W      - Not a Word Character

\s      - Whitespace (space, tab, newline)

\S      - Not Whitespace (space, tab, newline)
_______________________________________________
\b      - Word Boundary

\B      - Not a Word Boundary

^       - Beginning of a String

$       - End of a String
______________________________________________
[]      - Matches Characters in brackets

[^ ]    - Matches Characters NOT in brackets

|       - Either Or

( )     - Group
_____________________________________________

Quantifiers:

`*`       - 0 or More

`+`       - 1 or More

`?`       - 0 or One

`{3}`     - Exact Number

`{3,4}`   - Range of Numbers (Minimum, Maximum)


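A word boundary is a zero-width position between a word character (`\w`) and a non-word character, or at the start or end of the string next to a word character. It matches a position, not a character.
In our text we have a line that reads:

`Ha HaHa`

With the following pattern, we match the first two `Ha`'s because each is preceded by a word boundary (after a newline and after a space in this case).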

In [7]:
pattern = re.compile(r'\bHa')

matches = pattern.finditer(text_to_search)

for match in matches:
  print(match)

<re.Match object; span=(67, 69), match='Ha'>
<re.Match object; span=(70, 72), match='Ha'>


To match just the last `Ha`, we can use `\B`, which matches only at positions that are not a word boundary.

In [8]:
pattern = re.compile(r'\BHa')

matches = pattern.finditer(text_to_search)

for match in matches:
  print(match)

<re.Match object; span=(72, 74), match='Ha'>
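
One way to see where the boundaries fall is to substitute a visible marker at every `\b`. This is only an illustrative sketch on the single line `Ha HaHa`; the variable names are assumptions:

```python
import re

line = 'Ha HaHa'

# Replacing every \b with '|' makes the zero-width boundary positions visible.
marked = re.sub(r'\b', '|', line)
print(marked)  # |Ha| |HaHa|

# \bHa\b requires a boundary on both sides, so only the standalone word matches.
whole_words = [m.span() for m in re.finditer(r'\bHa\b', line)]
print(whole_words)  # [(0, 2)]
```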


To search for a pattern at the beginning of a string, we can use the `^` character.

Anything after `^` has to be at the start of the string.

In [9]:
sentence = 'Start a sentence and then bring it to an end'
pattern = re.compile(r'^St')

matches = pattern.finditer(sentence)

for match in matches:
  print(match)

<re.Match object; span=(0, 2), match='St'>


Similarly, to search the end of a string, we use `$`.

In [10]:
pattern = re.compile(r'end$')

matches = pattern.finditer(sentence)

for match in matches:
  print(match)

<re.Match object; span=(41, 44), match='end'>
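
A subtlety worth knowing: by default `$` also matches just before a trailing newline, while `\Z` matches only at the very end of the string. The sketch below uses a made-up string ending in a newline to show the difference:

```python
import re

with_newline = 'bring it to an end\n'

# $ matches at the very end of the string, and also just before a trailing newline.
dollar_matches = re.findall(r'end$', with_newline)
# \Z matches only at the very end of the string, so the trailing newline blocks it.
z_matches = re.findall(r'end\Z', with_newline)
# ^ anchors to the start of the whole string: 'it' is not at position 0.
caret_matches = re.findall(r'^it', with_newline)

print(dollar_matches)  # ['end']
print(z_matches)       # []
print(caret_matches)   # []
```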


# **Character sets**

As a starting point, the following pattern matches three digits, any single non-digit separator, three digits, another separator, and four digits. It matches every phone number in our text, whatever the separator.

In [11]:
pattern = re.compile(r'\d\d\d\D\d\d\d\D\d\d\d\d')

matches = pattern.finditer(text_to_search)

for match in matches:
  print(match)

<re.Match object; span=(155, 167), match='321-555-4321'>
<re.Match object; span=(168, 180), match='123.555.1234'>
<re.Match object; span=(181, 193), match='123*555*1234'>
<re.Match object; span=(194, 206), match='800-555-1234'>
<re.Match object; span=(207, 219), match='900-555-1234'>


We can specify a character set to match a specific group of characters. Let's use a character set to get phone numbers that are separated only by a `-` or a `.`.

Notice that it is not necessary to escape the `.` inside a character set, and that a `-` placed first or last in the set is treated as a literal hyphen.

Also, a character set matches only one character at a time, either a `-` or a `.`.

In [12]:
pattern = re.compile(r'\d\d\d[-.]\d\d\d[-.]\d\d\d\d')

matches = pattern.finditer(text_to_search)

for match in matches:
  print(match)

<re.Match object; span=(155, 167), match='321-555-4321'>
<re.Match object; span=(168, 180), match='123.555.1234'>
<re.Match object; span=(194, 206), match='800-555-1234'>
<re.Match object; span=(207, 219), match='900-555-1234'>


Matching a number that starts with `800` or `900`.

In [13]:
pattern = re.compile(r'[89]00[-.]\d\d\d[-.]\d\d\d\d')

matches = pattern.finditer(text_to_search)

for match in matches:
  print(match)

<re.Match object; span=(194, 206), match='800-555-1234'>
<re.Match object; span=(207, 219), match='900-555-1234'>


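The same pattern can be applied to text read from a file (here a `data.txt` that is not shown). Calling `.group(0)` on each match prints just the matched phone number instead of the whole match object.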

In [14]:
with open("data.txt", 'r') as f:
  contents = f.read()

  matches = pattern.finditer(contents)

  for match in matches:
    print(match.group(0))

800-555-5669
900-555-9340
800-555-6771
900-555-3205
800-555-6089
800-555-7100
900-555-5118
900-555-5428
800-555-8810
900-555-9598
800-555-2420
900-555-3567
800-555-3216
900-555-7755
800-555-1372
900-555-6426


The `-` within a character set can be used to specify a range. For example, `[a-z]` will match any one lowercase letter from a to z.

Similarly, `[A-Z]` will match any uppercase letter and `[1-9]` will match any digit from 1 to 9. If we wanted all lowercase and uppercase letters, we can specify `[a-zA-Z]`.

The `^` at the beginning of a character set negates the set. The set `[^a-zA-Z]` will match any character that is not a lowercase or uppercase letter.
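
A quick sketch of ranges and negation on a short made-up string (the `sample` value is an assumption chosen so that each set gives a different result):

```python
import re

sample = 'Ab3-0z'

lowercase = re.findall(r'[a-z]', sample)
letters = re.findall(r'[a-zA-Z]', sample)
# [1-9] does not include 0, so the '0' in the sample is skipped.
nonzero_digits = re.findall(r'[1-9]', sample)
# A leading ^ negates the set: everything that is not a letter.
not_letters = re.findall(r'[^a-zA-Z]', sample)

print(lowercase)       # ['b', 'z']
print(letters)         # ['A', 'b', 'z']
print(nonzero_digits)  # ['3']
print(not_letters)     # ['3', '-', '0']
```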

Matching cat, pat, mat, and excluding bat.

In [15]:
text = """
cat
pat
mat
bat
"""

pattern = re.compile(r'[^b]at')

matches = pattern.finditer(text)

for match in matches:
  print(match)

<re.Match object; span=(1, 4), match='cat'>
<re.Match object; span=(5, 8), match='pat'>
<re.Match object; span=(9, 12), match='mat'>


# **Quantifiers**

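We can use quantifiers to specify how many times the preceding element must repeat. In the pattern below, `\d{3}` replaces `\d\d\d`, and the unescaped `.` matches any separator character.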

In [16]:
pattern = re.compile(r'\d{3}.\d{3}.\d{4}')

matches = pattern.finditer(text_to_search)

for match in matches:
  print(match)

<re.Match object; span=(155, 167), match='321-555-4321'>
<re.Match object; span=(168, 180), match='123.555.1234'>
<re.Match object; span=(181, 193), match='123*555*1234'>
<re.Match object; span=(194, 206), match='800-555-1234'>
<re.Match object; span=(207, 219), match='900-555-1234'>
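
To compare the quantifier forms side by side, here is a minimal sketch on a made-up string of numbers. The word boundaries (`\b`) are added so that a run of digits is only matched as a whole:

```python
import re

sample = '7 42 555 1234 98765'

# {3}: exactly three digits between word boundaries.
exactly_three = re.findall(r'\b\d{3}\b', sample)
# {3,4}: at least three and at most four digits.
three_or_four = re.findall(r'\b\d{3,4}\b', sample)
# +: one or more digits, as many as possible.
one_or_more = re.findall(r'\d+', sample)

print(exactly_three)  # ['555']
print(three_or_four)  # ['555', '1234']
print(one_or_more)    # ['7', '42', '555', '1234', '98765']
```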


The `?` quantifier matches 0 or 1 occurrence of the preceding element, making it optional. Here it makes the period after `Mr` optional.

In [17]:
text = """
Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
"""

pattern = re.compile(r'Mr\.?')

matches = pattern.finditer(text)

for match in matches:
  print(match)

<re.Match object; span=(1, 4), match='Mr.'>
<re.Match object; span=(13, 15), match='Mr'>
<re.Match object; span=(31, 33), match='Mr'>
<re.Match object; span=(45, 48), match='Mr.'>


Let's match the Misters. Here we use `\s` for a single whitespace character and `[A-Z]` for the capital letter that starts the name. Since Mr. T has just one letter after `Mr.`, we use the `*` quantifier to allow 0 or more word characters (`\w`) after that capital.

In [18]:
text = """
Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
"""

pattern = re.compile(r'Mr\.?\s[A-Z]\w*')

matches = pattern.finditer(text)

for match in matches:
  print(match)

<re.Match object; span=(1, 12), match='Mr. Schafer'>
<re.Match object; span=(13, 21), match='Mr Smith'>
<re.Match object; span=(45, 50), match='Mr. T'>


# **Groups**

To match all of the names, we can use a group. We use parentheses to specify a group, and inside it we can use the alternation operator `|`, which means "either or". (In a regex, `|` is alternation, not Python's bitwise OR.) Here, we use it to specify `s`, or `r`, or `rs` after the `M`.

In [19]:
text = """
Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
"""

pattern = re.compile(r'M(s|r|rs)\.?\s[A-Z]\w*')  # \. so that only a literal period is optional

matches = pattern.finditer(text)

for match in matches:
  print(match)

<re.Match object; span=(1, 12), match='Mr. Schafer'>
<re.Match object; span=(13, 21), match='Mr Smith'>
<re.Match object; span=(22, 30), match='Ms Davis'>
<re.Match object; span=(31, 44), match='Mrs. Robinson'>
<re.Match object; span=(45, 50), match='Mr. T'>


Perhaps a more readable pattern is:

In [20]:
pattern = re.compile(r'(Ms|Mr|Mrs)\.?\s[A-Z]\w*')  # \. so that only a literal period is optional

matches = pattern.finditer(text)

for match in matches:
  print(match.group(0))

Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
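
Groups also let us pull out parts of each match. The sketch below is a variation on the pattern above, not the notebook's own cell: it assumes a second group around the surname so that title and surname can be read separately with `group(1)` and `group(2)`.

```python
import re

text = """
Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
"""

# Group 1 captures the title, group 2 the surname; \.? is an optional literal period.
pattern = re.compile(r'(Mr|Ms|Mrs)\.?\s([A-Z]\w*)')

pairs = [(m.group(1), m.group(2)) for m in pattern.finditer(text)]
for title, surname in pairs:
    print(title, surname)
```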


Matching emails. 

Here, we use a character set to match one or more characters within the set before the `@` symbol. Then we use another character set for one or more characters before reaching a literal `.`, and finally we use a group to match one of the top-level domains `com`, `edu`, or `net`.

In [27]:
emails = '''
CoreyMSchafer@gmail.com
corey.schafer@university.edu
corey-321-schafer@my-work.net
'''

pattern = re.compile(r'[a-zA-Z0-9.-]+@[a-zA-Z-]+\.(com|edu|net)')

matches = pattern.finditer(emails)

for match in matches:
  print(match.group(0))

CoreyMSchafer@gmail.com
corey.schafer@university.edu
corey-321-schafer@my-work.net


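Here is a more permissive email regex of the kind commonly found online. It also allows `_` and `+` before the `@`, digits in the domain, and any top-level domain.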

In [29]:
pattern = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')

matches = pattern.finditer(emails)

for match in matches:
    print(match.group(0))

CoreyMSchafer@gmail.com
corey.schafer@university.edu
corey-321-schafer@my-work.net
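
For input validation, where the whole string should be an email rather than merely contain one, `fullmatch()` is one option. The helper name `looks_like_email` below is hypothetical, and the pattern is a loose check, not a complete validator:

```python
import re

email_pattern = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')

def looks_like_email(candidate):
    """Return True if the entire string matches the pattern (hypothetical helper)."""
    return email_pattern.fullmatch(candidate) is not None

print(looks_like_email('corey.schafer@university.edu'))  # True
print(looks_like_email('see corey@x.com now'))           # False

# search() only needs the pattern to occur somewhere inside the string.
embedded = email_pattern.search('see corey@x.com now')
print(embedded.group(0))  # corey@x.com
```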


Matching URLs. Here we use groups to capture and later reference specific patterns in our match.

In [32]:
urls = '''
https://www.google.com
http://coreyms.com
https://youtube.com
https://www.nasa.gov
'''

pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')

matches = pattern.finditer(urls)

for match in matches:
    print(match.group(0))

https://www.google.com
http://coreyms.com
https://youtube.com
https://www.nasa.gov


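Reformatting the URLs using groups 2 and 3 of our pattern. Group 0 is the entire match, group 1 is the optional `www.`, group 2 is the domain name, and group 3 is the top-level domain. In the replacement string, `\2` and `\3` refer back to those groups.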

In [33]:
urls = '''
https://www.google.com
http://coreyms.com
https://youtube.com
https://www.nasa.gov
'''

pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')

subbed_urls = pattern.sub(r'\2\3', urls)

print(subbed_urls)


google.com
coreyms.com
youtube.com
nasa.gov
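
Numbered groups can get hard to track in longer patterns. As a possible alternative, Python supports named groups with `(?P<name>...)`, which can be referenced as `\g<name>` in the replacement. The group names `domain` and `tld` below are assumptions chosen for this sketch:

```python
import re

urls = '''
https://www.google.com
http://coreyms.com
'''

# Same structure as the pattern above, but groups 2 and 3 are given names.
pattern = re.compile(r'https?://(www\.)?(?P<domain>\w+)(?P<tld>\.\w+)')

subbed_urls = pattern.sub(r'\g<domain>\g<tld>', urls)
print(subbed_urls)

# Named groups can also be read from each match object.
domains = [m.group('domain') for m in pattern.finditer(urls)]
print(domains)  # ['google', 'coreyms']
```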

