# Regular Expressions

# Tasks today:
1) <b>Importing</b> <br>
2) <b>Using Regular Expressions</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) re.compile() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) re.match() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; c) re.findall() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; d) re.search() <br>
3) <b>Sets</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) Integer Ranges <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) Character Ranges <br>
4) <b>Counting Occurrences</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) {x} <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) {x,y} <br>
 &nbsp;&nbsp;&nbsp;&nbsp; c) ? <br>
 &nbsp;&nbsp;&nbsp;&nbsp; d) * <br>
 &nbsp;&nbsp;&nbsp;&nbsp; e) + <br>
5) <b>In-Class Exercise #1</b> <br>
6) <b>Escaping Characters</b> <br>
7) <b>Grouping</b> <br>
8) <b>In-Class Exercise #2</b> <br>
9) <b>Opening a File</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) open() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) with open() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; c) re.match() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; d) re.search() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; e) Store the String in a Variable <br>
10) <b>Regex Project</b> <br>

### Importing <br>
<p>Regular expressions are available in most programming languages. In Python, they are provided by the built-in module 're'.</p>

In [1]:
# import re
import re

In [8]:
help(re)

Help on module re:

NAME
    re - Support for regular expressions (RE).

MODULE REFERENCE
    https://docs.python.org/3.9/library/re
    
    The following documentation is automatically generated from the Python
    source files.  It may be incomplete, incorrect or include features that
    are considered implementation detail and may vary between Python
    implementations.  When in doubt, consult the module reference at the
    location listed above.

DESCRIPTION
    This module provides regular expression matching operations similar to
    those found in Perl.  It supports both 8-bit and Unicode strings; both
    the pattern and the strings being processed can contain null bytes and
    characters outside the US ASCII range.
    
    Regular expressions can contain both special and ordinary characters.
    Most ordinary characters, like "A", "a", or "0", are the simplest
    regular expressions; they simply match themselves.  You can
    concatenate ordinary characters, so last mat

### Using Regular Expressions <br>
<p>Regular expressions give us the ability to search for patterns within text, strings, files, and so on. They serve several uses, such as security measures, searching, filtering, and pattern recognition.</p>

##### re.compile()

In [2]:
# compile() builds a reusable Pattern object from the pattern string, so the same regular expression can be passed to several methods
pattern = re.compile('abcd')


In [9]:
help(re.compile)

Help on function compile in module re:

compile(pattern, flags=0)
    Compile a regular expression pattern, returning a Pattern object.
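
As a minimal sketch of what compiling buys you, the cell below compares a compiled pattern with the equivalent module-level call. The sample text and the variable names `compiled_result` and `direct_result` are illustrative only.

```python
import re

# Compile once, then reuse the Pattern object as often as needed.
pattern = re.compile('abcd')
compiled_result = pattern.findall('abcd 123 abcd')

# The module-level function takes the pattern string as its first argument instead.
direct_result = re.findall('abcd', 'abcd 123 abcd')

print(compiled_result)
print(direct_result)
print(compiled_result == direct_result)
```

Both calls should give the same list; compiling mainly helps when the same pattern is applied many times.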



##### re.match()

In [20]:
# span() returns the start and end indexes of the match as a single tuple.
# Since match() only checks whether the RE matches at the start of a string, start() will always be zero.
# The search() method scans through the whole string, so its match may not start at zero.

# pattern.match('123abcd123') would return None, because 'abcd' is not at the start of that string

match = pattern.match('abcd123')
print(match)

# Accessing the span of the match
span = match.span()
print(span)

<re.Match object; span=(0, 4), match='abcd'>
(0, 4)


In [10]:
help(re.match)

Help on function match in module re:

match(pattern, string, flags=0)
    Try to apply the pattern at the start of the string, returning
    a Match object, or None if no match was found.
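
Because `match()` returns `None` when the pattern is not found at the start, calling `.span()` on the result without checking would raise an `AttributeError`. The sketch below shows one way to guard against that; the two sample strings are the ones mentioned in the cell above.

```python
import re

pattern = re.compile('abcd')

for text in ['abcd123', '123abcd123']:
    match = pattern.match(text)
    if match is None:
        # The pattern does not occur at index 0 of this string.
        print(f'{text!r}: no match at the start')
    else:
        print(f'{text!r}: matched at {match.span()}')
```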



##### re.findall()

In [4]:
# The findall() method searches for all occurrences that match a given pattern and returns them as a list.
# In contrast, the search() method only returns the first occurrence that matches the pattern.

finders = pattern.findall('123abcd abcd123 abcd abcabc acb')
print(finders)

['abcd', 'abcd', 'abcd']


In [15]:
help(re.findall)

Help on function findall in module re:

findall(pattern, string, flags=0)
    Return a list of all non-overlapping matches in the string.
    
    If one or more capturing groups are present in the pattern, return
    a list of groups; this will be a list of tuples if the pattern
    has more than one group.
    
    Empty matches are included in the result.
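
The help text above mentions that capturing groups change what `findall()` returns. Here is a small sketch of the three cases, using a made-up sample string:

```python
import re

text = 'x=1, y=22, z=333'

# No groups: each result is the whole match.
whole = re.findall(r'[a-z]=\d+', text)

# One group: each result is just that group's text.
one_group = re.findall(r'[a-z]=(\d+)', text)

# Two groups: each result is a tuple with one entry per group.
two_groups = re.findall(r'([a-z])=(\d+)', text)

print(whole)
print(one_group)
print(two_groups)
```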



##### re.search()

In [25]:
# search(pattern, string, flags=0) scans through the string and returns the first match of the pattern

random_string = '123 123 234 abcd abc'
search = pattern.search(random_string)
print(search)

span = search.span()
print(span)

# abcd
print(random_string[span[0] : span[1]])

# abcd
print(random_string[12:16])

#(12, 16) <class 'tuple'>
print(span, type(span))

# 12 16
print(span[0], span[1])

# a b c d
print(random_string[12],random_string[13],random_string[14],random_string[15])

<re.Match object; span=(12, 16), match='abcd'>
(12, 16)
abcd
abcd
(12, 16) <class 'tuple'>
12 16
a b c d


In [16]:
help(re.search)

Help on function search in module re:

search(pattern, string, flags=0)
    Scan through string looking for a match to the pattern, returning
    a Match object, or None if no match was found.
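
Since `search()` stops at the first hit, a common follow-up question is how to get the position of every hit. One option, sketched below on an illustrative string, is `finditer()`, which yields one Match object per occurrence.

```python
import re

pattern = re.compile('abcd')
text = '123 abcd abcd'

# search() reports only the first occurrence.
first_span = pattern.search(text).span()

# finditer() yields a Match object for every non-overlapping occurrence.
all_spans = [m.span() for m in pattern.finditer(text)]

print(first_span)
print(all_spans)
```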



### Sets <br>
<p>A set, written in square brackets, matches any single character from the listed characters or range, such as the numbers 1 through 4 with [1-4].</p>

##### [a-z] or [A-Z] - any single lowercase/uppercase letter from a to z<br/>[^2] - any single character that's not 2

##### Integer Ranges

In [37]:
pattern_int = re.compile('[0-7][7-9][0-3]')
random_ints = '67383'

# find three digits directly next to each other that meet the 0-7;7-9;0-3 pattern

random_numbers = pattern_int.search(random_ints) # stops after finding first match 
span_int = random_numbers.span() # Accessing the span

# (0, 3)
print(span_int)

# 673
print(random_ints[span_int[0]:span_int[1]]) # The span holds the start and end indexes (0, 3), so this slice prints only the matched digits


find_randoms = pattern_int.findall('673772') # finds all instances of the pattern

# ['673', '772']
print(find_randoms)

# 673
print(find_randoms[0])

# 772
print(find_randoms[1])

(0, 3)
673
['673', '772']
673
772


##### Character Ranges

In [7]:
char_pattern = re.compile('[A-Z][a-z]')

# Search through a string that has an uppercase letter and lowercase letter
# directly next to each other

found = char_pattern.findall('Hello There Mr.Anderson')
print(found)

['He', 'Th', 'Mr', 'An']
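
The negated set `[^2]` from the heading above is not exercised in these cells, so here is a short sketch of it next to a combined range. The sample strings are invented for illustration.

```python
import re

# [^2] matches any single character except '2'.
not_two = re.findall('[^2]', '12321')

# Ranges can be combined inside one set: any run of ASCII letters.
letters = re.findall('[A-Za-z]+', 'abc123DEF')

print(not_two)
print(letters)
```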


### Counting Occurrences

##### {x} - something that occurs exactly x times

In [8]:
char_pattern_count = re.compile('[A-Z][a-z][0-3]{2}')

found_count = char_pattern_count.findall('Hello Mr. An12derson')
print(found_count)

['An12']


##### {x,y} - something that occurs between x and y times

In [9]:
random_pattern = re.compile('m{1,5}')
random_statement = random_pattern.findall("This is an example of a regular\
                                          expression trying to find one m,\
                                          more than one mmm or five mmmmm's")
print(random_statement)

['m', 'm', 'm', 'mmm', 'mmmmm']
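
The upper bound can also be left out: `{x,}` means "at least x times". A minimal sketch on a made-up string, alongside the exact count `{3}` for comparison:

```python
import re

text = 'm mm mmm mmmmmm'

# {2,} has no upper bound: runs of two or more m's.
at_least_two = re.findall('m{2,}', text)

# {3} takes exactly three at a time, so the run of six splits into two matches.
exactly_three = re.findall('m{3}', text)

print(at_least_two)
print(exactly_three)
```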


##### ? - something that occurs 0 or 1 time

In [39]:
#  "?"      Matches 0 or 1 (greedy) of the preceding RE.
#                   Greedy means that it will match as many repetitions as possible.

pattern = re.compile('Mrss?')

found_pat = pattern.findall('Hello M there Mr.Anderson, Mid how is Mrss.Anderson')
print(found_pat)

['Mrss']


##### * - something that occurs at least 0 times

In [42]:
# "*"      Matches 0 or more (greedy) repetitions of the preceding RE.

pattern_m = re.compile('M*s')

test = ["MMMs"]
found_test = pattern_m.findall(test[0])
print(found_test)

found_m = pattern_m.findall('MMMs name is Ms.Smith. This is Mssssss')
print(found_m)

['MMMs']
['MMMs', 's', 'Ms', 's', 's', 'Ms', 's', 's', 's', 's', 's']


##### + - something that occurs at least once

In [44]:
# "+"      Matches 1 or more (greedy) repetitions of the preceding RE.
pattern_again = re.compile('M+s+')

found_patt = pattern_again.findall('My name is Mss.Smith. This is MMMMMMsssss')
print(found_patt)

['Mss', 'MMMMMMsssss']
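
The comments above describe `?`, `*`, and `+` as greedy. To make that concrete, the sketch below contrasts a greedy `+` with its lazy form `+?` (a `?` placed after a quantifier makes it match as few repetitions as possible). The HTML-like sample string is only an illustration.

```python
import re

html = '<b>bold</b> and <i>italic</i>'

# Greedy: .+ grabs as much as it can, so the match runs to the last '>'.
greedy = re.findall('<.+>', html)

# Lazy: .+? stops at the first '>' it can, giving one match per tag.
lazy = re.findall('<.+?>', html)

print(greedy)
print(lazy)
```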


##### In-class exercise 1: 

Use a regular expression to find every number in the given string

In [51]:
import re
my_string = "This string has 10909090 numbers, but it is only 1 string. I hope you solve this 2day."

# Output: ['10909090', '1', '2']

match_string = re.compile('[0-9]+')
found_match = match_string.findall(my_string)

print(found_match)

['10909090', '1', '2']


In [62]:
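# Regular expressions use the backslash character ('\') to indicate special forms,
#      or to allow special characters to be used without invoking their special meaning

# e.g. "."      Matches any character except a newline, while "\." matches a literal period.

# A raw string (r'...') keeps Python from treating the backslash as its own escape character
pattern = re.compile(r'Mr?s?\.?')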

found_pat = pattern.findall("Hello M. there Mr.Anderson. Mid. how is Mrs.s.Anderson mrrrrss..")
print(found_pat)

['M.', 'Mr.', 'M', 'Mrs.']
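
To see the difference escaping makes, this sketch runs the same search with an unescaped and an escaped period. `re.escape()` is a standard helper that adds the backslashes for you; the file names are invented sample data.

```python
import re

text = 'file.txt fileAtxt'

# Unescaped '.' is a wildcard, so it also matches the 'A'.
wildcard = re.findall('file.txt', text)

# Escaped '\.' only matches a literal period.
literal = re.findall(r'file\.txt', text)

# re.escape() builds the escaped pattern from plain text.
escaped_pattern = re.escape('file.txt')

print(wildcard)
print(literal)
print(escaped_pattern)
```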


### Escaping Characters

##### \w - look for any word character (Unicode letters, digits, and underscore)<br/>\W - look for anything that isn't a word character

[History on Unicode](http://unicode.org/standard/WhatIsUnicode.html)

[More on Unicode Characters](https://en.wikipedia.org/wiki/List_of_Unicode_characters)

In [63]:
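pattern_1 = re.compile(r'[\w]+') # Find runs of word characters (whole words)
pattern_3 = re.compile(r'[\w]')  # Find single word characters

pattern_2 = re.compile(r'[\W]+') # Find runs of characters that are not word characters
pattern_4 = re.compile(r'[\W]')  # Find single non-word characters

# The + groups consecutive matching characters into one result; without it, each character is returned on its own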
found_1 = pattern_1.findall("This is a sentance. With an, exclamation mark at the end!")
print(found_1)

found_3 = pattern_3.findall("This is a sentance. With an, exclamation mark at the end!")
print(found_3)

found_2 = pattern_2.findall("This is a sentance. With an, exclamation mark at the end!")
print(found_2)

found_4 = pattern_4.findall("This is a sentance. With an, exclamation mark at the end!")
print(found_4)

['This', 'is', 'a', 'sentance', 'With', 'an', 'exclamation', 'mark', 'at', 'the', 'end']
['T', 'h', 'i', 's', 'i', 's', 'a', 's', 'e', 'n', 't', 'a', 'n', 'c', 'e', 'W', 'i', 't', 'h', 'a', 'n', 'e', 'x', 'c', 'l', 'a', 'm', 'a', 't', 'i', 'o', 'n', 'm', 'a', 'r', 'k', 'a', 't', 't', 'h', 'e', 'e', 'n', 'd']
[' ', ' ', ' ', '. ', ' ', ', ', ' ', ' ', ' ', ' ', '!']
[' ', ' ', ' ', '.', ' ', ' ', ',', ' ', ' ', ' ', ' ', ' ', '!']
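
Note that `\w` covers more than letters: digits and the underscore count as word characters too. A small sketch with an invented string:

```python
import re

text = 'user_name42 said: hi!'

# Digits and underscores are part of the "word".
words = re.findall(r'\w+', text)

# Everything else (spaces and punctuation) falls under \W.
separators = re.findall(r'\W+', text)

print(words)
print(separators)
```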


##### \d - look for any digit 0-9<br/>\D - look for anything that isn't a digit

In [15]:
pattern_nums = re.compile(r'\d{1,2}[\w]{2}')

found_date = pattern_nums.findall("Today is the 7th, in 20days it will be the 27th. 3rd,1st,30th")
print(found_date)

['7th', '20da', '27th', '3rd', '1st', '30th']


##### \s - look for any white space<br/>\S - look for anything that isn't whitespace

In [16]:
pattern_no_space = re.compile(r'\S')
pattern_space = re.compile(r'\s+')

found_dark = pattern_no_space.findall('Are you afraid of the dark?')
print(found_dark)

found_space = pattern_space.findall('Are you afraid of the dark?')
print(found_space)

['A', 'r', 'e', 'y', 'o', 'u', 'a', 'f', 'r', 'a', 'i', 'd', 'o', 'f', 't', 'h', 'e', 'd', 'a', 'r', 'k', '?']
[' ', ' ', ' ', ' ', ' ']


##### \b - look for boundaries or edges of a word<br/>\B - look for anything that isn't a boundary

In [17]:
pattern_bound = re.compile(r'\bTheCodingTemple\b')
pattern_bound_none = re.compile(r'\BTheCodingTemple\B')

found_bound = pattern_bound.findall("TheCodingTemple")
print(found_bound)

no_found_bound = pattern_bound_none.findall("fgTheCodingTempledsf")
print(no_found_bound)

['TheCodingTemple']
['TheCodingTemple']
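
A typical use of `\b` is matching a whole word without also matching it inside longer words. The sketch below uses a made-up sentence, and also shows why the raw string matters here: in a normal Python string, `'\b'` is the backspace character.

```python
import re

text = 'cat concatenate cat.'

anywhere = re.findall('cat', text)          # every occurrence
whole_word = re.findall(r'\bcat\b', text)   # only standalone words
inside_word = re.findall(r'\Bcat\B', text)  # only when buried inside a word

print(anywhere)
print(whole_word)
print(inside_word)

# Without the r prefix, Python turns '\b' into a single backspace character.
print('\b' == '\x08')
```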


### Grouping

In [18]:
my_string_again = "Max Smith, aaron rodgers, Sam Darnold,LeBron James, Michael Jordan, Kevin Durant, Patrick McCormick"

# Group of names regular expression compiler
pattern_name = re.compile('([A-Z][A-Za-z]+) ([A-Z][A-Za-z]+)')

found_names = pattern_name.findall(my_string_again)
print(found_names)

# loop over each comma-separated name and print the first captured group (the first name)
for name in my_string_again.split(','):
    match = pattern_name.search(name)
    
    if match:
        print(match.group(1))
    else:
        print("Not a name")

[('Max', 'Smith'), ('Sam', 'Darnold'), ('LeBron', 'James'), ('Michael', 'Jordan'), ('Kevin', 'Durant'), ('Patrick', 'McCormick')]
Max
Not a name
Sam
LeBron
Michael
Kevin
Patrick
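
Groups can also be given names with `(?P<name>...)`, which can read more clearly than numeric indexes. This is a sketch of the same name pattern with the hypothetical group names `first` and `last`:

```python
import re

# Same pattern as above, but each group is named.
named_pattern = re.compile(r'(?P<first>[A-Z][A-Za-z]+) (?P<last>[A-Z][A-Za-z]+)')

match = named_pattern.search('Sam Darnold')

print(match.group(0))        # the whole match
print(match.group('first'))  # same as match.group(1)
print(match.group('last'))   # same as match.group(2)
print(match.groups())        # all groups as a tuple
```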


##### In-class Exercise 2:

Write a function using regular expressions to find the domain name in the given email addresses (and return None for the invalid email addresses)<br><b>HINT: Use '|' for either or</b>

In [19]:
my_emails = ["jordanw@codingtemple.orgcom", "pocohontas1776@gmail.com", "helloworld@aol..com",
             "yourfavoriteband@g6.org", "@codingtemple.com"]

# You can also use the $ at the end of your compile expression -- this anchors the match to the end of the string

#.com OR .org => com|org

#Expected output:
#None
#pocohontas1776@gmail.com
#None
#yourfavoriteband@g6.org
#None
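
One possible approach is sketched below; it is not the only valid answer. The pattern and the function name `validate_email` are assumptions made for illustration, and `fullmatch()` is used so that the whole string has to fit the pattern (similar in effect to anchoring with `$`). Following the expected output above, it returns the full address rather than only the domain.

```python
import re

my_emails = ["jordanw@codingtemple.orgcom", "pocohontas1776@gmail.com", "helloworld@aol..com",
             "yourfavoriteband@g6.org", "@codingtemple.com"]

# Assumed shape: a non-empty local part, '@', one domain label, then '.com' or '.org'.
email_pattern = re.compile(r'[\w.+-]+@\w+\.(?:com|org)')

def validate_email(email):
    """Return the email if the entire string fits the pattern, otherwise None."""
    return email if email_pattern.fullmatch(email) else None

for email in my_emails:
    print(validate_email(email))
```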







### Opening a File <br>
<p>Python gives us a couple of ways to open and read files; below are the two used most often.</p>

##### open()

In [20]:
file = open("files/names.txt")

data = file.read()

print(data)

file.close()

Hawkins, Derek	derek@codingtemple.com	(555) 555-5555	Teacher, Coding Temple	@derekhawkins
Stanton, Brian	brians@codingtemple.com	(555) 555-5554	Teacher, Coding Temple
Johnson, Joe	joejohnson@codingtemple.com		Johson, Joe
Osterberg, Sven-Erik	governor@norrbotten.co.se		Governor, Norrbotten	@sverik
, Tim	tim@killerrabbit.com		Enchanter, Killer Rabbit Cave
Butz, Ryan	ryanb@codingtemple.com	(555) 555-5543	CEO, Coding Temple	@ryanbutz
Doctor, The	doctor+companion@tardis.co.uk		Time Lord, Gallifrey
Exampleson, Example	me@example.com	555-555-5552	Example, Example Co.	@example
Pael, Ripal	ripalp@codingtemple.com	(555) 555-5553	Teacher, Coding Temple	@ripalp
Vader, Darth	darth-vader@empire.gov	(555) 555-4444	Sith Lord, Galactic Empire	@darthvader
Fernandez de la Vega Sanz, Maria Teresa	mtfvs@spain.gov		First Deputy Prime Minister, Spanish Gov



In [21]:
# file.read()

##### with open()

In [22]:
with open('files/names.txt') as file:
    data = file.read()
    print(data)

Hawkins, Derek	derek@codingtemple.com	(555) 555-5555	Teacher, Coding Temple	@derekhawkins
Stanton, Brian	brians@codingtemple.com	(555) 555-5554	Teacher, Coding Temple
Johnson, Joe	joejohnson@codingtemple.com		Johson, Joe
Osterberg, Sven-Erik	governor@norrbotten.co.se		Governor, Norrbotten	@sverik
, Tim	tim@killerrabbit.com		Enchanter, Killer Rabbit Cave
Butz, Ryan	ryanb@codingtemple.com	(555) 555-5543	CEO, Coding Temple	@ryanbutz
Doctor, The	doctor+companion@tardis.co.uk		Time Lord, Gallifrey
Exampleson, Example	me@example.com	555-555-5552	Example, Example Co.	@example
Pael, Ripal	ripalp@codingtemple.com	(555) 555-5553	Teacher, Coding Temple	@ripalp
Vader, Darth	darth-vader@empire.gov	(555) 555-4444	Sith Lord, Galactic Empire	@darthvader
Fernandez de la Vega Sanz, Maria Teresa	mtfvs@spain.gov		First Deputy Prime Minister, Spanish Gov



In [23]:
# file.read()

##### re.match()

In [24]:
print(re.match(r"Hawkins, Derek", data))
print(data[0:14])

<re.Match object; span=(0, 14), match='Hawkins, Derek'>
Hawkins, Derek


##### re.search()

In [25]:
print(re.search(r"ripalp@codingtemple\.com", data))
print(data[588:611])

<re.Match object; span=(588, 611), match='ripalp@codingtemple.com'>
ripalp@codingtemple.com


##### Store the String in a Variable

In [26]:
answer = input("What would you like to search for?...")

# The user's text is used directly as the regular expression pattern
found = re.findall(answer, data)

if found:
    print(f"I found your data: {found}")
else:
    print("It's a no from me boss...")

What would you like to search for?...Stanton
I found your data: ['Stanton']
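
When the search text comes from a user, any special characters in it are interpreted as regex syntax. If a literal search is what you want, `re.escape()` can help. The sketch below uses a fixed string in place of `input()` and a one-line stand-in for `data`, both assumptions made so that the cell runs on its own.

```python
import re

# Stand-in for one line of the file contents.
data = 'Doctor, The\tdoctor+companion@tardis.co.uk'

# Stand-in for what a user might type at the input() prompt.
answer = 'doctor+companion'

# Here '+' is read as a quantifier, so the literal text is not found.
raw_hits = re.findall(answer, data)

# re.escape() makes every special character match literally.
safe_hits = re.findall(re.escape(answer), data)

print(raw_hits)
print(safe_hits)
```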


### In-Class Exercise #3 <br>
<p>Print each person's name and Twitter handle using groups. The output should look like:</p>
<p>==============<br>
   Full Name / Twitter<br>
   ==============</p>
Derek Hawkins / @derekhawkins

 Sven-Erik Osterberg / @sverik

 Ryan Butz / @ryanbutz

 Example Exampleson / @example

 Ripal Pael / @ripalp

 Darth Vader / @darthvader

In [27]:
with open("files/names.txt") as file:
    data = file.readlines()
    print(len(data)) # indexes fall between 0 - 10

11
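
As a starting point, here is one possible sketch of the grouping step. It assumes the fields in names.txt are separated by tab characters, and it uses three lines copied from the file output shown earlier instead of reading the file. The pattern and the names `line_pattern` and `results` are illustrative, not a reference solution.

```python
import re

# Three sample lines, assuming tab-separated fields as in names.txt.
lines = [
    "Hawkins, Derek\tderek@codingtemple.com\t(555) 555-5555\tTeacher, Coding Temple\t@derekhawkins",
    "Stanton, Brian\tbrians@codingtemple.com\t(555) 555-5554\tTeacher, Coding Temple",
    "Osterberg, Sven-Erik\tgovernor@norrbotten.co.se\t\tGovernor, Norrbotten\t@sverik",
]

# Group 1: last name, group 2: first name, group 3: handle in the final field.
line_pattern = re.compile(r'^([\w-]+), ([\w-]+)\t.*\t(@\w+)$')

results = []
for line in lines:
    match = line_pattern.match(line.strip())
    if match:
        last, first, handle = match.groups()
        results.append(f'{first} {last} / {handle}')

for entry in results:
    print(entry)
```

Lines without a handle in the last field simply do not match and are skipped. Names with spaces or a missing last name would need a looser pattern.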


### Regex project

Use Python to read the file regex_test.txt and print the full name on each line using regular expressions and groups (print None for names with no first and last name, or names that aren't properly capitalized)
##### Hint: use with open() and readlines()

In [28]:
"""
Expected Output
Abraham Lincoln
Andrew P Garfield
Connor Milliken
Jordan Alexander Williams
None
None
"""

'\nExpected Output\nAbraham Lincoln\nAndrew P Garfield\nConnor Milliken\nJordan Alexander Williams\nNone\nNone\n'