# Coding Temple's Data Analytics Course
---
## Python II: Regular Expressions

## Tasks today:
1) <b>Importing</b> <br>
2) <b>Using Regular Expressions</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) re.compile() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) re.match() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; c) re.findall() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; d) re.search() <br>
3) <b>Sets</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) Integer Ranges <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) Character Ranges <br>
4) <b>Counting Occurences</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) {x} <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) {, x} <br>
 &nbsp;&nbsp;&nbsp;&nbsp; c) {?} <br>
 &nbsp;&nbsp;&nbsp;&nbsp; d) {*} <br>
 &nbsp;&nbsp;&nbsp;&nbsp; e) {+} <br>
5) <b>In-Class Exercise #1</b> <br>
6) <b>Escaping Characters</b> <br>
7) <b>Grouping</b> <br>
8) <b>In-Class Exercise #2</b> <br>
9) <b>Opening a File</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) open() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) with open() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; c) re.match() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; d) re.search() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; e) Store the String in a Variable <br>

### Importing <br>
<p>Regular Expressions are universal throughout most programming languages... They are generally imported through the module 're'.</p>

In [3]:
# Import statements always go at the TOP of a notebook.
import re


### Using Regular Expressions <br>
<p>Regular expressions give us the ability to search for patterns within text, strings, files, etc. They serve several uses, such as; security measures, searching, filtering, pattern recognition, and more...</p>

##### re.compile()

In [3]:
# Using compile, we pre determine the string we want to search for inside of regex methods:
print(help(re.compile))

pattern = re.compile('abcd')

Help on function compile in module re:

compile(pattern, flags=0)
    Compile a regular expression pattern, returning a Pattern object.

None


##### re.match()

In [7]:
# In order to access pieces of a string and use the pattern to parse over a doc, string, etc., we would use the .match() function
print(help(re.match))

# This line of code:
re.match(pattern, 'abcd123')

# Same as this line:
match = pattern.match('abcd123')

match.span()

Help on function match in module re:

match(pattern, string, flags=0)
    Try to apply the pattern at the start of the string, returning
    a Match object, or None if no match was found.

None


(0, 4)

##### re.findall()

In [10]:
# What happens when I have multiple matches in a single string or text document?
# findall() comes into play
print(help(re.findall))

pattern.findall('abcd123abcd')

Help on function findall in module re:

findall(pattern, string, flags=0)
    Return a list of all non-overlapping matches in the string.
    
    If one or more capturing groups are present in the pattern, return
    a list of groups; this will be a list of tuples if the pattern
    has more than one group.
    
    Empty matches are included in the result.

None


['abcd', 'abcd']

##### re.search()

In [17]:
# Search object will return the singular match of the string wherever it matches at
# The return object from a search function is also still a match object!

# Instatiated a random string
random_string = '123 123 234 abcd abc'

# Search the string for a pattern:
search = pattern.search(random_string)
print(search)

# Grab the resulting indicies and return them:
span = search.span()
print(span, type(span))

# What if I wanted to return the actual match instead of just this match object here?:
# Hard-Coding: Non-reproducible code. No matter how the string or the indicies for the match change, by hard coding in values, my program will always remain static in it's reference point:
print(random_string[16:20]) # 16 will always equal 16. No matter how the string changes, this reference value will NEVER change!

# Reproducible method: No matter how my string object changes, if the match is found, the refrence point of this statement will dynamically change as that point does.
print(random_string[span[0]:span[1]]) # 16 does not always equal 16. As the strings shifts, so does this point of reference
print(span[0])

<re.Match object; span=(12, 16), match='abcd'>
(12, 16) <class 'tuple'>
 abc
abcd
12


### Sets <br>
<p>The following cells will allow you to use regular expressions to search for certain values within a range such as numbers 1 through 4.</p>

##### [a-z] or [A-Z] - any lowercase/uppercase letters from a to z<br/>[^2] - anything that's not 2

##### Integer Ranges

In [20]:
# We are searching phone numbers for an area code that is between 0-4 for the first digit, 7-9 for the second digit, and 0-3 for the third digit:
pattern_int = re.compile('[0-4][7-9][0-3]')
phone_num = '383 838 6788'

# Search the string for the matching numbers in our set and return them:
random_num = pattern_int.search(phone_num)
print(random_num)

<re.Match object; span=(0, 3), match='383'>


In [28]:
# Can also iterate over integer values:
# Compile an integer pattern and create an integer variable:
pattern_int = re.compile('[0-4][7-9][0-3]')
int_value = 67393

# Search the value for all matching numbers:
random_num = pattern_int.search(str(int_value))
print(random_num)

# Return this match:
span = random_num.span()
print(int(str(int_value)[span[0]:span[1]]))

<re.Match object; span=(2, 5), match='393'>
393


##### Character Ranges

In [31]:
# Compile a character pattern and create a string object:
char_pattern = re.compile('[AEIOUaeiou][a-z]')
my_str = 'Hello There Senior Anderson'

# Search the string for all matching characters in the pattern and return them:
char_pattern.findall(my_str)

['el', 'er', 'en', 'io', 'An', 'er', 'on']

### Counting Occurences

##### {x} - something that occurs {num_of_times}

In [37]:
"""
What does this .compile do?

Creating a pattern that is looking for an uppercase letter, a lowercase letter, 
and then two digits between 0-3
"""
# Establised our search parameters:
char_pattern_count = re.compile('[A-Z][a-z][0-3]{2}')

# Search the string object for values matching the pattern and return them:
found_count = char_pattern_count.findall('Hello Mr. An3derson')

# Populated lists return True when looked at like this:
if found_count:
    print(found_count)
    
# Non-populated lists will return False when looked at like this:
elif not found_count:
    print('Your search query returned no matching values')

Your search query returned no matching values


##### {x, x} - something that occurs between x and x times

In [6]:
"""
What is this compile statement doing???

Creating a pattern that looks for any occurances of the letter 'm' where it occurs
anywhere between 1-5 times.
"""

random_pattern = re.compile('m{1,5}')

random_statment = random_pattern.findall('m mm mmmmmmmmm this was a great soup! I am in love with Tomato Soup!')
print(random_statment)

['m', 'mm', 'mmmmm', 'mmmm', 'm', 'm']


##### ? - something that occurs 0 or 1 time

In [42]:
'''
What does this compile statement do?

Create a pattern of "Mr" and search for an occurance of the letter 's' to follow after it.
If the letter 's' does not occur, will still return all matches of "Mr".
'''

# Compile the string pattern to search for
pattern = re.compile('Mrs?')

# Search the string for values matching the pattern and return them:
found_pat = pattern.findall('Hello M there Mr. Anderson. How is Mrs. Anderson and Ms. Anderson?')
print(found_pat)

['Mr', 'Mrs']


##### * - something that occurs at least 0 times

In [43]:
'''
What does this compile statment do?

Create a pattern that will search for a s with any number of "M"'s preceding it.
This pattern will also return ANY singular s values as the M is NOT required to preceed it
in order to trigger the pattern.
'''

pattern_m = re.compile('M*s')

found_m = pattern_m.findall('MMMs name is Ms. Smith. This is Msssss')
print(found_m)

['MMMs', 's', 'Ms', 's', 's', 'Ms', 's', 's', 's', 's']


##### + - something that occurs at least once

In [44]:
'''
What does this compile statement do?

Create a pattern that will search for all singular s values. When it an s value, it will look in front of it and see if there are any M in front. If there is, it will return that match.
If not, it is NOT a match.
'''

pattern_again = re.compile('M+s')

found_m = pattern_again.findall('MMMs name is Ms. Smith. This is Msssss')
print(found_m)

['MMMs', 'Ms', 'Ms']


##### In-class exercise 1: 

Use a regular expression to find every number in the given string

In [49]:
my_string = 'This string has 10909090 numbers, but it is only 1 string. I hope you solve this 2day'
# Output: ['10909090', '1', '2']

pattern = re.compile('[0-9]+')
pattern.findall(my_string)

['10909090', '1', '2']

### Escaping Characters

##### \w - look for any Unicode character<br/>\W - look for anything that isnt a Unicode character

[History on Unicode](http://unicode.org/standard/WhatIsUnicode.html)

[More on Unicode Characters](https://en.wikipedia.org/wiki/List_of_Unicode_characters)

In [60]:
print('\U0001f525')
print('\U0001f30a')
print('\U00002660')
print('\U00002665')

🔥
🌊
♠
♥


In [54]:
# Anything that is a unicode object:
# Plus means I want everything that matches afterwards to also be returned as part of the group or set:
pattern_uni = re.compile('[\w]+')
found_uni = pattern_uni.findall('This is a sentence. This sentence has an exclamation point at the end!')
print(found_uni)

# Anything that isn't a unicode object or character:
pattern_not_uni = re.compile('[\W]+')
found_non_uni = pattern_not_uni.findall('This is a sentence. This sentence has an exclamation point at the end!')
print(found_non_uni)

['This', 'is', 'a', 'sentence', 'This', 'sentence', 'has', 'an', 'exclamation', 'point', 'at', 'the', 'end']
[' ', ' ', ' ', '. ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '!']


##### \d - look for any digit 0-9<br/>\D - look for anything that isnt a digit

In [65]:
date = 'Today is the 25th. Here soon, it will be the 1st.'

'''
What does this compile statement do?

We are searching for any numerical value that occurs 1-2 times, then
is proceeded by a non-numerical value 2 times.

25th would fall under this purview. Where 20OCT2015
'''
pattern_num = re.compile('\d{1,2}[\D]{2}')
found_date = pattern_num.findall(date)
print(found_date)

['25th', '1st']


##### \s - look for any white space<br/>\S - look for anything that isnt whitespace

In [84]:
string_1 = 'Are you 11 and    afraid aint of the dark?'

# What is whitespace?
# Whitespace is any form of space within a string. Either leading, trailing, or in between words.
# Pattern with no whitespaces:
pattern_no_space = re.compile('\S[a-z]+')
found_dark = pattern_no_space.findall(string_1)
print(found_dark)

# Pattern with whitespaces:
pattern_with_space = re.compile('\s+')
found_space = pattern_with_space.findall(string_1)
print(found_space)

['Are', 'you', 'and', 'afraid', 'aint', 'of', 'the', 'dark']
[' ', ' ', ' ', '    ', ' ', ' ', ' ', ' ']


##### \b - look for boundaries or edges of a word<br/>\B - look for anything that isnt a boundary

In [88]:
pattern_bound = re.compile(r'\bTheCodingTemple\b') # r in front of a string means I want to read this as a literal instead.
found_pat = pattern_bound.findall('Welcome to TheCodingTemple')
print(found_pat)

pattern_bound = re.compile(r'\BTheCodingTemple\B') # r in front of a string means I want to read this as a literal instead.
found_pat = pattern_bound.findall('Welcome to TheCodingTemple')
print(found_pat)

['TheCodingTemple']
[]


### Grouping

In [94]:
my_string_again = 'Max Smith, aaron rodgers, Sam Darnold, LeBron James, Michael Jordan, Kevin Durant, Patrick McCormik'

'''
What does this compile statement do?

We search for patterns matching either one of our grouped patterns in the compile statement.
Pattern group 1, (first set of parenthesis), searches for name exceptions (ex. LeBron, McCormik),
searching for a capital letter, followed by a lowercase letter, then another capital and lowercase to follow.

The second pattern searches for where there is a capital letter, and any capital or lowercase letters to follow.
'''
# The parenthesis create the grouping saying "Here is a list of how I want this specific search parameter to look."
pattern_name = re.compile('([A-Z][a-zA-Z-a-z]+) ([A-Z][A-Za-z]+)')
found_names = pattern_name.findall(my_string_again)
print(found_names)

# What if I wanted to break this down even further and see when I am not adding a name or skipping it?

'''
What is my loop doing?

I am splitting the string by first and last name. I am separating the string into a list, where each element of that list
should be the first name and last name as a single string. 

Then, we search the string for the name using regex. If it meets our conditional filter, it gets returned!
If it doesn't, our program let's us know!
'''
for name in my_string_again.split(','):
    match = pattern_name.search(name)
    if match:
        print(match.groups(2))
    else:
        print(f'The name did not work! Looks like the name: {name} may not be a valid format for first and last name')

[('Max', 'Smith'), ('Sam', 'Darnold'), ('LeBron', 'James'), ('Michael', 'Jordan'), ('Kevin', 'Durant'), ('Patrick', 'McCormik')]
('Max', 'Smith')
The name did not work! Looks like the name:  aaron rodgers may not be a valid format for first and last name
('Sam', 'Darnold')
('LeBron', 'James')
('Michael', 'Jordan')
('Kevin', 'Durant')
('Patrick', 'McCormik')


##### In-class Exercise 2:

Write a function using regular expressions to find the domain name in the given email addresses (and return None for the invalid email addresses)<br><b>HINT: Use '|' for either or</b>

In [101]:
my_emails = ["jordanw@codingtemple.orgcom", "pocohontas1776@gmail.com", "helloworld@aol..com",
             "yourfavoriteband@g6.org", "@codingtemple.com"]

# You can also use the $ at the end of your compile expression -- this stops the search
#.com OR .org => com|org

# Define our function:
def validate_email(emails):
    pattern = re.compile("([A-Za-z0-9]+)@([A-Za-z0-9]+).(org|com)$")
    
    # When we look at the match object like this, if a match is found, this triggers
    if pattern.match(email):
        return email
    else:
        return None
    
    
# Expected Output:
# None
# "pocohontas1776@gmail.com"
# None
# "yourfavoriteband@g6.org"
# None

for email in my_emails:
    print(validate_email(email))

None
pocohontas1776@gmail.com
None
yourfavoriteband@g6.org
None


### Opening a File <br>
<p>Python gives us a couple ways to import files, below are the two used most often.</p>

##### open()

In [108]:
# Open a file
f = open(r'C:\Users\Alex Lucchesi\coding-temple\coding_temple_data_analytics_ft\week-3\data\names.txt')

# Read in the data:
data = f.read()

# Print out the data and the type of the data
print(data)
print(type(data))


# When using this method:
# ALWAY CLOSE YOUR FILE
f.close()

ï»¿Hawkins, Derek        derek@codingtemple.com        (555) 555-5555        Teacher, Coding Temple        @derekhawkins
Zhai, Mo        mozhai@codingtemple.com        (555) 555-5554        Teacher, Coding Temple
Johnson, Joe        joejohnson@codingtemple.com                Johson, Joe
Osterberg, Sven-Erik        governor@norrbotten.co.se                Governor, Norrbotten        @sverik
, Tim        tim@killerrabbit.com                Enchanter, Killer Rabbit Cave
Butz, Ryan        ryanb@codingtemple.com        (555) 555-5543        CEO, Coding Temple        @ryanbutz
Doctor, The        doctor+companion@tardis.co.uk                Time Lord, Gallifrey
Exampleson, Example        me@example.com        555-555-5552        Example, Example Co.        @example
Pael, Ripal        ripalp@codingtemple.com        (555) 555-5553        Teacher, Coding Temple        @ripalp
Vader, Darth        darth-vader@empire.gov        (555) 555-4444        Sith Lord, Galactic Empire        @darthvader
Fer

##### with open()

In [109]:
# When using with open(), you do not have to close the file! It does it for you!
with open(r'C:\Users\Alex Lucchesi\coding-temple\coding_temple_data_analytics_ft\week-3\data\names.txt', encoding='utf-8') as f:
    data = f.read()
    print(data)


Hawkins, Derek        derek@codingtemple.com        (555) 555-5555        Teacher, Coding Temple        @derekhawkins
Zhai, Mo        mozhai@codingtemple.com        (555) 555-5554        Teacher, Coding Temple
Johnson, Joe        joejohnson@codingtemple.com                Johson, Joe
Osterberg, Sven-Erik        governor@norrbotten.co.se                Governor, Norrbotten        @sverik
, Tim        tim@killerrabbit.com                Enchanter, Killer Rabbit Cave
Butz, Ryan        ryanb@codingtemple.com        (555) 555-5543        CEO, Coding Temple        @ryanbutz
Doctor, The        doctor+companion@tardis.co.uk                Time Lord, Gallifrey
Exampleson, Example        me@example.com        555-555-5552        Example, Example Co.        @example
Pael, Ripal        ripalp@codingtemple.com        (555) 555-5553        Teacher, Coding Temple        @ripalp
Vader, Darth        darth-vader@empire.gov        (555) 555-4444        Sith Lord, Galactic Empire        @darthvader
Fernan

##### re.search()

In [111]:
re.search('ripalp@codingtemple.com', data)

<re.Match object; span=(786, 809), match='ripalp@codingtemple.com'>

##### Store the String to a Variable

In [113]:
answer = input('What name would you like to search for?: ')
found = re.findall(answer, data)
print(found)
if found:
    print(f'I found your data!: {found}')
else:
    print('Nothing to see here folks!')



[]
Nothing to see here folks!


### In-Class Exercise #3 <br>
<p>Print each persons name and twitter handle, using groups, should look like:</p>
<p>==============<br>
   Full Name / Twitter<br>
   ==============</p>
Derek Hawkins / @derekhawkins

 Erik Sven-Osterberg / @sverik

 Ryan Butz / @ryanbutz

 Example Exampleson / @example

 Ripal Pael / @ripalp

 Darth Vader / @darthvader

In [116]:
with open(r'C:\Users\Alex Lucchesi\coding-temple\coding_temple_data_analytics_ft\week-3\data\names.txt') as f:
    data = f.read()
    
# Match to the first name, last name, and twitter handle:
# Account for those objects alone.
def accounts(data):
    pattern = re.compile(r"([A-Za-z-]+), ([A-Za-z-]+).*(@[A-Za-z0-9_-]+)$", re.MULTILINE)
    matches = pattern.findall(data)
    for match in matches:
        # rearrange the matches to follow the format I want to return
        print(f'{match[1]} {match[0]} / {match[2]}')
        
print(accounts(data))

Derek Hawkins / @derekhawkins
Sven-Erik Osterberg / @sverik
Ryan Butz / @ryanbutz
Example Exampleson / @example
Ripal Pael / @ripalp
Darth Vader / @darthvader
None
