# Coding Temple's Data Analytics Course
---
## Advanced Python Day 1: Regular Expressions

## Tasks today:
1) <b>Importing</b> <br>
2) <b>Using Regular Expressions</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) re.compile() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) re.match() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; c) re.findall() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; d) re.search() <br>
3) <b>Sets</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) Integer Ranges <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) Character Ranges <br>
4) <b>Counting Occurrences</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) {x} <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) {x,y} <br>
 &nbsp;&nbsp;&nbsp;&nbsp; c) ? <br>
 &nbsp;&nbsp;&nbsp;&nbsp; d) * <br>
 &nbsp;&nbsp;&nbsp;&nbsp; e) + <br>
5) <b>In-Class Exercise #1</b> <br>
6) <b>Escaping Characters</b> <br>
7) <b>Grouping</b> <br>
8) <b>In-Class Exercise #2</b> <br>
9) <b>Opening a File</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) open() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) with open() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; c) re.match() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; d) re.search() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; e) Store the String in a Variable <br>
10) <b>Regex Project</b> <br>

### Importing <br>
<p>Regular expressions are supported in most programming languages. In Python, they are provided by the standard library module 're', which must be imported before use.</p>

In [41]:
# Part of standard python library
# import statement
import re

### Using Regular Expressions <br>
<p>Regular expressions give us the ability to search for patterns within text, strings, files, etc. They serve several uses, such as security measures, searching, filtering, and pattern recognition.</p>

##### re.compile()

In [42]:
# compile() builds a reusable Pattern object from the pattern text that we
# want to search for with the regular expression methods
help(re.compile)
# compile(pattern, flags=0)
pattern = re.compile('abcd')


Help on function compile in module re:

compile(pattern, flags=0)
    Compile a regular expression pattern, returning a Pattern object.
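
As a minimal sketch of why compiling can be useful: a compiled pattern can be reused across many strings, and it finds the same matches as passing the pattern text to the module-level functions. The names `abcd_pattern` and `samples` are illustrative and not part of the lesson.

```python
import re

# Compile once, then reuse the same Pattern object for every string
abcd_pattern = re.compile('abcd')

samples = ['abcd123', '123abcd', 'xyz']
compiled_hits = [bool(abcd_pattern.search(s)) for s in samples]

# The module-level function accepts the pattern text directly
module_hits = [bool(re.search('abcd', s)) for s in samples]

print(compiled_hits)  # [True, True, False]
print(module_hits)    # [True, True, False]
```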



##### re.match()

In [241]:
help(re.match)
match = pattern.match('abcd123')
print(match)
# What is the span, and how can we access it?
# The span is the (start, end) index range where the match was found in the string
# Access it with match_object.span()
print(match.span())
# To learn more about regex and test patterns interactively, see
# https://regex101.com/

Help on function match in module re:

match(pattern, string, flags=0)
    Try to apply the pattern at the start of the string, returning
    a Match object, or None if no match was found.

<re.Match object; span=(0, 4), match='abcd'>
(0, 4)
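
A point worth illustrating here: match() only looks at the start of the string and returns None when the pattern is not found there, so calling .span() on the result without checking raises an AttributeError. The sketch below guards against that; the helper name `describe_match` is hypothetical.

```python
import re

abcd_pattern = re.compile('abcd')

def describe_match(text):
    """Return the span of a match at the start of text, or None if there is none."""
    result = abcd_pattern.match(text)
    if result is None:
        # No match at the start of the string, so there is no span to report
        return None
    return result.span()

print(describe_match('abcd123'))  # (0, 4)
print(describe_match('123abcd'))  # None
```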

##### re.findall()

In [44]:
help(re.findall)
finders = pattern.findall('abcd123')
print(finders)

# findall() returns every non-overlapping match, so a longer string can yield several results
print(pattern.findall('abcd abcd123 abc 123'))

Help on function findall in module re:

findall(pattern, string, flags=0)
    Return a list of all non-overlapping matches in the string.
    
    If one or more capturing groups are present in the pattern, return
    a list of groups; this will be a list of tuples if the pattern
    has more than one group.
    
    Empty matches are included in the result.

['abcd']
['abcd', 'abcd']


##### re.search()

In [45]:
help(re.search)

# Instantiate a string of sample values
random_string = '123 123 234 abc abcd'
#
## Search the string for a pattern
search = pattern.search(random_string)
print(search)
#
## The match object shows where the pattern was found, but how do we get the text itself?
## Let's grab the matching indices and use them to slice the string
## .span() returns only the (start, end) indices of the match
span = search.span()
#print(span)
print(span, type(span))

# Because strings can be indexed, we can slice the match out of the original string
# One method is to hardcode the indices
# What is hardcoding? Writing fixed values into the code instead of variables, so the code is not reusable
# Example: putting 10 vs x
print(random_string[16:20]) #16 will always be 16 and 20 will always be 20

# This is reusable, no matter what word or phrase goes into my search object
print(random_string[span[0]:span[1]])

Help on function search in module re:

search(pattern, string, flags=0)
    Scan through string looking for a match to the pattern, returning
    a Match object, or None if no match was found.

<re.Match object; span=(16, 20), match='abcd'>
(16, 20) <class 'tuple'>
abcd
abcd
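
Slicing by span works, but the match object can also hand back the matched text directly. A short sketch comparing the two approaches; the variable names are illustrative.

```python
import re

random_string = '123 123 234 abc abcd'
found = re.search('abcd', random_string)

# Slice the original string using the span
start, end = found.span()
by_slice = random_string[start:end]

# Ask the match object for the matched text
by_group = found.group()

print(by_slice, by_group)          # abcd abcd
print(found.start(), found.end())  # 16 20
```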


### Sets <br>
<p>The following cells will allow you to use regular expressions to search for certain values within a range such as numbers 1 through 4.</p>

##### [a-z] or [A-Z] - any lowercase/uppercase letters from a to z<br/>[^2] - anything that's not 2

##### Integer Ranges

In [46]:
# This pattern matches three consecutive characters in a string:
# the first is a digit 0-4, the next is a digit 7-9,
# and the last is a digit 0-3
pattern_int = re.compile('[0-4][7-9][0-3]')
int_str = '67383 383'

# Search the string for the first matching number:
random_numbers = pattern_int.search(int_str)
print(random_numbers)

# Remember that a search returns a match object. The match object holds the
# first instance of a match only.
# It does not search the entire string and return all matches

# Get the matched number by slicing the string with the span instead of hardcoded indices
span = random_numbers.span()
print(int_str[span[0]:span[1]])

<re.Match object; span=(2, 5), match='383'>
383


In [47]:
# How can we use this to search an integer value instead of a string?
# Compile an integer pattern and create an integer variable
pattern_int = re.compile('[0-4][7-9][0-3]')
int_value = 67383

# Convert the value to a string, then search it for the matching digits
random_numbers = pattern_int.search(str(int_value))
print(random_numbers)

# Slice out the first match and convert it back to an integer
span = random_numbers.span()
print(int(str(int_value)[span[0]:span[1]]))

<re.Match object; span=(2, 5), match='383'>
383
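
Since search() stops at the first match, here is a hedged sketch of how every match and its span could be collected with findall() and finditer(). The sample string is made up for illustration.

```python
import re

pattern_int = re.compile('[0-4][7-9][0-3]')
int_str = '67383 383 190 472'  # illustrative sample, not from the lesson

# findall() returns the matched text of every non-overlapping match
all_matches = pattern_int.findall(int_str)

# finditer() yields one match object per match, so spans are available too
all_spans = [m.span() for m in pattern_int.finditer(int_str)]

print(all_matches)  # ['383', '383', '190', '472']
print(all_spans)    # [(2, 5), (6, 9), (10, 13), (14, 17)]
```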


##### Character Ranges

In [206]:
# Just as we looked through integer values using a range, we can do the same
# thing with characters in a string
# Compile a character pattern and create a string object

char_pattern = re.compile('[A-Z][a-z]')
my_str = 'Hello There Mr. Anderson'

# Using the .findall() method. This searches the string for all the matching characters
# and returns a list of the values
found = char_pattern.findall(my_str)
print(found)


['He', 'Th', 'Mr', 'An']
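
The negated set `[^2]` from the heading above is not shown in a cell, so here is a small sketch of it. A `^` as the first character inside the brackets means "any character except these". The sample strings are illustrative.

```python
import re

# Anything that is not the digit 2
not_two = re.compile('[^2]')
found_not_two = not_two.findall('12232')
print(found_not_two)  # ['1', '3']

# Runs of characters that are neither a lowercase vowel nor a space
not_vowel = re.compile('[^aeiou ]+')
found_not_vowel = not_vowel.findall('hello there')
print(found_not_vowel)  # ['h', 'll', 'th', 'r']
```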


### Counting Occurrences

##### {x} - something that occurs {num_of_times}

In [49]:
"""
What does .compile() do?
Creates a pattern that is looking for a single uppercase letter, a lowercase
letter, and four digits with values of 0-3
"""
char_pattern_count = re.compile('[A-Z][a-z][0-3]{4}')

# Search a string for values that match this pattern and return them.
found_count = char_pattern_count.findall('Hello Mr. An3333derson')
print(found_count)

['An3333']


##### {x,y} - something that occurs between x and y times

In [50]:
"""
Creating a pattern that looks for an occurrence of the letter m where it
occurs 1-5 times in a row
"""
random_pattern = re.compile('m{1,5}')

# Search a string for values matching the pattern and return them
random_statement = random_pattern.findall('m mm mmm this is a great soup! I am in love with Tomato soup!')
print(random_statement)


['m', 'mm', 'mmm', 'm', 'm']


##### ? - something that occurs 0 or 1 time

In [51]:
"""
What does this .compile() do?
Creates a pattern of 'Mrs' followed by an optional second 's'.
The s? matches the s IF it occurs. If it doesn't, the pattern still matches
without that character!
"""
pattern = re.compile('Mrss?')

# Search the string for values matching our pattern and return them
found_pat = pattern.findall('Hello M there Mr. Anderson. How is Mrss. Anderson and Ms. Anderson?')
print(found_pat)

['Mrss']


##### * - something that occurs 0 or more times

In [52]:
"""
Creates a pattern that searches for s with any number of M's preceding it.
This pattern will also return every s that stands alone, because the M is NOT
required to precede it.
"""
pattern_m = re.compile('M*s')

# Search our string for values matching the pattern and return them

found_m = pattern_m.findall('MMMs name is Ms. Smith. This is Msssss')
print(found_m)

['MMMs', 's', 'Ms', 's', 's', 'Ms', 's', 's', 's', 's']


##### + - something that occurs at least once

In [53]:
"""
Creates a pattern that searches for where M occurs at least 1 time
followed by an s.

"""
pattern_again = re.compile("M+s")

# Search our string for values matching the pattern and return them to us
found_patt = pattern_again.findall("My name is Ms. Smith, this is MMMMMsssss")
print(found_patt)

['Ms', 'MMMMMs']
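
To compare the counting symbols side by side, the sketch below runs each one against the same made-up string. Each quantifier takes as many repetitions as it is allowed. The dictionary of patterns and descriptions is only an illustration.

```python
import re

text = 'a ab abb abbb abbbb'  # illustrative sample

# Each pattern is the letter a followed by a differently quantified b
patterns = {
    'ab{2}': 'exactly two b',
    'ab{1,3}': 'one to three b',
    'ab?': 'zero or one b',
    'ab*': 'zero or more b',
    'ab+': 'one or more b',
}

results = {p: re.findall(p, text) for p in patterns}
for p, description in patterns.items():
    print(f'{p:8} {description:16} {results[p]}')
```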


##### In-class exercise 1: 

Use a regular expression to find every number in the given string

In [54]:
import re
my_string = 'This string has 10909090 numbers, but it is only 1 string. I hope you solve this 2day'
# Output: ['10909090', '1', '2']

rule_set = re.compile('[0-9]+')
# or rule_set = re.compile(r'\d+')
outcome = rule_set.findall(my_string)
print(outcome)


['10909090', '1', '2']


### Escaping Characters

##### \w - look for any word character (Unicode letters, digits, and underscore)<br/>\W - look for anything that isn't a word character

[History on Unicode](http://unicode.org/standard/WhatIsUnicode.html)

[More on Unicode Characters](https://en.wikipedia.org/wiki/List_of_Unicode_characters)

In [55]:
print('\u0040')


@


In [56]:
# \w selects word characters: Unicode letters, digits, and the underscore
# \W selects anything that ISN'T a word character

my_str = 'This is a sentence. With an, exclamation point at the end!'
# Selects a word character, and the plus means I want all the word characters that follow it
#pattern_uni = re.compile(r'[\w]+')
#found_uni = pattern_uni.findall('This is a sentence. With an, exclamation point at the end!')
#print(found_uni)

# What if I wanted everything that wasn't a word character?
pattern_not_uni = re.compile(r'[\W]+')
found_not_uni = pattern_not_uni.findall(my_str)
print(found_not_uni)

[' ', ' ', ' ', '. ', ' ', ', ', ' ', ' ', ' ', ' ', '!']
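
A short sketch of what \w covers: it includes digits, the underscore, and non-ASCII letters, unless the re.ASCII flag restricts it to [a-zA-Z0-9_]. The sample text is made up for illustration.

```python
import re

sample = 'snake_case naïve 42!'  # illustrative sample

# By default \w is Unicode-aware for str patterns
unicode_words = re.findall(r'\w+', sample)
print(unicode_words)  # ['snake_case', 'naïve', '42']

# With re.ASCII, the letter ï no longer counts as a word character
ascii_words = re.findall(r'\w+', sample, flags=re.ASCII)
print(ascii_words)  # ['snake_case', 'na', 've', '42']
```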


##### \d - look for any digit 0-9<br/>\D - look for anything that isn't a digit

In [57]:
date = 'Today is the 15th. In 16 days it will be the 1st.'
"""
Searches for any occurrence of 1-2 digits, followed by exactly two
characters that aren't digits.
"""
pattern_num = re.compile(r'\d{1,2}[\D]{2}')
#\d{1,2} is looking for one or two digits 0-9
#[\D]{2} is looking for any two non-digit characters after the digits (letters, spaces, punctuation)
found_date = pattern_num.findall(date)
print(found_date)

['15th', '16 d', '1st']


##### \s - look for any whitespace<br/>\S - look for anything that isn't whitespace

In [58]:
string = 'Are you    afraid of the dark?'

# Something to remember is that whitespace is any form of space within a string.
# It does not need to occur at the beginning or end, but again can be anywhere inside the string.
# Let's look at a pattern with NO whitespace
"""
Search a string for one non-whitespace character followed by one or more lowercase letters a-z
"""
pattern_no_space = re.compile(r'\S[a-z]+')
found_dark = pattern_no_space.findall(string)
print(found_dark)

# Pattern with any whitespace
pattern_space = re.compile(r'\s')
found_space = pattern_space.findall(string)
print(found_space)

['Are', 'you', 'afraid', 'of', 'the', 'dark']
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']


##### \b - look for boundaries or edges of a word<br/>\B - look for positions that aren't a word boundary

In [59]:

pattern_bound = re.compile(r'\bTheCodingTemple\b')
pattern_found = pattern_bound.findall('TheCodingTemple')
print(pattern_found)

# Anything that isn't a boundary
pattern_bound_none = re.compile(r'\BTheCodingTemple\B')
pattern_none = pattern_bound_none.findall("TheCodingTemple")
print(pattern_none)

['TheCodingTemple']
[]
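
The cell above writes its patterns as raw strings (the r prefix). A minimal sketch of why that matters for \b: in a regular Python string, '\b' is a single backspace character, so the pattern never reaches the regex engine as a word boundary. The sample text is illustrative.

```python
import re

# A regular string turns \b into one backspace character; a raw string keeps both characters
print(len('\b'), len(r'\b'))  # 1 2

text = 'cat concat cat'

# Without the r prefix the pattern looks for literal backspaces and finds nothing
plain_result = re.findall('\bcat\b', text)
print(plain_result)  # []

# With the r prefix, \b is a word boundary, so only the whole word cat matches
raw_result = re.findall(r'\bcat\b', text)
print(raw_result)  # ['cat', 'cat']

# \B requires a non-boundary, so only the cat inside concat matches
inside_result = re.findall(r'\Bcat', text)
print(inside_result)  # ['cat']
```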


### Grouping

In [210]:
my_string_again = "Max Smith, aaron Rogers, Sam Darnold, Lebron James, Michael Jordan, Kevin Durant, PAtrick McCormik"
"""
We use grouping to capture two parts of each name in our compile function
The first group is the first name: a capital letter followed by any letters
    or hyphens, which also handles Name Exceptions (Ex. LeBron, McCormik)
    Ex. Capital, lower case, then capital
The second group is the last name: a capital letter then any letters after

"""
pattern_name = re.compile(r'([A-Z][A-Za-z-]+) ([A-Z][A-Za-z]+)')
"""

What is my loop doing?
It is separating the names at each comma that is present in the string

"""
for name in my_string_again.split(','):
    #print(name)
    match = pattern_name.search(name)
    if match:
        print(match.group(1),match.group(2))
    else:
        print("Not a Name")
    

Max Smith
Not a Name
Sam Darnold
Lebron James
Michael Jordan
Kevin Durant
PAtrick McCormik
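
A hedged sketch of the other ways to read groups from a match object: group(0) is the whole match, groups() returns every captured group as a tuple, and groups can optionally be given names with the (?P<name>...) syntax. The group names `first` and `last` are illustrative choices.

```python
import re

# Same idea as the name pattern above, with names attached to the two groups
name_pattern = re.compile(r'(?P<first>[A-Z][A-Za-z-]+) (?P<last>[A-Z][A-Za-z]+)')

name_match = name_pattern.search(' Lebron James')

whole = name_match.group(0)          # the entire matched text
parts = name_match.groups()          # every captured group, in order
last_name = name_match.group('last') # a group looked up by its name

print(whole)      # Lebron James
print(parts)      # ('Lebron', 'James')
print(last_name)  # James
```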


##### In-class Exercise 2:

Write a function using regular expressions to find the domain name in the given email addresses (and return None for the invalid email addresses)<br><b>HINT: Use '|' for either or</b>

In [61]:
my_emails = ["jordanw@codingtemple.orgcom", "pocohontas1776@gmail.com", "helloworld@aol..com",
             "yourfavoriteband@g6.org", "@codingtemple.com"]

def validate_email(email):
    r"""
    [A-Za-z0-9]+ matches the beginning of the e-mail: capital or lower case letters and numbers
    @ matches the literal @ symbol
    [A-Za-z0-9]+ matches the domain name: capital or lower case letters and numbers
    \. matches a literal dot (an unescaped . would match any character)
    (org|com) matches either org or com
    | -> means or
    $ -> anchors the match to the end of the string, which stops partial matches
        Ex:  .orgcom is rejected because the string does not end right after org
    """
    pattern = re.compile(r"([A-Za-z0-9]+)@([A-Za-z0-9]+)\.(org|com)$")
    
    if pattern.match(email):
        return email
    
    else:
        return None
    
for email in my_emails:
    print(validate_email(email))

# The $ at the end of the compile expression anchors the match to the end of the string

#.com OR .org => com|org

#Expected output:
#None
#pocohontas1776@gmail.com
#None
#yourfavoriteband@g6.org
#None




None
pocohontas1776@gmail.com
None
yourfavoriteband@g6.org
None
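
To show why the dot and the end of the string matter in an e-mail pattern, this sketch compares a loose pattern (unescaped dot, only org anchored) with a strict one checked with fullmatch(). The test addresses are made up, and the strict pattern is one possible approach rather than a complete e-mail validator.

```python
import re

# Loose: . matches any character and com is not anchored to the end
loose = re.compile(r'([A-Za-z0-9]+)@([A-Za-z0-9]+).(org$|com)')
# Strict: literal dot, and fullmatch() requires the whole string to match
strict = re.compile(r'([A-Za-z0-9]+)@([A-Za-z0-9]+)\.(org|com)')

tricky = ['user@sitexcom', 'user@site.community', 'user@site.com']

loose_results = [bool(loose.match(e)) for e in tricky]
strict_results = [bool(strict.fullmatch(e)) for e in tricky]

print(loose_results)   # [True, True, True]
print(strict_results)  # [False, False, True]

# The groups give access to the domain name the exercise asks for
valid = strict.fullmatch('user@site.com')
domain = f'{valid.group(2)}.{valid.group(3)}'
print(domain)  # site.com
```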


### Opening a File <br>
<p>Python gives us a couple of ways to open and read files. Below are the two used most often.</p>

##### open()

In [224]:
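# Open the file
# No encoding is given here, so the platform default is used; the file's
# UTF-8 byte order mark then shows up as ï»¿ at the start of the output
f = open(r'C:\Users\jmira\OneDrive\Coding Temple Cohort 0501 0623\week_3\names.txt')

# Read in the data
data = f.read()

# Print the data and its type
print(data)
print(type(data))

# Always close the file when you are done with it!
f.close()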

ï»¿Hawkins, Derek        derek@codingtemple.com        (555) 555-5555        Teacher, Coding Temple        @derekhawkins
Zhai, Mo        mozhai@codingtemple.com        (555) 555-5554        Teacher, Coding Temple
Johnson, Joe        joejohnson@codingtemple.com                Johson, Joe
Osterberg, Sven-Erik        governor@norrbotten.co.se                Governor, Norrbotten        @sverik
, Tim        tim@killerrabbit.com                Enchanter, Killer Rabbit Cave
Butz, Ryan        ryanb@codingtemple.com        (555) 555-5543        CEO, Coding Temple        @ryanbutz
Doctor, The        doctor+companion@tardis.co.uk                Time Lord, Gallifrey
Exampleson, Example        me@example.com        555-555-5552        Example, Example Co.        @example
Pael, Ripal        ripalp@codingtemple.com        (555) 555-5553        Teacher, Coding Temple        @ripalp
Vader, Darth        darth-vader@empire.gov        (555) 555-4444        Sith Lord, Galactic Empire        @darthvader
Fer

##### with open()

In [225]:
# Using the with keyword: the file is closed automatically when the block ends
with open(r'C:\Users\jmira\OneDrive\Coding Temple Cohort 0501 0623\week_3\names.txt', encoding = 'utf-8') as f:
    data = f.read()
    print(data)

Hawkins, Derek        derek@codingtemple.com        (555) 555-5555        Teacher, Coding Temple        @derekhawkins
Zhai, Mo        mozhai@codingtemple.com        (555) 555-5554        Teacher, Coding Temple
Johnson, Joe        joejohnson@codingtemple.com                Johson, Joe
Osterberg, Sven-Erik        governor@norrbotten.co.se                Governor, Norrbotten        @sverik
, Tim        tim@killerrabbit.com                Enchanter, Killer Rabbit Cave
Butz, Ryan        ryanb@codingtemple.com        (555) 555-5543        CEO, Coding Temple        @ryanbutz
Doctor, The        doctor+companion@tardis.co.uk                Time Lord, Gallifrey
Exampleson, Example        me@example.com        555-555-5552        Example, Example Co.        @example
Pael, Ripal        ripalp@codingtemple.com        (555) 555-5553        Teacher, Coding Temple        @ripalp
Vader, Darth        darth-vader@empire.gov        (555) 555-4444        Sith Lord, Galactic Empire        @darthvader
Fernan
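
The file paths above only exist on the instructor's machine, so here is a self-contained sketch that writes a small temporary file and reads it back with `with open()`. It also shows how the encoding affects a leading byte order mark: 'utf-8' keeps it as the character '\ufeff', while 'utf-8-sig' strips it. The temporary file and its one-line content are assumptions made for illustration.

```python
import os
import tempfile

sample_text = 'Hawkins, Derek\tderek@codingtemple.com\n'

# Write a small UTF-8 file that starts with a byte order mark (BOM)
with tempfile.NamedTemporaryFile('w', encoding='utf-8-sig', suffix='.txt', delete=False) as tmp:
    tmp.write(sample_text)
    tmp_path = tmp.name

# Reading as utf-8 keeps the BOM as the first character
with open(tmp_path, encoding='utf-8') as f:
    with_bom = f.read()

# Reading as utf-8-sig removes it
with open(tmp_path, encoding='utf-8-sig') as f:
    without_bom = f.read()

os.remove(tmp_path)  # clean up the temporary file

print(repr(with_bom[:8]))
print(repr(without_bom[:8]))
```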

##### re.search()

In [222]:
p = re.search(r'joejohnson@codingtemple.com', data)
s = p.span()
data[s[0]:s[1]]

'joejohnson@codingtemple.com'

##### Store the String to a Variable

In [226]:
#answer = input('What would you like to search for?')
#found = re.findall(answer,data)
#
#print(found)
#
#if found:
#    print(f'I found your data: {found}')
#else:
#    print('Nothing to be found here')

### In-Class Exercise #3 <br>
<p>Print each person's name and twitter handle, using groups. The output should look like:</p>
<p>==============<br>
   Full Name / Twitter<br>
   ==============</p>
Derek Hawkins / @derekhawkins

 Sven-Erik Osterberg / @sverik

 Ryan Butz / @ryanbutz

 Example Exampleson / @example

 Ripal Pael / @ripalp

 Darth Vader / @darthvader

In [277]:
with open(r'C:\Users\jmira\OneDrive\Coding Temple Cohort 0501 0623\week_3\names.txt', encoding = 'utf-8') as f:
    data = f.read()

# Match last name, first name, and twitter handle
def accounts(data):
    pattern = re.compile(r"([A-Za-z-]+), ([A-Za-z-]+).*(@[A-Za-z0-9_-]+)$", re.MULTILINE)
    matches = pattern.findall(data)
    for match in matches:
        # Rearrange each match as first name, last name, and twitter handle
        print(f"{match[1]} {match[0]} / {match[2]}")
accounts(data)

Derek Hawkins / @derekhawkins
Sven-Erik Osterberg / @sverik
Ryan Butz / @ryanbutz
Example Exampleson / @example
Ripal Pael / @ripalp
Darth Vader / @darthvader
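
The re.MULTILINE flag is what lets `$` match at the end of every line instead of only at the end of the whole text. A sketch of the difference, using three lines copied from the file output above as in-memory sample data (the spacing between columns is assumed):

```python
import re

sample = (
    "Hawkins, Derek        derek@codingtemple.com        (555) 555-5555        Teacher, Coding Temple        @derekhawkins\n"
    "Zhai, Mo        mozhai@codingtemple.com        (555) 555-5554        Teacher, Coding Temple\n"
    "Butz, Ryan        ryanb@codingtemple.com        (555) 555-5543        CEO, Coding Temple        @ryanbutz\n"
)

pattern_text = r"([A-Za-z-]+), ([A-Za-z-]+).*(@[A-Za-z0-9_-]+)$"

# Without the flag, $ only matches at the very end of the text
without_flag = re.findall(pattern_text, sample)

# With the flag, $ also matches at the end of each line
with_flag = re.findall(pattern_text, sample, re.MULTILINE)

print(without_flag)  # [('Butz', 'Ryan', '@ryanbutz')]
print(with_flag)     # [('Hawkins', 'Derek', '@derekhawkins'), ('Butz', 'Ryan', '@ryanbutz')]
```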


### Regex project

Use python to read the file regex_test.txt and print the last name on each line using regular expressions and groups (return None for names with no first and last name, or names that aren't properly capitalized)
##### Hint: use with open() and readlines()

In [264]:
# Open the file and read it line by line
import re

with open(r'C:\Users\jmira\OneDrive\Coding Temple Cohort 0501 0623\week_3\regex_test.txt', encoding = 'utf-8') as f:
    lines = f.readlines()

# Print the matching name on each line, or None when the line has no
# properly capitalized first and last name
def validate_name(lines):
    pattern = re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]*\.?)* [A-Z][a-z]+")
    for line in lines:
        match = pattern.search(line)
        if match:
            print(match.group())
        else:
            print(None)
validate_name(lines)


Abraham Lincoln
Andrew P Garfield
Connor Milliken
Jordan Alexander Williams
None
None


In [231]:
"""
Expected Output
Abraham Lincoln
Andrew P Garfield
Connor Milliken
Jordan Alexander Williams
None
None
"""
