# Regular Expressions

# Tasks today:
1) <b>Importing</b> <br>
2) <b>Using Regular Expressions</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) re.compile() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) re.match() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; c) re.findall() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; d) re.search() <br>
3) <b>Sets</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) Integer Ranges <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) Character Ranges <br>
4) <b>Counting Occurrences</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) {x} <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) {x, y} <br>
 &nbsp;&nbsp;&nbsp;&nbsp; c) ? <br>
 &nbsp;&nbsp;&nbsp;&nbsp; d) * <br>
 &nbsp;&nbsp;&nbsp;&nbsp; e) + <br>
5) <b>In-Class Exercise #1</b> <br>
6) <b>Escaping Characters</b> <br>
7) <b>Grouping</b> <br>
8) <b>In-Class Exercise #2</b> <br>
9) <b>Opening a File</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) open() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) with open() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; c) re.match() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; d) re.search() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; e) Store the String in a Variable <br>
10) <b>Regex Project</b> <br>

### Importing <br>
<p>Regular expressions are supported in most programming languages. In Python, they are provided by the built-in module 're', which must be imported before use.</p>

In [3]:
# import re
import re

### Using Regular Expressions <br>
<p>Regular expressions give us the ability to search for patterns within text, strings, files, etc. They serve several uses, such as security checks (input validation), searching, filtering, and pattern recognition.</p>

#### RegEx Cheatsheet

In [None]:
########################
# DO NOT RUN THIS CELL #
########################

a, X, 9, < -- ordinary characters just match themselves exactly.
. (a period) -- matches any single character except newline '\n'
\w -- matches a "word" character: a letter or digit or underscore [a-zA-Z0-9_].
\W -- matches any non-word character.
\b -- matches word boundary (in between a word character and a non word character)
\s -- matches a single whitespace character -- space, newline, return, tab
\S -- matches any non-whitespace character.
\t, \n, \r -- tab, newline, return
\d -- matches any numeric digit [0-9]
\D -- matches any non-digit character.
^ -- matches the beginning of the string; as the first character inside [], it negates the set (omits the listed characters)
$ -- matches the end of the string
\ -- escapes special character.
(x|y|z) matches exactly one of x, y or z.
(x) in general is a remembered group. We can get the value of what matched by using the groups() method of the object returned by re.search.
x? matches an optional x character (in other words, it matches an x zero or one times).
x* matches x zero or more times.
x+ matches x one or more times.
x{m,n} matches an x character at least m times, but not more than n times.
?: matches an expression but does not capture it. Non-capturing group, written (?:x).
?= matches a suffix but excludes it from capture. Positive lookahead.
a(?=b) will match the "a" in "ab", but not the "a" in "ac"
In other words, a(?=b) matches the "a" which is followed by the string 'b', without consuming what follows the a.
?! matches if suffix is absent. Negative look ahead.
a(?!b) will match the "a" in "ac", but not the "a" in "ab"
?<= positive look behind
?<! negative look behind
[] matches any single character from the set listed inside the brackets

[^0-9] matches any non-digit character
[a-zA-Z] matches any letter (Add+ to find words but strip everything else)

########################
# DO NOT RUN THIS CELL #
########################
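
The lookahead and lookbehind entries in the cheatsheet are easier to follow on a small string. The following is a minimal sketch; the sample strings and variable names are made up for illustration.

```python
import re

# Positive lookahead: 'a' only when followed by 'b' (the 'b' is not part of the match)
lookahead = re.findall(r'a(?=b)', 'ab ac ab')

# Negative lookahead: 'a' only when NOT followed by 'b'
neg_lookahead = re.findall(r'a(?!b)', 'ab ac ab')

# Positive lookbehind: digits only when preceded by '$'
lookbehind = re.findall(r'(?<=\$)\d+', 'cost: $40, qty: 3')

# Negative lookbehind: digits not preceded by '$' or by another digit
neg_lookbehind = re.findall(r'(?<![\$\d])\d+', 'cost: $40, qty: 3')

print(lookahead)       # ['a', 'a']
print(neg_lookahead)   # ['a']
print(lookbehind)      # ['40']
print(neg_lookbehind)  # ['3']
```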

##### re.compile()

In [None]:
# compile() builds a reusable pattern object from the pattern string, for use with the regular expression methods below
pattern = re.compile('123abcd')
pattern
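
As a rough comparison, a compiled pattern and the module-level functions in `re` give the same result; compiling mainly saves retyping the pattern when it is reused. The sample text below is an assumption for illustration.

```python
import re

text = 'xx123abcdxx'

# Compiled once, then reused through its own methods
compiled = re.compile('123abcd')
from_compiled = compiled.search(text)

# Same search through the module-level function, passing the pattern as a string
from_module = re.search('123abcd', text)

print(from_compiled.span())  # (2, 9)
print(from_module.span())    # (2, 9)
print(compiled.pattern)      # 123abcd
```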


##### re.match()

In [None]:
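# match() only succeeds if the pattern is found at the very start of the string

test_string = '123abcd12334566aecrgtrt'
match = pattern.match(test_string)
print(match)
# Accessing the span of the match (match is None when the string does not start with the pattern)
if match:
    start, end = match.span()
    print(match.span())
    print(test_string[start:end])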
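
Because `match()` returns `None` when the pattern is not at the start of the string, calling `.span()` on the result without checking can raise an `AttributeError`. A small sketch of the guard, using made-up sample strings:

```python
import re

pattern = re.compile('123abcd')

at_start = pattern.match('123abcd and more')
not_at_start = pattern.match('more and 123abcd')

print(at_start)      # a match object covering (0, 7)
print(not_at_start)  # None

# Guard before calling methods on the result
for result in (at_start, not_at_start):
    if result:
        print(result.span())
    else:
        print('no match at the start')
```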

##### re.findall()

In [None]:
# findall() is much more inclusive than match(): it returns every non-overlapping occurrence of the pattern as a list, but gives no indices

finders = pattern.findall('123abcdab123cdababcaabcd123abcdabcd123RegExisfunaf')
print(finders)
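
When the positions of the matches are needed as well, `finditer()` is one option: it yields a match object per hit instead of a plain string. A minimal sketch reusing the same pattern and test string:

```python
import re

pattern = re.compile('123abcd')
text = '123abcdab123cdababcaabcd123abcdabcd123RegExisfunaf'

# finditer() yields one match object per occurrence, so spans are available
spans = [m.span() for m in pattern.finditer(text)]
print(spans)

# Each span can be used to slice the original string
for start, end in spans:
    print(start, end, text[start:end])
```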

##### re.search()

In [None]:
# search() looks for the first instance of the pattern anywhere in the string, whereas match() only looks at the start (index 0)

random_string = "123 123 234 123abcd abcd abc"

searching = pattern.search(random_string)
print(searching)

span = searching.span()
print(random_string[span[0]: span[1]])
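
Slicing with the span works, but the match object can also hand back the matched text directly. A short sketch on the same string:

```python
import re

pattern = re.compile('123abcd')
random_string = "123 123 234 123abcd abcd abc"

searching = pattern.search(random_string)

# group() returns the matched text; start() and end() are the two halves of span()
print(searching.group())                   # 123abcd
print(searching.start(), searching.end())  # 12 19
```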


### Sets <br>
<p>The following cells will allow you to use regular expressions to search for certain values within a range such as numbers 1 through 4.</p>

##### [a-z] or [A-Z] - any lowercase/uppercase letters from a to z<br/>[^2] - anything that's not 2

##### Integer Ranges

In [None]:
pattern_int = re.compile('[0-7][7-9][0-3]')

test = '67383'

random_numbers = pattern_int.search(test)
span = random_numbers.span()
print(test[span[0]:span[1]])

##### Character Ranges

In [None]:
char_pattern = re.compile('[A-Z][a-z]')

found = char_pattern.findall('Hello there, Mr. Anderson')
print(found)
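
The `[^2]` form mentioned in the heading above has no cell of its own, so here is a small sketch of negated sets. The sample strings are made up for illustration.

```python
import re

# [^2] matches any single character except '2'
not_two = re.findall('[^2]', '12321')

# The caret only negates when it is the first character inside the brackets;
# here the set excludes digits and spaces
words = re.findall('[^0-9 ]+', 'room 101 is open')

print(not_two)  # ['1', '3', '1']
print(words)    # ['room', 'is', 'open']
```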

### Counting Occurrences

##### {x} - something that occurs {num_of_times}

In [None]:
char_pattern_count = re.compile('[A-Z][a-z][0-3]{2}')

found_count = char_pattern_count.findall('Hello there Mr. An31derson')
print(found_count)

##### {x,y} - something that occurs between x and y times (no space after the comma in the actual pattern)

In [None]:
random_pattern = re.compile('m{1,5}') # unlike range(), the second number is included
random_statement = random_pattern.findall('This m is an example of a regular expression trying to find one m, more than one mmm or five mmmmms')

print(random_statement)
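
One detail worth checking when writing the braces: as far as I know, Python's `re` only treats `{x,y}` as a quantifier when there is no space inside the braces. The sketch below, on a made-up string, shows the difference.

```python
import re

text = 'm mm mmm'

# No space after the comma: a real quantifier, 2 to 3 repetitions of 'm'
with_quantifier = re.findall('m{2,3}', text)

# With a space, the braces are read as literal characters, so nothing matches here
with_space = re.findall('m{2, 3}', text)

print(with_quantifier)  # ['mm', 'mmm']
print(with_space)       # []
```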

##### ? - something that occurs 0 or 1 time

In [None]:
pattern_1 = re.compile('Mrss?')

found_1 = pattern_1.findall('Hello there, Mr. Anderson. How is Mrs. Anderson, and Mrss. Anderson?')

found_1

##### * - something that occurs at least 0 times

In [None]:
pattern_m = re.compile('M*s')

found_m = pattern_m.findall('MMMs name is Ms. Smith. This is Mssss Anderson')

print(found_m)

##### + - something that occurs at least once

In [None]:
pattern_2 = re.compile('M+s')

found_2 = pattern_2.findall('MMMs name is Ms. Smith. This is Mssssss Anderson')

print(found_2)
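
To compare `?`, `*`, and `+` directly, it can help to run all three against the same string. A minimal sketch with a made-up test string:

```python
import re

text = 'Mr Ms MMs s'

optional = re.findall('M?s', text)      # zero or one 'M' before 's'
zero_or_more = re.findall('M*s', text)  # any number of 'M', including none
one_or_more = re.findall('M+s', text)   # at least one 'M'

print(optional)      # ['Ms', 'Ms', 's']
print(zero_or_more)  # ['Ms', 'MMs', 's']
print(one_or_more)   # ['Ms', 'MMs']
```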

##### In-class exercise 1: 

Use a regular expression to find every number in the given string

In [None]:
number_patt = re.compile('[0-9]+')
my_string = "This string has 10909090 numbers, but it is only 1 string. I hope you solve this 2day."
found_nums = number_patt.findall(my_string)
print(found_nums)

### Escaping Characters

##### \w - look for any word character (a letter, digit, or underscore, including Unicode letters)<br/>\W - look for anything that isn't a word character

[History on Unicode](http://unicode.org/standard/WhatIsUnicode.html)

[More on Unicode Characters](https://en.wikipedia.org/wiki/List_of_Unicode_characters)

In [None]:
# raw strings (r'...') keep Python from interpreting the backslash before the regex engine sees it
pattern_3 = re.compile(r'[\w]+')
pattern_4 = re.compile(r'[\W]+')

found_3 = pattern_3.findall('This is a sentence. With an exclamation mark at the end!')
found_4 = pattern_4.findall('This is a sentence. With an exclamation mark at the end!')

print(found_3)
print(found_4)
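
In Python 3, `\w` on a `str` pattern covers more than `[a-zA-Z0-9_]`: accented and other Unicode letters count as word characters unless the `re.ASCII` flag is passed. A short sketch, with a made-up sample string written using escapes so the accents are unambiguous:

```python
import re

text = 'caf\u00e9_99 na\u00efve!'  # 'café_99 naïve!'

unicode_words = re.findall(r'\w+', text)
ascii_words = re.findall(r'\w+', text, re.ASCII)

print(unicode_words)  # ['café_99', 'naïve']
print(ascii_words)    # ['caf', '_99', 'na', 've']
```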

##### \d - look for any digit 0-9<br/>\D - look for anything that isn't a digit

In [None]:
pattern_nums = re.compile(r'\d+[a-z]{2}') # one or more digits followed by two lowercase letters (th, rd, ...)
found_date = pattern_nums.findall('Today is the 19th, tomorrow is the 20th. My birthday is the 3rd. Today is the 100th day of the year.')

print(found_date)
for date in found_date:
    print(date)

##### \s - look for any white space<br/>\S - look for anything that isn't whitespace

In [None]:
pattern_no_space = re.compile(r'\S[a-z]+')
pattern_space = re.compile(r'\s+')

found_space = pattern_space.findall('Are you afraid   of the  dark??')
print(found_space)

found_no_space = pattern_no_space.findall('Are you afraid   of the  dark??')
print(found_no_space)


##### \b - look for boundaries or edges of a word<br/>\B - look for anything that isn't a boundary

In [None]:
# \b matches the zero-width position between a word character and a non-word character (or the start/end of the string); it does not consume the space, \n, or \t itself.
# Good use case is if looking for a whole word that is not a subset of another word
# need the r before the quotes to make a raw string; otherwise Python reads '\b' as a backspace character

pattern_boundary = re.compile(r'\bTheCodingTemple\b')
pattern_not_bound = re.compile(r'\BTheCodingTemple\B') #looking for TheCodingTemple that does NOT have boundaries

found_bound = pattern_boundary.findall('       TheCodingTemple    ')
print(found_bound)

no_found_bound = pattern_not_bound.findall('1234TheCodingTempleblahblahblahetc')
print(no_found_bound)
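
The whole-word use case mentioned in the comments can be sketched as follows. The word `cat` and the sample sentence are assumptions chosen for illustration.

```python
import re

text = 'cat concatenate cat.'

anywhere = re.findall('cat', text)         # also hits the 'cat' inside 'concatenate'
whole_word = re.findall(r'\bcat\b', text)  # only the stand-alone words

print(len(anywhere))    # 3
print(len(whole_word))  # 2

# Without the r prefix, '\b' is a backspace character, not a word boundary
no_raw = re.findall('\bcat\b', text)
print(no_raw)           # []
```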

In [None]:
# looking for just letters in a string: re.compile('[a-zA-Z]+'). 
#If you're just looking for characters that aren't numbers, it's [^0-9]


### Grouping

In [None]:
my_string_again = "Max Smith, aaron rodgers, Sam Darnold, LeBron James, Micheal Jordan, Kevin Durant, Patrick McCormick"

#group of names RegEx compiler using 2 separate groups for first/last names

pattern_first = re.compile('([A-Z][a-zA-Z]+) ([A-Z][a-zA-Z]+)')

found_names = pattern_first.findall(my_string_again)
print(found_names)

# for name in found_names:
#     print(f'First Name: {name[0]} \n Last Name: {name[1]}')
    
#Splitting and using .search() syntax to get a match object

for name in my_string_again.split(', '):
    match = pattern_first.search(name)
    
    if match:
        print(name)
    else:
        print('Not a name')
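
The loop above only checks whether a match exists. To pull the individual groups out of a match object, `group()` and `groups()` can be used, roughly like this:

```python
import re

pattern_first = re.compile('([A-Z][a-zA-Z]+) ([A-Z][a-zA-Z]+)')
match = pattern_first.search('LeBron James')

print(match.group(0))  # LeBron James  (the whole match)
print(match.group(1))  # LeBron        (first group)
print(match.group(2))  # James         (second group)
print(match.groups())  # ('LeBron', 'James')
```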

##### In-class Exercise 2:

Write a function using regular expressions to find the domain name in the given email addresses (and return None for the invalid email addresses)<br><b>HINT: Use '|' for either or</b>

In [None]:
my_emails = ["jordanw@codingtemple.orgcom", "pocohontas1776@gmail.com", "helloworld@aol..com",
             "yourfavoriteband@g6.org", "@codingtemple.com"]

# email_pattern = re.compile('([A-Za-z0-9]+)@([A-Za-z0-9]+)(com|org$)')
# raw string with an escaped dot, so that '\.' only matches a literal period
def identifyEmails(a_list): # will give me a more properly formatted email in this case
    email_pattern = re.compile(r'([\w]+)@([\w]+)\.(com|org)$')
    for user in a_list:
        if email_pattern.match(user):
            print(user)
        else:
            print('None')

def identify_Emails(a_list): #will give 'metadata' style result (the match object)
    email_pattern = re.compile(r'([\w]+)@([\w]+)\.(com|org)$')
    for user in a_list:
        email = email_pattern.search(user)
        if email:
            print(email)
        else:
            print('None')
            
identifyEmails(my_emails)

identify_Emails(my_emails)

# The $ at the end of the pattern anchors the match to the end of the string, so nothing may follow com or org

#.com OR .org => com|org

#Expected output:
#None
#pocohontas1776@gmail.com
#None
#yourfavoriteband@g6.org
#None

# for address in addresses:
#     print address
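
The exercise asks for a function that returns the domain name rather than printing the whole address. One possible sketch is below; the name `find_domain` is hypothetical, and it assumes "domain name" means the part between the `@` and the final `.com` or `.org`.

```python
import re

# Hypothetical helper, not part of the exercise text.
# Group 1 captures the domain; (?:com|org) is non-capturing.
EMAIL_PATTERN = re.compile(r'\w+@(\w+)\.(?:com|org)$')

def find_domain(address):
    """Return the domain name of a valid address, or None."""
    found = EMAIL_PATTERN.match(address)
    return found.group(1) if found else None

my_emails = ["jordanw@codingtemple.orgcom", "pocohontas1776@gmail.com", "helloworld@aol..com",
             "yourfavoriteband@g6.org", "@codingtemple.com"]

domains = [find_domain(email) for email in my_emails]
print(domains)  # [None, 'gmail', None, 'g6', None]
```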


### Opening a File <br>
<p>Python gives us a couple of ways to open and read files; below are the two used most often.</p>

##### open()

In [None]:
file = open("files/names.txt") #opens the file but doesn't do anything with it yet

data = file.read() #actually reads the information
print(data)

file.close() #get in the habit of closing the file you just opened so it doesn't sit open in the background

##### with open()

In [None]:
#with open() opens the file for the indented block, and once the block is done, it automatically closes the file.

with open('files/names.txt') as file: 
    data = file.read()
    print(data)
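
The automatic closing can be checked through the file object's `closed` attribute. The sketch below writes a small temporary file first so that it runs anywhere; the file contents are a made-up stand-in for `files/names.txt`.

```python
import os
import tempfile

# Hypothetical stand-in for files/names.txt so the sketch is self-contained
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'names.txt')
    with open(path, 'w') as out_file:
        out_file.write('first line\nsecond line\n')

    with open(path) as file:
        data = file.read()
        print(file.closed)  # False, still inside the with block

    print(file.closed)      # True, closed automatically on exit

print(data.count('\n'))     # 2
```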

##### re.match()

In [None]:
#with the read data, you can match, search, findall, etc.
# because THIS IS ONE GIANT STRING!!!

print(re.match('Hawkins, Derek', data))

##### re.search()

In [None]:
print(re.search('vader', data))

##### Store the String to a Variable

In [None]:
answer = input('What do you want to look for?')

found = re.findall(answer, data)

if found:
    print(f'Here is your answer...{found}')
else:
    print('Sorry -- you have the wrong number.')
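
Note that whatever the user types is treated as a regular expression, so characters such as `.` or `$` act as operators. If the intent is to search for the text literally, `re.escape()` is one way to handle that. In the sketch below, `data` and `answer` are made-up stand-ins for the file contents and the typed input.

```python
import re

data = 'Price list (2024): $5.00 each'
answer = '$5.00'  # stand-in for what a user might type at the input() prompt

# Used as a pattern, '$' and '.' are regex operators, so nothing is found
as_pattern = re.findall(answer, data)

# re.escape() backslash-escapes the special characters so the text is matched literally
as_literal = re.findall(re.escape(answer), data)

print(as_pattern)  # []
print(as_literal)  # ['$5.00']
```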

### Homework Exercise 1 <br>
<p>Print each person's name and Twitter handle, using groups. The output should look like:</p>
<p>==============<br>
   Full Name / Twitter<br>
   ==============</p>
Derek Hawkins / @derekhawkins

 Sven-Erik Osterberg / @sverik

 Ryan Butz / @ryanbutz

 Example Exampleson / @example

 Ripal Pael / @ripalp

 Darth Vader / @darthvader

In [20]:
with open('files/names.txt') as file: 
    data = file.readlines()

# print(data[0])

# for line in data:
#     handle_pattern = re.compile('@[a-zA-Z0-9]+\s')
#     handle = re.findall(handle_pattern, line)
#     print(handle)
    
    
#     if line[-1]:
#         print(handle)
#         last_first_name = re.compile('([A-Z][a-zA-Z]+), ([A-Z][a-zA-Z]+)', line)
#         name = re.findall(last_first_name, line)
#         for last, first in line[0]:
#             print(f'{line[0][1]} {line[0][0]} / {handle}')

# last_first_name = re.compile('([A-Z][a-zA-Z]+), ([A-Z][a-zA-Z]+)')
# name = re.findall(last_first_name, data)
# print(f'{name[0][1]} {name[0][0]}')

# handle_pattern = re.compile('@[a-zA-Z0-9]+\n')
# handle = re.findall(handle_pattern, data)
    
    
#     for last, first in names and handle in handles:
#          print(f'{first} {last} / {handle}')
    
#can use file.readlines() to break up the data into a list of strings, one per line

#looking for people with a first and last name AND a twitter handle
#once I have that list of tuples, I can index into each and print just what I want.
#break it down

#what am I looking for?
#I'm only looking for those names that have a twitter handle, so if a line has both then print

# compile both patterns once, before the loop (raw strings keep the backslashes intact)
last_first_name = re.compile(r'([A-Z][a-zA-Z]+), ([\w+]*-*[A-Z][a-zA-Z]+)')
handle_pattern = re.compile(r'@[a-zA-Z0-9]+\n')  # twitter handle at the end of a line

for line in data:
    names = last_first_name.findall(line)
    handles = handle_pattern.findall(line)
    for handle in handles:
        twitter_handle = handle[:-1]  # drop the trailing newline
        print(f'{names[0][1]} {names[0][0]} / {twitter_handle}')
    


Derek Hawkins / @derekhawkins
Sven-Erik Osterberg / @sverik
Ryan Butz / @ryanbutz
Example Exampleson / @example
Ripal Pael / @ripalp
Darth Vader / @darthvader


In [21]:
import re

pattern = re.compile(r"(\w+), ([\w+]*-*[\w]+).*\s(@[\w]+)")  # groups: last name, first name, handle

# for data_groups in data:
#     print(data_groups)


for twitter_handle in data:
    match = pattern.search(twitter_handle)
    
    if match:
        print(f'{match.group(2)} {match.group(1)} / {match.group(3)}')

Derek Hawkins / @derekhawkins
Sven-Erik Osterberg / @sverik
Ryan Butz / @ryanbutz
Example Exampleson / @example
Ripal Pael / @ripalp
Darth Vader / @darthvader


### Homework Exercise 2

Use Python to read the file regex_test.txt and print the full name on each line using regular expressions and groups (print None for names with no first and last name, or names that aren't properly capitalized)
##### Hint: use with open() and readlines()

In [None]:
#if there is a middle name or abbreviation, print that as well 
#but given the data we have, the output is below; look at the aaron rodgers example above

"""
Expected Output
Abraham Lincoln
Andrew P Garfield
Connor Milliken
Jordan Alexander Williams
None
None
"""



In [22]:
import re

with open('files/regex_test.txt') as file: 
    data_1 = file.readlines()
    
# print(data_1)

# groups: first name, optional middle name or initial, last name
name_pattern = re.compile(r'([A-Z][a-z]+) *([\w]+)* ([A-Z][a-z]+)')
# name_pattern_2 = re.compile('([A-Z][a-z]+) ([A-Z][a-z]+)')

for line in data_1:
    full_name = name_pattern.search(line)
    if not full_name:
        print('None')
    elif not full_name.group(2):
        # no middle name or initial was captured
        print(f'{full_name.group(1)} {full_name.group(3)}')
    else:
        print(f'{full_name.group(1)} {full_name.group(2)} {full_name.group(3)}')
    
    



Abraham Lincoln
Andrew P Garfield
Connor Milliken
Jordan Alexander Williams
None
None
