## Lecture 13: Regular Expressions in Python

This one is going to be short and pretty fun.

Topics to cover:

1. What are regular expressions and why are they useful?
2. How to design and use a regular expression
3. When to use and when not to use a regular expression

Let's start with a motivating example:

In [121]:
# Let's imagine Google buys Stanford and it becomes a commercial institution...
# We need to change everyone's .edu address to a .com!

my_email = 'aguoman@stanford.edu'
#my_new_email = ??

#print(my_new_email)



In [3]:
# Some valid options

# mostly works... replace
my_email = "aguoman.educator@stanford.edu"
my_new_email = my_email.replace(".edu", ".com")
print(my_new_email)


aguoman.comcator@stanford.com


In [4]:
# this is actually good, but not very nice looking... slicing
my_new_email = my_email[:-4] + ".com"
print(my_new_email)

aguoman.educator@stanford.com


In [5]:
# don't do this... but this works... split/join
#...
# first get a list
my_new_email = my_email.split('.')
print(my_new_email)

# change the elements in the list
my_new_email.remove('edu')
my_new_email.append('com')

print('.'.join(my_new_email))

['aguoman', 'educator@stanford', 'edu']
aguoman.educator@stanford.com


#### BUT... Some of these don't work for all emails!

It would be nice if we could say "only replace at the end of the string"...

Regular expressions allow us to do this and much more!

They can...

1. match beginning and end of strings
2. extract/substitute parts of the string that we're searching
3. search on the basis of patterns in strings
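As a minimal sketch of the first two points, the motivating example can be handled with `re.sub` and the `$` end-of-string anchor. The helper name `edu_to_com` is made up for illustration.

```python
import re

def edu_to_com(email):
    # "\." matches a literal dot; "$" anchors the match to the end of the string,
    # so an ".edu" in the middle of the address is left alone
    return re.sub(r"\.edu$", ".com", email)

new_email = edu_to_com("aguoman.educator@stanford.edu")
print(new_email)  # aguoman.educator@stanford.com
```

Unlike the plain `replace` attempt, the "educator" part of the address survives.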

In [28]:
import re

# a regular expression is sometimes called a "pattern" because it describes the shape and form of a string

# as an example
telephone_number = "(801)-712-1238"

# the structure is (numberything)-numberything-numberything
# regular expressions give a way of encoding this sort of logic!

#### One basic thing we can do is "search"

In [16]:
# Let's look at searches:

# does a string contain a substring?
# basic python:
if "exam" in "We are having an exam next week.":
    print("Oh no!")
else:
    print("Whew")


Oh no!


In [15]:
# Doing this with the regular expression library is easy

match = re.search("exam", "We are having an exam next week.")

if match:
    print("Oh no! (But thank god for regexes).")
else:
    print("Whew")

Oh no! (But thank god for regexes).


In [22]:
# That was too easy, what if we wanted to see if any numbers are in a string?

test_string = "Does this strin9 contain any numb3r5?"

# Volunteer python solutions?


In [52]:
# one idea: test each character c against the digits '0'..'9' (c.isdigit() does this for us)

def contains_num(test_string):
    for c in test_string:
        if c.isdigit():
            return True
    return False
        
contains_num(test_string)

True

In [25]:
# How about using a set and some functional programming?
test_string = "Does this strin9 contain any numb3r5?"

if set(map(str, range(10))).intersection(test_string):
    print("Found some numbers!")
else:
    print("Nope. No numbers here.")

Found some numbers!


In [26]:
# It's easier with regexes!
import re
match = re.search("[0-9]", test_string)
print(match)

<_sre.SRE_Match object; span=(15, 16), match='9'>
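The match object printed above carries more than a yes/no answer. Here is a small sketch of the most commonly used accessors; note that `re.search` returns `None` when nothing matches, which is why `if match:` works as a test.

```python
import re

test_string = "Does this strin9 contain any numb3r5?"

match = re.search(r"[0-9]", test_string)
if match:
    print(match.group())  # the text that matched: 9
    print(match.span())   # (start, end) indices: (15, 16)

# no digits anywhere, so search gives back None
no_match = re.search(r"[0-9]", "no digits here")
print(no_match)  # None
```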


In [31]:
# okay... that was kinda cool... how about testing whether
# a string is a telephone number?

print(telephone_number)

# just to add extra noise
telephone_number += "@"

# the structure is (numberything)-numberything-numberything

(801)-712-1238@


In [32]:
# raw string (r"...") so the backslashes reach the regex engine untouched
match = re.search(r"\(\d+\)-\d+-\d+", telephone_number)

print(match)

<_sre.SRE_Match object; span=(0, 14), match='(801)-712-1238'>
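Notice that `re.search` happily finds the phone number inside the noisy string. If the question is whether the whole string is a telephone number, one option is `re.fullmatch`, which only succeeds when the pattern covers the entire string. A quick sketch, with variable names chosen just for this example:

```python
import re

phone_pattern = r"\(\d+\)-\d+-\d+"

noisy = "(801)-712-1238@"
clean = "(801)-712-1238"

found_in_noisy = re.search(phone_pattern, noisy)     # looks anywhere in the string
whole_noisy = re.fullmatch(phone_pattern, noisy)     # must cover the entire string
whole_clean = re.fullmatch(phone_pattern, clean)

print(found_in_noisy.group())  # (801)-712-1238
print(whole_noisy)             # None
print(whole_clean.group())     # (801)-712-1238
```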


#### Regex Characters

There are quite a few with special behavior, some of which we've already seen!

1. "[ ]" is kind of like an "or" in Python: it accepts any one of the characters inside (you can also put ranges inside!)
2. "0,1,..9,a..z,A..Z" all match themselves
3. "." matches any character (except newline)!
4. "\w" word-like character (letter, digit, or underscore)
5. "\W" non-word-like character
6. "\s" space-like character (space, tab, newline)
7. "\d" digit

Also some "control" type characters:

1. "+" matches *at least* one of whatever precedes it
   1. "0+" matches at least one "0"
   2. "[A-Z]+" matches at least one capital letter
2. "\*" matches *any number* of whatever precedes it (matches null string as well)
3. "?" matches 0 or 1 of whatever precedes it (means *optional*)
4. "^" matches start of string
5. "\$" matches end of string
6. "\" escapes things!
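To make the two lists above concrete, here is a small sketch that tries each special character on a toy string. The helper `first_match` is hypothetical, written only for this demo; it returns the matched text, or `None` if there is no match.

```python
import re

def first_match(pattern, text):
    # hypothetical helper: matched text of the first hit, or None
    m = re.search(pattern, text)
    return m.group() if m else None

print(first_match(r"[abc]", "xyzb"))             # b
print(first_match(r"[A-Z]+", "abcDEFghi"))       # DEF
print(first_match(r"a.c", "xx abc"))             # abc
print(first_match(r"0+", "1000 and 00"))         # 000
print(first_match(r"ab*", "a"))                  # a  (zero b's is fine)
print(first_match(r"colou?r", "colour"))         # colour
print(first_match(r"colou?r", "color"))          # color
print(first_match(r"^\d+", "42 apples"))         # 42
print(first_match(r"^\d+", "apples 42"))         # None (digits are not at the start)
print(first_match(r"\d+$", "apples 42"))         # 42
print(first_match(r"\w+\s\w+", "hello world!"))  # hello world
print(first_match(r"\.", "3.14"))                # .  (escaped, so a literal dot)
```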


Also... some tricky parens...

In [33]:
# Gotcha time...

# regular expression to match a single \ ?
test = "\\"
print(test)
print(re.search("\\\\", test))

# Eww! Good thing Python has raw strings (backslashes are left alone, not treated as escapes)
raw_test = r'\\'
print(raw_test)
print(re.search(r'\\', test))

\
<_sre.SRE_Match object; span=(0, 1), match='\\'>
\\
<_sre.SRE_Match object; span=(0, 1), match='\\'>
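The difference between the two kinds of string literal is easy to check without any regex at all. A short sketch:

```python
# A normal string literal processes backslash escapes; a raw literal does not.
normal = "\\d"  # two characters: a backslash and a "d"
raw = r"\d"     # the same two characters, written without doubling

print(normal == raw)  # True
print(len("\\"))      # 1: a single backslash
print(len(r"\\"))     # 2: two backslashes
print(len("\n"))      # 1: a newline character
print(len(r"\n"))     # 2: a backslash and an "n"
```

This is why regex patterns are usually written as raw strings: what you type is what the regex engine sees.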


In [46]:
# Practice time:

# match a vector of 4 floats (no space, but comma separated)
ex_vec = "(1.231,0.0012348,0,-5.9)"

# expr = ???

match = re.search(expr,ex_vec)
print(match)

<_sre.SRE_Match object; span=(0, 24), match='(1.231,0.0012348,0,-5.9)'>


In [63]:
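# `example` is not defined in the cells shown here; a string of 14 dots
# (an assumption) reproduces the output below
example = "." * 14
match = re.search(r'-?[0-9\.]+[0-9]*', example)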
print(match)

<_sre.SRE_Match object; span=(0, 14), match='..............'>


In [96]:
match = re.search(r"\(-?[0-9\.]+[0-9]*,-?[0-9\.]+[0-9]*,-?[0-9\.]+[0-9]*,-?[0-9\.]+[0-9]*\)", ex_vec)
print(match)

<_sre.SRE_Match object; span=(0, 24), match='(1.231,0.0012348,0,-5.9)'>
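The pattern above works on `ex_vec`, but `[0-9\.]+` also accepts a run of nothing but dots, as the earlier output showed. One possible tightening, using only the pieces introduced so far, is to require at least one digit before an optional dot. This is a sketch, not the only answer: it still accepts "1." and rejects ".5". The names `number` and `vector_pattern` are made up for this example.

```python
import re

ex_vec = "(1.231,0.0012348,0,-5.9)"

# one number: optional minus sign, at least one digit, optional dot, optional more digits
number = r"-?[0-9]+\.?[0-9]*"

# four numbers separated by commas, wrapped in literal parentheses
vector_pattern = r"\(" + ",".join([number] * 4) + r"\)"

vec_match = re.search(vector_pattern, ex_vec)
print(vec_match.span())  # (0, 24)

# a string of dots no longer counts as a number
dots_match = re.search(number, "..............")
print(dots_match)  # None
```

Building the long pattern from one short piece also makes it much easier to read than the copy-pasted version.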


## Matching Groups with Parentheses

In [71]:
# Let's modify that last example so that we can print out the numbers in the vector

In [72]:
match = re.search(r"\((-?[0-9\.]+[0-9]*),(-?[0-9\.]+[0-9]*),(-?[0-9\.]"
                  r"+[0-9]*),(-?[0-9\.]+[0-9]*)\)", ex_vec)

TypeError: unsupported operand type(s) for +=: 'int' and 'str'
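Here is a sketch of how the captured groups can be read back out. Each group comes back as a `str`, so adding one directly to an `int` raises the kind of `TypeError` shown above; that is presumably what happened in the original cell, though the code that triggered it is not shown. Converting with `float` first avoids the problem. The pattern is the same one as above, just assembled from a repeated piece (`number_group` is a name made up for this example).

```python
import re

ex_vec = "(1.231,0.0012348,0,-5.9)"

# the same grouped pattern as above, built from one repeated piece
number_group = r"(-?[0-9\.]+[0-9]*)"
pattern = r"\(" + ",".join([number_group] * 4) + r"\)"

match = re.search(pattern, ex_vec)

print(match.group(0))  # the whole match: (1.231,0.0012348,0,-5.9)
print(match.group(1))  # the first parenthesized group: 1.231
print(match.groups())  # ('1.231', '0.0012348', '0', '-5.9')

# groups are strings, so convert before doing arithmetic
values = [float(g) for g in match.groups()]
print(values)  # [1.231, 0.0012348, 0.0, -5.9]
```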

#### Extras

The 're' module has a lot of other features, including substitution (re.sub), re.split, re.findall, etc.

Mostly search gets the job done, but it's good to know the module has all this stuff!
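A quick taste of three of those extras, as a sketch on toy strings made up for this example:

```python
import re

# findall: every non-overlapping match, as a list of strings
digit_runs = re.findall(r"\d+", "(801)-712-1238")
print(digit_runs)  # ['801', '712', '1238']

# split: like str.split, but the separator is a pattern
parts = re.split(r"[,;]\s*", "a, b;c")
print(parts)  # ['a', 'b', 'c']

# sub: replace every match of the pattern
tidy = re.sub(r"\s+", " ", "too    many   spaces")
print(tidy)  # too many spaces
```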

## Review Slide

In [1]:
import re
# CHEATSHEET at regexr website (much better than what I have here)

# use "[]" to match any of the things inside

# +, *, ? are for matching at least one, any number, and 0 or 1 things

# a-z, A-Z, 3-8, etc... ranges!

# \d digits

# \w word characters

# \W non word characters

# () for a group

# just as an example... regular expressions can get complicated!

In [84]:
# And a Perl regular expression to match EMAIL!!! (properly)

#Don't do this or I'll cry.

(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:
\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(
?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ 
\t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0
31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\
](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+
(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:
(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)
?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\
r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[
 \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)
?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t]
)*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[
 \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*
)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)
*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+
|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r
\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:
\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t
]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031
]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](
?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?
:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?
:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?
:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?
[ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] 
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|
\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>
@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"
(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t]
)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?
:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[
\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-
\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(
?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;
:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([
^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\"
.\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\
]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\
[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\
r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] 
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]
|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \0
00-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\
.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,
;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?
:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*
(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[
^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]
]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*(
?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(
?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[
\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t
])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t
])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?
:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|
\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:
[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\
]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)
?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["
()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)
?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>
@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[
 \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,
;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t]
)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?
(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:
\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[
"()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])
*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])
+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\
.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(
?:\r\n)?[ \t])*))*)?;\s*)

SyntaxError: invalid syntax (<ipython-input-84-889289e42b8f>, line 5)

In [None]:
# Just for fun!!!

import turtle

turtle.color('red', 'yellow')
turtle.begin_fill()
while True:
    turtle.forward(200)
    turtle.left(170)
    if abs(turtle.pos()) < 1:
        break
turtle.end_fill()
turtle.done()

In [7]:
import unicodedata

print(unicodedata.lookup('SLICE OF PIZZA'))
print(unicodedata.name('👌'))
print(unicodedata.numeric('¼'))

🍕
OK HAND SIGN
0.25
