# Regular Expressions

Regular expressions are a mini-language, used to parse and extract information from strings.

### Motivation: slicing vs split vs regex

Given strings such as:

"01/09/2008", "05/12/2012"

we know we can extract the year this way:

In [1]:
dates = ["01/09/2008", "05/12/2012"]

for d in dates:
    print(d[-4:]) # use normal indexing

2008
2012


If we had strings such as:

"In the year 2008 we did such as such"
"After the year 2009 we continued something else"

We can no longer use slicing, but we can just split the string and get the 4th value to get the year:

In [4]:
sentences = ["In the year 2008 we did such as such"
             , "After the year 2009 we continued something else"]

for s in sentences:
    print(s.split(" ")[3])

2008
2009


How do we extract the years from the following movie titles?

"2019: After the Fall of New York"

"The exterminators of the year 3000"

"1990: The Bronx Warriors"

The first inclination of novice programmers would be to split each movie title above into tokens, go through the tokens, and check whether each one consists only of digits. If it does, extract that token as the year.
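
A minimal sketch of that split-and-check approach is shown below. The helper name `find_year` is made up for illustration, and it assumes the year is the only all-digit token in the title.

```python
titles = ["2019: After the Fall of New York",
          "The exterminators of the year 3000",
          "1990: The Bronx Warriors"]


def find_year(title):
    """Return the first token made up only of digits, or None."""
    for token in title.split():
        token = token.strip(":")  # "2019:" -> "2019"
        if token.isdigit():
            return token
    return None


years = [find_year(t) for t in titles]
print(years)  # ['2019', '3000', '1990']
```

It works, but every new format (a trailing comma, parentheses around the year) needs another special case.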

This pattern of coding comes up so often that there is a special way of extracting such information: regular expressions!

In [14]:
import re # <= regular expression library


movies = ["01 - 2019: After the Fall of New York"
          , "02 - The exterminators of the year 3000"
          , "03 - 1990: The Bronx Warriors"]


for m in movies:
    a = re.match(r"(\d+)*(\d+)*", m)
    print(a)
    print(a.group(0), a.group(1), a.group(2))

<re.Match object; span=(0, 2), match='01'>
01 01 None
<re.Match object; span=(0, 2), match='02'>
02 02 None
<re.Match object; span=(0, 2), match='03'>
03 03 None


**...what??**

Some people don't like regular expressions:

> Some people, when confronted with a problem, think
> “I know, I'll use regular expressions.” Now they have two problems.
>
> - Jamie Zawinski
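
The surprising output above has a simple cause: `re.match` only tries to match at the start of the string, so the greedy `(\d+)*` grabs the leading "01" and the second group never matches anything. One possible fix, assuming a year is always a standalone four-digit number, is to use `re.search` with a stricter pattern. This is a sketch, not the only way to do it:

```python
import re

movies = ["01 - 2019: After the Fall of New York",
          "02 - The exterminators of the year 3000",
          "03 - 1990: The Bronx Warriors"]

# \b is a word boundary, \d{4} is exactly four digits
year_regex = re.compile(r"\b(\d{4})\b")

years = []
for title in movies:
    found = year_regex.search(title)  # search scans the whole string
    years.append(found.group(1) if found else None)

print(years)  # ['2019', '3000', '1990']
```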

## Regular expressions in context

Regular expressions were invented, in their modern form, in 1951 by Stephen Kleene. They have their roots in theoretical computer science, although they have become extremely useful as a text parsing tool.

Practically every language has regular expressions built in. They are often heavily optimized and usually expressed in a terse, archaic syntax.

Regular expressions allow you to use basic components to parse a language. Here are some pseudo-code examples of regular expressions:

Find all characters which are digits

Find all characters which are digits, followed by another digit

Find all characters which are at the beginning of a line, are of one of the following characters: [,.!;:], followed by 3 digits, followed by a comma, followed by three characters which are NOT digits
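
As a rough illustration, the three pseudo-code descriptions above could be written in Python's `re` syntax as follows. The sample `text` is invented for this sketch, and the second pattern is only one reading of "a digit followed by another digit".

```python
import re

# Made-up sample text with three lines
text = "a1b22\n;123,abc rest\nx;456,def"

# 1. Every single digit
digits = re.findall(r"\d", text)

# 2. A digit immediately followed by another digit
pairs = re.findall(r"\d\d", text)

# 3. At the start of a line: one of , . ! ; : then three digits,
#    a comma, and three characters which are NOT digits
complex_matches = re.findall(r"^[,.!;:]\d{3},\D{3}", text, flags=re.MULTILINE)

print(digits)           # ['1', '2', '2', '1', '2', '3', '4', '5', '6']
print(pairs)            # ['22', '12', '45']
print(complex_matches)  # [';123,abc']
```

The `re.MULTILINE` flag makes `^` match at the start of every line instead of only at the start of the whole string.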

## Sample regular expressions

In [22]:
ages = "Papa is 102, Homer is 38 years old, Marge is 36 years old, Bart is 10 years old, Lisa is 8 years old and Maggie is 3."

# Task: Extract all ages
# Thinking: Find all numbers
# Regex pseudo code: find runs of one or more consecutive digits

regex_attempt1 = r"(\d+)" # <= one or more digits in a row

for m in re.finditer(regex_attempt1, ages): 
    print("Match starts at",m.start(), "ends at", m.end(), "and contains", m.group())

Match starts at 8 ends at 11 and contains 102
Match starts at 22 ends at 24 and contains 38
Match starts at 45 ends at 47 and contains 36
Match starts at 67 ends at 69 and contains 10
Match starts at 89 ends at 90 and contains 8
Match starts at 115 ends at 116 and contains 3


In [24]:
# Task: Extract all ages
# Thinking: Find all numbers
# Regex pseudo code: find one digit OR two digits

regex_attempt2 = r"(\d|\d\d)" # <= alternation tries \d first, so every match is a single digit

for m in re.finditer(regex_attempt2, ages):
    print("Match starts at", m.start(), "ends at", m.end(), "and contains", m.group())

Match starts at 8 ends at 9 and contains 1
Match starts at 9 ends at 10 and contains 0
Match starts at 10 ends at 11 and contains 2
Match starts at 22 ends at 23 and contains 3
Match starts at 23 ends at 24 and contains 8
Match starts at 45 ends at 46 and contains 3
Match starts at 46 ends at 47 and contains 6
Match starts at 67 ends at 68 and contains 1
Match starts at 68 ends at 69 and contains 0
Match starts at 89 ends at 90 and contains 8
Match starts at 115 ends at 116 and contains 3
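
The difference between the two attempts above comes down to how alternation works: the regex engine tries the alternatives left to right and keeps the first one that matches. The following sketch compares three patterns on the same `ages` string. The variable names are only for illustration.

```python
import re

ages = ("Papa is 102, Homer is 38 years old, Marge is 36 years old, "
        "Bart is 10 years old, Lisa is 8 years old and Maggie is 3.")

# \d is tried first and always succeeds, so \d\d is never used
single_first = re.findall(r"\d|\d\d", ages)

# \d\d is tried first, so 102 is split into '10' and '2'
double_first = re.findall(r"\d\d|\d", ages)

# + is greedy: it takes as many consecutive digits as possible
greedy = re.findall(r"\d+", ages)

print(single_first)  # ['1', '0', '2', '3', '8', '3', '6', '1', '0', '8', '3']
print(double_first)  # ['10', '2', '38', '36', '10', '8', '3']
print(greedy)        # ['102', '38', '36', '10', '8', '3']
```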


## Just use http://www.pyregex.com/ or https://www.debuggex.com/

**Exercise** Extract area codes from the following phone numbers. You _must_ write a single regex which is able to extract the area code from each of the following numbers (in a loop):

1-201-123-1234

98-708-567-7890

0-708-333-4444

In the above numbers, the area codes are 201, 708 and 708, respectively.

In [75]:
area_code_regex = r"\d+-(\d+)-\d+-\d+"

for ac in ["1-201-123-1234", "98-708-567-7890", "0-708-333-4444"]:
    print(re.findall(area_code_regex, ac))

['201']
['708']
['708']


Hint: Look for the start of string, then one or more digits, then a dash, THEN the digits which contain our area code. Ignore the rest.
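
Following the hint literally gives a slightly different solution from the one above. This sketch anchors at the start of the string and ignores everything after the area code. It assumes every number starts with a country code and a dash.

```python
import re

# ^     start of string
# \d+-  country code and a dash
# (\d+) the area code (captured)
anchored_regex = re.compile(r"^\d+-(\d+)")

numbers = ["1-201-123-1234", "98-708-567-7890", "0-708-333-4444"]

area_codes = []
for number in numbers:
    m = anchored_regex.match(number)
    area_codes.append(m.group(1) if m else None)

print(area_codes)  # ['201', '708', '708']
```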


## What regular expressions can't do

Regular expressions are part of a theoretical framework which defines languages. Some formalisms in that framework are less powerful than regular expressions, and some are more powerful.

For example, regular expressions are not able to correctly parse this expression:

`1 + (2 * (3 + 8))`

In order to parse the expression above, after each left parenthesis, we would have to use recursion. Regular expressions are not designed to parse such recursive expressions.
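
A small experiment hints at the problem. Neither the lazy nor the greedy pattern below can reliably pair each left parenthesis with its matching right parenthesis. The second test string is made up for this sketch.

```python
import re

expr = "1 + (2 * (3 + 8))"

# Lazy: stops at the FIRST ')' it finds, so the match is unbalanced
lazy = re.findall(r"\(.*?\)", expr)
print(lazy)  # ['(2 * (3 + 8)']

# Greedy: runs to the LAST ')', which glues separate groups together
greedy = re.findall(r"\(.*\)", "(1) + (2)")
print(greedy)  # ['(1) + (2)']
```

Matching nested parentheses requires counting how many are currently open, and a plain regular expression has no way to count to an arbitrary depth.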

Practically speaking, although _many_ people attempt it, regular expressions are not the correct choice to parse HTML (web) pages or XML documents.


Computer science students often learn about context-free grammars (CFGs). CFGs _can_ parse recursive strings and are often used to parse programming languages. Unfortunately, CFGs are out of scope for this course.

In [29]:
bestphysicist = ['Rafael Vescovi, PhD',
                 'Isaac Newton, physicist']

 
for k in bestphysicist:
    m = re.match(r"(?P<Name>\w+) (?P<surname>\w+), \w+", k)
    print(m.group('surname'), ', ',m.group('Name'))

Vescovi ,  Rafael
Newton ,  Isaac
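
The cell above relies on named groups, written `(?P<name>...)`, which let you refer to a captured piece by name instead of by position. As a small variation (the group names `first`, `last` and `title` are chosen here for illustration), `groupdict()` returns all named groups at once:

```python
import re

people = ['Rafael Vescovi, PhD',
          'Isaac Newton, physicist']

person_regex = re.compile(r"(?P<first>\w+) (?P<last>\w+), (?P<title>\w+)")

records = []
for person in people:
    m = person_regex.match(person)
    if m:
        records.append(m.groupdict())  # dict mapping group name -> text

print(records[1])
# {'first': 'Isaac', 'last': 'Newton', 'title': 'physicist'}
```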
