# Regular Expressions

Regular expressions, or regex, have a reputation among programmers for being difficult and persnickety.

> "Some people, when confronted with a problem, think 'I know, I'll use
regular expressions.' Now they have two problems."

Don't let this intimidate you! Everyone gets frustrated with regex.

Let's try and do some things with regular expressions in Python.


In [None]:
# load the regular expressions module.
import re

## The Queen's English

Some words are spelt differently in the UK and the US. Let's craft a few regular expressions that match both variations. Here is a chart of some examples:

![British vs. American spellings](https://i.redd.it/8pexf4d1fax21.jpg)

The square brackets mean "match any one of the characters inside the brackets."

Match a "t" then either "y" or "i" then "re".

In [None]:
regex = "t[yi]re"
print(re.findall(regex, "tyre"))
print(re.findall(regex, "tire"))
print(re.findall(regex, "wheel"))

Let's see if we can match both "while" and "whilst."

The special character `?` means "match 0 or 1 of the previous character." It doesn't match a literal question mark.

Match "whil" and then either "s" or "e" and then zero or one "t".

In [None]:
regex = "whil[se][t]?"
print(re.findall(regex, "whilst"))
print(re.findall(regex, "while"))
print(re.findall(regex, "meantime"))      
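The `?` also works directly after a single character, with no brackets needed. Here is a minimal sketch using the pair "colour" and "color", chosen for illustration:

```python
import re

# "u?" makes the "u" optional, so both spellings match
regex = "colou?r"
optional_u = [re.findall(regex, word) for word in ["colour", "color", "colr"]]
print(optional_u)  # [['colour'], ['color'], []]
```

The last word fails because only the "u" is optional; the second "o" is still required.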

Ok, now let's try something more complicated like "bogeyman" and "boogeyman."

We need to use the curly braces `{}`, a *quantifier*, to specify how many times the previous character may repeat.

Match "b" then one or two "o" then "geyman".

In [None]:
regex = "bo{1,2}geyman"
print(re.findall(regex, "bogeyman"))
print(re.findall(regex, "boogeyman"))
print(re.findall(regex, "booooooogeyman"))
      

Why don't you try crafting some regular expressions to match British and American spellings? Here is some sample code for you to work with:

```python
regex = "???"
print(re.findall(regex, "???"))
print(re.findall(regex, "???"))
print(re.findall(regex, "???"))
```

In [None]:
# try your own here


## Rhyming

```
There was an Old Man with a beard
Who said, 'It is just as I feared
Two Owls and a Hen
Four Larks and a Wren
```

So this is an AABBA poem, which means we need to create a regex that matches words that rhyme with "beard." We could make a regex that matches all the words, but it might be easier to find a pattern.

In [None]:
regex = "ed|eard"

In [None]:
last_line = "Have all built their nests in my TREE"
re.findall(regex, last_line)

An empty list, which means no match, which means no rhyme.

In [None]:
last_line = "Have all built their nests in my beard"
re.findall(regex, last_line)

In [None]:
last_line = "Have all built their beds in my HAT"
re.findall(regex, last_line)

Oh dear, this line does not rhyme.  
Our regex is broken.  
Some of the time.  
We can use a special token.  
To match the end of the line.

In [None]:
regex = "ed$|eard$"
last_line = "Have all built their beds in my HAT"
re.findall(regex, last_line)

With the `$`, we are saying "match a string that ends in 'ed' or a string that ends in 'eard'."
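To check the anchored pattern against several endings at once, here is a small sketch. The first two candidates come from the limerick; `re.search` returns a match object or `None`, so `bool()` turns it into a rhymes/doesn't-rhyme answer.

```python
import re

regex = "ed$|eard$"

# candidate lines; only the ending of each line matters
candidates = ["with a beard", "just as I feared", "beds in my HAT"]

# True when the line ends in "ed" or "eard"
rhymes = {line: bool(re.search(regex, line)) for line in candidates}
print(rhymes)
# {'with a beard': True, 'just as I feared': True, 'beds in my HAT': False}
```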

## Matching the News
Extract proper names from a [news article](https://text.npr.org/979155522)

In [None]:
text = """When electric pickup maker Lordstown Motors 
took over an old General Motors plant in Ohio in 2019, 
it had big ambitions — and made a lot of promises. 
It promised a revival for a community agonizingly 
familiar with lost jobs. It named itself after the town, 
Lordstown, and named its future truck the Endurance, after 
the region's enduring residents. And it promised a fast 
timeline. Lordstown aimed to bring the first mass-produced 
electric pickup truck to market, built right there in 
that old GM plant."""
text

Create a regex that matches one capital letter (`[A-Z]`) then zero or more lowercase letters (`[a-z]`). The `*` stands for "match zero or more of the previous character."

In [None]:
regex = "[A-Z][a-z]*"
re.findall(regex,text)
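Because `*` allows zero lowercase letters, a lone capital letter also counts as a match, while `+` requires at least one lowercase letter. Here is a quick comparison on a shortened sample string, used here only for illustration:

```python
import re

sample = "that old GM plant in Ohio"

# * allows zero lowercase letters, so "G" and "M" each match on their own
star_matches = re.findall("[A-Z][a-z]*", sample)
# + requires at least one lowercase letter after the capital
plus_matches = re.findall("[A-Z][a-z]+", sample)

print(star_matches)  # ['G', 'M', 'Ohio']
print(plus_matches)  # ['Ohio']
```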

In [None]:
regex = "[A-Z][a-z]+"
re.findall(regex,text)

So why isn't this matching "Motors"?
Let's look at our [handy web tool](https://pythex.org/?regex=%5B%5E.%5D%5B%5Cs%5D%5BA-Z%5D%5Ba-z%5D%2B&test_string=When%20electric%20pickup%20maker%20Lordstown%20Motors%20took%20over%20an%20old%20General%20Motors%20plant%20in%20Ohio%20in%202019%2C%20it%20had%20big%20ambitions%20—%20and%20made%20a%20lot%20of%20promises.%5CnIt%20promised%20a%20revival%20for%20a%20community%20agonizingly%20familiar%20with%20lost%20jobs.%20It%20named%20itself%20after%20the%20town%2C%20Lordstown%2C%20and%20named%20its%20future%20truck%20the%20Endurance%2C%20after%20the%20region%27s%20enduring%20residents.%5CnAnd%20it%20promised%20a%20fast%20timeline.%20Lordstown%20aimed%20to%20bring%20the%20first%20mass-produced%20electric%20pickup%20truck%20to%20market%2C%20built%20right%20there%20in%20that%20old%20GM%20plant.&ignorecase=0&multiline=0&dotall=0&verbose=0) to test regular expressions.
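Decoding the `regex=` parameter in the link gives the pattern `[^.][\s][A-Z][a-z]+`, which differs from the one in the cell above. A minimal sketch, assuming that is the pattern the question refers to, on a shortened sample string:

```python
import re

# pattern decoded from the link's query string (an assumption about what was tested)
regex = r"[^.][\s][A-Z][a-z]+"
sample = "maker Lordstown Motors took over"

link_matches = re.findall(regex, sample)
print(link_matches)  # ['r Lordstown']
```

`re.findall` returns non-overlapping matches. The match for "Lordstown" has already consumed the "n" that would need to fill the `[^.]` slot in front of " Motors", so "Motors" is skipped.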

In [None]:
regex = "[A-Z][a-z]{3,}"
re.findall(regex,text)

Close enough!

## Passwords

You know those annoying password checkers that verify you have a sufficiently strong password?

Let's create a regular expression that makes sure the string is at least 8 characters and *can* be made up of capital letters (`A-Z`), lowercase letters (`a-z`), digits (`0-9`), and either an exclamation point (`!`) or dollar sign (`$`).

In [None]:
password = "L1br4r14nsRule!"

regex = "[A-Za-z0-9!$]{8,}"

In [None]:
if re.fullmatch(regex,password):
    print("strong")
else:
    print("weak")

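Note that this is not the greatest password checker, because it only checks which characters are allowed. It doesn't require the password to contain any particular kind of character, so eight lowercase letters would pass.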
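One way to tighten the check is to run a separate search for each kind of character. This is only a sketch: `is_strong` is a hypothetical helper name, and requiring one of each kind is an assumed policy, not a recommendation.

```python
import re

def is_strong(password):
    """Hypothetical helper: 8+ allowed characters and at least one of each kind."""
    # same pattern as above: only allowed characters, at least 8 of them
    allowed = re.fullmatch(r"[A-Za-z0-9!$]{8,}", password)
    # one pattern per required kind of character
    required = [r"[A-Z]", r"[a-z]", r"[0-9]", r"[!$]"]
    return bool(allowed) and all(re.search(p, password) for p in required)

print(is_strong("L1br4r14nsRule!"))  # True
print(is_strong("librarians"))       # False: no capital, digit, or symbol
```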

## Parsing files

From the file `hamlet.txt`, write a regular expression that gathers a list of the names of all the speaking characters.

In [None]:
# open the file and read it into memory
with open("files/hamlet.txt", "r") as fh:
    hamlet_text = fh.read()

# print the first 1000 characters so it is pretty
print(hamlet_text[0:1000])

The output above shows that the speaking parts are names in capital letters on their own lines.

In [None]:
# craft a regex to match capital letters
regex = "[A-Z]"

# loop through the first 50 lines to test the regex
for line in hamlet_text.split("\n")[0:50]:
    character_name = re.findall(regex, line)
    if character_name:
        print(character_name)

Yikes, that appears to match every capital letter, but only one letter at a time. We want more letters.

In [None]:
# craft a regex that matches 0 or more capital letters
regex = "[A-Z]*"

for line in hamlet_text.split("\n")[0:50]:
    character_name = re.findall(regex, line)
    if character_name:
        print(character_name)

What is happening? It appears the regex is matching blank lines too. This is because of `*`. Remember that `*` means "match 0 or more of the preceding character," so when we write `[A-Z]*` we are saying "match 0 or more letters between capital A and capital Z." An empty string technically matches that pattern.
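A tiny check makes this visible. On a blank line, `re.findall` still returns a list holding one zero-length match, and a non-empty list counts as true in the `if` test:

```python
import re

# a blank line still yields one zero-length match
blank_line_matches = re.findall("[A-Z]*", "")

print(blank_line_matches)        # ['']
print(bool(blank_line_matches))  # True, so the `if` check passes
```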

In [None]:
# match 1 or more capital letters
regex = "[A-Z]+"

for line in hamlet_text.split("\n")[0:50]:
    character_name = re.findall(regex, line)
    if character_name:
        print(character_name)

Making headway, but we need to not match these individual letters. Let's just match lines that start and end with 1 or more capital letters and contain no other letters.

In [None]:
# match 1 or more capital letters that start and end a string
regex = "^[A-Z]+$"

for line in hamlet_text.split("\n")[0:50]:
    character_name = re.findall(regex, line)
    if character_name:
        print(character_name)

In [None]:
character_list = []

regex = "^[A-Z]+$"

for line in hamlet_text.split("\n")[0:50]:
    
    character_name = re.findall(regex, line)
    
    if character_name and (character_name[0] not in character_list):
        character_list.append(character_name[0])
        
print(character_list)

Wait, why only 4?! That is because we are only looping over the first 50 lines (see the slice `[0:50]` in the loop after the `split("\n")`). We can remove that now that we are done testing.

In [None]:
# create a list to hold the characters
character_list = []

# regex that matches speaking lines
regex = "^[A-Z]+$"

# loop over every line in the play
for line in hamlet_text.split("\n"):
    
    # match the character name
    character_name = re.findall(regex, line)
    
    # if the line contains a character name AND we haven't seen that name
    if character_name and (character_name[0] not in character_list):
        # add new character to our list
        character_list.append(character_name[0])

# print out our list of characters
print(character_list)

Cool! We just used regular expressions to create a list of speaking parts in Hamlet.
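The same result can be reached without an explicit loop. This sketch uses the `re.MULTILINE` flag, which lets `^` and `$` match at the start and end of every line, and a short made-up stand-in for `hamlet_text` so it runs on its own:

```python
import re

# short stand-in for hamlet_text (hypothetical excerpt, not the real file)
sample_text = (
    "HAMLET\nTo be, or not to be\n"
    "OPHELIA\nGood my lord\n"
    "HAMLET\nI humbly thank you"
)

# with re.MULTILINE, ^ and $ anchor to each line rather than the whole string
names = re.findall("^[A-Z]+$", sample_text, flags=re.MULTILINE)

# dict.fromkeys drops duplicates while keeping first-seen order
unique_names = list(dict.fromkeys(names))
print(unique_names)  # ['HAMLET', 'OPHELIA']
```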

## Regex Groups - Finding Birthdays

We have a paragraph of text, and what we want to do is extract the birthdays from this blob of text.

In [None]:
text = """Bob has a birthday on 06-20-1946, but Kevin
is much younger with a birthday on 01-01-1968. Abe
was born a long time ago on 02-12-1809, but
he isn't as old as George who was also born in 
February, but on 02-22-1732. Alf, a smart-mouthed 
alien was born on 09-22-1986, but we're not sure 
if that is the correct birthday. Also, don't 
confuse Alf with Howard, who was born on 08-01-1986
and is a duck."""

In [None]:
# match two digits, a single dash, two digits, a single dash, and then four digits
# (raw string, so Python does not treat \d as a string escape)
regex = r"\d{2}-\d{2}-\d{4}"

In [None]:
# find all the matching strings in our blob of text
birthdays = re.findall(regex, text)
birthdays

So this gives us each of the birthdays as one string. We can use the power of groups to identify the separate matching components of our regular expression. If we put *parentheses* around the sections of the regular expression we want to extract, `re.findall` returns them as separate *matching groups*.

In [None]:
# add parentheses around the parts we want to extract
# (raw string, so Python does not treat \d as a string escape)
regex = r"(\d{2})-(\d{2})-(\d{4})"
# search the string for matches
birthdays = re.findall(regex, text)
birthdays

In [None]:
# loop over our list of tuples and print each part
for birthday in birthdays:
    print("Month:", birthday[0])
    print("Day:", birthday[1])
    print("Year:", birthday[2])
    print()
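Indexing by position works, but groups can also be given names with the `(?P<name>...)` syntax. Here is a minimal sketch on a shortened sample sentence; the group names `month`, `day`, and `year` are chosen here for illustration:

```python
import re

sample = ("Bob has a birthday on 06-20-1946, but Kevin "
          "is much younger with a birthday on 01-01-1968.")

# each (?P<name>...) group can be read back by name instead of by index
regex = r"(?P<month>\d{2})-(?P<day>\d{2})-(?P<year>\d{4})"

# finditer yields match objects; groupdict() maps group names to matched text
birthday_parts = [m.groupdict() for m in re.finditer(regex, sample)]

for parts in birthday_parts:
    print("Month:", parts["month"], "Day:", parts["day"], "Year:", parts["year"])
```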