# <br>Finding Patterns in Strings

This week, you've learned a lot about the syntax of regular expressions and practiced the logic required to solve real-world problems. You've learned how to write regular expressions to match particular patterns. Today, we'll talk about how you code those expressions in Python, rather than reviewing how to build your match strings.

The regular expressions you write need to be passed to the computer in some way specific to each language. In Python, there are a handful of functions that will identify the matches to your patterns.
<br><br>Which function you use will depend on which question you are asking:
1. Is the match there?
2. Where is the match?
3. Return the first instance of the match
4. Return all the matches

<br>In addition to introducing the functions, we'll cover a couple quirks to working with regular expressions in Python.
<br><br>If you are new to Python, a group of text characters is called a string.

<br><br>For matching exact substrings, Python has some built-in functions. For matching more complicated patterns, you will need to use the Python package `re`, which stands for **regular expressions**. `re` is part of the standard Python library, so you do not need to install it; you will need to import it into this notebook, but we'll go over that.

<br><br>**After the workshop:** For a more thorough tutorial on the `re` package in Python, I recommend the Real Python Regex tutorial: https://realpython.com/regex-python/. Most of RealPython's tutorials are behind a paywall, but this one is available for free. 

## <br><br>Searching for **exact** strings using Python's built-in functions

#### Is the data there?

To see if an exact substring is present, we use the `in` operator, which produces a boolean. A boolean is a value that is either True or False.

*This is a Jupyter Notebook. To run a gray code cell, click in the cell and either click the play arrow or type shift+enter (shift+return on a Mac).*

In [1]:
full = "Morgan Taylor 555-555-1890"

In [80]:
"Taylor" in full

True

In [81]:
" 555-" in full

True

#### <br><br>Where is the data?

The string method `find()` returns the index position of the first character in the **first occurrence** of the search string. String methods are written after the string you're searching in, and the argument is your search string.

In [5]:
full = "Morgan Taylor 555-555-1890"

In [6]:
full.find(" 555-")

13

<br>You can then use that position to index or slice the original string:

In [7]:
full[13]

' '

In [8]:
full[13:]

' 555-555-1890'
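
<br>As a small sketch of how `find()` can be combined with `len()` and slicing: the variable names `target`, `start`, `end`, and `surname` below are illustrative choices, not part of the workshop material. It also shows that `find()` returns -1, rather than raising an error, when the substring is absent.

```python
full = "Morgan Taylor 555-555-1890"

# find() returns -1 (not an error) when the substring is absent
missing = full.find("Tylor")
print(missing)  # -1

# Combine find() with len() to slice out exactly the matched substring
target = "Taylor"
start = full.find(target)
end = start + len(target)
surname = full[start:end]
print(start, end, surname)  # 7 13 Taylor
```

Because -1 is also a valid index in Python (the last character), it is worth checking for it before slicing.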

### <br><br>Exercise 1

In [103]:
full_text = "Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' thought Alice `without pictures or conversation?' So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her. There was nothing so very remarkable in that; nor did Alice think it so very much out of the way to hear the Rabbit say to itself, `Oh dear! Oh dear! I shall be late!' (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural); but when the Rabbit actually took a watch out of its waistcoat-pocket, and looked at it, and then hurried on, Alice started to her feet, for it flashed across her mind that she had never before seen a rabbit with either a waistcoat-pocket, or a watch to take out of it, and burning with curiosity, she ran across the field after it, and fortunately was just in time to see it pop down a large rabbit-hole under the hedge."

Run the cell above to store the variable `full_text`. Write code to answer the questions:

Does the word "Rabbit" appear in this text?

In [101]:
"Rabbit" in full_text

True

Where does "Rabbit" first appear?

In [102]:
full_text.find("Rabbit")

552

#### <br><br>Do I need to use a regular expression?

Most string parsing tasks I encounter can actually be completed in Python without using regular expressions. A combination of the string methods `split()` and `join()`, along with basic Python object indexing, the `len()` function, and the `in` boolean can usually get me what I need. 
<br><br>However, you will use whatever you know best, and there are certainly some jobs that require regular expressions.
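
<br>To illustrate, here is a minimal sketch of that regex-free approach using the sample string from earlier. The variable names `parts`, `phone`, `name`, and `digits` are assumptions made for this example.

```python
full = "Morgan Taylor 555-555-1890"

# split() with no argument breaks the string on whitespace
parts = full.split()
print(parts)  # ['Morgan', 'Taylor', '555-555-1890']

# Index the pieces, then reassemble them with join()
phone = parts[-1]
name = " ".join(parts[:2])
digits = "".join(phone.split("-"))
print(name)         # Morgan Taylor
print(digits)       # 5555551890
print(len(digits))  # 10
```

This works because the data is consistently space-separated; less regular data is where regular expressions start to pay off.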

#### <br><br>A note about collecting or removing data from strings

It's important to remember that sometimes, you will need to search for one string in order to locate another. For example:
- If the data you need to collect doesn't follow a strict pattern, but some substring near the data you need does follow a pattern.
- If the data you need to collect will be unpredictable, but other data near it is predictable.

For example, let's say you wanted to extract filenames from a list of text lines. The filenames are all different, with different extensions, but each filename is preceded by the phrase "Filename: " and followed by a tab. 

In [96]:
data = "Date: 11-10-19\tFilename: 73820.pdf\tAuthor: User\n Date: Unknown\tFilename: giantPicture.jpeg\tAuthor: Unknown\n Date: 12-01-19\tFilename: gene3820028.txt\tAuthor: User\n Filename: P38HAK8.EXE\tAuthor: Admin"

Using regular expressions, you could search for the pattern of the actual filename - a series of letters or numbers followed by a dot followed by three or four letters. But it would be much cleaner to include a search for the consistent string "Filename:" in case there are other filenames in the document.

In [97]:
for line in data.split("\n"):
    filename = [l.split(": ")[-1] for l in line.split("\t") if "Filename:" in l][0]
    print(filename)

73820.pdf
giantPicture.jpeg
gene3820028.txt
P38HAK8.EXE


In [98]:
re.findall(r"Filename: (\w+\.\w+)\t", data)

['73820.pdf', 'giantPicture.jpeg', 'gene3820028.txt', 'P38HAK8.EXE']

## <br><br>The `re` module

First, we need to import the module:

In [2]:
import re

<br>The `re` module will let you search for those regular expressions you've been learning, not just exact strings.

<br><br>We can use the function `re.search()` to find out if a pattern is present AND the location of its **first** occurrence. `re.search()` is not a string method, i.e. it isn't called on the string you are searching. Instead, it takes two arguments: the expression to find and the string to search in.

In [104]:
full = "Morgan Taylor 555-555-1890"
re.search("Taylor", full)

<re.Match object; span=(7, 13), match='Taylor'>

This returns a **match object**. It tells us the start and end positions of the match (**span**) and the text that was matched (**match**). In this case, we asked for an exact match, so our match is the same as the search term.

If no match is found, `re.search()` returns `None`, which the notebook displays as no output.

In [105]:
re.search("Tylor", full)

<br><br>The result of `re.search()` can also be used as a Boolean, because a match object counts as True and `None` counts as False:

In [106]:
if re.search("Taylor", full):
    print("A match!")
else:
    print("No match.")

A match!


In [107]:
if re.search("Tylor", full):
    print("A match!")
else:
    print("No match.")

No match.


#### <br><br>The match object
We can also ask for just one piece of the match object by calling one of these methods on it:

The span:

In [108]:
re.search("Taylor", full).span()

(7, 13)

The start position:

In [109]:
re.search("Taylor", full).start()

7

The end position:

In [110]:
re.search("Taylor", full).end()

13

The string that was matched:

In [111]:
re.search("Taylor", full).group()

'Taylor'
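
<br>One caution when chaining these methods: if the search fails, `re.search()` returns `None`, and calling `.group()` on `None` raises an `AttributeError`. The sketch below shows one way to guard against that. The helper name `first_match` is hypothetical, introduced only for this example.

```python
import re

full = "Morgan Taylor 555-555-1890"

def first_match(pattern, text):
    """Return the matched string, or None if the pattern is not found."""
    m = re.search(pattern, text)
    # re.search() returns None on failure, so check before calling .group()
    if m is None:
        return None
    return m.group()

found = first_match("Taylor", full)
not_found = first_match("Tylor", full)
print(found)      # Taylor
print(not_found)  # None
```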

### <br><br>Exercise 2

In [113]:
full_text = "Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' thought Alice `without pictures or conversation?' So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her. There was nothing so very remarkable in that; nor did Alice think it so very much out of the way to hear the Rabbit say to itself, `Oh dear! Oh dear! I shall be late!' (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural); but when the Rabbit actually took a watch out of its waistcoat-pocket, and looked at it, and then hurried on, Alice started to her feet, for it flashed across her mind that she had never before seen a rabbit with either a waistcoat-pocket, or a watch to take out of it, and burning with curiosity, she ran across the field after it, and fortunately was just in time to see it pop down a large rabbit-hole under the hedge."

Run the cell above. What is the position of the last character in the first occurrence of the word "Rabbit"?

In [115]:
re.search("Rabbit", full_text).end()

558

### <br><br>Finding multiple occurrences

`re.search()` returns a match object for only the first occurrence of the search term, scanning the string from left to right. <br><br>`re.findall()` finds all non-overlapping occurrences in the string and returns the matched strings as a list. <br><br>`re.finditer()` returns an iterator that yields a match object for every occurrence in the string.

In [116]:
print(full)

Morgan Taylor 555-555-1890


In [117]:
re.search("555", full)

<re.Match object; span=(14, 17), match='555'>

In [118]:
re.findall("555", full)

['555', '555']

In [119]:
re.finditer("555", full)

<callable_iterator at 0x7fca7a2b55c0>

<br>`re.finditer()` returns an iterator that you can loop through to get the match object for each match:

In [120]:
for m in re.finditer("555", full):
    print(m)

<re.Match object; span=(14, 17), match='555'>
<re.Match object; span=(18, 21), match='555'>


In [121]:
for m in re.finditer("555", full):
    print(m.span())
    print(m.start())
    print(m.end())

(14, 17)
14
17
(18, 21)
18
21
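
<br>If you want the positions collected in a list rather than printed, a list comprehension is one option. This sketch also shows that the object returned by `re.finditer()` can only be looped through once. The variable names are assumptions for illustration.

```python
import re

full = "Morgan Taylor 555-555-1890"

# Collect the span of every match in one list
spans = [m.span() for m in re.finditer("555", full)]
print(spans)  # [(14, 17), (18, 21)]

# The iterator is exhausted after one pass; call finditer() again to reuse it
matches = re.finditer("555", full)
first_pass = [m.group() for m in matches]
second_pass = [m.group() for m in matches]
print(first_pass)   # ['555', '555']
print(second_pass)  # []
```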


### <br><br>Exercise 3

In [30]:
full_text = "Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' thought Alice `without pictures or conversation?' So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her. There was nothing so very remarkable in that; nor did Alice think it so very much out of the way to hear the Rabbit say to itself, `Oh dear! Oh dear! I shall be late!' (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural); but when the Rabbit actually took a watch out of its waistcoat-pocket, and looked at it, and then hurried on, Alice started to her feet, for it flashed across her mind that she had never before seen a rabbit with either a waistcoat-pocket, or a watch to take out of it, and burning with curiosity, she ran across the field after it, and fortunately was just in time to see it pop down a large rabbit-hole under the hedge."

Run the cell above. Write code to print the start position for every occurrence of "Alice" in `full_text`.

In [124]:
for m in re.finditer("Alice", full_text):
    print(m.start())

0
260
646
1014


### <br><br>Searching for inexact strings using regular expressions

Functions that are part of the `re` module can search for regular expressions, which you are now familiar with.<br><br>In the simplest examples, you just include the regular expression in quotes:

In [125]:
text = "The bee goes buzz buzzz buzzzz..."
re.findall("buz+", text)

['buzz', 'buzzz', 'buzzzz']

<br>**However**, the backslash is also the escape character in ordinary Python strings. In a regular Python string, `\b` means a backspace character, so the regular expression engine never sees the word-boundary token:

In [126]:
re.findall("\bbu", text)

[]

<br>To get around this, add `r` before the opening quote to make a **raw string**. In a raw string, Python leaves backslashes alone and passes them through to the `re` module unchanged:

In [127]:
re.findall(r"\bbu", text)

['bu', 'bu', 'bu']
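
<br>To see what the `r` prefix changes, it can help to compare the two strings directly. This is a small sketch; the names `plain` and `raw` are illustrative.

```python
import re

text = "The bee goes buzz buzzz buzzzz..."

plain = "\bbu"   # "\b" is a single backspace character here
raw = r"\bbu"    # the backslash and the "b" are kept as two characters

print(len(plain))         # 3
print(len(raw))           # 4
print(plain == "\x08bu")  # True

# Doubling the backslash is equivalent to using the raw string
print("\\bbu" == raw)             # True
print(re.findall("\\bbu", text))  # ['bu', 'bu', 'bu']
```

Raw strings are usually easier to read than doubled backslashes, which is why many people write every regular expression with the `r` prefix.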

### <br><br>Exercise 4
Use a regular expression to return all words that end in the letter z.

In [128]:
text = "The bee goes buzz buzzz buzzzz..."

In [129]:
re.findall(r"\w+z\b", text)

['buzz', 'buzzz', 'buzzzz']

<br><br>Let's practice what you've learned this week to create regular expressions to solve problems:

### Exercise 5
Run the code cell below to store this list of phone numbers. For this exercise, treat 800, 888, and 860 as toll-free area codes. Write code to answer the questions.

In [74]:
phone_numbers = "800-555-2928 247-888-2827 860-555-2029 888-555-7262 860-555-2876 917-555-9373"

Are there any toll-free phone numbers in this list?

In [66]:
"800-" in phone_numbers or "888-" in phone_numbers or "860-" in phone_numbers

True

In [67]:
" 800-" in phone_numbers or " 888-" in phone_numbers or " 860-" in phone_numbers

True

In [59]:
re.findall(r"800-\d{3}-|860-\d{3}-|888-\d{3}-", phone_numbers)

['800-555-', '860-555-', '888-555-', '860-555-']

How many toll-free numbers are in this list?

In [64]:
len(re.findall(r"800-\d{3}-|860-\d{3}-|888-\d{3}-", phone_numbers))

4

Return all the toll-free numbers in this list.

In [65]:
re.findall(r"800-\d{3}-\d{4}|860-\d{3}-\d{4}|888-\d{3}-\d{4}", phone_numbers)

['800-555-2928', '860-555-2029', '888-555-7262', '860-555-2876']

### <br><br>Exercise 6
Run the code cell below to store this NEW list of phone numbers. As before, treat 800, 888, and 860 as toll-free area codes. Write code to answer the questions.

In [77]:
phone_numbers = "800-555-2928 247-888-2827 (860)555-2029 888.555.7262 8605552876 9175559373"

Return all the toll-free numbers in this list.

In [78]:
re.findall(r"800\W?\d{3}\W?\d{4}|860\W?\d{3}\W?\d{4}|888\W?\d{3}\W?\d{4}", phone_numbers)

['800-555-2928', '860)555-2029', '888.555.7262', '8605552876']

### <br><br>Flags

These are the available flags in the `re` module:

re.I<br>re.IGNORECASE<br>Makes matching of alphabetic characters case-insensitive
<br><br>re.M<br>re.MULTILINE<br>Causes the start-of-string (`^`) and end-of-string (`$`) anchors to also match at the start and end of each line
<br><br>re.S<br>re.DOTALL<br>Causes the dot metacharacter to match a newline
<br><br>re.X<br>re.VERBOSE<br>Allows inclusion of whitespace and comments within a regular expression
<br><br>----<br>re.DEBUG<br>Causes the regex parser to display debugging information to the console
<br><br>re.A<br>re.ASCII<br>Specifies ASCII encoding for character classification
<br><br>re.U<br>re.UNICODE<br>Specifies Unicode encoding for character classification
<br><br>re.L<br>re.LOCALE<br>Specifies encoding for character classification based on the current locale

<br><br>Flags get added as a third argument to the `re` functions:

In [130]:
re.findall("Rabbit", full_text, re.I)

['Rabbit', 'Rabbit', 'Rabbit', 'rabbit', 'rabbit']
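
<br>Flags can also be combined with the `|` operator, or passed by name with the `flags=` keyword. The short sample string below is made up for this sketch and is not part of the workshop data.

```python
import re

# Made-up sample text with three lines
sample = "Rabbit one\nrabbit two\nthe RABBIT"

# re.I alone: case-insensitive, but ^ only matches at the very start
only_ignorecase = re.findall(r"^rabbit", sample, re.I)
print(only_ignorecase)  # ['Rabbit']

# re.M alone: ^ matches at each line start, but case must match
only_multiline = re.findall(r"^rabbit", sample, re.M)
print(only_multiline)  # ['rabbit']

# Both flags together, passed with the flags= keyword
both = re.findall(r"^rabbit", sample, flags=re.I | re.M)
print(both)  # ['Rabbit', 'rabbit']
```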

### Exercise 7
Run the line of code below.

In [131]:
addresses = "872 Route 13, Cortlandville NY 13045\n279 Troy Road, East Greenbush NY 12061\n2465 Hempstead Turnpike, East Meadow NY 11554\n6438 Basile Rowe, East Syracuse NY 13057\n25737 US Rt 11, Evans Mills NY 13637\n901 Route 110, Farmingdale NY 11735"

Use the multiline flag to return the Zip Codes of each address.

In [132]:
re.findall(r"\d{5}$", addresses, re.M)

['13045', '12061', '11554', '13057', '13637', '11735']
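
<br>To see what the multiline flag contributes, compare the same search with and without it. This sketch uses only the first two addresses from the exercise to keep it short; the variable names are illustrative.

```python
import re

two_addresses = "872 Route 13, Cortlandville NY 13045\n279 Troy Road, East Greenbush NY 12061"

# Without re.M, $ only matches at the end of the whole string
without_flag = re.findall(r"\d{5}$", two_addresses)
print(without_flag)  # ['12061']

# With re.M, $ also matches just before each newline
with_flag = re.findall(r"\d{5}$", two_addresses, re.M)
print(with_flag)  # ['13045', '12061']
```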

### <br><br>A note about using parentheses in regular expressions in Python

Parentheses in a regular expression create a group. In Python, when a pattern contains a group, `re.findall()` returns only the text captured by that group rather than the whole match:

In [133]:
names = "Bob Belcher, Linda Belcher, Tina Belcher, Gene Belcher, Louise Belcher, Jimmy Pesto, Jimmy Pesto Jr., Teddy"

In [135]:
re.findall(r"(\w+) Belcher", names)

['Bob', 'Linda', 'Tina', 'Gene', 'Louise']
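
<br>Two related behaviors are worth knowing, sketched here with a shortened version of the names list (an assumption made to keep the output readable). With more than one group, `re.findall()` returns a list of tuples, and a match object from `re.search()` lets you ask for the whole match or for a single group.

```python
import re

short_names = "Bob Belcher, Linda Belcher, Jimmy Pesto"

# Two groups: findall() returns one tuple per match
pairs = re.findall(r"(\w+) (\w+)", short_names)
print(pairs)  # [('Bob', 'Belcher'), ('Linda', 'Belcher'), ('Jimmy', 'Pesto')]

# With a match object, group(0) is the whole match and group(1) is the first group
m = re.search(r"(\w+) Belcher", short_names)
whole = m.group(0)
first_name = m.group(1)
print(whole)       # Bob Belcher
print(first_name)  # Bob
```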

<br>You may also remember that you need to use parentheses to set aside parts of your search term if you're using an `or` operator.

You might remember this example from Day 2. We want to find all the times that are real times (between 00:00 and 23:59).

In [139]:
times = "00:15, 09:45, 23:10, 30:00, 10:99, 27:10"

The solution requires an `or`, but it won't return the correct times without parentheses around the `or` part:

In [140]:
re.findall(r"[01]\d|2[0-3]:[0-5]\d", times)

['00', '15', '09', '23:10', '00', '10', '10']

However, once we put parentheses around the `or` part, `re.findall()` returns only what that group captured:

In [141]:
re.findall(r"([01]\d|2[0-3]):[0-5]\d", times)

['00', '09', '23']

So, we need to tell the `re` module that these parentheses are only for grouping, not for capturing. Adding the special term `?:` right after the opening `(` creates a **non-capturing group**:

In [142]:
re.findall(r"(?:[01]\d|2[0-3]):[0-5]\d", times)

['00:15', '09:45', '23:10']
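
<br>The same `?:` trick can shorten the toll-free search from Exercise 5. This is one possible rewrite, not the only solution: the three area codes go inside a single `or` group instead of repeating the rest of the pattern three times.

```python
import re

phone_numbers = "800-555-2928 247-888-2827 860-555-2029 888-555-7262 860-555-2876 917-555-9373"

# Capturing group: findall() returns only the area codes
codes_only = re.findall(r"(800|860|888)-\d{3}-\d{4}", phone_numbers)
print(codes_only)  # ['800', '860', '888', '860']

# Non-capturing group: findall() returns the full numbers
toll_free = re.findall(r"(?:800|860|888)-\d{3}-\d{4}", phone_numbers)
print(toll_free)  # ['800-555-2928', '860-555-2029', '888-555-7262', '860-555-2876']
```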