# Working with Strings: Regular Expressions

Kevin Bonham, PhD
Eric Franzosa, PhD

## Outline

- Complex string printing
- What are regular expressions?
- Using regular expressions in Python
- Lots of practice

## Learning Objectives

After completing this lesson, you should be able to:

- Print a complex string built from variables using the `.format()` string method
- Identify the individual matching elements of a regular expression
- Use the `re` module in Python to match a specified pattern in a string
- Predict which parts of a regular expression are prone to failure

## String Creation with `.format()`

There are many ways to create strings:

In [7]:
s0 = "Hello, World!"
s1, s2, s3, s4 = ['Hell', 'o', 'World', ',!']

In [11]:
s1 + s2 + s4[0] + ' ' + s3 + s4[1]

'Hello, World!'

In [16]:
import os

print("The current path is", os.getcwd())

The current path is /Users/kev/computation/science/hutlab/bst273/lectures


### The `.format()` string method enables complex string creation

In [19]:
template = """
The first element: {0}
The second element: {1}
The first element again: {0}
"""

print(template.format("Life, the Universe, and Everything", 42))


The first element: Life, the Universe, and Everything
The second element: 42
The first element again: Life, the Universe, and Everything



In [21]:
print(template.format("Green Eggs", "Spam"))


The first element: Green Eggs
The second element: Spam
The first element again: Green Eggs



In [22]:
t2 = "Numbers {} needed for {} arguments"
print(t2.format("aren't", "positional"))
print(t2.format("are", "repeated"))

Numbers aren't needed for positional arguments
Numbers are needed for repeated arguments


In [26]:
print(t2.format("WAIT WHAT?"))

IndexError: tuple index out of range

In [28]:
print(template.format(["first", "second"], "third"))


The first element: ['first', 'second']
The second element: third
The first element again: ['first', 'second']



### Poll: Finish the expression

Can you make a list with greetings for the teaching staff of BST273 using the `.format()` method?

In [33]:
staff = ["Eric", "Kevin", "Emma", "Marina", "Shirley"]
greetings = []

for s in staff:
    # Fix the line below and paste it into the answer box of today's poll
    greetings.append("Hello, s!".format())

greetings

['Hello, s!', 'Hello, s!', 'Hello, s!', 'Hello, s!', 'Hello, s!']

## Regular Expressions - Motivation

### Matching simple patterns is useful, but limited.

In [35]:
def narcissus(test_string):
    return "kevin" in test_string.lower()

In [37]:
[narcissus(s) for s in greetings]

[False, True, False, False, False]

### Thought question: How can I identify e-mail addresses?

In [38]:
def isemail(a_string):
    return "@" in a_string

In [41]:
test_cases = [
    "kbonham@broadinstitute.org",
    "eric_franzosa@hsph.harvard.edu",
    "http://www.broadinstitute.org",
    "My twitter handle: @kevbonham",
    "My office hours this week @ 1pm"
]

In [42]:
[isemail(t) for t in test_cases]

[True, True, False, True, True]

### Regular expressions are a special pattern syntax

- Also called “regexps” or “regexes” or “REs” as shorthand
- A language for describing patterns in strings
- Useful for:
    - Asking if a pattern occurs in a text
    - "Capturing" all or part of the pattern for manipulation
    - Replacing all or part of the pattern!

**WARNING: This will be confusing - practice is necessary**

My favorite resource (I use this all the time): [regexr.com](http://regexr.com)

![](https://imgs.xkcd.com/comics/regular_expressions.png)

### To use regular expressions in Python, we use the `re` module

- `re.search( <pattern>, <text> )` to find and return the first match (or `None` if there is none)
- `re.finditer( <pattern>, <text> )` to iterate through matches
- `re.sub( <pattern>, <replacement>, <text> )` find/replace
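
As a minimal sketch of how these three functions differ, here they are applied to the same text. The sample sentence and the variable names are only for illustration.

```python
import re

text = "ana, my nana, ate a banana"

# re.search returns a match object for the FIRST match, or None if nothing matches.
first = re.search("ana", text)
print(first.group(), first.start())   # ana 0

# re.finditer yields one match object per non-overlapping match, left to right.
starts = [m.start() for m in re.finditer("ana", text)]
print(starts)                         # [0, 9, 21]

# re.sub returns a NEW string with every match replaced; the original is unchanged.
replaced = re.sub("ana", "ANA", text)
print(replaced)                       # ANA, my nANA, ate a bANAna
```

Note that `re.search` and `re.finditer` give back match objects, not strings; call `.group()` on a match to get the matched text.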

In [165]:
import re

def displaymatch(pattern, string):
    match = re.search(pattern, string)
    if match is None:
        return "No Match!!"
    return 'Match:  "{}"'.format(match.group())

def displaymatches(pattern, string):
    matches = list(enumerate(re.finditer(pattern, string)))
    if len(matches) == 0:
        yield "No Match!!"
    else:
        for c, m in matches:
            yield('Match {}: "{}"'.format(c, m.group()))
            

In [166]:
print(displaymatch("an", "banana"))

Match:  "an"


In [167]:
for m in displaymatches("an", "banana"):
    print(m)

Match 0: "an"
Match 1: "an"


### Exact matching

In [168]:
print(displaymatch("Hello", "Hello, World!"))

Match:  "Hello"


In [169]:
print(displaymatch("hello", "Hello, World!"))

No Match!!


**Beware: *Matching is case-sensitive***
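
If case should not matter, one option (not used elsewhere in this lesson) is the `re.IGNORECASE` flag from the standard `re` module. A small sketch, with illustrative variable names:

```python
import re

text = "Hello, World!"

# Default behaviour: "h" and "H" are different characters.
strict = re.search("hello", text)

# re.IGNORECASE (also spelled re.I) makes letters match regardless of case.
relaxed = re.search("hello", text, flags=re.IGNORECASE)

print(strict)            # None
print(relaxed.group())   # Hello
```

The matched text keeps the case it has in the original string; only the comparison is relaxed.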

## The wildcard character

`.` represents **any single character** (except a newline, by default)

In [170]:
p = ".ello"
check = ["Hello, World!", "hello, world!", "Mellow", "yellow"]
for s in check:
    print(displaymatch(p, s))

Match:  "Hello"
Match:  "hello"
Match:  "Mello"
Match:  "yello"


In [171]:
p = "..i.."
check = "Life, the Universe, and Everything"
for m in displaymatches(p, check):
    print(m)

Match 0: "Unive"
Match 1: "thing"


**Beware: _everything has to match_**

The `i` in "Life" is not matched by `..i..` because only one character comes before it.

## Pre-defined character classes

- `\d` matches any **digit** (0-9)
    - `\D` matches any **NOT** digit
- `\s` matches any **whitespace** (space, tab, newline)
    - `\S` matches any **NOT** whitespace
- `\w` matches any **word character** (0-9, letters, underscore)
    - `\W` matches any **NOT** word character

In [174]:
check = "99 Bottles of beer on the wall, 99 bottles of beer"
patterns = [r"\d\d", r"\d\D", r"\s\D\D\s"]

In [175]:
for p in patterns:
    print("Pattern:", p)
    for m in displaymatches(p, check):
        print(m)

Pattern: \d\d
Match 0: "99"
Match 1: "99"
Pattern: \d\D
Match 0: "9 "
Match 1: "9 "
Pattern: \s\D\D\s
Match 0: " of "
Match 1: " on "
Match 2: " of "


In [176]:
check = "It’s my party and I’ll cry if I want to"
p = r"\W\w\w\w\w\W"
print(displaymatch(p, check))

Match:  " want "


In [105]:
check = "Yesterday, December 7, 1941--a date which will live in infamy"
print(displaymatch(p, check))

Match:  " 1941-"


### Custom character classes

- `[AB]` matches “A” or “B”
- `[A-E]` matches any character “A” through “E”
- `[A-Za-z0-9_]` matches any word character (equivalent to `\w`)


In [106]:
pattern = r"[kK]e"
check = ["Kevin", "Erik", "keep", "finicky", "maker"]
for c in check:
    print("Checking:", c)
    print(displaymatch(pattern, c))
    print()

Checking: Kevin
Match:  "Ke"

Checking: Erik
No Match!!

Checking: keep
Match:  "ke"

Checking: finicky
No Match!!

Checking: maker
Match:  "ke"



In [178]:
pattern = r"[ACGT]"
check = "A DNA sequence is represented by the letters A, C, G, and T"
for m in displaymatches(pattern, check):
    print(m)

Match 0: "A"
Match 1: "A"
Match 2: "A"
Match 3: "C"
Match 4: "G"
Match 5: "T"


### Negation of custom character classes

- Negate a character class with an initial `^`
- `[^ABC]` matches any character EXCEPT “A” or “B” or “C”
- `[^A-E]` matches any character EXCEPT “A” through “E”
- `[^A-Za-z0-9_]` matches any character EXCEPT word characters (equivalent to `\W`)

In [181]:
pattern = r"[^ACGT]"
check = "A DNA sequence is represented by the letters A, C, G, and T"
"".join([m[-2] for m in displaymatches(pattern, check)])


' DN sequence is represented by the letters , , , and '

### Boundaries

- `^` matches the start of a string (_before_ the first character)
- `$` matches the end of a string
- `\b` matches a “word boundary”: the position between a word character and a non-word character (or the start/end of the string)

In [183]:
for m in displaymatches(r"^ana", "ana, my nana,  ate a banana"):
    print(m)

Match 0: "ana"


In [184]:
for m in displaymatches(r"ana", "ana, my nana, ate a banana"):
    print(m) # how many matches do you expect?

Match 0: "ana"
Match 1: "ana"
Match 2: "ana"


In [185]:
for m in displaymatches(r"deer$", "Doe, a deer, a female deer"):
    print(m)

Match 0: "deer"


In [186]:
for m in displaymatches(r"deer\b", "Doe, a deer, a female deer"):
    print(m)

Match 0: "deer"
Match 1: "deer"


In [187]:
for m in displaymatches(r"\bana\b", "ana, my nana, ate a banana"):
    print(m)

Match 0: "ana"
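
Boundaries match *positions*, not characters, so they never appear in the matched text. The sketch below makes those positions visible; the short test string `"a deer"` and the variable names are just for illustration.

```python
import re

# Replace every word boundary with "|" to see where \b matches.
marked = re.sub(r"\b", "|", "a deer")
print(marked)   # |a| |deer|

# The boundary adds nothing to the match: each span is exactly 4 characters long.
text = "Doe, a deer, a female deer"
spans = [m.span() for m in re.finditer(r"deer\b", text)]
print(spans)    # [(7, 11), (22, 26)]
```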


### Repetition

- `+`: **one** or more matches of the previous character or group
- `*`: **zero** or more matches of the previous character or group
- `?`: **zero or one** match of the previous character or group (it is optional)
- `{N}`: exactly `N` matches in a row
    - `{N,M}`: `N` to `M` matches (inclusive)
    - `{N,}`: `N` or more matches
    - `{,M}`: `M` or fewer matches

In [191]:
check = ["AC", "ABC", "ABBC"]
patterns = [r"AB+C", r"AB*C", r"AB?C", r"AB{2}C"]
for p in patterns:
    print("Pattern:", p)
    for c in check:
        print("Checking:", c)
        print(displaymatch(p, c))
    print()

Pattern: AB+C
Checking: AC
No Match!!
Checking: ABC
Match:  "ABC"
Checking: ABBC
Match:  "ABBC"

Pattern: AB*C
Checking: AC
Match:  "AC"
Checking: ABC
Match:  "ABC"
Checking: ABBC
Match:  "ABBC"

Pattern: AB?C
Checking: AC
Match:  "AC"
Checking: ABC
Match:  "ABC"
Checking: ABBC
No Match!!

Pattern: AB{2}C
Checking: AC
No Match!!
Checking: ABC
No Match!!
Checking: ABBC
Match:  "ABBC"
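
The range forms `{N,M}`, `{N,}` and `{,M}` from the list above can be checked the same way. This is a small sketch with a slightly longer list of test strings; the `results` dictionary is only for illustration.

```python
import re

check = ["AC", "ABC", "ABBC", "ABBBC", "ABBBBC"]
patterns = [r"AB{1,2}C", r"AB{2,}C", r"AB{,2}C"]

# For each pattern, collect the test strings in which it finds a match.
results = {p: [c for c in check if re.search(p, c)] for p in patterns}

for p in patterns:
    print("Pattern:", p, "matches", results[p])

# Pattern: AB{1,2}C matches ['ABC', 'ABBC']
# Pattern: AB{2,}C matches ['ABBC', 'ABBBC', 'ABBBBC']
# Pattern: AB{,2}C matches ['AC', 'ABC', 'ABBC']
```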



### Subpatterns and Groups

- Parts of a pattern enclosed in parentheses are grouped
- These can be referenced within the pattern itself with a backreference: `\1` for the first group, `\2` for the second, and so on (written `"\\1"` in a non-raw string)

In [192]:
pattern = "(na)+"
check = "ana, my nana, ate a banana"

for m in displaymatches(pattern, check):
    print(m)

Match 0: "na"
Match 1: "nana"
Match 2: "nana"


In [193]:
pattern = "(..)\\1" # any 2 characters, followed by the same two characters
check = "ana, my nana, ate a banana"

for m in displaymatches(pattern, check):
    print(m)

Match 0: "nana"
Match 1: "anan"
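
Groups are also how you "capture" part of a match for later use. The sketch below pulls the pieces out of a match object with `.group(n)` and reuses them in a `re.sub` replacement; the pattern and the sample text are illustrative choices.

```python
import re

check = "99 Bottles of beer on the wall"
pattern = r"(\d+) (\w+)"   # group 1: a number, group 2: the word after it

match = re.search(pattern, check)
print(match.group(0))   # 99 Bottles   (group 0 is the whole match)
print(match.group(1))   # 99
print(match.group(2))   # Bottles
print(match.groups())   # ('99', 'Bottles')

# In the replacement string, \1 and \2 refer to the captured groups.
swapped = re.sub(pattern, r"\2: \1", check)
print(swapped)          # Bottles: 99 of beer on the wall
```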


### Choosing between multiple patterns

- `|` (pipe) behaves as a logical OR
- Often combined with parentheses to indicate a choice of sub-patterns

In [196]:
pattern = r"\w+\.(txt|py)\b"
check = ["my_script.py",
         "my_script.pyc",
         "my_input.txt",
         "README.md",
         "my_output.txt",
         "bad move.txt"]
for c in check:
    print("Checking:", c)
    print(displaymatch(pattern, c))

Checking: my_script.py
Match:  "my_script.py"
Checking: my_script.pyc
No Match!!
Checking: my_input.txt
Match:  "my_input.txt"
Checking: README.md
No Match!!
Checking: my_output.txt
Match:  "my_output.txt"
Checking: bad move.txt
Match:  "move.txt"


## Poll: How to identify e-mail addresses?

In [197]:
test_cases = [
    "kbonham@broadinstitute.org",
    "eric_franzosa@hsph.harvard.edu",
    "http://www.broadinstitute.org",
    "My twitter handle: @kevbonham",
    "My office hours this week @ 1pm"
]

def better_isemail(test_string):
    pattern = r"" # enter your pattern here
    if re.search(pattern, test_string):
        return True
    else:
        return False
    
[better_isemail(c) for c in test_cases]

[True, True, True, True, True]

In [None]:
hard_cases = [
    "numbers123@example.com", # TRUE
    "this.is.valid@weird-url.com", # TRUE
    "kevbonham+commerce@gmail.com", #TRUE
    "notvalid@notaurl", # FALSE
    "git@github.com:kescobo/bst273_lecture09.git" # FALSE
]
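
Try the poll yourself before reading on. One possible answer is sketched below; it is an illustration, not the official solution, and certainly not a complete e-mail validator. It assumes the whole string must be the address (hence `^` and `$`), and the function name `sketch_isemail` is made up for this example.

```python
import re

def sketch_isemail(test_string):
    # ^ and $       : the address must span the entire string
    # [\w.+-]+      : user name made of word characters, ".", "+" or "-"
    # @[\w-]+       : "@" followed by the first part of the domain
    # (\.[\w-]+)+   : at least one ".something" after it (note the escaped dot)
    pattern = r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$"
    return re.search(pattern, test_string) is not None

test_cases = [
    "kbonham@broadinstitute.org",
    "eric_franzosa@hsph.harvard.edu",
    "http://www.broadinstitute.org",
    "My twitter handle: @kevbonham",
    "My office hours this week @ 1pm",
]
hard_cases = [
    "numbers123@example.com",
    "this.is.valid@weird-url.com",
    "kevbonham+commerce@gmail.com",
    "notvalid@notaurl",
    "git@github.com:kescobo/bst273_lecture09.git",
]

print([sketch_isemail(c) for c in test_cases])   # [True, True, False, False, False]
print([sketch_isemail(c) for c in hard_cases])   # [True, True, True, False, False]
```

Because of the anchors, this sketch answers "is this string an e-mail address?" rather than "does this string contain one?".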

## Miscellaneous gotchas

### "Escaping" special characters

What if you want to match a special character? (`\`, `[`, `.`, `+` etc)

Find what's being added:

In [200]:
check = "2 + 2 = 4"
pattern = r"\d + \d"
print(displaymatch(pattern, check))

No Match!!


`r"\d + \d"` is parsed as:

- `\d`: a digit
- ` +`: one or more spaces
- ` `: a space
- `\d`: a digit

In [201]:
pattern = r"\d \+ \d"
print(displaymatch(pattern, check))

Match:  "2 + 2"


In [202]:
check = "42 + 3.14 = 45.14"
print(displaymatch(pattern, check))

Match:  "2 + 3"


In [204]:
pattern = r"\d+(.\d+)? \+ \d(.\d+)?"
print(displaymatch(pattern, check))

Match:  "42 + 3.14"


In [205]:
check = "6.02e23 + 1"
print(displaymatch(pattern, check))

Match:  "02e23 + 1"
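
The culprit is the unescaped `.`, which happily matches the `e`. Below is a sketch of two possible repairs: escaping the dot, and then (as an added assumption about what we want) allowing an optional exponent. The variable names are illustrative. `re.escape` is a standard-library helper that escapes every special character in a literal string for you.

```python
import re

check = "6.02e23 + 1"

loose = r"\d+(.\d+)? \+ \d(.\d+)?"        # the pattern from above
strict = r"\d+(\.\d+)? \+ \d+(\.\d+)?"    # dots escaped, \d+ on both sides
sci = r"\d+(\.\d+)?(e\d+)? \+ \d+(\.\d+)?(e\d+)?"  # optional exponent added

print(re.search(loose, check).group())    # 02e23 + 1
print(re.search(strict, check).group())   # 23 + 1
print(re.search(sci, check).group())      # 6.02e23 + 1

# re.escape builds a pattern that matches a string literally.
literal = re.escape("2 + 2")
print(re.search(literal, "2 + 2 = 4").group())   # 2 + 2
```

Escaping the dot alone does not make the match correct; it only stops `.` from matching the wrong character. The pattern still has to describe every form the number can take.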


### Matches don't overlap

`"banana"` has 2 `"ana"` substrings, but they're overlapping

In [208]:
for m in displaymatches("ana", "banana"):
    print(m)

Match 0: "ana"


This can be overcome with "lookaheads", but I'll leave that to you to look up.
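
As a starting point for that search, here is a minimal sketch of the lookahead idea. It wraps the pattern in `(?=...)`, which checks what comes next without consuming it; the variable name is illustrative.

```python
import re

# (?=...) is a lookahead: it tests that "ana" comes next but consumes nothing,
# so the scan moves on by one character and can find overlapping hits.
# The inner parentheses capture the text the lookahead saw.
overlapping = [(m.start(), m.group(1)) for m in re.finditer(r"(?=(ana))", "banana")]
print(overlapping)   # [(1, 'ana'), (3, 'ana')]
```

The match itself is empty (a lookahead is zero-width), which is why the text has to be read from `m.group(1)` rather than `m.group()`.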