# Regular Expressions

**Regular expressions are text-matching patterns described with a formal syntax. The patterns are interpreted as a set of instructions, which are then executed with a string as input to produce a matching subset. The term "regular expressions" is frequently shortened to 'regex' or 'regexp'.**

# Finding Patterns In Text

In [2]:
text = "The agent's phone number is 408-555-1234. Call soon!"

**We'll start off by trying to find out if the string "phone" is inside the text string. Now we could quickly do this with:**

In [3]:
'phone' in text

True

**But let's show the format for regular expressions, because later on we will be searching for patterns that won't have such a simple solution.**

In [4]:
import re

In [5]:
pattern = 'phone'

In [6]:
re.search(pattern,text)

<re.Match object; span=(12, 17), match='phone'>

In [7]:
pattern = "NOT IN TEXT"

In [8]:
re.search(pattern,text)

**The re.search() function takes the pattern, scans the text, and returns a Match object for the first match. If the pattern is not found, it returns None, which is why the cell above shows no output.**
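Because re.search() can return None, calling a method such as span() directly on the result raises an AttributeError when nothing matches. The sketch below shows one common way to guard against that; `find_span` is a hypothetical helper name used only for illustration.

```python
import re

def find_span(pattern, text):
    """Return the (start, end) span of the first match, or None if there is none."""
    match = re.search(pattern, text)
    if match is None:
        # No match: calling match.span() here would raise AttributeError
        return None
    return match.span()

text = "The agent's phone number is 408-555-1234. Call soon!"
print(find_span("phone", text))        # (12, 17)
print(find_span("NOT IN TEXT", text))  # None
```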

**Let's have a closer look at the Match object.**

In [11]:
pattern = 'phone'

In [12]:
match = re.search(pattern,text)

In [13]:
match

<re.Match object; span=(12, 17), match='phone'>

**Notice the span: the Match object also exposes the start and end indices.**

In [14]:
match.span()

(12, 17)

In [15]:
match.start()

12

In [16]:
match.end()

17

In [17]:
match.re.pattern

'phone'

In [18]:
match.string

"The agent's phone number is 408-555-1234. Call soon!"

**Another example: finding a set of words in a text.**

In [19]:
import re
patterns = [ 'this', 'that' ]
text = 'Does this text match the pattern?'

for pattern in patterns:
    print(f"Looking for {pattern} in {text}")
    if re.search(pattern,  text):
        print ('found a match!')
    else:
        print ('no match')

Looking for this in Does this text match the pattern?
found a match!
Looking for that in Does this text match the pattern?
no match


**Another example: inspecting the Match object returned by re.search().**

In [20]:
import re
pattern = 'this'
text = 'Does this text match the pattern?'
match = re.search(pattern, text)
s = match.start()
e = match.end()
print (f"Found {match.re.pattern} in {match.string} from {s} to {e} ({text[s:e]})")

Found this in Does this text match the pattern? from 5 to 9 (this)


# Compiling Expressions

**The re module includes module-level functions for working with regular expressions as text strings, but it is usually more efficient to compile the expressions your program uses frequently. The compile() function converts an expression string into a compiled pattern object (re.Pattern in recent versions of Python 3).**

In [24]:
import re
# Pre-compile the patterns
regexes = [re.compile(p) for p in [ 'this','that']]
text = 'Does this text match the pattern?'
for regex in regexes:
    print(f"Looking for {regex.pattern} in {text}")
    if regex.search(text):
        print ('found a match!')
    else:
        print ('no match')

Looking for this in Does this text match the pattern?
found a match!
Looking for that in Does this text match the pattern?
no match


**The module-level functions maintain a cache of compiled expressions, but the size of the cache is limited and using compiled expressions directly means you can avoid the cache lookup overhead. By pre-compiling any expressions your module uses when the module is loaded you shift the compilation work to application startup time, instead of a point where the program is responding to a user action.**
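As a rough illustration of compiling at load time, the sketch below defines a pattern once at module level and reuses it inside a function. The names `PHONE_WORD` and `count_phone` are hypothetical and not part of the re module.

```python
import re

# Compiled once, when the module is imported (hypothetical constant name)
PHONE_WORD = re.compile("phone")

def count_phone(text):
    """Count non-overlapping occurrences of 'phone' using the precompiled pattern."""
    return len(PHONE_WORD.findall(text))

print(count_phone("my phone is a new phone"))  # 2
```

Every call to `count_phone` reuses the same compiled object, so no compilation or cache lookup happens at call time.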

# Multiple Matches

**What if the pattern occurs more than once?**

In [25]:
text = "my phone is a new phone"
match = re.search("phone",text)
match.span()

(3, 8)

**Notice that re.search() only matches the first instance. If we want a list of all matches, we can use the findall() function:**

In [26]:
matches = re.findall("phone",text)
matches

['phone', 'phone']

In [27]:
len(matches)

2

In [28]:
import re
text = 'abbaaabbbbaaaaa'
pattern = 'ab'
for match in re.findall(pattern, text):
    print (f"Found {match}")

Found ab
Found ab


**To get actual match objects, use finditer(). It returns an iterator that produces Match instances instead of the strings returned by findall().**

In [29]:
text = "my phone is a new phone"
for match in re.finditer("phone",text):
    print(match.span())

(3, 8)
(18, 23)


In [64]:
import re
text = 'abbaaabbbbaaaaa'
pattern = 'ab'
for match in re.finditer(pattern, text):
    s = match.start()
    e = match.end()
    print (f"Found {text[s:e]} at {s}:{e}")

Found ab at 0:2
Found ab at 5:7


**Another example: finding every occurrence of a set of words in a text.**

In [31]:
import re
patterns = [ 'this', 'that' ]
text = 'Does this text match the pattern if so what is this. that is exact match in that string?'
for pattern in patterns:
    for match in re.finditer(pattern,text):
        print(f"Looking for '{pattern}' in '{text}'")
        s = match.start()
        e = match.end()
        print (f"Found {text[s:e]} at {s}:{e}")

Looking for 'this' in 'Does this text match the pattern if so what is this. that is exact match in that string?'
Found this at 5:9
Looking for 'this' in 'Does this text match the pattern if so what is this. that is exact match in that string?'
Found this at 47:51
Looking for 'that' in 'Does this text match the pattern if so what is this. that is exact match in that string?'
Found that at 53:57
Looking for 'that' in 'Does this text match the pattern if so what is this. that is exact match in that string?'
Found that at 76:80


# Patterns Syntax

**So far we've learned how to search for a basic string. What about more complex examples, such as finding a telephone number or an email address in a large string of text?**

**Regular expressions support more powerful patterns than simple literal text strings.**

In [35]:
import re
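def test_patterns(text, patterns=()):
    """Show the position and text of every match of each pattern in text."""
    # Print a two-line index ruler (tens digit, then ones digit) above the text
    print(''.join(str(i // 10 or ' ') for i in range(len(text))))
    print(''.join(str(i % 10) for i in range(len(text))))
    print(text)
    for pattern in patterns:
        print(f'Matching {pattern}')
        for match in re.finditer(pattern, text):
            s = match.start()
            e = match.end()
            # The end index is shown inclusively (e - 1)
            print('  %2d : %2d = "%s"' % (s, e - 1, text[s:e]))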

In [37]:
test_patterns('abbaaabbbbaaaaa',['ab'])

          11111
012345678901234
abbaaabbbbaaaaa
Matching ab
   0 :  1 = "ab"
   5 :  6 = "ab"


# Repetition

**There are five ways to express repetition in a pattern. A pattern followed by the metacharacter * is repeated zero or more times (allowing a pattern to repeat zero times means it does not need to appear at all to match). Replace the * with + and the pattern must appear at least once. Using ? means the pattern appears zero or one time. For a specific number of occurrences, use {m} after the pattern, where m is the number of times the pattern should repeat. Finally, to allow a variable but limited number of repetitions, use {m,n}, where m is the minimum number of repetitions and n is the maximum. Leaving out n ({m,}) means the pattern must appear at least m times, with no maximum.**

In [38]:
test_patterns('abbaaabbbbaaaaa',
              [ 'ab*',     # a followed by zero or more b
                'ab+',     # a followed by one or more b
                'ab?',     # a followed by zero or one b
                'ab{3}',   # a followed by three b
                'ab{2,3}', # a followed by two to three b
                ])

          11111
012345678901234
abbaaabbbbaaaaa
Matching ab*
   0 :  2 = "abb"
   3 :  3 = "a"
   4 :  4 = "a"
   5 :  9 = "abbbb"
  10 : 10 = "a"
  11 : 11 = "a"
  12 : 12 = "a"
  13 : 13 = "a"
  14 : 14 = "a"
Matching ab+
   0 :  2 = "abb"
   5 :  9 = "abbbb"
Matching ab?
   0 :  1 = "ab"
   3 :  3 = "a"
   4 :  4 = "a"
   5 :  6 = "ab"
  10 : 10 = "a"
  11 : 11 = "a"
  12 : 12 = "a"
  13 : 13 = "a"
  14 : 14 = "a"
Matching ab{3}
   5 :  8 = "abbb"
Matching ab{2,3}
   0 :  2 = "abb"
   5 :  8 = "abbb"
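The open-ended form {m,} is described above but not exercised in the cell. A minimal sketch on the same input string follows; note that it prints the exclusive end index from match.end(), not the inclusive end index that test_patterns() displays.

```python
import re

text = 'abbaaabbbbaaaaa'

# 'ab{2,}' : a followed by at least two b, with no upper limit
matches = [(m.start(), m.end(), m.group()) for m in re.finditer('ab{2,}', text)]
print(matches)  # [(0, 3, 'abb'), (5, 10, 'abbbb')]
```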


**The normal processing for a repetition instruction is to consume as much of the input as possible while matching the pattern. This is known as greedy behavior. Greediness can be turned off by following the repetition instruction with ?, which makes it match as little of the input as possible.**

In [39]:
test_patterns('abbaaabbbbaaaaa',
              [ 'ab*?',     # a followed by zero or more b, as few as possible
                'ab+?',     # a followed by one or more b, as few as possible
                'ab??',     # a followed by zero or one b, preferring zero
                'ab{3}?',   # a followed by exactly three b (? has no effect)
                'ab{2,3}?', # a followed by two to three b, preferring two
                ])

          11111
012345678901234
abbaaabbbbaaaaa
Matching ab*?
   0 :  0 = "a"
   3 :  3 = "a"
   4 :  4 = "a"
   5 :  5 = "a"
  10 : 10 = "a"
  11 : 11 = "a"
  12 : 12 = "a"
  13 : 13 = "a"
  14 : 14 = "a"
Matching ab+?
   0 :  1 = "ab"
   5 :  6 = "ab"
Matching ab??
   0 :  0 = "a"
   3 :  3 = "a"
   4 :  4 = "a"
   5 :  5 = "a"
  10 : 10 = "a"
  11 : 11 = "a"
  12 : 12 = "a"
  13 : 13 = "a"
  14 : 14 = "a"
Matching ab{3}?
   5 :  8 = "abbb"
Matching ab{2,3}?
   0 :  2 = "abb"
   5 :  7 = "abb"
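The difference between greedy and non-greedy matching is often easiest to see with delimited text. The sketch below uses a made-up HTML-like string (an assumption for illustration, not data from this notebook) and the dot metacharacter, which matches any single character.

```python
import re

html = '<b>bold</b> and <i>italic</i>'

greedy = re.findall('<.*>', html)   # consumes up to the last '>'
lazy = re.findall('<.*?>', html)    # stops at the first '>' it can

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```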


# Character Sets

**A character set is a group of characters, any one of which can match at that point in the pattern. For example, [ab] would match either a or b.**

In [40]:
test_patterns('abbaaabbbbaaaaa',
              [ '[ab]',    # either a or b
                'a[ab]+',  # a followed by one or more a or b
                'a[ab]+?', # a followed by one or more a or b, not greedy
                ])

          11111
012345678901234
abbaaabbbbaaaaa
Matching [ab]
   0 :  0 = "a"
   1 :  1 = "b"
   2 :  2 = "b"
   3 :  3 = "a"
   4 :  4 = "a"
   5 :  5 = "a"
   6 :  6 = "b"
   7 :  7 = "b"
   8 :  8 = "b"
   9 :  9 = "b"
  10 : 10 = "a"
  11 : 11 = "a"
  12 : 12 = "a"
  13 : 13 = "a"
  14 : 14 = "a"
Matching a[ab]+
   0 : 14 = "abbaaabbbbaaaaa"
Matching a[ab]+?
   0 :  1 = "ab"
   3 :  4 = "aa"
   5 :  6 = "ab"
  10 : 11 = "aa"
  12 : 13 = "aa"


**A character set can also be used to exclude specific characters. The special marker ^ means to look for characters not in the set following.**

In [42]:
#This pattern finds all of the substrings that do not contain the characters -, ., or a space.
test_patterns('This is some text -- with punctuation.',
              [ '[^-. ]+',  # sequences without -, ., or space
                ])

          1111111111222222222233333333
01234567890123456789012345678901234567
This is some text -- with punctuation.
Matching [^-. ]+
   0 :  3 = "This"
   5 :  6 = "is"
   8 : 11 = "some"
  13 : 16 = "text"
  21 : 24 = "with"
  26 : 36 = "punctuation"


**As character sets grow larger, typing every character that should (or should not) match becomes tedious. A more compact format using character ranges lets you define a character set to include all of the contiguous characters between a start and stop point.**

In [43]:
test_patterns('This is some text -- with punctuation.',
              [ '[a-z]+',      # sequences of lower case letters
                '[A-Z]+',      # sequences of upper case letters
                '[a-zA-Z]+',   # sequences of lower or upper case letters
                '[A-Z][a-z]+', # one upper case letter followed by lower case letters
                ])

          1111111111222222222233333333
01234567890123456789012345678901234567
This is some text -- with punctuation.
Matching [a-z]+
   1 :  3 = "his"
   5 :  6 = "is"
   8 : 11 = "some"
  13 : 16 = "text"
  21 : 24 = "with"
  26 : 36 = "punctuation"
Matching [A-Z]+
   0 :  0 = "T"
Matching [a-zA-Z]+
   0 :  3 = "This"
   5 :  6 = "is"
   8 : 11 = "some"
  13 : 16 = "text"
  21 : 24 = "with"
  26 : 36 = "punctuation"
Matching [A-Z][a-z]+
   0 :  3 = "This"


**As a special case of a character set the metacharacter dot, or period (.), indicates that the pattern should match any single character in that position.**

In [44]:
test_patterns('abbaaabbbbaaaaa',
              [ 'a.',    # a followed by any one character
                'b.',    # b followed by any one character
                'a.*b',  # a followed by anything, ending in b
                'a.*?b', # a followed by anything, ending in b, not greedy
                ])

          11111
012345678901234
abbaaabbbbaaaaa
Matching a.
   0 :  1 = "ab"
   3 :  4 = "aa"
   5 :  6 = "ab"
  10 : 11 = "aa"
  12 : 13 = "aa"
Matching b.
   1 :  2 = "bb"
   6 :  7 = "bb"
   8 :  9 = "bb"
Matching a.*b
   0 :  9 = "abbaaabbbb"
Matching a.*?b
   0 :  1 = "ab"
   3 :  6 = "aaab"
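One detail the examples above do not exercise: by default the dot does not match a newline character. A small sketch, assuming a two-line input string, shows the default behavior and the re.DOTALL flag that changes it.

```python
import re

text = 'ab\ncd'

default_matches = re.findall('b.c', text)            # . does not match '\n' by default
dotall_matches = re.findall('b.c', text, re.DOTALL)  # with DOTALL, . matches '\n' too

print(default_matches)  # []
print(dotall_matches)   # ['b\nc']
```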


# Escape Codes

# Identifiers for Characters in Patterns
Character classes such as digits or whitespace have escape codes that represent them, and you can use these codes to build up a pattern string. Notice that they make heavy use of the backslash \ . Because of this, when defining a pattern string for a regular expression we use the format:

    r'mypattern'
    
Placing the r in front of the string makes it a raw string, which tells Python that each \ in the pattern is a literal backslash rather than the start of a string escape sequence.
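To make the effect of the r prefix concrete, the sketch below compares ordinary and raw string literals. It uses '\b' only as an example of a sequence that Python itself interprets (as a backspace character) when the prefix is missing.

```python
import re

# Without the r prefix, Python turns '\b' into a single backspace character
# before the regex engine ever sees it.
print(len('\b'))   # 1
print(len(r'\b'))  # 2  (a backslash followed by 'b')

# r'\d' and '\\d' are the same two-character string
print(r'\d' == '\\d')  # True

text = "My telephone number is 408-555-1234"
print(re.search(r'\d\d\d', text).group())  # 408
```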

Below is a table of the most common identifiers:

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >\d</span></td><td>A digit</td><td>file_\d\d</td><td>file_25</td></tr>

<tr ><td><span >\w</span></td><td>Alphanumeric (letters, digits, and underscore)</td><td>\w-\w\w\w</td><td>A-b_1</td></tr>



<tr ><td><span >\s</span></td><td>White space</td><td>a\sb\sc</td><td>a b c</td></tr>



<tr ><td><span >\D</span></td><td>A non digit</td><td>\D\D\D</td><td>ABC</td></tr>

<tr ><td><span >\W</span></td><td>Non-alphanumeric</td><td>\W\W\W\W\W</td><td>*-+=)</td></tr>

<tr ><td><span >\S</span></td><td>Non-whitespace</td><td>\S\S\S\S</td><td>Yoyo</td></tr></table>

For example:

In [47]:
text = "My telephone number is 408-555-1234"

In [48]:
phone = re.search(r'\d\d\d-\d\d\d-\d\d\d\d',text)

**Notice the repetition of \d. That is a bit of an annoyance, especially if we are looking for very long strings of numbers. The quantifiers from the Repetition section make the pattern more compact.**

In [51]:
re.search(r'\d{3}-\d{3}-\d{4}',text)

<re.Match object; span=(23, 35), match='408-555-1234'>

In [52]:
test_patterns('This is a prime #1 example!',
              [ r'\d+', # sequence of digits
                r'\D+', # sequence of non-digits
                r'\s+', # sequence of whitespace
                r'\S+', # sequence of non-whitespace
                r'\w+', # alphanumeric characters
                r'\W+', # non-alphanumeric
                ])

          11111111112222222
012345678901234567890123456
This is a prime #1 example!
Matching \d+
  17 : 17 = "1"
Matching \D+
   0 : 16 = "This is a prime #"
  18 : 26 = " example!"
Matching \s+
   4 :  4 = " "
   7 :  7 = " "
   9 :  9 = " "
  15 : 15 = " "
  18 : 18 = " "
Matching \S+
   0 :  3 = "This"
   5 :  6 = "is"
   8 :  8 = "a"
  10 : 14 = "prime"
  16 : 17 = "#1"
  19 : 26 = "example!"
Matching \w+
   0 :  3 = "This"
   5 :  6 = "is"
   8 :  8 = "a"
  10 : 14 = "prime"
  17 : 17 = "1"
  19 : 25 = "example"
Matching \W+
   4 :  4 = " "
   7 :  7 = " "
   9 :  9 = " "
  15 : 16 = " #"
  18 : 18 = " "
  26 : 26 = "!"


**To match the characters that are part of the regular expression syntax, escape the characters in the search pattern.**

In [53]:
test_patterns(r'\d+ \D+ \s+ \S+ \w+ \W+',
              [ r'\\d\+',
                r'\\D\+',
                r'\\s\+',
                r'\\S\+',
                r'\\w\+',
                r'\\W\+',
                ])

          1111111111222
01234567890123456789012
\d+ \D+ \s+ \S+ \w+ \W+
Matching \\d\+
   0 :  2 = "\d+"
Matching \\D\+
   4 :  6 = "\D+"
Matching \\s\+
   8 : 10 = "\s+"
Matching \\S\+
  12 : 14 = "\S+"
Matching \\w\+
  16 : 18 = "\w+"
Matching \\W\+
  20 : 22 = "\W+"
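Escaping by hand gets tedious when the literal text is long or comes from elsewhere. The re module provides re.escape() for this; the sketch below applies it to a made-up literal (an assumption for illustration) that contains several metacharacters.

```python
import re

literal = '1+1=2 (true)'  # contains the metacharacters +, ( and )

# re.escape() backslash-escapes the regex metacharacters in the literal
pattern = re.escape(literal)
match = re.search(pattern, 'We know that 1+1=2 (true) always.')
print(match.group())  # 1+1=2 (true)
```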


## Groups

What if we wanted to do two tasks: find phone numbers, and also quickly extract their area code (the first three digits)? We can use groups for any general task that involves grouping together regular expressions (so that we can later break them down).

Using the phone number example, we can separate groups of regular expressions using parentheses:

In [26]:
phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')

In [27]:
results = re.search(phone_pattern,text)

In [28]:
# The entire result
results.group()

'408-555-1234'

In [67]:
# Can then also call by group position.
# remember groups were separated by parentheses ()
# Something to note is that group ordering starts at 1. Passing in 0 returns everything
results.group(1)

'408'

In [30]:
results.group(2)

'555'

In [31]:
results.group(3)

'1234'

In [32]:
# We only had three groups of parentheses
results.group(4)

IndexError: no such group
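When all of the groups are needed at once, the Match object's groups() method returns them together as a tuple. The sketch below repeats the phone number setup so that it runs on its own; the unpacked variable names are hypothetical.

```python
import re

text = "My telephone number is 408-555-1234"
phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')
results = phone_pattern.search(text)

# groups() returns every captured group as a tuple, in order
print(results.groups())  # ('408', '555', '1234')

# The tuple can be unpacked into names (hypothetical variable names)
area_code, exchange, line_number = results.groups()
print(area_code)         # 408

# group(0) is the same as group(): the entire match
print(results.group(0))  # 408-555-1234
```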

## Additional Regex Syntax

### Or operator |

Use the pipe operator to write an **or** condition. For example:

In [33]:
re.search(r"man|woman","This man was here.")

<re.Match object; span=(5, 8), match='man'>

In [34]:
re.search(r"man|woman","This woman was here.")

<re.Match object; span=(5, 10), match='woman'>
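Two properties of the pipe are worth seeing in action: the alternatives are tried from left to right at each position, and the pipe splits everything on either side of it, not just the adjacent characters. The sample sentences below are made up for illustration.

```python
import re

# Each alternative is tried left to right at every position in the text
people = re.findall(r"man|woman", "A man and a woman were here.")
print(people)  # ['man', 'woman']

# The pipe separates the whole pattern: this means 'cat' OR 'dog food'
pets = re.findall(r"cat|dog food", "cat food and dog food")
print(pets)    # ['cat', 'dog food']
```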

### The Wildcard Character

Use a "wildcard" as a placeholder that will match any single character in that position. You can use a simple period **.** for this. For example:

In [35]:
re.findall(r".at","The cat in the hat sat here.")

['cat', 'hat', 'sat']

In [36]:
re.findall(r".at","The bat went splat")

['bat', 'lat']

Notice that for "splat" we only matched three characters ('lat'). That is because each **.** matches exactly one character, so we need one **.** per wildcard character, or one of the quantifiers described above to match a variable number of characters.

In [37]:
re.findall(r"...at","The bat went splat")

['e bat', 'splat']

However, this still has the problem of grabbing extra characters before the word ('e bat'). Really we only want words that end with "at".

In [38]:
# One or more non-whitespace that ends with 'at'
re.findall(r'\S+at',"The bat went splat")

['bat', 'splat']

### Starts With and Ends With

We can use the **^** to signal starts with, and the **$** to signal ends with:

In [39]:
# Ends with a number
re.findall(r'\d$','This ends with a number 2')

['2']

In [40]:
# Starts with a number
re.findall(r'^\d','1 is the loneliest number.')

['1']

Note that this is for the entire string, not individual words!
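If the text contains several lines and each line should be treated separately, the re.MULTILINE flag changes what ^ and $ anchor to. A minimal sketch, assuming a made-up three-line string:

```python
import re

lines = "1 apple\n2 pears\nthree plums"

# By default ^ anchors only at the very start of the string
first_only = re.findall(r'^\d', lines)
print(first_only)  # ['1']

# With re.MULTILINE, ^ also matches right after each newline
per_line = re.findall(r'^\d', lines, re.MULTILINE)
print(per_line)    # ['1', '2']
```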

### Exclusion

To exclude characters, we can use the **^** symbol at the start of a set of brackets **[]**. Any character listed after it inside the brackets is excluded from the match. For example:

In [41]:
phrase = "there are 3 numbers 34 inside 5 this sentence."

In [42]:
re.findall(r'[^\d]',phrase)

['t',
 'h',
 'e',
 'r',
 'e',
 ' ',
 'a',
 'r',
 'e',
 ' ',
 ' ',
 'n',
 'u',
 'm',
 'b',
 'e',
 'r',
 's',
 ' ',
 ' ',
 'i',
 'n',
 's',
 'i',
 'd',
 'e',
 ' ',
 ' ',
 't',
 'h',
 'i',
 's',
 ' ',
 's',
 'e',
 'n',
 't',
 'e',
 'n',
 'c',
 'e',
 '.']

To get the words back together, add a + quantifier after the set:

In [43]:
re.findall(r'[^\d]+',phrase)

['there are ', ' numbers ', ' inside ', ' this sentence.']

We can use this to remove punctuation from a sentence.

In [44]:
test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

In [45]:
re.findall('[^!.? ]+',test_phrase)

['This',
 'is',
 'a',
 'string',
 'But',
 'it',
 'has',
 'punctuation',
 'How',
 'can',
 'we',
 'remove',
 'it']

In [46]:
clean = ' '.join(re.findall('[^!.? ]+',test_phrase))

In [47]:
clean

'This is a string But it has punctuation How can we remove it'
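An alternative to finding the pieces and joining them is to substitute the unwanted characters away with re.sub(). This is a sketch of a different technique from the one used above, shown on the same test phrase; here it happens to give the same result.

```python
import re

test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

# Replace each !, . or ? with the empty string
clean = re.sub(r'[!.?]', '', test_phrase)
print(clean)  # This is a string But it has punctuation How can we remove it
```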

## Brackets for Grouping

As we showed above, we can use brackets to define a set of allowed characters. For example, to find hyphenated words:

In [48]:
text = 'Only find the hyphen-words in this sentence. But you do not know how long-ish they are'

In [49]:
re.findall(r'[\w]+-[\w]+',text)

['hyphen-words', 'long-ish']

## Parentheses for Multiple Options

If we have multiple options for matching, we can use parentheses to list out these options. For example:

In [50]:
# Find words that start with cat and end with one of these options: 'fish','nap', or 'claw'
text = 'Hello, would you like some catfish?'
texttwo = "Hello, would you like to take a catnap?"
textthree = "Hello, have you seen this caterpillar?"

In [51]:
re.search(r'cat(fish|nap|claw)',text)

<re.Match object; span=(27, 34), match='catfish'>

In [52]:
re.search(r'cat(fish|nap|claw)',texttwo)

<re.Match object; span=(32, 38), match='catnap'>

In [53]:
# None returned
re.search(r'cat(fish|nap|claw)',textthree)
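One thing to watch for when combining parentheses with findall(): if the pattern contains a capturing group, findall() returns the group contents and not the whole match. The sketch below, using a made-up sentence, shows this alongside the non-capturing form (?:...).

```python
import re

text = 'catfish and catnap, but not caterpillar'

# With a capturing group, findall() returns only the group's contents
suffixes = re.findall(r'cat(fish|nap|claw)', text)
print(suffixes)  # ['fish', 'nap']

# (?:...) groups without capturing, so findall() returns the whole match
words = re.findall(r'cat(?:fish|nap|claw)', text)
print(words)     # ['catfish', 'catnap']
```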

### Conclusion

Excellent work! For full information on all possible patterns, check out: https://docs.python.org/3/howto/regex.html

## Next up: Python Text Basics Assessment