## Regular Expressions
*Regular expressions* are *patterns* that describe text to look for inside a string.
A regular expression can tell whether a string contains text that fits the pattern.
Python's regular expression functions live in a standard library module called `re`.
The `import` statement makes a library available in a program.

The variable name `p` will be used to store patterns.

In [1]:
import re
s = "this needs to be matched"
p = "match"
if re.search(p, s):
    print("re.search(", p, ", ", s, ") is True")

re.search( match ,  this needs to be matched ) is True
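
`re.search` works in an `if` because it returns a match object when the pattern is found somewhere in the string and `None` when it is not. A minimal sketch of both cases (the variable names `found` and `missing` are chosen only for this illustration):

```python
import re

s = "this needs to be matched"

# The pattern occurs inside the string, so a match object comes back.
found = re.search("match", s)
print(found is not None)   # True
print(found.start())       # 17, the index where the match begins

# The pattern does not occur, so the result is None (falsy in an `if`).
missing = re.search("banana", s)
print(missing)             # None
```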


A pattern is made of normal keyboard characters and escape characters, and some of them have a special meaning in patterns.
- period `.` matches any character except a newline
- the escape character `\.` is needed to match a period, because a period has a special meaning
- the escape character `\d` matches any number character, or *digit*, 0 to 9
- the escape character `\D` is the opposite of `\d` and matches anything that is not a digit
- the escape character `\w` matches any *word character*: a digit, a letter a to z or A to Z, or an underscore `_`
- the escape character `\W` is the opposite of `\w` and matches anything that is not a word character
- the escape character `\s` matches any *space character*, such as a space `" "`, a tab `\t`, or a newline `\n`
- the escape character `\S` is the opposite of `\s` and matches anything that is not a space character

<img src="regex1.jpg" width="300">

In [2]:
import re
s = "there are 12 months in the year"
p = r"\d\d"  # the r prefix makes a raw string, so the backslashes reach re unchanged
if re.search(p, s):
    print("re.search(", p, ", ", s, ") is True")

re.search( \d\d ,  there are 12 months in the year ) is True
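
The character classes listed above can be tried side by side. The sketch below uses a made-up sample string and `re.findall`, which returns every match as a list, so that the effect of each class is visible; the variable names are illustrative only.

```python
import re

s = "Room 12, file_v2.txt\tready"

digits = re.findall(r"\d", s)    # every single digit
words = re.findall(r"\w+", s)    # runs of letters, digits, and underscores
spaces = re.findall(r"\s", s)    # space characters (here two spaces and a tab)

print(digits)   # ['1', '2', '2']
print(words)    # ['Room', '12', 'file_v2', 'txt', 'ready']
print(spaces)   # [' ', ' ', '\t']

# An escaped period matches only a literal period,
# while a bare period matches any one character.
print(re.search(r"\.", s).group())   # .
print(re.search(r"v.", s).group())   # v2
```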


There are special characters that control how characters match in a pattern.
- plus `+` after a normal character or escape character matches one or more of that character in the string
- star `*` after a normal character or escape character matches zero or more of that character in the string
- question mark `?` after a normal character or escape character makes it optional: the pattern matches whether that character appears once or is missing
- caret `^` at the beginning of the pattern means the pattern must match starting at the beginning of the string
- dollar sign `$` at the end of the pattern means the pattern must match ending at the end of the string
- to match a plus, star, question mark, caret, or dollar sign, the escape character version must be used: `\+`, `\*`, `\?`, `\^`, or `\$`

In [3]:
import re
s = "there are 12 months in the year"
p = r"\d+"
if re.search(p, s):
    print("re.search(", p, ", ", s, ") is True")

re.search( \d+ ,  there are 12 months in the year ) is True
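
To compare the special characters described above, the sketch below checks a few small patterns against strings that should and should not match. The helper `matches` is a hypothetical convenience for this example, not part of `re`.

```python
import re

def matches(p, s):
    """Hypothetical helper: True if pattern p is found anywhere in s."""
    return re.search(p, s) is not None

# "+" needs at least one "o"; "*" also accepts none.
print(matches(r"go+d", "gd"))        # False
print(matches(r"go*d", "gd"))        # True
print(matches(r"go+d", "good"))      # True

# "?" makes the "u" optional.
print(matches(r"colou?r", "color"))   # True
print(matches(r"colou?r", "colour"))  # True

# "^" and "$" pin the match to the start and end of the string.
print(matches(r"^\d+", "12 months"))      # True
print(matches(r"^\d+", "has 12 months"))  # False
print(matches(r"\d+$", "year 2011"))      # True

# An escaped "+" matches a literal plus sign.
print(re.search(r"\d\+\d", "what is 2+3?").group())  # 2+3
```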


The pattern to match a date like *month*/*day*/*year*, where month and day are two digits and year is four digits, as in "03/05/2011", is `\d\d/\d\d/\d\d\d\d`. Each `\d` is an escape character, and each slash `/` matches a normal slash.

<img src="regex2.jpg" width="200">

In [4]:
import re
s = "03/05/2011"
p = r"\d\d/\d\d/\d\d\d\d"
if re.search(p, s):
    print("re.search(", p, ", ", s, ") is True")

re.search( \d\d/\d\d/\d\d\d\d ,  03/05/2011 ) is True


There are special characters to control how many times characters will match in a pattern.
- `{`*count*`}` will match the character just before it exactly *count* times
- `{`*min-count*`,`*max-count*`}` will match the character just before it at least *min-count* times but at most *max-count* times

Written this way, the pattern to match a date becomes `\d{2}/\d{2}/\d{4}`.

<img src="regex3.jpg" width="200">

In [5]:
import re
s = "03/05/2011"
p = r"\d{2}/\d{2}/\d{4}"
if re.search(p, s):
    print("re.search(", p, ", ", s, ") is True")

re.search( \d{2}/\d{2}/\d{4} ,  03/05/2011 ) is True
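
The range form with a minimum and maximum count is not used in the cell above, so here is a small sketch of it. It assumes, purely for illustration, that months and days may be written with one or two digits, and it adds `^` and `$` so that the whole string has to be a date.

```python
import re

# One or two digits for month and day, exactly four for the year.
p = r"^\d{1,2}/\d{1,2}/\d{4}$"

results = {}
for s in ["3/5/2011", "03/05/2011", "003/05/2011", "03/05/11"]:
    results[s] = re.search(p, s) is not None
    print(s, results[s])
```

Only the first two strings match: the third has three digits in the month, and the fourth has a two-digit year.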


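Several characters and escape characters can be grouped with `(` and `)` so that they match as a unit. In `(an){2}`, the `{2}` applies to the whole group `an`, not only to the `n`.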

In [6]:
import re
s = "banana"
p = "(an){2}"
if re.search(p, s):
    print("re.search(", p, ", ", s, ") is True")

re.search( (an){2} ,  banana ) is True


The `|` character can be used in a group, inside `(` and `)`, to let either of two patterns match.

In [7]:
import re
s = "this has been a long day"
p = "a (long|short) day"
if re.search(p, s):
    print("re.search(", p, ", ", s, ") is True")

re.search( a (long|short) day ,  this has been a long day ) is True
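
The parentheses matter here because `|` splits everything on its left from everything on its right. The sketch below, with a made-up test string, compares the grouped pattern with an ungrouped one; `grouped` and `ungrouped` are illustrative names.

```python
import re

s = "what a long week"

# Grouped: "a ", then "long" or "short", then " day". Not present in s.
grouped = re.search(r"a (long|short) day", s)
print(grouped)             # None

# Ungrouped: either "a long" or "short day". "a long" is present in s.
ungrouped = re.search(r"a long|short day", s)
print(ungrouped.group())   # a long
```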


After a pattern matches, you can get the exact part of the string that matched by calling the `group` method on the match object that `re.search` returns.

In [8]:
import re
s = "today 03/05/2011 is March 5th, 2011"
p = r"\d{2}/\d{2}/\d{4}"
m = re.search(p, s)
if m:
    print("after re.search(", p, ", ", s, "), m.group() is ", m.group())

after re.search( \d{2}/\d{2}/\d{4} ,  today 03/05/2011 is March 5th, 2011 ), m.group() is  03/05/2011


By grouping parts of the pattern with `(` and `)`, the `group` method can give the part of the string matched by each group. Groups are numbered from 1, in the order their opening parentheses appear.

<img src="regex4.jpg" width="300">

In [9]:
import re
s = "on 03/05/2011 it rained"
print("s starts as ", s)
p = r"(\d{2})/(\d{2})/(\d{4})"
print("p starts as ", p)
m = re.search(p, s)
if m:
    print("after re.search(p, s), m.group() is ", m.group())
    print("m.group(1) is the month ", m.group(1))
    print("m.group(2) is the day ", m.group(2))
    print("m.group(3) is the year ", m.group(3))

s starts as  on 03/05/2011 it rained
p starts as  (\d{2})/(\d{2})/(\d{4})
after re.search(p, s), m.group() is  03/05/2011
m.group(1) is the month  03
m.group(2) is the day  05
m.group(3) is the year  2011
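
When every group is needed at once, the match object's `groups()` method returns them together as a tuple. A short sketch using the same date string (converting the parts to `int` is an extra step added here for illustration):

```python
import re

s = "on 03/05/2011 it rained"
m = re.search(r"(\d{2})/(\d{2})/(\d{4})", s)
if m:
    print(m.groups())   # ('03', '05', '2011')
    # Convert the three matched strings to integers in one step.
    month, day, year = (int(part) for part in m.groups())
    print(month, day, year)   # 3 5 2011
    # span() gives the start and end index of the whole match in s.
    print(m.span())           # (3, 13)
```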


Building a regular expression can be tricky, and you can use tools like `regex101.com` to visualize how a given regular expression will match some input text.

## Regex functions

### The findall() Function
The findall() function returns a list containing all matches, in the order they are found.

If no matches are found, an empty list is returned.

In [10]:
import re

txt = "The rain in Spain"
x = re.findall("ai", txt)
print(x)

['ai', 'ai']
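
Two more cases are worth trying with `findall`: a pattern that never matches, and a pattern that contains groups. In the second case `findall` returns the text of the groups rather than the whole match. A minimal sketch with illustrative variable names and a made-up second string:

```python
import re

txt = "The rain in Spain"

# Whole words that contain "ai".
words = re.findall(r"\w*ai\w*", txt)
print(words)    # ['rain', 'Spain']

# No match anywhere gives an empty list, not None.
nothing = re.findall(r"Portugal", txt)
print(nothing)  # []

# With two groups in the pattern, each match becomes a tuple of the groups.
pairs = re.findall(r"(\d{2})/(\d{2})", "03/05 and 11/23")
print(pairs)    # [('03', '05'), ('11', '23')]
```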


### The split() Function
The split() function returns a list where the string has been split at each match.

Split at each white-space character:

In [11]:
import re

txt = "The rain in Spain"
x = re.split(r"\s", txt)
print(x)

['The', 'rain', 'in', 'Spain']
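
Because the string is split at each match, two space characters in a row produce an empty string between them. The sketch below uses a made-up string with uneven spacing to show this, along with two ways to adjust it: the `+` quantifier and the `maxsplit` argument of `re.split`.

```python
import re

txt = "The  rain\tin   Spain"   # two spaces, a tab, three spaces

# One split per space character leaves empty strings behind.
each = re.split(r"\s", txt)
print(each)    # ['The', '', 'rain', 'in', '', '', 'Spain']

# Treating a run of space characters as one separator avoids them.
runs = re.split(r"\s+", txt)
print(runs)    # ['The', 'rain', 'in', 'Spain']

# maxsplit limits how many splits are made.
first = re.split(r"\s+", txt, maxsplit=1)
print(first)   # ['The', 'rain\tin   Spain']
```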


### The sub() Function
The sub() function replaces the matches with the text of your choice.

Replace every white-space character with the number 9:

In [12]:
import re

txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt)
print(x)

The9rain9in9Spain
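
`re.sub` also accepts a `count` argument that limits the number of replacements, and the replacement text can refer back to groups in the pattern with `\1`, `\2`, and so on. A short sketch of both, reusing the date pattern from earlier (the year-month-day output format is just an example choice):

```python
import re

txt = "The rain in Spain"

# Replace only the first two space characters.
limited = re.sub(r"\s", "9", txt, count=2)
print(limited)   # The9rain9in Spain

# Reorder month/day/year into year-month-day using group references.
date = "03/05/2011"
reordered = re.sub(r"(\d{2})/(\d{2})/(\d{4})", r"\3-\1-\2", date)
print(reordered)   # 2011-03-05
```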
