# Python Tutorial 3 - Regular Expressions

A regular expression is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a pattern.

The Python module **re** provides full support for regular expressions. When working with regular expressions you usually want to work with raw strings. In a raw string, backslashes are kept literally instead of starting escape sequences with a special meaning, such as newline (\n) or tab (\t). To write a raw string in Python, prefix the string literal with r:

In [2]:
notRawString = 'This is a \tstring\n with special characters test'
print(notRawString)
print("--------------")
rawString = r'\tThis is a string\n with special characters' #raw string
print(rawString)

This is a 	string
 with special characters test
--------------
\tThis is a string\n with special characters
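
To see why this matters for regular expressions in particular, here is a small sketch (the sample text and variable names are illustrative and not part of the original notebook). Without the r prefix, Python turns \b into a backspace character before the **re** module ever sees it, so a word-boundary pattern silently fails to match.

```python
import re

text = "a cat sat"  # hypothetical sample string

# Without r, '\b' is the backspace character, not a word boundary.
plain = re.search('\bcat\b', text)

# With r, the backslash reaches the re module, which reads \b as a word boundary.
raw = re.search(r'\bcat\b', text)

print(plain)                  # None
print(raw.group())            # cat
print(len('\n'), len(r'\n'))  # 1 2
```

The last line shows the underlying difference: '\n' is one character (a newline), while r'\n' is two characters (a backslash and the letter n).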


### The search function

The *search* function in the **re** module scans *string* for the first location where the regular expression *pattern* matches, with optional *flags*.

Parameter | Description 
--- | --- 
 pattern |  This is the regular expression to be matched.
 string |  This is the string that is searched for the pattern. The match may start anywhere in the string, not only at the beginning.
 flags |  You can specify different flags using bitwise OR. These are modifiers, which are listed later in the tutorial.
 
The re.search function returns a match object on success and None on failure. We use the group(num) or groups() methods of the match object to get the matched expression(s).
 
Match Object Methods | Description 
--- | --- 
 group(x) |  This method returns the entire match (when x is 0 or omitted) or the specific subgroup x of the match
 groups() |  This method returns all matching subgroups in a tuple (empty if there weren't any)
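
As a minimal sketch of how these pieces fit together (the pattern and sample strings below are made up for illustration), the following shows a successful search, the difference between group() and groups(), and what a failed search returns:

```python
import re

# Hypothetical sample data: two capturing groups, one for each number.
m = re.search(r'(\d+) x (\d+)', 'size: 640 x 480 pixels')

if m is not None:
    whole = m.group()    # same as m.group(0): the entire match
    width = m.group(1)   # first subgroup
    height = m.group(2)  # second subgroup
    both = m.groups()    # all subgroups as a tuple
    print(whole, width, height, both)

# No digits here, so search returns None instead of a match object.
miss = re.search(r'(\d+) x (\d+)', 'no numbers here')
print(miss)  # None
```

Checking the result against None before calling group() avoids an AttributeError when nothing matched.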
 
The functions in the **re** module accept optional modifiers that control various aspects of matching. The modifiers are passed through the optional *flags* argument. You can provide multiple modifiers by combining them with the bitwise OR operator (|):
 
Modifier | Description 
--- | --- 
 re.I | Performs case-insensitive matching.
 re.L | Interprets words according to the current locale. This interpretation affects the alphabetic group (\w and \W), as well as word boundary behavior (\b and \B). In Python 3 this flag can only be used with bytes patterns.
 re.M | Makes $ match the end of a line (not just the end of the string) and makes ^ match the start of any line (not just the start of the string).
 re.S | Makes a period (dot) match any character, including a newline.
 re.U | Interprets letters according to the Unicode character set. This flag affects the behavior of \w, \W, \b, \B. In Python 3 this is already the default for string patterns.
 re.X | Permits more readable ("verbose") regular expression syntax. It ignores whitespace (except inside a set [] or when escaped by a backslash) and treats unescaped # as a comment marker.
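
The sketch below illustrates how modifiers change the result of the same search, and how two of them are combined with the bitwise OR operator. The sample text is an assumption made for this example.

```python
import re

text = "first line\nSECOND line"  # hypothetical two-line sample

# Without flags: ^ only matches at the start of the string and the case differs.
no_flags = re.search(r'^second', text)

# re.I ignores case, re.M lets ^ match right after the newline; | combines both.
both_flags = re.search(r'^second', text, re.I | re.M)

# re.S lets the dot match the newline, so .* can span both lines.
dot_all = re.search(r'first.*line', text, re.S)

print(no_flags)             # None
print(both_flags.group())   # SECOND
print(repr(dot_all.group()))
```

With re.S the greedy .* runs up to the last occurrence of "line", so the final match covers the whole two-line string.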

### Regular Expression Special characters

Regular expressions can contain both special and ordinary characters. Most ordinary characters, like 'A', 'a', or '0', are the simplest regular expressions; they simply match themselves. You can concatenate ordinary characters, so the regular expression r'last' matches the string 'last'.

Some characters, like '|', '\' or '(', are special. Special characters either stand for classes of ordinary characters, or affect how the regular expressions around them are interpreted. In the table below, I have tried to summarize some of the most important special characters: 

Special character Pattern | Description 
--- | --- 
^  |  Matches beginning of line.
$  |  Matches end of line.
.  |  Matches any single character except newline. Using the re.S option allows it to match newline as well.
[...]| Matches any single character in brackets.
[^...]| Matches any single character not in brackets
* | Matches 0 or more occurrences of preceding expression.
+ | Matches 1 or more occurrences of preceding expression.
? | Matches 0 or 1 occurrences of preceding expression.
{n} | Matches exactly n number of occurrences of preceding expression.
{n,m}| Matches at least n and at most m occurrences of preceding expression. Do not put a space after the comma.
a &#124; b| Matches either a or b.
(re)| Groups regular expressions and remembers matched text.
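(?imx)| Turns on the re.I, re.M, or re.X options for the whole regular expression. It should be placed at the very start of the pattern.
(?-imx:re)| Turns off the re.I, re.M, or re.X options for the enclosed group only (Python 3.6 and later). Likewise, (?imx:re) turns them on for the group only.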
\w | Matches word characters.
\W | Matches nonword characters.
\s | Matches whitespace. For ASCII text, equivalent to [ \t\n\r\f\v].
\S | Matches nonwhitespace.
\d | Matches digits. Equivalent to [0-9].
\D | Matches nondigits.
\b | Matches word boundaries
\n, \t, etc. | Matches newlines, carriage returns, tabs, etc.
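
To make the table more concrete, here is a short sketch that applies a handful of these special characters to one made-up string (the string and variable names are assumptions for illustration only):

```python
import re

log = "order 42: 3 apples, 12 pears"  # hypothetical sample string

digits = re.search(r'\d+', log).group()          # first run of digits
fruit = re.search(r'apples|pears', log).group()  # alternation, leftmost wins
vowels = re.search(r'[aeiou]{2}', log).group()   # exactly two vowels in a row
start = re.search(r'^\w+', log).group()          # word characters at the start
end = re.search(r'\w+$', log).group()            # word characters at the end
optional = re.search(r'pears?', "one pear").group()  # the s is optional

print(digits, fruit, vowels, start, end, optional)
# 42 apples ea order pears pear
```

Each call uses re.search, so only the first (leftmost) match of every pattern is reported.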

--- 
 
After all this theory, it is time for some practice. Work through the following examples line by line and make sure you understand what's going on. Edit the code snippets and run them to convince yourself how regular expressions work in Python.

In [7]:
import re  # We import the regular expression module

str1 = "My favorite color is red"   # We create a sample string
str2 = "My favorite colour is red"  # We create a 2nd sample string
str3 = "My favorite song is red"    # We create a 3rd sample string

myRegExp = r'col[ou]+r'
searchObj1 = re.search(myRegExp, str1)  # We try to match myRegExp against the string 'str1'
searchObj2 = re.search(myRegExp, str2)  # We try to match myRegExp against the string 'str2'
searchObj3 = re.search(myRegExp, str3)  # We try to match myRegExp against the string 'str3'

"""
We need try statements here because, in case the regular expression pattern doesn't match the string,
re.search returns None, and calling group() on None raises an AttributeError.
"""
try:
    print(searchObj1.group())  # prints the entire match
except AttributeError:
    print("Nothing matched!")

try:
    print(searchObj2.group())  # prints the entire match
except AttributeError:
    print("Nothing matched!")

try:
    print(searchObj3.group())  # prints the entire match
except AttributeError:
    print("Nothing matched!")
    


color
colour
Nothing matched!


#### Modifiers

In [8]:
import re  # We import the regular expression module

str1 = "My favorite COLOR is red"  # We create a sample string
str2 = "My favorite COLOR is red"  # We create a 2nd, identical sample string

myRegExp = r'col[ou]+r'
searchObj1 = re.search(myRegExp, str1)        # We try to match myRegExp against the string 'str1'
searchObj2 = re.search(myRegExp, str2, re.I)  # Same search against 'str2', with the case-insensitive modifier

try:
    print(searchObj1.group())  # prints the entire match
except AttributeError:
    print("Nothing matched!")

try:
    print(searchObj2.group())  # prints the entire match
except AttributeError:
    print("Nothing matched!")

Nothing matched!
COLOR
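
The same modifier can also be written inside the pattern itself, using the (?i) form from the special characters table. The following sketch, which reuses the sample string from the cell above, suggests that both spellings give the same result:

```python
import re

text = "My favorite COLOR is red"

# Modifier passed through the flags argument.
with_flag = re.search(r'col[ou]+r', text, re.I)

# Same modifier written inline at the start of the pattern.
inline = re.search(r'(?i)col[ou]+r', text)

print(with_flag.group())  # COLOR
print(inline.group())     # COLOR
```

The inline form can be handy when the pattern is stored as plain text, for example in a configuration file, and there is no place to pass a flags argument.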


#### Grouping matches

In [16]:
import re  # We import the regular expression module

line = "TV is lame test dog"  # We create a sample string

# We try to match a regular expression against the string 'line' using 3 groups for matches
searchObj = re.search(r'(.*) is (.*) test (.*)', line)

if searchObj:  # If the regular expression returned a match
    print(searchObj.group())   # prints the entire match
    print(searchObj.group(1))  # prints the 1st subgroup
    print(searchObj.group(2))  # prints the 2nd subgroup
    print(searchObj.group(3))  # prints the 3rd subgroup
else:
    print("Nothing found!!")

TV is lame test dog
TV
lame
dog


#### Matching at the beginning and end of a string

In [17]:
import re  # We import the regular expression module

line1 = "TV is lame"                    # We create a sample string
line2 = "I think TV is lame and silly"  # We create another sample string

# The match needs to start at the beginning of the line and finish at the very end
myRegExp = r'^(TV).*(lame)$'

searchObj1 = re.search(myRegExp, line1)  # We try to match myRegExp against the string 'line1'
searchObj2 = re.search(myRegExp, line2)  # We try to match myRegExp against the string 'line2'

if searchObj1:  # If the regular expression returned a match
    print(searchObj1.group())  # prints the entire match
else:
    print("Nothing found by myRegExp in line1")
if searchObj2:  # If the regular expression returned a match
    print(searchObj2.group())  # prints the entire match
else:
    print("Nothing found by myRegExp in line2")

TV is lame
Nothing found by myRegExp in line2
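
To isolate what each anchor contributes, the sketch below tries three variants of the pattern against the second sample string: one without anchors, one with only ^ and one with only $. The variable names are chosen for this illustration.

```python
import re

line2 = "I think TV is lame and silly"

# No anchors: the pattern may match anywhere inside the string.
unanchored = re.search(r'(TV).*(lame)', line2)

# Only ^: the match would have to begin at the very start of the string.
start_only = re.search(r'^(TV).*(lame)', line2)

# Only $: the match would have to finish at the very end of the string.
end_only = re.search(r'(TV).*(lame)$', line2)

print(unanchored.group())  # TV is lame
print(start_only)          # None
print(end_only)            # None
```

Either anchor on its own is enough to reject this string, because "TV" is not at the start and "lame" is not at the end.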


#### Search and replace

One of the most important functions in the **re** module is sub.

This function replaces occurrences of the RE pattern in string with repl. It substitutes all occurrences unless the optional count argument is provided, in which case at most count replacements are made. It returns the modified string.

In [3]:
#!/usr/bin/python
import re

phone = "2004-959-559 # This is Phone Number"

# Delete Python-style comments
num = re.sub(r'#.*$', "", phone)
print("Phone Num : ", num)

# Remove anything other than digits
num = re.sub(r'\D', "", phone)    
print("Phone Num : ", num)

Phone Num :  2004-959-559
Phone Num :  2004959559
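
The optional limit on the number of replacements is passed to re.sub as the count argument. As a small sketch (the phone string is shortened from the example above, and the variable names are illustrative):

```python
import re

phone = "2004-959-559"

# Default: every occurrence of the pattern is replaced.
all_replaced = re.sub(r'-', ' ', phone)

# count=1: only the first occurrence is replaced.
first_only = re.sub(r'-', ' ', phone, count=1)

print(all_replaced)  # 2004 959 559
print(first_only)    # 2004 959-559
```

Passing count by keyword keeps it from being confused with the flags argument, which comes after it in the function signature.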


#### Assigning names to matches

Sometimes it is useful to assign names to subgroups of matches so later in the code we can refer to them. This is particularly useful when working with URLs in Django.

In [43]:
import re
contactInfo = 'Doe, John: 555-1212'

"""
In the next line I assign the name 'last' to the 1st subgroup of the regular expression so I can later refer to it.
I also assign the names 'first' and 'phone' to the subsequent subgroups.
"""
match = re.search(r'(?P<last>\w+), (?P<first>\w+): (?P<phone>\S+)', contactInfo)
print(match.group('last'))
print(match.group('first'))
print(match.group('phone'))

Doe
John
555-1212
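
When several named subgroups are needed at once, the match object's groupdict() method returns them together as a dictionary. A minimal sketch, reusing the contact string from the cell above:

```python
import re

contactInfo = 'Doe, John: 555-1212'
match = re.search(r'(?P<last>\w+), (?P<first>\w+): (?P<phone>\S+)', contactInfo)

# All named subgroups as a dictionary keyed by group name.
info = match.groupdict()
print(info)  # {'last': 'Doe', 'first': 'John', 'phone': '555-1212'}

# Named subgroups are still numbered, so both forms refer to the same text.
print(match.group(1) == match.group('last'))  # True
```

A dictionary like this can be passed straight to other code that expects keyword-style data, which is one reason named groups are convenient.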


### More Examples

In [45]:
s = '100 High Road'
o = re.sub('Road$', 'RD.', s)
print(o)

100 High RD.


In [46]:
phonePattern = re.compile(r'^(\d{3})-(\d{3})-(\d{4})$')
o = phonePattern.search('800-555-1212').groups()   
print(o)

('800', '555', '1212')
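
A compiled pattern is mostly useful when the same expression is applied to many strings. The sketch below reuses the phonePattern from the cell above on a made-up list of candidates and guards against failed matches, since calling groups() directly on the result of search raises an AttributeError when nothing matched.

```python
import re

phonePattern = re.compile(r'^(\d{3})-(\d{3})-(\d{4})$')

# Hypothetical inputs: one valid number, one too short, one with extra text.
candidates = ['800-555-1212', '800-555-121', 'call 800-555-1212']

results = {}
for candidate in candidates:
    m = phonePattern.search(candidate)
    # Store the subgroups on success, None when the pattern did not match.
    results[candidate] = m.groups() if m else None

for candidate, parts in results.items():
    print(candidate, '->', parts)
```

Only the first candidate matches: the second has too few digits in the last block, and the third is rejected by the ^ anchor.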


In [1]:
import re
st = "z111111z"
m = re.search(r"^\d+$", st) #Notice the caret ^ and dollar $ symbols
try:
    print(m.group())
except AttributeError:
    print("No match!")

No match!

In [48]:
st = "z111111z"
m = re.search(r"\d+", st) #Raw string, so \d reaches the re module unchanged
print(m.group())

111111
