All the regex functions in Python are in the re module.

In [1]:
import re

Passing a string value representing your regular expression to re.compile() returns a Regex pattern object (or simply, a Regex object).

_Example: Matching a Phone Number_

In [4]:
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

A Regex object’s search() method searches the string it is passed for any matches to the regex. The search() method will return None if the regex pattern is not found in the string. If the pattern is found, the search() method returns a Match object, which has a group() method that will return the actual matched text from the searched string.

In [5]:
mo = phoneNumRegex.search('My number is 415-555-4242')
print('Phone number found: ' + mo.group())

Phone number found: 415-555-4242
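
Because search() returns None when nothing matches, calling group() directly on the result raises an AttributeError in that case. The following is a minimal sketch of guarding against that; the helper name `find_phone_number` is hypothetical and only used for illustration.

```python
import re

phone_num_regex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

def find_phone_number(text):
    """Return the first phone number found in text, or None if there is none."""
    mo = phone_num_regex.search(text)
    if mo is None:
        # No match: there is no Match object, so group() cannot be called.
        return None
    return mo.group()

print(find_phone_number('My number is 415-555-4242'))
print(find_phone_number('No number here'))
```

Checking for None before calling group() is the usual pattern whenever the input text is not guaranteed to contain a match.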


_Grouping with Parentheses_

Adding parentheses will create groups in the regex: (\d\d\d)-(\d\d\d-\d\d\d\d). The first set of parentheses in a regex string will be group 1. The second set will be group 2. By passing the integer 1 or 2 to the group() match object method, you can grab different parts of the matched text. Passing 0 or nothing to the group() method will return the entire matched text.



In [7]:
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My number is 415-555-4242.')
print('1', mo.group(1))
print('2', mo.group(2))
print('0', mo.group(0))

1 415
2 555-4242
0 415-555-4242


If you would like to retrieve all the groups at once, use the groups() method (__note the plural form of the name__).

In [8]:
areaCode, mainNumber = mo.groups()
mo.groups()

('415', '555-4242')

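Parentheses have a special meaning in regular expressions. To match literal parentheses in the text, escape them with a backslash: \\( and \\).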

In [9]:
phoneNumRegex = re.compile(r'(\(\d\d\d\)) (\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My phone number is (415) 555-4242')
print(1, mo.group(1))
print(2, mo.group(2))

1 (415)
2 555-4242


In regular expressions, the following characters have special meanings:

. ^ $ * + ? { } [ ] \ | ( )

To detect these characters as part of a text pattern, it's necessary to escape them with a backslash.
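
As a small sketch of escaping in practice: the first pattern escapes the special characters by hand, and the second uses re.escape(), a function of the standard re module that is not otherwise covered in these notes, to escape a literal string automatically. The sample strings are made up for illustration.

```python
import re

# Escaping by hand: \$ and \. match a literal dollar sign and a literal dot.
price_regex = re.compile(r'\$\d+\.\d\d')
price = price_regex.search('Total: $9.99 today')
print(price.group())

# re.escape() inserts the backslashes needed to match a string literally.
escaped_regex = re.compile(re.escape('1+1=2?'))
found = escaped_regex.search('Is 1+1=2? Yes.')
print(found.group())
```

Without escaping, the + and ? in the second string would be treated as repetition operators rather than as literal characters.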

_Matching Multiple Groups with the Pipe_

The | character is called a pipe. You can use it anywhere you want to match one of many expressions. When both Batman and Tina Fey occur in the searched string, the first occurrence of matching text will be returned as the Match object.

In [11]:
heroRegex = re.compile(r'Batman|Tina Fey')
mo1 = heroRegex.search('Batman and Tina Fey')
mo2 = heroRegex.search('Tina Fey and Batman')
print(mo1.group())
print(mo2.group())

Batman
Tina Fey


You can also use the pipe to match one of several patterns as part of your regex. This can be done with parentheses.

In [12]:
batRegex = re.compile(r'Bat(man|mobile|copter|bat)')
mo = batRegex.search('Batmobile lost a wheel')
print(mo.group())
print(mo.group(1))

Batmobile
mobile


_Optional Matching With the Question Mark_

Sometimes there is a pattern that you want to match only optionally. That is, the regex should find a match regardless of whether that bit of text is there.

In [15]:
batRegex = re.compile(r'Bat(wo)?man')
mo1 = batRegex.search('The Adventures of Batman')
mo2 = batRegex.search('The Adventures of Batwoman')
print(mo1.group())
print(mo2.group())

Batman
Batwoman


The (wo)? part of the regular expression means that the pattern wo is an optional group. The regex will match text that has zero instances or one instance of _wo_ in it.

In [16]:
# For the earlier phone number example.
phoneRegex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')
mo1 = phoneRegex.search('My number is 415-555-4242')
mo2 = phoneRegex.search('My number is 555-4242')
print(mo1.group())
print(mo2.group())

415-555-4242
555-4242


You can think of the ? as saying, “Match zero or one of the group preceding this question mark.”

_Matching Zero or More with the Star_

The * (called the star or asterisk) means “match zero or more”—the group that precedes the star can occur any number of times in the text. It can be completely absent or repeated over and over again.

In [17]:
batRegex = re.compile(r'Bat(wo)*man')
mo1 = batRegex.search('The Adventures of Batman')
mo2 = batRegex.search('The Adventures of Batwoman')
mo3 = batRegex.search('The Adventures of Batwowowowoman')
print(mo1.group())
print(mo2.group())
print(mo3.group())

Batman
Batwoman
Batwowowowoman


_Matching One or More with the Plus_

While * means “match zero or more,” the + (or plus) means “match one or more.”

In [18]:
batRegex = re.compile(r'Bat(wo)+man')
mo1 = batRegex.search('The Adventures of Batwoman')
mo2 = batRegex.search('The Adventures of Batwowowowoman')
mo3 = batRegex.search('The Adventures of Batman')
print(mo1.groups())
print(mo2.groups())
print(mo3)

('wo',)
('wo',)
None


_Matching Specific Repetitions with Braces_

If you have a group that you want to repeat a specific number of times, follow the group in your regex with a number in braces. Instead of one number, you can specify a range by writing a minimum, a comma, and a maximum in between the braces. You can also leave out the first or second number in the braces to leave the minimum or maximum unbounded.

In [24]:
haRegex = re.compile(r'(Ha){3}')
mo1 = haRegex.search('HaHaHa')
mo2 = haRegex.search('Ha')
print(mo1.group())
print(mo2)

HaHaHa
None
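
The example above only shows an exact count. As a sketch of the range forms described earlier (a minimum and a maximum, or one of the two left out), the patterns below are applied to a made-up string of six repetitions; the variable names are illustrative.

```python
import re

text = 'HaHaHaHaHaHa'  # six repetitions of 'Ha'

exactly_three = re.compile(r'(Ha){3}')   # exactly 3
three_to_five = re.compile(r'(Ha){3,5}') # between 3 and 5
three_or_more = re.compile(r'(Ha){3,}')  # 3 or more, no upper bound
up_to_two = re.compile(r'(Ha){,2}')      # between 0 and 2

print(exactly_three.search(text).group())
print(three_to_five.search(text).group())
print(three_or_more.search(text).group())
print(up_to_two.search(text).group())
```

Each pattern matches as many repetitions as its upper bound allows, which is 3, 5, 6 and 2 repetitions respectively.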


Python’s regular expressions are greedy by default, which means that in ambiguous situations they will match the longest string possible. The non-greedy (also called lazy) version of the braces, which matches the shortest string possible, has the closing brace followed by a question mark.

In [30]:
greedyHaRegex = re.compile(r'(Ha){3,5}')
mo1 = greedyHaRegex.search('HaHaHaHaHa')
nongreedyHaRegex = re.compile(r'(Ha){3,5}?')
mo2 = nongreedyHaRegex.search('HaHaHaHaHa')
print(mo1.group())
print(mo2.group())

HaHaHaHaHa
HaHaHa


_The findall() method_

In addition to the search() method, Regex objects also have a findall() method. While search() will return a Match object of the first matched text in the searched string, the findall() method will return the strings of every match in the searched string.

In [31]:
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo = phoneNumRegex.search('Cell: 415-555-9999 Work: 212-555-0000')
mo.group()

'415-555-9999'

On the other hand, findall() will not return a Match object but a list of strings—__as long as there are no groups in the regular expression__.

In [32]:
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d') # has no groups
phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')

['415-555-9999', '212-555-0000']

If there are groups in the regular expression, then findall() will return a list of tuples. Each tuple represents a found match, and its items are the matched strings for each group in the regex.

In [33]:
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)') # has groups
phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')

[('415', '555', '9999'), ('212', '555', '0000')]

__Character Classes__

\d - Any numeric digit from 0 to 9.  
\D - Any character that is _not_ a numeric digit from 0 to 9.  
\w - Any letter, numeric digit, or the underscore character.  
\W - Any character that is _not_ a letter, numeric digit, or the underscore character.  
\s - Any space, tab, or newline character.  
\S - Any character that is _not_ a space, tab, or newline.  

In [34]:
xmasRegex = re.compile(r'\d+\s\w+')
# at least one digit, followed by a space character, followed by at least one 'word character'.
xmasRegex.findall('12 drummers, 11 pipers, 10 lords, 9 ladies, 8 maids, 7 swans, 6 geese, 5 rings, 4 birds, 3 hens, 2 doves, 1 partridge')

['12 drummers',
 '11 pipers',
 '10 lords',
 '9 ladies',
 '8 maids',
 '7 swans',
 '6 geese',
 '5 rings',
 '4 birds',
 '3 hens',
 '2 doves',
 '1 partridge']
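
The example above uses \d, \s and \w. As a brief sketch of their negated counterparts, the patterns below run \D, \W and \S against a made-up sample string.

```python
import re

text = 'Room 42, Floor_3!'

non_digits = re.compile(r'\D+').findall(text)      # runs of anything but 0-9
non_word = re.compile(r'\W+').findall(text)        # runs of non-word characters
non_space = re.compile(r'\S+').findall(text)       # runs of non-whitespace

print(non_digits)
print(non_word)
print(non_space)
```

Note that the underscore counts as a word character, so it is not picked up by \W.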

_Making Customized Character Classes_

You can define your own character class using square brackets.

In [35]:
vowelRegex = re.compile(r'[aeiouAEIOU]')
vowelRegex.findall('RoboCop eats baby food. BABY FOOD.')

['o', 'o', 'o', 'e', 'a', 'a', 'o', 'o', 'A', 'O', 'O']

You can also include ranges of letters or numbers by using a hyphen. For example, the character class \[a-zA-Z0-9\] will match all lowercase letters, uppercase letters, and numbers.  
By placing a caret character ( ^ ) just after the character class’s opening bracket, you can make a negative character class. A negative character class will match all the characters that are not in the character class.

In [36]:
consonantRegex = re.compile(r'[^aeiouAEIOU]')
consonantRegex.findall('RoboCop eats baby food. BABY FOOD.')

['R',
 'b',
 'C',
 'p',
 ' ',
 't',
 's',
 ' ',
 'b',
 'b',
 'y',
 ' ',
 'f',
 'd',
 '.',
 ' ',
 'B',
 'B',
 'Y',
 ' ',
 'F',
 'D',
 '.']
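
The range form of a character class mentioned above can be sketched the same way. The sample string below is made up; the second pattern negates the same class with a caret.

```python
import re

text = 'Order #A17: 3 items, total $42.'

# One or more letters or digits in a row.
alnum_regex = re.compile(r'[a-zA-Z0-9]+')
alnum_runs = alnum_regex.findall(text)
print(alnum_runs)

# The negated class matches runs of everything else.
other_regex = re.compile(r'[^a-zA-Z0-9]+')
other_runs = other_regex.findall(text)
print(other_runs)
```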

_The Caret and Dollar Sign Characters_

The caret symbol ( ^ ) can be used at the start of a regex to indicate that a match must occur at the beginning of the searched text. Likewise, the dollar sign ( \$ ) can be used at the end of the regex to indicate that the string must end with this regex pattern.

The ^ and \$  can be used together to indicate that the entire string must match the regex—that is, it’s not enough for a match to be made on some subset of the string.

In [37]:
beginsWithHello = re.compile(r'^Hello')
mo1 = beginsWithHello.search('Hello, world!')
mo2 = beginsWithHello.search('He said hello.')
print(mo1.group())
print(mo2)

Hello
None


In [38]:
endsWithNumber = re.compile(r'\d$')
mo1 = endsWithNumber.search('Your number is 42')
mo2 = endsWithNumber.search('Your number is forty two.')
print(mo1.group())
print(mo2)

2
None


In [39]:
wholeStringIsNum = re.compile(r'^\d+$')
mo1 = wholeStringIsNum.search('1234567890')
mo2 = wholeStringIsNum.search('12345xyz67890')
mo3 = wholeStringIsNum.search('12  34567890')
print(mo1.group())
print(mo2)
print(mo3)

1234567890
None
None


_The Wildcard Character_

The . (or dot) character in a regular expression is called a wildcard and will match any character except for a newline.

In [40]:
atRegex = re.compile(r'.at')
atRegex.findall('The cat in the hat sat on the flat mat.')

['cat', 'hat', 'sat', 'lat', 'mat']

The dot-star (.\*) can stand in for 'anything'. The dot-star uses greedy mode: It will always try to match as much text as
possible. To match any and all text in a non-greedy fashion, use the dot, star, and question mark ( .\*? ).

In [42]:
nameRegex = re.compile(r'First Name: (.*) Last Name: (.*)')
mo = nameRegex.search('First Name: Al Last Name: Sweigart')
print(mo.groups())

('Al', 'Sweigart')


In [43]:
nongreedyRegex = re.compile(r'<.*?>')
mo = nongreedyRegex.search('<To serve man> for dinner.>')
print(mo.group())

<To serve man>


In [44]:
greedyRegex = re.compile(r'<.*>')
mo = greedyRegex.search('<To serve man> for dinner.>')
print(mo.group())

<To serve man> for dinner.>


The dot-star will match everything except a newline. By passing re.DOTALL as the second argument to re.compile() , you can make the dot character match
all characters, including the newline character.

In [45]:
noNewlineRegex = re.compile('.*')
print(noNewlineRegex.search('Serve the public trust.\nProtect the innocent. \nUphold the law.').group())

Serve the public trust.


In [46]:
newlineRegex = re.compile('.*', re.DOTALL)
print(newlineRegex.search('Serve the public trust. \nProtect the innocent.\nUphold the law.').group())

Serve the public trust. 
Protect the innocent.
Uphold the law.


Sometimes it's necessary to match letters without worrying whether they’re uppercase or lowercase. To make your regex case insensitive, you can pass re.IGNORECASE or re.I as a second argument to re.compile().

In [47]:
robocop = re.compile(r'robocop', re.I)
print(robocop.search('RoboCop is part man, part machine, all cop.').group())
print(robocop.search('ROBOCOP protects the innocent.').group())
print(robocop.search('Al, why does your programming book talk about robocop so much?').group())

RoboCop
ROBOCOP
robocop


_Substituting Strings with the sub() Method_

Regular expressions can not only find text patterns but can also substitute new text in place of those patterns. The sub() method for Regex objects
is passed two arguments. The first argument is the replacement string that takes the place of each match. The second is the string to search, in which the substitutions are made. The sub() method returns a new string with the substitutions applied.

In [51]:
namesRegex = re.compile(r'Agent \w+')
print(namesRegex.sub('CENSORED', 'Agent Alice gave the secret documents to Agent Bob.'))

CENSORED gave the secret documents to CENSORED.


To use the matched text itself as part of the substitution, in the first argument to sub(), you can type \1 , \2 , \3 , and so on, to mean “Enter the text of group 1 , 2 , 3 , and so on, in the substitution.”

In [52]:
agentNamesRegex = re.compile(r'Agent (\w)\w*')
print(agentNamesRegex.sub(r'\1****', 'Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.'))

A**** told C**** that E**** knew B**** was a double agent.


_Managing Complex Regexes_



In [53]:
# The same regex on a single line, for comparison:
# phoneRegex = re.compile(r'((\d{3}|\(\d{3}\))?(\s|-|\.)?\d{3}(\s|-|\.)\d{4}(\s*(ext|x|ext\.)\s*\d{2,5})?)')

phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?            # area code
    (\s|-|\.)?                    # separator
    \d{3}                         # first 3 digits
    (\s|-|\.)                     # separator
    \d{4}                         # last 4 digits
    (\s*(ext|x|ext\.)\s*\d{2,5})? # extension (the dot in "ext." is escaped)
    )''', re.VERBOSE)

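Note how the previous example uses the triple-quote syntax ( ''' ) to create a multiline string so that you can spread the regular expression definition over many lines, making it much more legible. Because re.VERBOSE is passed as the second argument to re.compile(), whitespace and # comments inside the pattern string are ignored.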
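
To see a verbose pattern in use, here is a sketch built on a simplified variant of the phone regex (the extension part is left out) and applied to a made-up sentence. Since the pattern contains groups, findall() returns tuples; the outermost group is the first item of each tuple and holds the whole number.

```python
import re

phone_regex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?   # optional area code, with or without parentheses
    (\s|-|\.)?           # optional separator
    \d{3}                # first 3 digits
    (\s|-|\.)            # separator
    \d{4}                # last 4 digits
    )''', re.VERBOSE)

text = 'Call 415-555-1234 or (212) 555-0000.'

matches = phone_regex.findall(text)
# Keep only the outermost group, which is the full matched number.
numbers = [groups[0] for groups in matches]
print(numbers)
```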

_Combining re.IGNORECASE, re.DOTALL and re.VERBOSE_

In [54]:
someRegexValue = re.compile('foo', re.IGNORECASE | re.DOTALL)

In [55]:
someRegexValue = re.compile('foo', re.IGNORECASE | re.DOTALL | re.VERBOSE)
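
re.compile() takes only a single value as its second argument, so several options are combined with the bitwise OR operator ( | ), as in the two cells above. The sketch below uses a made-up pattern and string to show that both flags take effect when combined, and that either flag alone is not enough here.

```python
import re

text = 'FOO\nbar'

# re.IGNORECASE lets 'foo' match 'FOO'; re.DOTALL lets '.' match the newline.
combined = re.compile(r'foo.bar', re.IGNORECASE | re.DOTALL)
both_flags = combined.search(text)
print(both_flags is not None)

# With only one of the two flags, the same pattern finds nothing.
only_ignorecase = re.compile(r'foo.bar', re.IGNORECASE).search(text)
only_dotall = re.compile(r'foo.bar', re.DOTALL).search(text)
print(only_ignorecase)
print(only_dotall)
```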

__Project: Phone Number and Email Address Extractor__