Importing the re module

In [1]:
import re

Finding a phone number with a regex object

In [3]:
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo = phoneNumRegex.search('My number is 415-555-4242')
print('Phone number found: ' + mo.group())

Phone number found: 415-555-4242
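
Note that search() returns None when the pattern does not occur in the text, so calling group() directly on the result can raise an AttributeError. A minimal sketch of a guard, where find_phone is a hypothetical helper name used only for illustration:

```python
import re

phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

def find_phone(text):
    # Hypothetical helper: return the first phone number in text, or None.
    mo = phoneNumRegex.search(text)
    if mo is None:
        return None
    return mo.group()

found = find_phone('My number is 415-555-4242')
missing = find_phone('I have no phone')
print(found)
print(missing)
```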


Grouping parts of the regex with parentheses

In [8]:
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My number is 415-555-4242')
mo.group(1)

'415'

In [9]:
mo.group(2)

'555-4242'

In [10]:
mo.group(0)

'415-555-4242'

In [11]:
mo.group()

'415-555-4242'

In [12]:
mo.groups()

('415', '555-4242')

In [16]:
areaCode, mainNumber = mo.groups()
print(areaCode)
print(mainNumber)

415
555-4242


Matching parentheses in text

In [18]:
phoneNumRegex = re.compile(r'(\(\d\d\d\)) (\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My number is (415) 555-4242')
print(mo.group(1))
print(mo.group(2))

(415)
555-4242
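
If the literal text contains several special characters, escaping each one by hand gets tedious. As a sketch, the standard library function re.escape() can add the backslashes for you. The variable names below are only illustrative:

```python
import re

literal = '(415)'
escaped = re.escape(literal)   # backslashes are added before ( and )
areaRegex = re.compile(escaped + r' \d\d\d-\d\d\d\d')
mo = areaRegex.search('My number is (415) 555-4242')
print(escaped)
print(mo.group())
```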


The pipe character '|' matches any one of several alternatives. When more than one alternative occurs in the text, search() returns the one that occurs first.

In [22]:
heroRegex = re.compile(r'Batman|Tina Frey')
mo1 = heroRegex.search('Batman and Tina Frey')

In [23]:
mo1.group()

'Batman'

In [24]:
mo2 = heroRegex.search('Tina Frey and Batman')

In [25]:
mo2.group()

'Tina Frey'
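
A small check, as a sketch, that it is the position in the text and not the order of the alternatives in the pattern that decides which name search() returns:

```python
import re

text = 'Batman and Tina Frey'
first = re.compile(r'Batman|Tina Frey').search(text).group()
# Same alternatives, written in the opposite order.
swapped = re.compile(r'Tina Frey|Batman').search(text).group()
print(first)
print(swapped)
```

Both patterns return the name that appears earliest in the text.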

In [26]:
batRegex = re.compile(r'Bat(man|mobile|copter|bat)')
mo = batRegex.search('Batmobile lost a wheel')
mo.group()

'Batmobile'

In [27]:
mo.group(1)

'mobile'

An optional match uses the question mark '?': the group before it may appear zero or one time.

In [29]:
batRegex = re.compile(r'Bat(wo)?man')
mo1 = batRegex.search('The Adventures of Batman')
mo1.group()

'Batman'

In [30]:
mo2 = batRegex.search('The adventures of Batwoman')
mo2.group()

'Batwoman'

Applying this idea to phone numbers, you can write a regex that matches numbers with or without an area code.

In [31]:
phoneRegex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')
mo1 = phoneRegex.search('My number is 415-555-4242')
print(mo1.group())
mo2 = phoneRegex.search('My number is 555-4242')
print(mo2.group())

415-555-4242
555-4242
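
When an optional group takes no part in the match, group() returns None for it. A short sketch using the same pattern (the variable names withArea and withoutArea are illustrative):

```python
import re

phoneRegex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')
withArea = phoneRegex.search('My number is 415-555-4242')
withoutArea = phoneRegex.search('My number is 555-4242')
print(withArea.group(1))      # the optional group matched
print(withoutArea.group(1))   # the optional group was skipped
```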


The question mark ('?') matches zero or one occurrence of the group before it.
The asterisk ('*'), on the other hand, matches zero or more occurrences, with no upper limit.

In [33]:
batRegex = re.compile(r'Bat(wo)*man')
mo1 = batRegex.search('The Adventures of Batman')
print(mo1.group())
mo2 = batRegex.search('The Adventures of Batwoman')
print(mo2.group())
mo3 = batRegex.search('The Adventures of Batwowowoman')
print(mo3.group())

Batman
Batwoman
Batwowowoman


If you want to require at least one occurrence, use the plus sign '+', which matches one or more.

In [35]:
batRegex = re.compile(r'Bat(wo)+man')
mo1 = batRegex.search('The Adventures of Batwowowoman')
print(mo1.group())
mo2 = batRegex.search('The Adventures of Batwoman')
print(mo2.group())
mo3 = batRegex.search('The Adventures of Batman')
print(mo3 is None)

Batwowowoman
Batwoman
True
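
To compare the three quantifiers side by side, the sketch below runs each pattern against the same three names and records which ones it matches. The names and results variables are illustrative:

```python
import re

names = ['Batman', 'Batwoman', 'Batwowowoman']
patterns = [r'Bat(wo)?man', r'Bat(wo)*man', r'Bat(wo)+man']

results = {}
for pattern in patterns:
    regex = re.compile(pattern)
    # Record which names each pattern matches.
    results[pattern] = [name for name in names if regex.search(name) is not None]
    print(pattern, results[pattern])
```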


You can match a specific number of repetitions with curly brackets. For example, (Ha){3} matches exactly three repetitions of 'Ha'.

In [37]:
haRegex = re.compile(r'(Ha){3}')
mo1 = haRegex.search('HaHaHa')
print(mo1.group())
mo2 = haRegex.search('Hahaha')
print(mo2 is None)

HaHaHa
True


You can also give a range of repetitions, such as {3,5} for three to five.

In [39]:
haRegex = re.compile(r'((Ha){3,5})')
mo = haRegex.search('HaHaHaHaHa')
mo.group()

'HaHaHaHaHa'

Python's regular expressions are greedy by default: when a pattern can match strings of different lengths, they match the longest one possible. In the last example, three or four repetitions of 'Ha' would also have been in range, but the match contains all five.
To make the match non-greedy, add a question mark after the closing curly bracket.

In [40]:
haRegex = re.compile(r'((Ha){3,5}?)')
mo = haRegex.search('HaHaHaHaHa')
mo.group()

'HaHaHa'
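
The difference also shows up when the text holds fewer repetitions than the upper bound. A brief sketch with four repetitions, which also uses the open-ended form {3,} (three or more):

```python
import re

text = 'HaHaHaHa'   # four repetitions
greedy = re.compile(r'(Ha){3,5}').search(text).group()
lazy = re.compile(r'(Ha){3,5}?').search(text).group()
openEnded = re.compile(r'(Ha){3,}').search(text).group()
print(greedy)
print(lazy)
print(openEnded)
```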

The search() method returns only the first match, but the findall() method returns every match in the text.

In [6]:
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')

['415-555-9999', '212-555-0000']

If the regex contains groups, findall() returns the groups instead of the full match: with two or more groups, the result is a list of tuples.

In [7]:
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')
phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')

[('415', '555', '9999'), ('212', '555', '0000')]
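
With exactly one group, findall() returns a list of strings holding that group's text. If you still want the full matches, one option is finditer(), which yields match objects. A sketch under those assumptions:

```python
import re

text = 'Cell: 415-555-9999 Work: 212-555-0000'
oneGroup = re.compile(r'(\d\d\d)-\d\d\d-\d\d\d\d')

# One group: findall() returns only the group's text.
areaCodes = oneGroup.findall(text)
# finditer() yields match objects, so group() gives the whole match.
fullNumbers = [mo.group() for mo in oneGroup.finditer(text)]
print(areaCodes)
print(fullNumbers)
```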

#### Shorthand codes for common character classes

\d - Any numeric digit from 0 to 9

\D - Any character that is NOT a numeric digit from 0 to 9

\w - Any letter, numeric digit, or the underscore character. (Think of it as matching "word" characters.)

\W - Any character that is NOT a letter, numeric digit, or the underscore character.

\s - Any space, tab, or newline character. (Think of it as matching "spaces" characters.)

\S - Any character that is NOT a space, tab, or newline.

An example:

In [8]:
xmasRegex = re.compile(r'\d+\s\w+')
xmasRegex.findall('12 drummers, 11 pipers, 10 lords, 9 ladies, 8 maids, 7 swans, 6 geese, 5 rings, 4 birds, 3 hens, 2 doves, 1 partridge')

['12 drummers',
 '11 pipers',
 '10 lords',
 '9 ladies',
 '8 maids',
 '7 swans',
 '6 geese',
 '5 rings',
 '4 birds',
 '3 hens',
 '2 doves',
 '1 partridge']
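
The shorthand classes can be tried out one at a time on a shorter string. A small sketch (the variable names are illustrative):

```python
import re

text = '12 drummers, 11 pipers'
nonDigits = re.findall(r'\D+', text)   # runs of characters that are not digits
words = re.findall(r'\w+', text)       # runs of "word" characters
pieces = re.findall(r'\S+', text)      # runs of characters that are not whitespace
print(nonDigits)
print(words)
print(pieces)
```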

#### Making your own character classes

You can create your own character class by using square brackets. For example:

In [9]:
vowelRegex = re.compile(r'[aeiouAEIOU]')
vowelRegex.findall('Robocop eats baby food. BABY FOOD.')

['o', 'o', 'o', 'e', 'a', 'a', 'o', 'o', 'A', 'O', 'O']

You can create a range of characters like [A-Z] or [0-5]. Also, most regex symbols lose their special meaning inside square brackets, so you don't need to escape characters such as ., *, ? or () with a backslash. For example, [0-5.] matches the digits 0 to 5 and a period. The backslash itself is an exception: to match a literal backslash inside a character class, you still have to escape it.
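
A quick sketch of ranges and of a period used literally inside square brackets (the sample strings are made up for illustration):

```python
import re

# Digits 0 to 5 plus a literal period; the 9 is outside the range.
digitsAndDot = re.findall(r'[0-5.]', 'Pi is 3.14159')
# Two ranges combined in one class: any run of ASCII letters.
letters = re.findall(r'[a-zA-Z]+', 'Room 101, Floor 3B')
print(digitsAndDot)
print(letters)
```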

You can also make a negative character class, which matches any character except the ones inside the square brackets, by adding a caret character (^) just after the opening square bracket. For example:

In [10]:
vowelRegex = re.compile(r'[^aeiouAEIOU]')
vowelRegex.findall('Robocop eats baby food. BABY FOOD.')

['R',
 'b',
 'c',
 'p',
 ' ',
 't',
 's',
 ' ',
 'b',
 'b',
 'y',
 ' ',
 'f',
 'd',
 '.',
 ' ',
 'B',
 'B',
 'Y',
 ' ',
 'F',
 'D',
 '.']

To check whether a string starts with certain characters, put a caret (^) at the beginning of the regex. Likewise, to check whether a string ends with certain characters, put a dollar sign ($) at the end of the regex. Example:

In [15]:
beginsWithHello = re.compile(r'^Hello')
print(beginsWithHello.search('Hello World'))
print(beginsWithHello.search('He said hello.') is None)
endsWithNumber = re.compile(r'\d$')
print(endsWithNumber.search('Your number is 42'))
print(endsWithNumber.search('Your number is 42.') is None)

<_sre.SRE_Match object; span=(0, 5), match='Hello'>
True
<_sre.SRE_Match object; span=(16, 17), match='2'>
True
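
Using the caret and the dollar sign together requires the whole string to match the pattern, not just a part of it. A minimal sketch, where wholeStringIsNum is an illustrative name:

```python
import re

wholeStringIsNum = re.compile(r'^\d+$')
allDigits = wholeStringIsNum.search('1234567890')
mixed = wholeStringIsNum.search('12345xyz67890')
print(allDigits.group())
print(mixed is None)
```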


The dot character (.) in a regular expression is called a wildcard. It matches any single character except a newline. For example:

In [17]:
atRegex = re.compile(r'.at')
atRegex.findall('The cat in the hat sat on the flat mat.')

['cat', 'hat', 'sat', 'lat', 'mat']

Note that the dot matches exactly one character, which is why the word "flat" produced the match 'lat' rather than the whole word.
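
If you want the whole word instead, one possible approach (an assumption here, not the only way) is to replace the dot with \w*, which matches zero or more word characters. The sketch also shows that the dot does not match a newline:

```python
import re

sentence = 'The cat in the hat sat on the flat mat.'
# \w* takes all the word characters that come before 'at'.
wholeWords = re.findall(r'\w*at', sentence)
# The only character before 'at' here is a newline, so there is no match.
afterNewline = re.findall(r'.at', '\nat')
print(wholeWords)
print(afterNewline)
```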