Regular expressions are a powerful tool for searching documents and extracting text from them. To use them in Python, we first import the built-in re module.

In [None]:
import re

Let's define a short document that we can search with regular expressions (regexes).

In [None]:
doc = 'We are what we pretend to be, so we must be careful about what we pretend to be.'
doc

'We are what we pretend to be, so we must be careful about what we pretend to be.'

If we want to extract the instances of "we " from the document (the word "we" followed by a whitespace character), we can compile a regex and use it to find every match. In the pattern, \s matches any whitespace character, not only a space.

In [None]:
my_regex = re.compile(r'we\s')
my_regex.findall(doc)

['we ', 'we ', 'we ']
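
One caveat with the pattern above: "we" followed by whitespace can also appear at the end of a longer word, and each match includes the trailing whitespace. The sketch below, which uses a hypothetical sample string that is not part of the document above, compares that pattern with one built on the word-boundary token `\b`, assuming the goal is to match "we" only as a whole word.

```python
import re

# Hypothetical sample text, chosen only to illustrate the difference.
sample = 'We owe what we owe.'

# Matches "we" plus one whitespace character anywhere, including inside "owe ".
loose = re.findall(r'we\s', sample)

# Matches "we" only as a whole word, and does not capture the whitespace.
strict = re.findall(r'\bwe\b', sample)

print(loose)   # ['we ', 'we ']
print(strict)  # ['we']
```

The first result contains a match taken from "owe what", while the second contains only the standalone word.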

Note that the initial "We" is not found. Matching is case sensitive by default; passing the re.I (ignore case) flag when compiling makes it case insensitive.

In [None]:
my_regex = re.compile(r'we\s', re.I)
my_regex.findall(doc)

['We ', 'we ', 'we ', 'we ']
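
As a small aside, `re.I` is the short alias of `re.IGNORECASE`, and the same behavior can be requested inside the pattern with the inline flag `(?i)`. The sketch below checks that the three spellings return the same matches for this document; the variable names are illustrative only.

```python
import re

doc = 'We are what we pretend to be, so we must be careful about what we pretend to be.'

# Short alias, as used above.
short_flag = re.findall(r'we\s', doc, flags=re.I)

# Long, more readable name for the same flag.
long_flag = re.findall(r'we\s', doc, flags=re.IGNORECASE)

# Inline flag placed at the start of the pattern itself.
inline_flag = re.findall(r'(?i)we\s', doc)

print(short_flag)
print(short_flag == long_flag == inline_flag)  # True
```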

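The re module has two primitive searching operations, search and match. search scans the whole string for the first location where the pattern matches, while match only checks for a match at the beginning of the string. Both return None when nothing is found.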

In [None]:
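# 'what' is not at the start of doc, so match returns None
match_matches = re.match('what', doc)
# search scans the whole string; re.I makes it case insensitive
search_matches = re.search('careful', doc, flags=re.I)
# 'We' is at the start of doc, so match succeeds
match_matches2 = re.match('We', doc)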

print(f'Match result 1: {match_matches}')
print(f'Search result: {search_matches}')
print(f'Match result 2: {match_matches2}')

doc

Match result 1: None
Search result: <re.Match object; span=(44, 51), match='careful'>
Match result 2: <re.Match object; span=(0, 2), match='We'>


'We are what we pretend to be, so we must be careful about what we pretend to be.'
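
The results printed above are either None or a match object. As a minimal sketch of how such a result might be used, the example below checks for None before reading the matched text and its position with the match object's `group()`, `span()`, `start()`, and `end()` methods. The variable names `result` and `missing` are illustrative.

```python
import re

doc = 'We are what we pretend to be, so we must be careful about what we pretend to be.'

result = re.search('careful', doc)

# search and match return None on failure, so test before using the result.
if result is not None:
    print(result.group())                    # the matched text: careful
    print(result.span())                     # (start, end) indices: (44, 51)
    print(doc[result.start():result.end()])  # slicing doc gives the same text

# 'what' occurs in doc, but not at the beginning, so match returns None.
missing = re.match('what', doc)
print(missing is None)  # True
```

Calling `group()` on a None result raises an AttributeError, which is why the check comes first.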