# Python Functions

In the prior lessons, we generally looked only at the regular expression mini-language itself.  Once you understand patterns, you will want to do something with them in Python code.

In [1]:
from src.setup import *

## Compiling Queries

Before Python can use a regular expression, it must be *compiled*.  You do not necessarily need to think about this, since any string passed as a pattern is compiled (and cached) behind the scenes.  However, if you use a pattern more than once in your program, pre-compiling it avoids repeated cache lookups and gives the pattern a reusable name.

Here we search for any sequence of five words where the second and fourth words are the same.

In [2]:
pat = r'(\w+) (?P<dup>\w+) (\w+) (?P=dup) (\w+)'
show(pat, couplet)

In [3]:
cpat = re.compile(pat)
show(cpat, couplet)
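
As a minimal sketch of the point above, a compiled pattern offers the same operations as methods that the `re` module offers as functions.  The string `text` is a stand-in chosen for illustration; it is not the `couplet` provided by the setup module.

```python
import re

# Hypothetical sample text; the lesson's `couplet` comes from src.setup
text = "Its fleece as white as snow"

pat = r'(\w+) (?P<dup>\w+) (\w+) (?P=dup) (\w+)'
cpat = re.compile(pat)

# The module-level function and the compiled method find the same match
m1 = re.search(pat, text)
m2 = cpat.search(text)
print(m1.group() == m2.group())

# The compiled object keeps the source string it was built from
print(cpat.pattern == pat)
```

Either spelling works; the compiled form mainly helps when the same pattern is applied in many places.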

## Find One

One of the common functions you will use is `re.search()`.  This will search for the first location where a match occurs, and return a *match object*.  The function `re.match()` is a more limited case of this, and will only match at the beginning of a string.  When no match is found the special value `None` is returned.

In [4]:
match = re.search(cpat, couplet)
if match is not None:
    print(match)

<re.Match object; span=(27, 50), match='fleece as white as snow'>
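
The difference between `re.search()` and `re.match()` can be sketched on a short string.  The sample `text` below is an assumption made for this illustration.

```python
import re

# Hypothetical sample text for illustration
text = "Mary had a little lamb"

# re.search() scans forward and finds 'lamb' wherever it occurs
found = re.search(r'lamb', text)
print(found)

# re.match() only tries the start of the string, so it returns None here
at_start = re.match(r'lamb', text)
print(at_start)

# re.match() succeeds when the pattern fits at the start of the string
mary = re.match(r'Mary', text)
print(mary)
```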


We can do a variety of operations with a match object.  For example, we could use the span in the original string to modify that string or extract a relevant portion.

In [5]:
couplet[match.start():match.end()]

'fleece as white as snow'

In [6]:
# A shorter but more obscure spelling
couplet[slice(*match.span())]

'fleece as white as snow'

Since the pattern used contains groups, we can look at those.  Notice that our pattern included a back reference which re-uses a group.  Hence there are four groups, not five.

In [7]:
match.groups()

('fleece', 'as', 'white', 'snow')

An individual group can be retrieved either by number, or by name if it is a named group.

In [8]:
match.group(3), match.group('dup')

('white', 'as')
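
A few related accessors on the match object are worth knowing as well.  The following is a small sketch using a stand-in string (an assumption, not the lesson's `couplet`): group 0 is the whole match, `.groupdict()` collects only the named groups, and `.span()` accepts a group number or name.

```python
import re

# Hypothetical sample text for illustration
text = "Its fleece as white as snow"
match = re.search(r'(\w+) (?P<dup>\w+) (\w+) (?P=dup) (\w+)', text)

# Group 0 is always the entire matched text
whole = match.group(0)
print(whole)

# Only named groups appear in the dictionary
named = match.groupdict()
print(named)

# Positions are available per group, by number or by name
dup_span = match.span('dup')
print(dup_span)
```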

## Splitting

Another useful thing to do with regex patterns is to split strings apart.  The string method `.split()` can divide a string based on a fixed delimiter, but the function `re.split()` can divide based on an arbitrarily complex regular expression.

For example, if you wished to divide the auto part list using both the dash and newlines, you might use:

In [9]:
print(re.split(r'[-\n]', parts))

['FORD', '2008', 'xyz37', 'FORD', '1998', 'ef445', 'TOYO', '1999', 'wxy66', 'TOYO', '2005', 'qrst3', 'FORD', '2010', 'ab614', 'MAZD', '1995', 'pqr33', 'TOYO', '2013', 'fg185', 'TOYO', '1997', 'abc23', 'FORD', '2012', 'lm034']


Splitting on either of two characters is already more than `str.split()` can do directly, but let us try a slightly more complex pattern.  We want to split on the years and their surrounding dashes, while verifying that each year starts with '1' or '2' and has four digits.

In [10]:
print(re.split(r'-[12]\d{3}-|\n', parts))

['FORD', 'xyz37', 'FORD', 'ef445', 'TOYO', 'wxy66', 'TOYO', 'qrst3', 'FORD', 'ab614', 'MAZD', 'pqr33', 'TOYO', 'fg185', 'TOYO', 'abc23', 'FORD', 'lm034']


Probably more useful for this case is to perform a separate split per line.  We cannot easily do that with a regular expression alone, but can with just a bit of Python code.

In [11]:
[re.split(r'-[12]\d{3}-', part) for part in parts.splitlines()]

[['FORD', 'xyz37'],
 ['FORD', 'ef445'],
 ['TOYO', 'wxy66'],
 ['TOYO', 'qrst3'],
 ['FORD', 'ab614'],
 ['MAZD', 'pqr33'],
 ['TOYO', 'fg185'],
 ['TOYO', 'abc23'],
 ['FORD', 'lm034']]
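
Two further behaviors of `re.split()` fit this task: a capturing group in the pattern keeps the captured text in the result, and `maxsplit` caps the number of splits.  This sketch uses a shortened, hypothetical `parts` string in the same layout as the lesson's data.

```python
import re

# Hypothetical two-line sample in the same MAKE-YEAR-CODE layout
parts = "FORD-2008-xyz37\nTOYO-1999-wxy66"

# Parentheses around the year keep it in the result; the dashes are dropped
rows = [re.split(r'-([12]\d{3})-', line) for line in parts.splitlines()]
print(rows)

# maxsplit limits how many splits are made, leaving the remainder intact
first = re.split(r'[-\n]', parts, maxsplit=2)
print(first)
```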

## Find Many

Two functions provide the ability to find all matches to a pattern within a string rather than just the first.  The function `re.findall()` returns a concrete list of all matches as strings (or of the group contents, when the pattern contains groups).  The function `re.finditer()` creates an iterator of match objects.

In [12]:
print(re.findall(r'(?<=-)[12]\d{3}(?=-)', parts))

['2008', '1998', '1999', '2005', '2010', '1995', '2013', '1997', '2012']


A list of matching strings for a pattern can be useful, but at times we might want to do more with the match object.  For example, perhaps matches have groups we would like to utilize.

In [13]:
pat = r'-(?P<century>19|20)(?P<year>\d\d)-'
for m in re.finditer(pat, parts):
    print(m.group('century'), m.group('year'), m.span())

20 08 (4, 10)
19 98 (20, 26)
19 99 (36, 42)
20 05 (52, 58)
20 10 (68, 74)
19 95 (84, 90)
20 13 (100, 106)
19 97 (116, 122)
20 12 (132, 138)
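
How groups change the result of `re.findall()` can be sketched as follows.  The shortened `parts` string is again an assumption for illustration.

```python
import re

# Hypothetical two-line sample in the same MAKE-YEAR-CODE layout
parts = "FORD-2008-xyz37\nTOYO-1999-wxy66"

# No capturing groups: each item is the whole match, dashes included
whole = re.findall(r'-(?:19|20)\d\d-', parts)
print(whole)

# Two capturing groups: each item is a tuple of the group contents
pairs = re.findall(r'-(19|20)(\d\d)-', parts)
print(pairs)

# finditer() plus groupdict() keeps the group names attached
pat = r'-(?P<century>19|20)(?P<year>\d\d)-'
named = [m.groupdict() for m in re.finditer(pat, parts)]
print(named)
```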


## Substitution

One of the most powerful uses of regular expressions is in replacing portions of text that match a pattern. It can often be useful to utilize the contents of groups in the match to create the relevant replacement.

Let us suppose we need to modify part numbers such that the four-digit year is represented by just the two-digit year followed by a braced indicator of the century.  Moreover, the year component of the part number should be surrounded by underscores rather than dashes.

In [14]:
print(re.sub(r'-([12]\d{1})(\d{2})-', r'_\2{\1x}_', parts))

FORD_08{20x}_xyz37
FORD_98{19x}_ef445
TOYO_99{19x}_wxy66
TOYO_05{20x}_qrst3
FORD_10{20x}_ab614
MAZD_95{19x}_pqr33
TOYO_13{20x}_fg185
TOYO_97{19x}_abc23
FORD_12{20x}_lm034
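
The same substitution can be written with named groups, which some readers find easier to follow than numbered ones.  In this sketch the group names `century` and `year` and the shortened `parts` string are choices made for illustration.

```python
import re

# Hypothetical two-line sample in the same MAKE-YEAR-CODE layout
parts = "FORD-2008-xyz37\nTOYO-1999-wxy66"

pat = r'-(?P<century>[12]\d)(?P<year>\d{2})-'

# \g<name> refers to a named group inside the replacement template
result = re.sub(pat, r'_\g<year>{\g<century>x}_', parts)
print(result)
```

The `\g<...>` form also accepts a number, such as `\g<1>`, which is useful when a literal digit follows the group reference.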


We can also pass a match object to a function within the substitution.  This allows arbitrarily complex computation based on the match, but the function will always need to return a string for the "new representation" of the match object.

Let us reverse every 3+ letter word in our rhyme that contains the letter 'a'.  This gives us the opportunity to use the `re.IGNORECASE` (also spelled `re.I`) flag.  We could, of course, achieve the same effect in this specific pattern using the character class `[aA]`.

In [15]:
# Lookahead checks that the word has an 'a' before grabbing words of 3+ letters
pat = re.compile(r'\b(?=\w*a\w*)(\w{3,})\b', re.IGNORECASE)
show(pat, rhyme)

In [16]:
def reverse_match(m):
    # `m.group()` would also give the entire matched string
    s = m.string[m.start():m.end()]
    return ''.join(reversed(s))

In [17]:
print(re.sub(pat, reverse_match, rhyme))

yraM dah a little bmal
Its fleece as white as snow
dnA everywhere taht yraM
went, the bmal saw sure
to go
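
A replacement function can also read groups and compute with them.  As a minimal sketch, the hypothetical helper `next_year()` below bumps each year in a shortened, made-up `parts` string by one.

```python
import re

# Hypothetical two-line sample in the same MAKE-YEAR-CODE layout
parts = "FORD-2008-xyz37\nTOYO-1999-wxy66"

def next_year(m):
    # Groups are strings, so convert, compute, and format back into text
    return f"-{int(m.group('year')) + 1}-"

bumped = re.sub(r'-(?P<year>[12]\d{3})-', next_year, parts)
print(bumped)
```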


A variation on `re.sub()` is `re.subn()`, which returns a pair containing both the new string and a count of the substitutions made.

In [18]:
new, count = re.subn(pat, reverse_match, rhyme)
print("Changed:", count, '\n-----')
print(new)

Changed: 8 
-----
yraM dah a little bmal
Its fleece as white as snow
dnA everywhere taht yraM
went, the bmal saw sure
to go
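
Both `re.sub()` and `re.subn()` accept a `count` argument that limits how many replacements are made.  This sketch assumes a one-line sample string and a simplified version of the pattern above.

```python
import re

# Hypothetical sample text for illustration
text = "Mary had a little lamb"

# Words of 3+ letters that contain an 'a' (either case)
pat = re.compile(r'\b(?=\w*a)\w{3,}\b', re.IGNORECASE)

# count=2 stops after the first two replacements
new, n = pat.subn(lambda m: m.group()[::-1], text, count=2)
print(new)
print(n)
```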
