# Lookahead and Lookbehind

Sometimes in regular expressions, you wish to make *assertions* about patterns without actually including the subpattern in the match.  

We have already seen a few examples of this, although not described as such.  The special symbols `^` and `$` mark the beginning and end of lines or strings, respectively, and describe what a match must look like without actually including any characters themselves. The pattern `\b` that must match a word boundary is similar.

However, we can be more general in describing a complete subpattern that must occur before or after a match, without that subpattern itself being included in the match.  Negative lookahead and lookbehind assertions are likewise available.

In [1]:
from src.setup import *

Let us try an example on our nursery rhyme: match every 'a' or 'e' that is followed by 'd', 'm' or 'r'.

In [2]:
show(r'[ae](?=[dmr])', rhyme)

Notice two things.  The match consists *only* of the 'a' or 'e' itself, not the letter that follows it; nonetheless, any 'a' or 'e' that is not followed by one of those letters is not matched.
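
The difference between asserting and consuming is easy to check with plain `re` calls.  The following is a minimal sketch; the sample sentence is invented for illustration and is not the `rhyme` text loaded by the setup module.

```python
import re

# Hypothetical sample text, chosen only for illustration
text = "a red hat and the same lamb"

# Lookahead: the following letter is required but is not part of the match
asserted = re.findall(r'[ae](?=[dmr])', text)

# Ordinary character class: the following letter is consumed by the match
consumed = re.findall(r'[ae][dmr]', text)

print(asserted)  # ['e', 'a', 'a']
print(consumed)  # ['ed', 'am', 'am']
```

Both patterns find matches at the same three positions; only the extent of each match differs.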

It is hard to come up with an obvious reason you would wish to highlight or reference exactly those single letters.  It might make better sense if we do not think of prose sentences, but other information encoded as text.  For example, suppose we have a list of auto part numbers.

In [3]:
print(parts)

FORD-2008-xyz37
FORD-1998-ef445
TOYO-1999-wxy66
TOYO-2005-qrst3
FORD-2010-ab614
MAZD-1995-pqr33
TOYO-2013-fg185
TOYO-1997-abc23
FORD-2012-lm034


For a somewhat more "real world" feel, imagine that hundreds of thousands of such numbers are listed rather than just a few.  We would like to match only the years for which we have Toyota parts.  This is a "lookbehind" question rather than a "lookahead" one.

In [4]:
show(r'(?<=^TOYO-)\d{4}', parts)
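
Outside the notebook, the same lookbehind can be tried with `re.findall`.  This sketch assumes the `parts` string holds exactly the nine lines printed above, and that `^` should match at the start of every line, which requires `re.MULTILINE`.

```python
import re

# Assumed contents of `parts`, copied from the listing above
parts = """\
FORD-2008-xyz37
FORD-1998-ef445
TOYO-1999-wxy66
TOYO-2005-qrst3
FORD-2010-ab614
MAZD-1995-pqr33
TOYO-2013-fg185
TOYO-1997-abc23
FORD-2012-lm034
"""

# re.MULTILINE lets ^ match at the start of each line, not only the string
toyota_years = re.findall(r'(?<=^TOYO-)\d{4}', parts, flags=re.MULTILINE)
print(toyota_years)  # ['1999', '2005', '2013', '1997']
```

The `TOYO-` prefix decides which years qualify, but it never appears in the results.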

Let us make a somewhat more complex query.  We would like the years of Ford parts that use a two-letter part code rather than the three- or four-letter version that some parts have.  Moreover, we only want the years for parts made in 2000 or later.

In [5]:
pat = r'(?<=^FORD-)2\d{3}(?=-[^0-9]{2}\d)' 
show(pat, parts)
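
As a quick check of how the lookbehind and lookahead combine, here is a small sketch run against only the Ford lines from the list above.  The variable name `ford_parts` is introduced here for illustration.

```python
import re

# Only the Ford lines from the sample list
ford_parts = "FORD-2008-xyz37\nFORD-1998-ef445\nFORD-2010-ab614\nFORD-2012-lm034"

pat = r'(?<=^FORD-)2\d{3}(?=-[^0-9]{2}\d)'
years = re.findall(pat, ford_parts, flags=re.MULTILINE)
print(years)  # ['2010', '2012']
```

The 2008 part is rejected by the lookahead (its code has three letters), and the 1998 part is rejected because the year does not start with `2`.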

Negative assertions are, as the name suggests, requirements that a certain pattern does *not* come before or after the pattern we wish to match.  To make a negative assertion, replace the `=` of the positive form with `!`: `(?!...)` is a negative lookahead and `(?<!...)` is a negative lookbehind.  For example, let us find the years of parts that are **not** made by Mazda and that do **not** have four letter codes.

In [6]:
pat = r'(?<!^MAZD-)\d{4}(?!-\D{4})'
show(pat, parts)
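
To see which lines each negative assertion rules out, the pattern can be applied one line at a time.  This is a sketch over four lines taken from the sample list; searching each line separately means `^` matches the start of the line without needing `re.MULTILINE`.

```python
import re

lines = ["FORD-2008-xyz37", "TOYO-2005-qrst3", "MAZD-1995-pqr33", "TOYO-1997-abc23"]
pat = re.compile(r'(?<!^MAZD-)\d{4}(?!-\D{4})')

kept = []
for line in lines:
    m = pat.search(line)
    if m is not None:
        kept.append(m.group())

print(kept)  # ['2008', '1997']
```

The Toyota part with the four-letter code `qrst` fails the negative lookahead, and the Mazda part fails the negative lookbehind.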

Even in these dense patterns we made a number of simplifying assumptions.  For example, for the part code portion of the part number, we stipulate that it always contains letters and numbers rather than other characters, and hence that a "not-digit" class like `\D` will match only letters.  Moreover, if the expected pattern of `MAKER-YEAR-CODE` is not followed on a given line, results will be unreliable.
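
One way to reduce that risk is to validate each line before extracting anything from it.  The sketch below assumes a stricter format than the text states: four capital letters, a four-digit year, then two to four lowercase letters followed by one to three digits.  That format is inferred from the sample data only, and the two malformed lines are made up for illustration.

```python
import re

# Assumed format, inferred from the sample part numbers (not a documented rule)
LINE_FORMAT = re.compile(r'[A-Z]{4}-\d{4}-[a-z]{2,4}\d{1,3}')

# The last two entries are hypothetical malformed lines
lines = ["FORD-2008-xyz37", "TOYO-2005-qrst3", "MAZD-95-pqr33", "bad line"]

# fullmatch requires the whole line to fit the format
valid = [ln for ln in lines if LINE_FORMAT.fullmatch(ln)]
invalid = [ln for ln in lines if not LINE_FORMAT.fullmatch(ln)]

print(valid)    # ['FORD-2008-xyz37', 'TOYO-2005-qrst3']
print(invalid)  # ['MAZD-95-pqr33', 'bad line']
```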

## Back References

So far, everything in these lessons has looked only at identifying patterns.  In fact, all we have done is highlight the matches, rather than work with them in a programmatic way within Python.  Later lessons will do more.

Here let us look at a feature found in many regular expression tools: *back references* to groups defined within an expression.  Especially when you do replacements on the patterns identified, it is useful to be able to refer to the components of a match.

For this example, suppose that the warehouse with these auto parts will stop stocking parts made prior to 2000 to keep more modern inventory.  To aid this process, parts will be renumbered to reflect this change.  Older parts should have the new pattern `MAKER-OBS-CODE(YEAR)`, i.e. the "year" portion will become the string 'OBS' (obsolete). To build on lookbehind patterns in this lesson, we only do this for non-Mazda parts.

In [7]:
pat = r'(?<!^MAZD-)(1\d{3})(-.*)'
new = r'OBS\2(\1)'
revised = re.sub(pat, new, parts, flags=re.MULTILINE)
show(r'OBS.*', revised)

What happened in the pattern above is that the two *groups* were automatically numbered, and are referred to as `\1` and `\2` for purposes of back reference in a replacement pattern.  A lookbehind assertion superficially looks like a group, but it does not count as one for purposes of back reference.  The same applies to lookahead assertions, and to both positive and negative assertions in either direction.
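
The group numbering can be inspected directly on a compiled pattern.  This is a minimal sketch using a single line from the sample list rather than the full `parts` string.

```python
import re

pat = re.compile(r'(?<!^MAZD-)(1\d{3})(-.*)')

# The lookbehind is parenthesized but is not counted as a group
group_count = pat.groups
print(group_count)  # 2

m = pat.search("FORD-1998-ef445")
captured = m.groups()
print(captured)  # ('1998', '-ef445')

# \1 is the year and \2 is the code, exactly as captured above
renumbered = pat.sub(r'OBS\2(\1)', "FORD-1998-ef445")
print(renumbered)  # FORD-OBS-ef445(1998)

# A Mazda line is left untouched by the negative lookbehind
unchanged = pat.sub(r'OBS\2(\1)', "MAZD-1995-pqr33")
print(unchanged)  # MAZD-1995-pqr33
```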

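When a complex regular expression has multiple groups, back referencing them by number can get confusing.  With more than 9 groups it gets worse: some tools support only `\1` through `\9`, and a reference such as `\10` is easy to misread as group 1 followed by a literal `0`.  (Python's `re` module accepts numbered references up to `\99`, and in replacement strings also offers the unambiguous form `\g<10>`.)  In this case, you can use *named groups*.  The syntax is a bit verbose compared to other regular expression elements, but it can add clarity.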
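
In Python's `re` module specifically, a reference to a group beyond 9 can be written as `\g<10>` in a replacement string, which leaves no doubt about where the group number ends.  A minimal sketch with a made-up ten-group pattern:

```python
import re

# Ten single-letter groups, so group 10 captures 'j' (contrived example)
pat = r'(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)'
text = 'abcdefghij'

# Python reads \10 as group 10
two_digit = re.sub(pat, r'\10', text)
print(two_digit)  # j

# \g<10> says the same thing without ambiguity
explicit_ten = re.sub(pat, r'\g<10>', text)
print(explicit_ten)  # j

# \g<1>0 is group 1 followed by a literal zero
one_then_zero = re.sub(pat, r'\g<1>0', text)
print(one_then_zero)  # a0
```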

In [8]:
pat = (r'(?<!^MAZD-)'
       r'(?P<year>1\d{3})'
       r'(?P<code>-.*)')
new = r'OBS\g<code>(\g<year>)'
revised = re.sub(pat, new, parts, flags=re.MULTILINE)

show(r'OBS.*', revised)
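
Named groups are also available on the match object itself, which can be handy when working with matches programmatically.  A small sketch, again using one line from the sample list:

```python
import re

pat = re.compile(r'(?<!^MAZD-)(?P<year>1\d{3})(?P<code>-.*)')
m = pat.search("TOYO-1999-wxy66")

# Named groups can be read as a dictionary
fields = m.groupdict()
print(fields)  # {'year': '1999', 'code': '-wxy66'}

# Named groups still receive numbers, in order of their opening parenthesis
by_number = (m.group(1), m.group(2))
print(by_number)  # ('1999', '-wxy66')

renamed = pat.sub(r'OBS\g<code>(\g<year>)', "TOYO-1999-wxy66")
print(renamed)  # TOYO-OBS-wxy66(1999)
```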

Back references, either named or numbered, may be used within a pattern as well.  In a somewhat contrived example, let us capture the parallel comparison words in the nursery rhyme, i.e. "fleece as white as snow" in this case.

In [9]:
show(r'(fleece) (?P<word>\w+) (\w+) (?P=word) (\w+)', rhyme)
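
To confirm that `(?P=word)` requires the *same text* that the `word` group captured, rather than just any word, the pattern can be tried on two short phrases.  Both phrases are invented for illustration and are not taken from the `rhyme` variable.

```python
import re

pat = re.compile(r'(fleece) (?P<word>\w+) (\w+) (?P=word) (\w+)')

# The second "as" repeats what the `word` group captured, so this matches
hit = pat.search("a fleece as white as snow")
print(hit.group(0))       # fleece as white as snow
print(hit.group('word'))  # as
print(hit.groups())       # ('fleece', 'as', 'white', 'snow')

# "like" differs from the captured "as", so there is no match
miss = pat.search("a fleece as white like snow")
print(miss)  # None
```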