In [3]:
import re

## Regular Expressions and the re Module

### Pattern String Syntax

#### Raw strings `r''`

Why do we use *raw* strings for `regex` patterns? There are several reasons, so let's go step by step:
- First of all, Python string literals have their own escape sequences, such as `\n` or `\t`. Few of them clash with `regex` syntax, but some do: 1) in a string literal `\b` is a backspace character, while in a pattern `\b` matches the empty string at a word boundary; 2) in a string literal `\1` is an octal escape (the character with code 1), while in a pattern `\number` refers back to a matched group. In a raw string literal, string escape sequences are not interpreted, so the backslashes reach the `regex` engine intact.
- More importantly, we can still *match* the characters these escape sequences stand for: `r'\t'` matches a tab character.
> Pattern elements (such as `r'\t'`, equivalent to the string literal `'\\t'`) do match the corresponding special characters (in this case, the tab character `\t`), so you can use a raw string literal even when you need a literal match for such special characters.
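- For these reasons we cannot safely write a pattern as is in a regular string literal; we have to escape each backslash with another backslash: `'\\s+'`, not `'\s+'` (an unrecognized escape such as `'\s'` is passed through unchanged, but recent Python versions warn about it). With a raw string we do not need to do this: `r'\s+'`.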
> In raw string literals, escape sequences are not interpreted as in Table 3-3, but are literally copied into the string, including backslashes and newline characters. Raw string literal syntax is handy for strings that include many backslashes, especially regular expression patterns.
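
The points above can be checked with a short sketch. It uses only the standard `re` module; the variable names and the sample text `'a cat sat'` are illustrative and not part of the original notes.

```python
import re

# '\b' in a regular string literal is ONE character (backspace, 0x08);
# r'\b' is TWO characters: a backslash followed by 'b'.
plain = '\b'
raw = r'\b'
lengths = (len(plain), len(raw))

# r'\bcat\b' asks the regex engine for word boundaries around 'cat';
# '\bcat\b' asks it for literal backspace characters around 'cat'.
with_raw = re.search(r'\bcat\b', 'a cat sat') is not None
with_plain = re.search('\bcat\b', 'a cat sat') is not None

# The two-character pattern r'\t' is interpreted by the regex engine
# and matches a real tab character.
tab_match = re.search(r'\t', 'a\tb') is not None

# The doubled-backslash spelling and the raw spelling are the same string.
same_pattern = '\\s+' == r'\s+'

print(lengths, with_raw, with_plain, tab_match, same_pattern)
# (1, 2) True False True True
```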

#### Greedy vs Non-Greedy 

What is the difference between greedy and non-greedy `regex`? 
> All of these examples are greedy, meaning that they match the substring beginning with the first occurrence of 'pre' all the way to the last occurrence of 'post'. When you care about what part of the string you match, you may often want to specify nongreedy matching, which in our example would match the substring beginning with the first occurrence of 'pre' but only up to the first following occurrence of 'post'.

In [1]:
s = 'preposterous and post facto'

In [8]:
# greedy match
re.match(r'pre.*post', s).group()

'preposterous and post'

In [7]:
# non-greedy match
re.match(r'pre.*?post', s).group()

'prepost'
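
As a small side-by-side sketch, the non-greedy form of a quantifier is written by appending `?` to it (`*?`, `+?`, `??`). The tag-like sample string `'<a><b>'` is an illustrative assumption, not taken from the quoted text.

```python
import re

s = 'preposterous and post facto'

# Greedy: .* grabs as much as possible, so the match ends at the LAST 'post'.
greedy = re.match(r'pre.*post', s).group()

# Non-greedy: .*? grabs as little as possible, so the match ends at the FIRST 'post'.
lazy = re.match(r'pre.*?post', s).group()

# The same distinction applies to + and +?.
tags_greedy = re.findall(r'<.+>', '<a><b>')
tags_lazy = re.findall(r'<.+?>', '<a><b>')

print(greedy)       # preposterous and post
print(lazy)         # prepost
print(tags_greedy)  # ['<a><b>']
print(tags_lazy)    # ['<a>', '<b>']
```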

## Match Versus Search

What's the difference between `match` and `search`? It turns out that `match` is implicitly anchored at the beginning of the string, while `search` looks for a match at any position.
> So far, we’ve been using regular expressions to match strings. For example, the RE with pattern `r'box'` matches strings such as `'box'` and `'boxes'`, but not `'inbox'`. In other words, an RE match is implicitly anchored at the start of the target string, as if the RE’s pattern started with `\A`.

In [20]:
s = 'inbox'

In [13]:
re.match(r'box', s) is None

True

In [15]:
re.search(r'box', s) is None

False

To test if the full string matches we may use `fullmatch()`.

In [17]:
re.fullmatch(r'box', s) is None

True

In [24]:
re.fullmatch(r'.*box', s) is None

False
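
One way to think about the three functions, sketched here as a rough equivalence rather than a description of how `re` is implemented: `match` behaves like `search` with `\A` in front of the pattern, and `fullmatch` behaves like `search` with the pattern wrapped in `\A(?:...)\Z`. The variable names are illustrative.

```python
import re

s = 'inbox'

# match: anchored at the start, so 'box' is not found in 'inbox'.
match_result = re.match(r'box', s)
anchored_search = re.search(r'\Abox', s)

# fullmatch: the pattern has to cover the whole string.
full_result = re.fullmatch(r'.*box', s)
anchored_both = re.search(r'\A(?:.*box)\Z', s)

print(match_result, anchored_search)               # None None
print(full_result.group(), anchored_both.group())  # inbox inbox
```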

## Anchoring at String Start and End

### Read entire file

> `$` Matches end of string (if MULTILINE, also matches right before `\n`)

In [40]:
digatend = re.compile(r'\d$', re.MULTILINE)
with open('afile.txt', 'r') as f:
    fstring = f.read()
    print(f'fstring: {repr(fstring)}')
    match = digatend.search(fstring)
    if match:
        print(f'match:{match.group()}')

fstring: 'The Life and Death of King Richard III\nby William Shakespeare\n\nACT I\n\nScene 1\n\nNow is the winter of our discontent\nMade glorious summer by this son of York,\nAnd all the clouds that loured upon our house\nIn the deep bosom of the ocean buried.\n\n'
match:1


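Without this flag, `$` matches only at the end of the whole string, or just before a single trailing `\n` at the very end. The first file ends with `.\n\n`, so there is no match; `afile2.txt` ends with `2\n`, so the `2` is found.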

In [45]:
digatend = re.compile(r'\d$')
with open('afile.txt', 'r') as f:
    fstring = f.read()
    print(f'fstring: {repr(fstring)}')
    match = digatend.search(fstring)
    if match:
        print(f'match:{match.group()}')

fstring: 'The Life and Death of King Richard III\nby William Shakespeare\n\nACT I\n\nScene 1\n\nNow is the winter of our discontent\nMade glorious summer by this son of York,\nAnd all the clouds that loured upon our house\nIn the deep bosom of the ocean buried.\n\n'


In [44]:
digatend = re.compile(r'\d$')
with open('afile2.txt', 'r') as f:
    fstring = f.read()
    print(f'fstring: {repr(fstring)}')
    match = digatend.search(fstring)
    if match:
        print(f'match:{match.group()}')

fstring: 'The Life and Death of King Richard III\nby William Shakespeare\n\nACT I\n\nScene 1\n\nNow is the winter of our discontent\nMade glorious summer by this son of York,\nAnd all the clouds that loured upon our house\nIn the deep bosom of the ocean buried.2\n'
match:2
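
The file-based cells above depend on `afile.txt` and `afile2.txt`. The following sketch reproduces the same three outcomes with in-memory strings; the shortened texts and variable names are assumptions made for illustration, not the contents of the original files.

```python
import re

# Ends with '.\n\n': no digit directly before the end of the string.
text_no_digit_at_end = 'ACT I\n\nScene 1\n\nbosom of the ocean buried.\n\n'
# Ends with '2\n': a digit directly before the single trailing newline.
text_digit_at_end = 'ACT I\n\nScene 1\n\nbosom of the ocean buried.2\n'

multi = re.compile(r'\d$', re.MULTILINE)
single = re.compile(r'\d$')

# With MULTILINE, $ also matches before every '\n', so 'Scene 1' is found.
multi_hit = multi.search(text_no_digit_at_end).group()

# Without MULTILINE, $ matches only at the end of the string
# (or just before a trailing newline), so nothing is found here ...
single_miss = single.search(text_no_digit_at_end)

# ... but the digit before the final '\n' is found here.
single_hit = single.search(text_digit_at_end).group()

print(multi_hit, single_miss, single_hit)  # 1 None 2
```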


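Finally, let's show that `r'\d$'` and `r'\d\n'` are not interchangeable: `r'\d\n'` requires an actual newline after the digit, so it stops matching once the trailing `\n` is stripped.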

In [72]:
s = """The Life and Death of King Richard III\nby William Shakespeare\n\nACT I\n\nScene \n\nNow is the winter of our discontent\nMade glorious summer by this son of York,\nAnd all the clouds that loured upon our house\nIn the deep bosom of the ocean buried.2\n"""

In [73]:
print(repr(s))

'The Life and Death of King Richard III\nby William Shakespeare\n\nACT I\n\nScene \n\nNow is the winter of our discontent\nMade glorious summer by this son of York,\nAnd all the clouds that loured upon our house\nIn the deep bosom of the ocean buried.2\n'


In [74]:
print(repr(s.strip()))

'The Life and Death of King Richard III\nby William Shakespeare\n\nACT I\n\nScene \n\nNow is the winter of our discontent\nMade glorious summer by this son of York,\nAnd all the clouds that loured upon our house\nIn the deep bosom of the ocean buried.2'


In [81]:
re.search(r'\d\n', s, re.MULTILINE).group(), re.search(r'\d\n', s.strip()) is None

('2\n', True)

### Read line by line

When we process a single line, `r'\d$'` matches whether or not the line ends with `\n`. If we need to distinguish these two cases, we can use the pattern `r'\d\n$'`, which matches only when the trailing newline is present.

In [55]:
re.search(r'\d$', 'Scene 1') is None, re.search(r'\d\n$', 'Scene 1') is None

(False, True)

In [57]:
re.search(r'\d$', 'Scene 1\n') is None, re.search(r'\d\n$', 'Scene 1\n') is None

(False, False)
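
A minimal sketch of the line-by-line case, assuming `io.StringIO` as a stand-in for an open text file (the sample text and names are illustrative). Iterating over a file yields lines that keep their trailing `\n`, except possibly the last one, and `r'\d$'` handles both.

```python
import io
import re

digatend = re.compile(r'\d$')

# io.StringIO stands in for a file opened with open(..., 'r').
fake_file = io.StringIO('ACT I\nScene 1\nbosom of the ocean buried.2')

hits = []
for line in fake_file:
    m = digatend.search(line)
    if m:
        # Keep the raw line to show whether it ended with '\n'.
        hits.append((line, m.group()))

print(hits)
# [('Scene 1\n', '1'), ('bosom of the ocean buried.2', '2')]
```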

### Alternative anchors

It turns out that there are alternative anchors for the beginning and the end of a string: `\A` and `\Z`:
> For RE objects that are not flagged as `MULTILINE`, `^` is the same as `\A`, and `$` is the same as `\Z`. For a multiline RE, however, `^` can anchor at the start of the string or the start of any line (where “lines” are determined based on `\n` separator characters). Similarly, with a multiline RE, `$` can anchor at the end of the string or the end of any line. `\A` and `\Z` always anchor exclusively at the start and end of the string, whether the RE object is multiline or not.
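
The sketch below compares the anchors on a two-line string (the sample string and variable names are illustrative). One detail worth checking in Python specifically: without `MULTILINE`, `$` also matches just before a trailing `\n`, whereas `\Z` matches only at the very end of the string, so the two are not always interchangeable.

```python
import re

s = 'Scene 1\nScene 2\n'

# With MULTILINE, ^ and $ anchor at every line; \A and \Z still anchor
# only at the start and end of the whole string.
caret_multi = re.findall(r'^Scene', s, re.MULTILINE)
a_multi = re.findall(r'\AScene', s, re.MULTILINE)
dollar_multi = re.findall(r'\d$', s, re.MULTILINE)
z_multi = re.findall(r'\d\Z', s, re.MULTILINE)

# Without MULTILINE, $ still matches before the trailing '\n'; \Z does not.
dollar_single = re.findall(r'\d$', s)
z_single = re.findall(r'\d\Z', s)

print(caret_multi, a_multi)     # ['Scene', 'Scene'] ['Scene']
print(dollar_multi, z_multi)    # ['1', '2'] []
print(dollar_single, z_single)  # ['2'] []
```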

## Regular Expression Object