# An intermediate user's guide to RegEx

In [85]:
import re

## Nuances of the `re` module
### The main regex module of Python

Module level functions vs. pattern level functions

#### Grouping

`.group()` vs. `.groups()`

`.findall()`

(I prefer this) `.finditer()`

Creating regex patterns using formatted strings
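As a quick orientation, here is a minimal sketch that touches each item in the list above. The sample `text`, the date pattern, and names such as `date_pattern` and `order_pattern` are made up for illustration; they are not part of the `re` module.

```python
import re

# Hypothetical sample text, made up for this sketch.
text = "order 12 shipped 2024-01-15, order 7 shipped 2024-02-03"

# Module-level function: the pattern string is passed on every call.
first = re.search(r"(\d{4})-(\d{2})-(\d{2})", text)

# Pattern-level method: compile once, then call methods on the pattern object.
date_pattern = re.compile(r"(\d{4})-(\d{2})-(\d{2})")
same_first = date_pattern.search(text)

# .group() returns the whole match (or one group); .groups() returns all groups.
whole = first.group()
year = first.group(1)
parts = first.groups()

# .findall() returns plain strings (tuples of strings when the pattern has groups).
all_dates = date_pattern.findall(text)

# .finditer() yields match objects, so spans and groups stay available.
spans = [m.span() for m in date_pattern.finditer(text)]

# Building a pattern from a formatted string; re.escape guards the literal part.
keyword = "order"
order_pattern = re.compile(rf"{re.escape(keyword)} (\d+)")
order_numbers = order_pattern.findall(text)

print(whole, year, parts)
print(all_dates)
print(spans)
print(order_numbers)
```

One thing to watch with formatted patterns: inside an f-string, a quantifier like `{4}` has to be written as `{{4}}`, because single braces are reserved for interpolation.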

##### Side note:
The reason I think these are intermediate features is that you can get by without knowing them and still implement simple regexes in real code.

At the same time, I don't think these are advanced either. Advanced features would include subroutines, really exploring look-arounds, and recursion (yeah, I know, it's crazy that you can do recursion in regex).

# Bonus

## Raw Strings
A Python raw string is created by prefixing a string literal with `r` or `R`.

from https://blog.devgenius.io/beauty-of-raw-strings-in-python-fa627d674cbf:<br>
Python raw string treats backslash(`\`) as a literal character.
It is useful when we want to have a string that contains backslash(`\`) and don’t want it to be treated as an escape character.

In [1]:
raw_string = r"\n"
regular_string = "\n"

In [2]:
raw_string == regular_string

False

In [12]:
raw_string == "\\n" # Note no 'r' in front of the string literal

True

### Why do we need to use raw strings in regex
Because the regex engine gives backslashes a special meaning, as when you put `\d` in a regex pattern. But Python string literals also interpret backslashes, in their own separate way, and the two layers conflict.

- Say your document or actual text contains `\d`, and you want to match that.
- You could create a regex pattern with just that, `\d`, but it won't match the `\d` in your text, because `\d` has a special meaning in regex: it matches any single digit (0, 1, 2, ..., 9).
    - So (side note) if you wanted to match any digit in your text, like `4` or `7`, a regex pattern of `\d` will catch it.
- (Back to the original use case) So the regex pattern we need should catch the literal character `\` followed by the literal character `d`, without any special meaning attached to the pair.
- In the regex world, you do this by 'escaping' the backslash, so the regex pattern needs to be `\\d`.

In [79]:
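# The text '\d' keeps its backslash because \d is not a Python escape sequence
# (newer Python versions warn about unrecognized escapes like this one).
match = re.match('\\d', '\d')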
print(match)

None


##### Why didn't this work?

The reason is that when you create Python string literals with backslashes, Python processes them as escape sequences first (we don't need the full details here). In an ordinary literal, `\\` collapses to a single backslash, so `'\\d'` reaches the regex engine as `\d`. That is the digit pattern again, not the pattern we intended.
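A small sketch to make this visible: it inspects the string that Python actually builds from the literal `'\\d'` and then tries it against a digit and against the backslash-plus-`d` text. The variable names are only illustrative.

```python
import re

pattern = '\\d'   # ordinary (non-raw) literal with two backslashes in the source

# Python collapses the escaped backslash, leaving two characters: \ and d.
print(len(pattern))
print(list(pattern))

# The regex engine therefore receives \d, which means "any digit".
digit_match = re.match(pattern, '7')

# The text here is a backslash followed by d, which is not a digit.
literal_match = re.match(pattern, '\\d')

print(digit_match)
print(literal_match)
```

The first match succeeds and the second is `None`, which is the same failure as in the cell above.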

So in order to use backslashes in Python strings without attributing any special meaning to them, we have to escape them by putting another backslash (`\`) in front of them.

So, as you can imagine, since the regex pattern we want to pass to the regex engine, `\\d`, contains two backslashes, we need to escape both of them when writing the pattern as a Python string. The final string that works in this use case is `\\\\d` (four backslashes).

In [81]:
match = re.match('\\\\d','\d')
print(match)

<re.Match object; span=(0, 2), match='\\d'>


This is where <b>Raw Strings</b> come in to save the day. As mentioned at the beginning, raw strings treat backslashes as literal characters, so we don't need to escape the backslashes of the regex pattern a second time for Python. We can simply pass `r'\\d'` to the regex engine.

In [84]:
print(re.match(r'\\d', '\d'))

<re.Match object; span=(0, 2), match='\\d'>
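To check that the four-backslash literal and the raw literal really are the same pattern, a short sketch (variable names are illustrative) can compare the two strings directly before handing one of them to `re`:

```python
import re

escaped = '\\\\d'   # ordinary literal: four backslashes in the source
raw = r'\\d'        # raw literal: two backslashes in the source

# Both literals produce the same 3-character string: \ \ d
print(escaped == raw)
print(len(raw), list(raw))

text = '\\d'        # a backslash followed by d
result = re.compile(raw).match(text)
print(result.group() == text)
```

Either spelling gives the regex engine the pattern `\\d`; the raw form is simply easier to read and to get right.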


<b><u>Takeaway</u></b>: Raw strings are not a strict requirement, but they make your life easier when working with regular expressions, because you don't need to worry about Python messing up a regex pattern that contains backslashes. And trust me, backslashes are a pretty common feature of regexes.

### Raw Strings Bonus

#### repr() Magic function
From a Stack Overflow post: the goal of `__repr__` is to be unambiguous (https://stackoverflow.com/questions/1436703/what-is-the-difference-between-str-and-repr). So `repr()` shows the underlying representation of the object.

In [4]:
print(repr(raw_string))

'\\n'


In [5]:
print(repr(regular_string))

'\n'


In [6]:
print(f"length of raw string: {len(raw_string)}")
print(f"length of regular string: {len(regular_string)}")

length of raw string: 2
length of regular string: 1


In [7]:
print(repr(raw_string[0]))

'\\'


In [8]:
print(repr(regular_string[0]))

'\n'


In [9]:
print(f"type of raw string: {type(raw_string)}")
print(f"type of regular string: {type(regular_string)}")

type of raw string: <class 'str'>
type of regular string: <class 'str'>
